Observability

Keeping Watch: Monitoring My Network with Telegraf

How I visualise network performance with Grafana, Prometheus and Telegraf so we can enjoy gaming marathons and 4K streaming sessions without a hitch!

Rudolf van Voorden

16 Apr 2024 • 5 min read

Network monitoring has become important for us, especially with all the remote work during lockdowns. My wife, who's a lecturer, basically lived on Teams for her lectures. So, we've had to make sure our network can handle our gaming marathons and 4K streaming sessions without a hitch.

What to monitor?
How does Telegraf work?
Telegraf setup
Grafana graph setup
- Panel query
Alerting
- Telegram Bot
Results
- Alternative to Grafana
  - Uptime Kuma

What to monitor?

Most of the pre-configured dashboards you find for Grafana are full of irrelevant graphs. My main focus is to monitor the network latency. Telegraf is my weapon of choice. It allows me to monitor latency, packet loss and DNS resolution.

How does Telegraf work?

Telegraf is a lightweight server agent designed for collecting, processing, and reporting metrics. It uses a plugin-based architecture in which each data source or metric type is handled by a separate plugin, such as Ping and DNS Query. This makes Telegraf highly flexible and extensible, supporting a wide range of data sources and output formats.

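The DNS Query plugin gathers DNS query times in milliseconds, much like dig.
The Ping plugin sends ping messages and reports the results. By default it executes the system ping command; with method = "native", as in my config below, Telegraf sends the pings itself.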
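
To make the Ping plugin's output more concrete, here is a minimal Python sketch of how a set of round-trip times could be reduced to the kind of fields the plugin reports. This is not Telegraf's implementation: the function name summarize_ping and the sample values are made up for illustration, and the standard deviation formula is an assumption.

```python
import math


def summarize_ping(rtts_ms, transmitted):
    """Summarise one ping run; rtts_ms holds one round-trip time per reply received."""
    received = len(rtts_ms)
    loss = 100.0 * (transmitted - received) / transmitted if transmitted else 0.0
    summary = {
        "packets_transmitted": transmitted,
        "packets_received": received,
        "percent_packet_loss": loss,
    }
    if received:
        avg = sum(rtts_ms) / received
        summary["minimum_response_ms"] = min(rtts_ms)
        summary["average_response_ms"] = avg
        summary["maximum_response_ms"] = max(rtts_ms)
        # Population standard deviation (assumption: Telegraf's exact formula may differ).
        variance = sum((r - avg) ** 2 for r in rtts_ms) / received
        summary["standard_deviation_ms"] = math.sqrt(variance)
    return summary


# Hypothetical run: 4 packets sent, 3 replies received.
result = summarize_ping([14.2, 16.1, 17.5], transmitted=4)
print(result)
```

With count = 4 in the config, each run produces one such summary per target.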

Telegraf setup

Setting up Telegraf involves editing the Telegraf configuration file (usually located at /etc/telegraf/telegraf.conf) and adding configuration blocks for Input Ping and Input DNS Query. This is my config:

[[inputs.ping]]
  interval = "15s"
  method = "native"
  timeout = 2.0
  count = 4
  urls = [
    "172.16.99.1", # pfSense Firewall
    "172.16.99.2", # Mikrotik CRS326
    "172.16.99.3", # OpenWrt-Livingroom
    "172.16.99.4", # OpenWrt-Utility
    "172.16.99.5", # OpenWrt-Bedroom
    "172.16.99.6", # OpenWrt-Attic
    "172.16.20.5", # Intel NUC Pro 
    "cloudflare.com", # DNS Query
    "google.com", # DNS Query
    "reddit.com", # DNS Query
    "twitter.com", # DNS Query
    "***.***.***.*" # WAN Gateway
  ]

[[inputs.dns_query]]
  servers = ["172.16.20.1"]
  domains = ["cloudflare.com", "google.com", "reddit.com", "twitter.com"]

[[outputs.prometheus_client]]
  listen = "172.23.0.2:9273"

Here is an example of the metrics generated by Telegraf Ping:

rudolf@gringotts:~$ curl 172.23.0.2:9273/metrics | 
grep 'url="cloudflare.com"' | head
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15813    0 15813    0     0  9956k      0 --:--:-- --:--:-- --:--:-- 15.0M
ping_average_response_ms{host="9b216f136c72",url="cloudflare.com"} 16.091742
ping_maximum_response_ms{host="9b216f136c72",url="cloudflare.com"} 17.548752
ping_minimum_response_ms{host="9b216f136c72",url="cloudflare.com"} 14.233287
ping_packets_received{host="9b216f136c72",url="cloudflare.com"} 4
ping_packets_transmitted{host="9b216f136c72",url="cloudflare.com"} 4
ping_percent_packet_loss{host="9b216f136c72",url="cloudflare.com"} 0
ping_result_code{host="9b216f136c72",url="cloudflare.com"} 0
ping_standard_deviation_ms{host="9b216f136c72",url="cloudflare.com"} 1.336313
ping_ttl{host="9b216f136c72",url="cloudflare.com"} 56
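
If you want to work with these metrics outside of Grafana, the text format is easy to parse. The following Python sketch filters the samples for one target, much like the grep above. It is a deliberately simple parser for lines shaped like the output shown: it skips comment lines, lines without labels, and lines with a trailing timestamp. The names parse_metrics and metrics_for_url are hypothetical helpers, not part of Telegraf or Prometheus.

```python
import re

# A few lines copied from the output above.
SAMPLE = """\
ping_average_response_ms{host="9b216f136c72",url="cloudflare.com"} 16.091742
ping_percent_packet_loss{host="9b216f136c72",url="cloudflare.com"} 0
ping_ttl{host="9b216f136c72",url="cloudflare.com"} 56
"""

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return (name, labels, value) tuples for every line the simple pattern matches."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # anything this sketch does not understand is skipped
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def metrics_for_url(text, url):
    """Collect metric name -> value for a single ping target."""
    return {
        name: value
        for name, labels, value in parse_metrics(text)
        if labels.get("url") == url
    }


cloudflare = metrics_for_url(SAMPLE, "cloudflare.com")
print(cloudflare)
```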


Grafana graph setup

The idea was to create a graph similar to Smokeping.

Example: Smokeping graph (credit: https://oss.oetiker.ch/smokeping/index.en.html)

Smokeping combines the average, maximum, minimum and current latency in a single graph along with packet loss, so one graph tells you a lot about a single target.

Panel query

I created a panel in Grafana with four queries:
ping_minimum_response_ms{url="google.com"}
ping_average_response_ms{url="google.com"}
ping_maximum_response_ms{url="google.com"}
ping_percent_packet_loss{url="google.com"}
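
The same four queries can be generated for any target, which is handy when you want to sanity-check them against Prometheus before building a panel. The Python sketch below builds the query strings and the matching instant-query URLs for the Prometheus HTTP API (/api/v1/query). The Prometheus address is an assumption (the default port on localhost), and the helper names are hypothetical.

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:9090"  # assumption: adjust to your Prometheus server

# The four metrics used in the panel.
PANEL_METRICS = [
    "ping_minimum_response_ms",
    "ping_average_response_ms",
    "ping_maximum_response_ms",
    "ping_percent_packet_loss",
]


def panel_queries(target):
    """Return the four panel queries for one ping target."""
    return [f'{metric}{{url="{target}"}}' for metric in PANEL_METRICS]


def instant_query_url(expr, base=PROMETHEUS_URL):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urlencode({"query": expr})
    return f"{base}/api/v1/query?{params}"


for expr in panel_queries("google.com"):
    print(expr)
    print("  " + instant_query_url(expr))
```

Opening one of the printed URLs in a browser or with curl should return the current value as JSON, assuming Prometheus is scraping the Telegraf endpoint.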

To create the ping graph with min, max, avg and packet loss, you have to add field overrides for the queries:

Override for Average: "Fill below to" set to Minimum

Override for Maximum: "Fill below to" set to Minimum, line width disabled (set to 0)

Override for Minimum: line width disabled (set to 0)

Alerting

Grafana can create alerts that help you identify issues almost immediately after they occur. I use the Grafana Alerting - Telegram integration to send Telegram notifications to my phone when alerts are firing.

Telegram Bot

My Telegram Bot is configured to alert me when network devices experience high latency or packet loss. Alerts are defined with queries and alert conditions. For example, if the average latency reported for a network device stays above a threshold over a 5-minute interval, it triggers an alert.
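
To illustrate the kind of condition described above, here is a small Python sketch that checks whether the mean latency over the last 5 minutes exceeds a threshold. It only mimics the logic; Grafana evaluates alert rules itself. The 50 ms threshold and the function name should_alert are assumptions for the example, not values from my setup.

```python
from statistics import mean


def should_alert(samples, now, window_s=300, threshold_ms=50.0):
    """Return True when the mean latency inside the window exceeds the threshold.

    samples: iterable of (unix_timestamp, average_latency_ms) pairs.
    """
    recent = [value for ts, value in samples if now - window_s <= ts <= now]
    if not recent:
        return False  # no data in the window; a real rule would handle this case separately
    return mean(recent) > threshold_ms


# Hypothetical data: one sample every 15 seconds, matching the ping interval.
slow = [(t, 80.0) for t in range(0, 300, 15)]
normal = [(t, 16.0) for t in range(0, 300, 15)]

print(should_alert(slow, now=300))
print(should_alert(normal, now=300))
```

Averaging over a window keeps a single slow ping from firing an alert, at the cost of reacting a little later.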

Results

I use Telegraf to keep an eye on several devices and addresses: my pfSense firewall, my ISP's gateway, and both Google and Cloudflare. This helps me identify packet drops and diagnose issues. Typically, I can check whether the problem lies within my network or further along the route. When I see packet loss to Google but smooth sailing to my gateway, I know my connection's not the culprit.
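
That reasoning can be written down as a simple rule. The Python sketch below classifies a set of packet-loss readings as a local problem, an upstream problem, or no problem. The target keys, the 1% loss threshold and the function name locate_fault are assumptions chosen for illustration.

```python
def locate_fault(loss_by_target, local_targets, external_targets, threshold=1.0):
    """Classify where packet loss occurs.

    loss_by_target: mapping of target -> percent packet loss.
    Returns "local", "upstream" or "ok".
    """
    def lossy(targets):
        return any(loss_by_target.get(t, 0.0) > threshold for t in targets)

    if lossy(local_targets):
        return "local"      # loss already shows up on my own devices or gateway
    if lossy(external_targets):
        return "upstream"   # my side is clean, so the problem is further along the route
    return "ok"


LOCAL = ["172.16.99.1", "WAN_GATEWAY"]    # WAN_GATEWAY is a placeholder key
EXTERNAL = ["google.com", "cloudflare.com"]

# Hypothetical readings: loss to Google only.
readings = {"172.16.99.1": 0.0, "WAN_GATEWAY": 0.0, "google.com": 25.0, "cloudflare.com": 0.0}
print(locate_fault(readings, LOCAL, EXTERNAL))
```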

My Telegram Bot alerts me with notifications on my phone, which is an added bonus, even if it's complex to set up and configure. In a later article I will dive deeper into this topic.

Alternative to Grafana

If you simply want to monitor application uptime and don't want to bother with Telegraf -> Prometheus -> Grafana, you can use Uptime Kuma.

Uptime Kuma

Uptime Kuma has the added bonus of supporting notifications via Telegram, Discord, Gotify, Slack, Pushover, Email (SMTP), and 90+ notification services.