The surprising truth about collecting host metrics with OpenTelemetry is that you’re often just scratching the surface of what your OS is telling you.
Let’s see it in action. Imagine you’re running a Node.js application and want to monitor its CPU and memory usage. You’ve got OpenTelemetry set up, and you’re piping metrics to a backend like Prometheus.
Here’s a snippet of how you might configure the host metrics receiver in your OpenTelemetry Collector:
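    receivers:
      hostmetrics:
        collection_interval: 10s
        scrapers:
          cpu:
            metrics:
              system.cpu.time:
                enabled: true
              # utilization is optional and off by default, so enable it explicitly
              system.cpu.utilization:
                enabled: true
          memory:
            metrics:
              system.memory.usage:
                enabled: true
          disk:
            metrics:
              system.disk.io:
                enabled: true
              system.disk.operations:
                enabled: true
          filesystem:
            metrics:
              system.filesystem.usage:
                enabled: true
              system.filesystem.inodes.usage:
                enabled: true
          network:
            metrics:
              system.network.connections:
                enabled: true
              system.network.io:
                enabled: true
          process:
            # restrict per-process metrics to processes named exactly "node"
            include:
              names: ["node"]
              match_type: strict
    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"
    service:
      pipelines:
        metrics:
          receivers: [hostmetrics]
          exporters: [prometheus]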
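When this collector runs, it starts scraping metrics from the host. The CPU utilization metric (system.cpu.utilization in the receiver’s naming) is not a single number: it is reported per CPU core and per state (user, system, idle, and so on), as a fraction between 0 and 1 rather than a percentage. For memory, system.memory.usage reports bytes per state, such as used, free, buffered, and cached. The disk metrics, system.disk.io and system.disk.operations, are cumulative counters of bytes and operations split by read and write direction; per-second rates are derived later, typically in the backend.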
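Since the disk and network metrics are exported as cumulative counters, a per-second figure is usually derived from two consecutive scrapes. Here is a minimal sketch of that arithmetic. The function name, the sample values, and the way a counter reset is handled are all illustrative assumptions, not part of OpenTelemetry or Prometheus:

```python
def per_second_rate(prev_value, curr_value, interval_seconds):
    """Turn two readings of a cumulative counter into a per-second rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        # A drop usually means the counter reset (for example after a reboot).
        # This sketch then treats the current value as the whole delta.
        delta = curr_value
    return delta / interval_seconds


# Hypothetical readings of bytes read from one disk, taken 10 seconds apart.
read_rate = per_second_rate(1_000_000, 1_500_000, 10)
print(read_rate)  # 50000.0
```

In practice the backend's own rate function does this work; the sketch only shows why a single scrape of a counter says little on its own.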
The hostmetrics receiver is a Swiss Army knife, but its real strength lies in its granular data. It reads OS-specific interfaces to gather this information. On Linux, that largely means /proc filesystem entries such as /proc/stat for CPU, /proc/meminfo for memory, and /proc/diskstats for disk I/O. On Windows, it uses Performance Counters.
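To make the /proc sources less abstract, here is a rough sketch of parsing the two simplest ones. It is not how the collector itself is implemented. The helper names and the sample text are made up for illustration, and the parsers take plain strings so the sketch also runs on machines without /proc:

```python
def parse_proc_stat_cpu(text):
    """Return {cpu_label: {state: ticks}} from /proc/stat-style text."""
    states = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # CPU lines look like "cpu0 4705 150 1120 16250 520 30 45 0 ..."
        if fields and fields[0].startswith("cpu"):
            values = [int(v) for v in fields[1:1 + len(states)]]
            result[fields[0]] = dict(zip(states, values))
    return result


def parse_meminfo(text):
    """Return {field: value} from /proc/meminfo-style text (most values are in kB)."""
    result = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            result[name.strip()] = int(parts[0])
    return result


# Made-up sample content standing in for the real files.
SAMPLE_STAT = "cpu  9000 300 2200 32000 1000 60 90 0\ncpu0 4705 150 1120 16250 520 30 45 0\n"
SAMPLE_MEMINFO = "MemTotal:       16384000 kB\nMemFree:         2048000 kB\nCached:          4096000 kB\n"

print(parse_proc_stat_cpu(SAMPLE_STAT)["cpu0"]["iowait"])  # 520
print(parse_meminfo(SAMPLE_MEMINFO)["MemFree"])            # 2048000
```

On a Linux host you could pass in the contents of the real files instead, for example open("/proc/stat").read().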
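The configuration above is just a starting point. Process filtering is what narrows down what you’re looking at, and it is configured on the process scraper: an include (or exclude) block with a names list and a match_type of strict or regexp. If you only care about your Node.js processes, you’d include the name node. Matching is by process name, so check the documentation for your collector version before relying on other criteria such as command-line arguments or PIDs. The collection_interval sets how often metrics are polled; the receiver’s default is one minute, and 10 seconds is a common choice when you need finer resolution.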
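The difference between exact and pattern-based name matching is easy to get wrong, so here is a small sketch of how a name-based include filter could behave. The function is hypothetical and uses Python’s re module, whose syntax differs from the Go regular expressions the collector uses. Treating patterns as unanchored is an assumption of this sketch:

```python
import re


def matches_include(process_name, names, match_type="strict"):
    """Return True if process_name is selected by a name-based include filter."""
    if match_type == "strict":
        # Exact, case-sensitive comparison against each configured name.
        return process_name in names
    if match_type == "regexp":
        # Assumption: patterns are unanchored, so "node" would also match "nodemon".
        return any(re.search(pattern, process_name) for pattern in names)
    raise ValueError(f"unknown match_type: {match_type}")


running = ["node", "nodemon", "nginx", "postgres"]
print([p for p in running if matches_include(p, ["node"])])
# ['node']
print([p for p in running if matches_include(p, ["^node$", "^ng"], "regexp")])
# ['node', 'nginx']
```

Anchoring patterns with ^ and $ is a cheap way to avoid picking up processes whose names merely contain the string you care about.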
Beyond the system-wide CPU, memory, disk, and network stats, the hostmetrics receiver can also expose per-process metrics, such as process.cpu.time and process.memory.usage, when its process scraper is enabled. This means you can see the CPU and memory consumed by individual processes, not just the aggregate system usage, which is invaluable for pinpointing resource hogs.
The system.cpu.time metric, for instance, breaks down cumulative CPU time by state (user, system, idle, I/O wait, etc.) for each CPU core. This allows you to differentiate between application-level CPU usage, kernel-level overhead, and time spent waiting for I/O. Similarly, system.memory.usage reports memory in bytes by state (used, free, cached, buffered, and a few others depending on the OS), offering a nuanced view of memory pressure.
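Because CPU time is cumulative, the per-state breakdown only becomes meaningful as a difference between two samples. Here is a minimal sketch of that calculation for a single core. The function name and the numbers are hypothetical:

```python
def utilization_by_state(prev, curr):
    """Fraction of elapsed CPU time spent in each state between two samples.

    prev and curr map state name -> cumulative seconds for one CPU core.
    """
    deltas = {state: curr[state] - prev[state] for state in curr}
    total = sum(deltas.values())
    if total <= 0:
        return {state: 0.0 for state in curr}
    return {state: delta / total for state, delta in deltas.items()}


# Hypothetical cumulative seconds for one core, sampled 10 seconds apart.
before = {"user": 100.0, "system": 40.0, "iowait": 5.0, "idle": 855.0}
after = {"user": 104.0, "system": 41.0, "iowait": 7.0, "idle": 858.0}

shares = utilization_by_state(before, after)
busy = 1.0 - shares["idle"]
print(shares["user"], shares["iowait"], round(busy, 2))  # 0.4 0.2 0.7
```

In this example the core was 70% non-idle, but two of those seven seconds were spent waiting on I/O rather than running application or kernel code, which is exactly the distinction the per-state breakdown exists to show.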
When you configure the hostmetrics receiver, you’re essentially telling OpenTelemetry which "scrapers" to enable. Each scraper targets a specific subsystem of the operating system. The cpu scraper collects CPU-related metrics, memory for memory statistics, disk for block device I/O, filesystem for mounted filesystem usage and inode counts, and network for network interface traffic and connection counts. You can enable or disable individual metrics within each scraper.
What many people miss is that the filesystem scraper’s system.filesystem.inodes.usage metric can reveal a silent killer: inode exhaustion. It’s easy to monitor disk space filling up, but running out of inodes (the data structures that store information about files and directories) on a filesystem prevents new files from being created, even if there’s plenty of disk space. This can manifest as seemingly random application errors related to file operations.
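For a quick local check of the same idea, independent of the collector, inode counts can be read with os.statvfs on Unix-like systems. This is a rough sketch. The helper names are made up, and some filesystems report zero total inodes, which the sketch returns as None:

```python
import os


def inode_usage(total_inodes, free_inodes):
    """Return the fraction of inodes in use, or None if no inode count is reported."""
    if total_inodes <= 0:
        return None
    return (total_inodes - free_inodes) / total_inodes


def inode_usage_for_path(path):
    """Read inode counts for the filesystem holding path (Unix only)."""
    stats = os.statvfs(path)
    return inode_usage(stats.f_files, stats.f_ffree)


# Hypothetical filesystem: 1,000,000 inodes with 50,000 still free.
print(inode_usage(1_000_000, 50_000))  # 0.95
```

Calling inode_usage_for_path("/") on a Linux host gives the live figure for the root filesystem. A value creeping toward 1.0 while disk space looks healthy is the warning sign described above.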
The next logical step after getting a handle on host metrics is to correlate them with application-specific metrics.