The metrics Prometheus exposes about itself are not just for dashboards; they reveal the internal health and performance of your monitoring system, often highlighting issues before they affect your collected data.
Let’s see Prometheus scraping itself.
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
This configuration tells Prometheus to scrape the /metrics endpoint of its own HTTP server running on localhost:9090. The metrics generated are invaluable for understanding Prometheus’s operational status.
Prometheus’s own metrics matter because they provide direct insight into the health of the scraping process, rule evaluation, and storage. If Prometheus can’t scrape itself, that is a strong sign it may not be scraping anything else effectively. This self-awareness is key to maintaining a robust monitoring infrastructure.
Let’s break down what to look for and how to fix common issues.
Scrape Failures
The most common problem is Prometheus failing to scrape its own metrics.
Diagnosis:
Check the up metric for the prometheus job.
curl -s -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{job="prometheus"}' | jq
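If the value is 0, Prometheus is not successfully scraping itself. If the request fails outright (for example, with a connection refused error), the server is not reachable at that address at all.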
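The check above can be wrapped in a small script. The following is a minimal sketch, assuming curl and jq are installed, the job is named prometheus, and the server listens on http://localhost:9090. The function names interpret_up and check_self_scrape are hypothetical helpers, not part of any Prometheus tooling.

```shell
#!/usr/bin/env bash
# Sketch: report whether Prometheus is scraping itself.
# Assumptions: curl and jq are installed; the job is named "prometheus".

# Turn the raw value of the up metric into a human-readable verdict.
interpret_up() {
  case "$1" in
    1) echo "self-scrape OK" ;;
    0) echo "self-scrape FAILING" ;;
    *) echo "no data: server unreachable or job name mismatch" ;;
  esac
}

# Query the HTTP API for up{job="prometheus"} and interpret the result.
# --data-urlencode keeps the braces and quotes in the selector intact.
check_self_scrape() {
  local base_url="${1:-http://localhost:9090}"
  local value
  value=$(curl -s -G "$base_url/api/v1/query" \
            --data-urlencode 'query=up{job="prometheus"}' \
          | jq -r '.data.result[0].value[1] // empty')
  interpret_up "$value"
}
```

Calling check_self_scrape with no arguments uses the assumed default address; pass a base URL as the first argument to check a different instance.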
Common Causes & Fixes:
- Prometheus Not Running:
  - Diagnosis: Check the systemd status:
    sudo systemctl status prometheus
  - Fix: If it’s inactive, start it:
    sudo systemctl start prometheus
    The prometheus service is the process that runs the Prometheus server, so nothing is scraped or queryable until it is up.
  - Why it works: The up metric records whether a scrape succeeded. If the server isn’t running, no scrape can happen and the server cannot record up at all, so a query against it fails with a connection error. Seen from another Prometheus instance that scrapes this one, the target is unreachable and up is 0.
- Incorrect targets in prometheus.yml:
  - Diagnosis: Examine your prometheus.yml configuration for the prometheus scrape job. Ensure targets is set to ['localhost:9090'] or the correct IP/hostname and port Prometheus is listening on.
  - Fix: Correct the targets entry. For example, if Prometheus is running on a different IP 192.168.1.100, change it to ['192.168.1.100:9090']. This works because Prometheus uses this configuration to know where to send its scrape requests; an incorrect address means it’s trying to connect to the wrong place.
  - Why it works: Prometheus sends HTTP requests to the specified targets. If the address is wrong, the requests will fail to reach the Prometheus server.
- Firewall Blocking Port 9090:
  - Diagnosis: If Prometheus is running but up is 0, check firewall rules. On systems using ufw:
    sudo ufw status
  - Fix: Allow traffic on port 9090:
    sudo ufw allow 9090/tcp
  - Why it works: Scrape requests are ordinary network traffic and subject to firewall rules, and many firewalls block ports by default. Explicitly allowing port 9090 lets the scrape requests reach the Prometheus process. This mostly matters when the target is a non-loopback address such as 192.168.1.100:9090; traffic to localhost goes over the loopback interface, which default firewall rules usually accept.
- Prometheus Binding to a Different IP:
  - Diagnosis: Check the Prometheus logs, or the --web.listen-address flag it was started with, to see which address it is listening on. Prometheus might be bound to 127.0.0.1 (localhost), to 0.0.0.0 (all interfaces), or to a specific internal IP.
  - Fix: If Prometheus is bound to a specific IP other than localhost, update the targets in prometheus.yml to match that IP. For example, if Prometheus is listening on 10.0.0.5, set targets: ['10.0.0.5:9090']. A 0.0.0.0 bind includes the loopback interface, so localhost:9090 already works in that case.
  - Why it works: localhost typically resolves to 127.0.0.1. If Prometheus is listening only on a different IP address, localhost won’t reach it. The targets configuration must match the network interface and port Prometheus is actually listening on.
- Incorrect External URL Configuration:
  - Diagnosis: The external URL is set with the --web.external-url command-line flag rather than in prometheus.yml, so check the value Prometheus was started with. If it contains a path (for example http://your-prometheus-domain.com/prometheus), Prometheus by default serves all of its endpoints, including /metrics, under that path prefix, and a self-scrape of the default /metrics path fails.
  - Fix: Set the flag to the URL through which Prometheus is reachable externally, e.g., --web.external-url=http://your-prometheus-domain.com:9090. If the URL needs a path prefix, either set metrics_path in the prometheus scrape job to match (for example /prometheus/metrics) or override the prefix with --web.route-prefix=/. If you don’t need an external URL, remove the flag.
  - Why it works: This setting is primarily for when Prometheus is behind a reverse proxy. It controls the URLs Prometheus generates for links back to itself, and its path component doubles as the default route prefix of the HTTP server, so the scrape job has to request the metrics at the path where the server actually serves them.
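The checks in this list can be walked through in order. Below is a rough sketch, assuming a systemd unit named prometheus (as above), the ss utility from iproute2, curl, and port 9090. The names listen_addrs and diagnose_self_scrape are hypothetical helpers introduced only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: walk through the scrape-failure checks from the list above.

# Read `ss -ltn` style output on stdin and print the local address:port
# of every listening socket on the given port.
listen_addrs() {
  awk -v port="$1" '$1 == "LISTEN" {
    n = split($4, parts, ":")      # the port is the last colon-separated field
    if (parts[n] == port) print $4
  }'
}

diagnose_self_scrape() {
  local port="${1:-9090}"

  # 1. Is the service running at all?
  if ! systemctl is-active --quiet prometheus; then
    echo "prometheus service is not active; try: sudo systemctl start prometheus"
    return 1
  fi

  # 2. Which address is it bound to? Compare this with targets in prometheus.yml.
  local addrs
  addrs=$(ss -ltn | listen_addrs "$port")
  if [ -z "$addrs" ]; then
    echo "nothing is listening on port $port"
    return 1
  fi
  echo "listening on: $addrs"

  # 3. Does the metrics endpoint answer on the address the scrape job uses?
  if curl -sf -o /dev/null "http://localhost:$port/metrics"; then
    echo "/metrics reachable via localhost:$port"
  else
    echo "/metrics not reachable via localhost:$port; check targets, external URL path, and firewall"
    return 1
  fi
}
```

The order mirrors the list: a stopped service explains everything else, a bind address that differs from the configured target is the next most likely cause, and only then is it worth looking at paths and firewall rules.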
High Resource Usage
If Prometheus is scraping itself successfully but consuming excessive CPU or memory, it indicates internal performance bottlenecks.
Diagnosis:
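Monitor Prometheus’s own process_resident_memory_bytes and process_cpu_seconds_total (a counter, so graph it with rate()). To inspect metric volume, pipe the output of the /metrics endpoint into promtool check metrics, which lints the exposition text it reads from stdin. For series counts per metric, the TSDB Status page in the web UI gives a more direct view of cardinality.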
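Both process metrics can also be read from the HTTP API. The following is a minimal sketch, assuming curl and jq are available, the job is named prometheus, and the server is at http://localhost:9090. PROM_URL, promql_value, to_mib, and report_usage are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: read Prometheus's own memory and CPU usage through its HTTP API.

PROM_URL="${PROM_URL:-http://localhost:9090}"   # assumed address

# Run an instant query and print the first sample value (empty if none).
promql_value() {
  curl -s -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1" \
    | jq -r '.data.result[0].value[1] // empty'
}

# Convert a byte count to MiB with one decimal place.
to_mib() {
  awk -v bytes="$1" 'BEGIN { printf "%.1f\n", bytes / 1048576 }'
}

report_usage() {
  local rss cpu
  rss=$(promql_value 'process_resident_memory_bytes{job="prometheus"}')
  # process_cpu_seconds_total is a counter; rate() yields CPU cores in use.
  cpu=$(promql_value 'rate(process_cpu_seconds_total{job="prometheus"}[5m])')
  echo "resident memory: $(to_mib "${rss:-0}") MiB"
  echo "cpu cores used (5m avg): ${cpu:-unknown}"
}
```

Running report_usage at intervals, or before and after a configuration change, gives a quick sense of whether memory or CPU is the resource under pressure.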
Common Causes & Fixes:
- Excessive Label Cardinality:
  - Diagnosis: Look for metrics with a very high number of unique label combinations. Run a query that counts series per metric name and look for abnormally large counts:
    topk(10, count by (__name__) ({job="prometheus"}))
  - Fix: Reduce cardinality by dropping or relabeling metrics that generate too many unique labels. In prometheus.yml, use metric_relabel_configs in the scrape job to drop problematic labels or metrics. For example:
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'promhttp_metric_handler_requests_total'
        action: drop
  - Why it works: High cardinality means a vast number of distinct time series. Each unique label combination is a separate series that Prometheus has to store, index, and query, which costs memory and CPU. Reducing cardinality directly lessens this load.
- Too Many Scrape Targets / Frequent Scrapes:
  - Diagnosis: Check prometheus_tsdb_head_series to see the number of active series. Monitor scrape_duration_seconds (recorded per target) and prometheus_rule_evaluation_duration_seconds. If these are consistently high, the scrape interval might be too short for the number of targets.
  - Fix: Increase the scrape interval in prometheus.yml for relevant jobs, e.g., change scrape_interval: 15s to scrape_interval: 30s. This gives Prometheus more time to process existing data before collecting new data.
  - Why it works: A shorter scrape interval means Prometheus has to complete its scrapes more often and ingest more samples per second; rule evaluation runs on its own schedule, set by evaluation_interval, and competes for the same resources. If the workload is too large, these cycles start overlapping, leading to resource contention.
- Inefficient Recording Rules:
  - Diagnosis: Analyze the prometheus_rule_evaluation_duration_seconds metric. Long-running rules indicate performance issues.
  - Fix: Optimize recording rules. Rewrite complex queries to be more efficient, or reduce their frequency of evaluation if possible. For example, avoid sum() over high-cardinality metrics if a more targeted query can achieve the same result.
  - Why it works: Recording rules are evaluated periodically. If a rule query is computationally expensive, it will consume significant CPU and I/O resources during each evaluation cycle, impacting overall Prometheus performance.
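For the cardinality diagnosis in particular, the per-metric series count can be pulled from the HTTP API. This is a sketch, assuming curl and jq are installed and the server is at http://localhost:9090. PROM_URL, top_series_query, and show_top_series are hypothetical names, and the PromQL shown is one common way to count series per metric name rather than the only one.

```shell
#!/usr/bin/env bash
# Sketch: list the metric names with the most series for one job.

PROM_URL="${PROM_URL:-http://localhost:9090}"   # assumed address

# Build a PromQL query: top N metric names by series count within a job.
top_series_query() {
  printf 'topk(%d, count by (__name__) ({job="%s"}))' "$1" "$2"
}

# Run the query and print "<series count><TAB><metric name>" per line,
# largest first.
show_top_series() {
  local limit="${1:-10}" job="${2:-prometheus}"
  curl -s -G "$PROM_URL/api/v1/query" \
       --data-urlencode "query=$(top_series_query "$limit" "$job")" \
    | jq -r '.data.result[]? | "\(.value[1])\t\(.metric.__name__)"' \
    | sort -rn
}
```

For example, show_top_series 5 prometheus would list the five metric names with the most series in the prometheus job. Any name near the top that is not needed is a candidate for a metric_relabel_configs drop rule.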
After fixing these issues, the next problems you are likely to run into are unreachable targets for other services, if Prometheus was too unhealthy to scrape them, or too many open files errors if OS file descriptor limits were hit due to excessive series.