Prometheus exposes metrics about itself, and they are not just for dashboards: they reveal the internal health and performance of your monitoring system, often surfacing problems before they affect the data you collect.

Let’s see Prometheus scraping itself.

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

This configuration tells Prometheus to scrape the /metrics endpoint of its own HTTP server running on localhost:9090. The metrics generated are invaluable for understanding Prometheus’s operational status.
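
To make concrete what that scrape returns, here is a minimal Python sketch that parses a few lines of the text exposition format served at /metrics. The sample payload and its values are illustrative, and parse_samples is a hypothetical helper rather than a full parser: it assumes one sample per line with no trailing timestamp.

```python
def parse_samples(text):
    """Return {series: value} for a simple exposition-format payload.

    Assumption: each sample line is "<series> <value>" with no timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Illustrative payload; the numbers are made up.
SAMPLE = """\
# TYPE prometheus_tsdb_head_series gauge
prometheus_tsdb_head_series 1234
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.5e+08
"""

print(parse_samples(SAMPLE))
```

In practice Prometheus does this parsing for you; the sketch only shows that each non-comment line becomes one time series sample.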

These self-metrics matter because they give direct insight into the health of scraping, rule evaluation, and storage. If Prometheus can’t scrape itself, it is probably not scraping other targets reliably either. This self-awareness is key to maintaining a robust monitoring infrastructure.

Let’s break down what to look for and how to fix common issues.

Scrape Failures

The most common problem is Prometheus failing to scrape its own metrics.

Diagnosis: Check the up metric for the prometheus job.

curl -s -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{job="prometheus"}' | jq .

If the value is 0, the most recent self-scrape failed. If curl cannot connect at all, the Prometheus server itself is down or not reachable at that address.
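
The same check can be scripted. The following is a minimal Python sketch, assuming Prometheus listens on http://localhost:9090 and the job is named prometheus; interpret_up and query_up are hypothetical helper names, and the status strings are only illustrative.

```python
import json
import urllib.parse
import urllib.request


def interpret_up(payload):
    """Map an /api/v1/query response for up{...} to a short status string."""
    if payload.get("status") != "success":
        return "query failed"
    results = payload["data"]["result"]
    if not results:
        return "no such series"
    # Instant-vector samples look like {"metric": {...}, "value": [ts, "1"]};
    # the sample value is returned as a string.
    value = results[0]["value"][1]
    return "scrape ok" if value == "1" else "scrape failing"


def query_up(base_url="http://localhost:9090", job="prometheus"):
    """Run the instant query; raises URLError if Prometheus is unreachable."""
    params = urllib.parse.urlencode({"query": f'up{{job="{job}"}}'})
    url = f"{base_url}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)
```

Calling interpret_up(query_up()) against a running server returns one of the status strings; a connection error from query_up means the server itself could not be reached.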

Common Causes & Fixes:

  1. Prometheus Not Running:

    • Diagnosis: Check the systemd status: sudo systemctl status prometheus. When the server is down, the API query fails to connect rather than returning 0.
    • Fix: If it’s inactive, start it: sudo systemctl start prometheus.
    • Why it works: Nothing listens on port 9090 while the server is down, so neither the API query nor a scrape can succeed. Starting the service brings the HTTP server, and with it the /metrics endpoint, back up.
  2. Incorrect targets in prometheus.yml:

    • Diagnosis: Examine your prometheus.yml configuration for the prometheus scrape job. Ensure targets is set to ['localhost:9090'] or the correct IP/hostname and port Prometheus is listening on.
    • Fix: Correct the targets entry. For example, if Prometheus is listening on 192.168.1.100, change it to ['192.168.1.100:9090'], then restart or reload Prometheus so the change takes effect.
    • Why it works: Prometheus sends an HTTP GET to each address listed in targets. A wrong host or port means the request never reaches the server’s /metrics endpoint.
  3. Firewall Blocking Port 9090:

    • Diagnosis: If Prometheus is running but up is 0, check firewall rules. On systems using ufw: sudo ufw status. Loopback traffic is rarely filtered, so this mostly applies when the target is the host’s LAN or public IP rather than localhost.
    • Fix: Allow traffic on port 9090: sudo ufw allow 9090/tcp. This opens the port to every source, so restrict the rule if Prometheus should not be publicly reachable.
    • Why it works: Scrape requests are ordinary HTTP traffic and are subject to firewall rules. If port 9090 is blocked on the interface the target address points to, the request is dropped before it reaches the Prometheus process.
  4. Prometheus Binding to a Different IP:

    • Diagnosis: Check the --web.listen-address flag (default 0.0.0.0:9090) in your service definition, or look in the startup logs for the address Prometheus is listening on.
    • Fix: If Prometheus is bound to 0.0.0.0 (all interfaces) or to 127.0.0.1, the target localhost:9090 works as is. If it is bound to a specific non-loopback IP, point targets at that IP. For example, if Prometheus is listening on 10.0.0.5, set targets: ['10.0.0.5:9090'].
    • Why it works: localhost typically resolves to 127.0.0.1, which only reaches a server bound to the loopback interface or to all interfaces. The targets configuration must match the interface and port Prometheus is actually listening on.
  5. Misconfigured External URL:

    • Diagnosis: The external URL is set with the --web.external-url command-line flag, not in prometheus.yml. If it contains a path, e.g., http://your-prometheus-domain.com/prometheus/, Prometheus by default serves all of its endpoints, including /metrics, under that path, and a self-scrape of plain /metrics fails. Check the flags in your service definition.
    • Fix: Either set metrics_path: /prometheus/metrics on the prometheus scrape job, or start Prometheus with --web.route-prefix=/ so its internal routes stay at the root. If Prometheus is not behind a reverse proxy, you can omit the flag entirely.
    • Why it works: This setting is primarily for when Prometheus is behind a reverse proxy. --web.route-prefix defaults to the path of --web.external-url, while a scrape job’s metrics_path defaults to /metrics, so the two must be brought into agreement.
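
The address and path rules from causes 2, 4, and 5 can be sketched in a few lines of Python. self_scrape_settings is a hypothetical helper, not part of Prometheus; it assumes an IPv4 or hostname listen address and relies on the route prefix defaulting to the path of the external URL.

```python
from urllib.parse import urlparse


def self_scrape_settings(listen_address="0.0.0.0:9090",
                         external_url=None, route_prefix=None):
    """Return (target, metrics_path) for a self-scrape job.

    Assumption: listen_address is "host:port" with an IPv4 address,
    a hostname, or an empty host (as in ":9090").
    """
    host, _, port = listen_address.rpartition(":")
    # An empty host or 0.0.0.0 means "all interfaces", so loopback works.
    if host in ("", "0.0.0.0"):
        host = "localhost"
    if route_prefix is None:
        # The route prefix defaults to the path of the external URL.
        route_prefix = urlparse(external_url).path if external_url else "/"
    metrics_path = route_prefix.rstrip("/") + "/metrics"
    return f"{host}:{port}", metrics_path


print(self_scrape_settings())
print(self_scrape_settings("10.0.0.5:9090"))
print(self_scrape_settings(external_url="http://example.com/prometheus/"))
```

The two returned values correspond to the targets entry and the metrics_path setting of the prometheus scrape job.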

High Resource Usage

If Prometheus is scraping itself successfully but consuming excessive CPU or memory, it indicates internal performance bottlenecks.

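Diagnosis: Monitor Prometheus’s own process_resident_memory_bytes and process_cpu_seconds_total, and check series counts on the TSDB status page of the web UI or via /api/v1/status/tsdb. Note that promtool check metrics only lints exposition-format output piped to it on stdin; to analyze the cardinality of stored data, run promtool tsdb analyze against the data directory.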
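
process_cpu_seconds_total is a counter, so CPU usage is its rate of change, which in PromQL would be something like rate(process_cpu_seconds_total{job="prometheus"}[5m]). The Python sketch below shows the same arithmetic on two samples; the helper names and the sample numbers are illustrative assumptions.

```python
def cpu_cores_used(earlier, later):
    """Average CPU cores used between two (unix_time, cpu_seconds) samples."""
    (t0, v0), (t1, v1) = earlier, later
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    if v1 < v0:
        # The counter went backwards, so the process restarted: count from zero.
        v0 = 0.0
    return (v1 - v0) / (t1 - t0)


def to_mib(num_bytes):
    """Convert a process_resident_memory_bytes value to MiB."""
    return num_bytes / (1024 * 1024)


# 30 CPU-seconds consumed over 60 wall-clock seconds is half a core.
print(cpu_cores_used((1000.0, 100.0), (1060.0, 130.0)))  # 0.5
print(to_mib(157286400))  # 150.0
```

A value that stays close to the number of available cores, or resident memory that climbs steadily, is the signal to work through the causes below.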

Common Causes & Fixes:

  1. Excessive Label Cardinality:

    • Diagnosis: Look for metrics with a very high number of unique label combinations. Run the query topk(10, count by (__name__) ({job="prometheus"})) to list the metric names with the most series.
    • Fix: Reduce cardinality by dropping or relabeling metrics that generate too many unique labels. In prometheus.yml, add metric_relabel_configs to the scrape job to drop whole metrics (action: drop) or individual labels (action: labeldrop). For example, to drop a single metric by name:
      metric_relabel_configs:
        - source_labels: [__name__]
          regex: 'promhttp_metric_handler_requests_total'
          action: drop
    • Why it works: Every unique label combination is a distinct time series that Prometheus has to store, index, and query, which costs memory and CPU. metric_relabel_configs is applied after the scrape but before ingestion, so dropped series never reach storage.
  2. Too Many Scrape Targets / Frequent Scrapes:

    • Diagnosis: Check prometheus_tsdb_head_series for the number of active series, and scrape_duration_seconds for how long each target takes to scrape. If scrape durations approach the scrape interval or the scrape timeout, Prometheus is being asked to collect more than it can keep up with.
    • Fix: Increase the scrape interval in prometheus.yml for relevant jobs, e.g., change scrape_interval: 15s to scrape_interval: 30s. Doubling the interval roughly halves the number of samples ingested per second for those jobs.
    • Why it works: Every scrape costs CPU to fetch, parse, and append samples, so ingestion load grows with the number of targets and series and shrinks as the interval grows. Memory is driven mostly by the number of active series, so a longer interval relieves CPU and disk more than RAM.
  3. Inefficient Recording Rules:

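    • Diagnosis: Analyze the prometheus_rule_evaluation_duration_seconds metric, and compare prometheus_rule_group_last_duration_seconds with prometheus_rule_group_interval_seconds. A group whose evaluation takes nearly as long as its interval, or a rising prometheus_rule_group_iterations_missed_total, points to rules that are too expensive.
    • Fix: Optimize recording rules. Rewrite complex queries to be more efficient, or raise the rule group’s evaluation interval if possible. For example, narrow selectors with label matchers before aggregating instead of running sum() over an entire high-cardinality metric.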
    • Why it works: Recording rules are evaluated periodically. If a rule query is computationally expensive, it will consume significant CPU and I/O resources during each evaluation cycle, impacting overall Prometheus performance.
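
To illustrate what "cardinality" means for the first cause above, here is a minimal Python sketch that counts series per metric name in an exposition-format payload, the same idea as count by (__name__) in PromQL. The sample lines and values are made up, and series_per_metric is a hypothetical helper.

```python
from collections import Counter


def series_per_metric(text):
    """Count sample lines (one per series) for each metric name."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" or, without labels, at the space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


# Illustrative payload; label values and numbers are made up.
SAMPLE = """\
prometheus_http_requests_total{code="200",handler="/metrics"} 42
prometheus_http_requests_total{code="200",handler="/api/v1/query"} 7
prometheus_http_requests_total{code="400",handler="/api/v1/query"} 1
prometheus_tsdb_head_series 1234
"""

# Most series first: the best candidates for relabeling or dropping.
for name, count in series_per_metric(SAMPLE).most_common():
    print(name, count)
```

Each extra label value multiplies the series count for that metric, which is why a single unbounded label can dominate memory usage.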

After fixing these issues, the next problems you are likely to see are other targets reporting up as 0 (with scrape errors such as context deadline exceeded) if Prometheus was too unhealthy to scrape them, or too many open files if the OS file-descriptor limit was hit due to excessive series.
