Prometheus is failing to scrape targets because it is running out of memory, and the cause is excessively high-cardinality labels on its metrics.

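Here’s what’s actually broken at the system level: The Prometheus server’s head block, which holds the most recent time-series data in memory, has grown too large to fit into RAM. Every unique combination of metric name and label values is a separate time series, and each active series carries its own index entries and sample buffers in the head. When one metric name produces a massive number of unique label combinations, the series count overwhelms the server’s ability to hold that in-memory index.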
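
To make the growth concrete, here is a minimal arithmetic sketch. It assumes the worst case, where every combination of label values actually occurs, and the label names and counts are made-up illustrations rather than measurements from a real server.

```python
from math import prod


def worst_case_series(label_value_counts):
    """Upper bound on series for one metric name.

    Each distinct combination of label values is its own series, so the
    worst case is the product of the distinct-value counts per label.
    """
    return prod(label_value_counts.values())


# Hypothetical label sets for a single metric such as http_requests_total.
bounded = {"method": 5, "status": 6, "path": 20}
unbounded = {**bounded, "user_id": 10_000}

print(worst_case_series(bounded))    # 600
print(worst_case_series(unbounded))  # 6000000
```

One unbounded label multiplies the series count of everything it is attached to, which is why a single user_id or request_id label can dominate head-block memory.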

Common Causes and Fixes:

  1. Application-Generated Labels with High Cardinality:

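    • Diagnosis: Find the metrics that own the most series. The TSDB Status page in the Prometheus UI (Status > TSDB Status) lists the metric names and label names with the highest series counts, and the query topk(10, count by (__name__)({__name__=~".+"})) gives a similar ranking, although it is expensive to run on a server that is already short of memory. If you have access to the data directory, promtool tsdb analyze reports the same kind of statistics for a block. You can also fetch the /metrics endpoint of a suspect target directly and look for labels with a high number of unique values. For example, if you see http_requests_total{path="/user/12345", method="GET", user_id="abcde12345", request_id="fghij67890", ...} where user_id and request_id are unique for every request, that’s a problem.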
    • Fix: Modify the application emitting the metric to reduce the cardinality. Remove or aggregate high-cardinality labels. For instance, replace user_id with a role or group if applicable, and replace raw paths such as /user/12345 with the route template. A request_id is unique per request by definition, so it belongs in logs or traces rather than in a metric label. The goal is to reduce the number of distinct series.
    • Why it works: Fewer unique series means a smaller in-memory index and less memory pressure.
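
As a rough aid for the diagnosis step, the sketch below counts series per metric name and distinct values per label in text copied from a target's /metrics endpoint. It is a simplified reader of the text exposition format, not a full parser, and the function name cardinality_report and the sample payload are hypothetical.

```python
import re
from collections import defaultdict

# Metric name, optional {label block}, then whitespace before the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s')
# One name="value" pair; the value may contain escaped characters.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def cardinality_report(exposition_text):
    """Return (series per metric name, distinct values per (metric, label))."""
    series = defaultdict(set)
    label_values = defaultdict(set)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this simplified reader understands
        name, raw_labels = match.group(1), match.group(2) or ""
        labels = tuple(sorted(LABEL_RE.findall(raw_labels)))
        series[name].add(labels)
        for label, value in labels:
            label_values[(name, label)].add(value)
    series_counts = {name: len(sets) for name, sets in series.items()}
    value_counts = {key: len(vals) for key, vals in label_values.items()}
    return series_counts, value_counts


# Made-up payload in the shape of the example above.
SAMPLE = '''\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/user/12345",user_id="abcde12345"} 1
http_requests_total{method="GET",path="/user/67890",user_id="fghij67890"} 1
http_requests_total{method="POST",path="/login",user_id="abcde12345"} 3
process_open_fds 12
'''

series_counts, value_counts = cardinality_report(SAMPLE)
for name, count in sorted(series_counts.items(), key=lambda item: -item[1]):
    print(name, count)
```

Labels whose distinct-value count grows with traffic rather than with the size of the deployment are the ones to remove or aggregate.
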
  2. Instance/Pod Labels from Orchestrators:

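    • Diagnosis: Examine the labels on series scraped from Kubernetes or other orchestrators. Discovery metadata such as __meta_kubernetes_pod_name, __meta_kubernetes_namespace, __meta_kubernetes_node_name, or pod annotations is discarded after target relabeling, so it only contributes once a relabel rule copies it onto the target (for example as pod, namespace, or node), or when a labelmap rule copies every pod label and annotation. Pod names change on every rollout, so a copied pod label creates a fresh set of series with each deployment.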
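    • Fix: Use relabel_configs in your Prometheus scrape configuration to drop or replace these labels. Labels that start with __ are removed automatically after target relabeling, so __meta_kubernetes_pod_name needs no explicit rule, and action: drop with source_labels would discard the matching targets entirely rather than remove a label. To remove a label that another rule copied onto the target, use labeldrop, which matches label names against regex. Make sure targets are still uniquely labeled once the label is gone. For example, to drop a pod label:
      scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        # Removes the target label named "pod" (set by an earlier rule, not shown).
        - regex: pod
          action: labeldrop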
      
      Or to copy only the metadata you need into shorter, controlled target labels:
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        regex: '(.*)'
        target_label: namespace
        action: replace
      - source_labels: [__meta_kubernetes_pod_name]
        regex: 'my-app-(.+)-([a-f0-9]+)'
        target_label: instance_short_id
        # Still one value per pod; use '$1' alone to keep only the component name.
        replacement: '$1.$2' # Example: my-app-worker-123abc -> worker.123abc
        action: replace
      
    • Why it works: Relabeling rules modify or remove labels before they are stored, preventing the generation of high-cardinality series.
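
To check what a replace rule will produce before deploying it, the sketch below imitates the rule in plain Python. It is an approximation under stated assumptions: Prometheus uses RE2 and anchors the regex at both ends, which re.fullmatch mimics for simple patterns like this one, and only $1-style references are handled. The helper name relabel_replace is hypothetical.

```python
import re


def relabel_replace(source_value, regex, replacement):
    """Approximate a relabel rule with action: replace.

    Returns the new target label value, or None when the regex does not
    match the whole source value (Prometheus then leaves the label alone).
    """
    match = re.fullmatch(regex, source_value)
    if match is None:
        return None
    # Substitute $1, $2, ... with the corresponding capture groups.
    return re.sub(
        r"\$(\d+)",
        lambda ref: match.group(int(ref.group(1))) or "",
        replacement,
    )


POD_REGEX = r"my-app-(.+)-([a-f0-9]+)"

print(relabel_replace("my-app-worker-123abc", POD_REGEX, "$1.$2"))  # worker.123abc
print(relabel_replace("my-app-worker-123abc", POD_REGEX, "$1"))     # worker
print(relabel_replace("other-pod-1", POD_REGEX, "$1.$2"))           # None
```

The second call shows the lower-cardinality variant: keeping only the first capture group collapses all replicas of a component into one label value.
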
  3. Incorrect Service Discovery (SD) Configuration:

    • Diagnosis: If using service discovery (like kubernetes_sd_configs, consul_sd_configs, etc.), a misconfiguration can lead to a vast number of targets being discovered, each with unique labels. Check your Prometheus UI’s "Targets" page to see the sheer number of endpoints being scraped.
    • Fix: Refine your service discovery configuration to only discover relevant services. Use relabel_configs to filter targets based on labels or metadata. For Kubernetes, this might involve restricting discovery to specific namespaces or pods with certain labels:
      kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
          - my-application-namespace
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: my-specific-app
        action: keep
      
    • Why it works: Limiting the number of discovered targets directly reduces the number of series Prometheus needs to manage.
  4. Client Library Misuse:

    • Diagnosis: If you’re instrumenting your own applications with Prometheus client libraries, review the code. Common mistakes include creating labels dynamically based on user IDs, request IDs, or other highly variable data points without aggregation.
    • Fix: Audit your application’s metric instrumentation. Aggregate or remove labels that are too granular. For example, if you’re tracking request latency per user, consider aggregating by user group or role instead of individual user IDs.
    • Why it works: Client-side aggregation or removal of high-cardinality labels prevents them from ever reaching the Prometheus server.
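
The effect of bounding labels in application code can be sketched without any client library. RequestCounter below is a hypothetical stand-in that keeps one entry per label set, the way a labeled counter keeps one child per label combination; normalize_path and the request data are also illustrative assumptions.

```python
import re
from collections import Counter

# A purely numeric path segment, e.g. the 12345 in /user/12345.
ID_SEGMENT = re.compile(r"/\d+(?=/|$)")


def normalize_path(path):
    """Collapse numeric path segments into a fixed placeholder."""
    return ID_SEGMENT.sub("/:id", path)


class RequestCounter:
    """Stand-in for a labeled counter: one entry per distinct label set."""

    def __init__(self):
        self.series = Counter()

    def inc(self, **labels):
        self.series[tuple(sorted(labels.items()))] += 1


# Made-up traffic: 1000 requests, each from a different user.
requests = [("GET", f"/user/{n}", f"user-{n}", "member") for n in range(1000)]

naive = RequestCounter()
bounded = RequestCounter()
for method, path, user_id, role in requests:
    # Per-user labels: every request creates a new series.
    naive.inc(method=method, path=path, user_id=user_id)
    # Route template plus role: all requests land in the same series.
    bounded.inc(method=method, path=normalize_path(path), role=role)

print(len(naive.series))    # 1000
print(len(bounded.series))  # 1
```

The bounded counter still records all 1000 requests; it only gives up the per-user breakdown, which is better served by logs or traces.
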
  5. __name__ Label Exploitation (Less Common but Possible):

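    • Diagnosis: Prometheus stores the metric name as the __name__ label, so metric names count toward cardinality like any other label. The same metric name with different label sets is normal and allowed; only samples that repeat the exact same name and label set within one scrape are flagged as duplicates. Nothing stops a malicious or buggy exporter from embedding variable data in the metric name itself (one name per queue, customer, or ID), or from reusing a single name for many unrelated purposes with distinct label sets. This is harder to diagnose directly from Prometheus logs but manifests as an explosion of series, often alongside a rapidly growing number of distinct metric names.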
    • Fix: Review your exporters. Pipe the exporter’s output into promtool check metrics (for example, curl -s http://<exporter>/metrics | promtool check metrics) to flag unusual metric naming conventions or label usage. Ensure exporters are configured to expose only necessary metrics.
    • Why it works: A small, fixed set of metric names, with variable data held in bounded labels, keeps the number of series predictable and the index cheap to store and query.
  6. Exporters Generating Too Many Metrics:

    • Diagnosis: Some exporters (e.g., node_exporter with certain collectors enabled, or complex application-specific exporters) can generate a very large number of metrics by default.
    • Fix: Disable unnecessary collectors in exporters. For node_exporter, you can specify which collectors to enable/disable via command-line flags (e.g., --no-collector.textfile). For application exporters, check their configuration options for reducing metric verbosity.
    • Why it works: Reducing the total number of distinct metrics scraped, even if their cardinality isn’t extremely high, can also contribute to memory savings.

After addressing these, the next error you may see is context deadline exceeded on scrapes. While the server recovers it can still be too slow to complete scrapes within scrape_timeout, and targets that expose very large metric payloads can take longer than the timeout to respond.
