Tame Prometheus Cardinality Overload

A Prometheus cardinality explosion happens when the number of unique time series (distinct combinations of metric name and label values) grows beyond what the TSDB can store and query efficiently.

Let’s see this in action. Imagine you’re collecting metrics from a fleet of Kubernetes pods and attaching each pod’s name and container name as the pod and container labels.

```
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
```

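If you have 1000 pods, each with 5 containers, that is up to 5,000 scrape targets, because the pod role generates one target per declared container port. If each target exposes a few hundred series, you already have more than a million unique time series, and every restart or rollout adds new pod values on top. Each unique combination of metric name, job, app, pod, container, and any labels the application itself attaches creates a new time series.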
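
The arithmetic behind that kind of estimate is plain multiplication. The sketch below is illustrative only: estimate_series is a hypothetical helper, and the figure of 200 series per target is an assumption, not a measurement.

```python
from math import prod


def estimate_series(series_per_target: int, label_value_counts: dict[str, int]) -> int:
    """Upper bound on active series if every label combination occurs.

    label_value_counts maps a label name to its number of distinct values.
    """
    return series_per_target * prod(label_value_counts.values())


if __name__ == "__main__":
    # 1000 pods x 5 containers, each target exposing an assumed 200 series.
    print(estimate_series(200, {"pod": 1000, "container": 5}))
```

Adding one more label with even ten distinct values multiplies the bound by ten, which is why a single careless label can dominate the total.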

The problem isn’t just the sheer number of series, but the uniqueness of the label combinations. Prometheus stores each unique series in its Time Series Database (TSDB). When cardinality gets too high, the TSDB struggles:

- Memory Usage: The TSDB needs to keep an index of all series in memory. A massive index consumes vast amounts of RAM, leading to OOM kills or extreme slowness.
- Write Performance: Every new unique series requires an update to the TSDB’s index, slowing down ingestion.
- Query Performance: Queries need to scan and filter through this enormous index, making even simple queries take minutes or crash the Prometheus server.
- Disk I/O: As the TSDB grows, disk operations for reads and writes become a bottleneck.

Detecting Cardinality Issues

The first step is to identify which metrics are causing the problem. Prometheus itself provides tools for this.

Prometheus UI Metrics:
- Navigate to your Prometheus UI, usually http://<prometheus-host>:9090/.
- Go to Status -> TSDB Status.
- Look at the number of series under "Head Stats". If this is in the millions and growing rapidly, you have a problem.
- Crucially, check the top-10 tables on the same page: label names with the highest value count, metric names with the highest series count, label names with the highest memory usage, and label-value pairs with the highest series count.
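
The same statistics are available from the HTTP API at /api/v1/status/tsdb, which is convenient for scripting. The following is a minimal sketch: the helper names are hypothetical, and the server address in the usage note is an assumption.

```python
import json
from urllib.request import urlopen


def fetch_tsdb_status(base_url: str) -> dict:
    """Fetch the TSDB statistics that back the TSDB Status page."""
    with urlopen(f"{base_url}/api/v1/status/tsdb") as resp:
        return json.load(resp)["data"]


def top_entries(status: dict, table: str, n: int = 5) -> list[tuple[str, int]]:
    """Return the n largest (name, value) rows from one statistics table."""
    rows = status.get(table, [])
    ranked = sorted(rows, key=lambda row: row["value"], reverse=True)
    return [(row["name"], row["value"]) for row in ranked[:n]]
```

Against a reachable server, top_entries(fetch_tsdb_status("http://localhost:9090"), "labelValueCountByLabelName") would list the label names with the most distinct values; "seriesCountByMetricName" gives the equivalent ranking by metric name.
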
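
promtool for Cardinality Analysis:
- The promtool utility (included with Prometheus) is invaluable.
- To analyze a persisted TSDB block (the most recent one by default), point it at the data directory:
```
promtool tsdb analyze /path/to/prometheus/data
```
- This prints the highest-cardinality label names and metric names, the most common label pairs, and the labels most involved in churn. Look for labels with an extremely high number of unique values. Because it reads blocks on disk rather than the in-memory head, very recent series may not show up yet.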
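
Querying Prometheus for High-Cardinality Metrics:
- You can write PromQL queries to identify metrics with many unique label combinations. This is usually done by counting distinct label values. Note that the {__name__=~".+"} selector touches every series, so these queries are expensive on a server that is already struggling.
- To count the distinct pod label values per job:
```
count by (job) (count by (job, pod) ({__name__=~".+"}))
```
  The inner aggregation produces one result per (job, pod) pair, and the outer one counts those pairs per job. The inner count must keep job in its by clause; otherwise the label is aggregated away before the outer count can group on it. A very large number here indicates a problem.
- To find the top 10 metrics by series count:
```
topk(10, count by (__name__)({__name__=~".+"}))
```
  This shows you which metric names are generating the most series.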
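
If you want to run such a query from a script rather than the UI, the instant-query endpoint /api/v1/query returns the result as JSON. This is a sketch under stated assumptions: the function names are hypothetical, and the parser expects an instant-vector response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

TOP_METRICS_QUERY = 'topk(10, count by (__name__)({__name__=~".+"}))'


def instant_query(base_url: str, promql: str) -> dict:
    """Run an instant query through the HTTP API and return the parsed JSON."""
    url = f"{base_url}/api/v1/query?{urlencode({'query': promql})}"
    with urlopen(url) as resp:
        return json.load(resp)


def series_counts(response: dict) -> list[tuple[str, int]]:
    """Turn an instant-vector response into (metric name, series count) pairs."""
    pairs = [
        (sample["metric"].get("__name__", ""), int(float(sample["value"][1])))
        for sample in response["data"]["result"]
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

For example, series_counts(instant_query("http://localhost:9090", TOP_METRICS_QUERY)) would return the ten largest metrics, assuming a server at that address.
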

Preventing and Mitigating Cardinality Explosions

The core principle is to reduce the number of unique label combinations generated.

Relabeling __meta Labels:

Problem: Automatically scraping Kubernetes metadata like pod_name, container_name, namespace, node_name, etc., directly into your metrics as labels creates high cardinality if you have many pods, containers, or nodes.
Diagnosis: Use promtool tsdb analyze or the UI metrics to see if pod, container, namespace, or node are among the top labels by value count.

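Fix: Copy only the stable metadata you need into target labels with relabel_configs. Prometheus discards all __meta_* labels once target relabeling finishes, so a pod or container name only becomes a label if a rule copies it. Do not use action: drop for this: in relabel_configs it discards every target whose source labels match the regex, and the default regex matches everything.

```
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Copy only the stable metadata you need into target labels.
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      # No rule copies __meta_kubernetes_pod_name or
      # __meta_kubernetes_pod_container_name, so they never become labels.
      # If a pod or container label is still attached (for example by an
      # earlier rule), remove it by label name:
      - action: labeldrop
        regex: pod|container
      # Or, if you need some identifier, map it to a static label if possible
      # - source_labels: [__meta_kubernetes_pod_name]
      #   regex: 'my-specific-app-(.*)'
      #   target_label: instance_id
      #   action: replace
```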

Why it works: By dropping or transforming dynamic labels that change frequently with every new pod or container, you reduce the number of unique label sets Prometheus needs to track.
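
Prometheus removes every label whose name starts with __ once target relabeling has finished, so a discovered value only ends up on your series if a rule copies it. The sketch below models that behavior in a deliberately simplified way: relabel_target is a hypothetical helper that handles only plain copy rules and ignores the job and instance labels that Prometheus adds itself.

```python
def relabel_target(discovered: dict[str, str], copies: dict[str, str]) -> dict[str, str]:
    """Simplified model of target relabeling with plain 'replace' rules.

    copies maps a source label name to the target label it is copied into.
    Labels whose names start with '__' are discarded at the end, as
    Prometheus does after target relabeling.
    """
    labels = dict(discovered)
    for source, target in copies.items():
        value = labels.get(source, "")
        if value:  # an empty value leaves no label behind
            labels[target] = value
    return {name: value for name, value in labels.items() if not name.startswith("__")}


if __name__ == "__main__":
    discovered = {
        "__meta_kubernetes_pod_label_app": "checkout",
        "__meta_kubernetes_pod_name": "checkout-7d9f8c6b5d-abcde",
    }
    # Only the app label survives; the pod name was never copied.
    print(relabel_target(discovered, {"__meta_kubernetes_pod_label_app": "app"}))
```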

Avoiding Dynamic Labels in Application Metrics:
- Problem: Applications often expose metrics with labels that are dynamic or have high cardinality, like user IDs, request IDs, or specific resource identifiers.
- Diagnosis: Identify metrics with high label value counts using promtool or PromQL queries. For example, if request_id is a label on many metrics, it’s a prime suspect.
- Fix: Modify your application’s metrics exposition to either:
  - Remove the high-cardinality label entirely.
  - Replace it with a more static or aggregated identifier.
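  - Consider a summary instead of a histogram with many buckets, if appropriate: a summary exposes one series per configured quantile plus _sum and _count, whereas a histogram exposes one series per bucket. Keep in mind that summary quantiles cannot be aggregated across instances and are computed in the application.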
```
// Assumes httpRequestsTotal is a *prometheus.CounterVec with the labels
// (method, path, status).

// Problematic: the user ID is embedded in the path label, so every user
// creates a new time series.
httpRequestsTotal.WithLabelValues("GET", "/users/"+userID, "200").Inc()

// Better: keep the set of label values bounded.
// Option 1: record the route template instead of the concrete path.
httpRequestsTotal.WithLabelValues("GET", "/users/:id", "200").Inc()
// Option 2: remove the dynamic part if it is not essential for aggregation.
httpRequestsTotal.WithLabelValues("GET", "/users", "200").Inc()
```
- Why it works: Removing or aggregating highly granular, dynamic label values directly at the source prevents them from being sent to Prometheus in the first place, drastically reducing the number of unique time series.
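
When the router does not hand you the route template, one option is to normalize the path before using it as a label value. The sketch below is in Python for brevity and rests on an assumption: that IDs appear as purely numeric or UUID-shaped path segments. route_label is a hypothetical helper, not part of any client library.

```python
import re

# Path segments that look like numeric IDs or UUIDs (an assumed convention).
_ID_SEGMENT = re.compile(
    r"(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})"
)


def route_label(path: str) -> str:
    """Replace ID-like path segments with ':id' so the label stays bounded."""
    segments = path.split("/")
    return "/".join(":id" if _ID_SEGMENT.fullmatch(seg) else seg for seg in segments)


if __name__ == "__main__":
    print(route_label("/users/12345/orders/9"))
```

The label value then depends on the number of routes in the application rather than the number of users.
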

Using labeldrop and labelkeep in Scrape Configs:
- Problem: Some dynamic labels come from the scraped metrics themselves rather than from service discovery, so target relabeling never sees them.
- Diagnosis: After target relabeling, you still see high cardinality from labels that were not __meta labels but were generated or attached by the application.
- Fix: Use the labeldrop or labelkeep actions in metric_relabel_configs. Both match their regex against label names. A plain action: drop would discard whole series instead of removing a label. Make sure the series are still uniquely labeled once the labels are removed.
```
scrape_configs:
  - job_name: 'my-app-metrics'
    static_configs:
      - targets: ['app-service:8080']
    # Drop labels that are not useful and add cardinality
    metric_relabel_configs:
      - action: labeldrop
        regex: request_id|user_session_id
    # Or, keep only a specific set of labels. __name__ must be in the list,
    # otherwise the metric name itself is removed.
    # metric_relabel_configs:
    #   - action: labelkeep
    #     regex: __name__|job|instance|app|version
```
- Why it works: These metric_relabel_configs operate on the metrics after they’ve been scraped but before they are ingested into the TSDB. They allow you to prune unwanted labels, thereby reducing cardinality.
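
Removing a label can make previously distinct series identical, and Prometheus does not sum such collisions for you. The following sketch models labeldrop on plain dictionaries and counts the collisions it produces; the helper names and sample data are hypothetical.

```python
import re
from collections import Counter


def labeldrop(series: list[dict[str, str]], regex: str) -> list[dict[str, str]]:
    """Remove every label whose NAME fully matches regex (anchored match)."""
    pattern = re.compile(regex)
    return [
        {name: value for name, value in labels.items() if not pattern.fullmatch(name)}
        for labels in series
    ]


def duplicate_label_sets(series: list[dict[str, str]]) -> int:
    """Count label sets that are identical to an earlier one in the list."""
    counts = Counter(frozenset(labels.items()) for labels in series)
    return sum(n - 1 for n in counts.values())


if __name__ == "__main__":
    scraped = [
        {"__name__": "http_requests_total", "path": "/a", "request_id": "1"},
        {"__name__": "http_requests_total", "path": "/a", "request_id": "2"},
        {"__name__": "http_requests_total", "path": "/b", "request_id": "3"},
    ]
    pruned = labeldrop(scraped, "request_id|user_session_id")
    print(duplicate_label_sets(pruned), "colliding series after labeldrop")
```

If the collision count is not zero, the label is better removed in the application, where the values can be aggregated properly.
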

Aggregating Metrics at the Exporter Level:
- Problem: Some exporters (e.g., node_exporter) expose a vast number of metrics with very detailed labels by default.
- Diagnosis: Use promtool to see if metrics from specific exporters (like node_exporter) have an overwhelming number of series due to their detailed label sets.
- Fix: Configure the exporter to disable or limit the metrics that generate high cardinality. For node_exporter, --collector.<name> enables a collector and --no-collector.<name> disables it. For example, to disable the textfile collector, which can generate many unique metrics from files:
```
./node_exporter --no-collector.textfile
```
  Or, if you are using the textfile collector but want to limit its scope, control which files it reads:
```
# In the directory passed to --collector.textfile.directory,
# e.g. /etc/node_exporter/textfile-collector/
# Create files like /etc/node_exporter/textfile-collector/my_metric.prom
# Content: my_custom_metric{label="value"} 123
```
  This is less about dropping labels and more about controlling the source of metrics.
- Why it works: By choosing which metrics collectors are active on an exporter, or by carefully crafting the metrics exposition from custom collectors, you can prevent high-cardinality metrics from ever being generated and sent to Prometheus.
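
If you generate textfile metrics yourself, keeping the label set fixed and writing the file atomically are the two things worth getting right. The sketch below assumes the collector reads only *.prom files from its directory; write_textfile_metric is a hypothetical helper and the directory is whatever you pass to the exporter.

```python
import os
import tempfile


def write_textfile_metric(directory: str, filename: str, lines: list[str]) -> str:
    """Write exposition lines to <directory>/<filename> via an atomic rename."""
    final_path = os.path.join(directory, filename)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        tmp.write("\n".join(lines) + "\n")
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file owner-only
    os.replace(tmp_path, final_path)  # readers never see a half-written file
    return final_path
```

A call such as write_textfile_metric("/etc/node_exporter/textfile-collector", "my_metric.prom", ['my_custom_metric{label="value"} 123']) produces the file described above with a single, bounded label.
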

Leveraging Service Discovery for Static Labels:
- Problem: You might need to identify which application instance or host a metric came from, but using dynamic labels like pod_name or container_name is too high cardinality.
- Diagnosis: You’re seeing high cardinality from labels that you feel should be static identifiers.
- Fix: Use service discovery (like Kubernetes SD or Consul SD) to inject more stable identifiers, and use relabel_configs to map them to your metric labels. The Kubernetes pod role has no dedicated Deployment meta label, so one option is to derive the Deployment name from __meta_kubernetes_pod_controller_name; another is to copy a pod label such as app.
```
scrape_configs:
  - job_name: 'kubernetes-deployments'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Assumes pods are created by a Deployment, whose ReplicaSets are named
      # <deployment>-<hash>; the rule strips the trailing hash.
      - source_labels: [__meta_kubernetes_pod_controller_kind, __meta_kubernetes_pod_controller_name]
        regex: ReplicaSet;(.+)-[a-z0-9]+
        target_label: deployment
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      # The highly dynamic pod name is discarded automatically after
      # relabeling; simply do not copy it into a label.
```
- Why it works: By using more stable identifiers provided by service discovery, like deployment names or service names, you get meaningful labels without the extreme uniqueness that pod or container names provide.
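
One way to obtain a Deployment name from pod-level discovery is to strip the pod-template hash from the name of the owning ReplicaSet. The sketch below checks that idea outside Prometheus. It assumes the controller kind and name come from __meta_kubernetes_pod_controller_kind and __meta_kubernetes_pod_controller_name, and that ReplicaSets follow the <deployment>-<hash> naming used by Deployments; deployment_label is a hypothetical helper.

```python
import re
from typing import Optional

# ReplicaSets created by a Deployment are named <deployment>-<pod-template-hash>.
_REPLICASET_NAME = re.compile(r"(.+)-[a-z0-9]+")


def deployment_label(controller_kind: str, controller_name: str) -> Optional[str]:
    """Return the Deployment name for a ReplicaSet-owned pod, else None."""
    if controller_kind != "ReplicaSet":
        return None
    match = _REPLICASET_NAME.fullmatch(controller_name)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(deployment_label("ReplicaSet", "checkout-api-7d9f8c6b5d"))
```

A ReplicaSet created by hand would also match, so treat the result as a convention rather than a guarantee.
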

Consider Remote Write and Downsampling:
- Problem: You have legitimate use cases for high-cardinality data (e.g., debugging a specific request), but you can’t afford to keep it all long-term.
- Diagnosis: You’ve tried all other methods, and you still have a significant number of high-cardinality metrics that are essential for short-term analysis but not for long-term trending.
- Fix: Configure Prometheus to use remote_write to send data to a long-term storage solution (like Thanos, Cortex, VictoriaMetrics, or Mimir). Where the backend supports downsampling, configure it to aggregate metrics after a certain period (e.g., keep 1-minute resolution for a day, then 5-minute resolution for a month, then hourly for a year). Downsampling support and configuration differ between these systems, so check your backend’s documentation. Then shorten local retention (--storage.tsdb.retention.time) so that Prometheus itself keeps only recent data.
```
remote_write:
  # Placeholder endpoint: the host, port, and path depend on your backend.
  - url: "http://your-long-term-storage:9201/api/v1/push"
    # Optional: filter metrics to send to remote write if needed
    # write_relabel_configs:
    #   - source_labels: [__name__]
    #     regex: "metric_i_want_to_keep_long_term"
    #     action: keep
```
- Why it works: With short local retention, Prometheus stores only recent, high-resolution data, while the remote storage handles the long-term, potentially lower-resolution, aggregated data. This addresses storage and query performance for historical data. It does not reduce the number of active series Prometheus holds in memory, so the earlier techniques still apply.
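
Downsampling itself is conceptually simple: samples are grouped into fixed time windows and each window is reduced to an aggregate. The sketch below averages samples per window as an illustration only; it is not how any particular backend implements it, and real systems typically keep several aggregates per window rather than a single mean.

```python
from collections import defaultdict
from statistics import fmean


def downsample(samples: list[tuple[int, float]], step_seconds: int) -> list[tuple[int, float]]:
    """Average (unix_timestamp, value) samples into windows of step_seconds."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for timestamp, value in samples:
        window_start = timestamp - timestamp % step_seconds
        buckets[window_start].append(value)
    return [(start, fmean(values)) for start, values in sorted(buckets.items())]


if __name__ == "__main__":
    # Seven one-minute samples reduced to two five-minute points.
    minute_samples = [(t * 60, float(t + 1)) for t in range(7)]
    print(downsample(minute_samples, 300))
```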

After fixing a cardinality explosion, the next problem you’ll likely encounter is slow query execution on the series that remain: the index is smaller, but expensive queries still have to scan it.