PromQL queries can feel like a black box, but understanding their execution reveals a surprisingly simple bottleneck: how much data the query actually needs to scan.

Let’s see this in action with a common scenario: finding the average idle CPU rate for each node in a cluster, for example on a dashboard panel covering the last hour.

avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))

This query looks innocent enough. It uses rate to calculate the per-second increase of the node_cpu_seconds_total counter, specifically for the idle CPU mode, and then averages it by instance.

But what if your cluster has thousands of nodes, each with many CPU cores? That can mean hundreds of thousands of time series, or millions once every CPU mode is counted. Prometheus has to:

  1. Select the matching series: It uses the label index to find every time series matching node_cpu_seconds_total{mode="idle"}.
  2. Filter by time: For each matching time series, it retrieves the samples within the last minute for the rate function.
  3. Calculate the rate: For each series, it takes the increase between the first and last samples in the 1m window (adjusting for counter resets and extrapolating to the window edges) and divides it by the window length in seconds.
  4. Aggregate: Finally, it groups these rates by instance and calculates the average.
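
The four steps above map onto the nested parts of the query. As a rough sketch, each expression below can be run on its own in the Prometheus expression browser to see what every stage produces (the range vector in step 2 can be shown in the table view but not graphed):

```promql
# Step 1: instant vector selector, one current sample per matching series.
node_cpu_seconds_total{mode="idle"}

# Step 2: range vector selector, every sample from the last minute per series.
node_cpu_seconds_total{mode="idle"}[1m]

# Step 3: per-second rate, still one series per instance and CPU.
rate(node_cpu_seconds_total{mode="idle"}[1m])

# Step 4: collapse the per-CPU series into one value per instance.
avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))
```

Steps 1 to 3 all return the same number of series; only the aggregation in step 4 shrinks the result, which is why the selector decides most of the cost.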

This is where performance tanks. The sheer volume of raw data being processed is the killer.

To speed this up, we need to reduce the amount of data Prometheus has to touch. The most effective way is to limit the number of series Prometheus needs to evaluate before aggregation.
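
Before changing a slow query, it helps to measure how many series its selector touches. The queries below are a minimal sketch of that check, using the metric from the example above; run each one separately:

```promql
# Total number of series the selector matches.
count(node_cpu_seconds_total{mode="idle"})

# Series per instance (roughly the CPU count, since mode is fixed).
count by (instance) (node_cpu_seconds_total{mode="idle"})

# The ten instances contributing the most series for this metric.
topk(10, count by (instance) (node_cpu_seconds_total))
```

If the first number is in the hundreds of thousands, narrowing the selector is likely to help more than any other change.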

Common Causes and Fixes

1. Overly Broad Metric Selectors (The "Everything" Problem)

  • Diagnosis: Your query is selecting far more time series than you actually need. This often happens with missing or generic label filters.
  • Example Problem: rate(node_cpu_seconds_total[1m]) without any label filters.
  • Fix: Add specific label filters to narrow down the target metrics.
    • Command/Check: Examine your query’s metric selector. Does it have explicit label constraints like job="my-app", namespace="production", or instance=~"web-server-.*"?
    • Example Fix: If you only care about CPU usage for web servers in the production environment, change rate(node_cpu_seconds_total{mode="idle"}[1m]) to rate(node_cpu_seconds_total{mode="idle", job="webserver", namespace="production"}[1m]).
    • Why it Works: This dramatically reduces the number of time series Prometheus needs to read from storage and process for the rate function.
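
As an illustration of how label matchers narrow a selector, here are a few variants. The label values (job="webserver", instance names starting with web-server-) are the hypothetical ones used above, so substitute the labels your own targets carry:

```promql
# Exact match on two labels.
rate(node_cpu_seconds_total{mode="idle", job="webserver"}[1m])

# Regex match. Patterns are fully anchored, so this matches
# instance values that start with "web-server-".
rate(node_cpu_seconds_total{mode="idle", instance=~"web-server-.*"}[1m])

# Negative match: every mode except idle.
rate(node_cpu_seconds_total{mode!="idle", job="webserver"}[1m])

# Compare how many series each selector touches (run separately).
count(node_cpu_seconds_total)
count(node_cpu_seconds_total{mode="idle", job="webserver"})
```

Exact matchers are generally the cheapest to resolve, so prefer = over =~ when a single value is enough.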

2. Large Aggregations Without Pre-aggregation

  • Diagnosis: You’re aggregating over a massive number of series, and the aggregation itself is the bottleneck, not the data retrieval.
  • Example Problem: sum(rate(node_cpu_seconds_total{mode="idle"}[5m])) across thousands of nodes.
  • Fix: Leverage recording rules to pre-aggregate data at a higher level.
    • Command/Check: Look for sum(), avg(), count() applied to metrics that naturally have many instances (e.g., node_cpu_seconds_total, container_cpu_usage_seconds_total).
    • Example Fix: Create a recording rule:
      groups:
      - name: node_cpu_aggregation
        rules:
        - record: node:cpu:idle:rate:5m
          expr: |
            sum by (instance) (
              rate(node_cpu_seconds_total{mode="idle"}[5m])
            )
      
      Then, query sum(node:cpu:idle:rate:5m) instead. The rule uses sum by (instance) rather than avg so that summing the recorded series gives the same total as the original query.
    • Why it Works: Prometheus evaluates the recording rule on every rule evaluation interval, not at query time, and stores the result as a new metric (node:cpu:idle:rate:5m) with one series per instance. Your query then only needs to sum this pre-aggregated metric, which has far fewer series.
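
Once the rule has been loaded and evaluated at least once, a quick sanity check is to compare series counts before and after. This sketch assumes the rule name shown above; run each query separately:

```promql
# Series the original query has to read: one per instance and CPU.
count(node_cpu_seconds_total{mode="idle"})

# Series the rewritten query has to read: one per instance.
count(node:cpu:idle:rate:5m)
```

If the second query returns nothing, the rule has probably not been loaded yet; running promtool check rules against the rule file is one way to catch syntax problems before reloading Prometheus.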

3. Aggregating Raw Counters Instead of Rates

  • Diagnosis: You’re aggregating raw counter metrics (<metric_name>_total) directly, where a rate-based or pre-aggregated metric would be more appropriate.
  • Example Problem: avg(node_cpu_seconds_total{mode="idle"}) without a rate or increase function. This averages raw counter values, which is rarely useful: the result mostly reflects how long each node has been running, and it jumps whenever a counter resets.
  • Fix: Always wrap counters in rate(), irate(), or increase() when calculating per-unit-of-time values (delta() is intended for gauges). Apply aggregation operators (sum, avg) to the results of these functions or to gauges.
    • Command/Check: Does your query use aggregation operators directly on counter metrics without a time-based function?
    • Example Fix: Change avg(node_cpu_seconds_total{mode="idle"}) to avg(rate(node_cpu_seconds_total{mode="idle"}[5m])).
    • Why it Works: Counters only increase, apart from resets to zero when a process restarts, so averaging their raw values is meaningless. rate() calculates the per-second change over time, giving you a meaningful value (here, the fraction of each second a CPU spent idle) that can then be aggregated. This fix is about correctness more than speed: it makes sure the work Prometheus does produces a usable number.
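
To make the difference concrete, here is a short sketch of counter-based queries that do produce interpretable values, built on the same metric. Run each one separately:

```promql
# Fraction of time each instance's CPUs spent idle, averaged across cores (0 to 1).
avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# The same data expressed as a busy percentage per instance.
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Idle CPU seconds accumulated by each series over the last 5 minutes.
increase(node_cpu_seconds_total{mode="idle"}[5m])
```

In each case the time-based function is applied first and the aggregation second, which is the order the fix above recommends.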

4. Unnecessary Labels in by Clauses

  • Diagnosis: You’re grouping by labels that don’t actually reduce the number of series, or you’re grouping by too many labels.
  • Example Problem: avg by (instance, cpu, mode) (rate(node_cpu_seconds_total[1m])) when you only need the average per instance.
  • Fix: Review your by and without clauses. Only group by the labels essential for your desired output.
    • Command/Check: Does your by clause include labels that are already unique to each series you’re interested in, or labels you don’t need in the final output?
    • Example Fix: If you want the average idle CPU rate per instance, avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) is correct. Writing avg by (instance, mode) instead gives a separate average for each mode per instance. That may be what you want for an unfiltered selector, but when the selector already pins mode to "idle", the extra label is redundant.
    • Why it Works: Every distinct combination of the grouping labels produces its own output series. Extra labels in the by clause mean more groups for Prometheus to hold in memory and more series to return, without adding information.
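
A rough way to see the effect of the grouping labels is to count the output series of each variant. The sketch below uses only the metric and labels from the example; run each query separately:

```promql
# One output series per instance.
avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))

# A similar grouping written with "without": drops the cpu label
# and keeps every other label (instance, job, mode, ...).
avg without (cpu) (rate(node_cpu_seconds_total{mode="idle"}[1m]))

# Number of groups produced by the over-specified clause...
count(avg by (instance, cpu, mode) (rate(node_cpu_seconds_total[1m])))

# ...versus the number produced when grouping by instance only.
count(avg by (instance) (rate(node_cpu_seconds_total[1m])))
```

When the first count equals the number of input series, the aggregation is not reducing anything and the by clause is worth revisiting.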

5. Inefficient offset Usage

  • Diagnosis: A query that compares current data with offset data makes Prometheus fetch samples from two distinct time ranges, roughly doubling the I/O for that part of the query.
  • Example Problem: metric_a - metric_b offset 5m, where metric_a and metric_b each match many series.
  • Fix: Narrow the selectors on both sides before comparing. If the same comparison runs often, pre-aggregate with a recording rule and apply offset to the recorded series. Use on() or ignoring() clauses to control which series are paired.
    • Command/Check: Are you using offset on selectors that match many series, or inside expensive aggregations?
    • Example Fix: offset must be attached directly to a selector (or a subquery), so sum(rate(a[5m])) - sum(rate(b[5m])) offset 1h does not parse; write sum(rate(a[5m])) - sum(rate(b[5m] offset 1h)) instead. If the logic allows, restructure the query to avoid the offset altogether. When the aggregates are already produced by a recording rule, offset the recorded series rather than recomputing the rate over raw data for both time ranges.
    • Why it Works: offset makes Prometheus read a second time range in addition to the current one. The fewer series each side selects, the less duplicated fetching and processing there is. Vector matching does not replace offset, but it lets Prometheus pair series on a small set of labels at each evaluation step.
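
For reference, here is a sketch of where the offset modifier has to sit for the query to parse: directly after the selector, inside any function call. The second query assumes a recorded series named node:cpu:idle:rate:5m, as in the recording rule example earlier:

```promql
# Current idle rate minus the idle rate one hour ago.
# offset follows the range selector, inside rate().
sum(rate(node_cpu_seconds_total{mode="idle"}[5m]))
  -
sum(rate(node_cpu_seconds_total{mode="idle"}[5m] offset 1h))

# The same comparison against a pre-aggregated recorded series,
# which reads far fewer series for both time ranges.
sum(node:cpu:idle:rate:5m) - sum(node:cpu:idle:rate:5m offset 1h)
```

Both forms still read two time ranges; the second one simply makes each read much smaller.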

6. Unoptimized unless or or Operations

  • Diagnosis: Using unless or or on large sets of series can be computationally expensive as Prometheus needs to evaluate both sides and perform set operations.
  • Example Problem: metric_a unless on(instance) metric_b where metric_a and metric_b both have thousands of series.
  • Fix: Try to filter down the series on one side of the operation as much as possible before the unless or or.
    • Command/Check: Are you using unless or or with broad selectors on either side?
    • Example Fix: If you want series from metric_a that are not present in metric_b where metric_b is filtered by job="critical", write metric_a unless on(instance) metric_b{job="critical"}. If metric_a itself can be filtered further, do that first: metric_a{job="app"} unless on(instance) metric_b{job="critical"}.
    • Why it Works: Set operations involve comparing elements. Reducing the size of either set significantly reduces the comparison work Prometheus needs to do.
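
As a concrete sketch of narrowing both sides, the queries below look for targets that are up but expose no idle CPU series. The job="node" label is an assumption about how the node exporter scrape job is named; adjust it to your configuration:

```promql
# Broad version: every "up" series compared against every CPU series.
up unless on (instance) node_cpu_seconds_total

# Narrowed version: only healthy node targets on the left,
# and a single CPU series per instance on the right.
up{job="node"} == 1
  unless on (instance)
node_cpu_seconds_total{mode="idle", cpu="0"}
```

Comparison operators bind more tightly than unless, so the narrowed query filters up to healthy targets first and then removes the instances that do have a matching CPU series.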

The next hurdle you’ll likely encounter is understanding how Prometheus’s internal storage (TSDB) impacts query performance, particularly with high cardinality.
