PromQL queries can feel like a black box, but understanding their execution reveals a surprisingly simple bottleneck: how much data the query actually needs to scan.
Let’s see this in action with a common scenario: finding the average idle CPU rate for each node in a cluster.

    avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))

This query looks innocent enough. It uses `rate` to calculate the per-second increase of the `node_cpu_seconds_total` counter over a one-minute window, specifically for the idle CPU mode, and then averages it by instance.
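But what if your cluster has thousands of nodes, each with multiple CPU cores? The selector matches one series per core per node for the idle mode alone, which quickly adds up to tens or hundreds of thousands of time series. Prometheus has to: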
- Scan all `node_cpu_seconds_total` metrics: It looks for every single time series matching `node_cpu_seconds_total{mode="idle"}`.
- Filter by time: For each matching time series, it needs to retrieve data points within the last minute for the `rate` function.
- Calculate the rate: It performs the division `(value[t] - value[t-1m]) / 1m` for each series.
- Aggregate: Finally, it groups these rates by `instance` and calculates the average.
This is where performance tanks. The sheer volume of raw data being processed is the killer.
To speed this up, we need to reduce the amount of data Prometheus has to touch. The most effective way is to limit the number of series Prometheus needs to evaluate before aggregation.
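Before tuning anything, it can help to measure how many series a selector actually touches. The queries below are a minimal sketch, assuming the usual node_exporter labels (`cpu`, `mode`, `instance`) are present; run each one separately.

```promql
# Total idle-mode series the selector matches (one per core per node)
count(node_cpu_seconds_total{mode="idle"})

# Number of distinct instances behind those series
count(count by (instance) (node_cpu_seconds_total{mode="idle"}))
```

If the first number is much larger than you expected, the selector is the first thing to narrow.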
Common Causes and Fixes
1. Overly Broad Metric Selectors (The "Everything" Problem)
- Diagnosis: Your query is selecting far more time series than you actually need. This often happens with missing or generic label filters.
- Example Problem: `rate(node_cpu_seconds_total[1m])` without any label filters.
- Fix: Add specific label filters to narrow down the target metrics.
  - Command/Check: Examine your query’s metric selector. Does it have explicit label constraints like `job="my-app"`, `namespace="production"`, or `instance=~"web-server-.*"`?
  - Example Fix: If you only care about CPU usage for web servers in the production environment, change `rate(node_cpu_seconds_total{mode="idle"}[1m])` to `rate(node_cpu_seconds_total{mode="idle", job="webserver", namespace="production"}[1m])`.
  - Why it Works: This dramatically reduces the number of time series Prometheus needs to read from storage and process for the `rate` function.
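To decide which label filters are worth adding, it may be useful to see how the matched series are distributed across label values. This is a sketch only: the `job` and `namespace` labels and their values are taken from the example above and may not exist in your setup.

```promql
# Series per job for the idle-mode selector, largest first
sort_desc(count by (job) (node_cpu_seconds_total{mode="idle"}))

# Series that survive the narrowed selector (label values are illustrative)
count(node_cpu_seconds_total{mode="idle", job="webserver", namespace="production"})
```

Comparing the second count with the unfiltered total gives a rough sense of how much work the extra matchers save.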
2. Large Aggregations Without Pre-aggregation
- Diagnosis: You’re aggregating over a massive number of series, and the aggregation itself is the bottleneck, not the data retrieval.
- Example Problem: `sum(rate(node_cpu_seconds_total{mode="idle"}[5m]))` across thousands of nodes.
- Fix: Leverage recording rules to pre-aggregate data at a higher level.
  - Command/Check: Look for `sum()`, `avg()`, `count()` applied to metrics that naturally have many instances (e.g., `node_cpu_seconds_total`, `container_cpu_usage_seconds_total`).
  - Example Fix: Create a recording rule:

        groups:
          - name: node_cpu_aggregation
            rules:
              - record: node:cpu:idle:rate:5m
                expr: |
                  avg by (instance) (
                    rate(node_cpu_seconds_total{mode="idle"}[5m])
                  )

    Then, query `sum(node:cpu:idle:rate:5m)` instead. Because the rule averages across cores, this returns the sum of per-instance averages rather than the sum over every core; record with `sum by (instance)` if you need the original total.
  - Why it Works: The recording rule calculates the average per instance at every rule evaluation interval, not at query time, and stores it as a new metric (`node:cpu:idle:rate:5m`). Your query then only needs to sum this pre-aggregated metric, which has far fewer series.
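Once a rule like the one described above is loaded, queries can read the recorded series directly. The sketch below assumes the recorded metric is named `node:cpu:idle:rate:5m` as in the example and that it holds the per-instance average idle rate.

```promql
# Series the dashboard query now reads (one per instance)
count(node:cpu:idle:rate:5m)

# Cluster-wide busy fraction derived from the recorded idle rate
1 - avg(node:cpu:idle:rate:5m)
```

Note that the second query weights every instance equally regardless of its core count, which may or may not be what you want.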
3. Aggregating Raw Counters Instead of Rates
- Diagnosis: You’re using raw counter metrics (`<metric_name>_total`) in aggregations where a pre-aggregated or rate-based metric would be more appropriate.
- Example Problem: `avg(node_cpu_seconds_total{mode="idle"})` without a `rate` or `increase` function. This implies you’re averaging raw counter values, which is rarely useful because the values mostly reflect how long each node has been running.
- Fix: Always use functions like `rate()`, `irate()`, or `increase()` on counters when calculating per-unit-of-time values (`delta()` is intended for gauges). Use aggregation functions (`sum`, `avg`) on results of these functions or on gauges.
  - Command/Check: Does your query use aggregation functions directly on counter metrics without a time-based function?
  - Example Fix: Change `avg(node_cpu_seconds_total{mode="idle"})` to `avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))`.
  - Why it Works: Counters only increase (apart from resets). Averaging raw counter values is meaningless. `rate()` calculates the change over time, giving you a meaningful value (e.g., idle CPU seconds per second) that can then be aggregated.
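For reference, the counter functions mentioned here differ mainly in what they return for the same window. A short sketch, each line being a separate query:

```promql
# Per-second idle rate per series, averaged over the 5m window
rate(node_cpu_seconds_total{mode="idle"}[5m])

# Idle seconds accumulated over the same window (rate multiplied by 300)
increase(node_cpu_seconds_total{mode="idle"}[5m])

# Rate computed from only the last two samples in the window; more responsive but noisier
irate(node_cpu_seconds_total{mode="idle"}[5m])
```

Any of these results can then be wrapped in `sum` or `avg`.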
4. Unnecessary Labels in `by` Clauses
- Diagnosis: You’re grouping by labels that don’t actually reduce the number of series, or you’re grouping by too many labels.
- Example Problem: `avg by (instance, cpu, mode) (rate(node_cpu_seconds_total[1m]))` when you only need the average per instance.
- Fix: Review your `by` and `without` clauses. Only group by the labels essential for your desired output.
  - Command/Check: Does your `by` clause include labels that are already unique to each series you’re interested in, or labels you don’t need in the final output?
  - Example Fix: If you want the average idle CPU rate per instance, `avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))` is correct. If you accidentally wrote `avg by (instance, mode)`, you’d get separate averages for each mode per instance, which might be okay, but if `mode` is always "idle" due to your selector, it’s redundant.
  - Why it Works: Prometheus produces one output series per distinct combination of grouping labels. Grouping by labels that are already unique per series means the aggregation barely shrinks the result, which increases memory pressure and computation.
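The difference between `by` and `without` is easiest to see side by side. In this sketch, both queries average across cores, but they keep different labels on the result:

```promql
# Keep only instance; every other label is dropped from the result
avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))

# Drop only cpu and mode; labels such as job are preserved on the result
avg without (cpu, mode) (rate(node_cpu_seconds_total{mode="idle"}[1m]))
```

`without` tends to be the safer choice when you want to keep target labels you have not listed explicitly.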
5. Inefficient `offset` Usage
- Diagnosis: Using `offset` without careful consideration can force Prometheus to fetch data from two distinct time ranges, potentially doubling the I/O for that part of the query.
- Example Problem: `(metric_a - metric_b offset 5m)` where `metric_a` and `metric_b` each cover many series.
- Fix: Whenever possible, filter or aggregate before comparing against offset data, or use vector matching with `on()` or `ignoring()` clauses when the comparison is at a single point in time.
  - Command/Check: Are you using `offset` on complex expressions or on metrics that have many series?
  - Example Fix: Note that `offset` must directly follow a selector, so `sum(rate(b[5m])) offset 1h` does not parse; write `sum(rate(b[5m] offset 1h))`. Instead of `sum(rate(a[5m])) - sum(rate(b[5m] offset 1h))`, consider whether the logic really needs the historical comparison. If both sides can be evaluated at the same timestamp, `sum(rate(a[5m]) - rate(b[5m]))` avoids the second time range entirely. For comparing series at the same point in time, vector matching is usually the more performant choice.
  - Why it Works: `offset` forces Prometheus to look at two separate points in time. If the underlying metrics are already being processed, this can lead to redundant data fetching and processing. Vector matching allows Prometheus to align series based on labels at a single point in time.
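When a historical comparison is genuinely needed, the `offset` modifier goes on the selector itself, inside `rate()`. A minimal sketch comparing the current per-instance idle rate with the value one hour earlier:

```promql
# Current idle rate minus the idle rate one hour earlier, per instance.
# offset is attached to the range selector, inside rate().
avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
  -
avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m] offset 1h))
```

Aggregating both sides to `instance` first keeps the subtraction down to one pair of series per node.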
6. Unoptimized `unless` or `or` Operations
- Diagnosis: Using `unless` or `or` on large sets of series can be computationally expensive as Prometheus needs to evaluate both sides and perform set operations.
- Example Problem: `metric_a unless on(instance) metric_b` where `metric_a` and `metric_b` both have thousands of series.
- Fix: Try to filter down the series on one side of the operation as much as possible before the `unless` or `or`.
  - Command/Check: Are you using `unless` or `or` with broad selectors on either side?
  - Example Fix: If you want series from `metric_a` that are not present in `metric_b` where `metric_b` is filtered by `job="critical"`, write `metric_a unless on(instance) metric_b{job="critical"}`. If `metric_a` itself can be filtered further, do that first: `metric_a{job="app"} unless on(instance) metric_b{job="critical"}`.
  - Why it Works: Set operations involve comparing elements. Reducing the size of either set significantly reduces the comparison work Prometheus needs to do.
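As a concrete illustration of narrowing both sides, the sketch below looks for targets that are up but expose no idle CPU series. The `job="node"` value is an assumption, and the query relies on every node reporting a `cpu="0"` series and on both metrics sharing the same `instance` label.

```promql
# Targets that are up but expose no idle CPU series.
# cpu="0" keeps one series per instance on the right-hand side.
up{job="node"} == 1
  unless on (instance)
node_cpu_seconds_total{mode="idle", cpu="0"}
```

Restricting the right-hand side to a single core means the set comparison handles one series per node instead of one per core.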
The next hurdle you’ll likely encounter is understanding how Prometheus’s internal storage (TSDB) impacts query performance, particularly with high cardinality.