The Prometheus query engine is failing to load results because it’s trying to materialize an unmanageably large number of time series samples in memory.
Common Causes and Fixes
1. Overly Broad Time Range:
- Diagnosis: Check the `start` and `end` parameters in your Prometheus query URL or API call. A very large range (e.g., weeks or months) naturally produces many samples.
- Fix: Reduce the time range of your query. For instance, instead of `start=1678886400&end=1678972800` (a full day), try `start=1678944000&end=1678972800` (eight hours). This limits the number of data points Prometheus needs to scan and return.
- Why it works: Fewer data points mean less data to process and transfer.
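To get a feel for how the time range drives the result size, the rough sketch below counts the evaluation steps of a range query and multiplies them by the number of matched series. The 15-second step and the figure of 1,000 matched series are assumptions chosen for illustration; queries that use range selectors such as `[5m]` load additional raw samples that this estimate ignores.

```python
def estimate_points(start: int, end: int, step: int, series: int) -> int:
    """Rough upper bound on the points returned by a range query.

    A range query is evaluated at start, start+step, ... up to end,
    and yields at most one point per matched series at each step.
    """
    if step <= 0 or end < start:
        raise ValueError("need step > 0 and end >= start")
    steps = (end - start) // step + 1
    return steps * series


STEP = 15        # assumed query resolution in seconds
SERIES = 1_000   # assumed number of matched series

full_day = estimate_points(1678886400, 1678972800, STEP, SERIES)
eight_hours = estimate_points(1678944000, 1678972800, STEP, SERIES)
print(full_day, eight_hours)
```

Under these assumptions the shorter window returns roughly a third of the points; widening the step would shrink the result in the same way.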
2. High Cardinality Labels:
- Diagnosis: Use `count({__name__=~".+"})` to get a rough estimate of your total active time series. If this number is in the millions or tens of millions, high cardinality is a strong suspect. Identify labels that have a large number of unique values. You can do this by inspecting `label_values(<label_name>)` (a Grafana templating function, not PromQL) or by using a query like `count(count by (label_name) ({__name__=~".+"}))`, which returns the number of distinct values of `label_name`. Look for labels with a vast number of distinct values, such as `pod_name`, `instance`, or dynamically generated IDs.
- Fix: Remove or reduce the cardinality of problematic labels. This might involve reconfiguring the exporter to not expose metrics with such labels, or using Prometheus relabeling rules to drop labels or keep only essential label values. Labels attached to scraped samples are rewritten under `metric_relabel_configs`. For example, to drop a label called `user_id`:
  ```yaml
  metric_relabel_configs:
    - regex: user_id
      action: labeldrop
  ```
  Or to keep only series whose `instance` label matches a pattern:
  ```yaml
  metric_relabel_configs:
    - source_labels: [instance]
      regex: 'instance-(.*)'
      action: keep
  ```
- Why it works: High cardinality means each unique combination of label values represents a distinct time series. Reducing the number of unique combinations directly reduces the total number of time series Prometheus needs to manage and query.
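The link between label values and series count can be sketched in plain Python. The label sets below are hypothetical; the sketch only illustrates the counting argument, not how Prometheus stores series internally.

```python
from collections import defaultdict


def label_cardinality(series):
    """Map each label name to the number of distinct values it takes."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


# Hypothetical label sets, one dict per time series.
series = [
    {"__name__": "http_requests_total", "job": "api", "user_id": "u1"},
    {"__name__": "http_requests_total", "job": "api", "user_id": "u2"},
    {"__name__": "http_requests_total", "job": "api", "user_id": "u3"},
    {"__name__": "http_requests_total", "job": "web", "user_id": "u1"},
]

cardinality = label_cardinality(series)
print(cardinality)

# Dropping user_id collapses series that differ only in that label.
without_user_id = {
    tuple(sorted((k, v) for k, v in s.items() if k != "user_id"))
    for s in series
}
print(len(series), "->", len(without_user_id))
```

Here `user_id` has the most distinct values, and removing it halves the number of distinct series in this toy data set.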
3. Inefficient Query Selectors:
- Diagnosis: Examine your query for selectors that match an extremely large number of time series. For example, a selector like `{job="my-service"}` might match thousands of instances, each with multiple metrics. A broad metric name selector like `{__name__=~"process_cpu.*"}` can also be problematic if you have many processes.
- Fix: Refine your selectors to be more specific. Add more label matchers. For instance, instead of `{job="my-service"}`, try `{job="my-service", environment="production"}`. If you need to query across many metrics, consider querying specific metric names or using a more targeted regular expression.
- Why it works: More specific selectors filter down the set of time series before Prometheus starts fetching data, reducing the initial load.
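One way to compare selectors before running an expensive query is to ask Prometheus how many series each one matches, using an instant `count(...)` query against the `/api/v1/query` endpoint. The helper below only builds the URLs; the base address `http://localhost:9090` is an assumption, and sending the request is left to whichever HTTP client is in use.

```python
from urllib.parse import urlencode


def count_query_url(base_url: str, selector: str) -> str:
    """Build an instant-query URL that counts the series a selector matches."""
    params = urlencode({"query": f"count({selector})"})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


BASE = "http://localhost:9090"  # assumed Prometheus address

broad = count_query_url(BASE, '{job="my-service"}')
narrow = count_query_url(BASE, '{job="my-service", environment="production"}')
print(broad)
print(narrow)
```

If the count for the narrower selector is much smaller, the refined query has correspondingly fewer series to load.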
4. Querying Aggregated Metrics Too Broadly:
- Diagnosis: Queries that aggregate over many time series, especially with `sum` or `avg` without a `by` clause, can still be problematic if the intermediate result set is huge. For example, `sum(rate(http_requests_total[5m]))` without any label filtering can be very expensive if `http_requests_total` has high cardinality.
- Fix: Apply label filters to the metric before aggregation, and use `by` clauses to limit the aggregation scope. For instance, `sum by (code, job) (rate(http_requests_total{job=~"api-.*"}[5m]))`.
- Why it works: Filtering early reduces the number of series the aggregation function has to load and process. The `by` clause only controls how the results are grouped, so it keeps the output small but does not by itself reduce the input.
5. Using topk or count_over_time on High Cardinality Metrics:
- Diagnosis: Functions like `topk(10, metric_name)` or `count_over_time(metric_name[1h])` can be resource-intensive if `metric_name` has very high cardinality. `topk` needs to sort potentially millions of series, and `count_over_time` needs to iterate over all samples within the specified window for each series.
- Fix: Use these functions on metrics that have already been filtered by specific labels, or consider if there’s an alternative aggregation strategy. For example, if you need the top 10 request rates by endpoint, query `topk(10, sum by (path) (rate(http_requests_total[5m])))`.
- Why it works: Applying filters or aggregations before functions like `topk` drastically reduces the number of series the function operates on.
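The effect of aggregating before ranking can be mimicked in plain Python. This is a sketch of the idea behind `topk(k, sum by (path) (...))`, not of Prometheus's implementation, and the per-series rates are made-up values.

```python
import heapq
from collections import defaultdict


def topk_by_label(rates, label, k):
    """Sum per-series rates by one label, then return the k largest groups."""
    totals = defaultdict(float)
    for labels, rate in rates:
        totals[labels[label]] += rate
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


# Hypothetical per-series request rates (labels, requests per second).
rates = [
    ({"path": "/login", "instance": "a"}, 4.0),
    ({"path": "/login", "instance": "b"}, 6.0),
    ({"path": "/search", "instance": "a"}, 7.5),
    ({"path": "/health", "instance": "a"}, 0.5),
    ({"path": "/health", "instance": "b"}, 0.5),
]

top_paths = topk_by_label(rates, "path", 2)
print(top_paths)
```

The ranking step sees three groups rather than five series; with real, high-cardinality data the gap between the two is far larger.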
6. Prometheus Server Resource Constraints:
- Diagnosis: Check Prometheus server resource utilization (CPU, memory). If the server is consistently at high CPU or memory usage, it might be struggling to handle even moderately complex queries, especially when combined with high ingestion rates. Look at `promhttp_metric_handler_requests_total` and `go_memstats_alloc_bytes` in Prometheus’s own metrics.
- Fix: Increase the resources allocated to your Prometheus server (CPU, RAM). If the server has memory headroom, you can also raise the per-query sample limit with the `--query.max-samples` flag (default 50,000,000), at the cost of higher peak memory use. Tune flags like `--storage.tsdb.max-block-duration` or `--storage.tsdb.retention.time` if they are contributing to excessive disk I/O or memory pressure, though this is less common for query load itself.
- Why it works: More resources allow the Prometheus server to process queries and manage its in-memory data structures more efficiently.
7. Network Bottlenecks or Client-Side Issues:
- Diagnosis: While less common for "too many samples" errors originating from the server, a very large result set being transferred over a slow or saturated network can manifest as timeouts or client-side memory exhaustion. Check network bandwidth and latency between Prometheus and the client.
- Fix: Optimize network infrastructure. Ensure the client application has sufficient memory to handle the response. If the response is genuinely enormous, consider if the query can be refined to return less data or if aggregation can be pushed closer to the source (e.g., via recording rules).
- Why it works: Ensures the data can be transferred efficiently and processed by the client.
The next error you’ll likely encounter, if you’ve fixed the sample loading issue but haven’t addressed the underlying data volume, is a query timeout.