The Prometheus query engine is failing to load results because it’s trying to materialize an unmanageably large number of time series samples in memory.

Common Causes and Fixes

1. Overly Broad Time Range:

  • Diagnosis: Check the start and end parameters in your Prometheus query URL or API call. A very large range (e.g., weeks or months) can naturally lead to many samples.
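  • Fix: Reduce the time range of your query. For instance, instead of start=1678886400&end=1678972800 (a full day), try start=1678944000&end=1678972800 (eight hours). For range queries, a larger step parameter also helps, because Prometheus evaluates the expression once per step. Both changes limit the number of data points Prometheus needs to scan and return.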
  • Why it works: Fewer data points means less data to process and transfer.
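A back-of-the-envelope estimate can show how much a shorter range helps. The sketch below assumes one returned point per series per step; the series count (5000) and the 60-second step are illustrative assumptions, and the function names are hypothetical, not part of any Prometheus client.

```python
def range_query_steps(start: int, end: int, step: int) -> int:
    """Number of evaluation steps in a range query, both endpoints included."""
    if step <= 0 or end < start:
        raise ValueError("need step > 0 and end >= start")
    return (end - start) // step + 1


def estimated_points(start: int, end: int, step: int, series: int) -> int:
    """Rough upper bound on returned points: one per series per step."""
    return range_query_steps(start, end, step) * series


# Timestamps are the ones used in the fix above.
# The 5000 series and the 60s step are assumptions for illustration.
full_day = estimated_points(1678886400, 1678972800, 60, 5000)
eight_hours = estimated_points(1678944000, 1678972800, 60, 5000)
print(full_day, eight_hours)
```

Under these assumptions the eight-hour query returns roughly a third of the points of the full-day query. Range vector selectors such as [5m] read additional raw samples per step, so treat this as a lower bound on what the server actually loads.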

2. High Cardinality Labels:

  • Diagnosis: Use count({__name__=~".+"}) to get a rough estimate of your total active time series (on a large server this query is itself expensive). If this number is in the millions or tens of millions, high cardinality is a strong suspect. Then identify labels that have a large number of unique values. In Grafana, the label_values(<label_name>) template function lists them; in PromQL, count(count by (label_name) ({__name__=~".+"})) returns the number of distinct values of label_name. Look for labels with a vast number of distinct values, such as pod_name, instance, or dynamically generated IDs.
  • Fix: Remove or reduce the cardinality of problematic labels. This might involve reconfiguring the exporter so it no longer exposes metrics with such labels, or using Prometheus relabeling rules under metric_relabel_configs to drop labels or keep only essential series. Note that action: drop removes whole series, not a label; to drop a label called user_id, use labeldrop and make sure the remaining labels still identify each series uniquely:
    metric_relabel_configs:
      - regex: user_id
        action: labeldrop
    
    Or to keep only series whose instance label matches a pattern (keep ignores replacement, so it is omitted):
    metric_relabel_configs:
      - source_labels: [instance]
        regex: 'instance-(.*)'
        action: keep
    
  • Why it works: High cardinality means each unique combination of label values represents a distinct time series. Reducing the number of unique combinations directly reduces the total number of time series Prometheus needs to manage and query.
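To find the labels worth dropping, it can help to count distinct values per label offline. The following is a minimal sketch that assumes you already have a list of label sets, shaped like the "data" array returned by the /api/v1/series endpoint; the sample data and function names are hypothetical.

```python
from collections import defaultdict


def label_cardinality(series):
    """Map each label name to the number of distinct values it takes."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


def worst_labels(series, limit=3):
    """Return the `limit` labels with the most distinct values, highest first."""
    counts = label_cardinality(series)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


# Hypothetical label sets for illustration only.
sample = [
    {"__name__": "http_requests_total", "job": "api", "user_id": "1"},
    {"__name__": "http_requests_total", "job": "api", "user_id": "2"},
    {"__name__": "http_requests_total", "job": "api", "user_id": "3"},
    {"__name__": "http_requests_total", "job": "web", "user_id": "3"},
]
print(worst_labels(sample))
```

In this toy data user_id has the most distinct values, which makes it the first candidate for a labeldrop rule.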

3. Inefficient Query Selectors:

  • Diagnosis: Examine your query for selectors that match an extremely large number of time series. For example, a selector like {job="my-service"} might match thousands of instances, each with multiple metrics. A broad metric name selector like {__name__=~"process_cpu.*"} can also be problematic if you have many processes.
  • Fix: Refine your selectors to be more specific. Add more label matchers. For instance, instead of {job="my-service"}, try {job="my-service", environment="production"}. If you need to query across many metrics, consider querying specific metric names or using a more targeted regular expression.
  • Why it works: More specific selectors filter down the set of time series before Prometheus starts fetching data, reducing the initial load.
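The narrowing effect of extra matchers can be illustrated with a toy model of selector matching. This is a simplified sketch, not the Prometheus implementation: it uses Python's re module rather than the RE2 syntax Prometheus uses, and the series list is made up. It does mirror one real behavior, namely that regex matchers are fully anchored.

```python
import re


def matches(labels, matchers):
    """True if a label set satisfies every (name, op, value) matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like an empty value
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # anchored match
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


def count_matching(series, matchers):
    """Count the label sets selected by the given matchers."""
    return sum(1 for labels in series if matches(labels, matchers))


# Hypothetical series for illustration only.
series = [
    {"__name__": "http_requests_total", "job": "my-service", "environment": "production"},
    {"__name__": "http_requests_total", "job": "my-service", "environment": "staging"},
    {"__name__": "process_cpu_seconds_total", "job": "my-service", "environment": "production"},
    {"__name__": "http_requests_total", "job": "other", "environment": "production"},
]
broad = count_matching(series, [("job", "=", "my-service")])
narrow = count_matching(
    series, [("job", "=", "my-service"), ("environment", "=", "production")]
)
print(broad, narrow)
```

Each added matcher can only shrink the selected set, which is why tightening the selector is usually the cheapest fix to try first.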

4. Querying Aggregated Metrics Too Broadly:

  • Diagnosis: Queries that aggregate over many time series, especially with sum or avg without a by clause, can still be problematic if the intermediate result set is huge. For example, sum(rate(http_requests_total[5m])) without any label filtering can be very expensive if http_requests_total has high cardinality.
  • Fix: Apply label filters to the metric before aggregation, or use by clauses to limit the aggregation scope. For instance, sum by (code, job) (rate(http_requests_total{job=~"api-.*"}[5m])).
  • Why it works: Filtering early or limiting the aggregation scope reduces the number of series that need to be processed by the aggregation function.

5. Using topk or count_over_time on High Cardinality Metrics:

  • Diagnosis: Functions like topk(10, metric_name) or count_over_time(metric_name[1h]) can be resource-intensive if metric_name has very high cardinality. topk needs to sort potentially millions of series, and count_over_time needs to iterate over all samples within the specified window for each series.
  • Fix: Use these functions on metrics that have already been filtered by specific labels, or consider if there’s an alternative aggregation strategy. For example, if you need the top 10 request rates by endpoint, query topk(10, sum by (path) (rate(http_requests_total[5m]))).
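  • Why it works: Applying filters or aggregations before functions like topk drastically reduces the number of series the function operates on. The inner aggregation still has to read every series the selector matches, so combine it with label filters when the sample limit is the main concern.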
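The aggregate-then-rank pattern can be sketched outside PromQL as well. The example below is a simplified model of sum by (path) followed by topk; the function names and the per-series rates are hypothetical, and it makes no claim about how the Prometheus engine implements either operation.

```python
import heapq
from collections import defaultdict


def sum_by(rates, label):
    """Collapse per-series values into one total per value of `label`."""
    totals = defaultdict(float)
    for labels, value in rates:
        totals[labels.get(label, "")] += value
    return dict(totals)


def topk(k, totals):
    """Return the k largest (group, value) pairs, highest first."""
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])


# Hypothetical per-series request rates for illustration only.
rates = [
    ({"path": "/login", "instance": "a"}, 4.0),
    ({"path": "/login", "instance": "b"}, 6.0),
    ({"path": "/search", "instance": "a"}, 7.0),
    ({"path": "/health", "instance": "a"}, 0.5),
    ({"path": "/health", "instance": "b"}, 0.5),
]
by_path = sum_by(rates, "path")
print(topk(2, by_path))
```

Here five input series collapse into three groups before ranking; on a real server the same step might shrink millions of series to a few hundred paths.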

6. Prometheus Server Resource Constraints:

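  • Diagnosis: Check Prometheus server resource utilization (CPU, memory). If the server is consistently at high CPU or memory usage, it might be struggling to handle even moderately complex queries, especially when combined with high ingestion rates. In Prometheus’s own metrics, look at process_resident_memory_bytes and go_memstats_alloc_bytes for memory, and at prometheus_engine_query_duration_seconds for query latency. (promhttp_metric_handler_requests_total only counts scrapes of the /metrics endpoint, so it says little about query load.)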
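  • Fix: Increase the resources allocated to your Prometheus server (CPU, RAM). The sample limit itself is set by the --query.max-samples flag (default 50000000); raise it only if the server has memory to spare, because the limit exists to protect Prometheus from running out of memory. Flags such as --storage.tsdb.max-block-duration or --storage.tsdb.retention.time can be tuned if they are contributing to excessive disk I/O or memory pressure, though this is less common for query load itself.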
  • Why it works: More resources allow the Prometheus server to process queries and manage its in-memory data structures more efficiently.

7. Network Bottlenecks or Client-Side Issues:

  • Diagnosis: While less common for "too many samples" errors originating from the server, a very large result set being transferred over a slow or saturated network can manifest as timeouts or client-side memory exhaustion. Check network bandwidth and latency between Prometheus and the client.
  • Fix: Optimize network infrastructure. Ensure the client application has sufficient memory to handle the response. If the response is genuinely enormous, consider if the query can be refined to return less data or if aggregation can be pushed closer to the source (e.g., via recording rules).
  • Why it works: Ensures the data can be transferred efficiently and processed by the client.
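One way a client can avoid a single enormous response is to request the range in smaller windows and process each one before fetching the next. This is a sketch of that idea, not something Prometheus does for you; the function name and the six-hour window are assumptions.

```python
def split_range(start: int, end: int, window: int):
    """Yield (start, end) pairs that cover [start, end] in windows of at most
    `window` seconds. Adjacent windows share an endpoint, so a client that
    queries each window should drop the duplicated boundary point."""
    if window <= 0 or end < start:
        raise ValueError("need window > 0 and end >= start")
    if start == end:
        yield start, end
        return
    cursor = start
    while cursor < end:
        upper = min(cursor + window, end)
        yield cursor, upper
        cursor = upper


# One day split into six-hour windows (21600 seconds); the window size is an assumption.
windows = list(split_range(1678886400, 1678972800, 21600))
for window_start, window_end in windows:
    print(window_start, window_end)
```

Each pair can then be passed as the start and end parameters of a separate range query, which keeps both the server-side sample count and the client-side response size bounded per request.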

The next error you’ll likely encounter, if you’ve fixed the sample loading issue but haven’t addressed the underlying data volume, is a query timeout.
