Prometheus doesn’t actually store exact quantile values; it uses a clever approximation that often surprises people.

Let’s see this in action. Imagine you’re scraping a service that reports request durations.

# prometheus.yml
scrape_configs:
  - job_name: 'my_service'
    static_configs:
      - targets: ['localhost:9091']

Your service exposes a histogram metric like this:

# HELP my_app_request_duration_seconds Histogram of request durations.
# TYPE my_app_request_duration_seconds histogram
my_app_request_duration_seconds_bucket{le="0.005"} 123
my_app_request_duration_seconds_bucket{le="0.01"} 200
my_app_request_duration_seconds_bucket{le="0.05"} 350
my_app_request_duration_seconds_bucket{le="0.1"} 400
my_app_request_duration_seconds_bucket{le="0.5"} 420
my_app_request_duration_seconds_bucket{le="1"} 425
my_app_request_duration_seconds_bucket{le="5"} 430
my_app_request_duration_seconds_bucket{le="10"} 432
my_app_request_duration_seconds_bucket{le="+Inf"} 432
my_app_request_duration_seconds_count 432
my_app_request_duration_seconds_sum 55.7

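When you query for the 95th percentile of my_app_request_duration_seconds, Prometheus has no precise value to return. Instead, it estimates the quantile from the cumulative _bucket counts. The query histogram_quantile(0.95, my_app_request_duration_seconds_bucket) returns a value based on how the observations fall between the defined bucket boundaries. Because the bucket series are counters that accumulate from process start, this form describes every observation recorded so far; in practice the buckets are usually wrapped in rate(), as in histogram_quantile(0.95, rate(my_app_request_duration_seconds_bucket[5m])), to get the quantile over a recent window.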

The problem Prometheus solves is the memory and processing overhead of storing every single observation for every metric. At millions of requests per second, keeping each duration would be infeasible, so a histogram only increments a counter per bucket. Prometheus uses those counts to estimate quantiles. It's a trade-off: sacrificing perfect accuracy for scalability. The histogram_quantile function locates the bucket that contains the target rank and interpolates linearly between that bucket's lower and upper bounds, which amounts to assuming observations are spread evenly within the bucket.

You control the accuracy through the bucket boundaries you define. If your buckets are too coarse, your quantile estimates will be too. For instance, if you only had buckets for 0.01, 0.1, and +Inf, and the 95th percentile fell above 0.1, Prometheus could not interpolate at all, because the +Inf bucket has no finite upper bound. histogram_quantile would simply return 0.1, the upper bound of the highest finite bucket, however slow the requests really were. Well-chosen buckets that cover the expected range of your data are key. A common pattern is to use exponentially increasing bucket boundaries, with extra fine-grained boundaries around critical thresholds (e.g., 0.1s, 0.5s, 1s for latency metrics).
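As a rough illustration of that pattern, the sketch below generates exponentially spaced upper bounds and merges in a few threshold boundaries. The helper names exponential_buckets and with_thresholds are hypothetical and written for this example, not part of any Prometheus client library; the start value, factor, and thresholds are assumptions you would tune to your own latencies.

```python
from __future__ import annotations


def exponential_buckets(start: float, factor: float, count: int) -> list[float]:
    """Return `count` upper bounds, each `factor` times the previous one."""
    if start <= 0 or factor <= 1 or count < 1:
        raise ValueError("need start > 0, factor > 1, count >= 1")
    return [start * factor**i for i in range(count)]


def with_thresholds(buckets: list[float], thresholds: list[float]) -> list[float]:
    """Merge extra boundaries (e.g. alerting thresholds) into a layout."""
    return sorted(set(buckets) | set(thresholds))


# Assumed layout: 10 buckets from 5 ms, doubling each time (up to 2.56 s),
# plus explicit boundaries at the thresholds mentioned in the text.
base = exponential_buckets(0.005, 2, 10)
layout = with_thresholds(base, [0.1, 0.5, 1.0])
print(layout)
```

The resulting list could then be passed as the bucket boundaries when the histogram is declared in your client library of choice.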

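The core idea is that histogram_quantile(q, b) first computes the target rank q * total_count, where total_count is the cumulative count of the +Inf bucket. It then finds the first bucket b_{i+1} whose cumulative count is greater than or equal to that rank; the bucket before it, b_i, has a cumulative count below the rank. The estimate is interpolated between the upper bounds of these two buckets: b_i.upper_bound + (b_{i+1}.upper_bound - b_i.upper_bound) * (q * total_count - b_i.count) / (b_{i+1}.count - b_i.count). Two edge cases follow from this. If the rank falls in the lowest bucket, its lower bound is taken to be 0 (provided its upper bound is positive). If the rank falls in the +Inf bucket, the upper bound of the highest finite bucket is returned.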
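The interpolation can be sketched in a few lines of Python. This is a simplified model of the behavior described above, not Prometheus's actual implementation: estimate_quantile is a hypothetical name, it assumes 0 < q <= 1, and it skips details such as negative bucket bounds and non-monotonic counts. The bucket data is the sample exposition shown earlier.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must include the +Inf bucket. Simplified sketch only.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if len(buckets) < 2 or total == 0:
        return math.nan

    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)

    if idx == len(buckets) - 1:
        # Rank falls in the +Inf bucket: there is no finite upper bound to
        # interpolate towards, so return the highest finite bound.
        return buckets[-2][0]

    upper, count = buckets[idx]
    # The lowest bucket is assumed to start at 0 with nothing below it.
    lower, prev_count = buckets[idx - 1] if idx > 0 else (0.0, 0)
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


buckets = [
    (0.005, 123), (0.01, 200), (0.05, 350), (0.1, 400), (0.5, 420),
    (1, 425), (5, 430), (10, 432), (math.inf, 432),
]
p95 = estimate_quantile(0.95, buckets)
print(round(p95, 4))  # 0.308
```

For this data the rank 0.95 * 432 = 410.4 lands in the bucket between 0.1 and 0.5, so the estimate comes out at roughly 0.308 seconds.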

Apply this to the sample above with histogram_quantile(0.95, my_app_request_duration_seconds_bucket). The total count is 432, so the 95th percentile rank is 0.95 * 432 = 410.4. The cumulative count at le="0.1" is 400, which is below the rank, and the count at le="0.5" is 420, the first at or above it. The 95th percentile therefore falls within the bucket between 0.1s and 0.5s (not between 0.05s and 0.1s, whose counts of 350 and 400 are both below the rank). Prometheus interpolates: 0.1 + (0.5 - 0.1) * (410.4 - 400) / (420 - 400) = 0.1 + 0.4 * 10.4 / 20 = 0.1 + 0.4 * 0.52 = 0.1 + 0.208 = 0.308. The result is approximately 0.308 seconds. Note that this bucket spans 0.4 seconds and holds only 20 observations, so the true 95th percentile could be anywhere between 0.1s and 0.5s.

A common pitfall is expecting exact quantile values from Prometheus histograms. Because it’s an estimation based on bucket counts, the accuracy is directly tied to the granularity and distribution of your defined buckets. If your data has a very tight distribution that spans only a few fine-grained buckets, the accuracy will be high. If the data is spread thinly across very coarse buckets, the accuracy will suffer significantly. You’ll often see quantile estimates that are slightly off from what you might expect if you had exact observations.
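To make the effect of granularity concrete, here is a small sketch that buckets the same synthetic observations two ways and compares each estimate against the exact 95th percentile. Everything in it is an assumption for illustration: the data (100 evenly spaced latencies between about 0.2s and 0.3s), both bucket layouts, and the helper names bucketize and estimate, which model the interpolation in simplified form rather than reproducing Prometheus's code.

```python
import bisect
import math


def bucketize(observations, bounds):
    """Cumulative counts per upper bound, like the _bucket series."""
    ordered = sorted(observations)
    return [(b, bisect.bisect_right(ordered, b)) for b in bounds + [math.inf]]


def estimate(q, buckets):
    """Simplified linear interpolation inside the bucket holding the rank."""
    rank = q * buckets[-1][1]
    idx = next(i for i, (_, c) in enumerate(buckets) if c >= rank)
    if idx == len(buckets) - 1:
        return buckets[-2][0]  # +Inf bucket: highest finite bound
    lower, prev = buckets[idx - 1] if idx > 0 else (0.0, 0)
    upper, count = buckets[idx]
    return lower + (upper - lower) * (rank - prev) / (count - prev)


# Synthetic, tightly distributed latencies: 0.2005 s, 0.2015 s, ... 0.2995 s.
latencies = [0.2005 + 0.001 * i for i in range(100)]
exact = sorted(latencies)[math.ceil(0.95 * len(latencies)) - 1]

coarse_est = estimate(0.95, bucketize(latencies, [0.1, 1.0]))
fine_est = estimate(0.95, bucketize(latencies, [0.2, 0.225, 0.25, 0.275, 0.3]))

print(f"exact={exact:.4f} coarse={coarse_est:.4f} fine={fine_est:.4f}")
```

With the coarse layout every observation lands in the single bucket between 0.1 and 1.0, so the estimate is pulled far above the real value; the fine layout stays within a fraction of a millisecond of it.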

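Beyond accuracy, consider the implications for alerting. Alerting on estimated quantiles can lead to false positives or negatives if the bucket granularity isn't sufficient around the alerting threshold. For instance, suppose you want to alert when the 99th percentile exceeds 2 seconds, but your buckets only go up to 1 second and then jump to +Inf. Any rank that falls in the +Inf bucket is reported as 1 second, the highest finite bound, so the estimate can never exceed 1 second and the alert will never fire, no matter how slow the requests are. Conversely, a threshold that sits in the middle of a wide bucket can fire too early or too late, because the estimate depends on the assumed even spread within that bucket.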

The next challenge is understanding how to properly configure these buckets for different types of metrics.
