Latency Percentiles: Beyond Averages

Latency percentiles describe what your users actually experience, where an average hides the extreme outliers.

Let’s say you’re running a simple web service. Here are the latencies of 15 sample requests, in milliseconds, sorted from fastest to slowest:

10, 15, 18, 22, 25, 28, 30, 35, 40, 55, 75, 120, 250, 500, 1200

P50 (Median): This is the 50th percentile. Half of your requests are faster than this, and half are slower. In our example, P50 is 35ms (the 8th of the 15 sorted values). This tells you that the typical request is pretty snappy.
P95: This is the 95th percentile. 95% of your requests are at least this fast, and only 5% are slower. In our example, the nearest-rank P95 is 1200ms: with only 15 samples, the slowest 5% amounts to less than one request, so P95 lands on the maximum. On a realistic sample size, this is where the "long tail" requests begin.
P99: This is the 99th percentile. 99% of your requests are at least this fast, and only 1% are slower. In our example, P99 is also 1200ms, for the same reason; you need at least 100 samples before a nearest-rank P99 can differ from the maximum. P99 shows what the slowest 1% of requests experience, which is close to, but not the same as, the worst case.
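As a rough check on the numbers above, here is a minimal sketch that computes the percentiles for the sample list. It assumes the nearest-rank definition of a percentile; the function name percentile_nearest_rank is made up for this illustration, and other definitions (such as linear interpolation between neighbouring values) give different answers on a sample this small.

```python
import math

def percentile_nearest_rank(samples, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers p percent of the samples.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]

latencies_ms = [10, 15, 18, 22, 25, 28, 30, 35, 40, 55, 75, 120, 250, 500, 1200]

p50 = percentile_nearest_rank(latencies_ms, 50)
p95 = percentile_nearest_rank(latencies_ms, 95)
p99 = percentile_nearest_rank(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)

print(p50, p95, p99, round(mean, 1))  # 35 1200 1200 161.5
```

With 15 samples, both tail percentiles collapse onto the single slowest request, which is why tail percentiles are normally computed over thousands of requests rather than a handful.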

The real power comes from comparing these. If your P50 is 20ms but your P99 is 1000ms, your average latency might still look acceptable, but 1% of your users are having a truly terrible time. In our example the mean is about 162ms, a figure that describes almost none of the requests: 11 of the 15 finished in 75ms or less, while one took 1200ms. A gap like this suggests a problem that affects a minority of requests but has a disproportionate impact on their experience.

Understanding the System in Action

Imagine a database query. The query itself might be very fast for most of the data. However, a single, large table scan or a complex join on an edge case could dramatically increase the latency for a few requests. Percentiles help you distinguish between a system that’s consistently good and one that’s mostly good but occasionally tanks.

Consider a distributed system. A request might involve calls to multiple services.

Service A: P50: 10ms, P99: 50ms
Service B: P50: 5ms, P99: 800ms
Service C: P50: 2ms, P99: 20ms

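If your overall request latency is the sum of these, the P50 will be roughly 17ms (10+5+2). Percentiles are not strictly additive, though: 870ms (50+800+20) is a rough and usually pessimistic estimate of the P99, because it assumes all three services hit their slow case on the same request. What does hold is that the overall P99 is dominated by Service B’s P99, even though Service B is fast for most requests.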
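To see how the tail of a summed latency behaves, here is a small sketch under a deliberately crude assumption: each service answers at its P50 value 98% of the time and at its P99 value 2% of the time, independently of the others. That two-point model and the 2% figure are invented for illustration; real latency distributions are continuous and often correlated across services.

```python
from itertools import product

# Hypothetical two-point model (an assumption, not a measurement):
# each service is "fast" 98% of the time and "slow" 2% of the time,
# which makes fast == its P50 and slow == its P99.
SLOW_PROBABILITY = 0.02
services = {
    "A": (10, 50),   # (fast_ms, slow_ms)
    "B": (5, 800),
    "C": (2, 20),
}

def total_latency_distribution(services, slow_probability):
    """Enumerate every fast/slow combination as (total_ms, probability)."""
    outcomes = []
    for choice in product((False, True), repeat=len(services)):
        total = 0
        probability = 1.0
        for (fast, slow), is_slow in zip(services.values(), choice):
            total += slow if is_slow else fast
            probability *= slow_probability if is_slow else 1 - slow_probability
        outcomes.append((total, probability))
    return sorted(outcomes)

def percentile(distribution, p):
    """Smallest total whose cumulative probability reaches p percent."""
    cumulative = 0.0
    for total, probability in distribution:
        cumulative += probability
        if cumulative >= p / 100:
            return total
    return distribution[-1][0]

dist = total_latency_distribution(services, SLOW_PROBABILITY)
overall_p50 = percentile(dist, 50)
overall_p99 = percentile(dist, 99)
sum_of_p99s = sum(slow for _, slow in services.values())

print(overall_p50, overall_p99, sum_of_p99s)  # 17 812 870
```

Under these assumptions the overall P99 is 812ms (Service B slow, the other two fast) rather than 870ms, since all three services being slow on the same request is far rarer than 1 in 100. Either way, Service B sets the tail.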

Levers You Control

Application Code Optimization: For P50, this is often about efficient algorithms and data structures. For P95/P99, it means identifying and optimizing those rare, expensive operations (e.g., full table scans, inefficient serialization, blocking I/O).
Infrastructure Scaling: If P99 is high due to resource contention (CPU, memory, network), scaling up instances or adding more nodes can help. This is especially relevant for distributed systems where bottlenecks can cascade.
Caching: Implementing effective caching strategies can dramatically reduce latency for frequently accessed data, thereby improving P50, P95, and P99.
Database Tuning: Indexing, query optimization, and connection pooling are critical for database-bound latency. A poorly indexed query can be the culprit for your P99 spikes.
Load Balancers and Network: Sometimes, network congestion or inefficient load balancing can introduce tail latency. Ensuring healthy network paths and properly configured load balancers is key.
Asynchronous Processing: For operations that don’t need immediate results, offloading them to background workers or message queues can prevent them from impacting synchronous request latency, thus improving all percentiles for the user-facing request.

The most surprising thing about latency percentiles is how a seemingly minor, targeted change can have a disproportionately large effect on the P99, often by fixing a single, obscure code path or configuration setting. For example, adding a simple index to a database table that previously required a full scan of a large dataset might reduce the latency of those specific slow queries from seconds to milliseconds. This single change could drop your P99 latency by hundreds or even thousands of milliseconds, making a drastic difference to the small share of users hitting that bottleneck, without necessarily changing the P50 at all.
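The index scenario can be sketched with synthetic numbers. Everything below is hypothetical: a workload of 1000 requests in which 20 hit the unindexed path, an assumed 2000ms full-scan latency, and an assumed 25ms latency once the index exists. The percentile helper uses the nearest-rank method.

```python
import math

def percentile_nearest_rank(samples, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical workload: 980 ordinary requests plus 20 that hit the slow path.
FAST_COUNT = 980
SLOW_COUNT = 20
SLOW_BEFORE_MS = 2000  # assumed latency of the full table scan
SLOW_AFTER_MS = 25     # assumed latency of the same query with an index

# Ordinary requests spread deterministically over 10..30ms.
fast = [10 + (i % 21) for i in range(FAST_COUNT)]

before = fast + [SLOW_BEFORE_MS] * SLOW_COUNT
after = fast + [SLOW_AFTER_MS] * SLOW_COUNT

p50_before = percentile_nearest_rank(before, 50)
p99_before = percentile_nearest_rank(before, 99)
p50_after = percentile_nearest_rank(after, 50)
p99_after = percentile_nearest_rank(after, 99)

print(p50_before, p99_before)  # 20 2000
print(p50_after, p99_after)    # 20 30
```

In this toy workload the median does not move at all, while the P99 falls from 2000ms to 30ms, because only the 2% of requests on the slow path were touched.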

The next step after analyzing latency percentiles is often correlating them with specific error rates or system events to pinpoint the root cause of those tail latencies.