Throughput vs. Latency: The Real Trade-offs

Throughput and latency are often presented as opposing forces, but the real story is how they’re intrinsically linked by the resources you have available.

Imagine a highway. Throughput is how many cars pass a point on that highway per hour. Latency is how long it takes a single car to get from point A to point B. On a single-lane road (limited resources), you can pack the lane with cars, which moves a lot of vehicles per hour but slows every one of them down (high throughput, high latency), or keep the road nearly empty so each car travels at full speed (low throughput, low latency). Add more lanes (more resources), and you can potentially have both high throughput and low latency.

Let’s see this in action with a simple web server. We’ll use wrk, a modern HTTP benchmarking tool, to simulate traffic and measure performance.

First, let’s set up a basic Python web server that just returns "Hello, World!":

from http.server import BaseHTTPRequestHandler, HTTPServer

class SimpleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-type', 'text/plain')
        self.end_headers()
        self.wfile.write(b"Hello, World!")

if __name__ == "__main__":
    server_address = ('', 8000)
    httpd = HTTPServer(server_address, SimpleHandler)
    print("Starting server on port 8000...")
    httpd.serve_forever()

Now, let’s benchmark it. We’ll simulate 100 concurrent connections making requests to this server.

Scenario 1: Focus on Throughput (Many small requests)

We’ll ask wrk to send a high volume of requests.

wrk -t4 -c100 -d10s http://localhost:8000/

-t4: Use 4 threads.
-c100: Keep 100 connections open.
-d10s: Run for 10 seconds.

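Here’s the kind of output you might see. The numbers are illustrative (expect far lower throughput from this single-threaded Python server on most machines), and the column layout is simplified: stock wrk prints Avg, Stdev, Max, and +/- Stdev, and reports the median as the 50% line only when run with --latency.

Running 10s test @ http://localhost:8000/
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   Median   Min
    Latency     1.23ms    0.87ms  15.50ms    0.98ms  0.30ms
    Req/Sec    12.50k    0.75k   13.50k   12.75k   12.25k
  500000 requests in 10.00s, 50.00MB read
Requests/sec: 50000.00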

In this output, notice:

Requests/sec: 50000.00: This is our throughput. We’re processing 50,000 requests per second.
Latency (Median): 0.98ms: The median time for a single request to complete is less than a millisecond. This seems great for latency, but it’s a consequence of the very simple workload.
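
To see where numbers like requests per second and median latency come from, it can help to time requests by hand. The sketch below is not how wrk works internally; it is a minimal single-client measurement, assuming a throwaway copy of the hello-world handler bound to an OS-assigned port (the names HelloHandler and measure are made up for this example). Because only one request is in flight at a time, throughput here is roughly the reciprocal of the average latency.

```python
import http.client
import statistics
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the hello-world handler above."""

    def do_GET(self):
        body = b"Hello, World!"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet during the measurement


def measure(n_requests=50):
    """Send n_requests one after another and report throughput and latency."""
    # Port 0 asks the OS for any free port, so this does not clash with port 8000.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()

    latencies = []
    started = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/")
        conn.getresponse().read()
        conn.close()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - started

    server.shutdown()
    server.server_close()
    return {
        "requests_per_sec": n_requests / elapsed,
        "median_latency_ms": statistics.median(latencies) * 1000,
        "max_latency_ms": max(latencies) * 1000,
    }


if __name__ == "__main__":
    print(measure())
```

A single sequential client like this understates what the server can do under load, which is why a tool such as wrk keeps many connections open at once.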

Scenario 2: Focus on Latency (Fewer, longer-running requests)

Now, let’s simulate a scenario where each request takes longer to process on the server-side. We can do this by adding a small delay in our Python handler.

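from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
import time

class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(0.1) # Simulate 100ms of work per request
        self.send_response(200)
        self.send_header('Content-type', 'text/plain')
        self.end_headers()
        self.wfile.write(b"Hello, World!")

if __name__ == "__main__":
    server_address = ('', 8000)
    # ThreadingHTTPServer (Python 3.7+) handles each connection in its own
    # thread, so the sleeps overlap. Plain HTTPServer serves one request at
    # a time and would cap out near 10 requests per second here.
    httpd = ThreadingHTTPServer(server_address, SlowHandler)
    print("Starting server on port 8000...")
    httpd.serve_forever()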

Now, let’s run wrk again, but this time with fewer connections to avoid overwhelming the server immediately.

wrk -t4 -c20 -d10s http://localhost:8000/

-t4: Use 4 threads.
-c20: Keep 20 connections open.
-d10s: Run for 10 seconds.

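Here’s a possible output (illustrative numbers):

Running 10s test @ http://localhost:8000/
  4 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   Median   Min
    Latency     110.50ms  20.10ms 250.00ms  108.75ms  100.40ms
    Req/Sec     45.25     1.83    47.54    44.79    43.88
  1810 requests in 10.00s, 181.00KB read
Requests/sec: 181.00

Notice the changes:

Requests/sec: 181.00: Our throughput has collapsed from 50,000 to about 180 requests per second. With 20 connections that each wait roughly 110ms for a response, the client can’t complete more than about 20 / 0.11 ≈ 180 requests per second, no matter how fast the server is otherwise. Even that ceiling assumes the server works on all 20 connections at once; a server that handles one request at a time tops out near 10 requests per second, and the other connections queue behind it.
Latency (Median): 108.75ms: The median latency has increased dramatically, from under 1ms to over 100ms. This is because each request now inherently takes at least 100ms due to time.sleep(0.1).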

The system’s resources (CPU, network I/O, memory, worker threads, open connections) are finite. When requests are short, each one releases whatever it holds almost immediately, so the same resources can serve many requests per second. When requests are long-running or resource-intensive, each one holds its connection, thread, or CPU core for longer. Fewer requests complete per second (lower throughput), and once every worker is busy, new requests wait in line before they are even started (higher latency).
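
One way to make this concrete is Little’s Law, which says that in a stable system the average number of requests in flight equals throughput multiplied by average latency. The sketch below just does that arithmetic; the function names are made up for this example, and the inputs are the connection counts and latencies from the two scenarios above. It is a handy sanity check on any benchmark output: reported throughput should not exceed the number of connections divided by the latency.

```python
def throughput_ceiling(connections: int, latency_s: float) -> float:
    """Upper bound on requests/sec when each connection has one request in flight."""
    return connections / latency_s


def in_flight(requests_per_sec: float, latency_s: float) -> float:
    """Average number of requests in the system (Little's Law: L = lambda * W)."""
    return requests_per_sec * latency_s


if __name__ == "__main__":
    # Scenario 2: 20 connections, about 110 ms per request.
    print(round(throughput_ceiling(20, 0.110)))  # 182
    # Scenario 1: 50,000 requests/sec at 1.23 ms average latency.
    print(in_flight(50_000, 0.00123))
```

Read the other way around, the second call says that sustaining 50,000 requests per second at 1.23ms each keeps roughly 60 requests in flight on average, which fits inside the 100 open connections.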

The key to understanding the trade-off is to look at your total available resources. With a well-provisioned server, a fast network connection, and efficient code, you can handle many concurrent requests at low latency. If the server is under-provisioned or the application code is inefficient, you’ll see low throughput, high latency, or both. What matters is how effectively those resources are used for the tasks at hand.
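
The "more lanes" effect can be sketched with a thread pool. This is a toy model rather than a real server: handle_request and run are hypothetical names, the work is simulated with time.sleep, and the latency it reports is service time only (measured from the moment a worker picks the request up, so time spent waiting for a free worker is not counted).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(service_time_s: float) -> float:
    """Pretend to do work, then return how long this one request took."""
    start = time.perf_counter()
    time.sleep(service_time_s)
    return time.perf_counter() - start


def run(workers: int, n_requests: int = 40, service_time_s: float = 0.02):
    """Push n_requests through `workers` threads; return (req/s, mean service time)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(handle_request, [service_time_s] * n_requests))
    elapsed = time.perf_counter() - start
    return n_requests / elapsed, sum(latencies) / len(latencies)


if __name__ == "__main__":
    for workers in (1, 4, 16):
        rps, latency = run(workers)
        print(f"{workers:>2} workers: {rps:7.1f} req/s, "
              f"{latency * 1000:5.1f} ms per request")
```

On most machines the per-request time should stay near 20ms while requests per second grows with the worker count, until something else (CPU, the scheduler, the number of requests) becomes the limit.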

When you see high latency in a system, it’s often because there’s contention for a shared resource. This could be CPU cycles, network bandwidth, disk I/O, or even a lock within your application. Each request, regardless of how small, needs a slice of these resources. If the total demand for these resources exceeds the supply, requests will queue up, and latency will increase. Throughput, in this context, is the rate at which these queued requests can be processed and completed, limited by the rate at which resources become available.
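
The queueing effect described above can be reproduced without any real server. The sketch below is a deliberately simple model, assuming one worker, a first-in-first-out queue, evenly spaced arrivals, and a fixed service time; simulate_queue is a made-up name and the specific intervals are arbitrary.

```python
def simulate_queue(arrival_interval_s, service_time_s, n_requests):
    """Single worker, FIFO queue. Returns (per-request latencies, completed req/s)."""
    latencies = []
    worker_free_at = 0.0
    for i in range(n_requests):
        arrival = i * arrival_interval_s
        start = max(arrival, worker_free_at)      # wait if the worker is still busy
        worker_free_at = start + service_time_s   # worker is busy until this time
        latencies.append(worker_free_at - arrival)
    throughput = n_requests / worker_free_at
    return latencies, throughput


if __name__ == "__main__":
    service = 0.010  # 10 ms of work per request, so capacity is 100 req/s
    for interval in (0.020, 0.010, 0.008):
        latencies, throughput = simulate_queue(interval, service, n_requests=1000)
        print(f"offered {1 / interval:6.1f} req/s -> completed {throughput:6.1f} req/s, "
              f"first latency {latencies[0] * 1000:6.1f} ms, "
              f"last latency {latencies[-1] * 1000:7.1f} ms")
```

While demand stays at or below capacity, every request takes just the service time. Once arrivals outpace the worker, completed throughput stays pinned at capacity and each new request waits longer than the one before it.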

The next challenge is optimizing for both when the resource constraints are real.