The token bucket algorithm isn’t just about limiting requests; it’s fundamentally about managing burstiness in your traffic, allowing for temporary spikes while enforcing a long-term average.
Let’s see it in action. Imagine a simple API endpoint that we want to limit. We’ll use a token bucket with a rate of 10 tokens per second and a capacity of 20 tokens.
{
"rate": "10r/s",
"capacity": 20
}
When a request comes in, the system checks if there’s at least one token in the bucket.
-
Scenario 1: Steady State: If requests arrive at exactly 10 per second, each request consumes one token. The bucket stays at its
capacity(20 tokens) because tokens are refilled at 10 per second, perfectly matching consumption. -
Scenario 2: Burst: A client makes 20 requests in rapid succession. The first 20 requests consume 20 tokens. The bucket is now empty.
-
Scenario 3: After Burst: For the next second, no new requests can be processed because the bucket is empty. However, during that second, 10 tokens are added back. If requests start arriving again at a moderate pace (say, 5 per second), they will be allowed as long as tokens are available. The bucket will gradually refill back to its
capacity. -
Scenario 4: Sustained High Rate: If a client tries to send 15 requests per second continuously, the first 20 will pass immediately. After that, only 10 requests per second will be allowed because that’s the rate at which tokens are replenished. The bucket will remain empty, and 5 requests per second will be dropped.
The core problem this solves is preventing a single user or a small group of users from overwhelming a service during sudden traffic spikes. Traditional fixed-rate limiting would simply drop requests after a certain threshold, leading to a jarring, all-or-nothing experience. The token bucket, by having a capacity, allows for a grace period. It decouples the instantaneous rate from the long-term average rate. The rate defines the refill rate (the steady-state throughput), and the capacity defines the maximum burst size that can be absorbed.
The rate parameter is typically expressed as Xr/Y where X is the number of tokens added, and Y is the time unit (e.g., 10r/s for 10 tokens per second, 100r/m for 100 tokens per minute). The capacity is the maximum number of tokens the bucket can hold, effectively setting the limit on how many requests can be served in a very short period, even if the average rate is lower.
When configuring a token bucket, the relationship between rate and capacity is crucial. A capacity much larger than the rate allows for significant bursts. For instance, a rate of 1r/s and a capacity of 60 would allow a burst of up to 60 requests in a single second, then cap subsequent requests at 1 per second. Conversely, a capacity equal to the rate (e.g., 10r/s with capacity: 10) offers very little burst allowance, behaving much closer to a fixed-rate limiter.
Most implementations of the token bucket algorithm use a floating-point number for the token refill, meaning that even if the bucket is empty, tokens are added continuously. This ensures that when a request arrives, the number of available tokens is precisely calculated based on the time elapsed since the last check, rather than relying on discrete, second-by-second refills. This continuous refilling is what allows for smooth handling of traffic that doesn’t perfectly align with second boundaries.
The next challenge is often handling rate limits across distributed systems where multiple instances of your service need to share the same token bucket state.