The most surprising thing about rate limiting is that it’s not about preventing traffic, but about managing it to ensure fairness and stability for everyone.

Imagine you’re running a popular API. Thousands of users are hitting your endpoints. Without rate limiting, a single user or a bot could overwhelm your servers, making the service unusable for everyone else. Rate limiting acts like a bouncer at a club, deciding who gets in and when, ensuring a smooth experience.

Let’s look at how it works with a common scenario: an API endpoint that should only be called 100 times per minute per user.

Token Bucket

The Token Bucket algorithm is like a bucket that gets filled with tokens at a constant rate. Each incoming request consumes one token. If the bucket is empty, the request is rejected.

How it works:

  • Bucket Size (Burst Limit): This is the maximum number of tokens the bucket can hold. It determines how many requests can be handled in a short burst.
  • Fill Rate: This is the rate at which tokens are added to the bucket, typically per second or per minute. This defines the sustained rate of requests allowed.

Example Configuration (Conceptual):

  • Bucket Size: 100 tokens
  • Fill Rate: 2 tokens per second

If a user makes 100 requests in quick succession, they’ll consume all 100 tokens. Then, for every 2 tokens that refill per second, they can make 2 more requests. If they try to make more than 100 requests in the first second, the excess will be dropped.

Code Snippet (Python, using py-ratelimit):

from ratelimit import limits, sleep_and_retry

@sleep_and_retry
@limits(calls=100, period=60) # 100 calls per 60 seconds
def api_call(user_id):
    print(f"Processing API call for user {user_id}")
    # Actual API logic here

# Simulate rapid calls
for i in range(150):
    api_call("user123")

In this example, calls=100 is the bucket size, and the effective fill rate is implicitly determined by period=60 (100 calls / 60 seconds). If the calls exceed 100 within 60 seconds, subsequent calls will be delayed or rejected depending on the library’s implementation (this one retries).

Leaky Bucket

The Leaky Bucket algorithm is conceptually similar but focuses on the output rate. Requests are added to a queue (the bucket). The bucket "leaks" requests out at a constant rate, processing them. If the queue is full, new requests are rejected.

How it works:

  • Bucket Size (Queue Size): The maximum number of requests that can be queued.
  • Leak Rate (Processing Rate): The constant rate at which requests are processed and sent out.

Example Configuration (Conceptual):

  • Bucket Size: 50 requests
  • Leak Rate: 10 requests per second

If 100 requests arrive in the first second, 50 might be queued. The next 50 will be rejected because the bucket is full. Then, the bucket will process 10 requests per second.

Code Snippet (Conceptual, as direct implementation is less common than Token Bucket for simple APIs):

This is often implemented as a queue with a background worker that processes items at a fixed rate.

import time
import queue

request_queue = queue.Queue(maxsize=50)
processing_rate = 10 # requests per second

def worker():
    while True:
        try:
            request = request_queue.get_nowait()
            print(f"Processing request: {request}")
            # Simulate processing time
            time.sleep(1.0 / processing_rate)
            request_queue.task_done()
        except queue.Empty:
            time.sleep(0.1) # Wait a bit if queue is empty

# Start worker in a separate thread (not shown for brevity)

def submit_request(request_data):
    try:
        request_queue.put_nowait(request_data)
        print(f"Queued request: {request_data}")
    except queue.Full:
        print(f"Request rejected: Queue full for {request_data}")

# Simulate rapid calls
for i in range(100):
    submit_request(f"request_{i}")

The key difference here is that bursts are absorbed by the queue up to its capacity, smoothing out the traffic after the queue.

Sliding Window

Sliding Window algorithms offer a more dynamic approach, tracking requests within a moving time window. This avoids the "bursty" issue of fixed windows where a surge of requests at the end of one window and the beginning of the next could exceed the limit.

How it works:

  • Window Size: The duration of the time window (e.g., 60 seconds).
  • Limit: The maximum number of requests allowed within that window.

Instead of counting all requests in a fixed minute, it counts requests within the last 60 seconds from the current moment.

Example Configuration (Conceptual):

  • Window Size: 60 seconds
  • Limit: 1000 requests

If it’s 10:00:30, the window is from 09:59:30 to 10:00:30. If it’s 10:01:00, the window is from 10:00:00 to 10:01:00.

Implementation Detail: Sliding Window Log

A common way to implement this is by keeping a log of timestamps for each request. When a new request comes in, you:

  1. Remove all timestamps older than the window size.
  2. Count the remaining timestamps.
  3. If the count is less than the limit, add the current request’s timestamp and allow it. Otherwise, reject it.

Code Snippet (Conceptual, Redis-based):

Using Redis’s sorted sets is a very efficient way to implement this.

import redis
import time

r = redis.Redis(decode_responses=True)
WINDOW_SECONDS = 60
MAX_CALLS = 100

def sliding_window_limit(user_key):
    current_time = time.time()
    window_start = current_time - WINDOW_SECONDS

    # Remove old timestamps
    r.zremrangebyscore(user_key, 0, window_start)

    # Count current requests in window
    current_calls = r.zcard(user_key)

    if current_calls < MAX_CALLS:
        # Add current request timestamp
        r.zadd(user_key, {str(current_time): current_time})
        return True
    else:
        return False

# Simulate calls
user_id = "user456"
for i in range(120):
    if sliding_window_limit(user_id):
        print(f"Request {i} allowed for {user_id}")
    else:
        print(f"Request {i} rejected for {user_id}")
    time.sleep(0.4) # Simulate requests arriving every 0.4 seconds

This approach ensures that the rate is consistently maintained over the defined window.

The real magic of rate limiting lies in its ability to adapt. While Token Bucket is great for allowing bursts and then enforcing a steady pace, and Leaky Bucket smooths output, Sliding Window offers a more precise measure of recent activity. Understanding how each algorithm manages the flow of requests is key to building resilient and fair systems.

The next step in rate limiting often involves distributed systems, where ensuring consistency across multiple servers becomes a significant challenge.

Want structured learning?

Take the full Rate-limiting course →