Concurrency limiting is how you prevent a single service from being overwhelmed by too many requests at once.

Let’s say you have a service that handles user profile lookups. It’s fast, but not infinitely so. If you suddenly get 10,000 requests for user profiles simultaneously, and your service can only handle 100 at a time, things are going to go sideways. Your service will start dropping requests, become unresponsive, and potentially crash. Concurrency limiting is the guardrail that stops this from happening.

Imagine this scenario:

You have a web application backed by a microservice responsible for fetching user data.

# User request comes in for profile ID 123
GET /users/123 HTTP/1.1
Host: user-service.example.com

The user-service is designed to handle at most 50 requests concurrently. It’s currently processing 49 requests.

// user-service internal state (simplified)
let activeRequests = 49;
const maxConcurrency = 50;

A new request arrives:

# Another user request for profile ID 456
GET /users/456 HTTP/1.1
Host: user-service.example.com

The user-service checks its current load: activeRequests (49) is less than maxConcurrency (50). It accepts the request. activeRequests becomes 50.

// user-service internal state
activeRequests = 50;
// maxConcurrency is a constant and is still 50

Now, a third request arrives just milliseconds later:

# Yet another user request for profile ID 789
GET /users/789 HTTP/1.1
Host: user-service.example.com

The user-service checks its load: activeRequests (50) is equal to maxConcurrency (50). It rejects this request.

// user-service internal state
activeRequests = 50;
// maxConcurrency is a constant and is still 50

// Response to the third request
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 5

{
  "error": "Too many concurrent requests. Please try again later."
}

The 429 Too Many Requests response is the key. It tells the client (or an intermediate proxy) to back off and retry later. The Retry-After header hints at how long to wait, here 5 seconds.
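The walk-through above can be sketched as a small in-process limiter. This is a minimal illustration, not a production implementation: the names ConcurrencyLimiter and handleWithLimit are hypothetical, and the response is a plain object standing in for whatever your HTTP framework uses.

```javascript
// Minimal in-process concurrency limiter (illustrative sketch).
class ConcurrencyLimiter {
  constructor(maxConcurrency) {
    this.maxConcurrency = maxConcurrency;
    this.activeRequests = 0;
  }

  // Reserves a slot and returns true if capacity is available, else false.
  tryAcquire() {
    if (this.activeRequests >= this.maxConcurrency) return false;
    this.activeRequests += 1;
    return true;
  }

  // Must be called exactly once for every successful tryAcquire().
  release() {
    if (this.activeRequests > 0) this.activeRequests -= 1;
  }
}

// Hypothetical wrapper: runs `work` under the limiter, or returns a
// 429-shaped result when the service is already at capacity.
async function handleWithLimit(limiter, work) {
  if (!limiter.tryAcquire()) {
    return {
      status: 429,
      headers: { "Retry-After": "5" },
      body: { error: "Too many concurrent requests. Please try again later." },
    };
  }
  try {
    return { status: 200, headers: {}, body: await work() };
  } finally {
    limiter.release(); // free the slot even if work() throws
  }
}
```

Releasing the slot in a finally block matters: if a failed request never gives its slot back, the limiter slowly leaks capacity until it rejects everything.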

The core problem concurrency limiting solves is cascading failures. When one service gets overloaded, it can slow down or fail to respond to the services that depend on it. Those dependent services might then also become overloaded, and so on, leading to a widespread outage. By capping concurrency, you ensure that each service can operate within its defined capacity, protecting itself and the services that call it.

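There are several ways to implement limiting, each with its own trade-offs. Note that the algorithms below cap the request rate (requests per unit of time) rather than the number of in-flight requests; they are commonly used alongside a concurrency cap like the one above: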

  1. Token Bucket Algorithm: Imagine a bucket that holds tokens. Tokens are added to the bucket at a fixed rate. To process a request, you must take a token from the bucket. If the bucket is empty, the request is rejected or queued. This is great for smoothing out bursts of traffic.

    • Configuration Example (conceptual):

      • rate: 100 tokens per second (requests per second)
      • burst: 200 tokens (maximum number of requests that can be processed in a short burst)
    • Why it works: It allows short bursts of traffic up to the burst limit but enforces the configured rate as the average over time, preventing sustained overload.

  2. Leaky Bucket Algorithm: Requests are added to a queue (the bucket). The bucket "leaks" requests at a constant rate, processing them. If the bucket is full, new requests are rejected. This is good for ensuring a steady outflow of requests, regardless of inflow.

    • Configuration Example (conceptual):

      • capacity: 50 (maximum number of requests in the bucket/queue)
      • leakRate: 100 requests per second (how fast requests are processed)
    • Why it works: It guarantees that requests are processed at a steady pace, preventing the system from being flooded even if the input rate is highly variable.

  3. Fixed Window Counter: This method counts requests within a fixed time window (e.g., 60 seconds). If the count exceeds a threshold within that window, subsequent requests are rejected until the window resets.

    • Configuration Example (conceptual):

      • windowSize: 60 seconds
      • maxRequests: 3000
    • Why it works: Simple to implement, it limits requests within defined intervals. However, it can allow up to double the configured rate across a window boundary (e.g., 3000 requests in the last second of one window and another 3000 in the first second of the next).

  4. Sliding Window Log: This is a more sophisticated version of the fixed window. It keeps a log of request timestamps. To check if a request can proceed, it counts how many requests have occurred within the last windowSize seconds.

    • Configuration Example (conceptual):

      • windowSize: 60 seconds
      • maxRequests: 3000
    • Why it works: It avoids the boundary issue of the fixed window by considering a rolling window of timestamps, providing a more accurate representation of current load, at the cost of storing one timestamp per request in the window.

  5. Rate Limiter Libraries/Proxies: Many frameworks and API gateways offer built-in rate limiting. Examples include:

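    • Envoy Proxy: a local rate limit filter, plus a global rate limit filter that consults an external rate limit service.

    • Nginx: limit_req_zone and limit_req for request rate; limit_conn_zone and limit_conn for concurrent connections.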

    • Spring Cloud Gateway: RequestRateLimiter filter.

    • Custom Code: Implementing one of the algorithms above directly in your service.

    • Configuration Example (Nginx):

      http {
          limit_req_zone $binary_remote_addr zone=mylimit:10m rate=10r/s;
      
          server {
              location / {
                  limit_req zone=mylimit burst=20 nodelay;
                  proxy_pass http://backend_server;
              }
          }
      }
      

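      Here, $binary_remote_addr is the key (the client IP in binary form), zone=mylimit:10m defines a shared memory zone named mylimit of 10MB, and rate=10r/s sets the sustained rate to 10 requests per second per key. burst=20 lets up to 20 requests above that rate be accepted rather than rejected, and nodelay forwards those excess requests immediately instead of spacing them out to match the rate. Requests beyond the burst are rejected (with a 503 by default; limit_req_status can change this to 429).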

    • Why it works: Offloads the complexity of rate limiting to a dedicated component, often a proxy that sits in front of your services, allowing your services to focus on business logic.
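To make the two bucket algorithms from the list above concrete, here is a minimal sketch of each. The class names and the tryAcquire(nowMs) method are hypothetical. Time is passed in explicitly (in milliseconds) so the behaviour is easy to check; a real service would pass Date.now(). The leaky bucket here is the "meter" variant: it only tracks how full the queue would be and does not actually hold or dispatch requests.

```javascript
// Token bucket: refills continuously at `ratePerSec`, holds at most `burst` tokens.
class TokenBucket {
  constructor(ratePerSec, burst, startMs = 0) {
    this.ratePerSec = ratePerSec;
    this.burst = burst;
    this.tokens = burst; // start full so an initial burst is allowed
    this.lastMs = startMs;
  }

  // Takes one token if available; returns false when the bucket is empty.
  tryAcquire(nowMs) {
    const elapsedMs = Math.max(0, nowMs - this.lastMs);
    const refill = (elapsedMs * this.ratePerSec) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + refill);
    this.lastMs = Math.max(this.lastMs, nowMs);
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// Leaky bucket (meter variant): `level` models queue depth, which drains at
// `leakRatePerSec`. A request is rejected if it would overflow `capacity`.
class LeakyBucket {
  constructor(capacity, leakRatePerSec, startMs = 0) {
    this.capacity = capacity;
    this.leakRatePerSec = leakRatePerSec;
    this.level = 0; // start empty
    this.lastMs = startMs;
  }

  tryAcquire(nowMs) {
    const elapsedMs = Math.max(0, nowMs - this.lastMs);
    const leaked = (elapsedMs * this.leakRatePerSec) / 1000;
    this.level = Math.max(0, this.level - leaked);
    this.lastMs = Math.max(this.lastMs, nowMs);
    if (this.level + 1 > this.capacity) return false;
    this.level += 1;
    return true;
  }
}
```

With the conceptual settings from the list (rate 100 and burst 200 for the token bucket, capacity 50 and leakRate 100 for the leaky bucket), the token bucket accepts 200 back-to-back requests and then roughly 100 per second, while the leaky bucket accepts 50 and then frees room at 100 per second.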

When implementing, consider where to apply the limit. Applying it at the edge (API Gateway) protects all downstream services. Applying it within a specific service protects that service and its immediate dependencies. Often, a combination is used.
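If you apply a limit inside the service itself, the two window-based approaches from the list above are short enough to write by hand. The sketch below is illustrative only: FixedWindowCounter, SlidingWindowLog, and countAccepted are hypothetical names, and time is passed in as milliseconds rather than read from a clock. It reuses the conceptual settings of a 60-second window and 3000 requests to show the boundary effect.

```javascript
// Fixed window: one counter per aligned window of `windowMs` milliseconds.
class FixedWindowCounter {
  constructor(windowMs, maxRequests) {
    this.windowMs = windowMs;
    this.maxRequests = maxRequests;
    this.windowStartMs = 0;
    this.count = 0;
  }

  tryAcquire(nowMs) {
    const startMs = Math.floor(nowMs / this.windowMs) * this.windowMs;
    if (startMs !== this.windowStartMs) {
      // A new window has begun, so the counter resets.
      this.windowStartMs = startMs;
      this.count = 0;
    }
    if (this.count >= this.maxRequests) return false;
    this.count += 1;
    return true;
  }
}

// Sliding window log: keeps the timestamp of every accepted request and
// counts only those inside the interval (nowMs - windowMs, nowMs].
class SlidingWindowLog {
  constructor(windowMs, maxRequests) {
    this.windowMs = windowMs;
    this.maxRequests = maxRequests;
    this.timestamps = [];
  }

  tryAcquire(nowMs) {
    const cutoffMs = nowMs - this.windowMs;
    // Drop timestamps that have slid out of the window.
    while (this.timestamps.length > 0 && this.timestamps[0] <= cutoffMs) {
      this.timestamps.shift();
    }
    if (this.timestamps.length >= this.maxRequests) return false;
    this.timestamps.push(nowMs);
    return true;
  }
}

// Counts how many of `n` back-to-back attempts at time `nowMs` are accepted.
function countAccepted(limiter, n, nowMs) {
  let accepted = 0;
  for (let i = 0; i < n; i++) {
    if (limiter.tryAcquire(nowMs)) accepted += 1;
  }
  return accepted;
}

const fixed = new FixedWindowCounter(60000, 3000);
const sliding = new SlidingWindowLog(60000, 3000);

// 3000 attempts at second 59, then 3000 more at second 60.
const fixedTotal = countAccepted(fixed, 3000, 59000) + countAccepted(fixed, 3000, 60000);
const slidingTotal = countAccepted(sliding, 3000, 59000) + countAccepted(sliding, 3000, 60000);

console.log(fixedTotal, slidingTotal); // prints: 6000 3000
```

The fixed window lets through 6000 requests within about one second because the counter resets at the boundary, while the sliding log still sees the first 3000 inside its rolling 60 seconds and rejects the rest.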

The most surprising thing about concurrency limiting is how often it’s implemented after a major outage has already occurred, rather than proactively. It’s not just a "nice-to-have" for high-traffic sites; it’s a fundamental resilience pattern for any distributed system.

One aspect that is often overlooked is where the limiting decision is made relative to your downstream calls. If your service rejects excess requests before it attempts any downstream call, it protects both itself and its dependencies. But if it accepts a request and the downstream service then rate-limits it, the accepted request still fails or stalls, and retries can pile up if requests arrive faster than the downstream dependency allows. This is why it pays to understand the full call chain and either apply limits at multiple levels or make sure your service can buffer and retry gracefully.
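As one illustration of retrying gracefully when a downstream service answers with 429, the caller can honour the Retry-After header and fall back to exponential backoff when it is missing. This is a sketch under stated assumptions: retryDelayMs is a hypothetical helper, the 100 ms base and 30 s cap are arbitrary defaults, and jitter (usually added in practice to avoid synchronized retries) is left out to keep the result deterministic.

```javascript
// Returns how long to wait (in ms) before retry number `attempt` (0-based).
// `retryAfter` is the raw Retry-After header value, if the response had one.
function retryDelayMs(retryAfter, attempt, nowMs = Date.now(), baseMs = 100, capMs = 30000) {
  if (typeof retryAfter === "string" && retryAfter.trim() !== "") {
    const value = retryAfter.trim();

    // Form 1: a non-negative integer number of seconds, e.g. "5".
    if (/^\d+$/.test(value)) {
      return Math.min(Number(value) * 1000, capMs);
    }

    // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    const dateMs = Date.parse(value);
    if (!Number.isNaN(dateMs)) {
      return Math.min(Math.max(0, dateMs - nowMs), capMs);
    }
  }

  // No usable hint from the server: exponential backoff, capped.
  return Math.min(baseMs * 2 ** attempt, capMs);
}

console.log(retryDelayMs("5", 0));          // prints: 5000
console.log(retryDelayMs(undefined, 3, 0)); // prints: 800
```

Capping the server-provided delay is a design choice made for this sketch; a caller that trusts its dependency may prefer to honour Retry-After as given.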

Once you’ve successfully implemented concurrency limiting, the next challenge you’ll likely face is managing distributed rate limiting across multiple instances of your service.
