Rate Limiting in Production: Config, Testing, Rollout (2026)

Rate limiting is often thought of as a simple gatekeeper, but its real power lies in its ability to shape traffic flow to prevent cascading failures, not just block bad actors.

Let’s see it in action. Imagine a critical service, user-service, that has a downstream dependency on profile-service. If profile-service gets overwhelmed, user-service will start timing out, and then any services calling user-service will also start failing. This is a common pattern.

Here’s a simplified envoy.yaml for an Envoy proxy that sits in front of profile-service and enforces rate limits on its behalf:

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 8080
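    filter_chains:
    - filters:
      # HTTP filters run inside the HTTP connection manager network filter.
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: profile_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: profile_service
                  # Descriptors sent to the rate limit service for this route.
                  rate_limits:
                  - actions:
                    - remote_address: {}
          http_filters:
          # ... Other filters
          - name: envoy.filters.http.ratelimit
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
              domain: profile-service
              timeout: 0.5s
              rate_limit_service:
                transport_api_version: V3
                grpc_service:
                  envoy_grpc:
                    cluster_name: rate_limiter_service
          # The router must be the last HTTP filter in the chain.
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router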
  clusters:
  - name: profile_service
    connect_timeout: 0.25s
    type: LOGICAL_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: profile_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: profile-service.internal
                port_value: 9000
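  - name: rate_limiter_service
    connect_timeout: 1s
    type: STATIC
    # The rate limit service speaks gRPC, so this cluster must use HTTP/2.
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}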
    load_assignment:
      cluster_name: rate_limiter_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 8081 # Assuming a local rate limiter service

This Envoy configuration defines a listener that forwards requests to profile-service. Crucially, it includes the HTTP rate limit filter (envoy.filters.http.ratelimit), which is configured with a domain (profile-service) and points to the rate_limiter_service cluster. When a request arrives, Envoy first calls the rate limit service over gRPC to check whether the request is allowed. If the service answers that the request is over its limit, Envoy immediately returns HTTP 429 Too Many Requests to the client, and the request never reaches profile-service.

The domain in the rate limiting filter is key. It’s a logical identifier that the external rate limiting service uses to group and apply different policies. So, profile-service requests might have a different limit than user-service requests, even if they use the same underlying rate limiter service. The rate limit service itself is a separate, dedicated application (such as the open-source envoyproxy/ratelimit reference implementation or a custom one) that maintains counters and applies logic based on its own configuration. The timeout on the rate limit call is critical: if the rate limiter is slow or unreachable, Envoy by default fails open and lets the request through to the upstream, so the limiter does not become a bottleneck itself. Setting failure_mode_deny to true on the filter reverses this and rejects requests when the limiter cannot be reached.
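To make the fail-open behavior concrete, here is a minimal Python sketch. It is an illustration, not Envoy's implementation: the function name check_with_fail_open and the check callable (standing in for the gRPC call to the rate limit service) are hypothetical, and only the failure_mode_deny parameter mirrors the Envoy filter option of the same name.

```python
import concurrent.futures


def check_with_fail_open(check, timeout_s, failure_mode_deny=False):
    """Return True if the request may proceed to the upstream.

    `check` is a hypothetical zero-argument callable that asks the rate
    limit service for a decision and returns True (allowed) or False
    (over limit).
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(check)
    try:
        # Wait at most `timeout_s` seconds for the limiter's decision.
        return future.result(timeout=timeout_s)
    except Exception:
        # Timeout or limiter error: allow by default (fail open),
        # reject only if failure_mode_deny is set.
        return not failure_mode_deny
    finally:
        # Do not block the request path on a slow limiter call.
        pool.shutdown(wait=False)
```

With the default settings, a limiter that takes longer than the timeout results in the request being allowed; passing failure_mode_deny=True turns the same situation into a rejection.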

The mental model for rate limiting is typically a "token bucket" or "leaky bucket" algorithm. In a token bucket, a bucket has a fixed capacity. Tokens are added to the bucket at a steady rate. When a request arrives, it consumes a token. If the bucket is empty, the request is rejected. This allows for bursts of traffic up to the bucket’s capacity, while maintaining an average rate. The rate_limiter_service is where these buckets are managed. Envoy simply acts as the enforcer, querying the service for each request.
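The token bucket described above can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions: the class name TokenBucket and its parameters are hypothetical, and a real rate limit service would keep this state in shared storage and may use a different algorithm entirely.

```python
import time


class TokenBucket:
    """A minimal token bucket: fixed capacity, steady refill rate."""

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_per_sec = float(refill_per_sec)
        self.clock = clock  # injectable so the bucket can be tested
        self.tokens = float(capacity)  # start full to permit an initial burst
        self.last = clock()

    def allow(self):
        """Consume one token if available; return whether the request passes."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Add tokens for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # bucket empty: reject
```

A bucket created with capacity 3 and a refill rate of 1 token per second accepts a burst of three requests, rejects the fourth, and accepts two more after two seconds have passed.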

The actual rate limiting logic is configured outside of Envoy, in the rate_limiter_service. This service can be configured with rules like: "Allow 100 requests per second for domain profile-service from any client IP" or "Allow 10 requests per minute for domain user-service from a specific API key." The configuration for this service is often managed separately, perhaps through a configuration file or an API.
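As a rough illustration of how such rules might be represented and evaluated, the Python sketch below encodes the two example rules and counts requests in fixed time windows. Everything here is an assumption made for illustration: the Rule and FixedWindowLimiter names, the descriptor keys remote_address and api_key, and the placeholder key value are hypothetical and do not reflect any particular rate limit service's configuration format.

```python
from dataclasses import dataclass
from typing import Optional

UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600}


@dataclass(frozen=True)
class Rule:
    domain: str
    key: str
    value: Optional[str]  # None matches any value for this key
    requests_per_unit: int
    unit: str


# The two example rules from the text; the API key value is a placeholder.
RULES = [
    Rule("profile-service", "remote_address", None, 100, "second"),
    Rule("user-service", "api_key", "hypothetical-key-123", 10, "minute"),
]


class FixedWindowLimiter:
    def __init__(self, rules):
        self.rules = rules
        self.counters = {}  # old windows are never evicted in this sketch

    def find_rule(self, domain, key, value):
        """Prefer a rule for the exact value, else a wildcard rule."""
        wildcard = None
        for rule in self.rules:
            if rule.domain != domain or rule.key != key:
                continue
            if rule.value == value:
                return rule
            if rule.value is None:
                wildcard = rule
        return wildcard

    def allow(self, domain, key, value, now):
        rule = self.find_rule(domain, key, value)
        if rule is None:
            return True  # no rule configured: not limited
        window = int(now // UNIT_SECONDS[rule.unit])
        counter_key = (domain, key, value, window)
        count = self.counters.get(counter_key, 0) + 1
        self.counters[counter_key] = count
        return count <= rule.requests_per_unit
```

Because the wildcard rule keeps a separate counter per descriptor value, each client IP gets its own budget of 100 requests per second rather than sharing one global budget.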

Perhaps the most surprising thing about production rate limiting is that its primary purpose isn’t to stop attackers, but to prevent internal cascading failures by degrading gracefully under load. By limiting the rate at which one service can call another, you keep the downstream service from being overwhelmed, which in turn keeps the upstream service from timing out, and so on up the call chain. It’s a proactive defense against system-wide instability.

The rate limiting service can make decisions based on a variety of attributes that Envoy extracts from the request. Envoy attaches these to the rate limit request as "descriptors", which are generated by rate_limits actions configured on a route or virtual host. For example, you could have a descriptor for the HTTP method, another for the client IP address, and another for a specific header value such as X-API-Key. The rate limiter service then uses these descriptors to look up and apply the correct rate limit policy. This allows for very granular control, such as setting different limits for GET vs. POST requests, or for authenticated users versus anonymous ones.
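The descriptor idea can be sketched in Python as follows. This is a simplified model, not Envoy's or any limiter's actual data structures: the function names, the POLICIES table and its numbers, and the choice to treat the presence of an X-API-Key header as "authenticated" are all assumptions made for illustration.

```python
def build_descriptors(method, remote_ip, headers):
    """Turn request attributes into (key, value) descriptor pairs."""
    lowered = {name.lower(): value for name, value in headers.items()}
    api_key = lowered.get("x-api-key")
    descriptors = [
        ("method", method.upper()),
        ("remote_address", remote_ip),
        # Assumption: a request carrying an API key counts as authenticated.
        ("authenticated", "true" if api_key else "false"),
    ]
    if api_key:
        descriptors.append(("api_key", api_key))
    return descriptors


# Hypothetical per-descriptor limits, in requests per second.
POLICIES = {
    ("method", "GET"): 200,
    ("method", "POST"): 20,
    ("authenticated", "true"): 100,
    ("authenticated", "false"): 10,
}


def applicable_limits(descriptors, policies):
    """Return the limits that apply to this request, keyed by descriptor."""
    return {d: policies[d] for d in descriptors if d in policies}
```

For an authenticated POST, applicable_limits returns both the POST limit and the authenticated-user limit; in this sketch a request would be expected to stay within every limit that applies to it.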

When rolling out rate limiting, start with very generous limits in a staging environment that mirrors production traffic as closely as possible. Monitor the rate limiter’s rejection counts and the downstream service’s latency and error rates. Gradually tighten the limits in production, watching for any increase in 429 responses or degradation in service performance. A common mistake is to set limits too aggressively initially, causing legitimate traffic to be blocked and leading to user complaints or application errors.
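One way to choose a starting limit before enforcing anything is to replay observed traffic against candidate limits and estimate how many requests each would have rejected. The Python sketch below assumes you already have per-second request counts from your metrics; the function names and the sample numbers are hypothetical, and the model ignores bursts within a second.

```python
def rejection_rate(per_second_counts, limit_per_second):
    """Fraction of requests a simple per-second limit would have rejected."""
    total = sum(per_second_counts)
    if total == 0:
        return 0.0
    rejected = sum(max(0, count - limit_per_second) for count in per_second_counts)
    return rejected / total


def tightest_safe_limit(per_second_counts, candidates, max_rejection_rate):
    """Smallest candidate limit whose estimated rejection rate is acceptable."""
    for limit in sorted(candidates):
        if rejection_rate(per_second_counts, limit) <= max_rejection_rate:
            return limit
    return None  # no candidate is loose enough


# Hypothetical sample: requests observed in five consecutive seconds.
observed = [80, 90, 120, 100, 110]
candidates = [100, 110, 120, 200]
start_limit = tightest_safe_limit(observed, candidates, max_rejection_rate=0.0)
```

Starting from a limit that would have rejected nothing in the sample, and only then tightening toward a small tolerated rejection rate, follows the gradual approach described above.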

The next concept you’ll likely grapple with is how to handle the rate limiting service itself becoming a bottleneck or a single point of failure.