Rate Limiting in Microservices: Sidecar and API Gateway (2026)

Rate limiting is crucial for microservices to prevent overload and ensure fair usage, but the common approach of implementing it at the API Gateway is fundamentally flawed.

Here’s a typical scenario:

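# Example Nginx configuration for rate limiting
http {
    # Key on client IP; 10 MB of shared state; 5 requests/second per client
    limit_req_zone $binary_remote_addr zone=mylimit:10m rate=5r/s;
    server {
        location / {
            # Permit bursts of up to 20 excess requests, served without delay
            limit_req zone=mylimit burst=20 nodelay;
            limit_req_status 429;  # Nginx rejects with 503 by default
            # backend_service must be defined in an upstream block (not shown)
            proxy_pass http://backend_service;
        }
    }
}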

In this setup, the gateway enforces one policy (5 requests per second per client IP, with a burst of 20) on every incoming request, whichever backend it targets. When traffic spikes, the gateway itself can become the bottleneck and drop requests before they reach the individual microservices. In effect you are rate-limiting the gateway rather than the backend services, and the gateway becomes a single point of failure.

The core problem is that the API Gateway, acting as a centralized point, doesn’t have the granular context of which microservice is being hit and its specific capacity. It treats all requests equally, regardless of their destination. This leads to situations where a burst of traffic targeting a low-capacity service can overwhelm the gateway, impacting requests for high-capacity services as well.
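To make the shared-budget problem concrete, here is a minimal sketch in Python. It is a toy model, not how any particular gateway works: the service names, budgets, and the admit helper are all hypothetical, and time is ignored so that each budget is simply a count of requests allowed in one window.

```python
from collections import Counter


def admit(requests, budgets, key):
    """Admit requests until the budget for key(request) runs out.

    requests: destination service names, in arrival order.
    budgets:  dict mapping a budget key to the number of requests allowed.
    key:      function mapping a destination to its budget key.
    """
    remaining = dict(budgets)
    admitted = Counter()
    for dest in requests:
        k = key(dest)
        if remaining.get(k, 0) > 0:
            remaining[k] -= 1
            admitted[dest] += 1
    return admitted


# A burst of 100 requests for a low-capacity service arrives just before
# 10 requests for a high-capacity one (hypothetical names and numbers).
traffic = ["reports"] * 100 + ["catalog"] * 10

# One shared budget at the gateway: the burst consumes all of it.
shared = admit(traffic, {"gateway": 50}, key=lambda dest: "gateway")

# One budget per service: the burst only exhausts its own allowance.
per_service = admit(traffic, {"reports": 5, "catalog": 45}, key=lambda dest: dest)

print(dict(shared))       # {'reports': 50}
print(dict(per_service))  # {'reports': 5, 'catalog': 10}
```

With the shared budget, none of the catalog requests get through, even though catalog had spare capacity.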

The Sidecar Pattern: A Better Approach

A more robust solution is to implement rate limiting closer to the services themselves, using a sidecar proxy pattern. In this model, each microservice has its own dedicated proxy (the sidecar) that handles concerns like rate limiting, logging, and service discovery.

Consider a service, user-service, with its own sidecar proxy. Sidecars are often implemented with Envoy or Nginx, and the sidecar runs alongside the user-service container within the same pod or deployment.

Here’s how the traffic flow changes:

1. Client Request: A client sends a request to the user-service.
2. Sidecar Intercepts: The request first hits the user-service’s sidecar proxy.
3. Rate Limiting: The sidecar checks its configured rate limits for user-service.
   - If the limit is exceeded, the sidecar rejects the request immediately (e.g., with a 429 Too Many Requests HTTP status).
   - If the limit is not exceeded, the sidecar forwards the request to the actual user-service container.
4. Service Processing: The user-service processes the request.
5. Response: The user-service sends the response back to the sidecar, which then returns it to the client.

This distributes the rate-limiting logic, making it scalable and resilient. Each service manages its own rate limits, preventing a single point of failure.
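The flow above can be sketched in a few lines of Python. This is an illustration of the decision a sidecar makes, assuming a token-bucket limiter; TokenBucket, handle, and the stand-in user_service function are hypothetical names, and a real proxy such as Envoy implements this internally.

```python
import time


class TokenBucket:
    """Holds up to `capacity` tokens, refilled continuously at `rate` per second."""

    def __init__(self, capacity, rate, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        # Refill based on elapsed time, then try to spend one token.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def user_service(request):
    # Stand-in for the real user-service container.
    return 200, f"hello {request}"


def handle(request, bucket, service):
    """Sidecar logic: reject with 429 when over the limit, otherwise forward."""
    if not bucket.allow():
        return 429, "Too Many Requests"
    return service(request)


# A fake clock keeps the example deterministic.
fake_now = [0.0]
bucket = TokenBucket(capacity=2, rate=1, clock=lambda: fake_now[0])

statuses = [handle("alice", bucket, user_service)[0] for _ in range(3)]
print(statuses)  # [200, 200, 429]

fake_now[0] += 1.0  # one second later, one token has been refilled
after_refill = handle("alice", bucket, user_service)[0]
print(after_refill)  # 200
```

The rejected request never reaches user_service, which is the point of placing the check in the sidecar.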

Configuration Example (Envoy Proxy)

Let’s look at a simplified Envoy configuration for a sidecar proxy managing rate limits for a product-service.

# envoy.yaml (simplified for demonstration)
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 10000 # Sidecar listens on this port
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: product_service_cluster
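          http_filters:
          - name: envoy.filters.http.local_ratelimit
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
              stat_prefix: rl
              # Token bucket: holds up to 10 tokens, refilled with 10 every second
              token_bucket:
                max_tokens: 10
                tokens_per_fill: 10
                fill_interval: 1s # Allow 10 requests per second
              # Enable and enforce the limit for 100% of requests
              filter_enabled:
                runtime_key: local_rate_limit_enabled
                default_value: {numerator: 100, denominator: HUNDRED}
              filter_enforced:
                runtime_key: local_rate_limit_enforced
                default_value: {numerator: 100, denominator: HUNDRED}
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router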

  clusters:
  - name: product_service_cluster
    connect_timeout: 0.25s
    type: LOGICAL_DNS
    # The actual product-service runs on a different port within the same pod
    # or accessible via localhost. Here we assume localhost.
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: product_service_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1 # Target the actual service
                port_value: 8080  # Port the product-service listens on

In this Envoy configuration:

- The listener_0 listener on 0.0.0.0:10000 is where incoming requests to the product-service sidecar arrive.
- The envoy.filters.http.local_ratelimit filter is configured to allow 10 requests per second; requests over the limit are rejected with a 429 response.
- When a request passes the rate limit, the envoy.filters.http.router filter forwards it to the product_service_cluster, which points to 127.0.0.1:8080, the actual product-service running locally.

This setup means the rate limiting is happening right next to the product-service, isolated from the traffic of other services. If product-service gets overwhelmed, it only affects requests destined for product-service.
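One way to check that the limit behaves as described is to send a burst of requests at the sidecar and count the status codes that come back. The helper below is a minimal sketch using only the Python standard library; burst is a hypothetical name, and the URL you pass depends on where your sidecar actually listens.

```python
import urllib.error
import urllib.request
from collections import Counter


def burst(url, n):
    """Send n sequential GET requests to url and count the response status codes."""
    counts = Counter()
    for _ in range(n):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                counts[resp.status] += 1
        except urllib.error.HTTPError as err:
            # urllib raises for 4xx/5xx responses; the status is on the exception.
            counts[err.code] += 1
    return counts
```

Against the configuration above, a call such as burst("http://127.0.0.1:10000/", 30) should report a mix of 200 and 429 responses if the limit is being enforced. The exact split depends on how quickly the requests are sent.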

The Crucial Insight: Distributed Enforcement

The fundamental shift here is from centralized, shared enforcement at the gateway to distributed, per-service enforcement via sidecars. The API Gateway might still be useful for cross-cutting concerns like authentication, SSL termination, or coarse-grained API routing, but for granular rate limiting, pushing it to the edge of each service is the only scalable and robust pattern. This ensures that rate limits are applied based on the service’s capacity and load, not the gateway’s.

The true power of the sidecar pattern for rate limiting lies in its ability to scale independently with your services. As you add more instances of a microservice, you also add more instances of its rate-limiting sidecar, distributing the load and maintaining consistent performance. This avoids the cascading failures that plague centralized gateway-based rate limiting.
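One consequence of scaling sidecars with the service deserves a caveat: a local limiter counts only the traffic its own instance sees, so the service-wide ceiling is roughly the per-instance limit multiplied by the replica count. The arithmetic below is a small sketch, assuming traffic is spread evenly across replicas; the function names are illustrative and not part of any proxy's API.

```python
def aggregate_limit(per_instance_rps: int, replicas: int) -> int:
    """Service-wide ceiling when every sidecar enforces its own local limit."""
    return per_instance_rps * replicas


def per_instance_limit(target_total_rps: int, replicas: int) -> int:
    """Per-sidecar limit that keeps the service-wide total at or below a target."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # Round down so the combined limit never exceeds the target.
    return target_total_rps // replicas


print(aggregate_limit(10, 3))      # 30
print(per_instance_limit(100, 8))  # 12
```

So three replicas with the 10 requests per second limit shown earlier admit up to about 30 requests per second in total. If you need a fixed service-wide cap, the per-instance value has to be adjusted as the replica count changes.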

A common pitfall is pointing the sidecar at the wrong upstream address or port. If the sidecar forwards to the wrong localhost port, or the service itself isn’t listening on the expected port, requests that pass the rate limiter will still fail with an upstream connection error (typically a 503 from the proxy), even though the rate limiter itself appears to be working correctly.
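A quick way to rule this out is to check that something is accepting TCP connections on the port the sidecar forwards to. The snippet below is a minimal sketch using Python's standard socket module; is_listening is a hypothetical helper, and 127.0.0.1:8080 is simply the upstream address from the example configuration.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Running is_listening("127.0.0.1", 8080) from inside the pod tells you whether the service is reachable where the sidecar expects it. A False result points at the service or the cluster configuration rather than the rate limiter.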

The next challenge you’ll likely encounter is how to dynamically update these rate limits without redeploying your sidecar proxies.