The most surprising truth about rate limiting is that it’s often the least aggressive configurations that cause the most production pain.

Let’s watch what happens when a single API endpoint, /api/v1/users, is hit by a sudden surge of traffic. Imagine a legitimate user, Alice, is trying to update her profile. She clicks "save" and her client makes a POST request. Behind the scenes, our API gateway is watching. It has a rate limit configured for this endpoint: 100 requests per minute. If Alice’s client is well-behaved, this is plenty. But what if a bug in her client causes it to rapidly resend the save request hundreds of times in a few seconds?

// Example request to /api/v1/users
POST /api/v1/users/12345 HTTP/1.1
Host: api.example.com
Content-Type: application/json
Authorization: Bearer <token>

{
  "email": "alice.smith@example.com",
  "bio": "Updated bio text."
}

Our gateway, seeing 100+ requests from Alice’s IP address (or more likely, her API token) within a short window, will start rejecting subsequent requests.

// Example rejection response (Retry-After is in seconds)
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 60

{
  "error": "Too Many Requests",
  "message": "You have exceeded the rate limit. Please try again in 60 seconds."
}
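
The gateway behavior described above can be sketched as a fixed-window counter keyed by API token. This is a minimal in-memory illustration, not how any particular gateway implements it; the class name `FixedWindowLimiter` and its `check` method are hypothetical, and the clock is passed in as `now` so the example is deterministic.

```python
import math
from dataclasses import dataclass, field


@dataclass
class FixedWindowLimiter:
    """Fixed-window counter: at most `limit` requests per key per window."""

    limit: int = 100          # requests allowed per window (value from the text)
    window_seconds: int = 60  # window length: one minute
    # key -> (window index, request count); in-memory, single process only
    _counts: dict = field(default_factory=dict)

    def check(self, key: str, now: float) -> tuple[int, dict]:
        """Return (HTTP status, headers) for one request by `key` at time `now`."""
        window = int(now // self.window_seconds)
        current_window, count = self._counts.get(key, (window, 0))
        if current_window != window:
            count = 0  # a new window has started, so the counter resets
        if count >= self.limit:
            # Seconds until the current window ends, rounded up.
            retry_after = math.ceil((window + 1) * self.window_seconds - now)
            return 429, {"Retry-After": str(retry_after)}
        self._counts[key] = (window, count + 1)
        return 200, {}


limiter = FixedWindowLimiter()
# Alice's buggy client sends 150 requests within the same second.
statuses = [limiter.check("alice-token", now=0.5)[0] for _ in range(150)]
print(statuses.count(200), statuses.count(429))  # 100 50
```

The first 100 requests pass and the remaining 50 are rejected until the window rolls over. A production gateway would typically keep these counters in a shared store rather than in process memory.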

This is the intended behavior. The problem arises when rate limits are configured too loosely, or when the wrong limits are applied.

The "Too Permissive" Limit

This is the classic antipattern. You set a rate limit that looks generous, like 1000 requests per minute per user. It seems like it would never be hit by a single, normal user.

What breaks: A distributed attack, or even a poorly written bot, can still overwhelm your downstream services, because a per-user limit caps each identity and not the total. If your limit is 1000 requests per minute and a malicious actor spreads traffic across 10 stolen tokens (or 10 source IPs on an unauthenticated endpoint), your gateway allows 1000 requests per minute from each of them, or 10,000 requests per minute in total. That can be enough to saturate your database or application servers.

Diagnosis: Check your rate limit configuration. Look for limits that are too high, especially on critical or resource-intensive endpoints. On Kubernetes, list the gateway’s custom resources, for example kubectl get apirule -n api-gateway (the resource kind and namespace depend on which gateway you run), or check your Kong, Apigee, or Cloudflare configuration.

Fix: Lower the rate limit to a value that reflects typical peak usage, not theoretical maximums. For /api/v1/users, a limit of 10 requests per minute per user might be more appropriate, with a burst allowance for brief spikes. In a gateway configuration this might look like the following (field names vary by product):

spec.rateLimit.rate: 10
spec.rateLimit.burst: 20

This forces abusive clients to slow down significantly, protecting your backend.

The "Global Limit" Trap

Another common mistake is applying a single, global rate limit across all users or all requests to an endpoint, without any per-user or per-IP differentiation.

What breaks: A single, very active legitimate user (or a bug in their client) can consume the entire global quota, starving all other users. Imagine a shared resource, like a public API endpoint for fetching a list of available products. If one user triggers a massive data export, they could hit the global limit of 500 requests per minute, making the endpoint unusable for everyone else.

Diagnosis: Examine your rate limiting rules. If you see a limit applied without a user_id, api_key, or source_ip dimension, it’s a global limit.

Fix: Implement per-user or per-key rate limiting so that the actions of one user don’t impact others. Configure a rule like: limit: 100/minute by api_key. Each unique API key then gets its own 100 requests per minute, rather than drawing from one shared pool.
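
To make the difference concrete, here is a small simulation of one minute of traffic under a shared quota and under a per-key quota. It is only a sketch: the `serve` function, the `key_of` parameter, and the client names `exporter` and `bob` are made up for illustration.

```python
from collections import defaultdict
from typing import Callable


def serve(requests: list[str], limit: int,
          key_of: Callable[[str], str]) -> dict[str, int]:
    """Count accepted requests per client within a single one-minute window.

    `requests` is an ordered list of API keys, one entry per request.
    `key_of` maps an API key to the bucket whose quota the request consumes.
    """
    used = defaultdict(int)      # bucket -> requests counted so far
    accepted = defaultdict(int)  # API key -> requests that were allowed
    for api_key in requests:
        bucket = key_of(api_key)
        if used[bucket] < limit:
            used[bucket] += 1
            accepted[api_key] += 1
    return dict(accepted)


# One export job fires 500 requests, then a second client makes 5.
traffic = ["exporter"] * 500 + ["bob"] * 5

shared = serve(traffic, limit=500, key_of=lambda api_key: "global")
per_key = serve(traffic, limit=100, key_of=lambda api_key: api_key)

print(shared)   # {'exporter': 500}
print(per_key)  # {'exporter': 100, 'bob': 5}
```

With the shared pool, the export job uses the entire quota and every one of the second client's requests is rejected. With per-key buckets, the export job is capped at 100 and the second client is unaffected.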

The "Unauthenticated Endpoint" Vulnerability

Rate limiting is often implemented after authentication. This leaves unauthenticated endpoints (like login pages, sign-up forms, or public-facing APIs) wide open.

What breaks: An attacker can flood an unauthenticated endpoint with requests without needing any credentials. This is a common vector for credential stuffing (replaying stolen username/password pairs), brute-force password guessing, or simply overwhelming a public resource. If your /login endpoint has no rate limiting, an attacker can attempt logins as fast as your servers will answer.

Diagnosis: Explicitly check rate limit configurations for any endpoints that do not require authentication.

Fix: Apply rate limits based on source IP address or other identifying characteristics before authentication. For unauthenticated endpoints, use a limit like 50/minute by source_ip. This prevents a single IP from making an excessive number of login attempts or hitting a public resource too hard.
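
One way to sketch a per-IP limit that runs before authentication is a sliding-window log, which tracks recent request timestamps for each source address. This is a swapped-in technique chosen for illustration, not something the limits above require; `SlidingWindowLimiter` and `handle_login` are hypothetical names, and the handler simply pretends every password guess is wrong.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Sliding-window log: at most `limit` requests per key in any `window` seconds."""

    def __init__(self, limit: int = 50, window: float = 60.0) -> None:
        self.limit = limit
        self.window = window
        # source IP -> timestamps of requests that were allowed
        self._hits = defaultdict(deque)

    def allow(self, source_ip: str, now: float) -> bool:
        hits = self._hits[source_ip]
        # Drop timestamps that have aged out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


def handle_login(limiter: SlidingWindowLimiter, source_ip: str, now: float) -> int:
    """Hypothetical handler: the limiter runs before any credential check."""
    if not limiter.allow(source_ip, now):
        return 429
    return 401  # placeholder: pretend the submitted password was wrong


limiter = SlidingWindowLimiter()
# 200 password guesses from one address, spaced 0.1 s apart.
results = [handle_login(limiter, "203.0.113.7", now=i * 0.1) for i in range(200)]
print(results.count(401), results.count(429))  # 50 150
```

Only the first 50 guesses reach the credential check; the rest are rejected without touching the user database. Note that this sketch stores one timestamp per allowed request, so memory grows with the number of distinct addresses.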

The "Ignoring Retry-After" Problem

Clients often don’t respect the Retry-After header returned in a 429 Too Many Requests response.

What breaks: Even if your rate limit is correctly configured, if clients immediately retry after being rate-limited, they will continue to hit the limit, consuming resources and generating more 429 responses. This creates a feedback loop of unnecessary traffic.

Diagnosis: Monitor your application logs and network traffic for repeated requests from the same client shortly after a 429 response. This indicates clients are not respecting the Retry-After header.
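
As a rough sketch of that check, the function below scans access-log records and counts requests that arrived before a previously announced Retry-After had elapsed. The `LogEntry` fields are assumptions about what your logs contain, not a real log format.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    timestamp: float          # seconds since some epoch
    client: str               # API key or source IP
    status: int               # HTTP status returned
    retry_after: float = 0.0  # Retry-After sent with a 429, in seconds


def early_retries(entries: list[LogEntry]) -> dict[str, int]:
    """Count, per client, requests sent before a prior Retry-After elapsed."""
    blocked_until: dict[str, float] = {}
    offenders: dict[str, int] = {}
    for entry in sorted(entries, key=lambda e: e.timestamp):
        deadline = blocked_until.get(entry.client)
        if deadline is not None and entry.timestamp < deadline:
            offenders[entry.client] = offenders.get(entry.client, 0) + 1
        if entry.status == 429:
            # Keep the latest deadline the server announced to this client.
            new_deadline = entry.timestamp + entry.retry_after
            blocked_until[entry.client] = max(deadline or 0.0, new_deadline)
    return offenders


log = [
    LogEntry(0.0, "alice", 429, 60),
    LogEntry(0.2, "alice", 429, 60),  # retried after 0.2 s
    LogEntry(0.4, "alice", 429, 60),  # and again
    LogEntry(0.0, "bob", 429, 60),
    LogEntry(61.0, "bob", 200),       # waited out the full minute
]
print(early_retries(log))  # {'alice': 2}
```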

Fix: Ensure your client applications implement exponential backoff with jitter when they receive a 429 response. The Retry-After header provides the minimum time to wait. If Retry-After is 60, a client might wait 60s, then 120s, then 240s, and so on, with a small random variation (jitter) added to each delay to prevent thundering herd problems.
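
A client-side sketch of this policy might look like the following. The function names, the 10% jitter, and the 600-second cap are illustrative choices, `send` stands in for whatever function performs the HTTP request, and the code assumes Retry-After arrives as a number of seconds (the header can also carry an HTTP date, which this sketch does not handle).

```python
import random
import time


def backoff_delay(attempt: int, retry_after: float = 0.0, base: float = 1.0,
                  cap: float = 600.0, jitter: float = 0.1) -> float:
    """Seconds to wait before retry number `attempt` (0 for the first retry).

    The delay starts at Retry-After (or `base` if none was sent), doubles on
    each attempt, and is capped at `cap` but never drops below Retry-After.
    """
    floor = max(retry_after, base)
    delay = max(retry_after, min(cap, floor * 2 ** attempt))
    # Jitter is only added, never subtracted, so Retry-After stays a minimum.
    return delay + random.uniform(0, jitter * delay)


def call_with_retries(send, max_attempts: int = 5, sleep=time.sleep) -> int:
    """Call `send()` until it returns a non-429 status or attempts run out.

    `send` is a hypothetical callable returning (status, headers).
    """
    status = 429
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            break
        if attempt + 1 < max_attempts:
            retry_after = float(headers.get("Retry-After", 0))
            sleep(backoff_delay(attempt, retry_after))
    return status


# Demo with a fake server that rejects twice and then succeeds.
# Delays are recorded instead of slept so the demo finishes instantly.
responses = iter([(429, {"Retry-After": "60"}),
                  (429, {"Retry-After": "60"}),
                  (200, {})])
waits = []
final_status = call_with_retries(lambda: next(responses), sleep=waits.append)
print(final_status, [round(w) for w in waits])
```

With a Retry-After of 60, the two recorded waits fall in the ranges 60 to 66 seconds and 120 to 132 seconds, matching the 60s, 120s progression described above with jitter on top.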

The "No Burst Allowance" Stranglehold

Rate limiting that strictly enforces an average rate without allowing for any bursts can cause legitimate users to be rejected during normal, albeit spiky, usage patterns.

What breaks: A user might legitimately make 5 requests in 1 second, then nothing for 59 seconds. If the limit of 10 requests per minute is enforced as strict spacing (no more than 1 request every 6 seconds), the second request of that burst already triggers a 429, even though the user is well within their overall minute quota.

Diagnosis: Observe legitimate user traffic patterns. If users are reporting 429 errors during normal operation, it’s likely due to a lack of burst capacity.

Fix: Configure a burst allowance (often called burst, burst size, or capacity). Set the rate to 10/minute and the burst to 20. A client that has been idle can then make up to 20 requests back to back. After that, capacity refills at the configured rate of one request every 6 seconds, so the long-run average stays at 10 per minute even though a single minute that starts with a full burst can see more.
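
A token bucket is one common way to get this behavior. The sketch below is a minimal single-client version under that assumption; the `TokenBucket` class is hypothetical, and the clock is passed in as `now` to keep the example deterministic.

```python
class TokenBucket:
    """Token bucket: `burst` tokens of capacity, refilled at `rate_per_minute`."""

    def __init__(self, rate_per_minute: float = 10, burst: int = 20,
                 now: float = 0.0) -> None:
        self.refill_per_second = rate_per_minute / 60.0
        self.capacity = float(burst)
        self.tokens = float(burst)  # the bucket starts full
        self.updated = now

    def allow(self, now: float) -> bool:
        """Spend one token for a request at time `now` if one is available."""
        elapsed = now - self.updated
        # Refill for the time that has passed, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_minute=10, burst=20)
# An idle client suddenly sends 25 requests at once.
burst_results = [bucket.allow(now=0.0) for _ in range(25)]
print(burst_results.count(True), burst_results.count(False))  # 20 5
# Three seconds later only half a token has refilled, so the request is rejected.
print(bucket.allow(now=3.0))   # False
# After twelve seconds about two tokens have refilled.
print(bucket.allow(now=12.0))  # True
```

The spiky user from the example above (5 requests in one second) is served in full, while a client that keeps sending is slowed to the refill rate once the bucket is empty.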

The "Overly Aggressive Default"

Sometimes, the default rate limit applied to all endpoints is too strict, without specific tuning for individual endpoints.

What breaks: A new or less critical endpoint might be unintentionally throttled. For example, a /healthz check endpoint might be rate-limited to 5 requests per minute, causing monitoring systems to fail their checks and trigger false alarms.

Diagnosis: Review the default rate limit configuration applied globally to your API gateway or service mesh. Check if critical internal services or health checks are being impacted.

Fix: Keep the default generous and define specific, stricter limits only for the endpoints that need them. For example, set a default limit of 1000/minute, add a specific rule for /api/v1/users of 10/minute, and either exempt health checks such as /healthz or give them a limit well above your monitoring frequency.
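
The rule resolution can be sketched as a default plus a table of per-path overrides. The names `DEFAULT_LIMIT`, `OVERRIDES`, and `limit_for` are hypothetical, and exempting /healthz entirely is an assumption made for this example; real gateways express this in their own configuration format.

```python
from typing import Optional

DEFAULT_LIMIT = 1000  # requests per minute for paths without a specific rule

# Path prefix -> requests per minute; None means "not rate limited".
OVERRIDES: dict[str, Optional[int]] = {
    "/api/v1/users": 10,
    "/healthz": None,  # assumption: health checks are exempted entirely
}


def limit_for(path: str) -> Optional[int]:
    """Return the per-minute limit for `path`, preferring the longest matching prefix."""
    matches = [prefix for prefix in OVERRIDES
               if path == prefix or path.startswith(prefix + "/")]
    if not matches:
        return DEFAULT_LIMIT
    return OVERRIDES[max(matches, key=len)]


print(limit_for("/api/v1/users/12345"))  # 10
print(limit_for("/api/v1/products"))     # 1000
print(limit_for("/healthz"))             # None
```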

The next common problem you’ll encounter is how to implement effective distributed rate limiting across multiple service instances.
