Rate limiting is, perhaps surprisingly, less about preventing abuse than about ensuring fairness among legitimate users.
Imagine a popular API that serves real-time stock quotes. Without rate limiting, a single, very fast client could hog all the available server resources, making it impossible for other users to get their stock data. Rate limiting steps in to ensure that no single client can consume an unreasonable portion of the server’s capacity, thereby guaranteeing a baseline level of service for everyone.
Let’s see this in action. We’ll use a simple in-memory rate limiter in Go, but the principles apply universally.
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"

	"golang.org/x/time/rate"
)

// One limiter shared by every client: 1 request per second, burst of 1.
var limiter = rate.NewLimiter(rate.Limit(1), 1)

func apiHandler(w http.ResponseWriter, r *http.Request) {
	// Allow does not block: it reports whether a token is available right now.
	if !limiter.Allow() {
		http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
		return
	}
	// Simulate API work
	time.Sleep(100 * time.Millisecond)
	fmt.Fprintln(w, "Stock quote data...")
}

func main() {
	http.HandleFunc("/stock", apiHandler)
	fmt.Println("Server listening on :8080")
	// ListenAndServe only returns on failure, so report the error and exit.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
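If you hit http://localhost:8080/stock repeatedly with curl -v http://localhost:8080/stock, you’ll see the first request succeed once the simulated 100 ms of work completes. A second request sent less than a second after the first will be rejected with a 429 Too Many Requests. Note that this example uses a single limiter for the whole server, so the limit is shared by all clients rather than applied to each one separately.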
The core problem rate limiting solves is resource exhaustion. APIs, databases, and underlying infrastructure have finite capacity. Uncontrolled demand, even from legitimate sources, can overwhelm these systems, leading to degraded performance, intermittent outages, and ultimately, lost revenue. Rate limiting acts as a traffic cop, smoothing out peaks and ensuring a consistent, predictable flow of requests.
Internally, many rate limiters, including the golang.org/x/time/rate package used above, operate on a token bucket algorithm. Think of a bucket that holds a certain number of tokens. Tokens are added to the bucket at a fixed rate (e.g., 10 tokens per second), and any tokens that arrive while the bucket is full are discarded. To be processed, each request must consume one token from the bucket. If the bucket is empty, the request is either denied or queued until a token becomes available. The "burst" capacity is the maximum number of tokens the bucket can hold, and it allows for short, intense spikes in traffic up to that size.
The two primary levers you control are the rate and the burst. The rate dictates how many requests are allowed over a sustained period (e.g., 100 requests per minute). The burst determines the maximum number of requests that can be handled in a very short interval, essentially how much "slack" the system has to absorb sudden traffic spikes. Choosing these values is a delicate balancing act: too restrictive, and you alienate legitimate users; too permissive, and you risk overwhelming your infrastructure.
A common misconception is that rate limiting is solely an external concern, handled by API gateways or load balancers. While these are excellent places to enforce global limits, implementing rate limiting within your application or service provides finer-grained control and allows for context-aware decisions. For instance, you might apply different limits based on the authenticated user, the type of request, or even the geographical origin of the request. This internal application of rate limiting is crucial for protecting specific, resource-intensive operations that might not be obvious at the gateway level.
The next natural step after implementing basic rate limiting is understanding how to handle the rejected requests gracefully, particularly when dealing with distributed systems.