Rate Limiting with GCP Cloud Endpoints: Quota Setup (2026)

Setting up rate limiting in Google Cloud Endpoints isn’t just about capping requests; the quotas you choose decide how backend capacity is shared among consumers and how much any one consumer can cost you to serve.

Let’s see this in action. Imagine you have a critical GetUser method in your UserService and you want to ensure no single consumer can overwhelm it.

type: google.api.Service
config_version: 3

http:
  rules:
  - selector: UserService.GetUser
    get: /v1/users/{user_id}

# ... other service configurations ...

metrics:
  # A metric is a named counter that quota is charged against.
  # The name is arbitrary; "user-requests" is just an example.
- name: "user-requests"
  # 'display_name' is a human-readable name for the metric.
  display_name: "User requests"
  value_type: INT64
  metric_kind: DELTA

quota:
  # This section defines the limits and which methods count against them.
  limits:
  - name: "user-requests-per-minute"
    # 'metric' names the counter this limit applies to.
    metric: "user-requests"
    # 'unit' sets the window and scope:
    # per minute, per consumer project.
    unit: "1/min/{project}"
    values:
      # Here, 60 requests per 1 minute.
      STANDARD: 60
  # 'metric_rules' is a list, allowing different costs per selector.
  metric_rules:
  - selector: "*" # Applies to all methods
    metric_costs:
      # Each call adds 1 to the counter.
      user-requests: 1

The quota block is where the magic happens for per-consumer rate limiting, and three pieces work together. Each entry under metrics declares a named counter, with a display_name for clarity in the GCP console. Each entry under quota.limits caps one metric: name identifies the limit, metric says which counter it applies to, unit sets the window and scope, and values.STANDARD is the amount allowed per window. The unit "1/min/{project}" means per minute, per consumer project; at the time of writing it is the only unit the Endpoints documentation lists as supported. Finally, quota.metric_rules binds methods to metrics. You can target a specific method using its selector (for example UserService.GetUser) or all methods with "*", and metric_costs states how much each call adds to the counter. Because these are lists, you can define multiple, distinct limits and charge expensive methods more than cheap ones.

Here’s how the limit is enforced at request time: when a client makes a request, Cloud Endpoints identifies the consumer from the API key on the request and checks that consumer’s usage against the configured limit. If the consumer has already used its 60 requests for the current minute, subsequent requests are rejected with a 429 Too Many Requests status code and a RESOURCE_EXHAUSTED error until the quota replenishes. Usage is tracked separately for each consumer, so one client exhausting its quota does not eat into another’s.
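On the client side, a 429 is a signal to slow down rather than a hard failure. The sketch below shows one common way to react, exponential backoff with a little jitter. It is a minimal illustration, not part of Cloud Endpoints: `call_with_backoff`, the `send` callable, and the default delays are all hypothetical names and values chosen for the example.

```python
import random
import time
from typing import Callable, Tuple

TOO_MANY_REQUESTS = 429


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() and retry with exponential backoff while it returns 429.

    send is a hypothetical callable that performs one request and returns
    (http_status, body). Any status other than 429 is returned immediately.
    After max_attempts calls the last response is returned as-is.
    """
    status, body = send()
    for attempt in range(1, max_attempts):
        if status != TOO_MANY_REQUESTS:
            break
        # Wait base_delay * 2^(attempt - 1) seconds, plus up to 10% jitter
        # so that many clients do not retry in lockstep.
        delay = base_delay * (2 ** (attempt - 1))
        sleep(delay + random.uniform(0, delay * 0.1))
        status, body = send()
    return status, body
```

With a per-minute quota, a base delay of a second or more is usually a reasonable starting point, since retrying within milliseconds only spends more of the same exhausted quota.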

The core problem this solves is preventing API abuse and ensuring fair usage among your consumers. Without it, a single rogue client could flood your backend, impacting availability for everyone else. It’s also a crucial tool for API monetization, allowing you to offer tiered access based on different rate limits. For example, a free tier might allow 60 requests per minute, while a premium tier could allow 6,000.

What most people don’t realize is that the configured number is not an exact, per-request meter. The proxy in front of your backend may cache quota state locally and reconcile it with Google’s service control backend rather than consulting a central counter on every call, so under bursty traffic a consumer can briefly exceed the limit before rejections begin. Treat the limit as a fairness and protection mechanism, not a billing-grade count. A useful mental model for the behavior is a token bucket, where tokens are replenished at a certain rate and requests consume tokens. If the bucket is empty, requests are denied.
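The token bucket idea is easy to see in code. The following is a minimal sketch of the general algorithm, assuming a bucket of 60 tokens refilled at one token per second to mirror a 60-per-minute limit. It is not Cloud Endpoints’ actual implementation, and the class and parameter names are made up for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Minimal token bucket: holds up to `capacity` tokens, refilled steadily."""

    def __init__(
        self,
        capacity: float,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = capacity  # the bucket starts full
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens and return True, or return False if too few."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Add the tokens earned since the last call, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per consumer: 60 tokens, refilled at 1 token per second.
bucket = TokenBucket(capacity=60, refill_per_second=1.0)
if not bucket.allow():
    print("429 Too Many Requests")
```

Note that a full bucket permits a burst of 60 back-to-back requests, after which the consumer is held to the refill rate. That burst-then-throttle shape is the main practical difference from a strict fixed window.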

Once you’ve set up these basic rate limits, the next logical step is to charge individual methods different costs with per-method metric_rules, and to review the usage rules in your service configuration, which decide whether calls without an API key are accepted at all.