The most surprising thing about ESC configuration is that its primary purpose is not to prevent failures, but to orchestrate how a service fails gracefully when a critical dependency is unavailable.

Let’s watch this in action. Imagine we have two services: frontend and backend. frontend relies on backend to serve data. If backend goes down, frontend should not crash or start returning garbage; it should fail predictably and gracefully. This is where ESC (Elastic Service Control) comes in.

Here’s a simplified frontend configuration using ESC to handle backend unavailability:

service: frontend
version: 1.0.0

dependencies:
  backend:
    # This is the critical part: the ESC configuration
    esc:
      enabled: true
      # If backend is unhealthy for 3 consecutive checks,
      # we'll consider it unavailable.
      unhealthy_threshold: 3
      # Each check happens every 10 seconds.
      interval: 10s
      # If backend is unavailable, we'll return an empty list
      # for a specified duration before escalating.
      fallback:
        type: empty_list
        duration: 60s
      # After the fallback duration, if backend is still unavailable,
      # we'll return a static error response.
      on_failure:
        type: static_response
        status_code: 503
        body: '{"error": "Backend service is temporarily unavailable."}'

health_checks:
  # This is how frontend checks if backend is healthy
  http:
    path: /health
    port: 8080
    interval: 5s
    timeout: 2s
    unhealthy_threshold: 2
    healthy_threshold: 1

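In this setup, frontend probes backend at /health on port 8080 every 5 seconds, with a 2-second timeout per probe. Two consecutive failed probes (unhealthy_threshold: 2 under health_checks) mark backend as unhealthy, and a single successful probe (healthy_threshold: 1) marks it healthy again. Once backend is marked unhealthy, frontend’s ESC mechanism starts tracking the failure.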

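The esc.enabled: true flag activates ESC, which behaves like a circuit breaker. ESC checks backend every 10 seconds (interval: 10s), and unhealthy_threshold: 3 in the esc section means frontend waits for three consecutive unhealthy checks before it considers backend truly unavailable. This threshold is separate from the unhealthy_threshold: 2 under health_checks, which only governs the individual probes. The extra margin prevents transient network blips or brief service hiccups from triggering a full failure.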
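The consecutive-failure rule can be sketched in a few lines of Python. This is only an illustration of the counting logic, not ESC’s actual implementation: the class name ConsecutiveFailureTracker and its record method are hypothetical, and the sketch assumes a single healthy check resets the streak.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureTracker:
    """Hypothetical helper: reports 'unavailable' after N failures in a row."""

    unhealthy_threshold: int = 3  # mirrors esc.unhealthy_threshold
    failures: int = 0             # length of the current failure streak

    def record(self, healthy: bool) -> bool:
        """Record one check result; return True if the dependency counts as unavailable."""
        # Assumption: any healthy check resets the streak, so isolated
        # blips never accumulate toward the threshold.
        self.failures = 0 if healthy else self.failures + 1
        return self.failures >= self.unhealthy_threshold


tracker = ConsecutiveFailureTracker(unhealthy_threshold=3)
# Two failures, a recovery, then three failures in a row.
results = [tracker.record(ok) for ok in [False, False, True, False, False, False]]
print(results)  # [False, False, False, False, False, True]
```

Only the final check trips the tracker: the recovery in the middle resets the count, which is exactly how a transient blip is absorbed.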

Once backend is deemed unavailable, the fallback mechanism kicks in. Here, type: empty_list means that for the next 60 seconds (duration: 60s), frontend answers any request that would have gone to backend with an empty list. This is useful when frontend can still provide a partial, reduced user experience. For example, if backend provides a list of recommended products, returning an empty list is better than returning an error.

If backend remains unavailable after that 60s fallback duration, the on_failure directive takes over. type: static_response means frontend will stop trying to fetch data and instead return a predefined, hardcoded error message: {"error": "Backend service is temporarily unavailable."} with a 503 Service Unavailable HTTP status code. This clearly signals to the client that the service is down.
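The two-stage degradation (fallback first, then on_failure) boils down to picking a response based on how long backend has been unavailable. The sketch below is a hedged illustration, not ESC’s real code: the function degraded_response is hypothetical, the 200 status for the empty-list fallback is an assumption (the config does not specify one), and so is treating the exact 60-second mark as already past the fallback window.

```python
import json
from typing import Tuple

FALLBACK_DURATION_S = 60.0  # fallback.duration from the config above
ON_FAILURE_STATUS = 503     # on_failure.status_code
ON_FAILURE_BODY = '{"error": "Backend service is temporarily unavailable."}'


def degraded_response(seconds_unavailable: float) -> Tuple[int, str]:
    """Return (status_code, body) for a request made while backend is unavailable."""
    if seconds_unavailable < FALLBACK_DURATION_S:
        # fallback.type: empty_list. Assumption: served as a normal 200 response.
        return 200, json.dumps([])
    # Fallback window has elapsed: on_failure.type: static_response.
    return ON_FAILURE_STATUS, ON_FAILURE_BODY


print(degraded_response(5))   # still inside the fallback window
print(degraded_response(60))  # fallback window elapsed
```

A client sees an empty but valid payload during the first minute and an explicit 503 afterwards, so it can distinguish "no data right now" from "service is down".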

The whole point of this ESC configuration is to allow frontend to be resilient. It doesn’t just crash; it degrades gracefully. It might return empty lists for a while, and then a clear error message, all orchestrated by ESC. The key levers you control are:

  • enabled: Turn the ESC mechanism on or off.
  • unhealthy_threshold: How many consecutive health check failures before the dependency is considered truly down.
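  • interval: How often ESC checks the dependency. This is set separately from the interval under health_checks, which controls how often individual probes run.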
  • fallback: What to do before escalating to a full failure state. This can be returning empty data, a cached response, or other strategies.
  • duration: Set under fallback; how long the fallback stays active before on_failure takes over.
  • on_failure: The ultimate action when dependencies remain unavailable after fallback.
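To get a feel for how these levers combine, it helps to work out the rough timeline they imply. The sketch below is a back-of-the-envelope calculation under simplifying assumptions: checks are evenly spaced, each failed check is detected immediately (timeouts are ignored), and the duration strings follow a simple number-plus-unit form. The helpers parse_duration and degradation_timeline are hypothetical names introduced for illustration.

```python
import re

# Assumed duration syntax: an integer followed by ms, s, m, or h.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text: str) -> float:
    """Convert a string such as '10s' into seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def degradation_timeline(unhealthy_threshold: int, interval: str, fallback_duration: str) -> dict:
    """Upper bounds, in seconds after backend goes down, for each stage to begin."""
    # Worst case: backend fails right after a check, so the N-th
    # consecutive failed check lands N full intervals later.
    trip_after = unhealthy_threshold * parse_duration(interval)
    return {
        "fallback_starts_by": trip_after,
        "on_failure_starts_by": trip_after + parse_duration(fallback_duration),
    }


timeline = degradation_timeline(3, "10s", "60s")
print(timeline)  # {'fallback_starts_by': 30.0, 'on_failure_starts_by': 90.0}
```

With the values from the example config, clients would start seeing empty lists within about 30 seconds of backend going down and the static 503 within about 90 seconds. Lowering unhealthy_threshold or interval shortens detection at the cost of more false alarms.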

One subtle but powerful aspect of ESC is how it interacts with service discovery and load balancing. When frontend marks backend as unavailable via ESC, it doesn’t just stop sending requests to one instance of backend. If backend is managed by a sophisticated orchestrator (like Kubernetes or a similar system), frontend’s ESC state can trigger the orchestrator to stop sending new traffic to any unhealthy backend instances. This prevents a cascade of failures where a struggling backend gets overloaded with requests from multiple frontend instances, leading to a complete collapse. The ESC state acts as a signal that ripples outwards.

The next concept you’ll likely encounter is how to implement more sophisticated fallback strategies, such as returning cached data from a distributed cache when a primary backend is unavailable.
