Ray Serve’s rolling updates allow you to deploy new versions of your models without interrupting service, but they can fail if not managed carefully.

Here’s how to make them work, and what happens when they don’t.

The Problem: The Illusion of Zero Downtime

When you update a Ray Serve deployment, the system attempts a "rolling update." It spins up new replicas of your updated code before tearing down the old ones. The goal is that traffic seamlessly shifts from the old replicas to the new ones. In theory, no requests should ever hit a dead replica.

This sounds great, but it’s a delicate dance. If the new replicas aren’t ready to serve traffic, or if the old ones are still actively processing requests when the switch happens, you’ll see errors.

What Happens When Rolling Updates Fail

The most common symptom of a failed rolling update is a sudden spike in 5xx errors on your service. Specifically, you might see:

  • 503 Service Unavailable: This means requests are hitting the Serve router, but the router can’t find any healthy replicas to send the request to. This usually happens when all old replicas have been torn down, but the new ones haven’t started up successfully or aren’t reporting as healthy.
  • 500 Internal Server Error: This is more insidious. It means the request did reach a replica, but that replica failed to process it. This often happens when a request is routed to a new replica that is still initializing its model weights or other resources, and it crashes before it can handle the inference.
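To tell the two failure modes apart during a rollout, it can help to tally the status codes returned to a probe client. The sketch below is a minimal illustration, not part of Ray Serve: `summarize_statuses` and its bucket names are hypothetical, and the interpretation of each bucket simply follows the descriptions above.

```python
from collections import Counter


def summarize_statuses(codes):
    """Group HTTP status codes collected during a rollout into coarse buckets."""
    buckets = Counter()
    for code in codes:
        if code == 503:
            # Per the description above: no healthy replica was available.
            buckets["no_healthy_replica"] += 1
        elif 500 <= code < 600:
            # A replica received the request but failed while handling it.
            buckets["replica_error"] += 1
        elif 200 <= code < 300:
            buckets["ok"] += 1
        else:
            buckets["other"] += 1
    return dict(buckets)
```

A rising `no_healthy_replica` count points at replica startup or health, while `replica_error` points at the application code running inside the replicas.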

Common Causes and Fixes

  1. New Replicas Aren’t Starting Up Fast Enough (or at All)

    • Diagnosis: Check the logs for your Serve deployment. Look for ERROR messages related to your application code failing to initialize. This could be due to missing dependencies, incorrect configuration loading, or long model loading times.

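      # Serve writes controller, proxy, and replica logs to the session log directory on each node
      ls /tmp/ray/session_latest/logs/serve/
      # Or find the actor and fetch its log through the state CLI
      ray list actors
      ray logs actor --id <actor_id>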
      

      Look for errors like ModuleNotFoundError, FileNotFoundError, or exceptions during torch.load() or tf.keras.models.load_model().

    • Fix:

      • Make room for parallel starts: New replicas start as Ray actors, so how many can initialize at once depends on the free CPUs, GPUs, and memory in the cluster. Be wary of relying on a num_ongoing_starts setting for this: it is not among the documented options of @serve.deployment or serve.run(), so confirm it exists in your Ray version first.
      • Optimize Model Loading: If your model loading is the bottleneck, try techniques like:
        • Loading weights once during deployment initialization rather than on every request.
        • Using lighter model formats (e.g., ONNX, TorchScript).
        • Pre-loading models on the driver before starting Serve.
      • Increase graceful_shutdown_timeout_s: This is crucial. It defines how long Serve waits for a replica to finish processing in-flight requests before it forcibly terminates it. If your requests take a long time to process and graceful_shutdown_timeout_s is too short (e.g., 30 seconds), the old replicas might be killed before they finish, causing a gap. Increase this to a value that accommodates your longest request processing times, e.g., graceful_shutdown_timeout_s=120.
    • Why it works: Free cluster capacity lets more replicas initialize concurrently. Optimizing model loading directly reduces the time it takes for a new replica to become ready. graceful_shutdown_timeout_s ensures old replicas have enough time to finish their work, preventing a sudden drop in available capacity.

  2. Stale or Corrupted Dependencies on Worker Nodes

    • Diagnosis: If some replicas start fine and others don’t, or if you see weird import errors only on certain nodes, check your environment. If you’re using custom Docker images or pip install on the fly, a mismatch in installed packages across nodes can cause issues.

      # On a problematic worker node, check installed packages
      pip freeze
      

      Compare this output to a known good node or your expected environment.

    • Fix: Ensure all worker nodes have the exact same set of dependencies installed. The most robust way is to use a consistent Docker image for all your Ray nodes. If not using Docker, ensure your requirements.txt is applied uniformly across all nodes before starting Ray.

    • Why it works: Guarantees that the code and its dependencies are identical everywhere, eliminating environment-specific import or runtime errors.
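Comparing two `pip freeze` outputs by eye is error-prone, so a small script can do the diff. This is a minimal sketch under stated assumptions: `parse_freeze` and `diff_environments` are hypothetical helper names, and only pinned `name==version` lines are compared (editable installs and URL requirements are skipped).

```python
def parse_freeze(text):
    """Parse `pip freeze` output into a {package_name: version} dict."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries that are not pinned with "==".
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.lower()] = version
    return packages


def diff_environments(good_text, bad_text):
    """Return {package: (good_version, bad_version)} for every mismatch."""
    good, bad = parse_freeze(good_text), parse_freeze(bad_text)
    return {
        name: (good.get(name), bad.get(name))
        for name in sorted(set(good) | set(bad))
        if good.get(name) != bad.get(name)
    }
```

Feed it the saved output from a known good node and from the problematic node; a `None` on either side means the package is missing on that node.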

  3. Initialization Happens Outside __init__, So Replicas Look Ready Too Early

    • Diagnosis: If your application performs a significant initialization step (like loading a large model) after the __init__ method of your deployment class, for example lazily on the first request, Serve might consider the replica "ready" before it’s actually capable of serving traffic. Look for logs indicating the model is loaded after the replica has started receiving requests.

    • Fix: Do all heavy initialization inside __init__. Serve does not treat a replica as started until its constructor returns, so blocking there keeps traffic away until the model is loaded. Serve has no is_ready() hook; the user-defined hook it does call is check_health(), which runs periodically and should raise an exception when the replica can no longer serve.

      import time

      from ray import serve

      @serve.deployment
      class MyModel:
          def __init__(self):
              # Load everything here: the replica only starts receiving
              # traffic after the constructor returns.
              time.sleep(10)  # Simulate a slow model load
              self.model = "loaded_model_object"

          async def __call__(self, request):
              return f"Result: {self.model}"

          def check_health(self):
              # Called periodically by Serve; raise to mark the replica unhealthy.
              if self.model is None:
                  raise RuntimeError("model is not loaded")

      # Deployment config
      # deployment = MyModel.bind()
      # serve.run(deployment)

    • Why it works: Loading in __init__ ties replica startup to the actual readiness of your application, not just to the Python process having started. check_health() then lets Serve detect and replace replicas that become unhealthy later.

  4. In-Flight Requests Not Finishing (The graceful_shutdown_timeout_s Problem Revisited)

    • Diagnosis: If your rolling update seems to hang or eventually fails with 503s, and you’ve confirmed new replicas are starting okay, the issue is likely that old replicas are holding onto requests for too long. The default graceful_shutdown_timeout_s is 20 seconds in recent Ray releases (confirm the value for your version). If your inference takes longer than that, the old replicas will still be busy when Serve tries to kill them.

      # Check where graceful_shutdown_timeout_s is set: in the @serve.deployment decorator,
      # in a .options(...) call, or in the deployments section of your Serve config file
      serve config   # prints the config of applications deployed with `serve deploy`
      
    • Fix: Increase graceful_shutdown_timeout_s in your deployment configuration. For example, if your typical inference time is 60 seconds, set it to 120 or 180 seconds.

      from ray import serve
      
      @serve.deployment(graceful_shutdown_timeout_s=180) # Set to 180 seconds
      class MyModel:
          # ... your deployment code ...
          pass
      
      # Or override it when binding the deployment
      # app = MyModel.options(graceful_shutdown_timeout_s=180).bind()
      # serve.run(app)
      
    • Why it works: This gives your existing replicas more time to complete their current tasks before being terminated. Serve waits up to this duration for active requests to finish, then force-kills the replica.
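The rule of thumb above (a timeout of two to three times the typical inference time) can be turned into a small calculation over observed request latencies. This is a heuristic sketch, not a Ray API: `suggest_shutdown_timeout`, `safety_factor`, and `floor_s` are hypothetical names, and the 20-second floor is an assumed lower bound you should adjust.

```python
import math


def suggest_shutdown_timeout(latencies_s, safety_factor=2.0, floor_s=20.0):
    """Suggest a graceful_shutdown_timeout_s value from observed latencies.

    Takes the slowest observed request, multiplies it by a safety factor,
    and never goes below `floor_s`.
    """
    if not latencies_s:
        return floor_s
    return max(floor_s, math.ceil(max(latencies_s) * safety_factor))
```

For example, with a slowest observed request of 60 seconds and the default factor of 2, the helper suggests 120, matching the lower end of the range given above.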

  5. Resource Starvation During Startup

    • Diagnosis: If your new replicas are failing to start or are crashing shortly after startup, and you see errors related to memory allocation or CPU limits, it could be that the Ray cluster doesn’t have enough resources to run both the old and new replicas simultaneously during the transition.

      # Check Ray cluster resource usage
      ray status   # or open the Ray dashboard (port 8265 on the head node by default)
      

      Look for high CPU/memory utilization on your nodes.

    • Fix:

      • Scale Up Your Cluster: Add more nodes or larger nodes to your Ray cluster.
      • Reduce num_replicas temporarily: If you have a very high number of replicas, consider reducing it before initiating the rolling update, then scaling back up afterward.
      • Adjust ray_actor_options: Ensure your deployment’s ray_actor_options (e.g., num_cpus, memory) are not overly aggressive, especially during the transition.
    • Why it works: Ensures that the Ray cluster has sufficient capacity to provision new replicas without impacting the performance or stability of existing ones, and without hitting system-level resource limits.
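A quick back-of-the-envelope check shows whether the cluster can hold old and new replicas at the same time. The sketch below is illustrative only: the function names are hypothetical, and `extra_replicas` (how many new replicas run alongside the old ones at the peak of the rollout) is an assumption you supply, not a value read from Serve.

```python
def peak_rollout_demand(num_replicas, cpus_per_replica, memory_per_replica_gb, extra_replicas):
    """Estimate peak resource demand while old and new replicas coexist."""
    peak = num_replicas + extra_replicas
    return {
        "replicas": peak,
        "cpus": peak * cpus_per_replica,
        "memory_gb": peak * memory_per_replica_gb,
    }


def fits(demand, cluster_cpus, cluster_memory_gb):
    """Return True if the estimated peak demand fits in the cluster totals."""
    return demand["cpus"] <= cluster_cpus and demand["memory_gb"] <= cluster_memory_gb
```

With 10 replicas at 2 CPUs and 4 GB each, and 2 extra replicas during the transition, the peak is 24 CPUs and 48 GB; a cluster with only 20 CPUs would not fit that, even though it runs the steady-state deployment comfortably.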

  6. Network Issues or Firewall Blocking

    • Diagnosis: In complex network environments, it’s possible that new replicas are starting up but cannot register themselves with the Serve controller or communicate with the router due to firewall rules or network segmentation. Look for logs indicating a failure to connect to the Serve controller or other internal Ray components.

    • Fix: Verify that all nodes in your Ray cluster can communicate with each other on the necessary Ray ports: typically 6379 for the GCS on the head node, 8265 for the dashboard, 8000 for Serve HTTP traffic, and the worker port range (10002-19999 by default). Ensure no firewalls are blocking inter-node communication.

    • Why it works: Allows the Ray Serve control plane and worker nodes to establish necessary communication channels for replicas to register and be managed correctly.
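A plain TCP connection test, run from one node against another, is often enough to spot a blocked port. This is a minimal sketch using only the standard library; `can_connect` and `check_ports` are hypothetical helper names, and the host and port list in the usage note are placeholders for your own cluster.

```python
import socket


def can_connect(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host, ports, timeout_s=2.0):
    """Map each port to whether it accepted a TCP connection."""
    return {port: can_connect(host, port, timeout_s) for port in ports}
```

For instance, `check_ports("<head_node_ip>", [6379, 8000, 8265])` run from a worker node shows which of the ports listed above are reachable. A successful connection only proves the port is open, not that the service behind it is healthy.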

The Next Hurdle: Health Checks and Load Balancers

Once your rolling updates are consistently successful, you’ll likely start thinking about external load balancers. You’ll need to configure your load balancer to only send traffic to the Ray Serve endpoint after it reports healthy, and to gracefully remove old instances during updates. This involves understanding how your load balancer interacts with the health check endpoints provided by Ray Serve.
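Before wiring up a real load balancer, you can mimic its health probe with a short polling loop. The sketch below assumes the Serve HTTP proxy exposes a health endpoint at `/-/healthz` on port 8000; treat that path and port as assumptions to confirm for your Ray version. `http_status` and `wait_until_healthy` are hypothetical helpers, and the `fetch` parameter exists only so the loop can be exercised without a running cluster.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout_s=2.0):
    """Return the HTTP status code for a GET, or None if the request cannot connect."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def wait_until_healthy(url, attempts=30, delay_s=1.0, fetch=http_status):
    """Poll `url` until it returns 200; give up after `attempts` tries."""
    for _ in range(attempts):
        if fetch(url) == 200:
            return True
        time.sleep(delay_s)
    return False
```

A call such as `wait_until_healthy("http://<serve_host>:8000/-/healthz")` approximates what a load balancer does before adding a backend to rotation: it holds traffic until the endpoint answers 200.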
