Ray Serve’s ability to scale and serve models in production hinges on a deceptively simple configuration that, when misapplied, leads to subtle but impactful performance degradation.

Let’s see it in action. Imagine we have a simple Keras model we want to serve.

import ray
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 1})
class KerasModel:
    def __init__(self):
        # In a real scenario, load your model here
        # self.model = tf.keras.models.load_model("my_model.h5")
        print("Model loaded!")

    async def __call__(self, request: Request):
        data = await request.json()
        # In a real scenario, run inference:
        # result = self.model.predict(data["input"])
        result = {"prediction": f"processed {data['input']}"}
        return result

app = KerasModel.bind()

When we deploy this with serve.run(app), Ray Serve spins up two replicas of our KerasModel deployment. Each replica is an independent Ray actor. The num_replicas=2 tells Serve how many instances of our model to run concurrently. The ray_actor_options={"num_cpus": 1} specifies that each of these replicas (actors) should be allocated one CPU core. This is crucial: Serve doesn’t just magically run your Python code; it manages Ray actors, each with its own resource allocation.

The core problem Serve solves is managing the lifecycle and scaling of these model-serving actors. When a request comes in, Serve’s router (another internal component) directs it to one of the available replicas. If all replicas are busy, the request will queue up until a replica becomes free. This is where num_replicas and num_cpus become your primary levers for scaling.

The mental model for Serve scaling is this:

  1. Replicas (num_replicas): These are independent instances of your model serving code. More replicas mean more parallel processing capacity for incoming requests.
  2. Resources per Replica (ray_actor_options): Each replica is a Ray actor and consumes resources (CPU, GPU, memory). num_cpus=1 is a common starting point, meaning one full CPU core is dedicated to each replica. If your model inference is CPU-bound and takes a significant chunk of a core, you might need more than one replica per core, or you might increase num_cpus per replica if your model is very heavy and can utilize multiple cores within a single actor process.
  3. Concurrency within a Replica: By default, a single replica in Serve can only process one request at a time. If your model inference is very fast (e.g., milliseconds), a single replica might handle many requests per second. If inference is slow (e.g., seconds), you’ll need many replicas to achieve high throughput. Ray Serve’s async nature allows a replica to handle other requests while waiting for I/O or computation to complete, but it doesn’t inherently give you multi-request parallelism within a single Python process unless your code is written to explicitly support it (which is complex for typical ML models).

Consider a scenario where you have a deployment with num_replicas=4 and ray_actor_options={"num_cpus": 2}. This means Serve will launch 4 separate Python processes (actors), and each of those processes will be allocated 2 CPU cores by the Ray cluster. If you then send 10 concurrent requests, Serve will distribute these across the 4 replicas. Each replica can handle one request at a time. So, at most, 4 requests are being processed simultaneously. The remaining 6 requests will wait. The 2 CPUs per replica are available to the Python process running that replica, which might be used by the model inference itself if it’s multi-threaded or multi-process internally, or by the Python interpreter and its libraries.

The common pitfall is to solely focus on num_replicas and forget ray_actor_options. If you have a CPU-intensive model that takes 500ms to infer on a single core, and you set num_replicas=2 with num_cpus=1 for each, your two replicas can together handle approximately 4 requests per second (2 replicas * 1 request/replica / 0.5 seconds/request). If you then increase num_replicas to 10 but keep num_cpus=1, you’re still bottlenecked by the total CPU available on your Ray cluster. If your cluster only has 8 cores, you can’t effectively run 10 replicas each demanding 1 full CPU. Ray will try its best, but you’ll see scheduling contention and potentially slowdowns.

The most surprising truth about Ray Serve scaling is that it’s not just about spinning up more identical copies of your model. It’s fundamentally about managing the resource allocation and concurrency boundaries of independent actors. When you set num_cpus for a replica, you’re telling the Ray scheduler to reserve that much CPU for that actor’s process. If your model code itself doesn’t efficiently utilize those allocated CPUs (e.g., it’s single-threaded Python code that does a lot of blocking I/O or computation), the extra CPUs might go unused by the model inference, but they are still unavailable to other actors.

A common optimization, especially for models that are not heavily CPU-bound but might have significant memory footprints or I/O waits, is to increase num_replicas while keeping num_cpus=1. This allows for more requests to be "in flight" across different replicas, hiding latency. Conversely, if your model inference is extremely CPU-intensive and can saturate a single core, you might experiment with num_cpus=2 or num_cpus=4 per replica, but be mindful that Python’s Global Interpreter Lock (GIL) often limits true parallelism for CPU-bound tasks within a single Python process. In such cases, num_replicas with num_cpus=1 is often more effective, distributing the load across more processes.

The next hurdle you’ll likely encounter is managing GPU resources for inference.

Want structured learning?

Take the full Ray course →