Pairing Ray Serve with vLLM can push LLM inference throughput well beyond what a naive serving loop achieves, but getting there takes more than plugging the two together.

Let’s see it in action. Imagine you have a model, say Llama-2-7b-chat-hf, deployed on Ray Serve. Instead of a simple predict endpoint handling one call at a time, you’re facing a steady stream of requests, and you want to handle hundreds of them concurrently without your GPU crying uncle.

Here’s a simplified Python snippet showing how you might set this up:

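from ray import serve
from starlette.requests import Request
from vllm import LLM, SamplingParams

@serve.deployment(
    num_replicas=2,
    ray_actor_options={"num_gpus": 1},
)
class LLMServe:
    def __init__(self):
        # Load the model inside the replica so each copy initializes vLLM
        # on the GPU that Ray assigned to it (not once in the driver process).
        self.llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
        self.sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

    async def __call__(self, request: Request) -> str:
        # Expects a JSON body of the form {"prompt": "..."}
        body = await request.json()
        # LLM.generate blocks until generation finishes and returns one
        # RequestOutput per prompt. For truly concurrent or streaming
        # requests, vLLM's AsyncLLMEngine is the usual choice.
        results = self.llm.generate(body["prompt"], self.sampling_params)
        return results[0].outputs[0].text

# Deploy the model
app = LLMServe.bind()
serve.run(app)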

This setup looks straightforward, but the magic happens in how Ray Serve orchestrates requests and how vLLM manages its internal state. Ray Serve acts as the traffic cop, distributing incoming requests across your LLMServe replicas. Each replica, in turn, leverages vLLM’s PagedAttention and continuous batching.

PagedAttention is vLLM’s core innovation: a memory management scheme that treats GPU memory the way an OS treats virtual memory. Instead of reserving one contiguous region for each sequence’s KV cache, it splits the cache into fixed-size blocks that do not need to be adjacent. Blocks are allocated only as a sequence grows and can be shared and reused, which sharply reduces fragmentation and increases the number of sequences vLLM can handle concurrently.

Continuous batching takes this further by forming batches dynamically from incoming requests rather than waiting for a fixed batch size. As soon as a request finishes, its resources become available for a new one, eliminating the idle time typical of static batching.
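To make the block-based allocation idea concrete, here is a toy sketch in plain Python. It is not vLLM’s implementation: the `PagedKVAllocator` class, its method names, and the 16-token block size are all illustrative assumptions. It only shows the bookkeeping: each sequence owns a table of block IDs, and blocks are handed out on demand and returned to a shared free list.

```python
import math


class PagedKVAllocator:
    """Toy model of block-based KV cache bookkeeping (hypothetical, not vLLM code)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size                    # tokens per block (assumed value)
        self.free_blocks = list(range(num_blocks))      # IDs of unused blocks
        self.block_tables: dict[str, list[int]] = {}    # seq_id -> block IDs it owns
        self.seq_lens: dict[str, int] = {}              # seq_id -> tokens stored

    def append_tokens(self, seq_id: str, num_tokens: int) -> bool:
        """Grow a sequence by num_tokens; return False if there are too few free blocks."""
        table = self.block_tables.get(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + num_tokens
        needed = math.ceil(new_len / self.block_size) - len(table)
        if needed > len(self.free_blocks):
            return False
        # Blocks need not be contiguous: any free block ID will do.
        table.extend(self.free_blocks.pop() for _ in range(needed))
        self.block_tables[seq_id] = table
        self.seq_lens[seq_id] = new_len
        return True

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

With 8 blocks of 16 tokens, a 40-token sequence takes 3 blocks and a 20-token sequence takes 2. Neither reserves space for its maximum possible length up front, which is where the savings over contiguous allocation come from.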

The num_replicas argument in the @serve.deployment decorator tells Ray Serve how many independent copies of your LLMServe class to run, and ray_actor_options={"num_gpus": 1} reserves one GPU for each replica. Ray Serve’s router then distributes incoming HTTP requests across these replicas. When a replica receives a request, it passes the prompt to the self.llm object, and vLLM’s engine takes over. The engine queues the request and, if the GPU has spare capacity, either starts processing it immediately or merges it into an ongoing batch. Continuous batching keeps the GPU doing useful work by processing multiple sequences in parallel, interleaving their computations, and managing their KV caches with PagedAttention. The SamplingParams object is passed along to control generation settings such as temperature and max_tokens. One caveat: in the simplified snippet above, generate is a blocking call, so merging separate HTTP requests into one running batch in practice relies on vLLM’s asynchronous engine (AsyncLLMEngine) rather than the offline LLM class.
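The difference between static and continuous batching is easy to see in a small simulation. The sketch below is a simplified model, not vLLM’s scheduler: it assumes every sequence produces one token per decode step, ignores prefill cost and memory limits, and uses hypothetical function names. It counts how many decode steps the GPU needs to finish a set of requests with different output lengths.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Fixed batches: each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A finished sequence's slot is refilled from the queue on the next step."""
    waiting = deque(lengths)   # remaining tokens for requests not yet started
    running: list[int] = []    # remaining tokens for requests in the batch
    steps = 0
    while waiting or running:
        # Admit new requests into any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every running sequence emits one token; drop those that are done.
        running = [r - 1 for r in running if r > 1]
    return steps
```

For output lengths of `[100, 10, 10, 10]` and a batch size of 2, the static version needs 110 steps because the second slot sits idle while the 100-token request finishes. The continuous version needs 100 steps because the short requests cycle through the free slot.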

The key to maximizing throughput here isn’t just having more replicas; it’s about how efficiently each replica utilizes its GPU. vLLM’s architecture is designed to keep the GPU compute units fed with work by minimizing memory overhead and maximizing parallelism. Ray Serve then provides the robust, scalable serving layer to handle the load.

Most people focus on num_gpus or num_replicas when scaling, but the memory footprint of the KV cache across all active sequences is the primary bottleneck that vLLM addresses. This is why you can often achieve higher concurrency with fewer, more efficiently managed GPUs than you might expect with other inference frameworks.
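A quick back-of-the-envelope calculation shows why the KV cache dominates. The sketch below uses Llama-2-7B’s published shape (32 layers, 32 attention heads, head dimension 128) with fp16 values. The 24 GiB GPU, the roughly 13 GiB of weights, and the 0.9 memory utilization fraction are illustrative assumptions, and activation memory is ignored, so treat the result as a rough upper bound.

```python
GIB = 2**30


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies (factor 2 = one key + one value)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_mem_bytes: float, weight_bytes: float,
                      per_token_bytes: int, utilization: float = 0.9) -> int:
    """Rough count of tokens whose KV cache fits after loading the weights."""
    budget = gpu_mem_bytes * utilization - weight_bytes
    return int(budget // per_token_bytes)


# Llama-2-7B in fp16: 32 layers, 32 KV heads, head_dim 128
per_token = kv_cache_bytes_per_token(32, 32, 128)   # 524288 bytes = 0.5 MiB
per_sequence = per_token * 4096                      # 2 GiB at a full 4096-token context

# Assumed hardware: 24 GiB GPU, ~13 GiB of weights, 90% of memory usable
tokens = max_cached_tokens(24 * GIB, 13 * GIB, per_token)
```

At half a MiB per token, a single sequence at full context would need 2 GiB of cache. Reserving that much contiguously for every request, whether or not it is used, is exactly the waste that on-demand block allocation avoids.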

Once you’ve mastered this, the next hurdle is managing dynamic batching across multiple nodes in a distributed Ray cluster, especially when dealing with varying sequence lengths.
