A Ray Serve deployment graph can execute arbitrary Python code, not just model inference, because every node in the graph is an ordinary Python class or function.

Let’s see this in action. Imagine you have a pipeline that first preprocesses some data, then runs it through a model, and finally post-processes the results. In Ray Serve, this looks like:

from ray import serve
from ray.serve.handle import DeploymentHandle
from starlette.requests import Request

@serve.deployment
class Preprocessor:
    def __call__(self, data: str) -> str:
        # Simple example: uppercase the data
        return data.upper()

@serve.deployment
class Model:
    def __init__(self):
        # In a real scenario, load your model here
        print("Model loaded.")

    def __call__(self, processed_data: str) -> str:
        # Simulate model inference
        return f"Model processed: {processed_data}"

@serve.deployment
class Postprocessor:
    def __call__(self, model_result: str) -> str:
        return f"Final output: {model_result.replace('Model processed:', 'Inferred result:')}"

@serve.deployment
class Pipeline:
    # Ingress deployment (the name is arbitrary): parses the HTTP request
    # and calls each stage in order through its DeploymentHandle
    def __init__(self, preprocessor: DeploymentHandle, model: DeploymentHandle, postprocessor: DeploymentHandle):
        self.preprocessor = preprocessor
        self.model = model
        self.postprocessor = postprocessor

    async def __call__(self, request: Request) -> str:
        data = (await request.json())["data"]
        processed = await self.preprocessor.remote(data)
        model_result = await self.model.remote(processed)
        return await self.postprocessor.remote(model_result)

# Compose the deployments into an application
app = Pipeline.bind(Preprocessor.bind(), Model.bind(), Postprocessor.bind())

# Deploy the application
serve.run(app)

If you send a POST request to the deployed endpoint with JSON {"data": "hello world"}, you’ll get back:

Final output: Inferred result: HELLO WORLD

The text comes back uppercased because the Preprocessor stage calls upper() on it before the model sees it.

This demonstrates how an ingress deployment chains the other deployments together through their handles, passing the output of one as the input to the next. But the real power comes from understanding that Preprocessor, Model, and Postprocessor are just standard Python classes decorated with @serve.deployment. They can contain any logic you need: data validation, feature engineering, calling external APIs, complex business logic, or even launching other Ray tasks.

The mental model for a Ray Serve deployment graph is a Directed Acyclic Graph (DAG) where nodes are deployments (or individual Python functions) and edges represent the flow of data. Each time Pipeline awaits a handle’s .remote() call and hands the result to the next stage, you’re essentially adding an edge from the output of the preceding deployment to the input of the subsequent one. Serve handles the serialization, deserialization, and network communication between these stages automatically. Each deployment runs as an independent service, allowing for independent scaling and fault tolerance.
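To make the DAG picture concrete, here is a minimal plain-Python sketch that does not use Ray at all. It models nodes as functions and edges as "consumes the output of" relationships, then runs them in topological order. The names NODES, EDGES, and run_graph are made up for illustration and are not part of the Serve API; Serve's real routing and transport are far more involved than this.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Nodes: plain functions standing in for deployments (illustrative only).
def preprocess(data: str) -> str:
    return data.upper()

def infer(text: str) -> str:
    return f"Model processed: {text}"

def postprocess(result: str) -> str:
    return "Final output: " + result.replace("Model processed:", "Inferred result:")

NODES = {"preprocess": preprocess, "infer": infer, "postprocess": postprocess}

# Edges: each node maps to the single upstream node whose output it consumes.
EDGES = {"infer": "preprocess", "postprocess": "infer"}

def run_graph(request_data: str) -> str:
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    order = TopologicalSorter({n: {p} for n, p in EDGES.items()}).static_order()
    outputs = {}
    last = None
    for name in order:
        upstream = EDGES.get(name)
        # A node without an upstream edge reads the original request data.
        arg = outputs[upstream] if upstream else request_data
        outputs[name] = NODES[name](arg)
        last = name
    return outputs[last]

print(run_graph("hello world"))
```

In this toy version the "edge" is just a dictionary lookup. In Serve, the same hop crosses a process boundary, which is why each stage can be replicated on its own.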

You can also express the same pipeline as ordinary Python methods inside a single deployment. This is particularly useful for simpler pipelines or when you want to encapsulate a sequence of operations in one unit.

from ray import serve

@serve.deployment
class MyPipeline:
    def __init__(self):
        # Load model or other resources here
        print("Pipeline initialized.")

    def preprocess(self, data):
        return data.upper()

    def infer(self, processed_data):
        return f"Model processed: {processed_data}"

    def postprocess(self, model_result):
        return f"Final output: {model_result.replace('Model processed:', 'Inferred result:')}"

    async def __call__(self, request):
        # request.json() is a coroutine on Starlette requests, so await it
        data = (await request.json())["data"]
        preprocessed = self.preprocess(data)
        inferred = self.infer(preprocessed)
        final = self.postprocess(inferred)
        return final

# Deploy the single deployment that contains the entire pipeline logic
serve.run(MyPipeline.bind())

The key insight here is that a graph of separate deployments and a single deployment that orchestrates several internal methods produce the same result. The choice between these approaches depends on whether you want to scale and manage each stage independently or treat the entire pipeline as a single unit. Either way, you build multi-model or multi-stage pipelines in plain Python, without manually managing inter-service communication.

What most people don’t realize is that the request object passed to the __call__ method of a deployment that handles HTTP traffic is an instance of starlette.requests.Request. That gives you access to all its features, including query parameters, headers, and even streaming the request body, not just JSON. Its body-reading methods, such as json(), are coroutines and must be awaited.
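Here is a small sketch of those Request features in use. The helper name describe_request, the "lang" query parameter, and the returned dictionary shape are invented for this example. The helper only relies on the query_params, headers, and stream() members that Starlette's Request provides, so it is written against a duck-typed argument rather than importing Starlette.

```python
from typing import Any


async def describe_request(request: Any) -> dict:
    """Summarize a request; `request` is assumed to be a starlette Request."""
    # Query string value, e.g. /?lang=de (falls back to "en" if absent)
    lang = request.query_params.get("lang", "en")
    # Header lookup with a default when the header is missing
    content_type = request.headers.get("content-type", "")
    # Read the body chunk by chunk instead of buffering it all at once
    size = 0
    async for chunk in request.stream():
        size += len(chunk)
    return {"lang": lang, "content_type": content_type, "body_bytes": size}
```

Inside a deployment you would call it from an async __call__, for example `return await describe_request(request)`.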

The next logical step is to explore how to handle conditional logic or branching within these deployment graphs, enabling more complex workflows beyond simple linear sequences.
