Ray Serve with FastAPI lets you expose machine learning models as scalable HTTP APIs.
Here’s a look at how it works in practice. Imagine you have a trained XGBoost model that predicts customer churn.
import numpy as np
import xgboost
from fastapi import FastAPI, Request
from ray import serve

app = FastAPI()


@serve.deployment
@serve.ingress(app)
class ChurnPredictor:
    def __init__(self):
        # Load the trained model once, when the replica starts
        self.model = xgboost.Booster(model_file="churn_model.xgb")

    def predict(self, data):
        # Convert the feature rows into a DMatrix
        dmatrix = xgboost.DMatrix(np.asarray(data, dtype=np.float32))
        return self.model.predict(dmatrix)

    @app.get("/health")
    async def health_check(self):
        return {"status": "ok"}

    @app.post("/")
    async def handle(self, request: Request):
        data = await request.json()
        predictions = self.predict(data["features"])
        return {"predictions": predictions.tolist()}


# Bind the deployment into a Serve application
churn_app = ChurnPredictor.bind()
When you run this with serve run your_script:churn_app, Ray Serve starts the ChurnPredictor deployment along with its HTTP proxy. A POST request to / is routed to a replica of ChurnPredictor, where the FastAPI app dispatches it to the handle method. That method reads the JSON payload, extracts the features, passes them to the predict method, and returns the predictions as JSON. The /health endpoint is an ordinary FastAPI route served by the same deployment.
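To exercise the endpoint from a client, something like the following sketch could be used. It assumes the service is reachable at http://127.0.0.1:8000/ (8000 being Serve's default HTTP port), and the helper names build_churn_request and predict_churn are made up for illustration. Only the "features" and "predictions" keys come from the example above.

```python
import json
import urllib.request

# Assumed address: Ray Serve's HTTP proxy listens on port 8000 by default.
SERVE_URL = "http://127.0.0.1:8000/"


def build_churn_request(feature_rows, url=SERVE_URL):
    """Build a POST request whose JSON body carries the "features" key."""
    body = json.dumps({"features": feature_rows}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_churn(feature_rows, url=SERVE_URL, timeout=5.0):
    """Send the request and return the "predictions" list from the response."""
    request = build_churn_request(feature_rows, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["predictions"]
```

With the service running, predict_churn([[12.0, 0.0, 3.0]]) would return one score per feature row. The row shown is hypothetical, and its length has to match the number of features the model was trained on.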
The core problem Ray Serve solves here is the operationalization of ML models. Traditionally, you’d have a separate service for your model, another for your API, and then complex orchestration. Ray Serve unifies this. A "deployment" in Ray Serve is a scalable, independently deployable component of your application. It can be a single model, a pre-processing pipeline, or a complex composed service.
The serve.deployment decorator registers your Python class as a deployable unit. Ray Serve handles the scaling, replication, and fault tolerance. You define the logic within the class’s methods, and Ray Serve makes it accessible via HTTP (or gRPC). FastAPI plugs in as the web framework through the serve.ingress decorator: applying @serve.ingress(app) to the deployment class hands the deployment’s HTTP traffic to that FastAPI app, so routes such as @app.post("/") and @app.get("/health") can be declared directly on the class’s methods. This decorator, rather than an app.mount call, is the key piece that connects FastAPI to your Ray Serve deployment.
The surprising part is how seamlessly Ray Serve handles stateful deployments. Your ChurnPredictor class has an __init__ method where you load the XGBoost model. Ray Serve ensures that this initialization code runs on each replica of your deployment, and the loaded model is available for inference. This means you don’t have to worry about serializing/deserializing models for each request; the model is loaded once per replica.
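The load-once-per-replica behaviour can be illustrated without Ray at all. The sketch below is a plain-Python stand-in, not Ray Serve code: FakeModel and Replica are hypothetical classes, and the round-robin loop is only there to spread calls across the toy replicas.

```python
import itertools


class FakeModel:
    """Hypothetical stand-in for a model that is expensive to load."""

    load_count = 0  # how many times a model has been "loaded"

    def __init__(self):
        FakeModel.load_count += 1

    def predict(self, rows):
        # Toy scoring rule: the mean of each row
        return [sum(row) / len(row) for row in rows]


class Replica:
    """Mimics a replica: state is built once in __init__ and reused per call."""

    def __init__(self):
        self.model = FakeModel()

    def handle(self, rows):
        return self.model.predict(rows)


# Three replicas serve ten requests; the model is loaded three times, not ten.
replicas = [Replica() for _ in range(3)]
picker = itertools.cycle(replicas)
results = [next(picker).handle([[1.0, 3.0]]) for _ in range(10)]
```

After the loop, FakeModel.load_count equals the number of replicas rather than the number of requests, which is the property the paragraph above describes.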
If you want to scale this endpoint, you can simply change the deployment’s configuration:
@serve.deployment(num_replicas=5, ray_actor_options={"num_cpus": 2})
class ChurnPredictor:
    # ... (rest of the class)
This tells Ray Serve to create 5 replicas of ChurnPredictor, each reserving 2 CPUs. Ray Serve will then automatically distribute incoming requests across these replicas.
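The resource arithmetic behind these settings is worth checking before scaling up. The helpers below are a hypothetical sketch (cpus_reserved and max_replicas are not Ray functions), and they ignore anything else that may be running on the cluster.

```python
def cpus_reserved(num_replicas: int, cpus_per_replica: float) -> float:
    """Total CPUs the deployment reserves across all of its replicas."""
    return num_replicas * cpus_per_replica


def max_replicas(cluster_cpus: float, cpus_per_replica: float) -> int:
    """Largest replica count that fits into the given number of CPUs."""
    return int(cluster_cpus // cpus_per_replica)


# The configuration above: 5 replicas with 2 CPUs each
needed = cpus_reserved(5, 2)
# A hypothetical 8-CPU cluster could not host all five replicas
fits = max_replicas(8, 2)
```

Here needed is 10 and fits is 4, so on an 8-CPU cluster one of the five requested replicas would be left waiting for resources.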
The one thing most people don’t realize is where the FastAPI app actually runs. It is not a separate front-end server sitting in front of the model. Ray Serve’s own HTTP proxy accepts each request and hands it to a replica of the ChurnPredictor deployment, chosen by Ray Serve’s internal scheduling and load balancing; the FastAPI app lives inside that replica and dispatches the request to the matching method. This is what allows you to mix standard FastAPI routes with your ML model code within the same application.
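To give a feel for what load balancing across replicas can look like, here is a toy simulation of one common strategy: sample two replicas at random and send the request to the less busy one. This is an illustration only and is not presented as Ray Serve's actual routing code. The function names are hypothetical, and requests in this toy never finish, so the counters only grow.

```python
import random


def pick_replica(in_flight, rng):
    """Sample two distinct replicas and return the index of the less busy one."""
    first, second = rng.sample(range(len(in_flight)), 2)
    return first if in_flight[first] <= in_flight[second] else second


def route_requests(num_requests, num_replicas, seed=0):
    """Route requests one by one and return the per-replica request counts."""
    rng = random.Random(seed)
    in_flight = [0] * num_replicas
    for _ in range(num_requests):
        in_flight[pick_replica(in_flight, rng)] += 1
    return in_flight


# Spread 1000 requests over 5 toy replicas
counts = route_requests(1000, 5)
```

Comparing only two candidates keeps the routing decision cheap while still steering traffic away from busy replicas, which is why this family of strategies is popular for request routers.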
The next step is understanding how to compose multiple deployments for more complex inference pipelines.