Anyscale Managed Ray lets you run Ray workloads without ever touching a cluster config file.
Imagine you’ve got a Python script that uses Ray for distributed training. Normally, you’d spin up a cluster, install Ray, and then run your script. With Anyscale, you just upload your script and tell Anyscale to run it. It handles provisioning the underlying infrastructure, installing Ray, and making sure your script runs. You get the benefits of Ray’s distributed computing without the operational overhead.
Let’s see it in action. Suppose you have a simple Ray script train.py:
import os

import ray
import ray.train
import torch
from ray.train.torch import TorchTrainer, prepare_data_loader, prepare_model
from torch.utils.data import DataLoader, TensorDataset

# No explicit ray.init() is needed: trainer.fit() connects to the running
# Ray cluster (or starts a local one) automatically.


def train_epoch(config):
    # prepare_model wraps the model for distributed data-parallel training.
    model = prepare_model(torch.nn.Linear(10, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    # Dummy data
    x = torch.randn(128, 10)
    y = torch.randn(128, 2)
    dataset = TensorDataset(x, y)
    # prepare_data_loader shards the batches across the training workers.
    dataloader = prepare_data_loader(DataLoader(dataset, batch_size=32))

    for _ in range(config.get("epochs", 1)):
        for batch_x, batch_y in dataloader:
            optimizer.zero_grad()
            outputs = model(batch_x)
            loss = loss_fn(outputs, batch_y)
            loss.backward()
            optimizer.step()
        print(f"Epoch finished with loss: {loss.item()}")
        # Ray Train ignores the training function's return value, so metrics
        # are sent back to the trainer with ray.train.report().
        ray.train.report({"loss": loss.item(), "pid": os.getpid()})


# The same entrypoint works for a local test run and as an Anyscale job.
def main():
    trainer = TorchTrainer(
        train_loop_per_worker=train_epoch,
        train_loop_config={"epochs": 2},
        scaling_config=ray.train.ScalingConfig(num_workers=2, use_gpu=False),
    )
    result = trainer.fit()
    print("Training finished.")
    print(f"Final result: {result}")


if __name__ == "__main__":
    main()
To run this on Anyscale, you create a new job in the Anyscale platform (through the UI, the CLI, or the API) and point it at this train.py script. The scaling_config itself stays in the script (here num_workers=2, use_gpu=False); on the Anyscale side you choose the compute the cluster is allowed to use, such as instance types and node counts. Anyscale then provisions a Ray cluster for you, uploads your code, installs the necessary dependencies (which you can list in a requirements.txt file), and runs the script, which in turn calls main(). You can monitor progress, view logs, and see the results directly in the Anyscale dashboard.
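As a rough illustration of "point a job at train.py", the sketch below writes a small job description to YAML text. The field names (name, entrypoint, working_dir, requirements, max_retries) reflect my understanding of Anyscale's job configuration and should be treated as assumptions to check against the current Anyscale documentation; render_job_yaml and the job name are hypothetical helpers invented for this example, and a real project would more likely use a YAML library.

```python
# Hypothetical job description. Field names are assumptions about Anyscale's
# job config format; verify them against the Anyscale docs before relying on them.
job_spec = {
    "name": "train-linear-demo",         # made-up job name
    "entrypoint": "python train.py",     # command executed on the cluster
    "working_dir": ".",                  # local directory uploaded with the job
    "requirements": "requirements.txt",  # extra Python dependencies
    "max_retries": 1,
}


def render_job_yaml(spec: dict) -> str:
    """Render a flat dict of strings/ints as simple 'key: value' YAML lines."""
    lines = []
    for key, value in spec.items():
        if not isinstance(value, (str, int)):
            raise TypeError(f"unsupported value for {key!r}: {value!r}")
        lines.append(f"{key}: {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_job_yaml(job_spec), end="")
```

The resulting text would be saved as a job file and submitted with the Anyscale CLI or API; the exact command and accepted fields depend on your Anyscale version.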
The core problem Anyscale solves is the complexity of managing distributed systems. Setting up and maintaining Ray clusters involves significant engineering effort: configuring VM instances, managing networking, handling node failures, ensuring consistent software versions, and scaling up or down based on workload. Anyscale abstracts all of this away. It provides a managed Ray environment where your Ray applications run as jobs.
Under the hood, Anyscale manages the full lifecycle of your Ray clusters. In a Kubernetes-backed deployment, for example, submitting a job causes Anyscale to provision the necessary pods (a Ray head node and worker nodes), sized to satisfy the resources your scaling_config requests, set up the Ray cluster, and then execute your Python script on it. When the job completes, Anyscale tears down the cluster, so you pay only for the compute time you use.
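The provision, run, tear down lifecycle can be pictured with a toy model. This is not Anyscale's implementation: ephemeral_cluster and run_job are hypothetical names, and nothing here talks to a real cluster. It only shows the ordering guarantee that matters for billing, namely that teardown happens whether the job succeeds or fails.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def ephemeral_cluster(num_workers: int, events: list) -> Iterator[dict]:
    """Toy stand-in for 'provision a cluster, hand it over, tear it down'."""
    events.append(f"provision: 1 head + {num_workers} workers")
    try:
        yield {"num_workers": num_workers}
    finally:
        # Runs on success and on failure, so no nodes are left running.
        events.append("teardown")


def run_job(job: Callable[[dict], None], num_workers: int) -> list:
    """Run `job` on a fresh toy cluster and return the ordered event log."""
    events: list = []
    try:
        with ephemeral_cluster(num_workers, events) as cluster:
            events.append("run entrypoint")
            job(cluster)
            events.append("job succeeded")
    except Exception as exc:
        events.append(f"job failed: {exc}")
    return events


if __name__ == "__main__":
    for event in run_job(lambda cluster: None, num_workers=2):
        print(event)
```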
The key levers you control are ScalingConfig and train_loop_config. ScalingConfig determines the resources for your training run: num_workers sets how many training workers (Ray actors) are launched; resources_per_worker lets you specify the CPU, GPU, memory, or custom resources reserved for each worker; use_gpu is a boolean shortcut that requests one GPU per worker. train_loop_config is a plain dictionary passed to your training function (other Ray libraries, such as Ray Tune, have their own equivalents) and carries the settings of the training loop: epochs, batch size, learning rate, and so on.
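A minimal sketch of how these two levers might be assembled in code. The helper names scaling_kwargs and build_trainer are hypothetical, and the specific values are illustrative. The sketch assumes that resources_per_worker accepts a "memory" entry expressed in bytes, which matches Ray's logical memory resource as I understand it; treat that as an assumption to confirm for your Ray version. Ray is imported lazily so the plain-dictionary part can be inspected without a Ray installation.

```python
GIB = 1024**3


def scaling_kwargs(num_workers, cpus_per_worker=1, memory_gib_per_worker=None,
                   use_gpu=False):
    """Build keyword arguments for ray.train.ScalingConfig (hypothetical helper)."""
    resources = {"CPU": cpus_per_worker}
    if memory_gib_per_worker is not None:
        # Assumption: Ray's logical "memory" resource is given in bytes.
        resources["memory"] = int(memory_gib_per_worker * GIB)
    return {
        "num_workers": num_workers,
        "use_gpu": use_gpu,
        "resources_per_worker": resources,
    }


# Settings for the training loop, received by the training function as `config`.
train_loop_config = {"epochs": 2, "batch_size": 32, "lr": 0.01}


def build_trainer(train_fn):
    """Create a TorchTrainer from the settings above (requires ray[train])."""
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    return TorchTrainer(
        train_loop_per_worker=train_fn,
        train_loop_config=train_loop_config,
        scaling_config=ScalingConfig(
            **scaling_kwargs(2, cpus_per_worker=2, memory_gib_per_worker=4)
        ),
    )
```

Keeping the resource settings in one small function makes it easy to change worker count and per-worker resources together rather than one at a time.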
Most people understand that num_workers increases parallelism. What they often miss is how resources_per_worker interacts with the actual workload. If your model or data loading is memory-bound, increasing num_workers without also reserving enough memory per worker (for example, through a memory entry in resources_per_worker, on nodes sized accordingly in your Anyscale compute configuration) can lead to out-of-memory (OOM) errors on individual workers, even if the total cluster memory is sufficient. The reason is that workers requesting only CPUs can be packed onto the same node until its CPUs run out, regardless of how much memory each one actually uses. Anyscale's managed environment lets you specify these granular resource requests, so each worker gets what it needs.
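The packing argument can be checked with a little arithmetic. The sketch below is a simplified model, not Ray's scheduler: Node, max_workers_on_node, and oom_risk are hypothetical names, and the node size (16 CPUs, 32 GiB) and the 6 GiB peak per worker are made-up numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    cpus: int
    memory_gib: float


def max_workers_on_node(node, cpus_per_worker, memory_gib_per_worker=None):
    """How many workers fit on one node given what each worker *requests*."""
    by_cpu = node.cpus // cpus_per_worker
    if memory_gib_per_worker is None:
        # Without a memory request, only CPUs limit the packing.
        return by_cpu
    by_memory = int(node.memory_gib // memory_gib_per_worker)
    return min(by_cpu, by_memory)


def oom_risk(node, workers_on_node, peak_gib_per_worker):
    """True if the workers' combined peak usage exceeds the node's memory."""
    return workers_on_node * peak_gib_per_worker > node.memory_gib


node = Node(cpus=16, memory_gib=32)

# Each worker requests 1 CPU but really peaks at about 6 GiB of memory.
cpu_only = max_workers_on_node(node, cpus_per_worker=1)
with_memory = max_workers_on_node(node, cpus_per_worker=1, memory_gib_per_worker=6)

print(cpu_only, oom_risk(node, cpu_only, peak_gib_per_worker=6))
print(with_memory, oom_risk(node, with_memory, peak_gib_per_worker=6))
```

With CPU-only requests, 16 workers can land on the node and would need 96 GiB against 32 GiB available. Requesting 6 GiB per worker caps the packing at 5 workers (30 GiB), which fits.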
The next step after running simple training jobs is to explore hyperparameter tuning with Ray Tune on Anyscale.