Ray’s autoscaler is surprisingly powerful, but it’s not actually scaling your cluster up and down based on machine load such as CPU utilization. It reacts to the resources that queued tasks and actors request.

Let’s get a multi-node Ray cluster up and running on AWS or GCP. This guide assumes you’ve got basic cloud provider familiarity and gcloud or aws CLI set up.

The Core Idea: Head and Workers

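A Ray cluster has a single head node and one or more worker nodes. The head node is the orchestrator. It runs the cluster’s control processes (the global control store and the autoscaler), the Ray dashboard, and a Ray head process. The cluster launcher itself runs on your local machine, not on the head node. Worker nodes run Ray worker processes. They connect back to the head node to join the cluster and receive tasks.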

Setting Up the Environment

First, you need to install Ray on your local machine. This is how you’ll launch and manage the cluster.

pip install "ray[all]"

Next, you’ll need a cloud provider account (AWS or GCP) and their respective CLI tools configured.

AWS Setup

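1. IAM Permissions: Ray needs permissions to create and manage EC2 instances, security groups, and key pairs. By default the launcher also creates an IAM role and instance profile for the head node, so it needs the matching IAM permissions too. The easiest way is to create an IAM user with programmatic access and attach a policy that grants these permissions. A good starting point is the AmazonEC2FullAccess policy (plus IAM permissions for the instance profile), but for production, you’d want a more restricted policy.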

2. Security Group: You need a security group that allows:

  • SSH (port 22) from your IP address.
  • All traffic between nodes in the cluster (for Ray communication).
  • The Ray dashboard port (8265) from your IP address.
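The three rules above can be written down as data before you touch the console. Here is a minimal sketch that builds them in the IpPermissions shape boto3 expects; the function name, the example IP address, and the security group ID are placeholders, not values from a real account.

```python
def build_ingress_rules(my_ip_cidr, security_group_id):
    """Return ingress rules for SSH, the Ray dashboard, and intra-cluster traffic."""

    def tcp_from_me(port):
        # A single TCP port, reachable only from your own address.
        return {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": my_ip_cidr}],
        }

    return [
        tcp_from_me(22),    # SSH
        tcp_from_me(8265),  # Ray dashboard
        # "-1" means all protocols; referencing the group itself allows
        # any traffic between instances that share this security group.
        {"IpProtocol": "-1", "UserIdGroupPairs": [{"GroupId": security_group_id}]},
    ]


# Placeholder values: replace with your own IP and security group ID.
rules = build_ingress_rules("203.0.113.7/32", "sg-0123456789abcdef0")
for rule in rules:
    print(rule)

# To apply the rules (requires boto3 and configured AWS credentials):
# import boto3
# ec2 = boto3.client("ec2", region_name="us-east-1")
# ec2.authorize_security_group_ingress(
#     GroupId="sg-0123456789abcdef0", IpPermissions=rules
# )
```

Keeping the rules as plain dictionaries makes them easy to review before anything is changed in your account.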

3. Key Pair: You’ll need an EC2 key pair to SSH into your instances. Create one in the EC2 console and download the .pem file.

4. Launching the Head Node: Use the ray up command. This command reads a YAML configuration file. Let’s create aws-cluster.yaml:

# aws-cluster.yaml
# Key names follow the Ray cluster launcher schema; verify them against your Ray version.
cluster_name: ray-cluster

# Upper bound on worker nodes across the cluster (the head node is not counted).
max_workers: 5

provider:
  type: aws
  region: us-east-1

auth:
  ssh_user: ubuntu
  # Assuming you've created a key pair named 'my-ray-key' in us-east-1
  # and have the .pem file in ~/.ssh/. Adjust the path if needed.
  # When ssh_private_key is set, KeyName must also be set in each node_config.
  ssh_private_key: ~/.ssh/my-ray-key.pem

available_node_types:
  ray.head.default:
    resources: {}
    node_config:
      InstanceType: m5.large
      KeyName: my-ray-key
      # Replace with your actual security group ID. Ensure it allows SSH,
      # the dashboard port, and inter-node communication.
      # Example: aws ec2 create-security-group --group-name ray-sg --description "Ray cluster security group"
      # Note: You'll need to find the ID after creation, e.g., 'sg-0123456789abcdef0'
      # For simplicity, Ray can create one if you omit this, but it's less controlled.
      # SecurityGroupIds:
      #   - sg-0123456789abcdef0
      # Replace with your existing VPC subnet ID if needed.
      # SubnetIds:
      #   - subnet-0123456789abcdef0
      # Use your custom AMI if you have one. Otherwise, Ray uses a default.
      # ImageId: ami-0abcdef1234567890
  ray.worker.default:
    # Number of worker nodes to start with and keep running.
    # The autoscaler can add more, up to max_workers.
    min_workers: 2
    max_workers: 5
    resources: {}
    node_config:
      InstanceType: m5.large
      KeyName: my-ray-key
      # Optional: tags for your instances (repeat under the head node to tag it too)
      TagSpecifications:
        - ResourceType: instance
          Tags:
            - Key: Project
              Value: ML-Experiment

head_node_type: ray.head.default

Now, launch it:

ray up aws-cluster.yaml

This command will:

  • Create a security group (if the config doesn’t name an existing one).
  • Create an EC2 instance for the head node.
  • Provision the worker nodes that the config asks for at startup.
  • Install Ray on all nodes.
  • Configure the nodes to form a cluster.

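Once it’s up, you can access the Ray dashboard at http://<head-node-public-ip>:8265, provided your security group opens that port to you. Alternatively, ray dashboard aws-cluster.yaml forwards the dashboard over SSH so that it is reachable at http://localhost:8265.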
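A quick way to confirm that the workers really joined is to run a handful of tasks and see which machines executed them. The following is a minimal sketch, meant to run on the head node; the function names and the task count are illustrative choices, and it assumes Ray is installed where it runs.

```python
import socket
from collections import Counter


def report_hostname():
    """Task body: return the name of the machine that executed it."""
    return socket.gethostname()


def tasks_per_host(hostnames):
    """Count how many tasks each machine ran."""
    return dict(Counter(hostnames))


def main(num_tasks=200):
    # Imported here so the helpers above can be used without Ray installed.
    import ray

    # "auto" attaches to the Ray instance already running on this machine.
    ray.init(address="auto")

    # Wrap the plain function as a Ray remote function and fan out tasks.
    remote_task = ray.remote(report_hostname)
    hostnames = ray.get([remote_task.remote() for _ in range(num_tasks)])

    print("Cluster resources:", ray.cluster_resources())
    print("Tasks per host:", tasks_per_host(hostnames))
```

Saved as, say, check_cluster.py (a hypothetical name) with a main() call at the bottom, the script can be sent to the head node with ray submit aws-cluster.yaml check_cluster.py. Tasks this small may all land on one node, so seeing a single hostname does not by itself mean the workers are missing; the cluster resources line is the more reliable signal.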

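5. Scaling and Stopping: The Ray CLI has no separate scale command. To add more workers manually, set min_workers for the worker node type in aws-cluster.yaml to the number you want (for example 5, keeping max_workers at least as large) and re-run ray up, which updates the running cluster:

ray up aws-cluster.yaml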
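Capacity can also be requested from code running inside the cluster, using request_resources from ray.autoscaler.sdk. Below is a minimal sketch; the helper names are made up for illustration, and the figure of 2 CPUs per worker assumes m5.large instances.

```python
def worker_bundles(num_workers, cpus_per_worker):
    """Describe the desired capacity as one CPU bundle per worker-sized slot."""
    return [{"CPU": cpus_per_worker} for _ in range(num_workers)]


def request_capacity(bundles):
    """Ask the autoscaler to make room for the given bundles."""
    # Imported here so worker_bundles can be used without Ray installed.
    import ray
    from ray.autoscaler.sdk import request_resources

    ray.init(address="auto", ignore_reinit_error=True)
    # This is a hint to the autoscaler; it is still capped by max_workers.
    request_resources(bundles=bundles)


# Assumption: each m5.large worker offers 2 CPUs.
bundles = worker_bundles(num_workers=5, cpus_per_worker=2)
print(bundles)
# On the head node: request_capacity(bundles)
```

Bundles that already fit on existing nodes, including the head node, do not trigger new instances, so the number of workers you end up with can be lower than the number of bundles.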

To stop the cluster:

ray down aws-cluster.yaml

GCP Setup

1. Service Account: Create a GCP service account with roles like Compute Instance Admin (v1) and Service Account User. Download the JSON key file.

2. Firewall Rules: You need firewall rules to allow:

  • SSH (port 22) from your IP.
  • All traffic within your VPC network (for Ray).
  • The Ray dashboard port (8265) from your IP.
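If you prefer the command line to the console, these rules map onto three gcloud compute firewall-rules create calls. The sketch below only builds and prints the commands so you can review them first; the rule names, the example addresses, and the default network are assumptions to replace with your own.

```python
import shlex


def firewall_commands(my_ip_cidr, vpc_cidr, network="default"):
    """Build gcloud commands for SSH, intra-VPC, and dashboard ingress rules."""
    specs = [
        ("allow-ray-ssh", "tcp:22", my_ip_cidr),
        ("allow-ray-internal", "tcp:0-65535,udp:0-65535,icmp", vpc_cidr),
        ("allow-ray-dashboard", "tcp:8265", my_ip_cidr),
    ]
    return [
        [
            "gcloud", "compute", "firewall-rules", "create", name,
            f"--network={network}",
            "--direction=INGRESS",
            f"--allow={allow}",
            f"--source-ranges={source}",
        ]
        for name, allow, source in specs
    ]


# Placeholder values: your public IP and your VPC's CIDR range.
for argv in firewall_commands("203.0.113.7/32", "10.128.0.0/20"):
    print(shlex.join(argv))

# To run them instead of printing (requires an authenticated gcloud CLI):
# import subprocess
# for argv in firewall_commands("203.0.113.7/32", "10.128.0.0/20"):
#     subprocess.run(argv, check=True)
```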

3. Launching the Head Node: Create gcp-cluster.yaml:

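# gcp-cluster.yaml
# Key names follow the Ray cluster launcher schema; verify them against your Ray version.
cluster_name: ray-cluster

# Upper bound on worker nodes (the head node is not counted).
max_workers: 5

provider:
  type: gcp
  region: us-central1
  availability_zone: us-central1-a
  # Replace with your GCP project ID.
  project_id: your-gcp-project-id

# The simplest way to pass credentials is the environment. Before ray up:
#   export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/gcp-key.json
# Ensure the service account has the necessary permissions.

auth:
  ssh_user: ubuntu

# Firewall rules are not configured here. Create them on your network with the
# console or gcloud: SSH (22) and the dashboard (8265) from YOUR_IP_ADDRESS/32,
# and all ports from YOUR_VPC_CIDR (e.g., 10.128.0.0/20).

available_node_types:
  ray_head_default:
    resources: {"CPU": 1}
    node_config: &node_config
      machineType: n1-standard-1 # GCP machine type
      disks:
        - boot: true
          autoDelete: true
          type: PERSISTENT
          initializeParams:
            diskSizeGb: 50
            # Image family used in Ray's GCP examples at the time of writing; verify it.
            sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
            # Or use your custom image if you have one:
            # sourceImage: projects/your-gcp-project-id/global/images/your-custom-image-name
      # A custom network or subnetwork, if needed, goes under networkInterfaces.
      # Optional: labels for your instances
      labels:
        ray-project: ml-experiment
  ray_worker_default:
    # Workers to start with and keep running; the autoscaler can add more.
    min_workers: 2
    max_workers: 5
    resources: {"CPU": 1}
    node_config: *node_config

head_node_type: ray_head_default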

Launch it:

ray up gcp-cluster.yaml

This will provision the head and worker nodes on GCP, install Ray, and configure the cluster. Access the dashboard at http://<head-node-external-ip>:8265.

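4. Scaling and Stopping: To add workers, set min_workers for the worker node type in gcp-cluster.yaml to the number you want (keeping max_workers at least as large) and re-run ray up:

ray up gcp-cluster.yaml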

Stop the cluster:

ray down gcp-cluster.yaml

The Autoscaler (A Deeper Dive)

The ray up command also sets up Ray’s autoscaler, which runs on the head node. This is where things get interesting. The autoscaler doesn’t scale based on how busy your machines are (CPU or memory utilization), and it doesn’t look at application metrics. Instead, it scales up based on the resources (CPUs, GPUs, custom resources) requested by queued tasks, actors, and placement groups, and it scales down by removing idle nodes.

Here’s how it works:

  1. Demand Check: The autoscaler periodically collects the resource requests that the cluster cannot currently satisfy.
  2. Scale-Up Decision: If pending requests don’t fit on the existing nodes, it launches enough workers to hold them, up to the configured maximum (max_workers).
  3. Scale-Down Decision: A worker that has been idle (no running tasks, actors, or referenced objects) for idle_timeout_minutes (5 minutes by default) is removed, as long as the cluster stays at or above min_workers.
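To make the decision logic concrete, here is a toy model of demand-driven scale-up and idle-timeout scale-down. It is a simplified sketch, not Ray’s actual scheduler: it considers CPUs only, assumes a single worker type, and every name in it is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkerType:
    cpus: int         # CPUs each worker node offers
    min_workers: int  # never scale below this
    max_workers: int  # never scale above this


def workers_to_launch(pending_cpu_requests, free_cpus_per_node, current_workers, worker):
    """Pack pending requests onto free capacity first, then onto new workers."""
    free = list(free_cpus_per_node)
    new_nodes = []  # remaining free CPUs on each worker we decide to launch
    for request in sorted(pending_cpu_requests, reverse=True):
        placed = False
        for pool in (free, new_nodes):
            for i, available in enumerate(pool):
                if available >= request:
                    pool[i] -= request
                    placed = True
                    break
            if placed:
                break
        if placed:
            continue
        can_grow = current_workers + len(new_nodes) < worker.max_workers
        if request <= worker.cpus and can_grow:
            new_nodes.append(worker.cpus - request)
        # Otherwise the request stays queued: too big for a worker, or at the cap.
    return len(new_nodes)


def workers_to_remove(idle_seconds, idle_timeout_minutes, min_workers):
    """Pick timed-out workers, longest idle first, without going below min_workers."""
    timeout = idle_timeout_minutes * 60
    timed_out = sorted(
        (name for name, seconds in idle_seconds.items() if seconds >= timeout),
        key=lambda name: idle_seconds[name],
        reverse=True,
    )
    removable = max(0, len(idle_seconds) - min_workers)
    return timed_out[:removable]


worker = WorkerType(cpus=2, min_workers=2, max_workers=5)
# Seven 1-CPU tasks are queued and all three existing nodes are fully busy.
print(workers_to_launch([1] * 7, [0, 0, 0], current_workers=2, worker=worker))  # prints 3
# Three workers exist; two have been idle longer than the 5-minute timeout.
print(workers_to_remove({"w1": 30, "w2": 400, "w3": 900}, 5, min_workers=2))  # prints ['w3']
```

In the first call only three workers are launched even though seven tasks are waiting: six tasks fit on the three new 2-CPU workers, and the seventh stays queued because the cluster has reached max_workers. In the second call only one worker is removed, because removing both idle workers would drop the cluster below min_workers.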

This mechanism is designed to save costs by shutting down idle resources. However, it means that if your tasks are very short-lived or sporadic, the autoscaler might not keep up with demand perfectly. You might experience a slight delay as new workers are provisioned.

The autoscaler configuration is part of the cluster YAML file passed to ray up (the cloud-specific YAMLs we used). You can fine-tune parameters like upscaling_speed, idle_timeout_minutes, min_workers, and max_workers.

The next thing you’ll likely run into is configuring persistent storage for your Ray cluster.
