Ray Autoscaler: Scale Cloud Clusters Automatically (2026)

Ray’s autoscaler is designed to dynamically adjust the number of nodes in your Ray cluster based on the workload, aiming to optimize resource utilization and cost.

Let’s see it in action. Imagine you have a Ray cluster running, and you submit a set of tasks that require more resources than currently available.

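import time

import ray

# Assuming a cluster is already running; connect to it
# ray.init(address="auto")  # Or your cluster address

# Each task reserves 4 CPUs and 1 GPU while it sleeps for 10 seconds
@ray.remote(num_cpus=4, num_gpus=1)
def heavy_task():
    time.sleep(10)

# Submit all 100 tasks up front so the unmet demand is visible,
# then block until every task has finished
refs = [heavy_task.remote() for _ in range(100)]
ray.get(refs)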

When these tasks are submitted, Ray’s internal scheduler will try to find available resources. If it can’t, it will signal the autoscaler. The autoscaler, observing the pending tasks and resource requests, will then decide to add new nodes to the cluster. Conversely, if the cluster has been idle or underutilized for a period, the autoscaler will detect this and scale down by removing nodes, saving you money.

The core problem the autoscaler solves is the mismatch between static cluster provisioning and dynamic, unpredictable workloads. Traditionally, you’d either over-provision to handle peak loads (and waste money during off-peak times) or under-provision and suffer performance degradation or task failures. The autoscaler bridges this gap by treating your cluster size as a variable that adapts to demand.

Internally, the autoscaler operates on a feedback loop. It continuously monitors the Ray scheduler’s queue for pending tasks that cannot be scheduled due to insufficient resources. It also tracks the utilization of existing nodes. Based on these metrics, it consults a configuration file (often autoscaler.yaml) to decide whether to scale up or down.
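The feedback loop described above can be sketched as a single decision function. This is a toy model, not Ray's implementation: ClusterState, decide, and the field names are hypothetical, and the rule that pending work always wins over scale-down is a simplification.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ClusterState:
    """Hypothetical snapshot inspected on each pass of the loop."""
    pending_demands: int            # tasks waiting because no node can fit them
    idle_minutes: Dict[str, float]  # worker name -> minutes since it last did work


def decide(state: ClusterState, idle_timeout_minutes: float) -> Tuple[str, List[str]]:
    """Return the action ("scale_up", "scale_down" or "hold") and workers to remove."""
    if state.pending_demands > 0:
        # Unschedulable work takes priority: ask for more capacity.
        return "scale_up", []
    expired = sorted(
        name for name, idle in state.idle_minutes.items()
        if idle >= idle_timeout_minutes
    )
    if expired:
        # No pending work and some workers have been idle long enough.
        return "scale_down", expired
    return "hold", []
```

Calling decide once per monitoring interval with a fresh snapshot gives the loop its shape: observe, compare against the configured thresholds, act.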

The autoscaler.yaml file is your primary lever. Here’s a simplified example:

cluster_name: my-ray-cluster

max_workers: 10 # Max number of worker nodes across all node types
upscaling_speed: 1.0 # How aggressively to scale up
idle_timeout_minutes: 5 # How long to wait before scaling down idle workers

provider:
    type: aws
    region: us-east-1
    availability_zone: us-east-1a,us-east-1b

available_node_types:
    head_node:
        node_config:
            InstanceType: m5.large
            ImageId: ami-0abcdef1234567890

    cpu_worker:
        node_config:
            InstanceType: c5.xlarge
            ImageId: ami-0abcdef1234567890
        min_workers: 0
        max_workers: 5

    gpu_worker:
        node_config:
            InstanceType: p3.2xlarge # GPU instance
            ImageId: ami-0abcdef1234567890
        min_workers: 0
        max_workers: 2
        # Setup commands for GPU worker nodes only
        worker_setup_commands:
            - echo "Ray GPU worker node setup"

head_node_type: head_node

# Commands to run on the head node
head_setup_commands:
    - echo "Ray head node setup"

# Commands to run on worker nodes
worker_setup_commands:
    - echo "Ray worker node setup"

In this configuration:

provider: Specifies your cloud provider (AWS in this case), region, and availability zones (a comma-separated string under availability_zone).
available_node_types: A map of named node configurations. You can have multiple types, for example, CPU-optimized and GPU-optimized. Each node_config is handed to the cloud provider when an instance is launched.
- min_workers and max_workers: Define the bounds for each worker node type.
head_node_type: Names the entry in available_node_types that is used for the head node.
max_workers (top level): Caps the total number of worker nodes across all types.
upscaling_speed: A cluster-wide setting that controls how quickly new nodes are provisioned. A higher value means faster scaling.
idle_timeout_minutes: The crucial setting for scaling down. If a worker node remains idle for this duration, it’s a candidate for termination.

The autoscaler doesn’t just blindly add nodes when there’s a pending task. It has a sophisticated algorithm that considers the number of pending tasks, the resource requirements of those tasks (CPUs, GPUs, memory), and the current number and types of available nodes. It tries to provision the right type of node to satisfy the pending requests efficiently. For instance, if you have many pending tasks requesting GPUs, it will prioritize adding GPU instances.
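To make the idea of matching node types to pending requests concrete, here is a minimal first-fit packing sketch. It is not Ray's scheduler: plan_nodes and fits are hypothetical helpers, the labels cpu_worker and gpu_worker are illustrative names, and the resource shapes assume a c5.xlarge exposes 4 CPUs and a p3.2xlarge exposes 8 CPUs and 1 GPU.

```python
from typing import Dict, List, Tuple

Resources = Dict[str, float]

# Assumed resource shapes for the two worker instance types in the example.
NODE_TYPES: Dict[str, Resources] = {
    "cpu_worker": {"CPU": 4},
    "gpu_worker": {"CPU": 8, "GPU": 1},
}


def fits(demand: Resources, free: Resources) -> bool:
    """True if every requested resource is available in `free`."""
    return all(free.get(key, 0) >= amount for key, amount in demand.items())


def plan_nodes(demands: List[Resources]) -> Dict[str, int]:
    """First-fit: reuse an already planned node, else add the smallest type that fits."""
    planned: List[Tuple[str, Resources]] = []  # (type name, remaining free resources)
    smallest_first = sorted(NODE_TYPES.items(), key=lambda kv: sum(kv[1].values()))
    for demand in demands:
        for _, free in planned:
            if fits(demand, free):
                break
        else:
            for name, shape in smallest_first:
                if fits(demand, shape):
                    free = dict(shape)
                    planned.append((name, free))
                    break
            else:
                raise ValueError(f"no node type can satisfy {demand}")
        # Reserve the demand on the chosen node.
        for key, amount in demand.items():
            free[key] -= amount
    counts: Dict[str, int] = {}
    for name, _ in planned:
        counts[name] = counts.get(name, 0) + 1
    return counts
```

With two GPU tasks and three small CPU tasks pending, this sketch plans two GPU nodes and places the CPU tasks in their spare capacity instead of launching a separate CPU node.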

A common point of confusion is how upscaling_speed interacts with idle_timeout_minutes. upscaling_speed governs the rate at which new nodes are added when demand is high. It caps the number of nodes that may be pending launch at any moment, expressed as a multiple of the current node count. A speed of 1.0 lets the cluster grow by at most 100% at a time (a 20-node cluster can have up to 20 launches in flight), while a speed of 0.1 limits growth to about 10% of the current size per step. Very small clusters are still allowed a handful of launches so they can get off the ground. idle_timeout_minutes dictates when nodes are considered for removal due to inactivity. The autoscaler will only scale down if there are no pending tasks and nodes have been idle for the specified timeout.
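The arithmetic behind upscaling_speed can be worked through in a few lines. This sketch follows the reading given in Ray's cluster configuration reference, where the setting is the number of nodes allowed to be pending as a multiple of the current node count, with a minimum of five when scaling up. The function names are hypothetical, and the assumption that every launch completes before the next round is a simplification.

```python
def max_pending_launches(current_nodes: int, upscaling_speed: float, floor: int = 5) -> int:
    """Cap on in-flight node launches; `floor` is the assumed small-cluster minimum."""
    return max(floor, int(upscaling_speed * current_nodes))


def rounds_to_reach(target: int, start: int, upscaling_speed: float) -> int:
    """Scaling rounds needed to grow from `start` to `target` nodes,
    assuming all launches from one round finish before the next begins."""
    nodes, rounds = start, 0
    while nodes < target:
        nodes = min(target, nodes + max_pending_launches(nodes, upscaling_speed))
        rounds += 1
    return rounds
```

Under these assumptions a 20-node cluster at speed 1.0 may have 20 launches pending, and an empty cluster needs four rounds (5, 10, 20, 40) to reach 40 nodes.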

The autoscaler uses a concept of "resource budgets" to manage scaling. When Ray needs to scale up, it calculates the total resources required by pending tasks and compares them to the available resources. If there’s a deficit, it determines how many nodes of which type are needed. The max_workers limits (one per node type, plus the cluster-wide cap on total workers) act as hard limits on this budget. It also respects the min_workers setting, ensuring at least that many nodes of each type are kept running.
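The way these limits bound a scaling decision can be illustrated with a small clamping function. This is a sketch under stated assumptions: clamp_targets is a hypothetical helper, cluster_max_workers stands in for the cluster-wide cap, and trimming the type with the most headroom first is an arbitrary policy chosen for the example, not Ray's behavior.

```python
from typing import Dict, Tuple


def clamp_targets(
    desired: Dict[str, int],
    bounds: Dict[str, Tuple[int, int]],
    cluster_max_workers: int,
) -> Dict[str, int]:
    """Apply per-type (min_workers, max_workers), then trim to the cluster-wide cap."""
    clamped = {
        name: min(max(desired.get(name, 0), low), high)
        for name, (low, high) in bounds.items()
    }
    overflow = sum(clamped.values()) - cluster_max_workers
    # Trim the types with the most headroom above their minimum first.
    by_headroom = sorted(clamped, key=lambda n: clamped[n] - bounds[n][0], reverse=True)
    for name in by_headroom:
        if overflow <= 0:
            break
        cut = min(overflow, clamped[name] - bounds[name][0])
        clamped[name] -= cut
        overflow -= cut
    return clamped
```

For example, a request for nine CPU workers against bounds of (0, 5) is cut to five, and a type with min_workers of 1 keeps one node even when nothing is requested.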

The autoscaler’s decision-making process is also influenced by its update interval, which determines how frequently it re-evaluates the cluster’s state and makes scaling decisions. A shorter interval means quicker responses to changing demand but also more frequent API calls to your cloud provider, which can hit API rate limits if the interval is set too low.

The autoscaler is an opt-in feature. It is typically enabled by launching the cluster with ray up autoscaler.yaml, which starts the head node with ray start --head and passes the configuration through the --autoscaling-config option. Once running, it continuously monitors the system, making it a powerful tool for managing Ray clusters in dynamic cloud environments.

When you have successfully configured and are running your autoscaler, the next operational challenge will be monitoring its performance and troubleshooting scaling decisions.