KubeRay autoscaling is not about adding more Ray clusters; it’s about dynamically adjusting the resources within a single Ray cluster based on demand.
Here’s KubeRay autoscaling in action. Imagine you have a Ray cluster running on Kubernetes, managed by KubeRay. You’ve defined an autoscaling configuration that tells KubeRay how to grow and shrink the cluster’s worker nodes.
Let’s say you start with a minimal cluster:
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
name: my-ray-cluster
spec:
rayVersion: "2.9.3"
enableInTreeAutoscaling: true # This is key!
autoscalerOptions:
idleTimeoutSeconds: 60 # How long to wait before scaling down idle nodes
image: rayproject/ray-autoscaler:latest
resources:
cpu: 1 # Autoscaler pod requests 1 CPU
memory: "1Gi" # Autoscaler pod requests 1 GiB of memory
headGroupSpec:
rayStartParams:
dashboard-host: "0.0.0.0"
workerGroupSpecs:
- replicas: 1
minWorkers: 1
maxWorkers: 5
groupName: default
rayStartParams:
num-cpus: 4 # Each worker node has 4 CPUs available to Ray
Initially, minWorkers: 1 means you’ll always have at least one worker node. When you submit a Ray job that requires more resources than available on the current nodes (e.g., you need 8 CPUs, but only have 4 available), KubeRay’s autoscaler kicks in. It sees the pending tasks, checks your maxWorkers: 5 limit, and requests a new Kubernetes Pod (which will become a Ray worker node) from the Kubernetes API. If the Kubernetes cluster has available capacity, it schedules and starts this new pod. Once the pod is ready and Ray registers it, the autoscaler scales down the cluster.
The autoscaler is a separate process running as a Pod within your Ray cluster (specified by autoscalerOptions.image). It continuously monitors the Ray cluster’s state and the Kubernetes API. When it detects that the number of pending Ray tasks exceeds the available resources on the currently running Ray workers, it translates this demand into Kubernetes Pod creation requests. Conversely, if worker nodes are idle for longer than idleTimeoutSeconds, the autoscaler will signal Kubernetes to terminate those Pods, thereby scaling down the Ray cluster.
The core problem KubeRay autoscaling solves is efficiently utilizing Kubernetes resources for Ray workloads. Without it, you’d either overprovision your Kubernetes cluster to handle peak Ray demands (leading to wasted resources during off-peak times) or manually scale your Ray cluster up and down, which is tedious and error-prone. KubeRay automates this, making Ray applications on Kubernetes cost-effective and responsive.
The enableInTreeAutoscaling: true flag is crucial. It tells KubeRay to use the built-in autoscaler logic. If this is false, KubeRay will not attempt to autoscale your worker nodes. The autoscaler image (autoscalerOptions.image) is the container image that runs the autoscaler logic. It needs to be a valid container image that KubeRay can pull and run. The rayStartParams for the worker nodes define how each individual Ray worker process starts, including resource allocation like num-cpus.
A common pitfall is forgetting to set enableInTreeAutoscaling: true. Without it, your minWorkers and maxWorkers settings are essentially decorative; the cluster won’t scale beyond the initial replicas. Another point of confusion is the difference between Kubernetes pod resources (resources.cpu, resources.memory in autoscalerOptions) and Ray worker resources (rayStartParams.num-cpus). The former are for the autoscaler process itself, while the latter is how many CPUs each Ray worker node makes available to Ray tasks.
When scaling down, the autoscaler prioritizes removing idle nodes. It checks the raylet.ip and raylet.port of each worker. If a worker has no active tasks and hasn’t been active for a period exceeding idleTimeoutSeconds, it’s marked for termination. The autoscaler then gracefully tells Ray to shut down that worker, and subsequently requests Kubernetes to delete the corresponding Pod.
The next hurdle you’ll likely encounter is configuring advanced autoscaling scenarios, such as specifying different node types or resource requests for different worker groups.