Ray can churn through compute-intensive tasks, but the compute costs can pile up faster than you can say "distributed training." Here’s how to slash those bills by running your workers on AWS Spot Instances.
Imagine you’re running a massive Ray cluster for a demanding machine learning workload. On-demand instances are reliable but expensive. Spot Instances, on the other hand, offer spare AWS compute capacity at a fraction of the on-demand price, but with a catch: AWS can reclaim them with a two-minute warning. The trick is to make your Ray cluster resilient to these interruptions.
Setting Up a Spot-Enabled Ray Cluster
The core idea is to configure your Ray cluster to launch worker nodes on Spot Instances. When a Spot Instance is reclaimed, Ray’s distributed nature allows it to continue operating, albeit with reduced capacity, until replacement nodes can be provisioned.
Here’s how you’d typically set this up using ray up with a cluster configuration file.
First, create a cluster.yaml file:
# cluster.yaml
cluster_name: ray-spot-cluster
provider:
  type: aws
  region: us-east-1
  availability_zones:
    - us-east-1a
    - us-east-1b
    - us-east-1c
  # Configure your VPC and subnet IDs here
  # vpc_id: vpc-xxxxxxxxxxxxxxxxx
  # subnet_ids:
  #   - subnet-xxxxxxxxxxxxxxxxx
  #   - subnet-yyyyyyyyyyyyyyyyy
head_node:
  instance_type: m5.large
  disk_size_gb: 50
  # You can also run the head node on spot, but it's generally
  # recommended to keep it on-demand for stability.
  # spot_instance: True
  # spot_max_price: 0.05 # Example: Max price per hour
worker_nodes:
  # This is where the magic happens for spot instances
  instance_type: m5.large
  min_workers: 2
  max_workers: 10
  disk_size_gb: 50
  # Set spot_instance to True to enable spot pricing for workers
  spot_instance: True
  # Define the maximum price you're willing to pay per hour for a spot instance.
  # This should be less than the on-demand price.
  # A good starting point is roughly 20-40% of the on-demand price.
  # Check current on-demand prices for m5.large in us-east-1:
  # https://aws.amazon.com/ec2/pricing/on-demand/
  # For m5.large in us-east-1, on-demand might be ~$0.096.
  # Let's set a max price of $0.04 (about 42% of on-demand).
  spot_max_price: 0.04
  # No interruption-tolerance threshold is set here. For an initial setup,
  # we rely on Ray's automatic retry mechanisms and let it replace
  # interrupted workers. Ray handles provisioning and deprovisioning,
  # and it can tolerate worker failures and restarts.
# Optional: Define IAM role for your cluster
# iam_instance_profile: arn:aws:iam::YOUR_ACCOUNT_ID:instance-profile/YOUR_RAY_INSTANCE_PROFILE
# Optional: Specify a custom AMI if needed
# image_id: ami-0abcdef1234567890
With this cluster.yaml, you can launch your cluster:
ray up cluster.yaml
Ray will then provision the head node (on-demand by default) and attempt to launch worker nodes using Spot Instances, respecting your spot_max_price.
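A caveat on key names: the file above is simplified, and the exact schema depends on your Ray version. In the AWS example configs that ship with recent Ray releases, per-node settings live under available_node_types, and each node_config is passed through to EC2, so Spot is requested with EC2's own InstanceMarketOptions block. Here is a rough sketch of that layout as a Python dict. The helper name spot_worker_type and the node type names head and spot_worker are made up for illustration, so check the keys against the cluster launcher docs for your version before relying on them.

```python
import json


def spot_worker_type(instance_type, max_price, min_workers, max_workers):
    """Build one worker entry for available_node_types (hypothetical helper)."""
    return {
        "min_workers": min_workers,
        "max_workers": max_workers,
        # node_config is forwarded to EC2, so these are EC2 launch fields.
        "node_config": {
            "InstanceType": instance_type,
            "InstanceMarketOptions": {
                "MarketType": "spot",
                # EC2 takes the maximum hourly price as a string.
                "SpotOptions": {"MaxPrice": f"{max_price:.4f}"},
            },
        },
    }


def build_cluster_config():
    """Assemble a cluster config dict mirroring the article's example values."""
    return {
        "cluster_name": "ray-spot-cluster",
        "max_workers": 10,
        "provider": {
            "type": "aws",
            "region": "us-east-1",
            "availability_zone": "us-east-1a,us-east-1b,us-east-1c",
        },
        "available_node_types": {
            # The head node stays on-demand: no InstanceMarketOptions.
            "head": {"node_config": {"InstanceType": "m5.large"}},
            "spot_worker": spot_worker_type("m5.large", 0.04, 2, 10),
        },
        "head_node_type": "head",
    }


if __name__ == "__main__":
    # Print the structure for inspection before translating it to YAML.
    print(json.dumps(build_cluster_config(), indent=2))
```

Leaving InstanceMarketOptions off the head node is what keeps it on-demand in this layout.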
How Ray Handles Spot Interruptions
When AWS reclaims a Spot Instance, it sends a two-minute warning. Ray’s internal mechanisms and the underlying cloud provider integration are designed to handle this gracefully.
- Interruption Notice: AWS sends a notification to the instance.
- Graceful Shutdown (Attempted): If your application is configured to listen for these signals (e.g., via EC2 instance metadata), it can attempt to save its state or finish ongoing tasks. If it isn’t, any Ray tasks still executing on that node are simply cut off when the instance goes away.
- Node Failure Detection: The Ray head node detects that a worker has become unresponsive.
- Task Re-scheduling: Ray’s scheduler identifies the tasks that were running on the failed worker and re-schedules them onto other available workers in the cluster. If there are no other available workers, these tasks will wait until a new worker is provisioned.
- Replacement Provisioning: Ray’s cluster autoscaler, if configured, will detect the reduced capacity and attempt to launch a new Spot Instance to replace the one that was terminated. This process can take a few minutes.
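The crucial aspect is that Ray’s distributed nature means a single worker failure doesn’t bring down the entire cluster. By default, Ray retries tasks that were lost to a node failure a limited number of times, and the scheduler is robust enough to handle dynamic changes in cluster membership. State that lives only in a worker’s memory is a different story: it is lost with the node unless you save it somewhere else.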
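To make the re-scheduling steps concrete, here is a toy model in plain Python. It is not Ray's scheduler, and the ToyCluster class and its methods are invented for this illustration. It only mimics the behavior described above: tasks on a reclaimed worker move to surviving workers, or wait in a queue until a replacement arrives.

```python
from collections import deque


class ToyCluster:
    """Toy model of task re-scheduling; not Ray's actual scheduler."""

    def __init__(self, workers):
        self.running = {w: [] for w in workers}  # worker -> tasks in flight
        self.pending = deque()                   # tasks waiting for capacity

    def submit(self, task):
        """Place a task on the least-loaded worker, or queue it if none exist."""
        if not self.running:
            self.pending.append(task)
            return None
        worker = min(self.running, key=lambda w: len(self.running[w]))
        self.running[worker].append(task)
        return worker

    def reclaim(self, worker):
        """Simulate a Spot reclaim: drop the worker and re-submit its tasks."""
        lost = self.running.pop(worker)
        for task in lost:
            self.submit(task)
        return lost

    def add_worker(self, worker):
        """Simulate the autoscaler bringing up a replacement node."""
        self.running[worker] = []
        while self.pending:
            self.submit(self.pending.popleft())


if __name__ == "__main__":
    cluster = ToyCluster(["w1", "w2"])
    for task in ["a", "b", "c", "d"]:
        cluster.submit(task)
    cluster.reclaim("w1")      # w1's tasks move to w2
    cluster.reclaim("w2")      # no workers left: everything waits
    cluster.add_worker("w3")   # replacement picks up the queue
    print(cluster.running)
```

The real system re-executes the interrupted tasks from the start, which is why the checkpointing advice later in this article matters for long-running work.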
Key Parameters for Spot Instances
- spot_instance: True: This is the primary flag to enable Spot Instance usage for a node group.
- spot_max_price: This defines the maximum hourly price you’re willing to pay. If the current Spot price exceeds this, your instance won’t launch or will be terminated. It’s essential to set this below the on-demand price. For m5.large in us-east-1, the on-demand price is around $0.096/hour. Setting spot_max_price: 0.04 (about 42% of on-demand) is a good starting point for significant savings. Always check the latest on-demand prices for your chosen region and instance type.
- instance_type: Choose instance types that are commonly available as Spot Instances. Popular general-purpose (M series), compute-optimized (C series), and memory-optimized (R series) instances are usually good candidates.
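The price arithmetic is easy to get wrong by a decimal place, so here is a small sketch of it. The function names are hypothetical, and the $0.096 on-demand figure is only this article's example, not a live price. Keep in mind that the max price is a cap: you are billed the current Spot price, so the savings computed at the cap are the worst case.

```python
def spot_budget(on_demand_price, fraction):
    """Return a max Spot price as a fraction of the on-demand hourly price."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return round(on_demand_price * fraction, 4)


def savings_pct(on_demand_price, spot_price):
    """Percentage saved per instance-hour relative to on-demand."""
    return round(100 * (1 - spot_price / on_demand_price), 1)


if __name__ == "__main__":
    on_demand = 0.096  # example m5.large price from this article
    print(spot_budget(on_demand, 0.40))   # cap at 40% of on-demand
    print(savings_pct(on_demand, 0.04))   # savings if billed at the $0.04 cap
```

With the article's numbers, a $0.04 cap means each worker-hour costs at most about 42% of on-demand, so the saving per worker-hour is a little over 58% or better.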
Optimizing for Cost and Resilience
- Diversify Instance Types: Don’t rely on a single instance type for your workers. If one type becomes too expensive or unavailable, Ray can potentially launch others if your cluster.yaml is configured to allow multiple types (though ray up typically provisions one type per group).
- Use the Right Region/AZ: Spot Instance prices vary by region and Availability Zone. Check AWS Spot Instance Advisor to find the most cost-effective options.
- Tune max_workers: Set max_workers to a level that balances your performance needs with your budget. You might not always need the maximum number of workers if many are frequently interrupted.
- Application-Level Checkpointing: For long-running, critical tasks, implement application-level checkpointing. This allows your tasks to resume from a saved state if they are interrupted, rather than starting from scratch. Ray tasks can be designed to handle this by passing state or references to saved data.
- Handle the Two-Minute Warning: For applications that need to perform immediate cleanup or save state upon interruption, listen for the EC2 Spot Instance termination notice. This can be done by polling the instance metadata service for the scheduled termination time (the spot/instance-action or spot/termination-time paths) or by using signal handlers for SIGTERM. Ray’s underlying infrastructure attempts to handle this, but explicit application-level handling can be more robust.
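For the two-minute warning, here is a minimal polling sketch using only the standard library. It assumes the code runs on the Spot instance itself and that IMDSv2 (token-based metadata access) is enabled. The function names are my own. The metadata service returns a 404 on spot/instance-action until an interruption is scheduled, then a small JSON document with the action and its time.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def parse_instance_action(body):
    """Parse the JSON served at spot/instance-action into (action, UTC time)."""
    doc = json.loads(body)
    when = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    return doc["action"], when.replace(tzinfo=timezone.utc)


def fetch_instance_action(timeout=1.0):
    """Return (action, time) if an interruption is scheduled, else None.

    Only works on an EC2 instance; elsewhere the request fails or times out.
    """
    # IMDSv2: obtain a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return parse_instance_action(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled yet
            return None
        raise
```

A background thread could call fetch_instance_action every few seconds and trigger a checkpoint when it returns something other than None. Polling much less often than that eats into the two minutes you have.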
The Next Hurdle: State Management
While Spot Instances offer huge cost savings, a frequent interruption rate can still impact your application’s throughput. You’ll soon find yourself needing to manage the state of your distributed computations more explicitly.
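As a first step in that direction, here is a sketch of application-level checkpointing in plain Python. The function names and the JSON file format are illustrative choices, not a Ray API, and the stop_after argument exists only to simulate an interruption. On a real Spot worker the checkpoint path would need to point at storage that outlives the instance, such as a shared filesystem or an object store, which this sketch does not cover.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interruption never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename within the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"next_item": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def process(items, path, checkpoint_every=100, stop_after=None):
    """Sum items, resuming from the last checkpoint.

    stop_after simulates a reclaim: processing halts before that index
    and returns None without saving the in-memory progress.
    """
    state = load_checkpoint(path)
    for i in range(state["next_item"], len(items)):
        if stop_after is not None and i >= stop_after:
            return None
        state["total"] += items[i]
        state["next_item"] = i + 1
        if state["next_item"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the first run is cut off at item 250, the second run restarts from the checkpoint at item 200 rather than from zero. The work between the last checkpoint and the interruption is redone, so checkpoint_every is a trade-off between checkpoint overhead and repeated work.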