Ray GPU Allocation: Fractional GPUs for Shared Workloads (2026)

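Ray can schedule fractions of a GPU, not just whole ones, so multiple tasks or actors can share the same physical GPU. The fraction is a scheduling quantity: Ray uses it to decide how many workloads to pack onto a device, but it does not partition the GPU’s memory or compute.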

Here’s what that looks like in practice. Imagine you have a single NVIDIA V100 with 16GB of VRAM. You want to run two separate inference services, each needing about 6GB of VRAM. Without fractional GPUs, you’d need two physical GPUs, or one would have to wait. With fractional GPUs, you can request num_gpus=0.5 for each service and Ray will place both on the same V100. Ray does this by setting the CUDA_VISIBLE_DEVICES environment variable for each worker process so that both see the same physical device. It does not issue CUDA runtime calls on your behalf, and the two processes genuinely share the GPU rather than each getting a dedicated one.
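A quick way to sanity-check a packing plan like this one is to add up the peak memory of the co-located services before choosing fractions. The sketch below uses the 16GB and 6GB figures from the example; the helper name fits_in_vram and the 1GB headroom for CUDA context overhead are illustrative assumptions, not Ray APIs or measured values.

```python
def fits_in_vram(total_gb: float, per_task_gb: list[float], headroom_gb: float = 1.0) -> bool:
    """Return True if the tasks' combined peak memory fits on one GPU with headroom."""
    return sum(per_task_gb) + headroom_gb <= total_gb


# Two 6 GB inference services on a 16 GB V100: 12 GB + 1 GB headroom fits.
print(fits_in_vram(16.0, [6.0, 6.0]))       # True
# A third 6 GB service would need 18 GB, so it does not fit.
print(fits_in_vram(16.0, [6.0, 6.0, 6.0]))  # False
```

If the check fails, a smaller fraction per task will not help: the fraction only controls how many tasks Ray co-locates, not how much memory each one uses.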

This is crucial for optimizing utilization. Many deep learning workloads, especially inference or smaller training jobs, don’t saturate a full GPU. Leaving whole GPUs idle while others are overloaded is wasteful. Fractional GPUs let you pack more workloads onto existing hardware, reducing costs and improving throughput. Ray treats each GPU as a resource with a capacity of 1.0. When a task requests num_gpus=0.5, Ray finds a GPU with at least 0.5 unclaimed, deducts the request from that GPU’s remaining capacity, and makes the device visible to the task’s worker process. Neither Ray nor the NVIDIA driver confines the process to half of the VRAM or half of the compute units. It is your responsibility to make sure co-located tasks fit in memory together, for example by capping memory usage in PyTorch or TensorFlow.

Here’s a simplified code snippet demonstrating the allocation:

import ray
import torch

# Initialize Ray and declare one GPU as a schedulable resource.
# Ray normally auto-detects GPUs; num_gpus=1 makes the assumption explicit.
if not ray.is_initialized():
    ray.init(num_cpus=4, num_gpus=1)

# Define a simple PyTorch model and task.
# num_gpus=0.3 in the decorator is the actual resource request. It is
# scheduler bookkeeping, not a limit on memory or compute.
@ray.remote(num_gpus=0.3)
def train_model(gpu_fraction: float):
    # Physical GPU ID(s) Ray assigned to this worker, e.g. [0]
    gpu_ids = ray.get_gpu_ids()

    # Ray sets CUDA_VISIBLE_DEVICES for the worker, so the assigned GPU
    # is visible to PyTorch as cuda:0 regardless of its physical index.
    device = torch.device("cuda:0")
    model = torch.nn.Linear(10, 10).to(device)
    print(f"Task running on GPU fraction: {gpu_fraction}, GPU IDs: {gpu_ids}, device: {device}")
    # Simulate some work
    input_tensor = torch.randn(1, 10).to(device)
    output = model(input_tensor)
    print(f"Output shape: {output.shape}")
    return True

# Launch 4 tasks, each requesting 0.3 of a GPU. Total requested: 1.2 GPUs.
# With 1 GPU, three tasks fit at once (0.9) and the fourth waits for a slot.
# With 2 GPUs, all four can run in parallel.
# gpu_fraction is only passed through for logging; the decorator sets the request.
results = [train_model.remote(gpu_fraction=0.3) for _ in range(4)]

# To launch another task requesting a full GPU, override the request:
# results.append(train_model.options(num_gpus=1.0).remote(gpu_fraction=1.0))

# Wait for all tasks; by default ray.wait returns after only one is ready
ready_refs, _ = ray.wait(results, num_returns=len(results))
print(f"Completed {len(ready_refs)} tasks.")

# Shutdown Ray
ray.shutdown()

The core idea is that num_gpus is a logical resource. When you specify num_gpus=0.5 on a Ray task, nothing is passed to the CUDA driver. Ray keeps a ledger of the GPUs on each node, deducts 0.5 from one of them, and starts the task in a worker whose CUDA_VISIBLE_DEVICES is restricted to that GPU. The ray.get_gpu_ids() function within the task returns the physical IDs Ray has assigned. Because CUDA renumbers visible devices from zero, your deep learning framework (like PyTorch or TensorFlow) should address the first assigned GPU as device index 0 rather than using the physical ID directly.

The key configuration lever is how many GPUs Ray believes each node has. Ray auto-detects GPUs by default; on a machine with one V100 you can make this explicit with ray.init(num_gpus=1). Ray then knows it has one physical GPU to manage. When tasks request fractional amounts, the scheduler admits them onto that GPU until its capacity is used up. Two tasks requesting 0.5 each run concurrently on the same GPU, and a third 0.5 request waits until one of them finishes. At any given moment, the sum of admitted fractions on a physical GPU does not exceed 1.0.
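The admission behavior for a single GPU can be modeled in a few lines. This is a deliberately simplified sketch, not Ray’s scheduler: the function name admit and the greedy, single-GPU policy are assumptions made for illustration, and exact fractions are used only to keep the arithmetic free of floating-point drift.

```python
from fractions import Fraction


def admit(requests, gpu_capacity=1):
    """Greedy admission on one GPU: start a request if remaining capacity covers it.

    Returns (running, queued) as lists of request indices.
    """
    free = Fraction(gpu_capacity)
    running, queued = [], []
    for i, req in enumerate(requests):
        need = Fraction(str(req))  # exact decimal arithmetic, e.g. '0.3' -> 3/10
        if need <= free:
            free -= need
            running.append(i)
        else:
            queued.append(i)
    return running, queued


print(admit([0.5, 0.5]))            # ([0, 1], [])
print(admit([0.5, 0.5, 0.5]))       # ([0, 1], [2])
print(admit([0.3, 0.3, 0.3, 0.3]))  # ([0, 1, 2], [3])
```

The last call mirrors four tasks at 0.3 each on one GPU: three are admitted at once and the fourth waits until capacity is released.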

A subtle but critical point is what Ray does not do. It does not modify the driver, intercept CUDA calls, or create virtual GPUs. For a task requesting 0.5 of a GPU, Ray sets CUDA_VISIBLE_DEVICES so that the task’s code can say torch.device("cuda:0") and reach the assigned physical GPU. From there, the processes sharing the device are ordinary CUDA processes: by default their kernels are time-sliced on the GPU, and they all allocate from the same pool of VRAM. There is no isolation between them, so one task that over-allocates can trigger out-of-memory errors in its neighbors. Ray acts as a bookkeeper for GPU capacity here, not as a multiplexer.
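The relationship between the GPU a worker was given and the device index the framework should use can be shown with the CUDA_VISIBLE_DEVICES environment variable alone. A minimal sketch, with hypothetical helper names (visible_gpu_ids, local_device_name) and a simulated environment instead of a real Ray worker. Treating an empty or missing variable as "no GPU assigned" is an assumption of this sketch; outside a managed worker, an unset variable normally means every GPU is visible.

```python
import os


def visible_gpu_ids(env=os.environ):
    """Parse CUDA_VISIBLE_DEVICES into a list of device identifiers (strings)."""
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def local_device_name(env=os.environ):
    """Device string to use inside the process.

    CUDA renumbers visible devices from 0, so the first visible GPU is
    cuda:0 in-process whatever its physical index is.
    """
    return "cuda:0" if visible_gpu_ids(env) else "cpu"


# Simulate a worker that was assigned physical GPU 3.
fake_env = {"CUDA_VISIBLE_DEVICES": "3"}
print(visible_gpu_ids(fake_env))    # ['3']
print(local_device_name(fake_env))  # cuda:0
print(local_device_name({}))        # cpu
```

Two workers that both receive the same value in CUDA_VISIBLE_DEVICES both resolve to cuda:0 and land on the same physical card, which is exactly how sharing happens.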

The actual mechanism is resource accounting in Ray’s scheduler, which tracks the unclaimed fraction of each GPU rather than its memory or compute. When a task requests a fraction X of a GPU, Ray picks a GPU whose remaining capacity is at least X, deducts X, and exposes that GPU to the task through CUDA_VISIBLE_DEVICES. If pending requests would push a GPU past its capacity, Ray queues them until running tasks release their share, so no physical GPU is over-allocated at any instant. Mapping the fraction to a real memory budget is left to you: if you request 0.5, configure your framework so the process stays within roughly half of the VRAM.
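Since the requested fraction is a scheduling quantity, any memory budget has to be enforced inside the process. In PyTorch, one option is torch.cuda.set_per_process_memory_fraction, which caps the caching allocator for the current process. The wrapper below is a hedged sketch: the name cap_gpu_memory is hypothetical, and matching the cap to the num_gpus value is a convention you choose, not something Ray does for you.

```python
def cap_gpu_memory(fraction: float) -> bool:
    """Cap this process's PyTorch allocations to a fraction of GPU memory.

    Returns True if the cap was applied, False if PyTorch or CUDA is unavailable.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    # Device 0 is the first GPU visible to this process.
    torch.cuda.set_per_process_memory_fraction(fraction, device=0)
    return True


applied = cap_gpu_memory(0.5)
print(f"memory cap applied: {applied}")
```

Calling this at the top of a task that requested num_gpus=0.5 makes an over-allocating task fail with its own out-of-memory error instead of starving its neighbor. The cap covers PyTorch’s allocator only, so leave some margin for CUDA context overhead.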

The next hurdle you’ll face is efficiently managing the lifecycle of these fractional GPU tasks, particularly in dynamic environments.