Ray RLlib: Reinforcement Learning Training Guide (2026)

Reinforcement learning agents often train faster in wall-clock time when several copies of the same environment are explored in parallel, each from its own independent starting point.

Let’s watch an RLlib agent train a simple CartPole policy. We’ll use the PPO algorithm, which is a good default for many problems.

import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune.registry import register_env
import gymnasium as gym

# Initialize Ray
ray.init(ignore_reinit_error=True)

# Define a simple environment registration
def env_creator(env_config):
    # CartPole is a classic-control Gymnasium environment, not an Atari game,
    # so it is created with gym.make rather than an Atari wrapper.
    return gym.make("CartPole-v1")

register_env("cartpole", env_creator)

# Configure the PPO algorithm
config = (
    PPOConfig()
    .environment("cartpole")
    .framework("torch") # or "tf2"
    .rollouts(num_rollout_workers=2) # Use 2 parallel workers for data collection
    .training(gamma=0.99, lr=0.0001)
    .evaluation(evaluation_interval=1, evaluation_num_workers=1) # Evaluate periodically
)

# Build and train the algorithm
algo = config.build()

# Train for a few iterations
for i in range(5):
    result = algo.train()
    print(f"Iteration: {i}, Mean Reward: {result['episode_reward_mean']:.2f}")

# Save the trained model
checkpoint_dir = algo.save()
print(f"Checkpoint saved in: {checkpoint_dir}")

# Clean up
algo.stop()
ray.shutdown()

This script initializes Ray, registers the CartPole-v1 environment, and then configures and trains a PPO agent. Setting num_rollout_workers=2 tells RLlib to spin up two separate processes that each run their own instance of CartPole-v1, collect experience, and send it back to the main training process. This parallel data collection is crucial for speeding up training, especially on more complex environments. The evaluation_interval and evaluation_num_workers settings enable periodic evaluation of the current policy on a separate set of workers, without interfering with the training data collection.

The core of RLlib’s power lies in its distributed execution capabilities. When you set num_rollout_workers greater than zero, RLlib runs environment interactions in parallel on Ray actors. Each worker is essentially a separate Python process that receives the latest policy weights from the central learner and uses them to interact with its local environment. The collected trajectories (sequences of states, actions, rewards, and done flags) are then aggregated into a training batch and used to update the policy’s parameters. The framework setting determines whether PyTorch or TensorFlow is used for the neural network models.
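The collect, aggregate, update, broadcast cycle described above can be sketched in plain Python. This is a toy illustration, not RLlib’s implementation: collect_trajectory, run_iteration, the one-parameter "policy", and the random toy environment are all hypothetical names invented for this example, and threads stand in for the separate worker processes so the sketch runs anywhere.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def collect_trajectory(worker_id, weights, num_steps=5):
    """Hypothetical rollout worker: acts in a toy environment with its copy of the weights."""
    rng = random.Random(worker_id)  # each worker has its own independent random stream
    trajectory = []
    state = rng.uniform(-1.0, 1.0)
    for step in range(num_steps):
        # Toy policy: choose action 1 when the weighted state is positive.
        action = 1 if weights["w"] * state > 0 else 0
        # Toy reward: 1.0 when the action matches the sign of the state.
        reward = 1.0 if action == int(state > 0) else 0.0
        done = step == num_steps - 1
        trajectory.append((state, action, reward, done))
        state = rng.uniform(-1.0, 1.0)
    return trajectory


def run_iteration(weights, num_workers=2):
    """Hypothetical learner step: gather trajectories, 'update', return new weights."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Broadcast a copy of the current weights to every worker.
        futures = [
            pool.submit(collect_trajectory, worker_id, dict(weights))
            for worker_id in range(num_workers)
        ]
        # Aggregate all workers' steps into one training batch.
        batch = [step for future in futures for step in future.result()]
    mean_reward = sum(reward for _, _, reward, _ in batch) / len(batch)
    # Placeholder for the gradient update: only the version counter changes here.
    new_weights = {"w": weights["w"], "version": weights["version"] + 1}
    return new_weights, batch, mean_reward


weights = {"w": 1.0, "version": 0}
for _ in range(3):
    weights, batch, mean_reward = run_iteration(weights)
```

Each call to run_iteration plays the role of one training iteration: two workers contribute five steps each, and the version counter stands in for the updated policy that would be sent back out on the next round.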

The PPOConfig object exposes a vast number of hyperparameters. gamma is the discount factor, which determines how much future rewards are valued. lr is the learning rate for the optimizer. Beyond these, you can tune the network architecture (model), the PPO-specific clipping epsilon (clip_param), the amount of data collected per iteration (train_batch_size), the minibatch size used for each gradient step (sgd_minibatch_size), the number of epochs to train on each collected batch (num_sgd_iter), and much more. Understanding these levers is key to optimizing performance. For instance, a smaller lr might lead to more stable training but slower convergence, while a larger lr can speed things up but risks divergence.
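To make a few of these hyperparameters concrete, here is a minimal plain-Python sketch of what gamma and clip_param do mathematically, plus a rough count of gradient steps per iteration. The function names are hypothetical helpers written for this guide, not RLlib APIs, and the update count assumes simple floor division, which may differ from how RLlib handles a partial final minibatch.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward t steps ahead is weighted by gamma**t."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def clipped_surrogate(ratio, advantage, clip_param):
    """PPO clipped objective for one sample.

    ratio is new_policy_prob / old_policy_prob for the action taken.
    """
    clipped_ratio = max(1.0 - clip_param, min(ratio, 1.0 + clip_param))
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


def updates_per_iteration(train_batch_size, sgd_minibatch_size, num_sgd_iter):
    """Approximate number of gradient steps in one training iteration (assumes floor division)."""
    minibatches_per_epoch = train_batch_size // sgd_minibatch_size
    return minibatches_per_epoch * num_sgd_iter
```

For example, discounted_return([1.0, 1.0, 1.0], gamma=0.5) weights the three rewards by 1, 0.5, and 0.25, and clipped_surrogate(1.5, 1.0, 0.2) caps the ratio at 1.2, so a large policy change earns no extra credit.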

RLlib’s configuration system is hierarchical: related settings are grouped under chained methods on the config object. For example, rollouts() controls aspects of data collection, such as the number of workers and the length of each sampled fragment (rollout_fragment_length). training() governs the optimization process, including the training batch size. You can also customize the environment’s observation and action spaces, define custom models, and specify callback functions to run at various points during training.

The algo.train() method is where the magic happens. In each iteration, RLlib collects a batch of experience from the rollout workers, performs gradient updates on the policy network using this data, and then distributes the updated policy weights back to the workers for the next round of data collection. The returned result dictionary provides metrics such as the mean, minimum, and maximum episode reward, the mean episode length, and loss values, which are essential for monitoring training progress.

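The evaluation_num_workers setting is a subtle but powerful feature. When enabled, RLlib spins up dedicated workers for evaluation. These workers use the current learned policy but do not contribute to training data. This gives you a performance estimate at regular intervals that is measured separately from the episodes used for training. Note that evaluation workers inherit the training exploration settings unless you override them; to evaluate without exploration noise, pass explore=False through evaluation_config. For policy-gradient methods like PPO the learned policy may itself be stochastic, so disabling exploration is not always what you want. Either way, it’s a clean separation of concerns that helps in understanding true learning progress.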
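The difference between acting with and without exploration can be illustrated with a small plain-Python sketch. This is not RLlib code: softmax and select_action are hypothetical helpers, and the two-action logits are made-up values for a single fixed state.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into action probabilities."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(logits, explore, rng):
    """Sample from the policy when exploring, otherwise take the most likely action."""
    probs = softmax(logits)
    if explore:
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return max(range(len(probs)), key=probs.__getitem__)


rng = random.Random(0)
logits = [2.0, 1.0]  # hypothetical policy output for one state

greedy_actions = [select_action(logits, False, rng) for _ in range(1000)]
sampled_actions = [select_action(logits, True, rng) for _ in range(1000)]
```

With exploration off, every call returns action 0 because it has the highest probability. With exploration on, action 1 is still chosen a sizeable fraction of the time, which is the noise that can blur an evaluation score.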

The algo.save() method persists the trained model’s weights, optimizer state, and the algorithm’s configuration. This is critical for resuming training later or for deploying the trained agent. When you load a checkpoint, RLlib can reconstruct the entire training state, allowing you to pick up exactly where you left off.
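As a rough mental model of what a checkpoint has to contain in order to resume training, here is a toy save-and-restore sketch in plain Python. It does not reflect RLlib’s actual checkpoint format: save_checkpoint, load_checkpoint, the checkpoint.json file name, and the example state values are all hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(directory, weights, optimizer_state, config, iteration):
    """Write everything needed to resume training into one JSON file (toy format)."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "checkpoint.json")
    state = {
        "weights": weights,
        "optimizer_state": optimizer_state,
        "config": config,
        "iteration": iteration,
    }
    with open(path, "w") as f:
        json.dump(state, f)
    return path


def load_checkpoint(path):
    """Read the saved state back so training can continue from the same iteration."""
    with open(path) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_checkpoint(
        tmp_dir,
        weights={"w": [0.1, -0.2]},
        optimizer_state={"step": 150},
        config={"gamma": 0.99, "lr": 0.0001},
        iteration=5,
    )
    restored = load_checkpoint(saved_path)
```

Restoring only the weights would be enough for deployment, but resuming training also needs the optimizer state and iteration counter, which is why all three are stored together.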

The next step after achieving good performance on a single environment is often to scale up to more complex environments or to explore more advanced algorithms like SAC or IMPALA.