A Ray cluster isn’t just a bunch of machines running Ray; it’s a precisely orchestrated system where the "head" node is the conductor and the "worker" nodes are the orchestra members, each with a critical, distinct role.

Let’s see this in action. Imagine you have a simple Ray script that launches a few tasks.

# This would run on the head node or be submitted to it.
import ray

# Initialize Ray, specifying the head node's address if not local.
# If running locally, `ray.init()` is enough.
# If connecting to an existing cluster:
# ray.init(address="auto") # Or specify the head node's IP: ray.init(address="ray://<head_node_ip>:10001")

@ray.remote
def my_task(x):
    return x * 2

if __name__ == "__main__":
    ray.init() # For local testing, or connect to cluster

    # Submit tasks
    futures = [my_task.remote(i) for i in range(5)]

    # Retrieve results
    results = ray.get(futures)
    print(results)

    ray.shutdown()

When you run this (or submit it to a cluster), Ray’s magic happens. The ray.init() call is where the connection to the cluster’s control plane is established. On the head node, this starts the Ray Client, Ray Dashboard, and Traffic Controller. The worker nodes, on the other hand, start their Raylets, GCS Clients, and Task Schedulers.

The core problem Ray solves is distributed execution: running Python code across multiple machines efficiently and reliably. It abstracts away the network communication, serialization, and scheduling complexities. The head node is the central point of intelligence and coordination, while worker nodes are where the actual computation happens.

Here’s how the pieces fit together:

  • Head Node:

    • Ray Client: This is the entry point for your application. When you run ray.init(), you’re interacting with the Ray Client on the head node (or the head node itself if running locally). It translates your Python calls into commands for the cluster.
    • Ray Dashboard: A web-based UI (usually at http://<head_node_ip>:8265) that provides visibility into the cluster’s state, running actors, tasks, logs, and resource utilization. It’s invaluable for debugging and monitoring.
    • Traffic Controller: This component on the head node manages incoming connections from clients and worker nodes, directing traffic to the appropriate internal Ray services.
    • Global Control Store (GCS) Client (on Head): While GCS is a distributed service, the head node often runs a client that acts as a primary interface for certain cluster-wide metadata and state.
  • Worker Nodes:

    • Raylet: This is the most crucial component on each worker node. It’s the local agent responsible for managing resources on that specific machine, scheduling tasks, and communicating with the head node’s scheduler. It also manages the lifecycle of actors running on its node.
    • GCS Client (on Workers): Workers also have GCS clients to access cluster-wide information managed by GCS, like the location of actors or task dependencies.
    • Task Scheduler (on Workers): Each worker node has a local scheduler that works under the direction of the head node’s scheduler to execute tasks and actors assigned to it.

When you submit my_task.remote(i), the Ray Client on the head node packages this request and sends it to the head’s scheduler. The head’s scheduler, in conjunction with the GCS, determines which worker node has the available resources (CPU, GPU, etc.) to execute my_task. It then sends a message to the Raylet on the chosen worker. The worker’s Raylet picks up the task, executes my_task(i) using its local resources, and sends the result back to the head node, which then forwards it to your client.

The surprising thing about Ray’s architecture is how it uses a distributed, multi-component system to achieve what looks like simple, synchronous Python code execution. The head node isn’t just a passive gateway; it actively participates in scheduling and coordination, making it a single point of failure if not managed for high availability, but also the brain of the operation. The worker Raylets are the muscle, executing work and reporting back, while GCS acts as the shared memory for critical cluster state.

Many users overlook the importance of the GCS (Global Control Store) service. It’s not just a simple key-value store; it’s a highly available, distributed database that stores essential cluster metadata, including actor registration, task dependencies, and the location of data. If GCS becomes unhealthy, the entire cluster can grind to a halt, affecting everything from task scheduling to actor discovery, often manifesting as cryptic timeouts or connection errors that are hard to trace back to the GCS’s internal state.

The next step is understanding how Ray manages stateful applications with Actors, which introduce a different set of architectural considerations and potential failure modes.

Want structured learning?

Take the full Ray course →