Ray’s metrics system is flexible, but the surprising part of integrating it with Prometheus and Grafana is how little configuration the Ray side needs before the most useful metrics show up.

Let’s see it in action. First, you need a running Ray cluster, either local (ray start --head) or spread across several machines. Ray’s components (the Raylet, the GCS, workers) are instrumented out of the box, and each node serves the collected metrics over HTTP in a format Prometheus can scrape. Note that 8265 is the default port of the Ray Dashboard web UI, not the metrics endpoint. The metrics port is set with the --metrics-export-port flag of ray start; the examples here assume every node was started with --metrics-export-port=8080.

Here’s what a snippet of that might look like if you were to curl http://localhost:8080/metrics:

# HELP ray_actor_restarts_total Total number of actor restarts.
# TYPE ray_actor_restarts_total counter
ray_actor_restarts_total{actor_name="MyActor",component="raylet",node_id="0a1b2c3d4e5f678901234567890abcdef",ray_node_ip="127.0.0.1",ray_node_port="6379"} 0
# HELP ray_actor_task_execution_time_s Average time spent executing actor tasks in seconds.
# TYPE ray_actor_task_execution_time_s gauge
ray_actor_task_execution_time_s{actor_name="MyActor",component="raylet",node_id="0a1b2c3d4e5f678901234567890abcdef",ray_node_ip="127.0.0.1",ray_node_port="6379"} 0.00123
# HELP ray_object_store_memory_usage_bytes Memory usage of the object store in bytes.
# TYPE ray_object_store_memory_usage_bytes gauge
ray_object_store_memory_usage_bytes{component="raylet",node_id="0a1b2c3d4e5f678901234567890abcdef",ray_node_ip="127.0.0.1",ray_node_port="6379"} 15728640

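This output is in Prometheus exposition format: # HELP and # TYPE comment lines describe each metric, and every sample line is a metric name, a set of labels in braces, and a value. Ray’s components are instrumented to emit these Prometheus-compatible metrics automatically. The raylet on each node, the gcs_server, and the dashboard all do this. Treat the metric names and labels shown here as illustrative; the exact set depends on your Ray version, so check the /metrics output of your own cluster.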
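To make the format concrete, here is a minimal Python sketch that parses sample lines like the ones above into (name, labels, value) tuples. It is a simplified parser for illustration only: it does not unescape label values or read timestamps, and the SAMPLE text is a shortened copy of the example output, not real cluster data. For production use, a maintained parser is the safer choice.

```python
import re

# Shortened copy of the example output above (illustrative values only).
SAMPLE = """\
# HELP ray_actor_restarts_total Total number of actor restarts.
# TYPE ray_actor_restarts_total counter
ray_actor_restarts_total{actor_name="MyActor",ray_node_ip="127.0.0.1"} 0
# TYPE ray_object_store_memory_usage_bytes gauge
ray_object_store_memory_usage_bytes{ray_node_ip="127.0.0.1"} 15728640
"""

# metric_name, optional {labels}, then the sample value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One key="value" pair inside the braces (escaped quotes are tolerated).
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Return a list of (name, labels, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this simplified parser understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_exposition(SAMPLE):
        print(name, labels, value)
```

Running the script prints one line per sample, which is a quick way to see which labels a metric actually carries before you write queries against it.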

To bring this into Prometheus, you’ll configure Prometheus to scrape these endpoints. In your prometheus.yml configuration, you’d add a scrape job like this:

scrape_configs:
  - job_name: 'ray'
    static_configs:
      - targets:
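        # Assumes each node was started with --metrics-export-port=8080.
        # 8265 is the Ray Dashboard UI port, not a scrape target.
        - 'ray-head-node-ip:8080' # Metrics endpoint on head node
        - 'ray-worker-node-1-ip:8080' # Metrics endpoint on worker node 1
        - 'ray-worker-node-2-ip:8080' # Metrics endpoint on worker node 2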

If you’re running Ray locally, the only target is localhost:8080 (or whichever port you passed to --metrics-export-port). Prometheus then fetches these metrics once per scrape interval.
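A hand-maintained target list gets tedious once nodes come and go. Prometheus can also read targets from a JSON file through file_sd_configs, so one option is to generate that file from a list of node IPs. The sketch below shows the idea; the function names, the cluster label, and the output path are hypothetical, and the port is whatever your Ray metrics endpoint actually listens on.

```python
import json


def build_file_sd(node_ips, metrics_port, cluster_name):
    """Build a Prometheus file_sd target group for the given node IPs."""
    targets = [f"{ip}:{metrics_port}" for ip in node_ips]
    # "cluster" is a hypothetical extra label attached to every target.
    return [{"targets": targets, "labels": {"cluster": cluster_name}}]


def write_file_sd(path, node_ips, metrics_port, cluster_name):
    """Write the target group to a JSON file that Prometheus can watch."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_file_sd(node_ips, metrics_port, cluster_name), handle, indent=2)


if __name__ == "__main__":
    # Example IPs and port are placeholders; substitute your own.
    print(json.dumps(build_file_sd(["10.0.0.1", "10.0.0.2"], 8080, "ray-demo"), indent=2))
```

You would then point a scrape job at the generated file with a file_sd_configs entry instead of static_configs, and Prometheus picks up changes to the file without a restart.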

Once Prometheus is scraping, you can connect Grafana. In Grafana, you add Prometheus as a data source, pointing it to your Prometheus server’s URL. Then, you can create dashboards. For example, to visualize Ray object store memory usage, you’d create a new panel, select your Prometheus data source, and use a query like this:

ray_object_store_memory_usage_bytes

Or, to see actor restarts per node:

sum by (node_id) (ray_actor_restarts_total)
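Before building a panel, it can help to run the same query against the Prometheus HTTP API and look at the raw result. Here is a small sketch using only the Python standard library. It assumes Prometheus is reachable at http://localhost:9090 (its usual default); the helper names are made up for this example, and the metric name is the one used in this article.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build the URL for an instant query against /api/v1/query."""
    query_string = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def parse_instant_vector(payload):
    """Turn an instant-vector API response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each result carries "value": [timestamp, "value-as-string"].
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def run_query(base_url, promql, timeout=10):
    """Send the query to Prometheus and return the parsed samples."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_instant_vector(json.load(response))


if __name__ == "__main__":
    # Assumes a local Prometheus; adjust the address for your setup.
    for labels, value in run_query(
        "http://localhost:9090", "sum by (node_id) (ray_actor_restarts_total)"
    ):
        print(labels.get("node_id"), value)
```

If the query returns an empty list here, it will be empty in Grafana too, which usually means the metric name or a label does not match what your cluster exports.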

The real power comes from understanding what these metrics represent. ray_actor_restarts_total is a counter that increments every time an actor crashes and is restarted by Ray. ray_object_store_memory_usage_bytes shows how much memory is currently occupied by objects in the distributed object store on a given node. ray_actor_task_execution_time_s gives you insight into how long your actor’s tasks are taking to execute on average.
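The counter versus gauge distinction matters when you query. A gauge can be read as-is, but a counter only ever goes up and starts again from zero when the exporting process restarts, so you normally look at how much it grew over a window. In PromQL that is what rate() and increase() are for, for example increase(ray_actor_restarts_total[5m]). The sketch below shows the reset handling in plain Python. It is a deliberate simplification: Prometheus also extrapolates to the edges of the window, which this function does not attempt.

```python
def counter_increase(samples):
    """Total growth across ordered counter samples, treating any drop as a reset."""
    total = 0.0
    previous = None
    for value in samples:
        if previous is not None:
            if value >= previous:
                total += value - previous
            else:
                # The counter dropped, so the process restarted and counted
                # up from zero again; the new value is all fresh growth.
                total += value
        previous = value
    return total


if __name__ == "__main__":
    # Samples 0, 2, 5, then a reset, then 1, 4.
    print(counter_increase([0, 2, 5, 1, 4]))
```

Taking the plain difference between the last and first sample would report 4 for this series and hide the restarts that happened before the reset.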

The system is designed so that most of the core operational metrics are available out-of-the-box. You don’t need to write custom instrumentation for basic health checks and resource utilization. The Raylets are the primary source for node-level metrics, including object store usage, CPU/memory utilization (though these are often scraped by node-exporter separately and correlated), and actor lifecycle events. The Ray Dashboard’s metrics endpoint provides high-level cluster information and its own operational statistics.

What most people miss is that Ray also exposes metrics related to its internal communication and scheduling. For instance, you can find metrics like ray_gcs_server_heartbeat_latency_ms, which tells you about the latency of heartbeats between the GCS service and its clients (like the Raylets), or ray_scheduler_queue_length, which indicates how many tasks are waiting in the scheduler’s queue. These are invaluable for diagnosing performance bottlenecks that aren’t tied to individual actors but to the cluster’s ability to orchestrate them. You would typically look for these on the head node, where the GCS runs. Exact metric names and the port they are served on vary between Ray versions and deployments (port 8126 is one you may see for a separately run GCS server), so confirm both against the /metrics output of your own cluster.

The next step is often correlating these Ray-specific metrics with system-level metrics like node CPU, memory, and network I/O, which are usually collected by separate Prometheus exporters like node_exporter.
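In Grafana this correlation usually means putting both series on the same dashboard, but the join itself is simple enough to sketch. The example below pairs Ray samples with node_exporter samples by node IP. It assumes Ray samples carry a ray_node_ip label (as in the sample output earlier), that node_exporter samples carry the usual instance label in host:port form with IPv4 addresses, and that both lists are (labels, value) pairs; the function names are hypothetical.

```python
def host_of(instance):
    """Strip the port from an instance label such as '10.0.0.5:9100'."""
    return instance.rsplit(":", 1)[0]


def join_by_node(ray_samples, node_samples):
    """Pair each Ray sample with the node_exporter sample from the same host."""
    node_by_host = {
        host_of(labels["instance"]): value for labels, value in node_samples
    }
    joined = []
    for labels, value in ray_samples:
        ip = labels.get("ray_node_ip")
        if ip in node_by_host:
            joined.append((ip, value, node_by_host[ip]))
    return joined


if __name__ == "__main__":
    # Illustrative values: object store bytes per node from Ray, and
    # available memory per node from node_exporter.
    ray_samples = [({"ray_node_ip": "10.0.0.5"}, 15728640.0)]
    node_samples = [({"instance": "10.0.0.5:9100"}, 8589934592.0)]
    for ip, ray_value, node_value in join_by_node(ray_samples, node_samples):
        print(ip, ray_value, node_value)
```

A natural pairing is object store usage from Ray against node_memory_MemAvailable_bytes from node_exporter, which shows whether object store growth is what is squeezing a node's memory.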
