Ray’s logging system, by default, scatters logs across individual worker nodes, making it a nightmare to debug. Centralizing these logs into a single, searchable location is the key to understanding what’s happening across your distributed Ray cluster.

Let’s set up a robust logging aggregation pipeline using Elasticsearch, Fluentd, and Kibana (the EFK stack), which is a common and powerful choice for this task.

The Core Problem: Log Sprawl

When you run a Ray cluster, each worker node generates its own set of logs. These logs are crucial for diagnosing issues, understanding application behavior, and monitoring performance. However, without a centralized system, you’d have to SSH into each node, find the relevant log files (often scattered in /tmp or user-defined directories), and manually grep through them. This is inefficient, error-prone, and simply not scalable for anything beyond a small, toy cluster.

The Solution: EFK Stack

The EFK stack provides a structured approach:

  1. Fluentd: This is your log collector. It runs on each Ray worker node, tails log files, parses them, and forwards them to a central aggregation point.
  2. Elasticsearch: This is your search and analytics engine. It receives logs from Fluentd, indexes them, and makes them searchable with incredible speed and flexibility.
  3. Kibana: This is your visualization and dashboarding tool. It connects to Elasticsearch, allowing you to build dashboards, search logs interactively, and gain insights into your Ray cluster’s behavior.

Setting Up Fluentd on Ray Workers

You’ll need to configure Fluentd to run as a daemon on each Ray node. The simplest way to achieve this is by building a custom Ray Docker image that includes Fluentd, or by using a runtime_env in your Ray job to install and configure it.

Example Fluentd Configuration (fluentd.conf)

This configuration tells Fluentd to watch Ray’s log files, parse them as JSON (Ray logs are often JSON-formatted), and send them to your Elasticsearch instance.

<source>
  @type tail
  @id in_tail_ray_logs
  path /path/to/your/ray/logs/*.log  # Adjust this path to where Ray logs are stored
  pos_file /var/log/td-agent/ray_logs.pos
  tag ray.logs
  <parse>
    @type json
  </parse>
</source>

<match ray.logs>
  @type elasticsearch
  host YOUR_ELASTICSEARCH_HOST
  port 9200
  logstash_format true
  logstash_prefix ray-logs
  include_tag_key true
  tag_key @log_name
  flush_interval 5s
</match>

Deployment:

If you’re using ray submit with a runtime_env, you could have a script that installs Fluentd and places this configuration. For instance, your runtime_env requirements.txt might not be enough, so you’d likely use pip with install_command and then setup_cmds to install Fluentd and copy the config.

Alternatively, if you’re managing your Ray cluster with Kubernetes, you’d deploy Fluentd as a DaemonSet, ensuring it runs on every node and is configured to pick up Ray’s log files. The key is that Fluentd needs to be able to read the log files produced by Ray processes.

Setting Up Elasticsearch and Kibana

You can deploy Elasticsearch and Kibana using Docker Compose, Kubernetes, or managed cloud services. For a local setup, Docker Compose is straightforward.

docker-compose.yml example:

version: '3.7'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.9 # Use a compatible version
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - ES_JAVA_OPTS=-Xms512m -Xmx512m # Adjust JVM heap size as needed
    ports:
      - "9200:9200"
      - "9300:9300"
    volumes:
      - esdata:/usr/share/elasticsearch/data
    networks:
      - elk

  kibana:
    image: docker.elastic.co/kibana/kibana:7.17.9 # Use a compatible version
    container_name: kibana
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    networks:
      - elk

volumes:
  esdata:
    driver: local

networks:
  elk:
    driver: bridge

Start them: docker-compose up -d

Once running, you can access Kibana at http://localhost:5601.

Connecting Kibana to Elasticsearch

  1. Create an Index Pattern: In Kibana, navigate to "Stack Management" -> "Kibana" -> "Index Patterns". Click "Create index pattern".
  2. Index Pattern Name: Enter ray-logs-* (or whatever you set as logstash_prefix in Fluentd).
  3. Time Field: Select @timestamp.
  4. Create: Click "Create index pattern".

Searching Your Ray Logs

Now, go to the "Discover" tab in Kibana. You should see logs from your Ray workers flowing in. You can:

  • Search: Use the search bar to filter logs by keywords, Ray actor names, task IDs, or any other field present in your log messages. For example, level:ERROR AND message:timeout.
  • Filter: Add filters by clicking on field values in the log message details.
  • Visualize: Create dashboards with charts showing error rates, task durations, or resource usage based on your log data.

The Counterintuitive Insight: Log Structure is Everything

Many people focus on getting logs to Elasticsearch. What they often overlook is that the structure of those logs dictates how useful they are. Ray’s default logging is decent, but for effective centralized searching, you want to ensure critical fields like actor_id, task_id, level, timestamp, and the actual message are consistently present and correctly parsed. If your logs are just unstructured text blobs, even a perfect EFK setup will only let you do basic keyword searches. Consider adding structured metadata to your Ray application logs using Python’s logging module, ensuring that when Fluentd parses them as JSON, all the important pieces are extracted into distinct fields.

The next logical step after centralizing logs is to build automated alerts based on specific log patterns or error rates, which you can configure directly within Kibana or by integrating with alerting tools.

Want structured learning?

Take the full Ray course →