Pulsar monitoring with Prometheus and Grafana is less about collecting metrics and more about understanding the emergent behavior of a distributed system.

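Consider a Pulsar cluster running a few topics with some producers and consumers. The snapshot below shows metrics from each layer, with names and values simplified for illustration. The broker-level numbers are the obvious ones to watch, but the interesting part is how they correlate with Pulsar Functions throughput and client-side latency.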

{
  "timestamp": 1678886400000,
  "metrics": {
    "broker": {
      "pulsar_broker_connections_count": 150,
      "pulsar_broker_message_in_rate": 5000,
      "pulsar_broker_message_out_rate": 4800,
      "pulsar_broker_topic_lookup_rate": 200
    },
    "bookkeeper": {
      "bookkeeper_write_latency_avg_ms": 2.5,
      "bookkeeper_read_latency_avg_ms": 1.8
    },
    "pulsar_functions": {
      "pulsar_functions_instance_running": 10,
      "pulsar_functions_processing_rate": 3000
    },
    "clients": {
      "producer_send_latency_p99_ms": 5.0,
      "consumer_receive_latency_p99_ms": 6.2
    }
  }
}

Prometheus scrapes metrics like these from each component and stores them as time series. Grafana then queries Prometheus and renders those series as graphs. The power comes from combining the views to diagnose issues. For instance, a spike in producer_send_latency_p99_ms might line up with an increase in pulsar_broker_connections_count or a subtle rise in bookkeeper_write_latency_avg_ms, which points at the broker or the storage layer rather than at the client.
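To make the "time series" idea concrete, here is a minimal sketch of how one snapshot could be flattened into individual samples, each identified by a metric name, a set of labels, a timestamp, and a value. The `to_samples` helper and the `component` label are hypothetical names chosen for illustration; Prometheus builds its series from the scraped endpoints itself, not from a JSON document like this one.

```python
import json

# Trimmed copy of the illustrative snapshot; values are sample data.
SNAPSHOT = json.loads("""
{
  "timestamp": 1678886400000,
  "metrics": {
    "broker": {"pulsar_broker_message_in_rate": 5000,
               "pulsar_broker_message_out_rate": 4800},
    "bookkeeper": {"bookkeeper_write_latency_avg_ms": 2.5}
  }
}
""")


def to_samples(snapshot):
    """Flatten a nested snapshot into (name, labels, timestamp_ms, value) tuples."""
    ts = snapshot["timestamp"]
    samples = []
    for component, metrics in snapshot["metrics"].items():
        for name, value in metrics.items():
            # The "component" label is an assumption made for this sketch.
            samples.append((name, {"component": component}, ts, float(value)))
    return samples


samples = to_samples(SNAPSHOT)
for name, labels, ts, value in samples:
    print(name, labels, ts, value)
```

Appending one such tuple per scrape for the same name and label set is what turns isolated readings into a series that can be graphed.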

The core problem Pulsar monitoring solves is the opacity of distributed systems. You can’t just grep logs to find out why a message is slow. You need to see the flow of data and the performance of each component in that flow, which means tracking metrics across brokers, bookies, ZooKeeper (or an equivalent metadata store), and the Pulsar Functions runtime.

Prometheus is configured with scrape targets for each Pulsar component. Brokers typically expose metrics at http://<broker-host>:8080/metrics/ and bookies at http://<bookie-host>:8000/metrics, although both ports are configurable. Pulsar Functions instances also expose their own metrics, usually on a different port per instance. You define these targets in your prometheus.yml configuration:

# metrics_path is not set, so Prometheus uses its default, /metrics.
scrape_configs:
  - job_name: 'pulsar_brokers'
    static_configs:
      - targets: ['broker1:8080', 'broker2:8080', 'broker3:8080']
        labels:
          env: 'production'
          cluster: 'us-east-1'

  - job_name: 'pulsar_bookies'
    static_configs:
      - targets: ['bookie1:8000', 'bookie2:8000', 'bookie3:8000']
        labels:
          env: 'production'
          cluster: 'us-east-1'
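Each of these targets answers a scrape with plain text in the Prometheus exposition format. As a rough sketch of what that payload looks like and how it breaks down into name, labels, and value, the following parses a small hand-written excerpt. The metric names in `SAMPLE` are made up, and `parse_metrics` is a hypothetical helper that deliberately ignores edge cases such as escaped quotes in label values and optional timestamps.

```python
import re

# Hypothetical excerpt of a /metrics response; names and values are made up.
SAMPLE = '''\
# TYPE pulsar_example_rate_in gauge
pulsar_example_rate_in{cluster="us-east-1",topic="orders"} 5000.0
pulsar_example_rate_in{cluster="us-east-1",topic="payments"} 1200.5
example_connections 150
'''

# name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse simple exposition lines into (name, labels, value) tuples."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this minimal parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        out.append((name, labels, float(value)))
    return out


parsed = parse_metrics(SAMPLE)
for name, labels, value in parsed:
    print(name, labels, value)
```

Running something like this against a real endpoint's output is a quick way to confirm which metric names and labels your Pulsar version actually exports before building dashboards on them.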

With Prometheus added to Grafana as a data source, you can build dashboards. A good Pulsar dashboard will have panels for:

  • Broker Health: Connections, request rates, topic lookups, backlog size.
  • BookKeeper Performance: Write/read latencies (average, p95, p99), ledger operations, disk usage.
  • Client Latency: Producer send latency, consumer receive latency, end-to-end latency if you instrument it.
  • Pulsar Functions: Instance status, processing rates, error counts.
  • ZooKeeper/Metadata Store: Latency, connection counts, request rates.

The key is to create correlated views. For example, place pulsar_broker_message_in_rate and bookkeeper_write_latency_avg_ms on the same panel, or on adjacent panels that share a time axis. If write latency starts creeping up, you can see immediately whether message ingestion is dropping with it.
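What the eye does on a shared time axis can also be approximated numerically. The sketch below computes a Pearson correlation coefficient between two aligned series; the data points are invented for illustration, and `pearson` is a hand-rolled helper rather than anything Grafana or Prometheus provides.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long, aligned series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must be aligned and contain at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Invented samples taken at the same instants: latency climbs, ingestion falls.
write_latency_ms = [2.5, 2.6, 3.0, 4.2, 6.0, 8.5]
message_in_rate = [5000, 4980, 4900, 4500, 3900, 3100]

r = pearson(write_latency_ms, message_in_rate)
print(f"correlation: {r:.3f}")
```

A coefficient close to -1 here would match the visual impression that ingestion falls as write latency rises. Correlation alone does not establish which component is the cause, so treat it as a pointer for where to look next.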

A crucial, often overlooked aspect of Pulsar monitoring is the count of unacknowledged messages, which brokers export per subscription as pulsar_subscription_unacked_messages (check the exact name against your version’s metrics output). It counts messages that have been dispatched to consumers but not yet acknowledged; the related pulsar_subscription_back_log counts entries that are persisted but not yet acknowledged by the subscription. If either number grows unchecked, it indicates a consumer lag problem, but more subtly, it can also signal issues within the broker itself or BookKeeper that are preventing acknowledgements from propagating or being processed efficiently. It’s a leading indicator of both consumer-side backpressure and potential internal system bottlenecks.
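One way to turn "grows unchecked" into something measurable is to fit a trend line over a recent window and look at its slope. This is a minimal sketch under assumed inputs: the points are invented (timestamp in seconds, unacked count), and `slope_per_second`, `backlog_is_growing`, and the `min_slope` threshold are hypothetical names and values, not Pulsar or Prometheus settings.

```python
def slope_per_second(points):
    """Least-squares slope of (timestamp_s, value) points, in units per second."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least 2 points")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    denom = sum((t - mean_t) ** 2 for t, _ in points)
    if denom == 0:
        raise ValueError("all timestamps are identical")
    return sum((t - mean_t) * (v - mean_v) for t, v in points) / denom


def backlog_is_growing(points, min_slope=1.0):
    """True if the fitted trend exceeds min_slope messages per second."""
    return slope_per_second(points) > min_slope


# Invented windows of unacked-message counts sampled every 60 seconds.
growing = [(0, 1000), (60, 1600), (120, 2200), (180, 2800)]
stable = [(0, 1000), (60, 1010), (120, 995), (180, 1005)]

print(slope_per_second(growing), backlog_is_growing(growing))
print(slope_per_second(stable), backlog_is_growing(stable))
```

Fitting a slope over a window is less sensitive to a single noisy sample than comparing only the first and last values, which matters when consumers acknowledge in bursts.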

Once you have these dashboards set up and you’ve seen some interesting patterns, the next logical step is to set up alerting on key thresholds, like sustained high latencies or rapidly increasing unacked message counts.
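The word "sustained" is the important part: an alert that fires on a single spike is mostly noise. The sketch below shows the idea in isolation, firing only when a value has stayed above a threshold for a minimum duration up to the latest sample. The `sustained_breach` helper, the 10 ms threshold, and the sample data are assumptions for illustration; in a real deployment this logic would normally live in a Prometheus alerting rule rather than in application code.

```python
def sustained_breach(samples, threshold, min_duration_s):
    """Return True if values have stayed above threshold for at least
    min_duration_s, measured up to the most recent sample.

    samples: list of (timestamp_s, value) tuples sorted by timestamp.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None  # any recovery resets the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= min_duration_s


# Invented p99 send latencies (ms), sampled every 30 seconds.
latency = [(0, 5.0), (30, 12.0), (60, 4.0), (90, 11.0),
           (120, 13.0), (150, 15.0), (180, 14.0)]
spike_only = [(0, 5.0), (30, 12.0), (60, 4.0)]

print(sustained_breach(latency, threshold=10.0, min_duration_s=60))
print(sustained_breach(spike_only, threshold=10.0, min_duration_s=60))
```

The brief spike at 30 seconds does not count toward the breach because the value recovered at 60 seconds; only the run starting at 90 seconds is measured.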
