The Pulsar broker’s JVM ran out of memory because the Java heap, which holds the broker’s live objects and metadata, filled up with objects the garbage collector could not reclaim. The JVM spent more and more time in garbage collection, new allocations failed, and the broker became unresponsive.

Common Causes and Fixes:

  1. Insufficient Heap Size: The most straightforward cause is that the JVM heap is simply too small for the broker’s workload.

    • Diagnosis: Check the broker’s JVM metrics for heap.used and heap.max. If heap.used is consistently near heap.max, or if you see frequent java.lang.OutOfMemoryError: Java heap space errors in the broker logs, the heap is likely undersized.
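    • Fix: Increase the heap through the PULSAR_MEM environment variable, which is normally set in conf/pulsar_env.sh (or in the systemd unit file or container environment that starts the broker). For example, to set the heap to 8 GB, use:
      export PULSAR_MEM="-Xms8g -Xmx8g"
      
      This sets both the initial heap size (-Xms) and the maximum heap size (-Xmx) to 8 GB. Overriding PULSAR_MEM replaces the stock value, which typically also carries a -XX:MaxDirectMemorySize setting, so keep that flag in the new value if you rely on it.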
    • Why it works: A larger heap provides more space for objects, reducing the frequency of garbage collection pauses and allowing the broker to handle more concurrent operations before running out of memory.
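
To make the "used versus available" check concrete, here is a minimal Python sketch that reads the text printed by jstat -gc <pid> and reports how full the heap is. It assumes the standard HotSpot column names (S0U, S1U, EU, OU and their capacity counterparts) with sizes in KB; the function name and the sample values are made up for illustration. Note that it compares against the currently committed heap, which only equals the maximum when -Xms and -Xmx match.

```python
def heap_usage_from_jstat(jstat_output: str) -> dict:
    """Summarize heap usage from `jstat -gc <pid>` text (header line plus sample line)."""
    lines = [ln for ln in jstat_output.splitlines() if ln.strip()]
    header, sample = lines[0].split(), lines[-1].split()
    # Some collectors print "-" for spaces they do not have; count those as 0.
    row = {name: (0.0 if value == "-" else float(value))
           for name, value in zip(header, sample)}
    # jstat reports sizes in KB: survivor (S0/S1), eden (E) and old (O) spaces.
    used_kb = row["S0U"] + row["S1U"] + row["EU"] + row["OU"]
    committed_kb = row["S0C"] + row["S1C"] + row["EC"] + row["OC"]
    return {
        "used_mb": used_kb / 1024,
        "committed_mb": committed_kb / 1024,
        "used_pct": 100 * used_kb / committed_kb,
    }


# Hand-written sample in the shape of `jstat -gc` output (not captured from a real broker).
SAMPLE = """\
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT    CGC    CGCT     GCT
 0.0   8192.0  0.0   8192.0 516096.0 262144.0 1572864.0 1310720.0 65536.0 61440.0 8192.0 7168.0   120    1.500     2    0.800    10    0.050    2.350
"""

usage = heap_usage_from_jstat(SAMPLE)
print(f"{usage['used_mb']:.0f} MB of {usage['committed_mb']:.0f} MB committed "
      f"({usage['used_pct']:.1f}%)")
# prints: 1544 MB of 2048 MB committed (75.4%)
```

A single sample says little on its own; the signal described above is usage that stays close to the limit across many samples, including right after collections.
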
  2. Excessive Topic/Subscription Count: A very large number of topics and subscriptions, especially those with many unacknowledged messages, can consume significant heap memory for metadata and internal state.

    • Diagnosis: Monitor the number of topics and subscriptions with pulsar-admin:
      pulsar-admin topics list <tenant>/<namespace> | wc -l
      pulsar-admin topics subscriptions persistent://<tenant>/<namespace>/<topic> | wc -l
      
      Also, check the output of pulsar-admin topics stats <topic> for the msgBacklog, unackedMessages, and msgDelayed counts of each subscription.
    • Fix: Implement topic/subscription cleanup policies. For instance, configure a message TTL and automatic expiry of inactive subscriptions. In broker.conf or standalone.conf:
      # Example: Set the default message TTL to 1 hour (3600 seconds)
      ttlDurationDefaultInSeconds=3600
      # Example: Delete subscriptions that have been inactive for 24 hours (1440 minutes)
      subscriptionExpirationTimeMinutes=1440
      
    • Why it works: Reducing the number of active, unacked messages and stale subscriptions directly lowers the memory footprint required to track them.
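
As a rough illustration of the diagnosis step, the following Python sketch totals backlog and unacknowledged messages across the subscriptions of one topic and lists subscriptions with no connected consumer. It assumes JSON shaped like the output of pulsar-admin topics stats, with a subscriptions map whose entries carry msgBacklog, unackedMessages, and a consumers list; the function name and the sample document are hypothetical.

```python
import json


def summarize_subscriptions(stats_json: str) -> dict:
    """Total backlog and unacked counts across one topic's subscriptions."""
    stats = json.loads(stats_json)
    subs = stats.get("subscriptions", {})
    return {
        "subscription_count": len(subs),
        "total_backlog": sum(s.get("msgBacklog", 0) for s in subs.values()),
        "total_unacked": sum(s.get("unackedMessages", 0) for s in subs.values()),
        # Subscriptions with no connected consumer are candidates for cleanup.
        "idle_subscriptions": sorted(
            name for name, s in subs.items() if not s.get("consumers")
        ),
    }


# Hand-written sample, trimmed to the fields the sketch reads (not real broker output).
SAMPLE_STATS = json.dumps({
    "subscriptions": {
        "billing": {"msgBacklog": 1200, "unackedMessages": 300,
                    "consumers": [{"consumerName": "c1"}]},
        "old-report": {"msgBacklog": 50000, "unackedMessages": 0, "consumers": []},
    }
})

summary = summarize_subscriptions(SAMPLE_STATS)
print(summary["total_backlog"], summary["idle_subscriptions"])
```

Running this over every topic in a namespace and sorting by backlog is one way to find the subscriptions worth expiring first.
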
  3. Inefficient Garbage Collector Configuration: The garbage collector settings in use might not suit Pulsar’s workload, leading to long or frequent GC pauses and memory churn.

    • Diagnosis: Examine broker logs for frequent or long-running garbage collection events. Tools like jstat -gc <pid> can show GC activity. If you see Full GC events happening very often with high duration, the GC is struggling.
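    • Fix: Check which collector the broker is actually running with, since Pulsar’s stock conf/pulsar_env.sh already selects a low-pause collector and a custom override may have replaced it. If needed, use a collector such as G1GC and tune its parameters through the PULSAR_GC variable in your broker startup script:
      export PULSAR_GC="-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=35"
      
      This enables G1GC, sets a pause-time goal of 200 ms (a target, not a guarantee), and starts the concurrent marking cycle once 35% of the heap is occupied.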
    • Why it works: G1GC is designed for larger heaps and aims to balance throughput and latency by performing concurrent garbage collection, reducing the impact of GC on application performance and memory availability.
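
To put a number on "the GC is struggling", one option is to compare two jstat -gc samples taken a known interval apart. The Python sketch below does that arithmetic, assuming you have read the GCT (total GC time in seconds) and FGC (full GC count) columns from each sample; the function name and example figures are illustrative. With concurrent collectors, GCT can include time spent in concurrent phases, so treat the percentage as an upper bound on pause time.

```python
def gc_overhead(gct_start_s: float, gct_end_s: float, interval_s: float,
                fgc_start: int = 0, fgc_end: int = 0) -> dict:
    """Share of wall-clock time spent in GC between two jstat samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return {
        # Percentage of the interval that jstat attributes to garbage collection.
        "gc_time_pct": 100 * (gct_end_s - gct_start_s) / interval_s,
        # Full collections normalized to a per-minute rate.
        "full_gcs_per_min": (fgc_end - fgc_start) * 60 / interval_s,
    }


# Two hypothetical samples taken 60 seconds apart.
overhead = gc_overhead(gct_start_s=2.35, gct_end_s=8.35, interval_s=60,
                       fgc_start=2, fgc_end=5)
print(f"{overhead['gc_time_pct']:.1f}% of time in GC, "
      f"{overhead['full_gcs_per_min']:.1f} full GCs per minute")
```
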
  4. Large Message Payloads / Slow Consumers: If consumers are slow to acknowledge messages, or if very large messages are processed frequently, the state the broker keeps for dispatched but unacknowledged messages and its in-flight buffers can grow significantly, consuming heap.

    • Diagnosis: Run pulsar-admin topics stats <topic> and compare msgRateIn and msgThroughputIn against msgRateOut and msgThroughputOut. For each subscription, check msgBacklog and unackedMessages; a backlog that keeps growing while unackedMessages stays high points to consumers that are not acknowledging fast enough.
    • Fix: Tune consumer-side acknowledgment or increase the number of consumers so that messages are acknowledged promptly. On the broker, cap how many unacknowledged messages may be outstanding, so that a slow consumer stops receiving new messages instead of accumulating state:
      # In broker.conf (example values)
      # Stop dispatching to a consumer on a shared subscription once it holds this many unacknowledged messages
      maxUnackedMessagesPerConsumer=10000
      # Stop dispatching to a shared subscription once it holds this many unacknowledged messages
      maxUnackedMessagesPerSubscription=40000
      
      Crucially, ensure consumers are configured to acknowledge messages promptly.
    • Why it works: The primary fix for memory pressure from slow consumers is to ensure messages are acknowledged efficiently. The longer messages stay unacknowledged, the more tracking state and buffered data the broker has to hold for them, and the caps put a bound on that amount.
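
On the consumer side, prompt acknowledgment mostly means acknowledging as soon as processing succeeds and negatively acknowledging on failure, rather than leaving the message pending. The Python sketch below shows that pattern. It relies only on a consumer object with acknowledge(msg) and negative_acknowledge(msg) methods, which the Pulsar Python client's Consumer provides; process_and_ack, RecordingConsumer, and the handlers are hypothetical names used so the example runs without a broker.

```python
def process_and_ack(consumer, msg, handler) -> bool:
    """Run handler(msg), then settle the message either way."""
    try:
        handler(msg)
    except Exception:
        # Ask the broker to redeliver instead of leaving the message unacked.
        consumer.negative_acknowledge(msg)
        return False
    consumer.acknowledge(msg)
    return True


class RecordingConsumer:
    """Hypothetical stand-in for a Pulsar consumer, used only to run the sketch."""

    def __init__(self):
        self.acked, self.nacked = [], []

    def acknowledge(self, msg):
        self.acked.append(msg)

    def negative_acknowledge(self, msg):
        self.nacked.append(msg)


def fail_on_purpose(msg):
    raise ValueError(f"cannot process {msg}")


consumer = RecordingConsumer()
process_and_ack(consumer, "m1", lambda m: None)   # succeeds, so it is acked
process_and_ack(consumer, "m2", fail_on_purpose)  # fails, so it is nacked
print(consumer.acked, consumer.nacked)
# prints: ['m1'] ['m2']
```

With a real client, the same function would be called in the receive loop with each received message. Settling every message in one place makes it harder for an error path to leave messages unacknowledged.
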
  5. Memory Leaks in User Code or Libraries: While less common, custom Pulsar functions, custom authentication providers, or third-party libraries can introduce memory leaks within the broker JVM.

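    • Diagnosis: If heap usage grows steadily over time without a clear correlation to traffic and never returns to a baseline even after GC, a leak is suspected. Capture a heap dump, for example with jmap -dump:live,format=b,file=heap.hprof <pid>, and analyze which objects are retained using a heap analysis tool such as YourKit, JProfiler, or Eclipse Memory Analyzer (jhat is no longer shipped with the JDK as of Java 9).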
    • Fix: Identify the leaking objects in the heap dump and fix the code responsible for not releasing them. This is highly specific to the leak’s origin. For example, if a custom AuthenticationProvider is holding onto large data structures indefinitely, it needs to be refactored to release them.
    • Why it works: Eliminating the leak prevents the continuous growth of memory usage, allowing the garbage collector to reclaim unused objects and keep the heap within manageable limits.
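
The "never returns to a baseline" test can be made less subjective by fitting a line through heap usage measured just after collections. The Python sketch below computes a least-squares slope in MB per hour; the function name, the sample data, and any threshold you apply to the result are assumptions for illustration, not a Pulsar or JVM feature.

```python
def heap_growth_mb_per_hour(samples):
    """Least-squares slope of (hours_since_start, post_gc_heap_mb) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    cov = sum((t - mean_t) * (h - mean_h) for t, h in samples)
    return cov / var_t


# Hypothetical post-GC heap readings taken once an hour.
leaking = [(0, 1000), (1, 1100), (2, 1200), (3, 1300)]
steady = [(0, 1000), (1, 1010), (2, 990), (3, 1000)]
print(heap_growth_mb_per_hour(leaking))  # 100.0
print(heap_growth_mb_per_hour(steady))
```

A slope that stays clearly positive over many hours while traffic is flat supports the leak hypothesis; a slope near zero suggests the heap is simply sized too tightly for the load.
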
  6. Off-Heap Memory Usage: OutOfMemoryError: Java heap space points specifically at the heap, but Pulsar also relies heavily on off-heap (direct) memory, mainly for Netty buffers. Off-heap exhaustion surfaces differently, as OutOfMemoryError: Direct buffer memory or as the operating system killing the process. Because heap and off-heap memory compete for the same physical memory, it is worth checking both when resizing the heap.

    • Diagnosis: Use jcmd <pid> GC.heap_info to see heap usage and jcmd <pid> VM.native_memory summary to inspect native memory (the latter only works if the JVM was started with -XX:NativeMemoryTracking=summary). Monitor file descriptor usage with lsof -p <pid> | wc -l.
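    • Fix: Pulsar’s defaults are generally good, but if direct memory is the constraint, raise the JVM’s direct memory limit, and ensure file descriptor limits are sufficient for the broker’s workload. The direct memory limit is a JVM flag rather than a broker.conf setting, so it goes in PULSAR_MEM; within that limit, managedLedgerCacheSizeMB in broker.conf controls how much the managed ledger cache may use.
      # Example: 8 GB heap with a 4 GB direct memory limit
      export PULSAR_MEM="-Xms8g -Xmx8g -XX:MaxDirectMemorySize=4g"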
      
      Also, increase OS-level file descriptor limits:
      # For the user running Pulsar
      ulimit -n 65536
      
    • Why it works: Keeping heap plus off-heap memory within what the host or container can provide prevents failed direct allocations and out-of-memory kills by the OS, and sufficient file descriptor limits let the broker keep its network connections and files open without strain.
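
Because the heap and direct memory limits are set separately, it helps to add them up against the memory the host or container actually has. The Python sketch below does that arithmetic. The 1 GB allowance for other JVM overhead (metaspace, thread stacks, code cache) is an assumed placeholder rather than a measured figure, and the function name is hypothetical.

```python
def memory_budget_gb(heap_gb: float, direct_gb: float, host_gb: float,
                     overhead_gb: float = 1.0) -> dict:
    """Check that heap + direct memory + JVM overhead fits in available memory."""
    total = heap_gb + direct_gb + overhead_gb
    return {
        "total_gb": total,
        "headroom_gb": host_gb - total,  # negative means over-committed
        "fits": total <= host_gb,
    }


# Example: -Xmx8g and -XX:MaxDirectMemorySize=4g on a 16 GB host.
budget = memory_budget_gb(heap_gb=8, direct_gb=4, host_gb=16)
print(budget)
```

Leaving some headroom for the operating system's page cache is usually worthwhile as well, so a budget that only just fits is a reason to shrink one of the limits or move to a larger host.
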

After fixing JVM heap issues, the next error you’ll likely encounter is java.lang.OutOfMemoryError: Direct buffer memory, if Netty’s off-heap buffers become the bottleneck. You may also see client timeouts and other network-related errors dating from periods when constant GC activity saturated the CPU and left the broker unresponsive.
