The Pulsar producer’s internal queue is full, causing it to stop sending messages to the broker because the broker is not acknowledging them fast enough.
Here’s why that’s happening and how to fix it:
1. Broker Network Saturation
The most common culprit is the broker’s network interface being overloaded. It’s receiving messages faster than it can send them out to consumers or disk.
Diagnosis: Check the network traffic on your Pulsar broker nodes. Look for high egress (outbound) and ingress (inbound) traffic, specifically on the port Pulsar uses (default 6650 for client traffic, 8080 for HTTP API).
# On your Pulsar broker node
sudo ss -tunp | grep 6650
# Observe the 'RECV-Q' and 'SEND-Q' for the Pulsar port. High values indicate issues.
# Also, check overall network utilization:
sudo nload
Fix:
- Scale brokers: Add more broker nodes to your cluster. This distributes the load.
- Increase network bandwidth: If your cloud provider or on-prem hardware has limited network capacity, upgrade it.
- Optimize consumer throughput: Ensure your consumers are keeping up. If they aren’t, they are indirectly causing producer backpressure by not draining the broker’s queues.
Why it works: Distributing the load across more brokers or increasing the pipe’s capacity allows the broker to process and acknowledge messages more quickly, clearing the producer’s queue.
2. Broker Disk I/O Bottleneck
Pulsar brokers write messages to disk before acknowledging them, especially if persistence is enabled (which it should be for durability). If the disk subsystem is slow or saturated, acknowledgments will lag.
Diagnosis:
Monitor disk I/O on your broker nodes. Look for high iowait percentages, high read/write latency, and full disk queues.
# On your Pulsar broker node
sudo iostat -xz 1 5
# Look for %util, await, svctm, and avgqu-sz for your Pulsar data disks.
# Also, check mount points for disk space:
df -h
Fix:
- Faster disks: Migrate Pulsar data to faster storage (e.g., SSDs instead of HDDs, NVMe drives).
- Separate disks: Ensure your Pulsar data directory is on a disk separate from the OS and logs.
- RAID configuration: Use appropriate RAID levels (e.g., RAID 10) for a balance of performance and redundancy.
- Scale brokers: Adding brokers also distributes the disk load.
Why it works: Faster disks or better I/O configurations reduce the time it takes for Pulsar to persist messages, allowing it to acknowledge them sooner and unblock producers.
3. Insufficient Broker Resources (CPU/Memory)
If brokers are starved for CPU or memory, their ability to process incoming messages, write to disk, and send acknowledgments is severely hampered.
Diagnosis:
Monitor CPU and memory utilization on your Pulsar broker nodes. Look for sustained high CPU usage (>80%) or constant swapping (high si/so in vmstat).
# On your Pulsar broker node
top -H -p $(pgrep -f "org.apache.pulsar.broker.PulsarBroker")
# Observe CPU and memory usage for individual Pulsar broker threads.
# Or for overall system:
vmstat 1 5
Fix:
- Increase broker resources: Allocate more CPU cores and RAM to your broker instances.
- Tune JVM heap: Ensure the Pulsar broker JVM heap size (
-Xmx) is appropriately configured and not causing excessive garbage collection pauses. - Scale brokers: Distribute the load across more nodes.
Why it works: Providing sufficient resources ensures the Pulsar broker process can execute its threads efficiently, reducing processing delays and freeing up queues.
4. Producer Configuration: blockIfQueueFull and sendTimeoutMs
The producer client has its own internal queue. If blockIfQueueFull is true (the default), it will block when this queue fills up. The sendTimeoutMs determines how long the producer will wait before throwing an error.
Diagnosis: Examine your producer client’s configuration.
Fix:
- Increase
sendTimeoutMs: If you expect temporary backpressure, increasing this value gives the producer more time to wait for the broker to catch up. For example, setting it to60000(60 seconds).Producer<byte[]> producer = pulsarClient.newProducer() .topic("my-topic") .sendTimeout(60, TimeUnit.SECONDS) // Increased timeout .create(); - Set
blockIfQueueFulltofalse: This will cause the producer to immediately throw an exception (ProducerQueueIsFullError) when the queue is full, rather than blocking. This is useful if you want immediate feedback on overload rather than a potential application hang.Producer<byte[]> producer = pulsarClient.newProducer() .topic("my-topic") .blockIfQueueFull(false) // Do not block, throw error immediately .create(); - Increase
maxPendingMessages: This is the size of the producer’s internal queue. Increasing it allows the producer to buffer more messages before blocking or throwing an error. The default is500.Producer<byte[]> producer = pulsarClient.newProducer() .topic("my-topic") .maxPendingMessages(2000) // Increased queue size .create();
Why it works: These settings directly control the producer’s behavior under pressure. Adjusting them allows you to tune how aggressively the producer backs off or signals an error when the broker can’t keep up.
5. Topic Throughput Limits or Throttling
While less common for causing producer queue full errors directly, if a topic has aggressive throttling configured at the broker level or is part of a tenant-level rate limit, it can indirectly cause backpressure by limiting how fast the broker can ingest messages for that specific topic.
Diagnosis: Check Pulsar admin API for topic-level or namespace-level throttling configurations.
# Get namespace-level limits
bin/pulsar-admin namespaces get-throttling my-tenant/my-namespace
# Get topic-level limits (if set)
bin/pulsar-admin topics stats my-topic
Fix:
- Increase limits: Adjust
messageRateorbyteRatelimits for the relevant namespace or topic. - Remove limits: If throttling is not strictly necessary, consider removing it.
Why it works: Removing artificial limits on message ingestion allows the broker to process messages for that topic at its maximum capacity, reducing the chance of queues filling up.
6. Large Message Sizes
If your producers are sending very large messages, even a moderate number of them can quickly saturate the broker’s network bandwidth or disk I/O, leading to the same symptoms as the above.
Diagnosis: Examine the sizes of messages being sent by your producer. Check broker metrics for message size distribution.
Fix:
- Optimize message payloads: Compress data, use binary formats, or split large messages into smaller ones.
- Increase broker resources: As mentioned in point 3, more CPU/memory and faster disks can handle larger payloads better.
- Increase network bandwidth: As mentioned in point 1, more bandwidth is key for high-throughput, large messages.
Why it works: Reducing the amount of data per message or increasing the capacity to handle that data directly alleviates the bottleneck caused by large payloads.
After resolving these, your next likely headache will be Topic is partitioned and partition key is not set if you haven’t configured a partition key for a partitioned topic.