The Pulsar broker error "Received a message with incorrect checksum" means the broker received a message whose data didn’t match the checksum that was calculated when the message was originally published.
This usually happens because of a data corruption issue somewhere between the producer and the broker, or within the broker’s storage itself.
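To make the failure mode concrete, here is a minimal sketch of a checksum comparison using CRC32C, the checksum family Pulsar uses for message payloads. It is a simplified illustration built only on the Java standard library, not Pulsar's actual implementation; the class and method names (ChecksumCheck, crc32c, matches) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

final class ChecksumCheck {
    // Computes CRC32C over the payload bytes (java.util.zip.CRC32C, Java 9+).
    static long crc32c(byte[] payload) {
        CRC32C crc = new CRC32C();
        crc.update(payload, 0, payload.length);
        return crc.getValue();
    }

    // Receiver-side check: recompute the checksum and compare it with the
    // value that travelled alongside the message.
    static boolean matches(byte[] receivedPayload, long sentChecksum) {
        return crc32c(receivedPayload) == sentChecksum;
    }

    public static void main(String[] args) {
        byte[] payload = "order-123".getBytes(StandardCharsets.UTF_8);
        long sentChecksum = crc32c(payload);

        // Simulate a single bit flipped somewhere between sender and receiver.
        byte[] corrupted = payload.clone();
        corrupted[0] ^= 0x01;

        System.out.println("intact payload matches:    " + matches(payload, sentChecksum));
        System.out.println("corrupted payload matches: " + matches(corrupted, sentChecksum));
    }
}
```

Any single flipped bit changes the CRC, so the second comparison fails. That is the condition the broker reports.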
Here are the most common causes and how to fix them:
1. Network Packet Corruption
Diagnosis: This is the hardest cause to diagnose directly, because network faults are often transient. The best indicator is that the error keeps recurring, but only intermittently. Start by checking whether checksum offloading is enabled on your network interfaces. On Linux, run ethtool -k <interface_name> and look at rx-checksumming and tx-checksumming (sub-features such as tx-checksum-ipv4 are listed beneath them). If these are on, the network card is doing the checksumming instead of the kernel. Faulty NICs or buggy drivers can get this wrong.
Fix: Disable TCP checksum offloading on the network interfaces of both your Pulsar clients (producers) and brokers.
On Linux, use sudo ethtool -K <interface_name> tx off rx off.
This forces the kernel to handle checksumming, which is generally more reliable, albeit at a slight CPU cost.
Why it works: Network interface cards (NICs) often offload TCP/IP checksum calculations to save CPU. If the NIC’s hardware, firmware, or driver is faulty, it can compute or validate those checksums incorrectly, so corrupted data reaches the application, where Pulsar’s own checksum detects it. Disabling offloading shifts this task to the operating system, bypassing the potentially faulty hardware.
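If you need to run this check across many hosts, a small helper can pull the two relevant flags out of captured ethtool -k output. The sketch below assumes you have already collected that output as text; the class and method names (OffloadReport, checksumOffload) and the sample input are hypothetical, and it only parses text rather than invoking ethtool itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OffloadReport {
    // Extracts the rx/tx checksumming flags from text in the "name: value"
    // format printed by `ethtool -k <interface_name>`.
    static Map<String, Boolean> checksumOffload(String ethtoolOutput) {
        Map<String, Boolean> flags = new LinkedHashMap<>();
        for (String line : ethtoolOutput.split("\\R")) {
            String[] parts = line.trim().split(":\\s*", 2);
            if (parts.length == 2
                    && (parts[0].equals("rx-checksumming") || parts[0].equals("tx-checksumming"))) {
                // Values look like "on", "off", or "on [fixed]".
                flags.put(parts[0], parts[1].startsWith("on"));
            }
        }
        return flags;
    }

    public static void main(String[] args) {
        // Hypothetical sample; real output lists many more features.
        String sample = String.join("\n",
                "Features for eth0:",
                "rx-checksumming: on",
                "tx-checksumming: on",
                "\ttx-checksum-ipv4: off [fixed]",
                "scatter-gather: on");
        System.out.println(checksumOffload(sample));
    }
}
```

A flag reported as true means the NIC is handling that direction's checksums, which makes the host a candidate for the ethtool -K change described in the fix.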
2. Producer-Side Serialization/Compression Issues
Diagnosis: Check the logs of your Pulsar producers. If you see exceptions related to serialization or compression before the message is sent, or if producers are crashing, this could be the culprit. Also, examine the producer code for any custom serialization logic or compression libraries that might be faulty or misconfigured.
Fix: Ensure that the producer is correctly serializing and compressing data before sending it. If using Pulsar’s built-in compression (like Snappy, Zlib, LZ4), verify that the compressionType setting in your producer configuration is a codec that your client and broker versions support. The codec is recorded in each message’s metadata, so there is no separate broker-side setting that has to match. If you’re using custom serializers, double-check their implementation for bugs.
Example producer configuration snippet (Java):
// Assumes an existing PulsarClient instance named pulsarClient.
Producer<byte[]> producer = pulsarClient.newProducer()
        .topic("my-topic")
        .compressionType(CompressionType.SNAPPY) // codec applied to outgoing payloads
        .create();
Why it works: If the producer’s serialization or compression logic is flawed, the payload bytes and the checksum attached to them can end up disagreeing. When the broker receives the message, its own checksum calculation over the received bytes differs from the attached value, triggering the mismatch error.
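One illustrative way a producer-side bug can create such a disagreement is reusing a byte buffer after it has been handed off for sending. The sketch below simulates that with the standard library only. No Pulsar API is involved, the names (BufferReuseDemo, survivesReuse) are hypothetical, and whether a given client version copies the payload internally is an assumption you would need to verify for your setup.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32C;

final class BufferReuseDemo {
    static long crc32c(byte[] data) {
        CRC32C crc = new CRC32C();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // Returns true when the queued bytes still match the checksum that was
    // taken before the application reused its buffer.
    static boolean survivesReuse(boolean defensiveCopy) {
        byte[] shared = "event-0001".getBytes(StandardCharsets.UTF_8);
        // Either hand off a private copy, or the shared array itself.
        byte[] queued = defensiveCopy ? Arrays.copyOf(shared, shared.length) : shared;
        long checksumAtSend = crc32c(queued);

        // Application code overwrites the shared buffer with the next event
        // while the first (simulated) send is still pending.
        byte[] next = "event-0002".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(next, 0, shared, 0, next.length);

        return crc32c(queued) == checksumAtSend;
    }

    public static void main(String[] args) {
        System.out.println("shared buffer survives reuse:  " + survivesReuse(false));
        System.out.println("defensive copy survives reuse: " + survivesReuse(true));
    }
}
```

If your producer code pools or reuses byte arrays, handing each send its own copy is a cheap way to rule this cause in or out.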
3. Broker-Side Data Corruption (Memory or Disk)
Diagnosis: If checksum errors are consistently happening for messages arriving at a specific broker, and you’ve ruled out network and producer issues, the problem might be with the broker itself. Check the broker host for disk problems (iostat -xz 1 on Linux shows per-device utilization and latency, while dmesg and smartctl -a <device> report actual I/O and device errors) and for memory problems (look in dmesg for ECC or machine-check errors if your hardware reports them; OOM killer messages indicate memory pressure rather than corruption, but are still worth noting).
Fix:
- For transient memory corruption: Restart the affected Pulsar broker. This is a temporary fix and might indicate a deeper hardware issue.
- For persistent disk corruption: Identify the specific Pulsar data directory (dataDir in server.conf) that might be affected. If you suspect the underlying storage device (SSD/HDD) is failing, you need to replace it. Before replacing it, make sure you have a full backup and can recover the data. You might need to re-ingest data if it’s irrecoverably corrupted. Example server.conf parameter: dataDir=/pulsar/data
- If using BookKeeper: Check the BookKeeper logs (bookkeeper.log) for errors related to the specific ledger containing the corrupted message. This might point to a problem with a BookKeeper node. Bookie storage locations are set by journalDirectory and ledgerDirectories in bookkeeper.conf.
Why it works: If the broker’s memory has a transient corruption, restarting the broker can clear it. If the disk is corrupt, any data written to or read from that sector will be incorrect. Replacing the faulty disk and potentially re-ingesting lost data resolves the corruption. For BookKeeper, diagnosing and fixing the underlying BookKeeper node or storage issues is crucial.
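The diagnosis for this cause hinges on whether the errors cluster on one broker. A rough way to check is to tally the error text per broker from logs you have already collected. The sketch below assumes the log lines contain the error message quoted at the top of this article; the exact log format, the broker names, and the class name ChecksumErrorTally are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ChecksumErrorTally {
    // Error text as quoted in this article; adjust if your logs word it differently.
    static final String MARKER = "Received a message with incorrect checksum";

    // Counts marker lines per broker. Input maps a broker name to its log lines.
    static Map<String, Long> countByBroker(Map<String, List<String>> logsByBroker) {
        Map<String, Long> counts = new TreeMap<>();
        logsByBroker.forEach((broker, lines) ->
                counts.put(broker, lines.stream().filter(l -> l.contains(MARKER)).count()));
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for real log files.
        Map<String, List<String>> logs = Map.of(
                "broker-1", List.of("INFO started", "WARN " + MARKER + " on topic a"),
                "broker-2", List.of("INFO started"),
                "broker-3", List.of("WARN " + MARKER, "WARN " + MARKER, "WARN " + MARKER));
        System.out.println(countByBroker(logs));
    }
}
```

A count that is heavily skewed toward one broker supports a host-level memory or disk problem. An even spread points back toward network or producer causes.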
4. Load Balancer Issues
Diagnosis: If you have a load balancer in front of your Pulsar brokers, it might be dropping or corrupting packets. Look for any specific load balancer logs indicating packet drops, retransmissions, or errors. Also, consider if the load balancer is performing any stateful inspection or manipulation that could interfere with the message stream.
Fix: Temporarily bypass the load balancer for a specific client and broker to see if the errors disappear. If they do, investigate your load balancer’s configuration. Ensure it’s configured for raw TCP passthrough and isn’t doing any packet modification or deep packet inspection that could interfere with Pulsar’s traffic.
Why it works: Some load balancers, especially those with advanced features, can inadvertently corrupt or drop network packets. Configuring them for simple TCP forwarding ensures that traffic is passed through without modification.
5. Outdated or Incompatible Client Libraries
Diagnosis: If you’re running a mix of very old and very new Pulsar client libraries across your producers and consumers, there’s a small chance of incompatibilities. Check which Pulsar client library versions your producers use, since the producer is the side that computes the checksum.
Fix: Ensure all your Pulsar producers are using a recent, compatible version of the Pulsar client library. Ideally, keep them updated to the latest stable release.
Why it works: Older client libraries might have bugs in their checksum calculation or handling of certain message formats that newer versions have fixed.
6. Pulsar Broker Bugs
Diagnosis: This is the least common cause, but if you’ve exhausted all other possibilities and are running a very specific or older version of Pulsar, a bug in the broker’s message handling or checksum verification could be responsible. Check the Apache Pulsar GitHub issues for similar reports.
Fix: Upgrade your Pulsar brokers to the latest stable version. If the issue persists, consider reporting it to the Pulsar community with detailed logs and reproduction steps.
Why it works: A bug in the broker’s code could lead to incorrect checksum calculations or misinterpretations of valid checksums. Upgrading to a fixed version resolves the issue.
After fixing these, you might encounter Topic not found errors if the underlying issue was severe enough to cause data loss or metadata corruption that needs manual intervention.