The Pulsar Bookie journal disk is a critical component for ensuring low-latency writes and data durability, but it’s often a bottleneck if not configured optimally.
Here’s a look at how Pulsar Bookie handles its journal and how you can tune it for low latency.
How Bookie Journal Works
When a Pulsar broker (acting as the BookKeeper client) sends a write request to a Bookie, the entry is placed in the ledger storage's in-memory write cache and appended to a journal file on disk. The journal is a sequential write-ahead log (WAL): every incoming write is recorded there and, by default, synced to disk before the Bookie acknowledges it. If the Bookie crashes, it replays the journal on restart, so acknowledged data is not lost.
Entry log files, the storage format used for long-term persistence, are written in the background when the write cache is flushed. The journal rolls over to a new file once it reaches its size limit, and old journal files are deleted after the entries they cover have been flushed to the entry logs.
The critical path for low-latency writes is the time it takes to write to the journal file and get that acknowledgment back. If this path is slow, your overall write latency will suffer.
Tuning for Low Latency
The primary goal for low-latency journal disk performance is to make those sequential writes as fast as possible and minimize any potential stalls.
1. Disk Choice: NVMe is King
- Diagnosis: `lsblk -o NAME,MODEL,SIZE,TYPE,TRAN,MOUNTPOINT`
- Check: NVMe devices are named `nvmeXnY` and report `nvme` in the TRAN column. If the journal sits on a SATA SSD or an HDD, you are already at a disadvantage for this workload.
- Fix:
  - Ensure the Bookie's journal directory (`journalDirectory` in `conf/bookkeeper.conf`) is mounted on an NVMe drive.
  - Example: `journalDirectory=/mnt/nvme_journal`
- Why it works: NVMe drives offer significantly lower write latency and higher IOPS than SATA SSDs or HDDs, which directly speeds up the journal WAL.
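The device check can also be scripted. The sketch below is one possible approach, not part of any Pulsar tooling: `disk_class` is a hypothetical helper that classifies a device by its name and by the kernel's `queue/rotational` flag in sysfs. The optional second argument exists only so the function can be pointed at a different sysfs root.

```shell
# Classify a block device as nvme, ssd, or hdd using sysfs.
# Usage: disk_class <device> [sysfs_block_root]
disk_class() {
  local dev="$1" root="${2:-/sys/block}"
  local rota_file="$root/$dev/queue/rotational"
  if [ ! -r "$rota_file" ]; then
    echo "unknown"
    return 1
  fi
  case "$dev" in
    nvme*) echo "nvme" ;;
    *)
      # rotational=0 means a non-spinning device, 1 means a spinning disk
      if [ "$(cat "$rota_file")" = "0" ]; then echo "ssd"; else echo "hdd"; fi
      ;;
  esac
}
```

For example, `disk_class nvme0n1` should print `nvme` on a host where that device exists.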
2. Filesystem: XFS with noatime
- Diagnosis: `mount | grep "$(df /mnt/nvme_journal | awk 'NR==2 {print $1}')"`
- Check: Verify the filesystem type and mount options. Look for `noatime`.
- Fix:
  - Format the journal partition with XFS (this destroys any data already on it): `mkfs.xfs /dev/nvme0n1p1` (replace with your NVMe partition).
  - Mount it with `noatime` in `/etc/fstab`: `/dev/nvme0n1p1 /mnt/nvme_journal xfs defaults,noatime 0 0`
  - Remount (`mount -o remount,noatime /mnt/nvme_journal`) or reboot.
- Why it works: XFS generally performs well for sequential, append-heavy workloads. `noatime` stops the filesystem from updating access times when files are read, which removes unnecessary metadata writes.
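To verify the mount options without reading the output by eye, a small helper can test for a single option. This is a minimal sketch; `has_mount_option` is a hypothetical name, and the options string is assumed to be the comma-separated list that `findmnt` or `/proc/mounts` reports.

```shell
# Succeed if a comma-separated mount option list contains the given option.
# Usage: has_mount_option <options> <wanted>
has_mount_option() {
  local options="$1" wanted="$2"
  # Wrap both sides in commas so "noatime" does not match "nodiratime"
  case ",$options," in
    *",$wanted,"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A typical call would be `has_mount_option "$(findmnt -no OPTIONS --target /mnt/nvme_journal)" noatime && echo "noatime is set"`.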
3. I/O Scheduler: none or mq-deadline
- Diagnosis: `cat /sys/block/nvme0n1/queue/scheduler` (replace `nvme0n1` with your NVMe device name).
- Check: The active I/O scheduler is the one shown in square brackets.
- Fix:
  - For NVMe, `none` (the multi-queue counterpart of the legacy `noop` scheduler) is usually best. `mq-deadline` can also be competitive.
  - For transient testing: `echo none > /sys/block/nvme0n1/queue/scheduler`
  - To make the setting persistent, use a udev rule, for example in `/etc/udev/rules.d/60-ioscheduler.rules`: `ACTION=="add|change", KERNEL=="nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="none"`. The `elevator=` boot parameter is ignored by kernels 5.4 and later, and `scsi_mod.use_blk_mq` applies only to SCSI devices, so neither helps with NVMe.
- Why it works: Modern NVMe drives have sophisticated internal controllers. Generic I/O schedulers can interfere with the drive's own handling of requests and add latency. `none` passes requests through with minimal modification.
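The scheduler file lists every available scheduler and marks the active one with square brackets, which makes it easy to check many devices at once. The helpers below are an illustrative sketch with hypothetical names; the optional root argument only lets `check_nvme_schedulers` read from a directory other than `/sys/block`.

```shell
# Print the active scheduler from a line such as "[none] mq-deadline kyber".
active_scheduler() {
  printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Report the active scheduler for every NVMe namespace under a sysfs root.
# Usage: check_nvme_schedulers [sysfs_block_root]
check_nvme_schedulers() {
  local root="${1:-/sys/block}" f dev
  for f in "$root"/nvme*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#"$root"/}   # strip the root prefix
    dev=${dev%%/*}      # keep only the device name
    echo "$dev: $(active_scheduler "$(cat "$f")")"
  done
}
```

Running `check_nvme_schedulers` with no arguments prints one `device: scheduler` line per NVMe namespace on the host.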
4. Journal Flush Settings: journalMaxGroupWaitMSec and flushInterval
- Diagnosis: Check `conf/bookkeeper.conf` (or `conf/bk_server.conf` in a standalone BookKeeper install).
- Check: Look for `journalMaxGroupWaitMSec`, `journalBufferedWritesThreshold`, `flushInterval`, and `journalDirectories`.
- Fix:
  - `journalMaxGroupWaitMSec`: The Bookie groups journal writes and syncs each group together. This setting caps how long an entry waits for its group to be flushed, so it adds directly to write latency. Keep it small for low latency; `1` is a reasonable start: `journalMaxGroupWaitMSec=1`. A group is also flushed early once `journalBufferedWritesThreshold` bytes are buffered.
  - `flushInterval`: This controls how often, in milliseconds, the Bookie flushes ledger storage to the entry logs. A lower value means more frequent flushes and more background I/O. A higher value reduces flush I/O but leaves more data covered only by the journal, which can lengthen recovery. Pulsar's shipped value is a sensible start: `flushInterval=60000`
  - `journalDirectories`: Each journal directory gets its own journal thread, so listing several directories on separate disks parallelizes journal writes: `journalDirectories=/mnt/nvme_journal1,/mnt/nvme_journal2`
- Why it works: These settings tune the trade-off between journal write latency, how many writes share one sync, and the background flushing process. Adjusting them keeps flushes from contending with new writes while still keeping the journal size manageable.
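Key names differ between BookKeeper releases, so it helps to see which journal-related settings a config file actually sets before changing anything. The sketch below uses a hypothetical helper, `journal_settings`, that prints the uncommented keys starting with `journal` plus `flushInterval` and `ledgerDirectories`. The key list is an assumption and may need extending for your version.

```shell
# Print active (uncommented) journal-related settings from a bookie config file.
# Usage: journal_settings <path-to-config>
journal_settings() {
  local conf="$1"
  grep -E '^[[:space:]]*(journal[A-Za-z]*|flushInterval|ledgerDirectories)[[:space:]]*=' "$conf"
}
```

For example, `journal_settings conf/bookkeeper.conf` lists the current values, and a key that is missing from the output is either commented out or not present in that release's config.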
5. Journal Sync: journalSyncData
- Diagnosis: Check `conf/bookkeeper.conf`.
- Check: Look for `journalSyncData`.
- Fix:
  - For durability against power loss, you want `journalSyncData=true` (the default). The Bookie then syncs each group of journal writes to disk before acknowledging them.
  - However, `fsync()` is expensive. If the lowest possible latency is paramount and you can tolerate losing recent writes on a sudden power failure or kernel crash, you might consider `journalSyncData=false`. Entries are still written to the journal through the OS page cache, so they survive a crash of the Bookie process itself.
  - Recommendation: Keep `journalSyncData=true` for production. If journal write latency is the primary bottleneck, investigate the disk and filesystem first.
- Why it works: `journalSyncData=true` guarantees that data is on stable storage before acknowledgment. Setting it to `false` relies on the OS page cache and can lead to data loss if the server loses power before the OS flushes the cache to disk.
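Before touching the sync setting, it is worth measuring what a synced write costs on the journal disk. The sketch below is a rough probe, not a substitute for a proper benchmark: `sync_write_probe` is a hypothetical helper that uses GNU `dd` with `oflag=dsync` to write small blocks and sync each one, which only approximates the Bookie's grouped journal syncs.

```shell
# Write COUNT 4 KiB blocks into DIR, syncing each write, and print dd's summary line.
# Usage: sync_write_probe <dir> [count]
sync_write_probe() {
  local dir="$1" count="${2:-1000}"
  local file="$dir/.journal_sync_probe.$$"
  # dd reports its statistics on stderr; the last line holds the elapsed time
  dd if=/dev/zero of="$file" bs=4k count="$count" oflag=dsync 2>&1 | tail -n 1
  rm -f "$file"
}
```

Running `sync_write_probe /mnt/nvme_journal` and dividing the reported elapsed time by the block count gives an approximate per-write sync latency for that disk.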
6. CPU and NUMA Awareness
- Diagnosis: `lscpu | grep NUMA` and `taskset -p <pid>` (find the Bookie PID with `jps -l`).
- Check: Is the Bookie process running on the same NUMA node as the NVMe drive?
- Fix:
  - Pin the Bookie process to the NUMA node that the NVMe drive is attached to. `numactl` or systemd service settings can handle this.
  - Example: If your NVMe is on NUMA node 0, start the Bookie with `numactl -N 0 -m 0 bin/bookkeeper bookie ...`.
- Why it works: Keeping the Bookie's threads and memory on the node local to the drive avoids cross-node memory access on the write path, which lowers latency.
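The drive's NUMA node can be read from sysfs rather than guessed. The sketch below assumes the controller's PCI device exposes a `numa_node` attribute under `/sys/class/nvme/<controller>/device/`; both function names are hypothetical, and the optional root argument only allows reading from another directory.

```shell
# Print the NUMA node an NVMe controller (e.g. nvme0) is attached to.
# A value of -1 means the kernel reports no NUMA affinity for the device.
nvme_numa_node() {
  local ctrl="$1" root="${2:-/sys/class/nvme}"
  cat "$root/$ctrl/device/numa_node"
}

# Build a numactl prefix for a node; print nothing when there is no affinity.
numactl_prefix() {
  local node="$1"
  if [ "$node" -ge 0 ] 2>/dev/null; then
    echo "numactl --cpunodebind=$node --membind=$node"
  fi
}
```

A start command could then look like `$(numactl_prefix "$(nvme_numa_node nvme0)") bin/bookkeeper bookie`, which falls back to an unpinned start when no node is reported.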
7. Journal Device Isolation
- Diagnosis: `iostat -xz 5` or `iostat -dx 5 /dev/nvme0n1`.
- Check: Monitor `%util`, `w_await`, and `aqu-sz` for the journal device (`svctm` is deprecated and no longer reported by recent sysstat releases). Are other processes or mounts using the same disk?
- Fix:
  - Dedicate a physical NVMe drive solely to the `journalDirectory`. Do not share it with entry logs, the OS, or any other application.
  - Put `ledgerDirectories` (where the entry logs live) on a separate, fast disk. Another NVMe is ideal, but SATA SSDs can be acceptable because entry log writes are flushed in the background and are not on the acknowledgment path.
- Why it works: Isolating the journal on its own high-performance device prevents I/O contention from other processes and from the Bookie's own storage operations.
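A quick way to catch an accidentally shared disk is to compare the device IDs behind the two directories. This is a minimal sketch using GNU `stat`; `same_device` is a hypothetical helper. It compares filesystems only, so two partitions on one physical drive would still look separate and need a manual check with `lsblk`.

```shell
# Succeed when two paths live on the same filesystem (same device ID).
# Usage: same_device <path1> <path2>
same_device() {
  [ "$(stat -c %d "$1")" = "$(stat -c %d "$2")" ]
}
```

For example, `same_device /mnt/nvme_journal /mnt/ledgers && echo "WARNING: journal and ledgers share a device"`, where `/mnt/ledgers` stands in for your ledger directory.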
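After applying these changes, monitor the Bookie's journal latency metrics in Prometheus/Grafana. You should see a clear reduction in the Bookie's journal summaries (`bookie_journal_JOURNAL_ADD_ENTRY` and `bookie_journal_JOURNAL_SYNC`) and in the p99 write latency of your Pulsar topics (`pulsar_storage_write_latency_le_*`).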
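For a spot check without a dashboard, the journal metrics can be filtered straight from the Bookie's Prometheus endpoint. The sketch below assumes the metrics use the `bookie_journal_JOURNAL_` prefix and that the endpoint listens on port 8000; `journal_metrics` is a hypothetical helper, and both assumptions should be checked against your deployment.

```shell
# Keep only the journal add-entry and sync metrics from Prometheus text on stdin.
journal_metrics() {
  grep -E '^bookie_journal_JOURNAL_(ADD_ENTRY|SYNC)'
}
```

A typical call would be `curl -s http://localhost:8000/metrics | journal_metrics`.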
Once the journal is well tuned, the next bottleneck is likely to be entry log write performance or network latency on the acknowledgment path.