The Pulsar Bookie journal disk is a critical component for ensuring low-latency writes and data durability, but it’s often a bottleneck if not configured optimally.

Here’s a look at how Pulsar Bookie handles its journal and how you can tune it for low latency.

How Bookie Journal Works

When a Pulsar broker (acting as the BookKeeper client on behalf of producers) sends an add-entry request to a Bookie, the Bookie places the entry in its in-memory ledger storage write cache and appends it to a journal file on disk. The journal is a sequential write-ahead log (WAL). With journal syncing enabled, an entry is acknowledged only after it has been written and synced to the journal, so acknowledged data survives a Bookie crash.

Independently of the acknowledgment, entries are flushed from memory to entry log files, which serve reads and long-term storage. The journal rolls over to a new file when it reaches its size limit. Once a checkpoint confirms that everything in an old journal file has been flushed to the entry logs, that file can be deleted.

The critical path for low-latency writes is the time it takes to write to the journal file and get that acknowledgment back. If this path is slow, your overall write latency will suffer.
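To get a rough feel for that path on a given disk, the sketch below times small synchronous writes with dd. It is only a proxy: the Bookie batches entries before syncing, so real numbers will differ. probe_sync_writes is a hypothetical helper name, and the script assumes GNU dd (for oflag=dsync).

```shell
#!/usr/bin/env bash
# probe_sync_writes DIR [COUNT]
# Writes COUNT 4 KiB blocks with O_DSYNC into a temp file under DIR and
# prints dd's summary line. Elapsed time divided by COUNT approximates
# the cost of one small synchronous write on that filesystem.
probe_sync_writes() {
  local dir="$1" count="${2:-1000}" file status
  file="$(mktemp "${dir}/journal-probe.XXXXXX")" || return 1
  dd if=/dev/zero of="$file" bs=4k count="$count" oflag=dsync 2>"${file}.log"
  status=$?
  tail -n 1 "${file}.log"          # dd prints its summary on stderr
  rm -f "$file" "${file}.log"      # always clean up the probe files
  return "$status"
}
```

For example, probe_sync_writes /mnt/nvme_journal 1000 run against the journal mount while the Bookie is idle gives a baseline to compare before and after the changes below.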

Tuning for Low Latency

The primary goal for low-latency journal disk performance is to make those sequential writes as fast as possible and minimize any potential stalls.

1. Disk Choice: NVMe is King

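  • Diagnosis: lsblk -o NAME,MODEL,TRAN,ROTA,SIZE,TYPE,MOUNTPOINT
  • Check: NVMe devices are named nvme* and show nvme in the TRAN column; ROTA=1 indicates a spinning disk. Model names do not always contain "NVMe", so don't rely on them. If the journal sits on a SATA SSD or an HDD, you’re already at a disadvantage for this specific workload.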
  • Fix:
    • Ensure your Bookie’s journal directory (journalDirectory, or journalDirectories in newer BookKeeper releases, set in conf/bookkeeper.conf) is mounted on an NVMe drive.
    • Example conf/bookkeeper.conf:
      journalDirectory=/mnt/nvme_journal
      
  • Why it works: NVMe drives offer much lower per-write latency and higher IOPS than SATA SSDs or HDDs, which directly shortens each synced write to the journal WAL.
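The check above can be scripted. This is a minimal sketch that decides by device name only, so it will not see through LVM or device-mapper layers; journal_device and is_nvme_device are hypothetical helper names, and findmnt comes from util-linux.

```shell
#!/usr/bin/env bash
# journal_device DIR: print the block device backing DIR.
journal_device() {
  findmnt -n -o SOURCE -T "$1"
}

# is_nvme_device DEV: succeed if DEV looks like an NVMe namespace or partition.
# Name-based only: /dev/mapper/* devices on top of NVMe are reported as non-NVMe.
is_nvme_device() {
  case "$1" in
    /dev/nvme*) return 0 ;;
    *)          return 1 ;;
  esac
}
```

Usage: is_nvme_device "$(journal_device /mnt/nvme_journal)" && echo "journal is on NVMe".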

2. Filesystem: XFS with noatime

  • Diagnosis: mount | grep $(df /mnt/nvme_journal | awk 'NR==2 {print $1}')
  • Check: Verify the filesystem type and mount options. Look for noatime.
  • Fix:
    • Format the journal partition with XFS:
      mkfs.xfs /dev/nvme0n1p1 # Replace with your NVMe partition. This erases any existing data on it.
      
    • Mount it with noatime in /etc/fstab:
      /dev/nvme0n1p1 /mnt/nvme_journal xfs defaults,noatime 0 0
      
    • Remount (mount -o remount,noatime /mnt/nvme_journal) or reboot.
  • Why it works: XFS is generally performant for sequential workloads. noatime prevents the filesystem from updating access times on files, reducing unnecessary disk writes.
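A small sketch for verifying the mount option from a script. has_mount_option is a hypothetical helper; it matches whole options, so nodiratime does not count as noatime.

```shell
#!/usr/bin/env bash
# has_mount_option OPTIONS OPTION
# Succeeds if the comma-separated OPTIONS string contains OPTION exactly.
has_mount_option() {
  case ",$1," in
    *",$2,"*) return 0 ;;
    *)        return 1 ;;
  esac
}
```

Usage: has_mount_option "$(findmnt -n -o OPTIONS -T /mnt/nvme_journal)" noatime || echo "noatime is not set".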

3. I/O Scheduler: none or mq-deadline

  • Diagnosis: cat /sys/block/nvme0n1/queue/scheduler (replace nvme0n1 with your NVMe device name).
  • Check: See which I/O scheduler is currently active. It is the name shown in square brackets, for example [none] mq-deadline kyber.
  • Fix:
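    • For NVMe, none is usually best (it is the multi-queue successor to the old noop scheduler). mq-deadline can also be competitive.
    • For transient testing (as root):
      echo none > /sys/block/nvme0n1/queue/scheduler
      
    • To make it persistent, use a udev rule, for example in /etc/udev/rules.d/60-nvme-scheduler.rules (the file name is arbitrary):
      ACTION=="add|change", KERNEL=="nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="none"
      
    • Avoid relying on the elevator= kernel boot parameter: recent kernels ignore it. scsi_mod.use_blk_mq does not apply to NVMe devices either, because they do not go through the SCSI layer.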
  • Why it works: Modern NVMe drives have sophisticated internal controllers. Generic I/O schedulers can sometimes interfere with the drive’s optimal handling of requests, adding latency. none passes requests through with minimal modification.
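The scheduler file lists every available scheduler and marks the active one with brackets. A minimal sketch for extracting it (active_scheduler is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# active_scheduler LINE
# LINE is the content of /sys/block/<dev>/queue/scheduler,
# e.g. "[none] mq-deadline kyber". Prints the bracketed (active) entry.
active_scheduler() {
  printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}
```

Usage: active_scheduler "$(cat /sys/block/nvme0n1/queue/scheduler)".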

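4. Journal Group Commit Settings: journalMaxGroupWaitMSec and flushInterval

  • Diagnosis: Check conf/bookkeeper.conf (bk_server.conf in a standalone BookKeeper install).
  • Check: Look for journalAdaptiveGroupWrites, journalMaxGroupWaitMSec, journalBufferedWritesThreshold, journalFlushWhenQueueEmpty, and flushInterval. Parameter names and defaults vary between BookKeeper versions, so confirm them against the configuration reference for your release.
  • Fix:
    • journalMaxGroupWaitMSec: With journalAdaptiveGroupWrites=true, the journal batches entries and syncs them together (group commit). This value is the longest an entry waits for a batch to fill before the journal is flushed. Lower values cut per-entry latency at the cost of more, smaller syncs. Higher values improve throughput but can add up to that many milliseconds to each write. For low latency, 1 or 2 is a common starting point.
      journalMaxGroupWaitMSec=1
      
    • journalBufferedWritesThreshold: The number of buffered bytes that triggers a flush before the wait time expires. Lowering it flushes sooner under bursty load.
      journalBufferedWritesThreshold=524288
      
    • journalFlushWhenQueueEmpty: When true, the journal flushes as soon as its queue drains instead of waiting for the group timer. This helps latency at low request rates but can reduce throughput under load.
    • flushInterval: How often (in milliseconds) ledger storage is flushed to the entry logs and a checkpoint is taken. It is not on the acknowledgment path, but it determines how much journal must be kept and replayed on recovery.
    • If you have several fast devices, listing more than one directory in journalDirectories lets the Bookie write to multiple journals in parallel.
  • Why it works: These settings tune the trade-off between per-write latency and sync efficiency. A short group wait keeps individual writes from queuing behind a batch timer, while the background flush interval keeps the amount of retained journal manageable.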
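Before changing anything, it helps to see which values are actually in effect. The sketch below reads keys from a properties-style file; conf_value and audit_conf are hypothetical helper names, and commented-out lines are ignored.

```shell
#!/usr/bin/env bash
# conf_value FILE KEY
# Prints the value of the last uncommented "KEY=value" line in FILE.
conf_value() {
  sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# audit_conf FILE KEY...
# Prints KEY=value for each key, or notes that the default applies.
audit_conf() {
  local file="$1" key value
  shift
  for key in "$@"; do
    value="$(conf_value "$file" "$key")"
    printf '%s=%s\n' "$key" "${value:-<unset, built-in default applies>}"
  done
}
```

Usage: audit_conf conf/bookkeeper.conf KEY1 KEY2, passing the journal-related keys documented for your BookKeeper release.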

5. Journal Sync: journalSyncData

  • Diagnosis: Check conf/bookkeeper.conf.
  • Check: Look for journalSyncData.
  • Fix:
    • For durability against power loss, you want journalSyncData=true (the default). The journal is then synced to disk with fsync() before each batch of writes is acknowledged.
    • However, fsync() is expensive. If the lowest possible latency is paramount and you can tolerate losing acknowledged data on a sudden power failure or kernel crash, you might consider journalSyncData=false. Entries are still written to the journal file through the OS page cache, so a crash of the Bookie process alone does not lose them.
    • Recommendation: Keep journalSyncData=true for production. If you are seeing journal write latency as the primary bottleneck, investigate the disk and filesystem first.
      journalSyncData=true
      
  • Why it works: journalSyncData=true guarantees that data has reached stable storage before acknowledgment. Setting it to false relies on the OS page cache and can lead to data loss if the server loses power before the OS flushes the cache to disk.

6. CPU and NUMA Awareness

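  • Diagnosis: lscpu | grep NUMA for the node layout, cat /sys/class/nvme/nvme0/device/numa_node for the node the drive is attached to (-1 means none is reported; the path can differ by kernel), and taskset -cp <pid> for the Bookie's CPU affinity (find the Bookie PID with jps -l).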
  • Check: Is the Bookie process running on the same NUMA node as the NVMe drive?
  • Fix:
    • Ensure your Bookie process is pinned to the NUMA node that the NVMe drive is attached to. This can often be managed by numactl or systemd service configurations.
    • Example: If your NVMe is on NUMA node 0, start Bookie with numactl --cpunodebind=0 --membind=0 bin/bookkeeper bookie ....
  • Why it works: Avoiding cross-NUMA node memory access significantly reduces latency by keeping memory operations local to the CPU core.
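The pinning step can be derived from sysfs rather than hard-coded. This is a sketch under assumptions: numa_prefix is a hypothetical helper, and the sysfs path in the usage note is where the PCIe device's node is commonly exposed, which may differ on your kernel.

```shell
#!/usr/bin/env bash
# numa_prefix NUMA_NODE_FILE
# Reads a sysfs numa_node file and prints a numactl prefix that binds
# both CPUs and memory to that node. Prints nothing if the file is
# missing or reports -1 (no NUMA affinity).
numa_prefix() {
  local node
  node="$(cat "$1" 2>/dev/null)" || node=-1
  case "$node" in
    ''|-1) : ;;
    *) printf 'numactl --cpunodebind=%s --membind=%s' "$node" "$node" ;;
  esac
}
```

Usage: $(numa_prefix /sys/class/nvme/nvme0/device/numa_node) bin/bookkeeper bookie, which falls back to an unpinned start when no node is reported.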

7. Journal Device Isolation

  • Diagnosis: iostat -xz 5 or iostat -dx 5 (e.g., iostat -dx 5 /dev/nvme0n1).
  • Check: Monitor %util, w_await, and aqu-sz for the journal device (svctm is deprecated and has been removed from recent sysstat releases). Are there other processes or mounts on the same journal disk?
  • Fix:
    • Dedicate a physical NVMe drive solely for the journalDirectory. Do not share it with entry logs, OS, or any other application.
    • Ensure ledgerDirectories (where the entry logs live) is on a separate, fast disk. Ideally this is another NVMe, but SATA SSDs can be acceptable here, because entry log writes are flushed in the background and are not on the acknowledgment path.
  • Why it works: Isolating the journal to its own high-performance device prevents I/O contention from other processes or even other Bookie storage operations.
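A quick scripted check that two directories do not sit on the same filesystem. backing_device and shares_device are hypothetical helper names. Note the limits: two partitions of one physical disk look like different devices here, so confirm the physical layout with lsblk as well.

```shell
#!/usr/bin/env bash
# backing_device DIR: print the filesystem source that holds DIR.
backing_device() {
  df -P "$1" | awk 'NR==2 {print $1}'
}

# shares_device DIR_A DIR_B: succeed if both directories are on the
# same filesystem source (which means they contend for the same disk).
shares_device() {
  [ "$(backing_device "$1")" = "$(backing_device "$2")" ]
}
```

Usage: shares_device /mnt/nvme_journal /mnt/ledgers && echo "journal and ledgers share a device", where /mnt/ledgers stands in for your ledger directory.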

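After applying these, monitor your Bookie’s journal latency metrics in Prometheus/Grafana. Exact metric names depend on your BookKeeper and Pulsar versions. Look for the Bookie's journal add-entry and journal sync latency metrics (for example, bookie_journal_JOURNAL_ADD_ENTRY and bookie_journal_JOURNAL_SYNC) and the broker's storage write latency buckets (pulsar_storage_write_latency_le_*). If the journal was the bottleneck, both should drop noticeably.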
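Because metric names differ between versions, it can be quicker to list what your Bookie actually exposes. The sketch below filters a Prometheus text dump for journal-related samples. journal_metrics is a hypothetical helper, and the port in the usage note (8000) is an assumption; check the stats port configured in your bookkeeper.conf.

```shell
#!/usr/bin/env bash
# journal_metrics: read Prometheus text format on stdin and print only
# sample lines (not HELP or TYPE comments) whose name mentions "journal".
journal_metrics() {
  grep -v '^#' | grep -i 'journal'
}
```

Usage: curl -s http://localhost:8000/metrics | journal_metrics.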

Once the journal is well tuned, the next bottleneck is likely to be entry log write performance or network latency between brokers and Bookies.
