Redpanda’s tail latency, particularly P99, can be frustratingly high in production, but it’s often a sign of a few specific bottlenecks rather than a general performance issue.

Here’s how to dig in and improve your P99 latency:

1. Disk I/O Saturation

This is the most common culprit. Redpanda is heavily I/O bound. If your disks can’t keep up with the write throughput, latency will skyrocket.

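Diagnosis: Run iostat -xz 1 on your Redpanda nodes. Look for %util consistently at or near 100% on the Redpanda data disks, and check the await columns (r_await and w_await in current sysstat releases): high values mean requests are queuing at the device. Do not rely on svctm; it has long been flagged as unreliable and is absent from recent sysstat versions. On NVMe devices, which serve many requests in parallel, %util can read 100% before the device is truly saturated, so weigh it together with await.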
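
To watch more than a couple of nodes, the %util check can be scripted. The following is a minimal sketch, not an official tool: it assumes you have captured the text of one iostat -x report, and the function name, the 90% threshold, and the sample numbers are all made up for illustration. Column positions are read from the header row because they differ between sysstat versions.

```python
def saturated_devices(iostat_text, util_threshold=90.0):
    """Return {device: %util} for devices at or above util_threshold.

    Expects the text of an `iostat -x` report. The %util column is located
    by name in the header row rather than by fixed position.
    """
    header = None
    result = {}
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            header = None  # a blank line ends the device table
            continue
        if fields[0].rstrip(":") == "Device":
            header = fields
            continue
        if header and "%util" in header and len(fields) == len(header):
            # Some locales print decimal commas; normalise before parsing.
            util = float(fields[header.index("%util")].replace(",", "."))
            if util >= util_threshold:
                result[fields[0]] = util
    return result


# Trimmed, invented sample in the shape of recent sysstat output.
SAMPLE = """\
Device            r/s     w/s     rkB/s     wkB/s  w_await  aqu-sz  %util
nvme0n1          5.00  900.00     40.00  92000.00    12.40    9.80  99.60
nvme1n1          1.00   20.00      8.00    400.00     0.30    0.01   3.20
"""

print(saturated_devices(SAMPLE))
```

A single hot sample proves little; what matters is a device that stays above the threshold across many consecutive reports.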

Common Causes & Fixes:

  • Under-provisioned EBS/Instance Store: You might be on a general-purpose disk type that doesn’t offer enough IOPS or throughput.

    • Fix: Migrate to a provisioned IOPS SSD (io2 Block Express for maximum performance, io1 for good performance) or a higher-tier instance with better local NVMe storage. For example, if you’re on gp3, consider provisioning 10,000 IOPS and 500 MB/s throughput.
    • Why it works: Higher provisioned IOPS and throughput directly translate to faster disk operations, allowing Redpanda to write data more quickly.
  • Too many partitions/topics on a single disk: Even with fast disks, if you have an excessive number of active partitions spread across them, you can hit aggregate I/O limits.

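    • Fix: Distribute topics and partitions across more brokers, or give each broker more disk bandwidth. A broker keeps its log data under a single data directory, so the practical ways to spread load are adding brokers and rebalancing partitions, or striping several disks behind that directory.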
    • Why it works: Spreading the load reduces contention on any single disk, allowing each to operate closer to its theoretical maximum without impacting others.
  • RAID configuration suboptimal: If you’re using RAID, especially RAID 5 or 6, for Redpanda data, the parity calculations can become a bottleneck under heavy write loads.

    • Fix: Use RAID 0 when losing one node’s local data is acceptable, since Redpanda’s replication provides durability across brokers. If you must keep parity RAID, make sure the controller is tuned for write-heavy workloads. Where a single NVMe device offers enough capacity and throughput, running without RAID avoids the overhead entirely.
    • Why it works: RAID 0 offers the highest write performance by striping data across drives without parity overhead.
  • Filesystem overhead: Some filesystems have higher write overhead than others.

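    • Fix: Format your Redpanda data volumes with xfs and mount them with noatime (on Linux, noatime already implies nodiratime, so listing both is harmless but redundant). Ensure fstrim is not running during peak hours.
    • Why it works: xfs is generally well-suited for high-throughput workloads. noatime removes the metadata write that would otherwise accompany reads, and keeping fstrim off-peak avoids the I/O stalls that discard operations can cause.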
  • Network attached storage (NAS) / SAN limitations: If Redpanda is not running on local NVMe or directly attached SSDs, the network and SAN infrastructure can easily become the bottleneck.

    • Fix: Migrate to local SSDs or NVMe. If NAS/SAN is unavoidable, ensure it’s provisioned with very high IOPS and low latency.
    • Why it works: Minimizing network hops and external dependencies for critical write paths drastically reduces latency.
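
The filesystem advice above can be verified mechanically. This is a rough sketch that parses text in the /proc/mounts format and reports the filesystem type and noatime flag of the mount backing the data directory. The function name is hypothetical, and /var/lib/redpanda/data is used as an assumed data directory; substitute whatever your data_directory setting points at.

```python
def check_data_mount(mounts_text, data_dir="/var/lib/redpanda/data"):
    """Report filesystem type and noatime for the mount backing data_dir.

    mounts_text uses the /proc/mounts layout:
    device mount_point fs_type options dump pass
    Returns None if no mount point is a parent of data_dir.
    """
    target = data_dir.rstrip("/") + "/"
    best = None
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue
        mount_point, fs_type, options = parts[1], parts[2], parts[3].split(",")
        prefix = mount_point.rstrip("/") + "/"
        # Keep the longest mount point that is a path prefix of data_dir.
        if target.startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type, options)
    if best is None:
        return None
    mount_point, fs_type, options = best
    return {
        "mount_point": mount_point,
        "is_xfs": fs_type == "xfs",
        "noatime": "noatime" in options,
    }


# Invented sample: root on ext4, a dedicated xfs volume for Redpanda.
SAMPLE_MOUNTS = (
    "/dev/root / ext4 rw,relatime 0 0\n"
    "/dev/nvme1n1 /var/lib/redpanda xfs rw,noatime,attr2 0 0\n"
)

print(check_data_mount(SAMPLE_MOUNTS))
```

On a real node you would pass in the contents of /proc/mounts, for example open("/proc/mounts").read().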

2. CPU Contention

While disk is the usual suspect, a CPU-starved node can’t process requests, compress data, or manage its internal state efficiently, leading to increased latency.

Diagnosis: Use top or htop. Look for Redpanda’s %CPU consistently above 80-90% of the cores it has been given (top reports per-process usage, where 100% equals one full core). Also check the load average: a value that stays well above the number of cores indicates the system is overloaded.
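
The load-average comparison is simple arithmetic, sketched below with Python’s standard library. The function names are illustrative, and treating a ratio above 1.0 as "overloaded" is only a rule of thumb, not a Redpanda-specific threshold.

```python
import os


def load_per_core(load_avg, cores):
    """Normalise a load average by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


def current_load_per_core():
    """Read the 1-minute load average of this host and normalise it."""
    one_minute, _five, _fifteen = os.getloadavg()  # Unix only
    return load_per_core(one_minute, os.cpu_count() or 1)


# Example with invented numbers: a load of 6.0 on a 4-core node.
print(load_per_core(6.0, 4))
```

A ratio that sits persistently above 1.0 means more runnable tasks than cores. Calling current_load_per_core() on a broker gives the live value.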

Common Causes & Fixes:

  • Insufficient CPU cores: Redpanda benefits from more cores for parallel processing of requests, replication, and log compaction.

    • Fix: Increase the instance type to one with more CPU cores. For example, move from an m5.xlarge (4 vCPU) to an m5.2xlarge (8 vCPU).
    • Why it works: More cores allow Redpanda to handle more concurrent operations and background tasks without blocking.
  • High irq or softirq usage: This indicates the CPU is spending a lot of time handling network or disk interrupts, often a sign of network saturation or very high I/O rates.

    • Fix: Optimize network interfaces (e.g., use SR-IOV if available) and ensure your disk I/O isn’t maxing out as described above. Tune interrupt affinity.
    • Why it works: Reducing interrupt load frees up CPU cycles for Redpanda’s application threads.
  • Background processes: Other applications or system daemons consuming CPU resources on the same nodes.

    • Fix: Isolate Redpanda on dedicated nodes. Remove or reschedule non-essential services.
    • Why it works: Guarantees that Redpanda has the CPU resources it needs without competition.
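
To put a number on the irq and softirq usage mentioned above, you can compare two samples of the aggregate cpu line from /proc/stat. This is a minimal sketch with a hypothetical function name and invented sample values. It sums only the first eight counters (user through steal) because guest time is already included in user time.

```python
def interrupt_share(before, after):
    """Fraction of CPU time spent in irq + softirq between two samples.

    Each argument is the aggregate 'cpu' line from /proc/stat, whose fields
    are: user nice system idle iowait irq softirq steal guest guest_nice.
    """
    def parse(line):
        fields = line.split()
        if not fields or fields[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        return [int(value) for value in fields[1:9]]

    delta = [b - a for a, b in zip(parse(before), parse(after))]
    total = sum(delta)
    if total <= 0:
        return 0.0
    irq, softirq = delta[5], delta[6]
    return (irq + softirq) / total


# Invented samples taken some seconds apart.
BEFORE = "cpu 100 0 50 800 10 5 35 0 0 0"
AFTER = "cpu 200 0 100 1500 20 25 155 0 0 0"

print(f"{interrupt_share(BEFORE, AFTER):.1%}")
```

The same calculation works on the per-core lines (cpu0, cpu1, ...) if you relax the label check, which helps spot a single core absorbing most interrupts.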

3. Network Throughput and Latency

High network latency between brokers or between clients and brokers can directly impact replication and producer acknowledgments.

Diagnosis: Use ping and traceroute between Redpanda nodes. Monitor network interface traffic with iftop or nload. Check for network errors (netstat -s | grep -i error).
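
If iftop or nload is not installed, link utilization can be estimated from the kernel’s byte counters. The sketch below assumes a Linux host, where those counters are exposed under /sys/class/net/<iface>/statistics/. The function names, the interface name, and the 10 Gbps link speed are placeholders for your own values.

```python
from pathlib import Path


def read_counter(iface, name="tx_bytes"):
    """Read one interface counter (e.g. tx_bytes, rx_bytes) on Linux."""
    return int(Path(f"/sys/class/net/{iface}/statistics/{name}").read_text())


def nic_utilization(bytes_before, bytes_after, interval_s, link_gbps):
    """Fraction of link capacity used in one direction over the interval."""
    if interval_s <= 0 or link_gbps <= 0:
        raise ValueError("interval_s and link_gbps must be positive")
    bits = (bytes_after - bytes_before) * 8
    return bits / interval_s / (link_gbps * 1e9)


# Invented example: 1.125 GB sent in one second on a 10 Gbps link.
print(nic_utilization(0, 1_125_000_000, 1.0, 10))
```

In practice you would call read_counter("eth0") twice with a time.sleep between the calls and pass both readings in. Check tx_bytes and rx_bytes separately, since replication can saturate one direction only.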

Common Causes & Fixes:

  • Network Interface Card (NIC) saturation: If your network interfaces are maxing out their bandwidth (e.g., 1 Gbps or 10 Gbps), replication traffic can lag.

    • Fix: Upgrade to higher-bandwidth network interfaces (e.g., 25 Gbps or more) or use multiple interfaces for aggregate throughput. Spread topics and partitions across more brokers to reduce the traffic each broker has to carry.
    • Why it works: More bandwidth allows Redpanda’s replication streams to keep up with producers.
  • High inter-broker latency: Physical distance or network congestion between brokers delays replication.

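    • Fix: Keep all brokers of a cluster within one region. Placing them in the same availability zone gives the lowest latency, at the cost of losing the cluster if that zone fails, while spreading them across zones adds a small amount of replication latency in exchange for availability. Use instances with enhanced networking.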
    • Why it works: Minimizing physical network hops and congestion reduces the time it takes for data to travel between brokers.
  • MTU mismatches: Incorrect Maximum Transmission Unit (MTU) settings can lead to packet fragmentation and retransmissions, increasing latency.

    • Fix: Ensure consistent MTU settings (typically 1500, or around 9000 for jumbo frames) across all network interfaces, switches, and routers on the path between Redpanda brokers.
    • Why it works: Consistent MTU avoids fragmentation, which is a costly operation that adds latency and consumes CPU.
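
A common way to test for an MTU mismatch is to ping with the "don’t fragment" flag set and the largest payload that should fit. The sketch below does the arithmetic for IPv4 and builds the command for the Linux iputils ping (-M do prohibits fragmentation). The function names and the host name broker-2 are placeholders.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def ping_command(host, mtu, count=3):
    """Build a Linux iputils ping command that forbids fragmentation."""
    return ["ping", "-M", "do", "-c", str(count),
            "-s", str(max_ping_payload(mtu)), host]


print(max_ping_payload(1500))
print(" ".join(ping_command("broker-2", 9000)))
```

If the command succeeds at the payload size for the MTU you expect, the path carries that MTU end to end. If it fails with a "message too long" style error or silently drops, some hop has a smaller MTU.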

4. Redpanda Configuration Tuning

Default configurations are often conservative. Specific settings can significantly impact performance.

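Diagnosis: Review the relevant settings on both sides: producer configuration in your client applications (such as acks), and broker and topic settings in redpanda.yaml or the cluster configuration (rpk cluster config).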

Common Causes & Fixes:

  • Producer acks setting: If producers are using acks=all, they must wait for data to be replicated to multiple brokers before receiving an acknowledgment.

    • Fix: If your durability requirements allow, consider acks=1 for producers.
    • Why it works: acks=1 means the producer only waits for the leader broker to acknowledge the write, reducing latency by not waiting for replication.
  • Memory allocation for the redpanda process: Insufficient memory leads to swapping and a smaller in-memory cache. Redpanda is written in C++ and has no garbage collector, so the symptoms are memory pressure and cache misses rather than GC pauses.

    • Fix: Ensure the Redpanda process has adequate memory allocated and that the system is not swapping. On Kubernetes, this means setting appropriate resources.limits.memory.
    • Why it works: Adequate memory lets Redpanda serve recent data from its cache instead of from disk and avoids the stalls caused by swapping.
  • Topic replication factor: A higher replication factor (e.g., 3 or 5) means more copies of data need to be written, increasing the load on brokers.

    • Fix: While essential for durability, ensure your replication factor is appropriate for your actual needs. If you have very high write volumes, consider if a replication factor of 3 is truly necessary for all topics.
    • Why it works: Fewer replicas mean less work for the cluster to do to achieve consistency.
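
Before changing acks or the replication factor, it helps to measure what the change actually buys you. Below is a minimal sketch of a nearest-rank percentile over client-side latency samples; the function name and the sample values are invented for illustration. Collect samples the same way before and after the change and compare the P99 values.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Invented produce latencies in milliseconds: mostly fast, two slow outliers.
latencies_ms = [2.0] * 98 + [40.0, 250.0]

print("P50:", percentile(latencies_ms, 50))
print("P99:", percentile(latencies_ms, 99))
```

The example shows why the median hides tail problems: two slow requests out of a hundred leave P50 untouched while P99 lands on one of the outliers.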

5. Log Compaction and Schema Registry

Background processes like log compaction and the Schema Registry can consume resources.

Diagnosis: Monitor CPU and disk I/O. Check Redpanda logs for compaction-related messages.

Common Causes & Fixes:

  • Aggressive log compaction: If compaction runs too frequently or on very large logs, it can cause significant I/O and CPU spikes.

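    • Fix: Tune log_compaction_interval_ms, which controls how often compaction runs. In recent Redpanda versions this is a cluster property (set with rpk cluster config set) rather than a redpanda.yaml entry. Compaction is enabled per topic through cleanup.policy, so switch topics that don’t need it to cleanup.policy=delete. Size-based retention (retention.bytes on the topic) is a separate setting that limits how much log data is kept.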
    • Why it works: Adjusting compaction frequency or disabling it reduces background I/O and CPU load.
  • Schema Registry load: If you have a high volume of schema registrations or lookups, the Schema Registry can become a bottleneck.

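    • Fix: Redpanda’s Schema Registry runs inside the broker process, so heavy schema traffic competes with produce and consume work. Make sure clients cache schemas instead of looking them up per message, and spread registry requests across brokers. If you run a separate, dedicated registry instead, give it sufficient resources and scale it horizontally.
    • Why it works: Fewer registry requests, or requests handled elsewhere, leave more broker resources for the data path.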

If you’ve addressed all of these, your next problem will likely be identifying and optimizing specific client applications that are sending misconfigured or inefficient requests.
