The most surprising thing about sizing Redpanda hardware is that you’re not just sizing for raw throughput, but for the predictability of that throughput.
Let’s see Redpanda in action. Imagine we’re setting up a small cluster for a new event-driven microservice. We’ve got three nodes, each a beefy virtual machine with 16 vCPUs and 64GiB of RAM.
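    # Cluster-wide view: brokers and topics
    rpk cluster info
    # Node-level setting: where this node stores its log segments
    # (assumes the default config path)
    grep data_directory /etc/redpanda/redpanda.yaml
    # Cluster-level setting: largest record batch the brokers accept
    rpk cluster config get kafka_batch_max_bytes

These commands give us a snapshot of the cluster’s current state. `data_directory` tells us where Redpanda stores its log segments on disk; it is a node-level property in `redpanda.yaml` rather than a cluster property, which is why we read it from the file. `kafka_batch_max_bytes` caps the size of a record batch a broker will accept, which matters when producers batch aggressively for throughput.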
Now, what problem does this solve? Redpanda, like Kafka, is a distributed commit log. It’s designed to ingest a massive stream of events and make them durable and available to consumers. Sizing correctly ensures that this ingestion and delivery happen without unexpected slowdowns, dropped messages, or data loss, even under load. It’s about predictable latency and throughput for your critical data streams.
Internally, Redpanda uses a shared-nothing, thread-per-core architecture: each node, and each core within a node, owns its share of the partitions and cooperates with its peers to replicate data. The core components are the Raft consensus protocol, which handles both partition replication and leader election, and the Kafka-compatible API that producers and consumers speak. Performance hinges on efficient disk I/O, network bandwidth, and CPU for serialization/deserialization, compression, and Raft.
Here are the exact levers you control:
- Storage: This is paramount. Redpanda needs fast, low-latency NVMe SSDs. The size of your data directory determines how much data you can retain, and the speed of the NVMe dictates how quickly Redpanda can write new data and compact old data. A common recommendation is to provision at least 3x the expected daily data volume for retention and growth. For example, if you expect 1TB of data per day, aim for at least 3TB of fast SSD storage per node. Treat that as a floor: the real requirement scales with your retention period and replication factor, spread across the nodes in the cluster.
- RAM: Unlike Kafka, Redpanda does not lean on the OS page cache. It reads and writes with direct I/O and keeps its own in-memory batch cache, so the Redpanda process should own most of the node’s memory rather than a small slice of it. A commonly cited rule of thumb is at least 2GiB of RAM per core. Our 64GiB, 16-vCPU nodes work out to 4GiB per core, which leaves comfortable headroom. Memory limits are set when the process starts (the `--memory` and `--reserve-memory` options; check the names against your version), and `rpk redpanda tune all` applies the kernel-level tuners for production hosts.
- CPU: While storage and RAM are often the first bottlenecks, CPU is critical for Raft replication, compression, and TLS encryption/decryption. Because Redpanda is thread-per-core, more vCPUs mean more partitions and concurrent requests handled in parallel. For busy clusters, 16-32 vCPUs per node is a common starting point.
- Network: Redpanda is a network-intensive application. High-throughput, low-latency networking is essential, especially for inter-node communication during replication. 10GbE is a baseline, with 25GbE or higher becoming necessary for very large clusters or high-throughput workloads.
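To make these levers concrete, here is a minimal back-of-the-envelope sketch in Python. It is not an official Redpanda sizing tool: the `ClusterPlan` class, its field names, the 1.5x headroom, and the two-day retention are all assumptions chosen for illustration. It estimates per-node storage from daily ingest, retention, and replication factor, and reports RAM per core and the average ingest rate for the three-node example above.

```python
from dataclasses import dataclass


@dataclass
class ClusterPlan:
    """Hypothetical inputs for a rough sizing estimate (all names are illustrative)."""
    daily_ingest_tb: float     # TB written per day, before replication
    retention_days: float      # how long data is kept
    replication_factor: int    # copies of each partition
    node_count: int
    vcpus_per_node: int
    ram_gib_per_node: float
    headroom: float = 1.5      # assumed safety margin for growth and compaction


def storage_per_node_tb(plan: ClusterPlan) -> float:
    """Replicated data volume, spread evenly across nodes, plus headroom."""
    replicated_tb = plan.daily_ingest_tb * plan.retention_days * plan.replication_factor
    return replicated_tb / plan.node_count * plan.headroom


def ram_per_core_gib(plan: ClusterPlan) -> float:
    """Memory available to each core on a node."""
    return plan.ram_gib_per_node / plan.vcpus_per_node


def avg_ingest_mb_per_s(plan: ClusterPlan) -> float:
    """Average producer ingest rate; real peaks can be far higher than this."""
    bytes_per_day = plan.daily_ingest_tb * 1e12
    return bytes_per_day / 86_400 / 1e6


plan = ClusterPlan(
    daily_ingest_tb=1.0,
    retention_days=2,
    replication_factor=3,
    node_count=3,
    vcpus_per_node=16,
    ram_gib_per_node=64,
)

print(f"storage per node: {storage_per_node_tb(plan):.1f} TB")
print(f"RAM per core:     {ram_per_core_gib(plan):.1f} GiB")
print(f"average ingest:   {avg_ingest_mb_per_s(plan):.1f} MB/s")
```

With these inputs the storage estimate comes out to 3 TB per node, which matches the 3x rule of thumb, but changing the retention or replication factor moves it quickly. The average ingest rate is only a lower bound for network planning, since replication traffic and traffic bursts sit on top of it.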
A factor that is easy to overlook is the write amplification of your underlying SSDs. If your NVMe drives have a high write amplification factor (e.g., due to heavy background garbage collection or wear leveling), Redpanda’s write performance can degrade significantly, even if the raw sequential write speeds look good on paper. Monitoring disk I/O wait times and queue depths is the most direct way to diagnose this.
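On Linux, I/O wait time and queue depth can be derived from two readings of `/proc/diskstats` taken a few seconds apart. The sketch below shows the arithmetic, assuming the standard field layout of that file. The helper names (`DiskSample`, `parse_diskstats_line`, `io_pressure`) and the sample lines are made up for illustration and are not part of Redpanda or rpk.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DiskSample:
    """Cumulative counters for one device from a /proc/diskstats line."""
    reads: int           # reads completed
    read_ms: int         # milliseconds spent reading
    writes: int          # writes completed
    write_ms: int        # milliseconds spent writing
    weighted_io_ms: int  # weighted milliseconds spent doing I/O


def parse_diskstats_line(line: str) -> Tuple[str, DiskSample]:
    """Parse one line; fields are major, minor, name, then the counters."""
    f = line.split()
    sample = DiskSample(
        reads=int(f[3]),
        read_ms=int(f[6]),
        writes=int(f[7]),
        write_ms=int(f[10]),
        weighted_io_ms=int(f[13]),
    )
    return f[2], sample


def io_pressure(before: DiskSample, after: DiskSample, interval_s: float) -> Tuple[float, float]:
    """Return (average wait per I/O in ms, average queue depth) over the interval."""
    ios = (after.reads - before.reads) + (after.writes - before.writes)
    wait_ms = (after.read_ms - before.read_ms) + (after.write_ms - before.write_ms)
    await_ms = wait_ms / ios if ios else 0.0
    queue_depth = (after.weighted_io_ms - before.weighted_io_ms) / (interval_s * 1000.0)
    return await_ms, queue_depth


# Synthetic readings for a hypothetical device, taken one second apart.
line_t0 = "259 0 nvme0n1 1000 0 8000 500 2000 0 16000 1500 0 1200 2000"
line_t1 = "259 0 nvme0n1 1100 0 8800 600 2400 0 19200 2400 3 1700 3000"

name, t0 = parse_diskstats_line(line_t0)
_, t1 = parse_diskstats_line(line_t1)
await_ms, queue_depth = io_pressure(t0, t1, interval_s=1.0)
print(f"{name}: await={await_ms:.2f} ms, queue depth={queue_depth:.2f}")
# prints "nvme0n1: await=2.00 ms, queue depth=1.00"
```

In practice you would read the real file twice instead of using hard-coded lines, or simply watch the equivalent columns in `iostat -x`. A wait time that climbs under steady write load while throughput stays flat is the pattern worth investigating.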
Next, you’ll want to explore how to tune Redpanda’s replication factor and topic configurations to optimize for different durability and performance trade-offs.