Redpanda’s rpk cluster commands are your primary diagnostic tools when things go sideways, and understanding them is key to keeping your cluster humming.
Redpanda Cluster Health Check: rpk cluster Commands
The rpk cluster command group is your go-to for understanding the internal state and health of your Redpanda cluster. It’s less about "fixing" and more about "diagnosing" issues by interrogating the cluster’s internal reporting mechanisms. The most common problem you’ll encounter is a cluster reporting unhealthy, often manifesting as elevated latency, failed operations, or general instability. This usually stems from fundamental issues with node communication, resource availability, or internal state consistency.
Here’s a breakdown of the most common causes for a Redpanda cluster reporting unhealthy, and how to diagnose and fix them using rpk:
1. Node Unreachability or Communication Failure
- Diagnosis: The most fundamental problem is that nodes can't talk to each other, whether because of a network partition, firewall rules, or a node that is simply down. Run `rpk cluster health` and look for a cluster that is reported as unhealthy and for any nodes listed as down. If `rpk cluster health` itself fails to connect or times out, that strongly suggests a connectivity problem between your client and the cluster, or between the cluster nodes themselves.
- Common Causes & Fixes:
- Firewall Blocking Ports: Redpanda nodes communicate on several ports (by default, 9092 for the Kafka API, 9644 for the Admin API, 33145 for internal RPC between nodes, 8081 for the Schema Registry, and 8082 for the HTTP Proxy). Ensure the ports configured in your `redpanda.yaml` are open between all nodes in the cluster.
- Fix: On affected nodes, use `sudo ufw allow <port>` (for UFW) or the equivalent command for your firewall. For example, to allow inter-node RPC on the default port 33145: `sudo ufw allow 33145/tcp`.
- Why it works: Opens the channels Redpanda needs for its internal RPC traffic, which carries Raft consensus and data replication.
- Incorrect advertised addresses or `seed_servers`: If nodes advertise addresses that their peers cannot reach, or do not list the right seed servers, they won't be able to find and connect to each other.
- Fix: Update your `redpanda.yaml` configuration file. Make sure `advertised_kafka_api` and `advertised_rpc_api` contain an IP address or hostname that clients and other nodes can actually reach. For `seed_servers`, list the internal RPC address and port of the cluster's seed nodes. Example:

```yaml
redpanda:
  advertised_kafka_api:
    - address: 192.168.1.10
      port: 9092
  advertised_rpc_api:
    address: 192.168.1.10
    port: 33145
  seed_servers:
    - host:
        address: 192.168.1.10
        port: 33145
    - host:
        address: 192.168.1.11
        port: 33145
```

  Then restart the Redpanda service on the affected node(s).
- Why it works: Correctly configured advertised addresses and seeds ensure nodes can discover each other and establish direct communication paths.
- Underlying Network Instability: Packet loss, high latency, or network interface issues can disrupt communication.
- Fix: Investigate your network infrastructure. Use tools like `ping`, `mtr`, or `tcpdump` to diagnose packet loss or excessive latency between nodes. Address any network hardware or configuration problems.
- Why it works: Stable and reliable network connectivity is a prerequisite for distributed system consensus and data replication.
- DNS Resolution Issues: If nodes rely on DNS for discovery, incorrect or slow DNS resolution can cause problems.
- Fix: Ensure all nodes can reliably resolve the hostnames of other nodes in the cluster. Check your DNS server configuration and network settings.
- Why it works: Consistent DNS resolution ensures nodes can correctly map hostnames to IP addresses for communication.
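The connectivity checks in this section can be scripted. The following is a minimal Python sketch, not part of rpk, that attempts a TCP connection and distinguishes DNS failures, timeouts, and refused connections. The function name `check_port` is made up for illustration, and the addresses and ports in the usage note are assumptions; substitute the values from your own `redpanda.yaml`.

```python
import socket
import time
from typing import Tuple


def check_port(host: str, port: int, timeout: float = 2.0) -> Tuple[bool, str]:
    """Attempt a TCP connection and return (reachable, human-readable detail)."""
    start = time.monotonic()
    try:
        # create_connection resolves the hostname and then connects.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed_ms = (time.monotonic() - start) * 1000
            return True, f"connected in {elapsed_ms:.1f} ms"
    except socket.gaierror as exc:
        # Hostname could not be resolved: a DNS problem, not a firewall one.
        return False, f"DNS resolution failed: {exc}"
    except (socket.timeout, TimeoutError):
        # No answer at all: often a firewall silently dropping packets.
        return False, "timed out"
    except OSError as exc:
        # Host answered but refused, or the network is unreachable.
        return False, f"connection failed: {exc}"
```

For example, running `check_port("192.168.1.11", 9092)` from `192.168.1.10` shows whether that node can reach its peer's Kafka listener. A timeout usually points at a firewall dropping packets, while a refused connection means the host is reachable but nothing is listening on that port.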
2. Resource Exhaustion (CPU, Memory, Disk I/O)
- Diagnosis: Redpanda is resource-intensive. When nodes are starved for CPU, memory, or disk I/O, they become slow to respond, leading to health check failures and general unresponsiveness. Start with `rpk cluster info` and `rpk topic list`. While `rpk` doesn't directly show OS-level resource usage, you'll often see an unhealthy cluster status, under-replicated partitions in `rpk cluster health`, and slow responses to `rpk` commands themselves. System-level tools like `top`, `htop`, `iostat`, and `vmstat` are crucial here.
- Common Causes & Fixes:
- Insufficient System Resources: The nodes simply don’t have enough RAM, CPU, or disk throughput for the workload.
- Fix: Scale up your instances (more CPU/RAM) or scale out (more nodes). Ensure disks are SSDs with sufficient IOPS. For disk I/O, check that your storage is provisioned correctly and not saturated.
- Why it works: Provides Redpanda with the necessary computational and storage resources to operate efficiently and meet its internal deadlines.
- Disk Latency: Redpanda relies heavily on disk for its write-ahead log (WAL) and state storage. High disk latency can cripple performance.
- Fix: Use faster storage (e.g., NVMe SSDs). Ensure your storage is not experiencing contention from other processes. Monitor disk I/O wait times (`iowait` in `top` or `iostat`).
- Why it works: Reduces the time Redpanda spends waiting for disk operations, allowing it to process requests and replicate data faster.
- Excessive Topic/Partition Count: A very large number of topics and partitions can increase metadata overhead and consume significant resources, especially during cluster startup or rebalancing.
- Fix: Consolidate topics where possible. Review your application’s partitioning strategy to avoid excessive fragmentation.
- Why it works: Reduces the management overhead for Redpanda, allowing it to focus resources on data handling rather than metadata processing.
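To get a quick feel for whether a disk is slow to acknowledge writes, a rough probe like the one below can help. It is a sketch only: it times small write-plus-`fsync` cycles from Python, which is not how Redpanda itself performs I/O, so treat the numbers as a coarse signal rather than a benchmark. The function names and the default of 50 writes of 4 KiB are arbitrary choices made for this illustration.

```python
import math
import os
import statistics
import tempfile
import time
from typing import Dict, List


def fsync_latencies_ms(directory: str, writes: int = 50, block_size: int = 4096) -> List[float]:
    """Time `writes` small write+fsync cycles in `directory`, in milliseconds."""
    payload = os.urandom(block_size)
    latencies: List[float] = []
    # The temporary file lives on the filesystem being probed and is removed afterwards.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the data to stable storage before stopping the clock
            latencies.append((time.perf_counter() - start) * 1000)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


def summarize(latencies: List[float]) -> Dict[str, float]:
    """Return median, nearest-rank 99th percentile, and maximum of the samples."""
    ordered = sorted(latencies)
    p99_index = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return {
        "median_ms": statistics.median(ordered),
        "p99_ms": ordered[p99_index],
        "max_ms": ordered[-1],
    }
```

Calling `summarize(fsync_latencies_ms("/var/lib/redpanda/data"))` probes the default Redpanda data directory (an assumption; use the `data_directory` from your configuration, and run it as a user with write access). A large gap between the median and the maximum suggests intermittent contention rather than a uniformly slow device. For a proper measurement, use a dedicated tool such as `fio` alongside `iostat`.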
3. Internal State Inconsistency or Consensus Issues
- Diagnosis: Redpanda uses the Raft consensus algorithm for state management. If nodes disagree on the state of partitions or metadata, the cluster becomes unhealthy. Run `rpk cluster health` and look for partitions reported as leaderless (unavailable) or under-replicated, then inspect the partitions of an affected topic with `rpk topic describe <topic> -p`.
- Common Causes & Fixes:
- Leader Election Failures: If a leader for a partition fails and a new leader cannot be elected due to network issues or insufficient quorum, the partition becomes unavailable.
- Fix: Ensure network connectivity is stable between nodes. Verify that a majority of nodes (quorum) can communicate for leader election to succeed. Check Redpanda logs on the affected nodes for Raft-related errors.
- Why it works: Stable communication and quorum allow Raft to reliably elect leaders and maintain consensus on partition state.
- Data Replication Lag: If replicas cannot keep up with the leader, they fall behind, and partitions might become unavailable if the leader fails.
- Fix: Address the underlying resource issues (CPU, disk, network) that are slowing replication. Monitor `rpk cluster health` for under-replicated partitions.
- Why it works: Ensures all replicas are up-to-date, maintaining fault tolerance and availability.
- Corrupted State: In rare cases, internal state files can become corrupted.
- Fix: This is the most drastic option. It typically involves stopping Redpanda, clearing the data directory on the affected node(s) (after backing up any critical data and understanding the implications of data loss for those nodes), and restarting. Any partition whose only replica lived on the cleared node will lose its data. Consult Redpanda support before attempting this.
- Why it works: Replaces potentially corrupted state files with a clean slate, allowing the node to rejoin the cluster and resynchronize data from healthy peers.
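The quorum requirement mentioned above is simple arithmetic: a Raft group needs a strict majority of its replicas to elect a leader. The sketch below works through that rule. The function names are illustrative only and do not correspond to anything in Redpanda or rpk.

```python
def raft_quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    if replicas < 1:
        raise ValueError("a Raft group needs at least one replica")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority remains."""
    return replicas - raft_quorum(replicas)


def has_quorum(replicas: int, reachable: int) -> bool:
    """True if the reachable replicas are enough to elect a leader."""
    return reachable >= raft_quorum(replicas)


for n in (1, 3, 4, 5):
    print(f"replicas={n} quorum={raft_quorum(n)} tolerates={tolerated_failures(n)}")
```

With a replication factor of 3, two replicas must be able to communicate, so the partition survives one node failure. Going from 3 to 4 replicas raises the quorum to 3 without tolerating any additional failures, which is why odd replication factors are the usual choice.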
The next error you’ll likely encounter after fixing cluster health issues is related to specific topic operations failing, such as producer requests timing out or consumer offsets not committing, indicating that while the cluster is up, it’s not yet fully performing at the expected level.