Redpanda Cluster Health Check: rpk cluster commands (2026)

Redpanda’s rpk cluster commands are your primary diagnostic tools when things go sideways, and understanding them is key to keeping your cluster humming.

The rpk cluster command group is your go-to for understanding the internal state and health of your Redpanda cluster. It’s less about "fixing" and more about "diagnosing" issues by interrogating the cluster’s internal reporting mechanisms. The most common problem you’ll encounter is a cluster reporting unhealthy, often manifesting as elevated latency, failed operations, or general instability. This usually stems from fundamental issues with node communication, resource availability, or internal state consistency.

Here’s a breakdown of the most common causes for a Redpanda cluster reporting unhealthy, and how to diagnose and fix them using rpk:

1. Node Unreachability or Communication Failure

Diagnosis: The most fundamental problem is that nodes can’t talk to each other. This could be due to network partitioning, firewall issues, or simply a node being down.
```
rpk cluster health
```
Check the Healthy field first, then look for node IDs listed under Nodes down and for any leaderless or under-replicated partitions. If rpk cluster health itself fails to connect or times out, it’s a strong indicator of a connectivity issue between your client and the cluster’s Admin API, or between the cluster nodes themselves.
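A minimal sketch of turning that check into a scriptable gate. It assumes the health overview contains a line of the form `Healthy: true` or `Healthy: false`; the function name `cluster_is_healthy` is hypothetical, and the exact output layout may differ between rpk versions.

```shell
# cluster_is_healthy: reads `rpk cluster health` output on stdin and succeeds
# only if it contains a "Healthy:" line whose value is "true".
# Assumption: the overview prints the status as "Healthy: <true|false>".
cluster_is_healthy() {
  [ "$(awk '$1 == "Healthy:" { print $2; exit }')" = "true" ]
}

# Example usage (requires a reachable cluster):
#   rpk cluster health | cluster_is_healthy || echo "cluster needs attention"
```

Reading from stdin keeps the parsing separate from the rpk call, so a missing or unparseable status line is treated as unhealthy rather than silently passing.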
Common Causes & Fixes:
- Firewall Blocking Ports: Redpanda nodes communicate on several ports (by default 9092 for the Kafka API, 9644 for the Admin API, 33145 for internal RPC between brokers, 8081 for Schema Registry, and 8082 for the HTTP Proxy). Ensure the RPC port is open between all nodes in the cluster, and that clients can reach the Kafka and Admin API ports.
  - Fix: On affected nodes, use sudo ufw allow <port> (for UFW) or equivalent commands for your firewall. For example, to allow inter-node communication on the default RPC port: sudo ufw allow 33145/tcp.
  - Why it works: Opens the channels Redpanda uses for internal RPC, Raft consensus traffic, and data replication.
- Incorrect advertised addresses or seed_servers: If nodes advertise addresses that peers or clients cannot reach, or do not correctly identify the seed servers, they won’t be able to find and connect to each other.
  - Fix: Update your redpanda.yaml configuration file. Set advertised_kafka_api and advertised_rpc_api to an IP address or hostname that clients and other nodes can actually reach. For seed_servers, list the RPC address and port of the cluster’s seed brokers. Example:
```
redpanda:
  advertised_kafka_api:
    - address: 192.168.1.10
      port: 9092
  advertised_rpc_api:
    address: 192.168.1.10
    port: 33145
  seed_servers:
    - host:
        address: 192.168.1.10
        port: 33145
    - host:
        address: 192.168.1.11
        port: 33145
```
    Then restart the Redpanda service on the affected node(s).
  - Why it works: Correctly configured advertised addresses and seeds ensure nodes can discover each other and establish direct communication paths.
- Underlying Network Instability: Packet loss, high latency, or network interface issues can disrupt communication.
  - Fix: Investigate your network infrastructure. Use tools like ping, mtr, or tcpdump to diagnose packet loss or excessive latency between nodes. Address any network hardware or configuration problems.
  - Why it works: Stable and reliable network connectivity is a prerequisite for distributed system consensus and data replication.
- DNS Resolution Issues: If nodes rely on DNS for discovery, incorrect or slow DNS resolution can cause problems.
  - Fix: Ensure all nodes can reliably resolve the hostnames of other nodes in the cluster. Check your DNS server configuration and network settings.
  - Why it works: Consistent DNS resolution ensures nodes can correctly map hostnames to IP addresses for communication.
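The firewall, listener, and DNS checks above can be roughed out as a reachability sweep. This is a sketch under stated assumptions: the port list holds Redpanda's default Kafka API, Admin API, and RPC ports (adjust it if your redpanda.yaml overrides them), an `nc` that supports `-z` and `-w` is installed, and the names `probe` and `check_ports` are hypothetical helpers.

```shell
# Default Redpanda ports: Kafka API, Admin API, internal RPC.
# Assumption: your cluster uses the defaults; edit this list otherwise.
REDPANDA_PORTS="9092 9644 33145"

# probe HOST PORT: succeeds if a TCP connection opens within 2 seconds.
# Assumption: the installed `nc` supports -z (scan only) and -w (timeout).
probe() {
  nc -z -w 2 "$1" "$2" >/dev/null 2>&1
}

# check_ports HOST...: prints "host:port open" or "host:port blocked"
# for every host and port combination.
check_ports() {
  for host in "$@"; do
    for port in $REDPANDA_PORTS; do
      if probe "$host" "$port"; then
        echo "$host:$port open"
      else
        echo "$host:$port blocked"
      fi
    done
  done
}

# Example usage, run from each node in turn:
#   check_ports 192.168.1.10 192.168.1.11
```

Passing hostnames instead of IP addresses exercises DNS resolution as well, since a name that fails to resolve shows up as blocked on every port.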

2. Resource Exhaustion (CPU, Memory, Disk I/O)

Diagnosis: Redpanda is resource-intensive. When nodes are starved for CPU, memory, or disk I/O, they become slow to respond, leading to health check failures and general unresponsiveness.
```
rpk cluster info
rpk topic list --detailed
```
While rpk doesn’t directly show OS-level resource usage, you’ll often see rpk cluster health report the cluster as unhealthy, partitions listed as under-replicated, and slow responses to rpk commands themselves. System-level tools like top, htop, iostat, and vmstat are crucial here.
Common Causes & Fixes:
- Insufficient System Resources: The nodes simply don’t have enough RAM, CPU, or disk throughput for the workload.
  - Fix: Scale up your instances (more CPU/RAM) or scale out (more nodes). Ensure disks are SSDs with sufficient IOPS. For disk I/O, check that your storage is provisioned correctly and not saturated.
  - Why it works: Provides Redpanda with the necessary computational and storage resources to operate efficiently and meet its internal deadlines.
- Disk Latency: Redpanda relies heavily on disk for its write-ahead log (WAL) and state storage. High disk latency can cripple performance.
  - Fix: Use faster storage (e.g., NVMe SSDs). Ensure your storage is not experiencing contention from other processes. Monitor disk I/O wait times (iowait in top or iostat).
  - Why it works: Reduces the time Redpanda spends waiting for disk operations, allowing it to process requests and replicate data faster.
- Excessive Topic/Partition Count: A very large number of topics and partitions can increase metadata overhead and consume significant resources, especially during cluster startup or rebalancing.
  - Fix: Consolidate topics where possible. Review your application’s partitioning strategy to avoid excessive fragmentation.
  - Why it works: Reduces the management overhead for Redpanda, allowing it to focus resources on data handling rather than metadata processing.
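For the disk latency check above, iowait can also be computed directly from two samples of the aggregate `cpu` line in `/proc/stat`. The sketch below is Linux-only, and `iowait_pct` is a hypothetical helper name; it takes the share of elapsed CPU time (user through steal) that was spent waiting on I/O.

```shell
# iowait_pct BEFORE AFTER: percentage of CPU time spent in iowait between two
# samples of the aggregate "cpu" line from /proc/stat.
# Fields after the "cpu" label: user nice system idle iowait irq softirq steal.
iowait_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    { total = 0; for (i = 2; i <= 9 && i <= NF; i++) total += $i; t[NR] = total; w[NR] = $6 }
    END {
      d = t[2] - t[1]
      if (d <= 0) print 0
      else printf "%.1f\n", 100 * (w[2] - w[1]) / d
    }'
}

# Example usage on a Redpanda node:
#   before=$(grep '^cpu ' /proc/stat); sleep 5; after=$(grep '^cpu ' /proc/stat)
#   iowait_pct "$before" "$after"
```

A sustained high value here points at storage rather than CPU as the bottleneck, which is the case where faster disks help more than extra cores.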

3. Internal State Inconsistency or Consensus Issues

Diagnosis: Redpanda uses the Raft consensus algorithm for state management. If nodes disagree on the state of partitions or metadata, the cluster will become unhealthy.
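```
rpk cluster health
rpk topic describe <topic> -p
```
In the rpk cluster health output, check the lists of leaderless and under-replicated partitions. rpk topic describe <topic> -p shows the current leader and replica set for each partition of a topic; a partition without a leader is unavailable for reads and writes.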
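To pull individual fields out of the `rpk cluster health` report for alerting, a small extractor along these lines may help. It assumes each field is printed as `Label: value` on its own line and matches labels by prefix, because the exact label text can vary between rpk versions; `health_field` is a hypothetical helper name.

```shell
# health_field LABEL: reads `rpk cluster health` output on stdin and prints the
# value of the first line that starts with LABEL (everything after the colon).
# Assumption: fields are printed one per line as "Label: value".
health_field() {
  awk -v label="$1" 'index($0, label) == 1 { sub(/^[^:]*:[ \t]*/, ""); print; exit }'
}

# Example usage (requires a reachable cluster):
#   rpk cluster health | health_field "Leaderless partitions"
#   rpk cluster health | health_field "Nodes down"
```

An empty result means the label was not found at all, which is worth treating differently from an empty list in the report.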
Common Causes & Fixes:
- Leader Election Failures: If a leader for a partition fails and a new leader cannot be elected due to network issues or insufficient quorum, the partition becomes unavailable.
  - Fix: Ensure network connectivity is stable between nodes. Verify that a majority of the partition’s replicas (a quorum) can communicate with each other, since Raft cannot elect a leader without one. Check Redpanda logs on the affected nodes for Raft-related errors.
  - Why it works: Stable communication and quorum allow Raft to reliably elect leaders and maintain consensus on partition state.
- Data Replication Lag: If replicas cannot keep up with the leader, they fall behind, and partitions might become unavailable if the leader fails.
  - Fix: Address underlying resource issues (CPU, disk, network) that are causing replication to be slow. Watch the under-replicated partitions reported by rpk cluster health to confirm that replicas are catching up.
  - Why it works: Ensures all replicas are up-to-date, maintaining fault tolerance and availability.
- Corrupted State: In rare cases, internal state files can become corrupted.
  - Fix: This is the most drastic option. It typically involves stopping Redpanda, clearing the data directory for the affected node(s) (after backing up any critical data and understanding the implications of data loss for those nodes), and restarting. This will cause data loss for any partition whose only replica was on the cleared node. Consult Redpanda support before attempting this.
  - Why it works: Replaces potentially corrupted state files with a clean slate, allowing the node to rejoin the cluster and resynchronize data from healthy peers.
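The quorum requirement behind leader election is simple arithmetic: a Raft group with N voters needs floor(N/2) + 1 of them reachable. As a worked sketch, with hypothetical helper names `quorum_size` and `has_quorum`, where N is the number of replicas of the partition in question:

```shell
# quorum_size N: smallest majority of N voters, i.e. floor(N/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum TOTAL REACHABLE: succeeds if REACHABLE voters form a majority
# of TOTAL, meaning a leader can still be elected.
has_quorum() {
  [ "$2" -ge "$(quorum_size "$1")" ]
}

# Example usage:
#   quorum_size 3                      # a 3-replica partition needs 2 voters
#   has_quorum 3 1 || echo "no quorum: partition cannot elect a leader"
```

This is why a partition with three replicas tolerates one failed node and one with five tolerates two, and why clearing a node's data is safe only while the remaining replicas still form a majority.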

Once the cluster reports healthy again, the next problems you’re likely to see are at the topic level, such as producer requests timing out or consumer offsets failing to commit. These indicate that the cluster is up but not yet performing at the expected level.
