A Redpanda debug bundle is an archive of operational data that gives Redpanda support engineers a snapshot of your cluster’s state at a specific moment in time. Think of it as a diagnostic scan for your Redpanda cluster, capturing system metrics, configuration files, logs, and internal state.

Let’s see what goes into one and how it’s useful by looking at a hypothetical scenario.

Imagine you’re running Redpanda in Kubernetes, and you’ve started experiencing intermittent producer timeouts. You’ve checked your network, your application’s logs, and everything seems fine from your end. This is where the debug bundle becomes invaluable.

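First, you’d initiate the collection. If you’re using the Redpanda Kubernetes Operator, this can be as simple as applying a DebugBundle resource like the one below (check the CRDs installed by your operator version for the exact apiVersion and supported fields):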

apiVersion: redpanda.com/v1alpha1
kind: DebugBundle
metadata:
  name: my-redpanda-cluster-bundle
  namespace: redpanda
spec:
  clusterRef:
    name: my-redpanda-cluster

Once applied, the operator will orchestrate the collection process. It’s not just about grabbing a few files; it’s a coordinated effort across all nodes in your Redpanda cluster.
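
If that resource is not available in your operator version, the rpk CLI also has a debug bundle subcommand that can be run from inside a Redpanda pod. The sketch below only assembles the command line so you can review it before running it. The helper name build_bundle_command is hypothetical, and the flags shown (--namespace, --logs-since, --output) are the ones I understand rpk to accept; confirm them with rpk debug bundle --help for your version.

```python
import shlex
from typing import List, Optional


def build_bundle_command(
    namespace: Optional[str] = None,
    logs_since: Optional[str] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Build the argv for `rpk debug bundle` without running it.

    Flag names are assumptions based on rpk's documented options;
    verify them against `rpk debug bundle --help`.
    """
    cmd = ["rpk", "debug", "bundle"]
    if namespace:
        # Kubernetes namespace the Redpanda pods run in.
        cmd += ["--namespace", namespace]
    if logs_since:
        # Only include logs from this point onward.
        cmd += ["--logs-since", logs_since]
    if output:
        # Where to write the resulting archive.
        cmd += ["--output", output]
    return cmd


if __name__ == "__main__":
    cmd = build_bundle_command(
        namespace="redpanda", logs_since="yesterday", output="bundle.zip"
    )
    # Print a copy-pasteable command line instead of executing it.
    print(shlex.join(cmd))
```

Returning a list rather than a string keeps the command safe to pass to subprocess.run without shell quoting issues.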

On each Redpanda node, the operator will trigger several diagnostic actions:

  • System Metrics: It collects iostat, vmstat, netstat, and top output. This gives Redpanda support insight into the underlying host’s performance. Are there I/O bottlenecks? Is memory being exhausted? Is the network saturated?
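    • Example Command (run on a Redpanda node): iostat -xz 1 5 (This prints five reports of extended I/O statistics at 1-second intervals. The first report shows averages since boot, so the later reports are the ones that reflect current load.)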
    • Why it works: iostat shows disk utilization, read/write speeds, and wait times, directly indicating if storage is a bottleneck.
  • Redpanda Process Information: It gathers details about the running Redpanda processes, including their memory usage, CPU consumption, and open file descriptors.
    • Example Command (run on a Redpanda node): ps aux | grep redpanda and pmap -x $(pgrep redpanda)
    • Why it works: ps shows resource allocation, while pmap reveals the memory map, which can highlight memory leaks or excessive memory usage patterns.
  • Redpanda Configuration: All redpanda.yaml (or equivalent configuration) files are included, along with any command-line arguments used to start the Redpanda process.
    • Why it works: Configuration errors or suboptimal settings are a common source of performance issues. Seeing the exact configuration ensures everyone is on the same page.
  • Redpanda Logs: This is crucial. It collects recent Redpanda logs, often including TRACE-level logs if enabled.
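    • Example (via operator): On Kubernetes, Redpanda logs normally go to the container’s stdout/stderr, so collection typically reads pod logs; on other deployments it may target journald or paths such as /var/log/redpanda/.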
    • Why it works: Logs are the primary source for understanding what Redpanda was doing, what errors it encountered, and the sequence of events leading up to a problem.
  • Topic and Partition Information: It fetches details about your topics, partitions, leader distribution, and replication status.
    • Example Command (using rpk): rpk topic list --detailed and rpk topic describe <topic-name> -p
    • Why it works: Understanding the state of your data topology is key to diagnosing issues related to data availability, leadership, or replication lag.
  • Internal Redpanda State: This includes critical internal data like the cluster’s state machine, RPC endpoints, and cached metadata.
    • Example (via operator): This often involves calls to the Redpanda Admin API (port 9644 by default) or scraping the /metrics and /public_metrics endpoints.
    • Why it works: This provides a deep dive into Redpanda’s internal workings, revealing issues with consensus, leader election, or metadata management that might not be obvious from logs alone.
  • Network Diagnostics: It runs tools like ping and traceroute against other Redpanda nodes and relevant external services.
    • Example Command (run on a Redpanda node): ping <other-redpanda-node-ip>
    • Why it works: Network latency or packet loss between nodes can severely impact cluster performance and availability.
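
To make the System Metrics item concrete, here is a minimal sketch of how captured iostat -xz output could be scanned for saturated disks. It assumes the usual sysstat layout (a header row starting with "Device", a %util column, a blank line between reports) and a decimal-point locale. The function name busiest_devices, the 80% threshold, and the abridged sample text are illustrative, not taken from a real bundle.

```python
from typing import Dict, List

# Abridged, made-up sample in the shape of `iostat -xz` output.
SAMPLE = """\
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.10    0.00    1.20   22.40    0.00   73.30

Device            r/s     w/s     rkB/s     wkB/s   aqu-sz  %util
nvme0n1         12.00  850.00    96.00  54400.00     7.90  97.50
nvme1n1          1.00    4.00     8.00     32.00     0.01   0.60
"""


def busiest_devices(iostat_text: str, util_threshold: float = 80.0) -> Dict[str, float]:
    """Return {device: peak %util} for devices at or above the threshold."""
    peaks: Dict[str, float] = {}
    header: List[str] = []
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            header = []  # a blank line ends the current device table
            continue
        if fields[0].rstrip(":") == "Device":
            header = fields  # older sysstat prints "Device:" with a colon
            continue
        # Only parse rows that line up with a device-table header.
        if header and "%util" in header and len(fields) == len(header):
            util = float(fields[header.index("%util")])
            peaks[fields[0]] = max(peaks.get(fields[0], 0.0), util)
    return {dev: util for dev, util in peaks.items() if util >= util_threshold}


if __name__ == "__main__":
    for device, util in busiest_devices(SAMPLE).items():
        print(f"{device}: peak %util {util:.1f}")
```

Looking up columns by header name rather than by position keeps the parser tolerant of the column differences between sysstat versions.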

The operator then aggregates all this data from across the cluster into a single, downloadable archive (e.g., redpanda-debug-bundle-20231027T100000Z.zip). When you provide this to Redpanda support, they can start diagnosing without needing to interact with your live cluster, reducing the risk of further disruption.

The most surprising aspect of the Redpanda debug bundle is its ability to capture internal cluster state that isn’t directly exposed through standard metrics or logs. This includes detailed information about the Raft consensus protocol’s state for each partition, which is often the root cause of leadership or replication issues.

When you open the bundle, you’ll find an organized directory structure. At the top level, you might see directories like logs/, metrics/, config/, and cluster_state/. Deeper within, you’ll find node-specific information and detailed breakdowns of Redpanda’s internal components.
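
A quick way to get oriented is to count the files under each top-level directory before extracting anything. The sketch below uses only Python's standard library and accepts either a .zip or a tar archive. The function names are hypothetical, and the directory names you see will depend on your Redpanda version; if the archive wraps everything in a single root folder, that folder will be the only key reported.

```python
import sys
import tarfile
import zipfile
from collections import Counter
from pathlib import PurePosixPath
from typing import List


def archive_members(path: str) -> List[str]:
    """Return the file paths stored in a .zip or tar archive."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            # Directory entries in a zip end with "/"; skip them.
            return [name for name in zf.namelist() if not name.endswith("/")]
    # tarfile.open in its default mode detects gzip/bz2/xz compression itself.
    with tarfile.open(path) as tf:
        return [member.name for member in tf.getmembers() if member.isfile()]


def count_by_top_level(names: List[str]) -> Counter:
    """Count files per top-level directory; root-level files go under '.'."""
    counts: Counter = Counter()
    for name in names:
        parts = PurePosixPath(name).parts
        counts[parts[0] if len(parts) > 1 else "."] += 1
    return counts


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python bundle_summary.py <bundle-archive>")
    summary = count_by_top_level(archive_members(sys.argv[1]))
    for directory, count in sorted(summary.items()):
        print(f"{directory}/: {count} file(s)")
```

Reading only the archive index means the script stays fast even on bundles that contain large log files.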

The next challenge you’ll likely encounter is understanding how to interpret the vast amount of data within the bundle, particularly the internal state information.
