A Redpanda debug bundle is a compressed archive of operational data that gives Redpanda support engineers a comprehensive snapshot of your cluster’s state at a specific moment in time. Think of it as a high-fidelity medical scan for your Redpanda cluster, capturing everything from system metrics to configuration files and logs.
Let’s see what goes into one and how it’s useful by looking at a hypothetical scenario.
Imagine you’re running Redpanda in Kubernetes, and you’ve started experiencing intermittent producer timeouts. You’ve checked your network, your application’s logs, and everything seems fine from your end. This is where the debug bundle becomes invaluable.
First, you’d initiate the collection. If you’re using the Redpanda Kubernetes Operator, it’s as simple as applying a specific DebugBundle resource:
apiVersion: redpanda.com/v1alpha1
kind: DebugBundle
metadata:
  name: my-redpanda-cluster-bundle
  namespace: redpanda
spec:
  clusterRef:
    name: my-redpanda-cluster
Once applied, the operator will orchestrate the collection process. It’s not just about grabbing a few files; it’s a coordinated effort across all nodes in your Redpanda cluster.
On each Redpanda node, the operator will trigger several diagnostic actions:
- System Metrics: It collects iostat, vmstat, netstat, and top output. This gives Redpanda support insight into the underlying host’s performance. Are there I/O bottlenecks? Is memory being exhausted? Is the network saturated?
  - Example Command (run on a Redpanda node): iostat -xz 1 5 (This captures 5 seconds of extended I/O statistics at 1-second intervals.)
  - Why it works: iostat shows disk utilization, read/write speeds, and wait times, directly indicating if storage is a bottleneck.
- Redpanda Process Information: It gathers details about the running Redpanda processes, including their memory usage, CPU consumption, and open file descriptors.
  - Example Command (run on a Redpanda node): ps aux | grep redpanda and pmap -x $(pgrep redpanda)
  - Why it works: ps shows resource allocation, while pmap reveals the memory map, which can highlight memory leaks or excessive memory usage patterns.
- Redpanda Configuration: All redpanda.yaml (or equivalent configuration) files are included, along with any command-line arguments used to start the Redpanda process.
  - Why it works: Configuration errors or suboptimal settings are a common source of performance issues. Seeing the exact configuration ensures everyone is on the same page.
- Redpanda Logs: This is crucial. It collects recent Redpanda logs, often including TRACE-level logs if enabled.
  - Example (via operator): The operator typically targets /var/log/redpanda/ or similar paths.
  - Why it works: Logs are the primary source for understanding what Redpanda was doing, what errors it encountered, and the sequence of events leading up to a problem.
- Topic and Partition Information: It fetches details about your topics, partitions, leader distribution, and replication status.
  - Example Command (using rpk): rpk topic list and rpk topic describe <topic-name> -p
  - Why it works: Understanding the state of your data topology is key to diagnosing issues related to data availability, leadership, or replication lag.
- Internal Redpanda State: This includes critical internal data like the cluster’s state machine, RPC endpoints, and cached metadata.
  - Example (via operator): This often involves internal Redpanda API calls or scraping metrics from the /internal/ endpoints.
  - Why it works: This provides a deep dive into Redpanda’s internal workings, revealing issues with consensus, leader election, or metadata management that might not be obvious from logs alone.
- Network Diagnostics: It runs tools like ping and traceroute against other Redpanda nodes and relevant external services.
  - Example Command (run on a Redpanda node): ping <other-redpanda-node-ip>
  - Why it works: Network latency or packet loss between nodes can severely impact cluster performance and availability.
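The per-node collection described in the list above can be approximated with a short script. This is only a sketch of the general idea, not the operator's actual implementation: the DIAGNOSTICS table, the output file names, and the collect function are all hypothetical, and the commands are simply the ones quoted in the list.

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical command set mirroring the list above; the real collector's
# command list and file names are not documented here.
DIAGNOSTICS = {
    "iostat.txt": ["iostat", "-xz", "1", "5"],
    "vmstat.txt": ["vmstat", "1", "5"],
    "ps.txt": ["ps", "aux"],
}


def collect(commands, out_dir, timeout=30):
    """Run each command, save its output to out_dir, and return a status map."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    status = {}
    for filename, argv in commands.items():
        # Skip tools that are not installed instead of failing the whole run.
        if shutil.which(argv[0]) is None:
            status[filename] = "skipped: tool not installed"
            continue
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            status[filename] = "timed out"
            continue
        # Keep stderr too, since error text is often the useful part.
        (out / filename).write_text(result.stdout + result.stderr)
        status[filename] = f"exit {result.returncode}"
    return status
```

Calling collect(DIAGNOSTICS, "node-0") on a host would leave one text file per command in node-0/. Recording a status for skipped or timed-out commands means a missing tool shows up in the result rather than silently leaving a gap.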
The operator then aggregates all this data from across the cluster into a single, downloadable archive (e.g., redpanda-debug-bundle-20231027T100000Z.tar.gz). When you provide this to Redpanda support, they have everything they need to start diagnosing without needing to interact with your live cluster, reducing the risk of further disruption.
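To make the aggregation step concrete, here is a minimal sketch of packing per-node output directories into one timestamped archive. It assumes each node's data already sits in its own local directory; the function names and the node-to-directory mapping are hypothetical, and only the archive naming pattern is taken from the example file name above.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def bundle_name(now):
    """Build an archive name in the style of the example file name."""
    return f"redpanda-debug-bundle-{now.strftime('%Y%m%dT%H%M%SZ')}.tar.gz"


def aggregate(node_dirs, dest_dir, now=None):
    """Pack each node's directory into a single gzip-compressed tar archive.

    node_dirs maps a node label (used as the top-level folder inside the
    archive) to the local directory holding that node's collected files.
    """
    now = now or datetime.now(timezone.utc)
    archive = Path(dest_dir) / bundle_name(now)
    with tarfile.open(archive, "w:gz") as tar:
        for node, path in node_dirs.items():
            # arcname keeps each node's files under its own folder.
            tar.add(path, arcname=node)
    return archive
```

Keeping one top-level folder per node makes it easy to compare the same file across nodes once the archive is unpacked.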
The most surprising aspect of the Redpanda debug bundle is its ability to capture internal cluster state that isn’t directly exposed through standard metrics or logs. This includes detailed information about the Raft consensus protocol’s state for each partition, which is often the root cause of leadership or replication issues.
When you receive the bundle, you’ll find a well-organized directory structure. At the top level, you might see directories like logs/, metrics/, config/, and cluster_state/. Deeper within, you’ll find node-specific information and detailed breakdowns of Redpanda’s internal components.
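Before digging into individual files, it can help to get a quick overview of what the archive contains. The sketch below counts files under each top-level directory of a tar archive. It assumes a .tar.gz bundle laid out roughly as described above; the summarize function is a hypothetical helper, and the actual directory names in a real bundle may differ.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def summarize(archive_path):
    """Return {top-level directory: number of files} for a tar archive."""
    counts = Counter()
    # "r:*" lets tarfile detect the compression (gzip, bz2, xz, or none).
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            parts = PurePosixPath(member.name).parts
            # Files stored directly at the archive root get their own bucket.
            top = parts[0] if len(parts) > 1 else "(root)"
            counts[top] += 1
    return dict(counts)
```

For a bundle with the layout mentioned above, the result would be a small mapping with keys such as logs, metrics, config, and cluster_state, which shows at a glance where most of the data is.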
The next challenge you’ll likely encounter is understanding how to interpret the vast amount of data within the bundle, particularly the internal state information.