The OpenTelemetry Batch Processor’s queue and retry mechanisms are designed to absorb temporary network glitches and upstream service unavailability, ensuring telemetry data isn’t lost and is eventually delivered.
Let’s see it in action. Imagine a service exporting traces. Without proper batching, each trace would be a separate HTTP request, hammering the collector. With batching, traces are collected, grouped into batches, and sent as a single request.
Here’s a simplified view of the batch processor’s lifecycle:
- Data Ingestion: Spans arrive from the SDK.
- Queueing: Spans are added to an in-memory queue.
- Batching: When the queue reaches a certain size or a timeout occurs, a batch is formed.
- Export: The batch is sent to the configured exporter (e.g., OTLP exporter to a collector).
- Retry (on failure): If the export fails, the batch is re-queued with a backoff strategy.
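The lifecycle above can be sketched as a toy model. This is not the Collector’s implementation: the MiniBatcher class, its max_attempts parameter, and the flaky_export function are hypothetical names chosen for illustration, and the retry happens in place without any backoff sleep to keep the example short.

```python
from collections import deque
from typing import Callable, Deque, List


class MiniBatcher:
    """Toy model of the lifecycle: ingest -> queue -> batch -> export -> retry."""

    def __init__(self, export: Callable[[List[str]], bool],
                 send_batch_size: int = 3, max_attempts: int = 3) -> None:
        self.export = export                  # returns True on success
        self.send_batch_size = send_batch_size
        self.max_attempts = max_attempts      # hypothetical stand-in for max_elapsed_time
        self.queue: Deque[str] = deque()      # in-memory queue of spans
        self.dropped: List[List[str]] = []    # batches given up on

    def ingest(self, span: str) -> None:
        """Ingestion and queueing; flush once the size threshold is reached."""
        self.queue.append(span)
        if len(self.queue) >= self.send_batch_size:
            self.flush()

    def flush(self) -> None:
        """Batching and export; a timeout tick would call this as well."""
        if not self.queue:
            return
        size = min(len(self.queue), self.send_batch_size)
        batch = [self.queue.popleft() for _ in range(size)]
        for _attempt in range(self.max_attempts):
            if self.export(batch):
                return
            # A real implementation would wait with backoff before retrying.
        self.dropped.append(batch)


sent: List[List[str]] = []
failures = iter([False])  # the first export fails, later ones succeed


def flaky_export(batch: List[str]) -> bool:
    ok = next(failures, True)
    if ok:
        sent.append(batch)
    return ok


batcher = MiniBatcher(flaky_export)
for i in range(4):
    batcher.ingest(f"span-{i}")
batcher.flush()  # simulate the timeout firing for the leftover span
print(sent)
```

The first three spans form a full batch that fails once and then succeeds on retry; the fourth span is only exported when the simulated timeout fires.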
The core problem this solves is the inherent unreliability of network communication and distributed systems. You will have transient failures. The batch processor is your shock absorber.
The configuration lives in the Collector’s configuration file (otelcol-config.yaml in this example). In the Collector, the batching settings sit under the processors section, while retry_on_failure is set on the exporter, so the snippet shows both. The endpoint value is a placeholder:

    processors:
      batch:
        send_batch_size: 1000
        timeout: 5s

    exporters:
      otlp:
        endpoint: collector.example.com:4317  # placeholder endpoint
        retry_on_failure:
          enabled: true
          initial_interval: 500ms
          max_interval: 30s
          max_elapsed_time: 5m

Let’s break down the key parameters:
- send_batch_size: The number of telemetry items (spans, metric data points, log records) that triggers an export. A larger size can improve efficiency by reducing the number of requests, but it increases the memory footprint and the amount of data lost if a batch fails. A smaller size limits the impact of an individual failure but is less efficient. In the Collector this value is a trigger threshold rather than a hard cap; the separate send_batch_max_size setting enforces an upper limit on batch size.
- timeout: The maximum amount of time to wait for a batch to fill before exporting it, even if send_batch_size hasn’t been reached. This ensures that data isn’t held indefinitely in memory when traffic is low.
- retry_on_failure: This block controls the retry logic.
  - enabled: A boolean that turns retries on or off. It is essential to have this set to true in production.
  - initial_interval: How long to wait after the first failed export before retrying. Shorter intervals mean quicker retries but can overwhelm a struggling upstream service.
  - max_interval: The upper bound on the wait between retries. As retries continue, the interval grows up to this maximum, which prevents constant hammering during prolonged outages.
  - max_elapsed_time: The total duration for which retries will be attempted for a single failed batch. After this time, the batch is discarded. This prevents infinite retry loops and memory exhaustion.
The send_batch_size and timeout work in tandem. A batch is sent either when send_batch_size items are accumulated OR when timeout elapses, whichever comes first.
Consider the interaction between these settings. If send_batch_size is 1000 and timeout is 5s:
- If your service generates 200 spans per second, a batch of 1000 will be ready in 5 seconds, and it will be sent.
- If your service generates only 50 spans per second, a batch of 250 spans will be sent every 5 seconds.
- If your service generates 1000 spans per second, the size threshold is reached about once per second, so a full batch of 1000 is sent roughly every second without waiting for the timeout.
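The three scenarios can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a perfectly steady span rate; batch_behavior is a hypothetical helper, not part of any OpenTelemetry API.

```python
from typing import Tuple


def batch_behavior(rate_per_s: float, send_batch_size: int = 1000,
                   timeout_s: float = 5.0) -> Tuple[str, int, float]:
    """Return (trigger, items_per_batch, seconds_between_batches) for a steady rate."""
    time_to_fill = send_batch_size / rate_per_s  # seconds to reach the size threshold
    if time_to_fill <= timeout_s:
        # The size threshold is reached before (or exactly when) the timeout fires.
        return ("size", send_batch_size, time_to_fill)
    # The timeout fires first, so the batch holds whatever arrived in that window.
    return ("timeout", int(rate_per_s * timeout_s), timeout_s)


for rate in (200, 50, 1000):
    print(rate, batch_behavior(rate))
```

At 200 spans per second both triggers coincide at 5 seconds, at 50 spans per second the timeout wins with 250 spans per batch, and at 1000 spans per second the size threshold wins once per second.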
The retry mechanism is a crucial safeguard. If the OTLP exporter fails with a retryable error (e.g., the collector is down or returns a 5xx error), the batch is not immediately dropped. Instead, it is held, and the export is attempted again after initial_interval. If that attempt also fails, the wait grows by a constant multiplier on each subsequent attempt, for example to initial_interval * 2 with a multiplier of 2, up to max_interval. In recent Collector releases the default multiplier is 1.5, with random jitter applied to each wait. This exponential backoff, capped by max_interval and bounded by max_elapsed_time, is a standard pattern for robust systems.
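To make the backoff concrete, the sketch below lists the waits that the example settings (500ms initial, 30s cap, 5m total) would produce. It is a simplification under stated assumptions: retry_schedule is a hypothetical helper, the multiplier is fixed at 2, jitter is ignored, and the time spent on each export attempt is not counted.

```python
from typing import List


def retry_schedule(initial_interval: float, max_interval: float,
                   max_elapsed_time: float, multiplier: float = 2.0) -> List[float]:
    """List the waits (in seconds) before each retry until the time budget is spent."""
    waits: List[float] = []
    elapsed = 0.0
    interval = initial_interval
    # Stop once the next wait would push total waiting past max_elapsed_time.
    while elapsed + interval <= max_elapsed_time:
        waits.append(interval)
        elapsed += interval
        interval = min(interval * multiplier, max_interval)  # exponential growth, capped
    return waits


schedule = retry_schedule(initial_interval=0.5, max_interval=30.0, max_elapsed_time=300.0)
print(schedule)
print(len(schedule), sum(schedule))
```

Under these assumptions the waits double from 0.5s until they hit the 30s cap, then repeat at 30s until the 5 minute budget is used up, at which point the batch would be discarded.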
The actual data items are held in memory until they are successfully exported or discarded after max_elapsed_time. Therefore, send_batch_size and the number of batches waiting or in flight at any one time directly impact the memory usage of the collector. You can monitor this using the Collector’s internal metrics, such as otelcol_processor_batch_batch_send_size for batch sizes and otelcol_exporter_queue_size for the exporter’s sending queue.
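A rough upper bound on buffered data can be estimated with simple multiplication. This is a back-of-the-envelope sketch: estimate_buffered_bytes is a hypothetical helper, and both the number of batches held at once and the average item size are assumed values that you would need to measure for your own workload.

```python
def estimate_buffered_bytes(send_batch_size: int, batches_held: int,
                            avg_item_bytes: int) -> int:
    """Approximate bytes held in memory when batches_held full batches are pending."""
    return send_batch_size * batches_held * avg_item_bytes


# Assumed figures: 1000 items per batch, 10 batches pending, about 2 KiB per item.
total = estimate_buffered_bytes(send_batch_size=1000, batches_held=10, avg_item_bytes=2048)
print(f"{total} bytes, about {total / (1024 * 1024):.1f} MiB")
```

The estimate ignores per-object overhead in the collector, so treat it as a lower bound on real memory use for the assumed inputs.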
When configuring retry_on_failure, be mindful of the max_elapsed_time. If this is set too low, you might discard data during brief but sustained network issues. If it’s too high, you risk holding onto large amounts of data in memory if the upstream is completely unreachable for an extended period. A common starting point for max_elapsed_time is 5-15 minutes.
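The default values are often a good starting point. At the time of writing, the Collector documents the following defaults (check the README for your version):
- send_batch_size: 8192
- timeout: 200ms
- retry_on_failure: enabled: true, initial_interval: 5s, max_interval: 30s, max_elapsed_time: 5m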
The most common pitfall is setting retry_on_failure.enabled: false in production, leading to data loss on any transient network hiccup.
Once you’ve mastered batching and retries, the next logical step is to understand how to deal with data that still fails to export after retries, which often involves configuring a dead-letter queue or an alternative export path.