The OpenTelemetry Collector’s memory limiter is failing to prevent Out-Of-Memory (OOM) errors because it’s not accounting for all the memory being consumed by the Collector process.
Here are the common reasons why the memory limiter might not be doing its job:
1. Incorrectly Configured memory_limiter Settings: The most straightforward cause is that the memory_limiter component itself is configured with limits that are too high for the host or container it runs in. The processor does nothing unless you configure it explicitly in your Collector’s YAML, and values copied from a sample configuration might not be aggressive enough for your specific workload.
- Diagnosis: Check your Collector’s configuration file for the memory_limiter component. Look at the limit_percentage and spike_limit_percentage values (or limit_mib and spike_limit_mib if you use absolute limits).
- Fix: Reduce limit_percentage to a value that leaves sufficient buffer for the OS and other processes. For example, if your host has 8GB of RAM and you want the Collector to use at most 70%, set:

      processors:
        memory_limiter:
          check_interval: 1s
          limit_percentage: 70
          spike_limit_percentage: 15

- Why it works: This tells the Collector process to try to stay below the specified percentage of total memory. limit_percentage is the hard limit, and the soft limit is limit_percentage minus spike_limit_percentage (55% here). Above the soft limit the processor starts refusing incoming data, and above the hard limit it also forces a garbage collection.
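If a percentage is awkward to reason about, the same intent can be written with absolute values. The fragment below is a sketch that assumes the 8GB host from the example above and uses the processor's limit_mib and spike_limit_mib options; the numbers are rounded illustrations (about 70% and 15% of 8192 MiB), not recommendations.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 5734        # hard limit, roughly 70% of 8192 MiB (assumed host size)
    spike_limit_mib: 1228  # roughly 15% of 8192 MiB; soft limit = 5734 - 1228 = 4506 MiB
```

Absolute values make the soft limit easy to compare against what top or a dashboard reports, at the cost of having to revisit them when the host or container size changes.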
2. memory_limiter Not Enabled or Applied to the Right Pipeline: The memory_limiter is a processor, and like any processor, it needs to be explicitly added to your data processing pipelines. If it’s defined in the config but not referenced in any service.pipelines section, it will have no effect.
- Diagnosis: Review your service.pipelines configuration. Search for the processors list within your traces, metrics, or logs pipelines.
- Fix: Add memory_limiter to the processors list in each relevant pipeline. It should be the first processor in the list, so that data can be refused before later processors spend memory on it.

      service:
        pipelines:
          traces:
            receivers: [otlp]
            processors: [memory_limiter, batch] # Make sure it's here!
            exporters: [otlp]
          metrics:
            receivers: [otlp]
            processors: [memory_limiter, batch]
            exporters: [otlp]

- Why it works: By including it in the pipeline, the Collector applies the memory checking logic to the data flowing through that specific pipeline.
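To make the "defined but not referenced" pitfall concrete, here is a minimal sketch of one complete configuration in which the processor is both defined and used, this time for a logs pipeline. The exporter endpoint is a hypothetical placeholder, and the limiter values are illustrative assumptions.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  memory_limiter:            # defined here...
    check_interval: 1s
    limit_percentage: 70
    spike_limit_percentage: 15
  batch:

exporters:
  otlp:
    endpoint: backend.example.com:4317   # hypothetical endpoint

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]  # ...and referenced here, first in the list
      exporters: [otlp]
```

Removing memory_limiter from the processors list under service.pipelines would leave the configuration valid but the limiter inactive for that pipeline.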
3. Underlying Exporter/Receiver Memory Consumption: Some receivers or exporters might consume significant memory before data even reaches the memory_limiter processor or after it leaves. For example, a receiver buffering a large volume of data or an exporter that queues data internally can exceed the limiter’s effective control. The memory_limiter operates on the current memory usage, not necessarily the peak usage caused by specific components.
- Diagnosis: Use top or htop on the Collector host to observe the Collector process’s memory usage. If it’s consistently high, identify which receivers/exporters are active and consider their buffering or internal state. Inspect the Collector’s logs for warnings related to specific components.
- Fix:
  - For receivers: Tune their respective configuration options to limit buffering or request sizes. For example, the gRPC server of the otlp receiver has max_concurrent_streams.
  - For exporters: Configure their timeout and retry_on_failure settings to avoid excessive retries that build up queues. Reduce the sending_queue size (queue_size) if applicable (though this can lead to data loss if the exporter is truly overwhelmed).
  - Example for the otlp receiver:

        receivers:
          otlp:
            protocols:
              grpc:
                max_concurrent_streams: 100 # Default is often higher

- Why it works: By reducing the memory footprint of individual components, you lower the overall Collector process memory, making the memory_limiter more effective at its job.
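The exporter-side advice can be sketched in the same way. The fragment below assumes an otlp exporter built on the standard exporter helper settings (timeout, retry_on_failure, sending_queue); the endpoint is hypothetical and every value is an illustrative assumption to be tuned against your own traffic.

```yaml
exporters:
  otlp:
    endpoint: backend.example.com:4317   # hypothetical endpoint
    timeout: 10s                         # per-request timeout
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 120s             # stop retrying a batch after this long
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 500                    # smaller queue holds less in memory but drops sooner
```

The trade-off is the one noted above: a bounded retry window and a smaller queue cap the memory held by the exporter, but data is discarded earlier when the backend is slow or unavailable.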
4. Go Runtime Garbage Collection (GC) Behavior: The OpenTelemetry Collector is written in Go, and its memory usage is heavily influenced by the Go runtime’s garbage collector. The memory_limiter checks memory only periodically, so memory allocated between GC cycles and between checks can temporarily push usage past the configured limits before the limiter can react. The limiter’s check_interval plays a role here.
- Diagnosis: Monitor the Collector’s memory usage over time. Look for sharp spikes just before OOMs occur. Go’s GC can be tuned via environment variables such as GOGC and GOMEMLIMIT, but this is an advanced and potentially risky configuration.
- Fix: While direct GC tuning is complex, a shorter check_interval in the memory_limiter can help it react faster to transient spikes. Ensure spike_limit_percentage is large enough to cover the memory that can be allocated between two checks. A larger value lowers the soft limit, so the limiter starts refusing data sooner but leaves more headroom below the hard limit.

      processors:
        memory_limiter:
          check_interval: 500ms # Shorter interval
          limit_percentage: 70
          spike_limit_percentage: 20 # Slightly more room for GC spikes

- Why it works: A shorter check_interval means the Collector evaluates its memory usage more frequently, allowing it to refuse data earlier when nearing the limit_percentage limit, even during GC cycles.
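For readers who do want to try runtime-level tuning, one common approach is to set the Go runtime's GOMEMLIMIT environment variable on the Collector process. The fragment below is a sketch that assumes a Kubernetes deployment; the container name, image tag, and the 1600MiB value (chosen here as roughly 80% of an assumed 2Gi container limit) are illustrative assumptions.

```yaml
# Fragment of a Kubernetes container spec (assumed deployment environment)
containers:
  - name: otel-collector                         # hypothetical name
    image: otel/opentelemetry-collector:latest   # pin a specific version in practice
    env:
      - name: GOMEMLIMIT
        value: "1600MiB"   # soft memory limit for the Go runtime (assumed value)
```

GOMEMLIMIT is a soft limit: the runtime collects garbage more aggressively as the heap approaches it, but it does not refuse data, so it complements the memory_limiter rather than replacing it.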
5. External Processes Consuming Host Memory: With percentage-based settings, the memory_limiter computes its limit from the total memory of the host (or of the container, when a container memory limit applies), not from what is currently free. If other processes on the same host are consuming a large portion of RAM, the host can run out of memory while the Collector is still below its own limit, so the limiter never engages even with a seemingly reasonable configuration.
- Diagnosis: Use top, htop, or free -h to check overall host memory utilization and identify other memory-hungry processes.
- Fix:
  - Reduce the memory footprint of other applications on the host.
  - Increase the host’s RAM.
  - Run the Collector on a dedicated host or in a container with memory limits set at the container orchestration level (e.g., Kubernetes resource limits).
- Why it works: By ensuring that the memory the limiter assumes it has is actually available to the Collector (by reducing other consumers, increasing total RAM, or giving the Collector its own container limit), the configured limit is reached before the host or container runs out of memory.
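The container option can be sketched as a Kubernetes resource block. This fragment assumes a Kubernetes deployment; the container name and the 2Gi figure are illustrative. The memory_limiter documentation describes percentage-based limits as relying on cgroups on Linux, so the comment about percentages resolving against the container limit should be read as an assumption to verify for your Collector version.

```yaml
# Fragment of a Kubernetes container spec (assumed deployment environment)
containers:
  - name: otel-collector    # hypothetical name
    resources:
      requests:
        memory: 2Gi         # reserve the memory so neighbouring workloads cannot take it
      limits:
        memory: 2Gi         # percentage-based limiter settings are expected to resolve against this
```

Setting the request equal to the limit means the memory the limiter plans around is actually reserved for the Collector.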
6. Data Throughput and Agent-Side Processing: A very high volume of telemetry data arriving at the Collector can overwhelm its processing capacity, leading to memory pressure. The memory_limiter is a last resort; if the data arrives faster than it can be processed and sent out, memory will inevitably climb.
- Diagnosis: Monitor the number of telemetry items being received and processed by the Collector. Look for increasing queue sizes within exporters or processors if they expose metrics.
- Fix:
  - Scale out the Collector instances.
  - Use the batch processor to group data before it hits memory-intensive processors or exporters.
  - Implement agent-side aggregation if possible (e.g., on the OpenTelemetry SDKs themselves) to reduce the volume sent to the Collector.
  - Ensure your batch processor’s timeout and send_batch_size are tuned appropriately.

        processors:
          batch:
            send_batch_size: 1000
            timeout: 1s

- Why it works: The batch processor consolidates smaller spans/metrics/logs into larger batches, reducing the overhead per item and allowing exporters to send data more efficiently, thus easing memory pressure.
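One detail worth making explicit is that send_batch_size is a trigger, not a cap: a single large incoming request can still produce a bigger batch. The sketch below adds the batch processor's send_batch_max_size option to bound batch size; the numbers are illustrative assumptions.

```yaml
processors:
  batch:
    send_batch_size: 1000       # send as soon as this many items are buffered
    send_batch_max_size: 2000   # upper bound; larger batches are split (0 means no cap)
    timeout: 1s                 # or send whatever is buffered after this long
```

send_batch_max_size must be greater than or equal to send_batch_size; keeping the two reasonably close gives exporters predictable payload sizes.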
After addressing these, the next symptom you’re likely to encounter is a large volume of data being refused or dropped, indicated by messages like "data refused due to high memory usage" in the Collector’s logs.