OpenTelemetry Memory Limiter: Prevent OOM in Collector (2026)

When the OpenTelemetry Collector’s memory limiter fails to prevent Out-Of-Memory (OOM) kills, it is usually because the limiter does not account for, or cannot control, all the memory the Collector process consumes.

Here are the common reasons why the memory limiter might not be doing its job:

1. Incorrectly Configured memory_limiter Settings: The most straightforward cause is that the memory_limiter processor is configured with limits that are too high for the host or container it runs on. The Collector does not apply a memory_limiter by default, so you need to configure it explicitly in your Collector’s YAML.

Diagnosis: Check your Collector’s configuration file for the memory_limiter processor. Look at the limit_percentage and spike_limit_percentage values (or limit_mib and spike_limit_mib if you use absolute limits).
Fix: Reduce limit_percentage to a value that leaves sufficient buffer for the OS and other processes. For example, if your host has 8GB of RAM and you want the Collector to use at most 70%, set:
```
processors:
  memory_limiter:
    limit_percentage: 70
    spike_limit_percentage: 15
    check_interval: 1s
```
Why it works: limit_percentage sets the hard limit as a percentage of total memory, and spike_limit_percentage is subtracted from it to give the soft limit (here 70% - 15% = 55%, roughly 4.4GB on an 8GB host). Above the soft limit the processor starts refusing incoming data; above the hard limit it also forces a garbage collection.
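Percentage-based limits depend on the Collector detecting total memory correctly. The same budget can also be expressed with the processor’s absolute settings, limit_mib and spike_limit_mib. A minimal sketch, assuming the 8GB host from the example above (8192 MiB, so 70% is about 5734 MiB and 15% is about 1228 MiB); the numbers are illustrative, not recommendations:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 5734        # hard limit: ~70% of 8192 MiB
    spike_limit_mib: 1228  # soft limit = 5734 - 1228 = 4506 MiB
```

Absolute values make the thresholds explicit, at the cost of having to update them whenever the host or container size changes.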

2. memory_limiter Not Enabled or Applied to the Right Pipeline: The memory_limiter is a processor, and like any processor, it needs to be explicitly added to your data processing pipelines. If it’s defined in the config but not referenced in any service.pipelines section, it will have no effect.

Diagnosis: Review your service.pipelines configuration. Search for the processors list within your traces, metrics, or logs pipelines.

Fix: Add memory_limiter to the processors list in your relevant pipelines. It should be the first processor in each pipeline, so that it can refuse data before later processors allocate memory for it.

```
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch] # Make sure it's here!
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

Why it works: By including it in the pipeline, the Collector applies the memory checking logic to the data flowing through that specific pipeline.

3. Underlying Exporter/Receiver Memory Consumption: Some receivers or exporters might consume significant memory before data even reaches the memory_limiter processor or after it leaves. For example, a receiver buffering a large volume of data or an exporter that queues data internally can exceed the limiter’s effective control. The memory_limiter samples memory usage once per check_interval and can only refuse new data passing through it; it cannot reclaim memory that receivers or exporter queues already hold, and it does not see peaks that occur between checks.

Diagnosis: Use top or htop on the Collector host to observe the Collector process’s memory usage. If it’s consistently high, identify which receivers/exporters are active and consider their buffering or internal state. Inspect the Collector’s logs for warnings related to specific components.
Fix:
- For receivers: Tune their respective configuration options to limit buffering or message sizes. For example, the otlp receiver’s gRPC server settings include max_recv_msg_size_mib and max_concurrent_streams.
- For exporters: Configure their timeout and retry settings to avoid excessive retries that build up queues. Reduce the sending_queue’s queue_size if applicable (though this can lead to data loss if the exporter is truly overwhelmed).
- Example for otlp receiver (values are illustrative):
```
receivers:
  otlp:
    protocols:
      grpc:
        max_recv_msg_size_mib: 4    # Cap the size of a single incoming message
        max_concurrent_streams: 100 # Cap concurrent streams per connection
```
Why it works: By reducing the memory footprint of individual components, you lower the overall Collector process memory, making the memory_limiter more effective at its job.
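The exporter-side tuning described above might look like the following. This is a sketch only: the endpoint is a placeholder and the values are assumptions chosen for illustration. It uses the timeout, sending_queue, and retry_on_failure settings available on exporters such as otlp.

```yaml
exporters:
  otlp:
    endpoint: backend.example.com:4317  # placeholder endpoint
    timeout: 10s                        # give up on a single send attempt after 10s
    sending_queue:
      enabled: true
      num_consumers: 4                  # parallel senders draining the queue
      queue_size: 500                   # a smaller queue holds less data in memory
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 120s            # stop retrying a batch after 2 minutes
```

The trade-off is explicit: a smaller queue and a bounded retry window cap memory, but data is dropped sooner when the backend is unavailable.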

4. Go Runtime Garbage Collection (GC) Behavior: The OpenTelemetry Collector is written in Go, and its memory usage is heavily influenced by the Go runtime’s garbage collector. The memory_limiter checks memory periodically, but between GC cycles the heap can grow well beyond the live data set, so usage might temporarily exceed the configured limits before the limiter can react. The limiter’s check_interval plays a role here.

Diagnosis: Monitor the Collector’s memory usage over time. Look for sharp spikes just before OOMs occur. Go’s GC can be tuned via environment variables (GOGC and GOMEMLIMIT), but this is an advanced and potentially risky configuration.
Fix: While direct GC tuning is complex, a shorter check_interval in the memory_limiter can help it react faster to transient spikes. Also size spike_limit_percentage to cover what the Collector can allocate between two checks: a larger value lowers the soft limit, so the limiter starts refusing data earlier and leaves more headroom below the hard limit.
```
processors:
  memory_limiter:
    limit_percentage: 70
    spike_limit_percentage: 20 # More headroom between the soft and hard limits
    check_interval: 500ms # Shorter interval
```
Why it works: A shorter check_interval means the Collector evaluates its memory usage more frequently, allowing it to refuse data earlier when nearing the configured limit, even between GC cycles.
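If you do decide to tune the Go runtime, the two standard environment variables are GOGC and GOMEMLIMIT (the latter requires a Collector built with Go 1.19 or later). A hedged sketch of setting GOMEMLIMIT on a Kubernetes container follows. The container name, image tag, and the 1600MiB value (a little under 80% of an assumed 2Gi container limit) are assumptions for illustration:

```yaml
# Fragment of a Pod spec; names and values are illustrative.
containers:
  - name: otel-collector
    image: otel/opentelemetry-collector-contrib:latest
    env:
      - name: GOMEMLIMIT
        value: "1600MiB"   # soft memory limit for the Go runtime, below the container limit
```

GOMEMLIMIT is a soft target: the runtime collects more aggressively as it approaches the value, but it does not refuse allocations, so it complements the memory_limiter rather than replacing it.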

5. External Processes Consuming Host Memory: When configured with percentages, the memory_limiter computes its limits from the total memory of the host (or the container’s memory limit), not from the memory that is actually free. If other processes on the same host are consuming a large portion of RAM, the host can run out of memory before the Collector reaches its own limit, and the kernel’s OOM killer may terminate the Collector even with a seemingly reasonable configuration.

Diagnosis: Use top, htop, or free -h to check overall host memory utilization and identify other memory-hungry processes.
Fix:
- Reduce the memory footprint of other applications on the host.
- Increase the host’s RAM.
- Run the Collector on a dedicated host or in a container with memory limits set at the container orchestration level (e.g., Kubernetes resource limits).
Why it works: By ensuring that the memory the limiter assumes is actually available to the Collector (by reducing other consumers, adding RAM, or isolating the Collector behind its own container limit), the percentage-based limit triggers before the kernel’s OOM killer does.
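For the containerized option, a sketch of Kubernetes resource limits is shown here. On Linux with cgroups, the memory_limiter’s percentage settings are computed against the container’s memory limit rather than the node’s RAM, so a 70% limit with the assumed 2Gi limit below works out to roughly 1434 MiB. The container name and sizes are assumptions:

```yaml
# Fragment of a Pod spec; names and sizes are illustrative.
containers:
  - name: otel-collector
    resources:
      requests:
        memory: 2Gi   # request equals limit so the scheduler reserves the full amount
      limits:
        memory: 2Gi   # the kernel OOM-kills the container above this
```

Setting the request equal to the limit is a design choice that keeps other workloads on the node from claiming memory the limiter assumes it has.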

6. Data Throughput and Agent-Side Processing: A very high volume of telemetry data arriving at the Collector can overwhelm its processing capacity, leading to memory pressure. The memory_limiter is a last resort; if the data arrives faster than it can be processed and sent out, memory will inevitably climb.

Diagnosis: Monitor the number of telemetry items being received and processed by the Collector. Look for increasing queue sizes within exporters or processors if they expose metrics.
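The Collector can report these numbers about itself. A minimal sketch, assuming a Collector version that supports the service.telemetry.metrics.level setting; how the metrics are exposed (for example, through a Prometheus endpoint) varies by version and is not shown:

```yaml
service:
  telemetry:
    metrics:
      level: detailed  # emit the most detailed set of internal metrics
```

With internal metrics enabled, series such as otelcol_exporter_queue_size and otelcol_receiver_refused_spans (names may differ between versions) show whether queues are growing and whether data is being refused.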
Fix:
- Scale out the Collector instances.
- Use the batch processor to group data before it hits memory-intensive processors or exporters.
- Implement agent-side aggregation if possible (e.g., on the OpenTelemetry SDKs themselves) to reduce the volume sent to the Collector.
- Ensure your batch processor’s timeout and send_batch_size are tuned appropriately.
```
processors:
  batch:
    send_batch_size: 1000
    timeout: 1s
```
Why it works: The batch processor consolidates smaller spans/metrics/logs into larger batches, reducing the overhead per item and allowing exporters to send data more efficiently, thus easing memory pressure.

After addressing these, the next problem you’re likely to encounter is a large volume of data being refused or dropped, indicated by messages such as "Memory usage is above soft limit. Refusing data." in the Collector’s logs.