The SpanProcessor in your OpenTelemetry Collector is rejecting new spans because its internal queue for processing spans has reached its maximum capacity. This typically happens when spans are being generated and sent to the collector faster than the processor can export them to the configured backend.
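To make the mechanism concrete, here is a minimal sketch of a bounded queue that rejects new items once it is full and nothing is draining it. It is only an illustration, not the Collector's or any SDK's actual implementation; the class name `BoundedSpanQueueDemo` and the method `countRejected` are hypothetical names chosen for this example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BoundedSpanQueueDemo {
    /**
     * Offers `produced` placeholder spans to a queue of the given capacity
     * while no consumer drains it, and returns how many were rejected.
     */
    static int countRejected(int capacity, int produced) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        int rejected = 0;
        for (int i = 0; i < produced; i++) {
            // offer() returns false immediately (instead of blocking) when the queue is full
            if (!queue.offer("span-" + i)) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        // Capacity 5, 8 spans produced, none exported: the last 3 are rejected
        System.out.println("rejected=" + countRejected(5, 8)); // prints "rejected=3"
    }
}
```

A real processor drains the queue concurrently, so rejections only start once the export side falls behind for long enough to fill the buffer.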
Common Causes and Fixes
- Under-provisioned Exporter: The most frequent culprit is an exporter that cannot keep up with the incoming span rate. This could be due to network latency, a slow backend API, or insufficient resources on the machine running the exporter.
  - Diagnosis: Check the logs of your OpenTelemetry Collector for repeated messages like `queue is full`, `failed to send span`, or exporter-specific errors (e.g., `timeout`, `connection refused`, `rate limited`). Monitor the collector’s CPU and memory usage.
  - Fix: Increase the batch size on the `batch` processor, and/or increase the exporter’s `timeout`. For example, with an OTLP trace exporter (the endpoint below is a placeholder):

    ```yaml
    exporters:
      otlp:
        endpoint: "backend.example.com:4317"  # placeholder; use your backend's OTLP endpoint
        timeout: 30s                          # how long a single export request may take

    processors:
      batch:
        send_batch_size: 1000  # spans per batch
        timeout: 5s            # flush a partial batch after this interval
    ```

    Why it works: A larger `send_batch_size` lets the exporter send more spans in a single request, reducing the number of network round trips and potentially improving throughput if the backend can handle larger batches. A longer exporter `timeout` gives each request more time to complete before it is abandoned, reducing transient failures that can back up the queue.
- High CPU/Memory on Collector Host: The collector itself might be struggling to process incoming spans, format them for export, or manage the queues, leading to a backlog.
  - Diagnosis: Use `top`, `htop`, or your cloud provider’s monitoring tools to observe the CPU and memory utilization of the OpenTelemetry Collector process. High sustained CPU (above 80%) or memory usage is a strong indicator.
  - Fix: Scale up the resources (CPU, RAM) of the machine running the collector. If running in a containerized environment, increase the resource limits for the collector pod. Why it works: More CPU and RAM let the collector process work faster and hold more data in memory, so it can process spans and manage queues more efficiently.
- Inefficient Span Processing Configuration: The `SpanProcessor` itself might be configured with parameters that are too restrictive for the observed traffic.
  - Diagnosis: Examine your collector’s configuration file, specifically the `processors` section. Look for `batch` processors and their `send_batch_size` settings, and check the `queue_size` under each exporter’s `sending_queue`.
  - Fix: Increase `queue_size`. In the Collector this setting lives under the exporter’s `sending_queue` rather than on the `batch` processor. For example, assuming an `otlp` exporter:

    ```yaml
    processors:
      batch:
        send_batch_size: 500
        timeout: 10s

    exporters:
      otlp:
        sending_queue:
          queue_size: 2000  # buffer capacity in front of the exporter
    ```

    Why it works: A larger `queue_size` provides more buffer space for spans waiting to be sent by the exporter, absorbing temporary spikes in traffic without immediately dropping spans.
- Network Saturation or Latency to Backend: The network path between the collector and the backend where spans are being sent might be congested or experiencing high latency.
  - Diagnosis: Use `ping` and `traceroute` from the collector host to the backend endpoint. Monitor network interface statistics on the collector host for dropped packets or high utilization.
  - Fix: Optimize network routing, increase bandwidth, or move the collector geographically closer to the backend if possible. Ensure firewalls are not introducing significant latency. Why it works: A faster, more reliable network connection lets spans reach the backend more quickly and with fewer interruptions, preventing backlogs in the collector’s queues.
- Backend Service Throttling or Unavailability: The receiving backend (e.g., Jaeger, Splunk, Datadog) might be rate-limiting the collector, experiencing its own performance issues, or be temporarily unavailable.
  - Diagnosis: Check the logs and monitoring dashboards of your tracing backend for any errors, warnings, or metrics indicating it’s overloaded, rate-limiting requests, or experiencing downtime.
  - Fix: Scale up your tracing backend infrastructure or adjust its configuration to handle the incoming load. If rate-limited, consider reducing the number of spans sent from your applications or increasing the backend’s ingestion capacity. Why it works: By ensuring the backend can accept and process spans at the rate they are being sent, you prevent the collector from being held up by a bottleneck downstream.
- Misconfigured Span Count or Sampling: While less common for a "queue full" error specifically, if span generation is excessively high due to misconfiguration (e.g., 100% sampling everywhere), it can overwhelm any downstream component, including the collector’s queues.
  - Diagnosis: Review your application’s OpenTelemetry instrumentation configuration and check the sampling rates. Ensure they are appropriate for each environment’s traffic volume (e.g., avoid 100% sampling in high-traffic production services unless absolutely necessary).
  - Fix: Adjust sampling probabilities in your application’s instrumentation to a more sustainable level. For example, in Java with the OTLP exporter:

    ```java
    import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
    import io.opentelemetry.sdk.trace.SdkTracerProvider;
    import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
    import io.opentelemetry.sdk.trace.samplers.Sampler;

    SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
        .setSampler(Sampler.traceIdRatioBased(0.1)) // Sample 10% of traces
        // Batch spans and export them over OTLP/gRPC
        .addSpanProcessor(BatchSpanProcessor.builder(OtlpGrpcSpanExporter.builder().build()).build())
        .build();
    ```

    Why it works: Reducing the number of spans generated at the source directly lowers the load on the collector, preventing its queues from filling up.
After resolving the `SpanProcessor` queue full error, you might encounter `Exporter failed to send batch` errors if the underlying issue was indeed the exporter’s inability to communicate with the backend, or if the backend itself is now the bottleneck.
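The causes above all come down to spans arriving faster than they leave. A back-of-the-envelope sketch like the following can help estimate how much headroom a given queue size buys. It assumes a simplified model in which queue capacity and both rates are expressed in the same unit and each export consumer sends one batch per round trip. The class `QueueHeadroom` and its methods are hypothetical helpers, not part of any OpenTelemetry API.

```java
class QueueHeadroom {
    /**
     * Seconds until a queue of `queueSize` items fills when items arrive at
     * `inPerSec` and are exported at `outPerSec`. Returns positive infinity
     * when the export side keeps up.
     */
    static double secondsUntilFull(int queueSize, double inPerSec, double outPerSec) {
        double netGrowth = inPerSec - outPerSec;
        if (netGrowth <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return queueSize / netGrowth;
    }

    /**
     * Rough upper bound on export rate (items per second), assuming each of
     * `consumers` workers sends one batch of `batchSize` per round trip.
     */
    static double maxExportRate(int batchSize, double roundTripSeconds, int consumers) {
        return batchSize / roundTripSeconds * consumers;
    }

    public static void main(String[] args) {
        // 2000-item queue, 1500 items/s in, 1000 items/s out: full after 4 seconds
        System.out.println(secondsUntilFull(2000, 1500.0, 1000.0)); // prints 4.0
        // Batches of 500 with a 250 ms round trip and one consumer
        System.out.println(maxExportRate(500, 0.25, 1)); // prints 2000.0
    }
}
```

Under this model, a larger queue only delays the first dropped span. Sustained overload still requires raising the export rate or lowering the incoming rate.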