The SpanProcessor in your OpenTelemetry Collector is rejecting new spans because its internal queue for processing spans has reached its maximum capacity. This typically happens when spans are being generated and sent to the collector faster than the processor can export them to the configured backend.

Common Causes and Fixes

  1. Under-provisioned Exporter: The most frequent culprit is an exporter that cannot keep up with the incoming span rate. This could be due to network latency, a slow backend API, or insufficient resources on the machine running the exporter.

    • Diagnosis: Check the logs of your OpenTelemetry Collector for repeated messages like queue is full, failed to send span, or exporter-specific errors (e.g., timeout, connection refused, rate limited). Monitor the collector’s CPU and memory usage.

    • Fix: Increase the exporter’s timeout, and increase the batch size in the batch processor that feeds it (batching is configured under processors, not inside the exporter). For example, with an OTLP exporter (the endpoint is a placeholder):

      exporters:
        otlp:
          endpoint: "backend.example.com:4317"
          timeout: 30s

      processors:
        batch:
          send_batch_size: 1000
          timeout: 5s

      Why it works: A larger send_batch_size puts more spans into each export request, reducing the number of network round trips and potentially improving throughput if the backend can handle larger batches. A longer exporter timeout gives each request more time to succeed before it is abandoned, reducing transient failures that can back up the queue.
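
      To confirm that the exporter is the bottleneck, the collector’s own telemetry can help. The fragment below is a minimal sketch that raises log and metric verbosity. With internal metrics enabled, exporter metrics such as otelcol_exporter_queue_size, otelcol_exporter_queue_capacity, and otelcol_exporter_send_failed_spans are the usual ones to watch. Exact metric names and how they are exposed vary by collector version, so treat this as a starting point rather than a definitive setup.

      ```yaml
      service:
        telemetry:
          logs:
            level: debug      # more verbose exporter and queue messages
          metrics:
            level: detailed   # emit the full set of internal metrics
      ```

      A queue size that stays close to its capacity points at the exporter or the backend. Debug logging is noisy under load, so it is worth lowering the log level again once the diagnosis is done.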

  2. High CPU/Memory on Collector Host: The collector itself might be struggling to process incoming spans, format them for export, or manage the queues, leading to a backlog.

    • Diagnosis: Use top, htop, or your cloud provider’s monitoring tools to observe the CPU and memory utilization of the OpenTelemetry Collector process. High sustained CPU (above 80%) or memory usage is a strong indicator.

    • Fix: Scale up the resources (CPU, RAM) of the machine running the collector. If running in a containerized environment, increase the resource limits for the collector pod.

      Why it works: More CPU and RAM directly enable the collector process to execute its instructions faster and hold more data in memory, allowing it to process spans and manage queues more efficiently.
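
      For a collector running on Kubernetes, raising the limits might look like the fragment below. It is a sketch of the container section of a pod spec only. The container name and all values are illustrative assumptions, not recommendations, and should be sized from the utilization you actually observe.

      ```yaml
      # Fragment of a pod spec; name and values are hypothetical.
      containers:
        - name: otel-collector
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              cpu: "2"
              memory: 4Gi
      ```
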
  3. Inefficient Span Processing Configuration: The SpanProcessor itself might be configured with parameters that are too restrictive for the observed traffic.

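    • Diagnosis: Examine your collector’s configuration file. In the processors section, look at the batch processor’s send_batch_size and timeout settings. In the exporters section, look at the exporter’s sending_queue and its queue_size. (In an application SDK, the equivalent setting is the BatchSpanProcessor’s maximum queue size.)

    • Fix: Increase the queue size. In the collector, the batch processor has no queue_size setting of its own; the queue belongs to the exporter’s sending_queue. For example:

      processors:
        batch:
          send_batch_size: 500
          timeout: 10s

      exporters:
        otlp:
          sending_queue:
            queue_size: 2000

      Why it works: A larger queue_size provides more buffer space for data waiting to be sent by the exporter, absorbing temporary spikes in traffic without immediately dropping spans. The buffer lives in memory, so a larger queue also raises the collector’s memory footprint.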
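
      In the collector, the buffer in front of an exporter is configured on the exporter itself. The sketch below assumes an OTLP exporter with a placeholder endpoint; the values are illustrative. As far as the exporter helper documentation describes it, queue_size is counted in batches by default rather than in individual spans, so with a send_batch_size of 500 a queue_size of 2000 could hold on the order of 500 × 2000 = 1,000,000 spans in memory.

      ```yaml
      exporters:
        otlp:
          endpoint: "backend.example.com:4317"  # placeholder
          sending_queue:
            enabled: true
            num_consumers: 10   # parallel senders draining the queue
            queue_size: 2000    # capacity; counted in batches by default
      ```

      Raising num_consumers only helps when the backend can accept more concurrent requests; otherwise the queue simply drains at the backend’s pace.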

  4. Network Saturation or Latency to Backend: The network path between the collector and the backend where spans are being sent might be congested or experiencing high latency.

    • Diagnosis: Use ping and traceroute from the collector host to the backend endpoint. Monitor network interface statistics on the collector host for dropped packets or high utilization.

    • Fix: Optimize network routing, increase bandwidth, or move the collector geographically closer to the backend if possible. Ensure firewalls are not introducing significant latency.

      Why it works: A more performant and reliable network connection allows spans to be transmitted to the backend more quickly and with fewer interruptions, preventing backlogs in the collector’s queues.

  5. Backend Service Throttling or Unavailability: The receiving backend (e.g., Jaeger, Splunk, Datadog) might be rate-limiting the collector, experiencing its own performance issues, or be temporarily unavailable.

    • Diagnosis: Check the logs and monitoring dashboards of your tracing backend for any errors, warnings, or metrics indicating it’s overloaded, rate-limiting requests, or experiencing downtime.

    • Fix: Scale up your tracing backend infrastructure or adjust its configuration to handle the incoming load. If rate-limited, consider reducing the number of spans sent from your applications or increasing the backend’s ingestion capacity.

      Why it works: By ensuring the backend can accept and process spans at the rate they are being sent, you prevent the collector from being held up by a bottleneck downstream.
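
      While the backend is being scaled, the exporter’s retry settings determine how the collector behaves during throttling or short outages. The fragment below is a sketch assuming an OTLP exporter with a placeholder endpoint; the intervals are illustrative and should be tuned to how long your backend typically takes to recover.

      ```yaml
      exporters:
        otlp:
          endpoint: "backend.example.com:4317"  # placeholder
          retry_on_failure:
            enabled: true
            initial_interval: 5s    # wait before the first retry
            max_interval: 30s       # upper bound on the backoff between retries
            max_elapsed_time: 300s  # give up on a batch after this long
      ```

      Retried batches occupy the sending queue while they wait, so long retry windows combined with a small queue can still lead to dropped spans.
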
  6. Misconfigured Span Count or Sampling: While less common for a "queue full" error specifically, if span generation is excessively high due to misconfiguration (e.g., 100% sampling everywhere), it can overwhelm any downstream component, including the collector’s queues.

    • Diagnosis: Review your application’s OpenTelemetry instrumentation configuration and check its sampling rates. Ensure they suit each environment: sampling everything is usually affordable in low-traffic development environments, while high-traffic production services generally need a lower ratio unless full capture is absolutely necessary.

    • Fix: Adjust sampling probabilities in your application’s instrumentation to a more sustainable level. For example, in Java with the OTLP exporter:

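      import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
      import io.opentelemetry.sdk.trace.SdkTracerProvider;
      import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
      import io.opentelemetry.sdk.trace.samplers.Sampler;

      SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
          .setSampler(Sampler.traceIdRatioBased(0.1)) // Sample 10% of traces
          .addSpanProcessor(BatchSpanProcessor.builder(OtlpGrpcSpanExporter.builder().build()).build())
          .build();
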

      Why it works: Reducing the number of spans generated at the source directly lowers the load on the collector, preventing its queues from filling up.
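
      If changing application code is not practical, sampling can also be applied inside the collector. The sketch below assumes a collector distribution that includes the probabilistic_sampler processor (it ships in the contrib distribution) and assumes that an otlp receiver, a batch processor, and an otlp exporter are defined elsewhere in the same file.

      ```yaml
      processors:
        probabilistic_sampler:
          sampling_percentage: 10   # keep roughly 10% of traces

      service:
        pipelines:
          traces:
            receivers: [otlp]
            processors: [probabilistic_sampler, batch]
            exporters: [otlp]
      ```

      Collector-side sampling lowers the load on the exporter and backend, but the applications still generate and send every span, so it does not reduce traffic into the collector.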

After resolving the SpanProcessor queue full error, you might encounter Exporter failed to send batch errors if the underlying issue was indeed the exporter’s inability to communicate with the backend, or if the backend itself is now the bottleneck.
