The OpenTelemetry trace exporter failed because the collector couldn’t establish a persistent, authenticated connection to the backend it was trying to send traces to.
Common Causes and Fixes
- Network Connectivity Issues: The collector cannot reach the OTLP endpoint.
  - Diagnosis: From the collector host, try `curl -v <your_otlp_endpoint_address>:<port>`. Look for "Connection refused" or a timeout.
  - Fix: Ensure firewalls (host-based and network) allow outbound traffic on the OTLP port (e.g., 4317 for gRPC, 4318 for HTTP). If using a proxy, verify that the `HTTP_PROXY` and `HTTPS_PROXY` environment variables are correctly set for the collector process.
  - Why it works: The `curl` check bypasses the collector’s internal networking and directly tests whether a TCP connection can be established to the target address and port.
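The reachability check can also be scripted, which helps when the collector runs in a container without `curl`. The following is a minimal sketch using only Python's standard library. The function name `can_connect` and the example host are hypothetical, and the check only confirms that a TCP connection opens, not that the peer speaks OTLP.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> tuple[bool, str]:
    """Try to open a TCP connection and report the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except ConnectionRefusedError:
        # Host reachable, but nothing is listening on that port.
        return False, "connection refused"
    except socket.timeout:
        # Often a firewall silently dropping packets.
        return False, "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors.
        return False, f"error: {exc}"


if __name__ == "__main__":
    # Hypothetical endpoint; substitute your OTLP receiver's host and port.
    print(can_connect("localhost", 4317))
```

A "connection refused" result usually points at the receiver (wrong port or service down), while a timeout more often points at a firewall or routing problem between the two hosts.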
- Incorrect OTLP Endpoint: The configured endpoint address or port is wrong.
  - Diagnosis: Review your OpenTelemetry Collector configuration (`config.yaml` or equivalent). Specifically, check the `endpoint` parameter within your `exporters` section for the OTLP exporter.
  - Fix: Correct the `endpoint` to the actual OTLP receiver address and port. For example, change `endpoint: "http://localhost:4318"` to `endpoint: "http://otel-collector.mycompany.com:4318"` or `endpoint: "http://192.168.1.100:4317"`.
  - Why it works: The exporter needs the precise network location of the OTLP receiver to send data. A typo or outdated address prevents any data from reaching its destination.
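A quick sanity check of the configured value can catch obvious typos before the collector is restarted. The sketch below assumes the endpoint is either a URL or a bare `host:port` string. `check_endpoint` is a hypothetical helper, not part of any OpenTelemetry package, and it validates only the shape of the value, not whether anything is listening there.

```python
from urllib.parse import urlsplit


def check_endpoint(endpoint: str) -> list[str]:
    """Return a list of problems found in an OTLP endpoint string."""
    problems = []
    # urlsplit only recognises the host when a scheme or "//" is present,
    # so prefix bare "host:port" values before parsing.
    candidate = endpoint if "://" in endpoint else "//" + endpoint
    parts = urlsplit(candidate)
    if parts.scheme not in ("", "http", "https"):
        problems.append(f"unexpected scheme: {parts.scheme!r}")
    if not parts.hostname:
        problems.append("missing host")
    try:
        port = parts.port  # raises ValueError for non-numeric or out-of-range ports
    except ValueError:
        problems.append("port is not a valid number")
    else:
        if port is None:
            problems.append("missing port")
    return problems


if __name__ == "__main__":
    for value in ("http://localhost:4318", "http://localhost", "http://localhost:43x18"):
        print(value, check_endpoint(value))
```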
- Authentication/Authorization Failure (TLS/SSL Issues): The collector cannot authenticate with the OTLP endpoint due to certificate problems.
  - Diagnosis: Check the collector logs for messages like "x509: certificate signed by unknown authority," "remote error: tls: bad certificate," or "ssl handshake failed."
  - Fix (if using self-signed certs or private CA): Configure the collector to trust your CA. For the `otlp` exporter, set `tls` to:

        tls:
          cert_file: /path/to/client.crt # Client certificate, needed only for mutual TLS
          key_file: /path/to/client.key  # Private key for the client certificate
          ca_file: /path/to/ca.crt       # Ensure this points to your CA certificate

    If the backend is using a self-signed cert and the collector doesn’t trust it, you might need to add the backend’s CA to the collector’s system trust store or explicitly provide it via `ca_file`.
  - Why it works: This ensures the collector trusts the identity of the OTLP endpoint presented by its TLS certificate, allowing the secure handshake to complete.
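To confirm that a given CA file is enough to validate the backend's certificate, the handshake can be reproduced outside the collector. This is a rough sketch using Python's standard `ssl` module. `probe_tls` is a hypothetical helper, it does not present a client certificate (so it will not reproduce mutual-TLS failures), and Python's error text will differ from the Go messages the collector logs.

```python
import socket
import ssl
from typing import Optional


def probe_tls(host: str, port: int, ca_file: Optional[str] = None,
              timeout: float = 5.0) -> tuple[bool, str]:
    """Attempt a TLS handshake, verifying the server against ca_file.

    With ca_file=None the system trust store is used.
    """
    context = ssl.create_default_context(cafile=ca_file)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return True, f"handshake ok ({tls.version()})"
    except ssl.SSLCertVerificationError as exc:
        # For example an unknown CA, an expired certificate, or a hostname mismatch.
        return False, f"certificate verification failed: {exc.verify_message}"
    except OSError as exc:
        # Covers refused connections, timeouts, and other TLS errors.
        return False, f"connection or handshake failed: {exc}"
```

Calling `probe_tls("your-backend.com", 4317, ca_file="/path/to/ca.crt")` with and without `ca_file` shows whether the private CA is what the default trust store is missing.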
- Authentication/Authorization Failure (API Keys/Tokens): The collector is sending requests without valid credentials or with incorrect ones.
  - Diagnosis: Examine collector logs for authentication errors from the OTLP backend. These often manifest as `401 Unauthorized` or `403 Forbidden` HTTP status codes in the collector’s export attempts (over gRPC, the equivalent status codes are `Unauthenticated` and `PermissionDenied`).
  - Fix: Ensure the `headers` field in your OTLP exporter configuration contains the correct authentication token or API key. For example:

        exporters:
          otlp:
            endpoint: "https://your-backend.com:4317"
            tls:
              insecure_skip_verify: true # Use only for testing, not production
            headers:
              Authorization: "Bearer your_secret_token_here"
              X-API-Key: "your_api_key_here"

    Replace `your_secret_token_here` or `your_api_key_here` with your actual credentials.
  - Why it works: Many OTLP backends require specific headers for authentication. Providing these correctly allows the backend to identify and authorize the incoming trace data.
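When the logs are long, a small script can tally how many export attempts were rejected for authentication reasons. The sketch below is a rough, assumption-laden example: `summarize_auth_errors` is a hypothetical helper, the sample lines are invented, and the patterns match only the HTTP status codes mentioned above, so real collector output may need different expressions.

```python
import re
from collections import Counter

# Patterns based on the HTTP status codes named in the diagnosis step.
AUTH_PATTERNS = {
    "invalid or missing credentials": re.compile(r"\b401\b|Unauthorized", re.I),
    "credentials lack permission": re.compile(r"\b403\b|Forbidden", re.I),
}


def summarize_auth_errors(lines):
    """Count log lines that look like authentication or authorization failures."""
    counts = Counter()
    for line in lines:
        for label, pattern in AUTH_PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return dict(counts)


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    sample = [
        "export failed: 401 Unauthorized",
        "export failed: HTTP 403 Forbidden",
        "traces exported successfully",
    ]
    print(summarize_auth_errors(sample))
```

A count dominated by the first label suggests a missing or mistyped token, while the second suggests the token is valid but lacks permission to write traces.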
- Backend Service Unavailability or Overload: The OTLP endpoint is running but is not accepting new connections or processing requests.
  - Diagnosis: Check the health status of your OTLP backend service (e.g., Jaeger, Tempo, Honeycomb, Datadog agent). Look for high CPU, memory, or disk I/O, or error logs within the backend itself.
  - Fix: Scale up your OTLP backend resources or investigate and resolve the performance bottlenecks within the backend application. Restarting the backend service might temporarily resolve issues caused by stuck processes.
  - Why it works: If the receiving service is overwhelmed or crashing, it cannot accept or process the incoming trace data, leading to connection errors or timeouts reported by the exporter.
- Collector Configuration Error (Exporter Type Mismatch): The exporter is configured incorrectly for the protocol or format expected by the backend.
  - Diagnosis: Verify the protocol your exporter uses against what your backend expects. Common values are `grpc` and `http/protobuf`. In the OpenTelemetry Collector the exporter type determines the protocol: the `otlp` exporter sends gRPC and the `otlphttp` exporter sends OTLP over HTTP. SDK exporters typically expose a `protocol` setting instead, such as the `OTEL_EXPORTER_OTLP_PROTOCOL` environment variable.
  - Fix: Adjust the protocol. If your backend expects gRPC, use the `otlp` exporter (or `grpc`, often the default) with a gRPC port (e.g., 4317). If it expects HTTP, use the `otlphttp` exporter (or `http/protobuf`) with an HTTP port (e.g., 4318).

        exporters:
          # Pick the exporter that matches your backend.
          otlp: # gRPC
            endpoint: "your-backend.com:4317"
          otlphttp: # HTTP/protobuf
            endpoint: "https://your-backend.com:4318"

  - Why it works: Different protocols use different network ports and data serialization methods. Mismatched configurations lead to the collector sending data in a format the backend cannot understand or on a port it isn’t listening on.
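The port conventions above lend themselves to a simple consistency check. The following sketch flags the common mistake of pairing one protocol with the other protocol's conventional port. `protocol_port_warning` is a hypothetical helper, and since backends are free to listen on non-default ports, a warning is a hint rather than proof of misconfiguration.

```python
from typing import Optional

# Conventional OTLP ports: 4317 for gRPC, 4318 for HTTP.
DEFAULT_PORTS = {"grpc": 4317, "http/protobuf": 4318}


def protocol_port_warning(protocol: str, port: int) -> Optional[str]:
    """Return a warning when protocol and port look mismatched, else None."""
    if protocol not in DEFAULT_PORTS:
        return f"unknown protocol {protocol!r}; expected one of {sorted(DEFAULT_PORTS)}"
    for other, other_port in DEFAULT_PORTS.items():
        if other != protocol and port == other_port:
            return (f"port {port} is the conventional {other} port, "
                    f"but protocol is {protocol}")
    return None


if __name__ == "__main__":
    print(protocol_port_warning("grpc", 4318))
```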
- Resource Exhaustion on Collector Host: The collector process itself is running out of memory or file descriptors, preventing it from establishing new network connections.
  - Diagnosis: Monitor the collector process’s resource usage (`top`, `htop`, `docker stats`). Check system logs for "out of memory" (OOM) killer messages or "too many open files" errors.
  - Fix: Increase the RAM allocated to the collector host or container. Increase the open file descriptor limit (`ulimit -n`) for the user running the collector process. Review the collector’s configuration for excessive batch sizes or queue sizes that might be contributing to memory bloat.
  - Why it works: Network operations require system resources. When these are depleted, the operating system prevents new connections from being made, leading to exporter failures.
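File descriptor limits can be inspected programmatically as well as with `ulimit -n`. The sketch below reports the limits and open descriptor count for the Python process that runs it, so it is only indicative when started by the same user and in the same environment as the collector. `fd_usage` is a hypothetical helper, the `resource` module is Unix-only, and the `/proc` lookup is Linux-specific.

```python
import os
import resource


def fd_usage() -> dict:
    """Report this process's open file descriptors against its limits."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # Linux-specific: each entry in /proc/self/fd is one open descriptor.
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        open_fds = None  # /proc is not available on this platform
    return {"soft_limit": soft, "hard_limit": hard, "open_fds": open_fds}


if __name__ == "__main__":
    print(fd_usage())
```

On Linux, the same figures for a running collector can be read from `/proc/<pid>/limits` and `/proc/<pid>/fd`, where `<pid>` is the collector's process ID.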
The next error you’ll likely encounter after fixing the permanent exporter failure is a "Queue Full" error or a "Batch Processor Timeout," indicating that while the exporter can now try to send data, the upstream processors or the sheer volume of data is overwhelming the collector’s capacity to handle it.