OpenTelemetry exporters will fail to connect to their backends over TLS if the backend requires mutual TLS (mTLS) and does not trust the client certificate the exporter presents.

Problem: Exporter TLS Handshake Failure

The OTel exporter service, acting as a client, is unable to establish a secure TLS connection to the Collector or other OTel-compatible backend (the server). This typically shows up as a TLS handshake error such as "certificate verify failed" or "bad certificate". A "no shared cipher" error is different: it points to a cipher-suite or protocol-version mismatch, not a trust problem. In the cases below, the core issue is that one side does not trust the certificate the other side presents, most often the server rejecting the exporter's client certificate.

Common Causes and Fixes

  1. Backend Not Trusting Exporter’s CA Certificate

    • Diagnosis: On the backend (Collector) side, check the TLS configuration for the receiver. Look for a client_ca_file setting; when it is present, the receiver requires clients to present a certificate and verifies it against that file. Confirm that the Certificate Authority (CA) that signed the exporter's certificate is in it.
    • Fix: Ensure the client_ca_file on the backend includes the CA certificate that signed the exporter's certificate. If the exporter's certificate comes from a private or self-signed CA, add that CA certificate to the file; a single PEM file can hold several CA certificates.
      # Collector's receiver config
      receivers:
        otlp:
          protocols:
            grpc:
              tls:
                cert_file: /path/to/backend.crt
                key_file: /path/to/backend.key
                client_ca_file: /path/to/ca.crt # This should contain the CA that signed the exporter's cert
      
    • Why it works: The backend uses the client_ca_file to verify the identity of clients connecting to it. If the CA that signed the client’s certificate isn’t in this file, the backend rejects the connection.
  2. Exporter Using Incorrect Client Certificate/Key

    • Diagnosis: On the exporter side (e.g., within an application or a separate exporter process), check the TLS configuration for the exporter. Verify the paths to the cert_file and key_file. Ensure these files correspond to the certificate and private key that the backend is configured to trust.
    • Fix: Update the exporter’s configuration to point to the correct client certificate and its corresponding private key.
      # Example: Collector otlp exporter config (e.g., an agent Collector forwarding to a gateway).
      # Application SDKs expose the same settings in code or via environment variables; names vary by language.
      exporters:
        otlp:
          endpoint: "https://collector.example.com:4317"
          tls:
            cert_file: "/path/to/exporter.crt"
            key_file: "/path/to/exporter.key"
            ca_file: "/path/to/ca.crt" # CA that signed the backend's server certificate (needed if it is not in the system trust store)
      
    • Why it works: The exporter needs to present a valid, trusted certificate to the backend. If it’s presenting the wrong one, or a certificate that doesn’t match its private key, the handshake will fail.
  3. Exporter Certificate Expired

    • Diagnosis: Check the validity period of the exporter’s client certificate. Use OpenSSL:
      openssl x509 -in /path/to/exporter.crt -noout -dates
      
    • Fix: Renew the exporter’s certificate and private key. Ensure the new certificate is valid and has a sufficient validity period. Update the exporter’s configuration to use the new certificate and key files.
    • Why it works: TLS connections require valid, unexpired certificates. An expired certificate is untrusted by definition.
  4. Certificate Subject Alternative Name (SAN) Mismatch

    • Diagnosis: Hostname verification applies to the backend's (server's) certificate: the exporter checks that the hostname or IP address in its configured endpoint appears in the SANs of the certificate the backend presents. Check the SANs in the backend's certificate (the receiver's cert_file):
      openssl x509 -in /path/to/backend.crt -noout -text | grep -A 1 'Subject Alternative Name'
      
    • Fix: Reissue the backend's certificate with SANs that cover the address the exporter connects to, or point the exporter at a name or IP address that is already listed. For Collector-to-Collector setups where the dialed address legitimately differs from the certificate's name, the exporter's TLS settings include server_name_override.
    • Why it works: During the handshake, the client (exporter) rejects a server certificate whose SANs do not match the name it dialed, even when that certificate chains to a trusted CA. The exporter's own certificate SANs normally play no part in this check. They matter only if the backend (or a proxy in front of it) is configured to authorize clients by certificate identity, such as a specific subject or SAN; in that case the exporter's certificate must carry the identity that policy expects.
  5. Incorrectly Formatted Certificate or Key Files

    • Diagnosis: The certificate or private key files might be corrupted, incomplete, or in the wrong format (e.g., PEM vs. DER, missing headers/footers).
    • Fix: Ensure certificate and key files are in PEM format, with standard -----BEGIN CERTIFICATE----- and -----END CERTIFICATE----- (or the matching PRIVATE KEY) delimiters. If the exporter's certificate was issued by an intermediate CA, append the intermediate certificate(s) to the cert_file after the leaf certificate, or use a separate chain-file option if your software offers one.
      # Example: the exporter's leaf certificate first, then the intermediate CA that issued it
      cat /path/to/exporter.crt /path/to/intermediate_ca.crt > /path/to/exporter_chain.crt
      
      Then update the exporter config to use exporter_chain.crt.
    • Why it works: TLS libraries expect specific file formats and content for certificates and keys to parse them correctly.
  6. System Clock Skew

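    • Diagnosis: Check the system time on both the exporter host and the backend host. If they differ significantly, certificate validation can fail because a certificate's validity period appears to be in the past or the future on one side. Freshly issued certificates are the most sensitive, since even a small skew can make one look "not yet valid". Print the time in UTC on each host so time zones don't confuse the comparison:
      date -u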
      
    • Fix: Synchronize the system clocks on all involved machines using NTP (Network Time Protocol).
      # On systems using systemd-timesyncd
      sudo timedatectl set-ntp true
      # Or manually configure an NTP client
      
    • Why it works: Certificate validity is strictly checked against the current system time. Clock skew can make a valid certificate appear invalid.
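
For cause 2, a quick sanity check is to load the exporter's files the way a Go client would. This is a minimal sketch using only the standard library, with an illustrative function name: tls.LoadX509KeyPair fails on unreadable or non-PEM files and on a private key that does not match the certificate, and the resulting *tls.Config is the kind of value Go's OTLP exporters accept (for example through otlptracehttp.WithTLSClientConfig, or wrapped with grpc's credentials.NewTLS for otlptracegrpc.WithTLSCredentials).

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// loadClientTLSConfig loads the exporter's certificate/key pair and the CA
// bundle used to verify the backend, failing early on the problems in cause 2.
func loadClientTLSConfig(certFile, keyFile, caFile string) (*tls.Config, error) {
	// Errors here cover unreadable files, non-PEM content, and a private key
	// that does not match the certificate's public key.
	pair, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, fmt.Errorf("client certificate/key: %w", err)
	}
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, fmt.Errorf("backend CA file: %w", err)
	}
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("backend CA file %s: no PEM certificates found", caFile)
	}
	return &tls.Config{
		Certificates: []tls.Certificate{pair}, // presented to the backend
		RootCAs:      roots,                   // used to verify the backend
		MinVersion:   tls.VersionTLS12,
	}, nil
}

func main() {
	if len(os.Args) != 4 {
		fmt.Println("usage: tlscheck <cert_file> <key_file> <ca_file>")
		return
	}
	if _, err := loadClientTLSConfig(os.Args[1], os.Args[2], os.Args[3]); err != nil {
		fmt.Println("exporter TLS material rejected:", err)
		return
	}
	fmt.Println("certificate, key and CA bundle loaded")
}
```

Run it with the same three paths as the exporter's cert_file, key_file, and ca_file; an error here means the handshake would fail before the backend ever gets a say.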

Once the handshake succeeds, a problem you may hit next is the backend lacking permission to write received telemetry to its storage, which shows up as data loss or as storage-access errors rather than TLS errors.


OpenTelemetry’s attributes provide a powerful, structured way to enrich telemetry, but their true magic lies in how they are processed and filtered downstream, not just in their presence.

Let’s see how this plays out with a simple trace. Imagine we’re instrumenting a web service that handles user requests.

package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"os"
	"time"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0" // Use a recent version
	"go.opentelemetry.io/otel/trace"
)

// initTracer initializes the OpenTelemetry tracer.
func initTracer() (func(context.Context) error, error) {
	// Create a stdout exporter to print traces to the console.
	exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		return nil, fmt.Errorf("failed to create stdout exporter: %w", err)
	}

	// Define resource attributes for this service.
	res, err := resource.New(context.Background(),
		resource.WithAttributes(
			semconv.ServiceName("my-web-service"),
			semconv.ServiceVersion("1.0.0"),
			attribute.String("environment", "production"),
		),
	)
	if err != nil {
		return nil, fmt.Errorf("failed to create resource: %w", err)
	}

	// Create an SDK tracer provider that batches spans to the exporter
	// and attaches the resource to every span.
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(tp)

	// Return the shutdown function; it flushes any spans still queued.
	return tp.Shutdown, nil
}

// helloHandler is a simple HTTP handler.
func helloHandler(w http.ResponseWriter, r *http.Request) {
	// Get the tracer for the current package.
	tracer := otel.Tracer("main")
	// r.Context() carries the server span that otelhttp started for this request.
	ctx := r.Context()

	// Start a child span with attributes. In the Go API, attributes are
	// passed at start time through trace.WithAttributes.
	_, span := tracer.Start(ctx, "helloHandler",
		trace.WithAttributes(
			attribute.String("http.method", r.Method),
			attribute.String("user.id", "user123"), // Example user ID
			attribute.String("request.path", r.URL.Path),
		),
	)
	defer span.End()

	// Simulate some work
	time.Sleep(50 * time.Millisecond)

	// Add another attribute to the span mid-operation.
	span.SetAttributes(attribute.Bool("processed.successfully", true))

	fmt.Fprintf(w, "Hello, World!\n")
}

func main() {
	// Initialize the tracer.
	shutdown, err := initTracer()
	if err != nil {
		log.Fatalf("Failed to initialize tracer: %v", err)
	}
	defer func() {
		if err := shutdown(context.Background()); err != nil {
			log.Printf("Failed to shutdown tracer provider: %v", err)
		}
	}()

	// Create a new HTTP multiplexer.
	mux := http.NewServeMux()

	// Wrap the helloHandler with otelhttp to automatically instrument HTTP requests.
	// otelhttp creates a server span per request and records common HTTP attributes
	// (request method, response status code, etc.) on it.
	handler := otelhttp.NewHandler(http.HandlerFunc(helloHandler), "helloHandler")
	mux.Handle("/hello", handler)

	// Start the HTTP server.
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080"
	}
	log.Printf("Server listening on port %s", port)
	if err := http.ListenAndServe(":"+port, mux); err != nil {
		log.Fatalf("Server failed to start: %v", err)
	}
}

When you run this application and send a request to /hello (e.g., curl http://localhost:8080/hello), the stdout exporter prints two spans: the server span that otelhttp creates for the request, and the child span started in helloHandler. The child span looks roughly like this (simplified; the exporter's real output uses its own field names and includes more detail):

{
  "traceId": "...",
  "spanId": "...",
  "parentSpanId": "...",
  "name": "helloHandler",
  "kind": 1,
  "startTimeUnixNano": 1678886400000000000,
  "endTimeUnixNano": 1678886400050000000,
  "attributes": [
    {
      "key": "http.method",
      "value": {
        "type": "STRING",
        "value": "GET"
      }
    },
    {
      "key": "user.id",
      "value": {
        "type": "STRING",
        "value": "user123"
      }
    },
    {
      "key": "request.path",
      "value": {
        "type": "STRING",
        "value": "/hello"
      }
    },
    {
      "key": "processed.successfully",
      "value": {
        "type": "BOOL",
        "value": true
      }
    }
  ],
  "status": {
    "code": 0
  },
  "resource": {
    "attributes": [
      {
        "key": "service.name",
        "value": {
          "type": "STRING",
          "value": "my-web-service"
        }
      },
      {
        "key": "service.version",
        "value": {
          "type": "STRING",
          "value": "1.0.0"
        }
      },
      {
        "key": "environment",
        "value": {
          "type": "STRING",
          "value": "production"
        }
      }
    ]
  }
}

Notice the attributes array. We manually added http.method, user.id, request.path, and processed.successfully, and kind 1 marks this as an internal span. The parentSpanId points to the server span that otelhttp wrapped around the handler; that span, printed separately, carries the attributes the middleware records automatically, such as the request method and the response status code. Because the request succeeded, neither span is marked as an error: the status stays unset (code 0). The resource section contains attributes defined at the service level (service.name, service.version, environment), and the same resource is attached to every span the service emits.

The fundamental problem that attributes solve is observability context. Without them, you have a trace showing a request happened, but you don’t know which user, in which environment, or why it might have failed (beyond generic errors). Attributes turn generic traces into actionable insights.

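Internally, the SDK keeps a span's attributes in memory while the span is recording. When you call tracer.Start, a new span is created; attributes passed at start and those added later with span.SetAttributes are stored with it. When span.End() is called, the finished span, with all of its attributes, is handed to the tracer provider's span processors. Because initTracer registers the exporter with WithBatcher, a batch span processor queues finished spans and exports them in batches, so console output may appear shortly after the request rather than at the exact moment End returns.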

The "levers" you control are:

  1. Resource Attributes: These are defined when the tracer provider is initialized (resource.New). They apply to all telemetry generated by that service instance (e.g., service.name, environment, host.name).
  2. Span Attributes: These are added to individual spans.
    • Automatic Instrumentation: Middleware like otelhttp automatically adds common attributes (e.g., http.method, http.status_code).
    • Manual Instrumentation: You explicitly add attributes with span.SetAttributes() or at span creation with tracer.Start(ctx, name, trace.WithAttributes(...)). This is where you add business-specific context like user.id, order.id, db.query.parameters, etc.
  3. Event Attributes: Spans can also have Events, which are timestamped occurrences within a span, and these events can also carry attributes.

The real power of attributes comes not just from collecting them, but from how they enable downstream processing. A backend system (such as Jaeger for traces, Prometheus for metrics, or a custom log analysis pipeline) can use these attributes for:

  • Filtering: Show me traces only for user.id="user123".
  • Aggregation: Calculate the average latency for requests with http.status_code=500.
  • Correlation: Link traces to logs that have matching traceId and spanId.
  • Alerting: Trigger an alert if the error rate (based on status.code=ERROR) exceeds a threshold for the production environment.

A common misconception is that attributes are just metadata that gets sent along for the ride. In reality, the structure and standardization of attributes (via Semantic Conventions) allow for powerful cross-system analysis. If every service uses http.status_code consistently, a backend can reliably aggregate HTTP status codes across your entire distributed system. If one service uses http_status and another uses status_code, that aggregation breaks.

The next concept you’ll likely grapple with is context propagation, understanding how trace and span IDs are passed between services to stitch together distributed traces.
