OpenTelemetry Collector instances, when run behind a load balancer, don’t inherently distribute traffic equally; they often default to a round-robin approach that can lead to uneven resource utilization and dropped data.
Let’s see this in action. Imagine you have two collector instances, collector-1 and collector-2, behind a simple TCP load balancer. The load balancer, in its default round-robin configuration, sends every other incoming connection to collector-1, then collector-2, and so on.
# Example of initial state (hypothetical)
Load Balancer -> collector-1 (receives 50% of new connections)
Load Balancer -> collector-2 (receives 50% of new connections)
However, the type of traffic matters. If your agents or services send data in bursts, or if connections are long-lived, round-robin might not be enough. One collector could get a large influx of data from a few chatty services, while another sits mostly idle. This leads to dropped metrics, traces, or logs because one collector’s queues fill up faster than its processing capacity.
Here’s the actual problem: the default load balancing strategy (often round-robin at the TCP/UDP layer) doesn’t account for the volume or state of the data being processed by each collector instance. It’s like a traffic cop directing cars without looking at how many cars are already backed up on a particular road.
Common Causes and Fixes
-
Uneven Connection Distribution (TCP Round-Robin Issue):
- Diagnosis: Monitor the active connections and the request rate (e.g.,
requests_totalorreceived_data_bytesin OTel Collector metrics) on each collector instance. If one instance consistently has more active connections or higher throughput than others, your load balancer isn’t distributing effectively for your workload. - Fix: Switch your load balancer to a more intelligent algorithm. For TCP, Least Connections is often better. This directs new connections to the collector with the fewest active connections.
# Example: Nginx configuration snippet for least_conn upstream otel_collectors { least_conn; server collector-1:4317; server collector-2:4317; } - Why it works: By sending new traffic to the least busy instance, it prevents any single collector from being overwhelmed by connection count alone, providing a more even distribution of new work.
- Diagnosis: Monitor the active connections and the request rate (e.g.,
-
Sticky Sessions (Agent-Level Affinity):
- Diagnosis: If your agents or clients are configured to prefer a specific collector instance (e.g., via DNS or a service discovery mechanism that returns a stable IP), they might not failover effectively. This can lead to persistent imbalances if some agents are stuck sending all their data to one collector that’s struggling.
- Fix: Ensure your service discovery or agent configuration does not enforce sticky sessions or long-lived affinities to a single collector IP. Agents should be able to discover and connect to any available collector.
# Example: OTel Collector agent config for discovery receivers: otlp: protocols: grpc: exporters: otlp: endpoint: "otel-collector.my-domain.com:4317" # Use a load balancer DNS # No specific client-side affinity settings - Why it works: Agents can dynamically switch to a different collector if their primary choice becomes unresponsive or overloaded, allowing the load balancer to manage traffic flow.
-
Agent-Side Load Balancing (Client-Side Distribution):
- Diagnosis: If your agents are configured with multiple collector endpoints and they themselves implement a simple round-robin, you might still see imbalances if agents have varying connection lifetimes or burstiness.
- Fix: Configure agents to use a more sophisticated client-side load balancing strategy if available, or rely solely on the upstream load balancer. If agents must list endpoints, ensure they don’t have internal sticky logic. Some agents might support a "pick first" or "random" strategy.
# Example: OTel Collector agent config with multiple endpoints (if supported by agent) exporters: otlp: endpoints: - "http://collector-1:4317" - "http://collector-2:4317" # Agent's internal load balancing strategy (e.g., round_robin, random, pick_first) load_balancing_strategy: random - Why it works: Offloads some distribution logic to the agent, allowing it to pick the "best" collector based on its own internal state or a randomized approach, potentially smoothing out traffic before it even hits the main load balancer.
-
Uneven Data Volume (Attribute/Trace ID Hashing):
- Diagnosis: Even with "Least Connections," if a few high-volume traces or large batches of logs are all hashed to the same collector by the application itself (e.g., if the application uses trace ID as a consistent hashing key for something upstream), one collector can still be swamped.
- Fix: Implement consistent hashing at the load balancer level, based on a stable attribute of the telemetry data (like
trace_idfor traces, or a consistent attribute for logs/metrics). This ensures all data for a specific trace or entity always goes to the same collector instance.# Example: HAProxy consistent hashing (using trace ID if available in protocol) backend otel_collectors hash-type consistent mask 0x1fff # Adjust mask based on desired distribution server collector-1 192.168.1.10:4317 server collector-2 192.168.1.11:4317 - Why it works: This ensures that data belonging to the same logical entity (like a single distributed trace) always lands on the same collector. This is crucial for trace processing, where context needs to be maintained. It can also help distribute high-volume, but consistently keyed, metric/log streams.
-
Collector Resource Limits:
- Diagnosis: Check the CPU, memory, and network utilization on each collector instance. If one collector is consistently maxing out resources while others are idle, it’s not a load balancing problem, but an undersized collector or a single point of failure.
- Fix: Increase the CPU/memory allocation for the collector pods/VMs, or tune the collector’s internal settings (e.g.,
queue_size,max_concurrent_requestsin theotlpreceiver) to better match its allocated resources.# Example: OTel Collector config for queue tuning receivers: otlp: protocols: grpc: max_concurrent_streams: 1000 # Example tuning # queue_size: 2000 # Also tune queue if needed - Why it works: Ensures that each collector instance has sufficient capacity to process the traffic it receives, preventing it from becoming a bottleneck due to insufficient resources rather than just uneven distribution.
-
Load Balancer Health Checks:
- Diagnosis: Verify that your load balancer’s health checks are correctly configured to detect unresponsive collector instances. If health checks are too slow or not sensitive enough, the load balancer might continue sending traffic to a collector that is effectively dead or severely degraded.
- Fix: Configure aggressive but reasonable health checks. For HTTP/S endpoints, a simple
GET /or a custom/healthendpoint is good. For gRPC, you might need a specific health check service. Ensure the timeout for health checks is significantly shorter than the time it takes for a collector to become unresponsive.# Example: AWS ALB health check config # Protocol: HTTP # Port: traffic-port # Path: /health # Assuming collector exposes a /health endpoint # Timeout: 5 seconds # Interval: 10 seconds # Unhealthy threshold: 2 - Why it works: Quickly removes unhealthy collector instances from the pool of available backends, preventing traffic from being sent to them and allowing the load balancer to direct traffic only to healthy instances.
By combining intelligent load balancing strategies (like least connections or consistent hashing) with proper health checks and ensuring agents don’t enforce sticky sessions, you can achieve a much more even distribution of telemetry data across your OpenTelemetry Collector fleet.
The next error you’ll hit is likely related to the processing of data within a collector, such as otcol_receiver_refused_data_points if queues are still filling up, indicating that even with good distribution, your collectors might be undersized for the total volume.