RabbitMQ’s TCP connection lost error means that RabbitMQ, which runs on the Erlang VM, saw a network connection to a client or another node end unexpectedly. This isn’t just a glitch; the node has decided the connection is no longer viable, often because of underlying network issues or resource exhaustion on either side.
Common Causes and Fixes
1. Network Firewalls/Load Balancers Dropping Idle Connections
- Diagnosis: Check firewall or load balancer logs for TCP_RESET or FIN_WAIT states indicating premature connection closure. On the RabbitMQ node, run `netstat -anp | grep <client_ip>` to see if connections are being established and then disappearing without a proper FIN/RST from the client.
- Fix: Make sure traffic crosses the connection more often than the device's idle timeout. Enable AMQP heartbeats or TCP keepalives on connections to the RabbitMQ port (default 5672 for AMQP) at a regular interval (e.g., every 30 seconds), or raise the idle timeout itself. If using a cloud load balancer, look for "TCP Keep-Alive" or "Idle Timeout" settings and adjust them. `iptables` does not generate keepalives, but rules like the following make sure it accepts packets that belong to established connections, and the RSTs that close them:

    iptables -A OUTPUT -p tcp --tcp-flags SYN,ACK SYN,ACK -m state --state ESTABLISHED -j ACCEPT
    iptables -A OUTPUT -p tcp --tcp-flags ACK ACK -m state --state ESTABLISHED -j ACCEPT
    iptables -A INPUT -p tcp --tcp-flags SYN,ACK SYN,ACK -m state --state ESTABLISHED -j ACCEPT
    iptables -A INPUT -p tcp --tcp-flags ACK ACK -m state --state ESTABLISHED -j ACCEPT
    iptables -A OUTPUT -p tcp --tcp-flags RST RST -j ACCEPT
    iptables -A INPUT -p tcp --tcp-flags RST RST -j ACCEPT

- Why it works: These firewalls/load balancers often have idle timeouts. If no data is sent for a period, they assume the connection is dead and tear it down. Periodic heartbeat or keepalive packets keep the connection "active" in their eyes, preventing premature closure.
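The keepalive idea above can also be applied from the client side of the connection. The following is a minimal sketch using only Python's standard `socket` module; the helper name `enable_tcp_keepalive` and its default values are illustrative assumptions, and the tuning options are platform-specific, so the code applies each one only where it exists. Most AMQP client libraries also offer an application-level heartbeat setting, which serves the same purpose.

```python
import socket


def enable_tcp_keepalive(sock, idle=30, interval=10, probes=3):
    """Turn on TCP keepalive so an idle connection still produces traffic.

    idle, interval and probes are illustrative defaults, not RabbitMQ
    recommendations.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These tuning options are platform-specific (all three exist on Linux),
    # so each one is set only if this platform exposes it.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

A client would call `enable_tcp_keepalive(sock)` on its socket before or right after connecting. The idle value should be lower than the shortest idle timeout of any device on the path.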
2. RabbitMQ Node Overload (High CPU/Memory)
- Diagnosis: Monitor RabbitMQ node resource utilization. Use `top` or `htop` on the server for CPU and memory. Check RabbitMQ’s own metrics via `rabbitmqctl status` (look for high file descriptor usage, low free memory) or its management UI (overview tab). Specifically, look for large `mnesia` tables or excessive garbage collection activity in Erlang’s VM stats if available.
- Fix:
  - Raise the memory limit: RabbitMQ's memory threshold is the `vm_memory_high_watermark` setting in `rabbitmq.conf` rather than an environment variable. Set it to a larger value, e.g., `vm_memory_high_watermark.absolute = 4GB`, and restart RabbitMQ.
  - Tune the Erlang runtime: While more advanced, you can pass extra emulator flags through the RabbitMQ environment file (e.g., `/etc/rabbitmq/rabbitmq-env.conf`). For example, `+A 30` sets the size of the Erlang async thread pool to 30, which may help under heavy I/O load. It does not change garbage collection behavior. Using `RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS` appends to RabbitMQ's default flags instead of replacing them:

        RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+A 30"

  - Optimize Queues/Consumers: Review your message rates, queue depths, and consumer throughput. Ensure consumers are acknowledging messages promptly. Consider increasing the number of consumers or optimizing message processing logic.
- Why it works: Erlang’s VM needs sufficient memory to operate efficiently. When it runs out, or if garbage collection becomes too frequent and blocking, it can lead to unresponsive connections and eventual termination. Raising the memory limit, tuning the runtime, or reducing the load lets the node keep up.
3. Erlang Distribution Protocol Issues (Clustering)
- Diagnosis: If you’re in a cluster, check `rabbitmqctl cluster_status`. Look for nodes that are disconnected or marked as "down." Examine the Erlang crash logs (usually in `/var/log/rabbitmq/crash.log` or similar) on all nodes for messages related to `net_kernel` or `net_tick` timeouts.
- Fix: Ensure that all nodes in the cluster can resolve each other’s hostnames and communicate over the Erlang distribution port (default 25672/TCP) and the epmd port (4369/TCP). Open these ports in your firewalls. If hostnames are unreliable, configure Erlang to use IP addresses by setting `NODENAME` in `rabbitmq-env.conf` to `rabbit@<node_ip_address>` for each node. A host part that contains dots generally requires long node names, so enable `USE_LONGNAME` as well:

    # In /etc/rabbitmq/rabbitmq-env.conf
    NODENAME=rabbit@192.168.1.100
    USE_LONGNAME=true

  Restart RabbitMQ on all nodes and rejoin the cluster if necessary.
- Why it works: Erlang’s clustering relies on a stable network connection between nodes using a specific protocol. If nodes can’t reach each other or if DNS resolution is flaky, the distribution protocol will time out, leading to cluster instability and connection drops.
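The two requirements above, name resolution and a reachable distribution port, can be checked from each node with a short script. This is a sketch using only the standard library; the function name `check_peer` and its defaults are assumptions made for illustration, and a successful TCP connect only shows that the port is open, not that the Erlang handshake would succeed.

```python
import socket


def check_peer(host, port=25672, timeout=3.0):
    """Return (resolved_addresses, reachable) for one cluster peer."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return [], False  # the name did not resolve at all
    # Each getaddrinfo entry ends with a sockaddr tuple; its first item is the IP.
    addresses = sorted({info[4][0] for info in infos})
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return addresses, True
    except OSError:
        # Refused, timed out, or unreachable: the name resolved but the port did not answer.
        return addresses, False
```

Running `check_peer` against every other node's hostname, from every node, separates DNS problems (an empty address list) from firewall problems (addresses returned but `reachable` is `False`).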
4. Client Application Crashing or Unresponsive
- Diagnosis: On the client machine experiencing connection drops, monitor its CPU and memory usage. Check the client application’s logs for errors, exceptions, or indications of being stuck in a loop or long garbage collection pause.
- Fix: Address the resource issues or bugs within the client application. If the client is a service, ensure it has adequate resources. If it’s a long-running process, implement proper error handling and retry mechanisms. Ensure the client is properly closing connections when it shuts down.
- Why it works: If the client application itself becomes unresponsive or crashes, it can’t properly close its TCP connection to RabbitMQ. RabbitMQ, seeing no activity or an abrupt closure, might log it as a connection lost error, even though the root cause is on the client’s side.
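The retry and clean-shutdown advice above might look like the following in a client. This is a library-agnostic sketch: `connect` and `work` are hypothetical callables supplied by the application (`connect` returns an object with a `close()` method), and the backoff values are arbitrary defaults rather than recommendations.

```python
import time


def run_with_reconnect(connect, work, max_attempts=5, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call work(conn), reconnecting with exponential backoff on network errors."""
    for attempt in range(1, max_attempts + 1):
        conn = None
        try:
            conn = connect()
            return work(conn)
        except OSError:  # ConnectionError and socket timeouts are OSError subclasses
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
        finally:
            # Always close what was opened, so the broker sees a clean shutdown.
            if conn is not None:
                conn.close()
```

Capping the delay keeps a long outage from pushing reconnect attempts arbitrarily far apart, and closing in `finally` covers both the success and failure paths.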
5. Network Latency or Packet Loss
- Diagnosis: Use `ping` and `traceroute` between the RabbitMQ node and the client machine to check for high latency or packet loss. Monitor network interface statistics on both the server and client for errors or dropped packets.
- Fix: Identify and resolve the underlying network issue. This might involve upgrading network hardware, reconfiguring network devices, or working with your network provider. If high latency is unavoidable, you might need to adjust RabbitMQ’s connection timeout settings (though this is generally not recommended as a primary fix) or implement more robust client-side retry logic.
- Why it works: TCP connections are sensitive to network instability. High latency can cause TCP retransmissions, and packet loss can lead to timeouts and connection resets, which RabbitMQ interprets as a lost connection.
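Because ICMP is sometimes filtered or routed differently from application traffic, it can help to measure the TCP path to the broker port itself. The sketch below times repeated TCP connects with the standard library; `probe_connect` and its parameters are illustrative assumptions, and connect time is only a rough proxy for latency and loss on the path.

```python
import socket
import statistics
import time


def probe_connect(host, port=5672, attempts=5, timeout=2.0):
    """Time repeated TCP connects; return (median_ms, failures)."""
    samples, failures = [], 0
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                samples.append((time.perf_counter() - start) * 1000.0)
        except OSError:
            failures += 1  # refused, timed out, or unreachable
    # median_ms is None when every attempt failed.
    median_ms = statistics.median(samples) if samples else None
    return median_ms, failures
```

A high median or a non-zero failure count from the client machine, compared with a clean result from a host on the broker's own subnet, points at the network between them rather than at RabbitMQ.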
6. RabbitMQ Erlang VM Crash (Less Common)
- Diagnosis: Look for Erlang crash dump files (`.dump` files, typically named `erl_crash.dump`) in RabbitMQ’s log directory. These files contain detailed information about why the Erlang VM terminated unexpectedly. Analyze these dumps using Erlang's `crashdump_viewer` tool or consult RabbitMQ support.
- Fix: The fix depends entirely on the crash dump. It could be a bug in RabbitMQ itself, a bug in an Erlang library, or a very specific environmental issue. Often, upgrading RabbitMQ or the underlying Erlang/OTP version is the solution.
- Why it works: A crash of the Erlang VM means the entire RabbitMQ process is terminated abruptly, naturally leading to all active connections being lost.
The next error you’ll likely encounter after fixing TCP connection issues is a "Channel closed by server" error, as clients attempt to re-establish their connections and channels, but might face new issues if underlying message routing or permission problems exist.