Patroni decided it couldn’t reach the primary PostgreSQL instance, so it initiated a failover (the unplanned counterpart of a switchover, which is requested by an operator).
This usually happens because the distributed configuration store Patroni uses (like etcd or Consul) is unavailable, or because the PostgreSQL primary itself is genuinely down and Patroni can’t establish a connection.
Here are the common reasons why Patroni might think the primary is unavailable and trigger a failover:
- Distributed Configuration Store (DCS) Unavailability: Patroni needs its DCS (etcd, Consul, ZooKeeper, or Kubernetes) to coordinate failover and store cluster state. If the DCS is down or unreachable from the primary’s node, the primary cannot renew its leader lock, and the rest of the cluster treats the primary as gone.
  - Diagnosis: From a Patroni node, check connectivity to your DCS. For etcd, this might be `curl http://etcd-node:2379/version`. For Consul, `curl http://consul-node:8500/v1/status/leader`.
  - Fix (etcd example): Ensure your etcd cluster is healthy. If etcd nodes are down, bring them back up. If network issues exist, fix routing or firewall rules. If etcd is experiencing high load, investigate and scale it.
  - Why it works: Patroni relies on the DCS for leader election. The primary holds a leader key that it must keep renewing before its TTL expires. If the primary cannot reach the DCS, by default it demotes itself to avoid split-brain, and replicas that can still reach the DCS elect a new leader once the key expires. The result is a failover even though PostgreSQL on the old primary was healthy.
- Network Partition Between Patroni and PostgreSQL Primary: Even if PostgreSQL is running, a network issue (firewall rule, routing problem, etc.) that prevents other nodes from reaching the PostgreSQL port (default 5432) on the primary node makes the primary look down to them.
  - Diagnosis: From the Patroni node, try `nc -zv <primary_ip> 5432` or `telnet <primary_ip> 5432`.
  - Fix: Check firewall rules (`iptables -L -n -v`) on both the Patroni and PostgreSQL nodes. Ensure port 5432 is open between them. Verify routing tables (`ip route show`) and DNS resolution for the primary’s hostname.
  - Why it works: Patroni connects to PostgreSQL over this port to check that the database is alive, and replicas stream WAL from the primary over it. A blocked port means neither can confirm that the database is accepting connections.
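- PostgreSQL Primary Not Running or Not Listening: The PostgreSQL process might have crashed, been stopped, or been configured to listen on the wrong network interface.
  - Diagnosis: On the suspected primary node, check `ps aux | grep postgres` to see if the process is running. Check `sudo ss -tulnp | grep 5432` to see which IP addresses PostgreSQL is listening on.
  - Fix: If PostgreSQL isn’t running, start it. In a Patroni-managed cluster, Patroni normally starts and stops PostgreSQL itself, so this usually means starting the Patroni service (for example `sudo systemctl start patroni`, if that is the unit name on your system) rather than running `sudo systemctl start postgresql` directly. If PostgreSQL is listening on `127.0.0.1` instead of `0.0.0.0` or a specific IP, fix the listen address. Patroni derives `listen_addresses` from `postgresql.listen` in `patroni.yml`, so set it there (for example `0.0.0.0:5432`, or the appropriate IP) and restart PostgreSQL through Patroni. A direct edit of `listen_addresses = '*'` in `postgresql.conf` is likely to be overridden.
  - Why it works: Patroni needs to connect to the PostgreSQL server process. If the process is absent or not bound to a network interface that Patroni and the other nodes can reach, the connection fails.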
- Patroni Configuration Incorrect (API/Postgres Ports): Patroni nodes need to reach each other’s REST API (default 8008) and the PostgreSQL port on the primary. If these are misconfigured or blocked, failover can be erratic.
  - Diagnosis: From one Patroni node, try `curl http://<other_patroni_node>:8008/cluster`. Also verify that Patroni nodes can reach the PostgreSQL port on the primary (as in the network partition item above).
  - Fix: Ensure `scope` and `namespace` in `patroni.yml` are consistent across the cluster. Verify that `restapi.listen` and `postgresql.listen` in `patroni.yml` match actual interfaces, that the corresponding `connect_address` values are reachable from the other nodes, and that firewalls allow traffic on 8008 and 5432 between Patroni nodes.
  - Why it works: Patroni uses its REST API to exchange cluster state and health information between nodes, and it also connects to PostgreSQL directly. Misconfigurations here break the communication chain needed for proper HA.
- PostgreSQL Primary Resource Exhaustion: The primary PostgreSQL server might be running but unresponsive due to extreme load, OOM killer intervention, or disk I/O issues. Patroni’s health checks might time out.
  - Diagnosis: On the primary node, check `top`, `htop`, `vmstat`, `iostat`. Look for high CPU, memory pressure, or disk queue lengths. Check `/var/log/syslog` or `journalctl` for OOM killer messages.
  - Fix: Tune PostgreSQL parameters (`shared_buffers`, `work_mem`, etc.) or add more resources (CPU, RAM, faster storage). Identify and optimize slow queries.
  - Why it works: If the PostgreSQL server is too busy to respond to Patroni’s connection attempts or health checks within the configured timeouts, Patroni will assume it’s unavailable.
- Replication Slot Issues (for replica promotion): This does not cause a failover by itself. However, after a replica is promoted, replication from the new primary may fail to establish because of replication slot problems, for example stale slots left on the old primary (if it’s still alive but unreachable) or slots missing on the new one. The result is confusion and follow-on errors.
  - Diagnosis: On the new primary, query `pg_replication_slots` and `pg_stat_replication` in `psql`. On the replicas, query `pg_stat_wal_receiver` and check the Patroni logs for errors related to replication slot management.
  - Fix: Clean up stale replication slots on the old primary if it is coming back into service, or ensure the new primary can create and manage slots. Often this means running `SELECT pg_drop_replication_slot('<slot_name>');` on the old primary (if accessible), or creating a slot manually on the new primary with `SELECT pg_create_physical_replication_slot('<slot_name>');` if Patroni failed to create it.
  - Why it works: Replication slots make the primary retain the WAL that each replica still needs. A missing slot can let required WAL be recycled before a replica has received it, and a stale slot can retain WAL indefinitely and fill the disk. Either way, the cluster’s replication health is compromised after a promotion.
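The port checks in the list above (the `nc -zv` probe against 5432 and the REST API check against 8008) can be scripted so they run the same way from every node. The following is a minimal sketch using only Python's standard library. The names `port_is_open` and `check_node` are made up for illustration and are not part of Patroni. It only tests TCP reachability, so a passing result does not prove that PostgreSQL or Patroni is healthy.

```python
import socket
from typing import Dict


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_node(host: str, ports: Dict[str, int], timeout: float = 3.0) -> Dict[str, bool]:
    """Probe each named port on host and report which ones accept connections."""
    return {name: port_is_open(host, port, timeout) for name, port in ports.items()}
```

For example, `check_node("<primary_ip>", {"postgresql": 5432, "patroni_api": 8008})` returns a dictionary such as `{"postgresql": True, "patroni_api": False}`. In that case the firewall or the `restapi.listen` setting is the place to look, not PostgreSQL itself.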
After fixing these, the next problem you’ll likely hit is applications failing to connect to the new primary, or a replica failing to catch up if replication was severely impacted.
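To see which member is now the leader and whether any replica is still behind, you can summarize the JSON returned by the `/cluster` endpoint queried earlier. The sketch below makes assumptions you should verify against your Patroni version: that the response has a `members` list, that each member carries `name` and `role`, and that replicas report a `lag` value in bytes. The function name, the 16 MiB threshold, and the sample payload are all made up for illustration.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


def summarize_cluster(
    payload: Dict[str, Any], max_lag_bytes: int = 16 * 1024 * 1024
) -> Tuple[Optional[str], List[str]]:
    """Return (leader name, names of replicas whose lag exceeds max_lag_bytes).

    A replica with a missing or non-integer lag is reported as lagging,
    because its position cannot be confirmed.
    """
    leader = None
    lagging = []
    for member in payload.get("members", []):
        if member.get("role") in ("leader", "standby_leader"):
            leader = member.get("name")
            continue
        lag = member.get("lag")
        if not isinstance(lag, int) or lag > max_lag_bytes:
            lagging.append(member.get("name"))
    return leader, lagging


# Sample payload shaped like the assumed /cluster response; values are invented.
SAMPLE = json.loads("""
{"members": [
  {"name": "pg-node-1", "role": "replica", "state": "running", "lag": 0},
  {"name": "pg-node-2", "role": "leader", "state": "running"},
  {"name": "pg-node-3", "role": "replica", "state": "running", "lag": 536870912}
]}
""")

leader, lagging = summarize_cluster(SAMPLE)
print(f"leader={leader} lagging={lagging}")
```

If the reported leader is not the node your applications connect to, the connection string or load balancer is the next thing to check. If a replica stays in the lagging list, revisit the replication slot item above.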