Postgres HA with Patroni: Automatic Failover Setup (2026)

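Patroni concluded that the primary PostgreSQL instance was no longer available, so it initiated a failover. (In Patroni’s terminology, a switchover is a planned, operator-requested role change; an automatic one is a failover.)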

This usually happens because the primary’s Patroni agent could not renew its leader key in the distributed configuration store (DCS, such as etcd or Consul), or because the PostgreSQL primary itself is genuinely down and Patroni can’t establish a connection.

Here are the common reasons why Patroni might think the primary is unavailable and trigger a failover:

Distributed Configuration Store (DCS) Unavailability: Patroni needs its DCS (etcd, Consul, ZooKeeper, or Kubernetes) to coordinate failover and store cluster state. If the primary’s node cannot reach the DCS, its Patroni agent cannot renew the leader key and demotes the local PostgreSQL to avoid split-brain (unless DCS failsafe mode is enabled). Once the key expires, a replica that can still reach the DCS acquires it and is promoted. If the DCS is down for every node, no new leader can be elected at all.
- Diagnosis: From a Patroni node, check connectivity to your DCS. For etcd, this might be curl http://etcd-node:2379/version. For Consul, curl http://consul-node:8500/v1/status/leader.
- Fix (etcd example): Ensure your etcd cluster is healthy. If etcd nodes are down, bring them back up. If network issues exist, fix routing or firewall rules. If etcd is experiencing high load, investigate and scale it.
- Why it works: Patroni relies on the DCS for leader election and for the leader key that identifies the current primary. While the DCS is healthy and reachable, the primary keeps renewing that key and the replicas have no reason to take over.
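The connectivity checks in the diagnosis step above can be scripted so they run the same way from every Patroni node. The following is a minimal Python sketch, not part of Patroni. It assumes the DCS endpoints are served over plain HTTP and takes the URLs to probe as command-line arguments; the helper names fetch and consul_has_leader are illustrative.

```python
import json
import sys
import urllib.request


def fetch(url, timeout=3.0):
    """Return (reachable, body_or_error) without raising on common network errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError) as exc:
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # ValueError covers a malformed URL.
        return False, str(exc)


def consul_has_leader(body):
    """Consul's /v1/status/leader returns a JSON string; empty means no leader."""
    try:
        leader = json.loads(body)
    except ValueError:
        return False
    return isinstance(leader, str) and leader != ""


if __name__ == "__main__":
    # Each argument is a full URL, e.g. http://etcd-node:2379/version
    for url in sys.argv[1:]:
        ok, body = fetch(url)
        if ok and url.endswith("/v1/status/leader"):
            ok = consul_has_leader(body)
        print(("OK   " if ok else "FAIL ") + url)
```

Saved under a name of your choice, it could be run as python3 dcs_check.py http://etcd-node:2379/version http://consul-node:8500/v1/status/leader. A reachable endpoint only shows that one DCS member answers; it does not prove the DCS cluster has quorum.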
Network Partition Between Patroni and PostgreSQL Primary: Even if PostgreSQL is running, a network issue (firewall rule, routing problem, etc.) can prevent other nodes from reaching the PostgreSQL port (default 5432) on the primary. Each Patroni agent monitors the PostgreSQL instance on its own node, so a blocked port by itself mainly breaks streaming replication; a failover follows when the same partition also cuts the primary off from the DCS and its leader key expires.
- Diagnosis: From a replica node, try nc -zv <primary_ip> 5432 or telnet <primary_ip> 5432.
- Fix: Check firewall rules (iptables -L -n -v) on both the Patroni and PostgreSQL nodes. Ensure port 5432 is open between them. Verify routing tables (ip route show) and DNS resolution for the primary’s hostname.
- Why it works: Replicas connect to the primary on the PostgreSQL port to stream WAL. With the port open, replication stays current, and the replicas remain viable failover candidates if the primary is later lost.
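The nc check from the diagnosis step can also be written as a small script, which is handy when several nodes and ports need testing at once. This is a sketch using only the Python standard library; the function name port_open and the host:port argument format are choices made for this example.

```python
import socket
import sys


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Arguments are host:port pairs, e.g. <primary_ip>:5432
    for target in sys.argv[1:]:
        host, _, port = target.rpartition(":")
        state = "open" if port_open(host, int(port)) else "unreachable"
        print(f"{target}: {state}")
```

A successful connection shows only that something accepts TCP connections on that port. It does not confirm that PostgreSQL is healthy or that authentication would succeed.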
PostgreSQL Primary Not Running or Not Listening: The PostgreSQL process might have crashed or been stopped, or it might be configured to listen on the wrong network interface.
- Diagnosis: On the suspected primary node, check ps aux | grep postgres to see if the process is running. Check sudo ss -tulnp | grep 5432 to see what IP addresses PostgreSQL is listening on.
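- Fix: If PostgreSQL isn’t running, start it through Patroni rather than directly, since Patroni manages the PostgreSQL process (for example, sudo systemctl start patroni, assuming Patroni runs as a systemd unit with that name). If it’s listening on 127.0.0.1 instead of 0.0.0.0 or a specific IP, set postgresql.listen in patroni.yml (for example, 0.0.0.0:5432 or the appropriate IP) and restart Patroni. Patroni derives listen_addresses from that setting, so a value edited directly in postgresql.conf is overridden.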
- Why it works: Patroni needs to connect to the PostgreSQL server process. If the process is absent or not bound to a network interface accessible by Patroni, the connection will fail.
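To make the listening check from the diagnosis step less error-prone, the local address reported by ss can be classified automatically. The sketch below assumes the usual ss -tln layout, in which the local address is the only field ending in :5432; the names listen_scope and scan are illustrative, and the script reads saved ss output from files given as arguments.

```python
import ipaddress
import sys


def listen_scope(local_address):
    """Classify a 'Local Address:Port' value such as 127.0.0.1:5432."""
    host, _, _port = local_address.rpartition(":")
    host = host.strip("[]")        # IPv6 addresses are printed in brackets
    host = host.split("%", 1)[0]   # drop an interface suffix such as %eth0
    if host in ("*", "0.0.0.0", "::"):
        return "all interfaces"
    try:
        if ipaddress.ip_address(host).is_loopback:
            return "loopback only"
    except ValueError:
        return "unknown"
    return "specific address"


def scan(lines, port=5432):
    """Yield (address, scope) for each line with a socket bound to the port."""
    suffix = f":{port}"
    for line in lines:
        for field in line.split():
            if field.endswith(suffix):
                yield field, listen_scope(field)
                break


if __name__ == "__main__":
    # Arguments are files containing saved output of: ss -tln
    for path in sys.argv[1:]:
        with open(path) as handle:
            for address, scope in scan(handle):
                print(f"{address}: {scope}")
```

A result of "loopback only" means other nodes cannot connect to this instance, which matches the misconfiguration described above.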
Patroni Configuration Incorrect (API/Postgres Ports): Patroni nodes need to reach each other’s REST API (default 8008) and the PostgreSQL port on the primary. If these are misconfigured or blocked, failover can be erratic.
- Diagnosis: From one Patroni node, try curl http://<other_patroni_node>:8008/cluster. Also, verify that Patroni nodes can reach the PostgreSQL port on the primary (using the nc or telnet check from the network partition case above).
- Fix: Ensure scope and namespace in patroni.yml are consistent across the cluster. Verify that restapi.listen and postgresql.listen in patroni.yml match actual interfaces, that restapi.connect_address and postgresql.connect_address are addresses the other nodes can reach, and that firewalls allow traffic on 8008 and 5432 between Patroni nodes.
- Why it works: When a leader has to be chosen, Patroni queries the other members’ REST APIs to compare their state before promoting, and replicas connect to the primary’s PostgreSQL port for replication. Misconfigurations here break the communication chain needed for proper HA.
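The curl check against /cluster from the diagnosis step can be extended to summarize what each node reports. The sketch below assumes the response is a JSON object with a members list whose entries carry name, role, and state fields, and it treats the states "running" and "streaming" as healthy; both are assumptions to verify against your Patroni version. The function name summarize is illustrative.

```python
import json
import sys
import urllib.request


def summarize(cluster):
    """Return (leader_names, problem_members) from a /cluster response dict."""
    members = cluster.get("members", [])
    leaders = [m.get("name") for m in members if m.get("role") == "leader"]
    # Assumption: healthy members report state "running" or "streaming".
    problems = [m.get("name") for m in members
                if m.get("state") not in ("running", "streaming")]
    return leaders, problems


if __name__ == "__main__":
    # Each argument is a Patroni REST API base URL, e.g. http://<node>:8008
    for base in sys.argv[1:]:
        url = base.rstrip("/") + "/cluster"
        try:
            with urllib.request.urlopen(url, timeout=3) as resp:
                leaders, problems = summarize(json.load(resp))
            print(f"{base}: leaders={leaders} problems={problems}")
        except (OSError, ValueError) as exc:
            print(f"{base}: unreachable or invalid response ({exc})")
```

Running it against every node is useful because nodes that disagree about the leader, or that cannot be reached at all, point to the port and address problems described above.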
PostgreSQL Primary Resource Exhaustion: The primary PostgreSQL server might be running but unresponsive due to extreme load, OOM killer intervention, or disk I/O issues. Patroni’s health checks might time out.
- Diagnosis: On the primary node, check top, htop, vmstat, iostat. Look for high CPU, memory pressure, or disk queue lengths. Check /var/log/syslog or journalctl for OOM killer messages.
- Fix: Tune PostgreSQL parameters (shared_buffers, work_mem, etc.) or add more resources (CPU, RAM, faster storage). Identify and optimize slow queries.
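- Why it works: If the PostgreSQL server is too busy to respond to Patroni’s connection attempts or health checks within the configured timeouts (ttl, loop_wait, and retry_timeout in the Patroni configuration), Patroni will assume it’s unavailable. Relieving the resource pressure lets those checks complete in time.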
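Searching logs for OOM killer activity, as suggested in the diagnosis step, is easy to automate. The sketch below assumes the kernel reports kills in the form "Out of memory: Killed process <pid> (<name>)", with older kernels writing "Kill process" instead; check a real message from your system before relying on the pattern. The name oom_kills is illustrative, and the script reads saved log files given as arguments.

```python
import re
import sys

# Assumption: kernel OOM messages look like
#   "Out of memory: Killed process <pid> (<name>) ..."
# Older kernels write "Kill process" instead of "Killed process".
OOM_RE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def oom_kills(lines, name_filter="postgres"):
    """Return (pid, process_name) for OOM kills whose name contains name_filter."""
    hits = []
    for line in lines:
        match = OOM_RE.search(line)
        if match and name_filter in match.group(2):
            hits.append((int(match.group(1)), match.group(2)))
    return hits


if __name__ == "__main__":
    # Arguments are saved log files, e.g. /var/log/syslog or journalctl -k output.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for pid, name in oom_kills(handle):
                print(f"{path}: OOM killer terminated {name} (pid {pid})")
```

Any hit for a postgres process around the time of the failover is a strong hint that memory pressure, rather than the network or the DCS, caused it.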
Replication Slot Issues (for replica promotion): Replication slots do not cause a failover of the primary by themselves. After a replica is promoted, however, the other replicas must start replicating from the new primary. If the slots they expect are missing there, or stale slots remain on an old primary that is still alive but unreachable, the result is confusion and follow-on errors.
- Diagnosis: On the new primary, check pg_replication_slots and pg_stat_replication in psql. On the replicas, check pg_stat_wal_receiver and the Patroni logs for errors related to replication slot management.
- Fix: Drop stale replication slots that still reference members that are gone. On the node that holds them (the old primary, if accessible), SELECT pg_drop_replication_slot('<slot_name>'); removes an inactive slot. If the new primary failed to create a slot that a replica expects, SELECT pg_create_physical_replication_slot('<slot_name>'); creates it manually, although Patroni normally manages member slots itself.
- Why it works: Replication slots make the primary retain the WAL that replicas have not yet received. A missing slot lets the primary recycle WAL a replica still needs, and a stale slot makes it retain WAL indefinitely, so broken or mismanaged slots during a promotion compromise the cluster’s replication health.
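The slot inspection from the diagnosis step can be turned into a quick report of inactive slots and how much WAL each one holds back. The sketch below assumes input produced by psql -At -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots", which prints one row per line with fields separated by | and booleans as t or f, plus the current WAL position from SELECT pg_current_wal_lsn(). The function names are illustrative.

```python
import sys


def lsn_to_int(lsn):
    """Convert an LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def inactive_slots(psql_lines, current_lsn):
    """Return (slot_name, retained_bytes) for inactive slots that hold back WAL."""
    result = []
    for line in psql_lines:
        parts = line.strip().split("|")
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        name, active, restart_lsn = parts
        if active == "t" or not restart_lsn:
            continue  # slot is in use, or it reserves no WAL yet
        retained = lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)
        result.append((name, retained))
    return result


if __name__ == "__main__":
    # Usage: script.py <current_lsn> <file_with_psql_output>
    if len(sys.argv) == 3:
        with open(sys.argv[2]) as handle:
            for name, retained in inactive_slots(handle, sys.argv[1]):
                print(f"{name}: inactive, retaining about {retained} bytes of WAL")
```

An inactive slot with a large and growing retained size is a candidate for the cleanup described in the fix, once you have confirmed that no replica still needs it.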

The next error you’ll likely hit after fixing these is related to the application’s ability to connect to the new primary, or perhaps a replica failing to catch up if replication was severely impacted.