Postgres is terminating connections because the pg_terminate_backend() function is being called with an invalid process ID (PID).
This usually happens when an automated process or a manual script tries to kill a PostgreSQL backend that no longer exists or was already terminated by another mechanism. pg_terminate_backend() expects the PID of a live backend. Given a PID that is not a current server process, it returns false and emits a warning (PID ... is not a PostgreSQL server process) instead of terminating anything. The damaging case is PID reuse: if the operating system has since assigned the stale PID to a new backend, the call terminates that unrelated session and its client connection is dropped.
Here are the most common reasons this occurs and how to address them:
Stale PID in Automation Scripts
Diagnosis: You’ll find a script (e.g., a shell script, Python script, or cron job) that periodically checks for long-running queries or idle connections and attempts to terminate them using pg_terminate_backend(). The PID it’s trying to terminate is no longer valid.
Common Cause: The script’s logic for fetching PIDs is flawed, or a race condition exists where a backend process finishes its work between the time the PID is fetched and the time pg_terminate_backend() is called.
Fix:
- Implement a "check-before-kill" mechanism: Before calling pg_terminate_backend(pid), query pg_stat_activity to ensure the PID still exists and matches the expected query or state.

      SELECT pid, usename, datname, query, state
      FROM pg_stat_activity
      WHERE pid = <the_pid_you_intend_to_kill>;

- Add error handling: If the query above returns no rows, or if the query or state doesn’t match what you expect, do not call pg_terminate_backend(). Log this condition as a warning instead of an error.

Why it works: This prevents the function from being called with a non-existent PID, thus avoiding the "invalid PID" error that triggers the connection termination.
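The check-before-kill idea can be folded into a single statement so that no gap remains between the check and the kill. The following is a minimal sketch, not a definitive implementation: it assumes a DB-API 2.0 cursor with the `%s` parameter style (as psycopg uses), and the function name `terminate_if_matching` and the logger name are hypothetical. It also assumes the script recorded `backend_start` when it first picked the PID, which guards against the PID having been reused by a different session.

```python
import logging

log = logging.getLogger("backend_cleanup")  # hypothetical logger name

# pg_terminate_backend() is only evaluated for rows that pass the WHERE
# clause, so a vanished or changed backend is never signalled.
TERMINATE_SQL = """
SELECT pid, pg_terminate_backend(pid) AS terminated
FROM pg_stat_activity
WHERE pid = %s
  AND backend_start = %s
  AND state = %s
"""


def terminate_if_matching(cursor, pid, backend_start, expected_state):
    """Terminate `pid` only if it is still the same session in the expected state.

    `cursor` is assumed to be a DB-API 2.0 cursor connected to PostgreSQL.
    Returns True if the termination signal was sent, False otherwise.
    """
    cursor.execute(TERMINATE_SQL, (pid, backend_start, expected_state))
    row = cursor.fetchone()
    if row is None:
        # Backend is gone, was replaced, or changed state: warn, do not kill.
        log.warning("backend %s no longer matches; skipping termination", pid)
        return False
    return bool(row[1])
```

A cleanup job would call this once per candidate and treat a `False` result as a normal outcome rather than a failure.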
External Process Management (e.g., Docker, Kubernetes)
Diagnosis: If your PostgreSQL instance is running within a containerized environment, the container orchestrator might be terminating PostgreSQL processes directly without PostgreSQL being aware. When pg_terminate_backend() is then called on a PID that the orchestrator already killed, you get this error.
Common Cause: The orchestrator’s health checks or scaling events might trigger container restarts or process termination. The PostgreSQL server might not have graceful shutdown procedures configured or triggered correctly.
Fix:
- Configure graceful shutdown: Ensure your container orchestrator sends an appropriate termination signal to the PostgreSQL container and allows enough time for the shutdown to finish before it force-kills the container. PostgreSQL maps signals to shutdown modes itself: SIGTERM requests a smart shutdown, SIGINT a fast shutdown, and SIGQUIT an immediate shutdown. You select the mode by choosing the stop signal in your orchestrator’s deployment configuration.
- Avoid direct PID killing from orchestrator: If possible, let PostgreSQL manage its own processes. If you must intervene, ensure your intervention logic is robust and checks for process existence.
Why it works: Graceful shutdown allows PostgreSQL to clean up connections and processes properly before the container is stopped, preventing orphaned PIDs that external scripts might try to kill later.
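To make the signal-to-mode relationship concrete, here is a small sketch of what a stop hook could do on a POSIX host. It is roughly what `pg_ctl stop -m fast` does, and `pg_ctl` remains the preferred tool; the helper names `read_postmaster_pid` and `request_shutdown` are hypothetical, and the data directory path is whatever your deployment uses.

```python
import os
import signal
from pathlib import Path

# PostgreSQL's mapping of shutdown modes to signals sent to the postmaster.
SHUTDOWN_SIGNALS = {
    "smart": signal.SIGTERM,     # wait for sessions to end on their own
    "fast": signal.SIGINT,       # roll back open transactions, disconnect clients
    "immediate": signal.SIGQUIT, # abort without clean shutdown; recovery on restart
}


def read_postmaster_pid(data_dir):
    """Return the postmaster PID, stored on the first line of postmaster.pid."""
    first_line = (Path(data_dir) / "postmaster.pid").read_text().splitlines()[0]
    return int(first_line.strip())


def request_shutdown(data_dir, mode="fast"):
    """Signal the postmaster (never an individual backend) to shut down."""
    pid = read_postmaster_pid(data_dir)
    os.kill(pid, SHUTDOWN_SIGNALS[mode])
    return pid
```

Signalling the postmaster rather than individual backends lets PostgreSQL wind down its own child processes in order.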
Manual Intervention Errors
Diagnosis: A DBA or developer manually ran a SELECT pg_terminate_backend(<pid>); command, but provided a PID that was already gone or incorrect.
Common Cause: Typo in the PID, or the PID was for a process that finished its work just before the command was executed.
Fix:
- Double-check PIDs: Always verify the PID from pg_stat_activity immediately before executing pg_terminate_backend().
- Use pg_stat_activity to find PIDs: Instead of guessing or recalling PIDs, always query pg_stat_activity to get the current, active PIDs.

      -- Example: Find a long-running query and its PID
      SELECT pid, usename, query, state, now() - query_start AS duration
      FROM pg_stat_activity
      WHERE state = 'active' AND query NOT LIKE '%pg_stat_activity%'
      ORDER BY duration DESC
      LIMIT 1;

  Then, use the pid from the result.

Why it works: Ensures the pg_terminate_backend() command is always executed with a PID that is currently active and managed by PostgreSQL.
Background Worker Processes
Diagnosis: PostgreSQL uses background worker processes for various tasks (e.g., autovacuum, logical replication workers). If a script or external tool tries to terminate a background worker PID that has already exited or been replaced, this error can occur.
Common Cause: Scripts designed to manage user-submitted queries might incorrectly target system background processes.
Fix:
- Filter background workers: When selecting PIDs to terminate, explicitly exclude background workers. You can identify them by checking the backend_type column in pg_stat_activity.

      SELECT pid, usename, datname, query, state, backend_type
      FROM pg_stat_activity
      WHERE pid = <the_pid_you_intend_to_kill> AND backend_type = 'client backend';

- Be cautious with system processes: Avoid terminating processes that are not directly associated with user queries unless you have a very specific and well-understood reason.
Why it works: This prevents attempts to terminate processes that are managed internally by PostgreSQL and might have different lifecycles than regular client backends.
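A cleanup script can apply the same filter when it builds its candidate list, so that background workers never reach the termination step. This is a sketch under assumptions: a DB-API 2.0 cursor with the `%s` parameter style, a PostgreSQL version whose pg_stat_activity has the backend_type column, and hypothetical names (`CANDIDATES_SQL`, `find_client_backends`, `is_safe_target`).

```python
# Only regular client sessions, never this script's own connection.
CANDIDATES_SQL = """
SELECT pid, backend_start, backend_type, state
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND pid <> pg_backend_pid()
  AND state = 'active'
  AND now() - query_start > make_interval(secs => %s)
ORDER BY query_start
"""


def is_safe_target(backend_type, pid, own_pid):
    """Second line of defence in Python, in case rows come from another source."""
    return backend_type == "client backend" and pid != own_pid


def find_client_backends(cursor, min_seconds, own_pid):
    """Return (pid, backend_start) pairs for long-running client backends."""
    cursor.execute(CANDIDATES_SQL, (min_seconds,))
    return [
        (pid, backend_start)
        for pid, backend_start, backend_type, _state in cursor.fetchall()
        if is_safe_target(backend_type, pid, own_pid)
    ]
```

Returning `backend_start` alongside the PID lets a later step confirm it is still looking at the same session.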
System Resource Exhaustion Leading to Process Death
Diagnosis: The operating system might be killing PostgreSQL backend processes due to memory pressure (OOM killer) or other resource constraints. If a script later tries to terminate one of these OOM-killed PIDs, you’ll get the "invalid PID" error.
Common Cause: Insufficient RAM on the server, or a specific query consuming excessive memory.
Fix:
- Monitor system resources: Regularly check dmesg or /var/log/syslog for OOM killer messages.
- Tune PostgreSQL memory settings: Adjust shared_buffers, work_mem, and maintenance_work_mem based on your server’s RAM.
- Optimize queries: Identify and optimize memory-hungry queries.
Why it works: By preventing processes from being killed by the OS, you ensure that any PIDs PostgreSQL is aware of are actually alive and managed by the database.
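Checking the kernel log can be automated. The sketch below scans log lines for OOM-killer entries that name a postgres process. The function name `oom_killed_pids` is hypothetical, the regular expression targets the "Killed process <pid> (<name>)" wording that Linux kernels print, and the exact surrounding text varies by kernel version, so treat the pattern as an assumption to verify against your own logs.

```python
import re

# Matches e.g. "Out of memory: Killed process 4242 (postgres) total-vm:..."
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def oom_killed_pids(log_lines, process_name="postgres"):
    """Return PIDs of processes named `process_name` that the OOM killer killed."""
    pids = []
    for line in log_lines:
        match = OOM_KILL_RE.search(line)
        if match and match.group(2) == process_name:
            pids.append(int(match.group(1)))
    return pids
```

The input can be the output of dmesg split into lines, or the lines of /var/log/syslog; reading either may require elevated privileges.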
Network Interruption/Client Disconnects
Diagnosis: A client application might have crashed or lost its network connection. PostgreSQL might still have a backend process associated with that client for a short period. If an automated cleanup script targets this PID, and the OS or PostgreSQL has already cleaned up the actual process, you can see this.
Common Cause: Unstable network, client application bugs, or aggressive client-side connection pooling that doesn’t properly notify the server on disconnect.
Fix:
- Implement timeouts on the server: Use idle_in_transaction_session_timeout to terminate sessions that sit idle inside an open transaction, and statement_timeout to cancel statements that run too long.
- Ensure proper client disconnect handling: Configure client connection pools to reset session state (for example with DISCARD ALL) when a connection is returned to the pool, to close connections explicitly, and to handle network errors properly.

Why it works: Server-side timeouts actively manage idle or stuck connections, reducing the chance that an external script will attempt to terminate a PID for a client that has already effectively disconnected.
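For applications that cannot change server configuration, the same timeouts can be set per session right after connecting. This is a sketch that assumes a DB-API 2.0 cursor with the `%s` parameter style; the function name `apply_session_timeouts` and the timeout values are illustrative, not recommendations. It uses PostgreSQL's set_config() function because, unlike a plain SET statement, it accepts bound parameters.

```python
SESSION_TIMEOUT_SETTINGS = (
    "statement_timeout",
    "idle_in_transaction_session_timeout",
)


def apply_session_timeouts(cursor, statement_timeout="30s", idle_in_tx_timeout="5min"):
    """Set both timeouts for the current session (third argument false = not LOCAL).

    Run with autocommit on, or commit afterwards: a rolled-back transaction
    also undoes the setting.
    """
    values = (statement_timeout, idle_in_tx_timeout)
    for name, value in zip(SESSION_TIMEOUT_SETTINGS, values):
        cursor.execute("SELECT set_config(%s, %s, false)", (name, value))
```

Settings that should apply to every session are better placed in postgresql.conf or attached with ALTER ROLE or ALTER DATABASE.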
After addressing these, the next error you might encounter is related to the actual resource that was causing the long-running query or connection leak in the first place, or potentially issues with replication if background workers were involved.