A Python application that doesn’t handle SIGTERM will often be killed abruptly by its orchestrator, losing critical state and potentially corrupting data.
Let’s see how this plays out in practice. Imagine a simple worker that, for each request it processes, appends the request details to a log file and then sleeps for a couple of seconds to simulate work. The script below stands in for that worker with a plain loop.
```python
import os
import signal
import sys
import time

STOPPING = False


def signal_handler(signum, frame):
    # Only set a flag here; the main loop decides when it is safe to stop.
    global STOPPING
    print(f"Received signal: {signum}. Initiating graceful shutdown.")
    STOPPING = True


signal.signal(signal.SIGTERM, signal_handler)
signal.signal(signal.SIGINT, signal_handler)  # Also handle Ctrl+C

print(f"Worker process started with PID: {os.getpid()}")

counter = 0
while not STOPPING:
    try:
        with open("requests.log", "a") as f:
            f.write(f"[{time.time()}] Processing request {counter}\n")
        print(f"Processed request {counter}")
        counter += 1
        time.sleep(2)  # Simulate work
    except KeyboardInterrupt:
        # Not normally reached: SIGINT is routed to signal_handler above.
        print("Ctrl+C detected, initiating shutdown.")
        STOPPING = True
    except Exception as e:
        print(f"An error occurred: {e}")
        break

print("Shutdown complete. Exiting.")
sys.exit(0)
```
Now, let’s run this script and send it a SIGTERM signal.
```bash
# In one terminal:
python your_script_name.py
# Output will show:
# Worker process started with PID: 12345
# Processed request 0
# Processed request 1
# ...

# In another terminal, find the PID (e.g., 12345) and send SIGTERM:
kill -s SIGTERM 12345
```
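Because this script handles SIGTERM, the handler prints its message, the loop finishes the current iteration, and the process prints Shutdown complete. Exiting. before it exits. If SIGTERM were not handled (try removing the two signal.signal calls), the process would simply disappear: the final print would never be reached, and the requests.log file might be left in an inconsistent state if a write was interrupted.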
SIGTERM exists to give an operating system or orchestration system (like Kubernetes, Docker Swarm, or systemd) a standardized way to tell a process, "Hey, it’s time to shut down now." The signal is designed to be caught and handled, allowing the application to clean up resources, save state, and exit cleanly. If a Python process does not handle it, the default action terminates the process immediately, without running finally blocks or atexit handlers, which is the equivalent of pulling the plug.
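To make the default behavior concrete, here is a minimal sketch that starts a child Python process with no SIGTERM handler, sends it SIGTERM, and inspects what happened. The child program and the helper name terminate_unhandled_child are illustrative and not part of the original example, and the sketch assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Child program with no SIGTERM handler; its finally block stands in for cleanup.
CHILD_SOURCE = """
import time
try:
    print("started", flush=True)
    time.sleep(30)
finally:
    print("cleanup ran", flush=True)
"""


def terminate_unhandled_child():
    """Start the child, send it SIGTERM, and return (exit code, captured output)."""
    with subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdout=subprocess.PIPE,
        text=True,
    ) as proc:
        first_line = proc.stdout.readline()  # block until the child is inside the try
        proc.send_signal(signal.SIGTERM)
        remaining = proc.stdout.read()       # returns at EOF, once the child is gone
        proc.wait(timeout=10)
    return proc.returncode, first_line + remaining


if __name__ == "__main__":
    code, output = terminate_unhandled_child()
    print(code, repr(output))
```

On a POSIX system the exit code should be the negated signal number, and the captured output should contain the started line but not the cleanup line, because the finally block never gets a chance to run.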
The core mechanism for handling signals in Python is the signal module. You register a callback function for specific signals. When the signal is received by the process, the registered callback is invoked.
```python
import os
import signal
import sys
import time

# This flag will be set by the signal handler
shutdown_requested = False


def graceful_shutdown_handler(signum, frame):
    global shutdown_requested
    print(f"Received signal {signum}. Initiating shutdown sequence...")
    shutdown_requested = True


# Register the handler for SIGTERM (signal number 15 on Linux)
signal.signal(signal.SIGTERM, graceful_shutdown_handler)
# It's also good practice to handle SIGINT (Ctrl+C) for local testing
signal.signal(signal.SIGINT, graceful_shutdown_handler)

print(f"Application started. PID: {os.getpid()}")

try:
    while not shutdown_requested:
        # Simulate doing work, e.g., processing requests, cleaning queues
        print("Doing work...")
        time.sleep(5)
except Exception as e:
    print(f"An unexpected error occurred: {e}")
finally:
    print("Performing final cleanup...")
    # This is where you'd save state, close connections, etc.
    print("Cleanup complete. Exiting gracefully.")

sys.exit(0)
```
The signal.signal(signal_number, handler_function) call is the heart of it. signal.SIGTERM is the standard termination signal, and signal.SIGINT is what you get from pressing Ctrl+C. The graceful_shutdown_handler function is executed when either signal is received. Inside the handler, we set a flag (shutdown_requested) to True. The main application loop then checks this flag and breaks when it’s true, entering a finally block for cleanup.
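One consequence of the flag-based design is latency: the loop only looks at the flag between iterations, so with time.sleep(5) the process can take up to five seconds to react. A minimal sketch of one way to shorten that is to sleep in short slices and poll the flag in between. The helper sleep_unless and its step parameter are hypothetical names introduced here for illustration.

```python
import time


def sleep_unless(should_stop, total_seconds, step=0.1):
    """Sleep for up to total_seconds, waking every `step` seconds to poll.

    should_stop is a zero-argument callable, e.g. lambda: shutdown_requested.
    Returns True if the full duration elapsed, False if it returned early.
    """
    deadline = time.monotonic() + total_seconds
    while not should_stop():
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return True
        time.sleep(min(step, remaining))
    return False
```

In the loop above, time.sleep(5) could then become sleep_unless(lambda: shutdown_requested, 5), which bounds the reaction time by roughly one step instead of the full sleep.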
The real power comes when you integrate this into an orchestration system. For example, in Kubernetes, when a pod is scaled down or deleted, the kubelet sends a SIGTERM to the containers within that pod. If your application handles SIGTERM correctly, it will have time to finish its current operations, save any in-progress work, and shut down cleanly before Kubernetes forcefully kills it with SIGKILL after a configurable grace period (defined by terminationGracePeriodSeconds).
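Because the grace period is finite, it can help to give cleanup code its own time budget that is shorter than the grace period. The sketch below drains a queue of pending jobs until a deadline passes and returns whatever is left so the caller can persist it. The function drain_with_deadline, the 5-second safety margin, and the 30-second grace period (the Kubernetes default for terminationGracePeriodSeconds) are assumptions chosen for illustration.

```python
import time
from collections import deque


def drain_with_deadline(pending, handle, budget_seconds):
    """Process queued jobs until the queue is empty or the budget runs out.

    Returns the jobs that were not processed, so the caller can persist them.
    """
    deadline = time.monotonic() + budget_seconds
    queue = deque(pending)
    while queue and time.monotonic() < deadline:
        handle(queue.popleft())
    return list(queue)


if __name__ == "__main__":
    GRACE_PERIOD = 30   # assumed to match terminationGracePeriodSeconds
    SAFETY_MARGIN = 5   # leave time for final logging and process exit
    done = []
    leftover = drain_with_deadline(range(5), done.append, GRACE_PERIOD - SAFETY_MARGIN)
    print(done, leftover)  # → [0, 1, 2, 3, 4] []
```

The design choice here is to stop starting new work once the budget is spent, rather than trying to interrupt a job that is already running.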
When the SIGTERM handler is invoked, it interrupts the normal flow of your program. If your application is in the middle of a long-running operation (like a database transaction, a network request, or a complex calculation), simply setting a flag might not be enough. You need to ensure that these operations can be safely interrupted or completed. This often means using mechanisms like:
- Timeout for operations: Wrap critical operations in try...except blocks with explicit timeouts.
- Event-driven shutdown: Use threading or asynchronous programming (asyncio) and coordinate shutdown across threads or tasks. For asyncio, you’d typically register a handler for SIGTERM and then cancel all running tasks.
- State persistence: If a process might be killed mid-operation, ensure that any critical data is written to a durable store as soon as possible, or that the state can be resumed.
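For the threading variant of event-driven shutdown, one possible sketch uses a threading.Event that the signal handler sets and that worker threads use as an interruptible sleep. The names worker, run_workers, and install_handlers are hypothetical, and the per-iteration "work" is a placeholder counter.

```python
import signal
import threading


def worker(name, stop_event, results):
    """Do one unit of work per iteration until asked to stop."""
    iterations = 0
    while not stop_event.is_set():
        iterations += 1                  # placeholder for real work
        stop_event.wait(timeout=0.05)    # sleeps, but wakes as soon as the event is set
    results[name] = iterations           # placeholder for per-thread cleanup


def run_workers(stop_event, count=3):
    """Start worker threads and block until all of them have stopped."""
    results = {}
    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}", stop_event, results))
        for i in range(count)
    ]
    for thread in threads:
        thread.start()
    # Join with a timeout so the main thread regularly returns to Python code,
    # which is where signal handlers get a chance to run.
    while any(thread.is_alive() for thread in threads):
        for thread in threads:
            thread.join(timeout=0.2)
    return results


def install_handlers(stop_event):
    """Route SIGTERM and SIGINT to stop_event.set(); call this from the main thread."""
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, lambda signum, frame: stop_event.set())
```

A program would create the event, call install_handlers(event), and then call run_workers(event). The main thread deliberately never waits on the event itself, so the handler is not calling set() while that same thread is inside one of the event's own methods.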
Consider this asyncio example:
```python
import asyncio
import os
import signal


async def worker(name, shutdown_event):
    print(f"Worker {name} started.")
    try:
        while not shutdown_event.is_set():
            print(f"Worker {name} doing work...")
            await asyncio.sleep(2)
    except asyncio.CancelledError:
        print(f"Worker {name} received cancellation.")
        # Perform cleanup specific to this worker
        print(f"Worker {name} cleaning up.")
        raise  # Re-raise so the task is recorded as cancelled
    finally:
        print(f"Worker {name} finished.")


async def main():
    print(f"Main application started with PID: {os.getpid()}")
    # Create the event inside the running loop
    shutdown_event = asyncio.Event()

    # Register signal handlers that set the event (Unix event loops only)
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, shutdown_event.set)

    tasks = [
        asyncio.create_task(worker(f"Task-{i}", shutdown_event)) for i in range(3)
    ]

    try:
        await asyncio.gather(*tasks)
    except Exception as e:
        print(f"An error occurred in main: {e}")
    finally:
        print("Main application shutting down.")
        # Ensure all tasks are cancelled if not already
        for task in tasks:
            if not task.done():
                task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)  # Wait for cancellations
        print("All tasks finished. Exiting.")


if __name__ == "__main__":
    asyncio.run(main())
```
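When SIGTERM is received, the event loop calls shutdown_event.set(). Each worker notices this the next time it evaluates while not shutdown_event.is_set(), which happens after its current await asyncio.sleep(2) completes, and then returns. Once every worker has returned, asyncio.gather(*tasks) completes and control moves to the finally block. The task.cancel() calls there are a safety net for the case where gather exits early, for example because one worker raised an exception while the others are still running: each remaining task receives an asyncio.CancelledError at its current await and can execute its own cleanup logic. Note that setting the event alone does not wake a worker that is blocked in a long await; if you need a prompt reaction, cancel the tasks from the signal handler itself.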
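As a variation, here is a hedged sketch in which the signal handler cancels the tasks directly, so a worker blocked in a long await is interrupted right away instead of at its next loop check. The self_signal_after parameter and the log list are demo-only assumptions that let the example terminate on its own, and loop.add_signal_handler is only available on Unix event loops.

```python
import asyncio
import os
import signal


async def worker(name, log):
    try:
        await asyncio.sleep(3600)          # stands in for a long-running await
    except asyncio.CancelledError:
        log.append(f"{name} cancelled")    # per-task cleanup would go here
        raise


async def main(self_signal_after=None):
    log = []
    tasks = [asyncio.create_task(worker(f"Task-{i}", log)) for i in range(3)]
    loop = asyncio.get_running_loop()

    def cancel_all():
        for task in tasks:
            task.cancel()

    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, cancel_all)

    if self_signal_after is not None:
        # Demo only: send SIGTERM to this process, as an orchestrator would.
        loop.call_later(self_signal_after, os.kill, os.getpid(), signal.SIGTERM)

    # return_exceptions=True collects each CancelledError instead of raising it.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return log, results
```

Running asyncio.run(main(self_signal_after=0.1)) should return shortly after the signal with one log entry per task. In a real service you would omit the self-signal and let the orchestrator deliver SIGTERM.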
The most surprising thing about graceful shutdown is how it forces you to think about the atomicity of your operations. A Python signal handler is a callback that runs in the main thread and can interrupt Python code at an almost arbitrary point (between bytecode instructions, specifically). A single call such as list.append completes before the handler runs, but any multi-step operation, such as updating two related data structures or writing one record with several write calls, can be split in the middle. That is harmless when the handler only sets a flag, because the interrupted code resumes afterwards. It becomes dangerous when the handler raises an exception or exits, or when the process is killed with no handler at all: in those cases a file you were writing might be left with incomplete data. The real lesson is that any operation that modifies shared state or external resources must be designed with the possibility of abrupt interruption in mind, and graceful shutdown is the mechanism that lets you mitigate the worst of these interruptions by giving you a chance to make things right.
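One common way to protect a state file against interruption is to write to a temporary file and rename it into place, so readers see either the old contents or the new ones. The sketch below assumes JSON-serializable state and a POSIX filesystem where the temporary file and the target live in the same directory; atomic_write_json is a hypothetical helper name.

```python
import contextlib
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Write data to path so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())      # push the bytes to disk before the rename
        os.replace(tmp_path, path)      # atomic when both paths share a filesystem
    except BaseException:
        # Interrupted or failed: remove the partial temp file, leave path untouched.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
```

If the process dies partway through, the worst case is a stray temporary file next to an intact previous version of the state file.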
The next hurdle you’ll likely face is managing the shutdown of multiple interconnected services, where the order of shutdown matters.