Checkpoint and restore let you migrate a running container from one host to another, preserving its exact state.
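Let’s see it in action. Imagine we have a simple web server running in a container, and we want to move it with as little interruption as possible. The commands below assume you run them as root (or prefix each with sudo) on a host with CRIU installed, because Podman supports checkpoint and restore only for rootful containers.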
# Start a simple Nginx container (no port mapping yet)
podman run -d --name my-nginx nginx
# Give Nginx a few seconds to start
sleep 5
curl localhost:8080 # This fails: no host port is mapped to the container yet.
# Remove the container and start it again with host port 8080 mapped to container port 80
podman stop my-nginx
podman rm my-nginx
podman run -d -p 8080:80 --name my-nginx nginx
sleep 5
curl localhost:8080 # Now returns the Nginx welcome page
Now, my-nginx is running and serving requests on port 8080. We can checkpoint this running state.
# Checkpoint the running container
podman container checkpoint --export nginx.tar.gz my-nginx
The nginx.tar.gz archive contains the state of the my-nginx container: the memory and CPU registers of its processes, their open file descriptors, and the changes made to the container’s filesystem. Established TCP connections are included only if you ask for them with --tcp-established. It’s like a snapshot of a process’s live execution.
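To get a feel for what was captured, you can list the archive without restoring it. This is a minimal sketch: list_checkpoint is a hypothetical helper name (not a Podman command), and it assumes your tar detects the compression format on read, as GNU tar and bsdtar do.

```shell
# Hypothetical helper: list the files stored in a checkpoint archive.
list_checkpoint() {
    # No -z or --zstd flag: tar detects the compression format when reading.
    tar -tf "$1"
}

# Example (needs an existing archive):
#   list_checkpoint nginx.tar.gz | head
```

You should see CRIU image files alongside container metadata; the exact file names depend on the Podman and CRIU versions in use.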
On a different host (or even the same host after stopping the original), you can restore this state.
# No separate "podman stop" is needed: checkpointing stops my-nginx unless --leave-running was given
# When restoring on a different host, copy nginx.tar.gz there first
# Restore the container from the checkpoint file
podman container restore --import nginx.tar.gz --name my-nginx-restored
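The my-nginx-restored container resumes from the state my-nginx was in when it was checkpointed: same process memory, same open files. A request that was in flight can only continue if its TCP connection was checkpointed and restored with --tcp-established and the container keeps its IP address; otherwise the client has to reconnect.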
The core problem checkpoint and restore solves is stateful application migration. Traditionally, moving a running application meant discarding its in-memory state: you’d stop it, save its data, provision a new instance, restore the data, and start it again, which could take minutes or hours. With checkpoint/restore, the application is unavailable only while the checkpoint is written, the archive is transferred, and the restore runs, and it then resumes with its in-memory state intact. How long that takes depends mainly on how much memory the container uses and how fast the archive can be copied.
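The checkpoint, transfer, and restore sequence can be wrapped in a small script. The following is a sketch under stated assumptions, not a hardened tool: migrate_container, run, the DRY_RUN variable, and the default archive path are all hypothetical names chosen for illustration, and it assumes root SSH access to a destination host that has Podman and CRIU installed.

```shell
#!/usr/bin/env bash
# Sketch: migrate a rootful container to another host via a checkpoint archive.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

migrate_container() {
    # $1: container name, $2: destination (user@host), $3: optional archive path
    local name="$1" dest_host="$2" archive="${3:-/tmp/$1-checkpoint.tar.gz}"

    # 1. Write the container state to an archive (this stops the container).
    run podman container checkpoint --export "$archive" "$name" || return 1
    # 2. Copy the archive to the same path on the destination host.
    run scp "$archive" "$dest_host:$archive" || return 1
    # 3. Recreate the container on the destination from the archive.
    run ssh "$dest_host" podman container restore --import "$archive" || return 1
}
```

Running `DRY_RUN=1 migrate_container my-nginx root@dest.example` prints the three commands without touching anything, which is a cheap way to check the paths before a real migration. The downtime is roughly the combined duration of the three steps.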
Internally, Podman delegates the work to CRIU (Checkpoint/Restore In Userspace). On checkpoint, CRIU freezes the processes within the container and serializes their state, including memory pages, registers, and open file descriptors, to image files. On restore, it recreates the processes, loads the saved state back into them, and only then lets them continue running. This is why it’s so powerful for applications that maintain significant in-memory state, like databases or long-running computations.
The podman container checkpoint command has several options to control the process. --export (-e) writes the checkpoint to an archive file, which is what makes it portable to another host; without it, the checkpoint stays in the container’s local storage and can only be restored on the same host. --leave-running (-R) keeps the original container running after the checkpoint is created, though this is less common for migration scenarios. On the restore side, --import (-i) points to the checkpoint archive, and --name (-n) gives the restored container a new name.
The surprising part is how network connections are handled. By default, Podman does not checkpoint a container that has established TCP connections; you have to pass --tcp-established to podman container checkpoint, and then again to podman container restore. With that option, the state of each connection (such as socket buffers and sequence numbers) is serialized and restored, so the remote peer might not even notice a disruption, provided the restored container keeps the same IP address and port and the pause is shorter than the peer’s timeouts.
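As a sketch of what that looks like on the command line, the two helpers below pass --tcp-established on both sides. The function names and the PODMAN override variable are hypothetical, added only so the commands can be previewed; whether a connection actually survives still depends on the restored container getting the same IP address.

```shell
# Hypothetical helpers: checkpoint and restore with established TCP connections.
# Set PODMAN=echo to print the arguments instead of invoking Podman.

checkpoint_keep_tcp() {
    # $1: container name, $2: archive path
    "${PODMAN:-podman}" container checkpoint --tcp-established --export "$2" "$1"
}

restore_keep_tcp() {
    # $1: archive path; the flag is required again if the archive holds connections
    "${PODMAN:-podman}" container restore --tcp-established --import "$1"
}
```

For example, `checkpoint_keep_tcp my-nginx nginx.tar.gz` on the source host followed by `restore_keep_tcp nginx.tar.gz` on the destination. One way to keep the address stable is to start the original container with a fixed address via --ip, assuming that address is free on the destination network.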
A common pitfall is assuming that a checkpoint taken on one kernel version will work flawlessly on another, or across different distributions. While CRIU aims for compatibility, differences in kernel behavior, CPU features, or the CRIU and Podman versions on each host can cause restore failures. Always test your checkpoint/restore workflow between the exact environments you intend to use.
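A quick pre-flight comparison can catch the most obvious mismatches before you rely on a migration. The sketch below is illustrative only: env_fingerprint and compare_fingerprints are hypothetical helper names, and matching kernel, CRIU, and Podman versions is a useful signal, not a guarantee that a restore will succeed.

```shell
# Hypothetical helper: print a one-line fingerprint of what a restore depends on.
env_fingerprint() {
    # Missing tools show up as empty fields rather than aborting the script.
    printf 'kernel=%s | criu=%s | podman=%s\n' \
        "$(uname -r)" \
        "$(criu --version 2>/dev/null | head -n 1)" \
        "$(podman --version 2>/dev/null)"
}

# Hypothetical helper: compare the fingerprints of source and destination.
compare_fingerprints() {
    # Succeeds only when both fingerprints are non-empty and identical.
    if [ -n "$1" ] && [ "$1" = "$2" ]; then
        echo "environments match"
    else
        echo "environments differ:"
        echo "  source:      $1"
        echo "  destination: $2"
        return 1
    fi
}
```

Run env_fingerprint on each host and pass the two outputs to compare_fingerprints. It is also worth running `criu check` as root on both hosts, which reports whether the kernel provides the features CRIU needs.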
The next hurdle you’ll likely encounter is managing these checkpoint files for more complex applications or when dealing with persistent volumes.