Database Recovery: Beyond the Backup

Postgres backup and recovery is often treated as a monolithic "set it and forget it" task, but the real magic happens when you understand how the pieces interact, particularly the interplay between pg_dump and Write-Ahead Logging (WAL).

Let’s see pg_dump in action. Imagine you have a database named mydatabase.

pg_dump -h localhost -p 5432 -U myuser -F c -b -v -f mydatabase_backup.dump mydatabase

This command creates a custom-format archive (-F c) that includes blobs (-b), shows verbose output (-v), and saves it to mydatabase_backup.dump. This pg_dump output is a point-in-time snapshot of your database’s data. It’s like taking a photograph of your data at the exact moment the dump finishes.

Now, what happens if a transaction is committed just as pg_dump finishes? Or what if you need to recover to a specific point in time between pg_dump runs? This is where Write-Ahead Logging (WAL) becomes critical.

WAL is Postgres’s mechanism for ensuring data durability and consistency. Before any change is written to the actual data files, it’s first written to a WAL record in a WAL segment file. These WAL segments are sequential logs of every single change made to the database.

When you perform a Point-in-Time Recovery (PITR), you combine a base backup (like one from pg_dump or pg_basebackup) with a stream of WAL files generated since that backup was taken. Postgres replays these WAL records against the base backup to reconstruct the database state at any given moment.

Here’s a typical setup for PITR. First, ensure wal_level is set to replica (or logical) in postgresql.conf.

wal_level = replica

Next, configure archive_mode and archive_command. archive_mode = on tells Postgres to start archiving WAL files, and archive_command specifies how to copy them to a safe, permanent location.

archive_mode = on
archive_command = 'cp %p /path/to/wal_archive/%f'

The %p is a placeholder for the WAL file to be archived, and %f is for its filename. In a production setup, you’d use a more robust method than cp, like rsync or a dedicated archiving tool, to handle potential failures and ensure files are moved reliably.

To perform a recovery, you’d first restore your latest base backup to a new data directory. Then, you’d tell Postgres where to find the archived WAL files. This is done by creating a recovery.signal file (for newer versions) or recovery.conf (older versions) in the data directory. For recovery.signal, you’d specify the target recovery time within the file.

recovery_target_time = '2023-10-27 10:30:00 UTC'

Postgres will then start up, find the recovery.signal file, and begin replaying WAL records from your archive until it reaches the specified recovery_target_time. It’s crucial to have your restore_command correctly configured in recovery.conf (or implicitly handled by recovery.signal and archive_command in newer versions) to tell Postgres how to retrieve the necessary WAL files from their archived location.

The most surprising thing about PITR is that pg_dump alone is not a PITR solution. It’s a logical backup, meaning it contains SQL statements or a custom format representing the data. While invaluable for restoring a database to a consistent state at the time of the dump, it doesn’t capture the continuous stream of changes that WAL does. You can’t use pg_dump to recover to a point after the dump was taken. You need the WAL archives.

A common pitfall is misinterpreting the role of pg_basebackup. Unlike pg_dump, which is a logical backup, pg_basebackup creates a physical copy of the data directory, including the data files and the current WAL files. It’s often preferred for base backups in a PITR strategy because it’s a block-level copy, generally faster to restore, and inherently includes the necessary WAL segments to start the recovery process. When using pg_basebackup, you still need to archive subsequent WAL files to achieve PITR.

If your archive_command fails, WAL files will stop being archived, and your PITR capability will be broken from that point forward. You’ll only be able to recover up to the last successfully archived WAL file.

The next hurdle you’ll encounter is understanding how to manage and prune your WAL archive to avoid filling up your disk space.