Write-Ahead Logging (WAL) only makes a write durable once the log record itself has been flushed to stable storage. That sounds obvious, but it is easy to treat "the change was logged" as if it meant "the change is safe," and the two are not the same until the flush has happened.
Let’s watch it in action. Imagine a simple database table:
CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(100)
);
Now, we want to add a new user:
INSERT INTO users (id, name) VALUES (1, 'Alice');
Before this INSERT can be reported as committed, and before the modified users data page can be written to disk, the database system has to write a record of the operation to its WAL. This WAL record describes the change: "insert row with id 1 and name 'Alice' into table 'users'." It looks something like this (simplified):
WAL Record: { type: INSERT, table_id: 123, row_data: { id: 1, name: 'Alice' } }
This WAL record is appended to a sequential log file and flushed to disk. The data page for the users table is modified in the database's memory buffers, but that in-memory change is not yet durable. The rule is about ordering on disk: the WAL record must reach stable storage before the modified data page is written to disk, and before the commit is acknowledged to the client.
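The ordering can be sketched in a few lines of Python. This is a toy model, not how any real database lays out its log: the JSON record format and the names wal_append and insert_user are made up for illustration, and a dict stands in for the in-memory data page. It logs and flushes before touching the in-memory table, which is the simplest ordering that satisfies the rule.

```python
import json
import os
import tempfile

def wal_append(wal_path, record):
    """Append one record to the log and force it to stable storage."""
    line = json.dumps(record) + "\n"
    with open(wal_path, "a", encoding="utf-8") as wal:
        wal.write(line)
        wal.flush()             # push Python's buffer down to the OS
        os.fsync(wal.fileno())  # ask the OS to push its cache to the device

def insert_user(wal_path, users, user_id, name):
    """Log the insert first, then change the in-memory table."""
    wal_append(wal_path, {"type": "INSERT", "table": "users",
                          "row": {"id": user_id, "name": name}})
    # Hypothetical in-memory "data page"; nothing here writes a data file.
    users[user_id] = name

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
users = {}
insert_user(wal_path, users, 1, "Alice")
```

The call to os.fsync is the part that matters. Without it, the record may still be sitting in an operating system cache when the power goes out.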
Now, what happens if the system crashes right after the WAL record is written to disk, but before the data page modification is written to disk?
When the database restarts, it reads its WAL file and finds the INSERT record for 'Alice'. If the log also shows that the transaction committed, the database knows this change must survive. It replays the WAL record, effectively performing the INSERT again, so 'Alice' is back in the users table. This is durability: the change survives the crash because a record of it was logged and flushed before the commit was acknowledged.
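Replay itself is not complicated. Here is a minimal sketch, assuming the log is a list of JSON lines in a made-up format and that a plain dict stands in for the users table. The function name replay is hypothetical.

```python
import json

def replay(wal_lines):
    """Rebuild the users table by re-applying every logged insert in order."""
    users = {}
    for line in wal_lines:
        record = json.loads(line)
        if record["type"] == "INSERT" and record["table"] == "users":
            row = record["row"]
            users[row["id"]] = row["name"]  # re-applying the same insert is harmless
    return users

wal_lines = [
    '{"type": "INSERT", "table": "users", "row": {"id": 1, "name": "Alice"}}',
]
recovered = replay(wal_lines)
print(recovered)  # {1: 'Alice'}
```

Because each record in this sketch sets a key to a value, replaying the log twice gives the same table as replaying it once. That property is what makes it safe to replay a record whose effect may or may not have reached the data files before the crash.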
The same replay handles transactions that span multiple WAL records. On restart, the database reads through the log and looks at how each transaction ended. If the log contains a commit record for a transaction, its changes are kept. If the transaction rolled back, or never logged a commit before the crash, its changes must not become visible: depending on the engine, they are undone or simply ignored.
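A rough sketch of that decision, using a made-up record format with a transaction id (xid) on every record. The function name recover is hypothetical, and the two-pass approach is a simplification: this version just skips uncommitted work, while real engines differ in how they deal with it.

```python
def recover(records):
    """Keep only the changes whose transaction logged a COMMIT."""
    # Pass 1: find every transaction that reached its commit record.
    committed = {r["xid"] for r in records if r["type"] == "COMMIT"}
    # Pass 2: re-apply inserts, skipping anything that never committed.
    users = {}
    for r in records:
        if r["type"] == "INSERT" and r["xid"] in committed:
            users[r["id"]] = r["name"]
    return users

log = [
    {"type": "INSERT", "xid": 1, "id": 1, "name": "Alice"},
    {"type": "COMMIT", "xid": 1},
    {"type": "INSERT", "xid": 2, "id": 2, "name": "Bob"},    # crashed before COMMIT
    {"type": "INSERT", "xid": 3, "id": 3, "name": "Carol"},
    {"type": "ROLLBACK", "xid": 3},
]
print(recover(log))  # {1: 'Alice'}
```

Bob's insert was logged, but without a commit record it does not survive recovery. Carol's was explicitly rolled back, so it is dropped as well.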
The core principle is that no data modification is considered committed (and thus durable) until a record of that modification has been appended to the WAL and that log record has been flushed to stable storage. The actual data pages can be modified in memory and written to disk later by a background process (for example, during a checkpoint), but the WAL record is the irrefutable proof that the change happened or was intended to happen.
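Here’s a look at a typical WAL directory structure and file naming convention on a PostgreSQL system. You’ll see files named like 000000010000000000000001: 24 hexadecimal digits, where the first eight are the timeline ID and the rest identify the segment's position in the WAL stream. Each of these files is a segment of the WAL stream, 16 MB in size by default.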
/var/lib/postgresql/data/pg_wal/
├── 000000010000000000000001
├── 000000010000000000000002
└── 000000010000000000000003
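If you want to make sense of those names, here is a small sketch that splits one apart. It assumes the usual layout of three 8-digit hexadecimal fields (timeline ID, then the high and low parts of the segment number) and the default 16 MB segment size. The function name parse_wal_segment is made up, and the result is only meaningful if your server uses that segment size.

```python
def parse_wal_segment(name, segment_size=16 * 1024 * 1024):
    """Split a 24-hex-digit WAL segment file name into its parts."""
    if len(name) != 24:
        raise ValueError("expected 24 hexadecimal digits")
    timeline = int(name[0:8], 16)
    high = int(name[8:16], 16)
    low = int(name[16:24], 16)
    # Number of segments covered by one step of the "high" field
    # (256 when segments are 16 MB).
    segments_per_high = 0x100000000 // segment_size
    segment_number = high * segments_per_high + low
    # Byte position in the WAL stream where this segment begins,
    # shown in a high/low hexadecimal style.
    start = segment_number * segment_size
    start_position = f"{start >> 32:X}/{start & 0xFFFFFFFF:X}"
    return timeline, segment_number, start_position

print(parse_wal_segment("000000010000000000000001"))  # (1, 1, '0/1000000')
```

All three files in the listing share timeline 1 and differ only in the segment number, which is why they sort in the order they were written.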
The most surprising thing about WAL is that the actual data pages don’t need to be written to disk when a transaction commits. Once the WAL record has been written and flushed, the modified page can stay in memory. If the system crashes, it relies on the WAL to reconstruct the state. The data pages are written to disk later, asynchronously or during checkpoints, and until then the WAL is the authoritative journal.
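To see why lagging data pages are safe, here is a toy model with a simulated crash. Everything in it is an assumption for illustration: the class name TinyStore, the list standing in for the flushed log, and the dicts standing in for data files and memory buffers. Recovery in this sketch replays only the records logged since the last checkpoint.

```python
class TinyStore:
    def __init__(self):
        self.wal = []            # stand-in for log records already flushed to disk
        self.data_file = {}      # stand-in for data pages on disk
        self.buffers = {}        # in-memory pages, lost on crash
        self.dirty = set()       # keys changed in memory but not yet in data_file
        self.checkpoint_pos = 0  # log position covered by the last checkpoint

    def put(self, key, value):
        self.wal.append((key, value))  # log first (treated as flushed here)
        self.buffers[key] = value      # then change the page in memory only
        self.dirty.add(key)

    def checkpoint(self):
        """Write dirty pages to the data file, some time after the commits."""
        for key in self.dirty:
            self.data_file[key] = self.buffers[key]
        self.dirty.clear()
        self.checkpoint_pos = len(self.wal)

    def crash(self):
        self.buffers = {}  # memory is gone; wal and data_file survive
        self.dirty = set()

    def recover(self):
        self.buffers = dict(self.data_file)
        for key, value in self.wal[self.checkpoint_pos:]:
            self.buffers[key] = value
            self.dirty.add(key)

store = TinyStore()
store.put(1, "Alice")
store.checkpoint()
store.put(2, "Bob")     # logged, but never written to the data file
store.crash()
store.recover()
print(store.buffers)    # {1: 'Alice', 2: 'Bob'}
```

At the moment of the crash the data file only contains Alice. Bob comes back purely from the log.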
If you’re using replication, the WAL stream is also what’s sent to replicas. The replica receives the WAL records and applies them to its own data, ensuring it stays in sync.
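The replica side can be pictured as the same replay loop, fed from the primary's log instead of a local file. A minimal sketch, assuming a made-up record format and a replica that remembers how many records it has already applied. The name apply_new_records is hypothetical, and a real system ships the records over a network connection rather than reading a shared list.

```python
def apply_new_records(primary_wal, replica):
    """Apply every primary log record the replica has not seen yet."""
    for record in primary_wal[replica["applied"]:]:
        replica["users"][record["id"]] = record["name"]
        replica["applied"] += 1
    return replica

primary_wal = [{"id": 1, "name": "Alice"}]
replica = {"applied": 0, "users": {}}

apply_new_records(primary_wal, replica)
primary_wal.append({"id": 2, "name": "Bob"})  # new write on the primary
apply_new_records(primary_wal, replica)
print(replica["users"])  # {1: 'Alice', 2: 'Bob'}
```

Tracking the applied position means the replica can disconnect and later pick up exactly where it left off.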
The next concept you’ll encounter is how checkpoints interact with WAL, and why WAL files are eventually recycled or removed.