The most surprising thing about PostgreSQL’s VACUUM is that it’s not about cleaning up old data, but about reclaiming space from deleted or updated rows that are still technically visible to older transactions.

Let’s see this in action. Imagine a table users with a few rows:

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    username VARCHAR(50) UNIQUE,
    last_login TIMESTAMP
);

INSERT INTO users (username, last_login) VALUES
('alice', NOW()),
('bob', NOW());

Now, let’s update alice’s last_login from within a transaction that starts before the update.

-- Transaction 1 (starts now)
BEGIN;
SELECT * FROM users WHERE username = 'alice'; -- This sees the old row
-- ... do other things ...

Meanwhile, in another session:

-- Transaction 2 (starts after Transaction 1)
UPDATE users SET last_login = NOW() WHERE username = 'alice';
COMMIT;

At this point, the users table conceptually has one row for alice. However, PostgreSQL, to maintain MVCC (Multi-Version Concurrency Control), keeps the old version of alice’s row around until all transactions that started before the UPDATE have finished. This "dead tuple" still occupies space.

This is where VACUUM comes in. It’s the process that marks these dead tuples as reusable space. It doesn’t physically remove them from the disk file immediately, but it tells PostgreSQL that the space they occupy can be overwritten by new rows or future UPDATEs.

Manual VACUUM:

You can run VACUUM manually on a table or the entire database.

  • VACUUM users;: This will clean up dead tuples in the users table. It won’t reclaim disk space back to the operating system unless the table is completely empty or you use VACUUM FULL.
  • VACUUM;: This cleans up dead tuples in all tables in the current database, excluding system catalogs.
  • VACUUM (VERBOSE, ANALYZE);: VERBOSE gives you more output about what’s happening. ANALYZE updates statistics about the table’s contents, which the query planner uses to create efficient execution plans. This is often run after VACUUM.

The Problem: Bloat and Performance Degradation

Without VACUUM (or autovacuum), dead tuples accumulate. This leads to:

  1. Bloat: Table and index files grow larger than necessary.
  2. Slower Scans: Full table scans have to read through more dead tuples, slowing down queries.
  3. Index Inefficiency: Indexes can also contain dead entries, making them larger and slower to use.
  4. Transaction ID Wraparound: PostgreSQL uses transaction IDs (XIDs) to track row visibility. XIDs are 32-bit numbers and will eventually wrap around. If old dead rows aren’t VACUUMed, they can prevent older XIDs from being "frozen" (marked as permanently visible), leading to the catastrophic "transaction ID wraparound" error, which can shut down your database.

Autovacuum: The Automated Solution

Autovacuum is a set of background worker processes that automatically run VACUUM and ANALYZE based on configurable thresholds. It’s essential for most PostgreSQL deployments.

Autovacuum monitors tables for changes. When a table exceeds certain thresholds for dead tuples or inserts, autovacuum will trigger a VACUUM or ANALYZE operation for that table.

Key Autovacuum Configuration Parameters:

These are typically set in postgresql.conf.

  • autovacuum = on: Enables the autovacuum daemon. This must be on for normal operation.
  • autovacuum_max_workers = 3: The number of autovacuum worker processes that can run concurrently. More workers can handle larger databases but consume more resources.
  • autovacuum_vacuum_threshold = 50: The minimum number of rows that must be dead in a table before autovacuum considers running VACUUM.
  • autovacuum_vacuum_scale_factor = 0.2: The fraction of a table’s size that must consist of dead tuples before autovacuum considers running VACUUM. The total threshold is autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * number_of_rows.
  • autovacuum_analyze_threshold = 50: Similar to autovacuum_vacuum_threshold, but for ANALYZE.
  • autovacuum_analyze_scale_factor = 0.1: Similar to autovacuum_vacuum_scale_factor, but for ANALYZE. The total threshold is autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * number_of_rows.
  • autovacuum_naptime = 1min: How often autovacuum checks if it needs to start new workers.

Tuning Autovacuum:

The default settings are often too conservative for busy tables. You’ll frequently need to tune autovacuum_vacuum_scale_factor and autovacuum_analyze_scale_factor for specific tables.

Example: Set autovacuum to run more aggressively on a high-traffic table orders:

ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.05, autovacuum_analyze_scale_factor = 0.02);

This means autovacuum will consider running VACUUM on orders when at least 5% of its rows are dead, and ANALYZE when at least 2% of its rows are dead, in addition to the base thresholds.

VACUUM FULL vs. VACUUM

VACUUM marks dead tuples as reusable but doesn’t shrink the file. VACUUM FULL rewrites the entire table into a new disk file, discarding dead space. This returns space to the OS but is much slower, requires an exclusive lock (blocking all reads/writes), and can cause significant I/O. It’s rarely the right answer and should be used sparingly, usually as a last resort for severe bloat. pg_repack is a popular extension that can achieve similar results with less downtime.

The one thing most people don’t know is that VACUUM doesn’t actually delete tuples from the physical data files. It just marks the space occupied by dead tuples as available for reuse. The actual physical removal of these tuples from the disk file only happens when PostgreSQL needs to rewrite the table or index, which VACUUM FULL does but regular VACUUM does not. The space is then available for new rows to be inserted into, or for updated rows to overwrite.

The next thing you’ll encounter is understanding how indexes are affected by dead tuples and how REINDEX can be used to clean them up.

Want structured learning?

Take the full Postgres course →