Postgres logical replication can start replicating before the initial snapshot is taken, creating a race condition that can lead to data divergence if not handled carefully.
Let’s see how this plays out.
Imagine we have a users table on a publisher database:
CREATE TABLE users (
id SERIAL PRIMARY KEY,
username VARCHAR(50) UNIQUE NOT NULL,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
INSERT INTO users (username) VALUES ('alice');
INSERT INTO users (username) VALUES ('bob');
Now, we set up a subscriber. On the publisher, we create a publication:
CREATE PUBLICATION my_publication FOR TABLE users;
On the subscriber, we create a subscription:
CREATE SUBSCRIPTION my_subscription
CONNECTION 'host=publisher_host port=5432 dbname=publisher_db user=repl_user password=repl_password'
PUBLICATION my_publication
WITH (copy_data = true);
The WITH (copy_data = true) tells the subscriber to perform an initial data copy from the publisher. During this process, PostgreSQL on the publisher creates a logical replication slot. This slot is a pointer to a specific point in the Write-Ahead Log (WAL) from which the subscriber should start receiving changes.
Here’s the crucial part: the copy_data process reads the existing data from the users table. As it reads, it’s also advancing the replication slot’s LSN (Log Sequence Number) forward in the WAL. Meanwhile, new transactions are happening on the publisher:
-- On the publisher, after subscription creation
INSERT INTO users (username) VALUES ('charlie');
If the copy_data process is slow, or if there’s a lot of data, the replication slot’s LSN might advance significantly. When copy_data finishes, the subscriber then starts applying the WAL changes from the point the slot is currently at. If charlie was inserted after the copy_data process had already read past the WAL record for that insertion, the subscriber will never see that INSERT statement. The slot is already too far ahead.
This is why understanding slot configuration is key. The logical replication slot is the gatekeeper for your changes.
The Anatomy of a Slot
A logical replication slot is created automatically when a subscription is initiated with copy_data = true or when a publication is used by a standby server. You can inspect these slots on the publisher using:
SELECT slot_name, plugin, slot_type, database, active, active_pid, wal_status, confirmed_flush_lsn, restart_lsn, active_reason
FROM pg_replication_slots;
slot_name: The unique identifier for the slot.plugin: The WAL decoding plugin being used (e.g.,pgoutputfor logical replication).slot_type:logicalfor our use case.database: The database the slot is associated with.active: Whether a backend process is currently connected and consuming changes.active_pid: The process ID of the backend consuming changes.wal_status:reservedorstreamed.reservedmeans the WAL is held but not yet consumed.streamedmeans it’s being actively sent.confirmed_flush_lsn: The LSN up to which the subscriber has confirmed receipt and applied changes. This is the most critical LSN for determining how far behind the subscriber is.restart_lsn: The LSN from which the slot will start sending changes. This is what advances as changes are consumed.active_reason: Explains why the slot is active (e.g.,initial_data_copy,streaming WAL).
Common Slot Configuration Pitfalls and Fixes
-
Slot Lagging Too Far Behind (Race Condition)
- Diagnosis: You notice data missing on the subscriber that exists on the publisher, or the
confirmed_flush_lsninpg_replication_slotsis significantly older than the current WAL generation. The subscriber might also show no new changes are arriving. - Cause: The initial
copy_dataprocess on the subscriber consumed WAL faster than the publisher could generate it, or the publisher experienced high write load after thecopy_datastarted but before it finished, causing therestart_lsnto jump ahead of the data already copied. Or, the subscriber process crashed/was stopped, and WAL kept accumulating on the publisher, but theconfirmed_flush_lsndidn’t advance. - Fix: The most robust solution is to drop and recreate the subscription with
copy_data = false. This avoids the initial snapshot and the associated race condition. You then manually synchronize the data (e.g., usingpg_dumpandpg_restoreon the affected tables) and then start the subscription. If you must usecopy_data, ensure the publisher has sufficient WAL disk space and I/O capacity, and ideally, schedule the subscription creation during a low-activity period. - Why it works: By not copying data initially, the subscriber starts consuming WAL from the very beginning, ensuring no changes are missed. Manual synchronization guarantees data consistency.
- Diagnosis: You notice data missing on the subscriber that exists on the publisher, or the
-
Slot Not Dropped After Subscription Removal
- Diagnosis: You deleted a subscription using
DROP SUBSCRIPTION, but a corresponding logical slot still exists on the publisher and is consuming disk space for WAL files. - Cause: PostgreSQL does not automatically drop logical replication slots when a subscription is dropped. This is a safety feature to prevent accidental data loss if the slot was being used by a manual consumer or if the subscription was dropped in error.
- Fix: Manually drop the orphaned slot using
SELECT pg_drop_replication_slot('slot_name');on the publisher. - Why it works: This command explicitly removes the replication slot and allows its associated WAL files to be purged by
pg_wal_retain_conf(orarchive_modeif archiving is enabled).
- Diagnosis: You deleted a subscription using
-
WAL Disk Full on Publisher
- Diagnosis: The publisher database becomes unresponsive, or
pg_waldirectory fills up. Error logs will show "WAL disk full" or similar I/O errors. - Cause: A logical replication slot is holding onto WAL files that are no longer needed by the primary server for crash recovery but are needed by the subscriber. If the subscriber is down, stopped, or significantly lagging, these WAL files will accumulate indefinitely.
- Fix:
- Immediate: Free up disk space by temporarily disabling WAL archiving if enabled, or by manually deleting WAL files (highly risky, use with extreme caution and only if you understand WAL lifecycle). The correct fix is to get the subscriber back online and consuming WAL.
- Long-term: Ensure the subscriber is always running and consuming changes. Monitor
pg_replication_slotsforconfirmed_flush_lsnlag. Increasemax_wal_sizeandwal_keep_size(thoughwal_keep_sizeis a minimum, slots can override this). Ensure sufficient disk space for WAL.
- Why it works: Logical slots prevent WALs from being recycled. If the slot isn’t advancing (
confirmed_flush_lsnisn’t moving), WAL files will keep accumulating until the disk is full. Fixing the subscriber’s consumption or dropping the slot (if no longer needed) resolves the accumulation.
- Diagnosis: The publisher database becomes unresponsive, or
-
max_replication_slotsExceeded- Diagnosis: You cannot create new subscriptions or publications, and
pg_stat_activityshows processes waiting to create slots. Error logs will contain "maximum number of replication slots exceeded". - Cause: The
max_replication_slotsparameter inpostgresql.conflimits the total number of logical replication slots that can exist on the server. - Fix: Increase the
max_replication_slotsparameter inpostgresql.conf(e.g., from10to20) and reload the PostgreSQL configuration (pg_ctl reloadorSELECT pg_reload_conf();). - Why it works: This increases the server’s capacity to manage multiple replication streams, allowing new slots to be created.
- Diagnosis: You cannot create new subscriptions or publications, and
-
max_wal_sendersExceeded- Diagnosis: New subscriptions fail to connect or start streaming.
pg_stat_activitymight show manywalsenderprocesses. Error logs might indicate "too many replication clients already connected". - Cause:
max_wal_senderslimits the number of concurrent connections that can perform WAL streaming. Each logical replication slot requires onewalsenderprocess. - Fix: Increase the
max_wal_sendersparameter inpostgresql.conf(e.g., from10to20) and reload the PostgreSQL configuration. - Why it works: This allows more
walsenderprocesses to run concurrently, accommodating the connections needed for your replication slots.
- Diagnosis: New subscriptions fail to connect or start streaming.
-
Incorrect
pluginSpecified- Diagnosis: When creating a publication or subscription, you receive an error like "plugin 'myplugin' not found" or "replication slot 'myslot' uses unsupported plugin 'myplugin'".
- Cause: The
pluginspecified in theCREATE PUBLICATIONorCREATE SUBSCRIPTIONcommand (or implicitly bypgoutputfor logical replication) is not available or is misspelled. For standard logical replication, the plugin ispgoutput. - Fix: Ensure the correct plugin name is used. For standard logical replication, you don’t explicitly specify the plugin on the subscription side; it’s handled by the publisher’s publication and the
pgoutputplugin. If you’re using custom plugins, ensure they are installed correctly and accessible by the PostgreSQL server. - Why it works: The
pluginparameter tells PostgreSQL how to decode the WAL stream. Using the correct plugin ensures the WAL can be interpreted as intended by the replication system.
The next hurdle you’ll likely face is managing replication lag and ensuring data consistency across complex, multi-table, or multi-publication scenarios.