Postgres logical replication can start replicating before the initial snapshot is taken, creating a race condition that can lead to data divergence if not handled carefully.

Let’s see how this plays out.

Imagine we have a users table on a publisher database:

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    username VARCHAR(50) UNIQUE NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

INSERT INTO users (username) VALUES ('alice');
INSERT INTO users (username) VALUES ('bob');

Now, we set up a subscriber. On the publisher, we create a publication:

CREATE PUBLICATION my_publication FOR TABLE users;

On the subscriber, we create a subscription:

CREATE SUBSCRIPTION my_subscription
    CONNECTION 'host=publisher_host port=5432 dbname=publisher_db user=repl_user password=repl_password'
    PUBLICATION my_publication
    WITH (copy_data = true);

The WITH (copy_data = true) tells the subscriber to perform an initial data copy from the publisher. During this process, PostgreSQL on the publisher creates a logical replication slot. This slot is a pointer to a specific point in the Write-Ahead Log (WAL) from which the subscriber should start receiving changes.

Here’s the crucial part: the copy_data process reads the existing data from the users table. As it reads, it’s also advancing the replication slot’s LSN (Log Sequence Number) forward in the WAL. Meanwhile, new transactions are happening on the publisher:

-- On the publisher, after subscription creation
INSERT INTO users (username) VALUES ('charlie');

If the copy_data process is slow, or if there’s a lot of data, the replication slot’s LSN might advance significantly. When copy_data finishes, the subscriber then starts applying the WAL changes from the point the slot is currently at. If charlie was inserted after the copy_data process had already read past the WAL record for that insertion, the subscriber will never see that INSERT statement. The slot is already too far ahead.

This is why understanding slot configuration is key. The logical replication slot is the gatekeeper for your changes.

The Anatomy of a Slot

A logical replication slot is created automatically when a subscription is initiated with copy_data = true or when a publication is used by a standby server. You can inspect these slots on the publisher using:

SELECT slot_name, plugin, slot_type, database, active, active_pid, wal_status, confirmed_flush_lsn, restart_lsn, active_reason
FROM pg_replication_slots;
  • slot_name: The unique identifier for the slot.
  • plugin: The WAL decoding plugin being used (e.g., pgoutput for logical replication).
  • slot_type: logical for our use case.
  • database: The database the slot is associated with.
  • active: Whether a backend process is currently connected and consuming changes.
  • active_pid: The process ID of the backend consuming changes.
  • wal_status: reserved or streamed. reserved means the WAL is held but not yet consumed. streamed means it’s being actively sent.
  • confirmed_flush_lsn: The LSN up to which the subscriber has confirmed receipt and applied changes. This is the most critical LSN for determining how far behind the subscriber is.
  • restart_lsn: The LSN from which the slot will start sending changes. This is what advances as changes are consumed.
  • active_reason: Explains why the slot is active (e.g., initial_data_copy, streaming WAL).

Common Slot Configuration Pitfalls and Fixes

  1. Slot Lagging Too Far Behind (Race Condition)

    • Diagnosis: You notice data missing on the subscriber that exists on the publisher, or the confirmed_flush_lsn in pg_replication_slots is significantly older than the current WAL generation. The subscriber might also show no new changes are arriving.
    • Cause: The initial copy_data process on the subscriber consumed WAL faster than the publisher could generate it, or the publisher experienced high write load after the copy_data started but before it finished, causing the restart_lsn to jump ahead of the data already copied. Or, the subscriber process crashed/was stopped, and WAL kept accumulating on the publisher, but the confirmed_flush_lsn didn’t advance.
    • Fix: The most robust solution is to drop and recreate the subscription with copy_data = false. This avoids the initial snapshot and the associated race condition. You then manually synchronize the data (e.g., using pg_dump and pg_restore on the affected tables) and then start the subscription. If you must use copy_data, ensure the publisher has sufficient WAL disk space and I/O capacity, and ideally, schedule the subscription creation during a low-activity period.
    • Why it works: By not copying data initially, the subscriber starts consuming WAL from the very beginning, ensuring no changes are missed. Manual synchronization guarantees data consistency.
  2. Slot Not Dropped After Subscription Removal

    • Diagnosis: You deleted a subscription using DROP SUBSCRIPTION, but a corresponding logical slot still exists on the publisher and is consuming disk space for WAL files.
    • Cause: PostgreSQL does not automatically drop logical replication slots when a subscription is dropped. This is a safety feature to prevent accidental data loss if the slot was being used by a manual consumer or if the subscription was dropped in error.
    • Fix: Manually drop the orphaned slot using SELECT pg_drop_replication_slot('slot_name'); on the publisher.
    • Why it works: This command explicitly removes the replication slot and allows its associated WAL files to be purged by pg_wal_retain_conf (or archive_mode if archiving is enabled).
  3. WAL Disk Full on Publisher

    • Diagnosis: The publisher database becomes unresponsive, or pg_wal directory fills up. Error logs will show "WAL disk full" or similar I/O errors.
    • Cause: A logical replication slot is holding onto WAL files that are no longer needed by the primary server for crash recovery but are needed by the subscriber. If the subscriber is down, stopped, or significantly lagging, these WAL files will accumulate indefinitely.
    • Fix:
      • Immediate: Free up disk space by temporarily disabling WAL archiving if enabled, or by manually deleting WAL files (highly risky, use with extreme caution and only if you understand WAL lifecycle). The correct fix is to get the subscriber back online and consuming WAL.
      • Long-term: Ensure the subscriber is always running and consuming changes. Monitor pg_replication_slots for confirmed_flush_lsn lag. Increase max_wal_size and wal_keep_size (though wal_keep_size is a minimum, slots can override this). Ensure sufficient disk space for WAL.
    • Why it works: Logical slots prevent WALs from being recycled. If the slot isn’t advancing (confirmed_flush_lsn isn’t moving), WAL files will keep accumulating until the disk is full. Fixing the subscriber’s consumption or dropping the slot (if no longer needed) resolves the accumulation.
  4. max_replication_slots Exceeded

    • Diagnosis: You cannot create new subscriptions or publications, and pg_stat_activity shows processes waiting to create slots. Error logs will contain "maximum number of replication slots exceeded".
    • Cause: The max_replication_slots parameter in postgresql.conf limits the total number of logical replication slots that can exist on the server.
    • Fix: Increase the max_replication_slots parameter in postgresql.conf (e.g., from 10 to 20) and reload the PostgreSQL configuration (pg_ctl reload or SELECT pg_reload_conf();).
    • Why it works: This increases the server’s capacity to manage multiple replication streams, allowing new slots to be created.
  5. max_wal_senders Exceeded

    • Diagnosis: New subscriptions fail to connect or start streaming. pg_stat_activity might show many walsender processes. Error logs might indicate "too many replication clients already connected".
    • Cause: max_wal_senders limits the number of concurrent connections that can perform WAL streaming. Each logical replication slot requires one walsender process.
    • Fix: Increase the max_wal_senders parameter in postgresql.conf (e.g., from 10 to 20) and reload the PostgreSQL configuration.
    • Why it works: This allows more walsender processes to run concurrently, accommodating the connections needed for your replication slots.
  6. Incorrect plugin Specified

    • Diagnosis: When creating a publication or subscription, you receive an error like "plugin 'myplugin' not found" or "replication slot 'myslot' uses unsupported plugin 'myplugin'".
    • Cause: The plugin specified in the CREATE PUBLICATION or CREATE SUBSCRIPTION command (or implicitly by pgoutput for logical replication) is not available or is misspelled. For standard logical replication, the plugin is pgoutput.
    • Fix: Ensure the correct plugin name is used. For standard logical replication, you don’t explicitly specify the plugin on the subscription side; it’s handled by the publisher’s publication and the pgoutput plugin. If you’re using custom plugins, ensure they are installed correctly and accessible by the PostgreSQL server.
    • Why it works: The plugin parameter tells PostgreSQL how to decode the WAL stream. Using the correct plugin ensures the WAL can be interpreted as intended by the replication system.

The next hurdle you’ll likely face is managing replication lag and ensuring data consistency across complex, multi-table, or multi-publication scenarios.

Want structured learning?

Take the full Postgres course →