Database replication is fundamentally about enabling a database to keep multiple copies of its data synchronized across different locations or nodes. The "streaming," "logical," and "physical" distinctions refer to how that synchronization happens, each with its own trade-offs in terms of performance, flexibility, and complexity.
Let’s dive into how this looks in practice. Imagine we have a primary PostgreSQL database and we want to set up a read replica.
Physical Replication: The "Bit-by-Bit" Copy
Physical replication, often called "streaming replication" in PostgreSQL, works by sending WAL (Write-Ahead Log) records from the primary to the replica. WAL records are the detailed, low-level transaction logs that the primary database uses to ensure durability. The replica then replays these WAL records to reconstruct the exact same state as the primary.
Here’s a simplified setup for PostgreSQL physical replication.
On the primary server (pg_primary):
-
Configure
postgresql.conf:wal_level = replica max_wal_senders = 3 wal_keep_size = 102400 # Keep 100MB of WAL files archive_mode = on archive_command = 'cp %p /path/to/archive/%f' # Or use rsync, etc.wal_level = replica: This is crucial. It tells PostgreSQL to log enough information in the WAL to allow for replication.minimalis not enough.max_wal_senders: The number of simultaneous replication connections the primary can handle.wal_keep_size: Ensures that WAL files needed by the replica are not removed too quickly.archive_modeandarchive_command: While not strictly required for streaming replication, archiving is essential for PITR (Point-In-Time Recovery) and for setting up a replica from a base backup.
-
Configure
pg_hba.conf:host replication replicator 192.168.1.10/32 md5 host replication replicator 192.168.1.11/32 md5- This grants the
replicatoruser (which we’ll create) permission to connect for replication purposes from specific IP addresses.
- This grants the
-
Create a Replication User:
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'your_replication_password'; -
Take a Base Backup: This is how the replica gets its initial copy of the data. You can use
pg_basebackup.pg_basebackup -h pg_primary -U replicator -D /var/lib/postgresql/14/main_replica -P -v -R-h: Hostname of the primary.-U: Replication user.-D: Destination directory for the replica’s data.-P: Show progress.-v: Verbose.-R: This is magic! It creates thestandby.signalfile and thepostgresql.auto.confwith replication connection details automatically in the replica’s data directory.
On the replica server (pg_replica), after running pg_basebackup (which copies data and sets up basic replication config):
-
Ensure
postgresql.confis set for a replica: The-Rflag inpg_basebackupshould have created apostgresql.auto.conffile in the replica’s data directory that looks something like this:primary_conninfo = 'user=replicator password=your_replication_password host=pg_primary port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any' primary_slot_name = 'your_slot_name' # Optional but recommendedYou’ll also need
hot_standby = onin yourpostgresql.confto allow read queries on the replica. -
Create a Replication Slot (Recommended): On the primary server:
SELECT pg_create_physical_replication_slot('your_slot_name');- Replication slots ensure that the primary server never removes WAL files that the replica still needs, even if the replica is temporarily disconnected. This prevents the replica from falling too far behind and needing a full base backup again.
How it works: The primary writes changes to WAL. The replica connects to the primary using primary_conninfo, requests WAL records, and applies them. If hot_standby = on, the replica can serve read queries while applying WAL.
Logical Replication: The "SQL Statement" or "Row Change" Copy
Logical replication is more granular. Instead of sending raw WAL records, it decodes them into logical changes (e.g., "row X was updated," "row Y was inserted") and sends these logical changes to the replica. This allows for more flexibility: you can replicate specific tables, transform data, or even replicate between different major versions of PostgreSQL (though this can be tricky).
Here’s a simplified setup for PostgreSQL logical replication.
-
On the Primary (
pg_primary):-
postgresql.conf:wal_level = logical max_replication_slots = 10 max_wal_senders = 10wal_level = logical: This is the key. It enables the decoding of WAL into logical changes.max_replication_slotsandmax_wal_senders: Similar to physical replication, but for logical decoding workers.
-
pg_hba.conf:host replication replicator 192.168.1.10/32 md5(Same as physical, assuming the same user).
-
Create a Publication: A publication defines what data to replicate.
CREATE PUBLICATION my_publication FOR TABLE users, products; -- Or for all tables: CREATE PUBLICATION my_publication FOR ALL TABLES;- This tells the primary to generate logical changes for
INSERT,UPDATE, andDELETEoperations on theusersandproductstables.
- This tells the primary to generate logical changes for
-
-
On the Replica (
pg_replica):- Ensure tables exist: The schema for the replicated tables must already exist on the replica. Logical replication does not replicate DDL (like
CREATE TABLE). You’d typically set up the schema first, then start replication. - Create a Subscription: A subscription tells the replica where to get the data from and what publication to subscribe to.
CREATE SUBSCRIPTION my_subscription CONNECTION 'host=pg_primary user=replicator password=your_replication_password dbname=mydatabase' PUBLICATION my_publication;CONNECTION: Specifies how to connect to the primary.PUBLICATION: Links this subscription to the publication created on the primary.
- Ensure tables exist: The schema for the replicated tables must already exist on the replica. Logical replication does not replicate DDL (like
How it works: The primary creates a logical decoding worker that reads WAL, decodes it into logical change events, and sends them over the network to the replica. The replica’s subscription process receives these events and applies them as SQL statements (or row changes) to its own database.
The Core Difference: What is Transmitted?
-
Physical Replication: Transmits WAL records (the raw, low-level transaction logs). This is a byte-for-byte copy.
- Pros: Simpler to set up for a full replica, high performance, can replicate everything (including DDL, schema changes, sequences).
- Cons: Less flexible. You can’t selectively replicate tables. You can’t easily transform data. Replicas are exact copies.
-
Logical Replication: Transmits decoded logical changes (e.g.,
INSERTof specific rows,UPDATEof specific columns).- Pros: Highly flexible. Can replicate specific tables, schemas, or even filter data. Can replicate between different database versions (with caveats). Allows for more complex topologies (e.g., multi-master, though this is advanced).
- Cons: More complex to manage. Doesn’t replicate DDL automatically. Can have performance overhead due to decoding. Requires careful schema management on the replica.
The "Streaming" Aspect
"Streaming" in database replication refers to the continuous, near real-time flow of data from the primary to the replica. Both physical and logical replication can be "streaming."
- Physical Streaming Replication: WAL records are streamed as they are generated.
- Logical Streaming Replication: Logical changes are decoded and streamed as they are generated.
The term "streaming replication" is often used synonymously with "physical replication" in PostgreSQL because that’s its primary use case. However, the underlying mechanism of continuous data flow can apply to both.
The next concept you’ll grapple with is multi-master replication, where multiple nodes can accept writes, and the complexity of conflict resolution that arises from it.