pgstream is an open source CDC command-line tool and library that offers Postgres replication support with DDL changes to any provided target.
Features
- Schema change tracking and replication of DDL changes
- Support for multiple out-of-the-box targets
  - Elasticsearch/OpenSearch
  - Webhooks
  - PostgreSQL
- Initial and on demand PostgreSQL snapshots (for when you don’t need continuous replication)
- Column value transformations (anonymise your data on the go!)
- Modular deployment configuration, only requires Postgres
- Kafka support with schema based partitioning
- Extendable support for custom targets
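To make the column value transformation feature more concrete, here is a minimal Python sketch of what anonymising selected columns of a row could look like. It is not pgstream's implementation or configuration format: the `anonymise_row` function, the row layout and the choice of a truncated SHA-256 digest are all assumptions made for illustration.

```python
import hashlib
from typing import Any, Dict, Set


def anonymise_row(row: Dict[str, Any], columns: Set[str]) -> Dict[str, Any]:
    """Return a copy of `row` with the selected column values replaced by a digest.

    Hypothetical helper: pgstream's own transformers are configured differently.
    """
    out = dict(row)  # copy, so the source row is left untouched
    for name in columns:
        if name in out and out[name] is not None:
            digest = hashlib.sha256(str(out[name]).encode("utf-8")).hexdigest()
            out[name] = digest[:16]  # deterministic, non-reversible placeholder
    return out


# Example row; the column names are made up for this sketch.
row = {"id": 1, "email": "jane@example.com", "plan": "pro"}
masked = anonymise_row(row, {"email"})
print(masked)
```

Because the digest is deterministic, the same input value always maps to the same masked value, which keeps joins on the transformed column consistent across tables.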
Usage
pgstream can be used via the readily available CLI or as a library. For detailed information about the CLI usage, check out the dedicated CLI documentation section.
CLI Installation
Environment setup
If you have an environment available, with at least Postgres and whichever resources you’re planning on running, then you can skip this step. Otherwise, a docker setup is available in this repository with profiles that selectively start Postgres, Kafka and OpenSearch. You can run all profiles at once or only the one you need, for example the pg2pg profile. The supported docker profiles are:
- pg2pg
- pg2os
- pg2webhook
- kafka
Configuration
Pgstream source and target need to be configured appropriately before the commands can be run. This can be done:
- Using the relevant CLI flags for each command
- Using a yaml configuration file
- Using environment variables (.env file supported)
Run pgstream
Replication mode
The run command starts streaming data from the configured source into the configured target. By passing the --init flag to the run command, pgstream will initialise the pgstream state in the source Postgres database before starting replication. It will:
- Create a pgstream schema
- Create tables/functions/triggers to keep track of schema changes for DDL replication (see Tracking schema changes for more details)
- Create a replication slot
The initialisation can also be performed by running the pgstream init command separately before pgstream run. Check out the CLI documentation for more details.
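As a rough illustration of what the initialisation steps amount to on the Postgres side, the Python sketch below builds the kind of SQL statements involved. It is an assumption-laden sketch, not pgstream's actual code: the `init_statements` helper and the slot name are hypothetical, and the schema tracking tables, functions and triggers that pgstream creates are not reproduced here. `pg_create_logical_replication_slot` is the standard Postgres function for creating a logical slot, and `wal2json` is the output plugin pgstream supports.

```python
from typing import List, Tuple


def init_statements(slot_name: str) -> List[Tuple[str, tuple]]:
    """Return (sql, params) pairs approximating the initialisation steps.

    Hypothetical helper for illustration only; placeholders use the
    `%s` style accepted by common Python Postgres drivers.
    """
    return [
        # Step 1: a dedicated schema to hold the pgstream state.
        ("CREATE SCHEMA IF NOT EXISTS pgstream", ()),
        # Step 3: a logical replication slot using the wal2json plugin.
        (
            "SELECT pg_create_logical_replication_slot(%s, %s)",
            (slot_name, "wal2json"),
        ),
    ]


# The slot name below is made up for this sketch.
statements = init_statements("pgstream_example_slot")
for sql, params in statements:
    print(sql, params)
```

In practice the --init flag or the pgstream init command takes care of these steps, so running such statements by hand should not be necessary.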
Example running pgstream replication from Postgres -> OpenSearch:
An initial snapshot can be performed before replication starts by using the --snapshot-tables flag or by setting the relevant configuration fields (check the configuration documentation for more details on advanced configuration options).
Example running pgstream with PostgreSQL -> PostgreSQL with initial snapshot enabled:
Snapshot mode
pgstream can also be used to perform a point-in-time snapshot of the source database. This is helpful if you don’t require continuous replication but want to keep the source and target in sync, for example by running nightly snapshots. The snapshot command doesn’t require any initialisation or pgstream specific state, since it only performs read operations on the source Postgres database.
Example running pgstream to perform a snapshot from PostgreSQL -> PostgreSQL:
Tutorials
- PostgreSQL replication to PostgreSQL
- PostgreSQL replication to OpenSearch
- PostgreSQL replication to webhooks
- PostgreSQL replication using Kafka
- PostgreSQL snapshots
- PostgreSQL column transformations
Documentation
For more advanced usage, implementation details, and detailed configuration settings, please refer to the full documentation below.
Benchmarks
Snapshots
Datasets used: IMDB database, MusicBrainz database, Firenibble database.
All benchmarks were run using the same setup, with pgstream v0.7.2, pg_dump/pg_restore (PostgreSQL) 17.4 and PostgreSQL 17.4, using identical resources to ensure a fair comparison.
For more details on performance benchmarking for snapshots to PostgreSQL with pgstream, check out this blog post.
Limitations
Some of the limitations of the initial release include:
- Single Kafka topic support
- Postgres plugin support limited to wal2json
- No row level filtering support
- Primary key/unique not null column required for replication
- Kafka serialisation support limited to JSON
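The Kafka-related limitations above (single topic, JSON serialisation) can be pictured together with schema-based partitioning in a short Python sketch. This is a sketch under stated assumptions rather than pgstream's implementation: the `to_kafka_record` helper and the event field names are hypothetical. It relies on the general Kafka behaviour that records with the same key are routed to the same partition by the default partitioner.

```python
import json
from typing import Any, Dict, Tuple


def to_kafka_record(event: Dict[str, Any]) -> Tuple[bytes, bytes]:
    """Turn a change event into a (key, value) pair for a single topic.

    Hypothetical helper: the key is the schema name, so all events for
    one schema share a partition and keep their relative order.
    """
    key = event["schema"].encode("utf-8")
    # JSON is the only serialisation format listed as supported.
    value = json.dumps(event, sort_keys=True).encode("utf-8")
    return key, value


# The event layout below is made up for this sketch.
event = {"schema": "public", "table": "users", "action": "I", "columns": {"id": 1}}
key, value = to_kafka_record(event)
print(key, value)
```

Keying by schema trades parallelism for ordering: events within a schema stay ordered, while different schemas can be consumed in parallel.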