Today we’re excited to expand our open source Postgres platform with pgstream, a CDC command line tool and library for PostgreSQL with replication support for DDL changes to any provided output.
Written by
Esther Minano Sanz
Published on
July 17, 2024
At Xata, Postgres takes centre stage. And while it is our main database, we also offer other features that require us to extend its reach while keeping the data in sync. A good example of this is our full-text search feature, which enables the use of Elasticsearch on top of Postgres. To keep these two datastores in sync, we capture and identify data and schema changes in Postgres and push these modifications downstream to Elasticsearch with minimal latency. This is often referred to as CDC (Change Data Capture).
So now you know our use case, but why did we build our own replication tool? There's many established solutions out there, but we had very specific requirements, which included support for continuous tracking of schema changes (DDL). This was something that existing tooling didn't support at the time. Database schemas have a tendency to change over time - if your CDC tool doesn't support replicating them, you risk data loss and manual intervention to fix your pipeline. There had to be a better way!
We also wanted a solution that was easy to deploy and operate for both big and small setups, which isn't always the case for existing tooling.
And so pgstream was born! pgstream is an open source CDC command-line tool and library. Some of its key features include:
Internally, pgstream is constructed as a streaming pipeline, where data from one module streams into the next, eventually reaching the configured output plugins. pgstream keeps track of schema changes and replicates them alongside the data changes to maintain a consistent view of the source data downstream. This modular approach makes adding and integrating output plugin implementations simple and painless.
One of the main differentiators of pgstream is that it tracks and replicates schema changes automatically. How? It relies on SQL triggers that will populate a Postgres table (pgstream.schema_log
) containing a history log of all DDL changes for a given schema. Whenever a schema change occurs, this trigger creates a new row in the schema log table with the schema encoded as a JSON value. This table tracks all the schema changes, forming a linearised change log that is then parsed and used within the pgstream pipeline to identify modifications and push the relevant changes downstream.
The schema and data changes are part of the same linear stream - the downstream consumers always observe the schema changes as soon as they happen, before any data arrives that relies on the new schema. This prevents data loss and manual intervention.
Disclaimer: There are a lot of references in this section to the WAL (Write Ahead Logging). It refers to a sequential record of all changes made to a database, and a key component to Postgres replication.
Now, let's dive a little deeper into the stream!
At a high level, the internal implementation is split into WAL listeners and WAL processors.
A listener is anything that listens to WAL data, regardless of the source. It has a single responsibility: consume and manage the WAL events, delegating the processing of those entries to modules that form the processing pipeline. Depending on the listener implementation, it might be required to also have a checkpointer to flag the events as processed once the processor is done.
There are currently two implementations of the listener:
A processor processes a WAL event. Depending on the implementation it might also be required to checkpoint the event once it's done processing it as described above.
There are currently three implementations of the processor:
pgstream
), where the schema log history is kept for use within the search store (i.e. read queries).In addition to the implementations described above, there's an optional processor decorator, the translator, that injects some of the pgstream logic into the WAL event. This includes:
pgstream.schema_log
table.This is only the beginning! We plan to continue developing pgstream and exploring how it can make it easier to replicate data.
Here are some of the items in our development pipeline:
pgoutput
)avro
)postgres
,clickhouse
, snowflake
)
We are excited to share this with you and look forward to your feedback! Want to see pgstream in action? Check out this quick demo video.If you have any suggestions or questions, please open an issue in our GitHub repo,
reach out to us on Discord or follow us on X / Twitter. We'd love
to hear from you and keep you up to date with the latest progress on pgstream
.
You can follow along this week with a new announcement every day on our launch week page, join in on our summer hackathon or just pop into Discord and say hi 👋
Xata provides the best free plan in the industry. It is production ready by default and doesn't pause or cool-down. Take your time to build your business and upgrade when you're ready to scale.
Copyright © 2024 Xatabase Inc.
All rights reserved.