pgstream v0.5.0 update
Improved user experience with new transformers, YAML configuration, CLI refactoring and table filtering.
Author: Esther Minano Sanz
We're proud to announce the release of v0.5 of pgstream, our open-source CDC tool for Postgres! 🚀 In this blog post, we'll dive into some of the key features packed into this latest release, and look at what the future holds!
You can find the complete release notes on the GitHub v0.5.0 release page.
What is pgstream?
pgstream is an open source CDC (Change Data Capture) tool and library that offers Postgres replication support with DDL changes. Some of its key features include:
- Replication of DDL changes: schema changes are tracked and seamlessly replicated downstream alongside the data, avoiding manual intervention and data loss.
- Modular deployment configuration: pgstream's modular implementation allows it to be configured for simple use cases, removing unnecessary complexity and deployment challenges. However, it can also easily integrate with Kafka for more complex use cases.
- Out of the box supported targets:
- Postgres: replication to Postgres databases with support for schema changes and batch processing.
- Elasticsearch/Opensearch: replication to search stores with special handling of field IDs to minimise re-indexing.
- Webhooks: subscribe and receive webhook notifications whenever your source data changes.
- Snapshots: capture a consistent view of your Postgres database at a specific point in time, either as an initial snapshot before starting replication or as a standalone process when replication is not needed.
- Column transformations: modify column values during replication or snapshots, which is particularly useful for anonymizing sensitive data.
For more details on how pgstream works under the hood, check out the full documentation.
What's new?
This update focuses on improving the usability of pgstream, from adding new column transformers for added flexibility, to simplifying configuration management by introducing YAML support, and refining the CLI experience. Also, table filtering is finally here! Let's take a look at the main new features in detail.
🔐 Advanced data transformations
After the introduction of transformers in v0.4, in this release we continue the work towards improving the transformation capabilities of pgstream. Masking, phone number and literal transformers, dynamic parameter support, and transformation rules validation are now available. Let's dive a bit deeper into some of these new transformation features!
Masking
Instead of producing random or realistic data to anonymize sensitive information, you can now simply mask the data, or parts of it. Powered by the go-masker library, it comes with a predefined set of masking functions (password, name, address, email, mobile, telephone, id, credit_card, url), while also offering a custom function in which the user can define the level of masking/unmasking by providing either indexes or percentages (useful when fields are variable in length).
Example masking rules:
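For illustration, masking rules embedded in the transformation config might look like the sketch below. The key names are assumptions based on the transformer names above, not pgstream's exact schema; check the documentation for the real layout.

```yaml
# Illustrative only: key names are assumptions, not pgstream's exact schema.
transformations:
  table_transformers:
    - schema: public
      table: users
      column_transformers:
        email:
          name: masking
          parameters:
            type: email        # predefined masking function
        password:
          name: masking
          parameters:
            type: password     # predefined masking function
        api_token:
          name: masking
          parameters:
            type: custom       # custom masking; hypothetical bound parameters
            mask_begin: "25%"
            mask_end: "75%"
```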
Dynamic parameter support
Supported transformers can now use dynamic parameters, which allows them to define the transformation rules based on the values of different columns in the same row. This is particularly useful for complex transformations that depend on multiple fields.
In the following example, we have a users table with mobile_number and country_code columns. The phone number transformer will use the value of the country_code column to determine the prefix for the randomly generated mobile phone number.
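A sketch of what that rule could look like; the key names are illustrative assumptions, with the real schema documented in the pgstream docs:

```yaml
# Illustrative only: the phone number transformer takes its prefix
# per row from the country_code column of the same table.
transformations:
  table_transformers:
    - schema: public
      table: users
      column_transformers:
        mobile_number:
          name: phone_number
          parameters:
            prefix:
              dynamic_column: country_code   # resolved row by row
```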
Transformation rules validation
In order to ensure you don't accidentally forget to add a transformation rule for a column, which could lead to sensitive data leaks, pgstream transformation rules now expose a validation mode setting. The validation mode can be set to strict, relaxed or table_level.
- relaxed mode, which is the default, only validates the provided transformations, ensuring the configured transformers are compatible with the table column data types.
- strict mode checks the transformation rules against the source table schema and enforces the explicit mention of all columns. Not every column needs a transformation applied (this can be bypassed by using a noop transformer or leaving it unset), but each one must be explicitly mentioned in the configuration.
- table_level mode means validation is evaluated on a per-table basis, allowing you to use different validation modes for different tables.
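As a sketch (key names are illustrative assumptions), strict mode might be configured like this, with every column of the table listed even when left untransformed:

```yaml
# Illustrative only: strict validation requires every column to be mentioned.
transformations:
  validation_mode: strict
  table_transformers:
    - schema: public
      table: users
      column_transformers:
        email:
          name: masking
          parameters:
            type: email
        id: {}          # explicitly listed, no transformer applied (noop)
        created_at: {}  # same: mentioned so strict validation passes
```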
For more details on the new transformers, check out the supported transformers section in the pgstream documentation.
For more details on how to set up and use transformers with pgstream, check out the transformers tutorial.
📜 YAML configuration
In this release, we have added support for YAML configuration files. This allows you to define the pgstream configuration in a more human-readable format, making it easier to manage and share your configurations. The transformation rules are embedded into the same configuration file, simplifying the configuration setup. Environment variables are still supported, but cannot be combined with the YAML configuration.
For more details on how to set up and use YAML configuration files with pgstream, check out the configuration documentation.
🧰 Command-Line Interface (CLI) Refactoring
We decided to spend a bit of time on the CLI, and refactor it to improve the user experience. The new CLI is more intuitive and user-friendly, making it easier to configure and run pgstream.
- Flags have been added to all commands, removing the need to provide a configuration file. This allows you to quickly set up pgstream without needing to create a configuration file, making it easier to get started. It relies on default values for most of the configuration.
- The snapshot command is now separate from the run replication command. This allows you to run snapshots independently of replication, making it more straightforward to manage your snapshot workflows (e.g. running a snapshot as a nightly job).
- A status command has been added to validate the pgstream configuration and initialisation.
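Put together, a session with the refactored CLI could look roughly like the following. The flag names are assumptions for illustration, so consult `pgstream --help` for the real ones:

```sh
# Illustrative only: flag names are assumptions.
# One-off snapshot, e.g. as a nightly job:
pgstream snapshot --source "postgres://localhost:5432/source_db" \
                  --target "postgres://localhost:5432/target_db"

# Validate the configuration and initialisation:
pgstream status -c pgstream.yaml

# Start replication:
pgstream run -c pgstream.yaml
```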
For more details on how to use the new CLI, check out the usage documentation or our tutorials section.
🔍 Table level filtering
We recently received some community feedback requesting table level filtering. Up until now, the only way of achieving this was to use pgstream as a library. In this release we finally added this feature, allowing you to specify which tables to include or exclude from the replication process, giving you more control over the data that is replicated when using the CLI. You can provide the configuration as part of the modifiers section in the new YAML configuration file, or as part of the environment variables.
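In YAML, the filter could be sketched like this (key names are illustrative assumptions; see the configuration documentation for the exact schema):

```yaml
# Illustrative only: include/exclude tables under the modifiers section.
modifiers:
  filter:
    include_tables:
      - public.users
      - public.orders
    exclude_tables:
      - public.audit_log   # never replicated
```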
For more details about the new table filtering configuration, check out the configuration documentation.
Conclusion
With the latest features discussed in this blogpost, you can build robust, compliant, and efficient data workflows. Whether you're replicating data to downstream systems, anonymizing sensitive information, or creating snapshots, pgstream has the tools you need.
If you have any suggestions or questions, you can reach out to us on Discord or follow us on X / Twitter or Bluesky. We welcome any feedback in issues, or contributions via pull requests! 💜
Ready to get started? Check out the pgstream documentation for more details.
Related Posts
pgstream v0.4.0: Postgres-to-Postgres replication, snapshots & transformations
Learn how the latest features in pgstream refine Postgres replication with near real-time data capture, consistent snapshots, and column-level transformations.
Introducing pgstream: Postgres replication with DDL changes
Today we’re excited to expand our open source Postgres platform with pgstream, a CDC command line tool and library for PostgreSQL with replication support for DDL changes to any provided output.
Postgres Cafe: Solving schema replication gaps with pgstream
In this episode of Postgres Café, we discuss pgstream, an open-source tool for capturing and replicating schema and data changes in PostgreSQL. Learn how it solves schema replication challenges and enhances data pipelines.
Postgres webhooks with pgstream
A simple tutorial for calling webhooks on Postgres data and schema changes using pgstream.