Data Privacy & Anonymization for Postgres

Your business handles PII and PHI. Development and staging branches should never contain raw sensitive data. Xata applies anonymization at the replication layer. Every branch starts clean, compliant, and production-realistic without manual scrubbing or custom scripts.

The privacy spectrum

Choose the right level of protection for your data, from lightweight redaction to full HIPAA-grade anonymization.

Column redaction (Basic)

Drop sensitive columns entirely. Simplest option when the data is not needed downstream.

Safe masking (Basic)

Replace values with fixed-length placeholders. Preserves column structure, removes content.

Synthetic data (Moderate)

Format-preserving fake data via Mimesis. Emails look like emails, phones look like phones, but none are real.

Deterministic hashing (Moderate)

SHA-2 hashing preserves referential integrity across tables. The same input always produces the same hash, so joins still work.
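
As a minimal sketch of why joins survive hashing, the snippet below hashes a shared key in two tables. It assumes SHA-256 as the SHA-2 variant; the salt, table names, and column names are hypothetical, and this is not pgstream's implementation.

```python
import hashlib

# Hypothetical secret salt; in practice it would come from a secrets manager.
SALT = b"example-salt"

def hash_value(value: str) -> str:
    """Return a deterministic SHA-256 hex digest for a sensitive value."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()

# Two hypothetical tables that share an email address as the join key.
users = [{"email": "ann@example.com", "plan": "pro"}]
orders = [{"email": "ann@example.com", "total": 42}]

anon_users = [{**row, "email": hash_value(row["email"])} for row in users]
anon_orders = [{**row, "email": hash_value(row["email"])} for row in orders]

# The join still matches because both sides hash to the same digest.
joined = [
    (u["plan"], o["total"])
    for u in anon_users
    for o in anon_orders
    if u["email"] == o["email"]
]
```

The salt is an assumption added here to make dictionary attacks on low-entropy values harder; it must stay constant across tables or the join keys diverge.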

k-Member microaggregation (Advanced)

Share data with confidence by preventing re-identification attacks. Groups similar records into clusters of k members and replaces quasi-identifier values with fair aggregates. Geographic-aware for ZIP codes.

HIPAA Expert Determination (Regulatory)

Our anonymization pipeline is designed to meet, and even exceed, HIPAA Expert Determination requirements. Probabilistic risk evaluations and distortion metrics provide the evidence an expert needs to certify compliance.

Why anonymize

Protect privacy. Your business handles PII (Personally Identifiable Information) and PHI (Protected Health Information). Branches should never contain raw sensitive data.

Stay compliant. GDPR, HIPAA, CPRA, and PCI require that non-production environments do not contain identifiable data. Anonymization at replication time ensures compliance by default.

Avoid costly mistakes. Real email addresses or phone numbers in test data lead to embarrassing outbound messages and costly breach notifications.

Open source foundation. Anonymization is implemented in pgstream (Apache 2.0). Full flexibility in deployment, transformers, and configuration.

Deterministic transformers. The same input always produces the same output. Foreign keys, joins, and relational integrity are preserved across anonymized tables.

Extensible transformers. Use transformers from Greenmask, NeoSync, and go-masker, or implement custom transformations in Go. Even the most complex requirements can be met.

The 5-stage pipeline

Anonymization happens during replication from production to the staging replica. Branches inherit already-anonymized data.

1. Entity recognition

Presidio-based PII detection classifies columns as Direct Identifiers (DIDs), Quasi-Identifiers (QIDs), or Safe. Covers all 18 HIPAA identifier families automatically.
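
The three-way classification can be illustrated with a simplified stand-in. The sketch below does not call Presidio; it uses two regular expressions and a list of column-name hints, all hypothetical, to label columns as DID, QID, or Safe.

```python
import re

# Simplified patterns standing in for a real PII recognizer (assumption).
DID_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}
# Hypothetical column-name hints for quasi-identifiers.
QID_NAME_HINTS = ("zip", "age", "birth", "gender")

def classify_column(name: str, samples: list[str]) -> str:
    """Classify a column as 'DID', 'QID', or 'Safe' from its name and sample values."""
    non_empty = [s for s in samples if s]
    for pattern in DID_PATTERNS.values():
        hits = sum(bool(pattern.match(s)) for s in non_empty)
        # Treat the column as a direct identifier if most samples match a pattern.
        if non_empty and hits / len(non_empty) >= 0.8:
            return "DID"
    if any(hint in name.lower() for hint in QID_NAME_HINTS):
        return "QID"
    return "Safe"

columns = {
    "contact": ["ann@example.com", "bob@example.org"],
    "zip_code": ["94107", "10001"],
    "order_total": ["19.99", "5.00"],
}
labels = {name: classify_column(name, values) for name, values in columns.items()}
```

A production recognizer would cover far more entity types and use context beyond regular expressions; the 0.8 match threshold is an arbitrary choice for this sketch.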

2. Risk assessment

Monte Carlo simulation of linkage attacks. Generates 50 synthetic adversary datasets and computes worst-case re-identification probability (Phisher K-Threshold).
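
To make the idea of a simulated linkage attack concrete, here is a minimal sketch. It assumes each simulated adversary knows the quasi-identifiers of a few individuals and that a match among n indistinguishable records is correct with probability 1/n. The function name, sample size, and data are hypothetical, and this is not the metric Xata computes.

```python
import random
from collections import Counter

def linkage_risk(released, population, trials=50, sample_size=3, seed=0):
    """Estimate worst-case re-identification probability over simulated adversaries."""
    rng = random.Random(seed)
    # Size of each equivalence class (records sharing the same quasi-identifiers).
    class_sizes = Counter(released)
    worst = 0.0
    for _ in range(trials):
        # Each simulated adversary holds quasi-identifiers for a few individuals.
        adversary = rng.sample(population, sample_size)
        for qids in adversary:
            matches = class_sizes.get(qids, 0)
            if matches:
                # Linking to one of n identical records succeeds with probability 1/n.
                worst = max(worst, 1.0 / matches)
    return worst

# Quasi-identifier tuples: (age, ZIP code).
raw = [(34, "94107"), (35, "94107"), (36, "94110"),
       (61, "10001"), (62, "10001"), (63, "10002")]
# The same records after generalizing to (age band, ZIP prefix).
generalized = [("30-39", "941")] * 3 + [("60-69", "100")] * 3

risk_before = linkage_risk(raw, raw)                  # every record is unique
risk_after = linkage_risk(generalized, generalized)   # classes of three records
```

For simplicity the adversary's records are expressed in the same generalization as the release; a fuller simulation would generalize the adversary's raw knowledge before matching.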

3. DID suppression

Direct identifiers are treated per column: redact (drop), mask (fixed-length placeholder), fake (format-preserving synthetic data), or hash (SHA-2).
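
The four per-column treatments can be sketched as a small policy table. This is an illustration under stated assumptions: the column names and policy are hypothetical, SHA-256 is assumed as the SHA-2 variant, and the fake email generator uses Python's random module as a stand-in for Mimesis.

```python
import hashlib
import random

DROP = object()  # sentinel: remove the column from the output row

def redact(value):
    """Drop the column entirely."""
    return DROP

def mask(value, width=8):
    """Replace the value with a fixed-length placeholder."""
    return "*" * width

def fake_email(value):
    """Generate a synthetic email that keeps the email format (stand-in for Mimesis)."""
    user = "".join(random.choices("abcdefghijklmnopqrstuvwxyz0123456789", k=8))
    return f"{user}@example.com"

def sha2_hash(value):
    """Hash the value with SHA-256."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# Hypothetical per-column policy.
POLICY = {"ssn": redact, "name": mask, "email": fake_email, "customer_id": sha2_hash}

def suppress(row):
    """Apply the configured treatment to each direct identifier in a row."""
    clean = {}
    for column, value in row.items():
        treated = POLICY.get(column, lambda v: v)(value)
        if treated is not DROP:
            clean[column] = treated
    return clean

row = {"ssn": "123-45-6789", "name": "Ann Lee", "email": "ann@corp.test",
       "customer_id": "c-1001", "plan": "pro"}
clean = suppress(row)
```

Columns without a policy entry pass through unchanged, which is where quasi-identifiers would be handed to the next stage.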

4. k-Member microaggregation

Quasi-identifiers are grouped into clusters of k similar records using ANN search. Values are replaced with fair aggregates (randomized median/mode) that prevent statistical bias.
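
A one-dimensional sketch of microaggregation follows. It swaps in two simplifications that should be read as assumptions: sorting replaces ANN search as the way to find similar records, and a plain median replaces the randomized median/mode described above.

```python
from statistics import median

def microaggregate(values, k):
    """Replace each value with the median of its cluster of at least k neighbors."""
    if k < 1 or len(values) < k:
        raise ValueError("need at least k values")
    # Sorting places similar values next to each other (a stand-in for ANN search).
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = list(values)
    start = 0
    while start < len(order):
        end = start + k
        # Fold a short tail into the current cluster so no cluster has fewer than k members.
        if len(order) - end < k:
            end = len(order)
        cluster = order[start:end]
        aggregate = median(values[i] for i in cluster)
        for i in cluster:
            result[i] = aggregate
        start = end
    return result

ages = [23, 67, 25, 64, 24, 70, 66]
anonymized = microaggregate(ages, k=3)
# anonymized == [24, 66.5, 24, 66.5, 24, 66.5, 66.5]
```

Every output value is shared by at least k records, so no single record can be singled out by this column alone.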

5. Distortion analysis

Measures information loss. Column-level distribution comparison (frequency charts, KDE). Inter-column relationship preservation via NPMI (Normalized Pointwise Mutual Information).
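
NPMI for a pair of values x and y is PMI divided by -log p(x, y), where PMI = log(p(x, y) / (p(x) p(y))). The sketch below computes it for every observed value pair of two columns; the column contents are hypothetical and the function is an illustration, not Xata's implementation.

```python
import math
from collections import Counter

def npmi(pairs):
    """Return NPMI for each observed (x, y) value pair of two columns."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    scores = {}
    for (x, y), count in joint.items():
        p_xy = count / n
        if p_xy == 1.0:
            # A single constant pair makes NPMI undefined (0/0); report 0 by convention.
            scores[(x, y)] = 0.0
            continue
        pmi = math.log(p_xy / ((left[x] / n) * (right[y] / n)))
        scores[(x, y)] = pmi / -math.log(p_xy)
    return scores

# Hypothetical (diagnosis, department) column pair with a perfect association.
original = [("flu", "gp"), ("flu", "gp"), ("fracture", "ortho"), ("fracture", "ortho")]
scores = npmi(original)
```

Scores range from -1 (values never co-occur) through 0 (independent) to 1 (values always co-occur). Comparing the scores computed on original and treated data gives a rough picture of how much inter-column structure survived.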

HIPAA compliance

Xata provides the quantitative evidence an expert needs to certify that data has been de-identified under HIPAA's Expert Determination method (§164.514(b)(1)).

No hardcoded thresholds. You configure the k-target for microaggregation. The system computes before-and-after risk scores and provides the metrics. Not a binary pass/fail, but the quantitative evidence a qualified statistical expert uses to certify compliance.

All 18 HIPAA identifier families detected. The entity recognition stage automatically classifies columns against all HIPAA-defined direct identifiers, from names and SSNs to biometric identifiers and vehicle serial numbers.

Safe Harbor support. For organizations using the Safe Harbor method (§164.514(b)(2)), the system detects and treats all 18 HIPAA identifier families through complete DID suppression.

Distortion analysis included. After anonymization, NPMI (Normalized Pointwise Mutual Information) measures how well inter-column relationships are preserved. Frequency charts and KDE compare original vs. treated distributions column by column.

18 HIPAA Direct Identifier Families

Names
Geographic data (below state)
Dates (except year)
Phone numbers
Fax numbers
Email addresses
Social Security numbers
Medical record numbers
Health plan beneficiary numbers
Account numbers
Certificate/license numbers
Vehicle identifiers
Device identifiers
Web URLs
IP addresses
Biometric identifiers
Full-face photos
Any other unique identifying number

Powered by pgstream

pgstream is an open source project for PostgreSQL data replication and transformation. It uses PostgreSQL logical replication (including DDL statements) and parallel snapshotting for maximum throughput.

Schedule pgstream via CI/CD to create nightly anonymized snapshots of production. Snapshots can be taken from a read-replica to minimize production impact.

pgstream integrates with transformer libraries including Greenmask, NeoSync, and go-masker. Custom transformations can be written in Go.

Deterministic and realistic transformers

Deterministic transformers ensure the same input always produces the same output. Foreign keys, joins, and relational data stay consistent across anonymized tables.

Realistic transformers generate format-preserving synthetic data. Anonymized emails look like emails, anonymized phone numbers look like phone numbers. Downstream systems and tests work without modification.
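
A transformer can be both deterministic and format-preserving. The sketch below shows one way to do that for phone numbers, assuming an HMAC-SHA-256 keyed digest drives the replacement digits; the key and function name are hypothetical, and this is not one of pgstream's built-in transformers.

```python
import hashlib
import hmac

KEY = b"example-key"  # hypothetical secret; keep it stable across runs

def fake_phone(phone: str) -> str:
    """Replace every digit deterministically while keeping the punctuation layout."""
    digest = hmac.new(KEY, phone.encode("utf-8"), hashlib.sha256).digest()
    out = []
    position = 0
    for ch in phone:
        if ch.isdigit():
            # Derive each replacement digit from the keyed digest of the input.
            out.append(str(digest[position % len(digest)] % 10))
            position += 1
        else:
            out.append(ch)
    return "".join(out)

first = fake_phone("+1 (555) 123-4567")
second = fake_phone("+1 (555) 123-4567")
```

Because the output depends only on the key and the input, the same phone number maps to the same fake value in every table, while spaces, parentheses, and dashes stay where downstream validators expect them.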

Example of original and anonymized values:

email: johndoe@gmail.com -> a2asd112@example.com
phone: +1 (555) 123-4567 -> +1 (555) 987-6543
username: johndoe92 -> a2asd112

Data subsetting (coming soon)

For multi-terabyte production databases, create smaller yet representative staging datasets. pgstream follows foreign key relationships to maintain referential integrity across subsetted tables.

Request 5% of the orders table, and the subsetting logic automatically filters related users, products, and payments tables to maintain a consistent, connected dataset.
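
Since subsetting is not yet released, the following is only a sketch of the idea: sample a fraction of a root table, then keep the parent rows those samples reference. The table shapes and the function name are hypothetical, and the payments table is omitted for brevity.

```python
import random

def subset(orders, users, products, fraction, seed=0):
    """Sample a fraction of orders and keep only the users and products they reference."""
    rng = random.Random(seed)
    count = max(1, round(len(orders) * fraction))
    sampled = rng.sample(orders, count)
    # Follow the foreign keys from the sampled orders to their parent rows.
    user_ids = {o["user_id"] for o in sampled}
    product_ids = {o["product_id"] for o in sampled}
    return {
        "orders": sampled,
        "users": [u for u in users if u["id"] in user_ids],
        "products": [p for p in products if p["id"] in product_ids],
    }

users = [{"id": i} for i in range(1, 11)]
products = [{"id": i} for i in range(1, 6)]
orders = [{"id": i, "user_id": (i % 10) + 1, "product_id": (i % 5) + 1}
          for i in range(100)]

result = subset(orders, users, products, fraction=0.05)
```

Every sampled order still points at a user and a product that exist in the subset, so foreign key constraints hold on the smaller dataset.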

Competitive differentiation

The only branching solution with production-quality anonymization. Neon recently added basic column masking, but it requires manual setup per column with no automatic PII detection, no format preservation guarantees, and no referential integrity across tables. Supabase and Tiger Data have no native anonymization capabilities.

Xata is the only platform where anonymization is integrated into the replication layer. PII is detected automatically, transforms preserve format and joins, and every branch starts with realistic, clean data that your tests and downstream systems can use without modification.

Part of a complete platform

Anonymization is one piece. Xata combines it with branching, scale-to-zero, and zero-downtime migrations.

Connect, anonymize, branch, deploy

Connect to production: keep production where it is (RDS, Aurora, Cloud SQL, self-hosted).
Anonymize during replication: configurable transformers applied at the replication layer.
Branch instantly: copy-on-write branches from the anonymized replica in seconds, not hours.
Deploy without downtime: pgroll serves old and new schema versions simultaneously.

From basic masking to
HIPAA-grade anonymization.
