Your business handles PII and PHI. Development and staging branches should never contain raw sensitive data. Xata applies anonymization at the replication layer. Every branch starts clean, compliant, and production-realistic without manual scrubbing or custom scripts.
The privacy spectrum
Choose the right level of protection for your data, from lightweight redaction to full HIPAA-grade anonymization.
Column redaction (Basic)
Drop sensitive columns entirely. This is the simplest option when the data is not needed downstream.
Safe masking (Basic)
Replace values with fixed-length placeholders. Preserves column structure, removes content.
Synthetic data (Moderate)
Format-preserving fake data via Mimesis. Emails look like emails, phones look like phones, but none are real.
Deterministic hashing (Moderate)
SHA-2 hashing preserves referential integrity across tables. The same input always produces the same hash, so joins still work.
k-Member microaggregation (Advanced)
Share data with confidence by preventing re-identification attacks. Groups similar records into clusters of k members and replaces quasi-identifier values with fair aggregates. Geographic-aware for ZIP codes.
HIPAA Expert Determination (Regulatory)
Our anonymization pipeline was designed to meet, and even exceed, HIPAA Expert Determination requirements. Probabilistic risk evaluations and distortion metrics provide the evidence an expert needs to certify compliance.
Why anonymize
Protect privacy. Your business handles PII (Personally Identifiable Information) and PHI (Protected Health Information). Branches should never contain raw sensitive data.
Stay compliant. GDPR, HIPAA, CPRA, and PCI DSS restrict how identifiable data may be used in non-production environments. Anonymizing at replication time keeps identifiable data out of every branch by default.
Avoid costly mistakes. Real email addresses or phone numbers in test data lead to embarrassing outbound messages and costly breach notifications.
Open source foundation. Anonymization is implemented in pgstream (Apache 2.0). Full flexibility in deployment, transformers, and configuration.
Deterministic transformers. The same input always produces the same output. Foreign keys, joins, and relational integrity are preserved across anonymized tables.
Extensible transformers. Use transformers from Greenmask, NeoSync, and go-masker, or implement custom transformations in Go. Even the most complex requirements can be met.
The 5-stage pipeline
Anonymization happens during replication from production to the staging replica. Branches inherit already-anonymized data.
1. Entity recognition
Presidio-based PII detection classifies columns as Direct Identifiers (DIDs), Quasi-Identifiers (QIDs), or Safe. Covers all 18 HIPAA identifier families automatically.
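To make the three-way classification concrete, here is a minimal sketch that labels a column from a sample of its values. It is not Presidio and not the pipeline's actual detector: the regular expressions, the column-name hints, the `classify_column` function, and the 0.8 threshold are all assumptions chosen for illustration, and the patterns cover only a few US-style formats.

```python
import re

# Hypothetical value patterns for direct identifiers (illustration only;
# the pipeline described above relies on Presidio recognizers instead).
DID_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$"),
}

# Hypothetical column-name hints for quasi-identifiers.
QID_NAME_HINTS = ("zip", "postal", "birth", "age", "gender")


def classify_column(name, samples, threshold=0.8):
    """Return 'DID', 'QID', or 'Safe' for one column."""
    values = [str(v) for v in samples if v is not None]
    for pattern in DID_PATTERNS.values():
        hits = sum(1 for v in values if pattern.match(v))
        # Flag the column when most sampled values look like an identifier.
        if values and hits / len(values) >= threshold:
            return "DID"
    if any(hint in name.lower() for hint in QID_NAME_HINTS):
        return "QID"
    return "Safe"
```

Under these assumptions, `classify_column("email", ["a@example.com"])` is labeled a DID, a `zip_code` column falls through to QID by name, and a free-form `status` column is left as Safe.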
2. Risk assessment
Monte Carlo simulation of linkage attacks. Generates 50 synthetic adversary datasets and computes worst-case re-identification probability (Phisher K-Threshold).
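The idea behind a simulated linkage attack can be sketched in a few lines. This is a simplified model, not the pipeline's algorithm: it assumes each adversary dataset is a random sample of people whose quasi-identifiers the attacker already knows, and that the chance of re-identifying a person is one over the number of released rows sharing those quasi-identifier values. The function name, parameters, and adversary model are hypothetical; only the default of 50 datasets comes from the text above.

```python
import random
from collections import Counter


def worst_case_risk(released, qids, n_datasets=50, sample_size=100, seed=0):
    """Estimate worst-case re-identification probability by simulated linkage."""
    rng = random.Random(seed)
    # Size of each equivalence class (rows sharing the same QID values).
    class_sizes = Counter(tuple(row[q] for q in qids) for row in released)
    worst = 0.0
    for _ in range(n_datasets):
        # Assumed adversary: knows the QIDs of a random subset of people.
        adversary = rng.sample(released, min(sample_size, len(released)))
        for person in adversary:
            matches = class_sizes[tuple(person[q] for q in qids)]
            # With `matches` candidate rows, a guess is right 1/matches of the time.
            worst = max(worst, 1.0 / matches)
    return worst
```

Running this before and after anonymization gives a pair of scores to compare: a row that is unique on its quasi-identifiers yields a risk of 1.0, while clusters of at least k rows cap the risk at 1/k.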
3. DID suppression
Direct identifiers are treated per column: redact (drop), mask (fixed-length placeholder), fake (format-preserving synthetic data), or hash (SHA-2).
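The four treatments can be sketched as small functions applied according to a per-column policy. This is an illustrative stand-in, not pgstream's transformer API: the `POLICY` mapping, the function names, and the hash-derived `fake_email` (used here in place of Mimesis) are assumptions, and SHA-256 is picked as one member of the SHA-2 family.

```python
import hashlib


def redact(row, column):
    """Drop the column entirely."""
    return {k: v for k, v in row.items() if k != column}


def mask(value, width=8):
    """Replace the value with a fixed-length placeholder."""
    return "*" * width


def fake_email(value):
    """Build a syntactically valid, non-real address (stand-in for Mimesis)."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return f"user_{digest[:10]}@example.com"


def sha2_hash(value):
    """Hash with SHA-256, so equal inputs always give equal outputs."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical per-column policy for this sketch.
POLICY = {"ssn": "redact", "name": "mask", "email": "fake", "patient_id": "hash"}


def suppress(row, policy=POLICY):
    """Apply the configured treatment to each direct-identifier column."""
    out = dict(row)
    for column, action in policy.items():
        if column not in out:
            continue
        if action == "redact":
            out = redact(out, column)
        elif action == "mask":
            out[column] = mask(out[column])
        elif action == "fake":
            out[column] = fake_email(out[column])
        elif action == "hash":
            out[column] = sha2_hash(out[column])
    return out
```

Columns that are not named in the policy pass through unchanged, which is why classification has to happen before this step.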
4. k-Member microaggregation
Quasi-identifiers are grouped into clusters of k similar records using ANN search. Values are replaced with fair aggregates (randomized median/mode) that prevent statistical bias.
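A one-dimensional sketch shows the core mechanic. It deliberately swaps in simpler techniques than the ones named above: clusters are formed by sorting a single numeric column rather than by ANN search over several quasi-identifiers, and each cluster is replaced by its plain median rather than a randomized one. The `microaggregate` function is hypothetical.

```python
from statistics import median


def microaggregate(values, k):
    """Replace each value with the median of its cluster of at least k neighbours."""
    if k < 1 or len(values) < k:
        raise ValueError("need at least k values")
    # Indices sorted by value, so neighbouring indices hold similar records.
    order = sorted(range(len(values)), key=lambda i: values[i])
    chunks = [order[i:i + k] for i in range(0, len(order), k)]
    # Fold a short tail into the previous cluster so none has fewer than k members.
    if len(chunks) > 1 and len(chunks[-1]) < k:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    result = list(values)
    for chunk in chunks:
        aggregate = median(values[i] for i in chunk)
        for i in chunk:
            result[i] = aggregate
    return result
```

For example, `microaggregate([23, 25, 24, 61, 59, 60, 40], k=3)` returns `[24, 24, 24, 59.5, 59.5, 59.5, 59.5]`: every released value is now shared by at least three records.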
5. Distortion analysis
Measures information loss. Column-level distribution comparison (frequency charts, KDE). Inter-column relationship preservation via NPMI (Normalized Pointwise Mutual Information).
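For reference, NPMI for a pair of values (x, y) is usually defined as ln(p(x, y) / (p(x) p(y))) / (-ln p(x, y)), which ranges from -1 (never co-occur) through 0 (independent) to 1 (always co-occur). The sketch below computes it for two columns given as a list of value pairs; the `npmi` function and its handling of the single-pair edge case are assumptions, not the pipeline's code.

```python
import math
from collections import Counter


def npmi(pairs):
    """Return {(x, y): NPMI score} for each observed pair of column values."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    scores = {}
    for (x, y), count in joint.items():
        p_xy = count / n
        if p_xy == 1.0:
            # Only one pair ever occurs: treat it as fully associated
            # (the formula would divide by ln 1 = 0).
            scores[(x, y)] = 1.0
            continue
        pmi = math.log(p_xy / ((left[x] / n) * (right[y] / n)))
        scores[(x, y)] = pmi / -math.log(p_xy)
    return scores
```

Computing these scores on the original columns and again on the treated columns, then comparing the two, gives one way to see how much of an inter-column relationship survived anonymization.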
HIPAA compliance
Xata provides the quantitative evidence an expert needs to certify that data has been de-identified under HIPAA's Expert Determination method (§164.514(b)(1)).
No hardcoded thresholds. You configure the k-target for microaggregation. The system computes before-and-after risk scores and provides the metrics. The result is not a binary pass/fail but the quantitative evidence a qualified statistical expert uses to certify compliance.
All 18 HIPAA identifier families detected. The entity recognition stage automatically classifies columns against all HIPAA-defined direct identifiers, from names and SSNs to biometric identifiers and vehicle serial numbers.
Safe Harbor support. For organizations using the Safe Harbor method (§164.514(b)(2)), the system detects and treats all 18 HIPAA identifier families through complete DID suppression.
Distortion analysis included. After anonymization, NPMI (Normalized Pointwise Mutual Information) measures how well inter-column relationships are preserved. Frequency charts and KDE compare original vs. treated distributions column by column.
18 HIPAA Direct Identifier Families
- Names
- Geographic data (below state)
- Dates (except year)
- Phone numbers
- Fax numbers
- Email addresses
- Social Security numbers
- Medical record numbers
- Health plan beneficiary numbers
- Account numbers
- Certificate/license numbers
- Vehicle identifiers
- Device identifiers
- Web URLs
- IP addresses
- Biometric identifiers
- Full-face photos
- Any other unique identifying number
Powered by pgstream
pgstream is an open source project for PostgreSQL data replication and transformation. It uses PostgreSQL logical replication (including DDL statements) and parallel snapshotting for maximum throughput.
Schedule pgstream via CI/CD to create nightly anonymized snapshots of production. Snapshots can be taken from a read-replica to minimize production impact.
pgstream integrates with transformer libraries including Greenmask, NeoSync, and go-masker. Custom transformations can be written in Go.
Deterministic and realistic transformers
Deterministic transformers ensure the same input always produces the same output. Foreign keys, joins, and relational data stay consistent across anonymized tables.
Realistic transformers generate format-preserving synthetic data. Anonymized emails look like emails, anonymized phone numbers look like phone numbers. Downstream systems and tests work without modification.
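A small example illustrates why determinism keeps joins intact. The tables, column names, and the `pseudonymize` helper are made up for this sketch, and SHA-256 stands in for whichever deterministic transformer is configured.

```python
import hashlib


def pseudonymize(value):
    """Deterministic SHA-256 pseudonym: equal inputs give equal outputs."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


users = [{"id": 1, "plan": "pro"}, {"id": 2, "plan": "free"}]
orders = [
    {"user_id": 1, "total": 40},
    {"user_id": 1, "total": 15},
    {"user_id": 2, "total": 9},
]

# Transform the key in both tables independently.
anon_users = [{**u, "id": pseudonymize(u["id"])} for u in users]
anon_orders = [{**o, "user_id": pseudonymize(o["user_id"])} for o in orders]

# Join the anonymized tables exactly as the originals would be joined.
plan_by_user = {u["id"]: u["plan"] for u in anon_users}
joined = [(plan_by_user[o["user_id"]], o["total"]) for o in anon_orders]
```

Because the same key maps to the same pseudonym in both tables, `joined` pairs each order with the right plan even though no original `id` survives.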
Data subsetting (coming soon)
For multi-terabyte production databases, create smaller yet representative staging datasets. pgstream follows foreign key relationships to maintain referential integrity across subsetted tables.
Request 5% of the orders table, and the subsetting logic automatically filters related users, products, and payments tables to maintain a consistent, connected dataset.
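Since this feature is not yet released, the following is only a sketch of the general idea, not pgstream's implementation. It assumes in-memory tables as lists of dictionaries, hypothetical `user_id`, `product_id`, and `order_id` foreign key columns, and a hypothetical `subset` function that samples the orders table and then keeps only the rows in related tables that the sample references.

```python
import random


def subset(orders, users, products, payments, fraction=0.05, seed=0):
    """Sample orders, then keep only the related rows they are connected to."""
    rng = random.Random(seed)
    count = min(len(orders), max(1, round(len(orders) * fraction)))
    kept_orders = rng.sample(orders, count)
    # Collect the keys that the sampled orders reference or are referenced by.
    order_ids = {o["id"] for o in kept_orders}
    user_ids = {o["user_id"] for o in kept_orders}
    product_ids = {o["product_id"] for o in kept_orders}
    return {
        "orders": kept_orders,
        "users": [u for u in users if u["id"] in user_ids],
        "products": [p for p in products if p["id"] in product_ids],
        "payments": [p for p in payments if p["order_id"] in order_ids],
    }
```

Every foreign key in the result resolves to a row that is also in the result, which is the property that keeps the smaller dataset usable for tests.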
Competitive differentiation
The only branching solution with production-quality anonymization. Neon recently added basic column masking, but it requires manual setup per column with no automatic PII detection, no format preservation guarantees, and no referential integrity across tables. Supabase and Tiger Data have no native anonymization capabilities.
Xata is the only platform where anonymization is integrated into the replication layer. PII is detected automatically, transforms preserve format and joins, and every branch starts with realistic, clean data that your tests and downstream systems can use without modification.
Part of a complete platform
Anonymization is one piece. Xata combines it with branching, scale-to-zero, and zero-downtime migrations.