Skim-review is the new normal. Your data workflow has to catch up

As AI accelerates code generation, teams shift from line-by-line review to guardrails and QA. Discover why production-like, anonymized data and isolated database branches are now critical for safe, fast shipping.

By Claudiu Dascalescu

We had a long thread in our Slack about how software is built now.

What surprised me was how similar the experience felt across very different projects and AI coding agents: the same pattern kept showing up.

Underneath it was something more uncomfortable: software has become easy to create and it’s no longer realistic to review it the way we used to. Code is cheap.

If you can’t read every line anymore, where do you draw the line?

How do you maintain the quality standards you expect as an engineer, while being honest about your limits and your time?

Colleagues said things like:

  • “Most of the code I’ve shipped lately was generated. I review the important parts carefully and skim the rest.”
  • “My main job besides reviewing is deleting code.”
  • “I keep asking the agent to refactor because it invents abstractions I did not ask for.”
  • “I’ve basically become the person who sets the direction, verifies outcomes, and deletes code.”
  • “Parallelizing agent work is possible, but managing multiple streams is tiring.”
  • “I haven’t coded by hand for months. It started as practice, now it’s second nature.”

This isn’t a debate about whether AI is good or bad.

It’s an observation about where the bottleneck moved. If you’re not reviewing every line anymore, correctness comes from two places:

  1. Guardrails: types, lint, CI, tests, pre‑commit hooks.
  2. QA in realistic environments.

The uncomfortable part is that QA and tests are only as good as the data and environments behind them. When code output speeds up, the database and data workflow become the thing that breaks first.

What “skim‑review” actually means in practice

I spend more time QA’ing the product than reading diffs line by line.

For production code I still try to review everything. For throwaway experiments, I don’t.

Skim‑review is not “no review.” It’s a shift in how teams allocate attention. Instead of reading every line of every change, people increasingly:

  • focus deep review on high‑risk areas,
  • rely on automated checks to catch basic issues,
  • validate behaviour end‑to‑end in preview environments,
  • iterate quickly until the product behaves correctly.

In our thread, several people described adding more structure around the agent to make skim‑review safer:

  • Pre‑commit hooks and type checks that prevent bad commits or fail fast (a minimal sketch follows below).
  • An explicit “simplify and clean up” step after implementation.
  • AI pre‑review to catch obvious issues before asking humans.
  • Repo documentation that the agent must read and update.

If there’s slop, the build breaks and the agent can’t commit.

I keep documentation in the repo and the agent is instructed to read it during planning and update it after implementation.
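
To make the “fail fast” guardrail concrete, here is a minimal pre‑commit hook sketch. The npm scripts are placeholders for whatever type checker, linter and test runner your project already uses.

```bash
#!/usr/bin/env bash
# .git/hooks/pre-commit (or the equivalent hook in husky / pre-commit).
# Abort the commit as soon as any check fails.
set -euo pipefail

npm run typecheck   # placeholder, e.g. tsc --noEmit
npm run lint        # placeholder, e.g. eslint .
npm test            # keep this to fast unit tests; slower suites belong in CI
```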

All of this helps. But the discussion kept circling back to one thing: as soon as you try to verify behaviour on something production‑shaped, the friction shows up.

QA without production‑like data is theatre

The real reason teams fight for production-like data isn’t realism for its own sake. It’s confidence.

When you can run a change against data that behaves like production, you stop doing the “merge and pray” loop.

Fewer surprises means fewer rollbacks, fewer hotfixes, and less time spent arguing whether a bug is “real” or “staging weirdness.”

Confidence is what turns speed into throughput.

Most production bugs are not “this function returns the wrong type.” They are:

  • a weird edge case in the data (nulls, unexpected shapes, duplicates),
  • a subtle permission boundary that depends on real user history,
  • performance behaviour that only appears at production scale,
  • an interaction between migrations, backfills and application code,
  • a single customer record that violates your assumptions.

Teams usually reach for one of these strategies:

  • Fake seed data: it’s safe, but it’s small. It rarely catches the bugs you actually ship.
  • Shared staging: it gets polluted, and it becomes a bottleneck when PR velocity goes up.
  • Clone production into staging: it can be slow, expensive, and if you don’t anonymize, a compliance nightmare.
  • Test against production: everyone does it once under pressure. It only needs to go wrong once.

The requirement becomes clear: you need production‑like data, but you can’t expose real PII. You need lots of isolated environments, but you can’t afford lots of full database copies.

The workflow that matches how teams build now

A recurring theme in our thread was that speed comes from constraints, not trust. Types break the build. Hooks block bad commits. Documentation keeps intent explicit. The data side needs the same approach.

A modern “skim‑review friendly” workflow needs four properties:

  1. Isolation: each pull request (or feature branch) gets its own database environment.
  2. Realism: the data looks like production in shape and distribution.
  3. Privacy: PII and sensitive data are anonymized before they reach dev and test.
  4. Cost control: you can create environments frequently without paying for full copies.

Xata maps cleanly onto those constraints: copy‑on‑write branching for isolation, anonymization during cloning or streaming for privacy, and scale‑to‑zero for cost control.

Step 1 – Create a staging replica from production (continuous streaming)

If you want staging data that stays production‑shaped over time, set it up as a continuous streaming replica instead of a one‑time copy.

Xata’s streaming replication uses Postgres logical replication: it takes an initial snapshot of the tables you choose, then keeps the target branch up to date by streaming changes through a replication slot.

1) Make sure your source Postgres can do logical replication

On the source database, confirm the replication settings, and enable them if needed:
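
A minimal sketch with psql; $SOURCE_DATABASE_URL is a placeholder for your source connection string, and managed providers may expose these settings through a parameter group instead.

```bash
# Check the current settings on the source database
psql "$SOURCE_DATABASE_URL" -c "SHOW wal_level;"
psql "$SOURCE_DATABASE_URL" -c "SHOW max_replication_slots;"
psql "$SOURCE_DATABASE_URL" -c "SHOW max_wal_senders;"

# Enable logical replication if wal_level is not already 'logical'
psql "$SOURCE_DATABASE_URL" -c "ALTER SYSTEM SET wal_level = 'logical';"
```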

Restart Postgres for the changes to take effect.

You also need a Postgres role with replication permissions and network connectivity from Xata to your Postgres instance.
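
A dedicated role might look roughly like this; the role name, password and grants are illustrative, so follow the Xata docs for the exact permissions the clone tooling needs.

```bash
psql "$SOURCE_DATABASE_URL" <<'SQL'
-- Dedicated role for streaming replication (name and password are placeholders)
CREATE ROLE xata_replicator WITH LOGIN REPLICATION PASSWORD 'change-me';
-- The initial snapshot needs read access to the tables being replicated
GRANT SELECT ON ALL TABLES IN SCHEMA public TO xata_replicator;
SQL
```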

2) Create the Xata project and the branch you’ll treat as staging

In the Console, create a project and a base branch you’ll use as your staging replica (many teams just use the main branch for this). For streaming replication, we recommend using at least one replica for the target branch.

3) Configure the Xata CLI locally
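
Assuming the CLI’s init-style setup command (check xata --help if your version differs):

```bash
# Run in the root of the repository and follow the prompts to select
# the project and the staging branch created above
xata init
```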

This links the folder to your project and branch context.

4) Generate a streaming config (tables + transforms)
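
The command name below matches the references later in this post; how it locates your source database (a flag or an environment variable) may vary, so check xata clone config --help.

```bash
# Interactive prompt: pick the tables to replicate and define transformation
# pipelines (including anonymization rules); the result is written to
# .xata/clone.yaml
xata clone config
```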


The prompt lets you select tables to replicate and define transformation pipelines (including anonymization rules), then writes them to .xata/clone.yaml.

Optional: if you want to enforce that sensitive columns must be explicitly covered, use strict validation:
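
For example (the flag spelling follows the checklist at the end of this post):

```bash
# Refuse the configuration unless every table and column is explicitly listed
xata clone config --validation-mode strict
```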

Strict validation mode requires that every table and column be explicitly defined in the configuration, minimizing the risk of accidental leaks.

5) Start the streaming replica (snapshot + continuous sync)
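
A minimal sketch, assuming the command picks up .xata/clone.yaml and the linked branch context from the previous steps:

```bash
# Takes an initial snapshot of the selected tables, then keeps streaming
# changes from the source through a logical replication slot
xata clone stream
```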

Optional: if you only want certain tables:
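
The table names below are placeholders, and the exact value format for the flag may differ in your CLI version:

```bash
# Replicate only the listed tables
xata clone stream --filter-tables users,orders,invoices
```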

The xata clone stream command takes an initial snapshot of your tables, sets up the streaming pipeline and begins continuous replication. You can filter tables using the --filter-tables flag.

One operational note worth keeping: if you stop streaming replication and do not plan to resume it, clean up the replication slot and related objects using xata stream destroy. Otherwise the Write‑Ahead Log (WAL) will continue to accumulate.

Step 2 – Branch the database per PR

Once you have a staging replica, you want isolation. Per pull request, branch from staging and point your preview environment at it:
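
A sketch of that flow using the commands described below; the command names come from the explanation that follows, but the argument and flag shapes are illustrative (check xata branch --help), and the PR number is a placeholder.

```bash
# Point the CLI context at the staging branch so "xata branch get id"
# resolves to the staging branch ID
xata branch checkout staging

# Create a copy-on-write branch for this pull request
# (flag and argument shapes are illustrative)
xata branch create "pr-123" --parent-id "$(xata branch get id)"

# Wait until the branch is ready, then grab a connection string
# for the preview environment
xata branch wait-ready "pr-123"
DATABASE_URL="$(xata branch url "pr-123")"
```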

The xata branch create command creates a new branch that inherits schema and data from its parent, using copy‑on‑write storage. The wait-ready subcommand waits for the branch to be ready, and branch url returns a connection string for the given branch. Using branch checkout staging sets the CLI context so that xata branch get id returns the ID of the staging branch when constructing the PR branch.

Step 3 – Clean up or hibernate branches when done

Branches shouldn’t live forever.

  • If you want branches to sleep automatically, enable Scale to Zero to hibernate after inactivity and wake on connection.
  • If you want explicit control, you can manually hibernate a branch in the Console.

In CI, the simplest cleanup is still deletion on PR close, and you can combine that with scale‑to‑zero for branches you keep around temporarily. Automated workflows often use xata branch delete to remove a branch when a pull request is merged or closed.
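
For example, in the CI job that runs when a pull request is closed (the argument shape is illustrative and should match whatever naming you used at creation time):

```bash
# Remove the per-PR branch created in Step 2
xata branch delete "pr-123"
```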

What changes for the team day‑to‑day

This workflow doesn’t just reduce staging fights. It changes how you define “done”:

  • build, lint and type checks pass,
  • migrations run cleanly on an isolated database branch,
  • integration tests run against anonymized, production‑shaped data,
  • key flows are verified in a preview environment,
  • risky areas (auth, deletes, billing, tenancy) have explicit checks.

The review conversation moves from “did you read every line?” to “did we verify the behaviour on realistic data?”

The part nobody should gloss over: anonymization is work

Anonymization isn’t magic. You need to choose which columns to transform and verify the result. The good news is that xata clone config writes your rules to .xata/clone.yaml, and you can run strict validation if you want to force completeness.

Good anonymization systems:

  • remove or transform direct identifiers (emails, phone numbers, addresses),
  • handle quasi‑identifiers thoughtfully (names, timestamps, free‑form text),
  • preserve relational consistency,
  • keep distributions realistic enough to reproduce bugs.

Implementation checklist

Here’s a practical way to start, even if you adopt it gradually:

  1. Create a single preview environment using a safe staging replica (see Step 1).
  2. Define anonymization rules for sensitive columns using xata clone config (consider --validation-mode strict once you have it working).
  3. Automate branch creation in CI: on PR open, create a branch from staging; on PR close, delete it. Use xata branch wait-ready to block until the branch is ready.
  4. Run migrations and integration tests against the branch. If you need to test zero‑downtime migrations, explore xata roll and its migrate and complete commands for pgroll‑powered schema changes (see the sketch after this list).
  5. Update your PR template to include a verification checklist referencing the preview environment.
  6. Add a “simplify after implement” step in your agent workflow so unnecessary abstractions get deleted before they calcify.
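
As a rough sketch of step 4, assuming the per‑PR branch from Step 2 and pgroll‑style migration files in the repo; how xata roll discovers the target connection and the migration files may differ from what is shown here, and the test command is a placeholder.

```bash
# Connection string for the isolated PR branch (see Step 2)
export DATABASE_URL="$(xata branch url "pr-123")"

# Apply the pending pgroll-powered migration, run the integration tests
# against the branch, then finalize the schema change
xata roll migrate
npm test            # placeholder for your integration test command
xata roll complete
```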

If you do only one thing, do this: make it easy for every pull request to run against realistic, isolated, anonymized data. That’s the missing piece for teams who’ve moved from line‑by‑line review to guardrails and QA.

Closing thought

In our thread, someone said: “I like shipping faster, but I don’t like reviewing everything.”

That tension isn’t going away.

The goal isn’t to pretend we can go back to the old way. It’s to build workflows where the new way is actually safe.

Skim‑review can work. But only if verification is real. For most teams, making verification real means fixing the data and environment story first.
