The missing layer in LLM knowledge bases

Karpathy's LLM knowledge base handles text. Add a Postgres database alongside your wiki so your LLM can query metrics and connect numbers to context.

By Claudiu Dascalescu

If you're running product, go-to-market, or operations, your work lives in two places. The context lives in writing: strategy docs, competitive intel, research notes, and campaign briefs. But the decisions you make (hopefully) depend on data: activation rates by cohort, pipeline conversion, revenue trends, campaign ROI. Your context exists in writing; your decisions rely on data. This is nothing new, but how well and how fast we can run these cycles has completely changed with AI.

Andrej Karpathy recently shared a pattern for using LLMs to build personal knowledge bases, and it resonated with a LOT of people - 18M views, and what feels like an equal number of derivative pieces, as of this writing. If your work is pure research and idea generation, it's all you need. Markdown is the right container for concepts, connections, and synthesis.

But if you need to make a quarterly planning decision, a wiki page will only get you halfway there. You need activation rates, not a summary of what activation means. You need pipeline numbers, not a positioning doc. Karpathy's pattern handles only the first half. The missing piece is a database alongside your wiki, with the LLM maintaining both and the references between them.

What Karpathy proposed

You rarely touch the wiki directly. The LLM writes it, maintains it, and keeps it consistent. You handle the thinking, it handles the work.

His first step is to index source documents (articles, papers, repos, datasets, images) into a raw/ directory. The LLM then incrementally "compiles" a wiki from those sources: summaries, concept pages, backlinks, entity articles, all cross-linked. The wiki lives as .md files, ready to view in your editor of choice - his preference is Obsidian. But remember, in this case Obsidian is a viewer, not an editor. The LLM is the editor.

Once the wiki reaches meaningful size (Karpathy's is around 100 articles and 400K words on some topics), you can ask the LLM complex research questions against it. The LLM reads the index files and summaries, pulls in the relevant articles, and synthesizes answers. At this scale, you don't need RAG infrastructure. The LLM's own index maintenance and the ability to read files on demand is enough.

Most importantly, outputs get filed back into the wiki. Your explorations and queries accumulate in the knowledge base, making it richer for future queries. The wiki basically gets stronger with exercise (novel concept), not just from ingesting new sources. Periodic linting keeps the wiki improving over time. As the wiki grows, you build lightweight tools around it. Karpathy vibe-coded a small search engine over the wiki that he uses directly through a web UI and hands off to the LLM as a CLI tool for larger queries.

Karpathy's closing observation: "I think there is room here for an incredible new product instead of a hacky collection of scripts." He's right - and several new products have already been shared. The challenge is that they all stay in the text layer: the decisions you make with them still rest on experience and professional principles, not data.

Two compilers, not one

I spent an embarrassing amount of time trying to make markdown tables work for tracking weekly metrics before admitting what should have been obvious from the start: knowledge and data behave differently. A concept page in your wiki gets rewritten and refined over time. The LLM improves it as it learns more. But last week's signup numbers don't get rewritten. They're a fact. Next week adds another row. The wiki pattern is built for knowledge that evolves through revision. Your data evolves through accumulation. Markdown doesn't have a natural model for "append a row to an existing dataset." Databases do. That's literally what they are.

Karpathy built a knowledge compiler. Raw text goes in, structured wiki comes out. What's missing is a data compiler that sits alongside it.

The two layers serve different purposes and need different operations:

|                | Wiki (knowledge)                     | Database (data)                                        |
| -------------- | ------------------------------------ | ------------------------------------------------------ |
| Shape          | Paragraphs, concepts, relationships  | Rows, columns, time-series                             |
| Core operation | Summarize, link, synthesize          | Query, aggregate, compare                              |
| LLM role       | Write and maintain text              | Define schemas, write queries, maintain derived tables |
| Update pattern | Rewrite sections, add pages          | Insert rows, run migrations                            |
| Scales by      | More pages                           | More rows and tables                                   |

The interesting part: the two layers reference each other constantly. A weekly analytics report (wiki) cites a SQL query that produced the numbers (database). A database table of experiment results gets interpreted in a wiki page that explains what the numbers mean and what to do about them. The LLM maintains both layers and the references between them.

The loop: database to markdown and back

The questions you most want answered span both layers. "Why did activation drop?" is not a data question or a knowledge question. It's both. You need a database to compute the drop, and a wiki to explain it.

Here's what this looks like if you run a weekly product analytics review. Say you're tracking signups, activation, branch creation, and revenue for a SaaS product.

It starts with the database. The LLM runs SQL against Postgres to pull the raw numbers. These are relational queries that markdown can't answer:
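A weekly activation query, for example. This is a sketch, not the actual queries from the review: the `signups` and `activation_events` tables and their columns are hypothetical stand-ins for whatever your product tracks.

```sql
-- Weekly activation rate: the share of each week's signups that
-- reached the activation event. Table names are illustrative.
SELECT
  date_trunc('week', s.created_at) AS week,
  count(*)                         AS signups,
  count(a.user_id)                 AS activated,
  round(100.0 * count(a.user_id) / count(*), 1) AS activation_pct
FROM signups s
LEFT JOIN activation_events a ON a.user_id = s.user_id
GROUP BY 1
ORDER BY 1;
```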

Revenue joins across billing tables:
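Something along these lines, assuming a conventional `invoices` / `subscriptions` / `plans` layout (the names are hypothetical):

```sql
-- Monthly revenue by plan, joined across hypothetical billing tables.
SELECT
  p.name                           AS plan,
  date_trunc('month', i.paid_at)   AS month,
  sum(i.amount_cents) / 100.0      AS revenue
FROM invoices i
JOIN subscriptions s ON s.id = i.subscription_id
JOIN plans p         ON p.id = s.plan_id
WHERE i.status = 'paid'
GROUP BY 1, 2
ORDER BY 2, 3 DESC;
```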

Then the LLM takes the query results, computes trends (week-over-week changes, cohort comparisons), and writes a report that combines the numbers with qualitative context: what you shipped that week, what might explain a spike or drop, what to watch next. That report is a markdown file that goes into your knowledge base:
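A sketch of what such a report might look like. The activation numbers come from the example below; the rest is placeholder content:

```markdown
# Weekly report - W14

## Metrics
- Activation: 38.8% (down from 79.5% in W13)
- Signups: flat week-over-week

## Context
- Shipped: onboarding checklist revamp
- Product Hunt launch brought a spike of new signups

## Watch next week
- Does activation recover as the launch cohort ages out?
```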

This is where the two layers start compounding. The report says activation dropped from 79.5% to 38.8%. That raises a question: why? You ask the LLM. It goes back to the database and queries activation by signup source to check if a specific channel brought in low-intent users. It searches the wiki for last week's report and finds that the previous cohort included a batch of internal test accounts that inflated the baseline. It checks the execution notes and sees the onboarding flow wasn't changed.
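The activation-by-source query behind that investigation could look like this sketch, again with hypothetical table and column names:

```sql
-- Activation rate split by signup source for last week, to check
-- whether a specific channel brought in low-intent users.
SELECT
  s.source,
  count(*)         AS signups,
  round(100.0 * count(a.user_id) / count(*), 1) AS activation_pct
FROM signups s
LEFT JOIN activation_events a ON a.user_id = s.user_id
WHERE date_trunc('week', s.created_at) =
      date_trunc('week', now() - interval '1 week')
GROUP BY 1
ORDER BY 3 DESC;
```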

You get an answer that combines fresh data ("activation by channel shows organic signups activated at 52%, while Product Hunt signups activated at 12%") with knowledge from the wiki ("W13 baseline was inflated by internal testing, see weekly report W13. Onboarding flow unchanged."). Neither layer alone could have given you that answer.

And now that answer gets filed back into your knowledge base too. Next week, when the LLM writes the W15 report, it has the full context chain: the drop in W14, the investigation, the root cause. Your knowledge base gets smarter every cycle, without you doing any of the bookkeeping.

This is the loop: database produces the numbers, wiki captures the interpretation, follow-up questions drive deeper into both, and everything accumulates. Each weekly report isn't just a snapshot. It's another layer of context that makes your next analysis better.

Putting it together: structure and downstream decisions

The product analytics example above shows the loop. But this pattern works for any domain where you mix text with numbers. Here's what the directory structure looks like when you set up a knowledge base with both layers:
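One plausible layout - the directory names below are illustrative; only `raw/` and `schema.md` come from Karpathy's original description:

```
knowledge-base/
├── schema.md        # instructions for both compilers
├── raw/             # source documents, exports, CSVs
├── wiki/            # compiled .md pages: concepts, entities, reports
│   └── reports/
│       └── W14.md
└── db/
    ├── schema.sql   # table definitions the LLM works from
    ├── queries/     # saved SQL (activation-rate.sql, ...)
    └── skills/      # skill files explaining what columns mean
```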

The schema.md file (what Karpathy calls the "schema layer") now has instructions for both compilers. It covers how to ingest text into the wiki and data into the database, what page structure the wiki follows and what table schemas the database uses.

Your data layer is a Postgres database that mirrors whatever data sources you're working with - product analytics, campaign performance, pipeline stages, billing, customer data. We wrote about how to build a product analytics warehouse in Postgres using materialized views and pg_cron, and that's exactly the kind of setup that works well as the data layer here. The LLM queries it directly via SQL whenever it needs to produce a report, answer a question, or validate a claim in a strategy doc.
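As a sketch of that warehouse setup: a materialized view that precomputes weekly metrics, refreshed on a schedule with pg_cron. The view and table names are illustrative.

```sql
-- Precompute weekly signup/activation metrics so the LLM's
-- report-writing queries stay cheap.
CREATE MATERIALIZED VIEW weekly_metrics AS
SELECT
  date_trunc('week', s.created_at) AS week,
  count(*)                         AS signups,
  count(a.user_id)                 AS activated
FROM signups s
LEFT JOIN activation_events a ON a.user_id = s.user_id
GROUP BY 1;

-- Refresh nightly at 03:00 via pg_cron.
SELECT cron.schedule('refresh-weekly-metrics', '0 3 * * *',
  'REFRESH MATERIALIZED VIEW weekly_metrics');
```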

The database is where the numbers live. The wiki is where the context lives. And because everything ends up in the same knowledge base, the LLM can move between the two without you having to point it at the right source every time.

Making the LLM useful with your data

Getting the LLM to work with your wiki is trivial since it reads and writes markdown. The database side takes a bit more setup, but it comes down to three things:

  1. Give it a schema
    Any ORM schema or plain schema.sql works, since Xata is vanilla Postgres. Prisma's PostgreSQL quickstart is a good starting point if you want typed queries.
    The key is describing what tables exist, what columns they have, and how they relate. Without it, every query starts with the LLM exploring your database blind.
  2. Save your queries
    The first time you ask a question, the LLM figures out the schema, writes the query, runs it, maybe adjusts it. That costs tokens and time. The second time, it will just run activation-rate.sql. Saved queries are the data equivalent of wiki pages: compiled knowledge about how to get a specific answer.
  3. Write a skill file
    The schema says status String. It doesn't say "a branch with status 'active' might still be hibernating." A skill file teaches the LLM what your columns actually mean, which metrics matter, and how to interpret what it finds. This is the highest-leverage thing you can do for output quality. You stop getting generic SQL output and start getting answers that reflect how your team thinks about your business.
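A skill file can be as simple as a markdown page the LLM reads before it queries. A hypothetical sketch, building on the branch-status example above:

```markdown
# Skill: querying the product database

## Column semantics
- branches.status: 'active' includes hibernating branches; to count
  truly live branches, also check recent activity.
- signups.source: 'unknown' usually means internal test accounts;
  exclude them from activation baselines.

## Metrics that matter
- Activation = first successful query within 48h of signup.
- Always report week-over-week deltas, not just raw counts.
```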

Picking a database for this

⚡ TLDR: we’re obviously going to suggest using a Xata Postgres database, but you do you.

Sign up at console.xata.io, then:

```shell
curl -fsSL https://xata.io/install.sh | bash
xata auth login
xata init
```

A knowledge base database has a specific usage pattern: you query it intensely for an hour while writing a weekly report, then don't touch it for days. Look for two things in your Postgres setup:

  • scale-to-zero: don’t pay for compute while the database sits idle between sessions
  • branching: so you can test hypotheses without risking your master knowledge base

Remember, your LLM is effectively an agent that needs a database it can break. Let it experiment freely on a copy.

This pattern works at personal or small-team scale: a few data sources, a few hundred thousand rows, one or two people asking questions. Postgres handles more than most people think before you need to reach for something else. If you need Spark here, I’d hate to see your Claude bill.

If you want to add a data layer to your own knowledge base, start with one data source. Pick the data you're currently cramming into markdown (product events, billing exports, analytics CSVs), put it in Postgres, write a skill file that explains what the columns mean, and run a few cross-layer queries. Save the queries that work. The pattern compounds from there, just like Karpathy's wiki does.

Thanks for sticking with us until the end. If you experiment with this and have ideas for improving it, tag us @xata and let us know!
