After vibe coding, is vibe observability next?
Written by
Tudor Golubenco
Published on
March 25, 2025
Since releasing the Xata Agent, our Postgres AI expert, one of the most common questions we’ve got is a variation of:
I’m currently using {Datadog|Grafana|Elastic} for monitoring Postgres, how is this tool different?
We usually answer this in terms of what the Xata Agent already does today for Postgres, but I think the question gets even more interesting if we generalize this to: What can LLMs bring to the observability space?
The standard way SRE and software teams keep their services up and healthy looks something like this:
I’d say this has been the standard for the past 10 years or so. An enormous amount of blog posts, conference talks, and podcasts have been created on the topic of best practices for each of the points above. And it works: you no longer need to be Google in order serve Google-level traffic with Google-level SLAs. In fact, you probably only need a handful of excellent SRE/platform engineers. At least partially, this is thanks to very good observability tools and best practices.
However, I think the current observability tooling is in a local maximum, and AI agents might be the way to escape it.
Let’s look at the main pain points:
The next sections look at each of the above to see if AI-first observability can change the game.
Think about how you are doing on-call today. If your process is similar to ours, it will look something like this:
First of all, it can be on-call 24/7, with humans only as an escalation option.
Then, it can investigate the root-cause:
If there is no playbook for the particular alert, it can use its own internal knowledge to generate one, and then execute it. During this process, it can search the web, check status pages of dependency services, or check the documentation of the systems involved.
If it successfully identifies the cause, in that ~42.9% of cases where the system self-recovers, it can recognize this and simply continue monitoring to confirm recovery.
If the necessary action is low-risk (e.g., scaling up a component), it can take that step without waking anyone up.
If confidence in the root cause is low, or the recommended action is risky (e.g., adding a new index), the agent will page a human. But this time, the engineer wakes up to an already written summary: what’s happening, the likely root cause, and suggested next steps. The agent has already done half the work.
Finally, the agent now has all the information to write the incident report and also store it in its knowledge base, in case it happens again.
If you’ve worked with Datadog or Splunk, you know these tools get expensive quickly. Open-source alternatives are cheaper, but really only by comparison. The truth is, there are good reasons observability tooling is expensive:
Can the LLM-based AI agents change the game from a cost perspective? Maybe. I imagine a system like this:
The Xata Agent, for example, retrieves the Postgres metrics and logs directly from the Postgres service (e.g. for RDS via Cloudwatch). It doesn’t need to store them again in its own storage. On a typical run, it checks first the high level metrics like CPU, memory, and latency. It drills down to retrieving other metrics and logs only when it detects something worth investigating.
SREs and platform engineers these days are expected to be experts in a growing number of technologies, including cloud services, IaC, Kubernetes, and multiple database technologies. On top of that, they need to have excellent communication and prioritization skills simply to justify their own job to upper management. No wonder they’re so hard to hire.
For example, the reason we hardly see any DBA jobs today is because those responsibilities have largely fell on SRE and platform teams. But a typical SRE doesn’t have the depth of knowledge in Postgres that a Postgres DBA once had. Instead, they tend to read the docs “on demand” when something breaks, or ask for help on mailing lists, Discord, etc.
Now imagine that AI agents, in addition to their built-in general knowledge, can be equipped with tools and playbooks from top experts in each technology.
That’s already the case for the Xata Agent, which we’re battle-training to troubleshoot issues across tens of thousands of active Postgres databases every day. With every incident, we improve an existing playbook or add a new tool. A few of these tools and playbooks are specific to how we do things at Xata, but many are applicable to any Postgres installation.
That means we are growing a community library of tools and playbooks. It is released as Apache-license open source and we’re hoping others will contribute their own tools and playbooks too.
There is another way to think about playbooks, which is inspired by a skeet by Simon Willison. SREs can basically build their own custom software to help them operate the infrastructure. They write this software in plain English, by telling the agent which tools to run, and under which conditions.
The blog post so far presents a one-sided story, but let’s look at some counter arguments as well. These are all valid objections that we heard about the Xata Agent already.
Privacy: What are the risks of sending data about your production environment to third-party LLM providers like OpenAI, Anthropic, Deepseek, etc? Even if the Xata Agent is open-source and fully self-hosted, it does rely on online LLM APIs. It doesn’t send your actual database data, but it may include metadata like table or column names, metrics, error logs, and so on as part of the prompts.
If this is a major blocker, probably the best solution today is to invest in self-hosted LLMs. The good news, in my opinion, is that LLMs are generalists, so if you have a local LLM for your code tools, you can use it for observability as well.
LLM cost: Isn’t calling models expensive? Are we replacing the high cost of observability tools for the high cost of AI APIs?
From what we’ve seen running the Xata Agent so far, the cost of current LLMs is still much smaller (like 2 orders of magnitude) compared to what we pay our observability solution. And this is with the most expensive models from OpenAI and Anthropic.
This one might go both ways in the future, but I’m somewhat optimistic that it won’t be a major bottleneck.
Hallucinations: What if the AI agent takes down production? What if it deletes data? What if it does something illegal?
For the time being, I think it’s important to limit what AI agents can do without supervision. This limits their usefulness, but also prevents the worse case scenarios.
For example, the Xata Agent currently does not have a tool for running arbitrary SQL. We don’t want it YOLO’ing a TRUNCATE
in production. Instead, it’s limited to predefined SQL actions: retrieving active queries, slow queries, current settings, active connections, etc.
We do want to add support for arbitrary SQL execution, but only behind an approval workflow. Until then, the Agent may recommend SQL for you to run, but it won’t run it itself. No YOLO mode (yet).
The space of AI agents for observability is just getting started. A number of startups are promising AI-based RCA, alert triaging, or on-call assistance. Most are either generic or focused on Kubernetes. Examples are Deductive.ai, AgentSRE, Parity, Cleric. In the Postgres space, besides our own Xata Agent, I’m aware of two others in development: dba.ai by Tembo, and CrystalDBA.
I believe the rise of MCP servers (every SaaS and their dog is getting one) will accelerate these agents faster than what you’d expect. MCP also means that some of these agents will collaborate with each other. For example, one agent specializing in Kubernetes collaborates with another focused on Postgres to resolve a complex incident.
I’m awfully biased, but I think the approach we take with the Xata Agent has the key elements to tackle the challenges above:
If you are interested in what AI agents could mean for observability, we’d love it if you would give Xata Agent a try and maybe even contribute to it. If you’d like to learn more about it, don’t hesitate to reach out to us.
Xata provides the best free plan in the industry. It is production ready by default and doesn't pause or cool-down. Take your time to build your business and upgrade when you're ready to scale.
Copyright © 2025 Xatabase Inc.
All rights reserved.