Writing an LLM Eval with Vercel's AI SDK and Vitest

LLM Evals: testing applications that use LLMs

Author

Richard Gill

The Xata Agent

Recently we launched Xata Agent, an open-source AI agent which helps diagnose issues and suggest optimizations for PostgreSQL databases.

To make sure the Xata Agent still works well after we modify a prompt or switch LLM models, we decided to test it with an Eval. Here, we'll explain how we used Vercel's AI SDK and Vitest to build an Eval in TypeScript.

Testing the Agent with an Eval

The problem with building applications on top of LLMs is that LLMs are a black box:

The Xata Agent contains multiple prompts and tool calls. How do we know that Xata Agent still works well after modifying a prompt or switching models?

To 'evaluate' how our LLM system is working we write a special kind of test - an Eval.

An Eval is usually similar to a system test or an integration test, but specifically built to deal with the uncertainty of making calls to an LLM.

The Eval run output

When we run the Eval the output is a directory with one folder for each Eval test case.

Each folder contains the output files for that test case, along with 'trace' information so we can debug what happened.
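
For illustration, a run directory might look roughly like this (the file names here are indicative, not the Agent's exact layout):

```
/tmp/eval-runs/<TEST_RUN_ID>/
  <eval-id-1>/
    response.json       # full response object from the AI SDK
    humanReadable.txt   # readable trace of prompts, tool calls and text
  <eval-id-2>/
    ...
```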

We’re using Vercel's AI SDK to perform tool calling with different models. The response.json files represent a full response object from Vercel’s AI SDK. This contains everything we need to evaluate the Xata Agent’s performance:

  • Final text response
  • Tool calls and intermediate ‘thinking’ the model does
  • System and user prompts
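
As a rough sketch of what a single Eval case does with the AI SDK: call the model with tools enabled and persist the full result. The tool, model id, and serialization below are placeholders rather than the Agent's real configuration, and the API shown follows AI SDK 4.x:

```ts
import { generateText, tool } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
import fs from 'node:fs/promises';
import path from 'node:path';

// Placeholder tool standing in for one of the Agent's PostgreSQL tools.
const getSlowQueries = tool({
  description: 'Return the slowest queries from pg_stat_statements',
  parameters: z.object({ limit: z.number() }),
  execute: async ({ limit }) => ({ queries: [], limit }),
});

export async function runEvalCase(evalDir: string, userPrompt: string) {
  const result = await generateText({
    model: openai('gpt-4o'),
    system: 'You are an agent that diagnoses PostgreSQL issues.',
    prompt: userPrompt,
    tools: { getSlowQueries },
    maxSteps: 5, // allow several tool calls plus a final text answer
  });

  // Persist the full response so the Eval can inspect the text, tool calls
  // and intermediate steps (serialization simplified here).
  await fs.writeFile(
    path.join(evalDir, 'response.json'),
    JSON.stringify(result, null, 2)
  );
  return result;
}
```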

We then convert this to a human-readable format:
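
A minimal formatter over the response can walk the steps and print tool calls and text, roughly like this (the field names follow AI SDK 4.x and the shape is simplified):

```ts
// Minimal structural type mirroring the parts of the AI SDK result used here.
type StepLike = {
  text: string;
  toolCalls: { toolName: string; args: unknown }[];
};

// Turn a response into a readable trace: one line per tool call,
// followed by the text each step produced.
export function toHumanReadable(result: { steps: StepLike[] }): string {
  const lines: string[] = [];
  for (const step of result.steps) {
    for (const toolCall of step.toolCalls) {
      lines.push(`tool call: ${toolCall.toolName}(${JSON.stringify(toolCall.args)})`);
    }
    if (step.text) {
      lines.push(`text: ${step.text}`);
    }
  }
  return lines.join('\n');
}
```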

We then built a custom UI to view all of these outputs so we can quickly debug what happened in a particular Eval run:

LLM Eval Runner

Using Vitest to run an Eval

Vitest is a popular TypeScript testing framework. To create our desired folder structure we have to hook into Vitest in a few places:

Get an id for the Eval run

To get a consistent id for each run of all our Eval tests we can set a TEST_RUN_ID environment variable in Vitest’s globalSetup.

We can then create and reference the folder for our eval run like this: path.join('/tmp/eval-runs/', process.env.TEST_RUN_ID)
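
A sketch of what this can look like (the file names and the id format are our own choices, not prescribed by Vitest):

```ts
// globalSetup.ts -- referenced from vitest.config.ts via test.globalSetup
export default function setup() {
  // globalSetup runs once in the main process before any test workers start,
  // so the workers inherit this environment variable.
  process.env.TEST_RUN_ID ??= new Date().toISOString().replace(/[:.]/g, '-');
}
```

```ts
// evalRunDir.ts -- small helper used by the Eval tests
import path from 'node:path';

export const evalRunDir = () =>
  path.join('/tmp/eval-runs/', process.env.TEST_RUN_ID!);
```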

Get an id for an individual Eval

Getting an id for each individual Eval test case is a bit trickier.

Since LLM calls take some time, we need to run Vitest tests in parallel using describe.concurrent. With concurrent tests we must use the local expect from the test context to make sure we read the correct test name.

We can use the Vitest describe + test name as the Eval id:
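
For example, using a hypothetical testNameToEvalId helper that reads the current test name from the test-local expect (the test names below are made up):

```ts
import { describe, test } from 'vitest';

// Hypothetical helper: build a filesystem-safe Eval id from the
// describe + test name reported by the test-local expect.
const testNameToEvalId = (expect: { getState: () => { currentTestName?: string } }) =>
  (expect.getState().currentTestName ?? 'unknown-test')
    .replace(/[^a-zA-Z0-9]+/g, '-')
    .toLowerCase();

describe.concurrent('slow queries', () => {
  // The expect from the test context must be used here; with concurrent
  // tests the global expect would not reliably report this test's name.
  test('recommends adding an index', async ({ expect }) => {
    const evalId = testNameToEvalId(expect);
    expect(evalId).toContain('recommends-adding-an-index');
  });
});
```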

From here it’s pretty straightforward to create a folder like this: path.join('/tmp/eval-runs/', process.env.TEST_RUN_ID, testNameToEvalId(expect)).

Combining the results with a Vitest Reporter

We can use Vitest’s reporters to execute code during/after our test run:
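
For example, a custom reporter can combine the per-test folders into a run-level summary once all tests have finished. This is a sketch; the Reporter import path and hooks vary a little between Vitest versions, and the summary file is our own convention:

```ts
// eval-reporter.ts -- registered in vitest.config.ts, e.g. reporters: ['default', './eval-reporter.ts']
import type { Reporter } from 'vitest/node';
import fs from 'node:fs';
import path from 'node:path';

export default class EvalReporter implements Reporter {
  // Called once after the whole test run has completed.
  onFinished() {
    const runDir = path.join('/tmp/eval-runs/', process.env.TEST_RUN_ID!);
    // Each Eval test case wrote its own folder; list them into a summary
    // file that the Eval UI can load for this run.
    const evalCases = fs
      .readdirSync(runDir, { withFileTypes: true })
      .filter((entry) => entry.isDirectory())
      .map((entry) => entry.name);
    fs.writeFileSync(
      path.join(runDir, 'summary.json'),
      JSON.stringify({ testRunId: process.env.TEST_RUN_ID, evalCases }, null, 2)
    );
  }
}
```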

Conclusion

Vitest is a powerful and versatile test runner for TypeScript which can be straightforwardly adapted to run an Eval.

The response objects from Vercel's AI SDK contain almost everything needed to see what happened in an Eval.

For full details check out the Pull Request which introduces the Eval in the open-source Xata Agent.

If you're interested in monitoring and diagnosing issues with your PostgreSQL database check out the Xata Agent repo - issues and contributions are always welcome. You can also join the waitlist for our cloud-hosted version.
