
Model Context Protocol (MCP) Evals


1. Introducing Evals
2. How to create an effective eval prompt
3. How are evals in MCP different than other evals
4. How to leverage evals in your MCP server
5. Running Evals in Continuous Integration (CI)
6. Conclusion

            Rapid adoption of MCP has led to a flood of unreliable servers, many little more than thin API wrappers. Too often, setups just don't work as expected. That's where evals come in: they help you define and measure what a “good” tool result really is.

            In this post, I'll quickly explain evals and show you how to use them, featuring the evals package I've been building to make implementation easy.

            Introducing Evals

Evals are a way to test the qualitative outputs you get from a Large Language Model (LLM) via additional prompts. Derived from the term “evaluate”, evals help you answer questions such as:

            • Did my tool provide an accurate and helpful answer?
            • What types of prompts does my tool answer well, and what types of prompts does it struggle with?
            • Did the client select the right tool?
• What types of user inputs lead to my tool being chosen?
            • How does my tool handle malformed inputs?
• How do different models affect tool selection and output?
            • How do tool call descriptions affect tool call selection?

            You might be wondering: if you're asking an LLM to evaluate its output, how can the evaluator be any wiser?

            The key is that you already know what a good answer should look like. By using targeted eval prompts — such as "Did the answer include X?" or "Did it follow format Y?" — you provide clear criteria for what constitutes a strong response.

            In essence, you supply the context for evaluation within the prompt itself.
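To make that concrete, here is a minimal sketch of the pattern. This is not the mcp-evals API; the helper name, prompt wording, and use of the Vercel AI SDK are assumptions for illustration. The tool's answer and the criteria you already know are folded into a single grading prompt for a judge model.

import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Hypothetical helper: grade an answer against criteria you already know.
async function judgeAnswer(answer: string, criteria: string[]): Promise<string> {
  const prompt = [
    "You are grading a tool's answer against explicit criteria.",
    `Answer:\n${answer}`,
    "For each criterion, reply PASS or FAIL with a one-line reason:",
    ...criteria.map((criterion, i) => `${i + 1}. ${criterion}`),
  ].join("\n\n");

  // The judge is only as wise as the criteria you hand it.
  const { text } = await generateText({ model: openai("gpt-4"), prompt });
  return text;
}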

            How to create an effective eval prompt

            The key to creating an effective eval prompt is providing additional context and using measurable heuristics. Context often takes the form of specific elements that you know should exist in the correct answer.

For instance, let's say you're interested in the weather today, so you give the model an initial prompt: “What is the weather like today?” You already know what information to expect, so well-crafted evals will check for that in the response. The following are good and bad examples of evals:

• Initial Prompt: What is the weather today?
  Bad: How helpful was the answer?
  Good: Did the response include the temperature in Celsius?

• Initial Prompt: What is the weather today?
  Bad: Was the weather returned?
  Good: Did the response provide a location?

• Initial Prompt: What is the weather today?
  Bad: Was the weather returned?
  Good: Did the response include the UV index?

Notice that we include specific information such as “temperature in Celsius,” “location,” and “UV index.” Next, let's look at a few more examples:

• Initial Prompt: Create a graph in JavaScript.
  Bad: Did you like the response?
  Good: Does the JavaScript graph include labels on the axes?

• Initial Prompt: Create an article about evals.
  Bad: How helpful is this article?
  Good: Does the article cover all topics covered in (insert source material)?

• Initial Prompt: Create an article about evals.
  Bad: How helpful is this article?
  Good: Does the article include a table of contents?

Again, in these examples, we provide further context to the model in the eval prompt by asking for specific elements that we expect in its initial output.
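As a rough sketch of how the weather examples above could be expressed as data (the interface and names here are assumptions, not part of any library), each initial prompt is paired with the specific elements you expect back; each criterion can then be fed to a judge like the judgeAnswer sketch earlier.

// Hypothetical structure pairing an initial prompt with measurable criteria.
interface EvalCase {
  prompt: string;
  criteria: string[];
}

const weatherEvalCase: EvalCase = {
  prompt: "What is the weather today?",
  criteria: [
    "Did the response include the temperature in Celsius?",
    "Did the response provide a location?",
    "Did the response include the UV index?",
  ],
};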

            How are evals in MCP different than other evals

            There are a few notable differences regarding evals in MCP. The most significant difference is in the tool selection process. Your tool descriptions have a considerable impact on which tools are selected.

            Crafting tool descriptions is crucial for ensuring the correct tool is selected and that the appropriate variables are passed into the tool.
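For example, here is a minimal sketch of a tool registered with the MCP TypeScript SDK, where the description spells out when the tool should be used and what its parameter means. The tool name, description wording, and stubbed response are assumptions for illustration.

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

const server = new McpServer({ name: "weather-server", version: "1.0.0" });

// A precise description helps the client pick this tool over similar ones
// and pass the right arguments into it.
server.tool(
  "get_weather",
  "Get the current weather for a specific city, including temperature in Celsius and UV index. " +
    "Use this when the user asks about current conditions. Requires a city name, not coordinates.",
  { city: z.string().describe("City name, e.g. 'Berlin'") },
  async ({ city }) => ({
    // Stubbed response for illustration; a real server would call a weather API here.
    content: [{ type: "text", text: `Weather for ${city}: 21°C, UV index 4` }],
  }),
);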

            How to leverage evals in your MCP server

            I decided to open-source the evals package I was using while building my own MCP servers.

It consists of a TypeScript package that you can run with npx, for example npx mcp-eval path/to/your/evals.ts path/to/your/server.ts, and an optional GitHub Action if you just want to run it on pull requests. It also creates a client, allowing you to simulate which tools are being selected.

            Below, I'll guide you through setting up evaluations on your server. You can find the complete documentation here.

            The first step is to install the package.

            npm install mcp-evals

After installing the package, you'll need to set up the functions that run your tests. I typically keep these in src/evals/evals.ts. For projects with a large number of evaluations, I split the individual evaluations into their own files and re-export them all from src/evals/evals.ts.

import { grade, EvalFunction, EvalConfig } from "mcp-evals";
import { openai } from "@ai-sdk/openai";

const analyze_databaseEval: EvalFunction = {
  name: 'analyze_database Evaluation',
  description: 'Evaluates the analyze_database tool functionality',
  run: async () => {
    const result = await grade(openai("gpt-4"), "Please analyze the configuration of my PostgreSQL database using connection string: postgres://user:pass@localhost:5432/mydb");
    return JSON.parse(result);
  }
};

const get_setup_instructionsEval: EvalFunction = {
  name: 'get_setup_instructions Evaluation',
  description: 'Evaluates the get_setup_instructions tool functionality',
  run: async () => {
    const result = await grade(openai("gpt-4"), "I want to set up a PostgreSQL database on Linux for production usage. What steps should I follow?");
    return JSON.parse(result);
  }
};

const debug_databaseEval: EvalFunction = {
  name: 'debug_database Evaluation',
  description: 'Evaluates the debug_database tool functionality',
  run: async () => {
    const result = await grade(openai("gpt-4"), "I am experiencing replication issues with my PostgreSQL database. How can I debug them?");
    return JSON.parse(result);
  }
};

const config: EvalConfig = {
  model: openai("gpt-4"),
  evals: [analyze_databaseEval, get_setup_instructionsEval, debug_databaseEval]
};

export default config;
export const evals = [analyze_databaseEval, get_setup_instructionsEval, debug_databaseEval];
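If you do split evals across files as described above, the aggregator file can simply import and re-export them. A sketch of that layout, with assumed file names and each sibling file exporting its own eval, might look like this:

// src/evals/evals.ts - aggregates evals defined in sibling files
import { EvalConfig } from "mcp-evals";
import { openai } from "@ai-sdk/openai";

import { analyze_databaseEval } from "./analyze_database.eval";
import { get_setup_instructionsEval } from "./get_setup_instructions.eval";
import { debug_databaseEval } from "./debug_database.eval";

const config: EvalConfig = {
  model: openai("gpt-4"),
  evals: [analyze_databaseEval, get_setup_instructionsEval, debug_databaseEval],
};

export default config;
export const evals = config.evals;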

            Running Evals in Continuous Integration (CI)

            Similar to unit tests, I often run my evals during CI and output the results to a comment on the PR. Below is an example GitHub action that implements this design.

name: AI Evals
on: [push]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Dependencies
        run: npm install mcp-evals openai
      - name: Run Evals
        run: npx mcp-eval src/evals.ts src/server.ts

            Conclusion

            Evals are an excellent way to ensure you are building reliable and helpful MCP servers. They verify that the right tools are being chosen and can catch misleading or incorrect tool call results. As the ecosystem matures, I'm excited to see what else can be done to improve the reliability and functionality of MCP servers.

            Written by Mat Lenhard (@mclenhard)