# Evaluation Implementation Guide
This guide explains how to create evaluation tests (`.eval.ts` files) for testing AI model interactions with specific tools or systems, such as Cloudflare Worker bindings or container environments.
## What are Evals?
Evals are automated tests designed to verify if an AI model correctly understands instructions and utilizes its available "tools" (functions, API calls, environment interactions) to achieve a desired outcome. They assess the model's ability to follow instructions, select appropriate tools, and provide correct arguments to those tools.
## Core Concepts
Evals are typically built using a testing framework like `vitest` combined with specialized evaluation libraries like `vitest-evals`. The main structure revolves around `describeEval`:
```typescript
import { expect } from 'vitest'
import { describeEval } from 'vitest-evals'
import { checkFactuality } from '@repo/eval-tools/src/scorers'
import { eachModel } from '@repo/eval-tools/src/test-models'
import { initializeClient, runTask } from './utils' // Helper functions
// Optional: eachModel runs the same suite against multiple models
eachModel('$modelName', ({ model }) => {
	describeEval('A descriptive name for the evaluation suite', {
		data: async () => [
			/* Test cases */
		],
		task: async (input) => {
			/* Test logic */
		},
		scorers: [
			/* Scoring functions */
		],
		threshold: 1, // Passing score threshold
		timeout: 60000, // Test timeout
	})
})
```
### Key Parts:
1.  **`describeEval(name, options)`**: Defines a suite of evaluation tests.
    - `name`: A string describing the purpose of the eval suite.
    - `options`: An object containing the configuration for the eval:
      - **`data`**: An async function returning an array of test case objects. Each object typically contains:
        - `input`: (string) The instruction given to the AI model.
        - `expected`: (string) A natural language description of the _expected_ sequence of actions or outcome. This is used by scorers.
      - **`task`**: An async function that executes the actual test logic for a given `input`. It orchestrates the interaction with the AI/system and performs assertions.
      - **`scorers`**: An array of scoring functions (e.g., `checkFactuality`) that evaluate the test outcome based on the `promptOutput` from the `task` and the `expected` string from the `data`.
      - **`threshold`**: (number, usually between 0 and 1) The minimum score required from the scorers for the test case to pass. A threshold of `1` means a perfect score is required.
      - **`timeout`**: (number) Maximum time in milliseconds allowed for a single test case.
2.  **`task(input)` Function**: The heart of the eval. It typically involves:
    - **Setup**: Initializing a client or test environment (`initializeClient`). This prepares the system for the test, configuring available tools or connections.
    - **Execution**: Running the actual interaction (`runTask`). This function sends the `input` instruction to the AI model via the client and captures the results, which usually include:
      - `promptOutput`: The textual response from the AI model.
      - `toolCalls`: A structured list of the tools the AI invoked, along with the arguments passed to each tool (the result shape is sketched just after this list).
    - **Assertions (`expect`)**: Using the testing framework's assertion library (`vitest`'s `expect` in the examples) to verify that the correct tools were called with the correct arguments based on the `toolCalls` data. Sometimes, this involves direct interaction with the system state (e.g., reading a file created by a tool) to confirm the outcome.
    - **Return Value**: The `task` function usually returns the `promptOutput` to be evaluated by the `scorers`.
3.  **Scoring (`checkFactuality`, etc.)**: Automated functions that compare the actual outcome (the returned `promptOutput`, together with the assertions that already passed inside the `task`) against the `expected` description.
4.  **Helper Utilities (`./utils`)**:
    - `initializeClient()`: Sets up the testing environment, connects to the system under test, and configures the available tools for the AI model.
    - `runTask(client, model, input)`: Sends the input prompt to the specified AI model using the configured client, executes the model's reasoning and tool use, and returns the results (`promptOutput`, `toolCalls`).
    - `eachModel()`: (Optional) A utility to run the same evaluation suite against multiple different AI models.
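For reference, the shapes these helpers trade in look roughly like the following. This is a minimal sketch: the real types live in your own `./utils`, and the names here simply mirror how they are used throughout this guide rather than a published API.
```typescript
// Illustrative shapes only — your ./utils defines the real types.
export interface ToolCall {
	toolName: string // which tool the model invoked, e.g. 'kv_write'
	args: Record<string, unknown> // arguments the model passed to that tool
}

export interface TaskResult {
	promptOutput: string // the model's final text response (returned for scoring)
	toolCalls: ToolCall[] // every tool invocation captured during the run
}

// Assumed signatures, matching how the helpers are called in this guide:
export declare function initializeClient(): Promise<unknown>
export declare function runTask(client: unknown, model: string, input: string): Promise<TaskResult>
```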
## Steps to Implement Evals
1.  **Identify Tools:** Define the specific actions or functions (the "tools") that the AI should be able to use within the system you're testing (e.g., `kv_write`, `d1_query`, `container_exec`).
2.  **Create Helper Functions:** Implement your `initializeClient` and `runTask` (or similarly named) functions (a sketch follows this list).
    - `initializeClient`: Should set up the necessary context, potentially using test environments like `vitest-environment-miniflare` for Workers. It needs to make the defined tools available to the AI model simulation.
    - `runTask`: Needs to simulate the AI processing: take an input prompt, interact with an LLM (or a mock) configured with the tools, capture which tools are called and with what arguments, and capture the final text output.
3.  **Create Eval File (`*.eval.ts`):** Create a new file (e.g., `kv-operations.eval.ts`).
4.  **Import Dependencies:** Import `describeEval`, scorers, helpers, `expect`, etc.
5.  **Structure with `describeEval`:** Define your evaluation suite.
6.  **Define Test Cases (`data`):** Write specific test scenarios:
    - Provide clear, unambiguous `input` prompts that target the tools you want to test.
    - Write concise `expected` descriptions detailing the primary tool calls or outcomes anticipated.
7.  **Implement the `task` Function:**
    - Call `initializeClient`.
    - Call `runTask` with the `input`.
    - Write `expect` assertions to rigorously check:
      - Were the correct tools called? (`toolName`)
      - Were they called in the expected order (if applicable)?
      - Were the arguments passed to the tools correct? (`args`)
      - (Optional) Interact with the system state if necessary to verify side effects.
    - Return the `promptOutput`.
8.  **Configure Scorers and Threshold:** Choose appropriate scorers (often `checkFactuality`) and set a `threshold`.
9.  **Run Tests:** Execute the evals using your test runner (e.g., `vitest run`).
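The helpers referenced in step 2 can be sketched as follows. This is a minimal, hypothetical implementation assuming the Vercel AI SDK v4 (`ai`, `@ai-sdk/openai`) and `zod`; the single `my_tool` tool and the OpenAI provider are placeholders, not this repository's actual setup.
```typescript
// utils.ts — a minimal sketch assuming the Vercel AI SDK v4 (`ai`, `@ai-sdk/openai`) and zod.
// The `my_tool` tool and the OpenAI provider below are placeholders for illustration.
import { generateText, tool } from 'ai'
import { openai } from '@ai-sdk/openai'
import { z } from 'zod'

export async function initializeClient() {
	// Register the tools the model may call. A real helper would wire these up to
	// the system under test (e.g. a Miniflare/Workers test environment or a container).
	return {
		tools: {
			my_tool: tool({
				description: 'Process a piece of data',
				parameters: z.object({ data: z.string() }),
				execute: async ({ data }) => ({ processed: data }),
			}),
		},
	}
}

export async function runTask(
	client: Awaited<ReturnType<typeof initializeClient>>,
	modelName: string,
	input: string
) {
	const result = await generateText({
		model: openai(modelName), // swap in whichever provider/model you are evaluating
		tools: client.tools,
		maxSteps: 5, // allow the model to chain several tool calls
		prompt: input,
	})
	// Collect tool calls from every step so the eval can assert on all of them.
	const toolCalls = result.steps.flatMap((step) => step.toolCalls)
	return { promptOutput: result.text, toolCalls }
}
```
Flattening `toolCalls` across all steps keeps the eval's assertions simple when the model needs several tool calls to complete a task.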
## Example Structure (Simplified)
```typescript
// my-feature.eval.ts
import { expect } from 'vitest'
import { describeEval } from 'vitest-evals'
import { checkFactuality } from '@repo/eval-tools/src/scorers'
import { initializeClient, runTask } from './utils'
describeEval('Tests My Feature Tool Interactions', {
	data: async () => [
		{
			input: 'Use my_tool to process the data "example"',
			expected: 'The my_tool tool was called with data set to "example"',
		},
		// ... more test cases
	],
	task: async (input) => {
		const client = await initializeClient() // Sets up environment with my_tool
		const { promptOutput, toolCalls } = await runTask(client, 'your-model', input)
		// Check if my_tool was called
		const myToolCall = toolCalls.find((call) => call.toolName === 'my_tool')
		expect(myToolCall).toBeDefined()
		// Check arguments passed to my_tool
		expect(myToolCall?.args).toEqual(
			expect.objectContaining({
				data: 'example',
				// ... other expected args
			})
		)
		return promptOutput // Return AI output for scoring
	},
	scorers: [checkFactuality],
	threshold: 1,
})
```
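To run this, vitest needs to pick up `.eval.ts` files, which its default `include` pattern (`.test`/`.spec` files) does not cover. One possible configuration, assumed here for illustration (your repository's config may differ):
```typescript
// vitest.config.ts — one way to include *.eval.ts files (illustrative)
import { defineConfig } from 'vitest/config'

export default defineConfig({
	test: {
		include: ['**/*.eval.ts'], // picked up by `vitest run`
		testTimeout: 60_000, // model calls are slow; align with the per-eval timeout
	},
})
```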
## Best Practices
- **Clear Inputs:** Write inputs as clear, actionable instructions.
- **Specific Expected Outcomes:** Make `expected` descriptions precise enough for scorers but focus on the key actions.
- **Targeted Assertions:** Use `expect` to verify the most critical aspects of tool calls (tool name, key arguments). Don't over-assert on trivial details unless necessary.
- **Isolate Tests:** Ensure each test case in `data` tests a specific interaction or a small sequence of interactions.
- **Helper Functions:** Keep `initializeClient` and `runTask` generic enough to be reused across different eval files for the same system.
- **Use `expect.objectContaining` or `expect.stringContaining`:** Often, you only need to verify _parts_ of the arguments, not the entire structure, making tests less brittle (see the snippet after this list).
- **Descriptive Names:** Use clear names for `describeEval` blocks and meaningful `input`/`expected` strings.
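As an example of the partial-matching practice above, an assertion on a hypothetical `d1_query` tool call (the tool and argument names are made up) might look like this inside a `task`:
```typescript
// Hypothetical: assumes a 'd1_query' tool was exposed and captured in toolCalls.
const queryCall = toolCalls.find((call) => call.toolName === 'd1_query')
expect(queryCall).toBeDefined()
expect(queryCall?.args).toEqual(
	expect.objectContaining({
		// only the argument under test; other args are ignored
		query: expect.stringContaining('SELECT'),
	})
)
```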