# Evaluation Implementation Guide

This guide explains how to create evaluation tests (`.eval.ts` files) for testing AI model interactions with specific tools or systems, such as Cloudflare Worker bindings or container environments.

## What are Evals?

Evals are automated tests designed to verify if an AI model correctly understands instructions and utilizes its available "tools" (functions, API calls, environment interactions) to achieve a desired outcome. They assess the model's ability to follow instructions, select appropriate tools, and provide correct arguments to those tools.

## Core Concepts

Evals are typically built using a testing framework like `vitest` combined with specialized evaluation libraries like `vitest-evals`. The main structure revolves around `describeEval`:

```typescript
import { expect } from 'vitest'
import { describeEval } from 'vitest-evals'

import { checkFactuality } from '@repo/eval-tools/src/scorers'
import { eachModel } from '@repo/eval-tools/src/test-models'

import { initializeClient, runTask } from './utils' // Helper functions

eachModel('$modelName', ({ model }) => {
	// Optional: Run tests for multiple models
	describeEval('A descriptive name for the evaluation suite', {
		data: async () => [
			/* Test cases */
		],
		task: async (input) => {
			/* Test logic */
		},
		scorers: [
			/* Scoring functions */
		],
		threshold: 1, // Passing score threshold
		timeout: 60000, // Test timeout
	})
})
```

### Key Parts:

1. **`describeEval(name, options)`**: Defines a suite of evaluation tests.
   - `name`: A string describing the purpose of the eval suite.
   - `options`: An object containing the configuration for the eval:
     - **`data`**: An async function returning an array of test case objects. Each object typically contains:
       - `input`: (string) The instruction given to the AI model.
       - `expected`: (string) A natural language description of the _expected_ sequence of actions or outcome. This is used by scorers.
     - **`task`**: An async function that executes the actual test logic for a given `input`. It orchestrates the interaction with the AI/system and performs assertions.
     - **`scorers`**: An array of scoring functions (e.g., `checkFactuality`) that evaluate the test outcome based on the `promptOutput` from the `task` and the `expected` string from the `data`.
     - **`threshold`**: (number, usually between 0 and 1) The minimum score required from the scorers for the test case to pass. A threshold of `1` means a perfect score is required.
     - **`timeout`**: (number) Maximum time in milliseconds allowed for a single test case.

2. **`task(input)` Function**: The heart of the eval. It typically involves:
   - **Setup**: Initializing a client or test environment (`initializeClient`). This prepares the system for the test, configuring available tools or connections.
   - **Execution**: Running the actual interaction (`runTask`). This function sends the `input` instruction to the AI model via the client and captures the results, which usually include:
     - `promptOutput`: The textual response from the AI model.
     - `toolCalls`: A structured list of the tools the AI invoked, along with the arguments passed to each tool.
   - **Assertions (`expect`)**: Using the testing framework's assertion library (`vitest`'s `expect` in the examples) to verify that the correct tools were called with the correct arguments based on the `toolCalls` data. Sometimes this involves direct interaction with the system state (e.g., reading a file created by a tool) to confirm the outcome.
   - **Return Value**: The `task` function usually returns the `promptOutput` to be evaluated by the `scorers`.

3. **Scoring (`checkFactuality`, etc.)**: Automated functions that compare the actual outcome (represented by the `promptOutput` and implicitly by the assertions passed within the `task`) against the `expected` description.

4. **Helper Utilities (`./utils`)** (see the sketch after this list):
   - `initializeClient()`: Sets up the testing environment, connects to the system under test, and configures the available tools for the AI model.
   - `runTask(client, model, input)`: Sends the input prompt to the specified AI model using the configured client, executes the model's reasoning and tool use, and returns the results (`promptOutput`, `toolCalls`).
   - `eachModel()`: (Optional) A utility to run the same evaluation suite against multiple different AI models.
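The helpers in `./utils` are not reproduced in this guide. Purely as a sketch of the shape they might take, here is one possible implementation on top of the Vercel AI SDK; the tool definition, the `maxSteps` value, and the exact return shape are assumptions for illustration, not the repository's actual helper code:

```typescript
// utils.ts — illustrative sketch only, not the repository's actual helpers
import { generateText, tool, type LanguageModel } from 'ai'
import { z } from 'zod'

// A stand-in "client": the set of tools the model is allowed to call.
export async function initializeClient() {
	return {
		tools: {
			my_tool: tool({
				description: 'Processes a piece of data',
				parameters: z.object({ data: z.string() }),
				execute: async ({ data }) => `processed: ${data}`,
			}),
		},
	}
}

// Sends the prompt to the model, lets it call tools, and returns both the
// final text (for the scorers) and the structured tool calls (for assertions).
export async function runTask(
	client: Awaited<ReturnType<typeof initializeClient>>,
	model: LanguageModel,
	input: string
) {
	const result = await generateText({
		model,
		tools: client.tools,
		prompt: input,
		maxSteps: 5, // allow a tool call followed by a final text response
	})
	return {
		promptOutput: result.text,
		toolCalls: result.steps.flatMap((step) => step.toolCalls),
	}
}
```

Whatever the real implementation looks like, the important part is the contract: `runTask` returns both the model's final text (scored against `expected`) and the structured `toolCalls` (checked with `expect`).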
## Steps to Implement Evals

1. **Identify Tools:** Define the specific actions or functions (the "tools") that the AI should be able to use within the system you're testing (e.g., `kv_write`, `d1_query`, `container_exec`).
2. **Create Helper Functions:** Implement your `initializeClient` and `runTask` (or similarly named) functions.
   - `initializeClient`: Should set up the necessary context, potentially using test environments like `vitest-environment-miniflare` for workers. It needs to make the defined tools available to the AI model simulation.
   - `runTask`: Needs to simulate the AI processing: take an input prompt, interact with an LLM (or a mock) configured with the tools, capture which tools are called and with what arguments, and capture the final text output.
3. **Create Eval File (`*.eval.ts`):** Create a new file (e.g., `kv-operations.eval.ts`).
4. **Import Dependencies:** Import `describeEval`, scorers, helpers, `expect`, etc.
5. **Structure with `describeEval`:** Define your evaluation suite.
6. **Define Test Cases (`data`):** Write specific test scenarios:
   - Provide clear, unambiguous `input` prompts that target the tools you want to test.
   - Write concise `expected` descriptions detailing the primary tool calls or outcomes anticipated.
7. **Implement the `task` Function:**
   - Call `initializeClient`.
   - Call `runTask` with the `input`.
   - Write `expect` assertions to rigorously check:
     - Were the correct tools called? (`toolName`)
     - Were they called in the expected order (if applicable)?
     - Were the arguments passed to the tools correct? (`args`)
   - (Optional) Interact with the system state if necessary to verify side effects.
   - Return the `promptOutput`.
8. **Configure Scorers and Threshold:** Choose appropriate scorers (often `checkFactuality`) and set a `threshold`.
9. **Run Tests:** Execute the evals using your test runner (e.g., `vitest run`), as sketched below.
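For step 9, the runner has to be told to pick up `*.eval.ts` files, which usually fall outside the default test glob. A minimal sketch of a `vitest` configuration that does this, assuming a standard setup (the repository may already ship its own config and scripts):

```typescript
// vitest.config.ts — minimal sketch; adjust to the project's existing setup
import { defineConfig } from 'vitest/config'

export default defineConfig({
	test: {
		include: ['**/*.eval.ts'], // pick up eval files
		testTimeout: 60000, // evals call real models, so allow generous timeouts
	},
})
```

With a config like this in place, `vitest run` executes the eval suites like any other tests.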
## Example Structure (Simplified)

```typescript
// my-feature.eval.ts
import { expect } from 'vitest'
import { describeEval } from 'vitest-evals'

import { checkFactuality } from '@repo/eval-tools/src/scorers'
import { initializeClient, runTask } from './utils'

describeEval('Tests My Feature Tool Interactions', {
	data: async () => [
		{
			input: 'Use my_tool to process the data "example"',
			expected: 'The my_tool tool was called with data set to "example"',
		},
		// ... more test cases
	],
	task: async (input) => {
		const client = await initializeClient() // Sets up environment with my_tool
		const { promptOutput, toolCalls } = await runTask(client, 'your-model', input)

		// Check if my_tool was called
		const myToolCall = toolCalls.find((call) => call.toolName === 'my_tool')
		expect(myToolCall).toBeDefined()

		// Check arguments passed to my_tool
		expect(myToolCall?.args).toEqual(
			expect.objectContaining({
				data: 'example',
				// ... other expected args
			})
		)

		return promptOutput // Return AI output for scoring
	},
	scorers: [checkFactuality],
	threshold: 1,
})
```

## Best Practices

- **Clear Inputs:** Write inputs as clear, actionable instructions.
- **Specific Expected Outcomes:** Make `expected` descriptions precise enough for scorers but focus on the key actions.
- **Targeted Assertions:** Use `expect` to verify the most critical aspects of tool calls (tool name, key arguments). Don't over-assert on trivial details unless necessary.
- **Isolate Tests:** Ensure each test case in `data` tests a specific interaction or a small sequence of interactions.
- **Helper Functions:** Keep `initializeClient` and `runTask` generic enough to be reused across different eval files for the same system.
- **Use `expect.objectContaining` or `expect.stringContaining`:** Often, you only need to verify _parts_ of the arguments, not the entire structure, making tests less brittle (see the sketch below).
- **Descriptive Names:** Use clear names for `describeEval` blocks and meaningful `input`/`expected` strings.
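To illustrate the partial-matching practice: when a tool receives a large argument object, or the model phrases a free-text argument slightly differently between runs, matching only the fields you care about keeps the assertion stable. The snippet below would sit inside a `task` like the one in the example above; the `kv_write` call and its arguments are hypothetical:

```typescript
// Assert only the fields that matter; ignore the rest of the argument object.
const kvCall = toolCalls.find((call) => call.toolName === 'kv_write')
expect(kvCall).toBeDefined()
expect(kvCall?.args).toEqual(
	expect.objectContaining({
		key: 'greeting',
		value: expect.stringContaining('hello'), // exact phrasing may vary by model
	})
)
```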
