# Evaluation Implementation Guide
This guide explains how to create evaluation tests (`.eval.ts` files) for testing AI model interactions with specific tools or systems, such as Cloudflare Worker bindings or container environments.
## What are Evals?
Evals are automated tests designed to verify if an AI model correctly understands instructions and utilizes its available "tools" (functions, API calls, environment interactions) to achieve a desired outcome. They assess the model's ability to follow instructions, select appropriate tools, and provide correct arguments to those tools.
## Core Concepts
Evals are typically built using a testing framework like `vitest` combined with specialized evaluation libraries like `vitest-evals`. The main structure revolves around `describeEval`:
```typescript
import { expect } from 'vitest'
import { describeEval } from 'vitest-evals'
import { checkFactuality } from '@repo/eval-tools/src/scorers'
import { eachModel } from '@repo/eval-tools/src/test-models'
import { initializeClient, runTask } from './utils' // Helper functions
// Optional: eachModel runs the same suite against multiple models
eachModel('$modelName', ({ model }) => {
	describeEval('A descriptive name for the evaluation suite', {
		data: async () => [
			/* Test cases */
		],
		task: async (input) => {
			/* Test logic */
		},
		scorers: [
			/* Scoring functions */
		],
		threshold: 1, // Passing score threshold
		timeout: 60000, // Test timeout
	})
})
```
### Key Parts:
1.  **`describeEval(name, options)`**: Defines a suite of evaluation tests.
    - `name`: A string describing the purpose of the eval suite.
    - `options`: An object containing the configuration for the eval:
      - **`data`**: An async function returning an array of test case objects. Each object typically contains:
        - `input`: (string) The instruction given to the AI model.
        - `expected`: (string) A natural language description of the _expected_ sequence of actions or outcome. This is used by scorers.
      - **`task`**: An async function that executes the actual test logic for a given `input`. It orchestrates the interaction with the AI/system and performs assertions.
      - **`scorers`**: An array of scoring functions (e.g., `checkFactuality`) that evaluate the test outcome based on the `promptOutput` from the `task` and the `expected` string from the `data`.
      - **`threshold`**: (number, usually between 0 and 1) The minimum score required from the scorers for the test case to pass. A threshold of `1` means a perfect score is required.
      - **`timeout`**: (number) Maximum time in milliseconds allowed for a single test case.
2.  **`task(input)` Function**: The heart of the eval. It typically involves:
    - **Setup**: Initializing a client or test environment (`initializeClient`). This prepares the system for the test, configuring available tools or connections.
    - **Execution**: Running the actual interaction (`runTask`). This function sends the `input` instruction to the AI model via the client and captures the results, which usually include:
      - `promptOutput`: The textual response from the AI model.
      - `toolCalls`: A structured list of the tools the AI invoked, along with the arguments passed to each tool (the result shape is sketched just after this list).
    - **Assertions (`expect`)**: Using the testing framework's assertion library (`vitest`'s `expect` in the examples) to verify that the correct tools were called with the correct arguments based on the `toolCalls` data. Sometimes, this involves direct interaction with the system state (e.g., reading a file created by a tool) to confirm the outcome.
    - **Return Value**: The `task` function usually returns the `promptOutput` to be evaluated by the `scorers`.
3.  **Scoring (`checkFactuality`, etc.)**: Automated functions that compare the actual outcome (the returned `promptOutput`, together with the assertions that already passed inside the `task`) against the `expected` description.
4.  **Helper Utilities (`./utils`)**:
    - `initializeClient()`: Sets up the testing environment, connects to the system under test, and configures the available tools for the AI model.
    - `runTask(client, model, input)`: Sends the input prompt to the specified AI model using the configured client, executes the model's reasoning and tool use, and returns the results (`promptOutput`, `toolCalls`).
    - `eachModel()`: (Optional) A utility to run the same evaluation suite against multiple different AI models.
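For reference, the shapes these helpers trade in look roughly like the following. This is a minimal sketch: the real types live in your own `./utils`, and the names here simply mirror how they are used throughout this guide rather than a published API.
```typescript
// Illustrative shapes only — your ./utils defines the real types.
export interface ToolCall {
	toolName: string // which tool the model invoked, e.g. 'kv_write'
	args: Record<string, unknown> // arguments the model passed to that tool
}

export interface TaskResult {
	promptOutput: string // the model's final text response (returned for scoring)
	toolCalls: ToolCall[] // every tool invocation captured during the run
}

// Assumed signatures, matching how the helpers are called in this guide:
export declare function initializeClient(): Promise<unknown>
export declare function runTask(client: unknown, model: string, input: string): Promise<TaskResult>
```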
## Steps to Implement Evals
1.  **Identify Tools:** Define the specific actions or functions (the "tools") that the AI should be able to use within the system you're testing (e.g., `kv_write`, `d1_query`, `container_exec`).
2.  **Create Helper Functions:** Implement your `initializeClient` and `runTask` (or similarly named) functions (a sketch follows this list).
    - `initializeClient`: Should set up the necessary context, potentially using test environments like `vitest-environment-miniflare` for Workers. It needs to make the defined tools available to the AI model simulation.
    - `runTask`: Needs to simulate the AI processing: take an input prompt, interact with an LLM (or a mock) configured with the tools, capture which tools are called and with what arguments, and capture the final text output.
3.  **Create Eval File (`*.eval.ts`):** Create a new file (e.g., `kv-operations.eval.ts`).
4.  **Import Dependencies:** Import `describeEval`, scorers, helpers, `expect`, etc.
5.  **Structure with `describeEval`:** Define your evaluation suite.
6.  **Define Test Cases (`data`):** Write specific test scenarios:
    - Provide clear, unambiguous `input` prompts that target the tools you want to test.
    - Write concise `expected` descriptions detailing the primary tool calls or outcomes anticipated.
7.  **Implement the `task` Function:**
    - Call `initializeClient`.
    - Call `runTask` with the `input`.
    - Write `expect` assertions to rigorously check:
      - Were the correct tools called? (`toolName`)
      - Were they called in the expected order (if applicable)?
      - Were the arguments passed to the tools correct? (`args`)
      - (Optional) Interact with the system state if necessary to verify side effects.
    - Return the `promptOutput`.
8.  **Configure Scorers and Threshold:** Choose appropriate scorers (often `checkFactuality`) and set a `threshold`.
9.  **Run Tests:** Execute the evals using your test runner (e.g., `vitest run`).
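The helpers referenced in step 2 can be sketched as follows. This is a minimal, hypothetical implementation assuming the Vercel AI SDK v4 (`ai`, `@ai-sdk/openai`) and `zod`; the single `my_tool` tool and the OpenAI provider are placeholders, not this repository's actual setup.
```typescript
// utils.ts — a minimal sketch assuming the Vercel AI SDK v4 (`ai`, `@ai-sdk/openai`) and zod.
// The `my_tool` tool and the OpenAI provider below are placeholders for illustration.
import { generateText, tool } from 'ai'
import { openai } from '@ai-sdk/openai'
import { z } from 'zod'

export async function initializeClient() {
	// Register the tools the model may call. A real helper would wire these up to
	// the system under test (e.g. a Miniflare/Workers test environment or a container).
	return {
		tools: {
			my_tool: tool({
				description: 'Process a piece of data',
				parameters: z.object({ data: z.string() }),
				execute: async ({ data }) => ({ processed: data }),
			}),
		},
	}
}

export async function runTask(
	client: Awaited<ReturnType<typeof initializeClient>>,
	modelName: string,
	input: string
) {
	const result = await generateText({
		model: openai(modelName), // swap in whichever provider/model you are evaluating
		tools: client.tools,
		maxSteps: 5, // allow the model to chain several tool calls
		prompt: input,
	})
	// Collect tool calls from every step so the eval can assert on all of them.
	const toolCalls = result.steps.flatMap((step) => step.toolCalls)
	return { promptOutput: result.text, toolCalls }
}
```
Flattening `toolCalls` across all steps keeps the eval's assertions simple when the model needs several tool calls to complete a task.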
## Example Structure (Simplified)
```typescript
// my-feature.eval.ts
import { expect } from 'vitest'
import { describeEval } from 'vitest-evals'
import { checkFactuality } from '@repo/eval-tools/src/scorers'
import { initializeClient, runTask } from './utils'
describeEval('Tests My Feature Tool Interactions', {
	data: async () => [
		{
			input: 'Use my_tool to process the data "example"',
			expected: 'The my_tool tool was called with data set to "example"',
		},
		// ... more test cases
	],
	task: async (input) => {
		const client = await initializeClient() // Sets up environment with my_tool
		const { promptOutput, toolCalls } = await runTask(client, 'your-model', input)
		// Check if my_tool was called
		const myToolCall = toolCalls.find((call) => call.toolName === 'my_tool')
		expect(myToolCall).toBeDefined()
		// Check arguments passed to my_tool
		expect(myToolCall?.args).toEqual(
			expect.objectContaining({
				data: 'example',
				// ... other expected args
			})
		)
		return promptOutput // Return AI output for scoring
	},
	scorers: [checkFactuality],
	threshold: 1,
})
```
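To run this, vitest needs to pick up `.eval.ts` files, which its default `include` pattern (`.test`/`.spec` files) does not cover. One possible configuration, assumed here for illustration (your repository's config may differ):
```typescript
// vitest.config.ts — one way to include *.eval.ts files (illustrative)
import { defineConfig } from 'vitest/config'

export default defineConfig({
	test: {
		include: ['**/*.eval.ts'], // picked up by `vitest run`
		testTimeout: 60_000, // model calls are slow; align with the per-eval timeout
	},
})
```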
## Best Practices
- **Clear Inputs:** Write inputs as clear, actionable instructions.
- **Specific Expected Outcomes:** Make `expected` descriptions precise enough for scorers but focus on the key actions.
- **Targeted Assertions:** Use `expect` to verify the most critical aspects of tool calls (tool name, key arguments). Don't over-assert on trivial details unless necessary.
- **Isolate Tests:** Ensure each test case in `data` tests a specific interaction or a small sequence of interactions.
- **Helper Functions:** Keep `initializeClient` and `runTask` generic enough to be reused across different eval files for the same system.
- **Use `expect.objectContaining` or `expect.stringContaining`:** Often, you only need to verify _parts_ of the arguments, not the entire structure, making tests less brittle (see the snippet after this list).
- **Descriptive Names:** Use clear names for `describeEval` blocks and meaningful `input`/`expected` strings.
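As an example of the partial-matching practice above, an assertion on a hypothetical `d1_query` tool call (the tool and argument names are made up) might look like this inside a `task`:
```typescript
// Hypothetical: assumes a 'd1_query' tool was exposed and captured in toolCalls.
const queryCall = toolCalls.find((call) => call.toolName === 'd1_query')
expect(queryCall).toBeDefined()
expect(queryCall?.args).toEqual(
	expect.objectContaining({
		// only the argument under test; other args are ignored
		query: expect.stringContaining('SELECT'),
	})
)
```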