create_test
Create YAML test cases for AI agents to capture expected behavior and establish baselines for regression testing.
Instructions
Create a new EvalView test case YAML file for an agent. Call this when the user asks to add a test, or when you want to capture expected agent behavior. After creating a test, call run_snapshot to establish the baseline. No YAML knowledge required — just describe the test. IMPORTANT: Automatically detect test_path by looking for a 'tests/evalview/' directory in the current project. If found, use it. Otherwise use 'tests'.
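For example, a caller might pass arguments like these (field names follow the input schema below; the values themselves are illustrative):

```python
# Hypothetical example arguments for a create_test tool call.
# Field names come from the input schema; values are illustrative.
args = {
    "name": "calculator-division",
    "query": "What is 144 / 12?",
    "expected_tools": ["calculator"],
    "forbidden_tools": ["bash"],        # hard-fail if the agent ever calls bash
    "expected_output_contains": ["12"],
    "min_score": 80,
    # "test_path" is omitted: the server auto-detects 'tests/evalview/' if it
    # exists in the project, otherwise it falls back to 'tests'.
}
```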
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| name | Yes | Test name (e.g. 'calculator-division', 'weather-lookup') | |
| query | Yes | The input query to send to the agent | |
| description | No | Human-readable description of what this test covers | |
| expected_tools | No | Tool names the agent should call (e.g. ['calculator', 'search']) | |
| forbidden_tools | No | Tool names the agent must NEVER call. Any violation is an immediate hard-fail (score=0, passed=false) regardless of output quality. Use this for safety contracts — e.g. a read-only agent that must never call edit_file, bash, or write_file. Matching is case-insensitive: 'EditFile' catches 'edit_file'. | |
| expected_output_contains | No | Strings that must appear in the agent's output | |
| min_score | No | Minimum passing score, 0-100 | 70 |
| test_path | No | Directory to save the test file. Auto-detect: use 'tests/evalview/' if it exists in the project, otherwise 'tests'. | |
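Given inputs like the ones described above, the generated test file follows a fixed layout (values here are illustrative; the comment line is emitted by the handler whenever `forbidden_tools` is set):

```yaml
name: "calculator-division"
description: "Checks basic division"

input:
  query: "What is 144 / 12?"

expected:
  tools:
    - calculator
  # Tools that must NEVER be called — hard-fail if any are invoked.
  forbidden_tools:
    - bash
  output:
    contains:
      - "12"

thresholds:
  min_score: 70
```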
Implementation Reference
- `evalview/mcp_server.py:345-406` (handler): The `_create_test` method implements the `create_test` tool, writing a YAML test file built from the provided arguments.
```python
def _create_test(self, args: Dict[str, Any]) -> str:
    test_name = args.get("name", "").strip()
    query = args.get("query", "").strip()
    if not test_name or not query:
        return "Error: 'name' and 'query' are required."
    test_path = args.get("test_path", self.test_path)
    slug = test_name.lower().replace(" ", "-").replace("_", "-")
    filename = os.path.join(test_path, f"{slug}.yaml")
    if os.path.exists(filename):
        return f"Error: test already exists at {filename}. Delete it first or choose a different name."
    os.makedirs(test_path, exist_ok=True)
    lines = [f'name: "{test_name}"']
    description = args.get("description", "")
    if description:
        lines.append(f'description: "{description}"')
    lines += ["", "input:", f'  query: "{query}"', "", "expected:"]
    expected_tools = args.get("expected_tools", [])
    if expected_tools:
        lines.append("  tools:")
        for t in expected_tools:
            lines.append(f"    - {t}")
    forbidden_tools = args.get("forbidden_tools", [])
    if forbidden_tools:
        lines.append("  # Tools that must NEVER be called — hard-fail if any are invoked.")
        lines.append("  forbidden_tools:")
        for t in forbidden_tools:
            lines.append(f"    - {t}")
    expected_output = args.get("expected_output_contains", [])
    if expected_output:
        lines.append("  output:")
        lines.append("    contains:")
        for s in expected_output:
            lines.append(f'      - "{s}"')
    min_score = args.get("min_score", 70)
    lines += ["", "thresholds:", f"  min_score: {int(min_score)}"]
    with open(filename, "w") as f:
        f.write("\n".join(lines) + "\n")
    summary_parts = [f"query: {query}"]
    if expected_tools:
        summary_parts.append(f"tools: {', '.join(expected_tools)}")
    if forbidden_tools:
        summary_parts.append(f"forbidden_tools (hard-fail): {', '.join(forbidden_tools)}")
    if expected_output:
        summary_parts.append(f"output contains: {', '.join(expected_output)}")
    return (
        f"Created {filename}\n"
        + "\n".join(f"  {p}" for p in summary_parts)
        + "\n\nRun run_snapshot to capture the baseline for this test."
    )
```
- `evalview/mcp_server.py:409-410` (registration): The tool is registered and dispatched from the `call_tool` handler in `evalview/mcp_server.py`.
```python
if name == "create_test":
    return self._create_test(args)
```
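The schema notes that forbidden-tool matching is case-insensitive ('EditFile' catches 'edit_file'). One way to sketch such a normalization is below; this is illustrative only, and EvalView's actual matcher may differ (the function name is hypothetical):

```python
import re

def normalize_tool_name(name: str) -> str:
    """Collapse case and separators so 'EditFile', 'edit_file',
    and 'edit-file' all compare equal.

    Illustrative sketch only; not the actual EvalView implementation.
    """
    return re.sub(r"[^a-z0-9]", "", name.lower())
```

With this normalization, `normalize_tool_name("EditFile")` and `normalize_tool_name("edit_file")` both reduce to `"editfile"`, so a forbidden-tool check can compare normalized names directly.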