# run_skill_test
Execute skill test suites to validate AI agent performance through deterministic checks and rubric-based evaluation. Use for regression testing and quality assurance of coding agents.
## Instructions
Run a skill test suite against a SKILL.md. Executes two evaluation phases: Phase 1 (deterministic) checks tool calls, file operations, commands run, output content, and token budgets. Phase 2 (rubric) uses LLM-as-judge to score output quality against a defined rubric. Call this after writing skill tests or after any change to the skill or agent. Use --no-rubric for fast Phase 1-only checks with no LLM cost.
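The handler shown under Implementation Reference below translates these options into an `evalview skill test` CLI invocation. As a minimal sketch (the YAML path is a placeholder, and actually running the command requires `evalview` on your PATH):

```python
import subprocess  # only needed if you uncomment the run line

# The command the MCP handler builds for a Phase 1-only run.
cmd = ["evalview", "skill", "test", "tests/my-skill-tests.yaml", "--json", "--no-rubric"]
# subprocess.run(cmd, check=True)  # requires evalview to be installed
print(" ".join(cmd))
```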
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| test_file | Yes | Path to the skill test YAML file (e.g. 'tests/my-skill-tests.yaml') | |
| agent | No | Agent type to test against: 'claude-code', 'system-prompt', 'codex', 'langgraph', 'crewai', 'openai-assistants', 'custom'. Defaults to value in YAML. | |
| no_rubric | No | Skip Phase 2 rubric evaluation — run deterministic checks only (faster, no LLM cost). Default: false. | |
| model | No | Model to use for evaluation (default: claude-sonnet-4-20250514) | |
| verbose | No | Show detailed output for all tests, not just failures. Default: false. | |
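The only hard constraint in the schema is that `test_file` is required; the other fields are optional. A hedged sketch of two argument payloads that satisfy the table above (paths and values are illustrative):

```python
# Minimal payload: only the required field.
minimal = {"test_file": "tests/my-skill-tests.yaml"}

# Fully specified payload using the documented defaults and enum values.
full = {
    "test_file": "tests/my-skill-tests.yaml",
    "agent": "claude-code",
    "no_rubric": True,
    "model": "claude-sonnet-4-20250514",
    "verbose": False,
}

def has_required(args: dict) -> bool:
    """Check the one required field from the schema: 'test_file'."""
    return bool(args.get("test_file"))

print(has_required(minimal), has_required(full))
```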
## Implementation Reference
- `evalview/mcp_server.py:448-460` (handler): the logic handling the `run_skill_test` tool, which constructs and executes the `evalview skill test` CLI command.

```python
elif name == "run_skill_test":
    test_file = os.path.normpath(args.get("test_file", ""))
    if not test_file:
        return "Error: 'test_file' is required."
    cmd = ["evalview", "skill", "test", test_file, "--json"]
    if args.get("agent"):
        cmd += ["--agent", args["agent"]]
    if args.get("no_rubric") is True:
        cmd += ["--no-rubric"]
    if args.get("model"):
        cmd += ["--model", args["model"]]
    if args.get("verbose") is True:
        cmd += ["--verbose"]
```

- `evalview/mcp_server.py:204-229` (schema): the MCP tool registration and input schema definition for `run_skill_test`.
```python
{
    "name": "run_skill_test",
    "description": (
        "Run a skill test suite against a SKILL.md. "
        "Executes two evaluation phases: "
        "Phase 1 (deterministic) checks tool calls, file operations, commands run, output content, and token budgets. "
        "Phase 2 (rubric) uses LLM-as-judge to score output quality against a defined rubric. "
        "Call this after writing skill tests or after any change to the skill or agent. "
        "Use --no-rubric for fast Phase 1-only checks with no LLM cost."
    ),
    "inputSchema": {
        "type": "object",
        "required": ["test_file"],
        "properties": {
            "test_file": {
                "type": "string",
                "description": "Path to the skill test YAML file (e.g. 'tests/my-skill-tests.yaml')",
            },
            "agent": {
                "type": "string",
                "description": "Agent type to test against: 'claude-code', 'system-prompt', 'codex', 'langgraph', 'crewai', 'openai-assistants', 'custom'. Defaults to value in YAML.",
            },
            "no_rubric": {
                "type": "boolean",
                "description": "Skip Phase 2 rubric evaluation — run deterministic checks only (faster, no LLM cost). Default: false.",
            },
```
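The handler's mapping from MCP arguments to CLI flags can be exercised standalone. This sketch reproduces the dispatch logic above as a plain function so the flag translation is easy to verify (the function name `build_cmd` is illustrative, not part of the codebase):

```python
import os

def build_cmd(args: dict) -> list[str]:
    # Mirrors the run_skill_test handler: translate MCP arguments
    # into an `evalview skill test` CLI invocation.
    test_file = os.path.normpath(args.get("test_file", ""))
    cmd = ["evalview", "skill", "test", test_file, "--json"]
    if args.get("agent"):
        cmd += ["--agent", args["agent"]]
    if args.get("no_rubric") is True:
        cmd += ["--no-rubric"]
    if args.get("model"):
        cmd += ["--model", args["model"]]
    if args.get("verbose") is True:
        cmd += ["--verbose"]
    return cmd

print(build_cmd({"test_file": "tests/my-skill-tests.yaml", "no_rubric": True}))
```

Note that boolean flags are only appended when the argument is literally `True`, so a missing or `False` value leaves the default behavior (rubric evaluation on, quiet output) intact.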