run_skill_test
Executes a two-phase evaluation of a SKILL.md: deterministic checks on tool calls, file ops, and token budgets, then LLM-as-judge rubric scoring for output quality.
Instructions
Runs a skill test suite against a SKILL.md in two evaluation phases. Phase 1 (deterministic) checks tool calls, file operations, commands run, output content, and token budgets. Phase 2 (rubric) uses an LLM as judge to score output quality against a defined rubric. Call this after writing skill tests or after any change to the skill or agent. Use --no-rubric for fast, Phase 1-only checks with no LLM cost.
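The two-phase flow described above can be pictured with a minimal Python sketch. This is not the tool's implementation: `run_two_phase`, `PhaseResult`, and `pass_threshold` are hypothetical names, and the choice to run Phase 2 even when Phase 1 fails is an assumption, since the description does not say how the phases interact.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class PhaseResult:
    passed: bool
    details: List[str] = field(default_factory=list)


def run_two_phase(
    deterministic_checks: Dict[str, Callable[[], bool]],
    rubric_judge: Optional[Callable[[], float]] = None,
    no_rubric: bool = False,
    pass_threshold: float = 0.7,  # hypothetical cut-off, not from the tool docs
) -> Dict[str, Optional[PhaseResult]]:
    # Phase 1: cheap, deterministic checks (tool calls, file ops, token budgets).
    failures = [name for name, check in deterministic_checks.items() if not check()]
    report: Dict[str, Optional[PhaseResult]] = {
        "phase1": PhaseResult(passed=not failures, details=failures),
        "phase2": None,
    }

    # Phase 2: rubric scoring by a judge; skipped entirely when no_rubric is set.
    if not no_rubric and rubric_judge is not None:
        score = rubric_judge()
        report["phase2"] = PhaseResult(
            passed=score >= pass_threshold,
            details=[f"score={score:.2f}"],
        )
    return report


# Stand-in checks and a stand-in judge; a real judge would call an LLM.
report = run_two_phase(
    {"token_budget": lambda: 1200 <= 2000, "called_read_tool": lambda: True},
    rubric_judge=lambda: 0.85,
)
```

With `no_rubric=True` the judge is never called, which mirrors the "no LLM cost" behavior described for Phase 1-only runs.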
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| test_file | Yes | Path to the skill test YAML file (e.g. 'tests/my-skill-tests.yaml') | |
| agent | No | Agent type to test against: 'claude-code', 'system-prompt', 'codex', 'langgraph', 'crewai', 'openai-assistants', 'custom' | Value in YAML |
| no_rubric | No | Skip Phase 2 rubric evaluation and run deterministic checks only (faster, no LLM cost) | false |
| model | No | Model to use for evaluation | claude-sonnet-4-20250514 |
| verbose | No | Show detailed output for all tests, not just failures | false |
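As an illustration of the schema above, the following Python sketch assembles an arguments object for a `run_skill_test` call. The helper `build_run_skill_test_args` is hypothetical and only builds and validates the payload; how the payload is actually sent depends on the client and is not shown here.

```python
import json
from typing import Optional

# Agent values listed in the schema table.
ALLOWED_AGENTS = {
    "claude-code", "system-prompt", "codex", "langgraph",
    "crewai", "openai-assistants", "custom",
}


def build_run_skill_test_args(
    test_file: str,
    agent: Optional[str] = None,
    no_rubric: bool = False,
    model: Optional[str] = None,
    verbose: bool = False,
) -> dict:
    """Build the arguments object, omitting optional fields left at their defaults."""
    if not test_file:
        raise ValueError("test_file is required")
    if agent is not None and agent not in ALLOWED_AGENTS:
        raise ValueError(f"unknown agent: {agent!r}")

    args: dict = {"test_file": test_file}
    if agent is not None:
        args["agent"] = agent      # otherwise the value in the YAML applies
    if no_rubric:
        args["no_rubric"] = True   # Phase 1 only, no LLM cost
    if model is not None:
        args["model"] = model
    if verbose:
        args["verbose"] = True
    return args


# Fast deterministic-only run against the example path from the table.
fast_args = build_run_skill_test_args("tests/my-skill-tests.yaml", no_rubric=True)
print(json.dumps(fast_args, indent=2))
```

Leaving optional fields out of the payload lets the documented defaults apply, rather than hard-coding them on the caller's side.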