Glama

run_skill_test

Execute skill test suites to validate AI agent performance through deterministic checks and rubric-based evaluation. Use for regression testing and quality assurance of coding agents.

Instructions

Run a skill test suite against a SKILL.md. Executes two evaluation phases: Phase 1 (deterministic) checks tool calls, file operations, commands run, output content, and token budgets. Phase 2 (rubric) uses LLM-as-judge to score output quality against a defined rubric. Call this after writing skill tests or after any change to the skill or agent. Use --no-rubric for fast Phase 1-only checks with no LLM cost.

Input Schema

| Name | Required | Description | Default |
|---|---|---|---|
| test_file | Yes | Path to the skill test YAML file (e.g. 'tests/my-skill-tests.yaml') | (none) |
| agent | No | Agent type to test against: 'claude-code', 'system-prompt', 'codex', 'langgraph', 'crewai', 'openai-assistants', 'custom'. | Value in YAML |
| no_rubric | No | Skip Phase 2 rubric evaluation; run deterministic checks only (faster, no LLM cost). | false |
| model | No | Model to use for evaluation. | claude-sonnet-4-20250514 |
| verbose | No | Show detailed output for all tests, not just failures. | false |
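The parameters above map one-to-one onto flags of the underlying `evalview skill test` CLI command. A minimal sketch of that mapping (`build_command` is a hypothetical helper for illustration, not the server's actual code):

```python
def build_command(args: dict) -> list[str]:
    """Translate run_skill_test tool arguments into an evalview CLI argv list.

    Hypothetical helper mirroring the flag mapping documented above;
    the real server code may differ.
    """
    # test_file is the only required argument; --json requests machine-readable output
    cmd = ["evalview", "skill", "test", args["test_file"], "--json"]
    if args.get("agent"):
        cmd += ["--agent", args["agent"]]
    if args.get("no_rubric"):
        cmd.append("--no-rubric")
    if args.get("model"):
        cmd += ["--model", args["model"]]
    if args.get("verbose"):
        cmd.append("--verbose")
    return cmd


print(build_command({"test_file": "tests/my-skill-tests.yaml", "no_rubric": True}))
# → ['evalview', 'skill', 'test', 'tests/my-skill-tests.yaml', '--json', '--no-rubric']
```

Optional parameters that are absent or falsy simply contribute no flags, so the defaults in the table apply on the CLI side.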

Implementation Reference

  • The logic handling the `run_skill_test` tool, which constructs and executes the `evalview skill test` CLI command.
    elif name == "run_skill_test":
        # Check for a missing value before normalizing: os.path.normpath("")
        # returns ".", which would defeat the emptiness check below.
        test_file = args.get("test_file", "")
        if not test_file:
            return "Error: 'test_file' is required."
        test_file = os.path.normpath(test_file)
        cmd = ["evalview", "skill", "test", test_file, "--json"]
        if args.get("agent"):
            cmd += ["--agent", args["agent"]]
        if args.get("no_rubric") is True:
            cmd += ["--no-rubric"]
        if args.get("model"):
            cmd += ["--model", args["model"]]
        if args.get("verbose") is True:
            cmd += ["--verbose"]
  • The MCP tool registration and input schema definition for `run_skill_test`.
    {
        "name": "run_skill_test",
        "description": (
            "Run a skill test suite against a SKILL.md. "
            "Executes two evaluation phases: "
            "Phase 1 (deterministic) checks tool calls, file operations, commands run, output content, and token budgets. "
            "Phase 2 (rubric) uses LLM-as-judge to score output quality against a defined rubric. "
            "Call this after writing skill tests or after any change to the skill or agent. "
            "Use --no-rubric for fast Phase 1-only checks with no LLM cost."
        ),
        "inputSchema": {
            "type": "object",
            "required": ["test_file"],
            "properties": {
                "test_file": {
                    "type": "string",
                    "description": "Path to the skill test YAML file (e.g. 'tests/my-skill-tests.yaml')",
                },
                "agent": {
                    "type": "string",
                    "description": "Agent type to test against: 'claude-code', 'system-prompt', 'codex', 'langgraph', 'crewai', 'openai-assistants', 'custom'. Defaults to value in YAML.",
                },
                "no_rubric": {
                    "type": "boolean",
                    "description": "Skip Phase 2 rubric evaluation — run deterministic checks only (faster, no LLM cost). Default: false.",
                },
                "model": {
                    "type": "string",
                    "description": "Model to use for evaluation (default: claude-sonnet-4-20250514)",
                },
                "verbose": {
                    "type": "boolean",
                    "description": "Show detailed output for all tests, not just failures. Default: false.",
                },
            },
        },
    }
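The handler excerpt above builds `cmd` but stops before the execution step. A minimal sketch of how the constructed command could be run and its `--json` output parsed, assuming `subprocess` (the helper name is hypothetical, and the real server likely adds error and timeout handling):

```python
import json
import subprocess
import sys


def run_json_command(cmd: list[str]) -> dict:
    """Run a CLI command and parse its stdout as JSON.

    Hypothetical sketch of the execution step; not the server's actual code.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


# Demonstrated with a stand-in command, since evalview may not be installed:
out = run_json_command(
    [sys.executable, "-c", "import json; print(json.dumps({'passed': 3}))"]
)
print(out)  # → {'passed': 3}
```

With `--json` on the evalview command line, the parsed dictionary would carry the suite's Phase 1 and Phase 2 results back to the MCP client.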
