## Server Configuration

Describes the environment variables required to run the server.
| Name | Required | Description | Default |
|---|---|---|---|
| test_path | No | Custom path to the directory containing test cases (passed as --test-path). | tests/ |
| OPENAI_API_KEY | No | API key for OpenAI, used for semantic similarity scoring, LLM-as-judge evaluation, and embeddings. | |
| ANTHROPIC_API_KEY | No | API key for Anthropic, used for running skill tests with the default provider. | |
| SKILL_TEST_API_KEY | No | API key specifically for the skill test provider if different from global keys. | |
| SKILL_TEST_BASE_URL | No | Base URL for the skill test provider (e.g., for DeepSeek, Groq, or Together API). | |
| SKILL_TEST_PROVIDER | No | Provider for skill tests (e.g., 'openai', 'anthropic'). | |
| EVALVIEW_SLACK_WEBHOOK | No | Webhook URL for Slack alerts used during production monitoring. | |
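The provider-override variables above can be combined to point skill tests at an OpenAI-compatible endpoint while keeping the global keys for scoring. A minimal shell sketch; all key values are placeholders, and the DeepSeek URL is illustrative (the table names DeepSeek only as an example):

```shell
# Keys for the built-in providers (values are placeholders)
export OPENAI_API_KEY="sk-..."         # semantic similarity, LLM-as-judge, embeddings
export ANTHROPIC_API_KEY="sk-ant-..."  # default provider for skill tests

# Route skill tests to an alternative OpenAI-compatible provider
export SKILL_TEST_PROVIDER="openai"
export SKILL_TEST_BASE_URL="https://api.deepseek.com/v1"  # illustrative endpoint
export SKILL_TEST_API_KEY="sk-..."     # used instead of the global keys for skill tests

# Optional: Slack alerts for production monitoring
export EVALVIEW_SLACK_WEBHOOK="https://hooks.slack.com/services/..."
```

If `SKILL_TEST_API_KEY` is unset, the server falls back to the global key for the selected provider, per the table above.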
## Capabilities

Features and capabilities supported by this server.
| Capability | Details |
|---|---|
| tools | {} |
## Tools

Functions exposed to the LLM to take actions.
| Name | Description |
|---|---|
| create_test | Create a new EvalView test case YAML file for an agent. Call this when the user asks to add a test, or when you want to capture expected agent behavior. After creating a test, call run_snapshot to establish the baseline. No YAML knowledge required — just describe the test. IMPORTANT: Automatically detect test_path by looking for a 'tests/evalview/' directory in the current project. If found, use it. Otherwise use 'tests'. |
| run_check | Check for regressions against the golden baseline. Returns a diff summary for each test: PASSED, OUTPUT_CHANGED, TOOLS_CHANGED, or REGRESSION. REGRESSION means the score dropped significantly — treat this as a blocking failure. TOOLS_CHANGED / OUTPUT_CHANGED are warnings: the agent's behavior shifted but may be intentional. Use this after any code change (prompt, model, tools) to confirm nothing broke. If you see a regression, show the diff to the user and offer to fix it before moving on. IMPORTANT: Automatically detect test_path by looking for a 'tests/evalview/' directory in the current project. If it exists, pass it as test_path. If the project has a custom test location, use that instead. |
| run_snapshot | Run tests and save passing results as the new golden baseline. Use this to establish or update the expected behavior after an intentional change. Future run_check calls compare against this baseline. |
| list_tests | List all available golden baselines in this EvalView project. Shows test names, variant counts, and when each baseline was last updated. |
| validate_skill | Validate a SKILL.md file for correct structure, naming conventions, and completeness. Call this after writing or editing a SKILL.md before running tests. Returns a list of issues found and whether the skill is valid. |
| generate_skill_tests | Auto-generate test cases from a SKILL.md file. Call this when the user asks to create tests for a skill — it reads the skill definition and generates a ready-to-run YAML test suite covering explicit, implicit, contextual, and negative test categories. After generating, call run_skill_test to execute them. |
| run_skill_test | Run a skill test suite against a SKILL.md. Executes two evaluation phases: Phase 1 (deterministic) checks tool calls, file operations, commands run, output content, and token budgets. Phase 2 (rubric) uses LLM-as-judge to score output quality against a defined rubric. Call this after writing skill tests or after any change to the skill or agent. Use --no-rubric for fast Phase 1-only checks with no LLM cost. |
| generate_visual_report | Generate a beautiful self-contained HTML visual report from the latest evalview check or run results. Opens automatically in the browser. Call this after run_check or run_snapshot to give the user a visual breakdown of traces, diffs, scores, and timelines. Returns the absolute path to the generated HTML file. |
## Prompts

Interactive templates invoked by user choice.
| Name | Description |
|---|---|
| No prompts | |
## Resources

Contextual data attached and managed by the client.
| Name | Description |
|---|---|
| No resources | |