Schema | EvalView

EvalView


Server Configuration

Describes the configuration options and environment variables the server accepts. All of them are optional.

Name	Required	Description
`test_path`	No	Custom path to the directory containing test cases (passed as --test-path). Defaults to 'tests/'.
`OPENAI_API_KEY`	No	API key for OpenAI, used for semantic similarity scoring, LLM-as-judge evaluation, and embeddings.
`ANTHROPIC_API_KEY`	No	API key for Anthropic, used for running skill tests with the default provider.
`SKILL_TEST_API_KEY`	No	API key specifically for the skill test provider if different from global keys.
`SKILL_TEST_BASE_URL`	No	Base URL for the skill test provider (e.g., for DeepSeek, Groq, or Together API).
`SKILL_TEST_PROVIDER`	No	Provider for skill tests (e.g., 'openai', 'anthropic').
`EVALVIEW_SLACK_WEBHOOK`	No	Webhook URL for Slack alerts used during production monitoring.
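
Since every variable in the table is optional, it can help to check which ones are actually set before starting the server. The following is a minimal sketch, not part of EvalView: the variable names come from the table above, while the `configured_features` helper and the feature labels are hypothetical and only paraphrase the descriptions.

```python
import os

# Variable names are taken from the configuration table. The feature labels
# paraphrase the table's descriptions; they are not EvalView identifiers.
OPTIONAL_VARIABLES = {
    "OPENAI_API_KEY": "semantic similarity scoring, LLM-as-judge, embeddings",
    "ANTHROPIC_API_KEY": "skill tests with the default provider",
    "SKILL_TEST_API_KEY": "separate API key for the skill test provider",
    "SKILL_TEST_BASE_URL": "custom base URL for the skill test provider",
    "SKILL_TEST_PROVIDER": "explicit skill test provider",
    "EVALVIEW_SLACK_WEBHOOK": "Slack alerts for production monitoring",
}


def configured_features(env=None):
    """Return {variable: feature} for each variable set to a non-empty value.

    Hypothetical helper: pass a dict to inspect it instead of os.environ.
    """
    env = os.environ if env is None else env
    return {
        name: feature
        for name, feature in OPTIONAL_VARIABLES.items()
        if env.get(name)
    }


if __name__ == "__main__":
    enabled = configured_features()
    for name, feature in OPTIONAL_VARIABLES.items():
        state = "set" if name in enabled else "not set"
        print(f"{name}: {state} ({feature})")
```

Running the script prints one line per variable, which makes it easy to spot a missing key before a skill test or a scoring step fails for lack of credentials.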

Capabilities

Features and capabilities supported by this server

Capability	Details
`tools`	{}

Tools

Functions exposed to the LLM to take actions

Name	Description
create_test	Create a new EvalView test case YAML file for an agent. Call this when the user asks to add a test, or when you want to capture expected agent behavior. After creating a test, call run_snapshot to establish the baseline. No YAML knowledge required — just describe the test. IMPORTANT: Automatically detect test_path by looking for a 'tests/evalview/' directory in the current project. If found, use it. Otherwise use 'tests'.
run_check	Check for regressions against the golden baseline. Returns a diff summary for each test: PASSED, OUTPUT_CHANGED, TOOLS_CHANGED, or REGRESSION. REGRESSION means the score dropped significantly — treat this as a blocking failure. TOOLS_CHANGED / OUTPUT_CHANGED are warnings: the agent's behavior shifted but may be intentional. Also returns observability signals: behavioral anomalies (tool loops, stalls), trust scores (benchmark gaming detection), and coherence issues (multi-turn context loss). Use this after any code change (prompt, model, tools) to confirm nothing broke. If you see a regression, show the diff to the user and offer to fix it before moving on. Use heal=true to auto-retry flaky failures and distinguish non-determinism from real drift. IMPORTANT: Automatically detect test_path by looking for a 'tests/evalview/' directory in the current project. If it exists, pass it as test_path. If the project has a custom test location, use that instead.
run_snapshot	Run tests and save passing results as the new golden baseline. Use this to establish or update the expected behavior after an intentional change. Future `run_check` calls will compare against this snapshot. Call this: (1) after creating a new test with create_test, (2) after confirming a behavioral change is intentional, (3) before making large refactors so you have a clean rollback point. Only passing tests are saved — failing tests are skipped with a warning. IMPORTANT: Automatically detect test_path by looking for a 'tests/evalview/' directory in the current project. If it exists, pass it as test_path.
list_tests	List all available golden baselines in this EvalView project. Shows test names, variant counts, and when each baseline was last updated.
validate_skill	Validate a SKILL.md file for correct structure, naming conventions, and completeness. Call this after writing or editing a SKILL.md before running tests. Returns a list of issues found and whether the skill is valid.
generate_skill_tests	Auto-generate test cases from a SKILL.md file. Call this when the user asks to create tests for a skill — it reads the skill definition and generates a ready-to-run YAML test suite covering explicit, implicit, contextual, and negative test categories. After generating, call run_skill_test to execute them.
run_skill_test	Run a skill test suite against a SKILL.md. Executes two evaluation phases: Phase 1 (deterministic) checks tool calls, file operations, commands run, output content, and token budgets. Phase 2 (rubric) uses LLM-as-judge to score output quality against a defined rubric. Call this after writing skill tests or after any change to the skill or agent. Use --no-rubric for fast Phase 1-only checks with no LLM cost.
generate_visual_report	Generate a beautiful self-contained HTML visual report from the latest evalview check or run results. Opens automatically in the browser. Call this after run_check or run_snapshot to give the user a visual breakdown of traces, diffs, scores, and timelines. Returns the absolute path to the generated HTML file.
compare_agents	Compare two agent endpoints side-by-side on the same test suite. Useful for A/B testing a new model, prompt change, or architecture swap. Returns per-test score deltas, tool diffs, and an optional HTML report.
replay	Open a trajectory diff viewer for a specific test. Shows a side-by-side HTML comparison of baseline vs current behavior — tool calls, parameters, outputs, and score changes. Opens automatically in the browser.
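
The descriptions of create_test and run_check spell out two rules a client can follow: how to pick test_path, and how to treat each status. The sketch below illustrates both under stated assumptions. The `tools/call` envelope is the standard MCP JSON-RPC request; the argument names `test_path` and `heal` are taken from the descriptions above, but the tool's exact input schema is not shown on this page. The `statuses` mapping passed to `triage` is a hypothetical shape, not the server's actual response format.

```python
import json
from pathlib import Path


def detect_test_path(project_root="."):
    """Use 'tests/evalview' if that directory exists, otherwise 'tests'."""
    candidate = Path(project_root) / "tests" / "evalview"
    return "tests/evalview" if candidate.is_dir() else "tests"


def build_run_check_request(test_path, heal=False, request_id=1):
    """Build an MCP tools/call request for run_check.

    Argument names are assumed from the tool description.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "run_check",
            "arguments": {"test_path": test_path, "heal": heal},
        },
    }


# Per the run_check description: REGRESSION blocks, the *_CHANGED statuses warn.
BLOCKING = {"REGRESSION"}
WARNINGS = {"TOOLS_CHANGED", "OUTPUT_CHANGED"}


def triage(statuses):
    """Group test names by severity.

    `statuses` maps test name -> status string (hypothetical shape).
    """
    groups = {"blocking": [], "warnings": [], "passed": [], "unknown": []}
    for name, status in statuses.items():
        if status in BLOCKING:
            groups["blocking"].append(name)
        elif status in WARNINGS:
            groups["warnings"].append(name)
        elif status == "PASSED":
            groups["passed"].append(name)
        else:
            groups["unknown"].append(name)
    return groups


if __name__ == "__main__":
    request = build_run_check_request(detect_test_path(), heal=True)
    print(json.dumps(request, indent=2))
```

Keeping the blocking and warning sets separate mirrors the guidance in the run_check description: a REGRESSION should stop the workflow, while TOOLS_CHANGED and OUTPUT_CHANGED only need a human to confirm whether the shift was intentional.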

Prompts

Interactive templates invoked by user choice

Name	Description
No prompts

Resources

Contextual data attached and managed by the client

Name	Description
No resources



MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/hidai25/eval-view'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.