Saiten MCP Server

Overview Schema Related Servers Score Discussions

agent-evaluation.md•15.3 KiB

# Agent Evaluation A framework for measuring agent workflow effectiveness based on vscode-ai-toolkit best practices. ## Table of Contents - [Overview](#overview) - Why evaluation matters - [Three Core Metrics](#three-core-metrics) - Intent Resolution, Tool Call Accuracy, Task Adherence - [Evaluation Methodology](#evaluation-methodology) - How to evaluate - [Evaluation Best Practices](#evaluation-best-practices) - Guidelines for effective evaluation - [Common Evaluation Pitfalls](#common-evaluation-pitfalls) - Anti-patterns to avoid - [Evaluation Checklist](#evaluation-checklist) - Quick check template --- ## Overview Effective agent evaluation goes beyond simple success/fail metrics. It requires assessing the quality of reasoning, tool usage, and task adherence across diverse scenarios. ## Three Core Metrics ### 1. Intent Resolution **Does the agent correctly understand and fulfill user intent?** | Aspect | Description | | -------------------- | ---------------------------------------------------- | | **What to Measure** | Accuracy of understanding user's goal | | **Success Criteria** | Agent addresses the actual need, not surface request | | **Common Failures** | Misinterpreting ambiguous requests, literal parsing | #### Evaluation Questions - [ ] Did the agent clarify ambiguous requests? - [ ] Did it address the underlying goal or just the explicit ask? - [ ] Were edge cases and constraints considered? - [ ] Did the output match user expectations? #### Example Scenarios | User Request | Poor Intent Resolution | Good Intent Resolution | | -------------------- | ------------------------------------ | ----------------------------------------- | | "Fix the bug" | Changes random code without analysis | Reproduces bug, identifies root cause | | "Add authentication" | Implements basic login | Asks about requirements (OAuth, JWT, etc) | | "Make it faster" | Random optimizations | Profiles code, targets bottlenecks | ### 2. Tool Call Accuracy **Does the agent use tools correctly and efficiently?** | Aspect | Description | | -------------------- | ------------------------------------------------- | | **What to Measure** | Correctness and appropriateness of tool selection | | **Success Criteria** | Right tool, right parameters, right sequence | | **Common Failures** | Wrong tool, hallucinated tools, inefficient usage | #### Evaluation Questions - [ ] Did the agent use available tools (not reimplementing)? - [ ] Were tool parameters correct and complete? - [ ] Was the tool sequence logical and efficient? - [ ] Were errors handled gracefully? #### Common Tool Usage Mistakes | Mistake | Example | Fix | | -------------------------- | ------------------------------------------------- | --------------------------------------- | | **Hallucinated Tools** | Calls `search_web()` when not defined | Validate tool availability | | **Wrong Parameters** | Passes string when tool expects array | Strict parameter validation | | **Inefficient Chains** | Reads same file 10 times | Cache results, batch operations | | **Missing Error Handling** | Continues after tool failure | Check return codes, implement fallbacks | | **Tool Reimplementation** | Writes grep-like logic instead of using grep tool | Document tool capabilities clearly | #### Tool Call Metrics ```markdown **Accuracy:** (Correct Tool Calls) / (Total Tool Calls) **Efficiency:** (Optimal Tool Count) / (Actual Tool Count) **Coverage:** (Tools Used) / (Relevant Tools Available) ``` ### 3. Task Adherence **Does the agent follow instructions and constraints?** | Aspect | Description | | -------------------- | --------------------------------------------- | | **What to Measure** | Compliance with explicit and implicit rules | | **Success Criteria** | Respects boundaries, doesn't overstep or skip | | **Common Failures** | Scope creep, ignoring constraints, shortcuts | #### Evaluation Questions - [ ] Did the agent respect scope boundaries? - [ ] Were constraints (time, resources, safety) followed? - [ ] Did it avoid making unauthorized changes? - [ ] Were deliverables complete per specification? #### Adherence Violations | Violation Type | Example | Impact | | --------------------- | ------------------------------------------- | ------------------------- | | **Scope Creep** | Asked to fix bug, refactors entire codebase | Wasted time, new risks | | **Constraint Bypass** | Ignores "don't delete files" rule | Data loss, trust breach | | **Premature Action** | Starts execution before plan approval | Wasted work, misalignment | | **Incomplete Output** | Returns half-finished documentation | User must finish work | --- ## Evaluation Methodology ### 1. Test Data Generation **Create diverse, realistic test scenarios** #### Scenario Categories | Category | Description | Example | | -------------------- | ----------------------------------- | ----------------------------------- | | **Happy Path** | Clear, straightforward requests | "List files in src/ directory" | | **Ambiguous** | Requires clarification | "Make it better" (what aspect?) | | **Edge Cases** | Boundary conditions, unusual inputs | Empty repository, large file counts | | **Error Handling** | Simulated failures | File not found, API timeout | | **Multi-Step** | Complex workflows | Analyze → Design → Implement → Test | | **Constraint-Heavy** | Many restrictions | "Refactor but don't change API" | #### Test Data Generation Process ``` Step 1: Identify Agent's Domain - What tasks does it handle? - What tools does it use? Step 2: Create Baseline Scenarios - 10-20 typical use cases - Cover main functionality Step 3: Add Variation - Ambiguous phrasing - Missing information - Conflicting requirements Step 4: Include Edge Cases - Boundary values (empty, max size) - Unusual file types or formats - Concurrent requests Step 5: Simulate Failures - Tool unavailability - Partial results - Timeout scenarios ``` #### Example Test Set (Code Review Agent) ```yaml test_cases: - id: "happy-001" input: "Review src/auth.py for security issues" expected_tools: ["read", "grep"] expected_output_type: "security_report" success_criteria: - Uses #tool:read to access file - Identifies security patterns (SQL injection, XSS, etc.) - Provides severity ratings - id: "ambiguous-001" input: "Check this" expected_behavior: "Request clarification" success_criteria: - Asks what to check - Offers options (code, tests, docs) - Doesn't make assumptions - id: "edge-001" input: "Review all Python files" context: "Repository has 1000+ .py files" expected_behavior: "Plan-First approach" success_criteria: - Creates plan before execution - Proposes batching or sampling - Estimates time required - id: "error-001" input: "Review src/nonexistent.py" expected_behavior: "Graceful error handling" success_criteria: - Detects file doesn't exist - Informs user clearly - Suggests alternatives (similar files, search) ``` ### 2. Evaluation Rubric **Score each dimension on a scale** #### Intent Resolution Rubric | Score | Description | | ----- | ----------------------------------------------------- | | **5** | Perfect understanding, proactive clarification | | **4** | Correct interpretation, addresses need | | **3** | Mostly correct, minor misunderstanding | | **2** | Partially correct, significant gaps | | **1** | Misunderstood intent, wrong direction | | **0** | Complete failure to understand or engage with request | #### Tool Call Accuracy Rubric | Score | Description | | ----- | ------------------------------------------------ | | **5** | Optimal tool selection and usage, error-free | | **4** | Correct tools, minor inefficiencies | | **3** | Mostly correct, some suboptimal choices | | **2** | Significant tool misuse, works but inefficient | | **1** | Wrong tools or hallucinations, barely functional | | **0** | Tool calls fail, agent cannot proceed | #### Task Adherence Rubric | Score | Description | | ----- | --------------------------------------------- | | **5** | Perfect compliance, respects all constraints | | **4** | Minor deviations, core requirements met | | **3** | Some scope creep or missed constraints | | **2** | Significant violations, partial deliverables | | **1** | Major violations, ignores instructions | | **0** | Completely off-track, disregards requirements | ### 3. Automated Testing **Implement programmatic evaluation where possible** #### What Can Be Automated | Aspect | Automation Approach | | ------------------------- | --------------------------------------------- | | **Tool Call Validation** | Parse logs, check tool names and parameters | | **Output Format** | Schema validation (JSON, YAML) | | **File Operations** | Verify files created/modified as expected | | **Performance Metrics** | Measure execution time, tool call count | | **Constraint Violations** | Check against rule list (deleted files, etc.) | #### What Requires Human Review | Aspect | Why Human Needed | | ------------------------ | --------------------------------------------- | | **Intent Understanding** | Nuanced interpretation of ambiguous requests | | **Output Quality** | Subjective assessment (clarity, completeness) | | **Creativity** | Innovation and problem-solving approach | | **Edge Case Handling** | Appropriateness of fallback strategies | #### Automation Example (Python) ```python def evaluate_tool_calls(log_file, expected_tools): """ Evaluate tool call accuracy from execution logs. """ with open(log_file) as f: logs = json.load(f) tool_calls = [entry for entry in logs if entry["type"] == "tool_call"] metrics = { "total_calls": len(tool_calls), "unique_tools": len(set(call["tool_name"] for call in tool_calls)), "hallucinated": sum(1 for call in tool_calls if call["tool_name"] not in expected_tools), "failed": sum(1 for call in tool_calls if call["status"] == "error"), } metrics["accuracy"] = 1 - (metrics["hallucinated"] / metrics["total_calls"]) return metrics ``` --- ## Evaluation Best Practices ### 1. Establish Baselines Before making changes, measure current performance: ```markdown Baseline Evaluation (v1.0) - Intent Resolution: 3.8/5.0 - Tool Call Accuracy: 4.2/5.0 - Task Adherence: 4.5/5.0 After Optimization (v1.1) - Intent Resolution: 4.3/5.0 ✅ +13% - Tool Call Accuracy: 4.6/5.0 ✅ +9.5% - Task Adherence: 4.4/5.0 ⚠️ -2.2% ``` ### 2. Diverse Test Set | Principle | Rationale | | --------------------- | ------------------------------------------------ | | **Coverage** | Test all major features and edge cases | | **Realism** | Use real-world scenarios, not synthetic examples | | **Difficulty Range** | Include easy, medium, and hard tasks | | **Failure Scenarios** | Don't just test happy paths | | **User Language** | Use actual phrasing users would employ | ### 3. Iterative Improvement ```mermaid graph LR A[Baseline] --> B[Identify Weaknesses] B --> C[Implement Fix] C --> D[Re-evaluate] D --> E{Improved?} E -->|Yes| F[Deploy] E -->|No| G[Different Approach] G --> C ``` ### 4. Log Everything **Comprehensive logging enables debugging and improvement** ```json { "test_id": "happy-001", "timestamp": "2024-01-15T10:30:00Z", "user_input": "Review src/auth.py", "agent_plan": "1. Read file 2. Scan for patterns 3. Report", "tool_calls": [ { "tool": "read", "params": { "file": "src/auth.py" }, "status": "success" }, { "tool": "grep", "params": { "pattern": "eval\\(" }, "status": "success" } ], "output": "Found 2 security issues...", "scores": { "intent_resolution": 5, "tool_accuracy": 5, "task_adherence": 5 }, "evaluator_notes": "Excellent handling, proactive security scan" } ``` ### 5. Continuous Monitoring **Evaluation isn't one-time; monitor in production** | Metric | Frequency | Action Threshold | | -------------------------- | --------------- | --------------------------- | | **User Satisfaction** | After each task | <80% positive → investigate | | **Tool Call Failures** | Real-time | >5% failure rate → alert | | **Task Completion Rate** | Daily | <90% → review failures | | **Average Execution Time** | Weekly | +20% vs baseline → profile | --- ## Common Evaluation Pitfalls ### 1. Overfitting to Test Set **Problem:** Agent performs well on tests but poorly on real tasks. **Solution:** - Use held-out test set - Regularly refresh test scenarios - Include user-reported issues ### 2. Ignoring User Feedback **Problem:** Metrics look good, users are unhappy. **Solution:** - Collect qualitative feedback - Review actual usage logs - User interviews/surveys ### 3. Binary Pass/Fail **Problem:** Doesn't capture nuance or areas for improvement. **Solution:** - Use rubrics with gradations - Track multiple dimensions - Identify specific failure modes ### 4. Testing Only Happy Paths **Problem:** Agent fails when things go wrong. **Solution:** - Systematically test error conditions - Simulate timeouts, missing resources - Test with malformed inputs ### 5. Lack of Regression Testing **Problem:** New features break existing functionality. **Solution:** - Maintain test suite - Run before each release - Automate where possible --- ## Evaluation Checklist Before deploying an agent workflow: ```markdown - [ ] Created diverse test set (happy, ambiguous, edge, error) - [ ] Defined success criteria for each test - [ ] Established baseline metrics - [ ] Tested all major workflows - [ ] Verified tool call accuracy - [ ] Confirmed constraint adherence - [ ] Logged evaluation results - [ ] Identified improvement areas - [ ] Documented known limitations - [ ] Set up monitoring for production ``` --- ## References - [vscode-ai-toolkit Agent Evaluation](https://github.com/microsoft/vscode-ai-toolkit) - [Building Effective Agents - Anthropic](https://www.anthropic.com/engineering/building-effective-agents) - [Prompt Engineering Guide - OpenAI](https://platform.openai.com/docs/guides/prompt-engineering)

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/aktsmm/FY26_techconnect_saiten'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

agent-evaluation.md•15.3 KiB