# MCP WordPress Tools Evaluation Guide

This guide covers the comprehensive evaluation system for MCP WordPress tools using [mcp-evals](https://github.com/mclenhard/mcp-evals).

## Overview

The evaluation system provides automated testing and scoring of WordPress MCP tools using LLM-based evaluation to ensure:

- **Tool Reliability**: Consistent performance across different scenarios
- **Quality Assurance**: Comprehensive testing of all 59 WordPress tools
- **Performance Monitoring**: Track tool performance over time
- **Regression Detection**: Identify when changes affect tool quality

## Quick Start

### Prerequisites

- Node.js 20+
- OpenAI API key (configured in GitHub secrets)
- WordPress test site credentials

### Running Evaluations

```bash
# Run all evaluations
npm run eval

# Run quick evaluation (critical tools only)
npm run eval:quick

# Run critical tools evaluation
npm run eval:critical

# Generate evaluation report
npm run eval:report

# Watch mode for development
npm run eval:watch
```

## Evaluation Configurations

### 1. Comprehensive Evaluation (`wordpress-tools-eval.yaml`)

Tests all 59 WordPress tools across multiple categories:

- **Post Management**: Create, read, update, delete posts
- **Media Management**: Upload, manage media files
- **User Management**: User creation, role management
- **Comment Management**: Comment moderation workflows
- **Taxonomy Management**: Categories and tags
- **Site Management**: Settings, health checks
- **Performance**: Cache management, optimization
- **Authentication**: Security and permissions
- **Error Handling**: Edge cases and failures

### 2. Critical Tools Evaluation (`critical-tools-eval.yaml`)

Focused evaluation of the most important tools with stricter scoring:

- Higher pass threshold (4.0/5.0 vs 3.5/5.0)
- Essential functionality only
- Faster execution for CI/CD pipelines

### 3. TypeScript Evaluations (`critical-tools.eval.ts`)

Advanced evaluations using TypeScript for complex scenarios:

- Multi-tool workflows
- Error recovery scenarios
- Performance benchmarks
- Security testing

## Evaluation Scoring

### Scoring Criteria

Each evaluation is scored on five dimensions (1-5 scale):

1. **Accuracy** (25%): How accurately the tool performs its function
2. **Completeness** (20%): How thoroughly it completes the task
3. **Relevance** (20%): How relevant the response is to the request
4. **Clarity** (20%): How clear and understandable the output is
5. **Reasoning** (15%): How well it handles edge cases and errors
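The weights above imply that the overall score is a weighted average of the five dimension scores. The following TypeScript sketch shows one way to combine them; the `DimensionScores` shape and `overallScore` helper are illustrative only and are not part of the mcp-evals API.

```typescript
// Illustrative only: combine per-dimension scores (1-5) into a weighted overall score
// using the percentages listed above.
interface DimensionScores {
  accuracy: number;
  completeness: number;
  relevance: number;
  clarity: number;
  reasoning: number;
}

const WEIGHTS: Record<keyof DimensionScores, number> = {
  accuracy: 0.25,
  completeness: 0.2,
  relevance: 0.2,
  clarity: 0.2,
  reasoning: 0.15,
};

function overallScore(scores: DimensionScores): number {
  return (Object.keys(WEIGHTS) as (keyof DimensionScores)[]).reduce(
    (total, dimension) => total + scores[dimension] * WEIGHTS[dimension],
    0,
  );
}

// Example: { accuracy: 4, completeness: 4, relevance: 5, clarity: 4, reasoning: 3 }
// yields 4.05, which lands in the "Good" band defined below.
```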
### Scoring Thresholds

- **Pass**: 3.5/5.0 (Acceptable performance)
- **Good**: 4.0/5.0 (Solid performance)
- **Excellent**: 4.5/5.0 (Outstanding performance)

### Example Evaluation

```yaml
- name: create_post_basic
  description: Test basic post creation functionality
  prompt: "Create a new blog post titled 'AI Trends in 2025' with content about emerging AI technologies"
  expected_tools:
    - wp_create_post
  success_criteria:
    - Post is created successfully
    - Title matches the request
    - Content is relevant to AI trends
```

## GitHub Actions Integration

### Automated Evaluation Pipeline

The evaluation system runs automatically on:

- **Pull Requests**: When tools are modified
- **Push to Main**: After merging changes
- **Weekly Schedule**: Every Monday at 2 AM UTC
- **Manual Trigger**: Via GitHub Actions UI

### Workflow Features

- **PR Comments**: Automatic evaluation results in PR comments
- **Artifacts**: Evaluation results saved for 30 days
- **Regression Detection**: Compares with previous results
- **Performance Tracking**: Trends over time
- **Failure Notifications**: Creates issues for significant regressions

### Environment Variables

Required secrets in the GitHub repository:

```bash
OPENAI_API_KEY=your_openai_api_key
TEST_WORDPRESS_URL=https://test-site.com
TEST_WORDPRESS_USER=testuser
TEST_WORDPRESS_PASSWORD=app_password
```

## Writing Custom Evaluations

### YAML Configuration

```yaml
model:
  provider: openai
  name: gpt-4o
  temperature: 0.3

evals:
  - name: custom_evaluation
    description: Description of what this tests
    prompt: "Test prompt for the LLM"
    expected_tools:
      - wp_tool_name
    success_criteria:
      - What constitutes success
      - Additional criteria
```

### TypeScript Evaluation

```typescript
import { EvalFunction, grade } from "mcp-evals";
import { openai } from "mcp-evals/models";

export const customEval: EvalFunction = {
  name: "custom_evaluation",
  description: "Test custom functionality",
  run: async () => {
    const result = await grade(openai("gpt-4o"), "Your test prompt here", {
      systemPrompt: "Evaluation criteria...",
      responseFormat: { type: "json_object" },
    });
    return JSON.parse(result);
  },
};
```

## Evaluation Categories

### 1. Functional Testing

Tests basic tool functionality:

```yaml
- name: create_post_basic
  prompt: "Create a new blog post with title and content"
  expected_tools: [wp_create_post]
```

### 2. Integration Testing

Tests multiple tools working together:

```yaml
- name: content_publishing_workflow
  prompt: "Create a post with images and publish it"
  expected_tools: [wp_create_post, wp_upload_media, wp_update_post]
```

### 3. Error Handling Testing

Tests edge cases and error scenarios:

```yaml
- name: handle_invalid_post
  prompt: "Try to update a non-existent post"
  expected_tools: [wp_update_post]
```

### 4. Performance Testing

Tests tool performance and efficiency:

```yaml
- name: high_volume_operations
  prompt: "List 100 most recent posts and generate summary"
  expected_tools: [wp_list_posts, wp_get_post]
```

## Best Practices

### Writing Effective Evaluations

1. **Clear Prompts**: Write specific, actionable prompts
2. **Realistic Scenarios**: Test real-world use cases
3. **Success Criteria**: Define clear success metrics
4. **Error Cases**: Include failure scenarios
5. **Performance**: Consider timeout and efficiency

### Evaluation Maintenance

1. **Regular Updates**: Keep evaluations current with tool changes
2. **Threshold Tuning**: Adjust scoring thresholds based on results
3. **Coverage Analysis**: Ensure all tools are tested (see the sketch after this list)
4. **Performance Monitoring**: Track evaluation trends
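One lightweight way to approach the coverage analysis item is to scan the evaluation configs for referenced `wp_*` tools and diff them against the server's registered tool list. The sketch below assumes configs live under `evaluations/config/` (as noted in Contributing) and uses a hypothetical `registeredTools` array; it is not an existing script in this repository.

```typescript
// Sketch: flag WordPress tools that no evaluation references yet.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Replace with the full list of 59 tools registered by your MCP server.
const registeredTools = ["wp_create_post", "wp_update_post", "wp_list_posts", "wp_upload_media"];

const configDir = "evaluations/config";
const referencedTools = new Set<string>();

for (const file of readdirSync(configDir)) {
  if (!/\.(ya?ml|ts)$/.test(file)) continue;
  const content = readFileSync(join(configDir, file), "utf8");
  for (const match of content.match(/wp_[a-z_]+/g) ?? []) {
    referencedTools.add(match);
  }
}

const untested = registeredTools.filter((tool) => !referencedTools.has(tool));
if (untested.length > 0) {
  console.warn(`Tools without evaluations: ${untested.join(", ")}`);
}
```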
### CI/CD Integration

1. **Fast Feedback**: Use `eval:quick` for PR checks
2. **Comprehensive Testing**: Full evaluations on main branch
3. **Regression Detection**: Compare with previous results
4. **Performance Gates**: Fail builds on significant regressions

## Troubleshooting

### Common Issues

1. **API Key Missing**: Ensure `OPENAI_API_KEY` is set
2. **WordPress Connection**: Verify test site credentials
3. **Timeout Errors**: Increase timeout or reduce evaluation scope
4. **Rate Limiting**: Add delays between evaluations

### Debug Mode

```bash
# Run with debug output
DEBUG=true npm run eval

# Run single evaluation
npm run eval:quick -- --filter create_post_basic
```

### Local Testing

```bash
# Set up local environment
export OPENAI_API_KEY=your_key
export WORDPRESS_SITE_URL=http://localhost/wp
export WORDPRESS_USERNAME=admin
export WORDPRESS_APP_PASSWORD=your_password

# Run evaluations
npm run eval:quick
```

## Reporting

### Automatic Reports

Evaluation results are automatically processed into:

- **JSON Summary**: Machine-readable results
- **HTML Report**: Visual dashboard
- **Markdown Report**: Documentation-friendly format

### Manual Report Generation

```bash
# Generate comprehensive report
npm run eval:report

# View results
open evaluations/reports/evaluation-report.html
```

### Example Report Structure

```json
{
  "overall_score": 4.2,
  "tests_passed": 23,
  "total_tests": 25,
  "status": "good",
  "categories": {
    "post": { "passed": 5, "total": 6, "avg_score": 4.1 },
    "media": { "passed": 3, "total": 3, "avg_score": 4.5 }
  },
  "failed_tests": [
    {
      "name": "handle_invalid_post",
      "score": 3.2,
      "reason": "Error handling could be more graceful"
    }
  ],
  "recommendations": ["Improve error handling for edge cases", "Add more comprehensive validation"]
}
```

## Advanced Features

### Performance Tracking

Track evaluation performance over time:

```bash
# Generate performance trends
node evaluations/scripts/track-performance.js

# Compare with previous results
node evaluations/scripts/compare-performance.js
```

### Custom Scoring

Override default scoring with custom logic:

```typescript
const customScoring = {
  accuracy: { weight: 0.3, min: 4.0 },
  completeness: { weight: 0.3, min: 3.5 },
  relevance: { weight: 0.2, min: 3.0 },
  clarity: { weight: 0.1, min: 3.0 },
  reasoning: { weight: 0.1, min: 3.0 },
};
```

### Integration with Monitoring

Connect evaluations to monitoring systems:

```bash
# Export metrics to Prometheus
npm run eval:metrics

# Send results to monitoring dashboard
npm run eval:monitor
```
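As an illustration of what a metrics export could look like, the sketch below converts the JSON summary shown earlier into Prometheus text exposition format. The summary path, output path, and metric names are assumptions for this example; they do not describe the actual `eval:metrics` implementation.

```typescript
// Illustrative only: turn the evaluation summary JSON into Prometheus gauges.
import { readFileSync, writeFileSync } from "node:fs";

// Assumed location of the machine-readable summary (adjust to your setup).
const summary = JSON.parse(readFileSync("evaluations/reports/evaluation-summary.json", "utf8"));

const lines: string[] = [
  "# TYPE mcp_eval_overall_score gauge",
  `mcp_eval_overall_score ${summary.overall_score}`,
  "# TYPE mcp_eval_tests_passed gauge",
  `mcp_eval_tests_passed ${summary.tests_passed}`,
  "# TYPE mcp_eval_tests_total gauge",
  `mcp_eval_tests_total ${summary.total_tests}`,
  "# TYPE mcp_eval_category_avg_score gauge",
];

for (const [category, stats] of Object.entries(summary.categories as Record<string, { avg_score: number }>)) {
  lines.push(`mcp_eval_category_avg_score{category="${category}"} ${stats.avg_score}`);
}

// Write where a Prometheus textfile collector (e.g. node_exporter) can pick it up.
writeFileSync("evaluations/reports/metrics.prom", lines.join("\n") + "\n");
```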
## Contributing

### Adding New Evaluations

1. Create evaluation in `evaluations/config/`
2. Add to appropriate category
3. Test locally with `npm run eval:quick`
4. Submit PR with evaluation results

### Improving Existing Evaluations

1. Review current evaluation scores
2. Identify areas for improvement
3. Update prompts and success criteria
4. Test changes thoroughly

### Reporting Issues

- Use GitHub Issues for evaluation problems
- Include full evaluation results
- Provide reproduction steps
- Tag with the `evaluation` label

## Resources

- [mcp-evals Documentation](https://github.com/mclenhard/mcp-evals)
- [OpenAI API Documentation](https://platform.openai.com/docs)
- [WordPress REST API](https://developer.wordpress.org/rest-api/)
- [GitHub Actions Documentation](https://docs.github.com/en/actions)

---

**Ready to improve tool quality?** Start by running `npm run eval:quick` to see current performance, then dive into writing custom evaluations for your specific use cases!
