# Comprehensive evaluation suite for Simplenote MCP Server
# Updated for realistic scenarios, proper tool usage, and measurable outcomes
model:
provider: openai
name: gpt-4o-mini # Cost-effective for comprehensive testing
evals:
# === CORE TOOL VALIDATION ===
- name: tool_availability_complete
description: Verify all implemented tools are available and properly configured
prompt: |
Verify that all these tools are available and working:
1. create_note - with required 'content' parameter and optional 'tags'
2. get_note - with required 'note_id' parameter
3. update_note - with required 'note_id' and 'content' parameters
4. delete_note - with required 'note_id' parameter
5. search_notes - with required 'query' and optional filters
6. add_tags - with required 'note_id' and 'tags' parameters
7. remove_tags - with required 'note_id' and 'tags' parameters
8. replace_tags - with required 'note_id' and 'tags' parameters
expected_result: "All 8 tools should be available with correct parameter schemas and descriptions"
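# Illustrative sketch (not the server's documented schema): argument shapes
# implied by the parameter names listed in the prompt above. The sample values,
# the "<note_id>" placeholder, and passing tags as a comma-separated string
# (as the workflow prompts below do) are assumptions.
#   create_note:  {content: "Hello", tags: "demo, test"}
#   get_note:     {note_id: "<note_id>"}
#   update_note:  {note_id: "<note_id>", content: "Hello again"}
#   delete_note:  {note_id: "<note_id>"}
#   search_notes: {query: "Hello"}
#   add_tags:     {note_id: "<note_id>", tags: "demo"}
#   remove_tags and replace_tags take the same two parameters as add_tags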
# === REALISTIC USER WORKFLOWS ===
- name: workflow_meeting_notes
description: Simulate real meeting notes workflow
prompt: |
Simulate a realistic meeting notes workflow:
1. Create a meeting note: "Weekly Team Standup - July 15, 2025\n\nAttendees: Alice, Bob, Charlie\n\nAgenda:\n- Sprint review\n- Blockers discussion\n- Next week planning"
2. Add tags: "meetings, weekly, team"
3. During the meeting, update the note to add: "\n\nNotes:\n- Sprint went well, 8/10 stories completed\n- Bob blocked on API integration\n- Next week: focus on testing"
4. After the meeting, add more tags: "completed, archived"
5. Search for "standup" to verify it can be found
6. Clean up by deleting the note
expected_result: |
Complete workflow should succeed:
- Note creation with formatted content
- Tag additions working properly
- Content updates preserving formatting
- Search finding the note correctly
- Cleanup completing successfully
- name: workflow_research_collection
description: Simulate research note collection and organization
prompt: |
Create a research collection workflow:
1. Create 3 research notes:
- "MCP Protocol Overview\n\nModel Context Protocol enables standardized communication..."
- "Simplenote API Research\n\nRESTful API with endpoints for CRUD operations..."
- "Evaluation Best Practices\n\nComprehensive testing requires realistic scenarios..."
2. Tag them: first with "research, mcp", second with "research, simplenote", third with "research, testing"
3. Search for all "research" tagged notes
4. Update the first note to add a "References" section
5. Use replace_tags on the first note to set its tags to "research, protocol" (swapping "mcp" for "protocol")
6. Search for "protocol" to verify tag change
7. Clean up all research notes
expected_result: |
Research workflow should demonstrate:
- Multiple note creation and management
- Consistent tagging strategy
- Tag-based organization and retrieval
- Content evolution over time
- Tag management operations
# === PERFORMANCE AND SCALE TESTING ===
- name: performance_concurrent_operations
description: Test system under concurrent load
prompt: |
Test concurrent operations to verify system stability:
1. Create 10 notes simultaneously, each with unique content: "Concurrent Test #1" through "Concurrent Test #10"
2. Simultaneously search for "Concurrent Test" while notes are being created
3. Update all 10 notes in parallel to append " - Updated"
4. Perform multiple tag operations on different notes concurrently
5. Delete all 10 notes simultaneously
Monitor for any failures, data corruption, or performance degradation.
expected_result: |
Should handle concurrent operations without issues:
- All 10 notes created successfully
- Search works during concurrent creation
- All updates complete without corruption
- Tag operations succeed without conflicts
- All deletions complete successfully
- Response times remain reasonable (< 30s total)
- name: performance_large_content_handling
description: Test handling of large note content (>10KB)
prompt: |
Test large content handling:
1. Create a note with content approximately 15,000 characters (large technical document)
2. Retrieve the full content and verify integrity
3. Update the note by appending an additional 5,000 characters
4. Search for specific terms within the large content
5. Delete the large note
Measure response times and verify no truncation occurs.
expected_result: |
Should handle large content efficiently:
- Creation succeeds with full content preserved
- Retrieval returns complete content (no truncation)
- Updates work on large notes
- Search finds terms within large content
- Operations complete within reasonable time (< 60s each)
# === ADVANCED SEARCH SCENARIOS ===
- name: search_advanced_queries
description: Test complex search queries and filters
prompt: |
Create test data and perform advanced searches:
1. Create notes with various content and tags:
- "Project Alpha Requirements" with tags "work, alpha, requirements"
- "Project Beta Design" with tags "work, beta, design"
- "Personal: Weekend Plans" with tags "personal, weekend"
2. Test these search scenarios:
- Search with tag filter: "work" tag only
- Search with multiple tag filters: "work" AND "alpha"
- Search with text + tag combination: "Project" with "work" tag
- Search with date range (if supported)
3. Clean up test notes
expected_result: |
Advanced search should work correctly:
- Tag filtering returns only matching notes
- Multiple filters work as AND operations
- Combined text+tag searches work properly
- Date filtering works if implemented
- All results are relevant and complete
# === ERROR HANDLING AND EDGE CASES ===
- name: error_comprehensive_validation
description: Test all parameter validation scenarios
prompt: |
Test parameter validation across all tools:
1. create_note with missing content
2. get_note with empty note_id
3. update_note with invalid note_id format
4. delete_note with non-existent note_id
5. search_notes with empty query
6. add_tags with malformed tags
7. remove_tags with tags not on note
8. replace_tags with invalid characters
expected_result: |
Each validation error should return proper error structure:
{
"success": false,
"error": "descriptive error message",
"error_type": "ValidationError" (or a similarly descriptive type name)
}
No operations should partially succeed or crash.
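# Illustrative only: one response that would satisfy the structure above, for
# case 1 (create_note with missing content). The message text is an assumption;
# only the three field names come from the expected_result.
#   {"success": false, "error": "content is required", "error_type": "ValidationError"}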
- name: edge_cases_special_content
description: Test handling of special content types
prompt: |
Test various special content scenarios:
1. Create note with only whitespace content
2. Create note with extremely long single line (>1000 chars)
3. Create note with non-ASCII Unicode content (emoji and symbols: 🎉🚀💻📝)
4. Create note with code snippets and formatting
5. Create note with multiple languages (English, español, 中文, العربية)
6. Verify that all of these notes can be retrieved and searched
7. Clean up all test notes
expected_result: |
Should handle all content types gracefully:
- Whitespace content preserved exactly
- Long lines don't cause issues
- Unicode content (emoji, symbols) preserved
- Code and formatting maintained
- Multi-language content searchable
- All content retrievable exactly as stored