# Model configuration
model:
  provider: openai
  name: gpt-4o
  # Uses the OPENAI_API_KEY environment variable by default
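  # A minimal sketch, assuming a POSIX shell: the key could be exported
  # before running the evals. The value shown is a placeholder, not a real key.
  #   export OPENAI_API_KEY="<your-api-key>"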
# List of evaluations to run for Simplenote MCP Server
# Fixed critical issues: dynamic test data, proper tool usage, realistic scenarios
evals:
- name: note_lifecycle_basic
  description: Test complete note lifecycle with proper setup and cleanup
  prompt: |
    Perform a complete note lifecycle test:
    1. Create a new note with content "Evaluation Test Note - Basic Lifecycle\n\nThis note tests the basic CRUD operations."
    2. Retrieve the created note and verify its content
    3. Update the note by appending "\n\nUpdated: " followed by the current date and time
    4. Search for notes containing "Evaluation Test"
    5. Delete the created note
    Each step should succeed and the note should be properly cleaned up.
  expected_result: |
    Should complete all operations successfully:
    - create_note returns: {"success": true, "note_id": "<string>", "message": "Note created successfully"}
    - get_note returns the full content with proper formatting
    - update_note preserves existing content and adds new content
    - search_notes finds the test note in results
    - delete_note removes the note successfully
- name: note_creation_with_tags
  description: Test note creation with tags and proper validation
  prompt: |
    Create a new note with the following specifications:
    - Content: "Tagged Note Test\n\nThis note has multiple tags for testing purposes."
    - Tags: "evaluation, testing, automated"
    Then retrieve the note to verify tags were properly applied.
  expected_result: |
    Should return JSON with structure:
    {
      "success": true,
      "note_id": "<valid_uuid_or_id>",
      "message": "Note created successfully",
      "tags": ["evaluation", "testing", "automated"]
    }
    Retrieved note should contain all specified tags.
- name: search_functionality_comprehensive
  description: Test search with various query types and filters
  prompt: |
    First create 3 test notes:
    1. "Meeting Notes - Weekly Standup" with tags "work, meetings"
    2. "Personal Tasks - Weekend Projects" with tags "personal, projects"
    3. "Research - MCP Protocol Documentation" with tags "work, research"
    Then perform these searches:
    1. Basic text search for "Meeting"
    2. Tag-based search for notes with "work" tag
    3. Search for notes containing "Protocol"
    Clean up all test notes afterward.
  expected_result: |
    Each search should return appropriate results:
    - "Meeting" search should find the first note
    - "work" tag filter should find notes 1 and 3
    - "Protocol" search should find the third note
    All searches should return structured results with note IDs and content snippets.
- name: tag_operations_comprehensive
  description: Test all tag operations on a single note
  prompt: |
    1. Create a note with content "Tag Operations Test Note"
    2. Add tags "initial, test" using add_tags
    3. Add additional tags "work, important" using add_tags
    4. Remove the "test" tag using remove_tags
    5. Replace all tags with "final, completed" using replace_tags
    6. Verify final tags and delete the note
  expected_result: |
    Should successfully perform all tag operations:
    - Initial tags: ["initial", "test"]
    - After adding: ["initial", "test", "work", "important"]
    - After removing: ["initial", "work", "important"]
    - After replacing: ["final", "completed"]
    Each operation should return success confirmation.
- name: error_handling_invalid_note_id
  description: Test handling of non-existent note IDs
  prompt: |
    Try to retrieve a note with a clearly invalid ID "non-existent-note-12345-invalid".
    The system should handle this gracefully with appropriate error messaging.
  expected_result: |
    Should return an error response with structure:
    {
      "success": false,
      "error": "Note not found" or similar,
      "error_type": "ResourceNotFoundError" or similar
    }
    Should not crash or return malformed responses.
- name: error_handling_missing_parameters
  description: Test tool parameter validation
  prompt: |
    Test parameter validation by:
    1. Attempting to create a note without required 'content' parameter
    2. Attempting to update a note without 'note_id' parameter
    3. Attempting to add tags without 'tags' parameter
  expected_result: |
    Each operation should return validation errors:
    - Missing content: "content is required"
    - Missing note_id: "note_id is required"
    - Missing tags: "tags are required"
    Should not perform partial operations or crash.
- name: search_edge_cases_and_special_characters
  description: Test search with special characters and edge cases
  prompt: |
    1. Create a test note with content: "Special chars: @#$%^&*(){}[]|\\:;\"'<>?,./ and unicode: àáâãäå ñoño 你好"
    2. Search for notes containing "@#$%"
    3. Search for notes containing "àáâãäå"
    4. Search for notes containing "你好"
    5. Try an empty search query
    6. Clean up the test note
  expected_result: |
    Should handle all special characters correctly:
    - Special ASCII characters should be found in search
    - Unicode characters should be found in search
    - Empty search should return appropriate response (not crash)
    - All text should be preserved exactly as entered
- name: performance_multiple_operations
  description: Test handling of multiple rapid operations
  prompt: |
    Perform rapid operations to test system responsiveness:
    1. Create 5 notes quickly with titles "Performance Test 1" through "Performance Test 5"
    2. Search for all notes containing "Performance Test"
    3. Update each note to append " - Updated"
    4. Search again to verify updates
    5. Delete all 5 test notes
    Monitor response times and ensure no operations fail due to timing.
  expected_result: |
    Should complete all operations successfully within reasonable time:
    - All 5 notes should be created (< 30 seconds total)
    - Search should find all 5 notes
    - All updates should succeed
    - Final search should show updated content
    - All deletions should succeed
    No operations should time out or fail due to rate limiting.
- name: content_handling_large_notes
  description: Test handling of large note content
  prompt: |
    1. Create a note with large content (roughly 9,000 characters):
       "Large Content Test\n\n" followed by the sentence "This is line N of a large note content test. "
       repeated 200 times, with N running from 1 to 200
    2. Retrieve the note and verify content integrity
    3. Update the note by appending "\n\nEnd of large content test"
    4. Search for "Large Content Test"
    5. Delete the large note
  expected_result: |
    Should handle large content without issues:
    - Note creation should succeed with full content preserved
    - Retrieval should return complete content
    - Updates should work on large notes
    - Search should find large notes
    - No truncation or corruption should occur