# Gmail MCP Server - Evaluations

## Overview

Comprehensive evaluations for the Gmail MCP server require a test Gmail account with known, stable data. Since Gmail contains personal data that changes over time, evaluations cannot be run against a production mailbox.

## Test Account Setup

To create proper evaluations, you'll need:

1. **Dedicated Test Gmail Account**
   - Create a new Gmail account specifically for testing
   - Use a consistent naming pattern (e.g., `mcp-test-gmail@gmail.com`)
   - Never use for real communication

2. **Seed Data**
   - Create known emails with specific subjects, senders, labels
   - Add test attachments
   - Create test drafts
   - Organize into threads

3. **Stable Test Data Requirements**
   - Messages must not be auto-deleted (avoid trash)
   - Labels must remain constant
   - Threads must not grow (no new replies)
   - Attachments must remain available

## Example Evaluation Questions

Once a test account is set up with known data, here are example evaluation scenarios:

### 1. Multi-step Search and Label Operations

**Question**: Find all unread messages from sender "test-sender@example.com" with subject containing "Report", apply label "Processed", and mark as read. How many messages were modified?

**Answer**: `[Number based on test data]`

**Tools Required**:
- `gmail_search_messages` (query: "from:test-sender@example.com subject:Report is:unread")
- `gmail_list_labels` (to find "Processed" label ID)
- `gmail_modify_message_labels` (add label, remove UNREAD)

### 2. Thread Analysis

**Question**: Find the conversation thread with subject "Project Alpha Discussion" and count how many different participants sent messages in that thread.

**Answer**: `[Number based on test data]`

**Tools Required**:
- `gmail_search_messages` or `gmail_list_threads` (query: "subject:Project Alpha Discussion")
- `gmail_get_thread` (to retrieve all messages)
- Parse headers to extract unique senders

### 3. Label Creation and Application

**Question**: Create a new label called "Q1-2025-Reports", find all messages from January 2025 with "invoice" in the subject, apply the label, and report how many messages now have this label.

**Answer**: `[Number based on test data]`

**Tools Required**:
- `gmail_create_label` (name: "Q1-2025-Reports")
- `gmail_search_messages` (query: "after:2025/01/01 before:2025/02/01 subject:invoice")
- `gmail_modify_message_labels` (apply label to results)

### 4. Attachment Identification

**Question**: How many messages in the inbox have attachments larger than 1MB? (Requires test data with known attachment sizes)

**Answer**: `[Number based on test data]`

**Tools Required**:
- `gmail_search_messages` (query: "has:attachment")
- `gmail_get_message` (for each result, check payload.parts for attachments)
- Filter by size metadata

### 5. Draft Management

**Question**: Find the draft with subject "Monthly Report Template", send it to "recipient@example.com", and confirm it was sent successfully. What is the resulting message ID?

**Answer**: `[Message ID from test]`

**Tools Required**:
- `gmail_list_drafts`
- Filter for subject match
- `gmail_send_draft`
- Extract message ID from response

### 6. Complex Label Query

**Question**: How many messages have both the "Important" label AND the "Work" label but do NOT have the "Archived" label?

**Answer**: `[Number based on test data]`

**Tools Required**:
- `gmail_list_labels` (to get label IDs)
- `gmail_search_messages` (complex query with label logic)
- Count results

### 7. Thread Label Modification

**Question**: Find all threads with more than 3 messages that contain the word "meeting", apply the label "Meetings" to all of them, and report how many threads were modified.

**Answer**: `[Number based on test data]`

**Tools Required**:
- `gmail_list_threads` (query: "meeting")
- `gmail_get_thread` (for each, count messages)
- Filter threads with > 3 messages
- `gmail_modify_thread_labels` (apply "Meetings" label)

### 8. Sender Analysis

**Question**: Who sent the most messages to you in threads labeled "Project Beta"? Report the email address.

**Answer**: `[Email address from test data]`

**Tools Required**:
- `gmail_list_labels` (find "Project Beta" label ID)
- `gmail_search_messages` or `gmail_list_threads` (filter by label)
- `gmail_get_message` or `gmail_get_thread` (extract sender headers)
- Count senders, find maximum

### 9. Label Organization

**Question**: List all custom (user-created) labels alphabetically and count how many messages are in the label with the most messages. What is that count?

**Answer**: `[Number from test data]`

**Tools Required**:
- `gmail_list_labels` (filter type: "user")
- `gmail_search_messages` (for each label, query by label ID)
- Find maximum count

### 10. Trash Recovery

**Question**: How many messages are currently in the trash that contain the word "invoice" and were sent in the last 30 days?

**Answer**: `[Number from test data]`

**Tools Required**:
- `gmail_search_messages` (query: "in:trash invoice newer_than:30d")
- Count results

## Creating the Evaluation XML

Once test data is seeded, create `evaluations.xml`:

```xml
<evaluation>
  <qa_pair>
    <question>Find all unread messages from sender "test-sender@example.com" with subject containing "Report", apply label "Processed", and mark as read. How many messages were modified?</question>
    <answer>5</answer>
  </qa_pair>
  <!-- Add remaining 9 qa_pairs -->
</evaluation>
```

## Running Evaluations

Use the MCP evaluation harness (if available) or manually test each question:

```bash
# Example manual testing
python -c "
from gmail_mcp import gmail_search_messages, SearchMessagesInput
result = gmail_search_messages(SearchMessagesInput(
    query='from:test-sender@example.com subject:Report is:unread',
    response_format='json'
))
print(result)
"
```

## Notes

- Evaluations must be run against the same test account consistently
- Test data should never change (static fixture)
- Real Gmail accounts are unsuitable due to changing data
- For MVP testing, manual verification is acceptable
- Full automated evaluations require dedicated infrastructure

## Future Enhancement

Consider creating a Gmail test harness that:

1. Creates a test account automatically
2. Seeds with known data via Gmail API
3. Runs evaluations
4. Tears down test data
5. Validates results
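
The final "validates results" step could look roughly like the sketch below. It assumes the `evaluations.xml` layout described under "Creating the Evaluation XML"; the helper names `load_qa_pairs` and `score`, the case-insensitive exact-match comparison, and the placeholder fixture values are all hypothetical and not part of the Gmail MCP server.

```python
import xml.etree.ElementTree as ET

# Placeholder fixture in the evaluations.xml layout; the answers here are
# illustrative only and must be replaced with values from the seeded account.
SAMPLE_XML = """
<evaluation>
  <qa_pair>
    <question>How many messages were modified?</question>
    <answer>5</answer>
  </qa_pair>
  <qa_pair>
    <question>Who sent the most messages in threads labeled "Project Beta"?</question>
    <answer>test-sender@example.com</answer>
  </qa_pair>
</evaluation>
"""


def load_qa_pairs(xml_text):
    """Return a list of (question, expected_answer) tuples from the XML text."""
    root = ET.fromstring(xml_text.strip())
    pairs = []
    for pair in root.findall("qa_pair"):
        question = (pair.findtext("question") or "").strip()
        answer = (pair.findtext("answer") or "").strip()
        pairs.append((question, answer))
    return pairs


def score(pairs, actual_answers):
    """Compare expected and actual answers, ignoring case and outer whitespace."""
    results = []
    for (question, expected), actual in zip(pairs, actual_answers):
        passed = expected.casefold() == str(actual).strip().casefold()
        results.append((question, expected, actual, passed))
    return results


pairs = load_qa_pairs(SAMPLE_XML)
# In a real run, these answers would come from the agent under evaluation.
results = score(pairs, ["5", "someone-else@example.com"])
passed_count = sum(1 for r in results if r[3])
print(f"{passed_count}/{len(results)} evaluations passed")
```

Exact matching keeps the check simple for numeric and email answers; free-form answers would likely need a looser comparison.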
