# Gmail MCP Server - Evaluations
## Overview
Comprehensive evaluations for the Gmail MCP server require a test Gmail account with known, stable data. Since Gmail contains personal data that changes over time, evaluations cannot be run against a production mailbox.
## Test Account Setup
To create proper evaluations, you'll need:
1. **Dedicated Test Gmail Account**
- Create a new Gmail account specifically for testing
- Use a consistent naming pattern (e.g., `mcp-test-gmail@gmail.com`)
- Never use for real communication
2. **Seed Data**
- Create known emails with specific subjects, senders, labels
- Add test attachments
- Create test drafts
- Organize into threads
3. **Stable Test Data Requirements**
- Messages must not be auto-deleted (keep fixtures out of Trash and Spam, which Gmail purges after 30 days)
- Labels must remain constant
- Threads must not grow (no new replies)
- Attachments must remain available
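
One way to keep seed data auditable is to record it in a manifest and compute expected answers from that manifest rather than by hand. The sketch below is illustrative only: `SeedMessage`, its fields, and `expected_unread_reports` are hypothetical names, not part of the Gmail MCP server.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SeedMessage:
    """One seeded email. Every field here is an illustrative assumption."""
    subject: str
    sender: str
    labels: frozenset = field(default_factory=frozenset)
    unread: bool = False


# Hypothetical manifest describing what was seeded into the test mailbox.
SEED = [
    SeedMessage("Weekly Report", "test-sender@example.com", frozenset({"INBOX"}), unread=True),
    SeedMessage("Status report", "test-sender@example.com", frozenset({"INBOX"}), unread=True),
    SeedMessage("Report Q4", "test-sender@example.com", frozenset({"INBOX"}), unread=False),
    SeedMessage("Lunch", "other@example.com", frozenset({"INBOX"}), unread=True),
]


def expected_unread_reports(seed, sender):
    """Count unread messages from `sender` whose subject contains 'report' (case-insensitive)."""
    return sum(
        1
        for m in seed
        if m.sender == sender and m.unread and "report" in m.subject.lower()
    )
```

Deriving answers from the manifest means that reseeding the account and regenerating the expected answers stay in step.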
## Example Evaluation Questions
Once a test account is set up with known data, here are example evaluation scenarios:
### 1. Multi-step Search and Label Operations
**Question**: Find all unread messages from sender "test-sender@example.com" with subject containing "Report", apply label "Processed", and mark as read. How many messages were modified?
**Answer**: `[Number based on test data]`
**Tools Required**:
- `gmail_search_messages` (query: "from:test-sender@example.com subject:Report is:unread")
- `gmail_list_labels` (to find "Processed" label ID)
- `gmail_modify_message_labels` (add label, remove UNREAD)
### 2. Thread Analysis
**Question**: Find the conversation thread with subject "Project Alpha Discussion" and count how many different participants sent messages in that thread.
**Answer**: `[Number based on test data]`
**Tools Required**:
- `gmail_search_messages` or `gmail_list_threads` (query: `subject:"Project Alpha Discussion"`; without the inner quotes, `subject:` applies only to the first word)
- `gmail_get_thread` (to retrieve all messages)
- Parse headers to extract unique senders
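
The header-parsing step can be sketched as follows. This assumes `gmail_get_thread` returns data shaped like the Gmail API thread resource (`messages[].payload.headers[]` with `name` and `value` keys); the server's actual output format may differ, and `unique_senders` is a hypothetical helper.

```python
from email.utils import parseaddr


def unique_senders(thread):
    """Return the set of lowercased sender addresses found in a thread."""
    senders = set()
    for message in thread.get("messages", []):
        for header in message.get("payload", {}).get("headers", []):
            if header["name"].lower() == "from":
                # parseaddr splits 'Alice <alice@example.com>' into (name, address).
                _, address = parseaddr(header["value"])
                if address:
                    senders.add(address.lower())
    return senders


# Sample data in the assumed shape; the addresses are made up.
sample_thread = {"messages": [
    {"payload": {"headers": [{"name": "From", "value": "Alice <alice@example.com>"}]}},
    {"payload": {"headers": [{"name": "From", "value": "bob@example.com"}]}},
    {"payload": {"headers": [{"name": "From", "value": "ALICE@example.com"}]}},
]}
participant_count = len(unique_senders(sample_thread))
```

Normalizing case and stripping display names keeps one participant from being counted twice.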
### 3. Label Creation and Application
**Question**: Create a new label called "Q1-2025-Reports", find all messages from January 2025 with "invoice" in the subject, apply the label, and report how many messages now have this label.
**Answer**: `[Number based on test data]`
**Tools Required**:
- `gmail_create_label` (name: "Q1-2025-Reports")
- `gmail_search_messages` (query: "after:2025/01/01 before:2025/02/01 subject:invoice")
- `gmail_modify_message_labels` (apply label to results)
### 4. Attachment Identification
**Question**: How many messages in the inbox have attachments larger than 1MB? (Requires test data with known attachment sizes)
**Answer**: `[Number based on test data]`
**Tools Required**:
- `gmail_search_messages` (query: "has:attachment")
- `gmail_get_message` (for each result, check payload.parts for attachments)
- Filter by size metadata
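
A minimal sketch of the size filter, assuming `gmail_get_message` returns the Gmail API message shape, where attachment parts carry a non-empty `filename` and a `body.size` in bytes, and parts may nest. The helper names are hypothetical, and "1MB" is taken here to mean 1,048,576 bytes.

```python
ONE_MB = 1024 * 1024  # assumption: 1MB is interpreted as 1 MiB


def iter_attachments(part):
    """Yield (filename, size_in_bytes) for every attachment at or below `part`."""
    if part.get("filename"):
        yield part["filename"], part.get("body", {}).get("size", 0)
    for child in part.get("parts", []):
        yield from iter_attachments(child)


def has_large_attachment(message, threshold=ONE_MB):
    """True if any attachment in the message is strictly larger than `threshold`."""
    payload = message.get("payload", {})
    return any(size > threshold for _, size in iter_attachments(payload))


# Made-up messages in the assumed shape.
messages = [
    {"id": "m1", "payload": {"parts": [
        {"filename": "", "body": {"size": 120}},
        {"filename": "big.pdf", "body": {"attachmentId": "a1", "size": 2_500_000}},
    ]}},
    {"id": "m2", "payload": {"parts": [
        {"filename": "small.txt", "body": {"attachmentId": "a2", "size": 4_096}},
    ]}},
    {"id": "m3", "payload": {"parts": [
        {"filename": "", "parts": [
            {"filename": "nested.zip", "body": {"attachmentId": "a3", "size": 1_048_577}},
        ]},
    ]}},
]
large_ids = [m["id"] for m in messages if has_large_attachment(m)]
```

The recursion matters because attachments in multipart messages are often nested one or more levels below the top-level payload.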
### 5. Draft Management
**Question**: Find the draft with subject "Monthly Report Template", send it to "recipient@example.com", and confirm it was sent successfully. What is the resulting message ID?
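**Answer**: `[Message ID returned by the send call; it differs on each run, so check that an ID is returned rather than matching a fixed value]`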
**Tools Required**:
- `gmail_list_drafts`
- Filter for subject match
- `gmail_send_draft`
- Extract message ID from response
### 6. Complex Label Query
**Question**: How many messages have both the "Important" label AND the "Work" label but do NOT have the "Archived" label?
**Answer**: `[Number based on test data]`
**Tools Required**:
- `gmail_list_labels` (to get label IDs)
- `gmail_search_messages` (query: "label:important label:work -label:archived")
- Count results
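
The same label logic can be cross-checked on the client side. The sketch below assumes messages expose a `labelIds` list and labels expose `id` and `name`, as in the Gmail API; the label IDs shown (`Label_1`, `Label_2`) are made up, and `count_matching` is a hypothetical helper.

```python
def count_matching(messages, labels, required_names, excluded_names):
    """Count messages carrying all required labels and none of the excluded ones."""
    name_to_id = {label["name"]: label["id"] for label in labels}
    required = {name_to_id[name] for name in required_names}
    excluded = {name_to_id[name] for name in excluded_names}
    count = 0
    for message in messages:
        ids = set(message.get("labelIds", []))
        # Subset test for the AND condition, empty intersection for the NOT condition.
        if required <= ids and not (excluded & ids):
            count += 1
    return count


labels = [
    {"id": "IMPORTANT", "name": "IMPORTANT"},
    {"id": "Label_1", "name": "Work"},
    {"id": "Label_2", "name": "Archived"},
]
messages = [
    {"id": "m1", "labelIds": ["IMPORTANT", "Label_1"]},
    {"id": "m2", "labelIds": ["IMPORTANT", "Label_1", "Label_2"]},
    {"id": "m3", "labelIds": ["Label_1"]},
]
matching = count_matching(messages, labels, ["IMPORTANT", "Work"], ["Archived"])
```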
### 7. Thread Label Modification
**Question**: Find all threads with more than 3 messages that contain the word "meeting", apply the label "Meetings" to all of them, and report how many threads were modified.
**Answer**: `[Number based on test data]`
**Tools Required**:
- `gmail_list_threads` (query: "meeting")
- `gmail_get_thread` (for each, count messages)
- Filter threads with > 3 messages
- `gmail_modify_thread_labels` (apply "Meetings" label)
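
The filtering step is easy to get wrong at the boundary, since "more than 3" excludes threads with exactly three messages. A small sketch, assuming each thread dict has an `id` and a `messages` list (the function name is hypothetical):

```python
def threads_to_label(threads, min_exclusive=3):
    """Return IDs of threads with strictly more than `min_exclusive` messages."""
    return [t["id"] for t in threads if len(t.get("messages", [])) > min_exclusive]


# Made-up threads with 3, 4, and 5 messages respectively.
threads = [
    {"id": "t1", "messages": [{}] * 3},
    {"id": "t2", "messages": [{}] * 4},
    {"id": "t3", "messages": [{}] * 5},
]
selected = threads_to_label(threads)
```

Seeding at least one thread with exactly three messages makes this boundary part of what the evaluation checks.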
### 8. Sender Analysis
**Question**: Who sent the most messages to you in threads labeled "Project Beta"? Report the email address.
**Answer**: `[Email address from test data]`
**Tools Required**:
- `gmail_list_labels` (find "Project Beta" label ID)
- `gmail_search_messages` or `gmail_list_threads` (filter by label)
- `gmail_get_message` or `gmail_get_thread` (extract sender headers)
- Count senders, find maximum
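
Counting senders can be sketched with `collections.Counter`, again assuming Gmail-API-shaped messages with a `From` header; `top_sender` is a hypothetical helper, not a server tool.

```python
from collections import Counter
from email.utils import parseaddr


def top_sender(messages):
    """Return (address, count) for the most frequent sender, or None if no senders."""
    counts = Counter()
    for message in messages:
        for header in message.get("payload", {}).get("headers", []):
            if header["name"].lower() == "from":
                address = parseaddr(header["value"])[1].lower()
                if address:
                    counts[address] += 1
    return counts.most_common(1)[0] if counts else None


# Made-up messages from a labeled thread.
messages = [
    {"payload": {"headers": [{"name": "From", "value": "Alice <alice@example.com>"}]}},
    {"payload": {"headers": [{"name": "From", "value": "Bob <bob@example.com>"}]}},
    {"payload": {"headers": [{"name": "From", "value": "alice@example.com"}]}},
]
winner = top_sender(messages)
```

Seed data for this question should avoid ties, because a tie makes the expected answer ambiguous.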
### 9. Label Organization
**Question**: List all custom (user-created) labels alphabetically and count how many messages are in the label with the most messages. What is that count?
**Answer**: `[Number from test data]`
**Tools Required**:
- `gmail_list_labels` (filter type: "user")
- `gmail_search_messages` (for each label, query by label ID)
- Find maximum count
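
A sketch of the sort-and-maximum step. It assumes each label dict carries `type`, `name`, and a per-label message count under `messagesTotal`, as the Gmail API's label resource does when fetched individually; if the server does not expose that field, the counts would come from the per-label searches instead. `summarize_user_labels` is a hypothetical helper.

```python
def summarize_user_labels(labels):
    """Return (alphabetical user label names, highest message count among them)."""
    user_labels = [label for label in labels if label.get("type") == "user"]
    names = sorted((label["name"] for label in user_labels), key=str.lower)
    # default=0 covers a mailbox with no user-created labels.
    max_count = max((label.get("messagesTotal", 0) for label in user_labels), default=0)
    return names, max_count


# Made-up labels; system labels such as INBOX must be ignored.
labels = [
    {"id": "INBOX", "name": "INBOX", "type": "system", "messagesTotal": 120},
    {"id": "Label_2", "name": "Work", "type": "user", "messagesTotal": 14},
    {"id": "Label_1", "name": "archive-2024", "type": "user", "messagesTotal": 31},
    {"id": "Label_3", "name": "Meetings", "type": "user", "messagesTotal": 9},
]
names, max_count = summarize_user_labels(labels)
```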
### 10. Trash Recovery
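**Question**: How many messages are currently in the trash that contain the word "invoice" and were sent in the last 30 days? (Requires reseeding before each run: trashed messages are purged after 30 days, and `newer_than:30d` is relative to the run date.)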
**Answer**: `[Number from test data]`
**Tools Required**:
- `gmail_search_messages` (query: "in:trash invoice newer_than:30d")
- Count results
## Creating the Evaluation XML
Once test data is seeded, create `evaluations.xml`:
```xml
<evaluation>
<qa_pair>
<question>Find all unread messages from sender "test-sender@example.com" with subject containing "Report", apply label "Processed", and mark as read. How many messages were modified?</question>
<answer>5</answer>
</qa_pair>
<!-- Add remaining 9 qa_pairs -->
</evaluation>
```
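
Writing the file by hand gets error-prone once questions contain quotes, ampersands, or angle brackets. A minimal sketch that generates the same structure with the standard library, assuming the `<evaluation>`/`<qa_pair>` layout shown above (`build_evaluation` is a hypothetical helper):

```python
import xml.etree.ElementTree as ET


def build_evaluation(pairs):
    """Serialize (question, answer) pairs into the evaluation XML structure."""
    root = ET.Element("evaluation")
    for question, answer in pairs:
        qa_pair = ET.SubElement(root, "qa_pair")
        ET.SubElement(qa_pair, "question").text = question
        ET.SubElement(qa_pair, "answer").text = str(answer)
    # ElementTree escapes &, <, and > in text content automatically.
    return ET.tostring(root, encoding="unicode")


pairs = [
    ('How many messages have both the "Important" label AND the "Work" label?', 3),
    ("How many threads mention R&D <internal>?", 1),
]
xml_text = build_evaluation(pairs)
```

The answers above are placeholders; real values come from the seeded test data.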
## Running Evaluations
Use the MCP evaluation harness (if available) or manually test each question:
```bash
# Example manual testing
python -c "
from gmail_mcp import gmail_search_messages, SearchMessagesInput

result = gmail_search_messages(SearchMessagesInput(
    query='from:test-sender@example.com subject:Report is:unread',
    response_format='json',
))
print(result)
"
```
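
If no harness is available, a small scoring loop over `evaluations.xml` can stand in for one. The sketch below is an assumption about how such a loop might look: `answer_fn` is a hypothetical callable representing whatever produces an answer for a question (an agent using the MCP tools, or a person entering results), and answers are compared by exact string match.

```python
import xml.etree.ElementTree as ET


def run_evaluations(xml_text, answer_fn):
    """Return a list of (question, expected, actual, passed) tuples."""
    results = []
    for qa_pair in ET.fromstring(xml_text).iter("qa_pair"):
        question = qa_pair.findtext("question", default="").strip()
        expected = qa_pair.findtext("answer", default="").strip()
        actual = str(answer_fn(question)).strip()
        results.append((question, expected, actual, actual == expected))
    return results


# Inline sample standing in for the contents of evaluations.xml.
SAMPLE_XML = """
<evaluation>
  <qa_pair><question>How many unread reports?</question><answer>5</answer></qa_pair>
  <qa_pair><question>How many large attachments?</question><answer>2</answer></qa_pair>
</evaluation>
"""


def stub_answer(question):
    """Stand-in for a real agent: always answers 5."""
    return 5


results = run_evaluations(SAMPLE_XML, stub_answer)
passed = sum(1 for *_, ok in results if ok)
```

Exact matching is strict by design: it only works when each question has a single, stable, unambiguous answer.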
## Notes
- Evaluations must be run against the same test account consistently
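- Test data should never change between runs (static fixture); questions that modify labels or send drafts (1, 3, 5, and 7) require restoring the seeded state afterwards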
- Real Gmail accounts are unsuitable due to changing data
- For MVP testing, manual verification is acceptable
- Full automated evaluations require dedicated infrastructure
## Future Enhancement
Consider creating a Gmail test harness that:
1. Creates a test account automatically
2. Seeds with known data via Gmail API
3. Runs evaluations
4. Tears down test data
5. Validates results