Zotero MCP


ANNOTATION_FEATURE.md•9.36 KiB

# Zotero Annotation Feature - Code Overview

## Summary

This feature adds the ability to create highlight annotations on PDF attachments in Zotero via the MCP interface. It handles the complexity of PDF text search, coordinate systems, and multiple storage configurations (Zotero Cloud, WebDAV).

## Commits

| Commit | Description |
|--------|-------------|
| `b6e7842` | Initial `zotero_create_annotation` tool |
| `999ab2a` | Add fuzzy text matching |
| `8ef202e` | Fix for PDFs with missing word spaces |
| `9d093ea` | Enhanced fuzzy matching (normalization, thresholds) |
| `555d8c8` | Debug info on search failures |
| `df41245` | **Anchor-based matching** for long passages |
| `6c7655f` | Major refactor for maintainability |
| `e361cc7` | "Did you mean" suggestions |

---

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│               MCP Tool: zotero_create_annotation                │
│                      (server.py:1700-1939)                      │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                          PDF Download                           │
│  1. Try local Zotero (WebDAV/local storage)                     │
│  2. Fall back to Web API (Zotero Cloud)                         │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                   Text Search (pdf_utils.py)                    │
│                                                                 │
│  Strategy Order:                                                │
│    1. Anchor-based (for text >100 chars)                        │
│    2. Exact match (PyMuPDF search_for)                          │
│    3. Fuzzy match (normalized text comparison)                  │
│                                                                 │
│  Features:                                                      │
│    • Neighboring page search (±2 pages)                         │
│    • Debug info on failure                                      │
│    • "Did you mean" suggestions                                 │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Coordinate Conversion                      │
│  PyMuPDF (top-left origin) → Zotero (bottom-left origin)        │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                  Create Annotation via Web API                  │
│  POST to Zotero API with annotationPosition JSON                │
└─────────────────────────────────────────────────────────────────┘
```

---

## File: `pdf_utils.py` (816 lines)

### Module Structure

```python
# Configuration Constants (lines 28-47)
ANCHOR_MIN_TEXT_LENGTH = 100    # Use anchor matching for text > 100 chars
ANCHOR_TARGET_LENGTH = 40       # Length of start/end anchors
ANCHOR_MATCH_THRESHOLD = 0.75   # Fuzzy threshold for anchors
FUZZY_THRESHOLD_SHORT = 0.85    # For text < 50 chars
FUZZY_THRESHOLD_MEDIUM = 0.75   # For text 50-150 chars
FUZZY_THRESHOLD_LONG = 0.65     # For text > 150 chars
DEFAULT_NEIGHBOR_PAGES = 2      # Pages to search on either side

# Character Replacement Maps
DASH_REPLACEMENTS = {...}       # em-dash, en-dash → hyphen
QUOTE_REPLACEMENTS = {...}      # curly quotes → straight
LIGATURE_REPLACEMENTS = {...}   # fi, fl, ff ligatures → expanded letters
```

### Function Groups

#### Text Normalization

| Function | Purpose |
|----------|---------|
| `normalize_text()` | Handle hyphens, dashes, quotes, ligatures |
| `normalize_for_matching()` | Aggressive: remove ALL spaces, lowercase |

#### Page Text Extraction

| Function | Purpose |
|----------|---------|
| `_extract_page_spans()` | Get text spans with bounding boxes |
| `_build_normalized_text_index()` | Create position-to-span mapping |
| `_get_spans_in_range()` | Find spans overlapping a position range |

#### Coordinate Conversion

| Function | Purpose |
|----------|---------|
| `_convert_rects_to_zotero()` | Transform PyMuPDF → Zotero coords |
| `_build_sort_index()` | Create annotation sort key |
| `_build_search_result()` | Assemble result dict |

#### Search Strategies

| Function | Purpose |
|----------|---------|
| `_anchor_based_search()` | Match start/end, highlight between |
| `_fuzzy_search_page()` | Normalized text comparison |
| `_sliding_window_match()` | SequenceMatcher-based fuzzy search |
| `_search_single_page()` | Orchestrates all strategies |

#### Public API

| Function | Purpose |
|----------|---------|
| `find_text_position()` | Main entry point for text search |
| `get_page_label()` | Get PDF page label (e.g., "i", "ii") |
| `verify_pdf_attachment()` | Check if file is valid PDF |
| `build_annotation_position()` | Create Zotero position JSON |

---

## File: `server.py` Changes

### New MCP Tool

```python
@mcp.tool(name="zotero_create_annotation")
def create_annotation(
    attachment_key: str,      # PDF attachment key
    page: int,                # 1-indexed page number
    text: str,                # Text to highlight
    comment: str = None,      # Optional comment
    color: str = "#ffd400"    # Highlight color
) -> str:
```

### Error Handling

When the text search fails, the tool reports what it searched for and, where possible, the closest passage it found:

```
Error: Could not find text on page 21

Text searched: "16.27% savings rate"

==================================================
DID YOU MEAN (score: 76%):
"we find a couple must save 16.27% (i.e., 63% more) to achieve the same expected utility with the TDF..."
(Found on page 21)
==================================================

TIP: Copy the exact text from the PDF instead of paraphrasing.
```

---

## Key Design Decisions

### 1. Anchor-Based Matching

For passages >100 characters, instead of matching the entire text:

- Extract first ~40 chars as **start anchor**
- Extract last ~40 chars as **end anchor**
- Find both in PDF, highlight everything between

**Why?** Long passages often have line breaks, hyphenation, or character variations in the middle. Only the start and end need to match.

### 2. Aggressive Text Normalization

PDF text extraction produces inconsistent output:

- Words without spaces: `"Wechallengetwotenets"`
- Special characters: em-dashes, curly quotes, ligatures

Solution: `normalize_for_matching()` removes ALL spaces and lowercases the text, so the comparison no longer depends on where the extractor placed word breaks.

### 3. Neighboring Page Search

PDF page numbers often don't match document page numbers (due to front matter). The tool automatically searches ±2 pages if the text isn't found on the specified page.

### 4. Hybrid Storage Support

Works with both:

- **Zotero Cloud Storage**: Downloads via Web API
- **WebDAV Storage**: Downloads via local Zotero (port 23119)

Annotations are always created via the Web API (the local API is read-only).

---

## Testing

Run the test suite:

```bash
.venv/bin/python -c "
from zotero_mcp.pdf_utils import (
    normalize_text,
    normalize_for_matching,
    find_text_position,
    _extract_anchor
)
# Tests run automatically on import
print('All imports successful')
"
```

---

## Usage Example

```python
# Via MCP tool
zotero_create_annotation(
    attachment_key="NHZFE5A7",
    page=1,
    text="We challenge two tenets of lifecycle investing",
    comment="Key thesis of the paper",
    color="#ffd400"
)
```

---

## Future Improvements

1. **Auto-retry on "Did you mean"**: Claude could automatically retry with suggested text
2. **Multi-page highlights**: Support annotations spanning multiple pages
3. **Other annotation types**: Underline, strikethrough, notes
4. **Batch annotations**: Create multiple highlights in one call
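
---

## Appendix: Illustrative Sketch of the Matching Helpers

The design decisions above (aggressive normalization, length-dependent fuzzy thresholds, start/end anchors, and the top-left to bottom-left coordinate flip) can be sketched in a few lines of standard-library Python. This is not the code in `pdf_utils.py`: the function names `normalize_sketch`, `fuzzy_threshold`, `split_anchors`, and `flip_rect` are hypothetical, the replacement table is a small stand-in for the elided maps, and the sketch assumes thresholds are chosen from the raw text length and that a rectangle is an `(x0, y0, x1, y1)` tuple on an unrotated page.

```python
import re

# Hypothetical stand-in for DASH/QUOTE/LIGATURE_REPLACEMENTS; the real
# tables are elided in this overview and may cover more characters.
REPLACEMENTS = {
    "\u2014": "-", "\u2013": "-",                    # em-dash, en-dash -> hyphen
    "\u201c": '"', "\u201d": '"',                    # curly double quotes
    "\u2018": "'", "\u2019": "'",                    # curly single quotes
    "\ufb01": "fi", "\ufb02": "fl", "\ufb00": "ff",  # ligatures -> letters
}

# Values taken from the configuration constants listed in this document.
ANCHOR_MIN_TEXT_LENGTH = 100
ANCHOR_TARGET_LENGTH = 40


def normalize_sketch(text: str) -> str:
    """Map special characters to ASCII, drop ALL whitespace, lowercase."""
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    return re.sub(r"\s+", "", text).lower()


def fuzzy_threshold(text: str) -> float:
    """Pick a similarity threshold from the text length (assumed: raw length)."""
    n = len(text)
    if n < 50:
        return 0.85   # short text: require a close match
    if n <= 150:
        return 0.75   # medium text
    return 0.65       # long text: tolerate more extraction noise


def split_anchors(text: str):
    """Return (start_anchor, end_anchor) for long text, else None."""
    if len(text) <= ANCHOR_MIN_TEXT_LENGTH:
        return None
    return text[:ANCHOR_TARGET_LENGTH], text[-ANCHOR_TARGET_LENGTH:]


def flip_rect(rect, page_height: float):
    """Convert a top-left-origin rect to a bottom-left-origin rect.

    Ignores page rotation and crop boxes, which real code may need to handle.
    """
    x0, y0, x1, y1 = rect
    return [x0, page_height - y1, x1, page_height - y0]
```

With these definitions, `normalize_sketch("We challenge\ntwo tenets")` and `normalize_sketch("Wechallengetwotenets")` produce the same string, which is why a PDF that drops word spaces can still be matched. Similarly, `flip_rect((72, 100, 300, 112), 792)` maps a rectangle 100 points below the top of a 792-point page to one whose lower edge sits 680 points above the bottom.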
