PDF Indexer MCP Server
Allows downloading and indexing research papers from arXiv for semantic search and retrieval.
A Model Context Protocol (MCP) server that enables AI agents to download, index, and semantically search PDF research papers. This server provides 8 tools that AI agents can discover and use autonomously to build research paper knowledge bases and answer questions.
What is MCP?
Model Context Protocol (MCP) is a standardized protocol that allows AI agents to discover and use tools. Instead of being limited to text generation, agents become action-capable systems that can:
Discover Tools: Agents automatically discover available tools from connected MCP servers
Understand Capabilities: Agents read tool descriptions and parameters to understand what each tool can do
Execute Tasks: Agents call tools with appropriate parameters to accomplish goals
Compose Workflows: Agents can combine multiple tools from different servers to solve complex problems
How MCP Works
AI Agent → MCP Protocol → Tool Server → Execution → Results → Agent
When you connect this MCP server to an AI agent (like in Cursor, Claude Desktop, or via OpenAI Agents framework), the agent automatically:
Discovers all 8 tools available from this server
Understands what each tool does from their descriptions
Uses the tools when they're needed to complete tasks
Can combine tools in complex workflows
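To make the discovery step concrete, here is a minimal sketch of how a tool's name, description, and parameters could be derived from an ordinary Python function. It uses only the standard library. `describe_tool` and the stubbed `download_pdf` are hypothetical illustrations, not this server's code and not the MCP wire format.

```python
import inspect


def download_pdf(url: str) -> dict:
    """Download a PDF research paper from a URL and save it locally."""
    # Stub for illustration only; the real tool performs the download.
    return {"success": True, "filename": url.rsplit("/", 1)[-1]}


def describe_tool(func) -> dict:
    """Build a tool descriptor from a function's signature and docstring."""
    sig = inspect.signature(func)
    params = {
        name: getattr(p.annotation, "__name__", str(p.annotation))
        for name, p in sig.parameters.items()
    }
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func),
        "parameters": params,
    }


descriptor = describe_tool(download_pdf)
print(descriptor)
```

An agent that receives a record like this has enough information to decide when the tool applies and which arguments to pass.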
Features
📥 PDF Download: Download research papers from URLs
📄 Intelligent Chunking: Two chunking strategies:
Header-based: Preserves document structure (ideal for academic papers)
S2 chunking: Spatial-semantic hybrid approach for optimal semantic chunks
🗄️ Database Indexing: Store papers and chunks in SQLite with navigation indices
🔍 Semantic Search: Search papers using MLX-optimized embeddings (Qwen3-Embedding-0.6B)
⚡ FAISS Vector Index: Fast similarity search even for thousands of chunks
🧠 Context-Aware: Retrieve surrounding chunks for better context
Quick Start: Suggested Prompt
Once the MCP server is configured, you can use prompts like:
"I have this research paper URL: [URL]. Please download it, index it,
make it searchable, and then search for information about [topic]."
Or more simply:
"Download and index this paper: [URL], then search it for information about [topic]."
The agent will automatically:
Download the PDF
Index it into the database
Generate embeddings for semantic search
Search for relevant content
Present the results
Installation
Option 1: Install from GitHub (Recommended)
# Clone the repository
git clone https://github.com/lizTheDeveloper/pdf-indexer-mcp.git
cd pdf-indexer-mcp
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Option 2: Install via pip (After Publishing)
pip install pdf-indexer-mcp
MCP Server Setup
For Cursor IDE
Locate MCP configuration file:
macOS:
~/Library/Application Support/Cursor/User/globalStorage/mcp.json
Windows:
%APPDATA%\Cursor\User\globalStorage\mcp.json
Linux:
~/.config/Cursor/User/globalStorage/mcp.json
Add configuration (create file if it doesn't exist):
{
  "mcpServers": {
    "pdf-indexer": {
      "command": "/absolute/path/to/pdf_indexer_mcp/venv/bin/python3",
      "args": [
        "/absolute/path/to/pdf_indexer_mcp/semantic_chunked_pdf_rag.py"
      ],
      "env": {}
    }
  }
}
Restart Cursor completely (not just reload)
Verify: After restart, you should see 8 tools available:
mcp_pdf-indexer_download_pdf
mcp_pdf-indexer_chunk_pdf
mcp_pdf-indexer_index_pdf
mcp_pdf-indexer_list_indexed_papers
mcp_pdf-indexer_get_document_structure
mcp_pdf-indexer_get_document_section
mcp_pdf-indexer_generate_embeddings
mcp_pdf-indexer_search_research_papers
For Claude Desktop
Locate MCP configuration file:
macOS:
~/Library/Application Support/Claude/claude_desktop_config.json
Windows:
%APPDATA%\Claude\claude_desktop_config.json
Linux:
~/.config/Claude/claude_desktop_config.json
Add the same configuration as above
Restart Claude Desktop completely
For OpenAI Agents Framework
import asyncio

from agents import Agent, Runner
from agents.mcp import MCPServerStdio


async def main():
    # Launch the MCP server as a subprocess and talk to it over stdio
    async with MCPServerStdio(
        name="PDF Indexer",
        params={
            "command": "/path/to/pdf_indexer_mcp/venv/bin/python3",
            "args": ["/path/to/pdf_indexer_mcp/semantic_chunked_pdf_rag.py"],
        },
    ) as pdf_indexer_server:
        agent = Agent(
            name="Research Assistant",
            instructions="Help users search and analyze research papers",
            mcp_servers=[pdf_indexer_server],
            model="gpt-4",
        )
        result = await Runner.run(
            agent,
            "Download and index this paper: https://arxiv.org/pdf/1706.03762.pdf",
        )
        print(result.final_output)


# "async with" and "await" must run inside a coroutine
asyncio.run(main())
Available MCP Tools
This server exposes 8 tools that AI agents can use. Tools are automatically discovered by agents when the server is connected.
1. download_pdf(url: str)
What it does: Downloads a PDF research paper from a URL and saves it locally.
When agents use it: When you ask to download a paper, the agent automatically discovers and uses this tool.
Example agent workflow:
User: "Download the attention paper from arxiv"
Agent:
1. Discovers download_pdf tool
2. Calls: download_pdf("https://arxiv.org/pdf/1706.03762.pdf")
3. Returns: Downloaded paper saved as "1706.03762.pdf"
Returns:
success: bool
filename: str (e.g., "1706.03762.pdf")
filepath: str (absolute path)
message: str
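The example above maps an arXiv URL to the filename "1706.03762.pdf". A minimal sketch of one way to derive such a filename from a URL follows. `filename_from_url` is a hypothetical helper, and the rule it applies (last path segment, with ".pdf" appended when missing) is an assumption rather than the server's documented behavior.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def filename_from_url(url: str) -> str:
    """Return a local PDF filename derived from the last URL path segment."""
    name = PurePosixPath(urlparse(url).path).name
    if not name:
        raise ValueError(f"URL has no filename component: {url}")
    # Some paper URLs omit the .pdf suffix, so add it when it is missing
    return name if name.lower().endswith(".pdf") else f"{name}.pdf"


print(filename_from_url("https://arxiv.org/pdf/1706.03762.pdf"))
print(filename_from_url("https://arxiv.org/pdf/2303.08774"))
```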
2. chunk_pdf(filename: str, method: str = "header")
What it does: Extracts text from a PDF and chunks it using header-based or S2 chunking.
Parameters:
filename: PDF filename (must be in papers/ directory)
method: "header" (default) or "s2" for spatial-semantic chunking
When agents use it: When asked to analyze or process a PDF's structure.
Returns:
success: bool
num_chunks: int
chunks: list of chunk dictionaries with preview text
method: str (chunking method used)
3. index_pdf(filename: str, url: str = "", method: str = "header")
What it does: Complete indexing workflow - downloads (if needed), chunks, and stores in database.
When agents use it: The most common tool agents use - it handles the full pipeline.
Example agent workflow:
User: "Index this paper and make it searchable"
Agent:
1. Discovers index_pdf tool
2. Calls: index_pdf("paper.pdf", url="https://...", method="header")
3. Paper is now in database and searchable
Returns:
success: bool
paper_id: int (database ID)
num_chunks: int
num_sections: int (for header method)
message: str
4. list_indexed_papers()
What it does: Lists all papers currently indexed in the database.
When agents use it: When asked "what papers do you have?" or "show me all papers".
Returns:
success: bool
count: int
papers: list of paper metadata
5. get_document_structure(filename: str)
What it does: Gets the complete structure of a paper (sections, headers, chunk ranges).
When agents use it: When you ask about a paper's structure or sections.
Returns:
success: bool
structure: dict with paper metadata and sections list
6. get_document_section(filename: str, ...)
What it does: Retrieves a specific section of a document.
Parameters (use one of):
chunk_index: int - Get specific chunk by index
header_path: str - Get section by header path (e.g., "Introduction")
page_start/page_end: int - Get chunks in page range
When agents use it: When asked "show me the Introduction section" or "get pages 5-10".
Returns:
success: bool
paper_id: int
num_chunks: int
chunks: list of full chunk content
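Because the parameters are described as "use one of", a caller may want to confirm that exactly one selector is set before invoking the tool. The sketch below shows one way to do that check. `pick_selector` is a hypothetical helper, and how the server itself reacts to zero or multiple selectors is not stated in this document.

```python
from typing import Optional


def pick_selector(
    chunk_index: Optional[int] = None,
    header_path: Optional[str] = None,
    page_start: Optional[int] = None,
    page_end: Optional[int] = None,
) -> str:
    """Return which selector is in use, or raise if none or several are given."""
    modes = []
    if chunk_index is not None:
        modes.append("chunk_index")
    if header_path is not None:
        modes.append("header_path")
    if page_start is not None or page_end is not None:
        modes.append("page_range")
    if len(modes) != 1:
        raise ValueError(f"expected exactly one selector, got {modes or 'none'}")
    return modes[0]


print(pick_selector(header_path="Introduction"))
print(pick_selector(page_start=5, page_end=10))
```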
7. generate_embeddings(filename: str, model_name: str = "mlx-community/Qwen3-Embedding-0.6B")
What it does: Generates semantic embeddings for all chunks in a paper and adds them to the FAISS vector index.
When agents use it: Agents automatically use this before semantic search.
Example agent workflow:
User: "Make this paper searchable"
Agent:
1. Calls index_pdf() - indexes the paper
2. Calls generate_embeddings() - makes it searchable
3. Paper is now ready for semantic search
Returns:
success: bool
paper_id: int
num_embeddings: int
embedding_dim: int (1024 for Qwen3-Embedding-0.6B)
model_name: str
8. search_research_papers(query: str, k: int = 5, context_window: int = 1, model_name: str = "mlx-community/Qwen3-Embedding-0.6B")
What it does: Semantically searches all indexed papers using embeddings and returns the most relevant chunks.
Parameters:
query: Search query text
k: Number of top results (default: 5)
context_window: Number of neighboring chunks to include (default: 1)
model_name: Embedding model (default: Qwen3-Embedding-0.6B)
When agents use it: When asked questions like "find papers about transformers" or "search for attention mechanisms".
Example agent workflow:
User: "What papers discuss attention mechanisms?"
Agent:
1. Discovers search_research_papers tool
2. Calls: search_research_papers("attention mechanisms", k=5)
3. Gets relevant chunks with context
4. Synthesizes answer from results
Returns:
success: bool
query: str (original query)
num_results: int
results: list of result dictionaries with:
chunk_id, paper_id, filename, title
text: Full chunk text
header_path: Section location
page_start, page_end: Page numbers
distance: Similarity score (lower = more similar)
is_context: bool (true if context chunk, not direct match)
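The interaction between context_window and is_context can be sketched as follows. This is an illustration under assumptions: chunk indices within one paper are taken to be contiguous integers starting at 0, and `expand_with_context` is a hypothetical helper, not the server's implementation.

```python
def expand_with_context(hits, context_window, num_chunks):
    """Return (chunk_index, is_context) pairs for hits plus their neighbors.

    hits: chunk indices that matched the query directly.
    context_window: how many neighbors to include on each side of a hit.
    num_chunks: total chunks in the paper, used to clamp the range.
    """
    direct = set(hits)
    selected = set()
    for idx in hits:
        first = max(0, idx - context_window)
        last = min(num_chunks - 1, idx + context_window)
        selected.update(range(first, last + 1))
    # A chunk is "context" only if it was pulled in as a neighbor
    return [(i, i not in direct) for i in sorted(selected)]


print(expand_with_context([0, 5], context_window=1, num_chunks=10))
# prints [(0, False), (1, True), (4, True), (5, False), (6, True)]
```

With context_window=0 only the direct matches come back, which is a quick way to compare results with and without surrounding text.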
How Agents Use These Tools
Autonomous Tool Discovery
When you connect this MCP server, agents automatically discover all 8 tools. Each tool has:
Name: What the tool is called
Description: What the tool does (agents read this!)
Parameters: What inputs the tool needs
Return Type: What the tool returns
Agents use these descriptions to understand when to use each tool.
Typical Agent Workflow
User: "Find papers about transformers and summarize the key findings"
Agent workflow:
1. Discovers list_indexed_papers() → checks what's available
2. Discovers search_research_papers() → searches for "transformers"
3. Discovers get_document_section() → gets more context for top results
4. Synthesizes findings into summary
All tool calls happen autonomously!
Multi-Tool Composition
Agents can combine tools in sophisticated ways:
# Example: Agent decides to do a complete research workflow
1. download_pdf("https://arxiv.org/pdf/...")
→ Downloads paper
2. index_pdf("paper.pdf", url="...", method="header")
→ Indexes paper in database
3. generate_embeddings("paper.pdf")
→ Makes it searchable
4. search_research_papers("related topic", k=5)
→ Finds related papers
5. get_document_section(filename, header_path="Introduction")
→ Gets specific sections for context
Complete Usage Example
Via Cursor/Claude Desktop
Once the MCP server is configured and restarted, you can simply ask:
You: "Download and index this paper about transformers"
Agent: [Automatically uses download_pdf and index_pdf tools]
You: "Search for papers about attention mechanisms"
Agent: [Automatically uses search_research_papers tool]
You: "Show me the Introduction section of the transformer paper"
Agent: [Automatically uses get_document_section tool]
Via OpenAI Agents Framework
import asyncio

from agents import Agent, Runner
from agents.mcp import MCPServerStdio


async def main():
    # Launch the MCP server as a subprocess and talk to it over stdio
    async with MCPServerStdio(
        name="PDF Indexer",
        params={
            "command": "/path/to/venv/bin/python3",
            "args": ["/path/to/semantic_chunked_pdf_rag.py"],
        },
    ) as pdf_server:
        agent = Agent(
            name="Research Assistant",
            instructions="""
            You help users manage and search research papers.
            You can download, index, and search papers using the available tools.
            """,
            mcp_servers=[pdf_server],
            model="gpt-4",
        )
        # Agent autonomously uses tools
        result = await Runner.run(
            agent,
            "Download this paper, index it, make it searchable, and then search for related work on attention",
        )
        print(result.final_output)


# "async with" and "await" must run inside a coroutine
asyncio.run(main())
Package Structure
pdf_indexer_mcp/
├── semantic_chunked_pdf_rag.py # Main MCP server (exposes tools)
├── utils/ # Logging utilities
├── pdf_processing/ # PDF text extraction
├── chunking/ # Chunking algorithms
├── database/ # Database models and operations
├── embeddings/ # MLX embedding generation and FAISS
├── papers/ # Downloaded PDFs (created automatically)
├── indexes/ # Database and FAISS indices (created automatically)
├── logs/ # Log files (created automatically)
├── requirements.txt # Python dependencies
├── pyproject.toml # Package metadata for pip
├── LICENSE # GPL-3.0 copyleft license
└── README.md # This file
Learning RAG (Retrieval-Augmented Generation)
This MCP server demonstrates a complete RAG (Retrieval-Augmented Generation) pipeline for research papers. Understanding RAG is essential for building effective AI systems that can access and use external knowledge.
What is RAG?
RAG combines information retrieval with language generation, allowing LLMs to:
Retrieve relevant information from external sources (here: research papers)
Augment the LLM's context with retrieved information
Generate responses grounded in retrieved content
Instead of relying solely on pre-trained knowledge, RAG enables systems to answer questions using up-to-date, domain-specific information.
How This Server Implements RAG
This MCP server provides a complete RAG implementation:
1. Document Ingestion (Retrieval Setup)
download_pdf(): Fetch papers from URLs
index_pdf(): Extract and chunk text, store in database
Creates a searchable knowledge base
2. Semantic Indexing (Vector Search)
generate_embeddings(): Convert text chunks into semantic vectors
Uses MLX-optimized embeddings (Qwen3-Embedding-0.6B, 1024 dimensions)
Stores vectors in FAISS for fast similarity search
3. Retrieval (Finding Relevant Content)
search_research_papers(): Semantic search across all papers
Finds relevant chunks based on meaning, not just keywords
Returns ranked results with context
4. Augmentation (Context Enhancement)
get_document_section(): Retrieve full context from specific sections
Includes surrounding chunks for better understanding
Provides metadata (section, page, headers)
5. Generation (LLM Response)
Agent receives retrieved chunks
Uses them as context to generate grounded responses
Responses are based on actual paper content, not just training data
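Steps 4 and 5 above amount to placing retrieved chunks in front of the user's question. A minimal sketch of that assembly is shown below, using the result fields listed for search_research_papers. `build_prompt`, the instruction wording, and the character budget are assumptions for illustration; in practice the connected agent decides how to use the results.

```python
def build_prompt(question, results, max_chars=4000):
    """Format retrieved chunks as numbered, cited context ahead of the question."""
    parts = []
    used = 0
    for n, r in enumerate(results, start=1):
        block = (
            f"[{n}] {r['filename']} | {r['header_path']} | "
            f"pages {r['page_start']}-{r['page_end']}\n{r['text']}"
        )
        if used + len(block) > max_chars:
            break  # stay within a rough context budget
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


# Placeholder result shaped like one entry of the documented "results" list
sample = [{
    "filename": "1706.03762.pdf",
    "header_path": "Introduction",
    "page_start": 1,
    "page_end": 2,
    "text": "Placeholder chunk text.",
}]
prompt = build_prompt("What is the paper about?", sample)
print(prompt)
```

Keeping the filename, section, and page range next to each chunk lets the generated answer point back to a verifiable location in the paper.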
RAG Pipeline Flow
User Query
↓
Semantic Search (search_research_papers)
↓
Find Relevant Chunks (FAISS vector search)
↓
Retrieve Context (get_document_section if needed)
↓
Augment LLM Context (pass chunks to LLM)
↓
Generate Response (grounded in retrieved content)
Key RAG Concepts Demonstrated
Chunking Strategy: Two approaches shown:
Header-based: Preserves structure, ideal for academic papers
S2 chunking: Spatial-semantic hybrid for unstructured documents
Semantic Search: Uses embeddings to find meaning, not just keywords
"attention mechanisms" finds related concepts even without exact words
Better than traditional keyword search
Vector Database: FAISS for fast similarity search
Scales to thousands of chunks
Sub-millisecond search times
Incremental Indexing: Add papers without rebuilding entire index
Each paper can be indexed independently
Embeddings added incrementally
Context Windows: Retrieve surrounding chunks for better context
Helps maintain narrative flow
Provides background for understanding
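The vector search concept can be illustrated without FAISS by a brute-force nearest-neighbor scan. The sketch below assumes Euclidean (L2) distance, which matches the "lower = more similar" convention of the distance field. The metric and index type this server actually configures in FAISS are not stated here, and the toy vectors are invented for the example.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, vectors, k=5):
    """Return the k (index, distance) pairs closest to the query, nearest first."""
    scored = [(i, l2_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy 2-dimensional "embeddings"; the real ones have 1024 dimensions
chunks = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
for index, distance in top_k([0.9, 0.1], chunks, k=2):
    print(index, round(distance, 3))
```

A scan like this compares the query against every stored vector, which is the work a vector index library is designed to do efficiently at larger scale.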
Why RAG Matters
Without RAG: LLMs can only use pre-trained knowledge, which may be:
Outdated (training data cutoff)
Generic (not domain-specific)
Limited (no access to private documents or new publications)
With RAG: LLMs can:
Access current information (newly published papers)
Use domain-specific knowledge (research papers)
Ground responses in verifiable sources
Answer questions about documents not in training data
RAG Best Practices (This Implementation)
Effective Chunking: Balance chunk size - too small loses context, too large dilutes relevance
Semantic Embeddings: Use models optimized for your domain (here: research papers)
Vector Search: Fast retrieval is essential (FAISS provides sub-millisecond search)
Metadata Preservation: Keep headers, pages, sections for navigation
Context Retrieval: Include surrounding chunks for better understanding
Further Learning
To understand RAG better:
Experiment with different chunking methods (header vs S2)
Try different embedding models
Adjust context_window in search_research_papers()
Explore the database structure to see how chunks are stored
Check logs to see performance metrics
This implementation provides a production-ready RAG system you can study and extend.
Requirements
Python: 3.10+ (numpy 2.2.6, one of the dependencies, does not support Python 3.9)
Platform: macOS (for MLX optimization), Linux/Windows (with CPU fallback)
RAM: ~500MB for embeddings
Disk: ~1GB for model downloads (first run)
Technical Details
Embedding Model
Model: mlx-community/Qwen3-Embedding-0.6B
Dimensions: 1024
Framework: MLX (optimized for Apple Silicon)
Speed: ~35 embeddings/second on Apple Silicon
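The figures above allow a rough back-of-the-envelope estimate of embedding time and raw vector size for a given number of chunks. The sketch assumes vectors are stored as 32-bit floats with no index overhead and that throughput stays near the quoted ~35 embeddings/second. Both are assumptions, and `estimate_embedding_cost` is a hypothetical helper.

```python
def estimate_embedding_cost(num_chunks, dim=1024, per_second=35.0, bytes_per_value=4):
    """Return (seconds, megabytes) to embed and store num_chunks vectors."""
    seconds = num_chunks / per_second
    megabytes = num_chunks * dim * bytes_per_value / 1_000_000
    return seconds, megabytes


seconds, megabytes = estimate_embedding_cost(700)
print(f"{seconds:.0f} s, {megabytes:.2f} MB")
# prints: 20 s, 2.87 MB
```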
Chunking Methods
Header-based (method="header"):
Best for academic papers with clear structure
Preserves document hierarchy
Groups content under headers
S2 Chunking (method="s2"):
Hybrid spatial-semantic approach
Combines layout analysis with semantic similarity
Optimal for unstructured documents
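As an illustration of the header-based idea, the sketch below splits plain text into one chunk per numbered section header. It is a simplification under assumptions: headers are detected with a regular expression for lines like "1 Introduction" or "2.1 Related Work", whereas the server's own detection logic is not described here. `chunk_by_headers` and the sample text are hypothetical.

```python
import re

# Assumption: headers look like "1 Introduction" or "2.1 Related Work"
HEADER_RE = re.compile(r"^\d+(\.\d+)*\s+[A-Z].*$")


def chunk_by_headers(text):
    """Split text into chunks, one per header, keeping the header as metadata."""
    chunks = []
    current = {"header_path": "Preamble", "lines": []}
    for raw in text.splitlines():
        line = raw.strip()
        if HEADER_RE.match(line):
            if current["lines"]:  # headers with no body text produce no chunk
                chunks.append(current)
            current = {"header_path": line, "lines": []}
        elif line:
            current["lines"].append(line)
    if current["lines"]:
        chunks.append(current)
    return [
        {"chunk_index": i, "header_path": c["header_path"], "text": " ".join(c["lines"])}
        for i, c in enumerate(chunks)
    ]


sample = """Sample Paper Title
1 Introduction
First body sentence.
Second body sentence.
2 Background
Third body sentence.
"""
for chunk in chunk_by_headers(sample):
    print(chunk)
```

Each chunk carries its header, which is what later makes lookups such as header_path="Introduction" possible.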
Storage
Database: SQLite (indexes/research_papers.db)
Vector Index: FAISS (indexes/research_papers.faiss)
Mapping: NumPy array (indexes/research_papers_mapping.npy)
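To see how papers and chunks are stored, the SQLite file can be inspected with Python's built-in sqlite3 module. The sketch below lists every table with its column names. It makes no assumption about the actual schema, and `list_tables` is a hypothetical helper.

```python
import os
import sqlite3


def list_tables(db_path):
    """Return {table_name: [column names]} for every table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )]
        return {
            name: [col[1] for col in conn.execute(f'PRAGMA table_info("{name}")')]
            for name in names
        }
    finally:
        conn.close()


DB_PATH = "indexes/research_papers.db"
if os.path.exists(DB_PATH):  # sqlite3.connect would otherwise create an empty file
    for table, columns in list_tables(DB_PATH).items():
        print(table, columns)
```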
Troubleshooting
MCP Server Not Starting
Verify virtual environment:
which python3 # Should show path in venv/bin/python3
Check dependencies:
pip list | grep fastmcp
Test server manually:
python semantic_chunked_pdf_rag.py
If it starts without errors, press Ctrl+C to stop.
Check logs:
tail -f logs/pdf_indexer_*.log
Tools Not Appearing in Agent
Restart completely (not just reload)
Check configuration path is absolute (not relative)
Verify Python path points to virtual environment
Check MCP logs for connection errors
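The path checks in this list can be automated with a short script. The sketch below reads a config in the "mcpServers" format shown in the setup section and reports any command or argument that is not an absolute, existing path. `check_mcp_config` is a hypothetical helper, and it assumes every entry in "args" is a file path, which holds for the configuration in this README but not for MCP servers in general.

```python
import json
import os


def check_mcp_config(config_text, server_name="pdf-indexer"):
    """Return a list of problems found in an MCP JSON config string."""
    problems = []
    servers = json.loads(config_text).get("mcpServers", {})
    entry = servers.get(server_name)
    if entry is None:
        return [f"no '{server_name}' entry under mcpServers"]
    # Assumption: the command and every argument are file paths
    paths = [entry.get("command", "")] + list(entry.get("args", []))
    for path in paths:
        if not os.path.isabs(path):
            problems.append(f"not an absolute path: {path!r}")
        elif not os.path.exists(path):
            problems.append(f"path does not exist: {path!r}")
    return problems


example = (
    '{"mcpServers": {"pdf-indexer": {'
    '"command": "venv/bin/python3", '
    '"args": ["semantic_chunked_pdf_rag.py"]}}}'
)
for problem in check_mcp_config(example):
    print(problem)
```

To check a real file, read its contents and pass the text to the function; an empty list means the checks found nothing wrong.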
Embedding Generation Fails
Verify MLX installed:
python -c "import mlx.core as mx; print('OK')"
Check available RAM (needs ~500MB)
First run downloads model automatically (may take time)
Search Returns No Results
Verify papers indexed:
# Agents should discover list_indexed_papers() tool
Generate embeddings:
# Agents should discover generate_embeddings() tool
Check FAISS index exists in indexes/ directory
Contributing
Contributions welcome! This project uses GPL-3.0 copyleft licensing.
License
GNU General Public License v3.0 (GPL-3.0) - Copyleft License
Copyright (C) 2025 Liz Howard (@lizTheDeveloper)
See LICENSE file for full license text.
Links
Repository: https://github.com/lizTheDeveloper/pdf-indexer-mcp
Issues: https://github.com/lizTheDeveloper/pdf-indexer-mcp/issues
Author: Liz Howard (@lizTheDeveloper)