
read_pdf

Extract text, tables, and images from PDF files as Markdown. Supports local files, URLs, page ranges, and OCR for scanned documents.

Instructions

Read content from a PDF file (local path or URL). Returns a unified Markdown string containing text, tables, and image references.

Args:
- source: Local file path or URL to the PDF.
- page_range: Format "1-5" or "10". If not provided, reads all pages.
- extract_images: If True, extracts images to a temp directory and links them.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| source | Yes | Local file path or URL to the PDF. | |
| page_range | No | Format "1-5" or "10". If not provided, reads all pages. | None |
| extract_images | No | If True, extracts images to a temp directory and links them. | False |
| force_ocr | No | If True, forces OCR (for scanned documents). | False |
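For illustration, here are arguments a client might send in a `tools/call` request. The defaults are inferred from the handler signature shown below, and the effect of `force_ocr` is inferred from the tool description; the values themselves are made up:

```python
# Illustrative read_pdf arguments; only "source" is required.
arguments = {
    "source": "https://example.com/report.pdf",
    "page_range": "1-5",      # optional: "1-5" or "10"; omit to read all pages
    "extract_images": False,  # optional: extract images to a temp dir and link them
    "force_ocr": False,       # optional: force OCR (for scanned documents)
}
```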

Implementation Reference

  • The primary handler for the 'read_pdf' tool. It invokes PDFParser to extract content and formats it into a Markdown report with metadata.
```python
@mcp.tool()
async def read_pdf(
    source: str,
    page_range: str | None = None,
    extract_images: bool = False,
    force_ocr: bool = False,
) -> str:
    """
    Read content from a PDF file (local path or URL).

    Returns a unified Markdown string containing text, tables, and image
    references.

    Args:
        source: Local file path or URL to the PDF.
        page_range: Format "1-5" or "10". If not provided, reads all pages.
        extract_images: If True, extracts images to temp dir and links them.
    """
    result = await parser.parse(source, page_range, extract_images, force_ocr)

    # Format the result for the AI. We return string content: if the client
    # expected JSON we could return json.dumps(result), but text-based LLMs
    # usually prefer direct text, so we construct a rich Markdown report.
    metadata = result["metadata"]
    content = result["content"]

    report = f"""# PDF Extraction Result

## Metadata
- **Title**: {metadata['title']}
- **Page Count**: {metadata['page_count']}
- **Source**: {metadata['source']}

## Content

{content}
"""
    return report
```
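For a PDF whose metadata resolves to the example values below (title, page count, and path are all made up), the returned report string would begin like this:

```python
# Illustrative shape of the handler's return value.
expected_prefix = """# PDF Extraction Result

## Metadata
- **Title**: Q3 Report
- **Page Count**: 2
- **Source**: /tmp/q3.pdf

## Content
"""
```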
  • The core parsing method in PDFParser that orchestrates loading the PDF, extracting text, images, tables, and building the result dictionary used by the handler.
```python
async def parse(
    self,
    source: str,
    page_range: str | None = None,
    extract_images: bool = False,
    force_ocr: bool = False,
) -> Dict[str, Any]:
    """
    Main entry point to parse a PDF.

    Args:
        source: URL or local path.
        page_range: String like "1-5", "10", or None for all.
        extract_images: Whether to extract images.

    Returns:
        Dict containing metadata and content (markdown).
    """
    # 1. Load document
    doc = await self.loader.load(source)
    try:
        # 2. Parse page range
        pages = self._parse_page_range(doc, page_range)

        # 3. Extract text (Markdown)
        text_md = self.text_extractor.extract_text(doc, pages, force_ocr=force_ocr)

        # 4. Extract images (optional)
        images_data = []
        if extract_images:
            images_data = self.image_extractor.extract_images(doc, pages)
            # Simplified approach: append the image markdown at the end of
            # the text rather than interpolating it.
            if images_data:
                text_md += "\n\n## Extracted Images\n"
                for img in images_data:
                    text_md += f"\n{img['markdown']}\n"

        # 5. Extract tables
        # pdfplumber needs a file path, while the loader may have opened the
        # document from a URL stream (fitz). To stay robust, save streamed
        # documents to a temp file before table extraction.
        temp_pdf_path = None
        if doc.name and os.path.exists(doc.name):
            # It's a local file.
            pdf_path = doc.name
        else:
            # It's a stream (URL); save to a temp file.
            import tempfile
            with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
                doc.save(tmp.name)
                pdf_path = tmp.name
                temp_pdf_path = tmp.name

        tables_md = self.table_extractor.extract_tables(pdf_path, pages)
        if tables_md:
            text_md += "\n\n## Extracted Tables\n" + "\n\n".join(tables_md)

        # Clean up the temp file.
        if temp_pdf_path and os.path.exists(temp_pdf_path):
            os.remove(temp_pdf_path)

        # 6. Construct final result
        metadata = {
            "page_count": len(doc),
            "title": doc.metadata.get("title", ""),
            "author": doc.metadata.get("author", ""),
            "source": source,
        }
        return {
            "metadata": metadata,
            "content": text_md,
            "images": [img["path"] for img in images_data],
        }
    finally:
        doc.close()
```
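The `_parse_page_range` helper invoked in step 2 is not shown in the reference. A minimal sketch of what it plausibly does, assuming a 1-based inclusive range in the `page_range` string and 0-based page indices for PyMuPDF (both assumptions, not confirmed by the source):

```python
def _parse_page_range(self, doc, page_range: str | None) -> list[int]:
    # Hypothetical sketch: the real helper is not shown in the reference.
    # Interprets "1-5" as pages 1 through 5 (1-based, inclusive) and "10"
    # as the single page 10, returning 0-based indices for PyMuPDF.
    if page_range is None:
        return list(range(len(doc)))
    if "-" in page_range:
        start, end = page_range.split("-", 1)
        first, last = int(start), int(end)
    else:
        first = last = int(page_range)
    # Clamp to the document's bounds and convert to 0-based indices.
    first = max(1, first)
    last = min(len(doc), last)
    return list(range(first - 1, last))
```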
  • The @mcp.tool() decorator registers the read_pdf function as an MCP tool.
```python
@mcp.tool()
```
  • A helper resource handler for pdf:// URIs that extracts PDF content using the same parser.
    @mcp.resource("pdf://{file_path}") async def read_pdf_resource(file_path: str) -> str: """ Directly read a PDF file as a resource using URI scheme pdf://... Warning: file_path must be absolute. """ # Simply delegate to the parsing logic # Note: Resources usually return raw content, but for PDF we want the processed markdown # because raw PDF bytes are not useful to the LLM directly as text. # Resource templates extract the variables. # FastMCP resources route passes the variable. # We need to reconstruct full path if needed, but here it comes as string. # Re-adding the leading slash if it was stripped is a common gotcha with URI templates, # but let's assume valid absolute path for now. # Using the same parser logic result = await parser.parse(file_path) return result["content"]

MCP directory API

We provide all the information about MCP servers via our MCP API.

```bash
curl -X GET 'https://glama.ai/api/mcp/v1/servers/rexfelix/readPDF_mcp_server'
```

If you have feedback or need assistance with the MCP directory API, please join our Discord server.