
read_pdf

Extract text, tables, and images from PDF files as Markdown. Supports local files, URLs, page ranges, and OCR for scanned documents.

Instructions

Read content from a PDF file (local path or URL). Returns a unified Markdown string containing text, tables, and image references.

Args:
- source: Local file path or URL to the PDF.
- page_range: Format "1-5" or "10". If not provided, reads all pages.
- extract_images: If True, extracts images to a temp directory and links them.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| source | Yes | Local file path or URL to the PDF. | |
| page_range | No | Format "1-5" or "10". If not provided, reads all pages. | None |
| extract_images | No | If True, extracts images to a temp directory and links them. | False |
| force_ocr | No | If True, forces OCR (for scanned documents). | False |
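For illustration, here are arguments a client might send in a `tools/call` request. The defaults are inferred from the handler signature shown below, and the effect of `force_ocr` is inferred from the tool description; the values themselves are made up:

```python
# Illustrative read_pdf arguments; only "source" is required.
arguments = {
    "source": "https://example.com/report.pdf",
    "page_range": "1-5",      # optional: "1-5" or "10"; omit to read all pages
    "extract_images": False,  # optional: extract images to a temp dir and link them
    "force_ocr": False,       # optional: force OCR (for scanned documents)
}
```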

Implementation Reference

  • The primary handler for the 'read_pdf' tool. It invokes PDFParser to extract content and formats it into a Markdown report with metadata.
```python
@mcp.tool()
async def read_pdf(
    source: str,
    page_range: str | None = None,
    extract_images: bool = False,
    force_ocr: bool = False,
) -> str:
    """
    Read content from a PDF file (local path or URL).

    Returns a unified Markdown string containing text, tables, and image
    references.

    Args:
        source: Local file path or URL to the PDF.
        page_range: Format "1-5" or "10". If not provided, reads all pages.
        extract_images: If True, extracts images to temp dir and links them.
    """
    result = await parser.parse(source, page_range, extract_images, force_ocr)

    # Format the result for the AI. We return string content: if the client
    # expected JSON we could return json.dumps(result), but text-based LLMs
    # usually prefer direct text, so we construct a rich Markdown report.
    metadata = result["metadata"]
    content = result["content"]

    report = f"""# PDF Extraction Result

## Metadata
- **Title**: {metadata['title']}
- **Page Count**: {metadata['page_count']}
- **Source**: {metadata['source']}

## Content

{content}
"""
    return report
```
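For a PDF whose metadata resolves to the example values below (title, page count, and path are all made up), the returned report string would begin like this:

```python
# Illustrative shape of the handler's return value.
expected_prefix = """# PDF Extraction Result

## Metadata
- **Title**: Q3 Report
- **Page Count**: 2
- **Source**: /tmp/q3.pdf

## Content
"""
```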
  • The core parsing method in PDFParser that orchestrates loading the PDF, extracting text, images, tables, and building the result dictionary used by the handler.
```python
async def parse(
    self,
    source: str,
    page_range: str | None = None,
    extract_images: bool = False,
    force_ocr: bool = False,
) -> Dict[str, Any]:
    """
    Main entry point to parse a PDF.

    Args:
        source: URL or local path.
        page_range: String like "1-5", "10", or None for all.
        extract_images: Whether to extract images.

    Returns:
        Dict containing metadata and content (markdown).
    """
    # 1. Load document
    doc = await self.loader.load(source)
    try:
        # 2. Parse page range
        pages = self._parse_page_range(doc, page_range)

        # 3. Extract text (Markdown)
        text_md = self.text_extractor.extract_text(doc, pages, force_ocr=force_ocr)

        # 4. Extract images (optional)
        images_data = []
        if extract_images:
            images_data = self.image_extractor.extract_images(doc, pages)
            # Simplified approach: append the image markdown at the end of
            # the text rather than interpolating it.
            if images_data:
                text_md += "\n\n## Extracted Images\n"
                for img in images_data:
                    text_md += f"\n{img['markdown']}\n"

        # 5. Extract tables
        # pdfplumber needs a file path, while the loader may have opened the
        # document from a URL stream (fitz). To stay robust, save streamed
        # documents to a temp file before table extraction.
        temp_pdf_path = None
        if doc.name and os.path.exists(doc.name):
            # It's a local file.
            pdf_path = doc.name
        else:
            # It's a stream (URL); save to a temp file.
            import tempfile
            with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
                doc.save(tmp.name)
                pdf_path = tmp.name
                temp_pdf_path = tmp.name

        tables_md = self.table_extractor.extract_tables(pdf_path, pages)
        if tables_md:
            text_md += "\n\n## Extracted Tables\n" + "\n\n".join(tables_md)

        # Clean up the temp file.
        if temp_pdf_path and os.path.exists(temp_pdf_path):
            os.remove(temp_pdf_path)

        # 6. Construct final result
        metadata = {
            "page_count": len(doc),
            "title": doc.metadata.get("title", ""),
            "author": doc.metadata.get("author", ""),
            "source": source,
        }
        return {
            "metadata": metadata,
            "content": text_md,
            "images": [img["path"] for img in images_data],
        }
    finally:
        doc.close()
```
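The `_parse_page_range` helper invoked in step 2 is not shown in the reference. A minimal sketch of what it plausibly does, assuming a 1-based inclusive range in the `page_range` string and 0-based page indices for PyMuPDF (both assumptions, not confirmed by the source):

```python
def _parse_page_range(self, doc, page_range: str | None) -> list[int]:
    # Hypothetical sketch: the real helper is not shown in the reference.
    # Interprets "1-5" as pages 1 through 5 (1-based, inclusive) and "10"
    # as the single page 10, returning 0-based indices for PyMuPDF.
    if page_range is None:
        return list(range(len(doc)))
    if "-" in page_range:
        start, end = page_range.split("-", 1)
        first, last = int(start), int(end)
    else:
        first = last = int(page_range)
    # Clamp to the document's bounds and convert to 0-based indices.
    first = max(1, first)
    last = min(len(doc), last)
    return list(range(first - 1, last))
```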
  • The @mcp.tool() decorator registers the read_pdf function as an MCP tool.
```python
@mcp.tool()
```
  • A helper resource handler for pdf:// URIs that extracts PDF content using the same parser.
    @mcp.resource("pdf://{file_path}") async def read_pdf_resource(file_path: str) -> str: """ Directly read a PDF file as a resource using URI scheme pdf://... Warning: file_path must be absolute. """ # Simply delegate to the parsing logic # Note: Resources usually return raw content, but for PDF we want the processed markdown # because raw PDF bytes are not useful to the LLM directly as text. # Resource templates extract the variables. # FastMCP resources route passes the variable. # We need to reconstruct full path if needed, but here it comes as string. # Re-adding the leading slash if it was stripped is a common gotcha with URI templates, # but let's assume valid absolute path for now. # Using the same parser logic result = await parser.parse(file_path) return result["content"]

MCP directory API

We provide all the information about MCP servers via our MCP API.

```bash
curl -X GET 'https://glama.ai/api/mcp/v1/servers/rexfelix/readPDF_mcp_server'
```

If you have feedback or need assistance with the MCP directory API, please join our Discord server.