read_pdf

Extract text, tables, and image references from PDF files (local or URL) and convert them to Markdown format for easy processing and analysis.

Instructions

Read content from a PDF file (local path or URL).
Returns a unified Markdown string containing text, tables, and image references.

Args:
    source: Local file path or URL to the PDF.
    page_range: Format "1-5" or "10". If not provided, reads all pages.
    extract_images: If True, extracts images to temp dir and links them.
    force_ocr: If True, forces OCR-based text extraction.

Input Schema

| Name           | Required | Description                                              | Default |
| -------------- | -------- | -------------------------------------------------------- | ------- |
| source         | Yes      | Local file path or URL to the PDF.                       | —       |
| page_range     | No       | Page range such as "1-5" or "10"; all pages if omitted.  | None    |
| extract_images | No       | Extract images to a temp directory and link them.        | false   |
| force_ocr      | No       | Force OCR-based text extraction.                         | false   |
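As a quick sanity check, a client can validate an argument payload against this schema before calling the tool. The validator below is an illustrative sketch written for this page, not part of the server:

```python
import re

def validate_read_pdf_args(args: dict) -> list:
    """Check a read_pdf argument payload against the input schema above."""
    errors = []
    if "source" not in args:
        errors.append("'source' is required")
    # page_range must look like "10" or "1-5" when present.
    pr = args.get("page_range")
    if pr is not None and not re.fullmatch(r"\d+(-\d+)?", pr):
        errors.append("'page_range' must look like '10' or '1-5'")
    # The two flags must be booleans when present.
    for flag in ("extract_images", "force_ocr"):
        if flag in args and not isinstance(args[flag], bool):
            errors.append(f"'{flag}' must be a boolean")
    return errors

print(validate_read_pdf_args({"source": "report.pdf", "page_range": "1-5"}))  # []
```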

Implementation Reference

  • The handler function for the MCP tool 'read_pdf', registered via the @mcp.tool() decorator. It parses the PDF with PDFParser and returns formatted Markdown containing metadata plus the extracted text, tables, and image references.
    @mcp.tool()
    async def read_pdf(
        source: str,
        page_range: str | None = None,
        extract_images: bool = False,
        force_ocr: bool = False,
    ) -> str:
        """
        Read content from a PDF file (local path or URL).
        Returns a unified Markdown string containing text, tables, and image references.
    
        Args:
            source: Local file path or URL to the PDF.
            page_range: Format "1-5" or "10". If not provided, reads all pages.
            extract_images: If True, extracts images to temp dir and links them.
            force_ocr: If True, forces OCR-based text extraction.
        """
        result = await parser.parse(source, page_range, extract_images, force_ocr)
    
        # Format the result as Markdown text for the model.
        # (Clients that expect JSON could receive json.dumps(result) instead,
        # but text-based LLMs usually prefer direct text content.)
    
        metadata = result["metadata"]
        content = result["content"]
    
        report = f"""# PDF Extraction Result
        
    ## Metadata
    - **Title**: {metadata['title']}
    - **Page Count**: {metadata['page_count']}
    - **Source**: {metadata['source']}
    
    ## Content
    {content}
    """
        return report
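For concreteness, feeding the same template a small sample result (made-up metadata, not from a real PDF) shows the shape of the report the tool returns:

```python
# Hypothetical sample values standing in for parser.parse() output.
metadata = {"title": "Annual Report", "page_count": 12, "source": "report.pdf"}
content = "## Page 1\n\nRevenue grew 8% year over year..."

report = f"""# PDF Extraction Result

## Metadata
- **Title**: {metadata['title']}
- **Page Count**: {metadata['page_count']}
- **Source**: {metadata['source']}

## Content
{content}
"""
print(report)
```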
  • The PDFParser.parse method, which contains the core logic invoked by the read_pdf tool handler: document loading, then text, image, and table extraction.
    async def parse(
        self,
        source: str,
        page_range: str | None = None,
        extract_images: bool = False,
        force_ocr: bool = False,
    ) -> Dict[str, Any]:
        """
        Main entry point to parse a PDF.
    
        Args:
            source: URL or local path.
            page_range: String like "1-5", "10", or None for all.
            extract_images: Whether to extract images.
            force_ocr: Whether to force OCR-based text extraction.
    
        Returns:
            Dict containing metadata and content (markdown).
        """
        # 1. Load Document
        doc = await self.loader.load(source)
    
        try:
            # 2. Parse Page Range
            pages = self._parse_page_range(doc, page_range)
    
            # 3. Extract Text (Markdown)
            text_md = self.text_extractor.extract_text(doc, pages, force_ocr=force_ocr)
    
            # 4. Extract Images (Optional)
            images_data = []
            if extract_images:
                images_data = self.image_extractor.extract_images(doc, pages)
                # Append image references to the end of the Markdown output.
                if images_data:
                    text_md += "\n\n## Extracted Images\n"
                    for img in images_data:
                        text_md += f"\n{img['markdown']}\n"
    
            # 5. Extract Tables
            # pdfplumber needs a file path. If the document was loaded from a
            # URL (fitz stream), save it to a temporary file first so table
            # extraction works for both local and remote sources.
            temp_pdf_path = None
            if doc.name and os.path.exists(doc.name):
                # Local file: use its path directly.
                pdf_path = doc.name
            else:
                # Stream (URL): persist to a temp file for pdfplumber.
                import tempfile

                with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
                    doc.save(tmp.name)
                    pdf_path = tmp.name
                    temp_pdf_path = tmp.name

            try:
                tables_md = self.table_extractor.extract_tables(pdf_path, pages)
            finally:
                # Remove the temp file even if table extraction raises.
                if temp_pdf_path and os.path.exists(temp_pdf_path):
                    os.remove(temp_pdf_path)

            if tables_md:
                text_md += "\n\n## Extracted Tables\n" + "\n\n".join(tables_md)
    
            # 6. Construct Final Result
            metadata = {
                "page_count": len(doc),
                "title": doc.metadata.get("title", ""),
                "author": doc.metadata.get("author", ""),
                "source": source,
            }
    
            return {
                "metadata": metadata,
                "content": text_md,
                "images": [img["path"] for img in images_data],
            }
    
        finally:
            doc.close()
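The `_parse_page_range` helper referenced in step 2 is not shown on this page; a plausible sketch of its logic, assuming 1-based "1-5"/"10" specs resolved to 0-based page indices and clamped to the document length, might look like this:

```python
from typing import List, Optional

def parse_page_range(page_range: Optional[str], page_count: int) -> List[int]:
    """Hypothetical stand-in for PDFParser._parse_page_range.

    Resolves a 1-based spec like "1-5" or "10" to 0-based page indices,
    returning all pages when no range is given.
    """
    if not page_range:
        return list(range(page_count))
    if "-" in page_range:
        start, end = (int(part) for part in page_range.split("-", 1))
    else:
        start = end = int(page_range)
    # Clamp to the document and convert 1-based input to 0-based indices.
    start = max(start, 1)
    end = min(end, page_count)
    return list(range(start - 1, end))

print(parse_page_range("2-4", 10))  # [1, 2, 3]
```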
  • The @mcp.tool() decorator registers the read_pdf function as an MCP tool.
    @mcp.tool()
MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/rexfelix/readPDF_mcp_server'
