pietermyb

mcp-pdf-reader

pdf-to-text

Extract text content from PDF documents for analysis or processing. Specify page ranges and include page numbers as needed.

Instructions

Extract all text from a PDF document

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| pdf_id | Yes | ID of the PDF to extract text from | |
| include_page_numbers | No | Whether to include page number markers in the output | true |
| start_page | No | Start page number (0-based, inclusive) | |
| end_page | No | End page number (0-based, inclusive) | |
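
Because start_page and end_page are 0-based and inclusive, a caller working with the 1-based page numbers shown in a PDF viewer has to subtract one from each bound. The following is a minimal sketch of that conversion; build_pdf_to_text_args and its first_page/last_page parameters are hypothetical names used for illustration and are not part of the server.

```python
from typing import Any, Dict, Optional


def build_pdf_to_text_args(
    pdf_id: str,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
    include_page_numbers: bool = True,
) -> Dict[str, Any]:
    """Build a pdf-to-text arguments dict from 1-based, inclusive page numbers.

    Hypothetical helper: only the dict keys come from the tool's input schema.
    """
    if not pdf_id:
        raise ValueError("pdf_id is required")
    args: Dict[str, Any] = {
        "pdf_id": pdf_id,
        "include_page_numbers": include_page_numbers,
    }
    # The schema's start_page/end_page are 0-based, so subtract 1 from each.
    if first_page is not None:
        args["start_page"] = first_page - 1
    if last_page is not None:
        args["end_page"] = last_page - 1
    return args
```

Omitting both page arguments leaves start_page and end_page out of the dict, which matches the schema marking them as optional.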

Implementation Reference

  • The handler for the 'pdf-to-text' tool, implemented as a branch inside the @server.call_tool() handler. It extracts text from the requested page range of an open PDF (a PyPDF2.PdfReader), optionally prefixes each page with a page marker, adds any available document metadata to the header, and returns the result as a single TextContent item.
    elif name == "pdf-to-text":
        pdf_id = arguments.get("pdf_id")
        if not pdf_id or pdf_id not in pdfs:
            raise ValueError("Invalid PDF ID")
    
        reader = pdfs[pdf_id]
        include_page_numbers = arguments.get("include_page_numbers", True)
    
        # Get page range or use all pages
        start_page = arguments.get("start_page", 0)
        end_page = arguments.get("end_page", len(reader.pages) - 1)
    
        # Validate page range
        if start_page < 0 or start_page >= len(reader.pages):
            start_page = 0
        if end_page < 0 or end_page >= len(reader.pages):
            end_page = len(reader.pages) - 1
        if start_page > end_page:
            start_page, end_page = end_page, start_page
    
        # Extract text from each page in the selected range
        all_text = []
        total_pages = len(reader.pages)
    
        for page_num in range(start_page, end_page + 1):
            page = reader.pages[page_num]
            page_text = page.extract_text()
    
            if page_text:
                # Format the text to be easier to read
                page_text = page_text.replace('\n\n', '\n').strip()
    
                if include_page_numbers:
                    all_text.append(f"\n--- PAGE {page_num + 1}/{total_pages} ---\n{page_text}")
                else:
                    all_text.append(page_text)
            elif include_page_numbers:
                all_text.append(f"\n--- PAGE {page_num + 1}/{total_pages} ---\n[No extractable text on this page]")
    
        # Join all the text
        full_text = "\n".join(all_text)
    
        # Get PDF metadata for context
        metadata = reader.metadata
        metadata_text = ""
        if metadata:
            metadata_text = "\nDocument Metadata:\n" + "\n".join([f"- {k}: {v}" for k, v in metadata.items() if v])
    
        # Create page range description
        if start_page == 0 and end_page == total_pages - 1:
            page_range_desc = f"all pages (1-{total_pages})"
        elif start_page == end_page:
            page_range_desc = f"page {start_page + 1}"
        else:
            page_range_desc = f"pages {start_page + 1}-{end_page + 1}"
    
        return [
            types.TextContent(
                type="text",
                text=(
                    f"Text extracted from {page_range_desc} of '{os.path.basename(pdf_paths[pdf_id])}'"
                    f"{metadata_text}\n\n{full_text}"
                ),
            )
        ]
  • The JSON schema and description for the 'pdf-to-text' tool, defined in the @server.list_tools() handler. Only pdf_id is required; include_page_numbers, start_page, and end_page are optional.
    types.Tool(
        name="pdf-to-text",
        description="Extract all text from a PDF document",
        inputSchema={
            "type": "object",
            "properties": {
                "pdf_id": {"type": "string", "description": "ID of the PDF to extract text from"},
                "include_page_numbers": {"type": "boolean", "description": "Whether to include page number markers in the output", "default": True},
                "start_page": {"type": "integer", "description": "Start page number (0-based, inclusive)"},
                "end_page": {"type": "integer", "description": "End page number (0-based, inclusive)"},
            },
            "required": ["pdf_id"],
        },
    )
  • The 'pdf-to-text' tool is registered in the list_tools() handler by including it in the returned list of types.Tool objects.
    return [
        types.Tool(
            name="open-pdf",
            description="Open a PDF file",
            inputSchema={
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Path to the PDF file"},
                },
                "required": ["path"],
            },
        ),
        types.Tool(
            name="close-pdf",
            description="Close an open PDF file",
            inputSchema={
                "type": "object",
                "properties": {
                    "pdf_id": {"type": "string", "description": "ID of the PDF to close"},
                },
                "required": ["pdf_id"],
            },
        ),
        types.Tool(
            name="list-pdf-metadata",
            description="List metadata of an open PDF",
            inputSchema={
                "type": "object",
                "properties": {
                    "pdf_id": {"type": "string", "description": "ID of the PDF to get metadata for"},
                },
                "required": ["pdf_id"],
            },
        ),
        types.Tool(
            name="get-pdf-page-count",
            description="Get the page count of a PDF",
            inputSchema={
                "type": "object",
                "properties": {
                    "pdf_id": {"type": "string", "description": "ID of the PDF to get page count for"},
                },
                "required": ["pdf_id"],
            },
        ),
        types.Tool(
            name="get-pdf-page-text",
            description="Get the text content of a specific page in a PDF",
            inputSchema={
                "type": "object",
                "properties": {
                    "pdf_id": {"type": "string", "description": "ID of the PDF to get page text from"},
                    "page_number": {"type": "integer", "description": "Page number (0-based index)"},
                },
                "required": ["pdf_id", "page_number"],
            },
        ),
        types.Tool(
            name="pdf-to-text",
            description="Extract all text from a PDF document",
            inputSchema={
                "type": "object",
                "properties": {
                    "pdf_id": {"type": "string", "description": "ID of the PDF to extract text from"},
                    "include_page_numbers": {"type": "boolean", "description": "Whether to include page number markers in the output", "default": True},
                    "start_page": {"type": "integer", "description": "Start page number (0-based, inclusive)"},
                    "end_page": {"type": "integer", "description": "End page number (0-based, inclusive)"},
                },
                "required": ["pdf_id"],
            },
        )
    ]
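
The page-range handling in the 'pdf-to-text' handler above can be isolated as two small pure functions, which makes its edge cases easier to see. This is a sketch that mirrors the handler's logic; the function names normalize_page_range and describe_page_range are hypothetical and do not appear in the server code.

```python
from typing import Tuple


def normalize_page_range(start_page: int, end_page: int, total_pages: int) -> Tuple[int, int]:
    """Mirror the handler's validation of a 0-based, inclusive page range."""
    # An out-of-range start falls back to the first page.
    if start_page < 0 or start_page >= total_pages:
        start_page = 0
    # An out-of-range end falls back to the last page.
    if end_page < 0 or end_page >= total_pages:
        end_page = total_pages - 1
    # A reversed range is swapped rather than rejected.
    if start_page > end_page:
        start_page, end_page = end_page, start_page
    return start_page, end_page


def describe_page_range(start_page: int, end_page: int, total_pages: int) -> str:
    """Build the 1-based, human-readable range used in the response header."""
    if start_page == 0 and end_page == total_pages - 1:
        return f"all pages (1-{total_pages})"
    if start_page == end_page:
        return f"page {start_page + 1}"
    return f"pages {start_page + 1}-{end_page + 1}"
```

Note that invalid bounds are silently replaced instead of raising an error: on a 10-page document, a start_page of 50 becomes 0, not 9, so the extraction starts from the first page.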

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/pietermyb/mcp-pdf-reader'
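
The same request can be made from Python with the standard library. This is a minimal sketch: the helper names are hypothetical, and it assumes the endpoint returns a JSON object, which the page does not state.

```python
import json
import urllib.request

# Base path taken from the curl example above.
API_BASE = "https://glama.ai/api/mcp/v1/servers"


def server_url(owner: str, name: str) -> str:
    """Build the directory URL for one MCP server."""
    return f"{API_BASE}/{owner}/{name}"


def fetch_server_info(owner: str, name: str, timeout: float = 10.0) -> dict:
    """GET the server record. Assumption: the response body is a JSON object."""
    with urllib.request.urlopen(server_url(owner, name), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call (performs a network request, so it is left commented out):
# info = fetch_server_info("pietermyb", "mcp-pdf-reader")
```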

If you have feedback or need assistance with the MCP directory API, please join our Discord server.