Glama
pietermyb

mcp-pdf-reader

pdf-to-text

Extract text content from PDF documents for analysis or processing. Specify page ranges and include page numbers as needed.

Instructions

Extract all text from a PDF document

Input Schema

Name                  Required  Description                                             Default
pdf_id                Yes       ID of the PDF to extract text from
include_page_numbers  No        Whether to include page number markers in the output
start_page            No        Start page number (0-based, inclusive)
end_page              No        End page number (0-based, inclusive)
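
As a rough illustration of the input schema above, the sketch below checks a set of arguments before they are sent to the tool. The helper `check_pdf_to_text_args` and the ID `"example-id"` are hypothetical and not part of mcp-pdf-reader; a real `pdf_id` would presumably come from opening a PDF first.

```python
from typing import Any

# Field names and types mirror the input schema; the helper itself is
# hypothetical and only illustrates what a well-formed call looks like.
SCHEMA_TYPES = {
    "pdf_id": str,
    "include_page_numbers": bool,
    "start_page": int,
    "end_page": int,
}
REQUIRED = ("pdf_id",)


def check_pdf_to_text_args(arguments: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the arguments fit the schema."""
    problems = []
    for name in REQUIRED:
        if name not in arguments:
            problems.append(f"missing required field: {name}")
    for name, value in arguments.items():
        expected = SCHEMA_TYPES.get(name)
        if expected is None:
            continue  # the schema does not forbid extra fields, so ignore them
        # bool is a subclass of int in Python, so reject it for integer fields.
        wrong_type = not isinstance(value, expected)
        bool_as_int = expected is int and isinstance(value, bool)
        if wrong_type or bool_as_int:
            problems.append(f"{name} should be {expected.__name__}")
    return problems


# Placeholder ID; pages 0 through 2 inclusive, i.e. the first three pages.
example_args = {
    "pdf_id": "example-id",
    "start_page": 0,
    "end_page": 2,
    "include_page_numbers": False,
}
```

Because both page bounds are 0-based and inclusive, `start_page=0, end_page=2` covers three pages, not two.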

Implementation Reference

  • The handler for the 'pdf-to-text' tool, implemented as a branch inside the function decorated with @server.call_tool(). It extracts text from the requested page range of an open PDF using PyPDF2.PdfReader, optionally adds page markers, prepends any document metadata, and returns the result as TextContent.
    elif name == "pdf-to-text":
        pdf_id = arguments.get("pdf_id")
        if not pdf_id or pdf_id not in pdfs:
            raise ValueError("Invalid PDF ID")

        reader = pdfs[pdf_id]
        include_page_numbers = arguments.get("include_page_numbers", True)

        # Get page range or use all pages
        start_page = arguments.get("start_page", 0)
        end_page = arguments.get("end_page", len(reader.pages) - 1)

        # Validate page range
        if start_page < 0 or start_page >= len(reader.pages):
            start_page = 0
        if end_page < 0 or end_page >= len(reader.pages):
            end_page = len(reader.pages) - 1
        if start_page > end_page:
            start_page, end_page = end_page, start_page

        # Extract text from all pages
        all_text = []
        total_pages = len(reader.pages)
        for page_num in range(start_page, end_page + 1):
            page = reader.pages[page_num]
            page_text = page.extract_text()
            if page_text:
                # Format the text to be easier to read
                page_text = page_text.replace('\n\n', '\n').strip()
                if include_page_numbers:
                    all_text.append(f"\n--- PAGE {page_num + 1}/{total_pages} ---\n{page_text}")
                else:
                    all_text.append(page_text)
            elif include_page_numbers:
                all_text.append(f"\n--- PAGE {page_num + 1}/{total_pages} ---\n[No extractable text on this page]")

        # Join all the text
        full_text = "\n".join(all_text)

        # Get PDF metadata for context
        metadata = reader.metadata
        metadata_text = ""
        if metadata:
            metadata_text = "\nDocument Metadata:\n" + "\n".join([f"- {k}: {v}" for k, v in metadata.items() if v])

        # Create page range description
        if start_page == 0 and end_page == total_pages - 1:
            page_range_desc = f"all pages (1-{total_pages})"
        elif start_page == end_page:
            page_range_desc = f"page {start_page + 1}"
        else:
            page_range_desc = f"pages {start_page + 1}-{end_page + 1}"

        return [
            types.TextContent(
                type="text",
                text=(
                    f"Text extracted from {page_range_desc} of '{os.path.basename(pdf_paths[pdf_id])}'"
                    f"{metadata_text}\n\n{full_text}"
                ),
            )
        ]
  • The JSON schema and description for the 'pdf-to-text' tool, defined in the function decorated with @server.list_tools(). It specifies pdf_id (required) and the optional include_page_numbers, start_page, and end_page parameters.
    types.Tool(
        name="pdf-to-text",
        description="Extract all text from a PDF document",
        inputSchema={
            "type": "object",
            "properties": {
                "pdf_id": {"type": "string", "description": "ID of the PDF to extract text from"},
                "include_page_numbers": {"type": "boolean", "description": "Whether to include page number markers in the output", "default": True},
                "start_page": {"type": "integer", "description": "Start page number (0-based, inclusive)"},
                "end_page": {"type": "integer", "description": "End page number (0-based, inclusive)"},
            },
            "required": ["pdf_id"],
        },
    )
  • The 'pdf-to-text' tool is registered in the list_tools() handler by including it in the returned list of types.Tool objects.
    return [
        types.Tool(
            name="open-pdf",
            description="Open a PDF file",
            inputSchema={
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Path to the PDF file"},
                },
                "required": ["path"],
            },
        ),
        types.Tool(
            name="close-pdf",
            description="Close an open PDF file",
            inputSchema={
                "type": "object",
                "properties": {
                    "pdf_id": {"type": "string", "description": "ID of the PDF to close"},
                },
                "required": ["pdf_id"],
            },
        ),
        types.Tool(
            name="list-pdf-metadata",
            description="List metadata of an open PDF",
            inputSchema={
                "type": "object",
                "properties": {
                    "pdf_id": {"type": "string", "description": "ID of the PDF to get metadata for"},
                },
                "required": ["pdf_id"],
            },
        ),
        types.Tool(
            name="get-pdf-page-count",
            description="Get the page count of a PDF",
            inputSchema={
                "type": "object",
                "properties": {
                    "pdf_id": {"type": "string", "description": "ID of the PDF to get page count for"},
                },
                "required": ["pdf_id"],
            },
        ),
        types.Tool(
            name="get-pdf-page-text",
            description="Get the text content of a specific page in a PDF",
            inputSchema={
                "type": "object",
                "properties": {
                    "pdf_id": {"type": "string", "description": "ID of the PDF to get page text from"},
                    "page_number": {"type": "integer", "description": "Page number (0-based index)"},
                },
                "required": ["pdf_id", "page_number"],
            },
        ),
        types.Tool(
            name="pdf-to-text",
            description="Extract all text from a PDF document",
            inputSchema={
                "type": "object",
                "properties": {
                    "pdf_id": {"type": "string", "description": "ID of the PDF to extract text from"},
                    "include_page_numbers": {"type": "boolean", "description": "Whether to include page number markers in the output", "default": True},
                    "start_page": {"type": "integer", "description": "Start page number (0-based, inclusive)"},
                    "end_page": {"type": "integer", "description": "End page number (0-based, inclusive)"},
                },
                "required": ["pdf_id"],
            },
        )
    ]
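
The page-range handling in the 'pdf-to-text' handler is easier to reason about when pulled out on its own. The sketch below mirrors that logic in two small functions; the names `normalize_page_range` and `page_marker` are hypothetical and do not appear in the server code.

```python
def normalize_page_range(start_page: int, end_page: int, total_pages: int) -> tuple[int, int]:
    """Mirror the handler's validation of 0-based, inclusive page bounds."""
    # An out-of-range start is reset to the first page, not clamped to the last.
    if start_page < 0 or start_page >= total_pages:
        start_page = 0
    # An out-of-range end is reset to the last page.
    if end_page < 0 or end_page >= total_pages:
        end_page = total_pages - 1
    # A reversed range is swapped rather than rejected.
    if start_page > end_page:
        start_page, end_page = end_page, start_page
    return start_page, end_page


def page_marker(page_num: int, total_pages: int) -> str:
    """Build the marker line; the input index is 0-based, the label is 1-based."""
    return f"\n--- PAGE {page_num + 1}/{total_pages} ---\n"


print(normalize_page_range(2, 50, 10))  # (2, 9)
print(normalize_page_range(7, 3, 10))   # (3, 7)
print(normalize_page_range(20, 5, 10))  # (0, 5)
```

Two consequences are worth keeping in mind: invalid bounds never raise an error, so a typo in `start_page` silently widens the range, and the page numbers in the output markers are one higher than the 0-based indices passed in.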

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/pietermyb/mcp-pdf-reader'
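
The same request can be made from Python with only the standard library. This is a minimal sketch: the helper names are hypothetical, and it assumes the endpoint returns a JSON body, which the page does not state.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example; everything else here is illustrative.
API_BASE = "https://glama.ai/api/mcp/v1/servers"


def server_url(owner: str, name: str) -> str:
    """Build the directory URL for one server, escaping each path segment."""
    return f"{API_BASE}/{quote(owner, safe='')}/{quote(name, safe='')}"


def fetch_server_info(owner: str, name: str, timeout: float = 10.0):
    """GET the server entry and decode it, assuming the response is JSON."""
    request = urllib.request.Request(server_url(owner, name), method="GET")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_server_info("pietermyb", "mcp-pdf-reader")` would issue the request shown in the curl command.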

If you have feedback or need assistance with the MCP directory API, please join our Discord server.