PDF Extraction MCP Server

by xraywu

extract-pdf-contents

Extract text content from local PDF files with optional page selection, supporting both standard PDF reading and OCR capabilities.

Instructions

Extract contents from a local PDF file, given a comma-separated list of page numbers. Negative page indices are supported.
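The page selection described above can be sketched as a small parser. This is an illustration only: the server's own `parse_pages` method is not shown on this page, so the function below is hypothetical, and it assumes that positive numbers are 1-based, that negative numbers count back from the last page, and that an empty selection means every page.

```python
from typing import List, Optional


def parse_pages(pages: Optional[str], total_pages: int) -> List[int]:
    """Turn a spec such as "1,3,-1" into sorted zero-based page indices.

    Assumptions (not confirmed by the source): positive numbers are 1-based,
    negative numbers count back from the last page, and an empty spec
    selects every page.
    """
    if not pages or not pages.strip():
        return list(range(total_pages))

    selected = set()
    for token in pages.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas such as "1,,2"
        number = int(token)  # raises ValueError on non-numeric input
        if number == 0:
            raise ValueError("page numbers start at 1; 0 is not valid")
        # Map 1-based positives and from-the-end negatives to zero-based indices.
        index = number - 1 if number > 0 else total_pages + number
        if not 0 <= index < total_pages:
            raise ValueError(f"page {number} is out of range for {total_pages} pages")
        selected.add(index)
    return sorted(selected)
```

Under these assumptions, `"1,3,-1"` on a five-page document selects the first, third, and last pages.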

Input Schema

Name      Required  Description  Default
pdf_path  Yes
pages     No
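As a rough illustration of what a valid call looks like, the sketch below checks an arguments dictionary against the two parameters above. The `validate_arguments` helper and the example path are hypothetical and not part of the server. The check is hand-rolled with the standard library rather than a full JSON Schema validator. Note that `pages` is a single string, not a list of integers.

```python
from typing import Any, Dict

# Copied from the tool's published input schema.
INPUT_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "pdf_path": {"type": "string"},
        "pages": {"type": "string"},
    },
    "required": ["pdf_path"],
}


def validate_arguments(arguments: Dict[str, Any]) -> None:
    """Raise ValueError if required keys are missing or a known key is not a string."""
    for name in INPUT_SCHEMA["required"]:
        if name not in arguments:
            raise ValueError(f"missing required argument: {name}")
    for name, value in arguments.items():
        spec = INPUT_SCHEMA["properties"].get(name)
        if spec is not None and spec["type"] == "string" and not isinstance(value, str):
            raise ValueError(f"argument {name!r} must be a string")


# Hypothetical example payload: first two pages plus the last page.
example_arguments = {"pdf_path": "/path/to/report.pdf", "pages": "1,2,-1"}
validate_arguments(example_arguments)
```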

Implementation Reference

  • Handler logic for 'extract-pdf-contents' tool: validates arguments, instantiates PDFExtractor, calls extract_content, and returns the extracted text as TextContent.
    if name == "extract-pdf-contents":
        if not arguments:
            raise ValueError("Missing arguments")

        pdf_path = arguments.get("pdf_path")
        pages = arguments.get("pages")

        if not pdf_path:
            raise ValueError("Missing file path")

        extractor = PDFExtractor()
        extracted_text = extractor.extract_content(pdf_path, pages)
        return [
            types.TextContent(
                type="text",
                text=extracted_text,
            )
        ]
  • Input schema for 'extract-pdf-contents': requires 'pdf_path' string, optional 'pages' string for comma-separated page numbers.
    inputSchema={
        "type": "object",
        "properties": {
            "pdf_path": {"type": "string"},
            "pages": {"type": "string"},
        },
        "required": ["pdf_path"],
    },
  • Registers the 'extract-pdf-contents' tool in the MCP server's list_tools handler, including name, description, and input schema.
    @server.list_tools()
    async def handle_list_tools() -> list[types.Tool]:
        """
        Tools for PDF contents extraction
        """
        return [
            types.Tool(
                name="extract-pdf-contents",
                description="Extract contents from a local PDF file, given page numbers separated in comma. Negative page index number supported.",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "pdf_path": {"type": "string"},
                        "pages": {"type": "string"},
                    },
                    "required": ["pdf_path"],
                },
            )
        ]
  • Core helper method in PDFExtractor class that performs the actual PDF content extraction, supporting both text-based and scanned (OCR) PDFs, with page parsing.
    def extract_content(self, pdf_path: str, pages: Optional[str]) -> List[str]:
        """Main method for extracting PDF content"""
        if not pdf_path:
            raise ValueError("PDF path must not be empty")

        try:
            # Check whether the PDF is a scanned document
            is_scanned = self.is_scanned_pdf(pdf_path)

            # Parse the page selection
            reader = PdfReader(pdf_path)
            total_pages = len(reader.pages)
            selected_pages = self.parse_pages(pages, total_pages)

            # Choose the extraction method based on the PDF type
            if is_scanned:
                text = self.extract_text_from_scanned(pdf_path, selected_pages)
            else:
                text = self.extract_text_from_normal(pdf_path, selected_pages)

            return text
        except Exception as e:
            raise ValueError(f"Failed to extract PDF content: {str(e)}")
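To see the handler's control flow without installing the server or reading a real PDF, the excerpts above can be approximated with stand-ins. In this sketch `TextContent` is a minimal local substitute for `types.TextContent`, `FakeExtractor` is a hypothetical replacement for `PDFExtractor`, and the function is synchronous for brevity. The unknown-tool branch is also an assumption, since the excerpt shows only the matching branch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TextContent:
    """Minimal stand-in for types.TextContent, so the sketch has no dependencies."""
    type: str
    text: str


class FakeExtractor:
    """Hypothetical stand-in for PDFExtractor; returns canned text instead of reading a PDF."""

    def extract_content(self, pdf_path: str, pages: Optional[str]) -> str:
        return f"text of {pdf_path} (pages={pages or 'all'})"


def handle_call_tool(
    name: str,
    arguments: Optional[Dict[str, str]],
    extractor: Optional[FakeExtractor] = None,
) -> List[TextContent]:
    """Validate the arguments, run the extractor, and wrap the result as text content."""
    if name != "extract-pdf-contents":
        # Assumption: unknown tool names are rejected.
        raise ValueError(f"Unknown tool: {name}")
    if not arguments:
        raise ValueError("Missing arguments")

    pdf_path = arguments.get("pdf_path")
    if not pdf_path:
        raise ValueError("Missing file path")

    extractor = extractor or FakeExtractor()
    text = extractor.extract_content(pdf_path, arguments.get("pages"))
    return [TextContent(type="text", text=text)]
```

Passing the extractor in as a parameter is a choice made for this sketch; it lets the validation path be exercised in a unit test without touching the filesystem.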

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/xraywu/mcp-pdf-extraction-server'
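The same request can be made from Python with only the standard library. This is a minimal sketch: the helper names are hypothetical, and it assumes the endpoint returns a JSON body, which this page does not document.

```python
import json
import urllib.request

API_BASE = "https://glama.ai/api/mcp/v1/servers"


def server_url(owner: str, name: str) -> str:
    """Build the directory API URL for one MCP server."""
    return f"{API_BASE}/{owner}/{name}"


def fetch_server(owner: str, name: str, timeout: float = 10.0) -> dict:
    """GET the server record and decode it, assuming the response is JSON."""
    request = urllib.request.Request(
        server_url(owner, name),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_server("xraywu", "mcp-pdf-extraction-server")` issues the same GET request as the curl command above.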

If you have feedback or need assistance with the MCP directory API, please join our Discord server.