extract-pdf-contents

Extract text content from local PDF files with optional page selection, supporting both standard PDF reading and OCR capabilities.

Instructions

Extract contents from a local PDF file. Pages can be selected with a comma-separated list of page numbers; negative page indexes are supported.
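The page selection described above can be sketched as follows. This is a minimal illustration, not the server's actual `parse_pages`, whose body is not shown on this page. It assumes that positive numbers are 1-based, that negative numbers count back from the last page (so `-1` is the last page), and that an empty spec selects every page. All three are assumptions.

```python
from typing import List, Optional


def parse_pages(pages: Optional[str], total_pages: int) -> List[int]:
    """Turn a spec such as "1,3,-1" into zero-based page indexes."""
    if pages is None or not pages.strip():
        # Assumption: no spec means every page.
        return list(range(total_pages))

    selected: List[int] = []
    for token in pages.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas such as "1,,2"
        number = int(token)  # raises ValueError on non-numeric input
        if number == 0:
            raise ValueError("page numbers start at 1; 0 is not valid")
        # Assumption: positive numbers are 1-based, negatives count from the end.
        index = number - 1 if number > 0 else total_pages + number
        if not 0 <= index < total_pages:
            raise ValueError(f"page {number} is out of range for {total_pages} pages")
        selected.append(index)
    return selected


print(parse_pages("1,3,-1", total_pages=10))  # [0, 2, 9]
```

Under these assumptions, `"1,3,-1"` on a 10-page file selects the first, third, and last pages.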

Input Schema


Name	Required	Description	Default
`pdf_path`	Yes
`pages`	No
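To make the schema concrete, here is a small sketch of arguments that fit it, together with a hand-written check. The `check_arguments` helper and the file path are hypothetical and exist only for illustration; the server itself relies on the JSON schema and its handler checks. Note that `pages` is a single string, not a list of integers.

```python
from typing import Any, Dict

# Mirrors the tool's input schema: pdf_path is required, pages is optional.
REQUIRED = ("pdf_path",)
PROPERTIES = {"pdf_path": str, "pages": str}


def check_arguments(arguments: Dict[str, Any]) -> None:
    """Raise ValueError if the arguments do not fit the schema."""
    for name in REQUIRED:
        if name not in arguments:
            raise ValueError(f"missing required argument: {name}")
    for name, expected in PROPERTIES.items():
        if name in arguments and not isinstance(arguments[name], expected):
            raise ValueError(f"{name} must be a string")


# Hypothetical file path, used only for illustration.
check_arguments({"pdf_path": "/tmp/report.pdf"})
check_arguments({"pdf_path": "/tmp/report.pdf", "pages": "1,2,-1"})
```

Both calls pass silently. Passing `{"pages": "1"}` alone, or `"pages": [1, 2]`, would raise `ValueError` in this sketch.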

Implementation Reference

src/pdf_extraction/server.py:39-57 (handler)
Handler logic for 'extract-pdf-contents' tool: validates arguments, instantiates PDFExtractor, calls extract_content, and returns the extracted text as TextContent.
    if name == "extract-pdf-contents":
        if not arguments:
            raise ValueError("Missing arguments")
        pdf_path = arguments.get("pdf_path")
        pages = arguments.get("pages")
        if not pdf_path:
            raise ValueError("Missing file path")
        extractor = PDFExtractor()
        extracted_text = extractor.extract_content(pdf_path, pages)
        return [
            types.TextContent(
                type="text",
                text=extracted_text,
            )
        ]
src/pdf_extraction/server.py:21-28 (schema)
Input schema for 'extract-pdf-contents': requires 'pdf_path' string, optional 'pages' string for comma-separated page numbers.
    inputSchema={
        "type": "object",
        "properties": {
            "pdf_path": {"type": "string"},
            "pages": {"type": "string"},
        },
        "required": ["pdf_path"],
    },
src/pdf_extraction/server.py:12-31 (registration)
Registers the 'extract-pdf-contents' tool in the MCP server's list_tools handler, including name, description, and input schema.
    @server.list_tools()
    async def handle_list_tools() -> list[types.Tool]:
        """
        Tools for PDF contents extraction
        """
        return [
            types.Tool(
                name="extract-pdf-contents",
                description="Extract contents from a local PDF file, given page numbers separated in comma. Negative page index number supported.",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "pdf_path": {"type": "string"},
                        "pages": {"type": "string"},
                    },
                    "required": ["pdf_path"],
                },
            )
        ]
src/pdf_extraction/pdf_extractor.py:73-95 (helper)
Core helper method in PDFExtractor class that performs the actual PDF content extraction, supporting both text-based and scanned (OCR) PDFs, with page parsing.
    def extract_content(self, pdf_path: str, pages: Optional[str]) -> List[str]:
        """Main method for extracting PDF content."""
        if not pdf_path:
            raise ValueError("PDF path must not be empty")
        try:
            # Check whether the PDF is a scanned document
            is_scanned = self.is_scanned_pdf(pdf_path)
            # Parse the page numbers
            reader = PdfReader(pdf_path)
            total_pages = len(reader.pages)
            selected_pages = self.parse_pages(pages, total_pages)
            # Choose the extraction method based on the PDF type
            if is_scanned:
                text = self.extract_text_from_scanned(pdf_path, selected_pages)
            else:
                text = self.extract_text_from_normal(pdf_path, selected_pages)
            return text
        except Exception as e:
            raise ValueError(f"Failed to extract PDF content: {str(e)}")
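The choice between the text path and the OCR path in the helper above can be illustrated with a dependency-free sketch. The bodies of `is_scanned_pdf`, `extract_text_from_scanned`, and `extract_text_from_normal` are not shown on this page, so everything below is hypothetical: `looks_scanned` uses an assumed heuristic (almost no embedded text means the file is scanned), and `ocr_page` stands in for whatever OCR engine the server uses.

```python
from typing import Callable, List, Sequence


def looks_scanned(page_texts: Sequence[str], min_chars: int = 20) -> bool:
    """Guess that a PDF is scanned when its pages yield almost no text.

    Hypothetical heuristic: the real is_scanned_pdf may work differently.
    """
    return all(len(text.strip()) < min_chars for text in page_texts)


def extract(
    page_texts: Sequence[str],
    selected_pages: Sequence[int],
    ocr_page: Callable[[int], str],
) -> List[str]:
    """Use the embedded text layer when there is one, otherwise fall back to OCR."""
    if looks_scanned(page_texts):
        return [ocr_page(i) for i in selected_pages]
    return [page_texts[i] for i in selected_pages]


# Stand-in OCR function that only labels the page it was asked to read.
def fake_ocr(index: int) -> str:
    return f"<ocr text of page {index}>"


text_layer = ["Introduction: this page has a real text layer.", "More body text on page two."]
print(extract(text_layer, [1], fake_ocr))
print(extract(["", "  "], [0, 1], fake_ocr))
```

The first call returns the embedded text of the second page; the second call finds no usable text layer and routes both pages through the stand-in OCR function.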
