extract-pdf-contents
Extract text from local PDF files, optionally limited to selected pages. Handles both text-based PDFs and scanned PDFs (via OCR).
Instructions
Extract contents from a local PDF file, given a comma-separated list of page numbers. Negative page indices are supported.
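The page-selection rule above can be sketched as follows. This is only an illustration, not the server's own `parse_pages` method, whose source is not shown here. The function name `parse_page_spec` is hypothetical, and the sketch assumes that positive numbers are 1-based, that negative numbers count back from the last page, and that a missing value selects every page.

```python
from typing import List, Optional


def parse_page_spec(pages: Optional[str], total_pages: int) -> List[int]:
    """Turn a spec such as "1,3,-1" into zero-based page indices.

    Assumptions (not confirmed by the source): positive numbers are
    1-based, negative numbers count back from the last page, and a
    missing spec selects every page.
    """
    if not pages:
        return list(range(total_pages))

    indices: List[int] = []
    for part in pages.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        number = int(part)  # raises ValueError on non-numeric input
        if number == 0:
            raise ValueError("Page numbers start at 1; 0 is not valid")
        # Positive: shift to zero-based. Negative: offset from the end.
        index = number - 1 if number > 0 else total_pages + number
        if not 0 <= index < total_pages:
            raise ValueError(
                f"Page {number} is out of range for {total_pages} pages"
            )
        indices.append(index)
    return indices
```

Under these assumptions, `parse_page_spec("1,3,-1", 10)` selects the first, third, and last pages of a ten-page file.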
Input Schema
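| Name | Required | Description | Default |
|---|---|---|---|
| pdf_path | Yes | Path to the local PDF file (string) | |
| pages | No | Comma-separated page numbers (string); negative indices are supported | |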
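As a rough illustration of the schema, the sketch below shows two argument payloads and a small checker. The helper name `read_arguments` and the file path `/tmp/report.pdf` are hypothetical and chosen only for this example. The checks enforce that `pdf_path` is required and `pages` is optional.

```python
from typing import Any, Dict, Optional, Tuple


def read_arguments(
    arguments: Optional[Dict[str, Any]],
) -> Tuple[str, Optional[str]]:
    """Return (pdf_path, pages), rejecting payloads without a pdf_path."""
    if not arguments:
        raise ValueError("Missing arguments")
    pdf_path = arguments.get("pdf_path")
    if not pdf_path:
        raise ValueError("Missing file path")
    # "pages" is optional, so a missing key simply yields None.
    return pdf_path, arguments.get("pages")


# Hypothetical payloads: one selecting pages, one relying on the default.
full_request = {"pdf_path": "/tmp/report.pdf", "pages": "1,2,-1"}
minimal_request = {"pdf_path": "/tmp/report.pdf"}
```

Both payloads pass the check. An empty dictionary, or one containing only `pages`, raises `ValueError`.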
Implementation Reference
- src/pdf_extraction/server.py:39-57 (handler) Handler logic for the 'extract-pdf-contents' tool: validates arguments, instantiates PDFExtractor, calls extract_content, and returns the extracted text as TextContent.

  ```python
  if name == "extract-pdf-contents":
      if not arguments:
          raise ValueError("Missing arguments")

      pdf_path = arguments.get("pdf_path")
      pages = arguments.get("pages")

      if not pdf_path:
          raise ValueError("Missing file path")

      extractor = PDFExtractor()
      extracted_text = extractor.extract_content(pdf_path, pages)
      return [
          types.TextContent(
              type="text",
              text=extracted_text,
          )
      ]
  ```
- src/pdf_extraction/server.py:21-28 (schema) Input schema for 'extract-pdf-contents': requires a 'pdf_path' string and accepts an optional 'pages' string of comma-separated page numbers.

  ```python
  inputSchema={
      "type": "object",
      "properties": {
          "pdf_path": {"type": "string"},
          "pages": {"type": "string"},
      },
      "required": ["pdf_path"],
  },
  ```
- src/pdf_extraction/server.py:12-31 (registration) Registers the 'extract-pdf-contents' tool in the MCP server's list_tools handler, including name, description, and input schema.

  ```python
  @server.list_tools()
  async def handle_list_tools() -> list[types.Tool]:
      """
      Tools for PDF contents extraction
      """
      return [
          types.Tool(
              name="extract-pdf-contents",
              description="Extract contents from a local PDF file, given page numbers separated in comma. Negative page index number supported.",
              inputSchema={
                  "type": "object",
                  "properties": {
                      "pdf_path": {"type": "string"},
                      "pages": {"type": "string"},
                  },
                  "required": ["pdf_path"],
              },
          )
      ]
  ```
- Core helper method in the PDFExtractor class that performs the actual PDF content extraction. It supports both text-based and scanned (OCR) PDFs and parses the page selection.

  ```python
  def extract_content(self, pdf_path: str, pages: Optional[str]) -> List[str]:
      """Main method for extracting PDF content"""
      if not pdf_path:
          raise ValueError("PDF path must not be empty")

      try:
          # Check whether the PDF is a scanned document
          is_scanned = self.is_scanned_pdf(pdf_path)

          # Parse the page numbers
          reader = PdfReader(pdf_path)
          total_pages = len(reader.pages)
          selected_pages = self.parse_pages(pages, total_pages)

          # Choose the extraction method based on the PDF type
          if is_scanned:
              text = self.extract_text_from_scanned(pdf_path, selected_pages)
          else:
              text = self.extract_text_from_normal(pdf_path, selected_pages)

          return text
      except Exception as e:
          raise ValueError(f"Failed to extract PDF content: {str(e)}")
  ```