parse_pdf
Parses PDF files to extract text and images, offering quick preview or full extraction modes.
Instructions
解析PDF文件内容,支持快速预览和完整解析两种模式
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| file_path | Yes | PDF文件的本地路径,例如'/path/to/document.pdf' | |
| mode | No | 解析模式:'quick'(仅文本)或'full'(文本和图片),默认为'full' | full |
Implementation Reference
- mcp_tool/tools/pdf_tool.py:48-85 (handler)The main handler (execute method) for the parse_pdf tool. It validates inputs (file_path, mode), processes the file path, and dispatches to either _quick_preview_pdf or _full_parse_pdf.
async def execute(self, arguments: Dict[str, Any]) -> List[types.TextContent | types.ImageContent | types.EmbeddedResource]: """ 解析PDF文件 Args: arguments: 参数字典,必须包含'file_path'键,可选'mode'键 Returns: 解析结果列表 """ if "file_path" not in arguments: return [types.TextContent( type="text", text="错误: 缺少必要参数 'file_path'" )] file_path = arguments["file_path"] # 处理文件路径,支持挂载目录的转换 file_path = self.process_file_path(file_path) if not os.path.exists(file_path): return [types.TextContent( type="text", text=f"错误: 文件不存在: {file_path}" )] if not file_path.lower().endswith('.pdf'): return [types.TextContent( type="text", text=f"错误: 文件不是PDF格式: {file_path}" )] mode = arguments.get("mode", "full") if mode == "quick": return await self._quick_preview_pdf(file_path) else: return await self._full_parse_pdf(file_path) - mcp_tool/tools/pdf_tool.py:87-122 (handler)Quick preview mode handler: extracts only text content from the PDF using PyMuPDF (fitz).
async def _quick_preview_pdf(self, file_path: str) -> List[types.TextContent | types.ImageContent | types.EmbeddedResource]: """ 快速预览PDF文件,仅提取文本内容 """ try: # 使用PyMuPDF提取文本 doc = fitz.open(file_path) text_content = [] # 添加文件信息 text_content.append(f"文件名: {os.path.basename(file_path)}") text_content.append(f"页数: {doc.page_count}") text_content.append("---") # 提取每页文本 for page_num in range(doc.page_count): page = doc[page_num] text = page.get_text() if text.strip(): text_content.append(f"第{page_num + 1}页:") text_content.append(text) text_content.append("---") doc.close() return [types.TextContent( type="text", text="\n".join(text_content) )] except Exception as e: error_details = traceback.format_exc() return [types.TextContent( type="text", text=f"错误: 快速预览PDF时发生错误: {str(e)}\n{error_details}" )] - mcp_tool/tools/pdf_tool.py:166-256 (handler)Full parse mode handler: extracts both text and images from the PDF using PyMuPDF, runs OCR on images via pytesseract, and returns the results including base64-encoded images.
async def _full_parse_pdf(self, file_path: str) -> List[types.TextContent | types.ImageContent | types.EmbeddedResource]: """ 完整解析PDF文件,提取文本和图片内容 """ results = [] try: # 使用PyMuPDF提取文本和图片 doc = fitz.open(file_path) # 添加文件信息 results.append(types.TextContent( type="text", text=f"文件名: {os.path.basename(file_path)}\n页数: {doc.page_count}\n---" )) # 处理每一页 for page_num in range(doc.page_count): page = doc[page_num] # 提取文本 text = page.get_text() if text.strip(): results.append(types.TextContent( type="text", text=f"第{page_num + 1}页:\n{text}\n---" )) # 提取图片 image_list = page.get_images() if image_list: results.append(types.TextContent( type="text", text=f"第{page_num + 1}页包含{len(image_list)}张图片" )) # 处理各页的图片 skipped_images = 0 successful_images = 0 for img_idx, img_info in enumerate(image_list): try: xref = img_info[0] base_image = doc.extract_image(xref) image_bytes = base_image["image"] # 获取图片MIME类型并检查是否支持 mime_type = self._get_image_mime_type(image_bytes) supported_mime_types = ["image/jpeg", "image/png", "image/gif", "image/webp"] # 如果格式不受支持,则跳过该图片 if mime_type not in supported_mime_types: skipped_images += 1 continue # 添加图片OCR识别结果 image_analysis = await self._analyze_image(image_bytes) results.append(types.TextContent( type="text", text=f"第{page_num + 1}页 图片{successful_images + 1}分析结果:\n{image_analysis}\n---" )) # 添加图片内容,直接返回图片而非只返回OCR文本 image_base64 = self._encode_image_base64(image_bytes) results.append(types.ImageContent( type="image", data=image_base64, mimeType=mime_type )) successful_images += 1 except Exception: # 捕获所有异常,但不中断处理流程 skipped_images += 1 # 如果有跳过的图片,添加简单提示 if skipped_images > 0: results.append(types.TextContent( type="text", text=f"注意: 第{page_num + 1}页有 {skipped_images} 张图片因格式问题已跳过处理。" )) doc.close() return results except Exception as e: error_details = traceback.format_exc() return [types.TextContent( type="text", text=f"错误: 完整解析PDF时发生错误: {str(e)}\n{error_details}" )] - mcp_tool/tools/pdf_tool.py:29-46 (schema)Tool name, description, and input schema definition for parse_pdf. Defines required 'file_path' parameter and optional 'mode' parameter (quick/full).
name = "parse_pdf" description = "解析PDF文件内容,支持快速预览和完整解析两种模式" input_schema = { "type": "object", "required": ["file_path"], "properties": { "file_path": { "type": "string", "description": "PDF文件的本地路径,例如'/path/to/document.pdf'", }, "mode": { "type": "string", "description": "解析模式:'quick'(仅文本)或'full'(文本和图片),默认为'full'", "enum": ["quick", "full"], "default": "full" } }, } - mcp_tool/tools/pdf_tool.py:21-22 (registration)Registration decorator that registers the PdfTool class with the ToolRegistry so it's discovered and available as the 'parse_pdf' tool.
@ToolRegistry.register class PdfTool(BaseTool):