aigo666

MCP Development Framework

parse_pdf

Extract text and images from PDF files using quick text-only or full content parsing modes to access document information.

Instructions

Parses PDF file content, supporting two modes: quick preview and full parsing.

Input Schema

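| Name | Required | Description | Default |
| --- | --- | --- | --- |
| file_path | Yes | Local path to the PDF file, e.g. '/path/to/document.pdf' | |
| mode | No | Parsing mode: 'quick' (text only) or 'full' (text and images); defaults to 'full' | full |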
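
As an illustration of how a caller might check arguments against this schema before invoking the tool, here is a minimal sketch in plain Python. The helper names `validate_arguments` and `with_defaults` are hypothetical and do not appear in the framework; only the schema fields themselves come from the tool definition.

```python
from typing import Any, Dict, List

# Schema fields copied from the tool's input definition (descriptions omitted).
INPUT_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "required": ["file_path"],
    "properties": {
        "file_path": {"type": "string"},
        "mode": {"type": "string", "enum": ["quick", "full"], "default": "full"},
    },
}


def validate_arguments(arguments: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: return a list of problems, empty if the arguments fit."""
    problems: List[str] = []
    for key in INPUT_SCHEMA["required"]:
        if key not in arguments:
            problems.append(f"missing required parameter '{key}'")
    for key, value in arguments.items():
        spec = INPUT_SCHEMA["properties"].get(key)
        if spec is None:
            continue  # the schema does not forbid extra keys
        if spec["type"] == "string" and not isinstance(value, str):
            problems.append(f"'{key}' must be a string")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"'{key}' must be one of {spec['enum']}")
    return problems


def with_defaults(arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: fill in schema defaults for omitted optional parameters."""
    filled = dict(arguments)
    for key, spec in INPUT_SCHEMA["properties"].items():
        if "default" in spec and key not in filled:
            filled[key] = spec["default"]
    return filled
```

Note that this sketch rejects a `mode` outside the enum, whereas the `execute` handler shown in the implementation reference only tests for `"quick"` and sends every other value to the full parse.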

Implementation Reference

  • Registers the parse_pdf tool via @ToolRegistry.register decorator on PdfTool class, setting name and description.
    @ToolRegistry.register
    class PdfTool(BaseTool):
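        """
        PDF parsing tool with two modes:
        1. Quick preview mode: extracts text only; suited to large PDF files
        2. Full parsing mode: extracts text and images for a more detailed document analysis
        """
        
        name = "parse_pdf"
        description = "Parses PDF file content, supporting two modes: quick preview and full parsing"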
  • Input schema defining parameters for file_path (required) and mode (optional, quick/full).
    input_schema = {
        "type": "object",
        "required": ["file_path"],
        "properties": {
            "file_path": {
                "type": "string",
                "description": "Local path to the PDF file, e.g. '/path/to/document.pdf'",
            },
            "mode": {
                "type": "string",
                "description": "Parsing mode: 'quick' (text only) or 'full' (text and images); defaults to 'full'",
                "enum": ["quick", "full"],
                "default": "full"
            }
        },
    }
  • Main execute handler: validates input, processes file path, checks existence and PDF format, dispatches to quick or full parse mode.
    async def execute(self, arguments: Dict[str, Any]) -> List[types.TextContent | types.ImageContent | types.EmbeddedResource]:
        """
        Parse a PDF file.
        
        Args:
            arguments: argument dict; must contain the 'file_path' key, optionally the 'mode' key
        
        Returns:
            List of parsing results
        """
        if "file_path" not in arguments:
            return [types.TextContent(
                type="text",
                text="Error: missing required parameter 'file_path'"
            )]
        
        file_path = arguments["file_path"]
        # Process the file path; supports translating mounted-directory paths
        file_path = self.process_file_path(file_path)
        
        if not os.path.exists(file_path):
            return [types.TextContent(
                type="text",
                text=f"Error: file does not exist: {file_path}"
            )]
        
        if not file_path.lower().endswith('.pdf'):
            return [types.TextContent(
                type="text",
                text=f"Error: file is not a PDF: {file_path}"
            )]
        
        mode = arguments.get("mode", "full")
        
        if mode == "quick":
            return await self._quick_preview_pdf(file_path)
        else:
            return await self._full_parse_pdf(file_path)
  • Helper function implementing full PDF parsing: extracts text and images per page using PyMuPDF, performs OCR on images, encodes images as base64.
    async def _full_parse_pdf(self, file_path: str) -> List[types.TextContent | types.ImageContent | types.EmbeddedResource]:
        """
        Fully parse a PDF file, extracting text and image content
        """
        results = []
        
        try:
            # Use PyMuPDF to extract text and images
            doc = fitz.open(file_path)
            
            # Add file information
            results.append(types.TextContent(
                type="text",
                text=f"File name: {os.path.basename(file_path)}\nPages: {doc.page_count}\n---"
            ))
            
            # Process each page
            for page_num in range(doc.page_count):
                page = doc[page_num]
                
                # Extract text
                text = page.get_text()
                if text.strip():
                    results.append(types.TextContent(
                        type="text",
                        text=f"Page {page_num + 1}:\n{text}\n---"
                    ))
                
                # Extract images
                image_list = page.get_images()
                if image_list:
                    results.append(types.TextContent(
                        type="text",
                        text=f"Page {page_num + 1} contains {len(image_list)} image(s)"
                    ))
                    
                    # Process the images on this page
                    skipped_images = 0
                    successful_images = 0
                    
                    for img_idx, img_info in enumerate(image_list):
                        try:
                            xref = img_info[0]
                            base_image = doc.extract_image(xref)
                            image_bytes = base_image["image"]
                            
                            # Determine the image MIME type and check that it is supported
                            mime_type = self._get_image_mime_type(image_bytes)
                            supported_mime_types = ["image/jpeg", "image/png", "image/gif", "image/webp"]
                            
                            # Skip the image if its format is unsupported
                            if mime_type not in supported_mime_types:
                                skipped_images += 1
                                continue
                            
                            # Add the image OCR result
                            image_analysis = await self._analyze_image(image_bytes)
                            results.append(types.TextContent(
                                type="text",
                                text=f"Page {page_num + 1}, image {successful_images + 1} analysis:\n{image_analysis}\n---"
                            ))
                            
                            # Add the image itself, rather than returning only the OCR text
                            image_base64 = self._encode_image_base64(image_bytes)
                            results.append(types.ImageContent(
                                type="image",
                                data=image_base64,
                                mimeType=mime_type
                            ))
                            
                            successful_images += 1
                        except Exception:
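                            # Catch all exceptions without interrupting processing
                            skipped_images += 1
                    
                    # If any images were skipped, add a brief note
                    if skipped_images > 0:
                        results.append(types.TextContent(
                            type="text",
                            text=f"Note: {skipped_images} image(s) on page {page_num + 1} were skipped due to format issues."
                        ))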
                        ))
            
            doc.close()
            return results
            
        except Exception as e:
            error_details = traceback.format_exc()
            return [types.TextContent(
                type="text",
                text=f"Error: an error occurred while fully parsing the PDF: {str(e)}\n{error_details}"
            )]
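
The full-parse handler relies on two helpers, `_get_image_mime_type` and `_encode_image_base64`, whose bodies are not shown in this reference. The following is a minimal sketch of how such helpers could work, assuming MIME detection by leading magic bytes and standard base64 encoding. The function names `guess_image_mime_type` and `encode_image_base64` are hypothetical stand-ins, and the framework's actual implementation may differ.

```python
import base64
from typing import Optional

# Leading byte signatures for formats the full-parse handler accepts.
_SIGNATURES = (
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
)


def guess_image_mime_type(image_bytes: bytes) -> Optional[str]:
    """Hypothetical stand-in: guess a MIME type from magic bytes, or None if unknown."""
    for signature, mime_type in _SIGNATURES:
        if image_bytes.startswith(signature):
            return mime_type
    # WebP is a RIFF container: b"RIFF", a 4-byte size, then b"WEBP".
    if image_bytes[:4] == b"RIFF" and image_bytes[8:12] == b"WEBP":
        return "image/webp"
    return None


def encode_image_base64(image_bytes: bytes) -> str:
    """Hypothetical stand-in: encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")
```

With helpers like these, an image whose type comes back as `None` would fall outside the `supported_mime_types` list and be counted as skipped, which matches the behaviour of the loop above. A client receiving an `ImageContent` item can recover the original bytes with `base64.b64decode` on its `data` field.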

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/aigo666/mcp-framework'