aigo666

MCP Development Framework

parse_word

Extract text, tables, and images from Word documents to access structured content for analysis or integration.

Instructions

Parse Word document content and extract text, tables, and image information.

Input Schema

Name | Required | Description | Default
file_path | Yes | Local path to the Word document, e.g. '/path/to/document.docx' | -
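As a rough illustration of what this schema asks of a caller, the sketch below checks an arguments dict against it using only the standard library. The helper name `check_arguments` is hypothetical and not part of the framework, and it covers only the `required` list and string-typed properties, not the full JSON Schema specification.

```python
from typing import Any, Dict, List

# The schema as shown on this page (description text omitted for brevity).
INPUT_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "required": ["file_path"],
    "properties": {"file_path": {"type": "string"}},
}


def check_arguments(arguments: Dict[str, Any],
                    schema: Dict[str, Any] = INPUT_SCHEMA) -> List[str]:
    """Return a list of problems; an empty list means the arguments look valid.

    Hypothetical helper: handles only 'required' and string-typed properties.
    """
    problems: List[str] = []
    # Every key listed under "required" must be present.
    for key in schema.get("required", []):
        if key not in arguments:
            problems.append(f"missing required parameter: {key}")
    # Properties declared as strings must actually be strings.
    for key, spec in schema.get("properties", {}).items():
        if key in arguments and spec.get("type") == "string":
            if not isinstance(arguments[key], str):
                problems.append(f"parameter {key} must be a string")
    return problems
```

A call such as `check_arguments({"file_path": "/path/to/document.docx"})` returns an empty list, while an empty dict yields one "missing required parameter" entry.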

Implementation Reference

  • Registration of the 'parse_word' tool via @ToolRegistry.register decorator on WordTool class, including name, description, and input schema.
    @ToolRegistry.register
    class WordTool(BaseTool):
        """
        Tool for parsing Word documents: extracts text content, tables, and image information.
        Supports the .docx and .doc (Word 97-2003) formats.
        """
        
        name = "parse_word"
        description = "解析Word文档内容,提取文本、表格和图片信息"
        input_schema = {
            "type": "object",
            "required": ["file_path"],
            "properties": {
                "file_path": {
                    "type": "string",
                    "description": "Word文档的本地路径,例如'/path/to/document.docx'",
                }
            },
        }
  • The main execute handler for the parse_word tool, validates input, processes file path, and calls the document parsing function.
    async def execute(self, arguments: Dict[str, Any]) -> List[types.TextContent | types.ImageContent | types.EmbeddedResource]:
        """
        Parse a Word document.
        
        Args:
            arguments: Argument dict; must contain the 'file_path' key.
            
        Returns:
            List of parsed results.
        """
        if "file_path" not in arguments:
            return [types.TextContent(
                type="text",
                text="错误: 缺少必要参数 'file_path'"
            )]
        
        # Resolve the file path, translating mounted-directory paths if needed
        file_path = self.process_file_path(arguments["file_path"])
        
        return await self._parse_word_document(file_path)
  • Core implementation of Word document parsing: handles .doc/.docx, extracts document properties, text paragraphs, tables (as Markdown), images (validated and base64 encoded), with LibreOffice conversion for .doc files.
    async def _parse_word_document(self, file_path: str) -> List[types.TextContent | types.ImageContent | types.EmbeddedResource]:
        """
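        Parse Word document content; supports the .docx and .doc formats.
        
        Args:
            file_path: Path to the Word document.
            
        Returns:
            List of content items extracted from the Word document.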
        """
        results = []
        temp_docx_path = None
        
        # Check that the file exists
        if not os.path.exists(file_path):
            return [types.TextContent(
                type="text",
                text=f"错误: 文件不存在: {file_path}\n请检查路径是否正确,并确保文件可访问。"
            )]
        
        # Check the file extension
        if not file_path.lower().endswith(('.docx', '.doc')):
            return [types.TextContent(
                type="text",
                text=f"错误: 不支持的文件格式: {file_path}\n仅支持.docx和.doc格式的Word文档。"
            )]
        
        try:
            # Compute the file size for the report header
            file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
            
            # Handle the .doc format (Word 97-2003 documents)
            if file_path.lower().endswith('.doc'):
                results.append(types.TextContent(
                    type="text",
                    text=f"# Word文档解析 (Word 97-2003 格式)\n\n文件大小: {file_size_mb:.2f} MB"
                ))
                
                # Check whether LibreOffice is available
                if not self._is_libreoffice_installed():
                    return [types.TextContent(
                        type="text",
                        text="错误: 无法解析Word 97-2003 (.doc)格式。\n"
                             "系统未安装LibreOffice,无法进行格式转换。\n"
                             "请安装LibreOffice后重试,或将文档另存为.docx格式。"
                    )]
                
                try:
                    # Emit a conversion notice
                    results.append(types.TextContent(
                        type="text",
                        text="正在使用LibreOffice转换文档格式,请稍候..."
                    ))
                    
                    # Convert .doc to .docx
                    temp_docx_path = self._convert_doc_to_docx(file_path)
                    
                    # Point file_path at the converted file
                    file_path = temp_docx_path
                    
                    results.append(types.TextContent(
                        type="text",
                        text="文档格式转换完成,继续解析...\n"
                    ))
                except Exception as e:
                    return results + [types.TextContent(
                        type="text",
                        text=f"错误: {str(e)}\n"
                             f"建议:\n"
                             f"1. 确保已正确安装LibreOffice且可通过命令行访问\n"
                             f"2. 尝试手动将文档转换为.docx格式后重试\n"
                             f"3. 检查文档是否加密或损坏"
                    )]
            else:
                results.append(types.TextContent(
                    type="text",
                    text=f"# Word文档解析\n\n文件大小: {file_size_mb:.2f} MB"
                ))
            
            # Open the Word document
            doc = docx.Document(file_path)
            
            # Extract document properties
            properties = {}
            if hasattr(doc.core_properties, 'title') and doc.core_properties.title:
                properties['标题'] = doc.core_properties.title
            if hasattr(doc.core_properties, 'author') and doc.core_properties.author:
                properties['作者'] = doc.core_properties.author
            if hasattr(doc.core_properties, 'created') and doc.core_properties.created:
                properties['创建时间'] = str(doc.core_properties.created)
            if hasattr(doc.core_properties, 'modified') and doc.core_properties.modified:
                properties['修改时间'] = str(doc.core_properties.modified)
            if hasattr(doc.core_properties, 'comments') and doc.core_properties.comments:
                properties['备注'] = doc.core_properties.comments
            
            # Append the document properties
            if properties:
                properties_text = "## 文档属性\n\n"
                for key, value in properties.items():
                    properties_text += f"- {key}: {value}\n"
                results.append(types.TextContent(
                    type="text",
                    text=properties_text
                ))
            
            # Extract the document content
            content_text = "## 文档内容\n\n"
            
            # Process paragraphs
            paragraphs_count = len(doc.paragraphs)
            content_text += f"### 段落 (共{paragraphs_count}个)\n\n"
            
            for para in doc.paragraphs:
                if para.text.strip():  # Only emit non-empty paragraphs
                    content_text += f"{para.text}\n\n"
            
            # Process tables
            tables_count = len(doc.tables)
            if tables_count > 0:
                content_text += f"### 表格 (共{tables_count}个)\n\n"
                
                for i, table in enumerate(doc.tables):
                    content_text += f"#### 表格 {i+1}\n\n"
                    
                    # Build a Markdown table
                    rows = []
                    for row in table.rows:
                        cells = [cell.text.replace('\n', ' ').strip() for cell in row.cells]
                        rows.append(cells)
                    
                    if rows:
                        # Header row
                        content_text += "| " + " | ".join(rows[0]) + " |\n"
                        # Separator row
                        content_text += "| " + " | ".join(["---"] * len(rows[0])) + " |\n"
                        # Body rows
                        for row in rows[1:]:
                            content_text += "| " + " | ".join(row) + " |\n"
                        
                        content_text += "\n"
            
            # Append the document content
            results.append(types.TextContent(
                type="text",
                text=content_text
            ))
            
            # Extract image information and content
            try:
                # Extract all images in the document, filtering out embedded external documents
                images = self._extract_images_from_word(doc)
                
                if images:
                    image_info = f"## 图片信息\n\n文档中包含 {len(images)} 张图片。\n\n"
                    results.append(types.TextContent(
                        type="text",
                        text=image_info
                    ))
                    
                    # Return the image content
                    for i, (image_id, image_bytes) in enumerate(images):
                        try:
                            # Determine the image MIME type
                            mime_type = self._get_image_mime_type(image_bytes)
                            
                            # Append the image to the results
                            image_base64 = self._encode_image_base64(image_bytes)
                            results.append(types.TextContent(
                                type="text",
                                text=f"### 图片 {i+1}\n\n"
                            ))
                            results.append(types.ImageContent(
                                type="image",
                                data=image_base64,
                                mimeType=mime_type
                            ))
                        except Exception as e:
                            # Record the image error without aborting
                            results.append(types.TextContent(
                                type="text",
                                text=f"注意: 图片 {i+1} 处理失败: {str(e)}"
                            ))
                else:
                    results.append(types.TextContent(
                        type="text",
                        text="## 图片信息\n\n文档中未包含图片或嵌入对象均不是有效图片。"
                    ))
            except Exception as img_error:
                results.append(types.TextContent(
                    type="text",
                    text=f"警告: 提取图片信息时出错: {str(img_error)}"
                ))
            
            # Append a completion notice
            results.append(types.TextContent(
                type="text",
                text="Word文档处理完成!"
            ))
            
            return results
        except Exception as e:
            error_details = traceback.format_exc()
            return [types.TextContent(
                type="text",
                text=f"错误: 解析Word文档失败: {str(e)}\n"
                     f"可能的原因:\n"
                     f"1. 文件格式不兼容或已损坏\n"
                     f"2. 文件受密码保护\n"
                     f"3. 文件包含不支持的内容\n\n"
                     f"详细错误信息: {error_details}"
            )]
        finally:
            # Clean up temporary files
            if temp_docx_path and os.path.exists(temp_docx_path):
                try:
                    # Remove the temporary directory that holds the converted file
                    temp_dir = os.path.dirname(temp_docx_path)
                    shutil.rmtree(temp_dir, ignore_errors=True)
                except Exception:
                    # Ignore errors raised during cleanup
                    pass 
  • Input schema definition for the parse_word tool, requiring 'file_path' parameter.
    input_schema = {
        "type": "object",
        "required": ["file_path"],
        "properties": {
            "file_path": {
                "type": "string",
                "description": "Word文档的本地路径,例如'/path/to/document.docx'",
            }
        },
    }
  • Helper function to extract and validate images from Word document, filtering non-image embeddings.
    def _extract_images_from_word(self, doc: Document) -> List[Tuple[str, bytes]]:
        """
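        Extract images from a Word document, filtering out embedded external documents.
        
        Args:
            doc: Word document object.
            
        Returns:
            List of images; each item holds the image relationship ID and its binary data.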
        """
        images = []
        document_part = doc.part
        rels = document_part.rels
        
        for rel in rels.values():
            try:
                # Only handle image-type relationships
                if "image" in rel.reltype:
                    image_part = rel.target_part
                    image_bytes = image_part.blob
                    image_id = rel.rId
                    
                    # Verify this is a real image, filtering out embedded external documents
                    if self._is_valid_image(image_bytes):
                        images.append((image_id, image_bytes))
            except Exception:
                continue
                    
        return images
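The helper above relies on `_is_valid_image`, and the main handler on `_get_image_mime_type`; neither body is shown on this page. One common way to do such checks is to compare the leading "magic" bytes of the blob against known image signatures. The sketch below illustrates that idea only; the function names and the set of formats are assumptions, not the framework's actual code.

```python
from typing import Optional

# Leading "magic" bytes of a few common image formats.
# The list is illustrative; the real helper may recognise other formats.
_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"BM", "image/bmp"),
)


def sniff_image_mime_type(data: bytes) -> Optional[str]:
    """Return a MIME type if the bytes start with a known image signature."""
    for signature, mime_type in _SIGNATURES:
        if data.startswith(signature):
            return mime_type
    return None


def is_valid_image(data: bytes) -> bool:
    """Treat a blob as an image only if its signature is recognised."""
    return sniff_image_mime_type(data) is not None
```

Under this approach an embedded Office document, which is a ZIP archive starting with `PK`, would not match any signature and would be filtered out.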
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden of behavioral disclosure. While it states what the tool does (parse and extract), it lacks critical behavioral details: it doesn't specify output format, error handling, performance characteristics, or any limitations (e.g., file size constraints, supported Word versions). For a tool with no annotations, this leaves significant gaps in understanding how it behaves.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
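Judging from the implementation shown earlier, the tool reports failures as ordinary text items rather than raising: messages begin with "错误:" (error), "警告:" (warning), or "注意:" (note). A caller could use those prefixes to tell failures from content. The sketch below works on plain strings to stay self-contained; the helper names are hypothetical, and the prefix convention is inferred from the source rather than documented.

```python
from typing import List

# Message prefixes observed in the implementation on this page.
PREFIXES = {
    "错误:": "error",    # "Error:"
    "警告:": "warning",  # "Warning:"
    "注意:": "note",     # "Note:"
}


def classify_text(text: str) -> str:
    """Label a text item by its prefix; anything else counts as content."""
    for prefix, label in PREFIXES.items():
        if text.startswith(prefix):
            return label
    return "content"


def has_error(texts: List[str]) -> bool:
    """Check every item, since an error may follow earlier progress messages."""
    return any(classify_text(text) == "error" for text in texts)
```

Scanning every item matters because a failed .doc conversion returns the progress messages collected so far followed by the error, so the error is not necessarily the first item.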

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is extremely concise and front-loaded: a single sentence in Chinese that directly states the tool's function. There is zero wasted text, and every word contributes to understanding the purpose. It efficiently communicates the core action and targets.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the complexity (parsing structured documents), lack of annotations, and no output schema, the description is incomplete. It doesn't explain what the extracted information looks like (e.g., structured data, raw text), how errors are handled, or any dependencies. For a tool with no structured behavioral hints, this leaves the agent with insufficient context to use it effectively.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
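One dependency the description omits is LibreOffice, which the implementation needs for .doc files. The bodies of `_is_libreoffice_installed` and `_convert_doc_to_docx` are not shown on this page, so the sketch below is only a guess at how such a check and conversion command might look. The function names and the candidate executable names are assumptions; the `--headless`, `--convert-to`, and `--outdir` options are standard LibreOffice command-line options.

```python
import shutil
from typing import List, Optional, Sequence


def find_libreoffice(candidates: Sequence[str] = ("soffice", "libreoffice")) -> Optional[str]:
    """Return the path of the first LibreOffice executable found on PATH, if any."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def build_convert_command(executable: str, doc_path: str, out_dir: str) -> List[str]:
    """Build (but do not run) a headless .doc to .docx conversion command."""
    return [
        executable,
        "--headless",
        "--convert-to", "docx",
        "--outdir", out_dir,
        doc_path,
    ]
```

The resulting list could be passed to `subprocess.run`, with `out_dir` pointing at a temporary directory that is removed afterwards, which would match the cleanup in the `finally` block shown earlier.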

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, with the single parameter 'file_path' fully documented in the schema. The description adds no additional parameter information beyond what's in the schema (e.g., no examples of valid paths beyond the schema's example, no constraints on file types). With high schema coverage, the baseline score of 3 is appropriate as the description doesn't enhance parameter understanding.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
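The file-type constraint that the description leaves out can be read from the implementation: the path must end in .docx or .doc, compared case-insensitively, and .doc files take the conversion route. A minimal sketch of that rule follows; the function name `classify_word_path` is hypothetical.

```python
def classify_word_path(file_path: str) -> str:
    """Return 'docx', 'doc', or 'unsupported' based on the file extension.

    Mirrors the case-insensitive endswith checks in the implementation;
    '.docx' is tested first so it is never mistaken for some other suffix.
    """
    lowered = file_path.lower()
    if lowered.endswith(".docx"):
        return "docx"  # parsed directly
    if lowered.endswith(".doc"):
        return "doc"   # needs conversion to .docx first
    return "unsupported"
```

For example, `classify_word_path("REPORT.DOC")` returns `"doc"`, while a path ending in `.docm` or `.pdf` is reported as unsupported.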

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: '解析Word文档内容,提取文本、表格和图片信息' (Parse Word document content, extract text, table, and image information). It specifies the verb (parse/extract) and resource (Word document) with the types of content extracted. However, it doesn't explicitly distinguish itself from sibling tools like parse_pdf or parse_excel, which likely perform similar extraction functions for different file formats.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives. It doesn't mention sibling tools like parse_pdf or parse_excel, nor does it specify prerequisites (e.g., file format requirements) or exclusions. The agent must infer usage based on the tool name and description alone.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
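In the absence of explicit guidance, a client could route files to a tool by extension. The sketch below shows one way to do that. Only the .docx and .doc entries are confirmed by the implementation on this page; the extensions mapped to `parse_pdf` and `parse_excel` are assumptions about those sibling tools, and `choose_tool` is a hypothetical helper.

```python
import os
from typing import Optional

TOOL_BY_EXTENSION = {
    ".docx": "parse_word",
    ".doc": "parse_word",
    ".pdf": "parse_pdf",     # assumption: not confirmed on this page
    ".xlsx": "parse_excel",  # assumption: not confirmed on this page
    ".xls": "parse_excel",   # assumption: not confirmed on this page
}


def choose_tool(file_path: str) -> Optional[str]:
    """Pick a tool name from the file extension, or None if nothing matches."""
    _, extension = os.path.splitext(file_path)
    return TOOL_BY_EXTENSION.get(extension.lower())
```

A `None` result signals that no parser applies, which lets the caller stop before issuing a call that would only return an unsupported-format error.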


MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/aigo666/mcp-framework'
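The same request can be made from Python with the standard library. This is a minimal sketch: the helper names are hypothetical, and treating the response body as JSON is an assumption, since the response format is not described on this page.

```python
import json
import urllib.request

BASE_URL = "https://glama.ai/api/mcp/v1/servers"


def build_server_request(owner: str, name: str) -> urllib.request.Request:
    """Build a GET request for one server entry in the MCP directory."""
    return urllib.request.Request(f"{BASE_URL}/{owner}/{name}", method="GET")


def fetch_server(owner: str, name: str, timeout: float = 10.0):
    """Send the request and decode the body, assuming it is JSON."""
    with urllib.request.urlopen(build_server_request(owner, name), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_server("aigo666", "mcp-framework")` targets the same URL as the curl command above.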

If you have feedback or need assistance with the MCP directory API, please join our Discord server.