parse_documents
Convert PDF, Word, PPT, and image files to Markdown format from local paths or URLs with optional OCR and language support.
Instructions
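A unified interface that converts files to Markdown. It accepts both local files and URLs, and selects the processing path automatically based on the USE_LOCAL_API configuration setting.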
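The way a single `file_sources` string is turned into separate URL and local-path lists can be sketched as follows. This is a minimal illustration, not the project's code: `split_sources` is a hypothetical name, and the comma splitting stands in for the real `parse_list_input` helper, whose exact behavior is not shown on this page.

```python
from typing import List, Tuple

URL_PREFIXES = ("http://", "https://")


def split_sources(file_sources: str) -> Tuple[List[str], List[str]]:
    """Split a comma-separated source string into (urls, local_paths)."""
    # Assumption: sources are separated by commas and may carry extra spaces.
    items = [part.strip() for part in file_sources.split(",") if part.strip()]
    # Drop duplicates while keeping the first-seen order.
    items = list(dict.fromkeys(items))
    urls = [s for s in items if s.lower().startswith(URL_PREFIXES)]
    paths = [s for s in items if not s.lower().startswith(URL_PREFIXES)]
    return urls, paths


if __name__ == "__main__":
    print(split_sources("/path/to/file.pdf, https://example.com/document.pdf"))
```

A URL that itself contains a comma would be split incorrectly by this sketch, so such URLs would need to be passed one at a time.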
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
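| file_sources | Yes | File path(s) or URL(s). Supported forms: - a single path or URL: "/path/to/file.pdf" or "https://example.com/document.pdf" - multiple paths or URLs (comma-separated): "/path/to/file1.pdf, /path/to/file2.pdf" or "https://example.com/doc1.pdf, https://example.com/doc2.pdf" - a mix of paths and URLs: "/path/to/file.pdf, https://example.com/document.pdf" (Supported formats: pdf, ppt, pptx, doc, docx, and the image formats jpg, jpeg, png) | |
| enable_ocr | No | Enable OCR recognition. | False |
| language | No | Document language. Defaults to "ch" (Chinese); other values such as "en" (English) are available. | ch |
| page_ranges | No | Page ranges to process, as a comma-separated string. For example, "2,4-6" selects page 2 and pages 4 through 6; "2--2" selects from page 2 through the second-to-last page. Applies to the remote API only. | None |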
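To make the `page_ranges` format concrete, here is a small sketch that expands a spec into explicit page numbers. It is an illustration only: the real parsing is done by the remote API, `expand_page_ranges` and `total_pages` are hypothetical names, and the rule that a negative number counts back from the last page is inferred from the "2--2" example above.

```python
from typing import List


def expand_page_ranges(page_ranges: str, total_pages: int) -> List[int]:
    """Expand a spec such as "2,4-6" or "2--2" into 1-based page numbers."""

    def resolve(token: str) -> int:
        # Assumption: a negative value counts from the end (-1 is the last page).
        value = int(token)
        return total_pages + 1 + value if value < 0 else value

    pages: List[int] = []
    for part in page_ranges.split(","):
        part = part.strip()
        if not part:
            continue
        # Search for the range dash after the first character so that a
        # leading minus sign is not mistaken for the separator.
        dash = part.find("-", 1)
        if dash == -1:
            start = end = resolve(part)
        else:
            start, end = resolve(part[:dash]), resolve(part[dash + 1:])
        # Keep only pages that actually exist in the document.
        pages.extend(p for p in range(start, end + 1) if 1 <= p <= total_pages)
    return pages


if __name__ == "__main__":
    print(expand_page_ranges("2,4-6", total_pages=10))  # [2, 4, 5, 6]
    print(expand_page_ranges("2--2", total_pages=10))   # [2, 3, 4, 5, 6, 7, 8, 9]
```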
Implementation Reference
- src/mineru_mcp/server.py:71-156 (handler) The 'parse_documents' tool handler, implemented as a FastMCP tool. It routes each request to local or remote processing based on configuration.
from typing import Annotated, Any, Dict

from pydantic import Field

# mcp, state, config, parse_list_input and the _handle_* helpers are defined
# elsewhere in server.py.


@mcp.tool()
async def parse_documents(
    file_sources: Annotated[
        str,
        Field(
            description="""File path(s) or URL(s). Supported forms:
            - Single path or URL: "/path/to/file.pdf" or "https://example.com/document.pdf"
            - Multiple paths or URLs (comma-separated): "/path/to/file1.pdf, /path/to/file2.pdf" or "https://example.com/doc1.pdf, https://example.com/doc2.pdf"
            - Mixed paths and URLs: "/path/to/file.pdf, https://example.com/document.pdf"
            (Supports pdf, ppt, pptx, doc, docx, and the image formats jpg, jpeg, png)"""
        ),
    ],
    enable_ocr: Annotated[bool, Field(description="Enable OCR recognition; defaults to False")] = False,
    language: Annotated[
        str,
        Field(description='Document language; defaults to "ch" (Chinese), other values such as "en" (English) are available'),
    ] = "ch",
    page_ranges: Annotated[
        str | None,
        Field(
            description='Page ranges as a comma-separated string. For example, "2,4-6" selects page 2 and pages 4 through 6; "2--2" selects from page 2 through the second-to-last page. (Remote API.) Defaults to None'
        ),
    ] = None,
) -> Dict[str, Any]:
    """
    Unified interface that converts files to Markdown. Supports local files and
    URLs, and selects the processing path based on the USE_LOCAL_API setting.
    """
    sources = parse_list_input(file_sources)
    if not sources:
        return {"status": "error", "error": "No valid file path or URL was provided"}

    # Remove duplicates while preserving the original order.
    sources = list(dict.fromkeys(sources))

    # Separate URLs from local file paths.
    url_paths = []
    file_paths = []
    for source in sources:
        if source.lower().startswith(("http://", "https://")):
            url_paths.append(source)
        else:
            file_paths.append(source)

    results = []
    client = state.get_client()
    output_dir = state.output_dir

    if config.USE_LOCAL_API:
        # Local mode: only the local file paths and the OCR flag are passed on.
        results = await _handle_local_api(file_paths, enable_ocr)
    else:
        # Remote mode: URLs and local files go through separate helpers.
        if url_paths:
            results.extend(
                await _handle_remote_urls(client, url_paths, enable_ocr, language, page_ranges, output_dir)
            )
        if file_paths:
            results.extend(
                await _handle_remote_files(client, file_paths, enable_ocr, language, page_ranges, output_dir)
            )

    if not results:
        return {"status": "error", "error": "No files were processed"}

    # A single result is returned directly, without the per-file identifiers.
    if len(results) == 1:
        result = results[0].copy()
        for key in ("filename", "source_path", "source_url"):
            result.pop(key, None)
        return result

    # Several results: report an overall status plus per-file details.
    success_count = len([r for r in results if r.get("status") == "success"])
    error_count = len([r for r in results if r.get("status") == "error"])
    overall_status = "success"
    if success_count == 0:
        overall_status = "error"
    elif error_count > 0:
        overall_status = "partial_success"

    return {
        "status": overall_status,
        "results": results,
        "summary": {
            "total_files": len(results),
            "success_count": success_count,
            "error_count": error_count,
        },
    }
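The three possible return shapes of the handler (an error, a single flattened result, or an aggregated summary) can be exercised in isolation with a sketch like the one below. It mirrors the aggregation logic shown above but is not part of the project: `summarize` is a hypothetical name, and the `content` key in the sample data is an assumed field used only for illustration.

```python
from typing import Any, Dict, List


def summarize(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Combine per-file results the way the handler above does."""
    if not results:
        return {"status": "error", "error": "No files were processed"}

    # One file: return its result directly, minus the identifying keys.
    if len(results) == 1:
        single = dict(results[0])
        for key in ("filename", "source_path", "source_url"):
            single.pop(key, None)
        return single

    success_count = sum(1 for r in results if r.get("status") == "success")
    error_count = sum(1 for r in results if r.get("status") == "error")
    if success_count == 0:
        status = "error"
    elif error_count > 0:
        status = "partial_success"
    else:
        status = "success"

    return {
        "status": status,
        "results": results,
        "summary": {
            "total_files": len(results),
            "success_count": success_count,
            "error_count": error_count,
        },
    }


if __name__ == "__main__":
    ok = {"status": "success", "filename": "a.pdf", "content": "# A"}  # "content" is assumed
    bad = {"status": "error", "filename": "b.pdf", "error": "failed"}
    print(summarize([ok]))
    print(summarize([ok, bad])["status"])
```

A caller therefore has to check for the `results` key before assuming a batch response, since a single-file call returns the per-file dictionary itself.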