
parse_documents

Convert PDF, Word, PPT, and image files to Markdown format from local paths or URLs with optional OCR and language support.

Instructions

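A unified interface that converts files to Markdown. It accepts both local files and URLs, and picks the processing path automatically based on the USE_LOCAL_API setting.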

Input Schema

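| Name | Required | Description | Default |
| --- | --- | --- | --- |
| file_sources | Yes | File paths or URLs. Accepts a single path or URL ("/path/to/file.pdf" or "https://example.com/document.pdf"); multiple comma-separated paths or URLs ("/path/to/file1.pdf, /path/to/file2.pdf" or "https://example.com/doc1.pdf, https://example.com/doc2.pdf"); or a mix of paths and URLs ("/path/to/file.pdf, https://example.com/document.pdf"). Supported formats: pdf, ppt, pptx, doc, docx, and the image formats jpg, jpeg, png. | |
| enable_ocr | No | Enable OCR recognition. Defaults to False. | |
| language | No | Document language. Defaults to "ch" (Chinese); other values such as "en" (English) are accepted. | ch |
| page_ranges | No | Page ranges as a comma-separated string. For example, "2,4-6" selects page 2 and pages 4 through 6; "2--2" selects from page 2 through the second-to-last page. Remote API only. Defaults to None. | |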
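
The page_ranges format can be easier to follow with a small worked example. The sketch below is not part of the server: `expand_page_ranges` and its `total_pages` argument are hypothetical names, and the remote API does its own parsing. It also assumes that a negative end counts back from the last page (so -1 would be the last page), which is inferred from the "2--2" example rather than stated in the source.

```python
import re

# One token is either "N" or "N-M", where M may be negative (e.g. "2--2").
_TOKEN = re.compile(r"^(\d+)(?:-(-?\d+))?$")


def expand_page_ranges(spec: str, total_pages: int) -> list[int]:
    """Expand a spec such as "2,4-6" into a list of 1-based page numbers.

    Hypothetical helper for illustration only; the real parsing happens
    on the remote API side.
    """
    pages: list[int] = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        match = _TOKEN.match(token)
        if match is None:
            raise ValueError(f"invalid page range: {token!r}")
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) is not None else start
        if end < 0:
            # Assumption: -1 is the last page, -2 the second-to-last, and so on.
            end = total_pages + end + 1
        pages.extend(range(start, min(end, total_pages) + 1))
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(pages))
```

For a 10-page document, `expand_page_ranges("2,4-6", 10)` selects pages 2, 4, 5 and 6, and `expand_page_ranges("2--2", 10)` selects pages 2 through 9.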

Implementation Reference

  • The 'parse_documents' tool handler, implemented as a FastMCP tool. It routes file parsing to either local or remote processing based on the USE_LOCAL_API configuration.
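    # Imports this excerpt relies on. `mcp`, `state`, `config`, `parse_list_input`
    # and the `_handle_*` helpers are defined elsewhere in the server module.
    from typing import Annotated, Any, Dict
    
    from pydantic import Field
    
    @mcp.tool()
    async def parse_documents(
        file_sources: Annotated[
            str,
            Field(
                description="""File paths or URLs, in any of these forms:
                - A single path or URL: "/path/to/file.pdf" or "https://example.com/document.pdf"
                - Multiple paths or URLs (comma-separated): "/path/to/file1.pdf, /path/to/file2.pdf" or
                  "https://example.com/doc1.pdf, https://example.com/doc2.pdf"
                - A mix of paths and URLs: "/path/to/file.pdf, https://example.com/document.pdf"
                (Supports pdf, ppt, pptx, doc, docx, and the image formats jpg, jpeg, png)"""
            ),
        ],
        enable_ocr: Annotated[bool, Field(description="Enable OCR recognition. Defaults to False")] = False,
        language: Annotated[
            str, Field(description='Document language. Defaults to "ch" (Chinese); "en" (English) and others are also accepted')
        ] = "ch",
        page_ranges: Annotated[
            str | None,
            Field(
                description='Page ranges as a comma-separated string. For example, "2,4-6" selects page 2 and pages 4 through 6; "2--2" selects from page 2 through the second-to-last page. (Remote API only.) Defaults to None'
            ),
        ] = None,
    ) -> Dict[str, Any]:
        """
        Unified interface that converts files to Markdown. Accepts local files and URLs,
        and picks the processing path automatically based on the USE_LOCAL_API setting.
        """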
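        # Split the comma-separated input into individual sources.
        sources = parse_list_input(file_sources)
        if not sources:
            return {"status": "error", "error": "No valid file path or URL was provided"}
    
        # Remove duplicates while preserving the original order.
        sources = list(dict.fromkeys(sources))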
    
        url_paths = []
        file_paths = []
    
        for source in sources:
            if source.lower().startswith(("http://", "https://")):
                url_paths.append(source)
            else:
                file_paths.append(source)
    
        results = []
        client = state.get_client()
        output_dir = state.output_dir
    
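        if config.USE_LOCAL_API:
            # Local mode only receives file_paths and enable_ocr; as written,
            # url_paths, language and page_ranges are not passed on.
            results = await _handle_local_api(file_paths, enable_ocr)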
        else:
            if url_paths:
                results.extend(
                    await _handle_remote_urls(client, url_paths, enable_ocr, language, page_ranges, output_dir)
                )
            if file_paths:
                results.extend(
                    await _handle_remote_files(client, file_paths, enable_ocr, language, page_ranges, output_dir)
                )
    
        if not results:
            return {"status": "error", "error": "No files were processed"}
    
        # A single result is returned directly, without the per-file identifiers.
        if len(results) == 1:
            result = results[0].copy()
            for key in ("filename", "source_path", "source_url"):
                result.pop(key, None)
            return result
    
        success_count = len([r for r in results if r.get("status") == "success"])
        error_count = len([r for r in results if r.get("status") == "error"])
    
        overall_status = "success"
        if success_count == 0:
            overall_status = "error"
        elif error_count > 0:
            overall_status = "partial_success"
    
        return {
            "status": overall_status,
            "results": results,
            "summary": {
                "total_files": len(results),
                "success_count": success_count,
                "error_count": error_count,
            },
        }
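
The two pieces of plain logic in the handler above, splitting the input into URLs and local paths and folding per-file results into one status, can be tried in isolation. This is a minimal sketch, not the server's code: `split_sources` is a hypothetical stand-in for `parse_list_input` (whose source is not shown here) plus the URL/path loop, and `summarize` mirrors only the multi-result aggregation at the end of the handler.

```python
from typing import Any

_URL_PREFIXES = ("http://", "https://")


def split_sources(raw: str) -> tuple[list[str], list[str]]:
    """Return (urls, local_paths) from a comma-separated string.

    Hypothetical stand-in for parse_list_input; assumes a plain comma split
    with surrounding whitespace stripped.
    """
    items = [part.strip() for part in raw.split(",") if part.strip()]
    items = list(dict.fromkeys(items))  # de-duplicate, keep order
    urls = [s for s in items if s.lower().startswith(_URL_PREFIXES)]
    paths = [s for s in items if not s.lower().startswith(_URL_PREFIXES)]
    return urls, paths


def summarize(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold per-file results into the overall status used by the handler."""
    success_count = sum(1 for r in results if r.get("status") == "success")
    error_count = sum(1 for r in results if r.get("status") == "error")
    if success_count == 0:
        overall = "error"
    elif error_count > 0:
        overall = "partial_success"
    else:
        overall = "success"
    return {
        "status": overall,
        "results": results,
        "summary": {
            "total_files": len(results),
            "success_count": success_count,
            "error_count": error_count,
        },
    }
```

One consequence of a plain comma split, if that is what `parse_list_input` does, is that a URL containing a comma would be broken into two sources.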

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Tongzhao9417/mineru_mcp'
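
The same request can be made from Python with only the standard library. This is a rough sketch: `server_url` and `fetch_server` are hypothetical helper names, and it assumes the endpoint returns a JSON body, which the page does not state explicitly.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

API_BASE = "https://glama.ai/api/mcp/v1/servers"


def server_url(owner: str, name: str) -> str:
    """Build the directory URL for one server, escaping each path segment."""
    return f"{API_BASE}/{quote(owner, safe='')}/{quote(name, safe='')}"


def fetch_server(owner: str, name: str, timeout: float = 10.0) -> dict:
    """GET the server record and decode it, assuming a JSON response."""
    request = Request(
        server_url(owner, name),
        headers={"Accept": "application/json"},
        method="GET",
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_server("Tongzhao9417", "mineru_mcp")` issues the same GET request as the curl command above.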

If you have feedback or need assistance with the MCP directory API, please join our Discord server.