macOS Native OCR MCP Server

API_REFERENCE.md•18.3 KiB

# API Reference

**Project**: macOS OCR MCP Server

**Version**: 0.1.0

**Document version**: 1.0.0

**Last updated**: 2026-01-18

---

## Table of Contents

1. [MCP Tools API](#mcp-tools-api)
2. [Python Module API](#python-module-api)
3. [Data Models](#data-models)
4. [Error Codes](#error-codes)
5. [Usage Examples](#usage-examples)
6. [Performance Metrics](#performance-metrics)

---

## MCP Tools API

### Tool 1: read_image_text

Extracts plain text from an image or PDF.

#### Request

```json
{
  "name": "read_image_text",
  "arguments": {
    "image_path": "/Users/zzz/Pictures/document.png"
  }
}
```

#### Parameters

| Parameter | Type | Required | Description |
|-----|------|------|------|
| image_path | string | Yes | Absolute path to the image or PDF |

#### Response

**Success**:

```json
{
  "content": [
    {
      "type": "text",
      "text": "这是识别到的文本内容。\n\n第二段文本..."
    }
  ]
}
```

**Failure**:

```json
{
  "content": [
    {
      "type": "text",
      "text": "Error processing image: FileNotFoundError..."
    }
  ]
}
```

#### Features

- **Automatic paragraph merging**: merges adjacent text lines into paragraphs
- **Smart correction**: fixes common OCR errors (such as line breaks inside Chinese text)
- **Multi-page support**: handles multi-page PDF documents automatically
- **Table optimization**: improves recognition of table cells

#### Use cases

- Quickly extracting copyable text
- Searching document content
- Text summarization and analysis
- Simple OCR scenarios

---

### Tool 2: read_image_layout

Extracts structured layout information, including text blocks, bounding boxes, and style types.

#### Request

```json
{
  "name": "read_image_layout",
  "arguments": {
    "image_path": "/Users/zzz/Pictures/document.pdf"
  }
}
```

#### Parameters

| Parameter | Type | Required | Description |
|-----|------|------|------|
| image_path | string | Yes | Absolute path to the image or PDF |

#### Response

**Success**:

```json
{
  "content": [
    {
      "type": "text",
      "text": "[\n {\n \"id\": 1,\n \"page\": 1,\n \"text\": \"段落内容...\",\n \"bbox\": {\"x\": 0.1, \"y\": 0.2, \"w\": 0.8, \"h\": 0.1},\n \"type\": \"print\",\n \"lines\": [...]\n },\n ...\n]"
    }
  ]
}
```

**Failure**:

```json
{
  "content": [
    {
      "type": "text",
      "text": "{\"error\": \"OCR failed...\"}"
    }
  ]
}
```

#### Returned data structure

See the [Data Models](#data-models) section.

#### Features

- **Structured output**: includes text blocks, bounding boxes, and confidence scores
- **Style analysis**: distinguishes printed text from emphasized text
- **Multi-page support**: returns the text blocks of every page
- **LLM friendly**: designed for AI-driven document reconstruction

#### Use cases

- Converting images to Markdown
- Extracting tables to CSV
- Restoring document layout
- Structured extraction from invoices and receipts
- Any scenario that needs the position of the text

---

## Python Module API

### Module: src.ocr

The core OCR module. It can be imported and used directly.

#### Function: recognize_text

Extracts plain text from an image or PDF.

**Signature**:

```python
def recognize_text(image_path: str) -> str
```

**Parameters**:

- `image_path` (str): absolute path to the image or PDF

**Returns**:

- `str`: the recognized plain text

**Example**:

```python
from ocr import recognize_text

text = recognize_text('/path/to/image.png')
print(text)
```

**Exceptions**:

- `FileNotFoundError`: the file does not exist
- `RuntimeError`: OCR recognition failed
- `ValueError`: invalid file path or format

---

#### Function: recognize_text_with_layout

Extracts structured layout information from an image or PDF.

**Signature**:

```python
def recognize_text_with_layout(image_path: str) -> List[Dict]
```

**Parameters**:

- `image_path` (str): absolute path to the image or PDF

**Returns**:

- `List[Dict]`: a list of text blocks, each carrying structured information

**Example**:

```python
from ocr import recognize_text_with_layout

layout = recognize_text_with_layout('/path/to/image.png')
for block in layout:
    print(f"Block {block['id']}: {block['text']}")
    print(f"  Position: x={block['bbox']['x']}, y={block['bbox']['y']}")
    print(f"  Type: {block['type']}")
```

**Exceptions**:

- `FileNotFoundError`: the file does not exist
- `RuntimeError`: OCR recognition failed
- `ValueError`: invalid file path or format

---

#### Function: cluster_into_blocks

Clusters text lines into semantic blocks (internal function).

**Signature**:

```python
def cluster_into_blocks(layout_items: List[Dict]) -> List[List[Dict]]
```

**Parameters**:

- `layout_items` (List[Dict]): list of text lines

**Returns**:

- `List[List[Dict]]`: list of clustered text blocks

**Notes**:

- Uses the Union-Find algorithm
- Decides membership from vertical continuity and horizontal alignment

---

#### Function: sort_blocks

Sorts text blocks into reading order (internal function).

**Signature**:

```python
def sort_blocks(blocks: List[List[Dict]]) -> List[List[Dict]]
```

**Parameters**:

- `blocks` (List[List[Dict]]): list of text blocks

**Returns**:

- `List[List[Dict]]`: the sorted list of text blocks

**Notes**:

- Uses a banding algorithm
- Sorts top to bottom, then left to right

---

#### Function: analyze_style_for_blocks

Analyzes the style type of each text block (internal function).

**Signature**:

```python
def analyze_style_for_blocks(image_path: str, blocks: List[Dict]) -> None
```

**Parameters**:

- `image_path` (str): path to the image
- `blocks` (List[Dict]): list of text blocks (modified in place)

**Notes**:

- Uses Pillow and NumPy
- Determines the style type from color saturation

---

### Module: src.server

The MCP server module.

#### Function: main

Starts the MCP server.

**Signature**:

```python
def main() -> None
```

**Example**:

```python
from server import main

main()  # Start the server and wait for an MCP client to connect
```

---

## Data Models

### Text line (TextLine)

The raw OCR result returned by the Vision framework.

**Structure**:

```python
{
    "text": str,          # Recognized text
    "confidence": float,  # Confidence [0, 1]
    "bbox": {
        "x": float,       # x coordinate [0, 1]
        "y": float,       # y coordinate [0, 1]
        "w": float,       # Width [0, 1]
        "h": float        # Height [0, 1]
    }
}
```

**Fields**:

| Field | Type | Description | Range |
|-----|------|------|------|
| text | string | Recognized text content | Any string |
| confidence | float | Recognition confidence | 0.0 - 1.0 |
| bbox.x | float | Left edge (normalized) | 0.0 - 1.0 |
| bbox.y | float | Bottom edge (normalized, origin at the bottom-left corner) | 0.0 - 1.0 |
| bbox.w | float | Width (normalized) | 0.0 - 1.0 |
| bbox.h | float | Height (normalized) | 0.0 - 1.0 |

---

### Text block (Block)

A semantic block produced by clustering (a paragraph, a table cell, and so on).

**Structure**:

```python
[
    {
        "text": str,      # Merged text
        "bbox": {
            "x": float,   # Left edge of the block
            "y": float,   # Bottom edge of the block
            "w": float,   # Width of the block
            "h": float    # Height of the block
        }
    }
]
```

---

### Structured block (StructuredBlock)

The complete data structure returned by `read_image_layout`.

**Structure**:

```python
{
    "id": int,        # Block number (starting at 1)
    "page": int,      # Page number
    "text": str,      # Text after merging and correction
    "bbox": {
        "x": float,   # Normalized x coordinate
        "y": float,   # Normalized y coordinate
        "w": float,   # Normalized width
        "h": float    # Normalized height
    },
    "type": str,      # Style type
                      # - "print": printed text (black)
                      # - "emphasized": emphasized text (colored)
    "lines": list     # Raw line information that makes up the block
}
```

**Fields**:

| Field | Type | Description | Example |
|-----|------|------|------|
| id | integer | Unique identifier of the block | 1, 2, 3, ... |
| page | integer | Page number (starting at 1) | 1 |
| text | string | Text after merging and correction | "段落内容..." |
| bbox.x | float | Left edge (normalized) | 0.1 |
| bbox.y | float | Bottom edge (normalized) | 0.2 |
| bbox.w | float | Width (normalized) | 0.8 |
| bbox.h | float | Height (normalized) | 0.1 |
| type | string | Style type | "print" or "emphasized" |
| lines | array | Raw line information | [TextLine, ...] |

---

### Sample data

**Sample image**: a document containing a title, paragraphs, and a table.

**Sample OCR output**:

```json
[
  {
    "id": 1,
    "page": 1,
    "text": "文档标题",
    "bbox": { "x": 0.15, "y": 0.85, "w": 0.7, "h": 0.08 },
    "type": "print",
    "lines": [
      {
        "text": "文档标题",
        "confidence": 0.98,
        "bbox": {"x": 0.15, "y": 0.85, "w": 0.7, "h": 0.08}
      }
    ]
  },
  {
    "id": 2,
    "page": 1,
    "text": "这是第一段文字。包含多行内容会自动合并。",
    "bbox": { "x": 0.1, "y": 0.65, "w": 0.8, "h": 0.15 },
    "type": "print",
    "lines": [
      {
        "text": "这是第一段文字。",
        "confidence": 0.95,
        "bbox": {"x": 0.1, "y": 0.75, "w": 0.8, "h": 0.05}
      },
      {
        "text": "包含多行内容会自动合并。",
        "confidence": 0.96,
        "bbox": {"x": 0.1, "y": 0.65, "w": 0.8, "h": 0.05}
      }
    ]
  },
  {
    "id": 3,
    "page": 1,
    "text": "重要提示：请注意此条目",
    "bbox": { "x": 0.1, "y": 0.5, "w": 0.8, "h": 0.05 },
    "type": "emphasized",
    "lines": [
      {
        "text": "重要提示：请注意此条目",
        "confidence": 0.97,
        "bbox": {"x": 0.1, "y": 0.5, "w": 0.8, "h": 0.05}
      }
    ]
  }
]
```

---

## Error Codes

### Error types

| Error type | Description | HTTP status code |
|---------|------|------------|
| FileNotFoundError | The file does not exist | 404 |
| ValueError | Invalid file path or format | 400 |
| RuntimeError | OCR recognition failed | 500 |
| PermissionError | No permission to access the file | 403 |

### Error message format

#### MCP tool errors

```json
{
  "content": [
    {
      "type": "text",
      "text": "Error processing image: <error details>"
    }
  ]
}
```

#### Python module errors

The Python module raises exceptions directly, so callers need to catch them with try-except:

```python
from ocr import recognize_text

try:
    text = recognize_text('/path/to/image.png')
except FileNotFoundError as e:
    print(f"File not found: {e}")
except RuntimeError as e:
    print(f"OCR failed: {e}")
except ValueError as e:
    print(f"Invalid input: {e}")
```

---

## Usage Examples

### Example 1: Extract plain text

```python
from ocr import recognize_text

# Recognize a single image
text = recognize_text('/path/to/document.png')
print(text)

# Recognize a PDF
text = recognize_text('/path/to/document.pdf')
print(text)
```

### Example 2: Extract a structured layout

```python
from ocr import recognize_text_with_layout
import json

# Run recognition and get the structured data
layout = recognize_text_with_layout('/path/to/document.png')

# Print as JSON
print(json.dumps(layout, indent=2, ensure_ascii=False))

# Iterate over the text blocks
for block in layout:
    print(f"\n--- Block {block['id']} (page {block['page']}) ---")
    print(f"Text: {block['text']}")
    print(f"Position: x={block['bbox']['x']:.2f}, y={block['bbox']['y']:.2f}")
    print(f"Size: w={block['bbox']['w']:.2f}, h={block['bbox']['h']:.2f}")
    print(f"Type: {block['type']}")
    # Confidence is stored per line; this prints the first line's value
    print(f"Confidence: {block['lines'][0]['confidence']:.2f}")
```

### Example 3: Filter text by type

```python
from ocr import recognize_text_with_layout

layout = recognize_text_with_layout('/path/to/document.png')

# Keep only printed text (excludes handwritten annotations)
print_blocks = [block for block in layout if block['type'] == 'print']

# Keep only emphasized text (such as handwritten annotations)
emphasized_blocks = [block for block in layout if block['type'] == 'emphasized']

print(f"Printed blocks: {len(print_blocks)}")
print(f"Emphasized blocks: {len(emphasized_blocks)}")
```

### Example 4: Extract text page by page

```python
from ocr import recognize_text_with_layout
from collections import defaultdict

layout = recognize_text_with_layout('/path/to/document.pdf')

# Group blocks by page
pages = defaultdict(list)
for block in layout:
    pages[block['page']].append(block)

# Print the content of each page
for page_num in sorted(pages.keys()):
    print(f"\n--- Page {page_num} ---")
    for block in pages[page_num]:
        print(block['text'])
```

### Example 5: Extract table data

```python
from ocr import recognize_text_with_layout

layout = recognize_text_with_layout('/path/to/table.png')

# Assumes OCR has already recognized the table rows correctly
table = []
for block in layout:
    lines = block['text'].split('\n')
    table.append(lines)

# Print the table
for row in table:
    print(' | '.join(row))
```

### Example 6: Convert to Markdown

```python
from ocr import recognize_text_with_layout

layout = recognize_text_with_layout('/path/to/document.png')

markdown_lines = []
for block in layout:
    text = block['text']

    # Treat blocks in the top region as headings.
    # y is measured from the bottom edge, so large values are near the top.
    if block['bbox']['y'] > 0.8:
        markdown_lines.append(f"# {text}\n")
    elif block['type'] == 'emphasized':
        markdown_lines.append(f"**{text}**\n")
    else:
        markdown_lines.append(f"{text}\n")

markdown_output = '\n'.join(markdown_lines)
print(markdown_output)
```

### Example 7: Convert to CSV

```python
import csv
import re
from ocr import recognize_text_with_layout

layout = recognize_text_with_layout('/path/to/table.png')

# Parse the table
table_data = []
for block in layout:
    # Assumes cells in a row are separated by tabs or by two or more spaces
    row = [cell.strip() for cell in re.split(r'\t+| {2,}', block['text'])]
    table_data.append(row)

# Write the CSV file
with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerows(table_data)

print("CSV saved to output.csv")
```

### Example 8: Batch processing

```python
import os
from ocr import recognize_text

# Process every image in a directory
input_dir = '/path/to/images'
output_dir = '/path/to/output'

os.makedirs(output_dir, exist_ok=True)

for filename in os.listdir(input_dir):
    if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
        input_path = os.path.join(input_dir, filename)
        # Keeps the original extension, e.g. scan.png -> scan.png.txt
        output_path = os.path.join(output_dir, f'{filename}.txt')

        text = recognize_text(input_path)

        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(text)

        print(f"Processed: {filename}")
```

### Example 9: Error handling

```python
import json
import os
import sys
from ocr import recognize_text, recognize_text_with_layout


def safe_recognize(image_path, layout=False):
    """Run OCR and return an error message instead of raising."""
    if not os.path.exists(image_path):
        return f"Error: file not found - {image_path}"

    if not os.access(image_path, os.R_OK):
        return f"Error: no read permission - {image_path}"

    try:
        if layout:
            result = recognize_text_with_layout(image_path)
            return json.dumps(result, ensure_ascii=False)
        else:
            return recognize_text(image_path)
    except RuntimeError as e:
        return f"OCR failed: {str(e)}"
    except ValueError as e:
        return f"Invalid input: {str(e)}"
    except Exception as e:
        return f"Unknown error: {str(e)}"


# Usage
if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Usage: python example.py <image_path> [--layout]")
        sys.exit(1)

    image_path = sys.argv[1]
    layout = '--layout' in sys.argv

    result = safe_recognize(image_path, layout)
    print(result)
```

### Example 10: Performance monitoring

```python
import time
from ocr import recognize_text


def benchmark(image_path, iterations=10):
    """Simple timing benchmark."""
    times = []
    text = ""

    for i in range(iterations):
        # perf_counter is a monotonic clock intended for measuring intervals
        start = time.perf_counter()
        text = recognize_text(image_path)
        elapsed = time.perf_counter() - start

        times.append(elapsed)
        print(f"Iteration {i+1}: {elapsed:.2f} s")

    avg_time = sum(times) / len(times)
    min_time = min(times)
    max_time = max(times)

    print(f"\n--- Performance summary ({iterations} iterations) ---")
    print(f"Average time: {avg_time:.2f} s")
    print(f"Shortest time: {min_time:.2f} s")
    print(f"Longest time: {max_time:.2f} s")
    print(f"Text length: {len(text)} characters")


# Usage
benchmark('/path/to/document.png', iterations=5)
```

---

## Performance Metrics

### Recognition speed

| Scenario | Resolution | Average time | Notes |
|-----|-------|---------|------|
| Plain-text image | 1920x1080 | 2-3 s | Accurate mode |
| Plain-text image | 1920x1080 | 0.5-1 s | Fast mode |
| Single-page PDF | A4 (300 DPI) | 3-5 s | Accurate mode |
| 10-page PDF | A4 (300 DPI) | 28-35 s | Processed page by page |
| Large image | 4000x3000 | 8-10 s | High resolution |

### Memory usage

| Scenario | Peak memory | Notes |
|-----|---------|------|
| Single-image OCR | 50-100 MB | Includes image loading and the OCR cache |
| Multi-page PDF | 100-200 MB | Pages are processed one after another |
| Style analysis | +20-50 MB | Pillow and NumPy memory |
| Batch processing | 50-100 MB per file | Depends on the level of concurrency |

### Accuracy

| Text type | Typical accuracy | Notes |
|---------|-----------|------|
| Printed Chinese | 95-99% | Clear fonts |
| Printed English | 96-99% | Standard fonts |
| Mixed Chinese and English | 94-98% | Mixed-script documents |
| Handwritten Chinese | 70-85% | Neat handwriting |
| Handwritten English | 75-90% | Standard handwriting |
| Table numbers | 90-98% | Clear layout |

### Algorithm complexity

| Algorithm | Time complexity | Space complexity | Typical input size |
|-----|-----------|-----------|-------------|
| Vision OCR | O(n) | O(n) | n = number of pixels |
| Block clustering | O(N²) | O(N) | N = number of text lines (< 1000) |
| Block sorting | O(N log N) | O(N) | N = number of text blocks |
| Style analysis | O(N * P) | O(P) | P = pixels per block |
| Smart merging | O(L) | O(L) | L = text length |

---

## Appendix

### A. Supported file formats

| Format | Extension | Support | Notes |
|-----|-------|------|------|
| PNG | .png | Full | Recommended |
| JPEG | .jpg, .jpeg | Full | Widely used |
| GIF | .gif | Full | No animation support |
| TIFF | .tiff, .tif | Full | Multi-page support |
| BMP | .bmp | Full | |
| PDF | .pdf | Full | Multi-page support |
| WebP | .webp | Full | Requires Pillow support |

### B. Supported languages

| Language | Code | Status |
|-----|------|------|
| Simplified Chinese | zh-Hans | Fully supported |
| Traditional Chinese | zh-Hant | Fully supported |
| English | en-US | Fully supported |
| Mixed Chinese and English | zh-Hans + en-US | Fully supported |

### C. Best practices

1. **Use absolute paths**: the MCP tools require absolute paths
2. **Check that the file exists**: verify the file is present before processing
3. **Handle errors**: always catch exceptions with try-except
4. **Manage memory**: watch memory limits during batch processing
5. **Image quality**: use images above 300 DPI for better accuracy
6. **Pick the right tool**: use `read_image_text` for plain text and `read_image_layout` for layout analysis

---

**End of document**

For questions or suggestions, please open an issue: https://github.com/wenjiazhu/macos-ocr-mcp/issues
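
The `bbox` fields in the data models are normalized to [0, 1] and measured from the bottom-left corner, while most imaging libraries (Pillow included) use pixel coordinates with a top-left origin. The sketch below shows one way to convert between the two. The helper name `bbox_to_pixels` and its parameters are hypothetical and not part of this project's API.

```python
from typing import Dict, Tuple


def bbox_to_pixels(
    bbox: Dict[str, float], image_width: int, image_height: int
) -> Tuple[int, int, int, int]:
    """Convert a normalized, bottom-left-origin bbox into a pixel box.

    Returns (left, top, right, bottom) with a top-left origin, which is
    the order Pillow's Image.crop() expects.
    """
    left = bbox["x"] * image_width
    right = (bbox["x"] + bbox["w"]) * image_width
    # bbox["y"] is the lower edge measured from the bottom of the image,
    # so the vertical axis has to be flipped.
    top = (1.0 - (bbox["y"] + bbox["h"])) * image_height
    bottom = (1.0 - bbox["y"]) * image_height
    return round(left), round(top), round(right), round(bottom)


# Title block from the sample data, on a hypothetical 1000x2000 px page
print(bbox_to_pixels({"x": 0.15, "y": 0.85, "w": 0.7, "h": 0.08}, 1000, 2000))
```

The returned tuple can be passed directly to `Image.crop()` to cut a block out of the source image, for example to inspect a low-confidence region.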
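
`read_image_layout` returns its result as a JSON string inside `content[0].text`, and a failure is reported as a JSON object with an `error` key instead of a list. A client therefore has to decode the string and check which shape came back. The following is a minimal sketch of that check; `OcrToolError` and `parse_layout_response` are hypothetical names introduced only for illustration.

```python
import json
from typing import Any, Dict, List


class OcrToolError(RuntimeError):
    """Hypothetical exception for a failed read_image_layout call."""


def parse_layout_response(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Decode the JSON payload carried in the first text content item."""
    text = response["content"][0]["text"]
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as exc:
        # Plain-text errors such as "Error processing image: ..." end up here.
        raise OcrToolError(text) from exc
    if isinstance(payload, dict) and "error" in payload:
        raise OcrToolError(payload["error"])
    return payload


ok = {"content": [{"type": "text", "text": '[{"id": 1, "page": 1, "text": "hello"}]'}]}
print(parse_layout_response(ok)[0]["text"])
```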
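
The reference states that `cluster_into_blocks` uses Union-Find with vertical-continuity and horizontal-alignment checks, at O(N²) cost. The sketch below illustrates that general technique on `TextLine`-shaped dictionaries. It is not the project's implementation: the function name `cluster_lines` and the thresholds `gap_ratio` and `x_tolerance` are assumptions chosen for the example.

```python
from typing import Dict, List


def cluster_lines(
    lines: List[Dict], gap_ratio: float = 1.0, x_tolerance: float = 0.02
) -> List[List[Dict]]:
    """Group text lines into blocks with a pairwise Union-Find pass."""
    parent = list(range(len(lines)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    def related(a: Dict, b: Dict) -> bool:
        box_a, box_b = a["bbox"], b["bbox"]
        # Horizontal alignment: left edges start at nearly the same x.
        aligned = abs(box_a["x"] - box_b["x"]) <= x_tolerance
        # Vertical continuity: y grows upward, so the upper line has larger y.
        upper, lower = (box_a, box_b) if box_a["y"] >= box_b["y"] else (box_b, box_a)
        gap = upper["y"] - (lower["y"] + lower["h"])
        close = gap <= gap_ratio * max(box_a["h"], box_b["h"])
        return aligned and close

    # Compare every pair of lines: O(N^2) comparisons.
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            if related(lines[i], lines[j]):
                union(i, j)

    groups: Dict[int, List[Dict]] = {}
    for index, line in enumerate(lines):
        groups.setdefault(find(index), []).append(line)
    return list(groups.values())


sample = [
    {"text": "A", "bbox": {"x": 0.1, "y": 0.80, "w": 0.3, "h": 0.04}},
    {"text": "B", "bbox": {"x": 0.1, "y": 0.75, "w": 0.3, "h": 0.04}},
    {"text": "C", "bbox": {"x": 0.1, "y": 0.40, "w": 0.3, "h": 0.04}},
]
print([[line["text"] for line in block] for block in cluster_lines(sample)])
```

Scaling the allowed gap by line height, rather than using a fixed value, keeps the check independent of font size; the real thresholds used by the project may differ.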
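
`sort_blocks` is described as a banding sort that orders blocks top to bottom and then left to right. One common reading of that idea is sketched below: blocks whose top edges fall within the same horizontal band are treated as one row and ordered by x. The function name `sort_reading_order` and the `band_height` parameter are hypothetical, and the input is simplified to dictionaries with a `bbox` key.

```python
from typing import Dict, List


def sort_reading_order(blocks: List[Dict], band_height: float = 0.05) -> List[Dict]:
    """Order blocks row by row: top to bottom, then left to right."""

    def top(block: Dict) -> float:
        # y is the bottom edge with a bottom-left origin, so top = y + h.
        return block["bbox"]["y"] + block["bbox"]["h"]

    bands: List[List[Dict]] = []
    # Visit blocks from the top of the page downward.
    for block in sorted(blocks, key=top, reverse=True):
        # Join the current band if this block starts close to the band's first block.
        if bands and top(bands[-1][0]) - top(block) <= band_height:
            bands[-1].append(block)
        else:
            bands.append([block])

    ordered: List[Dict] = []
    for band in bands:
        ordered.extend(sorted(band, key=lambda b: b["bbox"]["x"]))
    return ordered


sample = [
    {"text": "right", "bbox": {"x": 0.6, "y": 0.80, "w": 0.3, "h": 0.05}},
    {"text": "below", "bbox": {"x": 0.1, "y": 0.50, "w": 0.3, "h": 0.05}},
    {"text": "left", "bbox": {"x": 0.1, "y": 0.79, "w": 0.3, "h": 0.05}},
]
print([block["text"] for block in sort_reading_order(sample)])
```

Without banding, a plain sort on `y` would put "right" before "left" because its top edge is a fraction higher, which is the kind of jitter the band tolerance absorbs.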
