# extract_text
Extract text from images and PDF documents using OCR technology. Process PNG, JPG, JPEG, or PDF files from local paths or URLs to convert visual content into readable text.
## Instructions
Extract text from local files or URLs. Supported formats: PNG, JPG/JPEG, PDF.
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| file_path | No | Local file path or URL for PNG, JPG/JPEG, or PDF. Examples: ./test.png, C:/docs/a.pdf, https://example.com/a.jpg | |
| base64_data | No | Optional data URL or base64 payload. Use when file_path is unavailable. | |
| start_page_id | No | Optional PDF start page (1-based). Ignored for PNG/JPG inputs. | |
| end_page_id | No | Optional PDF end page (1-based). Ignored for PNG/JPG inputs. | |
| return_json | No | Optional, default false. Use only when structured layout details are needed (bbox_2d/content/label etc.), because JSON output is much longer. | false |
## Implementation Reference
- `src/glm_ocr_mcp/server.py:58-121` (handler): the main handler that executes the `extract_text` tool. It validates inputs, retrieves the OCR client, and calls either `parse()` or `parse_json()` based on the `return_json` parameter.

  ```python
  @server.call_tool()
  async def call_tool(name: str, arguments: dict) -> list[TextContent]:
      if name != "extract_text":
          raise ValueError(f"Unknown tool: {name}")

      arguments = arguments or {}
      file_path = arguments.get("file_path")
      base64_data = arguments.get("base64_data")
      start_page_id = arguments.get("start_page_id")
      end_page_id = arguments.get("end_page_id")
      return_json = bool(arguments.get("return_json", False))

      if (
          start_page_id is not None
          and end_page_id is not None
          and start_page_id > end_page_id
      ):
          return [
              TextContent(
                  type="text",
                  text="Error: start_page_id must be less than or equal to end_page_id",
              )
          ]

      # Get OCR client
      ocr = get_ocr_client()

      try:
          if base64_data:
              # Handle base64 data
              if "," in base64_data:
                  # May be data URL format
                  data_input = base64_data
              else:
                  data_input = f"data:application/octet-stream;base64,{base64_data}"
          else:
              # Handle file path
              data_input = file_path

          if return_json:
              json_result = ocr.parse_json(
                  data_input,
                  start_page_id=start_page_id,
                  end_page_id=end_page_id,
              )
              return [
                  TextContent(
                      type="text",
                      text=json.dumps(json_result, ensure_ascii=False),
                  )
              ]

          md_content = ocr.parse(
              data_input,
              start_page_id=start_page_id,
              end_page_id=end_page_id,
          )
          return [TextContent(type="text", text=md_content)]
      except FileNotFoundError:
          return [TextContent(type="text", text=f"Error: File not found: {file_path}")]
      except ValueError as e:
          return [TextContent(type="text", text=f"Error: {str(e)}")]
      except Exception as e:
          return [TextContent(type="text", text=f"OCR parsing error: {str(e)}")]
  ```
- `src/glm_ocr_mcp/server.py:17-56` (schema): the input schema definition and tool registration for `extract_text`. Defines the accepted parameters (`file_path`, `base64_data`, `start_page_id`, `end_page_id`, `return_json`) and registers the tool with MCP.

  ```python
  shared_schema = {
      "type": "object",
      "properties": {
          "file_path": {
              "type": "string",
              "description": "Local file path or URL for PNG, JPG/JPEG, or PDF. Examples: ./test.png, C:/docs/a.pdf, https://example.com/a.jpg",
          },
          "base64_data": {
              "type": "string",
              "description": "Optional data URL or base64 payload. Use when file_path is unavailable.",
          },
          "start_page_id": {
              "type": "integer",
              "minimum": 1,
              "description": "Optional PDF start page (1-based). Ignored for PNG/JPG inputs.",
          },
          "end_page_id": {
              "type": "integer",
              "minimum": 1,
              "description": "Optional PDF end page (1-based). Ignored for PNG/JPG inputs.",
          },
          "return_json": {
              "type": "boolean",
              "default": False,
              "description": "Optional, default false. Use only when structured layout details are needed (bbox_2d/content/label etc.), because JSON output is much longer.",
          },
      },
      "anyOf": [
          {"required": ["file_path"]},
          {"required": ["base64_data"]},
      ],
      "additionalProperties": False,
  }
  return [
      Tool(
          name="extract_text",
          description="Extract text from local files or URLs. Supported formats: PNG, JPG/JPEG, PDF.",
          inputSchema=shared_schema,
      ),
  ]
  ```
- `src/glm_ocr_mcp/server.py:15-16` (registration): the MCP server decorator that registers the `list_tools` handler, which makes the `extract_text` tool discoverable to MCP clients.

  ```python
  @server.list_tools()
  async def list_tools() -> list[Tool]:
  ```
- `src/glm_ocr_mcp/ocr.py:119-161` (handler): core OCR processing methods that execute the actual text extraction. `parse()` returns markdown content; `parse_json()` returns structured JSON with layout details.

  ```python
  def parse(
      self,
      file: Union[str, bytes],
      start_page_id: int | None = None,
      end_page_id: int | None = None,
  ) -> str:
      """
      Call layout parsing API, return markdown content

      Args:
          file: File path, base64 data, or URL

      Returns:
          Parsed markdown content
      """
      payload = self._build_payload(
          file,
          start_page_id=start_page_id,
          end_page_id=end_page_id,
      )
      result = self._post_layout_parsing(payload)
      return self._extract_markdown(result)

  def parse_json(
      self,
      file: Union[str, bytes],
      start_page_id: int | None = None,
      end_page_id: int | None = None,
  ) -> dict:
      """
      Call layout parsing API and return structured JSON response.

      `md_results` is removed because markdown is provided by `parse`.
      """
      payload = self._build_payload(
          file,
          start_page_id=start_page_id,
          end_page_id=end_page_id,
      )
      result = self._post_layout_parsing(payload)
      result.pop("md_results", None)
      return result
  ```
- `src/glm_ocr_mcp/ocr.py:164-169` (helper): factory function that creates and returns a `ZhipuOCR` client instance, loading the API key from environment variables.

  ```python
  def get_ocr_client() -> ZhipuOCR:
      """Get OCR client instance"""
      api_key = os.getenv("ZHIPU_API_KEY")
      if not api_key:
          raise ValueError("Please set ZHIPU_API_KEY environment variable")
      return ZhipuOCR(api_key)
  ```
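The handler normalizes its input before calling the OCR client: a raw base64 payload is wrapped in a generic data URL, while a string containing a comma is passed through on the assumption it is already a data URL. A standalone sketch of that logic (`normalize_input` is a hypothetical name, not a function in the codebase):

```python
def normalize_input(file_path: str | None, base64_data: str | None) -> str | None:
    """Mirror of the handler's input normalization; illustration only."""
    if base64_data:
        # A comma suggests a full data URL like "data:image/png;base64,...."
        if "," in base64_data:
            return base64_data
        # Bare base64 gets wrapped in a generic data URL
        return f"data:application/octet-stream;base64,{base64_data}"
    # No base64 payload: fall back to the file path or URL
    return file_path

print(normalize_input(None, "iVBORw0KGgo="))
# → data:application/octet-stream;base64,iVBORw0KGgo=
print(normalize_input("./test.png", None))
# → ./test.png
```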
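`parse_json()` strips the `md_results` key so the structured response does not duplicate the markdown that `parse()` already returns. A toy illustration of that step, assuming a hypothetical response shape (only `md_results` and the bbox_2d/content/label field names come from the source; the surrounding keys are invented):

```python
# Hypothetical raw API response; only "md_results" and the bbox_2d/content/label
# field names come from the source, the rest is invented for illustration.
raw = {
    "layout": [{"bbox_2d": [0, 0, 100, 20], "content": "Hello", "label": "text"}],
    "md_results": "Hello",
}

structured = dict(raw)                # copy, so `raw` stays intact
structured.pop("md_results", None)    # same call parse_json() makes

print(sorted(structured))  # markdown key is gone; layout details remain
```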
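`get_ocr_client()` fails fast when `ZHIPU_API_KEY` is unset, before any network call is made. The sketch below replicates that check locally for illustration (`load_api_key` is a hypothetical stand-in, and the key value is a placeholder):

```python
import os

def load_api_key() -> str:
    """Local replica of get_ocr_client()'s env-var check; illustration only."""
    api_key = os.getenv("ZHIPU_API_KEY")
    if not api_key:
        raise ValueError("Please set ZHIPU_API_KEY environment variable")
    return api_key

# Error path: an unset key raises immediately
os.environ.pop("ZHIPU_API_KEY", None)
try:
    load_api_key()
except ValueError as e:
    print(e)  # Please set ZHIPU_API_KEY environment variable

# Happy path with a placeholder value (not a real key)
os.environ["ZHIPU_API_KEY"] = "your-api-key"
print(load_api_key())
```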