# extract_text
Extract text from images and PDFs using file paths or base64 data. Supports PNG, JPG, and PDF formats.
## Instructions
Extract text from local files or URLs. Supported formats: PNG, JPG/JPEG, PDF.
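For orientation, a sketch of what an MCP `tools/call` request for this tool might look like (JSON-RPC 2.0 framing; the URL and `id` are placeholders, and the exact transport envelope depends on your client):

```python
import json

# Hypothetical MCP "tools/call" request invoking extract_text on a remote image.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "extract_text",
        "arguments": {"file_path": "https://example.com/a.jpg"},
    },
}

# Serialize for sending over the wire.
wire = json.dumps(request)
```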
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| file_path | No | Local file path or URL for PNG, JPG/JPEG, or PDF. Examples: ./test.png, C:/docs/a.pdf, https://example.com/a.jpg | |
| base64_data | No | Optional data URL or base64 payload. Use when file_path is unavailable. | |
| start_page_id | No | Optional PDF start page (1-based). Ignored for PNG/JPG inputs. | |
| end_page_id | No | Optional PDF end page (1-based). Ignored for PNG/JPG inputs. | |
| return_json | No | Return structured layout JSON (bbox_2d/content/label etc.) instead of markdown. Use only when layout details are needed, because JSON output is much longer. | false |
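For illustration, a few hypothetical argument payloads that satisfy this schema (the paths and URLs are placeholders, and `satisfies_schema_inputs` is our own helper, not part of the source):

```python
# Hypothetical argument payloads for extract_text; paths/URLs are placeholders.
image_args = {"file_path": "https://example.com/a.jpg"}
pdf_range_args = {"file_path": "C:/docs/a.pdf", "start_page_id": 2, "end_page_id": 5}
inline_args = {"base64_data": "iVBORw0KGgoAAAANSUhEUg"}  # bare base64, no data URL prefix

def satisfies_schema_inputs(args: dict) -> bool:
    """Check the schema's anyOf clause: file_path or base64_data must be present."""
    return "file_path" in args or "base64_data" in args

for args in (image_args, pdf_range_args, inline_args):
    assert satisfies_schema_inputs(args)
```

Note that `start_page_id`/`end_page_id` are only meaningful for PDFs and must satisfy `start_page_id <= end_page_id`, which the handler checks before calling the OCR client.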
## Implementation Reference
- `src/glm_ocr_mcp/server.py:50-56` (registration) — tool registration: defines the `extract_text` tool with its name, description, and input schema.

```python
return [
    Tool(
        name="extract_text",
        description="Extract text from local files or URLs. Supported formats: PNG, JPG/JPEG, PDF.",
        inputSchema=shared_schema,
    ),
]
```

- `src/glm_ocr_mcp/server.py:17-49` (schema) — input schema for `extract_text`: defines the `file_path`, `base64_data`, `start_page_id`, `end_page_id`, and `return_json` parameters.
```python
shared_schema = {
    "type": "object",
    "properties": {
        "file_path": {
            "type": "string",
            "description": "Local file path or URL for PNG, JPG/JPEG, or PDF. Examples: ./test.png, C:/docs/a.pdf, https://example.com/a.jpg",
        },
        "base64_data": {
            "type": "string",
            "description": "Optional data URL or base64 payload. Use when file_path is unavailable.",
        },
        "start_page_id": {
            "type": "integer",
            "minimum": 1,
            "description": "Optional PDF start page (1-based). Ignored for PNG/JPG inputs.",
        },
        "end_page_id": {
            "type": "integer",
            "minimum": 1,
            "description": "Optional PDF end page (1-based). Ignored for PNG/JPG inputs.",
        },
        "return_json": {
            "type": "boolean",
            "default": False,
            "description": "Optional, default false. Use only when structured layout details are needed (bbox_2d/content/label etc.), because JSON output is much longer.",
        },
    },
    "anyOf": [
        {"required": ["file_path"]},
        {"required": ["base64_data"]},
    ],
    "additionalProperties": False,
}
```

- `src/glm_ocr_mcp/server.py:58-121` (handler) — handler for `extract_text`: dispatches to `ZhipuOCR.parse()` for markdown or `ZhipuOCR.parse_json()` for structured JSON output, with error handling.
```python
@server.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    if name != "extract_text":
        raise ValueError(f"Unknown tool: {name}")

    arguments = arguments or {}
    file_path = arguments.get("file_path")
    base64_data = arguments.get("base64_data")
    start_page_id = arguments.get("start_page_id")
    end_page_id = arguments.get("end_page_id")
    return_json = bool(arguments.get("return_json", False))

    if (
        start_page_id is not None
        and end_page_id is not None
        and start_page_id > end_page_id
    ):
        return [
            TextContent(
                type="text",
                text="Error: start_page_id must be less than or equal to end_page_id",
            )
        ]

    # Get OCR client
    ocr = get_ocr_client()

    try:
        if base64_data:
            # Handle base64 data
            if "," in base64_data:
                # May be data URL format
                data_input = base64_data
            else:
                data_input = f"data:application/octet-stream;base64,{base64_data}"
        else:
            # Handle file path
            data_input = file_path

        if return_json:
            json_result = ocr.parse_json(
                data_input,
                start_page_id=start_page_id,
                end_page_id=end_page_id,
            )
            return [
                TextContent(
                    type="text",
                    text=json.dumps(json_result, ensure_ascii=False),
                )
            ]

        md_content = ocr.parse(
            data_input,
            start_page_id=start_page_id,
            end_page_id=end_page_id,
        )
        return [TextContent(type="text", text=md_content)]
    except FileNotFoundError:
        return [TextContent(type="text", text=f"Error: File not found: {file_path}")]
    except ValueError as e:
        return [TextContent(type="text", text=f"Error: {str(e)}")]
    except Exception as e:
        return [TextContent(type="text", text=f"OCR parsing error: {str(e)}")]
```

- `src/glm_ocr_mcp/ocr.py:119-141` (helper) — core OCR helper: `ZhipuOCR.parse()` calls the ZhipuAI layout parsing API and returns markdown text.
```python
def parse(
    self,
    file: Union[str, bytes],
    start_page_id: int | None = None,
    end_page_id: int | None = None,
) -> str:
    """
    Call layout parsing API, return markdown content

    Args:
        file: File path, base64 data, or URL

    Returns:
        Parsed markdown content
    """
    payload = self._build_payload(
        file,
        start_page_id=start_page_id,
        end_page_id=end_page_id,
    )
    result = self._post_layout_parsing(payload)
    return self._extract_markdown(result)
```

- `src/glm_ocr_mcp/ocr.py:143-161` (helper) — core OCR helper: `ZhipuOCR.parse_json()` calls the same layout parsing API and returns structured JSON (minus `md_results`).
```python
def parse_json(
    self,
    file: Union[str, bytes],
    start_page_id: int | None = None,
    end_page_id: int | None = None,
) -> dict:
    """
    Call layout parsing API and return structured JSON response.
    `md_results` is removed because markdown is provided by `parse`.
    """
    payload = self._build_payload(
        file,
        start_page_id=start_page_id,
        end_page_id=end_page_id,
    )
    result = self._post_layout_parsing(payload)
    result.pop("md_results", None)
    return result
```