extract_text

Extract text from images and PDFs supplied as a local file path, a URL, or base64 data. Supports PNG, JPG/JPEG, and PDF formats.

Instructions

Extract text from local files or URLs. Supported formats: PNG, JPG/JPEG, PDF.

Input Schema


Name	Required	Description
`file_path`	No	Local file path or URL for PNG, JPG/JPEG, or PDF. Examples: ./test.png, C:/docs/a.pdf, https://example.com/a.jpg
`base64_data`	No	Optional data URL or base64 payload. Use when file_path is unavailable.
`start_page_id`	No	Optional PDF start page (1-based). Ignored for PNG/JPG inputs.
`end_page_id`	No	Optional PDF end page (1-based). Ignored for PNG/JPG inputs.
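`return_json`	No	Optional, default false. Use only when structured layout details are needed (bbox_2d/content/label etc.), because JSON output is much longer.

Neither `file_path` nor `base64_data` is individually required, but at least one of them must be supplied: the schema enforces this with `anyOf` and rejects any property not listed here. When both are given, the handler uses `base64_data`.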
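As an illustration of how these parameters combine, the sketch below builds three argument dictionaries a client might send. Only the parameter names come from the table; the helper `to_data_url`, the placeholder bytes, and the chosen page numbers are assumptions made for the example.

```python
import base64


def to_data_url(raw: bytes, mime: str) -> str:
    """Encode raw file bytes as a data URL (hypothetical helper, not part of the server)."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# 1. OCR an image by URL; markdown text is the default output.
image_args = {"file_path": "https://example.com/a.jpg"}

# 2. Restrict OCR of a local PDF to a page range (page numbers are 1-based).
pdf_args = {"file_path": "C:/docs/a.pdf", "start_page_id": 2, "end_page_id": 4}

# 3. Send in-memory bytes as base64 and ask for structured layout JSON.
fake_png_bytes = b"\x89PNG placeholder bytes"  # stand-in for real image data
layout_args = {
    "base64_data": to_data_url(fake_png_bytes, "image/png"),
    "return_json": True,
}
```

How these dictionaries are sent depends on the MCP client in use, which this page does not describe.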

Implementation Reference

src/glm_ocr_mcp/server.py:50-56 (registration)

Tool registration: defines the 'extract_text' tool with its name, description, and input schema.

return [
    Tool(
        name="extract_text",
        description="Extract text from local files or URLs. Supported formats: PNG, JPG/JPEG, PDF.",
        inputSchema=shared_schema
    ),
]

src/glm_ocr_mcp/server.py:17-49 (schema)

Input schema for extract_text: defines file_path, base64_data, start_page_id, end_page_id, and return_json parameters.

shared_schema = {
    "type": "object",
    "properties": {
        "file_path": {
            "type": "string",
            "description": "Local file path or URL for PNG, JPG/JPEG, or PDF. Examples: ./test.png, C:/docs/a.pdf, https://example.com/a.jpg"
        },
        "base64_data": {
            "type": "string",
            "description": "Optional data URL or base64 payload. Use when file_path is unavailable."
        },
        "start_page_id": {
            "type": "integer",
            "minimum": 1,
            "description": "Optional PDF start page (1-based). Ignored for PNG/JPG inputs."
        },
        "end_page_id": {
            "type": "integer",
            "minimum": 1,
            "description": "Optional PDF end page (1-based). Ignored for PNG/JPG inputs."
        },
        "return_json": {
            "type": "boolean",
            "default": False,
            "description": "Optional, default false. Use only when structured layout details are needed (bbox_2d/content/label etc.), because JSON output is much longer."
        }
    },
    "anyOf": [
        {"required": ["file_path"]},
        {"required": ["base64_data"]},
    ],
    "additionalProperties": False,
}
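To make the constraints in this schema concrete, here is a hand-written approximation of the checks it expresses. It is a sketch for reading purposes, not a JSON Schema validator and not code from the server; the function name `check_arguments` and the wording of the messages are invented for the example.

```python
ALLOWED_KEYS = {"file_path", "base64_data", "start_page_id", "end_page_id", "return_json"}


def check_arguments(args: dict) -> list:
    """Return a list of problems; an empty list means the arguments look valid."""
    problems = []

    # additionalProperties: False -> unknown keys are rejected.
    for key in sorted(set(args) - ALLOWED_KEYS):
        problems.append(f"unexpected property: {key}")

    # anyOf: at least one of the two input sources must be present.
    if "file_path" not in args and "base64_data" not in args:
        problems.append("one of file_path or base64_data is required")

    # Both input sources are typed as strings.
    for key in ("file_path", "base64_data"):
        if key in args and not isinstance(args[key], str):
            problems.append(f"{key} must be a string")

    # Page ids are integers with minimum 1 (bool is excluded even though
    # it is a subclass of int in Python).
    for key in ("start_page_id", "end_page_id"):
        if key in args:
            value = args[key]
            if isinstance(value, bool) or not isinstance(value, int) or value < 1:
                problems.append(f"{key} must be an integer >= 1")

    if "return_json" in args and not isinstance(args["return_json"], bool):
        problems.append("return_json must be a boolean")

    return problems
```

Note that the schema does not compare `start_page_id` with `end_page_id`; that check lives in the handler.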

src/glm_ocr_mcp/server.py:58-121 (handler)

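Handler function for extract_text: rejects a page range whose start exceeds its end, wraps a bare base64 payload in a data URL, then dispatches to ZhipuOCR.parse() for markdown or to ZhipuOCR.parse_json() when return_json is true. Failures are returned as text content prefixed with "Error:" or "OCR parsing error:" rather than raised.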

@server.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    if name != "extract_text":
        raise ValueError(f"Unknown tool: {name}")

    arguments = arguments or {}
    file_path = arguments.get("file_path")
    base64_data = arguments.get("base64_data")
    start_page_id = arguments.get("start_page_id")
    end_page_id = arguments.get("end_page_id")
    return_json = bool(arguments.get("return_json", False))

    if (
        start_page_id is not None
        and end_page_id is not None
        and start_page_id > end_page_id
    ):
        return [
            TextContent(
                type="text",
                text="Error: start_page_id must be less than or equal to end_page_id",
            )
        ]

    # Get OCR client
    ocr = get_ocr_client()

    try:
        if base64_data:
            # Handle base64 data
            if "," in base64_data:
                # May be data URL format
                data_input = base64_data
            else:
                data_input = f"data:application/octet-stream;base64,{base64_data}"
        else:
            # Handle file path
            data_input = file_path

        if return_json:
            json_result = ocr.parse_json(
                data_input,
                start_page_id=start_page_id,
                end_page_id=end_page_id,
            )
            return [
                TextContent(
                    type="text",
                    text=json.dumps(json_result, ensure_ascii=False),
                )
            ]

        md_content = ocr.parse(
            data_input,
            start_page_id=start_page_id,
            end_page_id=end_page_id,
        )
        return [TextContent(type="text", text=md_content)]
    except FileNotFoundError:
        return [TextContent(type="text", text=f"Error: File not found: {file_path}")]
    except ValueError as e:
        return [TextContent(type="text", text=f"Error: {str(e)}")]
    except Exception as e:
        return [TextContent(type="text", text=f"OCR parsing error: {str(e)}")]
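The input-selection and page-range logic of the handler can be isolated into two small pure functions, which makes the precedence rules easier to see. The function names `resolve_input` and `page_range_error` are hypothetical; the branching and the error string mirror the handler shown above.

```python
from typing import Optional


def resolve_input(file_path: Optional[str], base64_data: Optional[str]) -> Optional[str]:
    """Pick the value passed on to the OCR client, mirroring the handler's branches."""
    if base64_data:
        if "," in base64_data:
            # A comma suggests a data URL ("data:<mime>;base64,<payload>"); pass it through.
            return base64_data
        # Bare base64: wrap it in a generic data URL.
        return f"data:application/octet-stream;base64,{base64_data}"
    # No base64 payload: fall back to the path or URL (may be None).
    return file_path


def page_range_error(start_page_id: Optional[int], end_page_id: Optional[int]) -> Optional[str]:
    """Return the handler's error text for an inverted range, else None."""
    if (
        start_page_id is not None
        and end_page_id is not None
        and start_page_id > end_page_id
    ):
        return "Error: start_page_id must be less than or equal to end_page_id"
    return None
```

Two consequences follow from this shape: `base64_data` wins when both inputs are supplied, and a range is only checked when both bounds are present.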

src/glm_ocr_mcp/ocr.py:119-141 (helper)

Core OCR helper: ZhipuOCR.parse() calls the ZhipuAI layout parsing API and returns markdown text.

def parse(
    self,
    file: Union[str, bytes],
    start_page_id: int | None = None,
    end_page_id: int | None = None,
) -> str:
    """
    Call layout parsing API, return markdown content

    Args:
        file: File path, base64 data, or URL

    Returns:
        Parsed markdown content
    """
    payload = self._build_payload(
        file,
        start_page_id=start_page_id,
        end_page_id=end_page_id,
    )
    result = self._post_layout_parsing(payload)

    return self._extract_markdown(result)

src/glm_ocr_mcp/ocr.py:143-161 (helper)

Core OCR helper: ZhipuOCR.parse_json() calls the ZhipuAI layout parsing API and returns structured JSON (minus md_results).

def parse_json(
    self,
    file: Union[str, bytes],
    start_page_id: int | None = None,
    end_page_id: int | None = None,
) -> dict:
    """
    Call layout parsing API and return structured JSON response.

    `md_results` is removed because markdown is provided by `parse`.
    """
    payload = self._build_payload(
        file,
        start_page_id=start_page_id,
        end_page_id=end_page_id,
    )
    result = self._post_layout_parsing(payload)
    result.pop("md_results", None)
    return result
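The effect of dropping `md_results` before serialization can be seen on a small made-up response. Only the `md_results` key and the `bbox_2d`/`content`/`label` field names appear on this page; the `layout_details` key and all values below are assumptions for illustration, not the actual API response shape.

```python
import json

# Made-up response: "layout_details" and every value here are illustrative only.
sample_response = {
    "md_results": "# Invoice\n\nTotal: 42",
    "layout_details": [
        {"label": "title", "content": "Invoice", "bbox_2d": [10, 10, 200, 40]},
    ],
}

structured = dict(sample_response)   # shallow copy so the sample stays intact
structured.pop("md_results", None)   # same call as in parse_json; no error if the key is absent

# The handler serializes the result this way before returning it as text.
serialized = json.dumps(structured, ensure_ascii=False)
```

Passing `None` as the default to `pop` means a response without `md_results` is returned unchanged instead of raising `KeyError`.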
