ascript-cn
by ascript-cn

ocr

Extract text from mobile device screens via OCR. Specify region, confidence, and regex pattern to get text, position, and accuracy.

Instructions

Performs OCR text recognition on the device screen. Returns the recognized text, position coordinates, and confidence.

Input Schema

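Name | Required | Description | Default
mode | No | OCR engine: mlkit (default, fast), paddle_v2, paddle_v3 (latest), tess | mlkit
rect | No | Recognition region [left, top, right, bottom]; full screen if omitted |
pattern | No | Regular expression used to filter the results |
confidence | No | Confidence threshold, 0.0-1.0; default 0.1 | 0.1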
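
As an illustration of how the schema above maps onto a call, the following minimal sketch builds an MCP `tools/call` JSON-RPC request for this tool. The helper name `build_ocr_call`, the request id, and the sample region and pattern are hypothetical. The client-side checks on `rect` length and `confidence` range are extra safeguards added here; the schema itself only enforces the `mode` enum and the basic types.

```python
import json
from typing import Any, Optional

# Allowed engines, copied from the schema's enum.
OCR_MODES = ("mlkit", "paddle_v2", "paddle_v3", "tess")


def build_ocr_call(
    mode: str = "mlkit",
    rect: Optional[list[int]] = None,
    pattern: Optional[str] = None,
    confidence: float = 0.1,
    request_id: int = 1,
) -> dict[str, Any]:
    """Build a JSON-RPC `tools/call` request for the `ocr` tool (hypothetical helper)."""
    if mode not in OCR_MODES:
        raise ValueError(f"unknown OCR mode: {mode!r}")
    # The two checks below are assumptions about sensible input, not schema rules.
    if rect is not None and len(rect) != 4:
        raise ValueError("rect must be [left, top, right, bottom]")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")

    arguments: dict[str, Any] = {"mode": mode, "confidence": confidence}
    # Optional arguments are left out so the server falls back to its defaults
    # (full screen for rect, no filtering for pattern).
    if rect is not None:
        arguments["rect"] = rect
    if pattern is not None:
        arguments["pattern"] = pattern

    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "ocr", "arguments": arguments},
    }


# Recognize only digit runs in the top 400 pixels of a 1080-pixel-wide screen.
request = build_ocr_call(mode="paddle_v3", rect=[0, 0, 1080, 400], pattern=r"\d+")
print(json.dumps(request, ensure_ascii=False, indent=2))
```

Omitting `rect` and `pattern` rather than sending null values keeps the request aligned with the schema, where both are optional and typed as array and string.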

Implementation Reference

  • Tool registration for 'ocr': defines name, description (OCR on device screen), and input schema with mode (mlkit/paddle_v2/paddle_v3/tess), rect (region), pattern (regex filter), and confidence (threshold).
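    Tool(
        name="ocr",
        # English: "Perform OCR text recognition on the device screen.
        # Returns the recognized text, position coordinates, and confidence."
        description=(
            "在设备屏幕上执行 OCR 文字识别。"
            "返回识别到的文字、位置坐标和置信度。"
        ),
        inputSchema={
            "type": "object",
            "properties": {
                "mode": {
                    "type": "string",
                    # English: "OCR engine: mlkit (default, fast), paddle_v2, paddle_v3 (latest), tess"
                    "description": "OCR 引擎:mlkit(默认,快)、paddle_v2、paddle_v3(最新)、tess",
                    "enum": ["mlkit", "paddle_v2", "paddle_v3", "tess"],
                    "default": "mlkit",
                },
                "rect": {
                    "type": "array",
                    "items": {"type": "integer"},
                    # English: "Recognition region [left, top, right, bottom]; full screen if omitted"
                    "description": "识别区域 [left, top, right, bottom],不传则全屏",
                },
                "pattern": {
                    "type": "string",
                    # English: "Regular expression used to filter the results"
                    "description": "正则表达式过滤结果",
                },
                "confidence": {
                    "type": "number",
                    # English: "Confidence threshold 0.0-1.0, default 0.1"
                    "description": "置信度阈值 0.0-1.0,默认 0.1",
                    "default": 0.1,
                },
            },
        },
    ),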
  • Dispatch handler for 'ocr' tool: calls dev.ocr() with mode, rect, pattern, and confidence arguments from the caller.
    if name == "ocr":
        return dev.ocr(
            mode=args.get("mode", "mlkit"),
            rect=args.get("rect"),
            pattern=args.get("pattern"),
            confidence=args.get("confidence", 0.1),
        )
  • Actual OCR implementation: executes OCR on the connected device. Maps mode string to int, builds GP strack parameters, and calls _run_gp() with the appropriate screen.Ocr class (android or ios).
    from typing import Any, Optional

    # Engine name -> integer code expected by the device-side Ocr class.
    _OCR_MODES = {
        "mlkit": 1,
        "paddle_v2": 2,
        "paddle_v3": 3,
        "tess": 4,
    }


    def ocr(
        mode: str = "mlkit",
        rect: Optional[list[int]] = None,
        pattern: Optional[str] = None,
        confidence: float = 0.1,
    ) -> dict[str, Any]:
        """
        Perform OCR text recognition on the device screen.

        mode: mlkit / paddle_v2 / paddle_v3 / tess
        rect: [left, top, right, bottom] recognition region; full screen if omitted
        pattern: regular-expression filter
        confidence: confidence threshold, 0.0-1.0
        """
        d = require_device()
        # Unknown mode names silently fall back to mlkit (1).
        mode_int = _OCR_MODES.get(mode, 1)

        # Build the GP strack parameter string
        params_parts = [f"mode={mode_int}"]
        if rect:
            params_parts.append(f"rect={rect}")
        if pattern:
            params_parts.append(f"pattern='{pattern}'")
        params_parts.append(f"confidence={confidence}")
        params_str = ", ".join(params_parts)

        # Pick the platform-specific OCR class
        class_id = (
            "ascript.android.screen.Ocr"
            if d.platform == "android"
            else "ascript.ios.screen.Ocr"
        )
        return _run_gp(d, class_id, params_str)
  • GP engine helper: _run_gp() handles the device-side screenshot, builds the strack payload, and sends it to the device's GP API endpoint for execution.
    import json
    import urllib.parse
    import urllib.request

    def _run_gp(device: Device, class_id: str, params_str: str) -> dict[str, Any]:
        """
        Invoke the device's GP engine to run an image/color tool.

        Flow: 1) capture and save a screenshot on the device
              2) call the GP strack engine with the screenshot path
        The request is sent to /api/gp/strack (Android) or /api/screen/gp (iOS).
        """
        # Capture a screenshot and save it on the device first
        image_path = _ensure_screenshot(device)

        strack = [
            {
                "id": class_id,
                # Literal value sent to the device; it means "image/color tool"
                "type": "图色工具",
                "data": {"params": params_str},
            }
        ]

        if device.platform == "android":
            url = f"{device.base_url}/api/gp/strack"
        else:
            url = f"{device.base_url}/api/screen/gp"

        payload = {
            "strack": json.dumps(strack),
            "image": image_path,
            "gp": "as_gp_test_screen_temp",
        }
        # The payload is sent as a form-encoded POST body
        body = urllib.parse.urlencode(payload).encode("utf-8")
        headers = dict(device.headers)
        headers["Content-Type"] = "application/x-www-form-urlencoded"
        req = urllib.request.Request(url, data=body, headers=headers, method="POST")

        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.loads(resp.read().decode("utf-8"))
  • Helper function _ensure_screenshot() captures a screenshot on the device and returns the image path, which is required by the GP engine before running OCR.
    import urllib.parse

    def _ensure_screenshot(device: Device) -> str:
        """
        Capture and save a screenshot on the device, and return its on-device file path.
        The GP engine needs an on-device image path in order to work.
        """
        if device.platform == "android":
            # The path parameter sets the save location; the GP engine reads from it
            save_path = "/sdcard/airscript/screen/mcp_temp.png"
            url = f"{device.base_url}/api/tool/screen/capture?path={urllib.parse.quote(save_path)}"
            _fetch_bytes(url, headers=device.headers, timeout=10)
            return save_path
        else:
            # iOS: /api/screen/capture/list?capture=true captures a screenshot,
            # saves it on the device, and returns the path
            url = f"{device.base_url}/api/screen/capture/list?capture=true"
            result = _fetch_json(url, method="GET", headers=device.headers, timeout=10)
            items = result.get("data", [])
            if items:
                # The most recent capture is the last entry in the list
                return items[-1].get("path", "")
            return ""
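
To make the flow in the excerpts above concrete, here is a minimal dry-run sketch that reproduces the parameter string, the Android screenshot URL, and the form-encoded GP request without touching the network. The helper names (`build_params`, `screenshot_url`, `build_gp_request`) and the base URL `http://192.0.2.10:8080` are hypothetical; the mode codes, endpoints, and payload keys are copied from the excerpts.

```python
import json
import urllib.parse

# Mode codes copied from _OCR_MODES in the excerpt above.
OCR_MODES = {"mlkit": 1, "paddle_v2": 2, "paddle_v3": 3, "tess": 4}


def build_params(mode="mlkit", rect=None, pattern=None, confidence=0.1) -> str:
    """Mirror the parameter string assembled in ocr()."""
    parts = [f"mode={OCR_MODES.get(mode, 1)}"]
    if rect:
        parts.append(f"rect={rect}")
    if pattern:
        parts.append(f"pattern='{pattern}'")
    parts.append(f"confidence={confidence}")
    return ", ".join(parts)


def screenshot_url(base_url: str, save_path: str) -> str:
    """Mirror the Android capture URL built in _ensure_screenshot()."""
    return f"{base_url}/api/tool/screen/capture?path={urllib.parse.quote(save_path)}"


def build_gp_request(base_url: str, platform: str, params: str, image_path: str):
    """Return (url, body) as _run_gp() would send them, with no network I/O."""
    is_android = platform == "android"
    class_id = "ascript.android.screen.Ocr" if is_android else "ascript.ios.screen.Ocr"
    strack = [{"id": class_id, "type": "图色工具", "data": {"params": params}}]
    endpoint = "/api/gp/strack" if is_android else "/api/screen/gp"
    payload = {
        "strack": json.dumps(strack),
        "image": image_path,
        "gp": "as_gp_test_screen_temp",
    }
    return base_url + endpoint, urllib.parse.urlencode(payload).encode("utf-8")


BASE_URL = "http://192.0.2.10:8080"  # hypothetical device address
SAVE_PATH = "/sdcard/airscript/screen/mcp_temp.png"

params = build_params(mode="paddle_v3", rect=[0, 0, 1080, 400], pattern=r"\d+")
print(params)
# mode=3, rect=[0, 0, 1080, 400], pattern='\d+', confidence=0.1

print(screenshot_url(BASE_URL, SAVE_PATH))
# http://192.0.2.10:8080/api/tool/screen/capture?path=/sdcard/airscript/screen/mcp_temp.png

url, body = build_gp_request(BASE_URL, "android", params, SAVE_PATH)
print(url)
# http://192.0.2.10:8080/api/gp/strack
```

Two details are worth noting. `urllib.parse.quote` treats `/` as safe by default, so the save path appears unescaped in the capture URL. The pattern is wrapped in single quotes without any escaping, so a pattern that itself contains `'` may not survive intact; the source does not say how the device-side parser handles that case.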
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the burden. It discloses the output (text, coordinates, confidence) but does not say whether the tool is read-only, whether it requires permissions, or how it handles edge cases such as no text being found. The behavioral disclosure is adequate but not thorough.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is two concise sentences with no wasted words. It front-loads the core action and output, making it efficient for an agent to parse.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Despite no output schema, the description mentions the return values (text, coordinates, confidence). For a tool with four parameters that are well-documented in the schema, the description provides sufficient context for the agent to understand the tool's result structure.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

All four parameters have descriptions in the input schema (100% coverage). The description adds minimal value beyond the schema (e.g., default engine, default region), so a baseline score of 3 is appropriate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's function: perform OCR on the device screen and return recognized text, coordinates, and confidence. It uses a specific verb ('执行 OCR', "perform OCR") and specific nouns ('设备屏幕', "device screen"; '文字识别', "text recognition"), and distinguishes the tool from sibling tools such as screen_capture or dump_ui_tree.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies use for text recognition tasks but does not provide explicit guidance on when to use this tool versus alternatives (e.g., screen_capture or dump_ui_tree). No when-not-to-use or alternative references are given.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ascript-cn/ascript-mcp'
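
For readers scripting against the directory, the curl command above could be expressed in Python roughly as follows. This is a sketch using only the standard library; the function names are hypothetical, and decoding the body as JSON is an assumption, since the response format is not described here.

```python
import json
import urllib.request

API_URL = "https://glama.ai/api/mcp/v1/servers/ascript-cn/ascript-mcp"


def build_request(url: str = API_URL) -> urllib.request.Request:
    """Build the same GET request as the curl command (no network I/O)."""
    return urllib.request.Request(url, method="GET")


def fetch_server_info(url: str = API_URL, timeout: float = 10.0):
    """Send the request and decode the body, assuming the API returns JSON."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Calling fetch_server_info() performs a real network request, so it is
# left commented out here:
# info = fetch_server_info()
```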

If you have feedback or need assistance with the MCP directory API, please join our Discord server.