
hyprland-mcp

by alderban107

find_text_on_screen

Locate text on your screen using OCR to find matching coordinates for automation tasks. Specify target text and optional search areas like monitors, windows, or regions to get precise screen positions.

Instructions

Find text on screen using OCR. Returns matching locations in screen coordinates.

Take a screenshot, run OCR, and find all occurrences of the target text. Coordinates are in absolute screen space — ready to pass to mouse_click.

Args:
    target: Text to find (case-insensitive, supports multi-word)
    monitor: Limit search to a specific monitor
    window: Limit search to a specific window (e.g. "class:discord")
    region: Limit search to a region "X,Y WxH"
    scope: "auto" (default) captures just the active window for better accuracy. "full" captures the entire desktop.
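The region string format "X,Y WxH" can be parsed with a short helper. The `parse_region` function below is a hypothetical illustration of the format, not part of the server's code:

```python
import re

def parse_region(region: str) -> tuple[int, int, int, int]:
    """Parse a region string like "100,200 800x600" into (x, y, w, h)."""
    m = re.fullmatch(r"(-?\d+),(-?\d+)\s+(\d+)x(\d+)", region.strip())
    if m is None:
        raise ValueError(f"invalid region: {region!r}")
    x, y, w, h = map(int, m.groups())
    return x, y, w, h

print(parse_region("100,200 800x600"))  # (100, 200, 800, 600)
```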

Input Schema

Name     Required  Description  Default
target   Yes
monitor  No
window   No
region   No
scope    No                     auto
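A minimal arguments payload for this schema might look like the following (the values are hypothetical). Only target is required; scope falls back to its "auto" default when omitted:

```python
import json

# Hypothetical tool-call arguments: only "target" is required;
# "scope" defaults to "auto" when omitted.
args = {"target": "Settings", "window": "class:discord"}

print(json.dumps(args, indent=2))
```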

Output Schema

Name    Required  Description  Default
result  Yes

Implementation Reference

  • Implementation and registration (via the @mcp.tool() decorator) of the find_text_on_screen tool, which captures a screenshot and performs OCR to locate the target text.
    @mcp.tool()
    async def find_text_on_screen(
        target: str,
        monitor: str | None = None,
        window: str | None = None,
        region: str | None = None,
        scope: str = "auto",
    ) -> str:
        """Find text on screen using OCR. Returns matching locations in screen coordinates.
    
        Take a screenshot, run OCR, and find all occurrences of the target text.
        Coordinates are in absolute screen space — ready to pass to mouse_click.
    
        Args:
            target: Text to find (case-insensitive, supports multi-word)
            monitor: Limit search to a specific monitor
            window: Limit search to a specific window (e.g. "class:discord")
            region: Limit search to a region "X,Y WxH"
            scope: "auto" (default) captures just the active window for better accuracy.
                   "full" captures the entire desktop.
        """
        from . import screenshot as ss, ocr
    
        if scope == "full" and not (monitor or window or region):
            png_bytes, origin_x, origin_y = await ss.capture_raw()
        else:
            png_bytes, origin_x, origin_y = await _auto_scope_capture(
                monitor, window, region,
            )
    
        boxes = ocr.extract_boxes(png_bytes)
        matches = ocr.find_text(boxes, target)
    
        if not matches:
            all_text = ocr.extract_text(png_bytes)
            preview = all_text[:500] + "..." if len(all_text) > 500 else all_text
            return f"Text '{target}' not found.\n\nOCR detected text:\n{preview}"
    
        lines = [f"Found {len(matches)} match(es) for '{target}':"]
        for m in matches:
            screen_x = m["x"] + origin_x + m["w"] // 2
            screen_y = m["y"] + origin_y + m["h"] // 2
            lines.append(
                f"- \"{m['text']}\" at screen ({screen_x}, {screen_y}) "
                f"[box: {m['x']+origin_x},{m['y']+origin_y} {m['w']}x{m['h']}, "
                f"conf: {m['conf']}%]"
            )
        return "\n".join(lines)
    
    
    @mcp.tool()
    async def click_text(
        target: str,
        button: str = "left",
        double: bool = False,
        monitor: str | None = None,
        window: str | None = None,
        region: str | None = None,
        occurrence: int = 1,
        scope: str = "auto",
    ) -> str:
        """Find text on screen and click it — screenshot + OCR + click in one call.
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries full burden and does well. It discloses key behaviors: takes a screenshot, runs OCR, returns coordinates in absolute screen space, case-insensitive search, supports multi-word targets, and accuracy implications of scope settings. It doesn't mention performance characteristics like speed or error rates, but covers the essential operational behavior adequately.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is appropriately sized and front-loaded. The first sentence states the core purpose and output. Subsequent sentences explain the process and parameter details in a structured 'Args:' section. Every sentence adds value with no wasted words, making it easy for an agent to quickly understand and use the tool.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (OCR-based search with multiple scoping options), no annotations, and an output schema (which handles return values), the description is complete. It covers purpose, behavior, all parameter meanings, and usage context. The presence of an output schema means the description doesn't need to explain return format, and it adequately addresses the gaps from missing annotations.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, so the description must compensate fully. It provides excellent parameter semantics: explains what 'target' is (case-insensitive, multi-word), clarifies 'monitor' and 'window' limit search scope, defines 'region' format ('X,Y WxH'), and details 'scope' options ('auto' vs 'full') with accuracy implications. This adds substantial meaning beyond the bare schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Find text on screen using OCR. Returns matching locations in screen coordinates.' It specifies the verb (find), resource (text on screen), method (OCR), and output (locations in screen coordinates). It distinguishes from siblings like screenshot_with_ocr (which captures but doesn't search) and mouse_click (which acts on coordinates).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear context for when to use this tool: for finding text via OCR with coordinate results 'ready to pass to mouse_click.' It mentions scope options ('auto' vs 'full') for accuracy trade-offs. However, it doesn't explicitly state when NOT to use it or name specific alternatives among siblings (e.g., screenshot_with_ocr for just OCR without search).

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

