Skip to main content
Glama

browser.find_by_vision

Locate web elements using natural language descriptions when CSS selectors fail. Returns clickable coordinates for browser automation.

Instructions

Use Claude Vision to find an element from a natural language description. Returns (x, y) coordinates you can pass to browser.execute_action click. Use when CSS selectors fail or the element has no reliable text anchor.

Input Schema

TableJSON Schema
NameRequiredDescriptionDefault
session_idYes
descriptionYes
take_screenshotNo

Implementation Reference

  • The handler implementation for the browser.find_by_vision tool. It uses a VisionTargeter (injected in the gateway) to identify elements on a screenshot.
    async def _find_by_vision(self, payload: VisionFindInput) -> dict[str, Any]:
        if self.vision_targeter is None:
            raise RuntimeError(
                "Vision targeting is not available — set ANTHROPIC_API_KEY to enable it."
            )
        session = await self.manager.get_session(payload.session_id)
        if payload.take_screenshot:
            screenshot = await self.manager.capture_screenshot(payload.session_id, label="vision")
            screenshot_path = screenshot["screenshot_path"]
        else:
            # Use the most recent screenshot if available
            screenshots = sorted(
                session.artifact_dir.glob("*.png"),
                key=lambda p: p.stat().st_mtime,
                reverse=True,
            )
            if not screenshots:
                raise RuntimeError("No screenshots available — take one first")
            screenshot_path = str(screenshots[0])
    
        result = await self.vision_targeter.find_element(screenshot_path, payload.description)
        return {"session_id": payload.session_id, **result}
  • Input schema for browser.find_by_vision tool, inheriting from SessionIdInput and defining the vision query parameters.
    class VisionFindInput(SessionIdInput):
        description: str = Field(min_length=1, max_length=500)
        take_screenshot: bool = True
  • The tool registration for browser.find_by_vision in the McpToolGateway, binding the input schema and handler.
    ToolSpec(
        name="browser.find_by_vision",
        description=(
            "Use Claude Vision to find an element from a natural language description. "
            "Returns (x, y) coordinates you can pass to browser.execute_action click. "
            "Use when CSS selectors fail or the element has no reliable text anchor."
        ),
        input_model=VisionFindInput,
        handler=self._find_by_vision,
    ),

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/LvcidPsyche/auto-browser'

If you have feedback or need assistance with the MCP directory API, please join our Discord server