browser.find_by_vision
Locate web elements using natural language descriptions when CSS selectors fail. Returns clickable coordinates for browser automation.
Instructions
Use Claude Vision to find an element from a natural language description. Returns (x, y) coordinates you can pass to browser.execute_action click. Use when CSS selectors fail or the element has no reliable text anchor.
Input Schema
TableJSON Schema
| Name | Required | Description | Default |
|---|---|---|---|
| session_id | Yes | ||
| description | Yes | ||
| take_screenshot | No |
Implementation Reference
- controller/app/tool_gateway.py:1278-1299 (handler)The handler implementation for the browser.find_by_vision tool. It uses a VisionTargeter (injected in the gateway) to identify elements on a screenshot.
async def _find_by_vision(self, payload: VisionFindInput) -> dict[str, Any]: if self.vision_targeter is None: raise RuntimeError( "Vision targeting is not available — set ANTHROPIC_API_KEY to enable it." ) session = await self.manager.get_session(payload.session_id) if payload.take_screenshot: screenshot = await self.manager.capture_screenshot(payload.session_id, label="vision") screenshot_path = screenshot["screenshot_path"] else: # Use the most recent screenshot if available screenshots = sorted( session.artifact_dir.glob("*.png"), key=lambda p: p.stat().st_mtime, reverse=True, ) if not screenshots: raise RuntimeError("No screenshots available — take one first") screenshot_path = str(screenshots[0]) result = await self.vision_targeter.find_element(screenshot_path, payload.description) return {"session_id": payload.session_id, **result} - Input schema for browser.find_by_vision tool, inheriting from SessionIdInput and defining the vision query parameters.
class VisionFindInput(SessionIdInput): description: str = Field(min_length=1, max_length=500) take_screenshot: bool = True - controller/app/tool_gateway.py:786-795 (registration)The tool registration for browser.find_by_vision in the McpToolGateway, binding the input schema and handler.
ToolSpec( name="browser.find_by_vision", description=( "Use Claude Vision to find an element from a natural language description. " "Returns (x, y) coordinates you can pass to browser.execute_action click. " "Use when CSS selectors fail or the element has no reliable text anchor." ), input_model=VisionFindInput, handler=self._find_by_vision, ),