vision_locate
Locate page elements by natural-language description using a vision LLM. Returns coordinates and confidence, with optional click action.
Instructions
⭐ Find an element by natural-language description using a vision LLM.
Uses the same provider as solve_recaptcha_ai (OPENAI_* / ANTHROPIC_* env).
Reuses solve_recaptcha_ai's vision plumbing so any vision-capable model
works (gpt-4o, gpt-5.x, claude, llava, llama-3.2-vision).
Args:
description: NL description, e.g. "the red Create button at bottom right"
click: if True, also dispatches a CDP mouse_click at the located point
api_key/base_url/model/provider: explicit overrides (else from env)
Returns JSON: {"found":true/false, "x":int, "y":int, "confidence":"high|medium|low"}.
Use when CSS selectors are unreliable (visual-only differentiator, dynamic IDs).
Input Schema
TableJSON Schema
| Name | Required | Description | Default |
|---|---|---|---|
| description | Yes | ||
| click | No | ||
| api_key | No | ||
| base_url | No | ||
| model | No | ||
| provider | No |
Output Schema
TableJSON Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes |