detect_elements_visually
Identify clickable and focusable UI elements on Android screens using screenshot analysis and UI tree data to locate interactive components for automation.
Instructions
Detect interactive elements on the screen using a combination of screenshots and UI tree analysis. Returns a list of all clickable/focusable elements with their coordinates and descriptions. Use this when you need to find elements to interact with.
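
As a rough sketch of how a caller might use the returned list, the helper below picks the first element whose description contains a search string and returns its center as tap coordinates. The `InteractiveElement` shape mirrors the fields the handler returns. `findTapTarget` and the sample data are hypothetical, introduced here for illustration, and are not part of the server.

```typescript
// Shape of one entry in the returned list (mirrors the handler's return type).
interface InteractiveElement {
  description: string;
  centerX: number;
  centerY: number;
  bounds: string; // e.g. "[40,1140][1040,1260]"
}

// Hypothetical helper: case-insensitive substring match on the description.
function findTapTarget(
  elements: InteractiveElement[],
  query: string,
): { x: number; y: number } | undefined {
  const needle = query.toLowerCase();
  const match = elements.find((el) =>
    el.description.toLowerCase().includes(needle),
  );
  return match ? { x: match.centerX, y: match.centerY } : undefined;
}

// Made-up sample data for illustration only.
const sampleElements: InteractiveElement[] = [
  { description: "Sign in", centerX: 540, centerY: 1200, bounds: "[40,1140][1040,1260]" },
  { description: "Settings", centerX: 980, centerY: 120, bounds: "[920,60][1040,180]" },
];

const target = findTapTarget(sampleElements, "settings");
console.log(target);
```

Descriptions fall back from text to content description, resource ID, and class name, so a substring match is only a heuristic. A caller that needs precision may prefer to match on the exact description.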
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| device_id | No | Device serial number | |
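
To make the schema concrete, here is a minimal sketch of a JSON-RPC `tools/call` request for this tool, with and without `device_id`. The helper name `buildDetectRequest` and the serial `emulator-5554` are assumptions for illustration. How the server picks a device when `device_id` is omitted is not documented here.

```typescript
// Arguments accepted by the tool: a single optional string.
interface DetectElementsArgs {
  device_id?: string;
}

// Hypothetical helper that builds an MCP `tools/call` request body.
function buildDetectRequest(id: number, deviceId?: string) {
  const args: DetectElementsArgs = {};
  if (deviceId !== undefined) {
    args.device_id = deviceId; // omit the key entirely when no device is given
  }
  return {
    jsonrpc: "2.0" as const,
    id,
    method: "tools/call" as const,
    params: { name: "detect_elements_visually", arguments: args },
  };
}

// "emulator-5554" is only an example serial, in the style printed by `adb devices`.
const withDevice = buildDetectRequest(1, "emulator-5554");
const withoutDevice = buildDetectRequest(2);

console.log(JSON.stringify(withDevice, null, 2));
console.log(JSON.stringify(withoutDevice, null, 2));
```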
Implementation Reference
- src/vision/visual-analyzer.ts:57-110 (handler) The implementation of detectElementsVisually, which captures a screenshot and parses the UI tree to identify interactive elements.

      export async function detectElementsVisually(deviceId?: string): Promise<{
        screenshot: ScreenshotResult;
        uiTree: string;
        interactiveElements: Array<{
          description: string;
          centerX: number;
          centerY: number;
          bounds: string;
        }>;
      }> {
        const resolved = await deviceManager.resolveDeviceId(deviceId);
        const screenshot = await captureScreenshot(resolved);

        let uiTree = '';
        const interactiveElements: Array<{
          description: string;
          centerX: number;
          centerY: number;
          bounds: string;
        }> = [];

        try {
          const tree = await getUITree(resolved);
          uiTree = summarizeTree(tree);

          // Extract interactive elements
          const { flattenTree } = await import('../uiautomator/ui-tree-parser.js');
          const allElements = flattenTree(tree);

          for (const el of allElements) {
            if ((el.clickable || el.focusable) && el.bounds.width > 0 && el.bounds.height > 0) {
              const desc = el.text || el.contentDesc || el.resourceId || el.className.split('.').pop() || 'unknown';
              interactiveElements.push({
                description: desc,
                centerX: el.bounds.centerX,
                centerY: el.bounds.centerY,
                bounds: `[${el.bounds.left},${el.bounds.top}][${el.bounds.right},${el.bounds.bottom}]`,
              });
            }
          }
        } catch (error) {
          log.warn('UI tree unavailable for visual detection', {
            error: error instanceof Error ? error.message : String(error),
          });
        }

        log.info('Visual detection completed', {
          deviceId: resolved,
          interactiveCount: interactiveElements.length,
        });

        return { screenshot, uiTree, interactiveElements };
      }

- src/controllers/vision-tools.ts:91-124 (registration) The MCP tool registration for detect_elements_visually.

      server.registerTool(
        'detect_elements_visually',
        {
          description: 'Detect interactive elements on the screen using a combination of screenshots and UI tree analysis. Returns a list of all clickable/focusable elements with their coordinates and descriptions. Use this when you need to find elements to interact with.',
          inputSchema: {
            device_id: z.string().optional().describe('Device serial number'),
          },
        },
        async ({ device_id }) => {
          return await metrics.measure('detect_elements_visually', device_id || 'default', async () => {
            const detection = await detectElementsVisually(device_id);
            const content: Array<{ type: 'text'; text: string } | { type: 'image'; data: string; mimeType: string }> = [];

            content.push({
              type: 'image' as const,
              data: detection.screenshot.base64,
              mimeType: 'image/png',
            });

            content.push({
              type: 'text' as const,
              text: JSON.stringify({
                success: true,
                interactiveElements: detection.interactiveElements,
                elementCount: detection.interactiveElements.length,
                uiTree: detection.uiTree,
              }, null, 2),
            });

            return { content };
          });
        }
      );
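
The registration returns two content items: a PNG image and a text item holding a JSON string. The following sketch shows one way a client might unpack that result and turn a `bounds` string back into numbers. `parseDetectionContent`, `parseBounds`, and the sample data are hypothetical helpers written for this illustration. Only the field names are taken from the code above.

```typescript
// Content items as emitted by the registration code: one image, one text.
type ContentItem =
  | { type: "text"; text: string }
  | { type: "image"; data: string; mimeType: string };

interface DetectionPayload {
  success: boolean;
  interactiveElements: Array<{
    description: string;
    centerX: number;
    centerY: number;
    bounds: string;
  }>;
  elementCount: number;
  uiTree: string;
}

// Hypothetical helper: pull the JSON payload and the base64 PNG out of `content`.
function parseDetectionContent(content: ContentItem[]): {
  payload: DetectionPayload;
  screenshotBase64: string | undefined;
} {
  const text = content.find(
    (c): c is Extract<ContentItem, { type: "text" }> => c.type === "text",
  );
  if (!text) throw new Error("no text item in tool result");
  const image = content.find(
    (c): c is Extract<ContentItem, { type: "image" }> => c.type === "image",
  );
  return {
    payload: JSON.parse(text.text) as DetectionPayload,
    screenshotBase64: image?.data,
  };
}

// Hypothetical helper: turn "[left,top][right,bottom]" back into numbers.
function parseBounds(bounds: string) {
  const m = /^\[(-?\d+),(-?\d+)\]\[(-?\d+),(-?\d+)\]$/.exec(bounds);
  if (!m) return null; // string did not match the expected format
  return {
    left: Number(m[1]),
    top: Number(m[2]),
    right: Number(m[3]),
    bottom: Number(m[4]),
  };
}

// Made-up result for illustration; "AAAA" stands in for real PNG data.
const sampleContent: ContentItem[] = [
  { type: "image", data: "AAAA", mimeType: "image/png" },
  {
    type: "text",
    text: JSON.stringify({
      success: true,
      interactiveElements: [
        { description: "Sign in", centerX: 540, centerY: 1200, bounds: "[40,1140][1040,1260]" },
      ],
      elementCount: 1,
      uiTree: "",
    }),
  },
];

const parsed = parseDetectionContent(sampleContent);
for (const el of parsed.payload.interactiveElements) {
  console.log(el.description, parseBounds(el.bounds));
}
```

Because the handler catches UI tree failures and still returns the screenshot, a client should be prepared for `elementCount` to be 0 and `uiTree` to be empty even though `success` is true.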