# UI Element Detection Research for Mobile Automation
Research on how automation testing frameworks detect where to tap, especially when accessibility labels are not present.
## Main Approaches
### 1. Accessibility Tree (Most Common)
Most test frameworks use the native accessibility APIs:
- **Android**: UIAutomator (what we're using with `android_describe_all`)
- **iOS**: XCUITest/IDB accessibility info
This is what Appium, Detox, Espresso, and XCUITest use. It's reliable but **requires accessibility labels/testIDs**.
### 2. Image/Template Matching (OpenCV)
[Appium's Image Locator](https://appiumpro.com/editions/32-finding-elements-by-image-part-1) uses OpenCV to find elements by matching a reference image against the screenshot:
```javascript
driver.findElement(by: .image, using: base64EncodedTemplateImage)
```
**Pros:**
- Works when there's no accessibility info
- Useful for games/custom renderers
**Cons:**
- Fragile with resolution/theme changes
- Requires maintaining reference images
### 3. OCR-Based Detection
Several Appium plugins use OCR to find text on screen:
| Plugin | Description |
|--------|-------------|
| [appium-ocr-plugin](https://github.com/jlipps/appium-ocr-plugin) | Tesseract-based, returns XPath-like elements |
| [ocr-click-plugin](https://github.com/Jitu1888/ocr-click-plugin) | Tesseract.js + Sharp + Vertex AI |
| [AppiumOCR](https://github.com/jaipaddy/AppiumOCR) | SikuliX OCR API integration |
### 4. Vision AI/LLM-Based (Emerging)
New approaches using multimodal LLMs:
- [android-vision-agent](https://github.com/areu01or00/android-vision-agent) - Uses GPT-4o for screen analysis
- [AskUI](https://www.askui.com/blog-posts/developing-an-automated-ui-controller-using-gpt-agents-and-gpt-4-vision) - Numbers UI elements and uses GPT-4V to identify them
- [VisionDroid](https://arxiv.org/html/2407.03037v1) - Academic research on MLLM for GUI testing
---
## Open Source Projects to Investigate
### Maestro
- **Approach**: Smart selectors + AI assist (MaestroGPT)
- **Link**: [github.com/mobile-dev-inc/Maestro](https://github.com/mobile-dev-inc/Maestro)
- **Docs**: [maestro.mobile.dev](https://maestro.mobile.dev/)
- **License**: Apache 2.0
- **Notes**: Built on learnings from Appium, Espresso, UIAutomator, XCTest. Has visual flow builder & element inspector. Cross-platform (Android, iOS, Web).
### appium-ocr-plugin
- **Approach**: Tesseract OCR
- **Link**: [github.com/jlipps/appium-ocr-plugin](https://github.com/jlipps/appium-ocr-plugin)
- **Notes**: Provides OCR endpoint, returns XML of text objects with screen positions. Can use XPath to find "elements" based on OCR results.
### ocr-click-plugin
- **Approach**: Tesseract.js + Sharp + Google Cloud Vertex AI
- **Link**: [github.com/Jitu1888/ocr-click-plugin](https://github.com/Jitu1888/ocr-click-plugin)
- **Notes**: Image enhancement preprocessing, AI-powered analysis, confidence filtering. Works with both iOS and Android.
### android-vision-agent
- **Approach**: GPT-4o vision + GPT-4-turbo planning
- **Link**: [github.com/areu01or00/android-vision-agent](https://github.com/areu01or00/android-vision-agent)
- **Notes**: Hybrid approach - Vision Model for extracting text/analyzing screenshots, Planning Model for reasoning. Includes smart task planning and error recovery.
### vimGPT
- **Approach**: GPT-4V + Vimium-style navigation
- **Link**: [github.com/ishan0102/vimGPT](https://github.com/ishan0102/vimGPT)
- **Notes**: Found that CogVLM can accurately specify pixel coordinates. Resolution threshold affects detection quality.
### AI-Employe
- **Approach**: GPT-4V + DOM indexing with MeiliSearch
- **Link**: [github.com/vignshwarar/AI-Employe](https://github.com/vignshwarar/AI-Employe)
- **Notes**: Found that direct coordinate requests to GPT-4V cause hallucinations. Solution: index DOM, have AI generate commands with element text, then look up actual coordinates.
---
## Key Research Findings
### Problem: GPT-4V Coordinate Hallucination
The [AI-Employe](https://github.com/vignshwarar/AI-Employe) project tested multiple approaches:
1. Sending shortened HTML to GPT-3
2. Creating bounding boxes with IDs for GPT-4V
3. Directly asking GPT-4V for X,Y coordinates
**Result**: None were reliable - all led to hallucinations.
### Solution: Element Numbering Technique
Both AI-Employe and [AskUI](https://www.askui.com/blog-posts/developing-an-automated-ui-controller-using-gpt-agents-and-gpt-4-vision) converged on a similar solution:
1. Detect UI elements (via accessibility tree, OCR, or CV)
2. Draw numbered bounding boxes on the screenshot
3. Send annotated screenshot to AI
4. AI returns the element NUMBER (not coordinates)
5. Look up actual coordinates from the numbered element map
**Example prompt from AskUI:**
> "All the UI elements are numbered for reference. The associated numbers are on top left of corresponding bbox. For the prompt/query asked, return the number associated to the target element."
### Academic Research
[Vision-Based Mobile App GUI Testing Survey](https://arxiv.org/html/2310.13518v3) covers:
- Traditional CV technologies for GUI element detection
- Deep learning-based approaches (e.g., predicting labels of image-based buttons)
- Models like OwlEyes, Gilb, NightHawk for detecting display bugs
[VisionDroid](https://arxiv.org/html/2407.03037v1) proposes:
- Extract GUI text and align with screenshots for vision prompts
- Function-aware exploration using MLLM
- Logic-aware bug detection
---
## Potential Implementation for Our Tool
### Hybrid Approach
1. **Primary**: Use accessibility tree (already working)
- `android_describe_all` / `ios_describe_all`
- `android_find_element` / `ios_find_element`
2. **Fallback when no accessibility labels**:
**Option A: OCR-based**
- Take screenshot
- Run Tesseract OCR to detect text elements
- Return text positions as tappable elements
**Option B: Element Annotation**
- Take screenshot
- Detect elements via OCR + edge detection
- Draw numbered bounding boxes
- Send annotated screenshot to AI
- AI returns element number
- Look up coordinates from element map
### Libraries to Consider
- **Tesseract.js** - OCR in JavaScript/Node.js
- **Sharp** - Image processing (already in use)
- **OpenCV.js** - Computer vision for element detection
- **node-tesseract-ocr** - Node.js Tesseract wrapper
---
## References
### Documentation & Guides
- [Appium Image Recognition](https://docs.digital.ai/continuous-testing/docs/te/test-execution-home/mobile-android-and-ios/appium/appium-image-recognition)
- [Appium Pro: Finding Elements By Image](https://appiumpro.com/editions/32-finding-elements-by-image-part-1)
- [Maestro Documentation](https://docs.maestro.dev/)
- [Vision-Based GUI Testing Guide](https://www.futuristicbug.com/vision-based-gui-testing/)
### Academic Papers
- [Vision-driven Automated Mobile GUI Testing via MLLM](https://arxiv.org/html/2407.03037v1)
- [Vision-Based Mobile App GUI Testing: A Survey](https://arxiv.org/html/2310.13518v3)
### Blog Posts
- [AskUI: GPT Agents & GPT-4 Vision for UI Automation](https://www.askui.com/blog-posts/developing-an-automated-ui-controller-using-gpt-agents-and-gpt-4-vision)
- [How Vision-Based AI Agents Work in UI Testing](https://www.askui.com/blog-posts/vision-ai-ui-testing)
- [Image-Based Locators in Mobile Testing](https://medium.com/ai-driven-software-testing/start-using-image-based-locators-in-mobile-app-testing-3995954e74b8)
- [OCR with Appium](https://medium.com/@patnaikgaurav61/implementing-ocr-library-in-appium-test-scripts-6682f21792f9)
---
*Research compiled: December 2024*