# AI Agent Integration Guide
This guide explains how AI agents should use the mcp-scrcpy-vision MCP server to control Android devices.
---
## Overview
The mcp-scrcpy-vision server gives AI agents **vision and control** over Android devices through MCP tools and resources. Agents can:
- **See** the screen via screenshots or continuous streaming
- **Interact** with the UI via tap, swipe, text input, and gestures
- **Navigate** apps and system UI
- **Automate** complex multi-step workflows
---
## Quick Start for Agents
### 1. Check Device Connection
```
Tool: android.devices.list
Result: [{ serial: "ABC123", status: "device", ... }]
```
Always start by listing devices to get the `serial` identifier needed for all other commands.
### 2. Choose Your Mode
**Streaming Mode (Recommended)** - For interactive control with fast input:
```
1. android.vision.startStream { serial: "ABC123" }
2. Read resource: android://device/ABC123/frame/latest.jpg
3. Perform actions (tap, swipe, etc.) - uses fast scrcpy control (~5-10ms)
4. Read resource again to see result
5. android.vision.stopStream { serial: "ABC123" } when done
```
**Snapshot Mode** - For simpler automation or when streaming is unavailable:
```
1. android.vision.snapshot { serial: "ABC123" } - returns PNG screenshot
2. Analyze image
3. Perform action (uses ADB shell commands, ~100-300ms)
4. Take another snapshot to verify
```
---
## The Agent Control Loop
The recommended pattern for AI agents:
```
┌─────────────────────────────────────────────────────┐
│                       OBSERVE                       │
│  Read frame/snapshot → Analyze screen content       │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│                       DECIDE                        │
│  What action should I take to achieve my goal?      │
│  - Which element to interact with?                  │
│  - What input type? (tap, swipe, text, key)         │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│                         ACT                         │
│  Execute the chosen action via MCP tool             │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│                       VERIFY                        │
│  Read new frame → Confirm action had expected       │
│  effect → Loop back to OBSERVE                      │
└─────────────────────────────────────────────────────┘
```
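The loop above can be sketched in Python as a small driver that takes the four stages as callables. This is an illustrative sketch only: `run_control_loop` and the `observe`, `decide`, `act`, and `verify` callables are hypothetical names, and how they talk to the MCP server depends on your client.

```python
from typing import Callable, Optional

Action = dict  # hypothetical shape, e.g. {"tool": "android.input.tap", "args": {...}}


def run_control_loop(
    observe: Callable[[], bytes],
    decide: Callable[[bytes], Optional[Action]],
    act: Callable[[Action], None],
    verify: Callable[[bytes, bytes, Action], bool],
    max_steps: int = 20,
) -> int:
    """Run observe/decide/act/verify until decide() returns None.

    Returns the number of actions whose effect was verified.
    """
    verified = 0
    frame = observe()
    for _ in range(max_steps):
        action = decide(frame)
        if action is None:  # goal reached, nothing left to do
            break
        act(action)
        new_frame = observe()
        if verify(frame, new_frame, action):
            verified += 1
        frame = new_frame  # the latest frame becomes the next observation
    return verified
```

The `max_steps` cap is a design choice to keep a confused agent from looping forever; pick a limit that suits the workflow.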
### Example: Login Flow
```
# Setup
android.vision.startStream { serial: "ABC123" }
# Observe - Read current screen
Read resource: android://device/ABC123/frame/latest.jpg
→ Agent sees: Login screen with username/password fields
# Decide - Need to enter username
android.ui.findElement { serial: "ABC123", resourceId: "username" }
→ Returns: { elements: [{ centerX: 540, centerY: 400, ... }] }
# Act - Tap username field
android.input.tap { serial: "ABC123", x: 540, y: 400 }
# Act - Type username
android.input.text { serial: "ABC123", text: "user@example.com" }
# Verify - Read screen to confirm
Read resource: android://device/ABC123/frame/latest.jpg
→ Agent sees: Username field now contains text
# Continue with password field...
```
---
## Vision Tools
### android.vision.startStream
Start continuous screen capture. This also enables fast input via the scrcpy control protocol.
```json
{
"serial": "ABC123",
"maxSize": 1024, // Optional: max dimension (default 1024)
"maxFps": 30, // Optional: capture fps (default 30)
"frameFps": 2 // Optional: output fps for frames (default 2)
}
```
**When to use**: For interactive sessions where you need real-time feedback and fast input.
### android.vision.stopStream
Stop the streaming session and release resources.
```json
{
"serial": "ABC123"
}
```
**When to use**: Always call this when an interactive session is finished.
### android.vision.snapshot
Take a single PNG screenshot without starting a stream.
```json
{
"serial": "ABC123"
}
```
**Returns**: Base64-encoded PNG image data.
**When to use**: For simple verification steps or when streaming isn't needed.
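Since the snapshot comes back as base64 text, an agent host usually needs to decode it before saving or analyzing it. A minimal sketch, assuming the base64 string has already been extracted from the tool result (the exact result envelope is not specified here, and `decode_snapshot` is a hypothetical helper):

```python
import base64

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_snapshot(b64_png: str) -> bytes:
    """Decode a base64 snapshot payload and check that it is a PNG."""
    raw = base64.b64decode(b64_png, validate=True)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("snapshot payload is not a PNG image")
    return raw
```

The returned bytes can then be written to disk with `open(path, "wb")` or passed to an image library.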
### android.ui.dump
Get the UI hierarchy as XML (via uiautomator).
```json
{
"serial": "ABC123"
}
```
**Returns**: XML string of all UI elements with bounds, text, resource-ids.
**When to use**: When you need detailed information about UI structure.
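If you work with the raw dump, uiautomator encodes each element as a `node` with a `bounds` attribute of the form `[x1,y1][x2,y2]`. The sketch below parses that format and computes a tap point; `find_nodes` and the sample XML are illustrative, not output from a real device.

```python
import re
import xml.etree.ElementTree as ET

# uiautomator writes bounds as "[left,top][right,bottom]" in pixels.
BOUNDS_RE = re.compile(r"\[(-?\d+),(-?\d+)\]\[(-?\d+),(-?\d+)\]")


def find_nodes(xml_text: str, text: str) -> list[dict]:
    """Return nodes whose text attribute equals `text`, with their centre point."""
    matches = []
    for node in ET.fromstring(xml_text).iter("node"):
        if node.get("text") != text:
            continue
        m = BOUNDS_RE.fullmatch(node.get("bounds", ""))
        if m is None:
            continue  # skip nodes with missing or malformed bounds
        x1, y1, x2, y2 = map(int, m.groups())
        matches.append({
            "resourceId": node.get("resource-id", ""),
            "centerX": (x1 + x2) // 2,
            "centerY": (y1 + y2) // 2,
        })
    return matches


# Hand-written sample in the uiautomator dump format.
SAMPLE = """<hierarchy rotation="0">
  <node text="" resource-id="" class="android.widget.FrameLayout" content-desc="" bounds="[0,0][1080,1920]">
    <node text="Login" resource-id="com.example.app:id/btn_login" class="android.widget.Button" content-desc="Sign in" bounds="[440,1150][640,1250]" />
  </node>
</hierarchy>"""

print(find_nodes(SAMPLE, "Login"))
```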
### android.ui.findElement
Find specific elements with their tap coordinates.
```json
{
"serial": "ABC123",
"text": "Login", // Optional: match by visible text
"resourceId": "btn_login", // Optional: match by resource ID
"className": "Button", // Optional: match by class name
"contentDesc": "Sign in" // Optional: match by content description
}
```
**Returns**: An `elements` array of matches, each with `centerX`, `centerY`, `bounds`, and other properties.
**When to use**: To find precise tap coordinates for interactive elements.
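A common pattern is to chain `android.ui.findElement` into `android.input.tap`. The sketch below assumes a hypothetical `call_tool(name, args)` function supplied by your MCP client, and assumes the result has the `{ elements: [...] }` shape shown in the login example.

```python
from typing import Any, Callable

# Hypothetical signature: forwards a tool name and arguments to the MCP server.
ToolCaller = Callable[[str, dict], Any]


def tap_element(call_tool: ToolCaller, serial: str, **selector: str) -> bool:
    """Tap the first element matching the selector; return False if none match."""
    result = call_tool("android.ui.findElement", {"serial": serial, **selector})
    elements = result.get("elements", [])
    if not elements:
        return False
    first = elements[0]
    call_tool(
        "android.input.tap",
        {"serial": serial, "x": first["centerX"], "y": first["centerY"]},
    )
    return True
```

For example, `tap_element(call_tool, "ABC123", resourceId="username")` would focus the username field from the login flow.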
---
## Input Tools
All input tools automatically use fast scrcpy control when streaming is active, falling back to ADB shell when not.
### android.input.tap
Tap at specific coordinates.
```json
{
"serial": "ABC123",
"x": 540,
"y": 1200
}
```
**Latency**: ~5-10ms (streaming) / ~100-300ms (no streaming)
### android.input.swipe
Swipe gesture from point A to point B.
```json
{
"serial": "ABC123",
"x1": 540,
"y1": 1500,
"x2": 540,
"y2": 500,
"durationMs": 300 // Optional: swipe duration
}
```
**Common patterns**:
- Scroll down (reveal content below): swipe from bottom to top
- Scroll up (reveal content above): swipe from top to bottom
- Scroll left (reveal content to the right): swipe from right to left
- Pull to refresh: swipe down from the top of the list
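These patterns can be turned into coordinates from the screen size. A minimal sketch, assuming you already know the device resolution; `scroll_swipe` and its `margin` parameter are hypothetical, and the `"right"` case is added here only for symmetry.

```python
def scroll_swipe(direction: str, width: int, height: int, margin: float = 0.25) -> dict:
    """Build x1/y1/x2/y2 for a scroll; 'down' reveals content further down."""
    cx, cy = width // 2, height // 2
    near_x, far_x = int(width * margin), int(width * (1 - margin))
    near_y, far_y = int(height * margin), int(height * (1 - margin))
    swipes = {
        "down": (cx, far_y, cx, near_y),   # finger moves bottom -> top
        "up": (cx, near_y, cx, far_y),     # finger moves top -> bottom
        "left": (far_x, cy, near_x, cy),   # finger moves right -> left
        "right": (near_x, cy, far_x, cy),  # finger moves left -> right
    }
    x1, y1, x2, y2 = swipes[direction]
    return {"x1": x1, "y1": y1, "x2": x2, "y2": y2}
```

Merging the result with `{"serial": ..., "durationMs": ...}` gives a complete argument object for `android.input.swipe`.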
### android.input.longPress
Long press at coordinates.
```json
{
"serial": "ABC123",
"x": 540,
"y": 800,
"durationMs": 1000 // Optional: hold duration (default 1000)
}
```
**When to use**: Context menus, drag initiation, selection mode.
### android.input.text
Type text into the focused input field.
```json
{
"serial": "ABC123",
"text": "Hello World"
}
```
**Note**: First tap an input field to focus it, then use this tool.
### android.input.keyevent
Send an Android key event by keycode.
```json
{
"serial": "ABC123",
"keycode": 4 // BACK key
}
```
**Common keycodes**:
| Key | Code | Key | Code |
|-----|------|-----|------|
| HOME | 3 | BACK | 4 |
| VOLUME_UP | 24 | VOLUME_DOWN | 25 |
| POWER | 26 | ENTER | 66 |
| DEL (backspace) | 67 | TAB | 61 |
| MENU | 82 | APP_SWITCH | 187 |
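Named constants make keycode calls easier to read than bare integers. The sketch below mirrors the table in an `IntEnum`; the `KeyCode` class and `keyevent_args` helper are illustrative names, not part of the server.

```python
from enum import IntEnum


class KeyCode(IntEnum):
    """Android keycodes from the table above."""
    HOME = 3
    BACK = 4
    VOLUME_UP = 24
    VOLUME_DOWN = 25
    POWER = 26
    TAB = 61
    ENTER = 66
    DEL = 67  # Android's KEYCODE_DEL, which acts as backspace
    MENU = 82
    APP_SWITCH = 187


def keyevent_args(serial: str, key: KeyCode) -> dict:
    """Build the argument object for android.input.keyevent."""
    return {"serial": serial, "keycode": int(key)}


print(keyevent_args("ABC123", KeyCode.BACK))
```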
### android.input.pinch
Pinch zoom gesture.
```json
{
"serial": "ABC123",
"centerX": 540,
"centerY": 960,
"startDistance": 200,
"endDistance": 400, // Zoom in (endDistance > startDistance)
"durationMs": 300
}
```
### android.input.dragDrop
Press and hold at the start point, then drag to the end point.
```json
{
"serial": "ABC123",
"startX": 200,
"startY": 500,
"endX": 600,
"endY": 500,
"durationMs": 500
}
```
---
## App Control Tools
### android.app.start
Launch an app by package name.
```json
{
"serial": "ABC123",
"packageName": "com.android.settings",
"activity": ".Settings" // Optional: specific activity
}
```
### android.app.stop
Force stop an app.
```json
{
"serial": "ABC123",
"packageName": "com.example.app"
}
```
### android.apps.list
List installed apps.
```json
{
"serial": "ABC123",
"system": false // Optional: include system apps
}
```
### android.activity.current
Get the current foreground activity.
```json
{
"serial": "ABC123"
}
```
**Returns**: Package name and activity of the foreground app.
---
## System Tools
### android.shell.exec
Execute an arbitrary shell command.
```json
{
"serial": "ABC123",
"command": "pm list packages"
}
```
**Warning**: Use with caution. Can execute any shell command.
### android.file.push / android.file.pull
Transfer files to/from device.
```json
// Push
{
"serial": "ABC123",
"localPath": "C:/test.txt",
"remotePath": "/sdcard/test.txt"
}
// Pull
{
"serial": "ABC123",
"remotePath": "/sdcard/screenshot.png",
"localPath": "C:/screenshot.png"
}
```
### android.clipboard.get / android.clipboard.set
Read or write the clipboard. Access is limited on Android 10+, which restricts clipboard reads from apps that are not in the foreground.
```json
{
"serial": "ABC123",
"text": "Text to copy" // For set only
}
```
### android.notifications.get
Get current notifications.
```json
{
"serial": "ABC123"
}
```
---
## Screen Control Tools
### android.screen.wake / android.screen.sleep
Wake or sleep the screen.
```json
{
"serial": "ABC123"
}
```
### android.screen.isOn
Check if screen is currently on.
```json
{
"serial": "ABC123"
}
```
### android.screen.unlock
Unlock the screen. This only works on devices without a PIN, pattern, or password lock.
```json
{
"serial": "ABC123"
}
```
---
## WiFi Connection Tools
Connect to devices wirelessly for untethered automation.
### Setup WiFi Connection
```
1. Connect device via USB first
2. android.adb.enableTcpip { serial: "ABC123" }
3. android.adb.getDeviceIp { serial: "ABC123" } → "192.168.1.50"
4. Disconnect USB cable
5. android.adb.connectWifi { ipAddress: "192.168.1.50" }
6. Use "192.168.1.50:5555" as serial for all commands
```
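Step 6 combines the device IP with port 5555 to form the serial. A small sketch of that step, with basic validation; `wifi_serial` is a hypothetical helper and it handles IPv4 addresses only.

```python
import ipaddress


def wifi_serial(ip_address: str, port: int = 5555) -> str:
    """Build the "ip:port" serial used after connecting over WiFi."""
    ip = ip_address.strip()
    ipaddress.IPv4Address(ip)  # raises ValueError for a malformed address
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"{ip}:{port}"


print(wifi_serial("192.168.1.50"))
```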
---
## Best Practices for Agents
### 1. Always Verify Actions
After every action, read the screen to verify it had the expected effect:
```
android.input.tap { ... }
// Wait 200-500ms for UI to update
Read resource: android://device/.../frame/latest.jpg
// Verify the expected change occurred
```
### 2. Use findElement for Precision
Instead of guessing coordinates, use `android.ui.findElement`:
```
// Good: Find element by text/ID
android.ui.findElement { serial: "ABC123", text: "Submit" }
→ Use returned centerX, centerY for tap
// Risky: Guessing coordinates
android.input.tap { x: 500, y: 1200 } // May miss on different screen sizes
```
### 3. Handle Loading States
Wait for content to load before interacting:
```
// Start an app
android.app.start { serial: "ABC123", packageName: "com.example.app" }
// Wait for app to load (poll for expected element)
loop:
  Read frame
  Check if expected element is visible
  If not, wait 500ms and retry (max 10 attempts)
```
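The polling loop above could be written as a reusable helper. In this sketch `wait_until` is a hypothetical name, and `condition` stands in for whatever check you use (for example, a `findElement` call that returns at least one match). The defaults match the 500ms interval and 10 attempts suggested above.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    attempts: int = 10,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `condition` until it is true or the attempts run out."""
    for attempt in range(attempts):
        if condition():
            return True
        if attempt < attempts - 1:  # no point sleeping after the last try
            sleep(interval_s)
    return False
```

The `sleep` parameter is injectable so the helper can be tested without real delays.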
### 4. Use Streaming for Interactive Sessions
For multi-step workflows, prefer streaming mode:
```
android.vision.startStream { serial: "ABC123" }
// ... multiple interactions ...
android.vision.stopStream { serial: "ABC123" }
```
Benefits:
- Roughly 10x or more faster input (~5-10ms vs ~100-300ms per action)
- Consistent frame access
- Lower latency feedback loop
### 5. Fallback Gracefully
If streaming fails, fall back to snapshot mode:
```
try:
  android.vision.startStream { ... }
except:
  // Use snapshot mode instead
  android.vision.snapshot { ... }
```
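In Python the fallback might look like the sketch below. It assumes a hypothetical `call_tool(name, args)` function that raises an exception when a tool call fails; real MCP clients may report errors differently, so adapt the error check accordingly.

```python
from typing import Any, Callable


def open_vision(call_tool: Callable[[str, dict], Any], serial: str) -> str:
    """Try streaming first; fall back to snapshot mode if it cannot start.

    Returns "stream" or "snapshot" so the caller knows which mode is active.
    """
    try:
        call_tool("android.vision.startStream", {"serial": serial})
        return "stream"
    except Exception:
        # Assumption: a failed startStream surfaces as an exception.
        call_tool("android.vision.snapshot", {"serial": serial})
        return "snapshot"
```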
### 6. Clean Up Resources
Always stop streams when done:
```
try:
  // Your automation workflow
finally:
  android.vision.stopStream { serial: "ABC123" }
```
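A context manager is one idiomatic way to guarantee the stop call in Python. This is a sketch under the same assumption of a hypothetical `call_tool(name, args)` function; `streaming` is an illustrative name.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def streaming(call_tool: Callable[[str, dict], Any], serial: str, **options: Any) -> Iterator[None]:
    """Start a stream on entry and always stop it on exit."""
    call_tool("android.vision.startStream", {"serial": serial, **options})
    try:
        yield
    finally:
        # Runs even if the workflow inside the `with` block raises.
        call_tool("android.vision.stopStream", {"serial": serial})
```

Usage: `with streaming(call_tool, "ABC123", frameFps=2): ...` wraps the whole workflow.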
---
## Common Workflows
### App Installation Testing
```
1. android.app.start { packageName: "com.android.settings" }
2. Navigate to Apps section (tap, swipe as needed)
3. Find target app
4. Verify app info (version, permissions)
5. android.app.stop { packageName: "com.android.settings" }
```
### Form Filling
```
1. android.vision.startStream { serial: "ABC123" }
2. For each form field:
   a. android.ui.findElement to locate field
   b. android.input.tap to focus
   c. android.input.text to enter data
   d. Read frame to verify
3. android.ui.findElement { text: "Submit" }
4. android.input.tap on submit button
5. Verify success screen
6. android.vision.stopStream
```
### Screenshot Documentation
```
1. android.vision.snapshot { serial: "ABC123" }
2. Save the returned PNG locally (use android.file.pull only if the screenshot was written to device storage)
3. Navigate to next screen
4. Repeat for all screens needed
```
### Accessibility Testing
```
1. android.vision.startStream
2. For each interactive element:
   a. android.ui.dump to get element properties
   b. Verify the `content-desc` attribute is set (non-empty)
   c. Check bounds are reasonable tap targets (48dp minimum)
3. Test with different font sizes (via Settings)
4. android.vision.stopStream
```
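Bounds in the UI dump are in pixels, so the 48dp check needs a conversion using Android's rule `px = dp * dpi / 160`. The sketch below assumes you already know the screen density (it could be read with `wm density` through `android.shell.exec`); the function names are hypothetical.

```python
import math


def min_tap_target_px(density_dpi: int, min_dp: int = 48) -> int:
    """Convert the minimum tap target from dp to pixels (px = dp * dpi / 160)."""
    return math.ceil(min_dp * density_dpi / 160)


def is_tappable(bounds: tuple[int, int, int, int], density_dpi: int) -> bool:
    """Check that an element's width and height both meet the minimum size."""
    x1, y1, x2, y2 = bounds
    need = min_tap_target_px(density_dpi)
    return (x2 - x1) >= need and (y2 - y1) >= need


# A 200x100 px button on a 420 dpi screen is too short: 48dp is 126 px there.
print(min_tap_target_px(420), is_tappable((440, 1150, 640, 1250), 420))
```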
---
## Troubleshooting
### "No devices found"
- Check USB cable connection
- Enable USB debugging in Developer Options
- Accept RSA fingerprint prompt on device
- Run `adb devices` to verify
### Streaming won't start
- Verify the scrcpy-server path is correct
- Check that the configured scrcpy version matches the scrcpy-server file exactly (scrcpy rejects mismatched versions)
- Try `android.vision.snapshot` to verify basic connectivity
### Input not working
- Ensure screen is on: `android.screen.wake`
- Check coordinates are within screen bounds
- Try increasing delays between actions
### UI elements not found
- Wait for content to load before searching
- Try alternative selectors (text vs resourceId)
- Use `android.ui.dump` to see actual element hierarchy
---
## Performance Tips
| Mode | Latency | Best For |
|------|---------|----------|
| Streaming + Fast Input | ~5-10ms/action | Interactive workflows, games, real-time control |
| Snapshot + ADB | ~100-300ms/action | Simple verification, screenshot capture |
For best performance:
1. Use streaming mode for multi-step workflows
2. Batch related actions (tap + type) before reading a frame, then verify the combined result
3. Use appropriate frame rate (`frameFps` parameter)
4. Stop streams when not actively using them
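To get a feel for the difference, the latency ranges in the table can be multiplied out for a whole workflow. This is a rough estimate of input time only (it ignores frame reads and UI settle time), and `estimate_input_ms` is a hypothetical helper.

```python
# Per-action latency ranges in milliseconds, taken from the table above.
STREAMING_MS = (5, 10)
ADB_MS = (100, 300)


def estimate_input_ms(actions: int, per_action_ms: tuple[int, int]) -> tuple[int, int]:
    """Return the (best, worst) total input time for a number of actions."""
    low, high = per_action_ms
    return actions * low, actions * high


# A 50-action workflow: 250-500 ms when streaming, 5000-15000 ms over ADB.
print(estimate_input_ms(50, STREAMING_MS), estimate_input_ms(50, ADB_MS))
```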
---
## Security Notes
This MCP server provides **full device control**:
- Can execute any shell command
- Can read/write files
- Can control all UI and input
- Can access clipboard and notifications
**Only connect devices you own, and only when you trust the AI agent controlling them.**