android-ui

Control Android device interfaces by performing tap, swipe, text input, key press, and app launch actions for UI automation and testing.

Instructions

Perform various UI interaction operations on an Android device.

Args:
    ctx: MCP Context.
    serial: Device serial number.
    action: The UI action to perform.
    x: X coordinate (for tap).
    y: Y coordinate (for tap).
    start_x: Starting X coordinate (for swipe).
    start_y: Starting Y coordinate (for swipe).
    end_x: Ending X coordinate (for swipe).
    end_y: Ending Y coordinate (for swipe).
    duration_ms: Duration of the swipe in milliseconds (default: 300).
    text: Text to input (for input_text).
    keycode: Android keycode to press (for press_key).
    package: Package name (for start_intent).
    activity: Activity name (for start_intent).
    extras: Optional intent extras (for start_intent).

Returns: A string message indicating the result of the operation.
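To make the per-action parameter requirements concrete, here is an illustrative sketch: two example argument sets for this tool, plus a minimal check that mirrors the action-specific required parameters listed above. The `REQUIRED_PARAMS` mapping and `missing_params` helper are assumptions derived from the Args list, not part of the tool itself.

```python
# Which parameters each action requires, per the Args list above.
# This mapping is an illustration, not DroidMind's actual validation code.
REQUIRED_PARAMS = {
    "tap": {"x", "y"},
    "swipe": {"start_x", "start_y", "end_x", "end_y"},
    "input_text": {"text"},
    "press_key": {"keycode"},
    "start_intent": {"package", "activity"},
}

def missing_params(args: dict) -> set:
    """Return the action-specific parameters absent from a call's arguments."""
    required = REQUIRED_PARAMS.get(args.get("action"), set())
    return {name for name in required if args.get(name) is None}

# A well-formed tap call: serial and action plus both tap coordinates.
tap_call = {"serial": "emulator-5554", "action": "tap", "x": 540, "y": 960}

# A swipe call missing its end coordinates; the tool would return an error.
bad_swipe = {"serial": "emulator-5554", "action": "swipe",
             "start_x": 540, "start_y": 1600}
```

A call like `bad_swipe` would be rejected with the swipe error message shown in the implementation below.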

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| serial | Yes | | |
| action | Yes | | |
| x | No | | |
| y | No | | |
| start_x | No | | |
| start_y | No | | |
| end_x | No | | |
| end_y | No | | |
| duration_ms | No | | |
| text | No | | |
| keycode | No | | |
| package | No | | |
| activity | No | | |
| extras | No | | |

Output Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| result | Yes | | |

Implementation Reference

  • The primary handler function 'android_ui', decorated with @mcp.tool(name='android-ui'), implements UI interactions (tap, swipe, input text, press key, start intent) on Android devices by dispatching to internal helper functions.
    @mcp.tool(name="android-ui")
    async def android_ui(  # pylint: disable=too-many-arguments
        ctx: Context,
        serial: str,
        action: UIAction,
        x: int | None = None,
        y: int | None = None,
        start_x: int | None = None,
        start_y: int | None = None,
        end_x: int | None = None,
        end_y: int | None = None,
        duration_ms: int = 300,  # Default for swipe
        text: str | None = None,
        keycode: int | None = None,
        package: str | None = None,
        activity: str | None = None,
        extras: dict[str, str] | None = None,
    ) -> str:
        """
        Perform various UI interaction operations on an Android device.
    
        Args:
            ctx: MCP Context.
            serial: Device serial number.
            action: The UI action to perform.
            x: X coordinate (for tap).
            y: Y coordinate (for tap).
            start_x: Starting X coordinate (for swipe).
            start_y: Starting Y coordinate (for swipe).
            end_x: Ending X coordinate (for swipe).
            end_y: Ending Y coordinate (for swipe).
            duration_ms: Duration of the swipe in milliseconds (default: 300).
            text: Text to input (for input_text).
            keycode: Android keycode to press (for press_key).
            package: Package name (for start_intent).
            activity: Activity name (for start_intent).
            extras: Optional intent extras (for start_intent).
    
        Returns:
            A string message indicating the result of the operation.
        """
        if action == UIAction.TAP:
            if x is None or y is None:
                msg = "Error: 'x' and 'y' coordinates are required for tap action."
                await ctx.error(msg)
                return msg
            return await _tap_impl(serial=serial, x=x, y=y, ctx=ctx)
    
        if action == UIAction.SWIPE:
            if start_x is None or start_y is None or end_x is None or end_y is None:
                msg = "Error: 'start_x', 'start_y', 'end_x', and 'end_y' are required for swipe action."
                await ctx.error(msg)
                return msg
            # duration_ms has a default, so no explicit None check needed if we pass it through
            return await _swipe_impl(
                serial=serial,
                start_x=start_x,
                start_y=start_y,
                end_x=end_x,
                end_y=end_y,
                ctx=ctx,
                duration_ms=duration_ms,
            )
    
        if action == UIAction.INPUT_TEXT:
            if text is None:
                msg = "Error: 'text' is required for input_text action."
                await ctx.error(msg)
                return msg
            return await _input_text_impl(serial=serial, text=text, ctx=ctx)
    
        if action == UIAction.PRESS_KEY:
            if keycode is None:
                msg = "Error: 'keycode' is required for press_key action."
                await ctx.error(msg)
                return msg
            return await _press_key_impl(serial=serial, keycode=keycode, ctx=ctx)
    
        if action == UIAction.START_INTENT:
            if package is None or activity is None:
                msg = "Error: 'package' and 'activity' are required for start_intent action."
                await ctx.error(msg)
                return msg
            # extras is optional
            device_manager = get_device_manager()
            return await start_intent(
                ctx=ctx,
                serial=serial,
                package=package,
                activity=activity,
                device_manager=device_manager,
                extras=extras,
            )
    
        # Should not be reached if action is a valid UIAction member
        unhandled_action_msg = f"Error: Unhandled UI action '{action}'."
        logger.error(unhandled_action_msg)
        await ctx.error(unhandled_action_msg)
        return unhandled_action_msg
  • The UIAction enum defines the actions supported by the 'android-ui' tool: TAP, SWIPE, INPUT_TEXT, PRESS_KEY, START_INTENT.
    class UIAction(Enum):
        """Actions available for UI automation."""
    
        TAP = "tap"
        SWIPE = "swipe"
        INPUT_TEXT = "input_text"
        PRESS_KEY = "press_key"
        START_INTENT = "start_intent"
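The helper functions the handler dispatches to (`_tap_impl`, `_swipe_impl`, and so on) are not shown on this page. As a rough sketch of what they likely do, assuming the common approach of shelling out to adb's `input` command, the command lines for tap and swipe could be built like this. The function names and structure here are hypothetical, not DroidMind's actual implementation:

```python
# Hypothetical sketch: build the adb command lines a helper like _tap_impl
# or _swipe_impl might execute. `input tap` and `input swipe` are standard
# adb shell commands for injecting touch events.

def tap_argv(serial: str, x: int, y: int) -> list:
    """adb command line for a simulated tap at (x, y)."""
    return ["adb", "-s", serial, "shell", "input", "tap", str(x), str(y)]

def swipe_argv(serial: str, start_x: int, start_y: int,
               end_x: int, end_y: int, duration_ms: int = 300) -> list:
    """adb command line for a swipe; duration_ms matches the tool's default."""
    return ["adb", "-s", serial, "shell", "input", "swipe",
            str(start_x), str(start_y), str(end_x), str(end_y),
            str(duration_ms)]

# Executing for real would look something like:
#     subprocess.run(tap_argv("emulator-5554", 540, 960), check=True)
```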

Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. While it mentions 'UI interaction operations' and lists parameters, it doesn't describe what actually happens during execution: whether these are simulated touches, whether they require device accessibility services, what timing considerations apply, or what side effects they produce. For a tool with 14 parameters and no annotation coverage, this is a significant gap in understanding the tool's behavior.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with a clear purpose statement followed by organized parameter explanations. The Args section efficiently groups parameters by their associated actions. While comprehensive, it maintains reasonable conciseness for a tool with 14 parameters. Every sentence serves a purpose, though the initial purpose statement could be more specific.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (14 parameters, no annotations, but with an output schema), the description provides adequate but incomplete coverage. The parameter explanations are strong, but context about behavioral aspects, error conditions, and practical usage is missing. The output schema means return values are documented elsewhere, but the description doesn't explain what 'result of the operation' entails or give examples of successful and failed outcomes.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The description provides excellent parameter semantics through the Args section, mapping each parameter to specific actions (e.g., 'x: X coordinate (for tap)'). With 0% schema description coverage, the description fully compensates by explaining what each parameter means and which actions they apply to. This adds substantial value beyond the bare schema, though it could benefit from more detail about parameter constraints or relationships.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 3/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description states 'Perform various UI interaction operations on an Android device' which provides a general purpose but is vague about the specific operations. It mentions UI interactions but doesn't clearly distinguish this from sibling tools like android-screenshot or android-shell. The purpose is understandable but lacks specificity about what makes this tool unique among Android automation tools.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

No guidance is provided about when to use this tool versus alternatives. With sibling tools like android-screenshot (for capturing screen), android-shell (for command execution), and android-app (likely for app management), there's no indication of when UI interaction is appropriate versus other Android automation approaches. The description offers no context about prerequisites, limitations, or typical use cases.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
