
interact_with_ui

Perform UI interactions on Android and iOS apps by tapping, swiping, or inputting text using element targeting or coordinates.

Instructions

Perform UI interactions like tap, swipe, or text input. Can target elements by ID/text or by coordinates.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| platform | Yes | Target platform | |
| action | Yes | Type of interaction to perform | |
| element | No | Element ID, resource ID, or text to interact with | |
| x | No | X coordinate for coordinate-based interaction | |
| y | No | Y coordinate for coordinate-based interaction | |
| text | No | Text to input (for input_text action) | |
| direction | No | Swipe direction (for swipe action) | |
| durationMs | No | Duration in milliseconds (for long_press and swipe) | 300 |
| deviceId | No | Device ID or name (optional) | |
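
Putting the schema together, a minimal tap-by-text invocation could look like the following (the element label and device ID are hypothetical values, not taken from this page):

```typescript
// Hypothetical arguments for a tap targeting a visible "Login" label.
// Only platform and action are required; element substitutes for x/y.
const exampleArgs = {
  platform: 'android',
  action: 'tap',
  element: 'Login',
  deviceId: 'emulator-5554', // optional; omit to use the default device
};
```

When both `element` and `x`/`y` are supplied, the handler below prefers the element lookup, since the coordinate branch is only reached when `element` is absent.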

Implementation Reference

  • Main handler function for the 'interact_with_ui' tool. Validates inputs, locates elements or uses coordinates, performs platform-specific UI interactions (tap, swipe, input, etc.) for Android and iOS, and returns interaction results.
    export async function interactWithUI(args: InteractWithUIArgs): Promise<InteractionResult> {
      const {
        platform,
        action,
        element,
        x,
        y,
        text,
        direction,
        durationMs = 300,
        deviceId,
      } = args;
    
      // Validate platform
      if (!isPlatform(platform)) {
        throw Errors.invalidArguments(`Invalid platform: ${platform}. Must be 'android' or 'ios'`);
      }
    
      // Validate action
      if (!INTERACTION_TYPES.includes(action as InteractionType)) {
        throw Errors.invalidArguments(
          `Invalid action: ${action}. Must be one of: ${INTERACTION_TYPES.join(', ')}`
        );
      }
    
      const interactionType = action as InteractionType;
      const startTime = Date.now();
    
      // Determine target coordinates
      let targetCoords: Point;
      let targetElement: UIElement | undefined;
    
      if (element) {
        // Find element by ID or text
        const foundElement = await findTargetElement(platform, element, deviceId);
        if (!foundElement) {
          throw Errors.elementNotFound(element);
        }
        targetElement = foundElement;
        targetCoords = foundElement.center;
      } else if (x !== undefined && y !== undefined) {
        targetCoords = { x, y };
      } else if (interactionType !== 'input_text' && interactionType !== 'clear') {
        throw Errors.invalidArguments('Either element or coordinates (x, y) must be provided');
      } else {
        // For input_text and clear, we operate on the focused element
        targetCoords = { x: 0, y: 0 };
      }
    
      // Perform interaction
      try {
        if (platform === 'android') {
          await performAndroidInteraction(interactionType, targetCoords, {
            text,
            direction: direction as SwipeDirection,
            durationMs,
            deviceId,
          });
        } else {
          await performIOSInteraction(interactionType, targetCoords, {
            text,
            direction: direction as SwipeDirection,
            durationMs,
            deviceId,
          });
        }
    
        return {
          success: true,
          interactionType: action,
          targetElement: targetElement
            ? {
                id: targetElement.id,
                type: targetElement.type,
                bounds: targetElement.bounds,
              }
            : undefined,
          coordinates: targetCoords,
          durationMs: Date.now() - startTime,
        };
      } catch (error) {
        return {
          success: false,
          interactionType: action,
          coordinates: targetCoords,
          durationMs: Date.now() - startTime,
          error: error instanceof Error ? error.message : String(error),
        };
      }
    }
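  • The InteractionResult type is not shown on this page; the following sketch is inferred from the handler's return statements above. Field names come from the code, but the `bounds` type is assumed, and the example values are hypothetical.

```typescript
// Result shape inferred from the handler's two return statements.
// `bounds` is typed loosely because the Bounds type is not shown on this page.
interface InteractionResult {
  success: boolean;
  interactionType: string;
  targetElement?: { id: string; type: string; bounds: unknown };
  coordinates: { x: number; y: number };
  durationMs: number;
  /** Present only on the failure path (catch branch) */
  error?: string;
}

// Example failure result, as produced by the catch branch (values hypothetical):
const failure: InteractionResult = {
  success: false,
  interactionType: 'tap',
  coordinates: { x: 120, y: 480 },
  durationMs: 42,
  error: 'device offline',
};
```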
  • TypeScript interface defining the input schema/arguments for the interact_with_ui tool, including platform, action type, element/coordinates, text, direction, etc.
    export interface InteractWithUIArgs {
      /** Target platform */
      platform: string;
      /** Interaction type */
      action: string;
      /** Target element ID or text (for element-based interactions) */
      element?: string;
      /** X coordinate (for coordinate-based interactions) */
      x?: number;
      /** Y coordinate (for coordinate-based interactions) */
      y?: number;
      /** Text to input (for input_text action) */
      text?: string;
      /** Swipe direction (for swipe action) */
      direction?: string;
      /** Duration in ms (for long_press and swipe) */
      durationMs?: number;
      /** Target device ID or name */
      deviceId?: string;
    }
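  • The isPlatform guard called at the top of the handler is not shown on this page. A minimal sketch consistent with its usage (a TypeScript type predicate narrowing a string to the two supported platforms) might be:

```typescript
// Hypothetical sketch of the isPlatform guard referenced by the handler.
// The real implementation may derive the list from a shared PLATFORMS constant.
type Platform = 'android' | 'ios';

function isPlatform(value: string): value is Platform {
  return value === 'android' || value === 'ios';
}
```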
  • Registration function for the interact_with_ui tool. Registers the tool name, description, JSON input schema, and binds the interactWithUI handler to the global tool registry.
    export function registerInteractWithUITool(): void {
      getToolRegistry().register(
        'interact_with_ui',
        {
          description:
            'Perform UI interactions like tap, swipe, or text input. Can target elements by ID/text or by coordinates.',
          inputSchema: createInputSchema(
            {
              platform: {
                type: 'string',
                enum: ['android', 'ios'],
                description: 'Target platform',
              },
              action: {
                type: 'string',
                enum: ['tap', 'long_press', 'swipe', 'input_text', 'clear'],
                description: 'Type of interaction to perform',
              },
              element: {
                type: 'string',
                description: 'Element ID, resource ID, or text to interact with',
              },
              x: {
                type: 'number',
                description: 'X coordinate for coordinate-based interaction',
              },
              y: {
                type: 'number',
                description: 'Y coordinate for coordinate-based interaction',
              },
              text: {
                type: 'string',
                description: 'Text to input (for input_text action)',
              },
              direction: {
                type: 'string',
                enum: ['up', 'down', 'left', 'right'],
                description: 'Swipe direction (for swipe action)',
              },
              durationMs: {
                type: 'number',
                description: 'Duration in milliseconds (for long_press and swipe, default: 300)',
              },
              deviceId: {
                type: 'string',
                description: 'Device ID or name (optional)',
              },
            },
            ['platform', 'action']
          ),
        },
        (args) => interactWithUI(args as unknown as InteractWithUIArgs)
      );
    }
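  • createInputSchema is not defined on this page. Judging from the call above, it likely wraps a property map and a list of required names into a standard JSON Schema object; a minimal sketch under that assumption:

```typescript
// Hypothetical sketch of the createInputSchema helper used during registration.
// Assumes it emits a plain JSON Schema object; the real helper may also set
// additionalProperties or a $schema field.
function createInputSchema(
  properties: Record<string, unknown>,
  required: string[]
): { type: 'object'; properties: Record<string, unknown>; required: string[] } {
  return { type: 'object', properties, required };
}

// Usage mirroring the registration call above, trimmed to one property:
const schema = createInputSchema(
  { platform: { type: 'string', enum: ['android', 'ios'] } },
  ['platform']
);
```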
  • Central registration point where registerInteractWithUITool() is imported and invoked as part of registerAllTools(), ensuring the tool is registered during server startup.
    const { registerInteractWithUITool } = await import('./ui/interact-with-ui.js');
    
    registerGetUIContextTool();
    registerInteractWithUITool();
  • Helper function implementing Android-specific UI interactions using ADB commands for tap, long_press, swipe, input_text, and clear actions.
    async function performAndroidInteraction(
      action: InteractionType,
      coords: Point,
      options: {
        text?: string;
        direction?: SwipeDirection;
        durationMs: number;
        deviceId?: string;
      }
    ): Promise<void> {
      const { text, direction, durationMs, deviceId } = options;
    
      switch (action) {
        case 'tap':
          await tap(coords.x, coords.y, deviceId);
          break;
    
        case 'long_press':
          // Implement as swipe with same start and end
          await swipe(coords.x, coords.y, coords.x, coords.y, durationMs, deviceId);
          break;
    
        case 'swipe': {
          if (!direction) {
            throw Errors.invalidArguments('direction is required for swipe action');
          }
          // Block scope avoids a lexical declaration leaking across case labels
          const swipeCoords = calculateSwipeCoordinates(coords, direction, 500);
          await swipe(
            swipeCoords.startX,
            swipeCoords.startY,
            swipeCoords.endX,
            swipeCoords.endY,
            durationMs,
            deviceId
          );
          break;
        }
    
        case 'input_text':
          if (!text) {
            throw Errors.invalidArguments('text is required for input_text action');
          }
          await inputText(text, deviceId);
          break;
    
        case 'clear':
          // Select all and delete
          await executeShell('adb', [
            ...(deviceId ? ['-s', deviceId] : []),
            'shell',
            'input',
            'keyevent',
            'KEYCODE_CTRL_A',
          ]);
          await executeShell('adb', [
            ...(deviceId ? ['-s', deviceId] : []),
            'shell',
            'input',
            'keyevent',
            'KEYCODE_DEL',
          ]);
          break;
      }
    }
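  • The calculateSwipeCoordinates helper invoked in the swipe case is not shown here. One plausible sketch, assuming the swipe is centered on the target point and the third argument is the total swipe distance in pixels (the real implementation may instead clamp to screen bounds):

```typescript
type SwipeDirection = 'up' | 'down' | 'left' | 'right';
interface Point { x: number; y: number; }

// Hypothetical sketch: returns start/end points for a swipe centered on
// `center`, moving `distance` pixels in the given direction. Note the
// convention assumed here: 'up' means the finger moves up (content scrolls down).
function calculateSwipeCoordinates(
  center: Point,
  direction: SwipeDirection,
  distance: number
): { startX: number; startY: number; endX: number; endY: number } {
  const half = distance / 2;
  switch (direction) {
    case 'up':
      return { startX: center.x, startY: center.y + half, endX: center.x, endY: center.y - half };
    case 'down':
      return { startX: center.x, startY: center.y - half, endX: center.x, endY: center.y + half };
    case 'left':
      return { startX: center.x + half, startY: center.y, endX: center.x - half, endY: center.y };
    case 'right':
      return { startX: center.x - half, startY: center.y, endX: center.x + half, endY: center.y };
  }
}
```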
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden. It mentions 'perform UI interactions', which implies mutation/write operations, but it doesn't disclose behavioral traits: whether the tool requires specific permissions, whether actions are reversible, potential side effects (e.g., app-state changes), or how errors are handled. For a complex nine-parameter mutation tool, this is inadequate.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is perfectly concise: two sentences that efficiently convey the core functionality without redundancy. It is front-loaded with the main purpose and follows with the targeting methods. Every word earns its place.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a complex UI interaction tool with 9 parameters, no annotations, and no output schema, the description is incomplete. It doesn't address return values, error conditions, platform-specific behaviors, or the relationship between element-based and coordinate-based targeting. The agent lacks sufficient context to use this tool effectively beyond basic parameter filling.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, providing detailed parameter documentation. The description adds minimal value beyond the schema, only vaguely mentioning the targeting methods ('by ID/text or by coordinates'), which partially map to the 'element', 'x', and 'y' parameters. It doesn't explain parameter relationships or usage nuances beyond what the schema already states.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool performs UI interactions, with specific examples (tap, swipe, text input) and targeting methods (ID/text or coordinates). It distinguishes itself from sibling tools by focusing on UI manipulation rather than analysis, building, or navigation. However, it doesn't explicitly differentiate itself from tools like 'get_ui_context' or 'inspect_app_state', which also involve the UI.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

No guidance is provided on when to use this tool versus alternatives. The description doesn't mention prerequisites (e.g., needing an app launched first), nor does it suggest when coordinate-based vs. element-based targeting is appropriate. With many sibling tools available, this lack of contextual guidance is a significant gap.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

