Skip to main content
Glama

mouse_click

Click at specific screen coordinates or convert image-based coordinates to screen positions for automated desktop interaction when pixel-level precision is required.

Instructions

Click at screen-absolute coordinates (virtual screen pixels), or pass origin+scale from a dotByDot=true screenshot response to let the server convert image-local coords automatically: screen = origin + (x,y) / (scale ?? 1). doubleClick:true for double-click; tripleClick:true for triple-click (selects a full line of text) — if both are set, tripleClick wins. windowTitle optionally focuses the window first (for pinned-dock setups). Prefer click_element (UIA) for stable text-addressed clicking in native apps. Prefer browser_click_element for Chrome. Use mouse_click only when pixel coords are the only available option. Pass lensId (from perception_register) to run safety guards (identity stable, foreground, coordinates in rect) before clicking and receive post.perception state feedback without a screenshot. Caveats: origin+scale are meaningful ONLY with dotByDot=true screenshot responses — applying them to scaled detail='text'/'meta' output lands clicks in the wrong positions.

Input Schema

TableJSON Schema
NameRequiredDescriptionDefault
xYesX coordinate. Screen-absolute by default. When 'origin' is provided, treated as image-local (pixel position within the screenshot).
yYesY coordinate. Screen-absolute by default. When 'origin' is provided, treated as image-local.
originNoWhen set, (x,y) are image-local coords from a screenshot. Server converts to screen coords: screen_x = origin.x + x / (scale ?? 1), screen_y = origin.y + y / (scale ?? 1). Copy origin values directly from the screenshot response text. This eliminates manual coord math and prevents out-of-window clicks.
scaleNoScale factor from screenshot response (only when dotByDotMaxDimension caused a resize). Omit if the screenshot was 1:1. Only used when 'origin' is also provided.
buttonNoMouse button to clickleft
doubleClickNoWhether to double-click
tripleClickNoWhether to triple-click (select a line of text). Takes precedence over doubleClick when both are true.
narrateNoNarration level. rich includes UIA or browser state diff when supported.minimal
speedNoCursor movement speed in px/sec. 0 = instant.
homingNoEnable homing correction if the target window moved.
windowTitleNoPartial title of the target window.
elementNameNoName or label of the UI element.
elementIdNoAutomationId of the UI element.
forceFocusNoBypass Windows foreground-stealing protection before focusing.
trackFocusNoDetect if focus was stolen after the action.
settleMsNoMilliseconds to wait before checking post-action state.
lensIdNoOptional perception lens ID from perception_register. When provided, guards are evaluated before clicking (safe.clickCoordinates, target.identityStable) and a perception envelope is attached to post.perception in the response.
Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It does well by explaining coordinate conversion mechanics, precedence rules (tripleClick wins over doubleClick), safety guard integration with lensId, and important caveats about origin+scale usage. However, it doesn't fully describe error conditions, performance characteristics, or what happens when clicks fail.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is appropriately sized for a complex tool with 17 parameters. It's front-loaded with core functionality and usage guidance, though it could be slightly more structured. Some sentences are quite long and could be broken up for better readability, but overall it's efficient with minimal waste.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a complex mouse interaction tool with 17 parameters and no output schema, the description does well to explain coordinate systems, click types, safety mechanisms, and sibling tool relationships. It covers the essential context needed to use the tool correctly, though it doesn't describe return values or error responses since there's no output schema.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents all 17 parameters thoroughly. The description adds some context about origin+scale conversion mechanics and lensId safety guards, but doesn't provide significant additional parameter semantics beyond what's in the schema. This meets the baseline expectation when schema coverage is high.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: clicking at screen coordinates with options for coordinate conversion, click types, and window focusing. It specifically distinguishes from siblings by stating 'Prefer click_element (UIA) for stable text-addressed clicking in native apps. Prefer browser_click_element for Chrome. Use mouse_click only when pixel coords are the only available option.' This provides explicit differentiation from alternative tools.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides excellent usage guidance with explicit when/when-not statements and named alternatives. It states 'Prefer click_element (UIA) for stable text-addressed clicking in native apps. Prefer browser_click_element for Chrome. Use mouse_click only when pixel coords are the only available option.' This gives clear context for when to use this tool versus its siblings.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

Install Server

Other Tools

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Harusame64/desktop-touch-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server