
screenshot

Read-only

Capture macOS screen images in full screen, specific regions, or windows. Returns base64-encoded images with dimension metadata for documentation or automation tasks.

Instructions

Capture a screenshot of the macOS screen. Supports full screen, a rectangular region, or a specific window by title. Returns a base64-encoded image with dimension metadata. Do not narrate visual observations or coordinate calculations. Brief task progress updates are acceptable.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| mode | Yes | Capture mode: "full" (entire screen), "region" (rectangular area), or "window" (specific window) | full |
| x | No | Left edge x-coordinate in screen pixels (may be negative for secondary displays; required when mode is region) | |
| y | No | Top edge y-coordinate in screen pixels (may be negative for secondary displays; required when mode is region) | |
| width | No | Region width in screen pixels (required when mode is region) | |
| height | No | Region height in screen pixels (required when mode is region) | |
| window_title | No | Window title to capture (required when mode is window) | |
| max_dimension | Yes | Maximum width or height of the returned image. 0 means no resize (default). When set, must be 256–4096. | |
| format | Yes | Output image format: "png" (default) or "jpeg" | png |
| ruler | Yes | When true, overlay coordinate rulers on the top and left edges of the screenshot. Tick labels show screen coordinates for precise positioning. | |
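
As an illustration of the schema above, the following sketch builds one argument object per capture mode. The `ScreenshotArgs` type is a hypothetical mirror of the table and is not taken from the server's source; the window title and coordinates are made-up values. The fields the table marks as required (`max_dimension`, `format`, `ruler`) are passed explicitly.

```typescript
// Hypothetical type mirroring the input schema table; not part of the server.
type ScreenshotArgs = {
  mode: "full" | "region" | "window";
  x?: number;
  y?: number;
  width?: number;
  height?: number;
  window_title?: string;
  max_dimension: number; // 0 = no resize, otherwise 256 to 4096
  format: "png" | "jpeg";
  ruler: boolean;
};

// Entire screen, original resolution.
const fullScreen: ScreenshotArgs = {
  mode: "full",
  max_dimension: 0,
  format: "png",
  ruler: false,
};

// A 640x480 region on a secondary display to the left of the main one
// (hence the negative x), downscaled to at most 1024 px with rulers drawn.
const region: ScreenshotArgs = {
  mode: "region",
  x: -1280,
  y: 0,
  width: 640,
  height: 480,
  max_dimension: 1024,
  format: "jpeg",
  ruler: true,
};

// A single window, selected by its title (example title only).
const windowCapture: ScreenshotArgs = {
  mode: "window",
  window_title: "Terminal",
  max_dimension: 0,
  format: "png",
  ruler: false,
};
```

Each object would be passed as the `arguments` of a `screenshot` tool call; how the call is issued depends on the MCP client in use.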

Implementation Reference

  • The handleScreenshot function executes the tool logic, including input validation, capturing the screen via `captureScreen`, and formatting the resulting image/metadata for the MCP response.
    async function handleScreenshot(
      args: Record<string, unknown>,
    ): Promise<CallToolResult> {
      const parsed = ScreenshotInputSchema.parse(args);
    
      try {
        const result = await captureScreen({
          mode: parsed.mode,
          region:
            parsed.mode === "region"
              ? { x: parsed.x!, y: parsed.y!, w: parsed.width!, h: parsed.height! }
              : undefined,
          windowTitle: parsed.mode === "window" ? parsed.window_title : undefined,
          maxDimension: parsed.max_dimension,
          format: parsed.format,
          ruler: parsed.ruler,
        });
    
        const mimeType = parsed.format === "jpeg" ? "image/jpeg" : "image/png";
    
        // Build coordinate mapping hint for agents
        const isIdentity =
          result.scaleX === 1 &&
          result.scaleY === 1 &&
          result.originX === 0 &&
          result.originY === 0;
    
        const coordinateHint = isIdentity
          ? "Coordinate mapping: screen = image pixel (1:1, no conversion needed)"
          : `Coordinate mapping: screen_x = ${result.originX} + image_x * ${result.scaleX}, screen_y = ${result.originY} + image_y * ${result.scaleY}`;
    
        return {
          content: [
            {
              type: "image" as const,
              data: result.base64,
              mimeType,
            },
            {
              type: "text" as const,
              text: `Image: ${result.width}x${result.height}\n${coordinateHint}`,
            },
          ],
        };
      } catch (error: unknown) {
        const message = error instanceof Error ? error.message : String(error);
    
        // Detect permission-related failures and include setup instructions
        const isPermissionError =
          /permission/i.test(message) ||
          /not permitted/i.test(message) ||
          /screen recording/i.test(message) ||
          /cannot be opened/i.test(message);
    
        const text = isPermissionError
          ? `Screenshot failed: ${message}\n\n${PERMISSION_INSTRUCTIONS}`
          : `Screenshot failed: ${message}`;
    
        return {
          content: [{ type: "text" as const, text }],
          isError: true,
        };
      }
    }
  • The ScreenshotInputSchema uses Zod for runtime input validation and adds cross-field refinements that check the parameters each mode requires: x, y, width, and height for 'region', and window_title for 'window'.
    /** Full runtime validation schema with cross-field refinements. */
    const ScreenshotInputSchema = ScreenshotBaseSchema.refine(
      (data) => {
        if (data.mode === "region") {
          return (
            data.x !== undefined &&
            data.y !== undefined &&
            data.width !== undefined &&
            data.height !== undefined
          );
        }
        return true;
      },
      { message: 'x, y, width, and height are required when mode is "region"' },
    ).refine(
      (data) => {
        if (data.mode === "window") {
          return data.window_title !== undefined;
        }
        return true;
      },
      { message: 'window_title is required when mode is "window"' },
    );
  • The screenshot tool definition is exported in screenshotToolDefinitions, which is used to register the tool with the MCP server.
    export const screenshotToolDefinitions: Tool[] = [
      {
        name: "screenshot",
        description:
          "Capture a screenshot of the macOS screen. Supports full screen, a rectangular region, or a specific window by title. Returns a base64-encoded image with dimension metadata. Do not narrate visual observations or coordinate calculations. Brief task progress updates are acceptable.",
        inputSchema: zodToToolInputSchema(ScreenshotBaseSchema),
        annotations: {
          readOnlyHint: true,
          destructiveHint: false,
        },
      },
    ];
  • The screenshotToolHandlers object maps the 'screenshot' tool name to its handler function, wrapping it in an enqueue function to manage execution flow.
    export const screenshotToolHandlers: Record<
      string,
      (args: Record<string, unknown>) => Promise<CallToolResult>
    > = {
      screenshot: (args) => enqueue(() => handleScreenshot(args)),
    };
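
The coordinate hint built in handleScreenshot states the mapping `screen_x = originX + image_x * scaleX` (and likewise for y). A minimal sketch of applying that formula on the client side follows. The field names match the `result` object in the handler, but the `CoordinateMapping` interface and both helper functions are assumptions made for illustration.

```typescript
// Assumed shape; field names follow `result` in handleScreenshot.
interface CoordinateMapping {
  originX: number;
  originY: number;
  scaleX: number;
  scaleY: number;
}

interface Point {
  x: number;
  y: number;
}

// Convert a pixel position in the returned image to screen coordinates,
// using the formula from the coordinate hint.
function imageToScreen(m: CoordinateMapping, imageX: number, imageY: number): Point {
  return {
    x: m.originX + imageX * m.scaleX,
    y: m.originY + imageY * m.scaleY,
  };
}

// Inverse mapping: locate a screen position inside the returned image.
function screenToImage(m: CoordinateMapping, screenX: number, screenY: number): Point {
  return {
    x: (screenX - m.originX) / m.scaleX,
    y: (screenY - m.originY) / m.scaleY,
  };
}

// Mirrors the handler's identity check: no conversion needed in this case.
function isIdentity(m: CoordinateMapping): boolean {
  return m.scaleX === 1 && m.scaleY === 1 && m.originX === 0 && m.originY === 0;
}
```

For a region captured at origin (100, 50) and returned at half resolution (scale 2), image pixel (10, 20) corresponds to screen position (120, 90).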
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true and destructiveHint=false, establishing this as a safe read operation. The description adds valuable behavioral context beyond annotations: it specifies the return format ('base64-encoded image with dimension metadata') and provides important usage constraints about narration and progress updates. No contradictions with annotations exist.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is perfectly front-loaded with the core purpose, followed by mode options, return format, and behavioral constraints. Every sentence earns its place with zero wasted words, making it highly efficient while remaining comprehensive.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a tool with 9 parameters, no output schema, and read-only annotations, the description provides excellent context about what the tool does, behavioral constraints, and return format. The main gap is lack of explicit output schema documentation, but the description compensates well by specifying the return format and metadata.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
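
Because the tool publishes no output schema, a client has to read the dimensions and coordinate mapping from the text item of the response. The sketch below parses the two text formats produced by the handler shown in the Implementation Reference. The `ScreenshotMetadata` interface and `parseScreenshotMetadata` function are hypothetical, and the parser assumes numbers are printed in plain decimal notation.

```typescript
// Hypothetical structured form of the text item returned by the tool.
interface ScreenshotMetadata {
  width: number;
  height: number;
  originX: number;
  originY: number;
  scaleX: number;
  scaleY: number;
}

// Parse "Image: WxH" plus the coordinate mapping line.
// Returns null if the text does not match either format used by the handler.
function parseScreenshotMetadata(text: string): ScreenshotMetadata | null {
  const size = /^Image: (\d+)x(\d+)$/m.exec(text);
  if (!size) return null;
  const width = Number(size[1]);
  const height = Number(size[2]);

  // Plain decimal number, optionally negative (assumption about formatting).
  const num = "(-?\\d+(?:\\.\\d+)?)";
  const mapping = new RegExp(
    `screen_x = ${num} \\+ image_x \\* ${num}, screen_y = ${num} \\+ image_y \\* ${num}`,
  ).exec(text);

  if (mapping) {
    return {
      width,
      height,
      originX: Number(mapping[1]),
      scaleX: Number(mapping[2]),
      originY: Number(mapping[3]),
      scaleY: Number(mapping[4]),
    };
  }

  // Identity case: the handler reports a 1:1 mapping instead of a formula.
  if (text.includes("screen = image pixel")) {
    return { width, height, originX: 0, originY: 0, scaleX: 1, scaleY: 1 };
  }
  return null;
}
```

Parsing free text is brittle if the server changes its wording, which is one reason an explicit output schema would be the more robust option.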

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 100% schema description coverage, the input schema already thoroughly documents all 9 parameters. The description mentions the three capture modes but doesn't add significant parameter semantics beyond what's in the schema. It meets the baseline for high schema coverage without compensating with extra parameter insights.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
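
The parameter relationships that the schema descriptions mention can be spelled out as a pre-flight check on the client. This is a dependency-free sketch, not the server's Zod schema: `RawScreenshotArgs` and `validateScreenshotArgs` are hypothetical names, the first two messages are copied from the refinements in the Implementation Reference, and the `max_dimension` message is invented here from the "0 or 256–4096" rule in the schema table.

```typescript
// Hypothetical loose input type: everything optional except mode.
interface RawScreenshotArgs {
  mode: "full" | "region" | "window";
  x?: number;
  y?: number;
  width?: number;
  height?: number;
  window_title?: string;
  max_dimension?: number;
}

// Return a list of human-readable problems; an empty list means the
// cross-field rules described in the schema are satisfied.
function validateScreenshotArgs(args: RawScreenshotArgs): string[] {
  const errors: string[] = [];

  // Region mode needs all four rectangle fields.
  if (
    args.mode === "region" &&
    [args.x, args.y, args.width, args.height].some((v) => v === undefined)
  ) {
    errors.push('x, y, width, and height are required when mode is "region"');
  }

  // Window mode needs a title to search for.
  if (args.mode === "window" && args.window_title === undefined) {
    errors.push('window_title is required when mode is "window"');
  }

  // 0 disables resizing; any other value must fall in the documented range.
  const maxDimension = args.max_dimension ?? 0;
  if (maxDimension !== 0 && (maxDimension < 256 || maxDimension > 4096)) {
    errors.push("max_dimension must be 0 or between 256 and 4096"); // assumed wording
  }

  return errors;
}
```

Checking these rules before the call avoids a round trip that would only come back as a validation error.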

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the specific action ('Capture a screenshot') and resource ('macOS screen'), with precise scope details ('full screen, a rectangular region, or a specific window by title'). It effectively distinguishes this tool from siblings like 'get_screen_info' or 'get_ui_elements' by focusing on image capture rather than information retrieval.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear context about when to use this tool (for capturing screenshots with various modes) and includes behavioral guidance ('Do not narrate visual observations or coordinate calculations. Brief task progress updates are acceptable.'). However, it doesn't explicitly mention when NOT to use it or name specific alternatives among sibling tools.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/antbotlab/mac-use-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.