Glama
by kazuph

capture

Capture a screenshot of the left half, right half, or full screen, run OCR on it, and return the recognized text in JSON, markdown, vertical, or horizontal format. The screenshot is saved to a dated directory in Downloads.

Instructions

Captures a screenshot of the specified region and performs OCR. Options:

  • region: 'left'/'right'/'full' (default: 'left')

  • format: 'json'/'markdown'/'vertical'/'horizontal' (default: 'markdown')

The screenshot is saved to a dated directory in Downloads.
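As a rough illustration of the options above, the sketch below builds the JSON-RPC `tools/call` message an MCP client would send to invoke `capture`. The helper `buildCaptureRequest` and the request ids are hypothetical names introduced here; only the tool name, the two option names, and their allowed values come from this page.

```typescript
// Sketch of a JSON-RPC "tools/call" request for the capture tool.
// buildCaptureRequest is a hypothetical helper, not part of the server.
type Region = "left" | "right" | "full";
type Format = "json" | "markdown" | "vertical" | "horizontal";

interface CaptureCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: {
    name: "capture";
    arguments: { region?: Region; format?: Format };
  };
}

function buildCaptureRequest(
  id: number,
  args: { region?: Region; format?: Format } = {},
): CaptureCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "capture", arguments: args },
  };
}

// Omitted options fall back to the server-side defaults ('left', 'markdown').
const defaultsRequest = buildCaptureRequest(1);

// Capture the whole screen and ask for the OCR text in 'horizontal' format.
const fullRequest = buildCaptureRequest(2, {
  region: "full",
  format: "horizontal",
});

console.log(JSON.stringify(defaultsRequest));
console.log(JSON.stringify(fullRequest, null, 2));
```

Leaving `arguments` empty is enough for the common case, since both options have defaults.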

Input Schema

Name    Required  Description  Default
format  No        -            markdown
region  No        -            left

Implementation Reference

  • Main handler for CallToolRequestSchema implementing the 'capture' tool logic: validates input, captures screenshot using takeScreenshot, performs OCR with performOCR, and returns the result or error.
    server.setRequestHandler(CallToolRequestSchema, async (request) => {
    	try {
    		const { name, arguments: args } = request.params;
    
    		if (name !== "capture") {
    			throw new Error(`Unknown tool: ${name}`);
    		}
    
    		const parsed = ScreenshotArgsSchema.safeParse(args);
    		if (!parsed.success) {
    			throw new Error(`Invalid arguments: ${parsed.error}`);
    		}
    
    		console.error(
    			`Debug: Starting screenshot capture for region: ${parsed.data.region}, format: ${parsed.data.format}`,
    		);
    		const imagePath = await takeScreenshot(parsed.data.region);
    		console.error(`Debug: Screenshot saved to: ${imagePath}`);
    
    		const ocrText = await performOCR(imagePath, parsed.data.format);
    		console.error("Debug: OCR completed");
    
    		return {
    			content: [
    				{
    					type: "text",
    					text: `Screenshot saved to: ${imagePath}\n\nOCR Results:\n${ocrText}`,
    				},
    			],
    		};
    	} catch (error) {
    		console.error("Error:", error);
    		return {
    			content: [
    				{
    					type: "text",
    					text: `Error: ${error instanceof Error ? error.message : String(error)}`,
    				},
    			],
    			isError: true,
    		};
    	}
    });
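The handler above returns a single text item in both the success and error cases, so a caller has to pull the path and OCR text back out of that string. The following is a minimal client-side sketch of one way to do that, assuming the text layout shown in the handler; `readCaptureResult`, `ToolResult`, and `CaptureOutcome` are hypothetical names, not part of the server or the MCP SDK.

```typescript
// Minimal result types mirroring the shape the handler returns.
interface ToolTextContent {
  type: "text";
  text: string;
}
interface ToolResult {
  content: ToolTextContent[];
  isError?: boolean;
}

type CaptureOutcome =
  | { ok: true; imagePath: string; ocrText: string }
  | { ok: false; message: string };

// Hypothetical helper: split the handler's text back into its parts.
function readCaptureResult(result: ToolResult): CaptureOutcome {
  const text = result.content.map((item) => item.text).join("\n");
  if (result.isError) {
    // Error results are formatted as "Error: <message>".
    return { ok: false, message: text.replace(/^Error: /, "") };
  }
  const match = /^Screenshot saved to: (.*)\n\nOCR Results:\n([\s\S]*)$/.exec(text);
  if (!match) {
    return { ok: false, message: `Unexpected result text: ${text}` };
  }
  return { ok: true, imagePath: match[1] ?? "", ocrText: match[2] ?? "" };
}

const sampleOutcome = readCaptureResult({
  content: [
    {
      type: "text",
      text: "Screenshot saved to: /tmp/shot.png\n\nOCR Results:\nhello",
    },
  ],
});
console.log(sampleOutcome);
```

Checking `isError` before parsing matters because the handler reports failures as a normal result rather than by throwing.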
  • index.ts:227-240 (registration)
    Registration of the 'capture' tool in the ListToolsRequestSchema handler, including name, description, and input schema.
    server.setRequestHandler(ListToolsRequestSchema, async () => ({
    	tools: [
    		{
    			name: "capture",
    			description:
    				"Captures a screenshot of the specified region and performs OCR. " +
    				"Options:\n" +
    				"- region: 'left'/'right'/'full' (default: 'left')\n" +
    				"- format: 'json'/'markdown'/'vertical'/'horizontal' (default: 'markdown')\n" +
    				"The screenshot is saved to a dated directory in Downloads.",
    			inputSchema: zodToJsonSchema(ScreenshotArgsSchema) as ToolInput,
    		},
    	],
    }));
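For orientation, the sketch below writes out by hand roughly what the registered `inputSchema` might look like as JSON Schema, and derives the rows of the Input Schema table from it. The exact output of `zodToJsonSchema` is not shown on this page, so the object here is an assumption; `captureInputSchema` and `describeParameters` are hypothetical names.

```typescript
// Hand-written approximation of the tool's input schema (an assumption,
// not captured output from zodToJsonSchema).
interface EnumProperty {
  type: "string";
  enum: string[];
  default: string;
}
interface ObjectSchema {
  type: "object";
  properties: Record<string, EnumProperty>;
  required: string[];
  additionalProperties: boolean;
}

const captureInputSchema: ObjectSchema = {
  type: "object",
  properties: {
    region: {
      type: "string",
      enum: ["left", "right", "full"],
      default: "left",
    },
    format: {
      type: "string",
      enum: ["json", "markdown", "vertical", "horizontal"],
      default: "markdown",
    },
  },
  // Both fields have defaults, so neither is required. A real generator
  // may omit this key entirely when it is empty.
  required: [],
  additionalProperties: false,
};

// Summarize each parameter: whether it is required, its default, its values.
function describeParameters(schema: ObjectSchema): string[] {
  return Object.entries(schema.properties).map(([name, prop]) => {
    const required = schema.required.includes(name) ? "Yes" : "No";
    return `${name}: required=${required}, default=${prop.default}, allowed=${prop.enum.join("/")}`;
  });
}

for (const row of describeParameters(captureInputSchema)) {
  console.log(row);
}
```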
  • Zod schema defining input parameters for the 'capture' tool: region (left/right/full) and format (json/markdown/vertical/horizontal).
    const ScreenshotArgsSchema = z.object({
    	region: z.enum(["left", "right", "full"]).default("left"),
    	format: z
    		.enum(["json", "markdown", "vertical", "horizontal"])
    		.default("markdown"),
    });
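To make the effect of the defaults concrete, here is a dependency-free sketch that mimics what the schema above accepts and how missing fields are filled in. It does not use Zod; `parseCaptureArgs` and `pickEnum` are hypothetical helpers, and the error messages are illustrative rather than Zod's own.

```typescript
const REGIONS = ["left", "right", "full"] as const;
const FORMATS = ["json", "markdown", "vertical", "horizontal"] as const;
type Region = (typeof REGIONS)[number];
type Format = (typeof FORMATS)[number];

interface CaptureArgs {
  region: Region;
  format: Format;
}
type ParseResult =
  | { success: true; data: CaptureArgs }
  | { success: false; error: string };

// Return the fallback when the value is missing, otherwise require that it
// is one of the allowed strings.
function pickEnum<T extends string>(
  allowed: readonly T[],
  value: unknown,
  fallback: T,
  field: string,
): T {
  if (value === undefined) return fallback;
  const found = allowed.find((candidate) => candidate === value);
  if (found === undefined) {
    throw new Error(`${field} must be one of: ${allowed.join(", ")}`);
  }
  return found;
}

// Hypothetical stand-in for ScreenshotArgsSchema.safeParse.
function parseCaptureArgs(input: unknown): ParseResult {
  if (typeof input !== "object" || input === null) {
    return { success: false, error: "arguments must be an object" };
  }
  const raw = input as Record<string, unknown>;
  try {
    return {
      success: true,
      data: {
        region: pickEnum(REGIONS, raw["region"], "left", "region"),
        format: pickEnum(FORMATS, raw["format"], "markdown", "format"),
      },
    };
  } catch (error) {
    return {
      success: false,
      error: error instanceof Error ? error.message : String(error),
    };
  }
}

console.log(parseCaptureArgs({})); // both defaults applied
console.log(parseCaptureArgs({ region: "top" })); // rejected value
```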
  • Helper that captures the full screen, crops the image to the left or right half when requested, and saves the result to the dated Downloads directory.
    async function takeScreenshot(
    	region: z.infer<typeof ScreenshotArgsSchema>["region"],
    ): Promise<string> {
    	const dateDir = await ensureDateDirectory();
    	const timestamp = new Date().toISOString().replace(/[:.]/g, "-");
    	const filename = `screenshot-${region}-${timestamp}.png`;
    	const filepath = join(dateDir, filename);
    
    	try {
    		// Get main display dimensions
    		const { width, height } = await getDisplayDimensions();
    		console.error(
    			`Debug: Display dimensions - width: ${width}, height: ${height}`,
    		);
    
    		// Always capture full screen
    		await execFileAsync("screencapture", [filepath]);
    
    		// Process image if needed
    		if (region !== "full") {
    			const tempFilePath = `${filepath}.temp.png`;
    			await sharp(filepath).toFile(tempFilePath);
    
    			const metadata = await sharp(tempFilePath).metadata();
    			if (!metadata.width || !metadata.height) {
    				throw new Error("Failed to get image dimensions");
    			}
    
    			const halfWidth = Math.floor(metadata.width / 2);
    
    			// Extract left or right half
    			if (region === "left") {
    				await sharp(tempFilePath)
    					.extract({
    						left: 0,
    						top: 0,
    						width: halfWidth,
    						height: metadata.height,
    					})
    					.toFile(filepath);
    			} else if (region === "right") {
    				await sharp(tempFilePath)
    					.extract({
    						left: halfWidth,
    						top: 0,
    						width: halfWidth,
    						height: metadata.height,
    					})
    					.toFile(filepath);
    			}
    
    			// Remove temporary file
    			await execFileAsync("rm", [tempFilePath]);
    		}
    
    		return filepath;
    	} catch (error) {
    		throw new Error(`Screenshot capture failed: ${error}`);
    	}
    }
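The filename and crop arithmetic in `takeScreenshot` can be isolated into two pure functions, which makes the behavior easier to check without a display or the `sharp` library. This is a sketch only: `screenshotFilename` and `cropBox` are hypothetical names, though the expressions inside them are taken from the helper above.

```typescript
type Region = "left" | "right" | "full";

interface CropBox {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Same naming rule as the helper: ISO timestamp with ':' and '.' replaced
// so the name is safe to use as a file name.
function screenshotFilename(region: Region, now: Date): string {
  const timestamp = now.toISOString().replace(/[:.]/g, "-");
  return `screenshot-${region}-${timestamp}.png`;
}

// Rectangle to extract for a region, or null when no crop is needed.
// Dimensions are those of the captured image, not of the display.
function cropBox(
  region: Region,
  imageWidth: number,
  imageHeight: number,
): CropBox | null {
  if (region === "full") return null;
  const halfWidth = Math.floor(imageWidth / 2);
  return {
    left: region === "left" ? 0 : halfWidth,
    top: 0,
    width: halfWidth,
    height: imageHeight,
  };
}

console.log(screenshotFilename("left", new Date("2024-01-02T03:04:05.678Z")));
console.log(cropBox("right", 2880, 1800));
```

One consequence of using `Math.floor` for both halves: when the image width is odd, the rightmost pixel column falls outside both the left and the right crop.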
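  • Helper that runs OCR on the screenshot. It first posts the image to the OCR API, passing the format as a query parameter, and falls back to Tesseract.js if the API call fails; on the fallback path the text is formatted locally according to the requested format.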
    async function performOCR(
    	imagePath: string,
    	format = "markdown",
    ): Promise<string> {
    	try {
    		const formData = new FormData();
    		formData.append("file", createReadStream(imagePath), {
    			filename: imagePath.split("/").pop(),
    		});
    
    		const response = await axios.post(
    			`${API_CONFIG.OCR_API_URL}${API_CONFIG.OCR_API_PATH}?format=${format}`,
    			formData,
    			{
    				headers: formData.getHeaders(),
    			},
    		);
    
    		if (response.status !== 200) {
    			throw new Error(`OCR API returned status ${response.status}`);
    		}
    
    		// Remove <br> tags
    		const content = response.data.content.replace(/<br\s*\/?>/g, "");
    		return content;
    	} catch (error) {
    		console.error("OCR API error, falling back to Tesseract.js:", error);
    
    		try {
    			// Configure worker for both Japanese and English recognition
    			console.error("OCR: Creating worker for Japanese and English...");
    			const worker = await createWorker("jpn+eng");
    			console.error("OCR: Starting recognition...");
    
    			const {
    				data: { text },
    			} = await worker.recognize(imagePath);
    			console.error("OCR: Recognition completed");
    			await worker.terminate();
    
    			// Format output according to specified format
    			let formattedText = text.trim();
    			switch (format) {
    				case "json":
    					formattedText = JSON.stringify({ content: text.trim() });
    					break;
    				case "markdown":
    					formattedText = `\`\`\`\n${text.trim()}\n\`\`\``;
    					break;
    				case "vertical":
    					formattedText = text.trim().split("\n").join("\n\n");
    					break;
    				case "horizontal":
    					formattedText = text.trim().replace(/\n/g, " ");
    					break;
    			}
    
    			return formattedText;
    		} catch (tesseractError) {
    			console.error("Tesseract.js error details:", tesseractError);
    			throw new Error(
    				`Both OCR API and Tesseract.js failed. API error: ${error instanceof Error ? error.message : String(error)}. Tesseract error: ${tesseractError instanceof Error ? tesseractError.message : String(tesseractError)}`,
    			);
    		}
    	}
    }
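The local formatting used on the Tesseract.js fallback path is easy to see in isolation. The sketch below mirrors the `switch` in `performOCR` as a pure function; `formatOcrText` is a hypothetical name, and the sample text is made up. It says nothing about what the OCR API itself returns for each format, which this page does not show.

```typescript
type Format = "json" | "markdown" | "vertical" | "horizontal";

// Mirror of the fallback formatting: trim the text, then shape it per format.
function formatOcrText(text: string, format: Format): string {
  const trimmed = text.trim();
  // Built with repeat() so this example contains no literal fence line.
  const fence = "`".repeat(3);
  switch (format) {
    case "json":
      // Wrap the text in an object with a single "content" field.
      return JSON.stringify({ content: trimmed });
    case "markdown":
      // Wrap the text in a fenced block.
      return `${fence}\n${trimmed}\n${fence}`;
    case "vertical":
      // Put a blank line between recognized lines.
      return trimmed.split("\n").join("\n\n");
    case "horizontal":
      // Join all recognized lines into one line.
      return trimmed.replace(/\n/g, " ");
  }
}

const recognized = " first line\nsecond line \n";
const formats: Format[] = ["json", "markdown", "vertical", "horizontal"];
for (const format of formats) {
  console.log(`${format}:`);
  console.log(formatOcrText(recognized, format));
}
```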
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It effectively describes key behaviors: the tool captures screenshots, performs OCR, saves files to a dated directory in Downloads, and provides default values for parameters. However, it misses details like error handling or performance characteristics.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is front-loaded with the core purpose, followed by a bulleted list of options for clarity. Every sentence earns its place by providing essential information without redundancy, making it efficient and well-structured.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's moderate complexity (2 parameters, no output schema, no annotations), the description is mostly complete. It covers purpose, parameters, and behavioral aspects like file saving. A minor gap is the lack of detail on OCR output or error cases, but overall it suffices for informed use.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The schema has 0% description coverage, so the description must compensate. It adds meaningful context by explaining what 'region' and 'format' control, including their allowed values and defaults. This goes beyond the schema's enum lists, though it could elaborate on the effects of each format option.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose with specific verbs ('captures a screenshot', 'performs OCR') and resources ('specified region'), making it immediately understandable. It distinguishes the tool's dual functionality of screenshot capture and OCR processing, which is comprehensive for its domain.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage through the mention of options like region and format, but does not explicitly state when to use this tool versus alternatives. Since there are no sibling tools, this is less critical, but it lacks guidance on scenarios or prerequisites for effective use.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.


MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/kazuph/mcp-screenshot'
