macos-vision-mcp
Server Configuration
Describes the environment variables required to run the server.
| Name | Required | Description | Default |
|---|---|---|---|
| No arguments | | | |
Capabilities
Features and capabilities supported by this server
| Capability | Details |
|---|---|
| tools | { "listChanged": true } |
| resources | { "listChanged": true } |
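Both capabilities advertise `listChanged`, which in the MCP specification means the server may send `notifications/tools/list_changed` or `notifications/resources/list_changed`. The sketch below shows one way a client might react by re-issuing the matching list request. The helper name `refresh_request` and the `REFRESH_FOR` table are illustrative and not part of this server.

```python
# Maps MCP list-changed notifications to the list request a client would re-issue.
REFRESH_FOR = {
    "notifications/tools/list_changed": "tools/list",
    "notifications/resources/list_changed": "resources/list",
}


def refresh_request(notification, capabilities, next_id):
    """Return a JSON-RPC list request for a list-changed notification, or None.

    `capabilities` is the server capabilities object shown in the table above.
    """
    list_method = REFRESH_FOR.get(notification.get("method"))
    if list_method is None:
        return None  # not a list-changed notification
    capability = list_method.split("/")[0]  # "tools" or "resources"
    if not capabilities.get(capability, {}).get("listChanged"):
        return None  # server never advertised this notification
    return {"jsonrpc": "2.0", "id": next_id, "method": list_method}


capabilities = {"tools": {"listChanged": True}, "resources": {"listChanged": True}}
print(refresh_request({"method": "notifications/tools/list_changed"}, capabilities, 7))
```

Most MCP client libraries handle this refresh internally, so a hand-written version like this is mainly useful for understanding the flow.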
Tools
Functions exposed to the LLM to take actions
| Name | Description |
|---|---|
| ocr_image | Extract text from a local image or PDF file using Apple Vision OCR (offline, no API key needed). USE WHEN: The user provides a local file path to an image, screenshot, scanned document, or PDF and wants to extract the text from it. DO NOT USE for: images hosted on URLs (download first), non-macOS systems, or when the user wants face/barcode detection (use the dedicated tools). Supported formats: jpg, jpeg, png, heic, heif, tiff, bmp, pdf. Parameters: path - absolute or relative path to the image/PDF file. format - "text" returns a single plain-text string (default); "blocks" returns JSON { pages: [{ page, paragraphs, textBlocks }] } with reading-order paragraphs and per-block bounding boxes. Each textBlock carries lineId, paragraphId, confidence, and page-local bbox (0–1). PDFs return one entry per page. start_page - PDFs only; 1-based index of the first page to OCR (default 1). Ignored for images. A start_page past the end returns an empty result. max_pages - PDFs only; maximum number of pages to OCR from start_page (default: all). Ignored for images. Returns: extracted text as a string (format="text") or a JSON document with per-page paragraphs and text blocks (format="blocks"). |
| detect_faces | Detect human faces in a local image file using Apple Vision (offline, no API key needed). USE WHEN: The user wants to know how many faces are in a local image, or needs their positions. DO NOT USE for: text extraction (use ocr_image), barcode reading (use detect_barcodes). Returns: JSON with the total face count and an array of face positions expressed as percentages of the image dimensions (top, left, width, height). |
| detect_barcodes | Detect and decode barcodes or QR codes in a local image file using Apple Vision (offline, no API key needed). USE WHEN: The user wants to read a QR code, barcode, EAN, UPC, Code128, PDF417, Aztec, DataMatrix or other 1D/2D code from a local file. DO NOT USE for: text extraction (use ocr_image), face detection (use detect_faces). Supported symbologies: QR, EAN-8, EAN-13, UPC-E, Code39, Code93, Code128, ITF, PDF417, Aztec, DataMatrix, GS1DataBar and more. Returns: JSON array of detected codes, each with its decoded value and symbology type. |
| detect_document | Detect the boundary of a document in a local image using Apple Vision (offline, no API key needed). USE WHEN: The user has a photo of a piece of paper, a receipt, a card, an ID, or any rectangular document and wants the four corner points, typically as a hint for cropping, deskewing, or straightening the image before further OCR. DO NOT USE for: reading the document text (use ocr_image), classifying the image (use classify_image), or analyzing a PDF (PDFs are already rectangular pages). Returns: JSON with the four corner points of the detected document (topLeft, topRight, bottomLeft, bottomRight), each as { x, y } in 0–1 image coordinates, plus a confidence score. Returns { "detected": false } if no document is found. |
| classify_image | Classify the content of a local image into categories using Apple Vision (offline, no API key needed). USE WHEN: The user wants to know what is depicted in an image: objects, scenes, activities, animals, food, etc. Works with 1000+ categories and returns confidence scores. DO NOT USE for: text extraction (use ocr_image), face/barcode detection (dedicated tools), images that need detailed visual description (use the model's built-in vision). Returns: JSON array of classification labels sorted by confidence (highest first), each with a label name and confidence score (0–1). |
| analyze_document | Run a full analysis pipeline on a local image or PDF and return structured JSON for document reconstruction: OCR (with line/paragraph grouping in reading order), face detection, barcode/QR detection, and rectangle detection, all in parallel, fully offline, no API key needed. USE WHEN: The user wants the model to reconstruct a document (invoices, scanned reports, contracts, IDs, receipts, mixed-content scans) into Markdown, HTML, DOCX, or any other format. Returns enough structure (paragraphs + raw text blocks with bounding boxes) that the model can render the output in whatever format the user asks for. DO NOT USE when: the user needs only one capability (use the dedicated tool; it will be faster). Returns: JSON with this shape: { "source": { "path", "pageCount", "isPdf" }, "pages": [ { "page": 0, "paragraphs": [{ "paragraphId", "lineIds", "text" }, ...], "textBlocks": [{ "text", "lineId", "paragraphId", "confidence", "bbox": { "x","y","width","height" } }, ...], "faces": [{ "x","y","width","height" }, ...], "barcodes": [{ "value","symbology","bbox" }, ...], "rectangles": [{ "confidence","bbox" }, ...] }, ... ], "summary": { "totalTextBlocks","totalParagraphs","totalFaces","totalBarcodes","totalRectangles" } }. Use paragraphs[].text as the primary surface for reading-order content. Use textBlocks[] when spatial information matters: multi-column layouts, tables, forms. PDFs return one entry per page; all coordinates are page-local 0–1. Face/barcode/rectangle detection on PDFs is best-effort (the underlying binary analyzes the PDF as a whole rather than per page). Parameters: path - absolute or relative path to the image/PDF file. start_page - PDFs only; 1-based index of the first page to analyze (default 1). Only narrows the OCR pass; face/barcode/rectangle detections are still whole-document and attached to the first returned page. Ignored for images. max_pages - PDFs only; maximum number of pages to OCR from start_page (default: all). Ignored for images. |
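To make the tool descriptions above concrete, here is a minimal Python sketch of how a client might call `analyze_document` and consume its result. It assumes the standard MCP `tools/call` JSON-RPC request shape. The file path, the sample result, and the helper names (`build_tool_call`, `reading_order_text`, `bbox_to_pixels`) are hypothetical. The sample only imitates the documented shape and is not real server output.

```python
import json


def build_tool_call(request_id, name, arguments):
    """Build a JSON-RPC 2.0 `tools/call` request as an MCP client would send it."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def reading_order_text(result):
    """Join paragraphs[].text for each page, in the order the server returned them."""
    return [
        "\n\n".join(p["text"] for p in page.get("paragraphs", []))
        for page in result.get("pages", [])
    ]


def bbox_to_pixels(bbox, width_px, height_px):
    """Scale a 0-1 bbox to pixel units.

    The description above does not say which corner is the origin,
    so this only scales values and does not flip the y axis.
    """
    return {
        "x": round(bbox["x"] * width_px),
        "y": round(bbox["y"] * height_px),
        "width": round(bbox["width"] * width_px),
        "height": round(bbox["height"] * height_px),
    }


# Hypothetical request: analyze the first two pages of a local PDF.
request = build_tool_call(
    1, "analyze_document",
    {"path": "/tmp/invoice.pdf", "start_page": 1, "max_pages": 2},
)
print(json.dumps(request, indent=2))

# Hand-written sample following the documented shape (not real output).
sample = {
    "pages": [{
        "page": 0,
        "paragraphs": [
            {"paragraphId": 0, "lineIds": [0], "text": "Invoice 42"},
            {"paragraphId": 1, "lineIds": [1], "text": "Total due: 10 EUR"},
        ],
        "textBlocks": [{
            "text": "Invoice 42", "lineId": 0, "paragraphId": 0, "confidence": 1.0,
            "bbox": {"x": 0.25, "y": 0.5, "width": 0.5, "height": 0.125},
        }],
    }]
}
print(reading_order_text(sample))
print(bbox_to_pixels(sample["pages"][0]["textBlocks"][0]["bbox"], 1000, 2000))
```

Reading `paragraphs[].text` first and using `textBlocks[]` only when layout matters follows the guidance in the tool description. The pixel conversion is only needed when drawing or cropping regions.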
Prompts
Interactive templates invoked by user choice
| Name | Description |
|---|---|
| No prompts | |
Resources
Contextual data attached and managed by the client
| Name | Description |
|---|---|
| macos-vision-capabilities | |
MCP directory API
We provide all the information about MCP servers via our MCP API.
curl -X GET 'https://glama.ai/api/mcp/v1/servers/woladi/macos-vision-mcp'
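The same request can be made from Python with only the standard library. This is a rough equivalent of the curl command above. It assumes the endpoint returns JSON, which this page does not state, and the `Accept` header and function names are illustrative.

```python
import json
import urllib.request

SERVER_URL = "https://glama.ai/api/mcp/v1/servers/woladi/macos-vision-mcp"


def build_request(url=SERVER_URL):
    """Build a GET request for the directory entry without sending it."""
    # The Accept header is an assumption; the curl example sends none.
    return urllib.request.Request(url, method="GET", headers={"Accept": "application/json"})


def fetch_server_info(url=SERVER_URL, timeout=10):
    """Send the request and decode the body as JSON (assumed response format)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    print(json.dumps(fetch_server_info(), indent=2))
```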
If you have feedback or need assistance with the MCP directory API, please join our Discord server.