---
name: macos-ocr-helper
description: This skill should be used when a user asks to extract text from images or PDFs, especially on macOS. Use it for OCR (Optical Character Recognition) tasks including converting images to text, PDF to Markdown, extracting tables to CSV, processing receipts/invoices, extracting code from screenshots, or analyzing scanned documents. It provides two local tools: read_image_text for pure text extraction and read_image_layout for structured layout information.
---
# macOS OCR Helper
## Overview
This skill provides direct access to macOS's native Vision framework for offline, high-accuracy OCR (Optical Character Recognition). It extracts text from images and PDFs, supports Chinese, English, and mixed-language text, merges paragraphs automatically, optimizes table output, and returns structured layout information suited to LLM consumption. **No MCP server is required; it works directly on macOS.**
## Quick Start
To use macOS OCR, call one of the two available local tools:
- **`read_image_text`**: Extract pure text from images or PDFs with automatic paragraph merging
- **`read_image_layout`**: Extract structured layout information including text blocks, bounding boxes, and semantic information
All processing happens locally on macOS with no cloud uploads or API keys required.
## Tool Selection Guide
### Use `read_image_text` when:
- You only need plain text content, without layout information
- You want to quickly extract readable text from documents
- You plan to perform text search, content analysis, or summarization
- You will use the OCR results for simple string manipulation
**Example**:
```
Call macos-ocr's read_image_text to read the following file:
image_path=/absolute/path/image.png
Output only the OCR text without explanations or formatting.
```
### Use `read_image_layout` when:
- You need to preserve or reconstruct the document layout
- You are processing complex documents with tables or multi-column layouts
- The LLM must reconstruct the document structure
- You need to locate text positions within the image
- You want to convert OCR results to structured formats (CSV, Markdown, etc.)
**Example**:
```
Call macos-ocr's read_image_layout to read the following file:
image_path=/absolute/path/document.pdf
Convert the returned blocks to Markdown format based on bbox coordinates.
```
## Common Use Cases
### Extract Plain Text from Images
**Scenario**: Quickly copy text from a screenshot or other image
Call `read_image_text` with the image path. The tool automatically handles paragraph merging and table optimization.
**Output**: Pure text string
### Convert Images/PDFs to Markdown
**Scenario**: Transform a scanned document into editable Markdown
1. Call `read_image_layout` to get structured blocks with layout information
2. Process blocks using bbox coordinates to reconstruct layout
3. Apply appropriate Markdown syntax for headings, paragraphs, lists, and tables
4. Use page separators for multi-page documents (e.g., `--- Page N ---`)
**Output**: Markdown document with preserved layout
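The ordering and page-separator steps above can be sketched in Python. This is a minimal illustration, not the skill's own implementation: the helper names `reading_order` and `pages_to_markdown` are hypothetical, the grouping of blocks into a list of pages is an assumption (the documented output is a flat list of blocks), and the code assumes `bbox.y` grows downward from the top edge of the page. Heading, list, and table detection are left out.

```python
from typing import Dict, List

Block = Dict  # one entry of the read_image_layout result: {"text": ..., "bbox": {...}}


def reading_order(blocks: List[Block]) -> List[Block]:
    """Sort blocks top-to-bottom, then left-to-right.

    Assumption: bbox "y" is measured from the top edge. If the tool reports a
    bottom-left origin instead, sort on -y.
    """
    return sorted(blocks, key=lambda b: (b["bbox"]["y"], b["bbox"]["x"]))


def pages_to_markdown(pages: List[List[Block]]) -> str:
    """Emit each page's blocks as paragraphs, with a separator per page."""
    parts = []
    for number, blocks in enumerate(pages, start=1):
        if len(pages) > 1:
            parts.append(f"--- Page {number} ---")
        parts.extend(b["text"].strip() for b in reading_order(blocks))
    return "\n\n".join(parts)


# Hypothetical sample data in the documented block shape.
page_one = [
    {"text": "Second paragraph.", "bbox": {"x": 0.1, "y": 0.4, "w": 0.8, "h": 0.1}},
    {"text": "Title", "bbox": {"x": 0.1, "y": 0.05, "w": 0.5, "h": 0.05}},
]
page_two = [{"text": "Appendix", "bbox": {"x": 0.1, "y": 0.1, "w": 0.3, "h": 0.05}}]

markdown = pages_to_markdown([page_one, page_two])
print(markdown)
```

A plain top-to-bottom sort will interleave columns in multi-column layouts; such pages would need the blocks split by column first.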
### Extract Tables to CSV
**Scenario**: Convert a table screenshot into spreadsheet-compatible format
1. Call `read_image_layout` to get table structure
2. Identify table boundaries using bbox coordinates
3. Format cells as comma-separated values
4. Handle merged cells with appropriate placeholders or repeated values
**Output**: CSV format with table data
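One possible way to turn layout blocks into CSV is to cluster cells into rows by their vertical position and then order each row by `x`. The sketch below assumes each table cell arrives as its own block, which the tool may not guarantee; `blocks_to_csv` and the `row_tolerance` threshold are hypothetical, and merged cells are not handled.

```python
import csv
import io
from typing import Dict, List


def blocks_to_csv(cells: List[Dict], row_tolerance: float = 0.02) -> str:
    """Group cell blocks into rows by bbox y, then write them as CSV."""
    rows: List[List[Dict]] = []
    for cell in sorted(cells, key=lambda c: c["bbox"]["y"]):
        # A cell joins the current row when its y is close to the row's first cell.
        if rows and abs(cell["bbox"]["y"] - rows[-1][0]["bbox"]["y"]) <= row_tolerance:
            rows[-1].append(cell)
        else:
            rows.append([cell])

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        ordered = sorted(row, key=lambda c: c["bbox"]["x"])
        writer.writerow([c["text"] for c in ordered])
    return buffer.getvalue()


# Hypothetical cells from a two-column table.
cells = [
    {"text": "Widget, large", "bbox": {"x": 0.1, "y": 0.31, "w": 0.3, "h": 0.05}},
    {"text": "Item", "bbox": {"x": 0.1, "y": 0.10, "w": 0.3, "h": 0.05}},
    {"text": "Price", "bbox": {"x": 0.6, "y": 0.105, "w": 0.2, "h": 0.05}},
    {"text": "9.50", "bbox": {"x": 0.6, "y": 0.30, "w": 0.2, "h": 0.05}},
]
csv_text = blocks_to_csv(cells)
print(csv_text)
```

Using the `csv` module rather than joining on commas means cells that contain commas or quotes are escaped correctly.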
### Process Receipts/Invoices
**Scenario**: Extract structured information from financial documents
1. Call `read_image_text` to get text content
2. Parse text to extract fields: merchant, date, amount, tax, items
3. Output structured JSON with all extracted fields
**Output**: JSON object with receipt details:
```json
{
  "merchant": string|null,
  "date": "YYYY-MM-DD"|null,
  "currency": string|null,
  "total": number|null,
  "tax": number|null,
  "items": [...],
  "payment_method": string|null,
  "invoice_no": string|null
}
```
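As a rough illustration of the parsing step, the Python sketch below fills a few of these fields from OCR text with regular expressions. The function `parse_receipt`, the label patterns, and the "first non-empty line is the merchant" rule are all assumptions made for the example; real receipts vary widely, and in practice an LLM or a more careful parser would do this step. Fields the sketch does not attempt are left as `None` or empty.

```python
import json
import re
from typing import Optional


def parse_receipt(text: str) -> dict:
    """Extract a few receipt fields from OCR text; unknown fields stay None."""

    def find_amount(label: str) -> Optional[float]:
        # Matches lines such as "Total: $21.60"; anchoring at the line start
        # keeps "Subtotal" from matching the "Total" label.
        pattern = rf"^{label}\b\D*?(\d+(?:\.\d+)?)\s*$"
        match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
        return float(match.group(1)) if match else None

    date = re.search(r"\b(\d{4})[-/.](\d{2})[-/.](\d{2})\b", text)
    merchant = next((ln.strip() for ln in text.splitlines() if ln.strip()), None)
    return {
        "merchant": merchant,  # assumption: first non-empty line
        "date": "-".join(date.groups()) if date else None,
        "currency": None,  # not detected in this sketch
        "total": find_amount("Total"),
        "tax": find_amount("Tax"),
        "items": [],  # line items are not parsed in this sketch
        "payment_method": None,
        "invoice_no": None,
    }


# Hypothetical OCR output.
sample = "ACME Market\n2024-03-15\nCoffee 4.00\nSubtotal 20.00\nTax 1.60\nTotal: $21.60\n"
receipt = parse_receipt(sample)
print(json.dumps(receipt, indent=2))
```

Amounts with thousands separators (for example `1,234.50`) are not matched by this pattern and would come back as `None`, which fits the schema's nullable fields.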
### Extract Code from Screenshots
**Scenario**: Get executable code from terminal or code editor screenshots
1. Call `read_image_text` to get code content
2. Remove line numbers, prompts (e.g., `$`, `>>>`), and irrelevant characters
3. Fix common OCR errors: 0/O confusion, 1/l/I confusion, punctuation errors
4. Maintain original indentation
5. Wrap in Markdown code block with appropriate language identifier
**Output**: Markdown code block with corrected syntax
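Steps 2, 4, and 5 above lend themselves to a small Python sketch. The helper `clean_code` is hypothetical, and its heuristics are assumptions: a line-number gutter is stripped only when every non-blank line starts with a number, and only `$`, `>>>`, and `...` prompts are recognized. Character-level fixes from step 3 (0/O, 1/l/I) are left for manual or LLM review because they depend on context.

```python
import re
import textwrap

FENCE = "`" * 3  # built this way so the example does not nest code fences


def clean_code(ocr_text: str, language: str = "python") -> str:
    """Strip line numbers and prompts, keep indentation, wrap in a code block."""
    lines = ocr_text.splitlines()
    non_blank = [ln for ln in lines if ln.strip()]

    # Remove a line-number gutter only if every non-blank line begins with one.
    if non_blank and all(re.match(r"\s*\d+(\s|$)", ln) for ln in non_blank):
        lines = [re.sub(r"^\s*\d+", "", ln, count=1) for ln in lines]

    # Remove shell and Python REPL prompts, preserving leading indentation.
    lines = [re.sub(r"^(\s*)(?:\$|>>>|\.\.\.) ", r"\1", ln) for ln in lines]

    # dedent drops the shared left margin but keeps relative indentation.
    body = textwrap.dedent("\n".join(lines)).strip("\n")
    return f"{FENCE}{language}\n{body}\n{FENCE}"


# Hypothetical OCR output from an editor screenshot with a line-number gutter.
ocr = "10 def add(a, b):\n11     return a + b\n12\n13 print(add(1, 2))"
cleaned = clean_code(ocr)
print(cleaned)
```

The all-lines check is deliberately conservative: text whose lines legitimately begin with numbers (such as numeric data) would be misread as a gutter, so the result is worth a quick look before use.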
### Analyze Long Documents
**Scenario**: Extract structure and key points from scanned papers or contracts
1. Call `read_image_text` for full text extraction
2. Generate structured outline (H1/H2/H3 headings)
3. Identify key points (maximum 10 items)
4. List potential issues requiring human verification (unclear numbers, broken references)
**Output**: Structured outline with key information
## Best Practices
### Prompt Construction
- Always explicitly specify the tool to use (`read_image_text` or `read_image_layout`)
- Use absolute file paths
- Clearly state expected output format
- Avoid ambiguous instructions
### Quality Optimization
For code documents:
- Remove line numbers and prompts
- Fix character confusions (0/O, 1/l/I, `:`/`;`)
- Preserve indentation
- Wrap in appropriate code block
For formal documents:
- Generate structural outline
- Extract key points with original text references
- List verification notes for unclear content
### Data Format Notes
**read_image_layout output structure**:
```json
[
  {
    "text": "Corrected semantic text",
    "bbox": {
      "x": 0.0,  // Normalized x-coordinate [0-1]
      "y": 0.0,  // Normalized y-coordinate [0-1]
      "w": 0.5,  // Normalized width [0-1]
      "h": 0.2   // Normalized height [0-1]
    },
    "lines": [   // Original line information (optional)
      {
        "text": "Original line text",
        "bbox": {...}
      }
    ]
  }
]
```
All bbox coordinates are normalized values between 0 and 1, representing positions relative to image dimensions.
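To locate a block in the source image, the normalized values can be scaled by the image size. The Python sketch below is a minimal example; `bbox_to_pixels` is a hypothetical helper, and it assumes the origin is the top-left corner with `y` increasing downward. If the tool reports a bottom-left origin (as the Vision framework does natively), the vertical values would need flipping first.

```python
from typing import Dict, Tuple


def bbox_to_pixels(
    bbox: Dict[str, float], image_width: int, image_height: int
) -> Tuple[int, int, int, int]:
    """Convert a normalized bbox to (left, top, right, bottom) in pixels."""
    left = round(bbox["x"] * image_width)
    top = round(bbox["y"] * image_height)
    right = round((bbox["x"] + bbox["w"]) * image_width)
    bottom = round((bbox["y"] + bbox["h"]) * image_height)
    return left, top, right, bottom


# The example bbox from the structure above, on a 1000x800 image.
print(bbox_to_pixels({"x": 0.0, "y": 0.0, "w": 0.5, "h": 0.2}, 1000, 800))  # (0, 0, 500, 160)
```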
### Error Handling
Common OCR error types and handling strategies:
- **Character confusion**: Fix 0/O and 1/l/I mix-ups, especially in code blocks
- **Punctuation errors**: Correct mismatched quotes and bracket pairs, and `:`/`;` swaps
- **Line splitting**: Merge incorrectly split paragraphs (automatically handled by the tool)
### Performance Considerations
- For large documents, consider preprocessing (compression, cropping)
- For batch processing, clearly separate results for each file
- Request progress updates or summaries when processing multiple files
## Troubleshooting
### Poor Recognition Quality
- Check image resolution (recommended 300 DPI or higher)
- Ensure image clarity without blur
- Try increasing contrast
- Check and correct image rotation if needed
### Multi-Page PDF Issues
- Verify PDF file integrity
- Check for password protection
- Use `read_image_layout` for more detailed information
### Inaccurate Coordinates
- Verify image dimensions and orientation
- Ensure normalized coordinates are within [0, 1] range
- Use `read_image_layout` for complete structural information
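
The range check mentioned above can be automated with a short Python sketch. `out_of_range_blocks` is a hypothetical helper that operates on the documented block shape; the small `tolerance` for floating-point rounding is an assumption.

```python
from typing import Dict, List


def out_of_range_blocks(blocks: List[Dict], tolerance: float = 1e-6) -> List[int]:
    """Return indexes of blocks whose bbox falls outside the [0, 1] range."""
    bad = []
    for index, block in enumerate(blocks):
        b = block["bbox"]
        inside = (
            b["x"] >= -tolerance
            and b["y"] >= -tolerance
            and b["w"] >= 0
            and b["h"] >= 0
            and b["x"] + b["w"] <= 1 + tolerance
            and b["y"] + b["h"] <= 1 + tolerance
        )
        if not inside:
            bad.append(index)
    return bad


# Hypothetical layout result: the second block extends past the right edge.
blocks = [
    {"text": "ok", "bbox": {"x": 0.1, "y": 0.1, "w": 0.5, "h": 0.2}},
    {"text": "too wide", "bbox": {"x": 0.8, "y": 0.1, "w": 0.4, "h": 0.1}},
]
print(out_of_range_blocks(blocks))  # [1]
```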
## Resources
### references/
This skill includes reference documentation for detailed usage patterns and best practices.
**references/examples.md**:
- Six complete usage scenarios with template prompts
- Detailed examples for each common use case
- Output format specifications
**references/best_practices.md**:
- Tool selection guidelines with decision criteria
- Prompt writing recommendations
- Quality assurance strategies
- Error handling and troubleshooting guide
- Performance optimization tips
Load these reference documents when working with complex scenarios or when detailed guidance is needed.
---
## Important Notes
- All OCR processing happens locally on macOS
- No cloud uploads or API keys required
- Supports Chinese (simplified/traditional), English, and mixed text
- Built-in PDF rendering engine handles multi-page documents automatically
- Zero configuration required - works out of the box