OCR a scanned/image-only PDF (Tesseract.js)
obsidian_ocr_pdf
Extracts text from scanned or image-only PDFs using Tesseract OCR, returning per-page text and confidence scores. Supports multilingual OCR and optional page ranges.
Instructions
Runs Tesseract OCR over each page of a scanned or image-only PDF and returns per-page text, per-page confidence, and a mean confidence, in the same shape as obsidian_read_pdf. Use this when obsidian_read_pdf returns has_text: false, which is typical for scans, photographed paper, and image-only PDFs. Read-only.

Multilingual via lang (default 'eng'; join several packs with '+', e.g. 'eng+rus'). pages optionally restricts OCR to a page range, and scale sets the render scale (a DPI multiplier; default 2, roughly 150 DPI; capped at 4). Expect about 1-2 s per page on an M1 CPU.

Powered by Tesseract.js (Apache-2.0), with @napi-rs/canvas for PDF→bitmap rendering. Language trained-data must be pre-installed via enquire-mcp install-ocr-lang <code>: serve mode makes zero outbound network calls, so a language missing from the local cache fails closed with an install hint rather than downloading at runtime. Both packages are gated behind optionalDependencies, so the markdown-only path stays zero-cost.
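The fallback described above (try obsidian_read_pdf first, run OCR only when it reports has_text: false) could be sketched roughly as follows. This is a minimal illustration, not part of the tool: `call_tool` is a hypothetical stand-in for however your MCP client invokes a tool by name, and the function name `read_pdf_with_ocr_fallback` is likewise made up for this example. Only the tool names, the `path` and `lang` parameters, and the `has_text` field come from the description.

```python
from typing import Any, Callable, Dict

# Hypothetical client hook: takes a tool name and its arguments, returns the
# tool's parsed result. Supply whatever your MCP client actually provides.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def read_pdf_with_ocr_fallback(
    call_tool: ToolCaller, path: str, lang: str = "eng"
) -> Dict[str, Any]:
    """Try the text-layer reader first; run OCR only if it finds no text."""
    result = call_tool("obsidian_read_pdf", {"path": path})

    # Fall back only on an explicit has_text: false, as the description says.
    if result.get("has_text") is not False:
        return result

    # OCR is slower (about 1-2 s per page), so it is the second choice.
    return call_tool("obsidian_ocr_pdf", {"path": path, "lang": lang})
```

Because OCR returns the same shape as obsidian_read_pdf, the caller can presumably handle either result with the same code.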
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| path | Yes | Vault-relative path to the PDF file (the .pdf extension is optional) | |
| lang | No | Tesseract language pack(s). Multi-lang via '+': 'eng+rus' for English+Russian mixed scans (max 8 packs per call). Common: 'eng', 'rus', 'jpn', 'chi_sim', 'fra', 'deu'. | 'eng' |
| pages | No | 1-indexed, inclusive page range; e.g. [2, 5] runs OCR on pages 2 through 5 | |
| scale | No | Render scale (DPI multiplier); 2 is roughly 150 DPI, capped at 4. Higher = better OCR on small text but slower. | 2 |
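
As an illustration of the schema above, here is a small sketch that assembles an arguments object and checks it against the documented limits before sending. The helper name `build_ocr_args` and the constants are hypothetical. Clamping `scale` on the client side is an assumption, since the description only says the value is capped at 4, not whether the server clamps or rejects larger values.

```python
from typing import Any, Dict, Optional, Tuple

MAX_LANG_PACKS = 8  # "max 8 packs per call"
MAX_SCALE = 4       # scale is "capped at 4"


def build_ocr_args(
    path: str,
    lang: str = "eng",
    pages: Optional[Tuple[int, int]] = None,
    scale: float = 2,
) -> Dict[str, Any]:
    """Build an obsidian_ocr_pdf arguments dict (hypothetical helper)."""
    packs = lang.split("+")
    if not all(packs):
        raise ValueError(f"empty language code in {lang!r}")
    if len(packs) > MAX_LANG_PACKS:
        raise ValueError(f"at most {MAX_LANG_PACKS} language packs per call")
    if scale <= 0:
        raise ValueError("scale must be positive")

    # Assumption: clamp locally instead of relying on server-side behaviour.
    args: Dict[str, Any] = {"path": path, "lang": lang, "scale": min(scale, MAX_SCALE)}

    if pages is not None:
        first, last = pages
        # The range is 1-indexed and inclusive, so [2, 5] covers four pages.
        if first < 1 or last < first:
            raise ValueError("pages must satisfy 1 <= first <= last")
        args["pages"] = [first, last]
    return args
```

For example, `build_ocr_args("scans/receipt", lang="eng+rus", pages=(2, 5))` produces a request for pages 2 through 5 of a mixed English and Russian scan at the default scale.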