ocr_pdf
Run OCR on scanned PDFs to add a searchable text layer using Tesseract, making the text copyable and searchable.
Instructions
Run OCR on a PDF to add a searchable text layer using Tesseract.
Use this when a PDF is scanned (no extractable text). After running, the output PDF has an invisible text layer and can be converted to markdown via markitdown, searched in Adobe / Windows Search, or copied from normally.
Args:
input_path: Absolute path to the input PDF on the user's filesystem.
output_path: Absolute path for the OCR'd output PDF. If omitted, defaults to "-ocr.pdf" in the same directory as input.
language: Tesseract language code (e.g. "eng", "eng+spa", "deu"). Multiple languages are joined with "+". Default "eng".
force_ocr: If True, OCR every page even if a text layer already exists. Use when the existing text layer is poor or junk OCR. Default False (pages with existing text are passed through unchanged).
deskew: If True, straighten skewed pages before OCR. Default True.
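The argument rules above can be sketched as a small helper that assembles a call payload. This is only an illustration: `build_ocr_args` is a hypothetical name, not part of the tool, and it assumes the tool accepts a plain mapping keyed by the parameter names listed above. It leaves `output_path` out when none is given so that the tool applies its own default.

```python
from pathlib import PurePosixPath, PureWindowsPath
from typing import Optional, Sequence


def _is_absolute(path: str) -> bool:
    # Accept either POSIX-style or Windows-style absolute paths.
    return PurePosixPath(path).is_absolute() or PureWindowsPath(path).is_absolute()


def build_ocr_args(
    input_path: str,
    languages: Sequence[str] = ("eng",),
    output_path: Optional[str] = None,
    force_ocr: bool = False,
    deskew: bool = True,
) -> dict:
    """Hypothetical helper: build an argument mapping for ocr_pdf."""
    if not _is_absolute(input_path):
        raise ValueError("input_path must be an absolute path")
    if output_path is not None and not _is_absolute(output_path):
        raise ValueError("output_path must be an absolute path")
    if not languages:
        raise ValueError("at least one Tesseract language code is required")

    args = {
        "input_path": input_path,
        # Multiple Tesseract language codes are joined with "+".
        "language": "+".join(languages),
        "force_ocr": force_ocr,
        "deskew": deskew,
    }
    # Omit output_path entirely so the tool falls back to its own default.
    if output_path is not None:
        args["output_path"] = output_path
    return args
```

For example, `build_ocr_args("/scans/invoice.pdf", languages=["eng", "spa"])` yields a mapping whose `language` value is `"eng+spa"` and which has no `output_path` key.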
Returns:
A dict with success (bool), output_path (str on success), and
error (str on failure).
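A caller will usually want to branch on `success` before touching the other keys. The following is a minimal sketch of that check, assuming the returned dict has exactly the shape described above; `ocr_output_or_raise` is a hypothetical helper, and the sample dicts in the usage note are made up for illustration.

```python
def ocr_output_or_raise(result: dict) -> str:
    """Return the OCR'd PDF path, or raise with the tool's error message."""
    if result.get("success"):
        # output_path is only documented as present on success.
        return result["output_path"]
    # error is only documented as present on failure; fall back defensively.
    raise RuntimeError(result.get("error", "OCR failed without an error message"))
```

With a made-up success result such as `{"success": True, "output_path": "/scans/out.pdf"}` the helper returns the path; with `{"success": False, "error": "..."}` it raises `RuntimeError` carrying the error string.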
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
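| input_path | Yes | Absolute path to the input PDF on the user's filesystem. | |
| output_path | No | Absolute path for the OCR'd output PDF. | "-ocr.pdf" in the same directory as input |
| language | No | Tesseract language code (e.g. "eng", "eng+spa", "deu"). Multiple languages are joined with "+". | eng |
| force_ocr | No | If True, OCR every page even if a text layer already exists. | False |
| deskew | No | If True, straighten skewed pages before OCR. | True |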