ocr_pdf
Extract text from scanned PDFs using OCR. Supports page ranges, multiple languages, and adjustable DPI for balancing accuracy and performance.
Instructions
Perform OCR on PDF pages using Tesseract for scanned documents.
Args:
file_path: Path to the PDF file
pages: Page range (e.g., '1,3,5-10,-1' for pages 1, 3, 5 to 10, and last page)
language: OCR language code (default: 'chi_sim' for simplified Chinese)
chunk_size: Maximum size of text chunks
chunk_overlap: Overlap between chunks to preserve context
dpi: DPI for PDF to image conversion (higher = better quality, slower)
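The `pages` syntax above can be illustrated with a small parser. This is a minimal sketch, not the tool's actual implementation: the function name `parse_pages`, the `total_pages` parameter, and the assumptions that pages are 1-based and that any negative number counts back from the end (with `-1` as the last page) are hypothetical.

```python
def parse_pages(spec: str, total_pages: int) -> list[int]:
    """Expand a spec such as '1,3,5-10,-1' into sorted 1-based page numbers.

    Hypothetical helper: assumes 1-based pages and that a leading '-'
    means "count from the end" (-1 is the last page).
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if part.startswith("-"):
            # Negative index: -1 maps to total_pages, -2 to total_pages - 1, ...
            start = end = total_pages + int(part) + 1
        elif "-" in part:
            # Inclusive range such as '5-10'.
            low, high = part.split("-", 1)
            start, end = int(low), int(high)
        else:
            start = end = int(part)
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"page spec {part!r} is invalid for {total_pages} pages")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_pages("1,3,5-10,-1", total_pages=12))
# → [1, 3, 5, 6, 7, 8, 9, 10, 12]
```

Collecting pages in a set means overlapping entries such as `'3,1-5'` are processed only once, which matters when each page costs a full OCR pass.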
Returns:
JSON string with OCR results and metadata
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
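| file_path | Yes | Path to the PDF file | |
| pages | No | Page range (e.g., '1,3,5-10,-1' for pages 1, 3, 5 to 10, and the last page) | |
| language | No | OCR language code | chi_sim |
| chunk_size | No | Maximum size of text chunks | |
| chunk_overlap | No | Overlap between chunks to preserve context | |
| dpi | No | DPI for PDF-to-image conversion (higher = better quality, slower) | |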
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | JSON string with OCR results and metadata | |
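The interaction between `chunk_size` and `chunk_overlap` can be sketched as a sliding window over the extracted text. This is an illustrative sketch only: the function name `chunk_text` is hypothetical, and it assumes sizes are measured in characters, which the tool description does not state.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Hypothetical helper: assumes character-based sizes. Each chunk repeats
    the last chunk_overlap characters of the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Requiring `chunk_overlap < chunk_size` keeps the step positive, so the loop always advances and terminates.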