get_pdf_text
Extract text from PDF files using native text extraction with flexible page range selection. Specify pages or ranges to retrieve content from PDFs containing native text.
Instructions
Extract text from specific pages or page ranges of a PDF file using native text extraction. Supports Python-style slicing: '5' (single page), '5:10' (range), '7:' (from page 7 to end), ':5' (from start to page 5). Use absolute_path for a file in any location, or relative_path for a file under the ~/pdf-agent/ directory. Note: Works best with PDFs that contain native text; scanned PDFs may yield limited results.
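
The choice between the two path parameters can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the function name `resolve_pdf_path`, the `pdf_home` argument, and the rule that `absolute_path` wins when both are supplied are all assumptions made for the example. The `use_pdf_home` flag is not modeled.

```python
from pathlib import Path
from typing import Optional

# Assumed location of the PDF agent home directory (~/pdf-agent/).
PDF_HOME = Path.home() / "pdf-agent"


def resolve_pdf_path(
    absolute_path: Optional[str] = None,
    relative_path: Optional[str] = None,
    pdf_home: Path = PDF_HOME,
) -> Path:
    """Hypothetical helper: pick the file path from the two path parameters."""
    if absolute_path:
        path = Path(absolute_path)
        if not path.is_absolute():
            raise ValueError("absolute_path must be an absolute path")
        # Assumption: absolute_path takes precedence if both are given.
        return path
    if relative_path:
        # relative_path is interpreted relative to the PDF home directory.
        return pdf_home / relative_path
    raise ValueError("Provide either absolute_path or relative_path")
```

For example, `resolve_pdf_path(relative_path="reports/annual.pdf")` would point at `reports/annual.pdf` inside the home directory, while an `absolute_path` is returned unchanged.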
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| absolute_path | No | Absolute path to the PDF file (e.g., '/Users/john/documents/report.pdf') | |
| relative_path | No | Path relative to ~/pdf-agent/ directory (e.g., 'reports/annual.pdf') | |
| use_pdf_home | No | Use PDF agent home directory for relative paths (default: true) | true |
| page_range | No | Page range in enhanced Python-style format: '5' (page 5), '5:10' (pages 5-10), '7:' (page 7 to end), ':5' (start to page 5). Also supports comma-separated combinations: '1,3:5,7' (pages 1, 3-5, and 7), '1-3,7,10:' (pages 1-3, 7, and 10 to end). Default: '1:' (all pages) | 1: |
| extraction_strategy | No | Text extraction strategy: 'hybrid' (enhanced native extraction with better error handling), 'native' (standard PDF.js extraction). Default: 'hybrid' | hybrid |
| preserve_formatting | No | Preserve text formatting and spacing (default: true) | true |
| line_breaks | No | Preserve line breaks in extracted text (default: true) | true |
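
To make the `page_range` syntax concrete, here is a rough sketch of how such a string could be expanded into page numbers. It is not the tool's own parser: `parse_page_range` and `total_pages` are hypothetical names, and the sketch assumes 1-based page numbers, inclusive range ends (as the "pages 5-10" wording suggests), and clamping of ranges to the document length.

```python
def parse_page_range(spec: str, total_pages: int) -> list[int]:
    """Hypothetical helper: expand a page_range string into 1-based page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        # Treat the hyphen form '1-3' the same as the colon form '1:3'.
        part = part.replace("-", ":")
        if ":" in part:
            start_text, end_text = part.split(":", 1)
            # An empty side means "from the first page" or "to the last page".
            start = int(start_text) if start_text else 1
            end = int(end_text) if end_text else total_pages
        else:
            start = end = int(part)
        # Assumption: ranges are clamped to the pages that exist.
        start = max(start, 1)
        end = min(end, total_pages)
        # Assumption: the end page is included, unlike a true Python slice.
        pages.update(range(start, end + 1))
    return sorted(pages)
```

Under these assumptions, `parse_page_range("1,3:5,7", 10)` yields pages 1, 3, 4, 5, and 7, and the default `'1:'` selects every page in the document.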