read_document
Extract text from PDF, Word, plain text, and HTML documents, with OCR for scanned files.
Instructions
Read and extract text from documents: PDF, Word, plain text, and HTML files.
Supported formats:
- PDF (.pdf): text extraction with pdftotext, OCR fallback for scans
- Word (.docx): paragraph and table text extraction (no extra deps)
- Plain text (.txt, .md, .csv, .log, .json, .xml, .yaml, .yml, .ini, .cfg, .toml)
- HTML (.html, .htm): strips tags, returns clean text

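
The format list above implies a choice of extraction path based on the file extension. The following is a minimal sketch of that mapping, not the tool's actual implementation: the names `FORMAT_BY_EXT` and `classify_document` are hypothetical, and the case-insensitive matching is an assumption.

```python
from pathlib import Path

# Extension groups copied from the supported-formats list.
PLAIN_TEXT_EXTS = {
    ".txt", ".md", ".csv", ".log", ".json", ".xml",
    ".yaml", ".yml", ".ini", ".cfg", ".toml",
}

# Hypothetical lookup table: extension -> format family.
FORMAT_BY_EXT = {".pdf": "pdf", ".docx": "word", ".html": "html", ".htm": "html"}
FORMAT_BY_EXT.update({ext: "text" for ext in PLAIN_TEXT_EXTS})


def classify_document(file_path: str) -> str:
    """Return 'pdf', 'word', 'text', or 'html' for a supported extension.

    Raises ValueError for anything outside the documented list.
    Matching is case-insensitive (an assumption, not documented behavior).
    """
    ext = Path(file_path).suffix.lower()
    try:
        return FORMAT_BY_EXT[ext]
    except KeyError:
        label = ext or "(no extension)"
        raise ValueError(f"Unsupported document type: {label}") from None
```

A caller could use such a check to reject unsupported files before invoking the tool, for example `classify_document("/path/to/report.docx")` returns `"word"`.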
Sample prompts that trigger this tool:
- "Read this PDF: /path/to/document.pdf"
- "What does this document say? /path/to/report.docx"
- "Extract text from /path/to/scanned.pdf"
- "Read the CSV at /path/to/data.csv"
- "Show me the contents of config.yaml"

Args:
- file_path: Absolute path to the document file.

Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| file_path | Yes | Absolute path to the document file. | |
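
As an illustration of the input schema, the sketch below builds the argument object for a call with the single required `file_path` field. The helper name `build_read_document_args` is hypothetical, and the client-side absolute-path check (POSIX-style paths assumed) reflects the argument description rather than any confirmed server behavior.

```python
import json
from pathlib import PurePosixPath


def build_read_document_args(file_path: str) -> dict:
    """Build the arguments object for a read_document call.

    Rejects relative paths up front, since the tool documents
    file_path as an absolute path (POSIX-style paths assumed here).
    """
    if not PurePosixPath(file_path).is_absolute():
        raise ValueError(f"file_path must be absolute, got: {file_path!r}")
    return {"file_path": file_path}


# Serialize the arguments as they might appear in a tool-call request.
print(json.dumps(build_read_document_args("/path/to/document.pdf")))
# prints {"file_path": "/path/to/document.pdf"}
```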
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes |