Read PDF Text
read_text
Extract text from PDFs with Y-coordinate reading order. Handle multi-column layouts and compact whitespace for Japanese forms to reduce token consumption.
Instructions
Extract text content from a PDF document with Y-coordinate-based reading order preservation.
Text is extracted page by page, sorted by vertical position (top to bottom) then horizontal position (left to right), providing natural reading order.
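The sort described above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the `TextItem` type, the `y_top` convention (distance measured from the top of the page), and the `line_tolerance` parameter are all assumptions made for the example.

```python
from typing import NamedTuple


class TextItem(NamedTuple):
    """Hypothetical text fragment: position plus content."""
    x: float      # distance from the left edge of the page
    y_top: float  # distance from the TOP edge of the page (assumed convention)
    text: str


def reading_order(items: list[TextItem], line_tolerance: float = 2.0) -> list[str]:
    """Group items into lines by vertical position, then order each line left to right."""
    lines: list[list[TextItem]] = []
    for item in sorted(items, key=lambda it: (it.y_top, it.x)):
        # Items within line_tolerance of the current line's first item share that line
        # (an assumed heuristic for slightly misaligned baselines).
        if lines and abs(item.y_top - lines[-1][0].y_top) <= line_tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])
    return [" ".join(it.text for it in sorted(line, key=lambda it: it.x)) for line in lines]
```

Note that raw PDF coordinates usually place the origin at the bottom-left, so a real extractor would need to flip the Y axis before applying a sort like this.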
For untagged multi-column PDFs (e.g. older 新旧対照表 (old/new comparison table) PDFs that lack a structure tree), pass split_columns: 2 or 3 to bucket items by X-coordinate, left to right. For tagged PDFs with proper <Table> markup, use the extract_tables tool instead.
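A rough sketch of X-coordinate bucketing is shown here to make the idea concrete. It assumes equal-width columns and items given as hypothetical `(x, y_top, text)` tuples with `y_top` measured from the top of the page; how the tool actually finds column boundaries is not documented here.

```python
def split_into_columns(items, page_width, split_columns=2):
    """Bucket (x, y_top, text) tuples into equal-width columns, left to right."""
    if split_columns not in (1, 2, 3):
        raise ValueError("split_columns must be 1, 2, or 3")
    column_width = page_width / split_columns  # assumption: columns share the page evenly
    columns = [[] for _ in range(split_columns)]
    for x, y_top, text in items:
        # Clamp so items at or beyond the page edges land in the first or last column.
        index = max(0, min(int(x // column_width), split_columns - 1))
        columns[index].append((x, y_top, text))
    blocks = []
    for column in columns:
        # Within a column, read top to bottom, then left to right.
        column.sort(key=lambda it: (it[1], it[0]))
        blocks.append("\n".join(text for _, _, text in column))
    # Separate columns with a blank line so they stay distinguishable.
    return "\n\n".join(block for block in blocks if block)
```

With a plain Y-sort, the rows of the left and right columns would interleave; bucketing first keeps each column's text contiguous.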
For Japanese form-style PDFs (帳票・様式, i.e. business forms and templates) where U+3000 fullwidth spaces are used as visual indentation, pass compact_whitespace: true to collapse runs of whitespace to a single ASCII space. This cuts token consumption by 20–40% without losing content.
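The whitespace compaction described above could look roughly like this. The function name is borrowed from the parameter for readability; the regex approach is an assumption, not the tool's confirmed implementation.

```python
import re

# In Python str patterns, \s matches Unicode whitespace, including U+3000.
# [^\S\n] means "whitespace other than a newline", so line breaks survive.
_WHITESPACE_RUN = re.compile(r"[^\S\n]+")


def compact_whitespace(text: str) -> str:
    """Collapse each whitespace run to one ASCII space and trim every line."""
    return "\n".join(_WHITESPACE_RUN.sub(" ", line).strip() for line in text.split("\n"))
```

For example, a form line such as `氏名` followed by three fullwidth spaces and a name becomes the label, one ASCII space, and the name.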
Args:
file_path (string): Absolute path to a local PDF file
pages (string, optional): Page range to extract. Format: "1-5", "3", or "1,3,5-7". Omit for all pages.
response_format ('markdown' | 'json'): Output format (default: 'markdown')
split_columns (1 | 2 | 3, optional): Column-aware reordering for untagged multi-column PDFs. Default: 1 (plain Y-sort, no column splitting).
compact_whitespace (boolean, optional): Collapse whitespace runs (incl. U+3000) to one ASCII space and trim each line. Default false.
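To illustrate the pages format ("1-5", "3", or "1,3,5-7"), here is one way such a string might be expanded into page numbers. The `parse_pages` helper, the 1-based numbering, and the choice to raise on out-of-range pages are assumptions for this sketch; the tool's own validation rules are not specified.

```python
def parse_pages(spec: str, page_count: int) -> list[int]:
    """Expand a spec such as "1,3,5-7" into sorted, unique 1-based page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page spec: {spec!r}")
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)  # raises ValueError on non-numeric input
        end = int(end_text) if sep else start
        if start < 1 or end < start or end > page_count:
            raise ValueError(f"page range out of bounds: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```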
Returns:
Extracted text organized by page number. With split_columns >= 2, columns are separated by a blank line so a downstream LLM can tell them apart.
Examples:
Extract all text: { file_path: "/path/to/doc.pdf" }
Untagged 新旧対照表: { file_path: "/path/to/older-shinkyu.pdf", split_columns: 2 }
Japanese form template: { file_path: "/path/to/form.pdf", compact_whitespace: true }
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| file_path | Yes | Absolute path to a local PDF file (e.g., "/path/to/document.pdf") | |
| pages | No | Page range to process. Format: "1-5", "3", or "1,3,5-7". Omit for all pages. | |
| response_format | No | Output format: "markdown" for human-readable, "json" for structured data | markdown |
| split_columns | No | Number of columns to use when reordering text. 1 (default) = existing Y-sort. 2 or 3 = bucket by X-coordinate left-to-right. Use for untagged 新旧対照表 / two-column PDFs where Y-sort would interleave columns. Tagged PDFs with proper <Table> markup should use extract_tables instead. | 1 |
| compact_whitespace | No | When true, collapse runs of whitespace (incl. fullwidth space U+3000) to a single ASCII space and trim each line. Reduces token consumption on Japanese form-style PDFs. Default: false (no whitespace normalization). | false |