get_page_text
Extract text from PDF pages as JSON, plain text, Markdown, or HTML. Specify a page range and choose whether headers and footers are included in the extracted content.
Instructions
Extract text content from one or more PDF pages.
Args:
filename: Path to a PDF file.
start_page: First page number (1-indexed, inclusive).
end_page: Last page number (1-indexed, inclusive). Defaults to start_page.
format: Output format. "json" returns structured page data with block/line/span
detail. "text" returns plain text. "markdown" returns markdown via
PyMuPDF4LLM. "html" returns HTML.
include_headers_footers: If False, crops top/bottom margins to exclude
headers and footers. Default True.
Input Schema
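| Name | Required | Description | Default |
|---|---|---|---|
| filename | Yes | Path to a PDF file. | |
| start_page | No | First page number (1-indexed, inclusive). | |
| end_page | No | Last page number (1-indexed, inclusive). | start_page |
| format | No | Output format: "json", "text", "markdown", or "html". | json |
| include_headers_footers | No | If False, crops top/bottom margins to exclude headers and footers. | True |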
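
The page arguments are 1-indexed and inclusive, while most PDF libraries index pages from 0. The following is a minimal sketch of how a caller or server might normalize these arguments before extraction. The function names `resolve_page_indices` and `check_format` are hypothetical, and the error handling is an assumption: the tool's actual behavior for out-of-range pages or unknown formats is not documented here. Because no default is stated for `start_page`, the sketch requires it.

```python
from typing import List, Optional

# Format names taken from the tool description; "json" is the documented default.
SUPPORTED_FORMATS = ("json", "text", "markdown", "html")


def resolve_page_indices(
    start_page: int, end_page: Optional[int], page_count: int
) -> List[int]:
    """Convert a 1-indexed inclusive page range into 0-indexed page indices.

    Hypothetical helper: raising ValueError on a bad range is an assumption,
    not documented behavior of get_page_text.
    """
    if end_page is None:
        # Documented rule: end_page defaults to start_page.
        end_page = start_page
    if start_page < 1 or end_page < start_page:
        raise ValueError(f"invalid page range: {start_page}-{end_page}")
    if end_page > page_count:
        raise ValueError(f"end_page {end_page} exceeds page count {page_count}")
    # Pages start_page..end_page (1-indexed) map to start_page-1..end_page-1.
    return list(range(start_page - 1, end_page))


def check_format(fmt: str = "json") -> str:
    """Return fmt unchanged if it is one of the documented output formats."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return fmt
```

For example, `resolve_page_indices(2, 4, page_count=10)` yields the indices for pages 2 through 4, and passing `end_page=None` selects the single page named by `start_page`.
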
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | | |
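
The `include_headers_footers` option is described as cropping the top and bottom margins. One way such cropping could work is to compute a clip rectangle that excludes a band at each end of the page and extract only inside it; in PyMuPDF, for instance, a rectangle like this can be passed as the `clip` argument of `Page.get_text`. The sketch below is an illustration only: the helper name `content_clip` and the `margin_fraction` value are assumptions, and the tool's real margin size is not documented here.

```python
from typing import Tuple

# (x0, y0, x1, y1) in page units, with y increasing downward from the top.
Rect = Tuple[float, float, float, float]


def content_clip(
    page_rect: Rect,
    include_headers_footers: bool = True,
    margin_fraction: float = 0.08,  # assumed value, not taken from the tool
) -> Rect:
    """Return the region of the page to extract text from.

    With include_headers_footers=True the full page is returned. Otherwise a
    band of margin_fraction of the page height is trimmed from top and bottom.
    """
    if not 0 <= margin_fraction < 0.5:
        raise ValueError("margin_fraction must be in [0, 0.5)")
    if include_headers_footers:
        return page_rect
    x0, y0, x1, y1 = page_rect
    margin = (y1 - y0) * margin_fraction
    # Shrink only vertically; the horizontal extent is left untouched.
    return (x0, y0 + margin, x1, y1 - margin)
```

A fixed fraction is a blunt choice: it can cut into body text on pages without running headers, which is presumably why the option defaults to keeping headers and footers.
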