Convert PDF to Markdown
pdf_to_markdown
Extract text content from PDF files and convert to Markdown, with options to clean output for AI use.
Instructions
Extract text content from a PDF file into Markdown format.
IMPORTANT: This is CONTENT EXTRACTION, not layout reconstruction.
Scanned PDFs, complex tables, two-column papers, and mathematical formulas may not convert reliably.
For scanned PDFs, an OCR engine is required (not included).
The default engine is MarkItDown, which gives better text extraction. The tool falls back to Pandoc if MarkItDown is unavailable.
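The fallback described above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the function name `choose_engine` is hypothetical, and probing for executables named `markitdown` and `pandoc` on the PATH is an assumption about how availability might be detected.

```python
import shutil
from typing import Callable, Optional


def choose_engine(
    requested: str = "markitdown",
    is_available: Optional[Callable[[str], bool]] = None,
) -> str:
    """Pick the conversion engine, falling back from MarkItDown to Pandoc."""
    if requested not in ("markitdown", "pandoc"):
        raise ValueError(f"unknown engine: {requested!r}")
    if is_available is None:
        # Assumption: an engine counts as available when an executable
        # with the same name is found on the PATH.
        is_available = lambda name: shutil.which(name) is not None
    if is_available(requested):
        return requested
    # Only the default engine has a documented fallback.
    if requested == "markitdown" and is_available("pandoc"):
        return "pandoc"
    raise RuntimeError(f"engine {requested!r} is not available")


if __name__ == "__main__":
    # Simulate a machine where only Pandoc is installed.
    print(choose_engine("markitdown", is_available=lambda name: name == "pandoc"))
    # prints "pandoc"
```

Passing the availability check in as a parameter keeps the sketch testable without either engine installed.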
Arguments:
inputPath (string, required): Path to the input PDF file
outputPath (string, optional): Output Markdown path. Defaults to the input file name with a .md extension
engine (enum, optional): Conversion engine, either 'markitdown' (default) or 'pandoc'
cleanForLLM (boolean, optional): Clean up Markdown for LLM consumption
preferSourceSidecar (boolean, optional): When true (default), first check for a source sidecar file (sample.pdf.source.md) and return it instead of extracting PDF text. This is the only reliable way to recover original Markdown structure.
overwrite (boolean, optional): Allow overwriting an existing output file. Defaults to false
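The path conventions in the argument list can be illustrated with a short sketch. The helper names (`default_output_path`, `sidecar_path`, `read_sidecar`) are hypothetical and the tool's real lookup may differ; only the `.md` default and the `sample.pdf.source.md` naming come from the description above.

```python
from pathlib import Path
from typing import Optional


def default_output_path(input_path: str) -> Path:
    """Same name as the input, with the extension replaced by .md."""
    return Path(input_path).with_suffix(".md")


def sidecar_path(input_path: str) -> Path:
    """Sidecar name: the full PDF file name followed by .source.md."""
    pdf = Path(input_path)
    return pdf.with_name(pdf.name + ".source.md")


def read_sidecar(input_path: str, prefer_source_sidecar: bool = True) -> Optional[str]:
    """Return the sidecar's Markdown if it should be used and exists, else None."""
    if not prefer_source_sidecar:
        return None
    candidate = sidecar_path(input_path)
    if candidate.is_file():
        return candidate.read_text(encoding="utf-8")
    return None


if __name__ == "__main__":
    print(default_output_path("docs/sample.pdf").as_posix())  # docs/sample.md
    print(sidecar_path("docs/sample.pdf").name)  # sample.pdf.source.md
```

When `read_sidecar` returns None, a caller would fall through to extracting text from the PDF itself.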
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| inputPath | Yes | Path to the input PDF file (relative to workspace) | |
| outputPath | No | Output Markdown path (relative to workspace). Auto-derived if omitted. | Input name with .md |
| engine | No | Conversion engine: 'markitdown' or 'pandoc'. | markitdown |
| cleanForLLM | No | Clean up the Markdown output for LLM consumption | |
| preferSourceSidecar | No | First check for a source sidecar file (.source.md) generated by markdown_to_pdf with preserveSource=true. If found, return the original Markdown instead of extracting PDF text. This is the only reliable way to recover structure. | true |
| overwrite | No | Allow overwriting existing output file. | false |
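As an illustration of the schema above, the following sketch builds an arguments object with the documented defaults filled in. The helper `build_arguments` is hypothetical and how the object is sent to the tool depends on the client. No default is stated for cleanForLLM, so the sketch leaves it out unless the caller sets it.

```python
import json

ALLOWED_ENGINES = ("markitdown", "pandoc")
OPTIONAL_NAMES = {"outputPath", "engine", "cleanForLLM", "preferSourceSidecar", "overwrite"}

# Only the defaults stated in the schema are listed here.
DOCUMENTED_DEFAULTS = {
    "engine": "markitdown",
    "preferSourceSidecar": True,
    "overwrite": False,
}


def build_arguments(input_path: str, **overrides) -> dict:
    """Merge caller overrides over the documented defaults and validate them."""
    unknown = set(overrides) - OPTIONAL_NAMES
    if unknown:
        raise ValueError(f"unknown argument(s): {sorted(unknown)}")
    args = {"inputPath": input_path, **DOCUMENTED_DEFAULTS, **overrides}
    if args["engine"] not in ALLOWED_ENGINES:
        raise ValueError(f"engine must be one of {ALLOWED_ENGINES}")
    return args


if __name__ == "__main__":
    print(json.dumps(build_arguments("docs/sample.pdf", cleanForLLM=True), indent=2))
```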