docshelf_convert_pdf
Convert PDF files to Markdown without committing to a shelf. Optionally split output by H2 headings for easier navigation.
Instructions
Standalone PDF → Markdown conversion (no shelf, no INDEX update).
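Use when you want the converted file but aren't ready to commit it to a shelf. With `split` enabled, the result is also split by H2 into a sibling subdirectory, provided the document exceeds the size threshold and contains at least two H2 headings.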
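To make the H2 split concrete, here is a minimal sketch that reuses the logic of the `split_by_h2` helper listed under Implementation Reference. The `H2_RE` pattern is an assumption (the project's actual `_H2_RE` is not shown on this page), and the sample document is made up for illustration.

```python
import re

# Assumption: stand-in for the project's _H2_RE, which is not shown here.
# Matches lines that start with exactly two hashes followed by whitespace.
H2_RE = re.compile(r"^##\s+(.+?)\s*$")


def split_by_h2(text):
    """Split Markdown on H2 boundaries into (title, body) pairs."""
    sections = [("preamble", [])]
    for line in text.splitlines():
        m = H2_RE.match(line)
        if m:
            # A new H2 starts a new section; the heading line stays in the body.
            sections.append((m.group(1).strip(), [line]))
        else:
            sections[-1][1].append(line)
    # Drop the preamble when nothing but whitespace precedes the first H2.
    if not "\n".join(sections[0][1]).strip():
        sections.pop(0)
    return [(title, "\n".join(body).rstrip() + "\n") for title, body in sections]


doc = "Intro text\n\n## Setup\nInstall it.\n\n## Usage\nRun it.\n"
sections = split_by_h2(doc)
for title, body in sections:
    print(repr(title), repr(body))
```

Text before the first H2 becomes a section titled `preamble`, and each later section keeps its own `## ` heading line, so every split file can be read on its own.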
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| params | Yes | | |
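As a rough illustration of what a call might look like, the sketch below builds an MCP `tools/call` request for this tool. It assumes the client speaks MCP JSON-RPC and that the single `params` argument is passed as a nested object, as the table above suggests. The field names come from the `ConvertPdfInput` schema in the Implementation Reference; the paths and the request `id` are hypothetical.

```python
import json

# Hypothetical example values; only the field names come from ConvertPdfInput.
arguments = {
    "params": {
        "pdf_path": "/tmp/papers/report.pdf",  # absolute path to the source PDF
        "out_dir": "/tmp/converted",           # created if missing
        "quality": "fast",                     # or "high"
        "split": True,                         # also split by H2 when warranted
    }
}

# JSON-RPC envelope for an MCP tool call.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "docshelf_convert_pdf", "arguments": arguments},
}

print(json.dumps(request, indent=2))
```

Only `pdf_path` and `out_dir` are required; `quality` defaults to `"fast"` and `split` to `False`.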
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | | |
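The handler in the Implementation Reference returns a plain dict, so a caller may find it handy to type and summarize it. The sketch below mirrors the field names from that dict. The `ConvertPdfResult` and `describe` names and all example values are hypothetical, and how the server wraps this dict inside `result` is not shown on this page.

```python
from typing import List, TypedDict


class ConvertPdfResult(TypedDict):
    """Field names taken from the dict returned by the convert_pdf handler."""

    status: str
    source_pdf: str
    output_markdown: str
    size_bytes: int
    split_into: int
    section_paths: List[str]


def describe(result: ConvertPdfResult) -> str:
    """Hypothetical helper: one-line summary of a conversion result."""
    summary = f"{result['output_markdown']} ({result['size_bytes']} bytes)"
    # split_into is 0 when no split was requested or the document did not qualify.
    if result["split_into"]:
        summary += f", split into {result['split_into']} sections"
    return summary


# Made-up values for illustration only.
example: ConvertPdfResult = {
    "status": "ok",
    "source_pdf": "/tmp/papers/report.pdf",
    "output_markdown": "/tmp/converted/report.md",
    "size_bytes": 48213,
    "split_into": 0,
    "section_paths": [],
}
print(describe(example))  # /tmp/converted/report.md (48213 bytes)
```

A `split_into` of 0 together with an empty `section_paths` means the single Markdown file is the only output.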
Implementation Reference
- src/docshelf_mcp/server.py:164-173 (registration) Registration of the 'docshelf_convert_pdf' MCP tool via the @mcp.tool decorator. Delegates to t.convert_pdf(params).

```python
@mcp.tool(
    name="docshelf_convert_pdf",
    annotations={
        "title": "Convert a PDF to Markdown",
        "readOnlyHint": False,
        "destructiveHint": False,
        "idempotentHint": True,
        "openWorldHint": False,
    },
)
```

- src/docshelf_mcp/tools.py:291-315 (handler) Core handler: converts PDF to Markdown via pdf_to_markdown(), cleans artefacts, optionally splits by H2 headings, and returns output paths.

```python
def convert_pdf(params: ConvertPdfInput) -> dict:
    """Implementation of the ``convert_pdf`` MCP tool."""
    pdf_path = Path(params.pdf_path).expanduser().resolve()
    out_dir = Path(params.out_dir).expanduser().resolve()
    out_dir.mkdir(parents=True, exist_ok=True)

    raw = pdf_to_markdown(pdf_path, quality=params.quality)
    cleaned = clean_markdown(raw)
    out_md = out_dir / f"{pdf_path.stem}.md"
    out_md.write_text(cleaned, encoding="utf-8")

    section_paths: list[Path] = []
    if params.split and should_split(cleaned):
        sections = split_by_h2(cleaned)
        if len(sections) >= 2:
            section_paths = write_split_files(sections, out_dir / pdf_path.stem)

    return {
        "status": "ok",
        "source_pdf": str(pdf_path),
        "output_markdown": str(out_md),
        "size_bytes": out_md.stat().st_size,
        "split_into": len(section_paths),
        "section_paths": [str(p) for p in section_paths],
    }
```

- src/docshelf_mcp/tools.py:136-157 (schema) Pydantic input schema for the convert_pdf tool: pdf_path (required), out_dir (required), quality (default 'fast'), split (default False).

```python
class ConvertPdfInput(_BaseInput):
    pdf_path: str = Field(
        ...,
        description="Absolute path to the source .pdf file.",
        min_length=1,
    )
    out_dir: str = Field(
        ...,
        description="Output directory. Created if missing. The resulting .md file "
        "uses the PDF's stem as its filename.",
        min_length=1,
    )
    quality: Quality = Field(
        default="fast",
        description="'fast' (pymupdf4llm) or 'high' (marker-pdf).",
    )
    split: bool = Field(
        default=False,
        description="If True, also split the converted Markdown by H2 into "
        "a sibling subdirectory.",
    )
```

- Helper that dispatches PDF-to-Markdown conversion to either the pymupdf4llm (fast) or the marker-pdf (high) engine.

```python
def pdf_to_markdown(pdf_path: Path | str, quality: Quality = "fast") -> str:
    """Convert a PDF file to Markdown.

    Args:
        pdf_path: Path to the source PDF.
        quality: ``"fast"`` (default, pymupdf4llm) or ``"high"`` (marker-pdf).

    Returns:
        The extracted Markdown text. NOT cleaned — run
        :func:`docshelf_mcp.core.splitter.clean_markdown` to remove
        PDF-extraction artefacts.

    Raises:
        FileNotFoundError: If ``pdf_path`` doesn't exist.
        ConversionError: If the requested engine is missing or fails.
    """
    pdf_path = Path(pdf_path).expanduser().resolve()
    if not pdf_path.exists():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")
    if pdf_path.suffix.lower() != ".pdf":
        raise ConversionError(
            f"Expected a .pdf file, got {pdf_path.suffix or '<no extension>'}"
        )

    if quality == "fast":
        return _convert_fast(pdf_path)
    if quality == "high":
        return _convert_high(pdf_path)
    raise ConversionError(f"Unknown quality preset: {quality!r}")
```

- Helper that cleans PDF-extraction artefacts (fake H1 headings, excessive blank lines) from the converted Markdown.

```python
def clean_markdown(text: str) -> str:
    """Smooth out common PDF-extraction artefacts in a Markdown string.

    Returns the cleaned text (always ends with a single trailing newline).
    """
    return "\n".join(_clean_lines(text.splitlines())) + "\n"


def should_split(text: str, threshold_bytes: int = DEFAULT_SPLIT_THRESHOLD_BYTES) -> bool:
    """Heuristic: does this document warrant a chapter-by-chapter split?

    True if the UTF-8 byte length exceeds ``threshold_bytes`` AND the document
    has at least two H2 headings to split on. Returning False here means the
    caller should keep the document as a single file.
    """
    if len(text.encode("utf-8")) <= threshold_bytes:
        return False
    h2_count = sum(1 for line in text.splitlines() if _H2_RE.match(line))
    return h2_count >= 2


def split_by_h2(text: str) -> list[tuple[str, str]]:
    """Split a Markdown string on H2 boundaries.

    Returns a list of ``(title, body)`` pairs. Content before the first H2 is
    returned with title ``"preamble"`` and is omitted if it is entirely
    whitespace. Each body starts at its ``## `` heading line — so writing the
    body verbatim to a file preserves the heading.
    """
    sections: list[tuple[str, list[str]]] = [("preamble", [])]
    for line in text.splitlines():
        m = _H2_RE.match(line)
        if m:
            title = m.group(1).strip()
            sections.append((title, [line]))
        else:
            sections[-1][1].append(line)

    if not "\n".join(sections[0][1]).strip():
        sections.pop(0)

    return [(title, "\n".join(body).rstrip() + "\n") for title, body in sections]


def write_split_files(
    sections: list[tuple[str, str]],
    target_dir: Path,
    *,
    clean_existing: bool = True,
) -> list[Path]:
    """Write each ``(title, body)`` section to ``target_dir/NNN-slug.md``.

    Args:
        sections: Output of :func:`split_by_h2`.
        target_dir: Output directory. Created if missing.
        clean_existing: If True (default), nukes ``target_dir`` first so the
            split is fully idempotent on re-run.

    Returns:
        List of written :class:`Path` objects, in section order.
    """
    if clean_existing and target_dir.exists():
        shutil.rmtree(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    written: list[Path] = []
    used_slugs: set[str] = set()
    for idx, (title, body) in enumerate(sections, start=1):
        slug = slugify(title)
        if slug in used_slugs:
            slug = f"{slug}-{idx:03d}"
        used_slugs.add(slug)
        filename = f"{idx:03d}-{slug}.md"
        path = target_dir / filename
        # If the body doesn't already start with a heading and the slice has a
        # real title (not "preamble"), prepend one so the standalone file is
        # self-explanatory.
        if title != "preamble" and not body.lstrip().startswith("#"):
            body = f"# {title}\n\n{body}"
        path.write_text(body, encoding="utf-8")
        written.append(path)
    return written
```