Glama

docshelf_convert_pdf

Idempotent

Convert PDF files to Markdown without committing to a shelf. Optionally split output by H2 headings for easier navigation.

Instructions

Standalone PDF → Markdown conversion (no shelf, no INDEX update).

Use when you want the converted file but don't yet want to commit it to a shelf. Optionally splits the result by H2.

Input Schema

Name      Required   Description   Default
params    Yes
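
As a rough illustration, a `params` payload built from the four fields in the Pydantic schema further down this page might look like the sketch below. The paths are made-up placeholders, `ConvertPdfParams` is a hypothetical stdlib stand-in for `ConvertPdfInput` (not the server's actual model), and the `Quality` alias is assumed from the field description.

```python
from dataclasses import asdict, dataclass
from typing import Literal

# Assumption: the two presets named in the schema description.
Quality = Literal["fast", "high"]


@dataclass
class ConvertPdfParams:
    """Hypothetical stand-in for ConvertPdfInput (the real model is Pydantic)."""

    pdf_path: str
    out_dir: str
    quality: Quality = "fast"
    split: bool = False

    def __post_init__(self) -> None:
        # Mirror the schema's min_length=1 constraint on both path fields.
        if not self.pdf_path or not self.out_dir:
            raise ValueError("pdf_path and out_dir must be non-empty")


# Placeholder paths, for illustration only.
example_params = asdict(
    ConvertPdfParams(pdf_path="/tmp/manual.pdf", out_dir="/tmp/converted", split=True)
)
```

Only `pdf_path` and `out_dir` have to be supplied; `quality` and `split` fall back to the defaults the schema lists.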

Output Schema

Name      Required   Description   Default
result    Yes
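
The `result` value presumably carries the dictionary that the core handler returns (see the implementation reference on this page). A minimal sketch of that shape follows; the class name `ConvertPdfResult` is hypothetical and the values are illustrative, not captured from a real run.

```python
from typing import TypedDict


class ConvertPdfResult(TypedDict):
    """Hypothetical name; the keys are copied from the handler's return statement."""

    status: str
    source_pdf: str
    output_markdown: str
    size_bytes: int
    split_into: int
    section_paths: list[str]


# Illustrative values only.
example_result: ConvertPdfResult = {
    "status": "ok",
    "source_pdf": "/tmp/manual.pdf",
    "output_markdown": "/tmp/converted/manual.md",
    "size_bytes": 48213,
    "split_into": 0,
    "section_paths": [],
}
```

When no split takes place, `split_into` is 0 and `section_paths` is empty, since both are derived from the same list in the handler.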

Implementation Reference

  • Registration of the 'docshelf_convert_pdf' MCP tool via the @mcp.tool decorator. Delegates to t.convert_pdf(params).
    @mcp.tool(
        name="docshelf_convert_pdf",
        annotations={
            "title": "Convert a PDF to Markdown",
            "readOnlyHint": False,
            "destructiveHint": False,
            "idempotentHint": True,
            "openWorldHint": False,
        },
    )
  • Core handler: converts PDF to Markdown via pdf_to_markdown(), cleans artefacts, optionally splits by H2 headings, and returns output paths.
    def convert_pdf(params: ConvertPdfInput) -> dict:
        """Implementation of the ``convert_pdf`` MCP tool."""
        pdf_path = Path(params.pdf_path).expanduser().resolve()
        out_dir = Path(params.out_dir).expanduser().resolve()
        out_dir.mkdir(parents=True, exist_ok=True)
    
        raw = pdf_to_markdown(pdf_path, quality=params.quality)
        cleaned = clean_markdown(raw)
        out_md = out_dir / f"{pdf_path.stem}.md"
        out_md.write_text(cleaned, encoding="utf-8")
    
        section_paths: list[Path] = []
        if params.split and should_split(cleaned):
            sections = split_by_h2(cleaned)
            if len(sections) >= 2:
                section_paths = write_split_files(sections, out_dir / pdf_path.stem)
    
        return {
            "status": "ok",
            "source_pdf": str(pdf_path),
            "output_markdown": str(out_md),
            "size_bytes": out_md.stat().st_size,
            "split_into": len(section_paths),
            "section_paths": [str(p) for p in section_paths],
        }
  • Pydantic input schema for the convert_pdf tool: pdf_path (required), out_dir (required), quality (default 'fast'), split (default False).
    class ConvertPdfInput(_BaseInput):
        pdf_path: str = Field(
            ...,
            description="Absolute path to the source .pdf file.",
            min_length=1,
        )
        out_dir: str = Field(
            ...,
            description="Output directory. Created if missing. The resulting .md file "
            "uses the PDF's stem as its filename.",
            min_length=1,
        )
        quality: Quality = Field(
            default="fast",
            description="'fast' (pymupdf4llm) or 'high' (marker-pdf).",
        )
        split: bool = Field(
            default=False,
            description="If True, also split the converted Markdown by H2 into "
            "a sibling subdirectory.",
        )
  • Helper that dispatches PDF-to-Markdown conversion to either pymupdf4llm (fast) or marker-pdf (high) engine.
    def pdf_to_markdown(pdf_path: Path | str, quality: Quality = "fast") -> str:
        """Convert a PDF file to Markdown.
    
        Args:
            pdf_path: Path to the source PDF.
            quality: ``"fast"`` (default, pymupdf4llm) or ``"high"`` (marker-pdf).
    
        Returns:
            The extracted Markdown text. NOT cleaned —
            run :func:`docshelf_mcp.core.splitter.clean_markdown` to remove
            PDF-extraction artefacts.
    
        Raises:
            FileNotFoundError: If ``pdf_path`` doesn't exist.
            ConversionError: If the requested engine is missing or fails.
        """
        pdf_path = Path(pdf_path).expanduser().resolve()
        if not pdf_path.exists():
            raise FileNotFoundError(f"PDF not found: {pdf_path}")
        if pdf_path.suffix.lower() != ".pdf":
            raise ConversionError(
                f"Expected a .pdf file, got {pdf_path.suffix or '<no extension>'}"
            )
    
        if quality == "fast":
            return _convert_fast(pdf_path)
        if quality == "high":
            return _convert_high(pdf_path)
        raise ConversionError(f"Unknown quality preset: {quality!r}")
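  • Helpers that post-process the converted Markdown: clean_markdown removes PDF-extraction artefacts (fake H1 headings, excessive blank lines), should_split decides whether a split is warranted, split_by_h2 cuts the text at H2 headings, and write_split_files writes each section to its own file.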
    def clean_markdown(text: str) -> str:
        """Smooth out common PDF-extraction artefacts in a Markdown string.
    
        Returns the cleaned text (always ends with a single trailing newline).
        """
        return "\n".join(_clean_lines(text.splitlines())) + "\n"
    
    
    def should_split(text: str, threshold_bytes: int = DEFAULT_SPLIT_THRESHOLD_BYTES) -> bool:
        """Heuristic: does this document warrant a chapter-by-chapter split?
    
        True if the UTF-8 byte length exceeds ``threshold_bytes`` AND the document
        has at least two H2 headings to split on. Returning False here means the
        caller should keep the document as a single file.
        """
        if len(text.encode("utf-8")) <= threshold_bytes:
            return False
        h2_count = sum(1 for line in text.splitlines() if _H2_RE.match(line))
        return h2_count >= 2
    
    
    def split_by_h2(text: str) -> list[tuple[str, str]]:
        """Split a Markdown string on H2 boundaries.
    
        Returns a list of ``(title, body)`` pairs. Content before the first H2
        is returned with title ``"preamble"`` and is omitted if it is entirely
        whitespace.
    
        Each body starts at its ``## `` heading line — so writing the body verbatim
        to a file preserves the heading.
        """
        sections: list[tuple[str, list[str]]] = [("preamble", [])]
        for line in text.splitlines():
            m = _H2_RE.match(line)
            if m:
                title = m.group(1).strip()
                sections.append((title, [line]))
            else:
                sections[-1][1].append(line)
    
        if not "\n".join(sections[0][1]).strip():
            sections.pop(0)
    
        return [(title, "\n".join(body).rstrip() + "\n") for title, body in sections]
    
    
    def write_split_files(
        sections: list[tuple[str, str]],
        target_dir: Path,
        *,
        clean_existing: bool = True,
    ) -> list[Path]:
        """Write each ``(title, body)`` section to ``target_dir/NNN-slug.md``.
    
        Args:
            sections: Output of :func:`split_by_h2`.
            target_dir: Output directory. Created if missing.
            clean_existing: If True (default), nukes ``target_dir`` first so the
                split is fully idempotent on re-run.
    
        Returns:
            List of written :class:`Path` objects, in section order.
        """
        if clean_existing and target_dir.exists():
            shutil.rmtree(target_dir)
        target_dir.mkdir(parents=True, exist_ok=True)
    
        written: list[Path] = []
        used_slugs: set[str] = set()
        for idx, (title, body) in enumerate(sections, start=1):
            slug = slugify(title)
            if slug in used_slugs:
                slug = f"{slug}-{idx:03d}"
            used_slugs.add(slug)
    
            filename = f"{idx:03d}-{slug}.md"
            path = target_dir / filename
    
            # If the body doesn't already start with a heading and the slice has a
            # real title (not "preamble"), prepend one so the standalone file is
            # self-explanatory.
            if title != "preamble" and not body.lstrip().startswith("#"):
                body = f"# {title}\n\n{body}"
    
            path.write_text(body, encoding="utf-8")
            written.append(path)
        return written
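
To make the splitting and naming rules in the helpers above concrete, here is a small self-contained sketch. It repeats the `split_by_h2` logic from the listing, but `H2_RE` and `slugify` are stand-ins written for this example (the project's `_H2_RE` and `slugify` are not shown on this page and may behave differently), and `planned_filenames` is a hypothetical helper that only computes names without writing files.

```python
import re

# Assumption: stand-in for the project's _H2_RE, which is not shown above.
H2_RE = re.compile(r"^##\s+(.+?)\s*$")


def split_by_h2(text: str) -> list[tuple[str, str]]:
    """Same logic as the split_by_h2 helper listed above, using the stand-in regex."""
    sections: list[tuple[str, list[str]]] = [("preamble", [])]
    for line in text.splitlines():
        m = H2_RE.match(line)
        if m:
            sections.append((m.group(1).strip(), [line]))
        else:
            sections[-1][1].append(line)
    # Drop the preamble when it holds nothing but whitespace.
    if not "\n".join(sections[0][1]).strip():
        sections.pop(0)
    return [(t, "\n".join(b).rstrip() + "\n") for t, b in sections]


def slugify(title: str) -> str:
    """Hypothetical slug helper; the project's own slugify may differ."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") or "section"


def planned_filenames(sections: list[tuple[str, str]]) -> list[str]:
    """Reproduce the NNN-slug.md naming, including the duplicate-slug suffix."""
    names: list[str] = []
    used: set[str] = set()
    for idx, (title, _body) in enumerate(sections, start=1):
        slug = slugify(title)
        if slug in used:
            slug = f"{slug}-{idx:03d}"
        used.add(slug)
        names.append(f"{idx:03d}-{slug}.md")
    return names


doc = "# Manual\n\nIntro text.\n\n## Setup\nStep one.\n\n## Usage\nRun it.\n\n## Setup\nAgain.\n"
sections = split_by_h2(doc)
print([title for title, _ in sections])
# ['preamble', 'Setup', 'Usage', 'Setup']
print(planned_filenames(sections))
# ['001-preamble.md', '002-setup.md', '003-usage.md', '004-setup-004.md']
```

The repeated "Setup" heading shows the collision rule: the second occurrence gets its index appended to the slug, so the two files never share a name.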
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate idempotentHint=true and destructiveHint=false. The description adds that this is 'standalone' (no side effects on shelf/index) and mentions optional splitting. This provides useful behavioral context beyond annotations without contradiction.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
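
The filesystem effects can be read off the handler in the implementation reference: it creates `out_dir` if needed, writes `<stem>.md` there, and, when a split happens, writes sections into a sibling `<stem>/` directory that `write_split_files` deletes first by default (`clean_existing=True`). The sketch below only mirrors that path arithmetic; `expected_outputs` is a hypothetical helper and the paths are placeholders.

```python
from pathlib import Path


def expected_outputs(pdf_path: str, out_dir: str) -> tuple[Path, Path]:
    """Return (markdown_file, split_directory) as the handler derives them.

    Hypothetical helper: it mirrors the path arithmetic in convert_pdf
    and touches nothing on disk.
    """
    pdf = Path(pdf_path).expanduser().resolve()
    out = Path(out_dir).expanduser().resolve()
    return out / f"{pdf.stem}.md", out / pdf.stem


# Placeholder paths, for illustration only.
md_file, split_dir = expected_outputs("/tmp/manual.pdf", "/tmp/converted")
```

An agent could run a check like this before calling the tool to see whether an existing `<stem>/` directory under `out_dir` would be replaced on a split.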

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is only two sentences, front-loading the essential purpose and usage context. No extraneous words; every part earns its place.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's moderate complexity and the presence of an output schema, the description covers key aspects: conversion, standalone nature, and optional split. It lacks details about prerequisites (e.g., file existence checks) but overall is sufficient.
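
The prerequisites mentioned here are visible in `pdf_to_markdown` in the implementation reference: the path must exist and must end in `.pdf`. A caller-side pre-check along those lines could look like the sketch below. `preflight_pdf` is a hypothetical helper, and `ValueError` stands in for the server's own `ConversionError`, which is not importable from this page.

```python
import tempfile
from pathlib import Path


def preflight_pdf(pdf_path: str) -> Path:
    """Hypothetical pre-check mirroring the validation in pdf_to_markdown."""
    path = Path(pdf_path).expanduser().resolve()
    if not path.exists():
        raise FileNotFoundError(f"PDF not found: {path}")
    if path.suffix.lower() != ".pdf":
        # The server raises its own ConversionError here; ValueError is a stand-in.
        raise ValueError(f"Expected a .pdf file, got {path.suffix or '<no extension>'}")
    return path


# Usage with a throwaway file; the bytes are just a placeholder header.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.pdf"
    sample.write_bytes(b"%PDF-1.4\n")
    checked = preflight_pdf(str(sample))
```

The check only looks at the path and extension, as the listed code does; it does not verify that the file is a well-formed PDF.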

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The input schema already contains detailed descriptions for all four properties (pdf_path, out_dir, quality, split). The description's mention of splitting by H2 adds no information beyond what the schema provides. With schema coverage this high, a minimal additional contribution from the description is acceptable.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
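
One such non-obvious relationship appears in the handler and `should_split` in the implementation reference: `split=True` is only a request, and a split actually happens only if the text exceeds a byte threshold and contains at least two H2 headings. The sketch below illustrates that gate. `will_split` is a hypothetical helper, `H2_RE` is a stand-in for the project's `_H2_RE`, and `ASSUMED_THRESHOLD_BYTES` is an invented value because the real `DEFAULT_SPLIT_THRESHOLD_BYTES` is not shown.

```python
import re

H2_RE = re.compile(r"^##\s+\S")  # stand-in for the project's _H2_RE
ASSUMED_THRESHOLD_BYTES = 50_000  # hypothetical; the real default is not shown


def will_split(
    text: str, split: bool, threshold_bytes: int = ASSUMED_THRESHOLD_BYTES
) -> bool:
    """Sketch of the gate: the flag, the size threshold, and the H2 count must all pass."""
    if not split:
        return False
    if len(text.encode("utf-8")) <= threshold_bytes:
        return False
    return sum(1 for line in text.splitlines() if H2_RE.match(line)) >= 2


small = "## A\ntext\n## B\ntext\n"
large = small + "x" * 60_000

print(will_split(small, split=True))   # False: below the assumed threshold
print(will_split(large, split=True))   # True: large enough, two H2 headings
print(will_split(large, split=False))  # False: split was not requested
```

In the non-split cases the tool still writes the single Markdown file and reports `split_into` as 0.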

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states 'Standalone PDF → Markdown conversion (no shelf, no INDEX update)', which specifies the action (conversion), input (PDF), output (Markdown), and distinguishing constraints. It uses a specific verb and resource, and the title 'Convert a PDF to Markdown' reinforces this.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explicitly says 'Use when you want the converted file but don't yet want to commit it to a shelf', providing a clear context for use. It implies alternatives (shelf-related tools) but does not explicitly list them or state when not to use this tool.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ignatenkofi/docshelf-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.