docshelf-mcp


docshelf_convert_pdf

Idempotent

Convert PDF files to Markdown without committing to a shelf. Optionally split output by H2 headings for easier navigation.

Instructions

Standalone PDF → Markdown conversion (no shelf, no INDEX update).

Use when you want the converted file but don't yet want to commit it to a shelf. Optionally splits the result by H2.

Input Schema


Name	Required	Description	Default
`params`	Yes

Output Schema


Name	Required	Description	Default
`result`	Yes

Implementation Reference

src/docshelf_mcp/server.py:164-173 (registration)

Registration of the 'docshelf_convert_pdf' MCP tool via the @mcp.tool decorator. Delegates to t.convert_pdf(params).

@mcp.tool(
    name="docshelf_convert_pdf",
    annotations={
        "title": "Convert a PDF to Markdown",
        "readOnlyHint": False,
        "destructiveHint": False,
        "idempotentHint": True,
        "openWorldHint": False,
    },
)

src/docshelf_mcp/tools.py:291-315 (handler)

Core handler: converts PDF to Markdown via pdf_to_markdown(), cleans artefacts, optionally splits by H2 headings, and returns output paths.

def convert_pdf(params: ConvertPdfInput) -> dict:
    """Implementation of the ``convert_pdf`` MCP tool."""
    pdf_path = Path(params.pdf_path).expanduser().resolve()
    out_dir = Path(params.out_dir).expanduser().resolve()
    out_dir.mkdir(parents=True, exist_ok=True)

    raw = pdf_to_markdown(pdf_path, quality=params.quality)
    cleaned = clean_markdown(raw)
    out_md = out_dir / f"{pdf_path.stem}.md"
    out_md.write_text(cleaned, encoding="utf-8")

    section_paths: list[Path] = []
    if params.split and should_split(cleaned):
        sections = split_by_h2(cleaned)
        if len(sections) >= 2:
            section_paths = write_split_files(sections, out_dir / pdf_path.stem)

    return {
        "status": "ok",
        "source_pdf": str(pdf_path),
        "output_markdown": str(out_md),
        "size_bytes": out_md.stat().st_size,
        "split_into": len(section_paths),
        "section_paths": [str(p) for p in section_paths],
    }
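To make the returned paths concrete, here is a minimal sketch of the file layout the handler above produces: the main Markdown file lands at `out_dir/<stem>.md`, and section files (written only when `split` is requested, `should_split()` passes, and at least two sections exist) land in `out_dir/<stem>/NNN-slug.md`. The function `planned_outputs` and the example slugs are hypothetical; the real slugs come from `slugify()`, which is not shown here.

```python
from pathlib import PurePosixPath


def planned_outputs(
    pdf_path: str, out_dir: str, slugs: list[str]
) -> tuple[PurePosixPath, list[PurePosixPath]]:
    """Mirror the handler's path arithmetic without touching the filesystem."""
    pdf = PurePosixPath(pdf_path)
    out = PurePosixPath(out_dir)

    # Main output: the PDF's stem with a .md suffix, directly in out_dir.
    main_md = out / f"{pdf.stem}.md"

    # Split output: a sibling subdirectory named after the stem,
    # with files numbered from 001 in section order.
    split_dir = out / pdf.stem
    section_files = [
        split_dir / f"{idx:03d}-{slug}.md"
        for idx, slug in enumerate(slugs, start=1)
    ]
    return main_md, section_files


# Hypothetical inputs, for illustration only.
main_md, section_files = planned_outputs(
    "/docs/report.pdf", "/out", ["preamble", "setup", "usage"]
)
print(main_md)
for path in section_files:
    print(path)
```

Because the split directory shares its name with the PDF's stem, two PDFs with the same stem converted into the same `out_dir` would overwrite each other's output.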

src/docshelf_mcp/tools.py:136-157 (schema)

Pydantic input schema for the convert_pdf tool: pdf_path (required), out_dir (required), quality (default 'fast'), split (default False).

class ConvertPdfInput(_BaseInput):
    pdf_path: str = Field(
        ...,
        description="Absolute path to the source .pdf file.",
        min_length=1,
    )
    out_dir: str = Field(
        ...,
        description="Output directory. Created if missing. The resulting .md file "
        "uses the PDF's stem as its filename.",
        min_length=1,
    )
    quality: Quality = Field(
        default="fast",
        description="'fast' (pymupdf4llm) or 'high' (marker-pdf).",
    )
    split: bool = Field(
        default=False,
        description="If True, also split the converted Markdown by H2 into "
        "a sibling subdirectory.",
    )
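As an illustration of how these fields fit together, the sketch below builds the arguments object a client might send. The single top-level `params` key follows the Input Schema table above. The file paths are made-up placeholders, and the assumption that omitted optional fields fall back to their declared defaults reflects ordinary Pydantic behaviour rather than anything shown in this listing.

```python
import json

# Hypothetical paths, for illustration only.
arguments = {
    "params": {
        "pdf_path": "/home/user/papers/report.pdf",  # required, absolute path
        "out_dir": "/home/user/converted",           # required, created if missing
        "quality": "fast",                           # "fast" or "high"
        "split": True,                               # request an H2 split
    }
}

# Only the two required fields; quality and split are left to their defaults
# ("fast" and False according to the schema).
minimal_arguments = {
    "params": {
        "pdf_path": "/home/user/papers/report.pdf",
        "out_dir": "/home/user/converted",
    }
}

payload = json.dumps(arguments, indent=2)
print(payload)
```

Note that `split=True` is a request, not a guarantee: the handler still skips the split when the document is small or has fewer than two H2 headings.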

src/docshelf_mcp/core/converter.py:60-88 (helper)

Helper that dispatches PDF-to-Markdown conversion to either the pymupdf4llm (fast) or the marker-pdf (high) engine.

def pdf_to_markdown(pdf_path: Path | str, quality: Quality = "fast") -> str:
    """Convert a PDF file to Markdown.

    Args:
        pdf_path: Path to the source PDF.
        quality: ``"fast"`` (default, pymupdf4llm) or ``"high"`` (marker-pdf).

    Returns:
        The extracted Markdown text. NOT cleaned —
        run :func:`docshelf_mcp.core.splitter.clean_markdown` to remove
        PDF-extraction artefacts.

    Raises:
        FileNotFoundError: If ``pdf_path`` doesn't exist.
        ConversionError: If the requested engine is missing or fails.
    """
    pdf_path = Path(pdf_path).expanduser().resolve()
    if not pdf_path.exists():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")
    if pdf_path.suffix.lower() != ".pdf":
        raise ConversionError(
            f"Expected a .pdf file, got {pdf_path.suffix or '<no extension>'}"
        )

    if quality == "fast":
        return _convert_fast(pdf_path)
    if quality == "high":
        return _convert_high(pdf_path)
    raise ConversionError(f"Unknown quality preset: {quality!r}")

src/docshelf_mcp/core/splitter.py:74-161 (helper)

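Helpers from the splitter module: clean_markdown() cleans PDF-extraction artefacts (fake H1 headings, excessive blank lines) from the converted Markdown; should_split(), split_by_h2(), and write_split_files() decide whether to split, cut the text at H2 boundaries, and write one file per section.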

def clean_markdown(text: str) -> str:
    """Smooth out common PDF-extraction artefacts in a Markdown string.

    Returns the cleaned text (always ends with a single trailing newline).
    """
    return "\n".join(_clean_lines(text.splitlines())) + "\n"


def should_split(text: str, threshold_bytes: int = DEFAULT_SPLIT_THRESHOLD_BYTES) -> bool:
    """Heuristic: does this document warrant a chapter-by-chapter split?

    True if the UTF-8 byte length exceeds ``threshold_bytes`` AND the document
    has at least two H2 headings to split on. Returning False here means the
    caller should keep the document as a single file.
    """
    if len(text.encode("utf-8")) <= threshold_bytes:
        return False
    h2_count = sum(1 for line in text.splitlines() if _H2_RE.match(line))
    return h2_count >= 2


def split_by_h2(text: str) -> list[tuple[str, str]]:
    """Split a Markdown string on H2 boundaries.

    Returns a list of ``(title, body)`` pairs. Content before the first H2
    is returned with title ``"preamble"`` and is omitted if it is entirely
    whitespace.

    Each body starts at its ``## `` heading line — so writing the body verbatim
    to a file preserves the heading.
    """
    sections: list[tuple[str, list[str]]] = [("preamble", [])]
    for line in text.splitlines():
        m = _H2_RE.match(line)
        if m:
            title = m.group(1).strip()
            sections.append((title, [line]))
        else:
            sections[-1][1].append(line)

    if not "\n".join(sections[0][1]).strip():
        sections.pop(0)

    return [(title, "\n".join(body).rstrip() + "\n") for title, body in sections]


def write_split_files(
    sections: list[tuple[str, str]],
    target_dir: Path,
    *,
    clean_existing: bool = True,
) -> list[Path]:
    """Write each ``(title, body)`` section to ``target_dir/NNN-slug.md``.

    Args:
        sections: Output of :func:`split_by_h2`.
        target_dir: Output directory. Created if missing.
        clean_existing: If True (default), nukes ``target_dir`` first so the
            split is fully idempotent on re-run.

    Returns:
        List of written :class:`Path` objects, in section order.
    """
    if clean_existing and target_dir.exists():
        shutil.rmtree(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    written: list[Path] = []
    used_slugs: set[str] = set()
    for idx, (title, body) in enumerate(sections, start=1):
        slug = slugify(title)
        if slug in used_slugs:
            slug = f"{slug}-{idx:03d}"
        used_slugs.add(slug)

        filename = f"{idx:03d}-{slug}.md"
        path = target_dir / filename

        # If the body doesn't already start with a heading and the slice has a
        # real title (not "preamble"), prepend one so the standalone file is
        # self-explanatory.
        if title != "preamble" and not body.lstrip().startswith("#"):
            body = f"# {title}\n\n{body}"

        path.write_text(body, encoding="utf-8")
        written.append(path)
    return written
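The splitting rules are easier to follow on a small input. The sketch below reproduces `should_split()` and `split_by_h2()` from the listing above so it runs on its own. Two things are assumptions: the `_H2_RE` pattern, because the real one is not shown, and the threshold values, which are passed explicitly because the value of `DEFAULT_SPLIT_THRESHOLD_BYTES` is not shown either.

```python
import re

# Assumption: the project's _H2_RE is not shown. This pattern matches lines
# of the form "## Title" and captures the title; "#" and "###" do not match.
_H2_RE = re.compile(r"^##\s+(.+?)\s*$")


def should_split(text: str, threshold_bytes: int) -> bool:
    """True only if the text is large enough AND has at least two H2s."""
    if len(text.encode("utf-8")) <= threshold_bytes:
        return False
    return sum(1 for line in text.splitlines() if _H2_RE.match(line)) >= 2


def split_by_h2(text: str) -> list[tuple[str, str]]:
    """Return (title, body) pairs; text before the first H2 is 'preamble'."""
    sections: list[tuple[str, list[str]]] = [("preamble", [])]
    for line in text.splitlines():
        m = _H2_RE.match(line)
        if m:
            sections.append((m.group(1).strip(), [line]))
        else:
            sections[-1][1].append(line)
    # Drop the preamble when it holds nothing but whitespace.
    if not "\n".join(sections[0][1]).strip():
        sections.pop(0)
    return [(t, "\n".join(b).rstrip() + "\n") for t, b in sections]


doc = "# Guide\n\nIntro text.\n\n## Setup\n\nInstall it.\n\n## Usage\n\nRun it.\n"

sections = split_by_h2(doc)
titles = [title for title, _ in sections]
print(titles)

# The same document splits or not depending on the size threshold.
small_threshold = should_split(doc, threshold_bytes=10)
large_threshold = should_split(doc, threshold_bytes=10_000)
print(small_threshold, large_threshold)
```

Each body keeps its own `## ` heading line, which is why `write_split_files()` only prepends a heading when a body does not already start with one.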

Tool Definition Quality

A (4.2/5.0)

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate idempotentHint=true and destructiveHint=false. The description adds that this is 'standalone' (no side effects on shelf/index) and mentions optional splitting. This provides useful behavioral context beyond annotations without contradiction.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is only two sentences, front-loading the essential purpose and usage context. No extraneous words; every part earns its place.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's moderate complexity and the presence of an output schema, the description covers key aspects: conversion, standalone nature, and optional split. It lacks details about prerequisites (e.g., file existence checks) but overall is sufficient.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The input schema already contains detailed descriptions for all four properties (pdf_path, out_dir, quality, split). The description's mention of splitting by H2 adds nothing beyond what the schema provides. With schema coverage this high, a minimal additional contribution from the description is acceptable.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states 'Standalone PDF → Markdown conversion (no shelf, no INDEX update)', which specifies the action (conversion), input (PDF), output (Markdown), and distinguishing constraints. It uses a specific verb and resource, and the title 'Convert a PDF to Markdown' reinforces this.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explicitly says 'Use when you want the converted file but don't yet want to commit it to a shelf', providing a clear context for use. It implies alternatives (shelf-related tools) but does not explicitly list them or state when not to use this tool.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ignatenkofi/docshelf-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.