docshelf_add_document

Idempotent

Add a PDF or Markdown file to a categorized document shelf. Automatically converts PDFs to Markdown, splits large documents by headings, and updates the navigation index.

Instructions

Add a PDF or Markdown file to the shelf and refresh INDEX.md.

  • .pdf is converted to Markdown (pymupdf4llm by default; pass quality='high' to use marker-pdf).

  • Documents larger than 50 KB with multiple H2 headings are split into one file per section (turn this off with split=False).

  • INDEX.md is regenerated automatically. The caller still owns the git commit / push step.
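The split rule above can be sketched as follows. This is a minimal illustration, not the server's code: `needs_split` and `split_on_h2` are hypothetical names, and reading "50 KB" as 50 * 1024 bytes of UTF-8 text is an assumption.

```python
import re

# Assumption: "50 KB" means 50 * 1024 bytes of UTF-8 text. The real threshold
# is read from the shelf's configuration.
SPLIT_THRESHOLD_BYTES = 50 * 1024

# An H2 heading is a line starting with exactly two '#' and a space.
# Limitation of this sketch: it also matches "## " lines inside code fences.
H2_PATTERN = re.compile(r"^## .+$", re.MULTILINE)


def needs_split(markdown: str, threshold: int = SPLIT_THRESHOLD_BYTES) -> bool:
    """Hypothetical check: large enough and at least two H2 headings."""
    size = len(markdown.encode("utf-8"))
    return size > threshold and len(H2_PATTERN.findall(markdown)) >= 2


def split_on_h2(markdown: str) -> list[str]:
    """Hypothetical splitter: one chunk per H2, plus any non-empty preamble."""
    starts = [m.start() for m in H2_PATTERN.finditer(markdown)]
    if not starts:
        return [markdown]
    bounds = starts + [len(markdown)]
    chunks = [markdown[a:b] for a, b in zip(bounds, bounds[1:])]
    preamble = markdown[: starts[0]]
    if preamble.strip():
        chunks.insert(0, preamble)
    return chunks
```

A small document with many headings, or a large one with a single H2, stays in one file under this rule.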

Input Schema

Name    Required  Description  Default
params  Yes
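The table lists only the `params` wrapper. As a rough sketch, a call's arguments might look like the payload below. The field names follow the AddDocumentInput model in the Implementation Reference; the path, title, and description values are made up for illustration.

```python
import json

# Field names come from AddDocumentInput; every value here is hypothetical.
arguments = {
    "params": {
        "source_path": "/home/user/manuals/thinkpad-x1.pdf",
        "category": "laptops",
        "title": "ThinkPad X1 Manual",
        "description": "User manual for the ThinkPad X1.",
        "split": True,      # default
        "quality": "fast",  # default; "high" uses marker-pdf
        # "shelf_path" is omitted, so the server falls back to $DOCSHELF_ROOT
        # or its working directory.
    }
}

print(json.dumps(arguments, indent=2))
```

Only `source_path`, `category`, and `title` are required; the remaining fields have defaults.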

Output Schema

Name    Required  Description  Default
result  Yes

Implementation Reference

  • The core implementation of add_document on the Shelf class. Handles PDF conversion, Markdown writing, splitting, metadata updates, and index rebuild.
    def add_document(
        self,
        source: Path | str,
        *,
        category: str,
        title: str,
        description: str = "",
        split: bool = True,
        quality: Quality = "fast",
    ) -> AddResult:
        """Add (or replace) a document in the shelf.
    
        Args:
            source: Path to a ``.pdf`` or ``.md`` file.
            category: Category bucket (e.g. ``"laptops"``). Created if missing.
            title: Human-readable title — used in the INDEX entry.
            description: Short description (one sentence). Empty by default.
            split: If True (default) and the document is large enough, split it
                by H2 into a sibling subdirectory.
            quality: PDF conversion quality preset (``"fast"`` or ``"high"``).
    
        Returns:
            :class:`AddResult` with the on-disk paths.
    
        Raises:
            FileNotFoundError: ``source`` doesn't exist.
            ValueError: ``source`` is not a .pdf or .md file.
        """
        source = Path(source).expanduser().resolve()
        if not source.exists():
            raise FileNotFoundError(f"Source not found: {source}")
    
        suffix = source.suffix.lower()
        if suffix not in {".pdf", ".md"}:
            raise ValueError(
                f"Unsupported source type {suffix!r}; expected .pdf or .md"
            )
    
        category_slug = slugify(category, max_len=80) or "uncategorized"
        category_dir = self.root / "docs" / category_slug
        category_dir.mkdir(parents=True, exist_ok=True)
    
        doc_stem = slugify(title, max_len=80) or "document"
        doc_path = category_dir / f"{doc_stem}.md"
    
        if suffix == ".pdf":
            raw_md = pdf_to_markdown(source, quality=quality)
            converted_from_pdf = True
        else:
            raw_md = source.read_text(encoding="utf-8", errors="replace")
            converted_from_pdf = False
    
        cleaned = clean_markdown(raw_md)
        if not cleaned.lstrip().startswith("#"):
            cleaned = f"# {title}\n\n{cleaned}"
        doc_path.write_text(cleaned, encoding="utf-8")
    
        section_paths: list[Path] = []
        was_split = False
        split_dir = category_dir / doc_stem
        if split and should_split(cleaned, self.config.split_threshold_bytes):
            sections = split_by_h2(cleaned)
            if len(sections) >= 2:
                section_paths = write_split_files(sections, split_dir)
                was_split = True
        if not was_split and split_dir.is_dir():
            # No split was written this time (splitting disabled, document too
            # small, or fewer than two H2 sections): wipe the stale split.
            import shutil
    
            shutil.rmtree(split_dir)
    
        # Record title/description in .meta.json for the indexer.
        self._update_category_meta(category_dir, doc_path.name, title, description)
    
        # Auto-rebuild INDEX.md so the on-disk state and the index stay in sync.
        # Callers that need batch performance can short-circuit by going one
        # layer down (write files manually, then call rebuild_index once).
        self.rebuild_index()
    
        return AddResult(
            document_path=doc_path,
            section_paths=section_paths,
            was_split=was_split,
            converted_from_pdf=converted_from_pdf,
        )
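The listing above relies on a `slugify` helper whose body is not shown. The sketch below is one plausible version, written only to illustrate why the `or "uncategorized"` and `or "document"` fallbacks exist and why re-adding a document with the same title replaces the same file. The implementation and the `document_path` helper are assumptions, not the project's code.

```python
import re
import unicodedata
from pathlib import Path


def slugify(text: str, max_len: int = 80) -> str:
    """Hypothetical slugifier: lowercase ASCII words joined by hyphens."""
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")
    return slug[:max_len].rstrip("-")


def document_path(root: Path, category: str, title: str) -> Path:
    """Hypothetical helper mirroring the layout docs/<category>/<title>.md."""
    category_slug = slugify(category) or "uncategorized"  # empty slug falls back
    doc_stem = slugify(title) or "document"
    return root / "docs" / category_slug / f"{doc_stem}.md"


print(document_path(Path("shelf"), "Laptops", "ThinkPad X1 Manual").as_posix())
```

Because the path depends only on the category and title, calling the tool twice with the same pair writes to the same location, which is consistent with the tool's idempotent hint.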
  • The thin wrapper in tools.py that resolves the shelf, calls Shelf.add_document, rebuilds the index, and returns a serializable dict response.
    def add_document(params: AddDocumentInput) -> dict:
        """Implementation of the ``add_document`` MCP tool."""
        shelf = _resolve_shelf(params.shelf_path)
        result = shelf.add_document(
            params.source_path,
            category=params.category,
            title=params.title,
            description=params.description,
            split=params.split,
            quality=params.quality,
        )
        # Note: Shelf.add_document already calls rebuild_index(), so this
        # second call repeats that work.
        shelf.rebuild_index()
        return {
            "status": "ok",
            "shelf_root": str(shelf.root),
            "document_path": str(result.document_path.relative_to(shelf.root)),
            "section_paths": [str(p.relative_to(shelf.root)) for p in result.section_paths],
            "was_split": result.was_split,
            "section_count": len(result.section_paths),
            "converted_from_pdf": result.converted_from_pdf,
            "index_path": "INDEX.md",
            "next_steps": (
                f"Commit the changes ('git add . && git commit -m \"docs: add {params.title}\"') "
                "to make the new entry visible via raw URLs."
            ),
        }
  • Pydantic model AddDocumentInput with all input fields (source_path, category, title, description, split, quality, shelf_path) and validation.
    class AddDocumentInput(_BaseInput):
        """Input for ``add_document``."""
    
        source_path: str = Field(
            ...,
            description="Absolute path to the source .pdf or .md file on disk.",
            min_length=1,
        )
        category: str = Field(
            ...,
            description="Category bucket — e.g. 'laptops', 'recipes', 'research-papers'. "
            "Created if missing.",
            min_length=1,
            max_length=80,
        )
        title: str = Field(
            ...,
            description="Human-readable document title. Used as the INDEX entry and "
            "(slugified) as the filename.",
            min_length=1,
            max_length=200,
        )
        description: str = Field(
            default="",
            description="Optional one-sentence description shown next to the entry "
            "in INDEX.md.",
            max_length=500,
        )
        split: bool = Field(
            default=True,
            description="Auto-split large documents (>50 KB) by H2 heading. "
            "Recommended unless the source is already small.",
        )
        quality: Quality = Field(
            default="fast",
            description="PDF conversion quality: 'fast' (pymupdf4llm, default) or "
            "'high' (marker-pdf, requires optional install).",
        )
        shelf_path: str | None = Field(
            default=None,
            description="Path to the shelf root directory. Defaults to $DOCSHELF_ROOT "
            "or the server's working directory.",
        )
  • MCP tool registration via @mcp.tool decorator with name='docshelf_add_document', annotations, and delegation to tools.add_document.
    @mcp.tool(
        name="docshelf_add_document",
        annotations={
            "title": "Add a document to the shelf",
            "readOnlyHint": False,
            "destructiveHint": False,
            "idempotentHint": True,
            "openWorldHint": False,
        },
    )
    def add_document(params: t.AddDocumentInput) -> str:
        """Add a PDF or Markdown file to the shelf and refresh INDEX.md.
    
        * ``.pdf`` is converted to Markdown (``pymupdf4llm`` by default; pass
          ``quality='high'`` to use ``marker-pdf``).
        * Documents larger than 50 KB with multiple H2 headings are split into
          one file per section (turn this off with ``split=False``).
        * INDEX.md is regenerated automatically. The caller still owns the git
          commit / push step.
        """
        try:
            return _serialize(t.add_document(params))
        except Exception as exc:
            logger.exception("add_document failed")
            return _serialize({"status": "error", "error": str(exc), "type": type(exc).__name__})
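The registration above never raises to the client: failures come back as a JSON payload with `status`, `error`, and `type` keys. The sketch below reproduces that pattern in isolation. `_serialize` here is a hypothetical stand-in based on `json.dumps` (the real serializer is not shown), and `run_tool` and `failing_add` are illustrative names.

```python
import json


def _serialize(payload: dict) -> str:
    """Hypothetical stand-in for the server's serializer."""
    return json.dumps(payload, indent=2)


def run_tool(fn, *args) -> str:
    """Call fn and return JSON, turning any exception into an error payload."""
    try:
        return _serialize(fn(*args))
    except Exception as exc:
        return _serialize(
            {"status": "error", "error": str(exc), "type": type(exc).__name__}
        )


def failing_add(path: str) -> dict:
    """Simulates the FileNotFoundError raised for a missing source file."""
    raise FileNotFoundError(f"Source not found: {path}")


# A client should branch on "status" rather than expect an exception.
response = json.loads(run_tool(failing_add, "/missing/file.pdf"))
if response["status"] == "error":
    print(f"{response['type']}: {response['error']}")
```

Under this pattern, a caller distinguishes a missing file from an unsupported extension by reading the `type` field (`FileNotFoundError` versus `ValueError`).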
  • AddResult dataclass returned by Shelf.add_document containing document_path, section_paths, was_split, and converted_from_pdf.
    @dataclass
    class AddResult:
        """Outcome of :meth:`Shelf.add_document`."""
    
        document_path: Path
        section_paths: list[Path]
        was_split: bool
        converted_from_pdf: bool
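To make the relationship between AddResult and the tool's JSON response concrete, here is a small sketch of the conversion the wrapper performs: absolute paths become strings relative to the shelf root. `to_response` is a hypothetical helper and the file names are invented; only the field names come from the listings above.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class AddResult:
    document_path: Path
    section_paths: list[Path]
    was_split: bool
    converted_from_pdf: bool


def to_response(result: AddResult, shelf_root: Path) -> dict:
    """Hypothetical helper: relative, string-typed paths are JSON-safe."""
    return {
        "document_path": str(result.document_path.relative_to(shelf_root)),
        "section_paths": [str(p.relative_to(shelf_root)) for p in result.section_paths],
        "was_split": result.was_split,
        "section_count": len(result.section_paths),
        "converted_from_pdf": result.converted_from_pdf,
    }


root = Path("/shelf")
result = AddResult(
    document_path=root / "docs" / "laptops" / "manual.md",
    section_paths=[
        root / "docs" / "laptops" / "manual" / "01-setup.md",   # invented names
        root / "docs" / "laptops" / "manual" / "02-battery.md",
    ],
    was_split=True,
    converted_from_pdf=True,
)
resp = to_response(result, root)
print(json.dumps(resp, indent=2))
```

Note that `Path.relative_to` raises ValueError if a path lies outside the shelf root, so this conversion assumes every result path was written under it.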
Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The annotations declare idempotentHint=true and destructiveHint=false. The description adds behavioral details the annotations cannot express: PDF conversion, auto-splitting logic, INDEX.md regeneration, and the caller's responsibility for committing. It does not contradict the annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is front-loaded with the main purpose followed by bullet points. Every sentence adds value without redundancy, making it efficient and easy to parse.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the complexity (multiple parameters, conversion, splitting, index update) and the presence of an output schema, the description covers most important behaviors. Minor gap: no mention of error handling or prerequisites (e.g., file existence), but overall sufficient.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The input schema already describes each parameter well, and the description adds context beyond it: the default conversion method, the 'quality' option, and the split behavior.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the action ('Add a PDF or Markdown file to the shelf') and the outcome ('refresh INDEX.md'). It distinguishes from sibling tools by specifying the file types and the shelf context.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description does not say when to use an alternative tool, but it implies the use case: adding new documents to the shelf. The sibling tool names (e.g., convert_pdf, search) are different enough that the context provides implicit guidance.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ignatenkofi/docshelf-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.