docshelf-mcp
Enables hosting a docshelf repository on GitHub, with INDEX.md containing raw.githubusercontent.com URLs for AI agents to fetch individual document sections on demand.
Put your manuals on a shelf, hand the AI the index.
MCP server for AI-friendly doc shelves
An MCP server that turns a folder of PDFs and Markdown into a chat-project-friendly document collection: AI agents see a single INDEX.md and pull individual sections by raw GitHub URL on demand, instead of choking on a 4 MB datasheet.
Why?
You have 30 hardware manuals, or 200 cooking recipes, or a stack of research PDFs.
You want Claude / ChatGPT / whatever to be able to answer questions across them — but:
❌ You can't dump 80 MB of PDFs into a chat project. It won't fit, and you'd burn the context window even if it did.
❌ You can manually copy-paste the relevant pages, but only after you remember which manual mentioned the thing you need.
❌ Long files mean retrieval is wasteful — the model loads the whole RouterOS guide just to answer a question about VLANs.
docshelf-mcp solves it like this:
1. You drop a PDF onto the shelf.
2. The shelf converts it to Markdown, splits big files chapter-by-chapter, and regenerates a navigation INDEX.md.
3. You commit and push to a public GitHub repo.
4. Add only INDEX.md to your Claude project. When the model needs a section, it fetches it via raw.githubusercontent.com.
Result: a 5 KB index pointing at a 50 MB collection. The model reads exactly the chapter it needs.
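As a rough illustration of the fetch step, the sketch below maps a repository URL and a repo-relative path to the address that raw.githubusercontent.com serves. The helper name `raw_url` and the default branch `main` are assumptions made for this example, not part of the docshelf-mcp API.

```python
from urllib.parse import quote, urlparse


def raw_url(remote: str, path: str, branch: str = "main") -> str:
    """Map a GitHub repo URL plus a repo-relative path to its raw URL."""
    parts = urlparse(remote).path.strip("/").split("/")
    if len(parts) < 2:
        raise ValueError(f"not a GitHub repo URL: {remote!r}")
    owner, repo = parts[0], parts[1].removesuffix(".git")
    # Percent-encode the path but keep "/" so directories survive.
    encoded = quote(path.lstrip("/"))
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{encoded}"


print(raw_url("https://github.com/me/my-homelab-docs",
              "docs/routers/mikrotik-routeros/002-bridging.md"))
```

An index built from URLs of this shape stays small because each entry costs one line, however large the file behind it is.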
📦 Install
From PyPI (once the first tagged release is published):
# uv (recommended)
uv pip install docshelf-mcp
# or plain pip
pip install docshelf-mcp
Or straight from main (always-latest, no PyPI required):
pip install "git+https://github.com/ignatenkofi/docshelf-mcp"
Optional high-quality PDF engine (pulls ~2 GB of PyTorch, only if you need it):
pip install "docshelf-mcp[high-quality]"
📋 Project Prompt
Drop this into the Custom Instructions of any Claude project that consumes
a docshelf-style INDEX.md:
This project uses the docshelf pattern.
INDEX.md is the entry point. When answering: read INDEX → fetch ONLY the needed section file via its GitHub raw URL (use WebFetch / fetch / curl). Don't load full source files into context. For large manuals split into chapters, follow INDEX → chapter SUBINDEX → section file.
Medium (~150 words) and full (~400 words) versions, plus how-to snippets for
Claude Code, Claude Desktop, and the Anthropic API, live in
docs/PROJECT_PROMPT.md.
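The read-INDEX-then-fetch-one-section flow that the prompt describes might look like this when done by hand. This is a minimal sketch: it assumes INDEX.md entries are ordinary Markdown links to raw URLs, which the text above does not spell out, and `find_sections` / `fetch_section` are hypothetical helper names.

```python
import re
from urllib.request import urlopen

# Assumed entry shape: [Title](https://raw.githubusercontent.com/...)
LINK_RE = re.compile(r"\[([^\]]+)\]\((https://raw\.githubusercontent\.com/[^)\s]+)\)")


def find_sections(index_text: str, keyword: str) -> list[tuple[str, str]]:
    """Return (title, url) pairs whose title mentions the keyword."""
    needle = keyword.lower()
    return [(t, u) for t, u in LINK_RE.findall(index_text) if needle in t.lower()]


def fetch_section(url: str) -> str:
    """Download one section file; nothing else from the shelf is loaded."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


index_text = """\
- [RouterOS: Bridging](https://raw.githubusercontent.com/me/my-homelab-docs/main/docs/routers/mikrotik-routeros/002-bridging.md)
- [RouterOS: Firewall](https://raw.githubusercontent.com/me/my-homelab-docs/main/docs/routers/mikrotik-routeros/003-firewall.md)
"""
matches = find_sections(index_text, "firewall")
print(matches[0][1])  # the only URL worth passing to fetch_section()
```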
Quickstart (Python library)
from docshelf_mcp import Shelf

shelf = Shelf("~/Documents/my-homelab-docs").init(
    name="My HomeLab Docs",
    remote="https://github.com/me/my-homelab-docs",
    default_categories=["routers", "switches", "psu", "motherboards"],
)
shelf.add_document(
    "~/Downloads/MIKROTIK_RouterOS.pdf",
    category="routers",
    title="Mikrotik RouterOS — full manual",
    description="Official RouterOS reference, split by chapter.",
)
# → docs/routers/mikrotik-routeros-full-manual.md + docs/routers/.../001-..md, 002-..md, ...
# → INDEX.md is regenerated automatically.
Then in the shelf directory: git add . && git commit -m "docs: add RouterOS" && git push.
In your Claude project, attach only INDEX.md. Done.
Quickstart (MCP server)
1. Add to Claude Desktop
Edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%/Claude/claude_desktop_config.json (Windows):
{
  "mcpServers": {
    "docshelf": {
      "command": "docshelf-mcp",
      "env": {
        "DOCSHELF_ROOT": "/Users/me/Documents/my-homelab-docs"
      }
    }
  }
}
Restart Claude Desktop. You now have six new tools available:
| Tool | What it does |
| --- | --- |
| | Bootstrap a new shelf directory. |
| | Add a PDF/MD file. Converts, splits, re-indexes. |
| | Regenerate INDEX.md. |
| docshelf_search | Plain-text search across the shelf, with raw URLs. |
| | List documents by category. |
| | Standalone PDF → Markdown (no shelf). |
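To make "plain-text search across the shelf, with raw URLs" concrete, here is a minimal sketch of what such a search could do. It is an illustration, not the server's implementation: `search_shelf` and its `raw_base` parameter are hypothetical, and matching is a plain case-insensitive substring test.

```python
from pathlib import Path


def search_shelf(root: Path, query: str, raw_base: str = "") -> list[dict]:
    """Case-insensitive substring search over every Markdown file under docs/."""
    needle = query.lower()
    hits = []
    for md in sorted((root / "docs").rglob("*.md")):
        rel = md.relative_to(root).as_posix()
        lines = md.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, 1):
            if needle in line.lower():
                hits.append({
                    "path": rel,
                    "line": lineno,
                    "text": line.strip(),
                    # raw_base is an assumed prefix such as
                    # https://raw.githubusercontent.com/<owner>/<repo>/main
                    "url": f"{raw_base}/{rel}" if raw_base else None,
                })
    return hits
```

Returning the URL alongside each hit lets an agent go straight from a match to the section file without loading anything else.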
2. Add to Claude Code
claude mcp add docshelf -- docshelf-mcp
# Optional: set the default shelf
claude mcp add docshelf --env DOCSHELF_ROOT=/path/to/shelf -- docshelf-mcp
3. Test from the command line
# Sanity check: should print the server version then wait on stdin
docshelf-mcp
The shelf layout
my-shelf/
├── .docshelf.json          ← shelf metadata: name, remote, category order
├── INDEX.md                ← auto-generated navigation (your chat-project file)
├── .gitignore
└── docs/
    ├── routers/
    │   ├── .meta.json              ← per-document title/description overrides
    │   ├── mikrotik-routeros.md    (full document, lightly cleaned)
    │   └── mikrotik-routeros/      (auto-split sections)
    │       ├── 001-overview.md
    │       ├── 002-bridging.md
    │       └── 003-firewall.md
    └── switches/
        └── cudy-gs1010pe.md
Everything in docs/ is committed; everything is fetchable via raw URL once you push to GitHub.
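Because the layout is just directories and files, it can be walked with the standard library. The sketch below groups documents by category the way the tree above suggests; `list_documents` is a hypothetical name, and the rule that only files directly inside a category count as documents is an assumption drawn from that tree.

```python
from pathlib import Path


def list_documents(root: Path) -> dict[str, list[str]]:
    """Group top-level Markdown files under docs/<category>/ by category."""
    docs = root / "docs"
    catalog: dict[str, list[str]] = {}
    for category in sorted(p for p in docs.iterdir() if p.is_dir()):
        # Only files directly inside the category count as documents;
        # subdirectories hold the auto-split sections of a larger file.
        catalog[category.name] = sorted(f.name for f in category.glob("*.md"))
    return catalog
```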
How splitting works
A document is split when both conditions hold:
UTF-8 size > 50 KB (configurable via .docshelf.json:split_threshold_bytes).
The document has at least two ## (H2) headings.
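The two conditions can be expressed as a single predicate. This is a sketch under stated assumptions: "50 KB" is read as 51,200 bytes, headings are detected with a simple line-anchored regex (so a `## ` line inside a code block would also count), and `should_split` is a hypothetical name.

```python
import re

DEFAULT_THRESHOLD = 50 * 1024  # assumption: "50 KB" read as 51,200 bytes

# A line that starts with exactly two '#' followed by a space and some text.
H2_RE = re.compile(r"^## \S", re.MULTILINE)


def should_split(markdown: str, threshold: int = DEFAULT_THRESHOLD) -> bool:
    """True only when the text is both large enough and has two or more H2s."""
    size = len(markdown.encode("utf-8"))  # byte size, not character count
    return size > threshold and len(H2_RE.findall(markdown)) >= 2
```

Measuring the encoded size matters for non-English documents, where one character can take several bytes.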
The splitter:
Cleans PDF-extraction noise (collapses runaway blank lines, demotes CLI dumps mistaken for H1s).
Slices on H2 boundaries.
Names files NNN-<slug>.md so they sort naturally and survive title changes.
Wipes the previous split directory before regenerating, so the operation is fully idempotent.
If you want to keep a document whole, pass split=False.
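The slicing and naming steps can be sketched in a few lines. This is an illustration rather than the library's splitter: it skips the noise-cleaning pass, drops any text before the first H2, and uses a simplified `slugify` that only strips accents and non-alphanumerics. Both function names are assumptions.

```python
import re
import unicodedata

H2_RE = re.compile(r"^## +(.+?)\s*$", re.MULTILINE)


def slugify(title: str) -> str:
    """Simplified slug: strip accents, lowercase, hyphenate."""
    folded = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9]+", "-", folded.lower()).strip("-") or "section"


def split_on_h2(markdown: str) -> dict[str, str]:
    """Return {filename: content}, one entry per H2 section."""
    matches = list(H2_RE.finditer(markdown))
    sections = {}
    for i, m in enumerate(matches):
        # A section runs from its heading to the next H2 (or end of file).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        name = f"{i + 1:03d}-{slugify(m.group(1))}.md"
        sections[name] = markdown[m.start():end].rstrip() + "\n"
    return sections


doc = "# Manual\n\nintro\n\n## Overview\nfoo\n\n## Bridging\nbar\n"
for name in split_on_h2(doc):
    print(name)  # 001-overview.md, then 002-bridging.md
```

The zero-padded counter is what keeps the files in reading order when listed alphabetically.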
Examples
See the examples/ directory for three concrete use cases:
examples/homelab/: original use case, hardware manuals for a home lab.
examples/recipes/: a cookbook with one recipe per file.
examples/research-papers/: academic PDFs with abstracts in .meta.json.
Each example shows the directory layout and the INDEX.md you'd end up with.
Optional: high-quality PDF conversion
The default engine (pymupdf4llm) is fast and good enough for ~95% of technical documents. For papers with complex tables, math, or scanned content, install the marker-pdf backend:
pip install "docshelf-mcp[high-quality]"
Then pass quality="high":
shelf.add_document("paper.pdf", category="research", title="...", quality="high")
⚠️ marker-pdf pulls in PyTorch (~2 GB) and is significantly slower (10–60 s per document on CPU). The library import is deferred: if you don't use quality="high", the dependency is never loaded.
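A deferred import of this kind is usually a function-local import guarded by the option that needs it. The sketch below shows the general pattern only; `load_engine` is a hypothetical name, and `marker` is assumed to be the import name of the optional backend.

```python
import importlib


def load_engine(quality: str = "default", module_name: str = "marker"):
    """Import the heavy backend only when it is actually requested."""
    if quality != "high":
        return None  # the default path never touches the optional dependency
    try:
        # module_name is an assumption; pass whatever the backend is called.
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            'quality="high" needs the optional extra: '
            'pip install "docshelf-mcp[high-quality]"'
        ) from exc
```

Users who never ask for the high-quality path pay neither the install size nor the import time.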
FAQ
Why GitHub raw URLs and not embeddings / RAG?
Because it's dead simple, costs nothing to host, and the AI is already good at chasing links. You can layer embedding search on top later if you want; the on-disk shape is a normal git repo.
Does this work with private repos?
Not for the raw-URL trick — raw.githubusercontent.com won't serve them without auth. The local search tool works fine on private shelves; you just lose the "AI fetches sections directly" benefit. Make the doc repo public (separate from your code repo).
Do I have to use GitHub?
No. The shelf is just a directory. If you don't set a github_remote, INDEX.md still gets generated — entries just won't have URLs. You can host the static files anywhere that serves raw text (S3, Cloudflare R2, GitLab raw, Gitea, …) and post-process URLs yourself.
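Post-processing for another host could be as small as prefixing each entry with a base URL. The sketch below assumes that URL-less INDEX.md entries are Markdown links with repo-relative `docs/...` paths, which is a guess about the generated format; `add_base_url` and the example host are hypothetical.

```python
import re

# Assumed entry shape when no remote is set: [Title](docs/category/file.md)
REL_LINK_RE = re.compile(r"\]\((docs/[^)\s]+)\)")


def add_base_url(index_text: str, base_url: str) -> str:
    """Prefix every relative docs/ link with the host's base URL."""
    base = base_url.rstrip("/")
    return REL_LINK_RE.sub(lambda m: f"]({base}/{m.group(1)})", index_text)


print(add_base_url("- [Cudy GS1010PE](docs/switches/cudy-gs1010pe.md)\n",
                   "https://docs.example.com/shelf"))
```

Links that are already absolute are left untouched, so running it twice does no harm.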
Does it edit the source PDFs?
No. PDFs are converted on add_document and the source is left in place. The shelf only writes inside its own directory.
What about non-English documents?
Slugify is Unicode-aware (NFKD-normalized, with \w under re.UNICODE). Cyrillic / CJK titles slug down to ASCII-ish forms; the body Markdown is preserved as-is.
Can I use it without MCP?
Yes — from docshelf_mcp import Shelf and use the class directly. See docs/USAGE.md.
Limitations
Public GitHub only for the raw-URL trick (or whatever public static host you wire up).
Single repo per shelf. If you outgrow one repo, run multiple shelves and attach multiple INDEX.md files.
Heuristic splitting. The PDF→Markdown extract isn't always clean enough to split cleanly. For pathological cases (some 4+ MB datasheets), keep the file whole and rely on docshelf_search.
No automatic git commit. Tools regenerate INDEX.md on disk, but the caller (you, or an agent) is responsible for git add / commit / push. This is intentional: staying out of git's way keeps the tool safe to call from agents.
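Since publishing is left to the caller, a small wrapper around git is one way to script it. This is a sketch using only the standard library; `publish` and `publish_commands` are hypothetical helpers, not part of docshelf-mcp.

```python
import subprocess
from pathlib import Path


def publish_commands(message: str) -> list[list[str]]:
    """The three git invocations the caller is expected to run."""
    return [
        ["git", "add", "."],
        ["git", "commit", "-m", message],
        ["git", "push"],
    ]


def publish(shelf_dir: Path, message: str) -> None:
    """Run the commands inside the shelf directory, stopping on failure."""
    for cmd in publish_commands(message):
        # check=True raises CalledProcessError at the first failing step,
        # for example when there is nothing to commit.
        subprocess.run(cmd, cwd=shelf_dir, check=True)
```

Keeping this step outside the tools means an agent can add documents freely while a human decides when anything is pushed.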
Demo
A short walkthrough video / GIF is planned: https://github.com/ignatenkofi/docshelf-mcp/blob/main/docs/demo.md (coming soon)
Architecture
For a deeper dive, see docs/ARCHITECTURE.md — module layout, data flow, design rationale.
Contributing
Bug reports and PRs welcome. To set up a dev env:
git clone https://github.com/ignatenkofi/docshelf-mcp
cd docshelf-mcp
uv pip install -e ".[dev]"
ruff check src tests
pytest -v
License
MIT — see LICENSE.
Origin
docshelf-mcp started life as a 350-line Python script (homelab-encyclopedia.py) that managed a single homelab manuals repo. The split / index / clean logic is the same code, generalised to work for any category-organised document collection.