Put your manuals on a shelf, hand the AI the index.
📖 Docs & landing page: https://ignatenkofi.github.io/docshelf-mcp/
___ __ ____ ____ _ _ ____ __ ____
/ __)/ \(_ _)/ ___)/ )( \( __)( ) ( __)
( (_ \( O ) )( \___ \) __ ( ) _) / (_/\ ) _)
\___/ \__/ (__) (____/\_)(_/(____)\____/(__)
MCP server for AI-friendly doc shelves
An MCP server that turns a folder of PDFs and Markdown into a chat-project-friendly document collection: AI agents see a single INDEX.md and pull individual sections by raw GitHub URL on demand — instead of choking on a 4 MB datasheet.
You have 30 hardware manuals, or 200 cooking recipes, or a stack of research PDFs.
You want Claude / ChatGPT / whatever to be able to answer questions across them — but each manual is megabytes of PDF: too big to attach to a chat project, and far too big to paste into context.
docshelf-mcp solves it like this:
1. Convert each document into lightly cleaned, chapter-split Markdown plus a single navigation INDEX.md.
2. Push the shelf to GitHub and attach only INDEX.md to your Claude project.
3. When the model needs a section, it fetches it via raw.githubusercontent.com.

Result: a 5 KB index pointing at a 50 MB collection. The model reads exactly the chapter it needs.
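The resulting INDEX.md might look roughly like this (an illustrative sketch only: the exact entry format is an assumption, and the `me/my-homelab-docs` owner/repo names are placeholders):

```markdown
# My HomeLab Docs

## routers

- **Mikrotik RouterOS — full manual** — Official RouterOS reference, split by chapter.
  - [001-overview](https://raw.githubusercontent.com/me/my-homelab-docs/main/docs/routers/mikrotik-routeros-full-manual/001-overview.md)
  - [002-bridging](https://raw.githubusercontent.com/me/my-homelab-docs/main/docs/routers/mikrotik-routeros-full-manual/002-bridging.md)
```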
From PyPI (once the first tagged release is published):
# uv (recommended)
uv pip install docshelf-mcp
# or plain pip
pip install docshelf-mcp
Or straight from main (always-latest, no PyPI required):
pip install "git+https://github.com/ignatenkofi/docshelf-mcp"
Optional high-quality PDF engine (pulls ~2 GB of PyTorch — only if you need it):
pip install "docshelf-mcp[high-quality]"
Drop this into the Custom Instructions of any Claude project that consumes
a docshelf-style INDEX.md:
This project uses the docshelf pattern.
INDEX.md is the entry point. When answering: read INDEX → fetch ONLY the needed section file via its GitHub raw URL (use WebFetch / fetch / curl). Don't load full source files into context. For large manuals split into chapters, follow INDEX → chapter SUBINDEX → section file.
Medium (~150 words) and full (~400 words) versions, plus how-to snippets for Claude Code, Claude Desktop, and the Anthropic API, live in docs/PROJECT_PROMPT.md.
from docshelf_mcp import Shelf
shelf = Shelf("~/Documents/my-homelab-docs").init(
name="My HomeLab Docs",
remote="https://github.com/me/my-homelab-docs",
default_categories=["routers", "switches", "psu", "motherboards"],
)
shelf.add_document(
"~/Downloads/MIKROTIK_RouterOS.pdf",
category="routers",
title="Mikrotik RouterOS — full manual",
description="Official RouterOS reference, split by chapter.",
)
# → docs/routers/mikrotik-routeros-full-manual.md + docs/routers/.../001-..md, 002-..md, ...
# → INDEX.md is regenerated automatically.
Then in the shelf directory: git add . && git commit -m "docs: add RouterOS" && git push.
In your Claude project, attach only INDEX.md. Done.
Edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%/Claude/claude_desktop_config.json (Windows):
{
  "mcpServers": {
    "docshelf": {
      "command": "docshelf-mcp",
      "env": {
        "DOCSHELF_ROOT": "/Users/me/Documents/my-homelab-docs"
      }
    }
  }
}
Restart Claude Desktop. You now have six new tools available:
| Tool | What it does |
|---|---|
| `docshelf_init_shelf` | Bootstrap a new shelf directory. |
| `docshelf_add_document` | Add a PDF/MD file. Converts, splits, re-indexes. |
| `docshelf_rebuild_index` | Regenerate INDEX.md from disk. |
| `docshelf_search` | Plain-text search across the shelf, with raw URLs. |
| `docshelf_list_documents` | List documents by category. |
| `docshelf_convert_pdf` | Standalone PDF → Markdown (no shelf). |
claude mcp add docshelf -- docshelf-mcp
# Optional: set the default shelf
claude mcp add docshelf --env DOCSHELF_ROOT=/path/to/shelf -- docshelf-mcp
# Sanity check — should print the server version then wait on stdin
docshelf-mcp
my-shelf/
├── .docshelf.json ← shelf metadata: name, remote, category order
├── INDEX.md ← auto-generated navigation (your chat-project file)
├── .gitignore
└── docs/
├── routers/
│ ├── .meta.json ← per-document title/description overrides
│ ├── mikrotik-routeros.md (full document, lightly cleaned)
│ └── mikrotik-routeros/ (auto-split sections)
│ ├── 001-overview.md
│ ├── 002-bridging.md
│ └── 003-firewall.md
└── switches/
└── cudy-gs1010pe.md
Everything in docs/ is committed; everything is fetchable via raw URL once you push to GitHub.
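Mapping a shelf path to its raw URL is mechanical. A minimal sketch (the URL shape below is an assumption based on how raw.githubusercontent.com serves public repos; `raw_url` is a hypothetical helper, not part of the package):

```python
def raw_url(remote: str, branch: str, path: str) -> str:
    # https://github.com/me/my-homelab-docs -> me/my-homelab-docs
    owner_repo = remote.removeprefix("https://github.com/").rstrip("/")
    return f"https://raw.githubusercontent.com/{owner_repo}/{branch}/{path}"

print(raw_url("https://github.com/me/my-homelab-docs", "main",
              "docs/switches/cudy-gs1010pe.md"))
# → https://raw.githubusercontent.com/me/my-homelab-docs/main/docs/switches/cudy-gs1010pe.md
```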
A document is split when both conditions hold:
- the file is larger than the split threshold (configurable via split_threshold_bytes in .docshelf.json), and
- it contains ## (H2) headings.

The splitter names section files NNN-<slug>.md so they sort naturally and survive title changes. If you want to keep a document whole, pass split=False.
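The H2 split can be sketched in a few lines (illustrative only, not docshelf's actual splitter):

```python
import re

def split_by_h2(markdown: str) -> list[tuple[str, str]]:
    """Cut a document at '## ' headings and name each section NNN-<slug>.md."""
    sections = []
    for part in re.split(r"(?m)^(?=## )", markdown):
        heading = re.match(r"## (.+)", part)
        if not heading:
            continue  # skip any preamble before the first H2
        slug = re.sub(r"[^\w]+", "-", heading.group(1).lower()).strip("-")
        sections.append((f"{len(sections) + 1:03d}-{slug}.md", part))
    return sections

doc = "# Manual\n\n## Overview\n...\n\n## Firewall rules\n..."
print([name for name, _ in split_by_h2(doc)])
# → ['001-overview.md', '002-firewall-rules.md']
```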
See the examples/ directory for three concrete use cases:
- examples/homelab/ — original use case, hardware manuals for a home lab.
- examples/recipes/ — a cookbook with one recipe per file.
- examples/research-papers/ — academic PDFs with abstracts in .meta.json.

Each example shows the directory layout and the INDEX.md you'd end up with.
The default engine (pymupdf4llm) is fast and good enough for ~95% of technical documents. For papers with complex tables, math, or scanned content, install the marker-pdf backend:
pip install "docshelf-mcp[high-quality]"
Then pass quality="high":
shelf.add_document("paper.pdf", category="research", title="...", quality="high")
⚠️ marker-pdf pulls in PyTorch (~2 GB) and is significantly slower (10–60 s per document on CPU). The library import is deferred — if you don't use quality="high", the dependency is never loaded.
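The deferred import works by resolving the backend module only inside the code path that needs it. A sketch of the pattern (illustrative only, with stdlib module names "json" and "csv" standing in for the real engines so it runs without either backend installed):

```python
import importlib

# Stand-ins: "json" for the default engine, "csv" for the heavy one.
_ENGINES = {"default": "json", "high": "csv"}

def get_pdf_engine(quality: str = "default"):
    """Resolve the conversion backend lazily: the heavy module is only
    imported when quality="high" is actually requested."""
    return importlib.import_module(_ENGINES[quality])

engine = get_pdf_engine()  # light path; the "high" backend never loads here
```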
Why GitHub raw URLs and not embeddings / RAG? Because it's dead simple, costs nothing to host, and the AI is already good at chasing links. You can layer embedding search on top later if you want — the on-disk shape is a normal git repo.
Does this work with private repos?
Not for the raw-URL trick — raw.githubusercontent.com won't serve them without auth. The local search tool works fine on private shelves; you just lose the "AI fetches sections directly" benefit. Make the doc repo public (separate from your code repo).
Do I have to use GitHub?
No. The shelf is just a directory. If you don't set a github_remote, INDEX.md still gets generated — entries just won't have URLs. You can host the static files anywhere that serves raw text (S3, Cloudflare R2, GitLab raw, Gitea, …) and post-process URLs yourself.
Does it edit the source PDFs?
No. PDFs are converted on add_document and the source is left in place. The shelf only writes inside its own directory.
What about non-English documents?
Slugify is Unicode-aware (NFKD-normalized, with \w under re.UNICODE). Cyrillic / CJK titles slug down to ASCII-ish forms; the body Markdown is preserved as-is.
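A minimal slugifier in that spirit (assumed shape, consistent with the NFKD + \w description above; not the package's exact implementation):

```python
import re
import unicodedata

def slugify(title: str) -> str:
    # NFKD-normalize, then collapse every non-word run into a hyphen.
    # \w is Unicode-aware in Python 3, so non-Latin letters survive.
    norm = unicodedata.normalize("NFKD", title)
    return re.sub(r"[^\w]+", "-", norm).strip("-").lower()

print(slugify("Mikrotik RouterOS — full manual"))
# → mikrotik-routeros-full-manual
```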
Can I use it without MCP?
Yes — from docshelf_mcp import Shelf and use the class directly. See docs/USAGE.md.
docshelf never runs git for you: it writes INDEX.md on disk, but the caller (you, or an agent) is responsible for git add / commit / push. This is intentional — staying out of git's way keeps the tool safe to call from agents.

A short walkthrough video / GIF is planned: https://github.com/ignatenkofi/docshelf-mcp/blob/main/docs/demo.md (coming soon)
For a deeper dive, see docs/ARCHITECTURE.md — module layout, data flow, design rationale.
Bug reports and PRs welcome. To set up a dev env:
git clone https://github.com/ignatenkofi/docshelf-mcp
cd docshelf-mcp
uv pip install -e ".[dev]"
ruff check src tests
pytest -v
MIT — see LICENSE.
docshelf-mcp started life as a 350-line Python script (homelab-encyclopedia.py) that managed a single homelab manuals repo. The split / index / clean logic is the same code, generalised to work for any category-organised document collection.