Tools to benchmark chunking strategies for your RAG corpus.
License: MIT
Auto chunking tuner and MCP server for RAG pipelines.
Give it your documents. It tries multiple chunking strategies, measures which setup supports retrieval best, and recommends a configuration for your corpus and use case. Getting started costs nothing in API fees: run chunk-tune estimate for a dry run before making any paid calls.
Full documentation: shantanu-deshmukh.github.io/chunktuner
flowchart TD
Lib["Python library"] --> Ingest
CLI["CLI (chunk-tune)"] --> Ingest
MCP["MCP server"] --> Ingest
Ingest["Ingest your documents<br/>files, URLs, repos"] --> Tune
subgraph Tune ["AutoTuner: for every strategy and param set"]
direction LR
Chunk["Chunk document"] --> Embed["Embed chunks<br/>and queries"] --> Score["Score retrieval<br/>recall, MRR, NDCG"]
end
Tune --> Rank["Rank all configs<br/>against baseline"] --> Best(["Recommended config<br/>.autochunk.yaml"])
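To make the flowchart concrete, here is a minimal, self-contained sketch of the same loop: chunk, embed, score, rank. It is not chunktuner's implementation. The word-window chunker, the bag-of-words stand-in for an embedding model, the recall@k scorer, and every function name below (chunk_words, embed, cosine, recall_at_k, tune) are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` words; neighbours share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # this window already reached the end
            break
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recall_at_k(chunks: list[str], qa_pairs: list[tuple[str, str]], k: int = 1) -> float:
    """Fraction of queries whose answer text appears in one of the top-k chunks."""
    vectors = [embed(c) for c in chunks]
    hits = 0
    for query, answer in qa_pairs:
        q = embed(query)
        ranked = sorted(range(len(chunks)), key=lambda i: cosine(q, vectors[i]), reverse=True)
        if any(answer in chunks[i] for i in ranked[:k]):
            hits += 1
    return hits / len(qa_pairs)


def tune(text: str, qa_pairs: list[tuple[str, str]], grid: list[tuple[int, int]], k: int = 1) -> list[dict]:
    """Score every (size, overlap) pair and return the results, best first."""
    results = []
    for size, overlap in grid:
        chunks = chunk_words(text, size, overlap)
        results.append({"size": size, "overlap": overlap, "recall": recall_at_k(chunks, qa_pairs, k)})
    return sorted(results, key=lambda r: r["recall"], reverse=True)


if __name__ == "__main__":
    corpus = (
        "Chunk size controls how much context each vector carries. "
        "Overlap repeats a few words across neighbouring chunks so that "
        "sentences cut at a boundary stay retrievable."
    )
    pairs = [("what does overlap repeat", "repeats a few words")]
    for row in tune(corpus, pairs, grid=[(8, 0), (8, 4), (16, 4)]):
        print(row)
```

A real run would replace embed with an actual embedding model and use a labelled query set, but the shape of the search stays the same: one score per configuration, sorted best first.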
When building a RAG pipeline, how you split documents into chunks directly impacts retrieval quality. chunktuner automates the process of finding the optimal chunking strategy for your specific corpus, embedding model, and use case.
It benchmarks strategies like fixed-token windows, recursive character splitting, semantic splitting, PDF structural chunking, and AST-based code chunking — then scores each one against real retrieval metrics (token recall, MRR, NDCG) and optional generation metrics (RAGAS faithfulness, answer relevancy).
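For readers unfamiliar with the ranking metrics named above, the sketch below computes MRR and NDCG using their standard textbook definitions with binary relevance. How chunktuner defines and weights these metrics internally is not shown here, so treat the function names and signatures as illustrative assumptions.

```python
import math


def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


def dcg(gains: list[float]) -> float:
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(gain / math.log2(pos + 1) for pos, gain in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """DCG of the top-k results divided by the DCG of an ideal ordering."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids[:k]]
    ideal = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg else 0.0
```

MRR only rewards the position of the first relevant chunk, while NDCG credits every relevant chunk in the top k and discounts those that appear lower in the list.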
CLI (chunk-tune): human-driven tuning from the terminal
# Install (pick one)
uv tool install chunktuner
pip install chunktuner
# Initialize workspace (embedding_model defaults to null — no API calls)
chunk-tune init
# See cost estimate before running anything
chunk-tune estimate ./my_docs --use-case rag_qa
# Get a recommendation (dummy embeddings by default; add --embedding-model for real ones)
chunk-tune recommend ./my_docs --use-case rag_qa
Python API:
from pathlib import Path
from chunktuner import FileIngestor, DummyEmbeddingFunction, LiteLLMEmbeddingFunction, AutoTuner
from chunktuner import default_registry, Evaluator, ScoreCalculator
docs = FileIngestor().ingest_dir(Path("./my_docs"))
# Free/offline: use dummy embeddings for quick strategy comparison.
# Swap in LiteLLMEmbeddingFunction for real embeddings with any provider:
# LiteLLMEmbeddingFunction("text-embedding-3-small") # OpenAI
# LiteLLMEmbeddingFunction("gemini/gemini-embedding-001") # Google
# LiteLLMEmbeddingFunction("openai/<id>", api_base="http://localhost:1234/v1") # local
embedding_fn = DummyEmbeddingFunction()
tuner = AutoTuner(
strategies=default_registry,
evaluator=Evaluator(embedding_fn),
scorer=ScoreCalculator(use_case="rag_qa"),
)
result = tuner.recommend(docs, use_case="rag_qa")
print(result.best.config)
After running recommend, you get a ranked table with the winning config and how much it beats the baseline:
Rank Strategy Params Score Recall MRR IOU AvgTok
────────────────────────────────────────────────────────────────────────────────────────
1 ★ recursive_character 1024 chr / 154 ov 0.821 0.950 0.880 0.062 212
2 fixed_tokens 512 tok / 51 ov 0.764 0.920 0.840 0.059 444
...
Baseline fixed_tokens 512 tok / 0 ov → score 0.682
Winner beats baseline by +0.139 (+20.4%)
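The last line of that output is consistent with a plain absolute and relative difference between the two scores. The snippet below reproduces the arithmetic; the function name is hypothetical, and the assumption that the percentage is taken relative to the baseline score is inferred from the numbers rather than stated by the tool.

```python
def improvement(winner: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gain, gain as a percentage of the baseline score)."""
    delta = winner - baseline
    return delta, 100.0 * delta / baseline


delta, pct = improvement(0.821, 0.682)
print(f"Winner beats baseline by {delta:+.3f} ({pct:+.1f}%)")
# prints: Winner beats baseline by +0.139 (+20.4%)
```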
See examples/financial_analysis for a full benchmark on S&P 500 earnings call transcripts — a corpus where separator choice and chunk size make a measurable difference in retrieval quality.
Run it offline with zero API cost:
cd examples/financial_analysis
uv sync
uv run python run_benchmark.py --fixture --num-transcripts 2
| Strategy | Best for |
|---|---|
| fixed_tokens | Baseline; uniform token windows |
| recursive_character | General prose and documentation |
| semantic | Theme-heavy articles |
| markdown_semantic | Structured Markdown docs |
| pdf_structural | PDFs with layout regions and tables |
| structural_semantic | PDF/DOCX with mixed layout and text |
| late_chunking | Long docs with dense cross-references |
| agentic | High-value narrative documents |
| code_ast | Code repos (Python, JavaScript) |
| code_window | Code baseline (sliding window) |
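As a rough intuition for what recursive character splitting does, the sketch below splits on the coarsest separator first and falls back to finer ones only for pieces that are still too long. It is a simplified illustration, not chunktuner's recursive_character strategy: the separator list is an assumption, and overlap between chunks is omitted.

```python
def recursive_split(
    text: str,
    max_chars: int,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split text into chunks of at most max_chars, preferring coarse boundaries."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut every max_chars characters.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current.strip():
        chunks.append(current)
    return chunks
```

Paragraph breaks are tried before line breaks, sentences, and single spaces, which is why this family of splitters tends to suit general prose: chunks end on natural boundaries whenever the size limit allows it.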
The MCP server is a Python FastMCP server (chunk-tune-mcp, over stdio) and needs no Node.js build. See docs/mcp_setup.md.
Add to your .mcp.json:
{
"mcpServers": {
"chunktuner": {
"command": "uvx",
"args": ["--from", "chunktuner[mcp]", "chunk-tune-mcp"],
"env": {
"CHUNK_TUNER_BASE_DIR": "/path/to/your/corpus"
}
}
}
}
Tools available: list_strategies, preview_chunks, evaluate_chunking, recommend_config.
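If you already have a .mcp.json with other servers, you may prefer to merge the entry in rather than paste it by hand. The helper below is a small sketch using only the standard library; the function names are hypothetical, and the server entry simply mirrors the JSON shown above.

```python
import json
from pathlib import Path


def merge_server(config: dict, corpus_dir: str) -> dict:
    """Return a copy of config with the chunktuner entry added or replaced."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["chunktuner"] = {
        "command": "uvx",
        "args": ["--from", "chunktuner[mcp]", "chunk-tune-mcp"],
        "env": {"CHUNK_TUNER_BASE_DIR": corpus_dir},
    }
    merged["mcpServers"] = servers
    return merged


def update_mcp_json(path: Path, corpus_dir: str) -> None:
    """Read path if it exists, merge the chunktuner entry, and write it back."""
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    merged = merge_server(existing, corpus_dir)
    path.write_text(json.dumps(merged, indent=2) + "\n", encoding="utf-8")
```

Calling update_mcp_json(Path(".mcp.json"), "/path/to/your/corpus") leaves any other configured servers untouched.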
chunk-tune init Bootstrap workspace config
chunk-tune analyze Quick structural scan (no API cost)
chunk-tune estimate Dry-run cost/token estimate
chunk-tune evaluate Full evaluation across strategies
chunk-tune recommend Evaluation + best config recommendation
chunk-tune compare Side-by-side comparison of specific strategies
chunk-tune preview Inspect how a strategy splits a document
chunk-tune cache Manage embedding and chunk cache
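For scripted use, the estimate-then-recommend sequence can be driven from Python with subprocess. This is a sketch under stated assumptions: it uses only the subcommands and flags that appear in this README (--use-case, and --embedding-model on recommend), the wrapper function names are hypothetical, and it requires chunk-tune to be installed and on PATH.

```python
import shutil
import subprocess
from typing import Optional


def build_command(
    subcommand: str,
    corpus: str,
    use_case: str = "rag_qa",
    embedding_model: Optional[str] = None,
) -> list[str]:
    """Build the argv list for a chunk-tune subcommand."""
    cmd = ["chunk-tune", subcommand, corpus, "--use-case", use_case]
    if embedding_model is not None:
        cmd += ["--embedding-model", embedding_model]
    return cmd


def estimate_then_recommend(
    corpus: str,
    use_case: str = "rag_qa",
    embedding_model: Optional[str] = None,
) -> None:
    """Run the dry-run estimate first, then the recommendation."""
    if shutil.which("chunk-tune") is None:
        raise RuntimeError("chunk-tune is not on PATH; install chunktuner first")
    # check=True stops the sequence if the estimate step fails.
    subprocess.run(build_command("estimate", corpus, use_case), check=True)
    subprocess.run(build_command("recommend", corpus, use_case, embedding_model), check=True)
```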
pip install chunktuner # CLI + library
uv add chunktuner # library
uv tool install chunktuner # global CLI
uvx --from chunktuner chunk-tune … # ephemeral CLI (no install)
# With optional extras
pip install "chunktuner[docling]" # PDF/DOCX support
uv add "chunktuner[docling]" # PDF/DOCX support
uv add "chunktuner[ragas]" # generation metrics
uv add "chunktuner[semantic]" # semantic chunking
uv add "chunktuner[code]" # AST code chunking
uv add "chunktuner[all]" # everything
See CONTRIBUTING.md.
Shantanu Deshmukh — full stack developer building E2E AI applications.