Local offline semantic search over documents (txt, md, pdf, docx, pptx, csv). Indexes folders into a LanceDB vector database with multilingual embeddings and supports hybrid vector + keyword search via Reciprocal Rank Fusion. No API keys, no cloud, no Docker required.
GitHub stars Python 3.11+ License: AGPL-3.0 MCP Compatible
If you find this useful, consider giving it a ⭐ — it helps others discover the project.
Local semantic search for your documents. No API keys. No cloud. Works with any MCP client.

AI coding tools are powerful, but they have blind spots when it comes to your local files.
This plugin bridges that gap. It indexes your local documents into a small, fast vector database. When you ask a question, it retrieves only the relevant pieces -- so your AI tool can answer with your actual data.
```
Your documents --> chunked --> embedded --> local vector DB
                                                 |
Your question --> embedded --> similarity search --> relevant chunks --> AI answers
```
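At query time, retrieval is nearest-neighbor search over the stored embedding vectors. A toy sketch of that step with cosine similarity — NumPy stands in here for the real embedding model and LanceDB:

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=2):
    """Rank document vectors by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                     # cosine similarity per chunk
    top = np.argsort(-scores)[:k]      # indices of the best-matching chunks
    return [(int(i), float(scores[i])) for i in top]

# Toy 3-dimensional "embeddings" for three chunks
chunks = np.array([[1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.05, 0.0])
print(cosine_top_k(query, chunks))  # chunks 0 and 1 rank highest
```

The real pipeline does the same thing with 384-dimensional sentence embeddings and an indexed vector store instead of a brute-force matrix product.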
| Format | Extension | Details |
|---|---|---|
| Plain text | .txt | UTF-8 with fallback |
| Markdown | .md | Raw text preserved |
| PDF | .pdf | Page-level extraction with metadata |
| Word | .docx | Full paragraph extraction |
| PowerPoint | .pptx | Slide-level extraction with metadata |
| CSV | .csv | Row-based text extraction |
```bash
git clone https://github.com/ZhuBit/cowork-semantic-search.git
cd cowork-semantic-search
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[all]"
```
Add the server to your MCP client's config. Replace paths with your own.
`.mcp.json` in your project root:

```json
{
  "mcpServers": {
    "semantic-search": {
      "command": "/absolute/path/to/.venv/bin/python",
      "args": ["-m", "server.main"],
      "cwd": "/absolute/path/to/cowork-semantic-search",
      "env": {
        "PYTHONPATH": "/absolute/path/to/cowork-semantic-search"
      }
    }
  }
}
```
`.cursor/mcp.json` in your project root, or `~/.cursor/mcp.json` globally:

```json
{
  "mcpServers": {
    "semantic-search": {
      "command": "/absolute/path/to/.venv/bin/python",
      "args": ["-m", "server.main"],
      "env": {
        "PYTHONPATH": "/absolute/path/to/cowork-semantic-search"
      }
    }
  }
}
```
`~/.codeium/windsurf/mcp_config.json`:

```json
{
  "mcpServers": {
    "semantic-search": {
      "command": "/absolute/path/to/.venv/bin/python",
      "args": ["-m", "server.main"],
      "env": {
        "PYTHONPATH": "/absolute/path/to/cowork-semantic-search"
      }
    }
  }
}
```
Open Cline > MCP Servers icon > Configure > Advanced MCP Settings, then add:

```json
{
  "mcpServers": {
    "semantic-search": {
      "command": "/absolute/path/to/.venv/bin/python",
      "args": ["-m", "server.main"],
      "env": {
        "PYTHONPATH": "/absolute/path/to/cowork-semantic-search"
      }
    }
  }
}
```
"Index all documents in ~/Documents/projects"
"Search for 'quarterly revenue report'"
First run downloads the embedding model (~120MB), then everything runs offline.
If you keep notes in Obsidian (or any folder of markdown files), this plugin turns your AI tool into a search engine for your knowledge base.
You: "Index my vault at ~/Documents/ObsidianVault"
AI: Indexed 847 files -> 3,291 chunks in 42s
You: "What did I write about API rate limiting?"
AI: Found 6 relevant chunks across 3 files:
- notes/backend/rate-limiting-strategies.md
- projects/acme-api/design-decisions.md
- daily/2025-11-03.md
...
You: "Find anything about the client meeting last November, use hybrid search"
AI: Found 4 results using hybrid search (vector + keyword):
- meetings/2025-11-12-acme-kickoff.md
- daily/2025-11-12.md
...
Works the same with PDFs, Word docs, PowerPoints, and CSVs -- just point it at a folder.
| Tool | Description |
|---|---|
| `index_folder` | Index or re-index all documents in a folder. Incremental -- skips unchanged files. |
| `semantic_search` | Search indexed documents using natural language. Supports vector and hybrid modes. |
| `get_index_status` | Show total chunks, file count, and list of indexed files. |
| `reindex_file` | Force re-index a single file, bypassing the hash cache. |
Embeddings use `paraphrase-multilingual-MiniLM-L12-v2`.

```python
from server.indexer import index_folder
from server.search import semantic_search

# Index a folder
result = index_folder("/path/to/docs")
print(f"{result['files_indexed']} files -> {result['total_chunks']} chunks")

# Search
results = semantic_search("project deadline", mode="hybrid")
for r in results["results"]:
    print(f"  {r['file_name']}: {r['text'][:100]}...")
```
```
server/
  main.py      # MCP server + tool definitions
  parsers.py   # Per-format text extraction
  chunker.py   # Text splitting with metadata
  indexer.py   # Discovery, hashing, embedding pipeline
  store.py     # LanceDB vector store + FTS + hybrid search
  search.py    # Query embedding + search orchestration
```
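The incremental behavior in `indexer.py` ("skips unchanged files") is typically built on content hashing. A hypothetical sketch of the idea — names like `files_to_reindex` are illustrative, not the project's actual code:

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """Content hash used to detect changed files between index runs."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def files_to_reindex(paths, hash_cache):
    """Return files whose content changed since the last run; update the cache."""
    changed = []
    for path in paths:
        digest = file_hash(path)
        if hash_cache.get(str(path)) != digest:
            changed.append(path)
            hash_cache[str(path)] = digest
    return changed
```

Hashing the bytes (rather than trusting modification times) means a touched-but-unchanged file is still skipped, and the `reindex_file` tool simply bypasses this cache.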
| Component | Choice | Why |
|---|---|---|
| MCP framework | FastMCP | Clean tool definitions, async support |
| Embeddings | sentence-transformers | Offline, multilingual, fast |
| Vector DB | LanceDB | Serverless, embedded, FTS built-in |
| Chunking | langchain-text-splitters | Battle-tested recursive splitting |
| PDF | PyMuPDF | Fast, accurate extraction |
| DOCX | python-docx | Lightweight, no system deps |
| PPTX | python-pptx | Slide-level extraction |
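Recursive splitting, the strategy langchain-text-splitters is used for, tries coarse separators first (paragraphs, then lines, then words) and only hard-cuts text when nothing fits. A simplified stand-in for the idea — not the library's implementation:

```python
def recursive_split(text, chunk_size=200, separators=("\n\n", "\n", " ")):
    """Split text on the coarsest separator that yields pieces within chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep in text:
            chunks, current = [], ""
            for piece in text.split(sep):
                candidate = piece if not current else current + sep + piece
                if len(candidate) <= chunk_size:
                    current = candidate          # keep packing the current chunk
                else:
                    if current:
                        chunks.append(current)
                    if len(piece) > chunk_size:
                        # Piece is too big on its own: retry with finer separators
                        chunks.extend(recursive_split(piece, chunk_size, separators))
                        current = ""
                    else:
                        current = piece
            if current:
                chunks.append(current)
            return chunks
    # No separator applies: hard-cut the text
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Splitting on natural boundaries keeps each chunk semantically coherent, which matters because the chunk — not the whole file — is what gets embedded and retrieved.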
```bash
source .venv/bin/activate
pytest tests/ -v
```
56 tests covering parsers, chunking, indexing, search, and MCP tool integration.
Contributions welcome -- open an issue or submit a PR.
AGPL-3.0 -- free to use, modify, and self-host. If you offer this as a network service, you must share your source code. See LICENSE for details.
Add this to `claude_desktop_config.json` and restart Claude Desktop:

```json
{
  "mcpServers": {
    "cowork-semantic-search": {
      "command": "/absolute/path/to/.venv/bin/python",
      "args": ["-m", "server.main"],
      "env": {
        "PYTHONPATH": "/absolute/path/to/cowork-semantic-search"
      }
    }
  }
}
```