An MCP server that adds AI-powered document intelligence to Paperless-ngx, enabling semantic search, automatic classification, receipt data extraction, bank statement matching, and accounting export — all running locally via Ollama.
AI-Powered Document Intelligence for Paperless-ngx
Semantic search, auto-classification, receipt extraction, and accounting export — 100% local, 100% private.
PaperCortex turns your Paperless-ngx document archive into an intelligent, queryable knowledge base — powered entirely by local AI running on your own hardware.
If you use Paperless-ngx to store invoices, receipts, contracts, tax documents, letters, or any other scanned paperwork, PaperCortex adds the intelligence layer that Paperless-ngx is missing:
Everything runs locally through Ollama. No document content ever leaves your network. No cloud APIs. No subscriptions. No data harvesting.
PaperCortex exposes all capabilities as an MCP (Model Context Protocol) Server, making it a first-class tool for Claude Code, AI coding agents, and automated workflows.
Paperless-ngx is an outstanding document management system with 37,000+ GitHub stars. It handles scanning, OCR, storage, and basic tagging beautifully. But once your documents are in Paperless-ngx, finding and working with them has real limitations:
| What you want to do | Paperless-ngx alone | With PaperCortex |
|---|---|---|
| Find a document by what it's about | Keyword search only — misses synonyms, translations, related concepts | Semantic search understands meaning across languages |
| Classify incoming documents | Manual rules or basic auto-matching | LLM-powered classification understands document content |
| Extract data from a receipt | Read it yourself and type it in | Automatic extraction of vendor, amount, date, tax, line items |
| Answer "How much did I spend on X?" | Export everything, open spreadsheet, filter manually | Natural language query returns the answer instantly |
| Send receipt data to accounting | Manual data entry or copy-paste | One-click DATEV/CSV export ready for your tax advisor |
| Use documents in AI workflows | No API integration for AI agents | Full MCP Server for Claude Code and any MCP-compatible agent |
| Keep data private | Self-hosted (good!) | Self-hosted AI too — zero cloud dependency |
Traditional keyword search fails when you don't remember the exact words. PaperCortex generates vector embeddings for every document using local Ollama models and stores them in a lightweight SQLite vector database.
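To make the idea concrete, here is a minimal sketch of ranking documents by embedding similarity. It assumes cosine similarity as the metric and scans every document (brute force) rather than using an HNSW index; the type and function names (`IndexedDocument`, `cosineSimilarity`, `rankBySimilarity`) are illustrative and not taken from the PaperCortex source.

```typescript
// Illustrative only: these names are assumptions, not PaperCortex's actual API.
interface IndexedDocument {
  id: number;
  title: string;
  embedding: number[];
}

interface SearchHit {
  id: number;
  title: string;
  score: number;
}

/** Cosine similarity of two equal-length vectors; returns 0 if either has zero norm. */
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

/** Brute-force ranking: score every document, sort descending, keep the top `limit`. */
function rankBySimilarity(
  queryEmbedding: number[],
  docs: IndexedDocument[],
  limit = 10,
): SearchHit[] {
  return docs
    .map((d) => ({
      id: d.id,
      title: d.title,
      score: cosineSimilarity(queryEmbedding, d.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}
```

In a real setup the query embedding would come from the same Ollama embedding model that indexed the documents, since vectors from different models are not comparable.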
Search by meaning, not by memory:

- "electricity bill" → finds documents containing "Stromrechnung", "utility payment", "power invoice"
- "office supplies" → finds "Bueroausstattung", "paper and toner", "desk accessories order"
- "tax deductible travel" → finds flight bookings, hotel receipts, train tickets, taxi invoices

Supported embedding models:

- nomic-embed-text (recommended: fast, accurate, 768 dimensions)
- mxbai-embed-large (higher accuracy, slower)

Every new document arriving in Paperless-ngx gets analyzed by a local LLM that reads the OCR content and assigns:
Classification runs asynchronously in the background. New documents are processed within minutes of arriving in Paperless-ngx.
PaperCortex includes a dedicated receipt processing pipeline optimized for expense management:
Data extraction from receipts and invoices:
Works with:
Import your bank statement as CSV and let PaperCortex automatically match transactions to receipts:
For German businesses and freelancers, PaperCortex generates DATEV-compatible export files that your Steuerberater can import directly:
Also supports plain CSV export for use with any accounting software worldwide.
Ask questions about your document archive in plain language:
"How much did I spend on hotels in Q1 2025?"
"Show me all contracts expiring this year"
"What was my highest single expense last month?"
"Find all invoices from Deutsche Telekom"
"Which receipts don't have a matching bank transaction?"
"Summarize my office supply spending trend over the last 12 months"
PaperCortex translates natural language into document queries, retrieves relevant documents via semantic search, and uses the local LLM to synthesize answers with source references.
PaperCortex implements the Model Context Protocol (MCP) — the open standard for connecting AI agents to external tools. This means any MCP-compatible AI agent can use your document archive as a knowledge source.
Compatible with:
| Feature | PaperCortex | paperless-ai | Veryfi | Taggun | Rossum |
|---|---|---|---|---|---|
| Fully self-hosted | :white_check_mark: | :white_check_mark: | :x: | :x: | :x: |
| Local AI (no cloud API) | :white_check_mark: | :x: OpenAI | :x: | :x: | :x: |
| Semantic search | :white_check_mark: | :x: | :x: | :x: | :x: |
| Auto-classification | :white_check_mark: | :white_check_mark: | :x: | :x: | :white_check_mark: |
| Receipt data extraction | :white_check_mark: | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| Bank statement matching | :white_check_mark: | :x: | :x: | :x: | :x: |
| DATEV export | :white_check_mark: | :x: | :x: | :x: | :x: |
| CSV accounting export | :white_check_mark: | :x: | :white_check_mark: | :x: | :white_check_mark: |
| MCP Server | :white_check_mark: | :x: | :x: | :x: | :x: |
| Natural language queries | :white_check_mark: | :x: | :x: | :x: | :x: |
| Multi-language documents | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| Free and open source | :white_check_mark: | :white_check_mark: | :x: $$$ | :x: $$$ | :x: $$$$ |
| Privacy — data stays local | :white_check_mark: | :warning: API calls | :x: | :x: | :x: |
| Works with Paperless-ngx | :white_check_mark: | :white_check_mark: | :x: | :x: | :x: |
┌─────────────────────┐         ┌──────────────────────────┐         ┌────────────────────┐
│                     │         │                          │         │                    │
│   Claude Code /     │   MCP   │       PaperCortex        │  REST   │   Paperless-ngx    │
│   AI Agents /       ├────────►│                          ├────────►│                    │
│   Automation        │         │   ┌──────────────────┐   │   API   │   OCR + Storage +  │
│                     │         │   │   MCP Server     │   │         │   Tagging          │
└─────────────────────┘         │   │   (stdio / HTTP) │   │         │                    │
                                │   └──────────────────┘   │         └────────────────────┘
                                │                          │
                                │   ┌──────────────────┐   │         ┌────────────────────┐
                                │   │  Intelligence    │   │         │                    │
                                │   │  Layer           │   │   LLM   │   Ollama           │
                                │   │                  ├────────────►│                    │
                                │   │  - Classifier    │   │   API   │   qwen2.5 / llama3 │
                                │   │  - Extractor     │   │         │   nomic-embed-text │
                                │   │  - Query Engine  │   │         │                    │
                                │   └──────────────────┘   │         └────────────────────┘
                                │                          │
                                │   ┌──────────────────┐   │
                                │   │  Vector Store    │   │
                                │   │  (SQLite + HNSW) │   │
                                │   └──────────────────┘   │
                                │                          │
                                └──────────────────────────┘
All processing happens on your hardware. The only network traffic is between PaperCortex and your local Paperless-ngx and Ollama instances.
Pull the required Ollama models:
ollama pull qwen2.5:14b # LLM for classification, extraction, queries
ollama pull nomic-embed-text # Embedding model for semantic search
git clone https://github.com/renefichtmueller/PaperCortex.git
cd PaperCortex
cp .env.example .env
Edit .env with your configuration:
PAPERLESS_URL=http://your-paperless-instance:8000
PAPERLESS_TOKEN=your-paperless-api-token
OLLAMA_URL=http://your-ollama-host:11434
OLLAMA_MODEL=qwen2.5:14b
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
Start PaperCortex:
docker compose up -d
PaperCortex will begin indexing your existing documents automatically.
git clone https://github.com/renefichtmueller/PaperCortex.git
cd PaperCortex
npm install
cp .env.example .env
# Edit .env with your settings
npm run build
npm start
npx papercortex --paperless-url http://localhost:8000 --paperless-token YOUR_TOKEN
PaperCortex exposes five MCP tools that AI agents can call:
papercortex_search: Semantic Document Search
Find documents by meaning, not just keywords.
{
"tool": "papercortex_search",
"arguments": {
"query": "electricity bills from last winter",
"limit": 10,
"date_from": "2024-12-01",
"date_to": "2025-02-28"
}
}
Returns: Ranked list of documents with relevance scores, titles, dates, and Paperless-ngx document IDs.
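The tool/arguments JSON shown for each tool reads as shorthand. On the wire, MCP clients send a JSON-RPC 2.0 request with the method `tools/call`, carrying the tool name and arguments in `params`. The sketch below builds such a message for the search example; `buildToolCall` is a hypothetical helper for illustration, and in practice an MCP client library constructs this message for you.

```typescript
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params: Record<string, unknown>;
}

/** Wrap a tool name and its arguments in an MCP `tools/call` JSON-RPC request. */
function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// The search example from above, expressed as a wire-level request.
const searchRequest = buildToolCall(1, "papercortex_search", {
  query: "electricity bills from last winter",
  limit: 10,
  date_from: "2024-12-01",
  date_to: "2025-02-28",
});
```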
papercortex_classify: Auto-Classification
Analyze a document and assign type, tags, and metadata.
{
"tool": "papercortex_classify",
"arguments": {
"document_id": 1234,
"apply": true
}
}
Returns: Suggested document type, tags, correspondent, and confidence scores. Set apply: true to write classifications back to Paperless-ngx.
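The CLASSIFICATION_CONFIDENCE setting (default 0.7) is described as the minimum confidence to auto-apply classifications. A minimal sketch of that kind of gate follows. The `Suggestion` shape and the `selectForApply` name are assumptions for illustration; the actual response fields and how the threshold interacts with apply: true may differ.

```typescript
// Hypothetical response shape; field names are assumptions for this sketch.
interface Suggestion {
  field: "document_type" | "tag" | "correspondent";
  value: string;
  confidence: number; // expected range: 0 to 1
}

/** Keep only suggestions whose confidence meets the minimum threshold. */
function selectForApply(
  suggestions: Suggestion[],
  minConfidence = 0.7,
): Suggestion[] {
  return suggestions.filter((s) => s.confidence >= minConfidence);
}
```

Suggestions below the threshold would then be left for manual review rather than written back to Paperless-ngx.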
papercortex_receipt: Receipt Data Extraction
Extract structured financial data from receipts and invoices.
{
"tool": "papercortex_receipt",
"arguments": {
"document_id": 5678
}
}
Returns:
{
"vendor": "Amazon EU S.a.r.l.",
"date": "2025-03-15",
"total_gross": 119.99,
"total_net": 100.83,
"tax_rate": 19,
"tax_amount": 19.16,
"currency": "EUR",
"items": [
{ "description": "USB-C Hub", "quantity": 1, "price": 49.99 },
{ "description": "Monitor Arm", "quantity": 1, "price": 70.00 }
],
"invoice_number": "INV-DE-2025-1234567"
}
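LLM-extracted amounts are worth sanity-checking before they reach an accounting export. The sketch below types the response shown above and verifies that net plus tax equals gross, that the tax amount matches the rate within one cent, and that the line items sum to the gross total. The `checkReceipt` helper is hypothetical, and treating `price` as a gross unit price is an assumption based on the sample, where the two items sum to `total_gross`.

```typescript
interface ReceiptItem {
  description: string;
  quantity: number;
  price: number; // assumed: gross price per unit
}

interface Receipt {
  vendor: string;
  date: string;
  total_gross: number;
  total_net: number;
  tax_rate: number; // percent, e.g. 19
  tax_amount: number;
  currency: string;
  items: ReceiptItem[];
  invoice_number: string;
}

/** Convert a decimal amount to integer cents to avoid floating-point drift. */
const toCents = (amount: number): number => Math.round(amount * 100);

/** Return a list of consistency problems; an empty list means the totals agree. */
function checkReceipt(r: Receipt): string[] {
  const problems: string[] = [];
  const net = toCents(r.total_net);
  const tax = toCents(r.tax_amount);
  const gross = toCents(r.total_gross);

  if (net + tax !== gross) {
    problems.push("net + tax does not equal gross");
  }
  // Allow one cent of rounding tolerance between rate and stated tax amount.
  const expectedTax = Math.round((net * r.tax_rate) / 100);
  if (Math.abs(expectedTax - tax) > 1) {
    problems.push("tax amount does not match tax rate");
  }
  const itemSum = r.items.reduce(
    (sum, item) => sum + toCents(item.price) * item.quantity,
    0,
  );
  if (itemSum !== gross) {
    problems.push("items do not sum to gross");
  }
  return problems;
}
```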
papercortex_query: Natural Language Questions
Ask questions about your entire document archive.
{
"tool": "papercortex_query",
"arguments": {
"question": "How much did I spend on business travel in Q1 2025?"
}
}
Returns: A natural language answer with source document references and a breakdown of the calculation.
papercortex_export: Accounting Export
Export extracted receipt data in accounting-ready formats.
{
"tool": "papercortex_export",
"arguments": {
"format": "datev",
"date_from": "2025-01-01",
"date_to": "2025-03-31",
"account_plan": "SKR03"
}
}
Supported formats: datev (German standard), csv (universal), json (programmatic).
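For the csv format, the essential detail is correct quoting of fields that contain commas or quotes, such as vendor names. The sketch below serializes receipt rows as plain comma-separated CSV. The column set and the names `ExportRow`, `escapeCsv`, and `toCsv` are assumptions for illustration. It does not attempt the DATEV Buchungsstapel layout, which has its own header and field rules.

```typescript
// Hypothetical column set; the real export may include more or different fields.
interface ExportRow {
  date: string;
  vendor: string;
  total_gross: number;
  currency: string;
  invoice_number: string;
}

/** Quote a field if it contains a comma, quote, or line break; double inner quotes. */
function escapeCsv(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

/** Serialize rows to CSV with a header line and CRLF line endings. */
function toCsv(rows: ExportRow[]): string {
  const header = ["date", "vendor", "total_gross", "currency", "invoice_number"];
  const lines = rows.map((r) =>
    [r.date, r.vendor, r.total_gross.toFixed(2), r.currency, r.invoice_number]
      .map(escapeCsv)
      .join(","),
  );
  return [header.join(","), ...lines].join("\r\n");
}
```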
Add to your ~/.claude.json (user scope) or to a .mcp.json file in the project root (project scope):
{
"mcpServers": {
"papercortex": {
"command": "node",
"args": ["./dist/mcp-server/index.js"],
"cwd": "/path/to/PaperCortex",
"env": {
"PAPERLESS_URL": "http://localhost:8000",
"PAPERLESS_TOKEN": "your-token",
"OLLAMA_URL": "http://localhost:11434"
}
}
}
}
Once connected, you can ask Claude Code about your documents naturally:
You: Search my documents for anything related to the office lease renewal
Claude: I found 4 relevant documents:
1. "Mietvertrag Verlängerung 2025" (Score: 0.94) — Document #1234
2. "Office Lease Agreement Amendment" (Score: 0.91) — Document #1235
3. "Nebenkostenabrechnung 2024" (Score: 0.78) — Document #1240
4. "Facilities Management Invoice" (Score: 0.72) — Document #1251
You: Extract the receipt data from document #5678 and export it for DATEV
Claude: Extracted receipt data:
Vendor: Deutsche Bahn AG
Date: 2025-03-20
Amount: 89.90 EUR (net: 75.55 EUR, 19% VAT: 14.35 EUR)
Description: ICE Frankfurt-Berlin, 1st class
DATEV export saved to: exports/datev_2025_03.csv
You: How much did I spend on cloud services this year?
Claude: Based on 23 matching documents, your cloud service spending in 2025:
- AWS: 2,340.00 EUR (12 invoices)
- Hetzner: 456.00 EUR (3 invoices)
- Cloudflare: 240.00 EUR (3 invoices)
- Vercel: 180.00 EUR (3 invoices)
- GitHub: 132.00 EUR (2 invoices)
Total: 3,348.00 EUR
┌──────────┐    ┌─────────────┐    ┌──────────────┐    ┌──────────┐    ┌──────────┐
│  Scan /  │    │ Paperless-  │    │ PaperCortex  │    │  Match   │    │  Export  │
│  Photo / ├───►│ ngx         ├───►│ Receipt      ├───►│  Bank    ├───►│  DATEV / │
│  Email   │    │ OCR+Store   │    │ Extraction   │    │  CSV     │    │  CSV     │
└──────────┘    └─────────────┘    └──────────────┘    └──────────┘    └──────────┘
# Process all unprocessed receipts
npm run receipt:process
# Extract data from a specific document
npm run receipt:extract -- --document-id 1234
# Import bank statement and match transactions
npm run receipt:match -- --bank-csv ./bank_export_2025_q1.csv
# Export matched data as DATEV
npm run receipt:export -- --format datev --period 2025-Q1
# Export as plain CSV
npm run receipt:export -- --format csv --period 2025-03
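The matching step in this workflow can be pictured as pairing each bank transaction with the receipt that has the same amount and the closest date. The sketch below is one simple way to do that, not the project's actual matcher: the exact-amount rule, the five-day window, and all type and function names are assumptions for illustration.

```typescript
// Hypothetical shapes; amounts are integer cents to avoid floating-point issues.
interface BankTransaction {
  id: string;
  date: string; // ISO date, e.g. "2025-03-17"
  amountCents: number; // debits may be negative in bank exports
}

interface ExtractedReceipt {
  documentId: number;
  date: string;
  totalGrossCents: number;
}

interface Match {
  transactionId: string;
  documentId: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

/** Absolute distance between two ISO dates, in days. */
function daysBetween(a: string, b: string): number {
  return Math.abs(Date.parse(a) - Date.parse(b)) / DAY_MS;
}

/** Greedy matching: same absolute amount, closest date within `maxDays`. */
function matchTransactions(
  transactions: BankTransaction[],
  receipts: ExtractedReceipt[],
  maxDays = 5,
): Match[] {
  const used = new Set<number>();
  const matches: Match[] = [];
  for (const tx of transactions) {
    let best: ExtractedReceipt | undefined;
    for (const r of receipts) {
      if (used.has(r.documentId)) continue;
      if (r.totalGrossCents !== Math.abs(tx.amountCents)) continue;
      const gap = daysBetween(tx.date, r.date);
      if (gap > maxDays) continue;
      if (best === undefined || gap < daysBetween(tx.date, best.date)) {
        best = r;
      }
    }
    if (best !== undefined) {
      used.add(best.documentId); // each receipt is matched at most once
      matches.push({ transactionId: tx.id, documentId: best.documentId });
    }
  }
  return matches;
}
```

Transactions left without a match are the ones that would answer a question like "Which receipts don't have a matching bank transaction?" from the opposite direction.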
The DATEV export generates a Buchungsstapel CSV file following the official DATEV format specification:
Your documents contain some of the most sensitive data in your life:
Cloud-based document AI services require uploading this data to external servers for processing. Even with encryption and privacy policies, you are trusting a third party with your most sensitive information.
PaperCortex takes a fundamentally different approach:
Your documents stay in your network. Period.
All configuration is done through environment variables. See .env.example for a complete template.
| Variable | Default | Description |
|---|---|---|
| PAPERLESS_URL | http://localhost:8000 | Paperless-ngx instance URL |
| PAPERLESS_TOKEN | (required) | Paperless-ngx API authentication token |
| OLLAMA_URL | http://localhost:11434 | Ollama API endpoint |
| OLLAMA_MODEL | qwen2.5:14b | LLM model for classification and extraction |
| OLLAMA_EMBEDDING_MODEL | nomic-embed-text | Embedding model for semantic search |
| VECTOR_DB_PATH | ./data/vectors.db | Path to the SQLite vector database |
| Variable | Default | Description |
|---|---|---|
| POLL_INTERVAL | 300 | Seconds between polling Paperless-ngx for new documents |
| BATCH_SIZE | 10 | Number of documents to process per batch |
| EMBEDDING_DIMENSIONS | 768 | Vector dimensions (must match embedding model) |
| CLASSIFICATION_CONFIDENCE | 0.7 | Minimum confidence to auto-apply classifications |
| Variable | Default | Description |
|---|---|---|
| DATEV_ADVISOR_NUMBER | (optional) | Steuerberater number for DATEV export header |
| DATEV_CLIENT_NUMBER | (optional) | Mandantennummer for DATEV export header |
| DATEV_FISCAL_YEAR_START | 01-01 | Fiscal year start (MM-DD) |
| DEFAULT_ACCOUNT_PLAN | SKR03 | Default chart of accounts (SKR03 or SKR04) |
| EXPORT_DIR | ./exports | Directory for generated export files |
| Variable | Default | Description |
|---|---|---|
| MCP_TRANSPORT | stdio | MCP transport mode (stdio or http) |
| MCP_PORT | 3100 | Port for HTTP transport mode |
| MCP_AUTH_TOKEN | (optional) | Bearer token for HTTP transport authentication |
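As an illustration of how these variables and defaults fit together, the sketch below reads a subset of them from an environment object, applies the documented defaults, and fails fast when the required token is missing. The `CoreConfig` shape and the `loadConfig` and `readNumber` helpers are hypothetical and not taken from the PaperCortex source.

```typescript
// Hypothetical config shape covering a subset of the variables in the tables.
interface CoreConfig {
  paperlessUrl: string;
  paperlessToken: string;
  ollamaUrl: string;
  ollamaModel: string;
  embeddingModel: string;
  embeddingDimensions: number;
  pollIntervalSeconds: number;
  classificationConfidence: number;
}

type Env = Record<string, string | undefined>;

/** Parse a numeric variable, falling back to the default when it is unset or blank. */
function readNumber(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isFinite(value)) {
    throw new Error(`${key} must be a number, got "${raw}"`);
  }
  return value;
}

/** Build the config from environment variables using the documented defaults. */
function loadConfig(env: Env): CoreConfig {
  const token = env["PAPERLESS_TOKEN"];
  if (!token) {
    throw new Error("PAPERLESS_TOKEN is required");
  }
  return {
    paperlessUrl: env["PAPERLESS_URL"] ?? "http://localhost:8000",
    paperlessToken: token,
    ollamaUrl: env["OLLAMA_URL"] ?? "http://localhost:11434",
    ollamaModel: env["OLLAMA_MODEL"] ?? "qwen2.5:14b",
    embeddingModel: env["OLLAMA_EMBEDDING_MODEL"] ?? "nomic-embed-text",
    embeddingDimensions: readNumber(env, "EMBEDDING_DIMENSIONS", 768),
    pollIntervalSeconds: readNumber(env, "POLL_INTERVAL", 300),
    classificationConfidence: readNumber(env, "CLASSIFICATION_CONFIDENCE", 0.7),
  };
}
```

In Node.js this would typically be called as `loadConfig(process.env)`. When switching to an embedding model with a different vector size, such as mxbai-embed-large at 1024 dimensions, EMBEDDING_DIMENSIONS has to be changed along with it.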
PaperCortex works with any Ollama-compatible model. Recommended configurations:
| Model | VRAM | Speed | Quality | Recommended For |
|---|---|---|---|---|
| qwen2.5:7b | 5 GB | Fast | Good | Raspberry Pi, low-end servers |
| qwen2.5:14b | 10 GB | Medium | Very Good | Most homelab setups |
| qwen2.5:32b | 20 GB | Slow | Excellent | High-accuracy requirements |
| llama3.1:8b | 5 GB | Fast | Good | Alternative to Qwen |
| mistral:7b | 5 GB | Fast | Good | European language focus |
| Model | Dimensions | Speed | Quality |
|---|---|---|---|
| nomic-embed-text | 768 | Very Fast | Very Good |
| mxbai-embed-large | 1024 | Fast | Excellent |
| all-minilm | 384 | Fastest | Good |
PaperCortex/
├── src/
│ ├── mcp-server/ # MCP Server for AI agent integration
│ │ ├── index.ts # Server entry point and tool registration
│ │ └── tools/
│ │ ├── search.ts # Semantic document search tool
│ │ ├── classify.ts # Auto-classification tool
│ │ ├── receipt.ts # Receipt data extraction tool
│ │ ├── query.ts # Natural language query tool
│ │ └── export.ts # DATEV/CSV export tool
│ ├── embeddings/
│ │ ├── ollama.ts # Ollama embedding API client
│ │ └── store.ts # SQLite vector store with HNSW index
│ ├── paperless/
│ │ ├── client.ts # Paperless-ngx REST API client
│ │ └── types.ts # TypeScript type definitions
│ └── receipt/
│ ├── extractor.ts # Receipt OCR content parsing and extraction
│ ├── matcher.ts # Bank CSV transaction matching engine
│ └── datev.ts # DATEV Buchungsstapel CSV formatter
├── docs/
│ ├── architecture.md # Detailed architecture documentation
│ ├── setup.md # Step-by-step installation guide
│ └── receipts.md # Receipt workflow documentation
├── docker-compose.yml # Production deployment
├── Dockerfile # Container build
├── .env.example # Configuration template (no secrets!)
├── package.json
├── tsconfig.json
└── LICENSE # MIT
Contributions are welcome! PaperCortex is early-stage and there are many ways to help:
git clone https://github.com/renefichtmueller/PaperCortex.git
cd PaperCortex
npm install
cp .env.example .env
# Edit .env with your local Paperless-ngx and Ollama settings
npm run dev
- Create a feature branch (git checkout -b feat/amazing-feature)
- Use conventional commit prefixes (feat:, fix:, docs:, refactor:)

| Area | Description | Difficulty |
|---|---|---|
| Bank CSV Parsers | Add parsers for different bank export formats (Sparkasse, ING, N26, Revolut, etc.) | Easy |
| Export Formats | Additional accounting export formats beyond DATEV | Medium |
| Web Dashboard | Build a simple web UI for browsing indexed documents and extracted data | Medium |
| Multi-language | Improve extraction accuracy for non-English/German receipts | Medium |
| Vision Models | Use Ollama vision models to extract data directly from receipt images | Hard |
| Webhooks | React to Paperless-ngx document events in real-time | Medium |
Q: Does PaperCortex modify my documents in Paperless-ngx?
A: By default, PaperCortex only reads documents. When you use the classify tool with apply: true, it can write tags, document types, and correspondents back to Paperless-ngx. Extraction results and embeddings are stored in PaperCortex's own database.
Q: How much disk space does the vector database need?
A: Roughly 1-2 KB per document for embeddings. A collection of 10,000 documents needs about 10-20 MB of vector storage.
Q: Can I use OpenAI instead of Ollama?
A: PaperCortex is designed for local-first operation with Ollama. Support for OpenAI-compatible APIs (including local alternatives like LM Studio, vLLM, or LocalAI) is on the roadmap.
Q: What Paperless-ngx version is required?
A: PaperCortex works with Paperless-ngx 2.0 and later (REST API v3+).
Q: Can I run PaperCortex on a Raspberry Pi?
A: PaperCortex itself is lightweight. The bottleneck is Ollama — you'll need a model that fits in your available RAM. qwen2.5:7b works on 8GB devices.
Q: Is DATEV export only for Germany?
A: The DATEV format is the German standard, but PaperCortex also exports plain CSV that works with any accounting software worldwide.
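As a rough cross-check of the disk space estimate in this FAQ, raw vector size is simply dimensions times bytes per component. The sketch below assumes one embedding per document and ignores index and SQLite overhead; the storage precision PaperCortex uses is not stated here, so both 2-byte and 4-byte components are shown as possibilities. The function names are illustrative.

```typescript
/** Raw size of one embedding vector in bytes. */
function rawVectorBytes(dimensions: number, bytesPerComponent: number): number {
  return dimensions * bytesPerComponent;
}

/** Raw vector storage in MiB for a collection, excluding index overhead. */
function estimateStorageMB(
  documents: number,
  dimensions: number,
  bytesPerComponent: number,
): number {
  return (documents * rawVectorBytes(dimensions, bytesPerComponent)) / (1024 * 1024);
}

// 768 dimensions at 2 bytes each: 1536 bytes (1.5 KB) per document.
const halfPrecisionBytes = rawVectorBytes(768, 2);
// 768 dimensions at 4 bytes each: 3072 bytes (3 KB) per document.
const singlePrecisionBytes = rawVectorBytes(768, 4);
// 10,000 documents at 2 bytes per component: about 14.6 MiB.
const tenThousandDocsMB = estimateStorageMB(10_000, 768, 2);
```

Under these assumptions the 1-2 KB per document figure lines up with 2-byte components at 768 dimensions, while 4-byte floats would roughly double it.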
MIT License — see LICENSE for details.
Free to use, modify, and distribute. Commercial use welcome.
Built on the shoulders of giants:
If PaperCortex is useful to you, please consider giving it a star — it helps others discover the project!
Your documents. Your AI. Your hardware.
No cloud required.