An optimized Model Context Protocol server for document OCR processing using Mistral AI with support for high-performance batch operations and async connection pooling. It enables efficient extraction of text and tables from local files or URLs into structured markdown and HTML formats while minimizing token costs.
Optimized MCP server for OCR processing using Mistral AI with batch processing and async connection pooling.
| Feature | Benefit |
|---|---|
| Batch Processing API | Up to 50% cost reduction for large file sets |
| Async Connection Pooling | 20-30% faster processing for multiple files |
| Token-Efficient Defaults | include_images=false and table_format=markdown save 30-40% of tokens |
| Concurrent Processing | Process up to 5 files simultaneously |
| Cross-Platform Paths | Works on Windows, macOS, Linux, and Docker |
| Configurable Parameters | Fine-tune OCR output with table_format, headers, footers |
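
The concurrent-processing row can be pictured as a semaphore that caps in-flight requests. The sketch below is only an illustration of that idea, not the server's actual code: `process_all` and the `process_one` callback are hypothetical names, and the limit of 5 mirrors the documented default.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def process_all(
    names: Sequence[str],
    process_one: Callable[[str], Awaitable[str]],
    max_concurrent: int = 5,
) -> List[str]:
    """Run process_one for every name, never more than max_concurrent at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(name: str) -> str:
        # Each task waits here until one of the max_concurrent slots is free.
        async with semaphore:
            return await process_one(name)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(name) for name in names))
```

Results come back in the same order as the input names, which keeps it simple to pair each output with its source file.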
# Navigate to project directory
cd D:/dev/mcp_mistral_ocr_opt
# Create and activate virtual environment
uv venv
# Windows
.venv\Scripts\activate
# Unix
source .venv/bin/activate
# Install dependencies
uv pip install .
# Build image
docker build -t mcp-mistral-ocr-opt .
# Run container
docker run -e MISTRAL_API_KEY=your_api_key \
  -v /path/to/your/files:/data/ocr \
  mcp-mistral-ocr-opt:latest
Create or edit the .env file:
# Required
MISTRAL_API_KEY=your_api_key_here
OCR_DIR=D:/dev/mcp_mistral_ocr_opt/data/ocr
# Optional - Batch Processing
BATCH_MODE=auto # auto, always, never
BATCH_MIN_FILES=5 # Use batch processing for 5+ files in auto mode
INLINE_BATCH_THRESHOLD=10 # Use inline batch for <10 files
MAX_CONCURRENT_REQUESTS=5 # Max concurrent API requests
# Optional - OCR Defaults (token optimization)
DEFAULT_TABLE_FORMAT=markdown # null, markdown, or html
INCLUDE_IMAGES=false # Default false for token efficiency
EXTRACT_HEADER=false # Extract document headers
EXTRACT_FOOTER=false # Extract document footers
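
To make the defaults above concrete, here is a minimal sketch of how these variables could be parsed. It is an assumption about one reasonable approach, not the server's implementation: `OcrSettings`, `load_settings` and `_as_bool` are hypothetical names, and the clamp to 10 concurrent requests follows the maximum mentioned in the tuning tips.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class OcrSettings:
    """Hypothetical container for the documented environment variables."""
    api_key: str
    ocr_dir: str
    batch_mode: str = "auto"
    batch_min_files: int = 5
    inline_batch_threshold: int = 10
    max_concurrent_requests: int = 5
    default_table_format: Optional[str] = "markdown"
    include_images: bool = False
    extract_header: bool = False
    extract_footer: bool = False


def _as_bool(value: str) -> bool:
    # Accept a few common spellings of "true"; everything else is False.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> OcrSettings:
    missing = [key for key in ("MISTRAL_API_KEY", "OCR_DIR") if not env.get(key)]
    if missing:
        raise ValueError(f"Missing required variables: {', '.join(missing)}")

    batch_mode = env.get("BATCH_MODE", "auto").strip().lower()
    if batch_mode not in {"auto", "always", "never"}:
        raise ValueError(f"Unsupported BATCH_MODE: {batch_mode!r}")

    table_format = env.get("DEFAULT_TABLE_FORMAT", "markdown").strip().lower()
    if table_format not in {"null", "markdown", "html"}:
        raise ValueError(f"Unsupported DEFAULT_TABLE_FORMAT: {table_format!r}")

    # Assumed clamp: at least 1 request, at most the documented maximum of 10.
    concurrency = min(max(int(env.get("MAX_CONCURRENT_REQUESTS", "5")), 1), 10)

    return OcrSettings(
        api_key=env["MISTRAL_API_KEY"],
        ocr_dir=env["OCR_DIR"],
        batch_mode=batch_mode,
        batch_min_files=int(env.get("BATCH_MIN_FILES", "5")),
        inline_batch_threshold=int(env.get("INLINE_BATCH_THRESHOLD", "10")),
        max_concurrent_requests=concurrency,
        default_table_format=None if table_format == "null" else table_format,
        include_images=_as_bool(env.get("INCLUDE_IMAGES", "false")),
        extract_header=_as_bool(env.get("EXTRACT_HEADER", "false")),
        extract_footer=_as_bool(env.get("EXTRACT_FOOTER", "false")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to exercise in tests without touching the real environment.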
Add to claude_desktop_config.json:
{
  "mcpServers": {
    "mistral-ocr-opt": {
      "command": "uv",
      "args": [
        "run",
        "--directory",
        "D:/dev/mcp_mistral_ocr_opt",
        "-m",
        "src.mcp_mistral_ocr_opt.main"
      ],
      "env": {
        "MISTRAL_API_KEY": "your_api_key_here",
        "OCR_DIR": "D:/dev/mcp_mistral_ocr_opt/data/ocr",
        "BATCH_MODE": "auto"
      }
    }
  }
}
process_local_file - Process a single file
Process a single local file from OCR_DIR.
{
  "name": "process_local_file",
  "arguments": {
    "filename": "document.pdf",
    "table_format": "markdown",
    "extract_header": false,
    "extract_footer": false,
    "include_images": false
  }
}
Parameters:
- filename (required): Name of the file, relative to OCR_DIR
- table_format (optional): null, markdown, or html - default: markdown
- extract_header (optional): Extract document headers - default: false
- extract_footer (optional): Extract document footers - default: false
- include_images (optional): Include base64 images - default: false (token efficient)
Supported local file types:
- .pdf
- .jpg, .jpeg, .png, .gif, .webp, .bmp, .avif
process_batch_local_files - Process multiple files concurrently
Process multiple files with concurrent or batch processing (auto-selected).
{
  "name": "process_batch_local_files",
  "arguments": {
    "patterns": ["*.pdf", "scanned_*.jpg"],
    "max_files": 100,
    "table_format": "markdown",
    "include_images": false
  }
}
Parameters:
- patterns (required): Array of glob patterns (e.g., ["*.pdf", "*.jpg"])
- max_files (optional): Maximum number of files to process
- Remaining OCR options: same as process_local_file
Auto-selection Logic:
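
Based only on the comments next to BATCH_MODE, BATCH_MIN_FILES and INLINE_BATCH_THRESHOLD in the configuration section, the selection could plausibly work as sketched below. This is an assumption, not the server's confirmed logic; `choose_strategy` and its three return labels are hypothetical names chosen for illustration.

```python
def choose_strategy(
    file_count: int,
    batch_mode: str = "auto",
    batch_min_files: int = 5,
    inline_batch_threshold: int = 10,
) -> str:
    """Return 'concurrent', 'inline_batch' or 'file_batch' (illustrative labels)."""
    if batch_mode not in {"auto", "always", "never"}:
        raise ValueError(f"Unsupported batch mode: {batch_mode!r}")
    if file_count <= 0:
        raise ValueError("file_count must be positive")

    # 'never' always uses direct concurrent requests.
    if batch_mode == "never":
        return "concurrent"
    # In 'auto' mode, small sets stay below the batch cut-off (5+ files use batch).
    if batch_mode == "auto" and file_count < batch_min_files:
        return "concurrent"
    # Batch path: inline for fewer than the threshold, file-based otherwise.
    if file_count < inline_batch_threshold:
        return "inline_batch"
    return "file_batch"
```

With the documented defaults, 3 files would be processed concurrently, 5 to 9 files would go through an inline batch, and 10 or more would use a file-based batch.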
process_url_file - Process file from URL
Process a file from a public URL.
{
  "name": "process_url_file",
  "arguments": {
    "url": "https://example.com/document.pdf",
    "file_type": "pdf",
    "table_format": "html"
  }
}
create_batch_job - Create explicit batch job
Create a batch processing job (for large file sets, cost savings up to 50%).
{
  "name": "create_batch_job",
  "arguments": {
    "patterns": ["documents/*.pdf"],
    "use_inline": false,
    "table_format": "markdown"
  }
}
Returns:
{
  "batch_type": "file",
  "job_id": "job_abc123",
  "batch_file_id": "file_xyz789",
  "files_queued": 50,
  "message": "Batch job created with 50 files. Use check_batch_status to monitor progress."
}
check_batch_status - Monitor batch job
{
  "name": "check_batch_status",
  "arguments": {
    "job_id": "job_abc123"
  }
}
Returns:
{
  "id": "job_abc123",
  "status": "SUCCESS",
  "created_at": "2026-01-22T12:00:00",
  "completed_at": "2026-01-22T12:05:00"
}
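
Because a batch job finishes asynchronously, a client typically polls check_batch_status until the job reaches a final state. The sketch below shows one way to do that under stated assumptions: `wait_for_batch` and the injected `check_status` callable are hypothetical, and the set of terminal statuses other than SUCCESS is an assumption that should be checked against the Mistral batch documentation.

```python
import time
from typing import Any, Callable, Dict

# Only SUCCESS and RUNNING appear in this README; the other names are assumed.
TERMINAL_STATUSES = {"SUCCESS", "FAILED", "TIMEOUT_EXCEEDED", "CANCELLED"}


def wait_for_batch(
    job_id: str,
    check_status: Callable[[str], Dict[str, Any]],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Poll check_status(job_id) until the job reaches a terminal status."""
    deadline = clock() + timeout_seconds
    while True:
        result = check_status(job_id)
        if result.get("status") in TERMINAL_STATUSES:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"Batch job {job_id} did not finish in time")
        sleep(poll_seconds)
```

Here `check_status` stands in for whatever function invokes the check_batch_status tool through your MCP client; once the returned status is SUCCESS, download_batch_results can be called with the same job_id.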
download_batch_results - Download completed results
{
  "name": "download_batch_results",
  "arguments": {
    "job_id": "job_abc123"
  }
}
cancel_batch_job - Cancel running job
{
  "name": "cancel_batch_job",
  "arguments": {
    "job_id": "job_abc123"
  }
}
list_batch_jobs - List all batch jobs
{
  "name": "list_batch_jobs",
  "arguments": {
    "status": "RUNNING"
  }
}
OCR results are saved in OCR_DIR/output/ as JSON (batch results as JSON Lines):
- {filename}_{timestamp}.json
- batch_results_{job_id}_{timestamp}.jsonl
Result structure:
{
  "pages": [
    {
      "index": 0,
      "markdown": "Extracted text content...",
      "images": [],
      "tables": [],
      "hyperlinks": [],
      "dimensions": {"width": 0, "height": 0}
    }
  ],
  "model": "mistral-ocr-latest",
  "usage_info": {...},
  "_metadata": {
    "source_file": "/path/to/document.pdf",
    "output_file": "/path/to/output.json",
    "file_type": "pdf",
    "processed_at": "2026-01-22T12:00:00",
    "table_format": "markdown",
    "include_images": false
  }
}
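
A saved single-file result can be turned back into one markdown document by joining its pages in index order. The helper names below (`load_result`, `pages_to_markdown`) are hypothetical, and the sketch assumes only the `pages`, `index` and `markdown` fields shown in the result structure.

```python
import json
from pathlib import Path
from typing import Any, Dict, Union


def load_result(path: Union[str, Path]) -> Dict[str, Any]:
    """Read one {filename}_{timestamp}.json result file."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def pages_to_markdown(result: Dict[str, Any]) -> str:
    """Join the markdown of all pages, ordered by their index field."""
    pages = sorted(result.get("pages", []), key=lambda page: page.get("index", 0))
    return "\n\n".join(page.get("markdown", "") for page in pages)
```

For example, `pages_to_markdown(load_result(path))` yields the full extracted text of a document whose result file is at `path`.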
{
  "name": "process_local_file",
  "arguments": {
    "filename": "invoice.pdf",
    "table_format": "html",
    "include_images": false
  }
}
{
  "name": "process_batch_local_files",
  "arguments": {
    "patterns": ["*.pdf"],
    "table_format": "markdown"
  }
}
{
  "name": "create_batch_job",
  "arguments": {
    "patterns": ["documents/**/*.pdf"],
    "use_inline": false,
    "table_format": "html",
    "extract_header": true,
    "extract_footer": true
  }
}
Then monitor:
{
  "name": "check_batch_status",
  "arguments": {
    "job_id": "job_abc123"
  }
}
And download when complete:
{
  "name": "download_batch_results",
  "arguments": {
    "job_id": "job_abc123"
  }
}
- Use include_images=false (default) - saves 30-40% tokens
- Use table_format="markdown" (default) - more efficient than HTML
- Leave extract_header/extract_footer disabled unless needed
- Set BATCH_MODE=always for large recurring batches
- Set max_files to limit processing if needed
- Tune MAX_CONCURRENT_REQUESTS (default: 5, max: 10)
- Keep BATCH_MODE=auto (default) for auto-selection

| Scenario | Old Version | Optimized | Improvement |
|---|---|---|---|
| 10 files concurrent | 45s | 12s | ~3.75x faster |
| 100 files batch | $5.00 | $2.50 | 50% cheaper |
| With images (tokens) | 100% | 60% | 40% fewer tokens |
| PDF processing (API calls) | 300 | 100 | 3x fewer calls |
uv run pytest
uv run pytest --cov=src --cov-report=term-missing
uv run python -m src.mcp_mistral_ocr_opt.main
docker build -t mcp-mistral-ocr-opt .
docker run -e MISTRAL_API_KEY=your_key \
  -e OCR_DIR=/data/ocr \
  -v "$(pwd)/data/ocr:/data/ocr" \
  mcp-mistral-ocr-opt:latest
version: '3.8'
services:
  mistral-ocr:
    image: mcp-mistral-ocr-opt:latest
    environment:
      MISTRAL_API_KEY: ${MISTRAL_API_KEY}
      OCR_DIR: /data/ocr
      BATCH_MODE: auto
      MAX_CONCURRENT_REQUESTS: 5
    volumes:
      - ./data/ocr:/data/ocr
    restart: unless-stopped
If migrating from the original mcp-mistral-ocr:
No code changes required for basic usage!
Solution: Add MISTRAL_API_KEY=your_key to the .env file
Solution: Check the OCR_DIR path in .env and make sure the files are in that directory
Solution: Check the Mistral dashboard, or run cancel_batch_job and retry
Solution: Verify your internet connection and that the API key is valid
Based on the original mcp-mistral-ocr project.