Mistral OCR Optimized



Author: snussik


Description

An optimized Model Context Protocol server for document OCR processing using Mistral AI with support for high-performance batch operations and async connection pooling. It enables efficient extraction of text and tables from local files or URLs into structured markdown and HTML formats while minimizing token costs.

README

Optimized MCP server for OCR processing using Mistral AI with batch processing and async connection pooling.

🚀 Key Optimizations

Feature	Benefit
Batch Processing API	Up to 50% cost reduction for large file sets
Async Connection Pooling	20-30% faster processing for multiple files
Token-Efficient Defaults	`include_images=false` and `table_format=markdown` save 30-40% of tokens
Concurrent Processing	Process up to 5 files simultaneously
Cross-Platform Paths	Works on Windows, macOS, Linux, and Docker
Configurable Parameters	Fine-tune OCR output with table_format, headers, footers

📦 Installation

Using UV (Recommended)

# Navigate to project directory
cd D:/dev/mcp_mistral_ocr_opt

# Create and activate virtual environment
uv venv
# Windows
.venv\Scripts\activate
# Unix
source .venv/bin/activate

# Install dependencies
uv pip install .

Using Docker

# Build image
docker build -t mcp-mistral-ocr-opt .

# Run container
docker run -e MISTRAL_API_KEY=your_api_key \
           -v /path/to/your/files:/data/ocr \
           mcp-mistral-ocr-opt:latest

⚙️ Configuration

Environment Variables

Create or edit the .env file:

# Required
MISTRAL_API_KEY=your_api_key_here
OCR_DIR=D:/dev/mcp_mistral_ocr_opt/data/ocr

# Optional - Batch Processing
BATCH_MODE=auto                  # auto, always, never
BATCH_MIN_FILES=5                # Use batch processing for 5+ files in auto mode
INLINE_BATCH_THRESHOLD=10        # Use inline batch for <10 files
MAX_CONCURRENT_REQUESTS=5        # Max concurrent API requests

# Optional - OCR Defaults (token optimization)
DEFAULT_TABLE_FORMAT=markdown    # null, markdown, or html
INCLUDE_IMAGES=false             # Default false for token efficiency
EXTRACT_HEADER=false             # Extract document headers
EXTRACT_FOOTER=false             # Extract document footers
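
The variables above could be read into a typed settings object along the following lines. This is a minimal sketch, not the server's actual loader: the names `OcrSettings`, `load_settings`, and `_as_bool` are hypothetical, and treating a missing `OCR_DIR` as an error is an assumption based on it being listed under "Required".

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else counts as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class OcrSettings:
    """Hypothetical container mirroring the .env keys listed above."""
    api_key: str
    ocr_dir: str
    batch_mode: str
    batch_min_files: int
    inline_batch_threshold: int
    max_concurrent_requests: int
    default_table_format: str
    include_images: bool
    extract_header: bool
    extract_footer: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> OcrSettings:
    """Build settings from a mapping, falling back to os.environ."""
    env = os.environ if env is None else env
    # Assumption: both "Required" variables must be non-empty.
    for required in ("MISTRAL_API_KEY", "OCR_DIR"):
        if not env.get(required):
            raise ValueError(f"{required} is required")
    batch_mode = env.get("BATCH_MODE", "auto")
    if batch_mode not in {"auto", "always", "never"}:
        raise ValueError(f"Unsupported BATCH_MODE: {batch_mode!r}")
    return OcrSettings(
        api_key=env["MISTRAL_API_KEY"],
        ocr_dir=env["OCR_DIR"],
        batch_mode=batch_mode,
        batch_min_files=int(env.get("BATCH_MIN_FILES", "5")),
        inline_batch_threshold=int(env.get("INLINE_BATCH_THRESHOLD", "10")),
        max_concurrent_requests=int(env.get("MAX_CONCURRENT_REQUESTS", "5")),
        default_table_format=env.get("DEFAULT_TABLE_FORMAT", "markdown"),
        include_images=_as_bool(env.get("INCLUDE_IMAGES", "false")),
        extract_header=_as_bool(env.get("EXTRACT_HEADER", "false")),
        extract_footer=_as_bool(env.get("EXTRACT_FOOTER", "false")),
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the sketch easy to try without touching the real environment.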

Claude Desktop Configuration

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "mistral-ocr-opt": {
      "command": "uv",
      "args": [
        "run",
        "--directory",
        "D:/dev/mcp_mistral_ocr_opt",
        "-m",
        "src.mcp_mistral_ocr_opt.main"
      ],
      "env": {
        "MISTRAL_API_KEY": "your_api_key_here",
        "OCR_DIR": "D:/dev/mcp_mistral_ocr_opt/data/ocr",
        "BATCH_MODE": "auto"
      }
    }
  }
}

🛠️ Available Tools

1. `process_local_file` - Process a single file

Process a single local file from OCR_DIR.

{
  "name": "process_local_file",
  "arguments": {
    "filename": "document.pdf",
    "table_format": "markdown",
    "extract_header": false,
    "extract_footer": false,
    "include_images": false
  }
}

Parameters:

filename (required): File name, relative to OCR_DIR
table_format (optional): null, markdown, or html - default: markdown
extract_header (optional): Extract document headers - default: false
extract_footer (optional): Extract document footers - default: false
include_images (optional): Include base64 images - default: false (token efficient)

Supported local file types:

PDFs: .pdf
Images: .jpg, .jpeg, .png, .gif, .webp, .bmp, .avif
Other formats (docx/xlsx/pptx) are not supported
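
A caller may want to filter files before sending them to the tool. The sketch below checks a file name against the extensions listed above; `SUPPORTED_SUFFIXES` and `is_supported` are hypothetical names, and the case-insensitive comparison is an assumption rather than documented server behavior.

```python
from pathlib import Path

# Extensions taken from the "Supported local file types" list above.
SUPPORTED_SUFFIXES = {
    ".pdf", ".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp", ".avif",
}


def is_supported(filename: str) -> bool:
    """Return True if the file extension is in the supported list.

    Assumption: matching ignores case, so "SCAN.JPG" is accepted.
    """
    return Path(filename).suffix.lower() in SUPPORTED_SUFFIXES
```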

2. `process_batch_local_files` - Process multiple files concurrently

Process multiple files with concurrent or batch processing (auto-selected).

{
  "name": "process_batch_local_files",
  "arguments": {
    "patterns": ["*.pdf", "scanned_*.jpg"],
    "max_files": 100,
    "table_format": "markdown",
    "include_images": false
  }
}

Parameters:

patterns (required): Array of glob patterns (e.g., ["*.pdf", "*.jpg"])
max_files (optional): Maximum number of files to process
Other parameters: same as process_local_file
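
To illustrate how `patterns` and `max_files` might interact, here is a minimal sketch that resolves glob patterns under a directory and caps the result. `collect_files` is a hypothetical helper; the de-duplication and the sorted order within each pattern are assumptions, not confirmed behavior of the server.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def collect_files(
    ocr_dir: str,
    patterns: Iterable[str],
    max_files: Optional[int] = None,
) -> List[Path]:
    """Resolve glob patterns relative to ocr_dir, optionally capped."""
    root = Path(ocr_dir)
    seen = set()
    matches: List[Path] = []
    for pattern in patterns:
        # Sort each pattern's matches so the result is deterministic.
        for path in sorted(root.glob(pattern)):
            if path.is_file() and path not in seen:
                seen.add(path)
                matches.append(path)
    return matches if max_files is None else matches[:max_files]
```

A file matched by two patterns is returned once, so overlapping patterns such as `*.pdf` and `scanned_*.pdf` would not cause duplicate processing in this sketch.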

Auto-selection Logic:

< 5 files: Concurrent processing
5-9 files: Inline batch (if BATCH_MODE=auto)
10+ files: File batch (saves up to 50% cost)
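
The thresholds above can be expressed as a small decision function. This is only a sketch of the documented `auto` behavior: `select_mode` and its return labels are hypothetical, and the handling of `BATCH_MODE=always` and `BATCH_MODE=never` is an assumption, since the README does not spell those cases out.

```python
def select_mode(
    file_count: int,
    batch_mode: str = "auto",
    batch_min_files: int = 5,
    inline_batch_threshold: int = 10,
) -> str:
    """Pick a processing strategy from the file count.

    Returns one of "concurrent", "inline_batch", or "file_batch"
    (labels invented for this sketch).
    """
    # Assumption: "never" disables batching entirely.
    if batch_mode == "never":
        return "concurrent"
    # Documented auto rule: fewer than BATCH_MIN_FILES runs concurrently.
    if batch_mode == "auto" and file_count < batch_min_files:
        return "concurrent"
    # Below INLINE_BATCH_THRESHOLD an inline batch is used.
    # Assumption: "always" falls through to here even for tiny sets.
    if file_count < inline_batch_threshold:
        return "inline_batch"
    return "file_batch"
```

With the default settings, 4 files map to concurrent processing, 5 to 9 files to an inline batch, and 10 or more to a file batch, matching the list above.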

3. `process_url_file` - Process file from URL

Process a file from a public URL.

{
  "name": "process_url_file",
  "arguments": {
    "url": "https://example.com/document.pdf",
    "file_type": "pdf",
    "table_format": "html"
  }
}

4. `create_batch_job` - Create explicit batch job

Create a batch processing job (for large file sets, cost savings up to 50%).

{
  "name": "create_batch_job",
  "arguments": {
    "patterns": ["documents/*.pdf"],
    "use_inline": false,
    "table_format": "markdown"
  }
}

Returns:

{
  "batch_type": "file",
  "job_id": "job_abc123",
  "batch_file_id": "file_xyz789",
  "files_queued": 50,
  "message": "Batch job created with 50 files. Use check_batch_status to monitor progress."
}

5. `check_batch_status` - Monitor batch job

{
  "name": "check_batch_status",
  "arguments": {
    "job_id": "job_abc123"
  }
}

Returns:

{
  "id": "job_abc123",
  "status": "SUCCESS",
  "created_at": "2026-01-22T12:00:00",
  "completed_at": "2026-01-22T12:05:00"
}

6. `download_batch_results` - Download completed results

{
  "name": "download_batch_results",
  "arguments": {
    "job_id": "job_abc123"
  }
}

7. `cancel_batch_job` - Cancel running job

{
  "name": "cancel_batch_job",
  "arguments": {
    "job_id": "job_abc123"
  }
}

8. `list_batch_jobs` - List all batch jobs

{
  "name": "list_batch_jobs",
  "arguments": {
    "status": "RUNNING"
  }
}

📊 Output

OCR results are saved under OCR_DIR/output/, as JSON for single files and JSONL for batch results:

Single files: {filename}_{timestamp}.json
Batch results: batch_results_{job_id}_{timestamp}.jsonl

Result structure:

{
  "pages": [
    {
      "index": 0,
      "markdown": "Extracted text content...",
      "images": [],
      "tables": [],
      "hyperlinks": [],
      "dimensions": {"width": 0, "height": 0}
    }
  ],
  "model": "mistral-ocr-latest",
  "usage_info": {...},
  "_metadata": {
    "source_file": "/path/to/document.pdf",
    "output_file": "/path/to/output.json",
    "file_type": "pdf",
    "processed_at": "2026-01-22T12:00:00",
    "table_format": "markdown",
    "include_images": false
  }
}
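
Given the result structure above, the full document text can be recovered by joining the per-page `markdown` fields. The helpers below are a minimal sketch; `load_result` and `document_markdown` are hypothetical names, and sorting by `index` is a defensive assumption in case pages are not already in order.

```python
import json
from typing import Any, Dict


def load_result(path: str) -> Dict[str, Any]:
    """Read a single-file OCR result saved as JSON."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def document_markdown(result: Dict[str, Any], separator: str = "\n\n") -> str:
    """Concatenate page markdown in page-index order."""
    pages = sorted(
        result.get("pages", []),
        key=lambda page: page.get("index", 0),
    )
    return separator.join(page.get("markdown", "") for page in pages)
```

A typical use would be `document_markdown(load_result(path))`, where `path` points at one of the JSON files in OCR_DIR/output/.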

🎯 Usage Examples

Example 1: Process a single PDF with tables

{
  "name": "process_local_file",
  "arguments": {
    "filename": "invoice.pdf",
    "table_format": "html",
    "include_images": false
  }
}

Example 2: Process all PDFs in directory with batch

{
  "name": "process_batch_local_files",
  "arguments": {
    "patterns": ["*.pdf"],
    "table_format": "markdown"
  }
}

Example 3: Create explicit batch job for 100+ documents

{
  "name": "create_batch_job",
  "arguments": {
    "patterns": ["documents/**/*.pdf"],
    "use_inline": false,
    "table_format": "html",
    "extract_header": true,
    "extract_footer": true
  }
}

Then monitor:

{
  "name": "check_batch_status",
  "arguments": {
    "job_id": "job_abc123"
  }
}

And download when complete:

{
  "name": "download_batch_results",
  "arguments": {
    "job_id": "job_abc123"
  }
}
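
The create, monitor, and download sequence above can be wrapped in one polling loop. The sketch below assumes a caller-supplied `call_tool(name, arguments)` function that forwards to the MCP server and returns the tool's JSON response as a dictionary; that function, `run_batch`, and the failure status names are all hypothetical. Only "SUCCESS", "RUNNING", and "QUEUED" appear in this README.

```python
import time
from typing import Any, Callable, Dict, List

ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]

# Assumption: these names mark a job that will never succeed.
FAILED_STATUSES = {"FAILED", "CANCELLED"}


def run_batch(
    call_tool: ToolCaller,
    patterns: List[str],
    poll_seconds: float = 10.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Create a batch job, poll until it finishes, then download results."""
    job = call_tool(
        "create_batch_job",
        {"patterns": patterns, "use_inline": False},
    )
    job_id = job["job_id"]
    for _ in range(max_polls):
        status = call_tool("check_batch_status", {"job_id": job_id})["status"]
        if status == "SUCCESS":
            return call_tool("download_batch_results", {"job_id": job_id})
        if status in FAILED_STATUSES:
            raise RuntimeError(f"Batch job {job_id} ended with status {status}")
        # Any other status (e.g. QUEUED, RUNNING) means: keep waiting.
        sleep(poll_seconds)
    raise TimeoutError(f"Batch job {job_id} did not finish in time")
```

The `sleep` parameter is injectable so the loop can be exercised without real delays. The polling interval and retry limit are arbitrary illustrative values.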

🔧 Performance Tips

Token Optimization

Set include_images=false (default) - saves 30-40% tokens
Use table_format="markdown" (default) - more efficient than HTML
Skip extract_header/extract_footer unless needed

Cost Optimization

Use batch processing for 10+ files (up to 50% cost savings)
Set BATCH_MODE=always for large recurring batches
Use max_files to limit processing if needed

Speed Optimization

Increase MAX_CONCURRENT_REQUESTS (default: 5, max: 10)
Use inline batch for 5-9 files (faster startup)
Enable BATCH_MODE=auto (default) for auto-selection
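
To show what a cap like MAX_CONCURRENT_REQUESTS means in practice, the following sketch limits in-flight work with `asyncio.Semaphore`. It illustrates the general technique, not the server's actual implementation; `process_all` and the `process_one` callback are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List


async def process_all(
    filenames: Iterable[str],
    process_one: Callable[[str], Awaitable[Any]],
    max_concurrent: int = 5,
) -> List[Any]:
    """Run process_one for every file, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(name: str) -> Any:
        # Each task waits here until one of the slots is free.
        async with semaphore:
            return await process_one(name)

    # gather returns results in the same order as the input files.
    return await asyncio.gather(*(guarded(name) for name in filenames))
```

Raising the limit lets more requests overlap, which helps only while the API and the network are not already the bottleneck.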

📈 Performance Benchmarks

Scenario	Old Version	Optimized	Improvement
10 files concurrent	45s	12s	About 3.75x faster
100 files batch	$5.00	$2.50	50% cheaper
With images (tokens)	100%	60%	40% fewer tokens
PDF processing (API calls)	300	100	3x fewer calls

▶️ Run via UV

uv run pytest
uv run pytest --cov=src --cov-report=term-missing
uv run python -m src.mcp_mistral_ocr_opt.main

🐳 Docker Support

Build Image

docker build -t mcp-mistral-ocr-opt .

Run Container

docker run -e MISTRAL_API_KEY=your_key \
           -e OCR_DIR=/data/ocr \
           -v $(pwd)/data/ocr:/data/ocr \
           mcp-mistral-ocr-opt:latest

Docker Compose

version: '3.8'
services:
  mistral-ocr:
    image: mcp-mistral-ocr-opt:latest
    environment:
      MISTRAL_API_KEY: ${MISTRAL_API_KEY}
      OCR_DIR: /data/ocr
      BATCH_MODE: auto
      MAX_CONCURRENT_REQUESTS: 5
    volumes:
      - ./data/ocr:/data/ocr
    restart: unless-stopped

🤝 Migration from Original

If migrating from the original mcp-mistral-ocr:

API Key: Same key works
Tools: All original tools still work
New Tools: Batch tools added (optional to use)
Defaults: More token-efficient by default

No code changes required for basic usage!

📝 Troubleshooting

Issue: "Configuration error: MISTRAL_API_KEY is required"

Solution: Add MISTRAL_API_KEY=your_key to the .env file

Issue: "File not found"

Solution: Check the OCR_DIR path in .env and make sure the files are in that directory

Issue: "Batch job stuck in QUEUED"

Solution: Check the Mistral dashboard, or cancel the job with cancel_batch_job and retry

Issue: Connection errors

Solution: Verify your internet connection and that the API key is valid

📄 License

Based on the original mcp-mistral-ocr project.

🔗 Links

How to Install

Run in the terminal:

claude mcp add mcp-mistral-ocr-optimized -- npx
