Multimodal AI vision MCP server for image, video, and object detection analysis. Enables UI/UX evaluation, visual regression testing, and interface understanding using Google Gemini and Vertex AI.
A powerful Model Context Protocol (MCP) server that provides AI-powered image and video analysis using Google Gemini and Vertex AI models.
You can use either the google provider or the vertex_ai provider. For simplicity, the google provider is recommended.
Below are the environment variables you need to set for your selected provider. (Note: it is recommended to raise your MCP client's timeout configuration to more than 5 minutes.)
(i) Using Google AI Studio Provider
export IMAGE_PROVIDER="google" # or vertex_ai
export VIDEO_PROVIDER="google" # or vertex_ai
export GEMINI_API_KEY="your-gemini-api-key"
Get your Google AI Studio API key here.
(ii) Using Vertex AI Provider
export IMAGE_PROVIDER="vertex_ai"
export VIDEO_PROVIDER="vertex_ai"
export VERTEX_CLIENT_EMAIL="[email protected]"
export VERTEX_PRIVATE_KEY="-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n"
export VERTEX_PROJECT_ID="your-gcp-project-id"
export GCS_BUCKET_NAME="your-gcs-bucket"
Refer to the guide here on how to set this up.
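As a quick sanity check before launching the server, a small POSIX-sh helper can confirm that the variables required by the selected provider are present. The variable names are those listed above; the helper itself is a sketch, not part of this package:

```shell
# Sketch: fail fast if the environment variables required by the chosen
# provider are missing. Variable names match the configuration above;
# check_provider_env is a hypothetical helper, not part of ai-vision-mcp.
check_provider_env() {
  case "$1" in
    google)    set -- GEMINI_API_KEY ;;
    vertex_ai) set -- VERTEX_CLIENT_EMAIL VERTEX_PRIVATE_KEY VERTEX_PROJECT_ID GCS_BUCKET_NAME ;;
    *) echo "unknown provider: $1"; return 1 ;;
  esac
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      return 1
    fi
  done
  echo "ok"
}

GEMINI_API_KEY="your-gemini-api-key"
check_provider_env google
```

Running this before the server starts surfaces a missing credential immediately, rather than as an opaque timeout inside the MCP client.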
Below are installation guides for this MCP server on different MCP clients, such as Claude Desktop, Claude Code, Cursor, and Cline.
Add to your Claude Desktop configuration:
(i) Using Google AI Studio Provider
{
"mcpServers": {
"ai-vision-mcp": {
"command": "npx",
"args": ["ai-vision-mcp"],
"env": {
"IMAGE_PROVIDER": "google",
"VIDEO_PROVIDER": "google",
"GEMINI_API_KEY": "your-gemini-api-key"
}
}
}
}
(ii) Using Vertex AI Provider
{
"mcpServers": {
"ai-vision-mcp": {
"command": "npx",
"args": ["ai-vision-mcp"],
"env": {
"IMAGE_PROVIDER": "vertex_ai",
"VIDEO_PROVIDER": "vertex_ai",
"VERTEX_CLIENT_EMAIL": "[email protected]",
"VERTEX_PRIVATE_KEY": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
"VERTEX_PROJECT_ID": "your-gcp-project-id",
"GCS_BUCKET_NAME": "ai-vision-mcp-{VERTEX_PROJECT_ID}"
}
}
}
}
(i) Using Google AI Studio Provider
claude mcp add ai-vision-mcp \
-e IMAGE_PROVIDER=google \
-e VIDEO_PROVIDER=google \
-e GEMINI_API_KEY=your-gemini-api-key \
-- npx ai-vision-mcp
(ii) Using Vertex AI Provider
claude mcp add ai-vision-mcp \
-e IMAGE_PROVIDER=vertex_ai \
-e VIDEO_PROVIDER=vertex_ai \
-e VERTEX_CLIENT_EMAIL=your-service-account@project.iam.gserviceaccount.com \
-e VERTEX_PRIVATE_KEY="-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n" \
-e VERTEX_PROJECT_ID=your-gcp-project-id \
-e GCS_BUCKET_NAME=ai-vision-mcp-{VERTEX_PROJECT_ID} \
-- npx ai-vision-mcp
Note: Increase the MCP startup timeout to 1 minute and the MCP tool execution timeout to about 5 minutes by updating ~/.claude/settings.json as follows:
{
"env": {
"MCP_TIMEOUT": "60000",
"MCP_TOOL_TIMEOUT": "300000"
}
}
Go to: Settings -> Cursor Settings -> MCP -> Add new global MCP server
Pasting the following configuration into your Cursor ~/.cursor/mcp.json file is the recommended approach. You may also install in a specific project by creating .cursor/mcp.json in your project folder. See Cursor MCP docs for more info.
(i) Using Google AI Studio Provider
{
"mcpServers": {
"ai-vision-mcp": {
"command": "npx",
"args": ["ai-vision-mcp"],
"env": {
"IMAGE_PROVIDER": "google",
"VIDEO_PROVIDER": "google",
"GEMINI_API_KEY": "your-gemini-api-key"
}
}
}
}
(ii) Using Vertex AI Provider
{
"mcpServers": {
"ai-vision-mcp": {
"command": "npx",
"args": ["ai-vision-mcp"],
"env": {
"IMAGE_PROVIDER": "vertex_ai",
"VIDEO_PROVIDER": "vertex_ai",
"VERTEX_CLIENT_EMAIL": "[email protected]",
"VERTEX_PRIVATE_KEY": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
"VERTEX_PROJECT_ID": "your-gcp-project-id",
"GCS_BUCKET_NAME": "ai-vision-mcp-{VERTEX_PROJECT_ID}"
}
}
}
}
Cline uses a JSON configuration file to manage MCP servers. To integrate the provided MCP server configuration:
(i) Using Google AI Studio Provider
{
"mcpServers": {
"ai-vision-mcp": {
"timeout": 300,
"type": "stdio",
"command": "npx",
"args": ["ai-vision-mcp"],
"env": {
"IMAGE_PROVIDER": "google",
"VIDEO_PROVIDER": "google",
"GEMINI_API_KEY": "your-gemini-api-key"
}
}
}
}
(ii) Using Vertex AI Provider
{
"mcpServers": {
"ai-vision-mcp": {
"timeout": 300,
"type": "stdio",
"command": "npx",
"args": ["ai-vision-mcp"],
"env": {
"IMAGE_PROVIDER": "vertex_ai",
"VIDEO_PROVIDER": "vertex_ai",
"VERTEX_CLIENT_EMAIL": "[email protected]",
"VERTEX_PRIVATE_KEY": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
"VERTEX_PROJECT_ID": "your-gcp-project-id",
"GCS_BUCKET_NAME": "ai-vision-mcp-{VERTEX_PROJECT_ID}"
}
}
}
}
The server uses stdio transport and follows the standard MCP protocol. It can be integrated with any MCP-compatible client by running:
npx ai-vision-mcp
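Concretely, the stdio transport is line-oriented: one JSON-RPC message per line on stdin/stdout. A minimal, illustrative handshake message looks like this; real clients manage the handshake for you, but piping such lines into the command above would drive the server directly:

```shell
# Sketch of the newline-delimited JSON-RPC framing used by MCP's stdio transport.
# The initialize request shape follows the MCP specification; protocolVersion and
# clientInfo values here are illustrative.
init_msg='{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"demo-client","version":"0.0.1"}}}'
printf '%s\n' "$init_msg"
# To talk to the server directly (illustrative):
#   printf '%s\n' "$init_msg" | npx ai-vision-mcp
```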
The server provides five main MCP tools:
analyze_image
Analyzes an image using AI and returns a detailed description.
Parameters:
- imageSource (string): URL, base64 data, or file path to the image
- prompt (string): Question or instruction for the AI
- mode (string, optional): Analysis mode - one of:
  - general (default) - General image analysis
  - palette - Extract design tokens (colors, spacing, typography)
  - hierarchy - Analyze visual hierarchy and eye flow
  - components - Catalog UI components and design system maturity
- options (object, optional): Analysis options including temperature and max tokens
Examples:
{
"imageSource": "https://plus.unsplash.com/premium_photo-1710965560034-778eedc929ff",
"prompt": "What is this image about? Describe what you see in detail."
}
{
"imageSource": "https://example.com/design.png",
"prompt": "Extract all design tokens from this screenshot",
"mode": "palette"
}
{
"imageSource": "C:\\Users\\username\\Downloads\\ui_mockup.png",
"prompt": "Analyze the visual hierarchy and eye flow",
"mode": "hierarchy"
}
{
"imageSource": "https://example.com/design-system.png",
"prompt": "List all UI components and evaluate design system maturity",
"mode": "components"
}
compare_images
Compares multiple images using AI and returns a detailed comparison analysis.
Parameters:
- imageSources (array): Array of image sources (URLs, base64 data, or file paths) - minimum 2, maximum 4 images
- prompt (string): Question or instruction for comparing the images
- options (object, optional): Analysis options including temperature and max tokens
Examples:
{
"imageSources": [
"https://example.com/image1.jpg",
"https://example.com/image2.jpg"
],
"prompt": "Compare these two images and tell me the differences"
}
{
"imageSources": [
"https://example.com/image1.jpg",
"C:\\\\Users\\\\username\\\\Downloads\\\\image2.jpg",
"data:image/jpeg;base64,/9j/4AAQSkZJRgAB..."
],
"prompt": "Which image has the best lighting quality?"
}
detect_objects_in_image
Detects objects in an image using AI vision models and generates an annotated image with bounding boxes. Returns the detected objects with coordinates and saves the annotated image either to a specified file or to a temporary directory.
Parameters:
- imageSource (string): URL, base64 data, or file path to the image
- prompt (string): Custom detection prompt describing what to detect or recognize in the image
- outputFilePath (string, optional): Explicit output path for the annotated image
Configuration:
This function uses optimized default parameters for object detection and does not accept a runtime options parameter. To customize the AI parameters (temperature, topP, topK, maxTokens), use environment variables:
# Recommended environment variable settings for object detection (these are now the defaults)
TEMPERATURE_FOR_DETECT_OBJECTS_IN_IMAGE=0.0 # Deterministic responses
TOP_P_FOR_DETECT_OBJECTS_IN_IMAGE=0.95 # Nucleus sampling
TOP_K_FOR_DETECT_OBJECTS_IN_IMAGE=30 # Vocabulary selection
MAX_TOKENS_FOR_DETECT_OBJECTS_IN_IMAGE=8192 # High token limit for JSON
File Handling Logic:
Response Types:
- file object when an explicit outputFilePath is provided
- tempFile object when no explicit outputFilePath is provided, so the annotated image is auto-saved to a temporary folder
- detections array with detected objects and coordinates
- summary with percentage-based coordinates for browser automation
Examples:
{
"imageSource": "https://example.com/image.jpg",
"prompt": "Detect all objects in this image"
}
{
"imageSource": "C:\\Users\\username\\Downloads\\image.jpg",
"outputFilePath": "C:\\Users\\username\\Documents\\annotated_image.png"
}
{
"imageSource": "data:image/jpeg;base64,/9j/4AAQSkZJRgAB...",
"prompt": "Detect and label all electronic devices in this image"
}
audit_design
Audits UI/UX design compliance with pixel-level analysis and AI critique.
This tool provides automated design compliance auditing using pure TypeScript/JavaScript pixel analysis combined with Gemini Vision API critique. It extracts dominant colors, detects visual complexity, validates WCAG contrast ratios, and generates actionable design recommendations.
Inspired by: Automating UX/UI Design Analysis with Python, Machine Learning, and LLMs by Jade Graham
Parameters:
- imageSource (string): URL, base64 data, or file path to the design image
- prompt (string, optional): Custom audit context or focus areas
- options (object, optional): Analysis options including temperature and max tokens
Examples:
{
"imageSource": "https://example.com/design.png",
"prompt": "Audit this design for accessibility and visual hierarchy"
}
{
"imageSource": "C:\\Users\\username\\Downloads\\ui_design.png",
"prompt": "Check WCAG AA compliance"
}
analyze_video
Analyzes a video using AI and returns a detailed description.
Parameters:
- videoSource (string): YouTube URL, GCS URI, or local file path to the video
- prompt (string): Question or instruction for the AI
- options (object, optional): Analysis options including temperature and max tokens
Supported video sources:
- YouTube URLs (e.g., https://www.youtube.com/watch?v=...)
- Local file paths (e.g., C:\Users\username\Downloads\video.mp4)
Examples:
{
"videoSource": "https://www.youtube.com/watch?v=9hE5-98ZeCg",
"prompt": "What is this video about? Describe what you see in detail."
}
{
"videoSource": "C:\\Users\\username\\Downloads\\video.mp4",
"prompt": "What is this video about? Describe what you see in detail."
}
Note: Only YouTube URLs are supported for public video URLs. Other public video URLs are not currently supported.
For basic setup, you only need to configure the provider selection and required credentials:
export IMAGE_PROVIDER="google"
export VIDEO_PROVIDER="google"
export GEMINI_API_KEY="your-gemini-api-key"
export IMAGE_PROVIDER="vertex_ai"
export VIDEO_PROVIDER="vertex_ai"
export VERTEX_CLIENT_EMAIL="[email protected]"
export VERTEX_PRIVATE_KEY="-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n"
export VERTEX_PROJECT_ID="your-gcp-project-id"
export GCS_BUCKET_NAME="your-gcs-bucket"
For comprehensive environment variable documentation, 👉 see the Environment Variable Guide.
The server uses a hierarchical configuration system where more specific settings override general ones:
1. Function-specific settings (TEMPERATURE_FOR_ANALYZE_IMAGE, etc.)
2. Task-specific settings (TEMPERATURE_FOR_IMAGE, etc.)
3. General settings (TEMPERATURE, etc.)
Basic Optimization:
# General settings
export TEMPERATURE=0.7
export MAX_TOKENS=1500
# Task-specific optimization
export TEMPERATURE_FOR_IMAGE=0.2 # More precise for images
export TEMPERATURE_FOR_VIDEO=0.5 # More creative for videos
Function-specific Optimization:
# Optimize individual functions
export TEMPERATURE_FOR_ANALYZE_IMAGE=0.1
export TEMPERATURE_FOR_COMPARE_IMAGES=0.3
export TEMPERATURE_FOR_DETECT_OBJECTS_IN_IMAGE=0.0 # Deterministic
export MAX_TOKENS_FOR_DETECT_OBJECTS_IN_IMAGE=8192 # High token limit
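The override order amounts to a fallback chain. This can be sketched in a few lines of shell; resolve_temperature is a hypothetical helper, but the variable names and precedence follow the hierarchy described above:

```shell
# Sketch of the documented precedence: function-specific beats task-specific,
# which beats the general setting. resolve_temperature is a hypothetical helper.
resolve_temperature() {
  # $1 = function name (e.g. ANALYZE_IMAGE), $2 = task name (e.g. IMAGE)
  eval "fn_val=\${TEMPERATURE_FOR_$1:-}"
  eval "task_val=\${TEMPERATURE_FOR_$2:-}"
  echo "${fn_val:-${task_val:-${TEMPERATURE:-}}}"
}

TEMPERATURE=0.7
TEMPERATURE_FOR_IMAGE=0.2
resolve_temperature ANALYZE_IMAGE IMAGE   # task-specific value wins: 0.2
TEMPERATURE_FOR_ANALYZE_IMAGE=0.1
resolve_temperature ANALYZE_IMAGE IMAGE   # function-specific value now wins: 0.1
```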
Model Selection:
# Choose models per function
export ANALYZE_IMAGE_MODEL="gemini-2.5-flash-lite"
export COMPARE_IMAGES_MODEL="gemini-2.5-flash"
export ANALYZE_VIDEO_MODEL="gemini-2.5-flash-pro"
If you see errors like:
tools/call failed: Transport closed
Common causes:
A) Image annotation dependency failed to load
This server uses imagescript for image annotation/dimension extraction.
Verify it loads:
npm run doctor
# or
npm run check:imagescript
B) stdout logs corrupt stdio MCP framing
This server uses the MCP stdio transport (newline-delimited JSON-RPC over stdout).
- Write diagnostic logs to stderr (console.error)
- Never use console.log in stdio MCP servers
If stdout is polluted, clients (Codex/Claude Code) may disconnect and report Transport closed.
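The rule can be demonstrated in isolation: only protocol frames may go to file descriptor 1, and everything diagnostic goes to descriptor 2. The JSON frame and helper names below are illustrative:

```shell
# Sketch: in a stdio MCP server, stdout carries only protocol JSON;
# diagnostics must go to stderr. The frame and helper names are illustrative.
emit_frame() { printf '%s\n' '{"jsonrpc":"2.0","id":1,"result":{}}'; }
log_debug()  { printf '%s\n' "debug: $*" >&2; }

emit_frame                       # -> stdout: the client parses this
log_debug "handled tools/call"   # -> stderr: invisible to the framing
```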
# Clone the repository
git clone https://github.com/tan-yong-sheng/ai-vision-mcp.git
cd ai-vision-mcp
# Install dependencies
npm install
# Build the project
npm run build
# Start development server
npm run dev
- npm run build - Build the TypeScript project
- npm run dev - Start development server with watch mode
- npm run lint - Run ESLint
- npm run format - Format code with Prettier
- npm start - Start the built server
The project follows a modular architecture:
src/
├── providers/ # AI provider implementations
│ ├── gemini/ # Google Gemini provider
│ ├── vertexai/ # Vertex AI provider
│ └── factory/ # Provider factory
├── services/ # Core services
│ ├── ConfigService.ts
│ └── FileService.ts
├── storage/ # Storage implementations
├── file-upload/ # File upload strategies
├── types/ # TypeScript type definitions
├── utils/ # Utility functions
└── server.ts # Main MCP server
The server includes comprehensive error handling.
1. Create your feature branch (git checkout -b feature/amazing-feature)
2. Commit your changes (git commit -m 'Add amazing feature')
3. Push to the branch (git push origin feature/amazing-feature)
This project is licensed under the MIT License - see the LICENSE file for details.