# AI Vision MCP Server
A powerful Model Context Protocol (MCP) server that provides AI-powered image and video analysis using Google Gemini and Vertex AI models.
## Features
- **Dual Provider Support**: Choose between Google Gemini API and Vertex AI
- **Multimodal Analysis**: Support for both image and video content analysis
- **Flexible File Handling**: Upload via multiple methods (URLs, local files, base64)
- **Storage Integration**: Built-in Google Cloud Storage support
- **Comprehensive Validation**: Zod-based data validation throughout
- **Error Handling**: Robust error handling with retry logic and circuit breakers
- **TypeScript**: Full TypeScript support with strict type checking
## Quick Start
### Pre-requisites
You can use either the [`google` provider](https://aistudio.google.com/welcome) or the [`vertex_ai` provider](https://cloud.google.com/vertex-ai/generative-ai/docs/start/quickstart). For simplicity, the `google` provider is recommended.
Below are the environment variables you need to set based on your selected provider. (Note: it is recommended to set your MCP client's timeout to more than 5 minutes.)
(i) **Using Google AI Studio Provider**
```bash
export IMAGE_PROVIDER="google" # or vertex_ai
export VIDEO_PROVIDER="google" # or vertex_ai
export GEMINI_API_KEY="your-gemini-api-key"
```
Get your Google AI Studio API key [here](https://aistudio.google.com/app/api-keys).
(ii) **Using Vertex AI Provider**
```bash
export IMAGE_PROVIDER="vertex_ai"
export VIDEO_PROVIDER="vertex_ai"
export VERTEX_CREDENTIALS="/path/to/service-account.json"
export GCS_BUCKET_NAME="your-gcs-bucket"
```
Refer to [the guideline here](docs/provider/vertex-ai-setup-guide.md) on how to set this up.
### Installation
Below are installation guides for this MCP server on different MCP clients, such as Claude Desktop, Claude Code, Cursor, and Cline.
<details>
<summary>Claude Desktop</summary>
Add to your Claude Desktop configuration:
(i) Using Google AI Studio Provider
```json
{
"mcpServers": {
"ai-vision-mcp": {
"command": "npx",
"args": ["ai-vision-mcp"],
"env": {
"IMAGE_PROVIDER": "google",
"VIDEO_PROVIDER": "google",
"GEMINI_API_KEY": "your-gemini-api-key"
}
}
}
}
```
(ii) Using Vertex AI Provider
```json
{
"mcpServers": {
"ai-vision-mcp": {
"command": "npx",
"args": ["ai-vision-mcp"],
"env": {
"IMAGE_PROVIDER": "vertex_ai",
"VIDEO_PROVIDER": "vertex_ai",
"VERTEX_CREDENTIALS": "/path/to/service-account.json",
"GCS_BUCKET_NAME": "ai-vision-mcp-{VERTEX_PROJECT_ID}"
}
}
}
}
```
</details>
<details>
<summary>Claude Code</summary>
(i) Using Google AI Studio Provider
```bash
claude mcp add ai-vision-mcp \
-e IMAGE_PROVIDER=google \
-e VIDEO_PROVIDER=google \
-e GEMINI_API_KEY=your-gemini-api-key \
-- npx ai-vision-mcp
```
(ii) Using Vertex AI Provider
```bash
claude mcp add ai-vision-mcp \
-e IMAGE_PROVIDER=vertex_ai \
-e VIDEO_PROVIDER=vertex_ai \
-e VERTEX_CREDENTIALS=/path/to/service-account.json \
-e GCS_BUCKET_NAME=ai-vision-mcp-{VERTEX_PROJECT_ID} \
-- npx ai-vision-mcp
```
Note: Increase the MCP startup timeout to 1 minute and the MCP tool execution timeout to about 5 minutes by updating `~/.claude/settings.json` as follows:
```json
{
"env": {
"MCP_TIMEOUT": "60000",
"MCP_TOOL_TIMEOUT": "300000"
}
}
```
</details>
<details>
<summary>Cursor</summary>
Go to: Settings -> Cursor Settings -> MCP -> Add new global MCP server
The recommended approach is to paste the following configuration into your `~/.cursor/mcp.json` file. You may also install it in a specific project by creating `.cursor/mcp.json` in your project folder. See the [Cursor MCP docs](https://docs.cursor.com/context/model-context-protocol) for more info.
(i) Using Google AI Studio Provider
```json
{
"mcpServers": {
"ai-vision-mcp": {
"command": "npx",
"args": ["ai-vision-mcp"],
"env": {
"IMAGE_PROVIDER": "google",
"VIDEO_PROVIDER": "google",
"GEMINI_API_KEY": "your-gemini-api-key"
}
}
}
}
```
(ii) Using Vertex AI Provider
```json
{
"mcpServers": {
"ai-vision-mcp": {
"command": "npx",
"args": ["ai-vision-mcp"],
"env": {
"IMAGE_PROVIDER": "vertex_ai",
"VIDEO_PROVIDER": "vertex_ai",
"VERTEX_CREDENTIALS": "/path/to/service-account.json",
"GCS_BUCKET_NAME": "ai-vision-mcp-{VERTEX_PROJECT_ID}"
}
}
}
}
```
</details>
<details>
<summary>Cline</summary>
Cline uses a JSON configuration file to manage MCP servers. To integrate the provided MCP server configuration:
1. Open Cline and click on the MCP Servers icon in the top navigation bar.
2. Select the Installed tab, then click Advanced MCP Settings.
3. In the `cline_mcp_settings.json` file, add the following configuration:
(i) Using Google AI Studio Provider
```json
{
  "mcpServers": {
    "ai-vision-mcp": {
      "timeout": 300,
      "type": "stdio",
      "command": "npx",
      "args": ["ai-vision-mcp"],
      "env": {
        "IMAGE_PROVIDER": "google",
        "VIDEO_PROVIDER": "google",
        "GEMINI_API_KEY": "your-gemini-api-key"
      }
    }
  }
}
```
(ii) Using Vertex AI Provider
```json
{
"mcpServers": {
"ai-vision-mcp": {
"timeout": 300,
"type": "stdio",
"command": "npx",
"args": ["ai-vision-mcp"],
"env": {
"IMAGE_PROVIDER": "vertex_ai",
"VIDEO_PROVIDER": "vertex_ai",
"VERTEX_CREDENTIALS": "/path/to/service-account.json",
"GCS_BUCKET_NAME": "ai-vision-mcp-{VERTEX_PROJECT_ID}"
}
}
}
}
```
</details>
<details>
<summary>Other MCP clients</summary>
The server uses stdio transport and follows the standard MCP protocol. It can be integrated with any MCP-compatible client by running:
```bash
npx ai-vision-mcp
```
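For clients with a JSON-based server registry, a configuration equivalent to the examples above should work (shown here for the `google` provider; swap in the `vertex_ai` variables as needed):
```json
{
  "mcpServers": {
    "ai-vision-mcp": {
      "command": "npx",
      "args": ["ai-vision-mcp"],
      "env": {
        "IMAGE_PROVIDER": "google",
        "VIDEO_PROVIDER": "google",
        "GEMINI_API_KEY": "your-gemini-api-key"
      }
    }
  }
}
```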
</details>
## MCP Tools
The server provides four main MCP tools:
### 1) `analyze_image`
Analyzes an image using AI and returns a detailed description.
**Parameters:**
- `imageSource` (string): URL, base64 data, or file path to the image
- `prompt` (string): Question or instruction for the AI
- `options` (object, optional): Analysis options including temperature and max tokens
**Examples:**
1. **Analyze image from URL:**
```json
{
"imageSource": "https://plus.unsplash.com/premium_photo-1710965560034-778eedc929ff",
"prompt": "What is this image about? Describe what you see in detail."
}
```
2. **Analyze local image file:**
```json
{
"imageSource": "C:\\Users\\username\\Downloads\\image.jpg",
"prompt": "What is this image about? Describe what you see in detail."
}
```
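3. **Analyze a base64-encoded image** (data URI truncated for brevity):
```json
{
  "imageSource": "data:image/jpeg;base64,/9j/4AAQSkZJRgAB...",
  "prompt": "What is this image about? Describe what you see in detail."
}
```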
### 2) `compare_images`
Compares multiple images using AI and returns a detailed comparison analysis.
**Parameters:**
- `imageSources` (array): Array of image sources (URLs, base64 data, or file paths) - minimum 2, maximum 4 images
- `prompt` (string): Question or instruction for comparing the images
- `options` (object, optional): Analysis options including temperature and max tokens
**Examples:**
1. **Compare images from URLs:**
```json
{
"imageSources": [
"https://example.com/image1.jpg",
"https://example.com/image2.jpg"
],
"prompt": "Compare these two images and tell me the differences"
}
```
2. **Compare mixed sources:**
```json
{
"imageSources": [
"https://example.com/image1.jpg",
"C:\\\\Users\\\\username\\\\Downloads\\\\image2.jpg",
"data:image/jpeg;base64,/9j/4AAQSkZJRgAB..."
],
"prompt": "Which image has the best lighting quality?"
}
```
### 3) `detect_objects_in_image`
Detects objects in an image using AI vision models and generates an annotated image with bounding boxes. Returns the detected objects with coordinates and saves the annotated image either to a specified file or to a temporary directory.
**Parameters:**
- `imageSource` (string): URL, base64 data, or file path to the image
- `prompt` (string): Custom detection prompt describing what to detect or recognize in the image
- `outputFilePath` (string, optional): Explicit output path for the annotated image
**Configuration:**
This function uses optimized default parameters for object detection and does not accept a runtime `options` parameter. To customize the AI parameters (temperature, topP, topK, maxTokens), use environment variables:
```bash
# Recommended environment variable settings for object detection (these are now the defaults)
TEMPERATURE_FOR_DETECT_OBJECTS_IN_IMAGE=0.0 # Deterministic responses
TOP_P_FOR_DETECT_OBJECTS_IN_IMAGE=0.95 # Nucleus sampling
TOP_K_FOR_DETECT_OBJECTS_IN_IMAGE=30 # Vocabulary selection
MAX_TOKENS_FOR_DETECT_OBJECTS_IN_IMAGE=8192 # High token limit for JSON
```
**File Handling Logic:**
1. **Explicit `outputFilePath` provided** → saves to the exact path specified
2. **No `outputFilePath` provided** → automatically saves to a temporary directory
**Response Types:**
- Returns a `file` object when an explicit `outputFilePath` is provided
- Returns a `tempFile` object when no `outputFilePath` is provided and the annotated image is auto-saved to a temporary folder
- Always includes `detections` array with detected objects and coordinates
- Includes `summary` with percentage-based coordinates for browser automation
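For orientation, a successful auto-save response might look like the following sketch. The top-level keys (`detections`, `summary`, `tempFile`) come from the list above; the inner field names and values are illustrative, not the exact schema:
```json
{
  "detections": [
    { "label": "laptop", "box": [120, 80, 560, 420] }
  ],
  "summary": "laptop at approximately 25% x, 15% y",
  "tempFile": { "path": "/tmp/annotated_image.png" }
}
```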
**Examples:**
1. **Basic object detection:**
```json
{
"imageSource": "https://example.com/image.jpg",
"prompt": "Detect all objects in this image"
}
```
2. **Save annotated image to specific path:**
```json
{
"imageSource": "C:\\Users\\username\\Downloads\\image.jpg",
"outputFilePath": "C:\\Users\\username\\Documents\\annotated_image.png"
}
```
3. **Custom detection prompt:**
```json
{
"imageSource": "data:image/jpeg;base64,/9j/4AAQSkZJRgAB...",
"prompt": "Detect and label all electronic devices in this image"
}
```
### 4) `analyze_video`
Analyzes a video using AI and returns a detailed description.
**Parameters:**
- `videoSource` (string): YouTube URL, GCS URI, or local file path to the video
- `prompt` (string): Question or instruction for the AI
- `options` (object, optional): Analysis options including temperature and max tokens
**Supported video sources:**
- YouTube URLs (e.g., `https://www.youtube.com/watch?v=...`)
- GCS URIs (e.g., `gs://your-gcs-bucket/video.mp4`)
- Local file paths (e.g., `C:\Users\username\Downloads\video.mp4`)
**Examples:**
1. **Analyze video from YouTube URL:**
```json
{
"videoSource": "https://www.youtube.com/watch?v=9hE5-98ZeCg",
"prompt": "What is this video about? Describe what you see in detail."
}
```
2. **Analyze local video file:**
```json
{
"videoSource": "C:\\Users\\username\\Downloads\\video.mp4",
"prompt": "What is this video about? Describe what you see in detail."
}
```
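3. **Analyze a video from a GCS URI** (supported per the `videoSource` parameter above; the bucket name is a placeholder):
```json
{
  "videoSource": "gs://your-gcs-bucket/video.mp4",
  "prompt": "What is this video about? Describe what you see in detail."
}
```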
**Note:** For public video URLs, only YouTube URLs are currently supported.
## Environment Configuration
For basic setup, you only need to configure the provider selection and required credentials:
### Google AI Studio Provider (Recommended)
```bash
export IMAGE_PROVIDER="google"
export VIDEO_PROVIDER="google"
export GEMINI_API_KEY="your-gemini-api-key"
```
### Vertex AI Provider (Production)
```bash
export IMAGE_PROVIDER="vertex_ai"
export VIDEO_PROVIDER="vertex_ai"
export VERTEX_CREDENTIALS="/path/to/service-account.json"
export GCS_BUCKET_NAME="your-gcs-bucket"
```
### 📖 **Detailed Configuration Guide**
For comprehensive environment variable documentation, including:
- Complete configuration reference (60+ environment variables)
- Function-specific optimization examples
- Advanced configuration patterns
- Troubleshooting guidance
👉 **[See Environment Variable Guide](docs/environment-variable-guide.md)**
### Configuration Priority Overview
The server uses a hierarchical configuration system where more specific settings override general ones:
1. **LLM-assigned values** (runtime parameters in tool calls)
2. **Function-specific variables** (`TEMPERATURE_FOR_ANALYZE_IMAGE`, etc.)
3. **Task-specific variables** (`TEMPERATURE_FOR_IMAGE`, etc.)
4. **Universal variables** (`TEMPERATURE`, etc.)
5. **System defaults**
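For example (a sketch, assuming the task-specific `TEMPERATURE_FOR_IMAGE` variable applies to all image functions): with the settings below, `analyze_image` resolves to 0.1, other image functions resolve to 0.2, and video functions fall back to the universal 0.7.
```bash
export TEMPERATURE=0.7                    # 4) universal fallback
export TEMPERATURE_FOR_IMAGE=0.2          # 3) task-specific: image functions
export TEMPERATURE_FOR_ANALYZE_IMAGE=0.1  # 2) function-specific: analyze_image only
```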
<details>
<summary><strong>Quick Configuration Examples</strong></summary>
**Basic Optimization:**
```bash
# General settings
export TEMPERATURE=0.7
export MAX_TOKENS=1500
# Task-specific optimization
export TEMPERATURE_FOR_IMAGE=0.2 # More precise for images
export TEMPERATURE_FOR_VIDEO=0.5 # More creative for videos
```
**Function-specific Optimization:**
```bash
# Optimize individual functions
export TEMPERATURE_FOR_ANALYZE_IMAGE=0.1
export TEMPERATURE_FOR_COMPARE_IMAGES=0.3
export TEMPERATURE_FOR_DETECT_OBJECTS_IN_IMAGE=0.0 # Deterministic
export MAX_TOKENS_FOR_DETECT_OBJECTS_IN_IMAGE=8192 # High token limit
```
**Model Selection:**
```bash
# Choose models per function
export ANALYZE_IMAGE_MODEL="gemini-2.5-flash-lite"
export COMPARE_IMAGES_MODEL="gemini-2.5-flash"
export ANALYZE_VIDEO_MODEL="gemini-2.5-pro"
```
</details>
## Development
### Prerequisites
- Node.js 18+
- npm or yarn
### Setup
```bash
# Clone the repository
git clone https://github.com/tan-yong-sheng/ai-vision-mcp.git
cd ai-vision-mcp
# Install dependencies
npm install
# Build the project
npm run build
# Start development server
npm run dev
```
### Scripts
- `npm run build` - Build the TypeScript project
- `npm run dev` - Start development server with watch mode
- `npm run lint` - Run ESLint
- `npm run format` - Format code with Prettier
- `npm start` - Start the built server
## Architecture
The project follows a modular architecture:
```
src/
├── providers/ # AI provider implementations
│ ├── gemini/ # Google Gemini provider
│ ├── vertexai/ # Vertex AI provider
│ └── factory/ # Provider factory
├── services/ # Core services
│ ├── ConfigService.ts
│ └── FileService.ts
├── storage/ # Storage implementations
├── file-upload/ # File upload strategies
├── types/ # TypeScript type definitions
├── utils/ # Utility functions
└── server.ts # Main MCP server
```
## Error Handling
The server includes comprehensive error handling:
- **Validation Errors**: Input validation using Zod schemas
- **Network Errors**: Automatic retries with exponential backoff
- **Authentication Errors**: Clear error messages for API key issues
- **File Errors**: Handling for file size limits and format restrictions
## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- Google for the Gemini and Vertex AI APIs
- The Model Context Protocol team for the MCP framework
- All contributors and users of this project