Integrates with OpenAI's API to provide visual question-answering capabilities using GPT-4o-mini and GPT-4.1 models for compositional image understanding and knowledge-dependent visual reasoning
🚀 ViperMCP: A Model Context Protocol for Viper Server
Mixture-of-Experts VQA, streaming-ready, and MCP-native.
ViperMCP is a mixture-of-experts (MoE) visual question‑answering (VQA) server that exposes streamable MCP tools for:
🔎 Visual grounding
🧩 Compositional image QA
🌐 External knowledge‑dependent image QA
It’s built on the shoulders of 🐍 ViperGPT and delivered as a FastMCP HTTP server, so it works with all FastMCP client tooling.
✨ Highlights
⚡ MCP-native JSON‑RPC 2.0 endpoint (
/mcp/) with streaming🧠 MoE routing across classic and modern VLMs/LLMs
🧰 Two tools out of the box:
viper_query(text) &viper_task(crops/masks)🐳 One‑command Docker or pure‑Python install
🔐 Secure key handling via env var or secret mount
⚙️ Setup
🔑 OpenAI API Key
An OpenAI API key is required. Provide it via one of the following:
OPENAI_API_KEY(environment variable)OPENAI_API_KEY_PATH(path to a file containing the key)?apiKey=...HTTP query parameter (for quick local testing)
🌐 Ngrok (Optional)
Use ngrok to expose your local server:
Use the ngrok URL anywhere you see http://0.0.0.0:8000 below.
🛠️ Installation
🐳 Option A: Dockerized FastMCP Server (GPU‑ready)
Save your key to
api.key, then run:
This starts a CUDA‑enabled container serving MCP at:
💡 Prefer building from source? Use the included
docker-compose.yaml. By default it readsapi.keyfrom the project root. If your platform injects env vars, you can also setOPENAI_API_KEYdirectly.
🐍 Option B: Pure FastMCP Server (dev‑friendly)
Your server should be live at:
To use OpenAI‑backed models via query param:
🧪 Usage
🤝 FastMCP Client Example
Pass images as base64 (shown) or as URLs:
🧵 OpenAI API (MCP Integration)
The OpenAI MCP integration currently accepts image URLs (not raw base64). Send the URL as type: "input_text".
🌐 Endpoints
🔓 HTTP GET Endpoints
🧠 MCP Client Endpoints (JSON‑RPC 2.0)
🔨 MCP Client Functions
🧩 Models (Default MoE Pool)
🐊 Grounding DINO
✂️ Segment Anything (SAM)
🤖 GPT‑4o‑mini (LLM)
👀 GPT‑4o‑mini (VLM)
🧠 GPT‑4.1
🔭 X‑VLM
🌊 MiDaS (depth)
🐝 BERT
🧭 The MoE router picks from these based on the tool & prompt.
⚠️ Security & Production Notes
This package may generate and execute code on the host. We include basic injection guards, but you must harden for production. A recommended architecture separates concerns:
🧱 Isolate codegen & execution.
🔒 Lock down secrets & file access.
🧪 Add unit/integration tests around wrappers.
📚 Citations
Huge thanks to the ViperGPT team:
🤝 Contributions
PRs welcome! Please:
✅ Ensure all tests in
/testspass🧪 Add coverage for new features
📦 Keep docs & examples up to date
🧭 Quick Commands Cheat‑Sheet
💬 Questions?
Open an issue or start a discussion. We ❤️ feedback and ambitious ideas!
This server cannot be installed
hybrid server
The server is able to function both locally and remotely, depending on the configuration or use case.
A mixture-of-experts visual question-answering server that enables visual grounding, compositional image question answering, and external knowledge-dependent image question answering through code generation and execution. Built on the ViperGPT framework with support for multiple computer vision models including Grounding DINO, SegmentAnything, and GPT-4o.
Related MCP Servers
- -security-license-qualityA powerful server that integrates the Moondream vision model to enable advanced image analysis, including captioning, object detection, and visual question answering, through the Model Context Protocol, compatible with AI assistants like Claude and Cline.Last updated -18Apache 2.0
- Asecurity-licenseAqualityA MCP server that enables Claude and other MCP-compatible assistants to generate images from text prompts using Together AI's image generation models.Last updated -4MIT License
- -security-license-qualityA server that connects to the xAI/Grok image generation API, allowing users to generate images from text prompts with support for multiple image generation and different response formats.Last updated -8
- -security-license-qualityThis server enables interaction with Google's Video Intelligence API for advanced video analysis, auto-generated using AG2's MCP builder to provide a standardized multi-agent interface.Last updated -