Enables video and image analysis using Qwen3-VL models deployed on Modal's serverless GPU infrastructure, supporting hours-long video processing, timestamp grounding, OCR in 32 languages, and various analysis tasks including summarization, text extraction, and frame comparison.
Qwen Video Understanding MCP Server
An MCP (Model Context Protocol) server that enables Claude and other AI agents to analyze videos and images using Qwen3-VL deployed on Modal.
Highlights
Hours-long video support with full recall
Timestamp grounding - second-level precision
256K context (expandable to 1M)
32-language OCR support
Free/self-hosted on Modal serverless GPU
Features
Video Analysis: Analyze videos via URL with custom prompts
Image Analysis: Analyze images via URL
Video Summarization: Generate brief, standard, or detailed summaries
Text Extraction: Extract on-screen text and transcribe speech
Video Q&A: Ask specific questions about video content
Frame Comparison: Analyze changes and progression in videos
Architecture
Claude/Agent → MCP Server → Modal API → Qwen3-VL (GPU)
The MCP server acts as a bridge between Claude and your Qwen3-VL model deployed on Modal's serverless GPU infrastructure.
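The flow above can be pictured as a thin function that turns one tool call into one HTTP request. This is a minimal sketch, not the server's actual code: the JSON field names (`video_url`, `question`, `max_frames`) are assumed to mirror the tool arguments, and `forward_tool_call` and its injectable `send` transport are hypothetical names introduced for illustration.

```python
import json
import urllib.request
from typing import Callable, Optional


def _http_post(url: str, body: bytes) -> bytes:
    """Default transport: POST a JSON body and return the raw response."""
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Generous timeout: the first request may wait on a cold start.
    with urllib.request.urlopen(req, timeout=300) as resp:
        return resp.read()


def forward_tool_call(
    endpoint: str,
    video_url: str,
    question: str,
    max_frames: int = 16,
    send: Optional[Callable[[str, bytes], bytes]] = None,
) -> dict:
    """Turn one tool call into one HTTP request against the Modal endpoint."""
    # Field names are assumptions that mirror the analyze_video arguments.
    payload = {
        "video_url": video_url,
        "question": question,
        "max_frames": max_frames,
    }
    transport = send or _http_post
    raw = transport(endpoint, json.dumps(payload).encode("utf-8"))
    return json.loads(raw)
```

Passing a stub as `send` lets the bridge logic be exercised without a deployed model or network access.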
Prerequisites
Modal Account: Sign up at modal.com
Deployed Qwen Model: Deploy the video understanding model to Modal (see below)
Python 3.10+
Quick Start
1. Deploy the Model to Modal (if not already done)
cd ~/qwen-video-modal
modal deploy qwen_video.py
2. Install the MCP Server
cd ~/qwen-video-mcp-server
pip install -e .
Or with uv:
uv pip install -e .
3. Configure Environment
cp .env.example .env
# Edit .env with your Modal workspace name
4. Add to Claude Desktop
Add the following to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json), replacing the directory path and MODAL_WORKSPACE value with your own:
{
  "mcpServers": {
    "qwen-video": {
      "command": "uv",
      "args": [
        "--directory",
        "/Users/adamanz/qwen-video-mcp-server",
        "run",
        "server.py"
      ],
      "env": {
        "MODAL_WORKSPACE": "adam-31541",
        "MODAL_APP": "qwen-video-understanding"
      }
    }
  }
}
5. Restart Claude Desktop
The qwen-video tools should now be available.
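If the config file already lists other MCP servers, pasting the JSON by hand risks overwriting them. The sketch below merges a single entry into an existing file instead. `add_mcp_server` is a hypothetical helper, not part of this project, and it assumes the file is either absent, empty, or valid JSON.

```python
import json
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, entry: dict) -> dict:
    """Insert or replace one server entry, leaving all other keys intact."""
    text = config_path.read_text().strip() if config_path.exists() else ""
    config = json.loads(text) if text else {}
    # Create the "mcpServers" section on first use, then set our entry.
    config.setdefault("mcpServers", {})[name] = entry
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

A call would pass the config path from step 4, the name `"qwen-video"`, and a dict with the same `command`, `args`, and `env` keys shown above, with your own directory and workspace filled in.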
Available Tools
analyze_video
Analyze a video with a custom prompt.
analyze_video(
    video_url="https://example.com/video.mp4",
    question="What happens in this video?",
    max_frames=16
)
analyze_image
Analyze an image with a custom prompt.
analyze_image(
    image_url="https://example.com/image.jpg",
    question="Describe this image"
)
summarize_video
Generate a video summary in different styles.
summarize_video(
    video_url="https://example.com/video.mp4",
    style="detailed"  # brief, standard, or detailed
)
extract_video_text
Extract text and transcribe speech from a video.
extract_video_text(
    video_url="https://example.com/presentation.mp4"
)
video_qa
Ask specific questions about a video.
video_qa(
    video_url="https://example.com/video.mp4",
    question="How many people appear in this video?"
)
compare_video_frames
Analyze changes throughout a video.
compare_video_frames(
    video_url="https://example.com/timelapse.mp4",
    comparison_prompt="How does the scene change?"
)
check_endpoint_status
Check the Modal endpoint configuration.
list_capabilities
List all server capabilities and supported formats.
Configuration
| Environment Variable | Description | Default |
| MODAL_WORKSPACE | Your Modal workspace/username | |
| MODAL_APP | Name of the Modal app | |
| | Override image endpoint URL | Auto-generated |
| | Override video endpoint URL | Auto-generated |
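"Auto-generated" presumably means the endpoint URLs are derived from the workspace and app names. Modal web endpoints typically follow the pattern `https://<workspace>--<app>-<function>.modal.run`, so a derivation might look like the sketch below. The `endpoint_url` helper and the function labels (`analyze-video`, `analyze-image`) are assumptions for illustration; check your Modal dashboard for the URLs your deployment actually exposes.

```python
from typing import Optional


def endpoint_url(
    workspace: str,
    app: str,
    label: str,
    override: Optional[str] = None,
) -> str:
    """Return the override if one is set, else derive the Modal URL."""
    if override:
        return override
    # Assumed pattern: <workspace>--<app>-<function label>.modal.run
    return f"https://{workspace}--{app}-{label}.modal.run"


# Hypothetical workspace name and function label:
print(endpoint_url("my-workspace", "qwen-video-understanding", "analyze-video"))
```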
Supported Formats
Video: mp4, webm, mov, avi, mkv
Image: jpg, jpeg, png, gif, webp, bmp
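A client can check a URL against these lists before sending it to the backend. This is a minimal sketch that looks only at the file extension in the URL path. `media_kind` is a hypothetical helper, and the server itself may detect formats differently (for example by content type).

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

VIDEO_EXTS = {"mp4", "webm", "mov", "avi", "mkv"}
IMAGE_EXTS = {"jpg", "jpeg", "png", "gif", "webp", "bmp"}


def media_kind(url: str) -> Optional[str]:
    """Classify a URL as 'video' or 'image' by extension, else None."""
    # urlparse drops the query string, so "?token=..." does not interfere.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower().lstrip(".")
    if suffix in VIDEO_EXTS:
        return "video"
    if suffix in IMAGE_EXTS:
        return "image"
    return None
```

A URL without a recognizable extension, such as a streaming page link, returns `None` under this check even if the content behind it is a video.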
Limitations
Videos must be accessible via public URL
Maximum 64 frames extracted per video
Recommended video length: under 10 minutes for best results
First request may have cold start delay (Modal serverless)
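The frame cap helps explain the length recommendation. The source does not say how frames are chosen, so the sketch below assumes uniform sampling, taking the middle frame of each equal-sized segment. `sample_frame_indices` is a hypothetical helper for illustration.

```python
def sample_frame_indices(
    total_frames: int,
    max_frames: int = 16,
    hard_cap: int = 64,
) -> list:
    """Pick evenly spaced frame indices, never more than hard_cap."""
    n = min(max_frames, hard_cap, total_frames)
    if n <= 0:
        return []
    step = total_frames / n
    # Take the middle frame of each of the n equal-length segments.
    return [int(step * i + step / 2) for i in range(n)]


print(sample_frame_indices(100, max_frames=4))  # [12, 37, 62, 87]
```

Under this assumption, a 10-minute video sampled at the 64-frame cap yields roughly one frame every 9.4 seconds (600 / 64), so brief events in longer videos can fall between samples.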
Cost
The Modal backend uses A100-40GB GPUs:
~$3.30/hour while processing
Scales to zero when idle (no cost)
Only charged for actual processing time
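As a worked example of the arithmetic, the sketch below converts processing time into an approximate charge at the quoted rate. The 60-second duration is illustrative only. Actual processing time depends on the video and on cold starts, and Modal's pricing may change.

```python
HOURLY_RATE_USD = 3.30  # approximate A100-40GB rate quoted above


def estimate_cost(
    processing_seconds: float,
    hourly_rate: float = HOURLY_RATE_USD,
) -> float:
    """Approximate cost in USD for a given amount of GPU time."""
    return round(processing_seconds / 3600 * hourly_rate, 4)


# A hypothetical 60-second analysis costs about five and a half cents.
print(estimate_cost(60))  # 0.055
```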
Troubleshooting
"Request timed out"
Video may be too large
Try a shorter video or reduce max_frames
"HTTP error 502/503"
Modal container is starting up (cold start)
Wait a few seconds and retry
"Video URL not accessible"
Ensure the URL is publicly accessible
Check for authentication requirements
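The "wait and retry" advice for 502/503 errors can be automated with exponential backoff. This is a sketch for client code that calls the endpoint directly with `urllib`, not something the MCP server is known to do. `call_with_retry` and its defaults are assumptions.

```python
import time
import urllib.error
from typing import Callable, TypeVar

T = TypeVar("T")
RETRYABLE = {502, 503}  # typical while a Modal container is cold-starting


def call_with_retry(
    request: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(), retrying on 502/503 with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return request()
        except urllib.error.HTTPError as err:
            is_last = attempt == attempts - 1
            # Other HTTP errors, or the final failure, go to the caller.
            if err.code not in RETRYABLE or is_last:
                raise
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
    raise RuntimeError("unreachable")
```

Timeouts are deliberately not retried here, since a timed-out request usually points to a video that is too large rather than a cold start.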
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
License
MIT