Enables video and image analysis using Qwen3-VL models deployed on Modal's serverless GPU infrastructure. Supports hours-long video processing, timestamp grounding, OCR in 32 languages, and analysis tasks including summarization, text extraction, and frame comparison.
Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type @ followed by the MCP server name and your instructions, e.g., "@Qwen Video Understanding MCP Server summarize this presentation video https://youtube.com/watch?v=abc123"
That's it! The server will respond to your query, and you can continue using it as needed.
Qwen Video Understanding MCP Server
An MCP (Model Context Protocol) server that enables Claude and other AI agents to analyze videos and images using Qwen3-VL deployed on Modal.
Highlights
Hours-long video support with full recall
Timestamp grounding - second-level precision
256K context (expandable to 1M)
32-language OCR support
Free/self-hosted on Modal serverless GPU
Features
Video Analysis: Analyze videos via URL with custom prompts
Image Analysis: Analyze images via URL
Video Summarization: Generate brief, standard, or detailed summaries
Text Extraction: Extract on-screen text and transcribe speech
Video Q&A: Ask specific questions about video content
Frame Comparison: Analyze changes and progression in videos
Architecture
The MCP server acts as a bridge between Claude and your Qwen3-VL model deployed on Modal's serverless GPU infrastructure.
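Roughly, the request flow looks like this (a sketch; the stdio transport is an assumption based on typical Claude Desktop MCP setups):

```
Claude Desktop --(MCP, stdio)--> qwen-video MCP server --(HTTPS)--> Modal web endpoint --> Qwen3-VL on serverless GPU
```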
Prerequisites
Modal Account: Sign up at modal.com
Deployed Qwen Model: Deploy the video understanding model to Modal (see below)
Python 3.10+
Quick Start
1. Deploy the Model to Modal (if not already done)
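Assuming the Modal app is defined in a file such as qwen_video.py (the filename is a placeholder; use the actual app file from this repository), deployment with the Modal CLI looks like:

```bash
# Authenticate with Modal (one-time setup)
modal setup

# Deploy the video-understanding app
modal deploy qwen_video.py
```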
2. Install the MCP Server
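For example, with pip (the package name below is a placeholder; use the name published for this repository):

```bash
pip install qwen-video-mcp
```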
Or with uv:
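(Same placeholder package name as above.)

```bash
uv pip install qwen-video-mcp
```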
3. Configure Environment
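A sketch of the environment setup; the variable names below are assumptions, so check the Configuration table and the repository source for the exact names:

```bash
# Variable names here are assumptions; verify against the repository
export MODAL_WORKSPACE="your-workspace"   # your Modal workspace/username
export MODAL_APP_NAME="qwen-video"        # name of the Modal app
# Optional overrides (endpoint URLs are otherwise auto-generated):
# export IMAGE_ENDPOINT_URL="https://..."
# export VIDEO_ENDPOINT_URL="https://..."
```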
4. Add to Claude Desktop
Add to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json):
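A typical entry looks like the following; the server key, command, package name, and environment variable names are assumptions based on the steps above:

```json
{
  "mcpServers": {
    "qwen-video": {
      "command": "uvx",
      "args": ["qwen-video-mcp"],
      "env": {
        "MODAL_WORKSPACE": "your-workspace",
        "MODAL_APP_NAME": "qwen-video"
      }
    }
  }
}
```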
5. Restart Claude Desktop
The qwen-video tools should now be available.
Available Tools
analyze_video
Analyze a video with a custom prompt; an example call is shown after this list.
analyze_image
Analyze an image with a custom prompt.
summarize_video
Generate a video summary in different styles.
extract_video_text
Extract text and transcribe speech from a video.
video_qa
Ask specific questions about a video.
compare_video_frames
Analyze changes throughout a video.
check_endpoint_status
Check the Modal endpoint configuration.
list_capabilities
List all server capabilities and supported formats.
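As a concrete illustration, an MCP tool call to analyze_video might carry a payload like the following. The argument names video_url, prompt, and max_frames are assumptions; max_frames mirrors the 64-frame limit noted under Limitations:

```json
{
  "name": "analyze_video",
  "arguments": {
    "video_url": "https://example.com/demo.mp4",
    "prompt": "What happens in the first 30 seconds?",
    "max_frames": 64
  }
}
```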
Configuration
| Environment Variable | Description | Default |
| --- | --- | --- |
|  | Your Modal workspace/username |  |
|  | Name of the Modal app |  |
|  | Override image endpoint URL | Auto-generated |
|  | Override video endpoint URL | Auto-generated |
Supported Formats
Video: mp4, webm, mov, avi, mkv
Image: jpg, jpeg, png, gif, webp, bmp
Limitations
Videos must be accessible via public URL
Maximum 64 frames extracted per video
Recommended video length: under 10 minutes for best results
First request may have cold start delay (Modal serverless)
Cost
The Modal backend uses A100-40GB GPUs:
~$3.30/hour while processing
Scales to zero when idle (no cost)
Only charged for actual processing time
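For example, a request that keeps the GPU busy for two minutes costs roughly $3.30 × 2/60 ≈ $0.11.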
Troubleshooting
"Request timed out"
Video may be too large
Try a shorter video or reduce max_frames
"HTTP error 502/503"
Modal container is starting up (cold start)
Wait a few seconds and retry
"Video URL not accessible"
Ensure the URL is publicly accessible
Check for authentication requirements
Development
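A typical local setup for a Python MCP server like this one; the repository URL, extras name, and test runner are assumptions:

```bash
# Clone the repository (URL is a placeholder)
git clone https://github.com/your-org/qwen-video-mcp.git
cd qwen-video-mcp

# Install in editable mode with dev dependencies (extras name is an assumption)
pip install -e ".[dev]"

# Run the test suite
pytest
```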
License
MIT