Vision Service - MCP

Vision capabilities for ThinkDrop AI: screen capture, OCR, and VLM scene understanding.

Features

  • Screenshot Capture - Fast cross-platform screen capture

  • OCR - Text extraction using PaddleOCR (local, multilingual)

  • VLM - Scene understanding using MiniCPM-V 2.6 (lazy-loaded, optional)

  • Watch Mode - Continuous monitoring with change detection

  • Memory Integration - Automatically store results in the user-memory service as embeddings

Quick Start

# 1. Copy environment config
cp .env.example .env

# 2. Edit .env (set API keys, configure VLM, etc.)
nano .env

# 3. Start service
./start.sh

The service will be available at http://localhost:3006.

Installation Options

Minimal (OCR Only - No GPU Required)

pip install -r requirements.txt
  • Screenshot + OCR only

  • ~200-500ms per capture

  • No VLM dependencies
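
If you want to sanity-check the OCR path outside the service, here is a minimal sketch of the kind of PaddleOCR call that ocr_engine.py wraps; the wrapper's exact internals are not shown in this README, so treat the snippet as illustrative only.

from paddleocr import PaddleOCR

# Detection/recognition models (~100MB) download automatically on first use.
ocr = PaddleOCR(use_angle_cls=True, lang="en")

result = ocr.ocr("screenshot.png", cls=True)
for box, (text, confidence) in result[0]:  # result[0]: lines for the first image
    print(f"{confidence:.2f}  {text}")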

Full (OCR + VLM - GPU Recommended)

# Uncomment VLM dependencies in requirements.txt
pip install torch transformers accelerate

# Or with CUDA support
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install transformers accelerate
  • Screenshot + OCR + VLM

  • 600-1500ms with GPU, 2-6s with CPU

  • ~2.4GB model download on first use
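
Because the VLM is lazy-loaded, the ~2.4GB MiniCPM-V 2.6 download and its memory cost are only paid on the first describe request. A minimal sketch of that lazy-loading pattern, assuming the standard transformers loading path for openbmb/MiniCPM-V-2_6 (the actual vlm_engine.py may differ):

import torch
from transformers import AutoModel, AutoTokenizer

_model = None
_tokenizer = None

def get_vlm(model_id="openbmb/MiniCPM-V-2_6", device="cuda"):
    """Load the VLM once, on first use, and cache it for later calls."""
    global _model, _tokenizer
    if _model is None:
        _model = AutoModel.from_pretrained(
            model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
        ).eval().to(device)
        _tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    return _model, _tokenizer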

API Endpoints

Health Check

GET /health

Capture Screenshot

POST /vision/capture
{
  "region": [x, y, width, height],  # Optional
  "format": "png"
}

Extract Text (OCR)

POST /vision/ocr
{
  "region": [x, y, width, height],  # Optional
  "language": "en"                  # Optional
}

Describe Screen (VLM)

POST /vision/describe
{
  "region": [x, y, width, height],   # Optional
  "task": "Find the Save button",    # Optional focus
  "include_ocr": true,               # Include OCR text
  "store_to_memory": true            # Auto-store to user-memory
}
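
The same request from Python with requests. Note that the X-API-Key header name is an assumption for illustration; check src/middleware/validation.py for the header the service actually expects.

import requests

resp = requests.post(
    "http://localhost:3006/vision/describe",
    headers={"X-API-Key": "your-vision-api-key-here"},  # assumed header name
    json={
        "task": "Find the Save button",
        "include_ocr": True,
        "store_to_memory": True,
    },
    timeout=60,  # first call may be slow while the VLM lazy-loads
)
resp.raise_for_status()
print(resp.json())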

Start Watch Mode

POST /vision/watch/start
{
  "interval_ms": 2000,
  "change_threshold": 0.08,
  "run_ocr": true,
  "run_vlm": false,
  "task": "Monitor for errors"
}

Stop Watch Mode

POST /vision/watch/stop

Watch Status

GET /vision/watch/status

Configuration

Key environment variables in .env:

# Service
PORT=3006
API_KEY=your-vision-api-key-here

# OCR
OCR_ENGINE=paddleocr
OCR_LANGUAGE=en

# VLM (lazy-loaded)
VLM_ENABLED=true
VLM_MODEL=openbmb/MiniCPM-V-2_6
VLM_DEVICE=auto  # auto, cpu, cuda

# Watch
WATCH_DEFAULT_INTERVAL_MS=2000
WATCH_CHANGE_THRESHOLD=0.08

# User Memory Integration
USER_MEMORY_SERVICE_URL=http://localhost:3003
USER_MEMORY_API_KEY=your-user-memory-api-key
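
VLM_DEVICE=auto presumably resolves to CUDA when available and falls back to CPU otherwise. A minimal sketch of that resolution logic, assuming PyTorch (the actual implementation may differ):

import os

import torch

def resolve_vlm_device() -> str:
    """Map VLM_DEVICE=auto|cpu|cuda onto a concrete torch device string."""
    configured = os.getenv("VLM_DEVICE", "auto")
    if configured != "auto":
        return configured
    return "cuda" if torch.cuda.is_available() else "cpu"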

Performance

OCR Only (Minimal Setup)

  • Capture: 10-20ms

  • OCR: 200-500ms

  • Total: ~300-600ms per request

  • Memory: ~500MB

OCR + VLM (Full Setup)

  • Capture: 10-20ms

  • OCR: 200-500ms

  • VLM (GPU): 300-800ms

  • VLM (CPU): 2-5s

  • Total (GPU): ~600-1500ms

  • Total (CPU): ~2.5-6s

  • Memory: ~3-4GB (model loaded)

Watch Mode Strategy

Watch mode uses smart change detection to minimize VLM calls:

  1. Every interval: Capture + fingerprint comparison

  2. On change: Run OCR (if enabled)

  3. On significant change: Run VLM (if enabled)

  4. Auto-store: Send to user-memory service as embedding

This keeps VLM usage efficient while maintaining continuous awareness.
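
A minimal sketch of that loop, using a downscaled grayscale fingerprint and the mean absolute pixel difference as the change metric. The two-tier thresholding and all names here are illustrative; the real watch_manager.py may fingerprint and threshold differently.

import time

import numpy as np
from PIL import Image

def fingerprint(img: Image.Image) -> np.ndarray:
    """Tiny grayscale thumbnail: cheap to compute and compare every interval."""
    return np.asarray(img.convert("L").resize((64, 64)), dtype=np.float32) / 255.0

def watch(capture, run_ocr, run_vlm, interval_ms=2000, change_threshold=0.08):
    last = None
    while True:
        img = capture()                        # e.g. the mss-based screenshot service
        fp = fingerprint(img)
        if last is not None:
            change = float(np.abs(fp - last).mean())
            if change > change_threshold / 2:  # "on change": run OCR
                run_ocr(img)
            if change > change_threshold:      # "on significant change": run VLM
                run_vlm(img)
        last = fp
        time.sleep(interval_ms / 1000)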

Integration with ThinkDrop AI

The vision service integrates with the MCP state graph:

// In AgentOrchestrator state graph
const visionResult = await mcpClient.callService('vision', 'describe', {
  include_ocr: true,
  store_to_memory: true,
  task: userMessage
});

// Result automatically stored as embedding in user-memory
// No screenshot files to manage!

Testing

Test Capture

curl -X POST http://localhost:3006/vision/capture \
  -H "Content-Type: application/json" \
  -d '{}'

Test OCR

curl -X POST http://localhost:3006/vision/ocr \
  -H "Content-Type: application/json" \
  -d '{}'

Test VLM (if enabled)

curl -X POST http://localhost:3006/vision/describe \
  -H "Content-Type: application/json" \
  -d '{"include_ocr": true, "store_to_memory": false}'

Test Watch

# Start
curl -X POST http://localhost:3006/vision/watch/start \
  -H "Content-Type: application/json" \
  -d '{"interval_ms": 2000, "run_ocr": true}'

# Status
curl http://localhost:3006/vision/watch/status

# Stop
curl -X POST http://localhost:3006/vision/watch/stop

Troubleshooting

OCR Not Working

  • Check PaddleOCR installation: pip list | grep paddleocr

  • Models download on first use (~100MB)

  • Check logs for download progress

VLM Not Loading

  • Ensure dependencies installed: pip list | grep transformers

  • Check available memory (need 4-8GB)

  • Set VLM_ENABLED=false to disable

  • Model downloads on first use (~2.4GB)

Performance Issues

  • CPU too slow: Disable VLM, use OCR only

  • Memory issues: Reduce watch interval, disable VLM

  • GPU not detected: Check CUDA installation
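
A quick way to confirm whether PyTorch can see your GPU:

import torch

print(torch.cuda.is_available())        # False: the VLM will run on CPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))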

Architecture

vision-service/
├── server.py                   # FastAPI app
├── src/
│   ├── services/
│   │   ├── screenshot.py       # mss wrapper
│   │   ├── ocr_engine.py       # PaddleOCR wrapper
│   │   ├── vlm_engine.py       # VLM wrapper (lazy)
│   │   └── watch_manager.py    # Watch loop
│   ├── routes/
│   │   ├── capture.py          # /vision/capture
│   │   ├── ocr.py              # /vision/ocr
│   │   ├── describe.py         # /vision/describe
│   │   └── watch.py            # /vision/watch/*
│   └── middleware/
│       └── validation.py       # API key validation
├── requirements.txt
├── start.sh
└── README.md
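
For reference, the mss capture that screenshot.py wraps is only a few lines. A minimal sketch (the actual wrapper is not shown in this README):

import mss
from PIL import Image

def capture(region=None) -> Image.Image:
    """Grab the primary monitor, or an optional [x, y, width, height] region."""
    with mss.mss() as sct:
        if region:
            x, y, w, h = region
            monitor = {"left": x, "top": y, "width": w, "height": h}
        else:
            monitor = sct.monitors[1]  # monitors[0] is the combined virtual screen
        shot = sct.grab(monitor)
        return Image.frombytes("RGB", shot.size, shot.rgb)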

License

Part of the ThinkDrop AI project.
