Vision Service - MCP

Vision capabilities for ThinkDrop AI: screen capture, OCR, and VLM scene understanding.

Features

  • Screenshot Capture - Fast cross-platform screen capture

  • OCR - Text extraction using PaddleOCR (local, multilingual)

  • VLM - Scene understanding using MiniCPM-V 2.6 (lazy-loaded, optional)

  • Watch Mode - Continuous monitoring with change detection

  • Memory Integration - Automatically store results in the user-memory service as embeddings

Quick Start

# 1. Copy environment config
cp .env.example .env

# 2. Edit .env (set API keys, configure VLM, etc.)
nano .env

# 3. Start service
./start.sh

The service will be available at http://localhost:3006.

Installation Options

Minimal (OCR Only - No GPU Required)

pip install -r requirements.txt
  • Screenshot + OCR only

  • ~200-500ms per capture

  • No VLM dependencies
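
If you want to sanity-check the OCR path outside the service, here is a minimal sketch of the kind of PaddleOCR call that ocr_engine.py wraps; the wrapper's exact internals are not shown in this README, so treat the snippet as illustrative only.

from paddleocr import PaddleOCR

# Detection/recognition models (~100MB) download automatically on first use.
ocr = PaddleOCR(use_angle_cls=True, lang="en")

result = ocr.ocr("screenshot.png", cls=True)
for box, (text, confidence) in result[0]:  # result[0]: lines for the first image
    print(f"{confidence:.2f}  {text}")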

Full (OCR + VLM - GPU Recommended)

# Uncomment VLM dependencies in requirements.txt
pip install torch transformers accelerate

# Or with CUDA support
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install transformers accelerate
  • Screenshot + OCR + VLM

  • 600-1500ms with GPU, 2-6s with CPU

  • ~2.4GB model download on first use
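
Because the VLM is lazy-loaded, the ~2.4GB MiniCPM-V 2.6 download and its memory cost are only paid on the first describe request. A minimal sketch of that lazy-loading pattern, assuming the standard transformers loading path for openbmb/MiniCPM-V-2_6 (the actual vlm_engine.py may differ):

import torch
from transformers import AutoModel, AutoTokenizer

_model = None
_tokenizer = None

def get_vlm(model_id="openbmb/MiniCPM-V-2_6", device="cuda"):
    """Load the VLM once, on first use, and cache it for later calls."""
    global _model, _tokenizer
    if _model is None:
        _model = AutoModel.from_pretrained(
            model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
        ).eval().to(device)
        _tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    return _model, _tokenizer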

API Endpoints

Health Check

GET /health

Capture Screenshot

POST /vision/capture
{
  "region": [x, y, width, height],  # Optional
  "format": "png"
}

Extract Text (OCR)

POST /vision/ocr
{
  "region": [x, y, width, height],  # Optional
  "language": "en"                  # Optional
}

Describe Screen (VLM)

POST /vision/describe
{
  "region": [x, y, width, height],   # Optional
  "task": "Find the Save button",    # Optional focus
  "include_ocr": true,               # Include OCR text
  "store_to_memory": true            # Auto-store to user-memory
}
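
The same request from Python with requests. Note that the X-API-Key header name is an assumption for illustration; check src/middleware/validation.py for the header the service actually expects.

import requests

resp = requests.post(
    "http://localhost:3006/vision/describe",
    headers={"X-API-Key": "your-vision-api-key-here"},  # assumed header name
    json={
        "task": "Find the Save button",
        "include_ocr": True,
        "store_to_memory": True,
    },
    timeout=60,  # first call may be slow while the VLM lazy-loads
)
resp.raise_for_status()
print(resp.json())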

Start Watch Mode

POST /vision/watch/start
{
  "interval_ms": 2000,
  "change_threshold": 0.08,
  "run_ocr": true,
  "run_vlm": false,
  "task": "Monitor for errors"
}

Stop Watch Mode

POST /vision/watch/stop

Watch Status

GET /vision/watch/status

Configuration

Key environment variables in .env:

# Service
PORT=3006
API_KEY=your-vision-api-key-here

# OCR
OCR_ENGINE=paddleocr
OCR_LANGUAGE=en

# VLM (lazy-loaded)
VLM_ENABLED=true
VLM_MODEL=openbmb/MiniCPM-V-2_6
VLM_DEVICE=auto  # auto, cpu, cuda

# Watch
WATCH_DEFAULT_INTERVAL_MS=2000
WATCH_CHANGE_THRESHOLD=0.08

# User Memory Integration
USER_MEMORY_SERVICE_URL=http://localhost:3003
USER_MEMORY_API_KEY=your-user-memory-api-key
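
VLM_DEVICE=auto presumably resolves to CUDA when available and falls back to CPU otherwise. A minimal sketch of that resolution logic, assuming PyTorch (the actual implementation may differ):

import os

import torch

def resolve_vlm_device() -> str:
    """Map VLM_DEVICE=auto|cpu|cuda onto a concrete torch device string."""
    configured = os.getenv("VLM_DEVICE", "auto")
    if configured != "auto":
        return configured
    return "cuda" if torch.cuda.is_available() else "cpu"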

Performance

OCR Only (Minimal Setup)

  • Capture: 10-20ms

  • OCR: 200-500ms

  • Total: ~300-600ms per request

  • Memory: ~500MB

OCR + VLM (Full Setup)

  • Capture: 10-20ms

  • OCR: 200-500ms

  • VLM (GPU): 300-800ms

  • VLM (CPU): 2-5s

  • Total (GPU): ~600-1500ms

  • Total (CPU): ~2.5-6s

  • Memory: ~3-4GB (model loaded)

Watch Mode Strategy

Watch mode uses smart change detection to minimize VLM calls:

  1. Every interval: Capture + fingerprint comparison

  2. On change: Run OCR (if enabled)

  3. On significant change: Run VLM (if enabled)

  4. Auto-store: Send to user-memory service as embedding

This keeps VLM usage efficient while maintaining continuous awareness.
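
A minimal sketch of that loop, using a downscaled grayscale fingerprint and the mean absolute pixel difference as the change metric. The two-tier thresholding and all names here are illustrative; the real watch_manager.py may fingerprint and threshold differently.

import time

import numpy as np
from PIL import Image

def fingerprint(img: Image.Image) -> np.ndarray:
    """Tiny grayscale thumbnail: cheap to compute and compare every interval."""
    return np.asarray(img.convert("L").resize((64, 64)), dtype=np.float32) / 255.0

def watch(capture, run_ocr, run_vlm, interval_ms=2000, change_threshold=0.08):
    last = None
    while True:
        img = capture()                        # e.g. the mss-based screenshot service
        fp = fingerprint(img)
        if last is not None:
            change = float(np.abs(fp - last).mean())
            if change > change_threshold / 2:  # "on change": run OCR
                run_ocr(img)
            if change > change_threshold:      # "on significant change": run VLM
                run_vlm(img)
        last = fp
        time.sleep(interval_ms / 1000)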

Integration with ThinkDrop AI

The vision service integrates with the MCP state graph:

// In AgentOrchestrator state graph
const visionResult = await mcpClient.callService('vision', 'describe', {
  include_ocr: true,
  store_to_memory: true,
  task: userMessage
});

// Result automatically stored as embedding in user-memory
// No screenshot files to manage!

Testing

Test Capture

curl -X POST http://localhost:3006/vision/capture \
  -H "Content-Type: application/json" \
  -d '{}'

Test OCR

curl -X POST http://localhost:3006/vision/ocr \
  -H "Content-Type: application/json" \
  -d '{}'

Test VLM (if enabled)

curl -X POST http://localhost:3006/vision/describe \
  -H "Content-Type: application/json" \
  -d '{"include_ocr": true, "store_to_memory": false}'

Test Watch

# Start
curl -X POST http://localhost:3006/vision/watch/start \
  -H "Content-Type: application/json" \
  -d '{"interval_ms": 2000, "run_ocr": true}'

# Status
curl http://localhost:3006/vision/watch/status

# Stop
curl -X POST http://localhost:3006/vision/watch/stop

Troubleshooting

OCR Not Working

  • Check PaddleOCR installation: pip list | grep paddleocr

  • Models download on first use (~100MB)

  • Check logs for download progress

VLM Not Loading

  • Ensure dependencies installed: pip list | grep transformers

  • Check available memory (need 4-8GB)

  • Set VLM_ENABLED=false to disable

  • Model downloads on first use (~2.4GB)

Performance Issues

  • CPU too slow: Disable VLM, use OCR only

  • Memory issues: Reduce watch interval, disable VLM

  • GPU not detected: Check CUDA installation
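
A quick way to confirm whether PyTorch can see your GPU:

import torch

print(torch.cuda.is_available())        # False: the VLM will run on CPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))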

Architecture

vision-service/
├── server.py                   # FastAPI app
├── src/
│   ├── services/
│   │   ├── screenshot.py       # mss wrapper
│   │   ├── ocr_engine.py       # PaddleOCR wrapper
│   │   ├── vlm_engine.py       # VLM wrapper (lazy)
│   │   └── watch_manager.py    # Watch loop
│   ├── routes/
│   │   ├── capture.py          # /vision/capture
│   │   ├── ocr.py              # /vision/ocr
│   │   ├── describe.py         # /vision/describe
│   │   └── watch.py            # /vision/watch/*
│   └── middleware/
│       └── validation.py       # API key validation
├── requirements.txt
├── start.sh
└── README.md
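
For reference, the mss capture that screenshot.py wraps is only a few lines. A minimal sketch (the actual wrapper is not shown in this README):

import mss
from PIL import Image

def capture(region=None) -> Image.Image:
    """Grab the primary monitor, or an optional [x, y, width, height] region."""
    with mss.mss() as sct:
        if region:
            x, y, w, h = region
            monitor = {"left": x, "top": y, "width": w, "height": h}
        else:
            monitor = sct.monitors[1]  # monitors[0] is the combined virtual screen
        shot = sct.grab(monitor)
        return Image.frombytes("RGB", shot.size, shot.rgb)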

License

Part of the ThinkDrop AI project.
