npu-vision-fallback

by Byte-Naut

Local low-power vision for desktop AI agents

When accessibility APIs fail: NPU-first, zero GPU wake-up, 100% local

License: MIT · Python 3.11+

English | Chinese documentation (中文文档)


English

What is this?

A lightweight, local-first vision service for desktop agents that need to see and interact with screens where traditional accessibility APIs fall short: games, remote desktops, canvas apps, and more.

Built for efficiency: Native OS OCR · Intel NPU acceleration · Zero cloud calls · Battery-friendly by design

[Figure: architecture diagram]


✨ Why Use This?

Desktop agents face a challenge: how to perceive UI when the accessibility tree is empty?

| Common Approach | The Problem |
| --- | --- |
| 🤖 Multimodal LLM screenshots | Expensive tokens, slow round-trips, coordinate hallucination |
| 🌳 OS Accessibility APIs only | Blind to games, canvas apps, remote desktops, emulators |
| 🔥 Heavy GPU OCR (PaddleOCR) | Big dependencies, high power draw, wakes discrete GPU |

npu-vision-fallback is your fallback layer: when the accessibility tree comes back empty, it gives your agent a small, fast, local vision service that doesn't touch the cloud or spin up the dGPU.

Perfect for:

  • 🎮 Game UIs and emulators

  • 🖥️ Remote desktop / VNC clients (no remote accessibility tree)

  • 🎨 Canvas / WASM web apps rendering outside the DOM

  • 💻 Local SLMs that can't afford multimodal screenshot tokens
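
The fallback pattern described above can be sketched in a few lines. This is an illustrative sketch only, not code from this project: `perceive_ui` and both callables are hypothetical names, and the element dictionaries are an assumed shape.

```python
from typing import Callable, Sequence

# Assumed element shape: {"label": "Start Game", "box": [x1, y1, x2, y2]}
Element = dict


def perceive_ui(
    read_accessibility_tree: Callable[[], Sequence[Element]],
    vision_fallback: Callable[[], Sequence[Element]],
) -> tuple[str, list[Element]]:
    """Return (source, elements), preferring the accessibility tree."""
    try:
        elements = list(read_accessibility_tree())
    except Exception:
        # Treat a failing accessibility API the same as an empty tree.
        elements = []
    if elements:
        return "accessibility", elements
    # Empty tree (game, canvas app, remote desktop): fall back to vision.
    return "vision", list(vision_fallback())
```

In an agent, `vision_fallback` would wrap a call to this server (for example its `analyze_screen` tool), so the vision path only runs when the cheaper accessibility path yields nothing.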


🚀 Quick Start

1. Install (Windows + Intel NPU recommended)

pip install "npu-vision-fallback[ocr-win,detect]"
python scripts/download_ui_model.py  # One-time setup

2. Configure Claude Desktop

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "npu-vision-fallback": {
      "command": "npu-vision-fallback"
    }
  }
}
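
If the config file already lists other servers, editing it by hand risks clobbering them. The following is a minimal sketch of merging the entry programmatically; `add_mcp_server` is a hypothetical helper, not part of this package, and the location of `claude_desktop_config.json` depends on your OS and is left to the caller.

```python
import json
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, command: str) -> dict:
    """Insert or update one entry under "mcpServers", keeping other entries."""
    config = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        # An empty file is treated like a missing one.
        config = json.loads(text) if text.strip() else {}
    config.setdefault("mcpServers", {})[name] = {"command": command}
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling `add_mcp_server(path, "npu-vision-fallback", "npu-vision-fallback")` produces the same entry as the JSON shown above.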

3. Use it

Restart Claude Desktop and try:

You: The accessibility tree for this game is empty. Can you read the screen at coordinates [0,0,1280,800] and find the "Start Game" button?

Claude: (calls analyze_screen) I found a button labeled "Start Game" at [520, 580, 720, 640]. Want me to click its center at (620, 610)?
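
The click point in the exchange above is simply the midpoint of the returned box. A small sketch of that arithmetic, assuming boxes are `[x1, y1, x2, y2]` in pixels (`box_center` is a hypothetical helper name):

```python
def box_center(box):
    """Return the integer midpoint (x, y) of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = box
    if x2 < x1 or y2 < y1:
        raise ValueError(f"malformed box: {box}")
    return ((x1 + x2) // 2, (y1 + y2) // 2)


print(box_center([520, 580, 720, 640]))  # (620, 610)
```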


📦 Installation Options

Windows + Intel NPU (recommended)

Native OCR + NPU UI detection (~85 MB total):

pip install "npu-vision-fallback[ocr-win,detect]"
python scripts/download_ui_model.py

Linux / macOS

Cross-platform OCR + CPU detection (~130 MB):

pip install "npu-vision-fallback[ocr-rapid,detect]"
python scripts/download_ui_model.py

Full (All Backends)

For development or testing all backends:

pip install "npu-vision-fallback[all]"
python scripts/download_ui_model.py

Minimal Core

Just the MCP server (no OCR/detection, ~20 MB):

pip install npu-vision-fallback

💡 Note: The detect extra uses OpenVINO (~80 MB) at runtime, not PyTorch. Model conversion requires the dev-convert extra (~2 GB), but that's a one-time setup most users skip.


🎯 Key Features

  • 🔋 NPU-first architecture: UI detection runs on Intel AI Boost at ~80 ms per call (~0.3 J)

  • ⚡ Zero dGPU wake-up: default paths use the NPU, system OCR, or the CPU, so the discrete GPU stays asleep and battery drain stays low

  • Native OS OCR: uses the Windows OCR engine (macOS Vision planned) for quality

  • 🧩 MCP protocol: works with Claude Desktop, Cursor, or any MCP client out of the box

  • 🪶 Lightweight: no PyTorch/TensorFlow at runtime; all heavy deps are optional

  • 🛡️ Privacy-first: 100% local processing, no telemetry, no cloud


⚡ Performance

Measured on Intel Core Ultra 9 275HX (2560×1600 screen, on battery):

| Task | Backend | Latency | Energy | Notes |
| --- | --- | --- | --- | --- |
| OCR | WinOCR | ~1100 ms | 2.5 J | Native Windows API (full screen) |
| OCR | RapidOCR | ~6300 ms | 14.5 J | Cross-platform ONNX CPU |
| UI Detection | OpenVINO NPU | ~80 ms | 0.3 J | YOLOv8n on Intel AI Boost |
| UI Detection | OpenVINO CPU | ~120 ms | - | Fallback when no NPU |

Full benchmark details and reproduction steps: outputs/power_report.md
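
As a rough sanity check on these figures, average power during a call is energy divided by latency. The sketch below takes the rounded numbers from the table at face value; it is back-of-the-envelope arithmetic, not part of the project's benchmark tooling.

```python
# (latency in seconds, energy in joules), copied from the table above
measurements = {
    "WinOCR": (1.100, 2.5),
    "RapidOCR": (6.300, 14.5),
    "OpenVINO NPU detection": (0.080, 0.3),
}


def average_power_watts(latency_s, energy_j):
    """Average power over one call: P = E / t."""
    return energy_j / latency_s


for name, (latency_s, energy_j) in measurements.items():
    print(f"{name}: {average_power_watts(latency_s, energy_j):.2f} W")
```

With these inputs both OCR backends come out at roughly 2.3 W, which suggests WinOCR's energy advantage comes mostly from finishing sooner rather than from drawing less power while it runs. NPU detection works out to about 3.75 W, but only for ~80 ms per call.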


๐Ÿ› ๏ธ MCP Tools

Tool

Purpose

Key Arguments

health_check

Server status

โ€”

list_backends

Available backends

โ€”

ocr_region

Extract text from region

region=[x1,y1,x2,y2]

detect_ui

Find UI elements

region=[x1,y1,x2,y2]

analyze_screen

๐ŸŒŸ Combined OCR + detection

region=[x1,y1,x2,y2]
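
Under the hood, an MCP client invokes these tools with a JSON-RPC `tools/call` request. The sketch below builds such a payload for illustration; `tools_call_request` is a hypothetical helper, and passing `region` as a list of four integers is an assumption based on the table above. In practice your MCP client library constructs and sends this message for you.

```python
import json


def tools_call_request(request_id, tool, region=None):
    """Build a JSON-RPC 2.0 "tools/call" message for an MCP server."""
    arguments = {}
    if region is not None:
        x1, y1, x2, y2 = region
        if not (x1 < x2 and y1 < y2):
            raise ValueError(f"region must satisfy x1 < x2 and y1 < y2: {region}")
        arguments["region"] = [x1, y1, x2, y2]
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


print(json.dumps(tools_call_request(1, "analyze_screen", [0, 0, 1280, 800])))
```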

analyze_screen is the primary tool: it fuses detection and OCR and returns spatially sorted elements with text annotations, which makes it a good fit for agent navigation.
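
One plausible way to fuse the two result sets is to attach each OCR line to the smallest detected element containing the line's center, then sort everything top-to-bottom and left-to-right. The sketch below illustrates that idea only; the server's actual fusion logic is not shown in this README, and the dictionary shapes and the `fuse` name are assumptions.

```python
def area(box):
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


def contains(box, point):
    x1, y1, x2, y2 = box
    px, py = point
    return x1 <= px <= x2 and y1 <= py <= y2


def fuse(detections, ocr_lines):
    """Annotate detected elements with OCR text and sort in reading order."""
    elements = [{"type": d["type"], "box": list(d["box"]), "text": ""} for d in detections]
    loose_text = []
    for line in ocr_lines:
        lx1, ly1, lx2, ly2 = line["box"]
        mid = ((lx1 + lx2) / 2, (ly1 + ly2) / 2)
        hits = [e for e in elements if contains(e["box"], mid)]
        if hits:
            # Prefer the tightest enclosing element (e.g. button over panel).
            target = min(hits, key=lambda e: area(e["box"]))
            target["text"] = (target["text"] + " " + line["text"]).strip()
        else:
            # Text outside every detected element is kept as its own entry.
            loose_text.append({"type": "text", "box": list(line["box"]), "text": line["text"]})
    # Reading order: by top edge first, then by left edge.
    return sorted(elements + loose_text, key=lambda e: (e["box"][1], e["box"][0]))
```

Sorting by the top-left corner is a simple heuristic; elements on the same visual row with slightly different top edges may need row bucketing in a real implementation.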


📚 Documentation


🧪 Examples

| Example | Description |
| --- | --- |
| basic_ocr.py | Simple OCR call on a screen region |
| agent_ui_navigation.py | Find and click UI elements |
| desktop_remote_vnc.py | Vision fallback in remote desktop |

uv run python examples/basic_ocr.py --region 0 0 1280 800
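
For readers writing their own scripts with the same command-line shape, a `--region` flag taking four integers can be parsed with `argparse` as sketched below. This is not the contents of `examples/basic_ocr.py`, which is not reproduced here; `parse_args` is a hypothetical name.

```python
import argparse


def parse_args(argv=None):
    """Parse a --region flag given as four integers: x1 y1 x2 y2."""
    parser = argparse.ArgumentParser(description="OCR a screen region (illustrative)")
    parser.add_argument(
        "--region",
        type=int,
        nargs=4,
        metavar=("X1", "Y1", "X2", "Y2"),
        default=None,
        help="pixel bounds of the region to read",
    )
    return parser.parse_args(argv)


args = parse_args(["--region", "0", "0", "1280", "800"])
print(args.region)  # [0, 0, 1280, 800]
```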

๐Ÿ—บ๏ธ Roadmap

  • v1.1 โ€” Multi-monitor support, DPI scaling awareness

  • v2.0 โ€” Custom model training interface, bring your own detector

  • v2.1 โ€” UI-TARS integration, macOS Vision backend, PP-OCR v4 on NPU


๐Ÿค Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines. Please read CLAUDE.mdโ€”it's the project constitution that ensures code quality and architectural consistency.


📋 Supported Backends

| Backend | Type | Device | Platform | Status |
| --- | --- | --- | --- | --- |
| winocr | System OCR | CPU/NPU | Windows | ✅ Primary |
| openvino_npu | UI Detection | NPU | Win/Linux + Intel NPU | ✅ Primary |
| openvino_cpu | UI Detection | CPU | Win/Linux/macOS | ✅ Fallback |
| rapid_ocr | OCR | CPU | All | ✅ Cross-platform |
| pytesseract | OCR | CPU | All | ✅ Last-resort |
| vision | System OCR | ANE | macOS | 🚧 Planned |
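
The primary/fallback split for UI detection amounts to a preference order over available devices, with the GPU deliberately left out so the discrete GPU is never woken. A minimal sketch of that selection follows; `pick_detection_device` is a hypothetical name, and this is not the project's actual selection code. With OpenVINO installed, the device list could come from `openvino.Core().available_devices`, though whether the project does so is an assumption.

```python
# Preference order for UI detection; "GPU" is intentionally absent.
PREFERENCE = ("NPU", "CPU")


def pick_detection_device(available, preference=PREFERENCE):
    """Return the first preferred device present in `available`.

    Device names follow OpenVINO's style, e.g. "CPU", "NPU", "GPU.0".
    """
    for wanted in preference:
        for device in available:
            if device == wanted or device.startswith(wanted + "."):
                return device
    raise RuntimeError(f"no usable detection device in {list(available)}")


print(pick_detection_device(["CPU", "GPU", "NPU"]))  # NPU
print(pick_detection_device(["CPU", "GPU.0"]))       # CPU
```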


📄 License

MIT © npu-vision-fallback contributors


๐Ÿ™ Acknowledgments

Built with:

Development assisted by Claude Code (Anthropic). Architecture design and code review powered by AI collaboration.
