Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type
@followed by the MCP server name and your instructions, e.g., "@Wayland MCP ServerTake a screenshot, find the browser icon, and click it"
That's it! The server will respond to your query, and you can continue using it as needed.
Here is a step-by-step guide with screenshots.
Wayland MCP Server
Model Context Protocol server for Wayland desktop automation
Features • Installation • Usage • API • Security
Overview
Wayland MCP Server enables AI assistants to interact with your Wayland desktop through the Model Context Protocol. It provides screenshot capture with VLM analysis, mouse control, keyboard input, and action chaining capabilities.
Why This Project?
Existing Wayland screenshot and automation tools often have reliability issues. This project provides a robust, MCP-native solution specifically designed for AI-driven desktop automation on modern Linux systems.
Quick Example
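For example, a request like "Click the text field at (100, 200) and type hello" can be carried out as a single action chain (the same chain shown under Action Sequences below):

```
chain:click:100,200;type:hello;press:Enter
```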
Features
Visual Analysis
Screenshot capture with precision ruler overlays
VLM-powered image analysis via OpenRouter or Google Gemini
Multiple vision model support (Claude, GPT-4V, Gemini, Qwen)
Side-by-side image comparison and diff detection
Mouse Automation
Absolute and relative cursor positioning
Click operations (left, right, middle button)
Drag and drop with coordinate precision
Bidirectional scrolling (vertical/horizontal)
Keyboard Control
Text input simulation
Individual key press events
Complex key combinations
Action Sequences
Chain multiple operations together
Flexible syntax: `chain:action1;action2;action3`
Example: `chain:click:100,200;type:hello;press:Enter`
Installation
Prerequisites
Python 3.8 or higher
Wayland compositor (GNOME, KDE Plasma, Hyprland, Sway, etc.)
`grim` and `slurp` for screenshots (usually pre-installed)
Quick Install
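A sketch of an install from PyPI; the package name `wayland-mcp` is a placeholder, so check the repository for the published name:

```bash
# Placeholder package name -- verify the published name before installing
pip install wayland-mcp
```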
From Source
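A minimal sketch of a source install; `<repo-url>` stands in for the actual repository URL, and the editable install matches the command referenced under Troubleshooting:

```bash
git clone <repo-url> wayland-mcp-server   # replace <repo-url> with the real repository URL
cd wayland-mcp-server
pip install -e .
```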
Input Control Setup
For mouse and keyboard automation, run the setup script:
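Run it from the project root (this is the same script referenced under Troubleshooting):

```bash
sudo ./setup.sh
```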
What it does:
Installs the `evemu-tools` package
Configures setuid for `evemu-event`
Adds user to the `input` group
Creates udev rules for device access
After setup, log out and back in for group changes to take effect.
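Once logged back in, you can confirm the group membership with the same check used under Troubleshooting:

```bash
groups | grep input
```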
Usage
MCP Configuration
The server supports two VLM providers:
Option 1: OpenRouter (multiple models via proxy)
Option 2: Google Gemini Direct (native API, faster)
Example for Claude Desktop (~/.config/Claude/claude_desktop_config.json):
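A hedged sketch of what such an entry can look like. The `mcpServers` layout is Claude Desktop's standard config format, but the server name and launch command (`wayland-mcp`) are placeholders, and only `OPENROUTER_API_KEY` and `WAYLAND_DISPLAY` are variables named in this README; treat CONFIG_EXAMPLES.md as authoritative:

```bash
# Writes a sample entry to a scratch file for manual merging into
# ~/.config/Claude/claude_desktop_config.json. "wayland-mcp" is a placeholder.
cat > /tmp/wayland-mcp-config.json <<'EOF'
{
  "mcpServers": {
    "wayland-mcp": {
      "command": "wayland-mcp",
      "env": {
        "OPENROUTER_API_KEY": "sk-or-...",
        "WAYLAND_DISPLAY": "wayland-0"
      }
    }
  }
}
EOF
```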
Note: See CONFIG_EXAMPLES.md for more configuration examples including Cursor, OpenRouter models, and VLM provider options.
Environment Variables
| Variable | Description | Default | Required |
|---|---|---|---|
| **VLM Provider Options** | | | |
| | Vision provider: OpenRouter or Gemini | | No |
| `OPENROUTER_API_KEY` | OpenRouter API key | - | For OpenRouter |
| | Google Gemini API key | - | For Gemini |
| | Model identifier | | No |
| **Wayland Environment** | | | |
| `XDG_RUNTIME_DIR` | Wayland runtime directory | | Yes |
| `WAYLAND_DISPLAY` | Display identifier | | Yes |
| **Optional** | | | |
| `WAYLAND_MCP_PORT` | Server listen port | | No |
Getting API Keys:
OpenRouter: openrouter.ai → Keys section
Google Gemini: Google AI Studio
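When launching the server manually rather than through a client config, the variables from the table above can be exported in the shell first (values shown are placeholders):

```bash
export OPENROUTER_API_KEY="sk-or-..."                           # from openrouter.ai -> Keys
export WAYLAND_DISPLAY="${WAYLAND_DISPLAY:-wayland-0}"          # normally set by your session
export XDG_RUNTIME_DIR="${XDG_RUNTIME_DIR:-/run/user/$(id -u)}" # standard Wayland runtime dir
export WAYLAND_MCP_PORT=4999                                    # optional; placeholder port
```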
Desktop Environment Compatibility
| Desktop | Status | Notes |
|---|---|---|
| GNOME | ✅ Tested | Wayland by default on modern versions |
| KDE Plasma | ✅ Tested | Enable Wayland session at login |
| Hyprland | ✅ Tested | Native Wayland compositor |
| Sway | ✅ Should work | i3-compatible Wayland compositor |
| Others | ⚠️ Untested | Any wlroots-based compositor should work |
Example Commands
Through an MCP client, you can request actions like:
"Take a screenshot and analyze what's on the screen"
"Move the mouse to coordinates (100, 200) and click"
"Type 'hello world' and press Enter"
"Click at (50, 50), then drag to (200, 200)"
Available Tools
The server exposes the following MCP tools:
Screen Capture
`capture_screenshot` - Take a screenshot with optional ruler overlays
`capture_and_analyze` - Capture and analyze using VLM in one step
Vision Analysis
`analyze_screenshot` - Analyze an existing screenshot with a custom prompt
`compare_images` - Compare two screenshots to detect differences
Mouse Control
`move_mouse` - Move cursor to coordinates (absolute or relative)
`click_mouse` - Perform left click at current position
`drag_mouse` - Drag between two coordinate points
`scroll_mouse` - Vertical scroll (positive=up, negative=down)
Action Execution
execute_action- Execute single action or chain multiple actions
Action Chain Syntax
Combine multiple actions with semicolons:
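The general form is the one shown under Features:

```
chain:action1;action2;action3
```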
Supported Actions:
`type:text` - Type a text string
`press:key` - Press a specific key
`click:` or `click:x,y` - Click at current position or at given coordinates
`move_to:x,y` - Move to absolute coordinates
`move_to:rel:x,y` - Move relative to current position
`drag:x1,y1:x2,y2` - Drag from point to point
`scroll:amount` - Scroll vertically (typical values: 15-120)
`scroll:horizontal:amount` - Scroll horizontally
Example Chains:
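A few illustrative chains composed from the supported actions above (coordinates and scroll amounts are arbitrary):

```
chain:click:100,200;type:hello;press:Enter
chain:move_to:500,300;click:;scroll:-60
chain:drag:100,100:400,400
```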
Security
⚠️ IMPORTANT SECURITY CONSIDERATIONS
This server grants extensive control over your desktop environment:
Full mouse and keyboard control
Screen capture capabilities
Ability to execute arbitrary input sequences
Best Practices
Only use with trusted AI models and MCP clients
Review action chains before execution in sensitive contexts
Consider running in a sandboxed or test environment
Be aware that the AI can perform any action you could perform manually
Permission Model
The setup script requires sudo access to:
Install system packages (`evemu-tools`)
Modify file permissions
Configure udev rules
After setup, the server runs with your user privileges but can control input devices through configured permissions.
Architecture
Troubleshooting
Input control not working
Ensure you ran `sudo ./setup.sh`
Log out and back in after setup
Verify you're in the `input` group: `groups | grep input`
Screenshots failing
Check if `grim` is installed: `which grim`
Verify `WAYLAND_DISPLAY` matches your session: `echo $WAYLAND_DISPLAY`
VLM analysis not working
Confirm `OPENROUTER_API_KEY` is set correctly
Check API key permissions on the OpenRouter dashboard
Test model availability: some models have usage limits
Server won't start
Check Python version: `python3 --version` (needs 3.8+)
Verify all dependencies: `pip install -e .`
Look for port conflicts if using a custom `WAYLAND_MCP_PORT`
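The individual checks above can also be run together as a quick sanity pass:

```bash
python3 --version          # needs 3.8+
which grim slurp           # screenshot tools
echo "$WAYLAND_DISPLAY"    # should match your session
groups | grep -w input     # input group membership (after setup and re-login)
```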
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Project Structure
License
GPL-3.0 License - See LICENSE for details.
Acknowledgments
Built on the Model Context Protocol
Uses FastMCP for server implementation
Inspired by the need for reliable Wayland automation tools