Skip to main content
Glama

MCP Agent Platform

README_EN.md7.56 kB
# MCP Agent Platform ## Project Introduction This is a human-computer interaction system based on multi-agent architecture, integrating visual recognition, speech recognition, and speech synthesis capabilities. The system consists of multiple specialized agents working together to achieve a natural human-computer interaction experience. ## System Architecture The system employs a distributed multi-agent architecture, utilizing the MCP (Multi-agent Communication Protocol) protocol for efficient communication and collaboration between agents. ### Agents - **Brain Agent** - Responsible for core decision-making and coordination - Processes information from other agents - Integrates Ollama large model API for multimodal understanding and generation - Manages agent states and task scheduling - **Eye Agent** - Processes visual input - Implements face detection and recognition - Supports real-time video stream processing - Generates image analysis results - **Ear Agent** - Processes audio input - Implements speech recognition - Supports real-time audio stream processing - Provides noise suppression and signal enhancement - **Mouth Agent** - Responsible for audio output - Implements speech synthesis - Supports emotional speech synthesis - Provides natural voice interaction experience ### Communication Mechanism The system uses a WebSocket-based real-time communication mechanism, implementing message passing between agents through the MCP protocol: 1. **Message Types** - Text Message (TextMessage) - Image Message (ImageMessage) - Audio Message (AudioMessage) - Command Message (CommandMessage) - Status Message (StatusMessage) 2. **Message Routing** - Message routing based on sender and receiver IDs - Supports broadcast and point-to-point communication - Implements reliable message delivery and confirmation mechanisms 3. **State Management** - Real-time monitoring of agent states - Dynamic allocation of system resources - Task queue priority management ### Technical Principles #### Face Recognition Technology The system uses deep learning models to implement face recognition functionality: 1. **Face Detection** - Uses MTCNN (Multi-task Cascaded Convolutional Networks) - Implements face detection and landmark localization through P-Net, R-Net, and O-Net - Processes video streams in real-time, supporting multi-face detection 2. **Feature Extraction** - Employs FaceNet deep learning model to extract 128-dimensional face feature vectors - Uses Triplet Loss training strategy to optimize feature extraction - Implements face alignment and normalization 3. **Similarity Calculation** - Uses cosine similarity to calculate feature vector distances - Sets dynamic thresholds for identity matching - Implements feature vector indexing and fast retrieval #### Speech Processing Technology 1. **Speech Recognition** - Uses Wav2Vec pre-trained model for feature extraction - Employs CTC (Connectionist Temporal Classification) decoding algorithm - Integrates Chinese speech recognition model with real-time transcription 2. **Speech Synthesis** - Implements end-to-end speech synthesis based on FastSpeech2 - Uses HiFiGAN vocoder for high-quality audio generation - Supports emotion control and speech rate adjustment #### Multimodal Processing 1. **Feature Fusion** - Implements multimodal fusion of visual and speech features - Uses Attention mechanism for cross-modal alignment - Supports collaborative understanding of multimodal information 2. **Context Management** - Maintains multi-turn dialogue history - Implements cross-modal information correlation analysis - Updates dialogue state dynamically ### Large Model Integration The system integrates the Ollama large model API to achieve multimodal intelligent processing. Different models are configured based on task types: 1. **Model Configuration** - Text Processing: Uses qwen2 model - Image Processing: Uses gemma3:4b model with multimodal support - Audio Processing: Uses qwen2 model with temperature parameter set to 0.5 2. **Features Implementation** - **Image Understanding** - Scene analysis and object recognition - Facial feature extraction and emotion analysis - Image description generation - **Dialogue Generation** - Context-aware dialogue management - Multi-turn dialogue history maintenance - Personalized response generation - **Multimodal Fusion** - Combined understanding of image and text - Bidirectional conversion between speech and text - Comprehensive analysis of multimodal information 3. **Configuration Example** ```python # Ollama Model Configuration OLLAMA_BASE_URL = "http://127.0.0.1:11434" # Model Configuration for Different Task Types MODEL_CONFIG = { "text": { "model": "qwen2", # Text processing model "params": {} }, "image": { "model": "gemma3:4b", # Image processing model "params": {}, "multimodal": True }, "audio": { "model": "qwen2", # Audio processing model "params": {"temperature": 0.5} } } ``` ### Core Features 1. **Face Recognition** - Real-time face detection and recognition - Person information database management - Facial feature extraction and matching - Stranger recognition and recording 2. **Voice Interaction** - Chinese speech recognition - Speech synthesis output - Real-time voice dialogue - Emotional speech synthesis 3. **Web Interface** - Visual robot interface - Real-time status display - Interactive control panel - System monitoring dashboard ## Installation Guide ### Requirements - Python 3.10+ - Windows/Linux/MacOS ### Installation Steps 1. Clone the project ```bash git clone [repository-url] cd mcpTest ``` 2. Install dependencies ```bash pip install -r requirements.txt ``` ### Configuration The following parameters can be configured in `config.py`: - Server configuration - Agent parameters - Model settings - Speech recognition parameters ## Usage Guide ### Start the System 1. Run the main program ```bash python main.py ``` 2. Access the Web interface ``` http://localhost:8070 ``` ### Features Usage 1. **Face Recognition** - The system automatically detects and recognizes faces in the camera - First-time recognized faces will be automatically recorded 2. **Voice Interaction** - The system supports Chinese speech recognition - You can dialogue with the system through voice ## Tech Stack - **Backend** - Python - FastAPI - WebSocket - OpenCV - SpeechRecognition - pyttsx3 - **Frontend** - HTML/CSS - JavaScript - WebSocket ## Directory Structure ``` ├── config.py # Configuration file ├── main.py # Main program ├── requirements.txt # Dependencies list ├── src/ │ ├── agents/ # Agent implementations │ ├── brain/ # Brain logic │ ├── platform/ # Platform core │ ├── utils/ # Utility functions │ └── web/ # Web service ├── static/ # Static resources └── templates/ # Page templates ```

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/rolenet/McpAgentRobot'

If you have feedback or need assistance with the MCP directory API, please join our Discord server