Voice-AGI MCP Server
Stateful voice-controlled AGI system combining local STT/TTS with Letta-style conversation management
Overview
Voice-AGI is an advanced MCP server that provides:
Stateful conversations - Multi-turn dialogue with context retention
Tool execution during voice - Call AGI functions naturally via speech
Local STT/TTS - Cost-effective Whisper + Edge TTS (no API costs)
Intent detection - Sophisticated NLU using local Ollama
AGI integration - Direct control of goals, tasks, memory, and research
Latency tracking - Performance metrics for optimization
Architecture
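At a high level, the server layers a local voice pipeline (Whisper STT, Edge TTS) over an Ollama-based intent detector, a registry of voice-callable AGI tools, and a stateful conversation manager backed by enhanced-memory. See Architecture Details below.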
Features
🎯 Stateful Conversation Management
Context retention across multiple turns (last 10 turns)
User context tracking (name, preferences, etc.)
Conversation history stored in enhanced-memory
Seamless multi-turn dialogue ("What was I just asking about?")
🔧 Voice-Callable AGI Tools
search_agi_memory - Search past memories via voice
create_goal_from_voice - "Create a goal to optimize memory"
list_pending_tasks - "What tasks do I have?"
trigger_consolidation - "Run memory consolidation"
start_research - "Research transformer architectures"
check_system_status - "How is the system doing?"
remember_name / recall_name - User context management
start_improvement_cycle - "Improve consolidation speed"
decompose_goal - "Break down this goal into tasks"
10+ tools total, easily extensible
🧠 Intent Detection
Local Ollama LLM (llama3.2) for sophisticated NLU
Intent classification - Automatically routes to appropriate tools
Parameter extraction - Extracts args from natural speech
Context-aware - Uses conversation history for better understanding
Fallback heuristics - Works even if Ollama unavailable
🎤 Voice Pipeline
STT: pywhispercpp (local, Python 3.14 compatible)
TTS: Microsoft Edge TTS (free, neural voices)
Audio feedback: Beeps for state changes
Latency tracking: STT, TTS, and total round-trip metrics
Flexible: Easy to add cloud STT/TTS later
📊 Performance Metrics
STT latency tracking (ms)
TTS latency tracking (ms)
Total round-trip latency
Conversation statistics (turns, words, duration)
Tool invocation counts
Installation
1. Install Dependencies
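Assuming the repository's requirements.txt (referenced under Prerequisites below), a typical install is:

```bash
pip install -r requirements.txt
```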
2. Ensure Prerequisites
Required:
Python 3.10+
edge-tts (installed via requirements.txt)
arecord (ALSA utils): sudo dnf install alsa-utils
Audio player: mpg123, ffplay, or vlc
Ollama with llama3.2: ollama pull llama3.2
Optional (for STT):
pywhispercpp: already in requirements.txt
Microphone access
3. Configure in Claude Code
Add to ~/.claude.json:
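A minimal sketch of the server entry, assuming the server is started by running src/server.py directly (adjust the path to your checkout):

```json
{
  "mcpServers": {
    "voice-agi": {
      "command": "python",
      "args": ["/path/to/voice-agi/src/server.py"]
    }
  }
}
```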
4. Restart Claude Code
Usage
Basic Voice Chat
Voice Conversation Loop
Listen Only
Speak Only
Get Conversation Context
List Voice Tools
Get Performance Stats
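Each of the commands above maps to an MCP tool and can be triggered from the Claude Code chat in natural language, for example:

```
@Voice-AGI MCP Server Start a voice conversation to help me manage my research goals
```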
Voice-Callable Tools
Tools are automatically invoked when intent is detected in user speech.
Memory Operations
Search Memory: search_agi_memory
Remember User Info: remember_name / recall_name
Goal & Task Management
Create Goal: create_goal_from_voice ("Create a goal to optimize memory")
List Tasks: list_pending_tasks ("What tasks do I have?")
Decompose Goal: decompose_goal ("Break down this goal into tasks")
AGI Operations
Memory Consolidation: trigger_consolidation ("Run memory consolidation")
Autonomous Research: start_research ("Research transformer architectures")
Self-Improvement: start_improvement_cycle ("Improve consolidation speed")
System Status: check_system_status ("How is the system doing?")
Extending the System
Adding New Voice-Callable Tools
In src/server.py:
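A minimal sketch, assuming @tool_registry.register() (see Contributing) takes a name and description; the actual decorator signature lives in src/tool_registry.py:

```python
# Sketch only: the registration keywords and handler signature are assumptions.
@tool_registry.register(
    name="pause_research",  # intent name the detector routes to
    description="Pause the current autonomous research run",
)
async def pause_research() -> str:
    # Invoke the AGI system here; the returned text is spoken via TTS.
    return "Research paused."
```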
Customizing Intent Detection
Edit src/intent_detector.py to:
Add new intent categories (see the sketch after this list)
Adjust LLM prompts
Tune confidence thresholds
Add domain-specific NLU
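As an illustration, a new intent category might be sketched like this (structure and names are hypothetical; adapt to whatever src/intent_detector.py actually uses):

```python
# Hypothetical intent table; the 0.6 threshold matches the invocation rule
# described under Architecture Details below.
CUSTOM_INTENTS = {
    "pause_research": {
        "examples": ["pause the research", "stop researching for now"],
        "confidence_threshold": 0.6,
    },
}
```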
Integrating with Other MCP Servers
Edit src/mcp_integrations.py to:
Add new MCP client classes
Implement actual API calls (currently stubbed; see the sketch after this list)
Configure MCP server URLs
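A sketch of what filling in a stub might look like (the class name, method, and URL are all illustrative; the real classes live in src/mcp_integrations.py):

```python
# Hypothetical client; replace the stubbed call with your actual MCP endpoint.
import httpx

class EnhancedMemoryClient:
    """Talks to the enhanced-memory MCP server used for long-term retention."""

    def __init__(self, base_url: str = "http://localhost:8080") -> None:
        self.base_url = base_url  # assumed URL; configure per your setup

    async def search(self, query: str) -> list:
        async with httpx.AsyncClient() as client:
            resp = await client.post(f"{self.base_url}/search", json={"query": query})
            resp.raise_for_status()
            return resp.json()
```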
Performance
Measured on Mac Pro 5,1 (Dual Xeon X5680, 24 threads):
| Operation | Latency |
|-----------|---------|
| STT (base model) | ~800ms |
| TTS (Edge) | ~1500ms |
| Intent detection | ~500ms |
| Total round-trip | ~2.8s |
Tips for Optimization:
Use a smaller Whisper model (tiny) for faster STT
Pre-load the Whisper model on startup
Use GPU if available (GTX 680 on your system)
Enable cloud STT/TTS for latency-critical use cases
Troubleshooting
Whisper Not Available
Edge TTS Not Working
Ollama Not Responding
Audio Recording Fails
No Audio Output
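Quick first checks for each case, using the tools from the Prerequisites section:

```bash
# Whisper not available: confirm pywhispercpp imports cleanly
python -c "import pywhispercpp"

# Edge TTS not working: synthesize a test clip (needs network access to the Edge service)
edge-tts --text "test" --write-media /tmp/test.mp3

# Ollama not responding: confirm the daemon is up and llama3.2 is pulled
ollama list
ollama run llama3.2 "hello"

# Audio recording fails: capture two seconds from the default ALSA device
arecord -d 2 /tmp/test.wav

# No audio output: play the test clip with any installed player
mpg123 /tmp/test.mp3
```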
Architecture Details
Conversation Flow
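In outline, a turn flows: microphone capture (arecord) → Whisper STT → intent detection (Ollama) → optional tool execution → response text → Edge TTS → audio playback, with each turn appended to the conversation history.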
Stateful Context
Conversation manager maintains:
Message history (last 10 turns)
User context (name, preferences)
Session metadata (start time, turn count)
Tool invocations (which tools were used)
Context is automatically:
Passed to intent detector for better NLU
Stored in enhanced-memory for long-term retention
Used for multi-turn understanding
Tool Invocation
Tools are invoked when:
Intent confidence > 0.6
Intent name matches registered tool
Required parameters can be extracted
Parameters extracted via:
LLM-based extraction (Ollama)
Pattern matching (regex)
Conversation context (previous turns)
Defaults (if specified in tool definition)
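The rule above can be sketched as follows (names are hypothetical; the real logic lives in src/server.py and src/tool_registry.py):

```python
# Illustrative sketch of the invocation rule; not the actual implementation.
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Intent:
    name: str
    confidence: float
    params: dict = field(default_factory=dict)  # filled by LLM, regex, context, defaults

def maybe_invoke(intent: Intent, registry: dict) -> Optional[str]:
    if intent.confidence <= 0.6:        # confidence must exceed 0.6
        return None
    tool = registry.get(intent.name)    # intent name must match a registered tool
    if tool is None:
        return None
    try:
        return tool(**intent.params)
    except TypeError:
        return None                     # required parameters could not be extracted
```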
Comparison to Letta Voice
| Feature | Letta Voice | Voice-AGI (This) |
|---------|-------------|------------------|
| STT | Deepgram (cloud) | Whisper (local) |
| TTS | Cartesia (cloud) | Edge TTS (free) |
| Memory | Letta stateful framework | Enhanced-memory MCP |
| Tools | Function calling | Voice-callable tools |
| Cost | ~$620/mo (8hr/day) | ~$5/mo (local compute) |
| Latency | ~700ms | ~2.8s (local CPU) |
| Privacy | ❌ Cloud data | ✅ Fully local |
| AGI Integration | ❌ None | ✅ Deep integration |
Best of Both Worlds: This system combines Letta's stateful conversation approach with your existing local infrastructure.
Future Enhancements
Phase 4: Streaming & VAD (Planned)
Voice Activity Detection (silero-vad)
Streaming transcription (continuous buffer)
Interrupt handling
GPU acceleration for Whisper
Phase 5: Cloud Upgrade (Optional)
Adaptive pipeline (local vs cloud based on context)
Deepgram STT integration
Cartesia TTS integration
Livekit for real-time streaming
Configuration
Environment Variables
Conversation Settings
In src/server.py:
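The values below are grounded in this README (the 10-turn history window and the 0.6 intent-confidence threshold); the constant names themselves are hypothetical:

```python
MAX_HISTORY_TURNS = 10         # turns retained for multi-turn context
INTENT_CONFIDENCE_MIN = 0.6    # minimum confidence before a tool is invoked
```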
Voice Pipeline Settings
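The pipeline presumably exposes the Whisper model size (base by default, tiny for lower latency, per the Performance section); the setting name here is hypothetical:

```python
WHISPER_MODEL = "base"  # switch to "tiny" for faster STT
```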
API Reference
See inline docstrings in:
src/server.py - Main MCP tools
src/conversation_manager.py - Conversation management
src/voice_pipeline.py - STT/TTS operations
src/tool_registry.py - Tool registration
src/intent_detector.py - Intent detection
src/mcp_integrations.py - MCP client interfaces
Contributing
To add new features:
New voice-callable tools: Add to src/server.py with @tool_registry.register()
Enhanced intent detection: Update src/intent_detector.py
MCP integrations: Implement actual calls in src/mcp_integrations.py
Performance optimizations: Add VAD, streaming, GPU acceleration
Cloud providers: Add Deepgram/Cartesia clients
License
Part of the Mac Pro 5,1 Agentic System - see main system documentation.
Support
For issues or questions:
Check logs: journalctl -f | grep voice-agi
Test components individually (see Troubleshooting)
Review AGI system documentation in /home/marc/
Voice-AGI v0.1.0 - Stateful voice control for recursive self-improving AGI systems