Voice Recognition MCP Service
This service provides voice recognition and text extraction capabilities through both stdio and MCP modes.
Features
- Voice recognition from file
- Voice recognition from base64 encoded data
- Text extraction
- Support for both stdio and MCP modes
- Structured voice recognition results
Project Structure
- voice_service.py: Core service implementation
- stdio_server.py: stdio mode entry point
- mcp_server.py: MCP mode entry point
- build.py: Build script for executables
- build_exec.sh: Build execution script
- test_*.sh: Test scripts for different functionalities
Installation
- Clone the repository:
- Install dependencies:
- Set up environment variables in .env:
Usage
stdio Mode
- Run the service:
- Send JSON-RPC requests via stdin:
- Or use the executable:
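For the JSON-RPC step above, a request might be assembled as in the following sketch. The method name `recognize_base64` and the `audio` parameter are hypothetical placeholders, not taken from this project; check voice_service.py for the names the service actually accepts. The sketch also assumes one JSON object per line on stdin. Only the JSON-RPC 2.0 envelope (`jsonrpc`, `id`, `method`, `params`) follows the public specification.

```python
import base64
import json


def encode_audio(data: bytes) -> str:
    """Return raw audio bytes as a base64 string suitable for a JSON payload."""
    return base64.b64encode(data).decode("ascii")


def build_request(request_id: int, method: str, params: dict) -> str:
    """Serialize a JSON-RPC 2.0 request as a single line of text."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": params,
    }
    return json.dumps(payload)


# Hypothetical method and parameter names, used for illustration only.
audio_bytes = b"placeholder audio bytes"
line = build_request(1, "recognize_base64", {"audio": encode_audio(audio_bytes)})
print(line)
```

The printed line can then be piped to the stdio entry point once the real method and parameter names are substituted.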
MCP Mode
- Run the service:
- Or use the executable:
Voice Recognition Results
The service provides structured voice recognition results. Here's an example of the response format:
Original API Response
Restructured Response
Label Result Fields
The label_result field contains the following structured information:
| Field | Description | Example Value |
|---|---|---|
| lan | Language code | "en" |
| emo | Emotion state | "unknown" |
| type | Audio type | "speech" |
| speaker | Speaker identifier | "woitn" |
| text | Recognized text content | "test test test" |
Special Labels
The service recognizes and processes the following special labels in the original response:
- <|en|>: Language code
- <|EMO_UNKNOWN|>: Emotion state
- <|Speech|>: Audio type
- <|woitn|>: Speaker identifier
Building Executables
- Make the build script executable:
- Build stdio mode executable:
- Build MCP mode executable:
The executables will be created at:
- stdio mode: dist/voice_stdio
- MCP mode: dist/voice_mcp
Testing
Run the test scripts:
License
This project is licensed under the MIT License - see the LICENSE file for details.