Voice to Text MCP Server

A speech-to-text MCP server that supports multiple audio formats and recognition engines.

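Features

🎯 Core features

Multiple engines: remote API calls (Alibaba Cloud Bailian, OpenAI Whisper, iFLYTEK, and others), Google Speech Recognition, CMU Sphinx
Multiple formats: WAV, MP3, M4A, FLAC, OGG, AAC
Multiple languages: Chinese, English, Japanese, Korean, French, German, Spanish, Russian
Batch processing: transcribe multiple audio files in one call
Live progress: reports detailed progress during transcription
No local model download: recognition through the remote API engines needs no large model on your machine

🛠️ Tools

transcribe_audio_file: transcribe an audio file
transcribe_audio_data: transcribe audio data
transcribe_with_remote_api: transcribe audio through a remote API
batch_transcribe: transcribe multiple files in a batch
analyze_audio_file: inspect an audio file's metadata
convert_audio_file_format: convert between audio formats
get_supported_formats: list the supported formats

📚 Resources

audio://info/{file_path}: get information about an audio file
audio://formats: get the supported audio formats

💡 Prompt templates

Speech-to-text assistant
Audio format conversion assistant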

Installation

Using uv (recommended)

# Clone the repository
git clone <repository-url>
cd DW_MCP_Server

# Install dependencies
uv sync

# Run the server
uv run python main.py

Using pip

# Install dependencies
pip install -r requirements.txt

# Run the server
python main.py

Usage

1. Start the server

# Development mode
uv run mcp dev main.py

# Or run it directly
uv run python main.py

2. Install in Claude Desktop

uv run mcp install main.py

3. Examples

Transcribe a single audio file

# Use Google Speech Recognition
result = await transcribe_audio_file(
    file_path="/path/to/audio.wav",
    language="zh-CN",
    engine="google"
)

# Use a remote API (requires an API key to be configured)
result = await transcribe_audio_file(
    file_path="/path/to/audio.mp3",
    language="zh-CN",
    engine="remote_api"
)

# Call a remote API directly
result = await transcribe_with_remote_api(
    file_path="/path/to/audio.wav",
    api_type="bailian",  # supported: bailian, openai, xunfei
    api_key="your_api_key",
    api_url="your_api_url",
    language="zh-CN"
)

Batch transcription

file_paths = [
    "/path/to/audio1.wav",
    "/path/to/audio2.mp3",
    "/path/to/audio3.m4a"
]

results = await batch_transcribe(
    file_paths=file_paths,
    language="zh-CN",
    engine="whisper"
)
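
How the batch tool walks the file list and reports progress is not documented in this README. A minimal sketch of one way to do it, assuming a hypothetical `fake_transcribe` coroutine standing in for the real engine call and a hypothetical `on_progress` callback (neither name comes from the project):

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical stand-in for a real engine call; it only echoes the path.
async def fake_transcribe(path: str, language: str) -> str:
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"[{language}] transcript of {path}"

async def batch_with_progress(
    file_paths: list[str],
    language: str,
    transcribe: Callable[[str, str], Awaitable[str]],
    on_progress: Callable[[int, int], None],
) -> dict[str, str]:
    """Transcribe files one by one; a failure is recorded, not raised."""
    results: dict[str, str] = {}
    total = len(file_paths)
    for done, path in enumerate(file_paths, start=1):
        try:
            results[path] = await transcribe(path, language)
        except Exception as exc:  # keep going so one bad file does not stop the batch
            results[path] = f"ERROR: {exc}"
        on_progress(done, total)
    return results

progress_log: list[tuple[int, int]] = []
batch_results = asyncio.run(
    batch_with_progress(
        ["/path/to/audio1.wav", "/path/to/audio2.mp3"],
        "zh-CN",
        fake_transcribe,
        lambda done, total: progress_log.append((done, total)),
    )
)
```

Processing files sequentially keeps the progress count simple; a concurrent variant would need a rate limit that suits the chosen remote API.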

Analyze an audio file

info = await analyze_audio_file("/path/to/audio.wav")
print(f"Format: {info.format}")
print(f"Duration: {info.duration} s")
print(f"Sample rate: {info.sample_rate} Hz")
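
For WAV input, the fields printed above can be derived from the file header with the standard library alone. A minimal sketch, assuming a hypothetical `AudioInfo` dataclass and `read_wav_info` helper; the server's actual return type and its handling of other formats are not shown in this README:

```python
import io
import wave
from dataclasses import dataclass

@dataclass
class AudioInfo:  # hypothetical shape, mirroring the fields used above
    format: str
    duration: float
    sample_rate: int
    channels: int

def read_wav_info(source) -> AudioInfo:
    """Read header fields from a WAV path or a binary file object."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return AudioInfo("wav", frames / rate, rate, wav.getnchannels())

# Build one second of 16 kHz mono silence in memory to exercise the function.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)  # 16-bit samples
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)
buffer.seek(0)

info = read_wav_info(buffer)
print(f"{info.format} {info.duration:.1f}s {info.sample_rate}Hz")
```

The `wave` module only reads uncompressed WAV, so MP3, M4A and the other listed formats would need a decoder such as ffmpeg first.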

Convert audio format

output_path = await convert_audio_file_format(
    input_path="/path/to/audio.mp3",
    output_path="/path/to/output.wav",
    target_format="wav"
)

Supported formats

Input formats

WAV
MP3
M4A
FLAC
OGG
AAC

Output formats

WAV
MP3
TXT (transcript text)
SRT (subtitle file)
VTT (WebVTT subtitles)
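
The subtitle outputs pair each piece of text with a time range. A minimal sketch of rendering timed segments as SRT, assuming segments arrive as hypothetical `(start_seconds, end_seconds, text)` tuples; how the server actually produces subtitle files is not described in this README:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Render numbered cues, each followed by a blank line."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(cues)

print(to_srt([(0.0, 2.5, "你好"), (2.5, 5.0, "Hello")]))
```

A WebVTT variant would start the file with a `WEBVTT` header line and use a period instead of a comma before the milliseconds.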

Supported languages

Chinese (zh-CN)
English (en-US)
Japanese (ja-JP)
Korean (ko-KR)
French (fr-FR)
German (de-DE)
Spanish (es-ES)
Russian (ru-RU)

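Engine comparison

Engine	Pros	Cons	Best for
Remote API (Bailian/OpenAI/iFLYTEK)	High accuracy, many languages, no local model	Requires a network connection and an API key	Online applications
Google Speech Recognition	High accuracy, many languages	Requires a network connection	Online applications
CMU Sphinx	Fully offline, lightweight	Relatively lower accuracy	Embedded devices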

Configuration

Environment variables

# Set the default language
export DEFAULT_LANGUAGE=zh-CN

# Set the default engine
export DEFAULT_ENGINE=remote_api

# Set the default API type
export DEFAULT_API_TYPE=bailian

# Configure the API key and URL
export BAILIAN_API_KEY=your_bailian_api_key
export BAILIAN_API_URL=https://bailian.aliyuncs.com/v1/audio/transcriptions
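
One way these variables might be read at startup. A minimal sketch, assuming a hypothetical `Settings` dataclass and `load_settings` helper; the variable names come from the list above, but the fallback defaults and the idea of deriving the key and URL variable names from the API type are assumptions, not something `main.py` is known to do:

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass(frozen=True)
class Settings:  # hypothetical container for the values above
    language: str
    engine: str
    api_type: str
    api_key: Optional[str]
    api_url: Optional[str]

def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from a mapping, falling back to assumed defaults."""
    api_type = env.get("DEFAULT_API_TYPE", "bailian")
    prefix = api_type.upper()  # assumption: e.g. bailian -> BAILIAN_API_KEY
    return Settings(
        language=env.get("DEFAULT_LANGUAGE", "zh-CN"),
        engine=env.get("DEFAULT_ENGINE", "remote_api"),
        api_type=api_type,
        api_key=env.get(f"{prefix}_API_KEY"),
        api_url=env.get(f"{prefix}_API_URL"),
    )

# Passing a plain dict keeps the example independent of the real environment.
settings = load_settings({"DEFAULT_ENGINE": "google"})
```

Leaving `api_key` as `None` when the variable is unset lets the caller decide whether a missing key is an error, which only matters for the remote API engine.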

Server configuration

# Edit the server configuration in main.py
mcp = FastMCP(
    "语音转文字服务",
    dependencies=["speechrecognition", "pydub", "openai-whisper", "torch"]
)

Development

Install development dependencies

uv sync --extra dev

Run the tests

uv run pytest

Format the code

uv run black main.py
uv run isort main.py

Type checking

uv run mypy main.py

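Troubleshooting

Common issues

Incorrect API key configuration

# Check the environment variables
echo $BAILIAN_API_KEY
echo $BAILIAN_API_URL

# Or pass them directly in code
result = await transcribe_with_remote_api(
    file_path="audio.wav",
    api_key="your_api_key",
    api_url="your_api_url"
)

Unsupported audio format

# Install ffmpeg
# Windows: download ffmpeg and add it to PATH
# macOS: brew install ffmpeg
# Linux: sudo apt install ffmpeg

Network connection errors
- Check your network connection
- Check that the API URL is correct
- Consider the offline engine (CMU Sphinx); Google Speech Recognition also requires a network connection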
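
Switching engines after a network failure can also be automated. A minimal sketch, assuming hypothetical engine callables (`remote_engine`, `offline_engine`) and a hypothetical `transcribe_with_fallback` helper; none of these names come from the project, and the real engines may raise different exception types:

```python
from typing import Callable, Sequence

Engine = Callable[[str], str]  # takes a file path, returns a transcript

def transcribe_with_fallback(
    path: str, engines: Sequence[tuple[str, Engine]]
) -> tuple[str, str]:
    """Try each (name, engine) in order; return (name, text) from the first success."""
    errors = []
    for name, engine in engines:
        try:
            return name, engine(path)
        except (ConnectionError, TimeoutError) as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all engines failed: " + "; ".join(errors))

# Hypothetical stand-ins: the remote engine is unreachable, the offline one works.
def remote_engine(path: str) -> str:
    raise ConnectionError("network unreachable")

def offline_engine(path: str) -> str:
    return f"offline transcript of {path}"

used, text = transcribe_with_fallback(
    "audio.wav", [("remote_api", remote_engine), ("sphinx", offline_engine)]
)
```

Only network-type errors trigger the fallback here, so a bad file path or an unsupported format still surfaces immediately.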

Debug logging

# Enable verbose logging
import logging
logging.basicConfig(level=logging.DEBUG)

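Contributing

Issues and pull requests are welcome!

Development guide

Fork the project
Create a feature branch
Commit your changes
Push to the branch
Open a pull request

License

MIT License

Changelog

v0.1.0

Initial release
Supports Google Speech Recognition, Whisper, and CMU Sphinx
Supports multiple audio formats
Supports batch processing
Provides progress feedback

Contact

For questions or suggestions, get in touch through any of the following:

Open an issue
Send an email
Join the discussion group

Note: remote APIs require an API key and URL. Before use, set the corresponding environment variables or pass the values as arguments on each call. Mainstream speech recognition APIs such as Alibaba Cloud Bailian, OpenAI Whisper, and iFLYTEK are recommended.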
