{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
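"# Real-Time Voice Conversation\n",
"**Note ⚠️: The real-time voice feature is currently in closed beta. If you run into any problems, please open an issue or report them in the WeChat group.**\n",
"\n",
"## Goal\n",
"Implement a real-time voice conversation feature that supports multiple voices. With this cookbook code as a reference, you can integrate real-time voice into your own platform or application through AppBuilder-SDK.\n",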
"\n",
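"## How It Works\n",
"The program loops over the user's speech: it converts the speech to text, runs the conversation, and then reads the reply aloud through TTS.\n",
"* Use the large-model ASR to convert speech to text.\n",
"* Use an Agent you created yourself for the conversation, so that it fits your application scenario and understands context.\n",
"* Use the large-model TTS to convert text to speech and play it back.\n",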
"\n",
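"## Prerequisites\n",
"* Before using the built-in ASR and TTS components, activate the component services and purchase quota; see [Activate component services](https://cloud.baidu.com/doc/AppBuilder/s/Glqb6dfiz#3%E3%80%81%E5%BC%80%E9%80%9A%E7%BB%84%E4%BB%B6%E6%9C%8D%E5%8A%A1)\n",
"* Install the pyaudio and webrtcvad packages with pip\n",
"* Grant the program microphone access\n",
"* Create your own Agent application\n",
"\n",
"## Sample Code"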
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Copyright (c) 2024 Baidu, Inc. All Rights Reserved.\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# http://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
"\n",
"import os\n",
"import time\n",
"import wave\n",
"import sys\n",
"import pyaudio\n",
"import webrtcvad\n",
"import appbuilder\n",
"import re\n",
"\n",
"# Create an API key on the Qianfan AppBuilder site first; see: https://cloud.baidu.com/doc/AppBuilder/s/Olq6grrt6#1%E3%80%81%E5%88%9B%E5%BB%BA%E5%AF%86%E9%92%A5\n",
"# Set the environment variable\n",
"os.environ[\"APPBUILDER_TOKEN\"] = (\n",
"    \"...\"\n",
")\n",
"# ID of a published AppBuilder application\n",
"app_id = \"...\"\n",
"appbuilder.logger.setLoglevel(\"ERROR\")\n",
"\n",
"FORMAT = pyaudio.paInt16\n",
"CHANNELS = 1  # webrtcvad only accepts 16-bit mono PCM\n",
"RATE = 16000\n",
"DURATION = 30  # frame length in ms; webrtcvad accepts 10, 20, or 30 ms frames\n",
"CHUNK = RATE // 1000 * DURATION  # samples per frame\n",
"\n",
"\n",
"class Chatbot:\n",
"    def __init__(self):\n",
"        self.p = pyaudio.PyAudio()\n",
"        self.tts = appbuilder.TTS()\n",
"        self.asr = appbuilder.ASR()\n",
"        self.agent = appbuilder.AppBuilderClient(app_id)\n",
"        self.conversation_id = self.agent.create_conversation()\n",
"\n",
"    def run(self):\n",
"        # Spoken greeting: \"I am your personal chatbot; if you have any questions, just ask me\"\n",
"        self.run_tts_and_play_audio(\n",
"            \"我是你的专属聊天机器人,如果你有什么问题,可以直接问我\"\n",
"        )\n",
"        while True:\n",
"            # Record\n",
"            audio_path = \"output.wav\"\n",
"            print(\"Recording audio...\")\n",
"            # record_audio returns the detected speech time in ms; skip clips under 1 s\n",
"            if self.record_audio(audio_path) < 1000:\n",
"                time.sleep(1)\n",
"                continue\n",
"            print(\"Recording finished\")\n",
"\n",
"            # ASR\n",
"            print(\"Running ASR...\")\n",
"            query = self.run_asr(audio_path)\n",
"            print(\"ASR finished\")\n",
"\n",
"            # Agent\n",
"            print(\"query: \", query)\n",
"            if len(query) == 0:\n",
"                continue\n",
"            answer = self.run_agent(query)\n",
"            # Print any links separately and strip them so TTS does not read URLs aloud\n",
"            results = re.findall(r\"(https?://[^\\s]+)\", answer)\n",
"            for result in results:\n",
"                print(\"Link:\", result)\n",
"                answer = answer.replace(result, \"\")\n",
"            print(\"answer:\", answer)\n",
"\n",
"            # TTS\n",
"            print(\"Running TTS and playing audio...\")\n",
"            self.run_tts_and_play_audio(answer)\n",
"            print(\"TTS and playback finished\")\n",
"\n",
"    def record_audio(self, path):\n",
"        with wave.open(path, \"wb\") as wf:\n",
"            wf.setnchannels(CHANNELS)\n",
"            wf.setsampwidth(self.p.get_sample_size(FORMAT))\n",
"            wf.setframerate(RATE)\n",
"            stream = self.p.open(\n",
"                format=FORMAT, channels=CHANNELS, rate=RATE, input=True\n",
"            )\n",
"            vad = webrtcvad.Vad(1)\n",
"            not_speech_times = 0\n",
"            speech_times = 0\n",
"            total_times = 0\n",
"            start_up_times = 33 * 5  # initial window of about 5 seconds (33 frames of 30 ms per second)\n",
"            history_speech_times = 0\n",
"            while True:\n",
"                # Stop once about 10 seconds of speech have been collected\n",
"                if history_speech_times > 33 * 10:\n",
"                    break\n",
"                data = stream.read(CHUNK, False)\n",
"                if vad.is_speech(data, RATE):\n",
"                    speech_times += 1\n",
"                    wf.writeframes(data)\n",
"                else:\n",
"                    not_speech_times += 1\n",
"                total_times += 1\n",
"                if total_times >= start_up_times:\n",
"                    history_speech_times += speech_times\n",
"                    # Emulate a sliding window: stop if the window was mostly silence,\n",
"                    # otherwise reset the counters and start a new, shorter window\n",
"                    if float(not_speech_times) / float(total_times) > 0.7:\n",
"                        break\n",
"                    not_speech_times = 0\n",
"                    speech_times = 0\n",
"                    total_times = 0\n",
"                    start_up_times = start_up_times / 2\n",
"                    if start_up_times < 33:\n",
"                        start_up_times = 33\n",
"            stream.close()\n",
"        return history_speech_times * DURATION\n",
"\n",
"    def run_tts_and_play_audio(self, text: str):\n",
"        # Documentation for AppBuilder's built-in TTS; adjust the parameters as needed: https://github.com/baidubce/app-builder/tree/master/python/core/components/tts\n",
"        msg = self.tts.run(\n",
"            appbuilder.Message(content={\"text\": text}),\n",
"            speed=5,\n",
"            pitch=5,\n",
"            volume=5,\n",
"            person=0,\n",
"            audio_type=\"pcm\",\n",
"            model=\"paddlespeech-tts\",\n",
"            stream=True,\n",
"        )\n",
"        stream = self.p.open(\n",
"            format=self.p.get_format_from_width(2),\n",
"            channels=1,\n",
"            rate=24000,\n",
"            output=True,\n",
"            frames_per_buffer=2048,\n",
"        )\n",
"        # Play each PCM chunk as soon as it arrives\n",
"        for pcm in msg.content:\n",
"            stream.write(pcm)\n",
"        stream.stop_stream()\n",
"        stream.close()\n",
"\n",
"    # Documentation for AppBuilder's built-in ASR; adjust the parameters as needed: https://github.com/baidubce/app-builder/blob/master/python/core/components/asr/README.md\n",
"    def run_asr(self, audio_path: str):\n",
"        with open(audio_path, \"rb\") as f:\n",
"            content_data = {\"audio_format\": \"wav\", \"raw_audio\": f.read(), \"rate\": 16000}\n",
"        msg = appbuilder.Message(content_data)\n",
"        out = self.asr.run(msg)\n",
"        text = out.content[\"result\"][0]\n",
"        return text\n",
"\n",
"    def run_agent(self, query):\n",
"        msg = self.agent.run(self.conversation_id, query, stream=True)\n",
"        answer = \"\"\n",
"        for content in msg.content:\n",
"            answer += content.answer\n",
"        return answer\n",
"\n",
"\n",
"if __name__ == \"__main__\":\n",
"    chatbot = Chatbot()\n",
"    chatbot.run()"
]
},
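{
"cell_type": "markdown",
"metadata": {},
"source": [
"The end-of-speech logic in `record_audio` can be tried out without a microphone. The cell below is a minimal sketch that replays the same window counting on a precomputed list of per-frame VAD decisions. `simulate_recording` and its parameters are hypothetical names introduced here for illustration; they are not part of AppBuilder-SDK, pyaudio, or webrtcvad. It may help when tuning the window length or the 0.7 silence ratio."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical helper for illustration; it mirrors the counters in record_audio\n",
"# but reads booleans instead of microphone frames.\n",
"def simulate_recording(vad_flags, frame_ms=30, frames_per_second=33):\n",
"    \"\"\"Return the accumulated speech time in ms for a list of VAD decisions.\"\"\"\n",
"    not_speech = speech = total = 0\n",
"    window = frames_per_second * 5  # first window: about 5 seconds\n",
"    history_speech = 0\n",
"    for is_speech in vad_flags:\n",
"        # Stop once about 10 seconds of speech have been collected\n",
"        if history_speech > frames_per_second * 10:\n",
"            break\n",
"        if is_speech:\n",
"            speech += 1\n",
"        else:\n",
"            not_speech += 1\n",
"        total += 1\n",
"        if total >= window:\n",
"            history_speech += speech\n",
"            # A window that is more than 70% silence ends the recording\n",
"            if not_speech / total > 0.7:\n",
"                break\n",
"            not_speech = speech = total = 0\n",
"            # Halve the window, but never go below about 1 second\n",
"            window = max(window / 2, frames_per_second)\n",
"    return history_speech * frame_ms\n",
"\n",
"\n",
"# Five seconds of silence: nothing is recorded, so run() would retry\n",
"print(simulate_recording([False] * 165))  # 0\n",
"# About five seconds of speech followed by silence\n",
"print(simulate_recording([True] * 165 + [False] * 83))  # 4950"
]
},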
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage\n",
"\n",
"Run the program directly.\n",
"\n",
"You can also replace any of the following modules with your own implementation or model:\n",
"* record_audio: audio recording\n",
"* run_asr: speech recognition, [AppBuilder ASR component documentation](https://github.com/baidubce/app-builder/blob/master/python/core/components/asr/README.md)\n",
"* run_agent: Agent conversation\n",
"* run_tts_and_play_audio: generates speech for the reply and plays it, [AppBuilder TTS component documentation](https://github.com/baidubce/app-builder/blob/master/python/core/components/tts/README.md)\n",
"\n",
"**AppBuilder TTS component parameters**\n",
"| Parameter | Type | Required | Description | Example |\n",
"|------------|---------|------|------------------------------------------------|-------------------------------------|\n",
"| message | String | Yes | Text to convert to speech | Message(content={\"text\": \"text to synthesize\"}) |\n",
"| model | String | No | Defaults to `baidu-tts`. Allowed values: `paddlespeech-tts`, `baidu-tts` | paddlespeech-tts |\n",
"| speed | Integer | No | Speaking rate. Defaults to 5 (medium); range 0 to 15. Only takes effect with `baidu-tts`; ignored with `paddlespeech-tts` | 5 |\n",
"| pitch | Integer | No | Pitch. Defaults to 5 (medium); range 0 to 15. Only takes effect with `baidu-tts`; ignored with `paddlespeech-tts` | 5 |\n",
"| volume | Integer | No | Volume. Defaults to 5 (medium); range 0 to 15. Only takes effect with `baidu-tts`; ignored with `paddlespeech-tts` | 5 |\n",
"| person | Integer | No | Voice. Defaults to 0 (度小美). Standard voices: 0 (度小美), 1 (度小宇), 3 (度逍遥, basic), 4 (度丫丫). Premium voices: 5003 (度逍遥, premium), 5118 (度小鹿), 106 (度博文), 110 (度小童), 111 (度小萌), 103 (度米朵), 5 (度小娇). Top-tier voices: 4003 (度逍遥, emotional male voice), 4106 (度博文, professional male anchor), 4115 (度小贤, male radio host), 4119 (度小鹿, sweet female voice), 4105 (度灵儿, clear female voice), 4117 (度小乔, lively female voice), 4100 (度小雯, energetic female anchor), 4103 (度米朵, cute female voice), 4144 (度姗姗, entertainment female voice), 4278 (度小贝, knowledge female anchor), 4143 (度清风, male voice-over), 4140 (度小新, professional female anchor), 4129 (度小彦, knowledge male anchor), 4149 (度星河, male advertising voice), 4254 (度小清, female advertising voice), 4206 (度博文, variety-show male voice), 4226 (南方, female radio host). Only takes effect with `baidu-tts`; ignored with `paddlespeech-tts` | 0 |\n",
"| audio_type | String | No | Audio format. With `baidu-tts`: `mp3` or `wav`. With `paddlespeech-tts` in non-streaming mode: must be `wav`. With `paddlespeech-tts` in streaming mode: must be `pcm` | wav |\n",
"| stream | Bool | No | Defaults to False. Currently `paddlespeech-tts` supports streaming and `baidu-tts` does not | False |\n",
"| retry | Integer | No | Number of HTTP retries | 3 |\n",
"| timeout | Integer | No | HTTP timeout | 5 |"
]
}
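,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `model`, `audio_type`, and `stream` rows of the table constrain one another, so it can be convenient to check a combination before calling the TTS component. The cell below is a minimal sketch of such a check. `check_tts_args` is a hypothetical helper written for this cookbook, not an AppBuilder-SDK function, and it encodes only the rules stated in the table."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical helper for illustration; it only encodes the rules in the table above\n",
"# and is not part of AppBuilder-SDK.\n",
"def check_tts_args(model, audio_type, stream):\n",
"    \"\"\"Return None if the combination is allowed by the table, else a message.\"\"\"\n",
"    if model == \"baidu-tts\":\n",
"        if stream:\n",
"            return \"baidu-tts does not support streaming\"\n",
"        if audio_type not in (\"mp3\", \"wav\"):\n",
"            return \"baidu-tts supports audio_type 'mp3' or 'wav'\"\n",
"    elif model == \"paddlespeech-tts\":\n",
"        # Streaming requires pcm; non-streaming requires wav\n",
"        expected = \"pcm\" if stream else \"wav\"\n",
"        if audio_type != expected:\n",
"            return f\"paddlespeech-tts with stream={stream} requires audio_type '{expected}'\"\n",
"    else:\n",
"        return f\"unknown model: {model}\"\n",
"    return None\n",
"\n",
"\n",
"# The combination used in run_tts_and_play_audio\n",
"print(check_tts_args(\"paddlespeech-tts\", \"pcm\", True))  # None\n",
"print(check_tts_args(\"baidu-tts\", \"mp3\", True))  # baidu-tts does not support streaming"
]
}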
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}