qa_system_1_dataset.ipynb•14.6 kB
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# 企业级问答系统 —— 离线知识生产流程\n",
"\n",
"## 一、背景介绍\n",
"\n",
"智能问答系统是当前应用范围最广,最容易落地的大模型应用。通常来说,智能问答系统包括以下五部分:\n",
"1. Query理解\n",
"2. 混合检索召回\n",
"3. 语义排序\n",
"4. 答案生成\n",
"5. 追问生成\n",
"\n",
"AppBuilder在问答系统沉淀了最佳实践流程,并以极易用的组件化的方式对外开放,可以使用页面操作 + 低代码的方式快速搭建自己的问答系统。\n",
"\n",
"<img src=\"./qa_system/qa_flow.png\" alt=\"drawing\" width=\"1000\"/>\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 二、实操流程\n",
"\n",
"整体使用流程包括以下三个环节:\n",
"\n",
"1. 登录[百度智能云千帆AppBuilder官网](https://cloud.baidu.com/product/AppBuilder)创建账户,\n",
"\n",
"<img src=\"./qa_system/product.png\" alt=\"drawing\" width=\"1000\"/>\n",
"\n",
"\n",
"2. 在[百度智能云千帆AppBuilder控制台](https://console.bce.baidu.com/ai_apaas/dialogHome)右侧头像栏『API密钥』页面获取密钥,并复制。\n",
"\n",
"<img src=\"./qa_system/token.jpg\" alt=\"drawing\" width=\"1000\"/>\n",
"\n",
"\n",
"<img src=\"./qa_system/get_token.png\" alt=\"drawing\" width=\"1000\"/>\n",
"\n",
"\n",
"3. 引用AppBuilder SDK代码,调用Dataset API相关接口,创建数据集,并增、查、删除文档。"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1 环境准备\n",
"\n",
"首先需要安装AppBuilder-SDK代码库,若已在开发环境安装,则可跳过此步。\n",
"\n",
"**注意:**: appbuilder-sdk 的python版本要求 3.9+"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Defaulting to user installation because normal site-packages is not writeable\n",
"Collecting appbuilder-sdk\n",
" Using cached appbuilder_sdk-0.6.0-py3-none-any.whl.metadata (768 bytes)\n",
"Requirement already satisfied: requests in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from appbuilder-sdk) (2.31.0)\n",
"Requirement already satisfied: proto-plus==1.22.3 in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from appbuilder-sdk) (1.22.3)\n",
"Requirement already satisfied: pydantic==2.6.4 in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from appbuilder-sdk) (2.6.4)\n",
"Requirement already satisfied: numpy in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from appbuilder-sdk) (1.24.4)\n",
"Requirement already satisfied: SQLAlchemy~=1.4.50 in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from appbuilder-sdk) (1.4.52)\n",
"Requirement already satisfied: urllib3<2.0.0 in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from appbuilder-sdk) (1.26.18)\n",
"Requirement already satisfied: tenacity in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from appbuilder-sdk) (8.2.3)\n",
"Requirement already satisfied: pandas in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from appbuilder-sdk) (2.2.1)\n",
"Requirement already satisfied: openpyxl in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from appbuilder-sdk) (3.1.2)\n",
"Requirement already satisfied: pymochow>=1.1.2 in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from appbuilder-sdk) (1.1.2)\n",
"Requirement already satisfied: parameterized in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from appbuilder-sdk) (0.9.0)\n",
"Requirement already satisfied: protobuf<5.0.0dev,>=3.19.0 in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from proto-plus==1.22.3->appbuilder-sdk) (4.25.3)\n",
"Requirement already satisfied: annotated-types>=0.4.0 in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from pydantic==2.6.4->appbuilder-sdk) (0.6.0)\n",
"Requirement already satisfied: pydantic-core==2.16.3 in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from pydantic==2.6.4->appbuilder-sdk) (2.16.3)\n",
"Requirement already satisfied: typing-extensions>=4.6.1 in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from pydantic==2.6.4->appbuilder-sdk) (4.10.0)\n",
"Requirement already satisfied: orjson in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from pymochow>=1.1.2->appbuilder-sdk) (3.9.15)\n",
"Requirement already satisfied: future in /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages (from pymochow>=1.1.2->appbuilder-sdk) (0.18.2)\n",
"Requirement already satisfied: et-xmlfile in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from openpyxl->appbuilder-sdk) (1.1.0)\n",
"Requirement already satisfied: python-dateutil>=2.8.2 in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from pandas->appbuilder-sdk) (2.8.2)\n",
"Requirement already satisfied: pytz>=2020.1 in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from pandas->appbuilder-sdk) (2024.1)\n",
"Requirement already satisfied: tzdata>=2022.7 in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from pandas->appbuilder-sdk) (2024.1)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from requests->appbuilder-sdk) (3.3.2)\n",
"Requirement already satisfied: idna<4,>=2.5 in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from requests->appbuilder-sdk) (3.6)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from requests->appbuilder-sdk) (2024.2.2)\n",
"Requirement already satisfied: six>=1.5 in /Users/chengmo/Library/Python/3.9/lib/python/site-packages (from python-dateutil>=2.8.2->pandas->appbuilder-sdk) (1.16.0)\n",
"Using cached appbuilder_sdk-0.6.0-py3-none-any.whl (335 kB)\n",
"Installing collected packages: appbuilder-sdk\n",
"Successfully installed appbuilder-sdk-0.6.0\n"
]
}
],
"source": [
"!python3 -m pip install appbuilder-sdk"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 引入AppBuilder-SDK,并设置TOKEN\n",
"\n",
"**注意:** 请使用刚才在AppBuilder平台上新建的个人TOKEN"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"AppBuilder 模块导入成功!\n",
"您的AppBuilder Token为:your_appbuilder_token\n"
]
}
],
"source": [
"# 引入os模块,引入appbuilder 模块\n",
"import os\n",
"import appbuilder\n",
"\n",
"# 设置appbuilder的token密钥,从页面上复制粘贴token,覆盖此处的 \"your_appbuilder_token\"\n",
"os.environ['APPBUILDER_TOKEN'] = \"your_appbuilder_token\"\n",
"\n",
"print(\"AppBuilder 模块导入成功!\")\n",
"print(\"您的AppBuilder Token为:{}\".format(os.environ['APPBUILDER_TOKEN']))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 创建全新知识库\n",
"\n",
"我们从零开始创建一个新的知识库,并使用它来支持问答系统。我们将其命名为**说明书知识库**"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"知识库创建成功! 知识库的ID为:c368a982-699f-4114-9e9a-c08b76c07f92, 知识库的名称为: 说明书知识库\n"
]
}
],
"source": [
"# 创建全新知识库\n",
"dataset = appbuilder.console.Dataset.create_dataset(\"说明书知识库\")\n",
"\n",
"print(\"知识库创建成功! 知识库的ID为:{}, 知识库的名称为: {}\".format(\n",
" dataset.dataset_id,\n",
" dataset.dataset_name))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4 上传文档到知识库\n",
"\n",
"在notebook的同级目录`./qa_system`下,我们提前准备了两篇文档:\n",
"- `./qa_system/平板电脑说明书.pdf`\n",
"- `./qa_system/显示器说明书.pdf`\n",
"\n",
"我们将这两篇文档上传到知识库中,用于后续问答应用的私域知识检索。"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"文档上传到知识库成功!知识库中的文档ID为:['c396c715-66d8-49cb-af9d-af0bd08a16e2', '817d29bb-be72-420b-bee1-c1d9b1116826']\n"
]
}
],
"source": [
"file_1 = \"./qa_system/平板电脑说明书.pdf\"\n",
"file_2 = \"./qa_system/显示器说明书.pdf\"\n",
"\n",
"document_infos = dataset.add_documents([\n",
" file_1, file_2\n",
"])\n",
"\n",
"print(\"文档上传到知识库成功!知识库中的文档ID为:{}\".format(document_infos.document_ids))\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.5 查找知识库中的文档\n",
"\n",
"问答系统中的知识库,常用增、删、查操作,我们刚才演示了增加文档的方法,下面演示如何查找文档。\n",
"\n",
"我们获取[百度智能云千帆AppBuilder控制台-我的知识](https://console.bce.baidu.com/ai_apaas/dataset)页面中,我们刚才创建的知识库中的文档列表。\n",
"\n",
"<img src=\"./qa_system/doc_list.png\" alt=\"drawing\" width=\"1000\"/>"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"文档列表获取成功!\n",
"第二个的文档是: 平板电脑说明书.pdf, 文档ID是: c396c715-66d8-49cb-af9d-af0bd08a16e2, 文档字数是: 8568\n"
]
}
],
"source": [
"doc_list = dataset.get_documents(page=1,limit=10)\n",
"print(\"文档列表获取成功!\")\n",
"print(\"第二个的文档是: {}, 文档ID是: {}, 文档字数是: {}\".format(\n",
" doc_list.data[1].name, doc_list.data[1].id, doc_list.data[1].word_count))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.6 删除知识库中的文档\n",
"\n",
"下面我们演示如何删除知识库中的文档。我们尝试删除 『平板电脑说明书.pdf』 这篇文档"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"文档: 平板电脑说明书.pdf, 文档ID: c396c715-66d8-49cb-af9d-af0bd08a16e2 删除成功!\n",
"文档删除成功,请刷新『知识库』前端页面,观察是否已无『平板电脑说明书.pdf』文档\n"
]
}
],
"source": [
"need_delete_doc_name = \"平板电脑说明书.pdf\"\n",
"need_delete_doc_id = \"\"\n",
"\n",
"doc_list = dataset.get_documents(page=1,limit=10)\n",
"for doc in doc_list.data:\n",
" if doc.name == need_delete_doc_name:\n",
" need_delete_doc_id = doc.id\n",
" break\n",
"\n",
"if need_delete_doc_id == \"\":\n",
" print(\"未找到指定文档,请检查 2.4 步骤是否正确上传,以及2.5步骤中是否展示了正确的文档列表\")\n",
"else:\n",
" dataset.delete_documents([need_delete_doc_id])\n",
" print(\"文档: {}, 文档ID: {} 删除成功!\".format(\n",
" need_delete_doc_name,\n",
" need_delete_doc_id\n",
" ))\n",
"\n",
"print(\"文档删除成功,请刷新『知识库』前端页面,观察是否已无『{}』文档\".format(need_delete_doc_name))\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"执行完上述步骤后,前端页面刷新,指定文档已经被成功删除\n",
"\n",
"<img src=\"./qa_system/delete_doc.png\" alt=\"drawing\" width=\"1000\"/>"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"当前创建的知识库的ID为:c368a982-699f-4114-9e9a-c08b76c07f92, 知识库的名称为: 说明书知识库\n"
]
}
],
"source": [
"# 再一次确认当前知识库的 Name 与 ID,方便后续 企业级问答系统 —— 在线问答流程 的使用\n",
"\n",
"print(\"当前创建的知识库的ID为:{}, 知识库的名称为: {}\".format(\n",
" dataset.dataset_id,\n",
" dataset.dataset_name))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 三、总结\n",
"\n",
"AppBuilder 为 基于RAG的 企业问答系统提供了 `appbuilder.console.dataset` API,可以方便的以代码态进行知识库的管理,实现基本的增删改查业务需求。\n",
"\n",
"本示例展示了如何从零创建知识库,并实现增、查、改的代码,并可以观察SDK与前端页面的联动。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}