rag.ipynb
{ "cells": [ { "cell_type": "markdown", "id": "f802e64d-4eaa-445d-a48a-1042a91bc394", "metadata": { "tags": [] }, "source": [ "# 智能问答机器人:简历筛选小助手\n", "\n", "## 目标\n", "通过对本地简历库的简历进行解析、切片、创建索引,实现基于JD进行简历筛选,并对筛选的Top1简历进行总结的功能\n", "\n", "## 准备工作\n", "### 平台注册\n", "1.先在appbuilder平台注册,获取token\n", "2.创建BES集群,详见(https://cloud.baidu.com/doc/BES/s/0jwvyk4tv)\n", "3.安装appbuilder-sdk" ] }, { "cell_type": "code", "execution_count": null, "id": "2939356f-61c2-42e9-9e0c-fc6729c193f6", "metadata": {}, "outputs": [], "source": [ "!pip install appbuilder-sdk\n", "!pip install elasticsearch==7.11.0" ] }, { "cell_type": "code", "execution_count": 1, "id": "4ccff03b-1567-4e8b-8e1f-9a5032690406", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "init done\n" ] } ], "source": [ "import logging\n", "import os\n", "\n", "# 设置环境变量\n", "os.environ[\"APPBUILDER_TOKEN\"] = \"...\"\n", "\n", "# 创建BES集群后获得集群id、用户名、密码\n", "cluster_id = \"...\"\n", "username = \"...\"\n", "password = \"...\"\n", "\n", "print(\"init done\")\n" ] }, { "cell_type": "markdown", "id": "aeb2fa55-075f-48df-a9fb-8b40d9900684", "metadata": {}, "source": [ "## 开发过程\n", "### 对简历文档内容进行解析、切分" ] }, { "cell_type": "code", "execution_count": null, "id": "41559341-fd7a-478c-a08b-1477d79e9d41", "metadata": { "ExecuteTime": { "end_time": "2023-12-18T06:24:26.982459Z", "start_time": "2023-12-18T06:23:53.771345Z" } }, "outputs": [], "source": [ "import appbuilder\n", "from pathlib import Path\n", "\n", "# 基于doc_parser和doc_splitter解析file_path文件为若干个段落\n", "def parse_file(file_path, doc_parser, doc_splitter):\n", " input_msg = appbuilder.Message(str(file_path))\n", " doc_parser_result = doc_parser(input_msg, return_raw=True)\n", " doc_splitter_result = doc_splitter(doc_parser_result)\n", " return [f\"{file_path.name}+{para['text'][:384]}\" \n", " for para in doc_splitter_result.content[\"paragraphs\"]]\n", "\n", "# 文档切分的分块大小,每个分块最大340个字符\n", "chunk_size = 340\n", "\n", "# 本地简历库地址\n", "file_dir = \"../简历\"\n", "\n", "# 声明文档解析和文档切分组件\n", "doc_parser = appbuilder.DocParser()\n", "doc_splitter = appbuilder.DocSplitter(splitter_type=\"split_by_chunk\", max_segment_length=chunk_size) \n", "\n", "# 批量解析,形成段落切片列表\n", "resume_paragraphs = [para_text \n", " for file in Path(file_dir).iterdir() if file.is_file()\n", " for para_text in parse_file(file, doc_parser, doc_splitter)]\n", "\n", "print(f\"total length of splitted paragraphs {len(resume_paragraphs)}\")\n", "\n", "print(\"sample resume paragraphs\")\n", "for paras in resume_paragraphs[:3]:\n", " print(paras)" ] }, { "cell_type": "markdown", "id": "5fc5bc38-6bc5-4187-a8fd-f802d77d89fa", "metadata": {}, "source": [ "### 计算文档切片的语义向量并入库" ] }, { "cell_type": "code", "execution_count": null, "id": "d514a628-5ae3-4269-aada-eaf8b21c3793", "metadata": {}, "outputs": [], "source": [ "import time\n", "# 千帆Embedding\n", "embedding = appbuilder.Embedding()\n", "\n", "# 将段落切片列表入库到BESVectorStoreIndex,这里面用到的Baidu Elastic Search服务\n", "segments = appbuilder.Message(resume_paragraphs)\n", "vector_index = appbuilder.BESVectorStoreIndex.from_segments(\n", " segments=segments, cluster_id=cluster_id, user_name=username, \n", " password=password, embedding=embedding)\n", "\n", "# bes向量入库是异步操作,此处等待bes进行refresh\n", "time.sleep(5)\n", "print(\"index done.\")" ] }, { "cell_type": "markdown", "id": "16a8aa38-7a33-4e27-bca4-00900cfe1641", "metadata": {}, "source": [ "### 从工作职责描述中抽取关键词" ] }, { "cell_type": "code", "execution_count": null, "id": "9f45ef5f-6206-4b31-83c4-3c8eb2c86925", "metadata": {}, 
"outputs": [], "source": [ "# 招聘jd\n", "from collections import Counter, defaultdict\n", "\n", "# 工作职责描述\n", "job_desc = '''拥有大语言模型领域的研究背景,发表过期刊会议论文者加分;\n", "具备深度学习平台开发或大模型开发经验,能够独立完成任务;\n", "具备优秀的编程能力,熟悉tensorflow、pytorch、paddlepaddle中的一种深度学习框架;\n", "对人工智能有浓厚兴趣,愿意投入时间和精力深入研究;\n", "具备团队协作精神,能够与团队成员有效沟通,共同推进项目进展。'''\n", "\n", "# 标签抽取的组件\n", "tagger = appbuilder.TagExtraction(model=\"Qianfan-Agent-Speed-8K\")\n", "\n", "# 从JD抽取标签并打印\n", "tags = tagger(appbuilder.Message(job_desc))\n", "\n", "print(\"从JD抽取标签并打印\")\n", "print(tags.content)\n", "tag_list = [tag.split(\".\")[1] for tag in tags.content.split(\"\\n\")]\n" ] }, { "cell_type": "markdown", "id": "21ef57a9-46cf-4939-b375-68e21d4b3e90", "metadata": {}, "source": [ "### 基于工作描述的关键词从简历库中检索简历并排序" ] }, { "cell_type": "code", "execution_count": null, "id": "7313e122-5199-4c90-bc6c-ad04e206ccc9", "metadata": {}, "outputs": [], "source": [ "# 基于JD抽取出来的标签,使用retriever对简历库进行检索,基于简历文件名称命中次数进行汇总和排序\n", "retriever = vector_index.as_retriever()\n", "\n", "resume_count = Counter()\n", "resume_content = defaultdict(set)\n", "\n", "for tag in tag_list:\n", " print(\"Going to retrieve resumes for tag: \", tag, \"\\nGot Results\")\n", " relevant_resumes = retriever(query=appbuilder.Message(tag), top_k=3)\n", " for idx, res in enumerate(relevant_resumes.content):\n", " name_and_content = res[\"text\"].split(\"+\")\n", " print(f\"{idx+1}: {name_and_content[0]}, {name_and_content[1]}\")\n", " resume_count[name_and_content[0]] += 1\n", " resume_content[name_and_content[0]].add(name_and_content[1])\n", " print()\n", "\n", "sorted_resumes = sorted(resume_count.items(), key=lambda x: x[1], reverse=True)\n", "print(sorted_resumes)" ] }, { "cell_type": "markdown", "id": "36ba212a-774b-4f26-a860-988a872423ac", "metadata": {}, "source": [ "### 对检索得到的最优简历进行总结" ] }, { "cell_type": "code", "execution_count": null, "id": "e38701cd-9fbf-4da4-982b-a87a9b310427", "metadata": {}, "outputs": [], "source": [ "\n", "play = appbuilder.Playground(\n", " prompt_template=\"基于候选人姓名、职责描述和简历内容,概括一下{name}的推荐理由。\\n候选人姓名: {name}\\n职责描述: {JD}\\n简历内容: {resume}\\n推荐理由: \",\n", " model=\"Qianfan-Agent-Speed-8K\"\n", ")\n", "\n", "resume_summary = play(appbuilder.Message({\"JD\": job_desc, \"name\": sorted_resumes[0][0], \"resume\": \"\\n\".join(list(resume_content[sorted_resumes[0][0]]))}))\n", "print(resume_summary.content)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 5 }
