chatbot_with_human_feedback.ipynb•14.6 kB
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg\" width=\"1000\"/>\n",
" <br>\n",
" <br>\n",
" <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n",
" |\n",
" <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
" |\n",
" <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n",
" </p>\n",
"</center>\n",
"<h1 align=\"center\">Instrumenting a chatbot with human feedback</h1>\n",
"\n",
"Phoenix provides endpoints to associate user-provided feedback directly with OpenInference spans as annotations.\n",
"\n",
"In this tutorial, we will create a manually-instrument chatbot with user-triggered \"👍\" and \"👎\" feedback buttons. We will have those buttons trigger a callback that sends the user feedback to Phoenix and is viewable alongside the span. Automating associating feedback with spans is a powerful way to quickly focus on traces of your application that are not behaving as expected."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install -q arize-phoenix-otel \"arize-phoenix-client>=1.5.0\" gradio"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"from typing import Any, Dict\n",
"from uuid import uuid4\n",
"\n",
"import httpx\n",
"\n",
"from phoenix.client import Client\n",
"from phoenix.otel import register"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n",
" openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n",
"\n",
"if not (phoenix_api_key := os.getenv(\"PHOENIX_API_KEY\")):\n",
" phoenix_api_key = getpass(\"🔑 Enter your Phoenix API key: \")\n",
"\n",
"os.environ[\"PHOENIX_CLIENT_HEADERS\"] = f\"api_key={phoenix_api_key}\"\n",
"os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com\"\n",
"os.environ[\"PHOENIX_PROJECT_NAME\"] = \"Chatbot with Annotations\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define endpoints and configure OpenTelemetry tracing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tracer_provider = register()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"FEEDBACK_ENDPOINT = f\"{os.environ['PHOENIX_COLLECTOR_ENDPOINT']}/span_annotations\"\n",
"OPENAI_API_URL = \"https://api.openai.com/v1/chat/completions\"\n",
"tracer = tracer_provider.get_tracer(__name__)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define and instrument chat service backend\n",
"\n",
"Here we define two functions:\n",
"\n",
"`generate_response` is a function that contains the chatbot logic for responding to a user query. `generate_response` is manually instrumented using the `OpenInference` semantic conventions. More information on how to manually instrument an application can be found [here](https://arize.com/docs/phoenix/tracing/how-to-tracing/manual-instrumentation). `generate_response` also returns the OpenTelemetry spanID, a hex-encoded string that is used to associate feedback with a specific trace.\n",
"\n",
"`send_feedback` is a function that sends user feedback to Phoenix via the `span_annotations` REST route."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client = Client()\n",
"http_client = httpx.Client()\n",
"\n",
"\n",
"def generate_response(\n",
" input_text: str, model: str = \"gpt-3.5-turbo\", temperature: float = 0.1\n",
") -> Dict[str, Any]:\n",
" user_message = {\"role\": \"user\", \"content\": input_text, \"uuid\": str(uuid4())}\n",
" invocation_parameters = {\"temperature\": temperature}\n",
" payload = {\n",
" \"model\": model,\n",
" **invocation_parameters,\n",
" \"messages\": [user_message],\n",
" }\n",
" headers = {\n",
" \"Content-Type\": \"application/json\",\n",
" \"Authorization\": f\"Bearer {openai_api_key}\",\n",
" }\n",
" with tracer.start_as_current_span(\"llm_span\", openinference_span_kind=\"llm\") as span:\n",
" span.set_input(user_message)\n",
"\n",
" # get the active hex-encoded spanID\n",
" span_id = span.get_span_context().span_id.to_bytes(8, \"big\").hex()\n",
" print(span_id)\n",
"\n",
" response = http_client.post(OPENAI_API_URL, headers=headers, json=payload)\n",
"\n",
" if not (200 <= response.status_code < 300):\n",
" raise Exception(f\"Failed to call OpenAI API: {response.text}\")\n",
" response_json = response.json()\n",
"\n",
" span.set_output(response_json)\n",
"\n",
" return response_json, span_id\n",
"\n",
"\n",
"def send_feedback(span_id: str, feedback: int, user_id: str) -> None:\n",
" label = \"👍\" if feedback == 1 else \"👎\"\n",
" client.annotations.add_span_annotation(\n",
" span_id=span_id,\n",
" annotation_name=\"user_feedback\",\n",
" label=label,\n",
" score=feedback,\n",
" metadata={\"example_key\": \"123\"},\n",
" identifier=user_id,\n",
" )\n",
" print(f\"Feedback sent for span_id {span_id}: {label}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Define an LLM evaluator to run on incorrect responses"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def run_llm_eval(span_id: str, input_text: str, assistant_content: str):\n",
" \"\"\"\n",
" Evaluates the quality of an LLM response by asking another LLM to classify its correctness.\n",
"\n",
" Args:\n",
" span_id: The ID of the span to evaluate\n",
" input_text: The original unchanged user query\n",
" assistant_content: The assistant's response to evaluate\n",
" \"\"\"\n",
" # Create a prompt for the evaluation model\n",
" eval_prompt = f\"\"\"\n",
" You are an expert evaluator of AI assistant responses. Please evaluate the following:\n",
"\n",
" User Query: {input_text}\n",
"\n",
" Assistant Response: {assistant_content}\n",
"\n",
" Is this response correct, helpful, and appropriate for the user query?\n",
" Provide a brief analysis and then classify as either \"CORRECT\" or \"INCORRECT\".\n",
"\n",
" Format your response as follows:\n",
" Analysis: [Your analysis here]\n",
" Classification: [CORRECT or INCORRECT]\n",
" \"\"\"\n",
"\n",
" # Call the evaluation model using the OpenAI API\n",
"\n",
" headers = {\n",
" \"Content-Type\": \"application/json\",\n",
" \"Authorization\": f\"Bearer {openai_api_key}\",\n",
" }\n",
"\n",
" payload = {\n",
" \"model\": \"gpt-4o\", # Using a smaller model for evaluation\n",
" \"messages\": [{\"role\": \"user\", \"content\": eval_prompt}],\n",
" }\n",
"\n",
" # Increased timeout to prevent ReadTimeout errors\n",
" eval_response = http_client.post(OPENAI_API_URL, headers=headers, json=payload, timeout=60.0)\n",
" eval_response = eval_response.json()\n",
" print(eval_response)\n",
" eval_content = eval_response[\"choices\"][0][\"message\"][\"content\"]\n",
"\n",
" # Store the evaluation as an annotation\n",
" client.annotations.add_span_annotation(\n",
" span_id=span_id,\n",
" annotation_name=\"correctness\",\n",
" annotator_kind=\"LLM\",\n",
" label=\"INCORRECT\" if \"Classification: INCORRECT\" in eval_content else \"CORRECT\",\n",
" score=1 if \"Classification: INCORRECT\" in eval_content else 0,\n",
" explanation=eval_content,\n",
" )\n",
"\n",
" print(f\"LLM Evaluation for span_id {span_id}:\")\n",
" print(eval_content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Chat Widget\n",
"\n",
"We create a simple chat application using IPython widgets. Alongside the chatbot responses we provide feedback buttons that a user can click to provide feedback. These can be seen inside the Phoenix UI!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def create_gradio_chat():\n",
" import gradio as gr\n",
"\n",
" def chat_response(message, history, user_id):\n",
" # Send the message to the OpenAI API and get the response\n",
" response_data, span_id = generate_response(message)\n",
" assistant_content = response_data[\"choices\"][0][\"message\"][\"content\"]\n",
"\n",
" # Store the span_id for feedback\n",
" return assistant_content, span_id\n",
"\n",
" def submit_feedback(feedback_type, span_id, message, response, user_id):\n",
" if feedback_type == \"positive\":\n",
" send_feedback(span_id, 1, user_id)\n",
" return \"Thanks for your positive feedback! We'll use it to improve our assistant.\"\n",
" else: # negative feedback\n",
" send_feedback(span_id, 0, user_id)\n",
" run_llm_eval(span_id, message, response)\n",
" return \"Thanks for your feedback. We'll work on improving this type of response.\"\n",
"\n",
" with gr.Blocks() as demo:\n",
" gr.HTML(\"<h3>Encyclopedia Chatbot</h3>\")\n",
" gr.HTML(\n",
" \"<p>Welcome to the Encyclopedia Chatbot. Ask any question about the world, and provide feedback to help us improve!</p>\"\n",
" )\n",
"\n",
" user_id = gr.Dropdown(\n",
" choices=[\"user1\", \"user2\", \"user3\", \"user4\", \"user5\"], value=\"user1\", label=\"User ID\"\n",
" )\n",
"\n",
" chatbot = gr.Chatbot(height=400)\n",
" msg = gr.Textbox(placeholder=\"Type your message here...\")\n",
"\n",
" # Hidden state to store the current span_id\n",
" current_span_id = gr.State(\"\")\n",
" feedback_message = gr.Markdown(\"\")\n",
"\n",
" def respond(message, chat_history, user_id):\n",
" # Get bot response\n",
" bot_response, span_id = chat_response(message, chat_history, user_id)\n",
"\n",
" # Update chat history\n",
" chat_history.append((message, bot_response))\n",
"\n",
" return \"\", chat_history, span_id\n",
"\n",
" # Send button\n",
" msg.submit(respond, [msg, chatbot, user_id], [msg, chatbot, current_span_id])\n",
"\n",
" with gr.Row():\n",
" thumbs_up = gr.Button(\"👍\", scale=1)\n",
" thumbs_down = gr.Button(\"👎\", scale=1)\n",
"\n",
" # Feedback handlers\n",
" def handle_positive_feedback(span_id, chat_history, user_id):\n",
" if not chat_history:\n",
" return \"No message to provide feedback on.\"\n",
"\n",
" last_user_msg, last_bot_msg = chat_history[-1]\n",
" return submit_feedback(\"positive\", span_id, last_user_msg, last_bot_msg, user_id)\n",
"\n",
" def handle_negative_feedback(span_id, chat_history, user_id):\n",
" if not chat_history:\n",
" return \"No message to provide feedback on.\"\n",
"\n",
" last_user_msg, last_bot_msg = chat_history[-1]\n",
" return submit_feedback(\"negative\", span_id, last_user_msg, last_bot_msg, user_id)\n",
"\n",
" thumbs_up.click(\n",
" handle_positive_feedback, [current_span_id, chatbot, user_id], feedback_message\n",
" )\n",
"\n",
" thumbs_down.click(\n",
" handle_negative_feedback, [current_span_id, chatbot, user_id], feedback_message\n",
" )\n",
"\n",
" return demo\n",
"\n",
"\n",
"# Create and display the Gradio interface\n",
"demo = create_gradio_chat()\n",
"demo.launch(inline=True, share=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyze feedback using the Phoenix Client\n",
"\n",
"We can use the Phoenix client to pull the annotated spans. By combining `get_spans_dataframe`\n",
"and `get_span_annotations_dataframe` we can create a dataframe of all annotations alongside\n",
"span data for analysis!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"spans_df = client.spans.get_spans_dataframe(project_identifier=os.environ[\"PHOENIX_PROJECT_NAME\"])\n",
"annotations_df = client.spans.get_span_annotations_dataframe(\n",
" spans_dataframe=spans_df, project_identifier=os.environ[\"PHOENIX_PROJECT_NAME\"]\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"annotations_df.join(spans_df, how=\"inner\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client.spans.get_span_annotations(\n",
" span_ids=spans_df.index, project_identifier=os.environ[\"PHOENIX_PROJECT_NAME\"]\n",
")"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}