{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg\" width=\"1000\"/>\n",
" <br>\n",
" <br>\n",
" <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n",
" |\n",
" <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
" |\n",
" <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n",
" </p>\n",
"</center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Automatically find the bad LLM responses in your LLM Evals with Cleanlab\n",
"\n",
"This guide will walk you through the process of evaluating LLM responses captured in Phoenix with Cleanlab's Trustworthy Language Models (TLM).\n",
"\n",
"TLM boosts the reliability of any LLM application by indicating when the model’s response is untrustworthy.\n",
"\n",
"This guide requires a Cleanlab TLM API key. If you don't have one, you can sign up for a free trial [here](https://tlm.cleanlab.ai/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install dependencies & Set environment variables"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"pip install -q \"arize-phoenix>=4.29.0\"\n",
"pip install -q 'httpx<0.28'\n",
"pip install -q openai cleanlab_tlm openinference-instrumentation-openai"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"from getpass import getpass\n",
"\n",
"import dotenv\n",
"\n",
"dotenv.load_dotenv()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n",
" openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = openai_api_key"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sign up for a free trial of Cleanlab TLM and get an API key [here](https://tlm.cleanlab.ai/).\n",
"if not (cleanlab_tlm_api_key := os.getenv(\"CLEANLAB_TLM_API_KEY\")):\n",
" cleanlab_tlm_api_key = getpass(\"🔑 Enter your Cleanlab TLM API key: \")\n",
"\n",
"os.environ[\"CLEANLAB_TLM_API_KEY\"] = cleanlab_tlm_api_key"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Connect to Phoenix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, we'll use Phoenix as our destination. You could instead add any other exporters you'd like in this approach.\n",
"\n",
"If you need to set up an API key for Phoenix, you can do so [here](https://app.phoenix.arize.com/).\n",
"\n",
"The code below will connect you to a Phoenix Cloud instance. You can also connect to [a self-hosted Phoenix instance](https://arize.com/docs/phoenix/deployment) if you'd prefer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Add Phoenix API Key for tracing\n",
"if not (PHOENIX_API_KEY := os.getenv(\"PHOENIX_CLIENT_HEADERS\")):\n",
" PHOENIX_API_KEY = getpass(\"🔑 Enter your Phoenix API Key: \")\n",
"os.environ[\"PHOENIX_CLIENT_HEADERS\"] = f\"api_key={PHOENIX_API_KEY}\"\n",
"os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have Phoenix configured, we can register that instance with OpenTelemetry, which will allow us to collect traces from our application here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.otel import register\n",
"\n",
"tracer_provider = register(project_name=\"evaluating_traces_TLM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prepare trace dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the sake of making this guide fully runnable, we'll briefly generate some traces and track them in Phoenix. Typically, you would have already captured traces in Phoenix and would skip to \"Download trace dataset from Phoenix\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from openinference.instrumentation.openai import OpenAIInstrumentor\n",
"\n",
"OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"\n",
"# Initialize OpenAI client\n",
"client = OpenAI()\n",
"\n",
"\n",
"# Function to generate an answer\n",
"def generate_answers(trivia_question):\n",
" response = client.chat.completions.create(\n",
" model=\"gpt-3.5-turbo\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"You are a trivia master.\"},\n",
" {\"role\": \"user\", \"content\": trivia_question},\n",
" ],\n",
" )\n",
" answer = response.choices[0].message.content\n",
" return answer\n",
"\n",
"\n",
"trivia_questions = [\n",
" \"What is the 3rd month of the year in alphabetical order?\",\n",
" \"What is the capital of France?\",\n",
" \"How many seconds are in 100 years?\",\n",
" \"Alice, Bob, and Charlie went to a café. Alice paid twice as much as Bob, and Bob paid three times as much as Charlie. If the total bill was $72, how much did each person pay?\",\n",
" \"When was the Declaration of Independence signed?\",\n",
"]\n",
"\n",
"# Generate answers\n",
"answers = []\n",
"for i in range(len(trivia_questions)):\n",
" answer = generate_answers(trivia_questions[i])\n",
" answers.append(answer)\n",
" print(f\"Question {i + 1}: {trivia_questions[i]}\")\n",
" print(f\"Answer {i + 1}:\\n{answer}\\n\")\n",
"\n",
"print(f\"Generated {len(answers)} answers and tracked them in Phoenix.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download trace dataset from Phoenix"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.client import AsyncClient\n",
"\n",
"px_client = AsyncClient()\n",
"spans_df = await px_client.spans.get_spans_dataframe(project_name=\"evaluating_traces_TLM\")\n",
"spans_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generate evaluations with TLM\n",
"\n",
"Now that we have our trace dataset, we can generate evaluations for each trace using TLM. Ultimately, we want to end up with a trustworthiness score and explaination for each prompt, response pair in the traces."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from cleanlab_tlm import TLM\n",
"\n",
"tlm = TLM(options={\"log\": [\"explanation\"]})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We first need to extract the prompts and responses from the individual traces. `TLM.get_trustworthiness_score()` will take a list of prompts and responses and return trustworthiness scores and explanations.\n",
"\n",
"**IMPORTANT:** It is essential to always include any system prompts, context, or other information that was originally provided to the LLM to generate the response. You should construct the prompt input to `get_trustworthiness_score()` in a way that is as similar as possible to the original prompt."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a new DataFrame with input and output columns\n",
"eval_df = spans_df[[\"context.span_id\", \"attributes.input.value\", \"attributes.output.value\"]].copy()\n",
"eval_df.set_index(\"context.span_id\", inplace=True)\n",
"\n",
"\n",
"# Combine system and user prompts from the traces\n",
"def get_prompt(input_value):\n",
" if isinstance(input_value, str):\n",
" input_value = json.loads(input_value)\n",
" system_prompt = input_value[\"messages\"][0][\"content\"]\n",
" user_prompt = input_value[\"messages\"][1][\"content\"]\n",
" return system_prompt + \"\\n\" + user_prompt\n",
"\n",
"\n",
"# Get the responses from the traces\n",
"def get_response(output_value):\n",
" if isinstance(output_value, str):\n",
" output_value = json.loads(output_value)\n",
" return output_value[\"choices\"][0][\"message\"][\"content\"]\n",
"\n",
"\n",
"# Create a list of prompts and associated responses\n",
"prompts = [get_prompt(input_value) for input_value in eval_df[\"attributes.input.value\"]]\n",
"responses = [get_response(output_value) for output_value in eval_df[\"attributes.output.value\"]]\n",
"\n",
"eval_df[\"prompt\"] = prompts\n",
"eval_df[\"response\"] = responses"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have all of the prompts and responses, we can evaluate each pair using TLM."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Evaluate each of the prompt, response pairs using TLM\n",
"evaluations = tlm.get_trustworthiness_score(prompts, responses)\n",
"\n",
"# Extract the trustworthiness scores and explanations from the evaluations\n",
"trust_scores = [entry[\"trustworthiness_score\"] for entry in evaluations]\n",
"explanations = [entry[\"log\"][\"explanation\"] for entry in evaluations]\n",
"\n",
"# Add the trust scores and explanations to the DataFrame\n",
"eval_df[\"score\"] = trust_scores\n",
"eval_df[\"explanation\"] = explanations\n",
"\n",
"# Display the new DataFrame\n",
"eval_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have a DataFrame with added colums:\n",
"- `prompt`: the combined system and user prompt from the trace\n",
"- `response`: the LLM response from the trace\n",
"- `score`: the trustworthiness score from TLM\n",
"- `explanation`: the explanation from TLM\n",
"\n",
"Let's sort our traces by the `score` column to quickly find untrustworthy LLM responses."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sorted_df = eval_df.sort_values(by=\"score\", ascending=True).head()\n",
"sorted_df"
]
},
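{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer an explicit cutoff over eyeballing the sorted table, you can also flag every response whose trust score falls below a threshold. The sketch below uses an arbitrary threshold of 0.7 purely for illustration; pick whatever cutoff suits your application."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Flag responses below an illustrative trust-score threshold (0.7 is an arbitrary choice)\n",
"TRUST_THRESHOLD = 0.7\n",
"flagged_df = eval_df[eval_df[\"score\"] < TRUST_THRESHOLD]\n",
"print(f\"{len(flagged_df)} of {len(eval_df)} responses scored below {TRUST_THRESHOLD}.\")\n",
"flagged_df[[\"prompt\", \"response\", \"score\"]]"
]
},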
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Let's look at the least trustworthy trace.\n",
"print(\"Prompt: \", sorted_df.iloc[0][\"prompt\"], \"\\n\")\n",
"print(\"OpenAI Response: \", sorted_df.iloc[0][\"response\"], \"\\n\")\n",
"print(\"TLM Trust Score: \", sorted_df.iloc[0][\"score\"], \"\\n\")\n",
"print(\"TLM Explanation: \", sorted_df.iloc[0][\"explanation\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Awesome! TLM was able to identify multiple traces that contained incorrect answers from OpenAI."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's upload the `score` and `explanation` columns to Phoenix."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Upload evaluations to Phoenix\n",
"\n",
"Our evals_df has a column for the span_id and a column for the evaluation result. The span_id is what allows us to connect the evaluation to the correct trace in Phoenix. Phoenix will also automatically look for columns named \"score\" and \"evaluation\" to display in the UI."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"eval_df[\"score\"] = eval_df[\"score\"].astype(float)\n",
"eval_df[\"explanation\"] = eval_df[\"explanation\"].astype(str)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"await px_client.spans.log_span_annotations_dataframe(\n",
" dataframe=eval_df,\n",
" annotation_name=\"Trustworthiness\",\n",
" annotator_kind=\"LLM\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should now see evaluations in the Phoenix UI!\n",
"\n",
"From here you can continue collecting and evaluating traces!"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}