{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n",
" <br>\n",
" <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n",
" |\n",
" <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
" |\n",
" <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n",
" </p>\n",
"</center>\n",
"<h1 align=\"center\">Q&A Classification Evals</h1>\n",
"\n",
"The purpose of this notebook is:\n",
"\n",
"- to evaluate the performance of an LLM-assisted approach to detecting issues with Q&A systems on retrieved context data\n",
"- to provide an experimental framework for users to iterate and improve on the default classification template.\n",
"\n",
"## Install Dependencies and Import Libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"#####################\n",
"## N_EVAL_SAMPLE_SIZE\n",
"#####################\n",
"# Eval sample size determines the run time\n",
"# 100 samples: GPT-4 ~80 sec / GPT-3.5 ~40 sec\n",
"# 1,000 samples: GPT-4 ~15-17 min / GPT-3.5 ~6-7 min (depending on retries)\n",
"# 10,000 samples: GPT-4 ~170 min / GPT-3.5 ~70 min\n",
"N_EVAL_SAMPLE_SIZE = 100"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"!pip install -qq \"arize-phoenix-evals\" \"openai>=1\" ipython matplotlib pycm scikit-learn tiktoken nest_asyncio 'httpx<0.28'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ℹ️ To enable async request submission in notebook environments like Jupyter or Google Colab, optionally use `nest_asyncio`. `nest_asyncio` globally patches `asyncio` to enable event loops to be re-entrant. This is not required for non-notebook environments.\n",
"\n",
"Without `nest_asyncio`, eval submission can be much slower, depending on your organization's rate limits. Speed increases of about 5x are typical."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import openai\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from pycm import ConfusionMatrix\n",
"from sklearn.metrics import classification_report\n",
"\n",
"from phoenix.evals import (\n",
" QA_PROMPT_RAILS_MAP,\n",
" QA_PROMPT_TEMPLATE,\n",
" OpenAIModel,\n",
" download_benchmark_dataset,\n",
" llm_classify,\n",
")\n",
"\n",
"pd.set_option(\"display.max_colwidth\", None)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download Benchmark Dataset\n",
"\n",
"\n",
"\n",
"- SQuAD 2.0:\n",
"Version 2.0 of the Stanford Question Answering Dataset (SQuAD 2.0) is a large-scale dataset that lets researchers design AI models for reading comprehension tasks under challenging constraints.\n",
"https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/default/15785042.pdf\n",
"- Supplemental data to SQuAD 2.0: To test the detection of incorrect answers, we created wrong answers based on the context data. The wrong answers are intermixed with the right answers.\n",
"- `sampled_answer` is a column that randomly holds either the original SQuAD 2.0 answer or the generated incorrect answer."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"df = download_benchmark_dataset(task=\"qa-classification\", dataset_name=\"qa_generated_dataset\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **question**: The question posed to the Q&A system.\n",
"- **sampled_answer**: Randomly sampled from either `answers` (the correct SQuAD 2.0 answer) or `wrong_answer` (a made-up incorrect answer). This is the column we evaluate, as it contains both right and wrong answers.\n",
"- **answer_true**: True if `sampled_answer` is correct, False if not. This is the ground truth that the evaluation cells compare against.\n",
"- **correct_answer**: A boolean column included in the dataset; the evaluation cells in this notebook do not use it.\n",
"- **answers**: The right answer to the question.\n",
"- **wrong_answer**: An incorrect answer generated from the context.\n",
"- **context**: The context to be used to answer the question; the Q&A eval must use it to check whether the answer is correct.\n",
"\n"
]
},
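{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the relationship between these columns concrete, the sketch below rebuilds a `sampled_answer` / `answer_true` pair from `answers` and `wrong_answer` on a two-row toy frame. It is an illustration only: the coin flip, its 50/50 probability, and the seed are assumptions, and this is not necessarily how the benchmark dataset was generated."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Toy rows that mirror the dataset's `answers` and `wrong_answer` columns.\n",
"toy = pd.DataFrame(\n",
"    {\n",
"        \"answers\": [\"1894\", \"Juan Rodriguez\"],\n",
"        \"wrong_answer\": [\n",
"            \"The Armenian Mosaic was re-discovered in 1920.\",\n",
"            \"The first non-Indian person to live in what is now NYC was Italian explorer Christopher Columbus.\",\n",
"        ],\n",
"    }\n",
")\n",
"\n",
"rng = np.random.default_rng(seed=0)  # hypothetical seed, for repeatability only\n",
"# Flip a coin per row: True keeps the right answer, False swaps in the wrong one.\n",
"toy[\"answer_true\"] = rng.random(len(toy)) < 0.5\n",
"toy[\"sampled_answer\"] = toy[\"answers\"].where(toy[\"answer_true\"], toy[\"wrong_answer\"])\n",
"toy[[\"sampled_answer\", \"answer_true\"]]"
]
},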
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>title</th>\n",
" <th>context</th>\n",
" <th>question</th>\n",
" <th>answers</th>\n",
" <th>correct_answer</th>\n",
" <th>wrong_answer</th>\n",
" <th>sampled_answer</th>\n",
" <th>answer_true</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>57317e8d497a881900248f87</td>\n",
" <td>Mosaic</td>\n",
" <td>Jerusalem with its many holy places probably had the highest concentration of mosaic-covered churches but very few of them survived the subsequent waves of destructions. The present remains do not do justice to the original richness of the city. The most important is the so-called \"Armenian Mosaic\" which was discovered in 1894 on the Street of the Prophets near Damascus Gate. It depicts a vine with many branches and grape clusters, which springs from a vase. Populating the vine's branches are peacocks, ducks, storks, pigeons, an eagle, a partridge, and a parrot in a cage. The inscription reads: \"For the memory and salvation of all those Armenians whose name the Lord knows.\" Beneath a corner of the mosaic is a small, natural cave which contained human bones dating to the 5th or 6th centuries. The symbolism of the mosaic and the presence of the burial cave indicates that the room was used as a mortuary chapel.</td>\n",
" <td>When was the Armenian Mosaic re-discovered?</td>\n",
" <td>1894</td>\n",
" <td>True</td>\n",
" <td>The Armenian Mosaic was re-discovered in 1920.</td>\n",
" <td>1894</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>56cfabed234ae51400d9be49</td>\n",
" <td>New_York_City</td>\n",
" <td>The first non-Native American inhabitant of what would eventually become New York City was Dominican trader Juan Rodriguez (transliterated to Dutch as Jan Rodrigues). Born in Santo Domingo of Portuguese and African descent, he arrived in Manhattan during the winter of 1613–1614, trapping for pelts and trading with the local population as a representative of the Dutch. Broadway, from 159th Street to 218th Street, is named Juan Rodriguez Way in his honor.</td>\n",
" <td>Who was the first non-Indian person to live in what is now NYC?</td>\n",
" <td>Juan Rodriguez</td>\n",
" <td>True</td>\n",
" <td>The first non-Indian person to live in what is now NYC was Italian explorer Christopher Columbus.</td>\n",
" <td>Juan Rodriguez</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>571a2c554faf5e1900b8a8f6</td>\n",
" <td>Memory</td>\n",
" <td>Short-term memory is supported by transient patterns of neuronal communication, dependent on regions of the frontal lobe (especially dorsolateral prefrontal cortex) and the parietal lobe. Long-term memory, on the other hand, is maintained by more stable and permanent changes in neural connections widely spread throughout the brain. The hippocampus is essential (for learning new information) to the consolidation of information from short-term to long-term memory, although it does not seem to store information itself. Without the hippocampus, new memories are unable to be stored into long-term memory, as learned from patient Henry Molaison after removal of both his hippocampi, and there will be a very short attention span. Furthermore, it may be involved in changing neural connections for a period of three months or more after the initial learning.</td>\n",
" <td>Which part of the brain does short-term memory seem to rely on?</td>\n",
" <td>frontal lobe (especially dorsolateral prefrontal cortex) and the parietal lobe</td>\n",
" <td>True</td>\n",
" <td>The cerebellum</td>\n",
" <td>frontal lobe (especially dorsolateral prefrontal cortex) and the parietal lobe</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>57301bf5b2c2fd1400568889</td>\n",
" <td>Roman_Republic</td>\n",
" <td>In 62 BC, Pompey returned victorious from Asia. The Senate, elated by its successes against Catiline, refused to ratify the arrangements that Pompey had made. Pompey, in effect, became powerless. Thus, when Julius Caesar returned from a governorship in Spain in 61 BC, he found it easy to make an arrangement with Pompey. Caesar and Pompey, along with Crassus, established a private agreement, now known as the First Triumvirate. Under the agreement, Pompey's arrangements would be ratified. Caesar would be elected consul in 59 BC, and would then serve as governor of Gaul for five years. Crassus was promised a future consulship.</td>\n",
" <td>What provided the Roman senate with exuberance?</td>\n",
" <td>successes against Catiline</td>\n",
" <td>True</td>\n",
" <td>The Roman Senate was filled with exuberance due to Pompey's defeat in Asia.</td>\n",
" <td>The Roman Senate was filled with exuberance due to Pompey's defeat in Asia.</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>572f8ee0b2c2fd14005681f6</td>\n",
" <td>Armenia</td>\n",
" <td>The Seljuk Empire soon started to collapse. In the early 12th century, Armenian princes of the Zakarid noble family drove out the Seljuk Turks and established a semi-independent Armenian principality in Northern and Eastern Armenia, known as Zakarid Armenia, which lasted under the patronage of the Georgian Kingdom. The noble family of Orbelians shared control with the Zakarids in various parts of the country, especially in Syunik and Vayots Dzor, while the Armenian family of Hasan-Jalalians controlled provinces of Artsakh and Utik as the Kingdom of Artsakh.</td>\n",
" <td>What area did the Hasan-jalalians command?</td>\n",
" <td>Artsakh and Utik</td>\n",
" <td>True</td>\n",
" <td>The Hasan-Jalalians commanded the area of Syunik and Vayots Dzor.</td>\n",
" <td>Artsakh and Utik</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id title \\\n",
"0 57317e8d497a881900248f87 Mosaic \n",
"1 56cfabed234ae51400d9be49 New_York_City \n",
"2 571a2c554faf5e1900b8a8f6 Memory \n",
"3 57301bf5b2c2fd1400568889 Roman_Republic \n",
"4 572f8ee0b2c2fd14005681f6 Armenia \n",
"\n",
" context \\\n",
"0 Jerusalem with its many holy places probably had the highest concentration of mosaic-covered churches but very few of them survived the subsequent waves of destructions. The present remains do not do justice to the original richness of the city. The most important is the so-called \"Armenian Mosaic\" which was discovered in 1894 on the Street of the Prophets near Damascus Gate. It depicts a vine with many branches and grape clusters, which springs from a vase. Populating the vine's branches are peacocks, ducks, storks, pigeons, an eagle, a partridge, and a parrot in a cage. The inscription reads: \"For the memory and salvation of all those Armenians whose name the Lord knows.\" Beneath a corner of the mosaic is a small, natural cave which contained human bones dating to the 5th or 6th centuries. The symbolism of the mosaic and the presence of the burial cave indicates that the room was used as a mortuary chapel. \n",
"1 The first non-Native American inhabitant of what would eventually become New York City was Dominican trader Juan Rodriguez (transliterated to Dutch as Jan Rodrigues). Born in Santo Domingo of Portuguese and African descent, he arrived in Manhattan during the winter of 1613–1614, trapping for pelts and trading with the local population as a representative of the Dutch. Broadway, from 159th Street to 218th Street, is named Juan Rodriguez Way in his honor. \n",
"2 Short-term memory is supported by transient patterns of neuronal communication, dependent on regions of the frontal lobe (especially dorsolateral prefrontal cortex) and the parietal lobe. Long-term memory, on the other hand, is maintained by more stable and permanent changes in neural connections widely spread throughout the brain. The hippocampus is essential (for learning new information) to the consolidation of information from short-term to long-term memory, although it does not seem to store information itself. Without the hippocampus, new memories are unable to be stored into long-term memory, as learned from patient Henry Molaison after removal of both his hippocampi, and there will be a very short attention span. Furthermore, it may be involved in changing neural connections for a period of three months or more after the initial learning. \n",
"3 In 62 BC, Pompey returned victorious from Asia. The Senate, elated by its successes against Catiline, refused to ratify the arrangements that Pompey had made. Pompey, in effect, became powerless. Thus, when Julius Caesar returned from a governorship in Spain in 61 BC, he found it easy to make an arrangement with Pompey. Caesar and Pompey, along with Crassus, established a private agreement, now known as the First Triumvirate. Under the agreement, Pompey's arrangements would be ratified. Caesar would be elected consul in 59 BC, and would then serve as governor of Gaul for five years. Crassus was promised a future consulship. \n",
"4 The Seljuk Empire soon started to collapse. In the early 12th century, Armenian princes of the Zakarid noble family drove out the Seljuk Turks and established a semi-independent Armenian principality in Northern and Eastern Armenia, known as Zakarid Armenia, which lasted under the patronage of the Georgian Kingdom. The noble family of Orbelians shared control with the Zakarids in various parts of the country, especially in Syunik and Vayots Dzor, while the Armenian family of Hasan-Jalalians controlled provinces of Artsakh and Utik as the Kingdom of Artsakh. \n",
"\n",
" question \\\n",
"0 When was the Armenian Mosaic re-discovered? \n",
"1 Who was the first non-Indian person to live in what is now NYC? \n",
"2 Which part of the brain does short-term memory seem to rely on? \n",
"3 What provided the Roman senate with exuberance? \n",
"4 What area did the Hasan-jalalians command? \n",
"\n",
" answers \\\n",
"0 1894 \n",
"1 Juan Rodriguez \n",
"2 frontal lobe (especially dorsolateral prefrontal cortex) and the parietal lobe \n",
"3 successes against Catiline \n",
"4 Artsakh and Utik \n",
"\n",
" correct_answer \\\n",
"0 True \n",
"1 True \n",
"2 True \n",
"3 True \n",
"4 True \n",
"\n",
" wrong_answer \\\n",
"0 The Armenian Mosaic was re-discovered in 1920. \n",
"1 The first non-Indian person to live in what is now NYC was Italian explorer Christopher Columbus. \n",
"2 The cerebellum \n",
"3 The Roman Senate was filled with exuberance due to Pompey's defeat in Asia. \n",
"4 The Hasan-Jalalians commanded the area of Syunik and Vayots Dzor. \n",
"\n",
" sampled_answer \\\n",
"0 1894 \n",
"1 Juan Rodriguez \n",
"2 frontal lobe (especially dorsolateral prefrontal cortex) and the parietal lobe \n",
"3 The Roman Senate was filled with exuberance due to Pompey's defeat in Asia. \n",
"4 Artsakh and Utik \n",
"\n",
" answer_true \n",
"0 True \n",
"1 True \n",
"2 True \n",
"3 False \n",
"4 True "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Display Binary Q&A Classification Template\n",
"\n",
"View the default template used to classify whether an answer is correct. You can tweak this template and evaluate its performance relative to the default."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"You are given a question, an answer and reference text. You must determine whether the\n",
"given answer correctly answers the question based on the reference text. Here is the data:\n",
" [BEGIN DATA]\n",
" ************\n",
" [Question]: {input}\n",
" ************\n",
" [Reference]: {reference}\n",
" ************\n",
" [Answer]: {output}\n",
" [END DATA]\n",
"Your response must be a single word, either \"correct\" or \"incorrect\",\n",
"and should not contain any text or characters aside from that word.\n",
"\"correct\" means that the question is correctly and fully answered by the answer.\n",
"\"incorrect\" means that the question is not correctly or only partially answered by the\n",
"answer.\n",
"\n"
]
}
],
"source": [
"print(QA_PROMPT_TEMPLATE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure the API Key\n",
"\n",
"Configure your OpenAI API key."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n",
" openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n",
"openai.api_key = openai_api_key\n",
"os.environ[\"OPENAI_API_KEY\"] = openai_api_key"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Benchmark Dataset Sample\n",
"Sample size determines run time. We recommend iterating on a small sample (100 rows) first, then increasing to a larger test set."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"df_sample = (\n",
" df.sample(n=N_EVAL_SAMPLE_SIZE)\n",
" .reset_index(drop=True)\n",
" .rename(\n",
" columns={\n",
" \"question\": \"input\",\n",
" \"context\": \"reference\",\n",
" \"sampled_answer\": \"output\",\n",
" }\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## LLM Evals: Q&A Classifications GPT-4\n",
"Run Q&A classifications against a subset of the data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instantiate the LLM and set parameters."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"model = OpenAIModel(\n",
"    model=\"gpt-4\",\n",
"    temperature=0.0,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"Hello! I'm working perfectly. How can I assist you today?\""
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model(\"Hello world, this is a test if you are working?\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the LLM eval using the template against the dataset. `llm_classify` is the main eval function."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "ca05336a691d49cf9b3fe40fa8ab9e1e",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"llm_classify | | 0/100 (0.0%) | ⏳ 00:00<? | ?it/s"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# The rails constrain the output to the values the template allows.\n",
"# Extra text such as \",,,\" or \"...\" is stripped, along with anything\n",
"# else that is not one of the expected binary values.\n",
"rails = list(QA_PROMPT_RAILS_MAP.values())\n",
"Q_and_A_classifications = llm_classify(\n",
" dataframe=df_sample,\n",
" template=QA_PROMPT_TEMPLATE,\n",
" model=model,\n",
" rails=rails,\n",
" concurrency=20,\n",
")[\"label\"].tolist()"
]
},
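{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough mental model of what rails do, the sketch below maps a raw model response onto one of the allowed labels and falls back to a sentinel when no single rail matches. `snap_to_rails`, `example_rails`, and the `NOT_PARSABLE` fallback string are hypothetical names used for illustration; this is not the `phoenix.evals` implementation, whose parsing rules may differ."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"\n",
"def snap_to_rails(raw, rails, fallback=\"NOT_PARSABLE\"):\n",
"    \"\"\"Return the single rail found in `raw`, or `fallback` if none or several match.\"\"\"\n",
"    # Split the response into lowercase alphabetic tokens, dropping punctuation.\n",
"    # Assumption: every rail is a single word, as in this notebook.\n",
"    tokens = set(re.findall(r\"[a-z]+\", raw.lower()))\n",
"    matches = [rail for rail in rails if rail.lower() in tokens]\n",
"    return matches[0] if len(matches) == 1 else fallback\n",
"\n",
"\n",
"example_rails = [\"correct\", \"incorrect\"]  # hypothetical rails for the demo\n",
"for raw in [\"Correct...\", \",,,incorrect\", \"I am not sure\"]:\n",
"    print(repr(raw), \"->\", snap_to_rails(raw, example_rails))"
]
},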
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"Evaluate the predictions against human-labeled ground-truth Q&A labels."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" correct 1.00 0.88 0.94 50\n",
" incorrect 0.89 1.00 0.94 50\n",
"\n",
" accuracy 0.94 100\n",
" macro avg 0.95 0.94 0.94 100\n",
"weighted avg 0.95 0.94 0.94 100\n",
"\n"
]
},
{
"data": {
"text/plain": [
"<Axes: title={'center': 'Confusion Matrix (Normalized)'}, xlabel='Predicted Classes', ylabel='Actual Classes'>"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"true_labels = df_sample[\"answer_true\"].map(QA_PROMPT_RAILS_MAP).tolist()\n",
"\n",
"print(classification_report(true_labels, Q_and_A_classifications, labels=rails))\n",
"confusion_matrix = ConfusionMatrix(\n",
" actual_vector=true_labels, predict_vector=Q_and_A_classifications, classes=rails\n",
")\n",
"confusion_matrix.plot(\n",
" cmap=plt.colormaps[\"Blues\"],\n",
" number_label=True,\n",
" normalized=True,\n",
")"
]
},
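{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how the report's numbers follow from confusion-matrix counts, the sketch below recomputes precision, recall, and F1 for each class by hand. The counts are hypothetical (44 of 50 correct answers labeled `correct`, all 50 incorrect answers labeled `incorrect`), chosen to be consistent with the GPT-4 report recorded in this notebook; `prf` is an illustrative helper, not part of scikit-learn or pycm."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def prf(tp, fp, fn):\n",
"    \"\"\"Precision, recall, and F1 from true-positive, false-positive, and false-negative counts.\"\"\"\n",
"    precision = tp / (tp + fp) if tp + fp else 0.0\n",
"    recall = tp / (tp + fn) if tp + fn else 0.0\n",
"    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0\n",
"    return precision, recall, f1\n",
"\n",
"\n",
"# Hypothetical counts: outer keys are actual classes, inner keys are predicted classes.\n",
"counts = {\n",
"    \"correct\": {\"correct\": 44, \"incorrect\": 6},\n",
"    \"incorrect\": {\"correct\": 0, \"incorrect\": 50},\n",
"}\n",
"for label in counts:\n",
"    tp = counts[label][label]\n",
"    # False negatives: rows of this class predicted as something else.\n",
"    fn = sum(n for pred, n in counts[label].items() if pred != label)\n",
"    # False positives: rows of other classes predicted as this class.\n",
"    fp = sum(row[label] for actual, row in counts.items() if actual != label)\n",
"    p, r, f = prf(tp, fp, fn)\n",
"    print(f\"{label:>9}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}\")"
]
},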
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## LLM Evals: Q&A Classifications GPT-3.5\n",
"\n",
"\n",
"Repeat the classification with GPT-3.5 Turbo and evaluate the predictions against human-labeled ground-truth Q&A labels."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"model = OpenAIModel(model=\"gpt-3.5-turbo\", temperature=0.0, request_timeout=20)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a10ca79971f34ba1a7a87e766ece3576",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"llm_classify | | 0/100 (0.0%) | ⏳ 00:00<? | ?it/s"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Exception in worker on attempt 1: raised APITimeoutError('Request timed out.')\n",
"Requeuing...\n"
]
}
],
"source": [
"Q_and_A_classifications = llm_classify(\n",
" dataframe=df_sample,\n",
" template=QA_PROMPT_TEMPLATE,\n",
" model=model,\n",
" rails=list(QA_PROMPT_RAILS_MAP.values()),\n",
" concurrency=20,\n",
")[\"label\"].tolist()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" correct 0.96 0.94 0.95 50\n",
" incorrect 0.94 0.96 0.95 50\n",
"\n",
" accuracy 0.95 100\n",
" macro avg 0.95 0.95 0.95 100\n",
"weighted avg 0.95 0.95 0.95 100\n",
"\n"
]
},
{
"data": {
"text/plain": [
"<Axes: title={'center': 'Confusion Matrix (Normalized)'}, xlabel='Predicted Classes', ylabel='Actual Classes'>"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"true_labels = df_sample[\"answer_true\"].map(QA_PROMPT_RAILS_MAP).tolist()\n",
"classes = list(QA_PROMPT_RAILS_MAP.values())\n",
"\n",
"print(classification_report(true_labels, Q_and_A_classifications, labels=classes))\n",
"confusion_matrix = ConfusionMatrix(\n",
" actual_vector=true_labels, predict_vector=Q_and_A_classifications, classes=classes\n",
")\n",
"confusion_matrix.plot(\n",
" cmap=plt.colormaps[\"Blues\"],\n",
" number_label=True,\n",
" normalized=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## LLM Evals: Q&A Classifications GPT-4 Turbo\n",
"\n",
"\n",
"Repeat the classification with GPT-4 Turbo and evaluate the predictions against human-labeled ground-truth Q&A labels."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"model = OpenAIModel(model=\"gpt-4-turbo-preview\", temperature=0.0)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "9994cc409e634712b8bb35535b11d152",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"llm_classify | | 0/100 (0.0%) | ⏳ 00:00<? | ?it/s"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"Q_and_A_classifications = llm_classify(\n",
" dataframe=df_sample,\n",
" template=QA_PROMPT_TEMPLATE,\n",
" model=model,\n",
" rails=list(QA_PROMPT_RAILS_MAP.values()),\n",
" concurrency=20,\n",
")[\"label\"].tolist()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" correct 1.00 0.88 0.94 50\n",
" incorrect 0.89 1.00 0.94 50\n",
"\n",
" accuracy 0.94 100\n",
" macro avg 0.95 0.94 0.94 100\n",
"weighted avg 0.95 0.94 0.94 100\n",
"\n"
]
},
{
"data": {
"text/plain": [
"<Axes: title={'center': 'Confusion Matrix (Normalized)'}, xlabel='Predicted Classes', ylabel='Actual Classes'>"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"true_labels = df_sample[\"answer_true\"].map(QA_PROMPT_RAILS_MAP).tolist()\n",
"classes = list(QA_PROMPT_RAILS_MAP.values())\n",
"\n",
"print(classification_report(true_labels, Q_and_A_classifications, labels=classes))\n",
"confusion_matrix = ConfusionMatrix(\n",
" actual_vector=true_labels, predict_vector=Q_and_A_classifications, classes=classes\n",
")\n",
"confusion_matrix.plot(\n",
" cmap=plt.colormaps[\"Blues\"],\n",
" number_label=True,\n",
" normalized=True,\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
}
},
"nbformat": 4,
"nbformat_minor": 4
}