@arizeai/phoenix-mcp

by Arize-ai
evaluate_relevance_classifications.ipynb (186 kB)
{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<center>\n", " <p style=\"text-align:center\">\n", " <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n", " <br>\n", " <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n", " |\n", " <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n", " |\n", " <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n", " </p>\n", "</center>\n", "<h1 align=\"center\">Retrieval Relevance Evals</h1>\n", "\n", "Arize provides tooling to evaluate LLM applications, including tools to determine the relevance or irrelevance of documents retrieved by retrieval-augmented generation (RAG) applications. This relevance is then used to measure the quality of each retrieval using ranking metrics such as precision@k. In order to determine whether each retrieved document is relevant or irrelevant to the corresponding query, our approach is straightforward: ask an LLM.\n", "\n", "The purpose of this notebook is:\n", "\n", "- to evaluate the performance of an LLM-assisted approach to relevance classification against information retrieval datasets with ground-truth relevance labels,\n", "- to provide an experimental framework for users to iterate and improve on the default classification template.\n", "\n", "## Install Dependencies and Import Libraries" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "#####################\n", "## N_EVAL_SAMPLE_SIZE\n", "#####################\n", "# Eval sample size determines the run time\n", "# 100 samples: GPT-4 ~ 80 sec / GPT-3.5 ~ 40 sec\n", "# 1,000 samples: GPT-4 ~15-17 min / GPT-3.5 ~ 6-7min (depending on retries)\n", "# 10,000 samples GPT-4 ~170 min / GPT-3.5 ~ 70min\n", "N_EVAL_SAMPLE_SIZE = 100" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { 
"name": "stderr", "output_type": "stream", "text": [ "9391.10s - pydevd: Sending message related to process being replaced timed-out after 5 seconds\n" ] } ], "source": [ "!pip install -qq \"arize-phoenix-evals>=0.0.5\" \"openai>=1\" ipython matplotlib pycm scikit-learn tiktoken nest_asyncio 'httpx<0.28'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ℹ️ To enable async request submission in notebook environments like Jupyter or Google Colab, optionally use `nest_asyncio`. `nest_asyncio` globally patches `asyncio` to enable event loops to be re-entrant. This is not required for non-notebook environments.\n", "\n", "Without `nest_asyncio`, eval submission can be much slower, depending on your organization's rate limits. Speed increases of about 5x are typical." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "import nest_asyncio\n", "\n", "nest_asyncio.apply()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "import os\n", "from getpass import getpass\n", "\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "from pycm import ConfusionMatrix\n", "from sklearn.metrics import classification_report\n", "\n", "from phoenix.evals import (\n", " RAG_RELEVANCY_PROMPT_RAILS_MAP,\n", " RAG_RELEVANCY_PROMPT_TEMPLATE,\n", " OpenAIModel,\n", " download_benchmark_dataset,\n", " llm_classify,\n", ")\n", "\n", "pd.set_option(\"display.max_colwidth\", None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download Benchmark Dataset\n", "\n", "We'll evaluate the evaluation system, which consists of an LLM, its settings, and an evaluation prompt template, against benchmark datasets of queries and retrieved documents with ground-truth relevance labels. 
Currently supported datasets include:\n", "\n", "- \"wiki_qa-train\"\n", "- \"ms_marco-v1.1-train\"" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>query_id</th>\n", " <th>query_text</th>\n", " <th>document_title</th>\n", " <th>document_text</th>\n", " <th>document_text_with_emphasis</th>\n", " <th>relevant</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Q1</td>\n", " <td>how are glacier caves formed?</td>\n", " <td>Glacier cave</td>\n", " <td>A partly submerged glacier cave on Perito Moreno Glacier . The ice facade is approximately 60 m high Ice formations in the Titlis glacier cave A glacier cave is a cave formed within the ice of a glacier . Glacier caves are often called ice caves , but this term is properly used to describe bedrock caves that contain year-round ice.</td>\n", " <td>A partly submerged glacier cave on Perito Moreno Glacier . The ice facade is approximately 60 m high Ice formations in the Titlis glacier cave A GLACIER CAVE IS A CAVE FORMED WITHIN THE ICE OF A GLACIER . 
Glacier caves are often called ice caves , but this term is properly used to describe bedrock caves that contain year-round ice.</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Q10</td>\n", " <td>how an outdoor wood boiler works</td>\n", " <td>Outdoor wood-fired boiler</td>\n", " <td>The outdoor wood boiler is a variant of the classic wood stove adapted for set-up outdoors while still transferring the heat to interior buildings.</td>\n", " <td>The outdoor wood boiler is a variant of the classic wood stove adapted for set-up outdoors while still transferring the heat to interior buildings.</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>Q100</td>\n", " <td>what happens to the light independent reactions of photosynthesis?</td>\n", " <td>Light-independent reactions</td>\n", " <td>The simplified internal structure of a chloroplast Overview of the Calvin cycle and carbon fixation The light-independent reactions of photosynthesis are chemical reactions that convert carbon dioxide and other compounds into glucose . These reactions occur in the stroma , the fluid-filled area of a chloroplast outside of the thylakoid membranes. These reactions take the light-dependent reactions and perform further chemical processes on them. There are three phases to the light-independent reactions, collectively called the Calvin cycle : carbon fixation, reduction reactions, and ribulose 1,5-bisphosphate (RuBP) regeneration. Despite its name, this process occurs only when light is available. Plants do not carry out the Calvin cycle by night. They, instead, release sucrose into the phloem from their starch reserves. 
This process happens when light is available independent of the kind of photosynthesis ( C3 carbon fixation , C4 carbon fixation , and Crassulacean Acid Metabolism ); CAM plants store malic acid in their vacuoles every night and release it by day in order to make this process work.</td>\n", " <td>The simplified internal structure of a chloroplast Overview of the Calvin cycle and carbon fixation THE LIGHT-INDEPENDENT REACTIONS OF PHOTOSYNTHESIS ARE CHEMICAL REACTIONS THAT CONVERT CARBON DIOXIDE AND OTHER COMPOUNDS INTO GLUCOSE . These reactions occur in the stroma , the fluid-filled area of a chloroplast outside of the thylakoid membranes. THESE REACTIONS TAKE THE LIGHT-DEPENDENT REACTIONS AND PERFORM FURTHER CHEMICAL PROCESSES ON THEM. There are three phases to the light-independent reactions, collectively called the Calvin cycle : carbon fixation, reduction reactions, and ribulose 1,5-bisphosphate (RuBP) regeneration. Despite its name, this process occurs only when light is available. Plants do not carry out the Calvin cycle by night. They, instead, release sucrose into the phloem from their starch reserves. This process happens when light is available independent of the kind of photosynthesis ( C3 carbon fixation , C4 carbon fixation , and Crassulacean Acid Metabolism ); CAM plants store malic acid in their vacuoles every night and release it by day in order to make this process work.</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Q1000</td>\n", " <td>where in the bible that palestine have no land in jerusalem</td>\n", " <td>Philistines</td>\n", " <td>The Philistine cities of Gaza, Ashdod, Ashkelon, Ekron, and Gath The Philistines (, , , or ; , Plištim), Pleshet or Peleset, were a people who as part of the Sea Peoples appeared in the southern coastal area of Canaan at the beginning of the Iron Age (circa 1175 BC), most probably from the Aegean region. 
According to the Bible , they ruled the five city-states (the \"Philistine Pentapolis\") of Gaza , Ashkelon , Ashdod , Ekron and Gath , from Wadi Gaza in the south to the Yarqon River in the north, but with no fixed border to the east. The Bible paints them as the Kingdom of Israel 's most dangerous enemy. Originating somewhere in the Aegean , their population was around 25,000 in the 12th century BC, rising to a peak of 30,000 in the 11th century BC, of which the Aegean element was not more than half the total, and perhaps much less. Nothing is known for certain about the original language or languages of the Philistines, however they were not part of the Semitic Canaanite population. There is some limited evidence in favour of the assumption that the Philistines were Indo-European-speakers either from Greece and/or Luwian speakers from the coast of Asia Minor . Philistine-related words found in the Bible are not Semitic, and can in some cases, with reservations, be traced back to Proto-Indo-European roots. By the beginning of the 1st Millennium BCE they had adopted the general Canaanite language of the region.</td>\n", " <td>The Philistine cities of Gaza, Ashdod, Ashkelon, Ekron, and Gath The Philistines (, , , or ; , Plištim), Pleshet or Peleset, were a people who as part of the Sea Peoples appeared in the southern coastal area of Canaan at the beginning of the Iron Age (circa 1175 BC), most probably from the Aegean region. According to the Bible , they ruled the five city-states (the \"Philistine Pentapolis\") of Gaza , Ashkelon , Ashdod , Ekron and Gath , from Wadi Gaza in the south to the Yarqon River in the north, but with no fixed border to the east. The Bible paints them as the Kingdom of Israel 's most dangerous enemy. Originating somewhere in the Aegean , their population was around 25,000 in the 12th century BC, rising to a peak of 30,000 in the 11th century BC, of which the Aegean element was not more than half the total, and perhaps much less. 
Nothing is known for certain about the original language or languages of the Philistines, however they were not part of the Semitic Canaanite population. There is some limited evidence in favour of the assumption that the Philistines were Indo-European-speakers either from Greece and/or Luwian speakers from the coast of Asia Minor . Philistine-related words found in the Bible are not Semitic, and can in some cases, with reservations, be traced back to Proto-Indo-European roots. By the beginning of the 1st Millennium BCE they had adopted the general Canaanite language of the region.</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>Q1001</td>\n", " <td>what are the test scores on asvab</td>\n", " <td>Armed Services Vocational Aptitude Battery</td>\n", " <td>The Armed Services Vocational Aptitude Battery (ASVAB) is a multiple choice test, administered by the United States Military Entrance Processing Command , used to determine qualification for enlistment in the United States armed forces . It is often offered to American high school students when they are in the 10th, 11th and 12th grade, though anyone eligible for enlistment may take it. Although the test is administered by the military, it is not (and never has been) a requirement that a test-taker with a qualifying score enlist in the armed forces.</td>\n", " <td>The Armed Services Vocational Aptitude Battery (ASVAB) is a multiple choice test, administered by the United States Military Entrance Processing Command , used to determine qualification for enlistment in the United States armed forces . It is often offered to American high school students when they are in the 10th, 11th and 12th grade, though anyone eligible for enlistment may take it. 
Although the test is administered by the military, it is not (and never has been) a requirement that a test-taker with a qualifying score enlist in the armed forces.</td>\n", " <td>False</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " query_id \\\n", "0 Q1 \n", "1 Q10 \n", "2 Q100 \n", "3 Q1000 \n", "4 Q1001 \n", "\n", " query_text \\\n", "0 how are glacier caves formed? \n", "1 how an outdoor wood boiler works \n", "2 what happens to the light independent reactions of photosynthesis? \n", "3 where in the bible that palestine have no land in jerusalem \n", "4 what are the test scores on asvab \n", "\n", " document_title \\\n", "0 Glacier cave \n", "1 Outdoor wood-fired boiler \n", "2 Light-independent reactions \n", "3 Philistines \n", "4 Armed Services Vocational Aptitude Battery \n", "\n", " document_text \\\n", "0 A partly submerged glacier cave on Perito Moreno Glacier . The ice facade is approximately 60 m high Ice formations in the Titlis glacier cave A glacier cave is a cave formed within the ice of a glacier . Glacier caves are often called ice caves , but this term is properly used to describe bedrock caves that contain year-round ice. \n", "1 The outdoor wood boiler is a variant of the classic wood stove adapted for set-up outdoors while still transferring the heat to interior buildings. \n", "2 The simplified internal structure of a chloroplast Overview of the Calvin cycle and carbon fixation The light-independent reactions of photosynthesis are chemical reactions that convert carbon dioxide and other compounds into glucose . These reactions occur in the stroma , the fluid-filled area of a chloroplast outside of the thylakoid membranes. These reactions take the light-dependent reactions and perform further chemical processes on them. There are three phases to the light-independent reactions, collectively called the Calvin cycle : carbon fixation, reduction reactions, and ribulose 1,5-bisphosphate (RuBP) regeneration. 
Despite its name, this process occurs only when light is available. Plants do not carry out the Calvin cycle by night. They, instead, release sucrose into the phloem from their starch reserves. This process happens when light is available independent of the kind of photosynthesis ( C3 carbon fixation , C4 carbon fixation , and Crassulacean Acid Metabolism ); CAM plants store malic acid in their vacuoles every night and release it by day in order to make this process work. \n", "3 The Philistine cities of Gaza, Ashdod, Ashkelon, Ekron, and Gath The Philistines (, , , or ; , Plištim), Pleshet or Peleset, were a people who as part of the Sea Peoples appeared in the southern coastal area of Canaan at the beginning of the Iron Age (circa 1175 BC), most probably from the Aegean region. According to the Bible , they ruled the five city-states (the \"Philistine Pentapolis\") of Gaza , Ashkelon , Ashdod , Ekron and Gath , from Wadi Gaza in the south to the Yarqon River in the north, but with no fixed border to the east. The Bible paints them as the Kingdom of Israel 's most dangerous enemy. Originating somewhere in the Aegean , their population was around 25,000 in the 12th century BC, rising to a peak of 30,000 in the 11th century BC, of which the Aegean element was not more than half the total, and perhaps much less. Nothing is known for certain about the original language or languages of the Philistines, however they were not part of the Semitic Canaanite population. There is some limited evidence in favour of the assumption that the Philistines were Indo-European-speakers either from Greece and/or Luwian speakers from the coast of Asia Minor . Philistine-related words found in the Bible are not Semitic, and can in some cases, with reservations, be traced back to Proto-Indo-European roots. By the beginning of the 1st Millennium BCE they had adopted the general Canaanite language of the region. 
\n", "4 The Armed Services Vocational Aptitude Battery (ASVAB) is a multiple choice test, administered by the United States Military Entrance Processing Command , used to determine qualification for enlistment in the United States armed forces . It is often offered to American high school students when they are in the 10th, 11th and 12th grade, though anyone eligible for enlistment may take it. Although the test is administered by the military, it is not (and never has been) a requirement that a test-taker with a qualifying score enlist in the armed forces. \n", "\n", " document_text_with_emphasis \\\n", "0 A partly submerged glacier cave on Perito Moreno Glacier . The ice facade is approximately 60 m high Ice formations in the Titlis glacier cave A GLACIER CAVE IS A CAVE FORMED WITHIN THE ICE OF A GLACIER . Glacier caves are often called ice caves , but this term is properly used to describe bedrock caves that contain year-round ice. \n", "1 The outdoor wood boiler is a variant of the classic wood stove adapted for set-up outdoors while still transferring the heat to interior buildings. \n", "2 The simplified internal structure of a chloroplast Overview of the Calvin cycle and carbon fixation THE LIGHT-INDEPENDENT REACTIONS OF PHOTOSYNTHESIS ARE CHEMICAL REACTIONS THAT CONVERT CARBON DIOXIDE AND OTHER COMPOUNDS INTO GLUCOSE . These reactions occur in the stroma , the fluid-filled area of a chloroplast outside of the thylakoid membranes. THESE REACTIONS TAKE THE LIGHT-DEPENDENT REACTIONS AND PERFORM FURTHER CHEMICAL PROCESSES ON THEM. There are three phases to the light-independent reactions, collectively called the Calvin cycle : carbon fixation, reduction reactions, and ribulose 1,5-bisphosphate (RuBP) regeneration. Despite its name, this process occurs only when light is available. Plants do not carry out the Calvin cycle by night. They, instead, release sucrose into the phloem from their starch reserves. 
This process happens when light is available independent of the kind of photosynthesis ( C3 carbon fixation , C4 carbon fixation , and Crassulacean Acid Metabolism ); CAM plants store malic acid in their vacuoles every night and release it by day in order to make this process work. \n", "3 The Philistine cities of Gaza, Ashdod, Ashkelon, Ekron, and Gath The Philistines (, , , or ; , Plištim), Pleshet or Peleset, were a people who as part of the Sea Peoples appeared in the southern coastal area of Canaan at the beginning of the Iron Age (circa 1175 BC), most probably from the Aegean region. According to the Bible , they ruled the five city-states (the \"Philistine Pentapolis\") of Gaza , Ashkelon , Ashdod , Ekron and Gath , from Wadi Gaza in the south to the Yarqon River in the north, but with no fixed border to the east. The Bible paints them as the Kingdom of Israel 's most dangerous enemy. Originating somewhere in the Aegean , their population was around 25,000 in the 12th century BC, rising to a peak of 30,000 in the 11th century BC, of which the Aegean element was not more than half the total, and perhaps much less. Nothing is known for certain about the original language or languages of the Philistines, however they were not part of the Semitic Canaanite population. There is some limited evidence in favour of the assumption that the Philistines were Indo-European-speakers either from Greece and/or Luwian speakers from the coast of Asia Minor . Philistine-related words found in the Bible are not Semitic, and can in some cases, with reservations, be traced back to Proto-Indo-European roots. By the beginning of the 1st Millennium BCE they had adopted the general Canaanite language of the region. \n", "4 The Armed Services Vocational Aptitude Battery (ASVAB) is a multiple choice test, administered by the United States Military Entrance Processing Command , used to determine qualification for enlistment in the United States armed forces . 
It is often offered to American high school students when they are in the 10th, 11th and 12th grade, though anyone eligible for enlistment may take it. Although the test is administered by the military, it is not (and never has been) a requirement that a test-taker with a qualifying score enlist in the armed forces. \n", "\n", " relevant \n", "0 True \n", "1 False \n", "2 True \n", "3 False \n", "4 False " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = download_benchmark_dataset(\n", " task=\"binary-relevance-classification\", dataset_name=\"wiki_qa-train\"\n", ")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Display Binary Relevance Classification Template\n", "\n", "View the default template used to classify relevance. You can tweak this template and evaluate its performance relative to the default." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "You are comparing a reference text to a question and trying to determine if the reference text\n", "contains information relevant to answering the question. Here is the data:\n", " [BEGIN DATA]\n", " ************\n", " [Question]: {input}\n", " ************\n", " [Reference text]: {reference}\n", " ************\n", " [END DATA]\n", "Compare the Question above to the Reference text. You must determine whether the Reference text\n", "contains information that can answer the Question. 
Please focus on whether the very specific\n", "question can be answered by the information in the Reference text.\n", "Your response must be single word, either \"relevant\" or \"unrelated\",\n", "and should not contain any text or characters aside from that word.\n", "\"unrelated\" means that the reference text does not contain an answer to the Question.\n", "\"relevant\" means the reference text contains an answer to the Question.\n" ] } ], "source": [ "print(RAG_RELEVANCY_PROMPT_TEMPLATE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The template variables are:\n", "\n", "- **input:** the question asked by a user\n", "- **reference:** the text of the retrieved document\n", "- **output:** a ground-truth relevance label" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure the LLM\n", "\n", "Configure your OpenAI API key." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n", " openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n", "os.environ[\"OPENAI_API_KEY\"] = openai_api_key" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Benchmark Dataset Sample\n", "The sample size determines the run time.\n", "We recommend iterating on a small sample first (100 samples),\n", "then increasing to a larger test set." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "df_sample = df.sample(n=N_EVAL_SAMPLE_SIZE).reset_index(drop=True)\n", "df_sample = df_sample.rename(\n", " columns={\n", " \"query_text\": \"input\",\n", " \"document_text\": \"reference\",\n", " },\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## LLM Evals: Retrieval Relevance Classifications GPT-4\n", "Run relevance classification against a subset of the data.\n", "Instantiate the LLM and set parameters." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The `model_name` field is deprecated. Use `model` instead. This will be removed in a future release.\n" ] } ], "source": [ "model = OpenAIModel(\n", " model_name=\"gpt-4\",\n", " temperature=0.0,\n", ")" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Hello! I'm working perfectly. How can I assist you today?\"" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model(\"Hello world, this is a test if you are working?\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run Relevance Classifications\n", "\n", "Run relevance classifications against a subset of the data." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "058106771549475aaa4fa7a2a3c5d1af", "version_major": 2, "version_minor": 0 }, "text/plain": [ "llm_classify | | 0/100 (0.0%) | ⏳ 00:00<? | ?it/s" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# The rails constrain the output to the specific values defined by the template.\n", "# They strip stray text such as \",,,\" or \"...\"\n", "# and ensure that the binary value expected from the template is returned.\n", "rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())\n", "relevance_classifications = llm_classify(\n", " dataframe=df_sample,\n", " template=RAG_RELEVANCY_PROMPT_TEMPLATE,\n", " model=model,\n", " rails=rails,\n", " concurrency=20,\n", ")[\"label\"].tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluate Classifications\n", "\n", "Evaluate the predictions against human-labeled ground-truth relevance labels." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " relevant 0.77 0.87 0.82 47\n", " unrelated 0.87 0.77 0.82 53\n", "\n", " accuracy 0.82 100\n", " macro avg 0.82 0.82 0.82 100\n", "weighted avg 0.83 0.82 0.82 100\n", "\n" ] }, { "data": { "text/plain": [ "<Axes: title={'center': 'Confusion Matrix (Normalized)'}, xlabel='Predicted Classes', ylabel='Actual Classes'>" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "<Figure size 640x480 with 2 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "true_labels = df_sample[\"relevant\"].map(RAG_RELEVANCY_PROMPT_RAILS_MAP).tolist()\n", "\n", "print(classification_report(true_labels, relevance_classifications, labels=rails))\n", "confusion_matrix = ConfusionMatrix(\n", " actual_vector=true_labels, predict_vector=relevance_classifications, classes=rails\n", ")\n", "confusion_matrix.plot(\n", " cmap=plt.colormaps[\"Blues\"],\n", " number_label=True,\n", " normalized=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classifications with explanations\n", "\n", "When evaluating a dataset for relevance, it can be useful to know why the LLM classified a document as relevant or irrelevant. The following code block runs `llm_classify` with explanations turned on so that we can inspect why the LLM made the classification it did. There is a speed tradeoff because more tokens are generated, but the explanations can be highly informative when troubleshooting." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using prompt:\n", "\n", "\n", "You are comparing a reference text to a question and trying to determine if the reference text\n", "contains information relevant to answering the question. 
Here is the data:\n", " [BEGIN DATA]\n", " ************\n", " [Question]: {input}\n", " ************\n", " [Reference text]: {reference}\n", " ************\n", " [END DATA]\n", "Compare the Question above to the Reference text. You must determine whether the Reference text\n", "contains information that can help answer the Question. First, write out in a step by step manner\n", "an EXPLANATION to show how to arrive at the correct answer. Avoid simply stating the correct answer\n", "at the outset. Your response LABEL must be single word, either \"relevant\" or \"unrelated\", and\n", "should not contain any text or characters aside from that word. \"unrelated\" means that the\n", "reference text does not help answer to the Question. \"relevant\" means the reference text directly\n", "answers the question.\n", "\n", "Example response:\n", "************\n", "EXPLANATION: An explanation of your reasoning for why the label is \"relevant\" or \"unrelated\"\n", "LABEL: \"relevant\" or \"unrelated\"\n", "************\n", "\n", "EXPLANATION:\n", "OpenAI invocation parameters: {'model': 'gpt-4', 'temperature': 0.0, 'max_tokens': 256, 'frequency_penalty': 0, 'presence_penalty': 0, 'top_p': 1, 'n': 1, 'timeout': None}\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "35aa9e114681486cb5fdc7a08c53ef19", "version_major": 2, "version_minor": 0 }, "text/plain": [ "llm_classify | | 0/5 (0.0%) | ⏳ 00:00<? 
| ?it/s" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "- Snapped 'relevant' to rail: relevant\n", "- Snapped 'relevant' to rail: relevant\n", "- Snapped 'unrelated' to rail: unrelated\n", "- Snapped 'relevant' to rail: relevant\n", "- Snapped 'unrelated' to rail: unrelated\n" ] } ], "source": [ "small_df_sample = df_sample.copy().sample(n=5).reset_index(drop=True)\n", "relevance_classifications_df = llm_classify(\n", " dataframe=small_df_sample,\n", " template=RAG_RELEVANCY_PROMPT_TEMPLATE,\n", " model=model,\n", " rails=rails,\n", " provide_explanation=True,\n", " verbose=True,\n", " concurrency=20,\n", ")" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>input</th>\n", " <th>reference</th>\n", " <th>label</th>\n", " <th>explanation</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>What did Lawrence Joshua Chamberlain do?</td>\n", " <td>Joshua Lawrence Chamberlain (September 8, 1828 – February 24, 1914), born as Lawrence Joshua Chamberlain, was an American college professor from the State of Maine , who volunteered during the American Civil War to join the Union Army . Although having no earlier education in military strategies, he became a highly respected and decorated Union officer , reaching the rank of brigadier general (and brevet major general ). For his gallantry at Gettysburg , he was awarded the Medal of Honor . He was given the honor of commanding the Union troops at the surrender ceremony for the infantry of Robert E. 
Lee 's Army at Appomattox , Virginia. After the war, he entered politics as a Republican and served four one-year terms of office as the 32nd Governor of Maine . He served on the faculty, and as president, of his alma mater , Bowdoin College .</td>\n", " <td>relevant</td>\n", " <td>The question asks about what Lawrence Joshua Chamberlain did. The reference text provides a detailed account of Lawrence Joshua Chamberlain's life, including his career as a college professor, his military service during the American Civil War, his political career as the Governor of Maine, and his role at Bowdoin College. Therefore, the reference text is relevant to the question.</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>what is the sign for degrees</td>\n", " <td>The degree symbol (°) is a typographical symbol that is used, among other things, to represent degrees of arc (e.g. in geographic coordinate systems ), hours (in the medical field), or degrees of temperature . The symbol consists of a small raised circle, historically a zero glyph . In Unicode it is encoded at .</td>\n", " <td>relevant</td>\n", " <td>The question asks for the sign used for degrees. The reference text provides a detailed explanation of the degree symbol, including its appearance and uses. Therefore, the reference text is directly relevant to the question.</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>how did Joan Crawford die?</td>\n", " <td>Joan Crawford (March 23, ca. 1904 – May 10, 1977), born Lucille Fay LeSueur, was an American actress in film, television and theatre. Starting as a dancer in traveling theatrical companies before debuting on Broadway , Crawford was signed to a motion picture contract by Metro-Goldwyn-Mayer in 1925. Initially frustrated by the size and quality of her parts, Crawford began a campaign of self-publicity and became nationally known as a flapper by the end of the 1920s. 
In the 1930s, Crawford's fame rivaled, and later outlasted, MGM colleagues Norma Shearer and Greta Garbo . Crawford often played hardworking young women who find romance and financial success. These \"rags-to-riches\" stories were well received by Depression -era audiences and were popular with women. Crawford became one of Hollywood's most prominent movie stars and one of the highest paid women in the United States, but her films began losing money and by the end of the 1930s she was labeled \"box office poison\". After an absence of nearly two years from the screen, Crawford staged a comeback by starring in Mildred Pierce (1945), for which she won the Academy Award for Best Actress . In 1955, she became involved with the Pepsi-Cola Company through her marriage to company Chairman Alfred Steele . After his death in 1959, Crawford was elected to fill his vacancy on the board of directors but was forcibly retired in 1973. She continued acting in film and television regularly through the 1960s, when her performances became fewer; after the release of the British horror film Trog in 1970, Crawford retired from the screen. Following a public appearance in 1974, after which unflattering photographs were published, Crawford withdrew from public life and became increasingly reclusive until her death in 1977. Crawford married four times. Her first three marriages ended in divorce; the last ended with the death of husband Alfred Steele. She adopted five children, one of whom was reclaimed by his birth mother. Crawford's relationships with her two older children, Christina and Christopher, were acrimonious. Crawford disinherited the two and, after Crawford's death, Christina wrote a \"tell-all\" memoir, Mommie Dearest , in which she alleged a lifelong pattern of physical and emotional abuse perpetrated by Crawford. 
She was voted the tenth greatest female star in the history of American cinema by the American Film Institute .</td>\n", " <td>unrelated</td>\n", " <td>The question asks about the cause of Joan Crawford's death. The reference text mentions that Joan Crawford died in 1977, but it does not provide any information about how she died. Therefore, while the text is somewhat related to the question in that it mentions her death, it does not provide the specific information needed to answer the question.</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Who Makes Nexen Tires</td>\n", " <td>Nexen Tire is a tire manufacturer, headquartered in Yangsan , South Gyeongsang Province , and Seoul , both in South Korea . Its major domestic competitors are Hankook Tire and Kumho Tires . The company's name is reflected in the company slogan , \"Next Century Tire.\"</td>\n", " <td>relevant</td>\n", " <td>The question asks about the manufacturer of Nexen Tires. The reference text clearly states that Nexen Tire is a tire manufacturer based in South Korea. Therefore, the reference text directly answers the question.</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>where is mark sanchez from</td>\n", " <td>Mark Travis John Sanchez (born November 11, 1986) is an American football quarterback for the New York Jets of the National Football League (NFL). He was drafted in the first round of the 2009 NFL Draft as the fifth overall selection by the Jets and the second quarterback taken overall. He played college football at the University of Southern California (USC). Sanchez grew up in a well-disciplined and athletic family. In the eighth grade, he began to play football and learn the intricacies of the quarterback position, training with his father, Nick. A well-regarded prospect, Sanchez committed to Southern California following his successful high school career in which he led his team to a championship title during his final season. 
At USC, Sanchez was relegated as the backup quarterback during his first three years though he rose to prominence due to his brief appearances on the field in 2007 due to injuries suffered by starting quarterback John David Booty . Sanchez also became popular within the community due to his Mexican-American heritage. Named the starter in 2008, Sanchez led USC to a 12–1 record and won the Rose Bowl against Penn State for which Sanchez was awarded the Most Valuable Player award for his performance on offense . Although many considered him too inexperienced, Sanchez announced his intention to enter the 2009 NFL Draft . He was selected by the Jets after they traded up with the Cleveland Browns , and was named the starting quarterback prior to the start of the season . Despite a subpar performance, Sanchez led the Jets to the AFC Championship Game , a losing effort to the Indianapolis Colts , becoming the fourth rookie quarterback in NFL history to win his first playoff game and the second to win two playoff games. In his second season, Sanchez continued to develop and led the Jets to the playoffs and the team's second consecutive AFC Championship Game where they narrowly lost to the Pittsburgh Steelers , 24–19. With the win over the New England Patriots the week prior, Sanchez tied four other quarterbacks for the second most post-season road victories by a quarterback in NFL history. In leading the Jets to two consecutive conference championships, Sanchez joined quarterback Ben Roethlisberger as the only two quarterbacks in NFL history to reach the conference championship in their first two seasons in the league. The next two seasons would be a regression for both the team and Sanchez as they failed to reach the playoffs. Fans and media critics called for a struggling Sanchez to be benched. He eventually was replaced towards the end of the 2012 season with Greg McElroy .</td>\n", " <td>unrelated</td>\n", " <td>The question asks about the origin of Mark Sanchez. 
The reference text provides information about Mark Sanchez's life, including his upbringing and career. However, it does not specify where he is from, which is the information asked for in the question. Therefore, the reference text does not contain information that can help answer the question.</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " input \\\n", "0 What did Lawrence Joshua Chamberlain do? \n", "1 what is the sign for degrees \n", "2 how did Joan Crawford die? \n", "3 Who Makes Nexen Tires \n", "4 where is mark sanchez from \n", "\n", " reference \\\n", "0 Joshua Lawrence Chamberlain (September 8, 1828 – February 24, 1914), born as Lawrence Joshua Chamberlain, was an American college professor from the State of Maine , who volunteered during the American Civil War to join the Union Army . Although having no earlier education in military strategies, he became a highly respected and decorated Union officer , reaching the rank of brigadier general (and brevet major general ). For his gallantry at Gettysburg , he was awarded the Medal of Honor . He was given the honor of commanding the Union troops at the surrender ceremony for the infantry of Robert E. Lee 's Army at Appomattox , Virginia. After the war, he entered politics as a Republican and served four one-year terms of office as the 32nd Governor of Maine . He served on the faculty, and as president, of his alma mater , Bowdoin College . \n", "1 The degree symbol (°) is a typographical symbol that is used, among other things, to represent degrees of arc (e.g. in geographic coordinate systems ), hours (in the medical field), or degrees of temperature . The symbol consists of a small raised circle, historically a zero glyph . In Unicode it is encoded at . \n", "2 Joan Crawford (March 23, ca. 1904 – May 10, 1977), born Lucille Fay LeSueur, was an American actress in film, television and theatre. 
Starting as a dancer in traveling theatrical companies before debuting on Broadway , Crawford was signed to a motion picture contract by Metro-Goldwyn-Mayer in 1925. Initially frustrated by the size and quality of her parts, Crawford began a campaign of self-publicity and became nationally known as a flapper by the end of the 1920s. In the 1930s, Crawford's fame rivaled, and later outlasted, MGM colleagues Norma Shearer and Greta Garbo . Crawford often played hardworking young women who find romance and financial success. These \"rags-to-riches\" stories were well received by Depression -era audiences and were popular with women. Crawford became one of Hollywood's most prominent movie stars and one of the highest paid women in the United States, but her films began losing money and by the end of the 1930s she was labeled \"box office poison\". After an absence of nearly two years from the screen, Crawford staged a comeback by starring in Mildred Pierce (1945), for which she won the Academy Award for Best Actress . In 1955, she became involved with the Pepsi-Cola Company through her marriage to company Chairman Alfred Steele . After his death in 1959, Crawford was elected to fill his vacancy on the board of directors but was forcibly retired in 1973. She continued acting in film and television regularly through the 1960s, when her performances became fewer; after the release of the British horror film Trog in 1970, Crawford retired from the screen. Following a public appearance in 1974, after which unflattering photographs were published, Crawford withdrew from public life and became increasingly reclusive until her death in 1977. Crawford married four times. Her first three marriages ended in divorce; the last ended with the death of husband Alfred Steele. She adopted five children, one of whom was reclaimed by his birth mother. Crawford's relationships with her two older children, Christina and Christopher, were acrimonious. 
Crawford disinherited the two and, after Crawford's death, Christina wrote a \"tell-all\" memoir, Mommie Dearest , in which she alleged a lifelong pattern of physical and emotional abuse perpetrated by Crawford. She was voted the tenth greatest female star in the history of American cinema by the American Film Institute . \n", "3 Nexen Tire is a tire manufacturer, headquartered in Yangsan , South Gyeongsang Province , and Seoul , both in South Korea . Its major domestic competitors are Hankook Tire and Kumho Tires . The company's name is reflected in the company slogan , \"Next Century Tire.\" \n", "4 Mark Travis John Sanchez (born November 11, 1986) is an American football quarterback for the New York Jets of the National Football League (NFL). He was drafted in the first round of the 2009 NFL Draft as the fifth overall selection by the Jets and the second quarterback taken overall. He played college football at the University of Southern California (USC). Sanchez grew up in a well-disciplined and athletic family. In the eighth grade, he began to play football and learn the intricacies of the quarterback position, training with his father, Nick. A well-regarded prospect, Sanchez committed to Southern California following his successful high school career in which he led his team to a championship title during his final season. At USC, Sanchez was relegated as the backup quarterback during his first three years though he rose to prominence due to his brief appearances on the field in 2007 due to injuries suffered by starting quarterback John David Booty . Sanchez also became popular within the community due to his Mexican-American heritage. Named the starter in 2008, Sanchez led USC to a 12–1 record and won the Rose Bowl against Penn State for which Sanchez was awarded the Most Valuable Player award for his performance on offense . Although many considered him too inexperienced, Sanchez announced his intention to enter the 2009 NFL Draft . 
He was selected by the Jets after they traded up with the Cleveland Browns , and was named the starting quarterback prior to the start of the season . Despite a subpar performance, Sanchez led the Jets to the AFC Championship Game , a losing effort to the Indianapolis Colts , becoming the fourth rookie quarterback in NFL history to win his first playoff game and the second to win two playoff games. In his second season, Sanchez continued to develop and led the Jets to the playoffs and the team's second consecutive AFC Championship Game where they narrowly lost to the Pittsburgh Steelers , 24–19. With the win over the New England Patriots the week prior, Sanchez tied four other quarterbacks for the second most post-season road victories by a quarterback in NFL history. In leading the Jets to two consecutive conference championships, Sanchez joined quarterback Ben Roethlisberger as the only two quarterbacks in NFL history to reach the conference championship in their first two seasons in the league. The next two seasons would be a regression for both the team and Sanchez as they failed to reach the playoffs. Fans and media critics called for a struggling Sanchez to be benched. He eventually was replaced towards the end of the 2012 season with Greg McElroy . \n", "\n", " label \\\n", "0 relevant \n", "1 relevant \n", "2 unrelated \n", "3 relevant \n", "4 unrelated \n", "\n", " explanation \n", "0 The question asks about what Lawrence Joshua Chamberlain did. The reference text provides a detailed account of Lawrence Joshua Chamberlain's life, including his career as a college professor, his military service during the American Civil War, his political career as the Governor of Maine, and his role at Bowdoin College. Therefore, the reference text is relevant to the question. \n", "1 The question asks for the sign used for degrees. The reference text provides a detailed explanation of the degree symbol, including its appearance and uses. 
Therefore, the reference text is directly relevant to the question. \n", "2 The question asks about the cause of Joan Crawford's death. The reference text mentions that Joan Crawford died in 1977, but it does not provide any information about how she died. Therefore, while the text is somewhat related to the question in that it mentions her death, it does not provide the specific information needed to answer the question. \n", "3 The question asks about the manufacturer of Nexen Tires. The reference text clearly states that Nexen Tire is a tire manufacturer based in South Korea. Therefore, the reference text directly answers the question. \n", "4 The question asks about the origin of Mark Sanchez. The reference text provides information about Mark Sanchez's life, including his upbringing and career. However, it does not specify where he is from, which is the information asked for in the question. Therefore, the reference text does not contain information that can help answer the question. " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's view the data\n", "merged_df = pd.merge(\n", " small_df_sample, relevance_classifications_df, left_index=True, right_index=True\n", ")\n", "merged_df[[\"input\", \"reference\", \"label\", \"explanation\"]].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## LLM Evals: Relevance Classifications with GPT-3.5 Turbo\n", "Run relevance classification against a subset of the data using GPT-3.5 Turbo. GPT-3.5 can significantly speed up the classification process. However, there are tradeoffs in classification quality, as we will see below." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The `model_name` field is deprecated. Use `model` instead. 
This will be removed in a future release.\n" ] } ], "source": [ "model = OpenAIModel(model=\"gpt-3.5-turbo\", temperature=0.0, request_timeout=20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2d083b07d0e449ccb1f9df27184c3c56", "version_major": 2, "version_minor": 0 }, "text/plain": [ "llm_classify | | 0/100 (0.0%) | ⏳ 00:00<? | ?it/s" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Exception in worker on attempt 1: raised APITimeoutError('Request timed out.')\n", "Requeuing...\n" ] } ], "source": [ "rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())\n", "relevance_classifications = llm_classify(\n", " dataframe=df_sample,\n", " template=RAG_RELEVANCY_PROMPT_TEMPLATE,\n", " model=model,\n", " rails=rails,\n", " concurrency=20,\n", ")[\"label\"].tolist()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " relevant 0.50 1.00 0.67 47\n", " unrelated 1.00 0.11 0.20 53\n", "\n", " accuracy 0.53 100\n", " macro avg 0.75 0.56 0.44 100\n", "weighted avg 0.77 0.53 0.42 100\n", "\n" ] }, { "data": { "text/plain": [ "<Axes: title={'center': 'Confusion Matrix (Normalized)'}, xlabel='Predicted Classes', ylabel='Actual Classes'>" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "<Figure size 640x480 with 2 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "true_labels = df_sample[\"relevant\"].map(RAG_RELEVANCY_PROMPT_RAILS_MAP).tolist()\n", "\n", "print(classification_report(true_labels, relevance_classifications, labels=rails))\n", "confusion_matrix = ConfusionMatrix(\n", " actual_vector=true_labels, 
predict_vector=relevance_classifications, classes=rails\n", ")\n", "confusion_matrix.plot(\n", " cmap=plt.colormaps[\"Blues\"],\n", " number_label=True,\n", " normalized=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preview: Running with GPT-4 Turbo" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The `model_name` field is deprecated. Use `model` instead. This will be removed in a future release.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "4a017b45b2794c689eed81f9c6dc2ee2", "version_major": 2, "version_minor": 0 }, "text/plain": [ "llm_classify | | 0/100 (0.0%) | ⏳ 00:00<? | ?it/s" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "model = OpenAIModel(model=\"gpt-4-turbo-preview\")\n", "relevance_classifications = llm_classify(\n", " dataframe=df_sample,\n", " template=RAG_RELEVANCY_PROMPT_TEMPLATE,\n", " model=model,\n", " rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),\n", " concurrency=20,\n", ")[\"label\"].tolist()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " relevant 0.75 0.94 0.83 47\n", " unrelated 0.93 0.72 0.81 53\n", "\n", " accuracy 0.82 100\n", " macro avg 0.84 0.83 0.82 100\n", "weighted avg 0.84 0.82 0.82 100\n", "\n" ] }, { "data": { "text/plain": [ "<Axes: title={'center': 'Confusion Matrix (Normalized)'}, xlabel='Predicted Classes', ylabel='Actual Classes'>" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "<Figure size 640x480 with 2 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "true_labels = df_sample[\"relevant\"].map(RAG_RELEVANCY_PROMPT_RAILS_MAP).tolist()\n", "\n", "print(classification_report(true_labels, 
relevance_classifications, labels=rails))\n", "confusion_matrix = ConfusionMatrix(\n", " actual_vector=true_labels, predict_vector=relevance_classifications, classes=rails\n", ")\n", "confusion_matrix.plot(\n", " cmap=plt.colormaps[\"Blues\"],\n", " number_label=True,\n", " normalized=True,\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.3" } }, "nbformat": 4, "nbformat_minor": 4 }
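
The classification reports printed in the notebook above can be checked by hand from confusion-matrix counts. The following is a minimal sketch in plain Python, not part of the notebook itself. The counts are an assumption: they are inferred from the rounded GPT-3.5 Turbo report (recall 1.00 on 47 relevant rows, recall 0.11 on 53 unrelated rows), not read from the notebook's confusion matrix, and the helper name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(tp, fp, fn):
    """Return (precision, recall, f1) for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Assumed counts, inferred from the rounded GPT-3.5 Turbo report:
# every one of the 47 relevant rows was predicted "relevant", and only
# 6 of the 53 unrelated rows were predicted "unrelated".
relevant_as_relevant = 47
relevant_as_unrelated = 0
unrelated_as_relevant = 47
unrelated_as_unrelated = 6

# For the "relevant" class, false positives are unrelated rows labeled relevant.
relevant_metrics = per_class_metrics(
    tp=relevant_as_relevant, fp=unrelated_as_relevant, fn=relevant_as_unrelated
)
# For the "unrelated" class, the roles of the off-diagonal cells swap.
unrelated_metrics = per_class_metrics(
    tp=unrelated_as_unrelated, fp=relevant_as_unrelated, fn=unrelated_as_relevant
)

total = (
    relevant_as_relevant
    + relevant_as_unrelated
    + unrelated_as_relevant
    + unrelated_as_unrelated
)
accuracy = (relevant_as_relevant + unrelated_as_unrelated) / total

print("relevant :", [round(m, 2) for m in relevant_metrics])
print("unrelated:", [round(m, 2) for m in unrelated_metrics])
print("accuracy :", round(accuracy, 2))
```

Under these assumed counts the rounded values match the GPT-3.5 Turbo report, which suggests the model labeled almost every document as relevant: perfect recall on the relevant class came at the cost of misclassifying most unrelated documents.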
