{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "lTormFlpkh_z"
},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n",
" <br>\n",
" <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n",
" |\n",
" <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
" |\n",
" <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n",
" </p>\n",
"</center>\n",
"<h1 align=\"center\">LLM Ops - Tracing, Evaluation, and Analysis</h1>\n",
"\n",
"In this tutorial we will learn how to build, observe, evaluate, and analyze a LLM powered application.\n",
"\n",
"It has the following sections:\n",
"\n",
"1. Understanding LLM-powered applications\n",
"2. Observing a RAG application using spans and traces\n",
"3. Evaluating the RAG application using LLM Evals\n",
"4. Learn how to construct an experimentation and evaluation workflow\n",
"\n",
"⚠️ This tutorial requires an OpenAI key and a [Phoenix Cloud](https://app.phoenix.arize.com/) account to run\n",
"\n",
"\n",
"## Understanding LLM powered applications\n",
"\n",
"Building software with LLMs, or any machine learning model, is [fundamentally different](https://karpathy.medium.com/software-2-0-a64152b37c35). Rather than compiling source code into binary to run a series of commands, we need to navigate datasets, embeddings, prompts, and parameter weights to generate consistent accurate results. LLM outputs are probabilistic and therefore don't produce the same deterministic outcome every time."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "brp39fbNkh_1"
},
"source": [
"## Observing the application using traces\n",
"\n",
"LLM Traces and Observability lets us understand the system from the outside, by letting us ask questions about that system without knowing its inner workings. Furthermore, it allows us to easily troubleshoot and handle novel problems (i.e. “unknown unknowns”), and helps us answer the question, “Why is this happening?”\n",
"\n",
"LLM Traces are designed to be a category of telemetry data that is used to understand the execution of LLMs and the surrounding application context such as retrieval from vector stores and the usage of external tools such as search engines or APIs. It lets you understand the inner workings of the individual steps your application takes wile also giving you visibility into how your system is running and performing as a whole.\n",
"\n",
"Traces are made up of a sequence of `spans`. A span represents a unit of work or operation (think a span of time). It tracks specific operations that a request makes, painting a picture of what happened during the time in which that operation was executed.\n",
"\n",
"LLM Tracing supports the following span kinds:\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/Arize-ai/phoenix-assets/main/images/blog/span_kinds.png\" width=\"1100\"/>\n",
"\n",
"\n",
"A trace is a collection of related spans that together represent a complete workflow, such as a single request to an application or one run of an agent.\n",
"\n",
"By capturing the building blocks of your application while it's running, Phoenix can provide a more complete picture of the inner workings of your application. To illustrate this, let's look at an example LLM application and inspect it's traces."
]
},
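{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the span/trace relationship concrete, here is a minimal, illustrative sketch (separate from the application we build below) of how nested spans form a single trace using the OpenTelemetry API. The span names and attributes are hypothetical, and the packages it relies on are installed in the next section:\n",
"\n",
"```python\n",
"from opentelemetry import trace\n",
"\n",
"tracer = trace.get_tracer(\"illustration\")\n",
"\n",
"# The outermost span becomes the root span of the trace.\n",
"with tracer.start_as_current_span(\"handle_request\") as root:\n",
"    root.set_attribute(\"input.value\", \"What is Arize?\")\n",
"\n",
"    # Child spans capture the individual steps taken to serve the request.\n",
"    with tracer.start_as_current_span(\"retrieve_documents\"):\n",
"        pass  # e.g. query a vector store for relevant chunks\n",
"\n",
"    with tracer.start_as_current_span(\"llm_call\"):\n",
"        pass  # e.g. call the LLM with the retrieved context\n",
"```\n",
"\n",
"Each `with` block opens a span, and the nesting records the parent/child relationships, so the three spans above are collected into one trace for the request."
]
},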
{
"cell_type": "markdown",
"metadata": {
"id": "uHG-eqAukh_1"
},
"source": [
"### Traces and Spans in action\n",
"\n",
"Let's build a relatively simple LLM-powered application that will answer questions about Arize AI. This example uses LlamaIndex for RAG and OpenAI as the LLM but you could use any LLM you would like. The details of the application are not important, but the architecture is.\n",
"\n",
"Let's get started. First, you will need to install dependencies and set up keys. You can get your [Phoenix Cloud](https://app.phoenix.arize.com/) keys in the Settings page of your account."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -Uqq arize-phoenix llama-index-llms-openai llama-index-embeddings-openai openai gcsfs nest_asyncio openinference-instrumentation-llama_index openinference-instrumentation-openai"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"if \"OPENAI_API_KEY\" not in os.environ:\n",
" os.environ[\"OPENAI_API_KEY\"] = getpass(\"🔑 Enter your OpenAI API key: \")\n",
"\n",
"if \"PHOENIX_API_KEY\" not in os.environ:\n",
" os.environ[\"PHOENIX_API_KEY\"] = getpass(\"🔑 Enter your Phoenix API key: \")\n",
"\n",
"if \"PHOENIX_COLLECTOR_ENDPOINT\" not in os.environ:\n",
" os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = getpass(\"🔑 Enter your Phoenix Collector Endpoint: \")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "T5xpDzOnAkmg"
},
"source": [
"Next, we will configure tracing using the `register` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.otel import register\n",
"\n",
"tracer_provider = register(auto_instrument=True, project_name=\"llm-ops\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5BUg7yRQArG6"
},
"source": [
"The following block of code initializes a query engine that enables RAG by querying the indexed documents to provide context-aware answers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from gcsfs import GCSFileSystem\n",
"from llama_index.core import (\n",
" Settings,\n",
" StorageContext,\n",
" load_index_from_storage,\n",
")\n",
"from llama_index.embeddings.openai import OpenAIEmbedding\n",
"from llama_index.llms.openai import OpenAI\n",
"\n",
"file_system = GCSFileSystem(project=\"public-assets-275721\")\n",
"index_path = \"arize-phoenix-assets/datasets/unstructured/llm/llama-index/arize-docs/index/\"\n",
"storage_context = StorageContext.from_defaults(\n",
" fs=file_system,\n",
" persist_dir=index_path,\n",
")\n",
"\n",
"Settings.llm = OpenAI(model=\"gpt-4o-mini\")\n",
"Settings.embed_model = OpenAIEmbedding(model=\"text-embedding-ada-002\")\n",
"index = load_index_from_storage(\n",
" storage_context,\n",
")\n",
"query_engine = index.as_query_engine()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XjSWOT5Ikh_2"
},
"source": [
"Now that we have an application setup, let'squery it and take a look inside."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from openinference.semconv.trace import SpanAttributes\n",
"from opentelemetry import trace\n",
"from tqdm import tqdm\n",
"\n",
"tracer = trace.get_tracer(__name__)\n",
"\n",
"queries = [\n",
" \"How can I query for a monitor's status using GraphQL?\",\n",
" \"How do I delete a model?\",\n",
" \"How much does an enterprise license of Arize cost?\",\n",
" \"How do I log a prediction using the python SDK?\",\n",
"]\n",
"\n",
"for query in tqdm(queries):\n",
" with tracer.start_as_current_span(\"Query\") as span:\n",
" span.set_attribute(SpanAttributes.OPENINFERENCE_SPAN_KIND, \"chain\")\n",
" span.set_attribute(SpanAttributes.INPUT_VALUE, query)\n",
" response = query_engine.query(query)\n",
" span.set_attribute(SpanAttributes.OUTPUT_VALUE, response)\n",
" print(f\"Query: {query}\")\n",
" print(f\"Response: {response}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tWvY9gbMkh_2"
},
"source": [
"Now that we've run the application a few times, let's take a look at the traces in the UI"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "iQm9sdjYkh_2"
},
"source": [
"The UI will give you an interactive troubleshooting experience. You can sort, filter, and search for traces. You can also view the detail of each trace to better understand what happened during the response generation process."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WX4pp4ezkh_2"
},
"source": [
"In addition to being able to view the traces in the UI, you can also query for traces to use as pandas dataframes in your notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.client import AsyncClient\n",
"\n",
"px_client = AsyncClient()\n",
"primary_df = await px_client.spans.get_spans_dataframe(project_identifier=\"llm-ops\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"spans_df = primary_df[\n",
" [\n",
" \"name\",\n",
" \"span_kind\",\n",
" \"context.trace_id\",\n",
" \"attributes.llm.input_messages\",\n",
" \"attributes.llm.output_messages\",\n",
" \"attributes.retrieval.documents\",\n",
" ]\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"spans_df"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qXIarX-mkh_3"
},
"source": [
"With just a few lines of code, we have managed to gain visibility into the inner workings of our application. We can now better understand how things like retrieval, prompts, and parameter weights could be affecting our application. But what can we do with this information? Let's take a look at how to use LLM evals to evaluate our application."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ihZT429Pkh_3"
},
"source": [
"## Evaluating the application using LLM Evals (Trace-level Evaluations)\n",
"\n",
"Evaluation should serve as the primary metric for assessing your application. It determines whether the app will produce accurate responses based on the data sources and range of queries.\n",
"\n",
"While it's beneficial to examine individual queries and responses, this approach is impractical as the volume of edge-cases and failures increases. Instead, it's more effective to establish a suite of metrics and automated evaluations. These tools can provide insights into overall system performance and can identify specific areas that may require scrutiny.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YbMTuUQEkh_3"
},
"source": [
"We can now use Phoenix's LLM Evals to evaluate these queries. LLM Evals uses an LLM to grade your application based on different criteria. For this example we will use the evals library to see if any `hallucinations` are present and if the `Q&A Correctness` is good (whether or not the application answers the question correctly).\n",
"\n",
"We will evaluate at the trace level, meaning the evaluation will consider the entire request in full context. This differs from span-level evaluations, which focus on assessing individual steps within the workflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"trace_df = (\n",
" spans_df.groupby(\"context.trace_id\")\n",
" .agg(\n",
" {\n",
" \"attributes.llm.input_messages\": lambda x: \" \".join(x.dropna().astype(str)),\n",
" \"attributes.llm.output_messages\": lambda x: \" \".join(x.dropna().astype(str)),\n",
" \"attributes.retrieval.documents\": lambda x: \" \".join(x.dropna().astype(str)),\n",
" }\n",
" )\n",
" .rename(\n",
" columns={\n",
" \"attributes.llm.input_messages\": \"input\",\n",
" \"attributes.llm.output_messages\": \"output\",\n",
" \"attributes.retrieval.documents\": \"reference\",\n",
" }\n",
" )\n",
" .reset_index()\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"trace_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"HALLUCINATION_PROMPT_TEMPLATE = \"\"\"\n",
"In this task, you will be presented with a query, a reference text and an answer. The answer is\n",
"generated to the question based on the reference text. The answer may contain false information. You\n",
"must use the reference text to determine if the answer to the question contains false information,\n",
"if the answer is a hallucination of facts. Your objective is to determine whether the answer text\n",
"contains factual information and is not a hallucination. A 'hallucination' refers to\n",
"an answer that is not based on the reference text or assumes information that is not available in\n",
"the reference text.\n",
"\n",
" [BEGIN DATA]\n",
" ************\n",
" [Query]: {{input}}\n",
" ************\n",
" [Reference text]: {{reference}}\n",
" ************\n",
" [Answer]: {{output}}\n",
" ************\n",
" [END DATA]\n",
"\n",
" Is the answer above factual or hallucinated based on the query and reference text?\n",
"\n",
"Please read the query, reference text and answer carefully, then write out in a step by step manner\n",
"an EXPLANATION to show how to determine if the answer is \"factual\" or \"hallucinated\". Avoid simply\n",
"stating the correct answer at the outset. Your response LABEL should be a single word: either\n",
"\"factual\" or \"hallucinated\", and it should not include any other text or characters. \"hallucinated\"\n",
"indicates that the answer provides factually inaccurate information to the query based on the\n",
"reference text. \"factual\" indicates that the answer to the question is correct relative to the\n",
"reference text, and does not contain made up information.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"QA_PROMPT_TEMPLATE = \"\"\"\n",
"You are given a question, an answer and reference text. You must determine whether the\n",
"given answer correctly answers the question based on the reference text. Here is the data:\n",
" [BEGIN DATA]\n",
" ************\n",
" [Question]: {{input}}\n",
" ************\n",
" [Reference]: {{reference}}\n",
" ************\n",
" [Answer]: {{output}}\n",
" [END DATA]\n",
"Please read the query, reference text and answer carefully, then write out in a step by step manner\n",
"an EXPLANATION to show how to determine if the answer is \"correct\" or \"incorrect\". Avoid simply\n",
"stating the correct answer at the outset. Your response LABEL must be a single word, either\n",
"\"correct\" or \"incorrect\", and should not contain any text or characters aside from that word.\n",
"\"correct\" means that the question is correctly and fully answered by the answer.\n",
"\"incorrect\" means that the question is not correctly or only partially answered by the\n",
"answer.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from openinference.instrumentation import suppress_tracing\n",
"\n",
"from phoenix.evals import create_classifier\n",
"from phoenix.evals.evaluators import async_evaluate_dataframe\n",
"from phoenix.evals.llm import LLM\n",
"\n",
"llm = LLM(provider=\"openai\", model=\"gpt-5\")\n",
"\n",
"\n",
"hallucination_evaluator = create_classifier(\n",
" name=\"hallucination\",\n",
" llm=llm,\n",
" prompt_template=HALLUCINATION_PROMPT_TEMPLATE,\n",
" choices={\"factual\": 1.0, \"hallucinated\": 0.0},\n",
")\n",
"\n",
"qa_evaluator = create_classifier(\n",
" name=\"q&a\",\n",
" llm=llm,\n",
" prompt_template=QA_PROMPT_TEMPLATE,\n",
" choices={\"correct\": 1.0, \"incorrect\": 0.0},\n",
")\n",
"\n",
"with suppress_tracing():\n",
" results_df = await async_evaluate_dataframe(\n",
" dataframe=trace_df,\n",
" evaluators=[hallucination_evaluator, qa_evaluator],\n",
" )\n",
"results_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RYNmmXOkCgO4"
},
"source": [
"After running the evaluaton, we log our traces back to Phoenix:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.evals.utils import to_annotation_dataframe\n",
"\n",
"root_spans = primary_df[primary_df[\"parent_id\"].isna()][[\"context.trace_id\", \"context.span_id\"]]\n",
"\n",
"# Merge results with root spans to align on trace_id\n",
"results_with_spans = pd.merge(\n",
" results_df.reset_index(), root_spans, on=\"context.trace_id\", how=\"left\"\n",
").set_index(\"context.span_id\", drop=False)\n",
"\n",
"# Format for Phoenix logging\n",
"annotation_df = to_annotation_dataframe(results_with_spans)\n",
"\n",
"hallucination_eval_results = annotation_df[\n",
" annotation_df[\"annotation_name\"] == \"hallucination\"\n",
"].copy()\n",
"qa_eval_results = annotation_df[annotation_df[\"annotation_name\"] == \"q&a\"].copy()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"await px_client.annotations.log_span_annotations_dataframe(\n",
" dataframe=hallucination_eval_results,\n",
" annotator_kind=\"LLM\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"await px_client.annotations.log_span_annotations_dataframe(\n",
" dataframe=qa_eval_results,\n",
" annotator_kind=\"LLM\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zdZcFBhXkh_3"
},
"source": [
"Now that we have the evaluation logged, let's take a look at them in the UI. You will notice that the traces that correspond to hallucinations are clearly marked in the UI and you can now query for them!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OeKTvt4Ekh_3"
},
"source": [
"We can use filtering to identify if there are certain queries that are resulting in hallucinations or incorrect answers. In the next section, we will use LLM Evals to identify if these issues are caused by the retrieval process for RAG. We are going to use an LLM to grade whether or not the chunks retrieved are relevant to the query."
]
},
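{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check outside the UI, you can also filter the annotation dataframes we built above directly in pandas. This is a minimal sketch that assumes the dataframes returned by `to_annotation_dataframe` include a `label` column for the classifier evaluators; adjust the column names if your version of Phoenix differs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: inspect flagged traces locally in pandas.\n",
"# Assumes the annotation dataframes built above expose a \"label\" column.\n",
"flagged_hallucinations = hallucination_eval_results[\n",
"    hallucination_eval_results[\"label\"] == \"hallucinated\"\n",
"]\n",
"incorrect_answers = qa_eval_results[qa_eval_results[\"label\"] == \"incorrect\"]\n",
"\n",
"print(f\"Hallucinated traces: {len(flagged_hallucinations)} of {len(hallucination_eval_results)}\")\n",
"print(f\"Incorrect answers: {len(incorrect_answers)} of {len(qa_eval_results)}\")"
]
},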
{
"cell_type": "markdown",
"metadata": {
"id": "g_Fod69gD_jT"
},
"source": [
"# Further Evaluation on Retrieval Process (Span-level Evaluations)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-N5UgQOqELGB"
},
"source": [
"We've now identified if there are certain queries that are resulting in hallucinations or incorrect answers. Let's see if we can use LLM Evals to identify if these issues are caused by the retrieval process for RAG. We are going to use an span-level evluations to grade whether or not the chunks retrieved are relevant to the query.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"filtered_df = primary_df[\n",
" (primary_df[\"span_kind\"] == \"RETRIEVER\")\n",
" & (primary_df[\"attributes.retrieval.documents\"].notnull())\n",
"]\n",
"\n",
"filtered_df = filtered_df.rename(\n",
" columns={\"attributes.input.value\": \"input\", \"attributes.retrieval.documents\": \"documents\"}\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"RAG_RELEVANCY_PROMPT_TEMPLATE = \"\"\"\n",
"You are comparing a reference text to a question and trying to determine if the reference text\n",
"contains information relevant to answering the question. Here is the data:\n",
" [BEGIN DATA]\n",
" ************\n",
" [Question]: {{input}}\n",
" ************\n",
" [Reference text]: {{documents}}\n",
" ************\n",
" [END DATA]\n",
"Compare the Question above to the Reference text. You must determine whether the Reference text\n",
"contains information that can help answer the Question. First, write out in a step by step manner\n",
"an EXPLANATION to show how to arrive at the correct answer. Avoid simply stating the correct answer\n",
"at the outset. Your response LABEL must be single word, either \"relevant\" or \"unrelated\", and\n",
"should not contain any text or characters aside from that word. \"unrelated\" means that the\n",
"reference text does not help answer to the Question. \"relevant\" means the reference text directly\n",
"answers the question.\n",
"\n",
"Example response:\n",
"LABEL: \"relevant\" or \"unrelated\"\n",
"************\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"llm = LLM(provider=\"openai\", model=\"gpt-5\")\n",
"\n",
"\n",
"relevancy_evaluator = create_classifier(\n",
" name=\"RAG Relevancy\",\n",
" llm=llm,\n",
" prompt_template=RAG_RELEVANCY_PROMPT_TEMPLATE,\n",
" choices={\"relevant\": 1.0, \"unrelated\": 0.0},\n",
")\n",
"\n",
"with suppress_tracing():\n",
" results_df = await async_evaluate_dataframe(\n",
" dataframe=filtered_df,\n",
" evaluators=[relevancy_evaluator],\n",
" )\n",
"results_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HuFknaEkSeSb"
},
"source": [
"Finally, we log our span level evaluations back to Phoenix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"relevancy_eval_df = to_annotation_dataframe(results_df)\n",
"\n",
"await px_client.annotations.log_span_annotations_dataframe(\n",
" dataframe=relevancy_eval_df,\n",
" annotator_kind=\"LLM\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "N_I1aNfeSkKo"
},
"source": [
"To inspect span-level evaluations, go to the Annotations tab for the span.\n",
"\n",
"This view reveals how well the retrieved documents performed in terms of quality. In this case, while the application did not hallucinate (as confirmed by the trace-level evaluation), the span-level assessment showed that the retrieved documents weren’t of high quality."
]
},
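{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also compute a quick summary of the span-level scores in the notebook. The sketch below assumes the annotation dataframe exposes a numeric `score` column (1.0 for \"relevant\", 0.0 for \"unrelated\", as configured in the classifier above)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: aggregate span-level relevance locally.\n",
"# Assumes relevancy_eval_df has a numeric \"score\" column (1.0 = relevant, 0.0 = unrelated).\n",
"relevant_fraction = relevancy_eval_df[\"score\"].mean()\n",
"print(f\"Fraction of retriever spans judged relevant: {relevant_fraction:.2f}\")"
]
},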
{
"cell_type": "markdown",
"metadata": {
"id": "yiQSjIQ0obOl"
},
"source": [
"# Creating an experimentation workflow"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BDU80-ofoeDb"
},
"source": [
"At this point, we’ve covered how to trace and evaluate your application data.\n",
"\n",
"The next step is to add these traces to a dataset. Once your traces are organized in a dataset, you can run experiments to measure how changes in your application affect the evaluation metrics. Below, we’ll walk through an example of this process."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "H18oekgWqq-S"
},
"source": [
"Running an experiment requires three main components: a dataset, a task to execute on that dataset, and evaluators to measure the quality of the task’s outputs."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qX15R3PNq2Fy"
},
"source": [
"### Define Dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"queries = [\n",
" \"How can I query for a monitor's status using GraphQL?\",\n",
" \"How do I delete a model?\",\n",
" \"How much does an enterprise license of Arize cost?\",\n",
" \"How do I log a prediction using the python SDK?\",\n",
" \"What is a trace versus a span?\",\n",
" \"Does Arize AX or Arize Phoenix support TypeScript\",\n",
" \"What is an experiment?\",\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataset_df = pd.DataFrame(data={\"query\": queries})\n",
"dataset = await px_client.datasets.create_dataset(\n",
" dataframe=dataset_df,\n",
" name=\"arize-questions\",\n",
" input_keys=[\"query\"],\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "N_pGKbNdrk3w"
},
"source": [
"If you navigate to the Datasets page of your Arize Phoenix instance, you will see the dataset we just created."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BxNufeW7rsDu"
},
"source": [
"### Define Task & Evaluator"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def query_system(input):\n",
" response = query_engine.query(input[\"query\"])\n",
" return response"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"HALLUCINATION_PROMPT_TEMPLATE = \"\"\"\n",
"In this task, you will be presented with a query, a reference text and an answer. The answer is\n",
"generated to the question based on the reference text. The answer may contain false information. You\n",
"must use the reference text to determine if the answer to the question contains false information,\n",
"if the answer is a hallucination of facts. Your objective is to determine whether the answer text\n",
"contains factual information and is not a hallucination. A 'hallucination' refers to\n",
"an answer that is not based on the reference text or assumes information that is not available in\n",
"the reference text.\n",
"\n",
" [BEGIN DATA]\n",
" ************\n",
" [Query]: {{input}}\n",
" ************\n",
" [Reference text and Answer]: {{output}}\n",
" ************\n",
" [END DATA]\n",
"\n",
" Is the answer above factual or hallucinated based on the query and reference text?\n",
"\n",
"Please read the query, reference text and answer carefully, then write out in a step by step manner\n",
"an EXPLANATION to show how to determine if the answer is \"factual\" or \"hallucinated\". Avoid simply\n",
"stating the correct answer at the outset. Your response LABEL should be a single word: either\n",
"\"factual\" or \"hallucinated\", and it should not include any other text or characters. \"hallucinated\"\n",
"indicates that the answer provides factually inaccurate information to the query based on the\n",
"reference text. \"factual\" indicates that the answer to the question is correct relative to the\n",
"reference text, and does not contain made up information.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"QA_PROMPT_TEMPLATE = \"\"\"\n",
"You are given a question, an answer and reference text. You must determine whether the\n",
"given answer correctly answers the question based on the reference text. Here is the data:\n",
" [BEGIN DATA]\n",
" ************\n",
" [Question]: {{input}}\n",
" ************\n",
" [Reference text and Answer]: {{output}}\n",
" ************\n",
" [END DATA]\n",
"Please read the query, reference text and answer carefully, then write out in a step by step manner\n",
"an EXPLANATION to show how to determine if the answer is \"correct\" or \"incorrect\". Avoid simply\n",
"stating the correct answer at the outset. Your response LABEL must be a single word, either\n",
"\"correct\" or \"incorrect\", and should not contain any text or characters aside from that word.\n",
"\"correct\" means that the question is correctly and fully answered by the answer.\n",
"\"incorrect\" means that the question is not correctly or only partially answered by the\n",
"answer.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hallucination_evaluator = create_classifier(\n",
" name=\"hallucination\",\n",
" llm=llm,\n",
" prompt_template=HALLUCINATION_PROMPT_TEMPLATE,\n",
" choices={\"factual\": 1.0, \"hallucinated\": 0.0},\n",
")\n",
"\n",
"qa_evaluator = create_classifier(\n",
" name=\"q&a\",\n",
" llm=llm,\n",
" prompt_template=QA_PROMPT_TEMPLATE,\n",
" choices={\"correct\": 1.0, \"incorrect\": 0.0},\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jPbaJwflsYHs"
},
"source": [
"### Run Experiment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"experiment = await px_client.experiments.run_experiment(\n",
" dataset=dataset, task=query_system, evaluators=[hallucination_evaluator, qa_evaluator]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8TvYVkDJziva"
},
"source": [
"You will see your experiment and results populate in Phoenix. From here, you can make changes to the RAG application, run the same datasets of examples, and see how the evaluation metrics change!"
]
},
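{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, you might change a retrieval parameter and re-run the same experiment to compare results side by side. The sketch below shows one possible iteration, assuming you want to retrieve more chunks per query (`similarity_top_k` is a standard LlamaIndex query-engine parameter); the experiment call mirrors the one above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# One possible iteration (a sketch): retrieve more chunks per query and re-run the experiment.\n",
"modified_query_engine = index.as_query_engine(similarity_top_k=5)\n",
"\n",
"\n",
"def query_system_top_k(input):\n",
"    response = modified_query_engine.query(input[\"query\"])\n",
"    return response\n",
"\n",
"\n",
"experiment_top_k = await px_client.experiments.run_experiment(\n",
"    dataset=dataset,\n",
"    task=query_system_top_k,\n",
"    evaluators=[hallucination_evaluator, qa_evaluator],\n",
")"
]
},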
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}