# Tracing, Evaluation, and Analysis of an LLM Application using Phoenix

[Docs](https://arize.com/docs/phoenix/) | [GitHub](https://github.com/Arize-ai/phoenix) | [Community](https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email)

In this tutorial we will learn how to build, observe, evaluate, and analyze an LLM-powered application. The application is an LLM-powered chat-on-docs service that answers questions about [Arize](https://docs.arize.com/arize/) from its product documentation.

Key concepts:

1. LLM traces are a category of telemetry data used to understand the execution of LLMs and the surrounding application context (such as retrieval from vector stores and usage of external tools).
2. Traces are made up of a sequence of `spans`. A span represents a unit of work or operation (think a span of time).
3. LLM evaluations provide visibility into the performance of the application.

Run the next two code blocks to install the dependencies and set up your OpenAI API key.

## Launch Phoenix to visualize the app

```python
%pip install -qq arize-phoenix openai gcsfs nest_asyncio llama-index-llms-openai 'httpx<0.28'
```

```python
import os
from getpass import getpass

# Uses your OpenAI API key
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")

os.environ["OPENAI_API_KEY"] = openai_api_key
```

```python
from gcsfs import GCSFileSystem
from llama_index.core import (
    Settings,
    StorageContext,
    load_index_from_storage,
)
from llama_index.core.graph_stores import SimpleGraphStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from openinference.instrumentation import suppress_tracing
from tqdm import tqdm
```

## Learn how to build the application and add tracing

```python
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

from phoenix.otel import register

tracer_provider = register(project_name="llm-rag-app")
LlamaIndexInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)
```
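`register()` exports traces to a running Phoenix collector (by default a local instance at `http://localhost:6006`). If you are not pointing at an existing Phoenix deployment, one option is to launch Phoenix inside the notebook before the tracing cell above. The following is a minimal sketch, assuming a throwaway local instance is acceptable; `px.launch_app()` comes from the `arize-phoenix` package installed earlier.

```python
# A minimal sketch: launch a local Phoenix instance in this notebook process.
# Assumes no other Phoenix server is already running on the default port.
import phoenix as px

session = px.launch_app()  # starts the Phoenix UI and trace collector locally
print(session.url)  # open this URL to watch traces arrive as the app runs
```

If you are instead using a hosted Phoenix instance, skip this and point `register()` at that collector endpoint.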
```python
# Pulls down the Arize product documentation knowledge base
file_system = GCSFileSystem(project="public-assets-275721")
index_path = "arize-phoenix-assets/datasets/unstructured/llm/llama-index/arize-docs/index/"
storage_context = StorageContext.from_defaults(
    fs=file_system,
    persist_dir=index_path,
    graph_store=SimpleGraphStore(),  # prevents an unauthorized request to GCS
)
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
index = load_index_from_storage(
    storage_context,
)
query_engine = index.as_query_engine()

# Ask the application questions about the Arize product
queries = [
    "How can I query for a monitor's status using GraphQL?",
    "How do I delete a model?",
    "How much does an enterprise license of Arize cost?",
    "How do I log a prediction using the python SDK?",
]

for query in tqdm(queries):
    response = query_engine.query(query)
    print(f"Query: {query}")
    print(f"Response: {response}")
```

## Learn how to evaluate the application using Phoenix LLM Evals

We now have visibility into the inner workings of our application. Next, let's take a look at how to use LLM evals to evaluate it.

We will walk through a few common evaluation metrics:

1. Hallucination eval: checks whether the application's response was a hallucination.
2. Q&A eval: checks whether the application answered the question correctly.
3. Document relevance eval: grades whether the documents/chunks retrieved were actually relevant to answering the query.

```python
# Convert traces into workable datasets
from phoenix.client import AsyncClient

px_client = AsyncClient()

spans_df = px_client.get_spans_dataframe(project_name="llm-rag-app")
spans_df[["name", "span_kind", "attributes.input.value", "attributes.retrieval.documents"]].head()

from phoenix.session.evaluation import get_qa_with_reference, get_retrieved_documents

retrieved_documents_df = get_retrieved_documents(px_client, project_name="llm-rag-app")
queries_df = get_qa_with_reference(px_client, project_name="llm-rag-app")
```

```python
hallucination_prompt = """
In this task, you will be presented with a query, a reference text and an answer. The answer is
generated to the question based on the reference text. The answer may contain false information. You
must use the reference text to determine if the answer to the question contains false information,
if the answer is a hallucination of facts. Your objective is to determine whether the answer text
contains factual information and is not a hallucination. A 'hallucination' refers to
an answer that is not based on the reference text or assumes information that is not available in
the reference text. Your response should be a single word: either "factual" or "hallucinated", and
it should not include any other text or characters. "hallucinated" indicates that the answer
provides factually inaccurate information to the query based on the reference text. "factual"
indicates that the answer to the question is correct relative to the reference text, and does not
contain made up information. Please read the query and reference text carefully before determining
your response.

    [BEGIN DATA]
    ************
    [Query]: {input}
    ************
    [Reference text]: {reference}
    ************
    [Answer]: {output}
    ************
    [END DATA]

    Is the answer above factual or hallucinated based on the query and reference text?
"""

qa_correctness_prompt = """
You are given a question, an answer and reference text. You must determine whether the
given answer correctly answers the question based on the reference text. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Reference]: {reference}
    ************
    [Answer]: {output}
    [END DATA]
Please read the query, reference text and answer carefully, then write out in a step by step manner
an EXPLANATION to show how to determine if the answer is "correct" or "incorrect". Avoid simply
stating the correct answer at the outset. Your response LABEL must be a single word, either
"correct" or "incorrect", and should not contain any text or characters aside from that word.
"correct" means that the question is correctly and fully answered by the answer.
"incorrect" means that the question is not correctly or only partially answered by the
answer.

Example response:
************
EXPLANATION: An explanation of your reasoning for why the label is "correct" or "incorrect"
LABEL: "correct" or "incorrect"
************

EXPLANATION:"""
```
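The prompt templates above reference `{input}`, `{reference}`, and `{output}` placeholders, which the evaluation step below fills from the dataframe columns. A quick sanity check, under the assumption that `get_qa_with_reference` and `get_retrieved_documents` name their columns after those placeholders, can catch a schema mismatch before spending LLM calls:

```python
# Sanity check (assumption: the helper outputs use column names that match the
# template variables referenced in the prompts above).
expected_qa_columns = {"input", "reference", "output"}
missing = expected_qa_columns - set(queries_df.columns)
if missing:
    print(f"queries_df is missing expected columns: {missing}")

print(f"{len(queries_df)} Q&A rows and {len(retrieved_documents_df)} retrieved-document rows to evaluate")
```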
```python
# Generating the Hallucination & Q&A Eval
from phoenix.evals import (
    async_evaluate_dataframe,
    create_classifier,
)
from phoenix.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4")

hallucination_eval = create_classifier(
    name="hallucination",
    prompt_template=hallucination_prompt,
    llm=llm,
    choices={"factual": 1.0, "hallucinated": 0.0},
)
qa_correctness_eval = create_classifier(
    name="qa_correctness",
    prompt_template=qa_correctness_prompt,
    llm=llm,
    choices={"correct": 1.0, "incorrect": 0.0},
)
with suppress_tracing():
    results_df = await async_evaluate_dataframe(
        dataframe=queries_df,
        evaluators=[hallucination_eval, qa_correctness_eval],
    )
results_df.head()
```

```python
from phoenix.evals.utils import to_annotation_dataframe

relevancy_eval_df = to_annotation_dataframe(dataframe=results_df)

await px_client.annotations.log_span_annotations_dataframe(
    dataframe=relevancy_eval_df,
    annotator_kind="LLM",
)
```
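Besides inspecting the logged annotations in the Phoenix UI, the scores can be summarized directly in the notebook. The exact columns added by `async_evaluate_dataframe` depend on the `phoenix.evals` version, so this sketch discovers them by comparing against the input dataframe rather than assuming names:

```python
# Rough summary of the eval results (assumption: evaluator outputs appear as
# columns that were not present in the input queries_df).
added_columns = [col for col in results_df.columns if col not in queries_df.columns]
print("Evaluator output columns:", added_columns)

# Mean score per numeric output column; with 1.0/0.0 choices this is the
# fraction of "factual" / "correct" answers.
numeric_outputs = results_df[added_columns].select_dtypes("number")
if not numeric_outputs.empty:
    print(numeric_outputs.mean())
```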
As you can see from the results, one of the queries was flagged as a hallucination.

We can use retrieval relevance evals to identify whether these issues are caused by the retrieval step of our RAG pipeline. We are going to use an LLM to grade whether or not the retrieved chunks are relevant to the query.

```python
rag_relevancy_prompt = """
You are comparing a reference text to a question and trying to determine if the reference text
contains information relevant to answering the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Reference text]: {reference}
    ************
    [END DATA]
Compare the Question above to the Reference text. You must determine whether the Reference text
contains information that can answer the Question. Please focus on whether the very specific
question can be answered by the information in the Reference text.
Your response must be single word, either "relevant" or "unrelated",
and should not contain any text or characters aside from that word.
"unrelated" means that the reference text does not contain an answer to the Question.
"relevant" means the reference text contains an answer to the Question."""
```

```python
retrieved_documents_eval = create_classifier(
    name="rag_relevancy",
    prompt_template=rag_relevancy_prompt,
    llm=llm,
    choices={"relevant": 1.0, "unrelated": 0.0},
)
with suppress_tracing():
    results_df = await async_evaluate_dataframe(
        dataframe=retrieved_documents_df, evaluators=[retrieved_documents_eval]
    )
results_df.head()
```

```python
if results_df.index.name == "context.span_id" and "context.span_id" not in results_df.columns:
    results_df = results_df.reset_index()

document_eval_df = to_annotation_dataframe(dataframe=results_df)

await px_client.annotations.log_span_annotations_dataframe(
    dataframe=document_eval_df,
    annotator_kind="LLM",
)
```

Looks like we are getting a lot of irrelevant chunks of text that might be polluting the prompt sent to the LLM.

If we once again visit the UI, we will now see that Phoenix has aggregated retrieval metrics (`precision`, `ndcg`, and `hit`). We see that our hallucinations and incorrect answers directly correlate with bad retrieval!
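Phoenix computes these aggregates for you, but the underlying arithmetic is straightforward. The sketch below is a hypothetical illustration (the dataframe and column names are made up, not taken from `results_df`): with one row per retrieved chunk and a 0/1 relevance judgment, precision@k is the mean relevance per query and `hit` is whether any retrieved chunk was relevant; `ndcg` additionally discounts relevant chunks that appear lower in the ranking.

```python
import pandas as pd

# Hypothetical data: one row per retrieved chunk, with the query it belongs to
# and the 0/1 relevance score an LLM judge assigned to it.
doc_relevance = pd.DataFrame(
    {
        "query_id": ["q1", "q1", "q2", "q2"],
        "relevance": [1.0, 0.0, 0.0, 0.0],
    }
)

per_query = doc_relevance.groupby("query_id")["relevance"]
precision_at_k = per_query.mean()  # fraction of retrieved chunks judged relevant
hit = per_query.max()  # 1.0 if at least one relevant chunk was retrieved

print(precision_at_k)
print(hit)
```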
