@arizeai/phoenix-mcp

by Arize-ai
llm_application_tracing_evaluating_and_analysis.ipynb
# Tracing, Evaluation, and Analysis of an LLM Application using Phoenix

[Docs](https://arize.com/docs/phoenix/) | [GitHub](https://github.com/Arize-ai/phoenix) | [Community](https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email)

In this tutorial we will learn how to build, observe, evaluate, and analyze an LLM-powered application. This is an LLM-powered chat-on-docs application that answers questions about [Arize](https://docs.arize.com/arize/) from its product documentation.

Key concepts:

1. LLM traces are a category of telemetry data used to understand the execution of an LLM and the surrounding application context (such as retrieval from vector stores and usage of external tools).
2. Traces are made up of a sequence of `spans`. A span represents a unit of work or operation (think: a span of time).
3. LLM evaluations provide visibility into the performance of the application.

Run the following code blocks to install dependencies and launch Phoenix.

## Launch Phoenix to visualize the app

```python
!pip install -qq "arize-phoenix[evals,llama-index]" "openai>=1" gcsfs nest_asyncio llama-index-llms-openai 'httpx<0.28'
```

```python
import os
from getpass import getpass

# Uses your OpenAI API key
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")

os.environ["OPENAI_API_KEY"] = openai_api_key
```

```python
from gcsfs import GCSFileSystem
from llama_index.core import (
    ServiceContext,
    StorageContext,
    load_index_from_storage,
)
from llama_index.core.graph_stores import SimpleGraphStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from tqdm import tqdm

import phoenix as px
from phoenix.trace import DocumentEvaluations
```

```python
px.launch_app().view()
```

## Learn how to build the application and add tracing

```python
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

from phoenix.otel import register

tracer_provider = register(endpoint="http://127.0.0.1:6006/v1/traces")
LlamaIndexInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)
```

```python
# Pulls down the Arize Product Documentation Knowledge Base
file_system = GCSFileSystem(project="public-assets-275721")
index_path = "arize-phoenix-assets/datasets/unstructured/llm/llama-index/arize-docs/index/"
storage_context = StorageContext.from_defaults(
    fs=file_system,
    persist_dir=index_path,
    graph_store=SimpleGraphStore(),  # prevents unauthorized request to GCS
)
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.0),
    embed_model=OpenAIEmbedding(model="text-embedding-ada-002"),
)
index = load_index_from_storage(
    storage_context,
    service_context=service_context,
)
query_engine = index.as_query_engine()

# Asking the application questions about the Arize product
queries = [
    "How can I query for a monitor's status using GraphQL?",
    "How do I delete a model?",
    "How much does an enterprise license of Arize cost?",
    "How do I log a prediction using the python SDK?",
]

for query in tqdm(queries):
    response = query_engine.query(query)
    print(f"Query: {query}")
    print(f"Response: {response}")
```

```python
from time import sleep

sleep(2)  # wait a little bit for data to become fully available
```
## Learn how to evaluate the application using Phoenix LLM Evals

We now have visibility into the inner workings of our application. Next, let's look at how to use LLM evals to evaluate it.

We will walk through a few common evaluation metrics:

1. Hallucination eval: checks whether the application's response was a hallucination
2. Q&A eval: checks whether the application answered the question correctly
3. Document relevance eval: grades whether the documents/chunks retrieved were actually relevant to answering the query

```python
# Convert traces into workable datasets

spans_df = px.Client().get_spans_dataframe()
spans_df[["name", "span_kind", "attributes.input.value", "attributes.retrieval.documents"]].head()

from phoenix.session.evaluation import get_qa_with_reference, get_retrieved_documents

retrieved_documents_df = get_retrieved_documents(px.Client())
queries_df = get_qa_with_reference(px.Client())
```

```python
# Generating the Hallucination & Q&A evals

import nest_asyncio

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

nest_asyncio.apply()  # Speeds up OpenAI API calls

# Hallucination eval: checks whether the application hallucinated
hallucination_eval = llm_classify(
    dataframe=queries_df,
    model=OpenAIModel(model="gpt-4", temperature=0.0),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # Makes the LLM explain its reasoning
    concurrency=4,
)
hallucination_eval["score"] = (
    hallucination_eval.label[~hallucination_eval.label.isna()] == "factual"
).astype(int)

# Q&A eval: checks whether the application answered the question correctly
qa_correctness_eval = llm_classify(
    dataframe=queries_df,
    model=OpenAIModel(model="gpt-4", temperature=0.0),
    template=QA_PROMPT_TEMPLATE,
    rails=list(QA_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # Makes the LLM explain its reasoning
    concurrency=4,
)
qa_correctness_eval["score"] = (
    qa_correctness_eval.label[~qa_correctness_eval.label.isna()] == "correct"
).astype(int)

# Log the evaluations to Phoenix
from phoenix.client import AsyncClient

px_client = AsyncClient()
await px_client.spans.log_span_annotations_dataframe(
    dataframe=hallucination_eval,
    annotation_name="Hallucination",
    annotator_kind="LLM",
)
await px_client.spans.log_span_annotations_dataframe(
    dataframe=qa_correctness_eval,
    annotation_name="QA Correctness",
    annotator_kind="LLM",
)
```

```python
hallucination_eval.head(2)
```

```python
qa_correctness_eval.head(2)
```
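The eval dataframes can also be triaged directly in pandas. The following sketch (not in the original notebook) uses the `score` column created above and the `explanation` column produced by `provide_explanation=True` to list the responses flagged as hallucinations:

```python
# Illustrative only: show the queries whose responses were judged hallucinated
# (score == 0) along with the LLM judge's explanation for each verdict.
flagged = hallucination_eval[hallucination_eval["score"] == 0]
for span_id, row in flagged.iterrows():
    print(f"span {span_id}: {row['label']}")
    print(f"  explanation: {row['explanation']}")
```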
As you can see from the results, one of the queries was flagged as a hallucination.

We can use retrieval relevance evals to identify whether these issues are caused by the retrieval step of the RAG pipeline. We will use an LLM to grade whether or not the chunks retrieved are relevant to the query.

```python
# Generating the Retrieval Relevance eval

from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

retrieved_documents_eval = llm_classify(
    dataframe=retrieved_documents_df,
    model=OpenAIModel(model="gpt-4", temperature=0.0),
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)

retrieved_documents_eval["score"] = (
    retrieved_documents_eval.label[~retrieved_documents_eval.label.isna()] == "relevant"
).astype(int)

px.Client().log_evaluations(
    DocumentEvaluations(eval_name="Relevance", dataframe=retrieved_documents_eval)
)
```

```python
retrieved_documents_eval.head(2)
```

It looks like we are retrieving a lot of irrelevant chunks of text that may be polluting the prompt sent to the LLM.

If we visit the UI once again, we will see that Phoenix has aggregated retrieval metrics (`precision`, `ndcg`, and `hit`). Our hallucinations and incorrect answers correlate directly with bad retrieval!

```python
print("The Phoenix UI:", px.active_session().url)
```
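For a quick numeric summary alongside the UI, the `score` columns built above can be averaged with plain pandas. A minimal sketch, not part of the original notebook:

```python
# Illustrative only: overall rates computed from the `score` columns created above.
print(f"Factual responses:  {hallucination_eval['score'].mean():.0%}")
print(f"Correct answers:    {qa_correctness_eval['score'].mean():.0%}")
print(f"Relevant documents: {retrieved_documents_eval['score'].mean():.0%}")
```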

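To see which queries suffered the worst retrieval, one option is to average document relevance per query. This sketch assumes the first index level of `retrieved_documents_eval` identifies the retrieval span (one per query), which may vary across Phoenix versions:

```python
# Illustrative only: fraction of retrieved chunks judged relevant, per query.
# Assumes level 0 of the dataframe's index is the retrieval span id.
per_query_relevance = retrieved_documents_eval.groupby(level=0)["score"].mean()
print(per_query_relevance.sort_values())
```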