# llamaindex_retrieval_chunk_eval.ipynb
{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<img alt=\"arize llama-index logos\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/llama-index-knowledge-base-tutorial/arize_llamaindex.png\" width=\"400\">\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## LlamaIndex Chunk Size, Retrieval Method and K Eval Suite\n", "\n", "This colab provides a suite of retrieval performance tests that helps teams understand\n", "how to setup the retrieval system. It makes use of the Phoenix Eval options for \n", "Q&A (overall did it answer the question) and retrieval (did the right chunks get returned).\n", "\n", "There is a sweep of parameters that is stored in experiment_data/results_no_zero_remove, \n", "check that directory for results. \n", "\n", "The goal is to help teams choose a Chunk size, retireval method, K for return chunks\n", "\n", "This colab downloads the script (py) files. Those files can be run without this colab directly,\n", "in a code only environment (VS code for example)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Retrieval Eval\n", "\n", "This Eval evaluates whether a retrieved chunk contains an answer to the query. Its extremely useful for evaluating retrieval systems.\n", "\n", "https://arize.com/docs/phoenix/concepts/llm-evals/retrieval-rag-relevance\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q&A EVal\n", "This Eval evaluates whether a question was correctly answered by the system based on the retrieved data. In contrast to retrieval Evals that are checks on chunks of data returned, this check is a system level check of a correct Q&A.\n", "\n", "https://arize.com/docs/phoenix/concepts/llm-evals/q-and-a-on-retrieved-data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<center>\n", " <p style=\"text-align:center\">\n", " <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/chunking.png\" width=800/>\n", " </p>\n", "</center>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The challenge in setting up a retrieval system is having solid performance metrics that allow you to evaluate your different strategies:\n", "- Chunk Size\n", "- Retrieval Method\n", "- K value\n", "\n", "In setting the above variables you first need some overall Eval metrics." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<center>\n", " <p style=\"text-align:center\">\n", " <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/eval_relevance.png\" width=800/>\n", " </p>\n", "</center>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above is the relevance evaluation used to check whether the chunk retrieved is relevant to the query." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<center>\n", " <p style=\"text-align:center\">\n", " <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/EvalQ_A.png\" width=600/>\n", " </p>\n", "</center>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above Eval shows a Q&A Eval on the entire system Q&A /\n", "on the overall question and answer. \n", "Each is used as we sweep through parameters to detremine effectiveness of retrieval." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sweeping values\n", "The scripts sweep through K, Retrival approach and chunk size, determining the trade off on your own docs." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<center>\n", " <p style=\"text-align:center\">\n", " <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/sweep_k.png\" width=800/>\n", " </p>\n", "</center>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above shows sweeping through K=4 and K=6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<center>\n", " <p style=\"text-align:center\">\n", " <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/sweep_chunk.png\" width=800/>\n", " </p>\n", "</center>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above shows sweeping through Chunk Size" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The script below runs a test on the question set, by default we have a 170 Question set\n", "# That takes some time to run so you can default it lower just to test\n", "# Comment this out to run on full dataset\n", "QUESTION_SAMPLES = 4" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install cohere matplotlib lxml openai 'arize-phoenix[evals,llama-index]' bs4 'llama-index-postprocessor-cohere-rerank' \"urllib3>=2.0.4\" nest_asyncio" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Retrieval Eval Scripts \n", "The following scripts can be run directly. In the case of long test suites, we recommend running \n", "the python script llama_index_w_evals_and_qa directly.py directly in python. All parameters are available \n", "in that script." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# # Download scripts\n", "import requests\n", "\n", "url = \"https://raw.githubusercontent.com/Arize-ai/phoenix/main/scripts/rag/llama_index_w_evals_and_qa.py\"\n", "response = requests.get(url)\n", "with open(\"llama_index_w_evals_and_qa.py\", \"w\") as file:\n", " file.write(response.text)\n", "\n", "url = \"https://raw.githubusercontent.com/Arize-ai/phoenix/main/scripts/rag/plotresults.py\"\n", "response = requests.get(url)\n", "with open(\"plotresults.py\", \"w\") as file:\n", " file.write(response.text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import datetime\n", "import os\n", "import pickle\n", "\n", "import cohere\n", "import openai\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Phoenix Observabiility\n", "Click link below to visualize llamaIndex queries and chunking as its happening!!!!!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#########################################\n", "### CLICK LINK BELOW FOR PHOENIX VIZ ####\n", "#########################################\n", "# Phoenix can display in real time the traces automatically\n", "# collected from your LlamaIndex application.\n", "import phoenix as px\n", "\n", "# Look for a URL in the output to open the App in a browser.\n", "px.launch_app()\n", "# The App is initially empty, but as you proceed with the steps below,\n", "# traces will appear automatically as your LlamaIndex application runs." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from getpass import getpass\n", "\n", "if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n", " openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n", "openai.api_key = openai_api_key\n", "os.environ[\"OPENAI_API_KEY\"] = openai_api_key\n", "if not (cohere_api_key := os.getenv(\"COHERE_API_KEY\")):\n", " cohere_api_key = getpass(\"🔑 Enter your Cohere API key: \")\n", "cohere.api_key = cohere_api_key\n", "os.environ[\"COHERE_API_KEY\"] = cohere_api_key\n", "\n", "# if loading from scratch, change these below\n", "web_title = \"arize\" # nickname for this website, used for saving purposes\n", "base_url = \"https://docs.arize.com/arize\"\n", "# Local files\n", "file_name = \"raw_documents.pkl\"\n", "save_base = \"./experiment_data/\"\n", "if not os.path.exists(save_base):\n", " os.makedirs(save_base)\n", "\n", "run_name = datetime.datetime.now().strftime(\"%Y%m%d_%H%M\")\n", "save_dir = os.path.join(save_base, run_name)\n", "if not os.path.exists(save_dir):\n", " # Create a new directory because it does not exist\n", " os.makedirs(save_dir)\n", "\n", "\n", "questions = pd.read_csv(\n", " \"https://storage.googleapis.com/arize-assets/fixtures/Embeddings/GENERATIVE/constants.csv\",\n", " header=None,\n", ")[0].to_list()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This will determine run time, how many questions to pull from the data to run\n", "selected_questions = questions[:QUESTION_SAMPLES] if QUESTION_SAMPLES else questions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nest_asyncio\n", "from llama_index.core import download_loader\n", "from llama_index_w_evals_and_qa import get_urls, plot_graphs, run_experiments\n", "\n", "import phoenix.evals.default_templates as templates\n", "from phoenix.evals import OpenAIModel\n", "\n", "nest_asyncio.apply()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "raw_docs_filepath = os.path.join(save_base, file_name)\n", "if not os.path.exists(raw_docs_filepath):\n", " print(f\"'{raw_docs_filepath}' does not exists.\")\n", " urls = get_urls(base_url) # you need to - pip install lxml\n", " print(f\"LOADED {len(urls)} URLS\")\n", "\n", "print(\"GRABBING DOCUMENTS\")\n", "BeautifulSoupWebReader = download_loader(\"BeautifulSoupWebReader\")\n", "# two options here, either get the documents from scratch or load one from disk\n", "if not os.path.exists(raw_docs_filepath):\n", " print(\"LOADING DOCUMENTS FROM URLS\")\n", " # You need to 'pip install lxml'\n", " loader = BeautifulSoupWebReader()\n", " documents = loader.load_data(urls=urls) # may take some time\n", " with open(save_base + file_name, \"wb\") as file:\n", " pickle.dump(documents, file)\n", " print(\"Documents saved to raw_documents.pkl\")\n", "else:\n", " print(\"LOADING DOCUMENTS FROM FILE\")\n", " print(\"Opening raw_documents.pkl\")\n", " with open(save_base + file_name, \"rb\") as file:\n", " documents = pickle.load(file)\n", "##############################\n", "### PARAMETER SWEEPS BELOW ###\n", "##############################\n", "###chunk_sizes### to test, will sweep through values of chunk size\n", "chunk_sizes = [\n", " 100,\n", " # 300,\n", " # 500,\n", " # 1000,\n", " # 2000,\n", "] # change this, perhaps experiment from 500 to 3000 in increments of 500\n", "\n", "### K ###: Sizes to test, will sweep through values of 
k\n", "k = [4, 6, 8]\n", "# k = [10] # num documents to retrieve\n", "\n", "### Retrieval Approach ###: transformation to test will sweep through retrieval\n", "# transformations = [\"original\", \"original_rerank\",\"hyde\", \"hyde_rerank\"]\n", "transformations = [\"original\", \"original_rerank\"]\n", "# Model for Q&A\n", "llama_index_model = \"gpt-4o\"\n", "# llama_index_model = \"gpt-3.5-turbo\"\n", "# Model for Evals\n", "eval_model = OpenAIModel(model=\"gpt-4o\", temperature=0.0)\n", "\n", "qa_template = templates.QA_PROMPT_TEMPLATE" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Uncomment when testing, 3 questions are easy to run through quickly\n", "questions = questions[0:3]\n", "all_data = run_experiments(\n", " documents=documents,\n", " queries=questions,\n", " chunk_sizes=chunk_sizes,\n", " query_transformations=transformations,\n", " k_values=k,\n", " web_title=web_title,\n", " save_dir=save_dir,\n", " llama_index_model=llama_index_model,\n", " eval_model=eval_model,\n", " template=qa_template,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all_data_filepath = os.path.join(save_dir, f\"{web_title}_all_data.pkl\")\n", "with open(all_data_filepath, \"wb\") as f:\n", " pickle.dump(all_data, f)\n", "\n", "# The retrievals with 0 relevant context really can't be optimized, removing gives a diff view\n", "plot_graphs(\n", " all_data=all_data,\n", " save_dir=os.path.join(save_dir, \"results_zero_removed\"),\n", " show=False,\n", " remove_zero=True,\n", ")\n", "plot_graphs(\n", " all_data=all_data,\n", " save_dir=os.path.join(save_dir, \"results_zero_not_removed\"),\n", " show=False,\n", " remove_zero=False,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example Results Q&A Evals (actual results in experiment_data)\n", "\n", "The Q&A Eval runs at the highest level of did you get the question answer correct based on the data:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<center>\n", " <p style=\"text-align:center\">\n", " <img alt=\"phoenix data\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/percentage_incorrect_plot.png\" />\n", " </p>\n", "</center>\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example Results Retrieval Eval (actual results in experiment_data)\n", "\n", "The retrieval analysis example is below, iterates through the chunk sizes, K (4/6/10), retrieval method\n", "The eval checks whether the retrieved chunk is relevant and has a chance to answer the question" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<center>\n", " <p style=\"text-align:center\">\n", " <img alt=\"phoenix data\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/all_mean_precisions.png\" />\n", " </p>\n", "</center>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example Results Latency (actual results in experiment_data)\n", "\n", "The latency can highly varied based on retrieval approaches, below are latency maps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<center>\n", " <p style=\"text-align:center\">\n", " <img alt=\"phoenix data\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/median_latency_all.png\" />\n", " </p>\n", "</center>\n" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 4 }
