{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"arize llama-index logos\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/llama-index-knowledge-base-tutorial/arize_llamaindex.png\" width=\"400\">\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## LlamaIndex Chunk Size, Retrieval Method and K Eval Suite\n",
"\n",
"This colab provides a suite of retrieval performance tests that helps teams understand\n",
"how to setup the retrieval system. It makes use of the Phoenix Eval options for \n",
"Q&A (overall did it answer the question) and retrieval (did the right chunks get returned).\n",
"\n",
"There is a sweep of parameters that is stored in experiment_data/results_no_zero_remove, \n",
"check that directory for results. \n",
"\n",
"The goal is to help teams choose a Chunk size, retireval method, K for return chunks\n",
"\n",
"This colab downloads the script (py) files. Those files can be run without this colab directly,\n",
"in a code only environment (VS code for example)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieval Eval\n",
"\n",
"This Eval evaluates whether a retrieved chunk contains an answer to the query. Its extremely useful for evaluating retrieval systems.\n",
"\n",
"https://arize.com/docs/phoenix/concepts/llm-evals/retrieval-rag-relevance\n"
]
},
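{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal, self-contained sketch of this Eval using `llm_classify` from `phoenix.evals`, assuming `OPENAI_API_KEY` is already set in the environment. The sample rows are illustrative, and the template variable names (`input` for the query, `reference` for the chunk) and rail labels are assumptions to verify against your Phoenix version."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch of a retrieval relevance Eval (illustrative only; the full\n",
"# sweep below uses the downloaded scripts instead)\n",
"import pandas as pd\n",
"\n",
"from phoenix.evals import OpenAIModel, llm_classify\n",
"from phoenix.evals.default_templates import RAG_RELEVANCY_PROMPT_TEMPLATE\n",
"\n",
"# One row per (query, retrieved chunk) pair; column names must match the\n",
"# template variables -- assumed here to be \"input\" and \"reference\"\n",
"relevance_df = pd.DataFrame(\n",
"    {\n",
"        \"input\": [\"How do I export data from Arize?\"],\n",
"        \"reference\": [\"The export API lets you download your data...\"],\n",
"    }\n",
")\n",
"\n",
"relevance_evals = llm_classify(\n",
"    dataframe=relevance_df,\n",
"    model=OpenAIModel(model=\"gpt-4o\", temperature=0.0),\n",
"    template=RAG_RELEVANCY_PROMPT_TEMPLATE,\n",
"    rails=[\"relevant\", \"unrelated\"],  # assumed rail labels for this template\n",
")\n",
"print(relevance_evals[\"label\"].tolist())"
]
},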
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Q&A EVal\n",
"This Eval evaluates whether a question was correctly answered by the system based on the retrieved data. In contrast to retrieval Evals that are checks on chunks of data returned, this check is a system level check of a correct Q&A.\n",
"\n",
"https://arize.com/docs/phoenix/concepts/llm-evals/q-and-a-on-retrieved-data"
]
},
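{
"cell_type": "markdown",
"metadata": {},
"source": [
"A matching sketch for the Q&A Eval is below, again assuming `OPENAI_API_KEY` is set. The rows are illustrative, and the template variable names (`input`, `reference`, `output`) and rail labels are assumptions to verify against `phoenix.evals.default_templates.QA_PROMPT_TEMPLATE` in your Phoenix version."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch of the system-level Q&A Eval (illustrative only)\n",
"import pandas as pd\n",
"\n",
"from phoenix.evals import OpenAIModel, llm_classify\n",
"from phoenix.evals.default_templates import QA_PROMPT_TEMPLATE\n",
"\n",
"# One row per question; column names must match the template variables --\n",
"# assumed here to be \"input\" (question), \"reference\" (retrieved context),\n",
"# and \"output\" (the system's answer)\n",
"qa_df = pd.DataFrame(\n",
"    {\n",
"        \"input\": [\"How do I export data from Arize?\"],\n",
"        \"reference\": [\"The export API lets you download your data...\"],\n",
"        \"output\": [\"You can export data with the export API.\"],\n",
"    }\n",
")\n",
"\n",
"qa_evals = llm_classify(\n",
"    dataframe=qa_df,\n",
"    model=OpenAIModel(model=\"gpt-4o\", temperature=0.0),\n",
"    template=QA_PROMPT_TEMPLATE,\n",
"    rails=[\"correct\", \"incorrect\"],  # assumed rail labels for this template\n",
")\n",
"print(qa_evals[\"label\"].tolist())"
]
},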
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/chunking.png\" width=800/>\n",
" </p>\n",
"</center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The challenge in setting up a retrieval system is having solid performance metrics that allow you to evaluate your different strategies:\n",
"- Chunk Size\n",
"- Retrieval Method\n",
"- K value\n",
"\n",
"In setting the above variables you first need some overall Eval metrics."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/eval_relevance.png\" width=800/>\n",
" </p>\n",
"</center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above is the relevance evaluation used to check whether the chunk retrieved is relevant to the query."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/EvalQ_A.png\" width=600/>\n",
" </p>\n",
"</center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above Eval shows a Q&A Eval on the entire system Q&A /\n",
"on the overall question and answer. \n",
"Each is used as we sweep through parameters to detremine effectiveness of retrieval."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sweeping values\n",
"The scripts sweep through K, Retrival approach and chunk size, determining the trade off on your own docs."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/sweep_k.png\" width=800/>\n",
" </p>\n",
"</center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above shows sweeping through K=4 and K=6"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/sweep_chunk.png\" width=800/>\n",
" </p>\n",
"</center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above shows sweeping through Chunk Size"
]
},
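{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of what the sweep covers, the conceptual sketch below just enumerates the parameter grid. The real work per configuration (index building, querying, and both Evals) happens inside `run_experiments` in the downloaded `llama_index_w_evals_and_qa.py` script; the values here are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Conceptual sketch of the parameter sweep (illustrative only; the real\n",
"# implementation is run_experiments in llama_index_w_evals_and_qa.py)\n",
"from itertools import product\n",
"\n",
"sweep_chunk_sizes = [500, 1000]  # characters per chunk\n",
"sweep_k_values = [4, 6]  # number of chunks retrieved per query\n",
"sweep_transformations = [\"original\", \"original_rerank\"]  # retrieval approaches\n",
"\n",
"# Each configuration implies: build an index at that chunk size, answer every\n",
"# question retrieving the top-k chunks with that approach, then run the\n",
"# retrieval Eval per chunk and the Q&A Eval per answer\n",
"configurations = list(product(sweep_chunk_sizes, sweep_transformations, sweep_k_values))\n",
"for chunk_size, transformation, k in configurations:\n",
"    print(f\"chunk_size={chunk_size}, transformation={transformation}, k={k}\")\n",
"print(f\"{len(configurations)} configurations total\")"
]
},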
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The script below runs a test on the question set, by default we have a 170 Question set\n",
"# That takes some time to run so you can default it lower just to test\n",
"# Comment this out to run on full dataset\n",
"QUESTION_SAMPLES = 4"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install cohere matplotlib lxml openai 'arize-phoenix[evals,llama-index]' bs4 'llama-index-postprocessor-cohere-rerank' \"urllib3>=2.0.4\" nest_asyncio"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Retrieval Eval Scripts \n",
"The following scripts can be run directly. In the case of long test suites, we recommend running \n",
"the python script llama_index_w_evals_and_qa directly.py directly in python. All parameters are available \n",
"in that script."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# # Download scripts\n",
"import requests\n",
"\n",
"url = \"https://raw.githubusercontent.com/Arize-ai/phoenix/main/scripts/rag/llama_index_w_evals_and_qa.py\"\n",
"response = requests.get(url)\n",
"with open(\"llama_index_w_evals_and_qa.py\", \"w\") as file:\n",
" file.write(response.text)\n",
"\n",
"url = \"https://raw.githubusercontent.com/Arize-ai/phoenix/main/scripts/rag/plotresults.py\"\n",
"response = requests.get(url)\n",
"with open(\"plotresults.py\", \"w\") as file:\n",
" file.write(response.text)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import datetime\n",
"import os\n",
"import pickle\n",
"\n",
"import cohere\n",
"import openai\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Phoenix Observabiility\n",
"Click link below to visualize llamaIndex queries and chunking as its happening!!!!!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#########################################\n",
"### CLICK LINK BELOW FOR PHOENIX VIZ ####\n",
"#########################################\n",
"# Phoenix can display in real time the traces automatically\n",
"# collected from your LlamaIndex application.\n",
"import phoenix as px\n",
"\n",
"# Look for a URL in the output to open the App in a browser.\n",
"px.launch_app()\n",
"# The App is initially empty, but as you proceed with the steps below,\n",
"# traces will appear automatically as your LlamaIndex application runs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from getpass import getpass\n",
"\n",
"if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n",
" openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n",
"openai.api_key = openai_api_key\n",
"os.environ[\"OPENAI_API_KEY\"] = openai_api_key\n",
"if not (cohere_api_key := os.getenv(\"COHERE_API_KEY\")):\n",
" cohere_api_key = getpass(\"🔑 Enter your Cohere API key: \")\n",
"cohere.api_key = cohere_api_key\n",
"os.environ[\"COHERE_API_KEY\"] = cohere_api_key\n",
"\n",
"# if loading from scratch, change these below\n",
"web_title = \"arize\" # nickname for this website, used for saving purposes\n",
"base_url = \"https://docs.arize.com/arize\"\n",
"# Local files\n",
"file_name = \"raw_documents.pkl\"\n",
"save_base = \"./experiment_data/\"\n",
"if not os.path.exists(save_base):\n",
" os.makedirs(save_base)\n",
"\n",
"run_name = datetime.datetime.now().strftime(\"%Y%m%d_%H%M\")\n",
"save_dir = os.path.join(save_base, run_name)\n",
"if not os.path.exists(save_dir):\n",
" # Create a new directory because it does not exist\n",
" os.makedirs(save_dir)\n",
"\n",
"\n",
"questions = pd.read_csv(\n",
" \"https://storage.googleapis.com/arize-assets/fixtures/Embeddings/GENERATIVE/constants.csv\",\n",
" header=None,\n",
")[0].to_list()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# This will determine run time, how many questions to pull from the data to run\n",
"selected_questions = questions[:QUESTION_SAMPLES] if QUESTION_SAMPLES else questions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import nest_asyncio\n",
"from llama_index.core import download_loader\n",
"from llama_index_w_evals_and_qa import get_urls, plot_graphs, run_experiments\n",
"\n",
"import phoenix.evals.default_templates as templates\n",
"from phoenix.evals import OpenAIModel\n",
"\n",
"nest_asyncio.apply()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"raw_docs_filepath = os.path.join(save_base, file_name)\n",
"if not os.path.exists(raw_docs_filepath):\n",
" print(f\"'{raw_docs_filepath}' does not exists.\")\n",
" urls = get_urls(base_url) # you need to - pip install lxml\n",
" print(f\"LOADED {len(urls)} URLS\")\n",
"\n",
"print(\"GRABBING DOCUMENTS\")\n",
"BeautifulSoupWebReader = download_loader(\"BeautifulSoupWebReader\")\n",
"# two options here, either get the documents from scratch or load one from disk\n",
"if not os.path.exists(raw_docs_filepath):\n",
" print(\"LOADING DOCUMENTS FROM URLS\")\n",
" # You need to 'pip install lxml'\n",
" loader = BeautifulSoupWebReader()\n",
" documents = loader.load_data(urls=urls) # may take some time\n",
" with open(save_base + file_name, \"wb\") as file:\n",
" pickle.dump(documents, file)\n",
" print(\"Documents saved to raw_documents.pkl\")\n",
"else:\n",
" print(\"LOADING DOCUMENTS FROM FILE\")\n",
" print(\"Opening raw_documents.pkl\")\n",
" with open(save_base + file_name, \"rb\") as file:\n",
" documents = pickle.load(file)\n",
"##############################\n",
"### PARAMETER SWEEPS BELOW ###\n",
"##############################\n",
"###chunk_sizes### to test, will sweep through values of chunk size\n",
"chunk_sizes = [\n",
" 100,\n",
" # 300,\n",
" # 500,\n",
" # 1000,\n",
" # 2000,\n",
"] # change this, perhaps experiment from 500 to 3000 in increments of 500\n",
"\n",
"### K ###: Sizes to test, will sweep through values of k\n",
"k = [4, 6, 8]\n",
"# k = [10] # num documents to retrieve\n",
"\n",
"### Retrieval Approach ###: transformation to test will sweep through retrieval\n",
"# transformations = [\"original\", \"original_rerank\",\"hyde\", \"hyde_rerank\"]\n",
"transformations = [\"original\", \"original_rerank\"]\n",
"# Model for Q&A\n",
"llama_index_model = \"gpt-4o\"\n",
"# llama_index_model = \"gpt-3.5-turbo\"\n",
"# Model for Evals\n",
"eval_model = OpenAIModel(model=\"gpt-4o\", temperature=0.0)\n",
"\n",
"qa_template = templates.QA_PROMPT_TEMPLATE"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uncomment when testing, 3 questions are easy to run through quickly\n",
"questions = questions[0:3]\n",
"all_data = run_experiments(\n",
" documents=documents,\n",
" queries=questions,\n",
" chunk_sizes=chunk_sizes,\n",
" query_transformations=transformations,\n",
" k_values=k,\n",
" web_title=web_title,\n",
" save_dir=save_dir,\n",
" llama_index_model=llama_index_model,\n",
" eval_model=eval_model,\n",
" template=qa_template,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"all_data_filepath = os.path.join(save_dir, f\"{web_title}_all_data.pkl\")\n",
"with open(all_data_filepath, \"wb\") as f:\n",
" pickle.dump(all_data, f)\n",
"\n",
"# The retrievals with 0 relevant context really can't be optimized, removing gives a diff view\n",
"plot_graphs(\n",
" all_data=all_data,\n",
" save_dir=os.path.join(save_dir, \"results_zero_removed\"),\n",
" show=False,\n",
" remove_zero=True,\n",
")\n",
"plot_graphs(\n",
" all_data=all_data,\n",
" save_dir=os.path.join(save_dir, \"results_zero_not_removed\"),\n",
" show=False,\n",
" remove_zero=False,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example Results Q&A Evals (actual results in experiment_data)\n",
"\n",
"The Q&A Eval runs at the highest level of did you get the question answer correct based on the data:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix data\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/percentage_incorrect_plot.png\" />\n",
" </p>\n",
"</center>\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example Results Retrieval Eval (actual results in experiment_data)\n",
"\n",
"The retrieval analysis example is below, iterates through the chunk sizes, K (4/6/10), retrieval method\n",
"The eval checks whether the retrieved chunk is relevant and has a chance to answer the question"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix data\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/all_mean_precisions.png\" />\n",
" </p>\n",
"</center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example Results Latency (actual results in experiment_data)\n",
"\n",
"The latency can highly varied based on retrieval approaches, below are latency maps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix data\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/median_latency_all.png\" />\n",
" </p>\n",
"</center>\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 4
}