@arizeai/phoenix-mcp

Official

Overview Inspect Schema Related Servers Score Discussions

evaluate_code_readability_classifications.ipynb•14.8 kB

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<center>\n", " <p style=\"text-align:center\">\n", " <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n", " <br>\n", " <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n", " |\n", " <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n", " |\n", " <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n", " </p>\n", "</center>\n", "<h1 align=\"center\">Code Readability Evals</h1>\n", "\n", "Arize provides tooling to evaluate LLM applications, including tools to determine the readability or unreadability of code generated by LLM applications.\n", "\n", "The purpose of this notebook is:\n", "\n", "- to evaluate the performance of an LLM-assisted approach to classifying\n", " generated code as readable or unreadable using datasets with ground-truth\n", " labels\n", "- to provide an experimental framework for users to iterate and improve on the default classification template.\n", "\n", "## Install Dependencies and Import Libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#####################\n", "## N_EVAL_SAMPLE_SIZE\n", "#####################\n", "# Eval sample size determines the run time\n", "# 100 samples: GPT-4 ~ 80 sec / GPT-3.5 ~ 40 sec\n", "# 1,000 samples: GPT-4 ~15-17 min / GPT-3.5 ~ 6-7min (depending on retries)\n", "# 10,000 samples GPT-4 ~170 min / GPT-3.5 ~ 70min\n", "N_EVAL_SAMPLE_SIZE = 10" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -qq \"arize-phoenix-evals>=0.0.5\" \"openai>=1\" ipython matplotlib pycm scikit-learn tiktoken nest_asyncio 'httpx<0.28'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ℹ️ To enable async request submission in notebook environments like Jupyter or Google Colab, optionally use `nest_asyncio`. `nest_asyncio` globally patches `asyncio` to enable event loops to be re-entrant. This is not required for non-notebook environments.\n", "\n", "Without `nest_asyncio`, eval submission can be much slower, depending on your organization's rate limits. Speed increases of about 5x are typical." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nest_asyncio\n", "\n", "nest_asyncio.apply()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "from getpass import getpass\n", "\n", "import matplotlib.pyplot as plt\n", "import openai\n", "import pandas as pd\n", "from pycm import ConfusionMatrix\n", "from sklearn.metrics import classification_report\n", "\n", "from phoenix.evals import (\n", " create_classifier,\n", " download_benchmark_dataset,\n", " evaluate_dataframe,\n", ")\n", "from phoenix.evals.llm import LLM\n", "\n", "pd.set_option(\"display.max_colwidth\", None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download Benchmark Dataset\n", "\n", "We'll evaluate the evaluation system consisting of an LLM model and settings in\n", "addition to an evaluation prompt template against a benchmark datasets of\n", "readable and unreadable code with ground-truth labels. Currently supported\n", "datasets for this task include:\n", "\n", "- openai_humaneval_with_readability" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset_name = \"openai_humaneval_with_readability\"\n", "df = download_benchmark_dataset(task=\"code-readability-classification\", dataset_name=dataset_name)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Display Binary Readability Classification Template\n", "\n", "View the default template used to classify readability. You can tweak this template and evaluate its performance relative to the default." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "CODE_READABILITY_PROMPT_TEMPLATE = \"\"\"\n", "You are a stern but practical senior software engineer who cares a lot about simplicity and\n", "readability of code. Can you review the following code that was written by another engineer?\n", "Focus on readability of the code. Respond with \"readable\" if you think the code is readable,\n", "or \"unreadable\" if the code is unreadable or needlessly complex for what it's trying\n", "to accomplish.\n", "\n", "ONLY respond with \"readable\" or \"unreadable\"\n", "\n", "Task Assignment:\n", "```\n", "{input}\n", "```\n", "\n", "Implementation to Evaluate:\n", "```\n", "{output}\n", "```\n", "\"\"\"\n", "\n", "CODE_READABILITY_PROMPT_RAILS_MAP = {\"readable\": 1, \"unreadable\": 0}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The template variables are:\n", "\n", "- **input:** the query from the user describing the coding task\n", "- **output:** an implementation of the coding task" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure the LLM\n", "\n", "Configure your OpenAI API key." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n", " openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n", "openai.api_key = openai_api_key\n", "os.environ[\"OPENAI_API_KEY\"] = openai_api_key" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instantiate the LLM and set parameters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Benchmark Dataset Sample\n", "Sample size determines run time\n", "Recommend iterating small: 100 samples\n", "Then increasing to large test set" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.sample(n=N_EVAL_SAMPLE_SIZE).reset_index(drop=True)\n", "df = df.rename(\n", " columns={\"prompt\": \"input\", \"solution\": \"output\"},\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## LLM Evals: Code Readability Classifications GPT-4\n", "\n", "Run readability classifications against a subset of the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = LLM(provider=\"openai\", model=\"gpt-4\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "code_readability_eval = create_classifier(\n", " name=\"code readability\",\n", " prompt_template=CODE_READABILITY_PROMPT_TEMPLATE,\n", " llm=model,\n", " choices=CODE_READABILITY_PROMPT_RAILS_MAP,\n", ")\n", "\n", "readability_classifications = evaluate_dataframe(dataframe=df, evaluators=[code_readability_eval])\n", "\n", "readability_classifications = readability_classifications[\"code readability_score\"].apply(\n", " lambda x: x.get(\"label\") if isinstance(x, dict) else None\n", ")\n", "\n", "readability_classifications.tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Evaluate the predictions against human-labeled ground-truth readability labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "true_labels = df[\"readable\"].map(CODE_READABILITY_PROMPT_RAILS_MAP).tolist()\n", "\n", "print(\n", " classification_report(\n", " true_labels, readability_classifications, labels=[\"readable\", \"unreadable\"]\n", " )\n", ")\n", "confusion_matrix = ConfusionMatrix(\n", " actual_vector=true_labels,\n", " predict_vector=readability_classifications,\n", " classes=[\"readable\", \"unreadable\"],\n", ")\n", "confusion_matrix.plot(\n", " cmap=plt.colormaps[\"Blues\"],\n", " number_label=True,\n", " normalized=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspecting evaluations\n", "\n", "Because the evals are binary classifications, we can easily sample a few rows\n", "where the evals deviated from ground truth and see what the actual code was in\n", "that case." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[\"readability\"] = readability_classifications\n", "# inspect instances where ground truth was readable but evaluated to unreadable by the LLM\n", "filtered_df = df.query('readable == False and readability == \"readable\"')\n", "\n", "# inspect first 5 rows that meet this condition\n", "result = filtered_df.head(5)\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classifications with explanations\n", "\n", "When evaluating a dataset for readability, it can be useful to know why the LLM classified text as readable or not. The following code block runs `llm_classify` with explanations turned on so that we can inspect why the LLM made the classification it did. There is speed tradeoff since more tokens is being generated but it can be highly informative when troubleshooting." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "small_df_sample = df.copy().sample(n=5).reset_index(drop=True)\n", "\n", "readability_classifications_df = evaluate_dataframe(\n", " dataframe=small_df_sample, evaluators=[code_readability_eval]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's view the data\n", "merged_df = pd.merge(\n", " small_df_sample, readability_classifications_df, left_index=True, right_index=True\n", ")\n", "merged_df[[\"input\", \"output\", \"label\", \"explanation\"]].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## LLM Evals: Code Readability Classifications GPT-3.5\n", "\n", "Run readability classifications against a subset of the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = LLM(provider=\"openai\", model=\"gpt-3.5-turbo\")\n", "\n", "code_readability_eval = create_classifier(\n", " name=\"code readability\",\n", " prompt_template=CODE_READABILITY_PROMPT_TEMPLATE,\n", " llm=model,\n", " choices=CODE_READABILITY_PROMPT_RAILS_MAP,\n", ")\n", "\n", "readability_classifications = evaluate_dataframe(dataframe=df, evaluators=[code_readability_eval])\n", "\n", "readability_classifications = readability_classifications[\"code readability_score\"].apply(\n", " lambda x: x.get(\"label\") if isinstance(x, dict) else None\n", ")\n", "\n", "readability_classifications.tolist()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "true_labels = df[\"readable\"].map(CODE_READABILITY_PROMPT_RAILS_MAP).tolist()\n", "\n", "print(\n", " classification_report(\n", " true_labels, readability_classifications, labels=[\"readable\", \"unreadable\"]\n", " )\n", ")\n", "confusion_matrix = ConfusionMatrix(\n", " actual_vector=true_labels,\n", " predict_vector=readability_classifications,\n", " classes=[\"readable\", \"unreadable\"],\n", ")\n", "confusion_matrix.plot(\n", " cmap=plt.colormaps[\"Blues\"],\n", " number_label=True,\n", " normalized=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Preview: GPT-4 Turbo" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = LLM(provider=\"openai\", model=\"gpt-4-turbo-preview\")\n", "\n", "code_readability_eval = create_classifier(\n", " name=\"code readability\",\n", " prompt_template=CODE_READABILITY_PROMPT_TEMPLATE,\n", " llm=model,\n", " choices=CODE_READABILITY_PROMPT_RAILS_MAP,\n", ")\n", "\n", "readability_classifications = evaluate_dataframe(dataframe=df, evaluators=[code_readability_eval])\n", "\n", "readability_classifications = readability_classifications[\"code readability_score\"].apply(\n", " lambda x: x.get(\"label\") if isinstance(x, dict) else None\n", ")\n", "\n", "readability_classifications.tolist()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "true_labels = df[\"readable\"].map(CODE_READABILITY_PROMPT_RAILS_MAP).tolist()\n", "\n", "print(\n", " classification_report(\n", " true_labels, readability_classifications, labels=[\"readable\", \"unreadable\"]\n", " )\n", ")\n", "confusion_matrix = ConfusionMatrix(\n", " actual_vector=true_labels,\n", " predict_vector=readability_classifications,\n", " classes=[\"readable\", \"unreadable\"],\n", ")\n", "confusion_matrix.plot(\n", " cmap=plt.colormaps[\"Blues\"],\n", " number_label=True,\n", " normalized=True,\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.3" } }, "nbformat": 4, "nbformat_minor": 4 }

Latest Blog Posts

What Is Context Bloat in MCP?
By Om-Shree-0709 on December 16, 2025.
mcp
Context Bloat
MCP Moves to the Linux Foundation: Neutral Stewardship for Agentic Infrastructure
By Om-Shree-0709 on December 15, 2025.
mcp
anthropic
Linux Foundation
Code Execution with MCP: Architecting Agentic Efficiency
By Om-Shree-0709 on December 14, 2025.
mcp
Token bloat

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Arize-ai/phoenix'

If you have feedback or need assistance with the MCP directory API, please join our Discord server