---
description: >-
Evaluate the readability of code generated by LLM applications using Phoenix's
evaluation framework.
---
# Code Readability Evaluation
This tutorial shows how to classify code as readable or unreadable using benchmark datasets with ground-truth labels.
**Key Takeaways:**
* Download and prepare benchmark datasets for code readability evaluation
* Compare different LLM models (GPT-4, GPT-3.5, GPT-4 Turbo) for classification accuracy
* Analyze results with confusion matrices and detailed reports
* Get explanations for LLM classifications to understand decision-making
***
## Notebook Walkthrough
We walk through the key code snippets on this page. To run the tutorial end to end, check out the full notebook.
{% embed url="https://colab.research.google.com/github/Arize-ai/phoenix/blob/main/tutorials/evals/evaluate_code_readability_classifications.ipynb#scrollTo=pDUpSpUG44ZV" %}
## Download Benchmark Dataset
```python
from phoenix.evals import download_benchmark_dataset

dataset_name = "openai_humaneval_with_readability"
df = download_benchmark_dataset(task="code-readability-classification", dataset_name=dataset_name)
```
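Before configuring the evaluation, it can help to sanity-check the download. The sketch below simply previews the columns used in the rest of the tutorial (`prompt`, `solution`, and the ground-truth `readable` flag):
```python
# Preview the benchmark: each row pairs a prompt/solution with a ground-truth readability label
print(df.shape)
print(df[["prompt", "solution", "readable"]].head())
```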
## Configure Evaluation
```python
N_EVAL_SAMPLE_SIZE = 10
df = df.sample(n=N_EVAL_SAMPLE_SIZE).reset_index(drop=True)
df = df.rename(columns={"prompt": "input", "solution": "output"})
```
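With only ten sampled rows, the split between readable and unreadable examples can be lopsided; a quick check like the sketch below makes the later metrics easier to interpret:
```python
# Inspect the ground-truth label balance in the sampled subset
print(df["readable"].value_counts())
```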
## Run Code Readability Classification
Run readability classifications against a subset of the data.
```python
from phoenix.evals import (
    CODE_READABILITY_PROMPT_RAILS_MAP,
    CODE_READABILITY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

model = OpenAIModel(model="gpt-4", temperature=0.0)
rails = list(CODE_READABILITY_PROMPT_RAILS_MAP.values())
readability_classifications = llm_classify(
    dataframe=df,
    template=CODE_READABILITY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    concurrency=20,
)["label"].tolist()
```
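It can also be convenient to keep the predictions next to the inputs for later inspection (the `readability_prediction` column name here is just an illustrative choice):
```python
# Attach the predicted labels to the evaluation dataframe
df["readability_prediction"] = readability_classifications
```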
## Evaluate Results and Plot Confusion Matrix
Evaluate the predictions against human-labeled ground-truth readability labels.
```python
import matplotlib.pyplot as plt
from pycm import ConfusionMatrix
from sklearn.metrics import classification_report

true_labels = df["readable"].map(CODE_READABILITY_PROMPT_RAILS_MAP).tolist()

print(classification_report(true_labels, readability_classifications, labels=rails))

confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=readability_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)
```
{% embed url="https://storage.googleapis.com/arize-phoenix-assets/assets/images/code-readability-cookbook.png" %}
## Get Explanations
When evaluating a dataset for readability, it can be useful to know why the LLM classified code as readable or not. The following code block runs `llm_classify` with explanations turned on so that we can inspect why the LLM made each classification. There is a speed tradeoff, since more tokens are generated, but the explanations can be highly informative when troubleshooting.
```python
readability_classifications_df = llm_classify(
dataframe=df.sample(n=5),
template=CODE_READABILITY_PROMPT_TEMPLATE,
model=model,
rails=rails,
provide_explanation=True,
concurrency=20,
)
```
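With `provide_explanation=True`, the returned dataframe includes an `explanation` column alongside each `label`; a quick way to read through them:
```python
# Print each predicted label with the LLM's reasoning for it
print(readability_classifications_df[["label", "explanation"]].to_string())
```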
## Compare Models
Run the same evaluation with different models:
```python
# GPT-3.5
model_gpt35 = OpenAIModel(model="gpt-3.5-turbo", temperature=0.0)
# GPT-4 Turbo
model_gpt4turbo = OpenAIModel(model="gpt-4-turbo-preview", temperature=0.0)
```
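A minimal sketch of the comparison loop, reusing `llm_classify`, the `rails`, and the `true_labels` from the previous sections (the `accuracy_score` helper from scikit-learn is an addition here, not part of the original notebook):
```python
from sklearn.metrics import accuracy_score

# Classify the same subset with each candidate model and compare accuracy
for name, candidate_model in [("gpt-3.5-turbo", model_gpt35), ("gpt-4-turbo", model_gpt4turbo)]:
    predictions = llm_classify(
        dataframe=df,
        template=CODE_READABILITY_PROMPT_TEMPLATE,
        model=candidate_model,
        rails=rails,
        concurrency=20,
    )["label"].tolist()
    print(f"{name}: accuracy = {accuracy_score(true_labels, predictions):.2f}")
```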