# Homework 3: LLM-as-Judge for Recipe Bot Evaluation with Phoenix
## Evaluate "Adherence to Dietary Preferences" for the Recipe Bot
**Example**: If a user asks for a "vegan" recipe, does the bot provide one that is actually vegan?
We'll provide ~2000 starter Recipe Bot traces and detailed criteria for this failure mode if you want a head start! We picked this criterion because it's fairly easy to align an LLM judge on dietary restrictions - the rules are generally clear and objective.
## Tools You'll Use
- **Phoenix**: For generating and viewing your traces and evaluations.
- Your preferred LLM (for crafting the judge)
- Your critical thinking & prompt engineering skills!
- The `judgy` Python library: [github.com/ai-evals-course/judgy](https://github.com/ai-evals-course/judgy)
## Understanding Phoenix in This Assignment
**Phoenix** is an observability platform that helps you collect, trace, and evaluate LLM applications. In this assignment, you'll use Phoenix to:
1. **Collect Recipe Bot traces**: Automatically capture user queries, bot responses, and metadata
2. **Run LLM-as-Judge evaluations**: Use Phoenix evals to systematically evaluate traces
3. **Log evaluation results**: Store judge predictions and ground truth labels for analysis
4. **Monitor performance**: Track judge accuracy and bias over time
### Key Phoenix Concepts You'll Use:
- **Spans**: Individual units of work (e.g., a single Recipe Bot query-response)
- **SpanQuery**: Query language to filter and retrieve specific traces
- **Phoenix Evals**: Framework for running LLM evaluations at scale
- **Client Annotations**: Logging evaluation results back to Phoenix using the new client API
> **📚 Phoenix Methods Guide**: For detailed examples and usage patterns, see [phoenix_methods_guide.md](phoenix_methods_guide.md)
## Three Implementation Options
You have three options for how much of the pipeline to implement yourself:
### Option 1: Full Implementation (Most Learning)
Start from scratch and implement the complete pipeline:
- Generate your own Recipe Bot traces with Phoenix
- Label your own data using Phoenix evals
- Build the entire evaluation workflow
- **Learning**: Complete end-to-end experience with LLM-as-Judge methodology and Phoenix observability
### Option 2: Start with Raw Traces (Medium Implementation)
Use our provided traces from Phoenix and focus on the evaluation:
- Skip trace generation, start with labeling
- Implement judge development and evaluation using Phoenix evals
- Focus on the core LLM-as-Judge workflow with Phoenix observability
- **Learning**: Judge development, bias correction, and statistical evaluation with Phoenix
### Option 3: Start with Labeled Data (Judge Development Focus)
Use our provided labeled traces from Phoenix:
- Skip trace generation and labeling
- Focus on judge prompt engineering and evaluation using Phoenix evals
- Implement the statistical correction workflow
- **Learning**: Judge optimization and bias correction techniques with Phoenix
Choose the option that best fits your learning goals and available time!
## Assignment Steps: From Labels to Confident Measurement 📊
### Step 1: Set Up Phoenix and Generate Traces *[Option 1 starts here]*
**Phoenix Setup:**
1. Install Phoenix: `pip install arize-phoenix`
2. Boot up your local Phoenix instance with `phoenix serve` and visit it at http://localhost:6006
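A minimal sketch of wiring your code to that local instance, assuming the default collector endpoint and the `recipe-agent` project name used by the reference scripts (adjust both to your setup):
```python
# Register an OpenTelemetry tracer provider that sends spans to the local Phoenix
# instance started with `phoenix serve`.
from phoenix.otel import register

tracer_provider = register(
    project_name="recipe-agent",                 # use your own project name if you prefer
    endpoint="http://localhost:6006/v1/traces",  # default local Phoenix collector
)
tracer = tracer_provider.get_tracer(__name__)
```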
**Generate Traces:**
- **Option 1**: Implement your own trace generation script that:
- Loads dietary queries from `data/dietary_queries.csv`
- Calls your Recipe Bot for each query
- Uses Phoenix instrumentation to log traces with span attributes (query, dietary_restriction, response)
- Handles concurrency efficiently (consider using ThreadPoolExecutor)
- **Option 2**: Use our provided traces from Phoenix
- Run `generate_traces.py` to build your traces.
**In Phoenix, go to Projects → recipe-agent (or Projects → YOUR_PROJECT_NAME if you wrote your own script) to view your traces.** If you're taking Option 1, a trace-generation sketch follows the tip below.
> **💡 Tip**: See [phoenix_methods_guide.md](phoenix_methods_guide.md#tracing) for Phoenix instrumentation examples
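The sketch below shows one way to structure the Option 1 trace-generation script. `run_recipe_bot` is a hypothetical placeholder for your Recipe Bot call, and the CSV column and span attribute names are assumptions; match whatever your data and judge expect.
```python
# Sketch: emit one traced span per dietary query, with a small thread pool for concurrency.
import csv
from concurrent.futures import ThreadPoolExecutor

def trace_one(row):
    query, restriction = row["query"], row["dietary_restriction"]
    with tracer.start_as_current_span("recipe_bot_query") as span:
        span.set_attribute("openinference.span.kind", "CHAIN")  # lets SpanQuery filter on span_kind
        span.set_attribute("input.value", query)
        span.set_attribute("dietary_restriction", restriction)
        response = run_recipe_bot(query)                         # hypothetical Recipe Bot call
        span.set_attribute("output.value", response)
    return response

with open("data/dietary_queries.csv") as f:
    rows = list(csv.DictReader(f))

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(trace_one, rows))
```
It reuses the `tracer` from the setup sketch above; Phoenix's `phoenix.otel` instrumentation can also capture LLM calls automatically if you prefer not to create spans by hand.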
### Step 2: Label Your Data *[Option 2 starts here]*
**Manual Labeling (Ideal, but a bit tedious):**
- Sample a subset of traces (150-200) from your collected data
- Manually label each trace as "Pass" or "Fail" based on the dietary adherence criteria
- This creates the most reliable ground truth for your evaluation
- Store labels in a CSV file for easy manipulation
**Automated Labeling (Alternative):**
- Implement a "ground truth LLM-as-Judge evaluator" that:
- Uses a more powerful model (e.g., GPT-4) for higher accuracy
- Creates a detailed labeling prompt with clear Pass/Fail criteria
- Implements robust output parsing to extract labels and explanations
- Logs ground truth labels back to Phoenix using the new client API
- **Note**: While automated labeling is faster, it introduces potential bias since you're using an LLM to create ground truth. Manual labeling is preferred for the most reliable evaluation.
- Consider sampling a subset (150-200 traces) for labeling to manage costs
**You can either implement your own script for this, or simply run `label_data.py`.** A sketch of the automated labeling approach follows the tip below.
> **💡 Tip**: See [phoenix_methods_guide.md](phoenix_methods_guide.md#evaluation-methods) for Phoenix evals examples
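If you go the automated route, Phoenix's `llm_generate` can run a labeling prompt over a dataframe of traces. The template, column names, and `gpt-4o` choice below are illustrative assumptions rather than the reference implementation (older Phoenix versions use `model_name=` instead of `model=`):
```python
# Sketch: automated ground-truth labeling over a sample of traces.
import json
import pandas as pd
from phoenix.evals import OpenAIModel, llm_generate

LABEL_TEMPLATE = """You are checking whether a recipe respects a dietary restriction.
Dietary restriction: {dietary_restriction}
User query: {query}
Recipe Bot response: {response}
Respond with JSON containing an "explanation" and a "label" of "Pass" or "Fail"."""

def parse_label(output: str, row_index: int) -> dict:
    # Robust-ish parsing: fall back to keeping the raw output for manual review.
    try:
        parsed = json.loads(output)
        return {"label": parsed["label"], "explanation": parsed["explanation"]}
    except Exception:
        return {"label": None, "explanation": output}

traces = pd.read_csv("data/raw_traces.csv").sample(150, random_state=42)
labels = llm_generate(
    dataframe=traces,
    template=LABEL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    output_parser=parse_label,
)
labeled = pd.concat([traces.reset_index(drop=True), labels.reset_index(drop=True)], axis=1)
labeled.to_csv("data/labeled_traces.csv", index=False)
```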
### Step 3: Split Your Labeled Data *[Option 3 starts here]*
- Implement a data splitting script (sketched after the tips below) that:
- Loads labeled traces from Phoenix
- Uses stratified splitting to maintain label distribution
- Creates Train (~15%), Dev (~40%), and Test (~45%) splits
- Saves splits as Phoenix datasets for further experimentation, with the proper fields (ground truth label, dietary restriction, input query, etc.)
- Consider using scikit-learn's `train_test_split` with stratification
> **💡 Tip**: See [phoenix_methods_guide.md](phoenix_methods_guide.md#tracing) for querying traces from Phoenix
> **💡 Tip**: See [phoenix_methods_guide.md](phoenix_methods_guide.md#datasets) for creating Phoenix datasets
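A sketch of the split itself, assuming the labeled CSV has a `label` column (creating the corresponding Phoenix datasets is covered in the datasets guide linked above):
```python
# Sketch: stratified ~15/40/45 train/dev/test split of the labeled traces.
import pandas as pd
from sklearn.model_selection import train_test_split

labeled = pd.read_csv("data/labeled_traces.csv")

# Carve out ~15% for Train, then split the remaining 85% into ~40/45 (Dev/Test).
train, rest = train_test_split(
    labeled, train_size=0.15, stratify=labeled["label"], random_state=42
)
dev, test = train_test_split(
    rest, train_size=0.40 / 0.85, stratify=rest["label"], random_state=42
)

train.to_csv("data/train_set.csv", index=False)
dev.to_csv("data/dev_set.csv", index=False)
test.to_csv("data/test_set.csv", index=False)
```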
### Step 4: Develop Your LLM-as-Judge Prompt *[All options continue from here]*
Craft a clear prompt with:
- The specific task/criterion
- Precise Pass/Fail definitions
- 2-3 clear few-shot examples (input, Recipe Bot output, desired judge reasoning & Pass/Fail label) taken from your Train set
- The structured output format you expect from the judge (e.g., JSON with reasoning and answer)
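One possible shape for the judge prompt; the few-shot examples here are placeholders, so swap in real traces from your Train split:
```python
# Sketch: judge prompt structure with Pass/Fail definitions, few-shot examples,
# and an explicit JSON output format.
JUDGE_PROMPT = """You are evaluating whether a Recipe Bot response adheres to the
user's stated dietary restriction.

Definitions:
- Pass: every ingredient, substitution, and suggestion is compatible with the restriction.
- Fail: anything in the response violates the restriction.

Examples (replace with traces from your Train split):
1. Restriction: vegan | Query: "vegan pasta recipe"
   Response: "...use nutritional yeast instead of parmesan..."
   Reasoning: no animal products are used. Label: Pass
2. Restriction: gluten-free | Query: "gluten-free stir fry"
   Response: "...add 2 tbsp regular soy sauce..."
   Reasoning: regular soy sauce contains wheat. Label: Fail

Now evaluate this trace.
Restriction: {dietary_restriction}
Query: {query}
Response: {response}

Respond with JSON containing "reasoning" and a "label" of "Pass" or "Fail"."""
```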
### Step 5: Refine & Validate Your Judge
- Implement judge evaluation (the TPR/TNR calculation is sketched after the tips below) that:
- Uses Phoenix experiments/evals to test your judge on the Dev set
- Calculates TPR and TNR metrics as Phoenix evals
- Iteratively refines the prompt based on performance
- Once finalized, experiment on your Test set for unbiased performance metrics
- **Phoenix Benefits**: Automatic logging of all evaluations, easy comparison of different prompts
> **💡 Tip**: See [phoenix_methods_guide.md](phoenix_methods_guide.md#logging-results) for logging evaluation results to Phoenix
> **💡 Tip**: See [phoenix_methods_guide.md](phoenix_methods_guide.md#experiments) for running experiments on Phoenix datasets
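A sketch of the TPR/TNR calculation, assuming your Dev set dataframe carries a ground-truth `label` column and a `judge_label` column with the judge's predictions:
```python
# Sketch: compute TPR and TNR for the judge on the Dev set.
import pandas as pd

dev = pd.read_csv("data/dev_set.csv")  # assumes judge_label has already been added

# Treat "Pass" (dietary adherence) as the positive class.
tp = ((dev["label"] == "Pass") & (dev["judge_label"] == "Pass")).sum()
fn = ((dev["label"] == "Pass") & (dev["judge_label"] == "Fail")).sum()
tn = ((dev["label"] == "Fail") & (dev["judge_label"] == "Fail")).sum()
fp = ((dev["label"] == "Fail") & (dev["judge_label"] == "Pass")).sum()

tpr = tp / (tp + fn)  # judge correctly passes adherent recipes
tnr = tn / (tn + fp)  # judge correctly fails non-adherent recipes
print(f"TPR={tpr:.2f}  TNR={tnr:.2f}")
```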
### Step 6: Measure on "New" Traces
- Implement a full evaluation script that:
- Queries all traces from Phoenix using SpanQuery
- Runs your finalized judge using Phoenix evals
- Handles large-scale evaluation efficiently
- Logs results for analysis
- This simulates evaluating production data
> **💡 Tip**: See [phoenix_methods_guide.md](phoenix_methods_guide.md#common-patterns) for complete evaluation pipeline examples
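A sketch of pulling every Recipe Bot span back out of Phoenix before running the judge; the selected attribute names mirror the trace-generation sketch in Step 1 and are assumptions about your span schema:
```python
# Sketch: query all Recipe Bot spans from Phoenix into a dataframe.
import phoenix as px
from phoenix.trace.dsl import SpanQuery

query = (
    SpanQuery()
    .where("span_kind == 'CHAIN'")
    .select(
        query="input.value",
        response="output.value",
        dietary_restriction="dietary_restriction",  # custom attribute; match what you logged
    )
)
traces_df = px.Client().query_spans(query, project_name="recipe-agent")

# From here, run your finalized judge over traces_df (e.g., with llm_generate)
# and keep its predictions for the judgy step in Step 7.
```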
### Step 7: Report Results with judgy
Report:
- The raw pass rate (p_obs) from your judge on the new traces
- The corrected true success rate (θ̂) using judgy
- The 95% Confidence Interval (CI) for θ
- Include a brief interpretation of your results (e.g., How well is the Recipe Bot adhering to dietary preferences? How confident are you in this assessment?)
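A sketch of the judgy step. The `estimate_success_rate` name, argument order, and return values are assumptions based on the library's README, so verify them against [github.com/ai-evals-course/judgy](https://github.com/ai-evals-course/judgy); the toy lists stand in for your real Test-set labels and judge predictions:
```python
# Sketch: bias-corrected success rate and confidence interval with judgy.
# Assumption: judgy exposes estimate_success_rate(test_labels, test_preds, unlabeled_preds)
# returning (theta_hat, lower_bound, upper_bound).
from judgy import estimate_success_rate

# 1 = Pass, 0 = Fail (toy values shown for illustration)
test_labels = [1, 1, 0, 0, 1]       # ground-truth labels on the Test set
test_preds = [1, 0, 0, 0, 1]        # judge predictions on the same Test set
unlabeled_preds = [1, 1, 1, 0, 1]   # judge predictions on the new, unlabeled traces

theta_hat, lo, hi = estimate_success_rate(test_labels, test_preds, unlabeled_preds)
p_obs = sum(unlabeled_preds) / len(unlabeled_preds)
print(f"p_obs={p_obs:.3f}  theta_hat={theta_hat:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```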
## Failure Mode: Adherence to Dietary Preferences
**Definition**: When a user requests a recipe with specific dietary restrictions or preferences, the Recipe Bot should provide a recipe that actually meets those restrictions and preferences.
**Examples**:
- ✅ Pass: User asks for "vegan pasta recipe" → Bot provides pasta with nutritional yeast instead of parmesan
- ❌ Fail: User asks for "vegan pasta recipe" → Bot suggests using honey as a sweetener (honey isn't vegan)
- ✅ Pass: User asks for "gluten-free bread" → Bot provides recipe using almond flour and xanthan gum
- ❌ Fail: User asks for "gluten-free bread" → Bot suggests using regular soy sauce (contains wheat) in the recipe
- ✅ Pass: User asks for "keto dinner" → Bot provides cauliflower rice with high-fat protein
- ❌ Fail: User asks for "keto dinner" → Bot includes sweet potato as a "healthy carb" (too high-carb for keto)
### Dietary Restriction Definitions (for reference; taken from OpenAI o4):
- **Vegan**: No animal products (meat, dairy, eggs, honey, etc.)
- **Vegetarian**: No meat or fish, but dairy and eggs are allowed
- **Gluten-free**: No wheat, barley, rye, or other gluten-containing grains
- **Dairy-free**: No milk, cheese, butter, yogurt, or other dairy products
- **Keto**: Very low carb (typically <20g net carbs), high fat, moderate protein
- **Paleo**: No grains, legumes, dairy, refined sugar, or processed foods
- **Pescatarian**: No meat except fish and seafood
- **Kosher**: Follows Jewish dietary laws (no pork, shellfish, mixing meat/dairy)
- **Halal**: Follows Islamic dietary laws (no pork, alcohol, proper slaughter)
- **Nut-free**: No tree nuts or peanuts
- **Low-carb**: Significantly reduced carbohydrates (typically <50g per day)
- **Sugar-free**: No added sugars or high-sugar ingredients
- **Raw vegan**: Vegan foods not heated above 118°F (48°C)
- **Whole30**: No grains, dairy, legumes, sugar, alcohol, or processed foods
- **Diabetic-friendly**: Low glycemic index, controlled carbohydrates
- **Low-sodium**: Reduced sodium content for heart health
## Sample Challenging Queries
**Contradictory Requests:**
- "I'm vegan but I really want to make something with honey - is there a good substitute?"
- "I want a cheeseburger but I'm dairy-free and vegetarian"
**Ambiguous Preferences:**
- "Something not too carb-y for dinner"
- "Something keto-ish but not super strict"
- "Dairy-free but cheese is okay sometimes"
## Key Metrics to Understand
- **True Positive Rate (TPR)**: How often the judge correctly identifies adherent recipes
- **True Negative Rate (TNR)**: How often the judge correctly identifies non-adherent recipes
- **Corrected Success Rate**: True adherence rate accounting for judge errors
- **95% Confidence Interval**: Range for the corrected success rate
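The corrected rate adjusts the observed pass rate for the judge's measured error rates. A toy calculation of the point estimate (judgy additionally provides the confidence interval), assuming TPR = 0.90, TNR = 0.80, and an observed pass rate of 0.75:
```python
# Toy example of the correction behind the corrected success rate.
tpr, tnr = 0.90, 0.80   # judge accuracy measured on the Test set
p_obs = 0.75            # raw pass rate from the judge on new traces

# p_obs = theta*TPR + (1 - theta)*(1 - TNR)  =>  theta = (p_obs + TNR - 1) / (TPR + TNR - 1)
theta_hat = (p_obs + tnr - 1) / (tpr + tnr - 1)
print(round(theta_hat, 3))  # 0.786: the true adherence rate is likely higher than the raw 0.75
```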
## Deliverables
1. **Your labeled dataset** with train/dev/test splits (from Phoenix)
2. **Your final judge prompt** with few-shot examples
3. **Judge performance metrics** (TPR/TNR on test set)
4. **Final evaluation results** using judgy (raw rate, corrected rate, confidence interval)
## Reference Implementation
This repository contains a complete reference implementation showing one approach to this assignment. You can:
- **Study the code structure** to understand the workflow
- **Use our provided data** as a starting point
- **Implement your own version** from scratch for full learning value
### Phoenix Integration
- Use `phoenix.otel` for automatic instrumentation of your Recipe Bot
- Use `SpanQuery().where("span_kind == 'CHAIN'")` to retrieve traces
- Use `llm_generate` from `phoenix.evals` for scalable evaluation
- Use the new client API to log evaluation results back to Phoenix
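One way to attach judge results to their spans (shown with `SpanEvaluations`, which is widely documented; the newer client annotation API mentioned above is covered in the methods guide). `judge_results_df` is a hypothetical dataframe indexed by span id, as returned by `query_spans`:
```python
# Sketch: log judge results back to Phoenix so they appear alongside the traces.
import phoenix as px
from phoenix.trace import SpanEvaluations

# judge_results_df: indexed by span id, with "label", "score", and "explanation"
# columns produced by your judge.
px.Client().log_evaluations(
    SpanEvaluations(
        eval_name="Dietary Adherence",
        dataframe=judge_results_df[["label", "score", "explanation"]],
    )
)
```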
> **📚 Complete Guide**: For detailed Phoenix API usage, examples, and best practices, see [phoenix_methods_guide.md](phoenix_methods_guide.md)
### Reference Implementation Structure
```
homeworks/hw3/
├── scripts/
│   ├── generate_traces.py       # Generate Recipe Bot traces with parallel processing
│   ├── label_data.py            # Use GPT-4o to label ground truth (150 examples)
│   ├── split_data.py            # Split data into train/dev/test sets
│   ├── develop_judge.py         # Develop LLM judge with few-shot examples
│   ├── evaluate_judge.py        # Evaluate judge performance on test set
│   └── run_full_evaluation.py   # Run judge on all traces and compute metrics
├── data/
│   ├── dietary_queries.csv      # 60 challenging edge case queries we crafted
│   ├── raw_traces.csv           # Generated Recipe Bot traces (~2400 total)
│   ├── labeled_traces.csv       # Traces with ground truth labels (150)
│   ├── train_set.csv            # Training examples for few-shot (~23)
│   ├── dev_set.csv              # Development set for judge refinement (~60)
│   └── test_set.csv             # Test set for final evaluation (~67)
├── results/
│   ├── judge_performance.json   # TPR/TNR metrics on test set
│   ├── final_evaluation.json    # Results with confidence intervals
│   └── judge_prompt.txt         # Final judge prompt
├── README.md                    # Project spec and general project guide
└── ai_evals_hw3_solution.ipynb  # Guide containing helpful Phoenix methods and links to Phoenix documentation
```