---
description: How to migrate from the deprecated Phoenix Evals interfaces to the new evals API
globs:
alwaysApply: false
---
# Phoenix Evals Migration Guide
## ⚠️ DEPRECATED INTERFACES
The following interfaces are **DEPRECATED** and should no longer be used:
- `phoenix.evals.models` module (all model classes)
- `phoenix.evals.llm_classify` function
- `phoenix.evals.llm_generate` function
- `phoenix.evals.run_evals` function
- `phoenix.evals.templates.PromptTemplate` class
- All legacy evaluator classes in the `phoenix.evals` root module
**Legacy documentation**: https://arize-phoenix.readthedocs.io/projects/evals/en/latest/api/legacy.html
## Migration Overview
The new Phoenix Evals API (v2.0+) provides:
- **Unified LLM interface** via `phoenix.evals.llm.LLM`
- **Composable evaluators** with `create_classifier` and `create_evaluator`
- **Efficient batch processing** with `evaluate_dataframe`
- **Better error handling** and async support
- **Structured outputs** with automatic scoring
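Taken together, these pieces replace the old classify-then-merge workflow with a single pass over a dataframe. A minimal preview (a sketch that uses only the interfaces demonstrated in the examples below):
```python
import pandas as pd
from phoenix.evals import create_classifier, evaluate_dataframe
from phoenix.evals.llm import LLM
# One LLM client is shared by every evaluator
llm = LLM(provider="openai", model="gpt-4o")
# Labels map directly to scores via `choices`
relevance = create_classifier(
    name="relevance",
    prompt_template="Is the response relevant?\n\nQuery: {input}\nResponse: {output}",
    llm=llm,
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
# One call evaluates every row of the dataframe
df = pd.DataFrame({"input": ["How do I reset my password?"], "output": ["Use the 'Forgot password' link."]})
results_df = evaluate_dataframe(dataframe=df, evaluators=[relevance])
```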
## Complete Migration Mapping
### 1. Model Interfaces
| **DEPRECATED** | **NEW INTERFACE** |
| ------------------------------------------------- | ----------------------------------- |
| `from phoenix.evals.models import OpenAIModel` | `from phoenix.evals.llm import LLM` |
| `from phoenix.evals.models import AnthropicModel` | `from phoenix.evals.llm import LLM` |
| `from phoenix.evals.models import GeminiModel` | `from phoenix.evals.llm import LLM` |
| `from phoenix.evals.models import VertexAIModel` | `from phoenix.evals.llm import LLM` |
| `from phoenix.evals.models import BedrockModel` | `from phoenix.evals.llm import LLM` |
| `from phoenix.evals.models import LiteLLMModel` | `from phoenix.evals.llm import LLM` |
### 2. Core Functions
| **DEPRECATED** | **NEW INTERFACE** |
| ---------------------------- | ---------------------------------------------------------------------- |
| `phoenix.evals.llm_classify` | `phoenix.evals.create_classifier` + `phoenix.evals.evaluate_dataframe` |
| `phoenix.evals.llm_generate` | `phoenix.evals.llm.LLM.generate_text` or custom evaluator |
| `phoenix.evals.run_evals` | `phoenix.evals.evaluate_dataframe` |
### 3. Templates
| **DEPRECATED** | **NEW INTERFACE** |
| ------------------------------------------------ | ---------------------------------------------------------- |
| `phoenix.evals.templates.PromptTemplate` | Raw strings or `phoenix.evals.templating.Template` |
| `phoenix.evals.templates.ClassificationTemplate` | `phoenix.evals.create_classifier` with `choices` parameter |
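The `rails` list from a legacy `ClassificationTemplate` becomes the keys of the `choices` dictionary, with each key mapped to the score you previously assigned by hand. A sketch of that mapping (the prompt and labels here are illustrative, not a built-in template):
```python
from phoenix.evals import create_classifier
from phoenix.evals.llm import LLM
llm = LLM(provider="openai", model="gpt-4o")
# Old: ClassificationTemplate(template="...", rails=["correct", "incorrect"])
# New: the rails become `choices` keys, each mapped to a numeric score
correctness = create_classifier(
    name="correctness",
    prompt_template="Is the answer correct?\n\nQuestion: {input}\nAnswer: {output}",
    llm=llm,
    choices={"correct": 1.0, "incorrect": 0.0},
)
```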
### 4. Evaluators
| **DEPRECATED** | **NEW INTERFACE** |
| -------------------------------------- | ------------------------------------------------- |
| `phoenix.evals.LLMEvaluator` | `phoenix.evals.LLMEvaluator` (new implementation) |
| `phoenix.evals.HallucinationEvaluator` | `phoenix.evals.metrics.HallucinationEvaluator` |
| `phoenix.evals.RelevanceEvaluator` | Create with `phoenix.evals.create_classifier` |
| `phoenix.evals.ToxicityEvaluator` | Create with `phoenix.evals.create_classifier` |
| `phoenix.evals.QAEvaluator` | Create with `phoenix.evals.create_classifier` |
| `phoenix.evals.SummarizationEvaluator` | Create with `phoenix.evals.create_classifier` |
| `phoenix.evals.SQLEvaluator` | Create with `phoenix.evals.create_classifier` |
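Built-in evaluators that carry over now live in `phoenix.evals.metrics` and plug into the same batch API. A sketch with `HallucinationEvaluator` (assuming it is configured with an `llm` at construction; check the metrics module for its exact parameters and the dataframe columns it expects):
```python
from phoenix.evals import evaluate_dataframe
from phoenix.evals.llm import LLM
from phoenix.evals.metrics import HallucinationEvaluator
llm = LLM(provider="openai", model="gpt-4o")
# Assumption: the built-in evaluator takes the LLM client as a constructor argument
hallucination = HallucinationEvaluator(llm=llm)
results_df = evaluate_dataframe(
    dataframe=spans_df,  # a dataframe with the columns the evaluator expects
    evaluators=[hallucination],
)
```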
## Step-by-Step Migration Examples
### Example 1: Basic Classification (llm_classify → create_classifier)
**DEPRECATED:**
```python
from phoenix.evals import llm_classify
from phoenix.evals.models import OpenAIModel
from phoenix.evals.templates import PromptTemplate
# Old way
model = OpenAIModel(model="gpt-4o")
template = PromptTemplate(
template="Is the response helpful?\n\nQuery: {input}\nResponse: {output}. Respond either as 'helpful' or 'not_helpful'"
)
evals_df = llm_classify(
data=spans_df,
model=model,
rails=["helpful", "not_helpful"],
template=template,
exit_on_error=False,
provide_explanation=True,
)
# Manual score assignment
evals_df["score"] = evals_df["label"].apply(lambda x: 1 if x == "helpful" else 0)
```
**NEW:**
```python
import pandas as pd
from phoenix.evals import create_classifier, evaluate_dataframe
from phoenix.evals.llm import LLM
# New way
llm = LLM(provider="openai", model="gpt-4o")
helpfulness_evaluator = create_classifier(
name="helpfulness",
prompt_template="Is the response helpful?\n\nQuery: {input}\nResponse: {output}",
llm=llm,
choices={"helpful": 1.0, "not_helpful": 0.0}, # Automatic scoring
)
results_df = evaluate_dataframe(
dataframe=spans_df,
evaluators=[helpfulness_evaluator],
)
```
### Example 2: Multiple Evaluators
**DEPRECATED:**
```python
from phoenix.evals import llm_classify
from phoenix.evals.models import OpenAIModel
model = OpenAIModel(model="gpt-4o")
# Multiple separate calls
relevance_df = llm_classify(data=df, model=model, rails=["relevant", "irrelevant"], ...)
helpfulness_df = llm_classify(data=df, model=model, rails=["helpful", "not_helpful"], ...)
toxicity_df = llm_classify(data=df, model=model, rails=["toxic", "non_toxic"], ...)
# Manual merging required
```
**NEW:**
```python
from phoenix.evals import create_classifier, evaluate_dataframe
from phoenix.evals.llm import LLM
llm = LLM(provider="openai", model="gpt-4o")
# Create multiple evaluators
relevance_evaluator = create_classifier(
name="relevance",
prompt_template="Is the response relevant?\n\nQuery: {input}\nResponse: {output}",
llm=llm,
choices={"relevant": 1.0, "irrelevant": 0.0},
)
helpfulness_evaluator = create_classifier(
name="helpfulness",
prompt_template="Is the response helpful?\n\nQuery: {input}\nResponse: {output}",
llm=llm,
choices={"helpful": 1.0, "not_helpful": 0.0},
)
toxicity_evaluator = create_classifier(
name="toxicity",
prompt_template="Is the response toxic?\n\nQuery: {input}\nResponse: {output}",
llm=llm,
choices={"toxic": 0.0, "non_toxic": 1.0},
)
# Single call evaluates all metrics
results_df = evaluate_dataframe(
dataframe=df,
evaluators=[relevance_evaluator, helpfulness_evaluator, toxicity_evaluator],
)
```
### Example 3: Text Generation (llm_generate → LLM.generate_text)
**DEPRECATED:**
```python
from phoenix.evals import llm_generate
from phoenix.evals.models import OpenAIModel
from phoenix.evals.templates import PromptTemplate
model = OpenAIModel(model="gpt-4o")
template = PromptTemplate(template="Generate a response to: {query}")
generated_df = llm_generate(
dataframe=df,
template=template,
model=model,
)
```
**NEW:**
```python
from phoenix.evals.llm import LLM
llm = LLM(provider="openai", model="gpt-4o")
# For single generations
response = llm.generate_text(prompt="Generate a response to: How do I reset my password?")
# For batch processing with dataframes
def generate_responses(row):
prompt = f"Generate a response to: {row['query']}"
return llm.generate_text(prompt=prompt)
df['generated_response'] = df.apply(generate_responses, axis=1)
```
### Example 4: Custom Evaluators
**DEPRECATED:**
```python
from phoenix.evals import LLMEvaluator
from phoenix.evals.models import OpenAIModel
class CustomEvaluator(LLMEvaluator):
def evaluate(self, input_text, output_text):
# Custom logic
pass
evaluator = CustomEvaluator(model=OpenAIModel(model="gpt-4o"))
```
**NEW:**
```python
from phoenix.evals import create_evaluator, LLMEvaluator
from phoenix.evals.llm import LLM
# Option 1: Function-based evaluator
@create_evaluator(name="custom_metric", direction="maximize")
def custom_evaluator(input: str, output: str) -> float:
# Custom heuristic logic
return len(output) / len(input) # Example metric
# Option 2: LLM-based evaluator
llm = LLM(provider="openai", model="gpt-4o")
class CustomLLMEvaluator(LLMEvaluator):
def __init__(self):
super().__init__(
name="custom_llm_eval",
llm=llm,
prompt_template="Evaluate this response: {input} -> {output}",
)
def _evaluate(self, eval_input):
# Custom LLM evaluation logic
pass
```
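Both styles plug into the same batch call. A short sketch (assuming `evaluate_dataframe` accepts the decorated function and the `LLMEvaluator` subclass instance alike in its `evaluators` list):
```python
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(
    dataframe=df,  # df as in the earlier examples
    evaluators=[custom_evaluator, CustomLLMEvaluator()],
)
```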
### Example 5: Different LLM Providers
**DEPRECATED:**
```python
from phoenix.evals.models import OpenAIModel, AnthropicModel, GeminiModel
openai_model = OpenAIModel(model="gpt-4o")
anthropic_model = AnthropicModel(model="claude-3-sonnet-20240229")
```
**NEW:**
```python
from phoenix.evals.llm import LLM
# All providers use the same interface
openai_llm = LLM(provider="openai", model="gpt-4o")
litellm_llm = LLM(provider="litellm", model="claude-3-sonnet-20240229")
```
## Migration Checklist
When migrating your code:
1. **✅ Update imports**
- Replace `phoenix.evals.models.*` with `phoenix.evals.llm.LLM`
- Replace `phoenix.evals.llm_classify` with `phoenix.evals.create_classifier`
- Replace `phoenix.evals.llm_generate` with direct LLM calls
2. **✅ Update model instantiation**
- Use unified `LLM(provider="...", model="...")` interface
- Remove provider-specific model classes
3. **✅ Replace function calls**
- Convert `llm_classify` to `create_classifier` + `evaluate_dataframe`
- Convert `llm_generate` to `LLM.generate_text`
- Convert `run_evals` to `evaluate_dataframe`
4. **✅ Update templates**
- Use raw strings instead of `PromptTemplate` objects
- Replace `rails` parameter with `choices` dictionary
5. **✅ Update evaluators**
- Use `create_classifier` for classification tasks
- Use `create_evaluator` decorator for custom metrics
- Import built-in evaluators from `phoenix.evals.metrics`
6. **✅ Test the migration**
- Verify outputs match expected format
- Check that scores are properly assigned
- Ensure error handling works as expected
## Getting Help
- **New API Documentation**: https://arize-phoenix.readthedocs.io/projects/evals/en/latest/api/evals.html
- **Legacy API Reference**: https://arize-phoenix.readthedocs.io/projects/evals/en/latest/api/legacy.html