# Pre-Built Evals

The following are simple functions on top of the LLM evals building blocks that are pre-tested with benchmark data.

{% hint style="info" %}
All eval templates are tested against golden data that is available as part of the LLM eval library's [benchmarked data](./#how-we-benchmark-pre-tested-evals) and target precision at 70-90% and F1 at 70-85%.
{% endhint %}

<table data-view="cards"><thead><tr><th align="center"></th><th align="center"></th><th align="center"></th><th align="center"></th><th data-hidden data-card-target data-type="content-ref"></th></tr></thead><tbody>
<tr><td align="center"><strong>Hallucination Eval</strong></td><td align="center"><a href="hallucinations.md">Hallucinations on answers to public and private data</a></td><td align="center"><em>Tested on:</em></td><td align="center">Hallucination QA Dataset, Hallucination RAG Dataset</td><td><a href="hallucinations.md">hallucinations.md</a></td></tr>
<tr><td align="center"><strong>Q&#x26;A Eval</strong></td><td align="center"><a href="q-and-a-on-retrieved-data.md">Private data Q&#x26;A Eval</a></td><td align="center"><em>Tested on:</em></td><td align="center">WikiQA</td><td><a href="q-and-a-on-retrieved-data.md">q-and-a-on-retrieved-data.md</a></td></tr>
<tr><td align="center"><strong>Retrieval Eval</strong></td><td align="center"><a href="retrieval-rag-relevance.md">RAG individual retrieval</a></td><td align="center"><em>Tested on:</em></td><td align="center">MS Marco, WikiQA</td><td><a href="retrieval-rag-relevance.md">retrieval-rag-relevance.md</a></td></tr>
<tr><td align="center"><strong>Summarization Eval</strong></td><td align="center"><a href="summarization-eval.md">Summarization performance</a></td><td align="center"><em>Tested on:</em></td><td align="center">GigaWord, CNNDM, Xsum</td><td><a href="summarization-eval.md">summarization-eval.md</a></td></tr>
<tr><td align="center"><strong>Code Generation Eval</strong></td><td align="center"><a href="code-generation-eval.md">Code writing correctness and readability</a></td><td align="center"><em>Tested on:</em></td><td align="center">WikiSQL, HumanEval, CodeXGLUE</td><td><a href="code-generation-eval.md">code-generation-eval.md</a></td></tr>
<tr><td align="center"><strong>Toxicity Eval</strong></td><td align="center"><a href="toxicity.md">Is the AI response racist, biased, or toxic</a></td><td align="center"><em>Tested on:</em></td><td align="center">WikiToxic</td><td><a href="toxicity.md">toxicity.md</a></td></tr>
<tr><td align="center"><strong>AI vs. Human</strong></td><td align="center"><a href="ai-vs-human-groundtruth.md">Compare human and AI answers</a></td><td align="center"></td><td align="center"></td><td></td></tr>
<tr><td align="center"><strong>Reference Link</strong></td><td align="center"><a href="reference-link-evals.md">Check citations</a></td><td align="center"></td><td align="center"></td><td></td></tr>
<tr><td align="center"><strong>User Frustration</strong></td><td align="center"><a href="user-frustration.md">Detect user frustration</a></td><td align="center"></td><td align="center"></td><td></td></tr>
<tr><td align="center"><strong>SQL Generation</strong></td><td align="center"><a href="sql-generation-eval.md">Evaluate SQL correctness given a query</a></td><td align="center"></td><td align="center"></td><td></td></tr>
<tr><td align="center"><strong>Agent Function Calling</strong></td><td align="center"><a href="tool-calling-eval.md">Agent tool use and parameters</a></td><td align="center"></td><td align="center"></td><td></td></tr>
<tr><td align="center"><strong>Audio Emotion</strong></td><td align="center"><a href="audio-emotion-detection.md">Classify emotions from audio files</a></td><td align="center"></td><td align="center"></td><td></td></tr>
</tbody></table>

## Supported Models

The models below can be instantiated and used in the LLM Eval functions. They are also directly callable with strings:

```python
from phoenix.evals import OpenAIModel

model = OpenAIModel(model_name="gpt-4", temperature=0.6)
model("What is the largest coastal city in France?")
```

We currently support a growing set of models for LLM Evals; please check out the [Eval Models section for usage](../evaluation-models.md).
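Each pre-built eval above ships as a prompt template plus output "rails" that can be run over a dataframe of records with `llm_classify`. Below is a minimal sketch of running the Hallucination Eval this way; it assumes the `phoenix.evals` exports (`llm_classify`, `HALLUCINATION_PROMPT_TEMPLATE`, `HALLUCINATION_PROMPT_RAILS_MAP`) and the column names the template expects, so check the eval's own page and your installed version for the exact interface.

```python
import pandas as pd

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Hypothetical example data: the hallucination template expects the user
# input, the retrieved reference text, and the model's answer per row.
df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "reference": ["Paris is the capital and largest city of France."],
        "output": ["The capital of France is Paris."],
    }
)

model = OpenAIModel(model_name="gpt-4", temperature=0.0)

# Rails constrain the judge's output to the template's allowed labels.
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())

results = llm_classify(
    dataframe=df,
    model=model,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=rails,
    provide_explanation=True,  # also return the judge's reasoning
)
print(results["label"])
```

The returned dataframe carries one label per input row (plus an explanation column when `provide_explanation=True`), which is what gets compared against the golden data to produce the precision and F1 targets noted above.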
