@arizeai/phoenix-mcp

tool-calling-eval.md•4.37 KiB

# Agent Function Calling Eval

The Agent Function Call eval can be used to determine how well a model selects a tool to use, extracts the right parameters from the user query, and generates the tool call code.

{% embed url="https://www.youtube.com/watch?v=Rsu-UZ1ZVZU" %}
Demo
{% endembed %}

## **Function Calling Eval Template**

```python
TOOL_CALLING_PROMPT_TEMPLATE = """
You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would answer the question. The tool
calls have been generated by a separate agent, and chosen from the list of
tools provided below. It is your job to decide whether that agent chose
the right tool to call.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Called]: {tool_call}
    [END DATA]

Your response must be single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the chosen tool would not answer the question,
the tool includes information that is not presented in the question,
or that the tool signature includes parameter values that don't match
the formats specified in the tool signatures below.

"correct" means the correct tool call was chosen, the correct parameters
were extracted from the question, the tool call generated is runnable and correct,
and that no outside information not present in the question was used
in the generated question.

    [Tool Definitions]: {tool_definitions}
"""
```

{% hint style="info" %}
We are continually iterating on our templates. View the most up-to-date template [on GitHub](https://github.com/Arize-ai/phoenix/blob/ecef5242d2f9bb39a2fdf5d96a2b1841191f7944/packages/phoenix-evals/src/phoenix/evals/span_templates.py#L189).
{% endhint %}

## **Running an Agent Eval using the Function Calling Template**

```python
from phoenix.evals import (
    TOOL_CALLING_PROMPT_RAILS_MAP,
    TOOL_CALLING_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# the rails object will be used to snap responses to "correct"
# or "incorrect"
rails = list(TOOL_CALLING_PROMPT_RAILS_MAP.values())
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# Loop through the specified dataframe and run each row
# through the specified model and prompt. llm_classify
# will run requests concurrently to improve performance.
tool_call_evaluations = llm_classify(
    dataframe=df,
    template=TOOL_CALLING_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True
)
```

Parameters:

* `df` - a dataframe of cases to evaluate. The dataframe must have these columns to match the default template:
  * `question` - the query made to the model. If you've [exported spans from Phoenix](https://app.gitbook.com/o/ZmsT56faZH0gUFkMMqBk/s/gtQcEYlwzTfZSAnHREvw/) to evaluate, this will be the `llm.input_messages` column in your exported data.
  * `tool_call` - information on the tool called and the parameters included. If you've [exported spans from Phoenix](../../../tracing/how-to-tracing/importing-and-exporting-traces/extract-data-from-spans.md) to evaluate, this will be the `llm.function_call` column in your exported data.
  * `tool_definitions` - the definitions of the tools available to the agent. This column fills the `{tool_definitions}` placeholder at the end of the template.

## Parameter Extraction Only

This template evaluates only the parameter extraction step of a router, not the tool selection:

```python
You are comparing a function call response to a question and trying to determine if the generated call has extracted the exact right parameters from the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [LLM Response]: {response}
    ************
    [END DATA]

Compare the parameters in the generated function against the JSON provided below.
The parameters extracted from the question must match the JSON below exactly.
Your response must be single word, either "correct", "incorrect", or "not-applicable",
and should not contain any text or characters aside from that word.

"correct" means the function call parameters match the JSON below and provides only relevant information.
"incorrect" means that the parameters in the function do not match the JSON schema below exactly, or the generated function does not correctly answer the user's question. You should also respond with "incorrect" if the response makes up information that is not in the JSON schema.
"not-applicable" means that response was not a function call.

Here is more information on each function:
{function_definitions}
```
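To make the placeholder and rails mechanics above more concrete, here is a minimal, standard-library-only sketch of how one row of evaluation data might fill a template and how a raw model reply might be snapped to a rail. It is an illustration under stated assumptions, not how `llm_classify` is implemented: `TEMPLATE_SKETCH`, `render_prompt`, `snap_to_rails`, and the `get_weather` tool are all hypothetical names invented for this example.

```python
import json
from typing import List, Optional

# Shortened stand-in for the tool calling template; only the
# placeholders matter for this sketch (not the real Phoenix constant).
TEMPLATE_SKETCH = (
    "[Question]: {question}\n"
    "[Tool Called]: {tool_call}\n"
    "[Tool Definitions]: {tool_definitions}\n"
)

# Hypothetical tool definition and evaluation row. Each key mirrors a
# dataframe column name, which in turn mirrors a template placeholder.
tool_definitions = [
    {"name": "get_weather", "parameters": {"city": "string", "unit": "string"}}
]
row = {
    "question": "What is the weather in Paris in celsius?",
    "tool_call": json.dumps(
        {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
    ),
    "tool_definitions": json.dumps(tool_definitions),
}


def render_prompt(template: str, row: dict) -> str:
    """Fill every {placeholder} in the template from the matching row key."""
    return template.format(**row)


def snap_to_rails(raw_response: str, rails: List[str]) -> Optional[str]:
    """Return the rail matching the cleaned reply, or None if nothing matches."""
    cleaned = raw_response.strip().strip(".\"'").lower()
    return cleaned if cleaned in rails else None


prompt = render_prompt(TEMPLATE_SKETCH, row)
print(prompt)
print(snap_to_rails(' "Correct". ', ["correct", "incorrect"]))
```

A row that lacks one of the keys raises `KeyError` in `render_prompt`, which reflects why the dataframe columns need to match the placeholders of whichever template is used. The parameter extraction template has different placeholders, so it would need different columns.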
