evaluate_llm_response_on_multiple_criteria
Evaluate LLM responses against multiple custom criteria to get per-criterion scores and critiques for detailed quality analysis.
Instructions
Evaluate an LLM's response to a prompt across multiple evaluation criteria.
This function uses an Atla evaluation model under the hood to return a list of
dictionaries, each containing an evaluation score and critique for a given
criterion.
Returns:
list[dict[str, str]]: A list of dictionaries containing the evaluation score
and critique, in the format `{"score": <score>, "critique": <critique>}`.
The order of the dictionaries in the list will match the order of the
criteria in the `evaluation_criteria_list` argument.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| evaluation_criteria_list | Yes | A list of evaluation criteria against which to evaluate the `llm_response`; one evaluation is run per criterion. | |
| llm_prompt | Yes | The prompt given to an LLM to generate the `llm_response` to be evaluated. | |
| llm_response | Yes | The output generated by the model in response to the `llm_prompt`, which needs to be evaluated. | |
| expected_llm_output | No | A reference or ideal answer to compare against the `llm_response`. This is useful in cases where a specific output is expected from the model. Defaults to None. | |
| llm_context | No | Additional context or information provided to the model during generation. This is useful in cases where the model was provided with additional information that is not part of the `llm_prompt` or `expected_llm_output` (e.g., a RAG retrieval context). Defaults to None. | |
| model_id | No | The Atla model ID to use for evaluation. `atla-selene` is the flagship Atla model, optimized for the highest all-round performance. `atla-selene-mini` is a compact model that is generally faster and cheaper to run. Defaults to `atla-selene`. | atla-selene |
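For illustration, a minimal set of arguments for a call to this tool might look like the following sketch. Only the key names come from the input schema above; the prompt, response, and criteria strings are invented examples:

```python
# Hypothetical arguments for `evaluate_llm_response_on_multiple_criteria`.
# Values are illustrative; only the keys are defined by the input schema.
arguments = {
    "evaluation_criteria_list": [
        "Score 1-5: Is the response factually accurate?",
        "Score 1-5: Does the response directly answer the question?",
    ],
    "llm_prompt": "What is the capital of France?",
    "llm_response": "The capital of France is Paris.",
    "model_id": "atla-selene-mini",  # optional; defaults to "atla-selene"
}
```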
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | A list of dictionaries in the format `{"score": <score>, "critique": <critique>}`, one per criterion, ordered to match `evaluation_criteria_list`. | |
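Given the two example criteria above, the returned `result` would be a two-element list in the same order. A sketch of the shape, with invented scores and critiques:

```python
# Illustrative output shape only; the scores and critiques are made up.
result = [
    {"score": "5", "critique": "The response correctly states that Paris is the capital."},
    {"score": "5", "critique": "The response answers the question directly and concisely."},
]
```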
Implementation Reference
- atla_mcp_server/server.py:202-236 (handler): The main handler function for the tool `evaluate_llm_response_on_multiple_criteria`. It accepts multiple evaluation criteria as a list and calls `evaluate_llm_response` concurrently for each criterion using `asyncio.gather`, returning a list of score/critique dicts.
```python
async def evaluate_llm_response_on_multiple_criteria(
    ctx: Context,
    evaluation_criteria_list: list[AnnotatedEvaluationCriteria],
    llm_prompt: AnnotatedLlmPrompt,
    llm_response: AnnotatedLlmResponse,
    expected_llm_output: AnnotatedExpectedLlmOutput = None,
    llm_context: AnnotatedLlmContext = None,
    model_id: AnnotatedModelId = "atla-selene",
) -> list[dict[str, str]]:
    """Evaluate an LLM's response to a prompt across *multiple* evaluation criteria.

    This function uses an Atla evaluation model under the hood to return a list of
    dictionaries, each containing an evaluation score and critique for a given
    criteria.

    Returns:
        list[dict[str, str]]: A list of dictionaries containing the evaluation score
            and critique, in the format `{"score": <score>, "critique": <critique>}`.
            The order of the dictionaries in the list will match the order of the
            criteria in the `evaluation_criteria_list` argument.
    """
    tasks = [
        evaluate_llm_response(
            ctx=ctx,
            evaluation_criteria=criterion,
            llm_prompt=llm_prompt,
            llm_response=llm_response,
            expected_llm_output=expected_llm_output,
            llm_context=llm_context,
            model_id=model_id,
        )
        for criterion in evaluation_criteria_list
    ]
    results = await asyncio.gather(*tasks)
    return results
```
- atla_mcp_server/server.py:257 (registration): Registration of the tool with the MCP server via the `mcp.tool()` decorator.
```python
mcp.tool()(evaluate_llm_response_on_multiple_criteria)
```
- atla_mcp_server/server.py:167-199 (helper): The underlying helper function `evaluate_llm_response` that performs a single evaluation via the Atla API. Called by `evaluate_llm_response_on_multiple_criteria` for each criterion.
```python
async def evaluate_llm_response(
    ctx: Context,
    evaluation_criteria: AnnotatedEvaluationCriteria,
    llm_prompt: AnnotatedLlmPrompt,
    llm_response: AnnotatedLlmResponse,
    expected_llm_output: AnnotatedExpectedLlmOutput = None,
    llm_context: AnnotatedLlmContext = None,
    model_id: AnnotatedModelId = "atla-selene",
) -> dict[str, str]:
    """Evaluate an LLM's response to a prompt using a given evaluation criteria.

    This function uses an Atla evaluation model under the hood to return a
    dictionary containing a score for the model's response and a textual critique
    containing feedback on the model's response.

    Returns:
        dict[str, str]: A dictionary containing the evaluation score and critique,
            in the format `{"score": <score>, "critique": <critique>}`.
    """
    state = cast(MCPState, ctx.request_context.lifespan_context)
    result = await state.atla_client.evaluation.create(
        model_id=model_id,
        model_input=llm_prompt,
        model_output=llm_response,
        evaluation_criteria=evaluation_criteria,
        expected_model_output=expected_llm_output,
        model_context=llm_context,
    )
    return {
        "score": result.result.evaluation.score,
        "critique": result.result.evaluation.critique,
    }
```
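Because the handler fans out with `asyncio.gather`, total latency is roughly that of the slowest single evaluation rather than the sum across criteria, and `asyncio.gather` preserves input order, which is what guarantees the documented ordering of results. A minimal standalone sketch of that pattern, with a hypothetical `fake_evaluate` standing in for `evaluate_llm_response`:

```python
import asyncio

# `fake_evaluate` is a stand-in for a single-criterion evaluation call.
async def fake_evaluate(criterion: str) -> dict[str, str]:
    await asyncio.sleep(0.1)  # simulate one API round trip
    return {"score": "5", "critique": f"Looks good on: {criterion}"}

async def main() -> None:
    criteria = ["factual accuracy", "directness"]
    # One concurrent task per criterion; results come back in input order.
    results = await asyncio.gather(*(fake_evaluate(c) for c in criteria))
    print(results)

asyncio.run(main())
```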