evaluate_llm_response_on_multiple_criteria

Evaluate LLM responses against multiple custom criteria to get per-criterion scores and critiques for detailed quality analysis.

Instructions

Evaluate an LLM's response to a prompt across multiple evaluation criteria.

This function uses an Atla evaluation model under the hood and returns a list of
dictionaries, one per criterion, each containing an evaluation score and a critique.

Returns:
    list[dict[str, str]]: A list of dictionaries containing the evaluation score
        and critique, in the format `{"score": <score>, "critique": <critique>}`.
        The order of the dictionaries in the list will match the order of the
        criteria in the `evaluation_criteria_list` argument.
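
For illustration, a call with two criteria might return a list shaped like the one below; the scores and critiques are invented placeholders, and the actual values depend on the criteria supplied.

    # Illustrative return value for a call with two criteria (placeholder values)
    [
        {"score": "4", "critique": "Mostly accurate, but one supporting detail is unverified."},
        {"score": "5", "critique": "The response follows the requested formal tone throughout."},
    ]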

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| `evaluation_criteria_list` | Yes | | |
| `llm_prompt` | Yes | The prompt given to an LLM to generate the `llm_response` to be evaluated. | |
| `llm_response` | Yes | The output generated by the model in response to the `llm_prompt`, which needs to be evaluated. | |
| `expected_llm_output` | No | A reference or ideal answer to compare against the `llm_response`. This is useful in cases where a specific output is expected from the model. | None |
| `llm_context` | No | Additional context or information provided to the model during generation. This is useful in cases where the model was provided with additional information that is not part of the `llm_prompt` or `expected_llm_output` (e.g., a RAG retrieval context). | None |
| `model_id` | No | The Atla model ID to use for evaluation. `atla-selene` is the flagship Atla model, optimized for the highest all-round performance. `atla-selene-mini` is a compact model that is generally faster and cheaper to run. | `atla-selene` |
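
As a hypothetical example (the criteria wording, prompt, and response text below are made up for illustration), an agent might call the tool with arguments such as:

    # Hypothetical arguments for evaluate_llm_response_on_multiple_criteria;
    # all string values are illustrative, not taken from the server's documentation.
    {
        "evaluation_criteria_list": [
            "Score 5 if the answer is factually correct, 1 if it contains errors.",
            "Score 5 if the answer only uses information from the provided context.",
        ],
        "llm_prompt": "Summarise the support ticket in two sentences.",
        "llm_response": "The customer was charged twice for the March renewal and requests a refund.",
        "llm_context": "Ticket text: 'I see two identical charges for my March renewal.'",
        "model_id": "atla-selene-mini",
    }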

Output Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| `result` | Yes | | |

Implementation Reference

  • The main handler function for the tool 'evaluate_llm_response_on_multiple_criteria'. It accepts multiple evaluation criteria as a list and calls evaluate_llm_response concurrently for each criterion using asyncio.gather, returning a list of score/critique dicts.
    async def evaluate_llm_response_on_multiple_criteria(
        ctx: Context,
        evaluation_criteria_list: list[AnnotatedEvaluationCriteria],
        llm_prompt: AnnotatedLlmPrompt,
        llm_response: AnnotatedLlmResponse,
        expected_llm_output: AnnotatedExpectedLlmOutput = None,
        llm_context: AnnotatedLlmContext = None,
        model_id: AnnotatedModelId = "atla-selene",
    ) -> list[dict[str, str]]:
        """Evaluate an LLM's response to a prompt across *multiple* evaluation criteria.
    
        This function uses an Atla evaluation model under the hood and returns a list of
        dictionaries, one per criterion, each containing an evaluation score and a critique.
    
        Returns:
            list[dict[str, str]]: A list of dictionaries containing the evaluation score
                and critique, in the format `{"score": <score>, "critique": <critique>}`.
                The order of the dictionaries in the list will match the order of the
                criteria in the `evaluation_criteria_list` argument.
        """
        tasks = [
            evaluate_llm_response(
                ctx=ctx,
                evaluation_criteria=criterion,
                llm_prompt=llm_prompt,
                llm_response=llm_response,
                expected_llm_output=expected_llm_output,
                llm_context=llm_context,
                model_id=model_id,
            )
            for criterion in evaluation_criteria_list
        ]
        results = await asyncio.gather(*tasks)
        return results
  • Registration of the tool with the MCP server via mcp.tool() decorator.
    mcp.tool()(evaluate_llm_response_on_multiple_criteria)
  • The underlying helper function evaluate_llm_response that performs a single evaluation via the Atla API. Called by evaluate_llm_response_on_multiple_criteria for each criterion.
    async def evaluate_llm_response(
        ctx: Context,
        evaluation_criteria: AnnotatedEvaluationCriteria,
        llm_prompt: AnnotatedLlmPrompt,
        llm_response: AnnotatedLlmResponse,
        expected_llm_output: AnnotatedExpectedLlmOutput = None,
        llm_context: AnnotatedLlmContext = None,
        model_id: AnnotatedModelId = "atla-selene",
    ) -> dict[str, str]:
        """Evaluate an LLM's response to a prompt using a given evaluation criteria.
    
        This function uses an Atla evaluation model under the hood and returns a dictionary
        containing a score for the model's response and a textual critique with feedback
        on that response.
    
        Returns:
            dict[str, str]: A dictionary containing the evaluation score and critique, in
                the format `{"score": <score>, "critique": <critique>}`.
        """
        state = cast(MCPState, ctx.request_context.lifespan_context)
        result = await state.atla_client.evaluation.create(
            model_id=model_id,
            model_input=llm_prompt,
            model_output=llm_response,
            evaluation_criteria=evaluation_criteria,
            expected_model_output=expected_llm_output,
            model_context=llm_context,
        )
    
        return {
            "score": result.result.evaluation.score,
            "critique": result.result.evaluation.critique,
        }
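  • A minimal sketch (not taken from the source) of how the lifespan context used above could be wired. The `MCPState` layout, the `AsyncAtla` client class from the Atla SDK, and the `ATLA_API_KEY` environment variable name are assumptions and may differ in the real server.
    # Hedged sketch of the lifespan wiring assumed by the handlers above.
    # Names (AsyncAtla, MCPState, ATLA_API_KEY) are assumptions, not confirmed from the source.
    import os
    from contextlib import asynccontextmanager
    from dataclasses import dataclass
    from typing import AsyncIterator

    from atla import AsyncAtla  # assumed async client exposed by the Atla SDK
    from mcp.server.fastmcp import FastMCP


    @dataclass
    class MCPState:
        atla_client: AsyncAtla


    @asynccontextmanager
    async def lifespan(server: FastMCP) -> AsyncIterator[MCPState]:
        client = AsyncAtla(api_key=os.environ["ATLA_API_KEY"])
        try:
            # Whatever is yielded here becomes ctx.request_context.lifespan_context
            # inside the tool handlers shown above.
            yield MCPState(atla_client=client)
        finally:
            await client.close()


    mcp = FastMCP("atla", lifespan=lifespan)
    mcp.tool()(evaluate_llm_response)
    mcp.tool()(evaluate_llm_response_on_multiple_criteria)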
Behavior 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations, the description adds some behavioral context (uses Atla model under the hood, returns list of dicts) but does not disclose side effects, cost, failure modes, or restrictions. It adds value beyond annotations but falls short of full transparency.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is concise (two sentences plus a return format note) and front-loaded with purpose. Every sentence adds value; no obvious fluff. Minor improvement possible by incorporating usage guidance.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (6 parameters, output schema exists), the description provides basic purpose and return format but lacks usage guidelines, behavioral warnings, and parameter elaboration. It is incomplete for fully understanding when and how to use the tool effectively.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is high (83%), and the description does not add new meaning to individual parameters beyond the schema. It explains the return format, which aids understanding of the output but does not directly enhance parameter semantics.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool evaluates an LLM response across multiple criteria using an Atla model and returns a list of dictionaries with score and critique. It distinguishes from the sibling tool 'evaluate_llm_response' by emphasizing 'multiple evaluation criteria', making the purpose and differentiation explicit.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage for multiple criteria but does not explicitly state when to use this tool versus the single-criterion sibling. No when-not-to-use or alternative suggestions are provided, leaving guidance implicit.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

