evaluate_llm_response_on_multiple_criteria

Evaluate LLM responses against multiple custom criteria to get per-criterion scores and critiques for detailed quality analysis.

Instructions

Evaluate an LLM's response to a prompt across multiple evaluation criteria.

This function uses an Atla evaluation model under the hood and returns a list of
dictionaries, one per criterion, each containing an evaluation score and a critique.

Returns:
    list[dict[str, str]]: A list of dictionaries containing the evaluation score
        and critique, in the format `{"score": <score>, "critique": <critique>}`.
        The order of the dictionaries in the list will match the order of the
        criteria in the `evaluation_criteria_list` argument.
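
For illustration, a call with two criteria might return a list shaped like the one below; the scores and critiques are invented placeholders, and the actual values depend on the criteria supplied.

    # Illustrative return value for a call with two criteria (placeholder values)
    [
        {"score": "4", "critique": "Mostly accurate, but one supporting detail is unverified."},
        {"score": "5", "critique": "The response follows the requested formal tone throughout."},
    ]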

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| `evaluation_criteria_list` | Yes | | |
| `llm_prompt` | Yes | The prompt given to an LLM to generate the `llm_response` to be evaluated. | |
| `llm_response` | Yes | The output generated by the model in response to the `llm_prompt`, which needs to be evaluated. | |
| `expected_llm_output` | No | A reference or ideal answer to compare against the `llm_response`. This is useful in cases where a specific output is expected from the model. | None |
| `llm_context` | No | Additional context or information provided to the model during generation. This is useful in cases where the model was provided with additional information that is not part of the `llm_prompt` or `expected_llm_output` (e.g., a RAG retrieval context). | None |
| `model_id` | No | The Atla model ID to use for evaluation. `atla-selene` is the flagship Atla model, optimized for the highest all-round performance. `atla-selene-mini` is a compact model that is generally faster and cheaper to run. | `atla-selene` |
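
As a hypothetical example (the criteria wording, prompt, and response text below are made up for illustration), an agent might call the tool with arguments such as:

    # Hypothetical arguments for evaluate_llm_response_on_multiple_criteria;
    # all string values are illustrative, not taken from the server's documentation.
    {
        "evaluation_criteria_list": [
            "Score 5 if the answer is factually correct, 1 if it contains errors.",
            "Score 5 if the answer only uses information from the provided context.",
        ],
        "llm_prompt": "Summarise the support ticket in two sentences.",
        "llm_response": "The customer was charged twice for the March renewal and requests a refund.",
        "llm_context": "Ticket text: 'I see two identical charges for my March renewal.'",
        "model_id": "atla-selene-mini",
    }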

Output Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| `result` | Yes | | |

Implementation Reference

  • The main handler function for the tool 'evaluate_llm_response_on_multiple_criteria'. It accepts multiple evaluation criteria as a list and calls evaluate_llm_response concurrently for each criterion using asyncio.gather, returning a list of score/critique dicts.
    async def evaluate_llm_response_on_multiple_criteria(
        ctx: Context,
        evaluation_criteria_list: list[AnnotatedEvaluationCriteria],
        llm_prompt: AnnotatedLlmPrompt,
        llm_response: AnnotatedLlmResponse,
        expected_llm_output: AnnotatedExpectedLlmOutput = None,
        llm_context: AnnotatedLlmContext = None,
        model_id: AnnotatedModelId = "atla-selene",
    ) -> list[dict[str, str]]:
        """Evaluate an LLM's response to a prompt across *multiple* evaluation criteria.
    
        This function uses an Atla evaluation model under the hood and returns a list of
        dictionaries, one per criterion, each containing an evaluation score and a critique.
    
        Returns:
            list[dict[str, str]]: A list of dictionaries containing the evaluation score
                and critique, in the format `{"score": <score>, "critique": <critique>}`.
                The order of the dictionaries in the list will match the order of the
                criteria in the `evaluation_criteria_list` argument.
        """
        tasks = [
            evaluate_llm_response(
                ctx=ctx,
                evaluation_criteria=criterion,
                llm_prompt=llm_prompt,
                llm_response=llm_response,
                expected_llm_output=expected_llm_output,
                llm_context=llm_context,
                model_id=model_id,
            )
            for criterion in evaluation_criteria_list
        ]
        results = await asyncio.gather(*tasks)
        return results
  • Registration of the tool with the MCP server via mcp.tool() decorator.
    mcp.tool()(evaluate_llm_response_on_multiple_criteria)
  • The underlying helper function evaluate_llm_response that performs a single evaluation via the Atla API. Called by evaluate_llm_response_on_multiple_criteria for each criterion.
    async def evaluate_llm_response(
        ctx: Context,
        evaluation_criteria: AnnotatedEvaluationCriteria,
        llm_prompt: AnnotatedLlmPrompt,
        llm_response: AnnotatedLlmResponse,
        expected_llm_output: AnnotatedExpectedLlmOutput = None,
        llm_context: AnnotatedLlmContext = None,
        model_id: AnnotatedModelId = "atla-selene",
    ) -> dict[str, str]:
        """Evaluate an LLM's response to a prompt using a given evaluation criteria.
    
        This function uses an Atla evaluation model under the hood and returns a dictionary
        containing a score for the model's response and a textual critique with feedback
        on that response.
    
        Returns:
            dict[str, str]: A dictionary containing the evaluation score and critique, in
                the format `{"score": <score>, "critique": <critique>}`.
        """
        state = cast(MCPState, ctx.request_context.lifespan_context)
        result = await state.atla_client.evaluation.create(
            model_id=model_id,
            model_input=llm_prompt,
            model_output=llm_response,
            evaluation_criteria=evaluation_criteria,
            expected_model_output=expected_llm_output,
            model_context=llm_context,
        )
    
        return {
            "score": result.result.evaluation.score,
            "critique": result.result.evaluation.critique,
        }
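  • A minimal sketch (not taken from the source) of how the lifespan context used above could be wired. The `MCPState` layout, the `AsyncAtla` client class from the Atla SDK, and the `ATLA_API_KEY` environment variable name are assumptions and may differ in the real server.
    # Hedged sketch of the lifespan wiring assumed by the handlers above.
    # Names (AsyncAtla, MCPState, ATLA_API_KEY) are assumptions, not confirmed from the source.
    import os
    from contextlib import asynccontextmanager
    from dataclasses import dataclass
    from typing import AsyncIterator

    from atla import AsyncAtla  # assumed async client exposed by the Atla SDK
    from mcp.server.fastmcp import FastMCP


    @dataclass
    class MCPState:
        atla_client: AsyncAtla


    @asynccontextmanager
    async def lifespan(server: FastMCP) -> AsyncIterator[MCPState]:
        client = AsyncAtla(api_key=os.environ["ATLA_API_KEY"])
        try:
            # Whatever is yielded here becomes ctx.request_context.lifespan_context
            # inside the tool handlers shown above.
            yield MCPState(atla_client=client)
        finally:
            await client.close()


    mcp = FastMCP("atla", lifespan=lifespan)
    mcp.tool()(evaluate_llm_response)
    mcp.tool()(evaluate_llm_response_on_multiple_criteria)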
Behavior 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations, the description adds some behavioral context (uses Atla model under the hood, returns list of dicts) but does not disclose side effects, cost, failure modes, or restrictions. It adds value beyond annotations but falls short of full transparency.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is concise (two sentences plus a return format note) and front-loaded with purpose. Every sentence adds value; no obvious fluff. Minor improvement possible by incorporating usage guidance.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (6 parameters, output schema exists), the description provides basic purpose and return format but lacks usage guidelines, behavioral warnings, and parameter elaboration. It is incomplete for fully understanding when and how to use the tool effectively.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is high (83%), and the description does not add new meaning to individual parameters beyond the schema. It explains the return format, which aids understanding of the output but does not directly enhance parameter semantics.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool evaluates an LLM response across multiple criteria using an Atla model and returns a list of dictionaries with score and critique. It distinguishes from the sibling tool 'evaluate_llm_response' by emphasizing 'multiple evaluation criteria', making the purpose and differentiation explicit.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage for multiple criteria but does not explicitly state when to use this tool versus the single-criterion sibling. No when-not-to-use or alternative suggestions are provided, leaving guidance implicit.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

