
evaluate_response

Evaluate LLM responses locally by scoring relevance, coherence, correctness, and completeness on a 1-5 scale using a judge model.

Instructions

Use a local judge model to score an LLM response on relevance, coherence, correctness, and completeness (1-5 each).

Input Schema

Name         Required  Description  Default
question     Yes       —            —
response     Yes       —            —
judge_model  No        —            —

Output Schema

No fields defined.
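Although no output fields are declared, the implementation below prompts the judge for a fixed JSON shape and attaches a judge_model key before returning, so a successful call should yield something like the following (all values illustrative):

    {
      "scores": {
        "relevance": 5,
        "coherence": 4,
        "correctness": 4,
        "completeness": 3
      },
      "rationale": {
        "relevance": "Directly addresses the question.",
        "coherence": "Logically consistent throughout.",
        "correctness": "No factual errors spotted.",
        "completeness": "Leaves one sub-question unanswered."
      },
      "overall": 4.0,
      "judge_model": "llama3"
    }

If the judge's reply cannot be parsed as JSON, the tool instead returns {"raw": ..., "error": ...}, still tagged with judge_model.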

Implementation Reference

  • MCP tool registration for 'evaluate_response' with the @mcp.tool decorator, defining name, description, and arguments (question, response, judge_model).
    @mcp.tool(
        name="evaluate_response",
        description=(
            "Use a local judge model to score an LLM response on relevance, "
            "coherence, correctness, and completeness (1-5 each)."
        ),
    )
    async def evaluate_response(
        question: str,
        response: str,
        judge_model: str | None = None,
    ) -> dict[str, Any]:
        """
        Args:
            question: The original question or prompt.
            response: The model response to evaluate.
            judge_model: Which Ollama model acts as judge. Defaults to first available.
        """
        return await ev.evaluate_response(
            question=question,
            response=response,
            judge_model=judge_model,
        )
  • Core handler function that uses a judge model (defaulting to first available Ollama model) to score an LLM response on relevance, coherence, correctness, and completeness (1-5 each), returning JSON scores with rationale.
    async def evaluate_response(
        question: str,
        response: str,
        judge_model: str | None = None,
    ) -> dict[str, Any]:
        model = judge_model or await _judge_model()
        prompt = _EVAL_PROMPT.format(question=question, response=response)
        raw = await oc.generate(model=model, prompt=prompt)
        result = _extract_json(raw)
        result["judge_model"] = model
        return result
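The oc module (the Ollama client) is not shown in the source. Below is a minimal sketch of what it plausibly looks like, assuming it wraps Ollama's documented HTTP API (POST /api/generate, GET /api/tags) with httpx; the endpoint, timeout, and exact signatures here are assumptions:

    # Hypothetical sketch of the `oc` Ollama client module (not shown in the
    # source). Assumes a default Ollama server on localhost:11434.
    from typing import Any

    import httpx

    OLLAMA_URL = "http://localhost:11434"  # assumed default endpoint

    async def generate(model: str, prompt: str) -> str:
        """Run a non-streaming generation and return the raw response text."""
        async with httpx.AsyncClient(timeout=120.0) as client:
            r = await client.post(
                f"{OLLAMA_URL}/api/generate",
                json={"model": model, "prompt": prompt, "stream": False},
            )
            r.raise_for_status()
            return r.json()["response"]

    async def list_models() -> list[dict[str, Any]]:
        """Return installed models; each dict has at least a 'name' key."""
        async with httpx.AsyncClient() as client:
            r = await client.get(f"{OLLAMA_URL}/api/tags")
            r.raise_for_status()
            return r.json()["models"]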
  • Helper that resolves the judge model: uses JUDGE_MODEL env var or falls back to the first available Ollama model.
    async def _judge_model() -> str:
        if DEFAULT_JUDGE_MODEL:
            return DEFAULT_JUDGE_MODEL
        models = await oc.list_models()
        if not models:
            raise RuntimeError("No Ollama models available for evaluation.")
        return models[0]["name"]
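DEFAULT_JUDGE_MODEL itself is not shown; going by the bullet above, it presumably reads the JUDGE_MODEL environment variable at import time, along these lines:

    import os

    # Assumed module-level constant: an empty string means "no override",
    # so _judge_model() falls through to the first installed model.
    DEFAULT_JUDGE_MODEL = os.environ.get("JUDGE_MODEL", "")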
  • Prompt template sent to the judge model, asking for scores and rationale for each criterion in JSON format.
    _EVAL_PROMPT = """\
    You are an expert evaluator.  Score the following response on a scale of 1-5
    for each criterion and provide a short rationale.
    
    Criteria:
      - relevance: Does the response address the question?
      - coherence: Is the response logically consistent?
      - correctness: Is the factual content accurate (as far as you can tell)?
      - completeness: Does the response fully answer the question?
    
    Question:
    {question}
    
    Response:
    {response}
    
    Return ONLY valid JSON with this schema:
    {{
      "scores": {{
        "relevance": <int 1-5>,
        "coherence": <int 1-5>,
        "correctness": <int 1-5>,
        "completeness": <int 1-5>
      }},
      "rationale": {{
        "relevance": "<string>",
        "coherence": "<string>",
        "correctness": "<string>",
        "completeness": "<string>"
      }},
      "overall": <float average>
    }}
    """
  • JSON extraction helper that tries direct parsing first, then strips markdown fences, and falls back to an error dict.
    def _extract_json(raw: str) -> dict[str, Any]:
        """Best-effort JSON extraction from a model response."""
        raw = raw.strip()
        # Try direct parse first
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            pass
        # Strip markdown fences
        for marker in ("```json", "```"):
            if marker in raw:
                raw = raw.split(marker, 1)[-1].split("```")[0].strip()
                try:
                    return json.loads(raw)
                except json.JSONDecodeError:
                    pass
        return {"raw": raw, "error": "Could not parse JSON from judge response"}
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description bears the full transparency burden. It mentions using a local judge model for scoring, which implies a read-only operation, but it does not explicitly state side effects, auth requirements, or idempotency. Adequate, but not thorough.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single concise sentence (18 words) that front-loads the core purpose without any superfluous information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

An output schema exists, so return-value details are not needed, but the description explains neither the parameters nor the usage context. Given that the tool takes three parameters and has sibling tools such as evaluate_agent, more context would help; it does, however, cover the essential purpose.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 2/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, and the description does not add semantic detail for the parameters (question, response, judge_model). It mentions the scoring criteria but does not clarify what question and response represent or how judge_model affects behavior, leaving significant gaps.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool scores an LLM response on four criteria with a 1-5 scale. It uses a specific verb 'score' and resource 'LLM response', distinguishing it from siblings like evaluate_agent (which evaluates an agent) and chat/generate (which produce responses).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

No guidance is given on when to use this tool versus alternatives such as evaluate_agent. The description states only what the tool does, offering no context for selection and no exclusions.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

