
evaluate_response

Evaluate LLM responses locally by scoring relevance, coherence, correctness, and completeness on a 1-5 scale using a judge model.

Instructions

Use a local judge model to score an LLM response on relevance, coherence, correctness, and completeness (1-5 each).

Input Schema

Name         Required  Description  Default
question     Yes       —            —
response     Yes       —            —
judge_model  No        —            —

Output Schema

No fields defined.
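Although no output fields are declared, the implementation below prompts the judge for a fixed JSON shape and attaches a judge_model key before returning, so a successful call should yield something like the following (all values illustrative):

    {
      "scores": {
        "relevance": 5,
        "coherence": 4,
        "correctness": 4,
        "completeness": 3
      },
      "rationale": {
        "relevance": "Directly addresses the question.",
        "coherence": "Logically consistent throughout.",
        "correctness": "No factual errors spotted.",
        "completeness": "Leaves one sub-question unanswered."
      },
      "overall": 4.0,
      "judge_model": "llama3"
    }

If the judge's reply cannot be parsed as JSON, the tool instead returns {"raw": ..., "error": ...}, still tagged with judge_model.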

Implementation Reference

  • MCP tool registration for 'evaluate_response' with the @mcp.tool decorator, defining name, description, and arguments (question, response, judge_model).
    @mcp.tool(
        name="evaluate_response",
        description=(
            "Use a local judge model to score an LLM response on relevance, "
            "coherence, correctness, and completeness (1-5 each)."
        ),
    )
    async def evaluate_response(
        question: str,
        response: str,
        judge_model: str | None = None,
    ) -> dict[str, Any]:
        """
        Args:
            question: The original question or prompt.
            response: The model response to evaluate.
            judge_model: Which Ollama model acts as judge. Defaults to first available.
        """
        return await ev.evaluate_response(
            question=question,
            response=response,
            judge_model=judge_model,
        )
  • Core handler function that uses a judge model (defaulting to first available Ollama model) to score an LLM response on relevance, coherence, correctness, and completeness (1-5 each), returning JSON scores with rationale.
    async def evaluate_response(
        question: str,
        response: str,
        judge_model: str | None = None,
    ) -> dict[str, Any]:
        model = judge_model or await _judge_model()
        prompt = _EVAL_PROMPT.format(question=question, response=response)
        raw = await oc.generate(model=model, prompt=prompt)
        result = _extract_json(raw)
        result["judge_model"] = model
        return result
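The oc module (the Ollama client) is not shown in the source. Below is a minimal sketch of what it plausibly looks like, assuming it wraps Ollama's documented HTTP API (POST /api/generate, GET /api/tags) with httpx; the endpoint, timeout, and exact signatures here are assumptions:

    # Hypothetical sketch of the `oc` Ollama client module (not shown in the
    # source). Assumes a default Ollama server on localhost:11434.
    from typing import Any

    import httpx

    OLLAMA_URL = "http://localhost:11434"  # assumed default endpoint

    async def generate(model: str, prompt: str) -> str:
        """Run a non-streaming generation and return the raw response text."""
        async with httpx.AsyncClient(timeout=120.0) as client:
            r = await client.post(
                f"{OLLAMA_URL}/api/generate",
                json={"model": model, "prompt": prompt, "stream": False},
            )
            r.raise_for_status()
            return r.json()["response"]

    async def list_models() -> list[dict[str, Any]]:
        """Return installed models; each dict has at least a 'name' key."""
        async with httpx.AsyncClient() as client:
            r = await client.get(f"{OLLAMA_URL}/api/tags")
            r.raise_for_status()
            return r.json()["models"]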
  • Helper that resolves the judge model: uses JUDGE_MODEL env var or falls back to the first available Ollama model.
    async def _judge_model() -> str:
        if DEFAULT_JUDGE_MODEL:
            return DEFAULT_JUDGE_MODEL
        models = await oc.list_models()
        if not models:
            raise RuntimeError("No Ollama models available for evaluation.")
        return models[0]["name"]
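DEFAULT_JUDGE_MODEL itself is not shown; going by the bullet above, it presumably reads the JUDGE_MODEL environment variable at import time, along these lines:

    import os

    # Assumed module-level constant: an empty string means "no override",
    # so _judge_model() falls through to the first installed model.
    DEFAULT_JUDGE_MODEL = os.environ.get("JUDGE_MODEL", "")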
  • Prompt template sent to the judge model, asking for scores and rationale for each criterion in JSON format.
    _EVAL_PROMPT = """\
    You are an expert evaluator.  Score the following response on a scale of 1-5
    for each criterion and provide a short rationale.
    
    Criteria:
      - relevance: Does the response address the question?
      - coherence: Is the response logically consistent?
      - correctness: Is the factual content accurate (as far as you can tell)?
      - completeness: Does the response fully answer the question?
    
    Question:
    {question}
    
    Response:
    {response}
    
    Return ONLY valid JSON with this schema:
    {{
      "scores": {{
        "relevance": <int 1-5>,
        "coherence": <int 1-5>,
        "correctness": <int 1-5>,
        "completeness": <int 1-5>
      }},
      "rationale": {{
        "relevance": "<string>",
        "coherence": "<string>",
        "correctness": "<string>",
        "completeness": "<string>"
      }},
      "overall": <float average>
    }}
    """
  • JSON extraction helper that tries direct parsing first, then strips markdown fences, and falls back to an error dict.
    def _extract_json(raw: str) -> dict[str, Any]:
        """Best-effort JSON extraction from a model response."""
        raw = raw.strip()
        # Try direct parse first
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            pass
        # Strip markdown fences
        for marker in ("```json", "```"):
            if marker in raw:
                raw = raw.split(marker, 1)[-1].split("```")[0].strip()
                try:
                    return json.loads(raw)
                except json.JSONDecodeError:
                    pass
        return {"raw": raw, "error": "Could not parse JSON from judge response"}
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description bears the full transparency burden. It mentions using a local judge model for scoring, which implies a read-only operation, but it does not explicitly state side effects, auth requirements, or idempotency. Adequate, but not thorough.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single concise sentence (18 words) that front-loads the core purpose without any superfluous information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

An output schema exists, so return-value details are not needed, but the description explains neither the parameters nor the usage context. Given that the tool takes three parameters and has sibling tools such as evaluate_agent, more context would help; it does, however, cover the essential purpose.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 2/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, and the description does not add semantic detail for the parameters (question, response, judge_model). It mentions the scoring criteria but does not clarify what question and response represent or how judge_model affects behavior, leaving significant gaps.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool scores an LLM response on four criteria with a 1-5 scale. It uses a specific verb 'score' and resource 'LLM response', distinguishing it from siblings like evaluate_agent (which evaluates an agent) and chat/generate (which produce responses).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

No guidance is given on when to use this tool versus alternatives such as evaluate_agent. The description states only what the tool does, offering no context for selection and no exclusions.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

