# evaluate_response
Evaluate LLM responses locally by scoring relevance, coherence, correctness, and completeness on a 1-5 scale using a judge model.
## Instructions
Use a local judge model to score an LLM response on relevance, coherence, correctness, and completeness (1-5 each).
## Input Schema

| Name | Required | Description | Default |
|---|---|---|---|
| question | Yes | The original question or prompt. | |
| response | Yes | The model response to evaluate. | |
| judge_model | No | Which Ollama model acts as judge. | `JUDGE_MODEL` env var, else the first available Ollama model |
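For illustration, a call might pass arguments like these (the values are hypothetical):

```python
# Hypothetical arguments for evaluate_response; omit judge_model to let the
# server pick its default judge.
arguments = {
    "question": "What does HTTP status 404 mean?",
    "response": "It means the requested resource was not found on the server.",
    "judge_model": "llama3",  # assumed name of an installed Ollama model
}
```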
## Output Schema

| Name | Required | Description | Default |
|---|---|---|---|
| scores | Yes | Integer scores (1-5) for relevance, coherence, correctness, and completeness. | |
| rationale | Yes | Short rationale string for each criterion. | |
| overall | Yes | Float average of the four scores. | |
| judge_model | Yes | The Ollama model that produced the evaluation. | |
| raw | No | The unparsed judge output, returned only when JSON parsing fails. | |
| error | No | Parse-failure message, returned only when JSON parsing fails. | |
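On success, the result is shaped like the sketch below (values are hypothetical; the structure mirrors the JSON schema in `_EVAL_PROMPT`, plus the `judge_model` key the handler adds):

```python
result = {
    "scores": {"relevance": 5, "coherence": 5, "correctness": 5, "completeness": 4},
    "rationale": {
        "relevance": "Directly addresses the question.",
        "coherence": "Logically consistent throughout.",
        "correctness": "Matches the standard definition.",
        "completeness": "Accurate but terse.",
    },
    "overall": 4.75,  # average of the four scores
    "judge_model": "llama3",  # assumed model name
}
```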
## Implementation Reference
- src/foundry_reverse/server.py:226-248 (registration): MCP tool registration for 'evaluate_response' with the @mcp.tool decorator, defining name, description, and arguments (question, response, judge_model).
```python
@mcp.tool(
    name="evaluate_response",
    description=(
        "Use a local judge model to score an LLM response on relevance, "
        "coherence, correctness, and completeness (1-5 each)."
    ),
)
async def evaluate_response(
    question: str,
    response: str,
    judge_model: str | None = None,
) -> dict[str, Any]:
    """
    Args:
        question: The original question or prompt.
        response: The model response to evaluate.
        judge_model: Which Ollama model acts as judge. Defaults to first available.
    """
    return await ev.evaluate_response(
        question=question,
        response=response,
        judge_model=judge_model,
    )
```
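For context, a client could call the registered tool over stdio with the official `mcp` Python SDK. This is a minimal sketch, not project code; the launch command `python -m foundry_reverse.server` is an assumption based on the file path above.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Assumed launch command; adjust to however this server is actually started.
    params = StdioServerParameters(
        command="python", args=["-m", "foundry_reverse.server"]
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "evaluate_response",
                arguments={
                    "question": "What is the capital of France?",
                    "response": "Paris is the capital of France.",
                },
            )
            print(result.content)


asyncio.run(main())
```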
- Core handler function that uses a judge model (defaulting to first available Ollama model) to score an LLM response on relevance, coherence, correctness, and completeness (1-5 each), returning JSON scores with rationale.

```python
async def evaluate_response(
    question: str,
    response: str,
    judge_model: str | None = None,
) -> dict[str, Any]:
    model = judge_model or await _judge_model()
    prompt = _EVAL_PROMPT.format(question=question, response=response)
    raw = await oc.generate(model=model, prompt=prompt)
    result = _extract_json(raw)
    result["judge_model"] = model
    return result
```
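The handler leans on two collaborators: `oc.generate` (an async Ollama client call) and `_extract_json` (shown further below). The Ollama client itself is not in this excerpt; below is a minimal sketch of what `generate` could look like, assuming `httpx` and Ollama's standard `/api/generate` endpoint on localhost.

```python
import httpx

# Sketch only: non-streaming text generation against a local Ollama server.
# The project's real client may differ (base URL, timeouts, error handling).
async def generate(model: str, prompt: str) -> str:
    async with httpx.AsyncClient(timeout=120.0) as client:
        resp = await client.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
        )
        resp.raise_for_status()
        # Ollama returns the full completion under the "response" key.
        return resp.json()["response"]
```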
- Helper that resolves the judge model: uses the `JUDGE_MODEL` env var or falls back to the first available Ollama model.

```python
async def _judge_model() -> str:
    if DEFAULT_JUDGE_MODEL:
        return DEFAULT_JUDGE_MODEL
    models = await oc.list_models()
    if not models:
        raise RuntimeError("No Ollama models available for evaluation.")
    return models[0]["name"]
```
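`DEFAULT_JUDGE_MODEL` and `oc.list_models` are defined elsewhere in the project. A plausible sketch, assuming the `JUDGE_MODEL` env-var lookup the description mentions and the response shape of Ollama's `/api/tags` endpoint:

```python
import os

import httpx

# Sketch only: an empty value means "no configured default", which makes
# _judge_model fall through to the first available model.
DEFAULT_JUDGE_MODEL = os.environ.get("JUDGE_MODEL", "")

# Ollama's /api/tags returns {"models": [{"name": "llama3:latest", ...}, ...]},
# which matches the models[0]["name"] access in the helper above.
async def list_models() -> list[dict]:
    async with httpx.AsyncClient() as client:
        resp = await client.get("http://localhost:11434/api/tags")
        resp.raise_for_status()
        return resp.json()["models"]
```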
- Prompt template sent to the judge model, asking for scores and rationale for each criterion in JSON format.

```python
_EVAL_PROMPT = """\
You are an expert evaluator. Score the following response on a scale of 1-5
for each criterion and provide a short rationale.

Criteria:
- relevance: Does the response address the question?
- coherence: Is the response logically consistent?
- correctness: Is the factual content accurate (as far as you can tell)?
- completeness: Does the response fully answer the question?

Question:
{question}

Response:
{response}

Return ONLY valid JSON with this schema:
{{
  "scores": {{
    "relevance": <int 1-5>,
    "coherence": <int 1-5>,
    "correctness": <int 1-5>,
    "completeness": <int 1-5>
  }},
  "rationale": {{
    "relevance": "<string>",
    "coherence": "<string>",
    "correctness": "<string>",
    "completeness": "<string>"
  }},
  "overall": <float average>
}}
"""
```
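Note the doubled braces: `str.format` consumes one brace level, so `{{` and `}}` reach the judge as literal `{` and `}` while `{question}` and `{response}` are substituted. A quick check, assuming `_EVAL_PROMPT` from the block above is in scope:

```python
prompt = _EVAL_PROMPT.format(question="2 + 2?", response="4")
assert "Question:\n2 + 2?" in prompt           # placeholder substituted
assert '"overall": <float average>' in prompt  # schema braces left intact
```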
- JSON extraction helper that tries direct parsing first, then strips markdown fences, and falls back to an error dict.

````python
def _extract_json(raw: str) -> dict[str, Any]:
    """Best-effort JSON extraction from a model response."""
    raw = raw.strip()

    # Try direct parse first
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass

    # Strip markdown fences
    for marker in ("```json", "```"):
        if marker in raw:
            raw = raw.split(marker, 1)[-1].split("```")[0].strip()
            try:
                return json.loads(raw)
            except json.JSONDecodeError:
                pass

    return {"raw": raw, "error": "Could not parse JSON from judge response"}
````
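A quick usage sketch of the happy and fallback paths, assuming `_extract_json` from the block above is in scope:

````python
fenced = '```json\n{"scores": {"relevance": 5}}\n```'
print(_extract_json(fenced))
# {'scores': {'relevance': 5}}

print(_extract_json("not json at all"))
# {'raw': 'not json at all', 'error': 'Could not parse JSON from judge response'}
````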