Duck Judge
duck_judgeEvaluates and ranks responses from multiple AI ducks using customizable criteria and a judge persona to provide comparative evaluations.
Instructions
Have one duck evaluate and rank other ducks' responses. Use after duck_council to get a comparative evaluation.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| responses | Yes | Array of duck responses to evaluate (from duck_council output) | |
| judge | No | Provider name of the judge duck (optional, uses first available) | |
| criteria | No | Evaluation criteria (default: ["accuracy", "completeness", "clarity"]) | |
| persona | No | Judge persona (e.g., "senior engineer", "security expert") |
Implementation Reference
- src/tools/duck-judge.ts:24-82 (handler)Main handler for the duck_judge tool. Validates inputs, selects a judge provider, builds a prompt asking the judge to evaluate/rank responses, parses the JSON judgment, formats the result, and returns it.
export async function duckJudgeTool( providerManager: ProviderManager, args: Record<string, unknown> ) { const { responses, judge, criteria = DEFAULT_CRITERIA, persona, } = args as unknown as DuckJudgeArgs; // Validate inputs if (!responses || !Array.isArray(responses) || responses.length === 0) { throw new Error('At least one response is required to judge'); } if (responses.length === 1) { throw new Error('At least two responses are required for comparison'); } // Determine judge provider const judgeProvider = judge || providerManager.getProviderNames()[0]; if (!judgeProvider) { throw new Error('No judge provider available'); } logger.info(`Starting judgment with ${judgeProvider} on ${responses.length} responses`); // Build the judgment prompt const prompt = buildJudgePrompt(responses, criteria, persona); // Get judgment from the judge duck const judgeResponse = await providerManager.askDuck(judgeProvider, prompt); // Parse the judgment const evaluation = parseJudgment( judgeResponse.content, judgeResponse.provider, judgeResponse.nickname, responses, criteria ); // Format output const formattedOutput = formatJudgeResult(evaluation); logger.info( `Judgment completed by ${judgeProvider}: #1 is ${evaluation.rankings[0]?.provider || 'unknown'}` ); return { content: [ { type: 'text', text: formattedOutput, }, ], }; } - src/tools/duck-judge.ts:5-10 (schema)Input schema for the duck_judge tool: accepts an array of DuckResponse objects, optional judge provider, optional criteria list, and optional persona.
export interface DuckJudgeArgs { responses: DuckResponse[]; judge?: string; criteria?: string[]; persona?: string; } - src/server.ts:576-621 (registration)Registration of the 'duck_judge' tool in the MCP server with its title, description, Zod input schema, and the async handler that delegates to duckJudgeTool.
// duck_judge this.server.registerTool( 'duck_judge', { title: 'Duck Judge', description: "Have one duck evaluate and rank other ducks' responses. Use after duck_council to get a comparative evaluation.", inputSchema: { responses: z .array( z.object({ provider: z.string(), nickname: z.string(), model: z.string().optional(), content: z.string(), }) ) .min(2) .describe('Array of duck responses to evaluate (from duck_council output)'), judge: this.providerEnum() .optional() .describe('Provider name of the judge duck (optional, uses first available)'), criteria: z .array(z.string()) .optional() .describe('Evaluation criteria (default: ["accuracy", "completeness", "clarity"])'), persona: z .string() .optional() .describe('Judge persona (e.g., "senior engineer", "security expert")'), }, annotations: { readOnlyHint: true, openWorldHint: true, }, }, async (args) => { try { return this.toolResult( await duckJudgeTool(this.providerManager, args as Record<string, unknown>) ); } catch (error) { return this.toolErrorResult(error); } } ); - src/tools/duck-judge.ts:84-125 (helper)Builds the prompt sent to the judge duck, listing responses, criteria, and instructions to return a JSON evaluation with rankings, criteria_scores, and summary.
function buildJudgePrompt(responses: DuckResponse[], criteria: string[], persona?: string): string { const criteriaList = criteria.map((c, i) => `${i + 1}. ${c}`).join('\n'); const responsesText = responses .map((r, i) => `--- Response ${i + 1} (${r.nickname} / ${r.provider}) ---\n${r.content}\n`) .join('\n'); const personaText = persona ? `You are a ${persona} evaluating these responses.\n\n` : ''; return `${personaText}You are a judge evaluating ${responses.length} responses to the same prompt. RESPONSES TO EVALUATE: ${responsesText} EVALUATION CRITERIA: ${criteriaList} INSTRUCTIONS: 1. Evaluate each response against ALL criteria 2. Assign a score from 0-100 for each response 3. Rank responses from best to worst 4. Provide a brief justification for each ranking 5. Give a final summary Respond with ONLY a JSON object in this exact format: { "rankings": [ {"provider": "<provider name>", "score": <0-100>, "justification": "<brief explanation>"}, {"provider": "<provider name>", "score": <0-100>, "justification": "<brief explanation>"} ], "criteria_scores": { "<provider>": {${criteria.map((c) => `"${c}": <0-100>`).join(', ')}} }, "summary": "<overall assessment and recommendation>" } IMPORTANT: - Rankings must be ordered from highest score to lowest - Use the exact provider names from the responses - Do NOT include any text before or after the JSON - Do NOT use markdown code blocks`; } - src/tools/duck-judge.ts:151-226 (helper)Parses the JSON judgment from the judge duck's response, extracting rankings, criteria scores, and summary, with fallback if parsing fails.
function parseJudgment( response: string, judgeProvider: string, judgeNickname: string, originalResponses: DuckResponse[], criteria: string[] ): JudgeEvaluation { const evaluation: JudgeEvaluation = { judge: judgeProvider, judgeNickname: judgeNickname, prompt: '', // Will be filled by caller if needed criteria, rankings: [], criteriaScores: {}, summary: '', rawResponse: response, }; try { // Try to extract JSON from the response const jsonMatch = response.match(/\{[\s\S]*\}/); if (!jsonMatch) { logger.warn(`No JSON found in judge response from ${judgeProvider}`); return createFallbackEvaluation(evaluation, originalResponses, response); } const parsed = JSON.parse(jsonMatch[0]) as ParsedJudgment; const matchedProviders = new Set<string>(); // Parse rankings if (Array.isArray(parsed.rankings)) { for (const [index, r] of parsed.rankings.entries()) { const matched = matchProvider(r.provider, originalResponses); if (matched && !matchedProviders.has(matched.provider)) { matchedProviders.add(matched.provider); evaluation.rankings.push({ provider: matched.provider, nickname: matched.nickname, rank: index + 1, score: typeof r.score === 'number' ? Math.max(0, Math.min(100, r.score)) : 0, justification: r.justification?.toString() || '', }); } } } // Parse criteria scores if (parsed.criteria_scores && typeof parsed.criteria_scores === 'object') { evaluation.criteriaScores = parsed.criteria_scores; } // Parse summary if (parsed.summary) { evaluation.summary = parsed.summary.toString(); } } catch (error) { logger.warn(`Failed to parse JSON judgment from ${judgeProvider}:`, error); return createFallbackEvaluation(evaluation, originalResponses, response); } // Ensure all original responses are represented const rankedProviders = new Set(evaluation.rankings.map((r) => r.provider)); for (const resp of originalResponses) { if (!rankedProviders.has(resp.provider)) { evaluation.rankings.push({ provider: resp.provider, nickname: resp.nickname, rank: evaluation.rankings.length + 1, score: 0, justification: 'Not evaluated by judge', }); } } return evaluation; }