Evaluate Translation Quality
xcomet_evaluate
Evaluate translation quality by analyzing source and translated text, returning a score (0-1), detected errors with severity, and a summary.
Instructions
Evaluate the quality of a translation using the xCOMET model.
This tool analyzes a source text and its translation, providing:
- A quality score between 0 and 1 (higher is better)
- Detected error spans with severity levels (minor/major/critical)
- A human-readable quality summary
Args:
- source (string): Original source text to translate from
- translation (string): Translated text to evaluate
- reference (string, optional): Reference translation for comparison
- source_lang (string, optional): Source language code (ISO 639-1)
- target_lang (string, optional): Target language code (ISO 639-1)
- response_format ('json' | 'markdown'): Output format (default: 'json')
- use_gpu (boolean, optional): Use GPU for inference if available (default: false)
Returns:
For JSON format:
```
{
  "score": number,      // Quality score 0-1
  "errors": [           // Detected errors
    {
      "text": string,
      "start": number,
      "end": number,
      "severity": "minor" | "major" | "critical"
    }
  ],
  "summary": string     // Human-readable summary
}
```
Examples:
- Evaluate EN→JA translation quality
- Check if MT output needs post-editing
- Compare translation against reference
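As an illustration of the "needs post-editing" use case, the sketch below builds example arguments for an EN→JA call and applies a simple decision rule to a result. The `needsPostEditing` helper, the 0.85 threshold, and the sample result are all hypothetical and chosen only for illustration; they are not part of the tool.

```typescript
// Shape of the JSON result, as described in the Returns section.
type Severity = "minor" | "major" | "critical";

interface EvaluationError {
  text: string;
  start: number;
  end: number;
  severity: Severity;
}

interface EvaluationResult {
  score: number;
  errors: EvaluationError[];
  summary: string;
}

// Hypothetical policy: flag a translation when the score is below a threshold
// or when any major/critical error was detected. The 0.85 default is an
// illustrative assumption, not a value recommended by the tool.
function needsPostEditing(result: EvaluationResult, minScore = 0.85): boolean {
  const hasSevereError = result.errors.some(
    (e) => e.severity === "major" || e.severity === "critical"
  );
  return result.score < minScore || hasSevereError;
}

// Example arguments for an EN→JA evaluation call.
const exampleArgs = {
  source: "The meeting starts at 10 a.m.",
  translation: "会議は午前10時に始まります。",
  source_lang: "en",
  target_lang: "ja",
  response_format: "json",
};

// A made-up result, used only to exercise the helper.
const sampleResult: EvaluationResult = {
  score: 0.72,
  errors: [{ text: "午前10時", start: 3, end: 8, severity: "minor" }],
  summary: "Made-up summary for illustration",
};

console.log(exampleArgs.target_lang, needsPostEditing(sampleResult));
```

Keeping the threshold as a parameter makes it easy to tune per language pair or content type.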
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| source | Yes | Original source text | |
| translation | Yes | Translated text to evaluate | |
| reference | No | Optional reference translation for comparison | |
| source_lang | No | Source language code (ISO 639-1, e.g., 'en', 'ja') | |
| target_lang | No | Target language code (ISO 639-1, e.g., 'en', 'ja') | |
| response_format | No | Output format: 'json' for structured data or 'markdown' for human-readable | json |
| use_gpu | No | Use GPU for inference (faster if available). Default: false (CPU only) | false |
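The constraints in the table can be mirrored on the client side before a call is made. The following is a minimal sketch in plain TypeScript, without Zod; `checkEvaluateArgs` and `withDefaults` are hypothetical helper names, and the maximum-length check is left out because the value of `MAX_TEXT_LENGTH` is not shown here.

```typescript
// Plain-TypeScript mirror of the documented input parameters.
interface EvaluateArgs {
  source: string;
  translation: string;
  reference?: string;
  source_lang?: string;
  target_lang?: string;
  response_format?: "json" | "markdown";
  use_gpu?: boolean;
}

// Same arguments after the documented defaults have been applied.
type ResolvedArgs = EvaluateArgs & {
  response_format: "json" | "markdown";
  use_gpu: boolean;
};

// Hypothetical helper: returns a list of problems (empty when the input is valid).
function checkEvaluateArgs(input: EvaluateArgs): string[] {
  const problems: string[] = [];
  if (input.source.length < 1) problems.push("source is required");
  if (input.translation.length < 1) problems.push("translation is required");
  for (const key of ["source_lang", "target_lang"] as const) {
    const code = input[key];
    if (code !== undefined && code.length !== 2) {
      problems.push(`${key} must be a two-letter code`);
    }
  }
  return problems;
}

// Hypothetical helper: fills in the documented defaults ('json', false).
function withDefaults(input: EvaluateArgs): ResolvedArgs {
  return {
    ...input,
    response_format: input.response_format ?? "json",
    use_gpu: input.use_gpu ?? false,
  };
}

const resolved = withDefaults({ source: "Hello", translation: "こんにちは" });
console.log(resolved.response_format, resolved.use_gpu);
```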
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| score | Yes | Quality score between 0 and 1 | |
| errors | Yes | Detected error spans | |
| summary | Yes | Human-readable quality summary | |
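A consumer of this output will often want a compact view of the `errors` array. The sketch below tallies errors per severity and reports the worst level present; `countBySeverity` and `worstSeverity` are hypothetical helpers, and the sample errors are made up.

```typescript
type Severity = "minor" | "major" | "critical";

interface ErrorSpan {
  text: string;
  start: number;
  end: number;
  severity: Severity;
}

// Tally how many detected errors fall into each severity level.
function countBySeverity(errors: ErrorSpan[]): Record<Severity, number> {
  const counts: Record<Severity, number> = { minor: 0, major: 0, critical: 0 };
  for (const e of errors) {
    counts[e.severity] += 1;
  }
  return counts;
}

// Return the most severe level present, or null when there are no errors.
function worstSeverity(errors: ErrorSpan[]): Severity | null {
  const order: Severity[] = ["critical", "major", "minor"];
  for (const level of order) {
    if (errors.some((e) => e.severity === level)) return level;
  }
  return null;
}

// Made-up error spans, used only to exercise the helpers.
const sampleErrors: ErrorSpan[] = [
  { text: "cat", start: 4, end: 7, severity: "minor" },
  { text: "sat", start: 8, end: 11, severity: "major" },
  { text: "mat", start: 19, end: 22, severity: "minor" },
];

console.log(countBySeverity(sampleErrors), worstSeverity(sampleErrors));
```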
Implementation Reference
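- src/tools/index.ts:107-142 (handler) Registration and handler for the xcomet_evaluate MCP tool. The handler calls xCometService.evaluate() with the source, translation, reference, and use_gpu params, then formats the result as JSON or markdown. Note that source_lang and target_lang are accepted by the input schema but are not forwarded to the service in this handler.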
```typescript
// Tool: xcomet_evaluate
server.registerTool(
  "xcomet_evaluate",
  {
    title: "Evaluate Translation Quality",
    description: TOOL_DESCRIPTIONS.evaluate,
    inputSchema: {
      source: EvaluateInputSchema.shape.source,
      translation: EvaluateInputSchema.shape.translation,
      reference: EvaluateInputSchema.shape.reference,
      source_lang: EvaluateInputSchema.shape.source_lang,
      target_lang: EvaluateInputSchema.shape.target_lang,
      response_format: EvaluateInputSchema.shape.response_format,
      use_gpu: EvaluateInputSchema.shape.use_gpu,
    },
    outputSchema: {
      score: EvaluateOutputSchema.shape.score,
      errors: EvaluateOutputSchema.shape.errors,
      summary: EvaluateOutputSchema.shape.summary,
    },
    annotations: READ_ONLY_ANNOTATIONS,
  },
  async (params: EvaluateInput) => {
    try {
      const result = await xCometService.evaluate(
        params.source,
        params.translation,
        params.reference,
        params.use_gpu
      );
      return createToolResponse(result, params.response_format, "Translation Quality Evaluation");
    } catch (error) {
      return createErrorResponse(error, "evaluating translation");
    }
  }
);
```
- src/schemas/index.ts:43-74 (schema) EvaluateInputSchema defines the input parameters for xcomet_evaluate: source (required), translation (required), reference (optional), source_lang (optional ISO code), target_lang (optional ISO code), response_format (json/markdown, default json), use_gpu (boolean, default false).
```typescript
export const EvaluateInputSchema = z.object({
  source: z
    .string()
    .min(1, "Source text is required")
    .max(MAX_TEXT_LENGTH, `Source text must not exceed ${MAX_TEXT_LENGTH} characters`)
    .describe("Original source text"),
  translation: z
    .string()
    .min(1, "Translation text is required")
    .max(MAX_TEXT_LENGTH, `Translation text must not exceed ${MAX_TEXT_LENGTH} characters`)
    .describe("Translated text to evaluate"),
  reference: z
    .string()
    .max(MAX_TEXT_LENGTH)
    .optional()
    .describe("Optional reference translation for comparison"),
  source_lang: z
    .string()
    .length(2)
    .optional()
    .describe("Source language code (ISO 639-1, e.g., 'en', 'ja')"),
  target_lang: z
    .string()
    .length(2)
    .optional()
    .describe("Target language code (ISO 639-1, e.g., 'en', 'ja')"),
  response_format: ResponseFormat.default("json").describe(
    "Output format: 'json' for structured data or 'markdown' for human-readable"
  ),
  use_gpu: UseGpuSchema,
});

export type EvaluateInput = z.infer<typeof EvaluateInputSchema>;
```
- src/schemas/index.ts:79-93 (schema) EvaluateOutputSchema defines the return type: score (0-1), errors array with text/start/end/severity, and a human-readable summary string.
```typescript
export const EvaluateOutputSchema = z.object({
  score: z.number().min(0).max(1).describe("Quality score between 0 and 1"),
  errors: z
    .array(
      z.object({
        text: z.string().describe("Error span text"),
        start: z.number().describe("Start position in translation"),
        end: z.number().describe("End position in translation"),
        severity: ErrorSeverity.describe("Error severity level"),
      })
    )
    .describe("Detected error spans"),
  summary: z.string().describe("Human-readable quality summary"),
});

export type EvaluateOutput = z.infer<typeof EvaluateOutputSchema>;
```
- src/services/xcomet.ts:147-170 (handler) The XCometService.evaluate() method that performs the actual evaluation logic. It validates the reference requirement, sends an 'evaluate' RPC request to the Python server via the stdio JSON-RPC protocol, and returns the EvaluateOutput.
```typescript
async evaluate(
  source: string,
  translation: string,
  reference?: string,
  useGpu: boolean = false
): Promise<EvaluateOutput> {
  // Validate reference requirement
  if (!reference && modelRequiresReference(this.config.model)) {
    throw new Error(XCometServiceErrors.referenceRequired(this.config.model));
  }

  const result = await this.serverManager.request<EvaluateOutput>(
    "evaluate",
    {
      source,
      translation,
      reference,
      use_gpu: useGpu,
    },
    this.config.timeout
  );

  return result;
}
```
- src/tools/index.ts:64-101 (helper) formatAsMarkdown helper used by createToolResponse to render evaluation results as a human-readable markdown string with star ratings, quality score percentage, and error tables.
```typescript
function formatAsMarkdown(data: Record<string, unknown>, title: string): string {
  let md = `## ${title}\n\n`;

  if ("score" in data && typeof data.score === "number") {
    const score = data.score;
    const stars =
      score >= 0.9
        ? "⭐⭐⭐⭐⭐"
        : score >= 0.7
          ? "⭐⭐⭐⭐"
          : score >= 0.5
            ? "⭐⭐⭐"
            : score >= 0.3
              ? "⭐⭐"
              : "⭐";
    md += `**Quality Score:** ${(score * 100).toFixed(1)}% ${stars}\n\n`;
  }

  if ("summary" in data && typeof data.summary === "string") {
    md += `**Summary:** ${data.summary}\n\n`;
  }

  if ("errors" in data && Array.isArray(data.errors) && data.errors.length > 0) {
    md += `### Detected Errors\n\n`;
    md += `| Severity | Text | Position |\n`;
    md += `|----------|------|----------|\n`;
    for (const error of data.errors) {
      const e = error as { severity: string; text: string; start: number; end: number };
      const severityEmoji = e.severity === "critical" ? "🔴" : e.severity === "major" ? "🟠" : "🟡";
      md += `| ${severityEmoji} ${e.severity} | ${e.text} | ${e.start}-${e.end} |\n`;
    }
    md += "\n";
  }

  if ("results" in data && Array.isArray(data.results)) {
    md += `### Batch Results\n\n`;
    md += `| # | Score | Errors | Critical |\n`;
    md += `|---|-------|--------|----------|\n`;
    for (const r of data.results) {
      const result = r as { index: number; score: number; error_count: number; has_critical_errors: boolean };
      md += `| ${result.index + 1} | ${(result.score * 100).toFixed(1)}% | ${result.error_count} | ${result.has_critical_errors ? "⚠️ Yes" : "✓ No"} |\n`;
    }
    md += "\n";
  }

  return md;
}
```
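The score-to-stars mapping inside formatAsMarkdown can be isolated for testing or reuse. The sketch below reproduces the same thresholds (0.9, 0.7, 0.5, 0.3) in standalone functions; `starCount` and `scoreLine` are hypothetical names and do not appear in the source.

```typescript
// Map a 0-1 quality score to a 1-5 star count, using the same
// thresholds as the nested ternary in formatAsMarkdown.
function starCount(score: number): number {
  if (score >= 0.9) return 5;
  if (score >= 0.7) return 4;
  if (score >= 0.5) return 3;
  if (score >= 0.3) return 2;
  return 1;
}

// Render the "Quality Score" line in the same format as formatAsMarkdown:
// a percentage with one decimal place followed by the star rating.
function scoreLine(score: number): string {
  const stars = "⭐".repeat(starCount(score));
  return `**Quality Score:** ${(score * 100).toFixed(1)}% ${stars}`;
}

console.log(scoreLine(0.9));
```

Returning a count rather than a string keeps the thresholds separate from the rendering, so either can change without touching the other.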