create_evaluation
Evaluate a trained or base model against a dataset with customizable evaluators. Specify dataset and evaluator IDs to run code_execution, similarity, or llm_judge evaluations.
Instructions
Create a new model evaluation: run your trained model or a base model against a dataset using selected evaluators. Either user_model_id or base_model must be provided. Use list_evaluators to see the available evaluators (e.g. code_execution, similarity, llm_judge).
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| name | No | Name for this evaluation run | |
| user_model_id | No | ID of your trained model to evaluate. Either this or base_model is required. | |
| base_model | No | HuggingFace model ID to evaluate (e.g. 'Qwen/Qwen2.5-Coder-7B-Instruct'). Either this or user_model_id is required. | |
| dataset_id | Yes | ID of the evaluation dataset to use. Must be a dataset marked for_evaluation. | |
| evaluator_ids | Yes | List of evaluator IDs to run (use list_evaluators to see options) | |
| max_samples | No | Maximum samples to evaluate | all |
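For example, a call to this tool from an MCP client might look like the sketch below. It assumes an already-connected `Client` from `@modelcontextprotocol/sdk` (transport setup elided); the run name, dataset ID, and sample cap are placeholder values.

```ts
// Sketch of a create_evaluation call via the MCP TypeScript SDK.
// Assumes `client` is a connected Client instance; "ds_abc123" and the
// other argument values are placeholders, not real IDs.
const result = await client.callTool({
  name: "create_evaluation",
  arguments: {
    name: "qwen-coder-baseline",
    base_model: "Qwen/Qwen2.5-Coder-7B-Instruct", // or pass user_model_id instead
    dataset_id: "ds_abc123",                      // must be marked for_evaluation
    evaluator_ids: ["code_execution", "similarity"],
    max_samples: 100,                             // omit to evaluate all samples
  },
});
```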
Implementation Reference
- src/mcp.ts:549-582 (registration): Tool registration for 'create_evaluation' in the MCP server's tools list, defining its name, description, and input schema.
{ name: "create_evaluation", description: "Create a new model evaluation. Run your trained model or a base model against a dataset using selected evaluators. " + "Use list_evaluators to see available evaluators (e.g. code_execution, similarity, llm_judge).", inputSchema: { type: "object" as const, properties: { name: { type: "string", description: "Name for this evaluation run" }, user_model_id: { type: "string", description: "ID of your trained model to evaluate. Either this or base_model is required.", }, base_model: { type: "string", description: "HuggingFace model ID to evaluate (e.g. 'Qwen/Qwen2.5-Coder-7B-Instruct'). Either this or user_model_id is required.", }, dataset_id: { type: "string", description: "ID of the evaluation dataset to use. Must be a dataset marked for_evaluation.", }, evaluator_ids: { type: "array", items: { type: "string" }, description: "List of evaluator IDs to run (use list_evaluators to see options)", }, max_samples: { type: "number", description: "Maximum samples to evaluate (default: all)", }, }, required: ["dataset_id", "evaluator_ids"], }, }, - src/mcp.ts:915-930 (handler)Handler for the 'create_evaluation' tool call. Validates that either user_model_id or base_model is provided, then calls the client's createEvaluation method.
case "create_evaluation": if (!args?.user_model_id && !args?.base_model) { return { content: [{ type: "text", text: "Error: either user_model_id or base_model is required" }], isError: true, }; } result = await getClient().createEvaluation({ name: args?.name as string | undefined, user_model_id: args?.user_model_id as string | undefined, base_model: args?.base_model as string | undefined, dataset_id: args!.dataset_id as string, evaluator_ids: args!.evaluator_ids as string[], max_samples: args?.max_samples as number | undefined, }); break; - src/mcp.ts:554-581 (schema)Input schema for create_evaluation: defines properties name, user_model_id, base_model, dataset_id, evaluator_ids, max_samples, with required fields dataset_id and evaluator_ids.
- src/mcp.ts:554-581 (schema): Input schema for create_evaluation, defining the properties name, user_model_id, base_model, dataset_id, evaluator_ids, and max_samples, with dataset_id and evaluator_ids required. This span is the inputSchema object already shown in the registration snippet above, so it is not repeated here.
- src/client.ts:280-289 (helper): Client-side createEvaluation method that sends a POST request to /api/v1/evaluations with the provided parameters.
```ts
async createEvaluation(params: {
  name?: string;
  user_model_id?: string;
  base_model?: string;
  dataset_id: string;
  evaluator_ids: string[];
  max_samples?: number;
}): Promise<any> {
  return this.request("POST", "/api/v1/evaluations", params);
}
```
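The `request` helper that `createEvaluation` delegates to is not shown in this section. Below is a minimal sketch of what such a helper could look like, assuming a `fetch`-based client with `baseUrl` and `apiKey` fields; both are assumptions, not confirmed by the source.

```ts
// Hypothetical implementation of the `request` helper used above.
// `this.baseUrl` and `this.apiKey` are assumed fields; the real client may differ.
private async request(method: string, path: string, body?: unknown): Promise<any> {
  const res = await fetch(`${this.baseUrl}${path}`, {
    method,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${this.apiKey}`,
    },
    body: body !== undefined ? JSON.stringify(body) : undefined,
  });
  if (!res.ok) {
    throw new Error(`Request failed: ${res.status} ${res.statusText}`);
  }
  return res.json();
}
```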