create_evaluation
Evaluate a trained or base model against a dataset with customizable evaluators. Specify dataset and evaluator IDs to run code_execution, similarity, or llm_judge evaluations.
Instructions
Create a new model evaluation. Run your trained model or a base model against a dataset using selected evaluators. Use list_evaluators to see available evaluators (e.g. code_execution, similarity, llm_judge).
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| name | No | Name for this evaluation run | |
| user_model_id | No | ID of your trained model to evaluate. Either this or base_model is required. | |
| base_model | No | HuggingFace model ID to evaluate (e.g. 'Qwen/Qwen2.5-Coder-7B-Instruct'). Either this or user_model_id is required. | |
| dataset_id | Yes | ID of the evaluation dataset to use. Must be a dataset marked for_evaluation. | |
| evaluator_ids | Yes | List of evaluator IDs to run (use list_evaluators to see options) | |
| max_samples | No | Maximum number of samples to evaluate | all |
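
As a rough illustration of the schema above, the following Python sketch assembles an arguments payload for create_evaluation and checks the documented requirements before sending. The helper name `build_create_evaluation_args` and the IDs in the usage example are hypothetical placeholders, not part of the tool. How the payload is actually submitted depends on your client and is not shown. The sketch assumes that an empty evaluator list is invalid and that supplying both model fields is left for the server to judge.

```python
from typing import Any, Optional


def build_create_evaluation_args(
    dataset_id: str,
    evaluator_ids: list[str],
    user_model_id: Optional[str] = None,
    base_model: Optional[str] = None,
    name: Optional[str] = None,
    max_samples: Optional[int] = None,
) -> dict[str, Any]:
    """Hypothetical helper: build the create_evaluation arguments dict."""
    # The schema requires one of user_model_id / base_model.
    if not user_model_id and not base_model:
        raise ValueError("either user_model_id or base_model is required")
    # dataset_id and evaluator_ids are marked Required in the schema.
    if not dataset_id:
        raise ValueError("dataset_id is required")
    if not evaluator_ids:
        # Assumption: an empty list is treated as missing.
        raise ValueError("evaluator_ids must contain at least one ID")

    args: dict[str, Any] = {
        "dataset_id": dataset_id,
        "evaluator_ids": list(evaluator_ids),
    }
    # Include optional fields only when set, so server defaults apply
    # (for example, max_samples defaults to all samples).
    optional = {
        "name": name,
        "user_model_id": user_model_id,
        "base_model": base_model,
        "max_samples": max_samples,
    }
    args.update({k: v for k, v in optional.items() if v is not None})
    return args


# Usage with placeholder IDs; real evaluator IDs come from list_evaluators.
example_args = build_create_evaluation_args(
    dataset_id="my-eval-dataset-id",
    evaluator_ids=["similarity"],
    base_model="Qwen/Qwen2.5-Coder-7B-Instruct",
    name="baseline-run",
    max_samples=50,
)
```

Leaving `max_samples` unset omits it from the payload, which lets the documented default (all samples) apply.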