run_evaluation
Load a dataset, query models via streaming, and score responses with an LLM-as-judge to produce per-question scores, an aggregate summary, and a pass/fail status.
Instructions
Run an LLM evaluation: load a dataset, query models via streaming, score responses with an LLM-as-judge, and return per-question scores, an aggregate summary, and a pass/fail status.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| dataset_path | Yes | Path to the JSON evaluation dataset file. | |
| models | Yes | Models to evaluate. | |
| judge | No | Configuration for the LLM judge. | |
| output_dir | No | Directory to save results JSON files. | |
| tracing | No | Optional tracing configuration. | |
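
As a rough illustration of the table above, the sketch below assembles an arguments object for `run_evaluation` and checks it against the required and optional parameter names. Only the parameter names and required flags come from the table. The helper `build_arguments`, the example path, the model names, and the assumption that `models` is a list of model-name strings are all hypothetical, since the schema shown here does not specify value types.

```python
import json

# Parameter names and required flags are taken from the Input Schema table.
REQUIRED = ("dataset_path", "models")
OPTIONAL = ("judge", "output_dir", "tracing")


def build_arguments(**kwargs):
    """Hypothetical helper: check names against the table and return a dict."""
    unknown = set(kwargs) - set(REQUIRED) - set(OPTIONAL)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    missing = [name for name in REQUIRED if name not in kwargs]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    return dict(kwargs)


args = build_arguments(
    dataset_path="datasets/example.json",  # hypothetical path
    models=["model-a", "model-b"],         # assumed shape: list of model names
    output_dir="results",                  # hypothetical directory
)

# Serialize the arguments as JSON, e.g. to pass along in a tool call.
print(json.dumps(args, indent=2))
```

Omitting `dataset_path` or `models` raises a `ValueError` in this sketch. The real tool may report missing parameters differently.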