llm_model_eval
Evaluate and benchmark all available models on reasoning and code tasks to determine quality, speed, and accuracy, then optimize routing priorities.
Instructions
Runs a suite of benchmark tasks (reasoning and code) against every available local and remote model (Ollama, Codex, APIs) to measure quality, speed, and accuracy. Results are cached for 7 days and used to optimize routing priorities.
Call the tool manually to force a re-evaluation; otherwise it runs automatically once per week at session end.
Returns: Formatted evaluation results with quality scores and latency metrics.
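The caching and routing behavior described above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the `EvalResult` type, its `quality` and `latency_s` fields, and the ranking rule (higher quality first, lower latency as the tie-breaker) are all assumptions. Only the 7-day cache lifetime comes from the description.

```python
import time
from dataclasses import dataclass

# 7-day cache lifetime, as stated in the tool description.
CACHE_TTL_SECONDS = 7 * 24 * 60 * 60


@dataclass
class EvalResult:
    """Hypothetical per-model benchmark summary (field names are assumptions)."""
    model: str
    quality: float    # assumed score where higher is better
    latency_s: float  # assumed mean latency in seconds


def cache_is_fresh(evaluated_at, now=None):
    """Return True if results stored at `evaluated_at` are under 7 days old."""
    now = time.time() if now is None else now
    return (now - evaluated_at) < CACHE_TTL_SECONDS


def routing_order(results):
    """Rank models by quality (descending), then by latency (ascending).

    This ranking rule is an assumption made for illustration.
    """
    ranked = sorted(results, key=lambda r: (-r.quality, r.latency_s))
    return [r.model for r in ranked]
```

Under these assumptions, a caller would re-run the benchmark only when `cache_is_fresh` returns `False` or when a re-evaluation is forced manually, and would otherwise reuse the cached ordering from `routing_order`.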
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| No arguments | | | |
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | Formatted evaluation results with quality scores and latency metrics. | |