# llm_model_eval
Evaluates and benchmarks diverse language models, measuring quality, speed, and accuracy to optimize task routing.
## Instructions
Evaluates and benchmarks all available local and remote models. Runs a suite of
benchmark tasks (reasoning, code) against each available model (Ollama, Codex,
APIs) to measure quality, speed, and accuracy. Results are cached for 7 days
and used to optimize routing priorities. The skill can be invoked manually to
force a re-evaluation; otherwise it runs automatically once per week at
session end.
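
The skill's internal implementation is not shown in this document; the sketch below illustrates the evaluate-and-cache loop it describes, under stated assumptions. Every name here (`CACHE_PATH`, `BENCHMARKS`, `run_prompt`, the cache file layout) is hypothetical, not the skill's actual API.

```python
import json
import time
from pathlib import Path

# Hypothetical cache location and the 7-day TTL described above.
CACHE_PATH = Path("~/.cache/llm_model_eval/results.json").expanduser()
CACHE_TTL_SECONDS = 7 * 24 * 3600

# Hypothetical benchmark suite: (task name, prompt, expected answer substring).
BENCHMARKS = [
    ("reasoning", "Q: All widgets are gadgets; X is a widget. Is X a gadget?", "yes"),
    ("code", "Write a Python function that reverses a string.", "def"),
]

def load_cache() -> dict | None:
    """Return cached results if they are younger than the 7-day TTL."""
    if CACHE_PATH.exists():
        data = json.loads(CACHE_PATH.read_text())
        if time.time() - data.get("timestamp", 0) < CACHE_TTL_SECONDS:
            return data["results"]
    return None

def evaluate(models: list[str], run_prompt, force: bool = False) -> dict:
    """Benchmark each model. `run_prompt(model, prompt)` is a caller-supplied
    function that sends one prompt to one backend (Ollama, Codex, an API, ...)
    and returns the text of the response."""
    if not force and (cached := load_cache()) is not None:
        return cached  # reuse results cached within the last 7 days
    results = {}
    for model in models:
        scores, latencies = [], []
        for _task, prompt, expected in BENCHMARKS:
            start = time.monotonic()
            answer = run_prompt(model, prompt)
            latencies.append(time.monotonic() - start)
            scores.append(1.0 if expected in answer else 0.0)  # crude accuracy check
        results[model] = {
            "quality": sum(scores) / len(scores),
            "avg_latency_s": sum(latencies) / len(latencies),
        }
    CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
    CACHE_PATH.write_text(json.dumps({"timestamp": time.time(), "results": results}))
    return results
```

Passing `force=True` mirrors the manual re-evaluation path: the cache check is skipped and fresh results overwrite the stored ones.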
## Returns

Formatted evaluation results with quality scores and latency metrics.

## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| No arguments | | | |
## Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | Formatted evaluation results with quality scores and latency metrics. | |
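
For illustration only, a call with no arguments might yield a payload shaped like the following; the model names, scores, and latencies are placeholder values, not real benchmark output:

```json
{
  "result": "llama3:8b    quality=0.74  avg_latency=310ms\ngpt-4o-mini  quality=0.91  avg_latency=820ms"
}
```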