llm_model_eval
Run benchmark evaluations across local and remote models to measure quality, speed, and accuracy. Results optimize routing priorities for AI task allocation.
Instructions
Evaluate and benchmark all available local and remote models.
Runs a suite of benchmark tasks (reasoning, code) against each available model (Ollama, Codex, APIs) to determine quality, speed, and accuracy. Results are cached for 7 days and used to optimize routing priorities.
Can be called manually to force a re-evaluation, or automatically runs once per week during session-end.
Returns: Formatted evaluation results with quality scores and latency metrics.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
No arguments | |||
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes |