# compare_methods
Compare AI models and methods side-by-side on shared benchmarks to evaluate performance differences across common datasets and metrics.
## Instructions
Compare 2-10 models/methods side-by-side across shared benchmarks. The tool finds datasets on which at least two of the specified models have been evaluated, enabling direct score comparison. Example: compare GPT-4, LLaMA-3, and Mistral across MMLU, GSM8K, etc.
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| models | Yes | Model/method names to compare e.g. ['GPT-4', 'LLaMA-3', 'Mistral'] | |
| dataset | No | Filter to a specific dataset e.g. 'MMLU' | |
| metric | No | Filter to a specific metric e.g. 'accuracy' | |
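As a sketch of how a client might assemble a request matching the schema above, the following builds and checks a `compare_methods` payload. The `validate_request` helper and the payload shape are illustrative assumptions, not part of the tool's actual API surface.

```python
# Illustrative sketch: build and validate a compare_methods request
# against the documented input schema. The helper below is hypothetical.

def validate_request(request):
    """Check a compare_methods request against the documented schema."""
    models = request.get("models")
    # 'models' is required and must name 2-10 models/methods
    if not isinstance(models, list) or not (2 <= len(models) <= 10):
        raise ValueError("'models' must be a list of 2-10 model/method names")
    # 'dataset' and 'metric' are optional string filters
    for key in ("dataset", "metric"):
        if key in request and not isinstance(request[key], str):
            raise ValueError(f"'{key}' must be a string when provided")
    return request

request = validate_request({
    "models": ["GPT-4", "LLaMA-3", "Mistral"],  # required: 2-10 names
    "dataset": "MMLU",                          # optional dataset filter
    "metric": "accuracy",                       # optional metric filter
})
```

Omitting `dataset` and `metric` compares the models across every benchmark where at least two of them overlap.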