Run a live A/B test against the engine's TOP 3 PICKS for a stated purpose — the engine chooses the candidates from the full catalog. Generates 5 representative test queries (auto-expands to 10 or 15 if results are too close to call), runs them through the picked models in parallel, and returns real cost, latency, and plain-English commentary on who won what. Use AFTER `pick` or `rank` when the user wants the engine's own picks stress-tested with live data. DO NOT use this when the user has already named specific candidate models — the engine will ignore the names and test its own picks. Use `compare` instead in that case. Costs more than `rank` (15+ live LLM calls).
Connector