optimize_bandit
Select the optimal variant from a set of options using multi-armed bandit algorithms to balance exploration and exploitation based on observed rewards.
Instructions
Pick the best option from a set of variants using a multi-armed bandit algorithm: UCB1, Thompson sampling, or ε-greedy. Use this when you have N options with observed reward history and need to choose the next one with a good explore/exploit tradeoff, for example A/B test arm selection, ad or email variant routing, or recommendation ranking. For context-dependent selection (a different best option per user or situation), use optimize_contextual instead. For continuous parameter tuning, use optimize_cmaes. Returns the selected arm and a score breakdown in under 1 ms.
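To make the explore/exploit tradeoff concrete, here is a minimal sketch of textbook UCB1 selection. It is not the tool's actual implementation: the arm fields `id`, `pulls`, and `total_reward` are hypothetical, since the shape of an arm is not specified here, and the exploration constant follows the standard UCB1 formula.

```python
import math

def ucb1_select(arms):
    """Pick an arm with textbook UCB1.

    `arms` is a list of dicts with hypothetical fields:
    id (str), pulls (int), total_reward (float).
    """
    total_pulls = sum(a["pulls"] for a in arms)
    best = None
    for arm in arms:
        if arm["pulls"] == 0:
            # An untried arm has unbounded uncertainty, so it is tried first.
            return {"selected": arm["id"], "algorithm": "ucb1",
                    "exploitation": 0.0, "exploration": math.inf,
                    "score": math.inf}
        # Exploitation: observed mean reward of this arm.
        exploitation = arm["total_reward"] / arm["pulls"]
        # Exploration: bonus that shrinks as the arm is pulled more often.
        exploration = math.sqrt(2 * math.log(total_pulls) / arm["pulls"])
        score = exploitation + exploration
        if best is None or score > best["score"]:
            best = {"selected": arm["id"], "algorithm": "ucb1",
                    "exploitation": exploitation,
                    "exploration": exploration, "score": score}
    return best

# Assumed example data: arm "b" has a lower mean but far fewer pulls.
example_arms = [
    {"id": "a", "pulls": 100, "total_reward": 60.0},
    {"id": "b", "pulls": 10, "total_reward": 5.0},
]
result = ucb1_select(example_arms)
print(result["selected"], round(result["score"], 3))
```

In this example the less-explored arm wins despite its lower mean, because its uncertainty bonus outweighs the gap in observed reward.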
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| arms | Yes | Candidate options to choose between (at least 2). | |
| algorithm | No | Selection algorithm. UCB1 is deterministic; thompson and epsilon-greedy are randomized, so repeated calls with the same input may return different arms. | ucb1 |
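The difference between the deterministic and sampling algorithms can be sketched as follows. This is an illustration under assumptions, not the tool's code: the arm fields `id`, `successes`, and `failures` are hypothetical, rewards are assumed to be binary, and the Thompson variant uses a Beta(1, 1) prior.

```python
import random

def thompson_select(arms, rng):
    """Sample a plausible success rate per arm and pick the highest draw."""
    draws = {
        a["id"]: rng.betavariate(a["successes"] + 1, a["failures"] + 1)
        for a in arms
    }
    return max(draws, key=draws.get)

def epsilon_greedy_select(arms, rng, epsilon=0.1):
    """With probability epsilon explore at random, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(arms)["id"]

    def mean(a):
        # Guard against division by zero for arms with no history.
        return a["successes"] / max(1, a["successes"] + a["failures"])

    return max(arms, key=mean)["id"]

# Assumed example data.
example_arms = [
    {"id": "a", "successes": 60, "failures": 40},
    {"id": "b", "successes": 5, "failures": 5},
]
rng = random.Random(0)  # seeding makes the randomized choice reproducible
print(thompson_select(example_arms, rng))
print(epsilon_greedy_select(example_arms, rng))
```

Passing a seeded `random.Random` instance is one way to get reproducible behavior from the randomized algorithms in tests.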
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| selected | Yes | The chosen arm. | |
| score | Yes | Combined exploitation + exploration score. | |
| algorithm | Yes | Which algorithm produced the selection. | |
| exploitation | No | Pure mean-reward component. | |
| exploration | No | Uncertainty bonus added to exploitation. | |
| regret | No | Cumulative regret estimate (lower is better). | |
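How the tool computes `regret` is not stated here. One common empirical estimate, shown below as an assumption rather than the tool's definition, sums over arms the number of pulls times the gap between the best observed mean and that arm's mean. The arm fields `id`, `pulls`, and `total_reward` are hypothetical.

```python
def estimate_regret(arms):
    """Empirical cumulative regret: sum of pulls * (best mean - arm mean).

    Arms with no pulls contribute nothing and are ignored.
    """
    pulled = [a for a in arms if a["pulls"] > 0]
    if not pulled:
        return 0.0
    means = {a["id"]: a["total_reward"] / a["pulls"] for a in pulled}
    best_mean = max(means.values())
    return sum(a["pulls"] * (best_mean - means[a["id"]]) for a in pulled)

# Assumed example data: 10 pulls were spent on an arm that is 0.1 worse.
example_arms = [
    {"id": "a", "pulls": 100, "total_reward": 60.0},
    {"id": "b", "pulls": 10, "total_reward": 5.0},
]
print(estimate_regret(example_arms))
```

Because the estimate uses observed means in place of the unknown true means, it is only as reliable as the reward history behind it.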