The best of both worlds: stochastic and adversarial bandits

We present a bandit algorithm, SAO (Stochastic and Adversarial Optimal), whose regret is essentially optimal both for adversarial rewards and for stochastic rewards. Specifically, SAO combines the O(√n) worst-case regret of Exp3 [Auer et al., 2002b] for adversarial rewards with the (poly)logarithmic regret of UCB1 [Auer et al., 2002a] for stochastic rewards. Adversarial rewards and stochastic rewards are the two main settings in the literature on (non-Bayesian) multi-armed bandits. Prior work on multi-armed bandits treats them separately and does not attempt to optimize jointly for both. Our result falls under a general theme: achieving good worst-case performance while also taking advantage of “nice” problem instances, an important issue in the design of algorithms with partially known inputs.
Sébastien Bubeck, Aleksandrs Slivkins
Type Journal
Year 2012
Where CoRR
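The paper itself is not reproduced here, but the two building blocks the abstract names are standard. Below is a minimal sketch of UCB1 and Exp3 for a K-armed bandit with rewards in [0, 1], run on a hypothetical Bernoulli instance; this is not the paper's SAO algorithm, whose method of combining the two regimes is the actual contribution.

# A minimal sketch of the two baselines the abstract cites: UCB1
# (Auer et al., 2002a) for stochastic rewards and Exp3 (Auer et al.,
# 2002b) for adversarial rewards. This is NOT the paper's SAO
# algorithm; it only illustrates the two regret regimes SAO combines.
import math
import random


def ucb1(n_arms, horizon, reward_fn):
    """UCB1: play the arm maximizing empirical mean + sqrt(2 ln t / n_i)."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # play every arm once before using the index
        else:
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        counts[arm] += 1
        sums[arm] += reward_fn(arm)
    return counts


def exp3(n_arms, horizon, reward_fn, gamma=0.1):
    """Exp3: exponential weights with gamma-uniform exploration."""
    weights = [1.0] * n_arms
    total = 0.0
    for _ in range(horizon):
        w_sum = sum(weights)
        probs = [(1 - gamma) * w / w_sum + gamma / n_arms for w in weights]
        arm = random.choices(range(n_arms), weights=probs)[0]
        r = reward_fn(arm)
        total += r
        # Importance-weighted reward estimate keeps the update unbiased.
        weights[arm] *= math.exp(gamma * (r / probs[arm]) / n_arms)
        # Rescale to avoid float overflow; probabilities are unaffected.
        m = max(weights)
        weights = [w / m for w in weights]
    return total


if __name__ == "__main__":
    means = [0.3, 0.5, 0.7]  # hypothetical stochastic instance
    bernoulli = lambda i: 1.0 if random.random() < means[i] else 0.0
    print("UCB1 pull counts:", ucb1(3, 10_000, bernoulli))
    print("Exp3 total reward:", exp3(3, 10_000, bernoulli))

On a stochastic instance like the one above, UCB1's pull counts concentrate on the best arm, giving logarithmic regret, while Exp3 keeps at least gamma/K probability on every arm, the price of its O(√n) guarantee against adversarial rewards.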