Best Arm Identification in Multi-Armed Bandits


We consider the problem of finding the best arm in a stochastic multi-armed bandit game. The regret of a forecaster is here defined by the gap between the mean reward of the optimal arm and the mean reward of the ultimately chosen arm. We propose a highly exploring UCB policy and a new algorithm based on successive rejects. We show that these algorithms are essentially optimal since their regret decreases exponentially at a rate which is, up to a logarithmic factor, the best possible. However, while the UCB policy needs the tuning of a parameter depending on the unobservable hardness of the task, the successive rejects policy benefits from being parameter-free, and also independent of the scaling of the rewards. As a by-product of our analysis, we show that identifying the best arm (when it is unique) requires a number of samples of order (up to a log(K) factor) $\sum_i 1/\Delta_i^2$, where the sum is on the suboptimal arms and $\Delta_i$ represents the difference between the mean reward of the best arm a...
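
The abstract only names the successive rejects policy, so the following is a minimal Python sketch of how such a policy is usually stated, not the authors' code. It assumes a fixed budget of pulls split into K - 1 phases, with the empirically worst arm dropped at the end of each phase. The names `pull`, `num_arms`, `budget`, and `hardness` are hypothetical, introduced here for illustration; `hardness` computes the sum of inverse squared gaps mentioned above and assumes the true means are known, which is never the case in practice.

```python
import math


def hardness(means):
    """Sum of 1 / gap^2 over suboptimal arms (assumes a unique best arm)."""
    best = max(means)
    return sum(1.0 / (best - m) ** 2 for m in means if m < best)


def successive_rejects(pull, num_arms, budget):
    """Return the index of the arm that survives num_arms - 1 rejections.

    pull(i) is a hypothetical callback returning one reward from arm i.
    The budget must exceed num_arms so that every arm is pulled at least once.
    """
    if budget <= num_arms:
        raise ValueError("budget must be larger than the number of arms")
    # Normalising constant: 1/2 + sum_{i=2..K} 1/i.
    log_bar = 0.5 + sum(1.0 / i for i in range(2, num_arms + 1))
    active = list(range(num_arms))
    totals = [0.0] * num_arms
    counts = [0] * num_arms
    prev_len = 0
    for phase in range(1, num_arms):
        # Cumulative number of pulls each surviving arm has after this phase.
        phase_len = math.ceil(
            (budget - num_arms) / (log_bar * (num_arms + 1 - phase))
        )
        for arm in active:
            for _ in range(phase_len - prev_len):
                totals[arm] += pull(arm)
                counts[arm] += 1
        prev_len = phase_len
        # Reject the arm with the lowest empirical mean.
        worst = min(active, key=lambda a: totals[a] / counts[a])
        active.remove(worst)
    return active[0]
```

A caller would supply its own reward source, for example `successive_rejects(lambda i: sample_reward(i), num_arms=4, budget=400)`, where `sample_reward` is likewise a placeholder. Note that the sketch takes no tuning parameter beyond the budget, which matches the parameter-free property claimed in the abstract.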

Jean-Yves Audibert, Sébastien Bubeck, Rémi Munos


COLT 2010 | Machine Learning | Mean Reward | Successive Rejects | UCB Policy


Post Info

Added	10 Feb 2011
Updated	10 Feb 2011
Type	Journal
Year	2010
Where	COLT
Authors	Jean-Yves Audibert, Sébastien Bubeck, Rémi Munos
