The multiarmed bandit problem is a typical example of the tradeoff between exploration and exploitation in reinforcement learning. This problem is expressed as a model of a gambler playing a slot machine with multiple arms. We study the stochastic bandit problem in which each arm has a reward distribution supported on a known bounded interval, e.g., [0, 1]. For this model, Auer et al. (2002) proposed practical policies called UCB and derived finite-time regret bounds for them. However, policies achieving the asymptotic bound given by Burnetas and Katehakis (1996) have remained unknown for this model. We propose the Deterministic Minimum Empirical Divergence (DMED) policy and prove that DMED achieves the asymptotic bound. Furthermore, the index used in DMED for choosing an arm can be computed easily by a convex optimization technique. Although we do not derive a finite-time regret bound, we confirm by simulations that DMED achieves a regret close to the asymptotic bound in finite time.
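As a rough illustration of the kind of convex optimization involved (not the paper's own implementation), the minimum empirical divergence for rewards bounded in [0, 1] can be evaluated through a one-dimensional concave dual of the form max over nu in [0, 1/(1-mu)] of E_F[log(1 - nu(X - mu))], where F is the empirical reward distribution of an arm and mu is a candidate mean. The sketch below assumes this dual form; the function name dmin_index and the use of SciPy's bounded scalar optimizer are illustrative choices, not part of the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dmin_index(samples, mu):
    """Sketch: minimum empirical KL divergence from the empirical distribution
    of `samples` (rewards in [0, 1]) to distributions with mean >= mu,
    computed via a one-dimensional concave dual (an assumed formulation)."""
    x = np.asarray(samples, dtype=float)
    if x.mean() >= mu:
        # The empirical mean already meets the target, so the divergence is zero.
        return 0.0
    hi = (1.0 - 1e-9) / (1.0 - mu)  # stay strictly inside the feasible range of nu

    def neg_dual(nu):
        # Negative dual objective: -E_F[ log(1 - nu * (X - mu)) ]
        return -np.mean(np.log(np.maximum(1.0 - nu * (x - mu), 1e-12)))

    res = minimize_scalar(neg_dual, bounds=(0.0, hi), method="bounded")
    return -res.fun

# Example: 200 Bernoulli(0.4) rewards compared against a candidate mean of 0.6
rng = np.random.default_rng(0)
print(dmin_index(rng.binomial(1, 0.4, size=200), 0.6))
```

Because the dual objective is concave in a single scalar variable, a bounded one-dimensional optimizer suffices, which is what makes the index inexpensive to compute at each round.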