Optimism in Reinforcement Learning Based on Kullback-Leibler Divergence

We consider model-based reinforcement learning in finite Markov Decision Processes (MDPs), focusing on so-called optimistic strategies. Optimism is usually implemented by carrying out extended value iterations under a constraint of consistency with the estimated model transition probabilities. In this paper, we strongly argue in favor of using the Kullback-Leibler (KL) divergence for this purpose. By studying the linear maximization problem under KL constraints, we provide an efficient algorithm for solving KL-optimistic extended value iteration. When implemented within the structure of UCRL2, the near-optimal method introduced by [2], this algorithm also achieves bounded regret in the undiscounted case. However, we provide some geometric arguments, as well as a concrete illustration on a simulated example, to explain the observed improved practical behavior, particularly when the MDP has reduced connectivity. To analyze this new algorithm, termed KL-UCRL, we also rely on recent devia...
Type Journal
Year 2010
Where CoRR
Authors Sarah Filippi, Olivier Cappé, Aurelien Garivier