Q-learning and enhanced policy iteration in discounted dynamic programming



We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy iteration-like algorithm for finding the optimal state costs or Q-factors. The main difference is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm requires solving an optimal stopping problem. The solution of this problem may be inexact, with a finite number of value iterations, in the spirit of modified policy iteration. The stopping problem structure is incorporated into the standard Q-learning algorithm, to obtain a new method that is intermediate between policy iteration and Q-learning/value iteration. Thanks to its special contraction properties, our method overcomes some of the traditional convergence difficulties of modified policy iteration, and admits asynchronous deterministic and stochastic iterative implementations, with lower overhead and/or more reliable convergence over existing Q-learning schemes. Furthermore, for lar...
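The policy evaluation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' reference implementation. It assumes a synchronous, model-based setting in which the stopping problem takes the following form: from the next state j, the evaluator pays the smaller of a fixed stopping cost J(j) = min_u Q(j, u) and the cost Q(j, mu(j)) of continuing with the current policy mu. That form is an assumption made for illustration, since the abstract does not spell it out. The function name `stopping_policy_iteration` and the parameters `P`, `G`, `alpha`, `outer_iters`, and `inner_iters` are hypothetical.

```python
def stopping_policy_iteration(P, G, alpha, outer_iters=50, inner_iters=5):
    """Sketch of policy iteration with an optimal-stopping evaluation step.

    P[i][u][j]: probability of moving from state i to state j under action u.
    G[i][u][j]: cost of that transition.
    alpha:      discount factor, 0 < alpha < 1.
    All names and the exact update are illustrative assumptions.
    """
    n = len(P)       # number of states
    m = len(P[0])    # number of actions (assumed equal in every state)
    Q = [[0.0] * m for _ in range(n)]
    mu = [0] * n

    for _ in range(outer_iters):
        # Policy improvement: stopping costs J and greedy policy mu,
        # both taken from the current Q-factors and then held fixed.
        J = [min(Q[i]) for i in range(n)]
        mu = [min(range(m), key=lambda u: Q[i][u]) for i in range(n)]

        # Inexact policy evaluation: a finite number of value iterations
        # on the stopping problem (stop at cost J[j], or continue with mu).
        for _ in range(inner_iters):
            Q = [[sum(P[i][u][j] * (G[i][u][j]
                                    + alpha * min(J[j], Q[j][mu[j]]))
                      for j in range(n))
                  for u in range(m)]
                 for i in range(n)]
    return Q, mu
```

With `inner_iters=1` the update reduces to ordinary value iteration on Q-factors, because min(J[j], Q[j][mu[j]]) equals J[j] right after the improvement step. Larger values of `inner_iters` move the sketch toward policy iteration, which matches the "intermediate" behaviour the abstract describes.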

Dimitri P. Bertsekas, Huizhen Yu


CDC 2010 | Control Systems | Iteration | Policy Iteration | Stopping Problem |


Added	13 May 2011
Updated	13 May 2011
Type	Conference
Year	2010
Where	CDC
Authors	Dimitri P. Bertsekas, Huizhen Yu
