In this paper, we consider planning and learning in infinite-horizon discounted-reward Markov decision problems. We propose a novel iterative direct policy-search approach, called dynamic policy programming (DPP). DPP is, to the best of our knowledge, the first convergent direct policy-search method that uses a Bellman-like iteration technique and is also compatible with function approximation. For the tabular case, we prove that DPP converges asymptotically to the optimal policy. We numerically compare the performance of DPP with that of other state-of-the-art approximate dynamic programming methods on the mountain-car problem, using linear function approximation with Gaussian basis functions. We observe that, unlike other approximate dynamic programming methods, DPP converges to a near-optimal policy, even when the basis functions are randomly placed. We conclude that DPP, combined with function approximation, asymptotically outperforms other approximate dynamic pro...
Mohammad Gheshlaghi Azar, Hilbert J. Kappen