Peter L. Bartlett, Jonathan Baxter

We model reinforcement learning as the problem of learning to control a Partially Observable Markov Decision Process (POMDP), and focus on gradient ascent approaches to this problem. In [3] we introduced GPOMDP, an algorithm for estimating the performance gradient of a POMDP from a single sample path, and we proved that this algorithm almost surely converges to an approximation to the gradient. In this paper, we provide a convergence rate for the estimates produced by GPOMDP, and give an improved bound on the approximation error of these estimates. Both of these bounds are in terms of mixing times of the POMDP.
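For concreteness, the following is a minimal Python sketch of the single-sample-path GPOMDP estimator described in [3]: an eligibility trace of score functions is discounted by a factor beta in [0, 1), and the gradient estimate is a running average of reward-weighted traces. The choice of beta trades the approximation error against estimation variance, with both effects governed by the POMDP's mixing time, as the paper's bounds make precise. The `env`/`policy` interfaces and the toy two-state chain are illustrative assumptions for this sketch, not part of the paper.

```python
import numpy as np


def gpomdp(env, policy, theta, beta=0.9, T=100_000):
    """Sketch of the GPOMDP gradient estimator of [3].

    env: hypothetical interface with reset() -> obs and
        step(action) -> (obs, reward).
    policy: function (theta, obs) -> (action, grad_log_prob), sampling an
        action and returning the score function grad_theta log mu(action|theta, obs).
    beta: discount in [0, 1); larger beta reduces the approximation error
        (bias) of the estimate at the cost of slower convergence.
    """
    z = np.zeros_like(theta, dtype=float)      # eligibility trace of score functions
    delta = np.zeros_like(theta, dtype=float)  # running gradient estimate
    obs = env.reset()
    for t in range(T):
        action, grad_logp = policy(theta, obs)
        obs, reward = env.step(action)
        z = beta * z + grad_logp                 # discounted trace update
        delta += (reward * z - delta) / (t + 1)  # running average of r_{t+1} z_{t+1}
    return delta  # converges a.s. to an approximation to the performance gradient


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    class TwoStateChain:
        """Toy chain (illustrative only): action 1 tends to reach the rewarding state."""
        def reset(self):
            self.state = 0
            return self.state  # observation = state here, for simplicity
        def step(self, action):
            p_good = 0.8 if action == 1 else 0.2
            self.state = int(rng.random() < p_good)
            return self.state, float(self.state)  # reward 1 in the good state

    def policy(theta, obs):
        # Bernoulli policy: P(action=1) = sigmoid(theta[0]).
        p = 1.0 / (1.0 + np.exp(-theta[0]))
        action = int(rng.random() < p)
        grad_logp = np.array([action - p])  # d/dtheta log mu(action | theta)
        return action, grad_logp

    est = gpomdp(TwoStateChain(), policy, np.zeros(1), beta=0.9, T=200_000)
    print("gradient estimate:", est)  # positive: raising theta raises average reward
```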