For a Markov Decision Process with a finite state space (of size $S$) and finite action spaces (of size $A$ per state), we propose a new algorithm, Delayed Q-learning. We prove it is PAC, achieving near-optimal performance for all but $\tilde{O}(SA)$ timesteps using $O(SA)$ space, improving on the $\tilde{O}(S^2 A)$ bounds of the best previous algorithms. This result proves that efficient reinforcement learning is possible without learning a model of the MDP from experience. Learning takes place from a single continuous thread of experience; no resets or parallel sampling is used. Beyond its smaller storage and experience requirements, Delayed Q-learning's per-experience computation cost is much less than that of previous PAC algorithms.
Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, Michael L. Littman
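As a concrete illustration of the model-free, $O(SA)$-space learning described in the abstract, below is a minimal Python sketch of a Delayed Q-learning agent. The environment interface (`env.reset()` and `env.step(a)`), the particular values of `m`, `eps1`, `gamma`, and `num_steps`, and the purely greedy action selection are illustrative assumptions for this sketch, not the exact settings analyzed in the paper.

```python
import numpy as np

def delayed_q_learning(env, S, A, gamma=0.95, m=20, eps1=0.1, num_steps=100_000):
    """Sketch of Delayed Q-learning on a finite MDP.

    Assumes `env` exposes reset() -> s and step(a) -> (s_next, r, done),
    with S states and A actions. The batch size m and threshold eps1 are
    illustrative; the paper derives specific values from its PAC analysis.
    """
    Q = np.full((S, A), 1.0 / (1.0 - gamma))    # optimistic initialization
    U = np.zeros((S, A))                        # accumulated update targets
    counts = np.zeros((S, A), dtype=int)        # samples since the last attempted update
    last_attempt = np.zeros((S, A), dtype=int)  # time of the last attempted update
    learn = np.ones((S, A), dtype=bool)         # LEARN flags
    t_star = 0                                  # time of the most recent successful update

    s = env.reset()
    for t in range(1, num_steps + 1):
        a = int(np.argmax(Q[s]))                # act greedily w.r.t. the optimistic Q
        s_next, r, done = env.step(a)

        if learn[s, a]:
            U[s, a] += r + gamma * np.max(Q[s_next])
            counts[s, a] += 1
            if counts[s, a] == m:               # attempt an update after m samples
                if Q[s, a] - U[s, a] / m >= 2 * eps1:
                    Q[s, a] = U[s, a] / m + eps1    # successful update
                    t_star = t
                elif last_attempt[s, a] >= t_star:
                    learn[s, a] = False         # no Q-value has changed; stop trying for now
                last_attempt[s, a] = t
                U[s, a] = 0.0
                counts[s, a] = 0
        elif last_attempt[s, a] < t_star:
            learn[s, a] = True                  # some Q-value changed; allow new attempts

        s = env.reset() if done else s_next
    return Q
```

Each timestep touches only the single visited state-action pair, so the per-experience computation is constant apart from the $O(A)$ maximizations, and the tables require $O(SA)$ storage, consistent with the claims in the abstract.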