Abstract-- We consider reinforcement learning, and in particular, the Q-learning algorithm in large state and action spaces. In order to cope with the size of the spaces, a functio...
We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms...
We present a reinforcement learning architecture, Dyna-2, that encompasses both samplebased learning and sample-based search, and that generalises across states during both learni...
Residual gradient (RG) was proposed as an alternative to TD(0) for policy evaluation when function approximation is used, but there exists little formal analysis comparing them ex...
Reinforcement learning algorithms can become unstable when combined with linear function approximation. Algorithms that minimize the mean-square Bellman error are guaranteed to co...