Toward Off-Policy Learning Control with Function Approximation


Download: www.sztaki.hu

We present the first temporal-difference learning algorithm for off-policy control with unrestricted linear function approximation whose per-time-step complexity is linear in the number of features. Our algorithm, Greedy-GQ, is an extension of recent work on gradient temporal-difference learning, which has hitherto been restricted to a prediction (policy evaluation) setting, to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function. A limitation of our control setting is that we require the behavior policy to be stationary. We call this setting latent learning because the optimal policy, though learned, is not manifest in behavior. Popular off-policy algorithms such as Q-learning are known to be unstable in this setting when used with linear function approximation. In reinforcement learning, the term "off-policy learning" refers to learning about one way of behaving, called the target policy, from da...
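The abstract describes Greedy-GQ only in words. As a rough illustration, here is a minimal Python sketch of a one-step (λ = 0) Greedy-GQ-style update. It assumes linear action-value estimates θᵀφ(s, a), a target policy that is greedy in θ, a secondary weight vector w, transitions (s, a, r, s′) sampled from a fixed (stationary) behavior policy, and a non-terminal s′. The function names, the feature map `phi`, and the step sizes `alpha` and `beta` are hypothetical choices for illustration. This is not the authors' reference implementation, and it omits step-size schedules and eligibility traces.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Vector = List[float]
# Hypothetical feature map: (state, action) -> feature vector of length n.
FeatureMap = Callable[[Hashable, Hashable], Sequence[float]]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def greedy_action(theta: Sequence[float], phi: FeatureMap,
                  state: Hashable, actions: Sequence[Hashable]) -> Hashable:
    """argmax_a theta^T phi(state, a); max() keeps the first action on ties."""
    return max(actions, key=lambda a: dot(theta, phi(state, a)))


def greedy_gq_step(theta: Sequence[float], w: Sequence[float], phi: FeatureMap,
                   actions: Sequence[Hashable], s: Hashable, a: Hashable,
                   r: float, s_next: Hashable, gamma: float,
                   alpha: float, beta: float) -> Tuple[Vector, Vector]:
    """One sketched Greedy-GQ update with lambda = 0; returns (theta, w).

    Sketch only, not the authors' code. Assumes (s, a, r, s_next) came from
    a fixed behavior policy and that s_next is non-terminal.
    """
    phi_t = phi(s, a)
    # The target policy is greedy in theta, so the expected next feature
    # vector under it is simply phi at the greedy next action.
    phi_hat = phi(s_next, greedy_action(theta, phi, s_next, actions))
    # TD error: delta = r + gamma * theta^T phi_hat - theta^T phi_t
    delta = r + gamma * dot(theta, phi_hat) - dot(theta, phi_t)
    w_phi = dot(w, phi_t)
    # theta <- theta + alpha * (delta * phi_t - gamma * (w^T phi_t) * phi_hat)
    new_theta = [th + alpha * (delta * f - gamma * w_phi * fh)
                 for th, f, fh in zip(theta, phi_t, phi_hat)]
    # w <- w + beta * (delta - w^T phi_t) * phi_t
    new_w = [wi + beta * (delta - w_phi) * f for wi, f in zip(w, phi_t)]
    return new_theta, new_w


if __name__ == "__main__":
    # Toy one-hot features over two actions (the state is ignored here).
    def onehot(state, action):
        return [1.0, 0.0] if action == 0 else [0.0, 1.0]

    theta, w = [0.0, 0.0], [0.0, 0.0]
    theta, w = greedy_gq_step(theta, w, onehot, [0, 1], s="s0", a=0, r=1.0,
                              s_next="s1", gamma=0.9, alpha=0.1, beta=0.05)
    print(theta, w)  # -> [0.1, 0.0] [0.05, 0.0]
```

Each step costs one dot product per action plus a few more of length n, so for a fixed action set the work grows linearly in the number of features, the property the abstract emphasizes. The −γ(wᵀφ)φ̂ term is the gradient correction that separates this family of methods from plain linear Q-learning, whose update would be θ ← θ + αδφ.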

Hamid Reza Maei, Csaba Szepesvári, Shalabh Bhatnagar, Richard S. Sutton


Behavior Policy | ICML 2010 | Machine Learning | Optimal Policy | Target Policy


» Model-Free Least-Squares Policy Iteration

» Efficient exploration through active learning for value function approximation in reinforc...

» Studying XCSBOA learning in Boolean functions structure encoding and random Boolean functi...

» A Generalized Path Integral Control Approach to Reinforcement Learning

» Lightweight Reinforcement Learning with Function Approximation for Real-life Control Tasks

» Parallel Reinforcement Learning with Linear Function Approximation

» A Functional-Link-Based Neurofuzzy Network for Nonlinear System Control

» Efficient Continuous-Time Reinforcement Learning with Adaptive State Graphs

Post Info

Added	09 Nov 2010
Updated	09 Nov 2010
Type	Conference
Year	2010
Where	ICML
Authors	Hamid Reza Maei, Csaba Szepesvári, Shalabh Bhatnagar, Richard S. Sutton


Sciweavers
