Policy gradient methods for reinforcement learning avoid some of the undesirable properties of the value function approaches, such as policy degradation (Baxter and Bartlett, 2001). However, the variance of the performance gradient estimates obtained from the simulation is sometimes excessive. In this paper, we consider variance reduction methods that were developed for Monte Carlo estimates of integrals. We study two commonly used policy gradient techniques, the baseline and actor-critic methods, from this perspective. Both can be interpreted as additive control variate variance reduction methods. We consider the expected average reward performance measure, and we focus on the GPOMDP algorithm for estimating performance gradients in partially observable Markov decision processes controlled by stochastic reactive policies. We give bounds for the estimation error of the gradient estimates for both baseline and actor-critic algorithms, in terms of the sample size and mixing properties of the controlled POMDP.
Evan Greensmith, Peter L. Bartlett, Jonathan Baxter
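As an informal illustration (not the GPOMDP algorithm analyzed in the paper), the following Python sketch shows how subtracting a baseline from the observed reward acts as an additive control variate in a REINFORCE-style gradient estimate: the correction term has zero expectation, so the estimate remains unbiased while its variance can shrink. The two-armed bandit, softmax policy, and constant baseline here are hypothetical choices made purely for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed bandit: noisy rewards with different means.
REWARD_MEANS = np.array([1.0, 1.5])

def sample_reward(action):
    return REWARD_MEANS[action] + rng.normal(scale=0.5)

def policy_probs(theta):
    # Softmax policy over the two arms, one score parameter per arm.
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_policy(theta, action):
    # d/dtheta log pi(action | theta) for the softmax policy.
    probs = policy_probs(theta)
    grad = -probs
    grad[action] += 1.0
    return grad

def gradient_samples(theta, baseline, n_samples):
    # REINFORCE-style samples: grad log pi(a) * (r - b).
    # Since E[grad log pi] = 0, subtracting the baseline b leaves the
    # expectation unchanged but can reduce the variance -- an additive
    # control variate.
    probs = policy_probs(theta)
    samples = np.zeros((n_samples, theta.size))
    for i in range(n_samples):
        action = rng.choice(theta.size, p=probs)
        reward = sample_reward(action)
        samples[i] = grad_log_policy(theta, action) * (reward - baseline)
    return samples

theta = np.array([0.0, 0.0])
no_baseline = gradient_samples(theta, baseline=0.0, n_samples=20000)
with_baseline = gradient_samples(theta, baseline=REWARD_MEANS.mean(),
                                 n_samples=20000)

print("mean (no baseline):      ", no_baseline.mean(axis=0))
print("mean (constant baseline):", with_baseline.mean(axis=0))
print("var  (no baseline):      ", no_baseline.var(axis=0))
print("var  (constant baseline):", with_baseline.var(axis=0))
```

Running this prints nearly identical means for the two estimators but a noticeably smaller per-component variance when the constant baseline is subtracted; which baseline minimizes the variance is precisely the kind of question the paper addresses.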