We give the first rigorous upper bounds on the error of temporal difference (TD) algorithms for policy evaluation as a function of the amount of experience. These upper bounds prove exponentially fast convergence, with both the rate of convergence and the asymptote strongly dependent on the length of the backups k or the parameter λ. Our bounds give formal verification to the long-standing intuition that TD methods are subject to a “bias-variance” trade-off, and they lead to schedules for k and λ that are predicted to be better than any fixed values for these parameters. We give preliminary experimental confirmation of our theory for a version of the random walk problem.
Michael J. Kearns, Satinder P. Singh