Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs


We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P) log T of the reward obtained by the optimal policy, where C(P) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm re...
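The optimistic step described in the abstract can be illustrated with a small sketch. It assumes that "close to the estimates" means an L1 ball of a given radius around the empirical next-state distribution; that choice, the radius value, and the function names are illustrative assumptions and are not taken from the listing. For this particular constraint set, the linear program can be solved by a simple greedy shift of probability mass, so no LP solver is needed.

```python
def optimistic_transition(p_hat, values, radius):
    """Maximize sum_j q[j] * values[j] over probability vectors q
    with ||q - p_hat||_1 <= radius (assumed L1 confidence set).

    p_hat:  empirical next-state probabilities for one state-action pair
    values: current value estimates of the next states
    radius: width of the confidence set (hypothetical parameter)
    """
    n = len(p_hat)
    q = list(p_hat)
    # Move up to radius/2 of probability mass onto the highest-value state.
    best = max(range(n), key=lambda j: values[j])
    q[best] = min(1.0, p_hat[best] + radius / 2.0)
    # Take the same amount of mass away from the lowest-value states first.
    surplus = sum(q) - 1.0
    for j in sorted(range(n), key=lambda j: values[j]):
        if surplus <= 1e-12:
            break
        if j == best:
            continue
        removed = min(q[j], surplus)
        q[j] -= removed
        surplus -= removed
    return q


def optimistic_value(p_hat, values, radius):
    """Optimistic expected next-state value under the assumed L1 set."""
    q = optimistic_transition(p_hat, values, radius)
    return sum(qj * vj for qj, vj in zip(q, values))
```

With empirical probabilities [0.5, 0.3, 0.2], next-state values [1.0, 0.0, 2.0], and radius 0.2, the sketch shifts 0.1 of mass from the lowest-value state to the highest-value one, raising the expected value from 0.9 to about 1.1. An optimistic learner would compute such a value for each action and prefer the action with the largest one; how the radius shrinks with experience is left unspecified here.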

Ambuj Tewari, Peter L. Bartlett


Information Technology | NIPS 2007 | Optimistic Linear Programming | Regret Bound | Transition Probabilities |


Post Info

Added	30 Oct 2010
Updated	30 Oct 2010
Type	Conference
Year	2007
Where	NIPS
Authors	Ambuj Tewari, Peter L. Bartlett
