In the online linear optimization problem, a learner must choose, in each round, a decision from a set $D \subset \mathbb{R}^n$ in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full information case, the upper bound on the regret is $O^*(\sqrt{nT})$, where $n$ is the ambient dimension and $T$ is the time horizon. For the bandit case, we present an algorithm which achieves $O^*(n^{3/2}\sqrt{T})$ regret; all previous (nontrivial) bounds here were $O(\mathrm{poly}(n)\,T^{2/3})$ or worse. It is striking that the convergence rate for the bandit setting is only a fa...
Varsha Dani, Thomas P. Hayes, Sham Kakade
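
As an illustration of the setup described in the abstract, the following is a minimal sketch of additive regret and of the two feedback models. It is not the paper's algorithm: the finite toy decision set, the uniformly random placeholder learner, and the helper names `dot`, `additive_regret`, and `feedback` are all assumptions made for this example.

```python
import random

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def additive_regret(decisions, costs, decision_set):
    """Cost incurred by the learner minus the cost of the best fixed decision in hindsight."""
    incurred = sum(dot(c, x) for c, x in zip(costs, decisions))
    best_fixed = min(sum(dot(c, x) for c in costs) for x in decision_set)
    return incurred - best_fixed

def feedback(cost, decision, bandit):
    """Bandit setting: only the scalar cost incurred is revealed.
    Full information setting: the whole cost vector is revealed."""
    return dot(cost, decision) if bandit else cost

# Toy instance (assumed for illustration): a finite decision set D in R^2
# and T = 100 rounds of random linear cost vectors.
rng = random.Random(0)
decision_set = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
costs = [(rng.random(), rng.random()) for _ in range(100)]

# Placeholder learner: plays uniformly at random and ignores all feedback.
decisions = [rng.choice(decision_set) for _ in costs]

regret = additive_regret(decisions, costs, decision_set)
print(f"additive regret of the random learner over {len(costs)} rounds: {regret:.3f}")
```

A real learner would update its choice from `feedback(...)` after each round; the placeholder above only fixes the interface so that the regret quantity being bounded is concrete.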