We develop a new tool for data-dependent analysis of the exploration-exploitation trade-off in learning under limited feedback. Our tool is based on two main ingredients. The fi...
A multi-armed bandit episode consists of n trials, each allowing selection of one of K arms, resulting in payoff from a distribution over [0, 1] associated with that arm. We assum...
The analysis of online least squares estimation is at the heart of many stochastic sequential decision-making problems. We employ tools from the self-normalized processes to provi...