Low Regret Algorithms for Multi-armed Bandits and MDPs

Ambuj Tewari

UC Berkeley


While taking actions in an unknown probabilistic environment, one faces the well-known exploration vs. exploitation trade-off. On the one hand, we would like to exploit our current knowledge of the environment by choosing actions that have been rewarding in the past. On the other hand, we would like to explore actions that have not been taken before, since they might prove rewarding. The notion of regret measures how much worse we perform compared to someone who has complete knowledge of the environment. Achieving low regret in a rich class of environments can therefore be taken as a sign of successfully balancing exploration against exploitation.
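To make the notion concrete, here is one standard formalization of regret for the bandit setting (an illustration; the talk may use a variant, and the symbols below are not taken from the abstract):

```latex
% With arm means \mu_i, optimal mean \mu^* = \max_i \mu_i, and a_t the arm
% chosen at round t, the expected regret after T rounds is the gap between
% always playing the best arm and what the algorithm actually earned:
R(T) \;=\; T \mu^* \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right].
```

Low regret means this gap grows slowly in $T$; the logarithmic rates mentioned below correspond to $R(T) = O(\log T)$.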

In this talk, I will describe algorithms for multi-armed bandit problems and Markov decision processes (MDPs) that achieve low regret over time. Even when started with no knowledge at all of the environment they face, these algorithms incur regret that grows only logarithmically with time.

The "optimism in the face of uncertainty" heuristic is a common feature of all these algorithms. After reviewing some known bandit algorithms, I will show how the heuristic can be used to derive a new algorithm, called Optimistic Linear Programming (OLP), that achieves low regret in any irreducible MDP. OLP is closely related to an algorithm proposed by Burnetas and Katehakis, with four key differences: OLP is simpler; it does not require knowledge of the supports of the transition probabilities; the proof of its regret bound is simpler; and its regret bound is larger than theirs, but only by a constant factor. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner, but OLP is simpler and its regret bound has a better dependence on the size of the MDP.
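To illustrate the optimism heuristic in its simplest setting, here is a minimal sketch of a UCB-style bandit algorithm (the classic UCB1 index, not the OLP algorithm of the talk; all names and parameters below are illustrative assumptions). Each arm's score is its empirical mean plus a confidence bonus, so under-explored arms look optimistically good until they have been tried enough:

```python
import math
import random

def ucb1(pull, n_arms, horizon, rng):
    """UCB1 sketch: play each arm once, then always pick the arm
    maximizing (empirical mean) + sqrt(2 ln t / n_i) -- the optimistic
    upper confidence bound on its true mean."""
    counts = [0] * n_arms      # n_i: times arm i was pulled
    sums = [0.0] * n_arms      # running reward totals per arm
    for t in range(horizon):
        if t < n_arms:
            a = t              # initialization: try every arm once
        else:
            a = max(range(n_arms),
                    key=lambda i: sums[i] / counts[i]
                                  + math.sqrt(2 * math.log(t + 1) / counts[i]))
        r = pull(a, rng)       # observe a reward in [0, 1]
        counts[a] += 1
        sums[a] += r
    return counts

# Usage: two Bernoulli arms with (hypothetical) means 0.9 and 0.5.
# The better arm ends up pulled far more often as the horizon grows,
# which is exactly what keeps the regret logarithmic.
rng = random.Random(0)
means = [0.9, 0.5]
counts = ucb1(lambda a, rng: 1.0 if rng.random() < means[a] else 0.0,
              n_arms=2, horizon=2000, rng=rng)
```

The same optimism principle carries over to MDPs, where the confidence regions are placed on transition probabilities and rewards rather than on a single arm mean.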

(Joint work with Peter Bartlett.)