In this talk, I will describe algorithms for multi-armed bandit problems and Markov decision processes (MDPs) that achieve low regret over time. Even when started with no prior knowledge of the environment, these algorithms incur regret that grows only logarithmically with time.
The "optimism in the face of uncertainty" heuristic is a common feature of all these algorithms. After reviewing some known bandit algorithms, I will show how the heuristic can be used to derive a new algorithm, called Optimistic Linear Programming (OLP), that achieves low regret in any irreducible MDP. OLP is closely related to an algorithm proposed by Burnetas and Katehakis, with four key differences: OLP is simpler; it does not require knowledge of the supports of the transition probabilities; the proof of its regret bound is simpler; but its regret bound is larger than theirs by a constant factor. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner, but it is simpler and its regret bound has a better dependence on the size of the MDP.
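To give a flavor of the optimism heuristic in the bandit setting, here is a minimal sketch of UCB1 (one of the known bandit algorithms in the literature, not OLP itself): each round, play the arm maximizing its empirical mean plus a confidence bonus, so under-explored arms look optimistically good. The Bernoulli arm means and the horizon below are illustrative assumptions.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Sketch of the UCB1 index policy: play the arm maximizing
    empirical mean + sqrt(2 ln t / n_i). `arm_means` are true
    Bernoulli means, used here only to simulate rewards."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k      # pulls per arm
    totals = [0.0] * k    # cumulative reward per arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # play each arm once to initialize
        else:
            # optimism in the face of uncertainty: upper confidence bound
            arm = max(range(k), key=lambda i: totals[i] / counts[i]
                      + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts

pulls = ucb1([0.9, 0.5], horizon=2000)
```

Because the confidence bonus shrinks like sqrt(ln t / n_i), a suboptimal arm is pulled only O(log T) times, which is the source of the logarithmic regret mentioned above.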
(Joint work with Peter Bartlett.)