# Reinforcement Learning in Large or Unknown MDPs

### Ambuj Tewari

###
EECS Department

University of California, Berkeley

Technical Report No. UCB/EECS-2007-126

October 25, 2007

### http://www.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-126.pdf

Reinforcement learning is a central problem in artificial intelligence. Unlike supervised learning, in reinforcement learning the learning agent does not receive any input from a supervisor about what to do in different situations. The agent has to learn from its own experience taking into account any uncertainty in the outcomes of its actions.

Markov Decision Processes (MDPs) have been the dominant formalism used to mathematically state and investigate the problem of reinforcement learning. Classical algorithms like value iteration and policy iteration can compute optimal policies for MDPs in time polynomial in the description of the MDP. This is fine for small problems but makes it impractical to apply these algorithms to real world MDPs where the number of states is enormous, even infinite. Another drawback is that these algorithms assume that the MDP parameters are precisely known. To quantify learning in an unknown MDP, the notion of regret has been defined and studied in the literature.

This dissertation consists of two parts. In the first part, we study two methods that have been proposed to handle large MDPs. PEGASUS is a policy search method that uses simulators and approximate linear programming is a general methodology that tries to obtain a good policy by solving linear programs of reasonable size. We give performance bounds for policies produced by these methods. In the second part, we study the problem of learning an unknown MDP. We begin by considering bounded parameter MDPs. These arise when we have confidence intervals associated with each MDP parameter. Finally, we give a new algorithm that achieves logarithmic regret in an irreducible but otherwise unknown MDP. This is a provably optimal rate up to a constant.

**Advisor:** Peter Bartlett

BibTeX citation:

@phdthesis{Tewari:EECS-2007-126, Author = {Tewari, Ambuj}, Title = {Reinforcement Learning in Large or Unknown MDPs}, School = {EECS Department, University of California, Berkeley}, Year = {2007}, Month = {Oct}, URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-126.html}, Number = {UCB/EECS-2007-126}, Abstract = {Reinforcement learning is a central problem in artificial intelligence. Unlike supervised learning, in reinforcement learning the learning agent does not receive any input from a supervisor about what to do in different situations. The agent has to learn from its own experience taking into account any uncertainty in the outcomes of its actions. Markov Decision Processes (MDPs) have been the dominant formalism used to mathematically state and investigate the problem of reinforcement learning. Classical algorithms like value iteration and policy iteration can compute optimal policies for MDPs in time polynomial in the description of the MDP. This is fine for small problems but makes it impractical to apply these algorithms to real world MDPs where the number of states is enormous, even infinite. Another drawback is that these algorithms assume that the MDP parameters are precisely known. To quantify learning in an unknown MDP, the notion of regret has been defined and studied in the literature. This dissertation consists of two parts. In the first part, we study two methods that have been proposed to handle large MDPs. PEGASUS is a policy search method that uses simulators and approximate linear programming is a general methodology that tries to obtain a good policy by solving linear programs of reasonable size. We give performance bounds for policies produced by these methods. In the second part, we study the problem of learning an unknown MDP. We begin by considering bounded parameter MDPs. These arise when we have confidence intervals associated with each MDP parameter. Finally, we give a new algorithm that achieves logarithmic regret in an irreducible but otherwise unknown MDP. This is a provably optimal rate up to a constant.} }

EndNote citation:

%0 Thesis %A Tewari, Ambuj %T Reinforcement Learning in Large or Unknown MDPs %I EECS Department, University of California, Berkeley %D 2007 %8 October 25 %@ UCB/EECS-2007-126 %U http://www.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-126.html %F Tewari:EECS-2007-126