Electrical Engineering
      and Computer Sciences

Electrical Engineering and Computer Sciences

COLLEGE OF ENGINEERING

UC Berkeley

Reinforcement Learning in Large or Unknown MDPs

Ambuj Tewari

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2007-126
October 25, 2007

http://www.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-126.pdf

Reinforcement learning is a central problem in artificial intelligence. Unlike supervised learning, in reinforcement learning the learning agent does not receive any input from a supervisor about what to do in different situations. The agent has to learn from its own experience taking into account any uncertainty in the outcomes of its actions. Markov Decision Processes (MDPs) have been the dominant formalism used to mathematically state and investigate the problem of reinforcement learning. Classical algorithms like value iteration and policy iteration can compute optimal policies for MDPs in time polynomial in the description of the MDP. This is fine for small problems but makes it impractical to apply these algorithms to real world MDPs where the number of states is enormous, even infinite. Another drawback is that these algorithms assume that the MDP parameters are precisely known. To quantify learning in an unknown MDP, the notion of regret has been defined and studied in the literature. This dissertation consists of two parts. In the first part, we study two methods that have been proposed to handle large MDPs. PEGASUS is a policy search method that uses simulators and approximate linear programming is a general methodology that tries to obtain a good policy by solving linear programs of reasonable size. We give performance bounds for policies produced by these methods. In the second part, we study the problem of learning an unknown MDP. We begin by considering bounded parameter MDPs. These arise when we have confidence intervals associated with each MDP parameter. Finally, we give a new algorithm that achieves logarithmic regret in an irreducible but otherwise unknown MDP. This is a provably optimal rate up to a constant.

Advisor: Peter Bartlett


BibTeX citation:

@phdthesis{Tewari:EECS-2007-126,
    Author = {Tewari, Ambuj},
    Title = {Reinforcement Learning in Large or Unknown MDPs},
    School = {EECS Department, University of California, Berkeley},
    Year = {2007},
    Month = {Oct},
    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-126.html},
    Number = {UCB/EECS-2007-126},
    Abstract = {Reinforcement learning is a central problem in artificial intelligence. Unlike supervised learning, in reinforcement learning the learning agent does not receive any input from a supervisor about what to do in different situations. The agent has to learn from its own experience taking into account any uncertainty in the outcomes of its actions.

Markov Decision Processes (MDPs) have been the dominant formalism used to mathematically state and investigate the problem of reinforcement learning. Classical algorithms like value iteration and policy iteration can compute optimal policies for MDPs in time polynomial in the description of the MDP. This is fine for small problems but makes it impractical to apply these algorithms to
real world MDPs where the number of states is enormous, even infinite. Another drawback is that these algorithms assume that the MDP parameters are precisely known. To quantify learning in an unknown MDP, the notion of regret has been defined and studied in the literature.

This dissertation consists of two parts. In the first part, we study two methods that have been proposed to handle large MDPs. PEGASUS is a policy search method that uses simulators and approximate linear programming is a general methodology that tries to obtain a good policy by solving linear programs of reasonable size. We give performance bounds for policies produced by these methods. In the second part, we study the problem of learning an unknown MDP. We begin by considering bounded parameter MDPs. These arise when we have confidence intervals associated with each MDP parameter. Finally, we give a new algorithm that achieves logarithmic regret in an irreducible but otherwise unknown MDP. This is a provably optimal rate up to a constant.}
}

EndNote citation:

%0 Thesis
%A Tewari, Ambuj
%T Reinforcement Learning in Large or Unknown MDPs
%I EECS Department, University of California, Berkeley
%D 2007
%8 October 25
%@ UCB/EECS-2007-126
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-126.html
%F Tewari:EECS-2007-126