CS 188, Fall 2005, Introduction to Artificial Intelligence
Assignment 6 Part 1, due 12/5, total value 4% of grade




This assignment should be done in pairs. Don't leave it to the last minute! Consult early and often with me and/or your TA.


This assignment comes in two parts. The first part is worth 50 points out of 100 and is mainly intended to help you become familiar with the basics of MDP representations, algorithms, and agents and with Spider solitaire. It does not involve writing much new code. The second part, to be posted shortly, deals with reinforcement learning.

The first thing you need to do is load the CS188 AIMA code in the usual way: load aima.lisp and then do (aima-load 'search) and (aima-load 'mdps). You should also copy, compile, and load all the lisp files in this directory.
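For concreteness, the whole loading sequence looks like this at the Lisp prompt (assuming you start Lisp in the directory containing aima.lisp; adjust the path otherwise):

>> (load "aima.lisp")
>> (aima-load 'search)
>> (aima-load 'mdps)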

Be sure to use the latest version from ~cs188. Several things have changed and new code has been added. As always, remember to compile the code.

MDPs

The AIMA code includes a general facility for defining and using MDPs. Code defining MDPs and all the basic operations on them (including I/O) is in the mdps module that you loaded with (aima-load 'mdps).

The main methods defined on MDPs give access to the available actions, the resulting states and their probabilities, and the rewards. The simplest kind of MDP is an enumerated-mdp, in which actions, results, and rewards are stored in hash tables; the MDP methods are defined generically for all such MDPs. An example is the 4x3 MDP used throughout Chapter 17, provided as *4x3-mdp*. There are also dynamic programming algorithms (value iteration and policy iteration) for solving MDPs; see mdps/algorithms/dp.lisp. Value iteration outputs the utility function as a hash table. For example, try:
>> (hprint (value-iteration *4x3-mdp*))
#
(1 1):     0.70530814
(2 1):     0.655308
(3 2):     0.660274
(1 3):     0.81155825
(2 3):     0.8678082
(4 1):     0.38792402
(4 3):     1.0
(3 1):     0.6114151
(1 2):     0.7615582
(4 2):     -1.0
(3 3):     0.91780823
The function value-iteration-policy does value iteration and converts the result into an optimal policy by one-step lookahead:
>> (hprint (value-iteration-policy *4x3-mdp*))
(1 1):     UP
(2 1):     LEFT
(3 2):     UP
(1 3):     RIGHT
(2 3):     RIGHT
(4 1):     LEFT
(4 3):     NIL
(3 1):     LEFT
(1 2):     UP
(4 2):     NIL
(3 3):     RIGHT
#

MDP agents and environments

You will first need to understand the basics of how environments and agents work; see the agent and environment code in the AIMA system. Notice in particular how run-environment works: it invokes the agent program with the current percept, and then updates the state of the environment based on the action that the agent returns. It also keeps track of the agent's score in the environment by updating the score slot of the agent itself.
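To make that control flow concrete, here is a small self-contained sketch of such a loop. The structure names (sketch-env, sketch-agent) and the trivial numeric "state" are invented for illustration only; the real run-environment is more general, but it follows the same percept -> action -> update -> score cycle described above.

;; Illustrative sketch only -- not the AIMA run-environment.
(defstruct sketch-agent program (score 0))
(defstruct sketch-env (state 0) agents (steps-left 10))

(defun run-sketch-env (env)
  "Repeatedly give each agent its percept, apply the action it returns,
and update its score, until the step limit runs out."
  (loop while (> (sketch-env-steps-left env) 0)
        do (dolist (agent (sketch-env-agents env))
             (let* ((percept (sketch-env-state env))   ; here the percept is just the state
                    (action (funcall (sketch-agent-program agent) percept)))
               ;; update the environment state based on the agent's action
               (incf (sketch-env-state env) action)
               ;; keep track of the agent's score, as run-environment does
               (incf (sketch-agent-score agent) (sketch-env-state env))))
           (decf (sketch-env-steps-left env)))
  (mapcar #'sketch-agent-score (sketch-env-agents env)))

;; Example: an "agent" whose program always returns the action +1.
;; (run-sketch-env (make-sketch-env
;;                   :agents (list (make-sketch-agent
;;                                   :program #'(lambda (percept)
;;                                                (declare (ignore percept))
;;                                                1)))))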

Any MDP can be converted into an environment using the mdp->environment function.

This function needs a list containing the one agent that will run in the environment. By default, it uses one constructed by new-simple-mdp-solving-agent. Such an agent computes a policy for the MDP (e.g., by the value-iteration-policy algorithm; see mdps/algorithms/dp.lisp) and then executes it.
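The core idea can be sketched in a few lines. This is not the actual new-simple-mdp-solving-agent; in particular, it assumes the percept directly identifies the current state (and matches the keys of the policy table), which is a simplification:

;; Sketch of a "solve once, then follow the policy" agent program.
(defun make-policy-following-program (mdp)
  (let ((policy (value-iteration-policy mdp)))  ; state -> action hash table
    #'(lambda (percept)
        ;; assume the percept is (or names) the current state
        (gethash percept policy))))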

Question 1 (5 pts). Use the agent-trial function to measure the average score of the simple-mdp-solving-agent in *4x3-mdp* over 1000 trials. Your result should be close to the true utility of the initial state (1 1), as shown on AIMA2e p.619.

Question 2 (5 pts). Now let's consider an agent that makes decisions using an approximate utility function and a lookahead search. (While this is unnecessary for the 4x3 world, it is essential for Spider.) Because an MDP has actions with uncertain outcomes, but just one agent, the search we need is an expectimax search that alternates between choosing maximum-utility actions and computing expected outcome values. The expectimax algorithm, and an agent that uses it, are included in the code you have loaded. Using the approximate utility function in 4x3-eval.lisp, evaluate depth-1 and depth-2 expectimax agents over 1000 trials on *4x3-mdp*.
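To see the shape of the computation, here is a self-contained toy version of depth-limited expectimax. The toy MDP (two actions at a single state, numeric utilities) and all the toy-* names are invented for illustration; the provided expectimax code operates on real MDP objects and an evaluation function instead.

;; Toy transition model: (state action) -> list of (next-state . probability).
(defparameter *toy-transitions*
  '(((s0 a) . ((s1 . 0.8) (s2 . 0.2)))
    ((s0 b) . ((s2 . 1.0)))))

;; Crude utility estimates used at the cutoff (and at terminal states).
(defparameter *toy-eval* '((s0 . 0.0) (s1 . 1.0) (s2 . -1.0)))

(defun toy-actions (state)
  (if (eq state 's0) '(a b) '()))           ; s1 and s2 are terminal

(defun toy-eval (state)
  (cdr (assoc state *toy-eval*)))

(defun toy-outcomes (state action)
  (cdr (assoc (list state action) *toy-transitions* :test #'equal)))

(defun expectimax-value (state depth)
  "Expectimax value of STATE with DEPTH levels of action lookahead:
max over actions of the probability-weighted average of successor values."
  (let ((acts (toy-actions state)))
    (if (or (zerop depth) (null acts))
        (toy-eval state)                    ; cutoff or terminal: use the estimate
        (loop for act in acts
              maximize (loop for (next . prob) in (toy-outcomes state act)
                             sum (* prob (expectimax-value next (1- depth))))))))

;; (expectimax-value 's0 1)  =>  0.6 (up to float rounding), since action A
;; averages 0.8*1.0 + 0.2*-1.0 and beats action B's certain -1.0.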

Question 3 (15 pts). The expectimax algorithm (and indeed any algorithm using Bellman backups) computes expected values by summing over all possible outcome states. This will not be possible in Spider, where the number of outcomes for one action can exceed 32 quintillion. Instead of summing over all outcomes, we will have to sum over a small sample. First, write num-results and random-result methods for enumerated MDPs (one line each); these are the same methods that spider-mdp.lisp provides for the Spider MDP (see the Spider section below).

Now, write sampling versions of the expectimax functions called sampling-expectimax-cutoff-decision, sampling-expected-cutoff-value, and sampling-max-cutoff-value. These should take an additional argument specifying the number of samples. The only substantial change to the expectimax code will be in sampling-expected-cutoff-value, which should first check if the actual number of outcomes is greater than the number of samples allowed. If so, it should generate the samples and average over them; if not, it should compute the exact expectation as before. Now use new-expectimax-cutoff-mdp-agent to make an agent that uses sampling-expectimax-cutoff-decision with 2 samples and depth-2 lookahead. Test this agent over 1000 trials as before; you should find that the agent does nearly as well as the depth-2 expectimax agent.
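Continuing the toy MDP from the expectimax sketch above, the sampling idea looks roughly like this. The toy-num-results and toy-random-result functions below play the role of the num-results and random-result methods you are asked to write (their real signatures are up to you); the whole thing is a sketch of the computation, not the assignment solution.

(defun toy-num-results (state action)
  (length (toy-outcomes state action)))

(defun toy-random-result (state action)
  "Draw one outcome of ACTION in STATE according to its probability."
  (let ((r (random 1.0))
        (outcomes (toy-outcomes state action)))
    (dolist (pair outcomes (car (car (last outcomes))))  ; fall back to the last outcome
      (decf r (cdr pair))
      (when (<= r 0.0) (return (car pair))))))

(defun toy-sampling-expected-value (state action num-samples)
  "Expected leaf value of ACTION in STATE, sampling only when the outcomes
are too numerous to enumerate."
  (if (> (toy-num-results state action) num-samples)
      ;; too many outcomes: generate NUM-SAMPLES samples and average them
      (/ (loop repeat num-samples
               sum (toy-eval (toy-random-result state action)))
         num-samples)
      ;; otherwise compute the exact expectation as before
      (loop for (next . prob) in (toy-outcomes state action)
            sum (* prob (toy-eval next)))))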

Spider

Now we will apply similar techniques to Spider. First, we need to understand the game itself and how it is implemented. You should definitely play a few times; the game is available on Windows systems, there are many free downloads, and it is also available on the Unix cluster under ~cs188/spider and as a Java applet. There are two ways to think about Spider: you can see the definitions for the Spider MDP in spider-mdp.lisp, but it's probably best to look first at the underlying Spider implementation itself in spider.lisp. To make a Spider instance, call make-spider-problem with suitable parameters. Here's a particularly easy one:
>> (setq ps-easy (make-spider-problem :num-packs 1 :num-suits 1 :num-stacks 10 :num-hidden-rows 2))
Notice that the stacks are numbered from 0 to 9 and that the "top-to-bottom" orientation of stacks in the Windows implementation is replaced here by a "left-to-right" ordering. Notice also that the initial state, s0-easy, is the state of the Spider poproblem, not the percept. The percept looks the same to the naked eye:
>> (setq s0-percept (get-percept ps-easy s0-easy))

 0    ??? ??? ???  AH
 1    ??? ??? ???  KH
 2    ??? ???  2H
 3    ??? ???  5H
 4    ??? ???  AH
 5    ??? ???  JH
 6    ??? ???  6H
 7    ??? ???  9H
 8    ??? ???  2H
 9    ??? ???  4H

Reserve:  ....................
Completed:  
but in the percept the hidden cards really are hidden:
>> (card-number (second (aref (spider-state-stacks s0-easy) 9)))
10
>> (card-number (second (aref (spider-state-stacks s0-percept) 9)))
NIL

Spider moves specify the number of cards to be moved, the origin stack, and the destination stack:

>> (pprint (actions ps-easy s0-easy))
(NEW-ROW #S(SPIDER-MOVE :K 1 :FROM 4 :TO 8)
 #S(SPIDER-MOVE :K 1 :FROM 0 :TO 8) #S(SPIDER-MOVE :K 1 :FROM 3 :TO 6)
 #S(SPIDER-MOVE :K 1 :FROM 9 :TO 3) #S(SPIDER-MOVE :K 1 :FROM 4 :TO 2)
 #S(SPIDER-MOVE :K 1 :FROM 0 :TO 2))
The new-row action deals out a new row of cards from the reserve. The outcome of a Spider action is defined by the result method:
>> (result ps-easy  #S(SPIDER-MOVE :K 1 :FROM 9 :TO 3) s0-easy)

 0    ??? ??? ???  AH
 1    ??? ??? ???  KH
 2    ??? ???  2H
 3    ??? ???  5H  4H
 4    ??? ???  AH
 5    ??? ???  JH
 6    ??? ???  6H
 7    ??? ???  9H
 8    ??? ???  2H
 9    ??? 10H

Reserve:  ....................
Completed:  
See spider.lisp for a complete explanation of exactly which moves are allowed. We have eliminated some redundant moves to make the game a little easier for the computer to play. You should also look at the goal-test, step-cost, and get-percept methods.

A Spider poproblem instance can be converted into an environment as follows:

>> (setq es-easy (poproblem->environment ps-easy :agents (list (new-random-spider-agent :problem ps-easy))))
#
You can now type (run-environment es-easy) and watch the random agent playing. It usually wins, which shows that this is an easy class of Spider problems. Look in spider-agents.lisp to see the definition of new-random-spider-agent. This file also contains an expectimax agent specifically for Spider.

The Spider MDP is defined in spider-mdp.lisp. Like any MDP, this has a results method, but it should be avoided, especially for the new-row action, whose set of possible outcomes is far too large to enumerate. Instead, the file defines num-results and random-result methods for use with sampling expectimax. We can make a Spider MDP as follows:

>> (setq smdp (make-spider-mdp :problem ps-easy :initial-state s0-easy))
#

Question 4 (5 pts). Evaluate the random spider agent over 1000 instances of the "easy" Spider game. (Be sure that you generate a new instance each time!)
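For orientation only, an evaluation loop might look something like the sketch below. It reuses the constructors shown earlier; the agent-score accessor is an assumption about how the agent's score slot is read, so check the agent and environment definitions for the real interface (and you will probably want to suppress the per-step display when running many trials).

;; Hedged sketch of an evaluation loop for the random agent; agent-score
;; is an assumed accessor for the agent's score slot -- verify it.
(defun evaluate-random-spider-agent (n)
  (/ (loop repeat n
           sum (let* ((problem (make-spider-problem :num-packs 1 :num-suits 1
                                                    :num-stacks 10 :num-hidden-rows 2))
                      (agent   (new-random-spider-agent :problem problem))
                      (env     (poproblem->environment problem :agents (list agent))))
                 (run-environment env)
                 (agent-score agent)))
     n))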

Question 5 (10 pts). Write a function new-random+history-spider-agent that, like new-random-spider-agent, returns an agent that selects moves randomly; however, the agent should keep a history of all visited states in a hash table (don't forget to use the state-hash-key!) and should never execute a move that leads to a state it has already visited. [Hint: you need only check the outcome of moves that have exactly one outcome.] Does this make your agent better or worse on the easy Spider instances?
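A minimal sketch of the bookkeeping part only (not a full agent): a hash table of visited keys, under the assumption that state-hash-key takes a single state argument and returns something comparable with equal. The real agent should keep its own table in a closure rather than a global one, and how the table hooks into move selection is left to you.

;; Sketch: remembering visited states via state-hash-key (signature assumed).
(defvar *visited-states* (make-hash-table :test #'equal))

(defun note-visited (state)
  (setf (gethash (state-hash-key state) *visited-states*) t))

(defun already-visited-p (state)
  (gethash (state-hash-key state) *visited-states*))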

Question 6 (5 pts). The file spider-eval.lisp contains an approximate utility function for Spider. Explain why the function includes the spider-suits-completed feature. [Hint: this is not a completely trivial question.]

Question 7 (5 pts). Evaluate a depth-1 sampling expectimax agent that uses 5 samples on both the "easy" instances and on instances with 2 packs, 1 suit, and 4 hidden rows. (Do as many trials as you can in a reasonable time, but no more than 1000 in any case.) If cycles are available, evaluate a depth-2 agent as well.
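The harder instances can presumably be created with the same constructor as before; the :num-stacks value of 10 below is an assumption carried over from the easy configuration, and ps-harder is just an illustrative name:

>> (setq ps-harder (make-spider-problem :num-packs 2 :num-suits 1
                                        :num-stacks 10 :num-hidden-rows 4))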