CS 188, Fall 2005, Introduction to Artificial Intelligence
Assignment 6 Part 1, due 12/5, total value 4% of grade




This assignment should be done in pairs. Don't leave it to the last minute! Consult early and often with me and/or your TA.


This assignment comes in two parts. The first part is worth 50 points out of 100 and is mainly intended to help you become familiar with the basics of MDP representations, algorithms, and agents and with Spider solitaire. It does not involve writing much new code. The second part, to be posted shortly, deals with reinforcement learning.

The first thing you need to do is load the CS188 AIMA code in the usual way: load aima.lisp and then do (aima-load 'search) and (aima-load 'mdps). You should also copy, compile, and load all the lisp files in this directory.
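For concreteness, the whole loading sequence looks like this at the Lisp prompt (assuming you start Lisp in the directory containing aima.lisp; adjust the path otherwise):

>> (load "aima.lisp")
>> (aima-load 'search)
>> (aima-load 'mdps)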

Be sure to use the latest version from ~cs188. Several things have changed and new code has been added. As always, remember to compile the code.

MDPs

The AIMA code includes a general facility for defining and using MDPs. Code defining MDPs and all the basic operations on them (including I/O) is in the mdps module that you loaded with (aima-load 'mdps).

The main methods defined on MDPs give access to the available actions, the resulting states and their probabilities, and the rewards. The simplest kind of MDP is an enumerated-mdp, in which actions, results, and rewards are stored in hash tables; the MDP methods are defined generically for all such MDPs. An example is the 4x3 MDP used throughout Chapter 17, provided as *4x3-mdp*. There are also dynamic programming algorithms (value iteration and policy iteration) for solving MDPs; see mdps/algorithms/dp.lisp. Value iteration outputs the utility function as a hash table. For example, try:
>> (hprint (value-iteration *4x3-mdp*))
#
(1 1):     0.70530814
(2 1):     0.655308
(3 2):     0.660274
(1 3):     0.81155825
(2 3):     0.8678082
(4 1):     0.38792402
(4 3):     1.0
(3 1):     0.6114151
(1 2):     0.7615582
(4 2):     -1.0
(3 3):     0.91780823
The function value-iteration-policy does value iteration and converts the result into an optimal policy by one-step lookahead:
>> (hprint (value-iteration-policy *4x3-mdp*))
(1 1):     UP
(2 1):     LEFT
(3 2):     UP
(1 3):     RIGHT
(2 3):     RIGHT
(4 1):     LEFT
(4 3):     NIL
(3 1):     LEFT
(1 2):     UP
(4 2):     NIL
(3 3):     RIGHT
#

MDP agents and environments

You will first need to understand the basics of how environments and agents work; see the agent and environment code in the AIMA system. Notice in particular how run-environment works: it invokes the agent program with the current percept, and then updates the state of the environment based on the action that the agent returns. It also keeps track of the agent's score in the environment by updating the score slot of the agent itself.
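To make that control flow concrete, here is a small self-contained sketch of such a loop. The structure names (sketch-env, sketch-agent) and the trivial numeric "state" are invented for illustration only; the real run-environment is more general, but it follows the same percept -> action -> update -> score cycle described above.

;; Illustrative sketch only -- not the AIMA run-environment.
(defstruct sketch-agent program (score 0))
(defstruct sketch-env (state 0) agents (steps-left 10))

(defun run-sketch-env (env)
  "Repeatedly give each agent its percept, apply the action it returns,
and update its score, until the step limit runs out."
  (loop while (> (sketch-env-steps-left env) 0)
        do (dolist (agent (sketch-env-agents env))
             (let* ((percept (sketch-env-state env))   ; here the percept is just the state
                    (action (funcall (sketch-agent-program agent) percept)))
               ;; update the environment state based on the agent's action
               (incf (sketch-env-state env) action)
               ;; keep track of the agent's score, as run-environment does
               (incf (sketch-agent-score agent) (sketch-env-state env))))
           (decf (sketch-env-steps-left env)))
  (mapcar #'sketch-agent-score (sketch-env-agents env)))

;; Example: an "agent" whose program always returns the action +1.
;; (run-sketch-env (make-sketch-env
;;                   :agents (list (make-sketch-agent
;;                                   :program #'(lambda (percept)
;;                                                (declare (ignore percept))
;;                                                1)))))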

Any MDP can be converted into an environment using the mdp->environment function.

This function needs a list containing the one agent that will run in the environment. By default, it uses one constructed by new-simple-mdp-solving-agent. Such an agent computes a policy for the MDP (e.g., by the value-iteration-policy algorithm; see mdps/algorithms/dp.lisp) and then executes it.
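The core idea can be sketched in a few lines. This is not the actual new-simple-mdp-solving-agent; in particular, it assumes the percept directly identifies the current state (and matches the keys of the policy table), which is a simplification:

;; Sketch of a "solve once, then follow the policy" agent program.
(defun make-policy-following-program (mdp)
  (let ((policy (value-iteration-policy mdp)))  ; state -> action hash table
    #'(lambda (percept)
        ;; assume the percept is (or names) the current state
        (gethash percept policy))))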

Question 1 (5 pts). Use the agent-trial function to measure the average score of the simple-mdp-solving-agent in *4x3-mdp* over 1000 trials. Your result should be close to the true utility of the initial state (1 1), as shown on AIMA2e p.619.

Question 2 (5 pts). Now let's consider an agent that makes decisions using an approximate utility function and a lookahead search. (While this is unnecessary for the 4x3 world, it is essential for Spider.) Because an MDP has actions with uncertain outcomes, but just one agent, the search we need is an expectimax search that alternates between choosing maximum-utility actions and computing expected outcome values. The expectimax algorithm, and an agent that uses it, are included in the code you have loaded. Using the approximate utility function in 4x3-eval.lisp, evaluate depth-1 and depth-2 expectimax agents over 1000 trials on *4x3-mdp*.
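To see the shape of the computation, here is a self-contained toy version of depth-limited expectimax. The toy MDP (two actions at a single state, numeric utilities) and all the toy-* names are invented for illustration; the provided expectimax code operates on real MDP objects and an evaluation function instead.

;; Toy transition model: (state action) -> list of (next-state . probability).
(defparameter *toy-transitions*
  '(((s0 a) . ((s1 . 0.8) (s2 . 0.2)))
    ((s0 b) . ((s2 . 1.0)))))

;; Crude utility estimates used at the cutoff (and at terminal states).
(defparameter *toy-eval* '((s0 . 0.0) (s1 . 1.0) (s2 . -1.0)))

(defun toy-actions (state)
  (if (eq state 's0) '(a b) '()))           ; s1 and s2 are terminal

(defun toy-eval (state)
  (cdr (assoc state *toy-eval*)))

(defun toy-outcomes (state action)
  (cdr (assoc (list state action) *toy-transitions* :test #'equal)))

(defun expectimax-value (state depth)
  "Expectimax value of STATE with DEPTH levels of action lookahead:
max over actions of the probability-weighted average of successor values."
  (let ((acts (toy-actions state)))
    (if (or (zerop depth) (null acts))
        (toy-eval state)                    ; cutoff or terminal: use the estimate
        (loop for act in acts
              maximize (loop for (next . prob) in (toy-outcomes state act)
                             sum (* prob (expectimax-value next (1- depth))))))))

;; (expectimax-value 's0 1)  =>  0.6 (up to float rounding), since action A
;; averages 0.8*1.0 + 0.2*-1.0 and beats action B's certain -1.0.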

Question 3 (15 pts). The expectimax algorithm (and indeed any algorithm using Bellman backups) computes expected values by summing over all possible outcome states. This will not be possible in Spider, where the number of outcomes for one action can exceed 32 quintillion. Instead of summing over all outcomes, we will have to sum over a small sample. First, write num-results and random-result methods for enumerated MDPs (one line each); these are the same methods that spider-mdp.lisp provides for the Spider MDP (see the Spider section below).

Now, write sampling versions of the expectimax functions called sampling-expectimax-cutoff-decision, sampling-expected-cutoff-value, and sampling-max-cutoff-value. These should take an additional argument specifying the number of samples. The only substantial change to the expectimax code will be in sampling-expected-cutoff-value, which should first check if the actual number of outcomes is greater than the number of samples allowed. If so, it should generate the samples and average over them; if not, it should compute the exact expectation as before. Now use new-expectimax-cutoff-mdp-agent to make an agent that uses sampling-expectimax-cutoff-decision with 2 samples and depth-2 lookahead. Test this agent over 1000 trials as before; you should find that the agent does nearly as well as the depth-2 expectimax agent.
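Continuing the toy MDP from the expectimax sketch above, the sampling idea looks roughly like this. The toy-num-results and toy-random-result functions below play the role of the num-results and random-result methods you are asked to write (their real signatures are up to you); the whole thing is a sketch of the computation, not the assignment solution.

(defun toy-num-results (state action)
  (length (toy-outcomes state action)))

(defun toy-random-result (state action)
  "Draw one outcome of ACTION in STATE according to its probability."
  (let ((r (random 1.0))
        (outcomes (toy-outcomes state action)))
    (dolist (pair outcomes (car (car (last outcomes))))  ; fall back to the last outcome
      (decf r (cdr pair))
      (when (<= r 0.0) (return (car pair))))))

(defun toy-sampling-expected-value (state action num-samples)
  "Expected leaf value of ACTION in STATE, sampling only when the outcomes
are too numerous to enumerate."
  (if (> (toy-num-results state action) num-samples)
      ;; too many outcomes: generate NUM-SAMPLES samples and average them
      (/ (loop repeat num-samples
               sum (toy-eval (toy-random-result state action)))
         num-samples)
      ;; otherwise compute the exact expectation as before
      (loop for (next . prob) in (toy-outcomes state action)
            sum (* prob (toy-eval next)))))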

Spider

Now we will apply similar techniques to Spider. First, we need to understand the game itself and how it is implemented. You should definitely play a few times; the game is available on Windows systems, there are many free downloads, and it is also available on the Unix cluster under ~cs188/spider and as a Java applet. There are two ways to think about Spider: you can see the definitions for the Spider MDP in spider-mdp.lisp, but it's probably best to look first at the underlying Spider implementation itself in spider.lisp. To make a Spider instance, call make-spider-problem with suitable parameters. Here's a particularly easy one:
>> (setq ps-easy (make-spider-problem :num-packs 1 :num-suits 1 :num-stacks 10 :num-hidden-rows 2))
Notice that the stacks are numbered from 0 to 9 and that the "top-to-bottom" orientation of stacks in the Windows implementation is replaced here by a "left-to-right" ordering. Notice also that the initial state, s0-easy, is the state of the Spider poproblem, not the percept. The percept looks the same to the naked eye:
>> (setq s0-percept (get-percept ps-easy s0-easy))

 0    ??? ??? ???  AH
 1    ??? ??? ???  KH
 2    ??? ???  2H
 3    ??? ???  5H
 4    ??? ???  AH
 5    ??? ???  JH
 6    ??? ???  6H
 7    ??? ???  9H
 8    ??? ???  2H
 9    ??? ???  4H

Reserve:  ....................
Completed:  
but in the percept the hidden cards really are hidden:
>> (card-number (second (aref (spider-state-stacks s0-easy) 9)))
10
>> (card-number (second (aref (spider-state-stacks s0-percept) 9)))
NIL

Spider moves specify the number of cards to be moved, the origin stack, and the destination stack:

>> (pprint (actions ps-easy s0-easy))
(NEW-ROW #S(SPIDER-MOVE :K 1 :FROM 4 :TO 8)
 #S(SPIDER-MOVE :K 1 :FROM 0 :TO 8) #S(SPIDER-MOVE :K 1 :FROM 3 :TO 6)
 #S(SPIDER-MOVE :K 1 :FROM 9 :TO 3) #S(SPIDER-MOVE :K 1 :FROM 4 :TO 2)
 #S(SPIDER-MOVE :K 1 :FROM 0 :TO 2))
The new-row action deals out a new row of cards from the reserve. The outcome of a Spider action is defined by the result method:
>> (result ps-easy  #S(SPIDER-MOVE :K 1 :FROM 9 :TO 3) s0-easy)

 0    ??? ??? ???  AH
 1    ??? ??? ???  KH
 2    ??? ???  2H
 3    ??? ???  5H  4H
 4    ??? ???  AH
 5    ??? ???  JH
 6    ??? ???  6H
 7    ??? ???  9H
 8    ??? ???  2H
 9    ??? 10H

Reserve:  ....................
Completed:  
See spider.lisp for a complete explanation of exactly which moves are allowed. We have eliminated some redundant moves to make the game a little easier for the computer to play. You should also look at the goal-test, step-cost, and get-percept methods.

A Spider poproblem instance can be converted into an environment as follows:

>> (setq es-easy (poproblem->environment ps-easy :agents (list (new-random-spider-agent :problem ps-easy))))
#
You can now type (run-environment es-easy) and watch the random agent playing. It usually wins, which shows that this is an easy class of Spider problems. Look in spider-agents.lisp to see the definition of new-random-spider-agent. This file also contains an expectimax agent specifically for Spider.

The Spider MDP is defined in spider-mdp.lisp. Like any MDP, this has a results method, but it should be avoided, especially for the new-row action, whose set of possible outcomes is far too large to enumerate. Instead, the file defines num-results and random-result methods for use with sampling expectimax. We can make a Spider MDP as follows:

>> (setq smdp (make-spider-mdp :problem ps-easy :initial-state s0-easy))
#

Question 4 (5 pts). Evaluate the random spider agent over 1000 instances of the "easy" Spider game. (Be sure that you generate a new instance each time!)
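For orientation only, an evaluation loop might look something like the sketch below. It reuses the constructors shown earlier; the agent-score accessor is an assumption about how the agent's score slot is read, so check the agent and environment definitions for the real interface (and you will probably want to suppress the per-step display when running many trials).

;; Hedged sketch of an evaluation loop for the random agent; agent-score
;; is an assumed accessor for the agent's score slot -- verify it.
(defun evaluate-random-spider-agent (n)
  (/ (loop repeat n
           sum (let* ((problem (make-spider-problem :num-packs 1 :num-suits 1
                                                    :num-stacks 10 :num-hidden-rows 2))
                      (agent   (new-random-spider-agent :problem problem))
                      (env     (poproblem->environment problem :agents (list agent))))
                 (run-environment env)
                 (agent-score agent)))
     n))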

Question 5 (10 pts). Write a function new-random+history-spider-agent that, like new-random-spider-agent, returns an agent that selects moves randomly; however, the agent should keep a history of all visited states in a hash table (don't forget to use the state-hash-key!) and should never execute a move that leads to a state it has already visited. [Hint: you need only check the outcome of moves that have exactly one outcome.] Does this make your agent better or worse on the easy Spider instances?
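A minimal sketch of the bookkeeping part only (not a full agent): a hash table of visited keys, under the assumption that state-hash-key takes a single state argument and returns something comparable with equal. The real agent should keep its own table in a closure rather than a global one, and how the table hooks into move selection is left to you.

;; Sketch: remembering visited states via state-hash-key (signature assumed).
(defvar *visited-states* (make-hash-table :test #'equal))

(defun note-visited (state)
  (setf (gethash (state-hash-key state) *visited-states*) t))

(defun already-visited-p (state)
  (gethash (state-hash-key state) *visited-states*))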

Question 6 (5 pts). The file spider-eval.lisp contains an approximate utility function for Spider. Explain why the function includes the spider-suits-completed feature. [Hint: this is not a completely trivial question.]

Question 7 (5 pts). Evaluate a depth-1 sampling expectimax agent that uses 5 samples on both the "easy" instances and on instances with 2 packs, 1 suit, and 4 hidden rows. (Do as many trials as you can in a reasonable time, but no more than 1000 in any case.) If cycles are available, evaluate a depth-2 agent as well.
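The harder instances can presumably be created with the same constructor as before; the :num-stacks value of 10 below is an assumption carried over from the easy configuration, and ps-harder is just an illustrative name:

>> (setq ps-harder (make-spider-problem :num-packs 2 :num-suits 1
                                        :num-stacks 10 :num-hidden-rows 4))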