CS188: Artificial Intelligence, Fall 2011
Written 2: Games and MDPs
Due: 10/5, submitted electronically by 11:59pm (no slip days)
Policy: Can be solved in groups (acknowledge collaborators) but must be written up individually. Instructions for submitting your assignment can be found on the website via the assignments page: http://inst.eecs.berkeley.edu/~cs188/fa11/assignments.html

1 Pacmen Competing

Consider a game where multiple Pacmen are competing for dots. Each Pacman's score is the number of dots it has eaten minus the number of dots any other Pacman has eaten. Whoever has the highest score when all the food is gone is the winner. One Pacman moves each turn, with p1 first, then p2, etc. Assume that Pacmen can move past each other in the maze and share a square. Finally, assume that they cannot stop.

First consider the simple one-player case, on the board below. Pacman can move West (left), East (right), South (down), or North (up), and is using depth-limited minimax search to choose his next move, with a basic evaluation function consisting of Pacman's score. There is no time-step penalty.

[Figure: one-player Pacman board (not recovered from the source).]

(a) (1 point) For what search depths, if any, will East be an apparently optimal action (i.e., an action that could be returned by depth-limited search)?

(b) (1 point) For what search depths, if any, will West be an apparently optimal action (i.e., an action that could be returned by depth-limited search)?

(c) (1 point) For what search depths, if any, will the minimax value returned be the actual minimax value of the game?

(d) (1 point) Now, Pacman is using an evaluation function of (score + 2 · d_closest), where d_closest denotes the Manhattan distance to the closest dot. For what search depths, if any, is West an apparently optimal action?
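Parts (a)-(d) turn on how depth-limited search interacts with the evaluation function: at the depth cutoff, leaves are scored by the heuristic, and any action tied for the best backed-up value is "apparently optimal." As a point of reference, here is a minimal sketch of single-agent depth-limited search; the game representation (explicit successor and evaluation functions) is a hypothetical stand-in, not the actual board above.

```python
def depth_limited_value(state, depth, successors, evaluate):
    """Value of `state` looking `depth` moves ahead.

    At depth 0 (or at a dead end) the state is scored by the
    evaluation function rather than played out to the end."""
    if depth == 0 or not successors(state):
        return evaluate(state)
    return max(depth_limited_value(s, depth - 1, successors, evaluate)
               for s in successors(state))


def apparently_optimal(state, depth, actions, result, successors, evaluate):
    """All actions whose successor achieves the best depth-limited value.

    Ties matter: several actions can be apparently optimal at once."""
    values = {a: depth_limited_value(result(state, a), depth - 1,
                                     successors, evaluate)
              for a in actions(state)}
    best = max(values.values())
    return [a for a, v in values.items() if v == best]
```

On a toy chain where a small dot lies one step one way and a bigger prize lies two steps the other way, the apparently optimal action flips as the depth grows, which is exactly the phenomenon parts (a)-(b) probe.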
A second, adversarial Pacman (p2) enters the game! The evaluation function p1 uses is again simply its score. Remember, p1's score is the number of dots p1 has eaten minus the number of dots p2 has eaten (so p1 and p2 always have opposite scores). It is currently p1's turn. Again, there is no time-step penalty for either agent.

(e) (1 point) If p1 uses minimax search to depth 10, what will be the minimax value of the root node in his search tree?

(f) (1 point) Now suppose p1 knows that p2 just moves randomly. If p1 uses a depth-10 expectimax search, what will be his optimal action or actions?
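The difference between (e) and (f) is the opponent model: minimax assumes p2 minimizes p1's score, while expectimax against a uniformly random p2 averages over p2's moves. A minimal sketch of that expectimax recursion on an explicit game tree (the tree and values below are hypothetical, not the actual maze):

```python
def expectimax(state, depth, is_max, successors, evaluate):
    """Alternating max (p1) and chance (random p2) layers.

    Chance layers take the unweighted average because the random
    opponent picks each legal move with equal probability."""
    kids = successors(state)
    if depth == 0 or not kids:
        return evaluate(state)
    values = [expectimax(s, depth - 1, not is_max, successors, evaluate)
              for s in kids]
    return max(values) if is_max else sum(values) / len(values)
```

Replacing the average with `min(values)` recovers minimax, which is why the same position can have different root values under the two opponent models.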
2 Minimax/Expectimax

Consider the following minimax tree, where x, y, z, w, a, and b are the utilities at the leaf nodes.

[Figure: minimax tree with leaves x, y, z, w, a, b and an internal node m (not fully recovered from the source).]

(a) (1 point) Assume that alpha-beta evaluates w; write the condition on the values x, y, z, and w (or a subset of those) that allows the evaluation of a to be skipped.

(b) (1 point) Assume that alpha-beta evaluates node m; write the condition on the values x, y, z, w, and a (or a subset of those) that allows the evaluation of b to be skipped. Do not treat m as a number!

Consider the following expectimax tree. The outcomes at the chance node are equally likely. For all terminal states, it holds that 0 ≤ U ≤ 6.

[Figure: expectimax tree; the chance node's leaves include y and c (not fully recovered from the source).]

(c) (1 point) Write the condition on y that guarantees that the computation can be safely stopped without considering the c-leaf.

Now consider the following, slightly more general expectimax tree. The chance node has N (terminal) outcomes. For each outcome i of the chance node, the probability is p_i and the utility is u_i. Again, all terminal utilities obey 0 ≤ U ≤ 6.

[Figure: chance node with outcomes u_1, u_2, ..., u_N.]

(d) (1 point) Write the condition that allows the computation of the chance node's value to stop after the first k leaves, without considering the rest of the leaves of the chance node.
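For parts (a) and (b), recall the mechanism behind the pruning conditions: a min node can stop as soon as its running value is no better than the best alternative the max player already has (alpha), and symmetrically for max nodes against beta. A sketch of alpha-beta on an explicit tree, instrumented to record which leaves were actually evaluated so the pruning is observable (the tree in the test is hypothetical, not the one in the figure):

```python
def alphabeta(node, alpha, beta, is_max, children, value, visited):
    """Alpha-beta search on an explicit tree.

    `visited` collects every leaf that gets evaluated; leaves skipped
    by the alpha >= beta cutoff never appear in it."""
    kids = children(node)
    if not kids:
        visited.append(node)
        return value(node)
    best = float('-inf') if is_max else float('inf')
    for c in kids:
        v = alphabeta(c, alpha, beta, not is_max, children, value, visited)
        if is_max:
            best = max(best, v)
            alpha = max(alpha, v)
        else:
            best = min(best, v)
            beta = min(beta, v)
        if alpha >= beta:
            break  # remaining siblings are pruned
    return best
```

For parts (c) and (d), the analogous idea for chance nodes is a bound argument rather than a cutoff: since every terminal utility is at most the stated upper bound, the partial weighted sum plus (remaining probability mass) x (upper bound) is the best the chance node could still achieve, and evaluation can stop once that optimistic total cannot beat the value the max player already has.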
3 Blackjack

In the game of Blackjack, a player gets cards, one at a time, with the goal of achieving a total value as close to 21 as possible without exceeding it. If the current value of a player's hand is less than 21, the player can hit (be dealt a single card) in the hope of acquiring a hand with higher value. However, the player runs the risk of busting, or going over 21, which results in an immediate loss. The player can also decide to stay (stop getting cards). The end result is either a win or a loss, as described below.

In the CS188 casino, several variants of Blackjack are played; in what follows, you will formulate each variant as an MDP. In all variants, there is only one player (you) and one dealer. The deck is always reset and shuffled after every game. Also, each card type always has a fixed numeric value (so aces always have value 1, face cards have value 10, etc.). Finally, the utility for the player winning is 10, the utility for the player losing is -5, and the utility of a tie is 0.

Fixed Dealer, Infinite Deck: For this variant, the deck is infinite, which means that the probability of being dealt any given card is independent of the cards already dealt. In addition, the dealer's hand is fixed to the value 15, i.e., staying at 16 or higher is a win.

(a) (1 point) Describe a minimal state representation for this problem variant.

(b) (1 point) For how many states s' is T(start, hit, s') non-zero?

(c) (1 point) Will value iteration converge exactly on this MDP (i.e., is there some finite k for which V_k(s) = V*(s) for all s)? Briefly justify your answer.

Fixed Dealer, Finite Deck: Now assume the infinite deck is replaced by a finite deck, which means that the cards already dealt are not replaced. Again, the dealer's hand is fixed to the value 15.

(d) (1 point) Describe a minimal state representation for this problem.
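For intuition on the infinite-deck variant: because draws are independent of history, the distinct card types that share a value also collapse, so transitions under hit depend only on the player's current hand total. A quick sketch, assuming a standard 13-rank deck with the fixed values above (aces 1, face cards 10); the rank list is an assumption, not stated in the problem.

```python
from fractions import Fraction

# Assumed standard 13 ranks with fixed values: ace=1, 2-10, J=Q=K=10.
CARD_VALUES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]

def hit_distribution(hand_value):
    """P(next hand total | hit) under the infinite-deck assumption.

    Each rank is equally likely regardless of what was dealt before,
    so ranks with equal value merge into one successor state."""
    dist = {}
    for v in CARD_VALUES:
        nxt = hand_value + v
        dist[nxt] = dist.get(nxt, Fraction(0)) + Fraction(1, len(CARD_VALUES))
    return dist
```

Counting the keys of `hit_distribution(start_total)` is one way to sanity-check an answer to part (b); the finite-deck variant in (d) breaks this collapse, since the remaining deck composition now matters.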
Dealer Showing, Infinite Deck: Assume once more that the deck is infinite, but now the dealer actually plays (stays and hits) in alternation with the player, and the dealer always shows her cards. The dealer has fixed behavior: if her cards total less than 15, she hits, while if they total 15 or more, she stays. The player may keep hitting after the dealer stops, and vice versa.

(e) (1 point) Describe a minimal state representation for this problem.

(f) (1 point) For how many states s' is T(start, hit, s') non-zero? Assume that the start state is before any card has been dealt.

Dealer Hiding, Infinite Deck: Assume once more that the deck is infinite, and the dealer still plays (stays and hits) in alternation with the player, according to the same fixed behavior. However, now the dealer does not show her cards' value (though the player does know how many cards the dealer has).

(g) (1 point) Can this problem still be formalized as an MDP? If so, describe a minimal state representation for this problem. If not, justify why not.
4 MDP: Walk or Jump?

Consider an MDP with states {0, 1, 2, 3, 4}, where 0 is the starting state and 4 is a terminal state. In the terminal state, there are no actions that can be taken, and the value of that state is defined to be zero. In states k ≤ 3, you can Walk (W), and T(k, W, k+1) = 1. In states k ≤ 2, you can also Jump (J), and T(k, J, k+2) = 2/3, T(k, J, k) = 1/3 (usually jumping is faster, but sometimes you trip and don't make progress). The reward R(s, a, s') = (s - s')^2 for all (s, a, s'). Use a discount of γ = 1/2.

[Figure: chain of states 0 through 4 with walk and jump transitions (not recovered from the source).]

(a) (1 point) Consider the policy π that chooses the action Walk in every state. Compute V^π(2).

(b) (1 point) Compute V*(2).

(c) (1 point) Compute Q*(1, W).

Now consider a similar MDP, but with N+1 states {0, ..., N}, where 0 is the starting state and N is the terminal state. Now the transition probabilities are T(k, J, k+2) = 0.1 and T(k, J, k) = 0.9 for k ≤ N-2, and T(k, W, k+1) = 0.9 and T(k, W, k) = 0.1 for k ≤ N-1. Again, you cannot jump from N-1. However, now R(s, a, s') = 100 for s' = N and 0 otherwise. Furthermore, the discount has changed: γ = 0.99.

(d) (1 point) What is the smallest k such that V_k(0) > 0, where V_k is the value function after k iterations of Bellman updates? Assume N is even.

(e) (1 point) Will value iteration converge exactly (i.e., is there some finite k for which V_k(s) = V*(s) for all s)? Briefly justify your answer.
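Parts (a)-(e) are all instances of the Bellman update V_{k+1}(s) = max_a Σ_{s'} T(s,a,s')[R(s,a,s') + γ V_k(s')]. A generic sketch of value iteration over an explicit finite MDP (the interface here, with T and R passed as functions, is one possible encoding, not the only one); the tiny MDP in the test is a made-up two-state chain, not the walk/jump MDP above, so it does not give away the answers:

```python
def value_iteration(states, actions, T, R, gamma, iters):
    """Repeated Bellman updates from V_0 = 0.

    Terminal states have no actions, so `max(..., default=0.0)`
    pins their value at zero, as the problem statement requires."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max((sum(p * (R(s, a, s2) + gamma * V[s2])
                         for s2, p in T(s, a).items())
                     for a in actions(s)), default=0.0)
             for s in states}
    return V
```

For part (a), the same loop with `max` over actions replaced by the single action π(s) performs policy evaluation; for part (d), note that V_k(0) can only become positive once some length-k action sequence reaches N with non-zero probability.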
Now consider the same MDP as above (with non-zero rewards at state N only). However, there is now a time limit: the game ends after a known, finite number of actions M, where M < N. Formally, after the first M steps, no further rewards are obtained. One way to treat this situation is with time-dependent values and policies, as optimal values and actions could change depending on the time remaining.

(f) (1 point) Consider the value function V*(s, t), where t is the number of time steps left and s is the current state. Write a general Bellman optimality equation for V*(s, t) in terms of other V*(s', t') quantities. A full-credit answer will not specialize to this MDP.

(g) (1 point) For this MDP, will the optimal time-dependent policy π*(s, t) ever be different (for any state s or time t) from the optimal policy π(s) for the original, unlimited-time version of the problem? Either provide a specific example of a difference or justify why there will be no difference.
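As a hedged illustration of the shape such a recurrence can take (a sketch of one common formulation, not necessarily the expected answer to part (f)): with t steps remaining, values with t-1 steps remaining play the role that V_k plays in ordinary value iteration, and V(s, 0) = 0 because no further rewards can be obtained.

```python
def finite_horizon_values(states, actions, T, R, gamma, horizon):
    """Time-dependent values: V(s, 0) = 0, and
    V(s, t) = max_a sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V(s', t-1)).

    A sketch; whether gamma belongs in the time-limited variant is a
    modeling choice this code leaves as a parameter."""
    V = {(s, 0): 0.0 for s in states}
    for t in range(1, horizon + 1):
        for s in states:
            V[(s, t)] = max((sum(p * (R(s, a, s2) + gamma * V[(s2, t - 1)])
                                 for s2, p in T(s, a).items())
                             for a in actions(s)), default=0.0)
    return V
```

The optimal time-dependent action at (s, t) is the argmax of the same expression, which is the object part (g) asks you to compare against the stationary optimal policy.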