Monte Carlo Tree Search Method for AI Games

1 Tejaswini Patil, 2 Kalyani Amrutkar, 3 Dr. P. K. Deshmukh
1,2 Pune University, JSPM, Rajashri Shahu College of Engineering, Tathawade, Pune
3 JSPM, Rajashri Shahu College of Engineering, Tathawade, Pune

Abstract: Older approaches to game AI require either a great deal of domain knowledge or a long time to generate effective AI behaviour. Both characteristics hamper the goal of building challenging game AI. In this paper, we put forward Monte-Carlo Tree Search (MCTS) as a general framework for game AI, in which randomized explorations of the search space are used to predict the most promising game actions. We demonstrate that MCTS can be applied effectively to (a) classic board games, (b) modern board games, and (c) video games. The algorithms required to implement these games are also explained, including variants that handle different kinds of information, and a few games are described to illustrate the algorithms in action.

Keywords: Artificial Intelligence, Monte Carlo methods, Game AI

1. INTRODUCTION

When implementing AI for computer games, the most important component is the evaluation function that estimates the quality of a game state. The classic approach is to use heuristic domain knowledge to establish such estimates. However, building an adequate evaluation function based on heuristic knowledge for a non-terminal game state is a domain-dependent and complex task. This is probably one of the main reasons why game AI in complex game environments did not reach a strong level for a long time, despite intensive research and the additional use of knowledge-based methods. In the last few years, several Monte-Carlo based techniques have emerged in the field of computer games and have already been applied successfully to many games.
Monte-Carlo Tree Search (MCTS), a Monte-Carlo based technique first established in 2006, is implemented in top-rated GO programs, which have defeated strong human players on the 9x9 board. However, the technique is not specific to GO or to classical board games: it generalizes easily to modern board games and video games, and its implementation is quite straightforward. In the proposed demonstration, we illustrate that MCTS can be applied effectively to (1) classic board games (such as GO), (2) modern board games (such as SETTLERS OF CATAN), and (3) video games (such as the SPRING RTS game). In this paper, we study several variants of MCTS for games of imperfect information. Determinization is a popular approach to such games, but it has several shortcomings. Two of these are strategy fusion (assuming the ability to make different decisions from different future states in the same information set) and nonlocality (ignoring the ability of other players to direct play towards some states in an information set and away from others). For MCTS, a third shortcoming is that the computational budget must be shared between searching several independent trees rather than devoting all iterations to exploring a single tree deeply. Some techniques required to solve particular problems in the MCTS algorithm are not discussed in this paper.

2. REVIEW BACKGROUND

2.1 Monte Carlo Tree Search

Monte Carlo Tree Search (MCTS) methods have gained popularity in recent years due to success in domains such as Computer Go. In particular, the upper confidence bound for trees (UCT) algorithm forms the basis of successful MCTS applications across a wide variety of domains. Many of the domains in which MCTS has proved successful, including Go, were considered challenging for traditional AI techniques (such as minimax search with pruning), particularly due to the difficulty of forecasting the winner from a nonterminal game state.
MCTS has several strengths. It requires little domain knowledge, although including domain knowledge can be beneficial. It is an anytime algorithm, able to produce good results with as much or as little computational time as is available. It also lends itself to parallel execution. In this paper we investigate the application of MCTS to games of imperfect information. In particular, we consider games with three different types of imperfect information. Information sets are collections of states, which appear in a game when one player has knowledge about the state that another player does not. For example, in a card game each player hides his own cards from his opponents; here an information set contains all states corresponding to all possible permutations of the opponents' cards. A player knows which information set he is in, but not which state within that information set. Volume 2, Issue 2, March-April 2013, Page 405

Partially observable moves appear in games where a player performs an action but some detail of the action is hidden from an opponent. Simultaneous moves arise when multiple players reveal decisions simultaneously without knowing what decision the other players have made; the effect of the decisions is resolved simultaneously. The well-known game of Rock-Paper-Scissors is an example of this. Monte-Carlo Tree Search (MCTS) is a class of game tree search algorithms that have recently proven successful for deterministic games of perfect information, particularly the game of Go. Determinization is an AI technique for making decisions in games of imperfect information by analyzing several instances of the equivalent deterministic game of perfect information. In this paper we combine determinization techniques with MCTS for the popular Chinese card game Dou Di Zhu, the board game Lord of the Rings, and the phantom (4,4,4) game. In determinized MCTS, there is a trade-off between the number of determinizations searched and the time spent searching each one; however, we show that this trade-off does not significantly affect the performance of determinized MCTS, as long as both quantities are sufficiently large. MCTS algorithms build a subtree of the entire decision tree, where usually one new node is added after every simulation. Each node stores estimates of the rewards obtained by selecting each action, and an improved estimate is available after every simulation step. Each decision in the tree is treated as a multi-armed bandit problem where the arms are actions and the rewards are the results of performing a Monte Carlo simulation after selecting that action. MCTS is an anytime algorithm, requiring little domain knowledge.

2.2 Structure of MCTS

The structure of an MCTS algorithm is generally quite similar across variants: a discrete number of iterations is performed, after which an action from the root node is selected according to statistics collected about each action.
Monte-Carlo Tree Search (MCTS), as shown in Fig. 1, is a best-first search technique which uses stochastic simulations. MCTS can be applied to any game of finite length. Its basis is the simulation of games where both the AI-controlled player and its opponents play random, or better, pseudo-random moves. From a single random game (where every player selects his actions randomly), very little can be learnt. But from simulating a multitude of random games, a good strategy can be inferred [1]. Each iteration performs four operations on the sub-tree built by the algorithm: Selection, Expansion, Simulation and Backpropagation.

Selection: While the state is found in the tree, the next action is chosen according to the statistics stored, in a way that balances exploitation and exploration. On the one hand, the task is often to select the game action that has led to the best results so far (exploitation). On the other hand, less promising actions still have to be explored, due to the uncertainty of the evaluation (exploration).

Expansion: When the game reaches a state that cannot be found in the tree, that state is added as a new node. This way, the tree is expanded by one node for each simulated game.

Simulation: For the rest of the game, actions are selected at random until the end of the game. Naturally, the weighting of the action selection probabilities has a significant effect: if all actions are selected with equal probability, the strategy played is often weak, and the level of the Monte-Carlo program is suboptimal. We can use heuristic knowledge to give larger weights to actions that look more promising.

Backpropagation: After reaching the end of the simulated game, we update each tree node that was traversed during that game. The visit counts are increased and the win/loss ratio is modified according to the outcome.
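The four phases above can be sketched in a few dozen lines of Python. The game and all class/function names below are our own illustration (a toy Nim game, where players alternately take 1-3 stones and taking the last stone wins), not code from the paper; selection uses the standard UCB1 rule.

```python
import math, random

class NimState:
    """Toy game: players alternately take 1-3 stones; taking the last stone wins."""
    def __init__(self, stones, player=1):
        self.stones, self.player = stones, player
    def legal_moves(self):
        return list(range(1, min(3, self.stones) + 1))
    def apply(self, move):
        return NimState(self.stones - move, -self.player)
    def is_terminal(self):
        return self.stones == 0
    def winner(self):
        # The previous mover took the last stone and won.
        return -self.player

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children, self.untried = [], state.legal_moves()
        self.visits, self.wins = 0, 0.0
    def ucb1(self, c=1.4):
        # Exploitation (mean reward) plus exploration bonus.
        return (self.wins / self.visits +
                c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts_search(root_state, iters=1000, seed=None):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iters):
        node = root
        # 1. Selection: descend with UCB1 while the node is fully expanded.
        while not node.untried and node.children:
            node = max(node.children, key=Node.ucb1)
        # 2. Expansion: add one new child node per iteration.
        if node.untried:
            move = node.untried.pop(rng.randrange(len(node.untried)))
            node.children.append(Node(node.state.apply(move), node, move))
            node = node.children[-1]
        # 3. Simulation: random playout to the end of the game.
        state = node.state
        while not state.is_terminal():
            state = state.apply(rng.choice(state.legal_moves()))
        winner = state.winner()
        # 4. Backpropagation: credit a win to the player who moved into each node.
        while node is not None:
            node.visits += 1
            if node.parent is not None and winner == node.parent.state.player:
                node.wins += 1
            node = node.parent
    # Final decision: the most-visited child of the root.
    return max(root.children, key=lambda n: n.visits).move
```

With enough iterations the most-visited child converges on the strongest move; for instance, with three stones left the search settles on taking all three.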
The game action finally executed by the program in the actual game is the one corresponding to the child which was explored most often.

Figure 1: Basic algorithm of MCTS

2.3 Node Selection

In this section we describe the node selection process used in the basic algorithm explained above. Node selection during tree descent is achieved by choosing the node that maximises some quantity, analogous to the multi-armed bandit problem in which a player must choose the slot machine that maximises the estimated reward each turn. An Upper Confidence Bounds (UCB) formula is typically used for such node selection [3]. The UCB formula used is

UCB1 = vi + C * sqrt(ln N / ni)

where vi is the estimated value of the node, ni is the number of times the node has been visited and N is the total number of times its parent has been visited. C is a tunable bias parameter. The UCB formula balances the exploitation of known rewards with the exploration of relatively unvisited nodes. Reward estimates are based on random simulations, so nodes must be visited a number of times before these estimates become reliable; MCTS estimates will typically be unreliable at the start of a search but converge to more reliable estimates given sufficient time, and to perfect estimates given infinite time.

2.4 Benefits of MCTS

MCTS offers several benefits over traditional tree search algorithms. A few of them are listed below:
1. Aheuristic: MCTS does not require any strategic or tactical knowledge about the given domain to make reasonable decisions. The algorithm can function effectively with no knowledge of a game apart from its legal moves and end conditions; this means that a single MCTS implementation can be reused for a number of games with little modification, and makes MCTS a potential boon for general game playing.
2. Asymmetric: MCTS performs asymmetric tree growth that adapts to the topology of the search space. The algorithm visits more interesting nodes more often, and focuses its search time on more relevant parts of the tree. This makes MCTS suitable for games with large branching factors such as 19x19 Go. Such large combinatorial spaces typically cause problems for standard depth- or breadth-based search methods, but the adaptive nature of MCTS means that it will (eventually) find those moves that appear optimal and focus its search effort there.
3. Anytime: The algorithm can be halted at any time to return the current best estimate. The search tree built thus far may be discarded or preserved for future reuse.
4. Elegant: The algorithm is simple to implement.

2.5 Drawbacks of MCTS Algorithms

MCTS has a few drawbacks, which can be major:
1. Playing Strength: The MCTS algorithm, in its basic form, can fail to find reasonable moves even for games of medium complexity within a reasonable amount of time. This is mostly due to the sheer size of the combinatorial move space and the fact that key nodes may not be visited enough times to give reliable estimates.
2.
Speed: MCTS search can take many iterations to converge to a good solution, which can be an issue for more general applications that are difficult to optimize. For example, the best Go implementations can require millions of playouts, in conjunction with domain-specific optimizations and enhancements, to make expert moves, whereas the best GGP implementations may only make tens of (domain-independent) playouts per second for more complex games. For reasonable move times, such GGPs may barely have time to visit each legal move, and it is unlikely that significant search will occur.

2.6 Improvements over the drawbacks

Many MCTS enhancements have been suggested to date. These can generally be described as either domain-knowledge or domain-independent enhancements.
1. Domain Knowledge: Domain knowledge specific to the current game can be exploited in the tree to filter out implausible moves, or in the simulations to produce heavy playouts that are more similar to playouts that would occur between human opponents. This means that playout results will be more realistic than random simulations, and nodes will require fewer iterations to yield realistic reward values. Domain knowledge can yield significant improvements, at the expense of speed and loss of generality.
2. Domain Independent: Domain-independent enhancements apply to all problem domains. These are typically applied in the tree (e.g. AMAF), although again some apply to the simulations (e.g. preferring winning moves during playouts). Domain-independent enhancements do not tie the implementation to a particular domain, maintaining generality, and are hence the focus of most current work in the area.

3. DESCRIPTION OF ALGORITHMS

We study two types of algorithms in this paper which elaborate on MCTS.

A.
Single Observer Information Set Monte Carlo Tree Search (SO-ISMCTS)

To overcome the problems associated with other MCTS-based approaches used in game AI, we propose searching a single tree whose nodes correspond to information sets rather than states. In SO-ISMCTS, nodes in the tree correspond to information sets from the root player's point of view, and edges correspond to actions [8] (i.e., moves from the point of view of the player who plays them). The correspondence between nodes and information sets is not one-to-one: partially observable opponent moves that are indistinguishable to the root player have separate edges in the tree, and thus the resulting information set may be represented by several nodes in the tree.

Figure 2: Single Observer Information Set Monte Carlo Tree Search

Figure 2 shows a game tree for a simple single-player game of imperfect information. The root information set contains two states: x and y. The player first selects one of two actions: a1 or a2. Selecting a2 yields an immediate reward of +0.5 and ends the game. If the player instead selects a1, he must then select an action a3 or a4. If the game began in state x, then a3 and a4 lead to rewards of -1 and +1, respectively. If the game began in state y, then the rewards are interchanged.

If states x and y are equally likely, action a1 has an expectimax value of 0: upon choosing a1, both a3 and a4 have an expectimax value of 0. Thus, the optimal action from the root is a2. However, a determinizing player searches trees corresponding to each state x and y individually and assigns a minimax value of +1 in each (by assuming that the correct choice of a3 or a4 can always be made), and thus wrongly believes a1 to be optimal. High-level pseudocode for SO-ISMCTS is given below:
1. Create a single tree whose root corresponds to the root information set.
2. Repeat the following for n iterations:
// Selection
3. Select a node and traverse the tree until reaching an action that yields some points or rewards.
// Expansion
4. If the selected node is not a terminal node, add a new node and expand the tree.
// Simulation
5. From the last node, run a simulation to the end of the game using a determinization algorithm.
// Backpropagation
6. Return the action from the root node whose corresponding child node has the greatest number of visits.

B. Multiple Observer Information Set Monte Carlo Tree Search (MO-ISMCTS)

SO-ISMCTS+POM solves the strategy fusion problem of SO-ISMCTS, at the expense of significantly weakening the opponent model: in particular, it is assumed that the opponent chooses randomly between actions that are indistinguishable to the root player. In the extreme case, when SO-ISMCTS+POM is applied to a phantom game, all opponent actions are indistinguishable and so the opponent model is essentially random. To address this, we propose multiple-observer information set MCTS (MO-ISMCTS). This algorithm maintains a separate tree for each player, whose nodes correspond to that player's information sets and whose edges correspond to moves from that player's point of view. Each iteration of the algorithm descends all of the trees simultaneously.
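The expectimax discrepancy in the x/y example above can be checked directly. The sketch below is our own encoding of the rewards, not code from the paper: it computes the information-set value (where one action must serve every state in the set) and the per-determinization minimax values that mislead a determinizing player.

```python
# Rewards for the second decision (after a1), per hidden state, as in Figure 2.
REWARDS = {"x": {"a3": -1.0, "a4": +1.0},
           "y": {"a3": +1.0, "a4": -1.0}}

def info_set_value():
    """Expectimax value when one action must be chosen for the whole
    information set {x, y}, each state being equally likely."""
    a1_value = max(sum(REWARDS[s][a] for s in ("x", "y")) / 2
                   for a in ("a3", "a4"))
    return max(a1_value, 0.5)  # a2 always pays +0.5 and ends the game

def determinized_value(state):
    """Perfect-information (minimax) value when the hidden state is
    assumed known, as a determinizing player does."""
    a1_value = max(REWARDS[state][a] for a in ("a3", "a4"))
    return max(a1_value, 0.5)
```

Here info_set_value() is 0.5 (choose a2), while each determinization reports +1 for a1: averaging the determinized values overstates a1, which is exactly the strategy-fusion error described above.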
Each selection step uses statistics in the tree belonging to the player about to act in the current determinization to select an action. Each tree is then descended by following the branch corresponding to the move obtained when the corresponding player observes the selected action, adding new branches if necessary. The information set trees can be seen as projections of the underlying game tree: each iteration induces a path through the game tree, which projects onto a path through each information set tree.

4. GAMES

A. Lord of the Rings

The Lord of the Rings is a board game with elements similar to Stratego, and it has several features which make it even more challenging from an AI perspective. It has hidden information, partially observable moves, and simultaneous moves, all of which make the decision-making process highly complex. The game also has an asymmetry between the two players, since they have different win conditions and different resources available to them, which necessitates different tactics and strategies. In this experiment, the following algorithms play in a round-robin tournament: determinized UCT, SO-ISMCTS, SO-ISMCTS+POM, and MO-ISMCTS. Each algorithm runs for a fixed number of iterations per decision. Determinized UCT uses ten determinizations with 1000 iterations each for the Dark player, and applies all iterations to a single determinization for the Light player. These values were chosen based on the results in Section V-B. Cheating ensemble UCT uses ten trees with 1000 iterations each for both Light and Dark; devoting all iterations to a single tree would be equivalent to cheating single-tree UCT.

B. Phantom (4,4,4) Games

Phantom (m,n,k) games are a generalization of games such as Noughts and Crosses and Renju, where players try to place k pieces in a row on an m x n grid. We investigate the phantom (4,4,4) game, in which players cannot see each other's pieces. This leads to a game with hidden information and partially observable moves.

5.
EXPERIMENTAL RESULTS FOR THE DOU DI ZHU GAME

A. Background

Dou Di Zhu is played among three people with one pack of cards, including the two jokers. The game starts with players bidding for the Landlord position. Those who lose the bid enter the game as the team competing against the Landlord. The objective of the game is to be the first player to have no cards left. The game was only played in a few regions of China until quite recently, when versions of the game on the Internet led to an increase in its popularity throughout the whole country. Today Dou Di Zhu is played by millions of people online, although almost exclusively in China. In addition, there have been several major Dou Di Zhu tournaments, including one in 2008 which attracted a large number of players.

B. Game Play

A shuffled pack of 54 cards is dealt to the three players. Each player is dealt 17 cards, with the last three leftover "kitty" cards set aside on the table, face down.

All players first review and appraise their own cards without showing them to the other players. Then, players take turns to bid for the Landlord position by telling the other players the risk stake they are willing to accept. There are three kinds of risk stakes, 1, 2, and 3, with 1 being the lowest and 3 the highest. Generally, the more confident a player is in the strength of his cards, the higher the risk stake he is willing to bid. In most online game rooms, the first bidder is chosen randomly by the system; in reality, players usually make up their own rules as to who gets to bid first. A player may accept the prior player's bid by passing on his turn to bid, or may try to outbid the prior player as long as the prior player did not bid 3. In other words, 1 can be outbid by 2 or 3; 2 can only be outbid by 3; and 3 cannot be outbid. The highest bidder takes the Landlord position, and the remaining players enter the Farmer team competing against the Landlord. The three leftover kitty cards are then revealed to all players before being dealt to the Landlord. The Landlord wins if he or she has no cards left; the Farmer team wins if either of the Farmers has no cards left.

C. Rules

Dou Di Zhu uses a standard 52-card deck with the addition of a black joker and a red joker. Suit is irrelevant, but the cards are ranked in ascending order. A bidding phase, which is not considered here, designates one of the players as the Landlord. The Landlord receives 20 cards dealt from a shuffled deck, while the other players receive 17 each. The goal of the game is to be the first to get rid of all cards in hand. If the Landlord wins, the other two players must each pay the stake to the Landlord. However, if either of the other two players wins, the Landlord pays the stake to both opponents. This means the two non-Landlord players must cooperate to beat the Landlord.
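The outbidding rule above is simple enough to state as a predicate. The helper below is our own hypothetical illustration, not code from the paper (None denotes that no bid has been made yet):

```python
def can_outbid(current_bid, new_bid):
    """Dou Di Zhu bidding: a new bid must be 1, 2 or 3 and strictly exceed
    the standing bid, so a bid of 3 can never be outbid."""
    if new_bid not in (1, 2, 3):
        raise ValueError("bids must be 1, 2 or 3")
    return current_bid is None or new_bid > current_bid
```

For example, can_outbid(1, 3) holds, while can_outbid(3, b) is false for every legal b.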
The non-Landlord players do not see each other's cards, so the game cannot be reduced to a two-player game with perfect recall. Card play takes place in a number of rounds until a player has no cards left. The Landlord begins the game by making a leading play, which can be any group of cards from his hand provided this group is a member of one of the legal move categories. The next player can play a group of cards from his hand provided this group is in the same category and has a higher rank than the group played by the previous player, or he may pass. A player who holds no compatible group has no choice but to pass. This continues until two players pass, at which point the next player may start a new round by making a new leading play of any category.

D. Implementation

The branching factor for leading plays is typically around 40, and for non-leading plays it is much smaller. However, in situations where moves with kickers are available, each combination of move and kicker must be considered as a separate move, leading to a combinatorial explosion in the branching factor for leading plays. It should be noted that this is a problem specific to Dou Di Zhu, caused by the game mechanic of being able to attach kicker cards to a play. To address this, an approach similar to the move grouping approach of Childs et al. is used: the player first chooses the base move and then the kicker, as two separate consecutive decision nodes in the tree. The overall win rate for determinized UCT was 43.6%, for ISMCTS it was 42.3%, and for cheating UCT it was 56.5%. These win rates are approximately the same as those we previously obtained. This is unsurprising: the 1000 deals we originally selected were chosen to be a good indicator of typical playing strength. Each deal was then put into one of three categories according to the difference in win rate between cheating UCT and determinized UCT.
If cheating UCT outperformed determinized UCT (with 95% significance), the deal was put into the first category. If determinized UCT outperformed cheating UCT (also with 95% significance), the deal was put into the second category.

6. THE OBJECTIVE: HANDLING UNCERTAINTY IN GAMES

The objective of the new algorithm is to handle uncertainty and hidden information in games. Below are two techniques which help to avoid the drawbacks of the current algorithm and achieve this objective.

A. Simultaneous Moves

Simultaneous moves are a special case of imperfect information, in which each player independently chooses an action and these actions are applied at the same time. Simultaneous moves can be modelled by having players choose their actions sequentially, while hiding their choices from the other players, until finally an environment action reveals the chosen actions and resolves their effects. With this in mind, any algorithm that can handle imperfect information in general can handle simultaneous moves in particular. However, some of our algorithms (particularly those not designed to handle partially observable moves) perform poorly under this model. Under a simple determinization approach, the first player is overly pessimistic (assuming the opponent can observe the chosen move and select the best response to it), while the second player is overly optimistic (assuming the first player's move is fixed at the point of the second player's decision, and thus determinizing it randomly). For this reason, we add a mechanism to the algorithms studied in this paper specifically to handle simultaneous moves. The UCT algorithm has been applied to the simultaneous move game Rock-Paper-Scissors, using an approach where each player's choice of action is treated
as a separate, independent multi-armed bandit problem. In other words, instead of selecting player 1's move, descending the corresponding tree branch, and selecting player 2's move from the resulting child node, both moves are selected independently from the same node, and the tree branch corresponding to the resulting pair of moves is descended.

B. Chance Nodes

Handling of chance events is not a primary focus of this paper. However, chance nodes do occur under certain circumstances in one of our test domains, so they cannot be ignored completely. Note that our chance nodes have a small number of possible outcomes (at most four, but rarely more than two), all with equal probability. Technically, another test domain includes a chance event with combinatorially many outcomes, corresponding to shuffling and dealing a deck of cards at the beginning of the game, but since this occurs before any player has made a decision it never occurs as a chance node in our search tree. Consider a chance node with several branches. To ensure that each branch is explored approximately equally, the first block of visits selects all outcomes in a random permutation, the second block of visits selects all outcomes in another random permutation, and so on. This is almost trivial to implement in UCT: since we already use UCB with random tie-breaking for action selection, it suffices to treat the environment player as a decision-making agent who has perfect information and receives a reward of zero for all terminal states. The UCB exploration term then ensures that the branches are visited in the manner described above.

7. FUTURE WORK

The ISMCTS family of algorithms has been demonstrated here for several domains. It is clear that an enhanced version of ISMCTS should yield better playing strength, especially for domains such as Dou Di Zhu, where there is a need for some mechanism to handle the large branching factor at opponent nodes.
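The random-permutation visiting scheme for chance nodes described in Section 6B can be sketched as a generator. This is our own illustration of the idea, not the authors' implementation:

```python
import random

def chance_outcome_scheduler(outcomes, rng=None):
    """Cycle through chance-node outcomes so that every block of
    len(outcomes) consecutive visits covers each outcome exactly once,
    using a fresh random permutation for each block."""
    rng = rng or random.Random()
    while True:
        block = list(outcomes)
        rng.shuffle(block)       # new random permutation per block
        yield from block
```

In practice the same visiting pattern falls out of UCB with random tie-breaking, as the text notes; the generator just makes the pattern explicit.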
It remains to establish the theoretical properties of these algorithms and their potential for converging to game-theoretic solutions. MO-ISMCTS is arguably the most theoretically defensible of the three ISMCTS algorithms, as it most accurately models the differences in information available to each player. A subject for future work is to conduct a full theoretical analysis of MO-ISMCTS and to investigate the situations under which it converges to an optimal policy. The SO-ISMCTS+POM algorithm currently assumes the opponent chooses indistinguishable moves at random, which is clearly incorrect as a decision model for the opponent. There is room for improvement in this aspect of the algorithm.

REFERENCES

[1] Determinization in Monte-Carlo Tree Search for the card game Dou Di Zhu [Online].
[2] Chang Liu and Andrew D. Tremblay, "Monte-Carlo Search Algorithm," March 28, 2011.
[3] Monte Carlo Tree Search [Online].
[4] Fight the Landlord (Dou Di Zhu) [Online].
[5] E. K. P. Chong, R. L. Givan, and H. S. Chang, "A framework for simulation-based network control via hindsight optimization," in Proc. IEEE Conf. Decision Control, Sydney, Australia, 2000.
[6] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 3rd ed. Upper Saddle River, NJ: Prentice-Hall.
[7] Board Game Geek, Stratego, 2011 [Online]; P. I. Cowling, E. J. Powley, and D. Whitehouse, "Information Set Monte Carlo Tree Search," IEEE Trans. Comput. Intell. AI Games, vol. 4, no. 2, June 2012.
[8] Algorithms for abstracting and solving imperfect information games [Online].
[9] The Monte Carlo Method for Game AI, blog [Online].
[10] D. Whitehouse, E. J. Powley, and P. I. Cowling, "Determinization and information set Monte Carlo tree search for the card game Dou Di Zhu," in Proc. IEEE Conf. Comput. Intell. Games, Seoul, Korea, 2011.

AUTHORS

Dr. Pradeep K. Deshmukh, B.E., M.E. and Ph.D. in Computer Science and Engineering. His key research interests include cloud computing, network security, and ANN. He is currently working as a Professor in Rajarshi Shahu College of Engineering, Pune, India.

Mrs. Tejaswini Patil, B.E. Computers. She has over 6 years of industrial experience and is currently pursuing her M.E. from Rajarshi Shahu College of Engineering, Pune, India.

Mrs. Kalyani Amrutkar, B.E. Computers from Pune University. She is currently pursuing her M.E. from Pune University and working as a Lecturer in a reputed institute.


Monte Carlo Tree Search for games with Hidden Information and Uncertainty. Daniel Whitehouse PhD University of York Computer Science Monte Carlo Tree Search for games with Hidden Information and Uncertainty Daniel Whitehouse PhD University of York Computer Science August, 2014 Abstract Monte Carlo Tree Search (MCTS) is an AI technique

More information

A Bandit Approach for Tree Search

A Bandit Approach for Tree Search A An Example in Computer-Go Department of Statistics, University of Michigan March 27th, 2008 A 1 Bandit Problem K-Armed Bandit UCB Algorithms for K-Armed Bandit Problem 2 Classical Tree Search UCT Algorithm

More information

Monte Carlo Tree Search. Simon M. Lucas

Monte Carlo Tree Search. Simon M. Lucas Monte Carlo Tree Search Simon M. Lucas Outline MCTS: The Excitement! A tutorial: how it works Important heuristics: RAVE / AMAF Applications to video games and real-time control The Excitement Game playing

More information

Adversarial Search. CS 486/686: Introduction to Artificial Intelligence

Adversarial Search. CS 486/686: Introduction to Artificial Intelligence Adversarial Search CS 486/686: Introduction to Artificial Intelligence 1 AccessAbility Services Volunteer Notetaker Required Interested? Complete an online application using your WATIAM: https://york.accessiblelearning.com/uwaterloo/

More information

By David Anderson SZTAKI (Budapest, Hungary) WPI D2009

By David Anderson SZTAKI (Budapest, Hungary) WPI D2009 By David Anderson SZTAKI (Budapest, Hungary) WPI D2009 1997, Deep Blue won against Kasparov Average workstation can defeat best Chess players Computer Chess no longer interesting Go is much harder for

More information

Set 4: Game-Playing. ICS 271 Fall 2017 Kalev Kask

Set 4: Game-Playing. ICS 271 Fall 2017 Kalev Kask Set 4: Game-Playing ICS 271 Fall 2017 Kalev Kask Overview Computer programs that play 2-player games game-playing as search with the complication of an opponent General principles of game-playing and search

More information

43.1 Introduction. Foundations of Artificial Intelligence Introduction Monte-Carlo Methods Monte-Carlo Tree Search. 43.

43.1 Introduction. Foundations of Artificial Intelligence Introduction Monte-Carlo Methods Monte-Carlo Tree Search. 43. May 6, 20 3. : Introduction 3. : Introduction Malte Helmert University of Basel May 6, 20 3. Introduction 3.2 3.3 3. Summary May 6, 20 / 27 May 6, 20 2 / 27 Board Games: Overview 3. : Introduction Introduction

More information

Outline. Game Playing. Game Problems. Game Problems. Types of games Playing a perfect game. Playing an imperfect game

Outline. Game Playing. Game Problems. Game Problems. Types of games Playing a perfect game. Playing an imperfect game Outline Game Playing ECE457 Applied Artificial Intelligence Fall 2007 Lecture #5 Types of games Playing a perfect game Minimax search Alpha-beta pruning Playing an imperfect game Real-time Imperfect information

More information

BLUFF WITH AI. CS297 Report. Presented to. Dr. Chris Pollett. Department of Computer Science. San Jose State University. In Partial Fulfillment

BLUFF WITH AI. CS297 Report. Presented to. Dr. Chris Pollett. Department of Computer Science. San Jose State University. In Partial Fulfillment BLUFF WITH AI CS297 Report Presented to Dr. Chris Pollett Department of Computer Science San Jose State University In Partial Fulfillment Of the Requirements for the Class CS 297 By Tina Philip May 2017

More information

Game-playing: DeepBlue and AlphaGo

Game-playing: DeepBlue and AlphaGo Game-playing: DeepBlue and AlphaGo Brief history of gameplaying frontiers 1990s: Othello world champions refuse to play computers 1994: Chinook defeats Checkers world champion 1997: DeepBlue defeats world

More information

Monte Carlo Tree Search

Monte Carlo Tree Search Monte Carlo Tree Search 1 By the end, you will know Why we use Monte Carlo Search Trees The pros and cons of MCTS How it is applied to Super Mario Brothers and Alpha Go 2 Outline I. Pre-MCTS Algorithms

More information

TRIAL-BASED HEURISTIC TREE SEARCH FOR FINITE HORIZON MDPS. Thomas Keller and Malte Helmert Presented by: Ryan Berryhill

TRIAL-BASED HEURISTIC TREE SEARCH FOR FINITE HORIZON MDPS. Thomas Keller and Malte Helmert Presented by: Ryan Berryhill TRIAL-BASED HEURISTIC TREE SEARCH FOR FINITE HORIZON MDPS Thomas Keller and Malte Helmert Presented by: Ryan Berryhill Outline Motivation Background THTS framework THTS algorithms Results Motivation Advances

More information

Using Fictitious Play to Find Pseudo-Optimal Solutions for Full-Scale Poker

Using Fictitious Play to Find Pseudo-Optimal Solutions for Full-Scale Poker Using Fictitious Play to Find Pseudo-Optimal Solutions for Full-Scale Poker William Dudziak Department of Computer Science, University of Akron Akron, Ohio 44325-4003 Abstract A pseudo-optimal solution

More information

SCRABBLE ARTIFICIAL INTELLIGENCE GAME. CS 297 Report. Presented to. Dr. Chris Pollett. Department of Computer Science. San Jose State University

SCRABBLE ARTIFICIAL INTELLIGENCE GAME. CS 297 Report. Presented to. Dr. Chris Pollett. Department of Computer Science. San Jose State University SCRABBLE AI GAME 1 SCRABBLE ARTIFICIAL INTELLIGENCE GAME CS 297 Report Presented to Dr. Chris Pollett Department of Computer Science San Jose State University In Partial Fulfillment Of the Requirements

More information

Small and large MCTS playouts applied to Chinese Dark Chess stochastic game

Small and large MCTS playouts applied to Chinese Dark Chess stochastic game Small and large MCTS playouts applied to Chinese Dark Chess stochastic game Nicolas Jouandeau 1 and Tristan Cazenave 2 1 LIASD, Université de Paris 8, France n@ai.univ-paris8.fr 2 LAMSADE, Université Paris-Dauphine,

More information

CS188: Artificial Intelligence, Fall 2011 Written 2: Games and MDP s

CS188: Artificial Intelligence, Fall 2011 Written 2: Games and MDP s CS88: Artificial Intelligence, Fall 20 Written 2: Games and MDP s Due: 0/5 submitted electronically by :59pm (no slip days) Policy: Can be solved in groups (acknowledge collaborators) but must be written

More information

Artificial Intelligence. Minimax and alpha-beta pruning

Artificial Intelligence. Minimax and alpha-beta pruning Artificial Intelligence Minimax and alpha-beta pruning In which we examine the problems that arise when we try to plan ahead to get the best result in a world that includes a hostile agent (other agent

More information

SEARCHING is both a method of solving problems and

SEARCHING is both a method of solving problems and 100 IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. 3, NO. 2, JUNE 2011 Two-Stage Monte Carlo Tree Search for Connect6 Shi-Jim Yen, Member, IEEE, and Jung-Kuei Yang Abstract Recently,

More information

Unit-III Chap-II Adversarial Search. Created by: Ashish Shah 1

Unit-III Chap-II Adversarial Search. Created by: Ashish Shah 1 Unit-III Chap-II Adversarial Search Created by: Ashish Shah 1 Alpha beta Pruning In case of standard ALPHA BETA PRUNING minimax tree, it returns the same move as minimax would, but prunes away branches

More information

AI Plays Yun Nie (yunn), Wenqi Hou (wenqihou), Yicheng An (yicheng)

AI Plays Yun Nie (yunn), Wenqi Hou (wenqihou), Yicheng An (yicheng) AI Plays 2048 Yun Nie (yunn), Wenqi Hou (wenqihou), Yicheng An (yicheng) Abstract The strategy game 2048 gained great popularity quickly. Although it is easy to play, people cannot win the game easily,

More information

DeepStack: Expert-Level AI in Heads-Up No-Limit Poker. Surya Prakash Chembrolu

DeepStack: Expert-Level AI in Heads-Up No-Limit Poker. Surya Prakash Chembrolu DeepStack: Expert-Level AI in Heads-Up No-Limit Poker Surya Prakash Chembrolu AI and Games AlphaGo Go Watson Jeopardy! DeepBlue -Chess Chinook -Checkers TD-Gammon -Backgammon Perfect Information Games

More information

CS 387/680: GAME AI BOARD GAMES

CS 387/680: GAME AI BOARD GAMES CS 387/680: GAME AI BOARD GAMES 6/2/2014 Instructor: Santiago Ontañón santi@cs.drexel.edu TA: Alberto Uriarte office hours: Tuesday 4-6pm, Cyber Learning Center Class website: https://www.cs.drexel.edu/~santi/teaching/2014/cs387-680/intro.html

More information

Adversarial Search and Game- Playing C H A P T E R 6 C M P T : S P R I N G H A S S A N K H O S R A V I

Adversarial Search and Game- Playing C H A P T E R 6 C M P T : S P R I N G H A S S A N K H O S R A V I Adversarial Search and Game- Playing C H A P T E R 6 C M P T 3 1 0 : S P R I N G 2 0 1 1 H A S S A N K H O S R A V I Adversarial Search Examine the problems that arise when we try to plan ahead in a world

More information

CPS331 Lecture: Search in Games last revised 2/16/10

CPS331 Lecture: Search in Games last revised 2/16/10 CPS331 Lecture: Search in Games last revised 2/16/10 Objectives: 1. To introduce mini-max search 2. To introduce the use of static evaluation functions 3. To introduce alpha-beta pruning Materials: 1.

More information

CS188 Spring 2014 Section 3: Games

CS188 Spring 2014 Section 3: Games CS188 Spring 2014 Section 3: Games 1 Nearly Zero Sum Games The standard Minimax algorithm calculates worst-case values in a zero-sum two player game, i.e. a game in which for all terminal states s, the

More information

MONTE-CARLO TWIXT. Janik Steinhauer. Master Thesis 10-08

MONTE-CARLO TWIXT. Janik Steinhauer. Master Thesis 10-08 MONTE-CARLO TWIXT Janik Steinhauer Master Thesis 10-08 Thesis submitted in partial fulfilment of the requirements for the degree of Master of Science of Artificial Intelligence at the Faculty of Humanities

More information

The first topic I would like to explore is probabilistic reasoning with Bayesian

The first topic I would like to explore is probabilistic reasoning with Bayesian Michael Terry 16.412J/6.834J 2/16/05 Problem Set 1 A. Topics of Fascination The first topic I would like to explore is probabilistic reasoning with Bayesian nets. I see that reasoning under situations

More information

Advanced Game AI. Level 6 Search in Games. Prof Alexiei Dingli

Advanced Game AI. Level 6 Search in Games. Prof Alexiei Dingli Advanced Game AI Level 6 Search in Games Prof Alexiei Dingli MCTS? MCTS Based upon Selec=on Expansion Simula=on Back propaga=on Enhancements The Mul=- Armed Bandit Problem At each step pull one arm Noisy/random

More information

Monte-Carlo Tree Search for the Simultaneous Move Game Tron

Monte-Carlo Tree Search for the Simultaneous Move Game Tron Monte-Carlo Tree Search for the Simultaneous Move Game Tron N.G.P. Den Teuling June 27, 2011 Abstract Monte-Carlo Tree Search (MCTS) has been successfully applied to many games, particularly in Go. In

More information

Game-Playing & Adversarial Search

Game-Playing & Adversarial Search Game-Playing & Adversarial Search This lecture topic: Game-Playing & Adversarial Search (two lectures) Chapter 5.1-5.5 Next lecture topic: Constraint Satisfaction Problems (two lectures) Chapter 6.1-6.4,

More information

game tree complete all possible moves

game tree complete all possible moves Game Trees Game Tree A game tree is a tree the nodes of which are positions in a game and edges are moves. The complete game tree for a game is the game tree starting at the initial position and containing

More information

Game Tree Search. CSC384: Introduction to Artificial Intelligence. Generalizing Search Problem. General Games. What makes something a game?

Game Tree Search. CSC384: Introduction to Artificial Intelligence. Generalizing Search Problem. General Games. What makes something a game? CSC384: Introduction to Artificial Intelligence Generalizing Search Problem Game Tree Search Chapter 5.1, 5.2, 5.3, 5.6 cover some of the material we cover here. Section 5.6 has an interesting overview

More information

COMP219: Artificial Intelligence. Lecture 13: Game Playing

COMP219: Artificial Intelligence. Lecture 13: Game Playing CMP219: Artificial Intelligence Lecture 13: Game Playing 1 verview Last time Search with partial/no observations Belief states Incremental belief state search Determinism vs non-determinism Today We will

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Adversarial Search Vibhav Gogate The University of Texas at Dallas Some material courtesy of Rina Dechter, Alex Ihler and Stuart Russell, Luke Zettlemoyer, Dan Weld Adversarial

More information

Five-In-Row with Local Evaluation and Beam Search

Five-In-Row with Local Evaluation and Beam Search Five-In-Row with Local Evaluation and Beam Search Jiun-Hung Chen and Adrienne X. Wang jhchen@cs axwang@cs Abstract This report provides a brief overview of the game of five-in-row, also known as Go-Moku,

More information

Intuition Mini-Max 2

Intuition Mini-Max 2 Games Today Saying Deep Blue doesn t really think about chess is like saying an airplane doesn t really fly because it doesn t flap its wings. Drew McDermott I could feel I could smell a new kind of intelligence

More information

Game Specific Approaches to Monte Carlo Tree Search for Dots and Boxes

Game Specific Approaches to Monte Carlo Tree Search for Dots and Boxes Western Kentucky University TopSCHOLAR Honors College Capstone Experience/Thesis Projects Honors College at WKU 6-28-2017 Game Specific Approaches to Monte Carlo Tree Search for Dots and Boxes Jared Prince

More information

CS 771 Artificial Intelligence. Adversarial Search

CS 771 Artificial Intelligence. Adversarial Search CS 771 Artificial Intelligence Adversarial Search Typical assumptions Two agents whose actions alternate Utility values for each agent are the opposite of the other This creates the adversarial situation

More information

COMP219: COMP219: Artificial Intelligence Artificial Intelligence Dr. Annabel Latham Lecture 12: Game Playing Overview Games and Search

COMP219: COMP219: Artificial Intelligence Artificial Intelligence Dr. Annabel Latham Lecture 12: Game Playing Overview Games and Search COMP19: Artificial Intelligence COMP19: Artificial Intelligence Dr. Annabel Latham Room.05 Ashton Building Department of Computer Science University of Liverpool Lecture 1: Game Playing 1 Overview Last

More information

Implementation of Upper Confidence Bounds for Trees (UCT) on Gomoku

Implementation of Upper Confidence Bounds for Trees (UCT) on Gomoku Implementation of Upper Confidence Bounds for Trees (UCT) on Gomoku Guanlin Zhou (gz2250), Nan Yu (ny2263), Yanqing Dai (yd2369), Yingtao Zhong (yz3276) 1. Introduction: Reinforcement Learning for Gomoku

More information

5.4 Imperfect, Real-Time Decisions

5.4 Imperfect, Real-Time Decisions 5.4 Imperfect, Real-Time Decisions Searching through the whole (pruned) game tree is too inefficient for any realistic game Moves must be made in a reasonable amount of time One has to cut off the generation

More information

Learning from Hints: AI for Playing Threes

Learning from Hints: AI for Playing Threes Learning from Hints: AI for Playing Threes Hao Sheng (haosheng), Chen Guo (cguo2) December 17, 2016 1 Introduction The highly addictive stochastic puzzle game Threes by Sirvo LLC. is Apple Game of the

More information

Enhancements for Monte-Carlo Tree Search in Ms Pac-Man

Enhancements for Monte-Carlo Tree Search in Ms Pac-Man Enhancements for Monte-Carlo Tree Search in Ms Pac-Man Tom Pepels Mark H.M. Winands Abstract In this paper enhancements for the Monte-Carlo Tree Search (MCTS) framework are investigated to play Ms Pac-Man.

More information

Lecture 14. Questions? Friday, February 10 CS 430 Artificial Intelligence - Lecture 14 1

Lecture 14. Questions? Friday, February 10 CS 430 Artificial Intelligence - Lecture 14 1 Lecture 14 Questions? Friday, February 10 CS 430 Artificial Intelligence - Lecture 14 1 Outline Chapter 5 - Adversarial Search Alpha-Beta Pruning Imperfect Real-Time Decisions Stochastic Games Friday,

More information

Creating a Havannah Playing Agent

Creating a Havannah Playing Agent Creating a Havannah Playing Agent B. Joosten August 27, 2009 Abstract This paper delves into the complexities of Havannah, which is a 2-person zero-sum perfectinformation board game. After determining

More information

Ar#ficial)Intelligence!!

Ar#ficial)Intelligence!! Introduc*on! Ar#ficial)Intelligence!! Roman Barták Department of Theoretical Computer Science and Mathematical Logic So far we assumed a single-agent environment, but what if there are more agents and

More information

Lower Bounding Klondike Solitaire with Monte-Carlo Planning

Lower Bounding Klondike Solitaire with Monte-Carlo Planning Lower Bounding Klondike Solitaire with Monte-Carlo Planning Ronald Bjarnason and Alan Fern and Prasad Tadepalli {ronny, afern, tadepall}@eecs.oregonstate.edu Oregon State University Corvallis, OR, USA

More information

Playing Othello Using Monte Carlo

Playing Othello Using Monte Carlo June 22, 2007 Abstract This paper deals with the construction of an AI player to play the game Othello. A lot of techniques are already known to let AI players play the game Othello. Some of these techniques

More information

Information capture and reuse strategies in Monte Carlo Tree Search, with applications to games of hidden information

Information capture and reuse strategies in Monte Carlo Tree Search, with applications to games of hidden information Information capture and reuse strategies in Monte Carlo Tree Search, with applications to games of hidden information Edward J. Powley, Peter I. Cowling, Daniel Whitehouse Department of Computer Science,

More information

Artificial Intelligence Search III

Artificial Intelligence Search III Artificial Intelligence Search III Lecture 5 Content: Search III Quick Review on Lecture 4 Why Study Games? Game Playing as Search Special Characteristics of Game Playing Search Ingredients of 2-Person

More information

Creating a Poker Playing Program Using Evolutionary Computation

Creating a Poker Playing Program Using Evolutionary Computation Creating a Poker Playing Program Using Evolutionary Computation Simon Olsen and Rob LeGrand, Ph.D. Abstract Artificial intelligence is a rapidly expanding technology. We are surrounded by technology that

More information

46.1 Introduction. Foundations of Artificial Intelligence Introduction MCTS in AlphaGo Neural Networks. 46.

46.1 Introduction. Foundations of Artificial Intelligence Introduction MCTS in AlphaGo Neural Networks. 46. Foundations of Artificial Intelligence May 30, 2016 46. AlphaGo and Outlook Foundations of Artificial Intelligence 46. AlphaGo and Outlook Thomas Keller Universität Basel May 30, 2016 46.1 Introduction

More information

Towards Strategic Kriegspiel Play with Opponent Modeling

Towards Strategic Kriegspiel Play with Opponent Modeling Towards Strategic Kriegspiel Play with Opponent Modeling Antonio Del Giudice and Piotr Gmytrasiewicz Department of Computer Science, University of Illinois at Chicago Chicago, IL, 60607-7053, USA E-mail:

More information

This is a repository copy of Ensemble Determinization in Monte Carlo Tree Search for the Imperfect Information Card Game Magic: The Gathering.

This is a repository copy of Ensemble Determinization in Monte Carlo Tree Search for the Imperfect Information Card Game Magic: The Gathering. This is a repository copy of Ensemble Determinization in Monte Carlo Tree Search for the Imperfect Information Card Game Magic: The Gathering. White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/75050/

More information

AN ABSTRACT OF THE THESIS OF

AN ABSTRACT OF THE THESIS OF AN ABSTRACT OF THE THESIS OF Paul Lewis for the degree of Master of Science in Computer Science presented on June 1, 2010. Title: Ensemble Monte-Carlo Planning: An Empirical Study Abstract approved: Alan

More information

CS221 Project Final Report Gomoku Game Agent

CS221 Project Final Report Gomoku Game Agent CS221 Project Final Report Gomoku Game Agent Qiao Tan qtan@stanford.edu Xiaoti Hu xiaotihu@stanford.edu 1 Introduction Gomoku, also know as five-in-a-row, is a strategy board game which is traditionally

More information

CS221 Final Project Report Learn to Play Texas hold em

CS221 Final Project Report Learn to Play Texas hold em CS221 Final Project Report Learn to Play Texas hold em Yixin Tang(yixint), Ruoyu Wang(rwang28), Chang Yue(changyue) 1 Introduction Texas hold em, one of the most popular poker games in casinos, is a variation

More information

ARTIFICIAL INTELLIGENCE (CS 370D)

ARTIFICIAL INTELLIGENCE (CS 370D) Princess Nora University Faculty of Computer & Information Systems ARTIFICIAL INTELLIGENCE (CS 370D) (CHAPTER-5) ADVERSARIAL SEARCH ADVERSARIAL SEARCH Optimal decisions Min algorithm α-β pruning Imperfect,

More information

2 person perfect information

2 person perfect information Why Study Games? Games offer: Intellectual Engagement Abstraction Representability Performance Measure Not all games are suitable for AI research. We will restrict ourselves to 2 person perfect information

More information

Adversarial Search Lecture 7

Adversarial Search Lecture 7 Lecture 7 How can we use search to plan ahead when other agents are planning against us? 1 Agenda Games: context, history Searching via Minimax Scaling α β pruning Depth-limiting Evaluation functions Handling

More information

Game Playing for a Variant of Mancala Board Game (Pallanguzhi)

Game Playing for a Variant of Mancala Board Game (Pallanguzhi) Game Playing for a Variant of Mancala Board Game (Pallanguzhi) Varsha Sankar (SUNet ID: svarsha) 1. INTRODUCTION Game playing is a very interesting area in the field of Artificial Intelligence presently.

More information

Optimal Rhode Island Hold em Poker

Optimal Rhode Island Hold em Poker Optimal Rhode Island Hold em Poker Andrew Gilpin and Tuomas Sandholm Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213 {gilpin,sandholm}@cs.cmu.edu Abstract Rhode Island Hold

More information

Game Playing AI Class 8 Ch , 5.4.1, 5.5

Game Playing AI Class 8 Ch , 5.4.1, 5.5 Game Playing AI Class Ch. 5.-5., 5.4., 5.5 Bookkeeping HW Due 0/, :59pm Remaining CSP questions? Cynthia Matuszek CMSC 6 Based on slides by Marie desjardin, Francisco Iacobelli Today s Class Clear criteria

More information

6. Games. COMP9414/ 9814/ 3411: Artificial Intelligence. Outline. Mechanical Turk. Origins. origins. motivation. minimax search

6. Games. COMP9414/ 9814/ 3411: Artificial Intelligence. Outline. Mechanical Turk. Origins. origins. motivation. minimax search COMP9414/9814/3411 16s1 Games 1 COMP9414/ 9814/ 3411: Artificial Intelligence 6. Games Outline origins motivation Russell & Norvig, Chapter 5. minimax search resource limits and heuristic evaluation α-β

More information

Today. Types of Game. Games and Search 1/18/2010. COMP210: Artificial Intelligence. Lecture 10. Game playing

Today. Types of Game. Games and Search 1/18/2010. COMP210: Artificial Intelligence. Lecture 10. Game playing COMP10: Artificial Intelligence Lecture 10. Game playing Trevor Bench-Capon Room 15, Ashton Building Today We will look at how search can be applied to playing games Types of Games Perfect play minimax

More information

AI Approaches to Ultimate Tic-Tac-Toe

AI Approaches to Ultimate Tic-Tac-Toe AI Approaches to Ultimate Tic-Tac-Toe Eytan Lifshitz CS Department Hebrew University of Jerusalem, Israel David Tsurel CS Department Hebrew University of Jerusalem, Israel I. INTRODUCTION This report is

More information

Creating a New Angry Birds Competition Track

Creating a New Angry Birds Competition Track Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference Creating a New Angry Birds Competition Track Rohan Verma, Xiaoyu Ge, Jochen Renz Research School

More information

Enhancements for Monte-Carlo Tree Search in Ms Pac-Man

Enhancements for Monte-Carlo Tree Search in Ms Pac-Man Enhancements for Monte-Carlo Tree Search in Ms Pac-Man Tom Pepels June 19, 2012 Abstract In this paper enhancements for the Monte-Carlo Tree Search (MCTS) framework are investigated to play Ms Pac-Man.

More information

Drafting Territories in the Board Game Risk

Drafting Territories in the Board Game Risk Drafting Territories in the Board Game Risk Presenter: Richard Gibson Joint Work With: Neesha Desai and Richard Zhao AIIDE 2010 October 12, 2010 Outline Risk Drafting territories How to draft territories

More information

Available online at ScienceDirect. Procedia Computer Science 62 (2015 ) 31 38

Available online at  ScienceDirect. Procedia Computer Science 62 (2015 ) 31 38 Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 62 (2015 ) 31 38 The 2015 International Conference on Soft Computing and Software Engineering (SCSE 2015) Analysis of a

More information

Foundations of AI. 6. Adversarial Search. Search Strategies for Games, Games with Chance, State of the Art. Wolfram Burgard & Bernhard Nebel

Foundations of AI. 6. Adversarial Search. Search Strategies for Games, Games with Chance, State of the Art. Wolfram Burgard & Bernhard Nebel Foundations of AI 6. Adversarial Search Search Strategies for Games, Games with Chance, State of the Art Wolfram Burgard & Bernhard Nebel Contents Game Theory Board Games Minimax Search Alpha-Beta Search

More information

Texas Hold em Inference Bot Proposal. By: Brian Mihok & Michael Terry Date Due: Monday, April 11, 2005

Texas Hold em Inference Bot Proposal. By: Brian Mihok & Michael Terry Date Due: Monday, April 11, 2005 Texas Hold em Inference Bot Proposal By: Brian Mihok & Michael Terry Date Due: Monday, April 11, 2005 1 Introduction One of the key goals in Artificial Intelligence is to create cognitive systems that

More information

Automatic Game AI Design by the Use of UCT for Dead-End

Automatic Game AI Design by the Use of UCT for Dead-End Automatic Game AI Design by the Use of UCT for Dead-End Zhiyuan Shi, Yamin Wang, Suou He*, Junping Wang*, Jie Dong, Yuanwei Liu, Teng Jiang International School, School of Software Engineering* Beiing

More information

CSC384: Introduction to Artificial Intelligence. Game Tree Search

CSC384: Introduction to Artificial Intelligence. Game Tree Search CSC384: Introduction to Artificial Intelligence Game Tree Search Chapter 5.1, 5.2, 5.3, 5.6 cover some of the material we cover here. Section 5.6 has an interesting overview of State-of-the-Art game playing

More information

CS440/ECE448 Lecture 11: Stochastic Games, Stochastic Search, and Learned Evaluation Functions

CS440/ECE448 Lecture 11: Stochastic Games, Stochastic Search, and Learned Evaluation Functions CS440/ECE448 Lecture 11: Stochastic Games, Stochastic Search, and Learned Evaluation Functions Slides by Svetlana Lazebnik, 9/2016 Modified by Mark Hasegawa Johnson, 9/2017 Types of game environments Perfect

More information

CSE 573: Artificial Intelligence Autumn 2010

CSE 573: Artificial Intelligence Autumn 2010 CSE 573: Artificial Intelligence Autumn 2010 Lecture 4: Adversarial Search 10/12/2009 Luke Zettlemoyer Based on slides from Dan Klein Many slides over the course adapted from either Stuart Russell or Andrew

More information

AlphaGo and Artificial Intelligence GUEST LECTURE IN THE GAME OF GO AND SOCIETY

AlphaGo and Artificial Intelligence GUEST LECTURE IN THE GAME OF GO AND SOCIETY AlphaGo and Artificial Intelligence HUCK BENNET T (NORTHWESTERN UNIVERSITY) GUEST LECTURE IN THE GAME OF GO AND SOCIETY AT OCCIDENTAL COLLEGE, 10/29/2018 The Game of Go A game for aliens, presidents, and

More information

Emergent bluffing and inference with Monte Carlo Tree Search

Emergent bluffing and inference with Monte Carlo Tree Search Emergent bluffing and inference with Monte Carlo Tree Search Peter I. Cowling Department of Computer Science York Centre for Complex Systems Analysis University of York, UK Email: peter.cowling@york.ac.uk

More information

Last-Branch and Speculative Pruning Algorithms for Max"

Last-Branch and Speculative Pruning Algorithms for Max Last-Branch and Speculative Pruning Algorithms for Max" Nathan Sturtevant UCLA, Computer Science Department Los Angeles, CA 90024 nathanst@cs.ucla.edu Abstract Previous work in pruning algorithms for max"

More information

CS 229 Final Project: Using Reinforcement Learning to Play Othello

CS 229 Final Project: Using Reinforcement Learning to Play Othello CS 229 Final Project: Using Reinforcement Learning to Play Othello Kevin Fry Frank Zheng Xianming Li ID: kfry ID: fzheng ID: xmli 16 December 2016 Abstract We built an AI that learned to play Othello.

More information

Creating a Dominion AI Using Genetic Algorithms

Creating a Dominion AI Using Genetic Algorithms Creating a Dominion AI Using Genetic Algorithms Abstract Mok Ming Foong Dominion is a deck-building card game. It allows for complex strategies, has an aspect of randomness in card drawing, and no obvious

More information

CS 188: Artificial Intelligence

CS 188: Artificial Intelligence CS 188: Artificial Intelligence Adversarial Search Instructor: Stuart Russell University of California, Berkeley Game Playing State-of-the-Art Checkers: 1950: First computer player. 1959: Samuel s self-taught

More information

Playout Search for Monte-Carlo Tree Search in Multi-Player Games

Playout Search for Monte-Carlo Tree Search in Multi-Player Games Playout Search for Monte-Carlo Tree Search in Multi-Player Games J. (Pim) A.M. Nijssen and Mark H.M. Winands Games and AI Group, Department of Knowledge Engineering, Faculty of Humanities and Sciences,

More information

4. Games and search. Lecture Artificial Intelligence (4ov / 8op)

4. Games and search. Lecture Artificial Intelligence (4ov / 8op) 4. Games and search 4.1 Search problems State space search find a (shortest) path from the initial state to the goal state. Constraint satisfaction find a value assignment to a set of variables so that

More information

Analysis and Implementation of the Game OnTop

Analysis and Implementation of the Game OnTop Analysis and Implementation of the Game OnTop Master Thesis DKE 09-25 Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science of Artificial Intelligence at the Department

More information

CSC321 Lecture 23: Go

CSC321 Lecture 23: Go CSC321 Lecture 23: Go Roger Grosse Roger Grosse CSC321 Lecture 23: Go 1 / 21 Final Exam Friday, April 20, 9am-noon Last names A Y: Clara Benson Building (BN) 2N Last names Z: Clara Benson Building (BN)

More information

Programming Project 1: Pacman (Due )

Programming Project 1: Pacman (Due ) Programming Project 1: Pacman (Due 8.2.18) Registration to the exams 521495A: Artificial Intelligence Adversarial Search (Min-Max) Lectured by Abdenour Hadid Adjunct Professor, CMVS, University of Oulu

More information

CS 2710 Foundations of AI. Lecture 9. Adversarial search. CS 2710 Foundations of AI. Game search

CS 2710 Foundations of AI. Lecture 9. Adversarial search. CS 2710 Foundations of AI. Game search CS 2710 Foundations of AI Lecture 9 Adversarial search Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square CS 2710 Foundations of AI Game search Game-playing programs developed by AI researchers since

More information
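The introduction describes Monte-Carlo Tree Search as a framework in which randomized explorations of the search space are used to estimate the most promising game actions. A minimal sketch of the four canonical MCTS phases (selection with UCB1, expansion, random simulation, and backpropagation) is shown below on the toy game of Nim, where players alternately remove 1-3 stones and the player taking the last stone wins. This is an illustrative sketch, not the implementation used in the paper; all class and function names are chosen for this example.

```python
# Minimal MCTS/UCT sketch on Nim (take 1-3 stones; taking the last stone wins).
# Illustrative only -- not the paper's implementation.
import math
import random

class Node:
    def __init__(self, stones, player, parent=None, move=None):
        self.stones = stones    # stones remaining (the game state)
        self.player = player    # player to move: +1 or -1
        self.parent = parent
        self.move = move        # move that led to this node
        self.children = []
        self.visits = 0
        self.wins = 0.0         # wins counted from the parent's perspective
        self.untried = [m for m in (1, 2, 3) if m <= stones]

def uct_select(node, c=1.4):
    # Selection: pick the child maximizing the UCB1 value.
    return max(node.children,
               key=lambda ch: ch.wins / ch.visits
               + c * math.sqrt(math.log(node.visits) / ch.visits))

def rollout(stones, player):
    # Simulation: play uniformly random moves to the end; return the winner.
    while stones > 0:
        stones -= random.choice([m for m in (1, 2, 3) if m <= stones])
        player = -player
    return -player  # the player who just moved took the last stone

def mcts(root_stones, root_player, iterations=2000):
    root = Node(root_stones, root_player)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while fully expanded and non-terminal.
        while not node.untried and node.children:
            node = uct_select(node)
        # 2. Expansion: add one child for an untried move.
        if node.untried:
            m = node.untried.pop()
            node = Node(node.stones - m, -node.player, parent=node, move=m)
            node.parent.children.append(node)
        # 3. Simulation: random playout from the new state
        #    (a terminal state returns its exact winner).
        winner = rollout(node.stones, node.player)
        # 4. Backpropagation: credit each node from its parent's perspective.
        while node is not None:
            node.visits += 1
            if node.parent is not None and winner == node.parent.player:
                node.wins += 1
            node = node.parent
    # Recommend the most-visited move, the usual robust-child rule.
    return max(root.children, key=lambda ch: ch.visits).move
```

With 5 stones the winning move is to take 1 (leaving a multiple of 4 for the opponent), and the sketch converges to it quickly because the game tree is tiny.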