Implementation of Upper Confidence Bounds for Trees (UCT) on Gomoku


Guanlin Zhou (gz2250), Nan Yu (ny2263), Yanqing Dai (yd2369), Yingtao Zhong (yz3276)

1. Introduction: Reinforcement Learning for the Gomoku Game

Gomoku, or Five in a Row, is a two-player strategy board game played with black and white stones. Players alternately place stones on the board, and the first player to get five stones of the same color in a line wins the game [13]. Gomoku has several rule variants, such as free-style Gomoku and Renju. Although it has been proved that the first player has a winning strategy in both free-style Gomoku [1] and Renju [8], play from complex board states still needs further study.

Many algorithms for playing Gomoku have been developed over several decades. Traditional approaches use tree search with alpha-beta pruning [4, 7]. In 2006, Coulom proposed a novel approach named Monte Carlo Tree Search (MCTS), which combines tree search with Monte-Carlo evaluation [3]. MCTS is a heuristic search algorithm that uses the simulation results from child nodes as the evaluation function for the tree search. One of its greatest advantages is that it does not require domain-specific knowledge, so the model is easy to apply to other domains. However, while expanding the tree it is hard to balance the exploitation of deep variants after moves with a high average win rate against the exploration of moves with few simulations. In 2006, Kocsis and Szepesvári proposed the UCT algorithm to address exactly this problem [5, 6]. UCT is based on UCB1, one of the standard algorithms for the multi-armed bandit problem. In our project we implement the UCT algorithm for Gomoku, and our revised version can beat amateur human players with a good level of confidence.

2. Preliminaries: The UCT Algorithm

We used the Upper Confidence Bounds for Trees (UCT) algorithm from the paper "A Survey of Monte Carlo Tree Search Methods" [2] as the base model of our implementation. UCT is one of the most popular algorithms in the Monte Carlo Tree Search family. It runs from the current state and outputs a suggested best action given a certain computational allowance. Each step of the UCT algorithm consists of four parts: 1. picking a suitable move; 2. simulating the following moves until termination; 3. recording the rewards; 4. selecting the best-rewarded move (after repeating 1 through 3 for a certain number of rounds). The full algorithm goes as follows, with the notation explained in the context of a Gomoku game.

s_0: the current configuration of the board, as an 11 x 11 double array;
v_0: a node representing the current state s_0, with extra information such as parent, children, and rewards;
v_l: a node in the next layer, a child of v_0 reached by one extra move from v_0;
nonterminal v: the state associated with v is not a winning, losing, or tie state;
fully expanded v: every possible move in the state associated with v has been explored;
C_p: set to 1/√2 to satisfy the Hoeffding inequality with rewards in the range [0, 1] [3][8];
action: a possible move, not overlapping with the pieces currently on the board;
f(s(v), a): generates a new state (board configuration) from a move and the current board configuration;
Q(v): the reward currently accumulated at v;
N(v): the number of simulations (games) v has been involved in so far;
Δ: the reward of a game: +1 for winning, -1 for losing, and 0 for a tie.
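To make the selection step concrete, the following minimal Python sketch (not our exact source code; the node fields q, n, and children are assumed names standing in for Q(v), N(v), and the child list) picks the child that maximizes the UCB1 bound used by UCT:

    import math

    CP = 1 / math.sqrt(2)  # the constant C_p from the notation above

    def best_child(node, cp=CP):
        """Return the child maximizing Q(v')/N(v') + cp * sqrt(2 ln N(v) / N(v'))."""
        def ucb(child):
            exploit = child.q / child.n                               # average reward
            explore = cp * math.sqrt(2 * math.log(node.n) / child.n)  # exploration bonus
            return exploit + explore
        return max(node.children, key=ucb)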

On top of the UCT algorithm we made many changes to form our actual implementation, which we explain in detail in the following sections.

3. Implementation and Algorithm Improvements

3.1 Environment and Required Packages

We implemented our Gomoku game environment and all related parts in Python. The required packages are pygame, which provides a simple and elegant game UI environment, and numpy. Before testing our code, please make sure both packages are installed successfully.

3.2 Basic Structures

Our project lives in the folder rl-gomoku. All game-related images are under the subfolder rl-gomoku/imgs, and our source code is under rl-gomoku/src, which contains five Python files:

main.py: the main entry point of our Gomoku game environment; its behavior is controlled by command-line parameters (see the example invocations after this list). There are two game modes. (1) Human vs. AI: run python main.py n t, where n is the total number of games and t is the computational budget of the MCTS AI, a floating-point number giving the seconds allowed per step. (2) AI vs. AI: run python main.py n t1 t2, where t1 and t2 are the computational budgets of the two AI players.

Fig. 3.1 Gomoku game environment

gomuku.py: implements the game environment. It defines the class gomuku_game, which stores all current game data, such as the current turn and the current board state, and contains the functions to display the game UI, update the game state, receive I/O input, and so on.

simple_ai.py: provides a basic AI agent for Gomoku. It defines some naive playing rules and supplies prior information to our more advanced game AI, which is implemented using UCT and MCTS.

mcts_ai.py and uct_tree.py: implement the UCT Monte Carlo Tree Search algorithm. These two files are the core of our project; we discuss their details in the next section.
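For example, with the layout above, a ten-game human-vs-AI session with a 2.0-second budget per AI move would be launched as follows (the game count and budget values here are illustrative, not prescribed):

    cd rl-gomoku/src
    python main.py 10 2.0        # human vs. AI: 10 games, 2.0 s per AI step
    python main.py 10 1.0 2.0    # AI vs. AI: 10 games, budgets of 1.0 s and 2.0 s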

3.3 Simple AI for Gomoku

With MCTS and UCT alone, finding the next move means exhaustively searching the whole board for the position the algorithm thinks is best. This process is very time-consuming and often ineffective, because the algorithm does not know how to play Gomoku: each position it picks is only slightly better than a random choice.

In our very early versions we implemented the Gomoku AI using the original UCT MCTS algorithm as illustrated in the previous section. However, we found that unless the computational budget of each step is very large (20 minutes or more), the purely random algorithm does not return good results, because the number of possibilities grows rapidly during tree search; for example, exploring the next ten moves in the early game already yields an astronomically large number of possibilities. So, apart from implementing MCTS and UCT, we also employed some predefined, knowledge-based, simple but effective AI rules for choosing the next move.

The key to this AI is the predefined rules, which divide into two main parts: rules for attack and rules for defence. We defined 7 rules in total. Specifically, if an empty position:

1. Can contribute to at least five consecutive stones, then attack on that position.
2. Is next to four consecutive stones, then attack on that position.
3. Is next to three consecutive stones with neither end blocked, then attack on that position.
4. Can contribute to at least five consecutive adversary stones, then defend on that position.
5. Is next to four consecutive adversary stones, then defend on that position.
6. Is next to three consecutive adversary stones with neither end blocked, and there are fewer than three consecutive stones on the player's own side, then defend on that position.
7. If none of the above cases applies, attack on the best position recorded (this is used only for generating the default policy).

To enforce these rules, we scan the whole board and check every case in all four directions: whenever we see an empty cell, we check whether there are at least four consecutive stones next to it in the horizontal, vertical, left-diagonal, and right-diagonal directions, and similar scans cover the other cases.
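A minimal sketch of one such scan, the rule-1 check (the 0 = empty board encoding and the helper names are assumptions for illustration, not our exact code):

    SIZE = 11
    DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]  # horizontal, vertical, two diagonals

    def run_length(board, r, c, dr, dc, color):
        """Count consecutive stones of `color` starting one step from (r, c) along (dr, dc)."""
        count = 0
        r, c = r + dr, c + dc
        while 0 <= r < SIZE and 0 <= c < SIZE and board[r][c] == color:
            count += 1
            r, c = r + dr, c + dc
        return count

    def completes_five(board, r, c, color):
        """Rule 1 (and rule 4, applied to the adversary's color): would playing
        `color` at the empty cell (r, c) produce at least five in a row?"""
        if board[r][c] != 0:  # 0 = empty in this assumed encoding
            return False
        return any(run_length(board, r, c, dr, dc, color)
                   + run_length(board, r, c, -dr, -dc, color) >= 4
                   for dr, dc in DIRECTIONS)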

3.4 UCT Algorithm for Gomoku

The basic UCT algorithm was illustrated in the previous sections; here we discuss how we implemented it for the Gomoku game and the improvements we made to the original algorithm. The first part to introduce is the basic data structure of the tree node and our method for finding the node for a given game state.

The tree node class is defined in uct_tree.py. Like an ordinary search-tree node, each node stores its tree-structure information (parents, children), the number of times it has been visited, its total reward, the current turn, and whether it is a terminal node. One thing to note is that in a Gomoku game a single state can have several different parents. As an example, let the locations of three black pieces be B1 (4, 5), B2 (5, 4), B3 (8, 6) and those of three white pieces be W1 (6, 4), W2 (6, 6), W3 (7, 6). The sequences B1 W1 B2 W2 B3 W3, B2 W1 B1 W2 B3 W3, B1 W1 B3 W2 B2 W3, and all other possible orderings lead to the same state; yet no matter how the previous moves were ordered, given the same current state the best next move is always the same.

Fig. 3.2 An example game state

Thus in our tree structure each node can have multiple children and multiple parents. This feature is actually helpful for our tree-search algorithm: when the search reaches a node that has already been detected and developed, we can keep using the information of that node rather than doing the same work again. Previous search results thereby accumulate more information for the next step, and the overall search efficiency increases.

Rather than saving each game state inside the tree-node class, we created a multi-level hash dictionary that records the mapping between game states and nodes. The idea is inspired by the way Linux manages its huge virtual memory space (Linux uses a four-level page table to map all frame numbers onto real physical addresses). Our state-to-node dictionary has three levels, and each level uses a different band of rows of the state for indexing (rows 4-6 for the first level, rows 0-3 for the second level, and rows 7-10 for the last level). We use the middle rows for the first level because in most games players tend to place their pieces in the middle first, so indexing on these rows first distributes the data more evenly across the dictionary. A function in mcts_ai.py named state2node(game_state) uses this layered dictionary to determine whether a state has been seen before: for an existing node it simply returns that node, and for an unexplored state it creates a new entry in the dictionary and returns the new node.
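A minimal sketch of this layered lookup (the Node fields and the exact three-level split are illustrative; the real uct_tree.py node carries more bookkeeping):

    class Node:
        def __init__(self, state):
            self.state = state
            self.parents, self.children = [], {}
            self.n, self.q = 0, 0.0          # visit count N(v) and accumulated reward Q(v)
            self.terminal = False

    state_index = {}  # level-1 dictionary, keyed on the middle rows

    def state2node(state):
        """Return the existing node for `state`, creating and registering one if needed."""
        k1 = tuple(map(tuple, state[4:7]))   # level 1: middle rows, most occupied early on
        k2 = tuple(map(tuple, state[0:4]))   # level 2: top rows
        k3 = tuple(map(tuple, state[7:]))    # level 3: bottom rows
        level2 = state_index.setdefault(k1, {})
        level3 = level2.setdefault(k2, {})
        if k3 not in level3:
            level3[k3] = Node(state)
        return level3[k3]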

Having introduced how we manage the search-tree data structure, we next discuss how we adapted the UCT tree-search algorithm to the Gomoku game and our efforts to make the original algorithm more efficient. Most of our adaptations and improvements focus on making the original tree policy and default policy fit Gomoku better and generate better decisions within a limited time budget.

3.4.1 Tree Policy with Advice

In the original version of UCT illustrated by Browne et al. [2], the tree policy tries to explore all possible actions (moves, in a board game) when a node is not fully expanded, and its expansion step simply picks the next node uniformly at random among the nodes that have not been explored. This may work fine with a narrow range of next actions and a huge time budget, but that is not the case in our Gomoku game. To make our tree policy more efficient, we introduced a coach system and a piece-influence map.

First, the coach system: when there are moves that guarantee a win if the player takes them, or a loss if the player does not, the search should focus on those moves. Second, when the coach offers no suggestions, the moves near the locations of the current pieces should be preferred, since in Gomoku a piece placed far away from all other pieces makes very little short-term contribution to the game. To capture this, we give every black and white piece the same initial weight of 1.0, so the whole board forms an 11 x 11 matrix, and we convolve this matrix with a 5 x 5 Gaussian-like kernel. Setting the values of already-occupied locations to zero then gives the probability distribution over the next expansion choice. The two images below show the exploration probability distribution for the next move: dark blue means the probability of exploring that move is nearly zero, and the yellow blocks mean the probability of exploring that location is higher.

Fig. 3.3 From game state to explore probability
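A minimal numpy sketch of this influence map, assuming a 0 = empty board encoding and a hand-rolled convolution (the kernel width and sigma are example values, not our tuned constants):

    import numpy as np

    def gaussian_kernel(size=5, sigma=1.0):
        """Build a normalized 5 x 5 Gaussian-like kernel."""
        ax = np.arange(size) - size // 2
        xx, yy = np.meshgrid(ax, ax)
        k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
        return k / k.sum()

    def explore_prob_map(board):
        """Turn an 11 x 11 board (0 = empty, nonzero = occupied) into an exploration distribution."""
        weight = (np.asarray(board) != 0).astype(float)  # every piece gets weight 1.0
        kernel = gaussian_kernel()
        pad = kernel.shape[0] // 2
        padded = np.pad(weight, pad)
        influence = np.zeros_like(weight)
        for r in range(weight.shape[0]):                 # plain 2-D convolution over the board
            for c in range(weight.shape[1]):
                window = padded[r:r + kernel.shape[0], c:c + kernel.shape[1]]
                influence[r, c] = (window * kernel).sum()
        influence[weight > 0] = 0.0                      # occupied cells can never be expanded
        total = influence.sum()
        return influence / total if total > 0 else influence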

In our final version of the tree policy we kept both of these ideas. Since the suggestions generated by SimpleAI are usually smarter than moves randomly picked from the probability-distribution map, we give SimpleAI higher priority. Our pseudocode is shown in Fig. 3.4, where \ denotes set difference, RandMove(A) returns a uniformly random pick from A, and RandMoveProbMap(A, M) returns a random pick from A with probabilities given by the distribution map M. One more thing worth mentioning: some of the moves on the probability map M may already have been explored, and we are not interested in expanding them again, so we set the probability of those explored moves to zero and renormalize M before picking the next move.

function TreePolicy(v)
    while v is nonterminal do
        suggestions <- suggestions generated by SimpleAI
        if suggestions is not empty do
            unexplored <- suggestions \ v.children
            if unexplored is not empty do
                v' <- RandMove(unexplored)
                return ExpandNode(v, v')
        else
            prob_map <- GenerateProbMap(v.state)
            possible_moves <- moves with non-zero probability on prob_map
            unexplored <- possible_moves \ v.children
            if unexplored is not empty do
                v' <- RandMoveProbMap(unexplored, prob_map)
                return ExpandNode(v, v')
        v <- BestChild(v, Cp)
    return v

function ExpandNode(v, v')
    add the new child v' to v
    return v'

Fig. 3.4 Tree Policy Pseudocode

3.4.2 Smarter Default Policy

In the original algorithm, the default policy simply takes random actions from the possible action set. This is not a good fit for our game, because the state space is huge while the space of terminal states is relatively small. We therefore introduced our Gomoku SimpleAI, which generates moves smarter than randomly picked ones. This yields better results because in a real Gomoku game both players are trying to win; the moves generated by SimpleAI are more player-like, so the estimated reward values are more accurate.

This method has a flaw, however. In a Gomoku game there are multiple ways to attack: some are defendable while others are not. Fig. 3.4.a shows a situation that is easy to defend if White places a piece at (2, 6) or (6, 2), while in Fig. 3.4.b White will lose no matter what it does if Black keeps attacking 3 -> 4 and then 4 -> 5.

Fig. 3.4.a Defendable attack (Black)    Fig. 3.4.b Undefendable attack (Black)

Although the second attack is much stronger, a defendable attacking move can still be a good move, since the opponent may fail to detect it and fall into the trap. But SimpleAI is smart enough to detect and defend every defendable attack, so under the default policy the reward of a defendable attack drops to zero. A smarter opponent makes our AI less aggressive and less willing to attack, which is not what we want. Our final method therefore makes it possible for both sides to make mistakes, and it is very easy to implement: for each move of the default policy, we give both sides some probability of making a random choice instead of using SimpleAI. After this adjustment we found that the reward of defendable attacking moves in the search tree becomes higher; since it is still lower than that of undefendable attacking moves, the undefendable attacks keep higher priority and are picked first.
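A minimal sketch of this mistake-prone playout (simple_ai_move, random_move, the state API, and the 0.2 mistake probability are illustrative assumptions, not our exact code or tuned constant):

    import random

    MISTAKE_PROB = 0.2  # illustrative value: chance a side ignores SimpleAI and plays randomly

    def default_policy(state):
        """Play a game out from `state`, letting both sides occasionally blunder."""
        while not state.is_terminal():
            if random.random() < MISTAKE_PROB:
                move = random_move(state)     # hypothetical helper: uniform pick among legal moves
            else:
                move = simple_ai_move(state)  # hypothetical helper: rule-based SimpleAI move
            state = state.apply(move)
        return state.reward()                 # +1 win, -1 loss, 0 tie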

3.4.3 Backward Propagation with Decay

One more difference between our algorithm and the original one is that we added decay to the backward propagation (the BACKUP function). Every time the default policy reaches a terminal state, the reward is propagated back to all of the node's ancestors; however, instead of passing the reward value on directly, we multiply it by a constant decay factor (< 1.0; we chose 0.85) at each level. We added this decay because in our game a short-term reward is always more promising than a long-term one. Suppose you are a player who really wants to win and you are offered two moves: the first has an 80% probability of winning within three rounds, and the other has an 80% probability of winning within ten rounds. A rational player usually chooses the first. Moreover, in an adversarial game the long-term future is hard to predict with high accuracy. Hence this rule.
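A minimal sketch of this decayed backup, written for a single-parent chain for clarity (our actual tree allows multiple parents, and per-player sign handling of the reward is omitted here):

    DECAY = 0.85  # the constant decay factor we chose

    def backup(node, reward):
        """BACKUP with decay: propagate a playout reward to all ancestors, shrinking it per level."""
        while node is not None:
            node.n += 1          # one more simulation passed through this node
            node.q += reward     # accumulate the (already decayed) reward
            reward *= DECAY      # shrink the reward before passing it one level up
            node = node.parent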

4. Test Results

To demonstrate the effectiveness of our algorithm we ran the following experiments: 1. AI against another AI with different computational limits; 2. AI against different human players.

4.1 Experiment between Two Gomoku AIs

In this experiment we let two Gomoku AIs, player1 and player2, play against each other. We ran five groups of experiments with five different time bases (0.2 s, 0.4 s, 0.8 s, 1.6 s, and 3.2 s), where player1 uses the time base as its computational budget and player2 uses double player1's budget. In each group the players take turns using the black and white pieces; for example, the first group consists of 400 games in total, with player1 playing White for 200 games and Black for 200 games, to ensure the move order does not influence the result (under Gomoku rules, the black side always moves first). Our results are shown in Table 4.1 and Figs. 4.1 and 4.2.

Table 4.1 Winning rates of alternating play under different computational limits
(column headers: n games | player1 budget (ms) | player2 budget (ms) | player1 win rate (%) | player2 win rate (%) | draw rate (%); the numeric entries were not recoverable)

Fig. 4.1 Computational limit vs. player win rate
Fig. 4.2 Computational limit vs. draw rate

One clear observation from Table 4.1 and Figs. 4.1 and 4.2 is that the draw rate drops dramatically as the base run time increases. With a larger computational budget, the tree search has more time to explore deeper nodes, which represent moves further into the game's future. The algorithm therefore finds smarter moves, and these moves create traps that are harder for the opponent to detect and defend against. As a result the game tends to end in fewer steps, and there are fewer draws (a draw being the case where every empty location is occupied and no one has won).

The second finding is that player2 always has a higher win rate than player1. This matches our expectation, since with a larger computational budget the online UCT algorithm has more time to reach more precise results. However, as the base time increases, the difference between the win rates actually decreases. Two reasons may explain this. First, although the reward and its UCB and LCB are continuous, the final output of the online UCT algorithm is always a single location on the board, which is a discrete value. With a larger computational budget these estimates (reward, UCB, and LCB) become more precise, but there may be a threshold beyond which the results are already precise enough to produce a rational discrete decision, so additional precision no longer improves the move much. Second, our SimpleAI and the exploration probability-distribution map help the online training converge faster, and in a learning-based system more training after convergence usually brings no obvious further improvement.

4.2 Experiment between Humans and the Gomoku AI

Finally, we let the AI play against Gomoku hobbyists and amateurs of different levels. The results in Table 4.2 show that our AI can sometimes beat human players; it can even sometimes win against the best Gomoku hobbyist on our team.

Table 4.2 Results of human players against the AI
(column headers: human player | level | AI time budget (s) | number of games | human wins | AI wins | AI winning rate (%); players A and B were intermediate level, C high level, and D expert level; the numeric entries were not recoverable)

From the results in Table 4.2 we conclude that our Gomoku AI does have the ability to beat human players at times. Under some circumstances it generates moves good enough to genuinely surprise its human opponent. We believe our UCT Gomoku algorithm has the potential to beat general human players consistently, and we are all willing to keep refining it to make our Gomoku AI better in the future.

5. References

[1] L. V. Allis. Searching for Solutions in Games and Artificial Intelligence. Ponsen & Looijen, 1994.
[2] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1-43, 2012.
[3] R. Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games, pages 72-83. Springer, 2006.
[4] D. E. Knuth and R. W. Moore. An analysis of alpha-beta pruning. Artificial Intelligence, 6(4):293-326, 1975.
[5] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In ECML, volume 6, pages 282-293. Springer, 2006.
[6] L. Kocsis, C. Szepesvári, and J. Willemson. Improved Monte-Carlo search. Univ. Tartu, Estonia, Tech. Rep. 1, 2006.
[7] J. Schaeffer. The history heuristic and alpha-beta search enhancements in practice. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(11):1203-1212, 1989.
[8] J. Wágner and I. Virág. Solving Renju. ICGA Journal, 24(1):30-35, 2001.
[9] D. Silver. Reinforcement Learning and Simulation-Based Search in Computer Go. ProQuest Dissertations Publishing, 2009.
[10] S. Gelly and D. Silver. Combining online and offline knowledge in UCT. In Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.
[11] D. Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.
[12] D. Silver et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354-359, 2017.
[13] Wikipedia. Gomoku - Wikipedia, the free encyclopedia. [Online; accessed 01-Dec-2017]
[14] Wikipedia. Q-learning - Wikipedia, the free encyclopedia. [Online; accessed 03-Dec-2017]
[15] Wikipedia. Monte Carlo tree search - Wikipedia, the free encyclopedia. [Online; accessed 05-Dec-2017]
