Parallel Go on CUDA with Monte Carlo Tree Search


Parallel Go on CUDA with Monte Carlo Tree Search

A thesis submitted to the Division of Research and Advanced Studies of the University of Cincinnati in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

in the School of Computing Sciences and Informatics of the College of Engineering and Applied Sciences

October 2012

by

Jun Zhou
B.S., University of Michigan, 2008

Thesis Advisor and Committee Chair: Dr. Kenneth Berman

Abstract

Traditional Go AI uses minimax tree search with pruning optimizations and a board evaluation function to dictate moves in a sequential fashion. However, it is widely accepted that professional human players are far better than minimax-based Go AI, due to the lack of a decent evaluation function for Go and the game's astronomical search space. Recent development of Monte Carlo Tree Search (MCTS) based Go AI has produced a big surge in playing strength. With the emergence of CUDA, Nvidia's massively parallel GPU platform, it appears that the MCTS process can be parallelized at the simulation stage by the GPU to enhance its performance. This thesis takes on the challenge of building a Monte Carlo Tree Search algorithm to play Go on the manycore CUDA platform.


Preface

Ever since the invention of the computer, our scientist forefathers have dreamed of harnessing its tremendous computational power to conquer board games. In the last few decades we witnessed the fast-paced improvement in microprocessors predicted by the famous Moore's Law. As computers become much cheaper and faster, the programs which run on them benefit directly. The most famous Man vs. Machine story has to be the match between Garry Kasparov and IBM's Deep Blue in the late 1990s, which left a monumental mark in computing history because it was the first time a computer chess AI convincingly defeated the world's best grandmaster. The extraordinary feat of Deep Blue is easily repeated nowadays by home personal computers; it is estimated that a top-notch computer chess AI now plays at a strength well beyond that of the top human grandmasters. Consequently, AI researchers began shifting their attention to another board game named Go. Go is an ancient Chinese board game with relatively simple rules that one can learn in minutes but take a lifetime to master. Currently even the strongest Go AI is considerably weaker [13] than professional human Go players, for reasons that will be discussed in later chapters.

Acknowledgements

I dearly thank Dr. Berman for teaching me all these years, making me work hard, pushing me forward, and providing me with countless pieces of research advice. I also thank Dr. Annexstein for teaching me CUDA, which is what inspired this research in the first place. I also thank Dr. Bhatnagar for teaching me Artificial Intelligence; many of my ideas came from his AI class.

Contents

Abstract
Preface
Acknowledgements
1 Introduction
  1.1 Origin of Go
  1.2 Go Terminology
  1.3 Rules
  1.4 Thesis Organization
2 Monte Carlo Tree Search
  2.1 Challenges with Conventional Minimax Tree
    2.1.1 Board Size
    2.1.2 Additive Complexity
    2.1.3 Evaluation Function
  2.2 MCTS Overview
  2.3 MCTS Process
    2.3.1 Selection
    2.3.2 Expansion
    2.3.3 Simulation
    2.3.4 Backpropagation
3 CUDA Architecture
  3.1 Background
    3.1.1 Advantages
    3.1.2 Limitations
  3.2 Tesla C1060
4 Implementation
  4.1 High-Level Game Flow
  4.2 CUDA Configuration
  4.3 Running Simulation in Parallel
  4.4 Biased Monte Carlo Sampling via Evaluation Function
  4.5 GPU Parallel Optimization
    4.5.1 Shared Memory
    4.5.2 Memory Padding
    4.5.3 Result BackPropagate
5 Benchmark
  5.1 Thread Configurations
  5.2 Board Sizes
  5.3 Biased Monte Carlo Tree Search
  5.4 Performance Impact of Different Monte Carlo Policies
  5.5 Optimizations and Speedup
  5.6 CPU vs GPU
6 Conclusions
  6.1 Verdict
  6.2 Future Works
A Go Ranking Illustration
B Go Board Struct
C Go Board Intersect
Bibliography

List of Tables

3.1 Tesla C1060 Key Specifications
A.1 Go Ranking [Low-High]

List of Figures

1.1 Go Stones and Board
2.1 MCTS Iterations
3.1 CPU vs. GPU Computing
3.2 CUDA Processing Flow
3.3 CUDA Memory Hierarchy
3.4 Tesla C1060 Graphics Card
5.1 Thread Configurations Effects on 19x19 Board
5.2 Different Board Sizes
5.3 Playing Speed of Different MCTS Policies
5.4 Optimizations and Speedup
5.5 CPU vs GPU
C.1 Go Board Intersect

Chapter 1

Introduction

1.1 Origin of Go

Go might not be the most popular game in the world, but it is certainly one of the oldest and most time-tested throughout history. It originated in China more than 2500 years ago, and was introduced to the West through Japan. The Chinese name for Go is Weiqi, which translates to "encircling game." Indeed, the idea of encircling is a fundamental concept in the game. The game is noted for its simple physical composition: a 19x19 board, uniform pieces called stones, and easy playing rules.

1.2 Go Terminology

Figure 1.1: Go Stones and Board

To understand the grand concept of Go, here is some of the most basic terminology defined:

stone: a black or white colored piece that is played on the board.

group: a set of stones connected by the lines, also called a chain.

liberty: a vacant point immediately adjacent to a stone in a cardinal (horizontal or vertical, but not diagonal) direction; a single stone can have up to 4 liberties.

komi: compensation points awarded to offset the first-move advantage; a 19x19 board often uses a fractional komi¹.

There are many more Go-specific terms, but because this thesis focuses on MCTS parallelism and the CUDA platform they will only be explained as necessary.

¹ A fractional komi entails that there is always a winner: no ties are possible.

1.3 Rules

Two players, one using black stones and the other using white stones, alternate placing stones on the board. Each stone must be placed on an intersection of two lines in the 19x19 grid. If one side is known to be stronger, a handicap can be used. A handicap allows the weaker side to set extra stones on the board before the game starts. There are also 13x13 and 9x9 boards, which are commonly used by beginners. The goal of the game is to use stones to surround as much territory as possible. Once a piece is played, its location on the board is set, and it may not be moved again unless it is captured. A capture happens when a stone or group of stones has no liberties left. The game finishes when there are two consecutive passes or when one player resigns. The player with more territory at the end of the game wins.

1.4 Thesis Organization

Here is a brief overview of successive chapters:

Chapter 2: Monte Carlo Tree Search covers the shortcomings of applying conventional minimax tree search techniques to Go, and delves into a Monte Carlo Tree based search algorithm for Go.

Chapter 3: CUDA Architecture dives deep into the Nvidia CUDA architecture, and compares it with CPU architecture.

Chapter 4: Implementation discusses the actual implementation step-by-step, provides variants of Monte Carlo Tree Search, and highlights the differences from the CPU implementation. It also explains several unique challenges the SIMD architecture encounters on the GPU, and their impact on performance.

Chapter 5: Benchmark shows the testing environment and various benchmarking results from the aforementioned manycore implementation.

Chapter 6: Conclusions assesses overall project goals and limitations and offers thoughts on future work.

Chapter 2

Monte Carlo Tree Search

2.1 Challenges with Conventional Minimax Tree

Knowing that Go predates chess by more than a millennium, one might guess Go is a simpler game to play and therefore more easily solved by a computer. Moreover, one might also conjecture that the same minimax techniques which work well for chess can also be applied to Go. However, this is not true. One should not underestimate the simple pieces and rules that make up the game of Go. The following sections highlight why Go is a fundamentally different problem for computers to tackle compared to chess.

2.1.1 Board Size

The 19x19 Go board has 361 line intersections, while the 8x8 chess board has only 64. Due to this fact, the number of possible positions that can be played on the Go board is far greater than that of chess. More precisely, the average number of legal moves at any moment during the game is about 30 for chess; Go begins with 55 distinct possible legal moves (with symmetric duplicates removed). As the game progresses and once the symmetry is broken, the number of legal moves explodes to the point where most of the 361 positions require evaluation. Some intersects are more popular than others, but all are allowed.

2.1.2 Additive Complexity

As games such as chess, checkers, and backgammon progress, pieces get eliminated, which causes the game board to simplify, thereby decreasing the complexity. In Go, as the game continues, most pieces retain their positions on the board, with the exception of a few capture moves (extremely rare in professional-level games). This not only eliminates board symmetry, but also adds more tension and new possibilities to the board, sustaining a high complexity and branching factor throughout the game.

2.1.3 Evaluation Function

Any tree-based search heuristic applied to board games requires an evaluation function to count material and assess the positions on the board, in order to assign scores for both sides so that the AI engine is fully aware of the values of the game states. In other words, the evaluation function is the backbone of minimax tree search, helping to decide which of the possible search paths to steer towards. In chess, an accurate evaluation function can be derived from a number of considerations which not only count up the raw material advantage on the board, but also include many other examinations such as king position, queen position, doubled pawns, isolated pawns, passed pawns, rooks on open files, bishop pairs, etc. These observations can be made easily and formulated into rules to construct an accurate and strong evaluation function. But similar techniques cannot be applied efficiently to Go. An accurate evaluation of a Go position requires complex analysis such as determining the life and death of stones and groups, whether stones can be connected to avoid capture, and whether an attack should be played in place of a defense. Much of the time, more than one stone placement can be good, depending on the side's strategy and other delicate trade-offs. For example, it can be debatable whether one should kill a few enemy stones at the cost of allowing the enemy's stones elsewhere to strengthen. Also, sometimes the present placement of a stone might not seem influential, but after many moves it suddenly becomes crucial.

All in all, an accurate evaluation function for Go is simply non-existent. Currently, even the best evaluation-function-based Go AI still only performs at kyu¹ levels.

2.2 MCTS Overview

Monte Carlo Tree Search (MCTS) [7][8][9] is an innovative tree search method that has received the spotlight in recent years of Go artificial intelligence research. It utilizes Monte Carlo random sampling techniques to tackle the astronomical search space which previously haunted traditional minimax tree search methods. Mainly due to the high branching factor and the poor evaluation function in Go, traditional minimax tree search struggles to find the optimal decision. In comparison, MCTS cleverly uses statistics collected by randomized samplings to estimate the true value of a node, and builds the tree iteratively to help in finding an optimal decision. MCTS relies on an important concept: the true value of a node can be approximated by running enough Monte Carlo simulations, so the policy can adjust to a best-first strategy based on the statistics. MCTS has succeeded on difficult problems where other techniques come up short. It is a statistical sampling technique, which means more runtime and more processing power generally achieve better performance. In terms of flexibility, it can be used with no domain knowledge at all in some instances, or with a little domain knowledge² to help the decision process.

¹ Go rankings are illustrated in Appendix A.
² In Go, domain knowledge consists of simple rules to bias the node selection towards more advantageous nodes.

2.3 MCTS Process

Since MCTS is based on random sampling, it runs many iterations to ensure the accuracy and precision of the collected statistics. Each iteration has four stages: Selection, Expansion, Simulation, and Backpropagation. Figure 2.1 [6] illustrates this iterative process.

Figure 2.1: MCTS Iterations

2.3.1 Selection

Selection refers to the process of choosing the optimal child node among all of the expandable child nodes in the tree. A node is expandable if it is not a terminal state and it has unvisited children.

The node selection problem faces the dilemma of exploration versus exploitation [10]. Exploration-biased heuristics favor less examined nodes, while exploitation-biased heuristics favor well-established nodes. Thus a policy must be defined to resolve this conundrum. Auer et al. [1] proposed a policy called Upper Confidence Bounds (UCB)³, which has a great variant for tree search, named Upper Confidence Bounds for Trees (UCT). UCT is simple and efficient, and is guaranteed to be within a constant factor of the best possible bound on the growth of the regret [2], which is defined as the difference between the optimal selection and the UCT selection. Therefore it is a widely adopted method for balancing exploitation and exploration. The heuristic assigns each node a UCT score, and simply picks the node with the highest score.

UCT = \bar{X}_j + 2 C_p \sqrt{\frac{2 \ln n}{n_j}}

C_p is a constant, usually set to 1/\sqrt{2} to satisfy the Hoeffding inequality with rewards in the range [0,1]; this is examined in detail by Kocsis and Szepesvári [12].

\bar{X}_j is the ratio of the number of wins over the number of node visits, which translates to the winning percentage for the child node j; this value is never more than 1.

n is the number of times the parent node has been visited, while n_j is the number of times the child node j itself has been visited.

³ UCB addresses what is also known as the multi-armed bandit problem [14].

The UCT value can be intuitively understood as the sum of two parts: the first part, \bar{X}_j, represents the win ratio and encourages exploitation of promising nodes with a high win ratio, while the second part, 2 C_p \sqrt{2 \ln n / n_j}, encourages visiting nodes that have not been well sampled. As more simulations execute, the statistics get updated in all the nodes. \bar{X}_j captures the simple win ratio of each node state. However, this ratio alone is not good enough to decide which child node j to expand next. For instance, consider a node A and a node B having 1/2 and 20/100 as \bar{X}_j values respectively. If only \bar{X}_j is considered, node A seems a sure pick because of its 0.5 win ratio over node B's 0.2 win ratio. However, this overlooks the fact that node B's ratio has a much higher confidence level, and the statistics for node A could deteriorate quickly if more simulations were run. Vice versa, consider a node C and a node D with 1/2 and 70/100 as \bar{X}_j values. It is tempting to always favor node D because it has a high ratio and high confidence. But if node C isn't examined further, the algorithm could miss out on a potentially even higher win ratio from node C. Thus a balance must be achieved in choosing the child node j. The second part of the formula, 2 C_p \sqrt{2 \ln n / n_j}, gives an incentive to examine less explored nodes. n_j is 0 for never-visited nodes, whose UCT value is therefore infinite; this guarantees every node at least one visit. In summary, less explored nodes have small denominators n_j, which make the overall quotient large, thus contributing more to the UCT value.
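To make the formula concrete, here is a minimal C sketch of the UCT score computation, assuming per-node win and visit counters as described above; the function and parameter names are illustrative, not taken from the thesis code.

#include <math.h>

/* UCT score for child j per the formula above: win ratio plus an
 * exploration bonus. Unvisited children get +infinity, so every child
 * is guaranteed at least one visit. */
double uct_score(double wins_j, double visits_j, double parent_visits,
                 double cp)  /* cp: exploration constant, e.g. 1/sqrt(2) */
{
    if (visits_j == 0.0)
        return INFINITY;
    return wins_j / visits_j
         + 2.0 * cp * sqrt(2.0 * log(parent_visits) / visits_j);
}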

2.3.2 Expansion

Once a node is selected, child nodes are added to expand the selected node according to the available (legal) actions. The application can control the expansion policy, defining rules to determine which child nodes are possible. After expansion, a filtering function is sometimes also applied to prune the child nodes in order to eliminate obviously sub-optimal ones.

2.3.3 Simulation

Starting from the selected node, the simulation follows a simulation (also called playout) policy until game completion to produce an outcome. The simulation policy, a key ingredient, uses simple strategies to map game states into actions, and also contains rules that restrict the legality of child nodes; on the other hand, it should not be overly restrictive, as that can degrade the random sampling process. The simulation policy is not equivalent to a full strategy, which would be much more expansive and complicated; the policy is usually simpler and leaves enough room for randomness. It should be noted that a policy-less simulation can have a potentially large number of child nodes and render the collected statistics less useful.

2.3.4 Backpropagation

The outcomes from simulations get propagated backwards from the leaf nodes to the root node in order to update the statistics. Based on these statistics, using the UCT (multi-armed bandit) heuristic, the application loops back to the node Selection stage.
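As a rough illustration of this stage, the following C sketch walks the statistics from a simulated leaf back to the root. The node layout is an assumption for illustration, and a real engine would also flip the win perspective at alternate plies.

#include <stddef.h>

/* Hypothetical node layout: each node tracks the visit and win counts
 * that the UCT selection reads on the next iteration. */
typedef struct Node {
    struct Node *parent;   /* NULL at the root */
    int visits;
    int wins;
} Node;

/* Propagate a batch of simulation outcomes from the leaf to the root. */
void back_propagate(Node *leaf, int wins, int simulations)
{
    for (Node *n = leaf; n != NULL; n = n->parent) {
        n->visits += simulations;
        n->wins   += wins;  /* perspective handling omitted for brevity */
    }
}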

Chapter 3

CUDA Architecture

3.1 Background

Compute Unified Device Architecture (CUDA) is a recent parallel computing platform developed by Nvidia Corporation. It is neither the first parallel computing platform, nor the first GPU-enabled parallel programming construct. However, it is arguably the first popularized parallel GPU architecture that sees wide use in both commercial and academic fields. CUDA itself is a massively parallel architecture, often containing more than 100 cores; the newer models even have more than 1000 cores. This poses a sharp contrast to traditional CPU architecture, which often contains no more than 4 to 8 cores. While a CUDA GPU can have hundreds of cores executing instructions in parallel, each core is a lot slower than a CPU core. CUDA emphasizes

Figure 3.1: CPU vs. GPU Computing

running many threads concurrently as opposed to running a single thread very quickly. This architecture is also known as Single Instruction Multiple Data (SIMD). The underlying premise is that, if a task can be saturated with a high enough volume of input data, the overall performance can improve due to the large-scale batch processing. Figure 3.1 demonstrates the peak computing capability in GFLOPS¹ of CUDA graphics cards compared to Intel CPUs. Programming on the GPU is vastly different from doing so on the CPU, primarily due to the architectural differences and the supported instruction sets. Historically it has never been an easy task to migrate an application from the CPU to the GPU while also improving its performance. The beauty of the CUDA platform is that it allows the use of the C programming language to write programs largely in the same fashion with minimal modifications. Using C also favors portability, as a great number of programs written in C for the CPU can be modified relatively easily to run on the CUDA platform.

3.1.1 Advantages

Ease of programming: C is one of the most popular programming languages. Compared to OpenCL and DirectCompute, C is more straightforward and general-purpose, thus more user friendly.

¹ Giga floating-point operations per second.

Figure 3.2: CUDA Processing Flow

Figure 3.3: CUDA Memory Hierarchy

Computation capability: CUDA-enabled graphics cards can achieve more than 100 GFLOPS, making them suitable for computationally expensive tasks.

Shared memory: Each multiprocessor within the device is allocated a region of shared memory, which is extremely fast, has low latency, and can be shared among different threads within the same block.

User-managed cache: Flexible caching configurations help improve read performance.

3.1.2 Limitations

Divergence penalties: Any SIMD architecture is restricted by the inability to execute different logic branches simultaneously; this can impact applications of a heavily branching nature.

Incomplete C/C++ support: Though CUDA implements the basic constructs of C, not all features are supported, e.g., dynamic allocation, full class support, virtual functions, various libraries, etc.

Thread configuration: Threads need to be cautiously set up and configured to achieve reasonable performance. More specifically, threads run in warps², and each core should be saturated with threads in order to mask memory access latency.

² In the current CUDA implementation, a warp contains 32 threads.

Small shared memory: Each multiprocessor contains 16-48 KB of shared memory; this can pose a challenge when fitting a large indivisible graph with many nodes.

3.2 Tesla C1060

Figure 3.4: Tesla C1060 Graphics Card

The Tesla C1060 card is used for this thesis. It is a professional-grade CUDA-enabled Nvidia graphics card manufactured in 2008. Some of the key hardware specifications determine the optimal thread configuration in the experiment.

CUDA Capability Major revision number:          1
CUDA Capability Minor revision number:          3
Total amount of global memory:                  4 GB
Number of multiprocessors:                      30
Number of cores:                                240
Total amount of constant memory:                65536 bytes
Total amount of shared memory per block:        16384 bytes
Total number of registers available per block:  16384
Warp size:                                      32
Maximum number of threads per block:            512
Maximum sizes of each dimension of a block:     512 x 512 x 64
Maximum sizes of each dimension of a grid:      65535 x 65535 x 1
Maximum memory pitch:                           2147483647 bytes
Texture alignment:                              256 bytes
Clock rate:                                     1.30 GHz
Concurrent copy and execution:                  Yes
Run time limit on kernels:                      No
Integrated:                                     No
Support host page-locked memory mapping:        Yes

Table 3.1: Tesla C1060 Key Specifications
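To make the programming model described in Section 3.1 concrete, here is a minimal CUDA kernel unrelated to the thesis code: the body is ordinary C, and only the __global__ qualifier, the built-in thread indices, and the <<<...>>> launch syntax are CUDA-specific.

#include <cuda_runtime.h>

/* Scale an array in parallel: one thread per element. */
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              /* guard against overshooting the array */
        x[i] *= a;
}

/* Host-side launch: enough 256-thread blocks to cover n elements. */
void scale_on_gpu(float *d_x, float a, int n)
{
    scale<<<(n + 255) / 256, 256>>>(d_x, a, n);
    cudaDeviceSynchronize();
}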

Chapter 4

Implementation

In this implementation, the GNU-licensed, freely available Go infrastructure library Fuego is used in order to avoid re-inventing the basic building blocks of Go such as the user interface, basic logic flow, game rules, etc. The Fuego programming library grants researchers a foundation to work with, rather than having to deal with trivial programming details, thus allowing more emphasis to be put on the actual algorithms and heuristics. A portion of the source code also borrows from the GNU Go framework.

4.1 High-Level Game Flow

Using the four-stage Monte Carlo Tree Search and its random sampling process, the procedure to find the optimal next move is illustrated in the pseudocode below:

function MonteCarloSearch(state0)
    create root node node0 with state state0
    while nodeCount < maximumAllowedNodes
        chosenNode = TreePolicy(node0)
        result = SimulationPolicy(chosenNode)
        BackPropagate(result, chosenNode)
    return best scoring node

Monte Carlo Tree Search belongs to the family of tree search algorithms; thus a game state is represented by a node in the tree. Each node needs to keep track of the Go board information, which contains the black and white stone positions and captures; moreover, it needs to store the win and loss counts of the node. The engine starts off by constructing a root node which represents the current game state, and invokes the TreePolicy function, which uses the upper confidence bound method (UCT) to pick out the most opportune child node. Once the appropriate child is chosen and expanded, the SimulationPolicy function takes over and a predetermined number of Monte Carlo simulations are run to generate statistics on the chosen node. Afterwards the results (wins and losses) get propagated back all the way to the root node. From there the algorithm loops back to the

node selection stage, thus iteratively constructing a Monte Carlo search tree by selecting and expanding one node at a time. If a terminal condition such as the maximum allowed time or the maximum allowed number of nodes is met, the search is immediately interrupted. It is then easy to determine the best child based on the collected statistics and return it as the next Go move.

function TreePolicy(node0)
    leafNodes = getAllLeafNodes(node0)
    foreach i in leafNodes
        score_i = winRatio_i + 2 * Cp * sqrt(2 * ln(totalVisits) / nodeVisits_i)
    return node_i with max score_i

The TreePolicy function is fairly straightforward. It uses an iterator to loop through all the leaf nodes of the root node, and puts them in a set. The UCT formula is used to determine which leaf node SimulationPolicy should execute from. The gist of UCT is that, in addition to the win ratio, it adds a weight which favors less explored nodes. Under UCT, a balance between exploration and exploitation is achieved.

function SimulationPolicy(chosenNode)
    while !winningCondition
        tryNakadeMove()
        else tryAtariCaptureMove()
        else tryAtariDefenseMove()
        else tryLowLibertyMove()
        else tryPatternMove()
        else tryCaptureMove()
        else tryRandomMove()
        else tryPassMove()
    return move

SimulationPolicy takes a board state and tries to generate a move based on a number of Go-specific functions from a priority list. Nakade¹ heuristics are tried first, then Atari² moves, low-liberty moves, pattern matching moves, capture moves, random moves, and lastly pass moves. The ordering of this priority list is important, and any adjustment can result in performance fluctuation.

¹ It literally means "inside move" in Japanese, and refers to the situation where a group of stones can be made into two eyes, or prevented from doing so, by a single move.
² A group of stones with only one liberty.
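The priority list can be read as a short-circuiting fallback chain: the first generator that produces a move wins. The C sketch below expresses that idea with a function-pointer table; this is an editorial illustration, not the thesis's actual implementation, and the Board and Move types are placeholders.

typedef struct Board Board;              /* opaque board state */
typedef struct { int point; } Move;      /* placeholder move type */

/* Each generator writes a move and returns 1 on success, 0 if no such
 * move exists on the current board. */
typedef int (*MoveGen)(const Board *, Move *);

/* Walk the priority list in order; the first hit decides the move.
 * The list would be {tryNakadeMove, tryAtariCaptureMove, ...,
 * tryPassMove}, matching the pseudocode above. */
int generate_move(const Board *b, Move *m, MoveGen gens[], int ngens)
{
    for (int i = 0; i < ngens; i++)
        if (gens[i](b, m))
            return 1;
    return 0;   /* unreachable if a pass is always legal */
}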

4.2 CUDA Configuration

The thread configuration in the CUDA architecture is intricate and can impact performance heavily. If only a limited number of threads execute in parallel, the majority of the CUDA cores' computational power is under-utilized. On the contrary, if far too many threads are allocated, the resulting divergent execution paths and unmanaged memory access patterns can be disastrous. Therefore in this thesis a variety of configurations are examined to illustrate the repercussions of each setup. Based on these results, the application can find the optimal configuration. CUDA threads must be configured with a grid dimension³ and a block dimension⁴. The default setup is a grid consisting of 512 blocks where each block contains 512 threads. This configuration was found to be optimal after performance analysis with configurations ranging from 64 to 512 blocks and from 32 to 512 threads. The detailed thread configurations and their impact on performance are shown in the Benchmark chapter.

³ A grid can contain up to 65535 x 65535 blocks in the x and y dimensions.
⁴ A block can contain up to 512 x 512 x 64 threads in the x, y, and z dimensions, but only a maximum of 512 threads in total.
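In CUDA source, the default configuration above corresponds to a launch like the following sketch; the kernel and buffer names are assumptions.

#include <cuda_runtime.h>

struct Board;   /* opaque board type, sketched in Section 4.3 */

__global__ void simulationKernel(const struct Board *rootBoard, int *results);

/* Default configuration from this section: 512 blocks of 512 threads,
 * i.e., 262,144 playout threads per launch. */
void launch_playouts(const struct Board *d_rootBoard, int *d_results)
{
    dim3 grid(512);
    dim3 block(512);
    simulationKernel<<<grid, block>>>(d_rootBoard, d_results);
    cudaDeviceSynchronize();   /* block until all playouts complete */
}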

4.3 Running Simulation in Parallel

CUDA Go aims to parallelize the Monte Carlo simulations; in other words, parallelization happens at the leaf nodes. Here is pseudocode that first runs on the CPU before launching a CUDA kernel that runs purely on the GPU:

function CUDAMonteCarloSearch(state0)
    create root node node0 with state state0
    while nodeCount < maximumAllowedNodes
        chosenNode = TreePolicy(node0)
        result = CUDASimulationKernel(chosenNode)
        BackPropagate(result, chosenNode)
    return best scoring node

The CUDAMonteCarloSearch function first runs on the CPU, much like the MonteCarloSearch method in the sequential counterpart. The TreePolicy is handled sequentially on the CPU, since this part of the algorithm is inherently linear. As the application reaches CUDASimulationKernel, a CUDA kernel is launched with user-managed configurations, and the process copies the initial board state from the CPU memory to the GPU memory and executes thousands of threads in parallel. On the GPU, the kernel does not distinguish amongst threads; this is to minimize divergence. Inside the GPU, each thread manages its own Go board state in global memory⁵. After each thread terminates, its score is stored in a pre-allocated array in the global memory region; threads are synchronized to ensure termination. The scores are accumulated by a fast prefix sum function, powered by the CUDA Thrust library, which runs in log N time. The results are transferred back from the GPU to the CPU memory and back-propagated all the way to the root node. With the updated node statistics, the game engine utilizes the UCT formula

once again to decide which of the next nodes should be explored, and hands that node off to the GPU, thereby iteratively constructing a Monte Carlo search tree.

function CUDASimulationKernel(chosenNode)
    initialize local board state from chosenNode
    while !gameEndingConditions
        SimulationPolicy(chosenNode)
    resultArray[threadIdx] = result
    SynchronizeAllThreads()
    stats = InclusiveScan(resultArray)
    return stats

When the CPU hands off the work to the GPU, each spawned thread works independently by following the set of rules formally defined in the SimulationPolicy function. After each game is played to completion, the result is saved at the index corresponding to the unique thread ID in the global array. The global array is accumulated by a fast inclusive scan function to generate the total number of wins achieved from the initial game state, before these statistics are passed back to the CPU, which continues in the main loop.

⁵ Shared memory is also used to reduce memory access delay.
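The host/device split described above might look roughly like the following sketch; playoutToEnd and the board type are hypothetical stand-ins for the engine's playout code, while the Thrust call is the library's real inclusive-scan API.

#include <thrust/device_ptr.h>
#include <thrust/scan.h>

typedef struct { char cells[361]; } Board;   /* placeholder board type */

__device__ int playoutToEnd(Board *b, unsigned seed);  /* hypothetical */

/* Each thread copies the starting state, plays one game to completion,
 * and records a 0/1 win result at its own index: no write conflicts. */
__global__ void simulationKernel(const Board *root, int *results)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    Board local = *root;               /* thread-private board copy */
    results[tid] = playoutToEnd(&local, tid);
}

/* Host side: tally the results with Thrust's inclusive scan; the last
 * element then holds the total number of wins. */
void accumulate_results(int *d_results, int n)
{
    thrust::device_ptr<int> p(d_results);
    thrust::inclusive_scan(p, p + n, p);
}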

4.4 Biased Monte Carlo Sampling via Evaluation Function

The original Monte Carlo sampling is uniform, meaning it does not bias towards any of the generated moves. While this is the standard Monte Carlo approach, it can lead to rather bizarre positions that a human Go player is not likely to play. The root problem is that the engine makes no distinction between positions that can potentially influence the game differently. Intuitively, stronger moves should be played more frequently than weaker moves. It appears that, if these moves are given weights according to an evaluation function which assigns scores to all of the moves, a better simulation might be achieved. Let the evaluation function E assess all of the generated moves m_i in M, and save the results as r_i in r. Next, the result vector is normalized so that it sums to 1; a new vector containing the normalized weight of each move, R_i in R, is formed. Now the Monte Carlo simulation probabilistically selects the next move based on the weights associated with the moves. Note that the evaluation function in Go is far from perfect. The best Go programs that utilize minimax tree search and evaluation functions still cannot achieve dan-level strength [15]. In essence, the evaluation relies on many Go-specific tactics and rules to reason about strategies and examine territories on the board, which combine to a total score at the end. Of course the score is far from accurate in many cases, but the assessment can be useful in distinguishing positions which a standard MCTS does not care to. The results of the unbiased Monte Carlo process versus the biased Monte Carlo process are presented later in the Benchmark chapter.
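Here is a minimal host-side C sketch of this weighted selection, assuming the evaluation scores have already been collected into a non-negative weight array; a GPU version would use a per-thread random state instead of rand().

#include <stdlib.h>

/* Roulette-wheel selection: return index i with probability
 * weight[i] / sum(weight). Weights are the evaluation scores of the
 * candidate moves; sampling against the running sum has the same
 * effect as explicit normalization. */
int pick_weighted_move(const double *weight, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += weight[i];

    double r = ((double)rand() / RAND_MAX) * sum;
    for (int i = 0; i < n; i++) {
        r -= weight[i];
        if (r <= 0.0)
            return i;          /* r landed inside move i's slice */
    }
    return n - 1;              /* guard against floating-point rounding */
}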

4.5 GPU Parallel Optimization

4.5.1 Shared Memory

Shared memory is an important aspect of CUDA. It can be seen as a user-managed cache. In order to achieve speedup, the application needs to make use of the shared memory and serve as many memory loads from it as possible. The Tesla C1060 provides 16 KB of shared memory per multiprocessor. It has a total of 30 multiprocessors, giving 30 x 16 KB = 480 KB of shared memory. The struct⁶ that keeps a Go game board occupies about 400 bytes. Therefore about 480 KB / 400 bytes ≈ 1200 boards can be allocated in shared memory. Certainly this optimization has a positive impact on the overall performance, due to the much faster memory loads from on-chip shared memory as opposed to the distant global memory.

⁶ Refer to Appendix B for more details.
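A sketch of how boards might be staged in shared memory, using the 400-byte figure above; the kernel body and layout are assumptions for illustration.

#define BOARD_BYTES      400                      /* per the text above */
#define BOARDS_PER_BLOCK (16384 / BOARD_BYTES)    /* 40 boards per 16 KB */

/* Each block stages its boards on-chip so playout reads hit fast
 * shared memory rather than distant global memory. */
__global__ void playoutKernelShared(const char *globalBoards, int *results)
{
    __shared__ char boards[BOARDS_PER_BLOCK * BOARD_BYTES];

    /* ... copy this block's boards in, run playouts, write results ... */
    (void)globalBoards; (void)results; (void)boards;  /* sketch only */
}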

4.5.2 Memory Padding

Many of the simulation policy rules require LookUp, LookDown, LookLeft, and LookRight functions, which check the positions adjacent to the current intersect. The CUDA architecture provides 16 memory banks⁷; therefore if threads of the same half-warp access through the same memory bank, a conflict occurs. To combat this drawback, the application allocates extra columns in the memory region so that none of the adjacent-intersect lookups cause memory bank conflicts. This technique speeds up memory reads; a sketch of the indexing scheme appears at the end of this section.

4.5.3 Result BackPropagate

CUDA Go needs to be smart about how to transfer the game simulation results back. If it is done carelessly, by accumulating all of the results into a single global variable under an atomic lock, significant write contention occurs and the generated delays are non-trivial. To avoid the write conflicts, the threads of CUDA Go first write the results to a pre-allocated array in GPU global memory, at indices formed by their unique thread IDs. An inclusive scan accumulator function is then used to tally the results in log N time before transferring the total sum to CPU memory. Done this way, the application not only avoids the write conflict, but also speeds up the accumulation.

⁷ It is also known as the 16-stride memory bank.
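Here is the promised sketch of the padded indexing scheme; the padding width of one column is illustrative, as the right amount depends on the bank layout and access pattern.

#define SIZE  19
#define PAD   1                 /* illustrative padding width */
#define WIDTH (SIZE + PAD)      /* padded row stride */

/* Neighbor lookups on a padded 1-D board; with the extra column,
 * adjacent threads' row-offset accesses map to different banks.
 * Border checks are omitted for brevity. */
__device__ int lookUp(const int *board, int idx)    { return board[idx - WIDTH]; }
__device__ int lookDown(const int *board, int idx)  { return board[idx + WIDTH]; }
__device__ int lookLeft(const int *board, int idx)  { return board[idx - 1]; }
__device__ int lookRight(const int *board, int idx) { return board[idx + 1]; }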

Chapter 5

Benchmark

5.1 Thread Configurations

In CUDA there can be a variety of thread configurations, and different configurations can lead to drastic performance disparities. In general, the number of threads per block should be a multiple of the warp size (32 threads), and the total number of threads should be abundant enough to saturate the cores. As illustrated in Figure 5.1, there need to be at least 128 blocks in the grid to obtain reasonable performance; however, increasing the block count beyond 256 gains little to no additional performance due to hardware saturation. It is also observed that the number of threads per block tops out near 256, and 512 threads per block gains no significant additional performance. The slope of the 64-block line is sharper than the lines below it, which means that the performance increase from adding more threads to the 64-block configuration

is more evident than for the higher-block-count lines. This is because when there are fewer blocks, the hardware is less saturated, so each new thread can utilize more hardware power. In the 512-block configuration, the hardware is already heavily occupied, so the addition of new threads gains only limited performance, much like the effect of diminishing returns.

Figure 5.1: Thread Configurations Effects on 19x19 Board

5.2 Board Sizes

Though the 19x19 Go board is the standard, it is not uncommon for amateurs to play on 9x9 or 13x13 boards for learning purposes. Computer Go programs

also play on boards of different sizes for academic reasons. In Figure 5.2, it comes as no surprise that smaller boards allow faster simulations, due to the smaller memory footprint and the fewer stones to be played until completion. The fitted curve shows a relation close to that of a quadratic function. The underlying reason is that as the board gets smaller, the number of playable next moves shrinks at a non-linear pace, resulting in a much faster simulation per game.

Figure 5.2: Different Board Sizes

5.3 Biased Monte Carlo Tree Search

The black side uses the biased MCTS engine while the white side uses the original unbiased MCTS engine. Each side runs a fixed number of simulations to determine the next move, and time [11] is unlimited.

                                          9x9      13x13    19x19
Black wins (games)                        51/100   52/100   11/20
White wins (games)                        49/100   48/100   9/20
Avg game length (moves)                   -        -        -
Avg black total time per game (seconds)   -        -        -
Avg white total time per game (seconds)   -        -        -
Avg black time per move (seconds)         -        -        -
Avg white time per move (seconds)         -        -        -

As the results produced by having the two engines battle one another demonstrate, the biased MCTS Go engine achieves 51%, 52% and 55% win ratios for the 9x9, 13x13 and 19x19 boards respectively; however, the biased engine also uses 6.8%, 14.1% and 29.4% more time than the unbiased MCTS Go engine. This data hints that the biased algorithm, which uses weights to decide the next move, can increase the playing strength of the Go engine. Intuitively, as MCTS generates a string of possible moves, biased weights which favor the stronger moves predicted by the evaluation function help to play the moves more sharply, thus achieving a more accurate result which is one step closer to perfect play than the previously

unbiased MCTS engine. The overall statistics collected by the biased MCTS are more competitive, resulting in an increase in playing strength. It is also observed that the bigger boards benefit more from the biased MCTS, as the winning ratio jumps from 51% (9x9 board) to 55% (19x19 board). A logical explanation could be that when the board is smaller there are fewer available next moves to choose from, so the biasing effect on each move is not as significant as it is for a bigger board with many candidate moves to pick from. In effect, the evaluation function plays a more important role on bigger boards due to the larger number of possible next moves, since a larger board has a much higher probability of arriving at a permutation of moves that differs from what the biased weights would select. However, the lift in playing strength does not come for free. The computing time of the biased MCTS Go engine sees an increase of 6.8%, 14.1% and 29.4% going from small to big boards. This additional cost comes from the larger number of available moves that need to be assessed by the evaluation function, which is not cheap to invoke because it involves many Go-specific rules and knowledge such as strategic and territorial analysis.

5.4 Performance Impact of Different Monte Carlo Policies

Divergence is a big issue for any SIMD architecture, and the CUDA Go engine is therefore subject to this problem. As an indication of the level of impact divergence has on the CUDA Go engine, this experiment modifies the MCTS so that it generates the next move randomly instead of following a predefined policy (the only requirement is that the move must be legal). Doing so avoids executing the many different branches of rules within the original policy which cause significant divergence. Figure 5.3 shows approximately a sevenfold speed increase over the standard MCTS engine. In other words, the policy-less engine can play games seven times faster than the original. Nonetheless, the playing strength of the policy-less Go engine suffers heavily due to the inaccurate statistics generated from the simulations. It almost never wins a game against a normal engine that follows the policy.

Figure 5.3: Playing Speed of Different MCTS Policies

5.5 Optimizations and Speedup

Due to the underlying divergent execution path issues and heavy memory footprint, a straightforward implementation that works well on the CPU can hit severe performance roadblocks on the CUDA GPU architecture. Several optimizations are applied to the baseline implementation. As Figure 5.4 shows, the largest speedup comes from setting appropriate blocks-per-grid and threads-per-block parameters. Another sizable improvement comes from the use of shared memory. Memory padding and the inclusive scan also add observable performance increases to the baseline engine.

Figure 5.4: Optimizations and Speedup

5.6 CPU vs GPU

Given the architectural differences between CPUs and GPUs in general, it is interesting to benchmark the two head-to-head in order to understand whether the CUDA GPU architecture is fit for solving Go using Monte Carlo Tree Search methods. In Figure 5.5, the comparisons demonstrate that the performance of

CUDA definitely outshines that of a single-core Pentium 4 CPU, yet falls short of the much newer generation quad-core AMD Phenom processor. Although not benchmarked in this experiment, it is predictable that with the newest generation Fermi-based Nvidia graphics cards, the GPU performance should overtake the AMD Phenom processor convincingly.

Figure 5.5: CPU vs GPU

Chapter 6

Conclusions

6.1 Verdict

Evaluating the success of the parallelized CUDA Go using MCTS is an intricate matter. Parallel Monte Carlo Tree Search is certainly a sound methodology for tackling problems with large search spaces, where high branching factors lead to astronomical numbers of search nodes. The parallelism can essentially be exploited at the Monte Carlo simulation stage, where a massive number of threads can be spawned to help gather high-quality statistics. In Go's case, the engine is able to determine which of the next moves is the most suitable based on the tried paths and their simulated results. In this thesis CUDA is selected as the underlying platform for the implementation for two reasons. First, it is a novel idea and a contribution

to utilize the GPU to solve Go using MCTS. Secondly, CUDA has hundreds of cores and is architecturally different from CPUs. On a normal CPU, threads can execute independently and branch freely regardless of what other threads do. In CUDA, however, any divergence in the thread execution paths can take a heavy toll on performance and serialize the thread executions within a warp. A great portion of the Monte Carlo simulation process unavoidably relies on randomization, thus creating divergent executions. For instance, one thread may be looking at a stone to calculate its liberties while another thread is performing pattern-matching heuristics. In CUDA it is often the case that an application gets more speedup if it mostly performs arithmetic such as addition, subtraction, and multiplication. In MCTS Go's scenario, most operations are branching and memory loads; therefore the improvement is not as drastic as other applications might enjoy. Various optimizations are also applied to the baseline implementation to achieve more speedup. It is often the case that a straightforward implementation which works well on the CPU will not perform adequately on the GPU. Though the CUDA platform has hundreds of cores, a GPU core is more than two orders of magnitude slower than a CPU core. The proper thread configuration ensures that the graphics card keeps all the cores occupied and achieves maximum throughput, while the shared memory usage and the memory padding serve to accelerate memory read accesses. Many of these optimizations are GPU-only and hardware-specific. Ideally one should not

have to consider these limitations in order to design an efficient parallel system, as from a theoretical standpoint the hardware specifics are best abstracted away. The idea of using the evaluation function to build a biased MCTS engine is also a contribution. Interestingly, the overall playing strength sees an increase at the cost of consuming more computation time. This can become useful in long games in which the time requirement is more flexible: the Go engine can switch between the unbiased and the biased mode depending on the remaining time and the importance of the next move. This can be the first step towards making a stronger MCTS Go playing engine. Overall this implementation gives great insight into how a SIMD architecture can benefit from the parallelism as well as the limitations it can pose. In particular, the biased MCTS algorithm is able to produce stronger playouts during the simulation stage. Moreover, it can be foreseen that with a good MIMD parallel architecture, which is immune to divergence problems, the speedups can be achieved more easily. The newer generation Nvidia graphics cards should also deliver stronger results.

6.2 Future Works

Although this thesis limits its scope to the SIMD-architecture CUDA platform, it should be interesting to conduct similar research and experiments on MIMD platforms and compare the results. Many of the divergence issues

should go away, and the performance boost should be closer to the theoretical improvement bound due to the ability to execute many paths concurrently. The other difficulty with MCTS algorithms is that, while they deliver better results than traditional minimax tree search applied to Go, they are still far from top human professional players. Perhaps MCTS needs stronger simulation policy heuristics based on entirely new paradigms [4], not limited to Go-specific rules and strategies. Machine learning [16][5] could also be applied to train the engine. On one hand, the playing strength of the MCTS Go engine is largely dependent on its simulation policy, domain-specific knowledge, pattern matching, and so on; on the other hand, these factors can become limitations to improving the engine's playing strength, as the Go-specific material can get so complicated that only a Go expert could produce it. If machine learning is used to train the engine, this knowledge and these patterns could potentially be learned without a Go expert, thus making the development of policies easier. In this thesis only leaf-node parallelization is explored. It should be interesting to experiment with root-node parallelization; for example, several trees could run MCTS algorithms simultaneously [3] and merge their results to produce an overall improvement over single-tree MCTS.

Appendix A

Go Ranking Illustration

Rank Type          Rank     Stage
Double-digit kyu   30-20k   Beginner
Double-digit kyu   19-10k   Casual
Single-digit kyu   9-1k     Intermediate amateur
Amateur dan        1-7d     Advanced player
Professional dan   1-9p     Professional player

Table A.1: Go Ranking [Low-High]

Appendix B

Go Board Struct

struct GoBoard {
    int intersect[19*19]; /* positive = black, negative = white, zero = empty */
    int whitecapture;     /* capture count for white */
    int blackcapture;     /* capture count for black */
    int komi;
    int result;
};

Appendix C

Go Board Intersect

The Go board is represented as a one-dimensional array, with each offset mapping to an intersect. While the board could also be represented with a two-dimensional array, that has a slightly larger footprint and thus slightly slower performance. The one-dimensional array is functionally equivalent; the program just needs to be intelligent about switching rows.
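A small C sketch of the mapping, without the padding discussed in Section 4.5.2; the row-boundary check shows where the program must be intelligent about switching rows.

#define N 19   /* board dimension */

int offset_of(int row, int col) { return row * N + col; }
int row_of(int offset)          { return offset / N; }
int col_of(int offset)          { return offset % N; }

/* A left neighbor at offset-1 exists only if we are not in column 0;
 * without this check, offset-1 silently wraps to the previous row. */
int has_left(int offset)        { return offset % N != 0; }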

Figure C.1: Go Board Intersect

Bibliography

[1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

[2] P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on, pages 322-331. IEEE, 1995.

[3] D. Auger. Multiple tree for partially observable Monte-Carlo tree search. Applications of Evolutionary Computation, pages 53-62, 2011.

[4] V. Berthier, H. Doghmen, and O. Teytaud. Consistency modifications for automatically tuned Monte-Carlo tree search. Learning and Intelligent Optimization, 2010.

[5] B. Bouzy and G. Chaslot. Monte-Carlo Go reinforcement learning experiments. In Computational Intelligence and Games, 2006 IEEE Symposium on. IEEE, 2006.

[6] C. Browne, E. Powley, D. Whitehouse, S. Lucas, P. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1-43, 2012.

[7] G. Chaslot. Monte-Carlo Tree Search. PhD thesis, Maastricht University, 2010.

[8] G. Chaslot, M. Winands, and H. van den Herik. Parallel Monte-Carlo tree search. Computers and Games, pages 60-71, 2008.

[9] A. Fern and P. Lewis. Ensemble Monte-Carlo planning: An empirical study. In Proc. 21st Int. Conf. Automat. Plan. Sched., Freiburg, Germany, pages 58-65, 2011.

[10] S. Gelly and Y. Wang. Exploration exploitation in Go: UCT for Monte-Carlo Go, 2006.

[11] S.C. Huang, R. Coulom, and S.S. Lin. Time management for Monte-Carlo tree search applied to the game of Go. In Technologies and Applications of Artificial Intelligence (TAAI), 2010 International Conference on. IEEE, 2010.

[12] L. Kocsis, C. Szepesvári, and J. Willemson. Improved Monte-Carlo search. Univ. Tartu, Estonia, Tech. Rep. 1, 2006.

[13] S. Lopez. Rybka's Monte Carlo analysis.


More information

Foundations of Artificial Intelligence

Foundations of Artificial Intelligence Foundations of Artificial Intelligence 6. Board Games Search Strategies for Games, Games with Chance, State of the Art Joschka Boedecker and Wolfram Burgard and Frank Hutter and Bernhard Nebel Albert-Ludwigs-Universität

More information

Building Opening Books for 9 9 Go Without Relying on Human Go Expertise

Building Opening Books for 9 9 Go Without Relying on Human Go Expertise Journal of Computer Science 8 (10): 1594-1600, 2012 ISSN 1549-3636 2012 Science Publications Building Opening Books for 9 9 Go Without Relying on Human Go Expertise 1 Keh-Hsun Chen and 2 Peigang Zhang

More information

Creating a Havannah Playing Agent

Creating a Havannah Playing Agent Creating a Havannah Playing Agent B. Joosten August 27, 2009 Abstract This paper delves into the complexities of Havannah, which is a 2-person zero-sum perfectinformation board game. After determining

More information

Contents. Foundations of Artificial Intelligence. Problems. Why Board Games?

Contents. Foundations of Artificial Intelligence. Problems. Why Board Games? Contents Foundations of Artificial Intelligence 6. Board Games Search Strategies for Games, Games with Chance, State of the Art Wolfram Burgard, Bernhard Nebel, and Martin Riedmiller Albert-Ludwigs-Universität

More information

46.1 Introduction. Foundations of Artificial Intelligence Introduction MCTS in AlphaGo Neural Networks. 46.

46.1 Introduction. Foundations of Artificial Intelligence Introduction MCTS in AlphaGo Neural Networks. 46. Foundations of Artificial Intelligence May 30, 2016 46. AlphaGo and Outlook Foundations of Artificial Intelligence 46. AlphaGo and Outlook Thomas Keller Universität Basel May 30, 2016 46.1 Introduction

More information

AN ABSTRACT OF THE THESIS OF

AN ABSTRACT OF THE THESIS OF AN ABSTRACT OF THE THESIS OF Paul Lewis for the degree of Master of Science in Computer Science presented on June 1, 2010. Title: Ensemble Monte-Carlo Planning: An Empirical Study Abstract approved: Alan

More information

Andrei Behel AC-43И 1

Andrei Behel AC-43И 1 Andrei Behel AC-43И 1 History The game of Go originated in China more than 2,500 years ago. The rules of the game are simple: Players take turns to place black or white stones on a board, trying to capture

More information

Move Evaluation Tree System

Move Evaluation Tree System Move Evaluation Tree System Hiroto Yoshii hiroto-yoshii@mrj.biglobe.ne.jp Abstract This paper discloses a system that evaluates moves in Go. The system Move Evaluation Tree System (METS) introduces a tree

More information

Heuristic Move Pruning in Monte Carlo Tree Search for the Strategic Card Game Lords of War

Heuristic Move Pruning in Monte Carlo Tree Search for the Strategic Card Game Lords of War Heuristic Move Pruning in Monte Carlo Tree Search for the Strategic Card Game Lords of War Nick Sephton, Peter I. Cowling, Edward Powley, and Nicholas H. Slaven York Centre for Complex Systems Analysis,

More information

The game of Paco Ŝako

The game of Paco Ŝako The game of Paco Ŝako Created to be an expression of peace, friendship and collaboration, Paco Ŝako is a new and dynamic chess game, with a mindful touch, and a mind-blowing gameplay. Two players sitting

More information

Foundations of Artificial Intelligence

Foundations of Artificial Intelligence Foundations of Artificial Intelligence 6. Board Games Search Strategies for Games, Games with Chance, State of the Art Joschka Boedecker and Wolfram Burgard and Bernhard Nebel Albert-Ludwigs-Universität

More information

Improving MCTS and Neural Network Communication in Computer Go

Improving MCTS and Neural Network Communication in Computer Go Improving MCTS and Neural Network Communication in Computer Go Joshua Keller Oscar Perez Worcester Polytechnic Institute a Major Qualifying Project Report submitted to the faculty of Worcester Polytechnic

More information

Small and large MCTS playouts applied to Chinese Dark Chess stochastic game

Small and large MCTS playouts applied to Chinese Dark Chess stochastic game Small and large MCTS playouts applied to Chinese Dark Chess stochastic game Nicolas Jouandeau 1 and Tristan Cazenave 2 1 LIASD, Université de Paris 8, France n@ai.univ-paris8.fr 2 LAMSADE, Université Paris-Dauphine,

More information

An AI for Dominion Based on Monte-Carlo Methods

An AI for Dominion Based on Monte-Carlo Methods An AI for Dominion Based on Monte-Carlo Methods by Jon Vegard Jansen and Robin Tollisen Supervisors: Morten Goodwin, Associate Professor, Ph.D Sondre Glimsdal, Ph.D Fellow June 2, 2014 Abstract To the

More information

Enhancements for Monte-Carlo Tree Search in Ms Pac-Man

Enhancements for Monte-Carlo Tree Search in Ms Pac-Man Enhancements for Monte-Carlo Tree Search in Ms Pac-Man Tom Pepels Mark H.M. Winands Abstract In this paper enhancements for the Monte-Carlo Tree Search (MCTS) framework are investigated to play Ms Pac-Man.

More information

Foundations of AI. 6. Board Games. Search Strategies for Games, Games with Chance, State of the Art

Foundations of AI. 6. Board Games. Search Strategies for Games, Games with Chance, State of the Art Foundations of AI 6. Board Games Search Strategies for Games, Games with Chance, State of the Art Wolfram Burgard, Andreas Karwath, Bernhard Nebel, and Martin Riedmiller SA-1 Contents Board Games Minimax

More information

Sokoban: Reversed Solving

Sokoban: Reversed Solving Sokoban: Reversed Solving Frank Takes (ftakes@liacs.nl) Leiden Institute of Advanced Computer Science (LIACS), Leiden University June 20, 2008 Abstract This article describes a new method for attempting

More information

FreeCiv Learner: A Machine Learning Project Utilizing Genetic Algorithms

FreeCiv Learner: A Machine Learning Project Utilizing Genetic Algorithms FreeCiv Learner: A Machine Learning Project Utilizing Genetic Algorithms Felix Arnold, Bryan Horvat, Albert Sacks Department of Computer Science Georgia Institute of Technology Atlanta, GA 30318 farnold3@gatech.edu

More information

CS 229 Final Project: Using Reinforcement Learning to Play Othello

CS 229 Final Project: Using Reinforcement Learning to Play Othello CS 229 Final Project: Using Reinforcement Learning to Play Othello Kevin Fry Frank Zheng Xianming Li ID: kfry ID: fzheng ID: xmli 16 December 2016 Abstract We built an AI that learned to play Othello.

More information

CSC321 Lecture 23: Go

CSC321 Lecture 23: Go CSC321 Lecture 23: Go Roger Grosse Roger Grosse CSC321 Lecture 23: Go 1 / 21 Final Exam Friday, April 20, 9am-noon Last names A Y: Clara Benson Building (BN) 2N Last names Z: Clara Benson Building (BN)

More information

Game-Playing & Adversarial Search

Game-Playing & Adversarial Search Game-Playing & Adversarial Search This lecture topic: Game-Playing & Adversarial Search (two lectures) Chapter 5.1-5.5 Next lecture topic: Constraint Satisfaction Problems (two lectures) Chapter 6.1-6.4,

More information

Feature Learning Using State Differences

Feature Learning Using State Differences Feature Learning Using State Differences Mesut Kirci and Jonathan Schaeffer and Nathan Sturtevant Department of Computing Science University of Alberta Edmonton, Alberta, Canada {kirci,nathanst,jonathan}@cs.ualberta.ca

More information

Procedural Play Generation According to Play Arcs Using Monte-Carlo Tree Search

Procedural Play Generation According to Play Arcs Using Monte-Carlo Tree Search Proc. of the 18th International Conference on Intelligent Games and Simulation (GAME-ON'2017), Carlow, Ireland, pp. 67-71, Sep. 6-8, 2017. Procedural Play Generation According to Play Arcs Using Monte-Carlo

More information

Learning to Play like an Othello Master CS 229 Project Report. Shir Aharon, Amanda Chang, Kent Koyanagi

Learning to Play like an Othello Master CS 229 Project Report. Shir Aharon, Amanda Chang, Kent Koyanagi Learning to Play like an Othello Master CS 229 Project Report December 13, 213 1 Abstract This project aims to train a machine to strategically play the game of Othello using machine learning. Prior to

More information

CPS 570: Artificial Intelligence Two-player, zero-sum, perfect-information Games

CPS 570: Artificial Intelligence Two-player, zero-sum, perfect-information Games CPS 57: Artificial Intelligence Two-player, zero-sum, perfect-information Games Instructor: Vincent Conitzer Game playing Rich tradition of creating game-playing programs in AI Many similarities to search

More information

Learning from Hints: AI for Playing Threes

Learning from Hints: AI for Playing Threes Learning from Hints: AI for Playing Threes Hao Sheng (haosheng), Chen Guo (cguo2) December 17, 2016 1 Introduction The highly addictive stochastic puzzle game Threes by Sirvo LLC. is Apple Game of the

More information

AN MCTS AGENT FOR EINSTEIN WÜRFELT NICHT! Emanuel Oster. Master Thesis DKE 15-19

AN MCTS AGENT FOR EINSTEIN WÜRFELT NICHT! Emanuel Oster. Master Thesis DKE 15-19 AN MCTS AGENT FOR EINSTEIN WÜRFELT NICHT! Emanuel Oster Master Thesis DKE 15-19 Thesis submitted in partial fulfilment of the requirements for the degree of Master of Science of Artificial Intelligence

More information

Monte-Carlo Tree Search and Minimax Hybrids with Heuristic Evaluation Functions

Monte-Carlo Tree Search and Minimax Hybrids with Heuristic Evaluation Functions Monte-Carlo Tree Search and Minimax Hybrids with Heuristic Evaluation Functions Hendrik Baier and Mark H.M. Winands Games and AI Group, Department of Knowledge Engineering Faculty of Humanities and Sciences,

More information

Approximate matching for Go board positions

Approximate matching for Go board positions Approximate matching for Go board positions Alonso GRAGERA The University of Tokyo, JAPAN alonso@is.s.u-tokyo.ac.jp Abstract. Knowledge is crucial for being successful in playing Go, and this remains true

More information

Artificial Intelligence. Minimax and alpha-beta pruning

Artificial Intelligence. Minimax and alpha-beta pruning Artificial Intelligence Minimax and alpha-beta pruning In which we examine the problems that arise when we try to plan ahead to get the best result in a world that includes a hostile agent (other agent

More information

Virtual Global Search: Application to 9x9 Go

Virtual Global Search: Application to 9x9 Go Virtual Global Search: Application to 9x9 Go Tristan Cazenave LIASD Dept. Informatique Université Paris 8, 93526, Saint-Denis, France cazenave@ai.univ-paris8.fr Abstract. Monte-Carlo simulations can be

More information

Red Shadow. FPGA Trax Design Competition

Red Shadow. FPGA Trax Design Competition Design Competition placing: Red Shadow (Qing Lu, Bruce Chiu-Wing Sham, Francis C.M. Lau) for coming third equal place in the FPGA Trax Design Competition International Conference on Field Programmable

More information

TD-Leaf(λ) Giraffe: Using Deep Reinforcement Learning to Play Chess. Stefan Lüttgen

TD-Leaf(λ) Giraffe: Using Deep Reinforcement Learning to Play Chess. Stefan Lüttgen TD-Leaf(λ) Giraffe: Using Deep Reinforcement Learning to Play Chess Stefan Lüttgen Motivation Learn to play chess Computer approach different than human one Humans search more selective: Kasparov (3-5

More information

Opleiding Informatica

Opleiding Informatica Opleiding Informatica Agents for the card game of Hearts Joris Teunisse Supervisors: Walter Kosters, Jeanette de Graaf BACHELOR THESIS Leiden Institute of Advanced Computer Science (LIACS) www.liacs.leidenuniv.nl

More information

Adversarial Search and Game- Playing C H A P T E R 6 C M P T : S P R I N G H A S S A N K H O S R A V I

Adversarial Search and Game- Playing C H A P T E R 6 C M P T : S P R I N G H A S S A N K H O S R A V I Adversarial Search and Game- Playing C H A P T E R 6 C M P T 3 1 0 : S P R I N G 2 0 1 1 H A S S A N K H O S R A V I Adversarial Search Examine the problems that arise when we try to plan ahead in a world

More information

Adversarial Search: Game Playing. Reading: Chapter

Adversarial Search: Game Playing. Reading: Chapter Adversarial Search: Game Playing Reading: Chapter 6.5-6.8 1 Games and AI Easy to represent, abstract, precise rules One of the first tasks undertaken by AI (since 1950) Better than humans in Othello and

More information

A Study of UCT and its Enhancements in an Artificial Game

A Study of UCT and its Enhancements in an Artificial Game A Study of UCT and its Enhancements in an Artificial Game David Tom and Martin Müller Department of Computing Science, University of Alberta, Edmonton, Canada, T6G 2E8 {dtom, mmueller}@cs.ualberta.ca Abstract.

More information

Probability of Potential Model Pruning in Monte-Carlo Go

Probability of Potential Model Pruning in Monte-Carlo Go Available online at www.sciencedirect.com Procedia Computer Science 6 (211) 237 242 Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science

More information

CUDA Threads. Terminology. How it works. Terminology. Streaming Multiprocessor (SM) A SM processes block of threads

CUDA Threads. Terminology. How it works. Terminology. Streaming Multiprocessor (SM) A SM processes block of threads Terminology CUDA Threads Bedrich Benes, Ph.D. Purdue University Department of Computer Graphics Streaming Multiprocessor (SM) A SM processes block of threads Streaming Processors (SP) also called CUDA

More information

SCRABBLE ARTIFICIAL INTELLIGENCE GAME. CS 297 Report. Presented to. Dr. Chris Pollett. Department of Computer Science. San Jose State University

SCRABBLE ARTIFICIAL INTELLIGENCE GAME. CS 297 Report. Presented to. Dr. Chris Pollett. Department of Computer Science. San Jose State University SCRABBLE AI GAME 1 SCRABBLE ARTIFICIAL INTELLIGENCE GAME CS 297 Report Presented to Dr. Chris Pollett Department of Computer Science San Jose State University In Partial Fulfillment Of the Requirements

More information

Third year Project School of Computer Science University of Manchester Chess Game

Third year Project School of Computer Science University of Manchester Chess Game Third year Project School of Computer Science University of Manchester Chess Game Author: Adrian Moldovan Supervisor: Milan Mihajlovic Degree: MenG Computer Science with IE Date of submission: 28.04.2015

More information

Game Specific Approaches to Monte Carlo Tree Search for Dots and Boxes

Game Specific Approaches to Monte Carlo Tree Search for Dots and Boxes Western Kentucky University TopSCHOLAR Honors College Capstone Experience/Thesis Projects Honors College at WKU 6-28-2017 Game Specific Approaches to Monte Carlo Tree Search for Dots and Boxes Jared Prince

More information

a b c d e f g h i j k l m n

a b c d e f g h i j k l m n Shoebox, page 1 In his book Chess Variants & Games, A. V. Murali suggests playing chess on the exterior surface of a cube. This playing surface has intriguing properties: We can think of it as three interlocked

More information

COMP3211 Project. Artificial Intelligence for Tron game. Group 7. Chiu Ka Wa ( ) Chun Wai Wong ( ) Ku Chun Kit ( )

COMP3211 Project. Artificial Intelligence for Tron game. Group 7. Chiu Ka Wa ( ) Chun Wai Wong ( ) Ku Chun Kit ( ) COMP3211 Project Artificial Intelligence for Tron game Group 7 Chiu Ka Wa (20369737) Chun Wai Wong (20265022) Ku Chun Kit (20123470) Abstract Tron is an old and popular game based on a movie of the same

More information

Artificial Intelligence. Topic 5. Game playing

Artificial Intelligence. Topic 5. Game playing Artificial Intelligence Topic 5 Game playing broadening our world view dealing with incompleteness why play games? perfect decisions the Minimax algorithm dealing with resource limits evaluation functions

More information

A Complex Systems Introduction to Go

A Complex Systems Introduction to Go A Complex Systems Introduction to Go Eric Jankowski CSAAW 10-22-2007 Background image by Juha Nieminen Wei Chi, Go, Baduk... Oldest board game in the world (maybe) Developed by Chinese monks Spread to

More information

The Combinatorial Multi-Armed Bandit Problem and Its Application to Real-Time Strategy Games

The Combinatorial Multi-Armed Bandit Problem and Its Application to Real-Time Strategy Games Proceedings of the Ninth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment The Combinatorial Multi-Armed Bandit Problem and Its Application to Real-Time Strategy Games Santiago

More information

TEMPORAL DIFFERENCE LEARNING IN CHINESE CHESS

TEMPORAL DIFFERENCE LEARNING IN CHINESE CHESS TEMPORAL DIFFERENCE LEARNING IN CHINESE CHESS Thong B. Trinh, Anwer S. Bashi, Nikhil Deshpande Department of Electrical Engineering University of New Orleans New Orleans, LA 70148 Tel: (504) 280-7383 Fax:

More information

Games and Adversarial Search II

Games and Adversarial Search II Games and Adversarial Search II Alpha-Beta Pruning (AIMA 5.3) Some slides adapted from Richard Lathrop, USC/ISI, CS 271 Review: The Minimax Rule Idea: Make the best move for MAX assuming that MIN always

More information

Generalized Game Trees

Generalized Game Trees Generalized Game Trees Richard E. Korf Computer Science Department University of California, Los Angeles Los Angeles, Ca. 90024 Abstract We consider two generalizations of the standard two-player game

More information

ON THE TACTICAL AND STRATEGIC BEHAVIOUR OF MCTS WHEN BIASING RANDOM SIMULATIONS

ON THE TACTICAL AND STRATEGIC BEHAVIOUR OF MCTS WHEN BIASING RANDOM SIMULATIONS On the tactical and strategic behaviour of MCTS when biasing random simulations 67 ON THE TACTICAL AND STATEGIC BEHAVIOU OF MCTS WHEN BIASING ANDOM SIMULATIONS Fabien Teytaud 1 Julien Dehos 2 Université

More information

SEARCHING is both a method of solving problems and

SEARCHING is both a method of solving problems and 100 IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. 3, NO. 2, JUNE 2011 Two-Stage Monte Carlo Tree Search for Connect6 Shi-Jim Yen, Member, IEEE, and Jung-Kuei Yang Abstract Recently,

More information

Mind Ninja The Game of Boundless Forms

Mind Ninja The Game of Boundless Forms Mind Ninja The Game of Boundless Forms Nick Bentley 2007-2008. email: nickobento@gmail.com Overview Mind Ninja is a deep board game for two players. It is 2007 winner of the prestigious international board

More information

Game Playing for a Variant of Mancala Board Game (Pallanguzhi)

Game Playing for a Variant of Mancala Board Game (Pallanguzhi) Game Playing for a Variant of Mancala Board Game (Pallanguzhi) Varsha Sankar (SUNet ID: svarsha) 1. INTRODUCTION Game playing is a very interesting area in the field of Artificial Intelligence presently.

More information

5.4 Imperfect, Real-Time Decisions

5.4 Imperfect, Real-Time Decisions 5.4 Imperfect, Real-Time Decisions Searching through the whole (pruned) game tree is too inefficient for any realistic game Moves must be made in a reasonable amount of time One has to cut off the generation

More information

43.1 Introduction. Foundations of Artificial Intelligence Introduction Monte-Carlo Methods Monte-Carlo Tree Search. 43.

43.1 Introduction. Foundations of Artificial Intelligence Introduction Monte-Carlo Methods Monte-Carlo Tree Search. 43. May 6, 20 3. : Introduction 3. : Introduction Malte Helmert University of Basel May 6, 20 3. Introduction 3.2 3.3 3. Summary May 6, 20 / 27 May 6, 20 2 / 27 Board Games: Overview 3. : Introduction Introduction

More information

UNIT 13A AI: Games & Search Strategies. Announcements

UNIT 13A AI: Games & Search Strategies. Announcements UNIT 13A AI: Games & Search Strategies 1 Announcements Do not forget to nominate your favorite CA bu emailing gkesden@gmail.com, No lecture on Friday, no recitation on Thursday No office hours Wednesday,

More information

CPS331 Lecture: Search in Games last revised 2/16/10

CPS331 Lecture: Search in Games last revised 2/16/10 CPS331 Lecture: Search in Games last revised 2/16/10 Objectives: 1. To introduce mini-max search 2. To introduce the use of static evaluation functions 3. To introduce alpha-beta pruning Materials: 1.

More information

How AI Won at Go and So What? Garry Kasparov vs. Deep Blue (1997)

How AI Won at Go and So What? Garry Kasparov vs. Deep Blue (1997) How AI Won at Go and So What? Garry Kasparov vs. Deep Blue (1997) Alan Fern School of Electrical Engineering and Computer Science Oregon State University Deep Mind s vs. Lee Sedol (2016) Watson vs. Ken

More information

CS 771 Artificial Intelligence. Adversarial Search

CS 771 Artificial Intelligence. Adversarial Search CS 771 Artificial Intelligence Adversarial Search Typical assumptions Two agents whose actions alternate Utility values for each agent are the opposite of the other This creates the adversarial situation

More information

Adversarial Search. Soleymani. Artificial Intelligence: A Modern Approach, 3 rd Edition, Chapter 5

Adversarial Search. Soleymani. Artificial Intelligence: A Modern Approach, 3 rd Edition, Chapter 5 Adversarial Search CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2017 Soleymani Artificial Intelligence: A Modern Approach, 3 rd Edition, Chapter 5 Outline Game

More information

Artificial Intelligence Search III

Artificial Intelligence Search III Artificial Intelligence Search III Lecture 5 Content: Search III Quick Review on Lecture 4 Why Study Games? Game Playing as Search Special Characteristics of Game Playing Search Ingredients of 2-Person

More information

Ar#ficial)Intelligence!!

Ar#ficial)Intelligence!! Introduc*on! Ar#ficial)Intelligence!! Roman Barták Department of Theoretical Computer Science and Mathematical Logic So far we assumed a single-agent environment, but what if there are more agents and

More information

CS 387: GAME AI BOARD GAMES

CS 387: GAME AI BOARD GAMES CS 387: GAME AI BOARD GAMES 5/28/2015 Instructor: Santiago Ontañón santi@cs.drexel.edu Class website: https://www.cs.drexel.edu/~santi/teaching/2015/cs387/intro.html Reminders Check BBVista site for the

More information

CS 4700: Foundations of Artificial Intelligence

CS 4700: Foundations of Artificial Intelligence CS 4700: Foundations of Artificial Intelligence selman@cs.cornell.edu Module: Adversarial Search R&N: Chapter 5 Part II 1 Outline Game Playing Optimal decisions Minimax α-β pruning Case study: Deep Blue

More information

UNIT 13A AI: Games & Search Strategies

UNIT 13A AI: Games & Search Strategies UNIT 13A AI: Games & Search Strategies 1 Artificial Intelligence Branch of computer science that studies the use of computers to perform computational processes normally associated with human intellect

More information