
AN ABSTRACT OF THE THESIS OF

Paul Lewis for the degree of Master of Science in Computer Science presented on June 1, 2010.

Title: Ensemble Monte-Carlo Planning: An Empirical Study

Abstract approved: Alan Fern

Monte-Carlo planning algorithms such as UCT make decisions at each step by intelligently expanding a single search tree given the available time and then selecting the best root action. Recent work has provided evidence that it can be advantageous to instead construct an ensemble of search trees and make a decision according to a weighted vote. However, these prior investigations have only considered the application domains of Go and Solitaire and were limited in the scope of ensemble configurations considered. In this paper, we conduct a large-scale empirical study of ensemble Monte-Carlo planning using the UCT algorithm in a set of five additional diverse and challenging domains. In particular, we evaluate the advantages of a broad set of ensemble configurations in terms of space and time efficiency in both parallel and sequential time models. Our results show that ensembles are an effective way to improve performance given a parallel model, can significantly reduce space requirements and in some cases may improve performance in a sequential model. Additionally, from our work we produced an open-source planning library.

© Copyright by Paul Lewis, June 1, 2010. All Rights Reserved

Ensemble Monte-Carlo Planning: An Empirical Study

by Paul Lewis

A THESIS submitted to Oregon State University in partial fulfillment of the requirements for the degree of Master of Science

Presented June 1, 2010
Commencement June 2011

Master of Science thesis of Paul Lewis presented on June 1, 2010.

APPROVED:

Major Professor, representing Computer Science

Director of the School of Electrical Engineering and Computer Science

Dean of the Graduate School

I understand that my thesis will become part of the permanent collection of Oregon State University libraries. My signature below authorizes release of my thesis to any reader upon request.

Paul Lewis, Author

TABLE OF CONTENTS

1 Introduction
2 Monte Carlo Planning with Sparse UCT
3 Ensemble UCT
  3.1 Motivation for Ensembles
    3.1.1 Parallelization
    3.1.2 Variance Reduction
    3.1.3 Space and Time Complexity
4 Related Work
5 A Generic Java Planning Library
  5.1 Running Tests
  5.2 Adding New Domains
6 Description and Evaluation of Domains
  6.1 Backgammon
  6.2 Biniax
  6.3 Connect 4
  6.4 Havannah
  6.5 Yahtzee
7 Empirical Results
  7.1 Experimental Setup
  7.2 General Results
  7.3 Specific Observations
    7.3.1 Biniax
    7.3.2 Backgammon
    7.3.3 Connect 4
    7.3.4 Havannah
    7.3.5 Yahtzee
  7.4 Parameter Sensitivity
8 Summary
Bibliography

LIST OF FIGURES

2.1 SparseUCT(s0, t, c, ss)
2.2 UCT Example Part 1
2.3 UCT Example Part 2
3.1 EnsembleUCT(s0, t, c, ss, e)
3.2 Various Ensemble Methods
5.1 Test File backgammon random uct
5.2 Output File backgammon random uct results
5.3 Test File 2
5.4 Test File 3
6.1 Comparison of Domains
6.2 Backgammon Board [9]
6.3 Biniax Examples
6.4 Havannah Board
7.1 Yahtzee UCT Constant Table
7.2 Connect 4 Ensemble Timing Table (ms)
7.3 Biniax Sparseness
7.4 Biniax UCT Ensemble Table
7.5 Backgammon Ensemble Table
7.6 Connect 4 Ensemble Table
7.7 Havannah Ensemble Table
7.8 Yahtzee Ensemble Table
7.9 Alternate Yahtzee Ensemble Table
7.10 Connect 4 Ensemble Parameter Sensitivity
7.11 Yahtzee Ensemble Parameter Sensitivity

Chapter 1 Introduction

UCT is a Monte-Carlo planning algorithm [10] which extends recent algorithms for multi-armed bandit problems to sequential decision problems, including general Markov Decision Processes and games. The algorithm has gained stature as the premier computer algorithm for the game of Go [6]. UCT uses Monte-Carlo methods combined with an evaluation function to construct a search tree and select a best root action. There has also been recent work on the benefits of various parallel UCT methods in the domain of Go [4]. One of these methods, root parallelization, builds multiple trees from a common root state and combines the root evaluation into a weighted vote from each tree. This method is similar to bagging methods for classifiers. It has been shown to have parallel time benefits and in some cases sequential time benefits. We evaluate these results with an abstract software structure that can run a general-purpose UCT algorithm on any one of five distinct domains: Biniax, Backgammon, Connect 4, Havannah and Yahtzee. With our software we were able to compare the performance of standard UCT methods to that of the ensemble method, root parallelization. We were able to show that across a variety of domains ensemble UCT is usually an effective tool for boosting performance in parallel time, will always conserve memory in sequential time and in certain circumstances may boost performance in sequential time. We are also able to show that as a tree grows the time to add each new trajectory

increases. This finding increases the incentive to use ensembles, as they keep the tree size low while boosting performance. We also found that ensembles work well with suboptimal values of the UCT constant, C, but work best with the best value of C for base UCT. The implication is that ensembles can be an easy extension to an already well-tuned UCT implementation.

Chapter 2 Monte Carlo Planning with Sparse UCT

The UCT algorithm uses Monte-Carlo methods to construct a game tree that balances exploration with exploitation. The basic algorithm only has two parameters to adjust: the number of roll-out trajectories and the UCT constant. The UCT constant, C, is a domain-dependent parameter that controls exploration; larger values of C increase exploration. The greater the number of trajectories (roll-outs), the more data UCT collects and the stronger it becomes. The algorithm is guaranteed to converge on the optimal solution given enough roll-outs, and a well-chosen C value will allow UCT to converge more quickly. The base UCT algorithm is also domain-independent in that it can be an effective tool even without any domain heuristic. Given a current state, UCT selects an action by constructing a sparse lookahead tree over the state space with the current state as the root and leaf nodes corresponding to terminal states. If the domain is deterministic then the edges may represent actions from one state node to another. If the domain is stochastic then the tree will contain alternating state and action nodes. Our implementation uses both state and action nodes as this works in both deterministic and stochastic domains. Each node, n, in the resulting tree stores the number of times visited and a cumulative reward. The value estimate of any node can be computed by dividing the cumulative reward by the number of visits, as shown by the following formula:

Q_UCT(n) = n.rewards / n.visits

UCT is distinct in the way that it constructs the tree and estimates action values. Unlike standard minimax search and sparse sampling [8], which typically build depth-bounded trees and apply evaluation functions at the leaves, UCT does not impose a depth bound and does not require an evaluation function. Rather, UCT incrementally constructs a tree and updates action values by carrying out a sequence of Monte-Carlo roll-outs of entire decision-making sequences starting from the root and ending at a terminal state. The key idea behind UCT is to intelligently bias the roll-out trajectories toward ones that appear more promising based on previous trajectories, while maintaining sufficient exploration. Each roll-out begins at the root and actions are selected via the following process. If the current state contains actions that have not yet been explored in previous roll-outs, then a random action is chosen from among the unselected actions. Otherwise, if all actions in the current state have been explored previously, then UCT selects the action that maximizes an upper confidence bound given by the following formula, where s is a state and a is a legal action taken from that state:

Q_UCT(s, a) = Q_UCT(a) + C * sqrt( log(s.visits) / a.visits )

The first term of the formula greedily selects the current best action while the second term gives a higher value to actions that have not been explored much.
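As a concrete illustration of this selection rule, the following Java sketch computes the upper confidence value and picks an action. The ActionNode bookkeeping fields and class names are hypothetical stand-ins for the structures described above, not code from the thesis library.

import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

// Sketch of UCT action selection: pick a random unsampled action if one exists,
// otherwise maximize Q_UCT(a) + C * sqrt(log(s.visits) / a.visits).
class ActionNode {
    double rewards;   // cumulative reward backed up through this action
    int visits;       // number of times this action has been selected
}

class UctSelection {
    private final double c;            // UCT exploration constant
    private final Random rng = new Random();

    UctSelection(double c) { this.c = c; }

    double uctValue(ActionNode a, int stateVisits) {
        double exploit = a.rewards / a.visits;                            // Q_UCT(a)
        double explore = c * Math.sqrt(Math.log(stateVisits) / a.visits); // exploration bonus
        return exploit + explore;
    }

    ActionNode select(List<ActionNode> actions, int stateVisits) {
        List<ActionNode> unsampled =
                actions.stream().filter(a -> a.visits == 0).collect(Collectors.toList());
        if (!unsampled.isEmpty()) {
            return unsampled.get(rng.nextInt(unsampled.size()));  // random unsampled action
        }
        ActionNode best = actions.get(0);
        for (ActionNode a : actions) {
            if (uctValue(a, stateVisits) > uctValue(best, stateVisits)) best = a;
        }
        return best;                                               // argmax of the UCT value
    }
}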

The other decision that needs to be made while traversing the tree is at the action nodes. If the domain is deterministic then each action node trivially leads to a single state node. However, in stochastic domains a given action may lead to one of several state nodes. In this case base UCT will call the domain simulator to return the next state, s, from the current state and the chosen action. If the current action node already contains s then the algorithm will use that node; otherwise a new node is created. In this way duplicate states from a given action are not generated. Some domains are highly stochastic. If the number of possible states per action is large compared to the number of roll-out trajectories, then each time an action node is visited the simulator will generate a new s and no state from that action node will be visited more than once. This is where we introduce the concept of sparse sampling, which is effective in these domains. All sparse sampling does is put a limit on the number of times an action node calls the simulator to generate a new s. Once the limit is reached, the action node will randomly select among the already-present state nodes. The state nodes are weighted based on how many times they were originally visited. Thus, if all the nodes were visited once except for one node that was visited twice, the node that was visited twice will have a probability of being selected equal to 2/sparseSampleSize while all the other nodes will have a probability of 1/sparseSampleSize. The last point to make about the algorithm is that once it adds a new leaf node (leaf nodes are always state nodes) it will play a random complete game and return the result. The resultant reward is passed back up the tree and added to each node visited. If a leaf node in the tree is a terminal state then no random game is played and

the reward from that terminal state is passed back up the tree. The sparse UCT algorithm is given in figure 2.1. Note that sparse UCT is the same as normal UCT when the sparse sample size is set to infinity. We present an example to show how sparse UCT would build a tree. Let's assume that our domain is stochastic with 2 competing agents and that there is a reward of 1 for winning and -1 for losing. In figure 2.2, blue circles are state nodes, red circles are action nodes and the value inside each node represents the cumulative reward on the left of the colon and the number of visits on the right. The tree would start with a single state node. From here UCT would play a random game. The result of the game could be one of three values: [0, 0] (draw), [1, -1] (player 1 wins) or [-1, 1] (player 2 wins). Part 1 of the example shows player 1 winning and part 2 shows how the root node gets updated appropriately. Part 3 shows how the next roll-out chooses a random action from the pool of unvisited actions, the simulator selects a state from that action, and a random game is then played from the new leaf state. Part 4 shows the updated nodes that were visited by the last trajectory. Note that since this is a two-player game where the turns alternate, the leaf state updates its reward from the first value while the parent action and state update their reward from the second value, which represents the other player's reward. Assuming that there are still unvisited nodes, part 5 from figure 2.3 shows a new action and state added to the tree and another random game being played. The visited nodes in part 6 are similarly updated. If we assume that the root state has no other unvisited action then it will choose between the current two for its next roll-out.

Figure 2.1: SparseUCT(s0, t, c, ss)

Input: s0 initial state; t number of trajectories; c UCT constant; ss sparse sample size
Output: values for each action in s0

1: for i = 1 to t do
2:   s <- s0
3:   while s is not a terminal state do
4:     if all actions of s have been sampled then
5:       a <- argmax_a Q_UCT(s, a)
6:     else
7:       a <- random unsampled action from s
8:     end if
9:     if the sparse sample limit ss has been reached for a then
10:      s <- select weighted random child of a
11:    else
12:      s <- transition(s, a)
13:      if a does not have child s then
14:        create new state node s for a
15:        simulateRandomGameToEnd(s)
16:      end if
17:    end if
18:  end while
19:  increment visits and add terminal state reward to all visited nodes
20: end for
21: return array of a.rewards/a.visits for each action in s0

Figure 2.2: UCT Example Part 1 (parts 1-4 of the tree-building example; each node shows cumulative reward : visits)

Since both were visited once and the one on the left has a higher cumulative reward, it will be selected for part 7. Two more nodes are added and part 8 shows the resulting updated tree. If another simulation were run then, in this scenario, the right action node would be selected from the root, since both nodes have the same cumulative reward but the right node has been visited fewer times.

Figure 2.3: UCT Example Part 2 (parts 5-8 of the tree-building example; each node shows cumulative reward : visits)
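To make the sparse-sampling step concrete, here is a minimal Java sketch of how an action node might cap its simulator calls at ss and afterwards reuse existing children, weighted by how often each was generated. The class, the generic successor type, and the Supplier standing in for the domain simulator are illustrative assumptions rather than the library's actual interfaces.

import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.function.Supplier;

// Sketch of sparse sampling at an action node: at most ss simulator calls,
// after which an existing child is reused with probability count/ss.
// The successor type S is assumed to implement equals/hashCode so that
// duplicate states generated by the simulator are merged, not stored twice.
class SparseActionNode<S> {
    final Map<S, Integer> childCounts = new HashMap<>(); // successor state -> times generated
    int samples = 0;                                      // simulator calls made so far
    private final Random rng = new Random();

    S nextState(int ss, Supplier<S> simulatorTransition) {
        if (samples < ss) {
            S s = simulatorTransition.get();              // ask the simulator for a successor
            childCounts.merge(s, 1, Integer::sum);        // duplicates only increase the weight
            samples++;
            return s;
        }
        // Limit reached: reuse an existing child, weighted by how often it was generated.
        int pick = rng.nextInt(samples);
        int running = 0;
        for (Map.Entry<S, Integer> e : childCounts.entrySet()) {
            running += e.getValue();
            if (pick < running) return e.getKey();
        }
        throw new IllegalStateException("unreachable");
    }
}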

Chapter 3 Ensemble UCT

Ensemble UCT methods generate multiple UCT trees from a common root state. There are several ways to use the multiple trees to vote on the root actions. We used the root parallelization method (figure 3.1): after all the trees are generated, the visits and rewards of each root action in each tree are added together and then the average for each action is computed. This is the same as taking a weighted vote, as shown by the following identity:

(r_1 + r_2 + ... + r_n) / (v_1 + v_2 + ... + v_n)
    = [v_1 / (v_1 + ... + v_n)] * (r_1 / v_1) + ... + [v_n / (v_1 + ... + v_n)] * (r_n / v_n)

When we started working with ensembles, one of our initial directions was to evaluate various ensemble techniques. All of the methods that we looked into used the root actions of several trees; the evaluation criterion was the only thing that changed. In this way we were also able to compare evaluations in different circumstances and see how they differed. Our candidates were root parallelization, root ensemble, plurality vote, instant runoff vote and Borda count vote. Root ensemble took the plain average of the per-tree value estimates rather than the weighted average that root parallelization used.
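The difference between the two combination rules can be made concrete with a small Java sketch: root parallelization pools rewards and visits before dividing (the weighted vote above), while root ensemble averages the per-tree value estimates directly. The array layout here is an assumption made for illustration, not the library's representation; whichever rule is used, the action with the highest combined value is chosen at the root.

// Sketch: combining the root statistics of n trees for one action.
// rewards[i] and visits[i] are that action's totals in tree i.
final class RootCombiners {
    /** Root parallelization: weighted vote, sum of rewards over sum of visits. */
    static double rootParallelization(double[] rewards, int[] visits) {
        double r = 0;
        long v = 0;
        for (int i = 0; i < rewards.length; i++) { r += rewards[i]; v += visits[i]; }
        return v > 0 ? r / v : Double.NEGATIVE_INFINITY;   // ignore actions never visited
    }

    /** Root ensemble: unweighted average of the per-tree value estimates. */
    static double rootEnsemble(double[] rewards, int[] visits) {
        double sum = 0;
        int n = 0;
        for (int i = 0; i < rewards.length; i++) {
            if (visits[i] > 0) { sum += rewards[i] / visits[i]; n++; }
        }
        return n > 0 ? sum / n : Double.NEGATIVE_INFINITY;
    }
}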

Figure 3.1: EnsembleUCT(s0, t, c, ss, e)

Input: s0 initial state; t number of trajectories; c UCT constant; ss sparse sample size; e number of ensembles
Output: action

1: for i = 1 to e do
2:   {r and v are arrays of rewards and visits to each action node}
3:   (r, v) <- SparseUCT(s0, t, c, ss)
4:   for j = 1 to number of children of the root do
5:     sumr_j += r_j
6:     sumv_j += v_j
7:   end for
8: end for
9: return action with highest average reward sumr_j / sumv_j, where sumv_j > 0

root ensemble = r_1/v_1 + r_2/v_2 + ... + r_n/v_n

With plurality vote each tree selects the single best action and the action that has the most votes is selected; ties are decided randomly. Instant runoff and Borda count are more complex voting systems: instant runoff has rounds of removing the worst action and giving its votes to another action, while Borda count assigns points based on rank and selects the action with the most overall points. Figure 3.2 shows some early tests we ran on the Connect 4 domain with base UCT at 4096 trajectories against the various ensemble methods, where the number of ensemble trees was fixed at 8. In this case root parallelization performs the best, plurality vote is close behind, and root ensemble does poorly.

Figure 3.2: Various Ensemble Methods

Trajectories  Root Parallelization  Root Ensemble  Plurality Vote  Instant Runoff Vote  Borda Count Vote
512           -0.443                -0.584         -0.536          -0.576               -0.613
1024          -0.266                -0.379         -0.274          -0.320               -0.360
2048          -0.007                -0.174         -0.096          -0.080               -0.160
4096           0.271                 0.074          0.212           0.153                0.066

From other tests we performed in various domains, we got a clear picture that the root ensemble method was weaker than root parallelization and that, even though some of the other voting methods showed some promise, root parallelization was overall the strongest.

3.1 Motivation for Ensembles

3.1.1 Parallelization

From the ensemble algorithm in figure 3.1 one can see that each iteration of the outermost loop, which generates one UCT tree, can be run in parallel. Once all the trees have been generated and the rewards and visits of each root action node have been summed, a short evaluation of the action with the best average reward can be made. The importance of running the algorithm in parallel is that any kind of time speedup with ensembles indicates that it may be useful in a cluster environment when all the processors need to be utilized.
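The sketch below shows, in Java, one way the outer loop could be distributed across worker threads with only the root statistics merged at the end. The RootStats holder, the buildOneTree callable and the thread count are illustrative assumptions; as noted in chapter 7, the library itself runs a single thread.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: building e UCT trees in parallel and merging their root statistics
// with the root parallelization (weighted vote) rule.
final class ParallelEnsembleUct {
    /** Hypothetical per-tree result: rewards[j] and visits[j] for each root action j. */
    record RootStats(double[] rewards, int[] visits) {}

    static int chooseAction(int e, int numActions, Callable<RootStats> buildOneTree)
            throws Exception {
        int threads = Math.min(e, Runtime.getRuntime().availableProcessors());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<RootStats>> futures = new ArrayList<>();
            for (int i = 0; i < e; i++) {
                futures.add(pool.submit(buildOneTree));   // one SparseUCT run per task
            }
            double[] sumR = new double[numActions];
            long[] sumV = new long[numActions];
            for (Future<RootStats> f : futures) {         // merge root rewards and visits
                RootStats s = f.get();
                for (int j = 0; j < numActions; j++) {
                    sumR[j] += s.rewards()[j];
                    sumV[j] += s.visits()[j];
                }
            }
            int best = -1;
            double bestValue = Double.NEGATIVE_INFINITY;
            for (int j = 0; j < numActions; j++) {        // weighted vote: sumR / sumV
                if (sumV[j] > 0 && sumR[j] / sumV[j] > bestValue) {
                    bestValue = sumR[j] / sumV[j];
                    best = j;
                }
            }
            return best;
        } finally {
            pool.shutdown();
        }
    }
}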

3.1.2 Variance Reduction

The idea of ensemble planning is related to the notion of bagging in machine learning, where multiple independent classifiers are learned based on bootstrap training sets. Bagging has been shown to significantly improve performance for learning algorithms that have high variance [2]. As UCT builds a tree, the most promising actions get explored the most. Sometimes the optimal action gets unlucky early on and it takes a while to catch up because many more trajectories are being used to explore suboptimal actions. If the chance of selecting the optimal action at a given state is greater than that of all other actions, then ensembles over multiple trees reduce the variance in selecting that action. UCT converges on the optimal action, but this may take a large number of trajectories. However, if a fixed number of trajectories has a good chance of selecting the optimal action, then averaging over multiple trees will increase the chance of selecting the optimal action. So there may be cases where more improvement can be gained in the same amount of time by using ensembles than by increasing the number of trajectories. This motivates using ensembles to improve sequential time UCT algorithms. Any of the ensemble methods we tried would have the effect of reducing the variance. However, root parallelization weighted the votes, with more weight given to actions that ran more simulations. This is important so that outliers with few trajectories don't pull the results away from the optimal action.

3.1.3 Space and Time Complexity

Ensembles are an alternative to creating larger trees: instead, multiple smaller trees are built. The smaller trees take less memory, which is important for UCT. The amount of memory increases linearly with the amount of time taken to run the base algorithm because a new node is added to the tree with each trajectory. Also, bigger trees take longer to traverse before a random game can be played. This increases the time of each new trajectory, so that a tree of double the size takes more than double the time to run. This means that ensembles have an extra saving in time.

Chapter 4 Related Work

Our work builds on a few papers that started to explore parallelization methods for the UCT algorithm. A wide variety of methods were tried, including complex message passing [3]. Some methods related to these were applied to the domain of Go [5]. More recently the root parallelization method was introduced alongside a few other methods [4]. This work was part of an evaluation of parallel UCT methods. The root parallelization method that we are investigating was shown to be effective not only in parallel time but also in sequential time. The units of measure used were Games-Per-Second (GPS) and strength speedup. The GPS is the increase in the number of complete UCT roll-outs performed per second. For root parallelization this means that 1 thread has a GPS of 1 while 4 threads each building a single tree have a GPS of 4. The strength speedup was a direct measure of how much stronger the algorithm performed compared to an equivalent single-threaded algorithm. So a strength speedup of 2 would mean that the algorithm has the same strength as a single-threaded program that takes 2 times as long to run. The results for a two-threaded program were a GPS of 2 and a strength speedup of 3; a four-threaded program had a GPS of 4 with a strength speedup of 6.5. These results showed it is possible for ensembles to increase the strength of the algorithm by more than the time speedup, and thus they may be useful for sequential time speedup as well.

The root parallelization method is similar to bagging predictors used in bootstrap learning [2]. Bagging methods run a classifier multiple times and average over the results. In cases where the classifier was highly unstable (there was a lot of variance) bagging worked best. In cases where the classifiers were stable or very accurate bagging was fairly useless, although it did not hurt the results. One remark from this paper was the ease of parallelizing the bagging method. No communication between CPUs would be needed because each classifier could be run independently and then averaged over in the end. Prior work in the context of solitaire introduced a family of algorithms that contained UCT, HOP-UCT, and Ensemble UCT [1]. The focus there was on achieving the best results in the domain of Klondike solitaire. Some advantages were noted for the ensemble approach, but the evaluation was not exhaustive in that it focused on optimizing a single domain.

Chapter 5 A Generic Java Planning Library

Our planning library was built using a Java code base that allows new agents and domains to be easily added and to interact with the already present agents and domains. The code base is open source under the GPL 2.0 license and can be found at Beaversource. There are currently six domains and four agents included in the library: the agents are Random, Human, UCT and Expectimax; the domains are Backgammon, Biniax, Connect 4, EWN, Havannah and Yahtzee. The original purpose of this library was to allow a generic UCT algorithm to play on a myriad of domains. The library ended up growing into a general-purpose library for any agent that can function in turn-based, discrete, fully observable, static domains. The library does not currently support simultaneous or partially observable domains. The base class structure consists of the following classes: Action, State, Simulator, Agent and Arbiter. The Action and State classes abstractly represent an action and a state in a given domain, respectively. These classes typically just hold information about the action and state with some accessor methods. It is typical for these classes to be immutable. The Simulator class is tied to a specific state and action and computes the legal actions and rewards at a given state and the state transition given an action. The simulator is used by the agents to explore the domain and by the arbiter to regulate a game. The Agent class's main method is

selectAction, which takes a state and simulator as parameters and returns an action. The Arbiter class calls the selectAction method on each Agent when it is time to make a move. The Arbiter class is what governs the legal play of a game. It takes in the agents and simulators needed to play a game. It also records some statistics, such as time taken to move. The Arbiter class can run many games in the same domain against the same set of agents and collect statistics on average reward. In order to keep things fair it rotates the move order of agents each game.

5.1 Running Tests

The UCTProject class is used to run tests. This class takes zero, one or two arguments. If it is run with no arguments then it runs in interactive mode. In this mode the user selects a simulator and agents. This is the only mode in which the human agent can be used. If the human agent is run, the user inputs actions and gets feedback about the current state and legal actions. If there are no human agents then the single game is simulated and the history and results are shown at the end. If a single argument is passed then it is the file path of a test file. An example test file is given in figure 5.1. All whitespace and comments (any line starting with a #) are ignored. The first line read in (that isn't a comment or whitespace) indicates the number of complete games to be simulated. The second line indicates the simulator that will be passed to the arbiter. The third line indicates the simulator that each agent will use to make decisions. The reason this distinction is made is

that an agent may not have access to the complete real-world simulator but only a close approximation. The next two lines are used to specify the agents to run in the domain. Each line is given to an agent and the parameters for that agent are separated by whitespace. Note that if a domain such as Backgammon requires two agents, an exception is thrown if not enough agents are included in the test file; if more than two agents are specified, only the first two are used. The test file in figure 5.1 will play 2000 games in the domain of Backgammon between a Random agent and a UCT agent with the number of trajectories set to 1024, C = 1, sparse sample size set to infinity and the number of ensembles set to 1.

Figure 5.1: Test File backgammon random uct

#Number of Simulated Games
2000
#World
Backgammon
#Simulated World
Backgammon
#Agents
Random
UCT 1024 1 -1 1 ROOT_PARALLELIZATION

The output of the test from figure 5.1, backgammon random uct, will be appended to the file backgammon random uct results (figure 5.2). All output is appended to a file with the name of the input file plus "results" attached to the end of the name. If that file does not already exist it will be created. Figure 5.2 shows what

the output of figure 5.1 might look like.

Figure 5.2: Output File backgammon random uct results

#2000 - BackgammonSimulator - BackgammonSimulator
-0.994, 0.109, 0.014, 0.023, Random
0.994, 0.109, 437.494, 91.069, UCT, 1024, 1, -1, 1, ROOT_PARALLELIZATION

The first line is just the first 3 lines of the test file put onto one line. Each line that follows is given to an agent in the tests. The first value is the average reward while the second value is the standard deviation of the average reward. The next two values are the average time per move and the standard deviation in time per move, respectively. Then at the end are appended the agent name and the parameters it used. If a second argument is passed to UCTProject then that second argument is the name of the output file instead of the default name. Another useful feature of the test files is shown in figure 5.3. This time the domain is Yahtzee and, since it is a single-agent domain, only one agent is given. However, the number of trajectories that UCT takes in as a parameter is given by an array of values, [128,256,512,1024], instead of a single value. What this does is run 4 separate tests and append the results of each test to the same output file. Each test will use a different value for the number of trajectories. Multiple parameters can be specified in this manner, as shown in figure 5.4. In this case all possible combinations of the parameters are run. Thus there will be output for the following pairs of parameters in the order given: (128,2), (256,2), (512,2), (1024,2), (128,4), (256,4), (512,4), (1024,4). Instead of writing [2,4,6,8,10] you may input

[2:4:10]. The first and last values are the first and last values to test, respectively. The difference between the first and last values divided by the middle value is the step value. Thus for [2:4:10] it would be (10 - 2)/4 = 2 and the resulting values would be [2,4,6,8,10].

Figure 5.3: Test File 2

#Number of Simulated Games
2000
#World
Yahtzee
#Simulated World
Yahtzee
#Agents
UCT [128,256,512,1024] 64 -1 2 ROOT_PARALLELIZATION

5.2 Adding New Domains

Any new domain that is added to this library will need to meet a few requirements. All domains consist of at least three parts: a simulator, a state class and an action class. There must be one top-level state and action class that inherit from the State and Action classes, respectively. If the domain has multiple types of states or actions, these can in turn inherit from the top-level action and state class. For instance, in the domain of Yahtzee there is a single state class, YahtzeeState, that inherits from State, and an action class, YahtzeeAction, that inherits from the Action class.

Figure 5.4: Test File 3

#Number of Simulated Games
2000
#World
Yahtzee
#Simulated World
Yahtzee
#Agents
UCT [128,256,512,1024] 64 -1 [2,4] ROOT_PARALLELIZATION

Since in Yahtzee there are two different actions that can be performed, namely selecting dice to re-roll or selecting a category to score, Yahtzee has two other action classes, YahtzeeRollAction and YahtzeeSelectAction, that inherit from YahtzeeAction. YahtzeeAction is never used as an object. The only reason to have it is so that the Yahtzee simulator, YahtzeeSimulator, can use generics to specify that it only uses actions of type YahtzeeAction and states of type YahtzeeState. This protects the simulator from being misused by passing in a state or action that is not a Yahtzee state or action. Another thing to keep in mind is that this library was designed to use immutable state and action objects (objects that hold data that can be accessed by other objects but cannot be modified). Instead of changing a state, a new state object is created to replace the old one. This has advantages and disadvantages. If a mutable action or state object is created then some of the basic simulator methods

also need to be overridden, since they assume that the objects are immutable. The simulator class itself is a mutable object. It contains the methods for taking an action and computing the rewards and legal actions at a given state. A simulator has a method takeAction that replaces its current state with a new state object based on the passed-in action. It is important for a simulator to check that the passed-in action is indeed a legal action from that state, or there could be problems. Also, the simulator keeps a record of the legal actions and rewards array for the current state. These should be updated whenever the state changes. The getLegalActions and getRewards methods don't need to be overridden as they just return these objects. This means that it is desirable to make some kind of method that computes the legal actions and rewards that can be called whenever the state changes. For all of the current simulators, computeLegalActions and computeRewards are used as private methods, although the names don't matter.
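To illustrate the shape of a new domain, the following Java sketch shows an immutable state and action pair together with a simulator that follows the takeAction / computeLegalActions / computeRewards pattern described above, using a made-up toy domain (counting down from n by 1 or 2). The class names, fields and signatures are guesses at the structure for illustration only, not the library's actual base classes.

import java.util.ArrayList;
import java.util.List;

// Sketch of a minimal domain: immutable State and Action objects plus a mutable
// Simulator that caches the legal actions and rewards for its current state.
final class CountdownState {                 // hypothetical toy domain state
    final int remaining;
    CountdownState(int remaining) { this.remaining = remaining; }
}

final class CountdownAction {                // subtract 1 or 2 from the counter
    final int amount;
    CountdownAction(int amount) { this.amount = amount; }
}

final class CountdownSimulator {
    private CountdownState state;
    private List<CountdownAction> legalActions;
    private double[] rewards;                // one entry per agent; single-agent here

    CountdownSimulator(CountdownState initial) { setState(initial); }

    /** Replace the current state and refresh the cached legal actions and rewards. */
    void takeAction(CountdownAction a) {
        boolean legal = a != null && a.amount >= 1 && a.amount <= 2
                && a.amount <= state.remaining;
        if (!legal) throw new IllegalArgumentException("illegal action");
        setState(new CountdownState(state.remaining - a.amount));  // new immutable state
    }

    private void setState(CountdownState s) {
        this.state = s;
        this.legalActions = computeLegalActions();
        this.rewards = computeRewards();
    }

    private List<CountdownAction> computeLegalActions() {
        List<CountdownAction> actions = new ArrayList<>();
        if (state.remaining >= 1) actions.add(new CountdownAction(1));
        if (state.remaining >= 2) actions.add(new CountdownAction(2));
        return actions;
    }

    private double[] computeRewards() {      // reward 1 once the counter reaches zero
        return new double[] { state.remaining == 0 ? 1.0 : 0.0 };
    }

    List<CountdownAction> getLegalActions() { return legalActions; }
    double[] getRewards() { return rewards; }
    CountdownState getState() { return state; }
}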

Chapter 6 Description and Evaluation of Domains

This chapter describes the domains that were used to generate data for the paper. We included five domains, all of which are discrete, fully observable, static and either deterministic or stochastic. Figure 6.1 is a quick comparison of the following features of the domains: number of agents, value of the UCT constant, and upper bounds on actions per state (APS) and states per action (SPA). A reasonable value of C was chosen for each domain by running base UCT at a fixed number of trajectories and converging on a local optimum.

Figure 6.1: Comparison of Domains

Domain      Agents  C   APS    SPA
Backgammon  2       1   15^4   15
Biniax      1       8   4      5 * (nElements! / ((nElements - 2)! * 2!))^4
Connect 4   2       1   7      1
Havannah    2       1   61     1
Yahtzee     1       64  32     252

6.1 Backgammon

Backgammon is an ancient gambling game that is still popular today. A neural network approach, TD-Gammon [11], has achieved mastery-level play. Backgammon is a turn-based two-player domain, played on a board consisting of twenty-four narrow triangles called points.

Figure 6.2: Backgammon Board [9]

The triangles alternate in color and are grouped into four quadrants of six triangles each. The quadrants are referred to as a player's home board and outer board and the opponent's home board and outer board. The home and outer boards are separated from each other by a ridge down the center of the board called the bar. The points are numbered for either player starting in that player's home board. The outermost point is the twenty-four point, which is also the opponent's one point. Each player has fifteen checkers of his own color. The initial arrangement of checkers is: two on each player's twenty-four point, five on each player's thirteen point, three on each player's eight point, and five on each player's six point. Figure 6.2 shows the initial board setup.

The starting player is determined randomly and players take turns rolling a pair of six-sided dice to determine possible moves. The value on each die corresponds to the number of points a single piece may be moved forward on the board. Players move pieces away from the home board. If doubles are rolled then the moves on the dice are made twice. A legal move is one in which a piece lands on a point where there is at most one enemy piece. If there is one enemy piece at that location it is captured. A captured piece is placed on the bar and must be moved onto that player's home board before any other pieces that player controls may be moved. If all legal moves of a captured piece are blocked by enemy pieces then that player can make no moves for that turn. Additionally, a player must make as many moves as possible during their turn. So if there is an option to make only 1 move or to make 2 moves, the player must choose the combination of 2 moves. The object of the game is to move all the checkers to the opponent's home board and then bear them off. The first player to bear off all checkers wins the game. In order to bear off checkers, a player must have all remaining pieces in the opponent's home board. Backgammon has special scoring rules when gambling, which we do not use for our domain. A win is 1 reward and a loss is -1 reward. The Backgammon domain is moderately stochastic, with 15 possible states that any action can lead to; sparse sampling is not useful here. The other interesting feature of this domain is that the number of possible actions from any given state can vary wildly. Some states may have no legal actions while others may have as many as a thousand. This means that for a fixed number of UCT trajectories the current

action may end up having a well-explored or a sparsely explored set of root actions.

6.2 Biniax

Biniax is a newer arcade-style game that can be found online, free to play. No previous research that we know of has been done in this domain. Biniax is a highly stochastic single-agent domain. The agent controls a single element on a 5 by 7 board. An action consists of moving the single element to an adjacent non-diagonal location (North, South, East or West). Some locations on the board are empty while others contain element pairs. A move can be made into an empty space or into a space with an element pair where one element in the pair is the same as the player's element. When the piece moves into a location with an element pair it will change its element value to that of the other element in the pair. Every 2 moves, all element pairs and the agent's element are moved down one location. Element pairs move off the bottom of the board but the player's element does not. The top row fills in with 4 random element pairs and one random empty location. As the game progresses the number of elements possible in the element pairs increases; the game starts with 4 possible elements. The goal of the agent is to survive for as many turns as possible (1 reward per action). The game ends when the agent can make no more legal moves; this occurs only when the agent is on the bottom row and there are either no spaces the element can move into, or any of the moves would cause the element to get pushed off the board by a falling element pair. Some examples are given by figure 6.3.

Figure 6.3: Biniax Examples (before/after board diagrams for four situations: the element moves North taking an element pair; the element moves East and is pushed down by a falling element pair; the element moves East and takes an element pair while moving down; positions in which the element has no legal moves)

Biniax is an ideal domain in which to apply sparse sampling methods because it is highly stochastic and base UCT has trouble building a deep tree. Every other action ends up being a stochastic action with the following number of equally possible next states:

Number of Next States = 5 * (nElements! / ((nElements - 2)! * 2!))^4

where nElements is the number of elements currently being generated. For example, with 4 elements this is 5 * 6^4 = 6480 possible next states.

6.3 Connect 4

Connect 4 has been the subject of past research because it was challenging enough of a domain to warrant study but had a significantly smaller state space than games such as chess or Go. It is now a solved domain where agents with access to the move database can make optimal decisions quickly. Connect 4 is a connection game played on a grid of height 6 and width 7. It is a two-agent deterministic domain where the agents alternate placing pieces on the board. The objective of the game is to create a connection of 4 pieces in a row either vertically, horizontally or diagonally. When a piece is placed on the board, a column is chosen and the piece moves down the column until it sits above a non-empty location. In this way there is a maximum of 7 actions per state, where placing a piece in a column is a legal action if that column has at least one empty location. The game ends as a draw if the game board contains no empty locations and there are no connections of four. In our research this domain was successful both because of its simplicity and because of the amount of previous work done. Fast open-source implementations of the simulator

for this domain could be found online and allowed us to generate trees with extremely high numbers of trajectories compared to the other domains. This domain was also challenging enough that UCT didn't converge on the optimal moves too quickly.

6.4 Havannah

Havannah is a connection game invented by game designer Christian Freeling. The game is based on the classic connection game Hex. No existing computer agent is capable of beating the best human players on a full-sized board (side length of 10). Christian Freeling has put out a prize for anyone who can create an agent that can beat him in 1 of 10 games. Havannah is a connection game where pieces are placed alternately by two competing players. Once a piece is placed it remains on the board for the remainder of the game. There are three possible win conditions: bridge, fork or ring. A bridge is a connection between 2 of the six corners, a fork is a connection of 3 of the six sides of the board (where a side does not include a corner), and a ring surrounds either enemy or empty locations. Figure 6.4 depicts a Havannah board of side length 8 with examples of the three possible win conditions highlighted in blue. For our tests we used a board with side length of 5 to speed up the collection of results. Unlike Hex, Havannah is not a determined game, but it is rare for it to end in a draw. It is deterministic and has a maximum number of possible moves equal to

the number of empty spaces on the board, so all search trees have a bounded depth. For our board size we had a maximum of 61 actions per state.

Figure 6.4: Havannah Board

6.5 Yahtzee

Yahtzee is a single-agent stochastic domain. In its present form it has been around since 1954. An optimal algorithm has been found that has an average score of 254.5896 with a standard deviation of 59.6117 [7]. Expert human players have also been able to average around 250 points over the course of many games. The object of the game is to score the most points by rolling 5 dice to make certain combinations. The game consists of 13 rounds. In each round the dice may be rolled up to 3 times. All dice are rolled in the first roll of a round. In the second and third rolls the player may choose which dice to roll and which to keep

the same. After the third roll the agent must select an appropriate category to score the dice. There are 13 categories to choose from, as described below. After a category has been scored it cannot be chosen again. There are two sections that may be scored: the upper section and the lower section. The upper section consists of 6 scoring categories: Ones, Twos, Threes, Fours, Fives and Sixes. These categories are scored by summing the total of matching die faces. For example, if you rolled [ 5 2 5 6 1 ] and placed it into the Fives category you would receive 10 points. If you put the same combination in the Ones category you would receive 1 point. If the total of the upper scores is 63 or more, a bonus of 35 points is added. The lower section categories are either scored a set amount or zero if the category requirements are not satisfied. The three of a kind and four of a kind categories need 3 and 4 of the same die face, respectively; these categories score the sum of all the dice. A straight is a sequence of consecutive die faces, where a small straight is 4 consecutive faces and a large straight is 5 consecutive faces. Small straights score 30 and large straights 40 points. A full house is where you have 3 of a kind and 2 of a kind; full houses score 25 points. A Yahtzee is 5 of a kind and scores 50 points. The Chance category is scored by the sum of the dice. Each Yahtzee rolled after the Yahtzee category has scored 50 points yields a 100 point bonus. This roll must be put into another category, as follows: if the corresponding upper section category is not filled then the Yahtzee must be scored there. For example, if [ 4 4 4 4 4 ] is rolled, the Fours category must be scored if not filled. If the corresponding upper section category is filled you may then put the score anywhere on the upper

section or lower section and score appropriately. Yahtzee is a highly stochastic domain, but actions taken by an agent can control the stochasticity: the number of possible next states for an action can range from 1 to 252. We found that sparse sampling did not help much in this domain since good moves tended to limit the number of possible states.
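As a small worked example of the scoring rules above, this Java sketch scores an upper-section category and checks the 63-point bonus; it simply restates the rules from this section and is not code from the thesis library.

// Sketch of upper-section Yahtzee scoring as described above.
final class YahtzeeScoring {
    /** Score dice in the upper-section category for a given face (1..6): sum of matching faces. */
    static int scoreUpper(int[] dice, int face) {
        int total = 0;
        for (int d : dice) if (d == face) total += face;
        return total;
    }

    /** The upper-section bonus: 35 points if the six upper categories total 63 or more. */
    static int upperBonus(int upperTotal) {
        return upperTotal >= 63 ? 35 : 0;
    }

    public static void main(String[] args) {
        int[] roll = {5, 2, 5, 6, 1};
        System.out.println(scoreUpper(roll, 5));  // 10, as in the Fives example above
        System.out.println(scoreUpper(roll, 1));  // 1, as in the Ones example above
        System.out.println(upperBonus(63));       // 35
    }
}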

Chapter 7 Empirical Results

The main result that we were interested in was the effect of ensembles (root parallelization) applied to base UCT in various domains; specifically, the benefits of ensembles in parallel time, sequential time and memory usage. Some secondary results we were interested in included the usefulness of sparseness in stochastic domains, the effect of ensembles with a suboptimal UCT constant, and the increased time per trajectory as a tree increases in size. We will present the ensemble results along with some other interesting results.

7.1 Experimental Setup

Our timing results were collected on one machine with four 2.67 GHz dual-core Intel Xeon processors and 24 gigabytes of memory. Our Java library only ran a single thread, so each test used one core for processing and a second core for garbage collection (Java's way of cleaning up dynamically allocated memory). For each domain we started by generating a reasonable UCT curve. This meant first finding a UCT constant for base UCT and then, if the domain was stochastic, finding a sparseness value; in some cases this value was infinite (the same as base UCT without sparseness). An example of how we determined a reasonable UCT constant for our original Yahtzee domain is given by figure 7.1. We increased the UCT constant

across a range of trajectories until there was no further improvement to base UCT.

Figure 7.1: Yahtzee UCT Constant Table

Trajectories  C=1  C=2  C=4  C=8  C=16  C=32  C=64  C=128
1024          167  172  181  189  195   196   194   189
2048          167  171  186  193  199   201   202   199
4096          167  175  186  198  202   204   208   207

For the deterministic domains we set the sparse sample size to 1 rather than infinite. The two values are logically equivalent for domains with only 1 state per action, but a sparseness size of 1 will run faster because the simulator only needs to be called the first time the state is visited from that action rather than each time. It is important to point out that this is not true for stochastic domains. For example, the domain of Backgammon always has 15 distinct possible states from any one action. Yet a sparse sample size of 15 is not logically equivalent to a sparse sample size of infinity. This is because during the course of filling up the 15 child states of the action there may be a state that is chosen more than once, leaving out another state. Once the number of visits to the action node is 15, the current set of states is fixed (no other new states can be added) and the distribution remains unchanged. For two-agent domains the opposing agent was a UCT agent fixed at some number of trajectories. From this point, ensemble table results could be collected for 2, 4, 8 and 16 ensembles. A point to make about our timing measurements is that we assume that a single trajectory added to a tree takes constant time. This isn't completely true because

as the tree becomes larger the time taken per trajectory increases. For some domains we found this time to be more of a factor than for others, but for the most part it was not huge. However, this means that the ensembles are usually performing slightly better than indicated by the results. Figure 7.2 gives an example of these results from the Connect 4 domain. The values in the table represent the average time it takes for the algorithm to select an action, in milliseconds, and the values to the right of ± indicate the 99% confidence interval for that value.

Figure 7.2: Connect 4 Ensemble Timing Table (ms). Average time per move for 1 to 16 ensembles at 4096 to 65536 total trajectories; values are on the order of 694 ± 6 to 792 ± 6 ms.

Another thing to note before analyzing the results of the ensembles is that the only domain that used sparseness in these tests was Biniax. We found that for domains with low stochasticity, such as Backgammon and Yahtzee, sparse sampling did not improve performance in a significant way. However, for a highly stochastic domain such as Biniax, where the tree was unable to grow deep, sparseness gave a significant improvement. These results are shown in figure 7.3. The far left column shows the infinite sparseness values of base UCT. The sparseness values starting from 1 improve upon infinite sparseness up to 8 before declining again. We ended up setting the sparseness of Biniax to 8 for our tests.

Figure 7.3: Biniax Sparseness (rows: number of trajectories; columns: sparse sample size, with ∞ being base UCT)

Trajectories  ∞    1    2    4    8    16   32
64            96   100  100  98   97   97   95
128           99   102  103  102  101  100  99
256           100  104  104  103  102  102  100
512           98   102  106  106  104  102  102
1024          101  103  104  107  106  105  104
2048          100  104  106  108  108  107  105
4096          101  102  106  107  110  109  107
8192          100  103  107  109  111  110  108

7.2 General Results

Figures 7.4 to 7.9 show a comparison of base UCT and various numbers of ensemble UCT trees for each domain. There is also a sixth, alternate Yahtzee domain. This was our original Yahtzee domain, but there was a flaw in how the scoring was done, so that it was slightly easier than the normal Yahtzee domain. This isn't a problem except that we wanted to compare our results to the optimal Yahtzee values that have already been found. The domains with more than one agent (Backgammon, Connect 4 and Havannah) all have a base UCT agent they compete against. All the base UCT agents use the UCT constant values from figure 6.1. The number of trajectories is set differently for each domain: Backgammon is 256, Connect 4 is 4096 and Havannah is 128. These values allowed us to control the strength of the agent we were testing against. If the value was too large then the tests would take too long, because both the base agent would take a while and the variable agent would need to be larger as well. On the other hand, if the value was too low

then the base UCT version of the variable agent would quickly win 100% of the games and ensembles cannot improve from there. Since the Connect 4 simulator was faster and had smaller states than Backgammon and Havannah, it was also able to generate much larger trees with less memory and in less time. For each ensemble table, the columns indicate the number of trajectories UCT uses to build each tree while the rows indicate the number of ensemble trees used to determine the correct action. We constructed the tables in such a way that it would be easy to compare the results of increasing the number of ensembles to that of increasing the number of trajectories. The number of ensembles and the number of trajectories both double with each increase. In this way the differences in sequential time between various ensemble values using the same number of overall trajectories can be easily compared. For instance, on any of the tables, starting from a base UCT result and moving diagonally up and to the right results in 2 ensemble trees each with half the number of trajectories. Continuing along this diagonal, all of the results use the same number of total trajectories. Just moving down a row shows the parallel time benefits of using ensembles. The values in each table are the average reward acquired over a series of games (usually 1000 to 4000) and the values to the right of ± indicate the 99% confidence interval for that value. The tables show values increasing from lower numbers of ensembles to higher numbers of ensembles given a fixed number of trajectories. This indicates that ensembles improve performance in general when using parallel time. There are cases where ensembles don't improve performance, but there were no statistically significant cases where performance suffered due to ensembles. This was the case for all