Contents

1 Introduction
  1.1 Rules of the game
  1.2 Complexity of the game
  1.3 History of self-learning backgammon AI
  1.4 Standard techniques
2 Implementation details
  2.1 Usage of open source projects
  2.2 Framework
    2.2.1 First Try: Windows Framework
    2.2.2 Second Try: Heuristic Framework
  2.3 Algorithms
    2.3.1 MCMC
    2.3.2 TD(λ)
    2.3.3 Genetic Algorithm
3 Results
  3.1 Intermediate Approaches
  3.2 Final Results
    3.2.1 MCMC
    3.2.2 TD
    3.2.3 Genetic
    3.2.4 Comparison
    3.2.5 Comparison
4 Analysis and Conclusions
  4.1 Suggestions for further work
A Installation and Running
B The code structure

List of Figures

3.1 Benchmark players results
3.2 MCMC learning from gnubg
3.3 MCMC learning from initial AI
3.4 MCMC learning from random AI
3.5 TD learning from random
3.6 TD learning, changing alpha values
3.7 TD learning, changing lambda values
3.8 Genetic population
3.9 Genetic generations
3.10 Best Scores
A.1 The main GUI
1 Introduction

In this project we explored the ability of various reinforcement learning algorithms (including a genetic algorithm) in the game of backgammon. We implemented 3 different algorithms: MCMC, TD and a genetic algorithm.

1.1 Rules of the game (quoted from [1])

Backgammon playing pieces are known variously as checkers, draughts, stones, men, counters, pawns, or chips. The objective is to remove (bear off) all of one's own checkers from the board before one's opponent can do the same. The checkers are scattered at first and may be blocked or hit by the opponent. As the playing time for each individual game is short, it is often played in matches, where victory is awarded to the first player to reach a certain number of points.

Setup

Each side of the board has a track of 12 long triangles, called points. The points are considered to be connected across one edge of the board, forming a continuous track in the shape of a horseshoe, and are numbered from 1 to 24. Players begin with two checkers on their 24-point, three checkers on their 8-point, and five checkers each on their 13-point and their 6-point. The two players move their checkers in opposing directions, from the 24-point towards the 1-point. Points 1 through 6 are called the home board or inner board, and points 7 through 12 are called the outer board. The 7-point is referred to as the bar point, and the 13-point as the mid point.

Movement

To start the game, each player rolls one die, and the player with the higher number moves first, using both of the numbers shown. If the players roll the same number, they must roll again, as the first move cannot be a double. Both dice must land completely flat on the right-hand side of the gameboard. The players then alternate turns, rolling two dice at the beginning of each turn. After rolling the dice, players must, if possible, move their checkers according to the number of pips shown on each die.
For example, if the player rolls a 6 and a 3 (notated as 6-3), that player must move one checker six points forward, and another or the same checker three points forward. The same checker may be moved twice, as long as the two moves are distinct: six and then three, or three and then six. If a player rolls two of the same number, called doubles, that player must play each die twice. For example, upon rolling a 5-5 that player may move up to four separate checkers forward five spaces each. For any roll, if a player can move both dice, that player is compelled to do so.

If players cannot move either die in a roll, given the position of their checkers, then that turn is over and the turn passes to the opponent. If it is possible to move either die, but not both, the higher number must be played. For example, if a player rolls 6-3 and can only move a 6 or a 3, the 6, being the bigger number, must be moved; if it is possible to move the 3 but not the 6, then the 3 is played. Further, if one die cannot be moved, but such a move is made possible by the moving of the other die, that move is compulsory.

In the course of a move, a checker may land on any point that is unoccupied or is occupied only by a player's own checkers. It may also land on a point occupied by exactly one opposing checker, or blot. In this case, the blot has been hit, and is placed in the middle of the board on the bar that divides the two sides of the playing surface. A checker may never land on a point occupied by two or more opposing checkers; thus, no point is ever occupied by checkers from both players simultaneously.

Checkers placed on the bar re-enter the game through the opponent's home board. A roll of 2 allows the checker to enter on the 23-point, a roll of 3 on the 22-point, and so forth. A player may not move any other checkers until all checkers on the bar belonging to that player have re-entered the game.
When all of a player's checkers are in that player's home board, that player may start removing them; this is called bearing off. A roll of 1 may be used to bear off a checker from the 1-point, a 2 from the 2-point, and so on. A die may not be used to bear off checkers from a lower-numbered point unless there are no checkers on any higher points. For example, if a player rolls a 6 and a 5 but has no checkers on the 6-point, though 2 checkers remain on the 5-point, then the 6 and the 5 must be used to bear off the 2 checkers from the 5-point.

When bearing off, a player may also move a lower die roll before the higher, even if that means the full value of the higher die is not fully utilized. For example, if a player has exactly 1 checker remaining on the 6-point and rolls a 6 and a 1, the player may move the 6-point checker 1 place to the 5-point with the lower die roll of 1, and then bear that checker off the 5-point using the die roll of 6; this is sometimes useful tactically.

If one player has not borne off any checkers by the time that player's opponent has borne off all fifteen, then the player has lost a gammon, which counts for double a normal loss. If the losing player has not borne off any checkers and still has checkers on the bar or in the opponent's home board, then the player has lost a backgammon, which counts for triple a normal loss.

Doubling

The game also includes doubling - an action which (if taken) causes the score to be doubled. In order to simplify the game, we ignored doubling in the implementation of the game.

1.2 Complexity of the game

The major difficulty in modelling an AI for backgammon is the huge number of possible states, estimated[2] to be far too large for any practical enumeration. Moreover, the large branching factor (several hundred[2] for each move) is an obstacle for common game-tree approaches.

1.3 History of self-learning backgammon AI

In 1992, Tesauro[2] created TD-Gammon, applying the TD(λ) algorithm to a multilayered neural net structure. Learning from self-play, it achieved a level of play comparable to human experts without explicit programming of a heuristic evaluation function.
It was further shown[3] that other algorithms that implement co-evolution can achieve similar results, using a genetic algorithm on a framework of neural networks.

1.4 Standard techniques

Bearoff database

Surprisingly, programmed AIs are better at the evaluation of normal game-progress situations (which is a pattern evaluation problem), but worse in the evaluation of bearoff positions. It is therefore standard to use a database for the ending positions.

Lookahead

One way to evaluate a board is to look at the possibilities in the game tree in the immediate future of the position. Because of the large branching factor, this is generally limited to a small number (1-3) of moves.

Rollouts

One technique for evaluating a board position is to sample several chains of rollouts (throws of dice) from the position, and have the AI play according to them. The position is then scored by the average final score (or by the evaluation of the resulting boards, if using partial rollouts).
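The rollout technique just described can be sketched as follows. Reducing a position to an int and injecting the playout policy as a function are simplifications for illustration, not details of any particular implementation:

```java
import java.util.Random;
import java.util.function.Function;

public class Rollout {
    // Sample several playouts from the position and average the final scores.
    // The playout function stands in for "let the AI play out one chain of
    // dice throws from this position and return the resulting score".
    public static double evaluate(int position, int samples,
                                  Function<Integer, Double> playout) {
        double total = 0.0;
        for (int i = 0; i < samples; i++) {
            total += playout.apply(position);
        }
        return total / samples; // average score over the sampled rollouts
    }

    public static void main(String[] args) {
        Random dice = new Random(42);
        // Toy playout: the "score" is the position value plus dice noise.
        double v = Rollout.evaluate(5, 1000, p -> p + (dice.nextDouble() * 2 - 1));
        System.out.println(v); // close to 5.0 on average
    }
}
```

With partial rollouts, the playout would stop after a fixed number of moves and return the evaluation of the resulting board instead of a final score.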
2 Implementation details

2.1 Usage of open source projects

The game details implementation (board position logic, board GUI) was adapted from JGammon[4]. JGammon is an open-source project that gives you the option to play backgammon games against different AIs and provides an interface to create new AIs. Since the AIs in JGammon are still in development, we added the GNU Backgammon program (gnubg[5]) for benchmarking. As its name suggests, gnubg is also an open-source project that gives you the option to play against AIs, but in contrast to JGammon, gnubg uses neural network weight tables which were built over months of training. Gnubg offers different difficulty levels (controlled by adding noise to the decisions and by n-ply lookahead). Gnubg at its full power - no noise and 3-ply lookahead - is considered a human expert. That was confirmed by the members of this team :). The JGammon GUI was heavily simplified to remove unneeded options, leaving plain gameplay.

2.2 Framework

One of the great challenges in this project was the evaluation function of the board. The target was to create an effective evaluation function mapping a board to a value, which should approximate the expectation of the score achievable from the board position. Then, when a move is needed in a game, we go over all the possible next-move boards and choose the one with the highest value. As the number of states is very large, we needed some abstraction of the board. The function was implemented with the naive approach of a lookup table (LUT), which was filled by the various algorithms. It should be noted that since there is full observability of the game position, and the passage from a position (with dice) to the next (without dice) is deterministic, it is sufficient to evaluate the board positions; there is no need to evaluate a separate value for actions.
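The move selection described above - evaluate every reachable next-move board through the LUT and pick the highest - can be sketched as follows. The string feature key and the default value of 0.0 for unseen entries are assumptions for illustration, not the project's exact encoding:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GreedyChooser {
    private final Map<String, Double> lut = new HashMap<>();

    // Store a learnt value for a feature key (filled by the learning algorithms).
    public void learn(String featureKey, double value) {
        lut.put(featureKey, value);
    }

    // Unseen feature keys fall back to a neutral default of 0.0.
    public double evaluate(String featureKey) {
        return lut.getOrDefault(featureKey, 0.0);
    }

    // Go over all possible next-move boards and pick the highest-valued one.
    public String choose(List<String> candidateBoards) {
        String best = null;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (String board : candidateBoards) {
            double v = evaluate(board);
            if (v > bestValue) {
                bestValue = v;
                best = board;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        GreedyChooser ai = new GreedyChooser();
        ai.learn("houses=4,blots=1", 0.8);
        ai.learn("houses=2,blots=3", -0.5);
        // prints houses=4,blots=1
        System.out.println(ai.choose(List.of("houses=2,blots=3", "houses=4,blots=1")));
    }
}
```

Because positions, not actions, are evaluated, this one table serves every dice roll: the roll only determines which candidate boards enter the list.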
First we tried a simple relaxation, which proved to be not very successful even against random opponents; then we tried a different approach that proved itself better than the first. We call every such evaluation scheme a framework.

2.2.1 First Try: Windows Framework

The windows framework is basically a relaxation of the full board description. The first abstraction we used is that the number of checkers at each point can be either 0, 1 or 2. That means that two boards are considered equal if they only differ in the actual number of checkers at points which aren't a blot. Furthermore, we divide the board into windows. A window is a series of contiguous points on the board; the board is divided into a number of windows, and windows can overlap. We then assign a value to every window in the board, and the value of the board is the average of all the window values.

After implementing this framework and testing it with the different algorithms (discussed later), we reached poor results in all the test cases, even against a random opponent. The main reason for the poor performance is that the relaxation wasn't aggressive enough: the number of states was still too large to be learnt well from a limited number of games. In addition, it might be that averaging the window values did not create a good enough estimate of the overall board value. We therefore changed to a different framework which gave better results: the heuristic framework.

2.2.2 Second Try: Heuristic Framework

The heuristic framework extracts a small set of features from each board. The function that uses the features to evaluate the board (via a LUT) is learnt by the RL algorithms. The features are:

1. The number of houses (points with checkers which aren't a blot).
2. The number of blots.
3. The number of checkers on the bar.
4. The number of the other player's checkers on the bar.
5. The number of houses in the last six points.
6. Whether separation of the pieces has been achieved (all of the other player's pieces are before all of ours).

2.3 Algorithms

We used the following algorithms, with the score of the game as the utility function, thereby trying to maximize the expected score of a game. We considered implementing an on-policy algorithm (SARSA), but since the results from learning from self-play were poor, we deemed it a lost cause.

2.3.1 MCMC

Markov chain Monte Carlo (MCMC) sampling is a model-free off-policy algorithm for evaluating the value of positions. Using a given policy (AI), it runs a large number of simulations of the game. The algorithm samples every board in a simulated game and gives it a value equal to the average score of the games which pass through that board. In our implementation, we did not accumulate the results when moving to a new generation; each generation learns the entire database from scratch. Since the algorithm is trivially parallelizable, we implemented it using multi-threading to speed up computation.

2.3.2 TD(λ)

Temporal difference learning is a model-free off-policy algorithm for evaluating the value of positions. Using a given policy (AI), it runs a large number of games. The value given to a board is averaged with the approximation given to the next position in the Markov chain. λ is an interpolation parameter between the immediate estimate of the reward and the reward of later positions. The reward of a non-terminal position p is calculated as

R(p) = λ·V(p+) + (1 − λ)·R(p+)

where p+ is the next position in the chain. The value is then updated as

V(p) ← (1 − α)·V(p) + α·R(p)

where α is the averaging parameter. [6]

2.3.3 Genetic Algorithm

A genetic algorithm is a local search algorithm. A population of (intermediate) solutions is kept.
Each solution is given a score by a scoring function, and a selection function chooses, with a preference for higher-scoring members, solutions to go on to the next generation. A crossover function is used to combine different solutions, and a mutation function to modify them randomly in a small manner.

In our implementation, we used randomly generated LUTs as the initial population; the scoring function evaluated the expected performance against a specified AI. The selection function selected randomly in proportion to the expectation of the score (after adding 3 in order to have positive weights). The crossover function averaged the value given to each position by the two AIs. The mutation function added white noise to the value of each board. It is possible to score the population using games against one another, thereby implementing self-play, without the need for a specific AI to score against. We used an outside AI in order to check the effect of the scoring function on the results.
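The genetic operators just described can be sketched as follows. Representing an individual as a map from position keys to values, and averaging with 0.0 for positions known to only one parent, are our assumptions for illustration:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Random;
import java.util.Set;

public class GeneticOperators {
    // Crossover: average the value the two parents assign to each position.
    // Positions known to only one parent are averaged with a default of 0.0.
    public static Map<String, Double> crossover(Map<String, Double> a,
                                                Map<String, Double> b) {
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        Map<String, Double> child = new HashMap<>();
        for (String k : keys) {
            child.put(k, (a.getOrDefault(k, 0.0) + b.getOrDefault(k, 0.0)) / 2.0);
        }
        return child;
    }

    // Mutation: add white (Gaussian) noise to every stored value.
    public static Map<String, Double> mutate(Map<String, Double> individual,
                                             double scale, Random rng) {
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : individual.entrySet()) {
            out.put(e.getKey(), e.getValue() + rng.nextGaussian() * scale);
        }
        return out;
    }

    // Selection weight: the expected score shifted by +3, so that the worst
    // possible expectation (-3) still maps to a non-negative weight.
    public static double selectionWeight(double expectedScore) {
        return expectedScore + 3.0;
    }

    public static void main(String[] args) {
        Map<String, Double> child = crossover(Map.of("pos1", 1.0),
                                              Map.of("pos1", 3.0, "pos2", 2.0));
        System.out.println(child.get("pos1")); // prints 2.0
    }
}
```

Selection then draws parents at random with probability proportional to selectionWeight of their measured expected score.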
3 Results

3.1 Intermediate Approaches

We discovered that adding features to the framework in fact caused poorer results. This is a direct consequence of the naive implementation of the evaluation function as a LUT: with too large a state space, it is hard to learn the evaluation function properly without some sort of interpolation between neighboring states.

We found that it is better (in the RL algorithms) to learn from a game between the AI we are learning from and the random player. This configuration gave better results than learning from a game between two better AIs, even when using ε-greedy exploration (choosing a random move with probability ε). The probable reason is that the naive LUT implementation causes the resulting AI to behave very close to a random one, since it has to have seen a framework value in order to evaluate it differently from the default.

Although we implemented rollouts to augment an existing AI, we found that they are too computationally expensive, and they did not give significantly better results. The AIs resulting from our implementation are not stable enough to use as a reliable guide for the rollout game progress, and the evaluation function is not good enough to give reliable results for partial rollouts.

3.2 Final Results

Benchmarking was done by evaluating the expected score of a game (measured over 1000 games) against 3 AIs:

- GnuBG - the gnubg AI (set to the expert level, which, although not the strongest setting, does play reasonably well).
- Random - an AI choosing its moves uniformly at random.
- Initial - JGammon's initial AI, which uses a linear evaluation function on a small set of heuristics. This AI is very weak and has very peculiar behaviour.

In our tests we measure the expected score of a game. A score is positive if our player won, and negative if the opponent won. A single game has a score from 1 to 3:
1 (or -1) for a regular win, 2 for a gammon and 3 for a backgammon.
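The ε-greedy exploration mentioned above (choosing a random move with probability ε) can be sketched as follows; the generic move type and the injected evaluation function are illustrative assumptions:

```java
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleFunction;

public class EpsilonGreedy {
    // With probability epsilon pick a uniformly random move (explore),
    // otherwise pick the move with the highest evaluation (exploit).
    public static <M> M choose(List<M> moves, ToDoubleFunction<M> value,
                               double epsilon, Random rng) {
        if (rng.nextDouble() < epsilon) {
            return moves.get(rng.nextInt(moves.size()));
        }
        M best = moves.get(0);
        for (M m : moves) {
            if (value.applyAsDouble(m) > value.applyAsDouble(best)) {
                best = m;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        // Exploits (picks 5) with probability 0.9, explores otherwise.
        System.out.println(choose(List.of(1, 5, 3), m -> m.doubleValue(), 0.1, rng));
    }
}
```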
Figure 3.1: Benchmark players results

3.2.1 MCMC

MCMC tests have three parameters:

- Number of games in each learning generation (varying from 10 to 10000, in powers of 10).
- Number of generations (varying from 1 to 9, in steps of 2).
- The AI to learn from.
Figure 3.2: MCMC learning from gnubg

Figure 3.2 shows only one learning generation, as later generations produce poor results. It can be observed that the AI achieves good results against the random opponent, and a continued improvement against the initial AI opponent; however, the AI did not show any improvement against the gnubg AI. It is important to remember that when you play more games, your database will be larger, but still not close to the number of states.
Figure 3.3: MCMC learning from initial AI

Figure 3.3 again shows only one generation, for the same reasons. This figure and figure 3.2 show a surprising result: learning from gnubg yields good results against the initial AI and poor results against the gnubg opponent, whereas, as this figure suggests, learning from the initial AI produces poor results against the initial AI and relatively good results against the gnubg opponent (breaking the -2.0 barrier and showing an improvement against a much stronger opponent).
Figure 3.4: MCMC learning from random AI

In figure 3.4 we can see very good results against the random opponent, as expected, and a consistent dominance over the initial AI. The results against gnubg do not break the -2.0 barrier. We can see from the various figures that performance does get better with more games.

3.2.2 TD

TD tests have five parameters:

- Number of games in each learning generation (varying from 10 to 10000, in powers of 10).
- Number of generations (varying from 1 to 9, in steps of 2).
- The alpha factor (varying from 0.01 to 0.1, in steps of 0.01).
- The lambda factor (varying from 0.1 to 1, in steps of 0.1).
- The AI to learn from.
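As a reminder, these runs exercise the TD(λ) update described earlier: R(p) = λ·V(p+) + (1 − λ)·R(p+) followed by V(p) = (1 − α)·V(p) + α·R(p). A minimal sketch, assuming the rule is applied backwards over the positions of one finished game (the application order is our assumption, not a detail stated above):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TdUpdate {
    // Walk a finished game backwards: the last position's reward is the
    // final score; earlier rewards interpolate between the LUT estimate of
    // the next position, V(p+), and the reward propagated back from it,
    // R(p+). Each value then moves toward its reward by the factor alpha.
    public static void update(Map<String, Double> lut, List<String> game,
                              double finalScore, double lambda, double alpha) {
        double nextReward = finalScore;
        for (int i = game.size() - 1; i >= 0; i--) {
            double reward;
            if (i == game.size() - 1) {
                reward = finalScore;                        // terminal reward
            } else {
                double nextValue = lut.getOrDefault(game.get(i + 1), 0.0);
                reward = lambda * nextValue + (1 - lambda) * nextReward;
            }
            String p = game.get(i);
            double old = lut.getOrDefault(p, 0.0);
            lut.put(p, (1 - alpha) * old + alpha * reward); // V(p) update
            nextReward = reward;
        }
    }

    public static void main(String[] args) {
        Map<String, Double> lut = new HashMap<>();
        update(lut, List.of("a", "b"), 1.0, 0.5, 0.1);
        System.out.println(lut); // b moved toward the win, a toward b's estimate
    }
}
```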
Figure 3.5: TD learning from random

Figure 3.5 shows the correlation between the average score and the number of games learned from (against random, with lambda = 0.1 and alpha = 0.01). As expected, in general the player improves as the number of games increases. However, it seems that against the better players the learning curve peaks around 1000 games, and from there on the results stay more or less the same.

Figure 3.6: TD learning, changing alpha values
Figure 3.7: TD learning, changing lambda values

Looking at figures 3.6 and 3.7, we can see no conclusive alpha or lambda values that yield the best results against all opponents. There is a small correlation between the good lambda values for the random and initial AIs (learned from), but the maximum for gnubg is different. Again, graphs with varying generations do not appear here, because beyond one generation the performance deteriorates.

3.2.3 Genetic

Genetic tests have three parameters:

- Size of the population (varying from 10 to 1000, in powers of 10).
- Number of generations (varying from 10 to 40, in steps of 5).
- The AI to learn from (used for scoring).
Figure 3.8: Genetic population

Figure 3.8 shows the correlation between the average score and the population size (learned from random with 40 generations). There is no clear connection between these parameters; however, most of the tests show that the score in general declines as the population grows.
Figure 3.9: Genetic generations

Figure 3.9 shows the correlation between the average score and the number of generations (learned from gnubg with a population of 10). In this case it is very hard to find a connection between the parameters, though it seems that the score is highest for both a low and a high number of generations. Note that a generation in a genetic algorithm is an inherently different concept from a generation in the MCMC or TD algorithms.

3.2.4 Comparison

These results suggest that the genetic algorithm framework is too noisy to give reliable results - a direct consequence of the large number of parameters of the framework (the LUT implementation), and of the small number of generations run (since running a large number is computationally prohibitive).
3.2.5 Comparison

Figure 3.10: Best Scores

Figure 3.10 shows the comparison between the best results we got from all of the learning algorithms and opponents. It is clear that the genetic algorithm performs poorly in comparison to the RL algorithms.
4 Analysis and Conclusions

The AIs resulting from the RL algorithms achieved quite good results against the random opponent - much better than InitialAI's performance - but performed only slightly better against InitialAI itself. Unsurprisingly, the performance against gnubg is abysmal; it is better than random (an expected loss of 1.5 points instead of 2.5) but generally worse than InitialAI's (1.3).

Playing against the resulting AIs, we noticed that the initial game play was much more reasonable than the bearoff. This is expected, since it is known that programming an evaluation of the bearoff states is particularly difficult (which is why a database is normally used), and the framework we used is totally unsuitable for such evaluation.

The AIs resulting from the genetic algorithm have much worse results. The large number of parameters (since we used a LUT framework) and the small number of generations (running more was computationally prohibitive) - which together mean that the number of games played during the learning process is quite small in comparison to the RL algorithms - caused the poor performance of this algorithm. This is easily verified by noting that the LUTs in the AIs resulting from the genetic algorithm are much smaller than those of the corresponding AIs resulting from the RL algorithms.

The various parameters had a small effect on the actual results. This suggests that the performance available from the specific framework we used is already saturated, which would also explain why self-play did not cause noticeable improvement.

The conclusion is that even with a simple set of features and a simple framework (without the flexibility achieved by using neural networks), reasonable performance can be achieved using reinforcement learning; however, as expected, the performance is worse than that achieved with a flexible framework and a long training period.
4.1 Suggestions for further work

It would be interesting to see whether an interpolation scheme, and perhaps a smaller set of parameters for the evaluation function, would improve the performance - especially with the genetic algorithm. It would also be interesting to compare the performance of different feature sets and frameworks, and to see the effect of different selection, crossover and mutation functions on the genetic algorithm's performance.
A Installation and Running

Warning: the compilation process takes a while, as the makefile compiles gnubg - get a cup of coffee in the meantime.

To install - make. To run - make run.

Note: since gnubg runs in a separate process, if interrupted it continues to run on the computer and interferes with running it again. Running killall gnubg will clear it.

The GUI

We provide an interface to create a new AI player that learns from a given AI. One can create a player using any of the algorithms we implemented, learning from any chosen AI. The same interface provides a way to play against any AI opponent, including one created by learning. We also provide the ability to save created AIs to files and to load them from files.

Figure A.1: The main GUI

The GUI is divided into two sections: the upper section focuses on playing or watching games, and the lower part handles learning. When learning a new AI, the new AI is placed in memory and called the current learner. You can choose the algorithm you want to use, the AI to learn from, and the different parameters for each algorithm. The directory AIs contains representative AIs from the various algorithms, which can be used to play against (by loading them from file).

Warning: learning with large parameters can lead to long computation times, so choose the parameters carefully.

After learning, you can view any game by choosing the preferred players and clicking Start Game.
B The code structure

<AI> options are specified as -c classname for loading a class (e.g. -c jgam.ai.RandomAI), as -o filename for loading an object (e.g. -o AIs/MCMC.ai), or as -ro ... / -rc ... for augmentation with rollouts.

utils - miscellaneous utilities used for various jobs.
  Game - an implementation of a simulation of a game between 2 AIs. Main: <AI1> <AI2> runs a match of 1000 games of one against the other and prints the expected score.
  Learner - an implementation of the learning process.
  Loader - a dynamic AI class loader.
  Logger - an interface for logging the board positions during the game simulation.
  RolloutAI - an implementation of augmentation of an AI using rollouts.
framework - the implementation of the feature framework extracted from the board.
  FrameworkAI - an implementation of the AI using the LUT.
  HeuristicFramework - the features extracted from the board position.
gui
  MainGui - the GUI.
rl
  RL - an interface for the learning algorithms.
  MCMC - an implementation of the MCMC algorithm. Main: <AI> <numberOfGames> <filename> learns from the AI, running the specified number of simulations, and saves to the filename.
  TD - an implementation of the TD algorithm. Main: <AI> <gamesNumber> <lambda> <alpha> <filename> learns from the AI, using the specified parameters, and saves to the filename.
  Exploration - an implementation of ε-greedy exploration.
genetic
  Genetic - an implementation of the genetic algorithm. Main: <population> <gameNumber> <AI> <filename> learns from the AI, using the specified parameters, and saves to the filename.
References

[1] Wikipedia, "Backgammon".
[2] Gerald Tesauro, "Temporal Difference Learning and TD-Gammon", Communications of the ACM, March 1995, Vol. 38, No. 3.
[3] Jordan B. Pollack and Alan D. Blair, "Co-Evolution in the Successful Learning of Backgammon Strategy", Machine Learning, Vol. 32, Issue 3, September 1998.
[4] JGammon, an open-source backgammon project.
[5] GNU Backgammon (gnubg), an open-source backgammon program.
[6] Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction".
More informationFoundations of AI. 6. Adversarial Search. Search Strategies for Games, Games with Chance, State of the Art. Wolfram Burgard & Bernhard Nebel
Foundations of AI 6. Adversarial Search Search Strategies for Games, Games with Chance, State of the Art Wolfram Burgard & Bernhard Nebel Contents Game Theory Board Games Minimax Search Alpha-Beta Search
More informationAr#ficial)Intelligence!!
Introduc*on! Ar#ficial)Intelligence!! Roman Barták Department of Theoretical Computer Science and Mathematical Logic So far we assumed a single-agent environment, but what if there are more agents and
More informationAdversarial Search. Soleymani. Artificial Intelligence: A Modern Approach, 3 rd Edition, Chapter 5
Adversarial Search CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2017 Soleymani Artificial Intelligence: A Modern Approach, 3 rd Edition, Chapter 5 Outline Game
More informationLearning to play Dominoes
Learning to play Dominoes Ivan de Jesus P. Pinto 1, Mateus R. Pereira 1, Luciano Reis Coutinho 1 1 Departamento de Informática Universidade Federal do Maranhão São Luís,MA Brazil navi1921@gmail.com, mateus.rp.slz@gmail.com,
More informationCS151 - Assignment 2 Mancala Due: Tuesday March 5 at the beginning of class
CS151 - Assignment 2 Mancala Due: Tuesday March 5 at the beginning of class http://www.clubpenguinsaraapril.com/2009/07/mancala-game-in-club-penguin.html The purpose of this assignment is to program some
More informationMITOCW Project: Backgammon tutor MIT Multicore Programming Primer, IAP 2007
MITOCW Project: Backgammon tutor MIT 6.189 Multicore Programming Primer, IAP 2007 The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue
More informationTEMPORAL DIFFERENCE LEARNING IN CHINESE CHESS
TEMPORAL DIFFERENCE LEARNING IN CHINESE CHESS Thong B. Trinh, Anwer S. Bashi, Nikhil Deshpande Department of Electrical Engineering University of New Orleans New Orleans, LA 70148 Tel: (504) 280-7383 Fax:
More informationPopulation Initialization Techniques for RHEA in GVGP
Population Initialization Techniques for RHEA in GVGP Raluca D. Gaina, Simon M. Lucas, Diego Perez-Liebana Introduction Rolling Horizon Evolutionary Algorithms (RHEA) show promise in General Video Game
More informationTemporal-Difference Learning in Self-Play Training
Temporal-Difference Learning in Self-Play Training Clifford Kotnik Jugal Kalita University of Colorado at Colorado Springs, Colorado Springs, Colorado 80918 CLKOTNIK@ATT.NET KALITA@EAS.UCCS.EDU Abstract
More informationDELUXE 3 IN 1 GAME SET
Chess, Checkers and Backgammon August 2012 UPC Code 7-19265-51276-9 HOW TO PLAY CHESS Chess Includes: 16 Dark Chess Pieces 16 Light Chess Pieces Board Start Up Chess is a game played by two players. One
More informationOptimal Yahtzee performance in multi-player games
Optimal Yahtzee performance in multi-player games Andreas Serra aserra@kth.se Kai Widell Niigata kaiwn@kth.se April 12, 2013 Abstract Yahtzee is a game with a moderately large search space, dependent on
More informationMore Adversarial Search
More Adversarial Search CS151 David Kauchak Fall 2010 http://xkcd.com/761/ Some material borrowed from : Sara Owsley Sood and others Admin Written 2 posted Machine requirements for mancala Most of the
More informationPresentation Overview. Bootstrapping from Game Tree Search. Game Tree Search. Heuristic Evaluation Function
Presentation Bootstrapping from Joel Veness David Silver Will Uther Alan Blair University of New South Wales NICTA University of Alberta A new algorithm will be presented for learning heuristic evaluation
More information46.1 Introduction. Foundations of Artificial Intelligence Introduction MCTS in AlphaGo Neural Networks. 46.
Foundations of Artificial Intelligence May 30, 2016 46. AlphaGo and Outlook Foundations of Artificial Intelligence 46. AlphaGo and Outlook Thomas Keller Universität Basel May 30, 2016 46.1 Introduction
More informationIntroduction to Artificial Intelligence CS 151 Programming Assignment 2 Mancala!! Due (in dropbox) Tuesday, September 23, 9:34am
Introduction to Artificial Intelligence CS 151 Programming Assignment 2 Mancala!! Due (in dropbox) Tuesday, September 23, 9:34am The purpose of this assignment is to program some of the search algorithms
More informationOthello/Reversi using Game Theory techniques Parth Parekh Urjit Singh Bhatia Kushal Sukthankar
Othello/Reversi using Game Theory techniques Parth Parekh Urjit Singh Bhatia Kushal Sukthankar Othello Rules Two Players (Black and White) 8x8 board Black plays first Every move should Flip over at least
More informationDeveloping Frogger Player Intelligence Using NEAT and a Score Driven Fitness Function
Developing Frogger Player Intelligence Using NEAT and a Score Driven Fitness Function Davis Ancona and Jake Weiner Abstract In this report, we examine the plausibility of implementing a NEAT-based solution
More informationCMPUT 657: Heuristic Search
CMPUT 657: Heuristic Search Assignment 1: Two-player Search Summary You are to write a program to play the game of Lose Checkers. There are two goals for this assignment. First, you want to build the smallest
More informationAdversarial Search and Game Playing
Games Adversarial Search and Game Playing Russell and Norvig, 3 rd edition, Ch. 5 Games: multi-agent environment q What do other agents do and how do they affect our success? q Cooperative vs. competitive
More informationU.S. REGULATION BACKGAMMON Honest, Fair Play And Sportsmanship Will Take Precedence Over Any Rule - Directors Discretion 2017(a) EDITION*
U.S. REGULATION BACKGAMMON Honest, Fair Play And Sportsmanship Will Take Precedence Over Any Rule - Directors Discretion 2017(a) EDITION* 1.0 PROPRIETIES 1.1 TERMS. TD-Tournament Director, TS-Tournament
More informationFoundations of AI. 6. Board Games. Search Strategies for Games, Games with Chance, State of the Art
Foundations of AI 6. Board Games Search Strategies for Games, Games with Chance, State of the Art Wolfram Burgard, Andreas Karwath, Bernhard Nebel, and Martin Riedmiller SA-1 Contents Board Games Minimax
More informationFive-In-Row with Local Evaluation and Beam Search
Five-In-Row with Local Evaluation and Beam Search Jiun-Hung Chen and Adrienne X. Wang jhchen@cs axwang@cs Abstract This report provides a brief overview of the game of five-in-row, also known as Go-Moku,
More informationPlaying Othello Using Monte Carlo
June 22, 2007 Abstract This paper deals with the construction of an AI player to play the game Othello. A lot of techniques are already known to let AI players play the game Othello. Some of these techniques
More informationYourTurnMyTurn.com: Backgammon rules. YourTurnMyTurn.com Copyright 2018 YourTurnMyTurn.com
YourTurnMyTurn.com: Backgammon rules YourTurnMyTurn.com Copyright 2018 YourTurnMyTurn.com Inhoud Backgammon Rules...1 The board...1 Object of the board game...1 Moving the men...1 Rules for moving the
More informationMore on games (Ch )
More on games (Ch. 5.4-5.6) Alpha-beta pruning Previously on CSci 4511... We talked about how to modify the minimax algorithm to prune only bad searches (i.e. alpha-beta pruning) This rule of checking
More informationCreating a Poker Playing Program Using Evolutionary Computation
Creating a Poker Playing Program Using Evolutionary Computation Simon Olsen and Rob LeGrand, Ph.D. Abstract Artificial intelligence is a rapidly expanding technology. We are surrounded by technology that
More informationMonte Carlo Tree Search
Monte Carlo Tree Search 1 By the end, you will know Why we use Monte Carlo Search Trees The pros and cons of MCTS How it is applied to Super Mario Brothers and Alpha Go 2 Outline I. Pre-MCTS Algorithms
More informationAI Plays Yun Nie (yunn), Wenqi Hou (wenqihou), Yicheng An (yicheng)
AI Plays 2048 Yun Nie (yunn), Wenqi Hou (wenqihou), Yicheng An (yicheng) Abstract The strategy game 2048 gained great popularity quickly. Although it is easy to play, people cannot win the game easily,
More informationCS 331: Artificial Intelligence Adversarial Search II. Outline
CS 331: Artificial Intelligence Adversarial Search II 1 Outline 1. Evaluation Functions 2. State-of-the-art game playing programs 3. 2 player zero-sum finite stochastic games of perfect information 2 1
More informationGame Tree Search. Generalizing Search Problems. Two-person Zero-Sum Games. Generalizing Search Problems. CSC384: Intro to Artificial Intelligence
CSC384: Intro to Artificial Intelligence Game Tree Search Chapter 6.1, 6.2, 6.3, 6.6 cover some of the material we cover here. Section 6.6 has an interesting overview of State-of-the-Art game playing programs.
More informationOn the Design and Training of Bots to Play Backgammon Variants
On the Design and Training of Bots to Play Backgammon Variants Nikolaos Papahristou, Ioannis Refanidis To cite this version: Nikolaos Papahristou, Ioannis Refanidis. On the Design and Training of Bots
More informationFoundations of Artificial Intelligence
Foundations of Artificial Intelligence 6. Board Games Search Strategies for Games, Games with Chance, State of the Art Joschka Boedecker and Wolfram Burgard and Bernhard Nebel Albert-Ludwigs-Universität
More informationGame Tree Search. CSC384: Introduction to Artificial Intelligence. Generalizing Search Problem. General Games. What makes something a game?
CSC384: Introduction to Artificial Intelligence Generalizing Search Problem Game Tree Search Chapter 5.1, 5.2, 5.3, 5.6 cover some of the material we cover here. Section 5.6 has an interesting overview
More informationBy David Anderson SZTAKI (Budapest, Hungary) WPI D2009
By David Anderson SZTAKI (Budapest, Hungary) WPI D2009 1997, Deep Blue won against Kasparov Average workstation can defeat best Chess players Computer Chess no longer interesting Go is much harder for
More informationHow AI Won at Go and So What? Garry Kasparov vs. Deep Blue (1997)
How AI Won at Go and So What? Garry Kasparov vs. Deep Blue (1997) Alan Fern School of Electrical Engineering and Computer Science Oregon State University Deep Mind s vs. Lee Sedol (2016) Watson vs. Ken
More informationContents. Foundations of Artificial Intelligence. Problems. Why Board Games?
Contents Foundations of Artificial Intelligence 6. Board Games Search Strategies for Games, Games with Chance, State of the Art Wolfram Burgard, Bernhard Nebel, and Martin Riedmiller Albert-Ludwigs-Universität
More informationFoundations of Artificial Intelligence
Foundations of Artificial Intelligence 6. Board Games Search Strategies for Games, Games with Chance, State of the Art Joschka Boedecker and Wolfram Burgard and Frank Hutter and Bernhard Nebel Albert-Ludwigs-Universität
More informationArtificial Intelligence
Artificial Intelligence Adversarial Search Vibhav Gogate The University of Texas at Dallas Some material courtesy of Rina Dechter, Alex Ihler and Stuart Russell, Luke Zettlemoyer, Dan Weld Adversarial
More informationAI Approaches to Ultimate Tic-Tac-Toe
AI Approaches to Ultimate Tic-Tac-Toe Eytan Lifshitz CS Department Hebrew University of Jerusalem, Israel David Tsurel CS Department Hebrew University of Jerusalem, Israel I. INTRODUCTION This report is
More informationECE 517: Reinforcement Learning in Artificial Intelligence
ECE 517: Reinforcement Learning in Artificial Intelligence Lecture 17: Case Studies and Gradient Policy October 29, 2015 Dr. Itamar Arel College of Engineering Department of Electrical Engineering and
More informationComp 3211 Final Project - Poker AI
Comp 3211 Final Project - Poker AI Introduction Poker is a game played with a standard 52 card deck, usually with 4 to 8 players per game. During each hand of poker, players are dealt two cards and must
More informationGame Playing Beyond Minimax. Game Playing Summary So Far. Game Playing Improving Efficiency. Game Playing Minimax using DFS.
Game Playing Summary So Far Game tree describes the possible sequences of play is a graph if we merge together identical states Minimax: utility values assigned to the leaves Values backed up the tree
More informationComputing Science (CMPUT) 496
Computing Science (CMPUT) 496 Search, Knowledge, and Simulations Martin Müller Department of Computing Science University of Alberta mmueller@ualberta.ca Winter 2017 Part IV Knowledge 496 Today - Mar 9
More informationGame Playing: Adversarial Search. Chapter 5
Game Playing: Adversarial Search Chapter 5 Outline Games Perfect play minimax search α β pruning Resource limits and approximate evaluation Games of chance Games of imperfect information Games vs. Search
More informationArtificial Intelligence
Artificial Intelligence Jeff Clune Assistant Professor Evolving Artificial Intelligence Laboratory AI Challenge One 140 Challenge 1 grades 120 100 80 60 AI Challenge One Transform to graph Explore the
More informationTriple Challenge.txt
Triple Challenge 3 Complete Games in 1 Cartridge Chess Checkers Backgammon Playing Instructions For 1 or 2 Players TRIPLE CHALLENGE Triple Challenge.txt TRIPLE CHALLENGE is an exciting breakthrough in
More informationApproaching The Royal Game of Ur with Genetic Algorithms and ExpectiMax
Approaching The Royal Game of Ur with Genetic Algorithms and ExpectiMax Tang, Marco Kwan Ho (20306981) Tse, Wai Ho (20355528) Zhao, Vincent Ruidong (20233835) Yap, Alistair Yun Hee (20306450) Introduction
More information6. Games. COMP9414/ 9814/ 3411: Artificial Intelligence. Outline. Mechanical Turk. Origins. origins. motivation. minimax search
COMP9414/9814/3411 16s1 Games 1 COMP9414/ 9814/ 3411: Artificial Intelligence 6. Games Outline origins motivation Russell & Norvig, Chapter 5. minimax search resource limits and heuristic evaluation α-β
More informationPlaying CHIP-8 Games with Reinforcement Learning
Playing CHIP-8 Games with Reinforcement Learning Niven Achenjang, Patrick DeMichele, Sam Rogers Stanford University Abstract We begin with some background in the history of CHIP-8 games and the use of
More informationGame Playing. Philipp Koehn. 29 September 2015
Game Playing Philipp Koehn 29 September 2015 Outline 1 Games Perfect play minimax decisions α β pruning Resource limits and approximate evaluation Games of chance Games of imperfect information 2 games
More informationMore on games (Ch )
More on games (Ch. 5.4-5.6) Announcements Midterm next Tuesday: covers weeks 1-4 (Chapters 1-4) Take the full class period Open book/notes (can use ebook) ^^ No programing/code, internet searches or friends
More informationLast update: March 9, Game playing. CMSC 421, Chapter 6. CMSC 421, Chapter 6 1
Last update: March 9, 2010 Game playing CMSC 421, Chapter 6 CMSC 421, Chapter 6 1 Finite perfect-information zero-sum games Finite: finitely many agents, actions, states Perfect information: every agent
More informationLearning to Play like an Othello Master CS 229 Project Report. Shir Aharon, Amanda Chang, Kent Koyanagi
Learning to Play like an Othello Master CS 229 Project Report December 13, 213 1 Abstract This project aims to train a machine to strategically play the game of Othello using machine learning. Prior to
More informationMonte Carlo based battleship agent
Monte Carlo based battleship agent Written by: Omer Haber, 313302010; Dror Sharf, 315357319 Introduction The game of battleship is a guessing game for two players which has been around for almost a century.
More informationA. Rules of blackjack, representations, and playing blackjack
CSCI 4150 Introduction to Artificial Intelligence, Fall 2005 Assignment 7 (140 points), out Monday November 21, due Thursday December 8 Learning to play blackjack In this assignment, you will implement
More informationHybrid of Evolution and Reinforcement Learning for Othello Players
Hybrid of Evolution and Reinforcement Learning for Othello Players Kyung-Joong Kim, Heejin Choi and Sung-Bae Cho Dept. of Computer Science, Yonsei University 134 Shinchon-dong, Sudaemoon-ku, Seoul 12-749,
More informationFoundations of AI. 5. Board Games. Search Strategies for Games, Games with Chance, State of the Art. Wolfram Burgard and Luc De Raedt SA-1
Foundations of AI 5. Board Games Search Strategies for Games, Games with Chance, State of the Art Wolfram Burgard and Luc De Raedt SA-1 Contents Board Games Minimax Search Alpha-Beta Search Games with
More informationHumanization of Computational Learning in Strategy Games
1 Humanization of Computational Learning in Strategy Games By Benjamin S. Greenberg S.B., C.S. M.I.T., 2015 Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment
More informationMyPawns OppPawns MyKings OppKings MyThreatened OppThreatened MyWins OppWins Draws
The Role of Opponent Skill Level in Automated Game Learning Ying Ge and Michael Hash Advisor: Dr. Mark Burge Armstrong Atlantic State University Savannah, Geogia USA 31419-1997 geying@drake.armstrong.edu
More informationUsing Artificial intelligent to solve the game of 2048
Using Artificial intelligent to solve the game of 2048 Ho Shing Hin (20343288) WONG, Ngo Yin (20355097) Lam Ka Wing (20280151) Abstract The report presents the solver of the game 2048 base on artificial
More informationUniversiteit Leiden Opleiding Informatica
Universiteit Leiden Opleiding Informatica Predicting the Outcome of the Game Othello Name: Simone Cammel Date: August 31, 2015 1st supervisor: 2nd supervisor: Walter Kosters Jeannette de Graaf BACHELOR
More informationSchool of EECS Washington State University. Artificial Intelligence
School of EECS Washington State University Artificial Intelligence 1 } Classic AI challenge Easy to represent Difficult to solve } Zero-sum games Total final reward to all players is constant } Perfect
More informationAchieving Desirable Gameplay Objectives by Niched Evolution of Game Parameters
Achieving Desirable Gameplay Objectives by Niched Evolution of Game Parameters Scott Watson, Andrew Vardy, Wolfgang Banzhaf Department of Computer Science Memorial University of Newfoundland St John s.
More informationProbability Questions from the Game Pickomino
Probability Questions from the Game Pickomino Brian Heinold Department of Mathematics and Computer Science Mount St. Mary s University November 5, 2016 1 / 69 a.k.a. Heckmeck am Bratwurmeck Created by
More informationSet 4: Game-Playing. ICS 271 Fall 2017 Kalev Kask
Set 4: Game-Playing ICS 271 Fall 2017 Kalev Kask Overview Computer programs that play 2-player games game-playing as search with the complication of an opponent General principles of game-playing and search
More informationCS 4700: Foundations of Artificial Intelligence
CS 4700: Foundations of Artificial Intelligence selman@cs.cornell.edu Module: Adversarial Search R&N: Chapter 5 1 Outline Adversarial Search Optimal decisions Minimax α-β pruning Case study: Deep Blue
More informationMachine Learning Othello Project
Machine Learning Othello Project Tom Barry The assignment. We have been provided with a genetic programming framework written in Java and an intelligent Othello player( EDGAR ) as well a random player.
More informationUniversity of Amsterdam. Encyclopedia of AI project. Tic-Tac-Toe. Authors: Andreas van Cranenburgh Ricus Smid. Supervisor: Maarten van Someren
University of Amsterdam Encyclopedia of AI project Tic-Tac-Toe Authors: Andreas van Cranenburgh Ricus Smid Supervisor: Maarten van Someren January 27, 2007 Encyclopedia of AI, assignment 5 Tic-tac-toe
More information42 nd WORLD BG CHAMPIONSHIP Tournament Rules
42 nd WORLD BG CHAMPIONSHIP Tournament Rules 1. PROPRIETIES 2. REGULATIONS 3. PREPARATION 4. THE GAME 5. DISPUTES 1. PROPRIETIES 1.1 Interpretation These tournament rules cannot and are not meant to cover
More informationGame Playing for a Variant of Mancala Board Game (Pallanguzhi)
Game Playing for a Variant of Mancala Board Game (Pallanguzhi) Varsha Sankar (SUNet ID: svarsha) 1. INTRODUCTION Game playing is a very interesting area in the field of Artificial Intelligence presently.
More informationIMPROVING TOWER DEFENSE GAME AI (DIFFERENTIAL EVOLUTION VS EVOLUTIONARY PROGRAMMING) CHEAH KEEI YUAN
IMPROVING TOWER DEFENSE GAME AI (DIFFERENTIAL EVOLUTION VS EVOLUTIONARY PROGRAMMING) CHEAH KEEI YUAN FACULTY OF COMPUTING AND INFORMATICS UNIVERSITY MALAYSIA SABAH 2014 ABSTRACT The use of Artificial Intelligence
More informationA Study of Machine Learning Methods using the Game of Fox and Geese
A Study of Machine Learning Methods using the Game of Fox and Geese Kenneth J. Chisholm & Donald Fleming School of Computing, Napier University, 10 Colinton Road, Edinburgh EH10 5DT. Scotland, U.K. k.chisholm@napier.ac.uk
More informationArtificial Intelligence Adversarial Search
Artificial Intelligence Adversarial Search Adversarial Search Adversarial search problems games They occur in multiagent competitive environments There is an opponent we can t control planning again us!
More informationTemporal Difference Learning for the Game Tic-Tac-Toe 3D: Applying Structure to Neural Networks
2015 IEEE Symposium Series on Computational Intelligence Temporal Difference Learning for the Game Tic-Tac-Toe 3D: Applying Structure to Neural Networks Michiel van de Steeg Institute of Artificial Intelligence
More information