K-means separated neural networks training with application to backgammon evaluations


Øystein Johansen

December 19, 2007

Abstract

This study examines whether a k-means clustering method can be utilized to classify backgammon positions into several clusters, such that each of these clusters has its own neural network to estimate the winning probabilities. In this way the different phases of the game can be covered by different neural nets, without specifying a special rule for each phase of the game. Each neural net will have less responsibility, and the size of the neural nets can be kept relatively small. The neural nets have been trained both with Sutton's TD(λ) algorithm and with a supervised learning algorithm. The same training algorithms have also been applied to a rule based classification configuration, so that the two classification methods can be compared. The results are still in favour of the rule based classification, but they also indicate that the k-means separation method is usable, promising and worth further investigation.

Contents

1 Introduction
  1.1 Backgammon
  1.2 Computer backgammon: a brief history
  1.3 Evaluations, equity and selection of the best move
  1.4 GNU Backgammon: Current evaluation techniques and position classification
    1.4.1 Lookahead searches
    1.4.2 Cubeful evaluations
  1.5 Scope of this study
2 Experimental setup and code implementation
  2.1 Neural networks
    2.1.1 Output units
    2.1.2 Input units
    2.1.3 Activation function
    2.1.4 Backpropagation
  2.2 Code structure and modelling
  2.3 Benchmark method
    2.3.1 Database benchmark
    2.3.2 Head-to-head benchmark player

3 Training algorithms and k-means separation
  3.1 TD(λ)-training
  3.2 Supervised training
  3.3 K-means separation
    3.3.1 Complexity of k-means
4 Results and analysis
  4.1 Benchmarks of GNU Backgammon's neural networks
  4.2 Database benchmark compared to head-to-head
  4.3 Training results before introduction of k-means separation
    4.3.1 TD(λ) training results
    4.3.2 Supervised training results
  4.4 Results of k-means clustering
  4.5 Training of k-means separated neural networks
    4.5.1 TD(λ) training results
    4.5.2 Supervised training results
  4.6 The final verdict: Best k-means versus the reference nets
5 Discussion
  5.1 TD(λ) training vs. supervised training
  5.2 K-means separation: Improvement or not?
  5.3 Further work
6 Conclusion
A Algorithm implementations
  A.1 TD(λ) training algorithm
  A.2 Supervised training algorithm
B Description of tools developed
C Simplified class diagram of evaluator engine

1 Introduction

Backgammon is a popular game, and it is gaining attention from more and more players all over the world. The possibility to make a computer program predict the outcome of a backgammon position is very valuable for any backgammon player who takes the game seriously. It is therefore important to be able to develop software that can give good estimates of the possible outcomes of a position. Serious players are willing to pay a relatively high price for a good software tool that can predict the probabilities of the game in a given position. A player who studies the game with a good software tool will learn more about the game, and in this way the player can get a small edge on his opponent in a tournament.

1.1 Backgammon

Backgammon is an ancient game. The exact origins remain unknown, but according to Crawford and Jacoby [10] the most ancient possible ancestor of the game found so far dates back about 5000 years.

Figure 1: The normal initial position of a backgammon game. Black should move his checkers clockwise to his home board (points 1 to 6). White should move his checkers anti-clockwise to his home board (points 19 to 24).

However, backgammon with the rules we know today is much newer. The scoring system and the doubling cube were probably invented in 1926 or 1927 in New York [15]. Backgammon has increased in popularity over the last decades. The possibility to play the game over the Internet and the increasing number of tournaments held all over the world are probably the main reasons for this increased popularity.

Backgammon is a two player board game, played with fifteen checkers (sometimes called men) for each player. The board consists of 24 points, where the checkers can be placed. These 24 points are usually of triangular shape and arranged in four quadrants with six points in each quadrant. The quadrants are divided by a bar, and checkers that are hit are placed on the bar. See figure 1.

The game is turn based and is played with a pair of dice. The player on turn rolls the dice and makes a move according to the dice rolled. The object of the game is to collect all checkers in the home board quadrant and then bear the checkers off the board. The player who first gets all his checkers off the board wins the game. Checkers can not be borne off unless all the player's remaining checkers are inside his home quadrant.

Backgammon is basically a race. The players move in opposite directions on the board, so the checkers interfere with each other in the race to get home. A single checker on a point (often called a blot) can be hit. A hit checker is placed on the bar and has to enter from the bar again before it can resume the race home. Being hit is therefore a setback in the race. A point which is occupied

with two or more checkers is said to be owned by the player. The opponent can not use this point. Owning several constructive points can have a major obstructive effect, and can therefore slow the opponent down in his race.

There are three types of wins in backgammon. A normal win is simply getting all checkers off before the opponent. This win gives the player one point. If a player can get all his checkers off before the opponent gets any of his checkers off, the player wins two points. This is called a gammon. The third type of win is if a player can get all his checkers off while the opponent still has all his checkers on the board and at least one checker remaining in the winner's home quadrant or on the bar. This win is called a backgammon and gives the winner three points.

The doubling cube is also an element in backgammon. If a player believes he has a good position he can offer to double the stakes of the game. The opponent must then consider whether he will accept this offer and continue playing for twice the stakes, or whether he would rather resign the game at the current stake.

1.2 Computer backgammon: a brief history

Making computer backgammon programs is different from making computer chess programs. Chess programs can do brute force searches in the game tree. It is possible to see deep into the tree, and alpha-beta pruning speeds up the searches such that the computer chess player can find a good move. Computer backgammon players must take into account each possible dice roll, of which there are 21 distinct ones, and find the best corresponding move for each of these 21 rolls. To find the best move for each roll, all the possible moves after each dice roll must be considered. On average there are about 20 legal moves in a position with a given dice roll. The fan-out of the search tree will therefore be much wider than in chess. This limits how deep it is possible to search the game tree.

One of the first attempts to make a computer backgammon player must have been BKG by Hans Berliner [4, 2]. This backgammon player was developed in the late 1970s. In 1979, BKG played the world champion of the same year, Luigi Villa. BKG won a 5 point match by 7-1 [3]. Later Berliner admitted that the program was really lucky and made some terribly bad moves, but Villa was not able to take advantage of the program's weaknesses. BKG was a knowledge based computer program with heuristic rules for how to evaluate each position. It is hard to estimate how strong this program was compared to today's programs, but it was probably just playing at an amateur level.

Expert Backgammon, by Tom Weaver, was a commercial program available for Macintosh in the mid 1980s. It was also knowledge based, but one of the big features of Expert Backgammon was its ability to do rollouts. Rollouts are simply Monte Carlo simulations of a position, which give a really good estimate of how good a position is and what the probable outcome is. Before this software, rollouts were done manually by two human players and were considered time consuming and boring. Computerized rollouts were much faster and more reliable.

The next successful attempt to make a computer player was Neurogammon, by Gerald Tesauro, a researcher at IBM [18]. It used an artificial neural network for evaluating backgammon positions, and this must be considered a milestone in the development of computer backgammon. Neurogammon was trained with supervised training based on inputs from an expert player.

This program became the best computer program for backgammon, but it still played at the level of a strong amateur player.

Gerald Tesauro continued his research on computer backgammon and constructed TD-Gammon. TD-Gammon was based on Neurogammon and used neural nets for the evaluations, but the difference was that TD-Gammon was trained with a Temporal Difference Learning algorithm, TD(λ) [19, 20]. It is a reinforcement learning technique which is described in detail by Sutton [17]. TD(λ) learning simply means that the program learns as it plays against itself. TD-Gammon achieved a master level of play, and it is fascinating that there was zero knowledge (except for the rules) coded into the system. It was purely trained by self play [21, 22].

JellyFish, by Fredrik Dahl, was the first commercial backgammon playing program that used neural nets for evaluating positions. Kit Woolsey, a world class player, said JellyFish 1.0 was better than TD-Gammon in some respects, but overall weaker [16]. JellyFish 2.0 was released in March 1996, and version 3.0 followed later. The development of JellyFish ceased in 1999, and the latest version available is version 3.5. JellyFish was also trained by a reinforcement learning algorithm, but not the exact TD(λ) algorithm used by TD-Gammon; the details about this are not known.

Snowie was the next commercial program that played backgammon. It was first released in the late 1990s. It was really strong, and it had an appealing user interface. Even though it was expensive compared to JellyFish, it became the choice for most serious backgammon players. It was so strong that evaluations and rollouts performed by Snowie were considered the only true evaluations.

GNU Backgammon (gnubg) was an open source effort that was started by Gary Wong in 1998 [24]. It is free software, and several developers joined the project; it soon became just as strong as Snowie, if not stronger. The neural networks were mainly trained by Joseph Heled, who has documented much of his work on his web page [9] and in the GNU Backgammon source code. This study is largely based on the work of Heled in the GNU Backgammon project, and some of the details of GNU Backgammon are therefore described thoroughly.

1.3 Evaluations, equity and selection of the best move

Common to all computer backgammon programs is that they do not consider moves as such; they evaluate positions. An evaluation of a backgammon position is simply an estimation of the probability of each outcome of the game. Each possible outcome is a random variable, and an evaluation is the process of predicting or estimating the values of these random variables. When an estimate for the probabilities of the possible outcomes of the game has been found, it is trivial to calculate the equity of the position. The equity is the expected number of points a player wins per game (ppg). The expected number of points a player wins is of course the same as the expected number of points the opponent loses. Therefore the equity of a player is always the negative of the opponent's equity. Since a player can win or lose one, two or three points, the equity will always be between -3.0 and 3.0.
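To make this concrete, the cubeless equity is just the expected score over the six mutually exclusive outcomes (single win or loss, gammon, backgammon). The small sketch below is only an illustration; the function and argument names are not part of any existing evaluator interface.

/* Cubeless equity from the six mutually exclusive outcome probabilities.
 * Single wins and losses score 1 point, gammons 2, backgammons 3. */
double cubeless_equity(double p_win1, double p_wing, double p_winbg,
                       double p_lose1, double p_loseg, double p_losebg)
{
    return 1.0 * (p_win1 - p_lose1)
         + 2.0 * (p_wing - p_loseg)
         + 3.0 * (p_winbg - p_losebg);
}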

When a computer backgammon player selects a move, it takes the current position, finds all legal moves, and then evaluates the resulting position after each of these moves. The resulting position that gives the best equity is the position that resulted from the best move. So the best move is found by evaluating the resulting position after each legal move from the current position with the given dice roll.

1.4 GNU Backgammon: Current evaluation techniques and position classification

GNU Backgammon is a backgammon program that plays and analyses backgammon games and positions. Its evaluations are considered to be among the best of all computer backgammon programs. This study tries to recreate the training done in the GNU Backgammon project, and hopefully the quality of the evaluations can even be improved.

The evaluation engine in GNU Backgammon is based on several techniques to get a good evaluation of each position. The positions are classified into five different evaluation classes according to the position on the board. These different classes are evaluated in different ways.

Two sided bearoff: Both players are bearing off their checkers, and both players have six or fewer checkers remaining on the board. These situations are handled by a two sided database. For each possible position there is a corresponding winning chance. This bearoff database is pre-calculated by a recursive algorithm, and this type of evaluation will therefore give an exact result.

One sided bearoff: Both players are bearing off, but at least one of the players has more than six checkers left. This type of position is evaluated by another database. This database stores, for one player at a time, the distribution of the number of rolls needed to get all checkers off from each position. The winning probability is then given by combining these distributions. This method is described by Boro [5], and it is known to give results that are very close to the exact probabilities.

Race positions: This is the type of position where the contact between the players' checkers has been broken and no checkers can be hit or blocked anymore. These positions are evaluated by the race neural network. This neural network has 214 input units, 128 hidden units and 5 outputs. This neural net was developed and trained by Heled [9], and it is considered to be very strong and reliable.

Crashed positions: This type of position occurs when one of the players has six or fewer free checkers to play, while the rest of his checkers are either borne off or stacked on the ace point or deuce point (the traditional names for the one-point and two-point; see the points numbered 1 and 2 in figure 1). This type of position is also evaluated by a neural network. The neural net has 250 input units, 128 hidden units and 5 output units. This neural net is known to have some weaknesses, and can make big mistakes in certain positions.

Contact positions: This is the class that is most common in a backgammon game. It covers any contact position that does not fall into the crashed definition above. Heled reports that 79% of all positions fall into this category,

based on online games played at FIBS between computers and humans. This position class is also evaluated by a neural network. This neural network has 250 input units, 128 hidden units and 5 output units. This net is strong, but it is known to make some mistakes in certain positions.

1.4.1 Lookahead searches

To improve the evaluations, GNU Backgammon can also do a lookahead search. It is the brute force game tree search known from computer chess, applied to backgammon. Since the fan-out of the game tree is much wider in backgammon than in chess, the search is shallow compared to the similar search in chess. A lookahead search simply looks at all possible rolls by the opponent and then averages the evaluations of the resulting positions after the best move for each roll. This can be done for an arbitrary number of subsequent moves. This is a method that is known to work well for chess, but in backgammon the search tree has a high branching factor and the search becomes expensive. To improve this there is also a set of pruning neural nets. These are smaller neural nets with only 200 input units, 10 hidden units, and the same 5 output units. These neural networks are applied at the top level of the search when doing a lookahead search. For deeper nodes in the search tree, there are move filters that cancel out bad moves that should not be evaluated further.

1.4.2 Cubeful evaluations

Backgammon is often played with a doubling cube. The player on roll can offer a double to his opponent, suggesting that the game continues for twice the stakes. The opponent can accept this suggestion, take the cube, or he may choose to pass the suggestion and thereby lose the game at the current stake. This is sometimes called dropping the cube. The cube gives a skewness to the equity, since if a player can double where the opponent will pass, the true equity after this will be 1.0 for the player and -1.0 for his opponent. Janowski [11] has studied this in detail, and his research serves as a basis for the cubeful evaluations in GNU Backgammon. It is basically an adjustment to the cubeless evaluation based on the cube ownership. This method is considered to be state of the art for handling cube evaluations in backgammon, even though several other methods have been implemented and tried.

1.5 Scope of this study

This study will try to build a set of training tools to create a new independent backgammon evaluation engine based on some of the techniques and code from GNU Backgammon. The tools will be developed and neural nets will be trained. Since the bearoff databases are considered to be accurate, and the race neural net is very strong and reliable, the effort should be put into the contact and crashed position types. Since most positions are evaluated with the contact neural net, it is believed that improving this specific neural net would give the best overall performance gain.

Another approach to improving the overall quality of the evaluations is to define different position classes than the current contact and crashed classes. The

classification between these two classes is simply rule based, and this classification could be done with a k-means classification instead. Such a k-means classification also means that the positions that are now evaluated by the crashed and contact evaluators can be evaluated by several new neural nets. This study therefore has two parts:

Try to recreate and retrain a neural network that handles contact positions, and hopefully get something that is stronger than the current GNU Backgammon neural net for this position class. This is to verify that the training algorithm works.

Reclassify the positions now handled by the crashed and contact neural nets with a k-means method. Then retrain these new neural nets based on these new classifications.

The study is also limited to evaluations where the cube is not considered. This means that all evaluations are made under the assumption that the game will be played to the end, and not terminated by a cube offer and a drop. Every point scored is considered to have the same value for the player, as if each point scored were connected to a stake. This is sometimes called cubeless money game. It is believed that a good cubeless money game evaluation will give good enough cubeful evaluations with the current cubeful evaluation techniques suggested by Janowski [11]. The study will also only consider static evaluations without any lookahead in a search tree. If the static neural net evaluations are improved, the lookahead evaluations will also improve.

2 Experimental setup and code implementation

This section describes some of the technologies used in this study and some of the decisions taken to build the computational model. It gives a very short description of neural networks; for a more extensive description of neural nets, refer to text books like Kartalopoulos [12] or Haykin [8]. This section will describe the neural nets used, the benchmarking methods, and how the software model is designed and implemented.

2.1 Neural networks

Artificial neural networks, or just neural networks, is a technology utilised in the development of artificial intelligence. The neural net technology is inspired by how the biological brain of humans and animals works. A biological brain is composed of millions of neurons, and these neurons are connected to each other by axons and dendrites. The connections between them are adaptive, which means that the connection structure is dynamically changing. Changes of the connections are what we call learning. In an artificial neural network it is similar, however here it is just numeric values that connect the neurons.

One of the most used structures is the multilayer neural network. The neurons are usually modelled in three layers, where the first layer is called the input layer, the intermediate layer is called the hidden layer and the last layer is called the output layer. See figure 2. The neurons in this model are sometimes called units or nodes. The nodes of each layer

Figure 2: Structure of a multilayer neural network. This network has three input units, four hidden units and three output units. The units, or neurons, are connected through adjustable weights. Adjusting the weights is considered training of the neural network.

are connected to each other with adjustable weights. The process of adjusting these weights is called training. The numeric value going into a unit is the sum, over all its input connections, of the connected value times the weight of that connection. The numeric value of the neuron is then adjusted with a nonlinear activation function.

The neural networks used in this study are based on the implementation found in GNU Backgammon. However, some features have been removed and some other features have been added. The neural net is now written as a class with the gobject system. The neural networks are quite standard three layer neural networks. There are five output units, 128 hidden layer units and 250 input units. The activation function is a standard sigmoid function.

2.1.1 Output units

There are five outputs of the neural networks, each representing the probability of an outcome. Since there are six different outcomes of a backgammon game, and a game must end with one of them, only five outputs are needed. The five different outputs are defined as in table 1. Note that there is no output for the probability of losing, since this would be redundant. The probability of losing is simply the probability of not winning, which is easily calculated as one minus the probability of winning (the first output).

Output  Description
0       Probability of winning (any type of win)
1       Probability of winning a gammon or a backgammon
2       Probability of winning a backgammon
3       Probability of losing a gammon or a backgammon
4       Probability of losing a backgammon

Table 1: Description of neural net outputs.

2.1.2 Input units

The calculation of the input vector for the neural networks from a backgammon position description is the same as in GNU Backgammon. There are four base inputs for each point of the board. These four inputs are all zero when there is no checker on the point. The first base input is set to 1.0 if there is a blot (single checker) on the point, and is zero if there is more than a single checker on the point. The second base input is set to 1.0 if there are exactly two checkers on the point, and is otherwise zero. The third base input is set to 1.0 if there are three or more checkers on the point. The last base input for a point increases by 0.5 for each checker beyond three; if there are three or fewer checkers on the point, this input is kept at zero. These base inputs are more or less the same as described by Tesauro [19].

In addition to the base inputs, there are 25 handcrafted inputs. To be able to test the current GNU Backgammon neural networks, these handcrafted inputs were kept the same as in GNU Backgammon. These inputs are pip counts, a blocking value, a contact value, a momentum of checker distribution, some inputs for anchors, and some inputs for the number of hitting dice rolls and how much the race is set back if hit. For details about the additional handcrafted inputs, check the GNU Backgammon source code [23].

So, there are 24 points on the board, plus the bar, and these 25 points have four base inputs each, which makes 100 inputs. There are 25 additional handcrafted inputs. These 125 inputs are calculated for both players, so the total number of inputs to the neural net is 250.

2.1.3 Activation function

The activation function is a quite standard sigmoid function, shown in equation 1. The β values for the sigmoid functions are 0.1 for the hidden layer activation and 1.0 for the output layer activation. Some small experiments were done to see if these values could be adjusted to other values that would make the weights converge faster; however, the effect of the β values was found to be minimal.

f(x) = 1 / (1 + exp(-βx))    (1)

Since the sigmoid is called very often, it is quite important that this function is efficient and fast. To increase its performance, the function has been discretized into a lookup table.
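As an illustration of this discretization, the sketch below precomputes the sigmoid on an evenly spaced grid and interpolates linearly between the table entries. The table size and clamping range here are assumptions made for the example, not the values used in GNU Backgammon or in the tools developed in this study; since β only scales the argument, a single table for β = 1 suffices and the caller passes βx.

#include <math.h>

#define SIG_ENTRIES 1024        /* assumed table resolution             */
#define SIG_MAX     10.0f       /* assumed clamping range [-10, 10]     */

static float sig_table[SIG_ENTRIES + 1];

/* Fill the table with f(x) = 1/(1 + exp(-x)); callers pass beta * x. */
static void sigmoid_init(void)
{
    for (int i = 0; i <= SIG_ENTRIES; i++) {
        float x = -SIG_MAX + 2.0f * SIG_MAX * i / SIG_ENTRIES;
        sig_table[i] = 1.0f / (1.0f + expf(-x));
    }
}

/* Table lookup with linear interpolation, clamped outside the range. */
static float sigmoid(float x)
{
    if (x <= -SIG_MAX) return sig_table[0];
    if (x >=  SIG_MAX) return sig_table[SIG_ENTRIES];

    float pos  = (x + SIG_MAX) / (2.0f * SIG_MAX) * SIG_ENTRIES;
    int   idx  = (int) pos;
    float frac = pos - idx;
    return sig_table[idx] + frac * (sig_table[idx + 1] - sig_table[idx]);
}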

Figure 3: Simplified class diagram of the Evaluator abstract class and its three implementations: NetEvaluator (composed of a NeuralNet, an InputCalculator and a PostProcessor), DBEvaluator (with a Database) and OverEvaluator.

2.1.4 Backpropagation

For adjusting the weights during training, a standard backpropagation algorithm is used. A simple experiment with a momentum adjustment to the backpropagation was done at an early stage, but no effect on the weight convergence was observed. Momentum adjustment was therefore abandoned.

2.2 Code structure and modelling

Before the new training tools were developed, an object oriented design of the system was made. All code is now written in the C programming language, and the only non-standard library used is GLib [6]. GLib is a utility library for C which provides basic data structures like linked lists, tree structures, hash tables, etc. This library is the foundation of the GTK+ toolkit system. In addition to the data structures, GLib also contains gobject and gtype. These provide the C programming language with an object system. With this object system it is possible to write object oriented code with known principles like inheritance, polymorphism and encapsulation.

There are two fundamentally different ways of evaluating a position: database evaluation and neural network evaluation. In addition there is also an evaluation when a game is over, and these situations get an evaluation type of their own. The obvious design would be to have an abstract class, Evaluator, with the method evaluate(), and three implementations of this abstract class, as shown in figure 3. In this way common evaluation behaviour can be implemented in the abstract class instead of in each implementation.

The NetEvaluator class is composed of three other classes: NeuralNet, InputCalculator and PostProcessor. An evaluation with an instance of this class begins with calculating the input vector for the neural net from the board. This calculation is performed by the InputCalculator instance. The evaluation itself is the calculation performed by the neural net. After the neural net evaluation, the PostProcessor checks the neural net output to make sure it makes sense. In some situations the neural net can report impossible results, like a higher probability of winning a gammon than the probability of winning. The PostProcessor instance will correct errors like these. The composition of the NetEvaluator is shown in figure 4.

Figure 4: Simplified class diagram of the NetEvaluator class, composed of a NeuralNet, an InputCalculator and a PostProcessor.

Figure 5: Simplified class diagram of the Engine class. The class has a collection of all the different evaluators and an appurtenant classifier.

The NeuralNet class is simply implemented as one class with all operations such as evaluate, train, load and save. However, the activation function should have been separated out of the NeuralNet class, since the use of different activation functions can be a subject for further studies.

The collection of different evaluators, together with an appurtenant classifier instance, is gathered in a class called Engine. The simplified class diagram for the Engine class is shown in figure 5. A full (simplified) class diagram of the whole evaluation system can be found in appendix C.

2.3 Benchmark method

With the tools developed as described above, it is really simple to breed a new neural network. The big problem is to breed a neural network that is better than previous nets. It is therefore necessary to be able to benchmark a neural net in an effective way. The best test would be to let one neural net play a high number of games against a reference neural net. However, the natural variance in high level backgammon is quite high. The number of games to be played in such a test must probably be above a million to get statistically significant results. Such a test would therefore be time consuming and would not be practical for benchmarking.

2.3.1 Database benchmark

A better approach is to have a collection of positions where a dice roll is given, and where the best move has been found by performing a Monte Carlo simulation of the resulting position after each legal move. In that way a database can store a position and an accompanying dice roll where the best move is known, and in addition, for every other move, the equity loss (ppg) incurred by not playing the best move. Fortunately such a database already exists from the GNU Backgammon project. Heled [9] collected such a benchmark database of positions for each position class. The database for contact positions contains positions with an accompanying dice roll and Monte Carlo simulated results for the best moves. For each position in the database there are usually only five or six move candidates that have been simulated; however, these are the five or six best moves.

The benchmark is calculated by looping through all these positions, finding the best move according to the current neural net evaluator, and comparing this move with the best move stored in the database. If the evaluator finds the same best move as the best move in the database, no error is added to the running total error. If another move is found to be best by the evaluator, the running error increases by the equity difference between the best move and the move chosen. When all positions in the database have been processed, the total error is reported as the benchmark. Since this benchmark is based on comparing different positions relative to each other, it is called the relative error benchmark.

The same database also contains some positions where no dice roll is given; there is just a Monte Carlo simulation of the five outputs. In Heled's work these positions and results were used for benchmarking the cube evaluation. In this study these positions are used to get an absolute error benchmark. The benchmarking algorithm returns the relative error, the absolute error, and the number of positions where the evaluator chose the best move according to the benchmark database. We should distinguish between absolute error and relative error:

Relative error: The error in how different positions are evaluated compared to each other. Is one position better than another? This can be measured by an algorithm that selects the best move given a position and a dice roll, which is exactly what the benchmark algorithm does.

Absolute error: The error in the evaluation of the position itself. How precisely does the net predict the outcome of the game compared to a rollout of the same position? (Rollout is the backgammon lingo for a Monte Carlo simulation; these simulations are usually quite accurate.) Fortunately the benchmark database also contains positions with a rollout result, and this is used as a measure of the absolute error.

Important: We are most interested in having the relative error as low as possible. We want our evaluator to make the right move. If it can find the right move, a rollout can find the true value of the position. This is also how human experts evaluate when making a move decision. A human will compare the different resulting positions after each move, and compare these positions to each other. The human will not try to estimate the true winning chance after each move and then compare the winning chances.
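A sketch of this relative-error benchmark loop is shown below. The BenchPos record, the Engine type and the helpers engine_find_best_move() and moves_equal() are hypothetical stand-ins for the tools developed in this study, and a chosen move that falls outside the stored candidate set is simply ignored here.

typedef struct { int from[4], to[4]; } Move;   /* up to four checker moves */

typedef struct {
    int    dice[2];          /* the accompanying dice roll                 */
    int    n_candidates;     /* number of rolled-out candidate moves       */
    Move   move[8];          /* candidates; move[0] is the best move       */
    double equity_loss[8];   /* ppg lost by playing each candidate instead */
    /* ... board representation omitted ...                                */
} BenchPos;

extern Move engine_find_best_move(struct Engine *engine, const BenchPos *pos);

static int moves_equal(const Move *a, const Move *b)
{
    for (int i = 0; i < 4; i++)
        if (a->from[i] != b->from[i] || a->to[i] != b->to[i])
            return 0;
    return 1;
}

/* Relative error benchmark: sum the equity losses of the moves the
 * evaluator picks whenever they differ from the rolled-out best move. */
double benchmark_relative_error(struct Engine *engine, const BenchPos *db,
                                int n, int *n_correct)
{
    double total_error = 0.0;
    *n_correct = 0;

    for (int i = 0; i < n; i++) {
        Move chosen = engine_find_best_move(engine, &db[i]);

        if (moves_equal(&chosen, &db[i].move[0])) {
            (*n_correct)++;                       /* best move found: no error */
            continue;
        }
        for (int j = 1; j < db[i].n_candidates; j++)
            if (moves_equal(&chosen, &db[i].move[j]))
                total_error += db[i].equity_loss[j];
    }
    return total_error;
}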

2.3.2 Head-to-head benchmark player

In addition to the benchmark based on the database, a head-to-head benchmark was developed. It is a small tool that can set two different evaluation engines up against each other for a specified number of games. These games were played without any cube actions, and the different outcomes were reported together with how many points per game one evaluation engine scores against the other. Standard deviations for all numbers were also calculated. In this way the database benchmark could be verified, and a relation between the benchmark error rates and ppg could be developed. This relation can be found in the results section.

3 Training algorithms and k-means separation

3.1 TD(λ)-training

It is possible to make a neural net based backgammon evaluator learn how to play good backgammon with zero knowledge (except for the rules of the game), just by letting it play against itself. The method is a reinforcement learning algorithm called Temporal Difference (TD) learning. This algorithm was introduced by Sutton [17]. It was first applied to backgammon by Tesauro [19]. The algorithm updates the weights in a neural net according to equation 2.

w_{t+1} - w_t = α (Y_{t+1} - Y_t) Σ_{k=1}^{t} λ^{t-k} ∇_w Y_k    (2)

The left side of equation 2 is simply the weight change. α is a learning rate parameter, and Y_{t+1} and Y_t are the evaluation outputs at time steps t+1 and t. A time step in this sense is one move by a player. ∇_w Y_k is the gradient of the output with respect to the weights. λ is a parameter in the range from 0.0 to 1.0, which controls how much temporal credit is assigned to errors for time steps (moves) further back in the game. A λ value of 0.0 will only assign credit for the error in the immediately previous position, while λ = 1.0 will assign the same credit to all errors through that game. Intermediate values of λ give a geometrically decreasing temporal credit assignment for earlier time steps.

Several sources [22] state that a good value for λ is 0.0. This simplifies a lot: there is then only temporal credit assigned for the previous time step, and only one term in the sum in equation 2. Notice how equation 2 reduces to the backpropagation equation when λ = 0. The backpropagation algorithm can then be utilised directly to update the weights. The target values for the input at time step t are simply the outputs of the estimator at time step t+1, namely Y_{t+1}. This makes for a really simple implementation of the algorithm. The implementation can be found in appendix A.1.
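A minimal sketch of this λ = 0 update is given below, assuming a NeuralNet class with the evaluate() and train() operations from figure 4. The exact function signatures, and the handling of perspective when the side to move changes, are simplifications made for the illustration; this is not the actual tool code (see appendix A.1).

#include <glib.h>
#include <string.h>

/* TD(0) self-play update: the net's own evaluation of the position one
 * time step later is used as the backpropagation target for the position
 * at time step t.  At the end of a game the observed outcome is the target. */
void td0_update(NeuralNet *nn,
                const float *input_t,        /* inputs for the position at time t   */
                const float *input_t1,       /* inputs for the position at time t+1 */
                gboolean     game_over,
                const float  final_outcome[5],
                float        alpha)
{
    float target[5];

    if (game_over)
        memcpy(target, final_outcome, sizeof target);
    else
        neuralnet_evaluate(nn, input_t1, target);   /* Y_{t+1} as target */

    neuralnet_train(nn, input_t, target, alpha);    /* one backprop step */
}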

3.2 Supervised training

Heled [9] used a database of training positions that was increased stepwise. It started out with a small database of positions, and the net was trained until no further improvement was seen. Then more positions were added to the training database as the training progressed. In the training in this study, however, the database contains all six hundred thousand training positions right from the beginning, and the number of positions in the database is never increased.

In his article, Heled describes his supervised algorithm:

A switch to supervised training (ST) addresses the first drawback. A set of positions is chosen, and the net is trained in epochs (a pass through all of the training data with a fixed α) over and over, trying to find the best fit for the data set as a whole. It was quite an effort to make supervised training work. I ended up with a method which works for me, but it is still unclear why it does. I use ridiculous values of α in the range 30 to 1, coupled with randomizing the order of data in each epoch.

1. set α to α_start
2. select a random order for the data
3. perform one epoch
4. err = NN error
5. if err is smaller by p% than err_previous, return to stage 3
6. decrease α. if α < α_min, set it to α_start
7. if err < err_previous, return to stage 3
8. return to stage 2

So, while there is a p% improvement, another pass is performed with the same settings. When there is a small improvement, α is clamped down some more. When the error increases, the data is reordered. Typical values are α_start = 20, p = 0.5% and α_min = 1 or 0.5. Obviously a big α lets us escape from small local minima, but it is not clear to me why it converges at all.

The algorithm in this study is quite similar to this. The difference is that in step 6, α is decreased by a decrease factor, by default set to 0.9 as found in Heled's code, but if the minimum α is reached, α is kept at this low value instead of being reset to α_start. The implementation instead has an increase of α at step 8 of the above algorithm. This increase is called the increase factor. The idea of increasing α is to create a small disturbance such that a training session can get out of a local minimum. The implementation of the supervised learning algorithm can be found in appendix A.2.

During an epoch (one run through all positions in the training database), the weights are updated for each position in the database. This is referred to as stochastic training in Duda [7], as opposed to batch training. Duda [7] recommends stochastic training for most applications, especially ones employing large redundant training sets. The database with training positions for the contact neural net contains about six hundred thousand positions. The database used in this study is the same as Heled used in his training. The positions in this database are chosen from self play, and the corresponding probability values are simulated with Monte Carlo simulations (rollouts).
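A sketch of this training schedule is given below, with the control flow following the numbered steps above. TrainPos, shuffle_positions() and run_epoch() (one stochastic pass over the data, returning the error) are hypothetical helpers, p is given as a fraction (0.005 for 0.5%), and the termination on a fixed epoch count is an assumption; this is an illustration, not the tool code in appendix A.2.

#include <glib.h>

void supervised_train(NeuralNet *nn, TrainPos *data, int n,
                      double alpha_start, double p, double decrease_factor,
                      double increase_factor, double alpha_min, int max_epochs)
{
    double alpha    = alpha_start;                   /* step 1 */
    double err_prev = G_MAXDOUBLE;

    shuffle_positions(data, n);                      /* step 2 */

    for (int epoch = 0; epoch < max_epochs; epoch++) {
        double err = run_epoch(nn, data, n, alpha);  /* steps 3-4 */

        if (err < err_prev * (1.0 - p)) {            /* step 5: big improvement   */
            err_prev = err;
            continue;                                /* same settings again       */
        }
        alpha *= decrease_factor;                    /* step 6 */
        if (alpha < alpha_min)
            alpha = alpha_min;                       /* kept low, not reset       */

        if (err < err_prev) {                        /* step 7: small improvement */
            err_prev = err;
            continue;
        }
        alpha *= increase_factor;                    /* step 8: small disturbance */
        shuffle_positions(data, n);                  /* and reorder the data      */
    }
}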

3.3 K-means separation

Our proposed method for improving the overall evaluations is to separate the evaluation into several neural nets. Among the current evaluators, the evaluations from the bearoff databases and the race neural network are close to perfect and need no further improvement. As shown in figure 6a, the current classifier for positions is rule based, and classifies positions into these position types:

Game over
Two sided bearoff
One sided bearoff
Race positions
Contact positions
Crashed positions

The first is simply the class returned when a game is over and there is a winner. The corresponding evaluator just gives the points to the winner. The two bearoff classes are evaluated by lookup databases. Only the last three items in this list are evaluated by neural nets. The race position class has a different input calculation, and the positions are conceptually different from the other two neural net classes. The race position neural network is also really good. It is therefore natural to keep the race position evaluations as they are today.

The crashed and contact position classes may benefit from a further subdivision, and the idea is to provide such a division with a k-means method as shown in figure 6b. The idea is to rewrite the classifier such that the positions that are now classified as contact or crashed will be classified by a new k-means classifier. These new classes will simply be labelled contact0, contact1, contact2 and contact3. See figure 6b.

The data vector to classify from is a subset of the input vector to the neural net. The input vector to the neural net is composed of 200 base inputs and 50 additional inputs, which are basically handcrafted features of the position such as pip count, number of hitting rolls, pips set back if hit, blocking value, etc. It is these additional inputs that are used for the k-means classification.

A simple program was written to extract all the positions from the training databases. The training data for the k-means classification are simply the positions from the training data sets of both the crashed training database and the contact training database. For each position in these databases the 50 additional handcrafted features were calculated and saved to a file.

The k-means algorithm used to generate the codebook was the implementation from Scipy. Scipy is a set of algorithms and tools for the Python programming language. It enriches the Python language with matrix operations like those known from commercial programs such as MATLAB. It also has some analytical packages available. The k-means algorithm used here is from the package scipy.cluster.vq.
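At evaluation time, classifying a position with the resulting codebook amounts to picking the cluster centre (µ vector) closest to the position's 50 handcrafted features. The sketch below illustrates this step; the constants and the function name are illustrative, not the actual classifier code.

#include <float.h>

#define N_FEATURES 50    /* the handcrafted inputs used for clustering */
#define N_CLUSTERS 4     /* contact0 .. contact3 */

/* Return the index of the codebook vector with the smallest squared
 * Euclidean distance to the feature vector of the position. */
int kmeans_classify(const float features[N_FEATURES],
                    const float codebook[N_CLUSTERS][N_FEATURES])
{
    int    best = 0;
    double best_dist = DBL_MAX;

    for (int c = 0; c < N_CLUSTERS; c++) {
        double dist = 0.0;
        for (int i = 0; i < N_FEATURES; i++) {
            double d = (double) features[i] - codebook[c][i];
            dist += d * d;
        }
        if (dist < best_dist) {
            best_dist = dist;
            best = c;
        }
    }
    return best;
}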

Figure 6: Decision tree for the standard rule based classifier (a), and for the newly implemented classifier with k-means classification (b).

Each iteration of the k-means algorithm, with the described dataset, took about 30 seconds. The calculation was set to compute a codebook (a set of µ vectors) over 1680 iterations. This takes about 15 hours, which was all the time available at that point. The algorithm was set to classify into four different classes. Of course it could be more or fewer than four, but five or more classes would demand even more training, and three is only one more than the existing configuration, so four new classes sounded like the best number for this study. The optimum number of classes can be examined further. The interactive Python/Scipy code for the k-means classifier training can be seen in listing 1.

Listing 1: Interactive Python code to generate the k-means codebook

from scipy.io import read_array
a = read_array("features.txt", atype='f', rowsize=100)
# The above reading takes half an hour
from scipy.cluster.vq import kmeans, vq
codebook, dist = kmeans(a, 4, thresh=1.0e-10, iter=1680)

The distortion (the sum of squared errors) could have been decreased further with more iterations, but the available time and the progress of the project limited the number of iterations. More information about the k-means algorithm that is used can be found in the Scipy documentation [14].

3.3.1 Complexity of k-means

There are several good features of k-means clustering. There is no need to name the classes; the algorithm finds the classes on its own. It is also proven that the clustering algorithm must converge to a minimum distortion. After some number of iterations there will be no changes in the set of µ, and the algorithm terminates.

The computational complexity of k-means is linear with respect to all input sizes. The complexity is O(ndcT), where n is the number of training samples, d is the dimension of each sample, c is the number of clusters and T is the number of iterations. See also Pattern Classification by R. Duda et al. [7].

The problem is to find out how many iterations are needed before the k-means algorithm terminates with a stable set of µ. This has been studied by D. Arthur and S. Vassilvitskii [1]. Their article states that the number of iterations needed for full convergence can be superpolynomial; the lower bound for the number of iterations is indicated to be 2^Ω(√n). This is much higher than the 1680 iterations done in this study, but it looks like the classifier can find certain concepts after just these relatively few iterations.

4 Results and analysis

This section summarises some of the results from head-to-head benchmarking, the k-means separation, and the training sessions for the standard classification scheme and the k-means classification scheme. The results from the training sessions are given as the error rates from the database benchmarking.

Relative error benchmark:
  # correct moves
  # total moves
  err/tot ratio %
  Total rel. error
  Average error
  Error rate

Absolute error benchmark:
  # total positions
  Total equity diff
  Abs. error rate

Table 2: Benchmark of the current GNU Backgammon neural net evaluator.

4.1 Benchmarks of GNU Backgammon's neural networks

It is natural to first find the benchmark values of Heled's [9] evaluation engine. In this way it is possible to compare directly whether something is done right or wrong. The benchmark results are shown in table 2. These neural nets are considered the reference neural nets.

4.2 Database benchmark compared to head-to-head

From a supervised training session before the k-means separation, some of the neural nets were matched head-to-head against the best net from Heled's work. From the training session, 24 contact nets were selected, with benchmark relative errors ranging upwards from 1207. Heled's best neural net benchmarks to a relative error of 1122.7. Each net was matched in a series of cubeless money games, terminated at the state where a race position occurs. The results were logged. To make the numbers comparable, the relative benchmark score of Heled's reference net (1122.7) is subtracted, and a linear regression is made on the benchmark error difference of the two competing nets.

Figure 7 shows the difference in the conventional benchmark relative errors compared to the equity lost against Heled's best neural net. The error bars indicate the 95% confidence interval. The dashed line is the regression line. It is assumed that the line intersects the origin, since two equal nets should logically be even. The slope of the regression line is of the order of 10^-3 ppg per unit of relative error difference. This linear relation is simple and can be used to estimate how many points per game one neural net will lose to another based on the database relative error difference.

This result does not only give a relation between the benchmark and ppg lost. It also verifies the benchmark: since the correlation can be seen, the benchmark database must be valid.

4.3 Training results before introduction of k-means separation

The first experiments performed were to try training with the same neural network configuration as the current GNU Backgammon evaluator. This was done to verify that the algorithms worked properly, and that a new evaluator could be retrained to the same strength with the newly implemented algorithms.

Figure 7: The difference in benchmarked relative error compared to points per game lost.

4.3.1 TD(λ) training results

Training with TD(λ) was one of the first experiments performed. This was even before the logging feature of the tool set was implemented, and a figure of the first initial TD-training can therefore not be shown. This training method reached its lowest relative and absolute error benchmarks after a long run. Unfortunately the number of games played to achieve this error rate is lost, but the training was run for over two weeks, so the real number of games must have been above 100 million. The learning rate α was set to 0.15 for all updates. A bootstrap method was used to indicate if there was any improvement, and at the time the training was stopped, there was no indication of improvement.

Achim Müller [13] also performed some TD-training. He tried to train from a different starting position. The starting position used in his training session was the position known as Nackgammon. Nackgammon is the same as backgammon, but both players start with four checkers back instead of two. This leads to more complex positions. This training converged to a relative error of 1616 after 110 million games. The α was set to 0.5 for this training.

Another TD-training session was performed later in the study, and the figure from this training session is shown in figure 8.

Figure 8: TD(λ) training. Decreasing relative error with the number of training games played.

The difference from the initial TD-training is the learning rate α, which was set to 0.5 during this training. The best of these training sessions gave a new contact neural net. According to the relation between relative error and ppg lost, it is possible to estimate how many points per game this neural net would lose against Heled's reference neural net.

4.3.2 Supervised training results

With this training method, the neural net scored better. A net that was trained in advance for about 1800 epochs managed to reach a new low in relative error, but after further epochs there was absolutely no improvement, and this training was stopped. A figure of this training is given in figure 9. This is still the lowest relative benchmark observed in this study. According to the relation found, it is estimated that this neural net will lose only very little, in points per game, against the reference neural net. This is extremely close to the reference net. It could be verified by a head-to-head test, but such a test would have to be played over millions of games to give a statistically significant result. This head-to-head test was therefore not performed.

4.4 Results of k-means clustering

The developed k-means method is now used to classify the positions in the training database. The result is quite promising. The classification algorithm has been coded in C, using the codebook values from the k-means algorithm.


More information

46.1 Introduction. Foundations of Artificial Intelligence Introduction MCTS in AlphaGo Neural Networks. 46.

46.1 Introduction. Foundations of Artificial Intelligence Introduction MCTS in AlphaGo Neural Networks. 46. Foundations of Artificial Intelligence May 30, 2016 46. AlphaGo and Outlook Foundations of Artificial Intelligence 46. AlphaGo and Outlook Thomas Keller Universität Basel May 30, 2016 46.1 Introduction

More information

Google DeepMind s AlphaGo vs. world Go champion Lee Sedol

Google DeepMind s AlphaGo vs. world Go champion Lee Sedol Google DeepMind s AlphaGo vs. world Go champion Lee Sedol Review of Nature paper: Mastering the game of Go with Deep Neural Networks & Tree Search Tapani Raiko Thanks to Antti Tarvainen for some slides

More information

CS 4700: Foundations of Artificial Intelligence

CS 4700: Foundations of Artificial Intelligence CS 4700: Foundations of Artificial Intelligence selman@cs.cornell.edu Module: Adversarial Search R&N: Chapter 5 1 Outline Adversarial Search Optimal decisions Minimax α-β pruning Case study: Deep Blue

More information

Foundations of Artificial Intelligence

Foundations of Artificial Intelligence Foundations of Artificial Intelligence 6. Board Games Search Strategies for Games, Games with Chance, State of the Art Joschka Boedecker and Wolfram Burgard and Bernhard Nebel Albert-Ludwigs-Universität

More information

Programming Project 1: Pacman (Due )

Programming Project 1: Pacman (Due ) Programming Project 1: Pacman (Due 8.2.18) Registration to the exams 521495A: Artificial Intelligence Adversarial Search (Min-Max) Lectured by Abdenour Hadid Adjunct Professor, CMVS, University of Oulu

More information

CSC321 Lecture 23: Go

CSC321 Lecture 23: Go CSC321 Lecture 23: Go Roger Grosse Roger Grosse CSC321 Lecture 23: Go 1 / 21 Final Exam Friday, April 20, 9am-noon Last names A Y: Clara Benson Building (BN) 2N Last names Z: Clara Benson Building (BN)

More information

Adversarial Search. CS 486/686: Introduction to Artificial Intelligence

Adversarial Search. CS 486/686: Introduction to Artificial Intelligence Adversarial Search CS 486/686: Introduction to Artificial Intelligence 1 Introduction So far we have only been concerned with a single agent Today, we introduce an adversary! 2 Outline Games Minimax search

More information

Presentation Overview. Bootstrapping from Game Tree Search. Game Tree Search. Heuristic Evaluation Function

Presentation Overview. Bootstrapping from Game Tree Search. Game Tree Search. Heuristic Evaluation Function Presentation Bootstrapping from Joel Veness David Silver Will Uther Alan Blair University of New South Wales NICTA University of Alberta A new algorithm will be presented for learning heuristic evaluation

More information

Foundations of Artificial Intelligence

Foundations of Artificial Intelligence Foundations of Artificial Intelligence 6. Board Games Search Strategies for Games, Games with Chance, State of the Art Joschka Boedecker and Wolfram Burgard and Frank Hutter and Bernhard Nebel Albert-Ludwigs-Universität

More information

Playing Othello Using Monte Carlo

Playing Othello Using Monte Carlo June 22, 2007 Abstract This paper deals with the construction of an AI player to play the game Othello. A lot of techniques are already known to let AI players play the game Othello. Some of these techniques

More information

CS440/ECE448 Lecture 11: Stochastic Games, Stochastic Search, and Learned Evaluation Functions

CS440/ECE448 Lecture 11: Stochastic Games, Stochastic Search, and Learned Evaluation Functions CS440/ECE448 Lecture 11: Stochastic Games, Stochastic Search, and Learned Evaluation Functions Slides by Svetlana Lazebnik, 9/2016 Modified by Mark Hasegawa Johnson, 9/2017 Types of game environments Perfect

More information

By David Anderson SZTAKI (Budapest, Hungary) WPI D2009

By David Anderson SZTAKI (Budapest, Hungary) WPI D2009 By David Anderson SZTAKI (Budapest, Hungary) WPI D2009 1997, Deep Blue won against Kasparov Average workstation can defeat best Chess players Computer Chess no longer interesting Go is much harder for

More information

Experiments on Alternatives to Minimax

Experiments on Alternatives to Minimax Experiments on Alternatives to Minimax Dana Nau University of Maryland Paul Purdom Indiana University April 23, 1993 Chun-Hung Tzeng Ball State University Abstract In the field of Artificial Intelligence,

More information

Game Playing. Philipp Koehn. 29 September 2015

Game Playing. Philipp Koehn. 29 September 2015 Game Playing Philipp Koehn 29 September 2015 Outline 1 Games Perfect play minimax decisions α β pruning Resource limits and approximate evaluation Games of chance Games of imperfect information 2 games

More information

Game Playing AI Class 8 Ch , 5.4.1, 5.5

Game Playing AI Class 8 Ch , 5.4.1, 5.5 Game Playing AI Class Ch. 5.-5., 5.4., 5.5 Bookkeeping HW Due 0/, :59pm Remaining CSP questions? Cynthia Matuszek CMSC 6 Based on slides by Marie desjardin, Francisco Iacobelli Today s Class Clear criteria

More information

Game Playing: Adversarial Search. Chapter 5

Game Playing: Adversarial Search. Chapter 5 Game Playing: Adversarial Search Chapter 5 Outline Games Perfect play minimax search α β pruning Resource limits and approximate evaluation Games of chance Games of imperfect information Games vs. Search

More information

Adversarial Search and Game Playing

Adversarial Search and Game Playing Games Adversarial Search and Game Playing Russell and Norvig, 3 rd edition, Ch. 5 Games: multi-agent environment q What do other agents do and how do they affect our success? q Cooperative vs. competitive

More information

CS 188: Artificial Intelligence

CS 188: Artificial Intelligence CS 188: Artificial Intelligence Adversarial Search Instructor: Stuart Russell University of California, Berkeley Game Playing State-of-the-Art Checkers: 1950: First computer player. 1959: Samuel s self-taught

More information

COMP219: Artificial Intelligence. Lecture 13: Game Playing

COMP219: Artificial Intelligence. Lecture 13: Game Playing CMP219: Artificial Intelligence Lecture 13: Game Playing 1 verview Last time Search with partial/no observations Belief states Incremental belief state search Determinism vs non-determinism Today We will

More information

ARTIFICIAL INTELLIGENCE (CS 370D)

ARTIFICIAL INTELLIGENCE (CS 370D) Princess Nora University Faculty of Computer & Information Systems ARTIFICIAL INTELLIGENCE (CS 370D) (CHAPTER-5) ADVERSARIAL SEARCH ADVERSARIAL SEARCH Optimal decisions Min algorithm α-β pruning Imperfect,

More information

Adversarial Search and Game- Playing C H A P T E R 6 C M P T : S P R I N G H A S S A N K H O S R A V I

Adversarial Search and Game- Playing C H A P T E R 6 C M P T : S P R I N G H A S S A N K H O S R A V I Adversarial Search and Game- Playing C H A P T E R 6 C M P T 3 1 0 : S P R I N G 2 0 1 1 H A S S A N K H O S R A V I Adversarial Search Examine the problems that arise when we try to plan ahead in a world

More information

Today. Types of Game. Games and Search 1/18/2010. COMP210: Artificial Intelligence. Lecture 10. Game playing

Today. Types of Game. Games and Search 1/18/2010. COMP210: Artificial Intelligence. Lecture 10. Game playing COMP10: Artificial Intelligence Lecture 10. Game playing Trevor Bench-Capon Room 15, Ashton Building Today We will look at how search can be applied to playing games Types of Games Perfect play minimax

More information

School of EECS Washington State University. Artificial Intelligence

School of EECS Washington State University. Artificial Intelligence School of EECS Washington State University Artificial Intelligence 1 } Classic AI challenge Easy to represent Difficult to solve } Zero-sum games Total final reward to all players is constant } Perfect

More information

How AI Won at Go and So What? Garry Kasparov vs. Deep Blue (1997)

How AI Won at Go and So What? Garry Kasparov vs. Deep Blue (1997) How AI Won at Go and So What? Garry Kasparov vs. Deep Blue (1997) Alan Fern School of Electrical Engineering and Computer Science Oregon State University Deep Mind s vs. Lee Sedol (2016) Watson vs. Ken

More information

Adversarial Search. CS 486/686: Introduction to Artificial Intelligence

Adversarial Search. CS 486/686: Introduction to Artificial Intelligence Adversarial Search CS 486/686: Introduction to Artificial Intelligence 1 AccessAbility Services Volunteer Notetaker Required Interested? Complete an online application using your WATIAM: https://york.accessiblelearning.com/uwaterloo/

More information

Intuition Mini-Max 2

Intuition Mini-Max 2 Games Today Saying Deep Blue doesn t really think about chess is like saying an airplane doesn t really fly because it doesn t flap its wings. Drew McDermott I could feel I could smell a new kind of intelligence

More information

CS 331: Artificial Intelligence Adversarial Search II. Outline

CS 331: Artificial Intelligence Adversarial Search II. Outline CS 331: Artificial Intelligence Adversarial Search II 1 Outline 1. Evaluation Functions 2. State-of-the-art game playing programs 3. 2 player zero-sum finite stochastic games of perfect information 2 1

More information

DeepStack: Expert-Level AI in Heads-Up No-Limit Poker. Surya Prakash Chembrolu

DeepStack: Expert-Level AI in Heads-Up No-Limit Poker. Surya Prakash Chembrolu DeepStack: Expert-Level AI in Heads-Up No-Limit Poker Surya Prakash Chembrolu AI and Games AlphaGo Go Watson Jeopardy! DeepBlue -Chess Chinook -Checkers TD-Gammon -Backgammon Perfect Information Games

More information

Games CSE 473. Kasparov Vs. Deep Junior August 2, 2003 Match ends in a 3 / 3 tie!

Games CSE 473. Kasparov Vs. Deep Junior August 2, 2003 Match ends in a 3 / 3 tie! Games CSE 473 Kasparov Vs. Deep Junior August 2, 2003 Match ends in a 3 / 3 tie! Games in AI In AI, games usually refers to deteristic, turntaking, two-player, zero-sum games of perfect information Deteristic:

More information

A Quoridor-playing Agent

A Quoridor-playing Agent A Quoridor-playing Agent P.J.C. Mertens June 21, 2006 Abstract This paper deals with the construction of a Quoridor-playing software agent. Because Quoridor is a rather new game, research about the game

More information

Game-Playing & Adversarial Search

Game-Playing & Adversarial Search Game-Playing & Adversarial Search This lecture topic: Game-Playing & Adversarial Search (two lectures) Chapter 5.1-5.5 Next lecture topic: Constraint Satisfaction Problems (two lectures) Chapter 6.1-6.4,

More information

Plakoto. A Backgammon Board Game Variant Introduction, Rules and Basic Strategy. (by J.Mamoun - This primer is copyright-free, in the public domain)

Plakoto. A Backgammon Board Game Variant Introduction, Rules and Basic Strategy. (by J.Mamoun - This primer is copyright-free, in the public domain) Plakoto A Backgammon Board Game Variant Introduction, Rules and Basic Strategy (by J.Mamoun - This primer is copyright-free, in the public domain) Introduction: Plakoto is a variation of the game of backgammon.

More information

CS 771 Artificial Intelligence. Adversarial Search

CS 771 Artificial Intelligence. Adversarial Search CS 771 Artificial Intelligence Adversarial Search Typical assumptions Two agents whose actions alternate Utility values for each agent are the opposite of the other This creates the adversarial situation

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Adversarial Search Vibhav Gogate The University of Texas at Dallas Some material courtesy of Rina Dechter, Alex Ihler and Stuart Russell, Luke Zettlemoyer, Dan Weld Adversarial

More information

Game Playing State-of-the-Art CSE 473: Artificial Intelligence Fall Deterministic Games. Zero-Sum Games 10/13/17. Adversarial Search

Game Playing State-of-the-Art CSE 473: Artificial Intelligence Fall Deterministic Games. Zero-Sum Games 10/13/17. Adversarial Search CSE 473: Artificial Intelligence Fall 2017 Adversarial Search Mini, pruning, Expecti Dieter Fox Based on slides adapted Luke Zettlemoyer, Dan Klein, Pieter Abbeel, Dan Weld, Stuart Russell or Andrew Moore

More information

Last update: March 9, Game playing. CMSC 421, Chapter 6. CMSC 421, Chapter 6 1

Last update: March 9, Game playing. CMSC 421, Chapter 6. CMSC 421, Chapter 6 1 Last update: March 9, 2010 Game playing CMSC 421, Chapter 6 CMSC 421, Chapter 6 1 Finite perfect-information zero-sum games Finite: finitely many agents, actions, states Perfect information: every agent

More information

Pengju

Pengju Introduction to AI Chapter05 Adversarial Search: Game Playing Pengju Ren@IAIR Outline Types of Games Formulation of games Perfect-Information Games Minimax and Negamax search α-β Pruning Pruning more Imperfect

More information

MyPawns OppPawns MyKings OppKings MyThreatened OppThreatened MyWins OppWins Draws

MyPawns OppPawns MyKings OppKings MyThreatened OppThreatened MyWins OppWins Draws The Role of Opponent Skill Level in Automated Game Learning Ying Ge and Michael Hash Advisor: Dr. Mark Burge Armstrong Atlantic State University Savannah, Geogia USA 31419-1997 geying@drake.armstrong.edu

More information

Prepared by Vaishnavi Moorthy Asst Prof- Dept of Cse

Prepared by Vaishnavi Moorthy Asst Prof- Dept of Cse UNIT II-REPRESENTATION OF KNOWLEDGE (9 hours) Game playing - Knowledge representation, Knowledge representation using Predicate logic, Introduction tounit-2 predicate calculus, Resolution, Use of predicate

More information

Adversarial Search Aka Games

Adversarial Search Aka Games Adversarial Search Aka Games Chapter 5 Some material adopted from notes by Charles R. Dyer, U of Wisconsin-Madison Overview Game playing State of the art and resources Framework Game trees Minimax Alpha-beta

More information

Artificial Intelligence Search III

Artificial Intelligence Search III Artificial Intelligence Search III Lecture 5 Content: Search III Quick Review on Lecture 4 Why Study Games? Game Playing as Search Special Characteristics of Game Playing Search Ingredients of 2-Person

More information

Computing Science (CMPUT) 496

Computing Science (CMPUT) 496 Computing Science (CMPUT) 496 Search, Knowledge, and Simulations Martin Müller Department of Computing Science University of Alberta mmueller@ualberta.ca Winter 2017 Part IV Knowledge 496 Today - Mar 9

More information

Game-playing AIs: Games and Adversarial Search FINAL SET (w/ pruning study examples) AIMA

Game-playing AIs: Games and Adversarial Search FINAL SET (w/ pruning study examples) AIMA Game-playing AIs: Games and Adversarial Search FINAL SET (w/ pruning study examples) AIMA 5.1-5.2 Games: Outline of Unit Part I: Games as Search Motivation Game-playing AI successes Game Trees Evaluation

More information

Using Artificial intelligent to solve the game of 2048

Using Artificial intelligent to solve the game of 2048 Using Artificial intelligent to solve the game of 2048 Ho Shing Hin (20343288) WONG, Ngo Yin (20355097) Lam Ka Wing (20280151) Abstract The report presents the solver of the game 2048 base on artificial

More information

CS 188: Artificial Intelligence Spring Game Playing in Practice

CS 188: Artificial Intelligence Spring Game Playing in Practice CS 188: Artificial Intelligence Spring 2006 Lecture 23: Games 4/18/2006 Dan Klein UC Berkeley Game Playing in Practice Checkers: Chinook ended 40-year-reign of human world champion Marion Tinsley in 1994.

More information

Game Design Verification using Reinforcement Learning

Game Design Verification using Reinforcement Learning Game Design Verification using Reinforcement Learning Eirini Ntoutsi Dimitris Kalles AHEAD Relationship Mediators S.A., 65 Othonos-Amalias St, 262 21 Patras, Greece and Department of Computer Engineering

More information

Othello/Reversi using Game Theory techniques Parth Parekh Urjit Singh Bhatia Kushal Sukthankar

Othello/Reversi using Game Theory techniques Parth Parekh Urjit Singh Bhatia Kushal Sukthankar Othello/Reversi using Game Theory techniques Parth Parekh Urjit Singh Bhatia Kushal Sukthankar Othello Rules Two Players (Black and White) 8x8 board Black plays first Every move should Flip over at least

More information

Automated Suicide: An Antichess Engine

Automated Suicide: An Antichess Engine Automated Suicide: An Antichess Engine Jim Andress and Prasanna Ramakrishnan 1 Introduction Antichess (also known as Suicide Chess or Loser s Chess) is a popular variant of chess where the objective of

More information

Teaching a Neural Network to Play Konane

Teaching a Neural Network to Play Konane Teaching a Neural Network to Play Konane Darby Thompson Spring 5 Abstract A common approach to game playing in Artificial Intelligence involves the use of the Minimax algorithm and a static evaluation

More information

Adversarial Search. Human-aware Robotics. 2018/01/25 Chapter 5 in R&N 3rd Ø Announcement: Slides for this lecture are here:

Adversarial Search. Human-aware Robotics. 2018/01/25 Chapter 5 in R&N 3rd Ø Announcement: Slides for this lecture are here: Adversarial Search 2018/01/25 Chapter 5 in R&N 3rd Ø Announcement: q Slides for this lecture are here: http://www.public.asu.edu/~yzhan442/teaching/cse471/lectures/adversarial.pdf Slides are largely based

More information

CS 188: Artificial Intelligence Spring Announcements

CS 188: Artificial Intelligence Spring Announcements CS 188: Artificial Intelligence Spring 2011 Lecture 7: Minimax and Alpha-Beta Search 2/9/2011 Pieter Abbeel UC Berkeley Many slides adapted from Dan Klein 1 Announcements W1 out and due Monday 4:59pm P2

More information

Announcements. CS 188: Artificial Intelligence Spring Game Playing State-of-the-Art. Overview. Game Playing. GamesCrafters

Announcements. CS 188: Artificial Intelligence Spring Game Playing State-of-the-Art. Overview. Game Playing. GamesCrafters CS 188: Artificial Intelligence Spring 2011 Announcements W1 out and due Monday 4:59pm P2 out and due next week Friday 4:59pm Lecture 7: Mini and Alpha-Beta Search 2/9/2011 Pieter Abbeel UC Berkeley Many

More information

Adversarial Search and Game Playing. Russell and Norvig: Chapter 5

Adversarial Search and Game Playing. Russell and Norvig: Chapter 5 Adversarial Search and Game Playing Russell and Norvig: Chapter 5 Typical case 2-person game Players alternate moves Zero-sum: one player s loss is the other s gain Perfect information: both players have

More information

Adversarial Search. CMPSCI 383 September 29, 2011

Adversarial Search. CMPSCI 383 September 29, 2011 Adversarial Search CMPSCI 383 September 29, 2011 1 Why are games interesting to AI? Simple to represent and reason about Must consider the moves of an adversary Time constraints Russell & Norvig say: Games,

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Jeff Clune Assistant Professor Evolving Artificial Intelligence Laboratory AI Challenge One 140 Challenge 1 grades 120 100 80 60 AI Challenge One Transform to graph Explore the

More information

CMSC 671 Project Report- Google AI Challenge: Planet Wars

CMSC 671 Project Report- Google AI Challenge: Planet Wars 1. Introduction Purpose The purpose of the project is to apply relevant AI techniques learned during the course with a view to develop an intelligent game playing bot for the game of Planet Wars. Planet

More information

CS 188: Artificial Intelligence Spring 2007

CS 188: Artificial Intelligence Spring 2007 CS 188: Artificial Intelligence Spring 2007 Lecture 7: CSP-II and Adversarial Search 2/6/2007 Srini Narayanan ICSI and UC Berkeley Many slides over the course adapted from Dan Klein, Stuart Russell or

More information

game tree complete all possible moves

game tree complete all possible moves Game Trees Game Tree A game tree is a tree the nodes of which are positions in a game and edges are moves. The complete game tree for a game is the game tree starting at the initial position and containing

More information

Adversarial Search: Game Playing. Reading: Chapter

Adversarial Search: Game Playing. Reading: Chapter Adversarial Search: Game Playing Reading: Chapter 6.5-6.8 1 Games and AI Easy to represent, abstract, precise rules One of the first tasks undertaken by AI (since 1950) Better than humans in Othello and

More information

CS-E4800 Artificial Intelligence

CS-E4800 Artificial Intelligence CS-E4800 Artificial Intelligence Jussi Rintanen Department of Computer Science Aalto University March 9, 2017 Difficulties in Rational Collective Behavior Individual utility in conflict with collective

More information

Theory and Practice of Artificial Intelligence

Theory and Practice of Artificial Intelligence Theory and Practice of Artificial Intelligence Games Daniel Polani School of Computer Science University of Hertfordshire March 9, 2017 All rights reserved. Permission is granted to copy and distribute

More information

Humanization of Computational Learning in Strategy Games

Humanization of Computational Learning in Strategy Games 1 Humanization of Computational Learning in Strategy Games By Benjamin S. Greenberg S.B., C.S. M.I.T., 2015 Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment

More information

Monte Carlo Tree Search

Monte Carlo Tree Search Monte Carlo Tree Search 1 By the end, you will know Why we use Monte Carlo Search Trees The pros and cons of MCTS How it is applied to Super Mario Brothers and Alpha Go 2 Outline I. Pre-MCTS Algorithms

More information

Announcements. Homework 1. Project 1. Due tonight at 11:59pm. Due Friday 2/8 at 4:00pm. Electronic HW1 Written HW1

Announcements. Homework 1. Project 1. Due tonight at 11:59pm. Due Friday 2/8 at 4:00pm. Electronic HW1 Written HW1 Announcements Homework 1 Due tonight at 11:59pm Project 1 Electronic HW1 Written HW1 Due Friday 2/8 at 4:00pm CS 188: Artificial Intelligence Adversarial Search and Game Trees Instructors: Sergey Levine

More information

TTIC 31230, Fundamentals of Deep Learning David McAllester, April AlphaZero

TTIC 31230, Fundamentals of Deep Learning David McAllester, April AlphaZero TTIC 31230, Fundamentals of Deep Learning David McAllester, April 2017 AlphaZero 1 AlphaGo Fan (October 2015) AlphaGo Defeats Fan Hui, European Go Champion. 2 AlphaGo Lee (March 2016) 3 AlphaGo Zero vs.

More information

Game playing. Chapter 5. Chapter 5 1

Game playing. Chapter 5. Chapter 5 1 Game playing Chapter 5 Chapter 5 1 Outline Games Perfect play minimax decisions α β pruning Resource limits and approximate evaluation Games of chance Games of imperfect information Chapter 5 2 Types of

More information

Artificial Intelligence Adversarial Search

Artificial Intelligence Adversarial Search Artificial Intelligence Adversarial Search Adversarial Search Adversarial search problems games They occur in multiagent competitive environments There is an opponent we can t control planning again us!

More information

U.S. TOURNAMENT BACKGAMMON RULES* (Honest, Fair Play And Sportsmanship Will Take Precedence Over Any Rule - Directors Discretion)

U.S. TOURNAMENT BACKGAMMON RULES* (Honest, Fair Play And Sportsmanship Will Take Precedence Over Any Rule - Directors Discretion) U.S. TOURNAMENT BACKGAMMON RULES* (Honest, Fair Play And Sportsmanship Will Take Precedence Over Any Rule - Directors Discretion) 1.0 PROPRIETIES 1.1 TERMS. TD-Tournament Director, TS-Tournament Staff

More information

1 Introduction. w k x k (1.1)

1 Introduction. w k x k (1.1) Neural Smithing 1 Introduction Artificial neural networks are nonlinear mapping systems whose structure is loosely based on principles observed in the nervous systems of humans and animals. The major

More information

Backgammon Basics And How To Play

Backgammon Basics And How To Play Backgammon Basics And How To Play Backgammon is a game for two players, played on a board consisting of twenty-four narrow triangles called points. The triangles alternate in color and are grouped into

More information

Local Search. Hill Climbing. Hill Climbing Diagram. Simulated Annealing. Simulated Annealing. Introduction to Artificial Intelligence

Local Search. Hill Climbing. Hill Climbing Diagram. Simulated Annealing. Simulated Annealing. Introduction to Artificial Intelligence Introduction to Artificial Intelligence V22.0472-001 Fall 2009 Lecture 6: Adversarial Search Local Search Queue-based algorithms keep fallback options (backtracking) Local search: improve what you have

More information

AN ABSTRACT OF THE THESIS OF

AN ABSTRACT OF THE THESIS OF AN ABSTRACT OF THE THESIS OF Paul Lewis for the degree of Master of Science in Computer Science presented on June 1, 2010. Title: Ensemble Monte-Carlo Planning: An Empirical Study Abstract approved: Alan

More information