
An Intelligent Othello Player Combining Machine Learning and Game Specific Heuristics

Kevin Cherry and Jianhua Chen
Department of Computer Science, Louisiana State University, Baton Rouge, Louisiana, U.S.A.

Abstract - In this paper we present an intelligent Othello game player that combines game-specific heuristics with machine learning techniques for move selection. Five game-specific heuristics are proposed, some of which can be generalized to fit other games. On the machine learning side, the normal Minimax algorithm, along with a custom variation, is used as a base, and genetic algorithms and neural networks are applied to learn the static evaluation function. The game-specific techniques (or a subset of them) are executed first, and if no move is found, Minimax is performed. All techniques, and several subsets of them, were tested against three deterministic agents, one nondeterministic agent, and three human players of varying skill levels. The results show that the combined Othello player performs better in general. We present the study results on the basis of performance (percentage of games won), speed, predictability of the opponent, and usage situation.

Keywords: Othello, genetic algorithm, neural network, minimax, influence map, expected min

1 Introduction

In Othello, two players, white and black, alternate turns in an attempt to capture as much of the 8x8 board as they can. Each valid move flips over one or more opponent pieces to display the player's color, and at the end of the game the player with the most cells showing his color wins. The game starts with two white and two black pieces, as shown in figure 1. The black player always goes first and must choose from the valid moves highlighted in figure 2 to place one of his black pieces. Each of black's valid locations results in a white piece being surrounded by two black pieces (one already on the board and the new one just placed); the surrounded white piece is flipped over to its black-colored side and becomes a black piece. For example, if the black player places a piece at row 5, column 3, the resulting board state is the one shown in figure 3. Only moves which result in one or more opponent pieces being flipped are valid. At this point the white player takes a turn, with his valid moves being those that surround one or more black pieces horizontally, vertically, or diagonally. Turns continue to alternate until either all board locations are occupied or one player has no pieces left on the board. If a player does not have any valid moves, that player's turn is forfeited.

(Figures 1-3: the starting position, black's valid opening moves, and the board after black plays row 5, column 3; images omitted.)
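To make the move rules above concrete, the following is a minimal sketch (not taken from the paper) of how legal moves and their flips could be computed. It assumes an 8x8 board stored as a list of lists holding 'B', 'W', or None; all names are illustrative.

```python
# Minimal sketch of Othello move legality; 8x8 board of 'B', 'W', or None.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def flips_for_move(board, row, col, player):
    """Return the opponent pieces flipped if `player` plays at (row, col);
    an empty list means the move is not valid."""
    if board[row][col] is not None:
        return []
    opponent = 'W' if player == 'B' else 'B'
    flipped = []
    for dr, dc in DIRECTIONS:
        r, c = row + dr, col + dc
        line = []
        # Walk over a contiguous run of opponent pieces...
        while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == opponent:
            line.append((r, c))
            r, c = r + dr, c + dc
        # ...and keep it only if it is capped by one of the player's own pieces.
        if line and 0 <= r < 8 and 0 <= c < 8 and board[r][c] == player:
            flipped.extend(line)
    return flipped

def valid_moves(board, player):
    """All empty squares whose placement flips at least one opponent piece."""
    return [(r, c) for r in range(8) for c in range(8)
            if flips_for_move(board, r, c, player)]
```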

2 Common methods

Along with the game-specific heuristics, which are discussed later, an algorithm called Minimax was used. Minimax is an exhaustive search approach to finding an ideal move among the valid choices [1]. The player is known as max and the opponent as min. To build the initial tree, the current board state starts as the root, each of the player's valid moves becomes a child node, each of the opponent's moves in response to the player's moves becomes a child of that node, and so on. Tree construction stops when the specified depth has been reached. This basic structure is shown in figure 4 (image omitted).

The leaf nodes get their values from a static evaluation function. This function takes in features of a board state and assigns a real value indicating how good that state is for each player; typically, more negative values indicate board states that are good for min, while more positive values are good for max. In this paper a neural network was used for the function, and certain input values as well as the weights were trained using genetic algorithms. Figure 5 (image omitted) shows an example of what the tree structure may look like after running the evaluation function. These leaf values must then be propagated upwards. The level right above the leaf nodes is a max level, meaning each of these nodes chooses the maximum value of its children, as that is better for the max player. The next level represents min's move, so the minimum value of the children is chosen. The final level is always max, since we are evaluating a move for the max player. Once the root node gets a value, the chosen path becomes clear (highlighted in figure 5). This path represents the board configuration, after a certain number of moves, that results when both the max and the min player play optimally.

Normally, the deeper the tree is allowed to grow, the more accurate the node values. The number of nodes grows exponentially, however, to an estimated total of b^d, where d is the maximum depth and b is the branching factor, i.e., the average number of children per node [2]. This is why several optimizations have been created for this algorithm; the one used in this paper is called alpha-beta and is discussed later.
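As a reference point for the description above, here is a minimal depth-limited Minimax sketch (not the authors' implementation). It reuses the hypothetical flips_for_move and valid_moves helpers from the previous listing, and evaluate stands in for the learned static evaluation function.

```python
def apply_move(board, move, player):
    """Return a copy of the board after `player` plays `move` (illustrative helper)."""
    new_board = [row[:] for row in board]
    r, c = move
    for fr, fc in flips_for_move(board, r, c, player):
        new_board[fr][fc] = player
    new_board[r][c] = player
    return new_board

def minimax(board, depth, max_player, min_player, maximizing, evaluate):
    """Plain Minimax: max levels keep the largest child value, min levels the smallest."""
    mover = max_player if maximizing else min_player
    moves = valid_moves(board, mover)
    if depth == 0 or not moves:       # turn forfeits are not modeled in this sketch
        return evaluate(board)        # static evaluation at the leaves
    child_values = [minimax(apply_move(board, m, mover), depth - 1,
                            max_player, min_player, not maximizing, evaluate)
                    for m in moves]
    return max(child_values) if maximizing else min(child_values)

def best_move(board, depth, max_player, min_player, evaluate):
    """Pick the root move whose subtree value is largest for the max player
    (assumes the max player has at least one valid move)."""
    return max(valid_moves(board, max_player),
               key=lambda m: minimax(apply_move(board, m, max_player), depth - 1,
                                     max_player, min_player, False, evaluate))
```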
3 Game-specific heuristics

The game's rules are easy enough to learn; however, several aspects must be considered to become an expert player. First, the number of pieces one has during the beginning or middle game is not a good indication of how well one is doing [3]. Since single moves can often flip several pieces and change the game's score dramatically, it becomes more important to focus on getting one's pieces into certain key locations. These locations set the player up for higher-scoring moves later on, and for this reason always choosing the move that results in the highest immediate score gain is not an ideal strategy. Second, one must learn where these prime locations are and set up moves that help secure them; this also means ensuring one's opponent does not take these locations first. Lastly, one should try to look a few moves ahead to ensure that each move made results in a more dominant board state later on. Any good Othello agent must make use of these aspects in specific and general heuristic ways. Several specific exploits are designed into our agent and are tried first, before relying on a more general approach. The order in which they are presented below is their order of precedence: techniques are applied according to this precedence, and if a move is found, that move is used and no further processing is done; otherwise the next technique is used. The only exception to this is blacklisting, where moves are not chosen but rather forbidden from being chosen by the sequential decision methods applied afterwards.

3.1 Killer move detection

The first exploit is killer move detection. If a player loses all of his pieces on the board, that player has lost and the game is over. Therefore all valid moves are checked to see whether any one of them leads to the opponent having this outcome.

3.2 Corner detection

Second is corner detection. One set of prime locations is the corners of the board: a corner can never be overtaken once a piece is placed there, and corners allow for high-scoring moves later in the game. By always picking a corner location whenever it appears in the list of valid moves, the agent can take advantage of these facts.

3.3 Blocking

Blocking is the third exploit, and it attempts to keep the opponent from making a move. If any move in the valid-move list forces the opponent to forfeit its next turn, that move is chosen. Even if this move is not an ideal choice, gaining an extra turn should make up for it.
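The first three exploits can be summarized by the following sketch, in their order of precedence (illustrative only; it reuses the hypothetical helpers from the earlier listings).

```python
CORNERS = [(0, 0), (0, 7), (7, 0), (7, 7)]

def first_three_exploits(board, player, opponent):
    """Try killer move, corner, then blocking detection; return a move or None."""
    moves = valid_moves(board, player)
    # 3.1 Killer move: a move after which the opponent has no pieces left.
    for m in moves:
        after = apply_move(board, m, player)
        if not any(cell == opponent for row in after for cell in row):
            return m
    # 3.2 Corner detection: always take an available corner.
    for m in moves:
        if m in CORNERS:
            return m
    # 3.3 Blocking: force the opponent to forfeit its next turn.
    for m in moves:
        if not valid_moves(apply_move(board, m, player), opponent):
            return m
    return None  # fall through to the remaining techniques
```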

3.4 Pattern detection

It is important to recognize board situations in which an ideal move is known; this sort of pattern detection over board states makes up the fourth exploit. The idea is to express several specific board states, together with a collection, ordered by precedence, of predetermined ideal moves or locations to attempt to overtake, as a single pattern. Several such patterns can be created to cover a large number of board states. The patterns created for the agent allow it to take a corner on its next move. This is accomplished by detecting a diagonal series of opponent pieces leading up to a corner and finding a way to overtake one of those pieces, which gives the agent a diagonal capture line to the corner on its next turn. Figure 6 (image omitted) shows a possible board state that this pattern would recognize. The agent plays white and its valid moves are shown as dotted circles; the opponent pieces marked with white crosshairs are possible targets to overtake. The pattern does not recognize the opponent piece at (1, 1), since overtaking that spot would give the opponent the corner. The piece at (4, 4) is also not a target, since the opponent could flip that piece right back on its next turn thanks to its piece at (3, 3). Since one of the agent's valid moves takes the target opponent piece at (2, 2), that move is chosen. The result leaves the agent with a valid move at the corner, as shown in figure 7 (image omitted). Notice that the spot at (2, 2) cannot be taken back by the opponent's next move, as it is protected by the opponent's own pieces.

The implementations behind the other exploits are trivial enough not to warrant further detail; the construction of patterns, however, can be quite complex. One must take a collection of several (possibly hundreds of) specific board states and represent them with a single structure, along with the actions to take if the pattern is matched. The implementation used here is therefore described as one possible approach to creating this exploit. A pattern was represented as an XML file that used bit masking to represent a location's status: one meant untaken (empty), two meant taken by the agent, four meant taken by the opponent, and zero meant the value was inherited from another XML file called a template. Templates were used to store bit-masked values for board locations common to multiple patterns, to avoid repeating the same value many times and to make it easier to change if need be. For convenience, if a location was not specified it was assumed to have a "don't care" value and was ignored when checking the pattern. The XML file then contained a list of rows and columns with a bit-masked state for each. For a pattern to match, each location, described by its row and column, would have to have a state matching that of the actual board. The pattern then specified which location it wanted to overtake. If this location was empty, the agent would simply place a piece there; if the location was taken by its opponent, it would look through its list of valid moves and choose one that would flip over that piece, thereby taking the location over. If no valid move could accomplish this, the pattern would move on to its next target location, if one existed. If no target location could be overtaken, the pattern was considered unmatched and the next pattern was evaluated.
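Since the exact file layout is specific to the authors' implementation, the following is only a hypothetical illustration of the idea: a tiny pattern using the 1/2/4 bit-mask states (template inheritance omitted), plus a matcher and a target-resolution step. The element and attribute names are invented, and flips_for_move is the hypothetical helper from the earlier sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical pattern: bit-mask states 1 = empty, 2 = agent, 4 = opponent.
PATTERN_XML = """
<pattern>
  <cell row="0" col="0" state="1"/>
  <cell row="1" col="1" state="4"/>
  <cell row="2" col="2" state="4"/>
  <target row="2" col="2"/>
</pattern>
"""
PATTERN = ET.fromstring(PATTERN_XML)

def cell_state(board, r, c, agent):
    """Bit-mask state of one board cell: 1 empty, 2 agent's piece, 4 opponent's."""
    if board[r][c] is None:
        return 1
    return 2 if board[r][c] == agent else 4

def pattern_matches(board, pattern, agent):
    """True if every location listed by the pattern matches the actual board."""
    return all(int(cell.get('state')) & cell_state(board, int(cell.get('row')),
                                                   int(cell.get('col')), agent)
               for cell in pattern.findall('cell'))

def pattern_move(board, pattern, agent, moves):
    """Return a valid move that overtakes one of the pattern's target locations,
    or None if no target can be reached (the pattern then counts as unmatched)."""
    for target in pattern.findall('target'):
        loc = (int(target.get('row')), int(target.get('col')))
        if board[loc[0]][loc[1]] is None and loc in moves:
            return loc                   # empty target: take it directly
        for m in moves:                  # otherwise flip the opponent piece sitting there
            if loc in flips_for_move(board, m[0], m[1], agent):
                return m
    return None
```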
3.5 Blacklisting

The final exploit was created to prevent the opponent from having an ideal valid move. Blacklisting is the act of banning valid player moves that would lead to this; the banned moves are then forbidden from being chosen by the next phase. If all moves are banned, the banned list is cleared. It is important not to be too aggressive when banning moves, as moves that seem to give the opponent the upper hand might actually prove to be better for the player later in the game; banning moves too often can therefore actually reduce performance.

4 Machine learning techniques

In this paper genetic algorithms were used to learn the influence map for the board, and both genetic algorithms and neural networks were used for the weights of the static evaluation function for Minimax. We also propose a variation on standard Minimax called expected min Minimax.

4.1 Expected min

If all of the previous game-specific exploits fail to find a move, Minimax is run on all valid moves that are not blacklisted. In addition to the normal Minimax algorithm, a variation called expected min Minimax was added. This addition addresses a crucial issue with normal Minimax: its assumption about the actions of the min player. Instead of assuming the min player will follow the strategy of a subjective evaluation function, a more general approach is used to find an accurate representative value for each min node. The steps are as follows:

1) Take all child node values.
2) Subtract from each value the maximum of those values plus 1 (e.g., if we have 1, 2, and 3, then produce (1 - 4), (2 - 4), and (3 - 4) to get -3, -2, and -1). This is done both to end up with higher weights on lower numbers and to allow values of zero to have some contribution to the weight distribution.
3) Sum these new values and divide each value by that sum (e.g., for the -3, -2, and -1 values from above, we have (-3 / -6), (-2 / -6), and (-1 / -6) to get 0.5, 0.333, and 0.1667).
4) Multiply the original values by these weights (e.g., our original values of 1, 2, and 3 become (1 * 0.5), (2 * 0.333), and (3 * 0.1667) to get 0.5, 0.667, and 0.5).
5) Sum these values to get the min parent's value (e.g., 0.5 + 0.667 + 0.5 = 1.667).

The min node's value is therefore taken from all of its children, with more weight given to lower values.
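Read literally, the five steps amount to the following small function (a sketch, not the authors' code); it reproduces the worked example, mapping child values 1, 2, and 3 to roughly 1.667.

```python
def expected_min_value(child_values):
    """Weighted average of the child values, with more weight on lower values,
    following the five steps described above."""
    shifted = [v - (max(child_values) + 1) for v in child_values]  # steps 1-2
    total = sum(shifted)
    weights = [s / total for s in shifted]                         # step 3
    return sum(v * w for v, w in zip(child_values, weights))       # steps 4-5

# Worked example from the text: values 1, 2, 3 give about 1.667.
print(expected_min_value([1, 2, 3]))
```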

4.2 Static evaluation function

Minimax's static evaluation function takes the form of a neural network [4] with eight input nodes, one layer of five hidden nodes, and a single output node giving the final value. Sigmoid was used as the activation function [5], and each input value is evaluated for both the agent and its opponent. The input values are: the number of corner and side locations held, the number of unique pieces that could be overtaken if the other player had the next move, and the influence-map sum over all of the player's pieces on the board.

4.3 Influence map

An influence map is used to express the relative importance of overtaking each location on the board [6]. These values should be learned to be effective; however, with a value range of zero to ten and 64 board locations, a search space of 11^64 is too massive. Certain exploits and commonalities were therefore used to reduce this space. Assigning a value of ten to the corners and zero to the four starting locations reduces the space to 11^56. Since one side of the board is no more important than the other, the influence map can be divided diagonally in both directions to form four quadrants with equal values, and each quadrant need only hold eight possibly unique values. After this, the search space is reduced to 11^8 = 214,358,881. In the chart of figure 8 (image omitted), only the values A to H need to be learned.
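As a sketch of this symmetry argument (not the paper's code, and the exact placement of the values A to H in its figure 8 is not reproduced), the learned values plus the fixed corner and starting-square values can be folded out into a full 8x8 map:

```python
def expand_influence_map(wedge_values):
    """Expand influence values into a full 8x8 map using the board's symmetry.
    `wedge_values[(r, c)]` gives the value for the representative cell with
    r <= c <= 3: ten cells in total, i.e. the corner (fixed at 10), one starting
    square (fixed at 0), and the eight learned values the paper calls A to H."""
    full = [[0] * 8 for _ in range(8)]
    for r in range(8):
        for c in range(8):
            rr = min(r, 7 - r)                     # fold vertically
            cc = min(c, 7 - c)                     # fold horizontally
            rr, cc = min(rr, cc), max(rr, cc)      # fold across the diagonal
            full[r][c] = wedge_values[(rr, cc)]
    return full

# Example with only the fixed cells set; the learned cells are left at 0 here.
wedge = {(r, c): 0 for r in range(4) for c in range(r, 4)}
wedge[(0, 0)] = 10   # all four corners
wedge[(3, 3)] = 0    # the four starting squares
influence_map = expand_influence_map(wedge)
```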

The influence map values were learned using genetic algorithms [7] with the fitness function of equation (1) (equation image omitted). The variables W_sw, W_sl, and W_nc are the weights for S_w, S_l, and N_c, respectively, and were set to 5, 5, and 2, respectively; these values represent heuristic estimates. Each generation contained 21 chromosomes, and selection, crossover, and mutation operations were defined and used at rates of 0.25, 0.75, and 0.05, respectively. Note that the mutation rate was chosen so that a single chromosome would be affected each generation. Chromosomes were chosen in proportion to their fitness (i.e., chromosome fitness divided by total population fitness, or fitness-proportionate selection [8]) and were either moved on to the next generation or underwent single-point crossover. Mutation was then applied by selecting random chromosomes at the aforementioned rate and randomly changing one of their weights to a random value in the valid range of zero to ten. Figure 9 (image omitted) shows how the fitness changed with each generation during the learning phase. In figure 9, Gen is the current generation, and the white, light gray, and dark gray lines indicate the maximum, average, and minimum fitness, respectively; there is a white line for each of these showing the progress from the initial generation.

To get the fitness value of a chromosome, that chromosome's knowledge was put into an influence map agent (an agent that uses only its influence map to make its decisions), and that agent played against the target agent as both the white and the black player. The fitness over both games was averaged to get the chromosome's actual fitness. This was run initially with a greedy agent (one whose move choice is based solely on the number of opponent pieces it can overtake) as the target; however, once a fitness of eighteen was reached, the fittest chromosome's knowledge was put into an influence map agent, which became the new target agent. The genetic algorithm was then started over, with all chromosomes testing their knowledge against the new target. This repetition continued approximately six times before the final fittest chromosome's knowledge was used. Its influence map was then used in the input to the static evaluation function.

4.4 Training the network

With all inputs defined and learned, the neural network must be trained, and genetic algorithms were again used. With eight input, five hidden, and one output node, the number of weights, including the bias nodes, is 51. The population size was increased to 100 to deal with this larger space, and the mutation rate was decreased; the new mutation rate of 0.01 was chosen so that only one chromosome would be affected, the same as before. The selection and crossover rates, as well as the fitness function, remained the same. Each chromosome's knowledge was put into a Minimax agent and used as its evaluation function. This agent played as both the white and the black player against the greedy, influence map (using the same map), and greedy influence (a combination of the two approaches) agents, so a total of six games were played for each chromosome and the fitness was averaged. The results of running the genetic algorithm on four different machines in parallel, each with its own initial population, for around 80 hours are shown in table 1. Notice that the best chromosome had a maximum fitness of 14.22, which is quite good.

table 1: Results of the genetic algorithm run on four machines in parallel
Machine                    1       2       3       4
Number of generations      14,608  11,769  11,808  12,597
Maximum fitness obtained   8.95    14.21   14.22   11.24

To speed up the process during training, Minimax used a maximum depth of one. Since it is only necessary to find neural network weights that accurately describe the given board state, a depth of one was enough to accomplish that, and any higher depth would have added unnecessary training time [9]. Also, to aid training, the mutation rate was sometimes adjusted when the genetic algorithm appeared to hit a plateau, which occurs when successive generations all have similar fitness values because nothing new is being added between populations. This can indicate a local optimum, and increasing the mutation rate introduces added variation to help overcome it.

To get a better idea of how the Minimax depth affects performance, maximum depths of one to six were used for testing. The maximum of six was chosen based on the amount of time taken to play a single game and on our judgment of how many depths are sufficient for testing.

Alpha-beta is among the most common Minimax optimizations [10]. It works by identifying nodes that do not contribute to the final root value. For example, say we have two min-level nodes, A and B, with A having a value of four and B unevaluated. B's children would all have to have values greater than four for B to be selected, since B takes the minimum value of its children. Therefore, while evaluating B's children, if any value is four or less we can stop: at that point we know B's value will be no greater than four, and A will be chosen. The resulting tree might look like figure 10 (image omitted), in which the nodes marked with X's were not evaluated and their subtrees were pruned. This procedure can be performed at any level to eliminate unnecessary node evaluations and make the entire Minimax algorithm far less expensive. Although alpha-beta was used to help speed up testing, it could only be used with normal Minimax: the expected min algorithm discussed earlier requires all child nodes in order to form an accurate value for the min parent, and therefore cannot, or at least should not, be combined with any pruning optimization.
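A standard alpha-beta formulation (a generic sketch, not the paper's code) that skips exactly these non-contributing subtrees looks like the following; it reuses the hypothetical helpers from the earlier Minimax listing.

```python
def alphabeta(board, depth, max_player, min_player, maximizing, evaluate,
              alpha=float('-inf'), beta=float('inf')):
    """Minimax with alpha-beta pruning: stop expanding a node as soon as its
    value can no longer influence the choice made at an ancestor."""
    mover = max_player if maximizing else min_player
    moves = valid_moves(board, mover)
    if depth == 0 or not moves:
        return evaluate(board)
    if maximizing:
        value = float('-inf')
        for m in moves:
            value = max(value, alphabeta(apply_move(board, m, mover), depth - 1,
                                         max_player, min_player, False, evaluate,
                                         alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:          # a min ancestor already has a better option
                break
        return value
    value = float('inf')
    for m in moves:
        value = min(value, alphabeta(apply_move(board, m, mover), depth - 1,
                                     max_player, min_player, True, evaluate,
                                     alpha, beta))
        beta = min(beta, value)
        if beta <= alpha:              # a max ancestor already has a better option
            break
    return value
```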
5 Experiments

Testing was performed against the three deterministic agents mentioned previously, a random agent, and three human players. The greedy agent always takes the move that results in the most opponent pieces being flipped. The influence map agent chooses the move with the highest corresponding value from our influence map. The greedy influence agent multiplies the number of opponent pieces flipped by the influence map value for each move and picks the move with the highest result. Finally, the random agent selects a valid move at random. The human players had self-rated skill levels of one, five, and seven, with one being a beginner and ten a very experienced player.
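The three deterministic baselines and the random agent can be summarized by the following policy sketches (illustrative only; they assume the earlier hypothetical helpers and an influence_map such as the one expanded above, and that the player has at least one valid move).

```python
import random

def greedy_move(board, player):
    """Greedy agent: flip as many opponent pieces as possible right now."""
    return max(valid_moves(board, player),
               key=lambda m: len(flips_for_move(board, m[0], m[1], player)))

def influence_move(board, player, influence_map):
    """Influence map agent: take the move on the most valuable location."""
    return max(valid_moves(board, player),
               key=lambda m: influence_map[m[0]][m[1]])

def greedy_influence_move(board, player, influence_map):
    """Greedy influence agent: weight the flip count by the location's value."""
    return max(valid_moves(board, player),
               key=lambda m: len(flips_for_move(board, m[0], m[1], player))
                             * influence_map[m[0]][m[1]])

def random_move(board, player):
    """Random agent: any valid move."""
    return random.choice(valid_moves(board, player))
```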

5.1 Artificial agents

The results of the agent playing against the three deterministic agents are shown in tables 2 and 3.

table 2: Games won out of six against the deterministic agents, without expected min (EM)
Depth  M     MCK   MCKBoBa  MCKBoP  ALL   Total  Avg   Per
1      6     6     5        5       5     27.0   5.40  90%
2      4     4.5   4        4       3     19.5   3.90  65%
3      4     4     4.5      4       4.5   21.0   4.20  70%
4      3     4     6        5       6     24.0   4.80  80%
5      4     4     6        5       6     25.0   5.00  83%
6      5     5     3        6       4     23.0   4.60  77%
Total  26.0  27.5  28.5     29.0    28.5
Avg    4.33  4.58  4.75     4.83    4.75
Per    72%   76%   79%      81%     79%

table 3: Games won out of six against the deterministic agents, with expected min (EM)
Depth  M     MCK   MCKBoBa  MCKBoP  ALL   Total  Avg   Per
1      6     6     5        5       5     27.0   5.40  90%
2      1.5   2     4        1       4     12.5   2.50  42%
3      3     3     5        3       5     19.0   3.80  63%
4      3     5     6        4       5     23.0   4.60  77%
5      6     5     6        6       6     29.0   5.80  97%
6      4     5     4.5      5       4.5   23.0   4.60  77%
Total  23.5  26.0  30.5     24.0    29.5
Avg    3.92  4.33  5.08     4.00    4.92
Per    65%   72%   85%      67%     82%

EM refers to the expected min approach. Each cell gives the number of games out of six that were won (one game against each of the three agents as each color; ties count as half a point). For the abbreviations in the column headings: M is Minimax, C is corner detection, K is killer move detection, Bo is blocking, Ba is blacklisting, P is pattern detection, and ALL means all techniques; a combination indicates the techniques used for those tests. The number in the first column is the maximum Minimax depth searched. Total is the total number of games won (the sum of the appropriate cells), Avg is the average, and Per is the percentage won. In the original charts, cells shaded light gray marked cases where all six games were won, dark-gray cells marked cases where half or fewer of the games were won, and mid-gray cells marked the best Minimax depth or best combination of techniques for that chart (the shading is not reproduced here).

Even though the expected min approach performed worse overall, it had higher maximum scores: 97% versus 90% over depths, and 85% versus 81% over technique combinations. This indicates some promise for the approach. Note that for expected min, depths one to five were run without alpha-beta pruning; due to memory constraints, however, pruning was applied at depth six, which caused the performance at depth six to be lower than its true value. For normal Minimax this anomaly is also present and can be explained by examining the different exploit combinations: for the technique sets that do not include blacklisting, performance actually improved from depth five. This indicates that blacklisting does not perform well at this depth and is evidence of the precaution mentioned when blacklisting was introduced earlier. Essentially, blacklisting is being used in situations where Minimax produces better results: blacklisting prevents what it considers a bad move, when Minimax, having examined the possible board states six plies ahead, knows better. This leads to the point that, although each of these exploits can be used to improve performance, all of them can decrease performance under certain circumstances, so caution should be taken when choosing the right combination (more on this later). Also note that depth one did considerably better than most other depths in both cases; this is most likely a side effect of training at that depth. Another side effect comes from the fact that the testing agents were the same as the training agents. Although the depths and techniques used differ, this could have distorted the test results, so it is imperative that tests also be run against unseen agents.
Therefore, the same tests were repeated against the random agent, with the difference that each game was played ten times and the scores averaged. The agent's scores in every category were superior to those of the random agent.

5.2 Human agents

Next, human players tested the aptitude of the agent. The results are shown in tables 4 and 5.

table 4: Games won against the human players, without expected min (EM)
Depth  M     ALL   Total  Avg   Per
1      3     5     8.0    4.00  67%
6      6     6     12.0   6.00  100%
Total  9.0   11.0
Avg    4.50  5.50
Per    75%   92%

table 5: Games won against the human players, with expected min (EM)
Depth  M     ALL   Total  Avg   Per
1      3     5     8.0    4.00  67%
6      3     6     9.0    4.50  75%
Total  6.0   11.0
Avg    3.00  5.50
Per    50%   92%

Note that with human players, the number of combinations, and therefore the number of games required, had to be lowered to remain reasonable. The charts are in the same format as before, and again alpha-beta was used with expected min at a depth of six. The results improved when the game-specific exploits were added, as indicated by the ALL column, and every exploit and depth category saw cumulative scores of 50% or higher (meaning the agent matched or exceeded the expertise of the human players).

5.3 Choosing the right parameters

When choosing the right exploits, the Minimax depth, and whether or not to use expected min, one must consider the application of the Othello agent. The aspects to consider are move accuracy, time taken, and the number of Minimax nodes examined (for memory constraints); they are summarized in the following figures and tables. In all of them, NM stands for normal Minimax and EM for expected min Minimax.

Figure 11 (image omitted) shows the percent accuracy of move decisions for both expected and normal Minimax and how it changes with maximum depth; its data is taken from tables 2 and 3 above.

Figure 12 (image omitted) shows the information expressed in table 6. Each additional depth can significantly increase the average time (in seconds) taken to make a move decision. Since using expected min at depth six was not actually possible (due to memory constraints), its value was estimated by fitting a third-order polynomial to the data from depths one to five.

table 6: Average move times (sec) by maximum depth
Depth  1      2      3      4      5      6
NM     0.002  0.008  0.063  0.197  0.609  2.179
EM     0.002  0.019  0.151  1.396  9.293  25.749

The average number of nodes searched is highly correlated with the average move time, as it is the main factor in determining it. Table 7 shows this, with the expected min value at depth six estimated using the same procedure as in table 6.

table 7: Average nodes searched by maximum depth
Depth  1   2   3    4      5       6
NM     7   28  219  714    2,037   7,548
EM     7   66  497  4,502  28,862  79,496

If the agent is used in a commercial game, move time needs to be minimized, the nodes searched must be kept reasonable, and accuracy can be left negotiable by the human player. For an easy difficulty setting, one might choose Minimax alone with expected min and a search depth of two. Normal difficulty could be Minimax with normal min searching to a depth of three, with corner and killer move detection. For hard, normal min Minimax at depth six with all exploits active. If the application is instead a test against other computer agents, move accuracy may matter more than fast move times, and the nodes searched may not be much of a concern; in that case, a combination of expected min Minimax searching to a depth of five with corner detection, killer move detection, blocking, and blacklisting would be a good choice.
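As an illustration of that estimation step (not the authors' code), a third-order polynomial can be fit to the depth one-to-five measurements with numpy and evaluated at depth six:

```python
import numpy as np

# Measured expected-min (EM) move times for depths 1-5, taken from table 6.
depths = np.array([1, 2, 3, 4, 5])
em_times = np.array([0.002, 0.019, 0.151, 1.396, 9.293])

# Fit a third-order polynomial and extrapolate to depth 6.
coeffs = np.polyfit(depths, em_times, deg=3)
estimate = np.polyval(coeffs, 6)

# Table 6 lists 25.749 s for this cell; the exact number depends on the
# fitting details, so this only illustrates the procedure.
print(f"Estimated EM move time at depth 6: {estimate:.3f} s")
```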

6 Conclusion

This paper has shown the approaches used to create an intelligent Othello player. Game-specific heuristics, namely killer move detection, corner detection, blocking, pattern detection, and blacklisting, were created to help the agent choose an ideal move among the valid alternatives. If these fail to produce a decision, Minimax is run with a custom static evaluation function learned by genetic algorithms and driven by a neural network. An alternative to the standard way of choosing a value for a min node was also presented. All of the above were tested under the same conditions, and the results were summarized and explained. Advice was given on selecting one's own combination of techniques for specific tasks. Overall, the agent with all of its methods performed quite well on all tests presented to it.

6.1 Future work

To improve upon the agent, some further work could be done. Using a variation of leave-one-out cross-validation, training on two of the three deterministic agents, testing on the third, and repeating this for each agent, would give a better approximation of the true performance of the neural network static evaluation function. The neural network might also benefit from additional input features. One such feature is the number of strong lines on the board for each player, where strong means a collection of pieces that is hard to overtake in few moves. For example, if a player had pieces forming a line from one side of the board to the other, the opponent could not overtake all of those pieces in a single move and would probably need several. Also, the way the neural network learns could be changed from genetic algorithms to a reinforcement learning method, which might yield better weights. If one wanted to use this agent in an Othello game meant for human players, one could benefit from adding more Minimax optimizations such as Negascout [11] or other forms of aspiration search. Remembering ideal moves from past games in a history table may improve move times as more games are played; it could also enable deeper Minimax searching by recalling values previously found for the same board states. More patterns could help capture such ideal moves, but they must be well thought out since, like blacklisting, they can be overused and block better decisions from Minimax.

7 References

[1] "AI Horizon: Minimax Game Tree Programming, Part 1." AI Horizon: Computer Science and Artificial Intelligence Programming Resources. AI Horizon. Accessed 27 February 2011. <http://www.aihorizon.com/essays/basiccs/trees/minimax.htm>.
[2] Rajiv Bakulesh Shah, "minimax," in Dictionary of Algorithms and Data Structures [online], Paul E. Black, ed. U.S. National Institute of Standards and Technology, 10 January 2007. Accessed 27 February 2011. <http://xlinux.nist.gov/dads//html/minimax.html>.
[3] Rose, Brian. Othello: A Minute to Learn... A Lifetime to Master. Anjar Co., 2005.
[4] Graupe, Daniel. Principles of Artificial Neural Networks. Singapore: World Scientific, 1997. 1-3.
[5] Shepherd, Adrian J. Second-Order Methods for Neural Networks. Great Britain: Springer, 1997. 3.
[6] Matthews, James. "generation5 - Simple Board Game AI." generation5: At the forefront of Artificial Intelligence. 27 December 2004. Generation5. Accessed 27 February 2011. <http://www.generation5.org/content/2000/boardai.asp>.
[7] Mitchell, Tom M. Machine Learning. Singapore: McGraw-Hill, 1997. 249-270.
[8] Mitchell, Tom M. Machine Learning. Singapore: McGraw-Hill, 1997. 255.
[9] Smed, Jouni, and Harri Hakonen. Algorithms and Networking for Computer Games. Finland: John Wiley & Sons, Ltd, 2006. 80-82.
[10] Shapiro, Stuart C. Encyclopedia of Artificial Intelligence. USA: Wiley-Interscience, 1987. 4-7.
[11] Reinefeld, Alexander. Die Entwicklung der Spielprogrammierung: Von John von Neumann bis zu den hochparallelen Schachmaschinen [The Development of Game Programming: From John von Neumann to Highly Parallel Chess Machines]. Zuse Institut Berlin: Humboldt-Universität zu Berlin, 2005. 47-51.