Programming an Othello AI Michael An (man4), Evan Liang (liange)


1 Introduction

Othello is a two-player board game played on an 8 × 8 grid. Players take turns placing stones of their assigned color (black or white) on the grid. After a player moves, all opponent pieces that lie in a straight line between two of the player's stones are flipped. Each move must flip at least one opponent stone. If a player has no legal moves, they must pass. The game ends when both players pass consecutively, and the winner is the player with the most stones on the board at the end. If both players have the same number of stones, the game ends in a draw.

We created an Othello engine, a program that takes an input position and tries to return the best move possible in that position. To do this, we modeled Othello as a zero-sum state-based game, with the formal definition given in Figure 1.

Figure 1: Othello as a state-based game.

We then evaluated the success of our engine by playing a series of games against other Othello engines. We score each win as 1 point, each draw as 0.5 points, and each loss as 0 points. The exact number of stones won by each player has no impact on the scoring.

2 Engine Implementation

We programmed the Othello engine from scratch in Python. The main parts of the implementation were the user interface, state representation, and game rules.

Othello engines are usually command-line interfaces that follow a certain protocol. Keeping the engine implementation separate from the GUI allows other programmers to build a GUI that relays moves between two different engines through the protocol. We used the same protocol for entering moves as Edax, a strong open-source engine. For example, "play e3" places a stone on the square E3, "pass" is the command to manually pass, and "go" lets the engine play a move. The interface is essentially an infinite loop that reads user input and takes actions based on that input. An external GUI can pipe input into our interface and read the output from our program, which goes to stdout.
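The report describes this command loop but does not include its code, so the following is a minimal sketch of what it might look like. The helper functions (string_to_move, move_to_string, apply_move, choose_move, print_board) are assumed to exist; the names and the exact behavior are ours, not necessarily the authors'.

    import sys

    def run_cli(board, player):
        # Read protocol commands from stdin and reply on stdout so an external
        # GUI can pipe moves in and read our replies.
        for line in sys.stdin:
            tokens = line.split()
            if not tokens:
                continue
            cmd = tokens[0].lower()
            if cmd == "play" and len(tokens) == 2:
                move = string_to_move(tokens[1])       # e.g. "e3" -> a move tuple
                board = apply_move(board, player, move)
                player = -player
            elif cmd == "pass":
                player = -player                       # switch sides without moving
            elif cmd == "go":
                move = choose_move(board, player)      # engine search picks a move
                board = apply_move(board, player, move)
                print(move_to_string(move), flush=True)
                player = -player
            elif cmd == "quit":
                break
            print_board(board, player)                 # board, side to move, score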

We represented the board as a two-dimensional 8 × 8 list in Python, using the constants WHITE = 1, EMPTY = 0, and BLACK = -1 to denote the occupancy of each square. We also stored the player to move as a single variable, which can take on either BLACK or WHITE. To implement the game rules, we added several helper functions, such as one that checks for opponent pieces in each direction and one that returns a list of valid moves. We also needed string-to-move and move-to-string functions that convert between a move tuple, e.g. (4, 2), and the corresponding square, e.g. E3. For user-friendliness, a printboard function prints the board, the player to move, and the current score (the number of stones of each color) after each turn.

Figure 2: The screenshots show a sample move played through the command line in our program (left) and Edax (right).

2.1 Baseline and Oracle

We implemented a simple greedy search algorithm as our baseline. This algorithm simply picks the move that flips the most stones at each turn, breaking ties randomly.
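Below is a condensed, runnable sketch of the representation, the rule helpers, and the greedy baseline described above. The constants match the report, but the function names and the (row, col) indexing convention are ours; this is not the authors' exact implementation.

    import random

    SIZE = 8
    WHITE, EMPTY, BLACK = 1, 0, -1
    # The eight directions scanned when looking for stones to flip.
    DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1),
                  ( 0, -1),          ( 0, 1),
                  ( 1, -1), ( 1, 0), ( 1, 1)]

    def initial_board():
        board = [[EMPTY] * SIZE for _ in range(SIZE)]
        board[3][3] = board[4][4] = WHITE
        board[3][4] = board[4][3] = BLACK
        return board

    def flips_for_move(board, player, move):
        # All opponent stones that playing `move` would flip (empty if illegal).
        row, col = move
        if board[row][col] != EMPTY:
            return []
        flipped = []
        for dr, dc in DIRECTIONS:
            line = []
            r, c = row + dr, col + dc
            # Walk along a run of opponent stones...
            while 0 <= r < SIZE and 0 <= c < SIZE and board[r][c] == -player:
                line.append((r, c))
                r, c = r + dr, c + dc
            # ...and keep the run only if it is capped by one of our own stones.
            if line and 0 <= r < SIZE and 0 <= c < SIZE and board[r][c] == player:
                flipped.extend(line)
        return flipped

    def legal_moves(board, player):
        return [(r, c) for r in range(SIZE) for c in range(SIZE)
                if flips_for_move(board, player, (r, c))]

    def apply_move(board, player, move):
        # Return a new board with `move` played and the resulting flips applied.
        new_board = [row[:] for row in board]
        for r, c in flips_for_move(board, player, move) + [move]:
            new_board[r][c] = player
        return new_board

    def move_to_string(move):
        # (row, col) -> algebraic square under our indexing, e.g. (2, 4) -> "e3".
        row, col = move
        return chr(ord('a') + col) + str(row + 1)

    def string_to_move(square):
        # "e3" -> (2, 4) under the same indexing.
        return int(square[1]) - 1, ord(square[0].lower()) - ord('a')

    def greedy_move(board, player):
        # Baseline: the move that flips the most stones, ties broken randomly.
        moves = legal_moves(board, player)
        if not moves:
            return None                   # no legal move: the player must pass
        return max(moves, key=lambda m: (len(flips_for_move(board, player, m)),
                                         random.random()))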

Our oracles are two of the strongest Othello engines, Cassio and Edax. Since Othello is an unsolved game with an unknown optimal play style, it is important to test our AI against different engines to avoid falling into the trap of optimizing our AI for one particular style of play. Our baseline implementation will never take a game off either oracle, yet may still be capable of beating human players. However, even the best human players will lose handily to either Cassio or Edax. We tested our baseline against Edax, and we lost 64-0. In other words, we ended the game with no stones remaining on the board! This gives an idea of the wide gap in strength between the strongest engines and a naive implementation.

Top engines like Edax and Cassio use a variety of techniques to improve their playing strength. These include the following:

- A large number of evaluation features, all carefully tuned to maximize score against other engines
- Several move ordering rules, which make alpha-beta pruning more efficient
- An opening book and an endgame solver
- Aggressive pruning rules that allow searches of over 20 plies
- Bitboards for board representation, which allow the use of bit tricks to speed up computations
- A hash table that caches scores and best moves of board positions encountered in search

We explored the first two techniques in detail in our engine. In particular, we used TD learning to tune a table of square values, and we implemented a move ordering rule that is also based on this table.

3 Search and Evaluation

3.1 Negamax with Alpha-Beta Pruning

Since it would be computationally infeasible for our algorithm to exhaustively search to an end state from every state of the game, we needed to define a value function to score intermediate board configurations. We implemented a negamax search algorithm with the simple evaluation function (number of stones of the player's color) - (number of stones of the opponent's color). Negamax is a special case of minimax that takes advantage of the zero-sum property of Othello. In particular, the score is symmetric with respect to the current player: a positive evaluation means the current player is winning, a negative one means the current player is losing, and zero indicates a tie. Therefore, negamax is functionally equivalent to minimax but slightly simpler to implement. As a matter of naming convention, the depth reached by an Othello engine is the number of turns (by either player) that the engine looks ahead, i.e. depth is always reported in plies; for instance, a depth 5 search looks 5 plies ahead.
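A minimal sketch of the plain (unpruned) negamax recursion with the stone-difference evaluation described above, reusing the legal_moves and apply_move helpers from the earlier sketch; alpha-beta pruning, added later in this section, is omitted here.

    def evaluate(board, player):
        # Naive evaluation from `player`'s point of view: our stones minus theirs.
        return sum(row.count(player) - row.count(-player) for row in board)

    def negamax(board, player, depth):
        # Plain negamax: the opponent's best score is the negation of ours,
        # so one function serves both sides of the minimax tree.
        moves = legal_moves(board, player)
        if depth == 0:
            return evaluate(board, player)
        if not moves:
            if not legal_moves(board, -player):    # neither side can move: game over
                return evaluate(board, player)
            return -negamax(board, -player, depth) # forced pass
        return max(-negamax(apply_move(board, player, m), -player, depth - 1)
                   for m in moves)

    def best_move(board, player, depth):
        # Choose the root move leading to the best negamax score.
        moves = legal_moves(board, player)
        if not moves:
            return None
        return max(moves, key=lambda m: -negamax(apply_move(board, player, m),
                                                 -player, depth - 1))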

The negamax algorithm returned moves essentially instantly on searches of depth 3. However, for a depth 4 search, the algorithm was quick in the opening and endgame phases (when the board is nearly empty or nearly full), but took about a second per move in the middlegame. This can be explained as follows:

1. In the opening, there are few legal moves (although the branching factor is generally increasing), and making each move is quick because we have to iterate over very few opponent stones when checking whether to flip them.

2. In the endgame, making a move may be slow, but the branching factor is bounded by the number of empty squares remaining, which decreases drastically.

3. In the middlegame, the branching factor is quite high and making a move is quite slow, so negamax will usually take a long time.

The general trend of search being quicker in the opening and endgame extends to other search depths, and we can use this to our advantage. In particular, since moves are quick in the opening and endgame, we can afford to search more plies there without spending too much time. We therefore implemented our program with a time-restricted search instead of a depth-restricted one. Endgame solvers in strong engines use a similar concept, searching over 20 plies deep at around move 40, all the way to end states, and thus solving the remainder of the game perfectly. While engines can also search deeper in the opening than in the middlegame, top engines instead create opening books by analyzing all possible positions after, say, 10 plies. This preprocessing works because the start position is always the same, and caching the evaluations of opening positions saves the engine time.

The next logical step for us was to implement alpha-beta pruning. Again, because of the zero-sum property, negamax-based alpha-beta pruning is simpler than the original algorithm. With random move ordering, alpha-beta is expected to allow us to search about 50% deeper, and this held up in practice: depth 6 searches in the middlegame take our engine up to about two seconds.

3.2 TD Learning an Improved Evaluation

Besides speed optimizations that allow us to search to a deeper ply, another way to improve our AI is to come up with a better evaluation function. One idea we tried is essentially a piece-square table. In Othello it is typically advantageous to have pieces on the edges and corners of the board, because these pieces are harder for the opponent to flip (or impossible to flip, in the case of corner pieces). Therefore, we can assign a higher evaluation score if our pieces are distributed more around the edges than our opponent's. To do this, we create a 4 × 4 array holding the value of a piece on each square of the upper-left quadrant; because of symmetry, this 4 × 4 array is sufficient. We then add the corresponding value to our evaluation for each of the player's pieces and subtract it for each of the opponent's pieces.
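A sketch of this piece-square evaluation, assuming `weights` is the 4 × 4 table of quadrant values (function names are ours):

    def quadrant(r, c):
        # Reflect any square into the upper-left 4 x 4 quadrant; by symmetry a
        # single quadrant of values covers the whole board.
        return min(r, SIZE - 1 - r), min(c, SIZE - 1 - c)

    def table_evaluate(board, player, weights):
        # Piece-square evaluation: add each square's weight for our stones,
        # subtract it for the opponent's stones.
        total = 0.0
        for r in range(SIZE):
            for c in range(SIZE):
                if board[r][c] != EMPTY:
                    qr, qc = quadrant(r, c)
                    total += weights[qr][qc] if board[r][c] == player else -weights[qr][qc]
        return total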

Figure 3: Types of squares: corner, buffer, and edge.

To start off, we tried hand-tuning the square values based on Edax's values. Edax heavily prioritizes the corners, as we would expect, while placing little value on the corner buffers, which also makes sense since placing a stone on a buffer might allow the opponent to play a stone in the corner. However, using Edax's values blindly leads our AI to over-prioritize corners and edges, because Edax's evaluation function contains several other factors that in turn affect its square values. Even so, hand-tuning these values already resulted in a visible improvement in our AI's play (it no longer haphazardly gave away important corner squares).

Figure 4: Our AI playing as black against an online engine, before (left) and after (right) adding a hand-tuned piece-square table.

To improve on the hand-tuned square values, we implemented TD learning. We used Monte Carlo simulation to generate over 10,000 games by playing our engine against older versions of itself, using a depth of 3 so that the games finish reasonably quickly. To provide some variety and exploration, we play a random move instead of the engine's move with high probability in the opening and with lower probability (ε = 0.04) for the rest of the game. Then, for each board position encountered in these games, we upweight squares that lead to wins and downweight those that lead to losses. Here we take into account the final score difference, which can be thought of as a gradient, so a 64-0 win causes a larger change in weights than a 33-31 win. We tried a variety of step sizes η ranging from 0.0001 to 0.01.
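The report describes this update only in words, so the following is one plausible Monte Carlo-style reading of it, not the authors' exact rule. It assumes the `quadrant` helper and the 4 × 4 `weights` table from the previous sketch, a list of board positions from one self-play game, and the final stone margin from White's perspective; the per-position weighting and lack of normalization are our assumptions.

    ETA = 0.001   # step size; the report tried values between 0.0001 and 0.01

    def td_update(weights, boards, white_margin):
        # Every square occupied during the game is nudged toward the final stone
        # margin of the side that held it, so a 64-0 win moves the weights much
        # more than a 33-31 win.
        for board in boards:              # all positions encountered in one game
            for r in range(SIZE):
                for c in range(SIZE):
                    if board[r][c] != EMPTY:
                        qr, qc = quadrant(r, c)
                        # board[r][c] is +1 for WHITE and -1 for BLACK, so this term
                        # is positive for the winner's stones and negative for the
                        # loser's when white_margin = (white - black) at the end.
                        weights[qr][qc] += ETA * board[r][c] * white_margin
        return weights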

Figure 5: Tuned weights for the piece-square table after TD learning. Note that the corner has the highest value, the edges also have high values, the buffers have negative values, and the other squares have almost no value, as expected.

Our tuned weights were tested against our naive evaluation function, which is equivalent to assigning a weight of 1 to every square. We used the same fixed depth (between 2 and 4) for both versions in our testing, and the tuned version consistently won around 70% of its games against the original version. We started TD learning with uniform weights and thus no prior knowledge of square values, but TD learning was able to realize on its own that corners are important and that buffers are to be avoided.

3.3 Optimizing Move Ordering in Alpha-Beta Pruning

With the worst possible move ordering, alpha-beta pruning takes about as long as minimax to reach a given depth. However, in the best case, alpha-beta pruning can reach about twice the depth of minimax. Therefore, we implemented a move ordering heuristic to increase the efficiency of alpha-beta search. Intuitively, for example, if a player can take a corner, the best move usually is indeed to play in the corner. We decided that a simple move ordering scheme could be to use the square-values table tuned with TD learning, because placing a stone on the most valuable square available is likely the best move in a given position. We did this by sorting the list of legal moves by the tuple (square value, random number), which sorts the moves by value and breaks ties randomly (see the sketch at the end of this subsection).

We then tested the time required for our engine to reach certain depths with and without the optimized move ordering. We generated games at a constant depth of 3 or 4, with the first 8 moves randomized and play stopped after move 48 to ensure each game has the same length, and we measured the average time elapsed. We also measured the time required to search to depth 10 from the start position, to see the effect of better move ordering at larger depths.

Figure 6: Time-to-depth comparisons before and after optimizing move ordering.
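A sketch of the (square value, random number) ordering inside a negamax search with alpha-beta pruning, combining the pieces from the earlier sketches (legal_moves, apply_move, quadrant, table_evaluate); this is our reconstruction, not the authors' code.

    def ordered_moves(board, player, weights):
        # Sort legal moves by (learned square value, random tiebreak), best first,
        # so that alpha-beta examines the most promising squares early.
        def key(move):
            qr, qc = quadrant(*move)
            return (weights[qr][qc], random.random())
        return sorted(legal_moves(board, player), key=key, reverse=True)

    def negamax_ab(board, player, depth, weights,
                   alpha=float("-inf"), beta=float("inf")):
        # Negamax with alpha-beta pruning: a child's window is the parent's window
        # negated and swapped, and the move loop stops once alpha >= beta.
        moves = ordered_moves(board, player, weights)
        if depth == 0 or (not moves and not legal_moves(board, -player)):
            return table_evaluate(board, player, weights)
        if not moves:                               # forced pass
            return -negamax_ab(board, -player, depth, weights, -beta, -alpha)
        best = float("-inf")
        for m in moves:
            score = -negamax_ab(apply_move(board, player, m), -player,
                                depth - 1, weights, -beta, -alpha)
            best = max(best, score)
            alpha = max(alpha, score)
            if alpha >= beta:                       # cut-off: opponent avoids this line
                break
        return best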

4 Discussion and Future Work

Through TD learning, we were able to learn a set of square values that beat the naive evaluation function 70% of the time at a given depth, and by improving our move ordering, our engine could reach a given depth approximately 50% faster. In the future, we could implement many of the strategies described in Section 2.1 that top engines use. Here are some more specific things we could consider adding:

- The number of legal moves as an evaluation feature. Being forced to pass is usually very disadvantageous, and in general, the more legal moves we have, the better the chance that one of them is good.
- Stone patterns as evaluation features. For example, if we have two stones on an edge separated by an empty square, we can never fill in that empty square ourselves, which could spell trouble if the opponent manages to sneak a piece in.
- Completely pruning out moves that are obviously bad, such as placing a stone on a buffer square when there are much better alternatives.

5 References

The Cassio engine can be found at cassio.free.fr. The Edax engine can be found at abulmo.perso.neuf.fr (the link has become inactive since the start of the project).