AI Learning Agent for the Game of Battleship

CS 221 Fall 2016
AI Learning Agent for the Game of Battleship
Jordan Ebel (jebel), Kai Yee Wan (kaiw)

Abstract

This project implements a Battleship-playing agent that uses reinforcement learning to become proficient at a modified version of Battleship. The large state space involved poses a major challenge, which we overcome with an algorithm based on Q-learning combined with carefully designed feature extraction. The resulting agent consistently outperforms algorithms that implement random and hunt-and-target strategies, as well as the human authors of this project. It is especially proficient at hunting for small ships placed sparsely on relatively large game boards.

1. Introduction

The game of Battleship is played on a square grid (traditionally 10-by-10). A number of ships, each 1 square wide and between 1 and 5 squares long, are placed on the grid either horizontally or vertically with no overlap. The ships are hidden from the player, who each turn makes a guess and selects a square to strike. The player is then informed whether the strike resulted in a hit or a miss. The game continues until the player has hit all the squares occupied by ships (i.e. sunk all the ships).

In this project, we implemented a Battleship-playing agent that learns to become proficient at our modified version of Battleship. For consistent evaluation and comparison of this agent with other approaches, we made some specific modifications to the basic game rules, outlined below. During gameplay, each player selects one square of the board to strike each turn. The game board is of size m-by-m, where m is known to the players; part of what we evaluate in this project is the effect of increasing and decreasing the board dimensions on the performance of game agents. The number and lengths of the ships can be customized from the traditional arrangement (5 ships of lengths 5, 4, 3, 3, 2). The only information available to the player is the number and sizes of the ships on the board, the observable status of each square on the board, and which ships have been sunk; the player is required to make its decisions based solely on this information.

Unless otherwise noted, we run our game agents on the classic game rules within the above definitions: a 10-by-10 board, an unlimited number of torpedoes, and one ship of length 5, one ship of length 4, two ships of length 3, and one ship of length 2. The main metric for game-playing proficiency is the player's hit rate, i.e. the ratio of strikes taken that resulted in a hit. Equivalently, a player should attempt to sink all ships on the board in as few strikes as possible.
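For concreteness, the classic rule set described above can be captured in a small configuration object. The sketch below is illustrative only; the class name and fields are ours and do not correspond to the Rules class interface described in Section 2.2.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ClassicRules:
    """Classic Battleship rules used throughout this report (Section 1)."""
    board_dim: int = 10                                        # m-by-m board
    ship_lengths: List[int] = field(default_factory=lambda: [5, 4, 3, 3, 2])
    torpedo_limit: Optional[int] = None                        # None means unlimited torpedoes
```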

2. Infrastructure

We have formalized the game of Battleship as a state space model. This model consists of a start state, the set of legal actions that can be taken from the current state, the generation of a successor state from a given action, and an evaluation of whether the current state represents the game's end. The gameplay itself is modeled as the interaction between a game controller module and game agents (Figure 1).

2.1 Modeling

A single state consists of a game board and game-related statistics (score, number of moves taken, ships remaining). On a game board of m x m squares, each square may be in one of the following statuses: unexplored, hit, or missed. To represent the game board efficiently, the model only needs to track the board's dimensions (e.g. 10-by-10) and maintain lists of squares that have been either hit or missed, with the remaining squares assumed to be unexplored. Separately, the state maintains a list of ships and the squares that they occupy; each ship can be represented by a position, a length, and an orientation (vertical or horizontal). This information is accessible only to the state object and is not exposed to game-playing agents.

The initial state of the game is created by resetting all squares of the grid to unexplored status. We have implemented a game function that randomly selects the positions and orientations of the ships, while ensuring that ships lie within the board boundaries and do not overlap (Figure 2).

An action is the selection of a square on the game board to strike, denoted by a pair of coordinates. At any given state, a legal action is a square that has not been previously struck, that is, an unexplored square not marked as either hit or missed. A successor state is generated from an action by comparing the action against the state's game board and ship locations; the statistics and the square selected by the action are both updated accordingly.

The game ends when all the ships of the current state have been sunk. Equivalently, all the squares occupied by ships, as maintained by the state, are included in the list of hit squares. To simplify this evaluation, we maintain a separate variable that tracks the number of ships (and their corresponding squares) remaining.

Figure 1: Gameplay modeled as the interaction between the controller module and agents: the agent selects an action (strike coordinates), the game controller reports hit or miss, and the agent updates its policy.

Figure 2: Example game board representing one possible state. Shaded squares contain parts of hidden ships; squares with X represent hits; those with O represent misses.
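The state model described above can be sketched roughly as follows. This is a minimal illustration in Python; the class and method names (State, legal_actions, generate_successor, is_end) are our own and not taken from the report's codebase, and the successor is generated by mutating the state in place for brevity.

```python
import random

class State:
    """Minimal sketch of the Battleship state model described in Section 2.1."""

    def __init__(self, dim, ship_lengths):
        self.dim = dim                                 # board is dim x dim
        self.hits, self.misses = set(), set()          # everything else is unexplored
        self.ships = self._place_ships(ship_lengths)   # hidden from agents

    def _place_ships(self, lengths):
        ships, occupied = [], set()
        for length in lengths:
            while True:
                horizontal = random.random() < 0.5
                row = random.randrange(self.dim - (0 if horizontal else length - 1))
                col = random.randrange(self.dim - (length - 1 if horizontal else 0))
                squares = {(row, col + i) if horizontal else (row + i, col)
                           for i in range(length)}
                if not squares & occupied:             # reject overlapping placements
                    occupied |= squares
                    ships.append(squares)
                    break
        return ships

    def legal_actions(self):
        explored = self.hits | self.misses
        return [(r, c) for r in range(self.dim) for c in range(self.dim)
                if (r, c) not in explored]

    def generate_successor(self, action):
        # Record the strike result; statistics updates are omitted in this sketch.
        if any(action in ship for ship in self.ships):
            self.hits.add(action)
        else:
            self.misses.add(action)
        return self

    def is_end(self):
        return all(ship <= self.hits for ship in self.ships)
```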

2.2 Game Implementation

The entire game was implemented from scratch in Python. From the outset, our goal in this project has been to experiment with, and compare results across, various combinations of game rules, ship types, and gameplay strategies. To support this, we have placed much emphasis on separating all modeling-related implementations from algorithm-related implementations. Furthermore, the implementations are modularized for code reuse while allowing great flexibility for customization. The following is a high-level summary of the key modules in the implementation (shown in Figure A1):

Game class: A controller module that solicits actions from agents, generates the successor state, and provides the successor state to the next agent. The controller selects a set of game rules from its Rules class and uses it to initialize the State instance.

State class: Defines the state space model. Includes data structure definitions for the game board, square statuses, game statistics, and ship statuses, as well as functions for determining legal squares for targeting, scoring, evaluating the end state, and generating successor-state game boards. This class only defines the mechanism for generating states and the relationships between ships, actions, and the game board; the actual values for game board sizing, ship types, ship sizes, and torpedo counts are never hard-coded in the State and are obtained by interfacing with the Rules class.

Rules class: A parent of child classes that each define specifics for game board sizing, ship types, ship sizes, and torpedo counts. By isolating these values into individual classes, the game controller can easily swap between rules.

Agent class: A parent of child classes that each implement an algorithm to derive a game-playing policy. Every gameplay algorithm evaluated in this project is implemented as an individual agent. Each agent returns a single action based on its internal policy, allowing the game controller to treat all agents as black boxes.

In addition, we have implemented modules that form the backend to the game implementation and play no functional role in the gameplay itself. Nonetheless, these modules are essential for running and analyzing our experiments:

Display class: Graphically outputs the game board and the statuses of all its squares, allowing us to understand the situation in the game at a glance.

Statistics class: Interfaces with the game controller to gather turn-by-turn actions, results, and game statistics. Includes functions for plotting the gathered data.
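The controller/agent interface described above boils down to a simple loop: the controller repeatedly solicits an action from the agent, applies it to the state, and stops at the end state. The sketch below is our own illustration; it reuses the State sketch from Section 2.1's illustration, and the names play_game and get_action are hypothetical rather than the report's API.

```python
class Agent:
    """Base class for game-playing agents; each subclass implements a policy."""
    def get_action(self, observation):
        raise NotImplementedError

def play_game(state, agent):
    """Run one game to completion and return the number of moves taken."""
    moves = 0
    while not state.is_end():
        # Agents only ever see observable information; ship locations stay hidden.
        observation = {"dim": state.dim,
                       "hits": set(state.hits),
                       "misses": set(state.misses)}
        action = agent.get_action(observation)
        state.generate_successor(action)
        moves += 1
    return moves
```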

3. Strategy and Approach

To implement an AI game agent for Battleship, we approach the game fundamentally as an optimization problem. The task then becomes searching the state space to find the optimal policy for selecting actions, i.e. the coordinates of the square to strike. At a high level, our agent employs a reinforcement learning technique to find the optimal policy for a Markov decision process. This approach reflects the nature of Battleship, where the locations of the ships are hidden from the player. With each action (square selected to strike), there is an unknown probability of transitioning to each possible successor state (either a miss or a hit), and the corresponding rewards are therefore unknown as well. But by studying the layout of the game board, a learning agent can learn, over repeated iterations, the probability that a specific square contains a ship.

3.1 Q-Learning Algorithm

Specifically, our algorithm is based on Q-learning. Each time we select a square on the game board to strike (an action), the algorithm updates the expected utility of selecting that square based on the result. Through this process, we eventually converge to the true expected utility and obtain the action that yields the highest expected value at each state.

A policy update is a relatively expensive operation computationally, so our design needs to balance under-training (not sufficiently converged to the optimal policy) against over-training (wasted effort and overfitting to training states). After some experimentation, we settled on 15 training iterations prior to running the agent on the test games. We also scale the step size by the inverse of the square of the number of iterations, which gradually reduces the step size as we converge on the optimal policy.

To ensure that we learn the optimal policy while properly exploring the state space, our agent uses the epsilon-greedy algorithm to produce the actual action. During training we set epsilon to 0.1: the agent generally chooses the action dictated by its policy, but 10 percent of the time it takes a random action that moves it into a state not yet covered by the existing policy. During testing we set epsilon to 0, and the agent always follows the learned policy.

A learning algorithm relies on the proper assignment of rewards to derive the expected utility of a policy. We use a simple scoring mechanism based on the weighted portion of each ship sunk, discounted by the total number of moves executed:

$$\text{Score} = \sum_{\text{ships}} \frac{\text{SquaresHit}}{\text{TotalSquaresOccupied}} \times \text{ShipValue} \;-\; \text{NumberMovesTaken}$$

Each ship is assigned a value based on its size: a ship of length 5 is assigned a value of 100, a ship of length 4 has value 80, a ship of length 3 has value 60, and so forth. In essence, this scoring system provides a reward that is consistent with the game's central premise that a player should strive to sink as many ships as possible in the fewest number of moves.

A key realization is that features need to be generalized, so that the learning agent can associate features found in states encountered during training with states that it has never seen during testing. This is done with function approximation, which parameterizes the Q-values with a weight vector $\mathbf{w}$ and a feature vector $\phi(s, a)$:

$$\hat{Q}_{\text{opt}}(s, a; \mathbf{w}) = \mathbf{w} \cdot \phi(s, a)$$

On each observed transition $(s, a, r, s')$, the weights are updated as

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \left[ \hat{Q}_{\text{opt}}(s, a; \mathbf{w}) - \left( r + \gamma \max_{a'} \hat{Q}_{\text{opt}}(s', a'; \mathbf{w}) \right) \right] \phi(s, a)$$
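A minimal sketch of the feature-based Q-learning update and epsilon-greedy action selection described above is given below. This is our own illustration in Python; the class name, the dictionary-based feature representation, and the step-size bookkeeping are simplifications rather than the report's actual code.

```python
import random
from collections import defaultdict

class QLearningAgent:
    """Sketch of Q-learning with linear function approximation (Section 3.1).
    feature_extractor(observation, action) returns a dict {feature: value}."""

    def __init__(self, feature_extractor, discount=1.0, epsilon=0.1):
        self.feature_extractor = feature_extractor
        self.discount = discount
        self.epsilon = epsilon                  # 0.1 during training, 0 during testing
        self.weights = defaultdict(float)
        self.num_updates = 0

    def q_value(self, observation, action):
        phi = self.feature_extractor(observation, action)
        return sum(self.weights[f] * v for f, v in phi.items())

    def get_action(self, observation, legal_actions):
        if random.random() < self.epsilon:      # explore a state not covered by the policy
            return random.choice(legal_actions)
        return max(legal_actions, key=lambda a: self.q_value(observation, a))

    def update(self, observation, action, reward, next_observation, next_actions):
        """One Q-learning update on the transition (s, a, r, s')."""
        self.num_updates += 1
        eta = 1.0 / self.num_updates ** 2       # step size shrinks with the iteration count
        target = reward
        if next_actions:                        # no future value at an end state
            target += self.discount * max(self.q_value(next_observation, a)
                                          for a in next_actions)
        residual = self.q_value(observation, action) - target
        for f, v in self.feature_extractor(observation, action).items():
            self.weights[f] -= eta * residual * v
```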

3.2 Designing Features

A major challenge for our agent is exploring the large state space, which grows exponentially with the game board size. As a matter of both interest and practicality, the agent must try to reduce the state space to be searched. This requires a carefully designed set of features to be extracted from game states. The agent uses these features to evaluate states, placing a clear preference on successors that lead toward an optimal policy and avoiding ones that do not.

In this discussion of feature design, it is worthwhile to quantitatively compare the game proficiency of our Q-learning agent using our chosen features against a non-generalized feature choice. As a baseline for proficiency, we define a non-generalized feature that is simply the combination of the state (game board, game scores) and the action (coordinates to target). This is an example of rote learning, because it makes no attempt to exploit patterns across similar states.

Distance between target candidate and nearest missed square: A basic feature that allows the agent to learn how closely to place targets relative to previously targeted squares. Intuitively, when hunting for a ship of unknown location and orientation, a player should space out their targeted squares so that they are not too close together (to achieve better coverage of the board) and not too far apart (to avoid ships slipping through large gaps). The ideal spacing is generally correlated with the number and sizes of the ships. Over training iterations, the agent gradually learns the ideal spacing.

Distance between target candidate and nearest hit square: By the same rationale as above, the agent also learns the ideal distance to target from a previously hit square. This feature essentially allows the agent to learn to target squares right next to hit squares, so that it attacks squares in a line.

Beyond the basic distance-based features, we have incorporated some more sophisticated features that encourage or discourage the agent toward certain successor-state patterns:

Ignore sunk squares when calculating the distances to nearest hits on the same row and column: This feature builds on the basic distance-based features but also takes advantage of knowledge about ships that have been sunk. It allows the agent to discount the squares surrounding a ship that has already been sunk, discouraging futile attempts at targeting the nearby area. We saw significant improvement with this feature, as the agent immediately moves on to hunting for a new target as soon as the targeted ship is sunk. When used in place of the basic distance-based features, we saw about 10-13% fewer moves.

Consecutive hit squares: This feature was chosen so that the agent would recognize the orientation of a ship when consecutive hit squares are uncovered. This prevents the wasted effort shown in Figure 3, where the agent attacks other columns when it can be deduced that a ship is oriented vertically, and vice versa. From this feature we saw a moderate performance improvement, and in most cases the agent learned to limit the erratic targeting behavior mentioned. When used in combination with the basic distance-based features, we saw about 0-8% fewer moves from including this feature alone.

Figure 3: Erratic square targeting by the agent. We address this behavior with an additional feature that checks for consecutive hit squares. With training, Q-learning can use this feature to recognize a desirable pattern of consecutive hits.

Percentage of all orientations of all not-hit ships that the attack square could hit: This feature was added to address the agent randomly, and therefore inefficiently, picking squares when hunting. With this feature, the agent is encouraged to attack the center of openings in rows and columns, as well as the center of the board, where more ships in more orientations could be located (a simplified example is shown in Figure 4). When used in combination with the basic distance-based features, we saw a significant improvement of about 10-15% fewer moves from including this feature.

Figure 4: (A) On a 3-by-3 board, the center tile has the highest probability of containing a ship (indicated by the darkness of shading), followed by the side tiles and lastly the corner tiles. This is because (B) the center tile can support 4 possible ship orientations, (C) the side tiles can each support 3 possible orientations, and (D) the corner tiles can each support only 2 possible ship orientations.
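A rough sketch of a feature extractor along these lines is shown below. It is our own simplification: the feature names, the use of Manhattan distance, and the observation fields are assumptions rather than the report's exact implementation, and the sunk-ship and consecutive-hit features are omitted for brevity.

```python
def battleship_features(observation, action):
    """Sketch of Section 3.2 features for a candidate target square `action`.
    `observation` holds the board dimension, hit/miss sets, and remaining ship lengths."""
    dim = observation["dim"]
    hits, misses = observation["hits"], observation["misses"]

    def nearest(squares):
        # Manhattan distance to the closest square in `squares` (board dimension if none).
        if not squares:
            return dim
        return min(abs(action[0] - r) + abs(action[1] - c) for r, c in squares)

    def placements_covering():
        # Count placements of the remaining ships that would cover `action`
        # without overlapping any square already known to be a miss.
        count = 0
        for length in observation["remaining_ship_lengths"]:
            for horizontal in (True, False):
                for offset in range(length):        # position of `action` within the ship
                    r0 = action[0] - (0 if horizontal else offset)
                    c0 = action[1] - (offset if horizontal else 0)
                    squares = [(r0, c0 + i) if horizontal else (r0 + i, c0)
                               for i in range(length)]
                    if all(0 <= r < dim and 0 <= c < dim and (r, c) not in misses
                           for r, c in squares):
                        count += 1
        return count

    # Indicator-style keys let the learned weights capture the ideal spacing per distance.
    return {
        ("dist_to_nearest_miss", nearest(misses)): 1.0,
        ("dist_to_nearest_hit", nearest(hits)): 1.0,
        "placements_covering_square": placements_covering(),
    }
```

On the 3-by-3 example of Figure 4 with a single length-2 ship and an empty board, placements_covering returns 4 for the center square, 3 for a side square, and 2 for a corner square, matching the counts in the figure.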

It is by combining all the features above that we achieve the most cumulative improvement over a rote-learning feature extractor. Table 1 summarizes the impact of each feature on the overall proficiency of the Q-learning agent.

Table 1: Features for the Q-Learning Agent and Respective Impact on Performance

#   Feature                                                     Avg. Moves   Avg. Score
0   Rote learning (non-generalized state/action combination)
1   Basic distance-based (distance from hit/missed)
2   Ignore sunk squares + feature #1
3   Consecutive hit squares + feature #1
4   Percentage of not-hit squares targetable + feature #1
5   All features included (features #1, #2, #3, #4)

3.3 Baselines and Oracles

We use several non-learning approaches as baselines and oracles to compare against our learning-based AI agent:

Baseline 1: A random algorithm simply selects squares at random to strike each turn. It is simple but not very interesting; we use it to represent the absolute lower bound on effectiveness.

Baseline 2: A hunt-and-target (H-and-T) algorithm is a simple improvement on the random algorithm. It initially also selects squares at random to strike, but after a strike results in a hit, it targets the squares adjacent to the hit square during subsequent turns (a code sketch appears at the end of Section 4).

Oracle: In general, a human can play Battleship reasonably well. Most humans employ a strategy similar to the hunt-and-target approach, but will subconsciously incorporate additional intuition to determine the likelihood that a square contains a ship. For example, an unexplored square that is mostly surrounded by missed strikes is not very likely to contain a ship, since a ship there would likely have occupied several of the adjacent squares that have already been exposed; a human would therefore prioritize other squares as targets. It is also useful to note that the theoretical upper bound on effectiveness is a 100% hit rate, i.e. every strike results in a hit and the game is completed in the minimum possible number of turns.

Results for game agents playing Battleship on a 10-by-10 game board with classic rules are shown in the following tables.

4. Experiment Results and General Analysis

The Random agent requires the most moves to complete the game and receives the lowest score. Its poor performance can be attributed to its complete lack of game-specific knowledge. The Hunt and Target agent performs much better than the Random agent, requiring about one third fewer moves. H-and-T also selects its hunting targets at random, but it includes the game-specific knowledge that ships are oriented in straight lines, and it attacks the squares surrounding a previous hit.

Agent               Avg. Num. Moves   Avg. Score
Random
Hunt and Target
Human
Q-Learning

Table 2: Performance of game agents on classic Battleship rules

                    Q-Learning   Hunt-and-Target   Random   Human
Q-Learning               X              76                     70
Hunt-and-Target         24               X
Random                   0               2             X        0
Human                                                            X

Table 3: Percentage of games won by each agent (row) playing against another agent (column)

The Human agent uses 12.5% fewer moves to complete the game than the Hunt and Target agent. The Human acts similarly to the Hunt and Target agent when attempting to sink a known ship. However, when searching for an unknown ship, the Human is capable of visually scanning the unexplored game squares and, using knowledge of the number and lengths of the ships on the game board, selecting the minimum number of shots required to explore regions of the board.

The Q-Learning agent outperforms the Random, Hunt and Target, and Human agents. This can be attributed to the addition of well-designed features to the Q-learning algorithm. In effect, the features allow the agent to approximate the probability that each square on the board contains a ship. While the features mostly mirror human intuition (e.g. the number of consecutive hits, or the likelihood of a ship fitting within an unexplored space), an algorithm is able to define the features more precisely and avoids the occasional biases of human visual approximation.

The advantage of Q-Learning over the other agents is even more apparent in head-to-head competition (Table 3). As an experiment, we ran the agents against each other one-on-one, with the agent that finishes its game first declared the winner. Q-Learning beats the H-and-T agent 76% of the time, and beats the Human 70% of the time.
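As a reference point for the analysis in the next section, the hunt-and-target baseline (Baseline 2 in Section 3.3) could be sketched roughly as follows. This is our own illustration; the class and attribute names are hypothetical and do not come from the report's code.

```python
import random

class HuntAndTargetAgent:
    """Sketch of Baseline 2: hunt at random, then target the neighbors of any hit."""

    def __init__(self):
        self.pending_targets = []                  # unexplored neighbors of known hits

    def get_action(self, observation):
        dim = observation["dim"]
        hits, misses = observation["hits"], observation["misses"]
        explored = hits | misses

        # Queue up unexplored neighbors of every known hit.
        for (r, c) in hits:
            for neighbor in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                nr, nc = neighbor
                if (0 <= nr < dim and 0 <= nc < dim
                        and neighbor not in explored
                        and neighbor not in self.pending_targets):
                    self.pending_targets.append(neighbor)

        # Target mode: strike adjacent squares of previous hits first.
        while self.pending_targets:
            target = self.pending_targets.pop()
            if target not in explored:
                return target

        # Hunt mode: otherwise strike a random unexplored square.
        unexplored = [(r, c) for r in range(dim) for c in range(dim)
                      if (r, c) not in explored]
        return random.choice(unexplored)
```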

5. Algorithm Behavior and Error Analysis

Among our baselines and oracles, the Hunt-and-Target (H-and-T) algorithm is a particularly interesting subject for a more in-depth comparison against the Q-Learning agent. The simplicity of H-and-T is its greatest strength, so significant outperformance is required to justify choosing a different implementation over H-and-T. As seen in the preceding section, Q-Learning indeed delivers that outperformance. To understand the factors that contribute to Q-Learning's advantage over H-and-T, we conducted a series of further experiments.

5.1 Target Selection Behavior

Figure 5 presents a heat map of the most frequently targeted tiles for Q-Learning (top) compared with Hunt-and-Target (bottom). For this experiment, we kept the locations and orientations of the ships constant on the game board and ran both agents on this same configuration over 50 games. During each game, whenever one agent completes the game first by sinking all ships, both agents are stopped and the targeted tiles are tallied. With this approach, we ensure that the same number of turns is counted for both agents in each game; the shades in the heat maps therefore represent the normalized average behavior of the two agents.

It is also important to note that Q-Learning was trained on a different, randomized set of game boards prior to this experiment, and that after training, Q-Learning was not permitted to perform any policy updates during the experiment itself. Effectively, this means that Q-Learning cannot learn the locations of the stationary ships over the course of the experiment. For each game iteration, Q-Learning acts only on the policy learned from previous training and behaves as though it is playing on this game board for the first time.

Figure 5: Heat map of the most frequently targeted tiles by Q-Learning (top) and Hunt-and-Target (bottom).

From these heat maps, it is clear that Q-Learning is able to target ship locations on the game board much more precisely than Hunt-and-Target. This can be attributed to Q-Learning's much more sophisticated approach to evaluating the likelihood that an unexplored region contains an un-sunk ship. Q-Learning's targets are much more concentrated on the ships and their surrounding squares, indicating much stronger confidence about their locations. Q-Learning correctly determined that certain areas of the board are unlikely to contain ships, and the heat map reveals that the agent rarely targeted them. In contrast, H-and-T's simple randomized hunting comes at a cost in performance: most areas of its map are darker and more evenly toned compared to Q-Learning's, indicating much worse precision. The smaller two-tile ship in the upper right is barely discernible on the H-and-T map, suggesting that the agent was essentially taking blind stabs at the region.
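The tallying procedure described above can be sketched roughly as follows. This is our own illustration under stated assumptions: it reuses the State sketch from Section 2.1's illustration, takes each agent as a callable that maps an observation to a target square, and approximates "stopping both agents" at the granularity of whole turns.

```python
import copy
from collections import Counter

def targeting_heatmaps(fixed_state, choose_a, choose_b, num_games=50):
    """Tally how often each agent targets each square on the same fixed board,
    stopping both agents as soon as either finishes a game (Section 5.1)."""
    tallies = {"a": Counter(), "b": Counter()}
    choosers = {"a": choose_a, "b": choose_b}
    for _ in range(num_games):
        states = {"a": copy.deepcopy(fixed_state), "b": copy.deepcopy(fixed_state)}
        while not any(s.is_end() for s in states.values()):
            for key in ("a", "b"):                  # both agents take one turn each
                obs = {"dim": states[key].dim,
                       "hits": set(states[key].hits),
                       "misses": set(states[key].misses)}
                action = choosers[key](obs)
                tallies[key][action] += 1
                states[key].generate_successor(action)
    # Normalize each tally so the two heat maps are directly comparable.
    return tuple({sq: n / sum(t.values()) for sq, n in t.items()}
                 for t in (tallies["a"], tallies["b"]))
```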

5.2 Performance Impact of Board and Ship Sizes

It is clear that Q-Learning holds an advantage over Hunt-and-Target on the traditional 10-by-10 game board. To understand how the game board size and the relative ship sizes affect the performance of each agent, we conducted another set of experiments.

Figure 6 shows how the portion of the game board that each agent must explore to complete the game varies as the board grows larger. As we gradually increase the board size from 10-by-10 to 20-by-20, Q-Learning maintains its advantage over H-and-T, consistently exploring a smaller portion of the board. In other words, Q-Learning is consistently more accurate and completes the game in fewer moves than H-and-T, even as the board size changes. Q-Learning's advantage over H-and-T increases only modestly, from 10 percent to about 14 percent; however, it should be noted that Q-Learning's advantage becomes larger in absolute terms (i.e. the number of squares targeted) even as the percentages remain proportionally consistent.

Figure 6: Portion of the game board explored by game agents with increasing board dimensions.

Figure 7 shows how well each agent performs as the average size of the ships relative to the game board changes. By relative size, we refer to the ratio of a ship's length to the game board's dimension, such that a ship of length 3 on a 10-by-10 game board yields a ratio of 3/10 = 0.3. The results from this experiment provide much insight into the strengths and weaknesses of each agent. There are two situations where Q-Learning and H-and-T perform about equally well: with ships of average relative size 0.1 (where all agents perform about equally poorly), and with ships of relative sizes 0.5 to 0.6. Q-Learning holds the advantage at relative sizes between 0.1 and 0.5, while H-and-T gradually takes the lead at sizes greater than 0.6.

Figure 7: Performance of game agents with varying ship sizes relative to the game board dimensions.

In summary, Q-Learning excels with small ships and/or sparse boards, while H-and-T is stronger with large ships and/or crowded boards. Intuitively, this finding is consistent with the assumption that Q-Learning's more sophisticated targeting methods are most useful for hunting down small ships in large unexplored areas. This is a situation where determining the probability that each square holds a ship truly pays off, since randomly targeted squares will more likely than not result in misses. By extension, we can also consider this to validate Q-Learning's strength at hunting down that last tiny ship on the board to win the game.

In contrast, Hunt-and-Target performs very well on crowded boards with large ships. Here, Q-Learning's more sophisticated, probability-based approach provides very little advantage, as randomly targeting squares on the board would also most likely result in a hit.

Once H-and-T obtains a hit, it targets adjacent squares, which on a crowded board are also likely to result in hits. Meanwhile, a situation where nearly all strikes result in hits is likely to confuse Q-Learning. In order to learn a useful policy, Q-Learning relies on the assumption that certain actions yield rewards while others do not. When nearly all moves yield the same reward or result, Q-Learning has little chance to improve its policy or identify patterns with its feature extractor, and its performance approaches that of a random implementation.

6. Related Works

The game of Battleship dates from the First World War and has remained popular through the decades. As such, there is no shortage of discussion of strategies and playing styles for the game, in both academic and non-academic settings. But as far as we could determine, this project is one of the few, if not the only, attempts to implement a Battleship game-playing agent by combining reinforcement learning with a nontrivial, designed set of features to assist in searching the state space. Some existing works have implemented a learning-based approach but did not combine it with feature extraction (e.g. Compton et al.); such implementations were bogged down by the large state space and achieved performance no better than hunt-and-target. Others have improved significantly over hunt-and-target, but only with non-learning algorithms that incorporate a significant amount of hardcoded, board-specific knowledge (e.g. Berry); such algorithms would likely not adapt well to a generalized game environment with variable factors (e.g. variable board size or ship count). The following is a summary of some significant related works:

C. Compton, J. Liu, N. Stanzione. "Battleship: A Hit or Miss Affair."
"Battleships - A Game Playing Agent."
N. Berry. "Battleship." DataGenetics.

7. Conclusions

The Q-Learning agent outperforms the Random, Hunt and Target, and Human agents. This can be attributed to the addition of well-designed features to the Q-learning algorithm, allowing it to evaluate the likelihood that an unexplored region contains a ship in a more precise and sophisticated manner. Q-Learning maintains its advantage over Hunt-and-Target as the board size is increased from 10-by-10, consistently exploring a smaller portion of the board; in other words, Q-Learning is consistently more accurate and completes the game in fewer moves. Q-Learning excels with small ships and/or sparse boards, while Hunt-and-Target is stronger with large ships and/or crowded boards. Intuitively, this finding is consistent with the assumption that Q-Learning's more sophisticated targeting methods are most useful for hunting down small ships in large unexplored areas.

Appendix

Figure A1: Overview of the game infrastructure, including the game controller, agents, data structures, and backend modules for display and maintenance.
