Model-Based Reinforcement Learning in Atari 2600 Games


Model-Based Reinforcement Learning in Atari 2600 Games

Daniel John Foley
Research Adviser: Erik Talvitie

A thesis presented for honors within Computer Science on May 15th, 2017
Franklin & Marshall College
Bachelor of Arts, May 13th, 2017

Chapter 1
Introduction

Imagine playing a video game for the first time, with no prior knowledge of the objective of the game or of the mapping between controller buttons and gameplay mechanics. The ideal is to formulate an optimal strategy in pursuit of the maximum final score, but no information is available at the start of gameplay. The general interaction between controller buttons and gameplay mechanics can be quickly learned (e.g., Up button = jump) by observing causal relationships. After pressing a button several times and witnessing its effects, it is possible to predict the future outcome of pressing the button. This is extremely useful because it is then possible to determine which button should be pressed in a given scenario to receive the largest final score. This type of learning is called reinforcement learning; actions that lead to favorable outcomes are encouraged, whereas actions that lead to less favorable outcomes are discouraged.

Humans are able to develop an understanding of the gameplay dynamics over time through gameplay interaction. This understanding is called a model, which allows humans to predict the outcome of pressing a button. Due to the complicated nature of many games, humans usually create an imperfect model, and thus may make inaccurate predictions. Despite utilizing an imperfect model of the gameplay dynamics, humans are remarkably skilled video game players. This work will focus on the creation of a software agent that learns how to play video games by developing an imperfect model of the gameplay dynamics. The objective of this research is not to solve a video game problem, but to explore model-based decision-making in a novel and challenging environment - a topic that is fairly underrepresented in the current AI literature.

1.1 The Arcade Learning Environment

The Arcade Learning Environment (ALE) is a platform for evaluating the success of AI programs in achieving flexible competence across a variety of problems. The ALE is composed of over 50 Atari 2600 games, each with its own objective, controller mapping, reward system, and gameplay objects. The wide variety of tasks associated with the ALE makes it an appropriate platform for evaluating non-domain-specific AI technology. The ALE serves as the challenge problem that will be explored in this thesis.

The ALE was purposefully designed as a premier challenge problem in the AI community so that it may serve as a microcosm of general AI. Bellemare et al. (2013a) created the ALE with the intention that it would be an achievable stepping stone [for general AI competency] and formidable enough to require new technological breakthroughs. The ALE has inspired novel techniques and approaches, and still serves as a prominent subject of active research.

The ALE interface provides functionality to observe and interact with the gameplay environment of Atari 2600 games. The ALE allows an agent to select actions from an action set of 18 joystick movement and button combinations, and then updates the respective gameplay world accordingly. An AI agent operating in the ALE has access to the current pixel-level representation of a screen of gameplay, as well as the points received after selecting an action.

Although the functionality of the ALE provides some useful information to the agent, there is a significant lack of environmental information available to it. An agent can observe the current pixel-level representation of a screen of gameplay, but the gameplay environment is still only partially observable to the agent, making it a sufficiently challenging task environment. For example, a single observation does not reveal the velocity of [objects on the screen] (Hausknecht and Stone, 2015), nor does it denote the location and time at which an object will enter the screen in the near future. An agent in a partially observable environment can only act based on the limited information it has available to it.

Atari 2600 games are deterministic in nature, so a perfect model of the gameplay dynamics would be able to predict the future with 100% accuracy. Despite this deterministic nature, it is very difficult to develop a perfect model of the dynamics due to the incredibly large number of possible gameplay situations, and therefore humans usually only develop an approximate model. Humans often believe deterministic Atari 2600 games are stochastic due to their approximate model of the high-dimensional game environments. This work will present the application of a stochastic model to the deterministic ALE.

1.2 Reinforcement Learning

The agent will learn a stochastic model of the ALE through reinforcement learning. Reinforcement learning (RL) is a type of learning that associates actions with a positive or negative reward signal. A RL agent in the ALE is able to determine which action should be taken, given the current situation, by selecting the action that will lead to the largest final score. A RL agent must discover which actions yield the most reward by trying them (Sutton and Barto, 1998), and this direct experience with the world shapes the agent's policy. A policy is an agent's master strategy - for any given situation, the agent should determine the most favorable action.

In model-based reinforcement learning (MBRL), an agent constructs an internal representation of the mechanics of the world based on its past experiences, and uses this information to make predictions about the future environment. MBRL agents develop a model of the dynamics of the world and the rewards associated with situations. Access to a model allows an agent to envision the long-term effects of taking a given action. This allows an agent to plan, which is to determine the best possible action to take in a given situation considering the long-term potential future outcomes. It is important to note that there are no published results of a MBRL agent performing well in the ALE.

In model-free reinforcement learning (MFRL), an agent does not attempt to create a predictive model of the dynamics of the world, but instead evaluates the quality of selecting each action in a given situation. MFRL agents are less computationally expensive than MBRL agents, but do not utilize available information that could be used to learn about the dynamics of the world. MFRL agents determine the best action to select solely based on the current situation. Despite having a short-term focus, several MFRL agents have performed well in the ALE.

1.3 Related Work

Several researchers have studied MFRL agents in the ALE, leading to the discovery of fairly successful MFRL approaches.

The main challenge is to construct a practical and useful representation of a screen of gameplay. Bellemare et al. (2013a) created five different screen representations to power a common MFRL algorithm named SARSA. Through empirical analysis, they demonstrated that learning progress is already possible in Atari 2600 games. Their work was the first to demonstrate a level of MFRL success within the ALE, and it would inspire future researchers to continue exploring the potential of MFRL in the ALE.

The success of a fairly simple method would inspire more sophisticated approaches. Mnih et al. (2015) trained a deep convolutional neural network to associate selecting an action in a screen of gameplay with a reward signal. They established the state-of-the-art results for performance in the ALE. Although their approach was very successful, it was also computationally expensive. Liang et al. (2016) sought to create a computationally practical MFRL agent that would achieve similar performance to the state-of-the-art results. This led to the creation of the Blob-PROST screen representation (which will be used in this work). SARSA coupled with Blob-PROST was able to attain similar performance to the state-of-the-art results while remaining computationally practical.

Despite the successes of MFRL approaches, MFRL has several disadvantages. Consider the game Montezuma's Revenge.

Figure 1: A screen of Montezuma's Revenge gameplay.

The objective of the first level is to obtain the key and open one of the doors (the two dashed lines that are one level above the top of the left and right ladders). The agent only receives points when the key is obtained or one of the doors is unlocked. From the original starting position, there is a fairly long and specific action sequence that needs to be followed to obtain the key. A MFRL agent is seriously disadvantaged in this context, as the agent would not learn anything about the relative quality of actions until it receives points by randomly obtaining the key, which is a very difficult task to complete randomly. When the MFRL agent does obtain the key, it would only learn which action it should select when it is one action away from reaching the key. Even in the absence of points, a MBRL agent would learn about the dynamics of the world, and could even learn how to reach the position where it is one action away from reaching the key. In addition, the agent would be aware of locations on the screen it has not visited, and could intentionally explore those locations. This encouraged exploration could lead to an informed search process, which is certainly more effective than searching randomly.

Several researchers have explored model-learning in the ALE, which entails training a model to be able to predict future screens of gameplay. Bellemare et al. (2013b) divided Atari 2600 games into different segments and applied Bayesian inference, Bellemare et al. (2014) introduced the Skip Context Tree Switching algorithm, and Oh et al. (2015) developed two deep neural networks. All three of these models have achieved remarkable prediction accuracy. However, their research does not explore decision-making in the ALE based on predicted future screens. No one has demonstrated that predicting future screens can improve gameplay performance in the ALE. This work will focus on the intersection of model-learning and reinforcement learning in the ALE, an area of study that is currently missing from the AI literature.

1.4 Primary Contributions

This thesis will focus on the creation of a model-based reinforcement learning agent that learns an imperfect predictive model of the gameplay dynamics of Atari 2600 games. The objective of this research is not to solve a video game problem, but to explore model-based decision-making in a novel and challenging environment. Each design choice in this work was made with the intention of maintaining practicality, as creating an accurate model in the ALE is extremely computationally challenging. There are no published results showing successful MBRL in the Atari 2600 domain - and this thesis will not be the first. This thesis, however, will present a practical model that shows promise with respect to learning potential, as well as document MBRL challenges not often presented in the current MBRL literature.

Chapter 2
MBRL Agent for Atari 2600

This chapter details the creation of a MBRL agent for the Atari 2600, which consists of a convolutional logistic regression model coupled with Monte Carlo planning. In an attempt to create a model useful for planning, special focus was placed on using simple techniques to maintain the practicality of the MBRL agent.

2.1 Pong

The collection of Atari 2600 games provided by the ALE forms a testbed for empirical analysis of flexibly competent agents. However, special focus will be placed on the game of Pong, which will serve as the core example throughout this thesis because it is simple, computationally workable, and easily understood. Pong is a single-player game analogous to tennis; the player has a paddle and the objective is to hit the ball with the paddle until the opponent's paddle is not able to make contact with the ball. In a screen of Pong gameplay, there are three main objects of importance: the player's paddle (right), the opponent's paddle (left), and the ball.

Figure 2: A screen of Pong gameplay.

The bottom of the screen and the white line at the top of the screen serve as vertical boundaries that cause the ball to bounce upon contact. If the ball manages to reach the left edge of the screen, the player is awarded a point and their score increases by one. If the ball manages to reach the right edge of the screen, the opponent is awarded a point and the player's score decreases by one. The numbers at the top of the screen, from left to right, reflect the current number of points for the opponent and the player, respectively. A game is finished when either the player or the opponent receives their 21st point. The player's final score is between -21 and 21, inclusive, where a negative number signifies defeat and a positive number signifies victory.

2.2 Tile-Level Screen Information

The ALE provides access to the screen information of Atari 2600 gameplay. More specifically, the ALE records the colors present at every pixel of a given screen. Liang et al. (2016) developed methods to translate the pixel-level screen information to object-level screen information, which is comprised of the locational information of colored blobs. The object-level screen information melds together contiguous pixels of the same color to form colored blobs, where the definition of contiguity has been expanded to include the surrounding s x s neighborhood, for some size s. The location of a colored blob is defined as the center of the blob's smallest bounding box (Liang et al., 2016).

Liang et al. then converted the object-level screen information to tile-level screen information, which bears resemblance to the original pixel-level screen information. The screen is divided into tiles of dimensions 4 pixels x 7 pixels, and the tile-level screen information records the colors present at every tile, where color presence is defined as a tile containing the center of a blob of the given color. The abstraction to tile-level screen information has proven to be an efficient and useful basis for a feature set for model-free learning, and as such, it will be used as the basis for the MBRL agent's feature set.

Figure 3: A screen of Pong gameplay (left), and a tile-level representation of the screen comprised of 9 tiles x 11 tiles (right). Black indicates there are no blobs within the tile. Note that the background color and the white line in the original screen of gameplay are each represented by one single blob that is assigned to the tile that contains the blob's center.

2.3 Out of Bounds Color

The ALE provides functionality to allow the user to determine the color palette for the screen information of Atari 2600 gameplay. To maintain practicality, the MBRL agent utilizes the 8-color (SECAM) palette from the ALE. The distinction of colors allows the agent to differentiate between objects on the screen. It would also be advantageous for the agent to know the boundaries of the gameplay screen. During the conversion process from pixel-level screen information to tile-level screen information, an immediate perimeter of tiles is created to surround the tiles of the gameplay screen. A special color is placed into each of the tiles in this immediate perimeter, establishing out of bounds markers for the perimeter layer. The addition of the out of bounds color extends the agent's original 8-color palette to a 9-color palette.

Figure 4: A 5 tiles x 5 tiles screen of gameplay surrounded by an immediate perimeter of tiles containing the out of bounds color. Under the original 8-color palette, the only information known is the existence of an orange blob and a green blob. The addition of the out of bounds color adds the information that the orange blob is on the left edge of the screen, and the green blob is on the right edge of the screen at the topmost tile.
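For concreteness, the conversion from blob centers to a tile-level representation with an out of bounds perimeter can be sketched in a few lines of Python. The sketch is illustrative rather than the agent's actual implementation: the blob-detection step is assumed to have already produced (pixel row, pixel column, color) triples, the tile orientation is an assumption since the text states the tile dimensions in both orders, and the function name and color indices are hypothetical.

    # Sketch of the tile-level screen representation with an out-of-bounds border.
    # Assumes blob centers are given as (pixel_row, pixel_col, color) triples.
    TILE_HEIGHT, TILE_WIDTH = 7, 4   # assumed orientation of the 4 x 7 / 7 x 4 tiles
    OUT_OF_BOUNDS = 8                # index of the extra color appended to the 8-color palette

    def tile_representation(blob_centers, screen_height, screen_width):
        rows = screen_height // TILE_HEIGHT
        cols = screen_width // TILE_WIDTH
        # One extra tile on every side holds the out-of-bounds color.
        tiles = [[set() for _ in range(cols + 2)] for _ in range(rows + 2)]
        for r in range(rows + 2):
            for c in range(cols + 2):
                if r in (0, rows + 1) or c in (0, cols + 1):
                    tiles[r][c].add(OUT_OF_BOUNDS)
        # A color is "present" in a tile if the tile contains a blob center of that color.
        for pixel_row, pixel_col, color in blob_centers:
            r = 1 + pixel_row // TILE_HEIGHT
            c = 1 + pixel_col // TILE_WIDTH
            tiles[r][c].add(color)
        return tiles

    # Example: one blob (hypothetical color index 6) near the top-left of a 210 x 160 screen.
    example = tile_representation([(10, 12, 6)], screen_height=210, screen_width=160)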

2.4 Logistic Regression

The basic learning algorithm that will be employed to learn the model is logistic regression. The following example is a simplified version of how the model can make predictions using the logistic function. Consider a tile at position (x, y) - the objective is to predict whether the tile will contain a white blob in the next screen. A feature is a piece of information, such as "there is a green blob in the tile at position (x, y)". A binary feature corresponds to the truth value (0 or 1) of the piece of information - if there is a green blob in the tile at position (x, y), then the corresponding binary feature is 1. This example will demonstrate how a logistic regression model can use a vector of binary features to predict the presence of a white blob within the tile at position (x, y) in the next screen.

Consider the binary feature vector \phi_{x,y} that corresponds to the presence of the colors white, green, and red in the tile at position (x, y) in the current screen of gameplay. In this example, white and red are present in the current screen, so the binary feature vector \phi_{x,y} is (1, 0, 1). Logistic regression will be used to predict the probability of the presence of a white blob in the next screen of gameplay, which is defined as \hat{M}(\text{white} \mid \phi, x, y). For every feature in the feature set, there is a weight for white blob presence. The weights account for the correlations between the features being active and the likelihood of a white blob being present in the tile at position (x, y) in the next screen. For this example, the white weight vector \Theta_{\text{white}} is (2, -1, 1.2). Summing the weights that correspond to the active features forms a linear combination

\Lambda_{x,y,\phi,\text{white}} = (2 \cdot 1) + (-1 \cdot 0) + (1.2 \cdot 1) = 3.2.

Applying the logistic probability formula to this linear combination yields

\hat{M}(\text{white} \mid \phi, x, y) = \frac{e^{3.2}}{1 + e^{3.2}} \approx 0.96.

The logistic probability formula has determined that the model probability of a white blob being present in the tile at position (x, y) in the next screen of gameplay is approximately 0.96. Logistic regression will always yield a number between 0 and 1, so it is an ideal tool for determining probability.
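The worked example above can be written out directly in code. The following Python sketch simply reproduces the arithmetic of this section with the same hypothetical feature vector and weights; it is not the agent's implementation.

    import math
    import random

    def logistic(z):
        # Logistic (sigmoid) function: maps any real number into (0, 1).
        return math.exp(z) / (1.0 + math.exp(z))

    # Binary features for the tile at (x, y): [white present, green present, red present].
    phi = [1, 0, 1]
    # Hypothetical weights for predicting white in the next screen, as in the example.
    theta_white = [2.0, -1.0, 1.2]

    # Linear combination of the weights of the active features.
    lam = sum(w * f for w, f in zip(theta_white, phi))    # 3.2
    p_white = logistic(lam)                               # roughly 0.96

    # Sample the prediction: the tile contains a white blob with probability p_white.
    predicts_white = random.random() < p_white
    print(lam, round(p_white, 3), predicts_white)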

Figure 5 graphs the logistic curve.

Figure 5: The logistic curve from x = -5 to x = 5 (Wolfram Alpha, 2017).

The model will then generate a random number between 0 and 1. If the randomly generated number is less than 0.96, then the tile at position (x, y) will contain a white blob in the predicted next screen. The same process can be applied to the other colors as well to determine the colors present in the tile at position (x, y) in the next screen. It can also be applied to all of the colors in the remaining tiles of the next screen to create an entire predicted next screen.

The success of the logistic regression model is largely dependent on the accuracy of the weights. Weights of the active features are updated by comparing the model's predictions of the screen to the actual screen. After each color prediction within each tile, for each color c and active feature i, the prediction error \delta_{c,i} is accumulated as

\delta_{c,i} \leftarrow \delta_{c,i} + (\tau_{x,y,c} - \hat{M}(c \mid \phi, x, y)).   (2.1)

Once an entire predicted next screen has been generated, each weight \theta_{c,i} is updated by the prediction error \delta_{c,i} and the step size \alpha, which is a number between 0 and 1 that allows the logistic regression model to gradually learn correlations over time. The weight update rule is

\theta_{c,i} \leftarrow \theta_{c,i} + \alpha \, \delta_{c,i}.   (2.2)
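The prediction-then-update cycle of Equations 2.1 and 2.2 for a single color can be sketched as follows. The function and variable names are illustrative, and the per-tile binary feature vectors are assumed to be given; this is a sketch of the mechanics, not the thesis implementation.

    import math
    import random

    def logistic(z):
        return math.exp(z) / (1.0 + math.exp(z))

    def predict_and_update(tiles, features, weights, actual, alpha=0.01):
        # One pass of Equations 2.1-2.2 for a single color.
        # tiles    -- list of (x, y) tile positions
        # features -- features[(x, y)] is the binary feature vector for that tile
        # weights  -- weight vector theta_c for this color (updated in place)
        # actual   -- actual[(x, y)] is 1 if the color appeared in the real next screen
        delta = [0.0] * len(weights)              # accumulated prediction error per feature
        predicted = {}
        for (x, y) in tiles:
            phi = features[(x, y)]
            p = logistic(sum(w * f for w, f in zip(weights, phi)))
            predicted[(x, y)] = random.random() < p
            # Equation 2.1: accumulate (tau - M_hat) for every active feature.
            for i, f in enumerate(phi):
                if f:
                    delta[i] += actual[(x, y)] - p
        # Equation 2.2: apply the accumulated error once the whole screen is predicted.
        for i in range(len(weights)):
            weights[i] += alpha * delta[i]
        return predicted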

2.5 Convolutional Logistic Regression (Dynamics Model)

Under the MBRL framework, an agent chooses actions leading to the most desirable future outcome. In order to compare the effects of selecting an action, the logistic regression model must be action-dependent. Equation 2.4 reports the logistic probability formula used in this work, which is an action-dependent formula. Action-dependence is achieved by separating the weights among the action set so that each action a has its own set of weights. The linear combination is the sum of the n active weights, which are both color and action dependent.

\Lambda_{x,y,\phi,a,c} = \sum_{i=0}^{n} \theta_{a,c,i} \, \phi_i   (2.3)

\hat{M}(c \mid \phi, x, y, a) = \frac{e^{\Lambda_{x,y,\phi,a,c}}}{1 + e^{\Lambda_{x,y,\phi,a,c}}}   (2.4)

In the Atari 2600 domain, a MBRL agent must predict future screens in order to determine the most favorable action. This work will utilize a convolutional approach for future screen prediction (Fukushima, 1980). The convolutional approach is centered around the process of predicting the colors within a tile of the next screen based on the presence of colors in the neighboring tiles of the current screen. Absolute position has been abstracted away from this approach, allowing the model to learn position-independent information.

For each tile in the next screen, a feature vector is constructed using a convolutional lens. A convolutional lens has access to the tile-level screen information in the surrounding area of the current screen. This surrounding area is named the convolutional neighborhood, which has square dimensions determined by the program. Figure 6 presents an example of a convolutional lens with dimensions 3 tiles x 3 tiles.

Figure 6: A 3 tiles x 3 tiles convolutional lens iterating through a 5 tiles x 5 tiles screen. The gray tiles and the red tile comprise the convolutional neighborhood for the first iteration, where the red tile is the tile that is currently being predicted.

As demonstrated by Figure 6, the convolutional lens iterates through the tile positions of the next screen in a top-to-bottom, left-to-right fashion. The feature vector constructed for each tile is determined by the presence of colors within the tiles in the convolutional neighborhood from the current screen. Each feature has a corresponding weight for each action-color pair, where the weight corresponds to the correlation between the feature and the color when the agent takes the action. For each tile position and for each color, summing the color's weights that correspond to the active features and then applying logistic regression to this sum yields the probability that a blob of that color will be present within the tile in the predicted next screen. This process constructs an entire predicted next screen.

When the agent encounters a real next screen of gameplay, it compares its predictions of the next screen with the actual screen, and applies a small update (found in Equation 2.6) to the weights of the active features - in this way the model is able to learn about the dynamics of the world over time.

\delta_{a,c,i} \leftarrow \delta_{a,c,i} + (\tau_{x,y,c} - \hat{M}(c \mid \phi, x, y, a))   (2.5)

\theta_{a,c,i} \leftarrow \theta_{a,c,i} + \alpha \, \delta_{a,c,i}   (2.6)
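A sketch of the convolutional prediction loop driven by Equations 2.3 and 2.4 is given below. It makes several simplifying assumptions that are not in the thesis: screens are represented as dictionaries mapping tile positions to sets of colors, features are (Δt, Δx, Δy, c) tuples, weights are stored in nested dictionaries keyed by (action, color), and the bias and truncation refinements of Section 2.6 are omitted. All names are illustrative.

    import math
    import random
    from collections import defaultdict

    def logistic(z):
        return math.exp(z) / (1.0 + math.exp(z))

    def active_features(screens, x, y, radius):
        # Enumerate (dt, dx, dy, color) features in the convolutional neighborhood.
        # screens maps a time offset dt (-1 = previous, 0 = current, +1 = the part of
        # the next screen that has already been predicted) to {(row, col): set(colors)}.
        feats = []
        for dt, grid in screens.items():
            for dx in range(-radius, radius + 1):
                for dy in range(-radius, radius + 1):
                    for c in grid.get((x + dx, y + dy), ()):
                        feats.append((dt, dx, dy, c))
        return feats

    def predict_next_screen(screens, weights, colors, height, width, action, radius=1):
        # Predict one screen tile-by-tile, top-to-bottom and left-to-right (Equation 2.4).
        predicted = defaultdict(set)
        screens = dict(screens)
        screens[+1] = predicted        # already-predicted tiles feed later predictions
        for x in range(height):
            for y in range(width):
                feats = active_features(screens, x, y, radius)
                for c in colors:
                    wts = weights.get((action, c), {})
                    lam = sum(wts.get(f, 0.0) for f in feats)
                    if random.random() < logistic(lam):
                        predicted[(x, y)].add(c)
        return predicted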

Formulating predictions solely based on the current screen of gameplay neglects information about the movement of objects on the screen. In order to account for movement, the presence of colors in the neighboring tiles of the previous screen of gameplay is also factored into the prediction formulation. However, this procedure still neglects the conditional nature of objects on the screen, which suggests the predictions should be conditional as well; there is only one ball in the game of Pong, so if the ball is predicted to be present within a tile of the next screen, the probability of predicting a ball should be 0% for all subsequent tiles within the predicted next screen. The presence of colors in the already predicted neighboring tiles of the predicted next screen of gameplay is therefore also taken into consideration when formulating predictions.

The features for this convolutional logistic regression model are screen-specific, location-specific (location in the convolutional neighborhood), and color-specific. A feature is of the form \phi_{\Delta t, \Delta x, \Delta y, c}, where the feature is active if the screen at time t + \Delta t has a blob of color c in the tile at row x + \Delta x, column y + \Delta y.

Figure 7: The possible feature enumerations for a 3 tiles x 3 tiles convolutional lens, where the red tile signifies the current tile being predicted. The blue tiles signify the partial convolutional neighborhood that does not currently contain any predictions, and is thus not useful for prediction formulation.

The feature vector will only consist of the feature enumerations for the colored blobs encountered in the convolutional neighborhood within the previous screen, the current screen, and the (partially predicted) next screen. For example, the feature set would be \{\phi_{-1,-1,-1,5}\} if the only blob encountered is present in the previous screen, located in the top-left position of the convolutional neighborhood, and is the sixth color in the 9-color palette.

2.5.1 Convolutional Logistic Regression Example

The following example will demonstrate how convolutional logistic regression can generate future screens of Pong gameplay. Figure 8 demonstrates the convolutional features across time available to the convolutional lens at the tile position row 4, column 5. The left grid and the middle grid are the bottom right quarter of the previous screen and current screen of gameplay, respectively. The convolutional features provide the information that the ball was (with respect to the current position) four tiles below and two tiles to the right in the previous screen, and two tiles below and one tile to the right in the current screen. Summing the white color weights that correspond to these two features and applying the logistic function provides the probability that the ball will be predicted to be present in the current tile position within the next screen. If the probability is greater than a randomly generated number between 0 and 1, then the prediction of the next screen will contain a ball within this tile. Note that the paddle present in the previous and current screens is not visible to the convolutional lens because it is outside of the convolutional neighborhood, and therefore the weights of those features are not included in the weight summation.

Figure 8: The predicted next screen of Pong gameplay displayed above provides a snapshot of a 9 tiles x 9 tiles convolutional lens mid-iteration predicting the ball at the tile position row 4, column 5. The paddle is not displayed in the predicted next screen, because the convolutional lens has not yet crossed a tile in which it would predict the presence of the paddle.

2.6 Sparsified Feature Set

It is common to initialize the weights to 0, but if the weights are initialized to 0, the first prediction will result in the logistic probability function predicting the presence of approximately half of all the possible features in the future screen. This is demonstrated by the sigmoid calculation

\frac{e^{0}}{1 + e^{0}} = 0.5.

This is both inefficient and inaccurate, as the set of possible features is magnitudes larger than the number of features that are active in any given screen of gameplay. To counter this scaling challenge, a bias term was introduced. The bias term is a feature that is always active, so it provides the agent with information about how frequently it can expect colored blobs to be present in the next screen in general. The bias term is associated with an initially negative weight so that the agent is initially pessimistic with regard to predicting that colored blobs will be present in the next screen of gameplay. The first prediction with the bias term will result in the logistic probability function predicting there are no colored blobs in the next screen. This inaccuracy is removed as the weight of the bias term eventually converges to a value indicative of the probability of a colored blob being present in the next screen. The addition of the bias term has turned this into a more scalable problem.

To further sparsify the predicted set of active features, probability truncation has been introduced. Applying the sigmoid function to the sum of the weights of the active features will often yield a probability that is only marginally larger than zero percent. The logistic regression formula will never return a probability of exactly zero percent, so a probability marginally larger than zero percent is essentially the minimum. To reduce sigmoid computation (it would be fairly computationally expensive to use the logistic probability formula for every color in every tile) and to account for the probability never being exactly zero percent, the sum of the weights of the active features is compared to a probability threshold. If the sum of the weights of the active features is smaller than the probability threshold, then the agent will assume the probability is zero.
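The bias term and the probability truncation can be combined into a single sparsified prediction step, sketched below. The bias weight initialization and the threshold value are illustrative assumptions, and the threshold is applied to the weight sum before any sigmoid is computed, as described above.

    import math

    BIAS = "bias"                 # a feature that is always active
    INITIAL_BIAS_WEIGHT = -6.0    # pessimistic start: predict "no blob" at first (assumed value)
    SUM_THRESHOLD = -5.0          # weight sums below this skip the sigmoid entirely (assumed value)

    def sparse_probability(active_features, weights):
        # Return the predicted presence probability, truncating tiny probabilities to zero.
        total = weights.get(BIAS, INITIAL_BIAS_WEIGHT)
        total += sum(weights.get(f, 0.0) for f in active_features)
        if total < SUM_THRESHOLD:
            return 0.0            # treated as exactly zero; no sigmoid is computed
        return math.exp(total) / (1.0 + math.exp(total))

    # With untrained weights the pessimistic bias alone falls below the threshold,
    # so nothing is predicted to appear.
    print(sparse_probability([(0, -1, 0, 3)], weights={}))    # 0.0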

2.7 Blob-PROST (Reward Model)

Convolutional logistic regression provides the agent with a dynamics model of the environment, allowing the agent to predict future screens of gameplay. It does not tell the agent how advantageous or disadvantageous these future screens of gameplay may be, however. To determine the favorableness of selecting a certain action in the current screen, the agent utilizes Blob-PROST for its reward model, a feature set created by Liang et al. (2016). The Blob-PROST feature set \psi consists of all of the blobs on a screen, the pairwise relative offsets between blobs on the screen, and the pairwise relative offsets between blobs on the screen over time (from the previous screen to the current screen). This work utilizes Blob-PROST features to power an action-dependent linear reward model, where summing the active weights \theta^{R}_{a,i} (a linear combination) provides the expected reward \hat{R}(a, \psi) of selecting an action in the current screen of gameplay. The m weights are updated according to the update rule in Equation 2.8, which compares the actual reward received \rho to the expected reward \hat{R}(a, \psi), and updates the reward weights with a step size \alpha^{R}.

\hat{R}(a, \psi) = \sum_{i=0}^{m} \theta^{R}_{a,i} \, \psi_i   (2.7)

\theta^{R}_{a,i} \leftarrow \theta^{R}_{a,i} + \alpha^{R} \, (\rho - \hat{R}(a, \psi))   (2.8)
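A minimal sketch of the linear reward model of Equations 2.7 and 2.8 follows. The Blob-PROST feature set is assumed to be supplied as a collection of active feature identifiers, and the class name, step size, and example feature names are hypothetical.

    from collections import defaultdict

    class LinearRewardModel:
        # Action-dependent linear reward model over Blob-PROST features (Eqs. 2.7-2.8).

        def __init__(self, step_size=0.01):
            self.alpha_r = step_size
            self.theta_r = defaultdict(float)      # weights indexed by (action, feature)

        def expected_reward(self, action, psi):
            # Equation 2.7: sum of the active weights for this action.
            return sum(self.theta_r[(action, f)] for f in psi)

        def update(self, action, psi, actual_reward):
            # Equation 2.8: move the active weights toward the observed reward.
            error = actual_reward - self.expected_reward(action, psi)
            for f in psi:
                self.theta_r[(action, f)] += self.alpha_r * error

    # Usage: after taking action a with active Blob-PROST features psi and receiving rho points.
    model = LinearRewardModel()
    model.update(action=3, psi={"blob(white,2,5)", "offset(white,orange,+1,-2)"}, actual_reward=1.0)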

2.8 Monte Carlo Planning Algorithm

If the agent is presented with several action choices, it would prefer to make the action choice that leads to the most favorable future screen. Since the agent is able to predict future screens of gameplay and can quantify the favorableness of these future screens, it is plausible that the agent can plan for the future. Note that the favorableness of a screen should be viewed as the long-term expected reward from the state, as opposed to the immediate reward received from reaching the screen. In order to calculate the long-term expected reward of a screen, the agent uses a Monte Carlo planning algorithm.

The agent's Monte Carlo planning algorithm performs x rollouts of y random actions (where x and y are determined by the specifications) following each action choice. This means the agent will determine the best action choice by:

- Predicting the future screen following a possible action choice
- Following the action choice with y additional random action choices
- Performing the above two steps a total of x times and calculating the average long-term expected reward for the action choice
- Performing the above step for each action choice
- Choosing the action with the largest average long-term expected reward

A discount factor \gamma is applied to allow the agent to determine a weighted long-term expected reward, where less weight is placed on rewards the further in the future they may occur. Discounting is a beneficial approach because it best aligns with the agent's objective of achieving a large final score; if the game will terminate in the next few screens (which can happen for a number of reasons), it is advantageous for the agent to focus on increasing its final score before the game ends. The agent's long-term reward estimate is determined by the following equation, where a_t is the action selected at step t of a rollout and \psi_t is the corresponding Blob-PROST feature set:

\hat{M}(\rho) = \frac{1}{x} \sum_{k=1}^{x} \sum_{t=1}^{y} \gamma^{t-1} \, \hat{R}(a_t, \psi_t)   (2.9)

Figure 9 presents a simplified Monte Carlo example where the agent only has two possible actions. The average reward of the rollouts corresponding to an action represents the Monte Carlo prediction for the long-term expected reward of selecting the action. In this example, action 1 yields the larger long-term expected reward, so action 1 is more favorable.

Figure 9: A two-action agent with Monte Carlo specifications of 3 rollouts of 2 random actions. Long-term expected reward of action 1: 3. Long-term expected reward of action 2:
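The rollout procedure described above can be sketched as follows. The sketch assumes a model object exposing two illustrative methods, predict(state, action) returning the predicted next state together with its Blob-PROST features, and expected_reward(action, features) returning the reward estimate; this interface, the default discount factor, and the function name are assumptions rather than the thesis code.

    import random

    def monte_carlo_plan(state, actions, model, num_rollouts=10, rollout_length=20, gamma=0.99):
        # Choose the action with the largest average discounted rollout return (Eq. 2.9).
        best_action, best_value = None, float("-inf")
        for first_action in actions:
            total = 0.0
            for _ in range(num_rollouts):
                # Predict the screen following the candidate action choice.
                features, current = model.predict(state, first_action)
                rollout_return = model.expected_reward(first_action, features)
                # Follow the candidate with rollout_length additional random actions.
                for t in range(1, rollout_length + 1):
                    action = random.choice(actions)
                    features, current = model.predict(current, action)
                    rollout_return += (gamma ** t) * model.expected_reward(action, features)
                total += rollout_return
            value = total / num_rollouts
            if value > best_value:
                best_action, best_value = first_action, value
        return best_action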

Pong Results Using Atari 2600 Simulator

The ALE provides functionality to save the current game so that it may be loaded in the future. The saving and loading functionality makes it possible to determine how well an agent would perform in the ALE if it used the Monte Carlo planning algorithm with the Atari 2600 simulator. The Atari 2600 simulator serves as a trivially perfect model - it will always predict the future with perfect accuracy, but the decision-making process requires having the agent try every action in the real world several times in order to determine the best action (and most domains do not provide a simulator). Experiments were conducted with the trivially perfect model to gauge agent performance with the Monte Carlo planning algorithm in Pong.

The experiment consisted of 30 independent trial episodes, and the planning specifications entailed 10 rollouts of 20 random actions. The recorded performance metrics were the average, minimum, and maximum score, and the average, minimum, and maximum number of frames.

The agent was very successful at Pong; it received a high average score and never scored lower than 14 points. The agent also played the game for many frames on average. It can therefore be concluded that an agent utilizing the Monte Carlo planning

algorithm with a sufficiently accurate model can achieve success in the ALE. The ambition of this project (which has not yet been achieved) was to create a sufficiently accurate model that could be utilized by the Monte Carlo planning algorithm to achieve success in the ALE.

Asterix Results Using Atari 2600 Simulator

In order to determine whether the Monte Carlo planning algorithm could achieve flexible competence in the ALE, experiments were conducted in a different game. Asterix is a game where the player must navigate the world to avoid enemies and collect point boosts. A screen of Asterix gameplay is pictured in Figure 10, where the player sprite is the yellow character. The yellow character can move up, down, left, and right in the world. The player has three lives, and loses a life when the yellow character makes contact with a harp. Harps appear from the left or right side of the screen, and move within their row to exit the screen from the other side. The player increases their score by collecting items such as clocks and shields, which appear and move in the same fashion as harps.

Figure 10: A screen of Asterix gameplay.

The same performance metrics as in Pong were recorded for an agent playing Asterix using the Monte Carlo planning algorithm and the Atari 2600 simulator. The agent had a few low-scoring games (such as a 1450 point game), which demonstrates that the Monte Carlo planning algorithm is not perfect, even when the Atari 2600 simulator is being utilized. However, the agent was very successful at Asterix on average, receiving an incredibly high average score. The agent also played for many frames on average, which means the agent was able to avoid making contact with harps for a very long time. These results indicate that although the Monte Carlo planning algorithm is not perfect, it is a flexibly competent planning algorithm within the ALE.

Chapter 3
Improvements and Results

This chapter presents a report of iterative improvement measures and an empirical analysis of the MBRL agent's performance. The experiments are uniform in nature, while sequentially adding the extensions to the model in order to isolate the impact of each additional component. The model is still not useful for planning, but each successive version has learned an iteratively improved model with a greater prediction accuracy.

3.1 Experimental Procedure

In each experiment, thirty trials were performed, in which an agent would learn a model of the game by making random actions. Experiments were conducted on both Pong and Asterix. The initial model for Pong used a 7 tiles x 7 tiles convolutional lens and the initial model for Asterix used a 3 tiles x 3 tiles convolutional lens; the number of objects on the screen differs between the games, so the sizes of the convolutional lens were respectively chosen to maintain efficiency within each game. The dimensions of each tile were 7 pixels x 4 pixels. After 100, 500, 1000, 1500, and 3000 training episodes, the agent relied on its model to choose actions for the next 5 episodes (planning episodes). The model was paired with the Monte Carlo planning algorithm to make decisions. The Monte Carlo specification entailed 10 rollouts of 20 random actions for each action.

Model Evaluation

The five evaluation metrics, per episode, were: score, number of frames, frames per second, negative log likelihood of the dynamics model error, and mean squared error of the reward model. Score is a direct measure of performance in Atari 2600 games. Number of frames provides more clarity with respect to measuring performance when a score comparison does not reveal any information. Two agents could finish a game of Pong with -21 points each, but the agent that plays for more frames is more resistant to surrendering points, and therefore performs better than the other agent. Frames per second is a direct measure of the computational practicality of the agent, which is noteworthy to measure because a frames-per-second comparison demonstrates the computational costs of an improvement.

Negative log likelihood is a prominent measure of accuracy for logistic regression, so the negative log likelihood of the dynamics model error is recorded. The negative log likelihood (\mu) equation is

\mu = -\sum_{t} \sum_{x,y} \sum_{c} \log\!\left( \hat{M}(\tau_{t,x,y,c} \mid \phi_t, x, y, a_t) \right).   (3.1)

In this document, model error refers to the negative log likelihood of the dynamics model error. Mean squared error (\kappa) is a prominent measure for linear function approximations, so the mean squared error of the reward model is recorded. The mean squared error equation is

\kappa = \sum_{t} \left( \rho_t - \hat{R}(a_t, \psi_t) \right)^2.   (3.2)

In this document, reward error refers to the mean squared error of the reward estimates. For each time-frame of 5 planning episodes, averages of the number of frames, the negative log likelihood of the dynamics model error, the mean squared error of the reward model, and the score were recorded.
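For concreteness, the two error measures of Equations 3.1 and 3.2 can be computed as in the following sketch, where the dynamics model's per-tile, per-color probabilities and the corresponding binary outcomes are assumed to be supplied as parallel lists; the small epsilon guard is an implementation detail added here, not part of the thesis.

    import math

    def negative_log_likelihood(probabilities, outcomes, eps=1e-12):
        # Equation 3.1: minus the sum of log-probabilities the dynamics model
        # assigned to what actually happened, over every (t, x, y, c) entry.
        total = 0.0
        for p, tau in zip(probabilities, outcomes):
            p_of_actual = p if tau == 1 else 1.0 - p
            total -= math.log(max(p_of_actual, eps))    # eps guards against log(0)
        return total

    def reward_squared_error(predicted_rewards, actual_rewards):
        # Equation 3.2: squared error of the reward model, summed over time steps.
        return sum((rho - r_hat) ** 2 for rho, r_hat in zip(actual_rewards, predicted_rewards))

    print(negative_log_likelihood([0.9, 0.2], [1, 0]))    # small error: confident, correct predictions
    print(reward_squared_error([0.0, 1.0], [0.0, 0.5]))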

3.2 Initial Results

Figure 11 displays both the score vs. training episodes and the frames vs. training episodes for the planning intervals of 5 episodes after 100, 500, 1000, 1500, and 3000 training episodes. These performance metrics were evaluated on Pong. The agent received an average score between -20 and -21 at each planning interval, which demonstrates the agent is a very unsuccessful Pong player. The agent progressively played for fewer frames at each planning interval, which suggests the information the agent has learned may have led to worse play.

Figure 11: Initial performance graph for Pong

Figure 12 displays both the model error vs. training episodes and the reward error vs. training episodes for the planning intervals of 5 episodes after 100, 500, 1000, 1500, and 3000 training episodes. These error metrics were evaluated on Pong. The agent progressively reduced the reward error at each interval, and the graph suggests more training could further reduce the reward error. The agent also progressively reduced the model error at each interval, but the model error seems to have plateaued.

Figure 12: Initial error graph for Pong

Figure 13 presents the performance metrics for Asterix. The agent's average score mostly fluctuated within a low range, which is a poor score in Asterix. The number of frames the agent played also fluctuated, leading to no decisive conclusions about change in performance over time.

Figure 13: Initial performance graph for Asterix

Figure 14 presents the error metrics for Asterix. Interestingly, the reward error increased slightly over time as the agent experienced more episodes of training. The model error decreased slightly over time, reaching \mu = 3500 after 3000 training episodes.

Figure 14: Initial error graph for Asterix

3.3 Action Independent Model

Since the dynamics model is action-specific, it divides the learning data among the set of possible actions. This division of data provides the model with insight regarding how each action choice will affect the next screen of gameplay - which is very useful for the agent when choosing an action. However, this division of data makes inefficient use of the resources available, as the future positions of many of the objects on a screen do not depend on the action taken by the agent. In the game of Pong, the agent controls the movement of the player paddle, but the movement of the opposing player's paddle is entirely independent of the actions taken by the agent. The movement of the ball is mostly independent of the actions taken by the agent, with the exception being the agent choosing to position the player paddle to make contact with the ball.

To efficiently make use of the overlapping resources, a second model was created - an action independent model. The action independent model shares its information among all actions, adding a layer of generalization to the model information available to the agent. The agent uses both models to predict future screens, where the new logistic probability formula for determining the probability of the existence of a color c in the next screen can be found in Equation 3.4. An action independent weight is defined as \theta^{AI}_{c,i}, and the linear combination of active action independent weights is defined as \Lambda^{AI}_{x,y,\phi,c}.

\Lambda^{AI}_{x,y,\phi,c} = \sum_{i=0}^{n} \theta^{AI}_{c,i} \, \phi_i   (3.3)

\hat{M}(c \mid \phi, x, y, a) = \frac{e^{(\Lambda_{x,y,\phi,a,c} + \Lambda^{AI}_{x,y,\phi,c})}}{1 + e^{(\Lambda_{x,y,\phi,a,c} + \Lambda^{AI}_{x,y,\phi,c})}}   (3.4)

The combination of action dependent and action independent weights allows the model to make better use of the available information. In the context of Pong, the action independent weights will allow the model to better track the movement of the ball, whereas the action dependent weights will allow the model to understand the causal relationship between selecting an action and moving the paddle.
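A sketch of the combined probability of Equation 3.4 is given below. The two weight tables are assumed to be dictionaries (keyed by feature, and additionally by action for the dependent table), and the function name and example values are illustrative.

    import math

    def combined_probability(active_features, action, theta_action, theta_independent):
        # Equation 3.4: logistic of the sum of the action-dependent and
        # action-independent linear combinations over the active features.
        lam_action = sum(theta_action.get((action, f), 0.0) for f in active_features)
        lam_independent = sum(theta_independent.get(f, 0.0) for f in active_features)
        z = lam_action + lam_independent
        return math.exp(z) / (1.0 + math.exp(z))

    # Both tables are trained with the same update rule; the independent table simply
    # pools its updates across every action.
    p = combined_probability([(-1, 2, 1, 6), (0, 1, 0, 6)], action=2,
                             theta_action={}, theta_independent={(0, 1, 0, 6): 1.5})
    print(round(p, 3))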

3.3.1 Results with the Additional Action Independent Model

Figure 15 displays the performance metrics for Pong with the additional action independent model. The upgraded agent was still an unsuccessful Pong player, scoring between -20 and -21 at each planning interval. Similarly to the original agent, it also played for fewer frames as it experienced more training episodes.

Figure 15: Performance graph for Pong with the additional action independent model

Figure 16 displays the error metrics for Pong with the additional action independent model. The agent's reward model was not altered, and as such, it is no surprise that the reward model error did not significantly change. The upgraded agent improved its model error to almost as low as \mu = 3600 after 3000 training episodes, a slight improvement over the original agent. The upgraded model also demonstrated quicker learning, as it received a smaller model error during each planning interval.

Figure 16: Error graph for Pong with the additional action independent model

Figure 17 presents the performance metrics for Asterix with the additional action independent model.

The upgraded agent was as ineffective at scoring points as the original model. Similarly to the original agent, the number of frames the agent experienced fluctuated in a manner leading to no decisive conclusions.

Figure 17: Performance graph for Asterix with the additional action independent model

Figure 18 presents the error metrics for Asterix with the additional action independent model. The agent's reward model was not altered, nor will it be altered in any subsequent improvements, so reporting differences in reward prediction error would be superfluous. The upgraded agent's model error after 100 training episodes matched the original agent's model error after 3000 training episodes, but remained fairly constant over time.

Figure 18: Error graph for Asterix with the additional action independent model

3.4 Flip Invariant Model

The situation depicted in Figure 8 is one in which the ball has just bounced off the paddle in an upward direction. The model would learn about this specific situation, but would not be able to generalize the information it has learned to a similar situation, such as the ball bouncing off the paddle in a downward direction. Many Atari 2600 games are symmetrical in nature, and thus information can often be generalized to learn about situations that are symmetrical to the situation directly experienced. Figure 19 mirrors the situation in Figure 8 flipped across the horizontal axis, where a comparison between the two figures demonstrates that the information learned between the two situations is vertically symmetrical.

Figure 19: Mirror situation to Figure 8, only flipped across the horizontal axis (vertical flip)

To benefit from this generalization, a flip-invariant model was introduced. The flip-invariant model utilizes the symmetrical nature of Atari 2600 games to learn more information. Flip-invariant weights are trained on 4 images: the regular version, the horizontal flip version, the vertical flip version, and the both-flip (vertical and horizontal) version. Flip features can be calculated by direct conversion from the non-flip (regular) features:

\phi_{\Delta t, \Delta x, \Delta y, c} \rightarrow \{\phi_{\Delta t, \Delta x, \Delta y, c},\ \phi_{\Delta t, W - \Delta x, \Delta y, c},\ \phi_{\Delta t, \Delta x, H - \Delta y, c},\ \phi_{\Delta t, W - \Delta x, H - \Delta y, c}\},

where H is the height of the screen in tiles and W is the width of the screen in tiles. When the ball moves in one direction, the flip invariant model learns about all symmetrical situations as well. However, during prediction, only the flip-invariant weight corresponding to the regular version of the image is factored into the logistic probability formula, which can be found in Equation 3.7. A flip invariant weight is defined as \theta^{FI}_{a,c,i}, an action-independent flip invariant weight is defined as \theta^{AI,FI}_{c,i}, the linear combination of active flip invariant weights is defined as \Lambda^{FI}_{x,y,\phi,a,c}, and the linear combination of active action independent, flip invariant weights is defined as \Lambda^{AI,FI}_{x,y,\phi,c}.

\Lambda^{FI}_{x,y,\phi,a,c} = \sum_{i=0}^{n} \theta^{FI}_{a,c,i} \, \phi_i   (3.5)

\Lambda^{AI,FI}_{x,y,\phi,c} = \sum_{i=0}^{n} \theta^{AI,FI}_{c,i} \, \phi_i   (3.6)

\hat{M}(c \mid \phi, x, y, a) = \frac{e^{(\Lambda_{x,y,\phi,a,c} + \Lambda^{AI}_{x,y,\phi,c} + \Lambda^{FI}_{x,y,\phi,a,c} + \Lambda^{AI,FI}_{x,y,\phi,c})}}{1 + e^{(\Lambda_{x,y,\phi,a,c} + \Lambda^{AI}_{x,y,\phi,c} + \Lambda^{FI}_{x,y,\phi,a,c} + \Lambda^{AI,FI}_{x,y,\phi,c})}}   (3.7)
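The flip conversion above translates directly into a small helper that produces the four feature variants used to train the flip-invariant weights. The function name is an assumption, and W and H are the screen width and height in tiles.

    def flip_variants(feature, width_tiles, height_tiles):
        # Return the regular, horizontal-flip, vertical-flip, and both-flip versions of a feature.
        dt, dx, dy, c = feature
        return [
            (dt, dx, dy, c),                                   # regular
            (dt, width_tiles - dx, dy, c),                     # horizontal flip
            (dt, dx, height_tiles - dy, c),                    # vertical flip
            (dt, width_tiles - dx, height_tiles - dy, c),      # both flips
        ]

    # Training updates all four variants; prediction uses only the regular version.
    print(flip_variants((-1, 2, 1, 6), width_tiles=40, height_tiles=30))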

Results with the Additional Flip Invariant Model

Figure 20 displays the performance metrics for Pong with the additional flip invariant model. The recently upgraded agent was still an unsuccessful Pong player, scoring between -20 and -21 at each planning interval. Similarly to both previous versions of the agent, it also played for fewer frames as it experienced more training episodes.

Figure 20: Performance graph for Pong with the additional flip invariant model

Figure 21 displays the error metrics for Pong with the additional flip invariant model. The improved agent has a smaller model error at each planning interval than the previous two versions.

Figure 21: Error graph for Pong with the additional flip invariant model

Figure 22 presents the performance metrics for Asterix with the additional flip invariant model. The upgraded agent was as ineffective at scoring points as the original model. Similarly to the original agent, the number of frames the agent experienced fluctuated in a manner leading to no decisive conclusions.

Figure 22: Performance graph for Asterix with the additional flip invariant model

Figure 23 presents the error metrics for Asterix with the additional flip invariant model. The recently upgraded agent's model error after 100 training episodes matched the previous agent's model error after 3000 training episodes, but remained constant over time.

Figure 23: Error graph for Asterix with the additional flip invariant model

3.5 Pairwise Features

One of the most important concepts in Atari 2600 games is the spatial relationship between objects on the screen. It is important to know where the agent avatar is in relation to objects that are beneficial to the agent, such as a point boost or power-up, and objects that are harmful to the agent, such as an enemy. Another very important concept is the spatial relationship between objects in time, as this provides information about the velocity of the objects in a game. In Figure 19, the ball was one tile below and one tile to the right of the paddle in the previous screen, and is three tiles below and two tiles to the right of the paddle (which has not moved) in the current screen. These two screens of gameplay suggest the ball has a velocity of two tiles downward and one tile rightward per screen. This information regarding the velocity of the ball suggests the position of the ball in the next screen will be two tiles downward and one tile rightward from its position in the current screen.

Pairwise features have been added to document the relationship between objects in space and time. There is a pairwise feature for every possible pairing of non-pairwise features. Pairwise feature (\xi) calculation is of the form

\{\phi_{\Delta t_1, \Delta x_1, \Delta y_1, c_1},\ \phi_{\Delta t_2, \Delta x_2, \Delta y_2, c_2}\} \rightarrow \xi_{\Delta t_1, \Delta t_2, \Delta x_1, \Delta x_2, \Delta y_1, \Delta y_2, c_1, c_2}.

This inclusion allows the model to be more expressive by drawing conclusions based on temporal and spatial relationships. In accordance with the regular non-pairwise features, flipped pairwise features are calculated from the regular pairwise features using the following protocol:

\xi_{\Delta t_1, \Delta t_2, \Delta x_1, \Delta x_2, \Delta y_1, \Delta y_2, c_1, c_2} \rightarrow \{\xi_{\Delta t_1, \Delta t_2, \Delta x_1, \Delta x_2, \Delta y_1, \Delta y_2, c_1, c_2},\ \xi_{\Delta t_1, \Delta t_2, W - \Delta x_1, W - \Delta x_2, \Delta y_1, \Delta y_2, c_1, c_2},\ \xi_{\Delta t_1, \Delta t_2, \Delta x_1, \Delta x_2, H - \Delta y_1, H - \Delta y_2, c_1, c_2},\ \xi_{\Delta t_1, \Delta t_2, W - \Delta x_1, W - \Delta x_2, H - \Delta y_1, H - \Delta y_2, c_1, c_2}\}
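The pairwise construction can be sketched as follows, with features assumed to be (Δt, Δx, Δy, c) tuples as in Section 2.5; forming every pairing of active non-pairwise features is expressed here with itertools.combinations, and the example color indices are hypothetical.

    from itertools import combinations

    def pairwise_features(active_features):
        # Form one pairwise feature xi for every pair of active non-pairwise features.
        pairs = []
        for (t1, x1, y1, c1), (t2, x2, y2, c2) in combinations(sorted(active_features), 2):
            pairs.append((t1, t2, x1, x2, y1, y2, c1, c2))
        return pairs

    # Example: the ball (color 6) seen relative to the paddle (color 4) in the previous
    # and current screens yields pairwise features that encode its velocity.
    print(pairwise_features([(-1, 4, 2, 6), (0, 2, 1, 6), (0, 0, 0, 4)]))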

Results with the Additional Pairwise Features

Figure 24 displays the performance metrics for Pong with the additional pairwise features. The recently upgraded agent was still an unsuccessful Pong player, scoring between -20 and -21 at each planning interval. However, it seems the agent has learned to play more frames over time, which suggests a small improvement in play style.

Figure 24: Performance graph for Pong with the additional pairwise features

Figure 25 displays the error metrics for Pong with the additional pairwise features. The recently upgraded agent has received a significantly smaller model error of around \mu = 3000 at each planning interval.


More information

Techniques for Generating Sudoku Instances

Techniques for Generating Sudoku Instances Chapter Techniques for Generating Sudoku Instances Overview Sudoku puzzles become worldwide popular among many players in different intellectual levels. In this chapter, we are going to discuss different

More information

Reinforcement Learning Agent for Scrolling Shooter Game

Reinforcement Learning Agent for Scrolling Shooter Game Reinforcement Learning Agent for Scrolling Shooter Game Peng Yuan (pengy@stanford.edu) Yangxin Zhong (yangxin@stanford.edu) Zibo Gong (zibo@stanford.edu) 1 Introduction and Task Definition 1.1 Game Agent

More information

Developing Frogger Player Intelligence Using NEAT and a Score Driven Fitness Function

Developing Frogger Player Intelligence Using NEAT and a Score Driven Fitness Function Developing Frogger Player Intelligence Using NEAT and a Score Driven Fitness Function Davis Ancona and Jake Weiner Abstract In this report, we examine the plausibility of implementing a NEAT-based solution

More information

Game Specific Approaches to Monte Carlo Tree Search for Dots and Boxes

Game Specific Approaches to Monte Carlo Tree Search for Dots and Boxes Western Kentucky University TopSCHOLAR Honors College Capstone Experience/Thesis Projects Honors College at WKU 6-28-2017 Game Specific Approaches to Monte Carlo Tree Search for Dots and Boxes Jared Prince

More information

COMP3211 Project. Artificial Intelligence for Tron game. Group 7. Chiu Ka Wa ( ) Chun Wai Wong ( ) Ku Chun Kit ( )

COMP3211 Project. Artificial Intelligence for Tron game. Group 7. Chiu Ka Wa ( ) Chun Wai Wong ( ) Ku Chun Kit ( ) COMP3211 Project Artificial Intelligence for Tron game Group 7 Chiu Ka Wa (20369737) Chun Wai Wong (20265022) Ku Chun Kit (20123470) Abstract Tron is an old and popular game based on a movie of the same

More information

AI Plays Yun Nie (yunn), Wenqi Hou (wenqihou), Yicheng An (yicheng)

AI Plays Yun Nie (yunn), Wenqi Hou (wenqihou), Yicheng An (yicheng) AI Plays 2048 Yun Nie (yunn), Wenqi Hou (wenqihou), Yicheng An (yicheng) Abstract The strategy game 2048 gained great popularity quickly. Although it is easy to play, people cannot win the game easily,

More information

Reinforcement Learning in Games Autonomous Learning Systems Seminar

Reinforcement Learning in Games Autonomous Learning Systems Seminar Reinforcement Learning in Games Autonomous Learning Systems Seminar Matthias Zöllner Intelligent Autonomous Systems TU-Darmstadt zoellner@rbg.informatik.tu-darmstadt.de Betreuer: Gerhard Neumann Abstract

More information

Traffic Control for a Swarm of Robots: Avoiding Group Conflicts

Traffic Control for a Swarm of Robots: Avoiding Group Conflicts Traffic Control for a Swarm of Robots: Avoiding Group Conflicts Leandro Soriano Marcolino and Luiz Chaimowicz Abstract A very common problem in the navigation of robotic swarms is when groups of robots

More information

Contents. List of Figures

Contents. List of Figures 1 Contents 1 Introduction....................................... 3 1.1 Rules of the game............................... 3 1.2 Complexity of the game............................ 4 1.3 History of self-learning

More information

CSC321 Lecture 23: Go

CSC321 Lecture 23: Go CSC321 Lecture 23: Go Roger Grosse Roger Grosse CSC321 Lecture 23: Go 1 / 21 Final Exam Friday, April 20, 9am-noon Last names A Y: Clara Benson Building (BN) 2N Last names Z: Clara Benson Building (BN)

More information

CS 229 Final Project: Using Reinforcement Learning to Play Othello

CS 229 Final Project: Using Reinforcement Learning to Play Othello CS 229 Final Project: Using Reinforcement Learning to Play Othello Kevin Fry Frank Zheng Xianming Li ID: kfry ID: fzheng ID: xmli 16 December 2016 Abstract We built an AI that learned to play Othello.

More information

CS221 Project Final Report Gomoku Game Agent

CS221 Project Final Report Gomoku Game Agent CS221 Project Final Report Gomoku Game Agent Qiao Tan qtan@stanford.edu Xiaoti Hu xiaotihu@stanford.edu 1 Introduction Gomoku, also know as five-in-a-row, is a strategy board game which is traditionally

More information

Othello/Reversi using Game Theory techniques Parth Parekh Urjit Singh Bhatia Kushal Sukthankar

Othello/Reversi using Game Theory techniques Parth Parekh Urjit Singh Bhatia Kushal Sukthankar Othello/Reversi using Game Theory techniques Parth Parekh Urjit Singh Bhatia Kushal Sukthankar Othello Rules Two Players (Black and White) 8x8 board Black plays first Every move should Flip over at least

More information

Monte Carlo Tree Search

Monte Carlo Tree Search Monte Carlo Tree Search 1 By the end, you will know Why we use Monte Carlo Search Trees The pros and cons of MCTS How it is applied to Super Mario Brothers and Alpha Go 2 Outline I. Pre-MCTS Algorithms

More information

Game Mechanics Minesweeper is a game in which the player must correctly deduce the positions of

Game Mechanics Minesweeper is a game in which the player must correctly deduce the positions of Table of Contents Game Mechanics...2 Game Play...3 Game Strategy...4 Truth...4 Contrapositive... 5 Exhaustion...6 Burnout...8 Game Difficulty... 10 Experiment One... 12 Experiment Two...14 Experiment Three...16

More information

Reinforcement Learning Simulations and Robotics

Reinforcement Learning Simulations and Robotics Reinforcement Learning Simulations and Robotics Models Partially observable noise in sensors Policy search methods rather than value functionbased approaches Isolate key parameters by choosing an appropriate

More information

ROBOCODE PROJECT AIBOT - MARKOV MODEL DRIVEN AIMING COMBINED WITH Q LEARNING FOR MOVEMENT

ROBOCODE PROJECT AIBOT - MARKOV MODEL DRIVEN AIMING COMBINED WITH Q LEARNING FOR MOVEMENT ROBOCODE PROJECT AIBOT - MARKOV MODEL DRIVEN AIMING COMBINED WITH Q LEARNING FOR MOVEMENT PATRICK HALUPTZOK, XU MIAO Abstract. In this paper the development of a robot controller for Robocode is discussed.

More information

Temporal-Difference Learning in Self-Play Training

Temporal-Difference Learning in Self-Play Training Temporal-Difference Learning in Self-Play Training Clifford Kotnik Jugal Kalita University of Colorado at Colorado Springs, Colorado Springs, Colorado 80918 CLKOTNIK@ATT.NET KALITA@EAS.UCCS.EDU Abstract

More information

I.M.O. Winter Training Camp 2008: Invariants and Monovariants

I.M.O. Winter Training Camp 2008: Invariants and Monovariants I.M.. Winter Training Camp 2008: Invariants and Monovariants n math contests, you will often find yourself trying to analyze a process of some sort. For example, consider the following two problems. Sample

More information

Tile Number and Space-Efficient Knot Mosaics

Tile Number and Space-Efficient Knot Mosaics Tile Number and Space-Efficient Knot Mosaics Aaron Heap and Douglas Knowles arxiv:1702.06462v1 [math.gt] 21 Feb 2017 February 22, 2017 Abstract In this paper we introduce the concept of a space-efficient

More information

Image Extraction using Image Mining Technique

Image Extraction using Image Mining Technique IOSR Journal of Engineering (IOSRJEN) e-issn: 2250-3021, p-issn: 2278-8719 Vol. 3, Issue 9 (September. 2013), V2 PP 36-42 Image Extraction using Image Mining Technique Prof. Samir Kumar Bandyopadhyay,

More information

DeepMind Self-Learning Atari Agent

DeepMind Self-Learning Atari Agent DeepMind Self-Learning Atari Agent Human-level control through deep reinforcement learning Nature Vol 518, Feb 26, 2015 The Deep Mind of Demis Hassabis Backchannel / Medium.com interview with David Levy

More information

Analyzing the Impact of Knowledge and Search in Monte Carlo Tree Search in Go

Analyzing the Impact of Knowledge and Search in Monte Carlo Tree Search in Go Analyzing the Impact of Knowledge and Search in Monte Carlo Tree Search in Go Farhad Haqiqat and Martin Müller University of Alberta Edmonton, Canada Contents Motivation and research goals Feature Knowledge

More information

Blur Detection for Historical Document Images

Blur Detection for Historical Document Images Blur Detection for Historical Document Images Ben Baker FamilySearch bakerb@familysearch.org ABSTRACT FamilySearch captures millions of digital images annually using digital cameras at sites throughout

More information

Comparison of Monte Carlo Tree Search Methods in the Imperfect Information Card Game Cribbage

Comparison of Monte Carlo Tree Search Methods in the Imperfect Information Card Game Cribbage Comparison of Monte Carlo Tree Search Methods in the Imperfect Information Card Game Cribbage Richard Kelly and David Churchill Computer Science Faculty of Science Memorial University {richard.kelly, dchurchill}@mun.ca

More information

Training a Neural Network for Checkers

Training a Neural Network for Checkers Training a Neural Network for Checkers Daniel Boonzaaier Supervisor: Adiel Ismail June 2017 Thesis presented in fulfilment of the requirements for the degree of Bachelor of Science in Honours at the University

More information

AI Approaches to Ultimate Tic-Tac-Toe

AI Approaches to Ultimate Tic-Tac-Toe AI Approaches to Ultimate Tic-Tac-Toe Eytan Lifshitz CS Department Hebrew University of Jerusalem, Israel David Tsurel CS Department Hebrew University of Jerusalem, Israel I. INTRODUCTION This report is

More information

Summary Overview of Topics in Econ 30200b: Decision theory: strong and weak domination by randomized strategies, domination theorem, expected utility

Summary Overview of Topics in Econ 30200b: Decision theory: strong and weak domination by randomized strategies, domination theorem, expected utility Summary Overview of Topics in Econ 30200b: Decision theory: strong and weak domination by randomized strategies, domination theorem, expected utility theorem (consistent decisions under uncertainty should

More information

TRIAL-BASED HEURISTIC TREE SEARCH FOR FINITE HORIZON MDPS. Thomas Keller and Malte Helmert Presented by: Ryan Berryhill

TRIAL-BASED HEURISTIC TREE SEARCH FOR FINITE HORIZON MDPS. Thomas Keller and Malte Helmert Presented by: Ryan Berryhill TRIAL-BASED HEURISTIC TREE SEARCH FOR FINITE HORIZON MDPS Thomas Keller and Malte Helmert Presented by: Ryan Berryhill Outline Motivation Background THTS framework THTS algorithms Results Motivation Advances

More information

Why Should We Care? More importantly, it is easy to lie or deceive people with bad plots

Why Should We Care? More importantly, it is easy to lie or deceive people with bad plots Elementary Plots Why Should We Care? Everyone uses plotting But most people ignore or are unaware of simple principles Default plotting tools (or default settings) are not always the best More importantly,

More information

46.1 Introduction. Foundations of Artificial Intelligence Introduction MCTS in AlphaGo Neural Networks. 46.

46.1 Introduction. Foundations of Artificial Intelligence Introduction MCTS in AlphaGo Neural Networks. 46. Foundations of Artificial Intelligence May 30, 2016 46. AlphaGo and Outlook Foundations of Artificial Intelligence 46. AlphaGo and Outlook Thomas Keller Universität Basel May 30, 2016 46.1 Introduction

More information

Chapter 4 SPEECH ENHANCEMENT

Chapter 4 SPEECH ENHANCEMENT 44 Chapter 4 SPEECH ENHANCEMENT 4.1 INTRODUCTION: Enhancement is defined as improvement in the value or Quality of something. Speech enhancement is defined as the improvement in intelligibility and/or

More information

FreeCiv Learner: A Machine Learning Project Utilizing Genetic Algorithms

FreeCiv Learner: A Machine Learning Project Utilizing Genetic Algorithms FreeCiv Learner: A Machine Learning Project Utilizing Genetic Algorithms Felix Arnold, Bryan Horvat, Albert Sacks Department of Computer Science Georgia Institute of Technology Atlanta, GA 30318 farnold3@gatech.edu

More information

Deep Learning for Autonomous Driving

Deep Learning for Autonomous Driving Deep Learning for Autonomous Driving Shai Shalev-Shwartz Mobileye IMVC dimension, March, 2016 S. Shalev-Shwartz is also affiliated with The Hebrew University Shai Shalev-Shwartz (MobilEye) DL for Autonomous

More information

CSE548, AMS542: Analysis of Algorithms, Fall 2016 Date: Sep 25. Homework #1. ( Due: Oct 10 ) Figure 1: The laser game.

CSE548, AMS542: Analysis of Algorithms, Fall 2016 Date: Sep 25. Homework #1. ( Due: Oct 10 ) Figure 1: The laser game. CSE548, AMS542: Analysis of Algorithms, Fall 2016 Date: Sep 25 Homework #1 ( Due: Oct 10 ) Figure 1: The laser game. Task 1. [ 60 Points ] Laser Game Consider the following game played on an n n board,

More information

Monte Carlo tree search techniques in the game of Kriegspiel

Monte Carlo tree search techniques in the game of Kriegspiel Monte Carlo tree search techniques in the game of Kriegspiel Paolo Ciancarini and Gian Piero Favini University of Bologna, Italy 22 IJCAI, Pasadena, July 2009 Agenda Kriegspiel as a partial information

More information

Laboratory 1: Uncertainty Analysis

Laboratory 1: Uncertainty Analysis University of Alabama Department of Physics and Astronomy PH101 / LeClair May 26, 2014 Laboratory 1: Uncertainty Analysis Hypothesis: A statistical analysis including both mean and standard deviation can

More information

Nonuniform multi level crossing for signal reconstruction

Nonuniform multi level crossing for signal reconstruction 6 Nonuniform multi level crossing for signal reconstruction 6.1 Introduction In recent years, there has been considerable interest in level crossing algorithms for sampling continuous time signals. Driven

More information

Computing Science (CMPUT) 496

Computing Science (CMPUT) 496 Computing Science (CMPUT) 496 Search, Knowledge, and Simulations Martin Müller Department of Computing Science University of Alberta mmueller@ualberta.ca Winter 2017 Part IV Knowledge 496 Today - Mar 9

More information

Why Should We Care? Everyone uses plotting But most people ignore or are unaware of simple principles Default plotting tools are not always the best

Why Should We Care? Everyone uses plotting But most people ignore or are unaware of simple principles Default plotting tools are not always the best Elementary Plots Why Should We Care? Everyone uses plotting But most people ignore or are unaware of simple principles Default plotting tools are not always the best More importantly, it is easy to lie

More information

Playing Atari Games with Deep Reinforcement Learning

Playing Atari Games with Deep Reinforcement Learning Playing Atari Games with Deep Reinforcement Learning 1 Playing Atari Games with Deep Reinforcement Learning Varsha Lalwani (varshajn@iitk.ac.in) Masare Akshay Sunil (amasare@iitk.ac.in) IIT Kanpur CS365A

More information

TTIC 31230, Fundamentals of Deep Learning David McAllester, April AlphaZero

TTIC 31230, Fundamentals of Deep Learning David McAllester, April AlphaZero TTIC 31230, Fundamentals of Deep Learning David McAllester, April 2017 AlphaZero 1 AlphaGo Fan (October 2015) AlphaGo Defeats Fan Hui, European Go Champion. 2 AlphaGo Lee (March 2016) 3 AlphaGo Zero vs.

More information

Decision Making in Multiplayer Environments Application in Backgammon Variants

Decision Making in Multiplayer Environments Application in Backgammon Variants Decision Making in Multiplayer Environments Application in Backgammon Variants PhD Thesis by Nikolaos Papahristou AI researcher Department of Applied Informatics Thessaloniki, Greece Contributions Expert

More information

Kalman Tracking and Bayesian Detection for Radar RFI Blanking

Kalman Tracking and Bayesian Detection for Radar RFI Blanking Kalman Tracking and Bayesian Detection for Radar RFI Blanking Weizhen Dong, Brian D. Jeffs Department of Electrical and Computer Engineering Brigham Young University J. Richard Fisher National Radio Astronomy

More information

A Quick Guide to Understanding the Impact of Test Time on Estimation of Mean Time Between Failure (MTBF)

A Quick Guide to Understanding the Impact of Test Time on Estimation of Mean Time Between Failure (MTBF) A Quick Guide to Understanding the Impact of Test Time on Estimation of Mean Time Between Failure (MTBF) Authored by: Lenny Truett, Ph.D. STAT T&E COE The goal of the STAT T&E COE is to assist in developing

More information

Training a Minesweeper Solver

Training a Minesweeper Solver Training a Minesweeper Solver Luis Gardea, Griffin Koontz, Ryan Silva CS 229, Autumn 25 Abstract Minesweeper, a puzzle game introduced in the 96 s, requires spatial awareness and an ability to work with

More information

Universiteit Leiden Opleiding Informatica

Universiteit Leiden Opleiding Informatica Universiteit Leiden Opleiding Informatica Predicting the Outcome of the Game Othello Name: Simone Cammel Date: August 31, 2015 1st supervisor: 2nd supervisor: Walter Kosters Jeannette de Graaf BACHELOR

More information

Experiments on Alternatives to Minimax

Experiments on Alternatives to Minimax Experiments on Alternatives to Minimax Dana Nau University of Maryland Paul Purdom Indiana University April 23, 1993 Chun-Hung Tzeng Ball State University Abstract In the field of Artificial Intelligence,

More information

USING A FUZZY LOGIC CONTROL SYSTEM FOR AN XPILOT COMBAT AGENT ANDREW HUBLEY AND GARY PARKER

USING A FUZZY LOGIC CONTROL SYSTEM FOR AN XPILOT COMBAT AGENT ANDREW HUBLEY AND GARY PARKER World Automation Congress 21 TSI Press. USING A FUZZY LOGIC CONTROL SYSTEM FOR AN XPILOT COMBAT AGENT ANDREW HUBLEY AND GARY PARKER Department of Computer Science Connecticut College New London, CT {ahubley,

More information

Dynamic Programming in Real Life: A Two-Person Dice Game

Dynamic Programming in Real Life: A Two-Person Dice Game Mathematical Methods in Operations Research 2005 Special issue in honor of Arie Hordijk Dynamic Programming in Real Life: A Two-Person Dice Game Henk Tijms 1, Jan van der Wal 2 1 Department of Econometrics,

More information

Chapter 3: Alarm correlation

Chapter 3: Alarm correlation Chapter 3: Alarm correlation Algorithmic Methods of Data Mining, Fall 2005, Chapter 3: Alarm correlation 1 Part II. Episodes in sequences Chapter 3: Alarm correlation Chapter 4: Frequent episodes Chapter

More information

More on games (Ch )

More on games (Ch ) More on games (Ch. 5.4-5.6) Alpha-beta pruning Previously on CSci 4511... We talked about how to modify the minimax algorithm to prune only bad searches (i.e. alpha-beta pruning) This rule of checking

More information

Why Randomize? Jim Berry Cornell University

Why Randomize? Jim Berry Cornell University Why Randomize? Jim Berry Cornell University Session Overview I. Basic vocabulary for impact evaluation II. III. IV. Randomized evaluation Other methods of impact evaluation Conclusions J-PAL WHY RANDOMIZE

More information

Adaptive Antennas in Wireless Communication Networks

Adaptive Antennas in Wireless Communication Networks Bulgarian Academy of Sciences Adaptive Antennas in Wireless Communication Networks Blagovest Shishkov Institute of Mathematics and Informatics Bulgarian Academy of Sciences 1 introducing myself Blagovest

More information

Artificial Intelligence. Minimax and alpha-beta pruning

Artificial Intelligence. Minimax and alpha-beta pruning Artificial Intelligence Minimax and alpha-beta pruning In which we examine the problems that arise when we try to plan ahead to get the best result in a world that includes a hostile agent (other agent

More information

Game Theory and Randomized Algorithms

Game Theory and Randomized Algorithms Game Theory and Randomized Algorithms Guy Aridor Game theory is a set of tools that allow us to understand how decisionmakers interact with each other. It has practical applications in economics, international

More information

Plan Execution Monitoring through Detection of Unmet Expectations about Action Outcomes

Plan Execution Monitoring through Detection of Unmet Expectations about Action Outcomes Plan Execution Monitoring through Detection of Unmet Expectations about Action Outcomes Juan Pablo Mendoza 1, Manuela Veloso 2 and Reid Simmons 3 Abstract Modeling the effects of actions based on the state

More information

Defense Technical Information Center Compilation Part Notice

Defense Technical Information Center Compilation Part Notice UNCLASSIFIED Defense Technical Information Center Compilation Part Notice ADPO 11345 TITLE: Measurement of the Spatial Frequency Response [SFR] of Digital Still-Picture Cameras Using a Modified Slanted

More information

Predicting outcomes of professional DotA 2 matches

Predicting outcomes of professional DotA 2 matches Predicting outcomes of professional DotA 2 matches Petra Grutzik Joe Higgins Long Tran December 16, 2017 Abstract We create a model to predict the outcomes of professional DotA 2 (Defense of the Ancients

More information

Advanced Techniques for Mobile Robotics Location-Based Activity Recognition

Advanced Techniques for Mobile Robotics Location-Based Activity Recognition Advanced Techniques for Mobile Robotics Location-Based Activity Recognition Wolfram Burgard, Cyrill Stachniss, Kai Arras, Maren Bennewitz Activity Recognition Based on L. Liao, D. J. Patterson, D. Fox,

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Deep Learning Barnabás Póczos Credits Many of the pictures, results, and other materials are taken from: Ruslan Salakhutdinov Joshua Bengio Geoffrey Hinton Yann LeCun 2

More information

Chapter 3 Novel Digital-to-Analog Converter with Gamma Correction for On-Panel Data Driver

Chapter 3 Novel Digital-to-Analog Converter with Gamma Correction for On-Panel Data Driver Chapter 3 Novel Digital-to-Analog Converter with Gamma Correction for On-Panel Data Driver 3.1 INTRODUCTION As last chapter description, we know that there is a nonlinearity relationship between luminance

More information

Image Enhancement in Spatial Domain

Image Enhancement in Spatial Domain Image Enhancement in Spatial Domain 2 Image enhancement is a process, rather a preprocessing step, through which an original image is made suitable for a specific application. The application scenarios

More information

Comp 3211 Final Project - Poker AI

Comp 3211 Final Project - Poker AI Comp 3211 Final Project - Poker AI Introduction Poker is a game played with a standard 52 card deck, usually with 4 to 8 players per game. During each hand of poker, players are dealt two cards and must

More information

arxiv: v1 [cs.cc] 21 Jun 2017

arxiv: v1 [cs.cc] 21 Jun 2017 Solving the Rubik s Cube Optimally is NP-complete Erik D. Demaine Sarah Eisenstat Mikhail Rudoy arxiv:1706.06708v1 [cs.cc] 21 Jun 2017 Abstract In this paper, we prove that optimally solving an n n n Rubik

More information

IMAGE ENHANCEMENT IN SPATIAL DOMAIN

IMAGE ENHANCEMENT IN SPATIAL DOMAIN A First Course in Machine Vision IMAGE ENHANCEMENT IN SPATIAL DOMAIN By: Ehsan Khoramshahi Definitions The principal objective of enhancement is to process an image so that the result is more suitable

More information

Creating a Poker Playing Program Using Evolutionary Computation

Creating a Poker Playing Program Using Evolutionary Computation Creating a Poker Playing Program Using Evolutionary Computation Simon Olsen and Rob LeGrand, Ph.D. Abstract Artificial intelligence is a rapidly expanding technology. We are surrounded by technology that

More information

Introduction. Chapter Time-Varying Signals

Introduction. Chapter Time-Varying Signals Chapter 1 1.1 Time-Varying Signals Time-varying signals are commonly observed in the laboratory as well as many other applied settings. Consider, for example, the voltage level that is present at a specific

More information

Alternation in the repeated Battle of the Sexes

Alternation in the repeated Battle of the Sexes Alternation in the repeated Battle of the Sexes Aaron Andalman & Charles Kemp 9.29, Spring 2004 MIT Abstract Traditional game-theoretic models consider only stage-game strategies. Alternation in the repeated

More information

Programming an Othello AI Michael An (man4), Evan Liang (liange)

Programming an Othello AI Michael An (man4), Evan Liang (liange) Programming an Othello AI Michael An (man4), Evan Liang (liange) 1 Introduction Othello is a two player board game played on an 8 8 grid. Players take turns placing stones with their assigned color (black

More information

CS440/ECE448 Lecture 11: Stochastic Games, Stochastic Search, and Learned Evaluation Functions

CS440/ECE448 Lecture 11: Stochastic Games, Stochastic Search, and Learned Evaluation Functions CS440/ECE448 Lecture 11: Stochastic Games, Stochastic Search, and Learned Evaluation Functions Slides by Svetlana Lazebnik, 9/2016 Modified by Mark Hasegawa Johnson, 9/2017 Types of game environments Perfect

More information

5.4 Imperfect, Real-Time Decisions

5.4 Imperfect, Real-Time Decisions 116 5.4 Imperfect, Real-Time Decisions Searching through the whole (pruned) game tree is too inefficient for any realistic game Moves must be made in a reasonable amount of time One has to cut off the

More information