Model-Based Reinforcement Learning in Atari 2600 Games


Model-Based Reinforcement Learning in Atari 2600 Games

Daniel John Foley
Research Adviser: Erik Talvitie

A thesis presented for honors within Computer Science on May 15th, 2017
Franklin & Marshall College
Bachelor of Arts, May 13th, 2017

Chapter 1
Introduction

Imagine playing a video game for the first time, with no prior knowledge of the objective of the game or of the mapping between controller buttons and gameplay mechanics. The ideal is to formulate an optimal strategy in pursuit of the maximum final score, but no information is available at the start of gameplay. The general interaction between controller buttons and gameplay mechanics can be quickly learned (e.g., Up button = jump) by observing causal relationships. After pressing a button several times and witnessing its effects, it is possible to predict the future outcome of pressing the button. This is extremely useful because it is then possible to determine which button should be pressed in a given scenario to receive the largest final score. This type of learning is called reinforcement learning; actions that lead to favorable outcomes are encouraged, whereas actions that lead to less favorable outcomes are discouraged.

Humans are able to develop an understanding of the gameplay dynamics over time through gameplay interaction. This understanding is called a model, which allows humans to predict the outcome of pressing a button. Due to the complicated nature of many games, humans usually create an imperfect model, and thus may make inaccurate predictions. Despite utilizing an imperfect model of the gameplay dynamics, humans are remarkably skilled video game players. This work will focus on the creation of a software agent that learns how to play video games by developing an imperfect model of the gameplay dynamics. The objective of this research is not to solve a video game problem, but to explore model-based decision-making in a novel and challenging environment - a topic that is fairly underrepresented in the current AI literature.

1.1 The Arcade Learning Environment

The Arcade Learning Environment (ALE) is a platform for evaluating the success of AI programs in achieving flexible competence across a variety of problems. The ALE is composed of over 50 Atari 2600 games, each with its own objective, controller mapping, reward system, and gameplay objects. The wide variety of tasks associated with the ALE makes it an appropriate platform for evaluating non-domain-specific AI technology. The ALE serves as the challenge problem that will be explored in this thesis.

The ALE was purposefully designed as a premier challenge problem in the AI community so that it may serve as a microcosm of general AI. Bellemare et al. (2013a) created the ALE with the intention that it would be an achievable stepping stone [for general AI competency] and formidable enough to require new technological breakthroughs. The ALE has inspired novel techniques and approaches, and still serves as a prominent subject of active research.

The ALE interface provides functionality to observe and interact with the gameplay environment of Atari 2600 games. The ALE allows an agent to select actions from an action set of 18 joystick movement and button combinations, and then updates the respective gameplay world accordingly. An AI agent operating in the ALE has access to the current pixel-level representation of a screen of gameplay, as well as the points received after selecting an action.

Although the functionality of the ALE provides some useful information to the agent, there is a significant lack of environmental information available to it. An agent can observe the current pixel-level representation of a screen of gameplay, but the gameplay environment is still only partially observable to the agent, making it a sufficiently challenging task environment. For example, a single observation does not reveal the velocity of [objects on the screen] (Hausknecht and Stone, 2015), nor does it denote the location and time at which an object will enter the screen in the near future. An agent in a partially observable environment can only act based on the limited information it has available to it.

Atari 2600 games are deterministic in nature, so a perfect model of the gameplay dynamics would be able to predict the future with 100% accuracy. Despite this deterministic nature, it is very difficult to develop a perfect model of the dynamics due to the incredibly large number of possible gameplay situations, and therefore humans usually only develop an approximate model. Humans often believe deterministic Atari 2600 games are stochastic due to their approximate model of the high-dimensional game environments. This work will present the application of a stochastic model to the deterministic ALE.

1.2 Reinforcement Learning

The agent will learn a stochastic model of the ALE through reinforcement learning. Reinforcement learning (RL) is a type of learning that associates actions with a positive or negative reward signal. A RL agent in the ALE is able to determine which action should be taken, given the current situation, by selecting the action that will lead to the largest final score. A RL agent must discover which actions yield the most reward by trying them (Sutton and Barto, 1998), and this direct experience with the world shapes the agent's policy. A policy is an agent's master strategy - for any given situation, the agent should determine the most favorable action.

In model-based reinforcement learning (MBRL), an agent constructs an internal representation of the mechanics of the world based on its past experiences, and uses this information to make predictions about the future environment. MBRL agents develop a model of the dynamics of the world and the rewards associated with situations. Access to a model allows an agent to envision the long-term effects of taking a given action. This allows an agent to plan, which is to determine the best possible action to take in a given situation considering the long-term potential future outcomes. It is important to note that there are no published results of a MBRL agent performing well in the ALE.

In model-free reinforcement learning (MFRL), an agent does not attempt to create a predictive model of the dynamics of the world, but instead evaluates the quality of selecting each action in a given situation. MFRL agents are less computationally expensive than MBRL agents, but do not utilize available information that could be used to learn about the dynamics of the world. MFRL agents determine the best action to select solely based on the current situation. Despite having a short-term focus, several MFRL agents have performed well in the ALE.

1.3 Related Work

Several researchers have studied MFRL agents in the ALE, leading to the discovery of fairly successful MFRL approaches.

The main challenge is to construct a practical and useful representation of a screen of gameplay. Bellemare et al. (2013a) created five different screen representations to power a common MFRL algorithm named SARSA. Through empirical analysis, they demonstrated that learning progress is already possible in Atari 2600 games. Their work was the first to demonstrate a level of MFRL success within the ALE, and it would inspire future researchers to continue exploring the potential of MFRL in the ALE.

The success of a fairly simple method would inspire more sophisticated approaches. Mnih et al. (2015) trained a deep convolutional neural network to associate selecting an action in a screen of gameplay with a reward signal. They established the state-of-the-art results for performance in the ALE. Although their approach was very successful, it was also computationally expensive. Liang et al. (2016) sought to create a computationally practical MFRL agent that would achieve similar performance to the state-of-the-art results. This led to the creation of the Blob-PROST screen representation (which will be used in this work). SARSA coupled with Blob-PROST was able to attain similar performance to the state-of-the-art results while remaining computationally practical.

Despite the successes of MFRL approaches, MFRL has several disadvantages. Consider the game Montezuma's Revenge.

Figure 1: A screen of Montezuma's Revenge gameplay.

The objective of the first level is to obtain the key and open one of the doors (the two dashed lines that are one level above the top of the left and right ladders). The agent only receives points when the key is obtained or one of the doors is unlocked. From the original starting position, there is a fairly long and specific action sequence that needs to be followed to obtain the key. A MFRL agent is seriously disadvantaged in this context, as the agent would not learn anything about the relative quality of actions until it receives points by randomly obtaining the key, which is a very difficult task to complete randomly. When the MFRL agent does obtain the key, it would only learn which action it should select when it is one action away from reaching the key. Even in the absence of points, a MBRL agent would learn about the dynamics of the world, and could even learn how to reach the position where it is one action away from reaching the key. In addition, the agent would be aware of locations on the screen it has not visited, and could intentionally explore those locations. This encouraged exploration could lead to an informed search process, which is certainly more effective than searching randomly.

Several researchers have explored model-learning in the ALE, which entails training a model to be able to predict future screens of gameplay. Bellemare et al. (2013b) divided Atari 2600 games into different segments and applied Bayesian inference, Bellemare et al. (2014) introduced the Skip Context Tree Switching algorithm, and Oh et al. (2015) developed two deep neural networks. All three of these models have achieved remarkable prediction accuracy. However, their research does not explore decision-making in the ALE based on predicted future screens. No one has demonstrated that predicting future screens can improve gameplay performance in the ALE. This work will focus on the intersection of model-learning and reinforcement learning in the ALE, an area of study that is currently missing from the AI literature.

1.4 Primary Contributions

This thesis will focus on the creation of a model-based reinforcement learning agent that learns an imperfect predictive model of the gameplay dynamics of Atari 2600 games. The objective of this research is not to solve a video game problem, but to explore model-based decision-making in a novel and challenging environment. Each design choice in this work was made with the intention of maintaining practicality, as creating an accurate model in the ALE is extremely computationally challenging. There are no published results showing successful MBRL in the Atari 2600 domain - and this thesis will not be the first. This thesis, however, will present a practical model that shows promise with respect to learning potential, as well as document MBRL challenges not often presented in the current MBRL literature.

Chapter 2
MBRL Agent for Atari 2600

This chapter details the creation of a MBRL agent for the Atari 2600, which consists of a convolutional logistic regression model coupled with Monte Carlo planning. In an attempt to create a model useful for planning, special focus was placed on using simple techniques to maintain the practicality of the MBRL agent.

2.1 Pong

The collection of Atari 2600 games provided by the ALE forms a testbed for empirical analysis of flexibly competent agents. However, special focus will be placed on the game of Pong, which will serve as the core example throughout this thesis because it is simple, computationally workable, and easily understood. Pong is a single-player game analogous to tennis; the player has a paddle and the objective is to hit the ball with the paddle until the opponent's paddle is not able to make contact with the ball. In a screen of Pong gameplay, there are three main objects of importance: the player's paddle (right), the opponent's paddle (left), and the ball.

Figure 2: A screen of Pong gameplay.

The bottom of the screen and the white line at the top of the screen serve as vertical boundaries that cause the ball to bounce upon contact. If the ball manages to reach the left edge of the screen, the player is awarded a point and their score increases by one. If the ball manages to reach the right edge of the screen, the opponent is awarded a point and the player's score decreases by one. The numbers at the top of the screen, from left to right, reflect the current number of points for the opponent and the player, respectively. A game is finished when either the player or the opponent receives their 21st point. The player's final score is between -21 and 21, inclusive, where a negative number signifies defeat and a positive number signifies victory.

2.2 Tile-Level Screen Information

The ALE provides access to the screen information of Atari 2600 gameplay. More specifically, the ALE records the colors present at every pixel of a given screen. Liang et al. (2016) developed methods to translate the pixel-level screen information to object-level screen information, which is comprised of the locational information of colored blobs. The object-level screen information melds together contiguous pixels of the same color to form colored blobs, where the definition of contiguity has been expanded to include the surrounding s x s neighborhood, for some size s. The location of a colored blob is defined as the center of the blob's smallest bounding box (Liang et al., 2016).

Liang et al. then converted the object-level screen information to tile-level screen information, which bears resemblance to the original pixel-level screen information. The screen is divided into tiles of dimensions 4 pixels x 7 pixels, and the tile-level screen information records the colors present at every tile, where color presence is defined as a tile containing the center of a blob of the given color. The abstraction to tile-level screen information has proven to be an efficient and useful basis for a feature set for model-free learning, and as such, it will be used as the basis for the MBRL agent's feature set.

Figure 3: A screen of Pong gameplay (left), and a tile-level representation of the screen comprised of 9 tiles x 11 tiles (right). Black indicates there are no blobs within the tile. Note that the background color and the white line in the original screen of gameplay are each represented by one single blob that is assigned to the tile that contains the blob's center.

2.3 Out of Bounds Color

The ALE provides functionality to allow the user to determine the color palette for the screen information of Atari 2600 gameplay. To maintain practicality, the MBRL agent utilizes the 8-color (SECAM) palette from the ALE. The distinction of colors allows the agent to differentiate between objects on the screen. It would also be advantageous for the agent to know the boundaries of the gameplay screen. During the conversion process from pixel-level screen information to tile-level screen information, an immediate perimeter of tiles is created to surround the tiles of the gameplay screen. A special color is placed into each of the tiles in this immediate perimeter, establishing out of bounds markers for the perimeter layer. The addition of the out of bounds color extends the agent's original 8-color palette to a 9-color palette.

Figure 4: A 5 tiles x 5 tiles screen of gameplay surrounded by an immediate perimeter of tiles containing the out of bounds color. Under the original 8-color palette, the only information known is the existence of an orange blob and a green blob. The addition of the out of bounds color adds the information that the orange blob is on the left edge of the screen, and the green blob is on the right edge of the screen at the topmost tile.
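For concreteness, the conversion from blob centers to a tile-level representation with an out of bounds perimeter can be sketched in a few lines of Python. The sketch is illustrative rather than the agent's actual implementation: the blob-detection step is assumed to have already produced (pixel row, pixel column, color) triples, the tile orientation is an assumption since the text states the tile dimensions in both orders, and the function name and color indices are hypothetical.

    # Sketch of the tile-level screen representation with an out-of-bounds border.
    # Assumes blob centers are given as (pixel_row, pixel_col, color) triples.
    TILE_HEIGHT, TILE_WIDTH = 7, 4   # assumed orientation of the 4 x 7 / 7 x 4 tiles
    OUT_OF_BOUNDS = 8                # index of the extra color appended to the 8-color palette

    def tile_representation(blob_centers, screen_height, screen_width):
        rows = screen_height // TILE_HEIGHT
        cols = screen_width // TILE_WIDTH
        # One extra tile on every side holds the out-of-bounds color.
        tiles = [[set() for _ in range(cols + 2)] for _ in range(rows + 2)]
        for r in range(rows + 2):
            for c in range(cols + 2):
                if r in (0, rows + 1) or c in (0, cols + 1):
                    tiles[r][c].add(OUT_OF_BOUNDS)
        # A color is "present" in a tile if the tile contains a blob center of that color.
        for pixel_row, pixel_col, color in blob_centers:
            r = 1 + pixel_row // TILE_HEIGHT
            c = 1 + pixel_col // TILE_WIDTH
            tiles[r][c].add(color)
        return tiles

    # Example: one blob (hypothetical color index 6) near the top-left of a 210 x 160 screen.
    example = tile_representation([(10, 12, 6)], screen_height=210, screen_width=160)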

2.4 Logistic Regression

The basic learning algorithm that will be employed to learn the model is logistic regression. The following example is a simplified version of how the model can make predictions using the logistic function. Consider a tile at position (x, y) - the objective is to predict whether the tile will contain a white blob in the next screen. A feature is a piece of information, such as "there is a green blob in the tile at position (x, y)". A binary feature corresponds to the truth value (0 or 1) of the piece of information - if there is a green blob in the tile at position (x, y), then the corresponding binary feature is 1. This example will demonstrate how a logistic regression model can use a vector of binary features to predict the presence of a white blob within the tile at position (x, y) in the next screen.

Consider the binary feature vector \phi_{x,y} that corresponds to the presence of the colors white, green, and red in the tile at position (x, y) in the current screen of gameplay. In this example, white and red are present in the current screen, so the binary feature vector \phi_{x,y} is (1, 0, 1). Logistic regression will be used to predict the probability of the presence of a white blob in the next screen of gameplay, which is defined as \hat{M}(\text{white} \mid \phi, x, y). For every feature in the feature set, there is a weight for white blob presence. The weights account for the correlations between the features being active and the likelihood of a white blob being present in the tile at position (x, y) in the next screen. For this example, the white weight vector \Theta_{\text{white}} is (2, -1, 1.2). Summing the weights that correspond to the active features forms a linear combination

\Lambda_{x,y,\phi,\text{white}} = (2 \cdot 1) + (-1 \cdot 0) + (1.2 \cdot 1) = 3.2.

Applying the logistic probability formula to this linear combination yields

\hat{M}(\text{white} \mid \phi, x, y) = \frac{e^{3.2}}{1 + e^{3.2}} \approx 0.96.

The logistic probability formula has determined that the model probability of a white blob being present in the tile at position (x, y) in the next screen of gameplay is approximately 0.96. Logistic regression will always yield a number between 0 and 1, so it is an ideal tool for determining probability.
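The worked example above can be written out directly in code. The following Python sketch simply reproduces the arithmetic of this section with the same hypothetical feature vector and weights; it is not the agent's implementation.

    import math
    import random

    def logistic(z):
        # Logistic (sigmoid) function: maps any real number into (0, 1).
        return math.exp(z) / (1.0 + math.exp(z))

    # Binary features for the tile at (x, y): [white present, green present, red present].
    phi = [1, 0, 1]
    # Hypothetical weights for predicting white in the next screen, as in the example.
    theta_white = [2.0, -1.0, 1.2]

    # Linear combination of the weights of the active features.
    lam = sum(w * f for w, f in zip(theta_white, phi))    # 3.2
    p_white = logistic(lam)                               # roughly 0.96

    # Sample the prediction: the tile contains a white blob with probability p_white.
    predicts_white = random.random() < p_white
    print(lam, round(p_white, 3), predicts_white)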

Figure 5 graphs the logistic curve.

Figure 5: The logistic curve from x = -5 to x = 5 (Wolfram Alpha, 2017).

The model will then generate a random number between 0 and 1. If the randomly generated number is less than 0.96, then the tile at position (x, y) will contain a white blob in the predicted next screen. The same process can be applied to the other colors as well to determine the colors present in the tile at position (x, y) in the next screen. It can also be applied to all of the colors in the remaining tiles of the next screen to create an entire predicted next screen.

The success of the logistic regression model is largely dependent on the accuracy of the weights. Weights of the active features are updated by comparing the model's predictions of the screen to the actual screen. After each color prediction within each tile, for each color c and active feature i, the prediction error \delta_{c,i} is accumulated as

\delta_{c,i} \leftarrow \delta_{c,i} + (\tau_{x,y,c} - \hat{M}(c \mid \phi, x, y)).   (2.1)

Once an entire predicted next screen has been generated, each weight \theta_{c,i} is updated by the prediction error \delta_{c,i} and the step size \alpha, which is a number between 0 and 1 that allows the logistic regression model to gradually learn correlations over time. The weight update rule is

\theta_{c,i} \leftarrow \theta_{c,i} + \alpha \, \delta_{c,i}.   (2.2)
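The prediction-then-update cycle of Equations 2.1 and 2.2 for a single color can be sketched as follows. The function and variable names are illustrative, and the per-tile binary feature vectors are assumed to be given; this is a sketch of the mechanics, not the thesis implementation.

    import math
    import random

    def logistic(z):
        return math.exp(z) / (1.0 + math.exp(z))

    def predict_and_update(tiles, features, weights, actual, alpha=0.01):
        # One pass of Equations 2.1-2.2 for a single color.
        # tiles    -- list of (x, y) tile positions
        # features -- features[(x, y)] is the binary feature vector for that tile
        # weights  -- weight vector theta_c for this color (updated in place)
        # actual   -- actual[(x, y)] is 1 if the color appeared in the real next screen
        delta = [0.0] * len(weights)              # accumulated prediction error per feature
        predicted = {}
        for (x, y) in tiles:
            phi = features[(x, y)]
            p = logistic(sum(w * f for w, f in zip(weights, phi)))
            predicted[(x, y)] = random.random() < p
            # Equation 2.1: accumulate (tau - M_hat) for every active feature.
            for i, f in enumerate(phi):
                if f:
                    delta[i] += actual[(x, y)] - p
        # Equation 2.2: apply the accumulated error once the whole screen is predicted.
        for i in range(len(weights)):
            weights[i] += alpha * delta[i]
        return predicted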

2.5 Convolutional Logistic Regression (Dynamics Model)

Under the MBRL framework, an agent chooses actions leading to the most desirable future outcome. In order to compare the effects of selecting an action, the logistic regression model must be action-dependent. Equation 2.4 reports the logistic probability formula used in this work, which is an action-dependent formula. Action-dependence is achieved by separating the weights among the action set so that each action a has its own set of weights. The linear combination is the sum of the n active weights, which are both color and action dependent.

\Lambda_{x,y,\phi,a,c} = \sum_{i=0}^{n} \theta_{a,c,i} \, \phi_i   (2.3)

\hat{M}(c \mid \phi, x, y, a) = \frac{e^{\Lambda_{x,y,\phi,a,c}}}{1 + e^{\Lambda_{x,y,\phi,a,c}}}   (2.4)

In the Atari 2600 domain, a MBRL agent must predict future screens in order to determine the most favorable action. This work will utilize a convolutional approach for future screen prediction (Fukushima, 1980). The convolutional approach is centered around the process of predicting the colors within a tile of the next screen based on the presence of colors in the neighboring tiles of the current screen. Absolute position has been abstracted away from this approach, allowing the model to learn position-independent information.

For each tile in the next screen, a feature vector is constructed using a convolutional lens. A convolutional lens has access to the tile-level screen information in the surrounding area of the current screen. This surrounding area is named the convolutional neighborhood, which has square dimensions determined by the program. Figure 6 presents an example of a convolutional lens with dimensions 3 tiles x 3 tiles.

Figure 6: A 3 tiles x 3 tiles convolutional lens iterating through a 5 tiles x 5 tiles screen. The gray tiles and the red tile comprise the convolutional neighborhood for the first iteration, where the red tile is the tile that is currently being predicted.

As demonstrated by Figure 6, the convolutional lens iterates through the tile positions of the next screen in a top-to-bottom, left-to-right fashion. The feature vector constructed for each tile is determined by the presence of colors within the tiles in the convolutional neighborhood from the current screen. Each feature has a corresponding weight for each action-color pair, where the weight corresponds to the correlation between the feature and the color when the agent takes the action. For each tile position and for each color, summing the color's weights that correspond to the active features and then applying logistic regression to this sum yields the probability that a blob of that color will be present within the tile in the predicted next screen. This process constructs an entire predicted next screen.

When the agent encounters a real next screen of gameplay, it compares its predictions of the next screen with the actual screen, and applies a small update (found in Equation 2.6) to the weights of the active features - in this way the model is able to learn about the dynamics of the world over time.

\delta_{a,c,i} \leftarrow \delta_{a,c,i} + (\tau_{x,y,c} - \hat{M}(c \mid \phi, x, y, a))   (2.5)

\theta_{a,c,i} \leftarrow \theta_{a,c,i} + \alpha \, \delta_{a,c,i}   (2.6)
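A sketch of the convolutional prediction loop driven by Equations 2.3 and 2.4 is given below. It makes several simplifying assumptions that are not in the thesis: screens are represented as dictionaries mapping tile positions to sets of colors, features are (Δt, Δx, Δy, c) tuples, weights are stored in nested dictionaries keyed by (action, color), and the bias and truncation refinements of Section 2.6 are omitted. All names are illustrative.

    import math
    import random
    from collections import defaultdict

    def logistic(z):
        return math.exp(z) / (1.0 + math.exp(z))

    def active_features(screens, x, y, radius):
        # Enumerate (dt, dx, dy, color) features in the convolutional neighborhood.
        # screens maps a time offset dt (-1 = previous, 0 = current, +1 = the part of
        # the next screen that has already been predicted) to {(row, col): set(colors)}.
        feats = []
        for dt, grid in screens.items():
            for dx in range(-radius, radius + 1):
                for dy in range(-radius, radius + 1):
                    for c in grid.get((x + dx, y + dy), ()):
                        feats.append((dt, dx, dy, c))
        return feats

    def predict_next_screen(screens, weights, colors, height, width, action, radius=1):
        # Predict one screen tile-by-tile, top-to-bottom and left-to-right (Equation 2.4).
        predicted = defaultdict(set)
        screens = dict(screens)
        screens[+1] = predicted        # already-predicted tiles feed later predictions
        for x in range(height):
            for y in range(width):
                feats = active_features(screens, x, y, radius)
                for c in colors:
                    wts = weights.get((action, c), {})
                    lam = sum(wts.get(f, 0.0) for f in feats)
                    if random.random() < logistic(lam):
                        predicted[(x, y)].add(c)
        return predicted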

Formulating predictions solely based on the current screen of gameplay neglects information about the movement of objects on the screen. In order to account for movement, the presence of colors in the neighboring tiles of the previous screen of gameplay is also factored into the prediction formulation. However, this procedure still neglects the conditional nature of objects on the screen, which suggests the predictions should be conditional as well; there is only one ball in the game of Pong, so if the ball is predicted to be present within a tile of the next screen, the probability of predicting a ball should be 0% for all subsequent tiles within the predicted next screen. The presence of colors in the already predicted neighboring tiles of the predicted next screen of gameplay is therefore also taken into consideration when formulating predictions.

The features for this convolutional logistic regression model are screen-specific, location-specific (location in the convolutional neighborhood), and color-specific. A feature is of the form \phi_{\Delta t, \Delta x, \Delta y, c}, where the feature is active if the screen at time t + \Delta t has a blob of color c in the tile at row x + \Delta x, column y + \Delta y.

Figure 7: The possible feature enumerations for a 3 tiles x 3 tiles convolutional lens, where the red tile signifies the current tile being predicted. The blue tiles signify the partial convolutional neighborhood that does not currently contain any predictions, and is thus not useful for prediction formulation.

The feature vector will only consist of the feature enumerations for the colored blobs encountered in the convolutional neighborhood within the previous screen, the current screen, and the (partially predicted) next screen. For example, the feature set would be \{\phi_{-1,-1,-1,5}\} if the only blob encountered is present in the previous screen, located in the top-left position of the convolutional neighborhood, and is the sixth color in the 9-color palette.

2.5.1 Convolutional Logistic Regression Example

The following example will demonstrate how convolutional logistic regression can generate future screens of Pong gameplay. Figure 8 demonstrates the convolutional features across time available to the convolutional lens at the tile position row 4, column 5. The left grid and the middle grid are the bottom right quarter of the previous screen and current screen of gameplay, respectively. The convolutional features provide the information that the ball was (with respect to the current position) four tiles below and two tiles to the right in the previous screen, and two tiles below and one tile to the right in the current screen. Summing the white color weights that correspond to these two features and applying the logistic function provides the probability that the ball will be predicted to be present in the current tile position within the next screen. If the probability is greater than a randomly generated number between 0 and 1, then the prediction of the next screen will contain a ball within this tile. Note that the paddle present in the previous and current screens is not visible to the convolutional lens because it is outside of the convolutional neighborhood, and therefore the weights of those features are not included in the weight summation.

Figure 8: The predicted next screen of Pong gameplay displayed above provides a snapshot of a 9 tiles x 9 tiles convolutional lens mid-iteration predicting the ball at the tile position row 4, column 5. The paddle is not displayed in the predicted next screen, because the convolutional lens has not yet crossed a tile in which it would predict the presence of the paddle.

2.6 Sparsified Feature Set

It is common to initialize the weights to 0, but if the weights are initialized to 0, the first prediction will result in the logistic probability function predicting the presence of approximately half of all the possible features in the future screen. This is demonstrated by the sigmoid calculation

\frac{e^{0}}{1 + e^{0}} = 0.5.

This is both inefficient and inaccurate, as the set of possible features is magnitudes larger than the number of features that are active in any given screen of gameplay. To counter this scaling challenge, a bias term was introduced. The bias term is a feature that is always active, so it provides the agent with information about how frequently it can expect colored blobs to be present in the next screen in general. The bias term is associated with an initially negative weight so that the agent is initially pessimistic with regard to predicting that colored blobs will be present in the next screen of gameplay. The first prediction with the bias term will result in the logistic probability function predicting there are no colored blobs in the next screen. This inaccuracy is removed as the weight of the bias term eventually converges to a value indicative of the probability of a colored blob being present in the next screen. The addition of the bias term has turned this into a more scalable problem.

To further sparsify the predicted set of active features, probability truncation has been introduced. Applying the sigmoid function to the sum of the weights of the active features will often yield a probability that is only marginally larger than zero percent. The logistic regression formula will never return a probability of exactly zero percent, so a probability marginally larger than zero percent is essentially the minimum. To reduce sigmoid computation (it would be fairly computationally expensive to use the logistic probability formula for every color in every tile) and to account for the probability never being exactly zero percent, the sum of the weights of the active features is compared to a probability threshold. If the sum of the weights of the active features is smaller than the probability threshold, then the agent will assume the probability is zero.
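The bias term and the probability truncation can be combined into a single sparsified prediction step, sketched below. The bias weight initialization and the threshold value are illustrative assumptions, and the threshold is applied to the weight sum before any sigmoid is computed, as described above.

    import math

    BIAS = "bias"                 # a feature that is always active
    INITIAL_BIAS_WEIGHT = -6.0    # pessimistic start: predict "no blob" at first (assumed value)
    SUM_THRESHOLD = -5.0          # weight sums below this skip the sigmoid entirely (assumed value)

    def sparse_probability(active_features, weights):
        # Return the predicted presence probability, truncating tiny probabilities to zero.
        total = weights.get(BIAS, INITIAL_BIAS_WEIGHT)
        total += sum(weights.get(f, 0.0) for f in active_features)
        if total < SUM_THRESHOLD:
            return 0.0            # treated as exactly zero; no sigmoid is computed
        return math.exp(total) / (1.0 + math.exp(total))

    # With untrained weights the pessimistic bias alone falls below the threshold,
    # so nothing is predicted to appear.
    print(sparse_probability([(0, -1, 0, 3)], weights={}))    # 0.0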

2.7 Blob-PROST (Reward Model)

Convolutional logistic regression provides the agent with a dynamics model of the environment, allowing the agent to predict future screens of gameplay. It does not tell the agent how advantageous or disadvantageous these future screens of gameplay may be, however. To determine the favorableness of selecting a certain action in the current screen, the agent utilizes Blob-PROST for its reward model, a feature set created by Liang et al. (2016). The Blob-PROST feature set \psi consists of all of the blobs on a screen, the pairwise relative offsets between blobs on the screen, and the pairwise relative offsets between blobs on the screen over time (from the previous screen to the current screen). This work utilizes Blob-PROST features to power an action-dependent linear reward model, where summing the active weights \theta^{R}_{a,i} (a linear combination) provides the expected reward \hat{R}(a, \psi) of selecting an action in the current screen of gameplay. The m weights are updated according to the update rule in Equation 2.8, which compares the actual reward received \rho to the expected reward \hat{R}(a, \psi), and updates the reward weights with a step size \alpha^{R}.

\hat{R}(a, \psi) = \sum_{i=0}^{m} \theta^{R}_{a,i} \, \psi_i   (2.7)

\theta^{R}_{a,i} \leftarrow \theta^{R}_{a,i} + \alpha^{R} \, (\rho - \hat{R}(a, \psi))   (2.8)
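A minimal sketch of the linear reward model of Equations 2.7 and 2.8 follows. The Blob-PROST feature set is assumed to be supplied as a collection of active feature identifiers, and the class name, step size, and example feature names are hypothetical.

    from collections import defaultdict

    class LinearRewardModel:
        # Action-dependent linear reward model over Blob-PROST features (Eqs. 2.7-2.8).

        def __init__(self, step_size=0.01):
            self.alpha_r = step_size
            self.theta_r = defaultdict(float)      # weights indexed by (action, feature)

        def expected_reward(self, action, psi):
            # Equation 2.7: sum of the active weights for this action.
            return sum(self.theta_r[(action, f)] for f in psi)

        def update(self, action, psi, actual_reward):
            # Equation 2.8: move the active weights toward the observed reward.
            error = actual_reward - self.expected_reward(action, psi)
            for f in psi:
                self.theta_r[(action, f)] += self.alpha_r * error

    # Usage: after taking action a with active Blob-PROST features psi and receiving rho points.
    model = LinearRewardModel()
    model.update(action=3, psi={"blob(white,2,5)", "offset(white,orange,+1,-2)"}, actual_reward=1.0)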

2.8 Monte Carlo Planning Algorithm

If the agent is presented with several action choices, it would prefer to make the action choice that leads to the most favorable future screen. Since the agent is able to predict future screens of gameplay and can quantify the favorableness of these future screens, it is plausible that the agent can plan for the future. Note that the favorableness of a screen should be viewed as the long-term expected reward from the state, as opposed to the immediate reward received from reaching the screen. In order to calculate the long-term expected reward of a screen, the agent uses a Monte Carlo planning algorithm.

The agent's Monte Carlo planning algorithm performs x rollouts of y random actions (where x and y are determined by the specifications) following each action choice. This means the agent will determine the best action choice by:

- Predicting the future screen following a possible action choice
- Following the action choice with y additional random action choices
- Performing the above two steps a total of x times and calculating the average long-term expected reward for the action choice
- Performing the above step for each action choice
- Choosing the action with the largest average long-term expected reward

A discount factor \gamma is applied to allow the agent to determine a weighted long-term expected reward, where less weight is placed on rewards the further in the future they may occur. Discounting is a beneficial approach because it best aligns with the agent's objective of achieving a large final score; if the game will terminate in the next few screens (which can happen for a number of reasons), it is advantageous for the agent to focus on increasing its final score before the game ends. The agent's long-term reward estimate is determined by the following equation, where a_t is the action selected at step t of a rollout and \psi_t is the corresponding Blob-PROST feature set:

\hat{M}(\rho) = \frac{1}{x} \sum_{k=1}^{x} \sum_{t=1}^{y} \gamma^{t-1} \, \hat{R}(a_t, \psi_t)   (2.9)

Figure 9 presents a simplified Monte Carlo example where the agent only has two possible actions. The average reward of the rollouts corresponding to an action represents the Monte Carlo prediction for the long-term expected reward of selecting the action. In this example, action 1 yields the larger long-term expected reward, so action 1 is more favorable.

Figure 9: A two-action agent with Monte Carlo specifications of 3 rollouts of 2 random actions. Long-term expected reward of action 1: 3. Long-term expected reward of action 2:
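The rollout procedure described above can be sketched as follows. The sketch assumes a model object exposing two illustrative methods, predict(state, action) returning the predicted next state together with its Blob-PROST features, and expected_reward(action, features) returning the reward estimate; this interface, the default discount factor, and the function name are assumptions rather than the thesis code.

    import random

    def monte_carlo_plan(state, actions, model, num_rollouts=10, rollout_length=20, gamma=0.99):
        # Choose the action with the largest average discounted rollout return (Eq. 2.9).
        best_action, best_value = None, float("-inf")
        for first_action in actions:
            total = 0.0
            for _ in range(num_rollouts):
                # Predict the screen following the candidate action choice.
                features, current = model.predict(state, first_action)
                rollout_return = model.expected_reward(first_action, features)
                # Follow the candidate with rollout_length additional random actions.
                for t in range(1, rollout_length + 1):
                    action = random.choice(actions)
                    features, current = model.predict(current, action)
                    rollout_return += (gamma ** t) * model.expected_reward(action, features)
                total += rollout_return
            value = total / num_rollouts
            if value > best_value:
                best_action, best_value = first_action, value
        return best_action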

Pong Results Using Atari 2600 Simulator

The ALE provides functionality to save the current game so that it may be loaded in the future. The saving and loading functionality makes it possible to determine how well an agent would perform in the ALE if it used the Monte Carlo planning algorithm with the Atari 2600 simulator. The Atari 2600 simulator serves as a trivially perfect model - it will always predict the future with perfect accuracy, but the decision-making process requires having the agent try every action in the real world several times in order to determine the best action (and most domains do not provide a simulator). Experiments were conducted with the trivially perfect model to gauge agent performance with the Monte Carlo planning algorithm in Pong.

The experiment consisted of 30 independent trial episodes, and the planning specifications entailed 10 rollouts of 20 random actions. The recorded performance metrics were the average, minimum, and maximum score, and the average, minimum, and maximum number of frames.

The agent was very successful at Pong; it received a high average score and never scored lower than 14 points. The agent also played the game for many frames on average. It can therefore be concluded that an agent utilizing the Monte Carlo planning

algorithm with a sufficiently accurate model can achieve success in the ALE. The ambition of this project (which has not yet been achieved) was to create a sufficiently accurate model that could be utilized by the Monte Carlo planning algorithm to achieve success in the ALE.

Asterix Results Using Atari 2600 Simulator

In order to determine whether the Monte Carlo planning algorithm could achieve flexible competence in the ALE, experiments were conducted in a different game. Asterix is a game where the player must navigate the world to avoid enemies and collect point boosts. A screen of Asterix gameplay is pictured in Figure 10, where the player sprite is the yellow character. The yellow character can move up, down, left, and right in the world. The player has three lives, and loses a life when the yellow character makes contact with a harp. Harps appear from the left or right side of the screen, and move within their row to exit the screen from the other side. The player increases their score by collecting items such as clocks and shields, which appear and move in the same fashion as harps.

Figure 10: A screen of Asterix gameplay.

The same performance metrics as in Pong were recorded for an agent playing Asterix using the Monte Carlo planning algorithm and the Atari 2600 simulator. The agent had a few low-scoring games (such as a 1450 point game), which demonstrates that the Monte Carlo planning algorithm is not perfect, even when the Atari 2600 simulator is being utilized. However, the agent was very successful at Asterix on average, receiving an incredibly high average score. The agent also played for many frames on average, which means the agent was able to avoid making contact with harps for a very long time. These results indicate that although the Monte Carlo planning algorithm is not perfect, it is a flexibly competent planning algorithm within the ALE.

Chapter 3
Improvements and Results

This chapter presents a report of iterative improvement measures and an empirical analysis of the MBRL agent's performance. The experiments are uniform in nature, while sequentially adding the extensions to the model in order to isolate the impact of each additional component. The model is still not useful for planning, but each successive version has learned an iteratively improved model with a greater prediction accuracy.

3.1 Experimental Procedure

In each experiment, thirty trials were performed, in which an agent would learn a model of the game by making random actions. Experiments were conducted on both Pong and Asterix. The initial model for Pong used a 7 tiles x 7 tiles convolutional lens and the initial model for Asterix used a 3 tiles x 3 tiles convolutional lens; the number of objects on the screen differs between the games, so the sizes of the convolutional lens were respectively chosen to maintain efficiency within each game. The dimensions of each tile were 7 pixels x 4 pixels. After 100, 500, 1000, 1500, and 3000 training episodes, the agent relied on its model to choose actions for the next 5 episodes (planning episodes). The model was paired with the Monte Carlo planning algorithm to make decisions. The Monte Carlo specification entailed 10 rollouts of 20 random actions for each action.

Model Evaluation

The five evaluation metrics, per episode, were: score, number of frames, frames per second, negative log likelihood of the dynamics model error, and mean squared error of the reward model. Score is a direct measure of performance in Atari 2600 games. Number of frames provides more clarity with respect to measuring performance when a score comparison does not reveal any information. Two agents could finish a game of Pong with -21 points each, but the agent that plays for more frames is more resistant to surrendering points, and therefore performs better than the other agent. Frames per second is a direct measure of the computational practicality of the agent, which is noteworthy to measure because a frames-per-second comparison demonstrates the computational costs of an improvement.

Negative log likelihood is a prominent measure of accuracy for logistic regression, so the negative log likelihood of the dynamics model error is recorded. The negative log likelihood (\mu) equation is

\mu = -\sum_{t} \sum_{x,y} \sum_{c} \log\!\left( \hat{M}(\tau_{t,x,y,c} \mid \phi_t, x, y, a_t) \right).   (3.1)

In this document, model error refers to the negative log likelihood of the dynamics model error. Mean squared error (\kappa) is a prominent measure for linear function approximations, so the mean squared error of the reward model is recorded. The mean squared error equation is

\kappa = \sum_{t} \left( \rho_t - \hat{R}(a_t, \psi_t) \right)^2.   (3.2)

In this document, reward error refers to the mean squared error of the reward estimates. For each time-frame of 5 planning episodes, averages of the number of frames, the negative log likelihood of the dynamics model error, the mean squared error of the reward model, and the score were recorded.
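For concreteness, the two error measures of Equations 3.1 and 3.2 can be computed as in the following sketch, where the dynamics model's per-tile, per-color probabilities and the corresponding binary outcomes are assumed to be supplied as parallel lists; the small epsilon guard is an implementation detail added here, not part of the thesis.

    import math

    def negative_log_likelihood(probabilities, outcomes, eps=1e-12):
        # Equation 3.1: minus the sum of log-probabilities the dynamics model
        # assigned to what actually happened, over every (t, x, y, c) entry.
        total = 0.0
        for p, tau in zip(probabilities, outcomes):
            p_of_actual = p if tau == 1 else 1.0 - p
            total -= math.log(max(p_of_actual, eps))    # eps guards against log(0)
        return total

    def reward_squared_error(predicted_rewards, actual_rewards):
        # Equation 3.2: squared error of the reward model, summed over time steps.
        return sum((rho - r_hat) ** 2 for rho, r_hat in zip(actual_rewards, predicted_rewards))

    print(negative_log_likelihood([0.9, 0.2], [1, 0]))    # small error: confident, correct predictions
    print(reward_squared_error([0.0, 1.0], [0.0, 0.5]))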

3.2 Initial Results

Figure 11 displays both the score vs. training episodes and the frames vs. training episodes for the planning intervals of 5 episodes after 100, 500, 1000, 1500, and 3000 training episodes. These performance metrics were evaluated on Pong. The agent received an average score between -20 and -21 at each planning interval, which demonstrates the agent is a very unsuccessful Pong player. The agent progressively played for fewer frames at each planning interval, which suggests the information the agent has learned may have led to worse play.

Figure 11: Initial performance graph for Pong

Figure 12 displays both the model error vs. training episodes and the reward error vs. training episodes for the planning intervals of 5 episodes after 100, 500, 1000, 1500, and 3000 training episodes. These error metrics were evaluated on Pong. The agent progressively reduced the reward error at each interval, and the graph suggests more training could further reduce the reward error. The agent also progressively reduced the model error at each interval, but the model error seems to have plateaued.

Figure 12: Initial error graph for Pong

Figure 13 presents the performance metrics for Asterix. The agent's average score mostly fluctuated within a low range, which is a poor score in Asterix. The number of frames the agent played also fluctuated, leading to no decisive conclusions about change in performance over time.

Figure 13: Initial performance graph for Asterix

Figure 14 presents the error metrics for Asterix. Interestingly, the reward error increased slightly over time as the agent experienced more episodes of training. The model error decreased slightly over time, reaching \mu = 3500 after 3000 training episodes.

Figure 14: Initial error graph for Asterix

3.3 Action Independent Model

Since the dynamics model is action-specific, it divides the learning data among the set of possible actions. This division of data provides the model with insight regarding how each action choice will affect the next screen of gameplay - which is very useful for the agent when choosing an action. However, this division of data makes inefficient use of the resources available, as the future positions of many of the objects on a screen do not depend on the action taken by the agent. In the game of Pong, the agent controls the movement of the player paddle, but the movement of the opposing player's paddle is entirely independent of the actions taken by the agent. The movement of the ball is mostly independent of the actions taken by the agent, with the exception being the agent choosing to position the player paddle to make contact with the ball.

To efficiently make use of the overlapping resources, a second model was created - an action independent model. The action independent model shares its information among all actions, adding a layer of generalization to the model information available to the agent. The agent uses both models to predict future screens, where the new logistic probability formula for determining the probability of the existence of a color c in the next screen can be found in Equation 3.4. An action independent weight is defined as \theta^{AI}_{c,i}, and the linear combination of active action independent weights is defined as \Lambda^{AI}_{x,y,\phi,c}.

\Lambda^{AI}_{x,y,\phi,c} = \sum_{i=0}^{n} \theta^{AI}_{c,i} \, \phi_i   (3.3)

\hat{M}(c \mid \phi, x, y, a) = \frac{e^{(\Lambda_{x,y,\phi,a,c} + \Lambda^{AI}_{x,y,\phi,c})}}{1 + e^{(\Lambda_{x,y,\phi,a,c} + \Lambda^{AI}_{x,y,\phi,c})}}   (3.4)

The combination of action dependent and action independent weights allows the model to make better use of the available information. In the context of Pong, the action independent weights will allow the model to better track the movement of the ball, whereas the action dependent weights will allow the model to understand the causal relationship between selecting an action and moving the paddle.
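A sketch of the combined probability of Equation 3.4 is given below. The two weight tables are assumed to be dictionaries (keyed by feature, and additionally by action for the dependent table), and the function name and example values are illustrative.

    import math

    def combined_probability(active_features, action, theta_action, theta_independent):
        # Equation 3.4: logistic of the sum of the action-dependent and
        # action-independent linear combinations over the active features.
        lam_action = sum(theta_action.get((action, f), 0.0) for f in active_features)
        lam_independent = sum(theta_independent.get(f, 0.0) for f in active_features)
        z = lam_action + lam_independent
        return math.exp(z) / (1.0 + math.exp(z))

    # Both tables are trained with the same update rule; the independent table simply
    # pools its updates across every action.
    p = combined_probability([(-1, 2, 1, 6), (0, 1, 0, 6)], action=2,
                             theta_action={}, theta_independent={(0, 1, 0, 6): 1.5})
    print(round(p, 3))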

3.3.1 Results with the Additional Action Independent Model

Figure 15 displays the performance metrics for Pong with the additional action independent model. The upgraded agent was still an unsuccessful Pong player, scoring between -20 and -21 at each planning interval. Similarly to the original agent, it also played for fewer frames as it experienced more training episodes.

Figure 15: Performance graph for Pong with the additional action independent model

Figure 16 displays the error metrics for Pong with the additional action independent model. The agent's reward model was not altered, and as such, it is no surprise that the reward model error did not significantly change. The upgraded agent improved its model error to almost as low as \mu = 3600 after 3000 training episodes, a slight improvement over the original agent. The upgraded model also demonstrated quicker learning, as it received a smaller model error during each planning interval.

Figure 16: Error graph for Pong with the additional action independent model

Figure 17 presents the performance metrics for Asterix with the additional action independent model.

The upgraded agent was as ineffective at scoring points as the original model. Similarly to the original agent, the number of frames the agent experienced fluctuated in a manner leading to no decisive conclusions.

Figure 17: Performance graph for Asterix with the additional action independent model

Figure 18 presents the error metrics for Asterix with the additional action independent model. The agent's reward model was not altered, nor will it be altered in any subsequent improvements, so reporting differences in reward prediction error would be superfluous. The upgraded agent's model error after 100 training episodes matched the original agent's model error after 3000 training episodes, but remained fairly constant over time.

Figure 18: Error graph for Asterix with the additional action independent model

3.4 Flip Invariant Model

The situation depicted in Figure 8 is one in which the ball has just bounced off the paddle in an upward direction. The model would learn about this specific situation, but would not be able to generalize the information it has learned to a similar situation, such as the ball bouncing off the paddle in a downward direction. Many Atari 2600 games are symmetrical in nature, and thus information can often be generalized to learn about situations that are symmetrical to the situation directly experienced. Figure 19 mirrors the situation in Figure 8 flipped across the horizontal axis, where a comparison between the two figures demonstrates that the information learned between the two situations is vertically symmetrical.

Figure 19: Mirror situation to Figure 8, only flipped across the horizontal axis (vertical flip)

To benefit from this generalization, a flip-invariant model was introduced. The flip-invariant model utilizes the symmetrical nature of Atari 2600 games to learn more information. Flip-invariant weights are trained on 4 images: the regular version, the horizontal flip version, the vertical flip version, and the both-flip (vertical and horizontal) version. Flip features can be calculated by direct conversion from the non-flip (regular) features:

\phi_{\Delta t, \Delta x, \Delta y, c} \rightarrow \{\phi_{\Delta t, \Delta x, \Delta y, c},\ \phi_{\Delta t, W - \Delta x, \Delta y, c},\ \phi_{\Delta t, \Delta x, H - \Delta y, c},\ \phi_{\Delta t, W - \Delta x, H - \Delta y, c}\},

where H is the height of the screen in tiles and W is the width of the screen in tiles. When the ball moves in one direction, the flip invariant model learns about all symmetrical situations as well. However, during prediction, only the flip-invariant weight corresponding to the regular version of the image is factored into the logistic probability formula, which can be found in Equation 3.7. A flip invariant weight is defined as \theta^{FI}_{a,c,i}, an action-independent flip invariant weight is defined as \theta^{AI,FI}_{c,i}, the linear combination of active flip invariant weights is defined as \Lambda^{FI}_{x,y,\phi,a,c}, and the linear combination of active action independent, flip invariant weights is defined as \Lambda^{AI,FI}_{x,y,\phi,c}.

\Lambda^{FI}_{x,y,\phi,a,c} = \sum_{i=0}^{n} \theta^{FI}_{a,c,i} \, \phi_i   (3.5)

\Lambda^{AI,FI}_{x,y,\phi,c} = \sum_{i=0}^{n} \theta^{AI,FI}_{c,i} \, \phi_i   (3.6)

\hat{M}(c \mid \phi, x, y, a) = \frac{e^{(\Lambda_{x,y,\phi,a,c} + \Lambda^{AI}_{x,y,\phi,c} + \Lambda^{FI}_{x,y,\phi,a,c} + \Lambda^{AI,FI}_{x,y,\phi,c})}}{1 + e^{(\Lambda_{x,y,\phi,a,c} + \Lambda^{AI}_{x,y,\phi,c} + \Lambda^{FI}_{x,y,\phi,a,c} + \Lambda^{AI,FI}_{x,y,\phi,c})}}   (3.7)
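The flip conversion above translates directly into a small helper that produces the four feature variants used to train the flip-invariant weights. The function name is an assumption, and W and H are the screen width and height in tiles.

    def flip_variants(feature, width_tiles, height_tiles):
        # Return the regular, horizontal-flip, vertical-flip, and both-flip versions of a feature.
        dt, dx, dy, c = feature
        return [
            (dt, dx, dy, c),                                   # regular
            (dt, width_tiles - dx, dy, c),                     # horizontal flip
            (dt, dx, height_tiles - dy, c),                    # vertical flip
            (dt, width_tiles - dx, height_tiles - dy, c),      # both flips
        ]

    # Training updates all four variants; prediction uses only the regular version.
    print(flip_variants((-1, 2, 1, 6), width_tiles=40, height_tiles=30))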

Results with the Additional Flip Invariant Model

Figure 20 displays the performance metrics for Pong with the additional flip invariant model. The recently upgraded agent was still an unsuccessful Pong player, scoring between -20 and -21 at each planning interval. Similarly to both previous versions of the agent, it also played for fewer frames as it experienced more training episodes.

Figure 20: Performance graph for Pong with the additional flip invariant model

Figure 21 displays the error metrics for Pong with the additional flip invariant model. The improved agent has a smaller model error at each planning interval than the previous two versions.

Figure 21: Error graph for Pong with the additional flip invariant model

Figure 22 presents the performance metrics for Asterix with the additional flip invariant model. The upgraded agent was as ineffective at scoring points as the original model. Similarly to the original agent, the number of frames the agent experienced fluctuated in a manner leading to no decisive conclusions.

Figure 22: Performance graph for Asterix with the additional flip invariant model

Figure 23 presents the error metrics for Asterix with the additional flip invariant model. The recently upgraded agent's model error after 100 training episodes matched the previous agent's model error after 3000 training episodes, but remained constant over time.

Figure 23: Error graph for Asterix with the additional flip invariant model

3.5 Pairwise Features

One of the most important concepts in Atari 2600 games is the spatial relationship between objects on the screen. It is important to know where the agent avatar is in relation to objects that are beneficial to the agent, such as a point boost or power-up, and objects that are harmful to the agent, such as an enemy. Another very important concept is the spatial relationship between objects in time, as this provides information about the velocity of the objects in a game. In Figure 19, the ball was one tile below and one tile to the right of the paddle in the previous screen, and is three tiles below and two tiles to the right of the paddle (which has not moved) in the current screen. These two screens of gameplay suggest the ball has a velocity of two tiles downward and one tile rightward per screen. This information regarding the velocity of the ball suggests the position of the ball in the next screen will be two tiles downward and one tile rightward from its position in the current screen.

Pairwise features have been added to document the relationship between objects in space and time. There is a pairwise feature for every possible pairing of non-pairwise features. Pairwise feature (\xi) calculation is of the form

\{\phi_{\Delta t_1, \Delta x_1, \Delta y_1, c_1},\ \phi_{\Delta t_2, \Delta x_2, \Delta y_2, c_2}\} \rightarrow \xi_{\Delta t_1, \Delta t_2, \Delta x_1, \Delta x_2, \Delta y_1, \Delta y_2, c_1, c_2}.

This inclusion allows the model to be more expressive by drawing conclusions based on temporal and spatial relationships. In accordance with the regular non-pairwise features, flipped pairwise features are calculated from the regular pairwise features using the following protocol:

\xi_{\Delta t_1, \Delta t_2, \Delta x_1, \Delta x_2, \Delta y_1, \Delta y_2, c_1, c_2} \rightarrow \{\xi_{\Delta t_1, \Delta t_2, \Delta x_1, \Delta x_2, \Delta y_1, \Delta y_2, c_1, c_2},\ \xi_{\Delta t_1, \Delta t_2, W - \Delta x_1, W - \Delta x_2, \Delta y_1, \Delta y_2, c_1, c_2},\ \xi_{\Delta t_1, \Delta t_2, \Delta x_1, \Delta x_2, H - \Delta y_1, H - \Delta y_2, c_1, c_2},\ \xi_{\Delta t_1, \Delta t_2, W - \Delta x_1, W - \Delta x_2, H - \Delta y_1, H - \Delta y_2, c_1, c_2}\}
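The pairwise construction can be sketched as follows, with features assumed to be (Δt, Δx, Δy, c) tuples as in Section 2.5; forming every pairing of active non-pairwise features is expressed here with itertools.combinations, and the example color indices are hypothetical.

    from itertools import combinations

    def pairwise_features(active_features):
        # Form one pairwise feature xi for every pair of active non-pairwise features.
        pairs = []
        for (t1, x1, y1, c1), (t2, x2, y2, c2) in combinations(sorted(active_features), 2):
            pairs.append((t1, t2, x1, x2, y1, y2, c1, c2))
        return pairs

    # Example: the ball (color 6) seen relative to the paddle (color 4) in the previous
    # and current screens yields pairwise features that encode its velocity.
    print(pairwise_features([(-1, 4, 2, 6), (0, 2, 1, 6), (0, 0, 0, 4)]))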

Results with the Additional Pairwise Features

Figure 24 displays the performance metrics for Pong with the additional pairwise features. The recently upgraded agent was still an unsuccessful Pong player, scoring between -20 and -21 at each planning interval. However, it seems the agent has learned to play more frames over time, which suggests a small improvement in play style.

Figure 24: Performance graph for Pong with the additional pairwise features

Figure 25 displays the error metrics for Pong with the additional pairwise features. The recently upgraded agent has received a significantly smaller model error of around \mu = 3000 at each planning interval.


More information

Techniques for Generating Sudoku Instances

Techniques for Generating Sudoku Instances Chapter Techniques for Generating Sudoku Instances Overview Sudoku puzzles become worldwide popular among many players in different intellectual levels. In this chapter, we are going to discuss different

More information

Reinforcement Learning Agent for Scrolling Shooter Game

Reinforcement Learning Agent for Scrolling Shooter Game Reinforcement Learning Agent for Scrolling Shooter Game Peng Yuan (pengy@stanford.edu) Yangxin Zhong (yangxin@stanford.edu) Zibo Gong (zibo@stanford.edu) 1 Introduction and Task Definition 1.1 Game Agent

More information

Developing Frogger Player Intelligence Using NEAT and a Score Driven Fitness Function

Developing Frogger Player Intelligence Using NEAT and a Score Driven Fitness Function Developing Frogger Player Intelligence Using NEAT and a Score Driven Fitness Function Davis Ancona and Jake Weiner Abstract In this report, we examine the plausibility of implementing a NEAT-based solution

More information

Game Specific Approaches to Monte Carlo Tree Search for Dots and Boxes

Game Specific Approaches to Monte Carlo Tree Search for Dots and Boxes Western Kentucky University TopSCHOLAR Honors College Capstone Experience/Thesis Projects Honors College at WKU 6-28-2017 Game Specific Approaches to Monte Carlo Tree Search for Dots and Boxes Jared Prince

More information

COMP3211 Project. Artificial Intelligence for Tron game. Group 7. Chiu Ka Wa ( ) Chun Wai Wong ( ) Ku Chun Kit ( )

COMP3211 Project. Artificial Intelligence for Tron game. Group 7. Chiu Ka Wa ( ) Chun Wai Wong ( ) Ku Chun Kit ( ) COMP3211 Project Artificial Intelligence for Tron game Group 7 Chiu Ka Wa (20369737) Chun Wai Wong (20265022) Ku Chun Kit (20123470) Abstract Tron is an old and popular game based on a movie of the same

More information

AI Plays Yun Nie (yunn), Wenqi Hou (wenqihou), Yicheng An (yicheng)

AI Plays Yun Nie (yunn), Wenqi Hou (wenqihou), Yicheng An (yicheng) AI Plays 2048 Yun Nie (yunn), Wenqi Hou (wenqihou), Yicheng An (yicheng) Abstract The strategy game 2048 gained great popularity quickly. Although it is easy to play, people cannot win the game easily,

More information

Reinforcement Learning in Games Autonomous Learning Systems Seminar

Reinforcement Learning in Games Autonomous Learning Systems Seminar Reinforcement Learning in Games Autonomous Learning Systems Seminar Matthias Zöllner Intelligent Autonomous Systems TU-Darmstadt zoellner@rbg.informatik.tu-darmstadt.de Betreuer: Gerhard Neumann Abstract

More information

Traffic Control for a Swarm of Robots: Avoiding Group Conflicts

Traffic Control for a Swarm of Robots: Avoiding Group Conflicts Traffic Control for a Swarm of Robots: Avoiding Group Conflicts Leandro Soriano Marcolino and Luiz Chaimowicz Abstract A very common problem in the navigation of robotic swarms is when groups of robots

More information

Contents. List of Figures

Contents. List of Figures 1 Contents 1 Introduction....................................... 3 1.1 Rules of the game............................... 3 1.2 Complexity of the game............................ 4 1.3 History of self-learning

More information

CSC321 Lecture 23: Go

CSC321 Lecture 23: Go CSC321 Lecture 23: Go Roger Grosse Roger Grosse CSC321 Lecture 23: Go 1 / 21 Final Exam Friday, April 20, 9am-noon Last names A Y: Clara Benson Building (BN) 2N Last names Z: Clara Benson Building (BN)

More information

CS 229 Final Project: Using Reinforcement Learning to Play Othello

CS 229 Final Project: Using Reinforcement Learning to Play Othello CS 229 Final Project: Using Reinforcement Learning to Play Othello Kevin Fry Frank Zheng Xianming Li ID: kfry ID: fzheng ID: xmli 16 December 2016 Abstract We built an AI that learned to play Othello.

More information

CS221 Project Final Report Gomoku Game Agent

CS221 Project Final Report Gomoku Game Agent CS221 Project Final Report Gomoku Game Agent Qiao Tan qtan@stanford.edu Xiaoti Hu xiaotihu@stanford.edu 1 Introduction Gomoku, also know as five-in-a-row, is a strategy board game which is traditionally

More information

Othello/Reversi using Game Theory techniques Parth Parekh Urjit Singh Bhatia Kushal Sukthankar

Othello/Reversi using Game Theory techniques Parth Parekh Urjit Singh Bhatia Kushal Sukthankar Othello/Reversi using Game Theory techniques Parth Parekh Urjit Singh Bhatia Kushal Sukthankar Othello Rules Two Players (Black and White) 8x8 board Black plays first Every move should Flip over at least

More information

Monte Carlo Tree Search

Monte Carlo Tree Search Monte Carlo Tree Search 1 By the end, you will know Why we use Monte Carlo Search Trees The pros and cons of MCTS How it is applied to Super Mario Brothers and Alpha Go 2 Outline I. Pre-MCTS Algorithms

More information

Game Mechanics Minesweeper is a game in which the player must correctly deduce the positions of

Game Mechanics Minesweeper is a game in which the player must correctly deduce the positions of Table of Contents Game Mechanics...2 Game Play...3 Game Strategy...4 Truth...4 Contrapositive... 5 Exhaustion...6 Burnout...8 Game Difficulty... 10 Experiment One... 12 Experiment Two...14 Experiment Three...16

More information

Reinforcement Learning Simulations and Robotics

Reinforcement Learning Simulations and Robotics Reinforcement Learning Simulations and Robotics Models Partially observable noise in sensors Policy search methods rather than value functionbased approaches Isolate key parameters by choosing an appropriate

More information

ROBOCODE PROJECT AIBOT - MARKOV MODEL DRIVEN AIMING COMBINED WITH Q LEARNING FOR MOVEMENT

ROBOCODE PROJECT AIBOT - MARKOV MODEL DRIVEN AIMING COMBINED WITH Q LEARNING FOR MOVEMENT ROBOCODE PROJECT AIBOT - MARKOV MODEL DRIVEN AIMING COMBINED WITH Q LEARNING FOR MOVEMENT PATRICK HALUPTZOK, XU MIAO Abstract. In this paper the development of a robot controller for Robocode is discussed.

More information

Temporal-Difference Learning in Self-Play Training

Temporal-Difference Learning in Self-Play Training Temporal-Difference Learning in Self-Play Training Clifford Kotnik Jugal Kalita University of Colorado at Colorado Springs, Colorado Springs, Colorado 80918 CLKOTNIK@ATT.NET KALITA@EAS.UCCS.EDU Abstract

More information

I.M.O. Winter Training Camp 2008: Invariants and Monovariants

I.M.O. Winter Training Camp 2008: Invariants and Monovariants I.M.. Winter Training Camp 2008: Invariants and Monovariants n math contests, you will often find yourself trying to analyze a process of some sort. For example, consider the following two problems. Sample

More information

Tile Number and Space-Efficient Knot Mosaics

Tile Number and Space-Efficient Knot Mosaics Tile Number and Space-Efficient Knot Mosaics Aaron Heap and Douglas Knowles arxiv:1702.06462v1 [math.gt] 21 Feb 2017 February 22, 2017 Abstract In this paper we introduce the concept of a space-efficient

More information

Image Extraction using Image Mining Technique

Image Extraction using Image Mining Technique IOSR Journal of Engineering (IOSRJEN) e-issn: 2250-3021, p-issn: 2278-8719 Vol. 3, Issue 9 (September. 2013), V2 PP 36-42 Image Extraction using Image Mining Technique Prof. Samir Kumar Bandyopadhyay,

More information

DeepMind Self-Learning Atari Agent

DeepMind Self-Learning Atari Agent DeepMind Self-Learning Atari Agent Human-level control through deep reinforcement learning Nature Vol 518, Feb 26, 2015 The Deep Mind of Demis Hassabis Backchannel / Medium.com interview with David Levy

More information

Analyzing the Impact of Knowledge and Search in Monte Carlo Tree Search in Go

Analyzing the Impact of Knowledge and Search in Monte Carlo Tree Search in Go Analyzing the Impact of Knowledge and Search in Monte Carlo Tree Search in Go Farhad Haqiqat and Martin Müller University of Alberta Edmonton, Canada Contents Motivation and research goals Feature Knowledge

More information

Blur Detection for Historical Document Images

Blur Detection for Historical Document Images Blur Detection for Historical Document Images Ben Baker FamilySearch bakerb@familysearch.org ABSTRACT FamilySearch captures millions of digital images annually using digital cameras at sites throughout

More information

Comparison of Monte Carlo Tree Search Methods in the Imperfect Information Card Game Cribbage

Comparison of Monte Carlo Tree Search Methods in the Imperfect Information Card Game Cribbage Comparison of Monte Carlo Tree Search Methods in the Imperfect Information Card Game Cribbage Richard Kelly and David Churchill Computer Science Faculty of Science Memorial University {richard.kelly, dchurchill}@mun.ca

More information

Training a Neural Network for Checkers

Training a Neural Network for Checkers Training a Neural Network for Checkers Daniel Boonzaaier Supervisor: Adiel Ismail June 2017 Thesis presented in fulfilment of the requirements for the degree of Bachelor of Science in Honours at the University

More information

AI Approaches to Ultimate Tic-Tac-Toe

AI Approaches to Ultimate Tic-Tac-Toe AI Approaches to Ultimate Tic-Tac-Toe Eytan Lifshitz CS Department Hebrew University of Jerusalem, Israel David Tsurel CS Department Hebrew University of Jerusalem, Israel I. INTRODUCTION This report is

More information

Summary Overview of Topics in Econ 30200b: Decision theory: strong and weak domination by randomized strategies, domination theorem, expected utility

Summary Overview of Topics in Econ 30200b: Decision theory: strong and weak domination by randomized strategies, domination theorem, expected utility Summary Overview of Topics in Econ 30200b: Decision theory: strong and weak domination by randomized strategies, domination theorem, expected utility theorem (consistent decisions under uncertainty should

More information

TRIAL-BASED HEURISTIC TREE SEARCH FOR FINITE HORIZON MDPS. Thomas Keller and Malte Helmert Presented by: Ryan Berryhill

TRIAL-BASED HEURISTIC TREE SEARCH FOR FINITE HORIZON MDPS. Thomas Keller and Malte Helmert Presented by: Ryan Berryhill TRIAL-BASED HEURISTIC TREE SEARCH FOR FINITE HORIZON MDPS Thomas Keller and Malte Helmert Presented by: Ryan Berryhill Outline Motivation Background THTS framework THTS algorithms Results Motivation Advances

More information

Why Should We Care? More importantly, it is easy to lie or deceive people with bad plots

Why Should We Care? More importantly, it is easy to lie or deceive people with bad plots Elementary Plots Why Should We Care? Everyone uses plotting But most people ignore or are unaware of simple principles Default plotting tools (or default settings) are not always the best More importantly,

More information

46.1 Introduction. Foundations of Artificial Intelligence Introduction MCTS in AlphaGo Neural Networks. 46.

46.1 Introduction. Foundations of Artificial Intelligence Introduction MCTS in AlphaGo Neural Networks. 46. Foundations of Artificial Intelligence May 30, 2016 46. AlphaGo and Outlook Foundations of Artificial Intelligence 46. AlphaGo and Outlook Thomas Keller Universität Basel May 30, 2016 46.1 Introduction

More information

Chapter 4 SPEECH ENHANCEMENT

Chapter 4 SPEECH ENHANCEMENT 44 Chapter 4 SPEECH ENHANCEMENT 4.1 INTRODUCTION: Enhancement is defined as improvement in the value or Quality of something. Speech enhancement is defined as the improvement in intelligibility and/or

More information

FreeCiv Learner: A Machine Learning Project Utilizing Genetic Algorithms

FreeCiv Learner: A Machine Learning Project Utilizing Genetic Algorithms FreeCiv Learner: A Machine Learning Project Utilizing Genetic Algorithms Felix Arnold, Bryan Horvat, Albert Sacks Department of Computer Science Georgia Institute of Technology Atlanta, GA 30318 farnold3@gatech.edu

More information

Deep Learning for Autonomous Driving

Deep Learning for Autonomous Driving Deep Learning for Autonomous Driving Shai Shalev-Shwartz Mobileye IMVC dimension, March, 2016 S. Shalev-Shwartz is also affiliated with The Hebrew University Shai Shalev-Shwartz (MobilEye) DL for Autonomous

More information

CSE548, AMS542: Analysis of Algorithms, Fall 2016 Date: Sep 25. Homework #1. ( Due: Oct 10 ) Figure 1: The laser game.

CSE548, AMS542: Analysis of Algorithms, Fall 2016 Date: Sep 25. Homework #1. ( Due: Oct 10 ) Figure 1: The laser game. CSE548, AMS542: Analysis of Algorithms, Fall 2016 Date: Sep 25 Homework #1 ( Due: Oct 10 ) Figure 1: The laser game. Task 1. [ 60 Points ] Laser Game Consider the following game played on an n n board,

More information

Monte Carlo tree search techniques in the game of Kriegspiel

Monte Carlo tree search techniques in the game of Kriegspiel Monte Carlo tree search techniques in the game of Kriegspiel Paolo Ciancarini and Gian Piero Favini University of Bologna, Italy 22 IJCAI, Pasadena, July 2009 Agenda Kriegspiel as a partial information

More information

Laboratory 1: Uncertainty Analysis

Laboratory 1: Uncertainty Analysis University of Alabama Department of Physics and Astronomy PH101 / LeClair May 26, 2014 Laboratory 1: Uncertainty Analysis Hypothesis: A statistical analysis including both mean and standard deviation can

More information

Nonuniform multi level crossing for signal reconstruction

Nonuniform multi level crossing for signal reconstruction 6 Nonuniform multi level crossing for signal reconstruction 6.1 Introduction In recent years, there has been considerable interest in level crossing algorithms for sampling continuous time signals. Driven

More information

Computing Science (CMPUT) 496

Computing Science (CMPUT) 496 Computing Science (CMPUT) 496 Search, Knowledge, and Simulations Martin Müller Department of Computing Science University of Alberta mmueller@ualberta.ca Winter 2017 Part IV Knowledge 496 Today - Mar 9

More information

Why Should We Care? Everyone uses plotting But most people ignore or are unaware of simple principles Default plotting tools are not always the best

Why Should We Care? Everyone uses plotting But most people ignore or are unaware of simple principles Default plotting tools are not always the best Elementary Plots Why Should We Care? Everyone uses plotting But most people ignore or are unaware of simple principles Default plotting tools are not always the best More importantly, it is easy to lie

More information

Playing Atari Games with Deep Reinforcement Learning

Playing Atari Games with Deep Reinforcement Learning Playing Atari Games with Deep Reinforcement Learning 1 Playing Atari Games with Deep Reinforcement Learning Varsha Lalwani (varshajn@iitk.ac.in) Masare Akshay Sunil (amasare@iitk.ac.in) IIT Kanpur CS365A

More information

TTIC 31230, Fundamentals of Deep Learning David McAllester, April AlphaZero

TTIC 31230, Fundamentals of Deep Learning David McAllester, April AlphaZero TTIC 31230, Fundamentals of Deep Learning David McAllester, April 2017 AlphaZero 1 AlphaGo Fan (October 2015) AlphaGo Defeats Fan Hui, European Go Champion. 2 AlphaGo Lee (March 2016) 3 AlphaGo Zero vs.

More information

Decision Making in Multiplayer Environments Application in Backgammon Variants

Decision Making in Multiplayer Environments Application in Backgammon Variants Decision Making in Multiplayer Environments Application in Backgammon Variants PhD Thesis by Nikolaos Papahristou AI researcher Department of Applied Informatics Thessaloniki, Greece Contributions Expert

More information

Kalman Tracking and Bayesian Detection for Radar RFI Blanking

Kalman Tracking and Bayesian Detection for Radar RFI Blanking Kalman Tracking and Bayesian Detection for Radar RFI Blanking Weizhen Dong, Brian D. Jeffs Department of Electrical and Computer Engineering Brigham Young University J. Richard Fisher National Radio Astronomy

More information

A Quick Guide to Understanding the Impact of Test Time on Estimation of Mean Time Between Failure (MTBF)

A Quick Guide to Understanding the Impact of Test Time on Estimation of Mean Time Between Failure (MTBF) A Quick Guide to Understanding the Impact of Test Time on Estimation of Mean Time Between Failure (MTBF) Authored by: Lenny Truett, Ph.D. STAT T&E COE The goal of the STAT T&E COE is to assist in developing

More information

Training a Minesweeper Solver

Training a Minesweeper Solver Training a Minesweeper Solver Luis Gardea, Griffin Koontz, Ryan Silva CS 229, Autumn 25 Abstract Minesweeper, a puzzle game introduced in the 96 s, requires spatial awareness and an ability to work with

More information

Universiteit Leiden Opleiding Informatica

Universiteit Leiden Opleiding Informatica Universiteit Leiden Opleiding Informatica Predicting the Outcome of the Game Othello Name: Simone Cammel Date: August 31, 2015 1st supervisor: 2nd supervisor: Walter Kosters Jeannette de Graaf BACHELOR

More information

Experiments on Alternatives to Minimax

Experiments on Alternatives to Minimax Experiments on Alternatives to Minimax Dana Nau University of Maryland Paul Purdom Indiana University April 23, 1993 Chun-Hung Tzeng Ball State University Abstract In the field of Artificial Intelligence,

More information

USING A FUZZY LOGIC CONTROL SYSTEM FOR AN XPILOT COMBAT AGENT ANDREW HUBLEY AND GARY PARKER

USING A FUZZY LOGIC CONTROL SYSTEM FOR AN XPILOT COMBAT AGENT ANDREW HUBLEY AND GARY PARKER World Automation Congress 21 TSI Press. USING A FUZZY LOGIC CONTROL SYSTEM FOR AN XPILOT COMBAT AGENT ANDREW HUBLEY AND GARY PARKER Department of Computer Science Connecticut College New London, CT {ahubley,

More information

Dynamic Programming in Real Life: A Two-Person Dice Game

Dynamic Programming in Real Life: A Two-Person Dice Game Mathematical Methods in Operations Research 2005 Special issue in honor of Arie Hordijk Dynamic Programming in Real Life: A Two-Person Dice Game Henk Tijms 1, Jan van der Wal 2 1 Department of Econometrics,

More information

Chapter 3: Alarm correlation

Chapter 3: Alarm correlation Chapter 3: Alarm correlation Algorithmic Methods of Data Mining, Fall 2005, Chapter 3: Alarm correlation 1 Part II. Episodes in sequences Chapter 3: Alarm correlation Chapter 4: Frequent episodes Chapter

More information

More on games (Ch )

More on games (Ch ) More on games (Ch. 5.4-5.6) Alpha-beta pruning Previously on CSci 4511... We talked about how to modify the minimax algorithm to prune only bad searches (i.e. alpha-beta pruning) This rule of checking

More information

Why Randomize? Jim Berry Cornell University

Why Randomize? Jim Berry Cornell University Why Randomize? Jim Berry Cornell University Session Overview I. Basic vocabulary for impact evaluation II. III. IV. Randomized evaluation Other methods of impact evaluation Conclusions J-PAL WHY RANDOMIZE

More information

Adaptive Antennas in Wireless Communication Networks

Adaptive Antennas in Wireless Communication Networks Bulgarian Academy of Sciences Adaptive Antennas in Wireless Communication Networks Blagovest Shishkov Institute of Mathematics and Informatics Bulgarian Academy of Sciences 1 introducing myself Blagovest

More information

Artificial Intelligence. Minimax and alpha-beta pruning

Artificial Intelligence. Minimax and alpha-beta pruning Artificial Intelligence Minimax and alpha-beta pruning In which we examine the problems that arise when we try to plan ahead to get the best result in a world that includes a hostile agent (other agent

More information

Game Theory and Randomized Algorithms

Game Theory and Randomized Algorithms Game Theory and Randomized Algorithms Guy Aridor Game theory is a set of tools that allow us to understand how decisionmakers interact with each other. It has practical applications in economics, international

More information

Plan Execution Monitoring through Detection of Unmet Expectations about Action Outcomes

Plan Execution Monitoring through Detection of Unmet Expectations about Action Outcomes Plan Execution Monitoring through Detection of Unmet Expectations about Action Outcomes Juan Pablo Mendoza 1, Manuela Veloso 2 and Reid Simmons 3 Abstract Modeling the effects of actions based on the state

More information

Defense Technical Information Center Compilation Part Notice

Defense Technical Information Center Compilation Part Notice UNCLASSIFIED Defense Technical Information Center Compilation Part Notice ADPO 11345 TITLE: Measurement of the Spatial Frequency Response [SFR] of Digital Still-Picture Cameras Using a Modified Slanted

More information

Predicting outcomes of professional DotA 2 matches

Predicting outcomes of professional DotA 2 matches Predicting outcomes of professional DotA 2 matches Petra Grutzik Joe Higgins Long Tran December 16, 2017 Abstract We create a model to predict the outcomes of professional DotA 2 (Defense of the Ancients

More information

Advanced Techniques for Mobile Robotics Location-Based Activity Recognition

Advanced Techniques for Mobile Robotics Location-Based Activity Recognition Advanced Techniques for Mobile Robotics Location-Based Activity Recognition Wolfram Burgard, Cyrill Stachniss, Kai Arras, Maren Bennewitz Activity Recognition Based on L. Liao, D. J. Patterson, D. Fox,

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Deep Learning Barnabás Póczos Credits Many of the pictures, results, and other materials are taken from: Ruslan Salakhutdinov Joshua Bengio Geoffrey Hinton Yann LeCun 2

More information

Chapter 3 Novel Digital-to-Analog Converter with Gamma Correction for On-Panel Data Driver

Chapter 3 Novel Digital-to-Analog Converter with Gamma Correction for On-Panel Data Driver Chapter 3 Novel Digital-to-Analog Converter with Gamma Correction for On-Panel Data Driver 3.1 INTRODUCTION As last chapter description, we know that there is a nonlinearity relationship between luminance

More information

Image Enhancement in Spatial Domain

Image Enhancement in Spatial Domain Image Enhancement in Spatial Domain 2 Image enhancement is a process, rather a preprocessing step, through which an original image is made suitable for a specific application. The application scenarios

More information

Comp 3211 Final Project - Poker AI

Comp 3211 Final Project - Poker AI Comp 3211 Final Project - Poker AI Introduction Poker is a game played with a standard 52 card deck, usually with 4 to 8 players per game. During each hand of poker, players are dealt two cards and must

More information

arxiv: v1 [cs.cc] 21 Jun 2017

arxiv: v1 [cs.cc] 21 Jun 2017 Solving the Rubik s Cube Optimally is NP-complete Erik D. Demaine Sarah Eisenstat Mikhail Rudoy arxiv:1706.06708v1 [cs.cc] 21 Jun 2017 Abstract In this paper, we prove that optimally solving an n n n Rubik

More information

IMAGE ENHANCEMENT IN SPATIAL DOMAIN

IMAGE ENHANCEMENT IN SPATIAL DOMAIN A First Course in Machine Vision IMAGE ENHANCEMENT IN SPATIAL DOMAIN By: Ehsan Khoramshahi Definitions The principal objective of enhancement is to process an image so that the result is more suitable

More information

Creating a Poker Playing Program Using Evolutionary Computation

Creating a Poker Playing Program Using Evolutionary Computation Creating a Poker Playing Program Using Evolutionary Computation Simon Olsen and Rob LeGrand, Ph.D. Abstract Artificial intelligence is a rapidly expanding technology. We are surrounded by technology that

More information

Introduction. Chapter Time-Varying Signals

Introduction. Chapter Time-Varying Signals Chapter 1 1.1 Time-Varying Signals Time-varying signals are commonly observed in the laboratory as well as many other applied settings. Consider, for example, the voltage level that is present at a specific

More information

Alternation in the repeated Battle of the Sexes

Alternation in the repeated Battle of the Sexes Alternation in the repeated Battle of the Sexes Aaron Andalman & Charles Kemp 9.29, Spring 2004 MIT Abstract Traditional game-theoretic models consider only stage-game strategies. Alternation in the repeated

More information

Programming an Othello AI Michael An (man4), Evan Liang (liange)

Programming an Othello AI Michael An (man4), Evan Liang (liange) Programming an Othello AI Michael An (man4), Evan Liang (liange) 1 Introduction Othello is a two player board game played on an 8 8 grid. Players take turns placing stones with their assigned color (black

More information

CS440/ECE448 Lecture 11: Stochastic Games, Stochastic Search, and Learned Evaluation Functions

CS440/ECE448 Lecture 11: Stochastic Games, Stochastic Search, and Learned Evaluation Functions CS440/ECE448 Lecture 11: Stochastic Games, Stochastic Search, and Learned Evaluation Functions Slides by Svetlana Lazebnik, 9/2016 Modified by Mark Hasegawa Johnson, 9/2017 Types of game environments Perfect

More information

5.4 Imperfect, Real-Time Decisions

5.4 Imperfect, Real-Time Decisions 116 5.4 Imperfect, Real-Time Decisions Searching through the whole (pruned) game tree is too inefficient for any realistic game Moves must be made in a reasonable amount of time One has to cut off the

More information