CS221 Project Final Report Deep Q-Learning on Arcade Game Assault


Fabian Chan (fabianc), Xueyuan Mei (xmei9), You Guan (you17)
Joint project with CS229

1 Introduction

Atari 2600 Assault is a game environment provided on the OpenAI Gym platform; it is a top-down shoot 'em up game in which the player earns reward points for destroying enemy ships. The enemy consists of a mothership and smaller vessels that shoot at the player. The player can move and shoot in various directions, with a total of 7 actions available. Every time the player shoots, a heat meter tracks how hot the engine is; if the player shoots too frequently, the meter fills up and the player loses a life to overheating. The player can also lose a life upon taking fire from enemy ships. The game ends when the player runs out of lives. We create an AI agent that generates optimal actions by taking raw pixels as features and feeding them into a convolutional neural network (CNN), an approach known as deep Q-learning.

2 Literature Review

The first paper, Playing Atari with Deep Reinforcement Learning [1], addresses how convolutional neural networks and deep reinforcement learning can be combined to build a high-performance AI agent that plays Atari games. The paper analyzes how reinforcement learning (RL) provides a good solution to the game-playing problem, as well as the challenges RL poses for deep learning from a data-representation perspective. It proposes deep reinforcement learning for game-playing agents, which is similar to our goal. However, there are differences between our approach and theirs: their approach relies on heavily downsampling images before feeding them into a neural network, while ours tries to avoid this downsampling procedure in an attempt to produce better data layers for deep Q-learning.

The second related paper, Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning [2], proposes another solution to game playing. Compared to the first paper, which uses a model-free Q-learning strategy, this paper tackles the problem with a combination of Monte-Carlo techniques and deep Q-learning, ending up with a much more sophisticated algorithm that adds extra assumptions and considerable complexity. Though we choose the model-free learning approach, this paper still provides insights into deep learning applied to games, such as the preprocessing of raw data and the architecture of convolutional neural networks.

3 Task Definition

The goal is to create an AI controller that tries to maximize the total game score. To achieve this, the problem is broken down into finding a score for each action (the Q-value) given the current state of the game; the optimal action to take is then the one that corresponds to the highest Q-value.
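As a minimal illustration of this action-selection rule (a sketch; the Q-values below are hypothetical placeholders for the network outputs described in Section 6):

```python
import numpy as np

# Hypothetical Q-values, one per available action (Assault exposes 7 actions).
q_values = np.array([0.1, 0.4, -0.2, 0.9, 0.0, 0.3, -0.5])

# Greedy policy: take the action whose estimated Q-value is highest.
best_action = int(np.argmax(q_values))  # -> 3
```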

We note that the scope of this project is limited to the Assault game; though our agent might work on other games with varying degrees of performance, we are not considering other games at this time. We will evaluate our agent by comparing its final scores against both the baseline and the oracle. The baseline corresponds to the performance of simple Q-learning with a simple feature extractor, with an average score of 670.7, while the oracle corresponds to the performance achieved by a top human player, with a score of 4153 [3]. This is explained in more detail in Section 5.3.

4 Infrastructure

OpenAI Gym is a platform for developing AI agents. It offers a Python API for interacting with the game, as well as providing the game environment represented as pixels along with the reward earned at every timestep. All of this allows us to spend less time writing the game itself and more time working on the AI agent. At each step, the agent takes an action and receives an observation and reward from the environment. An RL algorithm seeks to maximize some measure of the agent's total reward as the agent interacts with the environment, either online or in batches: the agent takes the tuple (observation, reward, done) as input at each timestep and either performs learning updates incrementally or collects the tuples for later use in a batch update.

Our infrastructure can be thought of as a processing pipeline and can be summarized as follows: game observations are collected from OpenAI Gym as raw pixel data, images drawn from a 128-color palette. The data is then processed and fed as input into a CNN; the outputs of the CNN are Q-values corresponding to the list of actions.

While there are many software libraries available for implementing neural networks, the tool we have chosen is TensorFlow, a popular open-source library used by various researchers and companies. With its flexible architecture and multidimensional data arrays, also called tensors, we can implement convolutional neural networks for reinforcement learning.

The final note regarding our infrastructure is that we used GPU computation. The reinforcement learning process is computationally heavy enough that we decided to leverage a GPU to perform the gradient-based updates on our neural network. Since TensorFlow supports computation on the GPU through NVIDIA's parallel computing API, CUDA, we gain a speedup of 2x-10x compared to relying on a CPU.

5 Approach

5.1 Modeling States and Actions

Our model defines its states and actions in a fairly straightforward manner: a state consists of the pixel values of the game screen taken over a window of k consecutive frames. This is a numpy.array of shape (k, 250, 160, 3), where the second dimension x is the height of the screen, the third dimension y is the width of the screen, and the fourth holds the RGB values of the pixel at coordinate (x, y). We want our state definition to contain not just the current frame but the last few frames, because Assault is a dynamic game and a single frame is not enough to determine the motion (direction and velocity) of the various game entities.
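A minimal sketch of this state construction, assuming the classic Gym API for Assault-v0 and k = 3; the helper below is illustrative rather than our exact implementation:

```python
from collections import deque

import gym
import numpy as np

K = 3  # number of consecutive frames kept in the state

env = gym.make("Assault-v0")
obs = env.reset()                    # raw RGB frame
frames = deque([obs] * K, maxlen=K)  # seed the window by repeating the first frame

def current_state(frames):
    """Stack the last K frames into a single array of shape (K, height, width, 3)."""
    return np.stack(frames, axis=0)

obs, reward, done, info = env.step(env.action_space.sample())
frames.append(obs)                   # sliding window: the oldest frame drops out
state = current_state(frames)
```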

Our assumption is that the most recent k frames provide sufficient information to calculate the optimal action(s) to take next. For our initial experiments we chose k = 3. The set of actions is simply the 7 actions made available to the player: moving and shooting in various directions, as well as a do-nothing action.

5.2 Challenges and Methods to Address Them

There are two major challenges we need to pay attention to in this project. First, to ensure reasonable progress, the ability to iterate quickly is crucial; in order to save training time and computing resources, we needed to simplify the state space and reduce the number of features before feeding them into our training algorithm. Second, because we expect training to take several days, we need to ensure that it converges to a good local minimum (if not a global one) within a reasonable amount of time. Experience replay is the technique we ended up using to help with convergence. Furthermore, a tradeoff needs to be made between exploration and exploitation, balancing running time against how well the result can be optimized. We decided to use simple epsilon-greedy exploration for this, and the main task in addressing this challenge is finding suitable value(s) for epsilon.

5.3 Baseline and Oracle

The baseline for our project is the performance of a simple Q-learning algorithm with a simple feature extractor. The reward for the algorithm is the score the player receives. The value of Q_opt is calculated as w · φ(s, a), where w is the weight vector, φ is the feature vector, s is the state, and a is the action. For this simple algorithm, we set k = 1, discount factor γ = 1, and ɛ = 0.3, and used a feature extractor that only indicates whether a pixel is black or not. We ran this simple Q-learning algorithm 5 times, each run lasting 3 hours. The average score over the 5 runs of this naive Q-learning approach is 670.7. The oracle for our project is the performance achieved by a top human player, which is 4153 [3]. The baseline and oracle scores give us a rough idea of how well the AI should perform.

6 Learning Algorithm

6.1 Q-Learning for Game Playing

One common way to deal with the game-playing problem is to assume a Markov Decision Process (MDP). This is appropriate for Assault because the enemy agents move randomly. An MDP is a model defined by a set of states, actions, transitions, and rewards. To train an AI to tackle game-playing tasks, reinforcement learning based on Q-learning is a popular choice. In Q-learning, the MDP recurrence is defined as

    Q(s, a) = E_{s'}[ r + γ max_{a'} Q(s', a') | s, a ]    (1)

where a is the action taken, s is the current state, r is the reward, and γ is the discount factor. Furthermore, we can use function approximation by parameterizing the Q-value; in this way, we can easily adapt linear regression and gradient descent techniques from machine learning.
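As an illustration of this kind of function approximation, here is a sketch of a linear Q estimate in the spirit of our baseline's black-pixel indicator features; the exact feature layout is an assumption, since the report only states that the extractor indicates whether a pixel is black:

```python
import numpy as np

NUM_ACTIONS = 7

def phi(frame, action):
    """Per-action copy of 'is this pixel black?' indicator features for a grayscale frame."""
    black = (frame.reshape(-1) == 0).astype(np.float64)      # 1 where the pixel is black
    features = np.zeros(NUM_ACTIONS * black.size)
    features[action * black.size:(action + 1) * black.size] = black
    return features

def q_hat(frame, action, w):
    """Linear approximation of the Q-value: Q(s, a; w) = w . phi(s, a)."""
    return float(np.dot(w, phi(frame, action)))
```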

With function approximation, we can calculate the best weights by adapting the update rule

    w ← w - η [ Q̂_opt(s, a; w) - (r + γ V̂_opt(s')) ] φ(s, a)    (2)

where w is a vector containing the weight of each feature; it is initialized randomly to avoid falling into the same local optimum in every trial.

6.2 Convolutional Neural Networks as Function Approximators

We decided to use deep Q-learning instead of ordinary Q-learning because there are too many possible game states. The size of one frame is 216 by 160 pixels, and each pixel can take many possible RGB values; furthermore, a sliding window of k frames multiplies the number of possible states again. This is far too large for ordinary Q-learning, because it would result in too many rows in our imaginary Q-table. We therefore use a neural network to learn the Q-values instead; in effect, this neural network operates as a function approximator. The network architecture is as follows:

- Preprocess frames: convert RGB pixels to grayscale and threshold to black or white
- Input layer: takes in the preprocessed frames. Size: [k, 160, 250, 1]
- Hidden convolutional layer 1: kernel size [8, 8, k, 32], strides [1, 4, 4, 1]
- Max-pooling layer 1: kernel size [1, 2, 2, 1], strides [1, 2, 2, 1]
- Hidden convolutional layer 2: kernel size [4, 4, 32, 64], strides [1, 2, 2, 1]
- Max-pooling layer 2: kernel size [1, 2, 2, 1], strides [1, 2, 2, 1]
- Hidden convolutional layer 3: kernel size [3, 3, 64, 64], strides [1, 1, 1, 1]
- Max-pooling layer 3: kernel size [1, 2, 2, 1], strides [1, 2, 2, 1]
- Reshape the max-pooling outputs into a vector of size [768] and feed it to one fully connected layer
- Feed the outputs through a rectified linear activation function
- Collect the final output as 7 Q-values, each corresponding to an action

Though we have 3 max-pooling layers in our network architecture, this is not what we had in mind at the beginning. Unlike in CNN architectures typically used for computer vision tasks such as image classification, pooling layers may not be desirable for our purposes, because we likely do not want to introduce translation invariance: the positions of the game entities are important for estimating Q-values. However, max-pooling serves as an adequate way to compress our large state space into a vector of size 768, which is why we have kept it. One alternative to these max-pooling layers is to use larger strides in each hidden convolutional layer, but given that the current strides are already substantial, we have been hesitant to increase them further. We do not dismiss it as a bad idea, though, and would like to try it given an additional month or two. Sketches of the preprocessing step and of the overall architecture are given below.
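First, the preprocessing step (a sketch; the luminance weights and the threshold value are assumptions, since the report does not specify them):

```python
import numpy as np

def preprocess(frame, threshold=0.5):
    """Convert an RGB frame to grayscale, then threshold it to black (0) or white (1)."""
    gray = 0.299 * frame[..., 0] + 0.587 * frame[..., 1] + 0.114 * frame[..., 2]
    gray = gray / 255.0                            # scale to [0, 1]
    return (gray > threshold).astype(np.float32)   # 1 = white, 0 = black
```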

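Next, a rough sketch of the network itself, written with tf.keras layers rather than the lower-level TensorFlow ops we actually used; stacking the k frames as input channels, applying ReLU inside the convolutional layers, and the width of the hidden fully connected layer are all simplifying assumptions:

```python
import tensorflow as tf

K_FRAMES = 3      # preprocessed frames stacked along the channel axis (an assumption)
NUM_ACTIONS = 7

def build_q_network():
    """Conv/pool stack, one fully connected layer, and a final layer of 7 Q-values."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, kernel_size=8, strides=4, padding="same",
                               activation="relu", input_shape=(250, 160, K_FRAMES)),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        tf.keras.layers.Conv2D(64, kernel_size=4, strides=2, padding="same", activation="relu"),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        tf.keras.layers.Conv2D(64, kernel_size=3, strides=1, padding="same", activation="relu"),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),  # hidden width is an assumption
        tf.keras.layers.Dense(NUM_ACTIONS),             # one Q-value per action
    ])

q_network = build_q_network()
```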
6.3 Training Process

The OpenAI Gym library provides game screen observations as pixel values, which we use to construct the state. We initialize the weights and biases of every layer of the neural network to random values drawn from a Gaussian distribution centered at zero. Then, at each time step, we take a minibatch of size 100 and feed it into the network for training. However, instead of starting training right away, our algorithm waits for 500 steps: during the first 500 steps, the next action is chosen uniformly at random. Once the program has observed enough with these randomly chosen actions, it starts to train and gains the ability to choose the next action based on the last state. Our loss function is

    L = (1/2) ( r + γ max_{a'} Q(s', a') - Q(s, a) )²    (3)

6.4 Epsilon-Greedy Search

Although the agent gains the ability to choose the next action based on the last state, it is worth noting that a tradeoff must be made between exploration and exploitation, balancing running time against how well the result can be optimized. We decided to use simple ɛ-greedy exploration, where ɛ, the probability that the player chooses a random action, decreases as time goes on. This works sufficiently well for our purposes. We use a dynamic ɛ that depends on the timestep: ɛ is decreased linearly from 0.8 to 0.05, annealed over 50,000 timesteps.

6.5 Experience Replay

In general, deep neural networks are difficult to train. In the presence of multiple local optima, gradient descent may end up at a bad local minimum, which leads to poor performance. Initializing the network weights and biases to random values helps a little but is likely not sufficient. Learning directly from the newest observations is also ineffective because of the strong correlations among the most recent observations; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. During training, the current parameters determine the next data sample the parameters are trained on, so unwanted feedback loops may occur and the parameters could converge to and get stuck at a poor local optimum, or even diverge.

We incorporate a technique called experience replay to encourage the algorithm to find better optima instead of getting stuck at some underperforming local optimum. More specifically, as we run a game session during training, all experiences (s, a, r, s') are stored in a replay memory. During training, we take random samples from the replay memory instead of always grabbing the most recent transition. By breaking the similarity of subsequent training examples, this trick is likely to prevent the network from diving into some local minimum, and does so efficiently [5]. With experience replay, the behavior distribution is averaged over many previous states, smoothing out learning and avoiding divergence of the parameters. In our implementation we simply draw a uniform sample from the bank of observed transitions (the replay memory) to construct the minibatch of size 100 that we train on at each iteration. Storing all past experiences is impossible due to the enormous state space, so we retain only the most recent 20,000 transitions in the replay memory and sample from those. Sketches of the ɛ schedule and of this replay memory are given below.
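First, the annealing schedule from Section 6.4 (a sketch using the 0.8 → 0.05 over 50,000 timesteps quoted above; the helper names are illustrative):

```python
import random

EPS_START, EPS_END, ANNEAL_STEPS = 0.8, 0.05, 50000

def epsilon(t):
    """Linearly anneal epsilon from EPS_START to EPS_END over ANNEAL_STEPS timesteps."""
    frac = min(t / ANNEAL_STEPS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def choose_action(q_values, t):
    """Epsilon-greedy: a random action with probability epsilon(t), otherwise the greedy one."""
    if random.random() < epsilon(t):
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```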

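Second, the replay memory from Section 6.5 (a sketch assuming the capacity of 20,000 transitions and minibatch size of 100 quoted above):

```python
import random
from collections import deque

REPLAY_CAPACITY = 20000
MINIBATCH_SIZE = 100

replay_memory = deque(maxlen=REPLAY_CAPACITY)  # old transitions fall off automatically

def remember(state, action, reward, next_state, done):
    """Store one transition (s, a, r, s') in the replay memory."""
    replay_memory.append((state, action, reward, next_state, done))

def sample_minibatch():
    """Uniformly sample past transitions, breaking the correlation between consecutive frames."""
    return random.sample(replay_memory, min(MINIBATCH_SIZE, len(replay_memory)))
```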
7 Results and Analysis

As a preliminary evaluation, we first ran our algorithm (without experience replay) for 5 trials, each with a cutoff at the end of 36 hours, using k = 3 consecutive frames. The final score of each trial is plotted in the top half of Figure 1. From the results we can see that some trials performed quite badly and are in fact no better than the baseline Q-learning algorithm, while other trials performed significantly better than the baseline. We believe these differences in performance among the trials can be explained by the gradient descent approach we used to train our deep neural network, which is characteristically vulnerable to getting stuck at an under-performing local optimum. We arrived at this explanation because each trial is initialized with random weights and biases and the trials produce wildly different final scores, so most likely each of them ended up in a different local optimum.

We then tweaked our hyperparameters, such as ɛ (the exploration-exploitation parameter), k (the number of consecutive frames in consideration), and η (the learning rate), in a manner similar to grid search. Since we do not have a dataset to divide into training, validation and test sets (our training data comes from operating a dynamic game), we simply repeated training with various values of ɛ, k and η and kept the combination that gave the best scores; this is straightforward compared to the usual hyperparameter optimization process in machine learning. We found that setting ɛ to the dynamic schedule described in Section 6.4, k to 4, and η to 0.01 gives the best results overall, but we noticed that in general varying the hyperparameters does not significantly influence the agent's performance, so we did not devote too much time to trying different combinations. A sketch of this kind of sweep appears below.

Figure 1: Comparison: performance with and without experience replay

We obtained our final results by running the agent with the model weights computed at the end of training and recording the final score, repeated for a total of 5 trials; instead of cutting off based on time, we now terminate training after 50,000 iterations. We switched from a time-based cutoff to an iteration-based cutoff because the time it takes to train a model is mainly a function of the hardware used to train it. Furthermore, reporting scores obtained over a certain number of episodes makes them more robust against the noise/stochasticity of the OpenAI Gym environment.
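A sketch of the hyperparameter sweep described above; the candidate values and the train_and_evaluate stub are hypothetical placeholders for full training runs:

```python
import itertools

def train_and_evaluate(k, eta, eps_schedule):
    """Hypothetical stand-in: run one full training trial and return its final average score."""
    return 0.0  # a real run would train the network and play evaluation episodes

# Illustrative candidate values; we swept epsilon, k and the learning rate eta.
grid = itertools.product([3, 4], [0.001, 0.01], ["fixed_0.3", "anneal_0.8_to_0.05"])
scores = {combo: train_and_evaluate(*combo) for combo in grid}
best_k, best_eta, best_eps = max(scores, key=scores.get)
```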

Figure 2: Average scores over training episodes

This time we also incorporated experience replay into the training procedure. The scores we obtained are plotted in the bottom half of Figure 1. Comparing these results with those above, we can see that experience replay produced results that are more consistent and stable. This is because the technique effectively prevented the algorithm from getting stuck at some bad local optimum, and thus helped achieve (slightly) higher and more consistent results.

To aid our analysis, we plot the average score per 20 consecutive training episodes over one complete trial in Figure 2. At the beginning there is a small number of episodes during which the score stays low and does not improve; this corresponds to the starting period when we do not apply training and only let the agent choose actions randomly to gather observations. Afterwards the score rises rapidly to around 600, because the game mode stays the same up to this point and it is quite easy to get there. The agent then appears to get stuck at around 600 for over 1500 episodes, because at this point the game jumps in difficulty: new enemy entities known as crawlers appear to the left and right of the agent, in addition to the enemy ships already hovering above. As if it had encountered a roadblock, the agent was unable to make much progress past 600 for quite some time, but it did eventually learn to overcome this obstacle. So we conclude that even with this change in difficulty, our agent simply needed some time to adjust and continue learning. From 600 onwards the agent improves more slowly than it did from the start to 600, because the game is no longer as easy as it was in the beginning. Finally, the score saturates at around 1000, which is roughly the final score the agent was able to achieve.

8 Conclusion

In this project, we implemented a game-playing agent for Atari Assault using deep Q-learning. We first implemented ordinary Q-learning to obtain a baseline score of 670.7, then implemented deep Q-learning by constructing a convolutional neural network using TensorFlow. We obtained promising results after experimenting with and without experience replay. For us, experience replay worked well in helping the agent avoid getting stuck at some bad local optimum.

Some of our improvements also came from extending the training time and tweaking hyperparameters. Our deep Q-learning agent managed to significantly outperform the baseline: the average score it obtained was 980, while the ordinary Q-learning baseline has an average score of 670.7. Although our agent does significantly better than the baseline, it still does not come close to the oracle. We believe there are still many things we can try to improve, such as revising the neural network architecture, adding customized feature extractors, and experimenting with more fully connected layers.

9 References

1. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602.
2. Guo, X., Singh, S., Lee, H., Lewis, R., & Wang, X. (2014). Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning. Advances in Neural Information Processing Systems (NIPS).
3. Assault (1983, Bomb) - Atari Score. Retrieved November 16, 2016.
4. Assault-v0. OpenAI Gym environment documentation.
5. Matiisen, T. Demystifying Deep Reinforcement Learning. Retrieved November 16, 2016.
