Creating an Agent of Doom: A Visual Reinforcement Learning Approach


Michael Lowney, Department of Electrical Engineering, Stanford University, mlowney@stanford.edu
Robert Mahieu, Department of Electrical Engineering, Stanford University, rmahieu@stanford.edu

Abstract — We investigate the use of the Deep Q-Network deep learning model for playing complex video games such as first-person shooters. Due to the availability of the ViZDoom interface framework, we chose to experiment with the 1993 PC game Doom. By modeling the game as an extremely large Markov Decision Process, we are able to employ Q-learning techniques to train the parameters of a neural network Q-function approximation. Using ViZDoom, we were able to train an artificial intelligence agent capable of completing tasks in simple scenarios using only the current screen image as input to the network. Results show that, for the scenarios tested, the agent was able to learn successful strategies in relatively small amounts of training time.

I. INTRODUCTION

In 2013, the Google DeepMind team published a fascinating paper detailing a method for using reinforcement learning techniques to train an artificial intelligence agent to play Atari 2600 video games using only the screen's raw pixel data and the game score as input [1]. The project proposed a new deep learning model, dubbed the Deep Q-Network (DQN), that uses a variation of Q-learning to train a convolutional neural network (CNN) to make decisions on which actions to take in the game. As such, it has the ability to learn and develop policies using high-dimensional input data. This project attempts to tackle the challenge of extending this study by training an AI agent to play a first-person shooter (FPS) game, specifically the 1993 classic PC game Doom. This goal is made tractable by the existence of an interface framework called ViZDoom, which was developed as part of the Visual Doom AI Competition at the 2016 Computational Intelligence & Games (CIG) conference [2]. The framework provides access to the screen buffer as well as game state information such as player health, armor, ammo, etc., making application of the DQN model possible. Therefore, to achieve our goal, we train a CNN using the DQN model and game data relayed from ViZDoom, which then develops the ability to make decisions on which keys to press based on the current screen image.

II. MODEL

A. Network Architecture

The CNN takes as input a 60x40 pixel single-channel grayscaled version of the screen buffer, which is then passed through three convolutional layers, each followed by a respective max-pooling subsampling layer. The first convolutional layer contains 32 8x8 filters, the second 64 4x4 filters, and the third 64 3x3 filters. Each max-pooling layer uses a 2x2 shape. After this, the data is passed through a fully-connected 512-node rectifier layer, then a final fully-connected linear output layer. At the output of the network, each node corresponds to a possible combination of button presses, and the weight at that node corresponds to the Q-value of taking that action from the current state. This network architecture is illustrated in Figure 1. At each layer of our network a Rectified Linear Unit (ReLU) was used as the activation function. The weights were initialized according to the method proposed by He et al. in [4], which has been shown to perform well with networks that use ReLUs as the activation function.
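To make the layer specification above concrete, the following is a minimal sketch of the described architecture in PyTorch. The paper does not state which framework was used, and strides, padding, and the height/width ordering of the input are our assumptions; only the filter counts, kernel sizes, pooling shape, and layer sizes come from the text.

import torch
import torch.nn as nn

class DoomDQN(nn.Module):
    """Sketch of the Q-network described above: three conv + 2x2 max-pool blocks,
    a 512-unit fully-connected ReLU layer, and a linear output producing one
    Q-value per button combination."""
    def __init__(self, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Infer the flattened feature size from a dummy 1x40x60 input
        # (height=40, width=60 ordering is an assumption).
        with torch.no_grad():
            n_flat = self.features(torch.zeros(1, 1, 40, 60)).numel()
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_flat, 512), nn.ReLU(),
            nn.Linear(512, num_actions),   # one Q-value per action
        )
        # He-style initialization, as referenced above [4].
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.kaiming_normal_(m.weight, nonlinearity='relu')

    def forward(self, screen):
        return self.head(self.features(screen))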
B. Markov Decision Process (MDP)

The process of playing the game is modeled as an extremely large, but finite, Markov Decision Process. A state, s, is simply the image pixel data contained in the screen buffer. The set of possible actions A = {a_1, a_2, ..., a_K} represents all possible combinations of button presses. For example, in the basic scenario that was tested, in which the game map is a single corridor with a stationary monster on one end and the AI agent on the other, a very small set of actions is used, representing all 2^3 = 8 binary combinations of [left, right, shoot].
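As an illustration of this action encoding, a hypothetical helper that enumerates all 2^3 button combinations for the basic scenario (the button names and their ordering are our assumption):

from itertools import product

# Each action is a binary vector [move_left, move_right, attack]; enumerating
# every combination gives the K = 2^3 = 8 actions of the basic scenario.
BUTTONS = ["MOVE_LEFT", "MOVE_RIGHT", "ATTACK"]
ACTIONS = [list(bits) for bits in product([0, 1], repeat=len(BUTTONS))]

print(len(ACTIONS))   # 8
print(ACTIONS[5])     # [1, 0, 1] -> move left and shoot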

The reward system is where the ViZDoom framework becomes especially useful: because the framework allows access to game state information, we are able to define rewards as a function of events in the game, such as a monster or the player dying. The maps/scenarios themselves are hardcoded using editing software to return rewards to the framework when certain events occur. For example, in the basic scenario, the simple corridor map used generates a +101 reward upon death of the monster, a -5 reward when shooting but missing, and a -1 reward every frame while the player is still alive. Further discussion of the specific rewards implemented is provided below in Section V.

Fig. 1: Depiction of the neural network architecture. Three convolutional layers with respective max-pooling subsampling, then a fully-connected rectification layer followed by a fully-connected output layer. Input is a 1-channel image and output is a weight associated with each possible game action.

III. ALGORITHM

To train the agent to play the game, an ε-greedy Q-learning approach is used in an attempt to determine the optimal policy for the MDP. This optimal policy represents the ideal game strategy at each state s, and is found by selecting the action a ∈ A at the current state s that maximizes the following recursion:

Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

In the above formulation, r represents the reward given for taking action a at state s, and γ ∈ [0, 1) is the discount factor, which controls how much we care about future versus immediate rewards. The optimal Q-value, Q*(s, a), represents the expected score that would be achieved by taking the action a and then continually choosing actions that are expected to return the largest Q-value until the game ends. Note that this is an off-policy learning algorithm since we are not directly evaluating Q-values based on some set policy. Because the state dimensionality and state space are so large, in order to promote adequate learning generalization, a neural network function approximation, referred to as a Q-network, parameterized by θ, is used to approximate the Q-function such that:

Q(s, a; θ) ≈ Q*(s, a)

The parameters θ are optimized using stochastic gradient descent (SGD) on the cost function below, representing the squared difference between the current target and prediction values:

L(θ_i) = ( r + γ max_{a'} Q(s', a'; θ_i) − Q(s, a; θ_i) )^2

A. Experience Replay

For the SGD update procedure, a technique called experience replay is utilized. This technique involves storing a large set of previous transitions e_t = (s_t, a_t, r_t, s'_t) in a dataset D_t = {e_1, ..., e_t} called the replay memory, across many run-throughs of the agent playing the game. During the update step at a given iteration t, a small set of transitions, referred to as a minibatch, is sampled uniformly at random from the dataset D_t. Then, using the RMSProp algorithm, this minibatch is used to perform an update on the network parameters. This technique avoids training on consecutive transitions, which would lead to erratic updates and slower learning due to the high correlation between consecutive states. The replay memory is large but finite; once it is full, newly added transitions replace the oldest ones.
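A minimal sketch of how the replay memory and the squared-error update described above could fit together, assuming PyTorch and the DoomDQN sketch from Section II. The memory capacity, batch size, and learning rate are placeholder values, and the terminal-state mask (1 - done) is a standard detail not spelled out in the text.

import random
from collections import deque

import torch
import torch.nn.functional as F

replay_memory = deque(maxlen=10_000)   # once full, the oldest transitions are dropped

def store(state, action, reward, next_state, done):
    replay_memory.append((state, action, reward, next_state, done))

def update(q_net, optimizer, batch_size=64, gamma=0.99):
    """One RMSProp step on the squared difference between target and prediction."""
    if len(replay_memory) < batch_size:
        return
    batch = random.sample(replay_memory, batch_size)         # uniform random minibatch
    s, a, r, s_next, done = zip(*batch)
    s, s_next = torch.stack(s), torch.stack(s_next)
    a = torch.tensor(a)
    r = torch.tensor(r, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; θ)
    with torch.no_grad():                                     # r + γ max_a' Q(s', a'; θ)
        q_target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_pred, q_target)                       # squared TD error

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)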
B. ε-greedy Q-Learning

In the ε-greedy learning implementation, the parameter ε refers to the probability that the agent takes a random action a ∈ A while playing the game during training, instead of taking the action arg max_{a ∈ A} Q(s, a; θ) based on the current network parameterization θ. Starting with an ε close to one allows for exploration of the state space at the beginning of training; however, to allow the training to converge, we decrease the ε parameter linearly over the training time from 0.99 to 0.10, causing the agent to more closely follow and refine its developed strategy.

C. Action Repeat

Another technique used was action repeat, in which the agent repeats an action choice over some number of frames before evaluating the Q-function again and choosing a new action to take. This technique provides several advantages. Using action repeat avoids populating the replay memory with, and thus training excessively on, very similar states, which leads to faster training. It also gives the game time to react to the selected action and respond with an appropriate reward, allowing the agent to more accurately associate the action chosen with the reward returned. In our implementation we found an action repeat of 4 frames worked well.
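A sketch of how ε-greedy action selection with linear decay and a 4-frame action repeat could be wired together. The decay endpoints (0.99 to 0.10) and the repeat of 4 frames come from the text; the function and variable names are our own, and we assume the standard ViZDoom Python bindings, where make_action accepts a frame-repeat count.

import random
import torch

EPS_START, EPS_END = 0.99, 0.10
FRAME_REPEAT = 4                      # repeat each chosen action over 4 frames

def epsilon(step, total_steps):
    """Linearly decay the exploration probability over the training run."""
    frac = min(step / total_steps, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def select_action(q_net, state, step, total_steps, actions):
    if random.random() < epsilon(step, total_steps):
        return random.randrange(len(actions))                # explore
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax())       # exploit: arg max_a Q(s, a; θ)

# Inside the game loop (game is a vizdoom.DoomGame instance, ACTIONS as enumerated earlier):
#   a = select_action(q_net, state, step, total_steps, ACTIONS)
#   reward = game.make_action(ACTIONS[a], FRAME_REPEAT)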

Fig. 2: Example screen images from the three test scenarios: (a) Basic, (b) Health-gathering, (c) Prediction.

IV. BASELINE & ORACLE

Because training an agent for a multiplayer game of Doom is such a complex undertaking, we decided to instead break the problem up into simpler scenarios. The three scenarios we examined are explained in detail in Section V, though example screen images from each can be viewed in Figure 2. The scenarios emphasize two main points of the game that an agent should be proficient at: killing monsters and collecting health packs. To adequately evaluate the performance of our system, we first implemented a baseline algorithm which simply took random actions (i.e., ε = 1.0) for the entirety of its playtime. As might be expected, this resulted in very poor average performance. The average scores obtained by the agent in each test scenario using this algorithm are summarized in Table I. To give an upper bound on what we might aim for in regards to good performance, our oracle was simply the average score achieved by a human playing the game in each scenario. These scores are also summarized in Table I.

TABLE I: Baseline and oracle performance results.

Scenario            Baseline Avg. Score    Oracle Avg. Score
Basic               -315                   90.52
Health-gathering    466.4                  2100
Prediction          0.0377                 0.7396

V. RESULTS

The first scenario we decided to explore, referred to as the basic scenario, has the agent at one end of a square room and a monster at the other end. The agent can only take three possible actions: move left, move right, and shoot. The monster can be killed in one shot, and the agent gets a reward of +101 if the monster is killed. The agent gets a reward of -1 for being alive at each state, and a reward of -5 for shooting and missing the monster, which incentivizes both precision and speed. The game engine returns these rewards to the agent as a result of its interactions with the world. Note that custom rewards can also be created, such as by using the engine to detect changes in the agent's amount of ammo. Within our implementation, the agent makes its decision on which action to take using only the screen buffer. The agent was trained over 20 epochs, each consisting of 5000 iterations to update the Q-function. After each epoch the agent was tested on 100 episodes. An episode ends once the monster is killed, or after a timeout of 300 actions is reached. The average score per episode for testing and training is shown in Figure 3 and Figure 4, respectively. The training scores represent the average score over each epoch of training, and the testing scores represent the average score after each epoch. The average training score steadily increases until about the 9th epoch, where it starts to level out. The testing scores also reach their maximum values in about 9 epochs. The agent outperformed the baseline and came close to the oracle score, with an average score of 85.45 for the basic scenario.

The second scenario we tested, called the health-gathering scenario, placed the agent in a square room where the floor is covered with acid. The agent can choose any combination of the following actions: move forward, turn left, and turn right. Standing in the acid slowly reduces the agent's health. Health packs are spawned at random throughout the room, and the agent must pick up these health packs in order to stay alive. In this scenario the agent only gets a reward of +1 for staying alive; there is no explicit reward given for collecting health packs. The agent must learn that collecting health packs is necessary to maximize its score. For this scenario we used a timeout of 2100 actions, so the maximum score that can be achieved is 2100.
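One way the epoch schedule described above (5000 learning iterations per epoch, followed by 100 test episodes) could be organized is sketched below. It assumes the helpers from the earlier snippets (select_action, store, update, FRAME_REPEAT), a hypothetical preprocess function that grayscales and resizes the screen buffer, and the standard ViZDoom Python API; it is not the authors' code.

import torch

EPOCHS = 20
LEARNING_STEPS_PER_EPOCH = 5000
TEST_EPISODES_PER_EPOCH = 100

def run_training(game, q_net, optimizer, actions):
    step, total_steps = 0, EPOCHS * LEARNING_STEPS_PER_EPOCH
    for epoch in range(EPOCHS):
        # Training phase: 5000 Q-function updates per epoch.
        game.new_episode()
        for _ in range(LEARNING_STEPS_PER_EPOCH):
            state = preprocess(game.get_state().screen_buffer)
            a = select_action(q_net, state, step, total_steps, actions)
            reward = game.make_action(actions[a], FRAME_REPEAT)
            done = game.is_episode_finished()
            # On terminal steps the stored next state is a placeholder;
            # the done flag masks it out of the target in update().
            next_state = state if done else preprocess(game.get_state().screen_buffer)
            store(state, a, reward, next_state, done)
            update(q_net, optimizer)
            step += 1
            if done:
                game.new_episode()
        # Testing phase: 100 greedy episodes, reporting the average total reward.
        scores = []
        for _ in range(TEST_EPISODES_PER_EPOCH):
            game.new_episode()
            while not game.is_episode_finished():
                state = preprocess(game.get_state().screen_buffer)
                with torch.no_grad():
                    a = int(q_net(state.unsqueeze(0)).argmax())
                game.make_action(actions[a], FRAME_REPEAT)
            scores.append(game.get_total_reward())
        print(f"epoch {epoch + 1}: average test score {sum(scores) / len(scores):.2f}")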

The agent was trained in a similar fashion for the health-gathering scenario: 20 epochs of 5000 iterations, with 100 episodes of testing after each epoch. The average score per episode for testing and training is shown in Figures 5 and 6, respectively. These plots show that the average score does increase over time, but there is some noise in the learning curve. This could be a sign that the network needs more time to train to a stable solution. After training it is interesting to view the agent's decision making during the game. In Figure 7, the agent's relative movements are plotted. This plot shows that the agent has learned to move in large circles around the room. This strategy ensures that the agent is always moving and does not get stuck against the wall or in a corner of the room. Since there is a fairly high density of health packs in the room, this circular path is likely to pick up health packs as well. The agent outperformed the baseline and came close to the performance of the oracle, with an average test score of 1835.04 for the health-gathering scenario. On many episodes it would remain alive until time ran out. More training is necessary for the agent to learn how to obtain the maximal score on every episode.

The final scenario we decided to test was the prediction scenario. In this scenario the agent has the following actions available: turn left, turn right, and shoot. The agent is placed in a square room, and at the far end of the room is a monster. The monster travels across the room on a fixed path and is spawned somewhere randomly on this path. The agent this time is given a rocket launcher with a single rocket. There is a delay between when the agent fires and when the rocket reaches an object, depending on the distance. In this scenario the agent must predict where the monster will move so that it can lead the shot and kill the monster. If the agent successfully hits the monster, it gets a reward of +1. For each state that the agent is alive, it receives a reward of -0.001. Each episode ends either when a timeout of 300 states has been reached, or when the rocket has been fired and has struck something. We trained the network several different times while adjusting hyperparameters such as the amount of regularization and the network structure itself. The average score did not improve significantly after training times of 20-80 epochs. Since this scenario is more complex than the previous two, we anticipated that it would need more training time, and expected the score to increase after a larger amount of computation. We also decided to expand the state for this scenario. Instead of feeding in a single 60x40 image, we used a 2x60x60 pixel input, where the first channel represented the current frame and the second channel represented the previous frame. Our theory was that, given both the current and previous frames, the network could infer which direction the monster was moving. We finally trained our network for 220 epochs (approximately 22 hours); the training scores can be seen in Figure 8, and the testing results are shown in Figure 9.
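A sketch of the two-channel state construction described above; the resize library (OpenCV), the buffer-format handling, and the normalization are assumptions, since the paper does not specify its preprocessing pipeline.

import cv2
import numpy as np
import torch

RESOLUTION = (60, 60)   # (width, height) of the two-channel prediction-scenario input

def preprocess(screen_buffer):
    """Grayscale the ViZDoom screen buffer and resize it to the network resolution."""
    if screen_buffer.ndim == 3:                              # color buffer, channels first
        screen_buffer = cv2.cvtColor(screen_buffer.transpose(1, 2, 0), cv2.COLOR_RGB2GRAY)
    frame = cv2.resize(screen_buffer, RESOLUTION, interpolation=cv2.INTER_AREA)
    return torch.from_numpy(frame.astype(np.float32) / 255.0).unsqueeze(0)  # 1 x H x W

def stack_state(current_frame, previous_frame):
    """Build the 2 x H x W state: channel 0 is the current frame, channel 1 the previous one."""
    if previous_frame is None:                               # first frame of an episode
        previous_frame = current_frame
    return torch.cat([current_frame, previous_frame], dim=0)

With this input, the first convolutional layer of the network sketched in Section II would take 2 input channels instead of 1.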
Fig. 3: Average testing score per episode vs. epoch for the basic scenario.
Fig. 4: Average training score per episode vs. epoch for the basic scenario.

VI. DISCUSSION

The results for the basic scenario demonstrate that, for the relatively small state- and action-spaces present here, the Q-function is able to converge quite quickly using our implementation, and after training it can be used to achieve very high scores on almost every test run. Convergence was achieved in about 9 epochs, which took about 28 minutes to run on our hardware. This outcome appears to indicate that the underlying algorithm and model are effective in training the agent to perform at least simple tasks such as this.

Fig. 5: Average testing score per episode vs. epoch for the health-gathering scenario.
Fig. 6: Average training score per episode vs. epoch for the health-gathering scenario.
Fig. 7: Path taken by the agent in the health-gathering scenario at test time.
Fig. 8: Average training score per episode vs. epoch for the prediction scenario.

The results for the health-gathering scenario, a somewhat more complex task, indicate that the algorithm is also able to learn successful methods for scenarios with slightly larger state- and action-spaces. What is particularly interesting about the outcome of this scenario is that although the agent does learn the successful method of survival by running around in circles (see Figure 7), it does not appear to have a strong, if any, understanding that picking up health packs is truly the key to staying alive. While the agent does continually pick up health packs as it runs around, it does not seem to specifically direct itself towards health packs; rather, it simply tries to keep moving as much as possible in hopes of hitting some by chance. After 20 epochs of training on the health-gathering scenario (taking a little over an hour), it is also interesting to note that the Q-function does not entirely converge, and it is difficult to say whether or not the average score is still on an upward trend. We hypothesize that this is because the system has yet to associate the health packs directly with success and would need significantly more training before this convergence occurs.

The prediction scenario did not perform as well as we had anticipated. At first glance it seems that the network may be over-fitting, because the average testing score starts to decay after epoch 190. If we had more time we would experiment with increasing the amount of L2 regularization used in our loss function. We could also have added dropout layers to our network at train time to reduce over-fitting. The learning rate and weight initialization are the other two hyperparameters we would like to experiment with, to see how they affect the training time and results. Choosing a lower learning rate and allowing the network to train longer may keep the score from dipping at the end. The fact that the learning curve is fairly flat until a certain point may indicate that the initial weights were poorly chosen and that no major change to the network was made until much later in training. For this particular scenario, this ε-greedy Q-learning implementation has a difficult time learning the optimal policy. The probability of the correct sequence of actions occurring at random is very low in certain cases. For example, when the monster is spawned off screen, the agent must turn in the correct direction a certain number of times just to see the monster, and then shoot at the precise time. The probability of this random walk occurring is very small, so it will not happen often. Had we trained on data acquired from a human playing the game, we could potentially have sped up the training time and allowed the agent to learn how to find the monster more easily. This would also prevent the neurons from becoming saturated or skewed by constantly training on examples that result in a negative reward. However, training on human play does take away from the appeal of the agent discovering how to play on its own. After training on the prediction scenario for 220 epochs, the agent did not outperform our baseline score. The average test score over 100 episodes was -0.3, which indicates that it did not hit the monster at all during the 100 trials. We have pointed out potential flaws in our methods above. In addition, longer training time may still be needed, since the addition of the previous frame to the state increases the number of variables in our network.

Fig. 9: Average testing score per episode vs. epoch for the prediction scenario.

VII. CONCLUSION

In this paper we implemented a Deep Q-Network for teaching an agent how to play certain simple scenarios that individually represent important tasks in the game Doom. Our agent only uses the information from the image buffer as an input to determine which action to take. We show that on two of the three scenarios our agent outperforms a random policy and comes very close to the level of a human player. The flaws of our method are pointed out, and potential solutions are proposed for future work.

REFERENCES

[1] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
[2] Kempka, Michał, et al. "ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning." arXiv preprint arXiv:1605.02097 (2016).
[3] Lample, Guillaume, and Devendra Singh Chaplot. "Playing FPS games with deep reinforcement learning." arXiv preprint arXiv:1609.05521 (2016).
[4] He, Kaiming, et al. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, 2015, pp. 1026-1034.