Reinforcement Learning Agent for Scrolling Shooter Game

Peng Yuan (pengy@stanford.edu)
Yangxin Zhong (yangxin@stanford.edu)
Zibo Gong (zibo@stanford.edu)

1 Introduction and Task Definition

1.1 Game Agent

Nowadays, gaming has become one of the most popular fields in both everyday life and research. A game agent is a computer program that not only plays a game automatically but also tries to maximize its score or performance in the game. The standard input of a game agent is a scene or state of a specific game, and its output is the action it estimates to be best for obtaining a high score in the future. The mapping from game state to optimal action is called the optimal policy. The task in this paper is: given a specific game and its rules, train a game agent that yields a reasonable approximation of the optimal policy.

1.2 Objective and Evaluation

Playing with a reasonable policy, a game agent should obtain a relatively high score. We therefore evaluate a game agent by its average (or maximum) score in the game. Usually, this performance is compared with that of human players of different levels; in this paper, we use amateur human performance as the comparison.

Our objective is thus to get a high score in a scrolling shooter game with an agent learned by reinforcement learning: given the environment data in each frame as state $s$, find an optimal policy $\pi(s)$ that gives the optimal action in [up, down, left, right, stay].

1.3 Challenges

Complex environment: the game contains the hero, enemies, missiles, power-ups, etc., together with their interaction rules, resulting in a complex environment with a huge state space.

Delayed rewards: a missile takes time to fly, so the reward of an action is revealed only several frames later, making it hard for the agent to associate the reward with the proper action.
Sequence of actions: an enemy is destroyed only after several hits, so obtaining a reward requires a sequence of actions rather than a single action.

2 Infrastructure

2.1 Scrolling Shooter Game

The specific game we chose in this paper is a scrolling shooter game. As shown in Figure 1, enemy ships appear randomly from the top of the screen with different moving and attacking strategies. We control a hero fighter (referred to as the hero below) to shoot down the enemies and try to survive as long as possible to get a high score. In addition, some extra items upgrade our weapon or give us a shield, which makes survival easier, so we should try to collect them as well. The game itself is open source [1], written in C++ with OpenGL rendering the frames, which lets us modify its source code slightly to better fit our needs in training and testing.
Figure 1 Scrolling Shooter Game

2.2 Infrastructure Overview

We built several APIs to let our AI agent interact with the game, and we also introduced a non-graphic mode of the game to speed up training. The workflow is shown in Figure 2. The communication is done via TCP, and we disabled the default 40 ms delay mechanism in TCP, which achieved a transmission rate of thousands of frames per second and made fast training possible.

Figure 2 Infrastructure overview (the game environment applies the enemies' actions and the hero's action to update the game status; game status data and the hero's action are exchanged with the Python AI agent over TCP)

3 Model

Generally, a game can be modeled as a Markov Decision Process (MDP). An MDP is a state-based model with states, actions, a transition distribution and rewards: from each state we can take a legal action, and then, according to a probability distribution, we transition to another state and receive some reward for the transition. Specifically, in our game the MDP can be formulated as follows.

State. Each state is one frame of the game scene. The state includes all the game information of that scene: enemy information (position, type); ammo information (position, speed, type); power-up information (position, type); hero information (position, health points, shield points, lives, active guns, score). The end state is the game-over scene.

Action. The legal action set is [up, down, left, right, stay], which means the hero moves up, down, left or right, or stays still. Since changing the action every frame makes little difference to the hero's position, each chosen action is held for the next 5 frames. As a result, a successor state is the scene 5 frames after the current state, after taking one of the actions above.

Transition distribution. This part of the MDP cannot be modeled directly, since the game is far too complicated and full of randomness.
In each state, given an action to take, we cannot predict what the next state will be until we execute that action and run the game. The number of states in this complex game is also too large, which makes modeling the transition distribution intractable. In the next section, we introduce reinforcement learning methods to tackle this issue.

Reward. The goal of the game is to get a high score, so intuitively the reward is the score gained in each transition. In practice, however, using the game score as the reward makes it difficult to solve the MDP and obtain a reasonable policy. In the next section, we give more details and redesign a reward function that is appropriate for solving the problem.

4 Approaches

4.1 Q-learning

4.1.1 Q-learning Overview

To solve the MDP without an explicit transition distribution, we use the Q-learning algorithm with function approximation. Before describing Q-learning, we need two important concepts of a policy over an MDP: value and Q-value. The value of a state $s$ with respect to a fixed policy $\pi$ is denoted $V_\pi(s)$; it is the expected total reward received in the future by following policy $\pi$ from state $s$. The Q-value of a state-action pair $(s, a)$ is denoted $Q_\pi(s, a)$; it is the expected total reward received in the future after taking action $a$ from state $s$ and then following policy $\pi$. If the policy is the optimal policy $\pi_{opt}$, the optimal policy, value and Q-value satisfy:

$$\pi_{opt}(s) = \arg\max_a Q_{opt}(s, a) \tag{1}$$

$$V_{opt}(s) = \begin{cases} 0 & \text{if } s \text{ is an end state} \\ \max_a Q_{opt}(s, a) & \text{otherwise} \end{cases} \tag{2}$$

$$Q_{opt}(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + V_{opt}(s') \right]$$

where $T(s, a, s')$ is the transition probability from state $s$ to $s'$ under action $a$, and $R(s, a, s')$ is the transition reward. To obtain the optimal policy $\pi_{opt}$, we can estimate $V_{opt}(s)$ and $Q_{opt}(s, a)$ from the MDP and take the argmax of the Q-value. The challenge, as stated in the previous section, is that we cannot easily estimate the transition distribution $T(s, a, s')$. The Q-learning algorithm is one solution: in Q-learning, we do not estimate $T(s, a, s')$; instead, we directly estimate $Q_{opt}(s, a)$ and $V_{opt}(s)$.
First we need to obtain training data: in each state $s$, we take the action $a$ that is optimal under the current estimate, and then we run the game to transition to a successor state $s'$ and receive an actual reward $r$ from the game. Repeating these steps gives a large number of $(s, a, s', r)$ tuples, which are used as the training dataset. For each tuple $(s, a, s', r)$, we update $Q_{opt}(s, a)$, $V_{opt}(s)$ and $\pi_{opt}(s)$ as:

$$Q_{opt}(s, a) \leftarrow (1 - \eta)\, Q_{opt}(s, a) + \eta \left( r + V_{opt}(s') \right)$$

$$V_{opt}(s) \leftarrow \max_a Q_{opt}(s, a)$$

$$\pi_{opt}(s) \leftarrow \arg\max_a Q_{opt}(s, a)$$

where $\eta \in (0, 1)$. The idea behind this update rule is to use the actual reward $r$ to correct the estimates of the value and Q-value step by step. After the updates, we use the corrected estimate of the optimal policy to gather new training data, and we repeat this process again and again.

In practice, we do not always take the estimated optimal action as the next action: with probability $\varepsilon \in (0, 1)$, we take a random action. This strategy is called the ε-greedy policy. Without it, the agent can get stuck in a suboptimal policy because it never tries actions outside its current estimate. The intuition is that we should sometimes take random actions, which might never have been considered before, and see whether they are better than the current policy; if so, we can improve the estimated optimal policy. In the tabular setting, Q-learning with ε-greedy exploration converges to the true optimal policy provided every state-action pair keeps being visited and the learning rate decays appropriately, although this takes a very long time in practice. There is another issue, however: if an MDP has too many states, there is a huge number of $(s, a)$ pairs, which makes it hard to store $Q_{opt}(s, a)$ for each pair in limited space and impossible to converge to the true values in limited time.

4.1.2 Function Approximation

To deal with the large number of states in the MDP, we employ function approximation in Q-learning. For each $(s, a)$ pair, we define a feature vector $\phi(s, a)$ and use the features to approximate the Q-value $Q_{opt}(s, a)$. For example, $\phi_1(s, a)$ = number of activated weapons, and $\phi_2(s, a) = \mathbf{1}[a = \text{up}]$. The idea is that the Q-value can be estimated from features of the current state and the action to take. For instance, if the hero has activated many powerful weapons, it is likely to receive a high score in the future (i.e., a high $\phi_1(s, a)$ might indicate a high $Q_{opt}(s, a)$); on the contrary, in a scrolling shooter game we should seldom move the hero towards the top of the screen (i.e., $\phi_2(s, a) = 1$ might indicate a lower $Q_{opt}(s, a)$).
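As a point of reference for the tabular procedure of Section 4.1.1, before features replace the table, the update rule and the ε-greedy selection can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: the state labels, the dictionary-based table and the helper names `epsilon_greedy` and `q_update` are assumptions made for the example.

```python
import random
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right", "stay"]

def epsilon_greedy(q, state, epsilon, rng):
    """With probability epsilon take a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])

def q_update(q, s, a, r, s_next, eta, is_end=False):
    """Q(s,a) <- (1 - eta) * Q(s,a) + eta * (r + V(s')), with V(end state) = 0."""
    v_next = 0.0 if is_end else max(q[(s_next, b)] for b in ACTIONS)
    q[(s, a)] = (1 - eta) * q[(s, a)] + eta * (r + v_next)
    return q[(s, a)]

# Hypothetical states "s0" and "s1"; unseen Q-values default to 0.
q = defaultdict(float)
rng = random.Random(0)
q_update(q, "s0", "left", r=1.0, s_next="s1", eta=0.5)
greedy_action = epsilon_greedy(q, "s0", epsilon=0.0, rng=rng)
```

With `epsilon=0.0` the selection is purely greedy; a positive value trades some immediate reward for exploration. The table grows with every distinct state visited, which is the storage problem that motivates the feature-based approximation.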
In this paper, we employ the most common form of function approximation, linear approximation:

$$Q_{opt}(s, a) = w \cdot \phi(s, a) \tag{3}$$

where $w$ is the vector of weights of all the features. With this function approximation, the Q-learning algorithm updates $\pi_{opt}$, $V_{opt}$ and $Q_{opt}$ by updating the weight vector $w$. Using stochastic gradient descent, the update rule of Q-learning becomes

$$w \leftarrow w - \eta \left[ Q_{opt}(s, a) - \left( r + V_{opt}(s') \right) \right] \phi(s, a) \tag{4}$$

where $\eta \in (0, 1)$, and the definitions of $\pi_{opt}$, $V_{opt}$ and $Q_{opt}$ follow (1), (2) and (3). The update rule we actually used in Q-learning is exactly formula (4). In Section 4.1.4, we specify the feature vector $\phi(s, a)$ used in this paper.

4.1.3 Reward Function

As the goal of playing this game is to get a high score, an intuitive idea is to use the game score obtained in a transition as the reward in Q-learning. But this causes several issues:

1) The hero gains score only when it defeats an enemy, but it needs to shoot an enemy for quite a long time before defeating it. The shooting actions are valuable but are not captured by the score reward.
2) When the hero takes damage, its life points and shield points go down but the score does not. So the bad actions that make the hero take damage are not captured by the score reward either.

3) When the hero collects a power-up, it is likely to obtain a higher total reward in the future. But the good actions that collect power-ups are not captured by the score reward.

To capture these properties of good and bad actions, we need to design a heuristic reward function that estimates the future reward after taking an action in the current state. In other words, it gives a bonus or a punishment when the hero takes actions that are likely to be good or bad for the future. The reward function we design in this paper is:

$$r = w_1 r_{dmg} + w_2 r_{dodge} + w_3 r_{item} + w_4 r_{attack} + w_5 r_{general} \tag{5}$$

It is a weighted sum of 5 components:

Damage taken reward $r_{dmg}$ is the damage the hero takes in the transition. When the hero takes damage in the 5-frame transition, this reward is negative to punish the bad actions.

Dodging enemy/ammo reward $r_{dodge}$ is the increase in the summed distance to nearby enemies and ammo. This reward encourages actions that dodge nearby enemies and ammo and punishes actions that get closer to them.

Item collecting reward $r_{item}$ is the decrease in distance to the closest power-up item. This reward encourages actions that try to collect the closest power-up. We find that these items are very useful for getting a high score, so we weight this reward higher than the others.

Attacking reward $r_{attack}$ is an estimate of whether the hero is actively attacking. It estimates whether the hero is trying to attack an enemy to gain score. We need this reward to encourage attacking actions, since defeating enemies is the only way to gain score. The general idea of this term is to check whether the hero is attacking an enemy or approaching the closest enemy that can be shot.

General movement reward $r_{general}$ is whether the hero is moving towards the bottom center when idle.
A generally good position for the hero is the bottom center of the screen, since from there the hero can shoot any enemy for a long time and can conveniently move left and right to dodge enemies and ammo. So when the hero is not dodging an attack, collecting an item or trying to attack enemies, a good move is to head for the bottom center.

Our final reward function is a weighted sum of the five rewards above. It is more complicated than the plain score reward but, according to our training results, more useful.

4.1.4 Features

Since we use function approximation to estimate $Q_{opt}(s, a)$, we need to design the feature vector $\phi(s, a)$ for each state-action pair in equation (3). The features we use include:

Number of ammo in the hero's nearby region. We define the nearby region as a circular area centered at the hero's position with a fixed radius. We divide this area into 8 sectors and count the number of ammo in each sector to form a feature vector of length 8.

Speed distribution of ammo in the hero's nearby region. The nearby region has the same definition. Instead of counting ammo, we use the histogram of speed angles as the feature (as shown in Figure 3). We use 8 buckets for the speed direction in each of the 8 sectors, so this is a feature vector of length 8 * 8 = 64.
Figure 3 Speed distribution features (left) and the hero front area (right)

Number of enemies in the hero's nearby region. This is similar to the first type of feature, but we keep track of the number of enemies of each type. There are 7 kinds of enemy in all, so this is a vector of length 7 * 8 = 56. With the three types of features above, we are able to teach the game agent to dodge enemies and ammo in specific situations.

Number of enemies in the hero's front area (as shown in Figure 3). We use this feature to keep track of the number and positions of attackable enemies. This feature is useful because we want the game agent to learn to approach and attack enemies to gain score.

Other features include the number of power-up items in the hero's nearby region (useful for teaching the hero to collect power-ups), hero shield points, health points, lives, activated weapon indicators, score, and special position indicators (whether the hero is at the center of the x axis or near the screen boundaries). These features try to capture factors that are relevant for estimating future reward.

All of these features correspond in some way to components of our reward function in Section 4.1.3. This is because even with a reasonable reward function, we still need appropriate features that can capture the bonus or punishment in order to obtain a good function approximation of the Q-value. In total, the feature vector of a state has length 193 in our design.

We also need features that depend on the action; otherwise we cannot tell the difference between the Q-values of two different actions in the same state. However, the action itself contains very little information and is meaningful only in combination with the state features. As a result, we replicate the length-193 feature vector 5 times (since we have 5 possible actions, keyed as [w, a, s, d, 0]) and multiply each copy by an indicator of the corresponding action. Finally, we get a feature vector of length 965.
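The replication scheme and the linear update of formula (4) can be sketched together as below. This is an illustrative sketch under stated assumptions, not the authors' code: the all-ones state vector is a placeholder for the real 193 features, the action order in `ACTIONS` is arbitrary, and `block_features`, `q_value` and `sgd_update` are hypothetical helper names.

```python
ACTIONS = ["up", "down", "left", "right", "stay"]  # order is an assumption

def block_features(state_features, action):
    """Place the state features in the block of the chosen action, zeros elsewhere."""
    n = len(state_features)
    phi = [0.0] * (n * len(ACTIONS))
    start = ACTIONS.index(action) * n
    phi[start:start + n] = state_features
    return phi

def q_value(w, phi):
    """Linear approximation Q(s, a) = w . phi(s, a), as in equation (3)."""
    return sum(wi * xi for wi, xi in zip(w, phi))

def sgd_update(w, phi, r, v_next, eta):
    """w <- w - eta * [Q(s,a) - (r + V(s'))] * phi(s,a), as in formula (4)."""
    error = q_value(w, phi) - (r + v_next)
    return [wi - eta * error * xi for wi, xi in zip(w, phi)]

state = [1.0] * 193                  # placeholder for the 193 state features
phi = block_features(state, "left")  # length 5 * 193 = 965
w = [0.0] * len(phi)
w = sgd_update(w, phi, r=1.0, v_next=0.0, eta=0.001)
```

Because only the block of the taken action is non-zero, each update changes only the weights of that action, so the five actions effectively learn separate linear models over the same state features.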
In our design, for each state-action pair, only 193 entries of the 965-dimensional vector can be non-zero because of the action indicators.

4.2 Deep Q-Learning

Deep Q-learning is based on our traditional Q-learning approach and thus shares the same action space and states. Instead of learning from self-extracted features, the core of Deep Q-learning is to use a Deep Q-Network (DQN) to replace the self-extracted features and the weights [2].

4.2.1 Features

To capture geometric relationships in the game world, we set up coordinates centered at the hero and mesh the area near the hero in each frame into a 52*52 grid (matrix). Without giving much additional information, we filter each frame into 4 feature maps, as shown in Figure 4, where each E and U is a 5*5 matrix of all ones, centered at the position of the target object.
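A hero-centered map of this kind could be built roughly as follows. The sketch is a guess at the mechanics, not the paper's code: the `cell_size` parameter, the row/column orientation, the choice to drop objects outside the grid, and the helper names are all assumptions, and the world boundary map is omitted.

```python
GRID = 52   # grid side length, from the text
STAMP = 5   # side of the all-ones block, from the text

def empty_map():
    return [[0] * GRID for _ in range(GRID)]

def to_cell(obj_xy, hero_xy, cell_size):
    """Map a world position to (row, col) in a grid centered on the hero."""
    col = int((obj_xy[0] - hero_xy[0]) // cell_size) + GRID // 2
    row = int((obj_xy[1] - hero_xy[1]) // cell_size) + GRID // 2
    return row, col

def stamp(grid, row, col):
    """Write a STAMP x STAMP block of ones centered at (row, col), clipped to the grid."""
    half = STAMP // 2
    for r in range(row - half, row + half + 1):
        for c in range(col - half, col + half + 1):
            if 0 <= r < GRID and 0 <= c < GRID:
                grid[r][c] = 1

def position_map(objects, hero_xy, cell_size=1.0):
    """Position map (enemies or power-ups): one all-ones block per object."""
    grid = empty_map()
    for xy in objects:
        row, col = to_cell(xy, hero_xy, cell_size)
        stamp(grid, row, col)
    return grid

def count_map(objects, hero_xy, cell_size=1.0):
    """Count map (missiles): number of objects falling into each cell."""
    grid = empty_map()
    for xy in objects:
        row, col = to_cell(xy, hero_xy, cell_size)
        if 0 <= row < GRID and 0 <= col < GRID:
            grid[row][col] += 1
    return grid

enemy_map = position_map([(103, 98)], hero_xy=(100, 100))
missile_map = count_map([(100, 100), (100, 100)], hero_xy=(100, 100))
```

Centering the grid on the hero means the same relative configuration always produces the same input, wherever the hero is on the screen.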
Figure 4 Feature maps (each frame is filtered into four maps: world boundary, enemies position, missiles count, and power-ups position)

4.2.2 Related Works

In 2013, DeepMind Technologies presented the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning [3]. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. In addition, other research [4, 5] related to the Deep Q-Network (DQN) inspired us in tuning the hyper-parameters of the DQN.

4.2.3 Deep Q-Network (DQN) Structure

To better address the delayed reward issue, we used a technique known as Experience Replay: we store the agent's experience at each time step, $e_t = (s_t, a_t, r_t, s_{t+1})$, in a data set $D = \{e_1, \ldots, e_N\}$, pooled over many episodes into a replay memory. The algorithm is shown in Algorithm 1. In the inner loop of the algorithm, we apply Q-learning updates, or minibatch updates, to samples of experience drawn at random from the pool of stored samples. After performing experience replay, the agent selects and executes an action according to an ε-greedy policy.
Algorithm 1 Deep Q-learning with Experience Replay:

Initialize replay memory $D$ to capacity $N$
Initialize action-value function $Q$ with random weights
for episode = 1:M do
    Initialize sequence $s_1 = [x_1]$ and the preprocessed sequence
    for t = 1:T do
        With probability $\varepsilon$ select a random action $a_t$
        Otherwise select $a_t = \arg\max_a Q^*(x_t, a)$
        Execute action $a_t$ in emulator, observe reward $r_t$ and extract new feature maps $x_{t+1}$
        Set $s_{t+1} = (s_t, a_t, r_t)$ and $x_{t+1} = x(s_{t+1})$
        Store transition $(x_t, a_t, r_t, x_{t+1})$ in $D$
        Sample random minibatch of transitions $(x_j, a_j, r_j, x_{j+1})$ from $D$
        Set $y_j = r_j + \max_{a'} Q(x_{j+1}, a')$
        Perform a gradient descent step according to the loss function $(y_j - Q(x_j, a_j))^2$
    end for
end for
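The replay memory and the target computation used in Algorithm 1 can be sketched as below. This is a minimal sketch, not the authors' implementation: the `ReplayMemory` class, the `done` flag marking end states, and the stand-in `q_fn` callable are assumptions made for illustration. The target has no discount factor, matching the update rules in Section 4.1, and is set to the plain reward at an end state, following equation (2).

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity pool of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, x, a, r, x_next, done):
        self.buffer.append((x, a, r, x_next, done))

    def sample(self, batch_size, rng=random):
        """Draw a random minibatch without replacement."""
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)

def td_targets(batch, q_fn, n_actions):
    """y = r at an end state, otherwise r + max_a' Q(x', a')."""
    targets = []
    for x, a, r, x_next, done in batch:
        if done:
            targets.append(r)
        else:
            targets.append(r + max(q_fn(x_next, b) for b in range(n_actions)))
    return targets

# Stand-in Q-function for the example only: the value equals the action index.
q_fn = lambda x, a: float(a)

memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.store(x=t, a=0, r=1.0, x_next=t + 1, done=(t == 4))
batch = memory.sample(2, rng=random.Random(0))
targets = td_targets(batch, q_fn, n_actions=5)
```

Sampling at random from the pool breaks the correlation between consecutive frames, and keeping old transitions lets a delayed reward be revisited many times after it was observed.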
The structure of our DQN is shown in Figure 5. We feed the Deep Q-Network (DQN) a sequence of 5 frames as one frame set (one iteration) and perform one update every 64 frame sets (the batch size is 64), resulting in an input dimension of 52*52*4*5*64. The first hidden layer convolves 16 filters of size 8*8 with stride 4 and applies a rectifier nonlinearity. The second hidden layer convolves 32 filters of size 4*4 with stride 2, again followed by a rectifier nonlinearity. The final hidden layer is fully connected and consists of 64 rectifier units. The output layer is also a fully connected linear layer, with 5 outputs for the 5 actions. The loss function of the network is

$$L = \left( r + V(s') - Q(s, a) \right)^2.$$

Figure 5 DQN structure

5 Experiments

5.1 Baselines & Oracle

As baselines, we implement two simple agents.

We first applied a random-move strategy to the game, in which the fighter ship takes a random action for a random period of time.

Then we applied a more sophisticated rule-based strategy: when an enemy ship or a missile is about to collide with the fighter ship, the agent tries to move the ship to avoid it. This strategy worked much better on the game. But since it can only take a few enemies into account, it cannot avoid collisions when there are more enemies and missiles.

In this paper, amateur human play is used as the oracle: we practiced playing the game for hours, then played it multiple times and recorded the final scores.

5.2 Evaluation & Analysis

For each kind of agent, we run the agent on the game 100 times and report its average score and maximum score as its evaluation. The results are shown in Table 1, and the learning curves are shown in Figure 6.
Table 1 Performance of each method

Method                                            Average Score   Max Score
Random                                                      111         175
Rule-based                                                  802        1751
Q-learning ($2.1 \times 10^5$ iterations)                   628        1475
Q-learning ($2 \times 10^7$ iterations)                    7847       40875
Deep Q-learning ($2.1 \times 10^5$ iterations)             1341        7875
Human                                                      9025       15120

Figure 6 Learning curve for Q-learning (left) and Deep Q-learning (right)

Among the automated agents, Q-learning obtained the highest performance in the end. Compared to the human player oracle, the Q-learning agent gets a comparable average score, while its maximum score is much higher than the maximum of the amateur human player. This shows the effectiveness of Q-learning in estimating the optimal policy. It also shows that our reward function estimates the future reward well and that the features we use capture factors that help approximate the Q-value correctly.

Another advantage of Q-learning is that each update iteration takes much less time than in the deep Q-learning method. When we compare Q-learning and Deep Q-learning after the same number of iterations ($2.1 \times 10^5$) in Table 1, Deep Q-learning performs better than Q-learning. But since we can execute many more iterations in limited time with Q-learning than with deep Q-learning, the final performance of Q-learning with $2 \times 10^7$ iterations is much higher.

Table 1 shows that the DQN can capture some useful local features and also learn some effective non-handcrafted features through the network. However, the final performance of the DQN is much lower. This is because 1) primarily, we cannot run more iterations in limited time due to the slow training speed of the DQN; 2) feature maps of positions alone may not contain enough information for this complicated game environment; and 3) the structure of the DQN with two convolution layers might not be appropriate for this game.

Other interesting observations.
While training the Q-learning algorithm, we once found that our agent was too aggressive: the hero usually preferred shooting enemies to dodging ammo and collecting power-ups, although the latter actions have a high probability of yielding a higher reward in the future. We checked our implementation and found that this was because the weight of $r_{attack}$ in our reward function (5) was too high, which prevented our heuristic reward from estimating the future score correctly. We then increased the weights of $r_{dmg}$, $r_{dodge}$ and $r_{item}$ and re-trained the model. After this modification of the hyper-parameters, we obtained a much smarter agent that can dodge ammo, collect power-ups and attack enemies at the same time. In Figure 6, the performance of Q-learning increases rapidly at iteration $2.1 \times 10^7$. This is exactly because we tuned up the weights above at that point and also tuned down the value of ε before continuing the training.

6 Conclusion

To conclude, we obtained a smart game agent for the scrolling shooter game using the Q-learning algorithm and a Deep Q-Network. Our experiments show that, under the same number of iterations, the DQN is more effective than the traditional Q-learning algorithm at capturing local-region and non-linear hidden features. However, the highest performance is still achieved by Q-learning, since it is more efficient and can also capture good features through a well-designed reward function and feature vector. The final performance of our Q-learning game agent is comparable to that of an amateur human player.

7 References

[1] "A scrolling shooter game: Chromium B.S.U.," http://chromium-bsu.sourceforge.net/.
[2] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[3] V. Mnih et al., "Playing Atari with Deep Reinforcement Learning," CoRR, vol. abs/1312.5602, 2013.
[4] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "What is the best multi-stage architecture for object recognition?," in 2009 IEEE 12th International Conference on Computer Vision, 2009, pp. 2146-2153.
[5] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807-814.