Reinforcement Learning Agent for Scrolling Shooter Game


Peng Yuan (pengy@stanford.edu)    Yangxin Zhong (yangxin@stanford.edu)    Zibo Gong (zibo@stanford.edu)

1 Introduction and Task Definition

1.1 Game Agent

Gaming has become a popular field in both everyday life and research. A game agent is a computer program that not only plays a game automatically but also tries to maximize its score or performance. The standard input of a game agent is a scene, or state, of a specific game, and its output is the action estimated to be best for obtaining a high score in the future. The mapping from game state to optimal action is called the optimal policy. The task in this paper is: given a specific game and its rules, train a game agent that yields a reasonable optimal policy.

1.2 Objective and Evaluation

Playing with a reasonable policy, a game agent should obtain a relatively high score, so a game agent is evaluated by its average (or maximum) performance in the game, usually compared against human players of different levels. In this paper, we use amateur human-level performance as the comparison. Our objective is therefore to obtain a high score in a scrolling shooter game with an agent learned by reinforcement learning: given the environment data of each frame as state s, find an optimal policy π such that π(s) gives the optimal action in [up, down, left, right, stay].

1.3 Challenges

Complex environment: the game contains the hero, enemies, missiles, power-ups, etc., together with their interaction rules, resulting in a complex environment with a huge state space.

Delayed rewards: it takes time for a missile to fly, so the reward of an action is revealed only after several frames, making it hard for the agent to associate the reward with the proper action.

Sequence of actions: an enemy is destroyed only after several hits, so a sequence of actions, rather than a single action, is required to obtain a reward.

2 Infrastructure

2.1 Scrolling Shooter Game

The specific game we chose for this paper is a scrolling shooter game. As shown in Figure 1, enemy ships appear randomly from the top of the screen with different movement and attack strategies. We control a hero fighter (referred to as the hero below) to shoot down the enemies and try to survive as long as possible to obtain a high score. In addition, some items can upgrade our weapon or give us a shield, which makes survival easier, so we should try to collect them as well. The game itself is open source [1], written in C++ with OpenGL rendering the frames, which allowed us to modify its source code slightly to better fit our needs in training and testing.

Figure 1: The scrolling shooter game

2.2 Infrastructure Overview

We built several APIs to let our AI agent interact with the game, and we added a non-graphic mode to speed the game up for faster training. The workflow is shown in Figure 2. The communication between the game and the agent is done over TCP; we disabled the default 40 ms delay mechanism of the TCP protocol, which gave us a transmission rate of thousands of frames per second and made fast training possible.

Figure 2: Infrastructure overview. The game environment (C++) updates the game status and sends game status data to the AI agent (Python) over TCP; the agent sends the hero's action back over the same connection.
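The following is a minimal sketch of the agent side of this TCP loop. The host, port, and message format (one JSON object per line) are assumptions for illustration and the real game protocol may differ; setting TCP_NODELAY is one common way to remove small-packet delays of the kind mentioned above.

    import json
    import socket

    HOST, PORT = "127.0.0.1", 5000   # illustrative address, not the game's actual endpoint

    def connect(host: str = HOST, port: int = PORT) -> socket.socket:
        sock = socket.create_connection((host, port))
        # Disable Nagle's algorithm so small state/action packets are sent immediately.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        return sock

    def game_loop(sock: socket.socket, choose_action) -> None:
        """Receive one game-status message per frame set and reply with an action."""
        reader = sock.makefile("r")
        writer = sock.makefile("w")
        for line in reader:
            state = json.loads(line)          # enemy/ammo/power-up/hero information
            action = choose_action(state)     # one of "up", "down", "left", "right", "stay"
            writer.write(action + "\n")
            writer.flush()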

3 Model

Generally, a game can be modeled as a Markov Decision Process (MDP). An MDP is a state-based model with states, actions, a transition distribution, and rewards: from each state we take a legal action, and then with some probability distribution we transition to another state and receive a reward for the transition. In our game, the MDP is formulated as follows.

State. Each state is one frame of the game scene and includes all the game information of that scene: enemy information (position, type); ammo information (position, speed, type); power-up information (position, type); and hero information (position, health points, shield points, lives, active guns, score). An end state is the game-over scene.

Action. The legal action set is [up, down, left, right, stay], i.e., the hero moves up, down, left, right, or stays still. Since changing the action every frame makes little difference to the hero's position, each chosen action is held for the next 5 frames. As a result, the successor state is the scene 5 frames after the current state, after taking one of the actions above.

Transition distribution. This part of the MDP cannot be modeled directly, since the game is too complicated and full of randomness. In each state, given an action to take, we cannot predict the next state until we execute that action and run the game. The number of states of this complex game is also too large, which makes modeling the transition distribution intractable. In the next section, we introduce reinforcement learning methods to tackle this issue.

Reward. The goal of the game is to obtain a high score, so intuitively the reward is the score gained in each transition. However, using the raw game score as the reward makes it difficult to solve the MDP and obtain a reasonable policy in practice. In the next section, we give more details and redesign a reward function better suited to the problem.

4 Approaches

4.1 Q-learning

4.1.1 Q-learning Overview

To solve the MDP without an explicit transition distribution, we use the Q-learning algorithm with function approximation. Before describing Q-learning, we need two important concepts about a policy over an MDP: value and Q-value. The value of a state s with respect to a fixed policy π, denoted V^π(s), is the expected total future reward received by following policy π from state s. The Q-value of a state-action pair (s, a), denoted Q^π(s, a), is the expected total future reward received after taking action a from state s and then following policy π. If the policy is the optimal policy π_opt, then:

    \pi_{opt}(s) = \arg\max_{a} Q_{opt}(s, a)    (1)

    V_{opt}(s) = \begin{cases} 0 & \text{if } s \text{ is an end state} \\ \max_{a} Q_{opt}(s, a) & \text{otherwise} \end{cases}    (2)

    Q_{opt}(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + V_{opt}(s') \right]

where T(s, a, s') is the probability of transitioning from state s to s' by action a, and R(s, a, s') is the transition reward. To obtain the optimal policy π_opt, we can estimate V_opt(s) and Q_opt(s, a) from the MDP and take the argmax of the Q-value. The challenge, as stated in the previous section, is that we cannot estimate the transition distribution T(s, a, s') easily. Q-learning is one solution: instead of estimating T(s, a, s'), it estimates Q_opt(s, a) and V_opt(s) directly. First we obtain training data: in each state s we take the currently predicted optimal action a, run the game to transition to a successor state s', and receive the actual reward r from the game. Repeating these steps produces a large number of (s, a, s', r) tuples, which serve as the training data. For each tuple (s, a, s', r), we update Q_opt(s, a), V_opt(s), and π_opt(s) as:

    Q_{opt}(s, a) \leftarrow (1 - \eta) \, Q_{opt}(s, a) + \eta \left( r + V_{opt}(s') \right)

    V_{opt}(s) \leftarrow \max_{a} Q_{opt}(s, a)

    \pi_{opt}(s) \leftarrow \arg\max_{a} Q_{opt}(s, a)

where η ∈ (0, 1). The idea behind this update rule is to use the actual reward r to correct the estimates of the value and Q-value step by step. After each update, we use the corrected estimate of the optimal policy to gather new training data and repeat the process. In practice, we do not always take the estimated optimal action: with probability ε ∈ (0, 1) we take a random action instead. This strategy, called the ε-greedy policy, is necessary because without it the algorithm can converge to a local optimum. The intuition is that we should occasionally take random actions that might never have been considered before and see whether they are better than the current policy; if so, the estimated optimal policy improves. With the ε-greedy strategy, Q-learning is guaranteed to eventually find the true optimal policy, although in practice this can take a very long time. One remaining issue is that an MDP with too many states has a very large number of (s, a) pairs, which makes it infeasible to store Q_opt(s, a) for every pair with limited space and impossible to converge to the true values in limited time.
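The following is a minimal tabular sketch of the update rules and ε-greedy selection above. It assumes states are hashable, which is only for illustration; our actual agent uses function approximation (Section 4.1.2).

    import random
    from collections import defaultdict

    ACTIONS = ["up", "down", "left", "right", "stay"]

    class TabularQLearner:
        def __init__(self, eta: float = 0.1, epsilon: float = 0.2):
            self.q = defaultdict(float)   # (state, action) -> estimate of Q_opt
            self.eta = eta
            self.epsilon = epsilon

        def value(self, state) -> float:
            # V_opt(s) = max_a Q_opt(s, a)
            return max(self.q[(state, a)] for a in ACTIONS)

        def choose_action(self, state) -> str:
            # epsilon-greedy policy: explore with probability epsilon
            if random.random() < self.epsilon:
                return random.choice(ACTIONS)
            return max(ACTIONS, key=lambda a: self.q[(state, a)])

        def update(self, state, action, reward, next_state, is_end: bool) -> None:
            # Q_opt(s, a) <- (1 - eta) Q_opt(s, a) + eta (r + V_opt(s'))
            target = reward + (0.0 if is_end else self.value(next_state))
            self.q[(state, action)] = (1 - self.eta) * self.q[(state, action)] + self.eta * target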

4.1.2 Function Approximation

To deal with the large number of states in the MDP, we employ function approximation in Q-learning. For each (s, a) pair we define a feature vector φ(s, a) and use these features to approximate the Q-value Q_opt(s, a), e.g. φ_1(s, a) = number of activated weapons, φ_2(s, a) = 1[a = up]. The idea is that the Q-value can be estimated from features of the current state and the action to take. For instance, if the hero has activated many powerful weapons, it is likely to receive a high score in the future (a high φ_1(s, a) might indicate a high Q_opt(s, a)); conversely, in a scrolling shooter we should seldom move the hero toward the top of the screen (φ_2(s, a) = 1 might indicate a lower Q_opt(s, a)). In this paper we use the most common function approximation, linear approximation:

    Q_{opt}(s, a) = \mathbf{w} \cdot \phi(s, a)    (3)

where w is the vector of weights of all the features. With this approximation, the Q-learning algorithm updates π_opt, V_opt, and Q_opt by updating the weight vector w. Using stochastic gradient descent, the Q-learning update rule becomes

    \mathbf{w} \leftarrow \mathbf{w} - \eta \left[ Q_{opt}(s, a) - \left( r + V_{opt}(s') \right) \right] \phi(s, a)    (4)

where η ∈ (0, 1), and the definitions of π_opt, V_opt, and Q_opt follow (1), (2), and (3). The actual update rule used by our Q-learning agent is exactly formula (4). Section 4.1.4 specifies the feature vector φ(s, a) used in this paper.
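The following is a minimal sketch of linear function approximation with the gradient update of formulas (3) and (4). Features are represented as a sparse dictionary {feature name: value}; the feature extractor itself (Section 4.1.4) is omitted here, so the class is illustrative rather than the exact implementation.

    from collections import defaultdict

    class LinearQLearner:
        def __init__(self, feature_extractor, actions, eta: float = 0.01):
            self.phi = feature_extractor          # phi(state, action) -> dict of feature values
            self.actions = actions
            self.eta = eta
            self.w = defaultdict(float)           # weight vector

        def q_value(self, state, action) -> float:
            # Q_opt(s, a) = w . phi(s, a)                                 -- formula (3)
            return sum(self.w[f] * v for f, v in self.phi(state, action).items())

        def value(self, state) -> float:
            return max(self.q_value(state, a) for a in self.actions)

        def update(self, state, action, reward, next_state, is_end: bool) -> None:
            # w <- w - eta * [Q_opt(s, a) - (r + V_opt(s'))] * phi(s, a)  -- formula (4)
            target = reward + (0.0 if is_end else self.value(next_state))
            residual = self.q_value(state, action) - target
            for f, v in self.phi(state, action).items():
                self.w[f] -= self.eta * residual * v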

4.1.3 Reward Function

Since the goal of playing this game is to obtain a high score, an intuitive choice is to use the game score gained in each transition as the reward in Q-learning. This causes several issues: 1) the hero scores only when it defeats an enemy, yet it must shoot an enemy for quite a long time before defeating it, so the valuable shooting actions are not captured by a score reward; 2) when the hero takes damage, its health and shield points go down but the score does not, so the bad actions that cause damage are not captured either; 3) when the hero collects a power-up, it is likely to obtain a higher total reward in the future, but the good actions that collect power-ups are not captured by a score reward.

To capture these aspects of good and bad actions, we design a heuristic reward function that estimates the future reward of taking an action in the current state; in other words, it gives a bonus or a penalty when the hero takes actions that are likely good or bad for the future. The reward function we design is

    r = w_1 r_{dmg} + w_2 r_{dodge} + w_3 r_{item} + w_4 r_{attack} + w_5 r_{general}    (5)

a weighted sum of 5 components:

Damage-taken reward r_dmg is the damage the hero takes in the transition. When the hero takes damage during the 5-frame transition, this reward is negative to penalize the bad action.

Dodging reward r_dodge is the increase of the summed distance to nearby enemies and ammo. It encourages actions that dodge nearby enemies/ammo and penalizes actions that move closer to them.

Item-collecting reward r_item is the decrease of the distance to the closest power-up item. It encourages actions that move toward the closest power-up. We find these items very useful for obtaining a high score, so we weight this reward higher than the others.

Attacking reward r_attack estimates whether the hero is actively attacking. We need this reward to encourage attacking, since defeating enemies is the only way to gain score. The general idea of this term is to check whether the hero is attacking an enemy or approaching the closest enemy that can be shot.

General movement reward r_general indicates whether the hero is moving toward the bottom center when idle. The bottom center of the screen is generally a good position for the hero (from there it can shoot any enemy for a long time and can conveniently move left and right to dodge enemies and ammo), so when the hero is not dodging, collecting items, or attacking, a good move is to head toward the bottom center.

The final reward function is the weighted sum of the five rewards above, which is more complicated than the plain score reward but, judging from the training results, more useful. A sketch of this computation is given below.
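The following is a minimal sketch of the weighted reward in equation (5). The weights and the example dodge-distance helper are illustrative placeholders, not the values or exact definitions used in our experiments.

    import math
    from dataclasses import dataclass

    WEIGHTS = dict(dmg=-1.0, dodge=0.5, item=1.5, attack=0.8, general=0.2)  # illustrative weights

    @dataclass
    class RewardTerms:
        dmg: float      # damage taken during the 5-frame transition
        dodge: float    # increase in summed distance to nearby enemies/ammo
        item: float     # decrease in distance to the closest power-up
        attack: float   # positive-attack estimate
        general: float  # moving toward the bottom center when idle

    def total_reward(t: RewardTerms, w=WEIGHTS) -> float:
        # r = w1*r_dmg + w2*r_dodge + w3*r_item + w4*r_attack + w5*r_general   (5)
        return (w["dmg"] * t.dmg + w["dodge"] * t.dodge + w["item"] * t.item
                + w["attack"] * t.attack + w["general"] * t.general)

    def dodge_term(hero_xy, threats_prev, threats_now) -> float:
        """Example component: change in summed distance from the hero to nearby threats."""
        hx, hy = hero_xy
        def summed_distance(points):
            return sum(math.hypot(x - hx, y - hy) for x, y in points)
        return summed_distance(threats_now) - summed_distance(threats_prev)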

4.1.4 Features

Since we use function approximation to estimate Q_opt(s, a), we need to design the feature vector φ(s, a) of each state-action pair in equation (3). The features we use include:

Number of ammo in the hero's nearby region. We define the nearby region as a circular area centered at the hero's position with a fixed radius, divide it into 8 sectors, and count the ammo in each sector, giving a feature vector of length 8.

Speed distribution of ammo in the hero's nearby region. The nearby region is defined as above, but instead of counting ammo we use a histogram of speed angles as the feature (as shown in Figure 3), with 8 buckets for the different speed directions, giving a feature vector of length 8 * 8 = 64.

Figure 3: Speed distribution features (left) and the hero's front area (right)

Number of enemies in the hero's nearby region. Similar to the first feature, but we track the counts of the different enemy types. There are 7 kinds of enemy in total, so this is a vector of length 7 * 8 = 56. With these three types of features, the agent can learn to dodge enemies and ammo in specific situations.

Number of enemies in the hero's front area (as shown in Figure 3). This feature tracks the number and positions of attackable enemies, which is useful because we want the agent to learn to approach and attack enemies to gain score.

Other features include the number of power-up items in the hero's nearby region (useful for teaching the hero to collect power-ups), the hero's shield points, health points, lives, activated-weapon indicators, score, and special position indicators (whether the hero is at the horizontal center or near the screen boundaries). These features try to capture factors relevant to estimating future reward. All of the features correspond in some way to components of the reward function in Section 4.1.3: even with a reasonable reward function, we still need features that can remember the bonuses and penalties in order to obtain a good approximation of the Q-value.

In total, the state feature vector has length 193 in our design. We also need features that depend on the action; otherwise the Q-values of two different actions in the same state would be indistinguishable. However, the action by itself carries very little information and is meaningful only in combination with the state features. We therefore replicate the length-193 feature vector 5 times (one copy per action [w, a, s, d, 0]) and multiply each copy by an indicator of the corresponding action. The result is a feature vector of length 965 in which, for any state-action pair, only 193 entries can be non-zero due to the action indicators.

4.2 Deep Q-Learning

Deep Q-learning is based on our traditional Q-learning approach and shares the same action space and states. Instead of learning on hand-crafted features, the core of Deep Q-learning is to use a Deep Q-Network (DQN) to replace the hand-crafted feature and weight part [2].

4.2.1 Features

To capture geometric relationships in the game world, we center the coordinate system on the hero and mesh the area around the hero in each frame into a 52*52 grid. Without losing much information, we then convert each frame into 4 feature maps, as shown in Figure 4, where each enemy (E) and power-up (U) is marked by a 5*5 block of ones centered at the object's position.

Figure 4: Feature maps — world boundary, enemy positions, missile counts, and power-up positions.
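The following NumPy sketch shows how the four 52*52 feature maps described above could be built. It assumes object positions have already been mapped to hero-centered grid coordinates; the actual conversion from game coordinates and the exact map contents are omitted and may differ from our implementation.

    import numpy as np

    GRID = 52       # 52x52 grid centered on the hero
    HALF_BLOCK = 2  # 5x5 block of ones around each marked object

    def stamp(fmap: np.ndarray, row: int, col: int) -> None:
        """Mark a 5x5 block of ones centered at (row, col), clipped to the grid."""
        r0, r1 = max(0, row - HALF_BLOCK), min(GRID, row + HALF_BLOCK + 1)
        c0, c1 = max(0, col - HALF_BLOCK), min(GRID, col + HALF_BLOCK + 1)
        fmap[r0:r1, c0:c1] = 1.0

    def build_feature_maps(boundary_cells, enemy_cells, missile_cells, powerup_cells):
        maps = np.zeros((4, GRID, GRID), dtype=np.float32)
        for r, c in boundary_cells:
            maps[0, r, c] = 1.0                 # world boundary map
        for r, c in enemy_cells:
            stamp(maps[1], r, c)                # enemy position map
        for r, c in missile_cells:
            maps[2, r, c] += 1.0                # missile count map
        for r, c in powerup_cells:
            stamp(maps[3], r, c)                # power-up position map
        return maps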

4.2.2 Related Works

In 2013, DeepMind Technologies presented the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning [3]. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. Other research related to the Deep Q-Network (DQN) [4, 5] also guided our tuning of the DQN hyper-parameters.

4.2.3 Deep Q-Network (DQN) Structure

To better address the delayed-reward issue, we use a technique known as experience replay: the agent's experience at each time step, e_t = (s_t, a_t, r_t, s_{t+1}), is stored in a data set D = {e_1, ..., e_N} pooled over many episodes into a replay memory. The method is shown in Algorithm 1. In the inner loop we apply Q-learning (minibatch) updates to samples of experience drawn at random from the pool of stored samples. After performing experience replay, the agent selects and executes an action according to an ε-greedy policy.

Algorithm 1: Deep Q-learning with Experience Replay

    Initialize replay memory D to capacity N
    Initialize the action-value function Q with random weights
    for episode = 1 to M do
        Initialize the sequence s_1 = {x_1} from the feature maps of the first frame set
        for t = 1 to T do
            With probability ε select a random action a_t;
            otherwise select a_t = argmax_a Q(x_t, a)
            Execute a_t in the emulator and observe the reward r_t and the new feature maps x_{t+1}
            Set s_{t+1} = (s_t, a_t, x_{t+1}) and store the transition (x_t, a_t, r_t, x_{t+1}) in D
            Sample a random minibatch of transitions (x_j, a_j, r_j, x_{j+1}) from D
            Set y_j = r_j if x_{j+1} is terminal, and y_j = r_j + max_{a'} Q(x_{j+1}, a') otherwise
            Perform a gradient descent step on the loss (y_j - Q(x_j, a_j))^2
        end for
    end for
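The following is a minimal sketch of the replay memory used in Algorithm 1: a fixed-capacity buffer of (x_t, a_t, r_t, x_{t+1}, done) transitions with uniform random minibatch sampling. The capacity is an illustrative value; the batch size of 64 matches the setting described in Section 4.2.3.

    import random
    from collections import deque

    class ReplayMemory:
        def __init__(self, capacity: int = 100_000):
            self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

        def store(self, x_t, a_t, r_t, x_next, done: bool) -> None:
            self.buffer.append((x_t, a_t, r_t, x_next, done))

        def sample(self, batch_size: int = 64):
            # Uniform random minibatch, as in the inner loop of Algorithm 1.
            return random.sample(self.buffer, batch_size)

        def __len__(self) -> int:
            return len(self.buffer)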

The structure of our DQN is shown in Figure 5. We feed the DQN with a sequence of 5 frames as one frame set (one iteration) and perform one update every 64 frame sets (batch size 64), resulting in an input dimension of 52*52*4*5*64 (grid height * grid width * feature maps * frames per set * batch size). The first hidden layer convolves 16 8*8 filters with stride 4 and applies a rectifier nonlinearity. The second hidden layer convolves 32 4*4 filters with stride 2, again followed by a rectifier nonlinearity. The next hidden layer is fully connected and consists of 64 rectifier units. The output layer is a fully connected linear layer with 5 outputs, one per action. The loss function of the network is

    L = \left( r + V(s') - Q(s, a) \right)^2

Figure 5: DQN structure
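The following PyTorch sketch corresponds to the architecture described above. It assumes the 4 feature maps of the 5 frames in a frame set are stacked into 20 input channels of a 52x52 grid; the exact input layout and framework used in our experiments may differ.

    import torch
    import torch.nn as nn

    class DQN(nn.Module):
        def __init__(self, in_channels: int = 4 * 5, n_actions: int = 5):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_channels, 16, kernel_size=8, stride=4),  # 52x52 -> 12x12
                nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=4, stride=2),           # 12x12 -> 5x5
                nn.ReLU(),
                nn.Flatten(),
                nn.Linear(32 * 5 * 5, 64),                            # fully connected, 64 rectifier units
                nn.ReLU(),
                nn.Linear(64, n_actions),                             # one Q-value per action
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)

    # Example: a batch of 64 frame sets produces Q-values of shape (64, 5).
    q_values = DQN()(torch.zeros(64, 20, 52, 52))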

5 Experiments

5.1 Baselines & Oracle

As baselines, we implemented two simple agents. The first uses a random-move strategy: the fighter ship takes a random action for a random period of time. The second uses a more sophisticated rule-based strategy: when an enemy ship or a missile is about to collide with the fighter ship, the agent tries to move the ship out of the way. This strategy works much better, but since it can take only a few enemies into account, it cannot avoid collisions once more enemies and missiles appear.

As the oracle, we use amateur human-level play: we practiced the game for hours, then played it multiple times and recorded the final scores.

5.2 Evaluation & Analysis

For each kind of agent, we run the agent on the game 100 times and report the average and maximum scores as the evaluation of that agent. The results are shown in Table 1, and the learning curves of the Q-learning and Deep Q-learning methods are shown in Figure 6.

Table 1: Performance of each method

    Method                                       Average Score    Max Score
    Random                                       111              175
    Rule-based                                   802              1751
    Q-learning (2.1 x 10^5 iterations)           628              1475
    Q-learning (2 x 10^7 iterations)             7847             40875
    Deep Q-learning (2.1 x 10^5 iterations)      1341             7875
    Human                                        9025             15120

Figure 6: Learning curves for Q-learning (left) and Deep Q-learning (right)

Compared to all the other methods, Q-learning obtained the highest final performance. Compared to the human-player oracle, the Q-learning agent reaches a comparable average score, while its maximum score is much higher than the amateur human player's maximum. This shows the effectiveness of Q-learning in estimating the optimal policy, and it also shows that our reward function estimates future reward well and that our features capture good factors for approximating the Q-value. Another advantage of Q-learning is that each update takes much less time than in the Deep Q-learning method. When we compare Q-learning and Deep Q-learning after the same number of iterations (2.1 x 10^5) in Table 1, Deep Q-learning performs better, but since Q-learning can execute far more iterations in the same amount of time, the final performance of Q-learning with 2 x 10^7 iterations is much higher.

Table 1 also shows that the DQN can capture useful local features and learn effective non-handcrafted features through the network. However, the final performance of the DQN is much lower, because 1) primarily, its slow training speed means we cannot run more iterations in limited time; 2) feature maps of positions alone may not contain enough information for this complicated game environment; and 3) the DQN structure with two convolution layers might not be appropriate for this game.

Other interesting observations. During Q-learning training, we once found that our agent was too aggressive: the hero usually preferred to shoot enemies rather than dodge ammo and collect power-ups, even though the latter actions are more likely to lead to a higher future reward. Checking our implementation, we found that the weight of r_attack in our reward function (5) was too high. As a result, the heuristic

could not estimate the future score correctly. We then increased the weights of r_dmg, r_dodge, and r_item and re-trained the model. After this modification of the hyper-parameters, we obtained a much smarter agent that dodges ammo, collects power-ups, and attacks enemies at the same time. From Figure 6, we see that the performance of Q-learning increased rapidly around iteration 2.1 x 10^7; this is exactly when we tuned up the weights above and also tuned down the value of ε to continue the training.

6 Conclusion

To conclude, we obtained a capable game agent for the scrolling shooter game using the Q-learning algorithm and a Deep Q-Network. Our experiments show the effectiveness of the DQN in capturing local-region and non-linear hidden features when compared with traditional Q-learning at the same number of iterations. However, the highest performance is still achieved by Q-learning, since it is more efficient and can also capture good features through a well-designed reward function and feature vector. The final performance of our Q-learning game agent is comparable to that of an amateur human player.

7 References

[1] Chromium B.S.U., an open-source scrolling shooter game. http://chromium-bsu.sourceforge.net/
[2] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[3] V. Mnih et al., "Playing Atari with Deep Reinforcement Learning," CoRR, vol. abs/1312.5602, 2013.
[4] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "What is the best multi-stage architecture for object recognition?," in Proc. IEEE 12th International Conference on Computer Vision (ICCV), 2009, pp. 2146-2153.
[5] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807-814.