Human Level Control in Halo Through Deep Reinforcement Learning


Samuel Colbran, Vighnesh Sachidananda

Abstract — In this report, a reinforcement learning agent and environment for the game Halo: Combat Evolved are detailed. The reinforcement learning agent approaches human-level performance when racing a vehicle on terrain with hills and obstacles.

Index Terms — Reinforcement Learning, Q-Learning, Function Approximation, Autonomous Vehicle, Halo

1 INTRODUCTION

Many recent works in the field of artificial intelligence have focused on the development of actors that can learn how to maximise expected utility, without prior knowledge of their environment, through reinforcement learning. In this project, several approaches to developing an intelligent agent are evaluated in Halo: Combat Evolved, a game shown in figure 1 that was developed by Bungie and published by Microsoft Studios in 2001.

Fig. 1. Halo - Combat Evolved.

2 TASK DEFINITION

2.1 Aim

The aim of this project is to develop a reinforcement learning agent that is able to achieve human-level vehicle control in Halo: Combat Evolved. The evaluation of the agent is strictly concerned with driving an in-game car through a timed obstacle course.

2.2 Scope

This agent must be able to (1) create a representation of its environment from high-dimensional pixel inputs and (2) generalize these inputs against past experiences in order to take actions and make decisions under future uncertainty. The specifics of this abstract process are explained in section 4.4.

Fig. 2. Reinforcement Learning Agent.

2.3 Evaluation

In order to prove the agent's ability to evaluate and understand the 3-dimensional game environment both visually and physically, the autonomous agent will engage in both a solo race and a head-to-head race against a human opponent. The race consists of markers to which the autonomous agent must navigate, culminating in an endpoint. Only the location of the very next marker will be known to the agent. The game state will include RGB pixels, a depth map, the location of the vehicle and more.

The map layout and game physics are unknown to the autonomous agent.

2.4 Dataset

The Halo game was not designed in a way that allows the integration of a reinforcement learning agent. The dataset we provide is the result of reverse engineering the proprietary, compiled Halo assembly code.

3 INFRASTRUCTURE

A large amount of infrastructure needed to be created prior to the development of the artificial intelligence component. Halo is proprietary, closed source and does not provide an existing API for controlling the game. The following section describes the interface that was created in order to mesh a server-side script containing the artificial intelligence with the game.

3.1 Halo Plugin

The original versions of Halo for Mac released in 2001, including a full version and a demo version with limited features, only supported PowerPC architectures. When Apple transitioned the Mac lineup to Intel processors, an updated version of Halo was released with support for the Intel x86 architecture. The demo version of Halo was not updated, so members of the community took the new full version and stripped out any content that was not previously provided by the demo. With this, Halo Mini Demo was born. A plugin system was created during its development that enabled members of the community to inject their own code into the executable. This system was utilised to build the API that the artificial intelligence components could use to control the game. The system hooks into game functions (stored at a certain point in executable memory) with the following procedure:

1) Disable executable memory protection (to make it writable).
2) Move a few of the instructions stored at a particular function offset into a code cave (i.e. move them somewhere else).
3) At the end of the code cave, add an instruction which jumps back into the original function. At this point, executing the instructions starting at the top of the code cave should be equivalent to calling the old function.
4) Write a new instruction at the function offset which jumps to a new set of instructions (i.e. the code we want to run instead).
5) Enable executable memory protection (to make it executable again).

In our new set of instructions, we can then jump to the code cave if we want to execute the old function at any time. The following functions and their corresponding locations in the executable were found using a debugger, x86 assembly editors and other reverse engineering tools.

Python Bindings

To speed up rapid prototyping, it was decided that the majority of the artificial intelligence code would be developed in Python rather than directly in the plugin with Objective-C++. All of the logic was moved to the server script (described in section 3.2) and executed by the plugin when Halo was launched. To facilitate this, a Python interpreter was added to the plugin and several bridge objects were developed to pass data between the API (C) and Python. This decision also made it easier to connect with other libraries such as TensorFlow and Keras. Additional features were added to support rapid prototyping. The plugin was designed to listen for any changes to the server script and then automatically reload the module without needing to reset the game. This meant that one could debug very easily: if something wasn't working, it was possible to insert a print statement or modify the debugging function and then immediately see the changes in action without needing to relaunch the game, as sketched below.
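The reload behaviour can be pictured with a short sketch. This is illustrative only: the actual reload is performed natively by the plugin in Objective-C++, and the module and class names here (server_script, Agent) are hypothetical stand-ins for the real script. It assumes the script exposes the reload(self, prior) callback described in section 3.2.

```python
import importlib

import server_script  # hypothetical name for the AI server script module


def hot_reload(current_agent):
    """Re-import the edited script and hand the previous instance to the
    new one so learned weights (or any other state) can be copied over."""
    importlib.reload(server_script)           # pick up the edited source
    new_agent = server_script.Agent()         # hypothetical class name
    new_agent.reload(prior=current_agent)     # callback described in section 3.2
    return new_agent
```
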

Render Objects (0x2305E0)

The render objects function was modified (i.e. replaced, with the existing function then called from the replacement) to facilitate the extraction of the diffuse and depth maps into image files. The diffuse map is what one normally sees when playing Halo. The depth map contains information relating to the distance of the surfaces displayed on screen: a high value (white) means that the rendered surface is far away, whereas a low value (black) means that it is near to the viewpoint. An example of these two maps is shown in figure 3.

Fig. 3. Diffuse map (left) and Depth map (right).
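As a rough illustration of how an exported depth map can be turned into a small feature vector for the agent, a sketch follows. The file name, image mode and sampling grid here are hypothetical, not the project's actual values.

```python
import numpy as np
from PIL import Image


def depth_features(depth_png_path, grid=(4, 6)):
    """Load an exported depth map and sample it on a coarse grid.

    Returns values in [0, 1], where 1.0 means a far-away surface and
    0.0 means a surface right in front of the viewpoint.
    """
    depth = np.asarray(Image.open(depth_png_path).convert("L"), dtype=np.float32) / 255.0
    h, w = depth.shape
    rows = np.linspace(0, h - 1, grid[0], dtype=int)
    cols = np.linspace(0, w - 1, grid[1], dtype=int)
    return depth[np.ix_(rows, cols)].ravel()   # grid[0] * grid[1] depth features
```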

To facilitate debugging, an additional step was injected into the rendering pipeline that cleared the existing rendering and replaced it with a flat plane, which was then rendered with the diffuse texture. This step, with an unmodified diffuse texture, does not have any impact on the game: imagine taking a snapshot of the currently displayed screen, deleting everything currently on the screen, and then rendering that snapshot back to the screen. For debugging, the plugin provides a mutable copy of the diffuse texture to the server script. With this, the server script is able to modify any of the pixels within the game screen, which greatly assisted in debugging. Figure 4 shows an example of the Q-value weight debug screen (i.e. the diffuse texture was modified by the server script to draw horizontal lines depicting the weights for each action at sampled pixels on the screen) that was made possible by the API.

Fig. 4. Q-value weight display utilising the debug API.

Render Object (0x1a7afc)

During early stages of the project development cycle, the presence of the car in the depth map was causing problems with user-provided features. The render object function was modified to remove the player (and the vehicle, if they were in one) from the render pipeline.

Read Controls (0x13e726)

The read controls function was fully replaced. In Halo, the controls are stored in a structure containing the following values:

1) jumping (bool)
2) switch grenade (bool)
3) interacting (bool) - used to enter vehicles etc.
4) switch weapon (bool)
5) meleeing (bool)
6) flashlight (bool)
7) throw grenade (bool)
8) fire weapon (bool)
9) crouching (bool)
10) zooming (bool)
11) scores (bool) - bring up the leaderboard
12) reload weapon (bool)
13) talk (bool) - bring up the chat dialog
14) movex (float) - move left or right
15) movey (float) - move backwards or forwards
16) lookx (float) - look left or right
17) looky (float) - look upwards or downwards

Rather than calling the old function, which would populate these values depending on which keys were currently pressed, the plugin instead asks the server script for controls, as sketched below. As the focus was on movement only, the server script was only able to set interacting (so that the actor could get into a vehicle), movex (so that the actor could move) and lookx (so that the actor could turn). An extended version of the API could be created that allowed the actor to set all of these controls and fully mimic a human player, but this was not necessary for the project.
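A minimal sketch of the control values the server script might hand back when the plugin asks for input. Only the three fields the project actually drives are filled in; the helper name and default values are illustrative, not part of the real API.

```python
def make_controls(steer, accelerate=1.0, enter_vehicle=False):
    """Build the subset of the Halo controls structure used by the agent.

    Only the three fields the project drives are set: interacting (to get
    into a vehicle), movex (to move) and lookx (to turn). Everything else
    in the controls structure is left at its default.
    """
    return {
        "interacting": bool(enter_vehicle),
        "movex": float(accelerate),
        "lookx": float(steer),
    }
```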

Run Command (0x11e3de)

Halo provides an in-game console where players can type various debugging commands (such as resetting the map, changing the map, listing players etc.). To assist with debugging the learning, the run command function was augmented with the following additional commands:

1) ai (on/off) - Turns the server script on or off. This was useful for starting at a certain test point, getting unstuck etc.
2) debug (on/off) - Turns the debug display on or off. As Python is modifying a huge array during debugging, it tends to slow the game down, so turning the display off is necessary when one is only interested in running the learning at maximum speed.
3) super (on/off) - Turns on supervised learning mode. As described in section 4.4, supervised learning was attempted, and this mode was used to record the diffuse map along with the user's current controls a few times per second.
4) freedrive (on/off) - Gives control back to the user while still running the server script. This is useful if one wants to debug learned weights.
5) learn (on/off) - Passes a flag to the AI to control whether it should learn (i.e. have some exploration factor) or simply exploit as much as possible.

Other

A few other game functions were modified to remove things such as antenna rendering, modify the game resolution etc.

3.2 Server Script

The server script contains the intelligence component and is written in Python. The plugin exposes a module called halo that the script can import to run the following functions:

1) console(string) - prints a string to the in-game console.
2) restart() - restarts the current match.
3) speed(float) - sets the game speed.
4) map(name, gametype) - sets the current game map.
5) ai(boolean) - sets whether the AI is in control.

The script also implements a class with the following functions, which are executed by the plugin at the appropriate time:

1) def configure() - called when Halo launches. Used to set the game speed, map file and other configuration settings using the halo module.
2) def setlearn(self, learn) - called when the user enters the learn on/off command in the in-game console.
3) def reload(self, prior) - called when the server script changes. The prior object contains the previous instance of the script (so one can copy over old weights etc.).
4) def onframe(self, input) - called when a new frame is rendered by the game engine.
5) def oncontrol(self, state) - called periodically when the game is requesting player controls.

3.3 Client Script

The server script is constrained to a 32-bit architecture, as it runs within the 32-bit Halo process. This presented a problem when connecting to libraries such as TensorFlow, which require a 64-bit instance of Python. Rather than going through the tedious process of compiling each library as a 32-bit version, it was decided that a new client script would be created. It communicates with the server script using a simple networking library such as Pyro4 to speed up development time. The client script is then able to run TensorFlow and other libraries in 64-bit, possibly on a different machine connected to a very powerful GPU to speed up learning. Each state is sent over the network from the server to the client, and the client responds with a corresponding action. The client script is completely optional and requires a server script that supports the networking library. If one were to develop all of the logic within the server script and not require additional 64-bit libraries such as TensorFlow, the client script would not be required.

3.4 Architecture

Overall, the system we developed connects processes in C++ and Python. The C++ process executes Halo (the game environment) as well as the Halo plugin. The plugin exposes control to a Python server script (running as a 32-bit process). The Python server script communicates with the client script through remote procedure calls. The full diagram of this communication, along with the data being sent, is shown in figure 5. A minimal server-script skeleton that fits this architecture is sketched below.

Fig. 5. Architecture Diagram.
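The following is a sketch of what a server script with this shape could look like. It is illustrative only: the halo module calls and callback names follow section 3.2, but the class name, map/gametype strings, the Pyro4 object URI and the remote agent interface are hypothetical.

```python
import halo   # module exposed by the plugin (section 3.2)
import Pyro4  # optional RPC to a 64-bit client script (section 3.3)


class Script:
    def __init__(self):
        # Hypothetical URI; the client registers a remote agent object here.
        self.remote_agent = Pyro4.Proxy("PYRONAME:halo.agent")
        self.learning = True

    def configure(self):
        halo.speed(2.0)                      # run the game faster while training
        halo.map("bloodgulch", "race")       # hypothetical map/gametype names
        halo.ai(True)

    def setlearn(self, learn):
        self.learning = learn

    def reload(self, prior):
        if prior is not None:                # keep state across hot reloads
            self.remote_agent = prior.remote_agent
            self.learning = prior.learning

    def onframe(self, input):
        pass                                 # e.g. update the debug overlay here

    def oncontrol(self, state):
        # Ship the state to the 64-bit client and return its chosen controls.
        return self.remote_agent.act(state, learn=self.learning)
```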

4 APPROACH

4.1 Model

In describing the model for our reinforcement learning algorithm, we begin by outlining the inputs and outputs to the algorithm, which dictate how the states, policy and objective are built. This treatment allows for an understanding of the Markov Decision Process that governs the autonomous agent's gameplay.

Sensor Input

At each state of the algorithm, the values shown in table 1 are received by the algorithm.

TABLE 1
Sensor Inputs at Each State

Variable                               | Range        | Description
(position_x, position_y, position_z)  | R^3          | Vehicle x, y, z coordinates
(target_x, target_y, target_z)        | R^3          | Next goal x, y, z coordinates
(heading_x, heading_y, heading_z)     | R^3          | Vehicle rotation x, y, z unit vector
vehicle                               | {0, 1}       | Whether the player is in a vehicle
tid                                   | I (integers) | Current race index. Starts at 0, then increases to 1 when the player hits the first checkpoint, etc.

Actions

Using the sensor inputs and the developed model, the autonomous agent is able to drive the car by specifying a policy over the variables shown in table 2.

TABLE 2
Action Outputs at Each State

Variable   | Range      | Description
accelerate | {-1, 0, 1} | Acceleration constant. The values correspond to reverse, stop and accelerate respectively.
steer      | [-2, 2]    | The direction and amount in which to steer. A negative number indicates steering to the left, 0 indicates continuing on a straight path, and a positive number indicates steering to the right.

States

Through the previously defined sensor inputs and actions, an understanding of the states and state space can be examined.

Unique State: Each state can be uniquely defined by the position and heading of the vehicle, the target, and whether or not the player is in a vehicle. These variables are shown in the first four rows of table 1.

Time Variant Aspects: The Markov Decision Process was trained under the assumption that there are no time-variant aspects included in the state space. This means that following a deterministic policy always leads to the same expected reward, regardless of the time at which actions are taken.

Continuous State Space: Since the variables that dictate the possible actions are continuous intervals (that are closed and bounded), we have a state space that requires continuous control. How we managed this is detailed in a later section.

Action Space Complexity: As the action space is continuous, it contains infinitely many actions and is difficult to convert into a discrete set. With the given interval from 0 to 1, if one was to create 100 discrete choices from 0.00 to 1.00 and 400 discrete choices for our steering direction, there would be 100 x 400 = 40,000 choices to choose between at each stage.

Time between next State: Any value iterations need to be very fast due to the frame rate of the Halo game. Currently, the frame rate is about 30 FPS, and the autonomous agent is expected to act on a policy and make decisions within 20 milliseconds.

4.2 Policy

Stochastic Policy: The policy network implemented takes a state, as described previously, as input and outputs an action from the action space. A stochastic policy was used to encourage exploration of the game surroundings. In addition to the game state, the policy relied on a specific feature set θ. These features were learned through the sensory input and are described in detail in the Deep Q-Learning portion of this paper.

π_θ(a | s) = P[a | s, θ]
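As a toy illustration, one concrete way such a stochastic policy over a discretised steering set could be parameterised is a softmax over linear scores, sketched below. This is not the exact parameterisation used in the project (most experiments below use ε-greedy Q-learning); the bin values, feature vector and temperature are placeholders.

```python
import numpy as np

# Five illustrative steering bins spanning the [-2, 2] range of table 2;
# the project binned steering into five small values around 0.00.
STEER_BINS = np.linspace(-2.0, 2.0, 5)


def stochastic_policy(features, theta, temperature=1.0):
    """Sample a steering bin with probability P[a | s, theta] (softmax scores)."""
    scores = theta @ features                 # one score per steering bin
    probs = np.exp(scores / temperature)
    probs /= probs.sum()
    action = np.random.choice(len(STEER_BINS), p=probs)
    return STEER_BINS[action], probs


# theta has one weight row per action; features come from the sensor inputs
# (table 1) and sampled depth values, as described in the next section.
```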

Learning Policy: This policy is learned through the notion of discounted reward. The following is a representation of the objective, whose details are mentioned in the next section:

L(θ) = E[r_1 + γ r_2 + γ² r_3 + ... | π_θ(s, a)]

4.3 Objective

Reward: In implementing the learning policy, rewards have been defined as progression towards the next marker in our race. A discount factor γ of 0.9 was used, and the reward is proportional to the distance progressed towards the next marker, where m is the next marker's position, l_0 is the location of the vehicle at the previous state and l_1 its location at the current state:

r_1 = ||m - l_0||² - ||m - l_1||²

4.4 Algorithms

The proposed approach for our task is to use recent results from Google DeepMind, including Deep Q-Learning, continuous policy gradients, and the Actor-Critic algorithm in conjunction with Deep Q-Learning.

Q-Learning with Function Approximation and Weight Regularization

The first algorithm utilised was Q-Learning with function approximation. As the state space of this problem is enormous, function approximation is necessary to estimate the Q-value of states using a reduced number of parameters. If one was to try to apply ordinary Q-Learning, the algorithm would need training examples from every single possible location, target, heading etc., which is an inefficient approach. Instead, the following two features were used to approximate the value of a state. The intuition behind choosing these features was to maintain a course that minimises race time whilst being able to avoid any immediate obstacles. To simplify the problem, the acceleration component was removed from the action (it was simply set to 1 at all times) and the steering direction was binned into 5 discrete choices centred on 0.00 (straight ahead), with the remaining values steering left and right by increasing amounts.

Best Angle: the best angle feature used the position, target and heading state variables to compute the steering direction that would optimise reaching the target. Using this feature alone to choose the action would be equivalent to running the baseline, where the actor drives directly towards the next goal point. This is prone to crashing into obstacles and getting stuck, so the additional depth feature was also added.

Depth: the depth at a subset of sampled pixels was added as features. It was hoped that the algorithm would assign reasonable weights to the depth features so that it could avoid obstacles which might cause it to crash or get stuck.

Exploration Policy

An epsilon-greedy approach was used to balance exploration with exploitation. The epsilon term was slowly decreased as more training samples were gathered, to ensure that the algorithm would converge over time to full exploitation and have a reasonable chance of reaching race markers past the initial one. The process used to anneal the exploration rate is a stationary Gauss-Markov process, also known as the Ornstein-Uhlenbeck process:

dε_t = θ(µ - ε_t) dt + σ dW_t

Weight Regularization

A novel approach was used to normalize the weights after each iteration relative to each action. This prevented any one action from acquiring such an extreme (very high) weight that it would always be chosen. Instead, the weights for each action summed to 1, so that the difference between the overall Q-values for each action was determined solely by the state rather than by relatively extreme weights. After a closer literature review, we found that this idea has recently been presented for faster convergence in deep neural networks by Tim Salimans and Diederik P. Kingma [7].
The proposal is to reparametrize the weights in the following manner, where v is a vector, g is a scalar, and ||v|| denotes the Euclidean norm of v:

w = (g / ||v||) v,    ||w|| = g

As proposed in this recent work, we also find that decoupling the weight vector v from its norm makes the stochastic gradient descent we employ converge a lot faster. Furthermore, the gradient of the loss function that we optimize changes under this weight reparametrization:

∇_g L = (∇_w L · v) / ||v||

∇_v L = (g / ||v||) ∇_w L - (g ∇_g L / ||v||²) v
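A small numpy sketch of this reparametrization and its gradients, assuming a linear Q-value layer. Variable names are illustrative; the code mirrors the formulas above rather than the project's exact implementation.

```python
import numpy as np


def weightnorm_forward(v, g, features):
    """Return the score w·phi with w = (g / ||v||) v."""
    w = g * v / np.linalg.norm(v)
    return w @ features, w


def weightnorm_grads(v, g, grad_w):
    """Translate a gradient w.r.t. w into gradients w.r.t. g and v."""
    norm_v = np.linalg.norm(v)
    grad_g = grad_w @ v / norm_v
    grad_v = (g / norm_v) * grad_w - (g * grad_g / norm_v**2) * v
    return grad_g, grad_v
```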

Utilizing this modified weight vector, and as a consequence the modified gradient, we find that our model converges much faster than it otherwise would, which we show in Section 6 (Error Analysis).

Supervised Learning

Although not reinforcement learning, supervised learning was attempted as an experiment to see how its performance might compare with reinforcement learning. To collect the necessary data, a human driver drove around the race course numerous times whilst the plugin recorded a frame and the associated steering action a few times per second. The actions were simplified as in the previous section to convert the problem into a simple 5-class classification problem. The model used is a convolutional neural network, shown in figure 6, trained to minimize the categorical cross-entropy; the predictor is ŷ = g(w · x_n), where g is the logistic function, and the loss is shown more explicitly below. Unfortunately, due to limited computer performance (no GPU acceleration), we found the process to yield poor results. Our classification error, cross-validated with 20 percent of the data as a holdout, was about 36 percent.

L(w) = (1/N) Σ_{n=1}^{N} H(p_n, q_n) = -(1/N) Σ_{n=1}^{N} [ y_n log(ŷ_n) + (1 - y_n) log(1 - ŷ_n) ]

Fig. 6. Convolutional Neural Network Model.

Deep Q-Learning

In addition to using function approximation with linear regression, we implemented the Deep Q-Learning algorithm as detailed by Google DeepMind. This combines the expressiveness of the neural networks we used in the supervised learning with the Q-Learning model, which is better suited to the MDP-style structure of gameplay. At the core of the Deep Q-Learning algorithm that we use is the Q-Learning with function approximation algorithm. At each step we perform the following update to our learned policy:

Q̂_opt(s, a; w) = w · φ(s, a)

w ← w - η [ Q̂_opt(s, a; w) - (r + γ V̂_opt(s')) ] φ(s, a)

We tested both nonlinear function approximation, through deep neural networks, and linear function approximation. In both implementations, we derive weights from the pixel (r, g, b) and depth (α) maps from gameplay. We compute hidden-layer activations and then use these activations to find a score for the current state. After experimentation, we used solely the depth (α) map, as it allows for a better encoding of the environment the agent needs to interact with: the depth of objects was found to be far easier to interpret than the RGB values. However, with GPU acceleration we would have used both the pixel and depth maps.

The learning function with deep neural networks is shown below. Each neuron learns an activation (depicted by h_j), and an overall score is computed for the input:

σ(z) = (1 + exp(-z))^(-1)

h_j = σ(v_j · φ(x))

score = w · h

This gives us a model-free estimation of the game state:

Q^π(s, a) ≈ Q(s, a, w)
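A compact sketch of the per-step update above for the linear case, with ε-greedy action selection as described in the Exploration Policy section. Feature construction and hyper-parameter values are placeholders, not the project's exact ones.

```python
import numpy as np

ACTIONS = range(5)          # the 5 discretised steering choices


def q_value(w, features, a):
    return w[a] @ features                      # Q̂_opt(s, a; w) = w · φ(s, a)


def select_action(w, features, epsilon):
    """ε-greedy: explore with probability ε, otherwise exploit."""
    if np.random.rand() < epsilon:
        return np.random.choice(list(ACTIONS))
    return max(ACTIONS, key=lambda a: q_value(w, features, a))


def td_update(w, features, a, reward, next_features, eta=0.01, gamma=0.9):
    """w <- w - η [Q̂(s,a;w) - (r + γ V̂(s'))] φ(s,a), applied to action a's weights."""
    v_next = max(q_value(w, next_features, b) for b in ACTIONS)
    delta = q_value(w, features, a) - (reward + gamma * v_next)
    w[a] -= eta * delta * features
    return w
```

Here w is a (5, d) array holding one weight vector per steering choice and features is the d-dimensional vector built from the best-angle and sampled-depth features.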

Policy Gradients

In order to use continuous control (i.e. a continuous action space) with this formulation, we set up a new loss function derived from the policy gradient:

dL(θ)/dθ = E_{x~p(x|θ)} [ dQ/dθ ] = E_{x~p(x|θ)} [ (dQ(s, a, w)/da) (da/dθ) ]

Our new loss function now looks like the following:

Loss = [ r + γ Q(s', a') - Q(s, a) ]²

We implement our Deep Q-network with and without policy gradients.

5 LITERATURE REVIEW

In implementing and testing the various models, literature regarding Deep Q-Learning, convolutional neural networks and stochastic optimization was reviewed.

5.1 Deep Q Learning

Most of our insights on Deep Q-Learning are derived from in-class presentations on Deep Q-Learning and Google DeepMind's recent papers. Specifically, the following works were very useful towards building our reinforcement learning agent.

Human Level Control through Deep Reinforcement Learning

This paper outlines the formulation for the Deep Q-Network that was used in this project.

Continuous Control with Deep Reinforcement Learning

In using policy gradients for our work, we used this paper to dictate how the loss functions for the Deep Q-Network change when the action space is continuous.

5.2 Convolutional Neural Networks

In testing a convolutional neural network for supervised learning, we used the literature to dictate how to structure the task and the learning for image classification.

ImageNet Classification with Deep Convolutional Neural Networks

ImageNet has evolved into a canonical benchmark, and this paper outlines some of the core architectural components of modern convolutional networks. The importance of convolutional, activation and fully connected layers is discussed in detail.

5.3 Stochastic Optimization

The crux of most modern machine learning algorithms is a performant algorithm for stochastic optimization. We reviewed literature in this field to better understand how to speed up convergence of learned policies.

Adam: A Method for Stochastic Optimization

In this work, the authors present a moment-based approach towards updating weights through stochastic optimization.

Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

This work became part of how we trained the function approximation in the Q-network for our reinforcement learning agent. We found this addition to speed up convergence of the network substantially, and it allowed our model to be run on a single quad-core CPU.

6 ERROR ANALYSIS

After implementing the various models, we ran the algorithms on our defined task and noted their performance. A summary of performance across models, along with a detailed analysis of the best model, is included next. We found the best model, given our data and machine constraints, to be Q-Learning with linear function approximation.

Model Benchmarks

We prepared two different solutions to the task of learning to drive a vehicle in Halo. The first solution used convolutional neural networks and the second used reinforcement learning. Our results using convolutional neural networks for classification were poor, because we could only capture a few episodes worth of data and could only use one layer of the RGB pixel tensor due to computational limits.

Reinforcement Learning: When using Q-Learning with function approximation and weight normalization, we found that we are able to converge to a non-zero reward quite quickly (figure 7), although there is a lot of variance across episodes. To combat this, further work could be done using multiple models with voting, such as the Asynchronous Advantage Actor-Critic model. We find that the agent is able to complete the course in about 15% of the episodes.

Fig. 7. Q Learning Reward Over Time.

Best Model Results

When competing in the race format, we found that our agent performs better than the baseline and is comparable to human-level performance. The reinforcement learning agent crashes less and, more importantly, does not get stuck on obstacles like the baseline does. To evaluate the ability of the agent to autonomously drive a vehicle, we simulate a race in which it must drive through areas of varying terrain with obstacles. To quantify its performance, we had it race through such a course and timed the completion at multiple checkpoints throughout the race. We scored the contestants using the rubric shown in figure 8. The agent's goal is to maximize its score, and it will thus aim to finish as many checkpoints as possible with the lowest times achievable. The performance on a short course with moderate terrain and a longer course with heavier terrain is shown in figures 9 and 10.

Fig. 8. Score Per Checkpoint.

Fig. 9. Short Course Scores (3 Markers).

Fig. 10. Long Course Scores (6 Markers).

7 CONCLUSION

We develop and detail an environment and agent for human-level control in Halo, and achieve near human-level results in moderately challenging control conditions. Further work will focus on augmenting the learning algorithms with more expressive function approximation, though this also requires better hardware. Another avenue for future work is training the agent on different tasks and investigating ways to transfer learning across these tasks; doing so would be a step towards more generalized artificial intelligence. Lastly, we would like to thank the CS221 team for the class and Bryan Anenberg, our TA, for providing us with feedback throughout the course of the project.

REFERENCES

[1] Human Level Control through Deep Reinforcement Learning, /v518/n7540/full/nature14236.html
[2] Continuous Control with Deep Reinforcement Learning.
[3] Recurrent Policy Gradients.
[4] DDPG Keras TORCS.
[5] ImageNet Classification with Deep Convolutional Neural Networks.
[6] Adam: A Method for Stochastic Optimization.
[7] Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks.
