ROBOCODE PROJECT AIBOT - MARKOV MODEL DRIVEN AIMING COMBINED WITH Q LEARNING FOR MOVEMENT


PATRICK HALUPTZOK, XU MIAO

Abstract. In this paper the development of a robot controller for Robocode is discussed. The Java robot used an aiming strategy based on modeling the enemy robot's movements with a Markov model: the gun was aimed at the location where the enemy robot was expected to be by the time the bullet had traveled far enough to hit it. The robot's movement was based on maintaining an optimal distance from the enemy, far enough to dodge the enemy's fire yet close enough to hit the enemy reliably. Q learning was used to learn the optimal movement strategy. The robot won the in-class robot war competition.

Key words and phrases: Robocode, Reinforcement Learning.

1. AIBot movement strategy

A general introduction to Robocode is presented in Appendix 1; the following description assumes a good understanding of general Robocode functionality and strategy. In a 1v1 contest both robots often implement strong aiming and tracking systems, so the key to victory usually lies in movement strategy. Dodging the bullets fired by the enemy and avoiding the walls preserves energy, so you can outlast the enemy. Maintaining a good field position, where you are not cornered, is also important so that you can dodge bullets without bumping into the wall.

We started by studying the tactics used by many of the most successful robots in 1v1 tournaments. To avoid being targeted accurately, many tanks continuously change their heading and velocity at random. Another strategy is monitoring the enemy's energy level. When the enemy fires a bullet, its energy drops by between 0.1 and 3.0 depending on the energy given to the bullet. If my tank only fires bullets with energy 0.8 or higher, then when the enemy is hit by one of my bullets its energy drops by 3.2 or more, so it is easy to tell whether the enemy's energy drop came from firing a bullet or from being hit by one. Some tanks move fairly predictably until they detect the firing of a bullet, and then change their movement at exactly that point, so that an enemy which has modeled their movement will likely miss.
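The energy-drop bookkeeping described above is easy to capture in a small helper. The sketch below is illustrative rather than the actual AIBot source; the class and method names are hypothetical, while the 0.1 to 3.0 firing window and the choice to fire only bullets of power 0.8 or more come from the text.

// Minimal sketch (not the actual AIBot code) of detecting an enemy shot
// from its energy drop, as described above. All names are hypothetical.
public class EnergyTracker {
    private double lastEnemyEnergy = 100.0;  // Robocode robots start with 100 energy

    /**
     * Call once per scan with the enemy's current energy.
     * Returns true if the drop since the last scan looks like a fired bullet
     * (between 0.1 and 3.0) rather than damage from one of our own bullets
     * (3.2 or more, since we only fire power >= 0.8).
     */
    public boolean enemyJustFired(double enemyEnergy) {
        double drop = lastEnemyEnergy - enemyEnergy;
        lastEnemyEnergy = enemyEnergy;
        return drop >= 0.1 && drop <= 3.0;
    }
}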

Q learning was used to control the robot, with the goal of learning to move to the most advantageous field position, one that minimized the likelihood of being hit by the enemy or bumping into the wall. One feature to aid in evasion was generated by monitoring the enemy's energy level and providing, as a feature of the MDP state space, whether the enemy had fired in the previous time step. When the enemy fires a bullet it takes time to reach our robot, and the hope was that Q learning would learn to take evasive action. There is a range of area we can reach by the time the bullet has traveled far enough to hit us; this range can be delimited by the minimum and maximum angle of the enemy's gun at the moment it fires. It was hoped the Q-learning evasion strategy would make our tank's position uniformly distributed between the enemy's minimum and maximum gun angles by the time the bullet arrives, so that the best the enemy can do is aim its gun at a random angle within that range and fire. This is the strategy that minimizes the probability of our tank being hit by enemy fire. Figure 1 illustrates this ideal evasion strategy, showing the angle across which we want to distribute our position uniformly.

Figure 1. The evasion technique used to reduce the probability of being hit. The optimal strategy is to be positioned uniformly at random across the range your robot can reach by the time the bullet arrives.

1.1. Q-Learning. Q learning was the approach implemented for controlling movement. The approach was to describe the current state of the world with a number of features and, from that state, learn the best movement to make. The utility estimate for each state-action pair was stored in a table. The reward feedback from the environment was -1 for hitting a wall and -10 for being hit by an enemy bullet; Robocode fires events for both, so the reward feedback was easy to track. The Q-learning feedback was kept independent of firing accuracy to simplify the approach. This has the obvious problem that as my robot stays further from the enemy it will likely be hit less often, but will also miss the enemy more often. To start training, Q(s,a) was initialized with very optimistic utility values to encourage it to try all the state-action pairs.

In Q learning my MDP state space was defined by the horizontal and vertical location of the enemy tank, the angle of my tank from the enemy tank, the distance from the enemy tank, and whether the enemy tank had just fired. The state space would be too large using raw measurements, so I bucketed them. There were 7 horizontal, 7 vertical, 8 angle, 6 distance and 2 enemy-fired buckets, giving 7x7x8x6x2 = 4704 unique states.
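The bucketing described above amounts to a simple index computation. The sketch below uses the bucket counts from the text (7 horizontal, 7 vertical, 8 angle, 6 distance, 2 fired); the battlefield size, the distance cap and all names are illustrative assumptions rather than the original code.

// Sketch of discretizing the MDP state into the 7x7x8x6x2 = 4704 buckets
// described above. Field dimensions and the distance cap are assumptions.
public class StateBuckets {
    static final int HORZ = 7, VERT = 7, ANGLE = 8, DIST = 6, FIRED = 2;

    static int bucket(double value, double max, int buckets) {
        int b = (int) (value / max * buckets);
        return Math.min(Math.max(b, 0), buckets - 1);  // clamp to a valid bucket
    }

    /** Flatten the five bucket indexes into one state id in [0, 4704). */
    static int stateId(double enemyX, double enemyY, double angleDeg,
                       double distance, boolean enemyFired) {
        int h = bucket(enemyX, 800.0, HORZ);                      // assumed 800x600 battlefield
        int v = bucket(enemyY, 600.0, VERT);
        int a = bucket((angleDeg % 360 + 360) % 360, 360.0, ANGLE);
        int d = bucket(Math.min(distance, 900.0), 900.0, DIST);   // assumed distance cap
        int f = enemyFired ? 1 : 0;
        return (((h * VERT + v) * ANGLE + a) * DIST + d) * FIRED + f;
    }
}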

The utility table was stored as QLStates[HORZ_BUCKETS][VERT_BUCKETS][DIST_BUCKETS][ANGLE_BUCKETS][FIRED_BUCKETS]. For each state I allowed 7x7 = 49 total actions: 7 of the actions related to moving closer, further, or staying at the same distance relative to the enemy, and 7 related to moving clockwise, counter-clockwise or staying at the same angle relative to the enemy. Each movement in distance or angle was, for each direction, a small amount (1), a large amount (15), or a random amount (15 * random). The update formula was:

(1.1)  Q_prev <- Q_prev + alpha * (reward + gamma * eBestAction - Q_prev)

where Q_prev stands for aeStateCosts[iHorzPrev][iVertPrev][iDistPrev][iAnglePrev][iFiredPrev][iAngleActPrev][iDistActPrev], alpha was 0.01, gamma was the discount factor, and eBestAction was the best aeStateCosts value for the current state across all currently possible actions (the actions being the last two array indexes).

Q learning was fun to watch converge. I saved the updated weights out after each battle to load in the next fight. Initially the robot would just sit or make very random movements. After a number of battles it converged to somewhat reasonable behavior, circling the opponent and weaving back and forth. Eventually it converged to a stable response, but there was enough noise and variation in the Q(s,a) values over time, from the random luck of how well the opponent happened to target the robot, to change the favored action for a given state from time to time.

In implementing the Q-learning approach to control movement, multiple action sets and different combinations of features to define the input space were tried. One potential problem was the coarseness of the state space: taking an action in a state often resulted in still being in the same state. For example, being really close to the enemy means being hit by the enemy's bullets with much higher probability, so moving away should have higher utility than moving closer or staying put; but before you move far enough away to matter statistically and reach the further-away bucket, you may get hit multiple times, making the Q(s,a) for moving away look less attractive. Possibly I could do better here by making the Q(s,a) update use a model: the robot knows what state it is heading towards and could use a weighted combination of the current state and that next state. A finer-grained state space, or function approximation for the utility function with a decaying alpha, might also stabilize the optimal movement strategy better.

In addition to our learning strategies for moving and shooting, Patrick built another hard-coded tank for CSE573 assignment 1. This tank's hard-coded movement and aiming strategies could be replaced independently with the learning-based strategies. A comparison of combinations of the AI and non-AI implementations against other tanks showed a serious degradation in performance when the AI approaches were combined, due to the numerous missed time slices.
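A compact version of the tabular update (1.1) is sketched below. It is standard Q-learning with alpha = 0.01, the optimistic initialization, and the reward scheme described above; the flattened state id, the action indexing and the discount value are assumptions, since the report does not show them.

// Sketch of the tabular Q update described above. The original stored Q in a
// 7-dimensional array; here the state is a flattened id (see StateBuckets)
// and the action an index in [0, 49).
public class QTable {
    private final double[][] q;
    private final double alpha = 0.01;   // learning rate from the report
    private final double gamma;          // discount factor; value not given in the report

    public QTable(int numStates, int numActions, double gamma, double optimisticInit) {
        this.gamma = gamma;
        q = new double[numStates][numActions];
        for (double[] row : q)
            java.util.Arrays.fill(row, optimisticInit);  // optimistic start to force exploration
    }

    /** Q(s,a) += alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a)). */
    public void update(int prevState, int prevAction, double reward, int curState) {
        double best = Double.NEGATIVE_INFINITY;
        for (double v : q[curState]) best = Math.max(best, v);   // eBestAction
        q[prevState][prevAction] += alpha * (reward + gamma * best - q[prevState][prevAction]);
    }

    /** Greedy action for the current state. */
    public int bestAction(int state) {
        int best = 0;
        for (int a = 1; a < q[state].length; a++)
            if (q[state][a] > q[state][best]) best = a;
        return best;
    }
}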

1.2. Lessons Learned. First, the most effective tank needs to be able to react quickly at run time. Many of our learning techniques used a great deal of computing time, causing missed time slices that resulted in sluggish and less optimal performance. If optimizing tank effectiveness were the only priority, we would simplify our learning techniques to be less ambitious but guaranteed to run within the time slices allotted, and hard-code more of the strategy we were learning directly into the tank. Another option would be to modify Robocode to allow more time per time step.

2. AIBot aiming strategy

In our agent, aiming is really an instance of a motion-tracking problem, which can be summarized as:

Observe the opponent's movement
Update inner states
Predict the opponent's next movement
Fire

Following this scheme we applied several different models and algorithms, all of them based on the basic physics of the motion.

2.1. Physics of motion. In every tick the tank has a specific speed (v) and a heading, and it moves along its heading a distance equal to the magnitude of its speed. After that, the tank turns by a specific turning angle (θ). The speed is bounded, and the maximum turning angle depends on the speed:

-8 <= v <= 8,    -θ_max(v) <= θ <= θ_max(v)

The prediction scheme is to sum up all the predicted per-tick movement vectors, as shown in Figure 2. This technique is also called virtual bullets. One important point is that in every tick the possible acceleration is fixed at 1, 0 or -2, so if the opponent has speed v at time slice t, there are only 3 possible values of its speed at the next time slice.

Figure 2. Motion prediction.
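The virtual-bullets loop above can be sketched as follows: starting from the last observed position, heading and speed, the predicted per-tick displacement vectors are summed until a bullet fired now would have covered the distance to the predicted point. For brevity this sketch assumes a constant predicted velocity and turn per tick, where AIBot would instead plug in its Markov or RL prediction; the bullet speed formula 20 - 3 * power is standard Robocode physics.

// Sketch of the "virtual bullets" prediction loop described above.
// A constant-velocity, constant-turn model is used here for brevity.
public class Predictor {
    /** Predict where the enemy will be when a bullet of the given power arrives. */
    static double[] predictImpact(double ex, double ey, double headingDeg,
                                  double speed, double turnDegPerTick,
                                  double gunX, double gunY, double bulletPower) {
        double bulletSpeed = 20.0 - 3.0 * bulletPower;    // Robocode bullet speed
        double x = ex, y = ey, heading = headingDeg;
        for (int tick = 1; tick < 200; tick++) {
            speed = Math.max(-8.0, Math.min(8.0, speed)); // speed is always in [-8, 8]
            x += speed * Math.sin(Math.toRadians(heading)); // Robocode: heading 0 = north
            y += speed * Math.cos(Math.toRadians(heading));
            heading += turnDegPerTick;
            double dist = Math.hypot(x - gunX, y - gunY);
            if (bulletSpeed * tick >= dist)               // the bullet has caught up
                return new double[] {x, y};
        }
        return new double[] {x, y};                       // give up after 200 ticks
    }
}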

2.2. Simple Markov Model. It is natural to think of this problem as a Markov process. Here we describe the simple Markov model that was the first model we implemented.

States: <v, θ>

We update the transition probabilities according to the observations. These two variables are fully observable, so updating them is purely a matter of statistics. The prediction is the next state with the maximum likelihood:

S_{t+1} = argmax_{S_{t+1}} T(S_t, S_{t+1})

This stationary Markov model provides an average motion approximation. It works well against the sample robots, because each of them uses a single movement strategy. However, smarter robots can switch among strategies, which cannot be described by a simple stationary model even when there are only two strategies, so we need more sophisticated approaches. A second idea was an n-th order Markov process; we implemented a 2nd-order Markov process. It turned out that the accuracy improved a little, but it slowed the robot considerably as time went on (a huge transition matrix). We tried to prune some states by examining the probabilities, using the conditional probability distribution and independence conditions, but that made it even worse.

2.3. Reinforcement Learning. In the Robocode project we have no prior knowledge of the transition model for the enemy's motion, so we turned to reinforcement learning, since this technique can be model-free. We reduce the tracking problem to:

States: <v, θ>
Action: predict the next state the opponent will be in
Reward:
  The next observed state gets immediate reward 1
  If a fired bullet hits the opponent, all the predicted states that generated that bullet get delayed reward 4
  If a fired bullet misses the opponent, all the predicted states that generated that bullet get delayed reward -2
  If a fired bullet hits another bullet, all the predicted states that generated that bullet get delayed reward 1

There is only one action, which is to find the next state with the maximal expected utility:

S_{t+1} = argmax_{S_{t+1}} T(S_t, a, S_{t+1}) * U(S_{t+1})

The action is fixed, so the policy is fixed too. We implemented both TD-learning and ADP-learning algorithms. ADP-learning has better accuracy, but it is so time-consuming that it skips a lot of turns and does nothing, so its overall performance is worse.
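The delayed-reward scheme above implies some bookkeeping: every predicted state that contributed to the aim of a bullet must be remembered with that bullet and rewarded when the bullet resolves. The sketch below illustrates that bookkeeping only; the simple running-average utility update and all names are assumptions, not the TD/ADP code used in the robot.

// Sketch of the delayed-reward bookkeeping described above: the predicted
// states that generated a bullet are stored with it, and all of them are
// rewarded when the bullet hits (+4), misses (-2), or hits a bullet (+1).
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PredictedStateRewards {
    private final Map<Integer, Double> utility = new HashMap<>();            // U(s)
    private final Map<Integer, List<Integer>> statesByBullet = new HashMap<>(); // per virtual bullet
    private final double alpha = 0.1;   // step size; illustrative, not from the report

    /** Record that this predicted state contributed to the aim of bullet bulletId. */
    public void recordPrediction(int bulletId, int predictedState) {
        statesByBullet.computeIfAbsent(bulletId, k -> new ArrayList<>()).add(predictedState);
    }

    /** The next observed state gets immediate reward 1. */
    public void observe(int state) {
        reward(state, 1.0);
    }

    /** Apply the delayed reward (+4 hit, -2 miss, +1 hit-bullet) to every state behind the bullet. */
    public void resolveBullet(int bulletId, double delayedReward) {
        List<Integer> states = statesByBullet.remove(bulletId);
        if (states == null) return;
        for (int s : states) reward(s, delayedReward);
    }

    private void reward(int state, double r) {
        double u = utility.getOrDefault(state, 0.0);
        utility.put(state, u + alpha * (r - u));   // simple running-average update
    }
}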

2.4. Modified Reinforcement Learning. Using an immediate reward attached to a single state did not seem quite right, because it is the transition from one state to another that causes the reward. So after the first RL model we made a slight change: associate the immediate reward with the path instead of the state. The modified model is:

States: <S_{t-1}, S_t>
Action: predict the next path the opponent will take
Reward:
  The next observed path gets immediate reward 1
  If a fired bullet hits the opponent, all the predicted paths that generated that bullet get delayed reward 4
  If a fired bullet misses the opponent, all the predicted paths that generated that bullet get delayed reward -2
  If a fired bullet hits another bullet, all the predicted paths that generated that bullet get delayed reward 1

The action prediction is:

P_{t+1} = argmax_{P_{t+1}} T(P_t, a, P_{t+1}) * U(P_{t+1}) = argmax_{P_{t+1}} T(S_t, a, S_{t+1}) * U(P_{t+1})

We then implemented a TD-learning algorithm for this model, and this is the model we used in the final tournament.

2.5. Experiment and Analysis.

2.5.1. Performance measure of SMM, TD-RL, ADP-RL and TD-MRL. We measured three values:

prediction accuracy (PA): the accuracy of the next state predicted from the current state.
hit-enemy accuracy (HA): the accuracy of our robot hitting the enemy, HA = iHit / iMissed.
missed-turn rate (MR): the rate of missed turns, which represents the computational intensiveness, MR = iMissedTurn / (iHit + iMissed + iHitBullet + iMissedTurn).

We played games against SpinBot, Marvin I and Fractal MC. Marvin I is the first version of the Marvin robot from our colleague; Fractal MC is a pure dodging robot with excellent movement behavior. The results are shown in Tables 1-3.
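The two ratio measures above reduce to a handful of counters; a minimal sketch follows, with counter names taken from the formulas and everything else illustrative.

// Sketch of the HA and MR measures defined above, kept as running counters.
public class Metrics {
    int iHit, iMissed, iHitBullet, iMissedTurn;

    /** HA = iHit / iMissed, the hit-to-miss ratio of our bullets as defined above. */
    double hitEnemyAccuracy() {
        return iMissed == 0 ? 0.0 : (double) iHit / iMissed;
    }

    /** MR = iMissedTurn / (iHit + iMissed + iHitBullet + iMissedTurn). */
    double missedTurnRate() {
        int total = iHit + iMissed + iHitBullet + iMissedTurn;
        return total == 0 ? 0.0 : (double) iMissedTurn / total;
    }
}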

Table 1. a. Against SpinBot
          PA     HA     MR     WIN/LOSE
SMM       86%            %     20/0
TD-RL     2%             %     18/2
ADP-RL    23%            %     18/2
TD-MRL    89%            %     20/0

Table 2. b. Against Marvin I
          PA     HA     MR     WIN/LOSE
SMM       35%
TD-RL     18%            %     16/4
ADP-RL    23%            %     10/10
TD-MRL    32%    1.4     0%    19/1

Table 3. c. Against Fractal MC
          PA     HA     MR     WIN/LOSE
SMM       8%             %     3/17
TD-RL     12%            %     15/5
ADP-RL
TD-MRL    23%            %     5/15

Figure 3. Motion prediction.

From Table 2 (against Marvin I) we can see that TD-MRL did better than the others even though its PA is not as good as SMM's. Regarding this phenomenon, we think the reinforcement learning is helping to achieve the goal of making the summed prediction vector match the actual movement vector more accurately (shown in Figure 2), rather than the goal of predicting the next time-slice state more accurately. Another interesting observation is that TD-RL did much better against Fractal MC, while TD-MRL did not do as well. At the same time, SMM did better against simple-pattern robots. It seems SMM can predict simple patterns more accurately, while the RL bots can predict more complicated patterns.

2.5.2. Adaptation measure. Unlike a general reinforcement learning problem, the Robocode project gives the reward dynamically; in other words, the reward does not depend directly on the state we formulated, and the reward function changes over time. Therefore the utility of a state normally does not converge at all; instead it oscillates (as shown in Figure 4). Intuitively we think this oscillation is the response to the opponent switching strategies.

Figure 4. Utility of state <0,0> plotted over time.

2.5.3. Open problems. In our RL models we only consider states built from velocity and turning angle. However, there are many events affecting the opponent's movement that we ignored, for example whether the opponent is approaching the wall, whether the

opponent is approaching our robot, whether the opponent has sensed a bullet approaching, whether enough time has passed for the opponent to switch strategies, and so on. All of these could be inner states too. However, characterizing all these states and computing over the full state space would slow the robot down (higher MR) and eventually disable it due to too many skipped turns. Developing a fast algorithm that deals with a big state space is an open problem. Another open problem is learning the firing power: we could treat firing with different powers as different actions and then use the full reinforcement learning method to generate the optimal policy. Again, this would also make the computation heavy.

2.6. BN with latent variables model. The usual models for motion are Bayesian networks or hidden Markov models, so after exploring the application of reinforcement learning we explored DBN approaches and finally arrived at a BN with latent variables. It is similar to an HMM, but an HMM uses only one factored variable to describe the state. We could vectorize the state space and serialize it into one variable, but some features (for example the velocity and turning angle) can be updated directly from the evidence/observations instead of through the whole transition matrix. So we chose a BN with a latent variable to represent the model.

To keep it simple, we start with one latent variable, HittingWall?, so the problem is formulated as:

States: <v, θ, HittingWall?>
Evidences: <v, θ, d_wall, d_center>, where d_wall is the minimal distance from the opponent to one of the walls and d_center is the distance from the opponent to the center of the game field
Transition model: updated online
Sensor model: the linear Gaussian model

P(<d_wall, d_center> | HittingWall?) = 1/(sqrt(2π)·λ) * exp(-(HittingWall?·d_wall + (1 - HittingWall?)·d_center)^2 / (2λ^2))

P(<v, θ> | <v, θ>) = 1

P(E_t | X_t) = P(<d_wall, d_center> | HittingWall?) * P(<v, θ> | <v, θ>) = P(<d_wall, d_center> | HittingWall?)

The Forward-Backward algorithm is then:

Forward();
Normalize();
Backward();
UpdateTransition(); (this updates the transition matrix from the estimated states)
EM(λ); (this updates the sensor model to maximize the likelihood of the data classification according to HittingWall?)

In addition, we assume:

(2.1)  P(HittingWall?) = 1 - (w - λ)(h - λ) / (w·h)

where w is the width of the battlefield and h is its height. For prediction we use the Viterbi algorithm. All of these algorithms require the history from the starting time point, which makes the computation really slow, so we keep a memory buffer of a fixed length: each time we learn, we learn only from that window, and each time we predict, we predict only within that window.

We started this approach after the final tournament but could not finish it due to the tight schedule. We think it would predict better but might be slower, so the total performance might not improve.
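The sensor model and the prior (2.1) translate into two short functions; the sketch below assumes, as the text suggests, that HittingWall? is treated as a 0/1 value and that λ plays both roles (noise scale in the Gaussian and margin in the prior). All names are illustrative.

// Sketch of the sensor model and the HittingWall? prior from equation (2.1).
public class WallSensorModel {
    final double lambda;          // noise scale / wall margin from the report
    final double width, height;   // battlefield dimensions w and h

    WallSensorModel(double lambda, double width, double height) {
        this.lambda = lambda;
        this.width = width;
        this.height = height;
    }

    /** P(<d_wall, d_center> | HittingWall?) as a Gaussian of the selected distance. */
    double sensor(double dWall, double dCenter, int hittingWall) {
        double x = hittingWall * dWall + (1 - hittingWall) * dCenter;
        return Math.exp(-x * x / (2.0 * lambda * lambda))
               / (Math.sqrt(2.0 * Math.PI) * lambda);
    }

    /** Prior P(HittingWall? = 1) = 1 - (w - lambda)(h - lambda) / (w * h), eq. (2.1). */
    double prior() {
        return 1.0 - (width - lambda) * (height - lambda) / (width * height);
    }
}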

3. Conclusions

AI techniques can be used to control the movement of a robot and to predict the movement of an enemy robot for aiming. However, great robot performance requires structuring the robot with a lot of domain knowledge. Providing the learning algorithm with the important, relevant features for modeling the enemy's movement, and designing effective performance measures for optimizing the movement and the distance from the enemy, are just as important as the specific learning algorithms used. Robocode robots are a great example of how important domain knowledge is when combined with AI to get great results. Approaches in class that did not incorporate much domain knowledge, such as the GA approach that just tried to converge by optimizing rough performance measures, did not converge to optimal solutions; the space of solutions was too large. Constraining the solution space by hard-coding in known important features and basic framework functionality allowed much more effective solutions to be found.

4. Division of Labor

Xu Miao wrote the aiming code and that section of the report; Patrick Haluptzok wrote the movement code and the rest of the report.

Microsoft; Department of Computer Science and Engineering, University of Washington
E-mail addresses: xm@u.washington.edu, patrickh@windows.microsoft.com
