
Modern AI for games: RoboCode
Jon Lau Nielsen (jlni@itu.dk), Benjamin Fedder Jensen (bfje@itu.dk)

Abstract
The study concerns the use of neuroevolution, neural networks and reinforcement learning in the creation of a virtual tank for the AI programming game Robocode. The control of the turret of such a robot was obtained through historical information given to a feedforward neural network for aiming, and the best projectile power was chosen by Q-learning. The control of movement was achieved through neuroevolution of augmenting topologies. Results show that such machine learning methods can create behaviour that is able to satisfactorily compete against other Robocode tanks.

I. INTRODUCTION

This study concerns the use of machine learning based agents in the guise of virtual robot tanks. Robocode [1] is an open source programming game in which these robot tanks compete against each other in a simple two-dimensional environment. The game is roborealistic in the sense that only limited information about the environment is given to robots, requiring them to actively scan the environment to gain new information. In this study such agents have been built with machine learning techniques and are composed of two controllers: a turret controller and a movement controller. The movement controller uses NeuroEvolution of Augmenting Topologies [4] (NEAT), and the turret controller uses an artificial neural network [7] (ANN) and reinforcement learning (RL). The reason for two techniques in the turret controller is that aiming the turret gun itself and loading a projectile of a certain power are divided into two subcontrollers. The neural network is a fully connected feedforward network using backpropagation for error gradient correction and controls the aim of the turret. The reinforcement learning is a modified Q-learning utilizing a Q-table; it controls the power of the projectiles to be fired.

A. Background

Evolutionary algorithms have been widely used to evolve behavior in Robocode agents. In a study by J. Hong and S. Cho the use of genetic algorithms (GA) was investigated [2]. They separated the behavior of their agents into five sub-behaviors, namely movement, aiming, bullet power selection, radar search and target selection. Their encoding of a genome consists of one gene for each of these sub-behaviors. A finite set of actions was used to represent genes, and evolution was done through natural selection, mutation and crossover breeding. Training was performed against single opponents and evolved specialized behavior. They were able to produce numerous genomes of distinct behavior capable of defeating simple opponents. The emergence of behavior was however restricted by the finite set of actions available.

Another study, by Y. Shichel, E. Ziserman and M. Sipper, investigated the use of tree-based genetic programming (GP) in Robocode agents [3]. Here the genomes are of varying size and consist of genes that refer to functions and terminals, organized in a tree structure that relays information from the Robocode environment and allows computation. They divided their agents' behavior into three sub-behaviors: forward/backward velocity, left/right velocity and aiming. Evolution was done through tournament selection, subtree mutation and subtree crossover breeding, as well as putting upper bounds on the depth of the genetic trees. Training was done against several opponents per fitness evaluation and was able to evolve generalized behavior. They used the scoring function that Robocode itself uses as their fitness function, and were able to claim a third place in a Robocode tournament named the Haikubot league. Agents that participate in this tournament are only allowed four lines of code, which suited the long lines of nested functions and terminals generated by genetic trees.

Neural networks have been used in RoboCode agents since May 2002, first utilized by Qonil's (alias) agent. It was only a year later that the Robocode community started to take an interest in neural networks, when Albert (alias) experimented with different inputs in his tank ScruchiPu, but without approaching the accuracy of other top targeting methods. Finally, in May 2006, Wcsv's (alias) neural targeting agent named Engineer reached a RoboCode rating of 2000+ [14]. The most modern neural network for RoboCode is currently Gaff's (alias) targeting, which is one of the best guns at predicting the future positions of wave surfers [15]. Pattern matching and a technique termed wave surfing can be used by agents to model incoming projectiles as waves which they then attempt to surf [16][17].

B. Platform

RoboCode is a real-time tank simulation game where two or more agents compete for victory in points. Agents can participate in an online tournament called Roborumble [9] and climb the ranks. RoboCode was originally started by Matthew A. Nelson in late 2000 and was in 2005 brought to Sourceforge as an open source project [1]. Robocode is built on roborealism principles, and thus hides information about opponent positions, projectile positions et cetera. In order for an agent to gather the information needed to make informed decisions, it must use a radar controller. The radar gathers information about opponents within its field of view, but not their projectiles. This information consists of an opponent's directional heading, current position and energy. Temporal information such as opponent velocity can be calculated from this information with respect to time.
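As a concrete illustration of what a single scan provides, the sketch below records an opponent's absolute position from a Robocode scan event and derives a speed estimate from two consecutive scans. It is our own illustration, not code from the study; the class and field names are ours.

    import robocode.AdvancedRobot;
    import robocode.ScannedRobotEvent;
    import java.awt.geom.Point2D;

    // Illustrative only: derives an opponent's position (and, from two scans,
    // a velocity estimate) from the limited information the radar provides.
    public class OpponentTracker extends AdvancedRobot {
        private Point2D.Double lastPosition;   // opponent position at previous scan
        private long lastScanTime;             // game turn of the previous scan

        @Override
        public void onScannedRobot(ScannedRobotEvent e) {
            // Absolute bearing to the opponent = own heading + relative bearing.
            double absoluteBearing = getHeadingRadians() + e.getBearingRadians();
            double x = getX() + Math.sin(absoluteBearing) * e.getDistance();
            double y = getY() + Math.cos(absoluteBearing) * e.getDistance();
            Point2D.Double position = new Point2D.Double(x, y);

            if (lastPosition != null) {
                long dt = getTime() - lastScanTime;               // turns between scans
                double speed = position.distance(lastPosition) / Math.max(dt, 1);
                // 'speed' approximates the opponent's velocity in pixels per turn;
                // the scan event also reports it directly via e.getVelocity().
                System.out.printf("opponent at (%.0f, %.0f), ~%.1f px/turn%n", x, y, speed);
            }
            lastPosition = position;
            lastScanTime = getTime();
        }
    }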

The game is used as a teaching aid for aspiring JAVA, .NET and AI programmers, given its simple interface and simple in-game JAVA compiler, as well as a community in which ideas, code, algorithms and strategies are shared. The mechanics of the game are such that agents receive sensory input at the beginning of each turn and are able to activate various actuators such as acceleration and projectile firing. These actions are either on a turn basis or temporal. An action such as MOVEAHEAD 1000 PIXELS is temporal and will halt the program until it has finished, whereas a non-temporal action such as ACCELERATE 20 PIXELS causes no halting. Temporal actions are incapable of adapting to new situations, and are therefore used only for very primitive agents.

At every round the robot tanks spawn at random positions with 100 energy. Energy is used for firing projectiles, which can be fired with a power ranging from 0.1 to 3.0. Power does not only determine damage on impact, however, but also the velocity of the projectile; high powered projectiles move slower than low powered projectiles. Damage is calculated from the power of the projectile and gives a bonus if high powered, thus making high powered projectiles the most lethal on average. Additionally, when damaging an opponent, the agent receives triple the amount of damage done as energy. Energy is also lost when the agent drives into walls and when opponents collide with it, which is termed ramming. Ramming causes no damage to the agent who applies forward velocity against the obstructing opponent, and is thus also useful for offense. Should an agent reach 0 energy, by getting hit, by driving into walls or by being drained from shooting, the agent will lose.

The radar controller has already been perfected through a technique termed Narrow Lock [12] from the Robocode community. It keeps an opponent within the field of view constantly. This technique is therefore used for the radar throughout the study.

II. TURRET CONTROLLER

The turret controller is split into two parts: aiming and projectile power. For aiming, an artificial neural network (ANN) [7] with one hidden layer estimates the future position of the opponent. As implementation details, the ANN uses a 0.9 momentum rate and a 0.1 learning rate.

Relative coordinates are used so that placement on the battlefield does not matter for the neural network. When firing a projectile, the input data from which the estimated future position of the enemy was computed is saved. Each turn a method checks whether any projectile has flown further than expected; if so, the projectile missed the target and the current relative position of the opponent is marked. Backpropagation happens the turn the projectile hits a wall, using the stored input and the exact location the opponent occupied when the projectile should have hit it, relative to the position the projectile was fired from.

Before the future position of an opponent can be estimated, the most suitable degree of power for the projectile has to be determined. The more power put into a projectile, the more damage it deals, but the slower it moves; therefore (all things equal) it is harder to hit an opponent with high powered projectiles, since the opponent has more time to alter its movement pattern and avoid them. To pick the best degree of power, a Q-learning module is made to divide the battlefield into states surrounding the agent, as seen in Figure 1. A state consists of only a start range and an end range.

The input of the ANN for aiming consists of a value describing how much power is being put into the projectile, and a sequence of last known positions of the opponent spread over a range of radar scans.
All inputs are normalized to be in the range [0, 1]. The last known positions are drawn as dots following the opponent in the presentation video, and we will therefore refer to them as green dots. While exploring different possible inputs, we found it unnecessary to include the velocity and heading of the enemy, as they are indirectly included through the previous positions. As for the output, one relative coordinate (two XY outputs) is used to estimate where the enemy will be in the future.

Fig. 1. The agent with surrounding rings, each representing a state in the RL.

In the Q-table each state is represented by itself and its actions Q(s_t, a_t). Here the actions are the different degrees of power that a projectile can be fired with. Since the best shot is determined only by the maximum Q-value of the possible actions (shots) in a given state, and not by a discounted maximum future value, the equation for updating Q is rewritten as can be seen in Equation (1).

Q(s_t, a_t) = Q(s_t, a_t)(1 - α_t(s_t, a_t)) + α_t(s_t, a_t)R(s_t)   (1)

Here the Q-value of state s_t with applied action a_t is updated with part of the old Q-value plus part of the reward, R(s_t), given by the newly fired projectile. To allow for small changes, the reward and the old Q-value are weighted by the learning rate α_t(s_t, a_t), which in our application has been set to 0.1 by default. For R(s_t) the energy penalties and rewards given by Robocode are used, in the form of damage dealt (positive), gun heat generated (negative), energy drained (negative) and energy returned (positive) [1]. These are determined by the power of the projectile fired: a shot powered by 3 will drain the agent of 3 energy and generate 1.6 turret heat, but cause 12 damage to the opponent it hits, with an additional 4 bonus damage and 9 energy returned to the agent. This will cause a reward of -3 - 0.8 + 12 + 4 + 9 = 21.2, but missing a high powered shot will create a bigger penalty than a smaller shot. The reward is given upon the event of hitting an opponent, hitting a wall (missing) or hitting another projectile. The full reward function can be seen in Equation (2), where the event PHP means the projectile hits an enemy projectile.

R(s_t) = -Drain - Heat + HitDamage + EnergyBack + Bonus, if hit;
         -Drain - Heat, if miss;
         -Drain - Heat, if PHP.   (2)

A. ANN: Input experiments

The purpose of this ANN experiment is to see the effect of using the past positions of the opponent. How much information is needed to provide the best estimation of the future position? The best input will be used from this test onward. To train the ANN, only the turret is active and only the opponent is allowed to move throughout the training. The opponent sample.myfirstjuniorrobot is used, which uses simple movement patterns and does not alter the pattern based on shots fired towards it. For each turn of the rounds the ANN is provided with the necessary inputs from the battlefield and asked for an output. As mentioned before, the inputs used are the shot power and an amount of history of where the opponent has been prior to this point in time (the dots). To have a useful ANN which can be used afterwards, the power input is randomized for each shot. While decreasing the number of dots, the number of hidden neurons is also decreased with a 1:2 ratio. As an example: ten dots need 20 numbers (X/Y) and one input is used for the energy of the shot, so the total input size is 21 and therefore 42 neurons are used in the hidden layer. The reason for decreasing the hidden neurons is to allow for fewer error corrections as the patterns get less complex. Each input is tested over 1000 sessions of 50 games each. For each session the current hit percentage is saved and shown in the graph. Figure 2 shows the average of these percentages. Since these are averages, slow convergence pulls down the average; therefore the averages over the last 20 sessions (1000 rounds) are shown in Table I.

Fig. 2. Input test for ANN using relative coordinates.

TABLE I
HIT% OVER THE LAST 20 SESSIONS (1000 ROUNDS)

Dots  Input size  Hidden neurons  Hit%
15    31          62              32.69
10    21          42              32.08
8     17          34              31.38
6     13          26              30.96
4     9           18              27.81
2     5           10              26.22
1     3           6               27.60

B. RL: Range experiment

This test allows us to show the benefit different ranges provide. The smaller the ranges, the more circles are placed around the agent, as seen in Figure 1, which represent the states. The benefit of dividing the states into smaller ranges is that it allows the turret to find which shot is the most rewarding in each state. As high powered shots move slowly, we expect the RL module to find a balance between risk and reward. The RL module also has the benefit of tailoring to the needs of the ANN: should the ANN start missing in one state, the RL can lower the energy put into the projectile. For this test, the ANN from the previous test is used, with 15 previous positions (dots), 31 inputs and 62 hidden neurons.
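To make the input layout concrete, the sketch below assembles such an input vector: one normalized power value followed by the last N opponent positions taken relative to the agent's own position. The class name and the normalization constants are our assumptions, not values from the study.

    import java.util.List;

    // Sketch: builds the ANN input vector described above - one normalized shot
    // power followed by the last N opponent positions relative to the agent.
    final class TurretInput {
        static final double MAX_POWER = 3.0;
        static final double MAX_RANGE = 1200.0;   // assumed bound on relative distances

        /** dots holds the last N opponent positions as {x, y}, oldest first. */
        static double[] build(double power, double ownX, double ownY, List<double[]> dots) {
            double[] input = new double[1 + 2 * dots.size()];   // e.g. 15 dots -> 31 inputs
            input[0] = power / MAX_POWER;                        // shot power in [0, 1]
            int i = 1;
            for (double[] dot : dots) {
                input[i++] = clamp01(0.5 + (dot[0] - ownX) / (2 * MAX_RANGE));
                input[i++] = clamp01(0.5 + (dot[1] - ownY) / (2 * MAX_RANGE));
            }
            return input;
        }

        private static double clamp01(double v) {
            return Math.max(0.0, Math.min(1.0, v));
        }
    }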
Throughout the test, the neural network is frozen so that no weight updates can be performed. For 32 shots (about one fast round) the agent fires randomly powered shots in the range [0.1, 3]; during this period the RL is allowed to update the Q-table. For the next 32 shots the RL stops updating the Q-table and utilizes the gained knowledge. The damage dealt over the utilized 32 shots is recorded, and the process is repeated 15,000 times (close to 50,000 rounds). Figure 3 shows the average reward given for each range throughout the test. Damage done is a big factor in the reward function, as can be seen in Equation (2), and important in taking down the opponent. Therefore the test was done twice, allowing us to also show the effect on average damage done, as seen in Figure 4.
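This experiment exercises the Q-table update of Equation (1) together with the reward of Equation (2). A minimal sketch of such a module is shown below; the ring-based state indexing follows Figure 1, while the action grid and the class layout are our own choices rather than the study's code.

    import java.util.Arrays;

    // Sketch of the Q-table from Section II: one row per distance ring around the
    // agent, one column per discrete projectile power (action). Follows Eq. (1)
    // with a fixed learning rate.
    final class PowerQTable {
        private final double[][] q;          // q[state][action]
        private final double[] powers;       // projectile power for each action
        private final double ringWidth;      // width of each distance ring (a "range")
        private final double alpha = 0.1;    // learning rate, as in the text

        PowerQTable(int numRings, double ringWidth, double[] powers) {
            this.q = new double[numRings][powers.length];
            this.powers = powers.clone();
            this.ringWidth = ringWidth;
        }

        int stateFor(double distanceToOpponent) {
            return Math.min((int) (distanceToOpponent / ringWidth), q.length - 1);
        }

        /** Greedy choice: the power with the highest Q-value in this state. */
        double bestPower(int state) {
            int best = 0;
            for (int a = 1; a < q[state].length; a++)
                if (q[state][a] > q[state][best]) best = a;
            return powers[best];
        }

        /** Eq. (1): Q(s,a) = Q(s,a)(1 - alpha) + alpha * R(s). */
        void update(int state, int action, double reward) {
            q[state][action] = q[state][action] * (1 - alpha) + alpha * reward;
        }

        /** Eq. (2): reward on hit, and on miss or projectile-hits-projectile (PHP). */
        static double reward(double drain, double heat, boolean hit,
                             double hitDamage, double energyBack, double bonus) {
            double base = -drain - heat;
            return hit ? base + hitDamage + energyBack + bonus : base;
        }

        @Override
        public String toString() {
            return Arrays.deepToString(q);
        }
    }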

Fig. 3. Reward test using the RL module with a frozen ANN, with the enemy's 15 previous positions as input.

Fig. 4. Damage test for the RL using the ANN with the enemy's 15 previous positions as input.

C. Result discussion

In testing different inputs for the ANN, the usage of more historical information (or dots) clearly has a big impact on how well the ANN can predict. From a single dot (meaning the current position of the enemy) it was able to hit an average of 27.60%, while giving it more information such as 10 and 15 dots resulted in averages of 32.08% and 32.69%, as seen in Table I. When further increasing the number of dots, the advantage of knowing more became smaller, as seen with 10 dots versus 15 dots. A further increase in dots could possibly move it to 33%, but this would also hinder the agent's performance in another area: when a new round starts, the turret waits for all inputs to be gathered before shooting, meaning slower shooting.

The RL range test allowed us to show the impact on both reward received and damage dealt as the RL module altered its Q-table, as seen in Figure 3 and Figure 4. While trying to explain the experimental results, we noticed that since everything can be calculated at time t, as can be seen in Equation (1), and since the RL does not depend on future Q-values of other states, there is a better solution than RL. The solution is simple: if the hit percentage of the ANN for each state is known, the expected reward can be calculated for each degree of power (action), as can be seen in Equation (3), and as such a near optimal solution can be precalculated into a lookup table. The lookup table has been precalculated and attached on the CD.

ExpectedReward(s_t) = Hit% · (-Drain - Heat + HitDamage + EnergyBack + Bonus) + (1 - Hit%) · (-Drain - Heat)   (3)

While this is a better solution, the RL is far from a bad solution; it just takes time to converge. Having the near optimal policy helps us explain why the test results ended as they did. In Figures 3 and 4 the average reward and average damage can be seen to meet at the same value for all ranges. In Figure 3 the reward seems to converge at 33, and in Figure 4 the average damage is 87. As we are using an ANN that averages a hit percentage of 32.69% over all states, the optimal expected reward is about 3.5 points per shot (according to the near optimal policy) using the optimal shot with a power of three. As the experiment does not show 3.5 · 32 = 112, it is a strong indication that the RL is far from fully converged. Before starting the range experiment, it was assumed that range 10 would outperform range 1000, as knowing how well the turret hits in one state allows for a shot tailored to that range. At close range a high hit percentage might be present, and a high powered projectile should be used for optimal reward. But as can be seen in the optimal values, going for the high powered shot is generally the best option given that our turret is shooting with about a 32.69% hit percentage. Should we later meet a more evasive opponent, or should the ANN start missing at long range, range 10 would show much better results. The reason range 10 would show better results is its ability to isolate the areas where it is hitting well and use the highest powered projectile there, while using a lower powered projectile in areas where the hit percentage is low; range 1000, on the other hand, would not be able to adjust to maximize the reward and minimize the penalty of missing.

III. MOVEMENT CONTROLLER

The movement controller is composed of sensors and actuators. A multilayered recursive neural network processes the sensor information and actuates through synapses and neurons, which feed information through the network by neural activation [7]. This network is created through neuroevolution by the use of NEAT. In evolutionary terms, the neural network is the phenotype of a genome, which has been evolved through several generations of mutation and selection. NEAT evolves the topology of a neural network genome through the creation of additional neurons and synapses, initially starting with a minimal structure consisting of input and output neurons and full connectivity [4]. As with other genetic computational methods, the best genomes are selected for crossbreeding and are allowed to survive into future generations. However, NEAT features speciation, which divides dissimilar genomes into species and allows all species to exist for a number of generations independent of their immediate suitability to the environment [4].

This allows evolutionary innovations to occur that give no immediate benefit, but that may serve as the base of offspring that prove strong in future generations. Species do however go extinct if they have shown no improvement over several generations. Additionally, the history of each synapse in the genome is recorded, such that offspring throughout the species share a genetic history [4]. This allows crossover between dissimilar genomes without duplication of synapses, as only synapses with the same genetic history are crossed over. The strength of the parents determines whose topology should be inherited, whereas among synapses of identical structure the inherited one is chosen randomly.

NEAT has been implemented in JAVA as a library for the purpose of this study. The library is based on NEAT C++ for Microsoft Windows [6] and provides similar features such as sigmoid activation response mutation and recursive synapses. There are several additions, however. It is possible to load the neural networks of phenotypes and use them as if they were ordinary multilayered neural networks, without loading the NEAT module. The network also allows recursion not only through synapses that connect a neuron to itself, but also through topology that causes recursion. This is done through a method similar to breadth-first search that allows neurons to be visited twice rather than once. This allows memorization of the previous time step as well as small cycles during processing. Recursion is an optional feature, but the cycles are a side effect of the innovation of new synapses that may cause cycles; another approach would be to restrict innovations such that cycles do not occur. A further additional feature is the ability to evolve from an existing genome rather than from a minimal structure, which allows training through several phases. Saving and loading of both genomes and phenotypes are supported.

A. Sensors

Fig. 5. Agent in-game displaying its sensors. The agent is black and red, and the opponent blue.

The input for the movement controller consists of eight continuous values supplied by eight sensors, as can be seen in Figure 5. Six of the sensors are distance sensors, where five detect the distance to the walls of the environment, and the last the distance to the opponent. The sensors which measure distance to walls do so by casting rays at angles that cover the forward field of view of the agent. The environment is modeled as the four lines that the square of the environment consists of, which are then checked for intersections with the rays. The resulting intersection points are then used to calculate the distance from the ray origins to the environment boundary [8]. To fit within the range [0, 1], only distances within an upper bound are perceived. The two remaining sensors are the angle to the opponent, which together with the distance gives the relative polar coordinates of the opponent, and a projectile danger field sensor. This last sensor, the projectile danger field, is the amount of danger of projectile impacts within the environment, which is determined by an attempt to track all projectiles as they travel from an opponent. However, it is impossible to detect at what angle projectiles have been fired by opponents. The sensor thus arbitrarily assumes that they are fired directly toward this agent, but this is imperfect, as better opponents fire at angles that intercept the agent at future turns. In order to make better predictions of the flight path of enemy projectiles, a prediction method must be used. This has not been done in this study. Together, these sensors provide enough information for the agent to perceive its environment and the opponent within it.
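A sketch of the wall-distance sensing described above is given below. It intersects each ray with the four battlefield edges (using an axis-aligned simplification rather than the general line intersection of [8]) and normalizes the nearest hit by an upper bound so the value fits in [0, 1]; the ray spread and the cap value are our assumptions.

    // Sketch of the wall-distance sensors: cast a ray from (x, y) at a given
    // heading, intersect it with the four battlefield edges, and normalize by
    // an assumed upper bound so the sensor value fits in [0, 1].
    final class WallSensors {
        private final double width, height;     // battlefield size
        private final double maxRange;          // distances beyond this read as 1.0

        WallSensors(double width, double height, double maxRange) {
            this.width = width;
            this.height = height;
            this.maxRange = maxRange;
        }

        /** Heading in radians, Robocode convention: 0 points up (+y), clockwise. */
        double normalizedDistance(double x, double y, double heading) {
            double dx = Math.sin(heading), dy = Math.cos(heading);
            double t = Double.POSITIVE_INFINITY;
            if (dx > 0) t = Math.min(t, (width - x) / dx);   // right wall
            if (dx < 0) t = Math.min(t, -x / dx);            // left wall
            if (dy > 0) t = Math.min(t, (height - y) / dy);  // top wall
            if (dy < 0) t = Math.min(t, -y / dy);            // bottom wall
            return Math.min(t, maxRange) / maxRange;         // clip and scale to [0, 1]
        }

        /** Five forward rays spread over an assumed forward field of view. */
        double[] forwardRays(double x, double y, double heading) {
            double[] offsets = { -Math.PI / 2, -Math.PI / 4, 0, Math.PI / 4, Math.PI / 2 };
            double[] out = new double[offsets.length];
            for (int i = 0; i < offsets.length; i++)
                out[i] = normalizedDistance(x, y, heading + offsets[i]);
            return out;
        }
    }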
The movement controller reacts to the environment through forward/backward acceleration and left/right acceleration actuation, as sketched below.

B. Training

The training of the movement controller is divided into four phases of increasing difficulty. First it must learn to move around the battlefield at the highest possible rate. Then it must learn to chase an opponent and attempt to keep a specific distance to it. Then it must learn to avoid getting hit by an opponent, and lastly it must learn to position itself optimally for the turret controller. All the phases have the limitation that the agent must never touch a wall; if it does so, it will receive a fitness score of zero. There are consequences of such strictness: innovations that could later turn out strong may be discarded early because of an inability to move correctly. Preliminary tests however showed that allowing the agent to touch walls with a lesser penalty could spawn behavior that optimized around hitting walls, which is unacceptable. Dividing training into phases does cause loss of knowledge, but in the context of the training being done this is acceptable and necessary. The idea is that instead of optimizing from a minimal structure which is not adapted to the environment, each phase after the first can optimize from a structure that is already adapted, if suboptimally, in certain aspects deemed useful. To learn how to chase, for example, it is an aid that the agent already knows how to move, albeit the new movement after training will of course be very different from the base from which it originated. Tests have shown, however, that in order to preserve previous knowledge, the fitness function must constrain the training in such a way that the old behavior is required as well as the new desired behavior.
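For reference, the actuation mentioned above might map onto Robocode's non-blocking movement calls roughly as follows; the scaling constants and the assumed output range of the network are ours, not values from the study.

    import robocode.AdvancedRobot;

    // Sketch: map two controller outputs (assumed to lie in [-1, 1]) onto
    // Robocode's non-blocking movement actuators. Scaling constants are assumed.
    final class MovementActuator {
        private static final double MAX_AHEAD = 100;                 // pixels per decision
        private static final double MAX_TURN  = Math.toRadians(10);  // radians per decision

        /** outForward and outTurn are the movement network's two outputs. */
        static void apply(AdvancedRobot robot, double outForward, double outTurn) {
            robot.setAhead(outForward * MAX_AHEAD);        // forward/backward movement
            robot.setTurnRightRadians(outTurn * MAX_TURN); // left/right (negative turns left)
            robot.execute();                               // commit this turn's actions
        }
    }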

Thus the fitness functions grow in complexity as the phases increase.

1) Phase 1: Mobility: The fitness function for the first phase is given in Equation (4). The variable w is the number of wall hits, v is the average forward/backward velocity of a round and h is the absolute average left/right velocity of a round with a minimal bound of 0.1. A maximum score of 100 is possible if v is 1 and h is 0.1. A fraction is used to enable the fitness to depend on two variables; preliminary testing showed that simple addition or multiplication can cause training to get stuck in local optima that only optimize one of the variables. An optimal behavior for this function should thus move forward at the highest possible velocity constantly, avoid any walls, and keep turning to a minimum. The reason for the limitation on turning is that preliminary tests without this variable had a tendency to evolve behavior that turned in circles in the center of the environment, thus fulfilling w and v.

f_1(s_i) = 0, if w > 0;  (v / h)^2 · v, otherwise.   (4)

Fig. 6. Training with f_1(s_i): 500 generations with a population of 100, 1 game per specimen, no opponent.

As can be seen in Figure 6, the agent quickly evolved structure which optimizes the fitness function, with the resulting genome having a value of 100. The training spanned 500 generations with a population of 100 specimens, each of which had one round to optimize its fitness. Since there are no opponents in this phase, the environment is static and unchanging, which allows a fairly precise fitness calculation, as chance is mostly avoided. Behaviorally the agent has a preference for turning right and does drive in cycles. However, the cycles are so wide that it must react when nearing walls, which also disrupts the cycles and causes it to travel around most of the environment.

2) Phase 2: Chasing: The fitness function for the second phase is given in Equation (5). It depends on the same variables as in the previous phase, with several additions to determine its ability to chase an opponent. The variable d is the average distance when within sensory range (the sixth distance sensor). Within the numerator of the fraction, this variable is part of a triangular equation where 0.5 is the top and 0 and 1 are the bottom. This forces it to stay within the middle of the sensory range of the sixth distance sensor. The fraction has a maximum of 100, as in the previous phase. An additional term is added, with the variable c which is the number of turns within sensory range, and t which is the number of turns in total. This has the effect that the agent should stay within sensory range for at least half of the round in order to fully benefit from the fraction. Since start locations are random, it takes time for the agent to get within sensory range of an opponent, and this ensures that it is not penalized for that, which would otherwise cause the initial random location to affect the fitness. In effect, the agent is forced to stay within the middle of the sensory range for at least half of the round.

f_2(s_i) = 0, if w > 0;  ((1 - |d - 0.5| · 2) / h)^2 · v · c / (t / 2), otherwise.   (5)

Fig. 7. Training with f_2(s_i): 500 generations with a population of 100, 5 games per specimen, random moving opponent.

The results can be seen in Figure 7. The resulting genome has a value of 63.07. The training spanned 500 generations with a population of 100 specimens, where each specimen had five rounds to optimize. The opponent was sample.crazy [1], a random moving opponent that does not attack. The reason for the additional rounds per specimen is that there now is an opponent, which causes the environment to become dynamic and now includes chance.
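As a concrete reading of Equations (4) and (5), the sketch below computes the two phase fitness values from per-round statistics. The method signatures are hypothetical; the variables follow the definitions given above, including the 0.1 floor on h.

    // Sketch of the phase fitness measures of Equations (4) and (5).
    // w = wall hits, v = average forward/backward velocity, h = absolute average
    // left/right velocity (floored at 0.1), d = average distance within sensory
    // range, c = turns within sensory range, t = total turns.
    final class PhaseFitness {
        static double phase1(int w, double v, double h) {
            if (w > 0) return 0;
            h = Math.max(h, 0.1);
            return Math.pow(v / h, 2) * v;                        // Eq. (4), max 100 at v = 1, h = 0.1
        }

        static double phase2(int w, double v, double h, double d, double c, double t) {
            if (w > 0) return 0;
            h = Math.max(h, 0.1);
            double triangle = 1 - Math.abs(d - 0.5) * 2;          // peaks at d = 0.5
            return Math.pow(triangle / h, 2) * v * (c / (t / 2)); // Eq. (5)
        }
    }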
There is a clear sign of improvement as generations increase, but the training has been unable to fully optimize for the fitness function. This can also be seen in the behavior, which has inherited the circular movement of the first phase. The agent clearly gets within sensory range, but it continuously gets too close or too far, as it circles either around or away from the opponent. A curious tendency is that if the agent is pitted against opponents made to circle around their prey, the two will drive around in a perfect circle, where both attempt to reach an optimal distance but negate each other.

3) Phase 3: Avoidance: The third phase was attempted with two different fitness functions, Equation (6) and Equation (7), as it was unclear which variables the training should depend on. The variable k_1 is the average danger sensor score, and k_2 is the average fraction of the projectiles fired by the opponent that hit the agent. The difference lies in that the sensor is an abstraction of possible projectile dangers, whereas k_2 is the actual occurrences in the environment. Part of the fitness function of phase 2 is included, such that the agent should attempt to maintain the desired distance to the opponent while at the same time avoiding being hit. There is no constraint on velocity, however. The danger variables are inverted, such that avoidance is rewarded. In effect the agent should avoid projectiles while staying within optimal range, which may seem like a paradox, as closer range means a statistically higher chance of getting hit. However, since movement should benefit the turret controller, it must stay within a range such that the statistical chance of hitting the opponent is high for the agent as well.

f_{3,1}(s_i) = 0, if w > 0;  ((1 - k_1)(1 - |d - 0.5| · 2) / h)^2 · c / (t / 2), otherwise.   (6)

f_{3,2}(s_i) = 0, if w > 0;  ((1 - k_2)(1 - |d - 0.5| · 2) / h)^2 · c / (t / 2), otherwise.   (7)

Fig. 8. Training with f_{3,2}(s_i): 500 generations with a population of 100, 5 games per specimen, direct targeting opponent.

TABLE II
COMPARISON OF THE TRAINED GENOMES OF THE FITNESS FUNCTIONS

Fitness function   Avoidance percentage
f_2(s_i)           73.53%
f_{3,1}(s_i)       78.70%
f_{3,2}(s_i)       78.84%

As can be seen in Figure 8, the agent had difficulties evolving a structure that optimizes the function. The resulting genome has a value of 82.49. The training parameters are identical to those of the previous phase, but with an attacking opponent named rz.hawkonfire 0.1 which uses direct targeting. Direct targeting was chosen because the danger sensor is only able to predict projectiles fired by such a strategy. The fitness starts quite high, as the structure trained in the previous phase receives a fairly good score from the part of the function that they share. Innovations do however increase the success of the agent as generations increase, but it is unable to achieve total avoidance. By keeping part of the previous fitness function, it maintains the previous behavior while adding new behavior that, in this case, attempts to avoid getting hit by the opponent. The efficiency of the two fitness functions can be seen in Table II. The second fitness function, Equation (7), proved best, while both proved to be an improvement over the structure of the previous phase. There is however only a very small difference in the efficiency of training from using an abstraction versus the real occurrences.

4) Phase 4: Combat: Two fitness functions were investigated for the combat phase, as can be seen in Equation (8) and Equation (9). The variable u is the number of projectiles fired by the agent that hit the opponent, and r is the number of projectiles fired by the opponent that hit the agent. Equation (8) simply rewards hitting the opponent and avoiding being hit, through a fraction. This is an abstraction of how combat works: the robot that hits its opponent the most will achieve victory. However, bullet power is not mirrored in the fitness function, as that is out of scope for this controller. It is assumed that the turret controller is itself near optimal, such that this abstraction holds. Tests have shown that it evolves cyclic behavior and appears to lose much of the previous training. Equation (9) forces the agent to evolve a behavior that still fulfills the task of chasing. The performance of both is evaluated in the results section.

f_{4,1}(s_i) = 0, if w > 0;  u / r, otherwise.   (8)

f_{4,2}(s_i) = 0, if w > 0;  (u (1 - |d - 0.5| · 2) / (h · r))^2 · c / (t / 2), otherwise.   (9)

Fig. 9. Training with f_{4,1}(s_i): 250 generations with a population of 100, 5 games per specimen, predictive targeting opponent.

Since the training of the turret was only done against the sample.myfirstjuniorrobot [1] opponent, it does not have the accuracy needed for real opponents.
Thus demetrix.nano.neutrino 0.27, which ranks 397 in Roborumble [9], was used to train the turret through 10000 rounds before the training of the movement controller began. Unlike in the chasing phase, this opponent does not use direct targeting, but uses prediction. This forces the agent to learn how to avoid such projectiles as well. Overfitting will occur, in that the agent will optimize for this opponent and this opponent only. The best solution would be to train against several opponents of varying strategies that generalize the tactics used throughout the whole population of robots. However, the application does not allow swapping of opponents, so this is unfeasible without altering the freely available source code. This has not been done, and the training has been made within the constraints of the officially released client. Harder opponents were not chosen, as preliminary tests showed that the turret had problems matching their complex movement patterns, and that no movement improvements could be found that increased the agent's overall success.

Fig. 10. Training with f_{4,2}(s_i): 250 generations with a population of 100, 5 games per specimen, predictive targeting opponent.

The training can be seen in Figure 9 and Figure 10. Training with the fitness function f_{4,1}(s_i) resulted in a genome with a value of 74.67, and function f_{4,2}(s_i) resulted in a genome with a value of 51.46. The training spanned 250 generations with a population of 100, and five rounds per specimen. Thus the simpler fitness function was easier for the agent to fit, whereas it had difficulties in fitting the second function. The resulting structures of the genomes are minimal: the genome of the simple function contains only five hidden neurons and 15 new synapses, and that of the second function contains only four hidden neurons and 13 new synapses. The behavior is as expected; the agent with a phenotype derived from the genome of f_{4,1}(s_i) moves in fast irregular cycles, and the agent of f_{4,2}(s_i) moves in almost constant motion, circles around the opponent and keeps within range. Both are capable of avoiding projectiles fired by opponents.

IV. RESULTS

To determine the performance of the combination of the movement and turret controllers, a test is made against robots from the official Roborumble tournament [9]. The Sample.MyFirstJunior robot serves as the base, and then three robots of decreasing rank in the official Roborumble tournament [9] are used. These are matt.underdark3 2.4.34 with a rank of 691, ar.theoryofeverything 1.2.1 with a rank of 411 and jk.mega.drussgt 1.9.0b with a rank of 1. Two agents are measured to determine the difference between the two movement controller variations found in phase 4 of the training. Each opponent is engaged in combat with these agents ten times for 1000 rounds.

TABLE III
PERFORMANCE OF AGENT WITH MOVEMENT VARIANT A

Opponent                      Mean    Standard Deviation
Sample.MyFirstJunior          72.5%   0.92%
matt.underdark3 2.4.34        48%     7.75%
ar.theoryofeverything 1.2.1   20.5%   0.67%
jk.mega.drussgt 1.9.0b        1%      0%

TABLE IV
PERFORMANCE OF AGENT WITH MOVEMENT VARIANT B

Opponent                      Mean    Standard Deviation
Sample.MyFirstJunior          74.8%   0.75%
matt.underdark3 2.4.34        26.2%   0.87%
ar.theoryofeverything 1.2.1   21.1%   0.7%
jk.mega.drussgt 1.9.0b        2%      0%

The scoring used is that of Robocode [10], which is a sum of survival score, bullet damage, ram damage and additional bonuses. This scoring scheme rewards not only victory, but also the ability to damage an opponent. It thus does not reward agents that wear opponents down through avoidance only, but requires offensive actions. Ramming is the action of driving into an opponent, which causes damage to the opponent only. The agent with movement control trained by Equation (8) will be referred to as movement variant A, and the movement control trained by Equation (9) will be referred to as movement variant B. The performance of these agents can be seen in Table III and Table IV.
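For reference, the Mean and Standard Deviation columns can be reproduced from the ten repeated 1000-round matches as sketched below. It assumes the reported percentages are each agent's share of the total Robocode score in a match, which is our reading rather than something stated explicitly in the tables.

    // Sketch: mean and standard deviation of an agent's score share over the
    // ten repeated 1000-round matches reported in Tables III and IV.
    final class MatchStats {
        static double mean(double[] shares) {
            double sum = 0;
            for (double s : shares) sum += s;
            return sum / shares.length;
        }

        static double standardDeviation(double[] shares) {
            double m = mean(shares), varSum = 0;
            for (double s : shares) varSum += (s - m) * (s - m);
            return Math.sqrt(varSum / shares.length);   // population SD over the matches
        }
    }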
Both movement variants performed well against Sample.MyFirstJunior, which is the baseline. There is little variance, as both have a standard deviation (SD) below 1%. The agents perform above chance, and are able to fire and move in such a way that they can defeat an offensive opponent. Against harder opponents, the agents have difficulties winning. Against ar.theoryofeverything 1.2.1 and especially jk.mega.drussgt 1.9.0b they perform equally poorly, and decidedly so, with an SD below 1%. The opponent jk.mega.drussgt 1.9.0b is the current champion of Roborumble [9], and the agents are incapable of hitting it or of avoiding getting hit by it. They are however able to both hit and avoid the projectiles of ar.theoryofeverything 1.2.1, but at too low a frequency to claim victory. There is a big difference in the performance of the agents against matt.underdark3 2.4.34. Movement variant B performs almost as poorly as against ar.theoryofeverything 1.2.1, with a mean of 26.2% and an SD of 0.87%. Movement variant A however had an almost even match, with a mean of 48% and an SD of 7.75%, although it has slightly less success against the other opponents than variant B. Overall, the agent trained with the simple fitness function in Equation (8) has the best performance. This means that the complex fitness function in Equation (9) creates a too large search space, or constrains the search space in such a way that local optima are harder to find.

The problem with both the turret controller and the movement controller is that of overfitting, as well as the scope of patterns that must be approximated. There are many different opponents with highly varying behaviors, and learning one of these puts the controller at a disadvantage against others. Some of this could be mitigated by training against a multitude of opponents, allowing the turret controller and movement controller to find an approximation that generalizes their behavior. As mentioned in phase four of the movement training, this was not possible because of software constraints. It would be possible for the movement controller to converge on a good generalized strategy given enough generations and a suitable fitness function. The turret controller however faces problems with its neural network that utilizes backpropagation: while a generalized solution is possible, the network is much more prone to getting stuck in undesirable local optima, valleys in the search space. The other problem is the sensory input of both controllers. Better input increases the chance of convergence in higher local optima. Certainly there is enough information in the input given to the controllers, as can be seen from the performance against the base opponent and from movement variant A against the opponent matt.underdark3 2.4.34. But a deeper understanding of the environment and the game would allow input that is more relevant to the controllers. There also exist techniques for training feature selection [13], allowing the controllers to learn which inputs prove essential to their success and which are unnecessary or impeding.

V. CONCLUSION

We have shown that artificial neural networks can successfully be used for aiming the turret of a Robocode agent using historical information about opponents, in Section II. An average hit percentage of 32.69% was achieved using this method against a simple opponent. Using this network it was shown that reinforcement learning can be used for determining the power of a projectile, albeit the difference in power across different distances varied little due to the high hit percentage. It was determined that high powered projectiles are most often favorable, but low powered projectiles were also favorable in a few circumstances. It was also shown that neuroevolution can successfully be used for evolving relevant movement behavior, in Section III. Training can be divided into several phases, such that the size of the search space can be reduced into several smaller partitions. The terms of the fitness functions of each phase were determined to be of utmost importance to avoid knowledge loss. The results in Section IV showed that an agent composed of controllers trained by these machine learning techniques is capable of defeating easier opponents, but was unable to defeat the leading robots in the Robocode tournament. Overfitting was a problem that could not be tackled in this study. Possible improvements were determined to be better sensory inputs for the networks in both controllers.

REFERENCES

[1] F. N. Larsen, Robocode. [online] Available: http://robocode.sourceforge.net [accessed 2. December]
[2] J. Hong and S. Cho, Evolution of emergent behaviors for shooting game characters in Robocode, Evolutionary Computation, pp. 634-638, 2004.
[3] Y. Shichel, E. Ziserman and M. Sipper, GP-Robocode: Using Genetic Programming to Evolve Robocode Players, Genetic Programming, Springer Berlin / Heidelberg, pp. 143-154, 2005.
[4] K. O. Stanley and R. Miikkulainen, Evolving Neural Networks through Augmenting Topologies, Evolutionary Computation, vol. 10, no. 2, pp. 99-127, 2002.
[5] D. M. Bourg and G. Seemann, AI for Game Developers. Sebastopol, CA: O'Reilly Media, pp. 269-347, 2004.
[6] M. Buckland, NEAT C++ for Microsoft Windows. [online] Available: http://nn.cs.utexas.edu/soft-view.php?softid=6 [accessed 2. December]
[7] P. Ross, Neural Networks: An Introduction. AI Applications Institute, University of Edinburgh, pp. 4:1-23, 1999.
[8] J. M. Van Verth and L. M. Bishop, Essential Mathematics for Games & Interactive Applications: A Programmer's Guide. Burlington, MA: Morgan Kaufmann, pp. 541-599, 2008.
[9] F. N. Larsen, Roborumble. [online] Available: http://darkcanuck.net/rumble/rankings?version=1&game=roborumble [accessed 2. December]
[10] F. N. Larsen, Robocode Scoring. [online] Available: http://robowiki.net/wiki/Robocode/Scoring [accessed 11. December]
[11] F. N. Larsen, Robocode Scoring. [online] Available: http://robowiki.net/wiki/Robocode/Scoring [accessed 11. December]
[12] F. N. Larsen, Narrow Lock. [online] Available: http://robowiki.net/wiki/One_on_One_Radar [accessed 11. December]
[13] K. O. Stanley et al., Automatic Feature Selection in Neuroevolution, Proceedings of the Genetic and Evolutionary Computation Conference, 2005.
[14] F. N. Larsen, Neural Targeting. [online] Available: http://robowiki.net/wiki/Neural_Targeting [accessed 12. December]
[15] F. N. Larsen, Targeting. [online] Available: http://robowiki.net/wiki/Gaff/Targeting [accessed 12. December]
[16] F. N. Larsen, Wave Surfing. [online] Available: http://robowiki.net/wiki/Wave_Surfing [accessed 12. December]
[17] F. N. Larsen, Pattern Matching. [online] Available: http://robowiki.net/wiki/Pattern_Matching [accessed 12. December]