Evolving Multimodal Behavior

Jacob Schrum

October 26, 2009

Abstract

Multimodal behavior occurs when an agent exhibits distinctly different kinds of actions under different circumstances. Many interesting problems in real and simulated environments require agents that exhibit such behavior. The ability to automatically discover multimodal behavior would be useful in robotics, video games and other high-level control problems. Multimodal behavior is also especially important for teams of agents, taking the form of division of labor between team members. This proposed dissertation develops a method for discovering such behavior via neuroevolution. Work completed so far demonstrates how three modifications to typical neuroevolutionary methods make multimodal behavior easier to evolve: (1) multiobjective evolution (via e.g. the multiobjective evolutionary algorithm NSGA-II) encourages multimodal behavior because distinct behaviors tend to be associated with sets of contradictory objectives, (2) whenever the population collectively surpasses preset objective goals, the corresponding objectives can be dropped, speeding up evolution, and (3) a special mutation operator that creates a new set of output neurons for a neural network encourages the development of multiple distinct behavioral modes. The proposed work will build upon these findings in order to improve the evolution of multimodal behavior further in the following ways: (1) methods of overcoming stagnation via behavioral diversity enhancement will be developed, (2) the new-mode mutation will be improved, and different methods of arbitrating between the multiple modes evaluated, (3) in order to evolve teams more effectively, populations will be divided into subpopulations for each role in the team, and (4) the dynamic objective-dropping mechanism will be modified into an open-ended learning process. The resulting algorithm will be evaluated in a set of increasingly challenging multimodal domains, including Unreal Tournament 2004, a complex commercial first-person shooter video game. Success in these domains will demonstrate the algorithm's ability to evolve interesting behavior for challenging domains.

Contents

1 Introduction
2 Related Work
  2.1 Multimodal Behavior: Design Approaches; Value-Function Approaches; Evolutionary Approaches
  2.2 Multiobjective Optimization: Pareto Optimality; Value-Function Methods; Evolutionary Methods
  2.3 Constructive Neuroevolution
  2.4 Evolution of Teamwork
3 Completed Work
  3.1 Experimental Setup: Battle Domain; Fight or Flight
  3.2 Benefits of Multiobjective Neuroevolution: Experiment; Results; Evolved Behaviors
  3.3 Targeting Unachieved Goals: Why Objective Management is Needed; Experiment; Results; Evolved Behaviors
  3.4 Evolving Multiple Output Modes: Evolving Extra Output Modes; Experiment; Results; Evolved Behaviors
4 Proposed Work
  4.1 Avoiding Stagnation By Promoting Diversity
  4.2 Extending Evolution of Multiple Output Modes: Variations on New-Mode Mutation; Probabilistic Arbitration; Restrictions on New-Mode Mutation
  4.3 Improving Heterogeneous Teams Using Subpopulations
  4.4 Establishing Open-Ended Evolution: While Targeting Unachieved Goals; When All Goals Are Achieved; As Individual Goals Are Achieved
  4.5 Evaluation: Task Extensions in BREVE; Unreal Tournament 2004
5 Future Work
6 Conclusion
Bibliography

1 Introduction

According to Poole et al. (1998), computational intelligence is "the study of the design of intelligent agents." This view of computational intelligence has resulted in many interesting approaches to creating agents that act intelligently, and has also given rise to agent design methods by which agents learn how to act intelligently on their own, based solely on feedback from the environment. The domains inhabited by these agents are examples of Reinforcement Learning (RL) problems. Popular approaches for solving RL problems are value-function methods and evolutionary methods. Value-function methods have been used successfully in problems such as backgammon (Tesauro 1994), soccer ball interception by a physical robot (Müller et al. 2007), and keepaway in a RoboCup soccer simulator (Stone et al. 2005; Taylor et al. 2005). Evolutionary methods have done well at checkers (Fogel 2001), finless rocket control (Gomez and Miikkulainen 2003), robot foraging (Stanley et al. 2003; Waibel et al. 2009), and agent control in a squad-based combat video game (Stanley et al. 2006). However, despite successes in these domains, there are still many challenging domains that have yet to be solved by these methods. One of the reasons for this is that many challenging domains require multimodal behavior to be solved.

Multimodal behavior is defined in this work as the ability to exhibit distinctly different kinds of behavior under different circumstances. Animals have to be able to find food, find mates and avoid danger. In soccer, players must balance offensive and defensive roles. In a first-person shooter video game, a player has to explore to obtain resources and win in combat against opponents. Performing well in these domains is hard because agents must not only recognize that the situation calls for a different mode of behavior, which may be difficult to discern in itself, but must also perform the needed behaviors, each of which is potentially complex and very different from the other behaviors.

Multimodal behavior is particularly interesting in the context of teams. In the soccer example above, the role chosen by a player will usually depend on what its fellow teammates are doing. Because team members need to fulfill different roles, multimodal behavior arises at the team level: different individuals in the team exhibit different behavioral modes.

Learning methods are usually used only to learn component behaviors in a hard-coded controller, or to learn the control mechanism that chooses between several hand-coded behaviors. In some cases both are learned, but separately, and with careful human design. The purpose of this work is to develop a learning method that creates agents that exhibit multimodal behavior by simply training them in the task requiring such behavior. Such a method frees the human designer from needing to make decisions about what subbehaviors a multimodal agent will need. Before explaining how this goal will be achieved, relevant previous work in this field is presented to give context to the proposed approach, and in order to show how the new approach distinguishes itself from what has been done before.

2 Related Work

Before proposing this new learning method, several previous approaches to dealing with multimodal problems are reviewed as a foundation. This section also discusses tools and techniques that are part

of the new learning method.

2.1 Multimodal Behavior

There are many examples of agents that exhibit multimodal behavior. As the brief survey below will show, most of these successes depend on extensive human intervention. The role of humans is discussed to highlight the need for a method that discovers multimodal behavior without human intervention.

Design Approaches

The most common approach to designing complex behavior for agents in both robotics and in simulated worlds (such as video games) is to simply hard-code all behaviors. This approach allows designers to be aware of how their agents should respond in many different situations, although obviously not in situations they did not anticipate. Thus hard-coded agents are reliable in typical situations but do not handle surprises well. Designing such agents also requires much advance knowledge about the domain in which the agent will function, as well as design time and effort from programmers. Because hard-coded designs are complex, it is important to employ design practices that assure that the design is functional and easy to understand. Two popular design paradigms are detailed below.

Hierarchical Subsumption Architecture

The subsumption architecture (Brooks 1986) has remained a cornerstone of robotics ever since its inception. A recent summary of its design is given by Prescott et al. (1999):

1. Distributed, layered control: Control is distributed across several layers operating in parallel, with no central control within a layer.
2. Behavioral decomposition: Each layer is designed to achieve some particular goal or task, thus different layers are oriented towards different behaviors of the overall multimodal system.
3. Increasing levels of "competence": Each higher layer of the architecture has access to the lower layers, and is able to arbitrate between them, but the lower layers function unaware of the higher layers. Thus, higher layers are able to refine collections of lower-level behaviors into a meaningful, multimodal behavior.
4. Incremental construction: Lower layers are frozen before higher layers are built on top of them.
5. Subsumption: Higher layers can subsume the roles of lower layers by taking over, suppressing, or replacing lower-level outputs with higher-level outputs.
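The layered control scheme just listed can be illustrated with a minimal sketch. The code below only illustrates the arbitration principle, not any particular robot architecture; the layer names, sensor fields and actions are hypothetical.

```python
# Minimal sketch of subsumption-style arbitration: layers are ordered by level
# of competence, and the highest layer that produces an output overrides
# (subsumes) the layers below it. All names here are hypothetical examples.

def wander(sensors):
    # Lowest layer: always proposes a default action.
    return "move_forward"

def avoid_obstacles(sensors):
    # Middle layer: only speaks up when an obstacle is detected.
    return "turn_left" if sensors.get("obstacle_ahead") else None

def seek_goal(sensors):
    # Highest layer: subsumes the others when the goal is visible.
    return "move_toward_goal" if sensors.get("goal_visible") else None

LAYERS = [seek_goal, avoid_obstacles, wander]   # highest competence first

def act(sensors):
    for layer in LAYERS:
        action = layer(sensors)
        if action is not None:
            return action

print(act({"obstacle_ahead": True}))   # turn_left
print(act({"goal_visible": True}))     # move_toward_goal
print(act({}))                         # move_forward
```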

Using the subsumption architecture means identifying subbehaviors that are parts of larger behaviors and incorporating them into a hierarchy. The lowest level is constituted of several simple behaviors. The next layer can therefore perform multimodal behavior by arbitrating between the different behaviors exhibited by each lower-layer component. Though effective, this approach has the obvious drawback of requiring both the individual behaviors and the hierarchy of subsumption layers to be hand designed.

A general-purpose Machine Learning (ML) approach by Stone and Veloso (2000) called Layered Learning takes the general idea of hierarchical, layered control, and adds to it the ability to learn the subcontrollers for individual layers. The approach allows for arbitrary ML algorithms to be used for each layer of a given control architecture. This hybrid approach led to victory in the 1999 RoboCup Soccer tournament when applied to a team of soccer player agents in the RoboCup simulator (Stone 2000). Still, despite the success of the bottom-up subsumption approach in robotics and academic circles, other methods are favored elsewhere, such as in commercial video games.

Behavior Trees

Despite the advent of computational learning techniques, AI in commercial games typically depends on old-fashioned AI techniques such as rule-based systems and finite state machines (Diller et al. 2004). However, a slightly more sophisticated, yet still hand-designed, AI mechanism popularized by the first-person shooter game Halo 2 is behavior trees (Isla 2005). Behavior trees offer another way of designing agents hierarchically, though the design process is top-down instead of bottom-up as with the subsumption architecture. A behavior tree starts at a root, under which there are several prioritized high-level behaviors. Each behavior has a firing condition that determines whether or not it should be active. The first behavior in the list whose firing condition is active will then be enacted by the agent. In the subtree beneath each high-level behavior is another prioritized list of behaviors, each with its own firing condition. The behavior selection process is repeated deeper into the tree until terminal low-level behaviors are chosen and carried out by the agent. The behavior tree approach allows multimodal behavior to develop because distinct lower-level behaviors are chosen at different times based on priority and firing mechanism. However, like the subsumption architecture method, this approach requires careful human design of several component subbehaviors. Since the purpose of this work is to develop multimodal behavior automatically, approaches to developing multimodal behavior using value-function and evolutionary methods are discussed next.

Value-Function Approaches

Value-function methods depend on modelling the environment as a Markov Decision Process (MDP). An MDP is defined as a tuple (S, A, R, T). The set S is the set of all states in the environment. Set A defines the actions available from each state. The function R : S × A × S → ℝ is a reward function that returns a scalar reward for each transition that occurs between states based on an action taken by the agent. Transitions between states based on actions occur according to the probabilities defined by T : S × A × S → [0, 1].
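The following sketch makes the (S, A, R, T) formalism and the temporal-difference updates discussed below concrete on a toy problem. The states, actions, rewards, transition probabilities, and learning parameters are all invented for illustration; the update shown is a Q-learning-style update, one common instance of the value-function approach.

```python
import random
from collections import defaultdict

# Toy MDP: sets S and A, reward function R(s, a, s'), and transition
# probabilities T(s, a, s'). All of the specific values are made up.
S = ["cold", "hot"]
A = ["wait", "act"]
R = {("cold", "act", "hot"): 1.0}                   # unlisted transitions give 0
T = {("cold", "wait"): {"cold": 1.0},
     ("cold", "act"):  {"hot": 0.9, "cold": 0.1},   # each T(s, a, .) sums to 1
     ("hot", "wait"):  {"cold": 1.0},
     ("hot", "act"):   {"hot": 1.0}}

def step(s, a):
    outcomes = T[(s, a)]
    s2 = random.choices(list(outcomes), weights=list(outcomes.values()))[0]
    return s2, R.get((s, a, s2), 0.0)

# One-step temporal-difference update of a tabular state-action value Q(s, a).
Q = defaultdict(float)
alpha, gamma = 0.1, 0.9          # learning rate and discount factor
s = "cold"
for _ in range(2000):
    a = random.choice(A)
    s2, r = step(s, a)
    td_error = r + gamma * max(Q[(s2, b)] for b in A) - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    s = s2

print(round(Q[("cold", "act")], 2), round(Q[("cold", "wait")], 2))  # "act" valued higher
```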

The value function used by these methods is either a state-value function, V(s), or, more commonly, a state-action value function, Q(s, a), that estimates the long-term future utility of states or of actions at those states, respectively. Most value-function methods are also Temporal Difference (TD) methods, in which the value function is updated after every action taken by an agent based on the difference between the rewards expected and the rewards actually received. However, domains requiring multimodal behavior are challenging for these methods because the relevant aspects of the state that dictate which mode of behavior is currently best may be hard to identify, especially when they are not directly observable within the current state as perceived by the agent. Because of this challenge, many value-function approaches to multimodal behavior break tasks down into hierarchies, as with the hand-designed methods described above.

MAXQ

The MAXQ method (Dietterich 1998) is an early example of such a hierarchical reinforcement learning method. It borrows from and expands upon methods such as Feudal Reinforcement Learning (Dayan and Hinton 1993) and Hierarchical Abstract Machines (Parr and Russell 1998). Applying the method requires that the hierarchy is designed by hand for the task, thus splitting it into subtasks. The hierarchy also separates context-dependent and context-independent tasks. This separation allows lower-level tasks to be reused by multiple higher-level tasks, since the rewards at lower levels will not be influenced by the outcome of higher-level tasks. Learning with MAXQ is much faster than the standard TD approach because the separate subtasks can restrict themselves to state and action space representations that are smaller than those required to represent the entire problem.

MAXQ was initially designed to use a tabular Q function. This work also introduced the Taxi domain, which, despite being a challenging grid world problem, is still a toy domain. MAXQ-style task decomposition has been applied to other domains and situations as well: Roncagliolo and Tadepalli (2004) used a MAXQ-style hierarchy with function approximation to solve a blocks-world task. A multiagent extension of MAXQ called MOMQ was used in a robot trash collection task (Cheng et al. 2007). Work has even been done on learning the task hierarchy itself, though still for the Taxi domain (Hengst 2002). Still, MAXQ has yet to be applied to any really large, realistic problems. No matter how large the state space is, the domains tested still have finite state spaces, and are generally grid worlds. The more challenging and interesting domains tend to have continuous inputs, and such domains are the ones for which multimodal behavior is both more important and harder to learn.

Basis Behaviors

Problems involving physical robots have continuous state spaces and tend to require multiple behaviors, such as wall following, obstacle avoidance, and homing. An effective value-function approach in such domains is to simply hand-code the low-level behaviors and learn only how to arbitrate between them. Since such an approach still requires much human intervention, it is far removed from the goals of this work. However, it turns out that even when only learning the behavior arbitration mechanism, there are many challenging issues to overcome, and these are worth discussing. Mataric (1997) used a behavior-based approach to learning the arbitration mechanism for robots in an adversarial multiagent foraging domain. The goal of the robots was to collect as many pucks

as possible and return them to a specific home location. In order to make the learning task simpler, the state space was reduced to a set of four simple conditions that were either true or false, and the action space was abstracted to a set of four predefined high-level behaviors. Even with so much human knowledge and such a small state-action space, the task was not easy, due to the temporal credit assignment problem caused by temporally extended actions. To overcome this problem, shaped reinforcement via progress estimators was used. For example, in addition to getting immediate positive reinforcement for dropping a puck at home, robots also received positive reinforcement if they were successfully nearing their home location while holding a puck. The end result was robots that perform a challenging task in a real-world multiagent domain. However, the learned behavior requires so much human input as to raise the question of whether a purely hand-coded approach would not have been more practical. For instance, all of the robot actions were predefined. It is common with value-function methods to have an abstract set of simple actions since value-function methods are not easily applied to continuous action spaces, though there has been work in overcoming this challenge (Hagen and Kröse 2000; Strösslin and Gerstner 2003). However, these methods were applied to robot navigation tasks that did not require multimodal behavior to be solved.

Continuous action spaces are not a problem for policy-search reinforcement learning methods, of which evolutionary methods are an example. Also, the temporal credit assignment problem faced by the robots in this domain does not come up in evolutionary search, because rewards are assigned based on overall behavior instead of for individual actions. Example applications of evolutionary methods are discussed next.

Evolutionary Approaches

Evolutionary algorithms (EAs) are all based on the idea of evolution via natural selection as first presented by Darwin. There are many different types of EAs, but they all involve maintaining consecutive populations of candidate solutions, each of whose individuals are evaluated and then preferentially selected to pass on genetic material in one form or another to the next population. Because EAs require a population of individuals to be maintained rather than a single agent, they tend to be more evaluation intensive than value-function methods. Also, since fitness (reward) is assigned only at the end of an evaluation, evolutionary methods do not allow agents to learn in the middle of an evaluation. However, this aspect of fitness assignment has also proved beneficial to evolutionary methods when creating agents for partially observable domains (Gomez et al. 2006; Stanley and Miikkulainen 2002). One reason is that a set policy can perform well despite misleading or ambiguous information in the state representation. Another reason is that policy representations for evolved agents are not constrained to the value-function paradigm, and can therefore often make use of representations that support some form of memory. A popular evolvable agent representation is neural networks, and the combination of EAs with neural networks is called neuroevolution.

Neuroevolution

In a neuroevolutionary algorithm, neural networks are evolved and used as input-output mappings to solve some given problem. Such algorithms have proven especially useful

in control tasks (Baluja 1996; Blynel and Floreano 2002; Boshy and Ruppin 2002; Gomez and Miikkulainen 2003), where the sensor inputs of an agent are the inputs to the neural network, and the outputs control the actuators of the agent. Neural networks are appealing because they are theoretically capable of representing any continuous function on a bounded domain (Hornik et al. 1989), which means they should be capable of solving nearly any problem given the right configuration. Neural networks can also support recurrent connections, which allow the networks to have a form of memory. Recurrent connections grant neural networks an internal state that often allows them to perform better in non-Markovian environments than value-function algorithms that depend on the Markovian property (Gomez and Miikkulainen 1999).

The earliest neuroevolution algorithms used fixed-length genomes to encode fully connected feed-forward networks (Miller and Todd 1989). This simple approach is still in use today (Togelius et al. 2009). However, there are also many different varieties of enhanced neuroevolution algorithms for creating networks with both fixed and arbitrary topologies (Kitano 1990; Gruau 1994; Moriarty and Miikkulainen 1996; Gomez and Miikkulainen 1996; Stanley and Miikkulainen 2002). Several of these algorithms support the evolution of recurrent connections, or can be applied to one of several fixed-topology recurrent architectures. The fact that neural networks can support recurrency allows them to do well in partially observable domains. This ability in turn makes neuroevolution a good candidate for solving multimodal problems, since indicators about which of several behavioral modes to select may not always be directly observable to an agent. The following are several previous instances of multimodal behaviors generated via neuroevolution.

Dangerous Foraging

Neural networks are one of many policy representations that have proven highly evolvable. One advantage offered by these so-called neuroevolution methods is the ability to evolve recurrent connections, which provide a memory mechanism for agents. For instance, Stanley et al. (2003) demonstrate the usefulness of neural networks with recurrent connections in a simple multimodal domain, called the Dangerous Foraging Domain. This domain models the situation faced by a foraging animal entering a new geographical region. The foraging animal cannot know in advance whether the food in this region is safe to eat or not, therefore it must taste one food item first before it has a chance of making appropriate decisions. In the domain, all food present in a given trial is either safe or unsafe, so optimal behavior is to try one piece of food, and then take different actions according to whether the food was safe or not. If it was safe, then all food should be eaten. Otherwise, all food should be avoided. This task is made challenging by the fact that the network only receives brief pain and pleasure signals upon eating unsafe and safe food respectively. The fact that this signal is brief rather than persistent makes the domain partially observable. Recurrent networks were evolved using a neuroevolution method known as NEAT (Stanley and Miikkulainen 2002). Networks constructed by evolution only were compared to those that also included a dynamic Hebbian synapse update rule intended to help the network adjust to different environments. Surprisingly, the evolution-only networks solved the Dangerous Foraging task more quickly and more reliably than networks with the Hebbian learning rule.
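The role of recurrency in this kind of task can be illustrated with a minimal sketch. The network below is a hypothetical hand-picked example, not one of the evolved controllers from the cited study: a single hidden neuron with a strong self-recurrent weight retains a brief pain signal long after the signal itself has disappeared.

```python
import math

# A one-input, one-hidden-neuron recurrent network. The hidden activation from
# the previous timestep is fed back in, giving the network persistent state.
# The weights are arbitrary illustrations, not an evolved controller.

def step(inputs, hidden, w_in, w_rec, w_out):
    new_hidden = [math.tanh(sum(w * x for w, x in zip(w_in[j], inputs)) +
                            sum(w * h for w, h in zip(w_rec[j], hidden)))
                  for j in range(len(hidden))]
    output = sum(w * h for w, h in zip(w_out, new_hidden))
    return output, new_hidden

w_in, w_rec, w_out = [[2.0]], [[3.0]], [1.0]   # strong self-recurrent weight
hidden = [0.0]
for t, pain in enumerate([0.0, 1.0, 0.0, 0.0, 0.0]):   # brief pain signal at t=1
    out, hidden = step([pain], hidden, w_in, w_rec, w_out)
    print(t, round(out, 3))   # the output stays elevated after the signal ends
```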

It is possible that recurrent networks only did so well in this domain because they were able to take advantage of a simple trick, whereby a recurrent connection activated by the pain sensor caused the robot to spin until it was facing away from food. However, it should be kept in mind that many domains may be solvable using such tricks. This is why having some form of recurrency is useful for constructing multimodal behavior.

Adaptive Teams of Agents

However, multimodal behavior can emerge without recurrency as well. For instance, in the Adaptive Teams of Agents (ATA; Bryant and Miikkulainen 2003) approach, multimodal behavior was evolved without recurrency in a simple strategy game called Legion-II. In this game, a single neural network controlled a team of several Roman legions on a hexagonal grid. Each legion had its own copy of the neural network for its control policy, thus making for a team composed of homogeneous members. The goal of the team was to stop barbarian hordes from rampaging through both the countryside and the villages on the grid, though protecting villages was more important. Successfully evolved behavior in this task consisted of knowing when to garrison a village vs. when to hunt down barbarians in the countryside. Since villages were more important, it was generally always a good idea to garrison the villages against attack, unless said village was already garrisoned. Thus successful teams evolved a multimodal behavior in which they would hunt the countryside for barbarians as long as they encountered no empty villages, but legions that did encounter empty villages would garrison them to protect against the barbarians. Thus the team has individuals filling different roles, which is an example of team-level multimodal behavior.

This work demonstrates that certain types of multimodal behavior can be evolved without an architecture designed specifically to promote multimodal behavior. The use of homogeneous teams also demonstrates that teammates can take on different roles despite having the same control policy. However, other types of multimodal behavior could not be evolved by this approach. When the game was augmented by adding the ability to build roads for quick travel, the Roman legions failed to exploit the strategic advantages of this option (Miikkulainen 2009). This result indicates that certain types of multimodal behavior are harder to evolve than others, hence the need for approaches that specifically promote the discovery of such behaviors.

Layered Evolution of a Subsumption Architecture

In the enhanced version of Legion-II with roads, there were separate tasks in which the Roman legions could participate: building roads, garrisoning villages, and hunting barbarians. With just a little knowledge of how the game works, the task can be divided into several subtasks. The ability to decompose the task in this way makes it amenable to the subsumption architecture approach mentioned above. However, rather than manually coding the subbehaviors of the architecture, they can be learned via layered evolution (Togelius 2004). Given that some task involves several subtasks, it is usually easier to evolve solutions to the individual subtasks than to the task as a whole. The subsumption architecture takes advantage of this fact. First a task is decomposed into subtasks, then separate controllers are evolved for each individual subtask. This means that each subtask must have its own fitness function. Sometimes certain subtasks depend on others,

which requires sequentially evolved subcontrollers to also have access to the outputs of previously evolved subcontrollers. Once the controllers for the individual subtasks are evolved, they are frozen and combined into a hierarchical controller in the following manner: The hierarchical controller is another evolved controller that takes as input some mixture of the original inputs for the subcontrollers as well as the outputs of the subcontrollers. The master controller has one output corresponding to each subcontroller. When used to control an agent, the action of the agent is defined by the action of the subcontroller for which the master controller produces the highest output. This architecture allows the master controller to arbitrate between several subbehaviors believed to be important to solving the overall task. The master controller itself is evolved separately using whichever group of subcontrollers is deemed best. It should be noted that the subcontrollers can each be created in completely different manners. Therefore, different subcontrollers could be evolved with different algorithms or representations, or some could even be hand-coded. In fact, this approach seems nearly identical to Layered Learning (Stone and Veloso 2000), except that in this work neuroevolution is preferred over other ML techniques.

This layered evolution methodology has been applied to some interesting problems. One application is in the EvoTanks domain (Thompson and Levine 2008; Thompson et al. 2009), which is modelled on the Atari 2600 game Combat. In this domain, several tanks travel around obstacles and fire at each other to see which tank can survive and destroy all opponents first. The specific goal that the subsumption architecture was trained to attain in this domain was the task of navigating around an obstacle to reach a waypoint guarded by enemy turrets, while avoiding enemy fire. The subtasks that the component controllers were trained on were visit waypoint, detect obstacles, and dodge shells. Each subsequent task had access to the outputs of the previous modules while evolving. Experiments were also performed to see what the effect of retraining lower-level controllers would be after the higher-level controllers had been trained. These results were inconclusive in that retraining certain controllers improved fitness while retraining others caused fitness to drop. Other trials were also performed on several different tasks within the domain, but the overall result is that the evolved tanks are fairly good at arbitrating between the behaviors learned by their individual subcontrollers.

Another domain in which layered evolution has had success is a simplified version of the popular first-person shooter game Unreal Tournament 2004 (van Hoorn et al. 2009a). The Unreal Tournament series of games all feature a Deathmatch mode of play, in which several human players and computer-controlled bots run around 3D maps collecting weapons and fighting in order to see who can get the most kills (frags). The full game is very complex, since there are many weapons to choose between, challenging obstacles that must be navigated around or even jumped over, and also many high-level strategic decisions that must be considered to perform well. The hierarchical subsumption controller evolved for this domain functions in a version of the game simplified in the following manner: Levels used were reduced to one floor, making the game effectively 2D. In fact, the ability of the bots to aim up and down was also disabled. Only one weapon was used: the shock rifle. This weapon fires straight and hits instantaneously, meaning that an agent that fires when aiming directly at an opponent is guaranteed to hit that opponent.
Besides health packs and shock rifle ammo, all items were removed from the maps used.

Despite these restrictions, the evolved bots are impressive in that they outperform the bots shipped with the game, as long as the prepackaged bots are functioning at the standard difficulty level (there are several difficulty levels beyond this). The subsumption bot was trained in the tasks of exploration, path-following and shooting. Though these examples demonstrate that layered evolution combined with the subsumption architecture can produce impressive results in challenging domains, these methods still require much human intervention and a priori knowledge about the task decomposition within the domain. In order to discover multimodal behaviors automatically, methods that do not depend on knowledge of the task hierarchy must be developed.

Neuro-Evolving Robotic Operatives

The NERO video game (Stanley et al. 2005a,b, 2006) allows multimodal behaviors to evolve via clever manipulation of the fitness function by a human user. In this sense the training regime is still divided into separate tasks by a human expert, but because the training is performed online by a human user dynamically, less a priori knowledge is required. All that is needed is a rough plan. The user is free to change this plan in response to the results of the evolutionary process.

NERO is a machine learning game, in that the role of the human player is to train a team of robot soldiers using machine learning techniques, specifically neuroevolution. During training, the team consists of the entire population. Thus, the team is heterogeneous. Evolution occurs in a special training area in which the human instructor can place various obstacles and enemies. The human user influences the behaviors of the robots indirectly via manipulation of a fitness function. As the fitness function changes, so does the selection pressure, which quickly leads to certain behaviors being favored. The fitness function is a weighted sum of several separate fitness components such as rewards for approaching enemies, shooting at enemies, crowding together, spreading out, etc. The user controls the weights via several slidebars in the game interface.

In order to assure that the weights assigned by the slidebars are comparable across different objectives, scores for each individual objective are combined using the z-score method. A z-score allows scores from different distributions to be compared on equal footing. Given a population of scores, the z-score measures the score's number of standard deviations above or below the mean. Thus a z-score of 0 is average, negative scores are below average and positive scores are above average. Each robot's fitness is then the weighted sum of the z-scores of its individual objective scores.

Weighted-sum fitness functions have weaknesses that will be discussed more fully below in the section on multiobjective optimization. However, some of these weaknesses do not apply to the fitness function in NERO, since the weights can be changed dynamically by the user. This process adds a lot of flexibility to the types of solutions that can be produced during training.
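The z-score fitness combination described above can be sketched in a few lines. The objective names, raw scores, and weights below are invented placeholders; in NERO the weights would come from the user's slidebar settings.

```python
from statistics import mean, pstdev

# Each objective's raw scores are standardized across the population (z-scores),
# then combined into a single fitness value using user-chosen weights.

def z_scores(scores):
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / sigma if sigma > 0 else 0.0 for s in scores]

raw = {                                   # raw score per robot, per objective
    "approach_enemy": [3.0, 5.0, 9.0, 1.0],
    "damage_dealt":   [10.0, 0.0, 40.0, 30.0],
}
weights = {"approach_enemy": 0.3, "damage_dealt": 0.7}   # hypothetical slidebar settings

standardized = {obj: z_scores(vals) for obj, vals in raw.items()}
fitness = [sum(weights[obj] * standardized[obj][i] for obj in weights)
           for i in range(len(raw["approach_enemy"]))]
print([round(f, 2) for f in fitness])     # one weighted z-score sum per robot
```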

In particular, multimodal behavior can emerge through careful control of the fitness function across an appropriate sequence of tasks.

In sum, the z-score method in NERO can also produce multimodal behavior, but it requires a lot of attention from a human user to be effective. As with the layered evolution method, the human user needs some knowledge of the task hierarchy to be learned in advance, but because the system is interactive, decisions are not set in stone in advance. More importantly, the idea of helping evolution find interesting behaviors by dynamically changing the fitness function throughout the course of evolution is a useful idea that will be used below in conjunction with Pareto-based multiobjective optimization methods to encourage multimodal behavior. Such multiobjective optimization methods are the topic of the next section.

2.2 Multiobjective Optimization

While there are cases where pursuit of a single objective requires many different behaviors, many complex domains require multimodal behavior to be solved because they have multiple, usually contradictory, objectives. This observation leads to the need for a way to do multiobjective optimization. In multiobjective optimization, two or more conflicting objectives are optimized simultaneously. Examples of such problems are numerous. In finance, it is always of interest to maximize profits while minimizing costs. In the design of aircraft and automobiles, trade-offs between cost, performance and safety must be taken into consideration. For agents in video games, there is often a need to balance the benefits of thwarting opponents with the risks of making oneself vulnerable. Because increases in one objective often come at the expense of other objectives, it is not always clear when one solution is better than another. A formalism is needed that can resolve this issue.

Pareto Optimality

For the examples above, it is not immediately clear what the best trade-off between the stated objectives is. In fact, different trade-offs may even be useful under different circumstances. However, it is clear that if a given solution to a problem is better than another solution in all objectives, it is better overall. In fact, even if a solution is strictly better than another solution in only one objective, it is absolutely better as long as it is at least equal in all the other objectives. These ideas are formalized by the notions of Pareto dominance and optimality:

Definition 1 (Pareto Dominance) Vector v = (v_1, ..., v_n) dominates u = (u_1, ..., u_n) if and only if the following conditions hold:
1. ∀i ∈ {1, ..., n} : v_i ≥ u_i, and
2. ∃i ∈ {1, ..., n} : v_i > u_i.
The expression v ≻ u denotes that v dominates u.
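Definition 1 translates directly into code. The sketch below assumes that all objectives are to be maximized, as in the definition above.

```python
def dominates(v, u):
    """True if objective vector v Pareto-dominates u: v is at least as good in
    every objective and strictly better in at least one (all objectives maximized)."""
    return (all(vi >= ui for vi, ui in zip(v, u)) and
            any(vi > ui for vi, ui in zip(v, u)))

print(dominates((2, 3), (2, 1)))   # True: equal in one objective, better in the other
print(dominates((2, 3), (3, 1)))   # False: a trade-off, so neither vector dominates
```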

Definition 2 (Pareto Optimal) A set of points S ⊆ F is Pareto optimal if and only if it contains all points such that ∀x ∈ S: there is no y ∈ F such that y ≻ x. The points in S are non-dominated, and make up the non-dominated Pareto front of F.

The above definitions indicate that the best solutions to a multiobjective problem are in the Pareto front of the search space. Therefore, solving a multiobjective optimization problem involves approximating the Pareto front as well as possible.

Of course, appealing to the idea of Pareto optimality is not the way that many multiobjective problems are solved in practice. A more common and intuitive approach is simply to combine the objectives into a single function, and then to attempt to optimize this Aggregate Objective Function (AOF). The simplest form of AOF is a weighted sum of objectives. This approach has the obvious disadvantage of needing to choose appropriate weights, since radically different solutions will be considered optimal under different weightings. With the right weights, a weighted-sum approach can actually do fairly well on problems whose true Pareto front has a convex shape. However, weighted-sum AOFs cannot capture Pareto optimal points on non-convex surfaces (Coello 1999), which makes the approach inappropriate for problems where the objectives are strongly contradictory. AOFs with other forms can be designed to capture points on non-convex Pareto surfaces, but these methods require additional knowledge about the behavior of and constraints on the objectives (Messac et al. 2000). Furthermore, such an approach is not practical when the number of parameters in decision space is high. Due to these limitations in solving multiobjective optimization problems from an engineering perspective, several AI-based approaches have been developed.

Value-Function Methods

Value-function methods have been applied to domains involving multiple contradictory objectives. However, the fact that value-function methods depend on the environment being an MDP can make it difficult to apply these methods to a multiobjective problem. The problem with the MDP framework within the context of a multiobjective optimization problem is the reward function R; it only returns a scalar reward. Learning value functions in a multiobjective setting requires replacing R with a vector-valued function R : S × A × S → ℝ^N, where N is the number of objectives. However, this is not the common approach. Multiple objectives in a value-function setting are normally dealt with by scaling scores from different objectives into a scalar reward function, which from a multiobjective perspective is similar to using a weighted-sum AOF. However, there have been several attempts to work with modified multiobjective MDPs using value functions, some of which are described below.

Greatest Mass Sarsa

Sprague and Ballard (2003b) dealt with multiple objectives using a modified version of the popular Sarsa (Rummery and Niranjan 1994) algorithm, called GM-Sarsa (Greatest Mass Sarsa). The problem they address is slightly more general than the one formalized above; in their formalization, each objective is its own separate MDP, with potentially different states, transition functions and reward functions. However, the action space must be the same for all MDPs.

Given a shared state space and transition function, this more general problem reduces to the multiobjective optimization problem proposed earlier. GM-Sarsa learns separate action-value functions for each MDP being solved. Each action-value function has its utility values updated according to on-policy exploration of the composite MDP by the GM-Sarsa agent. This agent chooses what actions to take based on getting the greatest mass: the agent attempts to maximize the sum of the utility estimates across all action-value functions.

Though GM-Sarsa is a novel way of modelling and solving multiobjective problems, the method of greatest mass is still very similar to using a weighted-sum AOF. Thus, GM-Sarsa discovers a single policy that finds near-optimal solutions to a multiobjective problem with some specific weighting. Changing the relative magnitudes of the rewards in the different MDPs is like changing weights in a weighted-sum AOF. Therefore, GM-Sarsa can only find policies at a particular point on the trade-off surface between objectives. Also, it is unlikely to find policies existing on non-convex Pareto fronts.

When GM-Sarsa was introduced it was tested on a few multiobjective grid world tasks (Sprague and Ballard 2003b), but it has since been used to model human eye-movement in a virtual sidewalk navigation task (Sprague and Ballard 2003a): A virtual humanoid was required to simultaneously accomplish the goals of staying on the sidewalk, avoiding obstacles, and picking up litter. Each task used a different continuous state space that was discretized using its own CMAC (Sutton 1996). The virtual human moved forward at a fixed speed. The action space shared by the component MDPs allowed for left-turn, right-turn and no-turn actions. GM-Sarsa performed well in this task, which was challenging because of the continuous state space and the conflicting goals. Also, the separate goals map nicely onto separate behaviors. Still, appropriate relative rewards for each goal were needed in order to produce the desired behavior, due to the use of summation to calculate the values of actions. Also, the method only generates one solution, and does not explore possible trade-offs between objectives, which is important in domains where the relative importance of objectives is not obvious.

Convex Hull Value Iteration

GM-Sarsa summed raw rewards in separate objectives to determine the values of actions, which means that each objective effectively had the same weighting. A multiobjective value-function approach by Barrett and Narayanan (2008) goes one step further and simultaneously solves a problem for all possible weightings of the different objectives. This approach assumes that the standard reward function is replaced with a vector function R, as described earlier. The overall function to be maximized by the algorithm is

R_w(s, a, s′) = w · R(s, a, s′)    (1)

for any arbitrary weight vector w such that Σ_i w_i = 1. The method works for arbitrary weight vectors by finding the set of solutions in the convex hull of the set of vectors whose reward is maximal for some weight setting. The convex hull consists of boundary points on a convex set that are maximal in some direction. Once the convex hull is known, a policy can be generated based on it by plugging in a specific weight vector. This approach has the benefit of being able to produce multiple policies based on a choice of objective weightings.
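Equation 1 simply scalarizes a vector-valued reward with a weight vector. The reward components and weights in the sketch below are arbitrary illustrations of this idea.

```python
# Equation 1: a weight vector w turns the vector reward R(s, a, s') into the
# scalar reward R_w(s, a, s') = w . R(s, a, s'). Values here are arbitrary.

def scalarize(w, reward_vector):
    assert abs(sum(w) - 1.0) < 1e-9            # the weights must sum to 1
    return sum(wi * ri for wi, ri in zip(w, reward_vector))

R_vec = (2.0, -1.0, 0.5)                        # one reward component per objective
print(round(scalarize((0.5, 0.3, 0.2), R_vec), 2))   # 0.8
print(round(scalarize((0.1, 0.8, 0.1), R_vec), 2))   # -0.55: a different trade-off
```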

However, as the name implies, this approach only works for domains where the Pareto front has a convex shape. This limitation is due to the fact that R_w(s, a, s′) is a weighted sum. Therefore, despite being able to generate a set of trade-off solutions, this approach still cannot find points on non-convex Pareto fronts. The details of the algorithm are complex, and not relevant to the proposed work. However, it is worth noting that the formulation is done in terms of a discrete state space, and the method was tested in a simple multiobjective grid world. There does not seem to be any further work using this algorithm, and it is not known whether it could be extended to domains with continuous state spaces, let alone those requiring multimodal behavior.

Evolutionary Methods

Most EAs make use of a single scalar fitness value to measure the performance of individuals in the population. The first attempts at solving multiobjective problems using EAs therefore made use of AOFs. For a survey, see Coello (1999). However, due to the problems with AOFs described above, the best modern Multi-Objective Evolutionary Algorithms (MOEAs) perform selection based on the concept of Pareto dominance. Still, there are many modern MOEAs in common use, and each one addresses the problem slightly differently. What follows is a brief survey of several popular Pareto-based MOEAs.

Non-dominated Sorting Genetic Algorithm II

Non-Dominated Sorting Genetic Algorithm II (NSGA-II; Deb et al. 2000) works by sorting the population into non-dominated Pareto fronts in terms of each individual's fitness scores, in order to select those that are Pareto-dominated by the fewest individuals. The underlying EA behind NSGA-II is a (µ + λ) Evolution Strategy (ES; Bäck et al. 1991). In the ES paradigm, a parent population of size µ is evaluated, and then used to produce a child population of size λ. Selection is performed on the combined parent and child population to give rise to a new parent population of size µ.

With NSGA-II, selection is elitist, and is implemented by first determining which individuals are in the non-dominated Pareto front. NSGA-II then removes these individuals from consideration momentarily and forms a second non-dominated Pareto front based on the remaining members of the population. This process repeats until the whole population is sorted into successive Pareto fronts. NSGA-II selects the members from the highest-ranked Pareto fronts to be the next parent generation. A cutoff is often reached such that the Pareto front under consideration holds more individuals than there are remaining slots in the next parent population. These slots are filled by selecting individuals from the current front based on a metric called crowding distance. The crowding distance for a particular point p in objective space is the average distance between all pairs of points on either side of p along each objective. Points having an objective score that is the maximum or minimum for the particular objective are considered to have a crowding distance of infinity. For other points, the crowding distance tends to be bigger the more isolated the point is. NSGA-II favors solutions with high crowding distance during selection, because the more isolated points in objective space are filling a niche in the trade-off surface with less competition.
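The two ingredients just described, sorting into successive non-dominated fronts and computing crowding distance within a front, can be sketched as follows. This is a simplified illustration (maximization assumed), not the reference NSGA-II implementation, and it omits the tournament selection and variation steps.

```python
def dominates(v, u):
    return all(a >= b for a, b in zip(v, u)) and any(a > b for a, b in zip(v, u))

def nondominated_sort(scores):
    """Split individuals (by index) into successive non-dominated Pareto fronts."""
    remaining, fronts = set(range(len(scores))), []
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(scores[j], scores[i])
                            for j in remaining if j != i)}
        fronts.append(sorted(front))
        remaining -= front
    return fronts

def crowding_distance(front, scores):
    """Normalized gap around each point along each objective; boundaries get infinity."""
    dist = {i: 0.0 for i in front}
    for m in range(len(scores[front[0]])):
        ordered = sorted(front, key=lambda i: scores[i][m])
        lo, hi = scores[ordered[0]][m], scores[ordered[-1]][m]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi > lo:
            for left, mid, right in zip(ordered, ordered[1:], ordered[2:]):
                dist[mid] += (scores[right][m] - scores[left][m]) / (hi - lo)
    return dist

scores = [(1, 9), (5, 5), (9, 1), (4, 4), (2, 2)]       # two objectives, maximized
fronts = nondominated_sort(scores)
print(fronts)                                           # [[0, 1, 2], [3], [4]]
print(crowding_distance(fronts[0], scores))             # boundary points are infinite
```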

By combining the notion of non-dominance with that of crowding distance, a total ordering of the population arises by which individuals in different Pareto fronts are sorted based on the dominance criteria, and individuals in the same front are further sorted based on crowding distance. The resulting comparison operator for this total ordering is also used by NSGA-II: the way that a new child population is derived from a parent population is via binary tournament selection based on the total ordering of Pareto fronts and crowding distance.

NSGA-II has performed well on several multiple-function minimization benchmarks of interest to the MOEA community (Deb et al. 2000), but these are all contrived problems of little interest in the real world. They are far removed from the domains of interest to this work, which focuses on the behavior of agents interacting with the environment. However, NSGA-II has been used in domains such as these as well. Some example domains are unmanned aerial vehicle navigation (Barlow et al. 2004), and opponent control in both a racing game (Agapitos et al. 2008; van Hoorn et al. 2009b) and a first-person shooter game (van Hoorn et al. 2009a). These applications indicate that NSGA-II is indeed useful for the types of domains of interest in this work.

Pareto Envelope-based Selection Algorithm II

PESA-II (Corne et al. 2001) is a slight modification to the selection scheme of PESA (Corne et al. 2000). The general idea behind PESA is to maintain an archive of non-dominated points, and to take offspring from this archive in order to explore new possibilities. The selection scheme is binary tournament selection based on the squeeze factor of hyper-boxes that partition the objective space. The squeeze factor is simply the count of individuals in a hyper-box, which means that favoring smaller squeeze factors encourages exploration of unpopulated regions of objective space.

Whereas PESA makes use of individual selection, PESA-II introduces region-based selection to enhance evolutionary search. Region-based selection is performed at the level of hyper-boxes. Whenever a given hyper-box is selected to produce offspring, it is actually a random individual from that hyper-box that reproduces. Region-based selection makes less crowded hyper-boxes as likely to be randomly chosen to compete in a binary tournament as highly crowded boxes. Since the winner of the tournament is the hyper-box with the smaller squeeze factor, this selection scheme ultimately results in less crowded individuals creating offspring more often than crowded ones.

PESA-II has performed well on several of the same benchmark problems that NSGA-II was tested on (Corne et al. 2001; Durillo et al. 2008). The question of which one is better seems to depend more on the particular problem than on the inherent qualities of either algorithm. Unlike NSGA-II, PESA-II has been less widely applied to interesting problems involving the behavior of agents, so the question of how well it would evolve multimodal behaviors is still open.

Strength Pareto Evolutionary Algorithm 2

Like PESA-II, SPEA2 (Zitzler and Thiele 1999) also uses an external archive in addition to an internal population. However, the archive has a fixed size, so in some respects the resulting scheme is more similar to the ES method used by NSGA-II. The major difference between SPEA2 and both previously mentioned algorithms is the selection mechanism. Fitness is assigned to each individual based on notions of strength and density. Strength is the number of solutions in the archive and the population that a given individual dominates. Stronger individuals are better because they dominate more points. Density values are smaller for

less crowded points. The density measure serves the same purpose as the squeeze factor in PESA-II and the crowding-distance metric in NSGA-II.

As with the other methods presented, SPEA2 has performed well on the standard benchmarks of interest to the field (Zitzler and Thiele 1999), though direct comparison indicates that SPEA2 is slightly slower than both PESA-II and NSGA-II (Durillo et al. 2008). There is evidence that SPEA2 is faster than NSGA-II in the early stages of evolution on noisy problems, but given enough time, the quality of solutions produced by NSGA-II is better (Bui et al. 2004). Because domains where an agent interacts with an environment typically involve noisy evaluations, and because multimodal behaviors are presumably high-quality behaviors, NSGA-II still seems to be a better choice for evolving multimodal behavior. However, as with PESA-II, no work has been done using SPEA2 to evolve behaviors, so its effectiveness at such a task is still an open question.

2.3 Constructive Neuroevolution

Neuroevolution algorithms that build additional structure over time are called constructive, and have several advantages over algorithms that evolve fixed networks. One advantage is that constructive algorithms search in a gradually increasing search space instead of the entire search space from the start. This search methodology allows constructive algorithms to optimize neural network weights incrementally within gradually complexifying neural network topologies, which in turn allows useful solutions to be found more quickly. The ability to evolve arbitrary topologies is also an advantage of constructive algorithms, because often a solution can be represented via a simple (i.e. sparsely connected) network topology that requires less tuning than e.g. a fully connected multilayer network.

Constructive neuroevolution algorithms can use a variety of encodings, and can offer several different ways in which to add structure, but generally there needs to be a way to do at least three simple operations: change a synaptic weight, add a new connection, and add a new neuron. These operations are implemented via mutations that add to network structure. Sometimes the action of adding a recurrent connection to the network is handled as a special case of the operation for adding a connection, but this is not strictly necessary. Neuroevolution of Augmenting Topologies (NEAT; Stanley and Miikkulainen 2002) and Evolutionary Acquisition of Neural Topologies (EANT; Siebel and Sommer 2007) are two modern constructive neuroevolution algorithms that allow for these types of basic structural changes. Each algorithm boasts the ability to evolve any potential neural network using these most basic mutations. Other neuroevolution algorithms (Yao and Liu 1996; Angeline et al. 1994) have also experimented with simplifying mutations, such as deleting nodes and connections. The efficacy of such simplifying mutations in speeding up evolutionary search is still uncertain.

Although NEAT and EANT make use of encodings capable of representing any network, it does not mean that any given network is easy to evolve, even supposing that the network in question would be well suited to the task. The completeness of the network encodings and the particular mutation operators supported affect the shape of the fitness landscape. The simple mutation operators used in these algorithms decrease the bias in evolutionary search, but this lack of bias may make certain solutions very difficult to evolve.
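The three basic operations named above can be sketched on a toy genome of connection genes. This is only an illustration of the general idea, not an implementation of NEAT or EANT; in particular, the add-neuron operator here uses the connection-splitting approach, which is one of several possibilities.

```python
import random

# Toy genome: a node list plus a dict of connection genes (source, target) -> weight.

def perturb_weight(connections):
    # Basic operation 1: change a synaptic weight.
    key = random.choice(list(connections))
    connections[key] += random.gauss(0.0, 0.5)

def add_connection(connections, nodes):
    # Basic operation 2: add a new connection between two existing nodes.
    # No feed-forward check is made, so recurrent links are possible here.
    src, dst = random.choice(nodes), random.choice(nodes)
    if (src, dst) not in connections:
        connections[(src, dst)] = random.gauss(0.0, 1.0)

def add_neuron(connections, nodes, new_id):
    # Basic operation 3: add a new neuron by splitting an existing connection
    # (src -> dst becomes src -> new -> dst), keeping the old weight downstream.
    (src, dst), w = random.choice(list(connections.items()))
    del connections[(src, dst)]
    nodes.append(new_id)
    connections[(src, new_id)] = 1.0
    connections[(new_id, dst)] = w

nodes = ["in0", "in1", "out0"]
connections = {("in0", "out0"): 0.3, ("in1", "out0"): -0.7}
add_neuron(connections, nodes, "h0")
add_connection(connections, nodes)
perturb_weight(connections)
print(nodes)
print(connections)
```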

19 Recently, Kohl and Miikkulainen (28) demonstrated that introducing some bias into constructive neuroevolution algorithms can help the evolutionary search, at least in certain domains. Specifically, the Cascade-NEAT algorithm supports a mutation that adds a new cascade node, which is a node receiving inputs from all preexisting input and hidden nodes, and sending output to all output nodes. No other structural mutations are supported. Thus in Cascade-NEAT, the search space is highly constrained, yet the constraints bias the evolutionary search towards discovering networks of a particular type. Cascade-NEAT has demonstrated improved performance over regular NEAT in several domains with high fracture. Informally, fracture is a measure of how radically the outputs of a function change across nearby inputs. To some extent, multimodal domains can be considered domains with high fracture, in the sense that a multimodal domain may suddenly require an agent to use a different mode of behavior without much change in input sensor values. However, the main point of bringing up Cascade-NEAT is not that it is designed to solve multimodal problems, but that it demonstrates the possibility of improving evolutionary search by biasing it with innovative mutations. Some innovative mutations that are designed specifically to discover multimodal behavior will be discussed later, both as part of the completed and proposed work. 2.4 Evolution of Teamwork Complex, interesting domains that require multimodal behavior are often multiagent domains. A challenging issue in such multiagent domains is the evolution of teamwork. For the purposes of this work, it is also interesting to examine what multimodal behavior means in the context of teams of agents. First there is the issue of whether multimodal behavior is defined differently in the context of a team. It is often useful for a team of agents to use a heterogeneous mix of policies so that each individual in the team takes on a specific role. Such division of labor is useful, though in a sense it is not actually multimodal behavior; each individual of the team is carrying out only a single behavior. However, since the outcome is the same, such behavior can be considered multimodal at the team level. If the composition of the team is homogeneous (shared control policy), then a division of labor into roles is also multimodal. This was the case with the ATA agents in the Legion-II game mentioned above. In Legion-II, the justification for designating such behavior as multimodal is that any of the individuals of the team is capable of exhibiting the garrison behavior or the barbarian chasing behavior depending on cues from the environment. Even if each individual of the homogeneous team only exhibits a single behavior within an evaluation, all team members need to have multimodal behaviors available to them. Regardless of how it arises, is is worth finding out whether or not multimodal behavior is easier to evolve in the context of heterogeneous or homogeneous teams. In general, there is evidence that homogeneous teams are better than heterogeneous teams in tasks that require cooperation (Waibel et al. 29). This result was demonstrated in a robot foraging domain. In contrast, heterogeneous teams did better in a version of the task that did not require teamwork. Another way to view these results is that heterogeneous teams will tend towards individualistic, loner strategies and homogeneous teams will tend towards group strategies. 
2.4 Evolution of Teamwork

Complex, interesting domains that require multimodal behavior are often multiagent domains. A challenging issue in such multiagent domains is the evolution of teamwork. For the purposes of this work, it is also interesting to examine what multimodal behavior means in the context of teams of agents. First there is the issue of whether multimodal behavior is defined differently in the context of a team. It is often useful for a team of agents to use a heterogeneous mix of policies so that each individual in the team takes on a specific role. Such division of labor is useful, though in a sense it is not actually multimodal behavior; each individual of the team is carrying out only a single behavior. However, since the outcome is the same, such behavior can be considered multimodal at the team level. If the composition of the team is homogeneous (a shared control policy), then a division of labor into roles is also multimodal behavior. This was the case with the ATA agents in the Legion-II game mentioned above. In Legion-II, the justification for designating such behavior as multimodal is that any individual on the team is capable of exhibiting either the garrison behavior or the barbarian-chasing behavior, depending on cues from the environment. Even if each individual of the homogeneous team only exhibits a single behavior within an evaluation, all team members need to have multimodal behaviors available to them. Regardless of how it arises, it is worth finding out whether multimodal behavior is easier to evolve in the context of heterogeneous or homogeneous teams. In general, there is evidence that homogeneous teams are better than heterogeneous teams in tasks that require cooperation (Waibel et al. 2009). This result was demonstrated in a robot foraging domain. In contrast, heterogeneous teams did better in a version of the task that did not require teamwork. Another way to view these results is that heterogeneous teams will tend towards individualistic, loner strategies and homogeneous teams will tend towards group strategies. When evolving homogeneous teams it is best to use team-level selection, and when evolving heterogeneous teams it is best to use individual-level selection (Waibel et al. 2009).

These selection methods play to the natural strengths of each approach. Either method could be useful in different domains, so running experiments with both can be instructive. For the most part, the domains that will be tested in this work are ones that should require teamwork, but it will be interesting to see to what extent loner strategies are effective. In particular, there will be a chance to see different types of multimodal behavior evolve: multimodal behavior by individuals using loner strategies in heterogeneous teams, multimodal behavior at the team level in homogeneous teams, and hopefully even multimodal behavior by individuals in homogeneous teams.

3 Completed Work

The approach taken to discover multimodal behavior in this work is neuroevolution combined with an MOEA, specifically NSGA-II. This section describes a sequence of experiments that each introduce and test one component of the proposed method:
1. Multiobjective neuroevolution using NSGA-II for selection.
2. Dynamic objective management via Targeting Unachieved Goals (TUG).
3. Introduction of a new-mode mutation and a means of arbitrating between evolved output modes.
In order to test the efficacy of the method, several test domains are needed, and they are explained first.

3.1 Experimental Setup

The test domains constitute a simple yet challenging set of tasks in the same environment. In this environment, the agents all have the same input and output space, which allows domains requiring multimodal behavior to be constructed by simply combining several separate domains into one evaluation. The domains completed so far include the Battle Domain, whose purpose is to demonstrate the benefits of multiobjective neuroevolution in discovering multimodal behavior, and the Fight or Flight domain, whose purpose is to create a domain that demands multimodal behavior in order to be solved. Both domains are team tasks, which will allow the differences in the evolution of homogeneous and heterogeneous teams to be explored. These domains were constructed using BREVE (Klein 2003), a simulation environment intended for artificial life simulations. BREVE's scripting language is agent-centric, which means it is well suited to programming multiagent domains.

3.1.1 Battle Domain

The purpose of the Battle Domain (Fig. 1) is to demonstrate that Pareto-based multiobjective evolution is more effective in producing multimodal behaviors than the standard approach of reducing multiple objectives into a single fitness function.

Figure 1: Illustrations of the Battle Domain: The green (dark) fighter agent in the center chases the nearest yellow (light) monster while swinging its bat. If the bat hits a monster, the monster is damaged; if it receives enough damage, it dies. If a monster hits the fighter, the fighter is damaged. The monsters must evolve to deal damage to the fighter, avoid damage from the fighter's bat, and survive as long as possible. The Battle Domain also constitutes the Fight task of the Fight or Flight domain.

The Battle Domain pits a group of four evolved monster agents against a single fighter agent controlled by a static policy. The fighter swings a bat constantly while approaching the nearest monster. If a monster is hit, it receives 10 points of damage. After receiving 50 points of damage, the monster dies and is removed for the remainder of the evaluation. In turn, the monsters damage the fighter for 10 points of damage each time one of them collides with it. The fighter can also take 50 points of damage before dying, but each time the fighter dies it respawns, and the monsters that are still alive are returned to their starting positions. The monster agents are controlled by neural networks. The outputs of the networks are very simple: One output determines the left/right turn impulse, and the other output determines the forward/backward motion impulse. The network inputs are slightly more complicated:
1. Bias: Constant input of 1.0.
2. Difference From Fighter Heading: The angle difference in radians between the heading of the fighter agent and the heading of the monster controlled by the network.

3. Angle to Fighter: Angle in radians between the monster's heading and a direct path to the fighter from the monster.
4. Yelp of Pain: Whenever any monster is struck by the fighter's bat, the monster emits a yelp of pain sensed briefly by all monsters. This input is 1.0 when the yelp is sensed, and 0.0 at all other times.
5. Just Received Hit: This is a personalized version of the yelp of pain. This input is 1.0 only if the sensing monster was just hit, and 0.0 otherwise.
6. Shout of Encouragement: A signal is also sent out whenever any monster deals damage to the fighter. This input is 1.0 immediately after any monster hits the fighter, and 0.0 at all other times.
7. Just Delivered Hit: A personalized version of the shout of encouragement, this value is 1.0 if the sensing monster just hit the fighter and 0.0 otherwise.
8. Fighter Out of Control: Whenever any agent in the world takes damage, it is knocked backward and has no control over its actions for a very brief period. The hit agent is also invulnerable during this brief period. This input is 1.0 whenever the fighter is in this out-of-control state, and 0.0 otherwise.
9. In Front of Fighter: This input is 1.0 whenever the fighter is facing in the general direction of the sensing monster, and 0.0 otherwise.
10. Very Close Bat: Value of 1.0 if the distance between the monster and the fighter's bat is such that the monster is within striking range, and 0.0 otherwise.
11. Close Bat: Same as the Very Close Bat sense, but for twice the distance.
12. Difference From Teammate Heading: For each teammate there is a sensor for the difference in the heading of the sensing monster and the heading of the particular teammate. This value is measured in radians, and is always 0.0 for the input corresponding to the teammate that is the same as the sensing monster.
13. Angle to Teammate: For each teammate there is a sensor for the angle between the sensing monster's heading and a direct path to the sensed teammate. Each monster senses the angle to itself as 0.0.
14. Teammate Just Delivered Hit: For each teammate there is a sense for whether or not it just hit the fighter. The input is 1.0 in the case of a hit and 0.0 otherwise. This sense overlaps somewhat with the shout of encouragement and just delivered hit senses.
15. Fighter Sensors: There are five sensors arrayed around the front of each monster for sensing the fighter. Their length is slightly greater than the length of the fighter's bat. Each sensor returns 1.0 if it is currently intersecting space occupied by the fighter and 0.0 otherwise.

16. Monster Sensors: There is an identical array of sensors for detecting other monsters.
17. Bat Sensors: There is yet another array of sensors for sensing the fighter's bat.
Using these inputs the monsters must fulfill multiple contradictory objectives. They must avoid being hit by the fighter's bat, since they lose life and can die from too many hits. If death is unavoidable, then it should be put off for as long as possible: if the fighter can kill monsters quickly, it can keep shifting its focus to new monsters and kill even more. Finally, monsters must maximize the damage that they as a group deal to the fighter. It is not the individual score, but the group score that matters. From the fighter's perspective, all monsters are the same. Therefore, the monsters must work together to maximize group score. Enacting strategies that take proper advantage of the given inputs can be particularly hard given that some of the inputs are only brief impulses. Any lasting effect that these inputs have on behavior can presumably only occur through the use of recurrent connections in the controlling neural network. Given the objectives above, the following three fitness measures are designed to measure success in each one:
1. Maximize Damage Dealt: Every time a monster contacts the fighter, the fighter loses 10 health points. The amount of damage dealt is attributed to the team, regardless of which individual dealt the damage.
2. Minimize Damage Received: Every time the fighter strikes a monster with its bat, the monster takes 10 points of damage, making for a resulting change in health of -10. The fitness attributed to the team is the average change in health across all individuals.
3. Maximize Time Alive: There are 600 time steps in each trial. For each individual monster, this objective measures the number of time steps that the monster is alive. The team score in this objective is the average across team members.
These objectives are in opposition because the easiest way to reduce damage received and extend time alive is to avoid the fighter, but dealing damage to the fighter requires monsters to stay near it. Likewise, a kamikaze attack may result in more damage to the fighter than patiently waiting for the right moment to strike, but it would also result in injury and death. There also exist strategies between these extremes that exemplify the trade-offs in the domain. From a multiobjective standpoint, many different solutions can be considered successful. Therefore, the Battle Domain is a useful platform for studying complex behavior: It is simple, yet has contradictory objectives that lead to different types of behaviors. In order to challenge evolution to come up with these behaviors, a challenging opponent is needed. The fighter agent uses a simple but effective chasing strategy. Starting from a position where it is surrounded by monsters, but facing in a random direction, the fighter constantly moves forward and always turns to pursue the nearest monster in front of it. If no monsters are in front, it keeps turning until it finds a monster to pursue. This strategy is challenging because the fighter never turns away from a monster that is threatening it. Monsters in front of the fighter have to keep backing away as the fighter pursues them. Because monsters and the fighter move at the same speed, as long as the fighter is chasing some monster, other monsters that are pursuing the fighter from behind will generally not be able to catch up with it.

Their only hope is to distract the fighter somehow to slow it down, or else find an opening to get around its attacks. Because of how challenging this strategy is for the monsters, an initially random population does not have much of a chance to evolve interesting behavior against a fighter moving at full speed. Therefore, the fighter is initially handicapped by having its speed reduced, and incremental evolution (Gomez and Miikkulainen 1997) is used to increase the speed gradually whenever the monster population demonstrates that it is able to handle the fighter at the current speed. Progression is based on three goals, one for each objective. A goal is simply a numeric value for an objective that should be attained by an average-performing member of the population. Given a numeric goal value for an objective, that goal is considered achieved once the average performance of the population in that objective has persisted long enough at a level above the value of the goal. To illustrate the persistence requirement, let us first consider a simple example without it. The performance of the population in each objective is measured every generation. The average across the population for each objective is also calculated. A plot of these averages over generations fluctuates heavily, both because the evaluations are noisy and because mutations have random effects upon consecutive generations. Due to these fluctuations, average population performance that is above the goal level may not mean much. Therefore, to assure that goal achievement does not depend too strongly on chance, recency-weighted averages of the average performance are used to determine whether a goal is achieved, in addition to the performance averages themselves. The purpose of the recency-weighted average is to track the average performance of the population over the most recent evaluations. When both the average performance and the recency-weighted average of that average have surpassed their objective's goal value, the goal is considered achieved. Once all goals are achieved, the challenge of the domain is incremented by increasing the speed of the fighter. When all goals are met with the fighter moving at full speed, evolution stops with a successful population. The goals for each objective are:
1. Maximize Damage Dealt: 50: The fighter has 50 health points, so this goal requires the monsters to kill the fighter at least once per trial. The fighter respawns after death, giving the monsters a chance to inflict more damage.
2. Minimize Damage Received: -20: Bat strikes deal 10 damage points each, so each monster should take no more than two hits on average. However, because this value is averaged across team members, it is possible to achieve this goal even if one team member dies (50 damage), since the average across the four team members could still be above -20.
3. Maximize Time Alive: 540: On average, team members must survive throughout 90% of the trial. This number is an average across team members as well, so it is still possible to achieve this goal even if some NPCs die in fewer than 540 iterations. Recall that the total number of iterations per trial is 600.
A recency-weighted average r(t) at time t is updated in the following fashion:

r(t + 1) = r(t) + α(x̄ − r(t))    (2)

where the α parameter is a constant equal to 0.15 and x̄ is the current average score for the particular objective. Thus the recency-weighted average moves slightly closer to the most recent average every generation. At the start of evolution, and whenever the population achieves all goals, thus causing the speed of the fighter to increase, the recency-weighted averages are reset to the minimum values of the corresponding objectives. These minimums are zero for all objectives except damage received, which has a minimum of -50.
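As a purely illustrative sketch of this bookkeeping, the following Python fragment tracks one objective's population average and recency-weighted average, and reports the goal as achieved only when both exceed the goal value. The class and variable names are assumptions of this sketch, not the actual implementation.

```python
class GoalTracker:
    """Tracks one objective's goal using the recency-weighted average of Equation (2)."""
    def __init__(self, goal, minimum, alpha=0.15):
        self.goal = goal        # e.g. 50 for Damage Dealt at full fighter speed
        self.minimum = minimum  # reset value, e.g. 0, or -50 for Damage Received
        self.alpha = alpha
        self.r = minimum        # recency-weighted average r(t)

    def update(self, population_scores):
        """Call once per generation with every individual's score in this objective."""
        average = sum(population_scores) / len(population_scores)
        self.r += self.alpha * (average - self.r)  # r(t+1) = r(t) + alpha * (avg - r(t))
        return average >= self.goal and self.r >= self.goal  # both must surpass the goal

    def reset(self):
        """Called whenever the fighter's speed is increased."""
        self.r = self.minimum

# The three Battle Domain goals:
trackers = [GoalTracker(goal=50, minimum=0),      # Maximize Damage Dealt
            GoalTracker(goal=-20, minimum=-50),   # Minimize Damage Received
            GoalTracker(goal=540, minimum=0)]     # Maximize Time Alive
```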

Successful performance in the Battle Domain will often depend on multimodal behavior, as will be seen in the experiments below. However, it is not clear exactly how many modes of behavior are required of agents to solve it, because these different behavioral modes are applied within the same task. If agents were instead evaluated in distinctly different tasks, then it would be clear that they need different behaviors for different tasks. Such a domain that is explicitly divided into distinct tasks is the Fight or Flight domain, described next.

3.1.2 Fight or Flight

Fight or Flight is an extension to the Battle Domain in which the monsters participate in two separate types of evaluations: Fight (Fig. 1) evaluations and Flight (Fig. 2) evaluations. Fight evaluations are identical to evaluations in the Battle Domain, and have the exact same objectives and goals. Flight evaluations are a new and distinct type of challenge. The Flight task shares the same dynamics and the same network inputs as the Fight task, but the fighter becomes prey because it no longer has a bat. There is no input sense for the monsters that indicates whether or not the agent they are interacting with is a fighter or prey, but this information can be inferred from the bat sensors and pain sensors. Since the prey has no bat, the four monsters cannot be damaged, but they can damage the prey. The prey initially faces a random direction and is surrounded by four monsters, as in the Fight task, and must escape without dying. Assuming monsters and the prey move at the same speed, the monsters will not be able to catch the prey once the prey is no longer surrounded by monsters. For this reason, Flight trials end as soon as the prey escapes a bounding box defined by the four monster positions. To encourage proper behavior, this termination condition holds even when the prey moves slower than the monsters, which is the case during the early stages of the incremental evolution process. The Flight task has only one objective, which is to maximize the damage dealt to the prey. Though similar to the maximize-damage objective in the Fight task, this objective is distinct simply because the task is different. Behavior for dealing damage in the Fight task is different from behavior for dealing damage in the Flight task, so the two objectives are kept separate. However, since the Fight or Flight domain requires the same teams of monsters to perform both the Fight and the Flight tasks, all together the domain has four distinct objectives. The new Flight task objective also has its own goal value for the sake of progression via incremental evolution. The goal for the maximize-damage objective in the Flight task is 100 damage, which equates to killing the prey twice per trial.

Figure 2: The Flight task has the same dynamics as the Fight task, except that the fighter is replaced with prey, shown in green. The prey is different from the fighter in that it does not have a bat. The prey also runs backwards away from the nearest monster (red) instead of chasing after it. The goal of the monsters is to damage the prey, but in order to do this successfully they must also confine the prey so it cannot escape. The Flight task is different enough from the Fight task that it requires a different behavior, which means that multimodal behavior is required of the monsters in order to handle the Fight or Flight domain as a whole.

This goal is greater than the corresponding Fight task goal because it is easier to kill prey than it is to kill a fighter armed with a bat. So in total the Fight or Flight domain consists of four objectives across two separate tasks. All monsters are evaluated in both the Flight task and the Fight task. Unless the prey escapes during a Flight trial as described above, Flight trials last 600 iterations, just like Fight trials. Given the two domains described above, it is now possible to carry out a series of experiments to see under what conditions multimodal behavior is likely to evolve.
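To make the structure of this two-task evaluation concrete, the following sketch shows how a team's four objective scores could be assembled from one Fight trial and one Flight trial. The trial functions and the dictionary keys are hypothetical; they stand in for whatever the BREVE simulation actually returns.

```python
def evaluate_team(team, run_fight_trial, run_flight_trial):
    """Assemble the four Fight or Flight objectives for one team of monsters."""
    fight = run_fight_trial(team)    # assumed to return per-trial statistics
    flight = run_flight_trial(team)

    def mean(values):
        return sum(values) / len(values)

    return [
        fight["damage_dealt"],          # Fight: maximize damage dealt (team total)
        mean(fight["health_changes"]),  # Fight: minimize damage received (team average, <= 0)
        mean(fight["time_alive"]),      # Fight: maximize time alive (team average)
        flight["damage_dealt"],         # Flight: maximize damage dealt to the prey
    ]
```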

3.2 Benefits of Multiobjective Neuroevolution

In order to demonstrate that Pareto-based evolution is effective in multiobjective problems, NSGA-II is compared to the z-scores method in the task of evolving both homogeneous and heterogeneous teams in the Battle Domain.

3.2.1 Experiment

Individual-level selection is used with the heterogeneous teams, and team-level selection is used with the homogeneous teams. Each team has four members. Furthermore, each individual of a heterogeneous team is required to be part of a different team for each of three separate evaluations. Such evaluation in multiple teams helps gauge each individual's fitness more accurately. A bad individual in a good heterogeneous team may get good scores, but if evaluated in enough teams, its true colors will show. Homogeneous teams also participate in three evaluations each, though the team remains the same each time. For both heterogeneous and homogeneous teams, multiple evaluations help overcome the problem of noisy evaluations in the domain. No two evaluations will be the same because the fighter has an initially random orientation. The final scores of individuals and teams in each objective are formed by averaging the scores obtained across these evaluations. When using the z-scores method, these raw scores must be transformed into z-scores and then incorporated into a weighted sum. The weights for each objective in the Battle Domain were 2.0 for maximizing damage dealt, 1.0 for minimizing damage received, and 0.5 for maximizing time alive. Obviously, other weightings are possible, which is itself a problem with any sort of weighted-sum approach, but these weightings are reasonable given some knowledge of the domain. Maximizing damage dealt has the highest weight because it is the hardest objective to achieve, especially in the presence of the other two objectives, which are opposed to it. An easy way to maximize time alive and minimize damage received is to simply run away from the fighter. The importance of the objective to deal damage helps prevent this behavior from becoming a dominant strategy. Minimizing damage received has a greater weight than maximizing time alive, because time alive can be maximized even when a monster receives a lot of damage. Once again, the objective that is more difficult to achieve has a higher weight. To make the comparison between NSGA-II and the z-scores method as fair as possible, the z-scores method functions in a manner as similar to NSGA-II as possible (i.e. as a (µ + λ) Evolution Strategy). In both the homogeneous and heterogeneous trials, the parent population size is 52. This number needed to be divisible by four in order for each individual to have an equal number of evaluations in the heterogeneous trials. Note that because four heterogeneous individuals are evaluated at a time, it actually takes fewer overall evaluations to evolve heterogeneous teams than homogeneous teams. The experiments are set up so that each individual is evaluated the same number of times, even though this approach means that homogeneous trials require a greater number of overall evaluations. In both methods, after all 52 individuals are evaluated, a child population of size 52 is made by simply cloning and mutating every individual in the parent population. Thus, there is no selection pressure in the creation of the child population. For NSGA-II, this approach is a slight deviation from the standard algorithm, which makes use of the combined Pareto-rank/crowding-distance selection operator at this point. Notice that no crossover is used in these experiments. Fogel and Atmar (1990) claim that crossover is unnecessary in evolutionary search, and is often detrimental, since the crossover operation often creates individuals that are highly dissimilar from either parent despite being derived from both of them. Simple mutation, on the other hand, always results in an individual that is a slight variation of its parent genotype.
Furthermore, NSGA-II and the z-scores method described above are both examples of ES-style EAs, and the ES paradigm tends to rely exclusively on mutation (Schaffer and Eshelman 1991). Indeed, preliminary experiments using crossover did not turn out as well as experiments using only mutation, so crossover was not used in the experiments reported below.
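As a minimal sketch of the z-scores fitness calculation described above (assuming, as is usual for z-scores, that each objective is standardized against the population's own mean and standard deviation; the data layout is also an assumption of this illustration):

```python
import statistics

# Objective weights used for the Battle Domain.
WEIGHTS = {"damage_dealt": 2.0, "damage_received": 1.0, "time_alive": 0.5}

def z_score_fitness(population_scores):
    """Collapse each individual's objectives into one weighted sum of z-scores.

    population_scores is a list of dicts mapping objective name to raw score.
    """
    fitnesses = [0.0] * len(population_scores)
    for name, weight in WEIGHTS.items():
        values = [individual[name] for individual in population_scores]
        mean = statistics.mean(values)
        stdev = statistics.pstdev(values) or 1.0  # guard against zero variance
        for i, value in enumerate(values):
            fitnesses[i] += weight * (value - mean) / stdev
    return fitnesses  # one scalar per individual; higher is better
```

Each individual thus receives a single scalar fitness, which is exactly the reduction to one objective that the Pareto-based approach avoids.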

Once all 52 members of the resulting child population have been evaluated, 52 members of the combined 104-member population are chosen to become the next parent population. Both methods use purely elitist selection to obtain the next parent population. NSGA-II uses Pareto rank and the crowding-distance metric for selection, as described earlier. In the z-scores method, the members with the highest weighted-sum fitness scores are chosen to become the next parent population. (This shared generational loop is sketched in code at the end of this subsection.) Both methods are subject to the same incremental evolution scheme. The details of this scheme and the goals for each objective were given in the domain description; the speed progression sequence was not. The speed of the fighter starts at 0% of the speed of the monsters. The fighter can still turn towards and swing the bat at monsters; it simply cannot move forward or backward. When the monsters overcome the fighter at 0% speed, the speed is upgraded in sequence to 40%, 80%, 90%, 95%, and then finally 100%. The step size decreases as speed increases because the higher speeds are harder to deal with. Both methods also make use of neural networks to control the monsters. The neuroevolutionary method used is similar to NEAT, specifically Feature Selective NEAT (Whiteson et al. 2005). FS-NEAT starts with a population of networks with sparse connections between the input and output layer. There are no hidden nodes, and each output node takes input from one randomly chosen input node. This approach allows evolution to ignore unnecessary inputs, or to put off using them until enough prerequisite structure is in place to make using them worthwhile. During the mutation phase, three basic types of mutations can occur to a network, independently of each other. These mutations (and their rates) are the basic weight perturbation (90%), connection addition (60%), and node splice (30%) mutations used in NEAT. It is safe for the mutation rates to be very high both because there is no crossover and because the selection process is elitist. Lack of crossover means that a network that is not mutated will be identical to a network in the parent population, and because of elitist selection, there is no reason to have more than one instance of any given network. Therefore, the mutation rates should be high so as to encourage new solutions to be found and evaluated. Networks were evolved under each of four possible conditions for up to 100 generations. If a population achieved all goals against the fighter at 100% speed before 100 generations, then the run terminated in success. A total of 30 runs for each condition were performed. The four experimental conditions were:
1. Hetero+Z: Heterogeneous teams using the z-score selection method.
2. Hetero+NSGA2: Heterogeneous teams using NSGA-II for selection.
3. Homo+Z: Homogeneous teams using the z-score selection method.
4. Homo+NSGA2: Homogeneous teams using NSGA-II for selection.
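The shared generational loop referred to above can be sketched as follows. The helper functions passed in are hypothetical stand-ins, and for simplicity this sketch re-evaluates parents every generation, which the actual experiments need not do.

```python
def evolve(parents, evaluate, mutate, select_survivors, goals, speeds):
    """(mu + lambda)-style loop shared by the NSGA-II and z-scores conditions.

    evaluate(pop, speed)             -> objective scores, one entry per genome
    mutate(genome)                   -> a mutated clone of the genome
    select_survivors(pop, scores, n) -> the best n genomes (Pareto rank plus
                                        crowding distance for NSGA-II, or the
                                        highest weighted z-score sums)
    goals                            -> goal bookkeeping as in Section 3.1.1
    """
    speed_index = 0
    for generation in range(100):                 # assumed generation limit
        children = [mutate(p) for p in parents]   # lambda = mu = 52, no selection pressure
        combined = parents + children             # 104 candidates
        scores = evaluate(combined, speeds[speed_index])
        parents = select_survivors(combined, scores, len(parents))  # purely elitist
        if goals.achieved(scores):
            if speed_index == len(speeds) - 1:
                return parents, "success"         # all goals met at 100% speed
            speed_index += 1                      # incremental evolution: faster fighter
            goals.reset()
    return parents, "timeout"

SPEEDS = [0.0, 0.4, 0.8, 0.9, 0.95, 1.0]          # fighter speed schedule
```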

3.2.2 Results

Figure 3 shows how many of the 30 heterogeneous runs of each type (Z vs. NSGA2) had surpassed each fighter speed of the incremental evolution sequence at each generation. Figure 4 shows similar results for the homogeneous trials. For the heterogeneous trials, the NSGA-II method is clearly superior, since it dominates the z-score method at all fighter speeds. The results for the homogeneous trials are not as stark. The z-score method actually dominates NSGA-II at speeds 0%, 40%, and 80%. NSGA-II catches up to the z-scores method at 90% speed. At speeds 95% and 100%, NSGA-II slightly dominates the z-score method. Because obtaining good performance in the hardest task (100% speed) is more important, NSGA-II is better than the z-score method overall. The data can also be viewed in terms of how long it took each individual run to overcome each speed. Figure 5 shows the number of generations spent by all heterogeneous runs against the fighter at each speed that the given run reached. Figure 6 shows the results for the homogeneous runs. Because every experimental condition included trials that did not finish within the allotted 100 generations, the completion times of different experimental conditions cannot be statistically compared in terms of their averages. However, median tests offer a valid alternative. The median test is a special case of the well-known chi-squared test. The test determines whether or not two independent samples have a common median, which should be the case for the completion times if different experimental conditions took roughly the same number of generations to surpass each speed. Results of median tests for the heterogeneous trials are in Table 1; median test results for the homogeneous trials are in Table 2.

Speed    Effect
0%       Small
40%      Small
80%      Small
90%      Small
95%      Large
100%     could not be calculated
End      Large

Table 1: Values of the χ²-statistic (χ²(1, n = 60)) when comparing the heterogeneous z-score trials to the heterogeneous NSGA-II trials via median tests. The scores being compared are the number of generations taken to achieve all goals at the given fighter speed. Values less than 0.05 for p are considered to indicate a statistically significant result, whereas values less than 0.01 are highly significant. Values for φ indicate the magnitude of the difference without regard to significance. Each φ value classifies the effect of the different conditions as being small, medium or large, as indicated in the Effect column. No median test could be performed for the fighter at 100% speed because over half of the runs from the two conditions did not surpass this speed. The last row gives the result of a straight chi-squared test that compares the two conditions in terms of whether or not they finished successfully. The results indicate that NSGA-II takes less time to progress than the z-score method against the 95% speed fighter, and also results in significantly more runs that finish within 100 generations.

The numerical results indicate that NSGA-II is clearly superior to the z-scores method in the heterogeneous case, and about as good in the homogeneous case. However, the quality of the produced solutions is more important than the time taken to evolve them, so an in-depth analysis of the behaviors evolved is in order.
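For readers unfamiliar with the median test, a comparison of this kind can be reproduced roughly as follows using SciPy; the completion times shown are placeholder values, not the actual experimental data.

```python
from math import sqrt
from scipy.stats import median_test

# Placeholder completion times (generations needed to beat one fighter speed);
# how unfinished runs are coded is a choice, here they sit at the generation cap.
z_runs     = [34, 41, 55, 100, 100, 62, 47, 88, 100, 53]
nsga2_runs = [28, 33, 30, 45, 39, 100, 41, 36, 52, 44]

chi2, p, grand_median, table = median_test(z_runs, nsga2_runs)
n = len(z_runs) + len(nsga2_runs)
phi = sqrt(chi2 / n)  # effect size for the underlying 2x2 contingency table

print(f"chi2 = {chi2:.3f}, p = {p:.3f}, phi = {phi:.3f}")
```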

Figure 3: Number of runs of heterogeneous teams that successfully achieved all goals at the specified fighter speed by a particular generation (one plot per fighter speed: 0%, 40%, 80%, 90%, 95%, and 100%). Achieving all goals at 100% speed indicates a complete success. Runs using the z-score selection method (Hetero+Z) are compared to runs of the NSGA-II selection method (Hetero+NSGA2). NSGA-II clearly outperforms the z-score method, especially when the fighter moves at higher speeds.

Figure 4: Number of runs of homogeneous teams that successfully achieved all goals at the specified fighter speed by a particular generation (one plot per fighter speed: 0%, 40%, 80%, 90%, 95%, and 100%; Homo+Z vs. Homo+NSGA2). For homogeneous runs, the z-score method is actually better at lower speeds, but is eventually overtaken by NSGA-II at 95% and 100% speeds.

Figure 5: Number of generations that each of the 30 heterogeneous runs of each type (Z vs. NSGA2) spent against the fighter at each speed. Bars that are cut off before reaching 100 generations completed successfully. Many more NSGA-II runs completed successfully than z-score runs.

Figure 6: Number of generations that each of the 30 homogeneous runs of each type (Z vs. NSGA2) spent against the fighter at each speed. These results are more difficult to interpret than the corresponding figure of heterogeneous results. There are more NSGA-II runs that get stuck at lower speeds than z-score runs. However, when NSGA-II does not get stuck, it overcomes the toughest task (100% speed) faster. Also, more NSGA-II runs finish successfully than z-score runs. This result indicates that although the z-scores method is fine for simpler tasks, more difficult multiobjective problems are better handled by NSGA-II.

Speed    Effect
0%       Small
40%      Small
80%      Small
90%      Small
95%      Small
100%     Small
End      Small

Table 2: Values of the χ²-statistic (χ²(1, n = 60)) when comparing the homogeneous z-score trials to the homogeneous NSGA-II trials via median tests. At most speeds, there is no significant difference in completion time between the z-scores method and NSGA-II. NSGA-II is also not significantly different from the z-scores method in determining whether or not a given run completes successfully. However, the significant results for speeds 0% and 90% actually indicate that the z-score method was significantly faster at these speeds, although the effect size is small in both cases.

3.2.3 Evolved Behaviors

Because the multimodal behaviors in heterogeneous and homogeneous teams are slightly different, the behaviors resulting from each team-composition method will be analyzed separately.

Heterogeneous Teams

In the heterogeneous NSGA-II trials, the most intelligent behavior involved baiting the fighter to allow teammates to sneak up from behind and hit the fighter repeatedly with a side-swiping attack, which prevented it from successfully swinging its bat (Fig. 7). For the purpose of clarity, baiting is defined in the Battle Domain to be a very particular type of running away: It involves moving away from the fighter while also turning slightly to the side by a very small amount. The resulting path of motion is curved. The turning aspect of this behavior is important because it involves an element of risk: If a monster runs straight away from the fighter, the fighter will never catch up, since at 100% speed the fighter and monsters move at the same speed. Turning slightly means the fighter could eventually catch up and hit the baiting monster. However, by putting itself at risk of eventually being hit, the monster acting as bait enables its teammates to catch up to the fighter when chasing from behind. When done right, the baiting monster is only hit about once, and may not be hit at all, before teammates approaching from behind catch up to the fighter. The monsters that do catch up begin to attack the fighter in a particular fashion that is slightly risky, but makes good use of the limited evaluation time. After the first hit from behind, the fighter ends up facing the monster that hit it. This monster continues to attack the fighter by hitting it on its left side near the front. The fighter always swings its bat from the right side, and being hit interrupts its swing.

Figure 7: Baiting Behavior Followed by Side-Swiping Attack: This example comes from one of the successful heterogeneous NSGA-II trials. The fighter sees a monster in front of it on the right, and so pursues it while swinging the bat. The baiting monster allows another monster to sneak up from behind and hit the fighter. Afterwards the fighter is facing the monster that just hit it and takes another swing at it, but the monster delivers side-swipe attacks on the fighter's left side which repeatedly cancel its swing. This behavior is a clever team-level multimodal behavior.

By attacking the fighter's left side, the monster is able to repeatedly interrupt the fighter's swing, and hit it until it dies. This attacking style does not always work as perfectly as described, which is why it is risky, but it works well enough to have evolved within seven out of 19 successful heterogeneous NSGA-II trials. The side-swipe attack also evolved in three heterogeneous NSGA-II trials that reached the 100% speed fighter, but did not overcome it in the allotted number of generations. Baiting behavior evolved in six of the 19 successful heterogeneous NSGA-II trials. These six trials also featured side-swipe attacking behavior. Additionally, the three trials that developed side-swiping at 100% speed without finishing also developed baiting behavior. The side-swipe attack and the baiting behavior represent two different modes of behavior exhibited by individuals in heterogeneous trials. Individuals generally start out attempting to attack the fighter with the side-swipe attack. If one starts hitting the fighter without being hit by its bat, then it continues the attack. However, when an attacking monster is hit by the bat, it switches to baiting mode. A behavior similar to baiting also evolved in heterogeneous NSGA-II trials that did not reach 100% speed, but this behavior is not as meaningful when the fighter is slower than the bait because it involves no risk. In fact, when the fighter moves at less than 100% speed, a monster can run straight away from it and never be caught, yet the monster's teammates will still be able to sneak up behind and hit the fighter. Because this strategy is effective at lower speeds, but completely ineffective at higher speeds, it causes a problem in the incremental evolution process that will be addressed in the experiments below. A less intelligent, but still effective, behavior evolved in the successful heterogeneous NSGA-II trials that did not exhibit the behaviors described above. This behavior was based on aggressively rushing the fighter in hopes of sneaking in a hit between swings. Once one hit gets through, a monster can often hit the fighter multiple times before it can get off another swing. Because individuals using the aggressive rushing behavior tend to take damage, populations with such individuals usually contained cowardly individuals that would run away from the fighter forever and never get hit.

Since goal progression is based on average population scores, having individuals of each type in the population allowed these populations to be successful. A total of 15 out of the 19 successful heterogeneous runs feature aggressive rushing individuals, and 12 of these also feature cowardly individuals. Populations that had rushing individuals but not cowardly individuals were more successful at timing their attacks, and thus were able to succeed without ever running away. Of the two successful heterogeneous z-score trials, one exhibited baiting behavior, and the other exhibited a strategy in which monsters randomly wandered around until attacked, at which point they would attack the fighter until it died. These behaviors are multimodal, but they evolved in the z-score trials only twice. In the heterogeneous z-score trials that reached 100% fighter speed but failed to succeed, the populations featured a mix of aggressive rushing individuals and cowardly individuals. However, the aggressive individuals were too aggressive, and as a result received too much damage to achieve the goal for the Damage Received objective. In sum, heterogeneous teams of agents evolved in the Battle Domain with NSGA-II are more likely to evolve multimodal behavior successfully than teams evolved using the z-scores method. However, many runs only exhibited a simple mix of rushing and running away, and some trials did not succeed at all, which means there is still room for improvement.

Homogeneous Teams

The best behavior exhibited by homogeneous NSGA-II trials involved teamwork and precise timing between teammates. The attack begins with one team member rushing the fighter to get off the first attack, while the others wait. Once the first hit is delivered, the monster that delivered it retreats while the next monster going counter-clockwise around the fighter rushes in to attack. By the time the fighter notices the monster rushing in to attack it, it is too late to prevent being hit. The reason this strategy works is that attacking according to counter-clockwise position around the fighter means that the next attacker always approaches the fighter opposite the side from which it swings the bat. This taking-turns behavior (Fig. 8) is multimodal both because different members of a homogeneous team are either attacking or retreating at any given time, and because every individual in the team can be seen attacking at some times and retreating at others. Unfortunately, this behavior only evolved within two out of the 19 successful homogeneous NSGA-II trials. It did not evolve in any of the homogeneous z-score trials. The most common behavior evolved in the homogeneous NSGA-II trials was a mixture of aggressive rushing and cowardly retreating, as seen in the heterogeneous runs. A total of 16 out of the 19 successful homogeneous NSGA-II runs had individuals exhibiting the aggressive rush behavior. However, unlike the heterogeneous teams, the homogeneous teams had individuals that would switch between running away and rushing in depending on whether or not a teammate was already harassing the fighter. When one monster was attacking, others would back up to make room. Therefore, this behavior is multimodal. A similar mixture of rushing and running away was exhibited by seven out of the 14 successful homogeneous z-score runs.
This behavior was different from that of the NSGA-II trials in that individual team members were either exclusive rushers or exclusive runners. This behavior is technically multimodal, but it is not a particularly interesting multimodal behavior: Individuals simply either rush forward or run away.

Figure 8: Taking Turns Attacking: This example comes from one of the successful homogeneous NSGA-II trials. A monster attacks from the right, barely dodging the bat. The first attack is the riskiest. Afterwards the fighter is knocked backwards. The fighter will try to attack the monster that just hit it, but while it does this, the next monster going counter-clockwise around the fighter will rush in and catch the fighter unawares. This process continues in a counter-clockwise circle around the fighter. This behavior is multimodal because each teammate rushes in and retreats at the appropriate time to damage the fighter without being hit.

The other behavior present in both NSGA-II runs and z-score runs was a side-swiping attack behavior, but this behavior did not involve baiting and was not multimodal. All monsters would move in a fixed curved path that went away from the initial fighter position, but eventually circled back around. If the timing was right, then one of the monsters would end up delivering side-swipe attacks to the fighter until it died. If the timing was wrong, then one of the monsters would die, but then another one would catch up and start delivering the side-swipe attacks before the fighter could turn around to face it. The point here is that the monsters would not react differently even when under attack, and they died as a result. However, one monster sacrifice is still within the bounds of the goal for the Damage Received objective, so this strategy was successful in the end. Three out of the 19 successful homogeneous NSGA-II trials exhibited this behavior, as did seven out of the 14 successful z-score trials. Ultimately, the NSGA-II trials have a slightly stronger tendency towards multimodal behavior than the z-score trials. However, even the NSGA-II method resulted in several trials that did not finish at all, as well as some successful trials that did not exhibit multimodal behavior. There are two issues that need to be resolved: (1) make multiobjective evolution quicker and more reliable, and (2) encourage evolution to result in multimodal behaviors. The next set of experiments (Section 3.3) introduces a feature designed to help with the first issue, and the experiments after that (Section 3.4) are designed to deal with the second issue.

3.3 Targeting Unachieved Goals

The performance of multiobjective evolution in the Battle Domain described above can be greatly improved by dynamically managing which objectives are active at any given time. An objective is active if it is currently being used by the multiobjective selection process. Normally, all objectives


More information

Adversarial Reasoning: Sampling-Based Search with the UCT algorithm. Joint work with Raghuram Ramanujan and Ashish Sabharwal

Adversarial Reasoning: Sampling-Based Search with the UCT algorithm. Joint work with Raghuram Ramanujan and Ashish Sabharwal Adversarial Reasoning: Sampling-Based Search with the UCT algorithm Joint work with Raghuram Ramanujan and Ashish Sabharwal Upper Confidence bounds for Trees (UCT) n The UCT algorithm (Kocsis and Szepesvari,

More information

CMSC 671 Project Report- Google AI Challenge: Planet Wars

CMSC 671 Project Report- Google AI Challenge: Planet Wars 1. Introduction Purpose The purpose of the project is to apply relevant AI techniques learned during the course with a view to develop an intelligent game playing bot for the game of Planet Wars. Planet

More information

Monte Carlo based battleship agent

Monte Carlo based battleship agent Monte Carlo based battleship agent Written by: Omer Haber, 313302010; Dror Sharf, 315357319 Introduction The game of battleship is a guessing game for two players which has been around for almost a century.

More information

Genetic Programming of Autonomous Agents. Senior Project Proposal. Scott O'Dell. Advisors: Dr. Joel Schipper and Dr. Arnold Patton

Genetic Programming of Autonomous Agents. Senior Project Proposal. Scott O'Dell. Advisors: Dr. Joel Schipper and Dr. Arnold Patton Genetic Programming of Autonomous Agents Senior Project Proposal Scott O'Dell Advisors: Dr. Joel Schipper and Dr. Arnold Patton December 9, 2010 GPAA 1 Introduction to Genetic Programming Genetic programming

More information

USING A FUZZY LOGIC CONTROL SYSTEM FOR AN XPILOT COMBAT AGENT ANDREW HUBLEY AND GARY PARKER

USING A FUZZY LOGIC CONTROL SYSTEM FOR AN XPILOT COMBAT AGENT ANDREW HUBLEY AND GARY PARKER World Automation Congress 21 TSI Press. USING A FUZZY LOGIC CONTROL SYSTEM FOR AN XPILOT COMBAT AGENT ANDREW HUBLEY AND GARY PARKER Department of Computer Science Connecticut College New London, CT {ahubley,

More information

Fictitious Play applied on a simplified poker game

Fictitious Play applied on a simplified poker game Fictitious Play applied on a simplified poker game Ioannis Papadopoulos June 26, 2015 Abstract This paper investigates the application of fictitious play on a simplified 2-player poker game with the goal

More information

CS-E4800 Artificial Intelligence

CS-E4800 Artificial Intelligence CS-E4800 Artificial Intelligence Jussi Rintanen Department of Computer Science Aalto University March 9, 2017 Difficulties in Rational Collective Behavior Individual utility in conflict with collective

More information

EMERGENCE OF COMMUNICATION IN TEAMS OF EMBODIED AND SITUATED AGENTS

EMERGENCE OF COMMUNICATION IN TEAMS OF EMBODIED AND SITUATED AGENTS EMERGENCE OF COMMUNICATION IN TEAMS OF EMBODIED AND SITUATED AGENTS DAVIDE MAROCCO STEFANO NOLFI Institute of Cognitive Science and Technologies, CNR, Via San Martino della Battaglia 44, Rome, 00185, Italy

More information

Artificial Intelligence. Minimax and alpha-beta pruning

Artificial Intelligence. Minimax and alpha-beta pruning Artificial Intelligence Minimax and alpha-beta pruning In which we examine the problems that arise when we try to plan ahead to get the best result in a world that includes a hostile agent (other agent

More information

Optimization of Tile Sets for DNA Self- Assembly

Optimization of Tile Sets for DNA Self- Assembly Optimization of Tile Sets for DNA Self- Assembly Joel Gawarecki Department of Computer Science Simpson College Indianola, IA 50125 joel.gawarecki@my.simpson.edu Adam Smith Department of Computer Science

More information

Training a Back-Propagation Network with Temporal Difference Learning and a database for the board game Pente

Training a Back-Propagation Network with Temporal Difference Learning and a database for the board game Pente Training a Back-Propagation Network with Temporal Difference Learning and a database for the board game Pente Valentijn Muijrers 3275183 Valentijn.Muijrers@phil.uu.nl Supervisor: Gerard Vreeswijk 7,5 ECTS

More information

Game Design Verification using Reinforcement Learning

Game Design Verification using Reinforcement Learning Game Design Verification using Reinforcement Learning Eirini Ntoutsi Dimitris Kalles AHEAD Relationship Mediators S.A., 65 Othonos-Amalias St, 262 21 Patras, Greece and Department of Computer Engineering

More information

How to divide things fairly

How to divide things fairly MPRA Munich Personal RePEc Archive How to divide things fairly Steven Brams and D. Marc Kilgour and Christian Klamler New York University, Wilfrid Laurier University, University of Graz 6. September 2014

More information

FreeCiv Learner: A Machine Learning Project Utilizing Genetic Algorithms

FreeCiv Learner: A Machine Learning Project Utilizing Genetic Algorithms FreeCiv Learner: A Machine Learning Project Utilizing Genetic Algorithms Felix Arnold, Bryan Horvat, Albert Sacks Department of Computer Science Georgia Institute of Technology Atlanta, GA 30318 farnold3@gatech.edu

More information

ROBOCODE PROJECT AIBOT - MARKOV MODEL DRIVEN AIMING COMBINED WITH Q LEARNING FOR MOVEMENT

ROBOCODE PROJECT AIBOT - MARKOV MODEL DRIVEN AIMING COMBINED WITH Q LEARNING FOR MOVEMENT ROBOCODE PROJECT AIBOT - MARKOV MODEL DRIVEN AIMING COMBINED WITH Q LEARNING FOR MOVEMENT PATRICK HALUPTZOK, XU MIAO Abstract. In this paper the development of a robot controller for Robocode is discussed.

More information

CS510 \ Lecture Ariel Stolerman

CS510 \ Lecture Ariel Stolerman CS510 \ Lecture04 2012-10-15 1 Ariel Stolerman Administration Assignment 2: just a programming assignment. Midterm: posted by next week (5), will cover: o Lectures o Readings A midterm review sheet will

More information

Strategic and Tactical Reasoning with Waypoints Lars Lidén Valve Software

Strategic and Tactical Reasoning with Waypoints Lars Lidén Valve Software Strategic and Tactical Reasoning with Waypoints Lars Lidén Valve Software lars@valvesoftware.com For the behavior of computer controlled characters to become more sophisticated, efficient algorithms are

More information

CS188: Artificial Intelligence, Fall 2011 Written 2: Games and MDP s

CS188: Artificial Intelligence, Fall 2011 Written 2: Games and MDP s CS88: Artificial Intelligence, Fall 20 Written 2: Games and MDP s Due: 0/5 submitted electronically by :59pm (no slip days) Policy: Can be solved in groups (acknowledge collaborators) but must be written

More information

Applying Mechanism of Crowd in Evolutionary MAS for Multiobjective Optimisation

Applying Mechanism of Crowd in Evolutionary MAS for Multiobjective Optimisation Applying Mechanism of Crowd in Evolutionary MAS for Multiobjective Optimisation Marek Kisiel-Dorohinicki Λ Krzysztof Socha y Adam Gagatek z Abstract This work introduces a new evolutionary approach to

More information

Set 4: Game-Playing. ICS 271 Fall 2017 Kalev Kask

Set 4: Game-Playing. ICS 271 Fall 2017 Kalev Kask Set 4: Game-Playing ICS 271 Fall 2017 Kalev Kask Overview Computer programs that play 2-player games game-playing as search with the complication of an opponent General principles of game-playing and search

More information

CS 354R: Computer Game Technology

CS 354R: Computer Game Technology CS 354R: Computer Game Technology Introduction to Game AI Fall 2018 What does the A stand for? 2 What is AI? AI is the control of every non-human entity in a game The other cars in a car game The opponents

More information

THE WORLD video game market in 2002 was valued

THE WORLD video game market in 2002 was valued IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 9, NO. 6, DECEMBER 2005 653 Real-Time Neuroevolution in the NERO Video Game Kenneth O. Stanley, Bobby D. Bryant, Student Member, IEEE, and Risto Miikkulainen

More information

Monte Carlo Tree Search and AlphaGo. Suraj Nair, Peter Kundzicz, Kevin An, Vansh Kumar

Monte Carlo Tree Search and AlphaGo. Suraj Nair, Peter Kundzicz, Kevin An, Vansh Kumar Monte Carlo Tree Search and AlphaGo Suraj Nair, Peter Kundzicz, Kevin An, Vansh Kumar Zero-Sum Games and AI A player s utility gain or loss is exactly balanced by the combined gain or loss of opponents:

More information

Temporal-Difference Learning in Self-Play Training

Temporal-Difference Learning in Self-Play Training Temporal-Difference Learning in Self-Play Training Clifford Kotnik Jugal Kalita University of Colorado at Colorado Springs, Colorado Springs, Colorado 80918 CLKOTNIK@ATT.NET KALITA@EAS.UCCS.EDU Abstract

More information

The Dominance Tournament Method of Monitoring Progress in Coevolution

The Dominance Tournament Method of Monitoring Progress in Coevolution To appear in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2002) Workshop Program. San Francisco, CA: Morgan Kaufmann The Dominance Tournament Method of Monitoring Progress

More information

5.4 Imperfect, Real-Time Decisions

5.4 Imperfect, Real-Time Decisions 5.4 Imperfect, Real-Time Decisions Searching through the whole (pruned) game tree is too inefficient for any realistic game Moves must be made in a reasonable amount of time One has to cut off the generation

More information

LEARNABLE BUDDY: LEARNABLE SUPPORTIVE AI IN COMMERCIAL MMORPG

LEARNABLE BUDDY: LEARNABLE SUPPORTIVE AI IN COMMERCIAL MMORPG LEARNABLE BUDDY: LEARNABLE SUPPORTIVE AI IN COMMERCIAL MMORPG Theppatorn Rhujittawiwat and Vishnu Kotrajaras Department of Computer Engineering Chulalongkorn University, Bangkok, Thailand E-mail: g49trh@cp.eng.chula.ac.th,

More information

Applying Modern Reinforcement Learning to Play Video Games. Computer Science & Engineering Leung Man Ho Supervisor: Prof. LYU Rung Tsong Michael

Applying Modern Reinforcement Learning to Play Video Games. Computer Science & Engineering Leung Man Ho Supervisor: Prof. LYU Rung Tsong Michael Applying Modern Reinforcement Learning to Play Video Games Computer Science & Engineering Leung Man Ho Supervisor: Prof. LYU Rung Tsong Michael Outline Term 1 Review Term 2 Objectives Experiments & Results

More information

STRATEGO EXPERT SYSTEM SHELL

STRATEGO EXPERT SYSTEM SHELL STRATEGO EXPERT SYSTEM SHELL Casper Treijtel and Leon Rothkrantz Faculty of Information Technology and Systems Delft University of Technology Mekelweg 4 2628 CD Delft University of Technology E-mail: L.J.M.Rothkrantz@cs.tudelft.nl

More information

Overview Agents, environments, typical components

Overview Agents, environments, typical components Overview Agents, environments, typical components CSC752 Autonomous Robotic Systems Ubbo Visser Department of Computer Science University of Miami January 23, 2017 Outline 1 Autonomous robots 2 Agents

More information

Robot Architectures. Prof. Yanco , Fall 2011

Robot Architectures. Prof. Yanco , Fall 2011 Robot Architectures Prof. Holly Yanco 91.451 Fall 2011 Architectures, Slide 1 Three Types of Robot Architectures From Murphy 2000 Architectures, Slide 2 Hierarchical Organization is Horizontal From Murphy

More information

COOPERATIVE STRATEGY BASED ON ADAPTIVE Q- LEARNING FOR ROBOT SOCCER SYSTEMS

COOPERATIVE STRATEGY BASED ON ADAPTIVE Q- LEARNING FOR ROBOT SOCCER SYSTEMS COOPERATIVE STRATEGY BASED ON ADAPTIVE Q- LEARNING FOR ROBOT SOCCER SYSTEMS Soft Computing Alfonso Martínez del Hoyo Canterla 1 Table of contents 1. Introduction... 3 2. Cooperative strategy design...

More information

ECE 517: Reinforcement Learning in Artificial Intelligence

ECE 517: Reinforcement Learning in Artificial Intelligence ECE 517: Reinforcement Learning in Artificial Intelligence Lecture 17: Case Studies and Gradient Policy October 29, 2015 Dr. Itamar Arel College of Engineering Department of Electrical Engineering and

More information

BIEB 143 Spring 2018 Weeks 8-10 Game Theory Lab

BIEB 143 Spring 2018 Weeks 8-10 Game Theory Lab BIEB 143 Spring 2018 Weeks 8-10 Game Theory Lab Please read and follow this handout. Read a section or paragraph completely before proceeding to writing code. It is important that you understand exactly

More information

Hierarchical Controller Learning in a First-Person Shooter

Hierarchical Controller Learning in a First-Person Shooter Hierarchical Controller Learning in a First-Person Shooter Niels van Hoorn, Julian Togelius and Jürgen Schmidhuber Abstract We describe the architecture of a hierarchical learning-based controller for

More information

The Effects of Supervised Learning on Neuro-evolution in StarCraft

The Effects of Supervised Learning on Neuro-evolution in StarCraft The Effects of Supervised Learning on Neuro-evolution in StarCraft Tobias Laupsa Nilsen Master of Science in Computer Science Submission date: Januar 2013 Supervisor: Keith Downing, IDI Norwegian University

More information

A Note on General Adaptation in Populations of Painting Robots

A Note on General Adaptation in Populations of Painting Robots A Note on General Adaptation in Populations of Painting Robots Dan Ashlock Mathematics Department Iowa State University, Ames, Iowa 511 danwell@iastate.edu Elizabeth Blankenship Computer Science Department

More information

CS221 Project Final Report Gomoku Game Agent

CS221 Project Final Report Gomoku Game Agent CS221 Project Final Report Gomoku Game Agent Qiao Tan qtan@stanford.edu Xiaoti Hu xiaotihu@stanford.edu 1 Introduction Gomoku, also know as five-in-a-row, is a strategy board game which is traditionally

More information

Mutliplayer Snake AI

Mutliplayer Snake AI Mutliplayer Snake AI CS221 Project Final Report Felix CREVIER, Sebastien DUBOIS, Sebastien LEVY 12/16/2016 Abstract This project is focused on the implementation of AI strategies for a tailor-made game

More information

CS 229 Final Project: Using Reinforcement Learning to Play Othello

CS 229 Final Project: Using Reinforcement Learning to Play Othello CS 229 Final Project: Using Reinforcement Learning to Play Othello Kevin Fry Frank Zheng Xianming Li ID: kfry ID: fzheng ID: xmli 16 December 2016 Abstract We built an AI that learned to play Othello.

More information

An Empirical Evaluation of Policy Rollout for Clue

An Empirical Evaluation of Policy Rollout for Clue An Empirical Evaluation of Policy Rollout for Clue Eric Marshall Oregon State University M.S. Final Project marshaer@oregonstate.edu Adviser: Professor Alan Fern Abstract We model the popular board game

More information

Robot Architectures. Prof. Holly Yanco Spring 2014

Robot Architectures. Prof. Holly Yanco Spring 2014 Robot Architectures Prof. Holly Yanco 91.450 Spring 2014 Three Types of Robot Architectures From Murphy 2000 Hierarchical Organization is Horizontal From Murphy 2000 Horizontal Behaviors: Accomplish Steps

More information

Artificial Intelligence for Games

Artificial Intelligence for Games Artificial Intelligence for Games CSC404: Video Game Design Elias Adum Let s talk about AI Artificial Intelligence AI is the field of creating intelligent behaviour in machines. Intelligence understood

More information

Decision Making in Multiplayer Environments Application in Backgammon Variants

Decision Making in Multiplayer Environments Application in Backgammon Variants Decision Making in Multiplayer Environments Application in Backgammon Variants PhD Thesis by Nikolaos Papahristou AI researcher Department of Applied Informatics Thessaloniki, Greece Contributions Expert

More information

Evolving High-Dimensional, Adaptive Camera-Based Speed Sensors

Evolving High-Dimensional, Adaptive Camera-Based Speed Sensors In: M.H. Hamza (ed.), Proceedings of the 21st IASTED Conference on Applied Informatics, pp. 1278-128. Held February, 1-1, 2, Insbruck, Austria Evolving High-Dimensional, Adaptive Camera-Based Speed Sensors

More information

A NEW SIMULATION FRAMEWORK OF OPERATIONAL EFFECTIVENESS ANALYSIS FOR UNMANNED GROUND VEHICLE

A NEW SIMULATION FRAMEWORK OF OPERATIONAL EFFECTIVENESS ANALYSIS FOR UNMANNED GROUND VEHICLE A NEW SIMULATION FRAMEWORK OF OPERATIONAL EFFECTIVENESS ANALYSIS FOR UNMANNED GROUND VEHICLE 1 LEE JAEYEONG, 2 SHIN SUNWOO, 3 KIM CHONGMAN 1 Senior Research Fellow, Myongji University, 116, Myongji-ro,

More information

When placed on Towers, Player Marker L-Hexes show ownership of that Tower and indicate the Level of that Tower. At Level 1, orient the L-Hex

When placed on Towers, Player Marker L-Hexes show ownership of that Tower and indicate the Level of that Tower. At Level 1, orient the L-Hex Tower Defense Players: 1-4. Playtime: 60-90 Minutes (approximately 10 minutes per Wave). Recommended Age: 10+ Genre: Turn-based strategy. Resource management. Tile-based. Campaign scenarios. Sandbox mode.

More information

Reinforcement Learning Simulations and Robotics

Reinforcement Learning Simulations and Robotics Reinforcement Learning Simulations and Robotics Models Partially observable noise in sensors Policy search methods rather than value functionbased approaches Isolate key parameters by choosing an appropriate

More information

Reinforcement Learning in a Generalized Platform Game

Reinforcement Learning in a Generalized Platform Game Reinforcement Learning in a Generalized Platform Game Master s Thesis Artificial Intelligence Specialization Gaming Gijs Pannebakker Under supervision of Shimon Whiteson Universiteit van Amsterdam June

More information

Learning and Using Models of Kicking Motions for Legged Robots

Learning and Using Models of Kicking Motions for Legged Robots Learning and Using Models of Kicking Motions for Legged Robots Sonia Chernova and Manuela Veloso Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213 {soniac, mmv}@cs.cmu.edu Abstract

More information

Behavior Emergence in Autonomous Robot Control by Means of Feedforward and Recurrent Neural Networks

Behavior Emergence in Autonomous Robot Control by Means of Feedforward and Recurrent Neural Networks Behavior Emergence in Autonomous Robot Control by Means of Feedforward and Recurrent Neural Networks Stanislav Slušný, Petra Vidnerová, Roman Neruda Abstract We study the emergence of intelligent behavior

More information