Developing an agent for Dominion using modern AI-approaches


Written by:
Rasmus Bille Fynbo, CPR: ******-**** *****
Christian Sinding Nellemann, CPR: ******-**** *****

Supervised by:
Georgios N. Yannakakis
Miguel A. Sicart

IT-University of Copenhagen
M.Sc. IT, Media Technology and Games (MTG-T)
Center for Computer Games Research
Fall 2010


Acknowledgments

We would like to thank our supervisors, Georgios N. Yannakakis and Miguel Sicart, Associate Professors at the IT-University of Copenhagen, for their input and support during the process of this work. We would also like to thank the people who are still around after being neglected for the past months.


Abstract

The game Dominion provides an interesting problem in artificial intelligence due to its complexity, its dynamics, the lack of earlier work, and its popularity. In this work, an agent for a digital implementation of Dominion is successfully created by separating the decisions of the game into three parts, each of which is approached with different methods. The first and simplest of these parts is predicting how far the game has progressed. This estimation is provided by an Artificial Neural Network (ANN) trained using the well-known technique of backpropagation. The second part is the problem of deciding which cards to buy during game play, which is addressed by using an ANN co-evolved with NeuroEvolution of Augmenting Topologies (NEAT) to evaluate card values, given game state information and the progress estimation from the first part. The last part is that of playing one's action cards in an optimal order, given a hand drawn from the deck. For this problem, a hybrid approach inspired by other work in Monte-Carlo based search, combined with an estimation of leaf nodes provided by an ANN co-evolved by NEAT, is employed with some success. Together, the three parts constitute a strong player for Dominion, shown by its performance versus various hand-crafted players, programmed to use both well-known strategies and heuristics based on the intermediate-level knowledge of the authors.


Contents

1 Introduction
2 Related work
3 The game
   3.1 Dominion
   3.2 Assumptions and premises
      3.2.1 The cards
      3.2.2 The number of players
      3.2.3 Winning conditions and end game
   3.3 Approach
4 Method
   4.1 Artificial neural networks
   4.2 Backpropagation
   4.3 NEAT
   4.4 Competitive co-evolution
      4.4.1 Measuring evolutionary progress
      4.4.2 Sampling
      4.4.3 Population switches
   4.5 Monte-Carlo-based approaches
   4.6 Tools
5 Progress evaluation
   Method
      Representation
   Experiments
      Error function
      Producing training data
      Finite state machine player (FSMPlayer)
   Results
      Simple experiment using progress estimates
      Training
      Results based on FSMPlayer games
      Results based on BUNNPlayer games
   Conclusion
6 Buy phase card selection
   Representation
   Experiments
      Fitness function
      Set-up
      Comparison of population switches
   Results
      Performance versus heuristics
      Evolved gain strategies
      Strategic circularities
   Conclusion
7 Action phase
   Method
      State evaluation
   Experiments
      Early observations and changes
   Results
   Conclusion
8 Conclusion
   Findings
   Contributions
   Future work
Bibliography
A Heuristics
   A.1 Action phase
   A.2 Buy phase
   A.3 Card related decisions
B Glossary of terms
C Cards
   C.1 Treasure cards
   C.2 Victory cards
   C.3 Action cards

1 Introduction

Games have long been used as a testbed for research in artificial intelligence techniques. Traditional choices include Chess, Checkers, Go, and other classical board games (Van den Herik et al., 2002; Billings et al., 1998). As the performance of hardware and artificial intelligence techniques improves, some games which have previously proved good testing grounds for research may lose their usefulness in showing the merits of new approaches. Chess has mostly been abandoned by the research community (Schaeffer and Van den Herik, 2002); for Checkers, AI has already achieved world-champion-class play (Schaeffer et al., 2007); and while AI research is still struggling against humans in Go, recent years have seen promising breakthroughs in neuroevolutionary (Stanley and Miikkulainen, 2004) and Monte Carlo Tree Search (MCTS) based (Lee et al., 2009) approaches. Though some researchers have advocated that we move away from these games (Billings et al., 1998), more modern games are rarely explored. We discuss previous work on the application of machine learning to games in chapter 2.

Modern, mechanic-oriented games, colloquially referred to as Eurogames, with their strong focus on rules and balance, may provide the proper level of complexity required to prove the worth of advanced machine learning techniques. Dominion, with its numerous configurations and its wide range of feasible strategies, is such a game. Our interest in creating a strong agent for the game has been further strengthened as requests for a working agent to train against seem to be a recurring issue in the game's community. Furthermore, to the best of our knowledge, no research has been done on Dominion, which means that questions regarding the nature of the game's dynamics are largely unanswered. One such question is whether the game has strategic circularities. From a game design perspective, the question of the presence of strategic circularities is interesting for more reasons than those stated by Cliff and Miller (1995) or Stanley and Miikkulainen (2002b), as the absence of such circularities would mean that the game has an optimal strategy which, once it is found, would make the game far less interesting to play. We investigate whether Dominion is a game for which strategic circularities exist in chapter 6. A description and brief analysis of Dominion is given in section 3.1.

Because we needed to be able to simulate thousands of games in seconds, as well as have access to a comprehensible user interface, we decided to implement our own digital version of Dominion (following the full basic rule set) for this thesis.

As modern board games often offer wide selections of approaches to winning, they are particularly suitable for the application of co-evolutionary approaches to machine learning. Such co-evolutionary approaches can be combined with a variety of evolutionary techniques; one such technique is NeuroEvolution of Augmenting Topologies (NEAT), which is what we have opted for. NEAT has, to our knowledge, not been applied to modern board games, though other co-evolutionary works by Lubberts and Miikkulainen (2001), Rawal et al. (2010) and Stanley and Miikkulainen (2002a) may be seen as related. An introduction to the concepts and techniques used throughout this work is given in chapter 4.

The approach which we apply in creating an agent for Dominion is largely based on our analysis of the game, which has led us to split the creation of the agent into three parts. We believe that having a good measure of the progress of the game (i.e. how far the game has progressed or, conversely, how long it will last) is central to decision making; our solution for estimating progress in Dominion is described in chapter 5, where the solution is also quantitatively evaluated. In chapter 6, we detail our creation of various co-evolved candidate solutions for the problem of card gain selection in Dominion, which are evaluated quantitatively as well as qualitatively. Our results in chapter 5 suggest that our assumption about the importance of progress prediction is valid, which is further confirmed by the results given in chapter 6. These two parts have been the prime subjects of our efforts, as they are by far the most crucial to skillful play. We further describe the creation of solutions to the problem of selecting and playing action cards in chapter 7. A discussion of our results can be found in chapter 8.

2 Related work

It is generally accepted that games are useful in determining the worth of techniques in artificial intelligence and machine learning (Szita et al., 2009). For this reason, much research has been devoted to the creation of agents capable of playing games of varying difficulty, complexity and type. How this research relates to our work, and why we are confident ours stands out, is discussed in the following.

The first games to gain the attention of the AI community were perhaps the classical board games, such as Checkers, Chess and Go. These are all characterized by being zero-sum, two-player games with perfect information, and a good degree of success has been achieved in the creation of agents playing them (Van den Herik et al., 2002). This type of game differs from Dominion in almost every respect; in particular, while the four players of Dominion have a fair amount of knowledge about the game state, the inherent nondeterminism of Dominion makes it hard to utilize this information in a brute force search approach.

Backgammon has also been a popular testbed since Tesauro and Sejnowski (1989) introduced the use of Parallel Networks in the context of games. The approach changed from the use of expert knowledge based supervised learning to unsupervised Temporal Difference learning (TD-learning) with Tesauro (1995), and achieved a greater degree of success against expert human players. Backgammon, and other problems to which TD-learning has been successfully applied, has special properties pointed out by Pollack and Blair (1998), however, which we will briefly discuss. Certain dynamics separate Backgammon from other games; in particular, they argue that Backgammon has an inherent reversibility: "the outcome of the game continues to be uncertain until all contact is broken and one side has a clear advantage" (Pollack and Blair, 1998, p. 234). While there is a fair amount of uncertainty involved in the problem at hand, poor choices made early on will make learning from good ones made later very difficult, contrary to the case of Backgammon. This leads us to conclude that a solution along the lines of Tesauro (1995), that is, training the agent using Temporal Difference learning, is not feasible in our context.

Another game which has attracted the attention of AI research is Poker. Billings et al. (1998) argue that research in games in which brute force search is infeasible is vital to progress beyond that of algorithm optimization and more powerful hardware. We agree on this point, and in this respect our choice of subject fits perfectly. The combination of imperfect information and nondeterminism entails such a combinatorial explosion that older search approaches, such as minimax, as well as more recent ones, such as Monte Carlo Tree Search, are impractical with the computational power we have available, at least in their original forms. It is argued that certain properties of Poker, namely those of imperfect information, multiple competing agents, risk management, agent modeling, deception and unreliable information, are what makes innovations in Poker AI research applicable to more general problems in AI (Billings et al., 1998, 2002). Dominion shares some, but not all, of those properties with Poker. Indeed, while there are multiple competing agents, and we have the problem of imperfect knowledge, we believe that opponent modelling and deception (while central to Poker) are of marginal importance to successfully playing Dominion.

Moving into the realm of more modern games, Magic: The Gathering shares many similarities with our subject, especially when one takes its meta-game of deck building into account. Some work has been done on the application of Monte Carlo based search in this area (Ward and Cowling, 2009), but in our opinion the approach suffers from an excessive simplification of the problem domain. Decks are limited to a very specific set of cards; not only that, most card types are removed entirely, so that the remaining dynamics are nothing but a shadow of those in the original game. While this approach of simplification is not uncommon, we agree that, when attempting to apply machine learning to solve an interesting problem, "we must be careful not to remove the complex activities that we are interested in studying" (Billings et al., 1998, p. 229). While making an agent capable of playing with only a subset of the original game would greatly have cut down development time, we originally chose this field of study due to its complexity and its potential appeal to actual players of the game; both would be significantly reduced by making any useful reductions or rule changes. Fortunately, the problem of selecting cards to play is much simpler in Dominion, mainly due to the fact that the hands of the players are discarded at the end of the turn (see section 3.1). In that respect, risk management and deception matter little when selecting cards to play, since conserving a card for the next turn is not possible. Meanwhile, the order of play matters in the same sense it does in the reduced version of Magic: The Gathering used by Ward and Cowling (2009), and we might draw on some experience from the work done there.

In terms of design, Dominion has more in common with the 1996 board game Settlers of Catan, another game to which Monte Carlo Tree Search has been successfully applied (Szita et al., 2009). Though, strictly speaking, Dominion is not a board game, it shares many similarities with the family of eurogames, in that it favors strategy over luck, has no player elimination, and has a medium level of player interaction. This is yet another feature that makes it interesting for the application of modern AI. The branching factor of Settlers of Catan, however, is much smaller than that of our domain. Though there are dice involved in the distribution of resources the players gain each turn, the decisions that need to be made are limited to which improvements to buy (out of three different options), and where to place them. This is in contrast with the branching factor of Dominion, which explodes due to the large amount of imperfect information about the hand configurations of the opponents. Moreover, all imperfect information and direct player interaction (i.e. which development cards the opponents have drawn, and interaction through trading) was removed from the game by Szita et al. (2009). According to Szita et al. (2009) these changes have no significant impact on the game; we disagree with that statement. On the contrary, in our opinion amassing certain development cards to play them with a specific timing near the endgame is an efficient strategy.

Another approach to AI research is to devise a new game and use that as a testbed for experiments, as done by Stanley and Miikkulainen (2002a,b). For these studies, Stanley and Miikkulainen designed a testing ground where agents are equal and engaged in a duel, leading to an environment where agents must evolve strategies that outwit those of the opponents. In this sense, our domain is quite similar, but the sheer amount of decisions to be made leads to a much larger complexity than that of the robot duel. While evolution of artificial neural network weights and topologies (see sections 4.1 and 4.3) and related techniques have been applied in other game contexts such as Pong (Monroy et al., 2006), the NERO (NeuroEvolving Robotic Operatives) game (Stanley et al., 2005) and Go (Lubberts and Miikkulainen, 2001), we are not aware of research on their application to modern board games.

A frequently occurring issue in problems used for AI research is that of strategic circularities: cycling between strategies is possible if there is an intransitive dominance relationship between strategies (Cliff and Miller, 1995), or, as phrased by Stanley and Miikkulainen (2002b), a strategy from late in a run may defeat more generational champions than an earlier strategy, yet the later strategy may not be able to defeat the earlier strategy itself.

This is somewhat similar to the problem of forgetting: "Among the problems one often confronts when using a co-evolutionary algorithm is that of forgetting, where one or more previously acquired traits (i.e., components of behavior) are lost only to be needed later, and so must be re-learned" (Ficici and Pollack, 2003). Some research subjects are chosen (or even designed) because they are assumed to have these issues, for instance the aforementioned robot duel: "Because of the complex interaction between foraging, pursuit, and evasion behaviors, the domain allows for a broad range of strategies of varying sophistication" (Stanley and Miikkulainen, 2002b). While co-evolved solutions to problems which have strategies and counter-strategies will be at risk of forgetting, whether Dominion has strategic circularities remains to be seen.

3 The game

3.1 Dominion

Dominion is a card game for two to four players. Published in 2008 by Rio Grande Games, the game has won a series of awards, among others the prestigious Spiel des Jahres (Spiel des Jahres, 2009) and the Deutsche Spiele Preis [1]. To the best of our knowledge, no research papers are available on autonomous learning of a Dominion strategy, nor any on Dominion itself. We encourage our readers to have a look at the concise rules of the game, which are freely available on the Internet (see Vaccarino, 2008); we will, however, review the rules below.

The game is themed around building a dominion, which is represented by the player's deck. During game play the player will acquire additional cards and build a deck to match her strategy, while trying to amass enough victory cards to win the game. The cards available during a session (called the supply) consist of seven cards which are available in every game, as well as ten kingdom cards which are drawn randomly from a pool of 25 different cards before the start of the game. This gives the game more than 3.2 million different configurations. The remaining 15 kingdom cards (which are not in the supply) will not be used and are set aside.

The cards in Dominion are divided into four main categories, some of which can be played during the appropriate phase of a turn:

Victory cards: Needed at the end of the game to win, but weaken the deck during the game.
Curse cards: Like victory cards, but with negative victory point values.
Treasure cards: Played during the buy phase to get coins for purchasing more cards.
Action cards: Played during the action phase to gain a wide variety of different beneficial effects.

[1] That it won both is significant, since the choice of the latter is often seen as a critique of the choice the jury of Spiel des Jahres makes (Spotlight on Games, 2009).

Figure 1: A screenshot of the user interface of our implementation of Dominion. Cards tinted with green are those which can be played during the current phase. During the buy phase, the cards in the supply that the player can afford are also tinted green.

The players start by drawing seven Copper cards, the cheapest treasure card, and three Estate cards, the cheapest victory card, from the supply; these cards form the players' starting decks. The starting decks are shuffled and each player draws five cards. A predetermined starting player [2] plays a turn, after which the turn passes to the player to the left, until the game has reached one of its ending conditions. When playing a turn, a player goes through three phases:

Action Phase: During this phase the player has one action, which can be used to play one action card from her hand. The player may gain further actions from the action card played initially, which can in turn be spent on playing more action cards subsequently.

Buy Phase: During this phase the player can play treasure cards to gain coins, which are added to any coins gained by playing action cards during the action phase. These coins can be spent on buying the cards in the supply. The player has one buy available each turn, which means that she is allowed to buy one card that she can afford. Many action cards increase the number of buys that the player can make, which allows the player to split her coins over multiple purchases. The cards bought are put into the player's discard pile.

Clean-up Phase: In the clean-up phase the player discards all cards played in the previous two phases along with any cards left in the hand. The player then draws a new hand of five cards from the deck (should the deck run out of cards, the player shuffles her discard pile and places it as her new deck). The turn then passes to the player on her left.

[2] Chosen randomly for the first game, and chosen as the player to the left of the last game's winner for subsequent games.
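To make the turn structure above concrete, the following is a minimal, illustrative sketch in Java; the Player interface and its method names are hypothetical and do not correspond to the actual classes of our implementation.

```java
// Illustrative sketch of the three turn phases; the Player interface below is
// a hypothetical placeholder, not the actual interface of our implementation.
public class TurnLoop {
    interface Player {
        void playActionPhase();  // play action cards while actions remain
        void playBuyPhase();     // play treasure cards and spend buys on supply cards
        void playCleanUpPhase(); // discard played cards and hand, draw five new cards
    }

    /** One full turn: action phase, buy phase, clean-up phase, in that order. */
    static void playTurn(Player player) {
        player.playActionPhase();
        player.playBuyPhase();
        player.playCleanUpPhase();
    }
}
```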

Figure 2: Examples of the card types (Curse excluded): victory, treasure and action, respectively. The cost of a card can be seen in its lower left corner. The particular action card, Market, allows the player to draw one card and gives her one more action. Furthermore, it gives her an extra purchase in the buy phase, and one extra coin to spend.

The game ends if, at the end of any player's turn, either or both of two requirements have been met:

- The stack of Provinces (the most expensive victory card) is empty.
- Three of the stacks of cards in the supply are empty.

Once this happens, each player counts the number of victory points among her cards, and the player who has gathered the most is the winner. Should two players have the same number of victory points, the player who played the fewest turns is the winner. Should the players also be tied on the number of turns, the players share the victory.

Some of the action cards are attack cards as well; this means that they negatively influence opponents on top of any positive effects they might have for the player playing the card. These negative effects can be countered if the player who would be affected can reveal a reaction card (another subset of action cards) from her hand [3]. For a list of the cards and their rules, see Appendix C.

The shuffling of the decks makes Dominion a non-deterministic game. Players have limited information (the order of the cards in their own decks, as well as the contents of the opponents' hands, decks and parts of their discard piles, are all unknown to the player).

[3] The only reaction card in the basic Dominion game is Moat. More reaction cards have been introduced in expansion packs.
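As an illustration of the ending conditions and tie-breaking rules described above, the following Java sketch shows one way they could be checked; the GameState and PlayerScore types are hypothetical placeholders and not the actual API of our implementation.

```java
// Sketch of the end-of-game check and winner determination. GameState and
// PlayerScore are hypothetical placeholders, not our actual implementation.
import java.util.ArrayList;
import java.util.List;

public class EndOfGame {
    interface GameState {
        int provincesLeft();        // cards left in the Province stack
        int emptySupplyStacks();    // number of empty stacks in the supply
        List<PlayerScore> scores(); // victory points and turns played per player
    }

    record PlayerScore(String name, int victoryPoints, int turnsPlayed) {}

    /** The game ends when the Province stack or any three supply stacks are empty. */
    static boolean isGameOver(GameState state) {
        return state.provincesLeft() == 0 || state.emptySupplyStacks() >= 3;
    }

    /** Most victory points wins; ties are broken by fewest turns played.
     *  Players still tied after that share the victory (all are returned). */
    static List<PlayerScore> winners(GameState state) {
        PlayerScore best = null;
        for (PlayerScore p : state.scores()) {
            if (best == null
                    || p.victoryPoints() > best.victoryPoints()
                    || (p.victoryPoints() == best.victoryPoints()
                        && p.turnsPlayed() < best.turnsPlayed())) {
                best = p;
            }
        }
        List<PlayerScore> winners = new ArrayList<>();
        for (PlayerScore p : state.scores()) {
            if (p.victoryPoints() == best.victoryPoints()
                    && p.turnsPlayed() == best.turnsPlayed()) {
                winners.add(p);
            }
        }
        return winners;
    }
}
```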

The game is also a nonzero-sum game. Dominion does not necessarily terminate: should all the players decide not to buy any cards, the game will go on forever.

As argued by Pollack and Blair (1998), some games involving a degree of randomness have an inherent reversibility, i.e. the outcome of the game continues to be uncertain until some point. As players have non-trivial chances of winning or losing up until this point, the players can potentially learn from each move, which makes the domain particularly suited for co-evolution. Other domains will punish blunders early in the game so severely that learning from each move will be unlikely, as the game is already virtually decided. Dominion is likely closest to the latter category, as the options available to a player are determined by her own choices in earlier turns, as well as the choices made by opponents (if, for instance, the opponents use a lot of attack cards or buy all the cards in a specific category). The cards acquired early are likely to show up the most times during a game, so a bad decision made early will haunt the performance of the player's deck more than one made late in the game.

3.2 Assumptions and premises

Often, when attempting to apply machine learning techniques to create an agent for a game, the game is simplified to some degree. Many machine learning approaches to games which are actually played by human players simplify these games almost beyond recognition. We have attempted to make as few changes to the rules of Dominion as possible (if too many rule modifications are needed, one might argue that selecting a simpler or more appropriate game would have been prudent). Some changes to the rules have been made, though, which are described in the following.

3.2.1 The cards

Dominion has 32 different cards with a wide range of different effects on the game. It would no doubt be easier to train an AI to play well using a subset of these 32 cards. We did, however, consider it feasible to implement all 32 and to train the AI using the full set of cards, and as we consider a general solution more interesting than a solution to a single configuration, or to a smaller subset, we decided to implement all the cards.

A number of expansions and promotional cards have been published for Dominion. We have chosen not to include these in our experiments, as we did not have access to either the expansions or the promotional cards, and as implementing them would require a lot of work not directly related to the application of machine learning.

3.2.2 The number of players

The basic Dominion game is played by two to four players. We elected to always use four players for training and testing purposes. This decision was made because we did not want to introduce a changing number of players as another factor agents would have to take into account when training. We chose four players, rather than two or three, as having to deal with three opponents would likely be a more interesting problem to solve, and as we personally consider the four player game the most interesting to play.

3.2.3 Winning conditions and end game

As stated in section 3.1, should multiple of the highest scoring players have the same number of victory points and have gained these in the same number of turns, these players share the win. This turned out to be impractical when running batches of many games, as which player wins determines who will start the next game. In order to find a starting player in such a way that the advantage of starting would not be given to a particular position, we elected to randomly select a winner among the players who would otherwise be sharing the victory.

As mentioned in section 3.1, Dominion is not guaranteed to terminate. When it does not terminate (or when it takes a very long time to terminate because players are reluctant to buy cards), it is a sign that the four players participating are unskilled. Observing players in early evolution, we found that an unreasonable amount of time was being spent on evaluating players who were not skilled enough to actually buy cards. As we estimated that these games would not be likely to contain very important genetic material, we simply elected to end the game after 50 turns and tally the score as if the game had ended by either of the two conditions having been fulfilled.

The various agents will not attempt to analyze the outcome of making a move before they make it. This means that no prediction of whether emptying a certain stack of cards will terminate the game will take place, nor of whether this will make the player win or lose.

This is of course limiting for the agents' ability to play, but as it would be somewhat unrelated to the learning of Dominion strategies, we decided not to introduce special handling of the end game (section 8.3 contains a brief description of how this might be implemented).

3.3 Approach

The decisions needed for playing the three phases of a turn in Dominion are vastly different (see section 3.1 or Vaccarino (2008) for a more thorough description of the phases).

In the action phase the player must make decisions on which of the action cards in hand to play, and in which order. Some decisions here are fairly straightforward. For instance, if the player holds a Village card and a Smithy card, playing the Smithy card first would leave the player with three newly drawn cards, whereas playing the Village card first would have the player holding four newly drawn cards and one action left, which could again be used to play any action cards drawn. Other decisions are more difficult: would it be best to play the Militia card to get two coins and potentially hamper the hands of the opposing players, or should one rather forgo the chance to disrupt the opponents' hands and play a Smithy card to draw three cards, in the hope that these will provide more coins than the Militia card would?

The choices a player must make during the buy phase are central to the future performance of a player's deck. Which cards the player buys during the buy phase determines which cards she will be able to play in later action phases, how many coins she has available during the buy phase, how many victory points she will have at the end of the game, and even whether she will be able to defend herself against attack cards played by opposing players during their turns. It is quite obvious that the skill of a player in Dominion is highly dependent on the strategy employed when selecting cards to gain.

It would appear that, though both are required to achieve good play, the decisions made during the action phase and those made in the buy phase are widely different. For this reason we have elected to split the solution into two distinct parts, one which is tasked with handling the action phase and another which handles the selection of which cards to gain.

There are overlaps between the action and the buy phase: it is possible to gain cards during the action phase (for instance when playing a Feast or a Remodel).

If such a card gain involves a choice between gaining different cards, the choice will be made as if it were the buy phase.

The clean-up phase is mainly a question of administrating the end of the turn: cards which have been played during the turn and cards left in the player's hand have to be discarded, the player must draw a new hand, and so forth. Only one decision can actually be made during the clean-up phase: the order in which to discard the cards from the player's hand. The cards which have been played during the turn go into the discard pile first, and these will all have been seen by the opposing players. The cards in the hand have not, so the player has the option of hiding some of the cards in hand from the opponents by discarding another card as the very last [4]. To avoid taking opponent modeling (Schadd et al., 2007) into consideration to obtain what would be a marginal improvement at best, we have decided to let the choice of which card to discard last be made at random.

Dominion is designed so that the cards which influence whether or not the player is winning (victory cards and curse cards) are not useful during the actual playing of the game [5]. Therefore a human player, when faced with the option of gaining a card to add to her deck, would make a choice between the available cards based in part on an estimate of how far the current game has progressed. If she estimates that the game is currently in an early phase, she can expect that the bought card will show up in the hand a larger number of times than a card bought later in the game. As decks as a rule grow in the course of a game, a card gained during the early game will also constitute a larger percentage of the deck in the immediately following turns. Hence, buying a victory card early, while giving the player a head start in terms of victory points, will also greatly hamper the player's deck and reduce the chance the player will get of gaining more and higher value victory cards later in the game. Likewise, choosing an action or treasure card over a victory card in the last turns of a game would only rarely have a significant positive impact on the player's performance: the action or treasure card would not be likely to come into play, and even if it did so, no more than a few times; only in rare instances could it prove more beneficial towards the player's total number of victory points than the victory card would.

[4] It could for instance be advantageous to hide a Moat card, so that the other players will not know how many of your Moats are in the discard pile and in turn will not be able to discern whether you might have one in hand.
[5] They can be played during neither the action phase nor the buy phase, and with certain cards in play (Bureaucrat), having one in hand even constitutes a risk.

A problem with many of the basic strategies we have encountered [6] is that they often have a notion of early and late game without being able to specify what these vague terms correspond to in terms of the actual game state. Is it a late game state when there is only a specific percentage of the starting Provinces available, and is this to be considered more or less late game than when some of the supply stacks are nearly empty? It is easy to come up with experience-based heuristics in reply to questions like that, but the result will hardly be very general or optimal.

We believe that having a way of estimating the progress of a game of Dominion is essential to playing the game. To avoid vague heuristics based on expert knowledge, we decided to use machine learning for a separate part of the player, tasked solely with estimating a game's progress depending on the game state. We intend to use these estimates of the game's progress in the other two parts of the solution.

Some decisions are unrelated to buying or playing cards; the decisions of whether or not to reveal a Moat when faced with an attack card, or whether to put a particular card aside when playing a Library, are examples of such. These decisions have to a high degree been left to heuristics; see Appendix A for a complete list.

[6] See for instance Boardgamegeek (2008).

4 Method

In our attempt to create a powerful agent for Dominion we have used several techniques, most of which will likely be familiar to those who are experienced in machine learning. As a reference to those readers who are, and as an introduction to other readers, they are described in relative brevity in this chapter.

4.1 Artificial neural networks

The human brain consists of nerve cells (neurons). Each neuron is connected to a number of other neurons, and each neuron can transmit information to or receive information from the connected neurons. These complex networks are believed to be the foundation of thought (Russell and Norvig, 2003, pp. 10-12). In attempting to create artificial intelligence, one approach is to simulate the cells in the human brain, which was first done by McCulloch and Pitts (1943). The simulated neurons (often called nodes or units) receive a number of inputs through a number of connections (often also referred to as links). Each connection has a weight associated with it, so that the sum of the inputs to the node can be considered a weighted sum (Russell and Norvig, 2003). The output, $y$, of a node depends on this weighted sum as well as the node's activation function, $g$, and is written as

    y = g\left( \sum_{i=1}^{n} w_i x_i \right)    (4.1)

where $x_i$ and $w_i$ represent the $i$th input and weight respectively, and $n$ is the number of inputs to the node.

It is obvious that the activation function is important for the output. Among the most commonly used are the threshold function or step function, which outputs 1 if the weighted sum of the inputs is zero or higher and otherwise outputs zero, the linear activation function, and the logistic sigmoid-shaped activation function (henceforth sigmoid function):

    g(x) = \frac{1}{1 + e^{-Sx}}    (4.2)

with $S$ being the slope parameter (henceforth just referred to as slope) of the sigmoid activation function.
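As a minimal illustration of equations 4.1 and 4.2, the following Java sketch computes the output of a single node with a sigmoid activation; the weights, inputs and slope used in the example are arbitrary values chosen only for illustration.

```java
// Minimal sketch of a single artificial neuron (equations 4.1 and 4.2).
// The weights, inputs and slope below are illustrative values only.
public class SigmoidNode {
    private final double[] weights; // one weight per input connection
    private final double slope;     // S in equation 4.2

    public SigmoidNode(double[] weights, double slope) {
        this.weights = weights;
        this.slope = slope;
    }

    // Weighted sum of the inputs (the argument of g in equation 4.1).
    private double weightedSum(double[] inputs) {
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * inputs[i];
        }
        return sum;
    }

    // Logistic sigmoid activation (equation 4.2).
    private double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-slope * x));
    }

    public double output(double[] inputs) {
        return sigmoid(weightedSum(inputs));
    }

    public static void main(String[] args) {
        // Two inputs plus a constant bias input of 1.
        SigmoidNode node = new SigmoidNode(new double[] {0.4, -0.7, 0.1}, 1.0);
        System.out.println("node output: " + node.output(new double[] {0.5, 0.25, 1.0}));
    }
}
```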

The advantage of the step function and the sigmoid function is that a relatively small change in the weighted input sum can result in a relatively large change in the output in some situations (when the weighted input sum is close to the step of the threshold function or to the steepest slope of the sigmoid function), while in other situations (when the weighted input sum is not close to the step or the steep slope) the change in the output will be small. Some methods of learning require the activation function to be differentiable (see section 4.2), which puts the sigmoid and the linear activation functions at an advantage.

Many nodes have a constant input called the bias. As the bias does not change, the weight on the connection leading it into the node can be considered as defining the placement of the step of the threshold function or of the steepest part of the sigmoid function. The value of the bias input is often set to 1.

Such nodes connected to each other are called an artificial neural network (henceforth neural network). If there are no cyclic connections in the network it is called a feed-forward network, and if it has cyclic connections it is called a recurrent network. The nodes of feed-forward networks are often organized in layers, so that the (weighted) outputs from one layer are fed as inputs to the nodes of the next layer. A layer which is composed of neither input nodes nor output nodes is called a hidden layer, and the nodes in it are called hidden nodes.

If we consider a network composed of a single node (often called a perceptron) using a step function, this node will define a hyperplane dividing the search space (which has a dimensionality equal to the number of mutually independent inputs) in two: a part where the output of the perceptron is one, and another where it is zero. To see that this is a hyperplane, consider when the threshold function will activate: $\sum_{i=1}^{n} w_i x_i > 0$; this splits the space in two by the hyperplane $\sum_{i=1}^{n} w_i x_i = 0$. For problems that are not linearly separable one would need to arrange nodes in multiple layers. For a better, smoother function approximation, one might use the softer sigmoid activation functions.

4.2 Backpropagation

Backpropagation is a technique with which the weights of a neural network can be optimized in order to make the neural network deliver certain outputs for certain inputs, or more precisely: "The aim is to find a set of weights that ensure that for each input vector the output vector produced by the network is the same as (or sufficiently close to) the desired output vector" (Rumelhart et al., 1986).

The technique, which was first described by those authors, is outlined in the following (index letters have been changed for readability). For backpropagation to work, one must have training data: a set of inputs and the outputs relating to them. The neural network should also be a feed-forward network without recurrency. By presenting a neural network with this input data one can calculate an error:

    E = \frac{1}{2} \sum_{c} \sum_{j} (a_{j,c} - d_{j,c})^2    (4.3)

where $c$ is the index of the elements in the training set, $j$ is the index of output units, and $a_{j,c}$ and $d_{j,c}$ are the actual and desired output for this output unit and training case. Recall that the input to a unit, $q$, is a linear function of the outputs, $y_p$, of the units connected to it and the weights of those connections ($w_{pq}$). The input, $x_q$, can then be written as

    x_q = \sum_{p} y_p w_{pq}    (4.4)

and this, if using a sigmoid activation function, is routed into the activation function to produce an output for the unit (see equation 4.2), so the output is

    y_q = g(x_q) = \frac{1}{1 + e^{-S x_q}}    (4.5)

To minimize $E$ we can compute the partial derivatives of $E$ for each weight in the neural network, then use these derivatives to modify the values of the weights. For the output units this is straightforward: one simply differentiates equation 4.3, which (for a single input/output pair) yields

    \frac{\partial E}{\partial y_q} = y_q - d_q    (4.6)

Applying the chain rule to find $\frac{\partial E}{\partial x_q}$ we find

    \frac{\partial E}{\partial x_q} = \frac{\partial E}{\partial y_q} \cdot \frac{\partial y_q}{\partial x_q}    (4.7)

To find $\frac{\partial y_q}{\partial x_q}$ we differentiate equation 4.5 (considering $S = 1$ for simplicity), which (along with equation 4.7) gives us an expression for how to change the total input to the output unit in order to change the error:

    \frac{\partial E}{\partial x_q} = \frac{\partial E}{\partial y_q} \, y_q (1 - y_q)    (4.8)

This is needed for the computation of the derivative of the error with respect to a certain weight:

    \frac{\partial E}{\partial w_{pq}} = \frac{\partial E}{\partial x_q} \cdot \frac{\partial x_q}{\partial w_{pq}} = \frac{\partial E}{\partial x_q} \, y_p    (4.9)

So now, for an output node $q$, given the actual and desired output and the output of a previous unit $p$, we can compute the partial derivative of $E$ with respect to the weight of the connection from $p$ to $q$: $\frac{\partial E}{\partial w_{pq}}$. For a unit, $p$, connected to $q$, the error from the output unit can be propagated back:

    \frac{\partial E}{\partial y_p} = \frac{\partial E}{\partial x_q} \cdot \frac{\partial x_q}{\partial y_p} = \frac{\partial E}{\partial x_q} \, w_{pq}    (4.10)

If node $p$ is connected to multiple output units, this can be summed to

    \frac{\partial E}{\partial y_p} = \sum_{q} \frac{\partial E}{\partial x_q} \, w_{pq}    (4.11)

These steps can be repeated, propagating the error back through the neural network. Now we have an expression for $\frac{\partial E}{\partial w}$ for each weight in the network. These can be used to compute changes to the weights: $\Delta w = -\epsilon \frac{\partial E}{\partial w}$, $\epsilon$ being a small constant, the learning rate.
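The following Java sketch illustrates a single backpropagation step for one sigmoid output node, following equations 4.4-4.9 and the weight update rule above (with S = 1); the training pair and learning rate are arbitrary example values, and a full implementation would of course repeat this over all layers and training cases.

```java
// Minimal sketch of one backpropagation update for a single sigmoid output
// node. The inputs, weights, target and learning rate are illustrative only.
public class BackpropStep {
    public static void main(String[] args) {
        double[] inputs  = {0.5, 0.25, 1.0}; // y_p: outputs of the previous layer (last entry is a bias of 1)
        double[] weights = {0.4, -0.7, 0.1}; // w_pq: weights into the output node q
        double desired = 1.0;                // d_q
        double epsilon = 0.1;                // learning rate

        // Forward pass: x_q and y_q (equations 4.4 and 4.5, with S = 1).
        double xq = 0.0;
        for (int i = 0; i < inputs.length; i++) {
            xq += inputs[i] * weights[i];
        }
        double yq = 1.0 / (1.0 + Math.exp(-xq));

        // Backward pass: dE/dx_q = (y_q - d_q) * y_q * (1 - y_q)  (equations 4.6 and 4.8).
        double dEdx = (yq - desired) * yq * (1.0 - yq);

        // dE/dw_pq = dE/dx_q * y_p (equation 4.9); update w = w - epsilon * dE/dw.
        for (int i = 0; i < weights.length; i++) {
            weights[i] -= epsilon * dEdx * inputs[i];
        }

        // Recompute the output with the updated weights to see it move toward the target.
        double xq2 = 0.0;
        for (int i = 0; i < inputs.length; i++) {
            xq2 += inputs[i] * weights[i];
        }
        double after = 1.0 / (1.0 + Math.exp(-xq2));
        System.out.printf("output before: %.4f, after one update: %.4f (desired %.1f)%n",
                yq, after, desired);
    }
}
```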

4.3 NEAT

When NeuroEvolution of Augmenting Topologies (NEAT) was first introduced by Stanley and Miikkulainen (2002c), it was shown to outperform earlier techniques on the benchmark of double pole balancing (see Barto et al., 1983, for a description of the benchmark and its justification) by a large margin [1]. The factors behind this significant success were threefold, as was shown by the ablation studies following the initial benchmarks. First, instead of starting with a fixed topology and only evolving the connection weights, NEAT allowed the networks to complexify over the course of evolution.

[1] NEAT beat the previous record holders, Cellular Encoding and Enforced Subpopulations, by factors of 25 and 5, respectively (Stanley and Miikkulainen, 2002c).

Second, NEAT would start from a minimal topology consisting of only the input nodes with full connections to the output node. This, it was argued, allowed the search for optimal weights to be performed in a smaller number of dimensions, allowing for better weights to be found. Third, the concept of speciation was used to protect new innovations, allowing them to evolve their weights without becoming extinct too fast for the effectiveness of their topology to be tested properly. Earlier experiments in neuroevolution had used fixed topologies containing a varying number of hidden neurons, layers and connections. These parameters had to be decided upon, and experienced users of machine learning will most likely know and agree that this parameter selection can be time consuming and tedious, since it often ends up as a trial and error process.

All the aforementioned factors were strongly dependent on the novel concept of historical markings in combination with the genetic encoding employed by NEAT. The genes of NEAT can be split into node genes and connection genes. The first contain information on the input, hidden and output nodes, while the latter hold information on each connection, its weight, and the historical marking in the form of an innovation number.

Two kinds of mutations can happen in NEAT. The mutation of weights is the simplest of these, in that it occurs in much the same way as in other neuroevolutionary systems: with a certain probability, the connection weight will either be perturbed by a small random amount, or it will be set to a new random value in the interval allowed for weights. The more complex mutation is that of the topology of the neural network, which may occur in two different forms: the addition of a new connection or of a new node. The first will select two random neurons that thus far have been unconnected, and add a connection between them. This new connection gene will receive an innovation number which is simply the previously highest innovation number incremented by one. The latter will select a random connection to be split into two connections, with a new node between them. These new connections will each receive a new innovation number, in the same way as with the add-connection mutation. As these innovation numbers are inherited, NEAT always keeps track of which connection genes stem from the same topological mutation.

When genomes are to be crossed over during mating, the innovation numbers in the connection genes are used to line up those genes that share historical origin. Crossover in the form of random selection from either parent is applied for all the genes that match up, while those that do not match up are selected from the more fit parent.
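The genetic encoding described above can be illustrated with a small sketch of a connection gene; the field names and the global innovation counter below are illustrative and not taken from any particular NEAT implementation.

```java
// Minimal sketch of NEAT's connection-gene encoding with historical markings.
// Names and structure are illustrative, not a specific NEAT library's API.
public class ConnectionGene {
    private static int innovationCounter = 0;

    final int inNode;      // id of the source node
    final int outNode;     // id of the target node
    double weight;         // connection weight, mutated during evolution
    final int innovation;  // historical marking used to line up genes in crossover

    ConnectionGene(int inNode, int outNode, double weight) {
        this.inNode = inNode;
        this.outNode = outNode;
        this.weight = weight;
        // Each new structural mutation receives the next innovation number.
        this.innovation = ++innovationCounter;
    }

    public static void main(String[] args) {
        ConnectionGene a = new ConnectionGene(1, 3, 0.5);
        ConnectionGene b = new ConnectionGene(2, 3, -0.2);
        System.out.println("innovation numbers: " + a.innovation + ", " + b.innovation);
    }
}
```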

The historical markings also provide the basis for speciation, which NEAT employs to ensure diversity in the population. When comparing genomes, some genes will inevitably not match up. These are either disjoint, when their innovation numbers are within the range of those of the other parent, or excess if they are outside that same range. The counts of disjoint and excess genes are used to measure the distance between two genomes when deciding whether they belong to the same species. The distance δ between two such genomes is

    \delta = \frac{c_1 E}{N} + \frac{c_2 D}{N} + c_3 \overline{W}    (4.12)

where $E$ and $D$ are the numbers of excess and disjoint genes, respectively, and $\overline{W}$ is the average distance between the weights of the matching genes. Three constants, $c_1$, $c_2$ and $c_3$, are selected by the user to specify the importance of each of the distance measures. $N$ is the size of the larger of the two genomes being compared, effectively normalizing δ by the genome sizes.

When species are decided upon, each genome after the first (which automatically ends up in the first species) is compared to a random member of each species until the distance between the two is below a threshold selected by the user; the genome is assigned to the first species it is sufficiently close to, or to a new species if it is too distant from all the other species. Once this has been done for all the genomes, they have their fitness for selection purposes adjusted relative to the size of their species, in that each of their individual fitnesses is divided by the number of members of their specific species. This is also known as explicit fitness sharing where, as phrased by Rosin and Belew (1995), "an individual's fitness is divided by the sum of its similarities with each other individual in the population, rewarding unusual individuals."

Each generation, when genomes are selected for survival due to elitism, the speciated fitness will ensure that a diverse population is maintained, by eliminating the less fit members of each species first. When topological mutations occur, chances are that the newly added structure will cause a drop in the fitness of the newly created genome, compared to previous, smaller ones (Stanley, 2004, section 2.5). New genomes with an innovative topology are protected by their small species sizes, and the less fit genomes in larger species will be eliminated first. Therefore innovative mutations will have the time needed for their connection weights to be optimized, and the fitness of their topology to be fairly tested.
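A direct transcription of equation 4.12 might look as follows; the coefficients and gene counts used in the example are arbitrary illustration values, not parameters used in our experiments.

```java
// Minimal sketch of the NEAT compatibility distance of equation 4.12.
// The values in main are illustrative only.
public class CompatibilityDistance {
    // E: excess genes, D: disjoint genes, W: average weight difference of
    // matching genes, N: size of the larger genome.
    static double delta(int excess, int disjoint, double avgWeightDiff,
                        int largerGenomeSize, double c1, double c2, double c3) {
        double n = Math.max(1, largerGenomeSize); // guard against division by zero
        return c1 * excess / n + c2 * disjoint / n + c3 * avgWeightDiff;
    }

    public static void main(String[] args) {
        double d = delta(2, 3, 0.5, 12, 1.0, 1.0, 0.4);
        System.out.println("compatibility distance: " + d);
    }
}
```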

4.4 Competitive co-evolution

Co-evolution in genetic algorithms was first suggested by Hillis (1990). It is a technique by which multiple genetically distinct populations with a shared fitness landscape evolve simultaneously (Rosin and Belew, 1995) [2]. The term competitive signifies that the individuals in the populations being co-evolved have their fitness based on direct competition with individuals of the other population. The advantage of co-evolution is that it creates an arms race in which one population (traditionally called the host population) will evolve in order to surpass test cases from the other population (called the parasite population). The roles are then switched, so that the population that was previously the parasite becomes the host and vice versa. Competitive co-evolution has been shown to achieve better results than standard evolution, to achieve these results over fewer generations, and to be less likely to get stuck at local optima (Lubberts and Miikkulainen, 2001; Rawal et al., 2010).

4.4.1 Measuring evolutionary progress

The shared fitness landscape of competitive co-evolution makes it difficult to measure evolutionary progress (Cliff and Miller, 1995). Comparing solutions across generations is no longer straightforward, as the fitnesses of these solutions are relative to the test cases chosen from the other population. With an exogenous fitness, a ranking of two agents could be performed by merely comparing their fitness scores, but with relative fitnesses this is not an option. Furthermore, for complex games like Dominion, the evolution might get stuck in a strategy loop, in which a population will lose the ability to exploit a certain weakness if this particular weakness no longer occurs in the opposing population (Cliff and Miller, 1995; Monroy et al., 2006). The opposing population might then develop the same weakness, which would force the first population to re-evolve the ability to exploit it and once again weed out the opposing players having this weakness. To an observer watching the champions getting replaced by new champions, this circularity might look like evolutionary progress.

[2] Co-evolution is also sometimes attributed to Axelrod (1987), but while that work is competitive evolution with a shared fitness landscape, it only utilizes one population, which sets it apart from what we mean by competitive co-evolution.

A method of monitoring progress that has often been used is the hall of fame, as suggested by Rosin (1997). The hall of fame stores the best performing individual from each generation in order to use them to test members of populations in later generations. The members of the host population will be tested not only against members of the parasite population but also against some selection of the members of the hall of fame [3].

If we wish merely to monitor the progress of the evolution, a master tournament as suggested by Floreano and Nolfi (1997) can be used. A master tournament simply tests the best performing individual of each generation against the best performing individuals of all generations. With a large number of generations, and a large number of games needed to accurately detect differences between these "master" individuals, such a tournament might take quite a while. Stanley and Miikkulainen (2002b) further argue that a master tournament does not indicate whether the aforementioned arms race takes place: it merely shows whether a champion can defeat other champions, but does not show strategies defeating other strategies, and is therefore not capable of showing whether the evolution has gotten stuck in a loop.

Instead, Stanley and Miikkulainen (2002b) suggest a dominance tournament, in which each generational champion is tested against all previous dominant strategies. The first generational champion becomes the first dominant strategy outright. If a generational champion later manages to beat every previous dominant strategy, it becomes a dominant strategy itself, and future dominant strategies must be capable of beating it. To limit the number of games needed, the candidate plays the opponents in the order last to first, that is, hardest to easiest, and the tournament terminates if the candidate fails to beat any opponent. As a generational champion will only be required to play against a subset of the previous champions, and as this tournament can often be terminated early, a dominance tournament is much less computationally expensive than a master tournament. It is also a handy tool for tracking evolutionary progress, as any strategy added to the collection of dominant strategies will be known to be capable of beating each previous dominant strategy.

[3] Rosin (1997) suggests random sampling if re-evaluating the fitnesses of the members of the hall of fame would be time consuming, which would likely be the case for Dominion agents.
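The dominance tournament just described can be sketched as follows; the Strategy interface and its beats() test are hypothetical stand-ins for however two candidates would actually be compared (for instance a batch of Dominion games), and are not part of our actual implementation.

```java
// Minimal sketch of a dominance tournament. Strategy and beats() are
// hypothetical placeholders for the real comparison between agents.
import java.util.ArrayList;
import java.util.List;

public class DominanceTournament {
    interface Strategy {
        /** True if this strategy beats the given opponent. */
        boolean beats(Strategy opponent);
    }

    private final List<Strategy> dominant = new ArrayList<>();

    /** Offer a generational champion; it is added only if it beats every
     *  previous dominant strategy, tested from newest (hardest) to oldest. */
    public boolean submit(Strategy champion) {
        for (int i = dominant.size() - 1; i >= 0; i--) {
            if (!champion.beats(dominant.get(i))) {
                return false; // terminate early on the first loss
            }
        }
        dominant.add(champion); // the first champion submitted is added outright
        return true;
    }
}
```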

4.4.2 Sampling

Having every member of one population play against every member of the opposing population in order to measure fitness values is computationally expensive. As argued by Rosin (1997): "It is desirable to reduce the computational effort expended in competition by testing each individual in the host population against only a limited sample of parasites from the other population." The questions remain which opponents to pick from the parasite population, and how to pair them up with the members of the host population. As the competitions are held to find a relative fitness hierarchy of the members of the host population, each host must compete against the same selection of opponents; if playing against different samples of the parasite population, a host might get an unrealistically high fitness from playing against comparatively weak opponents.

Let us address the question of how to play these games before getting into whom to play them against. Many games used as benchmarks for competitive co-evolution are two-player games, and it would appear that not much research has been done on how to deal with games with higher numbers of players, so a brief discussion is in order. One could potentially match one host against one parasite in a series of games; as we are not dealing with fewer than four players (see section 3.2), this could be done by including two instances of the host player and two instances of the parasite player in the same game. While this would ensure an even match between the two, more similar series of games would be required to make sure that the hosts' fitness values are based on a diverse selection of opponents. Another option would be matching two instances of the same player against two different members of the parasite population. This would have better diversity, but it might favor strategies based on ending the game quickly, as the two identical players would be able to support each other effectively by buying the same types of cards and thereby closing the game faster. This could be avoided by playing two members of the host population against two members of the parasite population. It would not mean that we could evaluate the fitness of twice as many hosts per game, though, as we still need the hosts to have played against the same opponents in order to create a fitness hierarchy: one of the members of the host population would need to be present in all games. Furthermore, this member of the host population would also need to be tested; as it would need to play against the same players as each of the other members of the host population, it would end up playing against itself, which again would favor strategies based on closing the game fast.

This would also change the nature of the fitness landscape of competitive co-evolution, the consequences of which are difficult to foresee. Instead, we opted for matching one member of the host population against three different members of the parasite population. While we are only testing one member of the host population at any given time, this matching can be used to test against diverse opponents while not favoring any particular strategies.

Now it remains to find the members of the parasite population that each member of the host population should be competing against. One option would be picking random players. This would sometimes result in the hosts being faced with relatively weak competition if only unskilled members of the parasite population are chosen, which would mean a loss of the arms race edge of competitive co-evolution (Monroy et al., 2006). As stated by Stanley and Miikkulainen (2002a): "The parasites are chosen for their quality and diversity, making host/parasite evolution more efficient and more reliable than random or round robin tournament." In order to get the highest level of quality we could use the champions, and to get a diverse sampling we could pick members of different species. Combining these, we use the champions of the three best performing species of parasites. Should fewer than three species occur in the parasite population, the best performing champion is used to fill in any gaps; this situation will only rarely arise when adjusting speciation parameters on the fly in order to make sure the populations have a certain number of species (an addition to the original NEAT, see Stanley, 2010).

4.4.3 Population switches

As mentioned, competitive co-evolution is designed to sustain an arms race between the two populations: the idea is that if both populations evolve, they will be fairly evenly matched and a good choice for evaluating the fitness of the opposition. The question of when to evaluate (and evolve) one or the other population is largely unexplored, though. Stanley and Miikkulainen (2002a) state that "in each generation, each population is evaluated against an intelligently chosen sample of networks from the other population", i.e., both populations are subject to evolution in each generation. Many other texts on co-evolution do not touch upon the topic of population switches (Rosin and Belew, 1995; Floreano and Nolfi, 1997; Rawal et al., 2010, just to name a few), and one must assume that they are using the standard version of evolving: evolving on both populations in each generation.

This method of switching populations seems somewhat in conflict with the arms-race purpose of competitive co-evolution. In order to force either population to improve, one would assume it would be best if it is faced with a population of slightly higher skill. If one population is more skilled than the other, it must necessarily be playing against a population less skilled than itself when standard competitive co-evolution lets both populations be evaluated by playing against each other. With proper sampling (see section 4.4.2) the individual competitions might not be played against less skilled players, but the question remains: is this evolution worth the computational effort spent on it? As approximately half the computational effort goes into training populations which are already doing better than their counterparts, one could reach twice the number of generations if only evolving on the least skilled population. This might mean that when only the worst performing population is evolved, the arms-race ability of competitive co-evolution has a larger impact, as it is present in every generation. One might, however, also imagine that an unreasonable amount of time gets used on evolving an inferior population while the better population is not evolving, which could mean that the solutions found would be less skilled. A third option would be that the method chosen for population switches is unimportant, so that the skill of the evolved individuals will not be influenced by this.

The unforeseeable effects of making changes to the population switches of NEAT, as well as the apparent lack of research done on the subject, made the question of different methods of population switches interesting to us. We intend to use Dominion as a testbed for comparing these two approaches to population switches in competitive co-evolution, which we call switch by skill (by which we mean only evolving on the population that has the lowest skill) and switch by generation (evolving on both populations every generation); see section 6.2.3.
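To make the two schemes concrete, the following sketch outlines one way the evolutionary loop could be organized under each of them. The Population type and its methods are hypothetical placeholders, not part of NEAT or ANJI, and the criterion used to decide when the host has overtaken the parasite population is the champion-versus-champion test detailed in section 6.2.3.

```java
import java.util.List;

/** Sketch only: Population and beats are hypothetical placeholders standing in
 *  for the actual NEAT/ANJI machinery. */
abstract class PopulationSwitchSketch {

    interface Population {
        void evaluateAgainst(List<Object> parasiteSample); // play games vs. sample, assign fitness
        void evolve();                                     // one evolutionary iteration (NEAT)
        List<Object> sampleParasites();                    // champions of the three best species
        Object champion();
    }

    /** Champion-vs-champion test of section 6.2.3: does a win at least half of a
     *  series of two-vs-two games against b? */
    abstract boolean beats(Object a, Object b);

    /** Standard co-evolution: both populations are evaluated and evolved every generation. */
    void switchByGeneration(Population a, Population b, int generations) {
        for (int g = 0; g < generations; g++) {
            a.evaluateAgainst(b.sampleParasites());
            b.evaluateAgainst(a.sampleParasites());
            a.evolve();
            b.evolve();
        }
    }

    /** Only the weaker (host) population evolves; roles swap when it overtakes the parasites. */
    void switchBySkill(Population host, Population parasite, int iterations) {
        for (int i = 0; i < iterations; i++) {
            host.evaluateAgainst(parasite.sampleParasites());
            host.evolve();
            if (beats(host.champion(), parasite.champion())) {
                Population tmp = host;
                host = parasite;
                parasite = tmp;
            }
        }
    }
}
```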

4.5 monte-carlo-based approaches

Since some Monte-Carlo approaches have served as inspiration during the course of our studies, we give a brief description of those in the following.

Monte-Carlo Tree Search was presented in 2006 by Chaslot et al. (2008), and has shown great promise in computer Go (Lee et al., 2009). The algorithm builds a tree of possible states from the current one, and ranks them according to how beneficial they are judged to be. The tree starts out with only one node, corresponding to the current state. A tree of possible subsequent states is then built by repeating the following mechanisms a number of times, as described by Chaslot et al. (2008):

Selection: If the game state is already in the tree, the inferred statistics about future states are used to select the next action.

Expansion: When a state is reached that does not already have a corresponding node in the tree, a new node is added.

Simulation: Subsequent possible actions are then selected randomly or by heuristics until an end state is reached.

Backpropagation: Each tree node visited is updated (i.e. the chance of winning is re-estimated) using the value found in simulation.

After the algorithm has run the desired number of simulations, the action corresponding to the node with the highest value is selected.

Another variant of Monte-Carlo-based search is the one using UCB1 (Auer et al., 2002) employed by Ward and Cowling (2009). Instead of building a tree like in MCTS, it relies on random simulation of possible actions after each of the actions directly available from the current state. The action to be simulated is selected based on the average rewards of actions found by earlier simulations (the algorithm is initialized by simulating each action once), plus a value calculated from the number of times the actions have been simulated. That is, the action j that maximizes

\bar{x}_j + C \sqrt{\frac{\ln n}{n_j}}    (4.13)

is simulated, where \bar{x}_j is the average reward of previous simulations of the action, C is a constant in the range [0, 1] controlling exploration/exploitation (with larger values corresponding to more uniform search), n_j is the number of times the action has been simulated, and n is the total number of simulations that have been performed.
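As an illustration of equation (4.13), a minimal sketch of the selection step could look as follows; the reward bookkeeping (totalReward, simulations) and the way the constant C is passed in are assumptions made for the example, not the thesis' actual implementation.

```java
/** Sketch of UCB1 action selection (equation 4.13). totalReward[j] and
 *  simulations[j] are assumed to be updated after every simulation of action j. */
final class Ucb1Sketch {
    static int selectAction(double[] totalReward, int[] simulations, double c) {
        int n = 0;
        for (int s : simulations) {
            n += s;
        }
        int best = 0;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (int j = 0; j < simulations.length; j++) {
            if (simulations[j] == 0) {
                return j; // initialization: every action is simulated once first
            }
            double average = totalReward[j] / simulations[j];
            double value = average + c * Math.sqrt(Math.log(n) / simulations[j]);
            if (value > bestValue) {
                bestValue = value;
                best = j;
            }
        }
        return best;
    }
}
```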

4.6 tools

A working digital version of Dominion was implemented in Java. The interface was created using Slick, a software package for creating 2D graphics in Java (Glass et al., 2010). Training of neural networks using backpropagation was done using a modified version of Neuroph, which is a neural network framework for Java (Sevarac et al., 2010). The training of neural networks using NEAT was done in a modified version of ANJI (Another NEAT Java Implementation, by James and Tucker, 2010).


5 progress evaluation

As argued in section 3.3, an estimate of how long the game will last is important for many of the decisions that a player needs to make in order to play Dominion well. This section describes the choices and observations made in creating progress evaluation for Dominion.

5.1 method

What is really needed is a continuous estimate of game progress, as opposed to discrete notions of early, middle and late game. This can be achieved in a number of ways. One might do a simple weighted sum of parameters deemed relevant to the estimation of game progress, or the same parameters might be used as input for a neural network approximator. The decision to use a neural network for this part of the AI seems an obvious one: neural networks have shown their worth countless times [1] in providing approximations of complex functions of a number of numerical values, based on properly preprocessed inputs. As the game's turn progress can be seen as a real number within [0, 1] (turn/maxturn), and all the relevant input can be normalized to that same interval, plugging things into the neural network, as well as putting the output to use, is an easy task.

The question of how to train the neural network now remains. One approach could be evolving a neural network using NeuroEvolution of Augmenting Topologies (NEAT). NEAT is at an advantage for problems where a complex topology is needed in order to get a good solution. It can, however, be quite cumbersome, with many parameters to set and very few guidelines as to how to set them. As NEAT is a randomized approach, a significant amount of computational effort will also be spent on evaluating solutions of lower fitness than previous ones. As we have a desired output (the value of turn/maxturn, which is stored when training data is produced) as well as an actual output, using backpropagation would also be an applicable way of training the neural network. Backpropagation is fast and reliable, but can be hampered by local optima.

[1] Among the most well-known within the AI community is Tesauro (1995).

Because we do not consider it likely that a complex topology is needed, and do not consider progress prediction in Dominion to be a problem which would likely have many local optima, we have opted for using backpropagation in our training of the neural network. For either of these techniques to work, decisions have to be made regarding which parts of the game state are relevant. The choices we have made are discussed in the following.

5.1.1 Representation

As mentioned in section 3.1, a game of Dominion can end in one of two ways, which we recall to be:

- Three of the supply stacks are empty at the end of any player's turn
- The stack of provinces is empty at the end of any player's turn

This suggests that the number of provinces left is important to the estimate of the game's progress, as well as the number of cards in the three supply stacks with the fewest cards left in them. Such an approach is not particularly fine-grained, though: for instance, it would make a situation with two empty and one nearly empty stack indistinguishable from a situation with two empty and two nearly empty stacks. From the player's point of view these situations could be very different, as the latter might offer an opportunity for closing the game more cheaply or with a higher victory point gain than the first one. Therefore we decided to include the number of cards in the four lowest stacks in our initial selection of the inputs to the neural network.

The number of provinces left and the nearly empty stacks might not be sufficient information about the game to predict when it will end: in some games the selection of kingdom cards means that the players will get enough money to buy the more expensive cards early, while in others, scraping together enough coins to get the right cards can be difficult. Also, in some games the selection of cheap cards may be large, while in other games these might be rare (this influences whether players can end the game with relatively weak economies or not). Therefore, the costs of the cards in the four smallest stacks are meaningful to include as a part of the game state useful for predicting progress.

This leaves the amount of coins the players are able to spend, which, like the cost of the cards, influences how fast a game can be closed.

This could be represented in a number of ways, for instance the average ratio of coins per card in a player's deck, or the number of coins the player used during the previous buy phases. These approaches both suffer from an inability to recognize certain strategies, though: for instance, if a player forms her deck around gaining cards costing four coins by playing Workshop cards, and then using Remodel cards to turn these into cards costing six coins and later into cards costing eight (i.e. Provinces), the potential of her strategy will not be recognized by examining the cards gained during the buy phase or the average coin value of her deck. Therefore, using information about the gains made during a turn (during both the action and buy phases) will give a more accurate picture of the card-gaining potential of a deck than merely looking at the treasure cards in the deck or the buys made. As we estimated that remembering only one turn could give very unreliable results (as the opposing players might have drawn particularly weak or strong subsets of their decks), we elected to measure the average of the best gains over a number of turns. We average this over three turns; four or five would give us a smoother curve of buy averages over turns (which would also mean that the impact of players' particularly lucky or unlucky turns would be diminished), but it would also be less sensitive to changes in the buy/gain power of decks. Comparing agents using averages over different numbers of turns might reveal which choice is better. This, however, is beyond the scope of this thesis.

So, to summarize, these are the inputs we have chosen for our neural network (a sketch of how they could be assembled is shown after the list):

Lowest stacks: The amount of cards left in the supply stacks closest to being empty. This input appears four times, once for each of the four stacks with the fewest cards in them. These inputs are normalized through division by the total number of cards of the given type at the start of the game.

Lowest stack costs: The costs of the cards in the stacks closest to being empty. Like the lowest stacks, this input appears four times. The lowest stack costs are normalized through division by eight, as this is the cost of the most expensive card in the game.

Gain average: The average gains of the players, averaged over three turns. This value is normalized by eight as well.

Provinces: The number of provinces left in supply, divided by the total number of provinces in the game [2] for normalization.

Bias: A constant input of 1.

[2] 12 in our case, as we play only four player games.
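As an illustration only, the eleven inputs could be assembled as sketched below; the GameState accessors are hypothetical names rather than those of our implementation, and the normalization constants follow the list above (initial stack sizes, a maximum card cost of eight, and the twelve provinces of a four player game).

```java
/** Sketch: the GameState accessors are hypothetical placeholders, not an actual API. */
interface GameState {
    int[] fourLowestStackSizes();        // sizes of the four smallest supply stacks
    int[] initialSizesOfLowestStacks();  // their sizes at the start of the game
    int[] costsOfLowestStacks();         // cost of the card in each of those stacks
    double bestGainAverageOverThreeTurns();
    int provincesLeft();

    /** Builds the eleven normalized inputs of the progress network. */
    static double[] progressInputs(GameState s) {
        double[] in = new double[11];
        int[] lowest = s.fourLowestStackSizes();
        int[] initial = s.initialSizesOfLowestStacks();
        int[] costs = s.costsOfLowestStacks();
        for (int i = 0; i < 4; i++) {
            in[i] = (double) lowest[i] / initial[i]; // lowest stacks
            in[4 + i] = costs[i] / 8.0;              // lowest stack costs
        }
        in[8] = s.bestGainAverageOverThreeTurns() / 8.0; // gain average
        in[9] = s.provincesLeft() / 12.0;                // provinces: 12 in a four player game
        in[10] = 1.0;                                    // bias
        return in;
    }
}
```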

5.2 experiments

Though the training of a neural network might seem a fairly automated process, there are some parameters that still need to be decided upon. The most important of these is the error function, which is described in section 5.2.1. We also had to decide upon an activation function for the output node and the nodes in the hidden layer of the neural network. As we wanted a gradual change in the estimates of progress, using a step function was out of the question. Linear and sigmoid activation functions would both be able to give us this gradual change; we eventually decided on a sigmoid activation function, as this would allow small changes in the inputs to a node to result in large changes in the output, which we believe to be advantageous in the training of progress prediction in Dominion. We used a slope of S = 1, as this is also the default value used by Neuroph (see section 4.6).

5.2.1 Error function

We wanted an error function which would attribute the lower error values to networks which made estimates of progress close to the actual progress:

E_{game} = \frac{\sum_{t=1}^{T} E(t)}{T}    (5.1)

Here E_{game} is the error of an agent's prediction of the progress over an entire game lasting T turns, with E(t) being the error of the prediction of the progress for a single turn, t. The same approach can be used to compute the average error for a collection of turns that is not a game (this could for instance be a series of games) by merely considering T the total number of turns in the collection.

If, instead of defining t as the turn, we define it as the current turn divided by the total number of turns, we can define our desired output as a function of t, namely D(t) = t. Given the actual output of the neural network approximator, A(t), we can define the error of the progress prediction in a single turn to be

E(t) = |D(t) - A(t)|    (5.2)

This error in the prediction for a single turn, E(t), is between 0 and 1. The higher values can only be achieved during the late or early game, and halfway through the game the error for a single turn can be at most one half. As the maximum error is linear both in the [0, 1/2] and the [1/2, 1] interval, the highest average error that progress prediction can get for an entire game is the average of max(t, 1 - t) over [0, 1], i.e. 0.75 (see figure 3).

Figure 3: The maximum error as a function of game progress.

If the output node of the neural network is unconnected (i.e. all the inputs to the output node will be zero), the output will be 0.5 for every input (see equation 4.2), the actual output thus being A(t) = 1/2. As we have negative as well as positive weights, this is somewhat similar to initializing the neural network with random weights between -1 and 1 [3], which will yield outputs averaging 0.5. Using this A(t), our error function is

E(t) = |D(t) - A(t)| = |t - 1/2|

The integral of this function (which, evaluated over the whole game, gives the average error of the output) is

\int_0^t E(t')\,dt' =
\begin{cases}
\frac{1}{2}\,t(1 - t) & \text{if } 0 \le t \le \frac{1}{2}, \\
\frac{1}{4}\,(2t^2 - 2t + 1) & \text{if } \frac{1}{2} \le t \le 1.
\end{cases}

[3] Or between any number and minus the same number.

For the t = [0, 1] interval (an entire game) the average error is:

\int_0^1 E(t')\,dt' = \frac{1}{4}

5.2.2 Producing training data

In order to train a neural network to evaluate the progress of a given game by looking at the subset of the game state described in section 5.1.1, we needed to gather such subsets for a large number of games. Rather than having four human players spend enormous amounts of time playing the game, we decided to make a finite state machine player (henceforth FSMPlayer) capable of playing the game following simple heuristics. As different kingdom card configurations might make games very different in terms of play time, we decided to create training data using random selections of kingdom cards. Using one subset of the kingdom cards to train would likely create a very good solution for that particular set of cards while not yielding a good general approximation of game progress. The series of games were played using the rules for restarting the game (see Vaccarino, 2008); this means that the winning player of one game will be playing last in the following game and thereby be at a disadvantage. This diminishes the impact of the slight advantage a starting player has. If one player is significantly stronger than the other players, though, the player to the left of the strong player could have a minor advantage compared to her peers.

5.2.3 Finite state machine player (FSMPlayer)

This finite state machine (henceforth FSM) considers the game to be in one of three states: early, mid-game or late. The transitions between these states are simple, expert-knowledge-based heuristics; for instance, the game is considered to no longer be in the early state once the first province has been bought by a player, if the first stack is entirely empty, or when the sum of the number of cards in the three smallest stacks drops below 13. The value that the FSMPlayer attributes to gaining a card is based on three things: the card's cost, its type and the state the game is considered to be in. The card type and the state are used to determine a value modifier for the card. For instance, action cards have a modifier of 5 in early and mid-game and 1 in late game. In order to utilize the designed values of the cards (which is reflected in the cost of the cards as decided upon by the game's designer), we let a card's value in any given phase be the product of its cost and the modifier for the card's type in that phase.

If, for instance, a player considers buying a Village card in the mid-game phase, the value of making this buy is the action card modifier for the mid-game phase times the cost of the card, i.e. 5 × 3 = 15. This value is then compared to the values of the other cards, and the player will buy the highest scoring one (see Appendix A.2 for a full list of the modifiers used).

For the action phase, the player basically uses heuristically assigned priorities for each card. The cards which give the player the highest number of actions are generally prioritized higher, as they will allow the player to continue the action phase. In some cases information about the game state is used to change the priorities of the action cards; for instance, playing a Witch has a significantly lower priority if the supply is out of Curse cards. For a complete list of the action phase heuristics used, see Appendix A.1. Other decisions the FSMPlayer might need to make are also made heuristically: for instance, if an opponent plays an attack card, the player will always reveal a reaction card if possible, and if given the option of stealing opposing players' treasure cards (when playing a Thief) the player will choose to only gain the Silver and Gold cards. These heuristics are fully detailed in Appendix A.3.

Should superior strategies with a significant impact on game duration exist outside the bounds set by our heuristics, the progress estimation could prove highly inaccurate if applied to games played by players capable of using such strategies. Even if this is not the case, it is likely that games between players using a more developed strategy for the end game would be so different in duration that we would need to re-train our neural network for progress evaluation.

5.3 results

In this section we describe the results of the training of the progress evaluation network, as well as an early experiment carried out to make sure the progress evaluation could potentially be used to gain an advantage when making buy decisions.

5.3.1 Simple experiment using progress estimates

In order to test our progress evaluation we set up a small experiment: we created a player using the same logic as the FSMPlayer described in section 5.2.3 for playing the action phase, buying and any other decisions; the only actual change was in the heuristic card value modifier used when deciding which cards to buy. Where, for example, the previous version would have a modifier for the values of treasure cards of 6 in the early game, 5 during mid-game and 1 during late game, the new version used the progress estimate to interpolate between these values in order to find a modifier (see figure 4). The mid-game values are placed at a progress of 0.4, which is another heuristic for where the center of the mid-game area might be.

Figure 4: The interpolation between the heuristic gain-value modifiers of treasure cards, based on the estimated progress.
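A minimal sketch of this interpolation, assuming (as figure 4 suggests) that the early-game and late-game modifiers are anchored at progress 0 and 1 respectively, could look as follows:

```java
/** Sketch: piecewise-linear interpolation of the treasure-card value modifier,
 *  anchored at 6 (progress 0), 5 (progress 0.4) and 1 (progress 1). */
final class ModifierInterpolationSketch {
    static double treasureModifier(double progress) {
        final double early = 6.0, mid = 5.0, late = 1.0;
        final double midPoint = 0.4; // heuristic centre of the mid-game
        if (progress <= midPoint) {
            return early + (mid - early) * (progress / midPoint);
        }
        return mid + (late - mid) * ((progress - midPoint) / (1.0 - midPoint));
    }
}
```

The resulting modifier is multiplied by a card's cost to obtain its gain value, just as in the FSMPlayer of section 5.2.3; only the modifier lookup differs between the two players.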

Playing against three of our simple FSMPlayers, our player using the progress evaluation interpolation won 3012 out of 10,000 games (see table 1). Assuming players of equal skill would win 25% of the games on average, the player is significantly better (p < 0.01). Because the players use identical heuristics for the card values in the different phases of the game, these results depend on our heuristic estimate of a good place to put the mid-game heuristic.

Table 1: Scores and number of wins during 10,000 games between one player using progress prediction to interpolate between buy value modifiers and three FSMPlayers.

Therefore these results do not prove that the neural network prediction of progress is better. They do, however, suggest that there exists a connection between prediction of progress and the relative value of gaining cards, which would imply that good progress evaluation is needed for good play.

5.3.2 Training

The neural network was initialized as a fully connected feed-forward network with one hidden layer of size five. The learning rate used was ε = 0.01 and the slope of the sigmoid activation function was the aforementioned S = 1. Training data for 100,000 turns was recorded from games played by the FSMPlayer described in section 5.2.3. For each turn we recorded the inputs to the neural network (the subset of the game state discussed in section 5.1.1) as well as the desired output of the network (the turn number divided by the highest turn number). The average error of the neural network's output was minimized using backpropagation.

To get an accurate measure of the performance of the trained network, we performed a variation of 10-fold cross-validation, in the following way: before the run, we partitioned the data into ten equally sized parts, each containing 10,000 input-output data pairs. In each fold, seven of these parts were used for training the network and two were used for validating that the error of the network on unseen data did not increase over an epoch (one of our termination criteria). The last part was used at the end of the fold for testing the efficiency of the solution found. In the subsequent folds the roles would then be shifted by one, so that each part of the data would be used for training, validation and testing the same number of times as every other part.
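The rotation of the ten parts between the three roles could, for one fold, be expressed as in the sketch below; the seven/two/one split follows the description above, while the exact index arithmetic is an assumption made for the example.

```java
/** Sketch of the role rotation in our 10-fold variation: for a given fold, seven
 *  parts train, two validate and one tests, shifted by one for each new fold. */
final class FoldRotationSketch {
    static int[][] foldRoles(int fold) {
        int[] training = new int[7];
        int[] validation = new int[2];
        int[] test = new int[1];
        for (int i = 0; i < 7; i++) {
            training[i] = (fold + i) % 10;
        }
        validation[0] = (fold + 7) % 10;
        validation[1] = (fold + 8) % 10;
        test[0] = (fold + 9) % 10;
        return new int[][] { training, validation, test };
    }
}
```

Over the ten folds, every part then serves as training data seven times, as validation data twice and as test data once.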

When using the validation set to make sure the neural network is not overfitting, the training can sometimes go on for a very long time without achieving significant improvements. Therefore we chose to terminate the training not only if the error on the validation set had risen, but also if the drop in error fell below a small threshold.

Figure 5: The error of the progress evaluation of one fold for the first 300,000 iterations. The data used for training was gathered from games played between FSMPlayers and BUNNPlayers, respectively.

5.3.3 Results based on FSMPlayer games

Figure 5 shows the error for the first 300,000 iterations of the training. As can be seen, the network starts out having an error of approximately 0.25 (as expected, see section 5.2.1). The error quickly drops to roughly 0.11, as seen in figure 6, after which only minor improvements are achieved. The number of epochs needed before training was interrupted ranged from 18 to 62 (with an average of 44.3 and a standard deviation of 16). Figure 6 shows an example of error values during training for one fold.

Figure 6: Examples of the error of progress evaluation training for entire folds. The data used for training was gathered from games played between FSMPlayers and BUNNPlayers, respectively. Ten measurements of the error are made for each epoch to give a more illustrative graph (measured by the performance on the validation set, i.e. in the same way as we would at the end of an epoch).

The ten folds yielded error values on the test sets with only small variation (the range, mean, median and standard deviation are summarized in table 2). The low standard deviation seems to support our assumption that progress prediction in Dominion is not made difficult by many local optima, at least when played by the simple FSMPlayer.

5.3.4 Results based on BUNNPlayer games

Once we had a better solution for the buying of cards (see section 6), we attempted to re-train our neural network. We did this using the same topology, learning rate, slope and so on as previously; the only difference being that the training data came from games played by agents using a better solution for buying cards (BUNNPlayers, the result of our efforts discussed in chapter 6).

An example of the training can be seen in figure 5. The training lasted for between 19 and 34 epochs (with a mean of 27.5 and a standard deviation of 4). The lower number of epochs needed and the lower standard deviation compared to the training using data from games between FSMPlayers could be attributed to the BUNNPlayers' ability to close games faster, as they make more skilled buys.

Figure 6 shows an example of the error measured over epochs. As seen before, the errors for the neural networks start out around 0.25. The errors quickly fall to just below 0.07, after which improvement stagnates somewhat. Once training concluded, the errors of the trained networks on the test sets were once again tightly clustered; a comparison between the errors of the networks trained with data from the FSMPlayers and those trained with that of the BUNNPlayers can be found in table 2. The low standard deviation of the errors could once again be interpreted as supporting our assumption that solving progress evaluation for Dominion is not made more difficult by local optima. The errors are significantly lower for the neural networks trained with data produced using the more skilled players. This could be due to the inherent randomness of the buys the heuristic player makes: when multiple cards have approximately the same value, the player picks one at random (see appendix A.2). The randomized buying would result in less focused buying of particular supply stacks, which would make the progress more difficult to evaluate.

Table 2: The errors (lowest, highest, mean, median and standard deviation) of the progress evaluation training for the two data sets (FSM and BUNN). Both are done using backpropagation and ten-fold cross-validation.

Both of the training data sets have been created using four identical players. This means that with a given selection of kingdom cards, they will follow the same strategy, buying out the most important cards for the strategy fast. It is likely that the neural network would need to be retrained in order to get the same low error values for games between players following different strategies.

5.4 conclusion

Using the subset of the game state described in section 5.1.1, we conducted a simple experiment, linking neural network progress evaluation to the heuristics used by the FSMPlayer. Our results show the heuristic/neural network hybrid obtaining a significantly higher score than the pure heuristic players, which shows that the progress approximation can be utilized to obtain improved play.

The neural networks were initially trained with a training set composed of data from games played by our FSMPlayer. The training was achieved via backpropagation, and the error of the neural networks decreases over training time. This shows that the selected subset of the game state can be used to approximate the actual progress of the game. From the average error of around 0.25, the neural networks improved to a mean error of approximately 0.11. The training was repeated with data gathered from games between BUNNPlayers (see section 6). This time the error mean value for the test set (i.e. the set of data for 10,000 turns which is not used during training) fell to just below 0.07, which could be attributed to the BUNNPlayers' higher skill at buying and thereby closing games. The errors of the 10-fold cross-validation test sets have a low standard deviation, which could suggest that solving the problem of progress evaluation in a game of Dominion is not made difficult by local optima.


6 buy phase card selection

As briefly argued in section 3.3, the choice of which cards to gain from those available plays a crucial part in playing Dominion efficiently. This applies to the buy phase, and also to situations where certain action cards allow cards to be gained during the action phase. Given a set of cards to choose from, as well as known information about the game state (i.e. only the information that would be available to a human player playing the game), it would be advantageous to be able to estimate the relative value of each. Neural networks, and in particular NEAT, seem obvious choices of technique for this purpose (see sections 4.1 and 4.3); the details about game state representation and implementation are discussed in this section.

6.1 representation

The factors to take into account during the decision of which card to gain are many, and quite possibly, the entire known game state is relevant in one form or another. Unfortunately, inclusion of too many factors would make it very hard for the AI to learn which of these are the most important, and which can be seen as having a negligible influence on the decision at hand. Therefore, it has been up to us to decide on a fitting subset of information to provide the AI with. The decisions we have made are discussed in the following, and the names of the factors we include are stated for easier reference.

In order to recognize effective combinations, information about which cards are available for this particular game (i.e. whether or not they were picked as one of the kingdom cards) will be important. If, for instance, it turns out that a combination of the Thief card and the Spy card is powerful, the player will need to know if the Spy card is actually available in order to properly assign a value to buying a Thief card. This information, which we name ingame, is of course only needed for kingdom cards, not for the seven cards available in every game.

Furthermore, how many of a particular card remain in the supply is significant (note that this is different from ingame described above, which signifies whether some card is part of the supply for the particular game). How many Curse cards remain in the supply, for instance, is important for the value of the Witch card: if a lot of Witches have been played already, the supply might be out of Curse cards, which would severely diminish the value of buying a new Witch card. We call the information about the number of cards of a type which are left in the supply remaining.

Whether or not some card is already in the player's deck is also important: using the previous example of the Spy/Thief combination, a player who recognizes this combination as strong and sees that both are available in a game might start out by buying the Spy card at the first available chance. If the player cannot perceive that she already has Spy cards in the deck, this behavior is merely repeated the next time the player has a chance to choose between these cards. The player will end up buying Spy cards at every chance and will fail to get the combination up and running. Therefore, the player should also consider which cards she already has in her deck and base the gain priorities on this (we name this measure inowndeck). Thereby, behavior which prioritizes getting the missing parts of an effective combination can hopefully be developed.

It is also important to take the gain choices which have previously been made by other players into account (which is equivalent to information about the cards in the opposing players' decks): the value of the Militia card might relate negatively to the amount of Library cards the other players have bought. Similarly, the Chapel and Remodel cards might increase in value once Witch cards have been bought by the opposing players. In a more direct manner, the value of the Moat card is expected to be higher once opponents have invested in various attack cards for their decks. This information we call inopponentdeck.

The amount of treasure cards in the decks of both the player deciding what to gain and her opponents is also relevant information. The Thief card, for instance, is most likely more valuable when you are short on treasure and your opponents are not, and the value of gaining treasure cards is also influenced by the amount already owned. We simply name this information moneydensity for the player deciding which card to buy and oppmoneydensity for the opposing players. This information is somewhat redundant: the same could be found by analyzing the inowndeck and inopponentdeck data for treasure cards. If we did not use a separate input for the money density, however, we would need the neural network to get topological and weight mutations in order to carry information about the density of money through the network, to properly let the value of the Thief card depend on this.

This would require the three inputs for Copper, Silver and Gold in the opposing players' decks to be routed into a node in the hidden layer, which would also need to be linked to the signal for whether or not buying a Thief card is being considered. After this, the weights of the connections involved would have to be optimized. As the money density is likely also important for the values of other cards, this mutation would have to occur multiple times. We believe that the increase in dimensionality of the search space caused by adding an input for the money density of the opposing players' decks would increase training time less than waiting for this particular topological mutation to occur would.

While the money density provides some information on the purchasing power of the current deck, it does not show the complete picture. Since some cards, like Feast or Workshop, allow the player to gain cards of a certain value, a player might be able to gain many cards while having a relatively low money density. Other cards, like Woodcutter or Festival, provide bonuses to the amount of coins available during the buy phase. Therefore, the deck might look poor based on money density while it actually has good purchasing power, and investments in treasure cards would be pointless. Because of this, a measure of the actual purchasing power is needed. We call this measure bestgainavg, as it is the value of the highest valued gains the player has made, averaged over the most recent turns (see section 6.2).

We consider the estimated progress of the current game to be important to any buy decisions the player makes as well. As argued in section 5, an estimate of the progress of the current game is among the important factors in selecting which card to buy. The value here is the output of our solution for progress evaluation; we simply name this progress.

6.2 experiments

As mentioned in the preceding section, we need to be able to estimate the relative value of each card available for gaining, in order to build an efficient deck. Which information is relevant to this decision has already been discussed, but a way of finding a function from this set of inputs to the desired estimated values of each available card has not. Neural networks are useful for approximating functions, among these complex functions that are linearly non-separable (see section 4.1), and NEAT (see section 4.3) provides the means to evolve them. The choice of input and output representation is crucial, however, and the decisions we have made for this specific problem are discussed in the following.

First, we had to decide on the desired format of the output. We decided that the estimated relative value of each available card should be in the form of a number in the interval [0, 1], since this follows the conventions of NEAT and neural networks in general. The most obvious way to do this would perhaps be to have an output for each choice, in our case for every card to be evaluated. Given the way NEAT is initialized, however, this solution would be imprudent. Recall that NEAT starts out with a minimal network topology, where every input is connected to every output. Then consider a fairly small network with ten inputs. With only one output, the total count of connections in the initial topology would be ten, but for each additional output added to the network, the amount of connections would grow by the number of inputs. In our case, since we must evaluate each of the 32 cards, the connections would amount to a staggering 32 times the number of inputs. With the input information [1] whose relevance we explain in section 6.1, the number of initial connections would be beyond four thousand. This is hardly minimal, nor is it a very feasible starting point for either optimization or complexification.

Instead of having an output for each possible choice, a more prudent approach is to activate the neural network once for each card, selecting the one that yields the highest output. This is similar to TD-Gammon (Tesauro, 1995), where the board positions that are reachable within the current time step are evaluated one by one using a trained neural network, and follows the advice given by Stanley (2010), which also mentions the use of control signals [2]. Instead of having an output node for each card we want to evaluate, we add one extra input node for each card. When the value of a card is estimated, the control signal corresponding to that particular card is set to 1, while those of all the other cards are set to 0. In this way, NEAT can evolve the neural network to output a specific value based on the information given about the current game state, and on which control signal is activated. A close-up, truncated view of the initial topology would then look much like the one seen in figure 7.

First, the inputs concerning the global game state can be seen at the very top. Their relevance has been explained in section 6.1, but the relation between the named inputs and the way they are input to the network is as follows:

[1] Much of which would be separate inputs for each card.
[2] Actually, Stanley (2010) names these candidate inputs.

Figure 7: The organization of the inputs for each card.

bestgainavg: The value of the highest valued card gained in each of the last three rounds, averaged and normalized by the value of the Province card, which is eight coins. This is a measure of the purchasing power of the player currently considering which card to gain. The information is a historical average to minimize the impact of lucky or unlucky hands.

moneydensity: The average coin value per card, based on treasure cards. Note that the raw value of this measure is fairly close to one (7/10) at the beginning of the game. Therefore, we have normalized it by the coin value of the Silver treasure card (2), since the player will rarely own enough Gold cards for this value to be more than 1.0 (this would correspond to the player starting turns with an average of 10 coins in hand). The value is clamped to 1.0 in the rare event that this happens.

opponentmoneydensity: The same measure as the previous, but for the opponents of the player deciding on which card to gain instead.

progress: The estimated progress of the game, output from the progress evaluation solution described in section 5.

After these, the inputs concerning the Adventurer card can be seen; the inputs for the rest of the cards are completely analogous, and appear in alphabetical order according to the card names, with the exception of the cards that are in every game, that is, the treasure cards, the normal victory cards and the Curse card. For those, the ingame input is not needed, and the inputs for these appear after those of the others, though still in alphabetical order. Thus, the inputs of each card are arranged relative to each other as shown in figure 8, where the keen reader will also notice that the bottom cards have one less input than the top ones, as per the missing ingame input.

Figure 8: The organization of the input groups relative to each other.

In our early experiments the neural network evaluating gain choices did not properly attribute importance to the progress. This is likely because of the initial topology of a network created by NEAT (when only one output node is used): the input nodes are connected directly to the output node, as mentioned in section 4.3. With such a topology, a progress input (which will be the same for all cards evaluated) will not make any difference in the evaluations of cards; for each card, the weighted progress will simply be added to the sum being calculated in the output node. As both the linear and sigmoid functions of the output node are monotonic, adding a constant [3] to each sum will not change the order in which the cards are ranked, and the progress becomes superfluous.

[3] As neither the weight on the progress connection nor the progress estimate itself will change while choosing between the cards available, the product of the two will be the same for each card we evaluate.

For the progress input to actually become useful, the network would have to mutate in such a way that the progress input would be routed into a node along with one or more of the beingconsidered signals, which signify that the network is to evaluate the value of a particular card. The output from this node would then need to be routed (possibly through other nodes) to the output. After such a mutation, weights would need to be evolved in such a way that the turning on and off of control signals would manifest itself in a changed ordering of the value of gaining cards (the output from the neural network). As there are somewhat effective strategies for Dominion which can be found without considering the progress [4], the network tended to develop these and did not find a proper topology and weight set for a neural network using the progress within a reasonable number of generations.

To help alleviate this problem we devised a new set of inputs: in addition to the beingconsidered signal, the subset of inputs relating to each card would have a progress input as well, which would be the estimated progress multiplied by beingconsidered (that is, the input would be the estimated progress when the particular card was being considered, and 0 otherwise). In the end we had six inputs for each kingdom card and five for the cards that are in every game, since the ingame signal is not needed for these. Together with the information about the average value of card gains over the last few rounds and information about the coin density in the decks of the current player and the opponents, the total number of inputs ended up at 189, including a bias input. An updated version of the inputs is illustrated in figure 9; these are the final ones used in the experiments.

Figure 9: The final version of the inputs.

[4] For instance Big Money, see Geekdo (2009), or strategies involving aggressive usage of particular attack cards.
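To make these mechanics concrete, the sketch below shows how a single gain decision could be carried out with the final input scheme: the network is activated once per candidate card, with that card's beingconsidered signal set to 1 and its per-card progress input set to the estimated progress, and the card yielding the highest output is selected for gaining. The Network and buildInputs names are hypothetical placeholders; the actual input ordering is the one shown in figure 9.

```java
import java.util.List;

/** Sketch of buy-phase card selection: one activation of the evolved network per
 *  candidate card. Network and buildInputs are hypothetical placeholders. */
abstract class GainSelectionSketch {

    interface Network {
        double activate(double[] inputs); // single output node, value in [0, 1]
    }

    /** Builds all 189 inputs: the global values (bestgainavg, moneydensity,
     *  opponentmoneydensity, bias) plus one group per card, with beingconsidered = 1
     *  and progress * beingconsidered set only for the card under consideration. */
    abstract double[] buildInputs(int consideredCard, double progress);

    int selectGain(Network net, List<Integer> gainableCards, double progress) {
        int bestCard = gainableCards.get(0);
        double bestValue = Double.NEGATIVE_INFINITY;
        for (int card : gainableCards) {
            double value = net.activate(buildInputs(card, progress));
            if (value > bestValue) {
                bestValue = value;
                bestCard = card;
            }
        }
        return bestCard;
    }
}
```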

6.2.1 Fitness function

As argued in section 5.3.1, the number of wins is a good way to measure whether one type of player is significantly better than another. Using the number of wins gained by a player as the base for a fitness function is problematic for a number of reasons, however. One arises when comparing players with significant differences in skill levels: if one player beats an opponent with a wide margin and another player beats the same opponent with a small margin, both these games will register in the same manner, and the difference in skill between the two players will not be obvious, as both players won the game. The same applies to players faced with stronger opposition; the use of wins would not register a difference between losing by a single point and not getting any victory points during the entire game. This introduces a lot of noise which could hinder good training. As mentioned earlier, using the total number of points is also a flawed approach, as this will favor players skilled at the game configurations that are likely to terminate due to the stack of Provinces being empty (because the score in this type of game tends to be higher). It might also reward players who buy victory cards early to go for a high average score at the expense of the number of wins.

We wanted the fitness to reflect more than just whether the player won the game or not. Should the player have won a large victory, we wanted this to be evident from the fitness, just as it should be evident if the player suffered a severe defeat. It would furthermore be practical if this fitness could be normalized within the [0, 1] interval.

To meet these goals, we decided upon the following function for the fitness of a player after playing a single game:

F_{game} = \frac{1}{3}\left(\frac{score}{winner\ score} + 1 - \frac{highest\ opponent\ score}{winner\ score} + game\ won\right)

Here, score represents the score of the player for whom fitness is being evaluated, winner score is the highest score obtained by any player, highest opponent score is the highest score gained by the other players, and game won is 1 if the player won the game and 0 otherwise.

While this fitness for a single game does reflect what we wanted it to, it is not entirely unproblematic. Sometimes players who are very unskilled (for instance those who have yet to figure out that buying Curses is a bad idea) or players faced with opponents who are using a Witch-based strategy will end up having a negative score at the end of a game. Should this happen, the player would get a negative fitness, which could be problematic in our further utilization of the fitness. Further problems will arise if all the players get a negative score: if, for instance, the players get scores of -1, -2, -2, and -10, the winner would get a fitness of 1/3 whereas the losing player would get a fitness of 10/3. Also, should the highest scoring player have a score of 0, the fitness could not be computed, as it would require a division by zero. To get around these problems we elected to add a correction (the absolute value of the lowest score) to all the scores, so that the lowest scoring player does not have less than zero points. Should the highest scoring player have a score of zero after this correction (which would only happen if all players get the same score of zero or less), this correction is increased by one to avoid division by zero.

Now that we have an expression for the fitness of an agent playing a single game of Dominion, we need a way of translating this into the fitness of an agent playing a series of games. We elected a simple averaging, i.e.

F_{n\ games} = \frac{1}{n}\sum_{i=1}^{n} F_{game_i}
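As an illustration, the single-game fitness together with the score correction described above could be computed as sketched below. The method name, argument layout and the way the evaluated player's own score is excluded when finding the highest opponent score are assumptions made for the example, not the actual implementation.

```java
/** Sketch of the single-game fitness of section 6.2.1, including the score correction. */
final class FitnessSketch {
    /** scores holds every player's final score; playerScore is the score of the
     *  player being evaluated, and won is whether that player won the game. */
    static double gameFitness(int[] scores, int playerScore, boolean won) {
        int lowest = Integer.MAX_VALUE;
        for (int s : scores) {
            lowest = Math.min(lowest, s);
        }
        int correction = lowest < 0 ? -lowest : 0; // lift the lowest score to zero
        int winnerScore = Integer.MIN_VALUE;
        for (int s : scores) {
            winnerScore = Math.max(winnerScore, s + correction);
        }
        if (winnerScore == 0) { // everyone ended on the same (corrected) score of zero
            correction += 1;
            winnerScore = 1;
        }
        int ownScore = playerScore + correction;
        int highestOpponent = Integer.MIN_VALUE; // highest corrected score among the others
        boolean skippedSelf = false;
        for (int s : scores) {
            if (!skippedSelf && s == playerScore) {
                skippedSelf = true; // skip one occurrence of the evaluated player's score
                continue;
            }
            highestOpponent = Math.max(highestOpponent, s + correction);
        }
        return ((double) ownScore / winnerScore
                + 1.0 - (double) highestOpponent / winnerScore
                + (won ? 1.0 : 0.0)) / 3.0;
    }
}
```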

As mentioned earlier, we terminate the game after 50 turns if the players have not managed to end the game on their own at that point. When this happens, scores are tallied just as they would be in a game ending in the regular way, and the scores and the information about which player won are used to compute fitnesses just as they would have been had the game ended due to empty supply stacks.

6.2.2 Set-up

We trained the buy network using competitive co-evolution, with a population size of 50 for each population. A weight mutation rate of 0.01 with a standard deviation of 1 was used, along with topological mutation rates of 0.3 for adding connections and 0.1 for adding neurons, and a remove-connection mutation rate of 0. The selection was done by elitism with a survival rate of 0.5. Speciation was also used in order to preserve potentially important innovations; this was done using excess and disjoint coefficients of 1.0, a weight difference coefficient of 0.8 and a threshold of 0.3. The neural network was initialized as a fully connected feedforward network with no hidden neurons, the output node using a sigmoid activation function. The method of sampling used was the one discussed in section 4.4.2, i.e. each member of one population was tested against the most fit members of the three best performing species of the other population.

6.2.3 Comparison of population switches

As we mention in section 4.4.3, the question of the effect of the technique chosen for population switches is largely unanswered. We intend to investigate whether evolving on both populations every generation (switch by generation) or only evolving on the least skilled population (switch by skill) is better, or if both techniques are equally good and the choice between them is unimportant. The switch by generation is easy to test: we merely use the standard version of co-evolution, which will evolve on both populations in each generation. The switch by skill approach requires more consideration, though: we intend to switch the roles of the populations, where one is being evaluated and evolved on (the host role) while the other is supplying test cases (the parasite role). We further intend to make these population switches when the population being evolved on plays stronger than the one supplying the test cases. As numerous generations might have passed since the population in the parasite role last had its members' fitnesses evaluated, comparing the fitness scores of the hosts to those of the parasites is infeasible.

If our fitness function was merely a function of the number of wins, the fitnesses might serve as a measure of whether to switch populations: if, for instance, the fitness was the number of wins divided by the total number of games, the members of the population in the host role would be better once the average fitness was above 0.5. This, however, is not possible, as our fitness function also takes the sizes of the wins into account (see section 6.2.1). We needed to find another way to compare the skills of the populations. As the purpose of conducting the experiment on different population switches was to investigate whether or not computation time could be saved, we wanted our method of testing not to require very many games. Therefore, testing all the members of one population against each member of the other was out of the question. It would also have been problematic, as we protect innovations through speciation (see section 4.3), and therefore the members of the population would not necessarily be the best performing ones. Instead we elected to let the champion of one population face the champion of the other population in a series of two vs. two games (two instances of each playing). If the challenger (i.e. the champion of the population in the host role) wins 50% or more of the games, we switch the populations, so that the one that was in the host role gets switched over to the parasite role and vice versa.

We chose to use the problem of card gain selection for this experiment as we felt it was better suited. The progress estimation is a problem which can be solved well using backpropagation, while the buy phase and action phase pose problems which are more difficult to solve. As what we would like to compare is how fast the two different techniques find good solutions, it would be preferable to investigate a problem for which solutions can be found fairly quickly. And as what a player does during the buy phase has a more direct influence on who wins the game than the playing of the action phase, setting our experiment in the buy phase would ensure less noise from the fitness function, which would also mean that fewer games would be required (the impact of the action phase is further debated in chapter 7). To detect any difference between these two techniques we compared runs of 1200 generations (that is, 1200 evolutionary iterations shared between the populations when testing switch by skill, and 600 generations with evolutionary iterations on both populations for those runs done with switch by generation). The number of generations was chosen as it had previously been shown to produce strategies complex enough to challenge a human player, and as it was feasible to do within reasonable time.

In order to get a statistical basis for the results, ten runs were executed with each technique for population switches.

6.3 results

Instead of undertaking a computationally heavy testing of all the members of the populations of each run, we decided to merely test the final dominant strategies from each evolutionary run. In addition to being less expensive in terms of computation, this would also limit the impact of unskilled population members which have survived merely because they are being protected by speciation (see section 4.3).

To compare the dominant strategies evolved using different methods for population switches, we had each member of one group play against each member of the opposing group. The games were played with two instances of one player against two instances of the other, ordered so that players using the same chromosome would not be seated next to each other in the order of play. Each of the pairs of players would play 1,000 games against each other in order to find the probabilities each player had of winning or losing. The average win ratio can be found in figure 10. The players evolved using population switch by skill won, on average, about half of their games against those developed using switch by generation. From this result one might get the impression that the two methods of population switches are equally strong, as they get approximately the same number of wins against each other on average.

In order to get a clearer image of any differences between the strategies evolved with different population switches, we let each play 1,000 games against three instances of a player using the Big Money strategy (BMPlayer), the results of which can be found in table 3. The mean win chance of those evolved through switch by skill is almost identical to that gained by those evolved with switch by generation (considering that the baseline should be 0.25 with one player in a four player game, these results are good; they do not show any significant difference between the two, though). A two-proportion z-test, chosen because of our large sample of 1,000 games, shows that the difference is not significant (p > 0.01). There are, however, differences in the placement of the quartiles of the two distributions: for the players developed using switch by skill, these are significantly closer to the middle than those of the players developed using switch by generation.

Figure 10: Comparison of dominant strategy chromosomes evolved with different methods for population switch. Each group of bars represents a chromosome evolved using switch by skill, and the bars represent that chromosome's win ratio against each of those evolved using switch by generation over 1,000 games.

The comparatively lower standard deviation of the performance of the players developed with switch by skill shows the same: the results, while equally good on average, are more condensed around the middle. It would also appear that the best agent evolved using switch by skill is better than the best one evolved using switch by generation. A two-proportion z-test shows that the difference is significant (p < 0.01). This only shows a difference in the players' ability to beat the BMPlayer; as Dominion has strategic circularities (see section 6.3.3), this result can not be interpreted as a general advantage of doing population switches based on skill.
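For reference, the two-proportion z-test referred to throughout this section is presumably the standard pooled test: with x_1 and x_2 wins out of n_1 and n_2 games,

\hat{p}_1 = \frac{x_1}{n_1}, \quad \hat{p}_2 = \frac{x_2}{n_2}, \quad \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}, \qquad z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}

and |z| is compared to the critical value for the chosen significance level (here 0.01).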

To further compare the two methods for population switch, we elected to let the best performing players play 1,000 games against each other (two instances of each). The player developed through population switch by skill won 565 of the games while the one developed through population switch by generation won 435, which is significantly better (p < 0.01).

Table 3: The scores (minimum, first quartile, mean, third quartile, maximum and standard deviation) of players evolved with different methods for population switches: performance against three Big Money players over 1,000 games.

To further compare the performance of the two, we matched each against three instances of the best player our experiments had produced so far (from a run of around 1500 generations which was not part of this experiment). Of the 10,000 games played, the player developed with switch by skill won 2513, while the other won fewer; a two-proportion z-test shows that this difference is significant (p < 0.01).

While too much importance should not be attributed to the fact that switch by skill created the best individual, the performance comparison between the ten runs of each kind did show interesting results: the higher spread of the performance of the agents evolved using switch by generation shows up in the higher standard deviation as well as in the placement of the quartiles. This could be attributed to the less focused way in which the evolutionary iterations are given to the populations with switch by generation: both populations are given one iteration, whether they are the best or not, whereas for population switches by skill, the evolutionary effort is focused on the worst performing population. The switch by skill will ensure steady but slow progress, while the switch by generation will yield faster but less steady advancement. Which method is the most appropriate to use for co-evolution depends on the purpose of this co-evolution. Should one be interested in running a large number of evolutionary runs, picking the best found solution and discarding the rest, one should probably choose the traditional switch by generation approach. If, however, the purpose is a proof of concept with limited computational power at one's disposal, switching population by skill might be preferred.

6.3.1 Performance versus heuristics

Using the knowledge gained from the experiment discussed in section 6.2.3, we evolved the network for making gain decisions accordingly.

6.3.1 Performance versus heuristics

Using the knowledge gained from the experiment discussed in section 6.2.3, we evolved the network for making gain decisions accordingly. Metrics for the actual skill of evolved dominant strategies are difficult to come up with, since their strength is relative to that of the opponents in the dominance tournament. We might, however, still compare them quantitatively to some of the strategies we have implemented by hand. At the end of a run, 1,000 games were played for each dominant strategy. In each game, the BUNNPlayer using the evolved dominant strategy played against three players using different hand-crafted gain strategies: the gain heuristics of the FSMPlayer using neural network progress evaluation (see appendix A.2 and section 5.3.1), the Big Money strategy, and a random, greedy baseline. The result of this comparison is shown in figure 11. Note that only the choice of which cards to gain differs among these players; all other decisions, such as those made during the action phase, are exactly the same.
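A minimal sketch of this evaluation protocol is given below. The play_game and make_* callables are hypothetical stand-ins for the game engine and the player constructors, and the per-game shuffling of the seating order is only illustrative:

    import random

    def evaluate_dominant_strategy(chromosome, play_game, make_bunn_player,
                                   make_fsm_player, make_bm_player,
                                   make_random_player, games=1000):
        """Estimate the win fraction of one evolved gain strategy against the
        three hand-crafted opponents. play_game(players) is assumed to run a
        four-player game and return the seat index of the winner."""
        wins = 0
        for _ in range(games):
            players = [make_bunn_player(chromosome),   # the evolved player
                       make_fsm_player(), make_bm_player(), make_random_player()]
            seating = list(range(4))
            random.shuffle(seating)                    # illustrative seating policy
            winner_seat = play_game([players[i] for i in seating])
            if seating[winner_seat] == 0:              # did the evolved player win?
                wins += 1
        return wins / games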

Figure 11: Fraction of games won by a player using the evolved dominant strategies of a single run, each game played against three players using different strategies (our gain heuristics, Big Money and random greedy gains). 1,000 games were played by each dominant strategy after the run was completed.

Here we see that this particular run (a representative run, though not the best of them) quickly reaches a level of strategy that beats the opposition in more than the 25% of games expected of players of equal skill, around generation 250. Interestingly, we also see some drops in skill versus this selection of opponents, namely around generation 700 and again at a later point in the run. This indicates that as a strategy evolves to become better in the general case (versus the parasites chosen from the opposing population), it may also become less proficient against specific strategies, and vice versa. Analogous plots of the ten runs using switch by skill and switch by generation are shown in figures 12 and 13, respectively.

Figure 12: The performance of the dominant strategies of all ten runs using switch by skill.

Figure 13: The performance of the dominant strategies of all ten runs using switch by generation.

In general the evolved strategies become capable of beating the heuristics we have at hand. Over 1,000 games, the 20 dominant strategies from our population switch experiment won, on average, more than the 0.25 of games expected from an agent of equal skill against three instances of our hand-crafted heuristic, and they also won an above-baseline share of games against three instances of the much stronger but less diverse Big Money strategy.

6.3.2 Evolved gain strategies

A more qualitative look at the evolved strategies can be seen in figure 14, where a clear progression shows in the types of cards the player decides to gain.

Figure 14: Percentage of specific card types gained by each dominant strategy.

The very first dominant strategy is merely chosen for being the strongest of its population (see section 4.4.1), which it is because of its ability to gain victory cards; accordingly, victory cards constitute a fair share of the cards it acquires. In fact, based on this percentage alone it might look as if it should win against many of the later dominant strategies. That is not the case, however, which becomes clear from figure 15, where a tally of the victory cards gained by the evolved player is shown: the many victory cards bought by this first player are all Estate cards. It never buys a Duchy or a Province, which is likely why it is not competitive beyond the first few generations of evolution.

Figure 15: Count of individual victory cards gained over 1,000 games.

It is clear that, though the percentage of victory cards bought is fairly high in the first dominant strategy, the share constituted by the more valuable ones goes steadily up. The dominant strategy of generation 1 is also able to beat the first, even though it buys a lower percentage of victory cards; looking at figure 15 again, we see that the number of the more valuable victory cards climbs. Furthermore, the very early strategies, while being better than the others of their populations, still have an unfortunate taste for buying Curse cards. Since Curse is a fully legal and even free card to gain, this makes perfect sense considering the randomly initialized weights of the neural networks.

For the reasons described in section 3.3, estimating how far a game has progressed is crucial when deciding which cards to gain. With the inputs discussed in section 6.2, we would expect NEAT to come up with solutions in which the progress evaluation from chapter 5 has some sort of influence. That this is the case is shown by figure 16. Note that the data has been cut off after turn 28, since fewer than 3% of the games last longer than that, and the data for those turns is therefore quite unstable. The composition of the card types gained clearly changes a great deal over the course of a game; timing indeed is important in Dominion.

We might also look at the action cards bought by the individual dominant strategies. These constitute a varying share of the total cards bought; we saw in figure 14, for example, that action cards seem to be gained less and less, until the percentage more or less stabilizes around generation 500. Though this percentage is similar across strategies, the composition changes quite frequently, as illustrated by figure 17.

Figure 16: Percentage of card types gained in each turn, averaged over 10,000 games.

Figure 17: Composition of action card types bought by each dominant strategy of one run.

Overall we see that the same strategy is generally tuned over a couple of dominant strategies, for example those between generations 107 and 169, which seem to favor a combination of Chancellor, Mine, Moat and Spy cards, along with some others. This scheme is defeated after some evolution, in generation 214, by one relying heavily on the use of the Witch card (which may explain the increasing percentage of Moat cards in the other strategy). The new strategy continues to increase the percentage of Witch cards gained until they account for almost every gain. This strategy is in turn beaten in generation 529 by one using Remodel, Moat, and Chapel cards, all strong defenses against the Witch card. If we once again look at figure 15, it is also clear that while this strategy is very defensive, it is able to gain many victory cards, enough to make up for the victory point penalties suffered through gaining Curses because of the Witch.

Another interesting family of strategies is seen between generations 1039 and 1089, where the cards granting extra buys account for a large part of the gains.

This in itself might not seem like an obviously good strategy, but if we once again take the victory cards bought into account, we see that these strategies favor the Gardens card more than the previous ones did. Coupled with the ability to buy many cards (recall that the Copper card costs 0 coins), this is a clever way to gain a lot of victory points at a small price compared to just gaining the standard victory cards.

All of these qualitative analyses might seem a bit circumstantial. For example, many of the strategies will simply buy treasure and victory cards if none of the kingdom cards they favor turn up. While that is true, it happens very infrequently: the probability of none of the action cards mainly favored by the last dominant strategy (that is, Cellar, Library, Smithy and Witch) turning up in a game is about 1.3% (a short calculation of this kind of probability is sketched below). Over 1,000 games this happens rarely enough, and the random fluctuations due to the set-up of available kingdom cards are small enough, that we may qualitatively tell different strategies apart with confidence.

It is important to point out that the run discussed above is quite representative of the runs in general. That is, while that particular run turned out to be the more successful in the end, an arms race happened in the other runs as well (though sometimes with inferior "arms"). As an example, we might look at the dominant strategies of a run which fared much worse against the various heuristics we have created. The action cards gained by the dominant strategies of one such run are shown in figure 18.

Figure 18: Composition of action card types bought by each strategy of a less successful run.

Clearly, some evolution is happening with respect to the composition of action cards bought. Where this run has failed is in discovering the value of the Province card: as shown by figure 19, Provinces are rarely bought, at least until generation 679, and even then very few of them are bought, as we can see from figure 20.
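A probability of this kind can be computed with a simple hypergeometric argument: the chance that none of a handful of favored kingdom cards appears among the ten cards of a set-up, given the pool they are drawn from. The sketch below leaves the size of the kingdom-card pool as a parameter, since the exact pool available to the implementation determines the resulting figure; it is only meant to illustrate the calculation.

    from math import comb

    def prob_none_in_setup(favored, pool_size, setup_size=10):
        """Probability that none of `favored` specific kingdom cards appears
        among the `setup_size` cards drawn uniformly from `pool_size` cards."""
        return comb(pool_size - favored, setup_size) / comb(pool_size, setup_size)

    # For example, drawing ten kingdom cards from the 25 of the base game would
    # leave out four given cards about 10.8% of the time; smaller pools make
    # the probability correspondingly smaller.
    print(prob_none_in_setup(favored=4, pool_size=25))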
