AI Agent for Ants vs. SomeBees: Final Report


Wanyi Qian, Yundong Zhang, Xiaotong Duan

Abstract. This project aims to build a real-time game-playing AI agent for Ants vs. SomeBees that can beat a human player. Specifically, we implemented an A* search agent and reinforcement-learning agents (expectimax search and TD learning) for this tower defense game. The win rate increases significantly when using these AI algorithms compared with our baseline model. A* search gives a decent result but suffers from computational inefficiency and a lack of look-ahead; reinforcement learning is a good alternative for addressing these issues.

Keywords: Tower Defense Game, Search, Reinforcement Learning

I. INTRODUCTION

Ants vs. SomeBees [1] is a tower defense game [2] and a simpler version of the classic Plants vs. Zombies [3]. The goal of a tower defense game is to keep the invaders from reaching a certain location on the board, commonly the innermost point of the defender's region. The game typically consists of three main elements: the invaders, the defenders, and the target. The result is determined by whether the defenders have eliminated all the invaders or the invaders have reached the target. In Ants vs. SomeBees, these are the bees, the ants, and the ant queen, respectively. The appeal of a tower defense game is that it demands real-time strategic planning from the player. In Ants vs. SomeBees, the ultimate goal is to keep the ant queen alive by strategically placing ant soldiers to defend against the invading bees, which enter the game board with a predefined probability. Since each type of ant has a food cost and the player has constrained resources, a strategy must be developed to use those resources well.

In previous research [4], the game was modeled as a search problem. In that model, the state contained the existence and location of ants and bees on the board, the amount of food, and the number of turns since the beginning of the game. The authors concluded that reinforcement learning should not be considered because the state, as defined, lacks the Markov property, and A* was analyzed further for agent development. The relaxed problem assumed that no additional bees enter the board at each turn. The time-series nature of the game was handled by setting a sub-goal for each turn: eliminate all bees currently on the board. By setting the recalculation and execution rate according to the evasion time of the bees, each sub-problem became deterministic. However, as mentioned in [4], this relaxation leaves a gap between solving the sub-problem and solving the entire game, so heuristics were introduced to narrow the gap. The cost of each action was defined to be 1 per turn, and five heuristics (Strongest Ants Needed, Food, Fire Power, Closest Bee, Bee Armor) were introduced to address the issue of reaching only a local optimum for each sub-problem instead of a global optimum for the entire game. In [4], the Bee Armor heuristic gave the highest win rate (98%) among the five. Moreover, the bee-based heuristics performed better than the ant-based heuristics, and non-admissible heuristics also gave promising results. However, the previous research did not describe its board conditions (e.g., board size) in detail, so the promising results reported there were not fully justified.
II. MODEL (SCOPE DEFINITION)

In this section, we define the setup of the game.

Game Board: The game board consists of 3 lanes with 8 tiles in each lane.

Ants: There are 2 types of ants that can be placed on the board, each with a specific food cost, armor value, damage, and ability. The ant queen is automatically placed on the left side, outside the board, at the start of the game. See Table I for details of the insect types. Once an ant is placed, it cannot be moved to another position or removed from the board by the player; it can only be eliminated by the bees.

TABLE I
INSECT TYPES

Insect Type   Food Cost   Armor   Damage   Skill
Harvester     1                            generates 1 food per turn
Thrower       3                            attacks one position ahead
Bee           N/A                          attacks the ant in the same position

Game Rules: A single game consists of a series of turns. At each turn, a bee may enter a row from the right side with a certain probability, and all rows lead to the ant queen at the left end. During one turn, each thrower ant on the board throws one leaf at the leading bee in its row, each harvester ant generates one food unit, and all bees move one tile to the left. The player decides whether to place an ant on the board and, if so, which type, given the current board condition. Only one ant can be placed per turn. The result of the game depends on whether the ants survive the attack or the bees reach the ant queen.
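For concreteness, the game configuration above can be pictured with a small state container like the following. This is only an illustrative sketch; the field names (food, ants, bees, armor, and so on) are our own choices and not the actual interfaces of the modified game code.

from dataclasses import dataclass, field
from typing import List

NUM_ROWS, NUM_COLS = 3, 8          # 3 lanes, 8 tiles per lane

@dataclass
class Insect:
    kind: str                      # "Harvester", "Thrower", or "Bee"
    armor: int
    row: int
    col: int

@dataclass
class GameState:
    food: int = 4                  # every mode starts with 4 food units
    turn: int = 0
    ants: List[Insect] = field(default_factory=list)
    bees: List[Insect] = field(default_factory=list)

    def bees_in_row(self, r: int) -> List[Insect]:
        return [b for b in self.bees if b.row == r]

    def ants_in_row(self, r: int) -> List[Insect]:
        return [a for a in self.ants if a.row == r]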

There is one special effect when a bee enters a tile that holds a thrower ant: the bee stings the ant, killing it and removing it from the board, but the ant also slows the bee down by one time unit, so the bee remains in its tile for the next turn. The following figures show the start of the game and a sample display after a few turns.

Game Conditions: We tested all the models under three game conditions: easy, medium, and hard. In all modes, the game starts with 4 food units. The easy mode consists of one wave of 2 bees at time 2, three waves of 1 bee at times 3, 8, and 13, and one wave of 2 bees at time 15. The medium mode consists of five waves of 2 bees at the same time steps as the easy mode. The hard mode consists of one wave of 2 bees, three waves of 5 bees, and one wave of 3 bees at the same time steps.

III. GOAL

The goal of this project is to implement AI agents using search and reinforcement learning to play this game, and to compare the performance of these agents with the baseline model and the oracle.

IV. ORACLE

As no existing agent is available for this game, a top human player serves as the oracle. Since data about this game is insufficient, after becoming familiar with the game we played it multiple times and counted the win rate ourselves.

TABLE II
WIN RATE FOR ORACLE

V. BASELINE AND RESULT

To simplify the problem for the baseline, we broke down the goal of solving the whole game into finding the optimal strategy for each sub-game (each turn). For each turn, we assumed that no other bees would enter the game board. Moreover, we assumed all tiles on the game board are legal for deployment of ants. For the baseline strategy, we implemented a greedy algorithm that always deploys ants in the row where the difference between the number of bees and the number of ants is largest. Moreover, to ensure there is enough food for the greedy algorithm, we set up a ratio of the number of harvesters to the number of throwers that must be satisfied before the greedy deployment rule is followed.

The original Ants vs. SomeBees uses the locations of ants, the locations of bees, the amount of food, the ant types, and time as the state of the game. To implement our greedy algorithm, we added the number of harvesters, the number of throwers, and the number of ants and bees in each row of the board to the state. For each ant turn, the ratio of harvesters to throwers is calculated first. If the ratio is below the threshold, the agent is forced to choose a harvester, to ensure there is enough food for future attacks; otherwise a thrower is chosen for the current turn. Because the game board is a 3 * 8 grid, the deployment location of the ant must be chosen carefully. In our greedy algorithm, the highest-weight place is the one whose column is as close as possible to the ant queen (the leftmost tile of the board) and whose row has the largest difference between the number of bees and the number of ants among all rows. The win rate of our baseline is shown below.

TABLE III
WIN RATE FOR BASELINE

Note that we found the ratio between the number of harvesters and the number of throwers to be an important factor affecting the win rate. Specifically, the ratio that is optimal under easy mode leads to a significant lack of food in hard mode. A possible reason is that under easy mode the number of bees in each wave is small, so the preferred ratio is about 1. However, under hard mode the number of bees in some waves is significantly higher than in easy mode, which requires the agent to place more throwers relative to harvesters than it does under easy mode.
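As a rough illustration of the baseline just described, the per-turn decision can be sketched as follows. The argument names, the ratio_threshold default, and the food costs (taken from Table I) are assumptions for the sketch, not the exact values or interfaces used in our experiments.

def greedy_action(food, harvesters, throwers, bees_per_row, ants_per_row,
                  free_cols_per_row, ratio_threshold=1.0):
    """Greedy baseline: return (ant_type, row, col) for this turn, or None to pass."""
    # Keep enough harvesters so that food income can support future throwers.
    ratio = harvesters / max(throwers, 1)
    ant_type = "Harvester" if ratio < ratio_threshold else "Thrower"

    # Row: largest (#bees - #ants).  Column: as close to the queen (leftmost) as possible.
    row = max(range(len(bees_per_row)), key=lambda r: bees_per_row[r] - ants_per_row[r])
    if not free_cols_per_row[row]:
        return None                                       # no free tile in the chosen row
    col = min(free_cols_per_row[row])

    cost = {"Harvester": 1, "Thrower": 3}[ant_type]       # food costs assumed from Table I
    return (ant_type, row, col) if food >= cost else None

# Example: 4 food, 1 harvester, 1 thrower, bees pressing hardest in row 2
print(greedy_action(4, 1, 1, bees_per_row=[0, 1, 3], ants_per_row=[1, 1, 1],
                    free_cols_per_row=[{0, 1}, {0, 1}, {0, 1, 2}]))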
VI. ALGORITHM AND RESULT

A. Search Game State

As a natural extension of the conditions we used for the baseline algorithm, we can also optimize the game with a search method using a similar heuristic. The challenge, though, is that the game board changes when new bees enter the colony, and thus the best strategy for one board may not be as good for the next one.

The assumption we made to address this issue is the same as in the baseline: no bees will enter the board in future turns, which makes the game deterministic and thus solvable with a search algorithm. The parameters we used are defined as:

state(s) = the number of bees on the board and their positions, the number of ants on the board with their types and positions, and the amount of food
s_start = the initial state: the current game board
IsEnd(s) = the end state: the number of bees on the board == 0
Actions(s) = { place an ant (a type of ant and a location), or do not place an ant }    (1)
Cost(s, a) = 0 if at the end, the heuristic cost if not at the end

The heuristic looks at the difference between the armor of the bees and the damage power of the ants, and computes the total number of turns needed to eliminate all bees on the game board:

for each row:
    bee_armor = sum of the armor of all bees in the row
    ant_firepower = number of ants of each type in the row * their firepower
    turns_needed[row] = bee_armor / ant_firepower
return sum(turns_needed)

To confirm the consistency of this heuristic, we need to make sure that our assumptions relax the original problem. The heuristic does not consider how close the bees are to the ant queen or how this affects the game state, so we remove the constraint that the bees cannot pass the left edge; that is, we do not constrain how long a row has to be, which makes the problem a relaxation of the original. Therefore, the heuristic we used is consistent.

Given these parameters, the algorithm generates the optimal solution for the current state, and when new bees enter the board, the algorithm is rerun to generate a new solution. The win rate under this condition is given below:

TABLE IV
WIN RATE FOR A*

The win rate using the A* algorithm does not show much improvement over the baseline model (the performance is poorer than the baseline's for hard mode). We noticed that the poor performance could come from two limitations of this algorithm. The first limitation is the lack of look-ahead. Unlike the logic a human adopts when playing such a defense game, which is to plan not only for the current state but also for the future, the agent does not recognize that an action made at the current turn influences future turns. For example, in the first turn, when no bees have entered the board, instead of planting a harvester (as a human player usually does), the agent simply does nothing, since there are no bees on the board yet. And when all the bees have been eliminated from the board, instead of planting a harvester or a thrower to prepare for future attacks, the agent again does nothing. This lack of planning ahead limits the success rate of the agent in a more intense game. The second limitation is computational inefficiency: when searching through the paths to the end goal, the state space expands exponentially as the number of bees on the board increases. Since we did not impose a time constraint on each turn, the average time needed to make one move in hard mode can be up to several minutes. Although A* is a powerful tool for such a search game, its performance is severely limited as the search space expands.
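For reference, the turns-needed heuristic defined above can be written out as a small function. The row representation (a list of bee armor values plus a thrower count per lane) and the per-thrower damage of 1 are assumptions of this sketch, as is the choice to return infinity for a lane that has bees but no attackers, a case the pseudocode above leaves unspecified.

def turns_needed_heuristic(rows):
    """rows: one (bee_armor_values, num_throwers) pair per lane."""
    total = 0.0
    for bee_armors, num_throwers in rows:
        bee_armor = sum(bee_armors)          # total armor left to remove in this lane
        firepower = num_throwers * 1         # assumed: each thrower deals 1 damage per turn
        if bee_armor == 0:
            continue                         # nothing to clear in this lane
        if firepower == 0:
            return float("inf")              # bees present but no ant can hit them
        total += bee_armor / firepower
    return total

# Two bees of armor 3 faced by one thrower, plus one bee faced by two throwers
print(turns_needed_heuristic([([3, 3], 1), ([], 0), ([3], 2)]))   # 6.0 + 1.5 = 7.5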
To address these issues, we added constraints that must be satisfied before new actions are generated with A*. The first constraint is that if no harvester has been placed yet, a harvester will be placed on the board; this means a harvester is placed at time step one. (Note that in all of the game conditions, the first wave of bees enters at time step two.) The second constraint is similar to the one we used in the baseline: if the ratio of the number of throwers to the number of harvesters is greater than 3, a harvester should be placed. This makes sure there will be enough food for future attacks. Only if the game condition satisfies the above constraints is the A* algorithm used to determine the next action. The win rate under this condition is shown below:

TABLE V
WIN RATE FOR A* WITH CONSTRAINTS ADDED

We see that, together with the constraints, the performance of A* is dramatically improved, and the computation is much more efficient as well. One thing to note, though, is that adding constraints is a way of using domain knowledge. The constraints we added mainly serve the purpose of securing enough food for future turns, because we are aware of future attacks. Since there are only 4 food units at the start of the game, food is a very stringent factor in determining which actions the agent can take. If we do not make sure there is enough food each turn, we get the first result (poor performance), since the A* agent does not see the complete picture, only considers the current state, and cannot choose the optimal action if there is not enough food to support it. Once we add the constraints, so that food is no longer as stringent as before, the performance of the A* agent increases significantly, since the agent has more resources to utilize and more feasible actions.
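The resulting turn-level decision procedure can be sketched as below. The a_star_plan callable stands in for the search described above, and the state dictionary and best_harvester_tile helper are placeholders of ours rather than the project's actual interfaces.

def choose_action(state, a_star_plan):
    """One ant-turn decision for the constrained A* agent."""
    harvesters, throwers = state["harvesters"], state["throwers"]

    # Constraint 1: make sure at least one harvester exists
    # (so one is placed at turn 1, before the first bee wave at turn 2).
    if harvesters == 0:
        return ("Harvester", best_harvester_tile(state))

    # Constraint 2: keep throwers / harvesters <= 3 so food income keeps up.
    if throwers / harvesters > 3:
        return ("Harvester", best_harvester_tile(state))

    # Otherwise fall back to the A* plan computed for the current (relaxed) board.
    return a_star_plan(state)

def best_harvester_tile(state):
    # Illustrative placement rule: as far left (close to the queen) as possible.
    return min(state["free_tiles"], key=lambda tile: tile[1])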

B. Adversarial Game State Search

1) Expectimax with a Manually Defined Evaluation Function: Next, we wanted to test the performance of reinforcement learning. If we define the state as (board layout, food, number of bees left) and the action as (ant type, location), then the problem becomes a Markov decision process (MDP). Nevertheless, this game does not have a specific reward for taking an action; it only produces two results, win or lose, which is all that matters. Thus, typical ways of solving an MDP, such as linear or dynamic programming, are not applicable. As stated in the project proposal, we mainly considered two candidate approaches for this problem: a search tree or reinforcement learning. At this stage, we have implemented the search algorithms and plan to extend to Q-learning. In the rest of this section, we describe the search methods in detail.

1. Expectimax with an evaluation function. In this game, at every turn the agent can either choose a type of ant to place or simply do nothing, while the adversary bees enter the board randomly. Once a bee enters the board, it follows a fixed policy. Intuitively, an expectimax search can be used to build the search tree. Since the policy of the bees is fixed, we can write the search strategy as:

V_max,bees(s, d) =
    Evaluation(s),                                           if d = 0
    max_a V_max,bees(Succ(s, a), d),                         if it is the agent's move
    sum_a pi_bees(s, a) * V_max,bees(Succ(s, a), d - 1),     if it is the bees' move      (2)

where a ranges over LegalActions(s). Here d is the search depth, and the output of the evaluation function is treated as the reward given the current state, which contains the board layout, the food, and the number of bees left.

To implement this search method, there were two main tasks for the team: the first was to change the original game structure, and the second was to determine a good search strategy. The original game is designed to consider only input from a human player, so it exposes minimal information about the board. To let us experiment with the algorithm, we spent a decent amount of time modifying the original code (including the GUI) so that the game can be controlled by the agent. Specifically, we built two separate classes (ant and bee) that take an action and a place as input and act each turn. We also re-coded the game board so that the board information is easier to obtain.

For the search, the logic is exactly as described in (2). However, a plain search easily runs out of time (we set the agent's decision time to 1 second per turn), as there are 3 * 8 locations on the game board. The time complexity is O((3 * 8 * 3N)^d) per turn, where N is the number of ant types. Hence, a depth of 2 or 3 is about the limit of the plain search. Nevertheless, there is some pruning we can do, most obviously on the deployment location. Ants should be deployed as far from the bees' hive as possible, to increase their life span. The ideal game board looks like a group of harvesters working at the leftmost side of the board, while the attacker ants directly face the bees. Hence, the deployment strategy for harvester ants is: as far left as possible; and the strategy for attacker ants is: the farthest position from which they can still attack the nearest bees (if any exist). With this pruning, we are able to run a depth-2 expectimax search.
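A compact sketch of the recursion in (2) is shown below. The callables evaluate, agent_actions, bee_outcomes, and successor stand in for the game-specific pieces and are assumed names, not the project's actual functions; bee_outcomes(state) is assumed to return (probability, outcome) pairs describing the random bee entries, and agent_actions(state) is assumed to always include the do-nothing action.

def expectimax(state, depth, agent_to_move,
               evaluate, agent_actions, bee_outcomes, successor):
    """Depth-limited expectimax with a fixed (random-entry) bee policy, as in (2)."""
    if depth == 0:
        return evaluate(state)

    if agent_to_move:
        # Agent move: try every legal placement (or doing nothing).
        # The depth is not decremented until the bees have also moved, matching (2).
        return max(expectimax(successor(state, a), depth, False,
                              evaluate, agent_actions, bee_outcomes, successor)
                   for a in agent_actions(state))

    # Bees' move: expectation over the random bee entries.
    return sum(p * expectimax(successor(state, o), depth - 1, True,
                              evaluate, agent_actions, bee_outcomes, successor)
               for p, o in bee_outcomes(state))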
For the evaluation function, an ideal approach would be to use Q-learning to learn a good combination of the features of the game board. However, due to the complexity, the evaluation function was mainly designed by hand through empirical experiments and human intuition. After many experiments, our evaluation function uses the following criteria (a small sketch of such an evaluation function is given at the end of this subsection):

Criterion 0: IsEnd(state) or IsWin(state).
Criterion 1: there has to be at least one harvester on the game board.
Criterion 2: in easy and medium mode, maintain a 1:1 ratio between attackers and harvesters.
Criterion 3: every row should contain at least one ant; penalize rows where the number of bees is larger than the number of ants.
Criterion 4: armor already lost by the bees on the game board receives a positive reward.
Criterion 5: the less total armor the bees have, the better.
Criterion 6: for all the ants, the closer to the left side, the better.

TABLE VI
WIN RATE FOR EXPECTIMAX

All of these criteria are reasonably intuitive. We found that with this evaluation function and a depth-2 expectimax search, our algorithm is able to beat the greedy baseline (which is actually a solid baseline). This is mainly because the agent can foresee upcoming situations. For example, when the game board contains only one harvester and one food, the greedy algorithm tells the agent to do nothing until the food reaches 4 and then place a thrower, while the expectimax agent chooses to place another harvester and, at the end of that turn, immediately has 2 food. By doing this, the expectimax agent reaches 4 food at about the same time as the greedy agent, but with one more harvester. During the expectimax search, we assumed that the agent knows exactly when and how many bees will enter the board. With this expectimax agent, we reached win rates of 42/50, 15/50, and 5/50 in easy, medium, and hard mode, respectively. We only used two types of ant here; increasing the number of ant types or the search depth would likely increase the win rate further, but the time complexity also increases exponentially.
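The sketch below shows one way the criteria above could be combined into a weighted evaluation function. The state fields, feature definitions, and weights are illustrative placeholders, not the hand-tuned values used in our experiments.

WIN_SCORE, LOSS_SCORE = 1e6, -1e6

def evaluate(state, weights):
    """Hand-crafted style evaluation: a weighted sum of the criteria listed above."""
    if state["is_loss"]:                                              # Criterion 0: terminal states dominate
        return LOSS_SCORE
    if state["is_win"]:
        return WIN_SCORE

    features = {
        "has_harvester": 1.0 if state["harvesters"] > 0 else 0.0,            # Criterion 1
        "ratio_balance": -abs(state["throwers"] - state["harvesters"]),      # Criterion 2
        "outnumbered_rows": -sum(1 for row in state["rows"]
                                 if row["bees"] > row["ants"]),              # Criterion 3
        "bee_armor_lost": state["bee_armor_lost"],                           # Criterion 4
        "bee_armor_left": -state["total_bee_armor"],                         # Criterion 5
        "ant_leftness": -sum(col for _, col in state["ant_positions"]),      # Criterion 6
    }
    return sum(weights[name] * value for name, value in features.items())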

2) TD Learning: Reinforcement learning (RL) is another classic way to solve an MDP, as it can learn the value of each action in each state. RL can also respond much faster at play time after sufficient training. Possible candidates include TD learning. However, to fully utilize RL, we need to define a good feature extractor so that the learning can generalize; more importantly, the loss function and the update rule need to be carefully specified and evaluated. The general form of the update used by these methods is:

w ← w − η (prediction(w) − target) ∇_w prediction(w)    (3)

where w is the learned weight vector, weighting each extracted feature, and η is the learning rate.

To implement TD learning, the features were first vectorized. The complete feature set was:

1. number of throwers on the board
2. ratio of harvesters to throwers > 2
3. ratio of harvesters to throwers < 1
4. IsEnd(state) or IsWin(state)
5. total armor of the bees on the board
6. sum of the deployment locations (columns) of the harvesters
7. sum of the deployment locations (columns) of the throwers
8. current state score
9. number of bees minus number of ants for each row
10. sum of the deployment locations (columns) of the harvesters / number of harvesters on the board
11. sum of the deployment locations (columns) of the throwers / number of throwers on the board
12. danger bees in each row

For the danger-bees feature, a row contains danger bees when the total armor of its bees is larger than that of its ants; the feature was calculated, as a heuristic, as the difference between the total armor of the bees and the number of throwers in the row, multiplied by the bees' location.

To increase the convergence speed, we first manually tuned an expectimax agent with function approximation (features 1, 2, 4, 5, 6, 7, 8, 12), using weights based on our intuition, which achieved a 100% win rate on the medium level of the game. These weights were used as prior knowledge influencing the choice of the optimal action during the learning process: with 30% probability, the agent chooses the optimal action produced by the evaluation function with the manually tuned weights. Then the TD learning agent, using features 1, 2, 3, 4, 5, 8, 10, 11, 12 with this warm start, was implemented. As with the expectimax agent, the legal actions were enumerated each turn and the resulting value of each action was calculated; with probability 0.3 the agent chooses the optimal action generated by the inner product of the manually tuned weights and the corresponding features, to speed up convergence.

During tuning we noticed that some features, such as the number of throwers on the board, grew without bound, while other features, such as the score of the current state, remained constant. Therefore, we manually removed those features, and the remaining features (features 2, 3, 4, 5, 8, 12, referred to as the full features below) were used to generate the results. The learning rate was defined as 1 / step size. The step size increases as the game time increases, and it was initialized within the range of 100 to 1000 to find the optimal initial value. The discount was set to 1 and the reward to 0, and both were held constant during learning. Besides this combination of features, we also implemented a location-only feature set containing the locations of each harvester, thrower, and bee, giving a 72 * 1 vector, to analyze how informative the insect locations are. The same 7:3 strategy (70% learned weights, 30% manually tuned weights) was also used in this TD learning agent to speed up the convergence of the weights.

The survival time in losing iterations, the weight difference in terms of the 2-norm, and the win/loss result of each iteration for the TD learning agent using the full features and the location feature are shown below. The survival time for losing iterations was smoothed with a median filter.
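A minimal sketch of the linear TD update in (3), together with the 70/30 action-selection mix described above. The helper names and the simple list-based weight representation are our own; value_learned and value_prior are assumed to be callables that score a candidate action under the learned and hand-tuned weights, respectively.

import random

def td_update(w, phi_s, phi_s_next, step_size, reward=0.0, discount=1.0):
    """One TD(0) step for a linear value function V(s) = w . phi(s), as in (3)."""
    prediction = sum(wi * xi for wi, xi in zip(w, phi_s))
    target = reward + discount * sum(wi * xi for wi, xi in zip(w, phi_s_next))
    eta = 1.0 / step_size                      # learning rate = 1 / step size
    return [wi - eta * (prediction - target) * xi for wi, xi in zip(w, phi_s)]

def pick_action(actions, value_learned, value_prior, prior_prob=0.3):
    """With 30% probability follow the hand-tuned prior weights (warm start)."""
    value = value_prior if random.random() < prior_prob else value_learned
    return max(actions, key=value)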

TABLE VII
WIN RATE WITH TWO TYPES OF FEATURES

Feature Type                                 Easy    Medium    Hard
Location Feature (learning rate = 0.001)
Full Features (learning rate = 0.001)

According to the results, the location information is informative enough to reach a reasonable win rate in easy and medium mode. However, according to the win/loss record of each iteration and the survival time of losing iterations, TD learning did not work as well as expected. We expected the density of winning iterations to increase over the course of training, eventually approaching a horizontal line at y = 1 in the plot. In our results, only a weak learning trend appears, at the early stage of the location-feature method in easy and medium mode and of the full-features method in medium mode. A possible reason is that the features we created were not informative enough, so the value produced by the evaluation function could not reflect the true value of a state. Moreover, the shape of the objective induced by our selected features can also influence the convergence of the gradient-descent update: the objective may not be convex, so gradient descent can fail to some extent, or the update may become trapped at a local minimum, which would explain the unstable win status at the later stage of training. Even though adjusting the learning rate helped improve the win rate from 0.56 to 0.71 for TD learning with the full features in medium mode, it still could not boost the performance much further. Additionally, since many sequential states lead to the end state, if the evaluation of one state is off (i.e., the value of the state should be positive but the evaluation function produces a negative value), the search may fail and result in a low win rate. The large number of possible state combinations also makes winning harder than losing.

During the learning process, our TD learning suffered from the difficulty of tuning the learning rate across modes. For example, the optimal learning rate in medium mode might not be the optimal learning rate in hard mode. In our approach, the initial learning rate was tuned manually over a range starting at 0.01, and we observed that the win rate first increased and then decreased as the learning rate was decreased from 0.01.

VII. CONCLUSION AND FUTURE WORK

In our project, we modified the previously implemented A* algorithm and implemented an expectimax agent and a TD learning agent. In conclusion, the A* agent alone does not perform well; it faces the challenges of lacking look-ahead and of computational inefficiency. We addressed these issues by adding constraints based on domain knowledge of the game; in this setting the agent gives very high performance for easy and medium mode, and a decent result for hard mode. Compared to the A* algorithm, the expectimax agent and the TD learning agent make decisions much faster, and the TD learning agent performed slightly better than the expectimax agent, with a higher win rate in medium mode. Moreover, we confirmed that the initial learning rate influences the performance of the agent in terms of win rate. In order to find an optimal combination of parameters for TD learning (i.e., learning rate and discount), we could wrap our script to try various combinations of parameters and return the combination with the highest win rate.
Also, we plan to add the location feature and other informative features to the full feature set to better describe the state value.

VIII. ACKNOWLEDGMENT

The game base code comes from the project developed for UC Berkeley CS 61A. We would also like to extend our gratitude to Moses and Clara for giving us access to their source code for this project; most of the backbone A* code we implemented is based on their work, with our modifications to the heuristics and constraints and some implementation changes.

IX. REFERENCES

1. Ants vs. SomeBees project reference page: inst.eecs.berkeley.edu/~cs61a/su13/projects/ants/ants
2. Tower Defense Game:
3. Plants vs. Zombies:
4. Leshem, Yotam, et al. "Plants vs. Zombies: Introduction to AI Final Project." Hebrew University of Jerusalem. Manuscript.
