HUJI AI Course 2012/2013. Bomberman. Eli Karasik, Arthur Hemed


Table of Contents
- Game Description
- The Original Game
- Our version of Bomberman
- Game Settings screen
- The Game Screen
- The Progress Screen
- Artificial Intelligence Agents
- Reflex Agents
- MiniMax based Agents
- Heuristic
- Alpha-Beta
- ExpectiMax
- Q-Learning
- Analysis
- General Conclusions
- Pruning Efficiency
- Effects of the Number of Bombs
- Appendix A - Compilation and Running

Game Description

The Original Game

Bomberman was originally released by the Japanese developer Hudson Soft in 1983. Known as ボンバーマン (Bonbāman) in Japan and as Dynablaster in Europe, Bomberman is a strategic, 2D, maze-based real-time game. The goal is to place bombs strategically in order to kill enemies and destroy obstacles, while avoiding the blast of any bomb. A bomb can also set off another bomb that is within its range of fire. A bomb destroys obstacles, but its fire is stopped by walls. Usually there are power-ups, lying in open spaces or hidden under destroyables, that help the player achieve the goal: increasing the number of bombs that can be placed simultaneously, increasing the blast radius, increasing movement speed, and others. Bomberman also features a multiplayer mode, where multiple bombermen compete and the last one left standing wins.

The Bomberman series has featured more than 70 different games, most of them in the format above, though some were adventure, platformer, puzzle, or racing variations rather than maze games. The game (surprisingly) also has a storyline, though it appeared only in later games: it is set in a galaxy known as the Bomber Nebula, on Planet Bomber. The main character, Bomberman, grows bored of making bombs in an underground factory of an evil empire. He hears a rumor that robots reaching the surface become human, and so he decides to escape. In other versions he is the first robot of his kind, created by Dr. Mimori, and despite being a prototype he accepts his role as the defender of justice.

Here is a screenshot from the original game:

[Screenshot of the original Bomberman]

More recent games also feature a 3D view of the maze instead of the traditional 2D view - here is an example from Bomberman: Act Zero, released in 2006:

[Screenshot of Bomberman: Act Zero]

However, most players did not like the new design and preferred the original; that game was widely criticized. Bomberman games still come out today, though they are now published by Konami, which acquired Hudson Soft in 2012.

Our version of Bomberman

Our version includes only a multiplayer mode, with two players competing against each other. Players can be either human controlled or an artificial intelligence. Unlike the original games we have no power-ups: the number of bombs a player can place and their blast radius stay the same during the game. A single hit to either player ends the game. The destroyables in the arena are generated randomly; players are always placed on opposite sides and always have enough space to place their first bomb. Our version is turn-based and discrete: each player moves or places a bomb on his turn, and players can stand only on integer coordinates. However, the default turn length is small (75ms), so this is almost unnoticeable to a human player. The game can end in a tie if both players are left standing after a fixed number of turns, or if both players were burned to ashes on the same turn.

Game Settings screen

The first screen that appears when running the game is the game settings screen. In this screen the user can choose all the settings for the game:

- The players that will play the game: human players or artificial intelligence.
- The number of games to play or to run with agents and see the results; the user can also choose whether to display the game user interface.
- The size of the square board.
- The number of turns per game, so that the game ends and does not run forever (which is possible if the agents always run from each other and never place bombs, or if the human players do not act).

- The number of bombs per player.
- The probability that a destroyable will appear in any open space (besides the two squares adjacent to the starting positions; these are always open so that it is possible to place the first bomb).
- The length of each turn. As mentioned, our game is turn-based, so it is important to choose the right turn length: it is very hard, even impossible, for a human player to play if the turn length is under 50 milliseconds, because most human players cannot think and react that fast. On the other hand, it is hard for a human player if the turn length is more than 100ms, because the game no longer feels like real time. For human players, the best setting, which is also the default, is 75ms.

The Game Screen

After the user has chosen all the settings, and if the game user interface is enabled, pressing 'start!' opens the next screen - the game screen itself, which looks like this (for board size 13):

[Screenshot of the game screen]

The objects in the game:

- The first player
- The second player
- Indestructible wall
- Destroyable wall - can be destroyed by the explosion of a bomb
- Bomb
- Fire from a bomb explosion - destroys walls and kills players
- The ground

The controls of the game for human players:

- Human player (WASD): W - up, S - down, A - left, D - right; press CTRL to place a bomb.
- Human player (Arrows): use the arrow keys to move; press SPACE to place a bomb.

The Progress Screen

The progress screen appears after all the games have ended, or while they are running if you chose to run without a user interface. Note that games without a UI usually run faster. The progress screen shows how many turns have passed in the current game, how many games have already been played, and the current scores.

Artificial Intelligence Agents

An important thing to notice is that there can be no trivial AI for this game: an agent must place bombs very selectively, otherwise it would commit suicide very quickly. Thus, a random agent is not interesting - it would simply always lose. An agent that never places bombs is unlikely to succeed either, since it stays trapped in its starting area; if the enemy opens just one path to it and places a bomb there, it is trapped and loses. So, even to end the game in a tie, an agent must clear some space around itself, and doing that without committing suicide is no simple task.

Reflex Agents

The Reflex agent is the simplest AI we developed. It chooses the next turn solely by looking at the state of the board, without considering any future actions of the opponent. We maintain two numbers for each position on the board:

1. We mark the places on the map that are going to blow up in the near future, and how many turns remain until that happens. We do this by going over each bomb and, for each direction, marking empty spaces as about to blow up, either until we reach some item (wall, destroyable, another bomb) or until we leave the blast radius of that bomb. A minimal sketch of this marking pass is shown below.
2. Next, we score places by how good it would be to arrive at them in the future. First of all, places that are taken by some object, or that we decided are going to blow up, receive a very low score. Next, we subtract the distance from that place to the enemy, ignoring all objects in the way (Manhattan distance), since a player should strive to reach the enemy and blow him up. The most important part, which is added last, is the bomb score, explained later.
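The danger-marking pass in step 1 can be sketched roughly as follows. This is only an illustrative Java sketch, assuming a simple cell-code board and a hypothetical Bomb class; it is not the project's actual code.

    import java.util.Arrays;
    import java.util.List;

    public class DangerMap {
        static final int EMPTY = 0;   // hypothetical cell code for an open square

        /** A bomb at (x, y) that explodes in turnsLeft turns, with the given blast radius. */
        static class Bomb {
            final int x, y, turnsLeft, radius;
            Bomb(int x, int y, int turnsLeft, int radius) {
                this.x = x; this.y = y; this.turnsLeft = turnsLeft; this.radius = radius;
            }
        }

        /**
         * For every cell, the number of turns until it is covered by fire,
         * or Integer.MAX_VALUE if no bomb currently threatens it.
         */
        static int[][] markDanger(int[][] board, List<Bomb> bombs) {
            int n = board.length;                       // the board is square
            int[][] danger = new int[n][n];
            for (int[] row : danger) {
                Arrays.fill(row, Integer.MAX_VALUE);
            }
            int[][] dirs = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
            for (Bomb b : bombs) {
                danger[b.x][b.y] = Math.min(danger[b.x][b.y], b.turnsLeft);
                for (int[] d : dirs) {
                    // Walk outward until we hit any item or leave the blast radius.
                    for (int step = 1; step <= b.radius; step++) {
                        int x = b.x + d[0] * step, y = b.y + d[1] * step;
                        if (x < 0 || y < 0 || x >= n || y >= n || board[x][y] != EMPTY) break;
                        danger[x][y] = Math.min(danger[x][y], b.turnsLeft);
                    }
                }
            }
            return danger;
        }
    }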

Given these two statistics, we decide movement using an algorithm based on Breadth First Search:

    Initialize the BFS queue
    Insert our current position into the queue
    Initialize best = current position
    While the queue is not empty:
        take the top of the queue
        if this location is occupied by an item,
          or the distance to this location is greater than the number of turns until it blows up:
            continue
        if (score - distance) > (score - distance) of the best location:
            best = current location
        insert all unvisited neighbours into the queue
    Backtrack from the best position to the start, finding the move used to reach it
    Return the direction and score of the best position

Now, back to the Bomb Score: we want to score how good a place is for placing a bomb in the future, and this depends on two factors - are there destroyables around it that the bomb can blow up and thus clear a path, and is it possible to run away from the bomb after placing it, so that we do not commit suicide. For that, we actually make a recursive call, finding the best action from a state in which we have reached the given destination and placed a bomb (and assuming that the enemy did nothing all this time...). If the best action returned has a high score, that means we can run away, and we return 5 * (number of destroyables in the blast radius). If not, it means the best we can do is walk to a place that will blow up in the future, so placing a bomb here is a bad decision and we return -1. Only the first call to getting the best action makes this recursive call; the second call is not recursive and its bomb score is simply 0.

But this is only the decision about the best movement at a given state. Alternatively, we may decide to place a bomb. There are a few situations that make us place a bomb (assuming we have a bomb left to place, and we are not standing on a bomb):

1. The Bomb Score is greater than zero, meaning that placing a bomb here will blow up a destroyable and we can still run away.
2. The enemy is close (Manhattan distance less than 3) and the Bomb Score is non-negative, meaning we can run away after placing. We do not know the other player's future moves, but we do know that this will constrict his movement and may blow him up.

So, if one of the above conditions holds, we place a bomb instead of making the best movement.

Runtime: this agent runs in O(N^4) time, where N is the size of the board, making it very fast even on large boards where N=29.

The agent described above was later renamed the Aggressive Reflex agent, due to the fact that it always chases the other player and places a lot of bombs around him; this is also its weakness, discussed in the Analysis section. We also created a Balanced Reflex agent, which does not have the second trigger for placing bombs: it will not place bombs around the enemy indiscriminately, but rather tries to survive longer.

MiniMax based Agents

Minimax is a decision rule for two-player zero-sum games, based on the minimax theorem by John von Neumann:

Theorem: In every two-player zero-sum game there is a value V (called the value of the game) and mixed strategies for the players such that: given the second player's strategy, the best payoff for the first player is V, and given the first player's strategy, the best payoff for the second player is -V.

In other words, maximizing your payoff is the same as minimizing the enemy's payoff. Thus, the minimax algorithm takes the action that maximizes the payoff, assuming the enemy in turn minimizes our payoff, assuming we will maximize our payoff in the next turn, and so on... While the minimax algorithm chooses the optimal action, we cannot apply it that way to our game, as the number of turns until some player wins is too large to consider all the options. Thus, after a fixed depth we employ a heuristic that scores a state by how close we are to winning or losing. The minimax agents we implemented use iterative deepening: we run the search for increasing depths until a certain timeout is reached - in our case the time limit for the turn - and take the result from the deepest search that we had time to complete.
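The iterative-deepening driver can be sketched as below. This is an illustrative Java sketch against a hypothetical GameState/Action abstraction, not the project's actual classes; heuristic() stands for the evaluation described in the next section, applied at the depth cutoff and at terminal states.

    import java.util.List;

    public class IterativeDeepening {
        /** Hypothetical game-state abstraction, just enough for a depth-limited minimax. */
        interface GameState {
            List<Action> legalActions(boolean maximizingPlayer);
            GameState apply(Action a, boolean maximizingPlayer);
            boolean isTerminal();
            double heuristic();   // evaluation at the depth cutoff (handles win/loss too)
        }
        interface Action {}

        /** Searches increasing depths until the turn's time budget runs out. */
        static Action chooseAction(GameState root, long timeLimitMillis) {
            long deadline = System.currentTimeMillis() + timeLimitMillis;
            Action best = root.legalActions(true).get(0);   // fallback: any legal move
            for (int depth = 1; System.currentTimeMillis() < deadline; depth++) {
                Action bestAtDepth = null;
                double bestValue = Double.NEGATIVE_INFINITY;
                for (Action a : root.legalActions(true)) {
                    if (System.currentTimeMillis() >= deadline) {
                        return best;   // keep the result of the deepest completed search
                    }
                    double v = minimax(root.apply(a, true), depth - 1, false);
                    if (v > bestValue) {
                        bestValue = v;
                        bestAtDepth = a;
                    }
                }
                best = bestAtDepth;   // this depth finished within the time limit
            }
            return best;
        }

        static double minimax(GameState s, int depth, boolean maximizing) {
            if (depth == 0 || s.isTerminal()) return s.heuristic();
            double best = maximizing ? Double.NEGATIVE_INFINITY : Double.POSITIVE_INFINITY;
            for (Action a : s.legalActions(maximizing)) {
                double v = minimax(s.apply(a, maximizing), depth - 1, !maximizing);
                best = maximizing ? Math.max(best, v) : Math.min(best, v);
            }
            return best;
        }
    }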

Heuristic

Our heuristic for a given state is composed of two main components:

1. We subtract the distance to the enemy. This is computed by BFS, ignoring destroyables and stopping the search only at walls and bombs.
2. For each bomb on the board, we score it, taking into account what it will destroy and how much danger we are in from it; a detailed explanation follows.

For each bomb, we expand in all the possible blast directions. If we reach a destroyable before the end of the blast radius, we add 5 + 20/(time to explosion). Taking into account the time until the explosion is important, because it makes bombs placed earlier better than bombs placed later; otherwise, the agent may consider placing a bomb now and waiting a few turns before placing it as the same thing. If a player is in the blast radius, the value of that is (bomb fuse turns + 5 - turns left until explosion) * (blast radius + 2 - distance to bomb): the closer the bomb, or the sooner it explodes, the more threatening it is. We add or subtract that value from the score depending on which player is in danger - us or the enemy. Special cases for the heuristic are death and win, for which we subtract or add a large constant. We also add a small random value in the range (0, 0.01); otherwise the player tends to move back and forth between two states while waiting for a bomb to blow up and having nothing better to do. This makes its movement a little more interesting and rational.

Alpha-Beta

The Alpha-Beta agent uses a version of the MiniMax algorithm that considers far fewer game states by pruning: it keeps the best value found so far for the maximizing player and the best value found so far for the minimizing player in the already explored neighboring branches of the search tree. This allows us to ignore many branches that we can be sure will never be played. For example, if the maximizing player finds that in the current branch the minimizing player can reach a value below the maximizer's best value so far, the maximizer will never play into that branch. And vice versa: if the minimizing player finds that the maximizing player can reach a value higher than the minimizer's current best, it will never play the action that leads to this branch of the search tree. In other games, Alpha-Beta usually achieves a considerable speedup over the basic Minimax agent. We will later compare the average search depth of the Alpha-Beta agent against that of the ExpectiMax agent, which uses the full Minimax tree and is explained below.
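Schematically, the pruned recursion looks like the sketch below. It is meant as a drop-in replacement for the minimax method in the earlier iterative-deepening sketch, reusing the same hypothetical GameState/Action abstraction; it is illustrative only, not the project's actual code. The initial call passes alpha = Double.NEGATIVE_INFINITY and beta = Double.POSITIVE_INFINITY.

    /**
     * Depth-limited minimax with alpha-beta pruning. alpha is the best value the
     * maximizing player is already guaranteed elsewhere in the tree, and beta is
     * the best (lowest) value the minimizing player is guaranteed; once alpha >= beta,
     * an optimal opponent will never let the game reach this branch, so we cut it off.
     */
    static double alphaBeta(GameState s, int depth, double alpha, double beta, boolean maximizing) {
        if (depth == 0 || s.isTerminal()) return s.heuristic();
        if (maximizing) {
            double value = Double.NEGATIVE_INFINITY;
            for (Action a : s.legalActions(true)) {
                value = Math.max(value, alphaBeta(s.apply(a, true), depth - 1, alpha, beta, false));
                alpha = Math.max(alpha, value);
                if (alpha >= beta) break;   // the minimizer already has a better option elsewhere
            }
            return value;
        } else {
            double value = Double.POSITIVE_INFINITY;
            for (Action a : s.legalActions(false)) {
                value = Math.min(value, alphaBeta(s.apply(a, false), depth - 1, alpha, beta, true));
                beta = Math.min(beta, value);
                if (alpha >= beta) break;   // the maximizer already has a better option elsewhere
            }
            return value;
        }
    }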

ExpectiMax

Minimax assumes that the opponent is optimal, meaning it plays Minimax itself. This is usually not the case, especially against human players, who in our game are expected to act fast and do not have enough time to think hard about their next move. ExpectiMax instead assumes that the opponent plays each action with some probability; in our implementation we assume a uniform distribution over the legal actions in the state. This unfortunately makes pruning impossible, since to compute the value of a tree we need to consider all of it.

Q-Learning

Q-learning is an off-policy reinforcement learning algorithm we learned in class. It stores a q-value for each pair of state and action. The policy at a state is the action with the highest q-value. During training, in each step, the agent either plays the policy or makes a random move with some exploration probability epsilon. After the step, values are updated using the value (highest q-value) of the resulting state with a discount factor, and a reward function that defines the reward for a given (state, action, state) transition.

Obviously, we cannot learn this way over the full state space. For example, at board size 21 we have 274 possible destroyables that are either there or missing; assuming 4 bombs for each player, each bomb takes 25 turns to blow up and has a 2-coordinate position in 1-20, plus 2-coordinate positions for each player in 1-20. That gives about 10^120 possible states... 10^40 times more than the estimated number of atoms in the observable universe.

This was unfortunately a failed attempt: no matter what feature extraction scheme we designed to decrease the state space, and no matter what the reward function was, we were not able to create a good reinforcement learning player. At our best, and this is what we hand in, we can successfully blow up a stationary enemy at any board size. Obviously, we could have gotten such a result with a very simple state extraction - a direction to the stationary enemy and a flag for whether we already set a bomb there; rewards would be good for reaching the enemy before a bomb is placed and for getting away after that, and bad for doing other things. Our scheme is definitely more complicated, but it still cannot beat a stationary enemy when destroyables block the path, and definitely not an enemy that is trying to bomb it.

We think there is quite a simple fact that prevents the creation of a good reinforcement learning player in Bomberman: the large amount of time between the action of placing a bomb and the moment its effects are seen - with 25 turns until the bomb blows up, that is 12 turns for the learning player. So, even if the state just before a bomb blows up does eventually receive some value and a good policy, this does not propagate 12 states backwards, as the chances of reaching that same state again are remote. No matter what we tried, the learning agent always learned that placing a bomb near a destroyable at the start of the game is a bad action, since the number of episodes where it blows up because of that bomb (or the one after it, or another one - there are many destroyables to clear...) is much bigger than the number of episodes where it actually wins against the opponent; setting a lower learning rate alpha did not help. Thus, when playing on a map with destroyables, the only thing the agent learns is that it should not place bombs...

The feature extraction scheme we submitted works the following way:

- A direction to the enemy valued 0-8, based on which of the 8 neighbours is closest to the enemy, with one more value for the case where we stand on the same tile.
- A bomb is represented by the number of destroyables it will blow up when exploding (assuming that more is likely better), and a direction to a player if he is in its line of fire. We did not include the number of turns until it blows up - apparently this already increases the state space too much.
- Destroyables are not represented at all - we account for them in the reward function. Storing the number of destroyables left just increases the state space, and the distance to the closest destroyable did not seem to help either.

State value function:
- For each bomb: 5 for each destroyable it reaches, 5 if the enemy is in range, -10/(time left) if we are in range.
- Subtract the distance to the enemy.
- Subtract the number of destroyables left.

Reward function: -100 for dying, 1000 for winning, plus the difference in the state value function between the next state and the current one.

We still feel that it should be possible to create a learning agent, even if only on small boards, and we consider this a challenge we would really like to solve some day.
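The update rule itself is the standard Q-learning update, Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)). A minimal illustrative Java sketch of a tabular learner over an opaque feature-string state key is shown below; the key construction, the action encoding, and the constants are hypothetical and do not reproduce the submitted feature scheme.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Random;

    public class QLearner {
        private final Map<String, double[]> qTable = new HashMap<String, double[]>(); // feature key -> Q-value per action
        private final int numActions;
        private final double alpha;    // learning rate
        private final double gamma;    // discount factor
        private final double epsilon;  // exploration probability
        private final Random rng = new Random();

        public QLearner(int numActions, double alpha, double gamma, double epsilon) {
            this.numActions = numActions;
            this.alpha = alpha;
            this.gamma = gamma;
            this.epsilon = epsilon;
        }

        private double[] values(String stateKey) {
            double[] q = qTable.get(stateKey);
            if (q == null) {
                q = new double[numActions];
                qTable.put(stateKey, q);
            }
            return q;
        }

        /** Epsilon-greedy action selection used during training. */
        public int chooseAction(String stateKey) {
            if (rng.nextDouble() < epsilon) {
                return rng.nextInt(numActions);
            }
            double[] q = values(stateKey);
            int best = 0;
            for (int a = 1; a < numActions; a++) {
                if (q[a] > q[best]) best = a;
            }
            return best;
        }

        /** Q(s,a) += alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a)) */
        public void update(String stateKey, int action, double reward, String nextStateKey) {
            double[] q = values(stateKey);
            double[] next = values(nextStateKey);
            double maxNext = next[0];
            for (int a = 1; a < numActions; a++) {
                maxNext = Math.max(maxNext, next[a]);
            }
            q[action] += alpha * (reward + gamma * maxNext - q[action]);
        }
    }

During training, chooseAction supplies the epsilon-greedy move, and update is called after each step with the observed reward.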

Analysis

General Conclusions

Alpha-Beta and ExpectiMax beat the reflex agent more than 90% of the time, and are practically unbeatable by a human. We discovered that the reflex agent is overly aggressive: it tends to chase the enemy and place bombs around him. However, this also leads to a situation where it places a bomb and moves forward while the enemy places a bomb in front of it, effectively trapping it - that is what happens in most of the cases where it loses. A player who does not know this usually loses to it; but if you play against this weakness, waiting for it and trapping it, you can usually win about 50% of the time. Playing aggressively against it is likely to fail, as the agent thinks much faster than a human.

Unless stated otherwise, all tests were run with the default settings: a 13x13 board, 75ms per turn, 2000 turns until a tie, 4 bombs per player placeable simultaneously, and a destroyable appearance rate of 65%.

Pruning Efficiency

As stated earlier, Alpha-Beta is a pruning technique for the Minimax algorithm that exploits the fact that many sub-trees will never be played by optimal agents. In other games, Alpha-Beta usually leads to a big speedup in the search, and thus increases the search depth we can reach in a reasonable time. In our case, the smallest useful depth is 6, as this is just enough for a move, placing a bomb, and running away; the deeper the search before we fall back to the heuristic, the smarter the player becomes. We compared the average depth that Alpha-Beta and ExpectiMax (which uses the same Minimax tree) reach, given a fixed amount of time after which they are stopped and return the result from the deepest finished tree (iterative deepening), over many games:

[Chart: Average Search Depth by Time - average search depth reached by Alpha-Beta and ExpectiMax for turn lengths of 5 to 1000 ms]

As the chart shows, the Alpha-Beta search is considerably faster and reaches much deeper given the same time. While Alpha-Beta can play reasonably even with 5ms turns, for ExpectiMax this is not enough time to think, and we also saw this in the results: in that case it loses to the reflex agent about 40% of the time. That said, for turn lengths of 50ms and above we observed that, while Alpha-Beta does indeed win more often when the two are pitted against each other, most games end in a tie.

Effects of the Number of Bombs

Initially, we thought that given a small number of bombs (1 or 2) an agent would be unlikely to win and the game would drift towards a tie - at least, that is what a human player can achieve if he makes no mistakes. We also expected that the more bombs we give, the faster games would end. When pitting the Reflex agent against Alpha-Beta, these were our results:

[Chart: Avg. Number of Turns by Number of Bombs - average game length for 1 to 5 bombs, Reflex agent vs. Alpha-Beta]

The number of ties was actually negligible in all cases. This is what led us to discover the weakness of the Aggressive Reflex agent described above, and the Balanced agent described above was an attempt to solve it. This was a moderate success: with one bomb, the game now ends in a tie about 20% of the time, and the Balanced agent even won a few times. This is still not what we had expected, but it is a big improvement over the Aggressive agent. The average number of turns also increased by about 20%. Also, beating it as a human is much harder, as it has no visible weaknesses.

[Chart: Avg. Number of Turns by Number of Bombs - average game length for 1 to 5 bombs]

[Chart: Number of Wins by Number of Bombs - wins for 1 to 5 bombs, Balanced Reflex vs. AlphaBeta]

Appendix A - Compilation and Running

The game was written in Java, using the Swing graphics library. It should run on any platform with a Java 7 virtual machine available. To compile, run ant in the project directory. This compiles the source files found in the src directory into class files in the build directory and creates an executable Java archive (jar), Bomberman.jar. To run, type java -jar dist/bomberman.jar, or use ant run, which compiles and runs the project.