CS221 Project: Final Report
Raiden AI Agent
Lu Bian lbian@stanford.edu  Yiran Deng yrdeng@stanford.edu  Xuandong Lei xuandong@stanford.edu

1 Introduction
Raiden is a classic shooting game in which the player controls a flight and must avoid collisions with enemy flights and enemy projectiles. Meanwhile, the player's flight can shoot missiles at enemy flights to gain points. In the normal game configuration, the player dies as soon as it is hit by any enemy object, so the player must first stay alive and only then shoot down as many enemies as possible. The difficult part of this game is that the game can run so fast that a human player cannot always make the best movement. In this project, we implemented an AI agent that plays the game automatically and achieves a higher score than a human player. We used AlphaBeta and Expectimax agents to control the player's movement, feeding the agents game-state information. We elaborate on our game-state features in the following sections. We hand-coded the weights for the feature vector and also trained them using TD-learning.

2 Task Definition
We built an AI agent that automatically makes the best decision based on the current state of the game. We also increased the difficulty of the game by introducing intelligent enemy flights that play against the player. We implemented four game modes:
- Human player vs. normal enemy.
- AI agent vs. normal enemy.
- AI agent vs. AI enemy.
- Human player vs. AI enemy.
Our job is to implement an AI-controlled player and an AI enemy. We introduce the AI enemy in the game description section.
3 Game Description
3.1 Player Actions
The player can move up, down, left, and right to avoid collisions with enemy flights and enemy projectiles. The player can also shoot missiles to destroy enemy flights (enemy projectiles are not destructible).

Figure 1: Game Layout
Figure 2: Player Shoot
3.2 Enemy Actions
Enemy flights are always generated at the top of the screen with random horizontal and vertical speeds. Each enemy also fires a spread of 3 projectiles every 0.5 seconds at the player. The vertical speed of a projectile is fixed, while its horizontal speed depends on the horizontal speed of the enemy flight; the middle projectile always shares the enemy's horizontal position. Classic enemy flights move at a fixed speed, whereas the AI enemy tracks the player's position and flies directly toward it. However, an enemy flight can only move downward, never upward, so once an enemy is below the player it cannot climb back up to hit the player.

3.3 Scoring
The score is shown in the top-left corner of the game board. The game never ends, so the goal is to accumulate as many points as possible. The player dies as soon as it collides with an enemy flight or an enemy projectile. The detailed scoring is as follows:
- +1 point: for staying alive for 1/60 of a second (60 points per second of survival).
- +1000 points: for hitting an enemy flight with a missile.
- -500 points: for firing a missile.
We added the penalty for firing a missile to prevent the abuse of missiles: with this penalty, firing missiles hurts the total score whenever the shooting accuracy falls below 50%. Without it, a human player could get a very high score simply by firing missiles continuously.

4 Approach
4.1 Game State Implementation
4.1.1 State members
The contents of GameState include:
- enemy list: the list of enemy jets.
- missile list: the list of missiles the player has shot.
- projectile list: the list of projectiles the enemies have shot.
- currentagent: the index of the agent currently taking an action.
- score: the current game-state score.
Each enemy, missile, and projectile in these lists has the following fields:
- speed: the current speed of the object along the x-axis and y-axis.
- position: recorded simply as an (x, y) pair.
- dimension: the height and width of the object.
Each object also provides the following functionality:
- checkcollision: takes another flight as input and returns whether the two collide, based on their positions and dimensions.
- updateflight: updates the position of the flight for the successor GameState.

4.1.2 State functions
An agent needs the successor state given an agent index and an action, as well as the legal actions that an agent can take, so these are the APIs we implemented in our GameState:
- getlegalactions: takes an agent index as input and returns the legal actions of that agent based on its position. More specifically, the legal actions are those that do not take the flight out of the board (except for the enemy).
- generatesuccessor: takes an agent index and the action that agent is about to take, and returns the successor state. When the index is 0, the player moves using the updateflight function and all projectiles also take a move; when the index belongs to an enemy, only that enemy updates its position.
- islose: loops over the lists of enemies and projectiles and uses Flight's checkcollision function to determine whether any of them collides with the player agent. If there is a collision, it returns True.
- iswin: always returns False, because the game has no end point; the player simply wants a higher score.
The game state cannot simply be copied with Python's deepcopy() because of pygame mask issues, so we implemented our own copy of all objects (including speed, position, etc.) and added the functionality above.
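The state representation above can be sketched roughly as follows. The class and method names mirror the report, but the exact signatures are our assumptions; in particular, we treat positions as bounding-box centers, which the report does not specify.

```python
class Flight:
    """One game object (player, enemy, missile, or projectile)."""

    def __init__(self, x, y, vx, vy, width, height):
        self.x, self.y = x, y              # position, treated here as the box center
        self.vx, self.vy = vx, vy          # speed along the x- and y-axis
        self.width, self.height = width, height

    def checkcollision(self, other):
        # Axis-aligned bounding-box overlap test from positions and dimensions.
        return (abs(self.x - other.x) < (self.width + other.width) / 2 and
                abs(self.y - other.y) < (self.height + other.height) / 2)

    def updateflight(self, dx=0, dy=0):
        # Advance the flight to its position in the successor GameState.
        self.x += self.vx + dx
        self.y += self.vy + dy


class GameState:
    def __init__(self, player, enemies, missiles, projectiles, score=0):
        self.player = player
        self.enemies = enemies
        self.missiles = missiles
        self.projectiles = projectiles
        self.score = score

    def islose(self):
        # The game is lost if the player touches any enemy or projectile.
        return any(self.player.checkcollision(f)
                   for f in self.enemies + self.projectiles)

    def iswin(self):
        # The game never ends, so there is no winning state.
        return False
```

A hand-rolled structure like this is also what makes the custom copy routine straightforward: every field is a plain number or list, with no pygame surfaces or masks involved.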
4.2 Agent
4.2.1 Search
At first, we used a Reflex agent for tuning game parameters and as a baseline for the more advanced agents. This agent simply loops through all legal actions for the current game state and chooses the action that leads to the highest score in the successor state.
We then used a simple Minimax agent to control the player's behavior. The Minimax agent is essentially the one we used for the Pacman assignment, transplanted to fit our game interface. However, when the search depth increases to 3, the game becomes very slow, because our game board (640×640) is much larger than Pacman's layout and we already have a lot of computation to do in a single frame. This motivated us to use an AlphaBeta agent to speed up the computation, again a variant of the Pacman alpha-beta pruning algorithm.
Since a Minimax agent guarantees a lower bound on performance against all adversaries, we expected Expectimax to outstrip Minimax in the easy level of our game, where the enemies move randomly, but perhaps not to score as well against our AI enemies.
Observing the behavior of our agent, we found that it would sometimes fire a missile even after detecting its certain death (i.e., the game would end whatever action it takes). This keeps the agent from achieving a higher score, since the game ends soon after the missile is fired and the 500-point cost is wasted. Furthermore, the agent must stay still to shoot a missile, which sometimes leads to its doom in situations where death was avoidable by moving. In our later implementation, we modified our agent to account for these cases. In the results section, we compare the performance of the Reflex, AlphaBeta, and Expectimax agents in all game modes.

4.2.2 Search depth
After getting the AlphaBeta and Expectimax agents working, we tried different search depths for our agent.
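The depth-limited AlphaBeta search described above can be sketched as follows. The state interface (getlegalactions, generatesuccessor, islose) matches our GameState, but the two-agent turn order and the depth bookkeeping shown here are simplifying assumptions on our part.

```python
import math

def alphabeta(state, depth, agent, evaluate, num_agents=2,
              alpha=-math.inf, beta=math.inf):
    """Depth-limited alpha-beta search in the style of the Pacman assignment.

    Agent 0 is the maximizing player; the remaining agents are minimizing
    enemies.  `evaluate` scores leaf states.  Returns (value, best_action).
    """
    if depth == 0 or state.islose():
        return evaluate(state), None
    next_agent = (agent + 1) % num_agents
    next_depth = depth - 1 if next_agent == 0 else depth  # one ply per full round
    best_action = None
    if agent == 0:                       # maximizing player
        value = -math.inf
        for action in state.getlegalactions(agent):
            score, _ = alphabeta(state.generatesuccessor(agent, action),
                                 next_depth, next_agent, evaluate,
                                 num_agents, alpha, beta)
            if score > value:
                value, best_action = score, action
            alpha = max(alpha, value)
            if value >= beta:            # prune: Min will never allow this branch
                break
        return value, best_action
    else:                                # minimizing enemy
        value = math.inf
        for action in state.getlegalactions(agent):
            score, _ = alphabeta(state.generatesuccessor(agent, action),
                                 next_depth, next_agent, evaluate,
                                 num_agents, alpha, beta)
            if score < value:
                value, best_action = score, action
            beta = min(beta, value)
            if value <= alpha:           # prune: Max will never allow this branch
                break
        return value, best_action
```

An Expectimax variant replaces the minimizing branch with an average over the enemy's legal actions, which is why it can exploit randomly moving enemies but takes risks against adversarial ones.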
Knowing that there is a trade-off between search depth and search speed, we observed drastically deteriorating game speed as we increased the search depth. The reason is that evaluating a game state requires a lot of computation, such as calculating the distances to enemies and projectiles. Therefore, we chose a search depth that does not hurt the game speed while still guaranteeing decent accuracy.

4.2.3 Evaluation function
The evaluation function is largely based on how a human player would behave under the same circumstances. The first thing we do is encourage our agent to stay near the center of the screen: staying at a border eliminates at least one legal action (for example, the agent cannot move left or down from the bottom-left corner), and lingering in the corners increases the probability of being trapped by enemies and their projectiles. Next, we evaluate the number of threats within a certain range, since our agent needs to respond
to all enemies and projectiles in a timely manner. The strategy is similar to Pacman's: we simply calculate the Euclidean distances between the agent and the enemies and projectiles, and penalize the score when a threat gets close. We also let our agent react to enemies as early as possible, which is realized by calculating the horizontal distance to each enemy and rewarding a large horizontal separation. Moreover, the screen is divided into four zones; we count the total number of enemies and projectiles in each zone and steer the agent toward the safest zone, i.e., the one with the fewest threats. This feature makes the agent behave more like a human player instead of just sitting at the bottom dodging enemies and projectiles.
The last problem for the agent is deciding when to shoot missiles. Our strategy is to consider the state at the current time step and calculate whether firing a missile could possibly hit an enemy. Denote the agent's position by (x_0, y_0), an enemy's position and speed by (x_1, y_1) and (v_{x1}, v_{y1}), and the missile's speed by (v_x, v_y). If the agent fires a missile, the missile will move to (x_0 + v_x t, y_0 + v_y t) after some time t, and the enemy will (possibly) move to (x_1 + v_{x1} t, y_1 + v_{y1} t) accordingly. The missile and the enemy collide if
|x_0 + v_x t - (x_1 + v_{x1} t)| < (W_m + W_e) / 2
|y_0 + v_y t - (y_1 + v_{y1} t)| < (H_m + H_e) / 2
where W_m and H_m are the missile's width and height, and W_e and H_e are the enemy's width and height. If both conditions are satisfied for some t, the agent shoots; otherwise it never shoots, due to the cost of firing missiles. This is a very strong estimate, and we observed that the agent's shooting accuracy is above 85% on average.
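The firing condition above can be sketched as a small predicate. The frame-by-frame scan and the look-ahead horizon max_t are hypothetical choices of ours; solving the two inequalities for t in closed form would work equally well.

```python
def missile_will_hit(x0, y0, x1, y1, vx1, vy1, vx, vy, wm, hm, we, he,
                     max_t=120):
    """Return True if a missile fired now from (x0, y0) with speed (vx, vy)
    would overlap the enemy at (x1, y1) moving at (vx1, vy1), assuming both
    keep a constant velocity.  Uses the bounding-box condition from the text:
        |x0 + vx*t - (x1 + vx1*t)| < (wm + we) / 2
        |y0 + vy*t - (y1 + vy1*t)| < (hm + he) / 2
    """
    for t in range(max_t):  # scan future frames up to the look-ahead horizon
        dx = (x0 + vx * t) - (x1 + vx1 * t)
        dy = (y0 + vy * t) - (y1 + vy1 * t)
        if abs(dx) < (wm + we) / 2 and abs(dy) < (hm + he) / 2:
            return True
    return False
```

Because the prediction assumes constant enemy velocity, it is only an estimate, but it is conservative enough to keep the observed shooting accuracy high.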
The update is:
w ← w − η { [ V̂_π(s; w) − (r + γ V̂_π(s′; w)) ] ∇_w V̂_π(s; w) + λ w }
V̂_π(s; w) = w · φ(s)
∇_w V̂_π(s; w) = φ(s)
‖w‖_2 = 1
Here the feature vector φ(s) consists of the game-state features described above, and the evaluation function returns w · φ(s) as the evaluation of the current game state.
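One step of this regularized, normalized TD update for a linear value function can be sketched as follows; the hyperparameter values (eta, gamma, lam) are illustrative defaults, not the ones we actually used.

```python
import numpy as np

def td_update(w, phi_s, phi_s_next, reward, eta=0.01, gamma=0.9, lam=0.1):
    """One TD(0) step for the linear value function V(s; w) = w . phi(s),
    with an L2 regularization term and renormalization of w."""
    v_s = w @ phi_s                       # current estimate V(s; w)
    v_next = w @ phi_s_next               # bootstrapped target uses V(s'; w)
    td_error = v_s - (reward + gamma * v_next)
    # The gradient of V(s; w) w.r.t. w is phi(s); lam * w is the regularizer.
    w = w - eta * (td_error * phi_s + lam * w)
    return w / np.linalg.norm(w)          # project back onto ||w||_2 = 1
```

Without the regularization term and the projection onto the unit sphere, the magnitude of w grows without bound under this update, which matches the divergence we observed with the plain formula.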
5 Results and Analysis
5.1 General results
For each agent, we let it play 200 games against the random enemy and 200 games against the AI enemy. We also played the game ourselves 50 times against each enemy. The results are shown below.

Agent                          # of Games   Avg. Score   Highest Score   Std.
Reflex vs. normal enemy        200          1354         6984            2922
Reflex vs. AI enemy            200          1320         7051            3253
AlphaBeta vs. normal enemy     200          6769         39784           5858
AlphaBeta vs. AI enemy         200          5864         30913           5028
Expectimax vs. normal enemy    200          7461         49106           7195
Expectimax vs. AI enemy        200          5261         27227           6316
Player vs. normal enemy        50           3829         8172            2670
Player vs. AI enemy            50           2539         4260            1646

Figure 3: Performance Comparison

The results indicate that human players perform significantly worse than our AlphaBeta and Expectimax agents but have the lowest standard deviation, which means human performance is more consistent, probably because we follow our own routines when playing such games. The Expectimax agent comes out on top in both average score and highest score against the normal enemy, with a large lead; this is what we expected, since Expectimax should outperform AlphaBeta against randomly moving enemies. Against the AI enemy, the AlphaBeta and Expectimax agents perform very similarly.
The standard deviations of AlphaBeta are smaller, which is consistent with what we observed: the AlphaBeta agent barely has close contact with enemies, since it assumes they are always Min adversaries, whereas the Expectimax agent will sometimes sneak up behind enemies. Furthermore, we expected the Expectimax agent to score much lower against the AI enemy, and our results verify this expectation.

5.2 TD-learning
We trained the weight vector over thousands of rounds of games, but the performance of the trained agent is not as good as that of our hand-coded agent. Here we list only the trained Expectimax agent's performance.

Agent                                   # of Games   Avg. Score   Highest Score   Std.
Trained Expectimax vs. normal enemy     100          2776         4724            1092
Trained Expectimax vs. AI enemy         100          1562         2286            557

Figure 4: Player Stuck at Corner

The overall performance of our trained agent is worse than both the human players and our hand-coded agent. This is because the trained agent keeps hovering in one of the two top corners of the game board and barely moves.

6 Conclusion and Future Work
Our agent outperforms human players when using our hand-coded evaluation function. The agent can score as high as 49,000 points, whereas a human scores no more than 10,000 points. When facing the AI enemy, our agent achieves an average score of about 5,500 points, whereas a human player only achieves about 2,500 points.
Our TD-learning-trained weights do not perform very well and leave the agent stuck in a local minimum. One improvement would be to change the feature vector so that TD-learning can escape this local minimum; ideally we would use a trained weight vector rather than a hard-coded one. Our agent's performance is also not very stable for now; one improvement we can make is to add more general features to the feature vector to obtain a more stable performance.

References
[1] Open-source Sky-Fighter: https://github.com/edward344/sky-fighter.git