CS221 Project Final Report
Automatic Flappy Bird Player
Minh-An Quinn, Guilherme Reis

Introduction

Flappy Bird is a notoriously difficult and addictive game - so much so that its creator removed it from app stores at the peak of its popularity, citing guilt over the time players were devoting to it. Although its rules are simple, straightforward, and intuitive, strategic timing and dexterity are essential to excel at Flappy Bird. Humans immediately understand the game's rules and what they must do to win, but lack the ability to strategically think several moves ahead, as well as the dexterity to take the optimal action at 30 frames per second. An AI, on the other hand, has no trouble computing ahead and executing actions with perfect timing - thinking ahead and intuitively mastering the game's physics, however, is not as easy. The motivation for our project was to understand, through careful feature engineering and modelling, precisely what on-screen information - positions, velocities, accelerations, etc. - is necessary to excel at Flappy Bird gameplay. To that end, we implemented Q-Learning to estimate the value of taking an action from a given state, and then pared down the information in the state to the bare minimum. With as little as one hour of training, our Automatic Flappy Bird Player beats average humans at gameplay; with 3-4 hours (~6000 games played), it achieves superhuman performance, eventually being able to play the game without losing.

Background

Flappy Bird is a one-player game in which the user controls a bird and attempts to fly it between pipes. The bird moves forward (i.e., right) at a constant speed set by the game's logic. As the bird moves, pipes come on-screen and become visible. Each pipe has an opening that the bird must pass through to clear it. Furthermore, the bird is affected by gravity and accelerates downward at a constant rate.
If the bird touches a pipe or the bottom border of the screen, the game ends. At any time, two actions are available to the user: click, which accelerates the bird upward, or do nothing. Hence, at every frame of gameplay the user must decide whether to propel the bird up, or do nothing and let gravity act on it. The user must be strategic in deciding when to click: too early, and the bird will come back down and hit the bottom part of the pipe; too late, and the bird may overshoot the pipe's opening and hit the top part.
The user gains one point every time the bird passes an obstacle, with the final score being the total accumulated over the duration of the game. Since the horizontal speed is constant, the only way for the bird to keep surviving is to keep passing through pipe openings.

Scope

The goal of our project is to build an AI agent that gets as high a score as possible. Note that, in the case of Flappy Bird, the bird's score equals the number of pipes it clears before death and is directly proportional to how long the bird stays alive. An essential measure of success is whether the AI can score better than a regular, average human player. Two fundamental challenges make this game hard for humans: firstly, humans have difficulty judging exactly when to click in preparation for a given obstacle, as they must take into consideration the horizontal velocity of the bird, the downward acceleration due to gravity, and the effect of upward acceleration cancelling out gravity if they click. Secondly, a human not only needs to make that decision in a timely fashion, but also needs the dexterity to translate the decision into a button press. Given this, we defined two success metrics. Our baseline is beating the gameplay of a person who is familiar with the game, i.e., 200-400 points/pipes. Our oracle, on the other hand, is achieving superhuman gameplay: being able to consistently clear thousands of pipes. As shown later, we achieve and surpass both our baseline and our oracle.

Infrastructure

Rather than building our own version of Flappy Bird (which would not be relevant for CS221), we used a version of Flappy Bird built in PyGame. In addition, we used the PyGame Learning Environment (PLE), a wrapper around PyGame, to implement our agent.
PyGame Learning Environment is helpful because it allowed us to hook into Flappy Bird's game logic to get the information with which to build the state, and because its framework facilitates receiving a game state (or another signal, such as the end of a game), replying with an action to take, and then receiving the corresponding reward. Flappy Bird operates at 30 frames per second; at each frame, PLE gives us the game state, and our Agent preprocesses it and replies with the action to take. PLE then passes this action to PyGame, which executes it and returns a reward to our Agent, to be used in incorporating feedback. Note that we had to make a few modifications to the PyGame and PLE code to streamline our infrastructure. The reward structure is, naturally, chosen by us and defined before the game starts. For reasons discussed later, we only incorporate feedback and alter our weights at the end of every game. To do so, we store a list of "moves": initialized at the start of every game, it saves the state-action pair at each frame. We then use this list at the end of each game in incorporatefeedback() to update our weights and improve our model.
Our infrastructure consists of two main files: quickstart.py and agent.py. The quickstart file instantiates the game and PLE, and bridges PLE and our Agent by saving states and rewards and passing them along to our Agent at the next frame. Quickstart also keeps a moving average and the maximum score over the last 100 games, and prints those during training and testing. Quickstart is also where one can choose whether to display the screen and watch gameplay or train at a higher speed, and where one can choose to either load previously saved weights or start training from scratch.

Challenges

The structure of the Flappy Bird game poses two main challenges for our agent. The first challenge is strategic timing. It is not sufficient for the bird to be able to line itself up vertically with the hole in the pipe; rather, it must ensure that once it reaches the pipe, it is still lined up vertically. This might entail dropping below the pipe and only then clicking, so that the bird becomes vertically aligned with the pipe opening just as it reaches it horizontally. Furthermore, not only must the bird reach the pipe's opening, it must be able to pass through the opening too (which takes approximately 15-25 frames). Hence, the bird might fit into the pipe, but have an excessive vertical velocity that causes it to crash into the pipe while inside it. This is further complicated by the fact that clicking inside the pipe is quite risky, since there isn't much vertical space to allow for significant vertical acceleration. As we observed human gameplay, we noticed that this is where people struggled: they could line the bird up with the pipe's hole, but were then forced to click while inside it, causing the bird to hit the pipe and die. Because of this, actually clearing the pipe is far harder than merely being able to enter it. Successfully going through a pipe is the result of not just one action, but rather a series of deliberate, well-timed actions.
Therefore, our agent must incorporate information about previous actions into its decision making. In addition, our agent must think several moves ahead and be willing to put itself into risky situations for a future reward. In this sense, actually entering the pipe is quite risky: while inside the pipe, the bird is most constrained in terms of what actions it can take. Furthermore, if the bird approaches the pipe with the wrong vertical velocity, it might be impossible for it to clear the pipe completely, even if it enters it. This is clearly a challenge because it makes it very easy for any type of reinforcement learning to find an alternate local optimum: simply avoid the riskiest part of the game (being inside the pipe) and, e.g., constantly click to rise diagonally, hitting the pipe at the last possible moment. Hence, we must incentivize our bird to incur the risk of entering the pipe, even when it is likely that doing so will lead to its death, until it learns to clear the pipe. One way of avoiding this local optimum is to use an exploration probability. We concluded that such an approach is inadequate: a) firstly, Flappy Bird's physics are deterministic, so it isn't advantageous to take
random actions during gameplay; b) secondly, with 30 decisions per second but only two possible actions at any given point, it was far too easy for the bird to randomly click when it shouldn't have, interfering with strategic timing. For example, if the bird is inside the pipe for 15-25 frames, even with a 0.1 exploration probability it is almost certain to click at some point, leading to near-certain death. Instead of incentivizing the bird to explore randomly, we are more selective about when we apply a positive or negative reward, as explained later. The second challenge is Flappy Bird's delayed rewards. When playing the game, users only gain points when they pass through a pipe obstacle. Due to the game's logic, the reward is given to the user as soon as they are halfway through a pipe. However, the task of successfully navigating through a pipe requires many moves and careful pre-planning - the reward is the result of a series of actions, not a single action. Additionally, since the goal of our agent is to gain as many points as possible, merely making it halfway through a pipe is not enough - our agent needs to learn how to successfully clear a whole pipe. Due to these differences, it is not appropriate to give the bird rewards with the same timing as the game logic. For example, the bird might remain alive despite making poor choices - such as avoiding the risk of entering the pipe; similarly, the bird might die even if it takes the best action in its current state, simply because it made poor choices (such as clicking) a few frames back. Consider the case where the bird is 20 frames away from the pipe and chooses to click, accelerating upwards well past the pipe's opening and sentencing it to (eventual) death as it crashes into the pipe. Clearly, clicking was not the optimal action at that state; however, the wise thing to do, given that you've already clicked, is to do nothing, hoping that gravity will eventually bring the bird down sufficiently.
Hence, we must punish the action 20 frames ago without punishing the next 20 frames, which are closest to the bird's death. Since there isn't a clear way of telling whether an action is bad until the bird dies, this was a significant challenge.

Approach

To create an automatic Flappy Bird agent, we use Q-learning to learn the value of taking each action at a given state. Note that we are not using function approximation, so we directly learn (and look up, when choosing the next action) the value of each specific state-action pair. Because of this, we need to have seen a given state before being able to act intelligently upon it. However, we manage to keep the state space quite small by aggressively reducing the variables in our state and binning them (through rounding) to further decrease the state space. Since we are modelling physical space (distances/velocities/accelerations), the optimal action is unlikely to change if a given variable in the state changes by a small amount, so binning is an effective strategy.
At each frame, our Agent receives information about the state of the game. We hook into the PyGame code to retrieve this information directly. We experimented with numerous features, but ultimately we need only 3 to achieve superhuman performance and near-perfect gameplay:

1. Horizontal distance between the bird and the next pipe. This is the distance between the bird's head and the start of the pipe obstacle.

2. Vertical distance between the bird and the next pipe. This distance is calculated as the difference between the y-coordinate of the top of the pipe opening and the y-coordinate of the bird. Since the pipe gap remains constant throughout the game, we only need to include the vertical distance between the bird and the top of the opening, and not also the distance to the bottom of the opening.

3. Vertical velocity of the bird. The vertical velocity implicitly contains the information of whether the bird is currently falling or rising and, if rising, how long ago the Agent chose to click. Since the horizontal velocity of the bird remains constant throughout the game, this was the only velocity feature needed.

Once the bird clears a pipe, features 1 and 2 are replaced by the distances to the following pipe. Note that all three values in the game code are continuous, making it unlikely that we'd see exactly the same state again. To bin them, we simply divide by a constant and round to a whole number. However, we noticed that far more precision is needed in some variables than in others. Hence, for the horizontal distance between the bird and the pipe, we divide by 4; for the vertical distance between the bird and the pipe gap, we divide by 8; and for the bird's vertical velocity, we divide by 2. This is equivalent to overlaying a grid on the game screen and mapping each continuous value to the grid square it falls in.
With this aggressive feature selection and binning, we only needed to encounter and record fewer than 22,000 states to achieve superhuman performance. There are three other hyperparameters that merit discussion. The first is the discount factor, which we fixed at 1 - intuitively, we care just as much about being alive in the future as we do about being alive currently. The second is the step size, which we set at 0.4 for optimal performance. However, our AI's performance was very sensitive to the step size: a step size of 0.45 or 0.25 results in abysmal gameplay performance. The third hyperparameter, t, is not a feature of Q-Learning, but rather something we devised ourselves to better decide which state-action pairs to punish and which to reward. Understanding its usage is intimately connected with how we update our weights/q-values (since we are not using function approximation, these are equivalent). We do not incorporate any feedback at all during gameplay. This may seem counterintuitive from a traditional reinforcement learning perspective; however, during the game
itself, we actually cannot know whether a given action is successful or not. Regardless of the reward system used, Flappy Bird has one characteristic feature: dying is bad, and everything else is equally good. Hence, we cannot necessarily use an event like clearing one pipe or staying alive for another frame as a proxy for "our Agent acted correctly": the Agent may have cleared a pipe but put the bird in a position where it won't be able to clear the next pipe, or it may have performed an action that will lead to death, but only a number of frames later. Hence, the only feedback we can evaluate without depending on future outcomes happens when the bird dies. Because of this, it is only when the bird dies and the game ends that we incorporate feedback: we keep a list of every state the bird was in and the action our Agent advised it to take, and upon the end of a game, we negatively reward the last t state-action pairs and positively reward all the previous ones. In essence, we punish whatever the bird did in the last t frames, since it was either a poor action or a very risky state to be in in the first place. Every action taken before the last t frames is positively rewarded: it does not matter to us what the bird did - if it led to staying alive for more than t frames, it was probably a wise action to take. We found that the optimal value was t = 9, i.e., punish the last 9 frames. This parameter is also quite sensitive: values of 7 or 8 lead to somewhat lower performance, and values of 10 and above lead to terrible performance. At the end of each game, we update the relevant q-values as follows:

qvalues[state][action] = (1 - stepsize) * qvalues[state][action] + stepsize * (Reward + max(qvalues[newstate]))

Reward is positive or negative depending on whether the state-action pair occurred more than t = 9 frames before the bird's death: if it did, Reward = 1; if it falls within the last 9 frames, Reward = -1000. The reward structure can be altered inside quickstart.py. As discussed previously, the optimal step size is 0.4.
When selecting the action to take at a given state, we simply compare which action results in a higher q-value (i.e., qvalues[state]['click'] vs. qvalues[state]['no click']) and return that action. If there is a tie, we choose not to click, for two reasons: a) clicking is a lot more dangerous in many situations than not clicking; b) clicking is irreversible, whereas simply doing nothing allows the bird to transition to a potentially known state from which we can act optimally.

Error Analysis

Our Agent met and exceeded both our baseline and our oracle, so (fortunately) there isn't significant room for improvement. However, there are certain drawbacks to our approach. Firstly, while it is remarkably fast to train to near-superhuman performance, training slows down considerably after approximately 5700 games. This is because at this point our Agent is so
skilled that it takes a very long stretch of gameplay for our bird to die, and hence for feedback to be incorporated. Secondly, while our Agent is very skilled at this particular game, its knowledge is not generalizable: changing the pipe gap or the horizontal speed of the game, for example, resulted in significantly worse performance.