CS221 Project Final Report: Automatic Flappy Bird Player

Minh-An Quinn, Guilherme Reis

Introduction

Flappy Bird is a notoriously difficult and addictive game - so much so that its creator removed it from app stores at the peak of its popularity, citing guilt over the time players were devoting to it. Although its rules are simple, straightforward, and intuitive, strategic timing and dexterity are essential to excel at Flappy Bird. Humans immediately understand the game's rules and what they must do to win, but they lack the ability to strategically think several moves ahead, as well as the dexterity to take the optimal action at 30 frames per second. An AI, on the other hand, has no trouble computing ahead and executing actions with perfect timing - thinking ahead and intuitively mastering the game's physics, however, is not as easy.

The motivation for our project was to understand, through careful feature engineering and modelling, precisely what on-screen information - such as positions, velocities, and accelerations - is necessary to excel at Flappy Bird gameplay. To that end, we implemented Q-learning to estimate the value of taking an action from a given state, and then pared down the information in the state to the bare minimum. With as little as 1 hour of training, our Automatic Flappy Bird Player beats average humans at gameplay; with 3-4 hours (~6000 games played), it achieves superhuman performance, eventually being able to play the game without losing.

Background

Flappy Bird is a one-player game where the user controls a bird and attempts to fly it between pipes. The bird moves forward (i.e., right) at a constant speed set by the game's logic. As the bird moves, pipes come on-screen and become visible. Each pipe has an opening that the bird must pass through to clear the pipe. Furthermore, the bird is affected by gravity and accelerates downward at a constant rate. If the bird touches a pipe or the bottom border of the screen, the game ends. At any time, there are two actions available to the user: click, in which case the bird accelerates upward, or do nothing. Hence, at every frame of gameplay the user must decide whether to propel the bird up, or do nothing and simply let it be pulled down by gravity. The user must be strategic in deciding when to click: too early, and the bird will come back down and hit the bottom end of the pipe; too late, and the bird may overshoot the pipe's opening and hit the top part of the pipe.

The user gains one point every time the bird passes an obstacle, with the final score being the total accumulated over the duration of the game. Since the horizontal speed is constant, the only way for the bird to keep surviving is to keep passing through pipe openings.

Scope

The goal of our project is to build an AI agent that gets as high a score as possible. Note that, in the case of Flappy Bird, the bird's score is equal to the number of pipes it clears before death and is directly proportional to how long the bird stays alive. An essential measure of success is whether the AI can score better than a regular, average human player. There are two fundamental challenges that make this game hard for humans: firstly, humans have difficulty judging exactly when to click in preparation for a given obstacle, as they have to take into consideration the horizontal velocity of the bird, the downward acceleration due to gravity, and the effect of the upward acceleration cancelling out gravity if the player clicks. Secondly, a human not only needs to make that decision in a timely fashion, but also needs the dexterity to translate the decision into a button press. Given this, we defined two success metrics. Our baseline is beating the gameplay of a person who is familiar with the game, i.e., 200-400 points/pipes. Our oracle, on the other hand, is achieving superhuman gameplay: being able to consistently clear thousands of pipes. As is shown later, we achieve and surpass both our baseline and our oracle.

Infrastructure

Rather than building our own version of Flappy Bird (which would not be relevant for CS221), we used a version of Flappy Bird built in PyGame. In addition, we used the PyGame Learning Environment (PLE), a wrapper around PyGame, to implement our agent. PLE is helpful because it allowed us to hook into Flappy Bird's game logic to get the information with which to build the state, and because its framework facilitates receiving a game state or other signal (such as the end of a game), replying with an action to take, and then receiving the corresponding reward. Flappy Bird operates at 30 frames per second; at each frame, PLE gives us the game state, and our Agent preprocesses it and replies with the action to take. PLE then passes this action to PyGame, which executes it and returns a reward to our Agent, to be used when incorporating feedback. Note that we had to make a few modifications to the PyGame and PLE code to streamline our infrastructure. The reward structure is, naturally, chosen by us and defined before the game starts. For reasons discussed later, we only incorporate feedback and alter our weights at the end of every game. To be able to do so, we store a list of "moves": initialized at the start of every game, it saves the state-action pair at each frame. We then use this list at the end of each game in incorporateFeedback() to update our weights and improve our model.
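For concreteness, the core of this loop can be sketched in a few lines of Python. The PLE calls shown (init, getActionSet, getGameState, act, game_over, reset_game) are the library's own; the getAction/incorporateFeedback interface of our Agent is only illustrative, and details such as the exact contents of the action set are assumptions.

    from ple import PLE
    from ple.games.flappybird import FlappyBird

    from agent import getAction, incorporateFeedback  # our agent.py (interface illustrative)

    game = FlappyBird()
    env = PLE(game, fps=30, display_screen=False)  # display_screen=True to watch gameplay
    env.init()

    # PLE's action set for Flappy Bird contains the 'flap' key plus None (do nothing).
    FLAP = next(a for a in env.getActionSet() if a is not None)

    moves = []  # (state, action) pairs recorded for the current game
    while True:
        state = env.getGameState()                    # raw, continuous game state
        action = getAction(state)                     # 'click' or 'no click'
        env.act(FLAP if action == "click" else None)  # None is a no-op in PLE
        moves.append((state, action))
        if env.game_over():                           # feedback only once the bird dies
            incorporateFeedback(moves)
            moves = []
            env.reset_game()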

Our infrastructure consists of two main files: quickstart.py and agent.py. The quickstart file instantiates the game and PLE, and bridges PLE and our Agent by saving states and rewards and then passing them along to our Agent at the next frame. Quickstart also keeps a moving average and the maximum score over the last 100 games, and prints those during training and testing. Quickstart is also where one chooses whether to display the screen and watch gameplay or train at a higher speed, and whether to load previously saved weights or start training from scratch.

Challenges

The structure of the Flappy Bird game poses two main challenges for our agent. The first challenge is strategic timing. It is not sufficient for the bird to be able to line itself up vertically with the hole in the pipe; rather, it must ensure that once it reaches the pipe, it is still lined up vertically. This might entail having to drop below the pipe and only then clicking, so that the bird becomes vertically lined up with the pipe opening just as it reaches it horizontally. Furthermore, not only must the bird reach the pipe's opening, it must also be able to pass through the opening (which takes approximately 15-25 frames). Hence, the bird might fit into the pipe but have an excessive vertical velocity that causes it to crash into the pipe while inside it. This is further complicated by the fact that clicking inside the pipe is quite risky, since there isn't much vertical space to allow for significant vertical acceleration. As we observed human gameplay, we noticed that this is where people struggled: they could line the bird up with the pipe's hole, but were then forced to click while inside it, causing the bird to hit the pipe and die. Because of this, actually clearing the pipe is far harder than merely being able to enter it.

Successfully going through a pipe is the result of not just one action but a series of deliberate, well-timed actions. Therefore, our agent must incorporate information about previous actions into its decision making. In addition, our agent must also think several moves ahead and be willing to put itself into risky situations for a future reward. In this sense, actually entering the pipe is quite risky: while inside the pipe, the bird is most constrained in terms of what actions it can take. Furthermore, if the bird approaches the pipe with the wrong vertical velocity, it might even be impossible for it to clear the pipe completely, even if it enters it. This is clearly a challenge because it makes it very easy for any type of reinforcement learning to find an alternate local optimum: simply avoid the riskiest part of the game (being inside the pipe) and, e.g., constantly click to rise diagonally, hitting the pipe at the last possible moment. Hence, we must incentivize our bird to incur the risk of entering the pipe, even when doing so is likely to lead to its death, until it learns to clear the pipe.

One way of avoiding this local optimum is to use an exploration probability. We concluded that such an approach is inadequate: a) firstly, Flappy Bird's physics are deterministic, so it isn't advantageous to take random actions during gameplay; b) secondly, with 30 decisions per second but only two possible actions at any given point, it was far too easy for the bird to randomly click when it shouldn't have, interfering with the strategic timing. For example, if the bird is inside the pipe for 15-25 frames, even with a 0.1 exploration probability it is fairly certain to click (over 20 frames, the chance of at least one exploratory action is roughly 1 - 0.9^20 ≈ 88%), leading to near-certain death. Instead of incentivizing the bird to explore randomly, we are more selective about when we apply a positive or negative reward, as explained later.

The second challenge is Flappy Bird's delayed rewards. When playing the game, users only gain points when they pass through a pipe obstacle. Due to the game's logic, the reward is given to the user as soon as they are halfway through a pipe. However, the task of successfully navigating through a pipe requires many moves and careful pre-planning; the reward is the result of a series of actions, not just a single action. Additionally, since the goal of our agent is to gain as many points as possible, merely making it halfway through a pipe is not enough; our agent needs to learn how to successfully clear a whole pipe. Due to these differences, it is not appropriate to give the bird rewards with the same timing as the game logic. For example, the bird might remain alive while making poor choices - such as avoiding the risk of entering the pipe; similarly, the bird might die even if it takes the best action at the state it is in, simply because it made poor choices (such as clicking) a few frames back. Consider the case where the bird is 20 frames away from the pipe and it chooses to click, accelerating upwards well past the pipe's opening and sentencing it to (eventual) death as it crashes into the pipe. Clearly, clicking was not the optimal action at this state; however, the wise thing to do, given that you've already clicked, is to do nothing, hoping that gravity will eventually bring the bird down far enough. Hence, we must punish the action 20 frames ago without punishing the next 20 frames, which are closest to the bird's death. Since there isn't a clear way of telling whether an action is bad until the bird dies, this was a significant challenge.

Approach

In order to create an automatic Flappy Bird agent, we use Q-learning to learn the value of taking each action at a given state. Note that we are not using function approximation, so we directly learn (and, when choosing the next action, look up) the value of each specific state-action pair. Because of this, we need to have seen each given state before being able to act upon it intelligently. However, we manage to keep the state space quite small by aggressively reducing the variables in our state and binning them (through rounding) to further decrease the state space. Since we are modelling physical quantities (distances/velocities/accelerations), it is likely that the optimal action does not change if a given variable in the state changes by a small amount, so binning is an effective strategy.

At each frame, our Agent receives information about the state of the game. We hook into the PyGame code to retrieve this information directly. We experimented with numerous features, but ultimately we need only 3 in order to achieve superhuman performance and near-perfect gameplay:

1. Horizontal distance between the bird and the next pipe. This is the distance between the bird's head and the start of the pipe obstacle.

2. Vertical distance between the bird and the next pipe. This distance is calculated as the difference between the y-coordinate of the top of the pipe opening and the y-coordinate of the bird. Since the pipe gap remains constant throughout the game, we only need to include the vertical distance between the bird and the top of the opening, and not also the distance to the bottom of the opening.

3. Vertical velocity of the bird. The vertical velocity implicitly contains the information of whether the bird is currently falling or rising, and, if rising, how long ago the Agent chose to click. Since the horizontal velocity of the bird remains constant throughout the game, this was the only velocity feature needed.

Once the bird clears a pipe, variables 1 and 2 are replaced by the distances to the following pipe. Note that all three values in the game code are continuous, making it unlikely that we'd see exactly the same state again. To bin them, we simply divide by a constant and round to a whole number. However, we noticed that far more precision is needed in some variables than in others. Hence, for the horizontal distance between the bird and the pipe, we divide by 4; for the vertical distance between the bird and the pipe gap, we divide by 8; and for the bird's vertical velocity, we divide by 2 before rounding. This is equivalent to overlaying a grid on the game screen and mapping each continuous value to the grid square it falls in. With this aggressive feature selection and binning, we only needed to encounter and record fewer than 22,000 states to achieve superhuman performance.
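As a rough illustration, this discretization fits in one helper function. The bin widths are the ones above; the key names of PLE's Flappy Bird state dictionary (next_pipe_dist_to_player, next_pipe_top_y, player_y, player_vel) are taken from the library and should be treated as assumptions, since they may differ between versions.

    def discretize(state):
        """Map PLE's continuous Flappy Bird state to a small discrete state.

        Bin widths follow the text: horizontal distance / 4, vertical
        distance to the top of the pipe opening / 8, vertical velocity / 2.
        """
        dx = state["next_pipe_dist_to_player"]             # horizontal distance to the next pipe
        dy = state["next_pipe_top_y"] - state["player_y"]  # vertical distance to the top of the gap
        vel = state["player_vel"]                          # bird's vertical velocity
        return (round(dx / 4), round(dy / 8), round(vel / 2))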

There are three other hyperparameters that merit discussion. The first is the discount factor, which we fixed at 1 - intuitively, we care just as much about being alive in the future as we do about being alive now. The second is the step size, which we set at 0.4 for optimal performance. Our AI's performance was very sensitive to the step size, however: a step size of 0.45 or 0.25 results in abysmal gameplay performance. The third hyperparameter, t, is not a feature of Q-learning, but rather something we devised ourselves to better decide which state-action pairs to punish and which to reward. Understanding its usage is intimately connected with how we update our weights/Q-values (since we are not using function approximation, these are equivalent).

We do not incorporate any feedback at all during gameplay. This may seem counterintuitive from a traditional reinforcement learning perspective; however, during the game itself, we cannot actually know whether a given action was successful or not. Regardless of the reward system used, Flappy Bird has one characteristic feature: dying is bad, and everything else is equally good. Hence, we cannot necessarily use an event like clearing one pipe or staying alive another frame as a proxy for "our Agent acted correctly": the Agent may have cleared a pipe but put the bird in a position where it won't be able to clear the next pipe, or it may have performed an action that will lead to death, but only a number of frames later. Hence, the only feedback we receive that we can evaluate without depending on future outcomes comes when the bird dies. Because of this, it is only when the bird dies and the game ends that we incorporate feedback: we keep a list of every state the bird was in and the action our Agent advised it to take, and upon the end of a game, we negatively reward the last t state-action pairs and positively reward all the previous ones. In essence, we punish whatever the bird did in the last t frames, since it was either a poor action or a very risky state to be in in the first place. For every action taken before the last t frames, we give a positive reward: it does not matter to us exactly what the bird did, but if it led to staying alive for more than t frames, it was probably a wise action to take. We found that the optimal value was t = 9, i.e., punish the last 9 frames. This parameter is also quite sensitive: values of 7 or 8 lead to somewhat lower performance, and values of 10 and above lead to terrible performance.

At the end of each game, we update the relevant Q-values as follows:

    qvalues[state][action] = (1 - stepsize) * qvalues[state][action] + stepsize * (Reward + max(qvalues[newstate]))

Reward is positive or negative depending on whether the state-action pair occurred more than 9 frames before the bird's death or within the last 9 frames: for the earlier pairs, Reward = 1; for the last 9, Reward = -1000. The reward structure can be altered inside quickstart.py. As discussed previously, the optimal step size is 0.4.

When selecting the action to take at a given state, we simply compare which action has the higher Q-value (i.e., qvalues[state]['click'] vs. qvalues[state]['no click']) and return that action. If there is a tie, we choose not to click, for two reasons: a) clicking is a lot more dangerous in many situations than not clicking; b) clicking is irreversible, whereas simply doing nothing allows the bird to transition to a potentially known state from which we can act optimally.
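Putting the pieces together, the end-of-game update and the action choice can be sketched as follows. This is only an outline under the conventions above (step size 0.4, discount 1, rewards of +1 or -1000, ties broken toward not clicking); discretize is the helper sketched earlier, and the bookkeeping around the terminal state is an assumption.

    import collections

    ETA = 0.4       # step size
    T_PUNISH = 9    # punish the last t = 9 frames before death

    # Tabular Q-values: unseen (state, action) pairs start at 0.
    qvalues = collections.defaultdict(lambda: {"click": 0.0, "no click": 0.0})

    def incorporateFeedback(moves):
        """Update Q-values from the (state, action) history of one finished game."""
        for i, (raw_state, action) in enumerate(moves):
            s = discretize(raw_state)
            frames_from_death = len(moves) - i
            reward = -1000 if frames_from_death <= T_PUNISH else 1
            if i + 1 < len(moves):
                # Discount factor is 1, so the future term is simply the best
                # Q-value of the state actually reached at the next frame.
                future = max(qvalues[discretize(moves[i + 1][0])].values())
            else:
                future = 0.0  # terminal: the bird died after this move
            qvalues[s][action] = (1 - ETA) * qvalues[s][action] + ETA * (reward + future)

    def getAction(raw_state):
        """Pick the higher-valued action; break ties toward not clicking."""
        q = qvalues[discretize(raw_state)]
        return "click" if q["click"] > q["no click"] else "no click"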

Error Analysis

Our Agent met and exceeded both our baseline and our oracle, so (fortunately) there is not significant room left for improvement. However, there are certain drawbacks to our approach. Firstly, while it is remarkably fast to train to the point of approaching superhuman performance, training slows down considerably after approximately 5700 games. This is because, at that point, our Agent is so skilled that it takes a very long stretch of gameplay for the bird to die, and hence for feedback to be incorporated. Secondly, while our Agent is very skilled at this particular game, its knowledge is not generalizable: changing the pipe gap or the horizontal speed of the game, for example, resulted in significantly worse performance.
