DeepMind Self-Learning Atari Agent


DeepMind Self-Learning Atari Agent
"Human-level control through deep reinforcement learning," Nature, Vol. 518, Feb 26, 2015
"The Deep Mind of Demis Hassabis," Backchannel / Medium.com interview by Steven Levy
Advanced Topics: Reinforcement Learning class notes, David Silver, UCL & DeepMind
Nikolai Yakovenko, 3/25/15, for EE6894

Motivations: "automatically convert unstructured information into useful, actionable knowledge"; the "ability to learn for itself from experience, and therefore it can do stuff that maybe we don't know how to program." - Demis Hassabis

"If you play bridge, whist, whatever, I could invent a new card game and you would not start from scratch; there is transferable knowledge." An explicit first step toward self-learning intelligent agents with transferable knowledge.

Why Games? Easy to create more data. Easy to compare solutions. (Relatively) easy to transfer knowledge between similar problems. But not yet.

"The idea is to slowly widen the domains. We have a prototype for this: the human brain. We can tie our shoelaces, we can ride cycles and we can do physics, with the same architecture. So we know this is possible." - Demis Hassabis

What They Did: An agent that learns to play any of 49 Atari arcade games. Learns strictly from experience. Only the game screen as input. No game-specific settings.

DQN: A novel agent, called the deep Q-network (DQN). Q-learning (reinforcement learning): choose actions to maximize future rewards, via a Q-function. CNN (convolutional neural network): represents the visual input space and maps it to game actions. Experience replay: batches updates of the Q-function on a fixed set of stored observations. There is no guarantee that this converges, or works very well. But often, it does.

DeepMind Atari -- Breakout

DeepMind Atari Space Invaders

CNN, from screen to Joystick

The Recipe: Connect the game screen via a CNN to a top layer of reasonable dimension, fully connected to all possible user actions. Learn the optimal Q-function Q*, maximizing future game rewards. Batch experiences, and randomly sample a batch, with experience replay. Iterate until done.

Obvious Questions. State: screen transitions, not just one frame (four stacked frames). Actions: how to start? Start with no action, or force the machine to wiggle the joystick. Reward: what is it? The game score. The game AI will totally fail in cases where these are not sufficient.

Peek-forward to results. Space Invaders Seaquest

But first Reinforcement Learning in One Slide

Markov Decision Process: a fully observable universe. State space S, action space A. Transition probability function f: S x A x S -> [0, 1]. Reward function r: S x A x S -> Real. At a discrete time step t, given state s, the controller takes action a according to a control policy π: S -> A (which may be probabilistic). Integrate over the results to learn the (average) expected reward.

Control Policy <-> Q-Function: Every control policy π has a corresponding Q-function Q: S x A -> Real, which gives the reward value for taking action a in state s, assuming future actions are taken with policy π. Our goal is to learn an optimal policy. This can be done by learning the optimal Q* function, with discount rate γ at each time step t (the maximum discounted reward, over all control policies π).
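In symbols, the standard γ-discounted definitions (notation here is the usual one, matching the slide's description):

```latex
Q^{\pi}(s, a) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_0 = s,\; a_0 = a,\; a_{t>0} \sim \pi \right],
\qquad
Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a).
```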

Q-learning: Start with any Q, typically all zeros. Perform various actions in various states, and observe the rewards. Iterate toward the next-step estimate of Q*, with α = learning rate.
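Written out, this is the standard tabular Q-learning update, using the slide's α (learning rate) and γ (discount):

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```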

Dammit, this is a bit complicated.

Dammit, this is complicated. Let's steal excellent slides from David Silver, University College London, and DeepMind.

Observation, Action & Reward

Measurable Progress

(Long-term) Greed is Good?

Markov State = Memory not Important

Rodentus Sapiens: Need-to-Know Basis

MDP: Policy & Value. Setting up a complex problem as a Markov Decision Process (MDP) involves tradeoffs. Once in an MDP, there is an optimal policy for maximizing rewards, and thus each environment state has a value: follow the optimal policy forward, to its conclusion. Optimal policy <-> true value at each state.

Chess Endgame Database If value is known, easy to pursue optimal policy.

Policy: Simon Says

Value: Simulate Future States, Sum Future Rewards Familiar to stock market watchers: discounted future dividends.

Simple Maze

Maze Policy

Maze Value

OK, we get it. Policy & value.

Back to Atari

How Game AI Normally Works Heuristic to evaluate game state; tricks to prune the tree.

These seem radically different approaches to playing games

but part of the Explore & Exploit Continuum

RL is Trial & Error

E&E Present in (most) Games

Back to Markov for a second

Markov Reward Process (MRP)

MRP for a UK Student

Discounted Total Return
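Written out, the discounted return from time t (the standard definition):

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad \gamma \in [0, 1]
```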

Discounting the Future We do it all the time.

Short Term View

Long Term View

Back to Q*

Q-Learning in One Slide Each step: we adjust Q toward observations, at learning rate α.

Q-Learning Control: Simulate every Decision

Q-Learning Algorithm. Or learn on-policy, by choosing actions according to the current policy rather than at random.
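A minimal sketch of this loop in tabular form with ε-greedy exploration. This is a toy illustration, not the DQN code; the environment interface (reset/step/actions) is assumed:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.

    `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done), plus a list env.actions.
    """
    Q = defaultdict(float)  # Q[(state, action)] starts at zero

    def greedy(state):
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Explore with probability epsilon, otherwise exploit the current Q
            action = random.choice(env.actions) if random.random() < epsilon else greedy(state)
            next_state, reward, done = env.step(action)
            # Move Q(s, a) toward the observed reward plus the discounted bootstrap
            target = reward + (0 if done else gamma * max(Q[(next_state, a)] for a in env.actions))
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```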

Think Back to the Atari Videos: By default, the system takes the default action (no action). Unless rewards are observed within a few steps of an action, the system moves toward a solution very slowly.

Back to the CNN

CNN, from screen (S) to Joystick (A)

Four Frames 256 hidden units
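A rough sketch of this screen-to-joystick network in PyTorch. The four stacked input frames and the 256-unit hidden layer are from the slide; the convolutional filter sizes and counts follow the original DQN paper and should be treated as assumptions:

```python
import torch
import torch.nn as nn

class DQNNetwork(nn.Module):
    """Maps a stack of 4 preprocessed 84x84 game frames to one Q-value per action."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4 input frames -> 16 feature maps
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # -> 32 feature maps
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),  # 256 hidden units, as on the slide
            nn.ReLU(),
            nn.Linear(256, n_actions),   # one output per joystick action
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.head(self.features(frames))
```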

Experience Replay: Simply, batch training. Feed in a bunch of transitions and compute a new approximation of Q*, assuming the current policy. Don't adjust Q after every data point: pre-compute some changes for a bunch of states, then pull a random batch from the database.
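A minimal sketch of such a replay memory, as a plain ring buffer with uniform random sampling. Class and method names here are illustrative, not DeepMind's:

```python
import random
from collections import deque

class ReplayMemory:
    """Stores (state, action, reward, next_state, done) transitions and
    serves uniformly sampled mini-batches for Q-function updates."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off the end

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 32):
        # Random sampling breaks the correlation between consecutive frames
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```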

Experience Replay (Batch train): DQN

Experience Replay with SGD
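In symbols, the batched SGD step minimizes a squared temporal-difference error of roughly this form over sampled transitions, where θ⁻ denotes an older, periodically frozen copy of the network parameters (the Nature paper's target network):

```latex
L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}
\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right]
```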

Do these methods help? Yes. Quite a bit. Units: game high score.

Finally, results: it works! (sometimes) Space Invaders, Seaquest

Some Games Better Than Others. Good at: quick-moving, complex, short-horizon games; semi-independent trials within the game; negative feedback on failure; Pinball. Bad at: long-horizon games that don't converge; Ms. Pac-Man; any walking-around game.

Montezuma: Drawing Dead Can you see why?

Can DeepMind learn from Chutes & Ladders? How about Parcheesi?

Actions & Values. Value is the expected (discounted) score from a state. Breakout: value increases as the agent gets closer to a medium-term reward. Pong: action values differentiate as the agent gets closer to ruin.

Frames, Batch Sizes Matter

Bibliography
DeepMind Nature paper (with video): http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html
Demis Hassabis interview: https://medium.com/backchannel/the-deepmind-of-demis-hassabis-156112890d8a
Wonderful Reinforcement Learning class (David Silver, University College London): http://www0.cs.ucl.ac.uk/staff/d.silver/web/teaching.html
Readable (kind of) paper on Replay Memory: http://busoniu.net/files/papers/smcc11.pdf
Chutes & Ladders, an ancient morality tale: http://uncyclopedia.wikia.com/wiki/chutes_and_ladders
ALE (Arcade Learning Environment): http://www.arcadelearningenvironment.org/
Stella (multi-platform Atari 2600 emulator): http://stella.sourceforge.net/faq.php
Deep Q-RL with Theano: https://github.com/spragunr/deep_q_rl

Addendum: Atari Setup w/ Stella

Addendum: ALE Atari Agent. A compiled agent communicates with the emulator over I/O pipes and saves frames.

Addendum: (Video) Poker? Can the input be fully connected to the actions? Atari games are played one button at a time. Here, we choose which cards to keep. Remember Montezuma's Revenge!

Addendum: Poker Transition. How does one encode this for RL? OpenCV makes image generation easy.