ECE 517: Reinforcement Learning in Artificial Intelligence

ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 17: Case Studies and Policy Gradient
October 29, 2015
Dr. Itamar Arel
College of Engineering
Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2015

Introduction
- We'll discuss several case studies of reinforcement learning, illustrating some of the trade-offs and issues that arise in real-world applications
- For example, we emphasize how domain knowledge is incorporated into the formulation and solution of the problem
- We also highlight the representation issues that are so often critical to successful applications
- Applications of RL are still far from routine and typically require as much art as science
- Making applications easier and more straightforward is one of the goals of current RL research

TD-Gammon (Tesauro, 1992, 1994, 1995, ...)
- One of the most impressive early applications of RL to date is Gerry Tesauro's (IBM) backgammon player
- TD-Gammon required little backgammon knowledge, yet learned to play extremely well, near the level of the world's strongest grandmasters
- The learning algorithm was a straightforward combination of the TD(λ) algorithm and nonlinear function approximation: a feedforward neural network (FFNN) trained by backpropagating TD errors
- There are probably more professional backgammon players than there are professional chess players
- Backgammon is in part a game of chance; it can be viewed as a large MDP

TD-Gammon (cont.)
- The game is played with 15 white and 15 black pieces on a board of 24 locations, called points
- Here's a typical position early in the game, seen from the perspective of the white player

TD-Gammon (cont.)
- White has just rolled a 5 and a 2, so it can move one of its pieces 5 steps and one (possibly the same piece) 2 steps
- The objective is to advance all pieces to points 19-24, and then off the board
- Hitting: landing on a point occupied by a single opposing piece removes that piece from play temporarily
- 30 pieces on 24 locations imply an enormous number of configurations (the state set is ~10^20)
- The effective branching factor is about 400: there are 21 distinct dice rolls, each allowing ~20 possible plays

TD-Gammon: details
- Although the game is highly stochastic, a complete description of the game's state is available at all times
- The estimated value of any state was meant to predict the probability of winning starting from that state
- Reward: 0 at all times except those on which the game is won, when it is 1
- Episodic (one game = one episode), undiscounted
- Nonlinear form of TD(λ) using a feedforward neural network, with backpropagation of the TD error
- 4 input units for each point: unary encoding of the number of white pieces, plus other features
- Use of afterstates
- Learning during self-play, fully incrementally
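To make this concrete, here is a minimal sketch (not Tesauro's code) of the update just described: a small sigmoid network estimates the probability of winning, and TD(λ) eligibility traces carry the backpropagated gradients between successive afterstates. The network sizes follow the commonly cited TD-Gammon 1.0 figures (198 inputs, 40 hidden units), but the step sizes and the feature vectors are assumptions for illustration.

```python
import numpy as np

N_IN, N_HID = 198, 40                      # TD-Gammon 1.0 sizes (assumed here)
rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.1, (N_HID, N_IN))
W2 = rng.normal(0.0, 0.1, N_HID)
alpha, lam = 0.1, 0.7                      # illustrative, not Tesauro's values

def forward(x):
    h = 1.0 / (1.0 + np.exp(-(W1 @ x)))    # hidden activations
    y = 1.0 / (1.0 + np.exp(-(W2 @ h)))    # estimated P(win) for this position
    return h, y

def gradients(x, h, y):
    # Backpropagate dy/dW through both sigmoid layers.
    dy = y * (1.0 - y)
    gW2 = dy * h
    gW1 = np.outer(dy * W2 * h * (1.0 - h), x)
    return gW1, gW2

def td_lambda_episode(positions, z):
    # positions: afterstate feature vectors from one self-play game;
    # z: final outcome (1 if the evaluated side won, else 0).
    global W1, W2
    e1, e2 = np.zeros_like(W1), np.zeros_like(W2)
    for t, x in enumerate(positions):
        h, y = forward(x)
        g1, g2 = gradients(x, h, y)
        e1 = lam * e1 + g1                 # eligibility trace on every weight
        e2 = lam * e2 + g2
        # Reward is 0 except at the end of the game, where the target is z.
        target = forward(positions[t + 1])[1] if t + 1 < len(positions) else z
        delta = target - y                 # undiscounted TD error
        W1 += alpha * delta * e1
        W2 += alpha * delta * e2
```

In self-play, both sides would pick moves by evaluating the afterstates reachable from the current roll with this same network, so the learner generates its own training games.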

TD-Gammon Neural Network Employed

Summary of TD-Gammon Results
- Two players played against each other; each had no prior knowledge of the game
- Only the rules of the game were imposed
- Humans learn from machines: TD-Gammon learned to play certain opening positions differently than was the convention among the best human players

Rebuttal on TD-Gammon
- For an alternative view, see "Why did TD-Gammon Work?", Jordan Pollack and Alan Blair, NIPS (1997)
- Claim: it was the co-evolutionary training strategy of playing games against itself that led to the success
- There was no need to deal with exploration/exploitation; any such self-play approach would work with backgammon
- Little sensitivity of success to the learned state values; the result does not extend to other problems (e.g. Tetris, maze-type problems), where the exploration issue comes up

The Acrobot
- A robotics application of RL
- Roughly analogous to a gymnast swinging on a high bar
- The first joint (corresponding to the hands on the bar) cannot exert torque; the second joint (corresponding to the gymnast bending at the waist) can
- This system has been widely studied by control engineers and machine learning researchers

The Acrobot (cont.)
- One objective for controlling the Acrobot is to swing the tip (the "feet") above the first joint by an amount equal to one of the links, in minimum time
- In this task, the torque applied at the second joint is limited to three choices: positive torque of a fixed magnitude, negative torque of the same magnitude, or no torque
- A reward of -1 is given on all time steps until the goal is reached, which ends the episode; no discounting is used
- Thus, the optimal value of any state is minus the minimum time to reach the goal (an integer number of steps)
- Sutton (1996) addressed the Acrobot swing-up task in an online, model-free context
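For reference, here is a minimal sketch of the kind of learner used on this task: Sarsa(λ) with linear function approximation (Sutton's experiments built the features with tile coding). The environment interface `env`, the feature map `phi`, and all constants below are assumed placeholders, not the original setup.

```python
import numpy as np

N_FEATURES, N_ACTIONS = 4096, 3    # three torques: -tau, 0, +tau
w = np.zeros((N_ACTIONS, N_FEATURES))
alpha, lam = 0.1, 0.9              # gamma = 1: episodic, undiscounted

def q(phi_s, a):
    return w[a] @ phi_s

def greedy(phi_s):
    # Greedy action selection suffices here: with -1 rewards per step, the
    # zero-initialized weights are optimistic, which itself drives exploration.
    return int(np.argmax([q(phi_s, a) for a in range(N_ACTIONS)]))

def run_episode(env, phi):
    # Assumes env.reset() -> state and env.step(a) -> (state, reward, done),
    # and phi(state) -> feature vector of length N_FEATURES.
    global w
    e = np.zeros_like(w)              # one eligibility trace per weight
    s = env.reset()
    phi_s = phi(s)
    a = greedy(phi_s)
    while True:
        s_next, r, done = env.step(a)    # r = -1 until the goal is reached
        delta = r - q(phi_s, a)
        e[a] += phi_s                    # accumulating trace for (s, a)
        if done:
            w += alpha * delta * e
            return
        phi_s = phi(s_next)
        a = greedy(phi_s)
        delta += q(phi_s, a)             # gamma = 1, so no discount factor
        w += alpha * delta * e
        e *= lam                         # decay all traces
```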

Acrobot Learning Curves for Sarsa(λ)

Typical Acrobot Learned Behavior

RL in Robotics
- Robot motor capabilities have been investigated using RL: walking, grabbing, and delivering (MIT Media Lab)
- RoboCup competitions (soccer games), where Sony AIBOs are commonly employed
- Maze-type problems
- Robots balancing themselves on an unstable platform
- Multi-dimensional input streams

Policy Gradient Methods
- Assume that our policy, π, has a set of n real-valued parameters, θ = {θ_1, θ_2, θ_3, ..., θ_n}
- Running the policy with a particular θ results in a reward, r
- Estimate the reward gradient, ∂r/∂θ_i, for each θ_i, and update
  θ_i ← θ_i + α ∂r/∂θ_i
  where α is another learning rate
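In its simplest model-free form, this gradient can be estimated by perturbing each parameter and measuring the change in reward. A minimal sketch of one such update, assuming a caller-supplied `evaluate(theta)` that runs the policy and returns its average reward:

```python
import numpy as np

def policy_gradient_step(theta, evaluate, eps=0.05, alpha=0.1):
    # Estimate dr/dtheta_i by central finite differences, then step uphill.
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        up, down = theta.copy(), theta.copy()
        up[i] += eps
        down[i] -= eps
        grad[i] = (evaluate(up) - evaluate(down)) / (2.0 * eps)
    return theta + alpha * grad    # alpha is the learning rate on the slide
```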

Policy Gradient Methods (cont.)
- This results in hill-climbing in policy space, so it is subject to all the problems of hill-climbing
- But we can also use tricks from search theory, like random starting points and momentum terms
- This is a good approach if you have a parameterized policy; let's assume we have a reasonable starting policy
- Typically faster than value-based methods
- Safe exploration, if you have a good policy
- Learns locally best parameters for that policy

An Example: Learning to Walk
- RoboCup four-legged league, where walking quickly is a big advantage
- Historically, gaits were tuned manually
- The robots have a parameterized gait controller: 12 parameters controlling step length, height, etc.
- A robot walks across the soccer field and is timed; the reward is a function of the time taken
- The robots know when to stop (via a distance measure)

An Example: Learning to Walk (cont.)
Basic idea:
1. Pick an initial θ = {θ_1, θ_2, ..., θ_12}
2. Generate N test parameter settings by perturbing θ: θ^j = {θ_1 + d_1, θ_2 + d_2, ..., θ_12 + d_12}, with each d_i ∈ {-ε, 0, +ε}
3. Test each setting θ^j and observe its reward r^j
4. For each θ_i, calculate the average rewards Avg_i^+, Avg_i^0, and Avg_i^- (e.g., Avg_i^- is the average reward over all settings j with θ_i^j = θ_i - ε), and set
   θ'_i = θ_i + ε if Avg_i^+ is largest, θ_i if Avg_i^0 is largest, or θ_i - ε if Avg_i^- is largest
5. Set θ ← θ' and go to step 2
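The loop above can be written compactly as follows (a sketch in the style of Kohl and Stone's RoboCup gait work, not their exact code); `time_walk` stands in for running the robot and returning its reward, and N, eps, and iters are illustrative values.

```python
import numpy as np

def walk_policy_search(theta, time_walk, iters=100, N=15, eps=0.05, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    for _ in range(iters):
        # Steps 1-2: N random perturbations, each d_i drawn from {-eps, 0, +eps}.
        deltas = rng.choice([-eps, 0.0, eps], size=(N, theta.size))
        rewards = np.array([time_walk(theta + d) for d in deltas])  # step 3
        new_theta = theta.copy()
        for i in range(theta.size):                                 # step 4
            def avg(mask):        # average reward of one perturbation group
                return rewards[mask].mean() if mask.any() else -np.inf
            a_plus = avg(deltas[:, i] > 0)
            a_zero = avg(deltas[:, i] == 0)
            a_minus = avg(deltas[:, i] < 0)
            if a_zero >= a_plus and a_zero >= a_minus:
                pass                          # leave theta_i unchanged
            elif a_plus >= a_minus:
                new_theta[i] += eps
            else:
                new_theta[i] -= eps
        theta = new_theta                                           # step 5
    return theta
```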

An Example: Learning to Walk (cont.)
(Figure: walking behavior before learning, "Initial", and after learning, "Final")

Value Function or Policy Gradient?
- When should I use policy gradient?
  - When there's a parameterized policy
  - When there's a high-dimensional state space
  - When we expect the gradient to be smooth
  - Typically on episodic tasks (e.g. AIBO walking)
- When should I use a value-based method?
  - When there is no parameterized policy
  - When we have no idea how to solve the problem (i.e. no known structure)

Summary
- RL is a powerful tool that can support a wide range of applications
- There is an art to defining the observations, states, rewards, and actions
- Main goal: formulate as simple a representation as possible
- Policy gradient methods search directly in policy space