REINFORCEMENT LEARNING (DD3359) O-03 END-TO-END LEARNING

REINFORCEMENT LEARNING (DD3359) O-03 END-TO-END LEARNING. Rika Antonova (antonova@kth.se), Ali Ghadirzadeh (algh@kth.se)

RL: What We Know So Far. Formulate the problem as an MDP (or POMDP). The state space captures information about the environment, e.g. the positions and velocities of the objects in the scene. The action space captures what our agent can do, e.g. position/acceleration/torque commands to each joint. Select an appropriate representation and parameters: continuous vs. discrete state/action spaces, the horizon length and discount factor, and a fully or partially observed state (MDP vs. POMDP). Slide by: Rika
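
As a small illustration of what such a problem specification might look like in code (the task, field names and numbers here are hypothetical, not from the lecture):

```python
from dataclasses import dataclass

@dataclass
class MDPSpec:
    """Hypothetical problem specification for a simple 2-joint reaching task."""
    state_dim: int = 4            # joint positions and velocities
    action_dim: int = 2           # torque command per joint
    continuous_actions: bool = True
    horizon: int = 200            # episode length
    discount: float = 0.99        # gamma
    fully_observed: bool = True   # MDP if True, otherwise treat as a POMDP

spec = MDPSpec()
print(spec)
```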

RL: What We Know So Far. Apply an appropriate RL algorithm to solve the problem. RL has been used for a variety of research problems in robotics; to get an overview of which approach might be appropriate for your problem, start by looking through the relevant surveys, e.g.: "Reinforcement Learning in Robotics: A Survey" (Jens Kober, J. Andrew Bagnell, Jan Peters, 2013); "A Survey on Policy Search for Robotics" (Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, 2013); "Learning Control in Robotics" (Stefan Schaal, Christopher G. Atkeson, 2010); and many more sources for specific subtasks/problems. Slide by: Rika

RL: Deeper Challenges. When the state/action space is large or continuous, function approximation is employed. Most recently, deep neural networks have been used successfully to approximate value and policy functions. But getting NNs to train well for an RL problem is not trivial; it is more difficult than supervised and unsupervised/structure learning! Slide by: Rika

End-to-end Training. Notable work on training NNs for RL was done by DeepMind in the context of games. DQN was the first visible demonstration of learning from pixels from scratch (no prior domain knowledge) using a generic algorithm (the NN structure is not task-specific): Playing Atari with Deep Reinforcement Learning. arXiv 2013, Mnih, Kavukcuoglu, Silver, Graves, Antonoglou, Wierstra, Riedmiller. Approaches from this line of work are useful to know about when working with NN-based RL in general. Slide by: Rika

Recall: Q-Learning. The Bellman Optimality Equation [equation on slide], annotated with: the stochastic reward from the environment, and the transition dynamics (not known explicitly; only perceived through interaction with the environment). Q-Learning is off-policy TD learning [update rule on slide; Sutton & Barto, Ch. 3], annotated with: the TD error. Slide by: Rika
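
In standard notation (Sutton & Barto), the two equations referenced on the slide are the Bellman optimality equation for Q and the off-policy TD (Q-learning) update, whose bracketed term is the TD error:

```latex
% Bellman optimality equation: expectation over the stochastic reward,
% sum over the (unknown) transition dynamics
Q^*(s,a) = \mathbb{E}\!\left[ r \mid s,a \right]
         + \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q^*(s',a')

% Q-learning (off-policy TD) update; the term in brackets is the TD error
Q(s_t,a_t) \leftarrow Q(s_t,a_t)
  + \alpha \Big[ r_{t+1} + \gamma \max_{a} Q(s_{t+1},a) - Q(s_t,a_t) \Big]
```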

Deep Q-Learning? We want a deep neural network as a function approximator for Q. Can we simply use the TD error as a loss and train our NN in a standard supervised-learning way? Problems? Slide by: Rika

Deep Q-Learning? Problems: (1) the (s,a,r,s') tuples are not iid (independent, identically distributed), but standard supervised learning approaches need iid samples; (2) the distribution of samples can change when the policy changes, but supervised learning usually makes a stationarity assumption; (3) large reward values (e.g. from longer episodes) might cause instabilities when training NNs. Slide by: Rika

DQN: Human-level control through deep RL. (Addressing problems 1 and 2) Use experience replay: break correlations in the data by shuffling (s,a,r,s') tuples, and learn from all past policies that explored the space. (Addressing problem 3) Reduce oscillations/instabilities: freeze the NN weights (θ_{i-1}) while updating the current weights (θ_i) on a batch of training data, and clip rewards or normalize them adaptively. Playing Atari with Deep Reinforcement Learning. Mnih et al., arXiv 2013. Slide by: Rika
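
A minimal sketch of these mechanisms, experience replay with reward clipping plus a periodically synced frozen target network, in plain Python/NumPy (buffer capacity, clipping range and sync frequency are illustrative choices, not taken from the paper):

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Store (s, a, r, s', done) tuples; uniform sampling breaks the
    temporal correlations that plain online Q-learning would suffer from."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        # Reward clipping as in DQN (here to [-1, 1]); alternatively
        # rewards could be normalized adaptively.
        self.buffer.append((s, a, float(np.clip(r, -1.0, 1.0)), s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return s, a, r, s_next, done


def sync_target(online_weights, target_weights, step, sync_every=10_000):
    """Copy the online network weights into the frozen target network every
    `sync_every` gradient steps (weights given as lists of NumPy arrays)."""
    if step % sync_every == 0:
        for online_w, target_w in zip(online_weights, target_weights):
            target_w[...] = online_w
    return target_weights
```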

DQN: Human-level control through deep RL. [Slide shows the Bellman Optimality Equation as written in Sutton & Barto Ch. 3 and the same equation in the notation of Mnih et al. 2013, with the stochastic reward from the environment marked in both.] Playing Atari with Deep Reinforcement Learning. Mnih et al., arXiv 2013. Slide by: Rika

DQN: Human-level control through deep RL. Construct a loss function based on the Bellman Optimality Equation [loss on slide], annotated with: the target for training iteration i, the NN weights from the previous training iteration, the behavior distribution (the states and actions encountered by the agent while learning), and the NN weights for the current iteration. Playing Atari with Deep Reinforcement Learning. Mnih et al., arXiv 2013. Slide by: Rika
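
In the notation of Mnih et al. (2013), the target and the loss at training iteration i read:

```latex
% target for iteration i, computed with the frozen weights from iteration i-1
y_i = \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) \;\middle|\; s,a \right]

% squared loss over the behaviour distribution \rho, with current weights \theta_i
L_i(\theta_i) = \mathbb{E}_{(s,a)\sim\rho(\cdot)}\!\left[ \big( y_i - Q(s,a;\theta_i) \big)^2 \right]
```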

DQN: Human-level control through deep RL. Differentiate the squared loss with respect to the NN weights θ_i, holding the NN weights from the previous iteration fixed when differentiating [gradient on slide, annotated with: the NN weights at training iteration i, the behavior policy, the "target network" (the Q network with weights from the previous training iteration held fixed), and the factor coming from the chain rule]. Do gradient descent to find the optimal NN weights. Playing Atari with Deep Reinforcement Learning. Mnih et al., arXiv 2013. Slide by: Rika
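
Differentiating that loss with respect to θ_i, holding θ_{i-1} fixed, gives (up to a constant factor absorbed into the learning rate):

```latex
\nabla_{\theta_i} L_i(\theta_i)
= \mathbb{E}_{(s,a)\sim\rho(\cdot),\, s'}\!\Big[
    \big( r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) - Q(s,a;\theta_i) \big)\,
    \nabla_{\theta_i} Q(s,a;\theta_i)
  \Big]
```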

DQN: Human-level control through deep RL. DEMO from: Human-level control through deep reinforcement learning. Mnih et al., Nature 2015. Slide by: Rika

DDPG: Deep Deterministic Policy Gradient. Recall from the lecture on continuous action spaces: DDPG is a model-free, off-policy RL method; it learns a deterministic policy (the actor) and can use any stochastic policy during training for exploration; it maintains a separate NN for learning the Q function (the critic). Why learn deterministic policies? They could be easier to learn than stochastic ones, and they are desirable when executing on robots. Continuous Control with Deep Reinforcement Learning. Lillicrap et al., ICLR 2016. Slide by: Rika

Making DDPG Work in Practice: Replay Buffer. At each training step: sample a minibatch uniformly from the buffer, use batch normalization (normalize each dimension to zero mean and unit variance), and update the critic and the actor. Continuous Control with Deep Reinforcement Learning. Lillicrap et al., ICLR 2016. Slide by: Rika

Making DDPG Work in Practice: Soft Target Networks. Use a copy of the actor and critic networks to compute the target values in the loss. The weights of these target networks are updated by slowly tracking the learned networks, with a rate τ ≪ 1 [update rule on slide, annotated with: the weights of the target networks and the weights of the actor and critic networks]. Continuous Control with Deep Reinforcement Learning. Lillicrap et al., ICLR 2016. Slide by: Rika
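
The soft update referenced on the slide is, in the notation of Lillicrap et al. (2016) with primes denoting the target networks:

```latex
\theta^{Q'} \leftarrow \tau\,\theta^{Q} + (1-\tau)\,\theta^{Q'},
\qquad
\theta^{\mu'} \leftarrow \tau\,\theta^{\mu} + (1-\tau)\,\theta^{\mu'},
\qquad \tau \ll 1
```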

DDPG: Deep Deterministic Policy Gradient. Learn the critic weights θ^Q by minimizing the loss [loss on slide, annotated with: the batch size, the target, and the target networks with weights slowly tracking the actor and critic NN weights]. Continuous Control with Deep Reinforcement Learning. Lillicrap et al., ICLR 2016. Slide by: Rika
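
Written out in the notation of the DDPG paper, with N the minibatch size and primed networks denoting the slowly tracking target networks:

```latex
% target computed with the target actor \mu' and target critic Q'
y_i = r_i + \gamma\, Q'\!\big(s_{i+1},\, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big)

% critic loss over a minibatch of size N
L(\theta^{Q}) = \frac{1}{N} \sum_{i=1}^{N} \big( y_i - Q(s_i, a_i \mid \theta^{Q}) \big)^2
```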

DDPG: Deep Deterministic Policy Gradient. Learn the actor weights θ^μ using the deterministic policy gradient theorem [gradient on slide], with states s_i from a minibatch of size N (collected while running the actor with weights θ^μ during a training episode). This is a deterministic version of the stochastic policy gradient theorem that we studied in one of the previous lectures. Continuous Control with Deep Reinforcement Learning. Lillicrap et al., ICLR 2016. Slide by: Rika
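
In the paper's notation, the minibatch estimate of the deterministic policy gradient is:

```latex
\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i=1}^{N}
  \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)}\;
  \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i}
```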

End-to-end RL Challenges. Approaches like DQN and DDPG learn from scratch. Upside: deep NNs will automatically learn to extract features useful for the task, e.g. they can learn directly from pixels / images of the scene! Downside: they might not be sample-efficient; it might take millions of samples to learn something useful, which could be prohibitively slow for learning on real hardware in real time. So, the next part of the lecture is on data-efficient algorithms designed to learn on real robots. Slide by: Rika

End-to-end deep learning: Recap. [Copyrighted figure: image inputs are mapped through the network parameters to motor outputs.]

RL Policy Search. [Copyrighted figures: a policy mapping image inputs to motor outputs.] Generate trajectories given the current policy, evaluate the sampled trajectories, and update the policy to make good samples more likely. This is inefficient for large policies.

RL Policy Search. [Copyrighted figures.] Randomly initialized policies are less likely to generate good trajectories to learn from.

Guided Policy Search Ingredients. Policy-search RL with complex dynamics and complex policies: difficult. Supervised learning with complex policies: manageable. Optimal control with complex dynamics: manageable. The GPS recipe combines the two manageable pieces: Optimal Control + Supervised Policy Learning.

GPS Trajectory Optimization. Find a trajectory based on optimal control, then solve a regression problem to match the policy to the observed trajectory. [Copyrighted figure.] This naïve approach would fail once the policy deviates from the demonstrated trajectory. Solution: find the widest trajectory distribution, sample from this distribution, and solve the regression problem to learn the policy. [Copyrighted figures.]

GPS Constraints. The produced action trajectories may not be well suited to train a neural network policy. See the presentation at https://www.youtube.com/watch?v=etmyh_--vnu. Adapt the teacher to produce samples well-suited for policy training.

GPS Constraints: Solution. Alternate between the two optimizations: optimize the NN policy to match the produced action trajectories, and optimize the trajectories with an extra constraint that keeps the samples from drifting too far from the policy. [Diagram: the local policies act on the full state, while the neural network policy acts on observations.] Train the NN policy parameters on the observed trajectories, and optimize the local policies to minimize the loss function.
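
One standard way to write the resulting constrained problem follows the GPS literature (e.g. Levine et al.); the exact expression on the slide is not transcribed, so the symbols below (x_t for the full state seen by the local policies, o_t for the observation seen by the NN policy, c(τ) for the trajectory cost) are assumptions:

```latex
\min_{\theta,\, p}\; \mathbb{E}_{p(\tau)}\big[\, c(\tau) \,\big]
\quad \text{s.t.} \quad
p(\mathbf{u}_t \mid \mathbf{x}_t) \;=\; \pi_\theta(\mathbf{u}_t \mid \mathbf{o}_t)
\quad \forall\, t
```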

Guided Policy Search: dual gradient descent. Alternate three updates: optimize the Lagrangian with respect to the local policies (trajectories), optimize it with respect to the neural network policy parameters, and then update the dual variables.
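
As a reminder of the generic scheme (the specific Lagrangian used in GPS appears in the slide images and is not transcribed here): for a constrained problem min_x f(x) subject to C(x) = 0, dual gradient descent forms the Lagrangian and alternates minimization over the primal variables with a gradient step on the dual variable:

```latex
\mathcal{L}(x, \lambda) \;=\; f(x) + \lambda\, C(x)

\text{repeat:}\qquad
x \;\leftarrow\; \arg\min_{x}\; \mathcal{L}(x, \lambda),
\qquad
\lambda \;\leftarrow\; \lambda + \alpha\, C(x)
```

In GPS the primal variables split into the local policies and the NN policy parameters, which gives the three alternating steps listed above.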

GPS Local policy optimization. The local policies are time-varying linear-Gaussian controllers. At each iteration: sample from each local policy and apply it to the real robot; fit local linear-Gaussian dynamics for each local policy; update the local policies using the fitted dynamics via a modified LQR algorithm.
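
Written out in the notation commonly used in the GPS papers (the symbols are not taken from the slide), both the local controllers and the fitted dynamics are Gaussian and linear in the state:

```latex
p_i(\mathbf{u}_t \mid \mathbf{x}_t) = \mathcal{N}\!\big(\mathbf{K}_t \mathbf{x}_t + \mathbf{k}_t,\; \boldsymbol{\Sigma}_t\big)
\qquad \text{(time-varying linear-Gaussian controller)}

p(\mathbf{x}_{t+1} \mid \mathbf{x}_t, \mathbf{u}_t) = \mathcal{N}\!\big(f_{\mathbf{x},t}\,\mathbf{x}_t + f_{\mathbf{u},t}\,\mathbf{u}_t + f_{c,t},\; \mathbf{F}_t\big)
\qquad \text{(fitted local linear-Gaussian dynamics)}
```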

Guided Policy Search: is GPS unnecessarily complicated? Each iteration: generate samples from the local policies; fit local dynamics; optimize the global policy parameters; update the local policies; increment the dual variable.

Mirror Descent Guided Policy Search. Each iteration: generate samples from the local or global policies; fit local dynamics; linearize the global policy using the samples; update the local policies; update the global policy. MDGPS is less complicated and has better convergence properties.

Path Integral Guided Policy Search. LQR-based local policies require a smooth and differentiable cost function; path-integral RL (model-free) can be used with MDGPS instead.

End-to-end training: Features. [Copyrighted figure.] Slide by: Ali