Tutorial of Reinforcement: A Special Focus on Q-Learning


Tutorial of Reinforcement: A Special Focus on Q-Learning
Tingwu Wang, Machine Learning Group, University of Toronto

Contents
1. Introduction
   1. Discrete Domain vs. Continuous Domain
   2. Model-Based vs. Model-Free
   3. Value-Based vs. Policy-Based
   4. On-Policy vs. Off-Policy
2. Prediction vs. Control: Marching Towards Q-Learning
   1. Prediction: TD-Learning and the Bellman Equation
   2. Control: Bellman Optimality Equation and SARSA
   3. Control: Switching to the Q-Learning Algorithm
3. Misc: Continuous Control
   1. Policy-Based Algorithms
   2. NerveNet: Learning Structured Policy in RL
4. References

Introduction
1. Today's focus: the Q-learning [1] method.
   1. Q-learning is a { discrete-domain, value-based, off-policy, model-free, control, often shows up in ML finals } algorithm.
2. Related to Q-learning [2]:
   1. The Bellman equation.
   2. TD-learning.
   3. The SARSA algorithm.

Discrete Domain vs. Continuous Domain
1. Discrete action space (our focus).
   1. Only a few actions are available (e.g. up, down, left, right).
   2. Often solved by value-based methods (DQN [3], or DQN + MCTS [4]).
   3. Policy-based methods work too (TRPO [5] / PPO [6], not our focus).

Discrete Domain vs. Continuous Domain
1. Continuous action space (not our focus).
   1. An action is a value from a continuous interval.
      1. Infinite number of choices.
      2. E.g. locomotion control of robots (MuJoCo [7]): actions could be the forces applied to each joint (say, 0-100 N).
   2. If we discretize the action space, we get back a discrete-domain problem (e.g. autonomous driving); a minimal discretization sketch follows below.
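As a concrete illustration of the discretization idea (my own sketch, not from the slides), the following snippet maps the hypothetical 0-100 N force interval from the MuJoCo example onto a small set of discrete actions; the bin count is an arbitrary choice.

```python
import numpy as np

# Assumption: a single joint whose force can be any value in [0, 100] N.
N_BINS = 11                                  # hypothetical number of discrete actions
ACTIONS = np.linspace(0.0, 100.0, N_BINS)    # 0, 10, 20, ..., 100 N

def to_discrete(force):
    """Map a continuous force to the index of the nearest discrete action."""
    return int(np.argmin(np.abs(ACTIONS - force)))

def to_continuous(action_index):
    """Map a discrete action index back to a force the simulator can apply."""
    return float(ACTIONS[action_index])

# After this mapping, value-based methods such as Q-learning can treat the
# problem as a discrete-domain one with N_BINS actions per joint.
```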

Model-Based vs. Model-Free
1. Model-based RL makes use of a dynamical model of the environment (not our focus).
   1. Pros:
      1. Better sample efficiency and transferability (VIN [8]).
      2. Security/performance guarantees (if the model is good).
      3. Enables Monte-Carlo Tree Search (used in AlphaGo [4]).
      4. ...
   2. Cons:
      1. The dynamical model is itself difficult to train.
      2. Time-consuming.
      3. ...

Model-Based vs. Model-Free
1. Model-free RL makes no assumptions about the environment's dynamical model (our focus).
   1. In the ML community, more focus has been put on model-free RL.
   2. E.g.:
      1. In Q-learning, we can choose our action by looking at Q(s, a), without worrying about what happens next.
      2. In AlphaGo, the authors combine model-free and model-based methods (much stronger performance, given a perfect dynamical model for chess/Go).

Value-Based vs. Policy-Based
1. Value-based methods are more interested in "value" (our focus).
   1. Estimate the expected reward for different actions given the initial states (table from Silver's slides [9]).
   2. Policies are chosen by looking at values, as in the sketch below.
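A minimal sketch of "choosing the policy by looking at values" (my own illustration, not from the slides): keep a table Q[s, a] of estimated action values and act greedily with respect to it. The table sizes here are placeholders.

```python
import numpy as np

n_states, n_actions = 5, 4           # hypothetical sizes
Q = np.zeros((n_states, n_actions))  # action-value table, learned elsewhere

def greedy_policy(state):
    """Pick the action with the highest estimated value in this state."""
    return int(np.argmax(Q[state]))
```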

Value-Based vs. Policy-Based
1. Policy-based methods directly model the policy (not our focus).
   1. The objective function is the expected average reward.
      1. Usually optimized by policy gradients or evolutionary updates.
   2. If a value function is used to reduce variance --> actor-critic methods.

On-Policy vs. Off-Policy
1. Behavior policy & target policy. My own way of telling them apart (works most of the time):
   1. The behavior policy is the policy used to generate training data.
      1. The data could be generated by other agents (learning by watching).
      2. The agent may just want to do something new to explore the world.
      3. Previously generated data can be re-used.
   2. The target policy is the policy the agent would use if it were put into testing.
   3. Behavior policy == target policy: on-policy; otherwise: off-policy.


Prediction: TD-Learning and the Bellman Equation
1. Prediction:
   1. Evaluating a given policy (which could be a crappy one).
   2. Bellman Expectation Equation (covered in lecture slides). Drop the expectation if the process is deterministic.
   3. Algorithms:
      1. Monte-Carlo algorithm (not our focus).
         1. It learns directly from episodes of experience.
      2. Dynamic Programming (not our focus).
         1. Only applicable when the dynamical model is known and small.
      3. TD-learning algorithm (related to Q-learning, covered in lecture slides).
         1. Update the value V(S_t) toward the estimated return R_{t+1} + γV(S_{t+1}); see the sketch below.
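A minimal TD(0) prediction sketch (my own illustration, not the lecture's code), assuming an episodic environment with a Gym-style interface (env.reset(), env.step()) and a fixed policy to evaluate; `env` and `policy` are placeholder names.

```python
def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=0.99):
    V = {}  # state -> estimated value of that state under `policy`
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                            # on-policy: data comes from the evaluated policy
            s_next, r, done, _ = env.step(a)
            v_next = 0.0 if done else V.get(s_next, 0.0)
            td_target = r + gamma * v_next           # R_{t+1} + gamma * V(S_{t+1})
            V[s] = V.get(s, 0.0) + alpha * (td_target - V.get(s, 0.0))
            s = s_next
    return V
```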

Prediction: TD-Learning and the Bellman Equation
1. Prediction examples (figures omitted in the transcription).
2. Since the trajectories are generated by the policy we want to evaluate, the value function eventually converges to the true value under that policy.

Control: Bellman Optimality Equation and SARSA
1. Control:
   1. Obtaining the optimal policy.
      1. Loop: evaluate with the Bellman Expectation Equation, then improve the policy.
   2. Bellman Optimality Equation (covered in lecture slides).
   3. SARSA:
      1. Fix the policy to be the epsilon-greedy policy derived from the Bellman Optimality Equation.
      2. Update the action values (and hence the policy) using the Bellman Expectation Equation (TD), as in the sketch below.
      3. When the Bellman Expectation Equation converges, the Bellman Optimality Equation is satisfied.
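A minimal tabular SARSA sketch (my own illustration, under the same Gym-style environment assumption as the TD(0) sketch above). Both the behavior policy and the target policy are epsilon-greedy, which is what makes SARSA on-policy.

```python
import random
from collections import defaultdict

def sarsa(env, n_actions, n_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(lambda: [0.0] * n_actions)

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[s][a])

    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            a_next = epsilon_greedy(s_next)          # next action drawn from the same policy
            target = r + (0.0 if done else gamma * Q[s_next][a_next])
            Q[s][a] += alpha * (target - Q[s][a])    # TD update on Q(S, A)
            s, a = s_next, a_next
    return Q
```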

Control: Switching to the Q-Learning Algorithm
1. Switching to an off-policy method.
   1. SARSA has the same target policy and behavior policy (epsilon-greedy).
   2. Q-learning may have different target and behavior policies.
      1. Target policy: the greedy policy (Bellman Optimality Equation).
      2. Common behavior policy for Q-learning: the epsilon-greedy policy (see the sketch below).
         1. Choose a random action with probability epsilon, and the greedy action with probability (1 - epsilon).
         2. Decay epsilon over time.
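A minimal tabular Q-learning sketch (my own illustration, same Gym-style environment assumption as above). The behavior policy is epsilon-greedy with a decaying epsilon, while the update bootstraps from the greedy (max) action, so the target policy differs from the behavior policy: off-policy learning.

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, n_episodes=1000, alpha=0.1, gamma=0.99,
               epsilon=1.0, epsilon_min=0.05, epsilon_decay=0.995):
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy.
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda act: Q[s][act])
            s_next, r, done, _ = env.step(a)
            # Target policy: greedy, via the max over next-state action values.
            best_next = 0.0 if done else max(Q[s_next])
            Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
            s = s_next
        epsilon = max(epsilon_min, epsilon * epsilon_decay)  # decay epsilon over time
    return Q
```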


Policy-Based Algorithms
1. Policy gradient (not our focus).
   1. Objective function: the expected return of the policy π_θ (equation omitted in the transcription).
   2. Taking the gradient (Policy Gradient Theorem): ∇_θ J(θ) = E[∇_θ log π_θ(a|s) Q_w(s, a)].
   3. Variants:
      1. If Q_w is the empirical return: the REINFORCE algorithm [10] (see the sketch below).
      2. If Q_w is an estimate of the action-value function: actor-critic methods [11].
      3. If KL constraints are added on policy updates: TRPO / PPO.
      4. If the policy is deterministic: DPG [12] / DDPG [13] (Deterministic Policy Gradient).
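A minimal REINFORCE sketch (my own illustration, not the slides' code) with a linear softmax policy over discrete actions, using Monte-Carlo returns as Q_w. It assumes a Gym-style environment whose states are NumPy feature vectors of dimension `n_features`; all names and hyperparameters are placeholders.

```python
import numpy as np

def reinforce(env, n_features, n_actions, n_episodes=1000, alpha=0.01, gamma=0.99):
    theta = np.zeros((n_features, n_actions))   # policy parameters

    def policy_probs(s):
        logits = s @ theta
        logits -= logits.max()                  # numerical stability
        exp = np.exp(logits)
        return exp / exp.sum()

    for _ in range(n_episodes):
        # Roll out one episode with the current stochastic policy.
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            p = policy_probs(s)
            a = int(np.random.choice(n_actions, p=p))
            s_next, r, done, _ = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
        # Policy gradient update with the empirical return as Q_w:
        # theta += alpha * G_t * grad log pi(a_t | s_t).
        G = 0.0
        for t in reversed(range(len(rewards))):
            G = rewards[t] + gamma * G
            p = policy_probs(states[t])
            grad_log = -np.outer(states[t], p)   # d log softmax / d theta
            grad_log[:, actions[t]] += states[t]
            theta += alpha * G * grad_log
    return theta
```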

NerveNet: Learning Structured Policy in RL
1. NerveNet:
   1. In traditional reinforcement learning, agent policies are learned with MLPs that take the concatenation of all observations from the environment as input to predict actions.
   2. We propose NerveNet to explicitly model the structure of an agent, which naturally takes the form of a graph.


References
[1] Watkins, Christopher J. C. H., and Peter Dayan. "Q-learning." Machine Learning 8.3-4 (1992): 279-292.
[2] Sutton, Richard S., Doina Precup, and Satinder Singh. "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning." Artificial Intelligence 112.1-2 (1999): 181-211.
[3] Mnih, Volodymyr, et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
[4] Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
[5] Schulman, John, et al. "Trust region policy optimization." Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 2015.
[6] Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
[7] Todorov, Emanuel, Tom Erez, and Yuval Tassa. "MuJoCo: A physics engine for model-based control." Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012.
[8] Tamar, Aviv, et al. "Value iteration networks." Advances in Neural Information Processing Systems. 2016.
[9] Silver, David. UCL Course on RL. http://www0.cs.ucl.ac.uk/staff/d.silver/web/teaching.html
[10] Williams, R. J. "Toward a theory of reinforcement-learning connectionist systems." Technical Report (1988).
[11] Konda, Vijay R., and John N. Tsitsiklis. "Actor-critic algorithms." Advances in Neural Information Processing Systems. 2000.
[12] Silver, David, et al. "Deterministic policy gradient algorithms." Proceedings of the 31st International Conference on Machine Learning (ICML-14). 2014.
[13] Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).