Reinforcement Learning


Reinforcement Learning
Assumptions we made so far:
- Known state space S
- Known transition model T(s, a, s')
- Known reward function R(s)
These assumptions are not realistic for many real agents.
Reinforcement Learning: learn an optimal policy in an a priori unknown environment.
- Assume a fully observable state (i.e. the agent can tell which state it is in).
- The agent needs to explore the environment (i.e. experimentation).

Passive Reinforcement Learning
Task: given a policy π, what is the utility function U^π?
- Similar to Policy Evaluation, but T(s, a, s') and R(s) are unknown.
Approach: the agent experiments in the environment.
- Trials: execute the policy from the start state until a terminal state is reached.
Trial 1: (1,1) -0.04, (1,2) -0.04, (1,3) -0.04, (1,2) -0.04, (1,3) -0.04, (2,3) -0.04, (3,3) -0.04, (4,3) 1.0
Trial 2: (1,1) -0.04, (1,2) -0.04, (1,3) -0.04, (2,3) -0.04, (3,3) -0.04, (3,2) -0.04, (3,3) -0.04, (4,3) 1.0
Trial 3: (1,1) -0.04, (2,1) -0.04, (3,1) -0.04, (3,2) -0.04, (4,2) -1.0

Direct Utility Estimation
Data: the three trials shown above.
Idea: average the reward-to-go over all trials for each state independently.
From the data above, estimate U(1,1):
A = 0.72   B = -1.16   C = 0.28   D = 0.55

Direct Utility Estimation
Data: the three trials shown above.
Idea: average the reward-to-go over all trials for each state independently.
From the data above, estimate U(1,2):
A = 0.76   B = 0.77   C = 0.78   D = 0.79
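
To make the averaging concrete, here is a minimal Python sketch of direct utility estimation on the three trials above. The every-visit convention, γ = 1, and all variable names are assumptions for illustration; the slides do not fix these details.

```python
# Direct utility estimation: average the observed reward-to-go per state.
# Sketch only; every-visit averaging and gamma = 1 are assumed.
from collections import defaultdict

# The three trials from the slides, as (state, reward) pairs.
trials = [
    [((1,1),-0.04), ((1,2),-0.04), ((1,3),-0.04), ((1,2),-0.04), ((1,3),-0.04),
     ((2,3),-0.04), ((3,3),-0.04), ((4,3), 1.0)],
    [((1,1),-0.04), ((1,2),-0.04), ((1,3),-0.04), ((2,3),-0.04), ((3,3),-0.04),
     ((3,2),-0.04), ((3,3),-0.04), ((4,3), 1.0)],
    [((1,1),-0.04), ((2,1),-0.04), ((3,1),-0.04), ((3,2),-0.04), ((4,2),-1.0)],
]

def direct_utility_estimate(trials, gamma=1.0):
    returns = defaultdict(list)
    for trial in trials:
        # Walk backwards, accumulating the discounted reward-to-go.
        g = 0.0
        for state, reward in reversed(trial):
            g = reward + gamma * g
            returns[state].append(g)
    # Average the collected samples for each state independently.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

U = direct_utility_estimate(trials)
print(U[(1, 1)], U[(1, 2)])  # Monte Carlo estimates for states (1,1) and (1,2)
```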

Direct Utility Estimation
Why is this less efficient than necessary? It ignores the dependencies between states:
U^π(s) = R(s) + γ Σ_s' T(s, π(s), s') U^π(s')

Adaptive Dynamic Programming (ADP)
Idea: run trials to learn a model of the environment (i.e. T and R):
- Memorize R(s) for all visited states.
- Estimate the fraction of times that action a taken in state s leads to s'.
- Use the PolicyEvaluation algorithm on the estimated model.
Data: the three trials shown above.

ADP
From the three trials above, estimate T[(1,3), right, (2,3)]:
A = 0   B = 0.333   C = 0.666   D = 1.0
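
Below is a rough sketch of the model-estimation step of ADP. It assumes the trials are recorded as (state, action, reward) triples, which the slides' printed trials omit; the function and variable names are illustrative, not from the slides.

```python
# ADP model estimation sketch: count how often action a taken in state s led
# to s', memorize R(s), and normalize counts into transition probabilities.
from collections import defaultdict

def estimate_model(trials):
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
    rewards = {}                                     # s -> R(s)
    for trial in trials:
        for (s, a, r), (s_next, _, _) in zip(trial, trial[1:]):
            counts[(s, a)][s_next] += 1
            rewards[s] = r
        s_last, _, r_last = trial[-1]                # terminal state's reward
        rewards[s_last] = r_last
    # Fraction of times (s, a) led to each s' gives the estimate of T(s, a, s').
    T = {sa: {s2: n / sum(ns.values()) for s2, n in ns.items()}
         for sa, ns in counts.items()}
    return T, rewards

# Example with actions recorded (None marks the terminal entry):
trial = [((1,3), "right", -0.04), ((2,3), "right", -0.04),
         ((3,3), "right", -0.04), ((4,3), None, 1.0)]
T, R = estimate_model([trial])
print(T[((1, 3), "right")])   # {(2, 3): 1.0}
```

The estimated T and R then feed the PolicyEvaluation algorithm from the MDP lectures to produce utility estimates for the current policy.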

Problem?
This can be quite costly for large state spaces. For example, Backgammon has about 10^50 states:
- The agent must learn and store all transition probabilities and rewards.
- PolicyEvaluation needs to solve a linear system with 10^50 equations and variables.

Temporal Difference (TD) Learning
If the policy led from (1,3) to (2,3) all the time, we would expect that
U^π(1,3) = -0.04 + U^π(2,3)
In general, R(s) should equal U^π(s) - γ U^π(s'), which motivates the update
U^π(s) ← U^π(s) + α [R(s) + γ U^π(s') - U^π(s)]
where α is the learning rate. α should decrease slowly over time, so that the estimates eventually stabilize.

From observation, U(1,3) = 0.84, U(2,3) = 0.92, and R = -0.04.
Is U(1,3) too low or too high?
A = Too low   B = Too high

Temporal Difference (TD) Learning
Idea: do not learn an explicit model of the environment! Use an update rule that implicitly reflects the transition probabilities.
Method:
- Initialize U^π(s) with R(s) when the state is first visited.
- After each transition, update with
  U^π(s) ← U^π(s) + α [R(s) + γ U^π(s') - U^π(s)]
- α is the learning rate; it should decrease slowly over time, so that the estimates eventually stabilize.
Properties:
- No need to store a model.
- Only one update per action (not a full PolicyEvaluation).
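
A minimal sketch of the TD(0) update above; the learning rate, its value, and the dictionary-based representation are illustrative assumptions.

```python
# TD(0) update sketch: U(s) <- U(s) + alpha * [R(s) + gamma * U(s') - U(s)].
def td_update(U, s, s_next, R, gamma=1.0, alpha=0.1):
    U[s] = U[s] + alpha * (R[s] + gamma * U[s_next] - U[s])

# Using the observation discussed above: U(1,3) = 0.84, U(2,3) = 0.92, R = -0.04.
U = {(1, 3): 0.84, (2, 3): 0.92}
R = {(1, 3): -0.04}
td_update(U, (1, 3), (2, 3), R)
print(U[(1, 3)])  # nudged toward the sample -0.04 + 0.92 = 0.88
```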

Active Reinforcement Learning
Task: in an a priori unknown environment, find the optimal policy.
- T(s, a, s') and R(s) are unknown.
- The agent must experiment with the environment.
Naïve approach: naïve active PolicyIteration
- Start with some random policy.
- Follow the policy to learn a model of the environment and use ADP to estimate utilities.
- Update the policy using π(s) ← argmax_a Σ_s' T(s, a, s') U^π(s')
Problem: this can converge to a sub-optimal policy! By following the policy, the agent might never learn T and R everywhere. There is a need for exploration!

Exploration vs. Exploitation
Exploration: take actions that explore the environment.
- Hope: possibly find areas of the state space with higher reward.
- Problem: possibly take suboptimal steps.
Exploitation: follow the current policy.
- Guaranteed to obtain a certain expected reward.
Approach:
- Sometimes take random steps (see the sketch below).
- Give a bonus reward to states that have not been visited often yet.
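
One concrete way to "sometimes take random steps" is ε-greedy action selection. The slide does not prescribe this scheme, so the sketch below is only one assumed implementation.

```python
# Epsilon-greedy exploration sketch: with probability epsilon take a random
# (exploratory) action, otherwise exploit the current Q estimates.
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)            # explore
    return max(actions, key=lambda a: Q.get((a, s), 0.0))  # exploit
```

An exploration bonus, as mentioned above, would instead add a term to the value of rarely visited states so that the greedy choice itself is drawn toward them.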

Q-Learning
Problem: the agent needs a model of the environment to select an action via
argmax_a Σ_s' T(s, a, s') U^π(s')
Solution: learn an action-utility function Q(a, s) instead of the state-utility function U(s).
- Definition: U(s) = max_a Q(a, s)
- Bellman equation with Q(a, s) instead of U(s):
  Q(a, s) = R(s) + γ Σ_s' T(s, a, s') max_a' Q(a', s')
- TD update with Q(a, s) instead of U(s):
  Q(a, s) ← Q(a, s) + α [R(s) + γ max_a' Q(a', s') - Q(a, s)]
Result: with the Q-function, the agent can select an action without a model of the environment:
argmax_a Q(a, s)
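
A sketch of the tabular Q-learning update and model-free action selection from the slide; the default Q-value of 0 for unseen (a, s) pairs and the learning rate are assumptions.

```python
# Tabular Q-learning sketch, keeping the slide's Q(a, s) argument order.
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, gamma=1.0, alpha=0.1):
    # Q(a,s) <- Q(a,s) + alpha * [R(s) + gamma * max_a' Q(a', s') - Q(a,s)]
    best_next = max(Q[(a2, s_next)] for a2 in actions)
    Q[(a, s)] += alpha * (r + gamma * best_next - Q[(a, s)])

def greedy_action(Q, s, actions):
    # Model-free action selection: argmax_a Q(a, s).
    return max(actions, key=lambda a: Q[(a, s)])

Q = defaultdict(float)                     # unseen entries default to 0.0
actions = ["up", "down", "left", "right"]
q_update(Q, (1, 1), "up", -0.04, (1, 2), actions)
print(greedy_action(Q, (1, 1), actions))
```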

Q-Learning Illustration
[Figure: grid-world states (1,1), (1,2), and (2,1), each annotated with its four action values Q(up, s), Q(right, s), Q(down, s), Q(left, s).]

Function Approximation
Problem: storing Q (or U, T, R) for each state in a table is too expensive if the number of states is large.
- It does not exploit the similarity of states (i.e. the agent has to learn a separate behavior for each state, even if the states are similar).
Solution: approximate the function using a parametric representation.
- For example, Φ(s) is a feature vector describing the state:
  - material values of the board
  - is the queen threatened?
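
A sketch of one common parametric choice, a linear combination of features with a TD-style parameter update. The chess-flavored features (material balance, queen threatened) mirror the slide's examples, but the feature encoding, update rule, and all names are illustrative assumptions.

```python
# Linear function approximation sketch: U_hat(s) = theta . Phi(s),
# with a gradient-style TD update of the parameters instead of a table entry.
import numpy as np

def phi(state):
    # Feature vector Phi(s) describing the state (illustrative features).
    return np.array([1.0,                        # bias term
                     state["material_balance"],  # material values of the board
                     1.0 if state["queen_threatened"] else 0.0])

def U_hat(theta, state):
    return float(theta @ phi(state))

def td_update_theta(theta, s, s_next, r, gamma=1.0, alpha=0.01):
    # theta <- theta + alpha * [r + gamma * U_hat(s') - U_hat(s)] * Phi(s)
    delta = r + gamma * U_hat(theta, s_next) - U_hat(theta, s)
    return theta + alpha * delta * phi(s)

theta = np.zeros(3)
s  = {"material_balance": 2, "queen_threatened": False}
s2 = {"material_balance": 1, "queen_threatened": True}
theta = td_update_theta(theta, s, s2, r=-0.04)
```

Because similar states share features, an update for one state generalizes to all states with similar feature vectors, which is exactly what the tabular representation cannot do.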

[Closing example slides: a self-modeling robot with tilt sensors and servo actuators; Morphological Estimation; Emergent Self-Model and Damage Recovery (with Josh Bongard and Victor Zykov, Science 2006); a comparison of random, predicted, and physical models.]