Soar-RL: A Year of Learning

Soar-RL: A Year of Learning. Nate Derbinsky, University of Michigan

Outline: The Big Picture; Developing Soar-RL Agents; Controlling the Soar-RL Algorithm; Debugging Soar-RL; Soar-RL Performance; Nuggets & Coal; Additional Resources

Outline: The Big Picture (The Path to Release; How Soar-RL Affects Agent Behavior); Developing Soar-RL Agents; Controlling the Soar-RL Algorithm; Debugging Soar-RL; Soar-RL Performance; Nuggets & Coal; Additional Resources

The Path to Release. Credit for most system functionality and all of the research that made Soar-RL possible should go to Shelley Nason and John Laird: Nason, S. and Laird, J. E., Soar-RL: Integrating Reinforcement Learning with Soar, International Conference on Cognitive Modeling, 2004. The work presented today covers the engineering effort to effectively and efficiently integrate Soar-RL with the Soar trunk: Nate Derbinsky, Nick Gorski, John Laird, Bob Marinier, Jonathan Voigt, Yongjia Wang.

The RL Agent-Environment Interface. [Diagram: at each step the agent receives state s_t and reward r_t, emits action a_t, and the environment returns reward r_t+1 and state s_t+1.] Sutton, R.S., and Barto, A.G., Reinforcement Learning: An Introduction.

Soar-RL Agent-Environment Interface. [Diagram: the Soar agent receives state s_t on the input-link and reward r_t on the reward-link, emits action a_t on the output-link, and the environment returns r_t+1 and s_t+1.]

Numeric Indifferent Preferences: (<state> ^operator <operator> = number). Here number, the value of the preference, is a numeric constant. The value of the numeric indifferent preference may bias selection of the operator from amongst indifferent preferences. numeric-indifferent-mode determines how values combine; indifferent-selection sets the policy for deciding amongst indifferent preferences.
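
For example, if two Soar-RL rules match the same proposed operator and assert numeric indifferent preferences of 2.3 and 1.0, a summing combination mode yields a net value of 3.3 for that operator while an averaging mode yields 1.65; which combination applies is what numeric-indifferent-mode controls (the specific mode names are not listed on this slide).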

How Soar-RL Affects Agent Behavior. Over time, Soar-RL modifies numeric indifferent preference values so as to maximize the expected receipt of future reward. Altering preference values in procedural memory allows Soar-RL to modify the outcome of operator selection, and thus affect agent behavior.

Water Jug Demonstration

Outline: The Big Picture; Developing Soar-RL Agents (Soar-RL Rules; Templates; Reward); Controlling the Soar-RL Algorithm; Debugging Soar-RL; Soar-RL Performance; Nuggets & Coal; Additional Resources

Soar-RL Rules:

    sp {my*rl*rule
       (state <s> ^operator <o> +
                  ^attrib-a alpha
                  ^attrib-b beta)
       (<o> ^name my-op)
    -->
       (<s> ^operator <o> = 2.3)
    }

The LHS can be anything; the RHS must be a single numeric indifferent preference. Soar-RL rules form a representation of a value function: Q(s, o) = 2.3.

Water-Jug Agent Example:

    sp {water-jug*empty*small*0*0
       (state <s> ^name water-jug ^operator <op> + ^jug <j1> <j2>)
       (<op> ^name empty ^empty-jug.volume 3)
       (<j1> ^volume 3 ^contents 0)
       (<j2> ^volume 5 ^contents 0)
    -->
       (<s> ^operator <op> = 0)
    }

Soar-RL Rule Usage. In order for Soar-RL to affect selection of an operator in a particular state, a Soar-RL rule must exist whose LHS matches the state-operator pair. With complex agents, the requirement of manually representing the Q-function with Soar-RL rules is unreasonable. Solutions: scripting or templates.
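
Scripting here simply means generating the Soar-RL rules offline. A minimal sketch of the idea in Python follows; the output file name, rule-naming scheme, operator list, and enumeration of jug contents are illustrative assumptions, not taken from the talk.

    # Hypothetical offline generator: emits one Soar-RL rule, with an initial
    # value of 0, for each (small-jug contents, large-jug contents, operator)
    # combination of the water-jug task.
    OPERATORS = ["empty", "fill", "pour"]

    def make_rule(c3, c5, op):
        name = "water-jug*%s*%d*%d" % (op, c3, c5)
        return ("sp {%s\n"
                "   (state <s> ^name water-jug ^operator <op> + ^jug <j1> <j2>)\n"
                "   (<op> ^name %s)\n"
                "   (<j1> ^volume 3 ^contents %d)\n"
                "   (<j2> ^volume 5 ^contents %d)\n"
                "-->\n"
                "   (<s> ^operator <op> = 0)\n"
                "}\n\n") % (name, op, c3, c5)

    with open("water-jug-rl.soar", "w") as out:
        for c3 in range(0, 4):      # small jug holds 0-3 units
            for c5 in range(0, 6):  # large jug holds 0-5 units
                for op in OPERATORS:
                    out.write(make_rule(c3, c5, op))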

Soar-RL Templates:

    sp {my*rl*template
       :template
       (state <s> ^operator <o> +
                  ^attrib-a <a>
                  ^attrib-b <b>)
       (<o> ^name my-op)
    -->
       (<s> ^operator <o> = 2.3)
    }

A template must have the :template flag. The LHS can be anything; the RHS must be a single numeric indifferent preference. A Soar-RL template is a representation of the initial value function of a set of state-operator pairs.

Water-Jug Agent Example:

    sp {water-jug*empty
       :template
       (state <s> ^name water-jug ^operator <op> + ^jug <j1> <j2>)
       (<op> ^name empty ^empty-jug.volume <evol>)
       (<j1> ^volume 3 ^contents <c1>)
       (<j2> ^volume 5 ^contents <c2>)
    -->
       (<s> ^operator <op> = 0)
    }

Soar-RL Template Behavior. During the proposal phase, the template rule is supplied to the matcher. Matches are used to create new Soar-RL productions that contribute to the current cycle and to future decisions. Each new production has the naming pattern rl*template-name*id, where template-name is the name of the original template rule and id is an auto-incrementing counter.

Water-Jug Agent Example:

    sp {rl*water-jug*empty*1
       (state <s> ^name water-jug ^operator <op> + ^jug <j1> <j2>)
       (<op> ^name empty ^empty-jug.volume 3)
       (<j1> ^volume 3 ^contents 0)
       (<j2> ^volume 5 ^contents 0)
    -->
       (<s> ^operator <op> = 0)
    }

Reward. The agent programmer must supply reward information to guide the reinforcement learning process. The location of reward is a new structure, a state's reward-link: state.reward-link.reward.value. For example: (<state> ^reward-link.reward.value 1.2) or (<state> ^reward-link.reward.value -2). The reward-link is not part of the io-link and is not modified directly by the environment.

Water-Jug Agent Example:

    sp {water-jug*detect*goal*achieved
       (state <s> ^name water-jug ^jug <j> ^reward-link <r>)
       (<j> ^volume 3 ^contents 1)
    -->
       (write (crlf) |The problem has been solved.|)
       (<r> ^reward.value 10)
       (halt)
    }

Outline: The Big Picture; Developing Soar-RL Agents; Controlling the Soar-RL Algorithm (Operator Selection; Reinforcement Learning; Manipulating Soar-RL Parameters); Debugging Soar-RL; Soar-RL Performance; Nuggets & Coal; Additional Resources

Operator Selection. The purpose of learning a Q-function is that the agent can act optimally by selecting the operator with the highest Q-value. In Soar preference semantics, symbolic preferences take precedence over numeric preferences; only if there would be a tie are numeric preferences considered.

Exploration vs. Exploitation. For reinforcement learning to discover the optimal policy, the agent must sometimes choose an action that does not have the maximum predicted value. Exploration is most often needed during initial learning and after a change in the task. Control of the exploration policy takes place via the indifferent-selection command.
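
As a concrete illustration of that trade-off, the sketch below implements an epsilon-greedy choice over the candidates' numeric preference values (epsilon-greedy is one of the policies indifferent-selection supports; the Python here is a sketch of the idea, not Soar's implementation, and the epsilon value is illustrative).

    import random

    # Epsilon-greedy: with probability epsilon explore by picking a candidate
    # operator uniformly at random; otherwise exploit the highest-valued one.
    def epsilon_greedy(values, epsilon=0.1):
        # values: dict mapping candidate operator -> current numeric preference
        if random.random() < epsilon:
            return random.choice(list(values))
        return max(values, key=values.get)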

Preference Updates. Soar-RL does Temporal Difference (TD) learning: update = α · (target − current), where the current estimate is Q(s_t, o_t) and α is the learning rate. The target estimate and the application of the update are affected by a number of Soar-RL parameters. Updates are applied at the beginning of the next decision phase.
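
A minimal tabular sketch of that update follows. The SARSA-style target used here is an assumption for illustration; the actual target Soar-RL computes depends on its on-policy/off-policy and discounting settings mentioned later in this talk, and the alpha and gamma defaults are illustrative.

    # One TD update: move Q(s, o) toward a bootstrapped target.
    def td_update(Q, s, o, reward, s_next, o_next, alpha=0.3, gamma=0.9):
        target = reward + gamma * Q.get((s_next, o_next), 0.0)  # SARSA-style target
        current = Q.get((s, o), 0.0)
        Q[(s, o)] = current + alpha * (target - current)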

Gaps in Rule Coverage. Since TD updates are transmitted backwards through the stored Q-function, it would seem necessary that the function be well represented by Soar-RL rules at every decision cycle. To address this practical issue, Soar-RL provides preliminary support for automatic propagation of updates over gaps. By default, Soar-RL automatically propagates updates over gaps, discounted exponentially with respect to the length of the gap. This behavior can be enabled or disabled via the temporal-extension parameter.
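
A sketch of the attenuation just described; the exponential form follows the slide's wording, but the discount parameter and the exact formula are assumptions.

    # Attenuate an update that must cross a gap of gap_length decision cycles
    # in which no Soar-RL rule fired; gamma here is an illustrative discount rate.
    def gap_discounted_update(update, gap_length, gamma=0.9):
        return update * (gamma ** gap_length)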

Gaps Example. [Diagram: a sequence of operator selections in which the first and last have matching Soar-RL rules but the intervening selection does not; with temporal-extension enabled, reward received after the gap is propagated back to the preceding Soar-RL rule, discounted over the length of the gap.]

Hierarchical Reinforcement Learning. HRL is RL done over a hierarchically decomposed structure. Learning can be done to improve subtask performance, as well as selection amongst subtasks. Hierarchical Soar-RL is built on Soar's impasse structure.

Operator No-Change Example. [Diagram: top state S1 keeps operator O1 selected across several decision cycles, receiving rewards r1, r2, r3, and then selects O2; a substate S2 created for the O1 no-change impasse selects operators O11, O12, O13.] Rewards at S1 after O1 are attributed to O1, discounted with respect to the number of decision cycles. Rewards at S2 are attributed to the respective operator. After O13, reward is checked at S2 and, if present, attributed directly to O13.

Other Soar-RL Features. Exploration policies: Boltzmann, epsilon-greedy, softmax, first, last. Learning policies: on-policy, off-policy. Reward discounting. Reward accumulation. Eligibility traces.
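
For example, a sketch of the first of those exploration policies, Boltzmann (softmax) selection over the candidates' numeric preferences; the temperature parameter and its default are illustrative, not Soar's implementation.

    import math
    import random

    # Boltzmann selection: sample each candidate operator with probability
    # proportional to exp(value / temperature).
    def boltzmann_select(values, temperature=1.0):
        ops = list(values)
        weights = [math.exp(values[o] / temperature) for o in ops]
        return random.choices(ops, weights=weights, k=1)[0]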

Manipulating Soar-RL Parameters. Get a parameter: rl [-g --get] <name>. Set a parameter: rl [-s --set] <name> <value>. Get all values: rl. Get Soar-RL statistics: rl [-S --stats] <statistic>.
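
For example, using the temporal-extension parameter named earlier in this talk (the off value shown here is an assumed setting, not taken from the slides):

    rl --get temporal-extension
    rl --set temporal-extension off
    rl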

Outline: The Big Picture; Developing Soar-RL Agents; Controlling the Soar-RL Algorithm; Debugging Soar-RL; Soar-RL Performance; Nuggets & Coal; Additional Resources

Debugging Soar-RL. New watch switches: --indifferent-selection (view numeric preferences for each operator); --template (view firing of templates); --rl (debugging information). New print and excise switches: --rl (all Soar-RL rules); --template (all Soar-RL templates). For example:

    rl*water-jug*empty*46 1. 0.
    rl*water-jug*pour*45 1. 3.

New Decision Cycle Commands. select <id>: forces the selection of an operator. predict: determines which operator will be chosen during the next decision phase; if operator selection will require probabilistic selection, predict will manipulate the random number generator to enforce its prediction (assuming no preference changes).

Outline: The Big Picture; Developing Soar-RL Agents; Controlling the Soar-RL Algorithm; Debugging Soar-RL; Soar-RL Performance (TestSoarPerformance; Rules vs. Templates); Nuggets & Coal; Additional Resources

TestSoarPerformance run times (percentage = slowdown of the Soar-RL build relative to Soar 8.6.4; the 8.6.4 baseline is per platform):

                            8.6.4    RL       Slowdown
    OS X (RL on)            8.067    8.231    2.0%
    OS X (RL off)           8.067    8.201    1.7%
    Linux (RL on)           3.593    3.660    1.9%
    Linux (RL off)          3.593    3.637    1.2%
    Windows XP (RL on)      3.703    3.765    1.7%
    Windows XP (RL off)     3.703    3.725    0.6%

Rules vs. Templates: Water Jug run times (percentage = overhead of templates relative to rules):

                    Rules    Templates    Overhead
    OS X            .043     .065         51%
    Linux           .024     .033         38%
    Windows XP      .125     .140         12%

Outline: The Big Picture; Developing Soar-RL Agents; Controlling the Soar-RL Algorithm; Debugging Soar-RL; Soar-RL Performance; Nuggets & Coal; Additional Resources

Nuggets & Coal. Nuggets: Soar-RL is an integration of reinforcement learning with Soar; it provides a highly configurable new learning mechanism with a relatively small performance cost; the Soar-RL beta is available for download today! Coal: the current template implementation takes a heavy performance toll.

Outline: The Big Picture; Developing Soar-RL Agents; Controlling the Soar-RL Algorithm; Debugging Soar-RL; Soar-RL Performance; Nuggets & Coal; Additional Resources

Additional Resources: http://winter.eecs.umich.edu/soar (binaries, tutorial, manual, programmer reference)