ECE-517 Reinforcement Learning in Artificial Intelligence


ECE-517 Reinforcement Learning in Artificial Intelligence
Lecture 7: Finite Horizon MDPs, Dynamic Programming
September 10, 2015
Dr. Itamar Arel
College of Engineering
Department of Electrical Engineering and Computer Science
The University of Tennessee, Fall 2015

Outline
- Finite Horizon MDPs
- Dynamic Programming

Finite Horizon MDPs: Value Functions
The duration, or expected duration, of the process is finite. Let's consider the following return functions:
- The expected sum of rewards: $V(s) = \lim_{N\to\infty} E\left[\sum_{t=0}^{N} r_t \,\middle|\, s_0 = s\right]$
- The expected discounted sum of rewards: $V_\gamma(s) = \lim_{N\to\infty} E\left[\sum_{t=0}^{N} \gamma^t r_t \,\middle|\, s_0 = s\right]$
- The expected sum of rewards for $N$ steps: $V_N(s) = E\left[\sum_{t=0}^{N-1} r_t \,\middle|\, s_0 = s\right]$
A sufficient condition for the above to converge is $|r_t| < r_{\max}$.
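As a quick illustration of these definitions, here is a minimal Python sketch (ours, not from the slides) that computes the three returns for a short, hypothetical reward sequence:

```python
import numpy as np

# Hypothetical reward sequence r_0..r_3 and discount factor; illustrative only.
rewards = np.array([1.0, 0.5, -0.2, 0.8])
gamma, N = 0.9, 3

total = rewards.sum()                                             # sum of rewards
discounted = np.sum(gamma ** np.arange(len(rewards)) * rewards)   # sum of gamma^t * r_t
n_step = rewards[:N].sum()                                        # sum of the first N rewards
print(total, discounted, n_step)
```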

Return Functions (cont.)
If $|r_t| < r_{\max}$ holds, then
$-\frac{r_{\max}}{1-\gamma} \;\le\; V_\gamma(s) \;\le\; \frac{r_{\max}}{1-\gamma}$
Note that this bound is very sensitive to the value of $\gamma$: with $r_{\max} = 1$, for example, $\gamma = 0.99$ gives a bound of 100, while $\gamma = 0.999$ gives 1000.
The expected average reward:
$M(s) = \lim_{N\to\infty} \frac{1}{N}\, E\left[\sum_{t=0}^{N-1} r_t \,\middle|\, s_0 = s\right]$
Note that the above limit does not always exist!

Relationship between $V_\gamma(s)$ and $V_N(s)$
Consider a finite horizon problem where the horizon, $N$, is random, i.e.
$V_{\bar{N}}(s) = E_N\left[\, E\left[\sum_{t=0}^{N-1} r_t \,\middle|\, s_0 = s\right] \right]$
Let's also assume that the final value for all states is zero. Let $N$ be geometrically distributed with parameter $\gamma$, such that the probability of stopping at the $n$-th step is
$\Pr\{N = n\} = (1-\gamma)\,\gamma^{\,n-1}$
Lemma: we'll show that under the assumption that $|r_t| < r_{\max}$,
$V_{\bar{N}}(s) = V_\gamma(s)$

Relationship between $V_\gamma$ and $V_{\bar{N}}$ (cont.)
Proof: the reward $r_n$ is collected whenever the process has not yet stopped by step $n$, i.e. with probability $\Pr\{N \ge n+1\} = \gamma^n$, so
$V_{\bar{N}}(s) = E_N\left[\, E\left[\sum_{t=0}^{N-1} r_t \,\middle|\, s_0 = s\right] \right] = \sum_{n=0}^{\infty} \Pr\{N \ge n+1\}\; E\left[r_n \mid s_0 = s\right] = E\left[\sum_{n=0}^{\infty} \gamma^n r_n \,\middle|\, s_0 = s\right] = V_\gamma(s)$
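A numerical sanity check of the lemma is easy to sketch (our example, with made-up rewards): averaging the $N$-step return over horizons drawn as $N \sim \text{Geometric}(1-\gamma)$ should reproduce the discounted return.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9
rewards = rng.uniform(-1.0, 1.0, 200)   # fixed reward stream; tail beyond 200 is negligible

# Discounted return V_gamma for this stream.
v_gamma = np.sum(gamma ** np.arange(len(rewards)) * rewards)

# Random-horizon return: N ~ Geometric with Pr{N = n} = (1 - gamma) * gamma^(n-1).
horizons = rng.geometric(1.0 - gamma, size=100_000)
v_bar_n = np.mean([rewards[:n].sum() for n in horizons])

print(v_gamma, v_bar_n)   # the two estimates should agree closely
```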

Outline
- Finite Horizon MDPs (cont.)
- Dynamic Programming

Example of a Finite Horizon MDP
Consider the following state diagram: [two-state diagram with states S1 and S2; each transition is labeled with a {reward, probability} pair, such as {5, 0.5}, {-1, 1}, and {0, 1}]

Why do we need DP techniques?
Explicitly solving the Bellman Optimality equation is hard:
- Computing the optimal policy ⇒ solves the RL problem
- Relies on the following three assumptions:
  - We have perfect knowledge of the dynamics of the environment
  - We have enough computational resources
  - The Markov property holds
In reality, all three are problematic. e.g. the game of Backgammon: the first and last conditions are fine, but computational resources are insufficient (approx. $10^{20}$ states). In many cases we have to settle for approximate solutions (much more on that later...)

Big Picture: Elementary Solution Methods
During the next few weeks we'll talk about techniques for solving the RL problem:
- Dynamic programming: well developed mathematically, but requires an accurate model of the environment
- Monte Carlo methods: do not require a model, but are not suitable for step-by-step incremental computation
- Temporal difference learning: methods that do not need a model and are fully incremental; more complex to analyze; launched the revisiting of RL as a pragmatic framework (1988)
The methods also differ in efficiency and speed of convergence to the optimal solution.

Dynamic Programming
Dynamic programming (DP) is the collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as an MDP. DP constitutes a theoretically optimal methodology; in reality it is often limited, since DP is computationally expensive. It is important to understand DP as a reference for other methods, which aim to:
- Do just as well as DP
- Require less computation
- Possibly require less memory
Most schemes will strive to achieve the same effect as DP, without the computational complexity involved.

Dynamic Programming (cont.)
We will assume finite MDPs (states and actions). The agent has knowledge of transition probabilities and expected immediate rewards, i.e.
$P_{ss'}^{a} = \Pr\{\, s_{t+1} = s' \mid s_t = s,\; a_t = a \,\}$
$R_{ss'}^{a} = E\{\, r_{t+1} \mid s_t = s,\; a_t = a,\; s_{t+1} = s' \,\}$
The key idea of DP (as in RL) is the use of value functions to derive optimal/good policies. We'll focus on the manner by which values are computed. Reminder: an optimal policy is easy to derive once the optimal value function (or action-value function) is attained.
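The sketch below shows one plausible way to store this model knowledge in code; the array names P and R and the toy sizes are our own, not from the lecture, and the later sketches reuse them.

```python
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# P[s, a, s'] = Pr{ s_{t+1} = s' | s_t = s, a_t = a }: a random stochastic matrix.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)   # each (s, a) row sums to 1

# R[s, a, s'] = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }: arbitrary toy rewards.
R = rng.uniform(-1.0, 1.0, (n_states, n_actions, n_states))
```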

Dynamic Programming (cont.)
Employing the Bellman equation for the optimal value/action-value function yields
$V^{*}(s) = \max_{a} \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V^{*}(s') \right]$
$Q^{*}(s,a) = \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma \max_{a'} Q^{*}(s',a') \right]$
DP algorithms are obtained by turning the Bellman equations into update rules. These rules help improve the approximations of the desired value functions. We will discuss two main approaches: policy iteration and value iteration.
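As a sketch of what "turning the Bellman equation into an update rule" can look like, using the toy P and R arrays above (the function name is ours):

```python
def bellman_optimality_backup(V, P, R, gamma=0.9):
    """One full sweep of V(s) <- max_a sum_s' P[s,a,s'] (R[s,a,s'] + gamma V[s'])."""
    # Q[s, a] = sum_s' P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
    Q = np.einsum('sap,sap->sa', P, R + gamma * V[None, None, :])
    return Q.max(axis=1)   # greedy over actions gives the new V
```

Repeating this backup to convergence is essentially the value iteration approach named on the slide.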

Method #1: Policy Iteration
A technique for obtaining the optimal policy. It comprises two complementing steps:
- Policy evaluation: updating the value function in view of the current policy (which can be sub-optimal)
- Policy improvement: updating the policy given the current value function (which can be sub-optimal)
The process converges by bouncing between these two steps:
$\pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi^{*} \xrightarrow{E} V^{*}$

Policy Evaluation
We'll consider how to compute the state-value function $V^{\pi}$ for an arbitrary policy $\pi$. Recall that
$V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V^{\pi}(s') \right]$
(assumes that policy $\pi$ is always followed). The existence of a unique solution is guaranteed as long as either $\gamma < 1$ or eventual termination is guaranteed from all states under the policy. The Bellman equation translates into $|S|$ simultaneous equations with $|S|$ unknowns (the values). Assuming we have an initial guess, we can use the Bellman equation as an update rule.

Policy Evaluation (cont.)
We can write
$V_{k+1}(s) = \sum_{a} \pi(s,a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V_k(s') \right]$
The sequence $\{V_k\}$ converges to the correct value function as $k \to \infty$. In each iteration, all state-values are updated, a.k.a. a full backup. A similar method can be applied to state-action ($Q(s,a)$) functions. An underlying assumption: all states are visited each time, so the scheme is computationally heavy:
- It can be distributed, given sufficient resources (Q: how?)
- In-place schemes use a single array and update values based on new estimates; they also converge to the correct solution
- The order in which states are backed up determines the rate of convergence
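A single full backup under a fixed stochastic policy might look like this (a sketch reusing the toy P and R arrays from above; policy is an |S| x |A| matrix of action probabilities, a representation we chose for illustration):

```python
def policy_evaluation_sweep(V, policy, P, R, gamma=0.9):
    """Full backup: V_{k+1}(s) = sum_a pi(s,a) sum_s' P (R + gamma V_k(s'))."""
    Q = np.einsum('sap,sap->sa', P, R + gamma * V[None, None, :])
    return np.einsum('sa,sa->s', policy, Q)   # average over actions under pi
```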

Iterative Policy Evaluation Algorithm
A key consideration is the termination condition. A typical stopping condition for iterative policy evaluation is
$\max_{s \in S} \left| V_{k+1}(s) - V_k(s) \right| < \theta$
for some small threshold $\theta > 0$.
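Putting the sweep and the stopping rule together (a sketch; theta is the small threshold from the slide):

```python
def evaluate_policy(policy, P, R, gamma=0.9, theta=1e-8):
    """Iterate full backups until max_s |V_{k+1}(s) - V_k(s)| < theta."""
    V = np.zeros(P.shape[0])
    while True:
        V_new = policy_evaluation_sweep(V, policy, P, R, gamma)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```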

Policy Improvement
Policy evaluation deals with finding the value function under a given policy. However, we don't know if that policy (and hence the value function) is optimal. Policy improvement addresses this, updating the policy when non-optimal values are reached. Suppose that for some arbitrary policy $\pi$ we've computed the value function $V^{\pi}$ (using policy evaluation). Let policy $\pi'$ be defined such that in each state $s$ it selects the action that maximizes the first-step value, i.e.
$\pi'(s) \stackrel{\mathrm{def}}{=} \arg\max_{a} \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V^{\pi}(s') \right]$
It can be shown that $\pi'$ is at least as good as $\pi$, and if they are equal they are both the optimal policy.

Policy Improvement (cont.)
Consider a greedy policy, $\pi'$, that selects the action that would yield the highest expected single-step return:
$\pi'(s) = \arg\max_{a} Q^{\pi}(s,a) = \arg\max_{a} \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V^{\pi}(s') \right]$
Then, by definition, $Q^{\pi}(s, \pi'(s)) \ge V^{\pi}(s)$; this is the condition for the policy improvement theorem. The above states that following the new policy for one step is enough to prove that it is a better policy, i.e. that $V^{\pi'}(s) \ge V^{\pi}(s)$.
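The greedy improvement step translates directly into code (again a sketch over the toy P and R arrays):

```python
def improve_policy(V, P, R, gamma=0.9):
    """Greedy policy: pi'(s) = argmax_a sum_s' P (R + gamma V(s'))."""
    Q = np.einsum('sap,sap->sa', P, R + gamma * V[None, None, :])
    greedy_actions = Q.argmax(axis=1)
    policy = np.zeros_like(Q)
    policy[np.arange(Q.shape[0]), greedy_actions] = 1.0   # deterministic pi'
    return policy
```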

Proof of the Policy Improvement Theorem
$V^{\pi}(s) \le Q^{\pi}(s, \pi'(s)) = E_{\pi'}\left[ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \right] \le E_{\pi'}\left[ r_{t+1} + \gamma Q^{\pi}(s_{t+1}, \pi'(s_{t+1})) \mid s_t = s \right] = E_{\pi'}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 V^{\pi}(s_{t+2}) \mid s_t = s \right] \le \cdots \le E_{\pi'}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s \right] = V^{\pi'}(s)$

Policy Iteration
[algorithm box: alternate full policy evaluation and greedy policy improvement until the policy is stable]
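Combining the two steps gives the full loop; this sketch (our composition of the helpers above, not the slide's own pseudocode) stops when the greedy policy no longer changes. NumPy's argmax breaks ties consistently by taking the first maximizer, which avoids cycling between equally good policies.

```python
def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation and greedy improvement until stable."""
    n_states, n_actions = P.shape[0], P.shape[1]
    policy = np.full((n_states, n_actions), 1.0 / n_actions)   # start uniform
    while True:
        V = evaluate_policy(policy, P, R, gamma)
        new_policy = improve_policy(V, P, R, gamma)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

pi_star, V_star = policy_iteration(P, R)
print(pi_star, V_star)
```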