ECE-517 Reinforcement Learning in Artificial Intelligence
Lecture 7: Finite Horizon MDPs, Dynamic Programming
September 10, 2015
Dr. Itamar Arel
College of Engineering
Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2015
Outline
- Finite Horizon MDPs
- Dynamic Programming
Finite Horizon MDPs: Value Functions
- The duration, or expected duration, of the process is finite
- Let's consider the following return functions:
  - The expected sum of rewards: $V^\pi(s) = \lim_{T \to \infty} E\left[\sum_{t=0}^{T} r_t \mid s_0 = s\right]$
  - The expected discounted sum of rewards: $V^\pi_\gamma(s) = \lim_{T \to \infty} E\left[\sum_{t=0}^{T} \gamma^t r_t \mid s_0 = s\right]$
  - The expected sum of rewards for $T$ steps: $V^\pi_T(s) = E\left[\sum_{t=0}^{T-1} r_t \mid s_0 = s\right]$
- A sufficient condition for the above to converge is $|r_t| \le r_{max}$ and $\gamma < 1$
Return functions
- If $|r_t| \le r_{max}$ holds, then $-\frac{r_{max}}{1-\gamma} \le V^\pi_\gamma(s) \le \frac{r_{max}}{1-\gamma}$
  - Note that this bound is very sensitive to the value of $\gamma$
- The expected average reward: $M^\pi(s) = \lim_{T \to \infty} \frac{1}{T} E\left[\sum_{t=0}^{T} r_t \mid s_0 = s\right]$
  - Note that the above limit does not always exist!
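For completeness, the bound follows from a one-line geometric-series argument (a standard derivation, not spelled out on the slide):

$\left|V^\pi_\gamma(s)\right| = \left|E\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]\right| \le \sum_{t=0}^{\infty} \gamma^t r_{max} = \frac{r_{max}}{1-\gamma}$

For example, at $\gamma = 0.9$ the bound is $10\, r_{max}$, while at $\gamma = 0.99$ it grows to $100\, r_{max}$, which illustrates the sensitivity to $\gamma$.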
Relationship between $V_\gamma(s)$ and $V_T(s)$
- Consider a finite horizon problem where the horizon is random, i.e. $V^\pi_T(s) = E_T\left[E\left[\sum_{t=0}^{T-1} r_t \mid s_0 = s\right]\right]$
- Let's also assume that the final value for all states is zero
- Let $T$ be geometrically distributed with parameter $\gamma$, such that the probability of stopping at the $n$-th step is $\Pr(T = n) = (1-\gamma)\gamma^{n-1}$
- Lemma: we'll show that, under the assumption that $|r_t| \le r_{max}$, $V^\pi_T(s) = V^\pi_\gamma(s)$
Relationship between $V_\gamma(s)$ and $V_T(s)$ (cont.)
Proof:
$V^\pi_T(s) = E_T\left[E\left[\sum_{t=0}^{T-1} r_t\right]\right] = \sum_{n=1}^{\infty} (1-\gamma)\gamma^{n-1} E\left[\sum_{t=0}^{n-1} r_t\right] = \sum_{t=0}^{\infty} E\left[r_t\right] \sum_{n=t+1}^{\infty} (1-\gamma)\gamma^{n-1} = \sum_{t=0}^{\infty} \gamma^t E\left[r_t\right] = V^\pi_\gamma(s)$
where the third equality swaps the order of summation, and the fourth uses $\sum_{n=t+1}^{\infty} (1-\gamma)\gamma^{n-1} = \gamma^t$.
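The lemma is easy to check numerically. Below is a minimal Monte Carlo sketch (not from the lecture; the two-state chain, rewards, and all names are illustrative assumptions): the sampled discounted return and the sampled undiscounted return with a geometric stopping time should agree.

import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Hypothetical two-state chain under a fixed policy: P[s, s'] and per-state reward r[s].
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
r = np.array([1.0, -1.0])

def discounted_return(s, horizon=200):
    """One sampled discounted return from state s (truncated; gamma**200 is negligible)."""
    total, g = 0.0, 1.0
    for _ in range(horizon):
        total += g * r[s]
        g *= gamma
        s = rng.choice(2, p=P[s])
    return total

def random_horizon_return(s):
    """One sampled undiscounted return with a geometric horizon:
    stop after each step with probability 1 - gamma, so Pr(T = n) = (1-gamma) * gamma**(n-1)."""
    total = 0.0
    while True:
        total += r[s]
        if rng.random() > gamma:
            return total
        s = rng.choice(2, p=P[s])

print(np.mean([discounted_return(0) for _ in range(2_000)]))       # estimates V_gamma(0)
print(np.mean([random_horizon_return(0) for _ in range(50_000)]))  # estimates V_T(0); should agree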
Outline
- Finite Horizon MDPs (cont.)
- Dynamic Programming
Example of a finite horizon MDP
Consider the following state diagram:
[State diagram: two states, $S_1$ and $S_2$, with transitions between them labeled by {reward, probability} pairs such as {5, 0.5}, {-1, 1}, and {10, 1}]
Why do we need DP techniques?
- Explicitly solving the Bellman Optimality equation is hard
  - Computing the optimal policy solves the RL problem
- It relies on the following three assumptions:
  - We have perfect knowledge of the dynamics of the environment
  - We have enough computational resources
  - The Markov property holds
- In reality, all three are problematic
  - e.g. the game of Backgammon: the first and last conditions are ok, but computational resources are insufficient (approx. $10^{20}$ states)
- In many cases we have to settle for approximate solutions (much more on that later ...)
Big Picture: Elementary Solution Methods
- During the next few weeks we'll talk about techniques for solving the RL problem:
  - Dynamic programming: well developed mathematically, but requires an accurate model of the environment
  - Monte Carlo methods: do not require a model, but are not suitable for step-by-step incremental computation
  - Temporal difference learning: methods that do not need a model and are fully incremental
    - More complex to analyze
    - Launched the revisiting of RL as a pragmatic framework (1988)
- The methods also differ in efficiency and speed of convergence to the optimal solution
Dynamic Programming
- Dynamic programming is the collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as an MDP
- DP constitutes a theoretically optimal methodology
  - In reality it is often of limited use, since DP is computationally expensive
- Important to understand as a reference for other models that:
  - Do just as well as DP
  - Require less computation
  - Possibly require less memory
- Most schemes will strive to achieve the same effect as DP, without the computational complexity involved
Dynamic Programming (cont.)
- We will assume finite MDPs (states and actions)
- The agent has knowledge of the transition probabilities and expected immediate rewards, i.e.
$P^a_{ss'} = \Pr\left\{s_{t+1} = s' \mid s_t = s, a_t = a\right\}$
$R^a_{ss'} = E\left[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\right]$
- The key idea of DP (as in RL) is the use of value functions to derive optimal/good policies
- We'll focus on the manner by which values are computed
- Reminder: an optimal policy is easy to derive once the optimal value function (or action-value function) is attained
Dynamic Programming (cont.)
- Employing the Bellman equation for the optimal value/action-value function yields
$V^*(s) = \max_a \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^*(s')\right]$
$Q^*(s,a) = \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma \max_{a'} Q^*(s',a')\right]$
- DP algorithms are obtained by turning the Bellman equations into update rules
- These rules help improve the approximations of the desired value functions
- We will discuss two main approaches: policy iteration and value iteration
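To make "turning the Bellman equation into an update rule" concrete, here is a minimal value-iteration sketch on a randomly generated tabular MDP (the model arrays, sizes, and names are illustrative assumptions, not from the lecture):

import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 2, 0.9

# Hypothetical tabular model: P[s, a, s'] transition probabilities, R[s, a, s'] expected rewards.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: V(s) <- max_a sum_s' P[s,a,s'] (R[s,a,s'] + gamma V(s'))
    Q = (P * (R + gamma * V)).sum(axis=2)   # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

pi = Q.argmax(axis=1)   # greedy policy w.r.t. the (near-)optimal values
print(V, pi)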
Method #1: Policy Iteration
- A technique for obtaining the optimal policy
- Comprises two complementing steps:
  - Policy evaluation: updating the value function in view of the current policy (which can be sub-optimal)
  - Policy improvement: updating the policy given the current value function (which can be sub-optimal)
- The process converges by bouncing between these two steps:
$\pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi^* \xrightarrow{E} V^*$
Policy Evaluation
- We'll consider how to compute the state-value function $V^\pi$ for an arbitrary policy $\pi$
- Recall that $V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^\pi(s')\right]$ (assumes that policy $\pi$ is always followed)
- The existence of a unique solution is guaranteed as long as either $\gamma < 1$ or eventual termination is guaranteed from all states under the policy
- The Bellman equation translates into $|S|$ simultaneous equations with $|S|$ unknowns (the values)
- Assuming we have an initial guess, we can use the Bellman equation as an update rule
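Since the Bellman equation for a fixed policy is linear in the values, the $|S|$ simultaneous equations can also be solved directly rather than iteratively. A minimal numpy sketch under the same assumed tabular model (all names are illustrative):

import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 4, 2, 0.9

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))
pi = rng.random((n_states, n_actions))   # stochastic policy pi(s, a)
pi /= pi.sum(axis=1, keepdims=True)

# Policy-conditioned transition matrix and expected one-step reward:
# P_pi[s, s'] = sum_a pi(s,a) P[s,a,s'],  r_pi[s] = sum_a pi(s,a) sum_s' P[s,a,s'] R[s,a,s']
P_pi = np.einsum('sa,sax->sx', pi, P)
r_pi = np.einsum('sa,sax,sax->s', pi, P, R)

# V = r_pi + gamma * P_pi @ V  =>  (I - gamma * P_pi) V = r_pi
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
print(V)

The direct solve costs $O(|S|^3)$, which is why the iterative update rule on the next slide is preferred for large state spaces.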
Policy Evaluation (cont.)
- We can write
$V_{k+1}(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V_k(s')\right]$
- The sequence $\{V_k\}$ converges to the correct value function as $k \to \infty$
- In each iteration, all state-values are updated, a.k.a. a full backup
- A similar method can be applied to state-action ($Q(s,a)$) functions
- An underlying assumption: all states are visited each time
  - The scheme is computationally heavy
  - It can be distributed, given sufficient resources (Q: How?)
- In-place schemes use a single array and update values based on new estimates
  - They also converge to the correct solution
  - The order in which states are backed up determines the rate of convergence
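A minimal sketch of iterative policy evaluation with full backups, under the same assumed tabular model (names are illustrative; the stopping condition anticipates the next slide):

import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, gamma, theta = 4, 2, 0.9, 1e-8

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))
pi = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform policy, for concreteness

V = np.zeros(n_states)
while True:
    # Full backup: every state is updated from the previous iterate V_k.
    Q = (P * (R + gamma * V)).sum(axis=2)   # Q_k(s, a)
    V_new = (pi * Q).sum(axis=1)            # V_{k+1}(s) = sum_a pi(s,a) Q_k(s,a)
    delta = np.max(np.abs(V_new - V))       # max_s |V_{k+1}(s) - V_k(s)|
    V = V_new
    if delta < theta:                       # typical stopping condition (next slide)
        break
print(V)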
Iterative Policy Evaluation algorithm
- A key consideration is the termination condition
- A typical stopping condition for iterative policy evaluation is
$\max_{s \in S} \left|V_{k+1}(s) - V_k(s)\right| < \theta$, for some small $\theta > 0$
Policy Improvement
- Policy evaluation deals with finding the value function under a given policy
- However, we don't know if the policy (and hence the value function) is optimal
- Policy improvement has to do with the above, and with updating the policy if non-optimal values are reached
- Suppose that for some arbitrary policy, $\pi$, we've computed the value function $V^\pi$ (using policy evaluation)
- Let policy $\pi'$ be defined such that in each state $s$ it selects the action that maximizes the first-step value, i.e.
$\pi'(s) \stackrel{\mathrm{def}}{=} \arg\max_a \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^\pi(s')\right]$
- It can be shown that $\pi'$ is at least as good as $\pi$, and if they are equally good, they are both the optimal policy
Policy Improvement (cont.)
- Consider a greedy policy, $\pi'$, that selects the action that would yield the highest expected single-step return
$\pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^\pi(s')\right]$
- Then, by definition, $Q^\pi(s, \pi'(s)) \ge V^\pi(s)$; this is the condition for the policy improvement theorem
- The above states that following the new policy for one step is enough to prove that it is a better policy, i.e. that $V^{\pi'}(s) \ge V^\pi(s)$ for all $s$
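The greedy improvement step is a one-liner given the model and $V^\pi$. A minimal sketch, reusing the assumed P and R arrays from the earlier snippets (the function name is illustrative):

import numpy as np

def improve_policy(P, R, V, gamma):
    """Greedy improvement: pi'(s) = argmax_a sum_s' P[s,a,s'] (R[s,a,s'] + gamma V(s'))."""
    Q = (P * (R + gamma * V)).sum(axis=2)   # Q^pi(s, a) induced by the current values
    return Q.argmax(axis=1)                 # deterministic greedy policy, one action per state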
Proof of the Policy Improvement Theorem
$V^\pi(s) \le Q^\pi(s, \pi'(s)) = E_{\pi'}\left[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\right]$
$\le E_{\pi'}\left[r_{t+1} + \gamma Q^\pi(s_{t+1}, \pi'(s_{t+1})) \mid s_t = s\right]$
$= E_{\pi'}\left[r_{t+1} + \gamma r_{t+2} + \gamma^2 V^\pi(s_{t+2}) \mid s_t = s\right]$
$\le \cdots \le E_{\pi'}\left[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s\right] = V^{\pi'}(s)$
Policy Iteration
[The slide presents the policy iteration algorithm: alternate policy evaluation and greedy policy improvement until the policy is stable]
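Putting the two steps together, a minimal policy-iteration sketch on an assumed random tabular MDP (all names are illustrative; evaluation is done by a direct linear solve for brevity, rather than the iterative sweep):

import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions, gamma = 4, 2, 0.9

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))

pi = np.zeros(n_states, dtype=int)   # arbitrary initial deterministic policy
while True:
    # Policy evaluation: solve (I - gamma P_pi) V = r_pi exactly.
    P_pi = P[np.arange(n_states), pi]                       # P_pi[s, s'] under pi
    r_pi = (P_pi * R[np.arange(n_states), pi]).sum(axis=1)  # expected one-step reward
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

    # Policy improvement: greedy w.r.t. V.
    Q = (P * (R + gamma * V)).sum(axis=2)
    pi_new = Q.argmax(axis=1)
    if np.array_equal(pi_new, pi):   # stable policy => optimal for a finite MDP
        break
    pi = pi_new

print(pi, V)

Because a finite MDP has finitely many deterministic policies and each improvement step is strictly better (or leaves the policy unchanged), this loop terminates.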