Stochastic Policy Gradient Reinforcement Learning on a Simple 3D Biped

Russ Tedrake
Computer Science & Artificial Intelligence Lab, Center for Bits & Atoms
Massachusetts Institute of Technology, Cambridge, MA 02139
Email: russt@ai.mit.edu

Teresa Weirui Zhang
Department of Mechanical Engineering, Brain & Cognitive Sciences
Massachusetts Institute of Technology, Cambridge, MA 02139
Email: resa@mit.edu

H. Sebastian Seung
Howard Hughes Medical Institute, Brain & Cognitive Sciences, Center for Bits & Atoms
Massachusetts Institute of Technology, Cambridge, MA 02139
Email: seung@mit.edu

Abstract

We present a learning system which is able to quickly and reliably acquire a robust feedback control policy for 3D dynamic walking from a blank slate using only trials implemented on our physical robot. The robot begins walking within a minute and learning converges in approximately 20 minutes. This success can be attributed to the mechanics of our robot, which are modeled after a passive dynamic walker, and to a dramatic reduction in the dimensionality of the learning problem. We reduce the dimensionality by designing a robot with only 6 internal degrees of freedom and 4 actuators, by decomposing the control system in the frontal and sagittal planes, and by formulating the learning problem on the discrete return map dynamics. We apply a stochastic policy gradient algorithm to this reduced problem and decrease the variance of the update using a state-based estimate of the expected cost. This optimized learning system works quickly enough that the robot is able to continually adapt to the terrain as it walks.

I. INTRODUCTION

Recent advances in bipedal walking technology have produced robots capable of leaving the laboratory environment to interact with the unknown and uncertain environments of the real world. Despite our best efforts, it is unlikely that we will be able to preprogram these robots for every possible situation without sacrificing performance. Endowing our robots with the ability to learn from experience and adapt to their environment seems critical for the success of any real world robot.

Dynamic bipedal walking is difficult to learn for a number of reasons. First, walking robots typically have many degrees of freedom, which can cause a combinatorial explosion for learning systems that attempt to optimize performance in every possible configuration of the robot. Second, details of the robot dynamics such as uncertainties in the ground contact and nonlinear friction in the joints are difficult to model well in simulation, making it unlikely that a controller optimized in a simulation will perform optimally on the real robot. Since it is only practical to run a small number of learning trials on the real robot, the learning algorithms must perform well after obtaining a very limited amount of data. Finally, learning algorithms for dynamic walking must deal with dynamic discontinuities caused by collisions with the ground and with the problem of delayed reward: torques applied at one time may have an effect on the performance many steps into the future.

Although there is a great deal of literature on learning control for dynamically walking bipedal robots, there are relatively few examples of learning algorithms actually implemented on the robot or which work quickly enough to allow the robot to adapt online to changing terrain. Some researchers attempt to learn a controller in simulation that is robust enough to run on the real robot [1], treating differences between the simulation and the robot as disturbances. The University of New Hampshire bipeds were two early examples of online learning which acquired a basic gait by tuning parameters in a hand-designed controller ([2], [3]). In this paper we generalize these results to obtaining a controller from scratch instead of tuning an existing controller. Learning control has also been successfully implemented on Sony's quadrupedal robot AIBO (e.g., [4]). The learned controllers for AIBO are open-loop trajectories, but trajectory feedback is essential for robust, dynamic, bipedal walking.

In order to study learning feedback control for walking, we performed our initial experiments on a simplified robot which captures the essence of dynamic walking but which minimizes many of the complications. Our robot has only 6 internal degrees of freedom and 4 actuators (footnote 1). The mechanical design of our robot is based on a passive dynamic walker ([5], [6]). This allows us to solve a portion of the control problem in the mechanical design, and makes the robot mechanically very stable; most policies in our search space result in either stable walking or failed walking where the robot ends up simply standing still. The learning on our robot is performed by a policy gradient reinforcement learning algorithm ([7], [8], [9]).

The goal of this paper is to describe our formulation of the learning problem and the algorithm that we use to solve it. We include our experimental results on this simplified biped, and discuss the possibility of applying the same algorithm to a more complicated walking system.

Footnote 1: The standard for 3D bipeds is to have at least 12 internal degrees of freedom and 12 actuators in the legs.

II. THE ROBOT

Fig. 1. The robot on the left is a simple passive dynamic walker. The robot on the right is our actuated version of the same robot.

The passive dynamic walker shown on the left in Figure 1 represents the simplest machine that we could build which captures the essence of stable dynamic walking in three dimensions. It has only a single passive pin joint at the hip. When placed at the top of a small ramp and given a push sideways, the walker will begin falling down the ramp and eventually converge to a stable limit cycle trajectory that has been compared to the waddling gait of a penguin [10]. The energetics of this passive walker are common to all passive walkers: energy lost due to friction and collisions when the swing leg returns to the ground is balanced by the gradual conversion of potential energy into kinetic energy as the walker moves down the slope. The mechanical design of this robot and some experimental stability results are presented in [11].

We designed our learning robot by adding a small number of actuators to this passive design. The robot shown on the right in Figure 1, which is also described in [11], has passive joints at the hip and 2 degrees of actuation (roll and pitch) at each ankle. The ankle actuators are position controlled servo motors which, when commanded to hold their zero position, allow the actuated robot to walk stably down a small ramp, simulating the passive walker. The shape of the large, curved feet is designed to make the robot walk passively at 0.8 Hz, and to take steps of approximately 6.5 cm when walking down a ramp of 0.03 radians. The robot stands 44 cm tall and weighs approximately 2.9 kg, which includes the CPU and batteries that are carried on-board. The most recent additions to this robot are the passive arms, which are mechanically coupled to the opposite leg to provide mechanical yaw compensation.

When placed on flat terrain, the passive walker waddles back and forth, slowly losing energy, until it comes to rest standing still.
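
For a sense of scale, the energy that gravity supplies on the ramp can be estimated from the figures quoted above. This is only a rough back-of-the-envelope sketch: it assumes the 6.5 cm step length is measured along the ramp, that 0.8 Hz is the stepping rate, and it uses standard gravity, none of which the text states explicitly.

```python
import math

# Figures quoted in the text for the robot walking down the ramp.
MASS_KG = 2.9          # robot mass, including CPU and batteries
STEP_LENGTH_M = 0.065  # approximate step length (assumed measured along the ramp)
SLOPE_RAD = 0.03       # ramp inclination
STEP_RATE_HZ = 0.8     # assumed to be the stepping rate
GRAVITY = 9.81         # m/s^2, standard gravity (assumption)


def energy_per_step(mass, step_length, slope, g=GRAVITY):
    """Potential energy released by descending one step along the ramp."""
    return mass * g * step_length * math.sin(slope)


energy_j = energy_per_step(MASS_KG, STEP_LENGTH_M, SLOPE_RAD)
power_w = energy_j * STEP_RATE_HZ  # average rate at which gravity adds energy
print(f"energy per step ~ {energy_j:.4f} J, average power ~ {power_w:.4f} W")
```

Under these assumptions gravity supplies roughly 0.055 J per step, which is the order of magnitude that the ankle actuators have to replace on flat terrain.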
In order to achieve stable walking on flat terrain, the actuators on our learning robot must restore energy into the system that would have been restored by gravity when walking down a slope.

III. THE LEARNING PROBLEM

The goal of learning is to acquire a feedback control policy which makes the robot's gait invariant to small slopes. In total, the system has 9 degrees of freedom (footnote 2), and the equations of motion can be written in the form

$$H(q)\ddot{q} + C(q, \dot{q})\dot{q} + G(q) = \tau + D(t), \qquad (1)$$

where

$$q = [\theta_{yaw}, \theta_{lPitch}, \theta_{bPitch}, \theta_{rPitch}, \theta_{roll}, \theta_{raroll}, \theta_{laroll}, \theta_{raPitch}, \theta_{laPitch}]^T,$$

$$\tau = [0, 0, 0, 0, 0, \tau_{raroll}, \tau_{laroll}, \tau_{raPitch}, \tau_{laPitch}]^T.$$

$H$ is the state dependent inertial matrix, $C$ contains interaction torques between the links, $G$ represents the effect of gravity, $\tau$ are the motor torques, and $D$ are random disturbances to the system. Our shorthand lPitch, bPitch, and rPitch refer to left leg pitch, body pitch, and right leg pitch, respectively. raroll, laroll, raPitch, and laPitch are short for right and left ankle roll and pitch. The actual output of the controller is a motor command vector

$$u = [u_{raroll}, u_{laroll}, u_{raPitch}, u_{laPitch}]^T,$$

which generates torques $\tau = h(q, \dot{q}, u)$. The function $h$ describes the linear feedback controller implemented by the servo boards and the nonlinear kinematic transformation into joint torques.

The robot uses a deterministic feedback control policy which is represented using a linear function approximator parameterized by vector $w$ and using nonlinear features $\phi$:

$$u = \pi_w(\hat{x}) = \sum_i w_i \phi_i(\hat{x}), \quad \text{with } x = \begin{bmatrix} q \\ \dot{q} \end{bmatrix}. \qquad (2)$$

The notation $\hat{x}$ represents a noisy estimate of the state $x$. Before learning, $w$ is initialized to all zeros, making the policy outputs zero everywhere, so that the robot simulates the passive walker.

To quantify the stability of our nonlinear, stochastic, periodic trajectory, we consider the dynamics on the return map, taken around the point where $\theta_{roll} = 0$ and $\dot{\theta}_{roll} > 0$.
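
The return map section defined above can be illustrated with a small sketch that finds the samples at which the roll angle crosses zero with positive velocity. The function name, the synthetic sinusoidal roll signal, and its amplitude and phase are hypothetical stand-ins for logged robot data; only the crossing condition comes from the text.

```python
import numpy as np


def return_map_crossings(theta_roll, theta_roll_dot):
    """Return sample indices k where theta_roll crosses zero upward.

    A crossing is reported at the first sample at or above zero,
    provided the roll velocity at that sample is positive.
    """
    theta_roll = np.asarray(theta_roll, dtype=float)
    theta_roll_dot = np.asarray(theta_roll_dot, dtype=float)
    upward = (theta_roll[:-1] < 0.0) & (theta_roll[1:] >= 0.0)
    moving_positive = theta_roll_dot[1:] > 0.0
    return np.nonzero(upward & moving_positive)[0] + 1


# Synthetic roll oscillation: a hypothetical 0.8 Hz sinusoid sampled at 100 Hz.
t = np.arange(0.0, 5.0, 0.01)
omega = 2.0 * np.pi * 0.8
roll = 0.1 * np.sin(omega * t - 0.5)
roll_dot = 0.1 * omega * np.cos(omega * t - 0.5)

idx = return_map_crossings(roll, roll_dot)
return_map_states = roll_dot[idx]  # roll velocity sampled on the section
print(idx, return_map_states)
```

On the real robot the state sampled at each crossing would be the full noisy estimate $\hat{x}$ rather than the roll velocity alone.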
The return map dynamics are a Markov random sequence with the probability at the $(n+1)$th crossing of the return map given by

$$f_w(x', x) = P\{\hat{X}(n+1) = x' \mid \hat{X}(n) = x, W(n) = w\}. \qquad (3)$$

$f_w(x', x)$ represents the probability density function over the state space which contains the dynamics in equations 1 and 2 integrated over one cycle. We do not make any assumptions about its form, except that it is Markov. Note that the element of $f_w$ representing $\theta_{roll}$ is the delta function, independent of $x$. The return map dynamics are represented as a Markov chain that depends on the parameter vector $w$ instead of the equivalent Markov decision process for simplification because the feedback controller is evaluated many times during a single step (our controller runs at 100 Hz and our robot steps at around 0.8 Hz). The stochasticity in $f_w$ comes from the random disturbances $D(t)$ and the state estimation error, $\hat{x} - x$.

Footnote 2: 6 internal DOFs and 3 DOFs for the robot's orientation. We assume that the robot is always in contact with the ground at a single point, and infer the robot's absolute (x, y) position in space directly from the remaining variables.

The cost function for learning uses a constant desired value, $x_d$, on the return map:

$$g(x) = \frac{1}{2} |x - x_d|^2. \qquad (4)$$

This desired value can be considered a reference trajectory on the return map, and is taken from the gait of the walker down a slope of 0.03 radians; no reference trajectory is required for the limit cycle between steps. For a given trajectory $\hat{x} = [\hat{x}(0), \hat{x}(1), ..., \hat{x}(N)]$, we define the average cost

$$G(\hat{x}) = \frac{1}{N} \sum_{n=0}^{N} g(\hat{x}(n)). \qquad (5)$$

Our goal is to find the parameter vector $w$ which minimizes

$$\lim_{N \to \infty} E\{G(\hat{x})\}. \qquad (6)$$

By minimizing this error, we are effectively minimizing the eigenvalues of the return map, and maximizing the stability of the desired limit cycle.

IV. THE LEARNING ALGORITHM

The learning algorithm is a statistical algorithm which makes small changes to the control parameters $w$ on each step and uses correlations between changes in $w$ and changes in the return map error to climb the performance gradient. This can be accomplished with a very simple online learning rule which changes $w$ with each step that the robot takes. The particular algorithm that we present here was originally proposed by [7]. We present a thorough derivation of this algorithm in the next section.

The algorithm makes use of an intermediate representation which we call the value function, $J(x)$. The value of state $x$ is the expected average cost to be incurred by following policy $\pi_w$ starting from state $x$:

$$J(x) = \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N} g(x(n)), \quad \text{with } x(0) = x.$$

$\hat{J}_v(x)$ is an estimate of the value function parameterized by vector $v$. This value estimate is represented in another function approximator:

$$\hat{J}_v(\hat{x}) = \sum_i v_i \psi_i(\hat{x}). \qquad (7)$$

During learning, we add stochasticity to our deterministic control policy by varying $w$.
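
The cost definitions in equations (4) and (5) can be sketched in a few lines. The function names are hypothetical, and the example uses a one-dimensional return-map state with a desired roll velocity of 1.0 rad/s purely for illustration; the $1/N$ normalization over $N+1$ samples follows equation (5) as written.

```python
import numpy as np


def step_cost(x, x_desired):
    """g(x) = 0.5 * |x - x_d|^2, as in equation (4)."""
    diff = np.atleast_1d(np.asarray(x, dtype=float)) - np.atleast_1d(
        np.asarray(x_desired, dtype=float)
    )
    return 0.5 * float(diff @ diff)


def average_cost(trajectory, x_desired):
    """G = (1/N) * sum_{n=0}^{N} g(x(n)), as in equation (5).

    `trajectory` holds the return-map states x(0) .. x(N), so N is one
    less than the number of stored states.
    """
    n_steps = len(trajectory) - 1
    return sum(step_cost(x, x_desired) for x in trajectory) / n_steps


# Illustrative one-dimensional example (values are made up).
x_d = [1.0]
trajectory = [[0.0], [0.5], [1.0]]
costs = [step_cost(x, x_d) for x in trajectory]
G = average_cost(trajectory, x_d)
print(costs, G)
```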
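
Let $Z(n)$ be a Gaussian random vector with $E\{Z_i(n)\} = 0$ and $E\{Z_i(n) Z_j(n')\} = \sigma^2 \delta_{ij} \delta_{nn'}$. During the $n$th step that the robot takes, we evaluate the controller using the parameter vector $w'(n) = w(n) + z(n)$. The algorithm uses a storage variable, $e(n)$, which we call the eligibility trace. We begin with $w(0) = e(0) = 0$. At the end of the $n$th step, we make the updates:

$$\delta(n) = g(\hat{x}(n)) + \gamma \hat{J}_v(\hat{x}(n+1)) - \hat{J}_v(\hat{x}(n)) \qquad (8)$$

$$e_i(n) = \gamma e_i(n-1) + b_i(n) z_i(n) \qquad (9)$$

$$\Delta w_i(n) = -\eta_w \delta(n) e_i(n) \qquad (10)$$

$$\Delta v_i(n) = \eta_v \delta(n) \psi_i(\hat{x}(n)). \qquad (11)$$

$\eta_w \geq 0$ and $\eta_v \geq 0$ are the learning rates and $\gamma$ is the discount factor of the eligibility trace, which will be discussed in more detail in the algorithm derivation. $b_i(n)$ is a boolean one step eligibility, which is 1 if the parameter $w_i$ is activated ($\phi_i(\hat{x}) > 0$) at any point during step $n$ and 0 otherwise. $\delta(n)$ is called the one step temporal difference error.

The algorithm can be understood intuitively. On each step the robot receives some cost $g(\hat{x}(n))$. This cost is compared to the cost that we expect to receive, as estimated by $\hat{J}_v(x)$. If the cost is lower than expected, then $-\eta \delta(n)$ is positive, so we add a scaled version of the noise terms, $z_i$, into $w_i$. Similarly, if the cost is higher than expected, then we move in the opposite direction. This simple online algorithm performs approximate stochastic gradient descent on the expected value of the average infinite-horizon cost.

V. ALGORITHM DERIVATION

The expected value of the average cost, $G$, is given by:

$$E\{G(\hat{x})\} = \int_{\hat{x}} G(\hat{x}) P_w\{\hat{X} = \hat{x}\} \, d\hat{x}.$$

The probability of trajectory $\hat{x}$ is

$$P_w\{\hat{X} = \hat{x}\} = P\{\hat{X}(0) = \hat{x}(0)\} \prod_{n=0}^{N-1} f_w(\hat{x}(n+1), \hat{x}(n)).$$

Taking the gradient of $E\{G(\hat{x})\}$ with respect to $w$ we find

$$\frac{\partial}{\partial w_i} E\{G(\hat{x})\} = \int_{\hat{x}} G(\hat{x}) \frac{\partial}{\partial w_i} P_w\{\hat{X} = \hat{x}\} \, d\hat{x}$$

$$= \int_{\hat{x}} G(\hat{x}) P_w\{\hat{X} = \hat{x}\} \frac{\partial}{\partial w_i} \log P_w\{\hat{X} = \hat{x}\} \, d\hat{x}$$

$$= E\left\{ G(\hat{x}) \frac{\partial}{\partial w_i} \log P_w\{\hat{X} = \hat{x}\} \right\}$$

$$= E\left\{ G(\hat{x}) \sum_{m=0}^{N-1} \frac{\partial}{\partial w_i} \log f_w(\hat{x}(m+1), \hat{x}(m)) \right\}.$$

Recall that $f_w(x', x)$ is a complicated function which includes the integrated dynamics of the controller and the robot. Nevertheless, $\frac{\partial}{\partial w_i} \log f_w$ is simply:

$$\frac{\partial}{\partial w_i} \log f_w(x(m+1), x(m)) = \frac{\partial}{\partial w_i} \log \left[ P\{\hat{X}' = x' \mid \hat{X} = x, W' = w'\} \, P_w\{W' = w'\} \right] = \frac{\partial}{\partial w_i} \log P_w\{W'(m) = w'\} = \frac{z_i(m)}{\sigma^2}.$$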
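
The last identity is the score of an isotropic Gaussian with mean $w$, and it can be checked numerically. The sketch below is only a sanity check under that Gaussian assumption; the helper names, the dimension, and the value of $\sigma$ are hypothetical.

```python
import numpy as np


def log_gaussian_density(w_perturbed, w, sigma):
    """Log density of w' ~ N(w, sigma^2 I), evaluated at w_perturbed."""
    d = w_perturbed - w
    return -0.5 * float(d @ d) / sigma**2 - 0.5 * len(w) * np.log(2.0 * np.pi * sigma**2)


def numerical_gradient(w_perturbed, w, sigma, eps=1e-6):
    """Central-difference gradient of the log density with respect to the mean w."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (
            log_gaussian_density(w_perturbed, w + step, sigma)
            - log_gaussian_density(w_perturbed, w - step, sigma)
        ) / (2.0 * eps)
    return grad


rng = np.random.default_rng(0)
sigma = 0.1
w = rng.normal(size=3)          # nominal parameters
z = sigma * rng.normal(size=3)  # exploration noise, so that w' = w + z

numeric = numerical_gradient(w + z, w, sigma)
analytic = z / sigma**2         # the closed form derived in the text
print(numeric, analytic)
```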

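Substituting, we have

$$\frac{\partial}{\partial w_i} E\{G(\hat{x})\} = \frac{1}{N\sigma^2} E\left\{ \left( \sum_{n=0}^{N} g(\hat{x}(n)) \right) \left( \sum_{m=0}^{N-1} z_i(m) \right) \right\} = \frac{1}{N\sigma^2} E\left\{ \sum_{n=0}^{N} g(\hat{x}(n)) \sum_{m=0}^{n} b_i(m) z_i(m) \right\}.$$

This final reduction is based on the observation that $E\{ g(\hat{x}(n)) \sum_{m=n}^{N} z_i(m) \} = 0$ (noise added to the controller on or after step $n$ is not correlated to the cost at step $n$). Because the $m = n$ term has zero expectation, keeping it in the inner sum, as the eligibility trace of equation 9 does, leaves the expectation unchanged. Similarly, random changes to a weight that is not used during the $n$th step ($b_i(m) = 0$) have zero expectation, and can be excluded from the sum.

Observe that the variance of this gradient estimate grows without bound as $N \to \infty$ [8]. To bound the variance, we use a biased estimate of this gradient which artificially discounts the eligibility trace:

$$\frac{\partial}{\partial w_i} E\{G(\hat{x})\} \approx \frac{1}{N\sigma^2} E\left\{ \sum_{n=0}^{N} g(\hat{x}(n)) \sum_{m=0}^{n} \gamma^{n-m} b_i(m) z_i(m) \right\} = \frac{1}{N\sigma^2} E\left\{ \sum_{n=0}^{N} g(\hat{x}(n)) e_i(n) \right\},$$

with $0 \leq \gamma \leq 1$. The discount factor $\gamma$ parameterizes the bias-variance trade-off.

Next, observe that we can subtract any mean zero baseline from this quantity without affecting the expected value of our estimate [12]. Including this baseline can dramatically improve the performance of our algorithm because it can reduce the variance of our gradient estimate. In particular, we subtract a mean-zero term containing an estimate of the value function as recommended by [7]:

$$\lim_{N \to \infty} \frac{\partial}{\partial w_i} E\{G(\hat{x})\} \approx \lim_{N \to \infty} \frac{1}{N\sigma^2} E\left\{ \sum_{n=0}^{N} g(\hat{x}(n)) e_i(n) \right\}$$

$$= \lim_{N \to \infty} \frac{1}{N\sigma^2} E\left\{ \sum_{n=0}^{N} g(\hat{x}(n)) e_i(n) - \sum_{n=0}^{N} \hat{J}_v(\hat{x}(n)) b_i(n) z_i(n) \right\}$$

$$= \lim_{N \to \infty} \frac{1}{N\sigma^2} E\left\{ \sum_{n=0}^{N} g(\hat{x}(n)) e_i(n) + \sum_{n=0}^{N} b_i(n) z_i(n) \sum_{m=n}^{N} \gamma^{m-n} \left( \gamma \hat{J}_v(\hat{x}(m+1)) - \hat{J}_v(\hat{x}(m)) \right) \right\}$$

$$= \lim_{N \to \infty} \frac{1}{N\sigma^2} E\left\{ \sum_{n=0}^{N} g(\hat{x}(n)) e_i(n) + \sum_{n=0}^{N} \left( \gamma \hat{J}_v(\hat{x}(n+1)) - \hat{J}_v(\hat{x}(n)) \right) e_i(n) \right\}$$

$$= \lim_{N \to \infty} \frac{1}{N\sigma^2} E\left\{ \sum_{n=0}^{N} \delta(n) e_i(n) \right\}.$$

The third line writes $-\hat{J}_v(\hat{x}(n))$ as a telescoping discounted sum whose tail term vanishes for $\gamma < 1$, and the fourth exchanges the order of summation so that the noise terms collect into the eligibility trace. By this derivation, we can see that the average of the weight update given in equations 8-11 is in the direction of the performance gradient:

$$\lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N} E\{\Delta w_i(n)\} \approx -\eta_w \sigma^2 \lim_{N \to \infty} \frac{\partial}{\partial w_i} E\{G(\hat{x})\}.$$

VI. LEARNING IMPLEMENTATION

In our initial implementation of the algorithm, we decided to further simplify the problem by decomposing the control in the frontal and sagittal planes. In this decomposition, the ankle roll actuators are responsible for stabilizing the oscillations of the robot in the frontal plane. The ankle pitch actuators cause the robot to lean forward or backward, which moves the position of the center of mass relative to the ground contact point on the foot.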
Because the hip joint on our robot is passive, if the center of mass is in front of the ground contact when the swing foot leaves the ground, then the robot will begin to walk forward. The distance between the center of mass and the ground contact is monotonically related to the step size and to the walking speed.

Due to the simplicity of the sagittal plane control, we only need to learn a control policy for the two ankle roll actuators which stabilize the roll oscillation in the frontal plane. This strategy will change as the robot walks at different speeds, but we hope the learning algorithm will adapt quickly enough to compensate for those differences.

With these simplifications in mind, we constrain the feedback policy to be a function of only two variables: $\theta_{roll}$ and $\dot{\theta}_{roll}$. The choice of these two variables is not arbitrary; they are the only variables that we use when writing a non-learning feedback controller that stabilizes the oscillation. We also constrain the policy to be symmetric: the controller for the left ankle is simply a mirror image of the controller for the right ankle. Therefore, the learned control policy only has a single output. The value function is approximated as a function of only a single variable: $\dot{\theta}_{roll}$. This very low dimensionality allows the algorithm to train very quickly.

The control policy and value functions are both represented using linear function approximators of the form described in Equations 2 and 7, which are fast and very convenient to initialize. We use a non-overlapping tile-coding for our approximator basis functions: 35 tiles for the policy (5 in $\theta_{roll}$ × 7 in $\dot{\theta}_{roll}$) and a separate set of tiles for the value function.

In order to make the robot explore the state space during learning, we hand-designed a simple controller to place the robot in random initial conditions on the return map. The random distribution is biased according to the distribution of points that the robot has already experienced on the return map: the most likely initial condition is the state that the robot experienced least often.
We use this controller to randomly reinitialize the robot every time that it comes to a halt standing still, or every 10 seconds, whichever comes first. This heuristic makes the distribution on the return map more uniform, and increases the likelihood of the algorithm converging on the same policy each time that it learns from a blank slate.
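
Putting the update rule of equations (8)-(11) together with the tile-coded approximators described in this section, one learning step might be sketched as follows. This is a minimal illustration, not the robot's actual software: the class names, the input ranges of the tiles, the number of value-function tiles, the learning rates, and the noise scale are all hypothetical choices. Only the 5 × 7 policy tiling, the single policy output, and the form of the updates come from the text.

```python
import numpy as np


class TileCoder:
    """Non-overlapping tile coding: exactly one active binary feature per input."""

    def __init__(self, lows, highs, bins):
        self.lows = np.asarray(lows, dtype=float)
        self.highs = np.asarray(highs, dtype=float)
        self.bins = np.asarray(bins, dtype=int)
        self.size = int(np.prod(self.bins))

    def index(self, x):
        """Flat index of the tile containing x (inputs are clipped to the range)."""
        frac = (np.asarray(x, dtype=float) - self.lows) / (self.highs - self.lows)
        cell = np.clip(np.floor(frac * self.bins).astype(int), 0, self.bins - 1)
        return int(np.ravel_multi_index(tuple(cell), tuple(self.bins)))


class PolicyGradientLearner:
    def __init__(self, policy_tiles, value_tiles, eta_w, eta_v, gamma, sigma, seed=0):
        self.policy_tiles, self.value_tiles = policy_tiles, value_tiles
        self.eta_w, self.eta_v, self.gamma, self.sigma = eta_w, eta_v, gamma, sigma
        self.w = np.zeros(policy_tiles.size)  # policy parameters, w(0) = 0
        self.v = np.zeros(value_tiles.size)   # value-estimate parameters
        self.e = np.zeros(policy_tiles.size)  # eligibility trace, e(0) = 0
        self.z = np.zeros(policy_tiles.size)  # exploration noise for this step
        self.b = np.zeros(policy_tiles.size)  # one-step eligibility flags
        self.rng = np.random.default_rng(seed)

    def begin_step(self):
        """Draw the noise z(n) used throughout step n and clear the flags b(n)."""
        self.z = self.sigma * self.rng.standard_normal(self.w.size)
        self.b[:] = 0.0

    def control(self, policy_input):
        """Evaluate the policy with w + z and mark the tile that was activated."""
        i = self.policy_tiles.index(policy_input)
        self.b[i] = 1.0
        return self.w[i] + self.z[i]

    def end_step(self, cost, value_input, next_value_input):
        """Apply equations (8)-(11) at the end of step n."""
        i = self.value_tiles.index(value_input)
        j = self.value_tiles.index(next_value_input)
        delta = cost + self.gamma * self.v[j] - self.v[i]  # (8)
        self.e = self.gamma * self.e + self.b * self.z     # (9)
        self.w -= self.eta_w * delta * self.e              # (10)
        self.v[i] += self.eta_v * delta                    # (11), psi is one-hot
        return delta


# Hypothetical ranges: roll in [-0.5, 0.5] rad, roll velocity in [-2, 2] rad/s.
policy_tiles = TileCoder(lows=[-0.5, -2.0], highs=[0.5, 2.0], bins=[5, 7])
value_tiles = TileCoder(lows=[-2.0], highs=[2.0], bins=[10])  # tile count assumed
learner = PolicyGradientLearner(policy_tiles, value_tiles,
                                eta_w=0.1, eta_v=0.2, gamma=0.2, sigma=0.1)

learner.begin_step()
u = learner.control([0.0, 1.0])  # called many times per step on a real robot
delta = learner.end_step(cost=0.5, value_input=[0.0], next_value_input=[1.0])
```

In a real control loop `control` would be called at every controller tick within a step, so several tiles can become eligible before `end_step` runs at the next return-map crossing.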

VII. EXPERIMENTAL RESULTS

When the learning begins, the policy parameters, $w$, are set to 0 and the baseline parameters, $v$, are initialized so that $\hat{J}_v(x) \approx \frac{g(x)}{1 - \gamma}$. We typically train the robot on flat terrain using short trials with random initial conditions. During the first few trials, the policy does not restore sufficient energy into the system, and the robot comes to a halt standing still. Within a minute of training, the robot achieves foot clearance on nearly every step; this is the minimal definition of walking on this system. The learning easily converges to a robust gait with the desired fixed point on the return map within 20 minutes (approximately 960 steps at 0.8 Hz). Error obtained during learning depends on the random initial conditions of each trial, and is therefore a very noisy stochastic variable. For this reason, in Figure 2 we plot a typical learning curve in terms of the average error per step. Figure 3 plots a typical trajectory of the learned controller walking on flat terrain. Figure 4 displays the final policy.

Fig. 2. A typical learning curve, plotted as the average error on each step.

Fig. 3. $\theta_{roll}$ trajectory of the robot starting from standing still (time axis in seconds).

Fig. 4. Learned feedback control policy $u_{raroll} = \pi_w(\hat{x})$.

In Figure 5 we plot the return maps of the system before learning ($w = 0$) and after 1000 steps. In general, the return map for our 9 DOF robot is 17 dimensional (9 states + 9 derivatives - 1), and the projection of these dynamics onto a single dimension is difficult to interpret. The plots in Figure 5 were made with the robot walking in place on flat terrain. In this particular situation, most of the return map variables are close to zero throughout the dynamics, and a two dimensional return map captures the desired dynamics. As expected, before learning the return map illustrates a single fixed point at $\dot{\theta}_{roll} = 0$, which means the robot is standing still. After learning, we obtain a single fixed point at the desired value ($\dot{\theta}_{roll} = 1.0$ radians / second), and the basin of attraction of this fixed point extends over the entire domain that we tested. On the rare occasion that the robot falls over, the system does not return to the map and stops producing points on this graph.

Fig. 5. Experimental return maps, before (left) and after (right) learning. Fixed points exist at the intersections of the return map (blue) and the line of slope one (red).

To quantify the stability of the learned controller, we measure the eigenvalues of the return map. Linearizing around the fixed point in Figure 5 suggests that the system has a single eigenvalue of 0.5. To obtain the eigenvalues of the return map when the robot is walking, we run the robot from a large number of initial conditions and record the return map trajectories $\hat{x}^i(n)$, 9 × 1 vectors which represent the state of the system (with fixed ankles) on the $n$th crossing of the $i$th trial. For each trial we estimate $\hat{x}^i(\infty)$, the equilibrium of the return map. Finally, we perform a least squares fit of the matrix $A$ to satisfy the relation

$$[\hat{x}^i(n+1) - \hat{x}^i(\infty)] = A [\hat{x}^i(n) - \hat{x}^i(\infty)].$$

The eigenvalues of $A$ for the learned controller and for our hand-designed controllers (described in [11]) are:

Controller                               Eigenvalues
Passive walking (63 trials)              0.88 ± 0.0, 0.75, 0.66 ± 0.03, 0.54, 0.36, 0.32 ± 0.3
Hand-designed feed-forward (89 trials)   0.80, 0.60, 0.49 ± 0.04, 0.36, 0.25, 0.20 ± 0.0, 0.0
Hand-designed feedback (58 trials)       0.78, 0.69 ± 0.03, 0.36, 0.25, 0.20 ± 0.0, 0.0
Learned feedback (42 trials)             0.74 ± 0.05, 0.53 ± 0.09, 0.43, 0.30 ± 0.02, 0.5, 0.07

All of these experiments were on flat terrain except the passive walking, which was on a slope of radians. The convergence of the system to the nominal trajectory is largely governed by the largest eigenvalues. This analysis suggests that our learned controller converges to the steady state trajectory more quickly than the passive walker on a ramp and more quickly than any of our hand-designed controllers.

Our stochastic policy gradient algorithm solves the temporal credit assignment problem by accumulating the eligibility within a step and discounting eligibility between steps. Interestingly, our algorithm performs best with heavy discounting between steps ($0 \leq \gamma \leq 0.2$). This suggests that our one dimensional value estimate does a good job of isolating the credit assignment to a single step.

While it took a few minutes to learn a controller from a blank slate, adjusting the learned controller to adapt to small changes in the terrain appears to happen very quickly. The non-learning controllers require constant attention and small manual changes to the parameters as the robot walks down the hall, on tiles, and on carpet. The learning controller easily adapts to these situations.

VIII. DISCUSSION

Designing our robot like a passive dynamic walker changes the learning problem in a number of ways. It allows us to learn a policy with only a single output which controls a 9 DOF system, and allows us to formulate the problem on the return map dynamics. It also dramatically increases the number of policies in the search space which could generate stable walking. The learning algorithm works extremely well on this simple robot, but will the technique scale to more complicated robots?

One factor in our success was the formulation of the learning problem on the discrete dynamics of the return map instead of the continuous dynamics along the entire trajectory. This formulation relies on the fact that our passive walker produces periodic trajectories even before the learning begins.
It is possible for passive walkers to have knees and arms [13], or on a more traditional humanoid robot this algorithm could be used to augment and improve an existing walking controller which produces nominal walking trajectories.

As the number of degrees of freedom increases, the stochastic policy gradient algorithm may have problems with scaling. The algorithm correlates changes in the policy parameters with changes in the performance on the return map. As we add degrees of freedom, the assignment of credit to a particular actuator will become more difficult, requiring more learning trials to obtain a good estimate of the correlation. This scaling problem is an open and interesting research question and a primary focus of our current research.

IX. CONCLUSIONS

We have presented a learning formulation and learning algorithm which works very well on our simplified 3D dynamic biped. The robot begins to walk after only one minute of learning from a blank slate, and the learning converges to the desired trajectory in less than 20 minutes. This learned controller is quantifiably more stable, using the eigenvalues of the return map, than any controller we were able to derive for the same robot by hand. Once the controller is learned, the robot is able to quickly adapt to small changes in the terrain.

Building a robot to simplify the learning allowed us to gain some practical insights into the learning problem for dynamic bipedal locomotion. Implementing these algorithms on the real robot proved to be a very different problem than working in simulation.

We would like to take two basic directions to continue this research. First, we are removing many of the simplifying assumptions used in this paper (such as the decomposed control policy) to better approximate optimal walking on this simple platform and to test our learning controller's ability to compensate for rough terrain. Second, we are scaling these results up to more sophisticated bipeds, including a passive dynamic walker with knees and humanoids that already have a basic control system in place.
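
As a companion to the eigenvalue analysis of Section VII, the least squares fit of the return map matrix $A$ can be sketched as follows. This is an illustration on synthetic data only. The text does not say how the per-trial equilibrium $\hat{x}^i(\infty)$ is estimated, so the sketch assumes it is approximated by the last recorded crossing; the function names and the two-dimensional test system are hypothetical.

```python
import numpy as np


def fit_return_map(trials):
    """Least-squares fit of A in [x(n+1) - x_inf] = A [x(n) - x_inf].

    `trials` is a list of arrays of shape (crossings, state_dim). The
    equilibrium of each trial is approximated by its last crossing
    (an assumption), and that last crossing is excluded from the fit.
    """
    before, after = [], []
    for x in trials:
        x = np.asarray(x, dtype=float)
        x_inf = x[-1]
        before.append(x[:-2] - x_inf)
        after.append(x[1:-1] - x_inf)
    X = np.vstack(before)
    Y = np.vstack(after)
    # Rows satisfy y = A x, i.e. Y = X A^T, so solve for A^T.
    A_transposed, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return A_transposed.T


def simulate(A, x_star, x0, crossings):
    """Generate a noise-free linear return-map trajectory (synthetic data)."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(crossings - 1):
        xs.append(x_star + A @ (xs[-1] - x_star))
    return np.array(xs)


A_true = np.array([[0.5, 0.1], [0.0, 0.3]])  # eigenvalues 0.5 and 0.3
x_star = np.array([1.0, 2.0])
rng = np.random.default_rng(1)
trials = [simulate(A_true, x_star, x_star + rng.normal(size=2), 30) for _ in range(3)]

A_fit = fit_return_map(trials)
eigenvalues = np.sort(np.linalg.eigvals(A_fit).real)
print(eigenvalues)
```

With noisy data from a real robot the estimate of the equilibrium matters more, and averaging the final few crossings of each trial would be one reasonable alternative.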

ACKNOWLEDGMENTS

This work was supported by the David and Lucille Packard Foundation (contract 99-47) and the National Science Foundation (grant CCR-02249). Special thanks to Ming-fa Fong and Derrick Tan for their help with designing and building the experimental platform.

REFERENCES

[1] J. Morimoto and C. Atkeson, "Minimax differential dynamic programming: An application to robust biped walking," Neural Information Processing Systems.
[2] W. T. Miller, III, "Real-time neural network control of a biped walking robot," IEEE Control Systems Magazine, vol. 14, no. 1, pp. 41-48, Feb. 1994.
[3] H. Benbrahim and J. A. Franklin, "Biped dynamic walking using reinforcement learning," Robotics and Autonomous Systems, vol. 22, 1997.
[4] N. Kohl and P. Stone, "Policy gradient reinforcement learning for fast quadrupedal locomotion," IEEE International Conference on Robotics and Automation.
[5] T. McGeer, "Passive dynamic walking," International Journal of Robotics Research, vol. 9, no. 2, April 1990.
[6] M. J. Coleman and A. Ruina, "An uncontrolled toy that can walk but cannot stand still," Physical Review Letters, vol. 80, no. 16, April 1998.
[7] H. Kimura and S. Kobayashi, "An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value functions," International Conference on Machine Learning (ICML '98), 1998.
[8] J. Baxter and P. Bartlett, "Infinite-horizon policy-gradient estimation," Journal of Artificial Intelligence Research, vol. 15, 2001.
[9] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," Advances in Neural Information Processing Systems, 1999.
[10] J. E. Wilson, "Walking toy," United States Patent Office, Tech. Rep., October.
[11] R. Tedrake, T. W. Zhang, M. Fong, and H. S. Seung, "Actuating a simple 3D passive dynamic walker," IEEE International Conference on Robotics and Automation.
[12] R. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, 1992.
[13] S. H. Collins, M. Wisse, and A. Ruina, "A three-dimensional passive-dynamic walking robot with two legs and knees," International Journal of Robotics Research, vol. 20, no. 7, July 2001.
