Computing Optimal Strategies to Commit to in Stochastic Games

Joshua Letchford¹ and Liam MacDermed² and Vincent Conitzer¹ and Ronald Parr¹ and Charles L. Isbell²
¹ Duke University, Department of Computer Science, Durham, NC 27708, USA, {jcl,conitzer,parr}@cs.duke.edu
² Georgia Institute of Technology, Robotics and Intelligent Machines Laboratory, Atlanta, Georgia 30332, {liam,isbell}@cc.gatech.edu

Abstract

Significant progress has been made recently in the following two lines of research in the intersection of AI and game theory: (1) the computation of optimal strategies to commit to (Stackelberg strategies), and (2) the computation of correlated equilibria of stochastic games. In this paper, we unite these two lines of research by studying the computation of Stackelberg strategies in stochastic games. We provide theoretical results on the value of being able to commit and the value of being able to correlate, as well as complexity results about computing Stackelberg strategies in stochastic games. We then modify the QPACE algorithm (MacDermed et al. 2011) to compute Stackelberg strategies, and provide experimental results.

1 Introduction

Computing game-theoretic solutions is a topic that has long been of interest to AI researchers. A recent line of research focuses on two-player games in which player 1 is able to commit to a strategy before the other player moves. The following standard example illustrates the potential benefit of such commitment.

         L        R
U      (1,1)    (3,0)
D      (0,0)    (2,1)

Figure 1: Normal-form game used in Example 1.

Example 1 (known). Consider the normal-form game in Figure 1. For the case where the players move simultaneously (no ability to commit), the unique Nash equilibrium is (U, L): U strictly dominates D, so that the game is solvable by iterated strict dominance. Player 1 (the row player) receives utility 1. However, now suppose that player 1 has the ability to commit. Then, she is better off committing to play D, which will incentivize player 2 to play R, resulting in a utility of 2 for player 1.
The situation gets even better for player 1 if she can commit to a mixed strategy: in this case, she can commit to the mixed strategy (.5 − ε, .5 + ε), which still incentivizes player 2 to play R, but now player 1 receives an expected utility of 2.5 − ε. To ensure the existence of optimal strategies, we assume (as is commonly done (Conitzer and Sandholm 2006; Paruchuri et al. 2008)) that player 2 breaks ties in player 1's favor, so that the optimal strategy for player 1 to commit to is (.5, .5), resulting in a utility of 2.5. (Note that there is never a reason for player 2 to randomize, since he effectively faces a single-agent decision problem.)

Copyright © 2012, Association for the Advancement of Artificial Intelligence. All rights reserved.

Besides potentially increasing the utility of player 1 (and never decreasing it (von Stengel and Zamir 2010)), the use of mixed Stackelberg strategies has several other technical advantages. First, it avoids the dreaded equilibrium selection problem: in simultaneous-move games, if the players choose their strategies from different equilibria, the result is not necessarily an equilibrium. Second, in two-player normal-form games, an optimal mixed Stackelberg strategy can be computed in polynomial time using linear programming (Conitzer and Sandholm 2006; von Stengel and Zamir 2010), whereas computing a Nash equilibrium is PPAD-complete (Daskalakis, Goldberg, and Papadimitriou 2006; Chen and Deng 2006), and if the goal is to find an optimal Nash equilibrium, typical objective functions (such as the sum of the players' utilities or even just player 1's utility) are NP-hard even to approximate (Gilboa and Zemel 1989; Conitzer and Sandholm 2008). (However, an optimal correlated equilibrium can be computed in polynomial time using linear programming.) Perhaps in part due to some of these advantages, the computation of mixed Stackelberg strategies has recently found application in various real security problems, including airport security (Jain et al. 2008; Pita et al. 2009), assigning Federal Air Marshals to flights (Tsai et al. 2009), and Coast Guard patrols (Shieh et al. 2012).
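The numbers in Example 1 can be checked mechanically. The following is a minimal Python sketch, not the linear program of Conitzer and Sandholm (2006): it only handles a leader with two pure strategies, and it searches the points where the follower's best response can change (the endpoints and the follower's indifference points). The function name `optimal_commitment_2xn` and the matrix layout are assumptions made for illustration; payoffs are expected to be integers or `Fraction` values so that ties are detected exactly.

```python
from fractions import Fraction


def optimal_commitment_2xn(leader, follower):
    """Best mixed strategy to commit to for a leader with two rows.

    leader[r][c] and follower[r][c] are the payoffs for row r and column c.
    Returns (p, column, value), where p is the probability of row 0.
    The follower breaks ties in the leader's favor, as assumed in the text.
    """
    n_cols = len(leader[0])
    # The leader's utility is linear in p on each best-response region,
    # so it suffices to check p = 0, p = 1 and every indifference point.
    candidates = {Fraction(0), Fraction(1)}
    for c in range(n_cols):
        for d in range(c + 1, n_cols):
            slope = ((follower[0][c] - follower[1][c])
                     - (follower[0][d] - follower[1][d]))
            if slope != 0:
                p = Fraction(follower[1][d] - follower[1][c], slope)
                if 0 <= p <= 1:
                    candidates.add(p)

    best = None
    for p in sorted(candidates):
        f_util = [p * follower[0][c] + (1 - p) * follower[1][c]
                  for c in range(n_cols)]
        top = max(f_util)
        for c in range(n_cols):
            if f_util[c] == top:  # column c is a best response at p
                value = p * leader[0][c] + (1 - p) * leader[1][c]
                if best is None or value > best[2]:
                    best = (p, c, value)
    return best


# Game of Figure 1: rows U, D and columns L, R.
leader_payoffs = [[1, 3], [0, 2]]
follower_payoffs = [[1, 0], [0, 1]]
result = optimal_commitment_2xn(leader_payoffs, follower_payoffs)
print(result)  # (Fraction(1, 2), 1, Fraction(5, 2))
```

The result says that the leader plays U with probability 1/2, the follower answers with column index 1 (R), and the leader's utility is 5/2, matching the value of 2.5 in the example.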
Most of the work on computing mixed Stackelberg strategies has focused on games where neither player learns anything about the other's actions until the end of the game, with the exception of player 2 learning player 1's mixed strategy before acting. (An exception is work on computing Stackelberg strategies in extensive-form games (Letchford and Conitzer 2010), on which we will draw later in the paper.) A useful language for describing games that play out over time is that of stochastic games, a generalization of MDPs to multiple players. Computing equilibria of stochastic games presents a number of challenges, but a recent sequence of papers makes significant progress on the problem of computing correlated equilibria that are not necessarily stationary (i.e., the players' actions may depend on the history of play, not just the current state). This line of research replaces the notion of value (the maximum utility) in traditional value iteration with an achievable set (the set of Pareto efficient maximal utilities for each player). Achievable sets represent all possible utility vectors yielded by correlated policies in equilibrium. Murray and Gordon (2007) presented the first exact algorithm for computing these achievable sets. However, the complexity of maintaining these achievable sets increased exponentially with each iteration, leading to an algorithm whose time and space requirements both scaled exponentially. MacDermed et al. (2011) showed how to epsilon-approximate achievable sets efficiently while simultaneously lowering computational complexity, leading to the Quick Polytope Approximation of all Correlated Equilibria algorithm (QPACE).

In this paper, we unite these two lines of work and focus on the problem of computing optimal strategies to commit to in stochastic games. The recent methods for computing correlated equilibria of stochastic games turn out to combine well with recent observations about the relationship between Stackelberg strategies and correlated equilibrium in normal-form games (Conitzer and Korzhyk 2011), although there are some additional subtleties in stochastic games, as we will see. The idea of commitment also combines well with the notion of grim-trigger strategies (strategies that aim to forever minimize another player's utility) in stochastic games, because player 1's commitment power will make it credible for her to play such a strategy.

[Figure 2: Example stochastic game.]

2 Stochastic games

A two-player stochastic game is defined as follows: We have two players, 1 and 2, a set of states T and, in each state t, a set of actions A^i_t for each player i.
For each state t and each action pair in A^1_t × A^2_t, we have an outcome that consists of two elements: the utilities that each of the players will achieve in that round, and information on what state the game will transition to next (possibly stochastically). Finally, we have a discount factor γ which is used to discount the value of future payoffs.

Consider the game in Figure 2. This game has three states: S, C and E. We assume state S is our initial state, meaning that play will begin with the possible actions A^1_S for player 1 and A^2_S for player 2 (throughout this paper play will always begin in the state labeled S). In this state player 1 has only one possible action; thus the outcome depends entirely on player 2. If player 2 chooses to play action L here (which we will denote L_S), then the outcome is (0,1);E. This means that player 1 receives a utility of 0, player 2 receives a utility of 1, and the game transitions to state E, meaning that in the next round the two players will be playing the normal-form game depicted as state E. Alternatively, if player 2 chooses R_S, then both players receive a utility of 0 for this round, and play transitions to state C.

[Figure 3: Example game where signaling must occur early, but not too early.]

3 Commitment and signals

In a two-player Stackelberg game, player 1 may be able to do more than just committing to play a specific mixed strategy; she may also be able to send signals to the other player. This idea has previously been explored for normal-form games (Conitzer and Korzhyk 2011). Specifically, they consider the case where she can commit to drawing from a joint distribution over signals and her own actions, and then send the signal to player 2 before he moves, while playing her drawn action. (Without loss of generality, we can assume that the signal player 1 sends to player 2 is simply the action that he should take.) They show that in a two-player normal-form game, player 1 gains nothing from the ability to commit to such a correlated strategy rather than just a mixed strategy (that does not signal anything to the other player).
The reason is that for each signal, there will be a distribution over player 1's actions conditional on that signal, and player 1 may as well just commit to playing the best of those distributions from her perspective.

It turns out that in two-player stochastic games, signaling becomes more meaningful. First, let us return to the game in Figure 2, and assume γ = 1. If player 1 commits to sending signal L_C and playing U_C with probability (.5 − ε), and to sending signal R_C and playing D_C with probability (.5 + ε), then player 2 can expect a utility of 1 + 2.5ε − ε² for taking the action R_S. Without signaling, if player 1 commits to U_C less than 2/(2 + ε) of the time, player 2 will respond with R_C, which leads to a utility of 0 for player 1. Furthermore, if player 1 commits to U_C at least 2/(2 + ε) of the time, player 2 will always prefer L_S, which again gives a utility of 0 to player 1.

However, being able to signal about one's action at a state only when reaching that state may not be enough in every game. Consider the game pictured in Figure 3. To achieve a positive utility in this game, player 1 needs to signal what she will be playing in state C before player 2 chooses his action in state F but after he acts in state S. Consider what happens if player 1 commits to (.5 + ε)U_C + (.5 − ε)D_C. If she commits to sending the following signal to player 2 in state F: R_F when she will be playing U_C and L_F when she will be playing D_C, it is possible for the two players to achieve a correlated equilibrium of (.5 + ε)(R_F, U_C) + (.5 − ε)L_F. Given this, player 2 would prefer R_S to L_S, as he would achieve an expected utility of 2 + 2ε for R_S. In contrast, if player 1 only sends the signal after the transition to state C, then player 2 will prefer to play R_F (and L_S). On the other hand, the information signaled informs player 2 of what action player 1 will play in state C, and if this information is signaled too early, namely before player 2 makes a choice in S, then he would prefer to choose L_S when the signal is to play L_F. Thus, without the ability to signal this information at this specific time, this correlation would not be possible and player 1 would achieve a utility of 0.

4 Value of correlation and commitment

The game in Figure 3 illustrates a situation where player 1 would be unable to achieve a positive utility without both the ability to commit to a mixed strategy and the ability to correlate her actions with those of player 2. In this game commitment and correlation work synergistically. In this section we show it is not always so: in some games commitment by itself obtains all the value whereas correlation by itself obtains none, and in other games the situation is reversed. Towards this end, we define the following three values. First, we define OptCom as player 1's utility in the optimal commitment strategy¹ that does not use correlation. Second, we define OptCor as player 1's utility in the correlated equilibrium that maximizes payoff for player 1. Finally, we define Opt as player 1's utility in the optimal commitment strategy that uses correlation.
[Figure 4: Game where commitment offers an unbounded advantage over correlation.]

Theorem 1. There exists a stochastic game where OptCom/OptCor = ∞ and OptCom = Opt for any discount factor.

Proof. Consider the game in Figure 4.² This is effectively a one-shot game because of the immediate transition to an absorbing state. To calculate OptCom and OptCor for this game, we can reason as follows. In the normal-form game associated with S, U_S dominates D_S, which allows us to solve this game by iterated strict dominance. Thus the only correlated equilibrium is (U_S, L_S), which gives a utility of ε to player 1. However, if player 1 is able to commit to playing D_S with probability 1, then player 2's best response to this is to play R_S (both with and without correlation). This causes player 1 to receive a utility of 1 − ε. Thus, OptCom/OptCor = (1 − ε)/ε, which tends to ∞ as ε approaches 0, and OptCom = Opt.

¹ We use the standard definition of optimal here, where a commitment strategy is considered to be optimal if it maximizes the utility for player 1.
² We have used the stage game for state S elsewhere to point out that the value of commitment in normal-form games is ∞ (Letchford, Korzhyk, and Conitzer 2012).

[Figure 5: Game where correlation offers an unbounded advantage over commitment.]

Theorem 2. For any γ > 0 there exists a stochastic game where OptCor/OptCom = ∞ and OptCor = Opt.

Proof. Consider the game in Figure 5. To calculate OptCor and OptCom for this game, let us start by finding the optimal Stackelberg strategy without correlation. Let us first consider what is necessary to convince player 2 to play L_S. Even if player 1 commits to U_S, player 2 will best respond by playing R_S (which gives player 1 a utility of 0) unless player 1 commits to D_C with probability at least .5. If player 1 does commit to D_C with probability at least .5, then player 2's best response for C will be L_C, and at best player 1 can expect ε utility from C. Thus, there is no way for player 1 to achieve more than ε by convincing player 2 to best respond with L_S.
As both outcomes where player 2 plays R_S transition to the absorbing state and give player 1 at most ε utility, we can conclude that player 1 will achieve at most ε via commitment without correlation. Next, consider the following correlated equilibrium, which involves: (U_S, L_S), .5(U_C, R_C) + .5(D_C, L_C). This gives a discounted expected value of .5 + .5ε for both players for state C, causing player 1 to prefer U_S to D_S and player 2 to prefer L_S to R_S. Player 1's utility under this equilibrium is .5 + .5ε. Thus, OptCor/OptCom ≥ (.5 + .5ε)/ε, which tends to ∞ as ε goes to 0.

5 Hardness results

In this section we consider the difficulty of solving for an optimal Stackelberg strategy. We consider two main dimensions of the problem: the amount of memory about past states and actions player 1 can base her actions upon, and the ability of player 1 to signal to the follower to enable correlation. For the memory dimension we consider three main cases. In the first case player 1 is constrained to commit to a stationary strategy. In the second case player 1 has some finite memory and can commit to act based upon the states and actions taken in these past timesteps. Finally, in the third case player 1 has an infinite memory, and can commit based on all actions and states that have occurred since the start of the game. For the signaling dimension we consider two cases, both with and without the ability to signal. An overview of our results appears in Figure 6.

             h = 0             0 < h < ∞         h = ∞
Corr.        NP-hard (Th 3)    NP-hard (Th 4)    ?³
No Corr.     NP-hard (Th 3)    NP-hard (Th 4)    NP-hard (Th 5)

Figure 6: Overview of hardness results. h represents the amount of history that player 1 can remember.

Theorem 3. It is NP-hard to solve for the optimal commitment to a stationary strategy in a stochastic game with or without correlation, for any discount factor γ > 0.

Proof. We reduce an arbitrary instance of 3SAT to a stochastic game such that player 1 can obtain utility 1 if and only if the 3SAT instance is satisfiable. The 3SAT instance consists of N variables x_1, ..., x_n and M clauses C_1, ..., C_m. The construction is pictured in Figure 7 and the details are as follows. We start with an initial state S, one state C_i for each clause, one state x_i for each variable and one final absorbing state E. Our initial state S has no payoff for either player, but transitions uniformly at random to a clause state C_i. Additionally, E, the absorbing state, has a single possible outcome, namely (0,0);E.

Clause states: Each clause state is constructed as follows. For each literal we have one row (C_i^{x}) and two columns (C_i^{+x} and C_i^{−x}). If the literal is positive we have two entries in the game: (C_i^{x}, C_i^{+x}) is assigned an outcome of (1/γ, 0);x and (C_i^{x}, C_i^{−x}) is assigned an outcome of (0,1);E.
If the literal is instead negative, we include the following two entries: (C_i^{x}, C_i^{+x}) is assigned an outcome of (1/γ, 0);E and (C_i^{x}, C_i^{−x}) is assigned an outcome of (0,0);x. The other 12 outcomes of the game are (0,0);E. In Figure 7 we show an example for a clause with the literals (x_1 ∨ ¬x_2 ∨ x_3).

Variable states: Each variable state has one column and two rows, +x and −x. The outcomes are as follows: for row +x it is (0, 1/γ);E and for row −x it is (0,0);E.

Proof of equivalence to 3SAT instance: We now show that player 1 can obtain a utility of 1 from this game if and only if there exists a satisfying assignment to the underlying 3SAT instance. Let us start by considering when player 1 can obtain a utility of 1 from a given clause state. We first consider a row that corresponds to a positive literal. In this case, if player 1 has also committed to +x in the corresponding variable state, then the two non-zero outcomes in this row give payoffs of (1/γ, 1) and (0,1). Since, by assumption, the follower breaks ties in player 1's favor, a signal to play C_i^{+x} (this could be either committing to play C_i^{x} or explicitly signaling this to player 2) will lead to a utility of 1 for player 1. If player 1 instead commits to any other strategy in x, then a signal to play C_i^{+x} will instead lead to player 2 deviating to C_i^{−x}, causing player 1 to receive 0 utility. Next, consider a row corresponding to a negative literal. In this case, if player 1 has committed to −x in the corresponding variable state, then the two potentially non-zero outcomes in this row give payoffs of (0,0) and (1/γ, 0). With similar logic as before, a signal to play the column that pays player 1 will lead to a utility of 0 for her unless she has committed to −x. As there are three literals in each clause, this gives player 1 three potential ways to incentivize the follower to play in a way that is beneficial to player 1. However, later commitment (in the variable states) can remove this potential (namely, commitment in such a way as to preserve the potential for the opposing literal).

³ We show how to solve this approximately by a modification of the QPACE algorithm in Section 6.
If all three of these signals lose their potential, player 1 is left with no way to incentivize the follower to play in a way that gives her utility if play reaches this clause. Since the initial state forces a uniform random choice between these clause subgames, this game will have expected value for player 1 of 1 if and only if all of the clause states have a utility of 1.

[Figure 7: Stochastic game used in the hardness reduction of Theorem 3.]

Theorem 4. It is NP-hard to solve for the optimal commitment to a strategy that uses a constant h steps of history with or without correlation and any discount factor γ > 0.

Proof. Consider the following modification to the reduction used in the proof of Theorem 3. For each variable state, we insert h buffer states that give no payoffs before that state. Thus, by the time play reaches the variable state, player 1 will have forgotten which clause play originated from, and the above reduction will again hold.

For the case of infinite history, it is impossible to extend the above 3SAT reduction. Consider the construction from Theorem 3 with h buffer states inserted for each variable. If player 1 has a memory of h + 1, then when choosing a literal to commit to, player 1 can condition this upon the clause the players transitioned from. In this way, player 1 will be able to satisfy both the positive and negative values of each literal. However, we note that stochastic games can model extensive-form games with chance nodes, which allows us to adapt the KNAPSACK reduction from Theorem 5 in (Letchford and Conitzer 2010) with minor changes.

Theorem 5. It is NP-hard to solve for the optimal commitment in a stochastic game even when the strategy is allowed to use infinite history without correlation and any discount factor γ > 0.

Proof. In the KNAPSACK problem, we are given a set of N items, and for each item i, a value p_i and a weight w_i; additionally, we are given a weight limit W. We are asked to find a subset of the items with total weight at most W that maximizes the sum of the p_i in the subset. We reduce an arbitrary KNAPSACK instance to a stochastic game, in such a way that the maximal utility obtainable by player 1 with commitment (whether pure or mixed) is equal to the optimal solution value in the KNAPSACK instance. This game is illustrated in Figure 8, and defined formally below.

Initial state S: The first state contains two possible choices by player 2, who chooses between an outcome of (0, −W);E and an outcome which randomizes uniformly over the item states, defined next.

Item states: Each item i has two states. At the top level (which can be reached directly from S), there is a state I_i^1 where player 2 acts. It has two potential outcomes: one is an outcome of (N p_i/γ, −N w_i/γ);E, the other is a transition to a state I_i^2 where only player 1 has a choice. The latter node also has two outcomes: (0, −N w_i/γ²);E and (0,0);E.

Proof of equivalence to KNAPSACK instance: If for an item i, player 1 commits to playing 100% In at I_i^2, then player 2, breaking ties in 1's favor, will move In at I_i^1, resulting in discounted payoffs of (N p_i, −N w_i) if I_i^1 is reached. Otherwise, player 2 will move Out at I_i^1, and player 1 will get 0 (and player 2 at most 0). Because player 1 wants player 2 to choose K, there is no benefit to player 1 in moving In at I_i^2 with probability strictly between 0 and 100%, since this will only make K less desirable to player 2 without benefiting player 1. Thus, we can assume without loss of optimality that player 1 commits to a pure strategy. Let X be the set of indices of states where player 1 commits to playing In. Then, player 2's expected utility for choosing K is (1/N) Σ_{i∈X} −N w_i = −Σ_{i∈X} w_i. Player 2 will choose K if and only if Σ_{i∈X} w_i ≤ W. Given this, player 1's expected utility is (1/N) Σ_{i∈X} N p_i = Σ_{i∈X} p_i. Hence, finding player 1's optimal strategy to commit to is equivalent to solving the KNAPSACK instance.

6 Empirical Results

While we have shown that the ability to commit can, in the extreme, provide an unbounded increase in utility, these results say little about the value of committing in general.
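The correspondence argued in the proof of Theorem 5 can be sanity-checked on a toy instance. The sketch below is an illustration under stated assumptions, not part of the reduction itself: it takes the discounted payoffs (N p_i, −N w_i) and the outside option (0, −W) as given by the proof, enumerates every pure commitment set X by brute force, and compares the leader's best value with a standard dynamic program for 0/1 KNAPSACK. The function names and the example instance are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def leader_utility(values, weights, limit, chosen):
    """Player 1's utility when she commits to In exactly at the items in `chosen`."""
    n = len(values)
    uniform = Fraction(1, n)  # the initial state picks an item uniformly
    # Player 2's expected utility for the knapsack branch K.
    follower_k = uniform * sum(-n * weights[i] for i in chosen)
    # Player 2 compares K with the outside option worth -limit,
    # breaking ties in player 1's favor.
    if follower_k >= -limit:
        return uniform * sum(n * values[i] for i in chosen)
    return Fraction(0)


def best_commitment(values, weights, limit):
    """Brute-force search over all pure commitment sets X."""
    n = len(values)
    return max(
        leader_utility(values, weights, limit, chosen)
        for size in range(n + 1)
        for chosen in combinations(range(n), size)
    )


def knapsack_optimum(values, weights, limit):
    """Standard 0/1 knapsack dynamic program over integer capacities."""
    table = [0] * (limit + 1)
    for value, weight in zip(values, weights):
        for capacity in range(limit, weight - 1, -1):
            table[capacity] = max(table[capacity],
                                  table[capacity - weight] + value)
    return table[limit]


example_values = [6, 10, 12]
example_weights = [1, 2, 3]
example_limit = 5
print(best_commitment(example_values, example_weights, example_limit))   # 22
print(knapsack_optimum(example_values, example_weights, example_limit))  # 22
```

Both computations agree on this instance, as the proof predicts; the brute-force search is exponential in the number of items and is only meant to make the argument concrete.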
We present the first algorithm which can compute all correlated commitment equilibria of a stochastic game. We use this algorithm to compare a leader's value using correlated commitment to her value using only correlation, and draw conclusions about the conditions under which commitment is most important.

[Figure 8: Stochastic game used in the hardness reduction of Theorem 5.]

6.1 Computing Commitment Equilibria

The QPACE algorithm (MacDermed et al. 2011) efficiently approximates the set of correlated equilibria in stochastic games by iteratively contracting a state's achievable set, by removing policies that violate a player's rationality constraints. Conitzer and Korzhyk (2011) showed that the set of commitment equilibria is equivalent to the set of correlated equilibria without the leader's rationality constraints. Therefore QPACE can be easily modified to approximately compute commitment equilibria by removing the leader's rationality constraints in the achievable set contraction step.

Every iteration of QPACE performs a backup similar to single agent value iteration which improves the current estimation of each state's achievable set V(s). Achievable sets are represented as polytopes with halfspace normals H_j and offsets V(s)_j. The normals H are fixed at initialization and are the same for all achievable sets. A state's backup is broken down into three steps: (1) Calculate the action achievable sets Q(s, a), giving us the set of possible continuation utilities. (2) Construct a set of inequalities that defines the set of equilibria. (3) Approximately project this feasible set into value-vector space by solving a linear program for each hyperplane V(s)_j. Step two is the only step that needs to be changed to compute feasible commitment policies instead of correlated equilibria. We modify equation 6 in MacDermed et al. (2011) to not include rationality constraints for the leader. The resulting set of inequalities defining the set of commitment equilibria is shown in equation 1. This gives us our polytope in ℝ^{(n+1)|A|^n} over variables cu_a and x_a of feasible correlated equilibria:

\[
\begin{aligned}
&\text{For each player } i \text{ who cannot commit, for each pair of distinct actions } \alpha, \beta \in A_i: \\
&\qquad \sum_{a \in A^n} cu_{a^{(\alpha)} i} \;\ge\; \sum_{a \in A^n} x_{a^{(\alpha)}} \left[ gt_{a^{(\beta)} i} + R(s, a^{(\beta)})_i \right] \\
&\qquad \sum_{a \in A^n} x_a = 1 \quad \text{and} \quad \forall a \in A^n,\; x_a \ge 0 \\
&\text{For each joint-action } a \in A^n \text{ and halfspace } j: \\
&\qquad H_j \, cu_a \;\le\; x_a \, Q(s, a)_j
\end{aligned}
\tag{1}
\]

The rest of QPACE remains unchanged. Our modification to QPACE is minor and leaves the strong theoretical properties of the original algorithm intact. Most importantly, the algorithm converges to within ε in polynomial time and returns a set of commitment equilibria which is guaranteed to include all exact equilibria, with additional solutions being no worse than ε-equilibria, where ε is the approximation parameter.

6.2 Experiments on random games

We ran suites of experiments over sets of random games. These random games varied over the number of states, actions, stochasticity (the number of successor states with nonzero transition probability), and discount (γ). Unless otherwise noted, games were run with four joint-actions, five states, two successor states, a γ of 0.9 and an ε approximation error of […]. Results are averaged over 1000 games, which allows our utility results to be accurate within 0.02 with 99% confidence.

A random game is generated using the following procedure: for each state joint-action pair, k successor states are chosen at random. The simplex over these k states represents all possible probability distributions over these states. A transition probability distribution is chosen uniformly at random from this simplex. Each state joint-action pair is also assigned a reward for each player drawn uniformly at random. Finally, these rewards are normalized such that each player's rewards range between 0 and 1.

Our first set of experiments examines the scalability of the algorithm.
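The random-game procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the generator used in the experiments: the function name `random_game`, the integer encoding of states and joint-actions, sampling successors without replacement, and min-max rescaling as the normalization step are all assumptions. Uniform sampling from the simplex is done by normalizing independent exponential draws, a standard construction.

```python
import random


def random_game(n_states=5, n_joint_actions=4, n_successors=2,
                n_players=2, seed=None):
    """Generate transitions and rewards for a random stochastic game.

    Returns (transitions, rewards), both keyed by (state, joint_action):
    transitions[s, a] maps successor states to probabilities and
    rewards[s, a] is a list with one reward per player.
    """
    rng = random.Random(seed)
    transitions = {}
    rewards = {}
    for s in range(n_states):
        for a in range(n_joint_actions):
            # Assumption: the k successors are distinct states.
            successors = rng.sample(range(n_states), n_successors)
            # Normalized i.i.d. exponentials are uniform on the simplex.
            draws = [rng.expovariate(1.0) for _ in successors]
            total = sum(draws)
            transitions[s, a] = {t: d / total
                                 for t, d in zip(successors, draws)}
            rewards[s, a] = [rng.random() for _ in range(n_players)]

    # Rescale each player's rewards so that they span [0, 1].
    for i in range(n_players):
        low = min(r[i] for r in rewards.values())
        high = max(r[i] for r in rewards.values())
        span = (high - low) or 1.0  # guard against a degenerate game
        for r in rewards.values():
            r[i] = (r[i] - low) / span
    return transitions, rewards


transitions, rewards = random_game(seed=0)
print(len(transitions))  # 20 state joint-action pairs with the defaults
```

The defaults mirror the setting quoted in the text (five states, four joint-actions, two successor states); the discount factor and approximation error are parameters of the solver, not of the generated game, so they do not appear here.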
Despite our algorithm having fewer constraints for each linear program than the original QPACE algorithm, we found its running time to be nearly identical to that of the original. Because QPACE starts each linear program at the solution of the previous iteration's linear program, the total number of basis changes over the course of the entire algorithm is relatively small. Thus, having fewer constraints reduces the overall running time by an insignificant amount. Our algorithm appears to scale linearly in the number of states, joint-actions, and 1/ε (the first two of these are shown in Figure 9).

[Figure 9: The running time of our algorithm is linear in the number of states and joint-actions. Axes: wall time (minutes) against states (thousands) and against number of joint-actions.]

Our second set of experiments focuses on determining the importance of commitment vs. equilibrium selection. One of the more powerful aspects of committing is being able to dictate which particular equilibrium of the many possible will be chosen. Without commitment, players are faced with a bargaining problem to determine which equilibrium will be chosen. This may result in significantly less utility for a potential leader. It is important to differentiate between the benefit of being able to commit to sub-rational policies and the benefit of equilibrium selection. Towards this end we compute both the Kalai-Smorodinsky bargaining solution (Kalai and Smorodinsky 1975), which favors equal gains to all parties, and the optimal selfish equilibrium for the potential leader with and without commitment (Figure 10).

[Figure 10: The average utility of being able to select a correlated or commitment equilibrium selfishly as opposed to the leader's Kalai-Smorodinsky bargaining solution or the follower's utility when the leader selected a commitment equilibrium. Series: leader's best commitment equilibrium, leader's best correlated equilibrium, bargaining correlated equilibrium, follower's commitment equilibrium. Vertical axis: leader utility (scaled).]

The results show that as γ increases, the importance of committing, over and above just being able to select the equilibrium, decreases.
This relationship is caused by threats becoming more powerful as the horizon increases. Strong threats act as a binding mechanism, permitting a wider array of possible equilibria by allowing players to punish each other for deviations without the need of someone violating her rationality constraint (as per folk theorems). On the other hand, the benefit of equilibrium selection remains important regardless of the discount factor.

The effect of equilibrium selection on the follower is even more dramatic than it is for the leader (Figure 10). With a small discount factor, the set of possible equilibria is small and thus likely to provide both players with similar utilities, even when the leader selects selfishly. As γ increases, a leader has more options for forcing the follower down a path more preferable to the leader at the expense of the follower.

Our third set of experiments examines how the number of actions affects the value of committing. We tested random games with a γ of both 0.0 and 0.4 across varying numbers of actions per player (Figure 11). We observe that as the number of actions increases, the relative commitment gain decreases slightly, because additional actions increase the probability that a correlated equilibrium without commitment will approach the unrestricted optimum, leaving less room for improvement by committing. While more actions decrease the importance of committing, the importance of being able to select the equilibrium (as opposed to having to bargain) remains high. This effect is stronger for larger values of γ. For games with very high discount factors, increasing the number of actions tends to increase the total value of the game, but not the relative importance of committing.

[Figure 11: The best utility achievable by the leader using either correlated or commitment equilibria compared to the unrestricted optimum. Results are shown for γ = 0.0 and 0.4. Series: optimum, commitment, correlated. Horizontal axis: actions per player. Vertical axis: leader utility (scaled).]

7 Conclusion

In this paper we showed how to adapt the QPACE algorithm to approximately solve for the optimal commitment solution in stochastic games. Additionally, we showed that removing the ability to correlate or the ability to use the full history of the game causes solving for the optimal commitment strategy to become NP-hard. Finally, we studied the value that both commitment and correlation provide to the leader, both theoretically and experimentally.

8 Acknowledgements

The authors would like to thank Dmytro Korzhyk for helpful discussions.
Letchford and Conitzer gratefully acknowledge NSF Awards IIS-…, IIS-…, and CCF-…, as well as ARO MURI Grant W911NF-… and an Alfred P. Sloan fellowship, for support. MacDermed and Isbell gratefully acknowledge NSF Grant IIS-… for support.

References

Chen, X., and Deng, X. 2006. Settling the complexity of two-player Nash equilibrium. In FOCS.
Conitzer, V., and Korzhyk, D. 2011. Commitment to correlated strategies. In AAAI.
Conitzer, V., and Sandholm, T. 2006. Computing the optimal strategy to commit to. In EC.
Conitzer, V., and Sandholm, T. 2008. New complexity results about Nash equilibria. Games and Economic Behavior 63(2).
Daskalakis, C.; Goldberg, P.; and Papadimitriou, C. H. 2006. The complexity of computing a Nash equilibrium. In STOC.
Gilboa, I., and Zemel, E. 1989. Nash and correlated equilibria: Some complexity considerations. Games and Economic Behavior 1.
Jain, M.; Pita, J.; Tambe, M.; Ordóñez, F.; Paruchuri, P.; and Kraus, S. 2008. Bayesian Stackelberg games and their application for security at Los Angeles International Airport. SIGecom Exch. 7(2):1–3.
Kalai, E., and Smorodinsky, M. 1975. Other solutions to Nash's bargaining problem. Econometrica 43(3).
Letchford, J., and Conitzer, V. 2010. Computing optimal strategies to commit to in extensive-form games. In EC.
Letchford, J.; Korzhyk, D.; and Conitzer, V. 2012. On the value of commitment. Working Paper.
MacDermed, L.; Narayan, K. S.; Isbell, C. L.; and Weiss, L. 2011. Quick polytope approximation of all correlated equilibria in stochastic games. In AAAI.
Murray, C., and Gordon, G. 2007. Finding correlated equilibria in general sum stochastic games. Technical Report CMU-ML-…, Carnegie Mellon University.
Paruchuri, P.; Pearce, J. P.; Marecki, J.; Tambe, M.; Ordóñez, F.; and Kraus, S. 2008. Playing games for security: An efficient exact algorithm for solving Bayesian Stackelberg games. In AAMAS.
Pita, J.; Jain, M.; Ordóñez, F.; Portway, C.; Tambe, M.; and Western, C. 2009. Using game theory for Los Angeles airport security. AI Magazine 30(1).
Shieh, E.; An, B.; Yang, R.; Tambe, M.; Baldwin, C.; DiRenzo, J.; Maule, B.; and Meyer, G. 2012. PROTECT: A deployed game theoretic system to protect the ports of the United States. In AAMAS.
Tsai, J.; Rathi, S.; Kiekintveld, C.; Ordóñez, F.; and Tambe, M. 2009. IRIS - a tool for strategic security allocation in transportation networks. In AAMAS - Industry Track.
von Stengel, B., and Zamir, S. 2010. Leadership games with convex strategy sets. Games and Economic Behavior 69.
