Safe Opponent Exploitation


SAM GANZFRIED and TUOMAS SANDHOLM, Carnegie Mellon University

We consider the problem of playing a repeated two-player zero-sum game safely: that is, guaranteeing at least the value of the game per period in expectation regardless of the strategy used by the opponent. Playing a stage-game equilibrium strategy at each time step clearly guarantees safety, and prior work has (incorrectly) stated that it is impossible to simultaneously deviate from a stage-game equilibrium (in hope of exploiting a suboptimal opponent) and to guarantee safety. We show that such profitable deviations are indeed possible specifically in games where certain types of "gift" strategies exist, which we define formally. We show that the set of strategies constituting such gifts can be strictly larger than the set of iteratively weakly-dominated strategies; this disproves another recent assertion which states that all noniteratively weakly dominated strategies are best responses to each equilibrium strategy of the other player. We present a full characterization of safe strategies, and develop efficient algorithms for exploiting suboptimal opponents while guaranteeing safety. We also provide analogous results for extensive-form games of perfect and imperfect information, and present safe exploitation algorithms and full characterizations of safe strategies for those settings as well. We present experimental results in Kuhn poker, a canonical test problem for game-theoretic algorithms. Our experiments show that (1) aggressive safe exploitation strategies significantly outperform adjusting the exploitation within stage-game equilibrium strategies only and (2) all the safe exploitation strategies significantly outperform a (nonsafe) best response strategy against strong dynamic opponents.
Categories and Subject Descriptors: I.2.11 [Artificial Intelligence]: Distributed Artificial Intelligence: Multiagent Systems; J.4 [Social and Behavioral Sciences]: Economics

General Terms: Algorithms, Economics, Theory

Additional Key Words and Phrases: Game theory, opponent exploitation, multiagent learning

ACM Reference Format: Ganzfried, S. and Sandholm, T. Safe opponent exploitation. ACM Trans. Econ. Comp. 3, 2, Article 8 (April 2015), 28 pages. DOI:

A shorter early version of this article appeared in Proceedings of the ACM Conference on Electronic Commerce (EC '12). This material is based on work supported by the National Science Foundation under grant IIS [...], as well as XSEDE computing resources provided by the Pittsburgh Supercomputing Center. Authors' addresses: S. Ganzfried and T. Sandholm, Computer Science Department, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213; email: sganzfr@cs.smu.edu. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2015 ACM /2015/04-ART8 $15.00 DOI:

1. INTRODUCTION

In repeated interactions against an opponent, an agent must determine how to balance between exploitation (maximally taking advantage of weak opponents) and exploitability (making sure that he himself does not perform too poorly against strong opponents). In two-player zero-sum games, an agent can play a minimax strategy, which guarantees at least the value of the game in expectation against any opponent. However, doing so could potentially forgo significant profits against suboptimal opponents. Thus, an equilibrium strategy has low (zero) exploitability, but achieves low exploitation. On the other end of the spectrum, agents could attempt to learn the opponent's strategy and maximally exploit it; however, doing so runs the risk of being exploited in turn by a deceptive opponent. This is known as the "get taught and exploited problem" [Sandholm 2007]. Such deception is common in games such as poker; for example, a player may play very aggressively initially, then suddenly switch to a more conservative strategy to capitalize on the fact that the opponent tries to take advantage of his aggressive "image," which he now leaves behind. Thus, pure opponent exploitation potentially leads to a high level of exploitation, but at the expense of exploitability. Respectively, the game-solving community has, by and large, taken two radically different approaches: finding game-theoretic solutions and opponent exploitation.

In this article, we are interested in answering a fundamental question that helps shed some light on this trade-off. Is it possible to exploit the opponent more than any equilibrium strategy of a stage game would, while simultaneously guaranteeing at least the value of the full game in expectation in the worst case? If the answer is no, then fully safe exploitation is not possible, and we must be willing to accept some increase in worst-case exploitability if we wish to deviate from equilibrium in order to exploit suboptimal opponents. However, if the answer is yes, then safe opponent exploitation would indeed be possible.

Recently it was stated that safe opponent exploitation is not possible [Ganzfried and Sandholm 2011]. The intuition for that argument was that the opponent could have been playing an equilibrium all along, and when we deviate from equilibrium to attempt to exploit him, then we run the risk of being exploitable ourselves. However, that argument is incorrect. It does not take into account the fact that our opponent may give us a gift by playing an identifiably suboptimal strategy, such as one that is strictly dominated.¹ If such gift strategies are present in a game, then it turns out that safe exploitation can be achieved; specifically, we can deviate from equilibrium to exploit the opponent provided that our worst-case exploitability remains below the total amount of profit won through gifts (in expectation).

¹ We thank Vince Conitzer for pointing this out to us.

Is it possible to obtain such gifts that do not correspond to strictly dominated strategies? What about other forms of dominance, such as weak, iterated, and dominance by mixed strategies? Recently it was claimed that all noniteratively weakly dominated strategies are best responses to each equilibrium strategy of the other player [Waugh 2009]. This would suggest that such undominated strategies cannot be gifts, and that gift strategies must therefore be dominated according to some form of dominance. We disprove this claim and present a game in which a noniteratively weakly dominated strategy is not a best response to an equilibrium strategy of the other player. Safe exploitation is possible in the game by taking advantage of that particular strategy. We define a formal notion of gifts, which is more general than iteratively weakly dominated strategies, and show that safe opponent exploitation is possible specifically in games in which such gifts exist.

Next, we provide a full characterization of the set of safe exploitation strategies, and we present several efficient algorithms for converting any opponent exploitation architecture (that is arbitrarily exploitable) into a fully safe opponent exploitation procedure. One of our algorithms is similar to a procedure that guarantees safety in the limit as the number of iterations goes to infinity [McCracken and Bowling 2004]; however, the algorithms in that paper can be arbitrarily exploitable in the finitely repeated game setting, which is what we are interested in. The main idea of our algorithm is to play an ε-safe best response (a best response subject to the constraint of having exploitability at most ε) at each time step rather than a full best response, where ε is determined by the total amount of gifts obtained thus far from the opponent. Safe best responses have also been studied in the context of Texas Hold'em poker [Johanson et al. 2007], though that work did not use them for online opponent exploitation. We also present several other safe algorithms which alternate between playing an equilibrium and a best response depending on how much has been won so far in expectation. Algorithms have been developed which guarantee ε-safety against specific classes of opponents (stationary opponents and opponents with bounded memory) [Powers et al. 2007]; by contrast, our algorithms achieve full safety against all opponents.

It turns out that safe opponent exploitation is also possible in extensive-form games, though we must redefine what strategies constitute gifts and must make pessimistic assumptions about the opponent's play in game states off the path of play. We present efficient algorithms for safe exploitation in games of both perfect and imperfect information, and fully characterize the space of safe strategies in these game models. We also show when safe exploitation can be performed in the middle of a single iteration of an extensive-form game. This may be useful when a mistake is observed early on.

We compare our algorithms experimentally on Kuhn poker [Kuhn 1950], a simplified form of poker which is a canonical problem for testing game-solving algorithms and has been used as a test problem for opponent-exploitation algorithms [Hoehn et al. 2005]. We observe that our algorithms obtain a significant improvement over the best equilibrium strategy, while also guaranteeing safety in the worst case. Thus, in addition to providing theoretical advantages over both minimax and fully exploitative strategies, safe opponent exploitation can be effective in practice.

The rest of the article is organized as follows.
In Section 2, we describe several alternative uses of the approach, and its applicability to more general game classes, such as infinitely repeated, general-sum, and multiplayer games. In Section 3, we present game theory background. In Section 4, we define safety and present an example of a game where safe exploitation is not possible, as well as a game where it is possible. In Section 5, we give a full characterization of when safe exploitation is possible, which turns out to coincide with games for which a gift strategy (which we define) exists for the opponent. In Section 6, we present several new algorithms for safely exploiting opponents, and show that prior algorithms are either unsafe or unexploitative. In Sections 7 and 8, we provide a full characterization of safe strategies in strategic-form and extensive-form games (of both perfect and imperfect information), respectively. In Section 9, we present experiments in an extensive-form game of imperfect information that demonstrate that our algorithms safely exploit suboptimal opponents significantly more than repeatedly playing the best stage-game Nash equilibrium. Finally, we conclude and present future research directions in Section 10.

2. USES, APPLICABILITY, AND GENERALITY OF THE APPROACH

In this section we suggest two alternative uses of the approach, as well as discuss its applicability and generality.

2.1. Two Alternative Uses of the Methodology

We can view safe exploitation as a meta-algorithm that enforces the safety of any opponent exploitation procedure by ensuring that it does not risk too much at any point. An opponent exploitation architecture consists of two components: 1) an opponent modeling algorithm, which takes as input the observations of both players' actions (to the extent that they are observable) and constructs a model of the opponent's strategy, and 2) a strategy selection algorithm, which takes the opponent model and the observations as input and outputs an exploitative strategy. This strategy may not be safe in general. The first way to use our safe exploitation methodology is to obtain safety by curtailing the strategies that the architecture may propose. This is depicted in Figure 1.

Fig. 1. Our safe exploitation methodology used as a meta-algorithm which makes any opponent exploitation architecture safe. An opponent exploitation architecture consists of two components: an opponent modeling algorithm and a strategy selection algorithm.

The second way to use the methodology is to view our safe exploitation algorithms as alternatives to standard exploitation algorithms within the opponent exploitation paradigm. Our safe algorithms still work with any opponent modeling algorithm to construct an opponent model, but replace a potentially unsafe strategy selection algorithm with a new algorithm that guarantees safety. This is depicted in Figure 2.

Fig. 2. Our safe exploitation methodology used to replace the strategy selection component while retaining the opponent modeling component of any opponent exploitation architecture.

2.2. Bounds Suffice for Using the Methodology

We expect our algorithms to be useful in practice in many real-world domains, for example, in (cyber)security games. It has been observed that human adversaries in such domains often behave irrationally, and there can be significant benefits to exploiting their mistakes [Blythe et al. 2011; Pita et al. 2010, 2012]. However, the cost of making a mistake ourselves is extremely high in such domains, for example, since human lives could be at stake. Algorithms that can exploit irrational opponents while still guaranteeing safety would be very desirable. Furthermore, perhaps the main criticism of security games to date is that the numeric payoffs for the attacker and defender are questionable. Our approach does not require an exact model of the game. We only need a lower bound on the gifts (mistakes) that the opponent has given us and an upper bound on the loss from our exploitation. This would be especially useful in security games, since it guarantees robustness even when the game models are not accurate. (Another advantage is that our approach applies also to multi-step games, which are a richer, more powerful framework than the security game models used to date: Stackelberg games where the defender moves once and then the attacker moves once.)

2.3. The Methodology Also Applies to Infinitely Repeated, General-Sum, and Multiplayer Games

Our methodology also applies straightforwardly to two-player zero-sum infinitely repeated games. While some of our algorithms specifically depend on the finite time horizon and will not extend to the infinite setting, several of them do not, and will apply straightforwardly. In particular, the algorithm that is most aggressive (among safe algorithms) and performed best in the experiments does not rely on a finite horizon.

For general-sum and multiplayer games, our methodology applies straightforwardly if we replace the minimax value with the maximin value (i.e., maximizing our expected payoff minimized over the others' strategies) in our algorithms. In two-player zero-sum games, these two values coincide, and any equilibrium strategy guarantees at least this value in expectation in the worst case.
In general-sum and multiplayer games, these properties do not hold; however, in many settings it could be very desirable to exploit opponents' mistakes while still guaranteeing the maximin value. For example, this could be extremely useful in security domains, which are often modeled as nonzero-sum games [Korzhyk et al. 2011], since safety is of high importance.

2.4. Safe Exploitation Can Be Viewed as Selection Among Equilibria of the Repeated Game

As we discuss in Section 4, in repeated games, the set of safe strategies is the same as the set of maximin strategies in the repeated game (and therefore, the set of Nash equilibria in the case where the repeated game is a two-player zero-sum game). Thus, one can view our safe exploitation algorithms as procedures for selecting among equilibria of the repeated game. In the context of nonrepeated games, our work can be viewed as equilibrium selection in the nonrepeated game. However, in both repeated and nonrepeated games, as we will discuss in Section 8.3, our equilibrium refinement differs from subgame perfection [Selten 1965], and thus also from all the usual equilibrium refinements, which are further refinements of subgame perfection.

3. GAME THEORY BACKGROUND

In this section, we briefly review relevant definitions and prior results from game theory and game solving.

3.1. Strategic-Form Games

The most basic game representation, and the standard representation for simultaneous-move games, is the strategic form. A strategic-form game (aka matrix game aka normal-form game) consists of a finite set of players $N$, a space of pure strategies $S_i$ for each player, and a utility function $u_i : \times_i S_i \to \mathbb{R}$ for each player. Here $\times_i S_i$ denotes the space of strategy profiles: vectors of pure strategies, one for each player.

The set of mixed strategies of player $i$ is the space of probability distributions over his pure strategy space $S_i$. We will denote this space by $\Sigma_i$. Define the support of a mixed strategy to be the set of pure strategies played with nonzero probability. If the sum of the payoffs of all players equals zero at every strategy profile, then the game is called zero sum. In this article, we will be primarily concerned with two-player zero-sum games. If the players are following strategy profile $\sigma$, we let $\sigma_{-i}$ denote the strategy taken by player $i$'s opponent, and we let $\Sigma_{-i}$ denote the opponent's entire mixed strategy space. Two-player zero-sum strategic-form games are often represented as a matrix, where the element in row $m$, column $n$ corresponds to player 1's payoff when he plays his $m$-th pure strategy and player 2 plays his $n$-th pure strategy.

3.2. Extensive-Form Games

An extensive-form game is a general model of multiagent decision making with potentially sequential and simultaneous actions and imperfect information. As with perfect-information games, extensive-form games consist primarily of a game tree; each non-terminal node has an associated player (possibly chance) that makes the decision at that node, and each terminal node has associated utilities for the players. Additionally, game states are partitioned into information sets, where the player whose turn it is to move cannot distinguish among the states in the same information set. Therefore, in any given information set, a player must choose actions with the same distribution at each state contained in the information set. If no player forgets information that he previously knew, we say that the game has perfect recall. A (behavioral) strategy for player $i$, $\sigma_i \in \Sigma_i$, is a function that assigns a probability distribution over all actions at each information set belonging to $i$.

3.3. Nash Equilibria

Player $i$'s best response to $\sigma_{-i}$ is any strategy in $\arg\max_{\sigma'_i \in \Sigma_i} u_i(\sigma'_i, \sigma_{-i})$. A Nash equilibrium is a strategy profile $\sigma$ such that $\sigma_i$ is a best response to $\sigma_{-i}$ for all $i$.
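The definitions above can be made concrete on a small matrix game. The following is a minimal sketch, assuming the standard Rock-Paper-Scissors payoffs of +1 for a win, -1 for a loss, and 0 for a tie (consistent with the payoffs quoted later in the text); the function names are illustrative and do not come from the article. It checks the Nash condition by comparing the profile's payoff with each player's best pure response, which suffices because a best response always exists among the pure strategies.

```python
from fractions import Fraction

# Row player's payoffs; rows and columns are ordered Rock, Paper, Scissors.
# Assumed standard +1 / -1 / 0 payoffs.
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]

def expected_payoff(matrix, row_mix, col_mix):
    """u_1(sigma_1, sigma_2) for mixed strategies given as probability lists."""
    return sum(p * q * matrix[r][c]
               for r, p in enumerate(row_mix)
               for c, q in enumerate(col_mix))

def best_row_value(matrix, col_mix):
    """Row player's payoff from his best pure response to col_mix."""
    return max(sum(q * row[c] for c, q in enumerate(col_mix)) for row in matrix)

def best_col_value(matrix, row_mix):
    """Row player's payoff when the column player best-responds (minimizes it)."""
    return min(sum(p * matrix[r][c] for r, p in enumerate(row_mix))
               for c in range(len(matrix[0])))

def is_nash(matrix, row_mix, col_mix):
    """True if neither player can gain by deviating unilaterally."""
    u = expected_payoff(matrix, row_mix, col_mix)
    return best_row_value(matrix, col_mix) == u == best_col_value(matrix, row_mix)

UNIFORM = [Fraction(1, 3)] * 3
ROCK = [Fraction(1), Fraction(0), Fraction(0)]

print(is_nash(RPS, UNIFORM, UNIFORM))  # True
print(is_nash(RPS, ROCK, UNIFORM))     # False: the column player would switch to Paper
```

Exact rational arithmetic (`Fraction`) is used so that the equality tests are not affected by floating-point rounding.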
An ε-equilibrium is a strategy profile in which each player achieves a payoff of within ε of his best response.

In two-player zero-sum games, we have the following result, which is known as the minimax theorem [von Neumann 1928; Osborne and Rubinstein 1994]:

$$v^* = \max_{\sigma_1 \in \Sigma_1} \min_{\sigma_2 \in \Sigma_2} u_1(\sigma_1, \sigma_2) = \min_{\sigma_2 \in \Sigma_2} \max_{\sigma_1 \in \Sigma_1} u_1(\sigma_1, \sigma_2).$$

We refer to $v^*$ as the value of the game to player 1. Sometimes we will write $v_i$ as the value of the game to player $i$. Any equilibrium strategy for a player will guarantee an expected payoff of at least the value of the game to that player.

Define the exploitability of $\sigma_i$ to be the difference between the value of the game and the performance of $\sigma_i$ against its nemesis, formally:

$$\mathrm{expl}(\sigma_i) = v_i - \min_{\sigma_{-i} \in \Sigma_{-i}} u_i(\sigma_i, \sigma_{-i}).$$

Since there always exists a nemesis that is a pure strategy, this expression is equal to $v_i - \min_{s_{-i} \in S_{-i}} u_i(\sigma_i, s_{-i})$. For any $\epsilon \geq 0$, define SAFE($\epsilon$) to be the set of strategies with exploitability at most $\epsilon$. The set SAFE($\epsilon$) is defined by linear constraints: $\sigma_i \in$ SAFE($\epsilon$) if and only if $u_i(\sigma_i, s_{-i}) \geq v_i - \epsilon$ for all $s_{-i} \in S_{-i}$. Define an ε-safe best response of player $i$ to $\sigma_{-i}$ to be any strategy in

$$\arg\max_{\sigma_i \in \mathrm{SAFE}(\epsilon)} u_i(\sigma_i, \sigma_{-i}).$$
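Exploitability and membership in SAFE($\epsilon$) follow directly from the pure-strategy form of the definition. The sketch below again assumes the standard +1/-1/0 Rock-Paper-Scissors payoffs and uses illustrative function names that are not from the article. It does not compute an ε-safe best response, which requires solving a linear program; it only evaluates the constraints, and shows that mixing a pure strategy with the equilibrium scales its exploitability down.

```python
from fractions import Fraction

# Row player's payoffs; rows and columns are ordered Rock, Paper, Scissors.
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]
VALUE = Fraction(0)  # value of RPS to the row player

def exploitability(matrix, row_mix, value):
    """expl(sigma_i) = v_i - min over opponent pure strategies of u_i(sigma_i, s)."""
    worst = min(sum(p * matrix[r][c] for r, p in enumerate(row_mix))
                for c in range(len(matrix[0])))
    return value - worst

def in_safe(matrix, row_mix, value, eps):
    """Membership test for SAFE(eps): exploitability at most eps."""
    return exploitability(matrix, row_mix, value) <= eps

def mix(first, second, weight_on_first):
    """Convex combination of two mixed strategies."""
    w = weight_on_first
    return [w * a + (1 - w) * b for a, b in zip(first, second)]

UNIFORM = [Fraction(1, 3)] * 3
PAPER = [Fraction(0), Fraction(1), Fraction(0)]
HALF_PAPER = mix(PAPER, UNIFORM, Fraction(1, 2))

print(exploitability(RPS, UNIFORM, VALUE))     # 0
print(exploitability(RPS, PAPER, VALUE))       # 1
print(exploitability(RPS, HALF_PAPER, VALUE))  # 1/2
print(in_safe(RPS, HALF_PAPER, VALUE, Fraction(1, 2)))  # True
```

Because the worst-case payoff is a concave function of the mixed strategy, putting weight $w$ on a strategy with exploitability $e$ and the rest on an equilibrium strategy yields exploitability at most $w \cdot e$, so any positive risk budget admits some deviation from equilibrium.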

Fig. 3. Payoff matrix of Rock-Paper-Scissors.

All finite games have at least one Nash equilibrium [Nash 1951]. In two-player zero-sum strategic-form games, a Nash equilibrium can be found efficiently by linear programming. In the case of zero-sum extensive-form games with perfect recall, there are efficient techniques for finding an equilibrium, such as linear programming [Koller et al. 1994]. An ε-equilibrium can be found in even larger games via algorithms such as generalizations of the excessive gap technique [Hoda et al. 2010] and counterfactual regret minimization [Zinkevich et al. 2007]. The latter two algorithms scale to games with approximately [...] game tree states, while the most scalable current general-purpose linear programming technique (CPLEX's barrier method) scales to games with around $10^8$ states. By contrast, full best responses can be computed in time linear in the size of the game tree, while the best known techniques for computing ε-safe best responses have running times roughly similar to an equilibrium computation [Johanson et al. 2007].

3.4. Repeated Games

In repeated games, the stage game is repeated for a finite number $T$ of iterations. At each iteration, players can condition their strategies on everything that has been observed so far. In strategic-form games, this generally includes the full mixed strategy of the agent in all previous iterations, as well as all actions of the opponent (though not his full strategy). In extensive-form games, generally only the actions of the opponent along the path of play are observed; in games with imperfect information, the opponent's private information may also be observed in some situations.

4. SAFETY

One desirable property of a strategy for a repeated game is that it is safe.

Definition 4.1. A safe strategy for a repeated game is a strategy that guarantees a worst-case payoff of at least $v_i$ per period in expectation.

The set of safe strategies is the same as the set of minimax strategies in the full repeated game.
Clearly playing a (stage-game) minimax strategy at each iteration is safe, since it guarantees at least $v_i$ in each iteration. However, a minimax strategy may fail to maximally exploit a suboptimal opponent. On the other hand, deviating from stage-game equilibrium in an attempt to exploit a suboptimal opponent could lose the guarantee of safety and may result in an expected payoff below the value of the game against a deceptive opponent (or if the opponent model is incorrect). Thus, a natural question to consider is whether there exist strategies that are safe, yet deviate from stage-game equilibrium strategies (in order to exploit an opponent's mistakes).

4.1. A Game in Which Safe Exploitation Is Not Possible

Consider the classic game of Rock-Paper-Scissors (RPS), whose payoff matrix is depicted in Figure 3. The unique equilibrium $\sigma^*$ is for each player to randomize equally among all three pure strategies. Now suppose that our opponent has played Rock in each of the first 10 iterations (while we have played according to $\sigma^*$). We may be tempted to try to exploit him by playing the pure strategy Paper at the 11th iteration. However, this would not be safe; it is possible that he has in fact been playing his equilibrium strategy all along, and that he just played Rock each time by chance (this will happen with probability $\frac{1}{3^{10}}$). It is also possible that he will play Scissors in the next round (perhaps to exploit the fact that he thinks we are more likely to play Paper having observed his actions). Against such a strategy, we would actually have a negative expected total profit: 0 in the first 10 rounds and -1 in the 11th. Thus, our strategy would not be safe. By similar reasoning, it is easy to see that any deviation from $\sigma^*$ will not be safe, and that safe exploitation is not possible in RPS.

4.2. A Game in Which Safe Exploitation Is Possible

Now consider a variant of RPS in which player 2 has an additional pure strategy T. If he plays T, then we get a payoff of 4 if we play R, and 3 if we play P or S. The payoff matrix of this new game RPST is given in Figure 4.

Fig. 4. Payoff matrix of RPST.

Clearly the unique equilibrium is still for both players to randomize equally between R, P, and S. Now suppose we play our equilibrium strategy in the first game iteration, and the opponent plays T; no matter what action we played, we receive a payoff of at least 3. Suppose we play the pure strategy R in the second round in an attempt to exploit him (since R is our best response to T). In the worst case, our opponent will exploit us in the second round by playing P, and we will obtain payoff -1. But combined over both time steps, our payoff will be positive no matter what the opponent does at the second iteration. Thus, our strategy constituted a safe deviation from equilibrium. This was possible because of the existence of a "gift" strategy for the opponent; no such gift strategy is present in standard RPS.

5. CHARACTERIZING GIFTS

What exactly constitutes a gift? Does it have to be a strictly dominated pure strategy, like T in the preceding example? What about weakly dominated strategies? What about iterated dominance, or dominated mixed strategies?
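The arithmetic behind the two examples of Section 4 can be checked directly. This sketch assumes the standard +1/-1/0 payoffs for RPS and builds the RPST matrix from the description in the text (column T pays us 4 against R and 3 against P or S); the helper names are illustrative. It compresses the RPS story to two rounds: one equilibrium round, then one deviation.

```python
from fractions import Fraction

# Row player's payoffs; rows are R, P, S and columns are R, P, S (and T for RPST).
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]
RPST = [[0, -1, 1, 4],
        [1, 0, -1, 3],
        [-1, 1, 0, 3]]
UNIFORM = [Fraction(1, 3)] * 3
R, P, S, T = 0, 1, 2, 3

def equilibrium_payoff_against(matrix, col):
    """Our expected payoff when we play uniformly and the opponent plays column col."""
    return sum(p * matrix[r][col] for r, p in enumerate(UNIFORM))

def worst_case(matrix, row):
    """Our payoff when we commit to pure strategy row and the opponent punishes it."""
    return min(matrix[row])

def two_round_guarantee(matrix, observed_col, deviation_row):
    """Round 1: equilibrium against the observed action. Round 2: worst case of the deviation."""
    return equilibrium_payoff_against(matrix, observed_col) + worst_case(matrix, deviation_row)

# RPS: the opponent shows Rock, we switch to Paper and can be punished by Scissors.
print(two_round_guarantee(RPS, R, P))    # -1, below the two-round value of 0
# RPST: the opponent shows T, we switch to Rock and can be punished by Paper.
print(two_round_guarantee(RPST, T, R))   # 7/3, above the two-round value of 0
```

In RPST the realized first-round payoff is at least 3 whichever action our equilibrium strategy selects, so even the realized two-round total is at least 3 - 1 = 2.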
In this section we first provide some negative results which show that several natural candidate definitions of gift strategies are not appropriate. Then we provide a formal definition of gifts and show that safe exploitation is possible if and only if such gift strategies exist.

Recent work has asserted the following.²

Assertion 1 [Waugh 2009]. An equilibrium strategy makes an opponent indifferent to all non-[weakly]-iteratively-dominated strategies. That is, to tie an equilibrium strategy in expectation, all one must do is play a non-[weakly]-iteratively-dominated strategy.

² This is made as a statement of fact in prior work [Waugh 2009], and not in the form of an assertion.

This assertion would seem to imply that gifts correspond to strategies that put weight on pure strategies that are weakly iteratively dominated. However, consider the game shown in Figure 5.

Fig. 5. A game with a gift strategy that is not weakly iteratively dominated.

It can easily be shown that this game has a unique equilibrium, in which P1 plays U and D with probability $\frac{1}{2}$, and P2 plays L and M with probability $\frac{1}{2}$. The value of the game to player 1 is 2.5. If player 1 plays his equilibrium strategy and player 2 plays R, player 1 gets expected payoff of 5, which exceeds his equilibrium payoff; thus R constitutes a gift, and player 1 can safely deviate from equilibrium to try to exploit him. But R is not dominated under any form of dominance. This disproves the assertion, and causes us to rethink our notion of gifts.

PROPOSITION 5.1. It is possible for a strategy that survives iterated weak dominance to obtain expected payoff worse than the value of the game against an equilibrium strategy.

We might now be tempted to define a gift as a strategy that is not in the support of any equilibrium strategy. However, the game in Figure 6 shows that it is possible for a strategy to not be in the support of an equilibrium and also not be a gift (since if P1 plays his only equilibrium strategy U, he obtains 0 against R, which is the value of the game).

Now that we have ruled out several candidate definitions of gift strategies, we present our new definition, which we relate formally to safe exploitation in Proposition 5.3.

Definition 5.2. A strategy $\sigma_{-i}$ is a gift strategy if there exists an equilibrium strategy $\sigma^*_i$ for the other player such that $\sigma_{-i}$ is not a best response to $\sigma^*_i$.³

³ This definition of gift strategies coincides with the strategies for the opponent specified by the third step of a procedure for selecting a particular equilibrium of a (one-shot) two-player zero-sum game, known as Dresher's procedure [Dresher 1961; van Damme 1987]. The procedure assumes the opponent will make a mistake (i.e., by playing a gift strategy), then selects a strategy that maximizes the minimum gain resulting from a possible mistake of the opponent. It has been shown that the strategies selected by this procedure coincide with the proper equilibria of the game [van Damme 1987], an equilibrium refinement concept defined by Myerson [1978]. Thus, proper equilibrium strategies exploit all gift strategies, and one could equivalently define gift strategies as strategies that are not a best response to a proper equilibrium strategy of the opponent. One could view proper equilibria, as well as some other equilibrium refinement concepts (e.g., trembling-hand perfect equilibrium), as approaches for exploiting mistakes of the opponent in (nonrepeated) games, although they are typically thought of as means to prescribe action probabilities for information sets that are reached with zero probability in equilibrium. In contrast, our main focus is on repeated games, although our techniques apply to single-shot games as well. Furthermore, we will show in Section 8.3 that even in single-shot games, our safe exploitative strategies differ from the strategies prescribed by subgame perfection [Selten 1965], and thus our approach differs from all prior refinements that are further refinements of subgame perfection. So, our work can be viewed as providing novel equilibrium selection concepts and procedures. In broad strokes, at every point in the game, prior refinements try to play as well as possible against an (almost) rational opponent (e.g., one who trembles with small probability), while ours exploits an opponent model (which does not have to be rational in any way) as much as possible subject to safety. So, our approach can exploit the opponent significantly more than prior equilibrium refinements. (Some of the prior refinements also assume that we will tremble with small probability ourselves; this is not motivated by exploitation, but rather so that we know how to respond to actions further down the tree at information sets that would otherwise be reached with probability zero.) Another difference is that in our technique, a safe, maximally exploitative strategy can be computed in polynomial time both in theory and practice. In contrast, while proper equilibrium strategies can be computed in polynomial time in theory for both strategic-form and extensive-form games, those polynomial-time algorithms are numerically unstable in practice [Miltersen and Sørensen 2006, 2008].

Fig. 6. Strategy R is not in the support of an equilibrium for player 2, but is also not a gift.

When such a strategy $\sigma_{-i}$ exists, player $i$ can win an immediate profit beyond $v_i$ against an opponent who plays $\sigma_{-i}$ by simply playing the safe strategy $\sigma^*_i$; then he can play a potentially unsafe strategy (that has exploitability below some limit) in future iterations in an attempt to exploit perceived weaknesses of the opponent. Using this definition, RPS and the game depicted in Figure 6 have no gift strategies for either player, while T is a gift for player 2 in RPST, and R is a gift for player 2 in the game depicted in Figure 5.

PROPOSITION 5.3. Assuming we are not in a trivial game in which all of player $i$'s strategies are minimax strategies, then non-stage-game-equilibrium safe strategies exist if and only if there exists at least one gift strategy for the opponent.

PROOF. Suppose some gift strategy $\sigma_{-i}$ exists for the opponent. Then there exists an equilibrium strategy $\sigma^*_i$ such that $u_i(\sigma^*_i, \sigma_{-i}) > v_i$. Let $\epsilon = u_i(\sigma^*_i, \sigma_{-i}) - v_i$. Let $s'_i$ be a nonequilibrium strategy for player $i$. Suppose player $i$ plays $\sigma^*_i$ in the first round, and in the second round does the following: if the opponent did not play $\sigma_{-i}$ in the first round, he plays $\sigma^*_i$ in all subsequent rounds. If the opponent did play $\sigma_{-i}$ in the first round, then in the second round he plays $\hat{\sigma}_i$, where $\hat{\sigma}_i$ is a mixture between $s'_i$ and $\sigma^*_i$ that has exploitability in $(0, \epsilon)$ (we can always obtain such a mixture by putting sufficiently much weight on $\sigma^*_i$), and he plays $\sigma^*_i$ in all subsequent rounds. Such a strategy constitutes a safe strategy that deviates from stage-game equilibrium.

Now suppose no gift strategy exists for the opponent, and suppose we deviate from equilibrium for the first time in some iteration $t'$. Suppose the opponent plays a nemesis strategy at time step $t'$ (to the strategy we are playing at time step $t'$), and plays an equilibrium strategy at all future time steps. Then we will win less than $v_i$ in expectation against his strategy. Therefore, we cannot safely deviate from equilibrium.
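Definition 5.2 can be applied mechanically once our equilibrium strategies are known. The following is a minimal sketch for pure gift strategies, under the assumption that the caller supplies the list of equilibrium strategies (in RPS and RPST the uniform strategy is the unique one, as stated in the text); the matrices use the standard RPS payoffs and the T column described in Section 4.2, and `gift_columns` is an illustrative name, not one from the article.

```python
from fractions import Fraction

# Row player's payoffs; rows are R, P, S and columns are R, P, S (and T for RPST).
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]
RPST = [[0, -1, 1, 4],
        [1, 0, -1, 3],
        [-1, 1, 0, 3]]
UNIFORM = [Fraction(1, 3)] * 3

def gift_columns(matrix, equilibrium_row_mixes, value, labels):
    """Opponent pure strategies that are not a best response to some supplied
    equilibrium strategy of ours, i.e. that give us strictly more than the value."""
    gifts = set()
    for row_mix in equilibrium_row_mixes:
        for c, label in enumerate(labels):
            payoff = sum(p * matrix[r][c] for r, p in enumerate(row_mix))
            if payoff > value:
                gifts.add(label)
    return sorted(gifts)

print(gift_columns(RPS, [UNIFORM], 0, ["R", "P", "S"]))        # []
print(gift_columns(RPST, [UNIFORM], 0, ["R", "P", "S", "T"]))  # ['T']
```

The result agrees with the text: RPS has no gift strategies, while T is a gift for player 2 in RPST. The games of Figures 5 and 6 are not checked here because their payoff matrices are not reproduced in this text.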
The following procedure gives an efficient algorithm, consisting of solving two linear programs (LPs), to determine whether a gift strategy for the opponent exists in a two-player zero-sum strategic-form game (and therefore whether safe exploitation is possible).

(1) Compute an equilibrium by solving the LP; this determines the value of the game to player $i$, $v_i$.
(2) Solve the LP that maximizes the expected payoff of player $i$ against the uniform random strategy of the opponent, subject to the constraints that player $i$'s strategy is an equilibrium (these constraints will use $v_i$). Let $\hat{v}$ denote the optimal objective value of this LP.
(3) If $\hat{v} > v_i$, then at least one gift strategy for the opponent exists; otherwise no gift strategies exist.

PROPOSITION 5.4. The above procedure determines in polynomial time whether a gift strategy for the opponent exists in a given two-player zero-sum game.
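One way to write out the two linear programs of the procedure is sketched below. The notation is an assumption of this sketch and is not taken from the article: $A$ is player $i$'s payoff matrix, rows $r$ index player $i$'s pure strategies, columns $c = 1, \dots, n$ index the opponent's pure strategies, and $x$ is player $i$'s mixed strategy.

```latex
% Step (1): equilibrium LP. Its optimal objective is the value v_i.
\begin{align*}
v_i \;=\; \max_{x,\,v} \quad & v \\
\text{s.t.} \quad & \sum_{r} x_r A_{rc} \;\ge\; v \quad \text{for every column } c, \\
& \sum_{r} x_r = 1, \qquad x_r \ge 0 \ \text{ for every row } r.
\end{align*}

% Step (2): best equilibrium strategy against the uniform opponent.
% The first constraint forces x to guarantee v_i, i.e. to be an equilibrium strategy.
\begin{align*}
\hat{v} \;=\; \max_{x} \quad & \frac{1}{n} \sum_{c=1}^{n} \sum_{r} x_r A_{rc} \\
\text{s.t.} \quad & \sum_{r} x_r A_{rc} \;\ge\; v_i \quad \text{for every column } c, \\
& \sum_{r} x_r = 1, \qquad x_r \ge 0 \ \text{ for every row } r.
\end{align*}

% Worked instance, RPST (unique equilibrium x = (1/3, 1/3, 1/3), v_i = 0):
% columns R, P, S each give 0 and column T gives (4 + 3 + 3)/3 = 10/3, so
\[
\hat{v} \;=\; \frac{1}{4}\left(0 + 0 + 0 + \frac{10}{3}\right) \;=\; \frac{5}{6} \;>\; 0 \;=\; v_i .
\]
```

In RPS the same computation gives $\hat{v} = 0 = v_i$, so step (3) reports that no gift strategy exists, in agreement with Section 4.1.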

PROOF. Suppose a gift strategy $s_{-i}$ for the opponent exists. Then there exists an equilibrium strategy $\sigma^*_i$ such that $u_i(\sigma^*_i, s_{-i}) > v_i$. For every other strategy $t_{-i}$ for the opponent, we have $u_i(\sigma^*_i, t_{-i}) \geq v_i$. Thus, player $i$'s expected payoff of playing $\sigma^*_i$ against the uniform random strategy will strictly exceed $v_i$, and $\hat{v} > v_i$.

Now suppose no gift strategies exist. Then for all equilibrium strategies $\sigma^*_i$ and all strategies $s_{-i}$ for the opponent, we have $u_i(\sigma^*_i, s_{-i}) = v_i$. Thus, all equilibrium strategies will obtain expected payoff $v_i$ against the uniform random strategy, and we have $\hat{v} = v_i$.

The procedure is polynomial time since it consists of solving LPs of polynomial size (the LP formulations for computing a best response as well as the equilibrium constraints are described by, for example, Koller et al. [1994]).

6. SAFETY ANALYSIS OF SOME NATURAL EXPLOITATION ALGORITHMS

Now that we know it is possible to safely deviate from equilibrium in certain games, can we construct efficient procedures for implementing such safe exploitative strategies? In this section we analyze the safety of several natural exploitation algorithms. In short, we will show that all prior algorithms and natural other candidate algorithms are either unsafe or unexploitative. We introduce algorithms that are safe and exploitative.

6.1. Risk What You've Won (RWYW)

The "Risk What You've Won" algorithm (RWYW) is quite simple and natural; essentially, at each iteration it risks only the amount of profit won so far. More specifically, at each iteration $t$, RWYW plays an ε-safe best response to a model of the opponent's strategy (according to some opponent modeling algorithm $M$), where ε is our current cumulative payoff minus $(t-1)v_i$. Pseudocode is given in Algorithm 1.

Algorithm 1. Risk What You've Won (RWYW)
  $v_i \leftarrow$ value of the game to player $i$
  $k^1 \leftarrow 0$
  for $t = 1$ to $T$ do
    $\pi^t \leftarrow \arg\max_{\pi \in \mathrm{SAFE}(\max\{k^t, 0\})} M(\pi)$
    Play action $a^t_i$ according to $\pi^t$
    Update $M$ with opponent's actions, $a^t_{-i}$
    $k^{t+1} \leftarrow k^t + u_i(a^t_i, a^t_{-i}) - v_i$
  end for

PROPOSITION 6.1. RWYW is not safe.

PROOF.
Consder RPS, and assume our opponent modelng algorthm M says that the opponent wll play accordng to hs dstrbuton of actons observed so far. Snce ntally k 1 = 0, we must play our equlbrum strategy σ at the frst teraton, snce t s the only strategy wth explotablty of 0. Wthout loss of generalty, assume the opponent plays R n the frst teraton. Our expected payoff n the frst teraton s 0, snce σ has expected payoff of 0 aganst R (or any strategy). Suppose we had played R ourselves n the frst teraton. Then we would have obtaned an actual payoff of 0, and would set k 2 = 0. Thus we wll be forced to play σ at the second teraton as well. If we had played P n the frst round, we would have obtaned a payoff of 1, and set k 2 = 1. We would then set π 2 to be the pure strategy P, snce our opponent model dctates the

12 8:12 S. Ganzfred and T. Sandholm opponent wll play R agan, and P s the unque k 2 -safe best response to R. Fnally, f we had played S n the frst round, we would have obtaned an actual payoff of -1, and would set k 2 = 1; ths would requre us to set π 2 equal to σ. Now, suppose the opponent had actually played accordng to hs equlbrum strategy n teraton 1, plays the pure strategy S n the second round, then plays the equlbrum n all subsequent rounds. As dscussed above, our expected payoff at the frst teraton s zero. Aganst ths strategy, we wll actually obtan an expected payoff of -1 n the second teraton f the opponent happened to play R n the frst round, whle we wll obtan an expected of 0 n the second round otherwse. So our expected payoff n the second round wll be 1 3 ( 1) = 1 3. In all subsequent rounds our expected payoff wll be zero. Thus our overall expected payoff wll be 1 3, whch s less than the value of the game; so RWYW s not safe. RWYW s not safe because t does not adequately dfferentate between whether profts were due to skll (.e., from gfts) or to luck Rsk What You ve Won n Expectaton (RWYWE) A better approach than RWYW would be to rsk the amount won so far n expectaton. Ideally we would lke to do the expectaton over both our randomzaton and our opponent s, but ths s not possble n general snce we only observe the opponent s acton, not hs full strategy. However, t would be possble to do the expectaton only over our randomzaton. For example, suppose we play accordng to the equlbrum σ at one teraton of RPS, and end up selectng acton R, whle the opponent selects acton P; then our actual payoff s -1, but our expected payoff (over our own randomzaton) s 0. It turns out that we can ndeed acheve safety usng ths procedure, whch we call RWYWE. Pseudocode s gven n Algorthm 2. Here u (π t, at ) denotes our expected payoff of playng our mxed strategy π t aganst the opponent s observed acton a t. 
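The contrast between the two bookkeeping rules can be checked by exact enumeration of the first two rounds of RPS. The sketch below is an illustration under stated assumptions, not the authors' code. The opponent is a hypothetical adaptive one of our own choosing: uniform in round 1, then in round 2 he plays the action that beats our best response to his first action (one concrete reading of the "without loss of generality" labeling in the proof of Proposition 6.1). The opponent model is assumed to predict a repeat of his first action, and the names `payoff` and `second_round_expectation` are ours.

```python
from fractions import Fraction
from itertools import product

ACTIONS = ("R", "P", "S")
BEATS = {"R": "P", "P": "S", "S": "R"}  # BEATS[x] is the action that beats x


def payoff(mine, theirs):
    """Our payoff in one round of rock-paper-scissors (game value 0)."""
    if mine == theirs:
        return 0
    return 1 if BEATS[theirs] == mine else -1


def second_round_expectation(rule):
    """Exact expected round-2 payoff under the given bookkeeping rule.

    rule = "actual"   -> RWYW:  k^2 is the realized payoff of round 1
    rule = "expected" -> RWYWE: k^2 is the expected payoff of our mixed strategy
    """
    third = Fraction(1, 3)
    total = Fraction(0)
    # Round 1: k^1 = 0 forces the uniform equilibrium; the opponent is uniform too.
    for mine, theirs in product(ACTIONS, repeat=2):
        if rule == "actual":
            k2 = payoff(mine, theirs)
        else:
            k2 = sum(third * payoff(a, theirs) for a in ACTIONS)  # always 0
        # Hypothetical adaptive opponent: beats our best response to his first action.
        opp_round2 = BEATS[BEATS[theirs]]
        if k2 >= 1:
            # A budget of 1 permits the full best response to the empirical model.
            round2 = Fraction(payoff(BEATS[theirs], opp_round2))
        else:
            # Budget 0 (or negative, clipped to 0): only the equilibrium is allowed.
            round2 = sum(third * payoff(a, opp_round2) for a in ACTIONS)
        total += third * third * round2
    return total
```

Under these assumptions `second_round_expectation("actual")` evaluates to $-\frac{1}{3}$, matching the loss derived in the proof, while `second_round_expectation("expected")` evaluates to 0: averaging over our own randomization never credits a lucky win to the budget.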
The difference between RWYWE and RWYW is in the step for updating $k^t$: RWYW uses $u_i(a^t_i, a^t_{-i})$ while RWYWE uses $u_i(\pi^t, a^t_{-i})$.

Algorithm 2. Risk What You've Won in Expectation (RWYWE)
    $v_i \leftarrow$ value of the game to player $i$
    $k^1 \leftarrow 0$
    for $t = 1$ to $T$ do
        $\pi^t \leftarrow \operatorname{argmax}_{\pi \in SAFE(k^t)} M(\pi)$
        Play action $a^t_i$ according to $\pi^t$
        The opponent plays action $a^t_{-i}$ according to unobserved distribution $\pi^t_{-i}$
        Update $M$ with opponent's actions, $a^t_{-i}$
        $k^{t+1} \leftarrow k^t + u_i(\pi^t, a^t_{-i}) - v_i$
    end for

LEMMA 6.2. Let $\pi$ be updated according to RWYWE, and suppose the opponent plays according to $\pi_{-i}$. Then for all $n \geq 0$,
$$E[k^{n+1}] = \sum_{t=1}^{n} u_i(\pi^t, \pi^t_{-i}) - n v_i.$$

PROOF. Since $k^1 = 0$, the statement holds for $n = 0$. Now suppose the statement holds for all $t \leq n$, for some $n \geq 0$. Then
$$\begin{aligned}
E[k^{n+2}] &= E[k^{n+1} + u_i(\pi^{n+1}, a^{n+1}_{-i}) - v_i] \\
&= E[k^{n+1}] + E[u_i(\pi^{n+1}, a^{n+1}_{-i})] - E[v_i] \\
&= \left[\sum_{t=1}^{n} u_i(\pi^t, \pi^t_{-i}) - n v_i\right] + E[u_i(\pi^{n+1}, a^{n+1}_{-i})] - v_i \\
&= \left[\sum_{t=1}^{n} u_i(\pi^t, \pi^t_{-i}) - n v_i\right] + u_i(\pi^{n+1}, \pi^{n+1}_{-i}) - v_i \\
&= \sum_{t=1}^{n+1} u_i(\pi^t, \pi^t_{-i}) - (n + 1) v_i.
\end{aligned}$$

LEMMA 6.3. Let $\pi$ be updated according to RWYWE. Then for all $t \geq 1$, $k^t \geq 0$.

PROOF. By definition, $k^1 = 0$. Now suppose $k^t \geq 0$ for some $t \geq 1$. By construction, $\pi^t$ has exploitability at most $k^t$. Thus, we must have
$$u_i(\pi^t, a^t_{-i}) \geq v_i - k^t.$$
Thus $k^{t+1} \geq 0$ and we are done.

PROPOSITION 6.4. RWYWE is safe.

PROOF. By Lemma 6.2,
$$\sum_{t=1}^{T} u_i(\pi^t, \pi^t_{-i}) = E[k^{T+1}] + T v_i.$$
By Lemma 6.3, $k^{T+1} \geq 0$, and therefore $E[k^{T+1}] \geq 0$. So
$$\sum_{t=1}^{T} u_i(\pi^t, \pi^t_{-i}) \geq T v_i,$$
and RWYWE is safe.

RWYWE is similar to the Safe Policy Selection Algorithm (SPS) [McCracken and Bowling 2004]. The main difference is that SPS uses an additional decay function $f : \mathbb{N} \to \mathbb{R}$, setting $k^1 \leftarrow f(1)$ and using the update step
$$k^{t+1} \leftarrow k^t + f(t + 1) + u_i(\pi^t, a^t_{-i}) - v_i.$$
They require $f$ to satisfy the following properties:
(1) $f(t) > 0$ for all $t$;
(2) $\lim_{T \to \infty} \frac{\sum_{t=1}^{T} f(t)}{T} = 0$.

In particular, they obtained good experimental results using $f(t) = \beta/t$. They are able to show that SPS is safe in the limit as $T \to \infty$;⁴ however, SPS is arbitrarily exploitable in finitely repeated games. Furthermore, even in infinitely repeated games, SPS can lose a significant amount; it is merely the average loss that approaches zero. We can think of RWYWE as SPS but using $f(t) = 0$ for all $t$.

6.3. Best Equilibrium Strategy

Given an opponent modeling algorithm $M$, we could play the best Nash equilibrium according to $M$ at each time step: $\pi^t = \operatorname{argmax}_{\pi \in SAFE(0)} M(\pi)$. This would clearly be safe, but can only exploit the opponent as much as the best equilibrium can, and potentially leaves a lot of exploitation on the table.

6.4. Regret Minimization between an Equilibrium and an Opponent Exploitation Algorithm

We could use a no-regret algorithm (e.g., Exp3 [Auer et al. 2002]) to select between an equilibrium and an (unsafe) opponent exploitation algorithm at each iteration. As prior work has pointed out [McCracken and Bowling 2004], this would be safe in the limit as $T \to \infty$. However, this would not be safe in finitely repeated games. Even in the infinitely repeated case, no-regret algorithms only guarantee that average regret goes to 0 in the limit; in fact, total regret can still grow arbitrarily large.

6.5. Regret Minimization in the Space of Equilibria

Regret minimization in the space of equilibria is safe, but again would potentially miss out on a lot of exploitation against suboptimal opponents. This procedure was previously used to exploit opponents in Kuhn poker [Hoehn et al. 2005].

6.6. Best Equilibrium Followed by Full Exploitation (BEFFE)

The BEFFE algorithm works as follows. We start off playing the best equilibrium strategy according to some opponent model $M$. Then we switch to playing a full best response for all future iterations if we know that doing so will keep our strategy safe in the full game (in other words, if we know we have accrued enough gifts to support full exploitation in the remaining iterations).
Specifically, we play a full best response at time step $t$ if the amount of gifts we have accumulated, $k^t$, is at least $(T - t + 1)(v_i - \epsilon)$, where $\epsilon$ is the exploitability of a full best response. Otherwise, we play the best equilibrium. Pseudocode is given in Algorithm 3.

This algorithm is similar to the DBBR algorithm [Ganzfried and Sandholm 2011], which plays an equilibrium for some fixed number of iterations, then switches to full exploitation. However, BEFFE automatically detects when this switch should occur, which has several advantages. First, it is one fewer parameter required by the algorithm. More importantly, it enables the algorithm to guarantee safety.

PROPOSITION 6.5. BEFFE is safe.

⁴ We recently discovered a mistake in their proof of safety in the limit; however, the result is still correct. A corrected proof is available at bowling/papers/04aaai-fallsymp-errata.pdf.

Algorithm 3. Best Equilibrium Followed by Full Exploitation (BEFFE)
    $v_i \leftarrow$ value of the game to player $i$
    $k^1 \leftarrow 0$
    for $t = 1$ to $T$ do
        $\pi^t_{BR} \leftarrow \operatorname{argmax}_{\pi} M(\pi)$
        $\epsilon \leftarrow v_i - \min_{\pi_{-i}} u_i(\pi^t_{BR}, \pi_{-i})$
        if $k^t \geq (T - t + 1)(v_i - \epsilon)$ then
            $\pi^t \leftarrow \pi^t_{BR}$
        else
            $\pi^t \leftarrow \operatorname{argmax}_{\pi \in SAFE(0)} M(\pi)$
        end if
        Play action $a^t_i$ according to $\pi^t$
        The opponent plays action $a^t_{-i}$ according to unobserved distribution $\pi^t_{-i}$
        Update $M$ with opponent's actions, $a^t_{-i}$
        $k^{t+1} \leftarrow k^t + u_i(\pi^t, a^t_{-i}) - v_i$
    end for

PROOF OF PROPOSITION 6.5. Follows by the same reasoning as the proof of safety of RWYWE, since we are playing a strategy with exploitability at most $k^t$ at each iteration.

One possible advantage of BEFFE over RWYWE is that it potentially saves up exploitability until the end of the game, when it has the most accurate information on the opponent's strategy (while RWYWE does exploitation from the start, when the opponent model has noisier data). On the other hand, BEFFE possibly misses out on additional rounds of exploitation by waiting until the end, since it may accumulate additional gifts in the exploitation phase that it did not take into account. Furthermore, by waiting longer before turning on exploitation, one's experience of the opponent can be from the wrong part of the space; that is, the space that is reached when playing equilibrium but not when exploiting. Consequently, the exploitation might not be as effective because it may be based on less data about the opponent in the pertinent part of the space. This issue has been observed in opponent exploitation in Heads-Up Texas Hold'em poker [Ganzfried and Sandholm 2011].

6.7. Best Equilibrium and Full Exploitation when Possible (BEFEWP)

BEFEWP is similar to BEFFE, but rather than waiting until the end of the game, we play a full best response at each iteration where its exploitability is at most $k^t$; otherwise we play the best equilibrium. Pseudocode is given in Algorithm 4. Like RWYWE, BEFEWP will continue to exploit a suboptimal opponent throughout the match provided the opponent keeps giving us gifts. It also guarantees safety, since we are still playing a strategy with exploitability at most $k^t$ at each iteration.
However, playing a full best response rather than a safe best response early in the match may not be the greatest idea, since our data on the opponent is still quite noisy.

PROPOSITION 6.6. BEFEWP is safe.

Algorithm 4. Best Equilibrium and Full Exploitation When Possible (BEFEWP)
    $v_i \leftarrow$ value of the game to player $i$
    $k^1 \leftarrow 0$
    for $t = 1$ to $T$ do
        $\pi^t_{BR} \leftarrow \operatorname{argmax}_{\pi} M(\pi)$
        $\epsilon \leftarrow v_i - \min_{\pi_{-i}} u_i(\pi^t_{BR}, \pi_{-i})$
        if $\epsilon \leq k^t$ then
            $\pi^t \leftarrow \pi^t_{BR}$
        else
            $\pi^t \leftarrow \operatorname{argmax}_{\pi \in SAFE(0)} M(\pi)$
        end if
        Play action $a^t_i$ according to $\pi^t$
        The opponent plays action $a^t_{-i}$ according to unobserved distribution $\pi^t_{-i}$
        Update $M$ with opponent's actions, $a^t_{-i}$
        $k^{t+1} \leftarrow k^t + u_i(\pi^t, a^t_{-i}) - v_i$
    end for

7. A FULL CHARACTERIZATION OF SAFE STRATEGIES IN STRATEGIC-FORM GAMES

In the previous section we saw a variety of opponent exploitation algorithms, some of which are safe and some of which are unsafe. In this section, we fully characterize the space of safe algorithms. Informally, it turns out that an algorithm will be safe if at each time step it selects a strategy with exploitability at most $k^t$, where $k$ is updated according to the RWYWE procedure. This does not mean that RWYWE is the only safe algorithm, or that safe algorithms must explicitly use the given update rule for $k^t$; it just means that the exploitability at each time step must be bounded by the particular value $k^t$, assuming that $k$ had hypothetically been updated according to the RWYWE rule.⁵

Definition 7.1. An algorithm for selecting strategies is expected-profit-safe if it satisfies the rule
$$\pi^t \in SAFE(k^t)$$
at each time step $t$ from 1 to $T$, where initially $k^1 = 0$ and $k$ is updated using the rule
$$k^{t+1} \leftarrow k^t + u_i(\pi^t, a^t_{-i}) - v_i.$$

PROPOSITION 7.2. A strategy $\pi$ (for the full game, not the stage game) is safe if and only if it is expected-profit-safe.

PROOF. If $\pi$ is expected-profit-safe, then it follows that $\pi$ is safe by similar reasoning to the proof of Proposition 6.4. Now suppose $\pi$ is safe, but at some iteration $t$ selects $\pi^t$ with exploitability exceeding $k^t$, as defined in Definition 7.1 (assume $t$ is the first such iteration); let $e$ denote the exploitability of $\pi^t$. Suppose the opponent had been playing the pure strategy that selects action $a^{t'}_{-i}$ with probability 1 at each iteration $t'$ for all $t' < t$, and suppose he plays his nemesis strategy to $\pi^t$ at time step $t$ (and follows a minimax strategy at all future iterations). Then our expected payoff in the first $t$ iterations is
$$\begin{aligned}
\sum_{t'=1}^{t-1} u_i(\pi^{t'}, a^{t'}_{-i}) + v_i - e
&< \sum_{t'=1}^{t-1} u_i(\pi^{t'}, a^{t'}_{-i}) + v_i - k^t \\
&= \sum_{t'=1}^{t-1} u_i(\pi^{t'}, a^{t'}_{-i}) + v_i - \left[\sum_{t'=1}^{t-1} u_i(\pi^{t'}, a^{t'}_{-i}) - (t - 1) v_i\right] \qquad (1) \\
&= t v_i.
\end{aligned}$$
In Equation (1), we use Lemma 6.2 and the fact that $E[k^t] = k^t$, since the opponent played a deterministic strategy in the first $t - 1$ rounds. We will obtain payoff at most $v_i$ at each future iteration, since the opponent is playing a minimax strategy. So $\pi$ is not safe and we have a contradiction; therefore $\pi$ must be expected-profit-safe, and we are done.

⁵ We could generalize the approaches to play strategies in $SAFE(f(k^t))$ at each time step rather than $SAFE(k^t)$, where $f(k^t) \leq k^t$ is an arbitrary function that is a potentially lower upper bound on the exploitability. This would result in a larger worst-case payoff guarantee when $f(k^t) < k^t$, but potentially at the expense of exploitation (since we are now restricting our space of strategies to a smaller set). In the opposite direction, we could also select strategies in $SAFE(k^t + \delta)$ for $\delta > 0$; this would lead to strategies that are approximately safe (within an additive factor $\delta$), and potentially achieve higher levels of exploitation.

8. SAFE EXPLOITATION IN EXTENSIVE-FORM GAMES

In extensive-form games, we cannot immediately apply RWYWE (or the other safe algorithms that deviate from equilibrium), since we do not know what the opponent would have done at game states off the path of play (and thus cannot evaluate the expected payoff of our mixed strategy).

8.1. Extensive-Form Games of Perfect Information

In extensive-form games of perfect information, it turns out that to guarantee safety we must assume pessimistically that the opponent is playing a nemesis off the path of play (while playing his observed action on the path of play). This pessimism potentially limits our amount of exploitation when the opponent is not playing a nemesis, but is needed to guarantee safety. We present an extensive-form version of RWYWE below as Algorithm 5. As in the strategic-form case, the time step $t$ refers to the iteration of the repeated game (not to the depth of the tree within a single iteration); the strategies refer to behavioral strategies for a single iteration of the full extensive-form game.
Algorithm 5. Extensive-Form RWYWE
    $v_i \leftarrow$ value of the game to player $i$
    $k^1 \leftarrow 0$
    for $t = 1$ to $T$ do
        $\pi^t \leftarrow \operatorname{argmax}_{\pi \in SAFE(k^t)} M(\pi)$
        Play action $a^t_i$ according to $\pi^t$
        The opponent plays action $a^t_{-i}$ according to unobserved distribution $\pi^t_{-i}$
        Update $M$ with opponent's actions, $a^t_{-i}$
        $\tau^t_{-i} \leftarrow$ strategy for the opponent that plays $a^t_{-i}$ on the path of play, and plays a best response to $\pi^t$ off the path of play
        $k^{t+1} \leftarrow k^t + u_i(\pi^t, \tau^t_{-i}) - v_i$
    end for
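The pessimistic payoff $u_i(\pi^t, \tau^t_{-i})$ used in the last step of Algorithm 5 can be sketched as a recursion over a small perfect-information tree. Everything concrete here is an assumption made for illustration: the nested-dictionary tree encoding, the function name `pessimistic_value`, and the toy game `GAME` (value 0, with opponent action `b` acting as a gift). Choosing each best strategy $\pi^t$ is left out; the sketch only evaluates the budget update for given strategies and observed paths.

```python
from fractions import Fraction


def pessimistic_value(node, pi, path, on_path=True):
    """Return u_i(pi, tau) for one iteration of the game.

    node : a number (leaf payoff to player i) or a dict with keys
           "name", "owner" ("us" or "opp") and "children" {action: node}
    pi   : our behavioral strategy, {node name: {action: probability}}
    path : observed play, {node name: action taken} for nodes that were reached
    """
    if not isinstance(node, dict):              # leaf: payoff to player i
        return Fraction(node)
    name, children = node["name"], node["children"]
    if node["owner"] == "us":
        # Average over our own randomization, as in strategic-form RWYWE.
        return sum(
            pi[name][a]
            * pessimistic_value(child, pi, path, on_path and path.get(name) == a)
            for a, child in children.items()
        )
    if on_path:
        # On the path of play the opponent's action was observed.
        return pessimistic_value(children[path[name]], pi, path, True)
    # Off the path of play: assume the opponent minimizes our payoff.
    return min(pessimistic_value(c, pi, path, False) for c in children.values())


# Hypothetical game: we move first, then the opponent. Its value is
# max(min(0, 2), min(0, -1)) = 0, and action "b" gives away payoff.
GAME = {
    "name": "root", "owner": "us",
    "children": {
        "L": {"name": "after_L", "owner": "opp", "children": {"a": 0, "b": 2}},
        "R": {"name": "after_R", "owner": "opp", "children": {"c": 0, "d": -1}},
    },
}
V = Fraction(0)
observed = {"root": "L", "after_L": "b"}

k = Fraction(0)
pi1 = {"root": {"L": Fraction(1), "R": Fraction(0)}}        # exploitability 0
k += pessimistic_value(GAME, pi1, observed) - V             # budget after round 1
pi2 = {"root": {"L": Fraction(1, 2), "R": Fraction(1, 2)}}  # exploitability 1/2
k += pessimistic_value(GAME, pi2, observed) - V             # budget after round 2
```

In this toy run the first gift raises the budget to 2, which is enough to allow the mixed strategy `pi2` in the second round. With the same observed path, `pi2` is credited only $\frac{1}{2} \cdot 2 + \frac{1}{2} \cdot (-1) = \frac{1}{2}$, because the unreached branch after `R` is scored as if the opponent would have played `d`.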

LEMMA 8.1. Let $\pi$ be updated according to Extensive-Form RWYWE, and suppose the opponent plays according to $\pi_{-i}$. Then for all $n \geq 0$,
$$E[k^{n+1}] \leq \sum_{t=1}^{n} u_i(\pi^t, \pi^t_{-i}) - n v_i.$$

PROOF. Since $k^1 = 0$, the statement holds for $n = 0$. Now suppose the statement holds for all $t \leq n$, for some $n \geq 0$. Then
$$\begin{aligned}
E[k^{n+2}] &= E[k^{n+1} + u_i(\pi^{n+1}, \tau^{n+1}_{-i}) - v_i] \\
&= E[k^{n+1}] + E[u_i(\pi^{n+1}, \tau^{n+1}_{-i})] - E[v_i] \\
&\leq \left[\sum_{t=1}^{n} u_i(\pi^t, \pi^t_{-i}) - n v_i\right] + E[u_i(\pi^{n+1}, \tau^{n+1}_{-i})] - v_i \\
&\leq \left[\sum_{t=1}^{n} u_i(\pi^t, \pi^t_{-i}) - n v_i\right] + u_i(\pi^{n+1}, \pi^{n+1}_{-i}) - v_i \\
&= \sum_{t=1}^{n+1} u_i(\pi^t, \pi^t_{-i}) - (n + 1) v_i.
\end{aligned}$$

LEMMA 8.2. Let $\pi$ be updated according to Extensive-Form RWYWE. Then for all $t \geq 1$, $k^t \geq 0$.

PROOF. By definition, $k^1 = 0$. Now suppose $k^t \geq 0$ for some $t \geq 1$. By construction, $\pi^t$ has exploitability at most $k^t$. Thus, we must have
$$u_i(\pi^t, \tau^t_{-i}) \geq v_i - k^t.$$
Thus $k^{t+1} \geq 0$ and we are done.

PROPOSITION 8.3. Extensive-Form RWYWE is safe.

PROOF. By Lemma 8.1,
$$\sum_{t=1}^{T} u_i(\pi^t, \pi^t_{-i}) \geq E[k^{T+1}] + T v_i.$$
By Lemma 8.2, $k^{T+1} \geq 0$, and therefore $E[k^{T+1}] \geq 0$. So
$$\sum_{t=1}^{T} u_i(\pi^t, \pi^t_{-i}) \geq T v_i,$$
and Extensive-Form RWYWE is safe.

We now provide a full characterization of safe exploitation algorithms in extensive-form games, similarly to what we did for strategic-form games earlier in the article.

Definition 8.4. An algorithm for selecting strategies in extensive-form games of perfect information is expected-profit-safe if it satisfies the rule
$$\pi^t \in SAFE(k^t)$$
at each time step $t$ from 1 to $T$, where initially $k^1 = 0$ and $k$ is updated using the same rule as Extensive-Form RWYWE.

LEMMA 8.5. Let $\pi$ be updated according to Extensive-Form RWYWE, and suppose the opponent plays according to $\pi_{-i} = \tau_{-i}$, where $\tau$ is defined in Algorithm 5. Then for all $n \geq 0$,
$$E[k^{n+1}] = \sum_{t=1}^{n} u_i(\pi^t, \pi^t_{-i}) - n v_i.$$

PROOF. Since $k^1 = 0$, the statement holds for $n = 0$. Now suppose the statement holds for all $t \leq n$, for some $n \geq 0$. Then
$$\begin{aligned}
E[k^{n+2}] &= E[k^{n+1} + u_i(\pi^{n+1}, \tau^{n+1}_{-i}) - v_i] \\
&= E[k^{n+1}] + E[u_i(\pi^{n+1}, \tau^{n+1}_{-i})] - E[v_i] \\
&= \left[\sum_{t=1}^{n} u_i(\pi^t, \pi^t_{-i}) - n v_i\right] + E[u_i(\pi^{n+1}, \tau^{n+1}_{-i})] - v_i \\
&= \left[\sum_{t=1}^{n} u_i(\pi^t, \pi^t_{-i}) - n v_i\right] + u_i(\pi^{n+1}, \pi^{n+1}_{-i}) - v_i \\
&= \sum_{t=1}^{n+1} u_i(\pi^t, \pi^t_{-i}) - (n + 1) v_i.
\end{aligned}$$

PROPOSITION 8.6. A strategy $\pi$ in an extensive-form game of perfect information is safe if and only if it is expected-profit-safe.

PROOF. If $\pi$ is expected-profit-safe, then it follows that $\pi$ is safe by similar reasoning to the proof of Proposition 8.3. Now suppose $\pi$ is safe, but at some iteration $t$ selects $\pi^t$ with exploitability exceeding $k^t$, as defined in Definition 8.4; let $e$ denote the exploitability of $\pi^t$. Suppose the opponent had been playing the pure strategy that selects action $a^{t'}_{-i}$ with probability 1 at each iteration $t'$ for all $t' < t$, and suppose he plays his nemesis strategy at time step $t$ (and follows a minimax strategy at all future iterations). Then our expected payoff is
$$\begin{aligned}
\sum_{t'=1}^{t-1} u_i(\pi^{t'}, a^{t'}_{-i}) + v_i - e
&< \sum_{t'=1}^{t-1} u_i(\pi^{t'}, a^{t'}_{-i}) + v_i - k^t \\
&= \sum_{t'=1}^{t-1} u_i(\pi^{t'}, a^{t'}_{-i}) + v_i - \left[\sum_{t'=1}^{t-1} u_i(\pi^{t'}, a^{t'}_{-i}) - (t - 1) v_i\right] \qquad (2) \\
&= t v_i.
\end{aligned}$$
In Equation (2), we use Lemma 8.5 and the fact that $E[k^t] = k^t$, since the opponent played a deterministic strategy in the first $t - 1$ rounds. We will obtain payoff at most $v_i$ at each future iteration, since the opponent is playing a minimax strategy. So $\pi$ is not safe and we have a contradiction; therefore $\pi$ must be expected-profit-safe, and we are done.
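Propositions 7.2 and 8.6 both reduce safety to one budget test per iteration. A minimal sketch of that test for the strategic-form case of Definition 7.1 is given below. It assumes the payoff matrix `A` of player $i$ and the game value `v` are supplied by the caller; the function names and the example game (matching pennies plus one gift column) are hypothetical.

```python
from fractions import Fraction


def expected_payoff(A, pi, col):
    """u_i(pi, a_{-i}): payoff of mixed strategy pi against opponent column col."""
    return sum(p * A[row][col] for row, p in enumerate(pi))


def exploitability(A, v, pi):
    """Game value minus the worst-case payoff of pi over opponent pure actions."""
    return v - min(expected_payoff(A, pi, col) for col in range(len(A[0])))


def is_expected_profit_safe(A, v, history):
    """Check pi^t in SAFE(k^t) for every (pi^t, observed column) pair in history."""
    k = Fraction(0)                            # k^1 = 0
    for pi, col in history:
        if exploitability(A, v, pi) > k:
            return False                       # risked more than won in expectation
        k += expected_payoff(A, pi, col) - v   # RWYWE update rule
    return True


# Hypothetical game with value 0: the third column always pays us 2.
A = [[1, -1, 2],
     [-1, 1, 2]]
v = Fraction(0)
sigma = [Fraction(1, 2), Fraction(1, 2)]       # equilibrium strategy
pure_top = [Fraction(1), Fraction(0)]          # exploitability 1
```

With these definitions, `is_expected_profit_safe(A, v, [(sigma, 2), (pure_top, 1)])` is `True`, since the gift in the first round raises $k^2$ to 2 and covers the exploitability of the pure strategy. In contrast, `is_expected_profit_safe(A, v, [(sigma, 0), (pure_top, 0)])` is `False`, because no gift was received before deviating.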
