Double-oracle Algorithm for Computing an Exact Nash Equilibrium in Zero-sum Extensive-form Games


Branislav Bošanský 1, Christopher Kiekintveld 2, Viliam Lisý 1, Jiří Čermák 1, Michal Pěchouček 1
1 Agent Technology Center, Dept. of Computer Science, FEE, Czech Technical University in Prague, {bosansky, lisy, cermak, pechoucek}@agents.fel.cvut.cz
2 Computer Science Department, University of Texas at El Paso, cdkiekintveld@utep.edu

ABSTRACT

We investigate an iterative algorithm for computing an exact Nash equilibrium in two-player zero-sum extensive-form games with imperfect information. The approach uses the sequence-form representation of extensive-form games and the double-oracle algorithmic framework. The main idea is to restrict the game by allowing the players to play only some of the sequences of available actions, then iteratively solve this restricted game, and exploit fast best-response algorithms to add additional sequences to the restricted game for the next iteration. In this paper we (1) extend the sequence-form double-oracle method to be applicable to non-deterministic extensive-form games, (2) present more efficient methods for maintaining a valid restricted game and computing best-response sequences, and finally we (3) provide theoretical guarantees of the convergence of the algorithm to a Nash equilibrium. We experimentally evaluate our algorithm on two types of games: a search game on a graph and simplified variants of Poker. The results show significant running-time improvements compared to the previous variant of the double-oracle algorithm, and demonstrate the ability to find an exact solution of much larger games compared to solving the full linear program for the complete game.

Categories and Subject Descriptors
I.2.11 [Distributed Artificial Intelligence]: [Multiagent systems]

Keywords
game theory, extensive-form games, exact Nash equilibrium, algorithms

1. INTRODUCTION

Computational game theory is one of the fundamental methods for analyzing multi-agent systems using a formal mathematical framework.
Improved algorithms for solving different classes of games have led to an increasing number of successful applications of game theory in areas ranging from auctions and trading agents [14] and security [11] to Poker [15], and many others. We focus on making fundamental algorithmic advances for solving large instances of an important general class of games: two-player, zero-sum extensive-form games (EFGs) with imperfect information. This class of games captures sequential interactions between two adversarial players in situations where they have uncertainty about the state of the world or the other player's actions. Many well-known games are instances of this class, including Poker, Kriegspiel (blind chess), and various security and pursuit-evasion games.

Appears in: Proceedings of the 12th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2013), Ito, Jonker, Gini, and Shehory (eds.), May 6-10, 2013, Saint Paul, Minnesota, USA. Copyright © 2013, International Foundation for Autonomous Agents and Multiagent Systems. All rights reserved.

Solving these games is a very difficult computational task for problems of realistic size; hence, approximation methods are commonly used to solve them in practice, including: regret minimization techniques (e.g., CFR) [16], later improved with sampling methods [6]; Nesterov's Excessive Gap Technique (EGT) [2]; or variants of Monte-Carlo tree search algorithms applied to imperfect-information games (e.g., see [9]). The first two types of algorithms guarantee convergence to an approximate ε-Nash equilibrium, while the third method has no theoretical guarantees for imperfect-information games, but it can produce good strategies in practice [9]. We focus on finding exact solutions, though our method could be used to develop approximate algorithms as well. The leading exact algorithm uses the sequence-form representation [5, 12] and linear programming optimization techniques to find a solution. Solving the linear program (LP) requires significantly more memory and time than the approximate methods.
However, research on decomposition methods for solving large-scale optimization problems provides a framework for developing iterative approaches that do not need to enumerate the full problem. These techniques are known in game theory as oracle algorithms [8], and they have been successfully used to solve large normal-form games [3]. In this paper we use the same iterative principle to develop a new exact algorithm for two-player, zero-sum EFGs with imperfect information based on the sequence-form double-oracle algorithm recently introduced by Bosansky et al. [1]. The main idea of the algorithm is to create a restricted game, in which the players have a limited number of allowed sequences, and then iteratively expand the restricted game by adding best-response sequences to the solution of the current restricted game. In the worst case, this approach may need to enumerate all possible sequences, but in typical cases a solution can be found by exploring a small fraction of the strategy space.

There are two other approaches that use similar oracle-based decompositions for EFGs, but which operate on the complete strategy space of the game. The first was introduced by McMahan et al. in [7] for the more general class of convex games; this method searches a space of linear combinations of multiple mixed strategies on the full game tree. A similar approximative method using strategies defined on the complete game tree was used in [15] to create a competitive Poker bot. We build on the ideas of the sequence-form double-oracle, since it exploits the tree structure of an EFG by using the compact sequence form so that the strategies do not need to be represented over the complete game tree. While this is a strong advantage, working with the sequence form also introduces complexities in maintaining a valid restricted game and ensuring convergence to a Nash equilibrium.

The contributions of this paper include several major improvements of the sequence-form double-oracle method: (1) we extend the double-oracle framework to model more general games of imperfect information with moves by Nature, (2) we dramatically improve the performance of the algorithm by introducing a novel technique for maintaining a valid restricted game without adding unnecessary sequences to the restricted game, (3) we improve best-response calculations for sequences using a set of domain-independent pruning techniques, and (4) we provide a formal theoretical analysis to guarantee that the algorithm converges to a Nash equilibrium. Finally, we present an experimental evaluation of our algorithm on two very different classes of games: a search game based on a pursuit-evasion scenario and a game based on simplified variants of Poker. For search games, which have small support in the equilibrium, our algorithm is much faster than both computing the LP for the full game and the previous sequence-form double-oracle algorithm. On the poker variants, which generally have larger support, the algorithm does not show speedups in all cases compared to the full LP, but it still has much lower memory requirements, so it is possible to solve larger games using our approach.

2. TECHNICAL BACKGROUND

Adversarial situations with sequential moves can be modeled as extensive-form games (EFGs), visually represented as game trees. EFGs are sufficiently general to model stochastic changes in the environment, private information of players, and limited observability of the actions of the opponent. We study zero-sum two-player games, where a special Nature player is used to model stochastic events. An EFG is formally defined as follows (the outline of the main symbols is depicted in Table 1): $N$ is a set of two players, $N = \{1, 2\}$; we use $i$ to refer to one of the two players (either 1 or 2), and $-i$ to refer to the opponent of $i$. $H$ denotes a finite set of the nodes in the game tree.
Each node corresponds to a unique history of actions taken by all players and Nature from the root of the game; hence, we use the terms history and node interchangeably. We denote $Z \subseteq H$ as the set of all terminal nodes of the game. For each terminal node $z \in Z$ we define a utility function for each player $i$ ($u_i : Z \to \mathbb{R}$). Our games are zero-sum, so $u_i(z) = -u_{-i}(z)$ holds for all $z \in Z$. $A(h)$ represents the set of actions available in node $h \in H$; we denote by $ha = h' \in H$ the node $h'$ reached from node $h$ by performing an action $a \in A(h)$, and we say that $h$ is a prefix of $h'$. The function $P : H \to N \cup \{c\}$ assigns each node to the player who takes the action in the node; $c$ denotes the Nature player, which selects an action in the node according to a known fixed probability distribution.

We represent the uncertainty that each player has using information sets $\mathcal{I}_i$, which form a partition over the nodes assigned to player $i$ ($P(h) = i$). That is, every node in the game tree belongs to exactly one information set, and the player does not know which of the nodes in the information set he is currently in when playing the game. All nodes $h$ in a single information set $I_i^k \in \mathcal{I}_i$ have the same set of possible actions $A(h)$. We assume perfect recall, which means that all nodes in any information set $I_i^k$ have the same history of actions for player $i$ (i.e., players cannot misremember their own actions). We use function $C : H \to [0, 1]$ to denote the probability of reaching node $h$ if both players do their best to reach it, which is calculated as a product of the probabilities of all actions of the Nature player in history $h$.

2.1 Sequence Form LP Method

Solving a game for a Nash equilibrium involves computing a profile of strategies (a strategy for each player) in which each player plays the best response to the strategies of the other players. Formally, let $\Pi_i$ be the set of pure strategies for player $i$, i.e., a selection of exactly one action for each information set. A mixed strategy is a probability distribution over the set of all pure strategies of a player, and we denote by $\Delta_i$ the set of all mixed strategies of player $i$.
For any pair of strategies $\delta \in \Delta$ we denote by $u_i(\delta) = u_i(\delta_i, \delta_{-i})$ the expected outcome of the game for player $i$. A best response of player $i$ to the opponent's strategy $\delta_{-i}$ is a strategy $\delta_i^{BR}$ such that $u_i(\delta_i^{BR}, \delta_{-i}) \ge u_i(\delta_i', \delta_{-i})$ for all strategies $\delta_i' \in \Delta_i$. A strategy profile $\delta = (\delta_1, \delta_2)$ is in a Nash equilibrium if for each player $i$ it holds that $\delta_i$ is a best response to $\delta_{-i}$.

Extensive-form games with perfect recall can be represented using the compact sequence form [5, 12]. A sequence $\sigma_i$ is an ordered list of actions of player $i$ in a history $h$. The set of all possible sequences in a game for player $i$ is denoted by $\Sigma_i$ and the set of sequences for all players is $\Sigma = \Sigma_1 \times \Sigma_2$. A sequence $\sigma_i \in \Sigma_i$ can be extended by a single action $a$ taken by player $i$, denoted as $\sigma_i a = \sigma_i'$. In games with perfect recall, an action $a$ taken to extend sequence $\sigma_i$ uniquely identifies an information set $I_i^k$ of player $i$. We use the function $I_i(\sigma_i')$ to denote the information set in which the last action (i.e., $a$ in this case) was executed. Similarly, all nodes in an information set $I_i^k$ share the same sequence of actions of player $i$, and we use $seq_i(I_i^k)$ to denote this sequence. We overload the notation and also use $seq_i(h)$ to denote the sequence of actions of player $i$ leading to node $h$, and $seq_i(H') \subseteq \Sigma_i$, where $seq_i(H') = \bigcup_{h' \in H'} seq_i(h')$ for some $H' \subseteq H$. Finally, we define function $g_i : H \to \mathbb{R}$ that extends the utility function to all nodes by setting $g_i(h) = u_i(h) \cdot C(h)$ if $h \in Z$ and $g_i(h) = 0$ if $h$ is not a terminal node ($h \notin Z$).

We use function $\omega : \Sigma \to \mathcal{P}(H)$ to identify the set of nodes that can be reached using a fixed combination of sequences of the players. More precisely, nodes in $\omega(\sigma_i, \sigma_{-i})$ can be reached by sequential execution of actions in the given order, stopping when either the next action is not valid in the reached node, or a leaf is reached. In non-deterministic games the execution of a single combination of sequences for two players can reach different nodes depending on the actions of Nature. Thus, function $\omega$ returns the set of all nodes in $H$ that can be reached by execution of the sequences for the two players.
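The behavior of $\omega$ in a game with Nature moves can be illustrated with a small sketch. The tree, the node names, the action labels, and the `Node` and `omega` helpers below are all hypothetical and chosen only for illustration; the sketch assumes that execution also stops when the acting player's sequence has no actions left.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    player: str = "leaf"          # "1", "2", "c" (Nature) or "leaf"
    children: dict = field(default_factory=dict)   # action label -> Node


def omega(node, seq1, seq2):
    """Names of nodes reached by maximal execution of seq1 (player 1) and seq2 (player 2)."""
    if node.player == "leaf":
        return {node.name}
    if node.player == "c":
        # Nature: every chance outcome has to be followed.
        reached = set()
        for child in node.children.values():
            reached |= omega(child, seq1, seq2)
        return reached
    seq = seq1 if node.player == "1" else seq2
    if not seq or seq[0] not in node.children:
        # The next action is missing or not valid here: execution stops.
        return {node.name}
    child = node.children[seq[0]]
    if node.player == "1":
        return omega(child, seq1[1:], seq2)
    return omega(child, seq1, seq2[1:])


def leaf(name):
    return Node(name)


# Nature deals H or L; player 1 sees the card, player 2 answers a bet.
root = Node("root", "c", {
    "H": Node("H", "1", {
        "bH": Node("H.bH", "2", {"call": leaf("H.bH.call"), "fold": leaf("H.bH.fold")}),
        "fH": leaf("H.fH"),
    }),
    "L": Node("L", "1", {
        "bL": Node("L.bL", "2", {"call": leaf("L.bL.call"), "fold": leaf("L.bL.fold")}),
        "fL": leaf("L.fL"),
    }),
})

print(sorted(omega(root, ("bH",), ("call",))))
```

In this toy tree the pair of sequences reaches a leaf after Nature deals H, but stops in the inner node L after Nature deals L, because the action bH is not valid there. This is the situation in which a single pair of sequences maps to several nodes.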
Using the sequence form we can find a Nash equilibrium of a two-player zero-sum extensive-form game using a linear program (LP) (e.g., see [10, p. 135]). This is enabled by an equivalent compact representation of the mixed strategies of the players in the form of realization plans. A realization plan of a sequence $\sigma_i$ is the probability that player $i$ will play this sequence of actions under the assumption that the opponent will choose compatible actions that reach the information sets in which the actions specified in the sequence $\sigma_i$ are applicable. We denote the realization plans of player $i$ by function $r_i : \Sigma_i \to \mathbb{R}$, and they can be computed using an LP:

$$\min\; v_{I_i(\emptyset)}$$

$$v_{I_i(\sigma_i)} - \sum_{I_i^k \in \mathcal{I}_i \,:\, seq_i(I_i^k) = \sigma_i} v_{I_i^k} \;\ge\; \sum_{\sigma_{-i} \in \Sigma_{-i}} \; \sum_{h \in \omega(\sigma_i, \sigma_{-i})} g_i(h) \cdot r_{-i}(\sigma_{-i}) \qquad \forall \sigma_i \in \Sigma_i \tag{1}$$

$$r_{-i}(\emptyset) = 1 \tag{2}$$

$$\sum_{a \in A(I_{-i}^k)} r_{-i}(\sigma_{-i} a) = r_{-i}(\sigma_{-i}) \qquad \sigma_{-i} = seq_{-i}(I_{-i}^k),\; \forall I_{-i}^k \in \mathcal{I}_{-i} \tag{3}$$

$$r_{-i}(\sigma_{-i}) \ge 0 \qquad \forall \sigma_{-i} \in \Sigma_{-i} \tag{4}$$

A separate LP is constructed for computing the strategy of each player. It uses variables $v_{I_i^k}$ that represent the expected utility of player $i$ in an information set $I_i^k$, while variables $r_{-i}$ represent the strategy of the opponent in the form of a realization plan. The first equation (1) ensures the maximization of the expected utility of player $i$ for each information set, while the opponent is trying to minimize the utility by selecting the optimal realization plan, which is constrained by equations (2)-(4).

Table 1: Outline of the main used symbols.

  $H$                                      game-tree nodes / histories
  $Z \subseteq H$                          leafs / terminal states
  $\sigma_i' \in \Sigma_i'$                sequences of player $i$ in the restricted game
  $\phi_i' \in \Phi_i'$                    full-length sequences of player $i$ added by the best-response sequence algorithm
  $\omega : \Sigma \to \mathcal{P}(H)$     all nodes reached by the maximal sequential execution of actions in a pair of sequences
  $r_i : \Sigma_i \to \mathbb{R}$          realization plan of player $i$ for a sequence
  $C : H \to \mathbb{R}$                   probability of reaching a node w.r.t. Nature play
  $g_i : H \to \mathbb{R}$                 extension of the utility function to all nodes; $g_i(h) = u_i(h) \cdot C(h)$ if $h \in Z$, $g_i(h) = 0$ otherwise
  $\pi_i$                                  implicit default pure strategy for player $i$
  $seq_i$                                  sequence(s) of actions of player $i$ leading to a node / a set of nodes / an information set
  $I_i : \Sigma_i \to \mathcal{I}_i$       an information set in which the last action of the sequence was executed

3. DOUBLE-ORACLE METHOD

In this section we describe in detail the sequence-form double-oracle algorithm for EFGs and our main contributions. The basic structure of the algorithm consists of iterating through three main steps until convergence: (1) create a restricted game by limiting the set of sequences that each player is allowed to play; (2) compute a pair of Nash equilibrium strategies in this restricted game; (3) for each player, compute a best-response strategy against the equilibrium strategy of the opponent, which may be any sequence in the complete game. The best-response sequences found in step 3 are added to the restricted game and allowed in the next iteration. The algorithm terminates if the value of the best-response sequences against the equilibrium strategies does not improve the value of the equilibrium strategies in the restricted game.

3.1 Definitions

We now introduce the concept of full-length sequences that lead to a terminal state of the game, and which we denote $\Phi \subseteq \Sigma$.
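The three-step loop outlined at the start of Section 3 can be illustrated on a much simpler setting. The sketch below is not the sequence-form algorithm of this paper: it is a normal-form analogue on a payoff matrix, in the spirit of the oracle algorithms cited in the introduction, and it assumes that the restricted game is solved exactly by a tiny textbook simplex over rationals. The function names `solve_zero_sum` and `double_oracle`, the choice of the first row and column as the initial restricted game, and the example matrix are all assumptions made for illustration.

```python
from fractions import Fraction


def solve_zero_sum(A):
    """Exact solution of matrix game A (row player maximizes) with a small simplex."""
    m, n = len(A), len(A[0])
    shift = 1 - min(min(row) for row in A)            # make every payoff >= 1
    B = [[Fraction(a) + shift for a in row] for row in A]
    # Column player's LP: max sum(y) s.t. B y <= 1, y >= 0, in tableau form.
    T = [B[i] + [Fraction(int(i == k)) for k in range(m)] + [Fraction(1)]
         for i in range(m)]
    T.append([Fraction(-1)] * n + [Fraction(0)] * (m + 1))   # objective row
    basis = [n + i for i in range(m)]
    while True:
        # Bland's rule: the first column with a negative reduced cost enters.
        col = next((j for j in range(n + m) if T[m][j] < 0), None)
        if col is None:
            break
        rows = [i for i in range(m) if T[i][col] > 0]
        row = min(rows, key=lambda i: (T[i][-1] / T[i][col], basis[i]))
        piv = T[row][col]
        T[row] = [x / piv for x in T[row]]
        for i in range(m + 1):
            if i != row and T[i][col] != 0:
                f = T[i][col]
                T[i] = [x - f * y for x, y in zip(T[i], T[row])]
        basis[row] = col
    total = T[m][-1]                                  # equals 1 / (shifted value)
    y = [Fraction(0)] * n
    for i, b in enumerate(basis):
        if b < n:
            y[b] = T[i][-1] / total
    x = [T[m][n + i] / total for i in range(m)]       # dual values: row strategy
    return x, y, 1 / total - shift


def double_oracle(A):
    rows, cols = [0], [0]                             # (1) initial restricted game
    while True:
        sub = [[A[i][j] for j in cols] for i in rows]
        x, y, v = solve_zero_sum(sub)                 # (2) solve the restricted game
        # (3) best responses in the complete game against the restricted equilibrium
        row_vals = [sum(A[i][j] * yj for j, yj in zip(cols, y)) for i in range(len(A))]
        col_vals = [sum(A[i][j] * xi for i, xi in zip(rows, x)) for j in range(len(A[0]))]
        br_row = max(range(len(A)), key=row_vals.__getitem__)
        br_col = min(range(len(A[0])), key=col_vals.__getitem__)
        if row_vals[br_row] <= v and col_vals[br_col] >= v:
            return rows, cols, x, y, v                # no best response improves
        if row_vals[br_row] > v:
            rows.append(br_row)
        if col_vals[br_col] < v:
            cols.append(br_col)


game = [[3, 1, 4],
        [2, 0, 1],
        [5, 2, 6]]
print(double_oracle(game))
```

On the example matrix the loop stops without ever adding the middle row or the last column, which mirrors the observation that a solution can often be found by exploring only part of the strategy space. The difficulties treated in the remainder of Section 3 (compatibility of sequences, temporary leaves) have no counterpart in this matrix setting.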
The restricted game is defined by a limited set of allowed sequences for each player, which are used to construct the sequence-form linear program (LP). We denote the set of restricted sequences as $\Sigma' \subseteq \Sigma$, and we overload the notation and also use $\Sigma'$ to refer to the restricted game corresponding to this set of sequences. The following section explains how this subset of sequences can be constructed based on the limited subset of full-length sequences $\Phi' \subseteq \Phi$ using the best-response methods.

Our algorithm also needs to use partial strategies from the restricted game in the context of the complete game. We extend the strategies from the restricted game to be defined in the complete game using the concept of a default pure strategy for player $i$, denoted $\pi_i \in \Pi_i$. This strategy is an arbitrary, fixed, pure strategy that can be implicitly defined for all information sets of player $i$ (e.g., take the first action from each $A(h)$). The default strategy is used only where the strategy from the restricted game is not defined in the complete game.

3.2 Upper-bound Validity Algorithm

In [1] the authors show that a naïve implementation of the sequence-form double-oracle method may result in an incorrect solution. If we simply add best-response sequences to the restricted game, the game can become malformed because of incompatibilities among the sequences, resulting in incorrect solutions. We now describe the situations that lead to these inconsistencies in more detail, and introduce a novel solution to this problem.

Figure 1: An extensive-form game between two players, circle and box. The dashed boxes indicate the information sets.

Inconsistencies in the Restricted Game

Consider the game depicted in Figure 1 and assume the restricted game is defined by sequences $\Sigma'_{\circ} = \{\emptyset, A, AC\}$ for the circle player and $\Sigma'_{\square} = \{\emptyset, y, yu\}$ for the box player. Now assume that the circle player adds sequences $B$, $BE$ based on its best-response calculation and we need to construct a new restricted game from sequences $(\Sigma'_{\circ} \cup \{B, BE\}, \Sigma'_{\square})$.
Executing sequences $BE$ and $yu$ together ends in an inner node $h = ByE$ in which the box player does not have any action available in the restricted game. Having nodes such as $h$ with no valid continuation in the restricted game causes inconsistencies because the LP does not know how to assign a value to these interior nodes.

In general the problem appears whenever the best-response algorithm adds a sequence $\sigma_i$ to a restricted game that already contains some sequence $\sigma_{-i}$ for the opponent, such that the execution of these two sequences ends in an inner node $h$ assigned to the player $-i = P(h)$, from which there is no continuation in the restricted game. That is, there is no sequence of player $-i$ in the restricted game that reaches $h$ and contains a valid action to take in $h$. A continuation for $h$ does exist in the complete game; otherwise it would be a terminal node and would have a utility assigned to it. The problem with having interior nodes with no continuations in the restricted-game LP is that the LP assigns a value of 0 to these nodes. This can result in the LP finding an optimal solution that ends in this node, and player $-i$ may not choose to add a new sequence to extend beyond node $h$ if all possible continuations have very low values (less than 0), leading to incorrect termination of the double-oracle algorithm.

The method introduced in [1] solved this problem by adding additional sequences to the restricted game that provide valid continuations for all interior nodes. However, this method has large computational inefficiencies, and was the major bottleneck in the algorithm. We introduce a new approach that treats these interior nodes as temporary leaf nodes in the restricted game, and assigns them temporary utility values that guarantee convergence of the algorithm to an equilibrium. For convergence, these temporary values must be lower bounds on the utility for the player who is moving in $h$ (and has no valid continuation), and therefore an upper bound for player $i$. This accomplishes two things.
First, player $i$ will play in the restricted game so as to end in node $h$ if there is any chance that there is a better outcome based on a continuation of this node. Second, player $-i$ will add a best-response sequence continuing from this node if it occurs with non-zero probability in the solution to the restricted game, and there exists a better strategy for $-i$ that can improve over the pessimistic value assumed for this node in the current restricted game solution.

To find a temporary value that meets these criteria, we compute the utility of the best response of player $i$ to the default strategy of the opponent. This is an upper bound on the value of node $h$ for player $i$ because the utility of the best response to the arbitrary default strategy $\pi_{-i}$ will always be at least as good as if $-i$ was playing an intelligent strategy instead of the default. If we consider again the example from Figure 1, the node $ByE$ will represent a temporary leaf in the restricted game and the temporary value will be equal to an upper-bound estimate from the perspective of the circle player.

Managing the Set of Temporary Leaves

Calculating the values of the temporary leaf nodes and updating them as new sequences are added to the restricted game requires some bookkeeping in the algorithm. First, we formalize the process of constructing the sequence-form LP for the restricted game (CoreLP). The set $\Phi'$ represents the set of full-length sequences added to the restricted game by the best-response algorithms, and the set $\Sigma'$ represents the set of sequences that are used to create the CoreLP. Intuitively, the sequences in $\Sigma'$ always form a maximal compatible part of the game given by a set of full-length sequences $\Phi'$, so $\Sigma'$ is maximal, such that:

$$\Sigma' \subseteq \{\mathrm{getAllPrefixes}(\phi) : \phi \in \Phi'\} \tag{5}$$

$$\forall \sigma_i \in \Sigma_i' \;\; \exists \sigma_{-i} \in \Sigma_{-i}' \;\text{ s.t. }\; \sigma_i \in seq_i(\omega(\sigma_i, \sigma_{-i})). \tag{6}$$

This means that for each sequence $\sigma_i$ in $\Sigma'$ there exists a compatible sequence of the opponent $\sigma_{-i}$ that allows an execution of the sequence $\sigma_i$ in full. Given the definition of $\Sigma'$, the restricted game can now have temporary leaves in place of inner nodes of the complete game, and we define the set $L \subseteq H$ to represent these temporary leaves. For each temporary leaf $h$, assigned to player $-i$, we define a temporary utility equal to the value of the best response of player $i$ to the default strategy of the opponent. The sequences of both players leading to node $h$ are extended to full-length sequences using either the default strategy (for player $-i$) or the best-response strategy (for player $i$). We denote these extended sequences as $\Phi_{BR(h)} \subseteq \Phi$.
Now, the temporary utilities are incorporated into the CoreLP using a modified function $g_i'$, defined as follows:

$$g_i'(h) = \sum_{\phi_{BR(h)} \in \Phi_{BR(h)}} \; \sum_{h' \in \omega(\phi_{BR(h)})} g_i(h') \tag{7}$$

if $h \in L$, and we set $g_i'(h) = g_i(h)$ otherwise.

In the implementation, we maintain some additional information for each temporary leaf $h$, including the set of full-length sequences of player $i$ that were added by the best-response sequence algorithm and which end in $h$, with a corresponding sequence from the default strategy of player $-i$. In general there may be multiple such sequences for each $h$, which form the set $\Phi_{BR(h)}$. This information is useful for updating the temporary leaves as new sequences are added. If a new sequence $\sigma_{new}$ has an action that is applicable in node $h$, then $h$ will no longer be a temporary leaf and it will be removed from the set $L$. However, new temporary leaves could be created somewhere in the sub-tree rooted in node $h$. In this case, we can reuse the sequences from $\Phi_{BR(h)}$ to quickly estimate the temporary utilities of these new temporary leaves.

3.3 Best-response Sequence Algorithms

The goal of the best-response sequence (BRS$_i$) algorithm is to generate new sequences that will be added to the restricted game in the next iteration, or to prove that no more sequences need to be added. Throughout this section we use the term searching player to represent the player for whom the algorithm computes the best-response sequence. We denote this player as $i$. First, we describe the basic steps of the best-response sequence algorithm, and then we focus on domain-independent pruning techniques.

Game-tree Search Algorithm

The BRS$_i$ algorithm returns a pure strategy that is a best response to the input strategy of the opponent, completed by his default strategy where it is not defined. The algorithm returns both the set of sequences forming this strategy and the expected value of this strategy against the completed realization plan of the opponent.
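A best-response search of the kind this subsection describes can be sketched as follows. This is a minimal illustration under several assumptions, not the implementation evaluated in this paper: the opponent's fixed strategy and Nature's distribution are both stored as edge probabilities on "fixed" nodes, no pruning is performed, and the `Node`, `annotate`, and `best_response` names as well as the toy game are hypothetical. As in the text, leaf values are weighted by the probability of reaching them, and the action of the searching player is chosen once per information set by summing over all nodes of that set.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                         # "leaf", "fixed" (opponent or Nature), "search"
    children: dict = field(default_factory=dict)   # action -> Node
    probs: dict = field(default_factory=dict)      # action -> probability ("fixed" only)
    utility: float = 0.0              # utility of the searching player in a leaf
    infoset: str = ""                 # information-set label of a "search" node
    weight: float = 0.0               # reach probability, filled in by annotate()


def annotate(node, weight, infosets):
    """Store the reach probability due to Nature and the opponent in every node."""
    node.weight = weight
    if node.kind == "search":
        infosets[node.infoset].append(node)
    for action, child in node.children.items():
        p = node.probs[action] if node.kind == "fixed" else 1.0
        annotate(child, weight * p, infosets)


def best_response(root):
    infosets = defaultdict(list)
    annotate(root, 1.0, infosets)
    choice = {}                       # information set -> selected action

    def value(node):
        if node.kind == "leaf":
            return node.weight * node.utility          # already weighted value
        if node.kind == "fixed":
            return sum(value(c) for c in node.children.values())
        if node.infoset not in choice:
            members = infosets[node.infoset]
            totals = {a: sum(value(m.children[a]) for m in members)
                      for a in node.children}
            choice[node.infoset] = max(totals, key=totals.get)
        return value(node.children[choice[node.infoset]])

    return value(root), choice


def leaf(u):
    return Node("leaf", utility=u)


# The opponent secretly picks L (0.7) or R (0.3); the searching player then
# guesses l or r in a single information set and wins 1 on a correct guess.
root = Node("fixed", probs={"L": 0.7, "R": 0.3}, children={
    "L": Node("search", infoset="I1", children={"l": leaf(1.0), "r": leaf(-1.0)}),
    "R": Node("search", infoset="I1", children={"l": leaf(-1.0), "r": leaf(1.0)}),
})
print(best_response(root))
```

The sketch recomputes subtree values without caching, which is acceptable only for toy trees; the pruning techniques discussed later in this section address the cost of searching the complete game tree.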
The algorithm is based on a depth-first search that traverses the complete game tree, in which the behavior of the opponent is fixed either to the strategy given by the realization plan $r'_{-i}$ from the CoreLP solution in information sets already included in the restricted game, or to the default strategy $\pi_{-i}$. The behavior of the depth-first search algorithm in each currently evaluated node $h$ depends primarily on the player to which the node is assigned (searching player, opponent, or Nature).

If node $h$ is assigned to Nature (i.e., it is a chance node), the method recursively evaluates the succeeding nodes, computes the expected value of node $h$ as a sum of the values of the successors weighted by the fixed probability distribution associated with node $h$, and propagates this expected value to the predecessor. If node $h$ is assigned to the opponent, the algorithm acts similarly, but the probability distribution is given either by the realization plan of the opponent $r'_{-i}$, or by the default strategy $\pi_{-i}$. Finally, if the algorithm evaluates node $h$ associated with the searching player, the algorithm selects the value of the best action played in the information set $I_i^k$ to which this node $h$ belongs. The probability of being in a specific node $h' \in I_i^k$ in this information set is given by the probability of the Nature play $C(h')$, and again either by the realization plan of the opponent $r'_{-i}(seq_{-i}(h'))$, or by the default strategy $\pi_{-i}$. Therefore, by applying each action $a \in A(h)$ in every node $h' \in I_i^k$ the algorithm finds the best action with the maximal expected utility.

Pruning Techniques

Since the best-response sequence (BRS$_i$) algorithm operates on the (generally large) game tree of the complete game, its performance can be substantially improved by ensuring that irrelevant branches are not evaluated. To do this, we introduce several domain-independent pruning techniques. First, the method of selecting a fixed action according to the default strategy, described in the previous subsection for nodes that belong to the opponent and are not included in the restricted game, results in pruning that significantly reduces the searched space.
Second, we can estimate the value for the searching player in nodes assigned to the opponent that are included in the set of temporary leaves of the restricted game. The algorithm has already calculated an upper-bound value for these temporary leaves according to the sequences of the searching player and the default strategy of the opponent. Therefore, the searching player can re-use this information directly without a need to search the state space beyond this node.

Finally, we introduce pruning for the nodes that are assigned to the searching player (i.e., $P(h) = i$, $h \in I_i^k$) and that are a part of the restricted game. The pruning relies on the fact that (1) it evaluates the nodes from the information set $I_i^k$ in an ordered fashion that is given by the probability of these nodes, and (2) we can prune the evaluation of some branches if the selection of the best action with maximal expected utility is certain. The pseudocode of this part of pruning is depicted in Figure 2. It uses BRS$_i(h)$ to denote the recursive call of the best-response sequence algorithm in node $h$.

Require: $i$ - searching player, $h$ - current node, $I_i^k$ - current information set, $r'_{-i}$ - opponent's strategy, Min/MaxUtility - bounds on utility values
 1: $H' \leftarrow \{h' ;\; h' \in I_i^k\}$
 2: sort $H'$ descending according to value $r'_{-i}(seq_{-i}(h')) \cdot C(h')$
 3: $w \leftarrow \sum_{h' \in H'} r'_{-i}(seq_{-i}(h')) \cdot C(h')$
 4: maxval $\leftarrow -\infty$
 5: for $h' \in H'$ do
 6:   if (maxval $-$ secmaxval) $> w \cdot$ (MaxUtility $-$ MinUtility) then
 7:     $v^h_{maxaction} \leftarrow v^h_{maxaction} +$ BRS$_i(h'\,maxaction)$
 8:   else
 9:     for $a \in A(h')$ do
10:       $v^h_a \leftarrow v^h_a +$ BRS$_i(h'a)$
11:     end for
12:     maxaction $\leftarrow \arg\max_{a \in A(h')} v^h_a$
13:     maxval $\leftarrow v^h_{maxaction}$
14:     secmaxval $\leftarrow \max_{a \in A(h') \setminus \{maxaction\}} v^h_a$
15:   end if
16:   $w \leftarrow w - r'_{-i}(seq_{-i}(h')) \cdot C(h')$
17: end for
18: return $v^h_{maxaction}$

Figure 2: Pruning in the nodes of the searching player.

First, we sort all the nodes in this information set based on the probability of occurrence (set $H'$), which is determined by the realization plan of the opponent as well as moves by Nature (lines 1-2). Then, the algorithm recursively evaluates the nodes in the set $H'$ (line 10) and for all actions it calculates the expected utility value (the recursive call returns the utility value already weighted by the probabilities). The pruning occurs if the choice of the best action cannot be changed; more formally, if the maximal expected value cannot be overcome in the remaining nodes given the remaining probabilities for these nodes (denoted as $w$) and the bounds for the minimal and maximal utility values in the game (line 6). If pruning is enabled, for the remaining nodes in $H'$ the algorithm evaluates only the values obtained by executing action maxaction (line 7), and these values are returned when one of these nodes is reached by the BRS$_i$ algorithm.

4. THEORETICAL ANALYSIS

In this section we prove that our sequence-form double-oracle algorithm will always converge to a Nash equilibrium of the complete game.
First, we formally define the strategy computed by the best-response sequence (BRS$_i$) algorithm, then we prove lemmas about the basic characteristics of the BRS strategies, and finally we prove the main convergence result. In the proof we use the concept of an extension of a realization plan from a restricted game to the full game using a default strategy outside of the restricted game. Formally, if $r_i'$ is a mixed strategy of player $i$ in the form of a realization plan in a restricted game $\Sigma'$, then we define $\bar{r}_i'$ to be a strategy identical to $r_i'$ on sequences in $\Sigma_i'$ and identical to the default strategy $\pi_i$ on the remaining sequences from $\Sigma_i$ that are not in $\Sigma_i'$.

Lemma 4.1. Let $r'_{-i}$ be a realization plan of player $-i$ on some restricted game $\Sigma'$. BRS$_i(r'_{-i})$ then returns a strategy $q_i$ in the complete game, such that $q_i$ is a pure best-response strategy to $\bar{r}'_{-i}$. The value returned by the algorithm is the value of executing the pair of strategies, $u_i(q_i, \bar{r}'_{-i})$.

Proof. BRS$_i(r'_{-i})$ searches the game tree and takes the action that maximizes the value of the game for player $i$ in all information sets $I_i$ assigned to player $i$. In the opponent's nodes, it takes the expected value of playing $r'_{-i}$ where it is defined, and the value of playing the pure action of the default strategy $\pi_{-i}$ where $r'_{-i}$ is not defined. In chance nodes, it returns the expected value of the node as the sum of the successors weighted by their probabilities. In each node $h$, if the successors have the maximal possible value for $i$, then node $h$ also has the maximal possible value for $i$ (when playing against $\bar{r}'_{-i}$). The selections in the nodes that belong to $i$ achieve this maximal value; hence, they form the best response to strategy $\bar{r}'_{-i}$.

For brevity we further use $v(\mathrm{BRS}_i(r'_{-i}))$ to denote the value returned by the BRS algorithm, which is equal to $u_i(q_i, \bar{r}'_{-i})$.

Lemma 4.2. Let $r'_{-i}$ be a realization plan of player $-i$ on some restricted game $\Sigma'$ and let $v_i^*$ be the value of the complete game $\Sigma$ for player $i$; then $v(\mathrm{BRS}_i(r'_{-i})) \ge v_i^*$.

Proof. The value of the game ($v_i^*$) is the value of the best response for player $i$ in the restricted game against $r'_{-i}$.
From Lemma 4.1, we know that $v(\mathrm{BRS}_i(r'_{-i}))$ is the value of the best response to strategy $\bar{r}'_{-i}$ for player $i$. Since the best response in restricted game $\Sigma'$ to $r'_{-i}$ is also a possible response to the extended strategy $\bar{r}'_{-i}$, the value of the best response to $\bar{r}'_{-i}$ must be at least $v_i^*$, and could be greater since the responding player has strictly more strategies to choose from in the unrestricted game.

Lemma 4.3. Let $r'_{-i}$ be a realization plan of player $-i$ returned by the CoreLP for some restricted game $\Sigma'$ and let $v_i^{LP}$ be the value of the restricted game returned by the CoreLP; then $v(\mathrm{BRS}_i(r'_{-i})) \ge v_i^{LP}$.

Proof. $r'_{-i}$ is a part of the Nash equilibrium strategy in a zero-sum game that guarantees value $v_i^{LP}$ in $\Sigma'$. If the best-response computation in the complete game selects only actions from $\Sigma'$, it creates the best response in game $\Sigma'$ as well, obtaining value $v_i^{LP}$. If the best response selects an action that is not included in $\Sigma'$, there are two cases.

Case 1: The best-response strategy uses an action in a temporary leaf of $\Sigma'$. Player $i$ makes the decision in the leaf, because otherwise the value of the temporary leaf would be directly returned by BRS. The value of the temporary leaf has been under-estimated for $i$ in the LP computation and it is over-estimated in the BRS computation as the best response to the default strategy $\pi_{-i}$. The value of the best response can only increase by including this action.

Case 2: The best-response strategy uses an action not included in $\Sigma'$ in an internal node of the game. This can occur only in nodes assigned to player $i$, because the actions of $-i$ leading out of $\Sigma'$ have probability zero in $\bar{r}'_{-i}$. BRS takes the maximum in the nodes assigned to player $i$, so the reason for selecting an action leading outside $\Sigma'$ is that it has a value greater than or equal to that of the best action in $\Sigma'$.

Lemma 4.4. Under the assumptions of the previous lemma, if $v(\mathrm{BRS}_i(r'_{-i})) > v_i^{LP}$ then it returns sequences that extend game $\Sigma'$ in the next iteration.

Proof.
Based on the proof of the previous lemma, BRS for player $i$ can improve over the value of the LP ($v_i^{LP}$) only by an action $a$ not present in $\Sigma'$ performed in a node $h$ included in $\Sigma'$, in which $i$ makes the decision. Let $(\sigma_i, \sigma_{-i})$ be the pair of sequences leading to $h$. Then, in the construction of the restricted game in the next iteration, sequence $\sigma_{-i}$ is the sequence that ensures that $\sigma_i a$ can be executed in full and will be part of the restricted game.

Theorem 4.5. The double-oracle algorithm for extensive-form games described in the previous section stops if and only if

$$v(\mathrm{BRS}_i(r'_{-i})) = -v(\mathrm{BRS}_{-i}(r'_i)) = v_i^{LP} = v_i^*, \tag{8}$$

which always happens after a finite number of iterations, and strategies $(\bar{r}'_i, \bar{r}'_{-i})$ are a Nash equilibrium of the complete game.

Proof. First we show that the algorithm continues until all equalities (8) hold. If $v(\mathrm{BRS}_i(r'_{-i})) \neq -v(\mathrm{BRS}_{-i}(r'_i))$, then from Lemma 4.2 and Lemma 4.4 we know that for some player $i$ it holds that $v(\mathrm{BRS}_i(r'_{-i})) > v^{LP}$; thus the restricted game in the following iteration is larger by at least one action and the algorithm continues. In the worst case, the restricted game equals the complete game, $\Sigma' = \Sigma$, and it cannot be extended any further. In this case the BRS cannot find a better response than $v$ and the algorithm stops due to Lemma 4.4. If the condition in the theorem holds, it is clear we have found a NE in the complete game, because from Lemma 4.1 we know that $q = \mathrm{BRS}_i(r'_{-i})$ is the best response to $\bar{r}'_{-i}$ in the complete game. Moreover, if the value of the best response to a strategy in a zero-sum game is the value of the game, then the strategy is optimal and it is a part of a Nash equilibrium of the game.

5. EXPERIMENTS

We experimentally evaluate the performance of the described sequence-form double-oracle algorithm with novel methods for maintaining a valid restricted game and computing best-response sequences. Since both novel methods are complementary, we experimentally analyze the impact of each of them separately. We run the first set of experiments on the search games used for experimental evaluation in [1]; the second set of experiments is on variants of a simplified poker game inspired by Leduc Hold'em [13]. The test games have differing characteristics in terms of stochasticity (the first one is deterministic, while the second includes chance nodes) and private information of the players (the players cannot observe the actions of the opponent in the first game, whereas the private information is completely determined by the moves of Nature in the second game). Also, these games differ greatly in the size of the support of Nash equilibria (i.e., the number of sequences actually used with non-zero probability in a Nash equilibrium). While there are only a few sequences in the support of the equilibrium in the first game, the size of the support is considerably higher in poker games.
Such differing characteristics give us an opportunity to present a broader evaluation of the performance of our algorithm and determine both strengths and current limitations. Note that all algorithms were implemented using a generic framework for modeling arbitrary extensive-form games, and none of the algorithms uses any domain-specific knowledge. Both the validity and best-response algorithms we present could be further enhanced with domain-dependent knowledge, which could significantly improve the runtime. However, our focus here is on generic methods. In all of the experiments we used a single thread on an Intel i7 CPU running at 2.8 GHz. Each of the algorithms was given a maximum of 10 GB of memory for Java heap space, and we used IBM CPLEX 12.2 for solving the linear programs.

5.1 Experiment Settings

Search Games. The search game contains two players: the patroller (or the defender) and the evader (or the attacker). The game is played on a directed graph (see Figure 3), where the evader aims to cross safely from a starting node (E) to a destination node (D), and the defender moves in the intermediate nodes (the shaded areas) trying to capture the evader (i.e., to be at the same node as the evader). The defender controls two units, and during each turn both players move their units simultaneously from the current node to an adjacent node, or stay in the same location. The only exception is the evader, who cannot stay in the two leftmost nodes. If a pre-determined number of turns is made without either player winning, the game is a draw.

Figure 3: Two variants of the graph used in the experiments; we refer to them as G1 (left) and G2 (right).

Players are unaware of the location and the actions of the other player, with one exception: the evader leaves tracks in the visited nodes that can be discovered if the defender visits the nodes later. The game also includes an option for the attacker to avoid leaving tracks using a special move (termed slow move) that requires two turns (the evader removes the tracks in a node in one turn).
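The simultaneous-move structure of the search game can be sketched as follows. This is a minimal illustration under stated assumptions: the graph `GRAPH` is a made-up three-node example (not G1 or G2 from Figure 3), the helper names `unit_moves`, `joint_moves`, and `captured` are hypothetical, and tracks, slow moves, and the restriction of patrolling units to shaded areas are not modeled.

```python
from itertools import product

# Hypothetical directed graph given as adjacency lists; not a graph from Figure 3.
GRAPH = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
}


def unit_moves(graph, node, may_stay=True):
    """Nodes a single unit can occupy after one turn."""
    moves = list(graph[node])
    if may_stay:
        moves.append(node)  # staying in place counts as a move
    return moves


def joint_moves(graph, evader, defenders, evader_may_stay=True):
    """All simultaneous moves: one choice for the evader and one per defender unit."""
    options = [unit_moves(graph, evader, evader_may_stay)]
    options += [unit_moves(graph, d) for d in defenders]
    return list(product(*options))


def captured(evader, defenders):
    """The evader is captured when it shares a node with a defender unit."""
    return evader in defenders


moves = joint_moves(GRAPH, "a", ["b", "c"])
print(len(moves))  # product of the per-unit option counts
```

Because the number of joint moves is the product of the options of all units, even a small graph yields a large branching factor per turn.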
Figure 3 shows two examples of the graphs used in the experiments. The patrolling units can move only in the shaded areas (P1, P2). Even though the graph is small, the concurrent movement of all units implies a large branching factor (up to 50 for one turn) and thus large game trees (up to … nodes). In the experiments we varied the graph (G1 and G2 from Figure 3, and G3, which has no edges in the middle column), the maximal number of turns of the game (from 3 to 7), and the option for the attacker to remove the tracks (SA if slow moves were allowed, SD otherwise).

Poker Games. Secondly, we use variants of a simplified two-player poker game inspired by Leduc Hold'em. Each player starts with the same amount of chips, and both players are required to put some number of chips in the pot (called ante). In the next step, the Nature player deals a single card to each player (the opponent is unaware of the card) and the betting round begins. A player can either fold (the opponent wins the pot), check (let the opponent start the round), bet (add some amount of chips as first in the round), call (add the amount of chips equal to the last bet of the opponent into the pot), or raise (match and increase the bet of the opponent). If no further raise is made by any of the players, the betting round ends, the Nature player deals one card on the table, and the second betting round with the same rules begins. After the second betting round ends, the outcome of the game is determined: a player wins if (1) her private card matches the table card and the opponent's card does not match, or (2) none of the players' cards matches the table card and her private card is higher than the private card of the opponent. If no player wins, the game is a draw and the pot is split.
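The rule for determining the winner after the second betting round can be written down directly. The sketch below is only an illustration: the function name `showdown` and the encoding of cards as integer ranks (higher is better, equal ranks denote the same card type) are assumptions, and betting and pot sizes are not modeled.

```python
def showdown(card1, card2, table):
    """Return 1 if player 1 wins, -1 if player 2 wins, and 0 for a draw.

    Cards are integer ranks; two cards "match" when their ranks are equal.
    """
    match1 = card1 == table
    match2 = card2 == table
    # Rule (1): exactly one private card matches the table card.
    if match1 != match2:
        return 1 if match1 else -1
    # Rule (2): no card matches the table card and one private card is higher.
    if not match1 and card1 != card2:
        return 1 if card1 > card2 else -1
    # Otherwise no player wins and the pot is split.
    return 0


print(showdown(2, 1, 2))  # player 1 pairs the table card
print(showdown(1, 3, 2))  # nobody pairs; player 2 holds the higher card
print(showdown(1, 1, 3))  # identical ranks, no pair: draw
```

Both private cards can match the table card only when there are at least three cards of a type; under the rules as stated no player wins in that case, so the function returns a draw.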
In the experiments we varied the number of types of cards (ranging from 3 to 5; there are 3 types of cards in Leduc), the number of cards of each type (ranging from 2 to 3; set to 2 in Leduc), the maximum length of a sequence of raises in a betting round (ranging from 1 to 4; set to 1 in Leduc), and the number of different sizes of bets (i.e., the amount of chips added to the pot) for bet/raise actions (ranging from 1 to 4; set to 1 in Leduc).

5.2 Results

Search Games. In the search game we measured the impact of each of the methods proposed in this paper. Therefore we compared 4 different instances of the double-oracle algorithm, which use different versions of the validity algorithm and the best-response sequence algorithm. The names of the algorithms are composed as follows: the DO prefix denotes a double-oracle algorithm; UBva denotes the upper-bound validity algorithm presented in this paper; SAva denotes the sequence-adding validity algorithm presented in [1]; NEWbrs denotes the best-response sequence algorithm presented in this paper; and ORGbrs denotes the original best-response sequence algorithm presented in [1]. FullLP denotes constructing and solving the full sequence-form linear program for the complete game.

Table 2: Time breakdown for scenario G1-SD of the search game; columns VA, BR, LP refer to the cumulative time spent in the validity algorithm, the best-response sequence algorithm, and the method for generating and solving the CoreLP. (Columns: Algorithm, Iterations, |Σ'|, VA [s], BR [s], LP [s]; rows: FullLP, DO-SAva-ORGbrs, DO-SAva-NEWbrs, DO-UBva-ORGbrs, DO-UBva-NEWbrs.)

In all scenarios all variants of the double-oracle algorithm were able to find the exact solution by adding only a small fraction of the sequences to the restricted game. For the largest scenarios the size of the restricted game was 0.1%–0.4% of the complete game. These results show that the double-oracle algorithms are particularly useful for games with small support of Nash equilibria, where they are able to quickly find the sequences actually needed for the solution without any domain-dependent knowledge.

The results for the selected scenarios of the search game with the number of turns fixed to 7 are depicted in Figure 4 (missing values for FullLP indicate that the algorithm ran out of memory). The overview of the complete running time of the algorithms shows that the novel UBva algorithm dramatically speeds up the overall running time (note the logarithmic scale), and the double-oracle algorithms with UBva significantly outperform the other algorithms even on smaller instances of the search game. The impact of the new validity algorithm is demonstrated on the time breakdown for the selected scenario G1-SD (see Table 2). The new validity algorithm increases the number of iterations somewhat; however, the time needed for keeping the restricted game valid is dramatically lower.
In addition, since we do not add additional sequences beyond the best responses, the restricted game is smaller, so generating and solving the linear program is faster as well. The novel NEWbrs algorithm also typically improves the runtime (it significantly speeds up calculating the best response). However, it slows down the overall performance of the algorithm in some cases. The reason lies in the increased number of iterations in comparison to the original ORGbrs: for these instances of graphs the original optimistic BRS algorithm quickly found the correct sequences that form the support of the Nash equilibrium.

Figure 4: Running times for FullLP and variants of the DO algorithm on different scenarios of the search game (G1–G3 denote the graph; SA (SD) denotes that slow moves were allowed (disallowed)). Missing values for FullLP indicate that the algorithm ran out of memory. (Vertical axis: time [s], log scale; horizontal axis: search game scenarios G1-SD, G2-SD, G3-SD, G1-SA, G2-SA, G3-SA.)

Poker Games. For variants of simplified poker games we compared only the best algorithm from the previous set of experiments, the double-oracle method with the upper-bound validity algorithm and the new best-response algorithm that utilizes default strategies (DO-UBva-NEWbrs; from now on we refer to this algorithm only as DO), along with the FullLP algorithm. Contrary to the previous game, the variants of poker have relatively large support, so the running-time performance of the DO algorithm was typically somewhat worse. For the instances of experimental settings where we primarily increased the number of cards and card types, the size of the final restricted game was 44%–65% of the size of the complete game (up to 4481 sequences for each player in this case), and it took the double-oracle algorithm roughly 2–10 times the amount of time needed by the FullLP to find the exact solution (for example, 6 seconds for FullLP compared to 60 seconds for DO on the instance with 5 types of cards, 3 cards of each type, and 4 different betting values in the second round).

The situation changes when we increase the depth of the game tree by allowing players to re-raise the opponent's bet, and increase the branching factor for the players' actions by increasing the number of different betting values for the actions bet and raise in both rounds. The size of the restricted game dropped to 2%–5% for the largest instances solvable by both algorithms, and FullLP was quickly unable to construct and solve the linear program due to the memory restrictions. Figure 5 shows the scaling of the memory consumption for FullLP and the DO algorithm; we fixed the number of types of cards to 3 and the number of cards of each type to 2. We can see that with increasing depth of the game tree (increasing number of re-raise actions, denoted as R1-R4) and increasing branching factor at nodes assigned to players (differing number of betting amounts, denoted as B1-B4) the memory consumed by both algorithms grows exponentially. However, DO requires orders of magnitude less memory, and we were able to find the exact solution to much larger instances of poker games compared to the FullLP.

Figure 5: Memory consumption for instances of poker games (R denotes the number of re-raise actions, B the number of possible betting amounts): solid lines represent FullLP, dashed lines the DO algorithm. Missing values for FullLP indicate that the algorithm ran out of memory. (Vertical axis: memory consumed [GB]; horizontal axis: number of different values for bet/raise actions, B1 to B4; series: FullLP-R1 to FullLP-R4 and DO-R1 to DO-R4.)

Table 3: Sizes of the restricted sets of sequences (% of the complete game) and the number of support sequences with increasing depth.

Case B2    |Σ'_1|       |Σ'_2|        Support P1    Support P2
DO-R1      … (67%)      1576 (69%)    …             …
DO-R2      … (34%)      4448 (39%)    …             …
DO-R3      … (13%)      7118 (14%)    …             …
DO-R4      … (3%)       8579 (4%)     …             …

Table 3 shows the exact sizes of the sets of sequences in the restricted game $\Sigma'$ for a fixed number of betting amounts B2 and increasing depth R1-R4. We can see that although the size of the restricted game increases, it corresponds to smaller fractions of the complete game. Similarly, the size of the support in the restricted game increases, but we can see that there is an increasing number of sequences in $\Sigma'$ that are not actually used in the support. By reducing the number of sequences added by the best-response algorithms, we can reduce the overall number of iterations of the DO algorithm, and thus further improve the performance.

6. DISCUSSION AND CONCLUSION

In this paper we present a sequence-form double-oracle algorithm for computing exact solutions for two-player zero-sum extensive-form games with imperfect information. We extend the double-oracle framework to non-deterministic games, provide new algorithms for key parts of the double-oracle framework, and present a theoretical proof that our double-oracle algorithm converges to a Nash equilibrium of the game. We experimentally evaluated our algorithm on two classes of games and show significant speed-up in running time for games with small-support Nash equilibria, and significant reduction in memory consumption even for games with large support, in comparison to state-of-the-art exact algorithms.

The benefits of the approach presented in this paper are twofold. Firstly, it computes exact Nash equilibria for extensive-form games prior to constructing the complete game by identifying the sequences of promising actions that the players should play without any domain knowledge. Secondly, our approach decomposes the problem of computing a Nash equilibrium into separate sub-problems.
Therefore, it can be seen as a framework in which the domain-independent validity and/or best-response sequence algorithms can be replaced with domain-specific and thus much faster implementations (e.g., exploiting new efficient best-response algorithms for poker games [4] would bring significant speed-up in these games). The presented formal analysis identifies the key properties that these domain-specific implementations need to satisfy to guarantee the convergence to the correct solution of the game.

The algorithm presented in this paper stimulates a large volume of possible future work. Besides applications to specific domains, several theoretical questions need to be investigated. First of all, the performance of the double-oracle algorithm currently depends on the number of iterations. We intend to exploit theoretic characteristics of the structure of the support of Nash equilibria in extensive-form games and improve the methodology of adding sequences in order to lower the number of iterations. Secondly, we plan to investigate the possibility of adapting this framework for general-sum games, in which pruning techniques and upper-bound calculations are much more complicated. Finally, a modification of this algorithm for different solution concepts in extensive-form games (such as refinements of Nash equilibria, or Stackelberg equilibria) is a natural continuation of this work.

Acknowledgements

This research was supported by the Czech Science Foundation (grant no. P202/12/2054) and by the Czech Ministry of Education, Youth and Sports (grant no. LH11051).

7. REFERENCES

[1] B. Bosansky, C. Kiekintveld, V. Lisy, and M. Pechoucek. Iterative Algorithm for Solving Two-player Zero-sum Extensive-form Games with Imperfect Information. In Proc. of European Conf. on Artificial Intelligence (ECAI).
[2] S. Hoda, A. Gilpin, J. Peña, and T. Sandholm. Smoothing techniques for computing Nash equilibria of sequential games. Math. Oper. Res., 35(2), May.
[3] M. Jain, D. Korzhyk, O. Vanek, V. Conitzer, M. Tambe, and M. Pechoucek. Double Oracle Algorithm for Zero-Sum Security Games on Graph. In Proc. of the Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS).
[4] M. Johanson, M. Bowling, K. Waugh, and M. Zinkevich. Accelerating best response calculation in large extensive games. In Proc. of Int. Joint Conf. on Artificial Intelligence (IJCAI).
[5] D. Koller, N. Megiddo, and B. von Stengel. Efficient computation of equilibria for extensive two-person games. Games and Economic Behavior, 14(2).
[6] M. Lanctot, K. Waugh, M. Zinkevich, and M. Bowling. Monte Carlo sampling for regret minimization in extensive games. In Advances in Neural Information Processing Syst. (NIPS).
[7] H. B. McMahan and G. J. Gordon. A fast bundle-based anytime algorithm for poker and other convex games. Journal of Machine Learning Research - Proceedings Track, 2.
[8] H. B. McMahan, G. J. Gordon, and A. Blum. Planning in the presence of cost functions controlled by an adversary. In Int. Conf. on Machine Learning.
[9] M. J. V. Ponsen, S. de Jong, and M. Lanctot. Computing approximate Nash equilibria and robust best-responses using sampling. J. Artif. Intell. Res. (JAIR), 42.
[10] Y. Shoham and K. Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations.
[11] M. Tambe. Security and Game Theory: Algorithms, Deployed Systems, Lessons Learned. Cambridge University Press.
[12] B. von Stengel. Efficient computation of behavior strategies. Games and Economic Behavior, 14.
[13] K. Waugh, N. Bard, and M. Bowling. Strategy grafting in extensive games. In Advances in Neural Information Processing Systems 22 (NIPS).
[14] M. P. Wellman. Trading Agents. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Pub.
[15] M. Zinkevich, M. Bowling, and N. Burch. A new algorithm for generating equilibria in massive zero-sum games. In Proc. of National Conference on Artificial Intelligence (AAAI).
[16] M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione. Regret minimization in games with incomplete information. Advances in Neural Information Processing Syst. (NIPS), 20, 2008.
