
An Emergence of Game Strategy in Multiagent Systems

Peter LACKO
Slovak University of Technology
Faculty of Informatics and Information Technologies
Ilkovičova 3, 842 16 Bratislava, Slovakia
lacko@fiit.stuba.sk

Abstract. In this paper we study the emergence of game strategy in multiagent systems, comparing symbolic and subsymbolic approaches. The symbolic approach is represented by a backtracking algorithm with a specified search depth, whereas the subsymbolic approach is represented by feed-forward neural networks adapted by the reinforcement temporal-difference technique TD(λ). As a test game we use simplified checkers. Three different strategies were used. In the first, a single agent repeatedly plays games against a MinMax version of the backtracking search method. In the second, a population of single agents repeatedly plays a megatournament. The third is an evolutionary modification of the second. We demonstrate that all three approaches lead to a population of agents that play checkers very successfully against a backtracking algorithm with search depth 3.

1 Introduction

Applications of TD(λ) reinforcement learning [2, 3] to computational studies of the emergence of game strategies were initiated by Gerald Tesauro [4, 5] in 1992. He let a machine-learning program endowed with a feed-forward neural network play backgammon against itself, and observed that a neural network emerged which was able to play backgammon at a supreme champion level.

The purpose of this paper is to use the TD(λ) reinforcement learning method and evolutionary optimization to adapt feed-forward neural networks that serve as evaluators of the next possible positions created from a given position by the permitted moves. The neural network evaluates each position by a score from the open interval (0,1). The position with the largest score is selected as the forthcoming position, while the other moves are ignored.

Supervisor: prof. Ing. Vladimír Kvasnička, DrSc., Institute of Applied Informatics, Faculty of Informatics and Information Technologies STU in Bratislava

M. Bieliková (Ed.), IIT.SRC 2005, April 27, 2005, pp. 41-48.

The method is tested on a simplified game of checkers, in which the player whose piece first reaches any square on the opposite end of the game board wins. Three different experiments were carried out. The first uses a single neural network playing against a player simulated by a MinMax backtracking algorithm with search depth 3. The second uses a population of neural networks that repeatedly play a megatournament (each network against all the others); after each game, both participating neural networks are adapted by TD(λ) according to the result of the game. Finally, the third experiment uses evolutionary adaptation of the neural networks, i.e. reinforcement learning is replaced by random mutation and natural selection. In all three experiments, the emerged neural networks won about 60% of their games of simplified checkers against the MinMax algorithm with search depth 3.

2 Simplified game of checkers

The game of checkers is played on a square board with sixty-four smaller squares arranged in an 8×8 grid of alternating colors (like a chess board). In the starting position each player has 8 pieces (black and white, respectively) on the 8 squares of the same color closest to his edge of the board. Each player must make one move per turn. The pieces move one square, diagonally, forward, and only to a vacant square. A player captures an opponent's piece by jumping over it, diagonally, onto the adjacent vacant square. If a player can jump, he must. A player wins if one of his pieces reaches a square on the opponent's edge of the board, if he captures the last opponent's piece, or if he blocks all of the opponent's moves.
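For concreteness, the board mechanics just described might be sketched as follows. Everything here is our own illustrative assumption, not code from the paper: an 8×8 list-of-lists board with entries 1 (a piece of the player to move), -1 (an opponent's piece) and 0 (vacant), and the helper names.

```python
# A minimal sketch of the simplified checkers rules described above.
import copy

def invert(board):
    """The inverse position: the board as seen by the other player."""
    return [[-board[7 - r][7 - c] for c in range(8)] for r in range(8)]

def apply_move(board, src, dst, captured=None):
    b = copy.deepcopy(board)
    b[dst[0]][dst[1]], b[src[0]][src[1]] = b[src[0]][src[1]], 0
    if captured is not None:
        b[captured[0]][captured[1]] = 0        # remove the jumped piece
    return b

def moves(board):
    """Successor boards for the player to move (pieces +1, advancing
    toward row 7); mandatory jumps shadow plain steps."""
    steps, jumps = [], []
    for r in range(8):
        for c in range(8):
            if board[r][c] != 1:
                continue
            for dc in (-1, 1):                 # one square, diagonally forward
                r1, c1, r2, c2 = r + 1, c + dc, r + 2, c + 2 * dc
                if 0 <= r1 < 8 and 0 <= c1 < 8 and board[r1][c1] == 0:
                    steps.append(apply_move(board, (r, c), (r1, c1)))
                if (0 <= r2 < 8 and 0 <= c2 < 8
                        and board[r1][c1] == -1 and board[r2][c2] == 0):
                    jumps.append(apply_move(board, (r, c), (r2, c2), (r1, c1)))
    return jumps if jumps else steps           # "if a player can jump, he must"

def wins(board):
    """Far edge reached, all opponent pieces captured, or opponent blocked."""
    return (any(board[7][c] == 1 for c in range(8))
            or not any(-1 in row for row in board)
            or not moves(invert(board)))
```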

2.1 Formalization of the game

In this section we formalize the game of checkers in a way that applies to all symmetric two-player games (chess, go, backgammon, etc.). Let the current position of the game be described by a variable P; this position can be changed by the permitted actions (moves) constituting a set A(P). Using a move a ∈ A(P), the position P is transformed into a new position P′. An inverse position P̄ is obtained from a position P by switching the color of all black pieces to white and of all white pieces to black. We use a multiagent approach, and we presume that the game is played by two agents G₁ and G₂, endowed with cognitive devices by which they are able to evaluate the next positions.

Algorithm 1
1. The game is started by the first player, G ← G₁, from the starting position, P ← P_init.
2. The player G generates from the position P the set of permitted next positions A(P) = {P₁, P₂, ..., Pₙ}. Each position Pᵢ from this set is evaluated by a coefficient 0 < zᵢ < 1. The player selects as his next position the P′ ∈ A(P) with the maximum coefficient, P ← P′. If the position P satisfies the condition for victory, the player G wins and the game continues with step 4.
3. The other player takes his turn in the game, G ← G₂; the position is generated as the inverse of the current position, P ← P̄; the game continues with step 2.
4. End of the game.

The key role in the algorithm is played by the calculation of the coefficients z = z(P′) for the positions P′ ∈ A(P). These calculations can be done either by methods of classical artificial intelligence, based on a combination of depth-first search and various heuristics, or by soft-computing methods. We use the second approach, turning our attention to the modern approach of multiagent systems. It is based on the presumption that the behavior of an agent in his environment, and the actions he performs, are fully determined by his cognitive device, which exhibits a certain plasticity (i.e. it is capable of learning).
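Algorithm 1 then reduces to a short loop. In the sketch below, moves, invert and wins are the assumed helpers from the previous sketch, and z1, z2 are assumed evaluator callables (the agents' cognitive devices, e.g. a network composed with the position encoding of Section 3) that map a position to a score in the open interval (0,1).

```python
# A sketch of Algorithm 1: greedy selection of the best-scored successor,
# with the board inverted between turns.
def play_game(z1, z2, P_init):
    evaluators, P, player = (z1, z2), P_init, 0
    while True:
        z = evaluators[player]
        candidates = moves(P)                  # the set A(P)
        if not candidates:                     # blocked: the opponent wins
            return 1 - player
        P = max(candidates, key=z)             # step 2: maximum coefficient
        if wins(P):
            return player                      # step 4: end of the game
        P, player = invert(P), 1 - player      # step 3: other player's turn
```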

3 The structure of the cognitive device (neural network)

Before we specify the cognitive device of the agents, we have to introduce the numerical representation of positions. A position is represented by a 32-dimensional vector

$$\mathbf{x}(P) = (x_1, x_2, \ldots, x_{32}) \in \{0, 1, -1\}^{32} \tag{1a}$$

where the single entries specify the single squares of the position P:

$$x_i = \begin{cases} 0 & (\text{the } i\text{-th square is free}) \\ \pm 1 & (\text{our piece / the opponent's piece is on the } i\text{-th square}) \end{cases} \tag{1b}$$

The neural network used has the architecture of a feed-forward network with one layer of hidden neurons. The activities of the input neurons are determined by the numerical representation x(P) of the given position P, and the output activity evaluates the position x(P) (see Fig. 1).

Fig. 1. Feed-forward neural network with one layer of hidden neurons (32 input neurons, p hidden neurons, 1 output neuron). The input activities are the 32-dimensional vector x(P), which codes a position of the game; the output activity is the real number z(P) from the open interval (0,1), an evaluation of the input position.

The number of parameters of the neural network is 34p+1, where p is the number of hidden neurons.
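Under the encoding (1a)-(1b), the evaluator of Fig. 1 is a small two-layer network. The sketch below uses sigmoid units and NumPy, which are our assumptions (the paper does not state the activation function), and it reproduces the parameter count 34p+1: 32p input-to-hidden weights, p hidden biases, p hidden-to-output weights, and 1 output bias. The choice of the 32 dark squares in encode is likewise our reading of the 32-dimensional representation.

```python
import numpy as np

def encode(board):
    """Eq. (1a)-(1b): one entry per playable (dark) square."""
    return np.array([board[r][c] for r in range(8) for c in range(8)
                     if (r + c) % 2 == 1], dtype=float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_net(p, rng=np.random.default_rng(0)):
    """32 -> p -> 1 evaluator; 32p + p + p + 1 = 34p + 1 parameters."""
    return {"W": rng.normal(0.0, 0.1, (p, 32)),   # input-to-hidden weights
            "b": rng.normal(0.0, 0.1, p),         # hidden biases
            "v": rng.normal(0.0, 0.1, p),         # hidden-to-output weights
            "c": float(rng.normal(0.0, 0.1))}     # output bias

def evaluate(net, x):
    """x: 32-vector over {0, 1, -1}; returns z(P) in the open interval (0, 1)."""
    h = sigmoid(net["W"] @ x + net["b"])
    return sigmoid(net["v"] @ h + net["c"])
```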

4 Adaptation of the agent's cognitive device with the temporal-difference TD(λ) method with reward and punishment [2]

In this section we give the basic principles of the reinforcement learning method used, which in its temporal-difference TD(λ) version currently belongs among the effective algorithmic tools for the adaptation of cognitive devices in multiagent systems. The basic principles of reinforcement learning are the following:
- The agent observes the mapping of an input pattern to the output signal of his cognitive device (the output signal is often called an action or control signal).
- The agent evaluates the quality of the output signal on the basis of an external scalar signal, the reward.
- The aim of learning is an adaptation of the agent's cognitive device that modifies the output signals toward maximization of the external reward signals.

In many cases the reward signal is delayed; it arrives only at the end of a long sequence of actions and can be understood as an evaluation of the whole sequence, namely whether or not the sequence achieved the desired goal.

We now outline the construction of the TD(λ) method as a generalization of the standard gradient-descent learning of neural networks. Let us presume that we know the sequence of positions of the given agent (player) and their evaluation by a real number:

$$P_1, P_2, \ldots, P_m, z_{\mathrm{reward}} \tag{2}$$

where z_reward is an external evaluation of the sequence, corresponding to the fact that the last position P_m means that the given agent won or lost:

$$z_{\mathrm{reward}} = \begin{cases} 1 & (\text{sequence of positions won}) \\ -1 & (\text{sequence of positions lost}) \end{cases} \tag{3}$$

From the sequence (2) we create m couples of positions and their evaluations by z_reward, which are used as a training set for the following objective function:

$$E(\mathbf{w}) = \frac{1}{2}\sum_{t=1}^{m}\left(z_{\mathrm{reward}} - G(\mathbf{x}_t; \mathbf{w})\right)^2 \tag{4}$$

We look for weight coefficients of the neural network (the cognitive device) that minimize this objective function. If we found weights for which the function is zero, then each position from the sequence (2) would be evaluated exactly by the number z_reward. The recurrent formula for the adaptation of the weight coefficients is

$$\mathbf{w} := \mathbf{w} - \alpha\frac{\partial E}{\partial \mathbf{w}} = \mathbf{w} + \Delta\mathbf{w} \tag{5}$$

$$\Delta\mathbf{w} = \alpha\sum_{t=1}^{m}\left(z_{\mathrm{reward}} - z_t\right)\frac{\partial z_t}{\partial \mathbf{w}} \tag{6}$$

where z_t = G(x(P_t); w) is the evaluation of the t-th position P_t by the neural network working as the cognitive device. Our goal is that all the positions from the sequence (2) be evaluated by the same number z_reward, which specifies whether the outcome of the game consisting of the sequence (2) was a win, draw, or loss for the given player. This approach can be generalized to a formula that creates the basis of the TD(λ) class of learning methods [2]:

$$\Delta\mathbf{w} = \sum_{t=1}^{m}\Delta\mathbf{w}_t \tag{7}$$

$$\Delta\mathbf{w}_t = \alpha\left(z_{t+1} - z_t\right)\sum_{k=1}^{t}\lambda^{t-k}\frac{\partial z_k}{\partial \mathbf{w}} \tag{8}$$

where the parameter 0 ≤ λ ≤ 1. Formulas (7) and (8) enable a recurrent calculation of the increment Δw. We introduce a new symbol e_t(λ), which can easily be calculated recurrently as follows:

$$e_t(\lambda) = \sum_{k=1}^{t}\lambda^{t-k}\frac{\partial z_k}{\partial \mathbf{w}}, \qquad e_{t+1}(\lambda) = \lambda e_t(\lambda) + \frac{\partial z_{t+1}}{\partial \mathbf{w}} \tag{9}$$

where e_1(λ) = ∂z_1/∂w. The single partial increments Δw_t are then determined by

$$\Delta\mathbf{w}_t = \alpha\left(z_{t+1} - z_t\right)e_t(\lambda) \tag{10}$$
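The recurrences (9) and (10) translate directly into an update loop over one finished game. The sketch below is a minimal reading of these formulas: it reuses evaluate from the previous sketch, assumes a hypothetical grad_z(net, x) returning ∂z/∂w for each weight array, and, following the sequence (2), lets the external signal z_reward play the role of the last target.

```python
def td_lambda_update(net, xs, z_reward, alpha=0.01, lam=0.9):
    """xs: encoded positions x(P_1)..x(P_m) of one finished game;
    grad_z(net, x) (assumed helper) returns dz/dw per weight array."""
    m = len(xs)
    e = grad_z(net, xs[0])                        # e_1(lambda) = dz_1/dw
    for t in range(m):
        z_t = evaluate(net, xs[t])
        # the target of the last step is the external reward signal
        z_next = evaluate(net, xs[t + 1]) if t + 1 < m else z_reward
        for k in net:                             # eq. (10)
            net[k] = net[k] + alpha * (z_next - z_t) * e[k]
        if t + 1 < m:                             # eq. (9): decay and accumulate
            g = grad_z(net, xs[t + 1])
            e = {k: lam * e[k] + g[k] for k in e}
```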

5 Results

To measure the success of the emergent game strategy, we used the MinMax algorithm [1]. In our implementation, we used the following heuristic evaluation function:

$$\mathrm{evaluation} = \sum_{i=1}^{l} y_n[i] - \sum_{i=1}^{m}\left(8 - y_s[i]\right) \tag{11}$$

If we denote the number of our pieces on the game board by l and the number of the opponent's pieces by m, then y_n[i] denotes the position of our i-th piece along the y axis (the axis toward the opponent's starting part of the game board), and y_s[i] denotes the equivalent value for the opponent's pieces. This evaluation ensures that the MinMax player tries to advance its pieces toward the opponent's part of the game board while trying to prevent the opponent's progress. The play of this player is quite offensive.

5.1 The result of supervised learning of a neural network

In our first, simplest approach, we considered only a 1-agent system. Our agent learns by playing against another agent whose decisions are based on backtracking search to a maximum depth of 3. The game is repeated with the players alternating who goes first. After each game, the agent whose cognitive device is represented by a neural network adapts it by the reinforcement learning method using the reward/punishment signal.

For training and testing we used a two-layer feed-forward network with 64 hidden neurons, the learning rate 0.01 and the coefficient λ = 0.9. The network learned after each game; it was rewarded with 1 if it won and 0 if it lost. After every 100 games, the ratio of won to lost games was plotted on the graph. In this trial, the network played against the MinMax algorithm searching to depth 3. The progress of learning is shown in Fig. 2. Evidently the network learned slowly, and even after 450,000 matches it achieved victory in only 65% of matches. Nevertheless, this is still an excellent result, since a network playing exactly as well as our MinMax algorithm searching to depth 3 would win only 50% of its matches.

Fig. 2. The progress of learning of a neural network playing against the MinMax algorithm with search depth 3.

5.2 The result of adaptation of a population of neural networks

In the second, more complicated case, we consider a multiagent system. Its agents repeatedly play a megatournament against each other, and after each game the neural networks of both agents are adapted by the reinforcement learning method. For this trial we used 20 neural networks. These networks played a megatournament against each other, and their level of development was measured by a tournament against the MinMax with search depth 3. The learning curve is shown in Fig. 3. The figure shows that even though the neural networks were not taught by TD(λ) learning against the MinMax algorithm, they did learn to play checkers. This means we did not need an expert with good knowledge of the game to teach the networks how to play.
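One megatournament of this experiment might look like the following round-robin sketch. Here play_recorded is a hypothetical variant of play_game above that also returns each player's encoded position sequence, and the 1/0 reward follows the convention of Section 5.1.

```python
from itertools import combinations

def megatournament(nets):
    """Every network plays every other one; both participants then
    adapt their weights by TD(lambda) from the finished game."""
    for i, j in combinations(range(len(nets)), 2):
        first_won, xs_i, xs_j = play_recorded(nets[i], nets[j])
        td_lambda_update(nets[i], xs_i, 1.0 if first_won else 0.0)
        td_lambda_update(nets[j], xs_j, 0.0 if first_won else 1.0)
```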

5.3 The result of evolutionary adaptation of a population of neural networks

In the third, most complex approach, we also used Darwinian evolution in the multiagent system: after the end of each megatournament the agents are assigned a fitness according to their success rate in the game, and they then reproduce quasi-randomly with a probability proportional to their fitness. In this case we use asexual reproduction: a copy of the parental agent is created, this copy is mutated with a certain probability, and it replaces some weak agent in the population. To assess fitness we used the MinMax with search depth 3. Figure 4 shows the learning curve as the average result of the agents in the population against the MinMax with search depth 3. The population consisted of 55 agents, from which in each epoch a subpopulation of 10 individuals was created. The subpopulation was generated by a quasi-random selection of agents from the population, which were mutated with probability 0.5. The mutation added to each weight of the neural network, with probability 0.001, a random number drawn from a logistic distribution with parameter 0.1. This subpopulation then replaced the weakest individuals in the population (one epoch is sketched after the figure captions below). Figure 4 shows that in this approach, too, a strategy of the game emerged, and the resulting neural networks were able to play at the same level as the MinMax algorithm with search depth 3.

Fig. 3. The progress of learning in a population of 20 neural networks trained by TD(λ). The curve shows the average ratio of wins to losses for the whole population against the MinMax with search depth 3.

Fig. 4. The progress of learning of a population of 55 neural networks adapted by the evolutionary algorithm. The curve shows the average percentage of wins for the whole population against the MinMax with search depth 3.
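With the constants quoted above, one evolutionary epoch might be sketched as follows. This is only our reading of the text: fitness (an agent's score against the MinMax of depth 3) is an assumed helper, and the exact selection and replacement mechanics are not specified in the paper.

```python
import copy
import numpy as np

rng = np.random.default_rng()

def mutate(net):
    """With probability 0.001 per weight, add logistic noise (scale 0.1)."""
    child = {}
    for k, w in net.items():
        w = np.asarray(w, dtype=float)
        mask = rng.random(w.shape) < 0.001
        child[k] = w + mask * rng.logistic(0.0, 0.1, w.shape)
    return child

def epoch(population):
    """One epoch for the population of 55 agents: pick 10 parents in
    proportion to fitness, mutate each with probability 0.5, and let
    the offspring replace the 10 weakest individuals."""
    scores = np.array([fitness(net) for net in population])
    probs = scores / scores.sum()                 # fitness-proportional selection
    parents = rng.choice(len(population), size=10, p=probs)
    offspring = [mutate(population[i]) if rng.random() < 0.5
                 else copy.deepcopy(population[i]) for i in parents]
    for idx, child in zip(np.argsort(scores)[:10], offspring):
        population[idx] = child                   # replace the weakest agents
```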

6 Conclusions

The purpose of this paper is a computational study of the emergence of game strategy for a simplified game of checkers. It was studied at three different levels. At the first, simplest level we studied a simple 1-agent system, where an agent (represented by a neural network) plays against another agent simulated by the MinMax backtracking method with a specified search depth of 3. At the second level, we used a genuine multiagent system, where the agents repeatedly play a megatournament, each agent against all the other agents. After each single game, both participating agents modify their neural networks by TD(λ) reinforcement learning. At the third level, Darwinian evolution is applied to all agents of a population (a multiagent system). As at the second level, the agents play a megatournament; its results are used for the fitness evaluation of the agents. The fitness values are used in the evolutionary approach in the reproduction process, where fitter agents reproduce with a higher probability than weaker ones. In the reproduction process the weight coefficients are softly randomly modified (mutated), and natural selection ensures that only better neural networks survive.

At all three levels we observed an emergence of a checkers strategy, and at the second and third levels we did not use any user-defined agents endowed with the ability to predict a correct strategy and therefore able to play the game perfectly. This is a very important moment in our computer simulations: in the multiagent systems used, the emergence of game strategy is spontaneous and not biased by predefined opponents that play the game perfectly. The neural networks are able to learn a strategy, which gives rise to agents capable of playing checkers at a very good level.

Acknowledgment: This work was supported by the Scientific Grant Agency of the Slovak Republic under grants #1/0062/03 and #1/1047/04.

References

1. Russell, S.J., Norvig, P.: Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall, 1995.
2. Sutton, R.S.: Learning to predict by the methods of temporal differences. Machine Learning, Vol. 3 (1988), 9-44.
3. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
4. Tesauro, G.J.: Practical issues in temporal difference learning. Machine Learning, Vol. 8 (1992), 257-277.
5. Tesauro, G.J.: TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, Vol. 6, No. 2 (1994), 215-219.