Multiagent Reinforcement Learning Dynamic Spectrum Access in Cognitive Radios

Sensors & Transducers
© 2014 by IFSA Publishing, S. L.
http://www.sensorsportal.com

1 Wu Chun, 2 Yin Mingyong, 2 Ma Shaoliang, 1 Jiang Hong
1 School of National Defense Technology, Southwest University of Science and Technology, Mianyang 621000, Sichuan, China
2 Institute of Computer Application, China Academy of Engineering Physics, Mianyang 621900, Sichuan, China
Tel: 86-86089890, fax: 86-86089890
E-mail: soldier_wu@163.com

Received: 28 November 2013 / Accepted: 28 January 2014 / Published: 28 February 2014

Abstract: A multiuser independent Q-learning method that requires no information interaction is proposed for multiuser dynamic spectrum access in cognitive radios. The method adopts a self-learning paradigm in which each CR user performs reinforcement learning only by observing its individual performance reward, without spending communication resources on information interaction with others. The reward is defined to represent both channel quality and channel conflict status. A learning strategy of sufficient exploration, preference for good channels, and punishment for channel conflict is designed to implement multiuser dynamic spectrum access. For the scenario of two users and two channels, a fast learning algorithm is proposed and its convergence to the maximal total reward is proved. The simulation results show that, with the proposed method, the CR system converges to a Nash Equilibrium with large probability and achieves a high total reward. Copyright © 2014 IFSA Publishing, S. L.

Keywords: Cognitive radios, Multiagent reinforcement learning, Q-learning, Dynamic spectrum access.

1. Introduction

With the trend of information innovation in the current world economy and social development, wireless communication technology has developed rapidly. Cognitive radio (CR) [1] has become a hot research topic in the wireless communication domain owing to its advantages of dynamic spectrum access and intelligent adaptation to the environment. High intelligence is one of the key
characteristics of CR, and learning is the main expression of that intelligence. On-line and off-line learning methods are both generally applied in CR [2]. In on-line learning, the agent interacts with the environment, receives feedback rewards, and learns from its own experience. Reinforcement learning is the representative on-line learning method.

Centralized solutions are commonly used in traditional wireless communication when on-line learning is applied to resource allocation. J. Nie presents a dynamic spectrum allocation method with centralized Q-learning in mobile communication systems [3]. S. Xergias makes use of the centralized E-FRTS (enhanced frame registry tree scheduler) to schedule and allocate multimedia traffic in IEEE 802.16 mesh networks [4]. On account of the autonomy and variety of CR users and the potential heterogeneity of cognitive radio networks (CRN), decentralized learning is more suitable for CR than centralized learning.

To maximize the global reward of all users, multiple CR users ordinarily negotiate with each other, which requires information exchange. J. E. Suris applies a cooperative game model to distributed spectrum sharing and proposes a distributed algorithm that achieves near-optimal allocation; in this algorithm the users exchange information about actions and rewards with each other [5]. P. Zhou studies the application of Bush-Mosteller reinforcement learning to the power control issue in CR [6]. Although the CR users do not exchange strategy and reward information with each other, they still obtain the total jamming intensity (global reward) from the primary user.

Information exchange between CR users (or with the primary user) occupies a certain amount of communication resources. Moreover, too frequent interactions may overload the communication. To overcome this shortcoming, an alternative is self-learning (independent learning), in which each user learns and acts based only on its own reward, without exchanging information with the others. Currently, research on self-learning in CR is scarce in the literature. This paper applies the repeated game to model the issue of multiple users competing for multiple channels, and proposes a multiagent reinforcement learning method, the multiuser independent Q-learning method, with which the CR users implement self-learning to maximize the global reward. The simulations validate the effectiveness of the proposed method.

Article number P_848

2. Stochastic Spectrum Access Model

The problem studied is that of M CR users (SUs, Secondary Users) accessing N channels not occupied by PUs (Primary Users); only $M \le N$ is considered in this paper. Each CR user repeatedly chooses a channel independently according to its own strategy, and aims at achieving the maximal total reward of all users. No user exchanges information with the others during the whole learning and channel selection process.

The repeated game [7] is used to model the process of multiple CR users competing for channels. At each stage game, the M users choose their respective
channels (among all N channels), and the reward of each user is determined by the combined strategies of all users. This stage game is represented by a matrix game [8]

$$\left(M, A^{(1)}, \ldots, A^{(M)}, r^{(1)}, \ldots, r^{(M)}\right),$$

where $M$ is the number of users in the game; $A^{(m)}$ is the action set of user $m$, which contains $N$ actions, i.e. choosing channel $n$ ($n = 1, 2, \ldots, N$); $A = A^{(1)} \times \cdots \times A^{(M)}$ is the combined action space; and $r^{(m)}$ is the reward function of user $m$. When user $m$ chooses channel $n$ and this conflicts with another user's choice, the reward $r^{(m)}(n)$ equals zero, i.e. $r^{(m)}(n) = 0$. With no channel conflict, the reward is determined by the channel gain, $r^{(m)}(n) = b_n^{(m)}$. The reward matrices of a game in which two users compete for two channels are shown in formulas (1) and (2):

$$R^{(1)} = \begin{pmatrix} 0 & b_1^{(1)} \\ b_2^{(1)} & 0 \end{pmatrix} \tag{1}$$

$$R^{(2)} = \begin{pmatrix} 0 & b_2^{(2)} \\ b_1^{(2)} & 0 \end{pmatrix} \tag{2}$$

The item at row $i$, column $j$ of matrix $R^{(m)}$ denotes the reward of user $m$ when user 1 chooses channel $i$ and user 2 chooses channel $j$.

3. Multiuser Independent Q-learning

The goal of the repeated game for accessing channels is to maximize the total reward of the stage game, $g_t = \sum_{m=1}^{M} r_t^{(m)}$, after executing the stage game many times. Multiagent reinforcement learning (MARL) [9] is an effective method to resolve the game. In most of the current literature on multiuser games, each user needs to observe the rewards and the strategies of the other users. In cognitive radio networks, such frequent exchange of rewards and strategies between CR users would occupy a great quantity of communication resources. This paper proposes a multiuser independent Q-learning (MIQ) method with which the CR users do not require any information exchange with each other.

The MIQ algorithm in the repeated game model aims at two goals: one is convergence to a Nash Equilibrium, and the other is that the total reward of all users reaches the maximal value or a value close to it.

Definition 1. The strategy profile $(\pi_m^*, \pi_{-m}^*)$ is a Nash Equilibrium if for each user $m = 1, 2, \ldots, M$

$$r^{(m)}(\pi_m^*, \pi_{-m}^*) \ge r^{(m)}(\pi_m, \pi_{-m}^*), \quad \forall \pi_m, \tag{3}$$

where $\pi_m^*$ is the strategy of user $m$, $\pi_{-m}^*$ is the combined strategies of the users other than user $m$, and $r^{(m)}$ is the reward of user $m$.

Among all possible combined strategies in which every user chooses a different channel, there must exist one
combined strategy whose reward is better than or equal to that of the other combined strategies. That combined strategy is a Nash Equilibrium point. A Nash Equilibrium point therefore clearly exists in the game described above, but it may not be unique. For instance, all orthogonal channel allocation strategies are Nash Equilibrium points when $M = N$.

The basic Q-learning method learns the optimal strategies in an unknown environment by using learned knowledge and exploring new strategies with a certain probability.
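Before turning to the update rule, Definition 1 can be checked numerically on a small instance. The Python sketch below enumerates the pure-strategy Nash Equilibria of the stage game by testing every unilateral deviation. The gain values and the function names are illustrative assumptions, and channels are indexed from 0.

```python
from itertools import product


def reward(gains, user, profile):
    """Stage-game reward: the channel gain if the user is alone on its channel, else 0."""
    channel = profile[user]
    return gains[user][channel] if profile.count(channel) == 1 else 0.0


def is_nash_equilibrium(gains, profile):
    """Definition 1: no user can gain by unilaterally switching channels."""
    n_channels = len(gains[0])
    for user in range(len(gains)):
        current = reward(gains, user, profile)
        for alt in range(n_channels):
            deviated = profile[:user] + (alt,) + profile[user + 1:]
            if reward(gains, user, deviated) > current:
                return False
    return True


def pure_nash_equilibria(gains):
    """All pure-strategy profiles (one channel per user) that satisfy Definition 1."""
    n_users, n_channels = len(gains), len(gains[0])
    return [profile
            for profile in product(range(n_channels), repeat=n_users)
            if is_nash_equilibrium(gains, profile)]


# Hypothetical gains b[m][n] for two users and two channels.
gains = [[0.9, 0.6], [0.8, 0.7]]
print(pure_nash_equilibria(gains))  # [(0, 1), (1, 0)]
```

With these hypothetical gains both orthogonal allocations are equilibria, but (0, 1) yields the larger total reward, which is why convergence to a Nash Equilibrium and convergence to the maximal total reward are treated as separate goals.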

With undetermined rewards and actions, the update formula of the Q-value function is [10]:

$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right) \tag{4}$$

Some improvements on basic Q-learning are required for the problem at hand. Each user learns independently during the game, and its reward is affected by the other users. Because the reward is uncertain, updating the Q-value slowly is a reasonable approach. In addition, because the state of the users does not change during the repeated game, the new Q-value does not contain the contribution of delayed rewards. With these two improvements, together with a proper exploration policy and control of the learning rate, a multiuser independent Q-learning method is proposed.

The key to achieving the joint optimal solution through independent actions and learning is to design suitable autonomous learning and action policies. Two principles are proposed and applied in the independent actions of each user:
1) The user prefers to choose the channel with high gain;
2) The user avoids channel conflict with others.
Nevertheless, the two principles may conflict occasionally or frequently. The proposed MIQ algorithm executes iterative action trials under the two principles and obtains the final channel allocation. The concrete implementation of the MIQ algorithm is as follows.

Step 1: Q-value table initialization. The Q-value table of user $m$ is initialized as

$$Q_0^{(m)}(n) = \frac{1}{N} \sum_{i=1}^{N} b_i^{(m)}, \quad n = 1, 2, \ldots, N \tag{5}$$

The initial Q-value of user $m$ is the average of the rewards on the different channels. After initialization, every item in the Q-value table has the same average value, so the user chooses any channel with the same probability in its first action. A deeper reason for initializing the Q-value with the average is that it conveniently realizes the subsequent updating principle of the Q-value: a big reward makes the Q-value increase slowly and a small reward makes the Q-value decrease slowly.

Step 2: Independent Q-learning process. The Q-value is updated iteratively until the specified number of games is reached.

a) Compute the probabilities of choosing each channel from the Q-value table, and execute the action according to the probabilities in formula (6):

$$P^{(m)}(n) = \frac{\left(Q^{(m)}(n)\right)^{q}}{\sum_{i=1}^{N} \left(Q^{(m)}(i)\right)^{q}}, \quad n = 1, 2, \ldots, N \tag{6}$$

The
larger the Q-value, the better the channel and the higher the probability of choosing it. In formula (6), $q$ is the probability controlling factor. With a larger $q$, action selection leans toward using learned knowledge; with a smaller $q$, it leans toward exploring all possible choices. Because there is no information interaction between users, adequate exploration is significant and necessary in the initial stage of learning. Generally, a small $q$ is set at the very beginning of learning, and $q$ then increases gradually and slowly over the repeated learning process to improve the convergence of learning.

b) After the actions, each user observes only its own reward:

$$r_t^{(m)} = \begin{cases} b_n^{(m)}, & \text{user } m \text{ has no conflict} \\ 0, & \text{user } m \text{ is in conflict} \end{cases} \tag{7}$$

On the one hand, this definition of the reward represents the quality of the channel, i.e. the reward is the channel gain when there is no conflict. On the other hand, the reward reflects the conflict status of the channels. When the action of user $m$ conflicts with that of any other user, the reward is zero, which decreases the corresponding Q-value. This can be regarded as a punishment mechanism for channel conflict.

c) Update the Q-value table:

$$Q_{t+1}^{(m)}(n) = (1 - \alpha_t)\, Q_t^{(m)}(n) + \alpha_t\, r_t^{(m)} \tag{8}$$

where $\alpha_t$ is the update rate of the Q-value, which depends on the number of times user $m$ has chosen channel $n$ during the $t$ repeated games and on a controlling factor of the Q-value update rate. The update rate $\alpha_t$ decreases gradually during the learning process, which contributes to the convergence of learning. When the user chooses a channel with high gain (higher than the average gain) and there is no conflict, the Q-value increases, and a higher gain leads to a larger increase. Meanwhile, if the user's selected channel conflicts with others, the update in formula (8) with a current reward of zero decreases the Q-value, and the degree of decrease is much larger than the degree of increase obtained without conflict. Because of this rather heavy punishment for channel conflict, the user searches for a high-gain channel on the basis of avoiding conflict.
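The steps above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the update rate `1 / (1 + rate_factor * count)`, the linear growth of `q` (`q_growth`), and all function and parameter names are assumptions chosen for illustration. Only the structure (average-value initialization, selection probabilities proportional to a power of the Q-value, zero reward on conflict, slow Q-value update without a delayed-reward term) follows Steps 1 and 2.

```python
import random


def choice_probabilities(q_values, q_exp):
    """Formula (6): the probability of channel n is proportional to Q(n) ** q."""
    weights = [value ** q_exp for value in q_values]
    total = sum(weights)
    return [weight / total for weight in weights]


def run_miq(gains, games=5000, q_start=0.5, q_growth=0.002,
            rate_factor=1.0, seed=0):
    """Independent Q-learning for M users on N channels.

    gains[m][n] is the conflict-free reward of user m on channel n.
    q_growth and the 1 / (1 + rate_factor * count) update rate are
    assumptions of this sketch, not values taken from the paper.
    """
    rng = random.Random(seed)
    n_users, n_channels = len(gains), len(gains[0])
    # Step 1, formula (5): every entry starts at the user's average gain.
    q_table = [[sum(row) / n_channels] * n_channels for row in gains]
    counts = [[0] * n_channels for _ in range(n_users)]
    q_exp = q_start
    for _ in range(games):
        # Step 2(a): each user samples a channel from its own Q-values only.
        picks = []
        for m in range(n_users):
            probs = choice_probabilities(q_table[m], q_exp)
            picks.append(rng.choices(range(n_channels), weights=probs)[0])
        for m, n in enumerate(picks):
            # Step 2(b), formula (7): zero reward on conflict, gain otherwise.
            reward = gains[m][n] if picks.count(n) == 1 else 0.0
            # Step 2(c), formula (8): update with a rate that decays per visit.
            counts[m][n] += 1
            alpha = 1.0 / (1.0 + rate_factor * counts[m][n])
            q_table[m][n] = (1.0 - alpha) * q_table[m][n] + alpha * reward
        q_exp += q_growth  # lean more on learned knowledge as learning proceeds
    # Final allocation: each user keeps the channel with its largest Q-value.
    return [max(range(n_channels), key=q_table[m].__getitem__)
            for m in range(n_users)]


if __name__ == "__main__":
    gain_rng = random.Random(1)
    demo_gains = [[0.5 + 0.5 * gain_rng.random() for _ in range(3)]
                  for _ in range(3)]
    print(run_miq(demo_gains))
```

Each user reads and writes only its own row of `q_table` and sees only its own reward, so no information is exchanged between users; the joint list `picks` is used solely by the simulated environment to detect conflicts.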
Studies of self-learning (independent learning), in which the agent learns and acts only by observing its own reward, are very few in the MARL field. It is especially difficult to prove the convergence of self-learning when the users do not exchange information. Bowling proposed a multiagent learning method, WoLF-PHC (win or learn fast policy hill-climbing), in which each user learns using a variable learning rate and which achieves the maximal rewards of all users [11]. However, no proof of convergence is provided for that algorithm, and only an experiment with 2 users and 2 actions is done to evaluate its convergence property. This paper proposes a fast learning algorithm for the scenario of 2 users and 2 channels, and proves its convergence. The concrete implementation of the fast learning algorithm is as follows.

Step 1: Phase of learning the channel rewards. Each user explores the channels randomly and thus obtains the reward value (channel gain) of each channel without channel conflict.

Step 2: Phase of fast greedy channel choosing. Each user chooses the channel with the highest gain. The channel allocation and the learning process are finished if no channel conflict occurs. When a channel conflict occurs, Step 3 is executed.

Step 3: Phase of Q-learning. The learning process is repeated for a specified number of times.

(a) Initialize the Q-value table:

$$Q_0^{(1)}(1) = Q_0^{(1)}(2) = Q_0^{(2)}(1) = Q_0^{(2)}(2) = 0.5 \tag{9}$$

(b) Choose the action based on the probability policy and observe the reward. The user chooses a channel according to the probabilities in formula (6), observes the conflict status, and calculates the reward. When the user chooses the higher-gain channel of the two candidates, the reward is

$$r_t^{(m)} = \begin{cases} [\ldots], & \text{user } m \text{ has no conflict} \\ [\ldots], & \text{user } m \text{ is in conflict} \end{cases} \tag{10}$$

where [...] is defined from the gains $b^{(m)}$ of user $m$ and a parameter $L$, and [...] is an appropriate value between [...] and [...]. When the user chooses the lower-gain channel of the two candidates, the reward is

$$r_t^{(m)} = \begin{cases} [\ldots], & \text{user } m \text{ has no conflict} \\ [\ldots], & \text{user } m \text{ is in conflict} \end{cases} \tag{11}$$

(c) Update the Q-value table:

$$Q_{t+1}^{(m)}(n) = Q_t^{(m)}(n) + r_t^{(m)}(n) \tag{12}$$

Theorem 1. The proposed fast learning algorithm converges to the optimal solution.

Proof: Suppose $b_1^{(1)} > b_2^{(1)}$, $b_1^{(2)} < b_2^{(2)}$ or $b_1^{(1)} < b_2^{(1)}$, $b_1^{(2)} > b_2^{(2)}$, i.e. the two users prefer different channels. It is clear that the algorithm converges to the optimal solution in the phase of fast greedy channel choosing.

Suppose $b_1^{(1)} > b_2^{(1)}$, $b_1^{(2)} > b_2^{(2)}$, i.e. both users prefer channel 1. Assign the parameter $L$ a sufficiently large value so that [...] and [...] are small enough, and assign [...] an appropriate value between [...] and [...]. In the first period of learning, the user performs $4n$ channel choices. On the condition that [...] and [...] are extremely small, the Q-value table updates as

$$[\ldots] \tag{13}$$

If [...], then
the Q-values satisfy [...]. During the subsequent learning periods, the Q-value [...] increases continually while the Q-value [...] decreases continually. After a while, the learning process ends and the final solution is that each user chooses the channel with the larger Q-value, i.e. user 1 chooses channel 1 and user 2 chooses channel 2. This solution is the optimal solution owing to $b_1^{(1)} + b_2^{(2)} > b_2^{(1)} + b_1^{(2)}$. If [...], the algorithm converges too.

Suppose $b_1^{(1)} < b_2^{(1)}$, $b_1^{(2)} < b_2^{(2)}$, i.e. both users prefer channel 2; the convergence can be proved in the same way.

4. Simulation and Results

The MIQ algorithm is simulated and evaluated mainly in three aspects: the probability of convergence to a Nash Equilibrium, the probability of convergence to the optimal solution, and the normalized performance of the proposed algorithm. In the scenario of M CR users selecting among N channels, the reward of each user $m$ choosing each channel $n$ is initialized to a uniformly distributed random number between 0.5 and 1,

$$b_n^{(m)} = 0.5 + 0.5 \cdot \mathrm{rand}() \tag{14}$$

Then each user carries out autonomous learning with the MIQ algorithm independently. In the policy updating procedure in Step 2(a), the probability controlling factor $q$ is adjusted dynamically: $q$ equals 0.5 when selecting a channel for the first time, and increases very gradually until the specified number of learning iterations is reached. In Step 2(c), the controlling factor of the Q-value update rate is configured with [...]. The number of repeated games is 10000, and the simulation process is executed 100 times with different random rewards to obtain the average performance of the proposed algorithm.

Fig. 1 and Fig. 2 show the performance of the MIQ algorithm when the number of users M equals the number of channels N. The strategies of the users converge to a Nash Equilibrium with a probability at or near 100 % (as shown in Fig. 1). When M = N = 2, the total reward of all users converges to the maximal value with a probability of 98 %. The probability of convergence to the maximal reward drops as the number of users/channels increases, and falls to near 70 % when M = N = 8. On the surface, Fig. 1 suggests that the MIQ algorithm performs very well with very few users but is not a very good method with many more users.

Fig. 1. The probabilities of convergence to Nash Equilibrium and the maximal reward (M = N).

A deeper evaluation of the algorithm in terms of average performance is given in Fig. 2. When M = N = 2, the normalized performance (the ratio of the current total reward to the maximal total reward) is 0.9999, and when M = N = 3 it is 0.9995. As the number of users/channels increases, the normalized performance drops very slowly. When M = N = 8, the normalized performance still maintains a very high value, i.e. 0.9978. The reason that the normalized performance stays high while the probability of convergence to the maximal reward drops noticeably is that the total reward under the channel allocation that is reached is very close to the maximal reward. Therefore, even when the MIQ algorithm does not achieve the absolute optimal performance, it achieves performance very close to it.

Fig. 2. The normalized performance of the MIQ algorithm and the random orthogonal allocation method (M = N).

Fig. 3 shows the 100 samples of normalized performance obtained in simulations (M = N = 8). It can be seen that in the 100 simulations in total, the MIQ algorithm converges to the maximal reward 69 times, the normalized performance obtained in the remaining 31 simulations is mostly greater than 0.98, and even the worst performance is greater than 0.97. When the number of users equals the number of channels, the MIQ algorithm can reach a conflict-free orthogonal channel allocation, and the normalized performance obtained by the MIQ algorithm is approximately 5 % higher than that of the random orthogonal channel allocation method.

Fig. 3. The normalized performance distribution of 100 simulation samples for the MIQ algorithm (M = N = 8).

Fig. 4 and Fig. 5 show the performance of the MIQ algorithm when the number of users M is less than or equal to the number of channels N. Fig. 4 shows the probabilities of convergence to a Nash Equilibrium and to the maximal total reward achieved by the proposed algorithm with different numbers of channels (M = 2, 3). As the number of channels increases, not only does the probability of convergence to the maximal total reward drop obviously, but the probability of convergence to a Nash Equilibrium drops similarly, which differs from the case shown in Fig. 1. Channel conflicts in the MIQ learning process lead to a decrease of the Q-value, so that the final channel allocation avoids channel conflicts effectively. The set of strategies realizing a conflict-free channel allocation is exactly a Nash Equilibrium when M = N, so the probability of convergence to a Nash Equilibrium is quite high (as shown in Fig. 1). When M < N, a conflict-free channel allocation does not necessarily satisfy the Nash Equilibrium, which is why the probability of convergence to a Nash Equilibrium drops in Fig. 4. Fig. 5 shows that the proposed MIQ algorithm also performs well when M ≤ N.

Fig. 4. The probabilities of convergence to Nash Equilibrium and the maximal reward (M ≤ N).

Fig. 5. The normalized performance of the MIQ algorithm and the random orthogonal allocation method (M ≤ N).

5. Conclusions

Independent learning without information exchange between nodes is an alternative on-line learning method for resource allocation in cognitive radio networks. This paper uses the repeated game to model multiuser dynamic spectrum access, and proposes a multiagent reinforcement learning method, the multiuser independent Q-learning method, with which the CR users coordinate in choosing the highest-gain channels and avoiding conflict with each other. Moreover, a fast learning algorithm for the case of 2 users and 2 channels is presented and proved to converge to a Nash Equilibrium. The simulations show that, with the proposed MIQ algorithm, the user actions converge to a Nash Equilibrium with high probability and the achieved total reward is close to the maximal reward.

Acknowledgements

Project supported by the National Natural Science Foundation of China (Grant No. 61379005), and the National Basic Research Program of China (Grant No. 2009CB320403).

References

[1] J. Mitola, J. G. Q. Maguire, Cognitive radio: making software radios more personal, IEEE Personal Communications, Vol. 6, Issue 4, 1999, pp. 13-18.
[2] C. Wu, Y. Li, K. Yi, Research on GA-LSSVM offline learning in cognitive radios, Journal of Beijing University of Posts and Telecommunications, Vol. 35, Issue 2, 2012, pp. 90-93.
[3] J. Nie, S. Haykin, A Q-learning-based dynamic channel assignment technique for mobile communication systems, IEEE Transactions on Vehicular Technology, Vol. 48, Issue 5, 1999, pp. 1676-1687.
[4] S. Xergias, N. Passas, A. K. Salkintzis, Centralized resource allocation for
multimedia traffic in IEEE 802.16 mesh networks, Proceedings of the IEEE, Vol. 96, Issue 1, 2008, pp. 54-63.
[5] J. E. Suris, L. A. Dasilva, H. Zhu, et al., Cooperative game theory for distributed spectrum sharing, in Proceedings of the IEEE International Conference on Communications, Glasgow, Scotland, 2007, pp. 5282-5287.
[6] P. Zhou, Y. Chang, J. A. Copeland, Reinforcement learning for repeated power control game in cognitive radio networks, IEEE Journal on Selected Areas in Communications, Vol. 30, Issue 1, 2012, pp. 54-69.
[7] V. D. Schaar M., F. Fu, Spectrum access games and strategic learning in cognitive radio networks for delay-critical applications, Proceedings of the IEEE, Vol. 97, Issue 4, 2009, pp. 720-739.
[8] C. Yang, J. Li, Mixed-strategy based discrete power control approach for cognitive radios: A matrix game-theoretic framework, in Proceedings of the 2nd International Conference on Future Computer and Communication, Wuhan, China, 2010, pp. 3806-3810.
[9] L. Busoniu, R. Babuska, B. D. Schutter, A comprehensive survey of multiagent reinforcement learning, IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews, Vol. 38, Issue 2, 2008, pp. 156-172.
[10] Tom M. Mitchell, Machine learning, McGraw-Hill College, 2005.
[11] B. Michael, V. Manuela, Multiagent learning using a variable learning rate, Artificial Intelligence, Vol. 136, Issue 2, 2002, pp. 215-250.

2014 Copyright ©, International Frequency Sensor Association (IFSA) Publishing, S. L. All rights reserved. (http://www.sensorsportal.com)