On Balancing Exploration vs. Exploitation in a Cognitive Engine for Multi-Antenna Systems

Haris I. Volos and R. Michael Buehrer
Mobile and Portable Radio Research Group (MPRG), Wireless@Virginia Tech
Bradley Department of Electrical and Computer Engineering
Virginia Polytechnic Institute and State University
Email: {hvolos, buehrer}@vt.edu

Abstract

In this paper, we define the problem of balancing exploration vs. exploitation in a cognitive engine controlled multi-antenna communication system in terms of the classical multi-armed bandit framework. We then employ the ε-greedy strategy and Gittins indices methods for addressing the problem in a system with no prior information. Results show that the Gittins indices assuming a normal reward process had the best overall performance compared to the Gittins indices with a Bernoulli reward process and the ε-greedy strategy. The latter was found to be more consistent albeit inefficient for most of the cases, except in the case of both a low number of trials and a low SNR, in which it was found to have better performance than the other methods. Nevertheless, the Gittins indices method should be generally preferred as it is more consistent than the ε-greedy strategy across different scenarios.

I. INTRODUCTION

A radio has traditionally used a fixed set of communication methods selected by its operator. Today, however, a radio is expected not only to use a large number of the plethora of current communication methods, but also to be able to select the method that best meets its goal under the current operating environment. Typically, the radio designer will analyze each communication method in terms of the desired goal under assumed channel models. Then, the designer will combine the analysis in order to arrive at a set of adaptation rules that best meets the goal for the radio. Performing the required analysis is usually a lengthy task requiring a significant amount of effort, especially when a large number of different methods are involved.
Furthermore, if the channel models do not hold or the communication methods perform in a way not expected, the design becomes moot. The same applies if the desired goal changes. The question then arises: what if a radio could be designed in such a way that it could determine on its own which communication method to use to meet its goals? The scope of our work [1], [2] is to propose methods that can make such a design possible.

This work is based on the pioneering work of Mitola, who first described the cognitive radio. While Mitola's ideal Cognitive Radio (CR) is not only capable of optimizing its own wireless capabilities, but also of self-determining its goals by observing its human operator's behavior [3], we are only interested in optimizing the wireless capabilities of the radio. Research such as [4], [5] borrowed ideas from the fields of machine learning and Artificial Intelligence (AI) in an effort to make that possible. Although current research made steps in the right direction, the main channel metrics considered were the Signal-to-Noise Ratio (SNR) and basic channel statistics, which are not adequate for multiple-antenna systems. The latter systems are also affected by other factors such as spatial correlation. Also, [5] addressed a range of issues concerning a CR, where the Physical (PHY) layer issues received partial attention. Our overall work focuses on multi-antenna PHY layer aspects. Furthermore, we look closely at the two main tasks needed to enable the radio to optimize its own performance: learning its own capabilities and optimizing to meet its goals. We want the radio to use the most applicable and efficient methods for learning and optimizing. Therefore, our aim is to design a Cognitive Engine (CE), the software package that makes the desired behavior possible. In this paper we focus on the CE's performance while it is learning.

This work was supported by the National Science Foundation under Grant No. 5248.
In [2] we proposed a CE design that employs learning and optimization techniques in order to learn the capabilities of the radio and optimize for different goals. Our CE design differs from existing designs [5]-[8] in that the tasks of learning and optimizing are separate. Our CE is not only learning what is the best method, given the goal and the channel conditions, but it also learns the abilities of the radio independent of the goal. Should the goal change, the known abilities of the radio can be used to speed up the optimization process and minimize the need for new learning. Our method can be seen as analogous to the Actor-Critic [9] methods used in reinforcement learning, where the actor (optimization unit) suggests possible solutions, and the critic (learning unit) responds with the desirability of the proposed solution based on its experience.

In our previous work [2], we assumed that the learning unit had enough examples to learn what is needed to provide the optimization unit with the necessary information. Furthermore, we established that there may be a negative impact when not enough samples are available. The extent of the negative effect is dependent upon the learning and optimization techniques used. However, performance during learning (collecting examples) was not evaluated. Even if the CE is assumed to go through prolonged learning sessions, it is practically impossible to expose it to all possible channel conditions a priori. Consequently, it is reasonable to expect that the CE sooner or later will face unknown conditions. For example, a radio operating in a critical mission may not have the luxury of time to learn what is best before operating; it has to establish a connection and learn at the same time. Properly balancing exploration vs. exploitation ensures that the negative effects of learning will be kept to a minimum. Therefore, we need to evaluate the performance of the radio controlled by the CE during learning.

This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE "GLOBECOM" 2009 proceedings.

In this paper we seek to evaluate the CE's performance during learning and to introduce synergies between learning and optimization (i.e., match learning with the optimization goal(s)). To put it in more classical terms, we address the problem of balancing exploration vs. exploitation. Exploration refers to trying options with an unknown but potentially beneficial outcome. On the other hand, exploitation refers to using what is already known to have the highest performance metric. Methods focusing on exploitation instead of exploration are usually known as greedy or myopic methods. Exploration involves unknown risk, while exploitation tends to be safer in terms of expectations. Depending on the situation, exploitation might ultimately limit the long-term performance. In a few words, the exploration vs. exploitation problem is: do we choose an option that guarantees short-term performance (exploitation), or do we choose an option that could hurt short-term performance but improve the long-term performance of the system (exploration)? How do we balance those two conflicting objectives? This is a universally occurring problem, and we borrow the results from other fields such as reinforcement learning [9] and dynamic programming [10], [11] and apply them to the context of selecting the best communication technique in a multi-antenna communication system. The authors of [12] applied similar concepts to finding unused channels in the context of Dynamic Spectrum Sharing (DSS).

The goal of this paper is to investigate and apply exploration vs.
exploitation balancing techniques, namely the simple, yet effective, ε-greedy strategy and the more complex, but optimal, Gittins [13] dynamic allocation index method. Section II presents some background information and formulates the problem. Section III explains our test setup. Section IV presents and discusses the test results, and Section V provides some concluding remarks.

II. BACKGROUND AND PROBLEM FORMULATION

The problem of exploration vs. exploitation is classically studied by using the mathematical framework of the Multi-Armed Bandit (MAB) problem. We introduce the MAB problem and two possible approaches, namely the ε-greedy strategy and the Gittins indices. Finally, we define how the distribution parameters are estimated in a recursive way.

A. Our Communication Problem

We have a communication system with K options. Each option uses a combination of modulation, coding, and MIMO techniques. The system is controlled by a CE that can learn and optimize performance subject to the collection of data samples under the different channel conditions. In this paper we limit the scope of our results to the case of achieving maximum spectral efficiency (capacity) and assume that the channel statistics do not change during the operating interval. We plan on expanding to more goals and varying channel statistics in our next publication.

B. The Multi-Armed Bandit Problem

The MAB problem gets its name from the slot machines (bandits) found in casinos. A typical slot machine has a single arm that, when pulled, returns a reward with a certain probability. In the multi-armed bandit problem it is assumed that the player is faced with either multiple machines or a single machine with multiple arms, and his goal is to get the maximum reward by using the machines. Generally it is assumed that the player has little or no information about the bandits and he has to decide between exploring for the most rewarding machines and using the machine that was found to yield the highest reward. Essentially this is an information acquisition problem and the player is always faced with the same options.
Adapting the description found in [10], let $Y$ be the set of $K$ slot machines (communication options), and let $W_y$ be a random variable that gives the amount of the reward returned (capacity) if we use machine $y$. Also let $\mu_y$ be the unknown true mean of $W_y$ and $\sigma_y^2$ be the variance. Finally, let $(\bar{\mu}_y^n, \hat{\sigma}_y^{2,n})$ be the estimate of the mean and the variance of $W_y$ after $n$ iterations and $s^n$ a belief state about the random variable $W_y$. The estimates $(\bar{\mu}_y^n, \hat{\sigma}_y^{2,n})$ can be an example of a belief state. Let $x_y^n = 1$ if the $y$th machine is played at iteration $n$ and $W_y^n$ the reward returned on that round. Also let

\[ N_y^n = \sum_{n'=1}^{n} x_y^{n'} \tag{1} \]

be the total number of times the $y$th machine was used. In the MAB problem we are looking for a policy that maximizes the expected return $V(s)$:

\[ V(s) = E_s\left[\sum_{n=0}^{N} \gamma^n W^n\right] \tag{2} \]

where $N$ is the maximum number of plays (often assumed to be $\infty$), $E_s$ is the expectation operator over the belief state $s$, $\gamma$ is a discount factor with $0 < \gamma < 1$, and $W^n$ is the return at time $n$. The discount factor is used to ensure a finite return when $N \to \infty$. Another interpretation is to treat $\gamma$ as the probability that the process continues after each play, so that $1 - \gamma$ is the probability that it stops [14]. Therefore, the discount factor is a way to express our expectation on the duration of the optimization horizon. A low-valued discount factor discounts future returns at a higher rate. As a result, when balancing exploration vs. exploitation the latter has a higher weight. On the other hand, a high-valued discount factor ($\gamma \to 1$) will make future rewards more important and exploration will have a higher weight than in the previous case.
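To make the role of the discount factor concrete, the following minimal Python sketch evaluates the sum inside (2) for a single realized sequence of returns and counts plays as in (1). It is only an illustration and not part of the simulation described in this paper: the function names, the assumption that the sum starts at n = 0, and the two return sequences are all hypothetical.

```python
def play_counts(choices, num_options):
    """Number of times each option y was used, in the spirit of (1)."""
    counts = [0] * num_options
    for y in choices:
        counts[y] += 1
    return counts


def discounted_return(rewards, gamma):
    """Sum of gamma**n * W^n for n = 0, 1, ... over one realized reward sequence."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("the discount factor must satisfy 0 < gamma < 1")
    total = 0.0
    weight = 1.0  # gamma**0
    for reward in rewards:
        total += weight * reward
        weight *= gamma
    return total


# Hypothetical sequences (not measured data): three failed exploratory plays
# followed by a better option, versus staying with a known, weaker option.
explore_first = [0.0] * 3 + [2.0] * 7
exploit_only = [1.0] * 10

for gamma in (0.5, 0.99):
    print(gamma,
          discounted_return(explore_first, gamma),
          discounted_return(exploit_only, gamma))
```

With these made-up numbers the sequence that explores first has the lower discounted return for γ = 0.5 and the higher one for γ = 0.99, which matches the qualitative statement that a low discount factor gives exploitation more weight.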

Finding the policy that maximizes (2) is a K-dimensional problem. Two key methods of addressing this problem are the ε-greedy strategy and the use of the Gittins indices.

C. The ε-greedy Strategy

The ε-greedy strategy [9] is a simple strategy that exploits by using the best method ($y_{greedy} = \arg\max_y \bar{\mu}_y$) a fraction $1-\epsilon$ of the time, with $\epsilon \in [0, 1]$ (greedy). However, with probability ε it explores by using a random $y$, uniformly selected. The ε-greedy methods guarantee that all the options are explored as the horizon tends to infinity, so by the law of large numbers $\bar{\mu}_y$ is going to converge to the true mean. The ε parameter controls how fast exploration is performed. A higher ε will cause a faster exploration and arrive more quickly at an optimal or near-optimal option. However, the high exploration rate may cause reduced overall returns because of the higher exploration cost. The ε-greedy strategy has two main variations: the ε-first strategy and the ε-decreasing strategy. The former explores for $\epsilon N$ trials and exploits for the remaining $(1-\epsilon)N$ trials, where $N$ is the number of total trials. The latter variation decreases the exploration rate by decreasing ε as the number of trials increases. In this paper we consider only the classic version of the ε-greedy strategy. However, we have some prior information about the communication system that should be used: we know the potential return of each option (capacity) and we also know an upper bound of the capacity that can be achieved under the current channel. Therefore, we restrict the exploration to machines that potentially can outperform the current $y_{greedy}$. We also restrict exploration to methods with a capacity $\leq C_{max}$, the maximum Shannon estimated capacity for the current channel conditions as given by [15]:

\[ C_{max} = \sum_{i=1}^{\min\{N_t, M_r\}} \log_2\left(1 + \frac{SNR}{N_t}\lambda_i\right) \tag{3} \]

where $N_t$ and $M_r$ are the number of transmit and receive antennas respectively, and $\lambda_i$ is the $i$th eigenvalue of $\mathbf{H}\mathbf{H}^T$, where $\mathbf{H}$ is the $M_r \times N_t$ channel matrix.

D. The Gittins Index

Gittins in [13] showed that we can solve the K-dimensional problem of (2) by using a dynamic allocation index method that breaks the problem into a series of K one-dimensional problems. The Gittins index $\nu_y$ at a belief state $s$ is given by:

\[ \nu_y(s) = \sup_{N} \frac{E_s\left[\sum_{n=0}^{N} \gamma^n W_y^n\right]}{E_\pi\left[\sum_{n=0}^{N} \gamma^n\right]} \tag{4} \]

which can be interpreted as the maximum expected reward per unit of expected discounted time. The result, albeit simple and straightforward, has been proven in a non-trivial way to be optimal. The interested reader can find more information in the paper [13] and book [11] authored by Gittins. The optimal policy is simply to use the option $y$ with the highest $\nu_y$. The Gittins index is dependent upon the underlying distribution of $W_y$. In this work we consider the Gittins index for the Normal Reward Process (NRP) and the Bernoulli Reward Process (BRP). It may be noted that in the application examined in this work the underlying process is Bernoulli - either a certain capacity is achieved or not. In our application, if a transmitted packet is successfully received, then we assume a return equal to the capacity of the communication option used; otherwise the return is zero. Therefore, assuming a BRP is in theory more suitable than assuming an NRP. However, as will be shown in the results, assuming an NRP has some performance advantages over assuming a BRP.

For an NRP the Gittins index is equal to:

\[ \nu(\bar{\mu}, \sigma^2, n, \gamma) = \bar{\mu} + \sigma\,\nu(0, 1, n, \gamma) \tag{5} \]

where $\nu(0, 1, n, \gamma)$ is the Gittins index for a zero mean, unit variance distributed process. For a BRP the Gittins index is equal to [16]:

\[ \nu(\alpha, \beta, \gamma, R_y) = R_y\,\nu(\alpha, \beta, \gamma, 1) \tag{6} \]

where $\nu(\alpha, \beta, \gamma, 1)$ is the Gittins index for a Bernoulli process with $\alpha$ successes and $\beta$ failures, with a reward of 1 if successful. $R_y$ is the reward received when option $y$ is successful. A challenge of the BRP is that the Gittins index is not defined when either $\alpha$ or $\beta$ is zero. This is challenging because there are cases in which the probability of success of the BRP is practically either one or zero.
For example, beamforming using QPSK with a rate 1/8 convolutional code at medium to high SNR levels and spatial multiplexing using 256 QAM at medium to low SNR levels, respectively. For this reason we make the practical assumption that when either $\alpha$ or $\beta$ is zero we calculate $\nu(\alpha+1, \beta+1, \gamma, R_y)$ instead. This assumption allows us to get an estimate of the index that otherwise we would not be able to achieve in a reasonable number of trials. A downside of the Gittins indices is that it is not trivial to estimate them. On the other hand, for most practical purposes the indices tabulated in [11] are sufficient. Finally, as with the ε-greedy strategy, we limited the choice of $y$ by using (3).

E. Mean and Variance Estimation

For the NRP, we need to have an estimate of the mean and the variance of each $W_y$. For their estimation, we adopt the method described in [10]. Subsequently, the mean can be recursively estimated using:

\[ \bar{\mu}_y^n = \begin{cases} \dfrac{N_y^n - 1}{N_y^n}\,\bar{\mu}_y^{n-1} + \dfrac{1}{N_y^n}\,W_y^n & \text{if } x_y^n = 1 \\ \bar{\mu}_y^{n-1} & \text{otherwise} \end{cases} \tag{7} \]

The variance can be similarly estimated using:

\[ \hat{\sigma}_y^{2,n} = \begin{cases} \dfrac{N_y^n - 2}{N_y^n - 1}\,\hat{\sigma}_y^{2,n-1} + \dfrac{1}{N_y^n}\left(W_y^n - \bar{\mu}_y^{n-1}\right)^2 & \text{if } x_y^n = 1 \text{ and } N_y^n \geq 2 \\ \hat{\sigma}_y^{2,n-1} & \text{if } x_y^n = 0 \end{cases} \tag{8} \]

$N_y^n$ can be updated using:

\[ N_y^n = N_y^{n-1} + x_y^n \tag{9} \]
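The recursive updates (7)-(9), and one way an ε-greedy choice restricted by a capacity bound could consume them, can be sketched in Python as follows. This is a hedged illustration rather than the implementation used in this work: the class and function names, the toy capacities and success probabilities, and the exact form of the exploration restriction (explore only allowed options whose capacity exceeds the current greedy mean) are assumptions made for the example.

```python
import random
import statistics


class RunningEstimate:
    """Running mean and sample variance of the returns of one option y."""

    def __init__(self):
        self.count = 0                # N_y, number of times the option was used
        self.mean = 0.0               # running mean of the returns
        self.variance = float("inf")  # unknown until two samples are available

    def update(self, reward):
        """Fold in one return; call only when this option was played (x_y = 1)."""
        self.count += 1  # count update, as in (9)
        n = self.count
        previous_mean = self.mean
        self.mean = (n - 1) / n * previous_mean + reward / n  # as in (7)
        if n >= 2:
            # For n == 2 the weight on the old variance is zero, so the
            # infinite placeholder is replaced by 0.0 to avoid 0 * inf.
            old = 0.0 if n == 2 else self.variance
            self.variance = ((n - 2) / (n - 1) * old
                             + (reward - previous_mean) ** 2 / n)  # as in (8)

    def variance_of_mean(self):
        """Sample variance divided by the count; infinite with fewer than two samples."""
        return self.variance / self.count if self.count >= 2 else float("inf")


def epsilon_greedy_choice(estimates, capacities, c_max, epsilon, rng):
    """Return an option index; assumes at least one capacity is <= c_max."""
    allowed = [y for y, c in enumerate(capacities) if c <= c_max]
    greedy = max(allowed, key=lambda y: estimates[y].mean)
    # Assumed restriction: explore only options that could beat the greedy mean.
    candidates = [y for y in allowed
                  if y != greedy and capacities[y] > estimates[greedy].mean]
    if candidates and rng.random() < epsilon:
        return rng.choice(candidates)
    return greedy


# Toy run with hypothetical options: (capacity in bits/symbol, success probability).
rng = random.Random(0)
capacities = [2.0, 4.0, 8.0]
success_prob = [0.95, 0.90, 0.05]
estimates = [RunningEstimate() for _ in capacities]
for _ in range(500):
    y = epsilon_greedy_choice(estimates, capacities, 8.0, 0.1, rng)
    reward = capacities[y] if rng.random() < success_prob[y] else 0.0
    estimates[y].update(reward)
print([e.count for e in estimates])
```

The estimator keeps only three numbers per option, so its cost does not grow with the number of trials; the same object can feed either the ε-greedy rule above or an index lookup.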

The variance of $\bar{\mu}_y^n$ is given by:

\[ \bar{\sigma}_y^{2,n} = \frac{1}{N_y^n}\,\hat{\sigma}_y^{2,n} \tag{10} \]

If $N_y^n$ is 0 or 1, then $\hat{\sigma}_y^{2,n} = \infty$. That means we don't know anything about the distribution. However, assumptions can be made. In this work we initialize the index to the maximum potential return (capacity) of each option until $N_y^n > 1$. In large problems it is hard to estimate the variance; therefore, we can use a single population variance for the initial steps:

\[ \hat{\sigma}^{2,n} = \frac{n-2}{n-1}\,\hat{\sigma}^{2,n-1} + \frac{1}{n}\sum_{y \in Y} x_y^n \left(W_y^n - \bar{\mu}_y^{n-1}\right)^2 \tag{11} \]

which is updated after every trial. The variance of $\bar{\mu}_y^n$ is then given by:

\[ \bar{\sigma}_y^{2,n} = \frac{\hat{\sigma}^{2,n}}{N_y^n} \tag{12} \]

III. TEST SETUP

We evaluated the proposed methods by implementing the system using the MATLAB simulation software package.

A. Configuration

Three MIMO techniques were considered: Beamforming, Transmit Diversity using an STBC, and Spatial Multiplexing using V-BLAST. The MIMO system was assumed to have four transmit and four receive antennas (4 × 4). The CE had the choice of the following modulation schemes: QPSK, 8-PSK, 16, 32, 64, 128, and 256 QAM. Furthermore, it could vary the coding rate of a convolutional codec with a constraint length K = 8 [17]. The available coding rates were: 1, 7/8, 3/4, 2/3, 1/2, 1/4, 1/6, and 1/8. Not all modulation/coding combinations were allowed. The modulation/coding combinations were chosen such that the combined distance metric [17] and spectral efficiency monotonically decreased and increased respectively. The final combinations had 22 different spectral efficiencies from 0.25 bits/symbol (QPSK, 1/8 codec) to 8 bits/symbol (256-QAM, uncoded).

B. Methods Tested

We tested the Gittins index for both NRP and BRP for γ equal to 0.5, 0.7, and 0.99. The ε-greedy strategy was evaluated with ε equal to ., .. Values around . are commonly used. The methods were tested in an SNR range between 5 and 50 dB at 5 dB intervals at a maximum pairwise antenna correlation ρ = .. We also tested the case of ρ = .5. In the latter case we only considered γ = 0.7 and ε = .. A high correlation, ρ, between the antenna elements negatively affects the use of spatial multiplexing.
The reader is reminded that spatial multiplexing exploits the availability of multiple spatial modes, which are reduced to one as ρ → 1. It may be noted that the different channel metrics such as SNR and ρ represent different sets of working options. Higher SNR levels have more options available, and a low ρ value allows the use of more spatial multiplexing options. The availability of working options will affect the total return, as options that do not work will adversely affect the return (if they are explored).

Gittins [11] provides tables for γ equal to 0.5, 0.6, 0.7, 0.8, 0.9 and 0.99 for both NRP and BRP.

C. Evaluation Metrics

In the results to follow we will compare the total return and the average instantaneous return to the optimal respective return. The total return is the sum of all returns for a number of trials. The average instantaneous return is the average return experienced after a specific number of trials. The optimal return for each case is the best possible return for the underlying channel conditions. The optimal returns were estimated by employing a brute force search over all available options and using 4 trials per option. As a comparison, each method (Gittins & ε-greedy) was evaluated after running up to 5 trials in total.

IV. RESULTS

Table I presents the average total return over the optimal total return, averaged over all the SNR levels, for all the methods and parameters tested. There are two sets, the first estimated at 5 trials and the second estimated at 5 trials. The results for 5 trials represent performance with a relatively short time frame and the results for 5 trials for a longer time frame.

TABLE I
AVERAGE TOTAL RETURN OVER OPTIMAL TOTAL RETURN

Number of Trials Performed    5      5          5      5      5          5
Discount Factor, γ            0.5    0.7        0.99   0.5    0.7        0.99
Gittins Index, NRP            .73    .73, .7    .72    .93    .93, .93   .94
Gittins Index, BRP            .6     .59, .65   .56    .89    .89, .87   .85
Exploration Parameter, ε      .      .          .2     .      .          .2
ε-greedy strategy             .53    .65, .5    .7     .74    .87, .87   .87

Value after the comma: max. pairwise antenna correlation, ρ, equal to .5; . otherwise.

Fig. 1. Total Return vs. SNR, 5 Trials (total return over optimal total return vs. SNR in dB).
Fig. 2. Total Return vs. SNR, 5 Trials (total return over optimal total return vs. SNR in dB).
Fig. 3. Total Return vs. Trials, SNR = 5 dB (total return over optimal total return vs. trials in packets).

At 5 trials, both Gittins index methods have a slightly reduced performance when a higher valued discount factor is used. This is because there is a higher focus on exploration. The same can be observed for the BRP at 5 trials - the NRP has practically no fluctuations. Also, we can observe that after 5 trials the returns are higher compared to 5 trials. Results show that the NRP has the highest overall returns. The reason why the NRP performs better than the BRP is explained in the following paragraphs, when the results of Figure 1 are discussed. Still commenting on the results of Table I, the ε-greedy strategy showed reduced performance with lower values of ε. In this case the reduced exploration hurt the returns in both cases. Most of the results were obtained assuming ρ = .. In addition, there is a small set for ρ = .5 (Section III-B). In the latter case, the results show that the Gittins NRP is still superior and performance at 5 trials is slightly reduced for the NRP and the ε-greedy strategy, and slightly improved for the BRP. With 5 trials, the results were nearly the same.

Figure 1 plots the total achieved reward over the optimal reward value by running the Gittins index methods for γ = 0.7 and the ε-greedy strategy for ε = .. The results of Figure 1 show: first, that all the methods have degraded performance at low SNRs. This can be attributed to the fact that many options simply do not work and, as a result, the exploration cost is higher.
Even though the options used are limited by (3), not all non-performing options are eliminated, because the limit given by (3) is not tight since the options used are suboptimal compared to the limit. Second, the Gittins index using a BRP was found to perform poorly at low to medium SNR levels. This is possibly due to the fact that this reward process is not defined for the cases where either $\alpha$ or $\beta$ is zero. Therefore, in order to be able to estimate the index when either $\alpha$ or $\beta$ is zero, instead of calculating $\nu_y(\alpha, \beta, \gamma, R_y)$ we calculated $\nu_y(\alpha+1, \beta+1, \gamma, R_y)$. This caused the index to be higher for a low number of trials, causing the exploration of those options until enough samples were collected. This disadvantage is more pronounced at the low SNR levels, where some of the methods have an extremely low probability of working and require an extremely large number of trials to get a non-zero $\alpha$. The same applies to $\beta$ for the higher SNRs; however, it is unlikely that the latter is causing any performance degradation. On the other hand, in the NRP case, if no successes are found, then $\bar{\mu}$ in (5) is going to be zero, and the other main contribution factor will be the variance, which at the initial stages is assumed to be the same (11) for all $y$. Therefore, in this case the current estimate of $\bar{\mu}$ will carry most of the decision weight until enough samples are collected for the individual variance calculation and for $\nu(0, 1, n, \gamma)$ in (5) to be a more decisive factor. Third, the Gittins index BRP has a reduced performance at an SNR = 5 dB. This is because, instead of the method focusing/settling on the optimal choice of beamforming with uncoded 128 QAM, with an expected return of 7 × 0.98 = 6.86 bps/Hz, where 7 is the spectral efficiency and 0.98 the probability of a successful packet, it settles on using VBLAST with uncoded QPSK, with an expected return of 8 × 0.7 = 5.6 bps/Hz. The reason that the latter is preferred is that it has a higher potential return (8 > 7 bps/Hz), which makes it more desirable for exploration.
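The comparison in the third point is a short calculation: the expected return of an option is its spectral efficiency multiplied by the probability of a successful packet. The Python lines below simply redo that arithmetic with the two options quoted above; the function and variable names are illustrative assumptions.

```python
def expected_return(capacity, p_success):
    """Expected per-packet return: capacity on success, zero otherwise."""
    return capacity * p_success


# (spectral efficiency in bps/Hz, probability of a successful packet)
options = {
    "beamforming, uncoded 128 QAM": (7.0, 0.98),
    "VBLAST, uncoded QPSK": (8.0, 0.70),
}

best_expected = max(options, key=lambda name: expected_return(*options[name]))
best_potential = max(options, key=lambda name: options[name][0])

for name, (capacity, p_success) in options.items():
    value = expected_return(capacity, p_success)
    print(f"{name}: potential {capacity:.2f}, expected {value:.2f} bps/Hz")
print("highest expected return:", best_expected)
print("highest potential return:", best_potential)
```

Ranking by potential return and ranking by expected return pick different options here, which is the mismatch described above.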
And fourth, the observations in the low SNR region of Figure 1 can also be seen in Figure 3, which shows the progression of the total reward in terms of the optimal reward. Results show that the ε-greedy strategy has the best performance for the first 5 trials, followed by the Gittins index with NRP and BRP. The latter had the worst performance for the reasons explained above. As the number of trials increases, the Gittins index with NRP outperforms the other two methods.

Figure 2 is the same type as Figure 1 but with 5 trials instead of 5. In this case the ε-greedy strategy seems to have the best performance up to an SNR = 20 dB. Potentially, one could use the ε-greedy strategy when the goal is to perform best in a time horizon of less than 5 trials.

A look at the instantaneous return is provided by Figures 4 & 5 for an SNR equal to 25 and 50 dB respectively. It may be observed from both figures that after 5 trials all the methods achieve a return very close to the return achieved after 5 trials. In addition, at a medium SNR (Figure 4) the two Gittins index methods perform the same after trials. On the contrary, at a high SNR level (Figure 5) the Gittins index with a BRP outperforms the Gittins index with an NRP. The ε-greedy strategy also outperforms the latter after trials and it also settles on the optimal method. The reason behind this is that the Gittins index with an NRP is using VBLAST with both 128 QAM and 256 QAM (both uncoded), with a return of 28 and 32 bps/Hz respectively. As a result, the average instantaneous reward is 30.55 vs. the optimal of 31.36 bps/Hz.

Fig. 4. Average Instantaneous Return vs. Trials, SNR = 25 dB (average instantaneous return in bits/s/Hz vs. trials in packets).
Fig. 5. Average Instantaneous Return vs. Trials, SNR = 50 dB (average instantaneous return in bits/s/Hz vs. trials in packets).

Finally, it may be noted that the results obtained by all the methods tested are always subject to the parameters used and are also dependent on the number of the available options and the underlying conditions. For example, the ε-greedy strategy is known to suffer when the number of options is very large []. On the other hand, we have seen that the Gittins indices with the NRP tend to perform better most of the time, verifying the long-term optimality of the Gittins indices.

V. CONCLUSIONS

In this paper we defined the exploration vs. exploitation problem in the context of a Cognitive Engine trying to learn (exploring) while providing optimal performance (exploiting) at the same time. This problem can be classically addressed in terms of the multi-armed bandit problem, which can be solved suboptimally by the ε-greedy method and optimally by the use of the Gittins dynamic allocation indices.
Even though our test scenario was a Bernoulli reward process, it was found that using the Gittins index of a normal reward process yielded better results, especially in the low SNR regions. In addition, the ε-greedy method was found to work well when the number of trials is small (short-term performance) and the SNR is poor. Nevertheless, the Gittins indices method should be generally preferred as it is more consistent than the ε-greedy strategy across different scenarios.

REFERENCES

[1] H. I. Volos, C. I. Phelps, and R. M. Buehrer, "Initial Design of a Cognitive Engine for MIMO Systems," in SDR Forum Technical Conference, Nov. 2007.
[2] H. I. Volos, C. I. Phelps, and R. M. Buehrer, "Physical Layer Cognitive Engine For Multi-Antenna Systems," in IEEE Military Communications Conference, Nov. 2008.
[3] J. Mitola, III, "Cognitive Radio: An Integrated Agent Architecture for Software Defined Radio," Ph.D. dissertation, KTH, Stockholm, Sweden, May 2000.
[4] C. Clancy, J. Hecker, E. Stuntebeck, and T. O'Shea, "Applications of Machine Learning to Cognitive Radio Networks," IEEE Wireless Communications, vol. 14, no. 4, pp. 47-52, August 2007.
[5] T. W. Rondeau, "Application of Artificial Intelligence to Wireless Communications," Ph.D. dissertation, Virginia Tech, 2007.
[6] C. J. Rieser, "Biologically Inspired Cognitive Radio Engine Model Utilizing Distributed Genetic Algorithms for Secure and Robust Wireless Communications and Networking," Ph.D. dissertation, Virginia Tech, 2004.
[7] T. R. Newman, B. A. Barker, A. M. Wyglinski, A. Agah, J. B. Evans, and G. J. Minden, "Cognitive engine implementation for wireless multicarrier transceivers," Wiley Journal on Wireless Communications and Mobile Computing, vol. 7, no. 9, pp. 1129-1142, 2007.
[8] Z. Zhao, S. Xu, S. Zheng, and J. Shang, "Cognitive radio adaptation using particle swarm optimization," Wiley Journal on Wireless Communications and Mobile Computing, 2008, published online.
[9] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. The MIT Press, March 1998.
[10] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality (Wiley Series in Probability and Statistics). Wiley-Interscience, 2007.
[11] J. C. Gittins, Multi-armed Bandit Allocation Indices. Wiley, Chichester, NY, 1989.
[12] L. Lai, H. E. Gamal, H. Jiang, and H. V. Poor, "Cognitive medium access: Exploration, exploitation and competition," IEEE/ACM Trans. on Networking, Oct. 2007, submitted.
[13] J. Gittins and D. Jones, "A Dynamic Allocation Index for the Sequential Design of Experiments," Progress in Statistics, pp. 241-266, 1974.
[14] I. M. Sonin, "A generalized Gittins index for a Markov chain and its recursive calculation," Statistics & Probability Letters, vol. 78, no. 12, pp. 1526-1533, September 2008.
[15] D. Gesbert, M. Shafi, D. sha Shiu, P. Smith, and A. Naguib, "From theory to practice: an overview of MIMO space-time coded wireless systems," Selected Areas in Communications, IEEE Journal on, vol. 21, no. 3, pp. 281-302, Apr 2003.
[16] D. Acuña and P. Schrater, "Bayesian Modeling of Human Sequential Decision-Making on the Multi-Armed Bandit Problem," in Proceedings of the 30th Annual Conference of the Cognitive Science Society, V. Sloutsky, B. Love, and K. McRae, Eds. Washington, DC: Cognitive Science Society, 2008.
[17] J. Proakis, Digital Communications, 4th ed. New York: McGraw-Hill, 2.