A Multi Armed Bandit Formulation of Cognitive Spectrum Access


Anonymous Author(s)

Abstract

We consider a cognitive network in which a cognitive user attempts to access a channel when it is not occupied by primary users. The problem is formulated as a multi-armed bandit (MAB) problem. After reviewing several existing MAB algorithms, we propose a new MAB algorithm. Simulation results demonstrate the advantage of the proposed scheme over other existing algorithms when applied to a cognitive spectrum access problem.

1 Introduction

Recently, the overwhelming increase in wireless services and devices has resulted in overcrowded wireless networks and a lack of spectrum resources. This problem stimulated a new paradigm of wireless communication, referred to as cognitive communications [1]. The basic idea of this technique is to take advantage of unused portions of licensed spectrum. In a cognitive network, users are classified into primary users and secondary users. Primary users always have permission to transmit, while a secondary user, also known as a cognitive user, first senses the channel and transmits its information only if the channel is not occupied. Extensive attention has been paid to developing efficient schemes for cognitive users to access the spectrum.

In this paper, we propose to cast the medium access problem of cognitive users into the framework of a multi-armed bandit (MAB) problem. Each channel is regarded as a slot machine with a certain expected reward, and the cognitive user as a gambler playing several slot machines. The MAB problem has been well investigated in the context of machine learning. The index policy proposed in [2] is proven to be optimal if the reward distribution is stationary. With non-stationary reward distributions, on the other hand, Whittle's index [3] is proven to be asymptotically optimal. However, these algorithms assume an infinite time horizon, which causes problems when they are applied to the spectrum access problem of cognitive users. Moreover, a wireless channel is by nature time varying, which must also be treated carefully when applying existing MAB algorithms to cognitive communication. In this paper, we introduce and evaluate several existing MAB algorithms and propose a new algorithm that combines existing schemes. The new algorithm takes into account both the finite-time and the time-varying nature of a wireless channel.

The remainder of the paper is organized as follows. Section 2 describes the network model and formulates the spectrum access problem of cognitive communication as a MAB problem. Section 3 introduces several existing MAB algorithms as well as the proposed algorithm. Simulation results are provided in Section 4, followed by concluding remarks in Section 5.

2 Network Model

[Figure 1: Channel model.]

Fig. 1 shows the network model of interest in this paper; the model and notation are similar to those in [4]. Consider a network consisting of a total of $N$ channels, $\mathcal{N} = \{1,\dots,N\}$. The primary users have priority to access all the channels, while a cognitive user tries to use these channels when they are not occupied by the primary users. The channels are accessed in a time-slotted fashion. Let $i$ refer to the channel index, $j$ to the time slot index, and $k$ to the cognitive user index. Assume that in each time slot channel $i$ is free with probability $p_i$, and let $\mathbf{p} = (p_1,\dots,p_N)$. Let $b_i(j)$ be a random variable that equals 1 if channel $i$ is available in time slot $j$ and 0 otherwise. For the wireless channel we assume a block-varying model, i.e., the value of $\mathbf{p}$ is static for a block of $T$ time slots. The cognitive user is normally assumed to be unaware of $\mathbf{p}$ a priori.

In the network model, the cognitive user seeks to exploit the free channels by sensing a channel at the beginning of each time slot. In particular, at time slot $j$ the cognitive user selects a channel $s(j) \in \mathcal{N}$ to access. If the sensing result shows that channel $s(j)$ is free, i.e., $b_{s(j)}(j) = 1$, the cognitive user can send one unit of information over this channel; otherwise the cognitive user has to wait until the next time slot and choose a channel again. The question is which channel the cognitive user should sense in each time slot. The total number of units of information that the cognitive user is able to send over one block is

$$W = \sum_{j=1}^{T} b_{s(j)}(j), \tag{1}$$

and the problem can be stated as characterizing strategies that maximize

$$E\{W\} = E\left\{ \sum_{j=1}^{T} b_{s(j)}(j) \right\}. \tag{2}$$

Intuitively, the essence of the problem is a trade-off between exploitation and exploration. Exploitation means that the cognitive user acts myopically, selecting the channel with the highest estimated probability of being free given all observations so far. Exploration means that, in order to learn the true value of $\mathbf{p}$ (it is assumed that there is a true value of $\mathbf{p}$ in the real world), the cognitive user tries different channels in different time slots. This observation allows us to interpret the problem in a Bayesian framework and to reformulate it as a MAB problem.
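The block model and the reward in Equation (1) can be illustrated with a minimal sketch. It assumes that channel availability is an independent Bernoulli draw in every slot, which the text implies but does not state outright. The names `simulate_block`, `policy`, and `always` are hypothetical helpers introduced only for this illustration, and channels are indexed from 0 rather than 1.

```python
import random

def simulate_block(p, policy, T, rng):
    """Run one block of T slots and return W, the units sent (Equation (1)).

    p      -- per-channel free probabilities, hidden from the policy
    policy -- hypothetical callable: history -> channel index in range(len(p))
    """
    history = []  # (s(j), b_{s(j)}(j)) pairs observed so far
    W = 0
    for _ in range(T):
        s = policy(history)
        # Assumption: availability is an independent Bernoulli(p[s]) draw.
        b = 1 if rng.random() < p[s] else 0  # 1 means the sensed channel is free
        W += b
        history.append((s, b))
    return W

def always(channel):
    """A trivial strategy that senses the same channel in every slot."""
    return lambda history: channel

rng = random.Random(0)
p = [0.2, 0.5, 0.9]
T = 100
runs = 2000
avg = sum(simulate_block(p, always(2), T, rng) for _ in range(runs)) / runs
print(avg)  # the expected value E{W} for this strategy is T * p[2] = 90
```

A strategy that knew $\mathbf{p}$ would always sense the best channel, as `always(2)` does here; the learning algorithms have to approach this behaviour without knowing $\mathbf{p}$.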

2.1 Problem Formulation

The following typical MAB example illustrates our problem. A gambler sequentially chooses one of $N$ machines to play. If he wins, he receives one unit of reward. The $i$th machine has winning probability $p_i$, which is unknown to the gambler, but he observes the outcomes of past plays. The goal is to maximize the overall reward after a total of $T$ plays.

Denote a medium access strategy of the cognitive user, i.e., a strategy for choosing channels, by $\Gamma$. $\Gamma$ is a function of the previous $j-1$ observations:

$$\Phi(j) = \{ s(1), b_{s(1)}(1), \dots, s(j-1), b_{s(j-1)}(j-1) \}, \quad j \ge 2. \tag{3}$$

Note that $s(j)$ is the channel chosen by strategy $\Gamma$ at time $j$, i.e., $s(j) = \Gamma(\Phi(j))$. The payoff function is the expected number of units of information the cognitive user is able to transmit in a block,

$$W_\Gamma = E\left\{ \sum_{j=1}^{T} b_{s(j)}(j) \right\} = \sum_{j=1}^{T} \sum_{i=1}^{N} p_i \Pr\{\Gamma(\Phi(j)) = i\}, \tag{4}$$

and the regret function is

$$R_\Gamma = \sum_{j=1}^{T} p^* - \sum_{j=1}^{T} \sum_{i=1}^{N} p_i \Pr\{\Gamma(\Phi(j)) = i\}, \tag{5}$$

where $p^* = \max\{p_1,\dots,p_N\}$.

With the MAB problem formulated, we are ready to proceed to learning algorithms.

3 Learning Algorithms

3.1 Upper Confidence Bound

In [5], Agrawal defines a family of policies based on the mean value of the reward. These policies are referred to as Upper Confidence Bound (UCB) algorithms. The main idea of UCB is to add a bias factor to the mean value of the reward. The algorithm first selects each channel once. Then, at time slot $j$, UCB chooses channel $s(j)$ such that

$$s(j) = \arg\max_{i \in \mathcal{N}} \left( \frac{x_i(j)}{y_i(j)} + \sqrt{\frac{\sigma \log j}{y_i(j)}} \right), \tag{6}$$

where $y_i(j)$ is the number of times channel $i$ has been chosen up to time $j-1$, $x_i(j) = \sum_{t=1}^{j} v_i(t)$ is the number of time slots up to time $j-1$ in which the cognitive user has sensed channel $i$ to be free, with $v_i(t) = 1$ if channel $i$ was sensed free in slot $t-1$ and $v_i(t) = 0$ otherwise, and $\sigma$ is a design parameter chosen to be 2 in [5].

3.2 Upper Confidence Bound Tuned (UCBT)

The UCBT algorithm was first proposed by Auer et al. in [6]. Its main characteristic is the use of the empirical variance in the bias sequence. Thus, exploration is reduced for channels with small reward variance. The UCBT algorithm chooses channel $s(j)$ such that

$$s(j) = \arg\max_{i \in \mathcal{N}} \left( z_i(j) + \sqrt{\frac{\left( z_i(j) - (z_i(j))^2 \right) \sigma \log j}{y_i(j)}} + \frac{c \log j}{y_i(j)} \right), \tag{7}$$

where $z_i(j) = x_i(j) / y_i(j)$ and $c$ is a further design parameter that is free to adjust.
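The two selection rules in Equations (6) and (7) can be sketched as index functions over the counts $x_i(j)$ and $y_i(j)$. This is an illustrative sketch rather than the authors' implementation. The function names and the default `c=1.0` are assumptions (the text leaves $c$ free), and channels are indexed from 0.

```python
import math

def ucb_index(x, y, j, sigma=2.0):
    """Equation (6): empirical mean plus the bias sqrt(sigma * log j / y)."""
    return x / y + math.sqrt(sigma * math.log(j) / y)

def ucbt_index(x, y, j, sigma=2.0, c=1.0):
    """Equation (7): bias scaled by the empirical variance z - z^2.

    c=1.0 is an assumed default; the paper leaves c as a free parameter.
    """
    z = x / y
    variance = z - z * z  # variance of a 0/1 reward with mean z
    return z + math.sqrt(variance * sigma * math.log(j) / y) + c * math.log(j) / y

def select(index_fn, xs, ys, j):
    """Return the 0-based channel to sense at slot j.

    xs[i] -- number of slots in which channel i was sensed free
    ys[i] -- number of slots in which channel i was sensed at all
    """
    # Initialisation phase: sense every channel once before using the index.
    for i, y in enumerate(ys):
        if y == 0:
            return i
    scores = [index_fn(x, y, j) for x, y in zip(xs, ys)]
    return max(range(len(scores)), key=scores.__getitem__)

# Two channels sensed 10 times each: the one found free more often wins.
print(select(ucb_index, [9, 3], [10, 10], 21))    # channel 0
# A channel sensed only once keeps a large bias term and is explored again.
print(select(ucb_index, [50, 0], [100, 1], 102))  # channel 1
```

Because the bias terms shrink as $y_i(j)$ grows, a rarely sensed channel is revisited even when its empirical mean is low, which is the exploration behaviour described above.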

3.3 Discounted UCB (DUCB)

The discounted UCB [7] adds a discount factor to the original UCBT algorithm. The average reward is weighted as

$$\hat{z}_i(j) = \frac{\sum_{t=1}^{T} \gamma_i^{T-t} x_i(t)}{\hat{n}_i(j)}, \qquad \hat{n}_i(j) = \sum_{t=1}^{T} \gamma_i^{T-t} \mathbf{1}\{s(t) = i\}, \tag{8}$$

where $0 < \gamma_i < 1$ is the discount factor for channel $i$. The factor $\gamma_i$ represents how fast channel $i$ changes. The discounted UCB is especially suitable for wireless channels because of the time-varying nature of the wireless environment: the algorithm assigns less weight to old data and more weight to fresh data.

3.4 Sliding Window UCB (SWUCB)

Another practical algorithm is the sliding window UCB [8]. The difference between SWUCB and DUCB is that SWUCB uses only a window of length $l$ and considers only the average reward within this window. The window length decreases as the environment changes faster.

3.5 Combined UCBT and DUCB

In this section, we propose a novel UCB algorithm that combines the UCBT and DUCB algorithms. The combined algorithm adopts Equation (8) as the average reward function and uses the selection criterion of UCBT. The selection criterion of the new algorithm is therefore

$$s(j) = \arg\max_{i \in \mathcal{N}} \left( \hat{z}_i(j) + \sqrt{\frac{\left( \hat{z}_i(j) - (\hat{z}_i(j))^2 \right) \sigma \log j}{y_i(j)}} + \frac{c \log j}{y_i(j)} \right), \tag{9}$$

where $\hat{z}_i(j)$ is given in Equation (8).

The combined algorithm enjoys the benefits of both UCBT and DUCB: it accounts for the empirical variance as well as the time-varying nature of wireless channels.

4 Simulation Results

In this section, we provide simulation results for all the MAB algorithms introduced in this paper as well as the proposed algorithm. The test scenario includes 20 channels with time block length $T = 100$ and 2000 blocks in total. The wireless channels are generated according to the IEEE 802.11 standard.
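The combined rule of Section 3.5 and a scenario of the same shape can be sketched as follows. This is a rough illustration under several loudly stated assumptions, not the simulation used in the paper:

- the 802.11 channel model is replaced by a stand-in in which $\mathbf{p}$ is redrawn uniformly at random for every block;
- a single discount factor `gamma` is used for all channels;
- the discounted sums of Equation (8) are read in the usual discounted-UCB way, taking the reward in (8) as the 0/1 outcome of each slot and evaluating the sums recursively up to the most recent slot;
- the names `combined_index` and `simulate` and the default `c=1.0` are hypothetical.

```python
import math
import random

def combined_index(z_hat, y, log_j, sigma=2.0, c=1.0):
    """Equation (9): the bias of Equation (7) applied to the discounted mean."""
    variance = max(z_hat - z_hat * z_hat, 0.0)  # variance of a 0/1 reward
    return z_hat + math.sqrt(variance * sigma * log_j / y) + c * log_j / y

def simulate(blocks, T, n, gamma, rng):
    """Return the pseudo-regret sum_j (p* - p_{s(j)}) of each block.

    Its expectation over the channel draws is the regret of Equation (5).
    """
    disc_reward = [0.0] * n  # discounted reward sums (numerator of Eq. (8))
    disc_count = [0.0] * n   # discounted play counts (n-hat in Eq. (8))
    plays = [0] * n          # plain play counts y_i(j)
    regrets = []
    j = 0
    for _ in range(blocks):
        # ASSUMPTION: stand-in channel model, p redrawn uniformly per block.
        p = [rng.random() for _ in range(n)]
        p_best = max(p)
        regret = 0.0
        for _ in range(T):
            j += 1
            if j <= n:
                s = j - 1  # initialisation: sense each channel once
            else:
                log_j = math.log(j)
                scores = []
                for i in range(n):
                    # Fall back to 0 if the discounted count has underflowed.
                    z = disc_reward[i] / disc_count[i] if disc_count[i] > 0.0 else 0.0
                    scores.append(combined_index(z, plays[i], log_j))
                s = max(range(n), key=scores.__getitem__)
            b = 1 if rng.random() < p[s] else 0  # 1 means the channel was free
            for i in range(n):  # age every channel's statistics by gamma
                disc_reward[i] *= gamma
                disc_count[i] *= gamma
            disc_reward[s] += b
            disc_count[s] += 1.0
            plays[s] += 1
            regret += p_best - p[s]
        regrets.append(regret)
    return regrets

# A short run; the scenario in the text corresponds to blocks=2000.
regrets = simulate(blocks=20, T=100, n=20, gamma=0.95, rng=random.Random(1))
print(sum(regrets) / len(regrets))
```

The statistics are carried across block boundaries, so the discounting is what lets the estimate follow $\mathbf{p}$ when it changes. Whether the original simulation resets its state between blocks is not stated in the text, so this too is an assumption.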
The simulation results, including the average regret, the variance of the regret, and the percentage of time the optimal channel is chosen, are plotted in Figures 2, 3, and 4. Although UCB exhibits the highest average regret and regret variance, it performs best in terms of the percentage of time the optimal channel is chosen. UCBT performs best in terms of regret variance, and SWUCB exhibits the best average regret. The performance of the proposed algorithm lies between that of UCBT and SWUCB; however, it chooses the optimal channel a larger percentage of the time than those two algorithms.

5 Concluding Remarks

In this paper, we propose to use the MAB problem model to formulate the spectrum access problem in cognitive radio. Several existing algorithms for solving the MAB problem are introduced. We also propose a novel algorithm, the combined UCBT and DUCB algorithm, to address the problem. The performance of these algorithms is evaluated under wireless channels generated by the IEEE 802.11 standard model.

Several aspects are worth further investigation as potential future work. First, although the simulation results demonstrate the advantage of the proposed scheme, it is necessary to derive theoretical bounds on the regret in order to evaluate exactly how good the scheme is. Moreover, multiple cognitive users can be included in the network model. Finally, the work can be extended by adding an actual behavior model of the primary users to generate the probability distribution of channels being free.

[Figure 2: Average regret. Plot of average regret versus observation period for UCB, UCBT, DUCB, and Combined UCB.]

[Figure 3: Variance of regret. Plot of variance of regret versus observation period for UCB, UCBT, DUCB, and Combined UCB.]

[Figure 4: Percentage of best channel chosen. Plot of the percentage of best channel chosen versus observation period for UCB, UCBT, DUCB, and Combined UCB.]

References

[1] Mitola, J. (2000) Cognitive radio: an integrated agent architecture for software defined radio. Royal Institute of Technology (KTH), Stockholm, Sweden.

[2] Gittins, J. & Jones, D. (1974) A dynamic allocation index for the sequential design of experiments. Progress in Statistics, European Meeting of Statisticians, vol. 1, pp. 241-266.

[3] Whittle, P. (1988) Restless bandits: activity allocation in a changing world. Journal of Applied Probability, vol. 25.

[4] Lai, L. & El Gamal, H. & Jiang, H. & Poor, H. V. (2007) Cognitive medium access: exploration, exploitation and competition. IEEE/ACM Trans. on Networking, vol. 10, no. 2, pp. 239-253.

[5] Agrawal, R. (1995) Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, vol. 27, pp. 1054-1078.

[6] Auer, P. & Cesa-Bianchi, N. & Fischer, P. (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning, vol. 47, pp. 235-256.

[7] Kocsis, L. & Szepesvari, C. (2006) Discounted UCB. 2nd Pascal Challenge Workshop.

[8] Garivier, A. & Moulines, E. (2008) On upper-confidence bound policies for non-stationary bandit problems. Available from http://arxiv.org/ps case/arxiv/pdf/0805/0805.3415v1.pdf