Online Learning in Autonomic Multi-Hop Wireless Networks for Transmitting Mission-Critical Applications

IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 28, NO. 5, JUNE 2010

Hsien-Po Shiang and Mihaela van der Schaar, Fellow, IEEE

Abstract: In this paper, we study how to optimize the transmission decisions of nodes aimed at supporting mission-critical applications, such as surveillance, security monitoring, and military operations. We focus on a network scenario where multiple source nodes simultaneously transmit mission-critical data through relay nodes to one or multiple destinations in multi-hop wireless Mission-Critical Networks (MCN). In such a network, the wireless nodes can be modeled as agents that can acquire local information from their neighbors and, based on this available information, can make timely transmission decisions to minimize the end-to-end delays of the mission-critical applications. Importantly, the MCN needs to cope in practice with the time-varying network dynamics. Hence, the agents need to make transmission decisions by considering not only the current network status, but also how the network status evolves over time, and how this is influenced by the actions taken by the nodes. We formulate the agents' autonomic decision making problem as a Markov decision process (MDP) and construct a distributed MDP framework, which takes into consideration the informationally-decentralized nature of the multi-hop MCN. We further propose an online model-based reinforcement learning approach for agents to solve the distributed MDP at runtime, by modeling the network dynamics using priority queuing. We compare the proposed model-based reinforcement learning approach with other model-free reinforcement learning approaches in the MCN. The results show that the proposed model-based reinforcement learning approach for mission-critical applications not only outperforms myopic approaches without learning capability, but also outperforms conventional model-free reinforcement learning approaches.
Index Terms: multi-user mission-critical transmission, autonomic multi-hop wireless networks, distributed Markov decision process, online reinforcement learning.

Manuscript received 4 April 2009; revised 22 December 2009. This work was funded by an ONR grant and by NSF CAREER CCF grant. Hsien-Po Shiang and Mihaela van der Schaar are with the Department of Electrical Engineering, UCLA (hpshiang@ucla.edu, mihaela@ee.ucla.edu). Digital Object Identifier 10.1109/JSAC.2010.1006xx. © 2010 IEEE.

I. INTRODUCTION

A plethora of mission-critical applications such as battlefield videoconferencing, surveillance and security monitoring are emerging, e.g. in SOSANETs [], where real-time response and actions to the acquired critical data become vital. This critical data needs to be relayed reliably and in a timely manner to one or multiple decision makers, possibly located at different destinations. To connect the various sources to the destinations, a rapidly deployable solution can be provided using multi-hop autonomic wireless networks. A key advantage of such flexible infrastructures is that the same network can be re-used and reconfigured to relay critical data to multiple destinations. The mission-critical applications require the network to support various transmission priorities, security, robustness requirements, and stringent transmission delay deadlines [6][8]. In this paper, we focus on minimizing the network delays of the mission-critical applications, and rely on related work (such as [8][9]) for the security and reliability requirements of the mission-critical applications.

Autonomic wireless networks are composed of autonomic wireless nodes (also interchangeably referred to as agents in this paper) endowed with the capability of individually sensing the network environment, learning the dynamic network changes based on their local information, and promptly adapting their transmission actions in an autonomous manner to optimize the utility of the applications which they are serving [].
The dynamic network changes include variations in network topology, wireless channel conditions, application requirements, etc. When these network dynamics occur, the autonomic nodes can self-configure and immediately react to these changes, without the need to propagate messages back and forth to a centralized coordinator. Autonomic wireless networks are especially suitable for mission-critical applications, since the autonomic behavior allows the wireless nodes to promptly discover local network changes and instantaneously react to them, such that the important data packets they are relaying will arrive at their destinations within their delay deadlines. Moreover, autonomic wireless nodes endowed with online learning capabilities can successfully model the network dynamics and foresightedly adapt their packet transmission to maximize the utility of the mission-critical applications.

In the MCN, the autonomic nodes need to coordinate their transmission decisions [7]. For example, in [25], it is shown that performance degradation is unavoidable if the agents do not optimize their routing decisions in a cooperative manner. In [26][27], the Network Utility Maximization (NUM) framework is introduced and it is shown that by allowing agents to cooperatively exchange information, they can optimize their transmission actions in a distributed manner, such that a Pareto-efficient solution can be reached. However, such solutions assume a static network setting and cannot address the dynamic nature of the MCN. Dynamic transmission policies based on local information feedback are proposed (for example, based on QoS state information [2] and queuing backpressure [4][5]), which ensure that the delays of the mission-critical applications are bounded as long as the rate allocations are inside the capacity region of the network. However, computing the capacity region requires a high computational complexity [32] and, moreover, does not guarantee that the required delay constraints of the mission-critical applications are met. In [3], a QoS-aware protocol with a priority-based queuing model was proposed to support real-time traffic in wireless sensor networks. The protocol allocates energy-efficient paths to the applications that meet their end-to-end delay requirements. Other QoS-aware solutions for supporting various applications in wireless sensor networks can be found in []. However, most of these solutions are mainly concerned with minimizing the energy consumption.

Importantly, in the distributed setting, an agent's decision impacts and is impacted by the decisions of the neighboring agents. We refer to this coupling effect as the spatial dependency among the agents. Although the abovementioned solutions consider the spatial dependency, they only react to the network changes in a myopic way: they optimize the transmission decisions based only on the information about the current network status and application requirements. In the dynamic MCN, however, the agents need to adopt foresighted adaptation by considering not only the immediate network status, but also how the network status evolves over time (referred to as the network dynamics in this paper), in order to make optimal transmission decisions. Hence, in addition to the spatial dependency, agents also need to consider the temporal dependency among their sequential decisions (performed over time). Moreover, in practice, the network dynamics may not be known.
Reinforcement learning solutions have been proposed for the nodes to learn the network dynamics and optimize the performance in routing [3] and admission control [3] solutions at runtime. However, these solutions do not minimize the delays of the mission-critical applications. Moreover, the majority of these solutions focus on model-free reinforcement learning approaches, which are not suitable for the mission-critical applications due to their slow convergence rates [5]. In summary, there is no integrated framework that considers the spatio-temporal dependencies among the agents in the MCN to minimize the end-to-end delays of the mission-critical applications, based on application priorities, packet-based delay deadlines, and the network dynamics.

In this paper, we provide a systematic framework based on which agents (the nodes in the MCN) can optimize their cross-layer transmission actions and minimize the delays of the mission-critical applications, while considering the spatio-temporal dependencies among their actions. We assume that all the source and relay nodes are able to make their own cross-layer transmission decisions, which are the packet-based scheduling decisions in the application layer and the routing decisions in the network layer. In [28], it has been shown that Markovian models (e.g. the finite-state Markov model [29]) can be applied for both traffic state transition and channel state transition. Also, in [3], it was shown that routing protocols in mobile ad hoc networks can be further improved by allowing the agents to make their decisions using a Markov Decision Process (MDP) [8]. Based on the MDP, the agents are able to forecast the future network status and optimize their cross-layer transmission actions in a way that considers the MCN dynamics. However, unlike in [3], which focuses on optimizing the overall throughput of the network, in this paper the agents minimize the expected end-to-end delays of the mission-critical applications. The expected end-to-end delay is referred to in this paper as the MDP delay value.
Overall, the paper makes the following contributions:

1) Distributed MDP framework that considers the spatio-temporal dependencies in the MCN. To account for the dynamic nature of the MCN, we construct an MDP framework which minimizes the MDP delay values of the mission-critical applications. To address the informationally-decentralized nature of the multi-hop MCN, the MDP needs to be formulated in a distributed manner, such that each agent in the MCN can deploy its own cross-layer transmission policy based only on local information exchanges with its neighboring agents. The proposed distributed MDP minimizes the delays of the mission-critical applications while capturing the spatio-temporal dependency in the MCN.

2) Model-based online learning approach to solve the distributed MDP in the MCN. We propose an online model-based learning approach for the agents in the MCN to solve the distributed MDP at runtime, when the network dynamics are unknown. Unlike the conventional model-free reinforcement learning approaches for solving MDPs (as in [6][7]), the proposed model-based learning algorithm adopts a preemptive-repeat priority M/G/1 queuing model [2], which enables a faster convergence rate and shorter delays for the mission-critical applications. The upper and lower bounds of the resulting MDP delay value are provided to verify the accuracy of the proposed model-based online learning approach at different network locations. Moreover, we compare the proposed model-based reinforcement learning approach with the model-free reinforcement learning approaches in terms of delay performance, computational complexity, and the required information exchange overheads.

This paper is organized as follows. In Section II, we discuss the network settings and the cross-layer transmission actions of the autonomic wireless nodes, and formulate the autonomic decision making problem in the MCN. In Section III, we discuss the distributed MDP framework that addresses both the dynamic and the informationally-decentralized nature of the MCN.
In Section IV, we propose a model-based online learning approach for the autonomic wireless nodes to solve the distributed MDP at runtime, which is suitable for the mission-critical applications. Section V provides simulation results and Section VI concludes the paper.

II. AUTONOMIC DECISION MAKING PROBLEM FORMULATION IN MCN

A. Mission-critical application characteristics

Unlike most cross-layer design papers that consider only a single application, we assume that there are multiple sources simultaneously transmitting delay-critical information over the MCN. Let $\mathcal{V} = \{V_i\}$ represent the set of the mission-critical applications. We assume that the packets of an application $V_i$ are prioritized into $K_i$ priority classes. The total number of priority classes in the network is $K = \sum_{i=1}^{V} K_i$, where $V$ is the number of applications. Let $\{C_k, k = 1, \ldots, K\}$ represent all the priority classes in the network. In the remainder of the paper, we label the $K$ classes (across all applications) in descending order of their priorities, i.e. $C_1$ is the highest priority class. A priority class $C_k$ is characterized by the parameters $\{D_k, R_k, L_k\}$ (footnote 1). $D_k$ represents the delay deadline of the packets in class $C_k$. A packet of a mission-critical application is useful only if it is received at the destination before its delay deadline. $R_k$ is the average source rate of the packets in class $C_k$. Based on the source rate, the source node generates a certain number of packets per unit time, which impacts the traffic load of the MCN. $L_k$ is the average packet length of the packets in class $C_k$, which directly impacts the packet error rate and the transmission rate of sending a class $C_k$ packet. Let $\mathrm{Delay}_k$ represent the end-to-end delay that is required for the transmission of the traffic in class $C_k$. These required delays are mandated by the mission and the deployed applications, and the MCN agents need to prioritize the traffic and minimize the end-to-end delays according to the assigned priorities [8]. For example, in a battlefield mission-critical network, instructions from a command center are mission-critical and should have higher priority than any other traffic, e.g. response notification, surveillance results, etc.

B. Multi-hop MCN settings

The MCN is represented by a network graph $G(\mathcal{V}, \mathcal{M}, \mathcal{E})$, where $\mathcal{M} = \{m_1, \ldots, m_M\}$ represents the set of agents and $\mathcal{E} = \{e_1, \ldots, e_E\}$ represents a set of edges (transmission links) that connect the various agents. There are two types of agents defined in this paper:

1) Autonomic Source Agents (ASs).
Each AS generates a mission-critical application and would like to transmit the application to a predetermined destination node.

2) Autonomic Relay Agents (ARs). ARs relay the packets from the ASs to the corresponding destination nodes. Unlike the ASs, the ARs do not generate their own traffic. They make their cross-layer transmission decisions and forward the packets for the ASs.

To enable us to better discuss the various networking solutions, we label the agents using a directed acyclic graph [3] as shown in Figure 1, which consists of $H$ hops from the ASs to the destination nodes (footnote 2). We assume that $M_h$ is the number of agents at the $h$-th hop ($1 \le h \le H$), and $M_1 = M_H = V$. Each agent at the $h$-th hop is tagged with a distinct number $m_h$ ($1 \le m_h \le M_h$). Let $\mathcal{M}_h \subset \mathcal{M}$ represent the set of agents at the $h$-th hop. The agent $m_h$ processes a priority queue and can only transmit the packets in the queue to a subset of ARs in $\mathcal{M}_{h+1}$. Through periodic information exchange (e.g. hello message exchange in [24]), we assume that each agent $m_h$ knows the existence of its neighboring nodes (i.e. the other agents $m'_h \in \mathcal{M}_h$ in the same hop and the agents $m_{h+1} \in \mathcal{M}_{h+1}$ in the next hop), as well as the interference matrix [2] of the current hop, which defines whether or not two different links of neighboring nodes can transmit simultaneously.

[Fig. 1. Considered multi-hop wireless network [].]

Footnote 1: We refer the interested readers to our previous work [2] for more details on these parameters.

Footnote 2: Note that such a directed acyclic network can be deployed over any physical network topology as an overlay network (see [3] for more details about how to deploy the directed acyclic graph over a multi-hop wireless network).

C. Effective transmission rate over the multi-hop MCN

We denote the maximum transmission rate over the link $(m_h, m_{h+1})$ as $T_{k,m_h,m_{h+1}}$ for traffic class $C_k$.
Assuming a memoryless packet erasure channel as in [2][2], and given the Signal-to-Interference-Noise-Ratio (SINR) $x_{m_h,m_{h+1}}$, we can compute the packet error rate $p_{k,m_h,m_{h+1}}(x_{m_h,m_{h+1}})$ over the link. If the agent $m_h$ selects $m_{h+1}$ as its next relay, the effective transmission rate (goodput) can be approximated using the sigmoid function [2]:

$$T^{goodput}_{k,m_h,m_{h+1}}(x_{m_h,m_{h+1}}) = T_{k,m_h,m_{h+1}} \left(1 - p_{k,m_h,m_{h+1}}(x_{m_h,m_{h+1}})\right), \qquad p_{k,m_h,m_{h+1}}(x_{m_h,m_{h+1}}) = \frac{1}{1 + e^{\zeta (x_{m_h,m_{h+1}} - \delta)}}, \tag{1}$$

where $\zeta$ and $\delta$ are constants corresponding to the modulation and coding schemes for a given packet length $L_k$. This goodput is determined by the actions of the agent $m_h$, which influence the delay of the applications (see Section III.A for more details).

D. Actions of the autonomic wireless nodes

An agent's cross-layer transmission action varies when transmitting different priority class traffic. Denote $A_{m_h} = \{A_{k,m_h}, \forall C_k\}$ as the cross-layer transmission action of agent $m_h$, where $A_{k,m_h} = \{\pi_{k,m_h}, \beta_{k,m_h,m_{h+1}}, \forall m_{h+1} \in \mathcal{M}_{h+1}\} \in \mathcal{A}_{m_h}$ represents the action of agent $m_h$ when sending packets in class $C_k$. $\mathcal{A}_{m_h}$ represents the set of feasible actions for the agent $m_h$. In this paper, we assume that the cross-layer transmission action includes the application layer packet scheduling $\pi_{k,m_h}$ of transmitting packets in class $C_k$, and the network layer relay selection parameter $\beta_{k,m_h,m_{h+1}}$, which determines the probability of selecting a node $m_{h+1} \in \mathcal{M}_{h+1}$ in the next hop as the next relay. Denote $A = \{A_{m_h}, \forall m_h \in \mathcal{M}\}$ as the actions of all the agents in the MCN. Note that the delay $\mathrm{Delay}_k(A)$ of packets in class $C_k$ is a function of all agents' actions.
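As an illustration of equation (1), the following minimal Python sketch evaluates the sigmoid packet error rate and the resulting goodput of a single link. The function names and the numerical values used for ζ, δ and the link rate are hypothetical and chosen only for this example; they are not taken from the paper.

```python
import math


def packet_error_rate(sinr, zeta, delta):
    """Sigmoid approximation of equation (1): p(x) = 1 / (1 + exp(zeta * (x - delta)))."""
    return 1.0 / (1.0 + math.exp(zeta * (sinr - delta)))


def goodput(max_rate, sinr, zeta, delta):
    """Effective rate of equation (1): T * (1 - p(x))."""
    return max_rate * (1.0 - packet_error_rate(sinr, zeta, delta))


# Hypothetical link parameters, for illustration only.
ZETA = 0.5        # steepness constant of the sigmoid
DELTA = 15.0      # SINR at which half of the packets are lost
MAX_RATE = 6.0e6  # maximum link rate T in bits per second
```

At `sinr == DELTA` the approximated packet error rate is 0.5, so `goodput(MAX_RATE, DELTA, ZETA, DELTA)` is half of the maximum rate; the goodput approaches `MAX_RATE` as the SINR grows.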

E. Problem formulation

In this subsection, we discuss several ways to determine the cross-layer transmission decisions for transmitting the mission-critical applications over the MCN.

- Centralized decision making

The majority of the cross-layer design papers assume a centralized optimization, in which a central controller collects global network information $\mathcal{G}$ and makes transmission decisions for all the agents in the MCN. Since minimizing end-to-end delay is the key objective in the MCN, the centralized optimization needs to minimize the end-to-end delays for the various applications [2][3]. An advantage of such a delay-driven approach is that the optimization only needs to be done for the higher priority classes, and the packets of the lower priority classes can simply be dropped if their delay constraints cannot be met (footnote 3). Let $a_k = [A_{k,m_h}, \forall m_h \in \mathcal{M}]$ represent the actions of all the agents sending traffic class $C_k$. The actions for transmitting the priority class $C_k$ can be computed after the actions for the higher priority classes $\{a_1, \ldots, a_{k-1}\}$ are determined, and the action $a_k$ will not affect any of the actions $\{a_1, \ldots, a_{k-1}\}$. Specifically, the following delay constrained optimization is considered for the priority class $C_k$:

$$a_k^{opt} = \arg\min_{a_k} \mathrm{Delay}_k(a_k, \{a_1, \ldots, a_{k-1}\}, \mathcal{G}) \quad \text{s.t.} \quad \mathrm{Delay}_k(a_k, \{a_1, \ldots, a_{k-1}\}, \mathcal{G}) \le D_k \tag{2}$$

However, in mission-critical applications, which have stringent delay deadlines, it is impractical to assume that the global information $\mathcal{G}$ can be gathered in time at a central controller. Hence, it is important to decompose the optimization in equation (2) in such a way that each agent $m_h$ can make timely decisions based on its local information $\mathcal{L}_{m_h}$.
- Distributed decision making for the agent $m_h$

Let $E[\mathrm{Delay}_{k,m_h}(A_{k,m_h}, \mathcal{L}_{m_h})]$ represent the expected delay from $m_h$ to the destination node of the traffic class $C_k$, which is a function of the transmission action $A_{k,m_h}$ and the local information $\mathcal{L}_{m_h}$. Let $\mathrm{Delay}^{PASS}_{k,m_h}$ represent the delay that has already passed when the class $C_k$ packet arrives at the agent $m_h$. This can be computed based on the information that is encapsulated in the packet header. Since agent $m_h$ cannot influence $\mathrm{Delay}^{PASS}_{k,m_h}$, it can only minimize the delay for the highest priority class $C_k$ in its queue using the following optimization [2]:

$$A^{opt}_{k,m_h}(\mathcal{L}_{m_h}) = \arg\min_{A_{k,m_h}} E[\mathrm{Delay}_{k,m_h}(A_{k,m_h}, \mathcal{L}_{m_h})] \quad \text{s.t.} \quad \mathrm{Delay}_{k,m_h}(A_{k,m_h}, \mathcal{L}_{m_h}) \le D_k - \mathrm{Delay}^{PASS}_{k,m_h} \tag{3}$$

Footnote 3: The action $A_{k,m_h} = \{\beta_{k,m_h,m_{h+1}}, \forall m_{h+1} \in \mathcal{M}_{h+1}\}$ hereafter does not include the application layer scheduling, since the highest priority packet is selected to be transmitted. To simplify the notation, we use the same notation for the cross-layer transmission actions and assume that the class $C_k$ is the highest priority class existing in the queue of the agent $m_h$ when taking the action $A_{k,m_h}$.

[Fig. 2. (a) Conventional distributed decision making of an agent. (b) Proposed foresighted decision making of an agent.]

Figure 2(a) illustrates this conventional distributed decision making. First, the agent evaluates the utility (i.e. the expected delay $E[\mathrm{Delay}_{k,m_h}(A_{k,m_h}, \mathcal{L}_{m_h})]$) which it can obtain from taking various actions based on the local information $\mathcal{L}_{m_h}$. Then, the agent determines its transmission action by solving the optimization in equation (3). The required local information $\mathcal{L}_{m_h}$ for computing $E[\mathrm{Delay}_{k,m_h}(A_{k,m_h}, \mathcal{L}_{m_h})]$ will be discussed later in Section III.B. However, due to the dynamic nature of the MCN, the gathered local information changes over time.
Hence, it is important for the agents to consider not only the current expected delay, but also the future expected delay as the network dynamics evolve. Figure 2(b) illustrates how an agent anticipates the evolution of the network dynamics by considering the impact of its current transmission action on the future network state (which will be defined in Section III.A) and, based on it, makes foresighted transmission decisions to transmit mission-critical applications. Next, we formulate this foresighted decision making of an agent in the MCN.

- Proposed foresighted decision making for the agent $m_h$

Let $E[\mathrm{Delay}^t_{k,m_h}]$ denote the expected delay of agent $m_h$ at the current service interval $t$. Given the current local information $\mathcal{L}^t_{m_h}$, agent $m_h$ makes foresighted decisions by taking into account the impact of its actions not only on the current expected delay, but also on the discounted expected delays in the future service intervals, i.e.

$$\mu_{k,m_h}(\mathcal{L}^t_{m_h}) = \arg\min_{A_{k,m_h}} \left\{ \sum_{t'=t}^{\infty} \gamma^{t'-t} E[\mathrm{Delay}^{t'}_{k,m_h}(A_{k,m_h}, \mathcal{L}^{t'}_{m_h})] \right\} \tag{4}$$

where $0 \le \gamma < 1$ (footnote 4) represents the discount factor, which decreases the utility impact of the later transmitted packets. If the discount factor $\gamma = 0$, the optimization in equation (4) becomes a myopic decision making, similar to the one in [2]. We refer to the function $\mu_{k,m_h}(\mathcal{L}_{m_h})$ as the cross-layer transmission policy given the local information $\mathcal{L}_{m_h}$. In the next section, we will discuss how to compute this cross-layer transmission policy.

Footnote 4: $1 - \gamma$ can be regarded as the probability that the priority class ends in a certain service interval. Note that different discount factors $\gamma_k$ can be considered for different priority classes. However, to simplify the exposition, we consider here the same $\gamma$ for all priority classes.

III. DISTRIBUTED MARKOV DECISION PROCESS FRAMEWORK

In this section, we discuss how to systematically compute the cross-layer transmission policy $\mu_{k,m_h}(\mathcal{L}_{m_h})$ for the agents in the MCN. First, we define the state of the agents in Section III.A. Then, in Section III.B, we propose the distributed MDP, which allows all the agents to make their own decisions.

A. States of the autonomic wireless nodes

We define the network state at agent $m_h$ as $s_{m_h} = \{[\eta_{k,m_h}, \forall C_k], [x_{m_h,m_{h+1}}, \forall (m_h, m_{h+1})]\} \in \mathcal{X}_{m_h}$, where $x_{m_h,m_{h+1}}$ represents the channel condition (see Section II.C) and $\eta_{k,m_h}$ represents the arrival rate of the class $C_k$ packets at agent $m_h$. To evaluate the expected delay $E[\mathrm{Delay}_{k,m_h}]$, agent $m_h$ first needs to compute the expected queuing delay $E[W_{k,m_h}]$ for which the packets in class $C_k$ will be queued at $m_h$. The state includes sufficient statistics for computing the expected queuing delay $E[W_{k,m_h}]$ when an action $A_{k,m_h}$ is taken. Note that the first two moments of the service time can be obtained as:

$$E[X_{k,m_h}] = \frac{L_k}{T_{k,m_h,m_{h+1}} \left(1 - p_{k,m_h,m_{h+1}}(x_{m_h,m_{h+1}})\right)}, \qquad E[X^2_{k,m_h}] = \frac{L_k^2 \left(1 + p_{k,m_h,m_{h+1}}(x_{m_h,m_{h+1}})\right)}{T^2_{k,m_h,m_{h+1}} \left(1 - p_{k,m_h,m_{h+1}}(x_{m_h,m_{h+1}})\right)^2} \tag{5}$$

Together with the arrival rate $\eta_{k,m_h}$, the expected queuing delay $E[W_{k,m_h}]$ can be computed using a priority M/G/1 queuing model [2]. We assume that each agent feeds back its expected delays to all the agents in the previous hop (similar to DSDV protocols [24]). Hence, the agent $m_h$ is able to select the next relay that minimizes the sum of the current queuing delay and the expected delay from the next hop to the destination node of class $C_k$, i.e.

$$E[\mathrm{Delay}_{k,m_h}(A_{k,m_h}, s_{m_h})] = \sum_{h'=h}^{H} E[W_{k,m_{h'}}(A_{k,m_{h'}}, s_{m_{h'}})] = E[W_{k,m_h}(A_{k,m_h}, s_{m_h})] + E[\mathrm{Delay}_{k,m_{h+1}}(A_{k,m_{h+1}})] \tag{6}$$

Importantly, the agent $m_h$'s transmission action will impact the information feedback $E[\mathrm{Delay}_{k,m_{h+1}}]$, since it selects the next relay $m_{h+1} \in \mathcal{M}_{h+1}$, and different relays feed back different expected delay values.
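The relay choice behind equations (5) and (6) can be sketched as follows. The sketch assumes that the expected queuing delay towards each candidate relay is supplied by the caller (the priority M/G/1 computation is not reproduced here), and all function and variable names are illustrative rather than taken from the paper.

```python
def service_time_moments(packet_length, max_rate, packet_error_rate):
    """First two moments of the service time in equation (5).

    A packet is retransmitted until it succeeds, so the number of attempts
    is geometric with success probability (1 - packet_error_rate).
    """
    success = 1.0 - packet_error_rate
    first = packet_length / (max_rate * success)
    second = packet_length ** 2 * (1.0 + packet_error_rate) / (max_rate ** 2 * success ** 2)
    return first, second


def best_next_relay(queuing_delay, feedback_delay):
    """Pick the relay that minimizes E[W] plus the fed-back delay, as in equation (6).

    queuing_delay[r]  : expected queuing delay at this agent when relay r is used
                        (assumed to be computed elsewhere from the queuing model)
    feedback_delay[r] : expected delay from relay r to the destination (feedback)
    """
    totals = {r: queuing_delay[r] + feedback_delay[r] for r in queuing_delay}
    relay = min(totals, key=totals.get)
    return relay, totals[relay]
```

For example, with `queuing_delay = {"a": 0.01, "b": 0.02}` and `feedback_delay = {"a": 0.05, "b": 0.02}`, relay `"b"` is preferred even though its local queuing delay is larger, because its fed-back delay to the destination is smaller.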
Moreover, the expected delay $E[\mathrm{Delay}_{k,m_h}]$ will be fed back to the agents in the previous hop and hence impact their transmission actions. Therefore, the agent $m_h$'s action $A_{k,m_h}$ will affect its own future state $s_{m_h}$ and will also influence the future expected delay as the network dynamics evolve. As in [3], we denote the probability that the agent $m_h$ has a state $s^{t+1}_{m_h}$ in service interval $t+1$ as $p(s^{t+1}_{m_h})$, which is modeled as a function of agent $m_h$'s current state $s^t_{m_h}$ and current action $A^t_{k,m_h}$, i.e.

$$p(s^{t+1}_{m_h}) = \hat{F}_{s^{t+1}_{m_h}}(s^t_{m_h}, A^t_{k,m_h}) \tag{7}$$

Note that the real $p(s^{t+1}_{m_h})$ can be very complicated in a real network, since it is impacted by the decisions of all the agents in the previous hop as well as the interference among the agents in the current hop. In our solution, the agents do not need to know the exact form of $p(s^{t+1}_{m_h})$. Online learning approaches for the agents to learn the state transition function in equation (7) will be discussed in Section IV. Next, we formulate the cross-layer optimization of the agent as an MDP for each class.

B. Distributed MDP for class $C_k$

For class $C_k$, the MDP at the agent $m_h$ is defined by a tuple $\langle \mathcal{X}_{m_h}, \mathcal{A}_{m_h}, I_{m_h}, T_{m_h}, U_{m_h}, \gamma \rangle$:

- States: Recall that the state is defined in Section III.A as $s_{m_h} = \{[\eta_{k,m_h}, \forall C_k], [x_{m_h,m_{h+1}}, \forall (m_h, m_{h+1})]\} \in \mathcal{X}_{m_h}$.
- Actions: Recall that the action is defined as $A_{k,m_h} = \{\beta_{k,m_h,m_{h+1}}, \forall m_{h+1} \in \mathcal{M}_{h+1}\} \in \mathcal{A}_{m_h}$ in Section II.D. To simplify the notation, we will hereafter use $A_{m_h}$ instead of $A_{k,m_h}$.
- Information exchange: Let $I_{m_h} = \{F^b_h, F^f_h\}$ (footnote 5) represent the information exchange of the agents in the $h$-th hop to the previous hop and to the next hop. Denote $F^{b,t}_h(m_h) = E[\mathrm{Delay}^t_{k,m_h}]$ as the feedback information from agent $m_h$ to the agents in the previous hop (see equation (6)) and let $F^{b,t}_h = [F^{b,t}_h(m_h), \forall m_h \in \mathcal{M}_h]$ represent the feedback information in the $h$-th hop in the service interval $t$. Denote $F^{f,t}_h(m_h) = \{\mathrm{Delay}^{PASS}_{k,m_h}, \eta_{k,m_h}\}$ as the feedforward information from node $m_h$ to the selected relay in the next hop and let $F^{f,t}_h = [F^{f,t}_h(m_h), \forall m_h \in \mathcal{M}_h]$ represent the feedforward information in the $h$-th hop.
Given the feedforward information $F^{f,t}_{h-1}$, the agent $m_h$ computes the average delay $\mathrm{Delay}^{PASS}_{k,h}$ of passing through the previous hops as:

$$\mathrm{Delay}^{PASS}_{k,h} = \frac{\sum_{m_{h-1}=1}^{M_{h-1}} \eta_{k,m_{h-1}} \, \mathrm{Delay}^{PASS}_{k,m_{h-1}}}{R_k} \tag{8}$$

If $\mathrm{Delay}^{PASS}_{k,h}$ exceeds the delay deadline $D_k$, the packet in class $C_k$ should be dropped and no MDP is needed for traffic class $C_k$ at the agent $m_h$.
- State transition probabilities: Let $T_{s_{m_h} s'_{m_h}}(A_{m_h}) \in T_{m_h}: \mathcal{X}_{m_h} \times \mathcal{X}_{m_h} \times \mathcal{A}_{m_h} \to [0, 1]$ represent the stationary state transition probabilities from state $s_{m_h}$ to state $s'_{m_h}$ when action $A_{m_h}$ is taken. Based on the state transition model in equation (7), we compute the state transition probabilities as $T_{s_{m_h} s'_{m_h}}(A_{m_h}) = \hat{F}_{s'_{m_h}}(s_{m_h}, A_{m_h})$.
- Cost: The expected delay $E[\mathrm{Delay}_{k,m_h}(s_{m_h}, A_{m_h})] \in U_{m_h}$ represents the cost function. As mentioned in Section III.A, we rely on a priority-based queuing model to compute the cost function (see equation (6)). Note that the expected delay of a higher priority class will not be influenced by the lower priority classes. However, if the class is one of the lower priority classes, the influence of the higher priority classes is taken into account based on the priority-based queuing model [2] (given the actions and states associated with the higher priority classes).
- Discount factor: Recall that $\gamma$ is the same discount factor as in equation (4).

Footnote 5: The superscript $b$ and the superscript $f$ represent backward and forward information, respectively.

Based on the information feedback $F^b_{h+1}$, we modify the Bellman equation [9] of the MDP as:

$$V^*_{k,m_h}(s_{m_h}, F^b_{h+1}) = \min_{A_{m_h} \in \mathcal{A}_{m_h}} \left\{ \sum_{t=0}^{\infty} \gamma^t E[\mathrm{Delay}^t_{k,m_h}(s_{m_h}, A_{m_h})] \right\} = \min_{A_{m_h} \in \mathcal{A}_{m_h}} \left\{ E[W_{k,m_h}(s_{m_h}, A_{m_h})] + F^b_{h+1}(A_{m_h}) + \gamma \sum_{s'_{m_h}} T_{s_{m_h} s'_{m_h}}(A_{m_h}) \, V^*_{k,m_h}(s'_{m_h}, F^b_{h+1}) \right\} \tag{9}$$

where $V^*_{k,m_h}$ is referred to as the MDP delay value, which is a discounted version of the long-term expected delay. To solve this feedback-modified Bellman equation, the agent $m_h$ adopts value iteration [9] by updating the MDP delay value:

$$V^{t+1}_{k,m_h}(s_{m_h}, F^{b,t}_{h+1}) = \min_{A_{m_h} \in \mathcal{A}_{m_h}} Q^t_{k,m_h}(s_{m_h}, A_{m_h}, F^{b,t}_{h+1}), \tag{10}$$

where

$$Q^t_{k,m_h}(s_{m_h}, A_{m_h}, F^{b,t}_{h+1}) = E[W^t_{k,m_h}(s_{m_h}, A_{m_h})] + F^{b,t}_{h+1}(A_{m_h}) + \gamma \sum_{s'_{m_h}} T_{s_{m_h} s'_{m_h}}(A_{m_h}) \, V^t_{k,m_h}(s'_{m_h}, F^{b,t}_{h+1})$$

is the Q-value at the agent $m_h$ when a cross-layer transmission action $A_{m_h}$ is taken in state $s_{m_h}$. The stationary policy can be written as:

$$\mu^t_{k,m_h}(\mathcal{L}^t_{m_h}) = \arg\min_{A_{m_h} \in \mathcal{A}_{m_h}} Q^t_{k,m_h}(s_{m_h}, A_{m_h}, F^{b,t}_{h+1}).$$

The feedback-modified Bellman equation in equation (9) can be solved using value iteration if the agent $m_h$ has complete knowledge about $E[W_{k,m_h}(s_{m_h}, A_{m_h})]$ and $T_{s_{m_h} s'_{m_h}}(A_{m_h})$. Table I presents the detailed implementation of the distributed MDP, and Figure 3 shows the system diagram of the distributed MDP, which allows the agents to exchange information with the nodes in the neighboring hops.

[Fig. 3. Proposed decentralized MDP framework and the necessary information exchange among the agents.]

IV. ONLINE MODEL-BASED LEARNING FOR SOLVING THE DISTRIBUTED MDP

In order to solve the Bellman equations, the agents need to know the state transition probabilities $T_{s_{m_h} s'_{m_h}}(A_{m_h})$ in the updating equation (10). However, the state transition probabilities may not be known to the agents a priori.
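For the case in which the queuing delays and the transition probabilities are known, the value iteration of equation (10) can be sketched in a few lines of Python. The dictionary-based data layout, the function name and the stopping tolerance are assumptions made for this illustration only; they are not part of the paper.

```python
def value_iteration(states, actions, queuing_delay, feedback, transition,
                    gamma, max_sweeps=1000, tol=1e-9):
    """Iterate V <- min_A Q(s, A) with the feedback-modified Q-value of equation (10).

    queuing_delay[s][a] : expected queuing delay E[W(s, a)]
    feedback[a]         : fed-back expected delay from the next hop under action a
    transition[s][a][s2]: probability of moving from state s to s2 under action a
    """
    value = {s: 0.0 for s in states}
    q = {}
    for _ in range(max_sweeps):
        q = {
            s: {
                a: queuing_delay[s][a] + feedback[a]
                + gamma * sum(transition[s][a][s2] * value[s2] for s2 in states)
                for a in actions
            }
            for s in states
        }
        new_value = {s: min(q[s].values()) for s in states}
        change = max(abs(new_value[s] - value[s]) for s in states)
        value = new_value
        if change < tol:
            break
    # Stationary policy: the action with the smallest Q-value in each state.
    policy = {s: min(q[s], key=q[s].get) for s in states}
    return value, policy
```

Because the discount factor is smaller than one, each sweep shrinks the gap to the fixed point by at least a factor of γ, so the loop stops after a small number of sweeps for moderate values of γ.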
In this section, we discuss online learning approaches for solving, at runtime, the distributed MDP introduced in the previous section. We propose a novel model-based reinforcement learning approach that is suitable for the agents to transmit mission-critical applications over the MCN. The proposed model-based reinforcement learning approach adopts the priority queuing model $E[W_{k,m_h}(s_{m_h}, A_{m_h})]$ for the cost and directly estimates the state transition probabilities $T_{s_{m_h} s'_{m_h}}(A_{m_h})$ to solve the distributed MDP. In Section IV.B, we show that the proposed model-based learning method converges faster than the model-free learning approaches, since it takes less time for the autonomic node to explore different states and correctly evaluate the Q-values.

A. Conventional model-free reinforcement learning

Model-free learning methods, e.g. Q-learning [6][7], can be applied at an agent $m_h$ to learn the next Q-values $[Q^{t+1}_{k,m_h}(s_{m_h}, A_{m_h}), \forall s_{m_h} \in \mathcal{X}_{m_h}]$ without characterizing the state transition probabilities $T_{s_{m_h} s'_{m_h}}(A_{m_h})$. Taking Q-learning as an example, given the feedback value $F^{b,t}_{h+1}$, the autonomic node $m_h$ updates the Q-value using the following updating equation:

$$Q^{t+1}_{k,m_h}(s^t_{m_h}, A^t_{m_h}) = (1 - \rho_t) \, Q^t_{k,m_h}(s^t_{m_h}, A^t_{m_h}) + \rho_t \left\{ \mathrm{Cost}^t_{k,m_h} + F^{b,t}_{h+1}(A^t_{m_h}) + \gamma \min_{A'_{m_h} \in \mathcal{A}_{m_h}} Q^t_{k,m_h}(s^{t+1}_{m_h}, A'_{m_h}) \right\} \tag{11}$$

where $0 < \rho_t < 1$ represents the learning rate, and $\sum_t \rho_t = \infty$ and $\sum_t (\rho_t)^2 < \infty$ are ensured for the convergence of the Q-value [6]. $\mathrm{Cost}^t_{k,m_h}$ represents the delay measurement (e.g. obtained by measuring the queue size) of sending packets in class $C_k$, and $s^{t+1}_{m_h}$ represents the next state after the agent $m_h$ takes the cross-layer transmission action $A^t_{m_h}$.

For exploration purposes, instead of following the optimal stationary policy $\mu^t_{k,m_h}(s_{m_h}) = \arg\min_{A_{m_h} \in \mathcal{A}_{m_h}} Q^t_{k,m_h}(s_{m_h}, A_{m_h})$, the next action is selected according to a soft-min policy. Let $\pi^t_{k,m_h}(s_{m_h}, A_{m_h})$ denote the probability for agent $m_h$ to take the action $A_{m_h}$ given the state $s_{m_h}$.
The soft-min policy $\mu^{t}_{k,m_h}(s_{m_h}) = [\pi^{t}_{k,m_h}(s_{m_h}, A_{m_h}), \forall A_{m_h} \in \mathcal{A}_{m_h}]$ is defined using the Boltzmann distribution [14][15][16]:

$$\pi^{t}_{k,m_h}(s_{m_h}, A_{m_h}) = \frac{\exp\big(-Q^{t}_{k,m_h}(s_{m_h}, A_{m_h})/\tau\big)}{\sum_{A'_{m_h} \in \mathcal{A}_{m_h}} \exp\big(-Q^{t}_{k,m_h}(s_{m_h}, A'_{m_h})/\tau\big)} \quad (12)$$

where $\tau$ is the temperature parameter. A small $\tau$ provides a greater probability difference in selecting different actions. If $\tau \to 0$, the approach reduces to $\mu^{t}_{k,m_h}(s_{m_h}) = \arg\min_{A_{m_h} \in \mathcal{A}_{m_h}} Q^{t}_{k,m_h}(s_{m_h}, A_{m_h})$. On the other hand, a larger $\tau$ allows the agents to explore various actions with higher probabilities (footnote 6). We provide the detailed steps of the model-free reinforcement learning in Algorithm 1 in Table VI.

Table II summarizes the required local information, memory complexity, and computational complexity of the model-free reinforcement learning approaches. In each service interval, the model-free reinforcement learning approaches need to update the Q-values for all $s_{m_h} \in \mathcal{X}_{m_h}$ and all $C_k$, and for each state, the minimum of $Q^{t}_{k,m_h}(s^{t+1}_{m_h}, A_{m_h})$ over $A_{m_h} \in \mathcal{A}_{m_h}$ is calculated. Hence, the computational complexity is $O(|\mathcal{X}_{m_h}||\mathcal{A}_{m_h}|K)$.

Footnote 6: $\tau$ provides a tradeoff between exploring different actions and exploiting the Q-values of taking an action. Such a tradeoff is important in the MCN, since it significantly impacts the convergence rate and the performance of the learning approach.

Note

that the dynamics in the MCN may change before the updated policy converges when using a model-free learning approach. Hence, we consider an alternative model-based reinforcement learning approach in the next subsection, which is more suitable for the agents in the MCN due to its faster convergence rate.

TABLE I
IMPLEMENTATION OF THE DISTRIBUTED MDP

Step 1. Gather local information. From the feedforward information $F^{f,t}_{h-1}$ from the previous hop, the agent $m_h$ computes $\mathrm{Delay}^{PASS}_{k}$ and determines whether the distributed MDP should be performed for traffic class $C_k$. Then, it gathers the local information $L^{t}_{m_h} = \{ s^{t}_{m_h}, F^{b,t}_{h+1} \}$.

Step 2. Evaluate queuing delay and state transition probabilities. Based on state $s_{m_h}$ and action $A_{m_h}$, the agent $m_h$ evaluates $E[W^{t}_{k,m_h}]$. The state transition probabilities are modeled using $\hat{T}^{t}_{s_{m_h} s'_{m_h}}(A_{m_h})$ as in equation (7).

Step 3. Update the transmission policy. The agent $m_h$ updates the MDP delay value $V^{t+1}_{k,m_h}(s_{m_h}, F^{b,t}_{h+1})$ using equation (10). The stationary policy of the agent $m_h$ is $\mu^{t}_{k,m_h}(L^{t}_{m_h}) = \arg\min_{A_{m_h} \in \mathcal{A}_{m_h}} Q^{t}_{k,m_h}(s_{m_h}, A_{m_h}, F^{b,t}_{h+1})$.

Step 4. Update the information exchange. After the policy $\mu^{t}_{k,m_h}(L^{t}_{m_h})$ is determined, the next relay $m_{h+1}$ is selected, and $m_h$ can then update the feedback information $F^{b,t+1}_{h}(m_h)$, which combines the feedback $F^{b,t}_{h+1}$ received from the next hop with the expected queuing delay $E[W_{k,m_h}(s_{m_h}, \mu^{t}_{k,m_h}(L^{t}_{m_h}))]$ through the factor $\beta_{k,m_h}$. The wireless node $m_h$ also needs to update its feedforward information $F^{f,t+1}_{h}(m_h) = \mathrm{Delay}^{PASS}_{k} + E[W^{t}_{k,m_h}(s_{m_h}, \mu^{t}_{k,m_h}(L^{t}_{m_h}))]$.

B. Proposed model-based reinforcement learning

In this section, we propose our model-based learning approach, which enables the agent $m_h$ to directly model the expected queuing delay $E[W_{k,m_h}(s_{m_h}, A_{m_h})]$ and estimate the state transition probabilities $\hat{T}_{s_{m_h} s'_{m_h}}(A_{m_h})$ to solve the Bellman equation through value iteration [19]. Figure 4 provides a system block diagram of the proposed online learning approach at the agent $m_h$.
Our approach is similar to the Adaptive-RTDP in [14], where the state transition probabilities are determined using maximum-likelihood estimation. Specifically, let $\hat{T}^{t}_{s_{m_h} s'_{m_h}}(A_{m_h})$ denote the estimated state transition probability at the agent $m_h$, which is updated at each service interval. The Q-value is also updated as:

$$Q^{t+1}_{k,m_h}(s_{m_h}, A_{m_h}) = (1-\rho^{t})\, Q^{t}_{k,m_h}(s_{m_h}, A_{m_h}) + \rho^{t} \Big\{ E[W^{t}_{k,m_h}(s_{m_h}, A_{m_h})] + F^{b,t}_{h+1}(A_{m_h}) + \gamma \sum_{s'_{m_h}} \hat{T}^{t}_{s_{m_h} s'_{m_h}}(A_{m_h}) \min_{A'_{m_h} \in \mathcal{A}_{m_h}} Q^{t}_{k,m_h}(s'_{m_h}, A'_{m_h}) \Big\} \quad (13)$$

where $s'_{m_h}$ represents the next state to which agent $m_h$ transits after it takes the cross-layer transmission action $A_{m_h}$. We provide the detailed steps of the proposed model-based reinforcement learning in Algorithm 2 in Table VII. The main differences between the model-based online learning approach and the model-free learning approaches are the following:

1) We model the expected queuing delay $E[W_{k,m_h}(s_{m_h}, A_{m_h})]$, with an action realized from the policy $\mu^{t}_{k,m_h}$, using the preemptive-repeat priority M/G/1 queuing model as in [21]:

$$E[W_{k,m_h}(s_{m_h}, A_{m_h})] = \begin{cases} \dfrac{\sum_{i=1}^{k} \eta_{i,m_h} E[X^{2}_{i,m_h}]}{2\big(1 - \sum_{i=1}^{k-1} \eta_{i,m_h} E[X_{i,m_h}]\big)\big(1 - \sum_{i=1}^{k} \eta_{i,m_h} E[X_{i,m_h}]\big)}, & \text{if } E[W_{k,m_h}] \le D^{rem}_{k,m_h} \\ \infty, & \text{otherwise} \end{cases} \quad (14)$$

From equation (14), if the queuing time exceeds the remaining delay deadline $D^{rem}_{k,m_h} = D_k - \mathrm{Delay}^{PASS}_{k}$, the expected queuing time $E[W_{k,m_h}]$ becomes infinite, since the packets are useless (no utility gain) and will be dropped at the agent $m_h$. Unlike Q-learning, which can only update the Q-value of one state-action pair at each service interval, our model-based learning approach uses the priority queuing model to provide an accurate estimate for any state-action pair. Hence, the priority queuing model enables faster learning, which is very important in order to satisfy the stringent delay constraints of mission-critical applications.

2) We apply the maximum-likelihood state-transition probabilities [14] in Algorithm 2 to update the state transition probabilities $\hat{T}^{t}_{s_{m_h} s'_{m_h}}(A_{m_h})$, instead of using the Q-value of the next state $s^{t+1}_{m_h}$ at each service interval.
In Algorithm 2, $n^{t}_{s_{m_h} s'_{m_h}}(A_{m_h})$ represents the observed number of times before service interval $t$ that the action $A_{m_h}$ was taken in state $s_{m_h}$ and led to a transition to $s'_{m_h}$, and $n^{t}_{s_{m_h}}(A_{m_h}) = \sum_{s'_{m_h} \in \mathcal{X}_{m_h}} n^{t}_{s_{m_h} s'_{m_h}}(A_{m_h})$ represents the observed number of times before service interval $t$ that the action $A_{m_h}$ was taken in state $s_{m_h}$. The maximum-likelihood estimate is the ratio of these two counts.

3) Unlike regular value iteration and Q-learning, instead of updating the value $Q^{t+1}_{k,m_h}(s_{m_h}, A_{m_h})$ for all $s_{m_h} \in \mathcal{X}_{m_h}$, we only update the value for states in a particular set $B_{m_h}$. The rest of the states $s_{m_h} \notin B_{m_h}$ have insufficient SINR values to keep the transmission time within the remaining delay deadline $D^{rem}_{k,m_h}$. In other words, the condition $L_k / T^{goodput}_{k,m_h,m_{h+1}} \le D^{rem}_{k,m_h}$ must hold to support the transmission of traffic class

$C_k$ at agent $m_h$. Hence, the set is defined as in (15):

$$B_{m_h} = \Big\{ s_{m_h} : \frac{L_k}{T^{goodput}_{k,m_h,m_{h+1}}(x_{m_h,m_{h+1}})} \le D^{rem}_{k,m_h} \Big\} \quad (15)$$

which is a threshold on the SINR $x_{m_h,m_{h+1}}$ that depends on the physical layer parameters $\delta$ and $\xi$ of the agent $m_h$ (see equation ()). We only update the Q-values of the states $s_{m_h} \in B_{m_h}$ in Algorithm 2.

Fig. 4. System diagram of the proposed model-based online learning approach at the agent $m_h$.

TABLE II
COMPLEXITY SUMMARY OF THE MODEL-FREE REINFORCEMENT LEARNING

Required local information: $L^{t}_{m_h} = \{ s^{t}_{m_h}, \{\mathrm{Cost}^{t}_{k,m_h}, \forall C_k\}, F^{f,t}_{h-1}, F^{b,t}_{h+1} \}$
Memory complexity, transmission policy: $|\mathcal{X}_{m_h}||\mathcal{A}_{m_h}|K$
Memory complexity, state transition: not required
Memory complexity, Q-value: $|\mathcal{X}_{m_h}||\mathcal{A}_{m_h}|K$
Computational complexity: $O(|\mathcal{X}_{m_h}||\mathcal{A}_{m_h}|K)$

Table III summarizes the required local information, memory complexity, and computational complexity of the proposed model-based reinforcement learning approach. The proposed model-based reinforcement learning approach has higher computational complexity than the model-free reinforcement learning approaches. However, the computational complexity is a minor concern in the MCN compared with satisfying the delay constraints of the mission-critical applications. For the proposed model-based reinforcement learning approach, the Q-values for all $s_{m_h} \in B_{m_h}$ and all $C_k$ need to be updated in each service interval, and for each state and each action $A_{m_h} \in \mathcal{A}_{m_h}$, the last term $\sum_{s'_{m_h}} \hat{T}^{t}_{s_{m_h} s'_{m_h}}(A_{m_h}) \min_{A'_{m_h} \in \mathcal{A}_{m_h}} Q^{t}_{k,m_h}(s'_{m_h}, A'_{m_h})$ in equation (13) is calculated. Although the computational complexity is larger, the convergence rate of the proposed model-based reinforcement learning approach is much faster than that of the model-free reinforcement learning approaches.
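The three ingredients of the model-based approach (the priority queuing delay of equation (14), the maximum-likelihood transition estimate, and the Q-value update of equation (13) restricted to the set of feasible states) can be sketched together as follows. This is a hedged illustration rather than the paper's Algorithm 2: the array layouts, the function names, the 0-based class index, and the uniform fallback for state-action pairs that have never been visited are all assumptions introduced here.

```python
import numpy as np

def priority_queue_delay(eta, EX, EX2, k, d_rem):
    """Expected waiting time of class k (0-based, 0 = highest priority).

    eta[i], EX[i], EX2[i]: arrival rate and first/second service-time moments of class i.
    Returns infinity when the queue is unstable or the delay exceeds d_rem.
    """
    load = eta * EX
    if 1.0 - load[: k + 1].sum() <= 0.0:
        return np.inf
    num = np.sum(eta[: k + 1] * EX2[: k + 1])
    den = 2.0 * (1.0 - load[:k].sum()) * (1.0 - load[: k + 1].sum())
    w = num / den
    return w if w <= d_rem else np.inf

def ml_transitions(counts):
    """counts[a, s, s2] -> maximum-likelihood T_hat[a, s, s2] (uniform if unvisited)."""
    totals = counts.sum(axis=2, keepdims=True)
    uniform = 1.0 / counts.shape[2]
    return np.where(totals > 0, counts / np.maximum(totals, 1), uniform)

def model_based_update(Q, W, F, T_hat, B, rho, gamma=0.75):
    """Update Q[s, a] for every state s in B and every action a at once."""
    v_next = Q.min(axis=1)  # best (smallest) delay value of each next state
    target = W + F[None, :] + gamma * np.einsum("ast,t->sa", T_hat, v_next)
    Q_new = Q.copy()
    Q_new[B] = (1.0 - rho) * Q[B] + rho * target[B]
    return Q_new

# Toy usage with hypothetical numbers: two classes, two states, one action.
eta, EX, EX2 = np.array([1.0, 1.0]), np.array([0.1, 0.1]), np.array([0.02, 0.02])
w0 = priority_queue_delay(eta, EX, EX2, k=0, d_rem=1.0)
w1 = priority_queue_delay(eta, EX, EX2, k=1, d_rem=1.0)
T_hat = ml_transitions(np.array([[[3, 1], [0, 0]]]))
Q_new = model_based_update(np.zeros((2, 1)), np.ones((2, 1)), np.zeros(1), T_hat, B=[0], rho=1.0)
```

Because the update touches every state in `B` and every action in each interval, its per-interval cost is higher than a single Q-learning step, which mirrors the complexity versus convergence tradeoff discussed in the text.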
In Section V.B, we compare the convergence speeds of the different learning methods through extensive simulation results. Hence, the MCN nodes can choose to implement this higher-complexity learning to improve their performance. In Section V.C, we investigate the case where nodes deploy heterogeneous learning methods and determine the resulting performance.

C. Upper and lower bounds of the model-based learning approach

Since the maximum-likelihood state-transition probabilities $\hat{T}^{t}_{s_{m_h} s'_{m_h}}(A_{m_h})$ are used in the proposed model-based learning approach, there is no guarantee that the resulting MDP delay value converges to the optimal value $V^{*}_{k,m_h}(s_{m_h}, F^{b}_{h+1})$ in equation (9). In this subsection, we investigate the accuracy of the proposed model-based learning in terms of the resulting MDP delay value. Let $\overline{V}^{t}_{k,m_h}(s_{m_h}, F^{b,t}_{h+1})$ and $\underline{V}^{t}_{k,m_h}(s_{m_h}, F^{b,t}_{h+1})$ denote the upper and the lower bounds of the value, respectively, using $\hat{T}^{t}_{s_{m_h} s'_{m_h}}(A_{m_h})$ in the proposed model-based learning approach in service interval $t$. We define $\varepsilon$ as the $(1-\delta)$-confidence interval of the real MDP delay value (using the unknown $T_{s_{m_h} s'_{m_h}}(A_{m_h})$ in Section III) in service interval $t$, i.e. $\mathrm{Prob}\big(|\overline{V}^{t}_{k,m_h}(s_{m_h}, F^{b,t}_{h+1}) - V^{t}_{k,m_h}(s_{m_h}, F^{b,t}_{h+1})| \le \varepsilon\big) \ge 1-\delta$ $(0 < \delta < 1)$.

Proposition: There exists a $(1-\delta)$-confidence interval $\varepsilon$, such that an agent $m_h$ can update the upper bound of the value $\overline{V}^{t}_{k,m_h}(s_{m_h}, F^{b,t}_{h+1})$ using

$$\overline{V}^{t+1}_{k,m_h}(s_{m_h}, F^{b,t}_{h+1}) = \min_{A_{m_h}} \Big\{ E[W^{t}_{k,m_h}(s_{m_h}, A_{m_h})] + F^{b,t}_{h+1}(A_{m_h}) + \gamma \sum_{s'_{m_h}} \hat{T}^{t}_{s_{m_h} s'_{m_h}}(A_{m_h})\, \overline{V}^{t}_{k,m_h}(s'_{m_h}, F^{b,t}_{h+1}) + \varepsilon \Big\} \quad (16)$$

and update the lower bound $\underline{V}^{t}_{k,m_h}(s_{m_h}, F^{b,t}_{h+1})$ using

$$\underline{V}^{t+1}_{k,m_h}(s_{m_h}, F^{b,t}_{h+1}) = \min_{A_{m_h}} \Big\{ E[W^{t}_{k,m_h}(s_{m_h}, A_{m_h})] + F^{b,t}_{h+1}(A_{m_h}) + \gamma \sum_{s'_{m_h}} \hat{T}^{t}_{s_{m_h} s'_{m_h}}(A_{m_h})\, \underline{V}^{t}_{k,m_h}(s'_{m_h}, F^{b,t}_{h+1}) - \varepsilon \Big\} \quad (17)$$

and the following two conditions are satisfied:

1) $n^{t}_{s_{m_h}}(A_{m_h}) = \frac{1}{2} \ln\Big(\frac{|\mathcal{A}_{m_h}||B_{m_h}|}{\delta}\Big) \Big(\frac{V_{max}}{\varepsilon}\Big)^{2}$, $\forall A_{m_h} \in \mathcal{A}_{m_h}$, where $V_{max} = \max_{k} \frac{D^{rem}_{k,m_h}}{1-\gamma}$ represents the largest MDP delay value.

2) $\underline{V}_{k,m_h}(s_{m_h}, F^{b}_{h+1}) \le V^{*}_{k,m_h}(s_{m_h}, F^{b}_{h+1}) \le \overline{V}_{k,m_h}(s_{m_h}, F^{b}_{h+1})$ with probability at least $1-2\delta$.

Proof: See Appendix.

This proposition shows that the estimated values $V^{t+1}_{k,m_h}(s_{m_h}, F^{b,t}_{h+1})$ become more accurate as $n^{t}_{s_{m_h}}(A_{m_h})$ becomes larger than $\frac{1}{2} \ln\Big(\frac{|\mathcal{A}_{m_h}||B_{m_h}|}{\delta}\Big) \Big(\frac{V_{max}}{\varepsilon}\Big)^{2}$. Moreover, the closer the agent $m_h$ is to the destination node, the shorter the remaining path becomes, which provides a smaller $V_{max}$ and leads to a smaller requirement on $n^{t}_{s_{m_h}}(A_{m_h})$. Hence, using the same proposed model-based learning approach to accumulate $n^{t}_{s_{m_h}}(A_{m_h})$, the learning approach provides a more accurate MDP delay value for an agent that is closer to its destination node, which is also verified in the simulation results in Section V.D.

TABLE III
COMPLEXITY SUMMARY OF THE MODEL-BASED REINFORCEMENT LEARNING

Required local information: $L^{t}_{m_h} = \{ s^{t}_{m_h}, F^{f,t}_{h-1}, F^{b,t}_{h+1} \}$
Memory complexity, transmission policy: $|B_{m_h}||\mathcal{A}_{m_h}|K$
Memory complexity, state transition: $|B_{m_h}|^{2}|\mathcal{A}_{m_h}|K$
Memory complexity, Q-value: $|B_{m_h}||\mathcal{A}_{m_h}|K$
Computational complexity: $O(K|B_{m_h}|^{2}|\mathcal{A}_{m_h}|)$

V. SIMULATION RESULTS

In this section, we simulate the performance of the proposed model-based reinforcement learning for solving the distributed MDP for the mission-critical applications.

A. Simulation results for different network topologies

We first simulate a 6-hop MCN with the topology shown in Figure 5(a), with two ASs and 8 ARs. Such an MCN is commonly adopted in various areas, such as battlefield sensing, security monitoring, and healthcare applications, where prioritized data packets need to be relayed to remote destinations in a timely manner.
Two groups of mission-critical applications are sent in different priority classes ($K = 8$). The characteristic parameters of these mission-critical applications are given in Table IV. Various mission-critical applications can be supported, e.g. video streams from surveillance cameras [12], or delay-sensitive monitoring reports such as forest fire detection or patient monitoring []. Group 1 mission-critical applications are sent through the AS $m_1$ to the destination node $D_1$, and group 2 mission-critical applications are sent from the other AS $m_2$ to its destination node $D_2$. The agents are assumed to be able to select a set of modulation and coding schemes that support a transmission rate $T$ = [...] Mbps for all the transmission links in the network [20]. Each receiver of the transmission links receives a random SINR $x$ that results in a packet error rate ranging from 5% to 3%. We assume that the nodes exchange hello messages (as in DSDV [24]) with the required information exchange every [...] ms (each service interval is [...] ms).

Figure 5(b) shows the MDP delay values from the ASs to the destination nodes for the first 2 service intervals. Only the results of the first five priority classes are shown. The higher priority traffic has a smaller MDP delay value $V^{t}_{k,m_h}$. The results of centralized optimization are computed analytically by assuming that the global network information is known by a central controller, which is unrealistic in practice. On the other hand, the proposed model-based reinforcement learning determines the cross-layer transmission policy at each agent based on local information. We set $\gamma = 0.75$, which is appropriate for a highly time-varying MCN (after 10 service intervals, the future accounts for only about 5% of the cost). Note that our model-based learning provides MDP delay values close to the centralized optimization results, especially for the priority classes $C_1, C_2, C_3$ that satisfy the condition $E[W_{k,m_h}] \le D^{rem}_{k,m_h}$.
These three priority classes converge to a steady state after $t = 4$, since their end-to-end delays are within the delay deadline of the applications (the required performance level is set as $\sum_{t=0}^{\infty} \gamma^{t} D_k = \frac{D_k}{1-\gamma} = 4$ when the delay deadline of each future service interval is considered) and no packets are dropped. The results also show that the higher priority traffic converges faster than the lower priority traffic. This is because the queuing delay of the lower priority class traffic is impacted by the higher priority class traffic.

Next, we simulate a skewed network topology that has two clusters of nodes, shown in Figure 6(a). Such a network topology with clusters of nodes can be common in the MCN due to landscape requirements. The network connections between the two clusters usually form a bottleneck for transmitting the mission-critical applications. Figure 6(b) shows that the MDP delay values $V^{t}_{k,m_h}$ of all the priority classes increase. We observe that only the convergence rates of the higher priority classes decrease in the skewed network due to the impact of the bottleneck.
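The two numerical remarks about the discount factor can be checked with a few lines of arithmetic. The sketch below assumes $\gamma = 0.75$, a horizon of 10 service intervals, and a per-interval deadline $D_k = 1$; the deadline value is inferred here from the stated performance level of 4 and is an assumption of this example.

```python
gamma = 0.75   # discount factor used in the simulations
D_k = 1.0      # assumed per-interval delay deadline (inferred, not stated explicitly)

# Weight of a cost incurred 10 service intervals in the future.
weight_after_10 = gamma ** 10

# Discounted sum of the deadline over an infinite horizon: D_k / (1 - gamma).
required_level = D_k / (1.0 - gamma)

# The same quantity approximated by a long finite sum.
partial_sum = sum(gamma ** t * D_k for t in range(200))

print(round(weight_after_10, 4), required_level)
```

The first value is a little under 6 percent, consistent with the "about 5%" remark, and the geometric sum equals the required performance level of 4.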

TABLE IV
THE CHARACTERISTIC PARAMETERS OF THE MISSION-CRITICAL APPLICATIONS

Group 1 mission-critical applications ($V_1$): classes $C_1$, $C_4$, $C_6$, $C_8$
Group 2 mission-critical applications ($V_2$): classes $C_2$, $C_3$, $C_5$, $C_7$
Rows: $R_k$ (Kbps), $D_k$ (sec), $L_k$ (bytes)

Fig. 5. (a) 6-hop network topology; axes: x-axis (m), y-axis (m). (b) MDP delay values of the first five priority classes versus service interval $t$, for centralized optimization and model-based learning, together with the required performance level.

Fig. 6. (a) 2-cluster skewed network topology; axes: x-axis (m), y-axis (m). (b) MDP delay values of the first five priority classes versus service interval $t$, for centralized optimization and model-based learning, together with the required performance level.

B. Comparison among the reinforcement learning approaches

In this subsection, we compare the proposed model-based reinforcement learning approach with Q-learning in [16] (a model-free reinforcement learning approach) and the myopic self-learning approach in [21] ($\gamma = 0$). We adopt the same network conditions as in the previous simulations and the network topology shown in Figure 5(a). In Figure 7, the simulation results show that the proposed model-based reinforcement learning approach outperforms the other two learning approaches in terms of the MDP delay values for all the priority classes. Although Q-learning has the lowest computational complexity, it has the worst performance in terms of both the MDP delay value $V^{t}_{k,m_h}$ and the convergence rate. The delay of the $C_1$ traffic converges after $t = 2$ for the proposed model-based learning approach and only after $t = 4$ for the Q-learning approach. Convergence is not guaranteed for the lower priority class traffic, especially for the myopic self-learning solution. Moreover, although the myopic approach has the fastest convergence rate, it results in worse performance than the proposed model-based reinforcement learning approach.
In addition to the MDP delay values $V^{t}_{k,m_h}$, we directly compare the expected end-to-end delays $E[\mathrm{Delay}^{t}_{k}]$ of the mission-critical applications from the ASs to the destination nodes. The acceptance level for $E[\mathrm{Delay}^{t}_{k}]$ is $D_k = 1$. In Figure 8, the simulation results show that by using the proposed model-based learning approach, the MCN is able to support up to three mission-critical classes, since the end-to-end delay must be within the delay deadline of the applications ($E[\mathrm{Delay}^{t}_{k}] \le D_k$), while by using the other two learning approaches, the network can only support two mission-critical classes.

Fig. 7. Comparisons of the discounted end-to-end delay (MDP delay values of the first five priority classes versus service interval $t$) using different learning approaches that solve the distributed MDP: model-based learning, self-learning, and Q-learning, together with the required performance level.

TABLE V
THE RESULTS OF HETEROGENEOUS LEARNING SCENARIOS

Columns: learning method (within 2 hops from ASs); learning method (outside 2 hops from ASs); expected delay of the first class traffic (sec); expected delay of the second class traffic (sec).
Scenarios (within 2 hops, outside 2 hops): (Model-based, Model-based); (Model-based, Both (random)); (Model-based, Model-free); (Model-free, Model-based); (Model-free, Both (random)); (Model-free, Model-free).

Next, we simulate the expected delay of different classes in a source variation scenario, where the AS $m_1$ disappears right after service interval $t = 6$. Figure 9 shows the changes of the expected delays over time for different classes using the various learning approaches. Since the AS $m_1$ is the source node of packets in classes $\{C_1, C_4, C_6, C_8\}$, the expected delays $E[\mathrm{Delay}_1]$ and $E[\mathrm{Delay}_4]$ in Figure 9 vanish after $t = 6$. We can observe that if Q-learning is applied, before $t = 6$ only class $C_1$ from $m_1$ can be delivered in time ($E[\mathrm{Delay}_1] \le D_1$). However, after $t = 6$, the class $C_2$ from $m_2$ can be supported by the MCN due to the alleviation of the traffic loading.
By applying the proposed model-based learning approach, before $t = 6$ both classes $C_1$ and $C_2$ can be delivered in time, and after $t = 6$ not only the class $C_2$ but also the class $C_3$ from $m_2$ can be supported by the MCN. This shows that the proposed model-based learning approach enables the MCN to support more mission-critical applications.

Fig. 8. Comparisons of the expected end-to-end delay ($E[\mathrm{Delay}_1]$ to $E[\mathrm{Delay}_5]$ versus service interval $t$) using different learning approaches that solve the distributed MDP: model-based learning, self-learning, and Q-learning, together with the required performance level.

Fig. 9. Source node of packets in classes $C_1$, $C_4$ disappears after $t = 6$ ($E[\mathrm{Delay}_1]$ to $E[\mathrm{Delay}_5]$ versus service interval $t$ for model-based learning, self-learning, and Q-learning, together with the required performance level).

C. Heterogeneous learning

In the previous simulations, we assumed that all the network nodes adopt the same learning approach to solve the distributed MDP. In reality, however, the agents can adopt different learning approaches. We simulated different scenarios in which the agents have heterogeneous learning capabilities, using the same network conditions as in the previous simulation and the same network topology shown in Figure 5(a). In Table V, we assume that the agents in the same hop use the same learning method. Model-based learning refers to the proposed model-based reinforcement learning approach, and model-free learning refers to the Q-learning in [16]. The simulation results show that adopting a model-based learning approach near the ASs is very important: the delays are smaller regardless of the type of learning approach adopted by the rest of the nodes. This is because the model-based learning approach provides a more accurate estimate of the expected delay feedback than the model-free learning approach. Also, the model-based learning approach converges faster than the model-free learning approach. Hence, the more

remaining nodes adopt the model-based learning approach, the higher the improvement in the delay performance. Moreover, the delays of the second priority class traffic vary more than those of the first priority class. This shows that the learning methods adopted by the agents can significantly impact the performance of mission-critical applications, especially the ones with lower priorities. In other words, the deployed learning approaches impact the number of mission-critical applications supported by the MCN.

D. Determining the upper and the lower bound

In this subsection, we provide simulation results to show the upper bound and the lower bound of the model-based reinforcement learning. We adopt the same network conditions and the 2-cluster network topology shown in Figure 6(a). Figure 10 shows the MDP delay values of the first priority class traffic at different hops. Since the real delay is proven to be bounded between the upper and the lower bounds, the result shows that the model-based reinforcement learning provides end-to-end delays that become more and more accurate over time, as well as when the agents are closer to the destination nodes.

Fig. 10. The upper and the lower bounds of the discounted end-to-end delays (V values versus service interval $t$) for the first priority class traffic at different hops.

VI. CONCLUSION

In this paper, we investigated how the agents in the MCN should optimally select their cross-layer transmission actions in order to minimize the end-to-end delays of mission-critical applications. To consider both the spatial and the temporal dependency in the MCN, we formulate the network delay minimization problem using a distributed MDP. To solve the distributed MDP in practice, we propose an online model-based reinforcement learning approach.
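Unlike the conventional model-free reinforcement learning approaches, the proposed model-based reinforcement learning approach has a faster convergence rate, since it takes advantage of the priority queuing model and requires less time for the autonomic node to explore different states to evaluate the Q-values. Our simulation results verify the suitability of the proposed model-based learning approach for supporting mission-critical applications by the agents in the MCN.

APPENDIX
PROOF OF THE PROPOSITION

We apply the Hoeffding inequality [22] to obtain the confidence interval $\varepsilon$. The inequality states that, given random variables $\{X_1, \ldots, X_m\}$ in the range $[0, X_{max}]$, the following holds:

$$\mathrm{Prob}\Big( \frac{1}{m} \sum_{i=1}^{m} X_i - \frac{1}{m} \sum_{i=1}^{m} E[X_i] \ge \varepsilon \Big) \le e^{-2m (\varepsilon / X_{max})^{2}} \quad (18)$$

From the first condition, we have $\varepsilon = V_{max} \sqrt{\frac{\ln(|\mathcal{A}_{m_h}||B_{m_h}|/\delta)}{2 n^{t}_{s_{m_h}}(A_{m_h})}}$. Denote $E[\overline{V}(s_{m_h}, A_{m_h})] = \sum_{s'_{m_h}} \hat{T}^{t}_{s_{m_h} s'_{m_h}}(A_{m_h})\, \overline{V}^{t}_{k,m_h}(s'_{m_h}, F^{b,t}_{h+1})$ as the average MDP delay upper bound based on the estimated $\hat{T}^{t}_{s_{m_h} s'_{m_h}}(A_{m_h})$ whenever state $s_{m_h}$ is visited and action $A_{m_h}$ is taken, and denote $E[V(s_{m_h}, A_{m_h})] = \sum_{s'_{m_h}} T_{s_{m_h} s'_{m_h}}(A_{m_h})\, V^{t}_{k,m_h}(s'_{m_h}, F^{b,t}_{h+1})$ as the average expected MDP delay value based on the real $T_{s_{m_h} s'_{m_h}}(A_{m_h})$. Similar to the proof of Lemma 3.2 in [23], equation (18) can be rewritten as:

$$\mathrm{Prob}\big( E[\overline{V}(s_{m_h}, A_{m_h})] - E[V(s_{m_h}, A_{m_h})] \ge \varepsilon \big) \le \exp\Big( -2 n^{t}_{s_{m_h}}(A_{m_h}) \Big(\frac{\varepsilon}{V_{max}}\Big)^{2} \Big) = \exp\Big( -\ln\frac{|\mathcal{A}_{m_h}||B_{m_h}|}{\delta} \Big) = \frac{\delta}{|\mathcal{A}_{m_h}||B_{m_h}|} \quad (19)$$

Hence, $\mathrm{Prob}\big( |\overline{V}^{t+1}_{k,m_h}(s_{m_h}, F^{b,t}_{h+1}) - V^{t+1}_{k,m_h}(s_{m_h}, F^{b,t}_{h+1})| \ge \varepsilon \big) \le \delta$ over all state-action pairs (the total number of state-action pairs is $|\mathcal{A}_{m_h}||B_{m_h}|$). A similar proof can be applied to the lower bound. Since $n^{t}_{s_{m_h}}(A_{m_h})$ in the last term of equations (16) and (17) goes to infinity as $t \to \infty$, we can show that both the upper bound and the lower bound converge under the same conditions, i.e. $\overline{V}_{k,m_h}(s_{m_h}, F^{b}_{h+1}) = \lim_{t \to \infty} \overline{V}^{t}_{k,m_h}(s_{m_h}, F^{b,t}_{h+1})$ and $\underline{V}_{k,m_h}(s_{m_h}, F^{b}_{h+1}) = \lim_{t \to \infty} \underline{V}^{t}_{k,m_h}(s_{m_h}, F^{b,t}_{h+1})$.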
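The relation between the number of visits and the confidence interval used in the proof can be evaluated numerically. The sketch below simply inverts the Hoeffding bound in equations (18) and (19); the function names and the example numbers (4 actions, 10 states in the feasible set, $\delta = 0.05$, $V_{max} = 4$) are assumptions chosen for illustration.

```python
import math

def confidence_radius(n_visits, n_actions, n_states_b, delta, v_max):
    """epsilon = V_max * sqrt( ln(|A||B| / delta) / (2 n) )."""
    return v_max * math.sqrt(math.log(n_actions * n_states_b / delta) / (2.0 * n_visits))

def visits_needed(eps, n_actions, n_states_b, delta, v_max):
    """Smallest integer n with n >= 0.5 * ln(|A||B| / delta) * (V_max / eps)^2."""
    return math.ceil(0.5 * math.log(n_actions * n_states_b / delta) * (v_max / eps) ** 2)

# Hypothetical example: the radius shrinks as 1/sqrt(n).
eps_100 = confidence_radius(100, n_actions=4, n_states_b=10, delta=0.05, v_max=4.0)
eps_400 = confidence_radius(400, n_actions=4, n_states_b=10, delta=0.05, v_max=4.0)
n_back = visits_needed(eps_100, n_actions=4, n_states_b=10, delta=0.05, v_max=4.0)
```

Quadrupling the number of visits halves the radius, and a smaller $V_{max}$ (an agent closer to its destination) lowers the number of visits needed for the same accuracy.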
Due to the symmetric structure of $\overline{V}_{k,m_h}(s_{m_h}, F^{b}_{h+1})$ and $\underline{V}_{k,m_h}(s_{m_h}, F^{b}_{h+1})$, we apply the union bound as in [23] to show that the probability that $V^{*}_{k,m_h}(s_{m_h}, F^{b}_{h+1})$ lies outside the interval between the lower and the upper bounds is at most $2\delta$, which completes the proof.

REFERENCES

[1] A. Rezgui and M. Eltoweissy, "Service-Oriented Sensor-Actuator Networks," IEEE Commun. Mag., vol. 45, no. 12, pp. 92-100, Dec. 2007.
[2] S. Nelakuditi, Z. Zhang, R. P. Tsang, D. H. C. Du, "Adaptive Proportional Routing: A Localized QoS Routing Approach," IEEE/ACM Trans. Netw., vol. 10, no. 6, Dec. 2002.
[3] K. Akkaya and M. Younis, "An Energy-Aware QoS Routing Protocol for Wireless Sensor Networks," in Proc. IEEE Workshop on Mobile and Wireless Networks (MWN 2003), Providence, RI, May 2003.
[4] P. Gupta and T. Javidi, "Towards Throughput and Delay-Optimal Routing for Wireless Ad-Hoc Networks," Asilomar Conference on Signals, Systems and Computers, Nov. 2007.
[5] M. J. Neely, E. Modiano, and C. E. Rohrs, "Dynamic Power Allocation and Routing for Time-Varying Wireless Networks," IEEE J. Sel. Areas Commun., vol. 23, no. 1, Jan. 2005.

TABLE VI
ALGORITHM 1: MODEL-FREE REINFORCEMENT LEARNING AT NODE $m_h$

[6] A. Tizghadam, A. Leon-Garcia, "On Congestion in Mission-Critical Networks," IEEE INFOCOM 2008, April 2008.
[7] M. Liotine, Mission Critical Network Planning, Artech House, Norwood, MA, 2003.
[8] Y. Guan, X. Fu, D. Xuan, P. U. Shenoy, R. Bettati, and W. Zhao, "NetCamo: Camouflaging Network Traffic for QoS-Guaranteed Mission Critical Applications," IEEE Trans. Syst., Man, Cybern. A, vol. 31, no. 4, July 2001.
[9] Y. Huang, W. He, K. Nahrstedt, W. C. Lee, "DoS Resistant Broadcast Authentication with Low End-to-end Delay," IEEE INFOCOM 2008, April 2008.
[10] D. Marsh, R. Tynan, D. O'Kane, G. M. P. O'Hare, "Autonomic Wireless Sensor Networks," Artificial Intelligence, vol. 17, 2004.
[11] D. Chen and P. K. Varshney, "QoS support in wireless sensor networks: A survey," in Proc. International Conference on Wireless Networks (ICWN), pp. 2-24, Las Vegas, NV, June 2004.
[12] J. Chakareski and P. Frossard, "Rate-Distortion Optimized Distributed Packet Scheduling of Multiple Video Streams Over Shared Communication Resource," IEEE Trans. Multimedia, vol. 8, no. 2, Apr. 2006.
[13] H.-P. Shiang and M. van der Schaar, "Informationally Decentralized Video Streaming over Multi-hop Wireless Networks," IEEE Trans. Multimedia, vol. 9, no. 6, Sep. 2007.
[14] A. G. Barto, S. J. Bradtke and S. P. Singh, "Learning to act using real-time dynamic programming," Artificial Intelligence, vol. 72, no. 1-2, Jan. 1995.
[15] P. Tadepalli and D. Ok, "Model-based average reward reinforcement learning," Artificial Intelligence, vol. 100, no. 1-2, Jan. 1998.
[16] C. J. C. H. Watkins, P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, May 1992.
[17] R. S. Sutton, "Learning to predict by the method of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9-44, Aug. 1988.
[18] M. L. Puterman, Markov Decision Process: Discrete Stochastic Dynamic Programming, John Wiley & Sons, Inc., New York, 1994.
[19] D. P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, 1995.
[20] D. Krishnaswamy, "Network-assisted Link Adaptation with Power Control and Channel Reassignment in Wireless Networks," 3G Wireless Conference, pp. 165-170, 2002.
[21] H.-P. Shiang and M. van der Schaar, "Multi-user video streaming over multi-hop wireless networks: A distributed, cross-layer approach based on priority queuing," IEEE J. Sel. Areas Commun., vol. 25, no. 4, May 2007.
[22] W. Hoeffding, "Probability inequalities for sums of bounded random variables," J. American Statistical Association, vol. 58, no. 301, pp. 13-30, Mar. 1963.
[23] E. Even-Dar, S. Mannor, Y. Mansour, "Action elimination and stopping conditions for reinforcement learning," Proc. International Conference on Machine Learning (ICML 2003), 2003.
[24] C. E. Perkins, P. Bhagwat, "Highly Dynamic Destination-Sequenced Distance-Vector Routing (DSDV) for Mobile Computers," ACM SIGCOMM Computer Communication Review, vol. 24, no. 4, Oct. 1994.
[25] T. Roughgarden, E. Tardos, "How Bad is Selfish Routing?" J. ACM, vol. 49, no. 2, March 2002.
[26] F. Kelly, A. Maulloo, and D. Tan, "Rate control in communication networks: shadow prices, proportional fairness and stability," J. Operational Research Society, vol. 49, no. 3, Mar. 1998.
[27] D. Xu, M. Chiang, and J. Rexford, "Link-state routing with hop-by-hop forwarding achieves optimal traffic engineering," Proc. IEEE INFOCOM, 2008.
[28] F. Fu, M. van der Schaar, "A systematic framework for dynamically optimizing multi-user video transmission," technical report, http://arxiv.org/abs/
[29] Q. Zhang, S. A. Kassam, "Finite-state Markov Model for Rayleigh fading channels," IEEE Trans. Commun., vol. 47, no. 11, Nov. 1999.
[30] J. Dowling, E. Curran, R. Cunningham, and V. Cahill, "Using Feedback in Collaborative Reinforcement Learning to Adaptively Optimize MANET Routing," IEEE Trans. Syst., Man, Cybern. A, vol. 35, no. 3, May 2005.
[31] H. Tong, T. X. Brown, "Adaptive Call Admission Control under Quality of Service Constraints: A Reinforcement Learning Solution," IEEE J. Sel. Areas Commun., vol. 18, no. 2, Feb. 2000.
[32] S. Toumpis, A. J. Goldsmith, "Capacity Regions for wireless Ad Hoc Network," IEEE Trans. Wireless Commun., vol. 2, no. 4, July 2003.


TABLE VII
ALGORITHM 2: MODEL-BASED REINFORCEMENT LEARNING AT NODE $m_h$

Hsien-Po Shiang is currently a Postdoctoral Scholar at the Department of Electrical Engineering, University of California, Los Angeles. He graduated from National Taiwan University with his B.S. and M.S. in Electrical Engineering in 2000 and 2002, respectively. In 2009, he received his Ph.D. degree in Electrical Engineering from the University of California, Los Angeles. During his Ph.D. study, he worked at Intel Corp., Folsom, CA, in 2006, researching overlay network infrastructure over wireless mesh networks. He has published several journal papers and conference papers on these topics and was selected as one of the eight Ph.D. students chosen for the 2007 Watson Emerging Leaders in Multimedia awarded by IBM Research, NY. His research interests include cross-layer optimizations/adaptations, multimedia communications, and dynamic resource management for delay-sensitive applications.

Mihaela van der Schaar received the Ph.D. degree from Eindhoven University of Technology, Eindhoven, The Netherlands, in 2001. She is currently an Associate Professor at the Department of Electrical Engineering, University of California, Los Angeles. She holds 3 granted US patents. She is also the editor (with Phil Chou) of the book Multimedia over IP and Wireless Networks: Compression, Networking, and Systems (San Diego, CA: Academic Press, 2007). Dr. van der Schaar has been an active participant in the International Organization for Standardization (ISO) MPEG standard since 1999, to which she made more than 5 contributions and for which she received 3 ISO recognition awards. She received the National Science Foundation CAREER Award in 2004, the IBM Faculty Award in 2005 and 2007, the Okawa Foundation Award in 2006, the IEEE Transactions on Circuits and Systems for Video Technology Best Paper Award in 2005, and the Most Cited Paper Award from the European Association for Signal Processing journal Signal Processing: Image Communication for [...]. She was elected an IEEE Fellow in 2010. Her research interests include wireless multimedia processing, communication and networking, game-theoretic approaches in multi-agent communication systems, and multimedia systems.
