2170 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 55, NO. 5, MAY 2007


Q-Learning Algorithms for Constrained Markov Decision Processes With Randomized Monotone Policies: Application to MIMO Transmission Control

Dejan V. Djonin and Vikram Krishnamurthy, Fellow, IEEE

Abstract: This paper presents novel Q-learning based stochastic control algorithms for rate and power control in V-BLAST transmission systems. The algorithms exploit the supermodularity and monotonic structure results derived in the companion paper. The rate and power control problem is posed as a stochastic optimization problem with the goal of minimizing the average transmission power under a constraint on the average delay, which can be interpreted as the quality of service requirement of a given application. The standard Q-learning algorithm is modified to handle the constraints so that it can adaptively learn a structured optimal policy for unknown channel/traffic statistics. We discuss the convergence of the proposed algorithms and explore their properties in simulations. To address the issue of unknown transmission costs in an unknown time-varying environment, we propose a variant of the Q-learning algorithm in which power costs are estimated online, and we show that this algorithm converges to the optimal solution as long as the power cost estimates are asymptotically unbiased.

Index Terms: Constrained Markov decision process (CMDP), delay constraints, monotone policies, Q-learning, randomized policies, reinforcement learning, supermodularity, transmission scheduling, V-BLAST.

I. INTRODUCTION

This paper addresses the problem of structured learning of a rate and power control policy for transmission over a wireless multiple-input multiple-output (MIMO) channel under a constraint on transmission latency. Several structural results on the optimal costs and policies have been derived in the companion paper [1].
It has been shown in [1] that the optimal rate allocation action is monotonically increasing in the buffer occupancy and that control policy optimization can be divided into two separate problems: low-layer bit-loading and high-layer total rate allocation. In this paper, we exploit these structural results to derive computationally efficient stochastic control algorithms.

Manuscript received January 25, 2006; revised August 2. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. David J. Miller. This work was supported in part by the Natural Sciences and Engineering Research Council (NSERC) PostDoctoral Fellowship Award and in part by an NSERC strategic grant. D. V. Djonin is with Dyaptive, Inc., Vancouver, BC V6E 4A6, Canada. V. Krishnamurthy is with the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC V6T 1Z4, Canada (e-mail: vikramk@ece.ubc.ca). Color versions of one or more of the figures in this paper are available online. Digital Object Identifier /TSP

A. Summary of the Contributions

The most important contributions of this paper are:
1) application of online policy learning algorithms for the computation of the optimal rate scheduling algorithms for delay-constrained V-BLAST transmission in imperfectly known channel and traffic environments with simulated costs;
2) utilization of structural results on the optimal rate scheduling policy with the goal of improving the convergence rate of online learning algorithms;
3) analytical formulation and numerical examination of three novel algorithms designed to incorporate a submodular linear constraint in the standard Q-learning algorithm to improve its convergence.

The main ideas of this paper are novel from a communications perspective as well as from a learning-based perspective.

B. Communications Perspective

The problem addressed in this paper is a cross-layer optimization problem, as we jointly consider the statistics of the traffic arriving into the adaptive V-BLAST transmitter and the transmitter adaptation based on the channel statistics. The optimization goal is to reduce the total transmission power over all transmitter antennas while satisfying the constraint on transmission delay. Due to the consideration of the transmit buffer, this problem is inherently a dynamic stochastic optimization problem and can be stated as a constrained Markov decision process (CMDP).

The problem of transmission control optimization for single-channel systems, with the objective of average power or bit-error rate (BER) minimization under latency constraints, has been previously addressed in [2], [3]. Multichannel rate adaptation for orthogonal frequency division multiplexing (OFDM) systems with a delay constraint has been addressed in [4]. However, none of these results discuss the problem of transmission control adaptation when the channel and/or traffic statistics are unknown. Here, we address this problem and present several algorithms for adaptive transmission control. These adaptive learning algorithms are based on the ideas of stochastic approximation and reinforcement learning [5]. To the best knowledge of the authors, the structured submodular Q-learning algorithm proposed in this paper is also novel from the control-theoretic viewpoint.

From a signal transmission perspective, the importance of the addressed problem is threefold.
1) We address the adaptive policy learning, as the wireless channels and traffic statistics are usually not a priori

known. Static optimization is therefore not suitable for the transmission control problem.
2) We address MIMO channels, as they provide higher capacities than single-channel systems [6], [7]. The MIMO channel capacity can be further increased by employing per-antenna power and rate allocation at the transmitter (e.g., see power adaptation for vertical Bell Labs layered space-time (V-BLAST) addressed in [8] and [9]).
3) We incorporate the consideration of real-time traffic in order to reduce the transmission latency and satisfy different user quality of service (QoS) conditions.

C. Learning-Based Perspective

The problem of policy learning for the analyzed V-BLAST power and rate adaptation can be addressed either using discrete stochastic approximation algorithms that learn the control policy directly (see [10] and references within) or using continuous stochastic approximation algorithms such as Q-learning [5], actor-critic methods [11] and policy space methods [12]. In this paper, we have decided to pursue the continuous stochastic approximation approach, as the discrete version would involve the search for the optimal policy within a very large set of candidate policies. Traditionally, both discrete stochastic approximation and Q-learning were used for unconstrained MDPs and the calculation of pure optimal policies. Due to the equivalence in costs of a CMDP and a Lagrangian MDP formulation of a CMDP [13], Q-learning can be applied to compute the optimal pure policy for a fixed Lagrangian multiplier and active constraint. The optimal randomized policy of a CMDP can be computed as a mixed policy of two optimal pure policies for two different Lagrangian multipliers. We also present an iterative algorithm to find these Lagrangian multipliers and compute optimal randomized policies.
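As a rough illustration of how a mixed policy can meet an average delay constraint with equality, the sketch below computes a mixing probability from the average delays of two pure policies. It assumes, purely for illustration, that one of the two pure policies is drawn once at the start and then held, so that long-run averages combine linearly in the mixing probability; the function names and numbers are hypothetical and are not taken from the paper.

```python
def mixing_probability(delay_1, delay_2, delay_bound):
    """Return q such that q*delay_1 + (1-q)*delay_2 equals delay_bound.

    Assumption: policy 1 is used with probability q and policy 2 with
    probability 1-q, chosen once and kept, so averages mix linearly.
    """
    lo, hi = min(delay_1, delay_2), max(delay_1, delay_2)
    if not lo <= delay_bound <= hi:
        raise ValueError("delay bound must lie between the two pure-policy delays")
    if delay_1 == delay_2:
        return 1.0
    return (delay_2 - delay_bound) / (delay_2 - delay_1)


def mixed_average(q, value_1, value_2):
    """Average cost of the mixture under the same linearity assumption."""
    return q * value_1 + (1.0 - q) * value_2


# Hypothetical numbers: policy 1 is fast but power hungry, policy 2 is the opposite.
q = mixing_probability(delay_1=2.0, delay_2=6.0, delay_bound=3.0)
mixed_delay = mixed_average(q, 2.0, 6.0)
mixed_power = mixed_average(q, 10.0, 4.0)
```

With these hypothetical values the mixture meets the delay bound exactly while using less power than the faster pure policy alone, which is the reason randomization can help in a constrained problem.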
Each of the pure policies that constitute the optimal randomized policy possesses a known structure, that is, rate allocation actions are monotonically increasing with the buffer state. This implies that the Q-factors possess a submodularity property that can be stated as a linear constraint on the Q-factors and easily utilized in the Q-learning algorithm.¹ The rationale is that by imposing the submodularity structure on the Q-factors, Q-learning searches through the policies more quickly and avoids considering nonstructured policies that are known to be nonoptimal. It has been shown in [1] that the reduction in policy space achieved by considering only structured policies can be several orders of magnitude. Further, unlike actor-critic methods [11], the Q-learning algorithm has well-explored convergence properties that can be shown to carry over to the structured version of Q-learning.

In practice, costs of such a CMDP can be estimated online during the learning phase and sampled costs can be used to update the Q-factors. Q-learning converges to the optimal solution with probability one as long as the cost estimates are asymptotically unbiased. We discuss how to perform power cost estimation in the case that powers are adapted at a faster rate than the transmission rates. This approach has the added advantage that transmission adaptation actions (which have to be negotiated between the transmitter and the receiver) can be performed less frequently than the power control actions. Furthermore, rate control actions can be based on a coarser quantization of the channel state than the power control actions. This results in a more efficient Q-learning algorithm.

¹ As opposed to the presented structured Q-learning algorithm, it is difficult to incorporate submodular constraints and ensure convergence to the optimal discrete policy using the policy space search methods discussed in [12].

Fig. 1. MIMO transmission model with Q-learning algorithm. ACM: adaptive coding and modulation.

D. Paper Outline

The outline of the paper is as follows. We formulate the V-BLAST power and rate control problem using a stochastic control framework and CMDP in Section II. Respective costs and transition probabilities are identified for such a problem in Section III. Section IV presents a summary of structural properties of optimal policies. In Section V, we utilize this structure of the optimal policy and propose several methods to improve the convergence rate of the Q-learning algorithm. This approach results in novel structured Q-learning algorithms for CMDPs that are posed as stochastic constrained optimization problems with linear constraints. We propose three algorithms to solve that constrained optimization problem: the primal-dual algorithm, the primal projection algorithm and the submodular parameterization algorithm. In Section VI, we numerically explore the performance of the proposed structured Q-learning algorithms for delay-constrained V-BLAST rate and power control. The simulations show that the primal projection method best utilizes the a priori known structure of the optimal policy for both stringent and relaxed delay constraints.

II. V-BLAST TRANSMISSION MODEL

Notation: A discrete-time slotted model is used throughout the paper. A time slot is defined as the time interval and controller decision in this time slot is made at the beginning of that interval at time. Let denote the discrete-time (in general random) variable at time slot. To avoid cumbersome notation, we will drop the time-slot superscript designation whenever that does not cause confusion. Let denote the cardinality of a certain finite set, and denote the probability measure. Let be the set of integers including 0.

Fig. 1 shows a schematic representation of the V-BLAST transmitter and receiver model used in this paper. The transmitter is equipped with the transmission buffer of length. The task of the controller is to choose rates and powers for each of

the transmission antennas. Let us denote the buffer occupancy in bits at the beginning of the th time slot with, where and the buffer state space is. The transmission buffer is continually supplied with the incoming traffic from a higher layer application. Let be the number of packets stored into the transmission buffer during the th time slot. It is assumed that for all, is an element of a finite state space and the packet length is bits. Furthermore, we assume the following.

A1) The number of packets stored into the transmission buffer is an ergodic Markov chain with transition probabilities that are independent of the chosen action, buffer occupancy and channel state.

Consider that traffic arriving at the transmission buffer is Markovian and independent of the buffer occupancy and actions taken. Out of these packets, only packets are stored into the transmission buffer. Therefore, A1 is satisfied if the number of packets stored into the transmission buffer does not depend on the buffer occupancy or actions, i.e., A1 cannot be satisfied if there are buffer overflows in the finite transmission buffer. A Markov model for the incoming traffic is sufficient since the incoming traffic state space is finite and control decisions are made periodically at the end of each time slot. Therefore, it is not necessary to consider a semi-Markov process for the incoming traffic.

The MIMO channel considered in this paper is a point-to-point wireless channel with transmit and receive antennas, that satisfy. The channel is considered to be block fading and constant during a time slot of length. Furthermore, at time slot, the MIMO channel is completely described by the complex dimensional channel matrix containing the elements. Let be the vector of transmitted symbols employing a certain modulation format, from all of the transmit antennas.
Each of the transmitted data streams over antennas can contain independent information. Then, the received signal vector can be presented in the following complex baseband vector form (1) while is the noise vector. The elements of are assumed to be independent and identically distributed (i.i.d.) Gaussian random variables with zero mean and variance. The channel matrix is assumed to depend only on the previous time slot, i.e., and the sequence of channel matrices, constitutes a continuous-valued Markov process.

A. Receiver Structure

In the above MIMO channel, signals from all of the transmit antennas are received on all of the receiver antennas. To recover and estimate the transmitted signals, several receiver structures have been devised. These include linear receivers such as zero-forcing and minimum mean-square error (MMSE) receivers, and nonlinear successive interference cancellation receivers. Each of these receivers computes estimates for all of the independent data streams. Next we consider the zero-forcing (ZF) linear detector and show that by employing this detector, the MIMO channel is decoupled into parallel independent channels.² The zero-forcing detector assumes that the channel gains are known at the receiver. In this receiver, the received signal at time slot is multiplied by the pseudo-inverse. The postdetection signal-to-noise ratio (SNR), normalized by the nominal transmission power of mW, and associated with th transmission antenna when the linear ZF equalizer is used can be expressed as (2) where is the normalized received SNR at each receive antenna and is defined as. We will assume that quantized information on the postdetection SNR is provided to the transmitter, and that this information is utilized in the controller to choose the current rates and power levels in all of the transmit antennas. For the th transmit antenna ( ), we assume that the postdetection SNR is quantized using thresholds at, where and.
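The decoupling and quantization steps described above can be sketched numerically. The example below uses the standard textbook expression for the zero-forcing postdetection SNR of stream k, namely rho divided by the k-th diagonal entry of the inverse Gram matrix of the channel; this is assumed here rather than copied from (2), whose exact form is not reproduced in this text. The restriction to two transmit antennas, the threshold values and all function names are illustrative assumptions.

```python
import bisect


def zf_post_detection_snr(H, rho):
    """Per-stream ZF postdetection SNR for a channel with two transmit antennas.

    H   : list of rows, each row holding two complex channel gains
    rho : normalized received SNR (linear scale)
    Assumed textbook form: snr_k = rho / [(H^H H)^{-1}]_{kk}.
    """
    g00 = sum(abs(row[0]) ** 2 for row in H)
    g11 = sum(abs(row[1]) ** 2 for row in H)
    g01 = sum(row[0].conjugate() * row[1] for row in H)
    det = g00 * g11 - abs(g01) ** 2
    if det <= 0:
        raise ValueError("channel matrix is rank deficient; ZF is not defined")
    # Diagonal of the inverse Gram matrix is (g11/det, g00/det).
    return [rho * det / g11, rho * det / g00]


def quantize_snr(snr, thresholds):
    """Map an SNR value to a channel-state index using sorted interior thresholds."""
    return bisect.bisect_right(thresholds, snr)


# Hypothetical 2x2 channel with correlated columns and illustrative thresholds.
H = [[1 + 0j, 1 + 0j],
     [0 + 0j, 1 + 0j]]
snrs = zf_post_detection_snr(H, rho=10.0)
states = [quantize_snr(s, [2.0, 6.0, 12.0]) for s in snrs]
```

The first stream suffers a noise enhancement of two because the channel columns are not orthogonal, so it falls into a lower quantized state than the second stream.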
Denote with the set of all quantized channel states corresponding to a certain transmit antenna. Therefore, there will be channel states in for each of the single antenna postdetection SNRs. Let the current channel state associated with the th transmitter ( ) be denoted with and. Let us denote with the composite MIMO channel state defined as the Cartesian product of state spaces of quantized postdetection SNRs for each of transmitter data streams. The composite channel state of all transmit channels is denoted with and we will adopt the following assumption regarding its statistical evolution:

A2) The sequence of channel states, forms an ergodic first-order Markov chain with transition probabilities and is independent of the action, buffer state and incoming traffic state.

² The ZF detector has been used in the simulations of Section VI. However, the CMDP model, the proposed adaptive algorithm and the utilized structural results are also valid for MMSE detectors with an appropriate change in the power costs.

III. V-BLAST POWER AND RATE CONTROL PROBLEM AS CMDP

This section provides a detailed formulation of the V-BLAST power and rate control problem (V-BLAST-PRCP) as a CMDP. A CMDP is completely described through its state space, action space, transition probabilities and cost criteria. Let denote an arbitrary finite set called the state space while denote the finite set called the action set of a CMDP. The following definitions are the building blocks of the Q-learning algorithm to be discussed in Section V.

A. State Space

Utilizing the definitions of Section II, the state space of V-BLAST-PRCP is the composite space comprising the buffer

space, incoming traffic space and the channel state space, i.e., where denotes the Cartesian product.

B. Action Space

The action in the V-BLAST-PRCP is interpreted as the composite rate allocation of the individual transmitter antennas. Let denote the number of bits that are allocated to antenna where is the set of all possible single-antenna bit allocations. As shown in Fig. 1, these bits are processed by the adaptive coding and modulation (ACM) block to produce consecutive symbols transmitted from antenna. Let the composite action denote the bit allocations across all transmit antennas. The set of composite actions is equal to. Let us also define the function which returns the number of bits retrieved from the buffer if action is applied, i.e., (3)

Define the set of Markovian admissible policies is measurable w.r.t. Let denote the σ-algebra generated by the observed system state at time. This means that is a (potentially) random function of the current state. Let denote the set of all pure policies where is a deterministic function of the current state. We now introduce the following unichain assumption on the set of optimal policies:

A3) The set of admissible policies for the PRCP CMDP consists of unichain policies.

This assumption establishes regularity conditions of the CMDP that ensure the existence of the optimal policy for the average cost problems (for more details see [14]). A CMDP is unichain [14] if every policy where is a deterministic function of induces a single recurrent class plus a possibly empty set of transient states.

C. Transition Probabilities

When the system is in state, a finite number of possible actions which are elements of the set can be taken. Let denote the action taken by the decision maker at the time. For a given policy, the evolution of an MDP is Markovian with transition probabilities (4) for some, and. Based on Assumptions A1 and A2, the transition probability of V-BLAST-PRCP between the composite state and when action is taken is given by (5) where is the indicator function that returns 1 if is true and 0 otherwise.

D. Cost Criteria

We will adopt the average expected cost as the optimization criterion in V-BLAST-PRCP. For any admissible policy, let the infinite horizon cost conditioned on initial state be defined as (6) where the expectation is over randomized actions and system state evolution for. The goal is to compute the optimal policy that minimizes the cost (6): (7) subject to the global constraint (8)

Let finite cost be the instantaneous cost of taking action in the state. For any linear V-BLAST receiver the power cost for the composite channel state and composite rate action can be expressed as the total power necessary for transmission with a given average BER, i.e., (9) where is a single-channel power needed to transmit with rate action over a channel state. Let the instantaneous delay cost be defined as (10) where is the average number of incoming packets in a time slot and is the length of each packet in bits. For given above and according to Little's formula, (8) describes the constrained average delay incurred in the buffer. The constraint cost is a user-specified parameter. Any policy that minimizes will be called the optimal policy. The cost of the policy that is optimal subject to constraint (8) will be denoted by.

The single-channel power cost when action is applied is a random variable dependent on the random postdetection SNR conditioned on channel state. However, as is known from [15], the equivalent immediate costs in the case of random immediate costs can be calculated as the average cost for a given state-action pair. Therefore (11) where is the power needed to transmit with rate over a channel with SNR with a given BER of. The expectation is over the SNR conditioned on the channel state being in state i.e.,.³ Furthermore, can be calculated from (12) for given. The expression for the BER is a function of the instantaneous signal-to-noise ratio and the rate action and depends on the utilized modulation format. In the numerical results of Section VI we use uncoded M-ary quadrature amplitude modulation (MQAM) in each of the transmission antennas and its BER expression will be approximated with (see e.g., [16]) (13) Therefore, for uncoded MQAM the immediate single-channel power cost for channel SNR can be expressed as (14)

³ As an alternative to the above calculation of power costs, in Section V-C, we will discuss online estimation of power costs that can be used in conjunction with the online Q-learning algorithm.

Let be the Lagrangian cost given with (15) for a certain Lagrangian multiplier. As discussed in [13] and [1], the optimal policy of the CMDP with one constraint is a mixture of two pure policies that are optimal for the unconstrained MDP with costs given as in (15) and two different Lagrangian multipliers.

IV. SUMMARY OF STRUCTURAL RESULTS ON OPTIMAL POLICIES

In this section, we review two theorems, whose proofs are given in the companion paper [1]. These results will allow us to reduce the computational complexity and exploit the structural results of the optimal policy in the Q-learning algorithms that are to be discussed in Section V.

A. Action Reduction

We first note that the action space of the V-BLAST-PRCP is of dimensions that can be very large for large transmit antenna arrays. The following theorem demonstrates that the action space can be reduced to the set with on the order of states and that the optimal policy of the V-BLAST-PRCP will only utilize actions from this reduced action set.

Theorem 1: For the V-BLAST-PRCP, if the transmission costs have the form (9), then the composite action set containing actions can be exponentially decreased in cardinality to the reduced action set with actions. To compute the optimal policy of V-BLAST-PRCP, for a certain channel state and traffic state, only the action that has the minimal transmission cost among the actions that retrieve a fixed amount of data from the transmission buffer should be considered, i.e., (16) All other actions can be dropped from the model.

To utilize the results of the previous theorem, we can define the reduced action set as. Using the reduced action sets, in the following sections we assume for simplicity that the buffer retrieval function is equal to the ordinal number of the action from the reduced action set. Let be the number of actions in the reduced action set.

B. Monotonic Policies

The Q-learning algorithm is based on the adaptive iterative learning of the Q-factors of an MDP. The Q-factors are defined as (17) where denotes the value function of a certain state that is the solution of Bellman's equation (18) for a fixed Lagrange multiplier. Function is called submodular (has decreasing differences) in for fixed parameters and, if for all and (19) It has been shown in [17] that if is submodular in then the optimal action of the MDP for a certain state is monotonically increasing in the state. The following assumption and definitions for the stated transmission control problem will be used to establish the result on monotonic policies below.

A4) The set of feasible actions is state-dependent and in state is a nonempty set of actions for which and and any.

This assumption states that there exists a feasible policy that does not lead to transmit buffer overflows.

Definition 1: For any, mixed policy is a randomized policy formed of two pure policies and such that policy is applied with probability and policy is applied with probability.

The next concept we will use is multimodularity.
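Before moving on, the link between submodularity and monotone policies can be checked on a toy table. The sketch below assumes the usual decreasing-differences form of condition (19) for a table indexed by buffer state b and action a, that is Q[b+1][a+1] - Q[b+1][a] <= Q[b][a+1] - Q[b][a], and verifies that the greedy (cost-minimizing) action index is then nondecreasing in b. The cost table and helper names are hypothetical and chosen only to illustrate the property.

```python
def is_submodular(Q, tol=1e-12):
    """Check decreasing differences on adjacent (buffer, action) pairs.

    On an integer grid the adjacent-pair check implies the condition
    for all ordered pairs, because differences telescope.
    """
    for b in range(len(Q) - 1):
        for a in range(len(Q[b]) - 1):
            upper = Q[b + 1][a + 1] - Q[b + 1][a]
            lower = Q[b][a + 1] - Q[b][a]
            if upper > lower + tol:
                return False
    return True


def greedy_policy(Q):
    """Smallest cost-minimizing action index for every buffer state."""
    return [min(range(len(row)), key=row.__getitem__) for row in Q]


def is_nondecreasing(policy):
    return all(x <= y for x, y in zip(policy, policy[1:]))


# Hypothetical Q table: quadratic mismatch between buffer level and action
# plus a linear action cost. The cross term -2*b*a makes it submodular.
Q = [[(b - a) ** 2 + 0.5 * a for a in range(4)] for b in range(5)]
policy = greedy_policy(Q)
```

A table with increasing differences, such as [[0, 0], [0, 1]], fails the check, which is the kind of table a structured learning algorithm is designed to rule out.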
Multimodularity extends the convexity property of continuous functions defined on Euclidean space to real-valued functions defined on a discrete set. We will call a 2-D multimodular base and let be a convex subset of the set of ordered pairs of integers [18].

Definition 2: (Multimodularity) A real-valued function is multimodular with respect to base if for all, and, the following holds: (20)

We will use the multimodularity property of discrete functions because it is preserved after minimization over any subset of parameters of a multimodular function (see [18, Lemma 61]). According to Definition 1, a mixed policy is a randomized policy that is a convex combination of pure policies and.

Definition 3: Pure policy is nondecreasing in the buffer state if the ordinal number (index) of the action taken in state is nondecreasing in buffer state for each channel state and traffic state.

Theorem 2: Consider the V-BLAST-PRCP defined in Section III. Let assumptions A1, A2, A3, A4 hold. Furthermore, assume that the following holds:

A5) The Lagrangian cost defined in (15) is a multimodular function of, and a submodular function of for any,,.

Then for cost constraint, the optimal randomized policy is a mixed policy of two pure policies and. Both and are nondecreasing functions of the buffer occupancy state (see Definition 3). Furthermore, there exists only one state such that.

Under the assumption of no buffer overflows, the above theorem states that each pure policy that constitutes the optimal mixed policy retrieves a nondecreasing number of bits from the buffer as the buffer occupancy increases. Therefore, the average number of bits retrieved from the buffer by the optimal mixed policy of the transmission controller is also nondecreasing in the buffer occupancy. An important consequence of the above theorem is that pure policies and can be computed from the adjoined unconstrained MDP for two different values of Lagrangian multipliers and. This implies that functions and are both submodular in and we will exploit that fact in the derivation of adaptive algorithms in the next section.

V. ADAPTIVE LEARNING ALGORITHMS FOR POLICY OPTIMIZATION

In this section, we propose several algorithms for adaptive online policy learning that exploit the structural results discussed in Section IV. First, we present the standard Q-learning algorithm that will be used as a reference in simulations. In order to utilize known structures of the optimal policy, we propose three novel structured Q-learning algorithms. We discuss how these Q-learning algorithms can utilize estimated costs for adaptation updates. Lastly, we propose an adaptive algorithm to compute the optimal Lagrangian multipliers for a given constraint and the optimal randomized policy using Q-learning.

The optimal solution of a CMDP problem for a fixed Lagrange multiplier can be computed using Bellman's equation (18), which can be rewritten equivalently as (21) and this equality has to hold for any state and action pair and its next state and action pair. This form is convenient as we can apply the stochastic approximation formulation (e.g., the Robbins-Monro algorithm) to iteratively find the optimal Q-factors. Note that the Q-factors in our problem are positive since the costs are also always positive. Next, we first discuss the standard variant of the Q-learning algorithm directly derived from (21) and later extend it to employ the structure that exists in the V-BLAST-PRCP formulation.

A. Nonstructured Q-Learning for Policy Optimization

The standard algorithm for updating the Q-factors in a nonstructured algorithm is as follows [5]. The Q-factor for the state-action pair is updated when this pair is visited as (22) where is the next state to be visited. The adaptive step size depends on the current state-action pair and is calculated according to (23) where is a given constant and Visit is the number of times the state-action pair has been visited by the algorithm. In a practical implementation of Q-learning, the Q-factors are updated applying the same policy for a fixed period of time slots referred to as the update interval.
After that interval, the new policy is chosen based on current Q-factors and. For finite MDPs, the Q-learning algorithm ensures convergence with probability one to the optimal solution of Bellman's equation (21) [5]. One of the drawbacks of nonadaptive algorithms for the calculation of the optimal policy using MDPs is that they require a priori knowledge of the expected costs of taking an action in a certain state. An adaptive learning algorithm such as Q-learning can take into account the random nature of the instantaneous cost for a certain state/action pair by utilizing the measured values of the cost, as discussed in Section V-C.

B. Constrained Structured Q-Learning for Policy Optimization

Here we utilize the results of Section IV to devise three structured Q-learning algorithms. Namely, we propose the primal-dual method, the primal projection method and the submodular parameterization method, all of which utilize the submodular property of the Q-factors as a constraint. To the best knowledge of the authors, the proposed structured Q-learning algorithms are novel both from the control-theoretic viewpoint and within the context of V-BLAST-PRCP.
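The nonstructured update of Section V-A can be sketched as a small tabular learner. Because (22) and (23) are not reproduced in this text, the code below makes three labeled assumptions: the update is the relative (average-cost) form of Q-learning with a fixed reference state-action pair, the step size is a constant divided by the visit count of the updated pair, and the Lagrangian cost is power plus multiplier times delay. Class, function and parameter names are hypothetical.

```python
from collections import defaultdict


def lagrangian_cost(power, delay, lam):
    """Assumed form of the Lagrangian cost: power + lam * delay."""
    return power + lam * delay


class RelativeQLearner:
    """Tabular Q-learning for cost minimization under an average-cost criterion.

    Assumptions (not taken from the paper): relative-value update with a
    reference pair, and step size eps / visits(x, a).
    """

    def __init__(self, n_states, n_actions, eps=1.0, ref=(0, 0)):
        self.Q = [[0.0] * n_actions for _ in range(n_states)]
        self.visits = defaultdict(int)
        self.eps = eps
        self.ref = ref

    def update(self, x, a, cost, x_next):
        """Update Q[x][a] after observing the sampled cost and next state."""
        self.visits[(x, a)] += 1
        step = self.eps / self.visits[(x, a)]
        ref_value = self.Q[self.ref[0]][self.ref[1]]
        target = cost + min(self.Q[x_next]) - ref_value
        self.Q[x][a] += step * (target - self.Q[x][a])
        return self.Q[x][a]

    def greedy_policy(self):
        """Cost-minimizing action index per state, chosen after an update interval."""
        return [min(range(len(row)), key=row.__getitem__) for row in self.Q]


learner = RelativeQLearner(n_states=2, n_actions=2)
first = learner.update(1, 0, lagrangian_cost(1.0, 2.0, 0.5), x_next=0)
second = learner.update(1, 0, 4.0, x_next=1)
```

A structured variant would additionally keep the table inside the set of submodular tables after each update, which is what the three methods of Section V-B aim to do in different ways.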

The dependence on the fixed Lagrangian multiplier is implied and is omitted from the notation for simplicity. Therefore, the following parameterization of the Q-function is introduced that guarantees that the Q-factors are submodular. Let for, and any as follows. If, then If or, then. If parameters are positive for all,,, and then the submodularity property (19) and positivity of the function are preserved for all and. The proof of this claim can be easily formalized by mathematical induction.

To provide a compact notation for the submodular constraint on the function, let us define the following -dimensional row vector (24) and similarly the following row vector of parameters (25) The linear relation between Q-factors and parameters explained above can be written in matrix form as follows: (26) where the -dimensional block-diagonal transformation matrix is defined as (27) and is a block-diagonal matrix with blocks being -dimensional matrix given with (28). The mapping from Q-factors to parameters is invertible and matrix is invertible. The constraint on submodularity of the function can now be given simply as (29) where the relation is performed element-wise. Using the above vector notation, (21) can be written compactly as (30) for the transition between the state-action pair with index in the vector, to the state-action pair with index. Then the th element of the vector-valued function is defined as (31) where set is defined as the set of state-action pairs that include the state and. We can finally state our constrained optimization problem in vector form as (32) subject to (29), where (33) and represents the gradient operator. Next we propose, and in Section VI explore by simulations, three methods to compute the solution to the optimization problem (32) and (29) in an online fashion.

1) Primal-Dual Method: This algorithm can be utilized to iteratively compute the solution of the constrained optimization problem given in (32) and (29). We first form the Lagrangian of the problem (34) where is the -dimensional row vector of Lagrange multipliers of the primal-dual algorithm. The update formulas for the primal-dual version of the structured Q-learning algorithm (for details on the primal-dual algorithm see [19, Sec. IV.D]) are (35) where learning steps and are calculated from (23) by substituting and instead of, respectively. The primal-dual algorithm converges with probability one provided that the optimization function is convex for any and and that the feasible region is convex. This is necessary to ensure that there is no primal-dual gap and that Slater's conditions are satisfied [20]. The feasible region is convex since the constraints are linear. Furthermore, it is easy to show that the Hessian of the function is a set of gradients associated with each

of the dimensions of the vector function. Furthermore, this Hessian is a diagonal matrix with only one nonzero positive element, which implies that the Hessian is positive semidefinite and that the optimization function is convex.

2) Primal Projection Method: The projection algorithm is also used to iteratively compute the solution of the constrained optimization problem given in (32) and (29). However, in this method we utilize the exact solution for the projected gradient in the optimization problem with linear constraints that are met with equality (cf. [21]). The primal projection constrained optimization method has been traditionally used in problems with equality constraints. To adapt this method to the inequality constraints we add positive auxiliary variables in the vector and form a modified vector such that (36) (37) where, where is identity matrix of dimensions. Now the inequality constrained problem has been transformed into an equality constrained problem, and the update formula for the modified vector for the transition between state-action pair with index, to the state-action pair with index is as follows: (38) where the projection matrix in the vector is calculated (39) as given in [21]. Function where and are row vectors of length. Similar to the primal-dual algorithm, the primal projection algorithm converges with probability one (for proof see also [22]).

3) Submodular Parameterization Method: In this method, the constrained optimization problem given in (32) and (29) is formulated to utilize the fact that if the elements of the vector are positive then the constraint (29) is satisfied. Therefore, we can introduce the following simple parameterization of the vector: (40) where denotes the element-wise squaring of the matrix. The iterative update formula of the modified vector for the transition between state-action pair with index in the vector, to the state-action pair with index is now given as (41)

C. Estimated Costs and Practical Implementation of Learning Algorithms

In order to avoid precomputation of costs in an unknown environment, costs can be estimated online during the course of the Q-learning algorithm. Namely, in the case of estimating the power costs, the transmitter chooses and keeps fixed the transmission rate during the length of the th time slot. This rate allocation is based on the coarse quantization of the channel supplied by the channel state. The power control actions can be performed several times during the decision epoch based on the timely estimate of the channel matrix. For example, if parameters of the CDMA2000 standard are considered, one of the possible frame durations is 20 ms, while the power control rate is 800 Hz. Therefore, in that setting there can be 16 power updates per one rate control action. The power levels are to be chosen based on (12) such that the predefined BER requirement is met for the applied rate action and current channel state. This approach has a practical appeal since rate control actions can be made less frequently and can be based on the smaller state space. In contrast to changing transmission rates, changing transmission power can be performed more frequently and does not need cooperation between the transmitter and the receiver. Average transmission power during a decision epoch is supplied to the Q-learning algorithm in order to update the Q-factors. The Q-learning algorithm with estimated costs converges to the optimal solution with probability one provided that the power cost estimates are asymptotically unbiased [23].

Next we discuss the per-iteration complexity of the proposed structured and nonstructured Q-learning algorithms.
Namely:
1) the nonstructured Q-learning algorithm (22) involves only number comparisons;
2) the primal-dual method given with (35) involves number comparisons, multiplications and additions;
3) considering that is precomputed, the primal-projection method (38) involves number comparisons, multiplications and additions; and
4) considering that is precomputed, the submodular parameterization (41) involves number comparisons, multiplications and additions.
From the above discussion, it can be seen that the number of calculations per iteration is at most linear in the state-space size.

D. Lagrangian Multiplier Update Algorithm and Randomized Policies

The optimal pure policy for a CMDP can be found using the relative value iteration (RVI) in case that the constraint in (8) is active. However, we can still pose the question of computing a suitable Lagrangian multiplier that satisfies the constraint with equality. Note that the average constraint for the optimal policy for Lagrangian multiplier can be given with (42) For the model of Section II and positive transmission and buffer costs, is a piece-wise constant decreasing function of.
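To illustrate why a piece-wise constant, decreasing average delay makes the multiplier easy to locate, the following minimal sketch searches for the smallest multiplier whose policy meets a delay target. Everything in it is an assumption made for illustration: the staircase `avg_delay`, its breakpoints and delay values, the search interval, and the use of plain bisection, which is a generic technique and not the update rule proposed in this paper.

```python
import bisect

# Assumed staircase: multiplier breakpoints and the average delay (in packets)
# of the optimal pure policy on each segment. Values are illustrative only.
BREAKPOINTS = [0.5, 1.0, 2.0, 4.0]
DELAYS = [6.0, 4.5, 3.2, 2.65, 1.8]  # decreasing, one more entry than BREAKPOINTS


def avg_delay(lam):
    """Piece-wise constant, decreasing average delay for multiplier lam >= 0."""
    return DELAYS[bisect.bisect_right(BREAKPOINTS, lam)]


def smallest_feasible_multiplier(target, lo=0.0, hi=10.0, tol=1e-6):
    """Bisection for (approximately) the smallest lam with avg_delay(lam) <= target."""
    if avg_delay(hi) > target:
        raise ValueError("target delay is not reachable on [lo, hi]")
    if avg_delay(lo) <= target:
        return lo  # the constraint already holds at the lower bound
    # Invariant: avg_delay(lo) > target >= avg_delay(hi).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if avg_delay(mid) <= target:
            hi = mid
        else:
            lo = mid
    return hi
```

Because the delay is monotone in the multiplier, the feasible multipliers form an interval, so the search only has to find its left end. In a learning setting `avg_delay` would not be available in closed form and would have to be estimated from the learned policy.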

A simple algorithm designed to find the smallest (that will be called ) such that the constraint (8) is satisfied can be formulated as follows: (43) where the step. The convergence to is ensured as the function is a piece-wise linear concave function that attains its global maximum at and its derivative is. Therefore, the algorithm (43) is just a gradient descent algorithm.

We demonstrate how to employ the algorithm (43) and the estimated parameter to find the optimal randomized policy for any constraint with RVI. We assume that the conditions of Theorem 2 hold. First, find the for a given feasible constraint. In view of the result on optimal mixed policies for CMDP [24], perturb the parameter by some to get and. Next we find the optimal pure policies and and their respective average constrained costs and. As stated in Theorem 2, the optimal randomized policy is a mixed policy of two pure policies; let parameter determine the probability of taking the policy and be the probability of taking the policy. Now, parameter can be computed such that. A simplified schematic diagram of the overall randomized policy learning algorithm for CMDP is presented in Fig. 2 and divided into three stages for simplicity.

VI. NUMERICAL RESULTS

A. Simulation Parameters

We have performed numerous Monte Carlo simulations of the proposed learning algorithms with the goal of evaluating and comparing their performance. The following choice of parameters has been used in the simulations: the number of transmit antennas is and the number of receiver antennas, BER target is. The average normalized postdetection SNR is chosen as follows: and. Further, channel states per each transmit antenna are employed, i.e.,. Therefore, there are a total of channel states of the discretized postdetection SNR across all transmit antennas.
We have employed a zero-forcing receiver and have chosen the partition of the postdetection SNRs so as to have equiprobable discretized channel states. As discussed in [3], a favorable partition of the channel state space is to assume that all channel states are equiprobable. Therefore, we have chosen the channel SNR thresholds as follows: and in order to guarantee the equiprobable distribution of channel states. In the course of the simulations it is assumed that the time slot is ms and that channel matrix realizations are independent across time slots and distributed with Gaussian distribution. The incoming traffic arriving into the transmission buffer is assumed to be i.i.d. and to follow the distribution where is the number of incoming packets during a single time slot. The following distribution is used in the simulations: for or,,, and. It is assumed that actions are available for transmission adaptation in each of the transmit antennas, where action corresponds to no transmission in the th channel, corresponds to BPSK transmission and retrieval of bits from the buffer, while corresponds to QAM transmission and retrieval of bits from the buffer. Therefore, there are a total of actions across all transmit antennas. However, due to the action reduction techniques discussed in Section IV, the total number of actions can be reduced to 9 as the total number of bit packets drawn from the buffer for action is an element of the set. The buffer size in the simulations was chosen to be eight packets. Note that to avoid packet overflows in the buffer, a special high penalty cost is introduced for state-action pairs that would lead to buffer overflows.

Fig. 2. Schematic representation of the flow diagram of the Q-learning algorithm applied to learn the randomized optimal policy of a constrained MDP. (a) Q-learning algorithm for a fixed Lagrange multiplier. (b) Learning algorithm of the optimal Lagrange multiplier. (c) Learning algorithm of the optimal randomized policy.
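The equiprobable partition of the channel states described above can be sketched as a quantile computation. The sketch below is only an illustration under a stated assumption: the postdetection SNR of one antenna is taken to be exponentially distributed with a given mean, which need not match the distribution in the simulated setup. The function names and the numeric values are hypothetical.

```python
import bisect
import math


def equiprobable_thresholds(mean_snr, num_states):
    """Interior thresholds that split an exponential SNR into equiprobable bins."""
    # Assumed exponential CDF: F(x) = 1 - exp(-x / mean_snr).
    # Solving F(t_k) = k / num_states gives the k-th threshold.
    return [-mean_snr * math.log(1.0 - k / num_states)
            for k in range(1, num_states)]


def channel_state(snr, thresholds):
    """Index of the quantization bin (0 .. len(thresholds)) that contains snr."""
    return bisect.bisect_right(thresholds, snr)


# Example with assumed values: four states and unit mean SNR (linear scale).
thresholds = equiprobable_thresholds(mean_snr=1.0, num_states=4)
state = channel_state(0.9, thresholds)
```

With thresholds placed at the quantiles of the SNR distribution, each discretized state is visited with the same long-run frequency, which is the property the partition is chosen to guarantee.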
As suggested in [5], during the course of Q-learning simulations a random action is chosen in action-state pair with probability Visit. The state-space size in this example is and the number of Q-factors that correspond to state-action pairs is equal to.

B. Lagrangian Cost Adaptation Using Proposed Q-Learning Algorithms

We demonstrate the performance of the proposed algorithms for two different values of the Lagrange multiplier that weighs the influence of the holding cost on the choice of the optimal policy. In Figs. 3 and 4, we show the performance of the three proposed structured Q-learning algorithms for and compare them with the performance of the standard Q-learning. It can be seen
that all three new algorithms perform better than the standard Q-learning algorithm, and that the convergence to a stable level that is close to the optimal cost is achieved within less than 200 policy updates. The convergence from that stable cost level to the optimal cost is very slow. Note that convergence is faster for the submodular parameterization method, as the policy update interval in that case is equal to only 20 iterations. Somewhat different behavior is achieved for Lagrange multiplier as demonstrated in Figs. 5 and 6. This case corresponds to the case when the weight placed on the holding (delay) costs is smaller. In this case, only the primal-projection method performs better than the standard Q-learning algorithm. A plausible explanation of this behavior is that too much structure is imposed on the policy for smaller values of and that some nonstructured policies found by the nonstructured Q-learning algorithm can perform even better. This seems to affect mostly the primal-dual method and the submodular parameterization method, while the projection method does not lose its effectiveness for smaller values of.

Fig. 3. Comparison of primal-dual and submodular parameterization Q-learning with nonstructured Q-learning policy adaptation schemes for Lagrange multiplier.

Fig. 4. Comparison of primal projection Q-learning with nonstructured Q-learning policy adaptation schemes for Lagrange multiplier.

Fig. 5. Comparison of primal-dual and submodular parameterization Q-learning with nonstructured Q-learning policy adaptation schemes for Lagrange multiplier. The optimal power cost is mW and optimal delay cost is 2.65 packets.

Fig. 6. Comparison of primal projection Q-learning with nonstructured Q-learning policy adaptation schemes for Lagrange multiplier. The optimal power cost is mW and optimal delay cost is 2.65 packets.
Bar-graph comparisons of the optimal and suboptimal policies obtained by primal-dual Q-learning for and are shown in Figs. 7 and 8, respectively. In Fig. 9, we show the convergence properties of the nonstructured and primal projection Q-learning algorithms for both power and delay cost and the Lagrangian cost. The choice of parameters is the same as in the simulation of Fig. 6 and the costs are smoothed out over ten trials. It can be seen that the primal projection Q-learning algorithm approaches both the optimal power and delay cost more closely than the nonstructured Q-learning algorithm. Furthermore, the Lagrangian cost of the primal projection method also approaches the optimal cost more closely than that of the nonstructured Q-learning algorithm.

C. Discussion

To gain further insight into the practical implementation of the proposed algorithms, we next discuss the time needed to achieve convergence. The primal projection method, which appears to have the fastest short-time convergence rate, needs 25 policy updates to achieve convergence with 10 iterations per policy update and. For ms that would amount to ms s to achieve convergence. In the worst case for, the primal projection method needs 120 policy updates to achieve convergence with 100 iterations per policy update, which would amount to ms min to
achieve convergence. Therefore, the proposed algorithms can also be applicable in slowly changing nonstationary channel and traffic environments. It should also be noted that actual short-term convergence properties depend on the chosen learning parameters, including the policy update interval, learning steps,, and, the initial state of the Q-factors, and the probabilities of taking randomized actions when certain states are visited.

It is possible to further improve the convergence properties of the proposed policy learning algorithms as follows. The optimal policy and Q-factors can be precomputed in an offline fashion for several known and typical channel and traffic environments. For example, we can precompute the Q-factors that give the optimal policy for Rayleigh fading channels and several typical values of fading rates. The classification of the current channel environment can be performed using an external channel estimator. The Q-factors for the optimal policy of the channel that best matches the current channel environment can then be used as a good guess for the initial conditions for the Q-factors of the policy learning algorithms described in this paper.

Fig. 7. Comparison of the optimal policy and the policy obtained by the primal projection Q-learning via policy adaptation for Lagrange multiplier.

Fig. 8. Comparison of the optimal policy and the policy obtained by the primal projection Q-learning via policy adaptation for Lagrange multiplier.

Fig. 9. Comparison of the convergence properties of power, delay, and Lagrangian costs for both primal projection Q-learning and nonstructured Q-learning policy adaptation.

VII. CONCLUSION

This paper explores adaptive delay-aware transmission control algorithms for MIMO systems under unknown channel and traffic conditions.
The problem of rate and power adaptation under delay constraints is formulated as a CMDP and its solution is obtained in an online fashion. We have proposed and explored the properties of two general classes of algorithms that can be used to learn the optimal control policy: the conventional Q-learning and the structured Q-learning algorithms. The presented methods can find application in the following modern wireless systems and under the following conditions.
1) Transmission rate/power control for next generation wireless systems that employ MIMO transmission based on V-BLAST principles.
2) Transmission rate/power control with latency requirements for multichannel transmission such as that in OFDM systems. This is of high relevance for modern wireless LAN systems based on OFDM standards such as IEEE 802.11g, and wireless MAN systems based on IEEE 802.16.
3) The proposed algorithms for transmission control adaptation are equally valid for time-correlated input traffic and channels as well as non-time-correlated traffic and channels.
4) The proposed structured Q-learning algorithm converges faster than the conventional Q-learning algorithm and both algorithms approach the optimal costs in less than 200 policy updates. Therefore, provided that the channel/traffic statistics change sufficiently slowly, the proposed Q-learning algorithm is useful even in nonstationary environments and will be able to track the changes in the statistics of the channel and traffic and appropriately adapt the transmission control policy.
5) The presented discussion and results are given for the zero-forcing V-BLAST receiver. However, the proposed Q-learning algorithm is applicable for any other linear
receiver structure (such as, for example, the MMSE detector) as long as the sequence of quantized postdetection SNR channel states forms a Markov chain.

In [25] and [26], policy gradient algorithms are presented for learning-based control of CMDPs. It is of interest to use such policy search algorithms when the policies are monotone.

REFERENCES

[1] D. V. Djonin and V. Krishnamurthy, "V-BLAST power and rate control under delay constraints in Markovian fading channels: Optimality of randomized monotonic policies," IEEE Trans. Signal Process., 2007, accepted for publication.
[2] B. E. Collins and R. Cruz, "Transmission policies for time varying channels with average delay constraints," in Proc. Allerton Conf. Commun., Contr. Comput., Sep. 1999, pp
[3] A. K. Karmokar, D. V. Djonin, and V. K. Bhargava, "Optimal and suboptimal packet scheduling over time-varying fading channels," IEEE Trans. Wireless Commun., vol. 5, no. 2, pp , Feb
[4] M. J. Hossain, D. V. Djonin, and V. K. Bhargava, "Delay limited optimal and suboptimal power and bit loading algorithms for OFDM systems over correlated fading," in Proc. GLOBECOM 2005, St. Louis, 2005, pp
[5] D. P. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific,
[6] G. J. Foschini and M. J. Gans, "On limits of wireless communication in a fading environment when using multiple antennas," Wireless Pers. Commun., vol. 6, pp , Mar
[7] G. J. Foschini, "Layered space-time architecture for wireless communication in a fading environment when using multielement antennas," Bell Labs Tech. J., pp , Oct
[8] S. Chung, H. C. Howard, and A. Lozano, "Low complexity algorithm for rate quantization in extended V-BLAST," in Proc. IEEE VTC 2001, 2001, pp
[9] H. Zhang, L. Dai, S. Zhou, and Y. Yao, "Low complexity per-antenna rate and power control approach for closed-loop V-BLAST," IEEE Trans. Commun., vol. 51, no. 11, pp , Nov
[10] V. Krishnamurthy, X. Wang, and G. Yin, "Spreading code optimization and adaptation in CDMA via discrete stochastic approximation," IEEE Trans. Inf. Theory, vol. 50, no. 9, pp , Sep
[11] A. Barto, R. Sutton, and C. Anderson, "Neuron-like elements that can solve difficult learning control problems," IEEE Trans. Syst., Man, Cybern., vol. SMC-13, pp ,
[12] P. Marbach and J. N. Tsitsiklis, "Simulation-based optimization of Markov reward processes," IEEE Trans. Autom. Contr., vol. 42, no. 2, pp , Feb
[13] E. Altman, Constrained Markov Decision Processes: Stochastic Modeling. London, U.K.: Chapman and Hall,
[14] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York: Wiley,
[15] D. P. Bertsekas, Dynamic Programming and Optimal Control. Belmont, MA: Athena Scientific, 1996, vol. 2.
[16] S. T. Chung and A. J. Goldsmith, "Degrees of freedom in adaptive modulation: A unified view," IEEE Trans. Commun., vol. 49, no. 9, pp , Sep
[17] D. M. Topkis, Supermodularity and Complementarity. Princeton, NJ: Princeton Univ. Press,
[18] E. Altman, B. Gaujal, and A. Hordijk, Discrete-Event Control of Stochastic Networks: Multimodularity and Regularity. Berlin, Germany: Springer-Verlag,
[19] D. P. Bertsekas, Nonlinear Programming, 2nd ed. Belmont, MA: Athena Scientific,
[20] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press,
[21] D. G. Luenberger, Optimization by Vector Space Methods. New York: Wiley,
[22] H. Kushner and D. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems. New York: Springer-Verlag,
[23] H. Kushner and G. Yin, Stochastic Approximation Algorithms and Applications, 1st ed. New York: Springer-Verlag,
[24] F. J. Beutler and K. W. Ross, "Optimal policies for controlled Markov chains with a constraint," J. Math. Anal. Appl., vol. 112, pp ,
[25] F. Vazquez Abad and V. Krishnamurthy, "Constrained stochastic approximation algorithms for adaptive control of constrained Markov decision processes," in Proc. 42nd IEEE Conf. Decision Contr., 2003, pp
[26] V. Krishnamurthy, F. Vazquez Abad, and K. Martin, "Implementation of gradient estimation to a constrained Markov decision problem," in Proc. 42nd IEEE Conf. Decision Contr., 2003, pp

Dejan V. Djonin received the B.Sc. and M.Sc. degrees from the University of Belgrade, Belgrade, Serbia, in 1996 and 1999, respectively, and the Ph.D. degree from the University of Victoria, Victoria, BC, Canada, in From 1998 to 2000, he was with the Department of Telecommunications, Institute Mihajlo Pupin, Belgrade, and worked on the development of an antenna array processing system. In 2005 and 2006, he held an NSERC Postdoctoral Fellowship at the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC. His research interests span applications of control theory in multimedia communication systems, wireless communications, information theory, and machine learning. He is currently with Dyaptive, Inc., Vancouver, BC, Canada. Dr. Djonin was awarded the Outstanding Paper Award for Young Researchers at the International Symposium on Information Theory and its Application (ISITA) Conference. He has served as an Assistant Program Chair of the Wireless Communications and Networking Conference (WCNC) 2004 and as a Technical Program Committee Member of the ICC 2005 and IEEE Globecom 2003 Conferences.

Vikram Krishnamurthy (S'90-M'91-SM'99-F'05) was born in He received the B.S. degree from the University of Auckland, New Zealand, in 1988, and the Ph.D. degree from the Australian National University, Canberra, Australia, in Since 2002, he has been a Professor and a Canada Research Chair at the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC, Canada. Prior to this, he was a Chaired Professor at the Department of Electrical and Electronic Engineering, University of Melbourne, Melbourne, Australia. His research interests span several areas including ion channels and nanobiology, stochastic optimization, scheduling and control, statistical signal processing, and wireless telecommunications. Dr. Krishnamurthy has served as Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING, IEEE TRANSACTIONS ON AEROSPACE AND ELECTRONIC SYSTEMS, IEEE TRANSACTIONS ON NANOBIOSCIENCE, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, Systems and Control Letters, and European Journal of Applied Signal Processing. He was Guest Editor of a Special Issue of IEEE TRANSACTIONS ON NANOBIOSCIENCE in March 2005 on bio-nanotubes.
