A Flexible VLSI Architecture for Extracting Diversity and Spatial Multiplexing Gains in MIMO Channels


Chia-Hsiang Yang
University of California, Los Angeles

Challenges:
1. A unified solution to span the entire diversity-multiplexing tradeoff curve
2. Tradeoff between two search methods
   - Depth-first: ML performance with variable throughput and long latency
   - K-best: near-ML performance with constant throughput and short latency
3. Antenna array size beyond 4×4
   - Area increases quadratically with the number of transmit antennas
   - Critical path increases linearly with the number of transmit antennas
4. Modulations beyond 16-QAM
   - Hardware increases quickly with the constellation size
   - Longer latency introduced by the minimum search circuit
5. Multiple sub-carriers

Research Contributions:
1. A unified sphere decoder architecture for extracting diversity and spatial multiplexing gains in MIMO channels
2. Signal processing techniques to support antenna sizes up to 16×16
   - Folding: hardware area increases linearly with antenna array size
   - Loop retiming: reduces the critical path
   - Data interleaving: supports multiple independent sub-carriers
   - A region partition enumeration method for constellations up to 64-QAM
3. A flexible architecture
   - Antenna array: 2×2 to 16×16
   - Modulations: BPSK to 64-QAM
   - Number of sub-carriers: 16 to 128
   - Search method: K-best or depth-first search
4. A simplified multiplier
   - Numerical strength reduction
   - Gray coding to reduce number of operations
5. A multi-core architecture for enhanced performance

Abstract

The sphere decoding algorithm is widely used in MIMO communications because it approaches maximum likelihood detection with significantly reduced computational complexity. This makes it attractive for hardware implementation; however, prior work focused only on solutions with a fixed number of antennas or fixed modulations. This work presents a unified sphere decoder architecture that exploits the diversity-multiplexing tradeoff in MIMO channels by taking advantage of flexibility in the number of antennas and the modulation scheme. Several signal processing and circuit techniques are combined to reduce the hardware complexity: a 20-fold area reduction is achieved compared to a direct-mapped architecture, even without interleaving of sub-carriers. The proposed flexible architecture supports antenna arrays from 2×2 to 16×16 and modulations from BPSK to 64-QAM, over 16 to 128 sub-carriers. The peak estimated data rate exceeds 1.5 Gbps (ideal throughput) using a 16 MHz bandwidth, in just 0.55 mm² in a standard 90 nm CMOS process.

I. INTRODUCTION

Multi-input multi-output (MIMO) communication has recently received significant attention due to its potential to increase link robustness and channel capacity. Hardware realization of MIMO signal processing algorithms is quite challenging because it requires multi-dimensional, matrix-based computations. However, with the growing demand for higher data transmission rates over wireless links, the need for devices equipped with multiple antennas increases. Among the various MIMO algorithms, sphere decoding is one of the most promising solutions. It approaches the information-theoretic bound set by maximum likelihood (ML) detection with several orders of magnitude lower computational complexity [1] [2].
This means that, for a given hardware cost, the reduced complexity can be used to increase the size of the antenna array and effectively improve performance beyond the ML performance of a system with a smaller array. The complexity reduction is achieved by transforming the exhaustive search of the ML decoder into the tree search procedure of sphere decoding. Tree search is also popular in other communications areas such as multi-user detection (MUD) for CDMA systems, block-based demodulation, and linear block error control code decoding [3]. Other potential applications include speech recognition, data compression, protein sequence exploration, and neural signal detection.

The sphere decoding algorithm is a multi-dimensional signal processing task dealing with vector and matrix arithmetic. The required computation involves hundreds of add and multiply operations, and may also need divide and trigonometric functions. Such high complexity limits system specifications such as antenna array size and modulations. In addition, prior work focused only on solutions with a fixed number of antennas or fixed modulations [16][17][19][21][22][24].

In this work, we evaluate the architectures proposed in prior work and advance the state of the art in multi-dimensional, matrix-based signal processing hardware. A number of signal processing techniques [23] are considered jointly with the technology parameters to greatly reduce hardware area (cost) and power while maximizing performance. This work develops an architecture that further simplifies sphere decoding implementation by jointly considering tradeoffs at the algorithm, architecture, and circuit layers of abstraction, with the goal of minimizing chip power and area. At the same time, additional degrees of freedom are considered in the design in order to take full advantage of the diversity and spatial multiplexing gains available in MIMO wireless channels [5]. Tuning over a range of diversity-multiplexing points is possible by varying, for example, the antenna array size and modulation scheme. Flexibility and scalability are thus key additional requirements in the design of multi-mode, multi-standard systems. Our work also uses the Matlab/Simulink framework to improve design productivity in mapping DSP algorithms onto silicon. The BEE2 platform [38] is used to verify system functionality before entering physical ASIC design.

This proposal is organized as follows. Section II reviews the fundamental diversity-multiplexing tradeoff in MIMO communications and describes the sphere decoding algorithm. Several signal processing techniques, evaluated in the power-area-performance space, and architecture details are presented in Section III. Section IV describes the Simulink design environment and the BEE2 emulation platform. Conclusions are summed up in Section V. Finally, Section VI proposes future work and the timeline.

II. ALGORITHM SPACE EXPLORATION

A MIMO system can improve the reliability of a wireless link through increased diversity, or improve the channel capacity through spatial multiplexing. Diversity gain and spatial multiplexing gain are related to system coverage range and data rate, respectively. Both gains can be improved using a larger antenna array. However, for a given MIMO system, there is a fundamental tradeoff between these two gains [4] [5]. In the diversity-multiplexing space, repetition codes, the Alamouti code, and space-time codes use data redundancy to increase diversity at the price of losing spatial multiplexing gain. In contrast, the Bell Labs Layered Space Time (BLAST) algorithm, Singular Value Decomposition (SVD), and QR decomposition allocate data streams

in different eigen-modes to maximize spatial multiplexing gain while sacrificing diversity gain, as shown in Fig. 1. Sphere decoding is a decoding scheme that can extract both diversity and multiplexing gains. With flexibility in coding and modulation, a sphere decoder can effectively explore the entire tradeoff curve, as shown in Fig. 1. The original data type for sphere decoding is uncoded data. By manipulating the input data, sphere decoding is capable of decoding space-time block codes (STBC), which improves the error probability and increases diversity gain. The data rate can be maximized by transmitting different modulations over different MIMO substreams to increase spatial multiplexing gain. Also, with proper preprocessing, the decoding process starts by decoding the symbols with the highest SNR first, and then cancels the effect of the decoded symbols on the remaining symbols until the final symbol is decoded. This decoding sequence is equivalent to that in BLAST [41]. A unified sphere decoder model is illustrated in the following section.

Fig. 1. Diversity-multiplexing tradeoff in MIMO communications. (Axes: diversity gain (range) versus spatial multiplexing gain (rate). Schemes shown: repetition, Alamouti, space-time, sphere decoding, BLAST, SVD, QR; the curve shifts with array size.)

A. Sphere Decoding Algorithm

Consider a multiple antenna system with M transmit antennas and N receive antennas. The received vector y can be represented by

$y = Hs + n$   (1)

where y is an N×1 vector of received symbols, and H denotes an N×M channel matrix whose elements are i.i.d. complex Gaussian with zero mean and unit variance. Vectors s and n (M×1 and N×1, respectively) represent the transmitted symbols and zero-mean, circularly symmetric white Gaussian noise, respectively. The transmitted vector $s \in \mathcal{Q}$ with the smallest Euclidean distance is selected as the ML estimate in (2). The channel matrix can be decomposed further using QR factorization; the equivalent ML estimate thus can be written as

$\hat{s} = \arg\min \|y - Hs\|^2 = \arg\min \|\hat{y} - Rs\|^2$ with $\hat{y} = Q^H y = R s_{ZF}$   (2)

where Q is a unitary matrix, R is an upper triangular matrix, and $s_{ZF} = (H^H H)^{-1} H^H y$ is the zero-forcing (unconstrained ML) estimate. The signal model is presented in Fig. 2.

Fig. 2. Signal model of sphere decoding algorithm. (Blocks: TX, channel H = QR with additive noise n, RX multiplication by $Q^H$, and the sphere decoder computing $\hat{s} = \arg\min \|\hat{y} - Rs\|^2$.)

The most commonly used methods for QR decomposition are Gram-Schmidt decomposition, Householder transformation, and Givens rotations [7]. Several modifications, such as division-free or square-root- and division-reduced methods, have been proposed to simplify the operations in the original algorithm [45] [46]. For hardware realization, [8] proposed an algorithm suitable for fixed-point implementation and [9] proposed a CORDIC-based triangular systolic array architecture to reduce latency. Under the assumption of a block fading channel, QR decomposition is computed at the packet rate.

Using the upper triangular nature of R, symbol decoding begins from the last row and proceeds in several steps. The decoded symbols are used in successive decoding steps until all symbols are decoded. This decoding algorithm can be mapped to finding the shortest path (with minimum Euclidean distance) in a tree topology: each possible constellation point denotes one node, and each row of the R matrix is mapped to one level of the tree, whose edges are weighted by channel coefficients. The whole solution space of this tree is equivalent to an exhaustive search in the trellis diagram of the original problem; the total number of combinations of transmitted symbols is $Q^M$, where Q here denotes the constellation size. By properly choosing a search radius and a search method, the ML solution can be approached by visiting only nodes within a hyper-sphere, rather than performing an exhaustive search. This complexity reduction is feasible because the Euclidean distance is a cumulative sum of squared terms. This means that, for each node, if its Euclidean distance is larger than the search radius, all branches below it are outside the search radius as well. The conceptual view of the sphere decoding algorithm is illustrated in Fig. 3. Tree pruning lets sphere decoding achieve ML performance with polynomial complexity (highlighted nodes in Fig. 3) rather than exponential complexity (all nodes in Fig. 3) [1].

Fig. 3. Concept of sphere decoding. Unlikely nodes and branches are indicated with gray shade. (Tree levels correspond to antennas ant-M down to ant-1; the branching factor is the constellation size; the search radius bounds the visited nodes.)

B. Performance Improvements

Several simple yet effective methods, such as detection ordering, candidate enumeration, and search radius setting, are applied to improve error performance and/or reduce the complexity of the basic sphere decoding [3]. For instance, the sphere decoding algorithm for a QAM system results in over $10^5$ times reduction in computational complexity compared to exhaustive search [10].

1) Detection Ordering: The idea behind detection ordering is to detect symbols with the largest SNR first: to avoid discarding the ML solution, the first decoded symbols should be the most reliable. Various ordering algorithms have been proposed for the preprocessing stage: V-BLAST-ZF ordering, V-BLAST-MMSE ordering, and norm ordering [3] [25]. Assuming a packet-based wireless communication system, the ordering only needs to be performed once at the beginning of each received frame.

2) Candidate Enumeration: Detection ordering is applied across levels in the tree topology. Within each level, the order of constellation point enumeration is another important factor in improving search speed. Schnorr-Euchner (SE) enumeration suggests traversing the constellation candidates in ascending order of the cumulative distance increment [2].
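The tree search with radius shrinking described in Section II-A can be sketched in a few lines of Python. This is a minimal real-valued illustration, not the architecture proposed in this work: the function names `sphere_decode` and `brute_force_ml`, the real-valued signal model, and the use of NumPy's QR routine are assumptions made for the example. Candidates at each level are visited closest-first, and the radius shrinks whenever a leaf improves on the best distance found so far.

```python
import itertools
import numpy as np

def sphere_decode(y, H, symbols):
    """Depth-first tree search with radius shrinking (real-valued sketch)."""
    M = H.shape[1]
    Q, R = np.linalg.qr(H)            # H = QR, R upper triangular
    y_hat = Q.T @ y                   # rotated receive vector
    best = {"dist": np.inf, "s": None}
    s = np.zeros(M)

    def search(level, dist):
        # b: received value with the already-decided symbols cancelled
        b = y_hat[level] - R[level, level + 1:] @ s[level + 1:]
        # closest-first candidate order at this level
        for cand in sorted(symbols, key=lambda c: abs(b - R[level, level] * c)):
            d = dist + (b - R[level, level] * cand) ** 2
            if d >= best["dist"]:
                break                 # later candidates are farther: prune
            s[level] = cand
            if level == 0:
                best["dist"], best["s"] = d, s.copy()   # shrink the radius
            else:
                search(level - 1, d)

    search(M - 1, 0.0)                # start from the last row of R
    return best["s"], best["dist"]

def brute_force_ml(y, H, symbols):
    """Exhaustive ML search over all candidate vectors, for comparison."""
    M = H.shape[1]
    return np.array(min(itertools.product(symbols, repeat=M),
                        key=lambda c: np.sum((y - H @ np.array(c)) ** 2)))
```

For a square channel matrix, the two functions return the same vector; the sphere decoder reaches it without expanding branches whose partial distance already exceeds the current radius.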
Therefore, the first candidate $s_i$ for each row is the one with the minimum distance between $b_i$ and $R_{ii} Q_i$ in (3). Finding a good admissible solution early means that we can shrink our initial radius early.

$\hat{y}_i - \sum_{j=i}^{M} R_{ij} s_j = b_i - R_{ii} s_i$ with $b_i = \hat{y}_i - \sum_{j=i+1}^{M} R_{ij} s_j$   (3)

3) Search Radius Setting: One major feature of sphere decoding is radius shrinking. Once a solution is found with a smaller Euclidean distance, the search radius is updated to this value so that more unlikely branches can be pruned. However, the initial choice of search radius is not easy, because it influences the complexity of the algorithm. When the search radius is too large, a very large number of nodes falls inside the sphere, which causes high detection complexity. Conversely, when the search radius is too small, the sphere may be empty and no solution is available. Based on the AWGN model, the sum of squared noise terms is central chi-square distributed with 2M degrees of freedom [11] [47]. Given the channel SNR, the search radius can be decided by solving the probability density function (pdf) for a chosen confidence interval. If the channel SNR is unknown, the Euclidean distance of the zero-forcing solution can be used as an initial guess. An algorithm with increasing search radius has also been proposed, which starts the search with a strict search radius and expands the radius if no solution is available within it [12] [48].

C. Tradeoff in Diversity-Multiplexing Space

A unified sphere decoder architecture is illustrated here for extracting diversity and spatial multiplexing gains along the tradeoff curve. We demonstrate that flexibility in antenna array size and in modulation are the key features for this purpose.

Antenna array size provides added flexibility to shift the tradeoff curve in the diversity-multiplexing space. In order to maximize diversity gain, we have to supply the receiver with multiple independently faded replicas of the same symbol, so that the error probability is reduced [13] [14]. The data replicas can be sent in space and/or time directions.
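For the search radius setting of Section II-B, the confidence-interval rule can be written out as a worked equation. This is a sketch under stated assumptions rather than a formula quoted from [11] or [47]: each noise element is assumed circularly symmetric complex Gaussian with variance $\sigma^2$, and $\epsilon$ is the tolerated probability that the transmitted vector falls outside the initial sphere. Both $\sigma^2$ and $\epsilon$ are symbols introduced here for illustration.

```latex
\[
\frac{2}{\sigma^{2}}\,\lVert n \rVert^{2} \sim \chi^{2}_{2M},
\qquad
\Pr\!\left(\lVert n \rVert^{2} \le r^{2}\right) = 1-\epsilon
\;\Longrightarrow\;
r^{2} = \frac{\sigma^{2}}{2}\,F^{-1}_{\chi^{2}_{2M}}(1-\epsilon),
\]
```

where $F^{-1}_{\chi^{2}_{2M}}$ denotes the inverse cumulative distribution function of a central chi-square variable with 2M degrees of freedom. Under the same assumptions the mean of $\lVert n \rVert^{2}$ is $M\sigma^{2}$, so a small $\epsilon$ places $r^{2}$ somewhat above that value.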
Since a unified signal model can be developed for these space-time (ST) coding schemes, the same sphere decoder architecture can be used with some data rearrangement. Sphere decoding supporting algebraic ST codes [48], linear dispersion codes [49], and space-time block codes (STBC) [15] was reported in prior work. The ML estimate can be written as

$\hat{s} = \arg\min \|y - Bs\|^2$   (4)

where the matrix B depends on the code generators and the channel matrix. By interpreting B as H in the original signal model, the sphere decoding algorithm can be applied. Since the matrix dimension changes due to the data rearrangement in the preprocessing stage, the equivalent antenna array size changes accordingly. For example, repetition coding by 2 in the space domain for an 8×8 system is transformed into data processing in a 4×4 system (only one half of the symbols need to be decoded). This requirement reinforces the need for flexibility in antenna array size.

Spatial multiplexing gain is characterized by data rate. To maximize spatial multiplexing gain, we should allow the data rate to scale with the SNR, or assign different data rates to different substreams for a fixed SNR [5][15]. To this end, the modulation scheme should adapt to the channel condition: a larger constellation is applied to substreams with higher SNR, and a smaller constellation is applied to substreams with lower SNR. In principle, this transmission strategy amounts to water-filling in the space domain. The system performance perspective, therefore, further motivates the need for flexibility in modulation schemes.

III. ARCHITECTURE SPACE EXPLORATION

The optimal architecture is decided by jointly considering tradeoffs at the algorithm, architecture, and circuit layers of abstraction, with the goal of minimizing chip power and area. As shown in Fig. 4, a layered design approach is adopted to merge algorithm and circuit decisions. An efficient multiplier is proposed to reduce area and delay at the same time. Saving area directly translates to power reduction, since the power spent in charging and discharging parasitic capacitances is also reduced. At the processing element (PE) architecture level, we evaluate the existing architectures [16][17][19][21][22][24] and propose a solution with improved area and throughput. Unlike prior work, flexibility is also considered in the design stage. Antenna size, modulation scheme, number of sub-carriers, and search method are designed with flexibility and scalability to cover multiple communication scenarios. A multi-core architecture which consists of many PEs ("small cores") is developed to support the tradeoff between range and data rate at the system architecture level. We finally summarize the flexibility, scalability, and system specification.
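The repetition-coding example of Section II-C can be made concrete with a short NumPy sketch. The generator matrix `G` below (each symbol repeated on two adjacent transmit antennas) is a hypothetical choice for illustration; the source does not specify the code generator. The point is only that the equivalent matrix B = HG has four columns, so its triangular factor R is 4×4 and the tree search has four levels.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, rep = 8, 8, 2                      # 8x8 array, repetition by 2 in space

# Hypothetical generator: each symbol is sent on `rep` adjacent antennas.
G = np.kron(np.eye(M // rep), np.ones((rep, 1)))        # shape (8, 4)

# i.i.d. complex Gaussian channel with unit variance, as in the signal model
H = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
B = H @ G                                # equivalent channel, shape (8, 4)

s = rng.choice([-1.0, 1.0], size=M // rep)              # 4 BPSK symbols
y = H @ (G @ s)                          # noiseless signal from all 8 antennas

Q, R = np.linalg.qr(B)                   # R is 4x4: a 4-level tree search
print(B.shape, R.shape)
```

Because `y` equals `B @ s`, the decoder only needs to search over the four underlying symbols, which is the sense in which the 8×8 problem is processed as a 4×4 one.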

Fig. 4. Illustration of layered design approach. (Layers shown: system architecture, PE architecture, metric calculation, multiplier.)

A. Numerical Strength Reduction

From an algorithm perspective, the complexity of sphere decoding is evaluated by the number of nodes visited in the tree search process. When considered for hardware implementation, decoding algorithms are generally compared in terms of the number of multiplications. Down at the circuit level, the size of the multipliers is the key factor in estimating the area, speed, and power of the sphere decoder. We start by simplifying the multiply operation to reduce hardware complexity. The multiplication is required to calculate the Euclidean distance, which is mathematically represented by two equivalent forms, Eqs. (5) and (6).

$\hat{s}_{ML} = \arg\min \|R(s - s_{ZF})\|^2$   (5)

$\phantom{\hat{s}_{ML}} = \arg\min \|Q^H y - Rs\|^2$   (6)

Seemingly, the number of multiplications in Eq. (5) is less than in Eq. (6): one multiplication for Eq. (5) and two multiplications for Eq. (6). Hence, Eq. (5) was most commonly used in prior work [16]-[21] as a baseline for implementation. However, a careful investigation shows that Eq. (6) is a better choice from a hardware perspective for at least two reasons. First, we observe that $s_{ZF}$ and $Q^H y$ can be pre-computed and, hence, have negligible impact on the total number of operations. Also, the computational effort of $s_{ZF}$ is not less than that of $Q^H y$. Second, the wordlength of s is usually much shorter than that of $s_{ZF}$. Separating terms as in Eq. (6) results in multipliers with reduced wordlength. Without loss of generality, the normalized size of a multiplier can be estimated by the product of the wordlengths of the multiplier and multiplicand. The normalized delay of a multiplier can be estimated by the sum of the wordlengths of the multiplier and multiplicand if an array multiplier is used [39]. The array multiplier approximation works well for first-order comparison purposes. Table I summarizes the relative area and delay reduction of a multiplier due to numerical strength reduction in a 64-QAM system, where the wordlength (WL) of s is 3 for a real multiplier. We see that the area reduction is at least 50%, and that the delay reduction also reaches 50% for large-wordlength inputs. The absolute area difference between these two types of multipliers is amplified by the total number of multiplications in the entire decoding process, which is approximately $O(M^3)$.

TABLE I
AREA AND DELAY REDUCTION DUE TO NUMERICAL STRENGTH REDUCTION
WL of s_ZF
WL of R = 12, area/delay: 0.5/ / / /0.63
WL of R = 16, area/delay: 0.5/ / / /0.54

The multiplier can be simplified further by taking advantage of some characteristics of communication signal processing: Gray coding and quantization effects. Gray code is a more compact representation in the constellation plane, since only odd numbers are used. Conventionally, the number is transformed to 2's complement representation for arithmetic operations. On careful examination of the Gray code representation, the corresponding multiplication can be implemented by simple shift, add, and invert operations. The code mapping, the associated operations, and the simplified multiplier are shown in Fig. 5. The neg operator stands for bit inversion. The 1-bit carry-in of 2's complement negation can be absorbed as a carry-in (shaded in gray) in the following adders, or simply be discarded as a quantization error on the LSB, which can be recovered by wordlength optimization. The shifter has no direct area cost apart from routing, while the cost of inverters and multiplexers is relatively low because they are simple operations. Overall, it is possible to implement one complex multiplier with 6 adders plus inverters and multiplexers, resulting in a total 40% area reduction compared to the traditional approach (area is estimated by Synopsys Design Compiler). This implementation does not imply that we have to force the use of Gray coding in the constellation plane; Gray coding is only used inside the sphere decoder to simplify metric calculation and candidate enumeration. The decoded symbols can be converted into any arithmetic representation at the sphere decoder outputs.
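The idea of replacing a general multiplier by shift, add, and invert operations can be illustrated for the odd per-axis values of 64-QAM. The sketch below is one possible decomposition chosen for illustration; it is an assumption and not necessarily the mapping implemented in Fig. 5, and the function name `times_odd` is hypothetical. Negation is modeled as bit inversion plus a carry-in of 1, mirroring the 2's complement discussion above.

```python
def times_odd(r: int, s: int) -> int:
    """Multiply integer r by s in {±1, ±3, ±5, ±7} using shifts and adds only."""
    mag = abs(s)
    if mag == 1:
        p = r
    elif mag == 3:
        p = (r << 1) + r          # 2r + r
    elif mag == 5:
        p = (r << 2) + r          # 4r + r
    elif mag == 7:
        p = (r << 3) - r          # 8r - r
    else:
        raise ValueError("s must be an odd value in [-7, 7]")
    # 2's complement negation: invert all bits, then add the carry-in
    return (~p + 1) if s < 0 else p

print(times_odd(11, -5))
```

In hardware, the added 1 is the carry-in that the text says can be absorbed by a following adder or dropped as an LSB quantization error; here it is kept so the result is exact.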

Fig. 5. Gray code representation and the simplified multiplier.

B. Architecture Tradeoff

In prior work, two major types of tree search methods are reported: depth-first (DF) [23] [24] and K-best [16]-[22]. The depth-first algorithm starts the search from the root of the tree and explores as far as possible along each branch until a leaf node is found, and then traces back. The K-best algorithm approximates a breadth-first search by keeping only the K branches with the smallest partial Euclidean distance (PED) at each level [26], which is similar to the M-algorithm in trellis decoding [27]. The major advantages of DF are that ML performance can be achieved, and that radius shrinking can be used for tree pruning. On the other hand, the advantages of K-best are its uniform data path and constant throughput.

Examining the details further, depth-first ensures ML performance only if the complete solution space is explored. This might not be feasible in practice, however, because of limited buffer size and processing cycles. This means that some termination scheme must be used, and thus ML performance is no longer guaranteed. Since the default input is uncoded data, achieving sub-optimal performance while keeping constant throughput is more important. Space-time codes or error correction codes can then be used to improve the performance. The iterative decoding scheme which combines a MIMO decoder and an error correction code decoder was proven to achieve near-capacity performance [2].

In hardware implementation, depth-first is realized in a folding-like architecture because only one node is visited at a time during the tree search process. In this case, an extra memory to record the visited nodes is required for the trace-back operation. K-best is realized in a multi-stage pipelined way, because no trace-back is needed. To process K data paths at the same time, a parallel architecture is applied. Figure 6 illustrates the basic architectures of these two search schemes, and Table II summarizes their comparison in terms of circuit metrics and algorithmic performance.
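The K-best search described above can be sketched as a level-by-level loop that keeps only the K partial paths with the smallest PED. As with any such sketch, the details are assumptions made for illustration: a real-valued signal model, NumPy's QR routine, and the hypothetical function name `k_best_decode`. There is no trace-back and no radius, which is what gives the scheme its regular data flow.

```python
import numpy as np

def k_best_decode(y, H, symbols, K):
    """Breadth-first tree search keeping the K smallest-PED paths per level."""
    M = H.shape[1]
    Q, R = np.linalg.qr(H)
    y_hat = Q.T @ y
    # each survivor: (PED, decided symbols for levels level+1 .. M-1)
    survivors = [(0.0, [])]
    for level in range(M - 1, -1, -1):
        children = []
        for ped, tail in survivors:
            # cancel the symbols already decided on this path
            b = y_hat[level] - R[level, level + 1:] @ np.array(tail)
            for cand in symbols:
                d = ped + (b - R[level, level] * cand) ** 2
                children.append((d, [cand] + tail))
        children.sort(key=lambda c: c[0])
        survivors = children[:K]          # fixed work per level
    ped, s = survivors[0]
    return np.array(s), ped
```

With K large enough to keep every path, the result coincides with the exhaustive ML search; a small K trades some error performance for a fixed amount of work per level.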

Fig. 6. Basic architecture of (a) depth-first (folding: a single PE) and (b) K-best (parallel and multi-stage: PE 1, PE 2, ..., PE M) algorithm.

TABLE II
COMPARISON OF DEPTH-FIRST AND K-BEST ALGORITHM

              Area    Throughput   Latency   Radius shrinking / tree pruning   Performance
Depth-first   Small   Variable     Long      Yes                               ML
K-best        Large   Constant     Short     No                                Near-ML

For a sphere decoder operating with a large antenna array, the biggest implementation challenge is reducing the area of the design. Using the number of (complex) multipliers as a first-order area estimate, the numbers of multipliers needed in the folding and multi-stage architectures are M and M(M+1)/2, respectively, where M is the number of transmit antennas. Expanding a 4×4 system to a 16×16 system, the relative area increases from 4 to 16 for the folding architecture and from 10 to 136 for the multi-stage architecture. The folding architecture is thus 2.5× to 8.5× more area efficient than the multi-stage architecture, as shown in Fig. 7 (a). To keep the area within a reasonable value, the folding technique is adopted.

The second design challenge is the operating frequency of the folded architecture. As the array size increases, the number of operands in the Multiply-Accumulate (MAC) operation in the metric calculation unit increases proportionally to the number of antennas. Assuming a tree adder design, the critical path delay increases roughly linearly with the number of transmit antennas. However, the time allowed to finish the MAC operation should be scaled down by the number of antennas in order to increase the throughput proportionally to the number of antennas. This timing requirement for a fixed bandwidth is shown in Fig. 7 (b). The situation is actually worse when metric enumeration is included in the loop. Since pipelining inside the loop is considered a difficult task, this architecture cannot operate at a high frequency even for a 4×4 system [24].
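The first-order multiplier counts quoted above follow directly from the two formulas. The short check below simply evaluates M and M(M+1)/2 for the two array sizes mentioned in the text; the function name `multiplier_counts` is introduced here for illustration only.

```python
def multiplier_counts(M: int) -> tuple[int, int]:
    """First-order area estimate: (folding, multi-stage) complex multipliers."""
    folding = M
    multi_stage = M * (M + 1) // 2
    return folding, multi_stage

for M in (4, 16):
    folding, multi_stage = multiplier_counts(M)
    print(f"{M}x{M}: folding={folding}, multi-stage={multi_stage}, "
          f"ratio={multi_stage / folding:.1f}")
# 4x4: folding=4, multi-stage=10, ratio=2.5
# 16x16: folding=16, multi-stage=136, ratio=8.5
```

The ratio equals (M+1)/2, so the area advantage of folding grows linearly with the number of transmit antennas.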
To facilitate pipeline insertion, inputs are up-sampled by a factor m, and then one register can be replaced with m pipeline registers in the loop using the Noble identity [42]. In this case, only one out of m samples is useful data, and the rest are either repeated values of the original sample or padding zeros. By applying data-stream interleaving, samples of other independent data streams can be introduced in the loop in place of the repeated values or padding zeros. Interleaving is therefore used to improve area efficiency through logic sharing, and to provide the flexibility needed to support a varying number of data sub-carriers. In a multi-carrier communication system, data streams are transmitted over narrow-band sub-carriers [28].

Fig. 7. Design challenge and tradeoff for large antenna size. Impact of antenna array size (4×4, 8×8, 16×16) on (a) area: area reduction using the folding technique relative to multi-stage, and (b) critical path delay: growing timing gap between the critical path in the loop and the timing requirement in the folding architecture.

C. PE structure

The function of the PE is to find the $s_i$ with minimum $T_i$ ($T_i = \|b_i - R_{ii} s_i\|^2$) for each level in the tree search, and to provide a candidate list ordered by increasing $T_i$, since a path with smaller $T_i$ has a higher probability of being the ML estimate. A straightforward algorithm mapping is to enumerate all possible constellation points and sort the $T_i$ to find the $s_i$ and the candidate list [24]. The hardware cost and computational latency of this architecture are very high for a large constellation size, due to the circuit parallelism and the inevitable latency of the sorting circuit. To resolve this problem, we propose another strategy. First, the closest point is found through the geometric relationship, since the $s_i$ with minimum $T_i$ corresponds to the closest point between $b_i$ and $R_{ii} Q_i$. The second step is to use the selected $s_i$ to calculate $T_i$. Finally, the candidate list is generated from the constellation arrangement, as described in Section III-C-2, Fig. 12.

We decompose the PE into two parts: the Metric Calculation Unit (MCU) and the Metric Enumeration Unit (MEU).
Each submodule can be mapped to the Area-Energy-Delay space to explore optimal design parameters for the top-level integration. Decomposing a design problem along these three axes provides insight into design techniques and their impact on power, area, and throughput.

Concurrency versus latency is one of the basic tradeoffs that need to be considered. Maximizing data throughput calls for a parallel architecture, which results in a large area. Conversely, time-multiplexing improves area efficiency, but increases latency. For example, the decoding algorithm operating on complex numbers can be transformed into a real-valued problem, which results in a tree that is twice as deep as the original tree with a smaller number of children per node [16]. Since the multipliers are reused, the number of multipliers is reduced to one half at the cost of an equal throughput reduction.

Flexibility is another issue in circuit design. Ideally, the circuit should be flexible enough to support different search schemes (depth-first or K-best). In general, the overhead of flexibility reduces both energy efficiency and area efficiency. This overhead should be minimized while maintaining system performance.

Fig. 8 shows the circuit diagram of one PE. There are m-stage pipeline registers inserted in the loop, so the critical path can be shortened to meet the timing constraint by choosing a larger m. Since m data streams are interleaved into the PE, the hardware is always active, achieving the maximum throughput as if the m pipeline registers had been introduced without the loop. The area overhead of the up-samplers for R can be removed if R is invariant for each sub-carrier during one packet transmission. The flexibility of the search scheme is provided by the shift-register chain, which can be configured for forward trace or backward trace. By placing K PEs in one sphere decoder, K search paths are explored at the same time to implement the K-best algorithm, while each PE has the flexibility to trace back as in depth-first search. The flexibility to support varying antenna size is provided by the folding architecture, which reuses the same hardware with a higher clock frequency as the antenna size increases. The details of the sub-modules are illustrated in the following.

Fig. 8. Circuit diagram of one PE. (MCU: shift-register chain, partial products, adder tree, m pipeline stages; MEU: symbol selection and computation of $T_i$ from $b_i$ and $R_{ii}$.)

1) Metric Calculation: The Metric Calculation Unit (MCU) computes $\sum_{j=i+1}^{M} R_{ij} s_j$. Essentially, it executes a multiply-accumulate (MAC) operation. To accumulate the maximum of 16 operands at the highest throughput, 15 simplified complex multipliers and 1 simplified real multiplier are followed by an adder tree that merges the partial products. It is possible to reduce the number of multipliers through time-multiplexing at the price of lower throughput [30]. For example, 4 complex multipliers can be time-multiplexed by a factor of 4 to emulate 16 multipliers, with throughput reduced by a factor of 4. Such tuning at the architecture level is used to position the design along the throughput and power axes, together with optimal tuning of variables such as the supply voltage.

Since the search process advances one stage per clock cycle, we propose an FIR-like architecture to facilitate metric calculation, as shown in Fig. 9. If only forward trace is allowed, the BER performance is limited by the number of parallel processors, as in the K-best algorithm. Even if more processing cycles are provided, the BER performance of the K-best algorithm cannot improve. By observing that a trace-back moves up by only one or two layers rather than jumping to an arbitrary layer, we embed a bidirectional shift-register chain to adjust the search depth. Since the search state is recorded in the shift registers, no extra memory, such as stack memory, is needed to keep all the states [26] [40]. Due to the trace-back requirement, the transposed-form FIR architecture is not suitable for reducing the critical path; instead, the critical path is reduced by data interleaving.

[Figure: FIR-like MCU. Symbols $s_{i+1}, s_{i+2}, \dots, s_M$ held in the shift-register chain are multiplied by the coefficients $R_{i,i+1}, R_{i,i+2}, \dots, R_{i,M}$ and merged by an adder tree.]
Fig. 9. Circuit diagram of MCU.

Coefficients of the R matrix are stored in memory in an area-efficient way. The diagonal terms of R are real, while the remaining terms are complex. Exploiting the upper-triangular structure, the real parts (a triangle including the diagonal) and the imaginary parts (a strict triangle) are organized into a single square memory, which saves around 50% of the area.
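The storage scheme and the MAC can be sketched together in software. The layout below is an assumption made for illustration (real parts in the upper triangle including the diagonal, imaginary parts mirrored into the strict lower triangle); the paper does not spell out the exact address mapping, and the names `pack_r`, `read_r`, and `interference` are hypothetical. The point is only that M×M real words suffice, instead of 2·M×M for a full complex matrix.

```python
def pack_r(R):
    """Pack an upper-triangular complex R (real diagonal) into an M x M real array."""
    M = len(R)
    mem = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(i, M):
            mem[i][j] = R[i][j].real        # real parts: upper triangle + diagonal
            if j > i:
                mem[j][i] = R[i][j].imag    # imaginary parts: strict lower triangle
    return mem


def read_r(mem, i, j):
    """Recover R[i][j] from the packed memory (zero below the diagonal)."""
    if j < i:
        return 0j
    if j == i:
        return complex(mem[i][i], 0.0)
    return complex(mem[i][j], mem[j][i])


def interference(mem, s, i):
    """MAC performed by the MCU: sum over j > i of R[i][j] * s[j]."""
    return sum(read_r(mem, i, j) * s[j] for j in range(i + 1, len(s)))


if __name__ == "__main__":
    R = [[2 + 0j, 1 + 1j, 0.5 - 2j],
         [0j, 3 + 0j, -1 + 0.25j],
         [0j, 0j, 1.5 + 0j]]
    mem = pack_r(R)
    print(interference(mem, [1 + 1j, 1 - 1j, -1 + 1j], 0))
```

With this layout every memory word is used, which is where the roughly 50% saving over separate full-size real and imaginary arrays comes from.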

2) Metric Enumeration: The Metric Enumeration Unit (MEU) enumerates the possible constellation points in ascending order of their Partial Euclidean Distance (PED), $T_i = \sum_{j=i}^{M} |b_j - R_{jj} s_j|^2$. Exhaustive search is a straightforward implementation: it calculates the PEDs of all constellation points and uses a sorting circuit to find the minimum one, as shown in Fig. 10 (a). The number of distance calculation units is proportional to the constellation size (64 units are required for 64-QAM, for example). This requirement alone makes it hard to support a large constellation size, in addition to the extra latency introduced by the minimum search circuit.

In the constellation plane, metric enumeration is equivalent to ordering the scaled constellation points $R_{ii} Q_i$ by their distance to $b_i$, from the closest to the farthest [2]. This is the underlying principle of the Schnorr-Euchner (SE) algorithm. SE enumeration was originally applied to one-dimensional signals, such as real-valued PAM or PSK constellations; it was therefore modified to arrange QAM constellations in $P_Q$ concentric groups to fit the original algorithm. For example, a 16-QAM constellation can be expressed as an arrangement of points on 3 concentric circles. The problem is then reformulated as finding the closest point in each subgroup and then the closest point across subgroups, as shown in Fig. 10 (b) [24].

[Figure: (a) exhaustive search: each scaled point $R_{ii} q_1, \dots, R_{ii} q_k$ is subtracted from $b_i$, squared, and passed to a min-search that outputs $\hat{s}_i$; (b) SE enumeration for QAM: PSK ALU 1 to PSK ALU $P_Q$ followed by a min-search that outputs $\hat{s}_i$; (c) region partition search: separate region decisions on the real and imaginary parts of $b_i$ against decision boundaries scaled by $R_{ii}$.]
Fig. 10. Closest point selection scheme: (a) exhaustive search architecture, (b) SE enumeration for QAM, (c) region partition based search approach. Real values are represented by gray lines.

The original algorithm [2] uses phase relationships to find the closest point on a concentric circle. This approach is not suitable for hardware implementation, so [24]

proposed a decision-boundary-based method to simplify the SE enumeration. One type of decision boundary is set by straight lines passing through the origin and the midpoint between two adjacent constellation points on a concentric circle, to specify the starting point. Another type of decision boundary is set by straight lines passing through the origin and the midpoint between two constellation points around the starting point on a concentric circle, to determine the initial search direction. However, this simplification is only applicable to small constellations such as 16-QAM. Larger constellation sizes are hard to support for several reasons. First, the decision boundary algorithm is quite complex: many multiplications are needed to generate the decision boundaries. Second, the number of subgroups grows quickly, which increases the latency of the min-search circuit. For example, 64-QAM is decomposed into 9 subgroups. Third, the concentric group partition is not scalable as the QAM constellation size changes, which makes it infeasible for the architecture to support different modulations.

We propose a simple partition method based on Cartesian coordinates. The constellation plane is partitioned into 64 regions for 64-QAM (8 regions in the real part and 8 regions in the imaginary part). The closest point (with minimum distance) can be decided from the location of $b_i / R_{ii}$, since the real part and the imaginary part can be decoded separately, as shown in Fig. 10 (c). In fact, this idea is also applied to symbol decision. For instance, to make a decision in a QPSK system, we do not need to calculate the distances from the received signal to the 4 constellation points. Instead, we only need to examine the signs of the real and imaginary parts.

The area complexity of the three architectures in Fig. 10 is evaluated using the number of add-equivalent operators (add, subtract, compare) as the area estimate. For a 64-QAM constellation, exhaustive search needs 64 subtractors, 64 square operators, and a min-search circuit. Assuming the square operators are simplified to absolute-value operators with a small performance loss [24] and that the min-search uses a serial comparison circuit, a total of 192 add-equivalent operators are needed. SE enumeration for 64-QAM needs 64 boundary decision comparisons and a min-search across 9 subgroups, so 73 add-equivalent operators are needed, assuming the boundaries are given. The proposed region partition search needs 8 comparators for the real part and 8 for the imaginary part, which is only 16 add-equivalent operators. Therefore, a 4.6× area reduction is achieved compared to SE enumeration for 64-QAM (73/16) and a 12× reduction compared to exhaustive search (192/16). A similar approach is applied to the delay comparison: the number of adder delays is

used as the delay estimation metric. Here, we assume the delay of the min-search circuit is equal to $\lceil \log_2 n \rceil$ adder delays, where n is the number of elements being compared. A serial comparison circuit would need n adder delays to finish the comparison, so a more area-consuming parallel architecture must be used to reach this delay. The delay of exhaustive search is approximated by the sum of the delays of 1 adder, 1 absolute-value operator, and $\log_2 64 = 6$ comparison levels, which equals 8. The delay of SE enumeration equals 1 operator plus $\lceil \log_2 9 \rceil = 4$ comparison levels, which equals 5. Our design needs only 1 comparator delay, which is 1/5 of the delay of SE enumeration without pipelining.

TABLE III. AREA AND DELAY COMPARISON. Columns: Exhaustive, SE enumeration, Our work. Rows: Area (normalized), Delay (normalized).

One challenge in the MEU implementation is that a divider or an inverse operator seems inevitable for calculating $b_i / R_{ii}$, which usually introduces longer latency and higher hardware complexity. The property that the diagonal element $R_{ii}$ of the R matrix is real simplifies the problem, but the division still introduces hardware overhead. One possible method is to calculate $R_{ii}^{-1}$ in the preprocessing stage, since these values are updated at the packet rate [16] [21]. If $R_{ii}^{-1}$ is given, only one multiplier is needed. In our approach, this inverse operation is not necessary. Instead of deciding $b_i R_{ii}^{-1}$ in the constellation plane, it is equivalent to decide $b_i$ in a constellation plane scaled by $R_{ii}$. With the decision boundaries (db) denoted as $db \in \{-6, -4, -2, 0, 2, 4, 6\}$, the calculation of $b_i R_{ii}^{-1}$ is replaced by the calculation of $db \cdot R_{ii}$. It may seem that replacing one multiplier with 6 multipliers, in order to execute the boundary comparisons in parallel, is not a good tradeoff from the area standpoint. However, a careful examination reveals that one large multiplier is replaced with small multipliers, and that these small multipliers can be simplified to shift-add operators. Therefore, only one adder is needed to implement $db \cdot R_{ii}$ ($6R_{ii} = 4R_{ii} + 2R_{ii}$); the other products can be implemented by hard-wired shifting and inversion. The negative values can be computed by bit inversion without the carry-in bit, because the missing carry appears as a negligible quantization error from the signal decision perspective. The area reduction is quite high. If the wordlength of $R_{ii}^{-1}$ is L, then the multiply operation with a large wordlength [$L \times WL(b_i)$] is replaced with an add operation that also has a smaller number of bits [$L + 3$]. The simple region decision circuit is shown in Fig. 11.
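The scaled-boundary decision can be sketched in a few lines of integer arithmetic. This is a minimal behavioural sketch under stated assumptions: $R_{ii}$ and $b_i$ are non-negative and signed fixed-point integers respectively, $R_{ii}$ is positive, the 64-QAM levels are the odd integers from -7 to 7, and the names `scaled_boundaries`, `region_index`, and `closest_symbol` are hypothetical. Bit inversion without carry-in is modelled by Python's `~x`, which equals `-x - 1`.

```python
LEVELS_64QAM = (-7, -5, -3, -1, 1, 3, 5, 7)   # per-dimension levels (assumed mapping)


def scaled_boundaries(r_ii):
    """Boundaries {-6,-4,-2,0,2,4,6} * R_ii using two shifts and a single adder."""
    two = r_ii << 1          # hard-wired shift
    four = r_ii << 2         # hard-wired shift
    six = four + two         # the only adder
    # Negative boundaries by bit inversion without carry-in: ~x == -x - 1,
    # i.e. one LSB of error, treated as quantization noise.
    return [~six, ~four, ~two, 0, two, four, six]


def region_index(b, r_ii):
    """Count how many boundaries b exceeds (thermometer code), giving 0..7."""
    return sum(b > t for t in scaled_boundaries(r_ii))


def closest_symbol(b_re, b_im, r_ii):
    """Decide real and imaginary parts separately; no divider, no sorting."""
    return (LEVELS_64QAM[region_index(b_re, r_ii)],
            LEVELS_64QAM[region_index(b_im, r_ii)])


if __name__ == "__main__":
    # b = (250, -730) with R_ii = 100 corresponds to b / R_ii = (2.5, -7.3).
    print(closest_symbol(250, -730, 100))
```

The seven comparisons are independent, so they map naturally onto parallel comparators, and the result is obtained without ever forming $b_i / R_{ii}$.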

[Figure: seven comparators test Re{$b_i$} or Im{$b_i$} against the boundaries $6R_{ii}$, $4R_{ii}$, $2R_{ii}$, 0, $-2R_{ii}$, $-4R_{ii}$, and $-6R_{ii}$; the comparator outputs, the sign of $R_{ii}$, and the constellation size drive a symbol remapping block that outputs s[2], s[1], s[0].]
Fig. 11. Region decision circuit.

An extra symbol remapping block is inserted at the end to remap constellation points when a different constellation size is used. Decision outputs are mapped to Gray code directly, without an intermediate 2's complement representation and Gray code transformation. Table IV shows the mapping rules. Although $R_{ii}$ can always be chosen to be positive to simplify this circuit further, we retain the flexibility of supporting negative values as well in order to relax the QR decomposition processing. With the proposed approach, no sorting is needed and it is easy to expand to a large constellation size. Additionally, the use of bit-level arithmetic results in only a linear complexity increase as the constellation size grows exponentially.

TABLE IV. SYMBOL REMAPPING AND DECISION. Column groups: 64-QAM, 16-QAM, QPSK/BPSK. Real and imaginary decision bits s[2], s[1], s[0] are remapped for 64-QAM (6-bit), 16-QAM (4-bit), QPSK (2-bit), and BPSK (1-bit).

After finding the closest point, the remaining candidates are also ordered by the distance between $b_i$ and the constellation points, in ascending order. The decoded symbol $\hat{s}_i$ is used to enumerate the remaining candidates through geometric relationships rather than sorting, in either trace-back or parallel search mode. The complexity of the

sphere decoding algorithm is independent of the lattice constellation size [48]; therefore, we can enumerate only the adjacent constellation points instead of the whole constellation plane. We extract 9 points in the constellation plane, as illustrated in Fig. 12. The eight surrounding constellation points have either a 1-bit error (Fig. 12 (a-b)) or 2-bit errors (Fig. 12 (c-d)) if Gray coding is used. The 2nd closest point for each solution set is decided based on the decision boundaries indicated by the dashed lines in Fig. 12 (a), (c). The remaining points are decided by the search direction, which is specified by other decision boundaries, starting from the 2nd point, as shown in Fig. 12 (b), (d). These two kinds of decision boundaries are easy to calculate by sign checks and comparisons on the real and imaginary parts. The search sequence within each group is well-defined, but the boundary between the two groups is not easy to calculate. For example, whether the 3rd search point in Fig. 12 (b) or the one in Fig. 12 (d) is closer cannot be decided by a simple boundary. Therefore, we adopt a mixed method: the two solution sets are compared to find the final enumeration sequence with respect to the central point.

[Figure: panels (a)-(b) show the 1-bit error subset and panels (c)-(d) the 2-bit errors subset around the central point $R_{ii} \hat{s}_i$, with $b_i$ and the numbered search order; (a), (c) mark the 2nd closest point and (b), (d) the 3rd to 5th points.]
Fig. 12. Candidate enumeration scheme. Decision boundaries are dashed lines in the central region.

Fig. 13 shows the overall area reduction for one PE. An overall 20× area reduction is achieved through various signal processing and circuit techniques, from the arithmetic level down to the circuit level. If 16 sub-carriers are processed through data-stream interleaving, the equivalent area reduction would be more than 260 times.

So far, we have built a one-PE sphere decoder. To speed up the search and improve the error probability, multiple PEs need to be utilized to span the search range. A multi-core architecture is proposed to coordinate all the functional blocks in a power- and area-efficient way.
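The mixed candidate enumeration described for Fig. 12 can be modelled in software as ordering two small neighbour subsets separately and then merging them by comparison. This is a sketch under assumptions: the circuit orders each subset with sign checks and decision boundaries, whereas this model simply sorts each subset by computed distance; the constellation is assumed to lie on odd integers up to `max_level`; and `merge_by_distance` and `enumerate_candidates` are hypothetical names.

```python
def merge_by_distance(a, b, key):
    """Merge two lists that are each already ordered by key (pairwise comparison)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if key(a[i]) <= key(b[j]):
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]


def enumerate_candidates(z, center, max_level=7):
    """Order the closest point and its (up to) 8 neighbours by distance to z.

    z:      b_i / R_ii as a complex number.
    center: closest constellation point found by the region decision.
    """
    def inside(p):
        return abs(p.real) <= max_level and abs(p.imag) <= max_level

    def dist(p):
        return abs(p - z)

    # Axis neighbours differ from the center in one Gray-coded bit,
    # diagonal neighbours in two bits.
    one_bit = [center + d for d in (2, -2, 2j, -2j)]
    two_bit = [center + d for d in (2 + 2j, 2 - 2j, -2 + 2j, -2 - 2j)]
    one_bit = sorted((p for p in one_bit if inside(p)), key=dist)
    two_bit = sorted((p for p in two_bit if inside(p)), key=dist)
    return [center] + merge_by_distance(one_bit, two_bit, dist)


if __name__ == "__main__":
    for p in enumerate_candidates(1.3 + 0.9j, 1 + 1j):
        print(p)
```

Only the merge step needs explicit distance comparisons between the two subsets, which mirrors the observation that the boundary between the groups is the part that cannot be resolved by a simple decision line.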

[Figure: bar chart of PE area after each step (initial, folding, MEU simplification, simplified multiplier, memory reduction, wordlength reduction), annotated with ×8.5, 30%, 20%, 5%, and 20%, for a total 20× reduction.]
Fig. 13. Summary of area reduction for one PE.

D. Multi-Core Architecture

A multiple-PE architecture inherently improves the search speed in proportion to the number of processing elements. Moreover, the search speed is further increased because shorter paths can be found earlier, which prunes the tree more efficiently. In addition, the number of processing elements offers the flexibility to trade performance for area. Virtually all K-best architectures use parallelism to search several branches at the same time [16]-[22], but they do not take advantage of two important features of sphere decoding: radius shrinking and tree pruning. When search paths run outside the search radius, they should be discarded instead of continuing with a deeper search. Intuitively, we should assign a new search branch within the search radius to the processors whose search paths are outside the search radius. To maximize the probability of finding the ML estimate, the children of the branch with the smaller Euclidean distance at that level are assigned as the new search candidates. Therefore, the functions needed include: (1) a sorting circuit to record the branch with the minimum Euclidean distance, (2) a radius checking block to examine whether the Euclidean distance is larger than the search radius, and (3) a candidate enumeration circuit, illustrated in Fig. 12. Since the radius checking block is included in the sphere decoder, one of the many algorithms for effective radius shrinking can be utilized [2] [3] [10] [12].

1) Sorting Circuit: Sorting algorithms are extensively studied in computer science. In hardware, several architectures are widely used: serial sorting, parallel sorting (Batcher sorter), and the Single Instruction Multiple Data (SIMD) architecture [16] [33]-[36]. The serial sorter executes the bubble sort algorithm [16]. Its serial comparison nature results in a longer latency. The parallel sorter, which is widely used in packet-switching networks,

makes use of parallelism to speed up sorting at the cost of increased area. SIMD provides the greatest flexibility, but its interconnect network is very complex. A comparison of these architectures is summarized in Table V. For N inputs, $n = \log_2 N$. Latency and area are estimated as the number of comparator delays and the number of comparators, respectively.

TABLE V. SUMMARY OF SORTING CIRCUITS
                      Serial    Parallel       SIMD
Latency               N         n(n+1)/2       n(n+1)/2
Area                  N/2       (n^2+n)N/4     N/2
Routing complexity    Low       Medium         High

Area is the first priority in the design of the sorting circuit, because the sorting circuit needs to be replicated to support multiple sub-carriers. Leveraging the data-interleaving operation, N-1 time slots are available for additional sub-carriers, which makes serial comparison possible within a symbol period. Therefore, the serial sorter is selected in our design. Since the first input is loaded into the register of the first stage, the latency is N-1 cycles (one cycle saved). Fig. 14 shows the circuit of the serial sorter. For each comparator, the larger operand is sent to the lower branch and the smaller one is sent to the upper branch. The final sorted Euclidean distance from each PE can be used by the outer receiver for iterative decoding.

[Figure: chain of compare stages (stage 1, stage 2, ..., stage M/2), each with L and H outputs.]
Fig. 14. Circuit diagram of a serial sorter.

2) Radius Checking: Radius checking is executed in parallel with sorting. The Euclidean distances stored in all PEs are checked serially. If the Euclidean distance is larger than the search radius, a new search path is assigned. On the other hand, if the Euclidean distance is smaller than the search radius, the search radius is updated to this smaller value and the corresponding branch is chosen as the ML estimate.

A multi-core architecture is proposed to coordinate all functional blocks. The number of PEs is decided by the BER-area-power tradeoff. A 16-PE architecture is shown in Fig. 15.
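The serial radius check described above can be written as a short loop. This is a minimal sketch that follows the text literally, assuming the distances belong to completed search paths and that ties count as inside the radius; the function name `radius_check` and its return format are hypothetical.

```python
def radius_check(distances, radius):
    """Serially scan the Euclidean distance held by each PE.

    Returns the updated radius, the index of the PE holding the current
    best estimate (or None), and the PEs that need a new search path.
    """
    best_pe = None
    reassign = []
    for pe, d in enumerate(distances):
        if d > radius:
            reassign.append(pe)      # path left the sphere: assign a new branch
        else:
            radius = d               # shrink the radius to the smaller distance
            best_pe = pe             # this branch is the current ML estimate
    return radius, best_pe, reassign


if __name__ == "__main__":
    print(radius_check([7.0, 4.0, 9.0, 2.5], 6.0))
```

Because the scan is serial, a PE checked early is compared against the radius in force at that moment; it is only flagged on a later pass if the radius has since shrunk below its distance.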
For each PE, the decoded symbols and the associated Euclidean

distance for 16 sub-carriers are fed into registers serially after processing. In each cycle, only the metrics of one sub-carrier are computed, while the other sub-carriers undergo sorting, radius checking, and candidate enumeration across PEs. A sorting circuit connecting the 16 registers that belong to the same sub-carrier is embedded. Radius checking is conducted serially using a multiplexer and is followed, conditionally, by a new path assignment.

[Figure: 16 PEs (PE-1 to PE-16), each containing a demux, MCU, memory, and MEU, arranged around a shared sub-carrier space of 16 register groups (SC-1 to SC-16), with an I/O interface and a mux for radius checking and updating.]
Fig. 15. Multi-PE sphere decoder architecture.

With this compact multi-PE architecture, the sphere decoder provides very high performance. At 256 MHz, each PE provides 46.5 GOPS (12-bit equivalent add), and the total for the 16-PE architecture amounts to 800 GOPS (including the sorting and radius checking circuits) for the whole system when operating in the QAM mode. In addition to high performance, flexibility and scalability are also included. We describe the design specifications next.

E. Design Specifications

The sphere decoder is designed to support different system configurations with respect to antenna array size, modulation and detection schemes, and the number of sub-carriers. Table VI summarizes the configuration modes. Since varying antenna array sizes and modulations are supported, this design is also capable of trading off diversity gain for spatial multiplexing gain if STBC is used. Due to interleaving by 16, the supported number of sub-carriers can be any multiple of 16 through data rearrangement.

TABLE VI. OVERVIEW OF SYSTEM CONFIGURATION MODES
Configuration         Modes
Antenna array size    Any square matrix between 2×2 and 16×16
Modulation            BPSK, QPSK, 16-QAM, 64-QAM
# sub-carriers        16, 32, 64, 128
Detection             Depth-first, K-best

The main design specification is the throughput constraint for the algorithm. Since a total bandwidth of 16 MHz is used, each sub-channel requires 1 MS/s to process the data in the case of 16 sub-carriers. The requirement is thus to process 16 parallel streams of data at a 1 MHz rate. The clock specification for the resulting architecture then becomes 256 MHz (1 MHz × 16 sub-carriers × 16 antennas). Note that all modes can run at a clock frequency of 256 MHz. The clock frequency for smaller array sizes is reduced because the channel bandwidth is fixed. Detailed system specifications are listed in Table VII for array sizes from 4×4 to 16×16. The system supports an ideal throughput of up to 1.536 Gbps, which corresponds to a spectral efficiency of up to 96 bps/Hz. When the system is operated in a smaller array mode, the clock frequency and supply voltage can be reduced to minimize power consumption.

TABLE VII. SUMMARY OF SYSTEM SPECIFICATION
Antenna array       4×4                      8×8                      16×16
Modulation          QPSK   16-QAM  64-QAM    QPSK   16-QAM  64-QAM    QPSK   16-QAM  64-QAM
BW (baseline)       16 MHz
Clock freq.         64 MHz                   128 MHz                  256 MHz
Throughput (bps)    128M   256M    384M      256M   512M    768M      512M   1.024G  1.536G
Spectral efficiency (bps/Hz)

A hardware comparison is given in Table VIII. The estimated chip area is 0.55 mm² in a standard 90 nm CMOS process, using the approximation that 10,000 FPGA slices correspond to about 1 mm² of layout area in 90 nm CMOS [28]. To make a fair comparison, the area is normalized by the number of transmit antennas (this is a conservative estimate, because the hardware complexity could grow quadratically with the number of transmit antennas). The data indicate that the proposed architecture is the most area-efficient compared to prior work. Furthermore, our design outperforms all previously published designs in terms of supported antenna array size and constellation size, as shown in Fig. 16. Unlike previous work, the proposed architecture also supports multiple sub-carriers and search methods. Finally, this is the first design that offers the flexibility required to fully traverse the diversity vs. spatial-multiplexing tradeoff curve.

TABLE VIII. HARDWARE COMPLEXITY COMPARISON. Columns: [19], [17], [21], [22], [24], This work. Rows: Area, Area (norm.). Areas are reported in gate count (GC), FPGA slices, or mm².
* This work: 154k gate count (GC), 0.55 mm² (90 nm), or 5.5k slices.
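The clock and throughput figures quoted in this section follow from simple products, which the sketch below reproduces. It assumes the ideal (uncoded) throughput is the number of transmit antennas times the bits per symbol times the bandwidth, which matches the Table VII entries; the function names are hypothetical.

```python
BANDWIDTH_HZ = 16_000_000                      # baseline bandwidth from Table VII
BITS_PER_SYMBOL = {"BPSK": 1, "QPSK": 2, "16-QAM": 4, "64-QAM": 6}


def clock_hz(n_tx, n_subcarriers=16, bandwidth_hz=BANDWIDTH_HZ):
    """Folded, interleaved clock: per-sub-carrier rate x sub-carriers x antennas."""
    symbol_rate = bandwidth_hz // n_subcarriers   # 1 MS/s for 16 sub-carriers
    return symbol_rate * n_subcarriers * n_tx


def ideal_throughput_bps(n_tx, modulation, bandwidth_hz=BANDWIDTH_HZ):
    """Ideal throughput: one symbol per transmit antenna per channel use."""
    return n_tx * BITS_PER_SYMBOL[modulation] * bandwidth_hz


def spectral_efficiency(n_tx, modulation):
    """Throughput divided by bandwidth, in bps/Hz."""
    return ideal_throughput_bps(n_tx, modulation) // BANDWIDTH_HZ


if __name__ == "__main__":
    for n_tx in (4, 8, 16):
        for mod in ("QPSK", "16-QAM", "64-QAM"):
            print(n_tx, mod, clock_hz(n_tx), ideal_throughput_bps(n_tx, mod),
                  spectral_efficiency(n_tx, mod))
```

For the 16×16, 64-QAM mode this gives a 256 MHz clock, 1.536 Gbps, and 96 bps/Hz, in agreement with the text.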

[Figure: supported antenna array size (4×4, 8×8, 16×16) versus modulation (BPSK, QPSK, 16-QAM, 64-QAM) for [17], [19], [21], [22], [24], and this work.]
Fig. 16. Comparison of this work with previous work.

IV. DESIGN METHODOLOGY

An integrated design methodology is adopted in our work to incorporate algorithm, architecture, and circuit implementation in a highly automated environment. Since the design is complex, we start with a layered design approach that decomposes the whole system hierarchically, from the top-level architecture down to the fundamental modules. Different considerations, such as area and throughput, are evaluated at each layer for architecture optimization. A graphical Simulink/Matlab development environment offers bit-true, cycle-accurate, hardware-equivalent modules for simulation, which are then translated to FPGA emulation without hardware description language (HDL) coding. Due to the limited capacity of a single FPGA, the BEE2 platform [38] is used to accommodate the whole system and speed up emulation.

A. Simulink-Based Design Environment

We use the Simulink/Matlab design environment [44]. Traditionally, circuit design for communication signal processing is divided into two stages: algorithm design and circuit implementation. Algorithm designers use C/C++ or Matlab for system simulation, and the designed architecture is then implemented by circuit designers using HDL. There are usually several iterations between the two design stages to ensure that the final design satisfies the specifications. In this work, Xilinx System Generator (XSG) block-sets are used to build hardware-equivalent modules, which enables cycle-accurate software simulation. In addition, quantization effects due to finite wordlength are considered in the simulation. Area information is extracted by the Resource Estimator (XSG) or Design Compiler (Synopsys), in terms of the number of slices or the area, early in the design stage, since the equivalent HDL can be generated automatically. The drawback of the Simulink-based design flow is its lengthy simulation time, which can be mitigated by FPGA-based hardware emulation [43].


B. Emulation Platform

FPGA-based hardware emulation and rapid prototyping have become an attractive solution, providing up to $10^6$ times faster simulation than software simulation [37]. A Xilinx University Program (XUP) board (Virtex-2 Pro 30 part) [50] is used to develop the hardware/software co-simulation environment for small circuits. In this case, the hardware modules built in Simulink are replaced by the configured FPGA to speed up simulation. Due to the limited capacity of the XUP board, the BEE2 platform is used for whole-system emulation. The BEE2 consists of 5 Virtex-2 Pro 70 FPGAs (~10M equivalent logic gates in total). Each FPGA embeds a PowerPC core, which minimizes the latency between the microprocessor and the reconfigurable logic. Four FPGAs (user FPGAs) are used for computation and one for control (control FPGA), as shown in Fig. 17. With high-bandwidth, low-latency links, the BEE2 provides a virtual single FPGA of five times the capacity [38].

[Figure: four user FPGAs (User FPGA-1 to User FPGA-4) connected to a central control FPGA.]
Fig. 17. BEE2 emulation platform.

C. Simulation Results

The BER performance of one PE is verified through the hardware/software co-simulation environment. In this preliminary experiment, the ZF-DFE/BLAST algorithm is adopted, i.e., for each level of the search tree, only the closest lattice point is chosen as the decoded symbol [41] [49]. Since only a small portion of the solution space is examined, there is a performance gap between this scheme and the ML solution. However, we demonstrate that a system with a larger antenna array and repetition coding can easily outperform the ML solution with a smaller antenna array. The BER performance can be further improved to achieve ML performance without repetition coding by using multiple PEs, a design that is in progress. Fig. 18 (a) shows the BER performance of 64-QAM modulation for different
