FPGA IMPLEMENTATION OF A NEAR-ML SPHERE DETECTOR FOR 802.16E BROADBAND WIRELESS SYSTEMS

Chris Dick (Xilinx, San Jose, CA, USA; chris.dick@xilinx.com); Milos Trajkovic (Signum Concepts, San Diego, CA, USA; milos.trajkovic@signumconcepts.com); Slobodan Denic (Signum Concepts, San Diego, CA, USA; slobodan.denic@signumconcepts.com); Dragan Vuletic (Signum Concepts, San Diego, CA, USA; dragan.vuletic@signumconcepts.com); Raghu Rao (Xilinx, San Jose, CA, USA; raghu.rao@xilinx.com); fred harris (San Diego State University (SDSU), San Diego, CA, USA; fharris@mail.sdsu.edu); Kiarash Amiri (Rice University, Houston, TX, USA; kiaa@rice.edu).

ABSTRACT

Spatial multiplexing multiple-input-multiple-output (MIMO) communication systems have recently drawn significant attention as a means to achieve tremendous gains in wireless system capacity and link reliability. The optimal hard decision detector for MIMO wireless systems is the maximum likelihood (ML) detector. ML detection is attractive due to its superior performance (in terms of BER). However, the complexity of a direct implementation grows exponentially with the number of antennas and the modulation order, making an ASIC or FPGA implementation infeasible for all but low density modulation schemes using a small number of antennas. Sphere decoding (SD) solves the ML detection problem in a computationally efficient manner. However, even with this complexity reduction, real-time implementation on a DSP processor is generally not feasible, and high-performance parallel computing platforms such as FPGAs are increasingly being employed for this class of applications. The sphere detection problem affords many opportunities for algorithm and micro-architecture optimizations and tradeoffs. This paper provides an overview of an FPGA implementation of a sphere detector and channel matrix pre-processor applicable to the 802.16e air interface protocol. The architecture of the design is presented along with resource utilization data and BER performance curves.
1. INTRODUCTION

Spatial division multiplexing (SDM) MIMO processing significantly increases the spectral efficiency, and hence capacity, of a wireless communication system: it is a core component of next generation wireless systems, for example WiMAX and other OFDM-based communication schemes. Sphere detection is a prominent method of simplifying the detection complexity in both SDM and SDMA systems while maintaining BER performance comparable with optimum maximum-likelihood (ML) detection [1], [2]. There are several approaches for realizing sphere detectors, and the algorithmic landscape is rich with methods that enable the designer to make various tradeoffs between performance (e.g. throughput of the wireless channel, BER) and implementation complexity [3].

While the algorithm (e.g. K-best or depth-first search (DFS)) and hardware architecture obviously have a strong influence on the resulting BER performance of the MIMO detector, the channel matrix pre-processing that is typically conducted prior to sphere detection likewise has a significant influence on the BER performance [4]. The channel matrix pre-processing can range from very simple processing that might, for example, compute a preferred order for processing the spatially multiplexed data streams based on variance computations applied to the channel matrix, through to much more sophisticated matrix factorization techniques that determine the preferred order for processing the streams in a more optimal (in the BER sense) manner.

This paper describes the field programmable gate array (FPGA) implementation of a detector for spatial multiplexing MIMO in 802.16e broadband wireless systems. By utilizing a channel matrix pre-processor that realizes a type of successive interference cancellation similar in concept to that employed in BLAST (Bell Labs Layered Space Time) processing, the detector achieves close to ML performance.

2. MIMO SYSTEM MODEL

Let us assume a wireless MIMO system that has $M_T$ transmit antennas and $M_R$ receive antennas, where $M_R \ge M_T$.
Proceedings of the SDR 09 Technical Conference and Product Exposition, Copyright 2009 SDR Forum, Inc. All Rights Reserved

All transmit antennas use the same channel for simultaneous communication with the receive antennas. The complex input-output MIMO model can be written as:

$$\tilde{y} = \tilde{H}\tilde{s} + \tilde{n} \qquad (1)$$

where $\tilde{H}$ denotes the $M_R \times M_T$ complex channel matrix, $\tilde{s} = [\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_{M_T}]^T$ is the $M_T$-dimensional vector transmitted from the $M_T$ transmit antennas, $\tilde{y} = [\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_{M_R}]^T$ is the complex receive vector of dimension $M_R$, and $\tilde{n}$ is a circularly symmetric complex additive white Gaussian noise vector of size $M_R$. The values $\tilde{s}_j$, $j = 1, \ldots, M_T$, are chosen from the complex set of symbols $\Omega$ with $p$ bits per symbol, i.e. $|\Omega| = 2^p$, and the set of all possible transmitted symbol vectors is denoted by $\Omega^{M_T}$. The $M_T$ parallel streams may use different modulation densities such as 4-, 16- or 64-QAM.

The detection process requires computing the solution to (1), and the goal is to reduce the required compute complexity by using simple arithmetic operations while simultaneously retaining the numerical integrity of the final result. The matrix elements in (1) are complex valued scalar quantities. However, this complex valued system of equations can be decomposed into a system of equations employing only real-valued numbers as follows:

$$y = \begin{pmatrix} \Re(\tilde{y}) \\ \Im(\tilde{y}) \end{pmatrix} = Hs + n \qquad (2)$$

$$= \begin{pmatrix} \Re(\tilde{H}) & -\Im(\tilde{H}) \\ \Im(\tilde{H}) & \Re(\tilde{H}) \end{pmatrix} \begin{pmatrix} \Re(\tilde{s}) \\ \Im(\tilde{s}) \end{pmatrix} + \begin{pmatrix} \Re(\tilde{n}) \\ \Im(\tilde{n}) \end{pmatrix} \qquad (3)$$

with the new matrices having larger dimensions $M = 2M_T$ and $N = 2M_R$. While the dimensionality of the key matrices has increased, the arithmetic required to manipulate the matrix elements is now simplified because the entries are real valued. Each $s_i$, $i = 1, \ldots, M$, in $s$ is chosen from the set of real numbers $\Omega'$, which in the case of 64-QAM modulation is $\Omega' = \{\pm 7, \pm 5, \pm 3, \pm 1\}$.

In addition to the real-valued decomposition (RVD) described above, the modified RVD (M-RVD) presented in [2] is employed in our design to improve the BER performance. The new ordering of the real and imaginary values of the complex components in (1) is summarized as follows:

$$\hat{y} = \hat{H}\hat{s} + \hat{n} \qquad (4)$$

or:

$$\begin{pmatrix} \Re(\tilde{y}_1) \\ \Im(\tilde{y}_1) \\ \vdots \\ \Re(\tilde{y}_{M_R}) \\ \Im(\tilde{y}_{M_R}) \end{pmatrix} = \hat{H} \begin{pmatrix} \Re(\tilde{s}_1) \\ \Im(\tilde{s}_1) \\ \vdots \\ \Re(\tilde{s}_{M_T}) \\ \Im(\tilde{s}_{M_T}) \end{pmatrix} + \begin{pmatrix} \Re(\tilde{n}_1) \\ \Im(\tilde{n}_1) \\ \vdots \\ \Re(\tilde{n}_{M_R}) \\ \Im(\tilde{n}_{M_R}) \end{pmatrix} \qquad (5)$$

The matrix $\hat{H}$ is the permuted channel matrix of (3) whose columns are reordered to match the other vectors of the new decomposition ordering in (5).
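The two decompositions can be compared with a small software sketch. The following pure-Python example is illustrative only and is not taken from the design: the function names `rvd`, `mrvd` and `matvec` and the 2x2 test values are hypothetical, and the interleaving of both rows and columns in `mrvd` is an assumption based on the ordering shown in (5). It builds the real-valued systems of (2)-(3) and (4)-(5) from a complex channel and checks, in the noise-free case, that both reproduce the received vector.

```python
def rvd(H, y):
    """Conventional RVD: stack real parts over imaginary parts, as in (2)-(3)."""
    top = [[h.real for h in row] + [-h.imag for h in row] for row in H]
    bottom = [[h.imag for h in row] + [h.real for h in row] for row in H]
    y_real = [v.real for v in y] + [v.imag for v in y]
    return top + bottom, y_real


def mrvd(H, y):
    """Modified RVD: interleave Re/Im of each antenna, as in (4)-(5)."""
    H_hat, y_hat = [], []
    for row, v in zip(H, y):
        re_row, im_row = [], []
        for h in row:
            re_row += [h.real, -h.imag]   # columns multiplying Re(s_j), Im(s_j)
            im_row += [h.imag, h.real]
        H_hat += [re_row, im_row]         # rows producing Re(y_i), Im(y_i)
        y_hat += [v.real, v.imag]
    return H_hat, y_hat


def matvec(A, x):
    """Plain real matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


# Hypothetical noise-free 2x2 example.
H = [[1 + 2j, 3 - 1j], [0.5j, -2 + 0j]]
s = [1 + 3j, -5 - 7j]
y = [sum(h * x for h, x in zip(row, s)) for row in H]

H_r, y_r = rvd(H, y)
s_r = [x.real for x in s] + [x.imag for x in s]          # [Re(s); Im(s)]
H_m, y_m = mrvd(H, y)
s_m = [part for x in s for part in (x.real, x.imag)]     # interleaved Re/Im
```

Both forms contain exactly the same entries; only their positions differ, which is why the permutation is free in hardware.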
The M-RVD reordering carries no extra computational cost. The optimum detector for the system described in (4) is the maximum-likelihood detector, which minimizes $\|y - Hs\|^2$ over all possible combinations of the vector $s$. For high order modulations and large numbers of antennas, the number of calculations in this detection scheme grows exponentially, and the corresponding compute requirements render real-time implementation of ML detection impractical. A reduced complexity quasi-ML algorithm can be formulated by starting with the QR decomposition of the channel matrix $H$ and defining the distance metric as shown in (6):

$$D(s) = \|y - Hs\|^2 = \|Q^H y - Rs\|^2 = \sum_{i=1}^{M} \Big| y'_i - \sum_{j=i}^{M} R_{i,j} s_j \Big|^2 \qquad (6)$$

where $H = QR$, $QQ^H = I$ and $y' = Q^H y$. The final term in (6) is a consequence of the upper triangular structure of the matrix $R$. The norm in (6) can be computed in $M$ iterations, starting from the $M$-th level ($i = M$) and progressing to the first ($i = 1$). For the first iteration the initial partial norm is defined as zero, $T_{M+1}(s^{(M+1)}) = 0$. Using the notation from [1], the partial Euclidean distance at each iteration can be calculated as:

$$T_i(s^{(i)}) = T_{i+1}(s^{(i+1)}) + |e_i(s^{(i)})|^2 \qquad (7)$$

with $s^{(i)} = [s_i, s_{i+1}, \ldots, s_M]^T$, $i = M, M-1, \ldots, 1$, and

$$e_i(s^{(i)}) = y'_i - R_{i,i} s_i - \sum_{j=i+1}^{M} R_{i,j} s_j = b_{i+1}(s^{(i+1)}) - R_{i,i} s_i \qquad (8)$$

The iterative algorithm defined in equations (7) and (8) can be viewed as a tree traversal, with level $i$ of the tree corresponding to processing symbols from the $i$-th antenna. The tree traversal can be performed using several different methods [5]. In our implementation we elect to employ a breadth-first search because of its attractive feedforward (and hence hardware friendly) structure. At each level only the $K$ nodes with the smallest $T_i$ are chosen for expansion. This type of detector is called a K-best detector [], [5].

The order in which the antennas are processed by the sphere detector has a profound impact on the BER (bit error rate) performance. Channel reordering is therefore applied prior to sphere detection.
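The recursion in (7) and (8) and the K-best pruning rule can be illustrated with a short software model. This is a minimal sketch in pure Python under stated assumptions, not the hardware architecture of the paper: the function name `k_best_detect`, the explicit sort at every level, and the 4x4 test matrix are all hypothetical, and the sort-free closest-node search used in the FPGA design is not modelled.

```python
def k_best_detect(R, y, symbols, K):
    """Breadth-first K-best search over an upper-triangular real system y ~ R s.

    R       : M x M upper-triangular matrix (list of rows)
    y       : rotated receive vector y' = Q^H y (length M)
    symbols : real constellation points per dimension
    K       : number of survivors kept at each tree level
    Returns (best symbol vector, its accumulated distance).
    """
    M = len(R)
    # Each survivor is (partial distance T, partial vector [s_i, ..., s_M]).
    survivors = [(0.0, [])]                 # T_{M+1} = 0, nothing decided yet
    for i in range(M - 1, -1, -1):          # levels M down to 1 (0-based index)
        children = []
        for T, s_part in survivors:
            # b = y'_i minus interference of already decided symbols, as in (8);
            # s_part[0] holds s_{i+1}, so s_j sits at index j - i - 1.
            b = y[i] - sum(R[i][j] * s_part[j - i - 1] for j in range(i + 1, M))
            for s in symbols:
                e = b - R[i][i] * s
                children.append((T + e * e, [s] + s_part))   # update (7)
        children.sort(key=lambda c: c[0])
        survivors = children[:K]            # keep the K smallest distances
    best_T, best_s = survivors[0]
    return best_s, best_T


# Hypothetical noise-free example with a 4-level constellation per dimension.
R = [[2.0, 1.0, 0.5, -1.0],
     [0.0, 1.5, 0.25, 0.5],
     [0.0, 0.0, 1.0, -0.5],
     [0.0, 0.0, 0.0, 2.0]]
s_true = [3, -1, 1, -3]
y_rot = [sum(r * s for r, s in zip(row, s_true)) for row in R]
s_hat, dist = k_best_detect(R, y_rot, [-3, -1, 1, 3], K=4)
```

With a small K the search visits only a fixed number of nodes per level, which is what makes the breadth-first form attractive for a pipelined implementation; the price is that the true ML path can be pruned when noise is present.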
The channel reordering proposed here is a V-BLAST-like method [6]. It determines the optimum detection order of the columns of the complex channel matrix defined in (1) by calculating the norms of the rows of the pseudo-inverse of the channel matrix over several iterations. Depending on the iteration count, the row with the maximum or minimum norm is selected. The row with the minimum Euclidean norm represents the influence of the strongest antenna, while the row with the maximum Euclidean norm represents the influence of the weakest antenna. The novel approach first processes the weakest stream. All subsequent iterations process the streams from highest to lowest power. The iteration process is illustrated in the following equations:

$$G_j = H_j^{\dagger} = (H_j^H H_j)^{-1} H_j^H, \quad j = 1, \ldots, M_T$$

$$k_j = \begin{cases} \arg\max_k \|(G_j)_k\|^2 & \text{for } j = 1 \\ \arg\min_k \|(G_j)_k\|^2 & \text{for } j > 1 \end{cases}$$

$$H_1 = \tilde{H}, \quad H_{j+1} = H_{j,[k_j]} \qquad (9)$$

where $(G_j)_k$ is the $k$-th row of $G_j$ and $H_{j,[k_j]}$ represents the deflated channel matrix excluding the detected $k_j$-th column, which is placed in the $(M_T - j + 1)$-th column of the new sorted matrix. Every iteration of the reordering method therefore operates on a smaller matrix than the previous one. The last remaining column is placed in the 1st column of the sorted matrix.

3. FPGA HARDWARE IMPLEMENTATION

In this section the main features of the FPGA implementation are presented. The target technology is the Xilinx Virtex-5 FPGA. The design flow employs System Generator [7] for design capture, simulation and verification. In order to support different numbers of antennas/users and modulation orders, the detector is designed for the most demanding 4×4, 64-QAM case. The block diagram of the MIMO 802.16e broadband wireless receiver is shown in Figure 1.

Figure 1: Block diagram of the MIMO 802.16e broadband wireless receiver.

It is assumed that the channel matrix is perfectly known to the receiver, which can be accomplished by classical means of channel estimation [8]. After channel reordering and QR decomposition, the sphere detector (SD) is applied. In preparation for engaging a soft-input-soft-output channel decoder (e.g. a Turbo decoder), soft outputs are produced by computing the log-likelihood ratio (LLR) of the detected bits.

The main architectural challenges of the system include the data sub-carrier processing and managing the system sub-modules to process the desired number of sub-carriers in real time while simultaneously minimizing processing latency. The channel matrix is estimated for every data sub-carrier, which limits the available processing time for every channel matrix. For the selected FPGA, with a target clock frequency of 225 MHz and a communication bandwidth of 5 MHz (corresponding to 360 data sub-carriers in a WiMAX system), the available number of processing clock cycles per channel matrix interval is calculated in (10):

$$\text{num\_cycles} = \frac{102.9\ \mu\text{s} / 360}{1 / 225\ \text{MHz}} \approx 64 \qquad (10)$$

The design is optimized to meet the timing specification defined in (10). Hence, the sub-modules are configured in a pipelined fashion to accommodate the high throughput of the channel matrix coefficient stream. Besides the high data rate, managing the latency of the sub-modules was also an important issue guiding the design architecture. The latency issues were solved by introducing Time Division Multiplexing (TDM) of the successive channel matrices. TDM provides more processing time between the matrix elements of the same channel while still sustaining high data throughput. The number of channels comprising a TDM group varies as a function of the specific sub-module. The channel matrix inversion process employs 5 channels in the TDM scheme, while 15 channels are time division multiplexed in the real-valued QR decomposition module.

3.1. Channel Matrix Reordering

To meet the high data rate requirements of the system, the channel ordering processing is realized using the pipelined architecture shown in Figure 2. The channel matrix is successively deflated in dimension as it progresses through the processing pipeline. Buffer memories organized in a ping-pong manner are incorporated to store the sorted columns. The first iteration buffer holds the estimated matrix values and its size is determined by the number of data sub-carriers. Buffers for the additional iterations store the 5 channels employed in the TDM structure.

Figure 2: Iterative channel matrix reordering algorithm.

The calculation of the G matrix is the most demanding component in Figure 2. The heart of the process is matrix inversion, which is realized using QR decomposition (QRD). A common method for realizing QRD is based on Givens rotations [9].
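As a software reference for the iterative reordering of (9) that this pipeline implements, the following pure-Python sketch may be useful. It is an illustration under stated assumptions rather than the hardware algorithm: the helper names `hermitian_gram`, `invert` and `vblast_like_order` are hypothetical, the inverse is computed by Gauss-Jordan elimination instead of QRD, and the squared row norms of the pseudo-inverse $G_j$ are read from the diagonal of $(H_j^H H_j)^{-1}$, which is valid because $G_j G_j^H = (H_j^H H_j)^{-1}$.

```python
def hermitian_gram(H):
    """Return A = H^H H for a complex matrix given as a list of rows."""
    n = len(H[0])
    return [[sum(row[i].conjugate() * row[j] for row in H) for j in range(n)]
            for i in range(n)]


def invert(A):
    """Gauss-Jordan inverse with partial pivoting (intended for small matrices)."""
    n = len(A)
    aug = [list(A[i]) + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def vblast_like_order(H):
    """Return original column indices in sorted-matrix order (columns 1..M_T)."""
    remaining = list(range(len(H[0])))
    picked = []
    while len(remaining) > 1:
        sub = [[row[c] for c in remaining] for row in H]      # deflated H_j
        A_inv = invert(hermitian_gram(sub))
        # Squared norm of row k of the pseudo-inverse G_j.
        norms = [A_inv[k][k].real for k in range(len(remaining))]
        if not picked:      # j = 1: weakest stream (largest row norm)
            k = max(range(len(norms)), key=norms.__getitem__)
        else:               # j > 1: strongest remaining stream
            k = min(range(len(norms)), key=norms.__getitem__)
        picked.append(remaining.pop(k))
    picked.append(remaining[0])
    # The j-th pick goes to column M_T - j + 1, so reverse the pick order.
    return picked[::-1]


# Hypothetical diagonal channel with stream gains 1, 4, 2 and 3.
H_demo = [[1 + 0j, 0j, 0j, 0j],
          [0j, 4 + 0j, 0j, 0j],
          [0j, 0j, 2 + 0j, 0j],
          [0j, 0j, 0j, 3 + 0j]]
order = vblast_like_order(H_demo)
```

For this diagonal example the weakest stream (gain 1) is picked first and lands in the last column, followed by the streams with gains 4 and 3, and the remaining stream (gain 2) takes the first column.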
This paper presents a novel approach for performing the complex rotations which are the fundamental computations in the systolic array we are using. Some well known algorithms for angle estimation and planar rotation, such as CORDIC, introduce very high system latency for the numerical accuracy required in this application, which is unacceptable for our system. The goal was to find an alternative solution for vector rotation and phase estimation using the FPGA embedded DSP resources (DSP48 [10]).

The algorithm described in [9] shows that the diagonal cells of the systolic structure rotate the input vector to the real axis and deliver the angle measured in this process to the off-diagonal cells, where an additional set of rotations is applied. Denoting the complex value in a diagonal cell as $Z$, we observe that multiplying the value contained in an off-diagonal cell by the complex conjugate $Z^*$ and scaling by $1/\sqrt{|Z|^2}$ produces the desired rotation. The division is done as a multiplication with a reciprocal value calculated using a polynomial approximation on a defined interval. Analysis of the function $f(x)$, defined in (11), showed the range where the function is close to linear. A first order approximation can be applied to data in the interval $[c, +\infty)$, where the constant $c$ is chosen to be as large as possible while producing an acceptably small error for our application. The input data range is determined in the following manner: $0 < |Z|^2 < 2$; $\Re(Z), \Im(Z) \in [-1, 1]$; hence values lower than $c$ have to be scaled in order to be in the range $[c, 2]$. The input scaling is implemented by shifting the data by the number of sign bits carried in the value, and the shift factor is determined using a binary search for the leading bit. Since the square-root function is approximated, the shift factor is chosen to be an even number of bits so that the output compensation scaling of the result is just a simple shift. Taking all these facts into account, the constant was determined to be $c = 0.25$.
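The scaling and table-based approximation described here can be modelled in a few lines of floating-point Python. This is a behavioural sketch only, not the fixed-point DSP48 datapath: the table size `TABLE_BITS`, the table span of [0.25, 2), the use of a first-order Taylor step from the start of each table segment, and the names `rsqrt_approx` and `rotate_by_conjugate` are all assumptions made for illustration.

```python
import math

C = 0.25             # lower edge of the approximation interval
X_MAX = 2.0          # |Z|^2 < 2 when Re(Z), Im(Z) lie in [-1, 1]
TABLE_BITS = 6       # hypothetical lookup-table address width
N_ENTRIES = 1 << TABLE_BITS
STEP = (X_MAX - C) / N_ENTRIES

# Tables for f(x) = 1/sqrt(x) and f'(x) = -0.5 * x**-1.5 at each segment start.
F_TAB = [1.0 / math.sqrt(C + k * STEP) for k in range(N_ENTRIES)]
DF_TAB = [-0.5 * (C + k * STEP) ** -1.5 for k in range(N_ENTRIES)]


def rsqrt_approx(x):
    """Approximate 1/sqrt(x) for 0 < x < 2 with a first-order table lookup."""
    if not 0.0 < x < X_MAX:
        raise ValueError("input outside the modelled range (0, 2)")
    shift = 0
    while x < C:               # scale up by an even number of bits
        x *= 4.0
        shift += 2
    k = min(int((x - C) / STEP), N_ENTRIES - 1)
    dx = x - (C + k * STEP)
    y = F_TAB[k] + dx * DF_TAB[k]       # f(x0 + dx) ~ f(x0) + dx * f'(x0)
    # Undo the input scaling: 1/sqrt(x) = 2**(shift/2) / sqrt(x * 2**shift).
    return y * (1 << (shift // 2))


def rotate_by_conjugate(z_diag, w):
    """Rotate w by the angle that brings z_diag onto the real axis."""
    return w * z_diag.conjugate() * rsqrt_approx(abs(z_diag) ** 2)


rotated = rotate_by_conjugate(0.3 + 0.4j, 0.3 + 0.4j)   # close to 0.5 + 0j
```

Because the input is scaled by an even number of bits, the output compensation is a plain shift by half that number, as noted in the text. With 64 table entries the relative error of this model stays below one percent over the whole input range; a real design would size the tables against its own accuracy target.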
The first order approximation is illustrated in (11):

$$f(x + \Delta x) = f(x) + \Delta x \, f'(x); \quad f(x) = \frac{1}{\sqrt{x}} \qquad (11)$$

The values $f(x)$ and $f'(x)$ are mapped to FPGA memories [10], while the multiplication is done using the DSP48 slices. Using the approximation, the final signal flowgraph of the complex rotator in the diagonal systolic cell is shown in Figure 3. The data sent to the off-diagonal cells are the in-phase and quadrature components of the rotated vector scaled by the corresponding approximated value. The multiplication process in the off-diagonal cells is defined in (12):

$$\frac{(I + jQ)(I_{rot} - jQ_{rot})}{\sqrt{I_{rot}^2 + Q_{rot}^2}} = (I + jQ)\left( \frac{I_{rot}}{\sqrt{I_{rot}^2 + Q_{rot}^2}} - j \frac{Q_{rot}}{\sqrt{I_{rot}^2 + Q_{rot}^2}} \right) \qquad (12)$$

Figure 3: Block diagram of the diagonal systolic array cell.

High data throughput is obtained using a pipelined architecture for the diagonal and off-diagonal cells, while the latency introduced by the approximation module and complex multipliers was managed by time division multiplexing (TDM) of the hardware across 5 channels. The number of diagonal and off-diagonal cells implemented for the 4×4 matrix is [...] and 7, respectively, while the processing time to decompose a single matrix is 4 × 4 = 16 data cycles. The data is delivered at the rate of one sample every 3 clock cycles, so the total time to decompose a single matrix is 3 × 4 × 4 = 48 clock cycles (out of the available 64). The rest of the available cycles are used for IO and for initializing the memories for the next 5-channel TDM sub-frame. Back substitution of the decomposed matrix [9] and the further reordering operations in (9) are implemented in the same TDM manner using established and published algorithms.

3.2. Modified Real-valued QR decomposition

After obtaining the optimal ordering of the channel matrix columns, the QR decomposition (QRD) of the real-valued matrix coefficients is applied. The functional unit used for this QRD processing is similar to the QRD engine designed to compute the inverse matrix, but with some modifications. The input data in this case are real valued and the systolic array structure has a correspondingly higher degree.
In essence, the higher order matrix of (5) is processed, and in order to meet the desired timing constraints the input data consumption rate had to be 1 input sample per clock cycle. This introduced processing latency problems which could not be addressed with a 5-channel TDM structure. The number of channels in a TDM group was therefore increased to 15 to provide more time between the successive elements of the same channel matrix.
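The timing figures quoted in Sections 3.1 and 3.2 can be cross-checked with simple arithmetic. The sketch below is a hedged back-of-the-envelope calculation, not part of the design flow: it takes the symbol time of 102.9 µs, the 360 data sub-carriers and the 225 MHz clock from the text, and assumes (this is an inference, not a statement from the paper) that the real-valued QRD must absorb all 8 × 8 entries of one matrix within one channel matrix interval.

```python
T_SYMBOL_S = 102.9e-6        # OFDM symbol time used in (10)
N_DATA_SUBCARRIERS = 360     # data sub-carriers in the 5 MHz WiMAX profile
F_CLK_HZ = 225e6             # target FPGA clock frequency


def cycles_per_matrix(t_symbol_s, n_subcarriers, f_clk_hz):
    """Clock cycles available per channel matrix, following (10)."""
    return (t_symbol_s / n_subcarriers) * f_clk_hz


budget = int(cycles_per_matrix(T_SYMBOL_S, N_DATA_SUBCARRIERS, F_CLK_HZ))

# Complex 4x4 QRD: 16 data cycles at one sample every 3 clock cycles.
complex_qrd_cycles = 3 * 4 * 4
spare_cycles = budget - complex_qrd_cycles

# Real-valued 8x8 QRD: assumed to consume all 64 entries within the budget.
rvd_samples_per_cycle = (8 * 8) / budget
```

Under these assumptions the budget is 64 cycles, the complex QRD leaves 16 cycles for IO and memory initialization, and the real-valued QRD needs one sample per clock cycle, which matches the rate stated above.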

3.3. Sphere Detector (SD)

The norm computation defined in (8) is done in the partial Euclidean distance (PED) blocks of the SD. Depending on the level of the tree, three different PED blocks are used. The root node PED block calculates all possible PEDs (tree level index i = M = 8). The second level PED block computes 8 possible PEDs for each of the 8 survivor paths generated in the previous level, which gives 64 generated PEDs for tree level index i = 7. The third type of PED block is used for all other tree levels and computes the closest-node PED for each PED computed on the previous level. This fixes the number of branches on each level to K = 64, which propagates to the last level i = 1 and produces 64 final PEDs along with their detected symbol sequences. The closest-node search is presented in the sort-free SD [2].

The pipelined architecture of the SD allows data processing on every clock cycle, so only one PED block is necessary at every tree level. The total number of PED units is equal to the number of tree levels, which for 4×4 64-QAM is 8. The block diagram of the SD is illustrated in Figure 4. The input data are provided by the real-valued QRD, and the outputs are saved and analyzed further in the design, depending on the type of decoding process.

Figure 4: Block diagram of the sphere detector.

3.4. Soft output decoding

Two types of decoding process can be employed in the SD: hard decoding, which determines the sequence having the minimum distance metric through all levels of the tree, and soft decoding, which represents each output bit as a Log-Likelihood Ratio (LLR) value, these values typically being supplied as a priori input values to the channel decoder. Although soft decoding is not part of the material reported in this paper, the sphere detector implementation provides support for the generation of soft outputs for use in the iterative detector/decoder shown in Figure 5. Floating point simulations have shown a significant improvement in BER performance when iterative soft detection is used [5].
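To make the relationship between the final PED list and the soft outputs concrete, the following pure-Python sketch computes max-log LLRs from a list of candidates. It is an illustration under stated assumptions and not the soft-value block of the design: the function name `max_log_llrs`, the sign convention (positive values favour bit 1), the absence of a priori information, the noise variance `sigma2` and the clipping level `clip` are all hypothetical choices.

```python
def max_log_llrs(candidates, n_bits, sigma2, clip):
    """Max-log LLRs from a candidate list.

    candidates : list of (ped, bits), bits being a tuple of 0/1 of length n_bits
    sigma2     : noise variance per real dimension (assumed known)
    clip       : saturation level applied to every LLR
    """
    llrs = []
    for k in range(n_bits):
        d0 = min((ped for ped, bits in candidates if bits[k] == 0), default=None)
        d1 = min((ped for ped, bits in candidates if bits[k] == 1), default=None)
        if d0 is None:          # the list holds no hypothesis with bit k = 0
            llr = clip
        elif d1 is None:        # the list holds no hypothesis with bit k = 1
            llr = -clip
        else:
            llr = (d0 - d1) / (2.0 * sigma2)
        llrs.append(max(-clip, min(clip, llr)))
    return llrs


# Hypothetical three-entry candidate list for a two-bit label.
demo_list = [(1.0, (0, 1)), (3.0, (1, 1)), (2.0, (0, 0))]
demo_llrs = max_log_llrs(demo_list, n_bits=2, sigma2=0.5, clip=8.0)
```

The explicit saturation matters because a fixed-size list, such as the 64 final PEDs produced by the detector, does not always contain a counter-hypothesis for every bit.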
The iterative soft detection structure presents challenges for the memory architecture of the system: additional memory buffers must be used to store blocks of bits along with their PEDs, which increases the memory footprint of the FPGA design.

Figure 5: Block diagram of the iterative soft decoding.

The Generating Soft Values block in Figure 5 is designed based on the algorithm described in [12]. LLR values are computed, saturated to 3 bits and decoded further using a CTC. The algorithm uses the extrinsic information provided by the CTC as the a priori LLR, $L_A$, and based on the stored PED values calculates the output LLR, $L_E$. The high latency and delay introduced by the soft generating block is minimized by using a parallel/pipelined architecture.

4. FPGA RESOURCE UTILIZATION

The architecture described in the previous section was realized using the Xilinx System Generator for DSP [7] design flow and implemented on a Virtex-5 XC5VFX130T-FF1738 FPGA [10]. The target clock frequency was 225 MHz. As mentioned earlier, the most computationally demanding 4×4 64-QAM configuration has been designed and tested. The achievable raw data rate in this case is calculated in (13):

$$D = \frac{1}{T_{symb}} \cdot \text{num\_subcarriers} \cdot M_T \cdot \text{nbits} = \frac{1}{102.9\ \mu\text{s}} \cdot 360 \cdot 4 \cdot 6 = 83.965\ \text{Mbps} \qquad (13)$$

The implementation and simulation included the detection process illustrated in Figure 1 with the exclusion of the soft output generation module. Table I shows the resource consumption for each of the key functional units in the design. The percentage utilization values in the table indicate FPGA area expressed relative to a XC5VFX130T device.

TABLE I. RESOURCE FOOTPRINT SUMMARY BY SUB-SYSTEM

Function            Slices         LUT/FF               DSP48       Block RAM
Channel preproc.    9,999 (48%)    0,339/9,954 (4%)     159 (49%)   05 (7%)
RVD QRD             ,75 (8%)       4,48/5,556 (5%)      30 (9%)     7 (4%)
Sphere Detector     ,445 (%)       3,3/6,55 (3%)        48 (15%)    (%)

5. SIMULATION RESULTS

The entire detection chain was realized using System Generator for DSP.
Design validation employed not only the simulation semantics of the MATLAB/Simulink [13] environment but also the co-simulation capabilities of System Generator [7]. In our simulations we assume that the channel matrix is known to the receiver. The in-phase and quadrature components of the channel matrix coefficients are drawn from a normal distribution and delivered from MATLAB to the System Generator modeling environment. The bit error rate (BER) is computed using this simulation framework. Figure 6 contrasts the BER plots for our fixed-point hard decision implementation, the floating-point version of the model and the ML curve. We note that there is virtually no difference between the floating-point software model and the hardware implementation. For a BER of 10^-5 the difference is only 0.006 dB.

Figure 6: BER curves comparing the 4x4 64-QAM system for the floating point MATLAB simulation (hard decision), System Generator design (hard decision) and ML curve. (Plot: BER versus EbNo [dB], 64 QAM, 4x4; curves: Fixed-Point FPGA Sphere Det./preprocessing, Floating-point Matlab Sphere Det./preprocessing, Maximum Likelihood.)

6. CONCLUSION

This paper has described the hardware implementation of a sphere detector for 802.16e systems. Unlike many papers on this topic, we have described the algorithmic and architectural details of the channel matrix pre-processor preceding the sphere detector. There are many ways to realize the pre-processing, and while our method is computationally complex, the resulting BER performance is close to ML. We note that the pre-processor is the largest functional unit in the design, requiring 5.3x more multipliers than the QRD block and 3.3x more multipliers than the sphere detector itself. However, considering the system benefits delivered by the design, and noting that new generation FPGAs provide in excess of 1000 multipliers, the cost of the circuit is warranted.

7. REFERENCES

[1] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner and H. Bolcskei, "VLSI implementation of MIMO detection using the sphere decoding algorithm," IEEE JSSC, vol. 40, no. 7, pp. 1566-1577, Jul. 2005.
[2] K. Amiri, C. Dick, R. Rao, J. R. Cavallaro, "Flex-Sphere: An FPGA configurable sort-free sphere detector for multi-user MIMO wireless systems," Proceedings of the 2008 Software Defined Radio Technical Conference, Oct. 26-30, 2008, Washington D.C.
[3] L. G. Barbero, J. S. Thompson, "Rapid Prototyping of a Fixed-Throughput Sphere Decoder for MIMO Systems," IEEE International Conference on Communications, Volume 7, Page(s): 3082-3087, June 2006.
[4] L. G. Barbero, J. S. Thompson, "A fixed-complexity MIMO detector based on the complex sphere decoder," Signal Processing Advances in Wireless Communications, 2006. SPAWC '06. IEEE 7th Workshop on, Jul. 2006.
[5] Z. Guo, P. Nilsson, "Algorithm and implementation of the K-best sphere decoding for MIMO detection," IEEE Journal on Selected Areas in Communications, Volume 24, Issue 3, Page(s): 491-503, March 2006.
[6] P. W. Wolniansky, G. J. Foschini, G. D. Golden, "V-BLAST: an architecture for realizing very high data rates over the rich-scattering wireless channel," Proc. URSI International Symposium on Signals, Systems and Electronics (ISSSE 98), Atlanta, GA, pp. 295-300, Sept. 1998.
[7] Xilinx, System Generator for DSP User Guide, Release 10.1, March 2008.
[8] A. R. S. Bahai, B. R. Saltzberg and M. Ergen, Multi-carrier digital communications theory and applications of OFDM, Springer, 2004.
[9] M. Karkooti, J. R. Cavallaro, C. Dick, "FPGA Implementation of Matrix Inversion Using QRD-RLS Algorithm," Conference Record of the Thirty-Ninth Asilomar Conference on Signals, Systems and Computers, page(s): 1625-1629, 2005.
[10] Xilinx, Virtex-5 FPGA User Guide, UG190 (v4.5), January 2009.
[11] Xilinx, IEEE 802.16e CTC decoder v3.0, Product specification DS634, October 2007.
[12] B. M. Hochwald, S. ten Brink, "Achieving near-capacity on a multiple-antenna channel," IEEE Trans. Commun., vol. 51, no. 3, pp. 389-399, Mar. 2003.
[13] The Mathworks, Simulink Simulation and Model Based Design, http://www.mathworks.com/products/simulink/