A Multi-standard Efficient Column-layered LDPC Decoder for Software Defined Radio on GPUs

2013 IEEE 14th Workshop on Signal Processing Advances in Wireless Communications (SPAWC)

A Multi-standard Efficient Column-layered LDPC Decoder for Software Defined Radio on GPUs

Rongchun Li, Jie Zhou, Yong Dou, Song Guo, Dan Zou
National Laboratory for Parallel and Distribution Processing, National University of Defense Technology, Changsha, China, 410073
Email: {rongchunli, jiezhou, yongdou, songguo, danzou}@nudt.edu.cn
Corresponding author, Tel: (86)3467535838

Shi Wang
Wuhan Military Delegate Bureau, General Armament Ministry, Wuhan, China
Email: shiwang@nudt.edu.cn

Abstract: In this paper, we propose a multi-standard high-throughput column-layered (CL) low-density parity-check (LDPC) decoder for Software-Defined Radio (SDR) on a Graphics Processing Unit (GPU) platform. Multiple columns in the sub-matrix of a quasi-cyclic LDPC (QC-LDPC) code are processed in parallel inside a block, while multiple codewords are decoded simultaneously among many blocks on the GPU. Several optimization methods are employed to enhance the throughput, such as the compressed matrix structure, memory optimization, the codeword packing scheme, the two-dimension thread configuration and asynchronous data transfer. The experiments show that our decoder has a low bit error ratio and a peak throughput of 712 Mbps, which is about two orders of magnitude faster than the CPU implementation and comparable to dedicated hardware solutions. Compared to the existing fastest GPU-based implementation, the presented decoder achieves a performance improvement of 3.0x.

Index Terms: GPU, SDR, LDPC Decoder, column-layered decoding

I. Introduction

LDPC code has been considered one of the most promising near-optimal error-correcting codes due to its excellent error-correcting performance and fast decoding throughput. As a matter of fact, LDPC code has been adopted in many industrial protocols, such as DVB-S2, DVB-T2, WiFi (802.11n) and WiMAX (802.16e). The LDPC decoding algorithm is based on belief propagation of messages, which requires very intensive computation.
Therefore, in order to reach the throughput required by the standards, dedicated application-specific integrated circuit (ASIC) solutions for LDPC decoders have been presented in recent years [1][2]. However, ASIC solutions have a high time-to-market, high design cost and fixed functionality. Recently, GPUs have become widely used for their high computational power: they can execute numerous threads simultaneously, and their peak performance can reach up to tera floating-point operations per second. The NVIDIA corporation presented the Compute Unified Device Architecture (CUDA) [3], using C as a high-level programming language, which offers a software environment that facilitates the development of high-performance applications. Compared to ASIC solutions, GPU-based ones are less expensive, scalable and flexible. This work concentrates on the parallel implementation of the LDPC decoding algorithm on the GPU platform.

There are three types of LDPC decoding schedules: two-phase message-passing (TPMP), layered decoding and sequential decoding. In TPMP, the check-to-variable (CV) and variable-to-check (VC) messages are calculated in two separate phases in one iteration. Conversely, in the layered decoding algorithm, the sparse binary parity-check matrix H is divided into multiple layers, which are serially processed. In each layer, the CV and VC messages are both computed. Compared to TPMP, layered decoding achieves about twice as fast decoding convergence. Layered decoding algorithms can be divided into two categories according to the construction of the layers: the row-layered (RL) one [4] and the CL one [5][6]. In CL decoding, the variable nodes (VNs) are divided into layers, whereas in the RL algorithm the check nodes (CNs) are. CL decoding can achieve higher decoding speed than the RL one due to its lower complexity. In this paper, CL decoding is chosen because of its faster convergence and higher speed.

Several studies have been devoted to LDPC decoders implemented on GPUs in recent years. Most of these works exploited TPMP as their decoding algorithm [7]-[12]. [8] proposed a scalable RL LDPC decoder using a 9800 GTX+ GPU.
To the best of our knowledge, this is the first paper proposing an LDPC decoder exploiting CL decoding on GPUs.

The rest of the paper is organized as follows. Section 2 describes the background on CUDA, QC-LDPC code and CL decoding. We then present the GPU-based parallel CL decoding algorithm in Section 3. Next, a series of optimization methods is exploited in Section 4. The performance evaluation is described in Section 5. Section 6 concludes the paper.

II. Background

A. CUDA

In the logical hierarchy, the CUDA structure consists of a grid, blocks, and threads. The device program is hierarchically governed in grid-block-thread steps. A grid is constructed as a three-dimension array of blocks. At the same time, each block can be constructed as a three-dimension array of threads as well. Within a block, all the threads can share data.

978-1-4673-5577-3/13/$31.00 (c)2013 IEEE

Conversely, blocks in a grid must be executed independently. In each block, 32 threads are organized as a warp, which is executed independently. Any remaining issue slots in a warp will be wasted if the block size is not a multiple of 32. When one warp is waiting for data, a ready warp is quickly switched in to execute, concealing the latency of memory accesses.

In the physical hierarchy, the CUDA structure consists of memories, as well as multiple stream-multiprocessors (SMs), each with several integrated stream-processors (SPs). An SM can process multiple blocks simultaneously if the computing resources are sufficient. In each SM, registers are allocated to each individual thread. In the meantime, threads can access four types of memory. The off-chip global memory can be read or written by all the blocks, with an access latency of more than 400 GPU clock cycles. Constant and texture memory are cached to eliminate memory latency. All threads in a block can access the shared memory with a latency of four cycles; it enables threads to communicate with one another within a block.

B. QC-LDPC Code

An LDPC code is a linear block code specified by a sparse M x N parity-check matrix H, where M denotes the number of rows and N represents the code length. The code rate can be calculated as r = 1 - M/N. H can also be expressed by a Tanner graph, which has M CNs and N VNs, corresponding to the M rows and N columns, respectively. If H(i, j) = 1, there is an edge between CN i and VN j in the Tanner graph. For a QC-LDPC code, H is constructed from multiple sub-matrices, each of which is a cyclic shift of the identity matrix. Many protocols adopt quasi-cyclic LDPC codes, such as 802.16e [13] and 802.11n [14]. 802.16e supports 6 different code rates and 19 code lengths ranging from 576 to 2304 bits with a granularity of 96 bits, while 802.11n supports 4 different code rates and 3 code lengths: 648, 1296 and 1944 bits. It is worth noting that the codes of both protocols can be expressed by a base matrix H_b, which consists of N_b = 24 columns and M_b = (1 - r)N_b rows.
Each element of H_b corresponds to a Z x Z sub-matrix, where Z is the expansion factor and can be obtained as Z = N/N_b. The parity-check matrix H for a given code length can be obtained by expanding H_b with the corresponding Z.

C. CL Decoding

The CL decoding algorithm was proposed by Zhang [5]. Cui [6] presented Min-Sum based CL decoding. In the CL decoding algorithm, the VNs are divided into layers, which are serially processed. In each layer, the VNs are updated first, followed by the CNs connected to these VNs. The updated CN messages in one layer feed into the next layers. As in the RL case, CL decoding gains fast convergence from the estimates updated within the same iteration; in fact, the CL decoding algorithm achieves about twice as fast convergence compared to the TPMP algorithm. Min-Sum based CL decoding is adopted in this paper for its lower complexity.

Algorithm 1: Min-Sum Column-layered Decoding Algorithm
  Initialization:
    Input: the received sequence y_j; Output: the decoded bits c;
    L_{i,j} = y_j; R_{i,j} = 0;
  Iteration:
    1:  for iteration k = 1 to k_max do
    2:    for layer l = 1 to N do
    3:      for each CN i in layer l do
    4:        R^l_{i,j} = prod_{j' in V_i \ j} sgn(L^{(l-1)}_{i,j'}) * min_{j' in V_i \ j} |L^{(l-1)}_{i,j'}|;
    5:      end for
    6:      for each VN j in layer l do
    7:        L^l_{i,j} = alpha * sum_{i' in C_j \ i} R^l_{i',j} + y_j;
    8:        Lv_j = alpha * sum_{i' in C_j} R^l_{i',j} + y_j;
    9:      end for
    10:   end for
    11: end for
    12: Hard decision and generate c;

Algorithm 1 describes the Min-Sum based CL decoding algorithm, where alpha is the normalization factor of the normalized Min-Sum update. Assume that binary phase shift keying (BPSK) is adopted as the modulation scheme over the additive white Gaussian noise (AWGN) channel. Let y = y_1, y_2, ..., y_N be the received sequence with added noise, and let L^l_{i,j} and R^l_{i,j} be the variable-to-check and check-to-variable messages on the edge H_{i,j} in the l-th layer. Lv_j denotes the final soft output of VN j. In Algorithm 1, R^l_{i,j} is updated from the L_{i,j} of the (l-1)-th and l-th layers. To simplify the sign and magnitude computation of R^l_{i,j}, a two-step update of the sorted sequence m_i is introduced [6], where m_i contains the magnitudes of the VC messages associated with CN i.
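Algorithm 1 can be sketched as a serial NumPy reference model with one column per layer. This is a minimal sketch for illustration only, not the paper's CUDA kernel; the function name, the alpha value and the toy (7,4) parity-check matrix used below are assumptions, not taken from the paper.

```python
import numpy as np

def cl_min_sum_decode(H, y, alpha=0.8, k_max=15):
    """Min-Sum column-layered decoding (Algorithm 1), one column (VN) per layer."""
    M, N = H.shape
    R = np.zeros((M, N))           # check-to-variable messages, R_{i,j} = 0
    L = np.where(H, y, 0.0)        # variable-to-check messages, L_{i,j} = y_j
    Lv = y.astype(float).copy()    # soft outputs
    for _ in range(k_max):
        for j in range(N):                       # layer l = column j
            rows = np.flatnonzero(H[:, j])       # C_j: CNs connected to VN j
            for i in rows:                       # CN update over V_i \ {j}
                cols = np.flatnonzero(H[i, :])
                others = cols[cols != j]
                # sign product times minimum magnitude (line 4)
                R[i, j] = np.prod(np.sign(L[i, others])) * np.min(np.abs(L[i, others]))
            Lv[j] = alpha * R[rows, j].sum() + y[j]   # line 8
            for i in rows:
                L[i, j] = Lv[j] - alpha * R[i, j]     # line 7, via Lv_j - alpha*R_{i,j}
    return (Lv < 0).astype(int)    # hard decision: BPSK maps bit 0 -> +1, bit 1 -> -1
```

Note that the VC update on line 7 is computed as Lv_j - alpha*R_{i,j}, which equals alpha times the sum over C_j \ i plus y_j, exactly as in the listing.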
Step a: remove sgn(L^(l-1)_{i,j}) and |L^(l-1)_{i,j}| from the sorted sequence m^(l-1)_i if |L^(l-1)_{i,j}| is in the sequence, forming the new sequence m^(l-1)'_i. R^l_{i,j} is then obtained as R^l_{i,j} = sgn(m^(l-1)'_i) * min(m^(l-1)'_i).

Step b: after L^l_{i,j} is generated, sort the magnitude of the new variable-to-check message L^l_{i,j} into m^(l-1)'_i to obtain m^l_i for layer l+1.

III. GPU-based CL Decoding Algorithm

From Algorithm 1, we can see that the CL decoding algorithm is serially processed column by column, which has a low degree of parallelism. The data dependence between successive iterations centers on L^(l-1)_{i,j} and L^l_{i,j}. However, by analyzing the QC-LDPC code, we find that the edges in a sub-matrix are never located on the same row as each other. The same holds among the M_b sub-matrices in the same column of the base matrix H_b. So in each column of the base matrix H_b, the update of L_{i,j} in each sub-matrix can be executed in parallel. Based on this idea, we partition the matrix H into N_b layers, each of which has Z columns. The CV and VC messages in each column within a layer can be updated in parallel.
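Returning to the step-a/step-b update above, it can be sketched with an explicit sorted magnitude list per CN plus a running sign product (the GPU implementation instead compresses m_i to the sgn/min/secmin triple). The function names and argument layout are illustrative assumptions.

```python
def step_a(m, sign_prod, L_old):
    """Step a: drop the old message for this edge from the sorted sequence m
    (if present) and read the CV message R off the remainder."""
    m2 = list(m)
    if abs(L_old) in m2:
        m2.remove(abs(L_old))                      # remove |L^(l-1)_{i,j}|
    s2 = sign_prod * (1 if L_old >= 0 else -1)     # divide out sgn(L^(l-1)_{i,j})
    R = s2 * min(m2)                               # sign product times smallest magnitude
    return R, m2, s2

def step_b(m2, s2, L_new):
    """Step b: insert the new VC message and restore sorted order for layer l+1."""
    m_next = sorted(m2 + [abs(L_new)])
    s_next = s2 * (1 if L_new >= 0 else -1)
    return m_next, s_next
```

Since only the smallest magnitude of the remainder is ever read, the sorted list can be compressed to its two smallest entries, which is what the sgn/min/secmin representation in Section III exploits.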

Fig. 1 shows the conceptual diagram of the parallelized CL decoding algorithm. The x dimension of a block is allocated to process the columns of one layer in parallel. Multiple layers are executed serially due to the data dependence. Let N_X be the x-dimension size, which is obtained as N_X = Z. In the meantime, the x dimension of the grid is used to execute multiple codewords simultaneously. Each block processes one codeword of length N.

In each thread, five steps are serially performed: the computation of m^(l-1)'_i, R^l_{i,j}, Lv_j, L^l_{i,j}, and m^l_i, where m^(l-1)'_i is the sequence of the (l-1)-th layer obtained by the step-a update. Among these steps, m^(l-1)'_i and R^l_{i,j} of all the edges in the corresponding column are computed serially, followed by the computation of Lv_j, obtained by summing R^l_{i,j} over all edges in the column. Then L^l_{i,j} is updated by subtracting R^l_{i,j} from Lv_j. After L^l_{i,j} is generated, m^l_i is updated with the new sign and magnitude through the step-b computation. Note that m_i and m'_i are both represented by three variables: sgn, min and secmin. The magnitudes of m_i and m'_i are kept sorted by updating the values of min and secmin. After all the threads complete the update for one layer, the __syncthreads() function is called to synchronize the threads within the block.

Fig. 1. Conceptual diagram of parallelizing the CL decoding algorithm (blocks along grid.x hold the N_cw codewords; threads along block.x process the Z columns of each of the N_b layers on the CUDA cores).

IV. Optimization Methods on GPU

A. Compressed Matrix Structure

CL decoding is based on the iterative message exchange between the VNs and CNs, which are represented by the positions of the non-zero elements in the H matrix. However, it is not preferable to store the whole H matrix on the GPU because of its huge memory size and low access efficiency. We exploit a compressed matrix structure, which is constituted by four arrays: H_VN, H_CN, RowNum and ColNum. H_VN scans the H matrix in column-major order and maps the row positions of the non-zero elements in each column.
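A host-side sketch of building the four arrays named above is given below. This is a minimal NumPy version for illustration; the array names follow the text, but the flat concatenated layout within each array is an assumption.

```python
import numpy as np

def compress(H):
    """Build the four-array compressed structure for a sparse binary H."""
    M, N = H.shape
    # Column-major scan: row positions of non-zeros in each column (H_VN),
    # plus the per-column non-zero counts (ColNum).
    H_VN = [i for j in range(N) for i in range(M) if H[i, j]]
    ColNum = [int(H[:, j].sum()) for j in range(N)]
    # Row-major scan: column positions of non-zeros in each row (H_CN),
    # plus the per-row non-zero counts (RowNum).
    H_CN = [j for i in range(M) for j in range(N) if H[i, j]]
    RowNum = [int(H[i, :].sum()) for i in range(M)]
    return H_VN, H_CN, RowNum, ColNum
```

Only the non-zero positions and per-row/per-column counts are kept, so the memory footprint scales with the number of edges rather than with M x N.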
H_CN is the corresponding permutation array, storing the column positions of the non-zero elements in each row. RowNum and ColNum describe the number of non-zero elements in each row and column, respectively.

B. Memory Optimization

1) Texture Memory: The texture memory, unlike the constant memory, can be updated through the memory copy function from host to device, which is flexible in the multi-standard design. The texture memory is read-only memory whose access latency is as low as that of registers when the data fits in the texture cache. The four arrays of the compressed matrix structure are stored in the texture memory. Furthermore, the input sequence y is also cached in the texture memory. All these structures have a small storage footprint, so they hit in the cache when read by the threads.

2) Shared Memory and Registers: On the GPU, the number of resident blocks per SM, N_BM, decides the performance of the application, and is obtained as:

  N_BM = min(N_BR, N_BS, N_BB)    (1)

where N_BR, N_BS and N_BB are the values of N_BM as restricted by the registers, the shared memory and the maximum number of resident blocks, respectively. When shared memory and registers are used judiciously, memory accesses happen in the on-chip memories instead of the off-chip global memory. The fewer shared memory and registers used, the larger N_BM is. The ideal situation is N_BR = N_BS = N_BB. On GPUs with compute capability less than 3.0, N_BB = 8. In our GPU-based CL decoding, there is no data dependence among the blocks, so it is preferable to adopt the on-chip shared memory and registers. We store sgn, min and secmin in the shared memory. Some intermediate results, such as R_{i,j} and the sum of R_{i,j} over the edges of each column, are also stored in the shared memory. As mentioned in Section 2, 32 threads are organized as a warp. To eliminate warp divergence, the shared memory addresses are organized serially by the thread identities (IDs). Registers are used for the variables exclusive to each thread, such as the iteration number, the index of the current column and so on.
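The resident-block arithmetic of eq. (1), together with eqs. (2) and (3) introduced later, can be checked with a short script. The GTX580 figures (49152 bytes of shared memory and 32768 registers per SM, N_BB = 8, 16 SMs) are those quoted in the text; the per-block costs of Z x 32 bytes and Z x 21 registers are an assumption consistent with the N_Y and N_cw values reported in Table I.

```python
# Hypothetical GTX580 resource figures as quoted in the paper.
SHARED_PER_SM, REGS_PER_SM = 49152, 32768
N_BB, N_SM = 8, 16   # max resident blocks (compute capability < 3.0), number of SMs

def occupancy(Z):
    """Resident blocks, second thread dimension and codeword count for expansion factor Z."""
    n_bs = SHARED_PER_SM // (Z * 32)   # blocks per SM as limited by shared memory
    n_br = REGS_PER_SM // (Z * 21)     # blocks per SM as limited by registers (assumed cost)
    n_bm = min(n_br, n_bs, N_BB)       # eq. (1): resident blocks per SM
    n_y = min(n_br, n_bs) // N_BB      # eq. (2): second thread dimension
    n_cw = n_bm * N_SM * 4 * n_y       # eq. (3): 4 packed codewords per 32-bit word
    return n_bm, n_y, n_cw

# e.g. occupancy(96) for N = 2304 (802.16e) or occupancy(27) for N = 648 (802.11n)
```

With these per-block costs the script reproduces the N_Y = 2, N_cw = 1024 configuration for Z = 96 and N_Y = 7, N_cw = 3584 for Z = 27 listed in Table I.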
In our implementation, Z x 32 bytes of shared memory and about Z x 21 registers are used per block. In the GTX580, there are 49152 bytes of shared memory and 32768 registers per SM. We can find that N_BR = N_BS.

3) Global Memory: After this use of shared memory and registers, only two variables remain in the global memory, Lv and L. The first one only creates a store access in each layer, while the second one generates a read and a write access from/to global memory per layer. After fetching L from the global memory, the data can be cached in registers, which have no access latency. In Fermi GPUs, the global access time can be diminished by efficiently utilizing the global memory bandwidth, merging the accesses of the 32 threads in a warp into a single memory transaction. The three accesses can adopt this coalesced memory access method, only needing to organize the addresses of Lv and L serially by the thread IDs.

C. Codeword Packing Scheme

In CL decoding, the demand on data precision is not high, so there is no need to adopt floating-point data precision with its huge memory bandwidth. In fact, experiments have shown that an 8-bit fixed-point implementation has BER performance close to the floating-point one. In order to increase arithmetic intensity and efficiently utilize the memory bandwidth, we exploit the codeword packing scheme, combining four 8-bit codewords into one 32-bit word. The four distinct codewords are stored in the shared memory, except for Lv and L. When fetching the L_{i,j} value from the global memory, the messages are unpacked into four 8-bit messages. After computation, we pack the four messages and write them back to the global memory. The same applies to Lv. The codeword packing scheme reduces the memory copy time and the global memory access time by a factor of four, and increases the number of codewords processed simultaneously by four times.

D. Two-dimension Thread Configuration

As mentioned in the optimization of shared memory and registers, the ideal situation is N_BR = N_BS = N_BB. However, on some GPUs integrating large computing resources, the situation becomes N_BR = N_BS > N_BB. For example, on the GTX580, N_BR and N_BS in our implementation are both equal to 16 if the code length N is 2304 and Z is 96, which is twice N_BB (8 on the GTX580). In order to fully utilize the computing resources of the GPU, we present a two-dimension thread configuration to achieve maximum utilization. The size of the second dimension, N_Y, is obtained as:

  N_Y = min(N_BR, N_BS) / N_BB    (2)

The two-dimension thread configuration fully utilizes the on-chip shared memory and registers, which increases the number of codewords processed simultaneously on the GPU by N_Y times.

E. Asynchronous Data Transfer

CUDA manages asynchronous data transfer through streams, which are sequences of commands executing in order. By applying stream execution, the data transfer between the host computer and the GPU can be overlapped with the computation procedure. In the non-streamed process, the data transfer and kernel execution are serially processed.
However, in the multi-streamed process, all the data transfers are overlapped with the multiple kernel executions, except the first host-to-device transfer and the last device-to-host transfer. In the non-streamed implementation, the total memory copy time occupies about 33.6 percent of the total time. In contrast, with the 8-streamed implementation, the total memory copy time is reduced to 6.0%.

V. Performance Evaluation

A. Experiment Setup

We lay out the testing scenario as follows. The binary data bits are randomly generated by the host computer, encoded by the LDPC encoder, mapped by BPSK modulation and then passed through the AWGN channel. In the receiver, the signal is delivered to the global memory of the GPU to perform the CL LDPC decoding procedure. The output bits are then delivered back to the host computer to calculate the BER. The CPU of the host computer is an Intel i3 530 with a frequency of 2.93 GHz, and the GPU is a GTX580 with compute capability 2.0. The frequency of the GTX580 is 1.54 GHz and the size of its global memory is 1.5 GB. There are 16 SMs in the GTX580, each of which integrates 49152 bytes of shared memory and 32768 registers.

Fig. 2. BER performance comparison of TPMP (N = 2304 and 576, 10 iterations) and the proposed CL decoding algorithm (N = 2304 and 576, 5 iterations), over Eb/No from 1.0 to 4.5 dB.

B. Decoder BER Performance

To evaluate the BER performance of the proposed decoder, we measure the BER comparison between TPMP and the presented CL decoding algorithm. IEEE 802.16e rate-1/2 LDPC codes with lengths of 576 and 2304 are used for the comparison. The resulting BER performance under the AWGN channel is shown in Fig. 2. It clearly depicts that the BER performance of the CL decoding algorithm after 5 iterations is close to that of the TPMP algorithm after 10 iterations.

C. Decoder Throughput

Table I shows the throughput of the proposed GPU-based CL LDPC decoder for various code lengths, five of which are for 802.16e and three for 802.11n.
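The factor of four in these codeword counts comes from the codeword packing scheme of Section IV-C. A bit-level sketch in plain Python is given below; the helper names pack4/unpack4 are hypothetical, and the CUDA version would operate on 32-bit registers instead of Python integers.

```python
def pack4(b0, b1, b2, b3):
    """Pack four signed 8-bit messages (one per codeword) into one 32-bit word."""
    word = 0
    for k, b in enumerate((b0, b1, b2, b3)):
        word |= (b & 0xFF) << (8 * k)   # two's-complement byte into lane k
    return word

def unpack4(word):
    """Unpack a 32-bit word back into four signed 8-bit messages."""
    vals = []
    for k in range(4):
        v = (word >> (8 * k)) & 0xFF
        vals.append(v - 256 if v >= 128 else v)   # restore the sign
    return vals
```

One packed load or store thus moves the messages of four codewords at once, which is why the memory traffic drops by a factor of four.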
The number of codewords N_cw is obtained as:

  N_cw = N_BM x N_SM x 4 x N_Y    (3)

where N_SM denotes the number of SMs in the GPU. For each length, the five optimization methods mentioned in Section 4 are employed to enhance the throughput. We also list the optimization process, adding the methods one by one; the characters A to E denote the combination of the methods they represent in Section 4. The throughput is computed by dividing the total bits N_cw x N by the decoding time t.

TABLE I
Throughput of the proposed decoder for different code lengths and optimization methods

                            Throughput (Mbps)
  N     Z    N_Y  N_cw    A    AB   ABC  ABCD  ABCDE
  768   32   6    3072    9    78   296  518   704
  1152  48   4    2048    9    89   333  508   689
  1536  64   3    1536    10   106  386  516   712
  1920  80   2    1024    11   102  377  493   695
  2304  96   2    1024    13   112  416  507   710
  648   27   7    3584    8    69   262  487   633
  1296  54   3    1536    9    93   335  505   691
  1944  81   2    1024    10   101  365  500   685

From the table, we can find that optimization methods B, C, D and E achieve average performance improvements of 9.6x, 3.7x, 1.5x and 1.4x, respectively. By exploiting the two-dimension thread configuration, N_BR/N_Y = N_BS/N_Y = N_BB on the GTX580, which is the ideal configuration for maximum throughput. The peak throughput is achieved at code length N = 1536. Three code lengths obtain throughput over 700 Mbps, because their block size Z is a multiple of 32, the warp size. Conversely, the throughput drops if Z is not a multiple of 32 because of the wasted issue slots in the warps. When decoding an 802.11n or 802.16e frame with fewer codewords, e.g. 128 words, the processing time is less than 1 ms, which falls within the frame duration.

D. Performance Comparison

The throughput of the CL LDPC decoder on CPUs is 2.6 Mbps, so our GPU-based decoder achieves about 240x to 273x speedup over that. Table II shows the performance comparison between different LDPC decoders. We can see that our proposed CL LDPC decoder is comparable to the ASIC implementations [1][2]. All the other GPU-based LDPC decoders [7]-[12] exploited the TPMP algorithm, except the RL decoding in [8]; our decoder is the only one exploiting CL decoding. The maximum throughput of these decoders is 209 Mbps, in [12]. In order to make the comparison fair, we also ran our decoder on their GPU types, the 9800 GTX+ and the C2050. We can see that our decoder achieves about 1.5x and 3.0x speedup over [8] on the 9800 GTX+ and over [12] on the C2050, respectively.
The difference is due to the limited resources of the 9800 GTX+, where the two-dimension thread configuration cannot be performed.

VI. Conclusion

In this paper, an efficient CL QC-LDPC decoder on the GPU has been presented. We propose the GPU-based parallel CL decoding algorithm, and several optimization methods are exploited to enhance the throughput. The throughput of the proposed decoder is about two orders of magnitude higher than that of the CPU implementation. The peak throughput is 712 Mbps, which is a 3.0x speedup over the existing fastest GPU-based LDPC decoder, and comparable to dedicated hardware solutions.

TABLE II
Performance comparison of various LDPC decoders

  ref.   Platform    Algo. type  Code length  Pre. (bits)  Iter.  Thr. (Mbps)
  [1]    ASIC        RL          2304         7            5      249
  [2]    ASIC        RL          2304         5            7      679
  [7]    8800 GTX    TPMP        8000         8            10     96.2
  [8]    9800 GTX+   RL          2304         16           5      160.0
  [9]    GTX260      TPMP        2304         32           20     24.5
  [10]   GTX470      TPMP        2304         8            10     152.2
  [11]   GTX570      TPMP        16200        8            20     92.4
  [12]   C2050       TPMP        8000         8            10     209.0
  Ours   9800 GTX+   CL          2304         8            5      235.6
  Ours   C2050       CL          2304         8            5      618.3
  Ours   GTX580      CL          2304         8            5      710.0

The proposed decoder can be employed as part of a GPU-based communication system, which is an alternative to the hardware solutions.

Acknowledgment

This work was supported by the National Science Foundation of China (62520).

References

[1] M. Awais, A. Singh, E. Boutillon, and G. Masera, "A Novel Architecture for Scalable, High Throughput, Multi-Standard LDPC Decoder," in DSD 2011: 14th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 340-347, Aug. 2011.
[2] C. Roth, P. Meinerzhagen, C. Studer, and A. Burg, "A 15.8 pJ/bit/iter Quasi-Cyclic LDPC Decoder for IEEE 802.11n in 90 nm CMOS," in 2010 IEEE Asian Solid-State Circuits Conference, pp. 313-316, Nov. 2010.
[3] NVIDIA Corporation, "NVIDIA CUDA Compute Unified Device Architecture Programming Guide, version 4.2," 2012.
[4] M. M. Mansour and N. R. Shanbhag, "High-throughput LDPC decoders," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 11, no. 6, pp. 976-996, Dec. 2003.
[5] J. Zhang and M. P. C. Fossorier, "Shuffled iterative decoding," IEEE Trans. Communications, vol. 53, no. 2, pp. 209-213, Feb. 2005.
[6] Z. Cui, Z. Wang, X. Zhang, and Q. Jia, "Efficient decoder design for high-throughput LDPC decoding," in Proc. 2008 IEEE Asia Pacific Conf. Circuits and Systems, pp. 640-643, Nov. 2008.
[7] G. Falcao, V. Silva, and L. Sousa, "How GPUs can outperform ASICs for fast LDPC decoding," in International Conference on Supercomputing, pp. 390-399, June 2009.
[8] A. K. Kumar, "A scalable LDPC decoder on GPU," in 24th International Conference on VLSI Design, pp. 183-188, Jan. 2011.
[9] J. Cui, Y. Wang, and H. Yu, "Systematic construction and verification methodology for LDPC codes," Lecture Notes in Computer Science, vol. 6843 LNCS, pp. 366-379, Aug. 2011.
[10] G. Wang, M. Wu, Y. Sun, and J. R. Cavallaro, "A massively parallel implementation of QC-LDPC decoder on GPU," in SASP 2011: Proceedings of the 2011 IEEE 9th Symposium on Application Specific Processors, pp. 82-85, June 2011.
[11] S. Gronroos, K. Nybom, and J. Bjorkqvist, "Efficient GPU and CPU-based LDPC decoders for long codewords," Analog Integrated Circuits and Signal Processing, pp. 1-13, Nov. 2012.
[12] G. Falcao, V. Silva, L. Sousa, and J. Andrade, "Portable LDPC decoding on multicores using OpenCL," IEEE Signal Processing Magazine, vol. 29, no. 4, pp. 81-87, Apr. 2012.
[13] "Air Interface for Fixed and Mobile Broadband Wireless Access Systems," IEEE Std 802.16e, Feb. 2006.
[14] "Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications," IEEE Std 802.11n, Oct. 2009.