A GPU Implementation for two MIMO OFDM Detectors

Teemu Nyländen, Janne Janhunen, Olli Silvén, Markku Juntti
Computer Science and Engineering Laboratory, University of Oulu, Finland
Centre for Wireless Communications, University of Oulu, Finland
{teemu.nylanden, {janne.janhunen,

Abstract

Two real-valued signal models based on the selective spanning with fast enumeration (SSFE) and layered orthogonal lattice detector (LORD) algorithms are implemented on an Nvidia graphics processing unit (GPU). A 2×2 multiple-input multiple-output (MIMO) antenna system with 16-quadrature amplitude modulation (16-QAM) is assumed. The level update vector chosen for SSFE is based on computer simulations carried out in the MATLAB environment. We implemented the algorithms on an Nvidia Quadro FX 1700 GPU and achieved a throughput of Mbps for SSFE and 16.8 Mbps for LORD. The results show that the general-purpose graphics processing unit (GPGPU) has the potential to achieve high throughput, provided that the detection algorithm allows efficient parallel processing. The latencies of the control code and of the partial Euclidean distance (PED) calculations are very small, but the latency of memory loads and stores to the GPU's global memory is significant. We also compare our results with those of a trellis-based detector implementation on a GPU, in which a more powerful GPU and a different detection algorithm are used. GPUs offer superior computing power and programmability compared to the application-specific software defined radio (SDR) designs implemented so far.

I. INTRODUCTION

The third generation partnership project (3GPP) long term evolution (LTE) targets include transmitting 100 Mbps over wireless connections [1]. High-rate wireless communication needs power-efficient solutions to process the increasing amounts of data with limited hardware and low power consumption.
The multiple-input multiple-output (MIMO) antenna system combined with the orthogonal frequency division multiplexing technique (MIMO OFDM) has been included in multiple wireless telecommunication standards, such as the IEEE wireless local area network (WLAN), IEEE wireless metropolitan area network (WMAN), Worldwide Interoperability for Microwave Access (WiMAX) and the 3GPP LTE. The multipath environment makes the MIMO channel frequency-selective, and OFDM can transform such a channel into a set of parallel frequency-flat MIMO channels. This transform decreases the computational complexity of the receiver.

Because multiple standards are being proposed for wireless and wired communication, flexibility is required from the terminal device. The advantage of the software defined radio (SDR) is its flexibility, both in an environment with multiple standards and within a single standard. However, the increased computing loads of future applications pose new challenges to SDR implementations. GPUs possess the programmable flexibility and the computing power to meet these challenges.

The three major computational blocks in a MIMO OFDM receiver are the fast Fourier transform (FFT), detection and channel decoding. As illustrated in [2], a graphics processing unit (GPU) implementation of a channel decoding block with low-density parity-check (LDPC) coding achieves a high throughput. The CUDA architecture offers built-in functions for efficient FFT processing on a GPU. As this study shows, the detection block can also be mapped on a GPU, provided that a detection algorithm that can be parallelized efficiently is chosen. On the other hand, GPUs have not been designed to be used as SDR processors. Consequently, one of the goals of this work is to identify their shortcomings for future improvements.

The maximum likelihood (ML) detector is optimal for finding the closest lattice point [3].
However, it is often not feasible in real implementations, because its computational complexity increases exponentially with the number of transmit antennas. Selective spanning with fast enumeration (SSFE) [4] and the layered orthogonal lattice detector (LORD) [5] calculate a near-ML solution with reduced computational complexity compared to full-complexity exhaustive-search ML detectors. We studied whether the Nvidia Quadro FX 1700 can fulfill the real-time requirements of the MIMO OFDM detector.

The SSFE algorithm is a near-ML MIMO detection algorithm that produces a deterministic and regular data flow [4]. For a real-valued system, SSFE can be characterized by a level update vector $\mathbf{m} = [m_1, \ldots, m_{2N}]$, where $N$ is the number of transmit antennas. The level update vector also defines the computational complexity of the algorithm. The LORD algorithm offers maximum a posteriori (MAP) performance with a 2×2 antenna system [6], and it achieves this with rather low computational complexity if 16-quadrature amplitude modulation (16-QAM) is assumed.

Both algorithms proceed one level at a time by calculating the partial Euclidean distances (PEDs) for each level. Most of the computational complexity of the algorithms originates from the PED calculations and slicing operations. These operations can, however, be performed in parallel for each level. High computing power is nonetheless required to meet the real-time requirements of the detector. The algorithm presented in [4] was modified to use a real-valued signal model instead of the complex-valued model of the original study. In addition, the original real-valued LORD [5] was implemented.

GPUs are designed for graphics processing and therefore cannot be used efficiently as SDR processors as such. The purpose of this paper, however, is to study the possibilities of massively parallel processing for MIMO OFDM detection. Restrictions such as power consumption and data transfers to and from the device are acknowledged by the authors, but ignored since they are outside the scope of this study.

The rest of the paper is organized as follows. The system model and maximum likelihood detection are briefly presented in Section II. The SSFE and LORD algorithms and the simulation results are presented in Section III. The Nvidia GPU, our detector implementations and the results are presented in Sections IV to VI. Section VII compares our implementations with the implementation in [7]. The final section concludes the paper.

II. SYSTEM MODEL

A MIMO OFDM based multiple antenna system with $N$ transmit and $M$ receive antennas is assumed. Figure 1 presents the block diagram of the MIMO OFDM transmission architecture. In this study, the detector block of the receiver is implemented on a single-precision floating-point GPU.

Fig. 1. Block diagram of the MIMO OFDM transmission. (Blocks: encoding, interleaving, mapping, S/P, OFDM modulation, channel, OFDM demodulation, channel and SNR estimation, soft detection, P/S, deinterleaving, decoding.)

The received signal on the $s$th subcarrier can be presented as

$\mathbf{y}_s = \mathbf{H}_s \mathbf{x}_s + \mathbf{n}_s, \quad s = 1, 2, \ldots, S$    (1)

where $S$ is the number of subcarriers, $\mathbf{y}_s \in \mathbb{C}^{M}$ is the received signal vector, $\mathbf{x}_s \in \mathbb{C}^{N}$ is the transmitted symbol vector and $\mathbf{n}_s \in \mathbb{C}^{M}$ is the noise vector. The symbol $\mathbf{H}_s \in \mathbb{C}^{M \times N}$ denotes the channel matrix. The entries of $\mathbf{x}_s$ are chosen independently of each other from a QAM constellation.
The ML detector selects the lattice point $\mathbf{H}\mathbf{x}$ that minimizes the Euclidean distance to the received vector $\mathbf{y}$, i.e.,

$\hat{\mathbf{x}} = \arg\min_{\mathbf{x} \in \mathcal{A}^{N}} \|\mathbf{y} - \mathbf{H}\mathbf{x}\|^2$    (2)

where $\mathcal{A}$ is the symbol alphabet and $\|\cdot\|$ denotes the Frobenius norm of a vector. An exhaustive search can be used to solve the ML detection problem. However, it becomes computationally infeasible as the set of lattice points grows. The SSFE and LORD algorithms approximate the ML solution of (2) by limiting the search to the lattice points within a search tree specified by the algorithms.

Due to additive noise in the channel, the received symbol lies somewhere between the lattice points. At this point, the maximum likelihood method would calculate the Euclidean distance between the received symbol and every constellation point, whereas the SSFE and LORD algorithms only calculate the distances to a limited set of constellation points within the search tree. The depth of the search tree depends on the number of receive antennas.

III. DETECTORS

The channel matrix $\mathbf{H}$ can be QR decomposed (QRD) into two parts. If the numbers of transmit and receive antennas are equal, the channel matrix can be presented as $\mathbf{H} = \mathbf{Q}\mathbf{R}$, where $\mathbf{Q}$ denotes an $N \times N$ orthogonal matrix and $\mathbf{R}$ is an $N \times N$ upper triangular matrix. After the QR decomposition, the metric $\|\mathbf{y} - \mathbf{H}\mathbf{x}\|^2$ can be rewritten as

$\|\mathbf{y} - \mathbf{Q}\mathbf{R}\mathbf{x}\|^2 = \|\mathbf{Q}^{H}\mathbf{y} - \mathbf{R}\mathbf{x}\|^2$    (3)

where $\mathbf{Q}^{H}$ denotes the Hermitian transpose of the matrix $\mathbf{Q}$. By denoting $\mathbf{Q}^{H}\mathbf{y} = \hat{\mathbf{y}}$, we get

$\|\hat{\mathbf{y}} - \mathbf{R}\mathbf{x}\|^2$.    (4)

Both implemented algorithms include the QRD as preprocessing. The upper triangular matrix $\mathbf{R}$ and the vector $\hat{\mathbf{y}}$ generated in the QRD were stored as fixed values in the GPU's constant memory.

A. Selective Spanning with Fast Enumeration

The SSFE algorithm provides a fixed throughput and computational complexity. It is also easily parallelized, and is thus an interesting alternative for implementation.
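The equivalence of the metrics in (3) and (4) can be checked numerically. The following is a minimal Python sketch under stated assumptions, not the authors' preprocessing code: the helper names (`real_valued`, `qr_gram_schmidt`, `matvec`, `sq_dist`), the channel values and the noise values are all made up for illustration, the real-valued model is built with the usual stacking of real and imaginary parts (the text does not spell out its construction), and classical Gram-Schmidt stands in for whatever QRD method was actually used.

```python
from math import sqrt


def real_valued(Hc):
    """Stack a complex M x N matrix into the real 2M x 2N matrix [[Re, -Im], [Im, Re]]."""
    top = [[h.real for h in row] + [-h.imag for h in row] for row in Hc]
    bottom = [[h.imag for h in row] + [h.real for h in row] for row in Hc]
    return top + bottom


def qr_gram_schmidt(H):
    """QR-decompose a square, full-rank real matrix; returns (columns of Q, R)."""
    n = len(H)
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [H[r][j] for r in range(n)]          # j-th column of H
        w = v[:]
        for i, q in enumerate(q_cols):           # remove components along earlier q's
            R[i][j] = sum(a * b for a, b in zip(q, v))
            w = [wk - R[i][j] * qk for wk, qk in zip(w, q)]
        R[j][j] = sqrt(sum(wk * wk for wk in w))
        q_cols.append([wk / R[j][j] for wk in w])
    return q_cols, R


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


# Illustrative 2x2 complex channel (hypothetical values) and its 4x4 real model.
Hc = [[1.0 + 0.5j, 0.3 - 0.2j], [-0.4 + 0.1j, 0.9 + 0.7j]]
H = real_valued(Hc)

# Real-valued symbol vector [Re x1, Re x2, Im x1, Im x2] with 4-PAM entries.
x = [1.0, -3.0, 3.0, -1.0]
y = [v + e for v, e in zip(matvec(H, x), [0.05, -0.02, 0.01, 0.03])]

q_cols, R = qr_gram_schmidt(H)
y_hat = [sum(a * b for a, b in zip(q, y)) for q in q_cols]   # y_hat = Q^T y

lhs = sq_dist(y, matvec(H, x))       # ||y - Hx||^2
rhs = sq_dist(y_hat, matvec(R, x))   # ||y_hat - Rx||^2
print(lhs, rhs)
```

Because $\mathbf{R}$ is upper triangular, the last entry of $\hat{\mathbf{y}} - \mathbf{R}\mathbf{x}$ depends on the last symbol only, which is what allows the metric to be accumulated level by level as partial Euclidean distances.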
The computational complexity of the algorithm depends only on the number of antennas and the level update vector. The level update vector also determines the output list size. By setting the level update vector to m = [1111], only four PED calculations and slicing operations are performed. At the other extreme, the algorithm with level update vector m = [4444] achieves MAP performance, but a total of 256 PED calculations and slicing operations need to be performed. By carefully choosing the level update vector, a compromise between error rate and computational complexity is achieved.

The SSFE algorithm is based on a tree search strategy, i.e., the algorithm traverses a search tree by calculating all the admissible PEDs and storing them in an intermediate list in memory. On the next level, the search continues with the nodes determined by the level update vector, until the leaf nodes are reached. After the final level, the final candidate list is used for the log likelihood ratio (LLR) calculation. It should be noted, however, that the final candidate list may not include the lowest Euclidean distances. Figure 2 presents the search tree of the SSFE algorithm, where the level update vector m = [1144] results in an output list size of 16. A real-valued signal model, a 2×2 antenna system and 16-QAM are assumed.

Fig. 2. Example of an SSFE search tree.

B. The Layered Orthogonal Lattice Detector

The layered orthogonal lattice detector (LORD) is a soft-output near-optimal lattice detector that relies on a channel orthogonalization process [6]. The LORD algorithm is very similar to the SSFE algorithm. The greatest difference is that the channel matrix is reordered in the preprocessing, and that a separate QRD and a separate tree search are required for each transmit antenna.

Assuming a 2×2 system, the LORD algorithm achieves MAP performance. With a real-valued 16-QAM system, the SSFE algorithm would need a full search with m = [4444] to achieve the same FER performance, resulting in eight times as many calculations. However, the LORD algorithm is heavily dependent on the constellation and the number of antennas used. With higher order modulations the computational complexity increases rapidly. Compared to the SSFE algorithm, the LORD algorithm also wastes memory resources, which can cause problems with GPU-type parallel processing.

Figure 2 also presents the search tree pruning for the LORD algorithm. A real-valued signal model, a 2×2 antenna system and 16-QAM are assumed. The search tree is exactly the same as with the SSFE algorithm with m = [1144]. The only difference is that with the LORD algorithm there are two search trees, one for each transmit antenna. Assuming a V-QAM modulation, the LORD algorithm covers all V 2 values for the in-phase (I) and quadrature-phase (Q) components of the lowest level antenna. Each of the covered values is decoded with spatial decision feedback equalizing (DFE) processing at the higher level antennas. This is performed by a slicing operation somewhat similar to the SSFE slicing operation. The high dependency on the constellation used quickly becomes the limiting factor with the LORD algorithm.

C. Performance Example

A good compromise on the list size for SSFE was determined by running simulations. The list size has a significant impact on the computational complexity of the SSFE algorithm. Parameter studies were performed on a MIMO OFDM simulator running in the MATLAB environment. In the simulator, one frame corresponds to one OFDM symbol and consists of 300 individual symbol vectors, each mapped to one OFDM subcarrier. Table I presents the simulation parameters, inspired by the 3G LTE specifications [8], [9], [10]. The corresponding frame error rate (FER) curves are presented in Figure 3.

TABLE I
SIMULATION PARAMETERS

Number of subcarriers              512, of which 300 used
Bandwidth                          5 MHz
Carrier frequency                  2.4 GHz
Cyclic prefix (CP) duration        4.69 μs
Symbol duration                    μs
MIMO scheme                        VBLAST
Channel code                       Turbo code with six iterations
Code rate                          1/2
Channel model                      Typical urban, 6 taps
User velocity                      120 km/h
Base station antenna separation    4λ
Mobile antenna separation          0.5λ

Fig. 3. FER comparison for real-valued QAM systems. (2×2 MIMO system, 16-QAM, correlated channel; curves: SSFE, LORD, MAP; axes: SNR (dB) and FER.)

Based on the simulations, a list size of 16 was found to offer a good compromise between computational complexity and FER performance. Further simulations were performed to discover the best configuration for the level update vector. It was found that m = [1224] offers the best FER performance with a list size of 16.
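The role of the level update vector can be made concrete with a small sketch. The Python code below is an illustration under stated assumptions, not the implementation evaluated in this paper: the function names, the example values of `R`, `x_true` and `noise`, the unscaled 4-PAM alphabet and the convention that `m[i]` is applied to row `i` of $\mathbf{R}$ (with detection starting from the last row) are all hypothetical choices. The nearest symbols are found by sorting the alphabet, which gives the same selection as a fast enumeration but is not the fast enumeration of [4].

```python
from itertools import product
from math import prod

PAM4 = (-3.0, -1.0, 1.0, 3.0)  # assumed per-dimension alphabet of real-valued 16-QAM


def list_size(m):
    """Number of leaf candidates produced by a level update vector."""
    return prod(m)


def ssfe(R, y_hat, m, alphabet=PAM4):
    """Tree search that spans the m[i] nearest symbols at level i.

    Returns a list of (symbol vector, accumulated PED) leaf candidates.
    """
    n = len(y_hat)
    paths = [((), 0.0)]  # (symbols for levels i+1..n-1, accumulated PED)
    for i in range(n - 1, -1, -1):
        expanded = []
        for tail, ped in paths:
            # Cancel the interference of the symbols already decided.
            interference = sum(R[i][i + 1 + k] * s for k, s in enumerate(tail))
            b = y_hat[i] - interference
            estimate = b / R[i][i]
            # Slicing: keep the m[i] alphabet points closest to the estimate.
            nearest = sorted(alphabet, key=lambda s: abs(s - estimate))[:m[i]]
            for s in nearest:
                expanded.append(((s,) + tail, ped + (b - R[i][i] * s) ** 2))
        paths = expanded
    return paths


def ml_exhaustive(R, y_hat, alphabet=PAM4):
    """Reference: exhaustive search over all symbol vectors using metric (4)."""
    n = len(y_hat)

    def distance(x):
        return sum((y_hat[i] - sum(R[i][j] * x[j] for j in range(i, n))) ** 2
                   for i in range(n))

    return min(product(alphabet, repeat=n), key=distance)


# Hypothetical upper triangular R and a lightly perturbed observation.
R = [[2.0, 0.3, -0.5, 0.1],
     [0.0, 1.5, 0.2, -0.4],
     [0.0, 0.0, 1.8, 0.6],
     [0.0, 0.0, 0.0, 1.2]]
x_true = (1.0, -3.0, 3.0, -1.0)
noise = (0.1, -0.05, 0.08, -0.02)
y_hat = [sum(R[i][j] * x_true[j] for j in range(4)) + noise[i] for i in range(4)]

candidates = ssfe(R, y_hat, m=(1, 1, 4, 4))
best, best_ped = min(candidates, key=lambda c: c[1])
print(len(candidates), best)
```

With `m = (1, 1, 4, 4)` the two levels detected first are fully spanned and the remaining two are only sliced, so the 16 leaves are independent of each other once the first two symbols are fixed.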
The vector m = [1144], however, maps better onto GPU processing than m = [1224], with only a small increase in FER. Figure 3 also shows that the LORD algorithm achieves MAP performance with the 2×2 antenna system and 16-QAM. Its computational complexity is, however, double that of SSFE with m = [1144].

IV. COMPUTE UNIFIED DEVICE ARCHITECTURE

The Nvidia Quadro FX 1700 is one of the mid-range products of the Nvidia Quadro product family. It consists of four streaming multiprocessors (SMs). Each SM contains eight pipelined scalar processor (SP) cores running at 920 MHz. The maximum number of active threads on the Quadro FX 1700 is 3072, i.e., 768 per SM. The peak rate supported by the Quadro FX 1700 is about 89 GFLOPS [11]. The Quadro FX 1700 has 512 MB of graphics double data rate 2 (GDDR2) global memory running at 400 MHz. It has a 128-bit memory interface and a memory bandwidth of 12.8 GB/s. The total amount of constant memory available is 64 kB, and 16 kB of shared memory is offered per block. The maximum power consumption of the Quadro FX 1700 is 42 W [11].

CUDA is a software programming model for writing scalable parallel programs in C. CUDA extensions are also available for some other standard programming languages, for example FORTRAN. CUDA is developed by Nvidia and requires an Nvidia GPU [12].

In the CUDA programming model, a GPU is viewed as a computing device that works as a co-processor for the main central processing unit (CPU). The CPU is often called the host and the GPU the device. The massive computational capability of GPUs is based on their high level of hardware parallelism. A GPU can have several SMs. Parallel portions of the program are executed as kernel functions on the device. A kernel is a function that is called from the CPU but executed on the GPU. Only a single kernel is executed at a time, but thousands of threads can be executed in parallel inside a single kernel function. A kernel is composed of a grid that consists of a set of equally sized thread blocks. At every kernel launch, the grid and block dimensions to be used are given to the kernel as an input.
One block can contain up to 512 threads. The grid can consist of multiple equally sized thread blocks, so the total number of threads equals the number of threads per block times the number of blocks. The number of thread blocks, however, depends more on the processed data than on the number of streaming multiprocessors available [13]. Figure 4 illustrates the composition of a kernel grid [14].

Fig. 4. An example of a kernel grid.

To manage the thousands of threads being processed simultaneously, the SMs use the single-instruction multiple-thread (SIMT) architecture. It maps each thread to an SP core and executes the threads independently, each with its own instruction address and register state [13]. The SIMT architecture concentrates on the execution of a single thread. The SIMT unit gathers the threads into groups of 32 parallel threads called warps. When a kernel function with one or more thread blocks is being executed, the SIMT unit divides the threads into warps and schedules them for execution. The threads inside a warp start executing simultaneously at the same program address, but they are free to branch and execute independently. The threads are also assigned unique, increasing thread IDs.

A GPU can have a large amount of off-chip memory, referred to as global memory. In addition to the global memory, GPUs also have fast on-chip memory and register resources. Although the global memory can be large, it is an off-chip resource and thus substantially slower than the on-chip resources. The latency penalty of transfers to and from global memory can be avoided to some extent by efficient use of the on-chip resources. When mapping algorithms on CUDA, it is important to minimize the number of global memory reads and writes, due to the long latency they incur.
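The relations between blocks, threads and warps described above reduce to a few lines of integer arithmetic. The sketch below is plain Python rather than CUDA code, and the function names and the two launch configurations are hypothetical; only the warp size of 32 and the limit of 3072 active threads are taken from the text.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def launch_geometry(threads_per_block, num_blocks, max_active_threads):
    """Summarize a one-dimensional launch configuration."""
    total_threads = threads_per_block * num_blocks
    # A partially filled warp still occupies a whole warp slot (ceiling division).
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    return {
        "total_threads": total_threads,
        "warps_per_block": warps_per_block,
        "total_warps": warps_per_block * num_blocks,
        # Fraction of the lanes in the allocated warps that carry a thread.
        "lane_utilization": threads_per_block / (warps_per_block * WARP_SIZE),
        # Share of the device's maximum number of active threads.
        "device_fraction": total_threads / max_active_threads,
    }


def global_thread_id(block_id, thread_id, threads_per_block):
    """Unique, increasing index of a thread within a one-dimensional grid."""
    return block_id * threads_per_block + thread_id


# Two illustrative configurations on a device with 3072 active threads.
full_warps = launch_geometry(32, 32, max_active_threads=3072)
half_warps = launch_geometry(16, 32, max_active_threads=3072)
print(full_warps)
print(half_warps)
```

A block of 16 threads still occupies one warp, so half of its lanes stay idle; this is one reason to prefer block sizes that are multiples of 32.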
Before the execution of a kernel starts, the required data has to be copied from the CPU's system memory to the GPU's global and constant memories. Data transfers to and from the device are slow due to the slow PCI Express bus, which is why the data should be kept in device memory as long as possible. This is a limiting factor for GPU detector implementations. However, since the purpose of this study was to explore the computational capacity of the GPU for MIMO OFDM detection, the data transfer issues were set aside and the main focus was on computing power.

V. MAPPING SSFE ON CUDA

The massive parallelism offered by GPUs makes it possible to run numerous independent tree searches in parallel on a single GPU. The computations required in the SSFE and LORD algorithms can be efficiently parallelized and mapped for GPU processing. When the SSFE algorithm is mapped with the vector m = [1144], the computations can be performed efficiently in parallel with 16 threads. However, to fill a full warp of 32 threads, at least two subcarrier detections need to be performed in parallel.

The simplest way to map the parallel searches would be to perform one subcarrier detection per thread block. However, fewer threads would then be active, since the maximum number of active thread blocks per SM is eight. Better performance can be achieved by mapping two or more parallel subcarrier detections to a single block, which increases the number of parallel subcarrier detections without the need to stall any warps.

In parallel programming models such as CUDA, conditional execution of code should be avoided if possible. When mapped for parallel processing with CUDA, the SSFE algorithm requires conditional execution of code at least in the slicing operations. The slicing operations in both algorithms are performed by exploiting the threadid and/or blockid variables. When branching occurs, the branches are executed serially, which naturally degrades overall performance. In our implementation, the threadid is mainly used in the slicing operations and the blockid is mostly used to select the received partial symbol vector for the calculations. Because branching could not be totally avoided with either of the algorithms, some portions of the code were executed serially.

A number of computer simulations with different grid and block configurations were performed. The detection kernel execution time was averaged over 1000 runs and the results were recorded using the CUDA Visual Profiler. The simulation results are presented in Table II.

TABLE II
SSFE DETECTOR CONFIGURATIONS
(columns: grid size (threads per block × blocks), throughput (Mbps), occupancy (%))

The peak performance of Mbps was achieved by mapping 64 parallel subcarrier detections on the GPU. The parallel subcarrier detections were performed with 32 thread blocks of 32 threads each, which gives a total of 1024 active threads per kernel. As the number of parallel subcarrier detections increased, the amount of branching required also increased. However, each branch only performs a single fetch from the constant memory and is therefore executed with a very small latency. As illustrated in Table II, the performance starts to deteriorate when the number of thread blocks exceeds 32 or the number of threads per block exceeds 64.
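The mapping of two subcarrier detections to each 32-thread block can be sketched as index arithmetic. The Python code below is a hedged illustration, not the authors' kernel: the text states that the block index selects the received vector and that the thread index drives the slicing, but it does not give the exact rule, so the way a lane is split into two branch choices here (`first_choice`, `second_choice`) is an assumption, as are all function and constant names.

```python
THREADS_PER_DETECTION = 16  # one tree path per thread for m = [1144]
SPAN = 4                    # branches spanned at each of the two fully spanned levels


def map_thread(block_id, thread_id, threads_per_block=32):
    """Map (block, thread) to (subcarrier, first_choice, second_choice).

    first_choice and second_choice are the ranks (0 = nearest) of the symbols
    taken at the two fully spanned tree levels; this split is hypothetical.
    """
    detections_per_block = threads_per_block // THREADS_PER_DETECTION
    subcarrier = block_id * detections_per_block + thread_id // THREADS_PER_DETECTION
    lane = thread_id % THREADS_PER_DETECTION
    first_choice, second_choice = divmod(lane, SPAN)
    return subcarrier, first_choice, second_choice


# 32 blocks of 32 threads: every thread gets one leaf of one subcarrier's tree.
assignments = [map_thread(b, t) for b in range(32) for t in range(32)]
subcarriers = {s for s, _, _ in assignments}
print(len(assignments), len(subcarriers))
```

Under this mapping the first 16 threads of a block work on one subcarrier and the remaining 16 on the next, so each warp of 32 threads is fully used.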
Table II also shows that a higher occupancy of the GPU does not guarantee better performance. The performance deterioration results from the limited amount of fast register resources. A block size of 64 threads increases the occupancy level of the GPU, but only six thread blocks were active, since the GPU ran out of register resources. With a GPU that has more SMs, more thread blocks could be used and a better throughput achieved.

Figure 5 illustrates the composition of the kernel grid used in this implementation. Due to the characteristics of the algorithm, a one-dimensional grid with one-dimensional thread blocks was used. Since only 16 threads were needed for one subcarrier detection, one thread block detects two subcarriers. The first 16 threads were used to detect the first subcarrier and the rest of the threads inside the block detected the second subcarrier. Block and thread indices were used to select which subcarrier was to be detected.

Fig. 5. Grid and block composition for the SSFE implementation.

The Quadro FX 1700 is capable of running 3072 active threads simultaneously. Thus, only one third of its full capacity was harnessed, due to the algorithm characteristics.

A. Memory Usage

After preprocessing, the generated values for $\mathbf{R}$ and $\hat{\mathbf{y}}$ were stored as fixed values in the constant memory to avoid unnecessary and costly data transfers between the host and the device. The fast register and shared memory resources were used in the computations, and only the final candidate and PED lists were written to the global memory and then transferred to the host. As discussed earlier, the focus of this study is on the computational power of GPUs, which is why the costly data transfers receive less attention. The shared memory was used for variables that could be shared among all the threads inside a single block. The registers were used to store variables and intermediate results that were only used by a single thread.
Table III shows the memory allocation for the different grid configurations.

TABLE III
SSFE MEMORY UTILIZATION
(columns: grid size, shared memory (per block, in bytes), registers (per thread))

The block size dictates how efficiently the threadid and blockid variables can be exploited in the computations. In this implementation, these variables are also involved in the branching necessitated by the algorithm. The threadid is mainly used for the slicing operations, and the blockid is mainly used for selecting the subcarriers to be processed. Table III also shows that the memory usage of the SSFE algorithm is small, making it a promising candidate for mobile solutions. The computational complexity is also quite small with a suitably selected level update vector.

VI. MAPPING LORD ON CUDA

As already mentioned, the LORD algorithm offers MAP performance with a 2×2 antenna system. The computational complexity is, however, highly dependent on the constellation used. In our implementation, real-valued 16-QAM was used, which kept the computational complexity rather low. With a 2×2 antenna system, two tree searches are required per subcarrier detection, compared to one search in the SSFE algorithm. If 16-QAM and a 2×2 antenna system are assumed, the computational complexity is doubled compared to SSFE with the vector m = [1144]. Then again, LORD achieves MAP performance, whereas the SSFE algorithm falls about 2 dB short of MAP performance with this configuration.

We implemented the two tree searches in a single kernel. One whole subcarrier detection for the LORD algorithm can be mapped efficiently with 32 threads, assuming a real-valued QAM system. The first 16 threads of a thread block performed the first tree search and the next 16 threads performed the second tree search concurrently. Due to the structure of the LORD algorithm, more branching was required than with the SSFE algorithm. The additional and more complex branching, the higher number of calculations and the less efficient memory utilization result in a lower throughput compared to the SSFE algorithm.

Table IV presents the simulation results for the LORD algorithm. Fewer simulations were performed with the LORD algorithm, since the branching and higher memory utilization degrade the performance at a rapidly accelerating pace. Table IV also shows that the GPU allocation starts to fall as the block size and the number of blocks are increased to 32.
According to the CUDA Occupancy Calculator, the occupancy should be 33 percent for this configuration, but the CUDA Visual Profiler reveals that the ineffective branching required only allows an occupancy of 25 percent.

TABLE IV
LORD DETECTOR CONFIGURATIONS
(columns: grid size (threads per block × blocks), throughput (Mbps), occupancy (%))

The composition of the grid used in the LORD implementation is very similar to that presented in Figure 5. The LORD implementation that results in peak performance also uses 32 blocks, but the block size is reduced to 16 threads. This means that the LORD algorithm needs two blocks instead of one to perform a single subcarrier detection and that only 512 active threads are in use. Only the block index was used to select which subcarrier was being detected, whereas the thread indices were used in the slicing operations as well as in selecting which tree search was being performed. Figure 6 illustrates the composition of the kernel grid and thread blocks for the LORD algorithm.

Fig. 6. Grid and block composition for the LORD algorithm.

A. Memory Usage

While SSFE uses only a small amount of memory, the memory requirements of the LORD algorithm are considerably larger. The computations for the two search trees alone almost double the memory requirements compared to SSFE. The characteristics of the LORD algorithm also require more variables for the calculations, which increases the memory requirements even further. The memory allocation for the LORD algorithm is presented in Table V.

TABLE V
LORD MEMORY UTILIZATION
(columns: grid size, shared memory (per block, in bytes), registers (per thread))

In addition, with higher order modulations and with antenna configurations larger than 2×2, the high memory usage becomes the bottleneck of LORD. In particular, the scarce register resources are insufficient for the LORD algorithm to be mapped efficiently on the GPU with larger antenna and constellation configurations.
The SSFE algorithm does not use as much memory as LORD, provided that the level update vector is selected properly. More parallel tree searches can therefore be performed by using SSFE.

VII. COMPARISON

GPU implementations of a MIMO OFDM detector were presented in [15] and [7]. In [7] the implementation achieves a peak throughput of Mbps with a complex-valued QAM system. Table VI [11] presents the major differences between the GPUs used in the implementations. Although the implementation results of this study fall short of the results presented in [7], the GPU in [7] was much more powerful than that used in this work. The GeForce 9600 GT used in [7] has twice the number of cores of the Quadro FX 1700, and its core speed and memory clock speed are also double those of the Quadro FX 1700. Taking the differences in GPU performance and the scalable programming model into consideration, the results presented in this work outrun the results achieved in [7]. However, it has to be noted that the LLR computations were not included in our implementation, unlike in [7].

TABLE VI
GPU RESOURCE COMPARISON

                     GeForce 9600 GT    Quadro FX 1700
Core clock           650 MHz            460 MHz
Shader clock         1625 MHz           920 MHz
Memory clock         900 MHz            400 MHz
Memory bandwidth     57.6 GB/s          12.8 GB/s
FLOPS                208 GFLOPS         GFLOPS

Table VII presents a comparison between the implementation results in terms of throughput (Mbps), goodput (Mbps) and execution time (ms). Throughput defines how much data the hardware can output per time unit. Goodput also takes into account the error probability that is typical for the detector in a certain channel condition. The goodput is the detection rate times (1 − FER) at the given SNR. Since the goodput is calculated after the decoder, the code rate is taken into account, which in this case is assumed to be 1/2. Note that the LORD algorithm performs better in bad channel conditions. However, when a better channel is available, the SSFE detector achieves a higher goodput. Our implementation can easily adapt to changing channel conditions by switching between the detection algorithms.

TABLE VII
COMPARISON OF THE RESULTS
(columns: SSFE, m = [1144]; LORD; Trellis based [7]; rows: throughput, goodput, execution time per number of subcarriers; legible entries: n/a, n/a, 14.2 μs, 2200 subcarriers)

In [7], the implementation used 16 threads for one subcarrier detection with a QAM system, similar to our SSFE implementation.
Our SSFE implementation maps two parallel subcarrier detections to each block and our LORD implementation one subcarrier detection per block, whereas [7] mapped four parallel subcarrier detections to each block. The block size in [7] is therefore 64, compared to 32 and 16 in our SSFE and LORD implementations, respectively. As presented earlier, any block size larger than 32 with SSFE or 16 with LORD decreased the performance of our implementation due to the limitations in fast memory resources. However, our implementation could be scaled to more powerful GPUs by increasing the number of thread blocks and therefore the number of parallel subcarrier detections. The total number of threads used in our implementations was 1024 for SSFE and 512 for LORD. Both of the implementations allocated only 33 percent of the GPU's resources. A higher GPU allocation was achieved with some configurations in this study, but with reduced performance.

VIII. CONCLUSION

Two MIMO OFDM detector implementations for single-precision floating-point GPU processing were presented. The implementations were designed for maximum throughput, but GPU utilization was also taken into account. Some drawbacks, such as power consumption and costly data transfers, were ignored in this study, because GPUs are not designed for SDR processing as such. The limited size of the fast on-chip memory resources and the required branching were found to be the limiting factors for the GPU implementations. An interesting future solution would be a GPU designed specifically for baseband processing. The emergence of the open computing language (OpenCL) will ease the realization of such a GPU. The implementations presented suit SDR processing well. For example, the SSFE algorithm can easily adapt to different channel conditions simply by changing the level update vector.
The GPU detector provides flexible solutions to support the different configurations included in future LTE systems. By remolding the memory and I/O architectures of GPUs, GPUs can meet the LTE performance requirements. The GPU based MIMO OFDM detector implementations proposed in this paper offer a promising solution for software defined radio, and GPUs specifically designed for baseband processing will make them even more promising.

REFERENCES

[1] 3rd Generation Partnership Project (3GPP); Technical Specification Group Radio Access Network, "Physical layer aspects for evolved UTRA (TR version (release 7))," 3rd Generation Partnership Project (3GPP), Tech. Rep.
[2] G. Falcão, V. Silva, and L. Sousa, "How GPUs can outperform ASICs for fast LDPC decoding," in Proceedings of the 23rd International Conference on Supercomputing (ICS '09), New York, USA, Jun. 2009.
[3] M. O. Damen, H. E. Gamal, and G. Caire, "On maximum likelihood detection and the search for the closest lattice point," IEEE Transactions on Information Theory, vol. 49, no. 10, Oct.
[4] M. Li, B. Bougard, L. V. D. Perre, and F. Catthoor, "Optimizing near-ML MIMO detector for SDR baseband on parallel programmable architectures," in Proc. of the Conference on Design, Automation and Test in Europe, Munich, Germany, Mar.
[5] M. Siti and M. Fitz, "Layered orthogonal lattice detector for two transmit antenna communications," in Proceedings of the Forty-Third Annual Allerton Conference on Communication, Control, and Computing, Sep.
[6] A. Tomasoni, M. Siti, M. Ferrari, and S. Bellini, "T-LORD: a MAP-approaching soft-input soft-output detector for iterative MIMO receivers," in Proceedings of the IEEE GLOBECOM 2007, Nov.
[7] M. Wu, Y. Sun, and J. R. Cavallaro, "Reconfigurable real-time MIMO detector on GPU," in IEEE 43rd Asilomar Conference on Signals, Systems and Computers, Pacific Grove, USA, Oct.
[8] 3rd Generation Partnership Project (3GPP).
[9] 3rd Generation Partnership Project (3GPP); Technical Specification Group Radio Access Network, "Physical layer aspects for evolved UTRA (TR version (release 7))," 3rd Generation Partnership Project (3GPP), Tech. Rep.
[10] 3rd Generation Partnership Project (3GPP), "TSGR1#41 R , EUTRA downlink numerology," 3rd Generation Partnership Project (3GPP), Tech. Rep.
[11] GPUReview, Tech. Rep.
[12] T. R. Halfhill, "Parallel processing with CUDA," Microprocessor, Jan.
[13] NVIDIA, "Programming guide version 2.1," NVIDIA Corporation, Tech. Rep.
[14] NVIDIA, "CUDA basics," NVIDIA Corporation, Tech. Rep.
[15] M. Wu, Y. Sun, and J. R. Cavallaro, "A GPU implementation of a real-time MIMO," in IEEE Workshop on Signal Processing Systems, Oct. 2009.
