DSP Design in Wireless Communication
Liang Liu and Fredrik Edman, liang.liu@eit.lth.se
The Evolving Wireless Scene
[Figure: data rate (1 Kb to 100 Mb) versus range (1 m to 10 km) for sensor networks, Bluetooth (PAN), 802.11 and 802.11a (LAN), and 2.5G/3G cellular (WAN). Trends: more bit/s and more bit/($·nJ). Courtesy: Prof. Jan Rabaey, BWRC]
Evolution of High-Speed Wireless
[Figure: data rate in Mbit/s (log scale, 10^-2 to 10^4) versus year, 1995 to 2015, followed by "5G?"]
- Cellular: GSM 9.6 Kb/s; GPRS 72 Kb/s; EDGE 474 Kb/s; HSDPA 2 Mb/s; HSPA+ 84 Mb/s; LTE 150 Mb/s; LTE-A 1 Gb/s
- WLAN: 802.11 2 Mb/s; 802.11b 11 Mb/s; 802.11a/g 54 Mb/s; 802.11n 600 Mb/s; 802.11ac 3.39 Gb/s; 802.11ad 7 Gb/s
- Enabling techniques: wideband, CDMA, OFDM, MIMO, high-order modulation (HighMod), massive MIMO
Same in other communication systems
- Road analogy: wider road, more passengers, multiple dimensions
- Wireless: wider bands (frequency), more bits, multiple antennas
Algorithms Beat Moore Beats Chemists
[Figure: growth on a log scale (1 to 10,000,000), 1980 to 2020, of algorithmic complexity (1G, 2G, 3G), processor performance (~Moore's Law), and battery capacity. Courtesy: Ravi Subramanian (Morphics)]
Design Considerations
- Small area
- Low power
- Low price
- Reliable
- High speed
- Flexible
- Time to market
Optimizing DSP Implementation
[Figure: optimization across the algorithm, architecture, module, and circuit levels, balancing functionality, performance, and implementation cost]
CDMA
CDMA: Code Division Multiple Access
[Figure from: S. Sriram et al., "A UMTS Baseband Receiver Chip for Infrastructure Applications," Texas Instruments]
Accumulator
UMTS filter in receiver system
[Figure: receiver chain: antenna, RF, ADC, reconfigurable UMTS filter, de-scramble & despread, demodulation, data. The spectrum sketch marks the in-band desired signal, the out-of-band signals, and 33 dB.]
Source: Deepak Dasalukunte et al., "Architectural Optimization for Low Power in a Reconfigurable UMTS Filter," in Proceedings of Wireless Personal Multimedia Communications Symposium (WPMC), San Diego, USA.
Adaptive UMTS Filter (Algorithm level)
- The 3GPP specification requires a minimum stop-band attenuation of 33 dB
- In a bad channel, this requires a filter length of 65 taps...
- ...but ONLY 5 taps are needed in a good one!
[Figure: tapped delay line (D D D D ...)]
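Constant Multiplier (Module level)
[Figure: FIR tap chain with coefficients Coe[0] to Coe[127], delay elements (D), and adders, shown twice: once with a multiplier per coefficient, and once with each coefficient selected by a MUX and applied through a shift & add unit]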
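The shift & add idea named on this slide can be illustrated with a minimal sketch: multiplying by a fixed coefficient reduces to summing shifted copies of the input, one per set bit of the coefficient. The function names are hypothetical, and the sketch assumes non-negative integer coefficients in plain binary (not the MUX-based circuit of the slide, whose details are not recoverable from the text).

```python
def shift_add_terms(coefficient):
    """Return the bit positions that are set in a non-negative integer coefficient."""
    if coefficient < 0:
        raise ValueError("this sketch handles non-negative coefficients only")
    return [bit for bit in range(coefficient.bit_length()) if (coefficient >> bit) & 1]


def constant_multiply(sample, coefficient):
    """Multiply an integer sample by a fixed coefficient using only shifts and adds."""
    result = 0
    for shift in shift_add_terms(coefficient):
        # Hardware view: one hard-wired shift and one adder per set bit.
        result += sample << shift
    return result


if __name__ == "__main__":
    # 10 = 0b1010, so 7 * 10 = (7 << 1) + (7 << 3)
    print(shift_add_terms(10), constant_multiply(7, 10))
```

Fewer set bits in a coefficient mean fewer adders, which is why coefficient encoding matters at the module level.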
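Symmetric Coefficient (Architecture level)
[Figure: FIR structure exploiting coefficient symmetry, with delay elements (D), adders, and MUX-selected coefficients Coe[0], Coe[1], ... applied through shift & add units]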
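A minimal sketch of why coefficient symmetry helps, assuming a linear-phase FIR filter whose coefficients satisfy coeffs[k] == coeffs[n_taps - 1 - k]: the two samples that share a coefficient are added first, so only about half the multiplications remain. The function names are hypothetical and the code models the arithmetic only, not the slide's hardware structure.

```python
def fir_direct(samples, coeffs):
    """Direct-form FIR: one multiplication per tap and output sample."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]
        out.append(acc)
    return out


def fir_symmetric(samples, half_coeffs, n_taps):
    """Folded FIR for symmetric coefficients.

    half_coeffs holds the first ceil(n_taps / 2) coefficients.
    """
    def x(i):
        # Samples before the start of the input are treated as zero.
        return samples[i] if i >= 0 else 0

    out = []
    for n in range(len(samples)):
        acc = 0
        for k in range(n_taps // 2):
            # Pre-add the two samples sharing a coefficient, then multiply once.
            acc += half_coeffs[k] * (x(n - k) + x(n - (n_taps - 1 - k)))
        if n_taps % 2 == 1:
            mid = n_taps // 2
            acc += half_coeffs[mid] * x(n - mid)  # the centre tap has no partner
        out.append(acc)
    return out


if __name__ == "__main__":
    coeffs = [1, 2, 3, 2, 1]
    samples = [1, 0, 2, -1, 3, 4]
    print(fir_direct(samples, coeffs))
    print(fir_symmetric(samples, coeffs[:3], len(coeffs)))
```

For a 65-tap symmetric filter this folding would need 33 multiplications per output instead of 65.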
OFDM
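OFDM: Orthogonal Frequency Division Multiplexing
[Figure: OFDM modulator: the symbols $x_{0,k}, x_{1,k}, \dots, x_{N-1,k}$ enter an N-point IDFT, which outputs $s_{0,k}, s_{1,k}, \dots, s_{N-1,k}$; the outputs are converted from parallel to serial and a cyclic prefix (CP) is added]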
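The modulator chain on this slide (N-point IDFT, parallel to serial, CP) can be sketched in a few lines. This is an illustration under assumptions: the function names are hypothetical, the IDFT is the direct O(N^2) form with a 1/N scaling, and the cyclic prefix is taken to be a copy of the last cp_len time samples placed in front of the symbol.

```python
import cmath


def idft(freq_symbols):
    """N-point inverse DFT (direct O(N^2) form, scaled by 1/N)."""
    n = len(freq_symbols)
    return [
        sum(freq_symbols[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def dft(time_samples):
    """N-point forward DFT, the receiver-side counterpart of idft."""
    n = len(time_samples)
    return [
        sum(time_samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def ofdm_modulate(freq_symbols, cp_len):
    """Return one serial OFDM symbol: cyclic prefix followed by the IDFT output."""
    time_samples = idft(freq_symbols)
    if not 0 <= cp_len <= len(time_samples):
        raise ValueError("cp_len must be between 0 and N")
    # The cyclic prefix repeats the tail of the symbol at its head.
    prefix = time_samples[len(time_samples) - cp_len:]
    return prefix + time_samples


if __name__ == "__main__":
    subcarrier_symbols = [1, -1, 1j, -1j]
    tx = ofdm_modulate(subcarrier_symbols, cp_len=2)
    recovered = dft(tx[2:])  # drop the CP, then apply the DFT
    print([complex(round(v.real, 6), round(v.imag, 6)) for v in recovered])
```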
FFT/IFFT in OFDM Systems
- A large number of subcarriers requires a large FFT, $O(N \log_2 N)$
- DVB: 2k/4k/8k FFT
- WLAN IEEE 802.11a/g: 64-point FFT (48+4 subcarriers)
- LTE (Long Term Evolution): 2k FFT
FFT: VLSI Architecture
Butterfly: inputs $X_1(k)$ and $X_2(k)$ with twiddle factor $W_N^k$; outputs $X_1(k) + W_N^k X_2(k)$ and $X_1(k) - W_N^k X_2(k)$
Architecture options: folding, pipeline, parallel
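The butterfly on this slide is the core operation of the radix-2 FFT. A minimal software sketch, assuming a recursive decimation-in-time formulation and a power-of-two length (the function name is hypothetical, and this is not one of the VLSI architectures discussed):

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    x1 = fft_radix2(x[0::2])  # DFT of the even-indexed samples, X1(k)
    x2 = fft_radix2(x[1::2])  # DFT of the odd-indexed samples, X2(k)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W_N^k
        # Butterfly: one multiplication feeds both outputs.
        out[k] = x1[k] + w * x2[k]
        out[k + n // 2] = x1[k] - w * x2[k]
    return out


if __name__ == "__main__":
    print(fft_radix2([0, 1, 0, 0]))
```

Each of the log2(N) recursion levels performs N/2 butterflies, which gives the O(N log2 N) operation count.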
FFT: VLSI Architecture
Utilization                    Butterfly   FIFO   Mult
Multi-path delay commutator    50%         50%    50%
Single-path delay feedback     50%         100%   50%
FFT: Multi-path delay feedback (high-throughput)
Feature       Number
Technology    0.13-μm
Area          1.44 mm^2
Throughput    1 GS/s
Power         39.6 mW @ 409 MS/s
MIMO
Multiple Antenna System, MIMO
[Figure: transmit antennas, the radio channel, and receive antennas for each configuration]
- SISO: Single Input Single Output
- SIMO: Single Input Multiple Output (receive diversity)
- MISO: Multiple Input Single Output (transmit diversity)
- MIMO: Multiple Input Multiple Output (multiple data streams)
Understanding MIMO via Audio
[Figure: audio analogy for SISO, MISO, and MIMO; the parallel streams interfere with each other]
- At least as many receivers as transmitted streams
- Spatial separation at both transmit and receive antennas
- Improves transmission throughput or reliability
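MIMO System Model
[Figure: data, serial-to-parallel (S/P) conversion, transmit (Tx) antennas, receive (Rx) antennas]
$$r = Hs + n$$
- $s$: transmitted vector ($N \times 1$)
- $r$: received vector ($M \times 1$)
- $H$: channel matrix ($M \times N$)
$$H = \begin{bmatrix} h_{11} & h_{12} & \cdots & h_{1N} \\ h_{21} & h_{22} & \cdots & h_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ h_{M1} & h_{M2} & \cdots & h_{MN} \end{bmatrix}$$
- $h_{ij}$ models the fading gain between the $j$th transmit and $i$th receive antenna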
MIMO Signal Processing - Receiver
Interference cancellation!
[Figure: data, S/P, Tx antennas, Rx antennas, $r = Hs + n$, high-complexity signal processing, $\hat{s}$]
Recover the transmitted signal $s$ from the received signal $r$, which contains interference and noise.
MIMO Signal Processing - Receiver
[Figure: receiver signal processing for $r = Hs + n$: channel estimation produces $\hat{H}$, matrix manipulation produces $\hat{H}^{-1}$, and symbol detection produces $\hat{s} = \hat{H}^{-1} r$]
- Channel estimation: obtain the channel state from training signals
- Matrix manipulation: matrix inversion or decomposition, depending on the detection algorithm
- Symbol detection: estimate the transmitted signal $s$ given the channel matrix $H$ and the received signal $r$
MIMO Signal Detection
$$r = Hs + n$$
Linear detection: zero-forcing detection
$$s_{zf} = H^{-1} r = s + H^{-1} n$$
Maximum Likelihood (ML) detection: exhaustive search
$$s_{ml} = \arg\max_{s \in Q^N} p(s \mid r, H) = \arg\min_{s \in Q^N} \| r - Hs \|^2$$
$1.6 \times 10^7$ points per vector detection for 64-QAM, $4 \times 4$ MIMO
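The two detectors on this slide can be compared in a small sketch. The assumptions are loud ones: a 2×2 channel (so the inverse can be written out by hand), a tiny alphabet standing in for the constellation Q, and hypothetical function names. The last helper just counts the exhaustive-search candidates, |Q|^N.

```python
from itertools import product


def matvec(h, s):
    """Compute H s for a matrix given as a list of rows."""
    return [sum(h_ij * s_j for h_ij, s_j in zip(row, s)) for row in h]


def zf_detect_2x2(h, r, alphabet):
    """Zero-forcing: apply H^-1 to r, then slice each entry to the nearest symbol."""
    (a, b), (c, d) = h
    det = a * d - b * c
    if det == 0:
        raise ValueError("channel matrix is singular")
    soft = [(d * r[0] - b * r[1]) / det, (-c * r[0] + a * r[1]) / det]
    return [min(alphabet, key=lambda q: abs(q - v)) for v in soft]


def ml_detect(h, r, alphabet):
    """Maximum likelihood: exhaustive search for the s minimising ||r - H s||^2."""
    n_tx = len(h[0])

    def distance(s):
        return sum(abs(r_i - y_i) ** 2 for r_i, y_i in zip(r, matvec(h, s)))

    return list(min(product(alphabet, repeat=n_tx), key=distance))


def ml_search_points(constellation_size, n_tx):
    """Number of candidate vectors visited by the exhaustive search."""
    return constellation_size ** n_tx


if __name__ == "__main__":
    h = [[1.0, 0.5], [0.3, 1.0]]
    r = matvec(h, [1, -1])  # noise-free received vector
    print(zf_detect_2x2(h, r, [-1, 1]), ml_detect(h, r, [-1, 1]))
    print(ml_search_points(64, 4))  # 64-QAM, 4 transmit antennas
```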
High Complexity ML Detection
[Figure: MIMO receiver signal processing with channel estimation ($\hat{H}$), matrix manipulation, and ML detection ($\hat{s}$)]
WLAN 802.11n example:
- Modulation 256-QAM; 4 Tx antennas; 108 sub-channels; 4 μs per OFDM symbol
- ML detection: $1.159 \times 10^{17}$ points/s
- Current DSP technology is 1G inst/s: $10^8$ processors needed!
- OR (Moore's Law: processor capability doubles every 18 months) MUST WAIT 40 years!
- Today: Intel i7 CPU, $10^{11}$ inst/s
Mike Faulkner 2005, Victoria Univ.
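The numbers in this example can be reproduced with a short calculation. The sketch assumes an OFDM symbol time of 4 μs (the value that yields the stated $1.159 \times 10^{17}$ points/s) and, as a simplification, one instruction per search point; the function names are hypothetical.

```python
import math


def ml_points_per_second(constellation_size, n_tx, n_subchannels, symbol_time_s):
    """Search points per second for exhaustive ML detection on every sub-channel."""
    points_per_vector = constellation_size ** n_tx
    return points_per_vector * n_subchannels / symbol_time_s


def years_until_single_processor(required_rate, current_rate, doubling_years=1.5):
    """Years of Moore's-Law doubling needed for one processor to reach required_rate."""
    return math.log2(required_rate / current_rate) * doubling_years


if __name__ == "__main__":
    rate = ml_points_per_second(256, 4, 108, 4e-6)
    print(f"{rate:.3e} points/s")
    print(f"{rate / 1e9:.3e} processors at 1G inst/s")
    print(f"{years_until_single_processor(rate, 1e9):.1f} years of doubling every 18 months")
```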
Sphere Decoding: Algorithm level optimization
[Figure: simplified 2D case comparing ML detection with sphere detection]
Limited search space, hence reduced complexity
Sphere Decoding: complexity reduced
Near-optimal ML performance with significantly reduced computational complexity (# search points)
[Figure, left: detection performance, BER (10^0 to 10^-5) versus SNR (dB, 5 to 30), for ML, tree-pruning, and tree-pruning + reordering; 4×4 array, 16-QAM]
[Figure, right: computational complexity, average # of search points versus SNR (dB), for the same three detectors; total # search points = 65536]
Near-ML detection with 0.1% computational complexity
Source: Chia-Hsiang Yang, University of California, Los Angeles, 2007
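Sphere Decoding: tree-search
QR decomposition: $H = QR$, where $R$ is an upper triangular matrix
$$\hat{s}_{ML} = \arg\min_{s} \| y - Rs \|^2, \quad \text{with } y = Q^H r$$
$$y - Rs = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} - \begin{bmatrix} R_{11} & R_{12} & R_{13} & R_{14} \\ 0 & R_{22} & R_{23} & R_{24} \\ 0 & 0 & R_{33} & R_{34} \\ 0 & 0 & 0 & R_{44} \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \\ s_3 \\ s_4 \end{bmatrix}$$
[Figure: search tree from the root through the 4th, 3rd, and 2nd layers down to the leaves; every node branches to -1 or 1, with partial distance $P_i$ and increment $inc_i$. K-best detection, e.g., K=2]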
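The K-best tree search named on this slide can be sketched as a breadth-first search that keeps only the K partial paths with the smallest partial distance at each layer. The sketch assumes the QR decomposition has already been done, so it takes the upper triangular R and y directly; the function name and the candidate representation are hypothetical.

```python
def k_best_detect(r_mat, y, alphabet, k):
    """K-best detection on an upper triangular system y = R s + noise.

    Returns (detected symbol vector, its squared distance ||y - R s||^2).
    """
    n = len(y)
    # Each candidate is (partial_distance, symbols for layers i..n-1).
    candidates = [(0.0, [])]
    for i in range(n - 1, -1, -1):  # the root corresponds to the last row of R
        expanded = []
        for dist, tail in candidates:
            for sym in alphabet:
                path = [sym] + tail  # path[j] is the symbol of layer i + j
                predicted = sum(r_mat[i][i + j] * path[j] for j in range(len(path)))
                increment = abs(y[i] - predicted) ** 2
                expanded.append((dist + increment, path))
        # Prune: keep only the K paths with the smallest partial distance.
        expanded.sort(key=lambda c: c[0])
        candidates = expanded[:k]
    best_dist, best_path = candidates[0]
    return best_path, best_dist


if __name__ == "__main__":
    r_mat = [
        [2.0, 0.5, -0.3, 0.1],
        [0.0, 1.5, 0.4, -0.2],
        [0.0, 0.0, 1.2, 0.3],
        [0.0, 0.0, 0.0, 1.0],
    ]
    s_true = [1, -1, -1, 1]
    y = [sum(r_mat[i][j] * s_true[j] for j in range(4)) for i in range(4)]
    print(k_best_detect(r_mat, y, [-1, 1], k=2))
```

With K=2 and a binary alphabet, at most 4 nodes are expanded per layer, instead of the 2^4 leaves an exhaustive search would visit for 4 layers. Because paths are pruned, K-best is not guaranteed to return the ML solution in noisy conditions.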
Sphere Decoding: VLSI
              Design [1]   Design [2]   Design [3]
Gate count    50K          91K          491K
Throughput    136 Mbps     269 Mbps     1100 Mbps
Systolic Array
- A systolic array is a homogeneous network of tightly coupled Processing Elements (PEs), called cells or nodes
- The wave-like propagation of data through a systolic array resembles the pulse of the human circulatory system; the name "systolic" was coined from medical terminology
Systolic Array (example)
[Figure sequence: the example array at time steps T1 to T7]
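The example computed on these slides is not recoverable from the text, so the following sketch assumes a common textbook case: an output-stationary systolic array that multiplies two matrices. Rows of A enter from the left and columns of B from the top, each skewed by one cycle per row or column; every PE multiplies the two values passing through it, accumulates the product, and forwards the values to its right and lower neighbours. All names are hypothetical.

```python
def systolic_matmul(a, b):
    """Cycle-by-cycle simulation of an n x p systolic array computing C = A B.

    Returns (C, number of cycles).
    """
    n, m, p = len(a), len(b), len(b[0])
    acc = [[0] * p for _ in range(n)]    # accumulator inside each PE
    a_reg = [[0] * p for _ in range(n)]  # A values travelling to the right
    b_reg = [[0] * p for _ in range(n)]  # B values travelling downwards
    cycles = n + p + m - 2
    for t in range(cycles):
        for i in range(n):
            for j in range(p - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]  # pass to the right neighbour
            k = t - i                          # row i is delayed by i cycles
            a_reg[i][0] = a[i][k] if 0 <= k < m else 0
        for j in range(p):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]  # pass to the lower neighbour
            k = t - j                          # column j is delayed by j cycles
            b_reg[0][j] = b[k][j] if 0 <= k < m else 0
        for i in range(n):
            for j in range(p):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]  # all PEs fire together
    return acc, cycles


if __name__ == "__main__":
    c, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(c, cycles)
```

The skewed inputs produce the wave-like data movement described above: PE (i, j) sees its first useful operands at cycle i + j.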
Anything we can do at the transmitter?
MIMO Precoder: understanding via audio
- If the receiver is not positioned directly between the speakers, the received streams will be at different levels: $L + N_L$ and $0.5R + N_R$
- Equalizing at the receiver side? Like ZF, this causes noise enhancement: $L + N_L$ and $2(0.5R + N_R) = R + 2N_R$
- Balance at the transmitter side (pre-code): requires accurate channel information at the transmitter!
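MIMO Precoder
[Figure: data, S/P, pre-coding, Tx antennas, Rx antennas, low-complexity receiver; the transmitted vector $s$ is replaced by the pre-coded vector $x$]
Zero-forcing pre-coder: $x = H^{-1} s$
$$r = Hx + n = H H^{-1} s + n$$
$$\begin{bmatrix} r_1 \\ r_2 \\ r_3 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \\ s_3 \end{bmatrix} + \begin{bmatrix} n_1 \\ n_2 \\ n_3 \end{bmatrix}$$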
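A small numerical sketch of the zero-forcing pre-coder, assuming a 2×2 real channel that the transmitter knows perfectly (function names are hypothetical). The transmitter sends x = H^-1 s, so the channel output is s plus noise and the receiver needs no matrix processing.

```python
def invert_2x2(h):
    """Closed-form inverse of a 2x2 matrix given as a list of rows."""
    (a, b), (c, d) = h
    det = a * d - b * c
    if det == 0:
        raise ValueError("channel matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def zf_precode(h, s):
    """Transmitter side: pre-distort the symbols with the channel inverse."""
    return matvec(invert_2x2(h), s)


def channel(h, x, noise):
    """Channel model r = H x + n."""
    return [y_i + n_i for y_i, n_i in zip(matvec(h, x), noise)]


if __name__ == "__main__":
    h = [[1.0, 0.5], [0.25, 1.0]]
    s = [1.0, -1.0]
    x = zf_precode(h, s)
    print(channel(h, x, noise=[0.01, -0.02]))  # close to s, offset only by the noise
```

Unlike receiver-side zero forcing, the noise is not multiplied by H^-1 here; the cost moves to the transmitter, which needs accurate channel knowledge and may need more transmit power.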
The corresponding processor for MIMO?
Digital Hardware Platforms
[Figure: platforms arranged between flexibility and efficiency: GPP, DSP, GPU, FPGA, ASIP, ASIC]
Digital Signal Processors for MIMO
Flynn's Taxonomy
- Single instruction stream, single data stream (SISD)
- Single instruction stream, multiple data streams (SIMD)
- Multiple instruction streams, single data stream (MISD)
- Multiple instruction streams, multiple data streams (MIMD)
VLIW+SIMD
- VLIW (very long instruction word): instruction-level parallelism, e.g., streaming of data
- SIMD (single instruction, multiple data): data-level parallelism, e.g., vectors
MIMO Processor Examples: Philips EVP Vector Processor for LTE (2007)
Massive MIMO
5G wireless communication
Goes to LARGE Dimension?
Goes to LARGE Dimension
Cellular   Antenna   WLAN       Antenna
HSPA+      2×2       802.11n    4×4
LTE        4×4       802.11ac   8×4
LTE-A      8×8       802.11?    16×16
Further scaling up?
- Size limitation in terminals
- Power consumption in portable devices
How many antennas can we have?
- 2 in a phone
- 16 in a laptop
- 100X
Massive MIMO or Very-Large MIMO
[Figure: base station (BS) with antennas A1 ... AM serving users UE1, UE2, ..., UEK]
- We think of a very-large MIMO (multi-user) system
- We mean $M \gg K \gg 1$
- We are looking for $M > 100$ antennas!
- We serve 10-20 users concurrently
F. Rusek, D. Persson, B. K. Lau, E. G. Larsson, T. L. Marzetta, O. Edfors, and F. Tufvesson, "Scaling Up MIMO: Opportunities and Challenges with Very Large Arrays," IEEE Signal Processing Magazine, Jan. 2013
Massive MIMO (video) https://www.youtube.com/watch?v=xbb481rnqgw
Dream case
[Figure: 4G system versus 5G system]
Massive MIMO for 5G: imagine a highway created just for you, no matter where you are!
DSP for Massive MIMO
[Figure: massive MIMO baseband block diagram]
- Per-user processing (1 ... K): channel encoding, interleaving, and symbol mapping on the transmit side; symbol demapping, deinterleaving, and channel decoding on the receive side
- Central processing: MIMO precoding, channel estimation + MIMO detection, reciprocity calibration, memory
- Per-antenna processing (1 ... M): OFDM modulation, resampling, filtering, and analog TX on the transmit side; analog RX, digital front-end, and OFDM demodulation on the receive side
- Data transfer networks connect the processing stages
Design challenges
128×16 massive MIMO system with 20 MHz bandwidth
- High computation count: $190 \times 10^9$ multiplications/s for ZF-based MIMO processing
- Low processing latency: 285 μs RX-TX turnaround time for moderate mobility
- Large data storage: 9.8 MB memory for the channel matrix
- Complicated data shuffling: 11 GB/s information exchange for 16-bit wordlength
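The slide does not say how the storage figure is derived. One back-of-the-envelope sketch that lands close to it, under assumptions that are not in the source: one 128×16 channel matrix is stored per subcarrier, there are 1200 used subcarriers (an LTE-like value for 20 MHz), and each complex entry takes two 16-bit words (real and imaginary part). The function name is hypothetical.

```python
def channel_matrix_bytes(n_antennas, n_users, n_subcarriers, bits_per_real):
    """Bytes needed to store one complex channel matrix per subcarrier."""
    complex_entries = n_antennas * n_users * n_subcarriers
    bits = complex_entries * 2 * bits_per_real  # real and imaginary parts
    return bits // 8


if __name__ == "__main__":
    # ASSUMPTION: 1200 subcarriers and 16-bit real/imaginary words.
    size = channel_matrix_bytes(128, 16, 1200, 16)
    print(f"{size} bytes = {size / 1e6:.2f} MB")
```

Under these assumptions the result is about 9.83 MB, consistent with the stated 9.8 MB; a 4×4 system with the same parameters would need 128 times less.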
Memory subsystem in Massive MIMO
- High capacity and throughput: channel matrix $128 \times 16$ in massive MIMO vs. $4 \times 4$ in LTE-A
- Multiple access patterns
  - Column-wise: $H^H H$
  - Row-wise: $Hy$
  - Diagonal-wise: $H^H H + \alpha I$
- Adjustable operand matrix size: $H^H H$ / $H^H H + \alpha I$ / $(H^H H + \alpha I)^{-1}$ ($K \times K$)
Flexible memory access
[Figure: a 3×3 matrix and two rotated layouts of it in a multi-bank memory]
Matrix     Layout 1   Layout 2
1 2 3      1 2 3      1 2 3
4 5 6      5 6 4      6 4 5
7 8 9      9 7 8      8 9 7
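The rotated 3×3 layouts on this slide are consistent with a skewed storage scheme, sketched here under that assumption: element (row, col) goes to bank (col + skew · row) mod N, at address row inside the bank. Elements that land in distinct banks can be read in the same cycle. The function names and the skew parameter are hypothetical, and this is not claimed to be the exact scheme of the 16-bank design.

```python
def bank_of(row, col, n_banks, skew):
    """Bank index for matrix element (row, col) under a linear skew."""
    return (col + skew * row) % n_banks


def layout(matrix, skew):
    """Return banks_view[address][bank]: what each bank stores at each address."""
    n = len(matrix)
    banks_view = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            banks_view[i][bank_of(i, j, n, skew)] = matrix[i][j]
    return banks_view


def is_conflict_free(cells, n_banks, skew):
    """True if all (row, col) cells fall into different banks."""
    banks = [bank_of(i, j, n_banks, skew) for i, j in cells]
    return len(set(banks)) == len(banks)


if __name__ == "__main__":
    m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    column0 = [(i, 0) for i in range(3)]
    diagonal = [(i, i) for i in range(3)]
    print(layout(m, skew=1))
    print(is_conflict_free(column0, 3, skew=0))   # unskewed: a column hits one bank
    print(is_conflict_free(column0, 3, skew=1))   # skewed: a column spreads over banks
    print(is_conflict_free(diagonal, 3, skew=1))
```

With three banks, skew 1 keeps rows, columns, and the main diagonal conflict-free, while skew -1 maps the whole main diagonal to one bank, so the choice of skew depends on which access patterns are needed.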
Conflict-free parallel memory scheme
[Figure: 16 memory banks (Memory Bank0, ..., each 2048x32) with a permutation network on the write path and an inverse-permutation network on the read path, controlled by a permutation pattern generate unit, an address generate unit, and control logic. Vectors v0 ... v15 are stored with the index of the target memory module (1 to 16) rotated from row to row, supporting row, column, and diagonal access.]
Conclusions
- Digital signal processing is evolving at a fast pace with new wireless technologies and applications
- Optimal DSP implementation is crucial to bring new DSP algorithms into practice
- The best design achieves balanced trade-offs depending on application requirements
- Optimize at earlier stages and do cross-layer (or co-) optimization
- Keep tracking new technologies for both algorithm and implementation
Thanks