Fast Fourier Transform: VLSI Architectures

Fast Fourier Transform: VLSI Architectures
Lecture, Vladimir Stojanović
6.973 Communication System Design, Spring 2006
Massachusetts Institute of Technology
Cite as: Vladimir Stojanovic, course materials for 6.973 Communication System Design, Spring 2006.

Pipelined FFT architectures: examples
[Figure: block diagrams of five pipeline FFT architectures built from butterflies (BF), commutators (C), delay elements, and multipliers]
(1) R2MDC (N=16): radix-2 multi-path delay commutator
(2) R2SDF (N=16): radix-2 single-path delay feedback
(3) R4SDF (N=256): radix-4 single-path delay feedback
(4) R4MDC (N=256): radix-4 multi-path delay commutator
(5) R4SDC (N=256): radix-4 single-path delay commutator

Radix-2 Multi-path Delay Commutator (R2MDC)
[Figure: R2MDC pipeline of commutators (C), butterflies (BF), and delay elements]
- The most classical approach for pipeline implementation of the radix-2 FFT
- The input sequence is broken into two parallel data streams flowing forward, with the correct distance between data elements entering the butterfly scheduled by proper delays
- Both butterflies and multipliers are at 50% utilization

Radix-2 Single-path Delay Feedback (R2SDF) [Wold & Despain 84]
[Figure: R2SDF pipeline of butterflies (BF) with feedback delay lines]
- Uses registers more efficiently, both as the input and as the output of the butterfly
- A single data stream goes through the multiplier at every stage
- Multiplier utilization is also 50%

Radix-4 Single-path Delay Feedback (R4SDF) [Despain 74]
[Figure: R4SDF pipeline, DFT decomposition diagram, and radix-4 butterfly with inputs x(n), ..., twiddle factors W, and outputs y(n), ...]
- Utilization of multipliers is 75%, achieved by storing butterfly outputs
- Radix-4 butterfly utilization is only 25%
- The butterfly is fairly complicated: at least 8 complex adders

Radix-4 Multi-path Delay Commutator (R4MDC) [Swartzlander 84]
[Figure: R4MDC pipeline of commutators (C) and butterflies (BF), DFT decomposition diagram, and radix-4 butterfly]
- What is the utilization of the butterflies? Of the multipliers?

Radix-4 Single-path Delay Commutator (R4SDC) [Bi & Jones 89]
[Figure: input, commutator, and butterfly element per stage, with coefficient inputs; DFT decomposition diagram and radix-4 butterfly]
- Modified radix-4 algorithm
- Programmable 1/4 radix-4 butterfly with 75% utilization
- Used to build one of the largest single-chip FFTs (8192 points) [Bidet 95]

R4SDC commutator and butterfly details
[Figure: commutator timing diagram (input samples over time, multiplexer settings m, and the outputs from the commutator at each stage) and the butterfly element built from add/sub units on the real and imaginary parts, with a legend for the addition/subtraction control]

Some conclusions
- Delay-feedback approaches are always more efficient than the corresponding delay-commutator approaches in terms of memory utilization, since the butterfly outputs share the same storage with the inputs
- Pipeline architectures require FFT algorithms to be formulated in a hardware-oriented form
  - where spatial regularity is preserved in the signal-flow graph (SFG)
  - so that arithmetic operations can be tightly scheduled for efficient hardware utilization

Decomposition: a review
- The twiddle factor W_N is the Nth primitive root of unity, with its exponent evaluated modulo N
- Most fast algorithms share the same general strategy
  - Map the one-dimensional transform into a two- or multidimensional representation
  - Exploit the congruence property of the coefficients to simplify computation
- Unlike the traditional step-by-step decomposition of twiddle factors, cascading the twiddle factor decomposition leads to new forms of FFT with high spatial regularity

Radix-2^2 approach
- Start with the classical divide-and-conquer radix-2 DIF indexing
- But consider the first two steps of the decomposition together [Shousheng and Torkelson 1996]
- Standard radix-2 approach: compute directly
- New idea: proceed to shorter DFTs by cascading the twiddle factor W_N^(((N/4) n2 + n3) k1)
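The cascading step rests on a twiddle-factor identity that can be checked numerically. The sketch below is a minimal illustration, assuming the usual radix-2^2 index split n = (N/2)n1 + (N/4)n2 + n3 and k = k1 + 2k2 + 4k3 and a forward transform with W_N = exp(-j2π/N); the helper names W and max_identity_error are hypothetical. It confirms that the twiddle left after the first butterfly splits into a trivial power of -j, one remaining nontrivial twiddle, and the kernel of an N/4-point DFT.

```python
import cmath


def W(N, e):
    """Twiddle factor W_N^e = exp(-j*2*pi*e/N) (forward-transform convention)."""
    return cmath.exp(-2j * cmath.pi * e / N)


def max_identity_error(N):
    """Largest deviation of the cascaded twiddle decomposition for size N.

    Checks, over all index values,
      W_N^((N/4*n2 + n3) * (k1 + 2*k2 + 4*k3))
        == (-j)^(n2*(k1 + 2*k2)) * W_N^(n3*(k1 + 2*k2)) * W_(N/4)^(n3*k3)
    N is assumed to be a multiple of 4.
    """
    q = N // 4
    worst = 0.0
    for n2 in range(2):
        for n3 in range(q):
            for k1 in range(2):
                for k2 in range(2):
                    for k3 in range(q):
                        lhs = W(N, (q * n2 + n3) * (k1 + 2 * k2 + 4 * k3))
                        rhs = ((-1j) ** (n2 * (k1 + 2 * k2))
                               * W(N, n3 * (k1 + 2 * k2))
                               * W(q, n3 * k3))
                        worst = max(worst, abs(lhs - rhs))
    return worst


if __name__ == "__main__":
    for size in (16, 64, 256):
        print(size, max_identity_error(size))
```

Only the middle factor needs a real complex multiplier, which is why the multiplier count behaves like radix-4 while the butterflies stay radix-2.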

A 16pt example
- Get radix-4-like multiplier complexity with radix-2 butterfly structures (radix-2^2)
[Figure: two signal-flow graphs of the 16-point transform with inputs x(0)..x(15). Top: two butterfly steps, twiddle factors W, and four N/4-point DFTs selected by (k1, k2). Bottom: the complete radix-2^2 flow graph with alternating butterfly stages (BF I, BF II, ...), twiddle factors between stage pairs, and outputs X(k) in bit-reversed order]
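To make the flow graph concrete, here is a minimal functional sketch of a radix-2^2 decimation-in-frequency FFT in Python. It is not a model of the pipeline hardware: it assumes a power-of-two length, returns the outputs in natural order for easy checking, and uses hypothetical names (dft, r22_dif_fft). Each recursion level performs one BF I step (distance N/2, plus the trivial -j), one BF II step (distance N/4), and then the only nontrivial twiddle multiplication.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def r22_dif_fft(x):
    """Radix-2^2 DIF FFT of a power-of-two-length list (natural-order output)."""
    N = len(x)
    if N == 1:
        return list(x)
    if N == 2:
        return [x[0] + x[1], x[0] - x[1]]
    h, q = N // 2, N // 4
    # BF I: radix-2 butterflies with input distance N/2.
    upper = [x[n] + x[n + h] for n in range(h)]          # k1 = 0
    lower = [x[n] - x[n + h] for n in range(h)]          # k1 = 1
    # Trivial twiddle: multiply the second half of the k1 = 1 branch by -j.
    lower = lower[:q] + [-1j * v for v in lower[q:]]
    out = [0j] * N
    for k1, s in ((0, upper), (1, lower)):
        # BF II: radix-2 butterflies with input distance N/4.
        add = [s[n] + s[n + q] for n in range(q)]        # k2 = 0
        sub = [s[n] - s[n + q] for n in range(q)]        # k2 = 1
        for k2, t in ((0, add), (1, sub)):
            # The only nontrivial multiplications: W_N^(n3 * (k1 + 2*k2)).
            tw = [t[n3] * cmath.exp(-2j * cmath.pi * n3 * (k1 + 2 * k2) / N)
                  for n3 in range(q)]
            sub_dft = r22_dif_fft(tw)                    # N/4-point DFT over n3
            for k3 in range(q):
                out[k1 + 2 * k2 + 4 * k3] = sub_dft[k3]
    return out


if __name__ == "__main__":
    data = [complex(i % 5, (3 * i) % 7) for i in range(16)]
    err = max(abs(a - b) for a, b in zip(r22_dif_fft(data), dft(data)))
    print("max error vs direct DFT:", err)
```

A hardware pipeline would emit the same values in bit-reversed order; the natural-order placement here is a convenience of the sketch.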

A 6pt radix- example Image removed due to copyright restrictions.

Radix-2^2 (R2^2SDF) architecture
[Figure: R2^2SDF pipeline for N=256: input x(n), alternating BF I and BF II stages with feedback delay lines, twiddle multipliers W(n) after each BF I/BF II pair, a clock-driven counter, and output X(k); detail schematics of (i) BF I and (ii) BF II operating on real and imaginary parts xr, xi and producing zr, zi]
- Similar to R2SDF, with a reduced number of multipliers
- Need two types of butterflies
  - One identical to that in R2SDF
  - The other also contains the logic for the trivial twiddle factor multiplication (with j)
- Synchronization control is very simple due to spatial regularity: just a log2(N)-bit binary counter

Radix-2^2 architecture
[Figure: R2^2SDF pipeline, as on the previous slide]
- Sync control: a log2(N)-bit binary counter
  - Synchronization controller
  - Address counter for twiddle factor reading in each stage
- On the first N/2 cycles, the 2-to-1 muxes in BF I switch to position 0
  - The butterfly is idle (input data is directed to the shift registers)
- On the next N/2 cycles, the muxes in BF I switch to position 1
  - The butterfly computes a 2pt DFT with the incoming data and the data stored in the shift registers
  - Output Z(n) is sent to the twiddle multiplier
  - Output Z(n+N/2) is sent back to the shift registers, to be multiplied in the next N/2 cycles, when the first half of the next frame is loaded in
- Operation of BF II is similar, except that the distance of the butterfly input sequence is just N/4, and it includes the trivial multiply logic
- Utilization of the multiplier is 75%
- The next frame can be computed without pausing, due to the pipelined processing in each stage
- A pipeline register can be inserted between each multiplier and BF stage to improve the performance
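The two-phase behaviour of a delay-feedback butterfly can be illustrated with a small cycle-level model. This is a sketch under simplifying assumptions: one stage only, no twiddle multiplier, no trivial -j logic, and a hypothetical class name SdfStage. With a delay line of depth D (D = N/2 for the first stage), the stage fills the delay line for D cycles, then for D cycles emits sums while writing differences back into the same storage.

```python
from collections import deque


class SdfStage:
    """Cycle-level sketch of one single-path delay-feedback butterfly stage."""

    def __init__(self, depth):
        self.depth = depth
        self.fifo = deque([0] * depth)   # feedback shift register
        self.count = 0                   # position within the 2*depth-cycle frame
        self.butterfly_cycles = 0        # cycles in which the butterfly computed

    def step(self, x):
        stored = self.fifo.popleft()
        if self.count < self.depth:
            # Phase 0: butterfly idle. Load the input into the delay line and
            # drain the differences stored during the previous frame.
            out = stored
            self.fifo.append(x)
        else:
            # Phase 1: 2-point DFT of the stored and incoming samples.
            out = stored + x             # sent forward (toward the multiplier)
            self.fifo.append(stored - x) # fed back, emitted depth cycles later
            self.butterfly_cycles += 1
        self.count = (self.count + 1) % (2 * self.depth)
        return out


if __name__ == "__main__":
    stage = SdfStage(depth=2)
    # One 4-sample frame followed by two flush samples.
    print([stage.step(v) for v in [1, 2, 3, 4, 0, 0]])
```

In steady state the butterfly is active in half of the cycles, and the storage holds inputs in one phase and outputs in the other, which is the source of the memory efficiency of delay-feedback designs.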

Arithmetic complexity

            multiplier #      adder #      memory size   control
R2MDC       2(log4 N - 1)     4 log4 N     3N/2 - 2      simple
R2SDF       2(log4 N - 1)     4 log4 N     N - 1         simple
R4SDF       log4 N - 1        8 log4 N     N - 1         medium
R4MDC       3(log4 N - 1)     8 log4 N     5N/2 - 4      simple
R4SDC       log4 N - 1        3 log4 N     2N - 2        complex
R2^2SDF     log4 N - 1        4 log4 N     N - 1         simple

- R2^2SDF has reached the minimum requirement for both multipliers and storage
- Only R4SDC is better in terms of adder usage
- R2^2SDF is well suited for VLSI implementations of pipeline FFT processors
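For a concrete transform size the comparison can be tabulated directly. The sketch below assumes the commonly cited hardware counts for these six architectures (complex multipliers, complex adders, memory words) as a function of log4 N; the function name pipeline_costs and the dictionary layout are hypothetical, and N is assumed to be a power of 4.

```python
def pipeline_costs(N):
    """Return {architecture: (multipliers, adders, memory words)} for size N.

    Assumed formulas, with s = log4(N):
      multipliers: 2(s-1), 2(s-1), s-1, 3(s-1), s-1, s-1
      adders:      4s, 4s, 8s, 8s, 3s, 4s
      memory:      3N/2-2, N-1, N-1, 5N/2-4, 2N-2, N-1
    """
    s = (N.bit_length() - 1) // 2
    if N < 4 or 4 ** s != N:
        raise ValueError("N must be a power of 4")
    return {
        "R2MDC":   (2 * (s - 1), 4 * s, 3 * N // 2 - 2),
        "R2SDF":   (2 * (s - 1), 4 * s, N - 1),
        "R4SDF":   (s - 1,       8 * s, N - 1),
        "R4MDC":   (3 * (s - 1), 8 * s, 5 * N // 2 - 4),
        "R4SDC":   (s - 1,       3 * s, 2 * N - 2),
        "R2^2SDF": (s - 1,       4 * s, N - 1),
    }


if __name__ == "__main__":
    for name, (mult, add, mem) in pipeline_costs(256).items():
        print(f"{name:8s} multipliers={mult:2d} adders={add:2d} memory={mem}")
```

For N = 256 this gives R2^2SDF 3 multipliers, 16 adders, and 255 memory words, against 3, 12, and 510 for R4SDC, which matches the observation that only R4SDC uses fewer adders.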

Memory issues
- The area and power consumption in the pipeline architectures are dominated by
  - the FIFO register files at each stage
  - the complex multipliers at each (or every other) stage
- To diminish the unnecessary data movement in the FIFO, the storage needs to be restructured
- A known approach is to build the FIFO from a 2-port RAM
  - with read and write addresses displaced by a constant
  - 2-port RAM cells take more area than 1-port RAM cells
- Alternative: use two N/2-word 1-port RAMs
  - Read and write are interleaved
  - Each is active every other cycle
[Figure: FIFO built from a 2-port RAM with separate write and read addresses, and from two 1-port RAMs with a shared read/write address, with input D(n) and delayed output]
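The RAM-based FIFO idea can be sketched in a few lines: instead of shifting every word each cycle, the data stays in place and only two addresses advance, kept a constant distance apart. The model below covers the 2-port variant only and is an illustration under assumptions: the class name DisplacedAddressFifo is hypothetical, and the memory is sized one word larger than the delay so that the read and write addresses never coincide.

```python
class DisplacedAddressFifo:
    """Fixed-delay FIFO on a 2-port memory with displaced read/write addresses."""

    def __init__(self, delay):
        self.size = delay + 1            # one spare word keeps addresses distinct
        self.mem = [0] * self.size
        self.waddr = 0

    def step(self, d_in):
        # The read address trails the write address by `delay` locations,
        # which in a ring of delay + 1 words is the next location.
        raddr = (self.waddr + 1) % self.size
        d_out = self.mem[raddr]          # word written `delay` cycles ago
        self.mem[self.waddr] = d_in      # exactly one write per cycle
        self.waddr = raddr
        return d_out


if __name__ == "__main__":
    fifo = DisplacedAddressFifo(delay=3)
    print([fifo.step(v) for v in [1, 2, 3, 4, 5, 6]])
```

Each cycle touches one word for reading and one for writing, whereas a shift-register FIFO moves every stored word, which is the unnecessary data movement the slide refers to.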

Single stage hardware example
[Figure: 8-point signal-flow graph with inputs x[0]..x[7], twiddle factors W, and outputs X[k] in bit-reversed order; block diagram with S/P and bit-reverse unit, control circuits, N/r butterflies, coefficient ROM, P/S unit, and counter]

T_FFT = (N/r) · log_r(N) · T_r,PE

where
  N/r      = number of butterflies per stage
  log_r(N) = number of stages
  T_r,PE   = time to calculate one butterfly

- Fold the stages onto each other
- Need a constant-geometry signal-flow graph
- Big price in area for parallelism (within each stage) [Sadat]

Radix-8 pipelined/parallel implementation
- A 64pt FFT example for 802.11a [Excerpted from Maharatna et al. 2004]
- Two-dimensional structure of 8pt FFTs
- The number of nontrivial complex multiplications is 49 (7x7), since the first twiddle is always 1
- The number of nontrivial complex multiplications for the radix-2 FFT is 66
- Radix-4 (or radix-2^2) FFTs need only 5 multiplies
- Important to note that the 8pt FFT (DIT) needs no multiplies
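The two-dimensional structure can be sketched as follows. This is an illustrative model, not the chip's datapath: it assumes the index maps n = n1 + 8·n2 and k = 8·k1 + k2, uses a direct 8-point DFT in place of a hardwired 8-point FFT, and the names dft and fft64_as_8x8 are hypothetical. It also counts the inter-dimensional twiddles that differ from 1.

```python
import cmath


def dft(x):
    """Direct DFT, standing in for the 8-point FFT block and used as reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def fft64_as_8x8(x):
    """64-point DFT as 8-point DFTs, twiddles, 8-point DFTs.

    Returns (X, count) where count is the number of twiddle multiplications
    by a factor other than 1.
    """
    assert len(x) == 64
    count = 0
    rows = []
    for n1 in range(8):
        # First dimension: 8-point DFT over n2, producing index k2.
        a = dft([x[n1 + 8 * n2] for n2 in range(8)])
        for k2 in range(8):
            e = (n1 * k2) % 64
            if e != 0:                   # W_64^0 = 1 needs no multiplier
                a[k2] *= cmath.exp(-2j * cmath.pi * e / 64)
                count += 1
        rows.append(a)
    X = [0j] * 64
    for k2 in range(8):
        # Second dimension: 8-point DFT over n1, producing index k1.
        col = dft([rows[n1][k2] for n1 in range(8)])
        for k1 in range(8):
            X[8 * k1 + k2] = col[k1]
    return X, count


if __name__ == "__main__":
    data = [complex(i % 7, (5 * i) % 3) for i in range(64)]
    X, count = fft64_as_8x8(data)
    print("twiddles != 1:", count)
    print("max error:", max(abs(a - b) for a, b in zip(X, dft(data))))
```

The count of 49 arises because every twiddle with n1 = 0 or k2 = 0 equals 1, leaving a 7x7 grid.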

8pt DIT FFT
Figure from Maharatna, K., E. Grass, and U. Jagdhold. "A 64-point Fourier Transform Chip for High-speed Wireless LAN Application Using OFDM." Solid-State Circuits 39 (2004): 484-493. Copyright IEEE. Used with permission.
- The only nontrivial multiply is with 1/sqrt(2)
- Easily realized using hardwired shift-and-add
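As an illustration of the shift-and-add idea, the constant 1/sqrt(2) ≈ 0.70711 can be approximated by a short sum of powers of two. The particular expansion below (2^-1 + 2^-3 + 2^-4 + 2^-6 + 2^-8 = 0.70703125) is an assumption chosen for the sketch, not the bit pattern used in the cited chip, and the function name mul_inv_sqrt2 is hypothetical.

```python
def mul_inv_sqrt2(x):
    """Approximate x / sqrt(2) for an integer x using only shifts and adds.

    Uses 1/sqrt(2) ~= 2^-1 + 2^-3 + 2^-4 + 2^-6 + 2^-8 = 0.70703125.
    Each right shift truncates, as a hardwired shifter without rounding would.
    """
    return (x >> 1) + (x >> 3) + (x >> 4) + (x >> 6) + (x >> 8)


if __name__ == "__main__":
    x = 1 << 16                          # 1.0 in a 16-fractional-bit format
    y = mul_inv_sqrt2(x)
    print(y, y / x, 2 ** -0.5)
```

In hardware the shifts are just wiring, so the multiply reduces to four adders; adding more terms trades adders for accuracy.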

Block diagram of the FFT unit
Figure from Maharatna, K., E. Grass, and U. Jagdhold. "A 64-point Fourier Transform Chip for High-speed Wireless LAN Application Using OFDM." Solid-State Circuits 39 (2004): 484-493. Copyright IEEE. Used with permission.
- The two stages are pipelined
- Fully parallel in each stage (radix-2 8pt FFT, single clock cycle)
- Two performance bottlenecks
  - The large number of global wires resulting from the multiplexing of complex data to the 8-point FFTs
  - Constructing the multiplier unit to attain the required speed with minimal silicon area is not trivial

Input unit
Figure from Maharatna, K., E. Grass, and U. Jagdhold. "A 64-point Fourier Transform Chip for High-speed Wireless LAN Application Using OFDM." Solid-State Circuits 39 (2004): 484-493. Copyright IEEE. Used with permission.
- Hardwired outputs and data shifting to the 8pt FFT
  - Reduces de-muxing
  - Reduces global wires
- Cannot shift every clock cycle
  - The multiplier cannot finish
  - Extend the latency
  - Temporary registers

Multiplier unit
Figure from Maharatna, K., E. Grass, and U. Jagdhold. "A 64-point Fourier Transform Chip for High-speed Wireless LAN Application Using OFDM." Solid-State Circuits 39 (2004): 484-493. Copyright IEEE. Used with permission.
- 49 multiplies
- Only nine unique sets of (cos, sin) hardwired constants
  - Significantly less storage space for coefficients
  - Turns multiplies into shift-and-add
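One plausible way to see why so few constant sets are needed is to fold the twiddle angles by symmetry. The sketch below assumes the 7x7 grid of twiddle exponents n1·k2 (n1, k2 = 1..7) of a 64-point transform split into two 8-point dimensions, and the function name folded_twiddle_angles is hypothetical. Multiples of a quarter turn are powers of -j, and angles above an eighth turn reuse the same cosine and sine values with the roles swapped. Whether this matches the exact counting used in the cited chip is an assumption.

```python
def folded_twiddle_angles():
    """Distinct twiddle angles, in units of 2*pi/64, after symmetry folding."""
    seen = set()
    for n1 in range(1, 8):
        for k2 in range(1, 8):
            e = (n1 * k2) % 64
            r = e % 16                   # quadrant symmetry: W_64^16 = -j
            r = min(r, 16 - r)           # octant symmetry: cos and sin swap
            seen.add(r)
    return sorted(seen)


if __name__ == "__main__":
    angles = folded_twiddle_angles()
    print(len(angles), angles)
```

Under these assumptions the 49 multiplications fold onto nine angles, 0 through 8 in units of 2π/64, where angle 0 is the trivial pair (1, 0) and angle 8 gives cos = sin = 1/sqrt(2).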

Multiplier unit and scheduling
Figures from Maharatna, K., E. Grass, and U. Jagdhold. "A 64-point Fourier Transform Chip for High-speed Wireless LAN Application Using OFDM." Solid-State Circuits 39 (2004): 484-493. Copyright IEEE. Used with permission.
- Some of the coefficients are requested concurrently by different FFT outputs
  - Solved by adding temporary registers in the input unit
- ~5% less power and area than 8 standard complex multipliers
- The buffer unit is similar to the input unit, just without temporary registers
  - Its outputs are also hardwired with a distance of 8

Output unit
Figure from Maharatna, K., E. Grass, and U. Jagdhold. "A 64-point Fourier Transform Chip for High-speed Wireless LAN Application Using OFDM." Solid-State Circuits 39 (2004): 484-493. Copyright IEEE. Used with permission.
- A mirror of the input unit, just without temporary registers
- Control/sync is simple
  - 5-bit counter
  - Starts counting when the input is full
  - Local counters control the input, intermediate, and output units

Readings
[1] S. He and M. Torkelson, "A new approach to pipeline FFT processor," Proceedings of IPPS '96, The 10th International Parallel Processing Symposium, pp. 766-770, 1996.
[2] S. He and M. Torkelson, "Designing pipeline FFT processor for OFDM (de)modulation," 1998 URSI International Symposium on Signals, Systems, and Electronics (ISSSE 98), pp. 257-262, 1998.
[3] E. Wold and A. M. Despain, "Pipeline and Parallel-Pipeline FFT Processors for VLSI Implementations," IEEE Transactions on Computers, vol. 33, no. 5, pp. 414-426, 1984.
[4] G. Bi and E. V. Jones, "A pipelined FFT processor for word-sequential data," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 1982-1985, 1989.
[5] K. Maharatna, E. Grass, and U. Jagdhold, "A 64-point Fourier transform chip for high-speed wireless LAN application using OFDM," IEEE Journal of Solid-State Circuits, vol. 39, pp. 484-493, 2004.

Interesting DIT&F algorithm
[6] C. Chiu, W. Hui, T. J. Ding, and J. V. McCanny, "A 64-point Fourier transform chip for video motion compensation using phase correlation," IEEE Journal of Solid-State Circuits, vol. 31, pp. 1751-1761, 1996.

Power-performance estimation
[7] S. Hong, S. Kim, M. C. Papaefthymiou, and W. E. Stark, "Power-complexity analysis of pipelined VLSI FFT architectures for low energy wireless communication applications," 42nd Midwest Symposium on Circuits and Systems, 1999.
[8] K. Pagiamtzis and P. G. Gulak, "Empirical performance prediction for IFFT/FFT cores for OFDM systems-on-a-chip," 45th Midwest Symposium on Circuits and Systems (MWSCAS-2002), 2002.