Fast Fourier Transform: VLSI Architectures

Lecture
Vladimir Stojanović
6.973 Communication System Design, Spring 2006
Massachusetts Institute of Technology

Cite as: Vladimir Stojanovic, course materials for 6.973 Communication System Design, Spring 2006.

Pipelined FFT architectures: examples

[Figure: block diagrams of five pipelined FFT architectures built from butterfly (BF), commutator (C), and delay elements.]

(1) R2MDC (N=16): radix-2 multi-path delay commutator
(2) R2SDF (N=16): radix-2 single-path delay feedback
(3) R4SDF (N=256): radix-4 single-path delay feedback
(4) R4MDC (N=256): radix-4 multi-path delay commutator
(5) R4SDC (N=256): radix-4 single-path delay commutator

Radix-2 Multi-path Delay Commutator

[Figure: R2MDC pipeline of commutator (C), butterfly (BF), and delay elements.]

- The most classical approach for pipeline implementation of the radix-2 FFT
- Input sequence is broken into two parallel data streams flowing forward, with the correct distance between data elements entering the butterfly scheduled by proper delays
- Both butterflies and multipliers are at 50% utilization

Radix-2 Single-path Delay Feedback [Wold & Despain 84]

[Figure: R2SDF pipeline of butterfly (BF) stages, each with a feedback delay line.]

- Uses registers more efficiently, both as the input and the output of the butterfly
- A single data stream goes through the multiplier at every stage
- Multiplier utilization is also 50%
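The delay-feedback idea can be illustrated with a small cycle-by-cycle model. This is only a behavioral sketch under stated assumptions, not the circuit from the slide: it assumes a decimation-in-frequency radix-2 pipeline with no extra pipeline registers, feedback delays N/2, N/4, ..., 1, and a single free-running cycle counter `t`. The names `r2sdf_fft` and `bit_reverse` are made up for this example.

```python
import cmath
from collections import deque


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    return int(format(i, "0{}b".format(bits))[::-1], 2)


def r2sdf_fft(frame):
    """Behavioral model of an N-point radix-2 SDF pipeline (N a power of 2).

    Returns the N DFT values in bit-reversed order, which is the order in
    which this kind of pipeline emits them.
    """
    n = len(frame)
    stages = n.bit_length() - 1                       # log2(N)
    delays = [n >> (s + 1) for s in range(stages)]    # N/2, N/4, ..., 1
    fifos = [deque([0j] * d) for d in delays]         # one feedback FIFO per stage
    outputs = []
    # Feed one frame, then N-1 zeros to flush the pipeline.
    stream = list(frame) + [0j] * (n - 1)
    for t, sample in enumerate(stream):
        x = complex(sample)
        for d, fifo in zip(delays, fifos):
            old = fifo.popleft()
            if (t // d) % 2 == 0:
                # Load phase: butterfly idle, input goes into the FIFO.
                # The FIFO releases a stored difference, which is the only
                # data that passes through the twiddle multiplier.
                fifo.append(x)
                x = old * cmath.exp(-2j * cmath.pi * (t % d) / (2 * d))
            else:
                # Compute phase: the sum leaves the stage at once, the
                # difference is written back into the same FIFO.
                fifo.append(old - x)
                x = old + x
        outputs.append(x)
    return outputs[n - 1:]                            # latency is N-1 cycles
```

In this model the twiddle multiplication happens only during the load phase, that is, on half of the cycles, which is one way to see the 50% multiplier utilization quoted above. The FIFO holds inputs in one phase and butterfly outputs in the other, which is the register sharing the slide refers to.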

Radix-4 Single-path Delay Feedback [Despain 74]

[Figure: R4SDF pipeline, the radix-4 DFT signal flow, and the radix-4 butterfly with four inputs x, four outputs y, and twiddle factor multipliers W.]

- Utilization of multipliers is 75%, achieved by storing 3 BF outputs
- Radix-4 butterfly utilization is only 25%
- Butterfly is fairly complicated: at least 8 complex adders

Radix-4 Multi-path Delay Commutator [Swartzlander 84]

[Figure: R4MDC pipeline of commutator (C) and butterfly (BF) stages, the radix-4 DFT signal flow, and the radix-4 butterfly with twiddle factor multipliers W.]

- What is the utilization of the butterflies? Of the multipliers?

Radix-4 Single-path Delay Commutator [Bi & Jones 89]

[Figure: R4SDC pipeline of input commutator, butterfly element, and coefficient multiplier per stage, with control signals c, and the radix-4 butterfly.]

- Modified radix-4 algorithm
- Programmable 1/4 radix-4 BF, 75% utilization
- Used to build one of the largest single-chip FFTs (8192 pts) [Bidet 95]

R4SDC commutator and butterfly details

[Figure: the R4SDC input commutator built from multiplexers and delays, the order of its outputs over time at each stage, and the butterfly element built from add/sub units operating on real and imaginary parts.]

Some conclusions

- Delay-feedback approaches are always more efficient than the corresponding delay-commutator approaches
  - In terms of memory utilization
  - Since the butterfly outputs share the same storage with its inputs
- Pipeline architectures require FFT algorithms to be formulated in a hardware-oriented form
  - Where spatial regularity is preserved in the signal-flow graph (SFG)
  - So that arithmetic operations can be tightly scheduled for efficient hardware utilization

Decomposition: a review

- The twiddle factor W_N is the Nth primitive root of unity
  - With its exponent evaluated modulo N
- Most fast algorithms share the same general strategy
  - Map the one-dimensional transform into a two- or multidimensional representation
  - Exploit the congruence property of the coefficients to simplify computation
- Unlike the traditional step-by-step decomposition of twiddle factors, cascading the twiddle factor decomposition leads to new forms of FFT with high spatial regularity
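For reference, the N-point DFT and the twiddle factor properties used in this review can be written as follows. This is the standard textbook form, not a copy of the slide's own equation, and the negative sign in the exponent is an assumed convention.

```latex
X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad 0 \le k < N, \qquad W_N = e^{-j 2\pi / N}

W_N^{m+N} = W_N^{m}, \qquad W_N^{N/2} = -1, \qquad W_N^{N/4} = -j
```

The first identity in the second line is the "exponent evaluated modulo N" property. The other two are the congruences that turn some twiddle multiplications into sign changes or real/imaginary swaps.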

Radix-2^2 approach

- Start with classical divide-and-conquer radix-2 DIF indexing
- But consider the first two steps of decomposition together [Shousheng and Torkelson 1996]
- Computing the first step directly gives the standard radix-2 approach
- New idea is to proceed to shorter DFTs by cascading the twiddle factor W_N^((N/4·n2 + n3)·k1)
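The two-step decomposition can be written out as below. This is a sketch following the usual presentation of the radix-2^2 algorithm; the index names n1, n2, n3, k1, k2, k3 and the symbol H are assumptions chosen to match the twiddle factor quoted above, with n1, n2, k1, k2 in {0, 1} and n3, k3 in {0, ..., N/4 - 1}.

```latex
n = \tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + n_3, \qquad k = k_1 + 2 k_2 + 4 k_3

W_N^{nk} = (-1)^{n_1 k_1}\, (-j)^{n_2 (k_1 + 2 k_2)}\, W_N^{n_3 (k_1 + 2 k_2)}\, W_{N/4}^{n_3 k_3}

X(k_1 + 2 k_2 + 4 k_3) = \sum_{n_3=0}^{N/4-1} \left[ H(k_1, k_2, n_3)\, W_N^{n_3 (k_1 + 2 k_2)} \right] W_{N/4}^{n_3 k_3}

H(k_1, k_2, n_3) = \left[ x(n_3) + (-1)^{k_1} x\!\left(n_3 + \tfrac{N}{2}\right) \right]
  + (-j)^{(k_1 + 2 k_2)} \left[ x\!\left(n_3 + \tfrac{N}{4}\right) + (-1)^{k_1} x\!\left(n_3 + \tfrac{3N}{4}\right) \right]
```

The two bracketed sums in H are radix-2 butterflies, the factor (-j)^(k1+2k2) is trivial, and the only full multiplication left is by W_N^(n3(k1+2k2)) once every two butterfly stages.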

A 16pt example

- Get radix-4-like multiplier complexity with radix-2 butterfly structures (radix-2^2)

[Figure: two 16-point signal flow graphs. The first shows the butterfly with decomposed twiddle factors feeding four N/4-point DFTs, one for each (k1, k2) pair. The second shows the full radix-2^2 flow graph with butterfly columns BF I, BF II, BF III, BF IV and outputs X(k) in bit-reversed order.]
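The structure in this example can be checked numerically with a short recursive routine. This is a software sketch of the radix-2^2 decimation-in-frequency recursion, not a model of the hardware; the function name `fft_radix22` is made up, and lengths that are powers of 2 but not powers of 4 are handled by falling back to a single radix-2 butterfly at the last level, which is an assumption of this sketch.

```python
import cmath


def fft_radix22(x):
    """Radix-2^2 DIF FFT of a sequence whose length is a power of 2.

    Returns the DFT in natural order.
    """
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    half, quarter = n // 2, n // 4
    out = [0j] * n
    for k1 in (0, 1):
        sign = -1 if k1 else 1
        # BF I: radix-2 butterfly over samples N/2 apart.
        b = [x[m] + sign * x[m + half] for m in range(half)]
        for k2 in (0, 1):
            # BF II: radix-2 butterfly over samples N/4 apart, with the
            # trivial twiddle (-j)^(k1 + 2*k2) folded in.
            trivial = (-1j) ** (k1 + 2 * k2)
            h = [b[n3] + trivial * b[n3 + quarter] for n3 in range(quarter)]
            # The only nontrivial multiplication: W_N^(n3 * (k1 + 2*k2)).
            g = [h[n3] * cmath.exp(-2j * cmath.pi * n3 * (k1 + 2 * k2) / n)
                 for n3 in range(quarter)]
            sub = fft_radix22(g)          # N/4-point DFT over n3
            for k3 in range(quarter):
                out[k1 + 2 * k2 + 4 * k3] = sub[k3]
    return out
```

For k1 = k2 = 0 the nontrivial twiddle is 1 for every n3, so one quarter of the data needs no multiplication at that level.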

A radix-2^2 example

[Image removed due to copyright restrictions.]

Radix-2^2 (R2^2SDF) architecture, N=256

[Figure: R2^2SDF pipeline for N=256, alternating BF I and BF II stages with feedback delays 128, 64, 32, 16, 8, 4, 2, 1, a twiddle multiplier W(n) after every BF II except the last, and control bits taken from a clocked counter. Detail diagrams of (i) BF I and (ii) BF II operating on real and imaginary parts.]

- Similar to R2SDF
  - Reduced number of multipliers
- Needs two types of butterflies
  - One identical to that in R2SDF
  - The other also contains the logic for trivial twiddle factor multiplication (by -j)
- Synchronization control is very simple due to spatial regularity
  - Just a log2(N)-bit binary counter

Radix-2^2 architecture

[Figure: R2^2SDF pipeline for N=256, as on the previous slide.]

- Sync control: a log2(N)-bit binary counter serves as
  - Synchronization controller
  - Address counter for twiddle factor reading in each stage
- On the first N/2 cycles, the 2-to-1 muxes in BF I switch to position 0
  - Butterfly is idle (input data directed to the shift registers)
- On the next N/2 cycles, the muxes in BF I switch to position 1
  - Butterfly computes a 2pt DFT with the incoming data and the data stored in the shift registers
  - Output Z1(n) is sent to the twiddle multiplier
  - Output Z1(n+N/2) is sent back to the shift registers, to be multiplied in the next N/2 cycles, when the first half of the next frame is loaded in
- Operation of BF II is similar, except that the distance of the butterfly input sequence is just N/4 and there is the trivial multiply logic
- Utilization of the multiplier is 75%
- Next frame can be computed w/o pausing, due to the pipelined processing in each stage
- A pipeline register can be inserted between each multiplier and BF stage to improve performance
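The role of the binary counter can be sketched in a few lines. The assumption here, inferred from the description above rather than taken from the actual design, is that the stage with feedback delay 2^i takes its mux select from bit i of a free-running log2(N)-bit counter, so the first stage (delay N/2) follows the most significant bit. The function name `mux_selects` is hypothetical.

```python
def mux_selects(t, n):
    """Mux select bits for all butterfly stages at clock cycle t.

    The list is ordered from the first stage (feedback delay N/2) to the
    last stage (feedback delay 1). 0 means load, 1 means compute.
    """
    bits = n.bit_length() - 1          # log2(N)
    count = t % n                      # log2(N)-bit counter wraps every frame
    return [(count >> i) & 1 for i in range(bits - 1, -1, -1)]


if __name__ == "__main__":
    for cycle in (0, 5, 8, 12):
        print(cycle, mux_selects(cycle, 16))
```

With N = 16, the first stage's select is 0 for cycles 0 to 7 and 1 for cycles 8 to 15, matching the "first N/2 cycles, next N/2 cycles" description. The additional control that enables the trivial -j multiplication in BF II is not modeled here.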

Arithmetic complexity

Architecture   Multiplier #      Adder #      Memory size   Control
R2MDC          2(log4 N - 1)     4 log4 N     3N/2 - 2      simple
R2SDF          2(log4 N - 1)     4 log4 N     N - 1         simple
R4SDF          log4 N - 1        8 log4 N     N - 1         medium
R4MDC          3(log4 N - 1)     8 log4 N     5N/2 - 4      simple
R4SDC          log4 N - 1        3 log4 N     2N - 2        complex
R2^2SDF        log4 N - 1        4 log4 N     N - 1         simple

- R2^2SDF has reached the minimum requirement for both multipliers and storage
  - Only R4SDC is better in terms of adder usage
- R2^2SDF is well suited for VLSI implementations of pipeline FFT processors
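To get a feel for the numbers, the complexity expressions can be evaluated for a concrete N. The formulas below are the commonly quoted ones from the He and Torkelson comparison (complex multipliers, complex adders, complex words of memory) and are stated here as an assumption of this sketch; the function name `pipeline_costs` is made up.

```python
def pipeline_costs(n):
    """Return {architecture: (multipliers, adders, memory words)} for N = 4^s."""
    s = (n.bit_length() - 1) // 2          # log4(N)
    if n < 4 or 4 ** s != n:
        raise ValueError("N must be a power of 4")
    return {
        "R2MDC":   (2 * (s - 1), 4 * s, 3 * n // 2 - 2),
        "R2SDF":   (2 * (s - 1), 4 * s, n - 1),
        "R4SDF":   (s - 1,       8 * s, n - 1),
        "R4MDC":   (3 * (s - 1), 8 * s, 5 * n // 2 - 4),
        "R4SDC":   (s - 1,       3 * s, 2 * n - 2),
        "R2^2SDF": (s - 1,       4 * s, n - 1),
    }


if __name__ == "__main__":
    for name, (mults, adders, memory) in pipeline_costs(256).items():
        print("{:8s} multipliers={:2d} adders={:2d} memory={:4d}".format(
            name, mults, adders, memory))
```

For N = 256 this gives 3 multipliers, 16 adders, and 255 words for R2^2SDF, against 3, 12, and 510 for R4SDC, which is the trade-off noted in the bullets.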

Memory issues

[Figure: a FIFO built from one 2-port RAM with separate write and read addresses, and an alternative built from two half-size 1-port RAMs with a shared R/W address.]

- The area/power consumption in the pipeline architectures is dominated by
  - The FIFO register files at each stage
  - The complex multipliers at each (or every other) stage
- To diminish the unnecessary data movement in the FIFO, the storage needs to be restructured
- A known approach is to build the FIFO with a 2-port RAM
  - With read and write addresses displaced by a constant
  - A 2-port RAM cell takes more area than a 1-port RAM cell
- Use two half-size 1-port RAMs
  - Read and write interleaved
  - Each active every other cycle
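One possible way to interleave reads and writes across two single-port memories is sketched below. It is an illustrative assumption, not the circuit from the slide: addresses are split by parity into two banks, the read address leads the write address by one so the two accesses always fall in different banks, and one output register makes up the final cycle of delay. The class name `InterleavedFifo` is hypothetical.

```python
class InterleavedFifo:
    """Fixed-delay FIFO of even depth built from two single-port banks."""

    def __init__(self, depth):
        if depth < 2 or depth % 2:
            raise ValueError("depth must be even and at least 2")
        self.depth = depth
        # banks[0] holds even addresses, banks[1] holds odd addresses.
        self.banks = [[0] * (depth // 2), [0] * (depth // 2)]
        self.out_reg = 0
        self.t = 0

    def step(self, value):
        """Advance one clock: store `value`, return the value from `depth` cycles ago."""
        out = self.out_reg
        read_addr = (self.t + 1) % self.depth
        write_addr = self.t % self.depth
        # Each bank sees exactly one access per cycle: a read or a write.
        assert (read_addr & 1) != (write_addr & 1)
        self.out_reg = self.banks[read_addr & 1][read_addr >> 1]
        self.banks[write_addr & 1][write_addr >> 1] = value
        self.t += 1
        return out


if __name__ == "__main__":
    fifo = InterleavedFifo(4)
    print([fifo.step(v) for v in range(1, 11)])
```

Data stays in place once written; only the two address pointers move, which is what avoids shifting every word through a register chain on every cycle.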

Single stage hardware example

[Figure: an 8-point FFT signal flow graph with inputs x[0] to x[7] and bit-reversed outputs X[k], and a single-stage architecture made of S/P and bit-reverse, control circuits, N/r butterflies, coefficient ROM, P/S, and counter blocks.]

- T_FFT = (N/r) · log_r(N) · T_r,PE, where
  - N/r = no. of butterflies per stage
  - log_r N = no. of stages
  - T_r,PE = time to calculate one butterfly
- Fold stages onto each other
  - Needs a constant-geometry signal flow graph
- Big price in area for parallelism (within each stage) [Sadat]

Radix-8 Pipelined/Parallel implementation

- A 64pt FFT example for 802.11a [excerpted from Maharatna et al.]
- Two-dimensional structure of 8pt FFTs
- The number of nontrivial complex multiplications is 49 (7x7)
  - Since the first twiddle is always 1
- The number of nontrivial complex multiplications for a radix-2 FFT is 66
- Radix-4 (or 2^2) FFTs need fewer multiplies than radix-2
- Important to note that the 8pt FFT (DIT) itself needs no multiplies
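The 7x7 count can be reproduced by enumerating the inter-stage twiddle factors of a two-dimensional decomposition. This is a minimal sketch, assuming the twiddle between the two 8-point stages is W_64^(a·b) for row index a and column index b, and counting a factor as trivial only when it equals 1, as the slide does. The function name `nontrivial_twiddles` is made up.

```python
def nontrivial_twiddles(n1, n2):
    """Count inter-stage twiddles W_N^(a*b), N = n1*n2, that differ from 1."""
    n = n1 * n2
    return sum(1 for a in range(n1) for b in range(n2) if (a * b) % n != 0)


if __name__ == "__main__":
    print(nontrivial_twiddles(8, 8))
```

Whenever a = 0 or b = 0 the exponent is zero, which removes one full row and one full column of the 8x8 grid and leaves 7 x 7 = 49 factors.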

8pt DIT FFT

Figure from Maharatna, K., E. Grass, and U. Jagdhold. "A 64-point Fourier Transform Chip for High-speed Wireless LAN Application Using OFDM." Solid-State Circuits 39 (2004): 484-493. Copyright IEEE. Used with permission.

- The only nontrivial multiply is by 1/sqrt(2)
- Easily realized using hardwired shift-and-add
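A hardwired shift-and-add multiply by 1/sqrt(2) can be sketched as follows. The particular set of shifts is an assumption chosen for illustration (it approximates 0.70711 by 0.70703); the shift pattern used on the actual chip is not given here. The function name `mul_inv_sqrt2` is hypothetical.

```python
def mul_inv_sqrt2(x):
    """Approximate x / sqrt(2) for an integer x using only shifts and adds.

    Uses 2^-1 + 2^-3 + 2^-4 + 2^-6 + 2^-8 = 0.70703125.
    """
    return (x >> 1) + (x >> 3) + (x >> 4) + (x >> 6) + (x >> 8)


if __name__ == "__main__":
    x = 1 << 16
    print(mul_inv_sqrt2(x), x / 2 ** 0.5)
```

In hardware each shift is just wiring, so the whole operation costs four adders and no multiplier. More terms can be added if the word length calls for a tighter approximation.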

Block diagram of the FFT unit

Figure from Maharatna, K., E. Grass, and U. Jagdhold. Copyright IEEE. Used with permission.

- The two stages are pipelined
- Fully parallel in each stage (radix-2 8pt FFT, single clk cycle)
- Two performance bottlenecks
  - Large number of global wires resulting from the multiplexing of complex data to the 8-point FFTs
  - Construction of the multiplier unit to attain the required speed with minimal silicon area is not trivial

Input unit

Figure from Maharatna, K., E. Grass, and U. Jagdhold. Copyright IEEE. Used with permission.

- Hard-wired outputs and data shifting
  - To the 8pt FFT
  - Reduce de-muxing
  - Reduce global wires
- Cannot shift every clk
  - Multiplier cannot finish
  - Extend latency
  - Temporary registers

Multiplier unit

Figure from Maharatna, K., E. Grass, and U. Jagdhold. Copyright IEEE. Used with permission.

- 49 multiplies
- Only nine unique sets of (cos, sin)
  - Hard-wired constants
  - Significantly less storage space for coefficients
  - Turn multiplies into shift & add

Multiplier unit and scheduling

Figures from Maharatna, K., E. Grass, and U. Jagdhold. Copyright IEEE. Used with permission.

- Some of the coefficients are requested concurrently by different FFT outputs
  - Solved by adding temporary registers in the input unit
- Less power and area than an implementation with standard complex multipliers
- Buffer unit is similar to the input unit, just w/o temporary registers
  - Outputs also hardwired with a distance of 8

Output unit

Figure from Maharatna, K., E. Grass, and U. Jagdhold. Copyright IEEE. Used with permission.

- A mirror of the input unit
  - Just w/o temporary registers
- Control/sync is simple
  - 5-bit counter
  - Starts counting when the input is full
  - Local counters control the input, intermediate, and output units

Readings

- He Shousheng and M. Torkelson, "A new approach to pipeline FFT processor," Proceedings of IPPS '96, The 10th International Parallel Processing Symposium, pp. 766-770, 1996.
- He Shousheng and M. Torkelson, "Designing pipeline FFT processor for OFDM (de)modulation," ISSSE 98, 1998 URSI International Symposium on Signals, Systems, and Electronics, pp. 257-262, 1998.
- E. Wold and Alvin M. Despain, "Pipeline and Parallel-Pipeline FFT Processors for VLSI Implementations," IEEE Trans. Computers, vol. 33, no. 5, pp. 414-426, 1984.
- G. Bi and E.V. Jones, "A pipelined FFT processor for word-sequential data," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 1982-1985, 1989.
- K. Maharatna, E. Grass and U. Jagdhold, "A 64-point Fourier transform chip for high-speed wireless LAN application using OFDM," IEEE Journal of Solid-State Circuits, vol. 39, pp. 484-493, 2004.

Interesting DIT&F algorithm

- C. Chiu, W. Hui, T.J. Ding and J.V. McCanny, "A 64-point Fourier transform chip for video motion compensation using phase correlation," IEEE Journal of Solid-State Circuits, vol. 31, pp. 1751-1761, 1996.

Power-performance estimation

- S. Hong, S. Kim, M.C. Papaefthymiou and W.E. Stark, "Power-complexity analysis of pipelined VLSI FFT architectures for low energy wireless communication applications," 42nd Midwest Symposium on Circuits and Systems, 1999.
- K. Pagiamtzis and P.G. Gulak, "Empirical performance prediction for IFFT/FFT cores for OFDM systems-on-a-chip," MWSCAS-2002, The 45th Midwest Symposium on Circuits and Systems, pp. I-583-6, 2002.