EEE 6323 Advanced VLSI Design - Spring 2015
Instructor: R. Bashirullah
TA: Qiuzhong Wu (qiuzhongwu@ufl.edu)
Due Monday, April 20, 2015 (by noon)

The goal of the project is to study one of the specified topics and design an architecture that consumes low power, is less sensitive to process variability, and occupies as little area as possible. Any of the low-power techniques taught in class (or new ones) can be used when implementing these projects.

LIST OF PROJECTS:

1. MIPS

The architecture of the MIPS processor can be taken from the Computer Architecture book written by John L. Hennessy and David A. Patterson. The goal of this project is to take the baseline unoptimized implementation of the MIPS processor given in the book and optimize it for power, energy, and area. Any of the power-saving techniques can be used. Pipelining and parallelism can be used. Low-power RAMs can be used. You may even go for a subthreshold design.

"A 2.60pJ/Inst Subthreshold Sensor Processor for Optimal Energy Efficiency," Bo Zhai, Leyla Nazhandali, Javin Olson, Anna Reeves, Michael Minuth, Ryan Helfand, Sanjay Pant, David Blaauw, and Todd Austin.
"Low-power CMOS digital design," A. Chandrakasan, S. Sheng, and R. Brodersen, IEEE J. Solid-State Circuits, vol. 27, pp. 473-483, Apr. 1992.
"A Leakage Reduction Methodology for Distributed MTCMOS," B. Calhoun, et al., IEEE Journal of Solid-State Circuits, vol. 39, no. 5, May 2004.
"A shared-well dual-supply-voltage 64-bit ALU," IEEE Journal of Solid-State Circuits, Mar. 2004, pp. 494-500.

2. FFT Processor

For the FFT project, you must create a hardware implementation of an FFT. The hardware implementation may be derived from any FFT algorithm (Cooley-Tukey, Good-Thomas, or any other). It can be a radix-2, radix-4, or any specialized FFT implementation. Low power is the main criterion here. Below are several articles on FFT hardware implementations:

"A 180-mV Subthreshold FFT Processor Using a Minimum Energy Design Methodology," A. Wang and A. P. Chandrakasan, IEEE Journal of Solid-State Circuits, vol. 40, no. 1, pp. 310-319, January 2005.
"A single chip radix-2 FFT butterfly architecture using parallel data distributed arithmetic," I. R. Mactaggart and M. A. Jack, IEEE Journal of Solid-State Circuits, vol. 19, no. 3, pp. 368-373, June 1984.
"A Low-Power, High-Performance, 1024-Point FFT Processor," Bevan M. Baas.
"Design and implementation of a 1024-point pipeline FFT processor," S. He and M. Torkelson, in Proc. IEEE Custom Integrated Circuits Conf., May 1998, pp. 131-134.
"A high precision 1024-point FFT processor for 2D convolution," M. Wosnitza, M. Cavadini, M. Thaler, and G. Troster, in Proc. IEEE Int. Solid-State Circuits Conf., 1998, vol. 41, pp. 118-119, 424.
"A radix 4 delay commutator for fast Fourier transform processor implementation," E. E. Swartzlander, W. K. W. Young, and S. J. Joseph, IEEE Journal of Solid-State Circuits, vol. 19, no. 5, pp. 702-709, Oct. 1984.
"A VLSI array processor for 16-point FFT," Moon-Key Lee, Kyung-Wook Shin, and Jang-Kyu Lee, IEEE Journal of Solid-State Circuits, vol. 26, no. 9, pp. 1286-1292, Sept. 1991.

3. Digital PLL

Phase-locked loops (PLLs) are used to recover timing information from a signal. They are ubiquitous in communications and are also used for timing recovery on boards and chips. Analog PLLs are very hard to design because they use feedback and are very sensitive to noise and operating parameters. The goal of this project is to design a purely digital PLL and compare its performance (measured in lock time and phase noise) and costs (in terms of area, power, and delay) to a traditional analog PLL. Some papers that can be consulted are:

"An All-Digital Phase-Locked Loop with 50-Cycle Lock Time Suitable for High-Performance Microprocessors," Jim Dunning, Gerald Garcia, Jim Lundberg, and Ed Nuckolls, IEEE Journal of Solid-State Circuits, vol. 30, no. 4, April 1995.
R. E. Best, Phase-Locked Loops: Theory, Design and Applications, 2nd ed. New York: McGraw-Hill, 1993.
"A Digitally Controlled PLL for SoC Applications," Thomas Olsson and Peter Nilsson, IEEE Journal of Solid-State Circuits, vol. 39, no. 5, May 2004, p. 751.
"A fully integrated standard-cell digital PLL," T. Olsson and P. Nilsson, IEE Electron. Lett., vol. 37, pp. 211-212, Feb. 2001.

4. High-Speed N-bit Kogge-Stone Adder (N >= 32)

The KS adder uses a parallel-prefix topology to shorten the critical path in the adder.
The critical path, which is the carry generation path, has a logarithmic dependence on the bit width. This should be compared to the linear dependence in the ripple-carry adder. There are many ways to implement the carry generation tree for parallel-prefix adders, but the KS implementation is the most straightforward, and it also has one of the shortest critical paths of all tree adders. The drawback of the KS implementation is the large area consumed and the somewhat complex routing of interconnects.

If you have a 16-bit adder, you will have 32 input pads and 16 output pads. This amounts to 48 pads, which is too many. Because of the limited number of pads, a bit serial-to-parallel input/output interface (SPI) must be used to feed input vectors to the adder and get back the output. The inputs are fed to the circuit in a bit-serial data stream and are converted into N-bit vectors by the serial-to-parallel converters. The sum vector is read out through a parallel-to-serial interface. In addition to speed, use low-power techniques to minimize power as well.
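
The logarithmic depth of the carry tree can be checked with a small bit-level model before writing any HDL. The Python sketch below is only an illustrative reference model, not part of the assignment: the function name `kogge_stone_add`, its arguments, and the choice to fold the carry-in into bit 0 are assumptions made for this example.

```python
def kogge_stone_add(a, b, n=32, cin=0):
    """Hypothetical reference model of an n-bit Kogge-Stone adder.

    Returns (sum, carry_out, prefix_levels).
    """
    mask = (1 << n) - 1
    a &= mask
    b &= mask

    # Bitwise generate and propagate signals.
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    # Group signals; the carry-in is folded into bit 0 (an assumption
    # of this sketch, other arrangements are possible).
    G = g[:]
    P = p[:]
    G[0] = g[0] | (p[0] & cin)

    # Parallel-prefix tree: at each level, bit i combines with bit i - d,
    # and the span d doubles, so there are ceil(log2(n)) levels.
    levels = 0
    d = 1
    while d < n:
        G_next = G[:]
        P_next = P[:]
        for i in range(d, n):
            G_next[i] = G[i] | (P[i] & G[i - d])
            P_next[i] = P[i] & P[i - d]
        G, P = G_next, P_next
        d *= 2
        levels += 1

    # After the tree, G[i] is the carry out of bit i.
    carries = [cin] + G[:-1]
    total = 0
    for i in range(n):
        total |= (p[i] ^ carries[i]) << i
    return total, G[n - 1], levels


if __name__ == "__main__":
    s, cout, levels = kogge_stone_add(0x12345678, 0x0FEDCBA8, n=32)
    print(hex(s), cout, levels)
```

A model like this can double as a golden reference when generating test vectors for the serial interface, since the HDL result can be compared against it bit for bit.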

J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits, 2nd ed. Prentice Hall, 2003, ISBN 0-13-120764-4.
N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley, 1993.

5. Low Power N-bit Radix-4 Booth Multiplier (N >= 32)

Normal array multipliers compute partial products in a radix-2 manner, which leads to a large number of partial products. You can decrease the number of partial products by increasing the radix of your multiplication. This leads to fewer partial products and hence a smaller and faster carry-save adder (CSA) array. Radix-4 yields N/2 partial products, each of which is 0, 1, 2, or 3 times the multiplicand. Multiplication by 3 is hard. To solve this, Booth encoding is used, which recodes each radix-4 digit into the set {-2, -1, 0, 1, 2} and so removes the need to multiply the multiplicand by 3. In this project you will design and lay out a Booth multiplier that is 16 or more bits wide.

Now, if you have a 16-bit multiplier, you will have 32 input pads and 32 output pads. This amounts to 64 pads, which is too many. Because of the limited number of pads, a bit serial-to-parallel input/output interface (SPI) must be used to feed input vectors to the multiplier and get back the output. The inputs are fed to the circuit in a bit-serial data stream and are converted into N-bit vectors by the serial-to-parallel converters. The outputs of the multiplier are read out through a parallel-to-serial interface. Use low-power techniques to minimize power.

J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits, 2nd ed. Prentice Hall, 2003, ISBN 0-13-120764-4.
N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley, 1993.

6. ECoG processor

Brain-computer interfaces (BCIs) convert brain signals into outputs that communicate a user's intent. The electrocorticographic activity (ECoG) recorded from the cortical surface can serve as a modality for a non-invasive BCI method. The sensorimotor rhythms comprise three major frequency bands: Mu (8-12 Hz), Beta (18-26 Hz), and Gamma (30-200 Hz). Changes in these rhythm amplitudes correspond to a person's actions or imagined actions. The objective is to develop a processor that can be used to estimate the energy in each frequency band over a specified time period. Thus, in order to estimate the power spectrum of the rhythm frequencies, bandpass filters and FFTs (>= 8 points) will be required. A snapshot of the power spectrum values should be available every 0.2 seconds. Ultra-low power consumption (~1 uW) is of utmost importance for bio-implantable circuits.

See also the references for the FFT Processor.
E. C. Leuthardt, G. Schalk, J. R. Wolpaw, J. G. Ojemann, and D. W. Moran, "A brain-computer interface using electrocorticographic signals in humans," J. Neural Eng., vol. 1, no. 2, pp. 63-71, 2004.
K. J. Miller, E. C. Leuthardt, G. Schalk, R. P. N. Rao, N. R. Anderson, D. W. Moran, J. W. Miller, and J. G. Ojemann, "Spectral changes in cortical surface potentials during motor movement," J. Neurosci., vol. 27, no. 9, pp. 2424-2432, 2007.

7. A high speed ADC backend

For this project, a digital backend of a high-speed flash ADC is implemented. The desired sample rate is 4 GSamples/s, and the nominal resolution is 5 bits. Due to the extremely high throughput (20 Gb/s), it is impossible to test the ADC in real time at reasonable cost. There are two workarounds. The first is to decimate the ADC output until the data rate is within the equipment limit.
The other is to store the data in a memory (shown as FIFO in the block diagram) and later read it out for offline post-processing. Both approaches need to be implemented in this project. The design goal is high throughput when processing the sampled data.
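
The two workarounds can be prototyped in software to size the decimation factor and the FIFO depth before committing to hardware. The Python sketch below is a rough behavioral model under stated assumptions: decimation is taken to mean keeping every M-th sample with no filtering, and the names `output_rate_bps`, `decimate`, and `CaptureFifo` are hypothetical helpers introduced only for this example.

```python
from collections import deque

SAMPLE_RATE_HZ = 4e9      # 4 GSamples/s, from the project description
BITS_PER_SAMPLE = 5       # nominal resolution, from the project description


def output_rate_bps(decimation):
    """Data rate in bits per second after keeping one sample in `decimation`."""
    return SAMPLE_RATE_HZ * BITS_PER_SAMPLE / decimation


def decimate(samples, factor):
    """Keep every `factor`-th sample (assumed: plain sample dropping)."""
    return samples[::factor]


class CaptureFifo:
    """Fixed-depth FIFO that captures a burst of 5-bit codes for later readout."""

    def __init__(self, depth):
        self.depth = depth
        self._buf = deque()

    def push(self, code):
        """Store one code; returns False once the FIFO is full."""
        if len(self._buf) >= self.depth:
            return False
        self._buf.append(code & 0x1F)   # keep only the 5 data bits
        return True

    def pop(self):
        """Read the oldest code, or None when the FIFO is empty."""
        return self._buf.popleft() if self._buf else None

    def __len__(self):
        return len(self._buf)


if __name__ == "__main__":
    for m in (1, 8, 64):
        print(m, output_rate_bps(m))

    fifo = CaptureFifo(depth=4)
    accepted = [fifo.push(code) for code in range(6)]
    print(accepted, [fifo.pop() for _ in range(len(fifo))])
```

The FIFO model stops accepting data when full rather than overwriting old samples; that is one possible design choice for burst capture, and a circular buffer that keeps the most recent samples would be an equally valid alternative.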

Timeline

Date           Description                                                    Points
03/10          Project assigned
03/17          Form groups (5 students per group)
               Brainstorming phase: determine the topic and carry out
               a literature review
03/24          Submission of project topic (1-page description)              5 pts
03/24-04/06    Design and analysis phase: simulation, design, and analysis
04/06-04/17    Physical implementation phase: layout and I/O ring with
               full-chip DRC and LVS
04/20          By noon, report: 4-page paper
04/21          Final project report demo/presentation, 10:00 am,
               2nd floor Comp. Lab                                            100 pts

Important Dates:
March 24, 2015: Submit your design topic (5 pts).
April 20, 2015: Paper due along with LVS and DRC report.
April 21, 2015: Project check-off (100 pts).

Report

As a general guideline, first understand the specifications before starting the design implementation. Use HDL (Verilog-HDL/VHDL) as the design input. Go through the basic steps of the general VLSI design flow (from HDL to GDS). You will need to hand in both a soft copy and a hard copy of your source code:

1. Hand in the HDL source code, the result of each step, and the design report.
2. The design report should contain the design description, implementation notes, simulation results, and a performance summary (power, area, speed, etc.).
3. Submit your DRC and LVS report without the pads. If you also have a clean DRC and LVS report with the I/O pads, you get extra credit.
4. Write a 4-page double-column paper in IEEE format. Download a Word file template from: http://www.ieee.org/web/publications/authors/transjnl/index.html
   The paper should include a title, author list (group members), abstract, introduction, a section describing the design methodology, a section describing results and discussion, conclusions, and a reference list. All figures, including schematics, waveforms, plots, and layout, must be embedded within the paper. The paper cannot exceed four pages in length. Figures should be chosen appropriately to best explain the overall design and results.