L15: VLSI Integration and Performance Transformations

Acknowledgement: Materials in this lecture are courtesy of the following sources and are used with permission.
Curt Schurgers
J. Rabaey, A. Chandrakasan, B. Nikolic. Digital Integrated Circuits: A Design Perspective. Prentice Hall/Pearson, 2003.

L15: 6.111 Spring 2006, Introductory Digital Systems Laboratory

Layout 101

[Figure: 3-D cross-section of a CMOS inverter on a p-type substrate with an n-type well, showing the N-channel MOSFET (W_n, L_n) and the P-channel MOSFET (W_p, L_p); the circuit representation (IN, OUT, VDD, GND); and the layout (metal, poly, n+ diff, p+ diff, contacts from metal to ndiff and from metal to pdiff). Figure by MIT OpenCourseWare. Used with permission.]

Follow simple design rules (contract between process and circuit designers)

Custom Design/Layout

[Figure: die photograph of the Itanium integer datapath (scale: 1000 um). Signals arrive from the register files / cache / bypass and pass through multiplexers, shifter, adder stages 1 to 3, and sum select, with loopback bus wiring between stages, then go to the register files / cache. The datapath is organized as bit slices 0 through 63. Courtesy Intel, as reprinted in Rabaey, et al. "Digital Integrated Circuits".]

[Figure: adder schematic with 9-1, 5-1, and 2-1 muxes, CARRYGEN, SUMGEN + LU, SUMSEL, and REG blocks. LU: Logical Unit. Itanium has 6 integer execution units like this.]

Bit-Slice Design Methodology
Hand crafting the layout to achieve maximum clock rates (> 1 GHz)
Exploits regularity in datapath structure to optimize interconnects

The ASIC Approach

[Figure: ASIC design flow. Design capture in Verilog (or VHDL) gives the behavioral description; logic synthesis gives the structural description; floorplanning, placement, and routing give the physical description; circuit extraction precedes tape-out. Pre-layout simulation and post-layout simulation feed the design iteration loop.]

Most common design approach for designs up to 500 MHz clock rates

Standard Cell Example

[Figure: layout of a 3-input NAND cell (from ST Microelectronics) between the power supply line (VDD) and the ground supply line (GND), with its delay table in ns. C = load capacitance, T = input rise/fall time.]

Each library cell (FF, NAND, NOR, INV, etc.) and its size variations (strength of the gate) are fully characterized across temperature, loading, etc.

Standard Cell Layout Methodology

2-level metal technology:
With limited interconnect layers, dedicated routing channels between rows of standard cells are needed
Width of the cell allowed to vary to accommodate complexity
Interconnect plays a significant role in speed of a digital circuit

Current day technology:
Cell structure hidden under interconnect layers

Verilog to ASIC Layout (the push-button approach)

    module adder64 (a, b, sum);
      input [63:0] a, b;
      output [63:0] sum;
      assign sum = a + b;
    endmodule

[Figures: the resulting design after synthesis, after placement, and after routing.]

Macro Modules

[Figure: 256 × 32 (or 8192-bit) SRAM generated by a hard-macro module generator.]

Generate highly regular structures (entire memories, multipliers, etc.) with a few lines of code
Verilog models for memories automatically generated based on size

Clock Distribution

[Figure: two flip-flops (D, Q) clocked through different paths, illustrating clock skew. Image removed due to copyright restrictions.]

For a 1 GHz clock, the skew budget is 100 ps.
Variations along different paths arise from:
Device: V_T, W/L, etc.
Environment: VDD, °C
Interconnect: dielectric thickness variation

[Figure: IBM clock routing. Image removed due to copyright restrictions.]

The Power Supply Wires are Not Ideal!

[Figure: pads feeding the VDD grid and the GROUND GRID; a driver (R_d, C_d) connected to a receiver through interconnect with capacitance C_int and coupling capacitance C_coup. Used with permission.]

The IR-drop problem causes the internal power supply voltage to be less than the external source

Analog Circuits: Clock Frequency Multiplication (Phase Locked Loop)

[Figure: PLL block diagram with PFD (up/down outputs), loop filter, VCO, and divider. Courtesy Michael Perrott. Used with permission.]

VCO produces high frequency square wave
Divider divides down VCO frequency
PFD compares phase of ref and div
Loop filter extracts phase error information

Used widely in digital systems for clock synthesis (a standard IP block in most ASIC flows)

Scan Testing

Idea: have a mode in which all registers are chained into one giant shift register which can be loaded/read out bit-serially. Test the remaining (combinational) logic by:
(1) in test mode, shift in new values for all register bits, thus setting up the inputs to the combinational logic
(2) clock the circuit once in normal mode, latching the outputs of the combinational logic back into the registers
(3) in test mode, shift out the values of all register bits and compare against expected results.

[Figure: registers with a 0/1 input mux selected by ScanShift, chained from shift in to shift out and clocked by CLK. Used with permission.]

[Figure: timing diagram of Clk, ScanShift, primary inputs, and primary outputs. Load/unload cycles load the test vector into the scan flops, one normal system cycle captures the response to the test vector, and further load/unload cycles shift it out. Figure by MIT OpenCourseWare.]
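The three scan steps can be sketched as a small behavioral model. This is only an illustration under stated assumptions: `scan_test`, `comb_logic`, and the inverter example are hypothetical names, and the combinational logic is modeled as a Python function from register bits to register bits.

```python
def scan_test(comb_logic, num_bits, test_vector):
    """Behavioral model of one scan test (hypothetical helper, not from the lecture)."""
    regs = [0] * num_bits

    # (1) Test mode: shift the test vector in serially, one bit per clock.
    for bit in test_vector:
        regs = [bit] + regs[:-1]

    # (2) Normal mode: one clock latches the combinational outputs into the registers.
    regs = list(comb_logic(regs))

    # (3) Test mode: shift the captured response out serially.
    response = []
    for _ in range(num_bits):
        response.append(regs[-1])
        regs = [0] + regs[:-1]
    return response


# Assumed example logic: each register is fed by an inverter on its own output.
good_logic = lambda r: [1 - b for b in r]
# Same logic with the middle output stuck at 0 (an assumed fault).
faulty_logic = lambda r: [1 - r[0], 0, 1 - r[2]]

expected = scan_test(good_logic, 3, [1, 0, 0])
observed = scan_test(faulty_logic, 3, [1, 0, 0])
fault_detected = observed != expected
```

Comparing the shifted-out response of the device under test against the response of a known-good model is what exposes the stuck-at fault here.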

Behavioral Transformations

There are a large number of implementations of the same functionality
Each implementation represents a different point in the area-time-power design space
Behavioral transformations allow exploring the design space at a high level

Optimization metrics:
1. Area of the design
2. Throughput or sample time T_S
3. Latency: clock cycles between the input and associated output change
4. Power consumption
5. Energy of executing a task
6. time, power, area

Fixed-Coefficient Multiplication

Conventional multiplication: $Z = X \cdot Y$

                                X3    X2    X1    X0
                          ×     Y3    Y2    Y1    Y0
                                X3Y0  X2Y0  X1Y0  X0Y0
                          X3Y1  X2Y1  X1Y1  X0Y1
                    X3Y2  X2Y2  X1Y2  X0Y2
              X3Y3  X2Y3  X1Y3  X0Y3
        Z7    Z6    Z5    Z4    Z3    Z2    Z1    Z0

Constant multiplication (becomes hardwired shifts and adds): $Z = X \cdot (1001)_2$

                                X3    X2    X1    X0
                          ×     1     0     0     1
                                X3    X2    X1    X0
              X3    X2    X1    X0
        Z7    Z6    Z5    Z4    Z3    Z2    Z1    Z0

$Y = (1001)_2 = 2^3 + 2^0$, so $Z = (X \ll 3) + X$

[Figure: X and X << 3 feed one adder that produces Z; shifts are implemented using wiring.]
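As a quick check of the shift-and-add idea, the following sketch multiplies by a constant using one shifted copy of X per 1 bit of the constant. The function names are hypothetical, and Python integers stand in for the hardwired shifts.

```python
def shift_add_multiply(x, constant):
    """Multiply x by a non-negative constant using only shifts and adds."""
    total = 0
    position = 0
    while constant >> position:
        if (constant >> position) & 1:
            total += x << position  # in hardware this shift is just wiring
        position += 1
    return total


def adders_needed(constant):
    """One adder per 1 bit beyond the first (assumes a simple adder chain)."""
    return max(bin(constant).count("1") - 1, 0)


# Y = (1001)_2 = 9, so Z = (X << 3) + X needs a single adder.
results = [shift_add_multiply(x, 0b1001) for x in range(16)]
```

For the constant 9 this gives one adder, compared with the full 4 × 4 partial-product array of the conventional multiplier.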

Transform: Canonical Signed Digits (CSD)

Canonical signed digit representation is used to increase the number of zeros. It uses digits {-1, 0, 1} instead of only {0, 1}.

Iterative encoding: replace each string of consecutive 1's

    0 1 1 ... 1 1   =  $2^{N-2} + \dots + 2^1 + 2^0$

with

    1 0 0 ... 0 -1  =  $2^{N-1} - 2^0$

Worst case CSD has 50% non-zero bits

Example: 01101111

    0  1  1  0  1  1  1  1
    0  1  1  1  0  0  0 -1
    1  0  0 -1  0  0  0 -1

so $Z = (X \ll 7) - (X \ll 4) - X$

[Figure: X << 7, X << 4, and X combined to form Z.]

Shift translates to re-wiring
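A minimal sketch of CSD encoding for positive integers follows. It does not perform the slide's iterative string replacement. Instead it uses the standard non-adjacent-form recurrence (look at the value modulo 4), which reproduces the slide's result for 01101111. The names `to_csd` and `csd_multiply` are hypothetical.

```python
def to_csd(value):
    """Return signed digits (-1, 0, 1), most significant first, for value >= 0."""
    digits = []  # built least significant digit first
    while value != 0:
        if value & 1:
            # value mod 4 == 1 -> digit +1; value mod 4 == 3 -> digit -1
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits[::-1]


def csd_multiply(x, digits):
    """Multiply x by the constant encoded in digits using shifts, adds, and subtracts."""
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if digit:
            total += digit * (x << position)
    return total


csd_111 = to_csd(0b01101111)
nonzero_binary = bin(0b01101111).count("1")
nonzero_csd = sum(1 for d in csd_111 if d != 0)
```

For the slide's example the number of non-zero digits drops from 6 to 3, which means two adder/subtractor stages instead of five.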

Algebraic Transformations

Commutativity: $A + B = B + A$
Distributivity: $(A + B)\,C = AC + BC$
Associativity: $(A + B) + C = A + (B + C)$
Common sub-expressions: [Figure: two dataflow graphs over X and Y that share a repeated operation.]

[Figures: each identity is shown as a pair of equivalent dataflow graphs over the inputs A, B, and C.]

Transforms for Efficient Resource Utilization

[Figure: dataflow graph over inputs A through I, scheduled in time steps 1 and 2.]
Time multiplexing: mapped to 3 multipliers and 3 adders

After applying distributivity:

[Figure: transformed dataflow graph over the same inputs, scheduled in time steps 1 and 2.]
Reduce number of operators to 2 multipliers and 2 adders
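The slide's dataflow graph is not recoverable from the transcription, so the sketch below uses a smaller, assumed expression, A·C + B·C, to show how distributivity trades a multiplier away. `OpCounter`, `before`, and `after` are hypothetical names.

```python
class OpCounter:
    """Counts the multiplications and additions an expression performs."""

    def __init__(self):
        self.mults = 0
        self.adds = 0

    def mul(self, a, b):
        self.mults += 1
        return a * b

    def add(self, a, b):
        self.adds += 1
        return a + b


def before(ops, a, b, c):
    # A*C + B*C: two multipliers, one adder
    return ops.add(ops.mul(a, c), ops.mul(b, c))


def after(ops, a, b, c):
    # (A + B)*C: one multiplier, one adder
    return ops.mul(ops.add(a, b), c)


ops_before, ops_after = OpCounter(), OpCounter()
same_result = before(ops_before, 3, 4, 5) == after(ops_after, 3, 4, 5)
```

Both forms compute the same value, but the factored form needs one fewer multiplier, which is the kind of saving the slide reports for its larger graph.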

A Very Useful Transform: Retiming

Retiming is the action of moving delays around in the system
Delays have to be moved from ALL inputs to ALL outputs or vice versa

Cutset retiming: A cutset is a set of edges whose removal splits the graph into two disjoint partitions. To retime, delays are moved from the ingoing to the outgoing edges of the cutset or vice versa.

Benefits of retiming:
Modify critical path delay
Reduce total number of registers

Courtesy of Prof. Charles E. Leiserson.

Retiming Example: FIR Filter

$y(n) = h(n) * x(n) = \sum_{i=0}^{K} x(n-i)\,h(i)$

Direct form:
[Figure: x(n) feeds a delay line; the taps are multiplied by h(0), h(1), h(2), h(3) and summed to give y(n).]

Associativity of the addition:
[Figure: the same filter with the adder chain reordered, annotated with delays (10) on the multiplier and (4) on the adder.] T_clk = 22 ns

Retime:
[Figure: transposed form, with the delays moved between the adders.] T_clk = 14 ns

Note: here we use a first-cut analysis that assumes the delay of a chain of operators is the sum of their individual delays. This is not accurate.
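The equivalence of the two forms and the first-cut clock periods can be checked with a short behavioral sketch. It assumes zero initial register contents, a multiplier delay of 10 ns, and an adder delay of 4 ns (read from the figure annotations). All function names are hypothetical.

```python
def fir_direct(x, h):
    """Direct form: delay line on the input, then one multiplier per tap and an adder chain."""
    taps = [0] * len(h)  # holds x(n), x(n-1), ...
    y = []
    for sample in x:
        taps = [sample] + taps[:-1]
        y.append(sum(hi * xi for hi, xi in zip(h, taps)))
    return y


def fir_transposed(x, h):
    """Transposed form: the registers sit between the adders instead of on the input."""
    state = [0] * (len(h) - 1)
    y = []
    for sample in x:
        products = [hi * sample for hi in h]
        padded = state + [0]
        y.append(products[0] + padded[0])
        # Each register captures its tap product plus the next register's value.
        state = [products[i + 1] + padded[i + 1] for i in range(len(state))]
    return y


def clock_period_direct(num_taps, t_mult=10, t_add=4):
    # One multiplier followed by a chain of (num_taps - 1) adders.
    return t_mult + (num_taps - 1) * t_add


def clock_period_transposed(t_mult=10, t_add=4):
    # One multiplier followed by one adder, independent of the number of taps.
    return t_mult + t_add


x = [1, 2, 3, 4, 5, 0, 0]
h = [1, 2, 3, 4]
outputs_match = fir_direct(x, h) == fir_transposed(x, h)
```

With four taps, the first-cut estimates come out to 22 ns for the direct form and 14 ns for the transposed form, matching the figures.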

Pipelining, Just Another Transformation (Pipelining = Adding Delays + Retiming)

Contrary to retiming, pipelining adds extra registers to the system

How to pipeline:
1. Add extra registers at all inputs
2. Retime

[Figure: a dataflow graph, the same graph after adding input registers, and the result after retiming.]
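A minimal behavioral sketch of the effect, assuming a computation that splits into two stages and a single pipeline register that has already been retimed to sit between them. `stage1`, `stage2`, and the stage delays are hypothetical.

```python
def combinational(xs, stage1, stage2):
    """Unpipelined: both stages evaluate within the same clock cycle."""
    return [stage2(stage1(x)) for x in xs]


def pipelined(xs, stage1, stage2):
    """Pipelined: a register between the stages delays every result by one cycle."""
    reg = None  # register contents are undefined before the first clock edge
    out = []
    for x in xs:
        out.append(None if reg is None else stage2(reg))
        reg = stage1(x)
    return out


def clock_periods(t_stage1, t_stage2):
    """First-cut clock period without and with the pipeline register."""
    return t_stage1 + t_stage2, max(t_stage1, t_stage2)


xs = [1, 2, 3, 4]
flat = combinational(xs, lambda v: v * 3, lambda v: v + 1)
piped = pipelined(xs, lambda v: v * 3, lambda v: v + 1)
```

The pipelined outputs are the same values one cycle later: latency grows by a cycle while the clock period shrinks to the slower of the two stages.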

The Power of Transforms: Lookahead

$y(n) = x(n) + A\,y(n-1)$
[Figure: first-order recursive loop from x(n) to y(n) with coefficient A.]
Try pipelining this structure

Loop unrolling:
$y(n) = x(n) + A\,[x(n-1) + A\,y(n-2)]$
[Figure: unrolled loop.]

Distributivity:
$y(n) = x(n) + A\,x(n-1) + A^2\,y(n-2)$
[Figure: loop with coefficient $A^2$.]
How about pipelining this structure!

Associativity, then retiming:
[Figure: final structure with a feed-forward path through A and a feedback loop through $A^2$ containing two delays; $A^2$ is precomputed.]
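The lookahead form can be checked numerically against the original recursion. This sketch assumes zero initial conditions and integer data so the comparison is exact. `first_order` and `lookahead` are hypothetical names.

```python
def first_order(x, a):
    """Original recursion: y(n) = x(n) + A*y(n-1)."""
    y, y_prev = [], 0
    for xn in x:
        y_prev = xn + a * y_prev
        y.append(y_prev)
    return y


def lookahead(x, a):
    """Lookahead form: y(n) = x(n) + A*x(n-1) + A^2*y(n-2)."""
    a_squared = a * a  # precomputed constant
    y = []
    x_prev = 0   # x(n-1)
    y_prev1 = 0  # y(n-1)
    y_prev2 = 0  # y(n-2)
    for xn in x:
        yn = xn + a * x_prev + a_squared * y_prev2
        y.append(yn)
        x_prev = xn
        y_prev2, y_prev1 = y_prev1, yn
    return y


x = [1, 2, 3, 4, 5, 6]
forms_match = first_order(x, 3) == lookahead(x, 3)
```

Because the feedback term now depends on y(n-2), the loop contains two delays, and one of them can be retimed into the multiply-add path to shorten the critical path.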

Key Concern in Modern VLSI: Variations!

[Figure: Random Dopant Fluctuations. Mean number of dopant atoms (log scale, 10 to 10000) versus technology node (1000, 500, 250, 130, 65, 32 nm), next to a transistor cross-section labeled SOURCE, GATE, DRAIN, BODY, Tox, Leff.]
With 100b transistors, 1b unusable (variations)

[Figure: Temp Variation & Hot Spots. Die temperature map, temperature in °C.]

[Figure: probability distribution of path delay, due to variations in V_dd, V_t, and temperature.]

Deterministic design techniques inadequate in the future

Courtesy of Shekhar Y. Borkar (Intel). Used with permission.

Trends: Chip in a Day (Matlab/Simulink to Silicon)

[Figure: chip floorplan with blocks Mult1, Mult2, Mac1, Mac2, S reg, X reg, and Add, Sub, Shift.]

Map algorithms directly to silicon - bypass writing Verilog!

(Courtesy of R. Brodersen. Used with permission.)

Trends: Watermarking of Digital Designs

Fingerprinting is a technique to deter people from illegally redistributing legally obtained IP by enabling the author of the IP to uniquely identify the original buyer of the resold copy.

The essence of the watermarking approach is to encode the author's signature. The selection, encoding, and embedding of the signature must result in minimal performance and storage overhead.

[Two images removed due to copyright restrictions.]