L15: VLSI Integration and Performance Transformations

Acknowledgement: Materials in this lecture are courtesy of the following sources and are used with permission.
Curt Schurgers
J. Rabaey, A. Chandrakasan, B. Nikolic. Digital Integrated Circuits: A Design Perspective. Prentice Hall/Pearson, 2003.
L15: 6.111 Spring 2006, Introductory Digital Systems Laboratory

Layout 101
[Figure: 3-D cross-section of an N-channel MOSFET (W_n, L_n) in the p-type substrate and a P-channel MOSFET (W_p, L_p) in an n-type well, showing n+ and p+ diffusions, SiO2, and a metal/pdiff contact. Figure by MIT OpenCourseWare.]
[Figure: circuit representation (IN, OUT, VDD, GND; terminals G, D, S) next to the layout, with layers metal, poly, n+ diff, p+ diff, and a contact from metal to ndiff. Used with permission.]
Follow simple design rules (contract between process and circuit designers).

Custom Design/Layout
[Figure: die photograph of the Itanium integer datapath (scale marker 1000 um), organized as bit slices 0 through 63. Labels: from register files / cache / bypass, multiplexers, shifter, adder stages 1 to 3, sum select, loopback bus wiring, to register files / cache. Courtesy Intel, as reprinted in Rabaey, et al., "Digital Integrated Circuits".]
[Figure: adder bit-slice schematic with 9-1, 5-1, and 2-1 muxes on inputs a and b, CARRYGEN, SUMGEN + LU (LU: Logical Unit), SUMSEL, and REG, producing sum/sumb to the cache.]
Itanium has 6 integer execution units like this.
Bit-slice design methodology:
Hand crafting the layout to achieve maximum clock rates (> 1 GHz)
Exploits regularity in datapath structure to optimize interconnects

The ASIC Approach
[Figure: ASIC design flow. Main flow: Design Capture (Verilog or VHDL), Logic Synthesis, Floorplanning, Placement, Routing, Tape-out. Design representations: behavioral, structural, physical. Verification: Pre-Layout Simulation, Circuit Extraction, Post-Layout Simulation, all feeding the design iteration loop.]
Most common design approach for designs up to 500 MHz clock rates.

Standard Cell Example
3-input NAND cell (from ST Microelectronics).
[Figure: cell layout with power supply line (VDD) and ground supply line (GND).]
Delay in ns (!!), characterized against C = load capacitance and T = input rise/fall time.
Each library cell (FF, NAND, NOR, INV, etc.), in each of its size variations (strength of the gate), is fully characterized across temperature, loading, etc.

Standard Cell Layout Methodology
[Figures: 2-level metal technology; current-day technology, with the cell structure hidden under interconnect layers.]
With limited interconnect layers, dedicated routing channels between rows of standard cells are needed.
Width of the cell is allowed to vary to accommodate complexity.
Interconnect plays a significant role in the speed of a digital circuit.

Verilog to ASIC Layout (the push button approach)
    module adder64 (a, b, sum);
      input [63:0] a, b;
      output [63:0] sum;
      assign sum = a + b;
    endmodule
[Figures: the design after synthesis, after placement, and after routing.]

Macro Modules
256 × 32 (or 8192-bit) SRAM, generated by a hard-macro module generator.
Generate highly regular structures (entire memories, multipliers, etc.) with a few lines of code.
Verilog models for memories are automatically generated based on size.

Clock Distribution
[Figure: flip-flops (D, Q) illustrating clock skew. Image removed due to copyright restrictions.]
For a 1 GHz clock, the skew budget is 100 ps.
Variations along different paths arise from:
Device: VT, W/L, etc.
Environment: VDD, °C
Interconnect: dielectric thickness variation
IBM clock routing. Image removed due to copyright restrictions.
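
As a quick check on the numbers above, the sketch below converts a clock frequency to a period and expresses a skew budget as a fraction of it. The function names are illustrative, not part of the lecture.

```python
def clock_period_ps(frequency_hz):
    """Clock period in picoseconds for a given frequency in hertz."""
    return 1e12 / frequency_hz


def skew_fraction(skew_ps, frequency_hz):
    """Fraction of the clock period consumed by the skew budget."""
    return skew_ps / clock_period_ps(frequency_hz)


# The slide's case: a 100 ps skew budget on a 1 GHz clock.
period = clock_period_ps(1e9)          # 1000 ps
fraction = skew_fraction(100, 1e9)     # one tenth of the period
remaining_ps = period - 100            # time left for clock-to-Q, logic, and setup
```

At 1 GHz the 100 ps budget is one tenth of the cycle, which is why variation along different clock paths matters so much.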

The Power Supply Wires are Not Ideal!
[Figure: driver (R_d, C_d) and receiver connected through an interconnect (C_int, C_coup), with pads to the VDD grid and to the GROUND GRID. Used with permission.]
The IR-drop problem causes the internal power supply voltage to be less than the external source.
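
The IR drop itself is just Ohm's law applied to the supply network. The sketch below is a minimal DC model, assuming a single lumped resistance on the supply rail and, optionally, one on the ground rail; the numeric values are hypothetical and chosen only for illustration.

```python
def internal_supply_voltage(external_v, current_a,
                            supply_resistance_ohm,
                            ground_resistance_ohm=0.0):
    """Voltage seen across the on-chip circuit after IR drop.

    Lumped DC model (an assumption): the current flows through the supply
    rail resistance and returns through the ground rail resistance.
    """
    drop_on_supply = current_a * supply_resistance_ohm
    rise_on_ground = current_a * ground_resistance_ohm
    return external_v - drop_on_supply - rise_on_ground


# Hypothetical numbers: 1.2 V external supply, 2 A draw, 50 milliohm rail.
v_internal = internal_supply_voltage(1.2, 2.0, 0.05)
```

With these illustrative values the circuit sees about 1.1 V instead of 1.2 V.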

Analog Circuits: Clock Frequency Multiplication (Phase Locked Loop)
[Figure: PLL block diagram with PFD (up/down outputs), loop filter, VCO, and divider. Courtesy Michael Perrott. Used with permission.]
VCO produces a high-frequency square wave.
Divider divides down the VCO frequency.
PFD compares the phase of ref and div.
Loop filter extracts phase error information.
Used widely in digital systems for clock synthesis (a standard IP block in most ASIC flows).

Scan Testing...
Idea: have a mode in which all registers are chained into one giant shift register which can be loaded/read out bit-serially. Test the remaining (combinational) logic by:
(1) in test mode, shift in new values for all register bits, thus setting up the inputs to the combinational logic
(2) clock the circuit once in normal mode, latching the outputs of the combinational logic back into the registers
(3) in test mode, shift out the values of all register bits and compare against expected results.
[Figure: scan flip-flops, each with a 0/1 input mux selected by ScanShift, chained from shift in to shift out and clocked by CLK. Used with permission.]
[Figure: timing diagram of Clk, ScanShift, primary inputs, and primary outputs: load/unload cycles, one normal system cycle that captures the response to the test vector loaded into the scan flops, then load/unload cycles again. Figure by MIT OpenCourseWare.]
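
The three steps can be sketched as a small behavioral model. This is an illustration only: the `ScanChain` class, the `run_scan_test` helper, and the example logic are hypothetical names, and the model assumes every register output feeds the combinational logic and every logic output feeds a register.

```python
class ScanChain:
    """Behavioral model of registers that can be chained into a shift register."""

    def __init__(self, length, logic):
        self.registers = [0] * length
        self.logic = logic  # combinational logic: list of bits -> list of bits

    def shift(self, bit_in):
        """One clock with ScanShift = 1: shift one bit in, return the bit shifted out."""
        bit_out = self.registers[-1]
        self.registers = [bit_in] + self.registers[:-1]
        return bit_out

    def capture(self):
        """One clock with ScanShift = 0: latch the combinational outputs."""
        self.registers = list(self.logic(self.registers))


def run_scan_test(chain, vector):
    """Apply one test vector and return the captured response, register 0 first."""
    # (1) Test mode: shift in the vector, last bit first, so that
    #     vector[i] ends up in registers[i].
    for bit in reversed(vector):
        chain.shift(bit)
    # (2) Normal mode: one clock latches the logic outputs into the registers.
    chain.capture()
    # (3) Test mode: shift out all register bits (the last register comes out first).
    shifted_out = [chain.shift(0) for _ in vector]
    return shifted_out[::-1]


def example_logic(bits):
    """Hypothetical 3-bit combinational block used only for this example."""
    return [bits[0] ^ bits[1], bits[1] & bits[2], bits[2] | bits[0]]


response = run_scan_test(ScanChain(3, example_logic), [1, 1, 0])
```

A tester would compare `response` against the expected output of the logic for that vector; a mismatch points to a fault in the combinational block or in the chain itself.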

Behavioral Transformations
There are a large number of implementations of the same functionality.
Each implementation represents a different point in the area-time-power design space.
Behavioral transformations allow exploring the design space at a high level.
Optimization metrics:
1. Area of the design
2. Throughput or sample time T_S
3. Latency: clock cycles between the input and the associated output change
4. Power consumption
5. Energy of executing a task
6. time power area

Fixed-Coefficient Multiplication
Conventional multiplication: Z = X · Y

                            X3    X2    X1    X0
                        ×   Y3    Y2    Y1    Y0
    ------------------------------------------------
                            X3Y0  X2Y0  X1Y0  X0Y0
                      X3Y1  X2Y1  X1Y1  X0Y1
                X3Y2  X2Y2  X1Y2  X0Y2
          X3Y3  X2Y3  X1Y3  X0Y3
    ------------------------------------------------
    Z7    Z6    Z5    Z4    Z3    Z2    Z1    Z0

Constant multiplication (becomes hardwired shifts and adds): Z = X · (1001)_2

                            X3    X2    X1    X0
                        ×   1     0     0     1
    ------------------------------------------------
                            X3    X2    X1    X0
          X3    X2    X1    X0
    ------------------------------------------------
    Z7    Z6    Z5    Z4    Z3    Z2    Z1    Z0

Y = (1001)_2 = 2^3 + 2^0, so Z = (X << 3) + X. The shifts are implemented using wiring.
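
A minimal software sketch of the same idea follows: for a fixed coefficient, each 1 bit contributes one shifted copy of X, so the multiplier reduces to shifts (free wiring in hardware) and adds. The function name is illustrative.

```python
def shift_add_multiply(x, coefficient):
    """Multiply x by a non-negative constant using only shifts and adds."""
    result = 0
    position = 0
    while coefficient >> position:
        if (coefficient >> position) & 1:
            # Each 1 bit of the coefficient adds one shifted copy of x.
            result += x << position
        position += 1
    return result


# The slide's coefficient (1001)_2 = 9 needs a single adder: (x << 3) + x.
z = shift_add_multiply(13, 0b1001)
```

The number of adders is one less than the number of 1 bits in the coefficient, which is the motivation for recoding the coefficient to have fewer non-zero digits.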

Transform: Canonical Signed Digits (CSD)
Canonical signed digit representation is used to increase the number of zeros. It uses digits {-1, 0, 1} instead of only {0, 1}.
Iterative encoding: replace each string of consecutive 1s:
    0 1 1 ... 1 1  =  2^(N-2) + ... + 2^1 + 2^0
    1 0 0 ... 0 -1 =  2^(N-1) - 2^0
Worst case: CSD has 50% non-zero bits.
Example: 01101111 (decimal 111)
    0  1  1  0  1  1  1  1
    0  1  1  1  0  0  0 -1
    1  0  0 -1  0  0  0 -1
Z = (X << 7) - (X << 4) - X. Shift translates to re-wiring.
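
The encoding can be sketched in a few lines. Note that this sketch uses the standard non-adjacent-form recurrence (decide each digit from the two lowest bits) rather than the literal string replacement described above; it is an assumption that the two agree in general, although they produce the same digits for the slide's example. Function names are illustrative.

```python
def to_csd(value):
    """CSD digits of a non-negative integer, least significant digit first."""
    digits = []
    while value != 0:
        if value & 1:
            # value % 4 == 1 -> digit +1; value % 4 == 3 -> digit -1,
            # which turns a run of 1s into a carry plus a single -1.
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def csd_multiply(x, digits):
    """Multiply x by the constant encoded in digits using shifts, adds, and subtracts."""
    return sum(digit * (x << position) for position, digit in enumerate(digits))


digits_msb_first = to_csd(0b01101111)[::-1]   # the slide's example, decimal 111
product = csd_multiply(5, to_csd(111))        # (5 << 7) - (5 << 4) - 5
```

For the example coefficient, the binary form has six 1 bits (five adders), while the CSD form has three non-zero digits (two adders/subtractors).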

Algebraic Transformations
Commutativity: A + B = B + A
Distributivity: (A + B) C = AC + BC
Associativity: (A + B) + C = A + (B + C)
Common sub-expressions: an expression of the same inputs (X, Y) that appears more than once is computed once and shared.
[Figures: dataflow graphs before and after each transformation.]

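Transforms for Efficient Resource Utilization
[Figure: dataflow graph with inputs A, B, C, D, E, F, G, H, I, scheduled over time steps 1 and 2.]
Time multiplexing: mapped to 3 multipliers and 3 adders.
[Figure: the same computation after applying distributivity, with inputs ordered A, C, B, D, E, F, G, H, I, scheduled over time steps 1 and 2.]
Distributivity: reduces the number of operators to 2 multipliers and 2 adders.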

A Very Useful Transform: Retiming
Retiming is the action of moving delays around in the system.
Delays have to be moved from ALL inputs to ALL outputs or vice versa.
Cutset retiming: a cutset is a set of edges whose removal splits the graph into two disjoint partitions. To retime, delays are moved from the ingoing to the outgoing edges of the cut, or vice versa.
Benefits of retiming:
Modify critical path delay
Reduce total number of registers
Courtesy of Prof. Charles E. Leiserson.
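
One common way to express retiming, sketched below, assigns each node an integer lag r(v) and updates the register count on every edge u -> v to w(e) + r(v) - r(u). The graph encoding and names here are assumptions for illustration; the slide itself describes the move only in words.

```python
def retime(edges, lag):
    """Apply a retiming to a dataflow graph.

    edges: list of (src, dst, registers) tuples.
    lag:   dict mapping node name to its integer lag r(v); missing nodes use 0.
    Returns the edge list with updated register counts.
    """
    retimed = []
    for src, dst, registers in edges:
        new_registers = registers + lag.get(dst, 0) - lag.get(src, 0)
        if new_registers < 0:
            raise ValueError(f"illegal retiming: edge {src}->{dst} would need "
                             f"{new_registers} registers")
        retimed.append((src, dst, new_registers))
    return retimed


# An adder with one register on each input and none on its output.
before = [("a", "add", 1), ("b", "add", 1), ("add", "out", 0)]
# A lag of -1 on the adder moves a delay from ALL its inputs to its output.
after = retime(before, {"add": -1})
```

In this example the register count drops from two to one, and the `ValueError` branch enforces the rule that a delay must be available on every input before it can be moved.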

Retiming Example: FIR Filter
Direct form: $y(n) = h(n) * x(n) = \sum_{i=0}^{K} x(n-i)\,h(i)$
[Figure: direct-form FIR filter with input x(n), taps h(0), h(1), h(2), h(3), and output y(n), including the symbol used for multiplication.]
After applying associativity of the addition: T_clk = 22 ns, with a multiplier delay of (10) and an adder delay of (4), i.e. 10 + 3 × 4.
After retiming (transposed form): T_clk = 14 ns, i.e. 10 + 4.
Note: here we use a first-cut analysis that assumes the delay of a chain of operators is the sum of their individual delays. This is not accurate.
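
The sketch below models both structures in software to show that they compute the same output, and reproduces the first-cut clock periods from the operator delays on the slide (multiplier 10 ns, adder 4 ns). The function names are illustrative, and zero initial register contents are assumed.

```python
def fir_direct(x, h):
    """Direct form: delay line on the input, chain of adders at the output."""
    delay_line = [0] * len(h)
    y = []
    for sample in x:
        delay_line = [sample] + delay_line[:-1]
        y.append(sum(tap * value for tap, value in zip(h, delay_line)))
    return y


def fir_transposed(x, h):
    """Transposed form: input broadcast to all multipliers, registers between adders."""
    state = [0] * (len(h) - 1)  # one register between each pair of adders
    y = []
    for sample in x:
        products = [tap * sample for tap in h]
        y.append(products[0] + (state[0] if state else 0))
        # Each register latches its tap product plus the next register's value.
        state = [products[i + 1] + (state[i + 1] if i + 1 < len(state) else 0)
                 for i in range(len(state))]
    return y


def first_cut_clock_period(t_mult, t_add, adders_in_chain):
    """Sum-of-delays estimate: one multiplier followed by a chain of adders."""
    return t_mult + adders_in_chain * t_add


taps = [1, 2, 3, 4]
samples = [1, 2, 3, 4, 5]
same_output = fir_direct(samples, taps) == fir_transposed(samples, taps)
t_clk_direct = first_cut_clock_period(10, 4, adders_in_chain=3)      # 22
t_clk_transposed = first_cut_clock_period(10, 4, adders_in_chain=1)  # 14
```

The transposed form keeps the same number of multipliers and adders; only the position of the registers changes, which shortens the longest register-to-register path.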

Pipelining, Just Another Transformation (Pipelining = Adding Delays + Retiming)
Contrary to retiming, pipelining adds extra registers to the system.
How to pipeline:
1. Add extra registers at all inputs
2. Retime
[Figures: the circuit after adding input registers, and after retiming.]

The Power of Transforms: Lookahead
Original recursion: y(n) = x(n) + A y(n-1). Try pipelining this structure.
Loop unrolling: y(n) = x(n) + A [x(n-1) + A y(n-2)]
Distributivity: y(n) = x(n) + A x(n-1) + A^2 y(n-2)
Associativity, then retiming, with A^2 precomputed. How about pipelining this structure!
[Figures: the signal-flow graph after each step.]
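
A short numerical check of the lookahead transform follows: the original first-order recursion and the unrolled form, which depends on y(n-2) and a precomputed A^2, should give identical outputs. Function names are illustrative, and zero initial conditions are assumed.

```python
def recursive_filter(x, a):
    """Original recursion: y(n) = x(n) + A * y(n-1)."""
    y, previous = [], 0
    for sample in x:
        previous = sample + a * previous
        y.append(previous)
    return y


def lookahead_filter(x, a):
    """Lookahead form: y(n) = x(n) + A * x(n-1) + A^2 * y(n-2)."""
    a_squared = a * a  # precomputed once, outside the loop
    y = []
    for n, sample in enumerate(x):
        x_prev = x[n - 1] if n >= 1 else 0
        y_prev2 = y[n - 2] if n >= 2 else 0
        y.append(sample + a * x_prev + a_squared * y_prev2)
    return y


outputs_match = recursive_filter([1, 2, 3, 4], 2) == lookahead_filter([1, 2, 3, 4], 2)
```

The feedback loop of the lookahead form contains two delays instead of one, so one of them can be retimed into the loop's arithmetic to pipeline it, at the cost of the extra feed-forward multiply and add.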

Key Concern in Modern VLSI: Variations!
[Figure: MOSFET cross-section labeled SOURCE, GATE, DRAIN, BODY, Tox, Leff.]
[Plot: Random Dopant Fluctuations; mean number of dopant atoms versus technology node (nm), from 1000 down to 32.]
[Plot: Temp Variation & Hot Spots; temperature (C).]
[Plot: path delay distribution, probability versus delay, due to variations in Vdd, Vt, and Temp.]
With 100 billion transistors, 1 billion unusable (variations).
Deterministic design techniques inadequate in the future.
Courtesy of Shekhar Y. Borkar (Intel). Used with permission.

Trends: Chip in a Day (Matlab/Simulink to Silicon)
[Figure: blocks labeled Mult1, Mult2, Mac1, Mac2, S reg, X reg, and Add, Sub, Shift.]
Map algorithms directly to silicon - bypass writing Verilog!
(Courtesy of R. Brodersen. Used with permission.)

Trends: Watermarking of Digital Designs
Fingerprinting is a technique to deter people from illegally redistributing legally obtained IP by enabling the author of the IP to uniquely identify the original buyer of the resold copy.
The essence of the watermarking approach is to encode the author's signature. The selection, encoding, and embedding of the signature must result in minimal performance and storage overhead.
Images removed due to copyright restrictions.