Lecture 9: Clocking for High Performance Processors

Computer Systems Lab, Stanford University
horowitz@stanford.edu
Copyright 2001 Mark Horowitz
EE371 Lecture 9-1 Horowitz

Overview
Reading
- Bailey: Clocking on the Alpha
- Stojanovic: Evaluation of different latches
- Harris: Skew tolerant domino design
Introduction
In addition to the design of the circuits on a chip, clocking and flop/latch design have a large influence on the circuit's power and performance. As was discussed in EE271, the role of the clocks is to keep signals correlated in time. There are many approaches to this problem, some of which don't even require clocks. If clocks are used, it is critical to minimize the overhead they cause, which includes clock skew and latch/flop overheads.
This lecture looks at the sequencing issue and explores a couple of approaches to this problem, including eliminating clocks (self-timing). The next lecture will look at flop and latch design in more detail. We start with a brief look at the history of clock design.

History
Clocking was a critical issue in the early '80s
- nMOS design
- Needed fast but low-power buffers
- Name of the game was bootstrapping

History, cont'd
CMOS changed the rules
- Solved the clock buffer problem
Clock circuitry became less interesting
- Until the '90s
Clocks once again are a difficult circuit issue

Overview of Talk
Background:
- Role of clocks
- Why clocks are bad (clock overhead)
Self-timed design
- Why no clocks are bad
Real world (how to live with badness)
- Clock distribution issues
- Skew tolerant designs
Summary

Common View of Clock's Function
Clocks work with a latch or flip-flop to hold state
Latch
- Stores data when the clock is low
Flip-Flop
- Stores In when clock rises
[Figure: latch and flip-flop symbols (In, Clk, Out) with In/Out waveforms]

Another View
If the delay of every path through the combinational logic were EXACTLY the same
- I would not need clocks
- The state is stored in the gates and the wires
- Signals stay naturally correlated in time (wave pipelining)
Impossible to do in practice, so

Clock's Function: Keep values in a system correlated in time
Keep signals from racing ahead of others
- Slow down signals that arrive too fast
- A flop is almost always built from two latches back to back
[Figure: combinational logic (CL) blocks in sequence; a D flop (D, Q) and φ-clocked latches around CL]

Clock Overhead
Unfortunately, clocks delay slow paths too
- Flip-flop: T_setup + T_clk-q
- Latch: T_d-q
[Figure: flip-flop and latch timing, with signals D in, Clk, Q out]

Clock Skew
Not all clocks arrive at the same time
Two problems:
- Adds more overhead (note 1): T_cyc = T_d + T_setup + T_clk-q + T_skew
  [Figure: Flop (Late) -> Logic (T_d) -> Flop (Early)]
- Can get the wrong answer; with no logic between two flops, the result is correct only if T_skew < T_clk-q - T_hold
  [Figure: Flop (Early) -> Flop (Late)]
Low overhead -> fast latches, low clock skew
1. As one of the readings points out, it is hard to break the flop delay into a setup delay and a clk-q delay, since the clk-q delay can increase when the setup time is small. We will ignore this type of issue until we talk about flops in the next lecture.
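The two skew effects above can be put into numbers. The following is a minimal sketch, not part of the original slides: the function names and the picosecond values are assumptions chosen only for illustration.

```python
def min_cycle_time(t_d, t_setup, t_clk_q, t_skew):
    """Cycle time from the slide: T_cyc = T_d + T_setup + T_clk-q + T_skew."""
    return t_d + t_setup + t_clk_q + t_skew


def back_to_back_flops_safe(t_skew, t_clk_q, t_hold):
    """Race check for two flops with no logic between them:
    safe only if T_skew < T_clk-q - T_hold."""
    return t_skew < t_clk_q - t_hold


# Illustrative values in picoseconds (assumed, not taken from the lecture).
print(min_cycle_time(t_d=1200, t_setup=60, t_clk_q=120, t_skew=100))  # 1480
print(back_to_back_flops_safe(t_skew=100, t_clk_q=120, t_hold=40))    # False
print(back_to_back_flops_safe(t_skew=50, t_clk_q=120, t_hold=40))     # True
```

With these assumed numbers, skew costs 100 ps of every cycle, and the same 100 ps is also enough to break a flop-to-flop path that has no logic in it.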

Microprocessors
To understand why clock design is getting harder, all we need to do is look at a couple of processor designs over the past 20 years.
Take a look at a few different processors: getting larger and getting faster

Processor  Year  Die area   Clock
8086       1978  30 mm^2    5 MHz
R2000      1986  80 mm^2    16 MHz
21064      1992  220 mm^2   150 MHz
21164      1995  300 mm^2   300 MHz
21264      1998  300 mm^2   600 MHz

Processor Trends
Performance:
Part of speed increase is technology
- Technology from 3 µm nMOS to 0.25 µm CMOS
- CMOS FO4 gate delay is roughly 0.5 ns/µm × L (L is the gate length)
Part is better circuit design
- 200, 50, 20-25, 20 FO4 inv delays/cycle
Bottom line: cycle time is getting shorter, even in # of gates
- One FO4 delay is a larger % of cycle time
Die size is growing
- More capacitance on clocks (nF)
- More resistance in clock lines
[Figure label: Doesn't Work]

Clock Overhead
Really two issues:
Latch / flop delay
- All latches I know of have a delay of ~1.5 FO4
- Two latches / cycle is > 15% of cycle
Clock skew
- Keeping clock skew in ps constant is hard
- But cycle times are falling
- So engineering needed on the clock is growing
If generating the clock is hard and getting harder, why do it?
Radical approach would be to eliminate it altogether
- Use local information instead
- Called self-timed design
- Uses information bundled with the signals for sequencing
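The overhead figures above follow from simple ratios. The sketch below is an illustration only: the helper names are hypothetical, the 20 FO4 cycle is the figure quoted on the Processor Trends slide, and the fixed 100 ps skew is an assumed value.

```python
def latch_overhead_fraction(latch_delay_fo4, latches_per_cycle, cycle_fo4):
    """Fraction of the cycle spent in latch delay, all measured in FO4 delays."""
    return latch_delay_fo4 * latches_per_cycle / cycle_fo4


def skew_fraction(skew_ps, clock_mhz):
    """Fraction of the cycle consumed by a fixed skew given in picoseconds."""
    cycle_ps = 1e6 / clock_mhz
    return skew_ps / cycle_ps


# Two ~1.5 FO4 latches in a 20 FO4 cycle.
print(f"{latch_overhead_fraction(1.5, 2, 20):.0%}")  # 15%

# The same (assumed) 100 ps skew at two clock rates from the processor table.
for mhz in (5, 600):
    print(mhz, "MHz:", f"{skew_fraction(100, mhz):.2%}")
```

Any cycle shorter than 20 FO4 pushes the latch share above 15%, and a skew that is constant in picoseconds grows as a share of the cycle as the clock rate rises.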

Simple Self-Timed Pipeline
[Figure sequence, five frames: three Function stages in a pipeline with blocks labeled C and D. State of stages 1, 2, 3 in each frame:]
Frame 1: Reset, Reset, Reset
Frame 2: Eval, Reset, Reset
Frame 3: Hold, Eval, Reset
Frame 4: Reset, Hold, Eval
Frame 5: Eval, Reset, Hold

Self-Timed Sequencing
Advantages:
- No global clock
- No clock skew
- No worst-case operation constraint
  Speed depends on operating conditions
  Cycle not limited by worst possible case
Disadvantage:
- Lots of overhead!
Cause of the overhead:
- Eliminated the clock, not the need for sequencing
- Control generated from local signals
- And that takes time
Note: This is added delay, not skew

Reducing Overhead
Make SMALL assumption about timing
- Remove D from forward path
- Data starts evaluation
[Figure: three Function stages with D blocks]

More Problems
In these kinds of systems there are three constraints that might be a problem
Forward constraints (data movement)
- This is the topic we were looking at: how fast can a data token flow forward
Backward constraints (bubble movement)
- This is the constraint on how fast the bubbles flow backward, or how fast the unit resets after having data. If this is slow, you can have many bubbles for each data, so this is not a constraint, but it lowers the pipeline rate.
Loop constraints (min cycle time)
- How fast can a function unit reset to be able to start another evaluation
These constraints have large delays in them:
- t_D: completion detection tree; t_C: buffer delay to drive datapath control
- t_D + t_C can be 1/2 cycle
Have fork, join issues too

Self-Timed vs. Clocked Systems
Went to self-timing to get rid of skew
- Got rid of skew and worst-case limits
- But got control overhead too
- Control overhead is larger than clock skew
Can be hidden (in theory) but complicated
- Little tool support
So need to choose badness
- Most designers choose clocks

Goal
Be paranoid:
- Make clock skew as small as possible
AND
- Make your circuit insensitive to skew

Clock Distribution
Need to reduce the skew on distributing the clock
This requires us to reduce the wire delay and the buffer delay
- But we can't reduce the delay to the required levels (100 ps), so
Make the effective delay small by balancing the delays of all the paths
- Change a total delay problem to a matching problem
- Make ΔT (the mismatch between paths) much smaller than T_drive
Use a clock tree
Match the delay on different branches of the tree
- If the buffer delay matches
- If the wire delay matches
- Skew will be zero
Obvious question:
- How well can you match delays?
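The shift from a total-delay problem to a matching problem can be sketched as follows. This is an illustration under stated assumptions: the leaf names and the per-segment delays (in ps) are invented, and each path is modeled simply as a list of buffer and wire delays from the root.

```python
def arrival_times(paths):
    """Map each leaf to its total root-to-leaf delay (sum of segment delays)."""
    return {leaf: sum(segments) for leaf, segments in paths.items()}


def skew(paths):
    """Skew is the spread between the latest and earliest clock arrival."""
    times = arrival_times(paths).values()
    return max(times) - min(times)


# Hypothetical three-leaf tree: buffer, wire, buffer, wire delays in ps.
paths = {
    "leaf_a": [300, 150, 300, 150],
    "leaf_b": [300, 150, 310, 145],
    "leaf_c": [300, 150, 290, 155],
}

print(arrival_times(paths))  # {'leaf_a': 900, 'leaf_b': 905, 'leaf_c': 895}
print(skew(paths))           # 10
```

In this made-up tree the total delay is about 900 ps, far above the 100 ps target, yet the skew is only 10 ps because the branches are nearly matched.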

H Trees
Space filling pattern that matches wire delays
Lots of papers on these things, but not the real issue

Real Matching Issues
There are function blocks on chip that mess up your nice abstract H-tree
- Wiring and buffers will need to fit
Buffers eventually need to drive latches
- Load of a latch is data dependent, since the source voltage can change
  Gate loading depends on whether channel is formed
  Variation depends on technology, but is around 2:1 in capacitance
  [Figure note: nMOS cap larger when 0, pMOS cap larger when 1]
Chip environments are not perfect
- IR drops on the power supply lines
- Temperature gradients across the chip
Fabrication is not perfect
- Proximity effects / process tilt

Wire Load Matching
Each wire has a different mix of components
- Not only gate cap vs. wire cap
- Also % M1 - M2, M1 - M1, M1 fringe, etc.
Need to find the worst-case skew
- Process corners don't help: they don't vary wires relative to each other and don't account for data dependence
- No real tools to help
Problem for simulating matching of any kind: you need to simulate the worst case for the matching, and this might not be either the slow or the fast case.
Buffer Matching
- Buffer delay depends on Vdd, which can vary over a chip due to IR drops (over 10%)
- Fabrication matching: process tilt and proximity effects

Process Tilt and Proximity Effects
Chips are large (2 cm on a side), and transistor features are small
Inverters on different sides of the chip will be different
- Difference is not in corner files, since corners make all transistors the same
- This data is not usually given to designers
- It is essential for simulating clock skew
Proximity issues
- Poly width sets channel length / speed of circuit
- Current gate lengths are 0.25 µm; poly control must be a few 0.01 µm
- Local poly environment affects etching rate, so it affects channel length
- Matched inverters need matched layout and matched environments (region around buffer needs to be the same, so add dummy buffers)

Single Clock Distribution - 21064
Thick metal layer for clocks: M3, 2 µm thick
Large clock buffer (entire vertical height of the chip)
- Use a tree to balance the delay in this direction
Shorted together all the local clock wires
- Main difference from a conventional tree; reduces the effects of mismatches
- Especially effective for reducing local skew
More recent processors have more clock buffers to keep skew small

Local Skew
Important for race-through
Can get the wrong answer:
- With no logic between two flops, the result is correct only if T_skew < T_clk-q - T_hold
- Only occurs if delay is less than skew
- Delay is small only when elements are close
- So only occurs when local skew is large
- Shorting clocks together can reduce local skew
But also limits your design
- Only have one clock and it is on all the time
- Not the lowest power solution
Can avoid this problem by not having short paths
- Need to have a tool to check (and fix) min-delay
  (Easy: make all your flops with the ability to have a long clk-q delay)
[Figure: Flop (Early) -> Flop (Late)]
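A min-delay check of the kind mentioned above can be sketched in a few lines. This is only an illustration, not a description of any real tool: the path names and delays are hypothetical, and the check uses the min-delay constraint given on the edge-triggered flop slide, t_dmin > t_skew + t_hold - t_clk-q, rewritten as a slack.

```python
def min_delay_slack(t_d, t_clk_q, t_hold, t_skew):
    """Slack against race-through: positive means the path is slow enough."""
    return t_d + t_clk_q - t_hold - t_skew


def race_violations(path_delays, t_clk_q, t_hold, t_skew):
    """Return {path: padding needed} for paths whose slack is not positive."""
    violations = {}
    for name, t_d in path_delays.items():
        slack = min_delay_slack(t_d, t_clk_q, t_hold, t_skew)
        if slack <= 0:
            violations[name] = -slack  # pad by more than this amount
    return violations


# Hypothetical short paths, delays in ps.
path_delays = {"scan_shift": 0, "bypass_mux": 40, "alu": 600}
print(race_violations(path_delays, t_clk_q=120, t_hold=40, t_skew=100))
# {'scan_shift': 20}
```

Raising t_clk_q in this sketch clears the violation without touching the paths, which is the effect of the "long clk-q delay" flop option noted on the slide.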

Global Skew
Still have long path problem
Skew adds more overhead:
- T_cyc = T_d + T_setup + T_clk-q + T_skew
[Figure: Flop (Late) -> Logic (T_d) -> Flop (Early)]
So don't use flops!
The situation with latches is a little different

Clocking Design
Trade-off between overhead / robustness / complexity
- Constraints on the logic vs. constraints on the clocks
For performance, you need to worry about
- Overhead of the sequencing
  Delay through the latches/flops
  Wasted time from clock skew
Look at a number of different clocking methods:
- Edge triggered clocking
- Pulse mode clocking
- Two phase clocking (might only have one clock)

Edge Triggered Flop Design
Most popular design style (comes from old TTL designs)
Used in many ASIC designs (gate arrays and standard cells)
Uses a single clock and breaks every cycle with a flip-flop
[Figure: CL block between two n-bit flops clocked by Clk, spanning t_cycle]
Timing Constraints
- t_dmax < t_cycle - t_setup - t_clk-q - t_skew
- t_dmin > t_skew + t_hold - t_clk-q
If skew is large enough, you have two-sided timing constraints
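The two constraints above define a window of allowed logic delay, and the sketch below shows how that window closes from both sides as skew grows. The function name and all numeric values (in ps) are assumptions made for illustration.

```python
def flop_delay_window(t_cycle, t_setup, t_clk_q, t_skew, t_hold):
    """Return (lower, upper) bounds on logic delay for an edge-triggered design.

    upper: t_dmax < t_cycle - t_setup - t_clk_q - t_skew
    lower: t_dmin > t_skew + t_hold - t_clk_q
    """
    upper = t_cycle - t_setup - t_clk_q - t_skew
    lower = t_skew + t_hold - t_clk_q
    return lower, upper


# Assumed flop parameters; only the skew changes between rows.
for t_skew in (0, 100, 200):
    lower, upper = flop_delay_window(
        t_cycle=1667, t_setup=60, t_clk_q=120, t_skew=t_skew, t_hold=40
    )
    print(f"skew={t_skew:3d} ps  {lower:4d} < t_d < {upper} ps")
```

With these numbers the lower bound is negative at zero skew, so it constrains nothing. Once the skew exceeds t_clk_q - t_hold (80 ps here) the bound turns positive and the constraint becomes two-sided.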

Flop Design
Flops introduce hard timing boundaries into the circuit
- Data must set up before the clock edge
- Output does not change until after the clock edge
- Any uncertainty in clock or data is wasted
- Need to know precisely when the data will arrive
If some section of logic will be done early, then to use that extra time you need to move the clock to the flop early, so the next cycle has more time (and you had better check the hold time of the following flop)
[Figure: three flops (D, Q) with a fast logic section followed by a slow one; the middle flop gets an early clock, with a note to watch the hold time]

Latch Based Design
Break the flop into its two latches, and place logic between the latches.
[Figure: chain of latches (D, Q, Ld) with a Logic block between each pair]
There are no hard boundaries in latches
- Pass data when clock is high
- Latching event is the load-to-hold transition
- If data arrives early it is passed through
- Can borrow time naturally, and
- Is insensitive to clock skew; for critical paths, data sets the timing (well, generally)

Pulse Mode Clocking
Two requirements:
- All loops of logic are broken by a single latch
- The clock is a narrow pulse
  It must be shorter than the shortest path through the logic
[Figure: CL block between two n-bit latches clocked by Clk, spanning t_cycle, with pulse width t_w]
Clock is usually generated inside the latch
Timing Requirements
- t_dmax < t_cycle - t_d-q - t_skew
- t_dmin > t_w - t_d-q + t_skew
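The pulse-mode requirements above can be evaluated the same way as the flop constraints. The sketch below is illustrative only: the function name and the picosecond values are assumptions, not figures from the lecture.

```python
def pulse_latch_delay_window(t_cycle, t_d_q, t_skew, t_w):
    """Return (lower, upper) bounds on logic delay for pulse-mode clocking.

    upper: t_dmax < t_cycle - t_d_q - t_skew
    lower: t_dmin > t_w - t_d_q + t_skew
    """
    upper = t_cycle - t_d_q - t_skew
    lower = t_w - t_d_q + t_skew
    return lower, upper


# Assumed values: 1667 ps cycle, 100 ps latch d-q delay, 100 ps skew.
for t_w in (100, 250, 400):
    lower, upper = pulse_latch_delay_window(
        t_cycle=1667, t_d_q=100, t_skew=100, t_w=t_w
    )
    print(f"t_w={t_w} ps  {lower} < t_d < {upper} ps")
```

Only one latch delay is subtracted from the cycle on the long-path side, but every increase in pulse width raises the minimum delay that each path must have.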

Pulse Mode Clocking
Used in the original Cray computers (ECL machines)
Advantage is that it has a very small clocking overhead
- One latch delay added to cycle
Leads to double-sided timing constraints
- If logic is too slow OR too fast, the system will fail
- But there is some flow time when the latch is enabled (softer edge)
Pulse width is critical
- Hard to maintain narrow pulses through inverter chains
People are starting to use this type of clocking for MOS circuits
- Pulse generation is done in each latch
- Clock distributed is 50% duty cycle
- CAD tools check min delay
- Called a glitch flop, but it is not a flop, it is a glitch latch!

Thinking About Timing
Imagine you're arranging your netlist on a sheet with all the flops at the top
- A gate's distance from the top indicates the settling time of its output
- Gates at the end of long paths would be at the bottom of the sheet
- Some of the outputs are the inputs to the flops, so we roll the sheet
- Forms a cylinder, where the circumference is equal to the cycle time
With flops, the input has to settle before its clock rises, and the output can't change until its clock falls, so
- To guarantee operation, need to waste skew time
- Hard edge is the problem

Latches
Are hard to analyze, since the timing of the output is not completely set by the clock
- Output is valid a clk-q delay after clock rises
  If input was valid when clock was low
- Output is valid a d-q delay after input
  If input becomes valid when clock is high
Skew changes the relative timing of clock and data
- Since latches are soft barriers, it does not change the output arrival time!
- Latch based system can tolerate skew
[Figure: Phi and Phi_b clock waveforms; Latch = T_cyc - T_max, where T_cyc = cycle time and T_max = max delay between latches]
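The two output-timing cases above can be written as one small function, which also shows why skew drops out for data that arrives while the latch is open. This is a sketch under assumptions: the names and values (in ps) are hypothetical, and the latch's closing edge is ignored, so data is assumed to arrive before the latch closes.

```python
def latch_output_time(t_data, t_clk_rise, t_clk_q, t_d_q):
    """Time at which the latch output becomes valid.

    Data valid before the clock rises: output follows the clock (clk-q delay).
    Data valid while the clock is high: output follows the data (d-q delay).
    """
    if t_data <= t_clk_rise:
        return t_clk_rise + t_clk_q
    return t_data + t_d_q


# Data arrives at 300 ps; the clock edge is nominally at 200 ps, +/- 50 ps of skew.
for t_clk_rise in (150, 200, 250):
    print(t_clk_rise, latch_output_time(300, t_clk_rise, t_clk_q=120, t_d_q=100))
# Each line shows an output time of 400: the skewed clock edge does not move it.

# Data that arrives early (100 ps) waits for the clock, so skew is visible again.
print(latch_output_time(100, 200, t_clk_q=120, t_d_q=100))  # 320
```

This matches the slide's point that, on critical paths, the data rather than the clock sets the timing.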

Summary
Clocks have skew
Today skew is caused by mismatches of balanced paths
- Simulating matching is VERY HARD, since you need data that is not available
- Need your own simulation environment, since assuming perfect matching gives you best-case skew
I believe that keeping skew under 200 ps is hard
- Can do better than that over smaller regions
  (for fixed functionality, skew should roughly scale with inverter speed)
Need to have circuits that deal with skew
- Need to prevent race-through by padding short paths (and minimizing local skew)
- Prevent cycle-time impact by using latch based design techniques
  Use skew tolerant domino (but that is another lecture)
If you are careful, clocks will work fine for chips running at GHz rates.