Relative Timing Driven Multi-Synchronous Design: Enabling Order-of-Magnitude Energy Reduction

Kenneth S. Stevens
University of Utah (UofU) and Granite Mountain Technologies (GMT)
27 March 2013

Learn from Prof. Kajitana
- Think differently and deeply
- Apply thought to current challenges
- Then collaborate

Goals of Presentation:
1. Define and propose rule breaker idea
2. Request support from physical design community

Multi-Synchronous Advantage
1. Efficiency in power and performance is new game in town
2. Multi-synchronous design provides optimization opportunity
3. New (asynchronous) timing model is one excellent path
4. Produces average 10× $e\tau^2$ improvement
   - Pentium: $e\tau^2$ = 17.5
   - FFT: $e\tau^2$ = 16.9
5. But... need improved physical design support

| Design | Energy | Area | Freq. | Latency | Aggregate |
|---|---|---|---|---|---|
| Pentium F.E. | 2.05 | 0.85 | 2.92 | 2.38 | 12.11 |
| 64-pt FFT | 3.95 | 2.83 | 2.07 | 3.37 | 77.98 |
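The headline figures can be reproduced from the table's own ratios. The sketch below assumes (it is not stated on the slide) that the $e\tau^2$ improvement is the energy ratio times the square of the frequency ratio, and that "Aggregate" is the product of the four per-metric ratios; both assumptions happen to match the printed numbers. The function and dictionary names are illustrative only.

```python
# Improvement ratios copied from the table (multi-synchronous vs. clocked design).
RATIOS = {
    "Pentium F.E.": {"energy": 2.05, "area": 0.85, "freq": 2.92, "latency": 2.38},
    "64-pt FFT": {"energy": 3.95, "area": 2.83, "freq": 2.07, "latency": 3.37},
}


def e_tau2(r):
    """Energy-delay-squared improvement, ASSUMING tau scales with 1/frequency."""
    return r["energy"] * r["freq"] ** 2


def aggregate(r):
    """Product of the four ratios (ASSUMED meaning of the 'Aggregate' column)."""
    return r["energy"] * r["area"] * r["freq"] * r["latency"]


for name, r in RATIOS.items():
    print(f"{name}: e*tau^2 = {e_tau2(r):.1f}, aggregate = {aggregate(r):.2f}")
# Pentium F.E.: e*tau^2 = 17.5, aggregate = 12.11
# 64-pt FFT: e*tau^2 = 16.9, aggregate = 77.98
```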

Timing is a Key Issue

Multi-synchronous design produces best results.

[Figure: chip with blocks labeled "Synchronous Clock at 1.5GHz", "Synchronous 3.0GHz clk", "Async circuit", "Synchronous variable freq.", "Pausable 1.7GHz clk", and "Synchronous Clock at 1.8GHz"]

Single frequency, low skew (small blocks, standard CAD):
1. global block frequencies
2. higher clock power
3. clock design, distribution

Multiple frequencies (SoC reality, localization):
1. blocks operate at best frequency
2. network not synchronized
3. synchronizing FIFOs

Wine goblet model: Energy Efficient Design

Energy efficiency has two primary sources:
- System architecture
- Physical design

Methodology and CAD unify the sources.

Best realization: Multi-synchronous
- Defined by system's critical path
- Then optimal local power-delay

Asynchronous is the best methodology: no synchronization cost.

[Figure: wine goblet with regions labeled "arch" and "pd"]

Interface Matters!

Clocked design requires synchronizers when crossing all domains.

[Figure: IP clock domain and network clock domain connected by data, clk, s, and r signals through four synchronizers (S)]

Major location for buffering in a design.

Interface Matters!

No synchronization required into async domain.

[Figure: IP clock domain and network clock domain connected by data, clk, s, and r signals through two synchronizers (S)]

Improves power, performance, and modularity.

Timed Asynchronous Designs

Multi-Synchronous Architecture
1. Make architectural bottleneck as fast as possible.
2. Make the rest of the design match bottleneck... normally as slow as possible
3. Optimize locally for power/performance.

[Figure: asynchronous Pentium bottleneck circuit, with signals irdy, irdyack, bufreq, bufack, tagin1 to tagin7, tagout1 to tagout7, and latches L1 to L7]

Concurrency and Time

Architectural level timing experiment: Pentium front end

[Figure, animated across six slides: cache latch (columns 0 to 15, rows 0 to 3), target, and length decoders, with numbered entries advancing from frame to frame]

Timing and Sequencing

Traditional representation of timing: metric values
- On an IC we measure it to picoseconds
- In track and ski racing, we measure it to milliseconds

But what do we really care about? It isn't the number on the stop watch... We care about who wins!!

The key: Timing results in sequencing

Relative Timing formally represents the signal sequencing produced by circuit timing.

New Formal Abstract Model: Relative Timing

Timing is both the technology differentiator and barrier. Relative Timing is the generalized solution.

The key property of time is the sequencing it imposes:
- Sequence gives winner, performance, etc.
- True in semiconductors as well as sports
- Absolute stopwatch value is auxiliary

Novel relativistic formal logic representation of time (relative timing):

$pod \mapsto poc_1 \prec poc_2$

- Sequencing relative to common reference
- Can now evaluate sequencing
- Can now control sequencing

Relative Timing
1. Relative Timing sequences signals at poc (point of convergence)
   - Requires a common timing reference: pod (point of divergence)
2. Formal representation: $pod \mapsto poc_1 + margin \prec poc_2$
3. RT models timing in ALL systems
   - Clocked: pod = clock, poc = flops
   - Async: pod = request, poc = latches
4. RT enables direct commercial CAD support of general timing requirements
   - formal RT constraints mapped to sdc constraints

[Figure: clocked stage with flops FF_i and FF_i+1, data path, and clk as POD; generic paths A and B from a POD to POC_0 and POC_1; waveform of clk_i, clk_i+1, and data with margin m]
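As a rough illustration of how such a constraint can be evaluated, the sketch below checks that the slowest path from the pod to the first poc, plus the margin, does not exceed the fastest path from the pod to the second poc. The `RTConstraint` class, the function names, the use of "at most" rather than a strict inequality, and the 0.700/0.640 test delays are assumptions for illustration; the 0.600 and 0.050 values echo the numbers used later in the flow script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RTConstraint:
    """pod -> first + margin before second (hypothetical container, not from the slides)."""
    pod: str       # point of divergence: the common timing reference
    first: str     # poc signal that must arrive first
    second: str    # poc signal that must arrive later
    margin: float  # required separation in ns (e.g. a setup time)


def required_min_delay(c, max_delay_to_first):
    """Smallest pod-to-second delay that still satisfies the constraint."""
    return max_delay_to_first + c.margin


def holds(c, max_delay_to_first, min_delay_to_second):
    """True when the worst-case early path still beats the best-case late path."""
    return min_delay_to_second >= required_min_delay(c, max_delay_to_first)


# Bundled-data stage: the request must reach the latch clock after the data settles.
bundled = RTConstraint(pod="req_i", first="L_i+1/d", second="L_i+1/clk", margin=0.050)
print(holds(bundled, 0.600, 0.700))  # True
print(holds(bundled, 0.600, 0.640))  # False
```

The same check covers the clocked case by setting the pod to the clock and the poc signals to the flop data and clock pins.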

Relative Timed Design: Bundled Data

Bundled data design is much like clocked.

[Figure: clocked pipeline (flops FF_i, FF_i+1, FF_i+2, combinational logic CL, n-bit datapath, clock network) above a bundled-data pipeline (latches L_i, L_i+1, L_i+2, combinational logic CL, delay elements, controllers Ctl_i to Ctl_i+2, req_i to req_i+3 and ack_i to ack_i+3 handshakes)]

Frequency based (clocked) design. Clock frequency and datapath delay of the first pipeline stage are constrained by:

$L_i/clk_i \mapsto L_{i+1}/d + s \prec L_{i+1}/clk_{i+1}$

Timed (bundled data) handshake design. Delay element sized by RT constraint:

$req_i \mapsto L_{i+1}/d + s \prec L_{i+1}/clk$

Clocked physical design directly supports the clocked Relative Timing constraints. The asynchronous circuit constraints must be provided as min and max constraints, and are not well supported.

Relative Timing Driven Flow

    set d0_fdel 0.600
    set d0_fdel_margin [expr $d0_fdel + 0.050]
    set d0_bdel 0.060
    set_size_only -all_instances [find -hier cell lc1]
    set_size_only -all_instances [find -hier cell lc3]
    set_size_only -all_instances [find -hier cell lc4]
    set_disable_timing -from A2 -to Y [find -hier cell lc1]
    set_disable_timing -from B1 -to Y [find -hier cell lc1]
    set_disable_timing -from A2 -to Y [find -hier cell lc3]
    set_disable_timing -from B1 -to Y [find -hier cell lc3]
    set_max_delay $d0_fdel -from a -to l0/d
    set_max_delay $d0_fdel -from b -to l0/d
    set_min_delay $d0_fdel_margin -from lr -to l0/clk
    set_max_delay $d0_bdel -from lr -to la
    #margin 0.050 -from a -to l0/d -from lr -to l0/clk
    #margin 0.050 -from b -to l0/d -from lr -to l0/clk
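The mapping from one RT constraint to the max-delay and min-delay lines in the script can be sketched as a small generator. This is only an illustration of the pattern visible on the slide (data paths bounded by a max delay, the control path held above that delay plus the margin); the function `rt_to_sdc` and its parameters are hypothetical and are not part of any tool flow described in the talk.

```python
def rt_to_sdc(data_paths, pod, late_pin, max_delay, margin):
    """Emit SDC-style lines for one RT constraint (illustrative helper only).

    data_paths: (from, to) pairs for the data that must arrive first
    pod, late_pin: endpoints of the control path that must arrive later
    max_delay, margin: delays in ns
    """
    lines = [
        f"set_max_delay {max_delay:.3f} -from {src} -to {dst}"
        for src, dst in data_paths
    ]
    # The control path must be slower than the data by at least the margin.
    lines.append(f"set_min_delay {max_delay + margin:.3f} -from {pod} -to {late_pin}")
    return lines


# Pin names and delay values taken from the script above.
for line in rt_to_sdc([("a", "l0/d"), ("b", "l0/d")], "lr", "l0/clk", 0.600, 0.050):
    print(line)
# set_max_delay 0.600 -from a -to l0/d
# set_max_delay 0.600 -from b -to l0/d
# set_min_delay 0.650 -from lr -to l0/clk
```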

Multi-rate 64-Point FFT Architecture

Initial design target: high performance military applications

Mathematically based on $W_N = e^{-j2\pi/N}$ notation

Hierarchical multi-rate design: $N = N_1 N_2$
- Decimate frequency ($\downarrow$) by $N_2$; operate on $N_2$ low frequency streams
- Transmute data & frequency to $N_1$ low frequency streams
- Expand ($\uparrow$) by $N_1$ to reconstruct original frequency stream

Design Models

Hierarchical derivation of multi-frequency design:

$$X_{m_1}(m_2) = \sum_{n_2=0}^{N_2-1} \left[ W_N^{m_1 n_2} \sum_{n_1=0}^{N_1-1} x_{n_2}(n_1)\, W_{N_1}^{m_1 n_1} \right] W_{N_2}^{m_2 n_2}$$

- $N_2$ FFTs using $N_1$ values as the inner summation
- Scaled and used to produce $N_1$ FFTs of $N_2$ values
- Hierarchically scale design

Base case when $N = 4$: $X(m) = W_4\, x(n)$
- 4-point FFT performed without multiplication
- Multiplication constants $W_4$ become ±1
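A quick numerical check of this decomposition, written as a plain-Python sketch. It assumes the usual index maps $n = N_2 n_1 + n_2$ and $m = m_1 + N_1 m_2$ and the sign convention $W_N = e^{-j2\pi/N}$, neither of which is spelled out on the slide; the function names are illustrative, and the naive `dft` stands in for the hardware FFT blocks.

```python
import cmath


def W(N, k):
    """Twiddle factor W_N^k, ASSUMING W_N = exp(-j*2*pi/N)."""
    return cmath.exp(-2j * cmath.pi * k / N)


def dft(x):
    """Direct O(N^2) DFT, used both as reference and as the small sub-transform."""
    N = len(x)
    return [sum(x[n] * W(N, n * m) for n in range(N)) for m in range(N)]


def multirate_dft(x, N1, N2):
    """Two-level transform: N2 inner N1-point DFTs, twiddle scaling, N1 outer N2-point DFTs."""
    N = N1 * N2
    assert len(x) == N
    # Decimate by N2: stream n2 holds x(N2*n1 + n2) for n1 = 0..N1-1.
    streams = [x[n2::N2] for n2 in range(N2)]
    inner = [dft(s) for s in streams]
    X = [0j] * N
    for m1 in range(N1):
        # Scale by W_N^(m1*n2), then take one N2-point DFT across the streams.
        scaled = [W(N, m1 * n2) * inner[n2][m1] for n2 in range(N2)]
        outer = dft(scaled)
        for m2 in range(N2):
            X[m1 + N1 * m2] = outer[m2]
    return X


# FFT-64 configuration from the talk: N1 = 16, N2 = 4.
x = [complex(n % 7, (3 * n) % 5) for n in range(64)]
err = max(abs(a - b) for a, b in zip(dft(x), multirate_dft(x, 16, 4)))
print(err < 1e-9)  # True
```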

FFT-64

Implemented on IBM's 65nm 10sf process, Artisan academic library

Three design blocks:
- FFT-4
- FFT-16: $N_1, N_2 = 4$
- FFT-64: $N_1 = 16$, $N_2 = 4$

Two designs:
- Clocked Multi-Synchronous
- Relative Timed Multi-Synchronous
  - near identical architectures
  - additional RT area / pipeline optimized version for FFT-64

General Multi-rate FFT Architecture

[Figure: block diagram. Input x(n) passes through a delay chain ($z^{-1}$) and decimators by $N_2$ into streams $x_0(n_1)$ to $x_{N_2-1}(n_1)$, each feeding an $N_1$-pt. FFT with $N_1$ constants. The outputs are scaled by twiddle factors of the form $e^{-j2\pi k/N}$, processed by $N_2$-pt. FFTs, then expanded by $N_1$ through a delay chain to form X(m). Frequency labels: 1.25GHz, 313MHz, 313MHz to 78MHz, 78MHz.]

ASIC tool flow, 65nm technology

FFT-4 Building Block

Data flow graph of pipelined 4-Point FFT design:

| Input | Operators (as drawn) | Output |
|---|---|---|
| Re{x[0]} | + + | Re{X[0]} |
| Im{x[0]} | + + | Im{X[0]} |
| Re{x[1]} | + - | Re{X[1]} |
| Im{x[1]} | + - | Im{X[1]} |
| Re{x[2]} | - + | Re{X[2]} |
| Im{x[2]} | - + | Im{X[2]} |
| Re{x[3]} | - - | Re{X[3]} |
| Im{x[3]} | - - | Im{X[3]} |
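For reference, a 4-point DFT really does need only additions, subtractions, and a swap of real and imaginary parts. The sketch below is the standard textbook two-stage form under the assumed convention $W_4 = e^{-j\pi/2} = -j$; it is not a transcription of the slide's data flow graph, and the name `fft4` is illustrative.

```python
def fft4(x0, x1, x2, x3):
    """4-point DFT using only add/subtract and a real/imaginary swap (no multipliers)."""
    # Stage 1: sums and differences of inputs two samples apart.
    a, b = x0 + x2, x1 + x3
    c, d = x0 - x2, x1 - x3
    # Multiplying d by -j swaps its real and imaginary parts with one negation.
    d_rot = complex(d.imag, -d.real)
    # Stage 2: combine the partial results.
    return (a + b, c + d_rot, a - b, c - d_rot)


print(fft4(1 + 2j, 3 - 1j, -2 + 0.5j, 0.5 + 4j))
# ((2.5+5.5j), (-2-1j), (-4.5-0.5j), (8+4j))
```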

Pipelined Asynchronous 4-Point Architecture
- Operates at 1/4 the input frequency
- Synchronization occurs between decimated rows
- Fast internal pipeline stages essential

[Figure: input handshake lr/la into LC0 and Dec4; four rows of controllers LC1 to LC4 (rows 0 to 3) linked by fork elements f0 to f11 and join elements j0 to j11 around two add/sub stages; then Exp4 and LC5 with output handshake rr/ra]

Decimator-4 Design Comparison
- Clocked block requires pipeline to change frequency
- Async block latency combinational and concurrent
- Multi-Synchronous asynchronous design smaller, faster, lower power

[Figure: clocked decimator (shift registers on clk and clk/4, registers R0 to R7, input Din, outputs D1 to D4) beside the asynchronous decimator (requests ri, r1 to r4; acknowledges ai, a1 to a4; input Din, outputs D1 to D4)]

Results

The 16-point FFT Comparison Result (* values are scaled ideally to 65 nm technology)

| Design | Points | Word bits | Time for 1K-point (µs) | Clock (MHz) | Tech. (nm) | Energy/point (pJ/data point) | Area | Power (mW) | Energy Benefit | Area Benefit | Throughput Benefit |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Our Design (Async) | 16-1024 | 32 | 0.83 | 1274 | 65 | 25.05 | 54 Kgates | 30.9 | 8.01 | 2.77 | 8.32 |
| Our Design (clock) | 16-1024 | 32 | 1.73 | 588 | 65 | 41.83 | 71 Kgates | 24.7 | 4.8 | 2.07 | 3.98 |
| Guan [1] | 16-1024 | 16 | 6.91 | 653 | 130 | 200.68 | 147 Kgates | 29.7 | 1 | 1 | 1 |

The 64-point FFT Comparison Result (* values are scaled ideally to 65 nm technology)

| Design | Points | Word bits | Time for 1K-point (µs) | Clock (MHz) | Tech. (nm) | Energy/point (pJ/data point) | Area | Power (mW) | Energy Benefit | Area Benefit | Throughput Benefit |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Our Design (Async-opt) | 64-1024 | 32 | 0.93 | 1284 | 65 | 62.41 | 0.41 mm² | 68.5 | 6.1 | 0.46 | 30.16 |
| Our Design (Async) | 64-1024 | 32 | 0.84 | 1366 | 65 | 59.94 | 0.50 mm² | 72.9 | 6.35 | 0.38 | 33.42 |
| Our Design (clock) | 64-1024 | 32 | 3.13 | 588 | 65 | 246.75 | 1.16 mm² | 80.7 | 1.54 | 0.16 | 8.99 |
| Baireddy [2] | 64-4096 | - | 28.14 | 514 | 90 | 380.88 | 0.19 mm² | 13.86 | 1 | 1 | 1 |

The 64-point async-opt design contains 229k gates, our clocked design 454k.

For comparison, these designs were scaled to a 65nm process by scaling frequency, power, and area in the 130nm technology by 2.0, 0.5, and 0.25, and in the 90nm design by 1.43, 0.7, and 0.49, respectively.

[1] X. Guan, Y. Fei, and H. Lin, "Hierarchical Design of an Application-Specific Instruction Set Processor for High-Throughput and Scalable FFT Processing," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 3, pp. 551-563, March 2012.
[2] V. Baireddy, H. Khasnis, and R. Mundhada, "A 64-4096 point FFT/IFFT/Windowing Processor for Multi Standard ADSL/VDSL Applications," in IEEE International Symposium on Signals, Systems and Electronics (ISSSE '07), pp. 403-405, 2007.

RT Physical Design Optimization

Timing, power, and performance optimizations driven by relative timing constraints.

[Figure: bundled-data pipeline with latches L_i, L_i+1, L_i+2, combinational logic CL, delay elements, controllers Ctl_i to Ctl_i+2, and req/ack handshakes]

$req_i \mapsto L_{i+1}/d + m \prec L_{i+1}/clk$

- Mapped to set_max_delay and set_min_delay constraints
- Clock frequency determines min delay, async adds hold time

RT Physical Design Problems
1. Inconsistency between operation and results
   - supported pins & formats, synthesis vs place and route, etc.
2. Min-delay constraints not well supported
   - Treated as hold time fixing
   - Create arbitrarily large delays
   - Degrades performance
   - Required matching max-delay constraint to bound delay
3. Poor job of optimizing competing constraints
4. Placement can be substantially improved

RT Physical Design Problems

Simple experiment with inverters, with endpoints mapping either to a module pin or a library gate pin.

[Figure: modules i0 and i1 with labeled pins A, B, C, D, E, F]

| Path | Design Compiler: Result | Design Compiler: Iterations | Design Compiler: Type | SoC Encounter: Result | SoC Encounter: Type |
|---|---|---|---|---|---|
| A to E | Yes | 5 | buffers | No | |
| A to F | Yes | 5 | buffers | No | |
| B to E | Yes | 1 | Dly Elts | No | |
| B to F | Yes | 1 | Dly Elts | Yes | Dly Elts |
| C to E | Yes | 1 | Dly Elts | No | |
| C to F | Yes | 1 | Dly Elts | Yes | Dly Elts |
| D to E | No | | | No | |
| D to F | No | | | No | |

Paths use both max and min delay constraints.

RT Physical Design Problems

[Figure: pipelined asynchronous 4-point FFT architecture (LC0, Dec4, controller rows LC1 to LC4 with fork/join elements and add/sub stages, Exp4, LC5)]

Min-delay constraints get dropped, even in relatively small design!

| Model | Design Compiler: #iter | Design Compiler: cyc. time | SoC: #iter | SoC: cyc. time | SoC: energy/op | SoC timing closure: #iter | SoC timing closure: cyc. time | SoC timing closure: energy/op |
|---|---|---|---|---|---|---|---|---|
| wl0.5 | 9 | 738ps | 1 | 728ps | 5.16pJ | 70 | 785ps | 4.85pJ |
| wl0 | 7 | 666ps | 1 | 764ps | 5.07pJ | 16 | 763ps | 4.87pJ |

RT Physical Design Potential
1. Low hanging fruit for performance improvements
2. Force directed algorithms
   - Combine power/placement optimizations
   - Drive cell clustering
   - Drive pipeline/repeater placement and wire optimization
3. Tool performance: Convergence and run-time

Multi-Synchronous Advantage
1. Efficiency in power and performance is new game in town
2. Multi-synchronous design provides optimization opportunity
3. New (asynchronous) timing model is one excellent path
4. Produces average 10× $e\tau^2$ improvement
   - Pentium: $e\tau^2$ = 17.5
   - FFT: $e\tau^2$ = 16.9
5. But... need improved physical design support

| Design | Energy | Area | Freq. | Latency | Aggregate |
|---|---|---|---|---|---|
| Pentium F.E. | 2.05 | 0.85 | 2.92 | 2.38 | 12.11 |
| 64-pt FFT | 3.95 | 2.83 | 2.07 | 3.37 | 77.98 |