Accurate Timing and Power Characterization of Static Single-Track Full-Buffers


By Rahul Rithe
Department of Electronics & Electrical Communication Engineering
Indian Institute of Technology Kharagpur, India
rrithe@iitkgp.ac.in

Guidance: Dr. Peter Beerel
Ming Hsieh Department of Electrical Engineering
University of Southern California, Los Angeles, CA 90089

Index

1. Abstract
2. Introduction
3. Background
   1. Static Single-Track Full-Buffer
   2. Characterizing Delay and Power
4. Modeling SSTFB
5. Asynchronous ASIC Design Flow
6. Characterization Flow
   1. Input Slew and Load Capacitance
   2. Input Waveforms
   3. Measuring Power
   4. Library Generation
7. Validation Results
8. Conclusion
9. References

Abstract

For mainstream acceptance of asynchronous circuits, a mature EDA tool flow that leverages commercially available tools for synchronous circuits is necessary. A characterization methodology that supports back-annotated power and timing for high-performance asynchronous circuits based on the static single-track full-buffer template is presented here. The model takes into account the effects of input slew rate and output loading and is represented in a commercial library format. Experimental results show that back-annotated post-P&R simulations using this library lead to at most 7.1% error compared to detailed Spice-level simulations that require orders of magnitude longer run times.

Introduction

Driven by overwhelming design-time constraints, standard-cell-based synchronous design styles, supported by mature CAD tools and a largely automated flow, dominate the ASSP and ASIC marketplaces. As device feature sizes shrink and process variability increases, however, reliance on a global clock becomes increasingly difficult and yields far-from-optimal solutions. Because standard-cell designs use very conservative circuit families and are often over-designed to accommodate worst-case variations, the performance and power gap between full-custom and standard-cell designs continuously widens. Recent research demonstrates that it is possible to narrow this gap using conventional standard-cell techniques with asynchronous cell libraries.

Asynchronous design has begun to demonstrate its advantages in the commercial marketplace. However, the development and characterization of asynchronous libraries is still an emerging area of research. In general, the challenges of characterizing asynchronous cells come from their more general circuit structure, which may include internal combinational loops, bi-directional pins, and mutual exclusion rules on dual-rail or 1-of-N inputs. These input constraints and general structures do not conform to the standard latch or flip-flop templates supported by commercial library characterization tools. Consequently, library characterization has been a limiting factor for otherwise promising asynchronous design styles that rely on more general circuit structures.

In particular, these challenges are epitomized in the proposed next-generation STFB circuit family called static single-track full-buffers (SSTFB) [1][3][4]. This work addresses these challenges by demonstrating an effective timing and power characterization flow for these cells. It first describes the decomposition of SSTFB cell behavior into a set of timing arcs that can be understood by commercial place-and-route and back-annotation tools. It then describes a novel methodology and tool kit to automatically characterize the timing and power consumption of all timing arcs using the commercially supported Liberty (.lib) file format.

Background

1. Static Single-Track Full-Buffer

Fig. 1 shows the circuit diagram of a static STFB dual-rail buffer.

Fig. 1: Static STFB dual-rail buffer circuit

When there is no token in the right channel (R is low, meaning the channel is empty), the right environment enables the domino logic to process a new token. When the next token arrives at the left channel (L goes high), it is processed by lowering the state signal S, which creates an output token on the right channel (R goes high) and causes A to assert, removing the token from the left channel via the reset NMOS transistors. The presence of the output token on the right channel restores the state signal and deactivates the NMOS transistor at the bottom of the N-stack, thus disabling the stage from firing while the output channel is busy.

After the sender drives the line high, the receiver is responsible for actively keeping the line high (via the input keepers) until it wants to drive it low. Similarly, after the receiver drives the line low, the sender is responsible for actively keeping the line low (via the output keepers) until it wants to drive it high. The line is always statically driven and no fight with staticizers exists. This hand-off technique enables the hold circuitry to be sized to a suitable strength, creating a tradeoff between performance/power/area and robustness to noise. The inverters in the hold circuitry can also be skewed so that they turn on early, creating an overlap between the driving and hold logic. This overlap prevents the channel wire from being in a tri-state condition, making the circuit family more robust to noise. The overlap also helps ensure that the channel wires are always driven close to the power supplies, further increasing noise margins [4].

The local cycle time of the static STFB template is 6 transitions and its forward latency is 2 transitions. It is called a full-buffer because each buffer stage can hold one token. The template is very flexible and can be expanded to implement different functionalities by enabling multiple 1-of-N input channels, arbitrary NMOS pull-down logic and multiple 1-of-N output channels [1][3][4].
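The hand-off rule described above (the sender may only raise an empty wire, the receiver may only lower a full one, and responsibility for holding the level passes to the other side after each transition) can be illustrated with a small behavioral model. This is a minimal sketch under our own assumptions: the class and method names are hypothetical, a single wire stands in for one rail of a channel, and no timing or electrical behavior is modeled.

```python
class SingleTrackWire:
    """Toy model of one single-track wire (hypothetical names, no timing)."""

    def __init__(self):
        self.level = 0          # 0 = channel empty, 1 = token present
        self.holder = "sender"  # side currently holding the wire at its level

    def send(self):
        """Sender drives the wire high; only legal when the channel is empty."""
        if self.level == 1:
            return False        # channel busy: the new token has to wait
        self.level = 1
        self.holder = "receiver"  # receiver's input keeper now holds it high
        return True

    def consume(self):
        """Receiver drives the wire low, removing the token."""
        if self.level == 0:
            return False        # nothing to consume
        self.level = 0
        self.holder = "sender"  # sender's output keeper now holds it low
        return True


wire = SingleTrackWire()
# send, blocked second send, consume, send again
trace = [wire.send(), wire.send(), wire.consume(), wire.send()]
print(trace, wire.level, wire.holder)
```

The second `send()` is refused because the wire still holds a token, which mirrors the stage being disabled from firing while its output channel is busy.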

2. Characterizing Delay and Power

Both delay and power consumption are characterized and stored in a library file. Delay and output slope are each captured in 2D tables, indexed by input slope and output load, for every timing arc of interest. For power, both static and dynamic sources are characterized. Dynamic power is made up of internal power and switching power. The former is dissipated by the cell in the absence of a load capacitance and the latter is the component dissipated while charging or discharging a load capacitance. Dynamic power is measured per timing arc (as with delay).

Static dissipation is due to leakage currents through 'OFF' transistors and can be significant when the circuit is in the idle state (there is no switching activity). It has four principal sources: reverse-biased junction leakage current, gate-induced drain leakage, gate direct-tunneling leakage and subthreshold (weak inversion) leakage. For 180nm, gate leakage is about 0.1% of total static power and subthreshold leakage dominates. The other two components are much smaller and thus generally ignored. With this simplification, leakage power can be computed as the product of the supply voltage and the subthreshold leakage current. Unlike delay and dynamic power, leakage power is typically represented as a single value per cell.

Modeling SSTFB

To model the performance of the SSTFB cells, a set of timing arcs is identified that captures the behavior of the cell. The causality between the timing arcs is formalized in a timed marked graph. As an example, Fig. 2 shows the marked graph model of a buffer cell in which the specific data rails have been abstracted. The dashed edges represent the behavior of the environment, whereas the solid edges represent the behavior of the cell.
The + symbol indicates a rising transition, - a falling transition, 0Z a low-to-tri-state transition, 1Z a high-to-tri-state transition, Z0 a tri-state-to-low transition and Z1 a tri-state-to-high transition. The behavior illustrated can be explained in two cases.

Fig. 2: Marked graph model of an SSTFB buffer

Case 1: The right channel R is free and a token arrives at the left channel L. This new token is processed by lowering the state signal S, which corresponds to the timing arc L+ → S-. As S goes low, it creates a token on the right channel R and causes the SCD to assert the signal A. These are the timing arcs S- → R+ and S- → A+. The reset block is activated as A goes high and removes the token from the left channel L (A+ → L-). Simultaneously, the domino logic precharges, causing S+, which tri-states the right channel, i.e., S+ is equivalent to R 1Z. The SCD then de-asserts A, giving the timing arc S+ → A-, which tri-states the left channel, i.e., is equivalent to L 0Z. The right channel R is eventually reset by the right environment, enabling the processing of a new token and completing the cycle.

Case 2: The right channel R is busy and a token arrives at the left channel L. In this case the token at L has to wait for R to be free. The triggering event is R being reset by the right environment. This introduces another timing arc, R- → S-, in addition to the timing arcs described in Case 1.

Asynchronous ASIC Design Flow

For each SSTFB cell needed, we created the following library views: a functional view containing the behavioral description of the cell in Verilog HDL, a schematic view containing the transistor-level implementation of the cell, a layout view containing detailed GDSII data, an abstract view in LEF format to support placement and routing, and finally its symbol. Using this library, a largely conventional standard-cell ASIC back-end design flow with conventional place-and-route tools can be used to create the layout, as illustrated in Fig. 3. We use Hspice to perform analog transistor-level simulations to verify correctness and measure performance.

Fig. 3: Asynchronous ASIC Design Flow

Characterization Flow

The industry-standard Liberty format supports several delay models. We chose the non-linear delay model because it provides a reasonable tradeoff between accuracy and complexity. This delay model uses lookup tables indexed by input slew and/or load capacitance. The selection of input slew and load capacitance indices and the creation of real-world input waveforms directly impact the accuracy of the characterization. In addition, it is necessary to measure the correct supply currents to accurately characterize internal power. Unlike synchronous standard cells, for which commercial library characterization tools are available, the characterization of these cells had to be implemented from scratch and is semi-automated.

1. Input Slew and Load Capacitance

Delay behaves non-linearly and non-monotonically with input slew. The design usage space should be bounded by carefully selecting the minimum and maximum input slew and load capacitance values to minimize delay calculation error due to interpolation and extrapolation. The output load index must be based on the cell drive strength. The tables should have enough points for both the input slew and output load indices to cover non-linear or non-monotonic regions. In the proposed flow, the minimum load capacitance was zero and the maximum was calculated such that the cell operated within pre-determined voltage swing levels. The input slew values were computed for each cell in the library based on the selected load capacitance values.

The load capacitance on internal pins is fixed. Consequently, timing arcs from input pins to the S and A pins need only be modeled as a 1D table (1x6) based on the input slew. However, arcs from the state pins S to the output pins R are modeled as a 2D table (6x6) based on both the slew on S and the output load.

2. Input Waveforms

The output load model can be simplified by assuming a lumped capacitance. For the input driver, however, the traditional use of a ramped linear waveform is not desirable, as it can by itself contribute 5-10% delay error. Commercial library characterization tools use one of two approaches to generate real-world input waveforms: the pre-driver method or a pre-driver generated real non-linear waveform. A buffer is often recommended for use as the pre-driver cell, as shown in Fig. 4(a).

Fig. 4(a): Test Setup for Synchronous Circuits
Fig. 4(b): Test Setup for Asynchronous Circuits

For asynchronous circuits, left and right environments must be set up to generate input tokens for, and consume output tokens from, the circuit under test (CUT). Commonly, these environments are modeled by cells called bitgen and bucket respectively, shown in Fig. 4(b). Thus we implicitly take care of the input waveform generation with the bitgen. The input slew is controlled by an adjustable capacitor C_S and the output load is controlled by the capacitor C_L.

3. Measuring Power

The main challenges for power characterization are partitioning the currents drawn through the supply amongst timing arcs for the dynamic component, modeling short-circuit current, and modeling the effects of crosstalk. The Liberty format measures internal energy per timing arc, which includes short-circuit power. Power analysis tools convert this internal energy to internal power by dividing by the system cycle time. They also add short-circuit energy and switching energy, the latter calculated as the energy required for switching the total net capacitance on the nets. The dynamic internal energy of an arc can be calculated using the following equation:

$$E_{arc} = \frac{(I_{vdd/gnd} - I_{leakage}) \cdot V_{dd} \cdot T}{N} \qquad (1)$$

where I_{vdd/gnd} is the average current measured through the specific voltage sources associated with the timing arc, I_{leakage} is the current measured when the circuit is idle, V_{dd} is the supply voltage, T is the total simulation trace time and N is the number of tokens processed in time T.

In particular, we added 0-volt voltage sources to Vdd segments of the extracted placed-and-routed netlist in order to measure the currents responsible for charging internal cell nodes. We added 0-volt voltage sources to segments of Gnd to measure the short-circuit current associated with charging output nodes (e.g., the R0/R1 nets). In general, the measured currents associated with each token value can be partitioned among the associated timing arcs that must occur for each such token processed. For cells with a single input channel, however, we partitioned currents into one power arc per output, each assigned to a single, arbitrarily chosen related pin. For cells with multiple input channels in which multiple power arcs existed for a given output, we accounted for the power of all arcs in each arc. In this case, the power analysis tool chooses one such power arc depending on the timing of the related pins. This leads to a small amount of error because we are essentially assuming the slew on all input channels is identical.

4. Library Generation

Using the above concepts, the flow illustrated in Fig. 5 is used for complete timing and power characterization. Spice netlists of the cells were fed to Hspice along with auto-generated stimulus files. These stimulus files contain measure statements for delay, slew, and energy. Data is extracted from the Hspice output, which is in the .mt# file format, and automatically converted to the Liberty format. For ease of characterization, we sometimes assumed symmetry to estimate the delays and slews of one data rail using measured data from the other rail, introducing a small amount of error due to small differences in the layout between rails.
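The per-arc energy of Equation (1) in the Measuring Power subsection, and its conversion to internal power by dividing by the cycle time, can be written out as a short calculation. This is a minimal sketch: the function names are hypothetical and the numeric inputs are made-up illustrative values (including the 1.8 V supply), not measurements from this work.

```python
def arc_energy(i_supply_avg, i_leakage, vdd, trace_time, n_tokens):
    """Equation (1): dynamic internal energy per token for one arc, in joules.

    i_supply_avg : average current through the 0-volt source(s) for the arc (A)
    i_leakage    : current measured while the circuit is idle (A)
    vdd          : supply voltage (V)
    trace_time   : total simulated time T (s)
    n_tokens     : number of tokens N processed during T
    """
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return (i_supply_avg - i_leakage) * vdd * trace_time / n_tokens


def internal_power(energy_per_token, cycle_time):
    """Average internal power (W) when one token is processed per cycle."""
    return energy_per_token / cycle_time


# Made-up example: 300 uA average, 1 uA idle, 1.8 V, 100 ns trace, 100 tokens.
e_arc = arc_energy(300e-6, 1e-6, 1.8, 100e-9, 100)
p_int = internal_power(e_arc, 0.5e-9)   # assumed 0.5 ns cycle time
print(e_arc, p_int)
```

Subtracting the idle current before multiplying keeps the leakage contribution out of the per-arc dynamic energy, consistent with leakage being stored separately as a single value per cell.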

Fig. 5: Characterization Flow

Validation Results

To validate the characterization of each individual cell, we tested every cell in an environment as shown in Fig. 6.

Fig. 6: Cell Verification Environment

Cadence's SoC Encounter is used to perform delay calculation for each timing arc instance in the routed netlist using the library's Liberty description, recording the resulting arc delays in a standard delay format (.sdf) file. The .sdf file along with the Verilog netlist is simulated in Cadence's NC-Verilog simulator. This generates a timing-accurate Value Change Dump (.vcd) file that records the signal activity. The .vcd file is then fed back into SoC Encounter to perform simulation-based power analysis. This analysis produces an Instance Power Report listing the internal power and the switching power consumption of each of the circuit modules. The results of the simulation-based timing and power analysis are compared to golden Hspice simulations. Table 1 shows the results of this comparison for each of the individual cells.

Table 1: Comparison of results from the Hspice and Encounter-based flows

         |        Timing (ns)        |    Internal Power (mW)    |   Switching Power (mW)
Cell     | Hspice  Encounter  Error  | Hspice  Encounter  Error  | Hspice  Encounter  Error
Bitgen   | 0.335   0.350      6.0%   | 0.4560  0.4386     3.8%   | 0.2590  0.2430     6.1%
Bucket   | 0.335   0.350      6.0%   | 0.2017  0.1958     2.9%   | 0.0000  0.0000     0.0%
Buffer   | 0.381   0.394      3.4%   | 0.5710  0.5643     1.2%   | 0.2005  0.2100     4.5%
Fork     | 0.528   0.556      5.0%   | 0.8596  0.8196     4.6%   | 0.4345  0.4490     3.2%
Merge    | 0.572   0.604      4.3%   | 0.6375  0.6080     4.5%   | 0.1422  0.1398     1.8%
OR       | 0.457   0.440      3.7%   | 0.5692  0.5627     1.1%   | 0.1788  0.1776     0.7%
XOR      | 0.472   0.463      1.9%   | 0.6171  0.6370     3.1%   | 0.1928  0.2020     4.0%

The above comparison is summarized in the form of bar graphs in Fig. 7(a), 7(b) and 7(c).

Fig. 7(a): Cycle Time comparison from Hspice and Encounter based flow (bar chart: cycle time in ns per cell, Hspice vs. Encounter)
Fig. 7(b): Internal Power comparison from Hspice and Encounter based flow (bar chart: internal power in mW per cell, Hspice vs. Encounter)

Fig. 7(c): Switching Power Comparison from Hspice and Encounter based flow (bar chart: switching power in mW per cell, Hspice vs. Encounter)

To further validate the quality of our characterization in a larger environment, we used several small fork-join non-linear pipelines with a general structure as shown in Fig. 8.

Fig. 8: Structure of the fork-join non-linear pipeline

Table 2 shows the performance and power dissipation as measured by Hspice and the Encounter-based flow using our prototype SSTFB library for the different fork-join pipelines.

Table 2: Performance and Power comparison from the Hspice and Encounter-based flows

No. of Buffers in   |     Performance (GHz)     |        Power (mW)
Long - Short Path   | Hspice  Encounter  Error  | Hspice  Encounter  Error
8-1                 | 0.913   0.864      5.3%   | 4.500   4.270      5.1%
8-2                 | 1.262   1.198      5.0%   | 6.418   6.027      6.0%
8-3                 | 1.515   1.406      7.1%   | 8.158   7.741      5.1%
8-4                 | 1.686   1.631      3.2%   | 9.710   9.375      3.4%
8-5                 | 1.672   1.636      2.1%   | 10.15   10.04      1.1%
8-6                 | 1.661   1.626      2.1%   | 11.04   10.54      4.7%
8-7                 | 1.642   1.545      5.9%   | 11.08   10.51      5.1%
8-8                 | 1.666   1.554      6.7%   | 11.69   10.88      6.9%

The above comparison is summarized in the plots in Fig. 9(a) and 9(b).
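The error columns in Tables 1 and 2 compare the Encounter-based estimate against the Hspice reference. The text does not state the exact formula, so the sketch below assumes the relative error is |Encounter - Hspice| / Hspice; the function name is ours. The printed percentages may have been computed from unrounded data, so recomputing from the rounded table entries need not reproduce every printed value.

```python
def relative_error(hspice, encounter):
    """Relative error of the Encounter estimate w.r.t. the Hspice reference.

    Assumed definition: |encounter - hspice| / hspice.
    """
    return abs(encounter - hspice) / hspice


# Timing entries (ns) for three cells, copied from Table 1.
timing = {
    "Buffer": (0.381, 0.394),
    "OR": (0.457, 0.440),
    "XOR": (0.472, 0.463),
}

errors = {cell: f"{100 * relative_error(h, e):.1f}%"
          for cell, (h, e) in timing.items()}
print(errors)
```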

Fig. 9(a): Performance Comparison for the Fork-join pipelines (performance in GHz vs. number of buffers in the short path, Hspice vs. Encounter)
Fig. 9(b): Power Comparison for the Fork-join pipelines (power in mW vs. number of buffers in the short path, Hspice vs. Encounter)

The performance plot shows an interesting, counter-intuitive result: maximum throughput is achieved when the short and long paths of the fork-join structure are somewhat unbalanced, i.e., the short path has 4 buffers while the long path has 8 buffers. This can be attributed to the free slack associated with the buffers, which are faster than the FORK and JOIN cells. More balanced fork-join pipelines are slightly slower due to increased wire delay and consume more energy.

The experimental results show a maximum error of 7.1% between the Encounter-estimated and golden Hspice numbers. We believe much of this error can be attributed to the limited slew propagation during SDF generation due to the loops and bi-directional pins in the SSTFB .lib model.

Conclusion

A fully characterized asynchronous library not only supports back-annotated simulation-based power and timing analysis; it also enables timing-driven place and route, performance- and power-driven synthesis, and ECO flows. Moreover, characterized asynchronous libraries are a necessary precursor to extending STA-based timing sign-off to these designs. This work demonstrates the issues, feasibility, and potential accuracy associated with characterizing static STFB circuits. This is quite promising because SSTFB circuits have among the most complex timing relationships of the many proposed asynchronous design styles and have promising characteristics for application in low-power, high-performance SoC interconnects.

References

[1] P. Golani, G. D. Dimou, M. Prakash and P. A. Beerel. Design of a High-Speed Asynchronous Turbo Decoder. ASYNC 2007, March 2007.
[2] M. Ferretti and P. A. Beerel. High Performance Asynchronous Design Using Single-Track Full-Buffer Standard Cells. IEEE Journal of Solid-State Circuits, Vol. 41, No. 6, pp. 1444-1454, June 2006.
[3] M. Ferretti and P. A. Beerel. Single-Track Asynchronous Pipeline Templates using 1-of-N Encoding. DATE'02, March 2002.
[4] P. Golani and P. A. Beerel. High Speed Noise Robust Asynchronous Circuits. Proc. of ISVLSI, March 2006.
[5] P. Golani and P. A. Beerel. Back-Annotation in High-Speed Asynchronous Design. Journal of Low Power Electronics 2, 37-44 (2006).