Accurate Timing and Power Characterization of Static Single-Track Full-Buffers

By Rahul Rithe
Department of Electronics & Electrical Communication Engineering
Indian Institute of Technology Kharagpur, India
rrithe@iitkgp.ac.in

Guidance: Dr. Peter Beerel
Ming Hsieh Department of Electrical Engineering
University of Southern California, Los Angeles, CA 90089

Index
1. Abstract
2. Introduction
3. Background
   1. Static Single-Track Full-Buffer
   2. Characterizing Delay and Power
4. Modeling SSTFB
5. Asynchronous ASIC Design Flow
6. Characterization Flow
   1. Input Slew and Load Capacitance
   2. Input Waveforms
   3. Measuring Power
   4. Library Generation
7. Validation Results
8. Conclusion
9. References

Abstract

For mainstream acceptance of asynchronous circuits, a mature EDA tool flow is necessary that leverages commercially available tools for synchronous circuits. A characterization methodology that supports back-annotated power and timing for high-performance asynchronous circuits based on the static single-track full-buffer template is presented here. The model takes into account the effects of input slew rate and output loading and is represented in a commercial library format. Experimental results show that back-annotated post-P&R simulations using this library lead to no more than 7.1% error compared to detailed SPICE-level simulations that require orders of magnitude longer run-times.

Introduction

Driven by overwhelming design-time constraints, standard-cell based synchronous design styles supported by mature CAD design tools and a largely automated flow dominate the ASSP and ASIC marketplaces. As device feature sizes shrink and process variability increases, however, the reliance on a global clock becomes increasingly problematic, yielding far-from-optimal solutions. Because standard-cell designs use very conservative circuit families and are often over-designed to accommodate worst-case variations, the performance and power gap between full-custom and standard-cell designs continuously widens. Recent research demonstrates that it is possible to narrow this gap using conventional standard-cell techniques with asynchronous cell libraries.

Asynchronous design has begun to demonstrate its advantages in the commercial marketplace. However, the development and characterization of asynchronous libraries is still an emerging area of research. In general, the challenges of characterizing asynchronous cells come from their more general circuit structure, which may include internal combinational loops, bi-directional pins, and mutual exclusion rules on dual-rail or 1-of-N inputs.
These input constraints and general structures do not conform to the standard latch or flip-flop templates supported by commercial library characterization tools. Consequently, library characterization has been a limiting factor for otherwise promising asynchronous design styles that rely on more general circuit structures. These challenges are epitomized in the proposed next-generation STFB circuit family called static single-track full-buffers (SSTFB) [1][3][4].

This work addresses these challenges by demonstrating an effective timing and power characterization flow for these cells. In particular, it describes the decomposition of SSTFB cell behavior into a set of timing arcs that can be understood by commercial place-and-route and back-annotation tools. It then describes a novel methodology and tool kit to automatically characterize the timing and power consumption of all timing arcs using the commercially supported Liberty file (.lib) format.

Background

1. Static Single-Track Full-Buffer

Fig. 1 shows the circuit diagram of a static STFB dual-rail buffer.

Fig. 1: Static STFB dual rail buffer circuit

When there is no token in the right channel R (R is low, meaning the channel is empty), the right environment enables the domino logic to process a new token. When the next token arrives at the left channel (L goes high), it is processed by lowering the state signal S, which creates an output token on the right channel (R goes high) and causes A to assert, removing the token from the left channel via the reset NMOS transistors. The presence of the output token on the right channel restores the state signal and deactivates the NMOS transistor at the bottom of the N-stack, thus disabling the stage from firing while the output channel is busy.

After the sender drives the line high, the receiver is responsible for actively keeping the line high (via the input keepers) until it wants to drive it low. Similarly, after the receiver drives the line low, the sender is responsible for actively keeping the line low (via the output keepers) until it wants to drive it high. The line is always statically driven and no fight with staticizers exists. This hand-off technique enables the hold circuitry to be sized to a suitable strength, creating a tradeoff between performance/power/area and robustness to noise. The inverters in the hold circuitry can also be skewed such that they turn on early, creating an overlap between the driving and hold logic. This overlap avoids the channel wire being in a tri-state condition, thus making the circuit family more robust to noise. The overlap also helps ensure that the channel wires are always driven close to the power supplies, further increasing noise margins [4].

The local cycle time of the static STFB template is 6 transitions and its forward latency is 2 transitions. It is called a full-buffer because each buffer stage can hold one token.
The template is very flexible and can be expanded to implement different functionalities by enabling multiple 1-of-N input channels, arbitrary NMOS pull-down logic and multiple 1-of-N output channels [1][3][4].
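The single-track hand-off described above (only the sender raises the wire, only the receiver lowers it, and responsibility for holding the level passes between them) can be illustrated with a small behavioral sketch. This is a toy model under simplifying assumptions, not the transistor-level circuit: the class name `SingleTrackChannel`, its methods, and the `holder` field are hypothetical names introduced only for illustration, and timing, keeper overlap and 1-of-N encoding are ignored.

```python
class SingleTrackChannel:
    """Toy model of one single-track wire: high = token present, low = empty.

    Hypothetical illustration only; it tracks the logic level and which side
    is currently responsible for statically holding that level.
    """

    def __init__(self):
        self.level = 0            # channel starts empty (low)
        self.holder = "sender"    # the sender holds an empty line low

    def send(self):
        """Sender drives the wire high to create a token."""
        if self.level == 1:
            raise RuntimeError("channel busy: sender must wait")
        self.level = 1
        self.holder = "receiver"  # receiver's input keeper now holds it high

    def consume(self):
        """Receiver drives the wire low to remove the token."""
        if self.level == 0:
            raise RuntimeError("channel empty: nothing to consume")
        self.level = 0
        self.holder = "sender"    # sender's output keeper now holds it low


if __name__ == "__main__":
    ch = SingleTrackChannel()
    ch.send()
    print(ch.level, ch.holder)
    ch.consume()
    print(ch.level, ch.holder)
```

The explicit errors mirror the protocol rule that a stage must not fire while its output channel is still busy.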

2. Characterizing Delay and Power

Both delay and power consumption are characterized and stored in a library file. For both delay and output slope, 2D tables indexed by input slope and output load are used to capture the behavior of each timing arc of interest.

For power, both static and dynamic sources of power are characterized. Dynamic power is made up of internal power and switching power. The former is dissipated by the cell in the absence of a load capacitance and the latter is the component that is dissipated while charging/discharging a load capacitance. Dynamic power is measured per timing arc (as with delay).

Static dissipation is due to leakage currents through 'OFF' transistors and can be significant when the circuit is in the idle state (there is no switching activity). It has four principal sources: reverse-biased junction leakage current, gate-induced drain leakage, gate direct-tunneling leakage and subthreshold (weak inversion) leakage. For 180nm, gate leakage is about 0.1% of total static power and subthreshold leakage dominates. The other two components are much smaller and thus generally ignored. With the above simplification, leakage power can be computed as the product of supply voltage and the subthreshold leakage current. Unlike delay and dynamic power, leakage power is typically represented as a single value per cell.

Modeling SSTFB

To model the performance of the SSTFB cells, a set of timing arcs is identified that captures the behavior of the SSTFB cell. The causality between the timing arcs is formalized in a timed marked graph. As an example, Fig. 2 shows the marked graph model of a buffer cell in which the specific data rails have been abstracted. Notice that the dashed edges represent the behavior of the environment whereas the solid edges refer to the behavior of the cell.
The + symbol is used to indicate a rising transition, - for a falling transition, 0Z for a low to tri-state transition, 1Z for a high to tri-state transition, Z0 for a tri-state to low transition and Z1 for a tri-state to high transition. The behavior illustrated can be explained in two cases.

Fig. 2: Marked Graph model of a SSTFB buffer

Case 1: The right channel is free and a token arrives at the left channel. This new token is processed by lowering the state signal S. This corresponds to the timing arc L+ -> S-. As S goes low it creates a token on the right channel R and causes the SCD to assert the signal A. These are the timing arcs S- -> R+ and S- -> A+. The reset block is activated as A goes high and removes the token from the left channel L (A+ -> L-). Simultaneously, the domino logic precharges, causing S+, which tri-states the right channel, i.e., S+ is equivalent to R 1Z. The SCD now de-asserts A, giving the timing arc S+ -> A-, which tri-states the left channel, i.e., A- is equivalent to L 0Z. The right channel R is eventually reset by the right environment, enabling the processing of a new token and completing a cycle.

Case 2: The right channel R is busy and a token arrives at the left channel L. In this case the token at L has to wait for R to be free. The triggering event in this case is R being reset by the right environment. This introduces another timing arc, R- -> S-, in addition to the timing arcs described in Case 1.

Asynchronous ASIC Design Flow

For each SSTFB cell needed we created the following library views: a functional view containing the behavioral description of the cell in Verilog HDL, a schematic view containing the transistor-level implementation of the cell, a layout view containing detailed GDSII data, an abstract view in LEF format to support placement and routing, and finally its symbol. Using this library, a largely conventional standard-cell ASIC back-end design flow using conventional place-and-route tools can be used to create the layout, as illustrated in Fig. 3. We use Hspice to perform analog transistor-level simulations to both verify correctness and measure performance.

Fig. 3: Asynchronous ASIC Design Flow

Characterization Flow

The industry-standard Liberty format supports several delay models to characterize delay. We chose the non-linear delay model because it provides a reasonable tradeoff between accuracy and complexity. This delay model uses lookup tables indexed by input slew and/or load capacitance. The selection of input slew and load capacitance indices and the creation of real-world input waveforms directly impact the accuracy of the characterization. In addition, it is necessary to measure the correct supply currents to accurately characterize internal power. Unlike synchronous standard cells, for which commercial library characterization tools are available, the characterization of SSTFB cells has to be implemented from scratch and semi-automated.

1. Input Slew and Load Capacitance

Delay behaves non-linearly and non-monotonically with input slew. The design usage space should be bounded by carefully selecting the minimum and maximum input slew and load capacitance values to minimize delay calculation error due to interpolation and extrapolation. The output load index must be based on the cell drive strength. The tables should have enough points for both input slew and output load index selections so as to cover non-linear or non-monotonic regions.

In the proposed flow, the minimum load capacitance was zero and the maximum was calculated such that the cell operated within pre-determined voltage swing levels. The input slew values were computed for each cell in the library based on the selected load capacitance values. The load capacitance on internal pins is fixed. Consequently, timing arcs from input pins to the S and A pins need only be modeled as a 1D table (1x6) based on the input slew. However, arcs from the state pins S to the output pins R are modeled as a 2D table (6x6) based on both the slew on S and the output load.

2. Input Waveforms

The output load model can be simplified by assuming a lumped capacitance. For the input driver, however, the traditional use of a ramped linear waveform is not desirable as it can by itself contribute 5-10% delay error. Commercial library characterization tools use one of two approaches to generate real-world input waveforms: the pre-driver method or a pre-driver generated real non-linear waveform. A buffer is often recommended for use as the pre-driver cell, as shown in Fig. 4(a).

Fig. 4(a): Test Setup for Synchronous Circuits
Fig. 4(b): Test Setup for Asynchronous Circuits

For asynchronous circuits, left and right environments have to be set up that generate and consume input and output tokens to/from the circuit under test (CUT). Commonly, these environments are modeled by cells called bitgen and bucket respectively, shown in Fig. 4(b). Thus we implicitly take care of the input waveform generation with the bitgen. The input slew is controlled by an adjustable capacitor C_S and the output load is controlled by the capacitor C_L.
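To make the lookup-table delay model of the Characterization Flow concrete, the following minimal sketch shows how a tool might evaluate a 2D delay table indexed by input slew and output load using bilinear interpolation. It is an illustration under stated assumptions, not the delay calculator of any commercial tool: the function names `nldm_lookup` and `_bracket` are hypothetical, and the 3x3 table values are made up for the example (the flow above uses 6x6 tables).

```python
from bisect import bisect_right


def _bracket(axis, x):
    """Return the indices (i, i+1) of the axis segment used for x.

    Values outside the axis range reuse the first or last segment,
    which amounts to linear extrapolation.
    """
    i = bisect_right(axis, x) - 1
    i = max(0, min(i, len(axis) - 2))
    return i, i + 1


def nldm_lookup(slews, loads, table, slew, load):
    """Bilinear interpolation in a table with rows = slew, columns = load."""
    i0, i1 = _bracket(slews, slew)
    j0, j1 = _bracket(loads, load)
    ts = (slew - slews[i0]) / (slews[i1] - slews[i0])
    tl = (load - loads[j0]) / (loads[j1] - loads[j0])
    # Interpolate along the load axis on both bracketing slew rows,
    # then interpolate between those two rows along the slew axis.
    low_row = table[i0][j0] * (1 - tl) + table[i0][j1] * tl
    high_row = table[i1][j0] * (1 - tl) + table[i1][j1] * tl
    return low_row * (1 - ts) + high_row * ts


if __name__ == "__main__":
    # Illustrative (made-up) indices and delays: slew in ns, load in fF, delay in ns.
    slews = [0.05, 0.10, 0.20]
    loads = [0.0, 10.0, 20.0]
    delays = [
        [0.10, 0.20, 0.30],
        [0.12, 0.22, 0.32],
        [0.16, 0.26, 0.36],
    ]
    print(nldm_lookup(slews, loads, delays, slew=0.075, load=5.0))
```

Because points outside the table fall back to extrapolation from the nearest segment, the sketch also shows why the minimum and maximum indices should bound the design usage space: accuracy degrades quickly once a query leaves the characterized range.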

3. Measuring Power

The main challenges for power characterization are partitioning the currents drawn through the supply amongst timing arcs for the dynamic component, and modeling short-circuit current and the effects of crosstalk. The Liberty format measures internal energy per timing arc, which includes short-circuit power. Power analysis tools convert this internal energy to internal power by dividing by the system cycle time. They also add short-circuit energy and switching energy, the latter calculated as the energy required for switching the total net capacitance on the nets. The dynamic internal energy for an arc can be calculated using the following equation:

$$E_{arc} = \frac{(I_{vdd/gnd} - I_{leakage}) \cdot V_{dd} \cdot T}{N} \qquad (1)$$

where I_{vdd/gnd} is the average current measured through specific voltage sources associated with the timing arc, I_{leakage} is the current measured when the circuit is idle, V_{dd} is the supply voltage, T is the total simulation trace time and N is the number of tokens processed in time T.

In particular, we added 0-volt voltage sources to Vdd segments in the extracted place-and-routed netlist in order to measure the currents responsible for charging internal cell nodes. We added 0-volt voltage sources to segments of Gnd to measure the short-circuit current associated with charging output nodes (e.g., the R0/R1 nets). In general, the measured currents associated with each token value can be partitioned among the associated timing arcs that must occur for each such token processed. For cells with a single input channel, however, we partitioned currents into one power arc for each output, accessed by an arbitrarily chosen single related pin. For cells with multiple input channels in which multiple power arcs existed for a given output, we accounted for the power of all arcs in each arc.
In this case, the power analysis tool chooses one such power arc depending on the timing of the related pins. This leads to a small amount of error because we are essentially assuming the slew on all input channels is identical.

4. Library Generation

Using the above concepts, the flow illustrated in Fig. 5 is used for complete timing and power characterization. Spice netlists of the cells were fed to Hspice along with auto-generated stimulus files. These stimulus files contain measure statements for delay, slew, and energy. Data is extracted from the output of Hspice, which is in the .mt# file format, and automatically converted to the Liberty format. For ease of characterization, we sometimes assumed symmetry to estimate delays and slews of one data rail using measured data from the other rail, introducing a small amount of error due to small differences in the layout between rails.
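The per-arc internal energy calculation described under Measuring Power, and the conversion from energy to power by dividing by the cycle time, can be sketched as follows. This is a minimal illustration, not the characterization scripts used in this work: the function names `arc_energy` and `internal_power` are hypothetical, and the numeric values (including the 1.8 V supply and the 0.5 ns cycle time) are assumed example inputs rather than measured data.

```python
def arc_energy(i_avg, i_leak, vdd, t_total, n_tokens):
    """Internal energy per token for one arc, in joules.

    i_avg    : average supply current measured for the arc (A)
    i_leak   : current measured when the circuit is idle (A)
    vdd      : supply voltage (V)
    t_total  : total simulation trace time (s)
    n_tokens : number of tokens processed during t_total
    """
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return (i_avg - i_leak) * vdd * t_total / n_tokens


def internal_power(e_arc, cycle_time):
    """Convert per-token internal energy (J) to power (W) for a given cycle time (s)."""
    if cycle_time <= 0:
        raise ValueError("cycle_time must be positive")
    return e_arc / cycle_time


if __name__ == "__main__":
    # Assumed example inputs, chosen only to exercise the formula.
    e = arc_energy(i_avg=250e-6, i_leak=1e-6, vdd=1.8, t_total=100e-9, n_tokens=100)
    print("energy per token (J):", e)
    print("internal power (W):", internal_power(e, cycle_time=0.5e-9))
```

Subtracting the idle current before scaling keeps leakage out of the dynamic energy figure, which matches the separate single-value leakage entry stored per cell.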

Fig. 5: Characterization Flow

Validation Results

To validate the characterization of each individual cell, we tested every cell in an environment as shown in Fig. 6.

Fig. 6: Cell Verification Environment

Cadence's SoC Encounter is used to perform delay calculation for each timing arc instance in the routed netlist using the library's Liberty description, recording the resulting arc delays in a standard delay format (.sdf) file. The .sdf file along with the Verilog netlist is simulated in Cadence's NC-Verilog simulator. This generates a timing-accurate Value Change Dump (.vcd) file that records the signal activity. The .vcd file is then fed back into SoC Encounter to perform simulation-based power analysis. This analysis produces an Instance Power Report listing the internal power and the switching power consumption of each of the circuit modules. The results of the simulation-based timing and power analysis are compared to golden Hspice simulations. Table 1 shows the results of this comparison for each of the individual cells.

Table 1: Comparison of results from Hspice and Encounter based flow

Cell     Timing (ns)                  Internal Power (mW)          Switching Power (mW)
         Hspice  Encounter  Error     Hspice  Encounter  Error     Hspice  Encounter  Error
Bitgen   0.335   0.350      6.0%      0.4560  0.4386     3.8%      0.2590  0.2430     6.1%
Bucket   0.335   0.350      6.0%      0.2017  0.1958     2.9%      0.0000  0.0000     0.0%
Buffer   0.381   0.394      3.4%      0.5710  0.5643     1.2%      0.2005  0.2100     4.5%
Fork     0.528   0.556      5.0%      0.8596  0.8196     4.6%      0.4345  0.4490     3.2%
Merge    0.572   0.604      4.3%      0.6375  0.6080     4.5%      0.1422  0.1398     1.8%
OR       0.457   0.440      3.7%      0.5692  0.5627     1.1%      0.1788  0.1776     0.7%
XOR      0.472   0.463      1.9%      0.6171  0.6370     3.1%      0.1928  0.2020     4.0%

The above comparison is summarized in the form of bar graphs in Fig. 7(a), 7(b) and 7(c).

Fig. 7(a): Cycle Time comparison from Hspice and Encounter based flow (bar chart of cycle time in ns per cell)
Fig. 7(b): Internal Power comparison from Hspice and Encounter based flow (bar chart of internal power in mW per cell)

Fig. 7(c): Switching Power Comparison from Hspice and Encounter based flow (bar chart of switching power in mW per cell)

To further validate the quality of our characterization in a larger environment, we used several small fork-join non-linear pipelines with a general structure as shown in Fig. 8.

Fig. 8: Structure of the fork-join non-linear pipeline

Table 2 shows the performance and power dissipation as measured by Hspice and the Encounter-based flow using our prototype SSTFB library for the different fork-join pipelines.

Table 2: Performance and Power comparison from Hspice and Encounter based flow

No. of Buffers in    Performance (GHz)            Power (mW)
Long - Short Path    Hspice  Encounter  Error     Hspice  Encounter  Error
8-1                  0.913   0.864      5.3%      4.500   4.270      5.1%
8-2                  1.262   1.198      5.0%      6.418   6.027      6.0%
8-3                  1.515   1.406      7.1%      8.158   7.741      5.1%
8-4                  1.686   1.631      3.2%      9.710   9.375      3.4%
8-5                  1.672   1.636      2.1%      10.15   10.04      1.1%
8-6                  1.661   1.626      2.1%      11.04   10.54      4.7%
8-7                  1.642   1.545      5.9%      11.08   10.51      5.1%
8-8                  1.666   1.554      6.7%      11.69   10.88      6.9%

The above comparison is summarized in the plots in Fig. 9(a) and 9(b).

Fig. 9(a): Performance Comparison for the Fork-join pipelines (performance in GHz vs. no. of buffers in the short path, Hspice and Encounter)
Fig. 9(b): Power Comparison for the Fork-join pipelines (power in mW vs. no. of buffers in the short path, Hspice and Encounter)

The performance plot shows an interesting, counter-intuitive result: maximum throughput is achieved when the short and long paths of the fork-join structure are somewhat unbalanced, i.e., the short path has 4 buffers while the long path has 8 buffers. This can be attributed to the free slack associated with the buffers, which are faster than the FORK and JOIN cells. More balanced fork-join pipelines are slightly slower due to increased wire delay and consume more energy.

The experimental results show a maximum error of 7.1% between the Encounter-estimated and Hspice golden numbers. We believe much of this error can be attributed to the limited slew propagation during SDF generation due to the loops and bi-directional pins in the SSTFB .lib model.

Conclusion

A fully characterized asynchronous library not only supports back-annotated simulation-based power and timing analysis; it also enables timing-driven place and route, performance- and power-driven synthesis, and ECO flows. Moreover, characterized asynchronous libraries are a necessary precursor to extending STA-based timing sign-off to these designs. This work demonstrates the issues, feasibility, and potential accuracy associated with characterizing static STFB circuits. This is quite promising because the SSTFB circuits have among the most complex timing relationships of the many different proposed asynchronous design styles and have promising characteristics for application in low-power, high-performance SoC interconnects.

References

[1] P. Golani, G. D. Dimou, M. Prakash, P. A. Beerel. Design of a High-Speed Asynchronous Turbo Decoder, ASYNC 2007, March 2007.
[2] M. Ferretti and P. A. Beerel. High Performance Asynchronous Design Using Single-Track Full-Buffer Standard Cells, IEEE Journal of Solid-State Circuits, Vol. 41, No. 6, pp. 1444-1454, June 2006.
[3] M. Ferretti and P. A. Beerel. Single-Track Asynchronous Pipeline Templates using 1-of-n Encoding, DATE'02, March 2002.
[4] P. Golani and P. A. Beerel. High Speed Noise Robust Asynchronous Circuits, Proc. of ISVLSI, March 2006.
[5] P. Golani and P. A. Beerel. Back-Annotation in High-Speed Asynchronous Design, Journal of Low Power Electronics 2, 37-44 (2006).