High-performance, Low-power, and Leakage-tolerance Challenges for Sub-70nm Microprocessor Circuits

ESSCIRC 2002

Ram K. Krishnamurthy, Atila Alvandpour, Sanu Mathew, Mark Anders, Vivek De, Shekhar Borkar
Microprocessor Research, Intel Labs, Intel Corporation, Hillsboro, OR 97124, USA

Abstract

CMOS technology scaling is becoming difficult beyond the 70nm node, raising new design challenges for high-performance and low-power microprocessors. This paper discusses some of the key paradigm shifts required. Circuit techniques are described that address (i) increasing switching and leakage power dissipation, (ii) poor leakage tolerance of large-signal cache arrays and register files, (iii) the worsening global on-chip interconnect scaling trend, and (iv) the need for high-performance robust datapath circuits, enabling up to 10GHz ALU and instruction scheduler loops in 130nm dual-Vt CMOS technology.

I. Introduction

Performance demand of future-generation microprocessors continues to grow, even as traditional CMOS technology scaling beyond the 70nm node becomes increasingly difficult. This trend has motivated the paradigm shifts necessary to achieve high-performance goals while factoring in the power envelope, interconnect scaling, and leakage-induced limitations. Three key barriers - power consumption, leakage tolerance, and interconnect delays - are addressed in this paper.

Fig. 1 shows the CPU power scaling trend: power has been increasing at nearly 2.7x every two years. This can be attributed to the following: although capacitance has scaled by 30% per process generation, the number of electrical switching nodes per unit area has doubled, die size has grown by 14%, frequency has doubled, and supply voltage has scaled by only 15% per generation. Further, active subthreshold (source-drain) and gate leakage currents have increased by 3-5X per generation due to threshold voltage and gate-oxide scaling, contributing a significant portion of total CPU power in active mode (Fig. 2).
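The per-generation factors listed above combine multiplicatively under the usual switching-power relation, P ~ (number of nodes) x C x Vcc^2 x f. The following is a minimal Python sketch of that bookkeeping. The function name and the example inputs are hypothetical illustrations, not values or code from the paper, and leakage power is deliberately left out.

```python
def switching_power_ratio(cap_per_node, node_density, die_area, frequency, vcc):
    """Generation-over-generation switching power ratio.

    Each argument is a scale factor relative to the previous generation
    (1.0 means unchanged). Assumes P ~ nodes * C * Vcc^2 * f, with the
    node count growing as node_density * die_area. Leakage is ignored.
    """
    return cap_per_node * node_density * die_area * frequency * vcc ** 2


# Hypothetical inputs, chosen only to show how the terms interact:
# halving the capacitance per node exactly cancels a doubled node density.
ratio_cancel = switching_power_ratio(0.5, 2.0, 1.0, 1.0, 1.0)

# A 10% supply reduction alone cuts switching power by 19% (0.9^2 = 0.81).
ratio_vcc = switching_power_ratio(1.0, 1.0, 1.0, 1.0, 0.9)
```

Because supply voltage is the only term that enters quadratically, a modest supply reduction offsets a comparatively large growth in the linear terms, which is the motivation for the supply scaling techniques discussed later.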
Based on current scaling trends, total power is rapidly approaching the power wall imposed by power delivery and by the thermal limitations of today's practical, cost-effective cooling solutions. The increasing subthreshold and gate leakage currents with process scaling also degrade the noise tolerance of dynamic circuits, especially the wide-OR structures commonly employed in register file and cache array bitlines. Fig. 3 shows the bitline robustness (DC noise margin as a fraction of supply voltage) scaling trend, indicating a rapid decline in sub-nm technologies that renders conventional bitlines non-functional.

Fig. 4 shows the wire-delay scaling trend for 0.25µm to 5nm technologies [1]. The 3%+ increase in wire delay per unit length has been compensated by reducing the repeater-insertion distance for global busses. With process scaling, this distance continues to drop, leading to an exponential increase in the number of on-die repeaters and the associated driver/repeater peak currents.

These challenges, coupled with the continued demand for aggressive integer execution performance across server, desktop, and portable processor platforms, motivate robust solutions for high-speed datapath architectures and circuits. Single-ended dynamic ALU and instruction scheduler execution cluster loops enabling up to 10GHz operation in 130nm dual-Vt CMOS technology are described.

II. Switching Power Reduction Techniques

Two of the most common techniques for switching power reduction are (i) lower supply voltage operation, and (ii) gating the unit/cluster-level clocks. Although dynamic power is a quadratic function of supply voltage, CPU operating supply has scaled by only 15% per generation (not the 30% dictated by constant-field scaling predictions) in order to sustain high transistor performance, i.e., the Vcc/Vt ratio. To achieve low-power benefits without compromising performance, two ways of lowering supply voltage can be employed: static and dynamic supply scaling.

Fig. 5 shows the static supply scaling approach. Two supply voltages are employed: a regular, high supply for performance-critical functional units or clusters, e.g., integer and floating-point execution, and a secondary, lower supply for the non-critical units or clusters. Interfacing between the two voltage domains requires level converter circuits that are free of static power [2]. This approach requires an additional power grid and the associated cost of decoupling capacitors. The secondary voltage may be generated off-chip [3] or regulated on-die from the core supply [4].

Dynamic supply scaling avoids the cost of using two supply voltages by adapting a single supply voltage to performance demand. The highest supply voltage delivers the highest performance at the fastest designed frequency of operation. When performance demand is low, supply voltage and clock frequency are lowered, delivering reduced performance but with substantial power reduction [5]. This technique is particularly valuable in conserving battery life on mobile platforms.

Clock gating is another valuable switching power reduction technique, exploiting the diverse switching activities of various CPU functional units. When a unit is inactive for several clock cycles, unnecessary clock power is avoided by shutting off the clock to that unit until it is reactivated. Since clock power is a substantial component of total CPU power, this technique potentially offers large power savings. For this technique to be effective, the power consumed by the logic that detects idle periods and generates the clock gating signals must be much lower than the savings achieved by clock gating [6].

III. Leakage Power Reduction Techniques

While the quadratic effect of voltage scaling on dynamic power is well known, the dependence of leakage power on supply voltage has not been exploited. Fig. 6 shows the impact of supply scaling on subthreshold leakage (V^3) and gate leakage (V^4) in a 1.2V, 0.13µm technology [7], indicating an even stronger benefit than for dynamic power. In addition, two common leakage power reduction techniques are dual-Vt usage [8] and sleep transistors [9], [10].

Dual-Vt technology offers two flavors of transistors: a high threshold voltage (high-Vt) transistor and a relatively lower threshold voltage (low-Vt) transistor. Fig. 7 shows high-Vt and low-Vt transistor leakage measurements in the 1.2V, 0.13µm dual-Vt CMOS technology. As the names imply, the high-Vt transistor is slower, but typically has 1X lower active leakage. An all-high-Vt implementation would achieve the lowest leakage power but the slowest performance, whereas an all-low-Vt implementation achieves the converse. Effective dual-Vt implementation involves selective insertion of high-Vt and low-Vt transistors such that the best performance (low-Vt on critical paths) is achieved with the lowest possible leakage (high-Vt on non-critical paths) [11].

Sleep transistors, or supply gating (Fig. 8), is a technique similar to clock gating that selectively shuts off the power supply to functional units during standby mode to save leakage power.
The virtual Vcc and Gnd rails are connected to the regular Vcc and Gnd rails during normal active-mode operation. During standby mode, the complementary Sleep signals are activated, disconnecting the virtual rails from their regular supplies and thus preventing a leakage current path between Vcc and Gnd. 0.13µm studies on a 32-bit static CMOS adder show 145X standby leakage reduction for a 5.1% sleep transistor size and 6mV virtual supply bounce [12]. Sleep transistors are typically large to minimize the stack performance penalty, and turning them on and off therefore consumes dynamic power. The leakage power savings should be carefully weighed against the dynamic power penalty by evaluating the number of idle/standby cycles for which leakage power is saved.

IV. Leakage-tolerant Techniques

High fan-in, compact dynamic gates are often employed in performance-critical units of microprocessors, e.g., in register files and large-signal L caches. However, the use of wide dynamic gates is strongly impacted by leakage currents in sub-0.13µm devices. In such cases, the keepers must compensate for large leakage currents without significant impact on the performance of the gates. Fig. 9(a) shows an example of an M-bit wide dynamic gate with the standard keeper PK. The keeper is unconditionally ON at the onset of the evaluation phase. Large keepers cannot be used to compensate for leakage, since their contention severely degrades performance. Further, as the size of register files and L caches continues to grow with technology scaling, the increasing number of bitcells per bitline aggravates the leakage tolerance problem.

Two techniques to combat this problem are discussed. The first is a conditional keeper domino technique [13], where a large fraction of the keeper is turned ON only if the dynamic output remains High in the evaluation phase. Thus, strong keepers can be utilized with leaky gates without significant impact on the performance of the gates.
The second is a pseudo-static leakage-tolerant bitline technique [14] that enables -Vcc gate-source underdrive on the bitline NMOS pulldown transistors without gate-oxide overstress and without routing additional bias voltages or control signals.

Fig. 9(b) shows the conditional keeper technique (CKP). It employs two keepers: a fixed keeper PK1 and a conditional keeper PK2. At the onset of the evaluation phase (clock Low-to-High), PK1 is the only active keeper. After a delay time T_keeper = T_delay_element + T_NAND, the keeper PK2 is activated only if the dynamic output is still High. Given the worst-case time for a potential output High-to-Low transition, the highest performance is achieved when PK2 is activated close to or later than the worst-case clock-to-output transition, T_MAX. The fixed keeper PK1 ensures sufficient robustness during T_keeper, which can be a small fraction of the clock phase. Compared to the standard keeper (PK), the conditional keeper technique can achieve higher robustness at comparable performance, where W(PK1) ~ W(PK) and the additional PK2 is activated conditionally with negligible impact on performance (W is the width of the devices at fixed length). To achieve higher performance at comparable robustness, the keepers can be sized such that W(PK) = W(PK1) + W(PK2), where W(PK1) << W(PK) and W(PK2) < W(PK). Fig. 10 shows the delay of 8, 16, and 32-bit wide bitlines with the conditional keeper, normalized to the delay of the gates with the standard keeper. At T_keeper ~ T_MAX, up to 35% lower delay is observed for 16- and 32-bit bitlines.

The pseudo-static bitline technique is geared toward the local bitlines (LBL) of register files and large-signal L caches. Fig. 11 shows a conventional LBL, where each LBL supports single-ended read on 16 cells with a two-way merge via a static NAND. Data from the storage cell is read by two access transistors per word (M1 and M2), forming a dynamic 16-way OR.
LBL dynamic ORs are susceptible to noise due to high active leakage during evaluate, when the precharged domino node should stay high. Using low-Vt for the domino pulldown NMOS transistors (M1 and M2) of the LBL does not meet the minimum noise margin floor. Fig. 12 shows the pseudo-static leakage-tolerant LBL circuit, with the read-select and bit-cell data inputs swapped. Static PMOS sustainers Px precharge the stack nodes Vs to Vcc every cycle. A static 2-input NOR pre-conditions the data input to Gnd, achieving a Vgs = -Vcc reverse bias on M1. This reduces active leakage by 7X even with low-Vt transistors on M1 and M2, as shown in measurements from the 1.2V, 0.13µm technology (Fig. 13). The performance penalty compared to the dual-Vt bitline scheme, due to higher input capacitance and the slow static NOR pullup, is offset by (i) low-Vt usage throughout the read path and (ii) a 5% downsized keeper transistor Pk that reduces contention during evaluate. A 6GHz, 0.13µm, 256x32b register file based on this technique is described in [14]; it achieves % higher performance than the dual-Vt LBL scheme with a 36% DC noise robustness improvement (Fig. 14).

V. On-chip Interconnect Design Techniques

Source-follower NMOS pull-up bus drivers have been proposed for high-speed, low-power on-chip static busses [15]. The higher gain of the NMOS improves pull-up performance, and the reduced output swing (Vcc-Vt) helps lower bus switching power with a simultaneous reduction in pull-down delay. However, this solution is impractical due to the floating driver output at Vcc-Vt, which prevents recovery from low-to-high coupling noise. Recovery from high-to-low coupling noise is also problematic due to the weak holding impedance of NMOS pull-up drivers. An effective alternative is a BiCMOS-style PMOS-boosted source follower (PSF) bus driver scheme that overcomes the floating output problem, enabling robust high-performance on-chip busses [16]. Fig. 15(a) shows the PSF driver scheme. The early strike of the NMOS begins a fast pull-up, and the PMOS booster follows through to complete the full-rail transition (Fig. 15(b)).
This maintains the speed advantage of the NMOS pull-up and the noise immunity benefit of a PMOS pull-up. Since the output is full-swing, any switching power benefit is lost. However, the performance advantage of the NMOS pull-up can be traded for energy and peak-current reduction with careful consideration of the region of applicability. Fig. 16 shows the 12-bit L1 cache to FPU write-back bus topology implemented on a production MHz 64-bit processor fabricated in .1µm CMOS technology [17]. The conventional low-skewed driver and repeaters are replaced with a PSF driver and repeaters optimized for the same input capacitance and faster pull-up operation. The bus driver/repeater performance is improved by -1% for a 2.5% increase in switching energy due to the short-circuit path in the PSF during the early transition. The performance benefit is sustained across a wide range of transistor sizes (Fig. 17).

VI. Robust High-speed Circuit Design Techniques

Out-of-order execution engines of superscalar processors require (i) wide instruction schedulers capable of scheduling back-to-back instructions into multiple ALUs in the execution core, and (ii) fast ALUs capable of executing these instructions with single-cycle latency and throughput. A high-speed ALU and instruction scheduler loop is therefore essential to maximize processor performance. A 6.5GHz 32-bit ALU and an 8-entry x 2-ALU instruction scheduler loop, implemented as part of the integer execution core and fabricated in 130nm dual-Vt CMOS technology [18] (Fig. 18), is described. High-speed single-ended dynamic circuit techniques enable the evaluation of complex (up to 2x9-way OR) logic operations while simultaneously achieving (i) high noise robustness, (ii) low active leakage power dissipation, (iii) maximum low-Vt usage, (iv) a simplified 2Φ 50% duty-cycle timing scheme with seamless scheduler/ALU interface time-borrowing, and (v) scalable performance up to 10GHz, measured at 1.7V, 25C.
The instruction scheduler is capable of scheduling dependent instructions to two 32-bit ALUs, choosing one of eight ready instructions to execute in each ALU per cycle (Fig. 19). Dependencies for the 16 instructions currently in the pool, D<15:0>, are evaluated and stored in a 1-bit x 24-entry dependency matrix during the previous cycle. The ready logic resolves dependencies between the 16 instructions in the pool and two external dependency signals (E<1:0>), essentially requiring an 18-way AND operation. An 8-way OR priority encoder then chooses from among the ready instructions using dynamically controlled priorities (P<6:0>) and drives a 14µm loopback bus into the ready logic and the shared ALU tri-state bus. Both true and complementary inputs are required for the ready logic and the priority encoder, requiring tall/wide AND/OR logic paths in a conventional differential domino implementation. Critical path performance is limited by tall NAND stacks, forcing an 8-stage implementation. Fig. 20(a)-(b) shows the ready logic and priority encoder implementation based on a single-ended to differential domino-compatible complementary signal generator (CSG); it eliminates the wide AND paths and realizes the complete critical path with single-ended 2x9-way and 8-way dynamic OR circuits, respectively. Dual-Vt optimization is conducted for high performance and to meet target noise margin constraints. High-Vt is used on the 9- and 8-way domino-OR NMOS pulldown transistors, and low-Vt is used for all other transistors. The complete scheduler path requires only 6 gate stages, improving critical path performance by 23% over the corresponding dual-rail implementation. Further, the single-ended design achieves 67% layout area reduction and 25% loopback interconnect length reduction by eliminating 5% of the scheduler logic transistors, enabling a dense layout occupying 21µm x 21µm (Fig. 24). Total active leakage power dissipation is 5% lower than that of the differential domino design.

The 32-bit ALU consists of a 5:1 source multiplexor, a single-ended 32-bit dynamic adder core, and an 4µm differential ALU loopback bus (Fig. 21). The source multiplexor selects single-rail ALU operands from the true and complementary outputs of the ALU loopback bus, 32-bit register file entries, and external debug inputs. The true and complementary sum outputs of the adder are driven onto the ALU loopback bus via a tri-stated bus driver. This organization enables single-cycle execution of add, subtract, and accumulate instructions.

The adder employs a radix-2 Han-Carlson architecture with the carry-merge operation performed in both the dynamic and static stages of the domino gates. This results in a worst-case evaluation path of 3N-2P-2N-2P-2N-2P stacks, with initial P-G generation occurring in the first stage, followed by 5 stages of carry-merge logic. This implementation enables a 4-way carry-merge operation to be effected in two logic stages. The worst-case domino NMOS pulldown is only 2-wide, allowing usage of performance-setting low-Vt transistors throughout the core while meeting noise immunity and active leakage power constraints. The Han-Carlson carry-merge tree skips the alternate odd carries (C1, C3, ..., C31) and generates the 16 even carries (C0, C2, ..., C30) in 5 stages. An extra carry-merge logic stage is required to generate the missing odd carries at the end of the carry-merge tree. This logic is folded into a CSG along with the output sum XORs to produce the dual-rail (true and complementary) sum outputs for the odd bits in a single gate stage, achieving a 1% delay reduction over the reference design (Fig. 22(a)). The single-ended even carries also feed into a CSG with the output sum XORs folded in to produce the dual-rail sum outputs for the even bits (Fig. 22(b)).
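The carry-merge organization described above can be checked with a bit-level model. The Python sketch below is a logic-only illustration of a radix-2 Han-Carlson adder: it does not model the domino stages, the CSG, or the dual-rail outputs. The function name is hypothetical, and the carry indexing (carry[i] is the carry into bit i, with carry[0] as a zero carry-in) is an assumption made to line up with the even/odd carry description in the text.

```python
def han_carlson_add(a: int, b: int, width: int = 32) -> int:
    """Add two unsigned integers modulo 2**width using a Han-Carlson prefix tree."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    # Bitwise generate and propagate signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]

    # Group generate/propagate, held only at odd bit positions.
    G = g[:]
    P = p[:]

    # First carry-merge stage: each odd bit absorbs its even neighbour.
    for i in range(1, width, 2):
        G[i] = g[i] | (p[i] & g[i - 1])
        P[i] = p[i] & p[i - 1]

    # Remaining stages: Kogge-Stone style merging on odd positions only
    # (distances 2, 4, 8, 16 for width 32, giving 5 stages in total).
    d = 2
    while d < width:
        new_G, new_P = G[:], P[:]
        for i in range(1, width, 2):
            if i - d >= 0:
                new_G[i] = G[i] | (P[i] & G[i - d])
                new_P[i] = P[i] & P[i - d]
        G, P = new_G, new_P
        d *= 2

    # Even carries come straight from the tree; carry[0] is the zero carry-in.
    carry = [0] * width
    for i in range(2, width, 2):
        carry[i] = G[i - 1]

    # Extra stage: the skipped odd carries need one more carry-merge each.
    for i in range(1, width, 2):
        carry[i] = g[i - 1] | (p[i - 1] & carry[i - 1])

    # Sum bit i is propagate XOR carry-in.
    result = 0
    for i in range(width):
        result |= (p[i] ^ carry[i]) << i
    return result
```

For example, `han_carlson_add(7, 1, width=4)` follows the tree for the even carries and the extra stage for the odd ones, and returns 8. In the model the odd-carry stage and the sum XOR are separate loops, whereas the text describes both being folded into a single CSG gate stage.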
The Han-Carlson architecture with CSG usage enabled a single-rail ALU implementation with 5% fewer carry-merge gates and a 4% reduction in active leakage energy compared to a differential domino Kogge-Stone implementation. Furthermore, only alternate bits are propagated between consecutive carry-merge stages, resulting in a 5% reduction in inter-stage interconnect routing complexity. This allowed a compact layout occupying 336µm x 4µm (Fig. 24), with a worst-case inter-stage wire length of 16µm, contributing to further speed improvement. At 6.5GHz (1.1V, 25C) operation, the measured ALU and scheduler loop power is 12mW and the active leakage power is 15mW. The advantages of the single-ended scheduler and ALU over dual-rail schemes are summarized in Table 1. Fig. 23 shows the maximum frequency (Fmax), switching power, and active leakage power vs. supply voltage measurements.

The ALU and instruction scheduler loop operates on a 50% duty-cycle 2Φ domino timing scheme, resulting in reduced circuit design and validation complexity. The Φ2 clock is locally generated by inverting the incoming Φ1 clock and triggers the CSG stages. Inputs to the CSG are set up before the rising edge of the Φ2 clock to minimize noise on the non-switching output. Peak output noise is limited to 3mV for up to 3ps of Φ2 clock skew/jitter variation, meeting output noise constraints. The scheduler's ready logic CSG clock (Φ1d) is a delayed version of the Φ1 clock, produced by an on-die programmable switched delay cell to enable clock stretching for slow-frequency debug.

Conclusions

New paradigm shifts necessary to achieve the high-performance and low-power goals of sub-70nm microprocessors are examined. Three key barriers - power consumption, leakage tolerance, and interconnect delays - are addressed. Static and dynamic supply scaling, clock gating, dual threshold voltage technology, and sleep transistor techniques, along with their tradeoffs, are discussed for switching and leakage power reduction.
Conditional keeper domino and pseudo-static techniques for improved dynamic bitline leakage tolerance are described. A PMOS-boosted source follower driver scheme for high-speed/low-power on-chip global busses, and implementation results on a production 64-bit processor, are studied. Robust ALU and instruction scheduler designs using single-ended dynamic circuits to enable up to 10GHz single-cycle operation in 130nm dual-Vt CMOS are reviewed.

Acknowledgement

The authors thank S. Hsu, D. Somasekhar, S. Narendra, A. Keshavarzi, Y. Ye, K. Soumyanath, W. Pinfold, C. Webb, P. Madland for discussions; B. Bloechel for measurements; and R. Hofsheier and J. Rattner for encouragement and support.

References

[1] J. Davis et al., Proc. IEEE, March 21, pp
[2] Y. Kanno et al., 2 VLSI Circuits Symp. Digest, pp
[3] T. Fuse et al., 21 VLSI Circuits Symp. Digest, pp
[4] L. R. Carley et al., Proc. ISLPED 1999, pp
[5] M. Takahashi et al., IEEE JSSC, Vol. 33, Nov. 199, pp
[6] M. Takahashi et al., IEEE JSSC, Vol. 35, Nov. 2, pp
[7] S. Tyagi et al., 2 IEDM Tech. Digest, pp
[8] S. Thompson, I. Young, 1997 VLSI Tech. Symp. Digest, pp
[9] S. Shigematsu et al., IEEE JSSC, Vol. 32, June 1997, pp
[10] T. Inukai et al., Proc. CICC 2, pp
[11] P. Pant et al., IEEE Trans. VLSI Systems, April 21, pp
[12] S. Borkar et al., Proc. 21 ISPD, pp
[13] A. Alvandpour et al., 21 VLSI Circuits Symp. Digest, pp
[14] R. Krishnamurthy et al., 21 VLSI Circuits Symp. Digest, pp
[15] H. Zhang et al., Proc. ISLPED 199, pp
[16] R. Krishnamurthy et al., 21 VLSI Circuits Symp., pp
[17] G. Singer et al., 2 ISSCC Digest, pp
[18] M. Anders et al., 22 ISSCC Digest, pp

Fig. 1. CPU power scaling trend
Fig. 2. Leakage power fraction
Fig. 3. Bitline robustness scaling trend
Fig. 4. Interconnect delay scaling trend
Fig. 5. Static 2-supply scheme
Fig. 6. Leakage vs. Vcc scaling trend
Fig. 7. 0.13µm leakage measurements
Fig. 8. Sleep transistor
Fig. 9(a). M-bit dynamic OR
Fig. 9(b). Conditional keeper technique
Fig. 10. Conditional keeper benefit
Fig. 11. Conventional dynamic LBL
Fig. 12. Pseudo-static LBL
Fig. 13. Pseudo-static leakage measurements
Fig. 14. 256x32 register file layout
Fig. 15. (a) PSF driver and (b) output response
Fig. 16. 64-bit processor floorplan
Fig. 17. PSF driver comparisons
Fig. 18. 32-bit integer ALU and instruction scheduler loop
Fig. 19. Instruction scheduler organization
Fig. 20(a). Scheduler ready logic CSG implementation
Fig. 20(b). Scheduler priority encoder CSG implementation
Fig. 21. 32-bit integer ALU core
Fig. 22(a). Han-Carlson odd-bit CSG circuit
Fig. 22(b). Han-Carlson even-bit CSG circuit
Fig. 23. 130nm Fmax, active power, and leakage power measurements
Fig. 24. 130nm testchip microphotograph and details

Table 1. CSG benefits summary

             Area    Performance (Delay)    Active Leakage    Robustness
ALU          5%      1%                     4%                equal
Scheduler    67%     23%                    5%                equal

