Computer-Aided Design for Low-Power Robust Computing in Nanoscale CMOS


INVITED PAPER

The problem with designs that minimize power consumption is that they tend to reduce circuit reliability; improved techniques that jointly optimize both power and reliability are needed.

By Dennis Sylvester, Senior Member IEEE, and Ashish Srivastava, Member IEEE

ABSTRACT This work argues that power consumption and design robustness are the foremost challenges to continued rapid improvement in CMOS integrated circuit (IC) performance. Furthermore, these two goals are often contradictory, which indicates that joint optimization approaches must be adopted to handle both properly. To highlight needs in computer-aided design (CAD), we review a sampling of state-of-the-art work in power reduction techniques and in the newly emerging area of statistical optimization applied to very large scale integration (VLSI) ICs. The lack of CAD techniques to perform multiobjective optimization (specifically, parametric yield under correlated performance metrics) is a major limitation of current CAD research. In addition, with design trends pushing towards architectures based on aggressive adaptivity and voltage scaling, CAD researchers and engineers will need to refocus efforts on enabling this type of complex design.

KEYWORDS CMOS; integrated circuits; low-power; parametric yield

Manuscript received April 11, 2006; revised August 23. The authors are with the University of Michigan, Ann Arbor, MI USA (dmcs@umich.edu; ansrivas@umich.edu).
Digital Object Identifier: /JPROC
/$25.00 © 2007 IEEE
Proceedings of the IEEE, Vol. 95, No. 3, March 2007

I. INTRODUCTION

Power consumption has repeatedly been the driving force behind changes in the preferred transistor technology for designing integrated circuits (ICs). Moves from bipolar to nMOS and then from nMOS to complementary MOS (CMOS) technologies occurred primarily due to the much lower power consumption (at the increasing integration levels prevalent at each transition) of the latter options. Key to these transitions was the technological maturity of MOS devices when power levels in bipolar became unacceptable. The semiconductor industry now faces a similar crisis due to the rapidly growing power densities observed in CMOS-based ICs in the deep-submicrometer (0.35 µm to 0.13 µm) and nanometer-scale (90 nm and beyond) regimes. Power density trends exhibited by CMOS circuits in the past decade closely mirror those of bipolar-based systems in the mid to late 1980s and early 1990s [76]. Rising power budgets have been driven by a number of factors, including a move away from constant-field scaling [105] in recent technologies and exponential growth in transistor subthreshold and gate leakage currents, and hence in static power dissipation.

The primary difference between past technology transition points and today is that there is no mature low-power technology alternative to CMOS. While there is considerable ongoing work in post-CMOS devices,¹ including single electron transistors, spin transistors, carbon nanotube FETs, and ferromagnetic logic devices, none of these technologies will be ready to assume a role as the workhorse of the IC industry for at least 10 years, and more likely 20+ years. With fundamental limitations on key MOSFET parameters such as subthreshold swing, and with the need to reduce operating voltages and subsequently threshold voltages, the days when process engineers primarily drove improvements in IC performance and power appear to be near an end. The burden is shifting to circuit designers and computer-aided design (CAD) engineers to improve power (or energy) efficiency in nanometer-scale CMOS, thereby extending the lifespan of traditional CMOS.

¹ See the Emerging Research Devices chapter of the 2005 International Technology Roadmap for Semiconductors for an excellent treatment of these devices [95].

The need for improvement in designing low-power circuits can be demonstrated by industry economic factors such as the large expenses incurred to remove heat from an IC (e.g., highly nonlinear cost versus power dissipation curves [74]). From a broader perspective, while there is substantial debate about the exact percentage of total U.S. electricity used for information technology purposes [79], [80], it is clear that the trend is markedly upwards. This is driven in large part by the proliferation of server farms, which consume more watts per square meter than semiconductor fabrication facilities or automobile manufacturing plants.

In attempting to rein in growing power density levels, designers and CAD tool developers will also face another challenge, which represents a second major difference between past technological transitions and today: MOSFETs are approaching their fundamental physical limits in terms of how small they can be manufactured reliably. Even given recent research efforts demonstrating channel lengths as small as 3 nm [75], the end of the current International Technology Roadmap for Semiconductors (ITRS) is forecast as 2020 at the 14 nm node (6 nm physical gate length) [95]. There are also a large number of so-called "red bricks," or needs with no known solutions, for technology nodes beyond 45 nm. Part of this concern stems from the growing number of new materials being used, or proposed, in silicon manufacturing. Given the relative difficulty of integrating new materials into a process flow, one can reasonably expect difficulties (or delays) in delivering new technology nodes such as 45 nm and 32 nm. The design and CAD communities must therefore assume more of the responsibility for creating highly manufacturable (i.e., robust) designs given such stressed underlying devices.
To this point, we have argued that the two most pressing issues in the design of nanometer-scale CMOS ICs will be power and robustness. There is general agreement that each of these is a major challenge in isolation: a 2003 Design Automation Conference panel with representatives from three of the top five integrated device manufacturers along with three of the top four electronic design automation (EDA) firms quantitatively concluded that variability and power (particularly leakage) were the two issues requiring the most additional EDA investment. However, it is actually the intersection of these two problems that requires the most attention and effort from the CAD community. There is an inherent tradeoff between these two quality metrics, which can be illustrated by a number of examples.

Design automation techniques aimed at reducing power consumption invariably do so by attacking nodes with timing slack. Fig. 1 qualitatively depicts this process: an initial path delay histogram (preoptimization) shows some spread, indicating the presence of noncritical paths. After power optimization (techniques to accomplish this are discussed in detail in Section II), the distribution is effectively pushed to the right, towards the initial critical path delay.

Fig. 1. Formation of a timing wall due to deterministic power optimization.

These traditional approaches are blind to the impact of optimization decisions on the yield of the design, and invariably result in the formation of a timing wall [73], since the optimization has no incentive to reduce the delay of noncritical paths. Under process variability, every near-critical path can determine the circuit delay, and hence the design becomes more susceptible to these variations. If process tolerances are relaxed in nanometer CMOS technologies,² this effect is worsened.
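The yield impact of a timing wall can be illustrated with a small calculation. The sketch below is not from the paper: it assumes that path delays are independent Gaussian random variables with a common standard deviation (real paths share gates and are correlated), and all delay values, `SIGMA`, and `T_MAX` are hypothetical numbers chosen only for illustration.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def timing_yield(nominal_delays, sigma, t_max):
    """Probability that every path meets t_max.

    Assumption (not from the paper): path delays are independent
    Gaussians centered on their nominal values.
    """
    y = 1.0
    for mu in nominal_delays:
        y *= phi((t_max - mu) / sigma)
    return y


# Hypothetical nominal path delays, normalized to the critical path delay.
before = [1.00, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]  # slack is spread out
after = [1.00, 0.99, 0.99, 0.98, 0.98, 0.97, 0.97, 0.96]   # timing wall

SIGMA = 0.03  # assumed standard deviation of each path delay
T_MAX = 1.03  # assumed timing constraint

yield_before = timing_yield(before, SIGMA, T_MAX)
yield_after = timing_yield(after, SIGMA, T_MAX)
print(f"timing yield before: {yield_before:.3f}, after: {yield_after:.3f}")
```

Under these assumptions the nominal critical path delay is identical in both cases, yet the timing yield drops once the noncritical paths are pushed up against the wall, because each near-critical path contributes its own probability of failure.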
² Such tolerances may be relaxed either voluntarily, due to a need to make new nodes achievable in the expected 2 year timeframe, or involuntarily, due to fundamental lithographic or other processing limitations.

Single event upsets (SEUs), caused by strikes from energetic particles such as neutrons or alpha particles, have become a major concern in combinational logic due to several technology trends such as reduced nodal capacitances. Low-power design techniques tend to increase SEU likelihood since they rely on small gates and potentially lower supply voltages, both of which reduce the critical charge required to cause functional failure.

Lowered supply voltages, a common technique described in detail later in this paper, also naturally increase performance uncertainty, which is magnified at low supply-to-threshold voltage ratios.

There is concern among researchers today that extremely scaled CMOS technologies (e.g., below 32 nm) will exhibit dramatically higher device defect rates than are currently seen. This concern is captured in the often-cited need for "reliable design using unreliable components." One way to deal with unreliable components is redundancy (commonplace today in large memory arrays, where several bit cells are likely to fail due to the sheer size of the memory). However, redundancy approaches for logic, such as triple modular redundancy, are very expensive in both area and power.

The growing use of clock gating [107] to address dynamic power, and of aggressive sleep or standby modes to improve battery life, inevitably leads to higher levels of power supply noise, since levels of current draw are becoming more disparate during runtime.

These examples illustrate that low-power design is fundamentally antagonistic to robust design. A key parameter in quantifying this tradeoff is parametric yield given timing and power constraints. Fig. 2 shows the impact of process variation (or spread) on the timing and leakage power behavior of a generic IC. Delay and leakage are inversely correlated since they exhibit opposite sensitivities to effective channel length Leff (and to Vth through short-channel effects). Therefore, fast dies exhibit high leakage and vice versa; the net effect is the two-sided yield constraint seen in Fig. 2. With the recent rise in leakage power as a fraction of total power³ and the proliferation of mobile electronics requiring long battery life, yield loss due to violation of power constraints is a major concern today.⁴ In Section V, we will discuss very early efforts at parametric-yield-driven optimization; in general, this is an open area of CAD research requiring significant attention in the next several years.

The remainder of this paper is organized as follows. We begin by reviewing the state of the art in CAD techniques for low power, focusing on dynamic and static power components (Sections II and III) and finally on the total power minimization problem (Section IV).
Section V then turns to the newly emerging area of statistical analysis and optimization of ICs considering both timing and power. In Section VI, we examine future trends in circuit design and discuss the role of CAD in enabling these new directions.

Fig. 2. The inverse correlation of leakage power and delay leads to low parametric yield, and an interesting and challenging CAD problem.

³ Intel has projected that over half of the total power consumption in their microprocessors at the 65 nm node will be contributed by leakage currents [97].
⁴ The assumption here is that dynamic power is relatively constant with process variation, as Leff fluctuations impact total switched capacitance sublinearly (due to the presence of wire capacitance) whereas Leff affects leakage power exponentially.

II. DYNAMIC POWER MINIMIZATION

Until recently, dynamic power dissipation, or the power dissipated in switching the capacitance associated with devices and interconnects, has been the major component of power dissipation. The dynamic power dissipated in switching a load capacitance $C$, operating at a voltage $V_{dd}$ and frequency $f$, can be expressed as

$$P_{Dyn} = \alpha C V_{dd}^{2} f \tag{1}$$

where $\alpha$ is the probability of switching. For a gate-level design, the dynamic power can be obtained by simply summing the dynamic power expressed in (1) over all nodes in the circuit. However, the calculation is complicated by the fact that it is not straightforward to determine the switching activity and the capacitance associated with each node. The switching probability of a node depends on the sequence of input vectors applied to a design, and probabilistic techniques are generally employed for estimation. The concept of transition density was proposed in [1] and improves upon switching activity computed using state probabilities (which assume at most one transition in a single cycle).
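As a minimal sketch of the gate-level summation of (1), the following snippet adds up the per-node contributions for a given supply voltage and clock frequency. The `Node` type and all numeric values are hypothetical; in practice the switching probabilities and capacitances would come from activity estimation and extraction.

```python
from dataclasses import dataclass


@dataclass
class Node:
    alpha: float  # switching probability per clock cycle
    cap: float    # switched capacitance in farads


def dynamic_power(nodes, vdd, freq):
    """Sum alpha * C * Vdd^2 * f, as in (1), over all circuit nodes."""
    return sum(n.alpha * n.cap * vdd ** 2 * freq for n in nodes)


# Hypothetical three-node circuit.
nodes = [Node(0.10, 5e-15), Node(0.25, 2e-15), Node(0.50, 1e-15)]

p_high = dynamic_power(nodes, vdd=1.2, freq=1e9)
p_low = dynamic_power(nodes, vdd=0.6, freq=1e9)
print(f"P at 1.2 V: {p_high:.3e} W, P at 0.6 V: {p_low:.3e} W")
```

Halving the supply voltage at fixed frequency reduces the computed dynamic power by a factor of four, which is the quadratic dependence visible in (1).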
Probabilistic approaches, however, suffer from increased complexity when considering spatial correlations resulting from reconvergence and temporal correlations across different clock cycles. Moreover, real circuit nodes experience incomplete transitions (glitches), whose identification requires expensive timing simulation and depends on the delay characteristics of the gates, which change continuously during optimization. This component of power has been observed to be as large as 70% in circuits such as combinational adders [2] and is typically in the 10% to 15% range.

The capacitance term in (1) represents the total capacitance driven by the gate, including junction, gate, and wire components. With process scaling, interconnect capacitance has become a significant fraction of the overall node capacitance. Interconnect capacitance is often estimated based on previous design experience, before physical layout of the design is completed. Another source of error in gate-level power estimation is that the estimation process is unaware of the charging and discharging patterns of the capacitance associated with the internal nodes of a gate. Notwithstanding these inaccuracies, gate-level dynamic power optimization is a key step in current design flows since it is difficult to change gate sizes after completion of the physical design phase. In this section, we discuss some of the techniques that have been developed for minimizing dynamic power consumption.

A. Gate Sizing

A large amount of work has been done in the area of gate-sizing-based optimization techniques. Gate-sizing algorithms can be broadly divided into discrete and continuous domain algorithms. Discrete gate sizing, which chooses the gate sizes from an available set of gate implementations (representing the standard cell library), has been shown to be NP-complete [3]. Hence, most of the developed techniques are heuristic, or applicable only to special circuit graphs such as trees [4], for which dynamic programming techniques can be used efficiently. A good overview of heuristic gate-sizing approaches for general nonlinear nonconvex discrete problems is provided in [5]. We will discuss discrete heuristic algorithms in more detail when we consider multiple supply and threshold voltage assignment.

Continuous gate sizing, on the other hand, assumes the availability of all gate sizes in the library. Most of the early work in the area of continuous gate sizing was targeted at improving the performance of a design. However, many of these approaches can be easily tailored to perform power optimization under timing constraints.
Gate sizing was first shown to be a convex optimization problem in [6], under suitable delay models. A convex problem has the advantage that any local minimum is guaranteed to be a global minimum. However, writing the convex problem requires enumerating all paths in the circuit, and the number of paths can grow exponentially with circuit size. This problem was resolved in [7], which was the first to propose a polynomial time gate-sizing approach. Polynomial complexity was achieved by using a novel ellipsoidal interior-point algorithm for convex optimization. The objective function in the optimization problem was simply expressed as a weighted sum of the gate sizes (interconnect capacitance is unchanged by gate sizing), where the weights represent the switching probability of the different nodes in the circuit.

Alternatively, a node-based formulation allows the entire problem to be written with a much smaller number of constraints that grows linearly with circuit size, at the cost of introducing additional variables. The node-based formulation is mathematically expressed as

$$
\begin{aligned}
\text{Min:}\quad & \sum_{i=1}^{n} \alpha_i x_i \\
\text{s.t.:}\quad & a_j \le D_0 && \forall j \in \text{output} \\
& a_j + D_{ji}(x) \le a_i && \forall i = 1, \ldots, n \text{ and } \forall j \in \text{input}(i) \\
& 0 \le a_i && \forall i \in \text{input} \\
& L_i \le x_i \le U_i && \forall i = 1, \ldots, n
\end{aligned}
\tag{2}
$$

where the $x_i$ represent gate sizes and the $\alpha_i$ are the switching-probability weights. The timing constraints are expressed using implicit optimization variables $a_i$ that represent the arrival time at each node. $D_0$ is the timing constraint imposed on the circuit, $D_{ji}$ represents the delay of the timing arc going from node $j$ to node $i$, and $L_i$ and $U_i$ represent the lower and upper bounds on each of the gate sizes.

The ease with which the above problem can be solved depends on the form of the delay model used to capture the dependence of $D_{ji}$ on the gate sizes. If a linear (or piecewise linear) model is used [9], then the problem becomes a linear programming (LP) problem and can be solved very efficiently. However, for better accuracy, posynomial and generalized posynomial [10] delay models have been proposed.
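To make the node-based formulation concrete, the sketch below evaluates the objective and constraints of (2) for a candidate sizing rather than solving the optimization. It relies on the observation that the smallest arrival times satisfying $a_j + D_{ji}(x) \le a_i$ are obtained by a single topological pass that takes the maximum over fan-ins. The four-gate circuit, the simple RC delay model, and all constants are assumptions made for illustration, not taken from the paper.

```python
# Hypothetical 4-gate circuit: FANIN[i] lists the gates driving gate i.
# An empty list means the gate is driven only by primary inputs (arrival 0).
FANIN = {"g1": [], "g2": [], "g3": ["g1", "g2"], "g4": ["g3"]}
OUTPUTS = ["g4"]
ALPHA = {"g1": 0.5, "g2": 0.5, "g3": 0.25, "g4": 0.125}  # switching weights
WIRE_CAP = 1.0   # assumed fixed load on every gate output
L, U = 1.0, 8.0  # size bounds L_i and U_i (identical for all gates here)


def fanout_load(gate, x):
    """Load on `gate`: wire capacitance plus the sizes of its fanout gates."""
    return WIRE_CAP + sum(x[g] for g, ins in FANIN.items() if gate in ins)


def gate_delay(gate, x):
    """Assumed RC model: drive resistance 1/x[gate] times the driven load."""
    return fanout_load(gate, x) / x[gate]


def arrival_times(x):
    """Smallest a_i with a_j + D_ji(x) <= a_i for every fan-in j of gate i."""
    a = {}
    for gate in FANIN:  # the dict is written in topological order
        start = max((a[j] for j in FANIN[gate]), default=0.0)
        a[gate] = start + gate_delay(gate, x)
    return a


def evaluate(x, d0):
    """Return (objective, feasible) of formulation (2) for the sizing x."""
    a = arrival_times(x)
    meets_timing = all(a[o] <= d0 for o in OUTPUTS)
    in_bounds = all(L <= xi <= U for xi in x.values())
    objective = sum(ALPHA[g] * x[g] for g in x)
    return objective, meets_timing and in_bounds


min_size = {g: 1.0 for g in FANIN}
upsized = {"g1": 2.0, "g2": 2.0, "g3": 2.0, "g4": 1.0}
print(evaluate(min_size, d0=4.0))
print(evaluate(upsized, d0=4.0))
```

In this toy case the minimum-size circuit has the lowest weighted size but misses the assumed timing constraint, while the upsized circuit meets timing at a higher objective value; an optimizer for (2) searches for the cheapest sizing between such extremes.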
Under posynomial and generalized posynomial delay models, problem (2) becomes a geometric program (GP), which can be mapped to a convex optimization problem using an exponential transformation of the gate size variables. However, geometric programs have a cubic runtime complexity, which becomes prohibitive for large circuits. To address this issue, the constraint structure in the above representation of the problem was simplified using Lagrangian multipliers in [8]. The proposed approach iteratively updated the Lagrangian multipliers, while the gate-sizing problem was solved for a known set of Lagrange multipliers, which can be done very efficiently. Under simple delay models, the approach was also shown to converge to the optimal solution. However, the approach was shown to have convergence issues, and improvements were suggested in [11].

Another class of gate-sizing algorithms, such as JiffyTune [12], is based on dynamic timing analysis to generate timing sensitivities. This incurs large runtime penalties. However, problems such as false paths that arise in formulations similar to (2) can be easily handled.

In all the above approaches, we have neglected the contribution of short circuit current to the total dynamic power dissipation. This can have strong implications for optimization: while short circuit power in a well-designed circuit (with sharp transitions) is usually small, this is not true for intermediate solutions obtained during optimization. This can even result in power and timing being nonconflicting objectives [13]. However, accurate models for short circuit power are generally nonconvex, and the resulting power optimization problems must be solved using general nonlinear optimization techniques. The conditions under which short circuit power is small enough to be neglected can be implicitly imposed by constraining the transition time at each of the nodes in the circuit. However, this raises the issue of propagating transition time constraints through the circuit timing graph. The two choices of propagating the worst transition time or the transition time associated with the worst delay lead to pessimistic or optimistic results, respectively. The approach proposed in [14] uses a weighted sum of the delay and transition time at each node to decide the timing event that is propagated to the next stage.

B. Multiple Supply Voltages

Although gate sizing can be used to optimize dynamic power dissipation in a design, the overall power dissipation depends at most linearly on gate sizes. It can be seen from (1) that reducing the power supply of a gate provides quadratic benefits in power. However, multiple supply voltages complicate design since they impose the topological constraint that gates operating at a lower supply voltage cannot fan in to any gate operating at a higher supply voltage without a dedicated level converter being inserted. Moreover, multiple power supplies require additional voltage generation and routing resources. Since it has been shown that the power savings achieved by using more than two power supplies are minimal [15], most of the work in multi-Vdd design is focused on dual-Vdd designs.

Clustered voltage scaling (CVS) was proposed in [16] and allows only one transition from high to low Vdd along a path, resulting in a cluster of low Vdd gates close to the output flip-flops or latches, as shown in Fig. 3. Any level conversion from low Vdd to high Vdd is performed at these sequential elements and is referred to as synchronous level conversion.
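The topological constraint can be checked mechanically. The following sketch lists every connection where a low Vdd gate drives a high Vdd gate: for a valid CVS assignment the list is empty, whereas any entries mark places where an in-path level converter would be required. The five-gate netlist and both assignments are hypothetical.

```python
def low_to_high_edges(fanout, low_vdd):
    """Return (driver, receiver) pairs where a low-Vdd gate feeds a high-Vdd gate."""
    return sorted(
        (drv, rcv)
        for drv, receivers in fanout.items()
        for rcv in receivers
        if drv in low_vdd and rcv not in low_vdd
    )


# Hypothetical netlist: FANOUT[g] lists the gates driven by gate g.
# g5 drives only a primary output (or a sequential element).
FANOUT = {
    "g1": ["g3"],
    "g2": ["g3", "g4"],
    "g3": ["g5"],
    "g4": ["g5"],
    "g5": [],
}

clustered = {"g3", "g5"}          # low-Vdd cluster next to the outputs
scattered = {"g1", "g4", "g5"}    # low-Vdd gate g1 feeds high-Vdd gate g3

print(low_to_high_edges(FANOUT, clustered))
print(low_to_high_edges(FANOUT, scattered))
```

The first assignment satisfies the constraint without any in-path converters; the second needs one on the connection from g1 to g3.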
Extended CVS (ECVS) [17] allows multiple high to low Vdd transitions along a path, and uses asynchronous level converters [18], [19] to convert a low Vdd signal up to high Vdd at all points in the circuit where a low Vdd gate acts as an input to a high Vdd gate.

Algorithms for dual-Vdd assignment are severely limited by the discrete nature of the dual-Vdd assignments. Most of the continuous gate-sizing approaches cannot be efficiently mapped to perform dual-Vdd assignment. However, some dual-Vdd assignment algorithms are based on solving the initial problem assuming the availability of a continuous range of supply voltages and then heuristically clustering the obtained solution to the available set of supply voltages [20]. Standard dual-Vdd library-based implementations of CVS and ECVS rely on a levelized traversal of the circuit graph from the primary outputs (or sequential element inputs) to the primary inputs (or sequential element outputs). A gate is set to low Vdd if the assignment does not result in a violation of the timing constraints. Though such a traversal is reasonable from a CVS perspective, it is artificial for ECVS.

An improved sensitivity-based greedy ECVS approach (called GECVS) was proposed in [21]. The approach iteratively computes the sensitivity of each gate to low Vdd assignment. The sensitivity metric is defined as

$$\text{Sensitivity} = \Delta P \sum_{\text{arcs}} \frac{\text{Slack}_{\text{arc}}}{\Delta D_{\text{arc}}} \tag{3}$$

where the summation is over all timing arcs associated with a gate (six for a 3-input NAND gate). $\Delta P$ and $\Delta D_{\text{arc}}$ represent the change in total power and the change in the delay of the arc, respectively, if the gate is assigned to a lower Vdd, and $\text{Slack}_{\text{arc}}$ represents the timing slack of the arc. The terms $\Delta P$ and $\Delta D_{\text{arc}}$ implicitly account for the power and delay penalties associated with asynchronous level conversion, and naturally lead to the formation of low Vdd clusters.
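The levelized CVS traversal described above can be sketched as follows. This is a simplified illustration, not the implementation from [16] or [17]: a reverse topological order stands in for the levelized traversal, each gate has a single fixed delay, a hypothetical multiplier `LOW_VDD_PENALTY` models the slowdown at the lower supply, and the whole circuit is re-timed for every trial assignment.

```python
# Hypothetical netlist: FANIN[g] lists the gates driving gate g,
# written in topological order (primary-input gates first).
FANIN = {"g1": [], "g2": [], "g3": ["g1", "g2"], "g4": ["g2"], "g5": ["g3", "g4"]}
DELAY_HIGH = {"g1": 1.0, "g2": 1.0, "g3": 2.0, "g4": 1.0, "g5": 1.0}
LOW_VDD_PENALTY = 1.5  # assumed delay multiplier for a gate at low Vdd


def critical_delay(low_vdd):
    """Longest path delay for a given set of low-Vdd gates."""
    arrival = {}
    for gate, ins in FANIN.items():
        d = DELAY_HIGH[gate] * (LOW_VDD_PENALTY if gate in low_vdd else 1.0)
        arrival[gate] = max((arrival[j] for j in ins), default=0.0) + d
    return max(arrival.values())


def cvs_assign(t_max):
    """Greedy CVS: visit gates from outputs to inputs, lowering Vdd when legal."""
    fanout = {g: [r for r, ins in FANIN.items() if g in ins] for g in FANIN}
    low_vdd = set()
    for gate in reversed(list(FANIN)):  # outputs first
        if not all(r in low_vdd for r in fanout[gate]):
            continue  # would drive a high-Vdd gate: violates the CVS constraint
        if critical_delay(low_vdd | {gate}) <= t_max:
            low_vdd.add(gate)
    return low_vdd


print(sorted(cvs_assign(t_max=5.0)))
```

With these made-up delays, the gates nearest the output absorb the available slack and form the low Vdd cluster, while gates further upstream stay at high Vdd either because timing would be violated or because they drive a high Vdd gate.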
The GECVS work [21] showed that this approach yields additional power savings of 28% and 13% over CVS and ECVS, respectively, for the ISCAS benchmark circuit set.

The two knobs of gate sizing and dual-Vdd assignment available to a designer were combined in [22] and [23]. The approach proposed in [23] improves CVS-based dual-Vdd assignment by following traditional CVS with low Vdd assignment enabled by gate sizing. Sensitivities of the form shown in (3) are computed for gates on the front (shown in Fig. 4), which consists of gates that can be assigned to low Vdd without violating the topological constraint. After a gate has been assigned to low Vdd, gate sizing is performed to meet the timing constraints. Gate sizing is performed using a sensitivity-based measure expressed as

$$\text{Sensitivity} = \frac{1}{\Delta P} \sum_{\text{arcs}} \frac{\Delta D_{\text{arc}}}{\text{Slack}_{\text{arc}} - S_{\min} + K} \tag{4}$$

where $S_{\min}$ is the minimum slack in the circuit and $K$ is a small positive number providing numerical stability. This form assigns higher sensitivity to gates on the critical paths. Since CVS is a heavily topologically constrained problem, [23] employs hill-climbing techniques to avoid local minima, and shows approximately 10% additional power savings over CVS.

Fig. 3. Implementation of simple dual-Vdd circuits using (a) CVS and (b) ECVS (from [17]).

Fig. 4. Power savings in dual-Vdd designs are found to be the largest when the lower Vdd is approximately 60% to 70% of the higher Vdd (from [17]).

Efficient implementation of the sensitivity computation step is of the utmost importance to maintain reasonable computational requirements in sensitivity-based approaches. The first and most important feature to note is that sensitivity computation for a gate does not involve a full or incremental timing analysis run. Instead, timing analysis is performed on a small subcircuit identified around the node of interest. The subcircuit involves gates in the immediate fan-in of the node (to consider the change in delay due to the change in load capacitance) and a few (generally two) logic stages in the fanout cone (to consider the change in delay due to the change in transition time). Secondly, techniques such as maximum weighted independent set (MWIS) [24] are used to select a set of independent gates. The gates are independent in the sense that changing the implementation of one of the gates does not alter the timing slack of the others. This reduces the number of sensitivity computation iterations.

Another important issue in dual-Vdd design is the choice of the lower supply voltage to maximize power savings. A lower supply voltage allows higher power savings per gate, but the higher delay penalty limits the percentage of gates that can be assigned to the lower supply voltage. References [17] and [25] report significant dynamic power savings of 45% to 50% using a lower Vdd that is 0.6 to 0.7 times the higher Vdd, as shown in Fig. 5. This range of the lower Vdd was also shown to be optimal for dual-Vdd designs using an analytical framework developed in [15], which claimed that the optimal lower supply voltage is equal to half the sum of the threshold voltage and the higher supply voltage.
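As a quick arithmetic check of the rule quoted from [15], the snippet below computes the lower supply as half the sum of the threshold voltage and the higher supply, and then the per-gate dynamic power ratio implied by the quadratic dependence in (1). The values of `VDD_HIGH` and `VTH` are hypothetical and chosen only for illustration.

```python
def optimal_low_vdd(vdd_high, vth):
    """Rule quoted from [15]: half the sum of Vth and the higher Vdd."""
    return 0.5 * (vdd_high + vth)


def dynamic_power_ratio(vdd_low, vdd_high):
    """Per-gate dynamic power at low Vdd relative to high Vdd, from (1)."""
    return (vdd_low / vdd_high) ** 2


VDD_HIGH = 1.2  # volts, assumed
VTH = 0.3       # volts, assumed

vdd_low = optimal_low_vdd(VDD_HIGH, VTH)
voltage_ratio = vdd_low / VDD_HIGH
power_ratio = dynamic_power_ratio(vdd_low, VDD_HIGH)
print(f"low Vdd = {vdd_low:.3f} V, ratio = {voltage_ratio:.3f}, "
      f"per-gate power ratio = {power_ratio:.3f}")
```

For these assumed values the voltage ratio falls inside the 0.6 to 0.7 range cited above. The power ratio applies only to gates that are actually moved to the lower supply; the overall saving also depends on the fraction of gates that can be reassigned and on level conversion overhead.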
Fig. 5. Gates 1, 2, and 3 are assigned to low Vdd during CVS and gates 4, 5, and 8 satisfy the dual-Vdd topological constraint for low Vdd assignment. The dashed line represents the front.

III. STATIC POWER MINIMIZATION

Section I mentioned that static power has grown to become a significant contributor to the total power budget, which is particularly problematic for mobile applications where battery life is dictated by average power draw. Leakage current calculations for gate-level designs require an estimate of the input state-vector probability for each gate. The input vector of a gate has a strong impact on the leakage power of a gate, due to the well-known "stack effect" that significantly reduces leakage if two or more off-devices are connected in series [38]. Thus, in addition to state probabilities, we require correlation information for the state of each pair of nodes, and approaches such as [26] are generally used.

In current and future technologies and under typical operating conditions (i.e., elevated temperatures), subthreshold leakage is the dominant contributor to leakage power dissipation. As a result, subthreshold leakage reduction techniques have received significant interest. Gate leakage current, due to tunneling between the gate and channel (or source/drain), is another prominent form of leakage in modern processes (e.g., 65 nm) [95]. However, there is substantial work in developing a high-k gate insulating material to replace SiO2; this would allow physically thicker oxides to be used to suppress tunneling while achieving the same electrical behavior as a thinner SiO2 region (i.e., the same $C_{ox}$) [115], [116]. Although the introduction of such a high-k material into a production technology has taken longer than expected, it seems likely that gate leakage will be addressed in this manner by the 32 nm node at the latest [114]. Therefore, in this section we examine many of the most popular subthreshold leakage optimization techniques.

Leakage power optimization techniques can be divided into two broad classes, depending on whether they provide active mode or standby mode leakage reduction. Standby mode leakage represents the leakage power dissipated by the design under idle conditions (significant for mobile applications). Leakage reduction techniques for the active, or normal, mode of operation are limited to multi-Vth assignment and gate-length biasing. However, a number of techniques have been developed to reduce standby leakage power dissipation through the use of sleep devices, body biasing, and input vector control. Similar ideas led to the development of drowsy caches [27] for memories.

Active mode leakage reduction techniques have been heavily investigated in recent years, with most of the emphasis on dual-Vth assignment. Dual-Vth assignment can be seen as an analog of dual-Vdd assignment for leakage power, with a number of key similarities. Dual-Vth provides a much stronger lever (compared to gate sizing and dual-Vdd assignment) to control leakage power, due to the exponential dependence of leakage current on Vth. In addition, two threshold voltages have been shown to provide a large fraction of the benefits that can be achieved by using multiple threshold voltages. Thus, given the cost of additional mask steps for threshold implants, most of the work in this area has been targeted at dual-Vth optimization. Analysis of the optimal choice of threshold voltages in dual-Vth designs shows that the difference between the two threshold voltages is approximately one-tenth of the supply voltage [15].

However, dual-Vth designs differ considerably from dual-Vdd designs in their physical design implications, which makes them much more attractive. Since the different threshold voltages are achieved using additional implants, no additional routing resources are required. In addition, dual-Vth assignment can be performed post-layout since it does not require any significant change to the layout of the gate. Moreover, the level conversion issues in dual-Vdd designs are not a concern for dual-Vth designs, and a large fraction of the design can be operated at higher threshold voltages without incurring level conversion penalties.

Due to the similarity in the problem structure for dual-Vdd and dual-Vth assignment, a number of comparable techniques have been developed. Techniques proposed in [28] and [29] are based on solving a continuous optimization problem, followed by heuristic clustering.
Sensitivity-based techniques were proposed in [30] and [31].

Recently, a number of techniques have been proposed to perform simultaneous dual-Vth assignment and gate sizing [32]–[35]. References [32] and [33] propose sensitivity-based techniques that identify the set of gates to be assigned to a lower Vth and upsized to meet delay constraints. The approach proposed in [34] is based on enumerating the solution space and pruning it to limit the required search. An iterative two-stage approach was proposed in [35]. Each iteration involves a slack allocation step that is performed to maximize the potential power savings. This is achieved by solving the following optimization problem:

$$
\begin{aligned}
\text{Max:}\quad & \sum_{i=1}^{n} s_i d_i \\
\text{s.t.:}\quad & a_j \le D_0 && \forall j \in \text{output} \\
& a_j + D_{ji} + d_i \le a_i && \forall i = 1,\ldots,n \ \text{and}\ \forall j \in \text{input}(i) \\
& 0 \le a_i && \forall i \in \text{input} \\
& d_i \le U_i && \forall i = 1,\ldots,n
\end{aligned}
\tag{5}
$$

where $a_i$ represents the arrival time at node $i$, $D_0$ is the timing constraint imposed on the circuit, $D_{ji}$ represents the delay of the timing arc going from node $j$ to node $i$, $d_i$ represents the slack allocated to node $i$, $U_i$ represents the upper bound on the allocated slack, and $s_i$ represents the power-delay sensitivity. The sensitivity corresponds to the maximum sensitivity across different implementations of the gate. The second stage identifies implementations of each gate that use the slack assigned to it. Since the allocated slack may not be utilized fully in the second stage, and in some cases may be too small to perform any upsizing or Vth assignment, some of the allocated slack can remain unutilized. Therefore, the unused slack is reallocated in the following iterations.

Reference [36] proposed a novel approach for simultaneous dual-Vth assignment and gate sizing. The inherently discrete problem is formulated as a continuous problem, which allows it to be solved using any of the widely available and highly efficient nonlinear optimizers. The formulation is based on the concept of a mixed-Vth gate (shown in Fig. 6), which consists of two implementations of the same gate operating in parallel, one at the higher and one at the lower threshold voltage.

Fig. 6. A mixed-Vth implementation of a NAND gate represented as a parallel combination of high-Vth and low-Vth gates (from [36]).

Vol. 95, No. 3, March 2007 Proceedings of the IEEE 513

Let us assume the problem is expressed as shown in (2), and that the delay of a gate is written as

$$ D = R C_l \tag{6} $$

where

$$ C_l = C_{Load} + K_{SL} W \tag{7} $$


represents the capacitive load driven by the gate, $R$ represents the driver resistance, which is inversely proportional to the device width, $K_{SL}$ is a constant that models the contribution of a gate's intrinsic capacitance to its own load, and $W$ is the width of the transistors associated with the gate. The delay of the mixed-Vth gate can then be expressed as

$$ D = R_{eff} C_l = \frac{R_l R_h}{R_l W_h + R_h W_l}\, C_l \tag{8} $$

where

$$ C_l = C_{Load} + K_{SL}(W_l + W_h). \tag{9} $$

The total power of each gate can be written as

$$ P_{gate} = P_l W_l + P_h W_h \tag{10} $$

where $P_l$ and $P_h$ are the power per unit width of the low-Vth and high-Vth fraction of the gate, respectively. These constants include both dynamic and static power, and are dependent on the load and switching activity of a gate. The circuit, consisting of all mixed-Vth gates, can now be optimized for leakage power by performing traditional gate sizing. The interesting outcome of this optimization is that at the optimal point either the high-Vth or the low-Vth width of each mixed-Vth gate is zero, and thus each gate can be implemented using standard high-Vth or low-Vth gates. This result is true under the assumption of ideal drivers. Even if drivers are modeled as standard cells, a very small number of the mixed gates can have nonzero high-Vth and low-Vth device widths. However, given their small number they can be heuristically snapped to high-Vth or low-Vth gates, with a negligible impact on overall leakage power. Compared to a sensitivity-based method, the approach achieves average leakage savings of 31% and average total power savings of 7.4%. The crucial advantage provided by such an approach is that it maps a discrete optimization problem to a continuous problem, which can be solved to optimality in an efficient manner.

Additional information available to a designer can be used to further improve the leakage savings. Based on functional gate-level simulations of expected program loads, designers can obtain early knowledge of expected state probabilities during the course of circuit operation. After observing that these state probabilities tend to be heavily skewed (i.e., a node is usually 0 or usually 1 rather than having near equal likelihoods), the authors of [100] used these probabilities to guide a heuristic dual-Vth assignment algorithm similar to [33]. Combined with low-level gate optimizations such as pin reordering, gate decomposition, and rewiring, the results showed substantial gains over a probability-unaware optimization approach (30% to 55% active mode leakage reductions).

Although dual-Vth assignment provides significant leakage reductions, the improvements are known to be extremely sensitive to process variations due to low-Vth devices exhibiting large spreads in leakage. Gate-length (Lgate) biasing provides an alternative to achieve leakage power savings while also addressing design robustness or yield (Fig. 7). Furthermore, Lgate biasing does not require additional mask steps. The efficacy of such an approach to reducing leakage variability was first investigated in [37]. The approach proposes only small Lgate biases of less than 10%, which allows cells to be replaced post-layout without engineering change orders (ECOs).

Fig. 7. By selectively introducing longer-than-minimum channel lengths in gates having timing slack, leakage spread induced by process variability can be greatly reduced and yield improved (from [37]).

Moreover, using small biases ensures good printability and provides substantial reduction in leakage since the nominal gate length for a technology is generally near the knee of the leakage versus

gate-length curve. Results in [37] showed that Lgate-biased designs exhibit 54% less worst-case leakage variability while providing significant leakage power savings for very small delay and dynamic power penalties.

Although dual-Vth and gate-length biasing provide significant leakage reductions, specialized leakage reduction modes are required for reducing power dissipation in designs with long periods of inactivity. We now discuss some of the techniques that have been developed for standby subthreshold leakage power minimization. These techniques are based on reducing leakage by allowing the entire design, or sections of it, to go into a "sleep" state under idle conditions. Thus, one of the key requirements is the generation of the "sleep" signal based on the type of workloads. Another key concern is the power overhead associated with bringing the design into a sleep state. This overhead provides a lower bound on the time the circuit should stay in the sleep state to obtain overall power savings. The generated sleep state signal can then be used to reduce leakage by power gating, through body biasing to increase device Vth, or by forcing the circuit into a known low-leakage state. Each of these techniques is now described in more detail.

Input vector control uses the stack effect [38] to reduce leakage in a design by forcing gates to a particular low-leakage logic state. However, the efficacy of such an approach is severely limited by the fact that, to maintain low hardware overheads, only a small number of nodes in a design can be directly assigned to a given state. Thus, even though the range of leakage values of a single gate can be extremely wide (100×) over different input combinations, the range of leakage currents for a complete design is much smaller (generally in the range of 10% to 20%) [109]. Moreover, determining the state that should be forced at each of the "control" points is a hard problem. A random sampling based approach was proposed in [39].
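The intuition behind random sampling of input vectors can be sketched with an elementary counting argument. This is an illustration only, not the derivation used in [39]: it assumes vectors are drawn independently and uniformly, and the function name is hypothetical. If a fraction x of all vectors is better than anything seen so far, n independent samples all miss that fraction with probability (1 - x)^n, so requiring this to be at most 1 - y gives a minimum sample count.

```python
import math


def samples_needed(x, y):
    """Smallest n such that 1 - (1 - x)**n >= y.

    x: fraction of input vectors allowed to have lower leakage than the
       best sampled vector (0 < x < 1).
    y: desired confidence (0 < y < 1).
    """
    if not (0.0 < x < 1.0 and 0.0 < y < 1.0):
        raise ValueError("x and y must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - y) / math.log(1.0 - x))


# Example: be 99% confident that at most 1% of vectors beat the best sample.
n = samples_needed(x=0.01, y=0.99)
print(n)
```

The count depends only on x and y, not on the number of circuit inputs, which is what makes sampling attractive when exhaustive enumeration of input vectors is impossible.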
The approach showed that, under the assumption that leakage power has a beta distribution for varying input patterns, the number of random samples required to establish that only a fraction $x$ of the vectors have smaller leakage (with a confidence $y$) is bounded and can be expressed as

$$ n \ge \frac{\ln(1 - y)}{\ln(1 - x)} \tag{11} $$

where $n$ is the number of samples.

Although standby leakage reductions achievable by standard input vector control techniques are severely limited by logical correlations, this approach can be combined with other leakage reduction techniques to boost its effectiveness. For instance, [99] showed that by simultaneously considering state assignment along with Vth assignment (in a dual-Vth process), one can achieve 5× greater leakage savings than input vector control alone. This improvement is based on the key observation that, given a standby mode assignment for a given circuit, the leaking devices are known and can be appropriately assigned to high Vth. The proposed algorithm seeks to minimize the delay overhead of such an approach while actively searching for the best input vector and corresponding Vth assignments.

Larger standby leakage power savings can be achieved by using high-Vth sleep transistors [62]. These sleep transistors are used to "gate" the power supply to a low-Vth logic block. This technique, also known as multiple-threshold CMOS (MTCMOS) or power gating, significantly reduces leakage power in the sleep state since all the off low-Vth devices are stacked with an off high-Vth device. However, sleep transistors are generally very large and the area penalty can be significant, and their use is associated with delay penalties during the active mode of operation. Moreover, sizing of sleep transistors is complicated since it depends on the switching patterns of the gates in the block connected to the sleep transistor [40]. Clustering techniques to place gates that have mutually exclusive discharge patterns have been explored [41].
Some authors have also investigated using separate sleep transistors for each gate [42]. The approach is based on slack allocation (5), where the objective function is replaced by the leakage power expressed as a function of the delay penalty associated with sleep transistors. However, such implementations suffer from sneak leakage paths, which are tricky to identify [43]. Nonetheless, power gating is one of the best understood and most important standby mode leakage reduction techniques at present and should remain so in the near future.

Another technique that has been explored involves altering the threshold voltage of the transistors using body biasing. Investigations using forward and reverse body biasing showed that reverse body biasing results in worse short-channel effects [110]. Hence, current implementations generally use forward body biases to lower Vth during active modes of operation. However, using the same body bias for the entire design results in a highly constrained problem and degrades power improvements. Techniques similar to clustering in MTCMOS can be used to group gates into a few sets, each of which is assigned to a single body bias.

IV. TOTAL POWER MINIMIZATION

As process technologies have scaled, the contribution of active leakage power dissipation has steadily increased, and in current technologies leakage power can be as significant as dynamic power dissipation. Hence, techniques for total power minimization have attracted a lot of interest. Based on the discussion in the previous sections, combining the ideas of multiple-Vdd and multiple-Vth designs seems to be the natural choice. The efficacy of such a design approach was explored in [44].

To estimate the power improvement obtained by applying multiple Vdds and Vths, and to analyze the relationship among the optimal values of the Vths and Vdds to be used, [44] performed a path-based analysis of a generic logic network. To simplify the problem, [44] assumes that the paths are node and edge disjoint. In addition, it is assumed that it is possible to apply a combination of Vdds and Vths to any fraction of the total path capacitance. This is equivalent to stating that extended clustered voltage scaling (ECVS) is used, which allows for asynchronous level conversion anywhere along a path.

Consider $V_1$ and $V_{TH1}$ to be the supply and threshold voltages in a single Vdd/Vth system and $C_{1,1}$ to be the total path capacitance. Considering the same path implemented in an n-Vdd/m-Vth design, let us define the $C_{i,j}$s (other than $C_{1,1}$, which is the total path capacitance) as the capacitances operating at a supply voltage $V_i$ and threshold voltage $V_{THj}$. If we define $C_i$ to be the capacitance operating at a supply voltage $V_i$, it can be expressed as

$$ C_i = \sum_{j=1}^{m} C_{i,j} \quad \text{for } i \ne 1. \tag{12} $$

The total dynamic power dissipation can then be expressed as

$$ P = f\left[\left(C_{1,1} - \sum_{i=2}^{n} C_i\right) V_1^2 + \sum_{i=2}^{n} C_i V_i^2\right]. \tag{13} $$

The first term in (13) corresponds to the capacitance operating at $V_1$ and is obtained by subtracting the sum of the capacitances operating at voltages other than $V_1$ from the total path capacitance $C_{1,1}$. Now the ratio of the dynamic power dissipation to that of the original design can be expressed as

$$ \text{Gain}_{Dyn} = 1 - \frac{1}{C_{1,1}} \sum_{i=2}^{n} C_i\left[1 - \left(\frac{V_i}{V_1}\right)^2\right]. \tag{14} $$

The static power can be expressed similarly. If $W_{1,1}$ is the total device width (both PMOS and NMOS) and $W_{i,j}$ is the device width (both PMOS and NMOS) at power supply $V_i$ and threshold voltage $V_{THj}$, then the gain in static power can be expressed as

$$ \text{Gain}_{static} = 1 - \sum_{i=1}^{n}\sum_{j=1}^{m}\left[1 - 10^{-\frac{V_{THj} - V_{TH1}}{S}}\left(\frac{V_i}{V_1}\right)^2\right]\frac{W_{i,j}}{W_{1,1}} \tag{15} $$

where $S$ is the subthreshold swing.
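The dynamic-power expressions (13) and (14) can be checked against each other with a few lines of code. The sketch below is illustrative only: the function names, argument layout, and the example split of 40% of the path capacitance onto a 0.6 V second supply are assumptions, not values from [44].

```python
def dynamic_power(freq, c_total, v1, low_caps, low_vdds):
    """Eq. (13): dynamic power when part of the path capacitance runs at lower supplies.

    low_caps[k] is the capacitance C_i moved to supply low_vdds[k] (i >= 2);
    whatever is not moved stays at the nominal supply v1.
    """
    c_at_v1 = c_total - sum(low_caps)
    return freq * (c_at_v1 * v1 ** 2
                   + sum(c * v ** 2 for c, v in zip(low_caps, low_vdds)))


def dynamic_gain(c_total, v1, low_caps, low_vdds):
    """Eq. (14): dynamic power relative to the single-supply design."""
    saved = sum(c * (1.0 - (v / v1) ** 2) for c, v in zip(low_caps, low_vdds))
    return 1.0 - saved / c_total


# Hypothetical example: 40% of a normalized path capacitance at 0.6 V, rest at 0.9 V.
baseline = dynamic_power(1.0, 1.0, 0.9, [], [])
dual_vdd = dynamic_power(1.0, 1.0, 0.9, [0.4], [0.6])
print(dual_vdd / baseline, dynamic_gain(1.0, 0.9, [0.4], [0.6]))
```

Both printed numbers agree, which confirms that (14) is simply (13) normalized by the single-supply power, so values below one indicate a power saving.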
The change in delay, $D$, when the power supply or Vth is changed is estimated using the alpha-power law model

$$ D_{i,j} = \frac{V_i}{V_1}\left(\frac{V_1 - V_{TH1}}{V_i - V_{THj}}\right)^{\alpha} \tag{16} $$

where $\alpha$ is the velocity saturation index of the alpha-power law. As shown in [15], the capacitance and transistor width along a path are largely proportional to the path's delay. Hence, the ratios of widths in (15) can be replaced by ratios of capacitance. At this point the problem of power minimization for given voltages and thresholds can be formulated as a linear programming (LP) problem with the ratios of capacitances as the variables. For each point in the $V_{dd,2 \ldots n}$ and $V_{th,2 \ldots m}$ space, which corresponds to a particular set of values for $V_{dd,2},\ldots,V_{dd,n}$ and $V_{th,2},\ldots,V_{th,m}$, the problem is formulated and the ratios of capacitance corresponding to different path delays are obtained as a solution of the LP problem. The ratios of capacitance are then integrated over an assumed path-delay distribution to obtain the total capacitance operating at each combination of Vdd and Vth. Since both gains are ratios of the optimized power to the original power, the total power reduction problem can now be expressed as

$$ \text{Minimize:}\quad K\,\text{Gain}_{dyn} + \text{Gain}_{static} \tag{17a} $$

$$ \text{s.t.:}\quad \left(1 + \sum_{i,j} \frac{C_{i,j}}{C_{1,1}}\,(D_{i,j} - 1)\right) t \le 1 \tag{17b} $$

where $t$ is the original path delay normalized to the critical path delay and $K$ is a weighting factor that provides emphasis to either dynamic or static power. The constraint forces the final delay of each path to be less than the critical delay of the network, which, being normalized, is equal to one, and thus maintains the operating frequency $f$. Since the paths are independent of each other, minimizing the power dissipation on each of the paths leads to the minimum power of the complete logic network.

Results in Fig. 8 demonstrate that the power reduction obtained by applying dual Vdd/Vth is consistently much larger than that of the optimal dual Vdd design. It is also seen in Fig. 8 that the advantage offered by the second threshold voltage is smallest (around 10% to 20%) for lower K values.
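As a numerical aside on the formulation above, the alpha-power delay ratio (16) and the path timing constraint (17b) can be evaluated directly. This is a minimal sketch under stated assumptions: the function names are hypothetical, the exponent alpha = 1.3 is an assumed value, and the 0.9 V / 0.225 V nominal point is borrowed from the example quoted for Fig. 9.

```python
def delay_ratio(v_i, vth_j, v1, vth1, alpha):
    """Eq. (16): delay at (v_i, vth_j) relative to the nominal (v1, vth1)."""
    return (v_i / v1) * ((v1 - vth1) / (v_i - vth_j)) ** alpha


def path_meets_timing(t, fractions, ratios):
    """Eq. (17b): scaled path delay must not exceed the normalized critical delay of 1.

    t:         original path delay normalized to the critical path delay.
    fractions: C_ij / C_11 for each non-nominal Vdd/Vth combination used.
    ratios:    the matching delay ratios D_ij from delay_ratio().
    Returns (constraint satisfied, scaled delay).
    """
    scaled = t * (1.0 + sum(f * (d - 1.0) for f, d in zip(fractions, ratios)))
    return scaled <= 1.0, scaled


ALPHA = 1.3  # assumed velocity saturation index
slow = delay_ratio(0.6, 0.225, 0.9, 0.225, ALPHA)   # low Vdd, same Vth

# A path with 30% slack can move half of its capacitance to the low supply...
print(path_meets_timing(0.7, [0.5], [slow]))
# ...while a critical path (t = 1) cannot absorb any slowdown.
print(path_meets_timing(1.0, [0.5], [slow]))
```

Because the constraint is linear in the capacitance fractions for fixed voltages, it fits the LP formulation described in the text.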
This is because the dual Vdd/Vth technique is predicated on using a lower second threshold voltage to allow cells to be run at a lower power supply while maintaining good drive capability. However, at small K values the static power is comparable to dynamic power, and a likely increase in static power due to the lower Vth is less acceptable as a tradeoff. Modern high-performance designs tend to exhibit K values in the range of 2 to 20; the dual Vdd/Vth approach delivers 15% to 30% lower power than dual Vdd alone over this range. A key distinction


observed is that the optimal second power supply voltage in dual Vdd/Vth systems is typically much lower than the range of $0.6\,V_{dd1}$ to $0.7\,V_{dd1}$ for dual Vdd designs. It has also been shown [44] that the power improvements of dual Vdd/Vth designs increase as the nominal power supply ($V_{dd1}$) is scaled down, as opposed to results for dual Vdd, where the improvements were shown to decrease with process scaling [15].

Fig. 9 shows the minimum achievable power as a function of the second Vdd and Vth. The optimal point is not overly sharp, and hence points close to optimal in terms of the lower supply and threshold voltage can be expected to provide near-optimal power savings. This attribute is desirable since it implies that design centering techniques should be capable of reducing the impact of process variability on total power. Also, the shallow minima may allow for reasonable performance even in situations where Vdd and Vth choices are rather limited.

Fig. 8. Dual Vdd/Vth shows larger total power reduction than dual Vdd/single Vth throughout the range of K values, with power improvements ranging from 15% to 45% (from [44]).

Fig. 9. Power reduction as a function of second Vdd and Vth values. This example uses $V_{dd1} = 0.9$ V, $V_{th1} = 0.225$ V, and $K = 10$ (from [44]).

These predictions were confirmed in [23], which developed an integrated approach to perform simultaneous Vdd/Vth assignment and gate sizing. In this approach the power of a design, initially synthesized using high-Vdd and low-Vth gates, is first optimized using low-Vdd assignment and gate sizing using a sensitivity-based approach, as described in Section II. This is followed by a phase in which gates are set to high Vth using the slack created by either reassigning gates to high Vdd (only gates that satisfy the dual-Vdd topological constraint are considered) or performing gate sizing. The decisions involved in the process are made based on sensitivity measures computed using (4) to reduce power or using (5) to force the circuit to meet timing constraints. Additionally, hill-climbing techniques are used in each phase of the algorithm to avoid local minima resulting from dual-Vdd assignments. The sensitivity-based approach was shown to provide significant power reduction across circuits with varying activity factors and thus encompasses the effectiveness of both dual-Vdd and dual-Vth assignment algorithms.

Other approaches to total power reduction using multiple supplies and threshold voltages have also been proposed. A genetic algorithm based search technique has been proposed in [46]. However, such an approach can have runtime issues. An iterative LP-based approach was proposed in [45]. The approach is based on using a simple linear delay model in the Vdd/Vth allocation phase, allowing the problem to be expressed as an LP. The objective function is expressed as

$$ \sum_{i \in \text{nodes}} (\Delta P_i)\, x_i $$

where $\Delta P_i$ represents the achievable change in power and $x_i$ is a 0/1 optimization variable. However, to solve the problem as an LP the variables are allowed to take continuous values in the range 0 to 1. If the final value of the variable in the optimized solution is greater than 0.99, then the implementation of the gate is changed. After the power optimization step, if the circuit fails timing (due to the inaccurate linear delay model), then the LP is re-solved to reduce the delay of the circuit.

Finally, we revisit the concept of Fig. 1 using the total power optimization framework described above from [44]. By examining results from various Vdd/Vth combinations, one can quantify the relationship between power gains and the resulting path delay histogram. Fig. 10 shows the fraction of paths post-optimization that exhibit delays within 5% of the timing constraint (these paths can be considered critical although the exact definition of criticality is not


important). Two points are evident.
1) Dual Vdd/Vth provides a better power/robustness tradeoff than dual Vdd: at iso-power, there are many fewer critical paths, easing design and timing verification. Similarly, at a fixed number of critical paths (which can be used in this case as a rough proxy for timing yield), power is 11% lower for dual Vdd/Vth.
2) More importantly, the plot shows rapidly diminishing returns for the last few percent of power reduction; this is achieved at great cost in terms of critical path density.
Thus, it can be argued that a yield-driven design would not operate at the minimal power solution, since it is likely that the timing yield in such a design would be very poor. This fundamental observation directly leads into the next section.

Fig. 10. Dual Vdd/Vth provides a better power/robustness tradeoff than dual Vdd: at iso-power, there are many fewer critical paths, easing design and timing verification (adapted from [44]).

V. POWER OPTIMIZATION UNDER VARIABILITY

Previous sections have considered power optimization in a deterministic environment. However, with technology scaling, process variations have been continually growing, making it vital that their impact is considered during the design phase to ensure acceptable parametric yield. The two key components of power dissipation, dynamic and static, behave very differently under variability. Dynamic power has a linear dependence on process parameters, and hence variations in dynamic power show the same range as the process parameters themselves. At the same time, leakage power dissipation exhibits an exponential dependence on process parameters, which causes certain fabricated samples of a design to have significantly higher leakage currents. Therefore, the focus of this section will be on techniques that enable efficient and accurate statistical analysis and optimization of leakage power. For a more detailed description of these techniques, we refer the reader to [108].

A. Leakage Analysis

A number of statistical leakage analysis approaches have been recently developed [47]–[51]. The approach proposed in [47] was the first to analytically analyze the impact of intra-die variations on leakage current, and used a Taylor series expansion to estimate the mean and variance of leakage power under variations. A high-level leakage analysis approach based on total device width in a design was developed in [49]. The first comprehensive approach to analyze both inter- and intra-die variations was presented in [48], where the leakage current of each gate was approximated as a lognormal random variable (RV). The total leakage, which is the sum of lognormal RVs, was then approximated as another lognormal using Wilkinson's method, a moment matching technique. The intra-die model was then applied to each sample of a discretized inter-die probability distribution and the resulting distributions were combined using Bayes' theorem.

The results obtained showed that intra-die and inter-die variations differ strongly in their impact on the overall leakage distribution, as shown in Fig. 11. If most of the variability is due to inter-die variation, the leakage distribution is strongly skewed. However, with increasing intra-die variations, as expected in future technologies, the skewness decreases due to the averaging effect of different samples on the same die.

Fig. 11. PDFs of leakage current for different contributions of inter- and intra-die process variation. The total variation ($3\sigma$) is 15% of nominal.

The first gate-level leakage analysis approach considering spatial correlation was proposed in [50]. To handle the correlated components of variations (inter-die and correlated intra-die), the overall chip area is divided into a grid using a model similar to that used in [52] for statistical timing analysis. To simplify the problem, this set of correlated RVs is replaced by another set of mutually independent RVs with zero mean and unit variance using

the principal components of the set of correlated RVs. A vector of RVs (say $X$) with a correlation matrix $C$ can be expressed as a linear combination of the principal components $Y$ as

$$ X = \mu_X + \Sigma V^{-1} D^{1/2} Y \tag{18} $$

where $\mu_X$ is the vector of the mean values of $X$, $\Sigma$ is a diagonal matrix with the diagonal elements being the standard deviations of $X$, $V$ is the matrix of the eigenvectors of $C$, and $D$ is a diagonal matrix of the eigenvalues of $C$. Since the correlation matrix of a multivariate (nondegenerate) Gaussian RV is positive-definite, all elements of $D$ are positive and the square root in (18) can be evaluated. The leakage power of an individual gate is expressed as

$$ \text{Leakage} = \exp\left(V_{nom} + \sum_{p} \beta_p\,\Delta P_p\right) \tag{19} $$

where $\exp(V_{nom})$ is the nominal value of leakage power, and the $\beta_p$s represent the sensitivities of the log of leakage to the process parameters under consideration. The variable $\Delta P_p$ represents the change in the process parameters from their nominal value. In a statistical scenario, the process parameters are modeled as RVs. If the overall circuit is partitioned using a grid, the leakage of individual gates can be expressed as a function of these RVs. Using the principal component approach, the leakage in (19) can then be expressed as

$$ \text{Leakage} = \exp\left(V_{nom} + \sum_{p} \beta_p \left(\sum_{j=1}^{n} \lambda_{ji} z_j\right) + \ell R\right) \tag{20} $$

where the $z_j$s are the principal components of the correlated RVs $\Delta P_p$ in (19) and the $\lambda$s can be obtained from (18). $R \sim N(0, 1)$ in the above equation represents the random component of the variations of all process parameters lumped into a single term that contributes a total variance of $\ell^2$ to the overall variance of the log of leakage.

The leakage power of the total circuit can then be expressed as a sum of correlated RVs. This sum can be accurately approximated as another lognormal random variable. Reference [50] shows that the approximation performed using an extension of Wilkinson's method (based on matching the first two moments) provides good accuracy. The leakage power of an individual gate $a$ is expressed as

$$ P^{a}_{leak} = \exp\left(a_0 + \sum_{i=1}^{n} a_i z_i + a_{n+1} R\right) \tag{21} $$

where the $z_i$s are principal components of the RVs and the $a_i$s are the coefficients obtained using (18) and (19). The mean and variance of the RV in (21) can be computed as

$$ E\left[P^{a}_{leak}\right] = \exp\left(a_0 + \frac{1}{2}\sum_{i=1}^{n+1} a_i^2\right) \tag{22} $$

$$ \mathrm{Var}\left[P^{a}_{leak}\right] = \exp\left(2a_0 + 2\sum_{i=1}^{n+1} a_i^2\right) - \exp\left(2a_0 + \sum_{i=1}^{n+1} a_i^2\right). \tag{23} $$

The correlation of the leakage of gate $a$ with the lognormal RV associated with $z_j$ is found by evaluating

$$ E\left[P^{a}_{leak}\, e^{z_j}\right] = \exp\left(a_0 + \frac{1}{2}\left(\sum_{i=1,\, i \ne j}^{n+1} a_i^2 + (a_j + 1)^2\right)\right) \quad \forall j \in \{1, 2, \ldots, n\}. \tag{24} $$

Similarly, the covariance of the leakage of two gates ($a$ and $b$) is found from the cross moment

$$ E\left[P^{a}_{leak} P^{b}_{leak}\right] = \exp\left((a_0 + b_0) + \frac{1}{2}\left(\sum_{i=1}^{n} (a_i + b_i)^2 + a_{n+1}^2 + b_{n+1}^2\right)\right) \tag{25} $$

by subtracting the product of the two means.

This approach assumes that the sum of leakage power can be expressed in the same canonical form as (21). If the random variables associated with all the gates in the circuit are summed in a single step, the overall complexity of the approach is $O(n^2)$ due to the size of the correlation matrix. Since the sum of two lognormal RVs is assumed to have a lognormal distribution in the same canonical form, we can use a recursive technique to estimate the sum of more than two lognormal RVs. In each recursive step, two RVs of the form in (21) are summed to obtain another RV in the same canonical form. To find

the coefficients in the expression for the sum of the RVs, the first two moments (as in Wilkinson's method) and the correlations with the lognormal RVs associated with each of the Gaussian principal components are matched. Let us outline one of the recursive steps, where we sum $P^{b}_{leak}$ and $P^{c}_{leak}$ to obtain $P^{a}_{leak}$. The coefficients associated with the principal components can be found using (22)–(25) and expressing the coefficients associated with the principal components as

$$ a_i = \log\left(\frac{E\left[P^{a}_{leak}\, e^{z_i}\right]}{E\left[P^{a}_{leak}\right] E\left[e^{z_i}\right]}\right) = \log\left(\frac{E\left[P^{b}_{leak}\, e^{z_i}\right] + E\left[P^{c}_{leak}\, e^{z_i}\right]}{\left(E\left[P^{b}_{leak}\right] + E\left[P^{c}_{leak}\right]\right) E\left[e^{z_i}\right]}\right). \tag{26} $$

Using the expressions developed in [48], the remaining two coefficients in the expression for $P^{a}_{leak}$ can be expressed as

$$ a_0 = \frac{1}{2}\log\left(\frac{\left(E\left[P^{b}_{leak}\right] + E\left[P^{c}_{leak}\right]\right)^4}{\left(E\left[P^{b}_{leak}\right] + E\left[P^{c}_{leak}\right]\right)^2 + \mathrm{Var}(P^{b}) + \mathrm{Var}(P^{c}) + 2\,\mathrm{Cov}(P^{b}, P^{c})}\right) \tag{27} $$

$$ a_{n+1} = \left[\log\left(1 + \frac{\mathrm{Var}(P^{b}) + \mathrm{Var}(P^{c}) + 2\,\mathrm{Cov}(P^{b}, P^{c})}{\left(E\left[P^{b}_{leak}\right] + E\left[P^{c}_{leak}\right]\right)^2}\right) - \sum_{i=1}^{n} a_i^2\right]^{0.5}. \tag{28} $$

Having obtained the sum of two lognormals in the original canonical form, the process is recursively repeated to compute the expression for the total leakage power of the circuit. The results obtained using such an analysis show that the dependence of the variance of leakage power on the random component is extremely weak. This arises because the random component associated with each gate is independent, and hence the ratio of standard deviation to mean for the sum of these independent RVs is inversely proportional to the square root of the number of RVs summed. This ratio does not reduce for correlated RVs; therefore, if a large number of RVs are summed with both correlated and random components, the overall variance is dominated by the variance of the correlated component.

B. Leakage Optimization

Optimization under variations, or robust optimization, has been an active area of research for some time in the operations research community, and recently some of these techniques have been applied in developing statistical circuit optimization techniques. Generally the optimization formulation in a deterministic environment is simply extended to a statistically varying environment. Such an extension involves substituting the objective function (power dissipation) by a function of its statistical parameters (mean, variance, etc.). The constraints in the formulation are also enforced to hold with a certain probability ($\eta$). This probability will correspond to the timing yield in a timing-constrained optimization.

One way to perform such an optimization is to sample the entire space and optimize the design while constraining the feasible space to meet the constraints for all of these samples. Moreover, if all the deterministic constraints are convex, then the constraints for the samples remain convex. The size of the sample set ($N$) needed to guarantee with a given probability ($\beta$) that the timing yield constraints will be satisfied over the entire process space was developed in [53], and is expressed as

$$ N \ge \left\lceil \frac{2n}{1 - \eta}\log\frac{12}{1 - \eta} + \frac{2}{1 - \eta}\log\frac{2}{1 - \beta} + 2n \right\rceil \tag{29} $$

where $n$ is the dimension of the deterministic problem. This shows that the number of samples becomes impractical for large values of $\eta$, and for large circuits (due to the linear dependence on $n$). Moreover, the approach does not use any available information about the distribution of the parameters or the structure of the optimization problem.

A better approach to perform circuit optimization is to analytically reformulate the chance-constrained optimization problem. Generally such problems are intractable, but in certain cases, such as an LP with ellipsoidal or polyhedral uncertainty, efficient reformulations can be obtained. The previous section discussed an LP formulation to investigate the optimal choice of process parameters in a generic logic network. This analysis was extended in [54]

to consider variations in threshold voltage and their impact on dual-Vth design, using techniques from robust optimization [55]. Let us reconsider the generic path network of Section IV. Assume that it consists of a set of paths with $N$ distinct delays, and that the number of paths with delay $D_i$ ($1 \le i \le N$) is $P_i$. To consider the worst case impact of variations on delay, the variations are assumed to be perfectly correlated on each path, and the variation across paths is assumed to be independent. Let us assume that the required timing yield of each path in the network ($\eta_k$) is predetermined, allowing the delay constraint to be re-expressed as

$$ P\left(T_{critical} - D_k \ge 0\right) \ge \eta_k. \tag{30} $$

The above inequality constrains the probability of a path with delay $D_k$ having a delay less than the critical delay of the circuit to be more than $\eta_k$. The timing constraint can be simplified using the alpha-power model as

$$ D_k = \sum_{i=1}^{n}\sum_{j=1}^{m} \frac{C_{i,j}\, V_{dd_j}}{\left(V_{dd_j} - V_{TH_k}\right)^{\alpha}} \left(1 - \frac{\Delta v_{th_k}}{V_{dd_j} - V_{TH_k}}\right)^{-\alpha} \tag{31} $$

which can be approximated as

$$ D_k = \sum_{i=1}^{n}\sum_{j=1}^{m} \frac{C_{i,j}\, V_{dd_j}}{\left(V_{dd_j} - V_{TH_k}\right)^{\alpha}} \left(1 + \frac{\alpha\,\Delta v_{th_k}}{V_{dd_j} - V_{TH_k}}\right). \tag{32} $$

This form allows us to write the delay as a linear combination of the variation in threshold voltage, represented as RVs $\Delta v_{th}$, which are assumed to be Gaussian. At this point the mean and variance of $D_k$ can be written as a function of $C$ (the vector of $C_{i,j}$s). Thus, we can express (30) as

$$ P\left(\left(T_{critical} - D_k\right) \sim N\left(\mu_k(C), \sigma_k(C)\right) \ge 0\right) \ge \eta_k \tag{33} $$

which is rewritten as

$$ \mu_k(C) + \sqrt{2}\,\mathrm{erfinv}(1 - 2\eta_k)\,\sigma_k(C) \ge 0 \tag{34} $$

where erfinv is the inverse of the error function. Equation (34) defines a convex set under the condition that $\eta_k > 0.5$ [54]. Since the target yield for a given path will always be much greater than 50%, this condition is easily satisfied.

To estimate the timing yield of each path, the convex "yield allocation" problem is formulated as

$$ \text{Min:}\ \sum_{k} [\ldots] \qquad \text{s.t.:}\ \prod_{k=1}^{N} \left(\eta_k\right)^{P_k} \ge Y; \quad 0.5 \le \eta_k \le 1, \quad k = 1, \ldots, N \tag{35} $$

where $Y$ is the desired timing yield for the network and the objective uses a constant vector representing the coefficients of the $C_{i,j}$s in the objective function of the power optimization problem. The two problems are then solved iteratively.

The importance of yield allocation is illustrated conceptually in Fig. 12, which shows an expected solution of optimal yield allocation that tightly constrains the fast paths, and is thus able to loosely constrain the slow paths while maintaining the overall yield of the network. On the other hand, uniform allocation (i.e., the same timing yield for each path) can potentially constrain paths to have a post-optimized yield higher than their initial yield (this is seen in Fig. 12 for paths with near-critical delay). In these cases, the power optimization problem may become infeasible if there are no available means to increase the timing yield of a path.

Fig. 12. Different yield allocation options can have a significant impact on the optimization problem (from [54]).

Fig. 13 shows results obtained using such a framework when the mean of the leakage power is minimized. As can be clearly seen, with an increase in the level of variability the optimal second threshold voltage reduces. For example, the difference between a purely deterministic optimization and an expected $3\sigma$ level of 30% of the mean is approximately 40 mV. This can be understood by noting that with increasing variations devices with higher nominal Vth suffer not only a larger delay penalty (due

to the growing Vth/Vdd ratio) but also a larger power penalty stemming from their larger variation. Even more critical is that the achievable leakage power savings when considering process fluctuations are significantly degraded compared to the deterministic case. For the most part this occurs because roughly half of the devices inserted at a given Vth will exhibit thresholds smaller than the nominal value. Due to the exponential dependency of leakage on Vth, the insertion of high-Vth devices will often result in much less leakage reduction than expected based on the nominal conditions alone. In Fig. 13, the power savings fall from roughly 90% in the deterministic case to just over 70% given a reasonable level of Vth variability.

Fig. 13. Average power reduction as a function of the second threshold voltage (from [54]).

Though the high-level path-based framework can be easily extended to consider variations, a similar approach runs into difficulties for circuit-level optimization. However, several techniques to perform statistical circuit optimization have been explored recently. The general technique of (30)-(34) to convert a chance-constrained linear constraint to a convex constraint was applied to the path-based gate-sizing formulation (Section II) under a linear delay model in [56]. The problem can then be solved using standard convex optimization algorithms. This work also assumes that yield allocation for the individual paths has already been performed. The work in [57] aimed to reuse the interior-point convex optimizer used in [7] for deterministic gate sizing. To enable this, the authors replace the statistical timing check with a traditional static timing check in which the delay of each gate is modeled as in (36), where $Y$ is the desired circuit yield and $\Phi$ is the cdf of the standard Gaussian RV. This approximation is mostly conservative, and akin to assuming that all variations are perfectly correlated.
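The conservatism of the per-gate model in (36) can be illustrated numerically. The following Python sketch is only an illustration under simple assumptions: gate delays are Gaussian, the example path and its numbers are made up, and the function names are hypothetical rather than taken from [57]. It compares the path delay obtained by summing per-gate percentile delays with the percentile that would result if the gate delays were independent.

```python
import math
from statistics import NormalDist


def percentile_delay(mean, sigma, target_yield):
    """Delay value mu + Phi^{-1}(Y) * sigma for a Gaussian delay, as in (36)."""
    return mean + NormalDist().inv_cdf(target_yield) * sigma


def path_delay_per_gate_model(gates, target_yield):
    """Static timing check: add up each gate's own percentile delay."""
    return sum(percentile_delay(m, s, target_yield) for m, s in gates)


def path_delay_independent(gates, target_yield):
    """Percentile of the total path delay if gate delays were independent."""
    mean = sum(m for m, _ in gates)
    sigma = math.sqrt(sum(s * s for _, s in gates))
    return percentile_delay(mean, sigma, target_yield)


if __name__ == "__main__":
    # Hypothetical path: four gates, each with (mean, sigma) in arbitrary units.
    gates = [(10.0, 1.0)] * 4
    print(path_delay_per_gate_model(gates, 0.95))
    print(path_delay_independent(gates, 0.95))
```

Summing per-gate percentile delays adds the standard deviations linearly, which is what a perfectly correlated path would give. For yields above 50% the result is therefore never smaller than the percentile under independent variation, where the standard deviations add in quadrature.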
However, the objective function is treated statistically, and a weighted sum of the squares of the first and second moments of leakage power, which are expressed as posynomials, is used. The approach in [58] was the first to map the statistical problem to a node-based formulation and is based on the two-stage iterative optimization approach [35] discussed in Section III. Node delays are again assumed to be perfectly correlated and of the form in (36), as in [57]. However, the desired improvements in leakage power (p) are used to add a second-order conic constraint to the optimization problem. Hence, the optimization is formulated as a feasibility problem, and can be expressed as a second-order conic problem (SOCP). The problem was then solved using efficient primal-dual interior-point methods, and results showed that the runtime of such an approach increases linearly with circuit size. These results should encourage CAD engineers to further investigate potential applications of interior-point algorithms.

A sensitivity-based optimization for leakage power was proposed in [59]. The "statistical sensitivities" used in this approach were generated by considering a high percentile point of the distribution of the sensitivities themselves. It is interesting to note that the incorporation of statistical sensitivities provided an additional reduction of 40% in leakage power at the tightest delay constraint compared to the case where only statistical timing analysis was used to enforce the delay constraint (illustrated in Fig. 14). This indicates that although the use of a statistical static timing analysis (SSTA) framework is clearly important, statistically modeling the power and delay impact of a change in Vth is equally critical.
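One way to picture a "statistical sensitivity" is the toy Python sketch below. It is not the algorithm of [59]: the alpha-power delay model, the exponential leakage model, every constant, the definition of sensitivity as leakage saved per unit of delay added, and all function names are assumptions made for illustration. For a candidate move of one gate from a low to a high Vth, it samples a Gaussian Vth shift, evaluates the sensitivity for each sample, and returns a chosen percentile of that distribution instead of the nominal value.

```python
import math
import random

VDD = 1.0      # supply voltage in volts (assumed)
ALPHA = 1.3    # alpha-power-law exponent (assumed)
N_VT = 0.035   # subthreshold slope parameter n*kT/q in volts (assumed)


def gate_delay(vth):
    """Alpha-power-law gate delay, up to a constant factor."""
    return VDD / (VDD - vth) ** ALPHA


def gate_leakage(vth):
    """Subthreshold leakage, exponential in Vth, up to a constant factor."""
    return math.exp(-vth / N_VT)


def sensitivity(vth_low, vth_high, dvth):
    """Leakage saved per unit of delay added when a gate moves from vth_low
    to vth_high, with both thresholds shifted by the same variation dvth."""
    d_leak = gate_leakage(vth_low + dvth) - gate_leakage(vth_high + dvth)
    d_delay = gate_delay(vth_high + dvth) - gate_delay(vth_low + dvth)
    return d_leak / d_delay


def statistical_sensitivity(vth_low, vth_high, sigma, percentile=0.95,
                            n_samples=5000, seed=0):
    """Chosen percentile of the sensitivity under Gaussian Vth variation."""
    rng = random.Random(seed)
    samples = sorted(sensitivity(vth_low, vth_high, rng.gauss(0.0, sigma))
                     for _ in range(n_samples))
    index = min(n_samples - 1, int(percentile * n_samples))
    return samples[index]


if __name__ == "__main__":
    print(sensitivity(0.20, 0.35, 0.0))              # nominal value
    print(statistical_sensitivity(0.20, 0.35, 0.03))  # 95th percentile
```

Because leakage is exponential in Vth, the sampled sensitivities are strongly skewed, so a percentile point can differ noticeably from the nominal value even for a modest spread in Vth.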
The gate delay model used in [57] is

$$\mu(D_{gate}) + \Phi^{-1}(Y)\,\sigma(D_{gate}) \tag{36}$$

where $D_{gate}$ is the statistical gate delay, $\mu$ and $\sigma$ represent the mean and standard deviation functions, $Y$ is the desired circuit yield, and $\Phi$ is the cdf of the standard Gaussian RV.

Fig. 14. Power-delay space obtained using a statistical sensitivity based optimization technique (from [59]).

Additionally, the optimization based on corner models is not able to meet very tight constraints
on the 95th percentile of the delay that are met by optimizations that employ an SSTA engine, due to the pessimism of the corner model approach. The top curve ("delay using corner models") in Fig. 14 plots the results for the optimization using corner models where the delay is calculated using worst-case models.

Although traditional optimization formulations can be extended to power optimization under timing yield constraints, they fail to consider the inverse correlation between delay and leakage power. This correlation was first considered during optimization in [60]. The approach was based on a principal component analysis (PCA) based yield analysis engine [50], where the same set of underlying RVs is used to perform both leakage and timing analysis to capture the correlation between these design parameters. The overall yield of the design can then be expressed as

$$Y = P\big(D \le D_0,\ \log P_L \le \log(P_0 - P_D)\big) \tag{37}$$

where $P_L$ and $P_D$ are the leakage and dynamic power of the design, and $P_0$ and $D_0$ represent the power and delay constraints imposed on the circuit. The above yield expression is equivalent to the integral of a bivariate Gaussian RV over a rectangular region, as shown in Fig. 15. The optimization is based on efficient computation of the gradient of the yield [as defined in (37)] with respect to the gate sizes. A cutset-based approach is used to determine the impact of gate sizing on the delay distribution, whereas the impact on leakage is easily obtained by recomputing the sum of leakage power with the upsized gate. The results showed that the technique can improve the yield of a design by as much as 40% over a deterministically optimized design.

Fig. 15. Joint probability distribution function for the bivariate Gaussian distribution of delay and leakage for a benchmark circuit. Sample constraints are shown, which can be used to calculate parametric yield (from [50]).

VI. FUTURE DIRECTIONS IN ROBUST LOW-POWER COMPUTING

Computer-aided design research tends to follow (and ideally anticipate) major trends in circuit design and process technology. Therefore, to project what lies ahead for CAD, we must first consider a roadmap for the future of IC design. While examining the prevailing trends in circuit design below, we will also discuss CAD implications to varying extents.

A. Multiple Cores

Multicore design has become the most significant trend in microprocessor design today in an effort to curb power trends while continuing to improve performance at historical rates. One particularly interesting spin on multicore design is to employ nonuniform cores [61], including running them at different supply voltages or otherwise achieving a range of power/performance design points across the chip. There are several interesting design angles that may be taken here, all pointing to more heterogeneity in future designs, which is in itself a challenge to CAD tools. These include running memories at higher voltages to improve both robustness and performance (providing better memory bandwidth) and the use of dedicated hardware accelerators for very low voltage designs, which are discussed later in this section.

B. Adaptive Design Styles

An effective way of handling the increasing uncertainty in circuit performance is to change the paradigm for the design of integrated circuits to one based on adaptive circuit fabrics. The idea of sensing the environmental conditions for a system and adjusting its behavior based on some prespecified objective is not new, and is widely used in the design of intelligent systems. In the IC design world, point techniques for sensing the process or environmental conditions are also well known [63]-[65]. However, contemporary designs are so complex that an underlying CAD infrastructure is absolutely vital to fulfilling the potential of adaptivity.
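The sense-and-adjust idea can be sketched as a small selection routine, here using body bias as the adjustable knob. This is a toy illustration, not a method from the paper: the delay and leakage models, the linear body-effect coefficient, the sign convention, all constants, and the function names are assumptions chosen for readability. Given a die's threshold-voltage offset (standing in for a sensor reading), it picks the body-bias level that meets a delay specification with the lowest leakage.

```python
import math

VDD = 1.0           # supply voltage in volts (assumed)
ALPHA = 1.3         # alpha-power-law exponent (assumed)
N_VT = 0.035        # subthreshold slope parameter in volts (assumed)
VTH_NOMINAL = 0.30  # nominal threshold voltage in volts (assumed)
BODY_COEFF = 0.1    # volts of Vth shift per volt of body bias (assumed linear)


def delay(vth):
    """Alpha-power-law delay, up to a constant factor."""
    return VDD / (VDD - vth) ** ALPHA


def leakage(vth):
    """Subthreshold leakage, exponential in Vth, up to a constant factor."""
    return math.exp(-vth / N_VT)


def choose_body_bias(die_vth_offset, bias_levels, delay_spec):
    """Return the bias level that meets delay_spec with the lowest leakage,
    or None if no available level meets the specification."""
    best = None
    for bias in bias_levels:
        # Assumed convention: forward bias (positive) lowers Vth,
        # reverse bias (negative) raises it.
        vth = VTH_NOMINAL + die_vth_offset - BODY_COEFF * bias
        if delay(vth) <= delay_spec:
            if best is None or leakage(vth) < best[1]:
                best = (bias, leakage(vth))
    return None if best is None else best[0]


if __name__ == "__main__":
    levels = [-0.4, -0.2, 0.0, 0.2, 0.4]   # hypothetical bias levels in volts
    spec = delay(0.305)                     # hypothetical delay specification
    print(choose_body_bias(0.02, levels, spec))   # slow die
    print(choose_body_bias(-0.03, levels, spec))  # fast die
```

In this toy model a slow die receives forward bias to recover speed, while a fast die receives reverse bias to cut leakage. The number of entries in `levels` is the granularity knob: more levels allow a closer match to each die but add implementation overhead.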
In particular, an adaptive design paradigm for integrated circuits should enforce predictable behavior by synthesizing adaptive circuit fabrics along with the target design. These techniques rely on sensing process and environmental conditions and adjusting certain circuit properties to keep the design within its specifications. The synthesis tools themselves insert sufficient adaptivity to guarantee specified power/performance/yield levels. This adaptivity can take the form of known, but largely unused to date, techniques such as adaptive body bias and supply voltages, or may leverage newly developed approaches as they come online (e.g., adaptive power gating [66]). Fig. 16 shows a generic depiction of the design space that an adaptive CAD tool would need to first map out and then use during the course of optimization. In this case there is an optimal granularity of applied body bias due to the tradeoff between the overhead associated with each added bias level and the diminishing power reductions provided by each level. Given that there is some consensus that designs in highly scaled CMOS will dedicate a small fraction of devices to managing chip behavior (monitoring, adjusting, optimizing traffic on-the-fly), CAD support for such a design style is clearly a major gap in current research.

Fig. 16. In applying adaptive techniques, the design space must be carefully explored to guide CAD tool decisions.

C. Ultralow-Voltage Design

There is growing interest in the use of ultralow supply voltages (i.e., 400 mV), with applications ranging from wireless integrated microsystems for remote sensing and biomedical applications to high-performance tasks that use multiple low-voltage cores to recoup throughput [67]. The major technology trends associated with this type of design are that while the impact of process variability is very high (due to the heightened sensitivity of current to Vth and Vdd in the subthreshold regime), performance requirements are typically not stringent. This eases CAD requirements overall, but the focus on energy in these applications (leading to battery life improvements) indicates that there should be a refocus on that metric during design optimization. For instance, it has been shown that minimum energy per operation is achieved at a specific supply voltage, denoted $V_{min}$, depending on circuit topology, activity, etc. (Fig. 17). Since energy efficiency is not monotonic with Vdd, there is some level of analysis to be done in determining the appropriate supply voltage for purely energy-constrained designs. Furthermore, gate-sizing choices are nontrivial in these types of systems, as larger gates can be used intelligently to improve cycle time, thereby limiting leakage energy per operation and improving overall energy efficiency. Supply voltage and gate-sizing selections can be concurrently optimized to carefully trade off dynamic and leakage energies, and potentially greatly improve battery life for applications like sensor networks. Given recent statements by a leading microprocessor company that energy efficiency is central to their future plans [68], ideas such as these may eventually be applied to more mainstream applications as well.

Fig. 17. Simulated energy consumption of a 50-stage inverter chain in 130-nm CMOS shows a clear minimum versus supply voltage, arising from the exponentially longer instruction execution time in subthreshold (from [67]). Similar trends hold for more complex logic blocks, which has been verified experimentally ([102], [103]).

D. The Role of Standard Cells

In the future, CAD tools must be able to seamlessly handle the many standard cell variants expected in new technologies. This explosion in variants arises both from the growing number of techniques being developed to address power and robustness and from the potential gains achievable by applying these techniques at finer levels of granularity. For instance, tuning standard cell layouts can be useful in addressing systematic variability sources, such as those based on optical proximity effects. These are good examples of low-hanging fruit: implementation costs are low in many cases, with high payoffs (e.g., the systematic behavior of critical dimension linewidth with respect to lithographic defocus, or the so-called iso-dense effect [98]). Recent work has also shown that transistor-level Lgate biasing can improve upon the results of [37] by more than 15% in average leakage and 39% in leakage variability, translating directly to higher yield at the cost of more complex libraries. In general, libraries represent the interface between IC performance and manufacturing. Many design-for-manufacturability (DFM) and leakage reduction approaches (e.g., the probability-aware dual-Vth assignment approach of [100]
relies on an expanded standard cell library) will be implemented via enhancements to the current standard cell library paradigm. While approaches calling for four flavors of every cell type may seem impractical, studies showing that the vast majority of instantiated cells in designs come from a small group of master cells [106] point to the possibility of applying these customizations to only a small subset of the total library while maintaining their efficacy.

One commonly cited approach to improving manufacturability in future technology nodes is the use of highly regular designs (beyond simple transistor orientation restrictions or forbidden pitches, which are already common today) [69], [101]. It seems reasonable that regularity will indeed be required to improve manufacturability and reduce variability levels, and this may occur as early as 45 nm. At the same time, however, the adaptive circuit techniques mentioned earlier in this section provide a method of coping with high levels of variability. As the primary example, adaptive body biasing has previously been shown to be an effective yield enhancement technique since it can be used to rein in leakage on fast parts as well as boost speed for slow chips. On the other hand, the power impact of highly regular design is yet to be determined. An open question here is: is it more effective/cheaper to incorporate regularity (extreme "correct by construction") or intelligent adaptivity ("sense and correct")?

E. Multiobjective Optimization

Statistical static timing analysis (SSTA) is currently an active research area [52], [70]-[72], and several EDA companies are now starting to bring SSTA solutions to the marketplace. Nonetheless, SSTA is unproven, based on an abundance of simplifications, difficult to practically implement, and not highly useful without optimization capabilities (of which there has been almost no work to date).
In addition, a number of modeling issues remain to be addressed in SSTA, including the dependence of delay on load and slope, interaction between interconnect and gate delays under process variations, and the impact of noise on delay. These issues need to be resolved before SSTA can replace traditional static timing analysis. Moreover, optimization using SSTA requires the definition of statistical sensitivities, and no approaches have yet been developed for their efficient computation.

Eventually parametric yield should be the objective function of CAD flows, not simply timing or power or area, etc. The key question then becomes: what approaches should be taken to this difficult problem? Regardless of the exact approaches taken, it is vital that CAD researchers take up this challenge aggressively. There does not yet seem to be a general direction for this work given the little empirical evidence in the literature today. However, we anticipate that most approaches will seek to build on mainstream SSTA approaches ([52], [70] in particular) and use these within the core optimization engine. We suggest that alternatives to this path that are more palatable to designers should be explored in parallel. One possibility is the use of well-established and fast deterministic approaches, combined with variation space sampling, to yield an efficient and proven robust design strategy. An added advantage of such an approach would be its generality and applicability to a wide range of optimization tasks that may be difficult to address in existing SSTA frameworks [113].

F. Interconnect Design Trends

Process technology scaling increases the fraction of delay attributable to interconnect due to the reverse scaling properties of wires. That is, since shrinking wires leads to larger delays whereas smaller devices are faster, the fraction of total circuit delay allocated to wiring has grown dramatically.
In certain cases interconnect optimization alone is unable to achieve the desired operating frequencies. In such cases the wire delay is distributed over multiple cycles by wire pipelining, which involves inserting flip-flops. However, additional flip-flops increase latency in wire-pipelined buses. The impact of this increased latency on performance depends on the microarchitecture. This has led to microarchitecture-aware floorplanning approaches that attempt to reduce the need for pipelining on performance-critical wires [111], [112]. The growth in wire delay is only problematic in global wires, for which the wirelength is sufficient to lead to large values of the quadratic RC delay term. Local wires are short enough that they can still often be modeled as lumped capacitances, given that their resistance is much smaller than the effective impedance of the active drivers [91]. To combat the observed quadratic delay dependency of global wires on line length, large CMOS inverters (repeaters) are inserted uniformly along these wires to reduce the delay dependency to linear [92]. This well-known technique has been in use for many years and has limited the negative (delay) impact of wire scaling to a large extent.

However, there are several related challenges that have cropped up due to the enormous numbers of repeaters in use today in high-end designs. Foremost among these is the growing power demand of these repeaters. In [93], the authors demonstrate that the number of nets requiring repeaters is growing rapidly (see footnote 5). With inevitable increases in wiring resistance due to scaling of cross-sectional dimensions, shorter and shorter wires are now buffered/repeated. Thus, more wires appear to be global. The net result of this trend is that chip repeater count grows roughly as $S^3$, where $S$ is the scale factor ($S \approx 1.4$ per generation). Combined with the fact that repeaters are very wide (often with $W/L \approx 100$), the total switched capacitance on global wires is very large. A recent report detailing the total power breakdown in a high-performance mobile microprocessor part shows that global signals (excluding clock) make up 21% of total chip power [94]. Furthermore, [94] reports the average switching activity of global signals as 0.05; this relatively low activity factor indicates that leakage will be a significant component of total power in global signaling. Also, the stack effect, which reduces runtime-mode leakage when input patterns lead to series-connected OFF devices, is not present in repeaters, further enhancing the importance of leakage.

Footnote 5: Although the authors of [93] use pessimistic assumptions in their analysis, the general trends described remain valid.

Footnote 6: Throughput can be maintained or even improved by having several pulses active on a line simultaneously, referred to as interconnect wave pipelining [86]. This threatens the robustness of the technique, as dispersion along the line can easily lead to data loss, and is not necessary for pulsed signaling to be advantageous.

Design solutions are central to addressing the power problems posed by modern on-chip wiring. These solutions must continue to meet stringent timing and signal integrity requirements while reducing both static and dynamic power. Other interconnect-related work, such as improved models and better materials, cannot provide as significant a gain in performance as improved signaling techniques (in the near to midrange timeframe). For example, while copper was adopted largely due to its better resistivity compared to aluminum, it does not scale well due to the need for barrier or cladding layers. There are no other materials on the horizon to provide lower resistance. The introduction of low-k materials to address back-end capacitance has been much slower than predicted by earlier versions of the ITRS [95]. In terms of modeling, sufficient accuracy is achievable today. Also, designers seek to avoid difficult-to-model cases, such as highly inductive lines, by using aggressive shielding. Given this, a question becomes: How will design adapt? We suggest the following general approaches.
1) Let buses evolve by using smarter repeaters.
2) Borrow techniques from off-chip signaling such as low-swing and ultrahigh-speed serial links or potentially low-overhead bus encoding techniques, particularly for peak-power-constrained applications [96].
3) Optimize global wiring dimensions targeting low power rather than performance, trading off density/cost to limit communication-driven power consumption. There is very little work in this area today.

Focusing on the first point, there are numerous examples of power-efficient approaches to on-chip global communication in the literature [81]-[84], [88]-[90]. One of the most promising areas for future work is in pulsed signaling, where a brief voltage pulse is propagated down a long wire to indicate a change in the present state of the signal. Only a small fraction of the total line capacitance is charged in this case, greatly reducing power consumption while maintaining high performance (see footnote 6). However, pulsed signaling is inherently sensitive to process variation, which can lead to attenuation (causing lost data) or broadening of the pulse (wasting power), particularly in the back end of the line (BEOL). Since BEOL variability levels tend to be much higher than those of front-end parameters [87], this again points to a compromise between power and robustness. One option here is to introduce adaptivity into the design of these pulsed signaling circuits: the pulse duration being received at the end of the line can be monitored, and parameters such as input pulsewidth and driver size can be modified accordingly to maintain balance between signal attenuation and broadening. Regardless of how traditional repeater-based signaling strategies evolve, ease of implementation is crucial to their adoption; they should be consistent with the current repeater paradigm and essentially constitute a "drop-in" addition for designers and CAD tool developers.

VII. CONCLUSION

Keeping industry on an exponential growth curve is vital to exploiting the promise of future ICs (e.g., to aid in computational biology research, cosmological data mining, etc.). Growing power consumption is widely recognized as a key limiter to the continuation of Moore's law, as has been the case in several technology inflection points in the past several decades. However, unlike these previous cases, there are no viable technological alternatives to CMOS, meaning that design and CAD will need to more actively reinforce process scaling improvements in the near term. Another major trend is that design is becoming probabilistic, threatening the robustness of future ICs; both design and CAD are presently ill-equipped to handle this transition. We have argued in this paper that these two design goals (robustness and low power) are often conflicting, citing a range of examples from soft errors to the nature of path delay histograms pre- and post-optimization. Compromises cannot be afforded in either of these two metrics; merely continuing along current trends is untenable in either the power or reliability dimensions. What is needed is multilevel innovation where power and robustness are considered jointly throughout the design process. By applying newly emerging techniques at the device, circuit, CAD, and architecture levels, we believe that industry will successfully meet these challenges. In summary, the two primary implications of such a multiarea approach to the future of CAD are: 1) a need for CAD researchers and engineers to broaden their perspective through detailed collaborations with device engineers, computer architects, and both analog and digital circuit designers and 2) that the CAD community will provide an increasing amount of "equivalent technology scaling" to the industry as difficulties in scaling CMOS devices beyond 32 nm mount.

REFERENCES

[1] F. Najm, "Transition density: A new measure of activity in digital circuits," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 12, no. 2, pp , Feb
[2] A. Shen, A. Ghosh, S. Devadas, and K. Keutzer, "On average power dissipation and random pattern testability of CMOS combinational logic networks," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 1992, pp
[3] P. K. Chan, "Algorithms for library-specific sizing of combinational logic," in Proc. IEEE/ACM Design Automation Conf., 1990, pp
[4] A. Oliveira and R. Murgai, "An exact gate assignment algorithm for tree circuits under rise and fall delays," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 2000, pp
[5] O. Coudert, R. W. Haddad, and S. Manne, "New algorithms for gate sizing: A comparative study," in IEEE/ACM Design Automation Conf., 1996, pp
[6] J. Fishburn and A. Dunlop, "TILOS: A posynomial programming approach to transistor sizing," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 1985, pp
[7] S. S. Sapatnekar, V. B. Rao, and P. M. Vaidya, "A convex optimization approach to transistor sizing for CMOS circuits," in Proc. IEEE Int. Conf. Computer-Aided Design, 1991, pp
[8] C. Chen, A. Srivastava, and M. Sarrafzadeh, "On gate level power optimization using dual-supply voltages," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 5, pp , Oct
[9] M. Berkelaar and J. Jess, "Gate sizing in MOS digital circuits with linear programming," in Proc. Eur. Design Automation Conf., 1990, pp
[10] M. Ketkar, K. Kasamsetty, and S. S. Sapatnekar, "Convex delay models for transistor sizing," in Proc. IEEE/ACM Design Automation Conf., 2000, pp
[11] H. Tennakoon and C. Sechen, "Gate sizing using Lagrangian relaxation combined with a fast gradient-based pre-processing step," in Proc. IEEE Int. Conf. Computer-Aided Design, pp
[12] A. R. Conn, P. K. Coulman, R. A. Haring, G. L. Morrill, C. Visweswariah, and C. W. Wu, "JiffyTune: Circuit optimization using time-domain sensitivities," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 17, no. 12, pp , Dec
[13] S. S. Sapatnekar and W. Chuang, "Power versus delay in gate sizing: Conflicting objectives?" in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 1995, pp
[14] H. Tennakoon and C. Sechen, "Efficient and accurate gate sizing with piecewise convex delay models," in Proc. IEEE/ACM Design Automation Conf., 2005, pp
[15] M. Hamada, Y. Ootaguro, and T. Kuroda, "Utilizing surplus timing for power reduction," in Proc. Custom Integrated Circuits Conf., 2001, pp
[16] K. Usami and M. Horowitz, "Clustered voltage scaling technique for low-power design," in Proc. Int. Symp. Low-Power Electronics and Design, 1995, pp
[17] K. Usami et al., "Automated low-power technique exploiting multiple supply voltage applied to a media processor," IEEE J. Solid-State Circuits, vol. 33, no. 3, pp , Mar
[18] S. Kulkarni and D. Sylvester, "Fast and energy-efficient asynchronous level converters for multi-Vdd design," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 9, pp , Sep
[19] F. Ishihara, F. Sheikh, and B. Nikolic, "Level conversion for dual-supply systems," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 2, pp , Feb
[20] V. Sundararajan and K. Parhi, "Synthesis of low power CMOS VLSI circuits using dual supply voltages," in Proc. IEEE/ACM Design Automation Conf., 1999, pp
[21] S. Kulkarni, A. Srivastava, and D. Sylvester, "A new algorithm for improved VDD assignment in low power dual VDD systems," in Proc. IEEE/ACM Int. Symp. Low-Power Electronics Design, 2004, pp
[22] C. Yeh et al., "Gate-level design exploiting dual supply voltages for power-driven applications," in Proc. IEEE/ACM Design Automation Conf., 1999, pp
[23] A. Srivastava, D. Sylvester, and D. Blaauw, "Power minimization using simultaneous gate sizing, dual-Vdd, and dual-Vth assignment," in Proc. IEEE/ACM Design Automation Conf., 2004, pp
[24] C. Chen, A. Srivastava, and M. Sarrafzadeh, "On gate-level power optimization using dual supply voltages," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 5, pp , Oct
[25] M. Takahashi et al., "A 60-mW MPEG4 video codec using clustered voltage scaling with variable supply-voltage scheme," IEEE J. Solid-State Circuits, pp , Nov
[26] S. Ercolani et al., "Estimate of signal probability in combinational logic networks," in Proc. Eur. Test Conf., 1989, pp
[27] K. Flautner et al., "Drowsy caches: Simple techniques for reducing leakage power," in Proc. ACM/IEEE Int. Symp. Microarchitecture, 2002, pp
[28] A. Srivastava, "Simultaneous Vt selection and assignment for leakage optimization," in Proc. Int. Symp. Low-Power Electronics Design, 2003, pp
[29] V. Sundararajan and K. Parhi, "Low power synthesis of dual threshold voltage CMOS VLSI circuits," in Proc. Int. Symp. Low-Power Electronics Design, 1999, pp
[30] L. Wei, K. Roy, and C. Koh, "Power minimization by simultaneous dual-Vth assignment and gate sizing," in Proc. Custom Integrated Circuits Conf., 2000, pp
[31] Q. Wang and S. Vrudhula, "Algorithms for minimizing standby power in deep submicron, dual-Vt CMOS circuits," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 21, no. 3, pp , Mar
[32] P. Pant, R. Roy, and A. Chatterjee, "Dual-threshold voltage assignment with transistor sizing for low power CMOS circuits," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 2, pp , Apr
[33] S. Sirichotiyakul et al., "Stand-by power minimization through simultaneous threshold voltage selection and circuit sizing," in Proc. IEEE/ACM Design Automation Conf., 1999, pp
[34] M. Ketkar and S. S. Sapatnekar, "Standby power optimization via transistor sizing and dual threshold voltage assignment," in Proc. IEEE/ACM International Conf. Computer-Aided Design, 2002, pp
[35] D. Nguyen et al., "Minimization of dynamic and static power through joint assignment of threshold voltages and sizing optimization," in Proc. Int. Symp. Low-Power Electronics Design, 2003, pp
[36] S. Shah, A. Srivastava, V. Zolotov, D. Sharma, D. Sylvester, and D. Blaauw, "Discrete Vt assignment and gate sizing using a self-snapping continuous formulation," in Proc. ACM/IEEE Int. Conf. Computer-Aided Design, Nov. 6-10, 2005, pp
[37] P. Gupta, A. B. Kahng, P. Sharma, and D. Sylvester, "Selective gate-length biasing for cost-effective runtime leakage control," in IEEE/ACM Design Automation Conf., 2004, pp
[38] A. Chandrakasan, W. J. Bowhill, and F. Fox, Design of High-Performance Microprocessor Circuits. Piscataway, NJ: IEEE Press,
[39] J. P. Halter and F. N. Najm, "A gate-level leakage power reduction method for ultra-low-power CMOS circuits," in Proc. Custom Integrated Circuits Conf., Santa Clara, CA, May 5-8, 1997, pp
[40] J. Kao et al., "MTCMOS hierarchical sizing based on mutual exclusive discharge patterns," in Proc. IEEE/ACM Design Automation Conf., 1998, pp
[41] M. Anis, S. Areibi, and M. I. Elmasry, "Dynamic and leakage power reduction in MTCMOS circuits using an automated efficient gate clustering technique," in Proc. Design Automation Conf., Jun
[42] V. Khandelwal and A. Srivastava, "Leakage control through fine-grained placement and sizing of sleep transistors," in Proc. IEEE/ACM Int. Conf. Computer Aided Design, Nov
[43] B. Calhoun, F. Honore, and A. Chandrakasan, "Design methodology for fine-grained leakage control in MTCMOS," in Proc. Int. Symp. Low-Power Electronics Design, 2003, pp
[44] A. Srivastava and D. Sylvester, "Minimizing total power by simultaneous Vdd/Vth assignment," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 23, no. 5, pp , May
[45] D. Chinnery and K. Keutzer, "Linear programming for sizing, Vth and Vdd assignment," in Proc. Int. Symp. Low-Power Electronics Design, 2005, pp
[46] W. Hung, Y. Xie, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and Y. Tsai, "Total power optimization through simultaneous multiple-Vdd, multiple-Vth assignment and device sizing with stack forcing," presented at the Int. Symp. Low-Power Electronics Design, Newport Beach, CA, Aug
[47] A. Srivastava, M. R. Bai, D. Sylvester, and D. Blaauw, "Modeling and analysis of leakage power considering within-die process variations," in Proc. Int. Symp. Low-Power Electronics Design, 2002, pp
[48] R. Rao, A. Srivastava, D. Blaauw, and D. Sylvester, "Statistical estimation of leakage current considering inter- and intra-die process variation," in Proc. Int. Symp. Low-Power Electronics Design, 2003, pp
[49] R. Rao, A. Devgan, D. Blaauw, and D. Sylvester, "Parametric yield estimation considering leakage variability," in Proc. IEEE/ACM Design Automation Conf., 2004, pp
[50] A. Srivastava, S. Shah, K. Agarwal, D. Sylvester, D. Blaauw, and S. Director,

22 BAccurate and efficient gate-level parametric yield estimation considering power/ performance correlation,[ in Proc. ACM/IEEE Design Automation Conf., 2005, pp [51] H. Chang and S. S. Sapatnekar, BFull-chip analysis of leakage power under process variations, including spatial correlations,[ in Proc. ACM/IEEE Design Automation Conf., 2005, pp [52] VV, BStatistical timing analysis considering spatial correlations using a single PERT-like traversal,[ in Proc. ACM/IEEE Int. Conf. Computer Aided-Design, 2003, pp [53] M. C. Campi and G. Calafiore, BDecision making in an uncertain environment: The scenario based optimization approach,[ in Multiple Participant Decision Making, J. Andrysek, M. Karny, and J. Kracik, Eds. Adelaide: Advanced Knowledge International, 2004, pp [54] A. Srivastava and D. Sylvester, BA general framework for probabilistic low-power design space exploration considering process variation,[ in Proc. ACM/IEEE Int. Conf. Computer-Aided Design, 2004, pp [55] A. Ben-Tal and A. Nemirovski, BRobust solutions of uncertain linear programs,[ Oper. Res. Lett., vol. 25, pp. 1 13, [56] M. Mani and M. Orshansky, BA new statistical algorithm for gate sizing,[ in Proc. ACM/IEEE Int. Conf. Computer Design, 2004, pp [57] S. Bhardwaj and S. Vrudhula, BLeakage minimization of nano-scale circuits in the presence of systematic and random variations,[ in Proc. IEEE/ACM Design Automation Conf., Jun [58] M. Mani, M. Orshansky, and A. Devgan, BAn efficient algorithm for statistical minimization of total power under timing yield constraints,[ in Proc. ACM/IEEE Design Automation Conf., 2005, pp [59] A. Srivastava, D. Sylvester, and D. Blaauw, BStatistical optimization of leakage power considering process variations using dual-vth and sizing,[ in Proc. ACM/IEEE Design Automation Conf., 2004, pp [60] K. Chopra, S. Shah, A. Srivastava, D. Sylvester, and D. 
Blaauw, "Parametric yield maximization using gate sizing based on efficient statistical power and delay gradient computation," in Proc. ACM/IEEE Int. Conf. Computer-Aided Design, Nov. 6–10, 2005, pp
[61] R. Kumar et al., "Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction," in Proc. ACM/IEEE Int. Symp. Microarchitecture (MICRO), 2003, pp
[62] S. Mutoh et al., "A 1-V power supply high-speed digital circuit technology with multi-threshold voltage CMOS," IEEE J. Solid-State Circuits, vol. 30, no. 8, pp , Aug
[63] J. Tschanz et al., "Adaptive body bias for reducing impacts of die-to-die and within-die parameter variations on microprocessor frequency and leakage," in Proc. ISSCC, 2002, pp
[64] T. Chen et al., "Comparison of adaptive body bias (ABB) and adaptive supply voltage (ASV) for improving delay and leakage under the presence of process variation," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 11, no. 5, pp , Oct
[65] M. Meijer, F. Pessolano, and J. Pineda de Gyvez, "Technology extrapolation for adaptive power and frequency scaling in 90 nm CMOS," in Proc. ISLPED, 2004, pp
[66] H. Deogun, R. M. Rao, D. Sylvester, and K. Nowka, "Adaptive MTCMOS for dynamic leakage and frequency control using variable footer strength," in Proc. IEEE System-on-Chip Conf., 2005, pp
[67] S. Hanson et al., "Ultra-low voltage, minimum energy CMOS," IBM J. Res. Develop., pp , Jul./Sep
[68] J. Rattner, DesignCon 2006, keynote address.
[69] L. W. Liebmann, "Layout impact of resolution enhancement techniques: Impediment or opportunity?" in ACM/IEEE Int. Symp. Physical Design, 2003, pp
[70] C. Visweswariah et al., "First-order incremental block-based statistical timing analysis," in Proc. ACM/IEEE Design Automation Conf., 2004, pp
[71] A. Devgan et al., "Block-based static timing analysis with uncertainty," in Proc. ACM/IEEE Int. Conf. Computer-Aided Design, 2003, pp
[72] A. Agarwal, D. Blaauw, and V. Zolotov, "Statistical timing analysis for intra-die process variations with spatial correlations," in Proc. ACM/IEEE Int. Conf. Computer-Aided Design, 2003, pp
[73] X. Bai, C. Visweswariah, and P. N. Strenski, "Uncertainty-aware circuit optimization," in ACM/IEEE Design Automation Conf., 2002, pp
[74] S. H. Gunther et al., "Managing the impact of increasing microprocessor power consumption," Intel Technol. J., 1st quarter,
[75] H. Lee et al., "Sub-5 nm all-around gate FinFET for ultimate scaling," in IEEE Symp. VLSI Technology, Dig. Tech. Papers, 2006, pp
[76] U. Ghoshal and R. Schmidt, "Refrigeration technologies for sub-ambient temperature operation of computing systems," in ISSCC 2000, paper TD13.2 (slide supplement).
[77] B. Zhai, S. Hanson, D. Blaauw, and D. Sylvester, "Analysis and mitigation of variability in subthreshold design," in IEEE/ACM Int. Symp. Low-Power Electronics Design, 2005, pp
[78] B. Zhai, D. Blaauw, D. Sylvester, and K. Flautner, "The limit of dynamic voltage scaling and extended DVS," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 11, pp , Nov
[79] M. Mills, "The Internet begins with coal," Forbes, May
[80] K. Roth et al., "Energy consumption by commercial office and telecommunications equipment," report by Arthur D. Little for the U.S. Department of Energy, Dec
[81] P. Wang, G. Pei, and E. Kan, "Pulsed wave interconnect," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 5, pp , May
[82] M. Chen and Y. Cao, "Analysis of pulse signaling for low-power on-chip global bus design," in Proc. IEEE Int. Symp. Quality Electronic Design, 2006, p. 6.
[83] H. Kaul and D. Sylvester, "A novel buffer circuit for energy efficient signaling in dual-Vdd systems," in Proc. ACM/IEEE Great Lakes Symp. VLSI, 2005, pp
[84] R. Bashirullah et al., "A 16 Gb/s adaptive bandwidth on-chip bus based on hybrid current/voltage mode signaling," in Proc. IEEE VLSI Symp. Circuits, 2004, p
[85] D. Sylvester, H. Kaul, K. Agarwal, R. M. Rao, S. Nassif, and R. B. Brown, "Power-aware global signaling strategies," in Proc. IEEE Int. Symp. Circuits and Systems, 2005, pp
[86] L. Zhang, Y. Hu, and C. C.-P. Chen, "Wave-pipelined on-chip global interconnect," in ACM/IEEE Asia-South Pacific Design Automation Conf., Jan. 2005, pp
[87] S. R. Nassif, "Modeling and analysis of manufacturing variations," in Proc. IEEE Custom Integrated Circuits Conf., 2001, pp
[88] H. Kaul and D. Sylvester, "Low-power global IC communication based on transition-aware global signaling," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 5, pp , May
[89] M. Khellah et al., "Static pulsed buses for on-chip interconnects," in Proc. Symp. VLSI Circuits, 2002, pp
[90] R. M. Rao et al., "Approaches to runtime and standby mode leakage reduction on global buses," in Proc. ISLPED, 2004, pp
[91] D. Sylvester and K. Keutzer, "Getting to the bottom of deep submicron," in Proc. ICCAD, 1998, pp
[92] H. B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI. Reading, MA: Addison-Wesley,
[93] P. Saxena, N. Menezes, P. Cocchini, and D. A. Kirkpatrick, "The scaling challenge: Can correct-by-construction design help?" in Proc. ISPD, 2003, pp
[94] N. Magen, A. Kolodny, U. Weiser, and N. Shamir, "Interconnect-power dissipation in a microprocessor," in Proc. SLIP, 2004, pp
[95] International Technology Roadmap for Semiconductors,
[96] H. Kaul, D. Sylvester, M. Anders, and R. Krishnamurthy, "Design and analysis of spatial encoding circuits for peak power reduction in on-chip buses," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 11, pp , Nov
[97] S. Narendra, "Leakage issues in IC design: Trends, estimation and avoidance," in ICCAD, 2003, full-day tutorial.
[98] P. Gupta, A. B. Kahng, Y. Kim, and D. Sylvester, "Self-compensating design for focus variation," in Proc. ACM/IEEE Design Automation Conf., 2005, pp
[99] D. Lee, D. Blaauw, and D. Sylvester, "Static leakage reduction through simultaneous Vt/Tox and state assignment," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 24, no. 7, pp , Jul
[100] D. Lee, D. Blaauw, and D. Sylvester, "Runtime leakage minimization through probability-aware dual-Vt or dual-Tox assignment," in Proc. ACM/IEEE Asia-South Pacific Design Automation Conf., 2005, pp
[101] V. Kheterpal et al., "Design methodology for IC manufacturability based on regular logic-bricks," in Proc. ACM/IEEE Design Automation Conf., 2005, pp
[102] B. Calhoun and A. Chandrakasan, "Characterizing and modeling minimum energy operation for subthreshold circuits," in Proc. ACM/IEEE Int. Symp. Low Power Electronics and Design, 2004, pp
[103] B. Zhai et al., "A 2.6 pJ/inst subthreshold sensor processor for optimal energy efficiency," presented at the IEEE Int. Symp. VLSI Circuits, Honolulu, HI,
[104] A. Wang and A. Chandrakasan, "A 180 mV FFT processor using subthreshold circuit techniques," in Proc. IEEE Int. Solid-State Circuits Conf., 2004, pp
[105] R. H. Dennard et al., "Design of ion-implanted MOSFETs with very small physical dimensions," IEEE J. Solid-State Circuits, vol. 9, no. 5, pp , Oct
[106] R. Puri, "Design and CAD challenges in sub-90 nm CMOS technologies," presented at the ACM/IEEE Int. Conf. Computer-Aided Design, San Jose, CA, Nov , 2003, embedded tutorial presentation.
[107] G. E. Tellez, A. Farrahi, and M. Sarrafzadeh, "Activity driven clock design for low power circuits," in Proc. ACM/IEEE Int. Conf. Computer-Aided Design, 1995, pp
[108] A. Srivastava, D. Sylvester, and D. Blaauw, Statistical Analysis and Optimization for VLSI: Timing and Power. New York: Springer,
[109] M. C. Johnson, D. Somasekhar, and K. Roy, "Models and algorithms for bounds on leakage in CMOS circuits," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 18, no. 6, pp , Jun
[110] A. Keshavarzi, S. Narendra, B. Bloechel, S. Borkar, and V. De, "Forward body bias for microprocessors in 130 nm technology generation and beyond," in Proc. IEEE Int. Symp. VLSI Circuits, 2002, pp
[111] V. Nookala, Y. Chen, D. Lilja, and S. Sapatnekar, "Microarchitecture-aware floorplanning using a statistical design of experiments approach," in Proc. ACM/IEEE Design Automation Conf., 2005, pp
[112] C. Long et al., "Floorplanning optimization with trajectory piecewise-linear model for pipelined interconnects," in Proc. ACM/IEEE Design Automation Conf., 2004, pp
[113] S. H. Kulkarni, D. Sylvester, and D. Blaauw, "A statistical framework for post-silicon tuning through body bias clustering," in Proc. ACM/IEEE Int. Conf. Computer-Aided Design, 2006, pp
[114] S. Inumiya et al., "A thermally-stable sub-0.9 nm EOT TaSix/HfSiON gate stack with high electron mobility suitable for gate-first fabrication of hp45 LOP devices," in IEEE Int. Electron Devices Meeting, 2005, pp
[115] Y. T. Hou et al., "High performance tantalum carbide metal gate stacks for nMOSFET application," in IEEE Int. Electron Devices Meeting, 2005, pp
[116] M. A. Quevedo-Lopez et al., "High performance gate first HfSiON dielectric satisfying 45 nm node requirements," in IEEE Int. Electron Devices Meeting, 2005, pp

ABOUT THE AUTHORS

Dennis Sylvester (Senior Member, IEEE) received the B.S. degree in electrical engineering (summa cum laude) from the University of Michigan, Ann Arbor, in 1995, and the M.S. and Ph.D. degrees in electrical engineering from the University of California, Berkeley, in 1997 and 1999, respectively. His dissertation research was recognized with the 2000 David J. Sakrison Memorial Prize as the most outstanding research in the UC-Berkeley EECS department. He is now an Associate Professor of electrical engineering and computer science at the University of Michigan, Ann Arbor. He was also a Visiting Associate Professor of electrical and computer engineering at National University of Singapore during the academic year. He previously held research staff positions in the Advanced Technology Group of Synopsys, Mountain View, CA, and at Hewlett-Packard Laboratories, Palo Alto, CA. He has published numerous articles along with one book and several book chapters in his field of research, which includes low-power circuit design and design automation techniques, design-for-manufacturability, and on-chip interconnect modeling. He also serves as a consultant and technical advisory board member for several electronic design automation firms in these areas. Dr. Sylvester received an NSF CAREER award, the 2000 Beatrice Winner Award at ISSCC, an IBM Faculty Award, an SRC Inventor Recognition Award, and several best paper awards and nominations.
He is the recipient of the ACM SIGDA Outstanding New Faculty Award, the 1938E Award for teaching and mentoring and Vulcans Education Excellence Award from the College of Engineering, and the University of Michigan Henry Russel Award. He has served on the technical program committee of numerous design automation and circuit design conferences and was General Chair of the 2003 ACM/IEEE System-Level Interconnect Prediction (SLIP) Workshop and 2005 ACM/IEEE Workshop on Timing Issues in the Synthesis and Specification of Digital Systems (TAU). He is currently an Associate Editor for IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS and IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS. He helped define the circuit and physical design roadmap as a member of the International Technology Roadmap for Semiconductors (ITRS) U.S. Design Technology Working Group from 2001 to He is a member of ACM, American Society of Engineering Education, and Eta Kappa Nu.

Ashish Srivastava (Member, IEEE) received the B.Tech. degree in electrical engineering from the Indian Institute of Technology, Kanpur, India, in 2001, and the M.S. and Ph.D. degrees in electrical engineering from the University of Michigan, Ann Arbor, in 2003 and 2005, respectively. Currently, he is with Magma Design Automation, Austin, TX, where he is a Senior Member of technical staff. He is the author of several papers and one book in his area of research interests, which include power and timing analysis and CAD techniques for circuit optimization.

Proceedings of the IEEE, Vol. 95, No. 3, March 2007
