Subthreshold Logic Energy Minimization with Application-Driven Performance

William Biederman, Daniel Yeager
EECS, UC Berkeley

Abstract

Ultra-low energy/instruction digital circuits in aggressively scaled technologies suffer from high σ/µ delay variation due to random dopant fluctuations (RDF) as well as high temperature/throughput sensitivity. Many solutions have been proposed in the literature to improve efficiency and robustness, including device upsizing, adaptive body and supply voltage, power gating, logic family selection, pipeline depth optimization, replay on error, and replica timing. These solutions require significant design margin, which degrades energy efficiency, and/or fail to meet timing constraints in practical applications. We propose the use of timing error detection to achieve minimum energy operation despite uncorrelated (RDF) delay variation, and replica-based body bias to combat correlated (temperature, mean-process) delay variation. This enables a design with guaranteed performance for practical, real-world applications.

Fig. 1. Classic theoretical bound on energy/operation vs. supply voltage without a throughput constraint. $V_{th} \approx 0.4$ V [3].

I. INTRODUCTION

Emerging applications in shipment tracking (both active and passive RFID), implantable and wearable biomedical electronics, and environmental monitoring have created new commercial markets that drive technology growth in a different way. These applications require a fixed level of computation, and the key performance metric is energy/op minimization. Minimizing energy/op provides maximum battery life for active devices or maximum wirelessly powered range for passive devices. Often, these devices are built into infrastructure such as roads or buildings with the expectation of a 10-year battery life or replacement interval.
The relationship between lifespan, battery size, frequency of operation, and E/op is given by the following equation:

    $E_{batt} = E_{op} \, f \, t_{life}$    (1)

This implies that, to achieve the minimum life span $t_{life}$ required by a fixed-performance application, the maximum $E_{op}$ is constrained by battery size. Any energy minimization beyond this point translates into additional performance or sophistication (such as signal processing), which opens new application spaces.

Aggressive energy minimization typically results in a subthreshold supply voltage for kHz to MHz clock frequency applications. The exponential dependence of MOS current on $V_{th}$ in the subthreshold regime causes $V_{th}$ variation to severely impact delay and energy. Section III reviews prior work in reducing and mitigating variation. However, prior works fail to address throughput needs for practical applications. We propose the use of timing error detection in an energy-minimization control loop to guarantee throughput (Section VII). Global body bias is performed open loop using an analog compensation circuit that senses device $V_{th}$. Optimization of the pipeline logic depth and device sizing given an energy tracking loop has not been presented in the literature, so we propose a method of selecting design parameters based on the process node and desired operating frequency (Section VI).

II. PROBLEM ANALYSIS

A. Impact of $V_{th}$ on minimum energy operating point

The total energy consumed by a digital circuit can be modeled as:

    $E_{TOTAL} = \alpha C V_{DD}^2 + E_{LEAKAGE}$    (2)

    $E_{LEAKAGE} = \int I_{LEAK} V_{DD} \, dt$    (3)

where $\alpha$ is the activity factor. Since both the leakage and switching energy depend on $V_{DD}$, voltage scaling is required to minimize energy consumption. The switching energy decreases quadratically with $V_{DD}$, while the leakage energy decreases linearly with $V_{DD}$; in addition, the leakage current itself decreases due to the DIBL effect ($V_{DS}$ decreases as $V_{DD}$ decreases).
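As a rough numerical illustration of Eqs. (2)-(3), the sketch below sweeps $V_{DD}$ and locates the minimum energy point for a design without a throughput constraint. It is a sketch under stated assumptions, not the authors' model: the clock period is taken as N gate delays, the gate delay is assumed to be K*C*Vdd/I_on with exponential subthreshold currents (the form used in Section VI), the same capacitance is used for switching and delay, DIBL is ignored, and every numeric value (`ALPHA`, `DEPTH`, `K_FIT`, `ETA`, `C_SW`, the sweep bounds) is a hypothetical placeholder.

```python
import math

# Hypothetical illustration parameters (not taken from the paper).
ALPHA = 0.1          # activity factor
DEPTH = 20           # logic depth N: gate delays per clock cycle
K_FIT = 4.0          # delay fitting constant K
ETA = 1.5            # subthreshold slope factor
V_THERMAL = 0.026    # thermal voltage kT/q near room temperature [V]
C_SW = 1e-15         # capacitance used for both switching and delay [F]


def energy_per_op(vdd: float) -> float:
    """Energy per operation [J] when the clock is set by the logic delay.

    Switching energy is ALPHA*C*Vdd^2 (Eq. 2). Leakage energy is
    I_leak*Vdd*T_cycle (Eq. 3) with T_cycle = DEPTH*t_delay. Assuming
    t_delay = K*C*Vdd/I_on and exponential subthreshold currents, the
    threshold voltage cancels: I_leak/I_on = exp(-Vdd/(ETA*V_THERMAL)).
    """
    switching = ALPHA * C_SW * vdd ** 2
    leak_to_on = math.exp(-vdd / (ETA * V_THERMAL))
    leakage = DEPTH * K_FIT * C_SW * vdd ** 2 * leak_to_on
    return switching + leakage


def minimum_energy_point(v_lo: float = 0.15, v_hi: float = 1.0, step: float = 0.001):
    """Brute-force sweep of Vdd; returns (v_opt, e_opt)."""
    n_steps = int(round((v_hi - v_lo) / step))
    grid = [v_lo + i * step for i in range(n_steps + 1)]
    v_opt = min(grid, key=energy_per_op)
    return v_opt, energy_per_op(v_opt)


v_opt, e_opt = minimum_energy_point()
print(f"V_opt = {v_opt * 1e3:.0f} mV, E_opt = {e_opt * 1e15:.4f} fJ/op")
```

Under these assumptions the threshold voltage drops out of the unconstrained optimum, because the exponential terms in delay and leakage cancel; only the ratio of leakage to switching activity moves the minimum.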
The integral yields a dependency on the operating frequency, which, for a non-throughput-constrained application, is set by the delay and the pipeline logic depth. This yields an increasing, exponential dependence on $V_{DD}$ below $V_T$ and leads to the classic minimum energy graph shown in Fig. 1. Since the optimal $V_{dd}$ is set by $V_{th}$, continual process scaling has proved useful thus far. Table I shows the trend of decreasing $V_{th}$ with decreasing process node; however, as the effective channel length becomes relatively large, DIBL plays an increasingly important role in the leakage current.

TABLE I
SCALING TRENDS

    Node   L_eff   T_ox   V_dd,nom   V_t,sat   I_off,nom   I_on,nom
    [nm]   [nm]    [nm]   [V]        [V]       [nA/µm]     [µA/µm]
    250    120     4.0    2.5        0.63      0.002       820
    180    70      2.3    1.8        0.49      0.11        840
    130    49      1.6    1.3        0.36      4.5         890
    90     35      1.4    1.2        0.32      19          1030
    65     24.5    1.2    1.1        0.30      62          1150
    45     17.5    1.1    1.0        0.27      200         1250
    32     12.6    1.0    0.9        0.27      350         1290

Fig. 2. Measured microcontroller frequency vs. supply voltage. Delay increases exponentially below $V_{th}$ = 400 mV. From [8].

Fig. 3. Sigma-to-mean of $I_{on}$ due to threshold voltage variation versus supply voltage [2].

B. The minimum E/op sets the operating frequency

For the traditional minimum energy/op analysis discussed above, $V_{dd}$ is set to $V_{th}$, which sets the characteristic gate delay. For a traditional design approach, the logic depth is fixed, yielding a set operating frequency that is often non-ideal and slower than desired. This fact was illustrated in [8], and the frequency vs. $V_{dd}$ graph is shown in Fig. 2.

C. The impact of PVT variation on $V_{th}$

Subthreshold current is exponential in threshold voltage, as can be seen from the following equation [3]:

    $I_{sub} \propto \exp\left(\frac{V_{gs} - V_{th}}{m \, v_{th}}\right), \quad m \propto S_S$    (4)

where $v_{th}$ denotes the thermal voltage. For Gaussian $V_{th}$ variation with standard deviation $\sigma_{V_{th}}$, the corresponding spread of the on-current is:

    $3\sigma_{I_{ON}} / \mu_{I_{ON}} = 3 \sqrt{\exp\left[\left(\frac{\sigma_{V_{th}}}{m \, v_{th}}\right)^2\right] - 1}$    (5)

Consequently, any process-induced variation will cause an exponential change in subthreshold current. The primary $V_{th}$ variations are due to random dopant fluctuations (RDF) and transistor length variations. However, RDF dominates geometric variations, since the dependence of $V_{th}$ variation on DIBL reduces as $V_{dd}$ is scaled. At the same time, $I_{on}$ becomes more sensitive to $V_{th}$ fluctuations at low voltage, with the net result that $I_{on}$ variation due to DIBL remains roughly constant [2]. Fig. 3 shows the effect that RDF- and $L_{eff}$-induced $V_{th}$ variations have on the 3σ/µ of $I_{on}$. This figure also highlights the importance of minimizing noise and variations in $V_{dd}$.
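To make the sensitivity of subthreshold current to $V_{th}$ variation concrete, the following sketch compares the closed-form sigma-to-mean ratio of a log-normally distributed current with a Monte Carlo estimate. It assumes a Gaussian $V_{th}$ and an exponential current law of the kind in Eq. (4); the slope factor, thermal voltage, mean threshold, gate bias, and sample count are hypothetical values chosen only for illustration.

```python
import math
import random

M_SLOPE = 1.3        # subthreshold slope factor m (assumed value)
V_THERMAL = 0.0259   # thermal voltage kT/q near 300 K [V]


def sigma_over_mean_closed_form(sigma_vth: float) -> float:
    """sigma/mu of a log-normal current whose exponent has std s = sigma_vth/(m*vT)."""
    s = sigma_vth / (M_SLOPE * V_THERMAL)
    return math.sqrt(math.exp(s * s) - 1.0)


def sigma_over_mean_monte_carlo(sigma_vth: float, mu_vth: float = 0.3,
                                vgs: float = 0.2, n: int = 100_000,
                                seed: int = 1) -> float:
    """Sample Gaussian Vth, map it through the exponential law, estimate sigma/mu."""
    rng = random.Random(seed)
    currents = [
        math.exp((vgs - rng.gauss(mu_vth, sigma_vth)) / (M_SLOPE * V_THERMAL))
        for _ in range(n)
    ]
    mean = math.fsum(currents) / n
    variance = math.fsum((i - mean) ** 2 for i in currents) / n
    return math.sqrt(variance) / mean


for sigma_mv in (10, 20, 30):
    sigma = sigma_mv * 1e-3
    closed = 3.0 * sigma_over_mean_closed_form(sigma)
    sampled = 3.0 * sigma_over_mean_monte_carlo(sigma)
    print(f"sigma_Vth = {sigma_mv} mV: 3*sigma/mu closed form {closed:.2f}, sampled {sampled:.2f}")
```

The ratio depends only on $\sigma_{V_{th}}$ relative to $m \, v_{th}$, not on the mean threshold or the bias point, which is why a few tens of millivolts of threshold spread already produce a current spread comparable to the mean.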
As can be seen in the subthreshold current equation, the current is also exponentially dependent on temperature through the thermal voltage ($v_{th} = kT/q$). This problem is highlighted by the measured MCU frequency vs. temperature from [8] in Fig. 2.

Fig. 4. Measured $V_{th}$ distribution from a single die [8].

D. $V_{th}$ variation is not spatially correlated

The cause of threshold variation for subthreshold logic was investigated in [9], which found no evidence of spatial correlation and supported RDF as the main cause of variation. The $V_{th}$ variations measured on a test die are shown in Fig. 4 and reveal an almost perfect Gaussian distribution. Random, uncorrelated variations in $V_{th}$ mean that we can only use standard statistical modeling to predict the performance of subthreshold logic.

III. PRIOR WORK

Both correlated and uncorrelated variations lead to several challenges under aggressive energy minimization design. The first is random variation from lithographic error and RDF, which leads to stochastic path delay. The second is correlated path delay variation due to temperature and process variation. Prior works addressing these issues are discussed in the following subsections.

A. Random Path Delay Variation

Fig. 5 shows the increase of $V_{th}$ variation with process node due to random dopant fluctuations (RDF) and line edge roughness (LER). Table I shows that $V_{th}$ is remaining relatively constant, and consequently the relative variation (σ/µ) is rapidly increasing. High relative variation requires substantial timing margin, and this margin degrades energy efficiency. As discussed in [2], threshold variation due to RDF and LER lacks spatial correlation and thus cannot be compensated through replica schemes such as [3]. Two techniques for reducing this variation are increasing the device active area (W*L) and averaging delay variation through a long data path [2]. Fig. 6 illustrates the effect of device upsizing and data path length increase on relative delay variation. Device upsizing increases both active and leakage power. Increasing the logic depth reduces the activity factor, which reduces energy efficiency due to a higher leakage-to-active power ratio [2].

Fig. 5. $V_T$ variation for predictive technology nodes. From [1]. (a) Comparison of process node. (b) Comparison of $V_{th}$ at 45nm node.

Fig. 6. Sigma-to-mean variation vs. transistor width and path length. Note that the highest variability data points also correspond to the most efficient designs (highest activity factor through low pipeline depth) and smallest device width (lowest active power). From [2].

B. Correlated Path Delay Variation

Excess path delay caused by temperature and/or process (mean doping level, oxide thickness, proximity correction effects, etc.) variation tends to be highly correlated across a die. This correlated variation can cause the critical path to exceed the clock period. Replica (or canary) circuits can be used to detect and compensate for correlated variation by adjusting the supply (AVS) or body bias (ABB) voltages [3].

IV. CHOICE OF TECHNOLOGY NODE

Some authors have argued that leakage energy at highly scaled process nodes will make subthreshold operation less efficient and thus less attractive [4]. However, it is important that we continue to study energy minimization in aggressively scaled technology nodes for several reasons. First, the choice between general-purpose (low $V_{th}$) and low-power (high $V_{th}$) devices can significantly mitigate leakage at low operating frequencies (Figs. 7(a) and 7(b)) [5]. Second, the economic side of Moore's Law will continue to motivate manufacturers to use smaller process nodes. This cost reduction is a key driver of new application spaces that were previously not economically viable. Third, the use of timing error detection eliminates the overdesign margin that penalizes conventional designs in highly scaled process nodes. We discuss this in detail throughout the remainder of this paper.

Fig. 7. Energy per operation vs. frequency. The top figure shows that for a fixed $V_{th}$, the choice of process node is limited by leakage. The bottom figure shows that leakage can be controlled by the choice of $V_{th}$. From [5].

V. PROPOSED SOLUTION

To summarize to this point, we want to maximize our activity factor and minimize our device size to achieve high energy efficiency. We assume that device $V_{th}$ can be adjusted to provide optimal energy efficiency at our design frequency, as in Fig. 7(b). Referring back to Fig. 6, this design point corresponds to the highest delay variation. This in turn requires the most timing margin, which requires us to increase our supply voltage or decrease $V_{th}$ to accommodate the worst-case variation.

To break this trade-off, we propose dynamically biasing our circuit to accommodate the measured critical path delay. Timing measurements are made at the latch with a transition detector [6], [7]. If a signal arrives during the latch's transparency window ($T_{D\text{-}Q}$), a timing error is generated. If a signal arrives before the latch's transparency window ($T_{CLK\text{-}Q}$), no error is generated. This enables real-time control of the supply and body bias for minimum energy while exactly satisfying timing constraints (no design margin required). Design optimization (sizing, pipeline depth, $V_{th}$) in this regime has not been presented in the literature. In Section VI we present a design procedure for subthreshold circuits operating in this regime, and Section VII describes the enabling circuitry in more detail.

TABLE II
PERFORMANCE COMPARISON OF $\sigma_{V_T}$ COMPENSATION TECHNIQUES AT 1 MHZ FOR 90 NM

    Design Strategy          V_min     N     E/op, no σ_Vth    E/op, with σ_Vth
    Nominal [2]              130 mV    10    15 fJ             31.8 fJ
    Variability-aware [2]    210 mV    15    -                 24.2 fJ
    Our Design @ 1 MHz       180 mV    16    -                 13 fJ

VI. ENERGY OPTIMIZATION IN THE PRESENCE OF RDF

A. Fixed Throughput Design Analysis

For a fixed throughput design, the leakage energy can no longer be modeled by Eq. 3 with the clock period set by the logic delay. Instead, leakage energy/op is now inversely proportional to the operating frequency, as long as Eq. 6 is valid:

    $f_{op} \le \frac{1}{t_{delay} \, N}$    (6)

In Eq. 3, the integral of $I_{LEAK}$ caused the exponential dependence on $V_T$ from $t_{delay}$ (Eq. 7) to cancel with that from $I_{LEAK}$ (Eq. 8). Now, however, the resulting energy equation for a fixed throughput design is given by Eq. 9.

    $t_{delay} = \frac{K C_g V_{DD}}{I_s \, e^{(V_{gs} - V_T)/(\eta v_{th})}}$    (7)

    $I_{sub} = I_s \, e^{(V_{gs} - V_T)/(\eta v_{th})}$  and  $I_{LEAK} = I_{sub}\big|_{V_{gs}=0}$    (8)

    $E_{\sigma V_T} = \alpha C V_{DD}^2 + \frac{V_{DD}}{f_{op}} \int I_{LEAK}(V_T) \, \xi(V_T, \sigma_{V_T}) \, dV_T$    (9)

Here, $K$ is a technological constant and $\xi(V_T, \sigma_{V_T})$ is the Gaussian probability distribution function. The standard deviation of $V_T$ vs. process node is modeled by [1] using Eq. 10.

    $\sigma_{V_{th}} = \frac{A_{V_{th}}}{\sqrt{W L}}$    (10)

We can now apply this new energy model to a MATLAB simulation in order to find the optimal sizing and pipeline depth for a fixed throughput and process node.

B. Design Optimization

As proven in [10], the optimal $V_{DDopt}$ and $V_{Topt}$ for minimum energy operation in subthreshold can be set as a function of frequency by:

    $V_{DDopt} = \eta v_{th} \left(2 - \mathrm{lambertW}(\beta)\right)$    (11)

    $V_{Topt} = V_{DDopt} - \eta v_{th} \ln\left(\frac{f_{op} K C_g L_{DP} V_{DD}}{I_s}\right)$    (12)

    where $\beta = \frac{-2 C \alpha}{W N K C_g} e^{2}$    (13)

The remaining design parameters are the logic depth and upsizing. The minimum values for these parameters can be determined through the use of a yield parameter. For our application space, the two critical constraints are throughput and energy. Since RDF variations are uncorrelated, the percentage of devices that meet these constraints can be written as:

    $\%Y_{total} = Y_{f_{op}} \cdot Y_{E_{op}}$    (14)

Fig. 8. The minimum energy per instruction vs. pipeline depth for different yield constraints.

Fig. 9. Plot of the minimum energy per instruction across frequency vs. process nodes.

The worst-case energy under variation from Eq. 9 is determined by the worst-case leakage current. Therefore, our energy yield is determined by [2]:

    $I_{leak,max} = I_S \, e^{-\frac{\mu_{V_{th}}}{m v_T} + \frac{1}{2}\left(\frac{\sigma_{V_{th}}}{m v_T}\right)^2} \left(P + \mathrm{conf} \sqrt{P \left(e^{\left(\frac{\sigma_{V_{th}}}{m v_T}\right)^2} - 1\right)}\right)$    (15)

where $\mu_{V_{th}}$ is set by the tracking loop and $P$, the total number of gates in the circuit, can be considered an application-specific design constant. Eq. 15 follows the notation of [2]; $m v_T$ plays the role of $\eta v_{th}$ in Eq. 8. Consequently, the only method available for altering $I_{leak,max}$ to meet a yield specification is to reduce $\sigma_{V_{th}}$ via upsizing. Only the logic depth is then left to be specified, via the yield on the timing constraint. The worst-case delay is specified as [2]:

    $t_{delay,max} = \mu(t_{delay}) + \mathrm{conf} \cdot \sigma(t_{delay})$    (16)

    $\sigma(\ln t_{delay}) = \sqrt{\ln\left(1 + \frac{1}{N}\left(e^{t^2} - 1\right)\right)}$    (17)

    $\mu(\ln t_{delay}) = \ln\left(\frac{1}{2} \, \frac{\eta C_L V_{DD}}{I_S \, e^{V_{DD}/(m v_T)}}\right) + \frac{\mu_{V_T}}{m v_T} + \frac{1}{2} \ln\left(\frac{N^3}{N - 1 + e^{t^2}}\right)$    (18)-(19)

where $t = \sigma_{V_{th}} / (m v_T)$. MATLAB can now be used to solve for the minimum energy design (sizing and logic depth) as a function of yield for a specific process and frequency. The following iterative algorithm can be used to compute the optimal design:

    ALGORITHM - detailed steps of iterative approach
    for (some frequency), for (some L), for (some yield)
        sweep N to meet f_op yield (
            sweep W to meet E yield ( E_σVT(V_DD, V_T) )
        )
        take min of E(W_sweep, N_sweep) that meets overall yield
    end; end; end;

Figure 8 shows the N iteration solution from MATLAB in 90nm, at 1MHz, with varying yield constraints. Relaxing the yield constraint from 95% to 40% results in more than 25% energy savings. At smaller process nodes, the fabrication savings could compensate for the cost of unusable chips.

Figure 9 was generated in MATLAB and shows the minimum energy operating point over frequency for a given process node when $V_{th}$ cannot be changed from the process norm by using different process flavors and body biasing. This trend is argued by [4] and exemplifies the importance of choosing device flavors and implementing error detection circuitry and accurate body biasing.

Table II shows a comparison of our design strategy versus others in the literature. These results are from MATLAB simulations with fitting parameters extracted from a predictive technology model in 90nm with 80% yield. Although our E/op is lower, we are using a different technology node, with fitting parameters and leakage values from a predictive model, and under a throughput constraint, so it is hard to make a direct comparison. However, the general trend of $V_{min}$ and $N$ remains the same, indicating that the analysis technique is reasonable.

Fig. 10. Circuit for open-loop body bias adjustment across temperature and process corner.

Fig. 11. Automatic open-loop body bias voltage vs. temperature and process corner.

Fig. 12. Minimum energy tracking loop with RDF robustness.

VII. MINIMUM ENERGY TRACKING LOOP

A pipelined logic design with latches that detect time borrowing (such as Razor-II [6] or TBTD [7]) allows a circuit to bias itself automatically for peak energy efficiency (Fig. 12). This in turn enables design at the minimum energy point without leaving timing margin. The operation is straightforward: if any latch detects input data transitions after the clock transition, an error signal is generated. The error signals from each pipeline stage are subsequently combined with a logic OR. A controller detects errors and increases the supply voltage to increase the timing slack. Conversely, the supply voltage is decreased after $N_{ef}$ error-free clock cycles. Note that the final pipeline stage should leave sufficient margin so that time borrowing by the previous stage does not cause a timing error at a hard clock edge. Also notice that the controller ensures that the amount of borrowed time cannot accumulate; errors force the timing margin to increase via an increase in supply voltage.

Body bias can be employed to increase the energy efficiency of the design. It is not required for the functionality of the supply voltage controller, and the controller automatically adapts to the body bias because the controller measures actual circuit timing. A simple, robust method for adjusting the body bias is shown in Fig. 10. Generated bias voltages are shown in Fig. 11 and correspond well with simulations of the optimal body bias for an inverter chain. The circuit works because it forces a constant $V_{th}$; for a given frequency and activity factor, the optimal $V_{th}$ should stay constant.
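The supply controller behavior described in this section (raise the supply on any error, lower it after $N_{ef}$ error-free cycles) can be sketched as a small behavioral model. This is a minimal sketch under stated assumptions, not the authors' controller: the class name, the 5 mV step, the $N_{ef}$ value of 16, the voltage limits, and the threshold-style error model `toy_timing_error` are all hypothetical stand-ins for the real latch error flags and DC-DC converter.

```python
from dataclasses import dataclass


@dataclass
class SupplyController:
    """Behavioral model of the error-driven supply voltage controller."""
    vdd: float               # current supply voltage [V]
    step: float = 0.005      # voltage step per adjustment [V] (assumed)
    n_ef: int = 16           # error-free cycles required before lowering Vdd (assumed)
    v_min: float = 0.10      # lower supply limit [V] (assumed)
    v_max: float = 1.00      # upper supply limit [V] (assumed)
    clean_cycles: int = 0    # consecutive error-free cycles seen so far

    def update(self, error: bool) -> float:
        """Advance one clock cycle given the OR of all latch error flags."""
        if error:
            # Any timing error: add slack immediately by raising the supply.
            self.vdd = min(self.v_max, self.vdd + self.step)
            self.clean_cycles = 0
        else:
            self.clean_cycles += 1
            if self.clean_cycles >= self.n_ef:
                # Long enough without errors: try a lower supply voltage.
                self.vdd = max(self.v_min, self.vdd - self.step)
                self.clean_cycles = 0
        return self.vdd


def toy_timing_error(vdd: float, v_crit: float = 0.30) -> bool:
    """Hypothetical plant: an error occurs whenever Vdd is below v_crit."""
    return vdd < v_crit


def simulate(cycles: int = 2000, v_start: float = 0.50) -> list:
    """Run the loop from a supply that starts too high; return the Vdd trace."""
    ctrl = SupplyController(vdd=v_start)
    trace = []
    for _ in range(cycles):
        trace.append(ctrl.update(toy_timing_error(ctrl.vdd)))
    return trace


trace = simulate()
print(f"final Vdd = {trace[-1]:.3f} V")
```

In this toy model the supply ramps down from its initial value and then dithers within one step of the voltage at which errors first appear. The asymmetry between the immediate increase and the delayed decrease is what prevents borrowed time from accumulating.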

VIII. ENERGY TRACKING LOOP RESULTS

The energy tracking loop of Fig. 12 was simulated across process corners to demonstrate the ability of the loop to compensate for these variations. Note that the variations are applied locally to the inverters in the pipeline stage; a replica circuit could not detect these RDF-induced timing variations. Figs. 13(a) and 13(b) demonstrate the operation of the loop. Initially the supply voltage is set too high; the controller decreases the supply voltage until timing errors are detected. Fig. 13(c) shows the effect of the body bias circuitry sensing and correcting global $V_{th}$ variation due to temperature or correlated manufacturing variation. In the FF corner, the body bias generator creates a negative body bias in order to compensate for the low $V_{th}$. This results in a higher (more optimal) supply voltage, as shown in Fig. 13(c).

Fig. 13. Energy tracking loop settling transient. (a) SS corner. (b) FF corner. (c) FF corner with body bias. The upper plots show the supply voltage and detected errors. The lower plots show the pipeline input and output; the timing converges to 1/2 clock cycle.

IX. CONCLUSION

A variety of techniques exist in the literature to improve efficiency and robustness against variation in subthreshold logic, with little concern for throughput constraints. In order to meet the economic demands imposed by Moore's Law, it must be possible to design low energy/op circuits in deep sub-micron technologies while still achieving minimum performance. We proposed an application-driven design approach to specify design choices for a desired yield. This analysis was applied to an iterative MATLAB solver to determine the ideal width and logic depth given our specified design constraints. We employed this design in a Spectre simulation using dynamic biasing of our circuit to accommodate the measured critical path delay. We compared our performance to other design techniques found in the literature and found that our method is comparable, and possibly superior with more accurate model parameters and under certain throughput, yield, and process node scenarios. We also conclude that when designing for minimum energy operation, it is desirable to continue using scaled process nodes, despite the fact that $\sigma_{V_{th}}$ increases. The increase in $V_{th}$ variation can be at least partially compensated for by design techniques and by decreased fabrication costs.

REFERENCES

[1] M. Abu-Rahma and M. Anis, "Variability in VLSI circuits: Sources and design considerations," in Circuits and Systems, 2007. ISCAS 2007. IEEE International Symposium on, 27-30 2007, pp. 3215-3218.
[2] B. Zhai, S. Hanson, D. Blaauw, and D. Sylvester, "Analysis and mitigation of variability in subthreshold design," in Proceedings of the 2005 International Symposium on Low Power Electronics and Design. ACM, 2005, p. 25.
[3] Y. Ramadass and A. Chandrakasan, "Minimum energy tracking loop with embedded DC-DC converter enabling ultra-low-voltage operation down to 250 mV in 65 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 43, no. 1, p. 256, 2008.
[4] M. Seok, D. Sylvester, and D. Blaauw, "Optimal technology selection for minimizing energy and variability in low voltage applications," in Proceedings of the 13th International Symposium on Low Power Electronics and Design. ACM, 2008, pp. 9-14.
[5] D. Bol, D. Flandre, and J. Legat, "Technology flavor selection and adaptive techniques for timing-constrained 45nm subthreshold circuits," in Proceedings of the 14th ACM/IEEE International Symposium on Low Power Electronics and Design. ACM, 2009, pp. 21-26.
[6] S. Das, C. Tokunaga, S. Pant, W. Ma, S. Kalaiselvan, K. Lai, D. Bull, and D. Blaauw, "RazorII: In situ error detection and correction for PVT and SER tolerance," IEEE Journal of Solid-State Circuits, vol. 44, no. 1, pp. 32-48, 2009.
[7] K. Bowman, J. Tschanz, N. Kim, J. Lee, C. Wilkerson, S. Lu, T. Karnik, and V. De, "Energy-efficient and metastability-immune timing-error detection and instruction-replay-based recovery circuits for dynamic-variation tolerance," ISSCC 2008, 2008.
[8] B. Zhai, S. Pant, L. Nazhandali, J. Olson, A. Reeves, M. Minuth, R. Helfand, D. Blaauw, and T. Austin, "Energy efficient subthreshold processor design," IEEE Journal of Solid-State Circuits (JSSC), 2007.
[9] N. Drego, A. Chandrakasan, and D. Boning, "Lack of spatial correlation in MOSFET threshold voltage variation and implications for voltage scaling," IEEE Transactions on Semiconductor Manufacturing, vol. 22, no. 2, p. 245, 2009.
[10] B. Calhoun, A. Wang, and A. Chandrakasan, "Modeling and sizing for minimum energy operation in subthreshold circuits," IEEE Journal of Solid-State Circuits, vol. 40, no. 9, pp. 1778-1786, 2005.