Combating NBTI-induced Aging in Data Caches


Shuai Wang, Guangshan Duan, Chuanlei Zheng, and Tao Jin
State Key Laboratory of Novel Software Technology, Department of Computer Science and Technology, Nanjing University
swang@nju.edu.cn, {guangshan_duan, zhengchl, taojin}@smail.nju.edu.cn

ABSTRACT

The negative bias temperature instability (NBTI) in CMOS devices is one of the most prominent aging mechanisms and poses a severe threat to the reliability of modern processors built in deep submicron semiconductor technologies. Due to the unbalanced duty cycle ratio of its SRAM cells, the data cache suffers heavy NBTI stress, which exacerbates its aging. In this paper, an aging-aware design is proposed to combat the NBTI-induced aging in the data cache. First, the detailed lifetime behaviors of the cachelines in the data cache are studied. Then, different schemes are proposed to mitigate the negative aging effects by balancing the duty cycle ratio of the SRAM cells in the cachelines according to their different lifetime phases. By applying our proposed idle-time-based cacheline invalidation, early write-back, and bit-flipping schemes, the duty cycle ratio of the data cache can be well balanced. By adopting the drowsy scheme for invalidated cachelines, our design can also reduce the power consumption significantly, which further improves the thermal behavior and reduces the aging of data caches.

Categories and Subject Descriptors
B.8.1 [Hardware]: Performance and Reliability - Reliability, Testing, and Fault-Tolerance

General Terms
Reliability

Keywords
Data caches; negative bias temperature instability; low power; duty cycle balancing

1. INTRODUCTION

In deep submicron semiconductor technologies, the aging effect in CMOS devices has become one of the major challenges in new microprocessor designs [4].
Recent research has shown that the lifetime reliability of CMOS devices can be degraded by aging mechanisms such as bias temperature instability (BTI), hot-carrier injection, and gate-oxide wearout [23, 17, 15]. The negative bias temperature instability (NBTI) has proven to be one of the most critical failure mechanisms affecting future technologies. NBTI affects a PMOS device when a negative voltage is applied at its gate (logic 0), and the NBTI-induced aging is proportional to the stress time and the switching activity of the device. If an SRAM cell holds the same value for a long time, which means a highly biased duty cycle ratio, some of its devices are kept under heavy NBTI stress and their aging is exacerbated. Since data caches in modern processors are implemented with SRAM cells and hold the data for program execution, protecting the data cache against NBTI-induced aging is very important for reliable processor design. Due to the uneven use of the cachelines and the presence of narrow-width values [9], the data cache suffers a highly biased duty cycle ratio and thus heavy NBTI stress. However, the NBTI-induced degradation of device reliability cannot be mitigated simply by adopting traditional techniques such as guardbanding, which may incur a significant reduction in circuit speed. Therefore, the focus of this work is a microarchitectural solution that balances the duty cycle ratio and mitigates the aging stress.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GLSVLSI'13, May 2-3, 2013, Paris, France. Copyright 2013 ACM 978-1-4503-1902-7/13/05...$15.00.

In this work, an aging-aware data cache (AADC) design is proposed to combat the lifetime degradation in the performance and reliability of the SRAM cells in the data cache by duty cycle balancing. We first conduct a detailed analysis of the lifetime behaviors of the data cache and divide its cachelines into three groups: clean cachelines, dirty cachelines, and invalid cachelines. For the clean cachelines, we propose to do the idle-time-based cacheline invalidation first and then bit-flip these invalidated cachelines periodically. For the dirty cachelines, we propose to do the idle-time-based early write-back first, and then do the invalidation and apply the bit-flipping. For the invalid cachelines, we can just do the bit-flipping periodically. By carefully choosing the idle-time and bit-flipping time intervals, the average duty cycle ratio of the data cache can be well balanced with negligible performance and energy overheads, and thus the NBTI degradation of the data cache is significantly mitigated. Further, to reduce the power consumption of the data cache, we adopt the drowsy scheme for the invalidated cachelines in our design, and the low-energy states of the SRAM cells further alleviate the aging effect in the data cache.

The rest of the paper is organized as follows. In the next section, related work on aging-aware/NBTI-aware designs is discussed. In Section 3, we provide detailed designs of our aging-aware data cache. The experimental results and discussion are presented in Section 4. Section 5 draws the conclusion.

2. RELATED WORK

The NBTI-induced instability in SRAM cells was studied and a data flipping scheme was proposed in [14]. In that scheme, data stored in inverted mode need to be flipped back during the read operation. Therefore, extra XOR gates were added to do the inverting, which increases the cycle time and the power consumption of the SRAM cells. In our AADC design, no bit-inverting is needed during the data cache access, so it has no impact on the performance and power efficiency of the data cache. In [1], an NBTI-aware processor design named Penelope was proposed and evaluated. Penelope protects the memory structures in the processor, such as registers and cache blocks, by utilizing the idle time of these resources. Therefore, it is limited in balancing the duty cycle ratios of heavily used memory structures such as the data cachelines. In contrast, our AADC provides solutions for both idle and in-use cachelines. In [21], duty cycle balancing designs were proposed for register files by exploiting the narrow-width data in the processor. Shin et al. proposed a redundancy scheme to improve the lifetime of SRAM caches against NBTI-induced wearout [19]. The duty cycle balancing scheme proposed in [12] targets aging-effect reduction in instruction caches by exploring their lifetime behaviors, while our work is a microarchitectural solution for data caches. The schemes in [12] cannot be directly applied to data caches, since the lifetime behaviors of data caches are much more complicated than those of instruction caches. Moreover, we also consider low power designs and their effects on aging reduction.
In [11], a holistic approach named Colt was proposed to balance the duty cycle ratios of devices in modern processors by applying the complement mode to the data path, control path, and storage hierarchy. Although Colt does not need to do the bit inverting when the data are fetched, extra XOR gates are still required to do the bit-flipping. In our AADC, the bit-flipping is done by writing all zeros or all ones to the cachelines, so no extra XOR gates are involved.

3. AGING-AWARE DATA CACHE (AADC)

3.1 Motivation

Previous work has observed that caches often hold more 0s than 1s [14, 1]. Our experimental results also demonstrate that the duty cycle ratio of data caches is not balanced to the best case (i.e., 50%) in our simulated microprocessor. Therefore, the PMOS devices in data caches are under negative bias most of the time and suffer high NBTI stress. The traditional guardbanding technique requires a large guardband in SRAM VDDMIN, which is too expensive and may limit some low power designs, such as supply voltage scaling. Customized NBTI-resilient SRAM cells were proposed in [2, 20]. Recent work [16, 7, 8] also exploited low-energy states of the SRAM cells for mitigating the aging effect. However, all the previous schemes target the NBTI-induced aging effects in general SRAM cell structures. There is no scheme that targets an aging-aware design specifically for the L1 data cache by utilizing the access pattern and lifetime behavior of the data cache. Therefore, we propose a microarchitectural solution to balance the duty cycle ratios and combat the NBTI-induced aging of the SRAM cells in the data cache.

[Figure 1: The lifetime of a data item in the data cache. Labels: Invalid, Cache Miss, Access, Live, Dead, Replace.]

3.2 Lifetime Behavior of the Data Cache

In [3, 24, 22], the lifetime behaviors of L1 caches have been broadly studied, especially for improving their reliability against soft errors.
Due to the variety of access patterns in the L1 data cache, such as read, write, replace, and write-back, the lifetime model of the L1 data cache in those studies is quite complicated, which makes the data cache difficult to analyze and optimize. Therefore, we simplify the lifetime model of cachelines in the data cache and divide their lifetime into the following three phases, Live, Dead, and Invalid, similar to the analysis in [13].

Live: the lifetime phase between the first access and the last access of a data item.
Dead: the lifetime phase between the last access and the replacement of a data item.
Invalid: the lifetime phase when the data item is in the invalid state.

Figure 1 shows the correlation among the three lifetime phases for typical data cache activities, where an access (A) can be a cache read (R) or a cache write (W). Notice that the data item in the data cache can be a cacheline, a word, a byte, or a single bit. Although [22] claims that a byte-level analysis is accurate for the lifetime characterization of the data cache, we choose a cacheline-level model in our AADC study for two reasons: a) the cost of controlling bit-flipping at the byte level is too high, so we choose bit-flipping for each cacheline, and b) the target of our work is to mitigate the NBTI-induced aging in the data cache, for which a coarse-grained lifetime model such as a cacheline-based model should be sufficient.

3.3 NBTI-Aware Designs for Different Lifetime Phases

Based on the lifetime categorization of the data cache, we adopt different strategies for different lifetime phases in order to mitigate the NBTI stress of the SRAM cells with minimum performance and energy overheads. For the cachelines in the invalid state, we propose to simply bit-flip these cachelines periodically.
Since the data in these invalid cachelines will not be needed for program execution, they do not need to be flipped back, even if they are in the complement mode when the cachelines become valid again after a cacheline replacement from the L2 cache. Our bit-flipping just writes all zeros or all ones to these cachelines, so we do not need extra XOR gates to do the flipping, which makes our design more power and area efficient than the previous schemes in [14, 11].

For the cachelines in the valid state, we cannot simply apply the same bit-flipping because the data in these cachelines may be needed during future cache accesses. For the cachelines in the Live phase, if we apply our flipping scheme (writing zeros or ones) to the cachelines, the original data will be lost. If we use the inverting scheme proposed in [14, 11], extra XOR gates for inverting are required. The clean cachelines in the Dead phase, however, are actually not needed in the future. Because the data in a clean cacheline have not been updated by the CPU, they are simply discarded at the replacement. This makes it possible to apply our flipping scheme to these cachelines in the Dead phase. The problem is that we cannot determine which read operation to a clean cacheline is the last read during the program execution, so we cannot know when to start applying our bit-flipping scheme.

In [22], a clean cacheline invalidation (CCI) scheme was proposed to reduce the vulnerability factor of the clean cachelines in the data cache by invalidating the cachelines after they have been idle for some predefined interval. The CCI scheme is based on the observation that most read-read (RR) instances have small intervals (less than 1K cycles) and that these RR instances with small intervals contribute only a small percentage of the overall RR time. Different from their scheme, we adopt the cacheline invalidation (CI) scheme in order to do our bit-flipping and balance the duty cycle ratios of the cachelines. If a clean cacheline in the data cache remains idle for a certain predefined interval, we propose to invalidate it and then apply our bit-flipping scheme to it in the same way as to the invalid cachelines.

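The effect of idle-time-based invalidation on the lifetime phases of a single clean cacheline can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function name `phase_durations`, the access trace, and the rule that the idle window before an invalidation is counted as Dead time are hypothetical choices made here, not part of the original evaluation.

```python
def phase_durations(accesses, replace_at, idle_interval=None):
    """Split one clean cacheline's residency into Live, Dead and Invalid cycles.

    accesses:      sorted cycle numbers of the accesses (the first one is the fill)
    replace_at:    cycle at which the line is finally replaced
    idle_interval: if given, the line is invalidated once it has been idle
                   for more than this many cycles (assumed CI behaviour)
    """
    live = dead = invalid = 0
    boundaries = list(accesses) + [replace_at]
    for i in range(len(accesses)):
        gap = boundaries[i + 1] - boundaries[i]
        is_last_gap = i == len(accesses) - 1
        if idle_interval is not None and gap > idle_interval:
            # The line sits idle until the counter expires, then it is invalid
            # until the next access (a refill) or the replacement.
            dead += idle_interval
            invalid += gap - idle_interval
        elif is_last_gap:
            dead += gap      # between the last access and the replacement
        else:
            live += gap      # between two accesses of the same data item
    return {"live": live, "dead": dead, "invalid": invalid}


# Hypothetical trace: four accesses, replacement at cycle 50,000.
trace = [0, 200, 900, 6000]
baseline = phase_durations(trace, 50_000)
with_ci = phase_durations(trace, 50_000, idle_interval=4000)
print(baseline)
print(with_ci)
```

In this made-up trace, the long gap before the replacement and the one long gap between accesses are mostly converted into Invalid time, which is the time that periodic bit-flipping can then balance.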
By applying our cacheline invalidation and flipping (CIF) scheme, most of the Dead phase of a clean cacheline is converted into the Invalid phase, so the NBTI stress can be mitigated by applying our bit-flipping scheme. Moreover, the Live phase of the clean cacheline is reduced if a small invalidation interval is used. In that case, part of the Live phase is also converted into the Invalid phase and its aging effects are alleviated as well. The duty cycle ratio of the remaining Live phase is not optimized by our CIF scheme. However, the remaining Live phase contributes only a small percentage (less than 10% in our study) of the cacheline lifetime, so the overall duty cycle ratio of the SRAM cells in the clean cachelines can be well balanced.

For the dirty cachelines in a write-back data cache, the data are still needed and will be written back into the L2 cache at the replacement. Therefore, we cannot apply our CIF scheme to the dirty cachelines directly. Instead, we propose to do the idle-time-based early write-back (EWR) [22] first, and then do the invalidation and bit-flipping. As with the clean cachelines, because the Live phase accounts for only a small percentage of the lifetime after the early write-back and invalidation, the overall duty cycle ratio of the SRAM cells in the dirty cachelines is also well balanced. For a write-through data cache, the situation is much simpler. Since all cachelines are clean in a write-through data cache, we can just apply our CIF scheme to balance the duty cycle ratio.

3.4 Microarchitecture of the AADC

The key issues in the AADC design are how to do the early write-back (EWR) for the dirty cachelines, the cacheline invalidation (CI) for the clean cachelines, and the bit-flipping for the invalid cachelines.

[Figure 2: Microarchitectural schematic of the proposed AADC. Blocks: address decoder, tag array with V, D, and Z bits (Way 0 to Way N-1), data array (Way 0 to Way N-1), global counters BF and IT, EWR/CI logic with 2-bit local counter, bit-flipping logic.]

Figure 2 shows the block diagram of our AADC design. We use the valid bit (V) in the tag array to control whether the EWR/CI or the bit-flipping scheme should be applied to each cacheline. For a valid cacheline (V = 1), an N-bit global counter (IT, for idle-time based) ticked by the clock signal and a two-bit local counter per cacheline, ticked by the global counter every 2^N cycles, are introduced. The local counter is reset to zero once the cacheline is accessed. If the local counter saturates, we use the dirty bit (D) to decide whether EWR+CI is performed (dirty cacheline, D = 1) or only CI is performed (clean cacheline, D = 0). After that, the valid bit V is set to zero, and the local counter is also reset to zero.

For an invalid cacheline (V = 0), a global counter (BF) is used for the bit-flipping. The BF counter and the cacheline state zero bit (Z) work together to determine whether all zeros or all ones should be written into the entire cacheline. If the BF counter saturates and the Z bit is equal to one, indicating that the data in the cacheline are currently all zeros, all ones are written into the cacheline in order to balance the duty cycle ratios of its SRAM cells, and the Z bit is set to zero. If the BF counter saturates and the Z bit is equal to zero, all zeros are written into the cacheline and the Z bit is set to one. Note that in order to minimize the area overhead of our AADC design, we choose the same idle time interval for EWR and CI. Therefore, only one two-bit local counter is needed for each cacheline, and it is shared between the dirty and clean cases by using the D bit.

3.5 Power Optimization

Previous aging reduction solutions have studied the aging benefits provided by the low-energy states of the SRAM cells [16, 7, 8]. If energy saving or leakage control schemes [10, 13] are adopted in data caches, the aging effect will be further mitigated.

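The counter-driven control flow described in Section 3.4 can be sketched behaviorally in Python. This is a minimal model under stated assumptions rather than the authors' hardware: the names `Line` and `AADCController` are hypothetical, "saturates" is taken to mean the two-bit local counter reaching 3, the BF counter is modeled as a plain period in cycles, and the early write-back is only recorded as an event.

```python
from dataclasses import dataclass


@dataclass
class Line:
    valid: bool = False   # V bit
    dirty: bool = False   # D bit
    zero: bool = False    # Z bit: True when the line currently holds all zeros
    local: int = 0        # two-bit idle counter
    data: int = 0         # line contents as an integer bit-vector


class AADCController:
    def __init__(self, num_lines, it_bits, bf_period, line_bits=512):
        self.lines = [Line() for _ in range(num_lines)]
        self.it_period = 1 << it_bits        # IT wraps every 2**N cycles
        self.bf_period = bf_period           # assumed: BF saturates every bf_period cycles
        self.all_ones = (1 << line_bits) - 1
        self.cycle = 0

    def access(self, idx, write=False, fill_data=0):
        """Model a CPU access; return True on a hit, False on a miss (refill)."""
        ln = self.lines[idx]
        hit = ln.valid
        if not hit:                          # miss: refill the line from L2
            ln.valid, ln.dirty, ln.data = True, False, fill_data
        ln.dirty = ln.dirty or write
        ln.local = 0                         # any access resets the idle counter
        return hit

    def tick(self):
        """Advance one cycle; return the (action, line index) pairs triggered."""
        self.cycle += 1
        events = []
        if self.cycle % self.it_period == 0:         # IT ticks the local counters
            for i, ln in enumerate(self.lines):
                if not ln.valid:
                    continue
                ln.local += 1
                if ln.local == 3:                    # two-bit counter saturated
                    if ln.dirty:
                        events.append(("EWR", i))    # early write-back to L2
                        ln.dirty = False
                    events.append(("CI", i))         # cacheline invalidation
                    ln.valid = False
                    ln.local = 0
        if self.cycle % self.bf_period == 0:         # BF saturates
            for i, ln in enumerate(self.lines):
                if ln.valid:
                    continue
                # Z = 1 (all zeros) -> write all ones; Z = 0 -> write all zeros.
                ln.data = self.all_ones if ln.zero else 0
                ln.zero = not ln.zero
                events.append(("FLIP", i))
        return events


# Tiny hypothetical configuration so the behaviour is visible in a few cycles.
ctl = AADCController(num_lines=1, it_bits=2, bf_period=10, line_bits=8)
ctl.access(0, write=True)
for _ in range(30):
    for action, idx in ctl.tick():
        print(ctl.cycle, action, idx)
```

The model needs no XOR on the read path: a flipped line is always invalid, so its contents are never returned to the CPU and a refill simply overwrites them.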
In our AADC design, the cachelines will not be needed after the invalidation, which makes them very suitable for energy saving schemes such as the drowsy scheme. Therefore, we propose to adopt the drowsy scheme to further reduce the aging effect of the data cache, i.e., to apply the drowsy scheme to the invalidated cachelines. Moreover, the leakage control scheme also reduces the temperature of the data cache, which can further alleviate the aging.

3.6 Area, Performance, and Power Overheads of the AADC

For the area overhead, since no extra XOR gates or inverting operations are needed in our AADC design, the overhead comes mainly from one extra Z bit showing the current state (all zeros or all ones) of the invalid cachelines, and the two-bit local counter that implements EWR+CI or CI for each cacheline. The space overheads of the global counters BF and IT are negligible because they are shared by the entire data cache. The space overhead of the Z bit and the two-bit local counter for each cacheline is also small. For instance, for a data cache with a 64-byte cacheline, the space overhead of our AADC compared to the data array is only 3 bits out of 64 bytes (3/(64 × 8) ≈ 0.59%).

For the performance overhead, since the data in an invalid cacheline have no impact on processor execution and our bit-flipping operation is not on the critical path, the bit-flipping does not degrade performance. However, the early write-back and cacheline invalidation schemes do have an impact on performance, because the invalidation operations may cause additional cache misses if the invalidated cachelines need to be accessed by the CPU in the near future. Therefore, we need to carefully choose the idle interval for EWR and CI in order to maximize the lifetime aging mitigation and minimize the performance degradation. Note that the drowsy scheme in our AADC has no performance impact, since all the cachelines in drowsy mode are invalid and do not need to be woken up during accesses.

The bit-flipping operation in our AADC scheme is the major contributor to the power overhead. In general, if a large time interval for bit-flipping is used, the power overhead is reduced. However, if the time interval is too large, the duty cycle balancing for the data cache in our AADC scheme becomes less effective.
Therefore, detailed experiments need to be conducted in order to choose a proper bit-flipping interval for power efficiency.

4. EXPERIMENTAL EVALUATION

4.1 Experimental Setup

We derive our simulators from SimpleScalar V3.0 [6] to model a high-performance microprocessor similar to Alpha 21364. Table 1 gives the detailed configuration of the simulated microprocessor. To evaluate the power efficiency of our AADC design, a modified version of the Wattch power model [5] is used for power profiling (at 32nm technology) during the simulation. For experimental evaluation, we use the SPEC CPU2000 benchmark suite compiled for the Alpha Instruction Set Architecture. Ten benchmarks are randomly selected for our experimental evaluation. We use the reference input sets for this study. Each benchmark is first fast-forwarded to its early single simulation point specified by SimPoint [18]. We use the last 100 million instructions of the fast-forwarding phase for warm-up if the number of skipped instructions is more than 100 million. Then, we simulate the next 100 million instructions in detail.

Table 1: Parameters of the simulated processor.

Processor Core
  Datapath Width       4 inst. per cycle
  Int Issue Queue      20 entries
  FP Issue Queue       15 entries
  Load/Store Queue     64 entries
  Active list (ACL)    80 entries
  Int Register File    80 registers
  FP Register File     72 registers
  Function Units       4 IALU, 2 IMULT/IDIV, 2 FALU, 1 FMULT/FDIV/FSQRT, 2 MemPorts
Branch Predictor
  Branch Predictor     Alpha 21264 tournament predictor, 32-entry RAS
  BTB                  2048-entry, 2-way
Memory Hierarchy
  L1 I/DCache          64KB, 2 ways, 64B blocks, 2 cycles
  L2 UCache            4MB, 8 ways, 128B blocks, 12 cycles
  Memory               225 cycles first chunk, 12 cycles rest
  TLB                  Fully-assoc., 128 entries

4.2 Experimental Results and Analysis

In order to apply our AADC design, we first need to conduct the lifetime behavior analysis on the data cache, and this characterization is performed at the cacheline level.

[Figure 3: The lifetime distribution of the cachelines in the data cache. Y-axis: Lifetime Distribution (0 to 1); series: Invalid, Live, Dead.]

Our experimental results show that most of the cachelines in the data cache are valid (in use) during the execution. As shown in Figure 3, 99.5% of the cachelines in the data cache are valid on average. The Live and Dead phases are the lifetime phases in which the cachelines are in the valid state. Figure 3 shows that the Live phase accounts for about 24.6% of a cacheline's lifetime and the Dead phase contributes about 74.9% on average. Therefore, in order to apply different effective aging mitigation schemes according to the different lifetime behaviors of the cachelines, we first divide the cachelines into two groups in our AADC study: valid and invalid cachelines.

For the invalid cachelines, we propose to simply bit-flip these cachelines periodically. However, as we discussed in Section 3, we need to choose a proper bit-flipping time interval in order to achieve a well-balanced duty cycle ratio of the invalid cachelines as well as to minimize the overheads. If we choose a small interval, the power overhead is high, but the duty cycle ratio is more perfectly balanced. If a large interval is used, the power overhead is reduced, but the duty cycle ratio of the cachelines is not well balanced. Based on our experimental results, a 40K-cycle interval for bit-flipping has negligible power and performance overheads with nearly perfect duty cycle balancing capability. Therefore, we choose the 40K-cycle bit-flipping interval for our AADC design.

[Figure 4: The average stress duty cycle (zero) ratio for valid cachelines. Y-axis: Duty Cycle (Zero) Ratio.]

For the valid cachelines, our experimental results in Figure 4 show that the average stress duty cycle (zero) ratio is 84.0%, which needs to be further reduced. For clean cachelines, based on the fact that most of the read-read (RR) instances have small intervals, we proposed in Section 3 to use an idle-time-based cacheline invalidation (CI) scheme to invalidate the valid cachelines after they have been idle for some predefined interval. By applying the CI scheme, most of the duty cycles of clean cachelines are converted into the duty cycles of invalid cachelines. Then, we can further balance them by applying our bit-flipping scheme. However, similar to the bit-flipping scheme, the problem is how to choose a proper invalidation interval that reduces the RR phase significantly with negligible performance degradation. Our experimental results show that if a small 500-cycle interval is chosen, the RR phase is significantly reduced to 0.5% from the original 13.7%, but the performance loss is high, 5.4% on average. This high performance loss is mainly caused by the high pipeline stall penalty due to the increased data cache misses incurred by the CI scheme, which is not affordable in modern high-performance processors.
However, if a large 64K-cycle interval is adopted, the performance loss is less than 0.3%, but the RR phase is reduced only to 6.3%. Based on our experimental results, a 4K-cycle interval is a good choice for clean cacheline invalidation: the performance loss is under 0.7% and the RR phase is reduced from 13.7% to 2.4%.

For dirty cachelines, we proposed to first adopt the idle-time-based early write-back (EWR) scheme [22], and then apply the invalidation and bit-flipping. As with the idle time chosen for clean cacheline invalidation, we conducted a study over different idle times. The experimental results show that a 4K-cycle interval is also a good choice for the early write-back, which can effectively reduce the Live phase of dirty cachelines with negligible performance overheads.

Therefore, as discussed in Section 3, we choose a 4K-cycle interval for both idle-time-based cacheline invalidation and early write-back to minimize the area overhead of our AADC design. After the cacheline invalidation, we use the same 40K-cycle interval for bit-flipping in order to achieve duty cycle balancing. Our experimental results in Figure 5 show that our AADC design can reduce the average stress duty cycle ratio to 54.1% for all cachelines in the data cache with a performance loss under 0.8%. A previous study has shown that the gate-oxide failure probability is proportional to the device stress time [15]. Therefore, we can expect a similar MTTF (mean time to failure) improvement for the data cache, which is 48% in our study.

Figure 5: The average stress duty cycle (zero) ratio for all cachelines after applying the proposed AADC scheme.

Figure 6: The power reduction rate by applying the drowsy scheme.

For further power saving, we propose to apply the drowsy scheme to these invalidated cachelines. We scale the power numbers provided in [10] for this study.
Since the data in the invalidated cachelines of our AADC design will not be needed during the drowsy mode, the performance overhead of the wake-up operations of the drowsy scheme can be ignored or overlapped with the cache miss penalty. Figure 6 shows that our AADC design can achieve a 61.5% power reduction for the data cache, which can further mitigate the aging effect.
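
The combined policy described above (idle-time-based invalidation, early write-back for dirty lines, then periodic bit-flipping of invalid lines) can be illustrated with a minimal single-cell sketch. This is not the simulator used in the paper; the function name `simulate_line`, the interpretation of "4K" and "40K" as 4,000 and 40,000 cycles, the assumption that the program data bit is always 0, and the choice to write back and invalidate a dirty line in the same step are all assumptions made for illustration only.

```python
IDLE_LIMIT = 4_000      # assumed reading of the 4K-cycle idle interval
FLIP_INTERVAL = 40_000  # assumed reading of the 40K-cycle bit-flipping interval


def simulate_line(accesses, total_cycles, aadc, data_bit=0):
    """Track one representative bit of one cacheline, cycle by cycle.

    accesses maps a cycle number to "R" (read) or "W" (write).
    Returns (zero duty cycle ratio, number of early write-backs).
    """
    valid = dirty = False
    bit = data_bit  # a stale invalid line is assumed to hold the data value
    idle = since_flip = zero_cycles = early_writebacks = 0

    for cycle in range(total_cycles):
        op = accesses.get(cycle)
        if op is not None:
            # Hit, or miss followed by refill: the cell holds program data again.
            valid, bit, idle = True, data_bit, 0
            dirty = dirty or op == "W"
        elif valid:
            idle += 1
            if aadc and idle >= IDLE_LIMIT:
                if dirty:
                    # Early write-back before the line is given up (assumed
                    # to happen in the same step as the invalidation).
                    early_writebacks += 1
                    dirty = False
                valid, since_flip = False, 0
        elif aadc:
            # Invalid line: invert the stored bit every FLIP_INTERVAL cycles.
            since_flip += 1
            if since_flip >= FLIP_INTERVAL:
                bit ^= 1
                since_flip = 0
        zero_cycles += bit == 0

    return zero_cycles / total_cycles, early_writebacks


if __name__ == "__main__":
    trace = {0: "W"}  # hypothetical trace: one write, then the line goes dead
    print("baseline:", simulate_line(trace, 404_000, aadc=False))
    print("AADC    :", simulate_line(trace, 404_000, aadc=True))
```

For a line that is written once and then never touched again, the baseline cell holds 0 for the whole run, while the sketched policy brings the zero duty cycle ratio close to one half at the cost of a single early write-back. Real traces would also incur extra misses whenever an invalidated line is accessed again, which this sketch does not account for.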

5. CONCLUSION

The NBTI-induced aging effect is becoming a critical threat to the performance and reliability of future CMOS devices. The SRAM data caches in modern processors suffer high aging stress due to the unbalanced duty cycle ratio of the devices. In this paper, we propose an aging-aware data cache (AADC) design to combat NBTI-induced aging by duty cycle balancing. Based on our detailed study of the lifetime behaviors of the cachelines in the data cache, different aging reduction schemes are proposed for different lifetime phases. By applying our proposed idle-time-based invalidation for clean cachelines, early write-back and invalidation for dirty cachelines, and the bit-flipping scheme for invalid cachelines, the duty cycle ratio of the entire data cache can be balanced close to 50% with minimized overheads. For further power saving, we apply the drowsy scheme to the invalidated cachelines in our AADC design, which achieves a 61.5% power reduction in the data cache. Therefore, the NBTI degradation of the data cache can be significantly alleviated.

6. ACKNOWLEDGMENTS

This work was supported in part by a grant from Chinese NSF Award 61100035.

7. REFERENCES

[1] J. Abella, X. Vera, and A. Gonzalez. Penelope: The NBTI-aware processor. In Proceedings of IEEE/ACM International Symposium on Microarchitecture, pages 85–96, 2007.
[2] J. Abella, X. Vera, O. Unsal, and A. Gonzalez. NBTI-resilient memory cells with NAND gates for highly-ported structures. In Workshop on Dependable and Secure Nanocomputing, June 2007.
[3] A. Biswas et al. Computing architectural vulnerability factors for address-based structures. In Proc. of the IEEE International Symposium on Computer Architecture, June 2005.
[4] S. Borkar. Designing reliable systems from unreliable components: the challenges of transistor variability and degradation. IEEE Micro, 25(6):10–16, Nov. 2005.
[5] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for architectural-level power analysis and optimizations. In Proc. International Symposium on Computer Architecture, 2000.
[6] D. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. Technical Report 1342, Computer Sciences Department, University of Wisconsin, 1997.
[7] A. Calimera, M. Loghi, E. Macii, and M. Poncino. Dynamic indexing: Concurrent leakage and aging optimization for caches. In Proceedings of the ACM/IEEE International Symposium on Low-Power Electronics and Design, pages 343–348, August 2010.
[8] A. Calimera, M. Loghi, E. Macii, and M. Poncino. Partitioned cache architectures for reduced NBTI-induced aging. In Proceedings of the Design, Automation and Test in Europe, pages 938–943, March 2011.
[9] O. Ergin et al. Exploiting narrow values for soft error tolerance. IEEE Computer Architecture Letters, 5(2), July-Dec. 2006.
[10] K. Flautner, N. Kim, S. Martin, D. Blaauw, and T. Mudge. Drowsy caches: Simple techniques for reducing leakage power. In Proc. the 29th International Symposium on Computer Architecture, pages 148–157, Anchorage, AK, May 2002.
[11] E. Gunadi, A. Sinkar, N. Kim, and M. Lipasti. Combating aging with the Colt duty cycle equalizer. In Proceedings of the IEEE/ACM Int. Symp. on Microarchitecture, pages 103–114, 2010.
[12] T. Jin and S. Wang. Aging-aware instruction cache design by duty cycle balancing. In Proceedings of IEEE Computer Society Annual Symposium on VLSI, pages 195–200, 2012.
[13] S. Kaxiras, Z. Hu, and M. Martonosi. Cache decay: Exploiting generational behavior to reduce cache leakage power. In Proc. the Int'l Symposium on Computer Architecture, pages 240–251, 2001.
[14] S. V. Kumar, C. H. Kim, and S. S. Sapatnekar. Impact of NBTI on SRAM read stability and design for reliability. In Proceedings of Int. Sym. on Quality Electronic Design, 2006.
[15] E. Minami et al. Circuit-level simulation of TDDB failure in digital CMOS circuit. IEEE Trans. on Semiconductor Manufacturing, 8(3), Aug 1995.
[16] A. Ricketts, J. Singh, K. Ramakrishnan, N. Vijaykrishnan, and D. K. Pradhan. Investigating the impact of NBTI on different power saving cache strategies. In Proceedings of the Design, Automation and Test in Europe, pages 592–597, March 2010.
[17] E. Rosenbaum et al. Effect of hot-carrier injection on n- and pMOSFET gate oxide integrity. IEEE Electron Device Letters, 12(11), Nov 1991.
[18] T. Sherwood et al. Automatically characterizing large scale program behavior. In Proc. of ASPLOS X, October 2002.
[19] J. Shin et al. A proactive wearout recovery approach for exploiting microarchitectural redundancy to extend cache SRAM lifetime. In Proceedings of International Symposium on Computer Architecture, pages 353–362, 2008.
[20] T. Siddiqua and S. Gurumurthi. Recovery boosting: A technique to enhance NBTI recovery in SRAM arrays. In Proceedings of the IEEE Computer Society Annual Symposium on VLSI, July 2010.
[21] S. Wang et al. Low power aging-aware register file design by duty cycle balancing. In Proceedings of Conference on Design, Automation and Test in Europe, pages 546–549, 2012.
[22] S. Wang, J. Hu, and S. G. Ziavras. On the characterization and optimization of on-chip cache reliability against soft errors. IEEE Transactions on Computers, 58(9):1171–1184, September 2009.
[23] W. Wang et al. The impact of NBTI on the performance of combinational and sequential circuits. In Proceedings of the Design Automation Conf., 2007.
[24] W. Zhang. Computing cache vulnerability to transient errors and its implication. In Proc. of the 20th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, Oct. 2005.