Recovery-Based Design for Variation-Tolerant SoCs

Vivek Kozhikkottu, Sujit Dey and Anand Raghunathan
School of Electrical and Computer Engineering, Purdue University
School of Electrical and Computer Engineering, UC San Diego

ABSTRACT
Parameter variations have emerged as a significant threat to continued CMOS scaling in the nanometer regime. Due to the increasing performance penalties associated with worst-case design, recovery based design has emerged as a promising approach for dealing with the impact of variations. Previous work has applied recovery based design at the circuit and micro-architecture levels of abstraction. In this work, we address the problem of designing variation-tolerant SoCs using the recovery based design paradigm. We demonstrate that a monolithic implementation of recovery based design fails to scale for large SoCs. We propose the concept of recovery islands, wherein each island consists of one or more SoC components that can recover independently of the rest of the SoC, and demonstrate how our proposal can be easily realized via minor changes to a traditional SoC design flow. We study the tradeoffs involved in applying recovery based design at the system level. We demonstrate that it is critical to account for (i) the inherent diversity of the error-voltage profiles among various components in an SoC, and (ii) the impact of error recovery in a component on overall system performance. We then propose a systematic recovery-based SoC design methodology that partitions a given SoC into recovery islands and also computes the optimal operating points for each island, taking into account the various system level trade-offs involved. We evaluate our framework on three different SoC designs, an b MAC processor, an MPEG encoder and a Wireless Video Capture system, and demonstrate an average of 32% energy savings over conventional designs.
Categories and Subject Descriptors
B.7.1 [INTEGRATED CIRCUITS]: VLSI (Very large scale integration)

General Terms
Algorithms, Design

Keywords
System-on-chip, Variation Aware Design, Variation Tolerance, Low Power Design

This material is based upon work supported in part by the National Science Foundation under Grant No

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 2012, June 3-7, 2012, San Francisco, California, USA. Copyright 2012 ACM ACM /12/06...$

1. INTRODUCTION
Continued scaling of CMOS technologies has resulted in parameter variations emerging as a critical design concern. Parameter variations can be broadly classified as process variations, caused by the inherent nature of the manufacturing process, and environmental variations, caused by fluctuations in temperature and supply voltage. These parameter variations manifest as statistical behavior in the delay and power consumption of circuits, and have traditionally been dealt with by over-design. However, with continued scaling into the nanometer regime, the gap between typical-case and worst-case design is growing too large, and the performance and energy cost of worst-case design can no longer be ignored. To overcome the problems with worst-case design, recovery based design techniques such as Razor [1] and EDS [2] have been proposed. These techniques employ embedded error detection and recovery circuitry to help detect and recover from timing errors induced by variations. They help eliminate conservative voltage guard bands by dynamically controlling the supply voltage in response to the occurrence of timing errors.
Moreover, components can be voltage overscaled even beyond their zero error operating points to achieve considerable energy reductions for a negligible loss in performance [1]. These recovery based design techniques have hitherto been applied only at the circuit and micro-architecture levels [3,4]. We believe that ours is the first effort to explore the application of recovery-based design in a systematic manner to entire SoCs.

1.1 Paper Overview and Contributions
In this work, we address the problem of designing variation-tolerant SoCs using the recovery based design paradigm. The significant contributions of our work are as follows:
• We demonstrate that applying recovery based design in a monolithic fashion is not scalable for large SoC designs. We propose a new design approach in which SoCs are divided into multiple recovery islands, each of which can detect and recover from errors independently of the rest of the SoC. We also demonstrate that the communication architecture serves as an ideal variable latency interface for partitioning the SoC into recovery islands.
• We study the trade-offs involved in applying recovery based design at the system level. We demonstrate that each component's distinct error-voltage characteristics, as well as its impact on overall system performance, need to be considered while clustering components into recovery islands and computing their operating points.
• We propose a methodology that systematically partitions a given SoC into recovery islands and also computes the optimal operating point for each island. The framework takes into account the above trade-offs, as well as the complex interactions between different islands, using an emulation based performance analysis framework.
• We apply recovery based SoC design to three different SoC designs, an b MAC processor, an MPEG encoder and a Wireless Video Capture system, and obtain an average of 32% energy savings over conventional designs.

The rest of this paper is organized as follows.
Section 2 summarizes prior work on variation-aware system design. Section 3 describes the challenges involved in applying recovery based design to SoCs. Section 4 gives an overview of the proposed concept of recovery islands and the various interfaces needed to enable it. Section 5 analyzes the various system-level trade-offs involved in recovery island based SoC design with the help of an example. Section 6 describes our systematic recovery based SoC design methodology. Section 7 describes our experimental setup and presents the results obtained by applying the proposed framework to three example SoC designs.

2. RELATED WORK
In the context of SoCs, several previous efforts have demonstrated the strong potential of addressing variations at the system level. In the context of multiple voltage-frequency island based SoC design, several efforts [5, 6] have exploited the inherent flexibility of the multi-island design paradigm to mitigate the impact of variations. Techniques for analyzing the impact of process variations on system performance and power were developed in [7] and [8]. A variation tolerant on-chip communication architecture was discussed in [9]. In [10], the authors develop techniques to optimize system-level power management policies under the impact of variations. [11] proposes partitioning an SoC into fine grained body bias islands to help mitigate the impact of within-die leakage variations. However, most of these techniques only deal with manufacturing induced process variations and do not deal with workload, voltage and temperature based variations. Recovery based design, due to its dynamic and adaptive nature [1] [2], deals with all sources of variations and thus eliminates the need for conservative design margins.

Due to the various power-performance penalties associated with worst-case design, researchers have started actively developing recovery based design techniques. Razor [1] and EDS [2] propose circuit level mechanisms to detect and correct timing based errors, providing a safety net that allows the elimination of guard bands and design margins.
Furthermore, these mechanisms achieve substantial energy savings by facilitating voltage overscaling, a technique of scaling the supply voltage beyond the circuit's critical operating point, resulting in timing errors. In this context, [12] and [13] have proposed using cell sizing and dual threshold voltage cells to modify the timing slack of the frequently-occurring, near-critical timing paths to facilitate further voltage overscaling, thereby achieving additional energy savings. Similarly, at the architecture level, [14] and [15] have suggested architectural modifications to reshape the error-voltage profiles of underlying micro-architectural blocks so as to increase their potential for voltage overscaling. In [3] and [4] the authors argue that fine-grained adaptive biasing and voltage interpolation based techniques can be applied to processors instrumented with recovery mechanisms to help mitigate the impact of within-die parameter variations. However, as noted earlier, these techniques focus on the circuit and micro-architecture level trade-offs involved in applying recovery based design. In this paper, we focus on identifying the key system level characteristics and trade-offs that must be taken into account for applying recovery based design in the context of SoCs.

3. MOTIVATION
In this section, we motivate the need for a new approach to recovery based design for SoCs by outlining two major scalability concerns associated with applying recovery based techniques in a monolithic fashion. We utilize an example SoC design to help quantify these concerns. Figure 1 shows the block diagram of a Wireless Video Capture Device (WVCD) SoC consisting of ten components connected to a system bus. The SoC performs two main functions, namely, i) it encodes video frames stored in an on-chip frame buffer, and ii) it packetizes the frames using the b protocol and sends the packets out to a wireless interface for transmission.
The four important compute-intensive functions, i) Checksum Computation (CRC), ii) Wired Equivalent Privacy encryption (WEP), iii) Motion Estimation (ME), and iv) DCT compression (DCT), are all implemented as hardware accelerators.

[Figure 1: Wireless video capture SoC]

The first major factor limiting the scalability of a monolithic recovery based scheme is the impact of within-die parameter variations [16]. Within-die variations cause components within a given instance of the SoC to have differing performance-power characteristics. Recovery based design techniques typically try to operate a component at its optimal operating voltage point so as to eliminate the conservative voltage guard bands needed to deal with variations. However, in a monolithic implementation, the operating voltage of the entire SoC would be determined by the voltage of its slowest component, the one that has been impacted most negatively by variations. As a consequence, a large number of components would be forced to operate at sub-optimal voltages, leading to reduced energy benefits. Figure 2 shows the mean energy savings (for WVCD SoC chips) obtained by a monolithic implementation of recovery based design, for increasing values of within-die process variations. The figure shows that for higher values of within-die variations, the energy savings attained by monolithic recovery based design decrease significantly. Moreover, increased within-die variations in other important parameters such as voltage, temperature and workload would only exacerbate this effect. In summary, within-die parameter variations pose a severe challenge to scaling recovery based design to large SoCs.

[Figure 2: Mean energy savings vs. within-die variations]

The second major concern affecting the scalability of monolithic recovery based design is the strict timing constraint required for performing error detection and correction.
The timing constraint can be expressed as follows:

$$T_{clk\_tree} + T_{delay\_sample} + T_{clk\_error} + T_{error\_agg} < T_{clk\_period} \tag{1}$$

where $T_{clk\_tree}$ is the clock to flip-flop delay, $T_{delay\_sample}$ and $T_{clk\_error}$ represent the delays associated with generation of the error signal by the shadow flip-flop, and $T_{error\_agg}$ refers to the delay required for aggregating all the error signals back to gate the clock source. All the above delays must add up to less than the system's clock period ($T_{clk\_period}$) so that, when an error is detected, clock gating is successfully performed before the start of the next clock cycle. However, with increasing SoC sizes and the poor scaling of interconnect delays [17], the global delay components ($T_{clk\_tree}$ and $T_{error\_agg}$) restrict the applicability of monolithic implementations of recovery based design to small systems. For example, for an operating frequency of 1 GHz at the 45nm technology node, our analysis suggests that the monolithic scheme would be feasible only for circuits of size up to 0.9 mm$^2$. Thus, interconnect delays enforce a strict limit on the size of SoC designs for which monolithic recovery based design is applicable.

Due to the above mentioned limitations associated with monolithic recovery based design, we propose applying recovery based techniques to SoCs in a more fine grained manner. There are two key issues that must be addressed in order to realize this proposal. First, we need to allow components to recover independently of the rest of the SoC, while maintaining correct operation. Second, we must explore the design space of possible partitions, ranging from a monolithic recovery based design on one extreme to a fine-grained partitioning where each SoC component is in its own recovery island. In doing so, it is necessary to consider the area and energy overheads associated with the creation of recovery islands. We address these issues in the following sections.

4. RECOVERY ISLANDS: SCALING RECOVERY BASED DESIGNS TO SOCS
Traditionally, SoCs are partitioned into coarse-grained voltage and frequency islands. We propose partitioning these voltage-frequency islands further into more fine grained recovery islands.
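The feasibility condition of Equation 1, which every recovery island must meet, can be illustrated with a minimal Python sketch. The class name, function name and all delay values below are hypothetical and chosen only for illustration; they are not taken from the analysis reported in this paper.

```python
from dataclasses import dataclass


@dataclass
class IslandTiming:
    """Delay terms of Equation 1, in nanoseconds (illustrative values only)."""
    t_clk_tree: float      # clock to flip-flop delay
    t_delay_sample: float  # shadow flip-flop sampling delay
    t_clk_error: float     # error signal generation delay
    t_error_agg: float     # delay to aggregate error signals back to the clock gate

    def total(self) -> float:
        """Left-hand side of Equation 1."""
        return (self.t_clk_tree + self.t_delay_sample
                + self.t_clk_error + self.t_error_agg)


def satisfies_equation_1(timing: IslandTiming, clk_period_ns: float) -> bool:
    """True if error detection and clock gating complete within one clock period."""
    return timing.total() < clk_period_ns


# Hypothetical delays for a small island and a large monolithic design at 1 GHz.
CLK_PERIOD_NS = 1.0
small_island = IslandTiming(0.25, 0.125, 0.125, 0.25)  # total 0.75 ns
monolithic = IslandTiming(0.5, 0.125, 0.125, 0.5)      # total 1.25 ns

print(satisfies_equation_1(small_island, CLK_PERIOD_NS))
print(satisfies_equation_1(monolithic, CLK_PERIOD_NS))
```

In this sketch only the two global terms differ between the two cases, mirroring the observation that the clock tree and error aggregation delays are what limit the size of a region that can recover as a unit.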
Each recovery island consists of one or more SoC components and must possess the following key properties: i) the ability to detect and recover from errors in any of its components independently of the rest of the SoC, and ii) dimensions that allow the timing constraint imposed by Equation 1 to be satisfied.

One of the challenges associated with enabling recovery islands to recover independently is to find suitable points for partitioning a given SoC into islands. Creating recovery islands from arbitrary partitions of logic would require extensive re-design of the underlying components and their respective interfaces. From an external perspective, each recovery island can take a variable number of cycles to respond to a transaction, depending on whether a component in the island has encountered an error or not. We note that the system-level communication architecture (bus or network-on-chip) used in most SoCs is already designed to tolerate variable latencies, be it due to bus contention or a component being busy. The communication architecture thus serves as an ideal variable latency interface to partition the SoC into recovery islands. However, one cannot directly connect the recovery islands to the interconnect fabric, as they would violate the established interface protocols. Appropriate cross recovery-island interfaces need to be designed to interface recovery islands with the rest of the system.

Figure 3 presents the structure of a generic recovery island based SoC. The SoC shown in the figure has been partitioned into three recovery islands, each consisting of one or more SoC components. Each of the recovery islands is connected to the system interconnect with the help of cross-recovery island interfaces. We note that the system interconnect fabric itself is excluded from the recovery island partitioning process and therefore needs to be designed conservatively so as to avoid timing errors.
A more detailed description of these interfaces is provided in Section A of the supplementary material.

[Figure 3: Recovery island based design]

For each recovery island, the timing critical flip-flops are instrumented with error detection and recovery circuitry [1,2]. When an error is detected, the clock is gated for the next cycle to allow the correct values to be restored to all the flip-flops. All the error signals are then aggregated and fed into the operating point controller [1], which is responsible for dynamically controlling the supply voltage of the island to maintain a desired error rate. We achieve supply voltage scaling by utilizing voltage interpolation [18] so as to avoid the significant overheads associated with voltage regulators and converters. Voltage interpolation provides the ability for different groups of logic gates within a block to select between two static supply voltages, VDDH and VDDL. The scheme enables dynamic modulation of a circuit's delay by choosing an appropriate combination of logic segments within a block to be connected to VDDH and VDDL, respectively.

The recovery island based design methodology incurs area and energy penalties associated with the additional cross-island interfaces and operating point control mechanisms. As a consequence, partitioning the SoC at the granularity of individual components can lead to considerable energy overheads and may significantly diminish the system level energy benefits obtained by the framework. Thus, it is imperative to find an energy-optimal partitioning of the SoC into recovery islands.

5. DESIGN TRADEOFFS
In this section, we explore the various system level design trade-offs involved in recovery island based SoC design. We utilize the previously described WVCD SoC (Figure 1) for illustrating some of these trade-offs.
[Figure 4: Error rate versus voltage profile. Axes: error rate (%) vs. voltage; curves: CPU, WEP, CRC, ME, DCT, total.]

We first describe the component level characteristics to be considered when partitioning an SoC into recovery islands. The operating voltage of a component in recovery based design depends on its inherent error-voltage profile, which is in turn determined by various factors such as circuit structure, component size, application workload (path activation probabilities) as well as process, voltage and temperature variations. For the example WVCD SoC, Figure 4 shows the error versus voltage profiles for the largest five components. We also plot the total error versus voltage profile for the entire SoC. As the figure shows, the total system error is mostly dominated by errors in the CPU. In a scheme in which the entire SoC is treated as a single recovery island (monolithic implementation of recovery based design), each component would be operated at a voltage mostly determined by the CPU's error-voltage profile. On the other hand, if we perform recovery at a component level, each component would be operated at its own optimum voltage based on its error-voltage profile, which would lead to substantial energy savings. However, this scheme would also involve excessive overheads associated with implementing recovery islands. In general, SoCs have components that tend to differ greatly in their structure, complexity, size and workload, leading to diverse error-voltage profiles across components. This diversity is further amplified by intra-die variations in process parameters, temperature gradients across the chip, as well as local voltage fluctuations. The partitioning scheme thus needs to incorporate this inherent diversity in the error-voltage profiles among various components, and the overheads associated with recovery islands, in addition to system-level factors as discussed next.

[Figure 5: System performance loss versus error rate. Axes: system performance loss (%) vs. error rate (%); curves: ME, DCT, CPU, WEP, CRC.]

We now motivate the need to consider system level effects while choosing optimal operating points for each recovery island. Errors in a component force it to spend clock cycles in recovery and thereby affect system performance.
However, depending on how critical a component is to overall system performance, the same error rate in different components can have different effects on system performance. Also, due to complex inter-dependencies between components (e.g., concurrent execution and synchronization), the system performance impact due to errors in different components need not be additive. Figure 5 plots the system performance loss versus error rate for different components of the WVCD SoC. As can be seen, errors in the ME accelerator have the greatest impact on system performance, followed by the DCT accelerator. Therefore, for a given system level performance target, a configuration in which the rest of the components (CRC, CPU, WEP) operate at higher error rates, and thereby can be voltage scaled more aggressively, is more energy efficient than a configuration in which the ME and DCT components operate at higher error rates. In complex SoCs, the correlation between error rate and system performance loss can be quite varied across components, and this diversity needs to be considered while selecting the optimal operating points for each SoC component. Recovery islands that consist of components that are more critical to system performance should be operated at lower error rates, whereas those that contain components with a lower impact on system performance should be voltage scaled more aggressively so as to reduce overall system power. Also note that it is beneficial during partitioning to group together components that have a similar impact on system performance.

6. RECOVERY BASED SOC DESIGN
In this section, we describe a systematic methodology for recovery based SoC design that considers the issues and tradeoffs described in the previous section. The proposed methodology, shown in Figure 6, takes as its input the given SoC architecture, the application software, the desired performance target and component-level clustering constraints derived from the SoC floorplan.
It produces as its output the best SoC partitioning scheme along with optimized operating points for each island. The methodology consists of three main steps. In the component characterization step, we compute the error rate, system performance loss and energy savings for each SoC component at each possible operating voltage. The island partitioning and optimization step partitions the SoC appropriately into recovery islands, and computes the best operating point for each island. Finally, the local search step further tunes the operating points obtained in the previous step while considering the complex performance interactions between different SoC components. We elaborate upon these steps in the rest of this section.

[Figure 6: Recovery island based design methodology]

6.1 Component Characterization
In this step, we first obtain the error-voltage profile and the error-system performance loss profile for each component. For generating the error-voltage profiles, we first capture bus level input traces for each component by performing cycle-accurate functional simulation of the SoC for representative workloads. We then use the captured traces as input vectors to perform post-synthesis simulations at different operating voltages to obtain the error-voltage profile for each component. For the error vs. system performance loss profile, we use an emulation based performance analysis framework. We instrument each SoC component with error injectors and circuitry that mimics error recovery, and obtain the system performance loss for increasing error rates in each component. More details on the emulation setup are provided in Supplemental Section B. We then combine both profiles with component level energy estimates to obtain an error, system performance loss and energy tuple for each operating point (voltage).
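The final combination step of the characterization can be sketched as follows. This is a minimal Python illustration under stated assumptions: the profile values are invented, the names `OperatingPoint`, `interpolate` and `characterize` are hypothetical, and the piecewise-linear lookup of performance loss from error rate is an assumption made here for illustration rather than a description of the actual tool flow.

```python
from typing import Dict, List, NamedTuple, Tuple


class OperatingPoint(NamedTuple):
    voltage: float     # operating voltage (V)
    error_rate: float  # timing error rate (%)
    perf_loss: float   # system performance loss (%)
    energy: float      # component energy (arbitrary units)


def interpolate(points: List[Tuple[float, float]], x: float) -> float:
    """Piecewise-linear lookup over (x, y) pairs with strictly increasing x.

    Values outside the sampled range are clamped to the nearest endpoint.
    """
    if x <= points[0][0]:
        return points[0][1]
    if x >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("points must be sorted by x")


def characterize(error_vs_voltage: Dict[float, float],
                 loss_vs_error: List[Tuple[float, float]],
                 energy_vs_voltage: Dict[float, float]) -> List[OperatingPoint]:
    """Build one (error, performance loss, energy) tuple per operating voltage."""
    result = []
    for voltage in sorted(error_vs_voltage):
        error = error_vs_voltage[voltage]
        loss = interpolate(loss_vs_error, error)
        result.append(OperatingPoint(voltage, error, loss, energy_vs_voltage[voltage]))
    return result


# Invented profiles for a single hypothetical component.
error_vs_voltage = {1.0: 0.0, 0.9: 1.0, 0.8: 4.0}          # voltage -> error rate (%)
loss_vs_error = [(0.0, 0.0), (2.0, 1.0), (4.0, 3.0)]        # error rate (%) -> loss (%)
energy_vs_voltage = {1.0: 100.0, 0.9: 81.0, 0.8: 64.0}      # voltage -> energy

points = characterize(error_vs_voltage, loss_vs_error, energy_vs_voltage)
for p in points:
    print(p)
```

The resulting list of tuples, one per candidate voltage, is the form of input that the later partitioning and optimization steps consume in the sketches of this document.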
6.2 Island Partitioning and Optimization
In this step we derive an optimized partition of the SoC into recovery islands and obtain the best operating point for each island. The number of ways an SoC consisting of N components can be partitioned into k recovery islands can be quite large ($N^k$). The search space involved in identifying an optimal operating point for each island further increases the design space by $O^k$, where O is the number of operating points. To efficiently explore this design space, we adopt an iterative procedure wherein we start off with each component in a separate partition, and iteratively apply the operating point selection and island clustering steps until we can no longer find a better partition.

Consider the initial partition where each component is in a separate recovery island. We can compute the best operating point for each component by modeling it as a convex optimization problem as shown in Equation 2.

$$\underset{V_i}{\text{minimize}} \; \sum_{i=1}^{n} E_i(V_i) \quad \text{subject to} \quad \sum_{i=1}^{n} P_i(V_i) \le \zeta; \quad V_{min} \le V_i \le V_{max}, \quad i = 1, \ldots, n \tag{2}$$

In the above equation, $E_i(V_i)$ and $P_i(V_i)$ refer to the energy and system performance loss, respectively, of the i-th component operating at voltage $V_i$, and $\zeta$ is the constraint on acceptable system performance loss. For this step, we make the simplifying assumption that the system performance loss due to N different components is linearly additive. This is not true in general, due to effects such as communication dependencies, shared resources, system-level critical paths, etc. We ignore these effects in the island partitioning and optimization step to make the problem tractable, but account for them in the subsequent local search step.

Once we obtain the optimal operating points for each component, we group together the two components whose operating points are closest, if this clustering is valid based upon the floorplan derived constraints. These constraints are represented as a clustering matrix that specifies which component pairs could be grouped together, based on their proximity in the SoC's floorplan.
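To make Equation 2 concrete, the following Python sketch selects one operating point per component under the additive performance loss assumption. Note that it deliberately swaps in exhaustive enumeration over a handful of discrete operating points in place of a convex solver, which is only practical for tiny examples. The function name, the component data and the loss bound are all hypothetical.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

# Each candidate operating point is (voltage, perf_loss_percent, energy).
Point = Tuple[float, float, int]


def select_operating_points(components: Dict[str, List[Point]],
                            max_loss: float
                            ) -> Optional[Tuple[Dict[str, float], int]]:
    """Minimize total energy subject to the summed performance loss bound.

    Returns ({component: voltage}, total_energy), or None if no combination
    of operating points meets the bound.
    """
    names = list(components)
    best = None
    for combo in product(*(components[name] for name in names)):
        total_loss = sum(point[1] for point in combo)  # additive assumption
        if total_loss > max_loss:
            continue
        total_energy = sum(point[2] for point in combo)
        if best is None or total_energy < best[1]:
            best = ({n: p[0] for n, p in zip(names, combo)}, total_energy)
    return best


# Invented data: the ME accelerator is performance critical, the CRC block is not.
components = {
    "ME":  [(1.0, 0.0, 100), (0.9, 2.5, 81)],
    "CRC": [(1.0, 0.0, 100), (0.9, 0.5, 81), (0.8, 1.0, 64)],
}

best = select_operating_points(components, max_loss=2.0)
print(best)
```

With these invented numbers, scaling the ME accelerator alone would already exceed the loss bound, so the search keeps it at the highest voltage and scales only the CRC block, which matches the qualitative argument of Section 5.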
Grouping components reduces the overheads associated with recovery islands, at the cost of forcing the grouped components to operate at the same voltage. The grouping heuristic therefore minimizes the sub-optimality in operating points. We choose the two components j and k that, when grouped together, give the best energy savings $E = E_{ri} - (E_k(V_j) - E_k(V_k))$. The first term, $E_{ri}$, refers to the energy savings due to the reduced overheads of having one less recovery island, and the second term refers to the energy loss due to one of the components (in this case, k) going from a lower operating voltage $V_k$ to a higher point $V_j$. We iteratively perform the island partitioning and operating point selection steps until we find no further grouping that can lead to energy savings.

6.3 Local Search
In this step, we tune the operating points for the partitioned SoC obtained from the previous phase, taking into account the inter-dependencies between various SoC components. We achieve this by first performing emulation of the SoC at the operating points obtained from the previous step to compute the actual system performance loss. Next, based on whether the performance loss is larger or smaller than the given target, we increase or decrease the operating voltages for each island by one unit step and measure the resulting system performance loss. We then greedily change the operating voltage of the island with the best energy savings to performance loss ratio, and repeat this process until the specified performance target is just satisfied.

7. EXPERIMENTAL RESULTS
In this section, we first describe our experimental setup and the example SoCs used in our study. We then present the energy savings obtained by utilizing our framework on three different SoC designs. Our experimental methodology to evaluate the proposed concepts consists of various commercial and research tools.
For obtaining the error-voltage profiles, we first perform logic synthesis of each component with Synopsys Design Compiler using the IBM 45nm technology cell library. We utilize VARIUS [19] for modeling the impact of inter-die and intra-die process and temperature variations on component-level error-voltage profiles. For each of our experiments we generate chips, each of which has different intra-die variation profiles for $V_{th}$ and $L_{eff}$ values. To obtain the temperature distribution across the chip, we provide average power consumption values of each SoC component, along with the SoC floorplan, to the HotSpot thermal modeling tool [20]. We use NANOSIM [21], a transistor level simulator, to obtain the power consumption data for each of the SoC components at different operating voltages. The memory energy consumption and access times are modeled using CACTI5.3 [22]. We use an Altera DE3 board [23] as our emulation platform for obtaining the component level error-rate versus system performance loss profiles.

We evaluate our framework on three example SoC designs, an b MAC processor, an MPEG encoder, and a Wireless Video Capture Device. The WVCD system was described in detail in Section 3. We now briefly describe the MPEG and MAC systems. MPEG encoding entails two compute-intensive operations, Motion Estimation (ME) and DCT Compression, which are implemented as hardware accelerators. The input frames to be encoded are stored in an on-chip frame buffer, and an embedded processor is in charge of co-ordinating the transfer of frames between the frame buffer and the two accelerators, and also executes the remaining tasks. The MAC processor implements the key steps of the b MAC protocol, and consists of a processor, hardware accelerators for CRC and WEP computation, and peripherals connected by a system interconnect.
In order to verify the functional correctness of the various cross island interfaces as well as to accurately model the impact of errors on system performance, each SoC was partitioned into recovery islands using the proposed methodology and emulated on the DE3 platform.

[Figure 7: Energy distribution for conventional and recovery based SoC designs]

Figure 7 presents a box whisker plot of the normalized energy consumption of distinct chip instances for each of the three example SoCs. For each SoC, we evaluate the energy consumption under three different design schemes. Traditional refers to a guard band based design scheme wherein the voltage is chosen based on timing analysis using the worst case process/temperature corner provided in the cell library.

One Island refers to a recovery-based design wherein the entire SoC is treated as a single recovery island, ignoring the feasibility of the timing constraint described in Equation 1. Finally, RBD refers to the proposed recovery island based design framework. Both the One Island and RBD cases are designed for a target system performance loss of no more than 2% due to error recovery. As can be seen from the figure, the RBD design achieves the best energy distribution: the median of the energy distribution is reduced by 31%-33% compared to the Traditional design. The One Island design is able to eliminate the overheads associated with die-to-die variations and thereby achieves 18%-21% improvements in the median of the energy distribution. RBD outperforms the One Island case by 11%-14%. These results clearly illustrate that (i) recovery based SoC design can significantly optimize energy consumption under variations, and (ii) the proposed recovery island based SoC design framework maximally leverages the potential of recovery based design.

[Figure 8: Energy savings sensitivity to magnitude of WID variations]

Figure 8 plots the percentage energy savings offered by the RBD scheme over the One Island scheme with increasing values of within die manufacturing variations. With increasing values of σ/μ, the energy savings offered by the RBD scheme increase, as it is able to locally reconfigure the operating voltage to each island's characteristics. Also note that the RBD scheme performs better for the Wireless Video Capture Device, as it has a larger number of components and hence displays more diversity across components. Figure 9 plots the percentage energy savings obtained for the WVCD SoC as a function of the number of recovery islands. As can be seen from the figure, the optimal energy savings are obtained for an SoC partitioned into three recovery islands.
For larger numbers of recovery islands, the overheads associated with recovery islands begin to dominate over the potential energy savings attainable by performing recovery at a finer granularity. The above example clearly demonstrates the need for performing recovery based design at an optimal granularity and hence the partitioning methodology presented in Section 6.

Figure 9: Energy savings sensitivity to number of recovery islands for the WVCD system

Table 1 details the number of components, the area overheads and the number of recovery islands in the final clustering, for each of the three example SoC designs. The area and energy overheads of the required cross-recovery island interfaces and operating point controllers were estimated by synthesizing them using the IBM 45nm library. These overheads are added to the overheads reported in [1] to estimate the total overheads of recovery based design. In summary, we believe that our experiments clearly illustrate the potential benefits of recovery based SoC design in optimizing energy consumption under variations.

Table 1: Recovery island design details

SoC Design   No. of Components   Area Overhead (%)   Best No. of Recovery Islands
MAC          8                   3.2                 2
MPEG         8                   4.7                 3
WVCD         [missing]           [missing]           3

8. CONCLUSION

We explored the concept of recovery based design and demonstrated how one can implement such a paradigm in the context of modern SoC designs. We presented a variation aware framework for partitioning an SoC into recovery islands and also finding the optimal operating points for each island. We applied the proposed framework to three example SoCs and demonstrated substantial energy benefits over traditional guard band based design.

9. REFERENCES

[1] D. Ernst et al., "Razor: A low-power pipeline based on circuit-level timing speculation," in Proc. MICRO, 2003.
[2] K. Bowman, J. Tschanz, C. Wilkerson, S. Lu, T. Karnik, V. De, and S. Borkar, "Circuit techniques for dynamic variation tolerance," in Proc. DAC, 2009.
[3] M. Gupta, J. Rivers, P. Bose, G. Wei, and D. Brooks, "Tribeca: Design for PVT variations with local recovery and fine-grained adaptation," in Proc. MICRO, 2009.
[4] S. Sarangi, B. Greskamp, A. Tiwari, and J. Torrellas, "EVAL: Utilizing processors with variation-induced timing errors," in Proc. MICRO, 2008.
[5] U. Y. Ogras, R. Marculescu, and D. Marculescu, "Variation-adaptive feedback control for networks-on-chip with multiple clock domains," in Proc. DAC, 2008.
[6] S. Garg and D. Marculescu, "System-level throughput analysis for process variation aware multiple voltage-frequency island designs," ACM TODAES, vol. 13, no. 4, pp. 1-25.
[7] V. J. Kozhikkottu, R. Venkatesan, A. Raghunathan, and S. Dey, "VESPA: Variability emulation for System-on-Chip performance analysis," in Proc. DATE, 2011.
[8] S. Chandra, K. Lahiri, A. Raghunathan, and S. Dey, "Considering process variations during system-level power analysis," in Proc. ISLPED, 2006.
[9] S. Pasricha, Y. Park, N. Dutt, and F. J. Kurdahi, "System-level PVT variation-aware power exploration of on-chip communication architectures," ACM TODAES, vol. 14, no. 2, pp. 1-25.
[10] S. Chandra, K. Lahiri, A. Raghunathan, and S. Dey, "Variation-tolerant dynamic power management at the system-level," IEEE TVLSI, vol. 17, no. 9.
[11] S. Garg and D. Marculescu, "System-level mitigation of WID leakage power variability using body-bias islands," in Proc. CODES+ISSS, 2008.
[12] A. Kahng, S. Kang, R. Kumar, and J. Sartori, "Designing a processor from the ground up to allow voltage/reliability tradeoffs," in Proc. HPCA, 2010.
[13] L. Wan and D. Chen, "DynaTune: Circuit-level optimization for timing speculation considering dynamic path behavior," in Proc. ICCAD, 2009.
[14] B. Greskamp et al., "Blueshift: Designing processors for timing speculation from the ground up," in Proc. HPCA, 2009.
[15] N. Zea, J. Sartori, B. Ahrens, and R. Kumar, "Optimal power/performance pipelining for error resilient processors," in Proc. ICCD, 2010.
[16] K. Bowman, A. Alameldeen, S. Srinivasan, and C. Wilkerson, "Impact of die-to-die and within-die parameter variations on the clock frequency and throughput of multi-core processors," IEEE TVLSI, vol. 17, no. 12, Dec.
[17] ITRS.
[18] K. Brownell, G. Wei, and D. Brooks, "Evaluation of voltage interpolation to address process variations," in Proc. ICCAD, 2008.
[19] S. R. Sarangi et al., "VARIUS: A model of process variation and resulting timing errors for microarchitects," IEEE Trans. Semiconductor Manufacturing, vol. 21, no. 1, pp. 3-13, 2008.
[20] K. Skadron et al., "Temperature-aware microarchitecture," in Proc. ISCA, 2003.
[21] Nanosim, Synopsys Inc.
[22] CACTI-5.3.
[23] Altera.

SUPPLEMENTAL SECTION

A. CROSS ISLAND INTERFACES

In this section, we give an overview of the cross island interfaces needed to ensure correct functioning of the recovery island based designs. As noted earlier, each recovery island needs to adhere to the existing system bus protocols and hence appropriate wrappers need to be designed for each island. In this work, we implemented and tested interface wrappers (explained below) for both master and slave interfaces of a commercially available communication architecture, the Avalon Interconnect Fabric from Altera [23]. A similar procedure can be applied to design interface wrappers for any other standard communication architecture.

Figure 10: Cross island interface logic

Figure 10 shows an overview of the wrapper interface for a Read-Write Avalon Master. The wrapper needs to deal with two scenarios. First, it needs to ensure that a read or write request sent out by the component during an error recovery cycle is not interpreted by the communication architecture as two requests. Second, it needs to make sure that any data returned by the communication architecture during an error recovery cycle is always captured and not lost.

As can be seen from the figure, the wrapper consists of two major components, the intervention detection logic and the selection and sampling logic. The intervention detection logic analyzes the error signal coming from an island, the request signals from the master and the data valid signal from the bus to determine if there is a need to intervene in the current cycle. The selection logic is a set of multiplexers that perform the desired modification to the bus signals.

Consider a scenario wherein the master interface sends out a write request during a cycle in which an error occurred. The intervention detection logic should detect this scenario and deassert the write request signal in the next cycle, so that the system bus does not treat it as two distinct write requests. This functionality is achieved with the help of simple multiplexer logic.

The more complicated scenario arises when the master issues a read request and the system bus responds to it during a recovery cycle. In this case, we need to ensure that the data returned is appropriately captured and is available to the master in the next cycle. This functionality is achieved with the help of sampler logic, which always stores a clock-delayed version of the read data bus signal. The selection logic then sets the data valid signal high on the next cycle and routes the sampled read data signal to the master interface.

B. EMULATION AND ERROR INJECTION FRAMEWORK

In this section, we describe in detail the emulation and error injection framework utilized to obtain the error vs. system performance loss profiles for each SoC component. Figure 11 gives an overview of the proposed emulation based error injection framework. To perform the required analysis, we first instrument each SoC component with the cross island interfaces described in the previous section. To analyze the impact of error recovery on system performance, we mimic the error aggregation signal generated by shadow latches using a synthetic error injection module and clock gate each component using the generated error signal. The error injection circuit consists of a random number generator (LFSR) and a software programmable control register. The error signal is produced by comparing the generated random number to the threshold value programmed into the control register. Thus, the error rate in a component can be appropriately controlled by writing the required value into a threshold register through software.
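The synthetic error injection module of Section B (an LFSR whose output is compared against a software-programmed threshold) can be modeled in a few lines of Python. This is only a behavioral sketch under stated assumptions, not the circuit used in the paper: the 16-bit width, the tap positions, the seed, the direction of the comparison (error when the LFSR value is below the threshold), and the names `ErrorInjector` and `measured_error_rate` are all illustrative choices that do not appear in the original text.

```python
class ErrorInjector:
    """Behavioral model of an LFSR-plus-threshold error injection module.

    Assumptions (not from the paper): 16-bit Fibonacci LFSR with taps
    16, 14, 13, 11, and an error is signalled when the LFSR value is
    strictly below the programmed threshold.
    """

    def __init__(self, threshold: int, seed: int = 0xACE1) -> None:
        if not 0 <= threshold <= 0xFFFF:
            raise ValueError("threshold must fit in 16 bits")
        if seed & 0xFFFF == 0:
            raise ValueError("an all-zero LFSR state never changes")
        self.threshold = threshold  # models the software-programmable register
        self.state = seed & 0xFFFF  # current LFSR contents

    def step(self) -> bool:
        """Advance one clock cycle and return True if an error is signalled."""
        s = self.state
        # Feedback bit for the polynomial x^16 + x^14 + x^13 + x^11 + 1.
        bit = (s ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1
        self.state = (s >> 1) | (bit << 15)
        return self.state < self.threshold


def measured_error_rate(threshold: int, cycles: int = 65535) -> float:
    """Fraction of cycles in which the injector raises the error signal."""
    injector = ErrorInjector(threshold)
    errors = sum(injector.step() for _ in range(cycles))
    return errors / cycles
```

Because this LFSR visits every nonzero 16-bit value once per period, the long-run error rate in the model is close to threshold / 65536, so a control loop would write a threshold of roughly 656 to request an error rate near 1%. In hardware, the returned flag would drive the clock gate of the component under test.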
Figure 11: Emulation based error injection framework

The application program that runs on the SoC is instrumented with a software control loop that is in charge of programming the error rate for a given component, executing the application and finally measuring the overall system performance with the help of hardware performance counters. We note that the system performance metric is chosen by the system designer and can be anything from throughput or latency to a pre-defined performance score over a set of benchmarks. The emulation board used for our experiments is an Altera DE3 board equipped with a Stratix III EP3SL150 FPGA. The proposed methodology can also be applied to any state-of-the-art emulation platform.

C. DISCUSSION

For the recovery based design paradigm to be widely applicable to a large class of SoCs, it needs to be compatible with current design flows. In this section, we discuss key considerations in this regard. We also explore alternative methodologies that could be incorporated into our proposed framework for differing design requirements.

Incorporating recovery based design into current design flows: A key requirement needed to utilize the proposed recovery based design paradigm is the ability to partition an SoC into multiple recovery islands. As noted in Section 4, variable latency interfaces in the system serve as ideal points around which the system may be partitioned. Most interfaces that exist in commercial SoCs, such as communication channels, system buses and on-chip networks, utilize latency insensitive protocols and hence can be appropriately re-designed or instrumented with interface wrappers to ensure correct functionality even when a component is unavailable during the error recovery process.

Most commercial SoCs include components that are equipped with recovery mechanisms such as pipeline flushes and state machine rollback. These mechanisms are essential for correcting errors from sources such as speculative execution and soft errors. Although in this study we chose to utilize a single-cycle clock gating based recovery scheme for each island, such components can instead utilize their own built-in recovery mechanisms for dealing with timing violations. The proposed framework is not restricted to any specific error recovery scheme and can easily be adapted to deal with multiple recovery mechanisms that can exist within an SoC.

Current SoC platforms make use of various power management schemes to dynamically adapt to an application's time-varying power-performance requirements. Dynamic voltage-frequency scaling (DVFS) is one such widely used mechanism, which modulates the voltage and frequency of individual SoC components/islands based on workload characteristics. The voltage interpolation scheme utilized by the framework requires two supply voltage rails, VDDH and VDDL. One possible integration scheme with DVFS would involve utilizing existing DVFS controllers to decide the VDDH and VDDL operating voltages. The recovery based design scheme's operating point controller can then perform more fine grained voltage interpolation based on the current error rate.
Thus, the proposed framework can be integrated with DVFS with minimal changes to the overall design flow.

Another common practice employed in current commercial SoC design involves using IP modules procured from external vendors. These components are often non-modifiable and cannot be instrumented with the circuitry needed to detect and recover from errors. In such a scenario, these components alone may be operated with design margins, and only the other system components are considered in the recovery based design process.

Alternative design methodologies: In this work, we made several choices, such as floorplan driven clustering, emulation based local search for eliminating non-linearities associated with component inter-dependencies, the number of operating points under consideration, and a static error-voltage profile based evaluation methodology. We now analyze the various alternative choices that could be adopted for achieving different end design objectives, such as improved energy benefits or reduced emulation runtime.

One such alternative involves performing cluster driven floorplanning, wherein various components are first clustered together without considering the delay constraints essential for correct functionality. The clustered system is then floorplanned and evaluated for delay violations. Performing partitioning prior to floorplanning could potentially help in physically grouping together components that are most suited for clustering, thereby leading to improved energy savings. However, if the current clustering configuration violates the delay requirements, the above described process would have to be repeated with the next best clustering configuration. Also, in some commercial design flows, floorplanning is done quite early in the design cycle and may not be flexible to changes thereafter.

In this study we considered ten distinct operating points at which each component was characterized. Thus, a k-component system would require 10 × k emulation runs to derive the error vs. system performance loss profiles. Reducing the number of operating points proportionately decreases the total run time, at the cost of increased energy consumption due to a more coarse grained search space. The emulation based local search phase could also be replaced with an appropriate analytical performance model to attain similar run time savings. However, this scheme would be applicable only for systems that have relaxed overall system performance constraints, as the complex inter-component interactions cannot be completely captured by an analytical framework.
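As a rough illustration of the characterization cost discussed above, the number of emulation runs scales with the product of the component count and the number of operating points. The helper below is a minimal sketch; the function name `emulation_runs` and its parameters are hypothetical, and it assumes each component is characterized independently at every operating point, as the text describes.

```python
def emulation_runs(num_components: int, operating_points: int = 10) -> int:
    """Runs needed to profile every component at every operating point."""
    if num_components < 0 or operating_points < 0:
        raise ValueError("counts must be non-negative")
    return num_components * operating_points


# The MAC design in Table 1 has 8 components.
full_profile = emulation_runs(8)        # ten operating points per component
coarse_profile = emulation_runs(8, 5)   # half as many operating points
```

For the 8-component MAC design this gives 80 runs at ten operating points and 40 runs at five, which is the proportional run time reduction described above; the price is a coarser set of candidate operating voltages per island.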
