MLP-aware Instruction Queue Resizing: The Key to Power-Efficient Performance


Pavlos Petoumenos 1, Georgia Psychou 1, Stefanos Kaxiras 1, Juan Manuel Cebrian Gonzalez 2, and Juan Luis Aragon 2

1 Department of Electrical and Computer Engineering, University of Patras, Greece
2 Computer Engineering Department, University of Murcia, Spain

Abstract. Several techniques that aim to improve power-efficiency (measured as EDP) in out-of-order cores trade performance for energy. Prime examples are the techniques that resize the instruction queue (IQ). While most of them produce good results, they fail to take into account that changing the timing of memory accesses can have significant consequences for the memory-level parallelism (MLP) of the application and thus incur disproportionate performance degradation. We propose a novel mechanism that addresses this problem by collecting fine-grain information about the maximum IQ resizing that does not affect the MLP of the program. This information is used to override the resizing enforced by feedback mechanisms when this resizing might reduce MLP. We compare our technique to a previously proposed non-MLP-aware management technique, and our results show a significant increase in EDP savings for most benchmarks of the SPEC2000 suite.

Key words: Instruction Queue, Power-Aware, MLP-Aware

1 Introduction

Power efficiency in high-performance cores has received considerable attention in recent years. A significant body of work targets energy reduction in processor structures, striving at the same time to preserve the processor's performance to the extent possible. In this work, we revisit a class of microarchitectural techniques that resize the Instruction Queue (IQ) to reduce its energy consumption. The IQ is one of the most energy-hungry structures because of its size, operation (fully-associative matches), and access frequency. Three proposals by Buyuktosunoglu et al. [1], Folegnani and González [2], and Kucuk et al.
[3] exemplify this approach: the main idea is to reduce energy by resizing the IQ downwards to match the needs of the program, while using a feedback loop to limit the damage to performance. The IQ can be physically partitioned into segments that can be completely turned off [1][3] or logically partitioned [2]. While physical partitioning and segment deactivation can be more effective in energy savings, the more sophisticated resizing policy of Folegnani and González minimizes performance degradation. We consider the

combination of the physical partitioning of [1] and [3] and the ILP-contribution policy of [2] as the basis for our comparisons.

Studying these approaches in more detail, we discovered that, in some cases, small changes in the IQ size bring about significant degradation in performance. Newfound understanding of the relation between the size of the instruction queue and performance, from the work of Chou et al. [7], Karkhanis and Smith [6], Eyerman and Eeckhout [4], and Qureshi et al. [8], points to the main culprit for this: Memory-Level Parallelism (MLP) [5]. In the presence of misses and MLP, the single most important factor that affects performance in relation to the IQ size is whether MLP is preserved or harmed.

While this new understanding seems, in retrospect, obvious and easy to integrate into previous proposals, it in fact requires a new approach to managing the IQ. Specifically, while prior feedback-loop proposals based their decisions on a coarse sampling period measured in thousands of cycles or instructions, an MLP-aware technique requires a much more fine-grain approach, where decisions must be taken at instruction intervals whose size is comparable to the size of the IQ. The reason is that MLP itself exists only if misses appear within the instruction window of the processor, so it must be handled at that resolution. More accurately, MLP exists if the distance between misses is less than the size of the reorder buffer [6]. In our case, we use a combined IQ and ROB in the form of a Register Update Unit [9], thus the ROB size defaults to the size of the IQ. In the rest of the paper we discuss MLP with respect to the size of the IQ.

Contributions of this paper: (i) We expose MLP (when it exists) as the main factor that affects performance in IQ resizing techniques. (ii) We propose a methodology to integrate MLP-awareness into IQ resizing techniques by measuring and predicting MLP in fine-grain segments of the dynamic instruction stream. (iii) We propose an example practical implementation of this approach and show that it consistently outperforms, in total EDP (for the whole processor), an optimized non-MLP-aware technique (a physically partitioned IQ + ILP-contribution feedback [2]), as well as improving EDP across the board compared to the base case of an unmanaged IQ.

Structure of this paper. Section 2 presents related work, while Section 3 motivates the need for IQ resizing using MLP. In Section 4 we present our proposal for MLP-awareness in IQ resizing and in Section 5 we delve into some details. Section 6 offers our evaluation and Section 7 concludes the paper.

2 Related Work

IQ Resizing Techniques. The instruction window (possibly including a separate reorder buffer, an instruction queue and the load/store queue) has been a prime target for energy optimizations. The reason is twofold: first, they are greedy

consumers of the processor power budget; second, these structures are typically sized to support peak execution in a wide out-of-order processor. Although there are many proposals for IQ resizing, we list here the three main proposals that are relevant to our work.

The proposals by Buyuktosunoglu et al. [1] and Kucuk et al. [3] resize the IQ by physically partitioning it into segments and disabling and enabling each segment as needed (see footnote 1). Both techniques use a feedback loop to control the size of the IQ. The Buyuktosunoglu et al. technique [1] uses the number of active entries (i.e., ready-to-issue entries) per segment to decide whether or not a segment deserves to remain active, while Kucuk et al. [3] argue that the occupancy of the IQ (in valid instructions), rather than active entries, is a better indication of its required size. In both techniques the IPC is measured periodically; a sudden drop in IPC signals the need for IQ upsizing. However, in both approaches there is no other mechanism to revert to the full IQ size.

The Folegnani and González approach [2] is distinguished by being a logical resizing of the IQ (limiting the range of the entries that can be allocated), not a physical partitioning as in the other two. In addition, the feedback mechanism is based on how much the youngest instructions in the IQ contribute to the actual ILP. If they do not contribute much, the size of the IQ can be reduced further. However, there is no way to adapt the IQ to larger sizes except by periodically reverting to full size. Nevertheless, the Folegnani and González resizing policy is very good at adjusting the IQ size so as to not harm the ILP in the program.

Modeling of MLP. Karkhanis and Smith presented a first-order model of a superscalar core [6] that deepened our understanding of MLP. This work exposed the importance of MLP in shaping the performance of out-of-order execution.
Karkhanis and Smith show that, in the absence of any upsetting events such as branch mispredictions and L2 misses, the number of instructions that will issue (on average) is a property of the program and is expressed by the so-called IW characteristic of the program. The presence of upsetting events such as L2 misses, however, decreases the IPC from the ideal point of the IW characteristic, depending on the apparent cost of the L2 miss. This is because an L2 miss drains the pipeline and eventually stalls it when the instruction that misses blocks the retirement of other instructions [6]. MLP, in this case, spreads the cost of accessing main memory over all the instructions that miss in parallel, making their apparent cost much lower. Existing IQ resizing mechanisms focus mainly on the variations of the ILP (effectively moving along the IW characteristic), ignoring what happens during misses. Our work focuses instead on the latter.

Footnote 1: Sequentially from the end of the IQ in [1] or independently of position in [3].

MLP-Aware Techniques. Qureshi et al. exploited MLP to optimize cache replacement [8]. MLP in their case is associated with data and the MLP prediction is done in the cache. Data that do not exhibit MLP are given preference

in staying in the cache over data that exhibit MLP. Eyerman and Eeckhout exploit MLP to optimize fetch policies for Simultaneous Multi-Threaded (SMT) processors [4]. In their case, MLP is associated with instructions that miss in the cache. Our MLP prediction mechanism is similar to the one proposed by Eyerman and Eeckhout, but since we use it to manage the IQ rather than the fetch stage, we associate MLP information with larger segments of code.

Fig. 1. Comparison of the distribution of distances between parallel L2 misses and performance degradation due to IQ resizing for gcc and art. (Each panel plots the distribution of distance between parallel misses and the increase of execution time.)

3 MLP and IQ Resizing

In this section, we motivate the basic premise of this paper, i.e., that IQ resizing techniques must be aware of the MLP in the program to avoid excessive performance degradation. Figure 1 shows the distribution of the distances between parallel misses in two SPEC2000 programs: art and gcc. We assume here an IQ of 128 entries. The distance is measured as the number of intervening instructions between instructions that miss in parallel. Figure 1 also plots, for each distance, the increase in execution time if the IQ size is decreased below that distance. Each time we resize the IQ, we eliminate the MLP with distance greater than the size of the IQ. In art, MLP is distributed over a range of distances, so its execution time grows in proportion to the decrease of the IQ because we immediately eliminate some MLP. The more MLP is eliminated, the harder performance is hit. In contrast, most MLP in gcc is clustered around distance 32. Below this point, we experience a dramatic increase in execution time (10). For the intermediate IQ sizes (between the maximum size and 32), execution time increases slowly due to loss of ILP.
These examples demonstrate how sensitive performance is with respect to MLP and indicate that an efficient IQ management scheme must focus primarily on MLP rather than ILP. Another important characteristic of MLP that necessitates a fine-grain approach to IQ management is that the distance among parallel misses changes very frequently. Figure 2 shows a window of 20K instructions from the execution of twolf. At each point the maximum observed distance among parallel misses is shown. Ideally, at each point in time, the IQ size should fit all such distances while at the same time be as small as possible. A blanket IQ size for

the whole window, based on some estimation of the average distance between parallel misses, is simply not good enough, since it would eliminate all MLP of distance larger than the average.

Fig. 2. Maximum distances between parallel misses of twolf.

4 Managing the IQ with MLP

Our approach is to quantify MLP opportunities and relate this information back to the instruction stream via a prediction structure. Upon seeing the same instructions again, the MLP information stored in the predictor guides our decisions for IQ resizing.

4.1 Quantifying MLP: MLP-distance

Our first concern is to quantify MLP opportunities in a way that is useful to IQ resizing. Two memory instructions are able to overlap their L2 misses if there are no dependencies between them and the number of instructions dispatched between them is less than the size of the IQ. This number of instructions, called MLP-distance in [4], is also the basic metric for our management scheme.

A straightforward way to measure the MLP-distance among instructions is to check the LSQ every time a miss is serviced and find the youngest instruction which is still waiting for its data from the L2. This technique does not fit our case, since it can only identify overlapping misses for the current size of the IQ. To overcome this problem we need to check for misses that could potentially overlap over a number of instructions, as many as the maximum number of instructions that can fit in the unmanaged (full) IQ. Always keeping information for as many instructions as the maximum IQ size partially defeats the purpose of resizing the IQ. Thus, instead of keeping information for each individual instruction, we keep aggregate information for instruction segments, i.e., groups of sequentially dispatched instructions (these instruction segments coincide with the segments that make up a physically partitioned IQ).
This MLP information is kept in a small cyclic buffer which we call MLP distance buffer or MDB with as many entries as the maximum number of IQ segments (Figure 3). MDB is not affected by IQ resizing. A new entry is allocated for each segment, but in contrast to the real IQ entries, MDB entries are evicted

only when it fills. This means that the MDB retires a segment only when newly dispatched instructions are farther away than the size of the non-managed IQ, and thus could not possibly execute in parallel with the retiring segment.

Fig. 3. MLP Distance Buffer (MDB): A circular queue keeps aggregate information corresponding to IQ segments.

Our approach to measuring MLP-distance is similar to [4] but based on segments for increased power efficiency. Each time an instruction causes an L2 miss, the corresponding MDB segment is marked as also having caused a miss. Upon eviction of an entry, the MDB is searched for the other entries which have caused L2 misses. If there are such entries, there could be MLP among them. We update each entry's MLP-distance field with the distance, measured in segments, from the youngest entry with a miss, if this distance is longer than the previously recorded value. The MDB is infrequently accessed and it is only processed whenever segments which caused L2 misses are retired.

The MLP-distance is not an entirely accurate estimation of actual MLP. Residing in the IQ at the same time is not the only requirement for two instructions to overlap their misses; for example, dependencies between the instructions may cause their misses to be isolated. In any case, the actual MLP-distance will be less than or equal to the value produced by our approach. This in turn means we might miss some opportunities for downward IQ resizing, but we will not incur performance degradation due to loss of MLP. Our experiments showed that for most benchmarks falsely assuming parallel misses causes few overestimations of the MLP-distance. Considering the simplicity of our mechanism, these results indicate a satisfactory level of performance. Measuring the MLP-distance allows us to control the IQ size for the relevant part of the code so as to not hurt MLP.
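The MDB bookkeeping described above can be sketched in a few lines of Python. This is only an illustrative software model under stated assumptions, not the hardware design: the class and method names are hypothetical, the distance is taken to be the difference between segment sequence numbers, and only entries that recorded a miss are updated when a missing segment is evicted.

```python
from collections import deque
from dataclasses import dataclass

NUM_SEGMENTS = 8  # 128-entry IQ split into 16-entry segments (Section 5.1)


@dataclass
class MdbEntry:
    segment_id: int        # sequence number of the dispatched segment
    missed: bool = False   # True once an instruction in the segment misses in L2
    mlp_distance: int = 0  # longest observed distance (in segments) to a younger miss


class MlpDistanceBuffer:
    """Toy model of the MDB: a circular queue with one entry per segment."""

    def __init__(self, capacity: int = NUM_SEGMENTS):
        self.capacity = capacity
        self.entries = deque()
        self.retired = []  # evicted entries, kept here only for inspection

    def allocate(self, segment_id: int) -> None:
        # Entries are evicted only when the buffer is full.
        if len(self.entries) == self.capacity:
            self._evict_oldest()
        self.entries.append(MdbEntry(segment_id))

    def mark_miss(self, segment_id: int) -> None:
        for entry in self.entries:
            if entry.segment_id == segment_id:
                entry.missed = True
                return

    def _evict_oldest(self) -> None:
        oldest = self.entries[0]
        if oldest.missed:  # the buffer is processed only for missing segments
            missed = [e for e in self.entries if e.missed]
            youngest = missed[-1]
            for entry in missed:
                # Assumption: distance = difference of segment sequence numbers.
                distance = youngest.segment_id - entry.segment_id
                entry.mlp_distance = max(entry.mlp_distance, distance)
        self.retired.append(self.entries.popleft())
```

For example, if segments 0 and 5 both record an L2 miss while segments 0 to 7 are in the buffer, dispatching segment 8 evicts segment 0 with an MLP-distance of 5 segments under these assumptions.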
We need, however, to relate this information back to the instruction stream, so that the next time we see the same part of the code we can react and correctly set the size of the IQ.

4.2 Associating MLP-Distance with Code

For the purpose of managing the IQ in an MLP-aware fashion, we dynamically divide program execution into fragments and associate MLP-distance information with these fragments. Execution fragments should be comparable in size

to the size of the IQ. Much shorter fragments would be sub-optimal, since the information they carry will be used to manage the whole IQ, which contains multiple such fragments. This could lead to very frequent and often conflicting resizing decisions. Longer fragments, such as program phases, also fail to provide us with the fine-grain information we need to quickly react to fast-changing MLP-distances in the instruction stream.

To achieve the desired balance in the size of the fragments we use the notion of superpaths. A superpath is nothing more than an aggregation of sequentially executed basic blocks and loosely corresponds to a trace (in a trace cache) [13] or a hotspot [12]. The MLP-distance of a superpath is assigned by the MDB: when all of the MDB entries belonging to a superpath are retired, the longest MLP-distance stored in these entries is selected to update the MLP-distance of the superpath. Note that the instructions which establish this MLP-distance do not have to belong to the same superpath. In other words, an MLP-distance can straddle two or more superpaths, in which case it might affect all of them (if it is the maximum observed MLP-distance in each of them). The next time the same superpath is observed in dispatch, its stored information is retrieved to manage the IQ. A more detailed discussion about superpaths and how we actually implemented the aforementioned mechanisms can be found in Section 5.

4.3 Resizing Policy

When we start tracking a superpath in dispatch, we check whether we have stored information about its behavior, including its MLP-distance information. If there is such information, then we have an indication of the minimum IQ size which will not affect the MLP of this superpath. In many cases, however, there is no MLP-distance information available or the MLP-distance does not restrict IQ downsizing. The question is: how much can we downsize the IQ in such cases?
As explained in Section 3, an efficient IQ resizing scheme has to find the IQ size that minimizes energy consumption without hurting performance. This means that besides not hurting MLP we must also protect ILP. This gap can be filled by any of the existing IQ resizing techniques. For example, the ILP-feedback approach of [2] can provide a target IQ size that does not hurt ILP, while the MLP-aware approach judges whether this size hurts MLP and, if so, overrides the decision.

For the rest of this paper the ILP-feedback information will be provided by the decision-making mechanism of Folegnani and González [2]. This mechanism examines the contribution of the youngest segment of the IQ to the IPC within a specific number of cycles. If the contribution of the youngest part is not important, namely the number of instructions that issue from this segment is below a threshold, the IQ size is decreased. In our case, we deactivate the youngest segment when it empties. The main idea in the Folegnani and González work is that if a segment contributes very little to the ILP, deactivating it would not harm performance. However, this holds only for ILP, not for MLP. In other words, even if the contribution of a segment in issued instructions is very small it can still have a

significant impact on performance, which occurs if any of its issued instructions is involved in MLP. It is exactly in this situation that MLP-awareness makes all the difference.

Further, in the Folegnani and González work, once the IQ is downsized, there is no way to detect whether the situation changes; all we can see is the contribution to ILP of the active portion of the IQ. Thus, periodically, the IQ is brought back to its full size, and the downsizing process starts again. In our case, the existence of MLP automatically upsizes the IQ to a size that does not harm MLP.

5 Practical Implementation

5.1 IQ Segmentation

To allow dynamic variation of the IQ size, we divide the IQ into a number of independent parts referred to as segments. For example, we use a 128-entry IQ partitioned into eight sixteen-entry segments. Bitline segmentation is used to implement the resizing of the structure [10]. The structure of the IQ follows the one in [2]. The IQ is a circular FIFO queue, with every new instruction inserted at the tail; retiring instructions are removed from the head. The difference in our case is that individual segments can be deactivated. A segment is deactivated if the youngest segment issues fewer than a threshold number of instructions in a quantum of time (1000 cycles), and a segment is reactivated every 5 quanta. This inevitably leads to constraints that have to be met during the resizing process, similar to those faced by Ponomarev et al. [11]: downsizing of the IQ is only permitted if there are no instructions left to commit in the segment being removed, and upsizing is constrained to activate segments that come after all the instructions currently residing in the IQ.

5.2 Superpaths

Our basic IQ management unit, the superpath, is characterized by its size and its first instruction. Sequential basic blocks are organized into superpaths at the dispatch stage and they contain at least as many instructions as the IQ size.
Superpath creation ends when we encounter a basic block which is at the head of a previously created superpath, in order to reduce both the number and the overlap of superpaths. For each newly created superpath we allocate an entry in a small hardware cache which keeps information about the superpath, including its MLP-distance. After exploring the design space, we chose a 4-way, 16-set configuration (indexed by the lower-order bits of the start address of the superpath). We store 28 bits per superpath entry: 20 bits of the starting address (the lowest-order bits above the indexing bits), the MLP-distance prediction (4 bits, quantized in multiples of 16), its confidence counter (3 bits) and a valid bit. For our 4x16 cache, the storage requirements add up to 1792 bits. According to CACTI [14], this structure contributes 9.5 mW to the total power consumption of the processor, which is a small power overhead compared to the power consumption of the IQ (8.9 W for the configuration described in Section 6).
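As a rough illustration of the superpath cache described above, the following Python sketch reproduces the storage arithmetic, the index and tag split, and the confidence update rule that Section 5.4 describes. All names are hypothetical. Three points are assumptions rather than statements from the text: the start address is used without any alignment shift, the confidence counter restarts at 1 when a prediction is replaced, and the predicted MLP-distance is interpreted directly as a minimum number of active segments when overriding the ILP-feedback target.

```python
from dataclasses import dataclass
from typing import Optional

NUM_SETS, NUM_WAYS = 16, 4
INDEX_BITS = 4                      # log2(NUM_SETS)
TAG_BITS, DIST_BITS, CONF_BITS, VALID_BITS = 20, 4, 3, 1
ENTRY_BITS = TAG_BITS + DIST_BITS + CONF_BITS + VALID_BITS
TOTAL_BITS = ENTRY_BITS * NUM_SETS * NUM_WAYS
CONF_MAX = (1 << CONF_BITS) - 1     # 3-bit saturating counter tops out at 7


def index_and_tag(start_addr: int) -> tuple:
    """Split a superpath start address into set index and 20-bit tag.

    Assumption: the raw address is used, with no alignment shift.
    """
    index = start_addr & (NUM_SETS - 1)
    tag = (start_addr >> INDEX_BITS) & ((1 << TAG_BITS) - 1)
    return index, tag


@dataclass
class SuperpathEntry:
    tag: int
    mlp_distance: int = 0  # prediction, in 16-instruction segments
    confidence: int = 1
    valid: bool = True

    def update(self, observed: int) -> None:
        """Apply one MLP-distance observation to the stored prediction."""
        if observed == self.mlp_distance:
            self.confidence = min(self.confidence + 1, CONF_MAX)
            return
        self.confidence = max(self.confidence - 1, 0)
        if self.confidence == 0:
            self.mlp_distance = observed
            self.confidence = 1  # assumption: restart value is not given in the text


def target_segments(ilp_target: int, predicted: Optional[int]) -> int:
    """Override the ILP-feedback target when it would cut into predicted MLP."""
    if predicted is None:  # no stored information for this superpath
        return ilp_target
    return max(ilp_target, predicted)
```

With these constants, `ENTRY_BITS` evaluates to 28 and `TOTAL_BITS` to 1792, matching the figures quoted in the text.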

Table 1. Configuration of simulated system

Parameter                 Configuration
Fetch/Issue/Commit width  4 instructions per cycle
BTB                       1024 entries, 4-way set-associative
Branch Predictor          Combining, bimodal + 2 Level, 2 cycle penalty
Instruction Queue         128 entries (combined with ROB)
Load/Store Queue          64 entries
L1 I-cache                16 KB, 4-way, 64 bytes block size
L1 D-cache                8 KB, 4-way, 32 bytes block size
Unified L2 cache          256 KB/512 KB//, 8-way, 64 bytes block size
TLB                       4096 entry (I), 4096 entry (D)
Memory                    8 bytes wide, 120 cycles latency
Functional Units          4 int ALUs, 2 int multipliers, 4 fp ALUs, 2 fp multipliers

5.3 Results Overview

5.4 MLP-distance Prediction

When all instructions of a superpath commit, the stored superpath information is updated with the MLP-distance information of this particular execution. Different executions of a superpath are generally not identical in terms of MLP, so what we want to associate with the superpath is a dominant value for its MLP-distance. To manage this, in addition to keeping an MLP-distance prediction for each superpath entry, we employ a 3-bit saturating confidence counter which indicates our confidence that the stored MLP-distance is also the dominant value. The confidence counter is incremented for each MLP-distance update which agrees with the current prediction and decremented for each update which disagrees. When the counter reaches zero, we replace the stored prediction.

6 Evaluation

6.1 Experimental Setup

For our experiments we use a detailed cycle-accurate simulator that supports a dynamic superscalar processor model and WATTCH power models [15]. The configuration of the simulated processor is described in Table 1. We use a subset of the SPEC2000 benchmark suite, containing benchmarks with more than one long-latency load per 1K instructions for the smallest cache size we utilize.
Benchmarks with even fewer misses are of no interest for our work, since without misses our mechanism falls back to the baseline ILP-feedback technique. All benchmarks are run with their reference input. We simulate 300M instructions after skipping 1B instructions for all benchmarks except for vpr, twolf and mcf, where we skip 2B instructions, and ammp, where we skip 3B instructions. The experiments were performed for four different cache sizes. Three metrics are used in our evaluation: total processor energy, execution time and the energy

× delay product. We first present the effects of MLP-awareness on the baseline ILP-oriented mechanism and then a direct comparison of the two mechanisms, the ILP-aware and the combination of the ILP-feedback and ILP/MLP techniques. All results (unless otherwise noted) are normalized to the base case of an unmanaged IQ.

Fig. 4. Average (geometric mean) Normalized EDP (left) and Performance Degradation (right) for ILP-feedback and ILP-feedback with MLP-awareness.

6.2 Effects of MLP-awareness

Figure 4 depicts the EDP geometric mean of all benchmarks for two different thresholds: an aggressive threshold of 768 instructions and a conservative threshold of 256 instructions. Note how difficult it is to improve the geometric mean of the whole-processor EDP with IQ resizing. This depends a great deal on the portion of the total power budget taken up by the IQ. In our case, this is a rather conservative 13.7%, so in the ideal case (significant energy savings and no performance degradation) we expect to approach this percentage in EDP savings. Indeed, this is what we achieve with our proposal.

As shown in the graph, the ILP-feedback technique works marginally with a conservative threshold, while its combination with the MLP-aware mechanism improves the situation only slightly. However, with much more aggressive resizing, the ILP-feedback technique seriously harms performance and, despite the larger energy savings, yields a worse EDP across all cache configurations. In this case, the incorporation of the MLP-aware mechanism can readily improve the results and turn loss into significant benefit, approaching 1 EDP improvement. As long as the ILP-feedback technique does not hurt the MLP, it yields benefits. When that changes, the performance loss is unacceptable.
This hazard is avoided when the MLP mechanism is used because it works as a safety net. With the help of the MLP mechanism, resizing can be pushed to the limit.

6.3 Direct Comparison of ILP- and ILP/MLP-aware techniques

In this section, we compare the behavior of the two approaches, each evaluated for a configuration which minimizes its average EDP. An additional consideration was to find configurations that do not harm EDP over the base case for any benchmark. This, however, is not possible for the ILP-feedback technique, unless we are content with marginal EDP improvements. Thus, for the ILP-feedback technique we remove this restriction and simply select the threshold that minimizes average EDP, which is 256 instructions. The ILP-feedback with the MLP mechanism can be pushed much harder, as is evident from Figure 4 with the 768-instruction threshold. However, EDP worsens over the base case for two programs (applu and art), even though the average EDP is minimized. The threshold that gives the second best average EDP for the combined ILP/MLP mechanism, giving up less than 2% over the previous best case, is the 512-instruction threshold, which satisfies our requirement for EDP improvement across all benchmarks.

Figure 5 illustrates the normalized EDP, execution time increase and energy savings for the best thresholds for each mechanism. The end result is that the very aggressive resizing of the ILP/MLP technique harms performance comparably to the conservative ILP-feedback technique, but at the same time manages to reduce the IQ size more and produce significantly higher energy savings. This results in an EDP for the ILP/MLP technique that is consistently better than the EDP of the ILP-feedback technique, almost doubling the benefit on average ( % compared to % of the ILP-feedback technique). The benefit of the ILP/MLP technique diminishes slightly (relative to the ILP-feedback technique) with cache size, since with fewer misses we have fewer opportunities for MLP. Finally, note that the performance degradation of the ILP/MLP technique is kept at reasonable levels while, even for the conservative ILP-feedback, it can vary considerably more (e.g., mcf execution time increases 67%-69%).
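The whole-processor bound mentioned in Section 6.2 can be made concrete with a little arithmetic. The sketch below assumes that normalized EDP is (1 - energy savings) × (1 + execution time increase), and that the IQ's 13.7% share of the power budget translates into the same share of energy when execution time is unchanged. The function names are hypothetical and the numbers are illustrative, not measured results.

```python
IQ_POWER_SHARE = 0.137  # fraction of the total power budget taken up by the IQ


def normalized_edp(energy_savings: float, slowdown: float) -> float:
    """EDP relative to the unmanaged IQ (1.0 means no change)."""
    return (1.0 - energy_savings) * (1.0 + slowdown)


def break_even_slowdown(energy_savings: float) -> float:
    """Execution time increase at which the energy savings are cancelled out."""
    return 1.0 / (1.0 - energy_savings) - 1.0


# Ideal case: all IQ energy is saved and performance is untouched.
ideal = normalized_edp(IQ_POWER_SHARE, 0.0)

# Even in that ideal case, a slowdown of roughly 16% erases the whole benefit.
limit = break_even_slowdown(IQ_POWER_SHARE)
```

Under these assumptions the best achievable normalized EDP is 0.863, which is why EDP savings can only approach the IQ's share of the power budget, and why even moderate performance degradation from lost MLP quickly turns a resizing technique into a net loss.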
7 Conclusions

In this paper, we revisit techniques for resizing the instruction queue that aim to improve the power-efficiency of high-performance out-of-order cores. Prior approaches resized the IQ paying attention primarily to ILP. In many cases this results in considerable loss of performance, while the energy gains from the IQ are bounded with respect to the energy of the whole processor. The result is that EDP improves in some cases but worsens in others, making such techniques inconsistent. The culprit for this is Memory-Level Parallelism (MLP). Resizing the IQ can reduce the amount of MLP in programs, with serious consequences for performance. With this realization, we set out to provide a technique that can be applied on top of previous IQ resizing techniques. Our technique detects possible MLP at runtime and uses prediction to guide IQ resizing decisions. Because we need to manage the whole IQ, our basic unit of management is a sequence of basic blocks, called a superpath, comparable in the number of instructions to the maximum IQ size. MLP information is associated with superpaths and is used to override resizing decisions that might harm the MLP of the superpath. In the absence of misses and MLP, resizing of the IQ is performed using already existing techniques.

Our results show that we can manage the IQ considerably better than prior approaches, yielding consistently better EDP over the base case. At the same time, we can push the resizing of the IQ much more aggressively (to achieve better energy savings), knowing that our safety-net mechanism protects the MLP of the program and will not inordinately harm performance.

Fig. 5. Normalized Energy-Delay Product (a), Execution Time Increase (b) and Energy Savings (c) for best configurations: ILP-Feedback (256 threshold) and ILP-Feedback with MLP (512 threshold). (Each panel compares ILP-feedback and ILP-feedback with MLP for swim, mgrid, applu, gcc, galgel, art, crafty, facerec, lucas, parser, vortex, apsi, wupwise, vpr, twolf, mcf and ammp.)

References

1. Smith, T.F., Waterman, M.S.: Identification of Common Molecular Subsequences. J. Mol. Biol. 147, (1981)
2. May, P., Ehrlich, H.C., Steinke, T.: ZIB Structure Prediction Pipeline: Composing a Complex Biological Workflow through Web Services. In: Nagel, W.E., Walter, W.V., Lehner, W. (eds.) Euro-Par LNCS, vol. 4128, pp Springer, Heidelberg (2006)

3. Foster, I., Kesselman, C.: The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco (1999)
4. Czajkowski, K., Fitzgerald, S., Foster, I., Kesselman, C.: Grid Information Services for Distributed Resource Sharing. In: 10th IEEE International Symposium on High Performance Distributed Computing, pp IEEE Press, New York (2001)
5. Foster, I., Kesselman, C., Nick, J., Tuecke, S.: The Physiology of the Grid: an Open Grid Services Architecture for Distributed Systems Integration. Technical report, Global Grid Forum (2002)


LWA Beamforming Design Concept LWA Beamforming Design Concept Steve Ellingson October 3, 27 Contents Introduction 2 2 Integer Sample Period Delay 2 3 Fractional Sample Period Delay 3 4 Summary 9 Bradley Dept. of Electrical & Computer

More information

7/19/2012. IF for Load (Review) CSE 2021: Computer Organization. EX for Load (Review) ID for Load (Review) WB for Load (Review) MEM for Load (Review)

7/19/2012. IF for Load (Review) CSE 2021: Computer Organization. EX for Load (Review) ID for Load (Review) WB for Load (Review) MEM for Load (Review) CSE 2021: Computer Organization IF for Load (Review) Lecture-11 CPU Design : Pipelining-2 Review, Hazards Shakil M. Khan CSE-2021 July-19-2012 2 ID for Load (Review) EX for Load (Review) CSE-2021 July-19-2012

More information

Leveraging the Core-Level Complementary Effects of PVT Variations to Reduce Timing Emergencies in Multi-Core Processors

Leveraging the Core-Level Complementary Effects of PVT Variations to Reduce Timing Emergencies in Multi-Core Processors Leveraging the Core-Level Complementary Effects of PVT Variations to Reduce Timing Emergencies in Multi-Core Processors Guihai Yan a) Key Laboratory of Computer System and Architecture, Institute of Computing

More information

Handling Search Inconsistencies in MTD(f)

Handling Search Inconsistencies in MTD(f) Handling Search Inconsistencies in MTD(f) Jan-Jaap van Horssen 1 February 2018 Abstract Search inconsistencies (or search instability) caused by the use of a transposition table (TT) constitute a well-known

More information

Engineering the Power Delivery Network

Engineering the Power Delivery Network C HAPTER 1 Engineering the Power Delivery Network 1.1 What Is the Power Delivery Network (PDN) and Why Should I Care? The power delivery network consists of all the interconnects in the power supply path

More information

UNIT-III LIFE-CYCLE PHASES

UNIT-III LIFE-CYCLE PHASES INTRODUCTION: UNIT-III LIFE-CYCLE PHASES - If there is a well defined separation between research and development activities and production activities then the software is said to be in successful development

More information