SATSim: A Superscalar Architecture Trace Simulator Using Interactive Animation


Mark Wolff, Linda Wills
School of Electrical and Computer Engineering
Georgia Institute of Technology
{wolff,linda.wills}@ece.gatech.edu

Abstract

This paper describes an interactive animation tool, SATSim, which conveys superscalar architecture concepts. It has been used in an advanced undergraduate computer architecture course to visualize the complicated behavioral patterns of superscalar architectures, such as out-of-order execution, in-order commitment, and the impact of branch mispredictions and cache misses. SATSim allows students to interactively change hardware configuration parameters and to observe their effects visually in a more accessible manner than is currently possible with existing simulators or with traditional static media.

Introduction

As the complexity of superscalar architectures increases, the underlying concepts of dynamic scheduling and speculative execution become increasingly difficult to explain without some form of visualization. These architectures typically exhibit complex behavioral patterns involving concurrent, out-of-order execution of instructions on multiple functional units and the resolution of data and control dependencies on the fly. Static visual aids, such as diagrams on whiteboards, slides, or in textbooks, are limited in their ability to simultaneously convey both the structural relationships between components of a superscalar architecture and the temporal relationships between instructions as they progress through the pipeline and utilize architectural resources.

An alternative is to use a superscalar architectural simulator. However, existing simulators [6,7,8,9] have been designed primarily for research, in which the focus is on accurately modeling the effects of architectural mechanisms so that their performance characteristics can be studied. Most of these simulators do not attempt to visually convey the behavior of the architectural mechanisms themselves and how they interact. They are often designed to accurately model a particular architecture and are often too complex to be suitable for students who are new to superscalar architecture concepts. More importantly, these simulators are typically not interactive: a machine configuration is specified, and the simulation summarizes performance and resource utilization results as output. It is not easy to observe what is going on during the simulation (e.g., what the components are doing and how they interact), and it is not possible to interactively experiment with the simulation (e.g., to force a branch misprediction or a cache miss at any point during simulation in order to observe its effect).

This paper describes SATSim: Superscalar Architecture Trace Simulator. SATSim is being used at Georgia Tech in an undergraduate senior-level course on advanced computer architecture. SATSim provides an animated and interactive visualization aid for teaching some of the important concepts associated with superscalar architectures. These concepts include out-of-order execution, in-order commitment, dynamic resolution of data dependencies using register renaming and reservation stations, and the performance effects of branch prediction accuracy and cache hit rates.

Simulator Details

SATSim reads instructions from a trace file and displays them as they progress through a simulated microarchitecture. The user can configure the microarchitecture by selecting the superscalar factor, the number of reservation stations per execution unit, the number of reorder and rename buffer entries, and the number of each type of execution unit. In addition, the user can specify branch prediction accuracy, cache miss rates, and cache miss penalties, using the dialog box shown in Figure 1. SATSim models the behavior of the specified microarchitecture, avoiding processor-specific idiosyncrasies that tend to confuse the concepts. Branch prediction and cache behavior are generated statistically, with a manual override to illustrate the effect of a misprediction or cache miss.

The trace file input is in MIPS format. The simulator decodes all instructions into one of only five types: integer, floating point, branch, load, and store. Source and destination registers are decoded so that data dependencies and renaming resource utilization can be accurately portrayed.

Figure 2 shows a screenshot of the animation in progress. Figure 3 shows the animation window as it progresses through six simulated clock cycles. In particular, instruction 5 is shown going from issue to commitment. Each instruction is given a three-digit name so that it can be displayed and tracked by the user as it progresses through the simulated processor.

The simulator displays the status of all in-flight instructions. Instructions are shown as they occupy locations in the pipeline, rename buffer, reorder buffer, reservation stations, and execution units. Instructions acquire color as their destination register is renamed, and lose color once their result has been broadcast. For each instruction that occupies a reservation station, the display shows the color, or lack of color, of the instruction providing data to each source register. The current color assignment, if any, for each of the registers is also displayed. The status of each instruction is updated on each simulated clock cycle. Running performance metrics are also updated and displayed on each clock cycle.
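The color bookkeeping described above can be illustrated with a small sketch. This is not SATSim's source code (the tool was written in Visual C++); the class and method names, the fixed pool of colors, and the register names are all assumptions made for illustration. It shows only the rule stated in the text: an instruction gains a color when its destination register is renamed, waiting instructions record the color of the producer of each source register, and the color is dropped when the result is broadcast.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Instruction:
    name: str                     # display name, e.g. "005"
    dest: Optional[str]           # destination register, or None
    sources: List[str]            # source registers
    color: Optional[int] = None   # set when the destination is renamed
    source_colors: Dict[str, Optional[int]] = field(default_factory=dict)


class RenameTracker:
    """Tracks which in-flight instruction (color) will produce each register."""

    def __init__(self, num_colors: int) -> None:
        self.free_colors = list(range(num_colors))  # hypothetical color pool
        self.register_color: Dict[str, int] = {}    # current color per register

    def rename(self, instr: Instruction) -> bool:
        """Record source colors, then color the destination. False means stall."""
        if instr.dest is not None and not self.free_colors:
            return False
        # Sources are looked up before the destination is renamed, so an
        # instruction that reads and writes r1 depends on the older r1.
        instr.source_colors = {r: self.register_color.get(r) for r in instr.sources}
        if instr.dest is not None:
            instr.color = self.free_colors.pop(0)
            self.register_color[instr.dest] = instr.color
        return True

    def broadcast(self, instr: Instruction, waiting: List[Instruction]) -> None:
        """Result broadcast: waiting sources and the producer lose the color."""
        color = instr.color
        if color is None:
            return
        for w in waiting:
            for reg, c in w.source_colors.items():
                if c == color:
                    w.source_colors[reg] = None  # operand is now available
        # Clear the register's color only if no younger instruction renamed it.
        if instr.dest is not None and self.register_color.get(instr.dest) == color:
            del self.register_color[instr.dest]
        self.free_colors.append(color)
        instr.color = None


# Usage: instruction 002 reads r1, which instruction 001 produces.
tracker = RenameTracker(num_colors=4)
i1 = Instruction("001", dest="r1", sources=["r2", "r3"])
i2 = Instruction("002", dest="r4", sources=["r1", "r5"])
tracker.rename(i1)
tracker.rename(i2)
```

In this sketch a source color of None means the operand is already available, which corresponds to the "lack of color" shown in the reservation station display.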
The animation has three interactive features that are useful for explaining and understanding superscalar architecture concepts. The user can force the next fetched branch instruction to be mispredicted, and then observe the resulting performance impact. The user can force the next fetched line of instructions to miss in the instruction cache. Finally, the user can force the next load instruction that executes to miss in the data cache.

During the simulation, the user has control of the animation speed. The animation can be set to update the display every 1, 10, 100, or 1000 clock cycles. The user can single-step through the animation, set the animation to 1, 2, or 10 screen updates per second, or let it run as fast as possible. The user can also allow the simulation to run to the end of the trace file without updating the screen.

When the simulation ends, the program writes pertinent performance and utilization data to a tab-delimited text file. Table 1 lists the data that is written to disk after each simulation.

    Date and Time
    All Simulation Parameters
    Trace File Name
    Total Cycles
    Number Instructions Committed
    Number of Each Instruction Type Fetched
    Total Icache Misses
    Total Miss Penalty
    Total Stall Cycles
    Total Dcache Misses
    Reorder
    Rename
    Integer Execution
    Floating Point Execution
    Branch Execution
    Memory Execution
    Integer Reservation
    Floating Point Reservation
    Branch Reservation
    Memory Reservation

Table 1. Data Collection

SATSim runs on Windows 9x or NT. The program was developed using Microsoft Visual C++ Version 6.

Simulator Use

The simulator is currently used in an advanced undergraduate computer architecture course to assist students with understanding superscalar architecture concepts. One associated assignment asks the students to discuss the effects of branch prediction accuracy and cache hit rates on the performance of superscalar architectures.
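The statistically generated branch and cache behavior, together with the manual override, can be sketched roughly as follows. This is an illustrative assumption about how such a mechanism might look, not SATSim's implementation: the class name, the seeded random generator, and the counters are all hypothetical. A miss rate of 0.1 for a branch event would correspond to a prediction accuracy of 90%.

```python
import random


class StatisticalEvent:
    """Draws miss/no-miss outcomes at a configured rate, with a one-shot override."""

    def __init__(self, miss_rate: float, seed: int = 0) -> None:
        if not 0.0 <= miss_rate <= 1.0:
            raise ValueError("miss_rate must be between 0 and 1")
        self.miss_rate = miss_rate
        self.rng = random.Random(seed)  # seeded so a run can be repeated
        self.force_next = False         # set by the user-facing override
        self.events = 0
        self.misses = 0

    def force_miss(self) -> None:
        """Make the next event a miss regardless of the configured rate."""
        self.force_next = True

    def next_is_miss(self) -> bool:
        """Return True if the next event (branch, fetch line, or load) misses."""
        miss = self.force_next or self.rng.random() < self.miss_rate
        self.force_next = False  # the override applies to one event only
        self.events += 1
        self.misses += int(miss)
        return miss


# Usage: a perfect predictor that the user forces to mispredict once.
branch = StatisticalEvent(miss_rate=0.0)
branch.force_miss()
outcomes = [branch.next_is_miss() for _ in range(3)]
print(outcomes)  # [True, False, False]
```

One instance per event source (branch predictor, instruction cache, data cache) would mirror the three override controls described in the text.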
Another assignment asks the student to explore the design space by varying the architectural parameters, and then discuss the tradeoffs between chip-area cost (based on a given cost model), cycle time, and architectural resources.
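A design-space exploration of this kind can be organized as a sweep over parameter combinations. The sketch below is only an illustration: the parameter ranges and the area weights are placeholders invented for the example, not SATSim's limits and not the cost model given to the students, and the linear cost function is likewise an assumption.

```python
from itertools import product
from typing import Dict, Iterator, List

# Hypothetical parameter ranges, chosen only for illustration.
PARAMETER_RANGES: Dict[str, List[int]] = {
    "superscalar_factor": [1, 2, 4],
    "reservation_stations_per_unit": [1, 2, 4],
    "reorder_buffer_entries": [8, 16, 32],
    "rename_buffer_entries": [8, 16],
}

# Placeholder area weights per unit of each resource (NOT the course's cost model).
AREA_WEIGHTS: Dict[str, float] = {
    "superscalar_factor": 4.0,
    "reservation_stations_per_unit": 1.0,
    "reorder_buffer_entries": 0.5,
    "rename_buffer_entries": 0.5,
}


def configurations(ranges: Dict[str, List[int]]) -> Iterator[Dict[str, int]]:
    """Yield every combination of the given parameter values."""
    names = list(ranges)
    for values in product(*(ranges[name] for name in names)):
        yield dict(zip(names, values))


def area_cost(config: Dict[str, int], weights: Dict[str, float]) -> float:
    """Linear placeholder cost: the weighted sum of the resource counts."""
    return sum(weights[name] * value for name, value in config.items())


def within_budget(ranges: Dict[str, List[int]],
                  weights: Dict[str, float],
                  budget: float) -> List[Dict[str, int]]:
    """Return the configurations whose placeholder cost fits the budget."""
    return [c for c in configurations(ranges) if area_cost(c, weights) <= budget]


candidates = within_budget(PARAMETER_RANGES, AREA_WEIGHTS, budget=20.0)
print(len(candidates), "configurations fit the budget")
```

Each surviving configuration would then be simulated, and the archived results compared against its cost.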

The simulator allows students to explore the design space, and the archived data then allows for a more thorough analysis of design tradeoffs. Through these assignments, the students were able to explore the difficult concepts of dynamic scheduling and speculative execution.

Conclusions and Future Work

SATSim and the associated assignments assist students with understanding superscalar architecture concepts. A significant improvement in the students' comprehension, compared to previous courses, was observed through quizzes and class participation. SATSim conveys important fundamental concepts in superscalar architecture design and their associated nomenclature. Once this conceptual foundation is in place, more advanced concepts can be discussed, such as trace caches, branch predication, data prediction, vector processors, SIMD ISA extensions, and multiprocessor systems.

Planned enhancements to SATSim include an online help system, a browse function for selecting input and output files, and a utility for running simulations in batch mode. Real-time visualization of resource utilization information will also be added.

The simulator can be downloaded from http://ece.gatech.edu/research/pica/satsim/satsim.html

References

[1] Hennessy, J.L., Patterson, D.A., Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann Publishers, 1996.
[2] Flynn, M.J., Computer Architecture: Pipelined and Parallel Processor Design, Jones and Bartlett Publishers, 1995.
[3] Song, S.P., Denman, M., Chang, J., The PowerPC 604 RISC Microprocessor, IEEE Micro, October 1994, pp. 8-17.
[4] Kessler, R.E., The Alpha 21264 Microprocessor, IEEE Micro, March 1999, pp. 24-36.
[5] Yeager, K.C., The MIPS R10000 Superscalar Microprocessor, IEEE Micro, April 1996, pp. 28-40.
[6] Burger, D.C., Austin, T.M., The SimpleScalar Tool Set, Version 2.0, Computer Architecture News, 25(3), June 1997, pp. 13-25.
[7] Moura, C., SuperDLX: A Generic Superscalar Simulator, ACAPS Technical Memo 64, McGill University School of Computer Science.
[8] DLXview, http://yara.ecn.purdue.edu/~teamaaa/dlxview/
[9] Hostetler, L.B., Mirtich, B., DLXsim - A Simulator for DLX.

Figure 1. Screenshot of the Trace Options dialog

Figure 2. Screenshot of animation

Figure 3. Sequence of Animation Steps