Processors Processing Processors. The meta-lecture


Simulators 5SIA0


Why Simulators?
Meet your friend Harm. Harm loves tractors.
Unfortunately for Harm, you need to go outside to drive tractors, and the outside world is filled with dangers: Rain! Scary animals!
So Harm uses his PC.
Harm: "Oh no! My PC is too slow to run Farming Simulator."
You (wearing the obligatory cape): "Stand back! I'm a computer architect!"

How to help Harm?
Of course you have many ideas on how to speed up Harm's computer. But which ones should you apply?

Design Space Exploration Options
- Buy (or build) all hardware options. Gee, that sounds expensive...
- Use analytical models. How reliable is that?
- Simulate the design points! Hey, I like simulators. That sounds promising :)

What to simulate for?
- Performance
- Energy
- Power (!= Energy)
- Thermal
What details to simulate?
- Cycle accurate vs functional
- Caches
- Full operating system
- Disk accesses
- Background tasks
- ...
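The "Power (!= Energy)" remark can be made concrete with a small calculation: energy is average power multiplied by run time, so a design that draws more power can still use less energy if it finishes sooner. The sketch below is only an illustration; the function name `energy_uj` and the two design points in the usage note are assumptions, not numbers from the lecture.

```c
#include <stdint.h>

/* Energy in microjoules from average power in milliwatts and run time in
   milliseconds (1 mW * 1 ms = 1 uJ). Hypothetical helper for illustration. */
uint64_t energy_uj(uint64_t avg_power_mw, uint64_t runtime_ms) {
    return avg_power_mw * runtime_ms;
}
```

For example, an assumed design A drawing 10 W for 2 s uses 20 J, while an assumed design B drawing 15 W for 1 s uses 15 J: B has the higher power but the lower energy, which is why a simulator may need to report both.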

All the details: RTL Simulation
Simulate at gate level:
- modelsim/questasim (Mentor)
- ncsim (Cadence)
- VCS (Synopsys)
- Icarus Verilog (Open Source!)
- ...
Advantages:
- No need to build a custom simulator if you need RTL to build hardware anyway
- Highest level of precision and detail
Disadvantage:
- Horribly slow for realistic designs. Example: for an Nvidia GPU with > 1 billion transistors, small tests take over 8 hours! [1]
[1] http://www.deepchip.com/items/0523-04.html

[Comic: a computer architect's excuse, "Simulating!", modified from http://xkcd.com/303/]

Slightly less horribly slow: Hardware Emulation
- Start from the RTL description of the target architecture
- Synthesize for FPGA (slow)
- Emulate on FPGA (fast!)
Note: instrumentation is required to get detailed information out!

Levels of detail in Simulation
- Full-System versus User-level
- Cycle Accurate versus Functional
- Execution- versus Trace-driven

Full-system versus User-Level
To OS or not to OS?
[Figure: Full-System versus User-Level]

User-Level
Famous example: SimpleScalar [1]
Advantages:
- Fast to develop and update to new architectures
- Usually accurate enough
Disadvantages:
- Any time spent in the OS is not modelled accurately. This can have a severe impact: database applications spend 20-30% of their time in OS mode.
[1] http://www.simplescalar.com/

Cycle Accurate versus Functional
Functional: no or limited model of the micro-architecture
- An (add) instruction of the target can be translated to an (add) instruction on the host, and be simulated that way.
- Example 1: SimpleScalar sim-fast
- Example 2: QEMU, a full-system emulator using dynamic translation
Cycle accurate: includes a model of the micro-architecture
- Block resources in the pipeline when an instruction executes
- Use the target branch predictor scheme
- Out-of-order execution
- Example: SimpleScalar sim-outorder
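The difference between the two styles can be sketched with a toy example. The code below is not taken from SimpleScalar or QEMU; the two-instruction ISA, the names `run_functional` and `run_timed`, and the latencies (1 cycle for an add, 3 cycles for a multiply on a single blocking execution unit) are all assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy target ISA: accumulate-add and accumulate-multiply. */
typedef enum { I_ADD, I_MUL } opcode_t;
typedef struct { opcode_t op; int32_t imm; } instr_t;

/* Functional model: only the architectural result matters. */
int32_t run_functional(const instr_t *prog, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = (prog[i].op == I_ADD) ? acc + prog[i].imm : acc * prog[i].imm;
    return acc;
}

typedef struct { int32_t result; uint64_t cycles; } timed_result_t;

/* Timing model: same result, plus a cycle count. Each instruction blocks the
   single execution unit for an assumed latency (ADD: 1 cycle, MUL: 3). */
timed_result_t run_timed(const instr_t *prog, size_t n) {
    timed_result_t r = { 0, 0 };
    for (size_t i = 0; i < n; i++) {
        if (prog[i].op == I_ADD) {
            r.result += prog[i].imm;
            r.cycles += 1;
        } else {
            r.result *= prog[i].imm;
            r.cycles += 3;
        }
    }
    return r;
}
```

Both functions compute the same value; only the second one says anything about performance, and it is only as good as its (here invented) latency model. A real cycle-accurate simulator would also model the branch predictor, caches and out-of-order issue.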

Intermezzo - Internals of dynamic translation
Target binary -> translate ("magic") -> native instructions

    int32_t instructions[] = { 0x3FE9, 0xA701, 0xEF02, 0x8FF0 };
    execute(instructions);

Question: implement the execute function in regular C.

Answer:

    #include <stdint.h>

    void execute(int32_t *instructions) {
        /* declare a pointer to a function that returns void
           and has no arguments */
        void (*fp)(void);
        /* set the function pointer to the first instruction
           (data-to-function pointer cast: a common compiler extension) */
        fp = (void (*)(void))instructions;
        /* call the function
           Note: make sure the last instruction in the list returns.
           Note: the memory holding the instructions must be executable,
           which is usually not the case for a plain array. */
        fp();
    }
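Jumping into raw bytes only works when the bytes are valid host code in executable memory. A portable way to illustrate the same "translate once, execute many times" idea is to translate target instructions into host function pointers. This is a sketch under stated assumptions: the two-byte toy ISA and the names `translate` and `run_translated` are hypothetical, and a real dynamic translator such as QEMU emits native host instructions instead of function pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy target ISA: one opcode byte followed by one operand byte. */
enum { OP_LOADI = 0, OP_ADDI = 1, OP_HALT = 2 };

typedef struct { int32_t acc; } cpu_t;           /* target state: one accumulator */
typedef void (*host_op_t)(cpu_t *, int32_t);     /* stands in for native code */

static void host_loadi(cpu_t *c, int32_t imm) { c->acc = imm; }
static void host_addi(cpu_t *c, int32_t imm)  { c->acc += imm; }

typedef struct { host_op_t fn; int32_t imm; } translated_t;

/* Translate once: decode target bytes into host operations.
   Returns the number of translated operations; OP_HALT or an unknown
   opcode ends the block. */
size_t translate(const uint8_t *target, size_t n_bytes,
                 translated_t *out, size_t max_out) {
    size_t n = 0;
    for (size_t pc = 0; pc + 1 < n_bytes && n < max_out; pc += 2) {
        switch (target[pc]) {
        case OP_LOADI: out[n].fn = host_loadi; break;
        case OP_ADDI:  out[n].fn = host_addi;  break;
        default:       return n;
        }
        out[n].imm = target[pc + 1];
        n++;
    }
    return n;
}

/* Execute the translated block: no decoding left, only indirect calls. */
int32_t run_translated(const translated_t *code, size_t n) {
    cpu_t cpu = { 0 };
    for (size_t i = 0; i < n; i++)
        code[i].fn(&cpu, code[i].imm);
    return cpu.acc;
}
```

The decode cost is paid once in `translate`; a hot block can then be passed to `run_translated` repeatedly, which is the reason translation caches make functional simulation fast.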

Execution- versus Trace-driven
Execution driven: the application executes on the simulator.
    Application binary -> execution-driven simulator -> metrics
Trace driven: the simulator uses a trace as input.
    Application binary -> execution-driven simulator OR ISA-compatible processor -> instruction trace -> trace-driven simulator -> metrics
Example instruction trace:
    mov edx,len
    mov ecx,msg
    mov ebx,1
    mov eax,4
    int 0x80
Why would a sane person do this?

Trace-driven Simulation
Advantages:
- Trace collection is only required once
- Trace collection can be done with an ISA-compatible processor
- The trace simulator does not need to simulate all instructions; it can skip ahead in the trace if an instruction is not implemented
Disadvantages:
- Cannot speculatively execute code (the trace is fixed)
- The trace file can become huge for large applications (hundreds of GBs)
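A minimal sketch of a trace-driven simulator, assuming the trace has already been reduced to a list of memory addresses: it replays the addresses through a small direct-mapped cache and reports hits and misses. The cache geometry (4 sets, 16-byte lines) and the name `simulate_trace` are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_SETS   4u    /* assumed: tiny direct-mapped cache with 4 sets */
#define LINE_BYTES 16u   /* assumed: 16-byte cache lines */

typedef struct { uint64_t hits, misses; } cache_stats_t;

/* Replay an address trace through a direct-mapped cache model.
   The trace is fixed input: nothing is executed, only accesses are replayed. */
cache_stats_t simulate_trace(const uint32_t *addr_trace, size_t n) {
    uint32_t tags[NUM_SETS] = { 0 };
    int valid[NUM_SETS] = { 0 };
    cache_stats_t s = { 0, 0 };
    for (size_t i = 0; i < n; i++) {
        uint32_t line = addr_trace[i] / LINE_BYTES;
        uint32_t set  = line % NUM_SETS;
        uint32_t tag  = line / NUM_SETS;
        if (valid[set] && tags[set] == tag) {
            s.hits++;
        } else {
            s.misses++;          /* fill the line, evicting the old one */
            valid[set] = 1;
            tags[set]  = tag;
        }
    }
    return s;
}
```

The same trace can be replayed against many cache configurations without re-running the application, which is the "collect once" advantage. It also shows the limitation: the simulator cannot explore a path that is not in the trace.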

Mixing Simulation Strategies
Direct execution:
- Parts execute directly on the host (e.g. using dynamic translation, as in QEMU)
- Other parts are executed on a cycle-accurate simulator
Use case: you are interested in memory accesses and memory behavior. Execute only the loads and stores on the simulator, and emulate the rest directly on the host machine.

Simulation in the Multiprocessor Era


Parallelisation in all levels of the simulation stack
The stack, from top to bottom:
- Benchmark: multi-threaded application (threads 0, 1, ... N)
- Target processor: multi-core target processor (cores A, B, ...)
- Simulator: multi-threaded simulator (threads 0, 1, ... N)
- Host platform: multi-core host platform (cores A, B, ...)
Questions: does each of these combinations make sense?
- A multi-threaded application running on a single-core target processor
- A multi-core processor running on a single-threaded simulator
- A multi-threaded simulator running on a single-core host
But how to build a fast, multi-threaded simulator?

Parallel Simulation Techniques
- Discrete event simulation
- Quantum simulation (not Schrödinger's cat quantum, though)
- Slack simulation

Space Granularity
The textbook implicitly assumes that the smallest hardware block that can be mapped to a simulator thread is a full target core. This holds for almost all real-world simulators, and it severely limits the parallelism.
The exception is RTL simulation, where the blocks can be smaller. The Rocketick simulator even appears to use GPUs! [1]
[1] http://www.deepchip.com/items/0523-04.html

Discrete-Event Simulation
A logical choice for a simulator time step is one cycle for the fastest core.
Disadvantage: under-utilisation of the host platform if threads are idle while waiting for synchronisation.
Is it really this bad? What assumption did the author of the book make here?
Assumption: every target processor Pn is mapped to a separate host core.

Target vs Host Cores
There is no relation between the number of target cores and the number of host cores!

Discrete-Event Simulation
Utilisation of the host depends on the variation in the processing time of a cycle, but also on the number of host cores!
[Figure: target processors P1 to P4 all scheduled on 1 host core]

Quantum Simulation
Synchronize threads at larger time steps, e.g. every 3 cycles.
Advantage: utilisation improves, because the variation in processing time is amortized over longer sections of simulation.
Disadvantage: no longer cycle accurate.
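The utilisation argument can be checked with a small model. The sketch assumes one host core per target core and a barrier after every quantum; `host_time[p * cycles + t]` is the (made-up) host time that core `p` needs to simulate target cycle `t`. The name `barrier_sync_cost` and the numbers in the usage note are assumptions, and a quantum of 1 corresponds to synchronising every cycle.

```c
typedef struct { long busy; long wall; } sync_cost_t;

/* Host cost of simulating `cycles` target cycles on `cores` host threads
   that meet at a barrier every `quantum` cycles (quantum >= 1).
   busy: total host time spent simulating, summed over all threads.
   wall: elapsed host time; each quantum takes as long as its slowest thread. */
sync_cost_t barrier_sync_cost(const int *host_time, int cores, int cycles,
                              int quantum) {
    sync_cost_t c = { 0, 0 };
    for (int start = 0; start < cycles; start += quantum) {
        int end = (start + quantum < cycles) ? start + quantum : cycles;
        long slowest = 0;
        for (int p = 0; p < cores; p++) {
            long sum = 0;
            for (int t = start; t < end; t++)
                sum += host_time[p * cycles + t];
            c.busy += sum;
            if (sum > slowest)
                slowest = sum;
        }
        c.wall += slowest;
    }
    return c;
}
```

Utilisation is `busy / (cores * wall)`. With two cores whose per-cycle costs alternate as 1, 3, 1, 3, 1, 3 and 3, 1, 3, 1, 3, 1, a quantum of 1 gives 24 / 36 (about 67%), while a quantum of 3 gives 24 / 28 (about 86%): the per-cycle variation is amortized over the longer section.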

Slack Simulation
Start with the discrete-event simulation schedule. Instead of waiting in the red areas (the idle time before each synchronisation point), use the slack to process ahead.
Side effect: drift. The cores might be simulating different points in time, and could drift apart.
Mitigation: allow a maximum drift (or slack), and synchronize when this value is exceeded. Example: a max slack of 2.

Slack versus Quantum simulation
In quantum simulation, the core simulation times always stay within a cycle window that is fixed in global time. In slack simulation the simulation times also stay within a window, but with the key difference that this is a sliding window.
Typically much less synchronisation!
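The fixed versus sliding window can be written down as two small predicates. This is a sketch of the rule only, not of a real simulator's synchronisation code: `core_time[p]` is assumed to hold the number of target cycles core `p` has completed, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Simulated time of the slowest target core. */
uint64_t global_min_time(const uint64_t *core_time, size_t cores) {
    uint64_t min = core_time[0];
    for (size_t p = 1; p < cores; p++)
        if (core_time[p] < min)
            min = core_time[p];
    return min;
}

/* Slack rule (sliding window): a core may simulate one more cycle if it
   then leads the slowest core by at most max_slack cycles. */
int may_advance_slack(const uint64_t *core_time, size_t cores, size_t self,
                      uint64_t max_slack) {
    return core_time[self] + 1 - global_min_time(core_time, cores) <= max_slack;
}

/* Quantum rule (fixed window): a core may simulate one more cycle only if
   that cycle lies before the next barrier, which sits at a fixed multiple
   of the quantum in global time. */
int may_advance_quantum(const uint64_t *core_time, size_t cores, size_t self,
                        uint64_t quantum) {
    uint64_t window_end =
        (global_min_time(core_time, cores) / quantum + 1) * quantum;
    return core_time[self] < window_end;
}
```

With completed cycles {2, 3}, a quantum of 3 stops the second core at the barrier at cycle 3, while a max slack of 2 lets it continue to cycle 4. With {3, 5} the roles reverse: the quantum window [3, 6) still allows one more cycle, but the slack rule holds the core back because it would lead by 3.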

Still not good enough
From the paper "Graphite: A Distributed Parallel Simulator for Multicores" [1]: "Simulation slowdown is as low as 41x versus native execution".
"That still sounds slow."
"Well... Yes :("
[1] Graphite: A Distributed Parallel Simulator for Multicores - Jason E. Miller et al.

Question: What can we do if it still takes weeks or months to simulate a full benchmark?
[Figure: benchmark timeline from 0 to 1e16 cycles]

Workload Sampling
[Figure: benchmark timeline from 0 to 1e12 cycles, showing the init phase and a fixed-length simulation window]
Naive approach: only simulate the first X cycles (fixed length).
Problem: benchmarks often start with reading settings and initialisation. This is most likely not representative of the workload!
Fix: use functional simulation to skip over the initial section.
Question: Is the window always a good representation of the benchmark? Why/why not?

Program Modes
Real-world programs spend time in different modes, which can have very different characteristics.

Workload Sampling
Uniform sampling: sample uniformly over the program, hopefully capturing the dominant modes.
However, if the window size is very small, the micro-architecture (e.g. the branch predictor and caches) is not initialized correctly!
Solution: add a warm-up period before every window.
Question: How long should we warm up?
Some numbers suggested by SMARTS [1] to get a feeling for the scale:
- Initializing caches: 500,000 cycles
- Initializing branch prediction, reorder buffers, etc. (micro-architectural structures): 4000 cycles
- Window size: 1000 cycles
[1] SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling - Roland E. Wunderlich et al.
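To get a feeling for what these warm-up numbers mean for the total simulation effort, the cost of a sampled run can be estimated as the number of windows times the warm-up plus window length. The function name `sampled_cycles` and the choice of 10,000 windows in the usage note are assumptions for illustration; SMARTS itself derives the number of samples statistically.

```c
#include <stdint.h>

/* Cycles that must be simulated when every measurement window is preceded
   by cache warm-up and micro-architectural warm-up. */
uint64_t sampled_cycles(uint64_t windows, uint64_t cache_warmup,
                        uint64_t uarch_warmup, uint64_t window_size) {
    return windows * (cache_warmup + uarch_warmup + window_size);
}
```

With the scale numbers above and an assumed 10,000 windows, `sampled_cycles(10000, 500000, 4000, 1000)` gives 5.05e9 cycles, roughly 0.5% of a 1e12-cycle benchmark. Note that almost all of that is cache warm-up, not measurement: only 1e7 cycles fall inside the windows.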

Workload Sampling
Mode sampling: profile for modes in the application, and select representative windows. Typically the window size can be larger, so fewer windows (and less warm-up) are required.
[Figure: timeline from 0 to 1e12 cycles comparing mode sampling, uniform sampling, skipping init with functional simulation, and a fixed-length window]

Summary
- Why simulators: more accurate than models, cheaper than building hardware
- Simulation detail: Full-System vs User-level; Functional vs Cycle Accurate (micro-arch.) vs Gate-Level; Execution- vs Trace-driven
- (Fast) multiprocessor simulation: discrete event, quantum, slack
- Workload sampling
- Summary (the meta-lecture)
You can read about all of this in your textbook, chapter 9.