COTSon: Infrastructure for system-level simulation


COTSon: Infrastructure for system-level simulation
Ayose Falcón, Paolo Faraboschi, Daniel Ortega
HP Labs, Exascale Computing Lab
http://sites.google.com/site/hplabscotson
MICRO-41 tutorial, November 9, 2008, Lake Como, Italy
© 2008 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Core Concepts
- Functional Simulator (SimNow): sequences the behavioral simulation of CPUs and devices.
- Timers: using functional events, they compute the target metrics (time, power).
- Sampler: decides when to turn the Timers on or off, and for how long.
- Interleaver: decides how to buffer and reorder functional events (SMP).
- Time Predictor: based on how the timer metrics evolve over time, decides how to feed the information back to the functional simulator.

Decoupling Simulation
- Functional simulation (fast): emulates the behavior of all the components of the system (disks, video, network cards, etc.). Necessary to verify correctness and run software.
- Timing simulation (slow): models the timing of all the components. Used to measure performance (or power).
- COTSon approach: functional-directed, with sampling and time feedback.
[Diagram: the functional simulator (device function and software) sends events (instructions, ...) to the timing simulator, which produces metrics (time and power) and returns time feedback (predicted IPC).]

COTSon Components
[Architecture diagram. Labels: SimNow (functional) with Core 0, Core 1, northbridge, memory, southbridge, HD 0, HD 1 and NIC; asynchronous events; Interleaver; CPU and Memory Timer (I$ and D$ per core, bus, L2$, memory); disk timers; NIC timer; sampling; timing feedback; COTSon Node 0 and COTSon Node 1; Network Mediator with network switch (functional) and network timer.]

Timers (a.k.a. CPU/device models)
- Accept instructions, process them and update metrics.
- All timers share the memory hierarchy.
- Some must-have metrics: cycles and instructions.
- Pluggable architecture. Not only CPU models, but also profiling, trace generation and SimPoint-like analysis.
- Current models:
  - Timer0: simple linear model + cache hierarchy
  - Timer1: Timer0 + in-order pipeline
  - Bandwidth: only limited by memory bandwidth
  - PTLSim (open source): linked to COTSon, full x86 OoO superscalar
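As a rough illustration of the "simple linear model + cache hierarchy" idea, the sketch below charges a fixed cost per instruction plus a fixed penalty per cache miss and tracks the two must-have metrics (cycles and instructions). The class name `LinearTimer`, the parameters `base_cpi` and `miss_penalty`, and the toy direct-mapped cache are all assumptions made for this example; this is not the COTSon SDK interface.

```python
from dataclasses import dataclass, field


@dataclass
class LinearTimer:
    """Toy timer: cycles = base_cpi * instructions + miss_penalty * misses."""
    base_cpi: float = 1.0      # assumed cost of an instruction with no stalls
    miss_penalty: int = 100    # assumed stall cycles per cache miss
    n_sets: int = 64           # toy direct-mapped cache: 64 sets
    line_bytes: int = 64       # 64-byte cache lines
    cycles: float = 0.0
    instructions: int = 0
    _tags: dict = field(default_factory=dict)

    def _access(self, addr):
        """Return True on a hit, and install the line either way."""
        line = addr // self.line_bytes
        index = line % self.n_sets
        hit = self._tags.get(index) == line
        self._tags[index] = line
        return hit

    def step(self, mem_addrs=()):
        """Account for one instruction and its memory accesses."""
        self.instructions += 1
        self.cycles += self.base_cpi
        for addr in mem_addrs:
            if not self._access(addr):
                self.cycles += self.miss_penalty

    @property
    def ipc(self):
        return self.instructions / self.cycles if self.cycles else 0.0


if __name__ == "__main__":
    timer = LinearTimer()
    timer.step([0x0])   # cold miss
    timer.step([0x8])   # same cache line: hit
    timer.step()        # no memory access
    print(timer.instructions, timer.cycles, round(timer.ipc, 4))
```

An in-order pipeline model in the spirit of Timer1 would replace the constant `base_cpi` with per-stage accounting while keeping the same cache bookkeeping.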

Samplers
- Decide when and how much to simulate, and when to move from one simulation state to another:
  - Functional: fast-forward to the next state as quickly as possible
  - Warming (simple/detailed): get data into stateful structures (e.g., caches), but do not account for time
  - Simulation: account for time
- Pluggable architecture with many implementations: SMARTS [1], SimPoint [2], Dynamic Sampling [3], Random, Interval-based, ...
- Samplers provide the major acceleration component. Even for very accurate (hence slow) timing models, a good sampler only needs to invoke the timing model < 1% of the time.
[1] Wunderlich et al., "SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling", ISCA'03
[2] B. Calder, SimPoint (www.cse.ucsd.edu/~calder/simpoint)
[3] A. Falcón et al., "Combining Simulation and Virtualization through Dynamic Sampling", ISPASS'07
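To make the three sampler states concrete, here is a minimal sketch of an interval-based sampler that cycles through functional, warming and simulation phases, plus a helper that reports what fraction of instructions reaches the timing model. The function names and the phase lengths are hypothetical and chosen only for illustration; COTSon's own sampler interface is not shown in these slides.

```python
from itertools import cycle


def interval_sampler(functional, warming, simulation):
    """Yield (state, n_instructions) forever, cycling through the phases."""
    phases = [("functional", functional),
              ("warming", warming),
              ("simulation", simulation)]
    if sum(length for _, length in phases) <= 0:
        raise ValueError("at least one phase must have a positive length")
    for state, length in cycle(phases):
        if length > 0:
            yield state, length


def timed_fraction(sampler, total_instructions):
    """Fraction of instructions that are run with the timing model on."""
    executed = timed = 0
    for state, length in sampler:
        length = min(length, total_instructions - executed)
        executed += length
        if state == "simulation":
            timed += length
        if executed >= total_instructions:
            break
    return timed / total_instructions


if __name__ == "__main__":
    # Hypothetical setting: 9,800 fast-forwarded, 100 warming, 100 timed.
    sampler = interval_sampler(9_800, 100, 100)
    print(timed_fraction(sampler, 1_000_000))
```

With these made-up lengths, one instruction in a hundred is simulated with full timing, which is the kind of ratio the slide has in mind when it says the timing model is invoked only a small part of the time.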

Single CPU simulation
- Fast and accurate single-node simulation using Dynamic Sampling.
- Detects program phase changes dynamically.
- The challenge is to avoid disturbing the VM's execution in the code cache during fast functional emulation.
- Phase changes are correlated with VM statistics (exceptions, I/O events, code cache invalidations, ...), which are easy to get and don't impact performance.
[Figure: IPC and exceptions plotted against instructions, with numbered markers.]

Dynamic Sampling
A. Falcón, P. Faraboschi, and D. Ortega, "Combining Simulation and Virtualization through Dynamic Sampling", in Proceedings of ISPASS'07
- Lets users favor accuracy or speed, depending on their requirements:
  - High accuracy: 0.4% error with 8.5x speedup
  - High speed: 39x speedup with 1.9% error
- Fully dynamic:
  - Does not require any a priori analysis
  - Automatically detects code phases
- Allows timing feedback to be provided to the functional simulator
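The phase-detection idea can be sketched as follows: watch a cheap VM statistic per interval (for example, the exception count) and flag an interval as a likely phase change when the statistic moves by more than a sensitivity threshold. This is a simplified illustration under assumed names (`detect_phase_changes`, `sensitivity`) and an assumed relative-change rule; the published algorithm may use a different criterion.

```python
def detect_phase_changes(stat_per_interval, sensitivity=0.5):
    """Return indices of intervals whose statistic differs from the previous
    interval by more than `sensitivity`, as a relative change."""
    changes = []
    prev = None
    for i, value in enumerate(stat_per_interval):
        if prev is not None:
            base = max(abs(prev), 1)  # avoid dividing by zero
            if abs(value - prev) / base > sensitivity:
                changes.append(i)
        prev = value
    return changes


if __name__ == "__main__":
    # Hypothetical exception counts for six consecutive intervals.
    exceptions = [100, 105, 98, 300, 310, 40]
    print(detect_phase_changes(exceptions))
```

A sampler built on this would switch to warming and timed simulation at the flagged intervals and stay in fast functional mode elsewhere; lowering `sensitivity` trades speed for accuracy.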

Multi-core simulation
- SimNow performs functional simulation of multi-cores. It simulates the MP system sequentially, interleaving the cores at coarse granularity, which misses fine-grain memory interactions.
- COTSon buffers events and delivers them interleaved to the CPU timing models.
- Problem: hard to scale up (OS? BIOS?).
[Diagram: the simulator front-end (SimNow, cores 1 to 4) feeds the Interleaver; the simulator back-end has one CPU model per core (Model CPU 1 to 4) and an interconnect/memory model.]

Interleaving Fundamentals
- MP functional simulation runs sequentially, interleaved at coarse granularity. This may miss fine-grain memory interactions.
- We buffer events at every MP quantum and deliver them interleaved to the timers.
- Interleaving is based on the CPUs' IPC.
[Diagram: per-CPU instruction runs are buffered and coalesced over an MP quantum, then interleaved (1 2 1 2 ...) and sent to the timing model.]
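One way to picture IPC-based interleaving is to give every buffered event an estimated issue cycle and merge the per-CPU buffers in that order. The sketch below assumes the i-th event of a CPU issues at cycle i / IPC; the function name `interleave` and that timestamp rule are illustrative assumptions rather than COTSon's actual implementation.

```python
import heapq


def interleave(buffers, ipc):
    """Merge per-CPU event buffers in order of estimated issue time.

    buffers: {cpu_id: [event, ...]} collected during one MP quantum
    ipc:     {cpu_id: instructions per cycle} for that quantum
    """
    streams = []
    for cpu, events in buffers.items():
        # Assumption: event i of a CPU issues at cycle i / IPC, so each
        # stream is already sorted by time.
        streams.append([(i / ipc[cpu], cpu, ev) for i, ev in enumerate(events)])
    # heapq.merge lazily merges sorted streams; ties are broken by CPU id.
    return [(cpu, ev) for _, cpu, ev in heapq.merge(*streams)]


if __name__ == "__main__":
    buffers = {0: ["a0", "a1", "a2", "a3"], 1: ["b0", "b1"]}
    ipc = {0: 2.0, 1: 1.0}
    for cpu, event in interleave(buffers, ipc):
        print(cpu, event)
```

In this example the CPU with twice the IPC contributes two events for every one from the slower CPU, so the timing models see an order closer to what a real SMP would produce.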

Timing Feedback
- Problem: feed timing information back to the functional emulator.
  - Give the simulated application an illusion of approximate time (functional time corresponding to simulated time).
  - Define the IPC of a quantum based on previous history.
  - A classic time-series prediction problem, with an unknown model.
- Current model: simple predictor.
  - The IPC is fed back to the functional simulator.
  - The application being simulated acts as if execution is faster or slower.
[Diagram: emulate (functional) at CPI=1.0; simulate (timing) and observe the current CPI=2.0; predict the CPI from the previous observed and predicted CPIs; emulate (functional) at CPI=1.8.]
October 2008 GT Talk
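The slides only say that the current model is a "simple predictor". As one possible reading, the sketch below blends each newly observed CPI with the previous prediction using an exponential moving average. The class name `CpiPredictor` and the weight `alpha` are assumptions for this example, not the predictor COTSon ships.

```python
class CpiPredictor:
    """Exponentially weighted moving average of observed CPI values."""

    def __init__(self, initial_cpi=1.0, alpha=0.5):
        self.predicted = initial_cpi  # CPI used while emulating functionally
        self.alpha = alpha            # weight of the newest observation

    def observe(self, measured_cpi):
        """Blend a CPI measured by the timing model into the prediction."""
        self.predicted = (self.alpha * measured_cpi
                          + (1 - self.alpha) * self.predicted)
        return self.predicted


if __name__ == "__main__":
    predictor = CpiPredictor(initial_cpi=1.0, alpha=0.5)
    for measured in (2.0, 2.0, 1.2):
        print(predictor.observe(measured))
```

With `alpha=0.8`, a prediction of 1.0 followed by an observation of 2.0 moves to roughly 1.8, the same figures shown in the slide's diagram, although the slide does not state how its predictor is weighted.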

Many-core simulation
M. Monchiero, J.-H. Ahn, A. Falcón, D. Ortega, and P. Faraboschi, "How to simulate 1000 cores", dasCMP'08
- Translate software thread-level parallelism into simulated core-level parallelism.
- Identify and separate the instruction streams of the different threads at the OS level (context switches).
- Dynamically map each instruction flow to the corresponding core of the target multicore architecture, taking into account application-level thread synchronization.
[Diagram: in the simulator front-end, SimNow (1 core) provides thread IDs (from the guest OS) and OS context switches, which split the stream into Thread 1, Thread 2, Thread 3, ...; in the back-end each thread feeds one of Model CPU 1 to n, on top of an interconnect/memory model.]
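The thread-to-core translation can be sketched as a demultiplexer: a single functional stream tagged with thread IDs is split into one instruction stream per simulated core. The function name `demux_by_thread` and the first-come core assignment are assumptions for illustration, and the sketch deliberately leaves out the thread-synchronization handling that the slide mentions.

```python
def demux_by_thread(stream, n_cores):
    """Split one (thread_id, instruction) stream into per-core streams.

    Each new thread ID is bound to the next free simulated core.
    """
    core_of = {}                              # thread id -> core index
    per_core = [[] for _ in range(n_cores)]   # one instruction list per core
    for tid, insn in stream:
        if tid not in core_of:
            if len(core_of) >= n_cores:
                raise ValueError("more threads than simulated cores")
            core_of[tid] = len(core_of)
        per_core[core_of[tid]].append(insn)
    return core_of, per_core


if __name__ == "__main__":
    # Hypothetical stream from a single functional core; a change of
    # thread ID stands in for an OS context switch.
    stream = [(7, "i0"), (9, "j0"), (7, "i1"), (3, "k0"), (9, "j1")]
    mapping, cores = demux_by_thread(stream, n_cores=4)
    print(mapping)
    print(cores)
```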

Multi-node simulation
- Simulate a computer cluster as a cluster of full-system simulators:
  - Each node of the cluster is simulated with a full-system simulator
  - A network simulator is used to simulate the network topology
- Problems:
  - Time skew between nodes needs to be controlled with quanta
  - Quantum size must be chosen carefully: small quanta give bad simulation speed, large quanta give bad simulation accuracy

Adaptive Synchronization
A. Falcón, P. Faraboschi, and D. Ortega, "An Adaptive Synchronization Technique for Parallel Simulation of Networked Clusters", in Proceedings of ISPASS'08
- Basic idea: dynamically adjust the quantum for maximum speed at a controlled accuracy loss.
- The quantum increases or decreases depending on packet traffic.
- Slow acceleration, fast deceleration ("driving over speed bumps").
[Figure: packets and quantum plotted over time.]
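The "slow acceleration, fast deceleration" rule can be illustrated with a small update function: grow the quantum gently while the network is idle and cut it sharply as soon as packets appear. The function name `next_quantum` and all constants (growth and shrink factors, quantum bounds) are made up for this sketch; the ISPASS'08 paper should be consulted for the actual policy.

```python
def next_quantum(quantum, packets, q_min=1.0, q_max=1000.0,
                 grow=1.05, shrink=0.5):
    """Return the synchronization quantum to use for the next interval."""
    if packets > 0:
        quantum *= shrink   # fast deceleration: traffic needs tight sync
    else:
        quantum *= grow     # slow acceleration: no traffic, relax sync
    return max(q_min, min(q_max, quantum))


if __name__ == "__main__":
    quantum = 100.0
    # Hypothetical packet counts seen in consecutive quanta.
    for packets in (0, 0, 0, 12, 3, 0, 0):
        quantum = next_quantum(quantum, packets)
        print(packets, round(quantum, 2))
```

The asymmetry is the point of the design: a burst of traffic shrinks the quantum within a couple of steps, while it takes many idle intervals to climb back, which limits the accuracy lost when communication resumes.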

Speed vs. Accuracy Tradeoffs
- We can play the speed vs. accuracy game at several control points:
  - Within a node: dynamic sampling sensitivity
  - At cluster level: adaptive quantum range
- By choosing the appropriate values we can reach:
  - Single-node accuracy in the order of 11% 15% error (simple CPU model)
  - Networking accuracy (microbenchmark) up to 15 Gb/s
  - All of the above with self-relative slowdown (vs. native) of ~15x-3x
- Improvement areas:
  - SMP and cluster validation on larger applications
  - Better CPU models (if needed), especially in the SMP coherency area
  - Distributed simulation sometimes unstable for large clusters (> 5 nodes)
  - Canned recipes for non-expert users for accuracy/speed requirements

Success stories
- Fault isolation for commodity architectures study
  - "Configurable isolation: building high-availability systems with commodity multi-core processors" (ISCA'07)
  - "Isolation in Commodity Multicore Processors" (IEEE MICRO'07)
- Nanophotonics architecture investigation
  - "Corona: System implications of emerging nanophotonic technology" (ISCA'08)
- Last-level cache technologies study (CACTI-D)
  - "A comprehensive memory modeling tool and its application to the design and analysis of future memory hierarchies" (ISCA'08)
- Web 2.0 workload analysis
  - "Microblades and megaservers: system architectures for emerging Web 2.0 / internet workloads" (ISCA'08)
- ... and some other internal projects at HP Labs

Putting it all together
[Figure: IPC, network traffic and Acc. IPC over time of 8 nodes running NAMD.]

COTSon Labs

COTSon Labs Experiments
1. Functional simulation
2. Simple timers: dump_to, in_order
3. Memory tracer
4. Timing feedback
5. Samplers: random sampling, dynamic sampling
6. Selective tracing
7. Network simulation
8. Disk simulation

Functional simulation (I)
[Diagram: cotson-node driven by Lua files and Lua commands.]
7 November 2008

Functional simulation (II)
How to start a (deterministic) simulation:
- Send keystrokes to SimNow
- xtools using SimNow hacks
- Network access
- Pre-started application

Simple timer: dump_to
- Use the COTSon SDK to create your own timing or sampling module.
- Experiment: instructions from SimNow are disassembled and dumped to a file. No time feedback.
- Output fields (disasm):
    pid tid cr3 PC (length) Opcodes disasm
    [load store] virtual @ physical @ (length)
    [load store] virtual @ physical @ (length)

Simple timer: in-order
- 3-stage in-order pipeline + cache stalls
- Memory hierarchy in Lua
[Diagram: CPU 0 and CPU 1, each with I$, D$ and L2$, connected through a MOESI bus to memory.]

Memory tracer
- Transparent memory
- Dump to file/display
[Diagram: CPU 0 and CPU 1, each with I$, D$ and L2$, plus memory and the memory tracer.]

Timing feedback
[Figure: IPC over time for CPU 1 and CPU 2, with timing feedback.]

Timing feedback
[Figure: IPC over time for CPU 1 and CPU 2, without timing feedback.]

Random sampling
Sampling states:
- Functional: pre-programmed IPC
- Simple Warming: warm caches and branch predictor
- Detailed Warming: simple warming + warm reorder buffer
- Simulation: sample, full timing

Dynamic sampling (I)

Dynamic sampling (II)
[Figure: IPC over time, with curves labeled "full" and "dynamic".]

Selective Tracing
- Lets the user determine which application(s), or part(s) of an application, running inside SimNow are simulated with timing.
- Combined with CR3 tracing, allows the user to skip instructions from the OS or other applications (a change in the CR3 register = a context switch).
- Uses SimNow tagging of instructions to communicate data between the guest OS and COTSon, via a reserved CPUID instruction.

Ex: application instrumentation

    #include "cotson-tracer.h"
    int main(void) {
        COTSON_BEGIN_TRACE (1)
        /* benchmark code */
        COTSON_END_TRACE (1)
    }

Ex: OS instrumentation

    $> cotson_tracer.sh begin 1
    $> benchmark1
    $> cotson_tracer.sh end 1
    $>
    $> cotson_tracer.sh begin 2
    $> benchmark2
    $> cotson_tracer.sh end 2
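On the simulator side, selective tracing with CR3 filtering amounts to keeping only the instructions that fall between a begin and an end marker and that belong to the address space that issued the marker. The sketch below models that with a hypothetical event format, `(kind, cr3, payload)`, and a hypothetical function name, `select_traced`; how COTSon actually decodes the tagged CPUID instruction is not described in these slides.

```python
def select_traced(events):
    """Keep instructions issued inside an active trace zone.

    events: iterable of (kind, cr3, payload) where kind is
            "begin" (payload = zone id), "end" (payload = zone id)
            or "insn" (payload = program counter).
    Returns a list of (zone id, program counter) pairs.
    """
    active = {}   # CR3 value -> zone id currently being traced
    kept = []
    for kind, cr3, payload in events:
        if kind == "begin":
            active[cr3] = payload
        elif kind == "end":
            active.pop(cr3, None)
        elif kind == "insn" and cr3 in active:
            kept.append((active[cr3], payload))
    return kept


if __name__ == "__main__":
    # Hypothetical event stream: process 0xA traces zone 1, process 0xB
    # runs in between (a different CR3 means a context switch happened).
    events = [
        ("insn", 0xA, 0x10),
        ("begin", 0xA, 1),
        ("insn", 0xA, 0x14),
        ("insn", 0xB, 0x90),
        ("insn", 0xA, 0x18),
        ("end", 0xA, 1),
        ("insn", 0xA, 0x1C),
    ]
    print(select_traced(events))
```

Only the two instructions of process 0xA inside the zone survive; the instruction from the other address space and everything outside the markers would stay in fast functional mode.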

Network simulation
- 4-node cluster, 1 CPU per node
- NAS benchmarks with the mpich2 MPI library
- Node discovery, MPI boot and five NAS benchmarks (cg, ep, is, lu, mg) with 8 threads
- Simple crossbar switch, 2Gb/s bandwidth
- 1 Gb/s NICs
- Adaptive quantum synchronization 1:1

Disk simulation
- DiskSim integrated into COTSon (http://www.pdl.cmu.edu/disksim)
- Experiment: no CPU timing (IPC=1)
- Disk model: Seagate Cheetah 4LP, 4.5 GB, 10,033 rpm