Recent Advances in Simulation Techniques and Tools


Yuyang Li, li.yuyang(at)wustl.edu
(A paper written under the guidance of Prof. Raj Jain)
http://www.cse.wustl.edu/~jain/cse567-17/ftp/raistat/index.html

Abstract: Simulation refers to imitating the operation of a real-world process or system with a specific set of techniques. It is derived from modeling the object in the real world. In computer engineering, simulation is one of the three main methods of analyzing computer systems; it requires less time than measuring the real system and can be done with computer programs. Gem5 and ESESC are two featured tools for simulating computer systems. In this paper, their features are studied, validated and summarized.

Keywords: simulation tools, computer system simulator, simulator validation, gem5, ESESC

Table of Contents
1. Introduction
2. Background
   2.1 Simulation
   2.2 Computer Architecture Simulator
   2.3 Emulator
3. The Gem5 Simulator
   3.1 gem5 Accuracy Test
   3.2 pd-gem5
   3.3 gem5-gpu
4. The ESESC Simulator
   4.1 An Example of Simulation
5. Conclusions
References
Acronyms

1. Introduction

In computer system analysis, the three main methods are modeling, simulation and measurement. Among these, simulation can be performed at any design stage, does not require a large amount of time, and can be carried out entirely in software. [Jain91] Since the appearance of simulation technology, its techniques and tools have developed rapidly, and different techniques have been invented to satisfy various demands. A comparison is therefore useful to identify their differences and the use cases each one fits.

2. Background

2.1 Simulation

Simulation is the imitation of the operation of a real-world process or system over time. [Banks01] It is based on a model representing the key features, behaviors and functions of the real-world object: the model represents the system, while the simulation represents operating the system over time. Simulation is widely used in computer engineering and electrical engineering.

2.2 Computer Architecture Simulator

A computer architecture simulator (also called an architectural simulator) is software that models computer hardware and predicts its performance. The modeled target is very flexible: it can be a microprocessor alone or a full system including processor, memory and I/O devices. Architectural simulation serves several purposes: evaluating designs without building the real hardware, accessing non-existent computer devices, generating detailed performance data, and quick debugging. A typical example is a multicore computer system, which demands full-system simulation because building and debugging the real machine is difficult and time-consuming. Moreover, with the help of simulation, software development can start before the hardware is ready, which in turn helps validate the design of the hardware. [Joloboff09]

2.3 Emulator

An emulator can be hardware or software, and allows one system to behave like another. The system running the emulator is called the host, and the one being emulated is called the guest; the emulator allows the host to run software designed for the guest. The word "emulator" was coined in 1963 [Emerson95]; at that time it referred only to simulation assisted by microcode or hardware, while pure software emulation was still called "simulation".

3. The Gem5 Simulator

Gem5 is a simulation platform for system-level computer architecture and processor microarchitecture. It integrates interchangeable CPU models, a GPU model, a memory system and multiple instruction set architectures, and offers full-system capability, multi-system capability and power modeling. By now it has developed into many branches.

3.1 gem5 Accuracy Test [Butko12]

A two-factor experiment was used to measure the accuracy of the gem5 simulator: run the same programs on a real hardware system and on the system simulated by gem5, collect the output data and calculate the differences. As the reference hardware, the Snowball SKY-S9500-ULP-C01 development kit was chosen, built around an ST-Ericsson A9500, a dual-core ARM Cortex-A9 processor running the Linux kernel. Correspondingly, the gem5 model was configured as a dual-core ARM Cortex-A9 processor at 1 GHz, also running the Linux kernel. The ALP Media, SPLASH-2 and STREAM benchmarks were chosen as the workload, permitting the performance of a multicore architecture to be exercised and assessed. The first suite includes speech recognition, face recognition, ray tracing, MPEG-2 encoding and MPEG-2 decoding; in the experiment, the MPEG-2 encoder and decoder were selected as the tested services. The second suite contains complete applications (Barnes, FMM, Ocean, Radiosity, Water-Spatial) and three kernels (FFT, LU, Radix). Finally, the STREAM benchmark is a simple program that calculates transfer rates from measured memory bandwidth. The descriptions are presented in Table 1.

Table 1. Benchmark Set Description

The measured execution times are shown in Table 2.

Table 2. Execution Time of The Adopted Benchmarks

The results show that the mismatch rate lies between 0.47% and 17.94% and varies across workloads. The Radix application, which has the largest mismatch rate, was therefore studied further: different numbers of keys were tested to examine the relationship between the key count and the mismatch rate. Table 3 shows the outputs.

Table 3. Execution Time (ET) of Radix Sort Kernel

Fig.1 shows that increasing the number of keys increases the mismatch rate. This is because the volume of communication between cores grows with the number of keys, which causes more cache misses.

Fig.1 Radix Sort Execution Time Behavior

For the STREAM benchmark, Table 4 shows that the results from both systems are very close.

Table 4. Memory Bandwidth When Executing Stream Benchmark

3.2 pd-gem5

pd-gem5 is developed for parallel/distributed computer systems. [Alian16] Each host runs one or more gem5 processes, each of which simulates a full-system node or a network switch. A multi-node system simulated in pd-gem5 is depicted in Fig.2:

Fig.2 The Structure of a Four-node Computer System Simulation in pd-gem5

For networking, each packet is generated by a simulated NIC and forwarded to a simulated network-switch port through TCP sockets; the switch then routes the packet to the simulated NIC of the destination node. The NIC latency model of traditional gem5 is

    latency = l_nic + S / b_nic

where S is the packet size (no larger than the maximum transmission unit), and l_nic and b_nic are the NIC's fixed latency and maximum bandwidth. In pd-gem5, this model is enhanced to capture more precisely the non-linear latency effects of diverse packet sizes. Barrier synchronization is implemented to keep the simulated nodes in step: each node synchronizes at the end of every simulated time quantum, which is fixed. To run the simulation, hosts each consisting of a quad-core Intel Xeon processor, 2x8GB DDR3-1600 DIMMs and an Intel Ethernet NIC are used, and another group with an AMD quad-core processor, 1x8GB DDR3 DIMM and a Realtek PCIe Ethernet NIC is used to validate the model.
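The linear NIC latency model is easy to sketch in Python. The first function follows the formula in the text; the table-based refinement is only an assumption about how a "more precise" non-linear model might look, not pd-gem5's actual implementation:

```python
def nic_latency_linear(packet_size, l_nic, b_nic):
    """Classic gem5 NIC model: fixed latency plus serialization time.
    packet_size in bytes, l_nic in seconds, b_nic in bytes/second."""
    return l_nic + packet_size / b_nic

def nic_latency_piecewise(packet_size, table):
    """Hypothetical pd-gem5-style refinement: choose per-size-range
    (l_nic, b_nic) pairs from a calibration table, so small and large
    packets can see different effective latency/bandwidth."""
    for max_size, l_nic, b_nic in table:
        if packet_size <= max_size:
            return l_nic + packet_size / b_nic
    raise ValueError("packet exceeds maximum transmission unit")

# Illustrative calibration: small packets pay a relatively higher fixed cost.
calib = [(128, 4e-6, 0.5e9), (1500, 2e-6, 1.0e9)]
```

For example, `nic_latency_piecewise(64, calib)` uses the first row of the table, while a 1 KB packet falls into the second; the table entries here are invented for illustration.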

Systems of 2 to 24 nodes are evaluated with a star network topology, running the MPI implementation of the NAS benchmarks. Fig.3 compares pd-gem5, gem5 and the measurements on the AMD quad-core hosts. The pd-gem5 curve shows a non-linearity similar to the measured curve and fits it well; in contrast, the original gem5 result is more linear and differs significantly from the real measurements. In this way the pd-gem5 model is validated.

Fig.3 Comparison of Packet Round Trip Latency

Fig.4 shows the speedup of pd-gem5 running on multiple simulation hosts over pd-gem5 running on a single simulation host. Since each simulation host has four cores, the number of simulation hosts is 0.25 times the number of simulated nodes: 2, 4 and 6 hosts for 8, 16 and 24 simulated nodes, respectively. The simulation time on multiple hosts is normalized to that on a single host to obtain the speedup. When the number of simulated nodes is no larger than the number of (active) threads supported by one simulation host, pd-gem5 on a single host performs slightly better than on multiple hosts, because the synchronization overhead is lower (on-chip versus off-chip interconnect communication). However, running the NAS benchmarks on 2, 4 and 6 simulation hosts offers 1.2x, 1.6x and 3.2x higher geometric-mean performance for 8, 16 and 24 simulated nodes than running them on a single host. Note that the microbenchmark fails to complete in a reasonable amount of time when a single host attempts to simulate a 24-node system, because the simulation becomes extremely slow; hence there is no corresponding speedup point in Fig.4.
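The speedup computation described above (single-host simulation time over multi-host simulation time, summarized by a geometric mean across benchmarks) can be sketched as follows; the timing numbers are placeholders, not the paper's measurements:

```python
import math

def speedup(single_host_time, multi_host_time):
    """Speedup of multi-host pd-gem5 over a single simulation host."""
    return single_host_time / multi_host_time

def geometric_mean(values):
    """Geometric mean, the usual way to summarize a set of speedups."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical per-benchmark simulation times (hours), 16 simulated nodes.
single = {"cg": 10.0, "ep": 4.0, "mg": 6.0}
multi  = {"cg":  6.0, "ep": 3.0, "mg": 3.5}

speedups = [speedup(single[b], multi[b]) for b in single]
print(f"geometric-mean speedup: {geometric_mean(speedups):.2f}")
```

The geometric mean is preferred over the arithmetic mean here because speedups are ratios, and it does not let one benchmark's large ratio dominate the summary.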

Fig.4 Simulation Speedup of Using Multiple Simulation Hosts over a Single Simulation Host

3.3 gem5-gpu [Jason15]

gem5-gpu is a simulator modeling integrated CPU-GPU systems, built on the gem5 and GPGPU-Sim simulators. It can simulate both systems with coherent caches and a single virtual address space shared across the CPU and GPU, and systems with separate caches. gem5-gpu follows gem5's cache-coherence language SLICC, and adds a series of heterogeneous cache protocols; the MOESI protocol is used in the newly added models. The simulator supports most unmodified CUDA 3.2 source code and allows the CPU and GPU to work simultaneously. Fig.5 shows that, compared with the real system, gem5-gpu differs by less than 22% in most cases, and its performance correlates tightly with that of the GPGPU simulation. In this way the simulator is validated.

Fig.5 Runtimes Normalized to NVIDIA GTX 580

4. The ESESC Simulator

Besides gem5, ESESC is another widely used simulator for computer systems. ESESC is a very fast open-source simulator that models heterogeneous multicores (out-of-order cores, in-order cores and GPUs) with detailed performance, power and thermal models. [esesc] It models the ARM ISA, supports time-based sampling (TBS) and gives detailed power and temperature reports. ESESC is known above all for its speed, which comes from time-based sampling. [Ehsan13] Fig.6 shows that, compared with single-threaded simulators without sampling and with multithreaded simulators, ESESC's simulation speed decreased only slightly as the number of cores increased, yet remained far higher (measured in MIPS) than that of the other simulators. Together with its detailed reports of physical behavior (power and temperature), this makes ESESC widely applicable in its field.

Fig.6 Simulation Speed for Different Simulators

4.1 An Example of Simulation

A simple example experiment was carried out: computer systems with different L1 cache sizes, with and without an L2 cache, were tested, and the IPC was studied as the measure of performance. The details are shown in Table 5.

Table 5. The Details of the Experiments

The results are shown below:

Table 6. The Results of the Verification Experiment

The ANOVA table is as follows:

Table 7. The ANOVA Table of the Experiment

The table shows that the effect of the L1 cache size is significant, while the effect of the L2 cache is insignificant. A linear regression model was also fitted relating IPC to cache size. Moreover, increasing the cache size does decrease the miss rate and helps improve performance. The results are therefore valid and represent real-world conditions.

5. Conclusions

Computer system simulation is becoming more and more important, thanks to the convenience of not requiring the real system and to a reliability validated by several experiments. Simulators are also branching into more and more types to fit different demands. The gem5 simulator can simulate different CPU and GPU models and many computer systems, and can be customized to individual simulation requirements; derived simulators such as pd-gem5 and gem5-gpu build on it. pd-gem5 targets parallel or distributed systems: it can run on several hosts at the same time and is suitable for simulating networked systems. gem5-gpu combines gem5 CPU simulation with GPGPU simulation to simulate the CPU and GPU together; it keeps an independent interface between gem5 and the GPGPU simulator, and gives results very close to measurements of the real system.

The ESESC simulator excels at simulation speed. It can simulate multicore systems, in-order and out-of-order processors, and different kinds of caches. Compared with instruction-based simulators such as SESC, it is based on the TBS technique and generates output data faster, and its output is detailed, including power and temperature calculations. However, it is limited to the ARM instruction set architecture, and it currently cannot handle multithreaded workloads.

In this way, these simulators show different features and have their own advantages and disadvantages. There is probably no simulator that is perfect for every task, but the development and evolution of computer simulation techniques and tools is evident: the current techniques have already made a great difference and made simulation an essential step in computer system analysis.

References

1. [Jain91] Raj Jain, "The Art of Computer Systems Performance Analysis," John Wiley & Sons, Inc., 1991, ISBN-10: 0471503363
2. [Banks01] J. Banks, J. Carson, B. Nelson, D. Nicol, "Discrete-Event System Simulation," Prentice Hall, 2001, p. 3, ISBN 0-13-088702-1
3. [Joloboff09] "Full System Simulation of Embedded Systems"
4. [Emerson95] Pugh, Emerson W., "Building IBM: Shaping an Industry and Its Technology," MIT Press, 1995, p. 274, ISBN 0-262-16147-8
5. [Alian16] Mohammad Alian, Daehoon Kim, Nam Sung Kim, "pd-gem5: Simulation Infrastructure for Parallel/Distributed Computer Systems," IEEE Computer Architecture Letters, Vol. 15, No. 1, January-June 2016
6. [Butko12] Anastasiia Butko, Rafael Garibotti, Luciano Ost, Gilles Sassatelli, "Accuracy Evaluation of GEM5 Simulator System," 7th International Workshop on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), 9-11 July 2012
7. [Jason15] Jason Power, Joel Hestness, Marc S. Orr, Mark D. Hill, David A. Wood, "gem5-gpu: A Heterogeneous CPU-GPU Simulator," IEEE Computer Architecture Letters, Vol. 14, No. 1, January-June 2015
8. [esesc] ESESC Simulator Official Website, https://masc.soe.ucsc.edu/esesc/
9. [Ehsan13] Ehsan K. Ardestani, Jose Renau, "ESESC: A Fast Multicore Simulator Using Time-Based Sampling," 19th IEEE International Symposium on High Performance Computer Architecture (HPCA 2013), 23-27 Feb. 2013

Acronyms

1. SPLASH-2: Stanford ParalleL Applications for SHared memory
2. ET: Execution Time
3. ESESC: Enhanced Super ESCalar
4. TBS: Time-Based Sampling
5. IPC: Instructions Per Cycle
6. ANOVA: ANalysis Of VAriance

Last modified: December 15, 2017

This and other papers on performance analysis of computer systems are available online at http://www.cse.wustl.edu/~jain/cse567-17/index.html