Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes

1 Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes
Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Jayneel Gandhi, Christopher J. Rossbach, Onur Mutlu

2 Executive Summary
Problem: No single best page size for GPU virtual memory
- Large pages: Better TLB reach
- Small pages: Lower demand paging latency
Our goal: Transparently enable both page sizes
Key observations
- Can easily coalesce an application's contiguously-allocated small pages into a large page
- Interleaved memory allocation across applications breaks page contiguity
Key idea: Preserve virtual address contiguity of small pages when allocating physical memory to simplify coalescing
Mosaic is a hardware/software cooperative framework that:
- Coalesces small pages into a large page without data movement
- Enables the benefits of both small and large pages
Key result: 55% average performance improvement over state-of-the-art GPU memory management mechanism

3 GPU Support for Virtual Memory
- Improves programmability with a unified address space
- Enables large data sets to be processed in the GPU
- Allows multiple applications to run on a GPU
Virtual memory can enforce memory protection

4 State-of-the-Art Virtual Memory on GPUs
[Diagram: four GPU cores, each with a private TLB, backed by a shared TLB, page table walkers, and the page table and data in GPU-side main memory; CPU memory sits on the CPU side.]
- Limited TLB reach
- High latency page walks
- High latency I/O

5 Trade-Off with Page Size
Larger pages:
- Better TLB reach
- High demand paging latency
Smaller pages:
- Lower demand paging latency
- Limited TLB reach

6 Trade-Off with Page Size
[Figure: normalized performance of small (4KB) and large (2MB) pages, with no paging overhead and with paging overhead.]
Can we get the best of both page sizes?

7 Outline
- Background
- Key challenges and our goal
- Mosaic
- Experimental evaluations
- Conclusions

8 Challenges with Multiple Page Sizes
[Diagram: state-of-the-art GPU memory over time. Allocations from App 1 and App 2 alternate and are spread across large page frames 1 to 5 before each application's pages are coalesced.]
- Cannot coalesce (without migrating multiple 4KB pages)
- Need to search which pages to coalesce

9 Desirable Allocation
[Diagram: desirable behavior of GPU memory over time, with the same alternating App 1 and App 2 allocations across large page frames 1 to 5.]
- Can coalesce (without moving data)

10 Our Goals
- High TLB reach
- Low demand paging latency
- Application transparency: programmers do not need to modify the applications


12 Mosaic
[Diagram: Mosaic's three components (Contiguity-Conserving Allocation, In-Place Coalescer, Contiguity-Aware Compaction), split between the GPU runtime and the hardware.]

13 Outline
- Background
- Key challenges and our goal
- Mosaic
  - Contiguity-Conserving Allocation
  - In-Place Coalescer
  - Contiguity-Aware Compaction
- Experimental evaluations
- Conclusions

14 Mosaic: Data Allocation
[Diagram: (1) the application demands data; (2) Contiguity-Conserving Allocation allocates memory in a large page frame.]
- Soft guarantee: a large page frame contains pages from only a single address space
- Conserves contiguity within the large page frame

15 Mosaic: Data Allocation
[Diagram: (3) data is transferred from CPU memory over the system I/O bus.]
- Data transfer is done at a small page granularity
- A page that is transferred is immediately ready to use

16 Mosaic: Data Allocation
[Diagram: (4) transfer done.]
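
The allocation slides above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `ContiguityConservingAllocator`, its methods, and the choice to raise an error when no frame is free are all assumptions made for illustration. The sketch reserves one large page frame per (address space, virtual large page) pair and places each small page at the slot that matches its virtual offset, so a fully populated frame is already laid out as a large page.

```python
SMALL = 4 * 1024                   # small page size (4KB)
LARGE = 2 * 1024 * 1024            # large page size (2MB)
PAGES_PER_FRAME = LARGE // SMALL   # 512 small pages per large page frame


class ContiguityConservingAllocator:
    """Hypothetical sketch: one address space per large page frame."""

    def __init__(self, num_frames):
        self.free_frames = list(range(num_frames))
        self.frame_of = {}    # (asid, virtual large page number) -> frame
        self.allocated = {}   # frame -> set of occupied small-page slots

    def alloc(self, asid, vaddr):
        """Return the physical address backing the small page at vaddr."""
        vpn = vaddr // SMALL
        vlpn, slot = divmod(vpn, PAGES_PER_FRAME)
        key = (asid, vlpn)
        if key not in self.frame_of:
            if not self.free_frames:
                # Simplification: a real system would fall back to
                # another policy instead of failing.
                raise MemoryError("no free large page frame")
            frame = self.free_frames.pop(0)
            self.frame_of[key] = frame
            self.allocated[frame] = set()
        frame = self.frame_of[key]
        self.allocated[frame].add(slot)
        # Physical offset inside the frame mirrors the virtual offset.
        return frame * LARGE + slot * SMALL

    def coalesceable(self):
        """Frames whose 512 slots are all allocated."""
        return [f for f, slots in self.allocated.items()
                if len(slots) == PAGES_PER_FRAME]
```

Interleaving requests from two address spaces, for example `alloc(1, 0)`, `alloc(2, 0)`, `alloc(1, SMALL)`, still leaves each application's pages contiguous inside its own frame, which is the property the coalescer relies on.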


18 Mosaic: Coalescing
[Diagram: (1) Contiguity-Conserving Allocation passes a list of large pages to the In-Place Coalescer.]
- Fully-allocated large page frame: coalesceable
- Allocator sends the list of coalesceable pages to the In-Place Coalescer

19 Mosaic: Coalescing
[Diagram: (2) the In-Place Coalescer updates the page tables.]
- In-Place Coalescer has: list of coalesceable large pages
- Key task: perform coalescing without moving data
  - Simply need to update the page tables

20 Mosaic: Coalescing
[Diagram: large page table with a coalesced bit, alongside the small page table.]
- Application-transparent
  - Data can be accessed using either page size
  - No TLB flush
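
A minimal sketch of why coalescing needs no data movement, under the assumption that the small pages already sit at their natural offsets inside the large page frame. The `LargePTE` structure and the `coalesce` and `translate` functions are hypothetical names, not the hardware page table format from the talk. Coalescing only flips a bit, and the same virtual offset resolves to the same physical address through either path.

```python
from dataclasses import dataclass, field

SMALL = 4096
PAGES_PER_FRAME = 512


@dataclass
class LargePTE:
    """Hypothetical large page table entry with its small-page mappings."""
    frame_base: int                             # physical base of the frame
    coalesced: bool = False                     # the coalesced bit
    small: dict = field(default_factory=dict)   # slot -> physical address


def coalesce(pte):
    """Set the coalesced bit if every small page is present and in place."""
    in_place = all(
        pte.small.get(slot) == pte.frame_base + slot * SMALL
        for slot in range(PAGES_PER_FRAME)
    )
    if in_place:
        pte.coalesced = True   # page table update only; no data is copied
    return pte.coalesced


def translate(pte, offset):
    """Translate an offset within the 2MB virtual region."""
    if pte.coalesced:
        return pte.frame_base + offset           # large page path
    slot, page_offset = divmod(offset, SMALL)    # small page path
    return pte.small[slot] + page_offset
```

Because the small-page mappings are left intact, a translation cached before coalescing stays valid afterwards, which is consistent with the slide's point that data can be accessed using either page size.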


22 Mosaic: Data Deallocation
Key task: free up not-fully-used large page frames
- Splinter pages: break down a large page into small pages
- Compaction: combine fragmented large page frames

23 Mosaic: Data Deallocation
[Diagram: (1) the application deallocates data; (2) pages are splintered by resetting the coalesced bit.]
- Splinter only frames with deallocated pages

24 Mosaic: Compaction
Key task: free up not-fully-used large page frames
- Splinter pages: break down a large page into small pages
- Compaction: combine fragmented large page frames

25 Mosaic: Compaction
[Diagram: (1) pages are compacted across large page frames, producing free large pages; (2) list of free pages.]
- Compaction decreases memory bloat
  - Happens only when memory is highly fragmented

26 Mosaic: Compaction
- Once pages are compacted, they become non-coalesceable
  - No virtual contiguity
- Maximizes number of free large page frames
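
The deallocation and compaction steps above can be sketched as follows. This is an illustrative model only: the `Frame` class, the `deallocate` and `compact` functions, and the greedy fullest-first packing order are assumptions, and the sketch treats all frames as belonging to one address space. It shows the two effects named on the slides: splintering resets the coalesced bit, and compaction frees whole frames at the cost of making the destination frames non-coalesceable.

```python
PAGES_PER_FRAME = 512


class Frame:
    """Hypothetical large page frame holding a set of small-page ids."""

    def __init__(self, pages):
        self.pages = set(pages)
        self.coalesced = len(self.pages) == PAGES_PER_FRAME
        self.coalesceable = True


def deallocate(frame, page):
    """Free one small page and splinter the frame it belonged to."""
    frame.pages.discard(page)
    frame.coalesced = False   # reset the coalesced bit


def compact(frames):
    """Pack pages from sparse frames into fuller ones.

    Returns the number of completely free frames afterwards.
    """
    partial = sorted(
        (f for f in frames if 0 < len(f.pages) < PAGES_PER_FRAME),
        key=lambda f: len(f.pages),
        reverse=True,
    )
    lo, hi = 0, len(partial) - 1
    while lo < hi:
        dst, src = partial[lo], partial[hi]
        if len(dst.pages) == PAGES_PER_FRAME:
            lo += 1                          # destination is full
        elif not src.pages:
            hi -= 1                          # source is drained
        else:
            dst.pages.add(src.pages.pop())   # this step moves data
            dst.coalesceable = False         # virtual contiguity is lost
    return sum(1 for f in frames if not f.pages)
```

With frames holding 300, 200, and 100 pages, the sketch ends with one full frame, one frame holding 88 pages, and one free frame. A real policy would run this only past some fragmentation threshold, as the slide notes.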


28 Baseline: State-of-the-Art GPU Virtual Memory
[Diagram: four GPU cores, each with a private TLB, backed by a shared TLB, page table walkers, and the page table and data in GPU-side main memory; CPU memory sits on the CPU side.]

29 Methodology
- GPGPU-Sim (MAFIA) modeling GTX750 Ti
  - 30 GPU cores
  - Multiple GPGPU applications execute concurrently
  - 64KB 4-way L1, 2048KB 16-way L2
  - 64-entry L1 TLB, 1024-entry L2 TLB
  - 8-entry large page L1 TLB, 64-entry large page L2 TLB
  - 3GB main memory
- Model sequential page walks
- Model page tables and virtual-to-physical mapping
- CUDA-SDK, Rodinia, Parboil, LULESH, SHOC suites
  - 235 total workloads evaluated
- Available at:
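
As a worked example of the TLB reach trade-off, the snippet below multiplies the TLB entry counts from the methodology slide by the 4KB and 2MB page sizes used in the talk. The helper name `tlb_reach` is introduced here for illustration; reach is simply entries times page size.

```python
KB = 1024
MB = 1024 * KB

SMALL_PAGE = 4 * KB
LARGE_PAGE = 2 * MB


def tlb_reach(entries, page_size):
    """Bytes of memory a TLB can map when every entry is valid."""
    return entries * page_size


reach = {
    "L1 TLB, small pages": tlb_reach(64, SMALL_PAGE),
    "L2 TLB, small pages": tlb_reach(1024, SMALL_PAGE),
    "L1 TLB, large pages": tlb_reach(8, LARGE_PAGE),
    "L2 TLB, large pages": tlb_reach(64, LARGE_PAGE),
}

for name, size in reach.items():
    print(f"{name}: {size // KB} KB")
```

The 8-entry large page L1 TLB covers 16MB, compared with 256KB for the 64-entry small page L1 TLB, which is why coalesced pages improve reach despite the far smaller structure.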

30 Comparison Points
- State-of-the-art CPU-GPU memory management
  - GPU-MMU based on [Power et al., HPCA '14]
  - Upside: Utilizes parallel page walks, TLB request coalescing, and page walk cache to improve performance
  - Downside: Limited TLB reach
- Ideal TLB: Every TLB access is an L1 TLB hit

31 Performance
[Figure: weighted speedup of GPU-MMU, Mosaic, and Ideal TLB versus number of concurrently-executing applications. Percentage labels on the plot: homogeneous workloads 95.0%, 61.5%, 55.4%, 33.8%, 39.0%; heterogeneous workloads 21.4%, 31.5%, 43.1%, 23.7%.]
- Mosaic consistently improves performance across a wide variety of workloads
- Mosaic performs within 10% of the ideal TLB

32 Other Results in the Paper
- TLB hit rate
  - Mosaic achieves average TLB hit rate of 99%
- Per-application IPC
  - 97% of all applications perform faster
- Sensitivity to different TLB sizes
  - Mosaic is effective for various TLB configurations
- Memory fragmentation analysis
  - Mosaic reduces memory fragmentation and improves performance regardless of the original fragmentation
- Performance with and without demand paging


34 Summary
Problem: No single best page size for GPU virtual memory
- Large pages: Better TLB reach
- Small pages: Lower demand paging latency
Our goal: Transparently enable both page sizes
Key observations
- Can easily coalesce an application's contiguously-allocated small pages into a large page
- Interleaved memory allocation across applications breaks page contiguity
Key idea: Preserve virtual address contiguity of small pages when allocating physical memory to simplify coalescing
Mosaic is a hardware/software cooperative framework that:
- Coalesces small pages into a large page without data movement
- Enables the benefits of both small and large pages
Key result: 55% average performance improvement over state-of-the-art GPU memory management mechanism


36 Backup Slides

37 Current Methods to Share GPUs
Time sharing
- Fine-grained context switching
- Coarse-grained context switching
Spatial sharing
- NVIDIA GRID
- Multi process service

38 Other Methods to Enforce Protection
- Segmented paging
- Static memory partitioning

39 TLB Flush
- With Mosaic, the contents of the page tables are the same
- A TLB flush in Mosaic occurs when page table content is modified
  - This invalidates content in the TLB, which then needs to be flushed
  - Both large and small page TLBs are flushed

40 Performance with Demand Paging
[Figure: normalized performance of GPU-MMU with no paging, GPU-MMU with paging, and Mosaic with paging, for homogeneous and heterogeneous workloads.]

41 In-Place Coalescer: Coalescing
Key assumption: soft guarantee
- Large page range always contains pages of the same application
[Diagram: coalescing sets the large page bit in the L1 page table and the disabled bit in each of the corresponding L2 page table entries; the virtual address (VA) is shown with fields labeled PD, PT, and PO.]
Q: How to access large page base entry?
Benefit: No data movement

42 In-Place Coalescer: Large Page Walk
- Large page index is available at leaf PTE
[Diagram: L1 page table with the large page bit set and L2 page table entries with the disabled bit set, as in the coalescing slide.]

43 Sample Application Pairs
[Figure: weighted speedup of GPU-MMU, Mosaic, and Ideal TLB for TLB-friendly and TLB-sensitive application pairs.]

44 TLB Hit Rate
[Figure: L1 and L2 TLB hit rate (0% to 100%) for GPU-MMU and Mosaic with 1 to 5 concurrently-executing applications.]

45 Pre-Fragmenting DRAM
[Figure: normalized performance of no CAC, CAC, CAC-BC, and CAC-Ideal versus fragmentation index (up to 100%).]

46 Page Occupancy Experiment
[Figure: normalized performance of no CAC, CAC, CAC-BC, and CAC-Ideal versus large page frame occupancy.]

47 Memory Bloat
[Figure: memory bloat relative to GPU-MMU versus page occupancy (1% to 75%); series include GPU-MMU and CAC.]

48 Individual Application IPC
[Figure: four panels of normalized performance for GPU-MMU, Mosaic, and Ideal-TLB versus sorted application number.]

49 [Figure: four panels of normalized performance for GPU-MMU and Mosaic versus the number of per-SM L1 TLB base page entries, per-SM L1 TLB large page entries, shared L2 TLB base page entries, and shared L2 TLB large page entries.]

50 Mosaic: Putting Everything Together
[Diagram: full Mosaic flow across the GPU runtime and hardware. Application demands data, allocate memory, transfer data over the system I/O bus, transfer done, coalesce pages (list of large pages); application deallocates data, splinter pages, compact pages (list of free pages).]

51 Mosaic: Data Allocation
[Diagram: allocation path only. Application demands data, allocate memory, transfer data over the system I/O bus, transfer done, coalesce pages (list of large pages).]

52 Mosaic: Data Deallocation
[Diagram: deallocation path only. Application deallocates data, splinter pages, compact pages (list of free pages).]


More information

Architectural Core Salvaging in a Multi-Core Processor for Hard-Error Tolerance

Architectural Core Salvaging in a Multi-Core Processor for Hard-Error Tolerance Architectural Core Salvaging in a Multi-Core Processor for Hard-Error Tolerance Michael D. Powell, Arijit Biswas, Shantanu Gupta, and Shubu Mukherjee SPEARS Group, Intel Massachusetts EECS, University

More information

Using Variable-MHz Microprocessors to Efficiently Handle Uncertainty in Real-Time Systems

Using Variable-MHz Microprocessors to Efficiently Handle Uncertainty in Real-Time Systems Using Variable-MHz Microprocessors to Efficiently Handle Uncertainty in Real-Time Systems Eric Rotenberg Center for Embedded Systems Research (CESR) Department of Electrical & Computer Engineering North

More information

CUDA 를활용한실시간 IMAGE PROCESSING SYSTEM 구현. Chang Hee Lee

CUDA 를활용한실시간 IMAGE PROCESSING SYSTEM 구현. Chang Hee Lee 1 CUDA 를활용한실시간 IMAGE PROCESSING SYSTEM 구현 Chang Hee Lee Overview Thin film transistor(tft) LCD : Inspection Object Type of Defect Type of Inspection Instrument Brief Lighting / Focusing Optic Magnification

More information

Enhancing System Architecture by Modelling the Flash Translation Layer

Enhancing System Architecture by Modelling the Flash Translation Layer Enhancing System Architecture by Modelling the Flash Translation Layer Robert Sykes Sr. Dir. Firmware August 2014 OCZ Storage Solutions A Toshiba Group Company Introduction This presentation will discuss

More information

Challenges in Transition

Challenges in Transition Challenges in Transition Keynote talk at International Workshop on Software Engineering Methods for Parallel and High Performance Applications (SEM4HPC 2016) 1 Kazuaki Ishizaki IBM Research Tokyo kiszk@acm.org

More information

Hybrid QR Factorization Algorithm for High Performance Computing Architectures. Peter Vouras Naval Research Laboratory Radar Division

Hybrid QR Factorization Algorithm for High Performance Computing Architectures. Peter Vouras Naval Research Laboratory Radar Division Hybrid QR Factorization Algorithm for High Performance Computing Architectures Peter Vouras Naval Research Laboratory Radar Division 8/1/21 Professor G.G.L. Meyer Johns Hopkins University Parallel Computing

More information

Plane-dependent Error Diffusion on a GPU

Plane-dependent Error Diffusion on a GPU Plane-dependent Error Diffusion on a GPU Yao Zhang a, John Ludd Recker b, Robert Ulichney c, Ingeborg Tastl b, John D. Owens a a University of California, Davis, One Shields Avenue, Davis, CA, USA; b Hewlett-Packard

More information

CT-Bus : A Heterogeneous CDMA/TDMA Bus for Future SOC

CT-Bus : A Heterogeneous CDMA/TDMA Bus for Future SOC CT-Bus : A Heterogeneous CDMA/TDMA Bus for Future SOC Bo-Cheng Charles Lai 1 Patrick Schaumont 1 Ingrid Verbauwhede 1,2 1 UCLA, EE Dept. 2 K.U.Leuven 42 Westwood Plaza Los Angeles, CA 995 Abstract- CDMA

More information

Scheduling and Communication Synthesis for Distributed Real-Time Systems

Scheduling and Communication Synthesis for Distributed Real-Time Systems Scheduling and Communication Synthesis for Distributed Real-Time Systems Department of Computer and Information Science Linköpings universitet 1 of 30 Outline Motivation System Model and Architecture Scheduling

More information

Optimizing VM Checkpointing for Restore Performance in VMware ESXi Server

Optimizing VM Checkpointing for Restore Performance in VMware ESXi Server Optimizing VM Checkpointing for Restore Performance in VMware ESXi Server Irene Zhang University of Washington Tyler Denniston MIT CSAIL Yury Baskakov VMware Alex Garthwaite CloudPhysics Virtual Machine

More information

LEGO car course topics

LEGO car course topics LEGO car course topics Xiebing Wang, Xiang Gao, Biao Hu, Kai Huang Chair of Robotics and Embedded Systems Department of Informatiks Technische Universität München Xiebing Wang, Xiang Gao, Biao Hu, Kai

More information

Liu Yang, Bong-Joo Jang, Sanghun Lim, Ki-Chang Kwon, Suk-Hwan Lee, Ki-Ryong Kwon 1. INTRODUCTION

Liu Yang, Bong-Joo Jang, Sanghun Lim, Ki-Chang Kwon, Suk-Hwan Lee, Ki-Ryong Kwon 1. INTRODUCTION Liu Yang, Bong-Joo Jang, Sanghun Lim, Ki-Chang Kwon, Suk-Hwan Lee, Ki-Ryong Kwon 1. INTRODUCTION 2. RELATED WORKS 3. PROPOSED WEATHER RADAR IMAGING BASED ON CUDA 3.1 Weather radar image format and generation

More information

An evaluation of debayering algorithms on GPU for real-time panoramic video recording

An evaluation of debayering algorithms on GPU for real-time panoramic video recording An evaluation of debayering algorithms on GPU for real-time panoramic video recording Ragnar Langseth, Vamsidhar Reddy Gaddam, Håkon Kvale Stensland, Carsten Griwodz, Pål Halvorsen University of Oslo /

More information

NVIDIA APEX: High-Definition Physics with Clothing and Vegetation. Michael Sechrest, IDV Monier Maher, NVIDIA Jean Pierre Bordes, NVIDIA

NVIDIA APEX: High-Definition Physics with Clothing and Vegetation. Michael Sechrest, IDV Monier Maher, NVIDIA Jean Pierre Bordes, NVIDIA NVIDIA APEX: High-Definition Physics with Clothing and Vegetation Michael Sechrest, IDV Monier Maher, NVIDIA Jean Pierre Bordes, NVIDIA Outline Introduction APEX: A Scalable Dynamics Framework APEX Clothing

More information

Table of Contents HOL EMT

Table of Contents HOL EMT Table of Contents Lab Overview - - Machine Learning Workloads in vsphere Using GPUs - Getting Started... 2 Lab Guidance... 3 Module 1 - Machine Learning Apps in vsphere VMs Using GPUs (15 minutes)...9

More information

Developing a GPU Processing Framework for Accelerating Remote Sensing Algorithms

Developing a GPU Processing Framework for Accelerating Remote Sensing Algorithms 19 October 2010 Research and Industrial Collaboration Conference Research to Reality Northeastern University, Boston, MA Developing a GPU Processing Framework for Accelerating Remote Sensing Algorithms

More information

Parallel Storage and Retrieval of Pixmap Images

Parallel Storage and Retrieval of Pixmap Images Parallel Storage and Retrieval of Pixmap Images Roger D. Hersch Ecole Polytechnique Federale de Lausanne Lausanne, Switzerland Abstract Professionals in various fields such as medical imaging, biology

More information

High Performance Computing for Engineers

High Performance Computing for Engineers High Performance Computing for Engineers David Thomas dt10@ic.ac.uk / https://github.com/m8pple Room 903 http://cas.ee.ic.ac.uk/people/dt10/teaching/2014/hpce HPCE / dt10/ 2015 / 0.1 High Performance Computing

More information

Data acquisition and Trigger (with emphasis on LHC)

Data acquisition and Trigger (with emphasis on LHC) Lecture 2! Introduction! Data handling requirements for LHC! Design issues: Architectures! Front-end, event selection levels! Trigger! Upgrades! Conclusion Data acquisition and Trigger (with emphasis on

More information

NRC Workshop on NASA s Modeling, Simulation, and Information Systems and Processing Technology

NRC Workshop on NASA s Modeling, Simulation, and Information Systems and Processing Technology NRC Workshop on NASA s Modeling, Simulation, and Information Systems and Processing Technology Bronson Messer Director of Science National Center for Computational Sciences & Senior R&D Staff Oak Ridge

More information

Dynamic Routing and Spectrum Assignment in Brown-field Fixed/Flex Grid Optical Network. Tanjila Ahmed

Dynamic Routing and Spectrum Assignment in Brown-field Fixed/Flex Grid Optical Network. Tanjila Ahmed Dynamic Routing and Spectrum Assignment in Brown-field Fixed/Flex Grid Optical Network Tanjila Ahmed Outline ØAbstract ØWhy we need flexible grid? ØChallenges to handle mixed grid ØExisting Solutions ØOur

More information

A Bypass First Policy for Energy-Efficient Last Level Caches

A Bypass First Policy for Energy-Efficient Last Level Caches A Bypass First Policy for Energy-Efficient Last Level Caches Jason Jong Kyu Park University of Michigan Ann Arbor, MI, USA Email: jasonjk@umich.edu Yongjun Park Hongik University Seoul, Korea Email: yongjun.park@hongik.ac.kr

More information

Building Java Apps with ArcGIS Runtime SDK

Building Java Apps with ArcGIS Runtime SDK Building Java Apps with ArcGIS Runtime SDK Vijay Gandhi, Elise Acheson, Eric Bader Demo Source code: https://github.com/esri/arcgis-runtime-samples-java/tree/master/devsummit-2014 Video Recording: http://video.esri.com

More information

The Xbox One System on a Chip and Kinect Sensor

The Xbox One System on a Chip and Kinect Sensor The Xbox One System on a Chip and Kinect Sensor John Sell, Patrick O Connor, Microsoft Corporation 1 Abstract The System on a Chip at the heart of the Xbox One entertainment console is one of the largest

More information

Image Processing Architectures (and their future requirements)

Image Processing Architectures (and their future requirements) Lecture 16: Image Processing Architectures (and their future requirements) Visual Computing Systems Smart phone processing resources Example SoC: Qualcomm Snapdragon Image credit: Qualcomm Apple A7 (iphone

More information

Microarchitectural Attacks and Defenses in JavaScript

Microarchitectural Attacks and Defenses in JavaScript Microarchitectural Attacks and Defenses in JavaScript Michael Schwarz, Daniel Gruss, Moritz Lipp 25.01.2018 www.iaik.tugraz.at 1 Michael Schwarz, Daniel Gruss, Moritz Lipp www.iaik.tugraz.at Microarchitecture

More information

RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM

RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM Fengbin Tu, Weiwei Wu, Shouyi Yin, Leibo Liu, Shaojun Wei Institute of Microelectronics Tsinghua University The 45th International

More information

Table of Contents HOL ADV

Table of Contents HOL ADV Table of Contents Lab Overview - - Horizon 7.1: Graphics Acceleartion for 3D Workloads and vgpu... 2 Lab Guidance... 3 Module 1-3D Options in Horizon 7 (15 minutes - Basic)... 5 Introduction... 6 3D Desktop

More information

Chapter 16 - Instruction-Level Parallelism and Superscalar Processors

Chapter 16 - Instruction-Level Parallelism and Superscalar Processors Chapter 16 - Instruction-Level Parallelism and Superscalar Processors Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 16 - Superscalar Processors 1 / 78 Table of Contents I 1 Overview

More information

CAMEO: Continuous Analytics for Massively Multiplayer Online Games

CAMEO: Continuous Analytics for Massively Multiplayer Online Games CAMEO: Continuous Analytics for Massively Multiplayer Online Games Alexandru Iosup Parallel and Distributed Systems Group Delft University of Technology 1 MMOGs are a Popular, Growing Market 25,000,000

More information

Characterizing and Improving the Performance of Intel Threading Building Blocks

Characterizing and Improving the Performance of Intel Threading Building Blocks Characterizing and Improving the Performance of Intel Threading Building Blocks Gilberto Contreras, Margaret Martonosi Princeton University IISWC 08 Motivation Chip Multiprocessors are the new computing

More information

HMD based VR Service Framework. July Web3D Consortium Kwan-Hee Yoo Chungbuk National University

HMD based VR Service Framework. July Web3D Consortium Kwan-Hee Yoo Chungbuk National University HMD based VR Service Framework July 31 2017 Web3D Consortium Kwan-Hee Yoo Chungbuk National University khyoo@chungbuk.ac.kr What is Virtual Reality? Making an electronic world seem real and interactive

More information

Application-Managed Flash Sungjin Lee, Ming Liu, Sangwoo Jun, Shuotao Xu, Jihong Kim and Arvind

Application-Managed Flash Sungjin Lee, Ming Liu, Sangwoo Jun, Shuotao Xu, Jihong Kim and Arvind Application-Managed Flash Sungjin Lee, Ming Liu, Sangwoo Jun, Shuotao Xu, Jihong Kim and Arvind Massachusetts Institute of Technology Seoul National University 14th USENIX Conference on File and Storage

More information

Parallel Randomized Best-First Search

Parallel Randomized Best-First Search Parallel Randomized Best-First Search Yaron Shoham and Sivan Toledo School of Computer Science, Tel-Aviv Univsity http://www.tau.ac.il/ stoledo, http://www.tau.ac.il/ ysh Abstract. We describe a novel

More information

Experience Report on Developing a Software Communications Architecture (SCA) Core Framework. OMG SBC Workshop Arlington, Va.

Experience Report on Developing a Software Communications Architecture (SCA) Core Framework. OMG SBC Workshop Arlington, Va. Communication, Navigation, Identification and Reconnaissance Experience Report on Developing a Software Communications Architecture (SCA) Core Framework OMG SBC Workshop Arlington, Va. September, 2004

More information

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Computer Architecture ECE 568 Part 14 Improving Performance: Interleaving Israel Koren ECE568/Koren Part.14.1 Background Performance

More information

Performance Metrics, Amdahl s Law

Performance Metrics, Amdahl s Law ecture 26 Computer Science 61C Spring 2017 March 20th, 2017 Performance Metrics, Amdahl s Law 1 New-School Machine Structures (It s a bit more complicated!) Software Hardware Parallel Requests Assigned

More information

ECE473 Computer Architecture and Organization. Pipeline: Introduction

ECE473 Computer Architecture and Organization. Pipeline: Introduction Computer Architecture and Organization Pipeline: Introduction Lecturer: Prof. Yifeng Zhu Fall, 2015 Portions of these slides are derived from: Dave Patterson UCB Lec 11.1 The Laundry Analogy Student A,

More information

Computer Architecture ( L), Fall 2017 HW 3: Branch handling and GPU SOLUTIONS

Computer Architecture ( L), Fall 2017 HW 3: Branch handling and GPU SOLUTIONS Computer Architecture (263-2210-00L), Fall 2017 HW 3: Branch handling and GPU SOLUTIONS Instructor: Prof. Onur Mutlu TAs: Hasan Hassan, Arash Tavakkol, Mohammad Sadr, Lois Orosa, Juan Gomez Luna Assigned:

More information

GC for interactive and real-time systems

GC for interactive and real-time systems GC for interactive and real-time systems Interactive or real-time app concerns Reducing length of garbage collection pause Demands guarantees for worst case performance Generational GC works if: Young

More information

Exploring Heterogeneity within a Core for Improved Power Efficiency

Exploring Heterogeneity within a Core for Improved Power Efficiency Computer Engineering Exploring Heterogeneity within a Core for Improved Power Efficiency Sudarshan Srinivasan Nithesh Kurella Israel Koren Sandip Kundu May 2, 215 CE Tech Report # 6 Available at http://www.eng.biu.ac.il/segalla/computer-engineering-tech-reports/

More information

MLP-Aware Runahead Threads in a Simultaneous Multithreading Processor

MLP-Aware Runahead Threads in a Simultaneous Multithreading Processor MLP-Aware Runahead Threads in a Simultaneous Multithreading Processor Kenzo Van Craeynest, Stijn Eyerman, and Lieven Eeckhout Department of Electronics and Information Systems (ELIS), Ghent University,

More information

escience: Pulsar searching on GPUs

escience: Pulsar searching on GPUs escience: Pulsar searching on GPUs Alessio Sclocco Ana Lucia Varbanescu Karel van der Veldt John Romein Joeri van Leeuwen Jason Hessels Rob van Nieuwpoort And many others! Netherlands escience center Science

More information