Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes
1 Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes
Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Jayneel Gandhi, Christopher J. Rossbach, Onur Mutlu

2 Executive Summary
- Problem: No single best page size for GPU virtual memory
  - Large pages: Better TLB reach
  - Small pages: Lower demand paging latency
- Our goal: Transparently enable both page sizes
- Key observations
  - Can easily coalesce an application's contiguously-allocated small pages into a large page
  - Interleaved memory allocation across applications breaks page contiguity
- Key idea: Preserve virtual address contiguity of small pages when allocating physical memory to simplify coalescing
- Mosaic is a hardware/software cooperative framework that:
  - Coalesces small pages into a large page without data movement
  - Enables the benefits of both small and large pages
- Key result: 55% average performance improvement over state-of-the-art GPU memory management mechanism

3 GPU Support for Virtual Memory
- Improves programmability with a unified address space
- Enables large data sets to be processed in the GPU
- Allows multiple applications to run on a GPU
- Virtual memory can enforce memory protection

4 State-of-the-Art Virtual Memory on GPUs
[Diagram: four GPU cores, each with a private TLB, backed by a shared TLB, page table walkers, and the page table and data in GPU-side main memory; CPU memory sits on the CPU side.]
- Limited TLB reach
- High latency page walks
- High latency I/O

5 Trade-Off with Page Size
- Larger pages:
  - Better TLB reach
  - High demand paging latency
- Smaller pages:
  - Lower demand paging latency
  - Limited TLB reach
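The TLB-reach side of this trade-off is simple arithmetic, sketched below. The page sizes (4KB and 2MB) are the ones used in the deck; the entry count of 64 matches the L1 TLB size listed on the methodology slide, but it is used here only as an illustrative input, and the function name `tlb_reach` is hypothetical.

```python
KB = 1024
MB = 1024 * KB

def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory covered when every TLB entry maps a distinct page."""
    return entries * page_size

# Same number of entries, two page sizes.
small_reach = tlb_reach(64, 4 * KB)       # 64 entries of 4KB pages
large_reach = tlb_reach(64, 2 * MB)       # 64 entries of 2MB pages
pages_per_large = (2 * MB) // (4 * KB)    # small pages that fit in one large page

print(small_reach // KB, "KB vs", large_reach // MB, "MB")
```

With equal entry counts, the large-page TLB covers 512 times more memory, which is also the number of 4KB pages that must be transferred before a whole 2MB page is resident.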
6 Trade-Off with Page Size
[Figure: normalized performance of small (4KB) and large (2MB) pages, in two panels: No Paging Overhead and With Paging Overhead.]
Can we get the best of both page sizes?

7 Outline
- Background
- Key challenges and our goal
- Mosaic
- Experimental evaluations
- Conclusions

8 Challenges with Multiple Page Sizes
[Diagram: timeline of interleaved App 1 and App 2 allocations filling Large Page Frames 1-5 under state-of-the-art GPU memory management, followed by attempts to coalesce App 1 pages and App 2 pages. Legend: Unallocated, App 1, App 2.]
- Cannot coalesce (without migrating multiple 4K pages)
- Need to search which pages to coalesce

9 Desirable Allocation
[Diagram: the same timeline of App 1 and App 2 allocations across Large Page Frames 1-5 under the desirable behavior, followed by coalescing App 1 pages and App 2 pages. Legend: Unallocated, App 1, App 2.]
- Can coalesce (without moving data)

10 Our Goals
- High TLB reach
- Low demand paging latency
- Application transparency
  - Programmers do not need to modify the applications
12 Mosaic
[Diagram: the GPU runtime hosts Contiguity-Conserving Allocation; the hardware hosts the In-Place Coalescer and Contiguity-Aware Compaction.]

13 Outline
- Background
- Key challenges and our goal
- Mosaic
  - Contiguity-Conserving Allocation
  - In-Place Coalescer
  - Contiguity-Aware Compaction
- Experimental evaluations
- Conclusions

14 Mosaic: Data Allocation
[Diagram: (1) Application Demands Data; (2) Contiguity-Conserving Allocation allocates memory in a large page frame; the page table and data reside in hardware.]
- Soft guarantee: A large page frame contains pages from only a single address space
- Conserves contiguity within the large page frame
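The soft guarantee can be illustrated with a small allocator model. This is a sketch under stated assumptions, not the runtime's actual implementation: the class name, the `asid` (address space ID) parameter, and the policy of placing each small page at the same offset inside the frame as it has inside its 2MB-aligned virtual region are all hypothetical choices made for illustration.

```python
SMALL_PER_LARGE = 512  # 2MB large page frame / 4KB small page

class ContiguityConservingAllocator:
    """Toy model: every large page frame belongs to one address space only."""

    def __init__(self, num_frames):
        self.free_frames = list(range(num_frames))
        self.frame_of = {}  # (asid, virtual large page number) -> frame
        self.used = {}      # frame -> set of small-page slots in use

    def allocate(self, asid, vpn):
        """Back small virtual page `vpn` of `asid`; return its physical page number."""
        vlpn, slot = divmod(vpn, SMALL_PER_LARGE)
        key = (asid, vlpn)
        if key not in self.frame_of:
            if not self.free_frames:
                raise MemoryError("no free large page frame")
            frame = self.free_frames.pop(0)
            self.frame_of[key] = frame
            self.used[frame] = set()
        frame = self.frame_of[key]
        self.used[frame].add(slot)
        # Same slot in the frame as in the virtual region: contiguity is conserved.
        return frame * SMALL_PER_LARGE + slot

    def coalesceable(self):
        """Frames that are fully allocated and can be coalesced in place."""
        return sorted(f for f, s in self.used.items() if len(s) == SMALL_PER_LARGE)

alloc = ContiguityConservingAllocator(num_frames=4)
# Interleaved requests from two applications land in separate frames.
a0 = alloc.allocate(asid=1, vpn=0)
b0 = alloc.allocate(asid=2, vpn=0)
a1 = alloc.allocate(asid=1, vpn=1)
b1 = alloc.allocate(asid=2, vpn=1)
for vpn in range(2, SMALL_PER_LARGE):
    alloc.allocate(asid=1, vpn=vpn)
print(alloc.coalesceable())
```

Because interleaved requests never share a frame, a fully allocated frame is coalesceable without moving any data.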
15 Mosaic: Data Allocation
[Diagram: as on the previous slide, plus (3) Transfer Data from CPU memory over the system I/O bus into the large page frame.]
- Data transfer is done at a small page granularity
- A page that is transferred is immediately ready to use

16 Mosaic: Data Allocation
[Diagram: (3) Transfer Data from CPU memory over the system I/O bus; (4) Transfer Done is reported to Contiguity-Conserving Allocation.]
18 Mosaic: Coalescing
[Diagram: (1) Contiguity-Conserving Allocation passes the list of large pages to the In-Place Coalescer. A fully-allocated large page frame is marked coalesceable.]
- Allocator sends the list of coalesceable pages to the In-Place Coalescer

19 Mosaic: Coalescing
[Diagram: (1) list of large pages sent to the In-Place Coalescer; (2) the In-Place Coalescer updates the page tables.]
- In-Place Coalescer has: List of coalesceable large pages
- Key Task: Perform coalescing without moving data
  - Simply need to update the page tables

20 Mosaic: Coalescing
[Diagram: (1) list of large pages sent to the In-Place Coalescer; (2) update page tables. The large page table carries a coalesced bit and sits alongside the small page table.]
- Application-transparent
  - Data can be accessed using either page size
  - No TLB flush
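A toy page-table model shows why coalescing needs no data movement: once all small pages of a region sit in place inside one frame, setting a coalesced bit is enough, and a walk through either path yields the same physical address. The `PageTable` class and its methods below are hypothetical; the real page-table format is only sketched on the slides.

```python
SMALL_PAGE = 4096
SMALL_PER_LARGE = 512

class PageTable:
    """Toy two-level view: small-page mappings plus a per-large-page coalesced bit."""

    def __init__(self):
        self.small = {}      # vpn -> physical small page number
        self.coalesced = {}  # virtual large page number -> frame, while the bit is set

    def map_small(self, vpn, ppn):
        self.small[vpn] = ppn

    def coalesce(self, vlpn, frame):
        """Set the coalesced bit if all 512 small pages already sit in place."""
        base_vpn = vlpn * SMALL_PER_LARGE
        base_ppn = frame * SMALL_PER_LARGE
        in_place = all(
            self.small.get(base_vpn + i) == base_ppn + i
            for i in range(SMALL_PER_LARGE)
        )
        if in_place:
            self.coalesced[vlpn] = frame  # page-table update only, no data copy
        return in_place

    def translate(self, vaddr):
        """Return (physical address, which mapping served the walk)."""
        vpn, offset = divmod(vaddr, SMALL_PAGE)
        vlpn, slot = divmod(vpn, SMALL_PER_LARGE)
        if vlpn in self.coalesced:
            ppn = self.coalesced[vlpn] * SMALL_PER_LARGE + slot
            return ppn * SMALL_PAGE + offset, "large"
        return self.small[vpn] * SMALL_PAGE + offset, "small"

pt = PageTable()
for i in range(SMALL_PER_LARGE):
    pt.map_small(3 * SMALL_PER_LARGE + i, 7 * SMALL_PER_LARGE + i)

vaddr = (3 * SMALL_PER_LARGE + 5) * SMALL_PAGE + 0x10
before = pt.translate(vaddr)
ok = pt.coalesce(vlpn=3, frame=7)
after = pt.translate(vaddr)
```

The small-page mappings are left untouched, which is one way to read the slide's point that data stays accessible through either page size.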
22 Mosaic: Data Deallocation
- Key Task: Free up not-fully-used large page frames
- Splinter pages
  - Break down a large page into small pages
- Compaction
  - Combine fragmented large page frames

23 Mosaic: Data Deallocation
[Diagram: (1) Application Deallocates Data; (2) Splinter Pages (reset the coalesced bit) in the page table for the affected large page frame.]
- Splinter only frames with deallocated pages
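Splintering on deallocation can be sketched in a few lines. The data layout (a dict of in-use slots per frame and a set of frames whose coalesced bit is set) and the function name `deallocate` are assumptions made for illustration only.

```python
def deallocate(allocated, coalesced, frame, slot):
    """Free one small page; splinter its frame if it was coalesced.

    allocated: dict frame -> set of in-use small-page slots
    coalesced: set of frames whose coalesced bit is currently set
    Returns True if this call splintered the frame.
    """
    allocated[frame].discard(slot)
    if frame in coalesced:
        coalesced.remove(frame)  # reset the coalesced bit
        return True
    return False

allocated = {0: set(range(512)), 1: set(range(512))}
coalesced = {0, 1}
splintered = deallocate(allocated, coalesced, frame=0, slot=17)
print(splintered, sorted(coalesced))
```

Only the frame that lost a page is splintered; frame 1 stays coalesced.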
24 Mosaic: Compaction
- Key Task: Free up not-fully-used large page frames
- Splinter pages
  - Break down a large page into small pages
- Compaction
  - Combine fragmented large page frames

25 Mosaic: Compaction
[Diagram: (1) Contiguity-Aware Compaction compacts pages across large page frames, producing free large pages; (2) the list of free pages is returned to the allocator.]
- Compaction decreases memory bloat
- Happens only when memory is highly fragmented

26 Mosaic: Compaction
- Once pages are compacted, they become non-coalesceable
  - No virtual contiguity
- Maximizes number of free large page frames
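One way to picture compaction is a greedy pass that drains the emptiest splintered frames into the fullest ones. The slides do not specify the policy, so the greedy ordering, the function name `compact`, and the occupancy-count representation are assumptions.

```python
def compact(occupancy, capacity=512):
    """Greedy compaction sketch over splintered large page frames.

    occupancy: dict frame -> number of small pages in use
    Returns (new occupancy, freed frames, number of small pages moved).
    """
    occ = dict(occupancy)
    # Fullest frames act as destinations, emptiest frames as sources.
    order = sorted(occ, key=lambda f: occ[f], reverse=True)
    lo, hi = 0, len(order) - 1
    moved = 0
    while lo < hi:
        dst, src = order[lo], order[hi]
        room = capacity - occ[dst]
        if room == 0:
            lo += 1          # destination is full, try the next one
            continue
        if occ[src] == 0:
            hi -= 1          # source is drained, try the next one
            continue
        n = min(room, occ[src])
        occ[dst] += n
        occ[src] -= n
        moved += n           # moved pages lose their virtual contiguity
    freed = [f for f in order if occ[f] == 0]
    return occ, freed, moved

occ, freed, moved = compact({0: 300, 1: 100, 2: 50, 3: 62})
print(occ, sorted(freed), moved)
```

In this example four partly used frames collapse into one full frame, freeing three large page frames at the cost of moving 212 small pages, which is why the slide restricts compaction to highly fragmented memory.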
28 Baseline: State-of-the-Art GPU Virtual Memory
[Diagram: four GPU cores, each with a private TLB, backed by a shared TLB, page table walkers, and the page table and data in GPU-side main memory; CPU memory sits on the CPU side.]

29 Methodology
- GPGPU-Sim (MAFIA) modeling GTX750 Ti
- 30 GPU cores
- Multiple GPGPU applications execute concurrently
- 64KB 4-way L1, 2048KB 16-way L2
- 64-entry L1 TLB, 1024-entry L2 TLB
- 8-entry large page L1 TLB, 64-entry large page L2 TLB
- 3GB main memory
- Model sequential page walks
- Model page tables and virtual-to-physical mapping
- CUDA-SDK, Rodinia, Parboil, LULESH, SHOC suites
- 235 total workloads evaluated
- Available at:

30 Comparison Points
- State-of-the-art CPU-GPU memory management
  - GPU-MMU based on [Power et al., HPCA'14]
  - Upside: Utilizes parallel page walks, TLB request coalescing and page walk cache to improve performance
  - Downside: Limited TLB reach
- Ideal TLB: Every TLB access is an L1 TLB hit

31 Performance
[Figure: weighted speedup of GPU-MMU, Mosaic, and Ideal TLB versus the number of concurrently-executing applications, for homogeneous and heterogeneous workloads. Labels in source order: Homogeneous 95.0%, 61.5%, 55.4%, 33.8%, 39.0%; Heterogeneous 21.4%, 31.5%, 43.1%, 23.7%.]
- Mosaic consistently improves performance across a wide variety of workloads
- Mosaic performs within 10% of the ideal TLB
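The slide does not define weighted speedup. The sketch below uses the usual multiprogrammed-workload definition, the sum over applications of IPC when sharing the GPU divided by IPC when running alone, and assumes that this is the metric plotted. The IPC values are made up for illustration.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of per-application slowdown-normalized IPC (standard definition)."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one shared and one alone IPC per application")
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

# Two hypothetical applications, each running at half of its standalone IPC.
ws = weighted_speedup(ipc_shared=[1.0, 2.0], ipc_alone=[2.0, 4.0])
print(ws)
```

A value equal to the number of applications would mean that sharing the GPU costs nothing; lower values reflect interference such as TLB contention.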
32 Other Results in the Paper
- TLB hit rate
  - Mosaic achieves average TLB hit rate of 99%
- Per-application IPC
  - 97% of all applications perform faster
- Sensitivity to different TLB sizes
  - Mosaic is effective for various TLB configurations
- Memory fragmentation analysis
  - Mosaic reduces memory fragmentation and improves performance regardless of the original fragmentation
- Performance with and without demand paging
34 Summary
- Problem: No single best page size for GPU virtual memory
  - Large pages: Better TLB reach
  - Small pages: Lower demand paging latency
- Our goal: Transparently enable both page sizes
- Key observations
  - Can easily coalesce an application's contiguously-allocated small pages into a large page
  - Interleaved memory allocation across applications breaks page contiguity
- Key idea: Preserve virtual address contiguity of small pages when allocating physical memory to simplify coalescing
- Mosaic is a hardware/software cooperative framework that:
  - Coalesces small pages into a large page without data movement
  - Enables the benefits of both small and large pages
- Key result: 55% average performance improvement over state-of-the-art GPU memory management mechanism
36 Backup Slides
37 Current Methods to Share GPUs
- Time sharing
  - Fine-grained context switching
  - Coarse-grained context switching
- Spatial sharing
  - NVIDIA GRID
  - Multi process service

38 Other Methods to Enforce Protection
- Segmented paging
- Static memory partitioning

39 TLB Flush
- With Mosaic, the contents in the page tables are the same
- TLB flush in Mosaic occurs when page table content is modified
  - This invalidates content in the TLB
  - Need to be flushed
- Both large and small page TLBs are flushed

40 Performance with Demand Paging
[Figure: normalized performance of GPU-MMU no Paging, GPU-MMU with Paging, and Mosaic with Paging, for homogeneous and heterogeneous workloads.]

41 In-Place Coalescer: Coalescing
- Key assumption: Soft guarantee
  - Large page range always contains pages of the same application
[Diagram: coalescing sets the large page bit in the L1 page table entry and the disabled bit in each of the corresponding L2 page table entries. Virtual address (VA) fields shown: PD, PT, PO, PO.]
- Q: How to access large page base entry?
- Benefit: No data movement
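The address fields in the diagram can be read as follows, assuming PD, PT, and PO stand for page directory index, page table index, and page offset, and assuming the 4KB and 2MB page sizes used elsewhere in the deck. Under that reading, a small-page walk uses PD, a 9-bit PT index, and a 12-bit offset, while a large-page walk stops at PD and uses a 21-bit offset. The function names are hypothetical, and everything above bit 21 is lumped into PD for simplicity.

```python
SMALL_OFFSET_BITS = 12   # 4KB page offset
PT_INDEX_BITS = 9        # 512 small pages per 2MB region
LARGE_OFFSET_BITS = SMALL_OFFSET_BITS + PT_INDEX_BITS  # 21 bits, 2MB page offset

def split_small(vaddr):
    """Return (PD, PT, PO) for a small-page walk."""
    po = vaddr & ((1 << SMALL_OFFSET_BITS) - 1)
    pt = (vaddr >> SMALL_OFFSET_BITS) & ((1 << PT_INDEX_BITS) - 1)
    pd = vaddr >> LARGE_OFFSET_BITS
    return pd, pt, po

def split_large(vaddr):
    """Return (PD, PO) for a large-page walk: PT and PO merge into one offset."""
    return vaddr >> LARGE_OFFSET_BITS, vaddr & ((1 << LARGE_OFFSET_BITS) - 1)

vaddr = (5 << 21) | (3 << 12) | 0x2A
small_fields = split_small(vaddr)
large_fields = split_large(vaddr)
print(small_fields, large_fields)
```

Both walks share the same PD field, so the entry holding the large page bit is reached with the upper bits of the address alone.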
42 In-Place Coalescer: Large Page Walk
- Large page index is available at leaf PTE
[Diagram: L1 page table entry with the large page bit set; the corresponding L2 page table entries with the disabled bit set after coalescing.]

43 Sample Application Pairs
[Figure: weighted speedup of GPU-MMU, Mosaic, and Ideal TLB for TLB-friendly and TLB-sensitive application pairs.]

44 TLB Hit Rate
[Figure: L1 and L2 TLB hit rate (0% to 100%) of GPU-MMU and Mosaic versus the number of concurrently-executing applications (1 to 5 apps).]

45 Pre-Fragmenting DRAM
[Figure: normalized performance of no CAC, CAC, CAC-BC, and CAC-Ideal versus fragmentation index (axis ticks include 50%, 70%, 90%, 95%, 97%, 100%).]

46 Page Occupancy Experiment
[Figure: normalized performance of no CAC, CAC, CAC-BC, and CAC-Ideal versus large page frame occupancy.]

47 Memory Bloat vs. GPU-MMU
[Figure: memory bloat of GPU-MMU and CAC versus page occupancy (1%, 10%, 25%, 35%, 50%, 75%).]

48 Individual Application IPC
[Figure: four panels of normalized performance of GPU-MMU, Mosaic, and Ideal-TLB versus sorted application number.]

49
[Figure: four panels of normalized performance of GPU-MMU and Mosaic versus TLB size: Per-SM L1 TLB Base Page Entries, Per-SM L1 TLB Large Page Entries, Shared L2 TLB Base Page Entries, Shared L2 TLB Large Page Entries.]

50 Mosaic: Putting Everything Together
[Diagram: in the GPU runtime, Application Demands Data and Application Deallocate Data feed Contiguity-Conserving Allocation, which exchanges the List of Large Pages and the List of Free Pages with the hardware In-Place Coalescer and Contiguity-Aware Compaction. Actions shown: Allocate Memory, Transfer Data over the system I/O bus, Transfer Done, Coalesce Pages, Splinter Pages, Compact Pages, all operating on the page table and data.]

51 Mosaic: Data Allocation
[Diagram: the allocation path only: Application Demands Data, Allocate Memory, Transfer Data over the system I/O bus, Transfer Done, List of Large Pages, Coalesce Pages.]

52 Mosaic: Data Deallocation
[Diagram: the deallocation path only: Application Deallocate Data, Splinter Pages, Compact Pages, List of Free Pages.]
More informationTable of Contents HOL EMT
Table of Contents Lab Overview - - Machine Learning Workloads in vsphere Using GPUs - Getting Started... 2 Lab Guidance... 3 Module 1 - Machine Learning Apps in vsphere VMs Using GPUs (15 minutes)...9
More informationDeveloping a GPU Processing Framework for Accelerating Remote Sensing Algorithms
19 October 2010 Research and Industrial Collaboration Conference Research to Reality Northeastern University, Boston, MA Developing a GPU Processing Framework for Accelerating Remote Sensing Algorithms
More informationParallel Storage and Retrieval of Pixmap Images
Parallel Storage and Retrieval of Pixmap Images Roger D. Hersch Ecole Polytechnique Federale de Lausanne Lausanne, Switzerland Abstract Professionals in various fields such as medical imaging, biology
More informationHigh Performance Computing for Engineers
High Performance Computing for Engineers David Thomas dt10@ic.ac.uk / https://github.com/m8pple Room 903 http://cas.ee.ic.ac.uk/people/dt10/teaching/2014/hpce HPCE / dt10/ 2015 / 0.1 High Performance Computing
More informationData acquisition and Trigger (with emphasis on LHC)
Lecture 2! Introduction! Data handling requirements for LHC! Design issues: Architectures! Front-end, event selection levels! Trigger! Upgrades! Conclusion Data acquisition and Trigger (with emphasis on
More informationNRC Workshop on NASA s Modeling, Simulation, and Information Systems and Processing Technology
NRC Workshop on NASA s Modeling, Simulation, and Information Systems and Processing Technology Bronson Messer Director of Science National Center for Computational Sciences & Senior R&D Staff Oak Ridge
More informationDynamic Routing and Spectrum Assignment in Brown-field Fixed/Flex Grid Optical Network. Tanjila Ahmed
Dynamic Routing and Spectrum Assignment in Brown-field Fixed/Flex Grid Optical Network Tanjila Ahmed Outline ØAbstract ØWhy we need flexible grid? ØChallenges to handle mixed grid ØExisting Solutions ØOur
More informationA Bypass First Policy for Energy-Efficient Last Level Caches
A Bypass First Policy for Energy-Efficient Last Level Caches Jason Jong Kyu Park University of Michigan Ann Arbor, MI, USA Email: jasonjk@umich.edu Yongjun Park Hongik University Seoul, Korea Email: yongjun.park@hongik.ac.kr
More informationBuilding Java Apps with ArcGIS Runtime SDK
Building Java Apps with ArcGIS Runtime SDK Vijay Gandhi, Elise Acheson, Eric Bader Demo Source code: https://github.com/esri/arcgis-runtime-samples-java/tree/master/devsummit-2014 Video Recording: http://video.esri.com
More informationThe Xbox One System on a Chip and Kinect Sensor
The Xbox One System on a Chip and Kinect Sensor John Sell, Patrick O Connor, Microsoft Corporation 1 Abstract The System on a Chip at the heart of the Xbox One entertainment console is one of the largest
More informationImage Processing Architectures (and their future requirements)
Lecture 16: Image Processing Architectures (and their future requirements) Visual Computing Systems Smart phone processing resources Example SoC: Qualcomm Snapdragon Image credit: Qualcomm Apple A7 (iphone
More informationMicroarchitectural Attacks and Defenses in JavaScript
Microarchitectural Attacks and Defenses in JavaScript Michael Schwarz, Daniel Gruss, Moritz Lipp 25.01.2018 www.iaik.tugraz.at 1 Michael Schwarz, Daniel Gruss, Moritz Lipp www.iaik.tugraz.at Microarchitecture
More informationRANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM
RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM Fengbin Tu, Weiwei Wu, Shouyi Yin, Leibo Liu, Shaojun Wei Institute of Microelectronics Tsinghua University The 45th International
More informationTable of Contents HOL ADV
Table of Contents Lab Overview - - Horizon 7.1: Graphics Acceleartion for 3D Workloads and vgpu... 2 Lab Guidance... 3 Module 1-3D Options in Horizon 7 (15 minutes - Basic)... 5 Introduction... 6 3D Desktop
More informationChapter 16 - Instruction-Level Parallelism and Superscalar Processors
Chapter 16 - Instruction-Level Parallelism and Superscalar Processors Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 16 - Superscalar Processors 1 / 78 Table of Contents I 1 Overview
More informationCAMEO: Continuous Analytics for Massively Multiplayer Online Games
CAMEO: Continuous Analytics for Massively Multiplayer Online Games Alexandru Iosup Parallel and Distributed Systems Group Delft University of Technology 1 MMOGs are a Popular, Growing Market 25,000,000
More informationCharacterizing and Improving the Performance of Intel Threading Building Blocks
Characterizing and Improving the Performance of Intel Threading Building Blocks Gilberto Contreras, Margaret Martonosi Princeton University IISWC 08 Motivation Chip Multiprocessors are the new computing
More informationHMD based VR Service Framework. July Web3D Consortium Kwan-Hee Yoo Chungbuk National University
HMD based VR Service Framework July 31 2017 Web3D Consortium Kwan-Hee Yoo Chungbuk National University khyoo@chungbuk.ac.kr What is Virtual Reality? Making an electronic world seem real and interactive
More informationApplication-Managed Flash Sungjin Lee, Ming Liu, Sangwoo Jun, Shuotao Xu, Jihong Kim and Arvind
Application-Managed Flash Sungjin Lee, Ming Liu, Sangwoo Jun, Shuotao Xu, Jihong Kim and Arvind Massachusetts Institute of Technology Seoul National University 14th USENIX Conference on File and Storage
More informationParallel Randomized Best-First Search
Parallel Randomized Best-First Search Yaron Shoham and Sivan Toledo School of Computer Science, Tel-Aviv Univsity http://www.tau.ac.il/ stoledo, http://www.tau.ac.il/ ysh Abstract. We describe a novel
More informationExperience Report on Developing a Software Communications Architecture (SCA) Core Framework. OMG SBC Workshop Arlington, Va.
Communication, Navigation, Identification and Reconnaissance Experience Report on Developing a Software Communications Architecture (SCA) Core Framework OMG SBC Workshop Arlington, Va. September, 2004
More informationUNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568
UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Computer Architecture ECE 568 Part 14 Improving Performance: Interleaving Israel Koren ECE568/Koren Part.14.1 Background Performance
More informationPerformance Metrics, Amdahl s Law
ecture 26 Computer Science 61C Spring 2017 March 20th, 2017 Performance Metrics, Amdahl s Law 1 New-School Machine Structures (It s a bit more complicated!) Software Hardware Parallel Requests Assigned
More informationECE473 Computer Architecture and Organization. Pipeline: Introduction
Computer Architecture and Organization Pipeline: Introduction Lecturer: Prof. Yifeng Zhu Fall, 2015 Portions of these slides are derived from: Dave Patterson UCB Lec 11.1 The Laundry Analogy Student A,
More informationComputer Architecture ( L), Fall 2017 HW 3: Branch handling and GPU SOLUTIONS
Computer Architecture (263-2210-00L), Fall 2017 HW 3: Branch handling and GPU SOLUTIONS Instructor: Prof. Onur Mutlu TAs: Hasan Hassan, Arash Tavakkol, Mohammad Sadr, Lois Orosa, Juan Gomez Luna Assigned:
More informationGC for interactive and real-time systems
GC for interactive and real-time systems Interactive or real-time app concerns Reducing length of garbage collection pause Demands guarantees for worst case performance Generational GC works if: Young
More informationExploring Heterogeneity within a Core for Improved Power Efficiency
Computer Engineering Exploring Heterogeneity within a Core for Improved Power Efficiency Sudarshan Srinivasan Nithesh Kurella Israel Koren Sandip Kundu May 2, 215 CE Tech Report # 6 Available at http://www.eng.biu.ac.il/segalla/computer-engineering-tech-reports/
More informationMLP-Aware Runahead Threads in a Simultaneous Multithreading Processor
MLP-Aware Runahead Threads in a Simultaneous Multithreading Processor Kenzo Van Craeynest, Stijn Eyerman, and Lieven Eeckhout Department of Electronics and Information Systems (ELIS), Ghent University,
More informationescience: Pulsar searching on GPUs
escience: Pulsar searching on GPUs Alessio Sclocco Ana Lucia Varbanescu Karel van der Veldt John Romein Joeri van Leeuwen Jason Hessels Rob van Nieuwpoort And many others! Netherlands escience center Science
More information