Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes


Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes
Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Jayneel Gandhi, Christopher J. Rossbach, Onur Mutlu

Executive Summary
Problem: no single best page size for GPU virtual memory
- Large pages: better TLB reach
- Small pages: lower demand paging latency
Our goal: transparently enable both page sizes
Key observations:
- An application's contiguously-allocated small pages can easily be coalesced into a large page
- Interleaved memory allocation across applications breaks page contiguity
Key idea: preserve the virtual address contiguity of small pages when allocating physical memory, to simplify coalescing
Mosaic is a hardware/software cooperative framework that:
- Coalesces small pages into a large page without data movement
- Enables the benefits of both small and large pages
Key result: 55% average performance improvement over the state-of-the-art GPU memory management mechanism

GPU Support for Virtual Memory
- Improves programmability with a unified address space
- Enables large data sets to be processed on the GPU
- Allows multiple applications to run on a GPU
- Virtual memory can enforce memory protection

State-of-the-Art Virtual Memory on GPUs
[Diagram: each GPU core has a private TLB, backed by a shared TLB and page table walkers; the page table and data live in GPU-side main memory, with CPU memory reached over a high-latency I/O bus]
Problems: limited TLB reach, high-latency page walks, high-latency I/O

Trade-Off with Page Size
- Larger pages: better TLB reach, but high demand paging latency
- Smaller pages: lower demand paging latency, but limited TLB reach

Trade-Off with Page Size
[Chart: normalized performance with small (4KB) vs. large (2MB) pages. With no paging overhead, large pages are 52% faster; with paging overhead, large pages are 93% slower]
Can we get the best of both page sizes?
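The TLB-reach side of this trade-off is simple arithmetic: reach equals the number of TLB entries times the page size. A quick illustrative sketch (the 64-entry figure is borrowed from the methodology slide purely for illustration; it is not the only configuration evaluated):

```python
# TLB reach = number of entries x page size.
def tlb_reach_bytes(entries: int, page_size: int) -> int:
    """Total memory the TLB can map at once."""
    return entries * page_size

KB, MB = 1024, 1024 * 1024

small = tlb_reach_bytes(64, 4 * KB)   # 64 entries of 4KB pages
large = tlb_reach_bytes(64, 2 * MB)   # 64 entries of 2MB pages

print(small // KB, "KB")   # 256 KB of reach with 4KB pages
print(large // MB, "MB")   # 128 MB of reach with 2MB pages: 512x more
```

The same entry budget covers 512 times more memory with 2MB pages, which is why large pages win whenever demand paging is not on the critical path.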

Outline
- Background
- Key challenges and our goal
- Mosaic
- Experimental evaluation
- Conclusions

Challenges with Multiple Page Sizes
[Diagram: over time, App 1 and App 2 allocations interleave across large page frames, so each frame ends up holding pages from both applications]
State-of-the-art GPU memory:
- Cannot coalesce without migrating multiple 4KB pages
- Need to search for which pages to coalesce

Desirable Allocation
[Diagram: over time, each large page frame receives pages from only one application, so App 1's and App 2's pages can each be coalesced without moving data]
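The two allocation timelines can be sketched with a toy allocator. Everything here is hypothetical (a frame holds 4 small pages instead of 512) and only illustrates the observation: reserving each large page frame for a single application keeps its small pages coalesceable, while interleaved allocation leaves no frame coalesceable without data migration:

```python
PAGES_PER_FRAME = 4  # toy value; a real 2MB frame holds 512 x 4KB pages

def allocate(requests, conserve_contiguity):
    """Place each small-page request (an app ID) into large page frames."""
    frames = []
    for app in requests:
        # Contiguity-conserving: only reuse a partially-filled frame that
        # already belongs to this app. Interleaved: reuse any open frame.
        open_frames = [f for f in frames if len(f) < PAGES_PER_FRAME
                       and (f[0] == app or not conserve_contiguity)]
        if open_frames:
            open_frames[0].append(app)
        else:
            frames.append([app])
    return frames

def coalesceable(frames):
    """Full, single-application frames can be coalesced without moving data."""
    return [f for f in frames
            if len(f) == PAGES_PER_FRAME and len(set(f)) == 1]

# Two apps allocating in an interleaved pattern, as on the timelines above.
requests = ["A", "B"] * 8
print(len(coalesceable(allocate(requests, conserve_contiguity=False))))  # 0
print(len(coalesceable(allocate(requests, conserve_contiguity=True))))   # 4
```

With interleaving, every frame mixes both apps and nothing can be coalesced in place; with the contiguity-conserving policy, every full frame is coalesceable.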

Our Goals
- High TLB reach
- Low demand paging latency
- Application transparency: programmers do not need to modify their applications

Outline
- Background
- Key challenges and our goal
- Mosaic
- Experimental evaluation
- Conclusions

Mosaic
[Diagram: Mosaic consists of Contiguity-Conserving Allocation and Contiguity-Aware Compaction in the GPU runtime, plus an In-Place Coalescer in hardware]

Outline
- Background
- Key challenges and our goal
- Mosaic
  - Contiguity-Conserving Allocation
  - In-Place Coalescer
  - Contiguity-Aware Compaction
- Experimental evaluation
- Conclusions

Mosaic: Data Allocation
[Diagram: (1) the application demands data; (2) the Contiguity-Conserving Allocator allocates memory within a large page frame and updates the page table]
Soft guarantee: a large page frame contains pages from only a single address space
- Conserves contiguity within the large page frame

Mosaic: Data Allocation
[Diagram: (3) data is transferred from CPU memory over the system I/O bus]
- Data transfer is done at a small page granularity
- A page that has been transferred is immediately ready to use

Mosaic: Data Allocation
[Diagram: (4) the runtime is notified once the data transfer is done]

Outline
- Background
- Key challenges and our goal
- Mosaic
  - Contiguity-Conserving Allocation
  - In-Place Coalescer
  - Contiguity-Aware Compaction
- Experimental evaluation
- Conclusions

Mosaic: Coalescing
[Diagram: (1) once a large page frame is fully allocated, it becomes coalesceable]
- The allocator sends the list of coalesceable pages to the In-Place Coalescer

Mosaic: Coalescing
[Diagram: (2) the In-Place Coalescer walks its list of coalesceable large pages and updates the page tables]
Key task: perform coalescing without moving data
- Simply need to update the page tables

Mosaic: Coalescing
[Diagram: the coalesced bit is set in the large page table entry, while the small page table entries are kept]
- Application-transparent
- Data can be accessed using either page size
- No TLB flush
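In-place coalescing amounts to flipping page-table metadata rather than copying pages. A minimal sketch of that idea, with invented field names (the real design keeps both a large and a small page table; here only a coalesced flag distinguishes the two access paths):

```python
from dataclasses import dataclass, field

SMALL = 4 * 1024
LARGE = 2 * 1024 * 1024
PAGES_PER_FRAME = LARGE // SMALL  # 512

@dataclass
class LargeFrameEntry:
    base_pa: int                  # physical base of the 2MB frame
    coalesced: bool = False       # the coalesced bit from the slide
    small_ptes: dict = field(default_factory=dict)  # page index -> small page PA

def coalesce(entry):
    """Coalescing without data movement: only page-table state changes."""
    entry.coalesced = True

def splinter(entry):
    """Deallocation path: reset the coalesced bit; small PTEs remain valid."""
    entry.coalesced = False

def translate(entry, offset):
    """Data stays accessible at either page size."""
    if entry.coalesced:
        return entry.base_pa + offset      # one large-page mapping
    page, off = divmod(offset, SMALL)
    return entry.small_ptes[page] + off    # fall back to small-page PTEs

# A contiguously-allocated frame: small page i sits at base + i*SMALL.
e = LargeFrameEntry(
    base_pa=0x40000000,
    small_ptes={i: 0x40000000 + i * SMALL for i in range(PAGES_PER_FRAME)})

before = translate(e, 5 * SMALL + 100)   # via small pages
coalesce(e)                              # no copy, no TLB flush needed
after = translate(e, 5 * SMALL + 100)    # via the large page
print(before == after)                   # True: identical physical address
```

Because the allocator preserved contiguity, the large-page and small-page paths resolve to the same physical address, which is why no TLB flush is needed when the bit flips.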

Outline
- Background
- Key challenges and our goal
- Mosaic
  - Contiguity-Conserving Allocation
  - In-Place Coalescer
  - Contiguity-Aware Compaction
- Experimental evaluation
- Conclusions

Mosaic: Data Deallocation
Key task: free up not-fully-used large page frames
- Splinter pages: break a large page down into small pages
- Compaction: combine fragmented large page frames

Mosaic: Data Deallocation
[Diagram: (1) the application deallocates data; (2) the affected large page frames are splintered by resetting the coalesced bit]
- Splinter only frames with deallocated pages

Mosaic: Compaction
Key task: free up not-fully-used large page frames
- Splinter pages: break a large page down into small pages
- Compaction: combine fragmented large page frames

Mosaic: Compaction
[Diagram: (1) the Contiguity-Aware Compaction unit compacts pages out of fragmented large page frames, freeing whole large pages; (2) the list of free pages is sent back to the allocator]
- Compaction decreases memory bloat
- Happens only when memory is highly fragmented
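Compaction itself can be sketched as a repacking pass. This toy version (invented names, 4 pages per frame for brevity) migrates the surviving small pages of fragmented frames into as few frames as possible, freeing whole large page frames at the cost of making the moved pages non-coalesceable:

```python
PAGES_PER_FRAME = 4  # toy value; a real 2MB frame holds 512 x 4KB pages

def compact(frames):
    """frames: list of lists of live small pages (None = deallocated slot).
    Returns (packed_frames, freed_frame_count). Packed frames lose virtual
    contiguity, so they are treated as non-coalesceable afterwards."""
    live = [p for f in frames for p in f if p is not None]
    packed = [live[i:i + PAGES_PER_FRAME]
              for i in range(0, len(live), PAGES_PER_FRAME)]
    return packed, len(frames) - len(packed)

# Three fragmented frames, each roughly half empty after deallocation.
fragmented = [["a0", None, "a1", None],
              ["a2", None, None, "a3"],
              [None, "a4", "a5", None]]
packed, freed = compact(fragmented)
print(packed)  # [['a0', 'a1', 'a2', 'a3'], ['a4', 'a5']]
print(freed)   # 1 large page frame freed for future coalescing
```

This is why the runtime invokes compaction only under high fragmentation: the freed large frames are worth the lost coalesceability of the relocated pages.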

Mosaic: Compaction
- Once pages are compacted, they become non-coalesceable (no virtual contiguity)
- Maximizes the number of free large page frames

Outline
- Background
- Key challenges and our goal
- Mosaic
  - Contiguity-Conserving Allocation
  - In-Place Coalescer
  - Contiguity-Aware Compaction
- Experimental evaluation
- Conclusions

Baseline: State-of-the-Art GPU Virtual Memory
[Diagram: private per-core TLBs backed by a shared TLB and page table walkers, with the page table and data in GPU-side main memory and CPU memory across the I/O bus]

Methodology
- GPGPU-Sim (MAFIA) modeling a GTX 750 Ti
  - 30 GPU cores; multiple GPGPU applications execute concurrently
  - 64KB 4-way L1 cache, 2048KB 16-way L2 cache
  - 64-entry L1 TLB, 1024-entry L2 TLB
  - 8-entry large-page L1 TLB, 64-entry large-page L2 TLB
  - 3GB main memory
- Models sequential page walks
- Models page tables and virtual-to-physical mapping
- CUDA-SDK, Rodinia, Parboil, LULESH, and SHOC suites
  - 235 total workloads evaluated
- Available at: https://github.com/cmu-safari/mosaic

Comparison Points
- GPU-MMU, the state-of-the-art CPU-GPU memory management design based on [Power et al., HPCA '14]
  - Upside: utilizes parallel page walks, TLB request coalescing, and a page walk cache to improve performance
  - Downside: limited TLB reach
- Ideal TLB: every TLB access is an L1 TLB hit

Performance
[Chart: weighted speedup of GPU-MMU, Mosaic, and Ideal TLB for 1-5 concurrently-executing applications. Mosaic's improvements over GPU-MMU are 95.0%, 61.5%, 55.4%, 33.8%, and 39.0% across the homogeneous workloads, and 21.4%, 31.5%, 43.1%, and 23.7% across the heterogeneous workloads]
- Mosaic consistently improves performance across a wide variety of workloads
- Mosaic performs within 10% of the Ideal TLB

Other Results in the Paper
- TLB hit rate: Mosaic achieves an average TLB hit rate of 99%
- Per-application IPC: 97% of all applications perform faster
- Sensitivity to different TLB sizes: Mosaic is effective for various TLB configurations
- Memory fragmentation analysis: Mosaic reduces memory fragmentation and improves performance regardless of the original fragmentation
- Performance with and without demand paging

Outline
- Background
- Key challenges and our goal
- Mosaic
  - Contiguity-Conserving Allocation
  - In-Place Coalescer
  - Contiguity-Aware Compaction
- Experimental evaluation
- Conclusions

Summary
Problem: no single best page size for GPU virtual memory
- Large pages: better TLB reach
- Small pages: lower demand paging latency
Our goal: transparently enable both page sizes
Key observations:
- An application's contiguously-allocated small pages can easily be coalesced into a large page
- Interleaved memory allocation across applications breaks page contiguity
Key idea: preserve the virtual address contiguity of small pages when allocating physical memory, to simplify coalescing
Mosaic is a hardware/software cooperative framework that:
- Coalesces small pages into a large page without data movement
- Enables the benefits of both small and large pages
Key result: 55% average performance improvement over the state-of-the-art GPU memory management mechanism

Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes
Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Jayneel Gandhi, Christopher J. Rossbach, Onur Mutlu

Backup Slides

Current Methods to Share GPUs
- Time sharing: fine-grained context switching, coarse-grained context switching
- Spatial sharing: NVIDIA GRID, Multi-Process Service

Other Methods to Enforce Protection
- Segmented paging
- Static memory partitioning

TLB Flush
- With Mosaic, coalescing leaves the contents of the page tables unchanged, so it requires no flush
- A TLB flush in Mosaic occurs only when page table content is modified, since that invalidates content cached in the TLB
- When a flush is needed, both the large-page and small-page TLBs are flushed

Performance with Demand Paging
[Chart: normalized performance of GPU-MMU without paging, GPU-MMU with paging, and Mosaic with paging, for homogeneous and heterogeneous workloads]

In-Place Coalescer: Coalescing
Key assumption (the soft guarantee): a large page range always contains pages of the same application
[Diagram: coalescing sets the large page bit in the L1 page table entry and the disabled bit in each L2 page table entry; the virtual address splits into page directory (PD), page table (PT), and page offset (PO) fields]
- Q: How to access the large page base entry?
- Benefit: no data movement

In-Place Coalescer: Large Page Walk
- The large page index is available at the leaf PTE
[Diagram: the same page-table structure as the previous slide, with the large page bit set at the L1 entry and disabled bits set at the L2 entries]
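Putting the two backup slides together: the virtual address splits into PD, PT, and PO fields, and a walker can short-circuit to a large-page translation when the large page (coalesced) bit is set at the L1 entry. A hedged sketch with illustrative parameters only (4KB pages, 512-entry leaf tables, a flat dictionary standing in for the page directory):

```python
SMALL_SHIFT, PT_BITS = 12, 9          # 4KB pages, 512-entry leaf tables
LARGE_SHIFT = SMALL_SHIFT + PT_BITS   # 2MB large pages

def split(va):
    """Split a VA into (PD index, PT index, page offset)."""
    po = va & ((1 << SMALL_SHIFT) - 1)
    pt = (va >> SMALL_SHIFT) & ((1 << PT_BITS) - 1)
    pd = va >> LARGE_SHIFT
    return pd, pt, po

def walk(page_dir, va):
    """page_dir: PD index -> (coalesced_bit, large_base, leaf_table)."""
    pd, pt, po = split(va)
    coalesced, large_base, leaf = page_dir[pd]
    if coalesced:
        # Large-page walk: the translation is resolved at the upper level
        # using the large page base plus the 21-bit large-page offset.
        return large_base + (va & ((1 << LARGE_SHIFT) - 1))
    return leaf[pt] + po  # ordinary small-page walk

# One coalesced 2MB region whose small pages are physically contiguous.
large_base = 0x200000
leaf = {i: large_base + (i << SMALL_SHIFT) for i in range(1 << PT_BITS)}
page_dir = {0: (True, large_base, leaf)}
va = (7 << SMALL_SHIFT) + 0x2A
print(hex(walk(page_dir, va)))  # 0x20702a, same as the small-page walk
```

Because contiguity was preserved at allocation time, clearing the coalesced bit (splintering) changes the walk length but not the resulting physical address.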

Sample Application Pairs
[Chart: weighted speedup of GPU-MMU, Mosaic, and Ideal TLB for a TLB-friendly and a TLB-sensitive application pair]

TLB Hit Rate
[Chart: L1 and L2 TLB hit rates of GPU-MMU and Mosaic for 1-5 concurrently-executing applications]

Pre-Fragmenting DRAM
[Chart: normalized performance of no CAC, CAC, CAC-BC, and CAC-Ideal as the fragmentation index varies from 30% to 100%]

Page Occupancy Experiment
[Chart: normalized performance of no CAC, CAC, CAC-BC, and CAC-Ideal across large page frame occupancies]

Memory Bloat vs. GPU-MMU
[Chart: memory bloat of GPU-MMU with 4KB pages and of CAC as page occupancy varies from 1% to 75%]

Individual Application IPC
[Charts: per-application normalized performance of GPU-MMU, Mosaic, and Ideal TLB, sorted by application, for the 2-, 3-, 4-, and 5-application workloads]

Sensitivity to TLB Sizes
[Charts: normalized performance of GPU-MMU and Mosaic as per-SM L1 TLB base page entries (8 to 256), per-SM L1 TLB large page entries (4 to 64), shared L2 TLB base page entries (64 to 4096), and shared L2 TLB large page entries (32 to 512) are varied]

Mosaic: Putting Everything Together
[Diagram: the complete flow. The application demands data; the Contiguity-Conserving Allocator allocates memory and transfers data over the system I/O bus; once a large page frame is fully allocated, the In-Place Coalescer coalesces its pages; when the application deallocates data, pages are splintered, and the Contiguity-Aware Compaction unit compacts pages and returns the list of free pages to the allocator]

Mosaic: Data Allocation
[Diagram: the allocation path. The application demands data; the allocator allocates memory and transfers data over the system I/O bus; the list of fully-allocated large pages is sent to the In-Place Coalescer to be coalesced]

Mosaic: Data Deallocation
[Diagram: the deallocation path. The application deallocates data; the affected pages are splintered; the Contiguity-Aware Compaction unit compacts pages and returns the list of free pages to the allocator]