Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes


Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes
Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Jayneel Gandhi, Christopher J. Rossbach, Onur Mutlu

Executive Summary
Problem: no single best page size for GPU virtual memory
- Large pages: better TLB reach
- Small pages: lower demand paging latency
Our goal: transparently enable both page sizes
Key observations:
- An application's contiguously-allocated small pages can easily be coalesced into a large page
- Interleaved memory allocation across applications breaks page contiguity
Key idea: preserve the virtual address contiguity of small pages when allocating physical memory, to simplify coalescing
Mosaic is a hardware/software cooperative framework that:
- Coalesces small pages into a large page without data movement
- Enables the benefits of both small and large pages
Key result: 55% average performance improvement over the state-of-the-art GPU memory management mechanism

GPU Support for Virtual Memory
- Improves programmability with a unified address space
- Enables large data sets to be processed on the GPU
- Allows multiple applications to run on a GPU
- Virtual memory can enforce memory protection

State-of-the-Art Virtual Memory on GPUs
[Diagram: each GPU core has a private TLB, backed by a shared TLB and page table walkers; the page table and data live in GPU-side main memory, with CPU memory reached over a high-latency I/O bus]
Problems: limited TLB reach, high-latency page walks, high-latency I/O

Trade-Off with Page Size
- Larger pages: better TLB reach, but high demand paging latency
- Smaller pages: lower demand paging latency, but limited TLB reach

Trade-Off with Page Size
[Chart: normalized performance with small (4KB) vs. large (2MB) pages. With no paging overhead, large pages are 52% faster; with paging overhead, large pages are 93% slower]
Can we get the best of both page sizes?
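The TLB-reach side of this trade-off is simple arithmetic: reach equals the number of TLB entries times the page size. A quick illustrative sketch (the 64-entry figure is borrowed from the methodology slide purely for illustration; it is not the only configuration evaluated):

```python
# TLB reach = number of entries x page size.
def tlb_reach_bytes(entries: int, page_size: int) -> int:
    """Total memory the TLB can map at once."""
    return entries * page_size

KB, MB = 1024, 1024 * 1024

small = tlb_reach_bytes(64, 4 * KB)   # 64 entries of 4KB pages
large = tlb_reach_bytes(64, 2 * MB)   # 64 entries of 2MB pages

print(small // KB, "KB")   # 256 KB of reach with 4KB pages
print(large // MB, "MB")   # 128 MB of reach with 2MB pages: 512x more
```

The same entry budget covers 512 times more memory with 2MB pages, which is why large pages win whenever demand paging is not on the critical path.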

Outline
- Background
- Key challenges and our goal
- Mosaic
- Experimental evaluation
- Conclusions

Challenges with Multiple Page Sizes
[Diagram: over time, App 1 and App 2 allocations interleave across large page frames, so each frame ends up holding pages from both applications]
State-of-the-art GPU memory:
- Cannot coalesce without migrating multiple 4KB pages
- Need to search for which pages to coalesce

Desirable Allocation
[Diagram: over time, each large page frame receives pages from only one application, so App 1's and App 2's pages can each be coalesced without moving data]
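The two allocation timelines can be sketched with a toy allocator. Everything here is hypothetical (a frame holds 4 small pages instead of 512) and only illustrates the observation: reserving each large page frame for a single application keeps its small pages coalesceable, while interleaved allocation leaves no frame coalesceable without data migration:

```python
PAGES_PER_FRAME = 4  # toy value; a real 2MB frame holds 512 x 4KB pages

def allocate(requests, conserve_contiguity):
    """Place each small-page request (an app ID) into large page frames."""
    frames = []
    for app in requests:
        # Contiguity-conserving: only reuse a partially-filled frame that
        # already belongs to this app. Interleaved: reuse any open frame.
        open_frames = [f for f in frames if len(f) < PAGES_PER_FRAME
                       and (f[0] == app or not conserve_contiguity)]
        if open_frames:
            open_frames[0].append(app)
        else:
            frames.append([app])
    return frames

def coalesceable(frames):
    """Full, single-application frames can be coalesced without moving data."""
    return [f for f in frames
            if len(f) == PAGES_PER_FRAME and len(set(f)) == 1]

# Two apps allocating in an interleaved pattern, as on the timelines above.
requests = ["A", "B"] * 8
print(len(coalesceable(allocate(requests, conserve_contiguity=False))))  # 0
print(len(coalesceable(allocate(requests, conserve_contiguity=True))))   # 4
```

With interleaving, every frame mixes both apps and nothing can be coalesced in place; with the contiguity-conserving policy, every full frame is coalesceable.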

Our Goals
- High TLB reach
- Low demand paging latency
- Application transparency: programmers do not need to modify their applications

Outline
- Background
- Key challenges and our goal
- Mosaic
- Experimental evaluation
- Conclusions

Mosaic
[Diagram: Mosaic consists of Contiguity-Conserving Allocation and Contiguity-Aware Compaction in the GPU runtime, plus an In-Place Coalescer in hardware]

Outline
- Background
- Key challenges and our goal
- Mosaic
  - Contiguity-Conserving Allocation
  - In-Place Coalescer
  - Contiguity-Aware Compaction
- Experimental evaluation
- Conclusions

Mosaic: Data Allocation
[Diagram: (1) the application demands data; (2) the Contiguity-Conserving Allocator allocates memory within a large page frame and updates the page table]
Soft guarantee: a large page frame contains pages from only a single address space
- Conserves contiguity within the large page frame

Mosaic: Data Allocation
[Diagram: (3) data is transferred from CPU memory over the system I/O bus]
- Data transfer is done at a small page granularity
- A page that has been transferred is immediately ready to use

Mosaic: Data Allocation
[Diagram: (4) the runtime is notified once the data transfer is done]

Outline
- Background
- Key challenges and our goal
- Mosaic
  - Contiguity-Conserving Allocation
  - In-Place Coalescer
  - Contiguity-Aware Compaction
- Experimental evaluation
- Conclusions

Mosaic: Coalescing
[Diagram: (1) once a large page frame is fully allocated, it becomes coalesceable]
- The allocator sends the list of coalesceable pages to the In-Place Coalescer

Mosaic: Coalescing
[Diagram: (2) the In-Place Coalescer walks its list of coalesceable large pages and updates the page tables]
Key task: perform coalescing without moving data
- Simply need to update the page tables

Mosaic: Coalescing
[Diagram: the coalesced bit is set in the large page table entry, while the small page table entries are kept]
- Application-transparent
- Data can be accessed using either page size
- No TLB flush
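In-place coalescing amounts to flipping page-table metadata rather than copying pages. A minimal sketch of that idea, with invented field names (the real design keeps both a large and a small page table; here only a coalesced flag distinguishes the two access paths):

```python
from dataclasses import dataclass, field

SMALL = 4 * 1024
LARGE = 2 * 1024 * 1024
PAGES_PER_FRAME = LARGE // SMALL  # 512

@dataclass
class LargeFrameEntry:
    base_pa: int                  # physical base of the 2MB frame
    coalesced: bool = False       # the coalesced bit from the slide
    small_ptes: dict = field(default_factory=dict)  # page index -> small page PA

def coalesce(entry):
    """Coalescing without data movement: only page-table state changes."""
    entry.coalesced = True

def splinter(entry):
    """Deallocation path: reset the coalesced bit; small PTEs remain valid."""
    entry.coalesced = False

def translate(entry, offset):
    """Data stays accessible at either page size."""
    if entry.coalesced:
        return entry.base_pa + offset      # one large-page mapping
    page, off = divmod(offset, SMALL)
    return entry.small_ptes[page] + off    # fall back to small-page PTEs

# A contiguously-allocated frame: small page i sits at base + i*SMALL.
e = LargeFrameEntry(
    base_pa=0x40000000,
    small_ptes={i: 0x40000000 + i * SMALL for i in range(PAGES_PER_FRAME)})

before = translate(e, 5 * SMALL + 100)   # via small pages
coalesce(e)                              # no copy, no TLB flush needed
after = translate(e, 5 * SMALL + 100)    # via the large page
print(before == after)                   # True: identical physical address
```

Because the allocator preserved contiguity, the large-page and small-page paths resolve to the same physical address, which is why no TLB flush is needed when the bit flips.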

Outline
- Background
- Key challenges and our goal
- Mosaic
  - Contiguity-Conserving Allocation
  - In-Place Coalescer
  - Contiguity-Aware Compaction
- Experimental evaluation
- Conclusions

Mosaic: Data Deallocation
Key task: free up not-fully-used large page frames
- Splinter pages: break a large page down into small pages
- Compaction: combine fragmented large page frames

Mosaic: Data Deallocation
[Diagram: (1) the application deallocates data; (2) the affected large page frames are splintered by resetting the coalesced bit]
- Splinter only frames with deallocated pages

Mosaic: Compaction
Key task: free up not-fully-used large page frames
- Splinter pages: break a large page down into small pages
- Compaction: combine fragmented large page frames

Mosaic: Compaction
[Diagram: (1) the Contiguity-Aware Compaction unit compacts pages out of fragmented large page frames, freeing whole large pages; (2) the list of free pages is sent back to the allocator]
- Compaction decreases memory bloat
- Happens only when memory is highly fragmented
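Compaction itself can be sketched as a repacking pass. This toy version (invented names, 4 pages per frame for brevity) migrates the surviving small pages of fragmented frames into as few frames as possible, freeing whole large page frames at the cost of making the moved pages non-coalesceable:

```python
PAGES_PER_FRAME = 4  # toy value; a real 2MB frame holds 512 x 4KB pages

def compact(frames):
    """frames: list of lists of live small pages (None = deallocated slot).
    Returns (packed_frames, freed_frame_count). Packed frames lose virtual
    contiguity, so they are treated as non-coalesceable afterwards."""
    live = [p for f in frames for p in f if p is not None]
    packed = [live[i:i + PAGES_PER_FRAME]
              for i in range(0, len(live), PAGES_PER_FRAME)]
    return packed, len(frames) - len(packed)

# Three fragmented frames, each roughly half empty after deallocation.
fragmented = [["a0", None, "a1", None],
              ["a2", None, None, "a3"],
              [None, "a4", "a5", None]]
packed, freed = compact(fragmented)
print(packed)  # [['a0', 'a1', 'a2', 'a3'], ['a4', 'a5']]
print(freed)   # 1 large page frame freed for future coalescing
```

This is why the runtime invokes compaction only under high fragmentation: the freed large frames are worth the lost coalesceability of the relocated pages.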

Mosaic: Compaction
- Once pages are compacted, they become non-coalesceable (no virtual contiguity)
- Maximizes the number of free large page frames

Outline
- Background
- Key challenges and our goal
- Mosaic
  - Contiguity-Conserving Allocation
  - In-Place Coalescer
  - Contiguity-Aware Compaction
- Experimental evaluation
- Conclusions

Baseline: State-of-the-Art GPU Virtual Memory
[Diagram: private per-core TLBs backed by a shared TLB and page table walkers, with the page table and data in GPU-side main memory and CPU memory across the I/O bus]

Methodology
- GPGPU-Sim (MAFIA) modeling a GTX 750 Ti
  - 30 GPU cores; multiple GPGPU applications execute concurrently
  - 64KB 4-way L1 cache, 2048KB 16-way L2 cache
  - 64-entry L1 TLB, 1024-entry L2 TLB
  - 8-entry large-page L1 TLB, 64-entry large-page L2 TLB
  - 3GB main memory
- Models sequential page walks
- Models page tables and virtual-to-physical mapping
- CUDA-SDK, Rodinia, Parboil, LULESH, and SHOC suites
  - 235 total workloads evaluated
- Available at: https://github.com/cmu-safari/mosaic

Comparison Points
- GPU-MMU, the state-of-the-art CPU-GPU memory management design based on [Power et al., HPCA '14]
  - Upside: utilizes parallel page walks, TLB request coalescing, and a page walk cache to improve performance
  - Downside: limited TLB reach
- Ideal TLB: every TLB access is an L1 TLB hit

Performance
[Chart: weighted speedup of GPU-MMU, Mosaic, and Ideal TLB for 1-5 concurrently-executing applications. Mosaic's improvements over GPU-MMU are 95.0%, 61.5%, 55.4%, 33.8%, and 39.0% across the homogeneous workloads, and 21.4%, 31.5%, 43.1%, and 23.7% across the heterogeneous workloads]
- Mosaic consistently improves performance across a wide variety of workloads
- Mosaic performs within 10% of the Ideal TLB

Other Results in the Paper
- TLB hit rate: Mosaic achieves an average TLB hit rate of 99%
- Per-application IPC: 97% of all applications perform faster
- Sensitivity to different TLB sizes: Mosaic is effective for various TLB configurations
- Memory fragmentation analysis: Mosaic reduces memory fragmentation and improves performance regardless of the original fragmentation
- Performance with and without demand paging

Outline
- Background
- Key challenges and our goal
- Mosaic
  - Contiguity-Conserving Allocation
  - In-Place Coalescer
  - Contiguity-Aware Compaction
- Experimental evaluation
- Conclusions

Summary
Problem: no single best page size for GPU virtual memory
- Large pages: better TLB reach
- Small pages: lower demand paging latency
Our goal: transparently enable both page sizes
Key observations:
- An application's contiguously-allocated small pages can easily be coalesced into a large page
- Interleaved memory allocation across applications breaks page contiguity
Key idea: preserve the virtual address contiguity of small pages when allocating physical memory, to simplify coalescing
Mosaic is a hardware/software cooperative framework that:
- Coalesces small pages into a large page without data movement
- Enables the benefits of both small and large pages
Key result: 55% average performance improvement over the state-of-the-art GPU memory management mechanism

Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes
Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Jayneel Gandhi, Christopher J. Rossbach, Onur Mutlu

Backup Slides

Current Methods to Share GPUs
- Time sharing: fine-grained context switching, coarse-grained context switching
- Spatial sharing: NVIDIA GRID, Multi-Process Service

Other Methods to Enforce Protection
- Segmented paging
- Static memory partitioning

TLB Flush
- With Mosaic, coalescing leaves the contents of the page tables unchanged, so it requires no flush
- A TLB flush in Mosaic occurs only when page table content is modified, since that invalidates content cached in the TLB
- When a flush is needed, both the large-page and small-page TLBs are flushed

Performance with Demand Paging
[Chart: normalized performance of GPU-MMU without paging, GPU-MMU with paging, and Mosaic with paging, for homogeneous and heterogeneous workloads]

In-Place Coalescer: Coalescing
Key assumption (the soft guarantee): a large page range always contains pages of the same application
[Diagram: coalescing sets the large page bit in the L1 page table entry and the disabled bit in each L2 page table entry; the virtual address splits into page directory (PD), page table (PT), and page offset (PO) fields]
- Q: How to access the large page base entry?
- Benefit: no data movement

In-Place Coalescer: Large Page Walk
- The large page index is available at the leaf PTE
[Diagram: the same page-table structure as the previous slide, with the large page bit set at the L1 entry and disabled bits set at the L2 entries]
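Putting the two backup slides together: the virtual address splits into PD, PT, and PO fields, and a walker can short-circuit to a large-page translation when the large page (coalesced) bit is set at the L1 entry. A hedged sketch with illustrative parameters only (4KB pages, 512-entry leaf tables, a flat dictionary standing in for the page directory):

```python
SMALL_SHIFT, PT_BITS = 12, 9          # 4KB pages, 512-entry leaf tables
LARGE_SHIFT = SMALL_SHIFT + PT_BITS   # 2MB large pages

def split(va):
    """Split a VA into (PD index, PT index, page offset)."""
    po = va & ((1 << SMALL_SHIFT) - 1)
    pt = (va >> SMALL_SHIFT) & ((1 << PT_BITS) - 1)
    pd = va >> LARGE_SHIFT
    return pd, pt, po

def walk(page_dir, va):
    """page_dir: PD index -> (coalesced_bit, large_base, leaf_table)."""
    pd, pt, po = split(va)
    coalesced, large_base, leaf = page_dir[pd]
    if coalesced:
        # Large-page walk: the translation is resolved at the upper level
        # using the large page base plus the 21-bit large-page offset.
        return large_base + (va & ((1 << LARGE_SHIFT) - 1))
    return leaf[pt] + po  # ordinary small-page walk

# One coalesced 2MB region whose small pages are physically contiguous.
large_base = 0x200000
leaf = {i: large_base + (i << SMALL_SHIFT) for i in range(1 << PT_BITS)}
page_dir = {0: (True, large_base, leaf)}
va = (7 << SMALL_SHIFT) + 0x2A
print(hex(walk(page_dir, va)))  # 0x20702a, same as the small-page walk
```

Because contiguity was preserved at allocation time, clearing the coalesced bit (splintering) changes the walk length but not the resulting physical address.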

Sample Application Pairs
[Chart: weighted speedup of GPU-MMU, Mosaic, and Ideal TLB for a TLB-friendly and a TLB-sensitive application pair]

TLB Hit Rate
[Chart: L1 and L2 TLB hit rates of GPU-MMU and Mosaic for 1-5 concurrently-executing applications]

Pre-Fragmenting DRAM
[Chart: normalized performance of no CAC, CAC, CAC-BC, and CAC-Ideal as the fragmentation index varies from 30% to 100%]

Page Occupancy Experiment
[Chart: normalized performance of no CAC, CAC, CAC-BC, and CAC-Ideal across large page frame occupancies]

Memory Bloat vs. GPU-MMU
[Chart: memory bloat of GPU-MMU with 4KB pages and of CAC as page occupancy varies from 1% to 75%]

Individual Application IPC
[Charts: per-application normalized performance of GPU-MMU, Mosaic, and Ideal TLB, sorted by application, for the 2-, 3-, 4-, and 5-application workloads]

Sensitivity to TLB Sizes
[Charts: normalized performance of GPU-MMU and Mosaic as per-SM L1 TLB base page entries (8 to 256), per-SM L1 TLB large page entries (4 to 64), shared L2 TLB base page entries (64 to 4096), and shared L2 TLB large page entries (32 to 512) are varied]

Mosaic: Putting Everything Together
[Diagram: the complete flow. The application demands data; the Contiguity-Conserving Allocator allocates memory and transfers data over the system I/O bus; once a large page frame is fully allocated, the In-Place Coalescer coalesces its pages; when the application deallocates data, pages are splintered, and the Contiguity-Aware Compaction unit compacts pages and returns the list of free pages to the allocator]

Mosaic: Data Allocation
[Diagram: the allocation path. The application demands data; the allocator allocates memory and transfers data over the system I/O bus; the list of fully-allocated large pages is sent to the In-Place Coalescer to be coalesced]

Mosaic: Data Deallocation
[Diagram: the deallocation path. The application deallocates data; the affected pages are splintered; the Contiguity-Aware Compaction unit compacts pages and returns the list of free pages to the allocator]