Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes

1 Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes
Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Jayneel Gandhi, Christopher J. Rossbach, Onur Mutlu

2 Executive Summary
- Problem: No single best page size for GPU virtual memory
  - Large pages: Better TLB reach
  - Small pages: Lower demand paging latency
- Our goal: Transparently enable both page sizes
- Key observations
  - Can easily coalesce an application's contiguously-allocated small pages into a large page
  - Interleaved memory allocation across applications breaks page contiguity
- Key idea: Preserve virtual address contiguity of small pages when allocating physical memory to simplify coalescing
- Mosaic is a hardware/software cooperative framework that:
  - Coalesces small pages into a large page without data movement
  - Enables the benefits of both small and large pages
- Key result: 55% average performance improvement over state-of-the-art GPU memory management mechanism

3 GPU Support for Virtual Memory
- Improves programmability with a unified address space
- Enables large data sets to be processed in the GPU
- Allows multiple applications to run on a GPU
- Virtual memory can enforce memory protection

4 State-of-the-Art Virtual Memory on GPUs
[Diagram: four GPU cores, each with a private TLB, in front of a shared TLB and the page table walkers. GPU-side memory holds the page table (main memory) and data (main memory); CPU-side memory is the CPU memory.]
- Limited TLB reach (private and shared TLBs)
- High latency page walks
- High latency I/O (between GPU-side and CPU-side memory)

5 Trade-Off with Page Size
- Larger pages:
  - Better TLB reach
  - High demand paging latency
- Smaller pages:
  - Lower demand paging latency
  - Limited TLB reach
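The trade-off can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes TLB reach is simply entries times page size, and it borrows the L1 TLB entry counts (64 base-page entries, 8 large-page entries) and the 4KB and 2MB page sizes that appear elsewhere in this deck. The function name `tlb_reach` is hypothetical.

```python
KB = 1024
MB = 1024 * KB

SMALL_PAGE = 4 * KB   # base page size used in the deck
LARGE_PAGE = 2 * MB   # large page size used in the deck


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory that a fully populated TLB can map."""
    return entries * page_size


# Reach of a 64-entry TLB holding 4KB pages vs. an 8-entry TLB holding 2MB pages.
small_reach = tlb_reach(64, SMALL_PAGE)
large_reach = tlb_reach(8, LARGE_PAGE)

# A demand-paged large page moves this many times more data than a small page.
transfer_ratio = LARGE_PAGE // SMALL_PAGE

print(small_reach // KB, "KB reach with small pages")
print(large_reach // MB, "MB reach with large pages")
print(transfer_ratio, "x more data per page fault with large pages")
```

Under these assumptions, eight large-page entries cover far more memory than 64 small-page entries, while each large-page fault has to move 512 times as much data over the I/O bus.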

6 Trade-Off with Page Size
[Figure: normalized performance of small (4KB) and large (2MB) pages, in two panels: No Paging Overhead and With Paging Overhead.]
Can we get the best of both page sizes?

7 Outline
- Background
- Key challenges and our goal
- Mosaic
- Experimental evaluations
- Conclusions

8 Challenges with Multiple Page Sizes
[Diagram: state-of-the-art GPU memory over time. App 1 and App 2 allocations alternate across Large Page Frames 1 to 5, followed by attempts to coalesce App 1's pages and App 2's pages. Legend: Unallocated, App 1, App 2.]
- Cannot coalesce (without migrating multiple 4KB pages)
- Need to search which pages to coalesce

9 Desirable Allocation
[Diagram: desirable behavior of GPU memory over time. The same alternating App 1 and App 2 allocations are placed across Large Page Frames 1 to 5, followed by coalescing App 1's pages and App 2's pages. Legend: Unallocated, App 1, App 2.]
- Can coalesce (without moving data)

10 Our Goals
- High TLB reach
- Low demand paging latency
- Application transparency
  - Programmers do not need to modify the applications

12 Mosaic
[Diagram: Mosaic's three components, drawn across the GPU Runtime and Hardware layers: Contiguity-Conserving Allocation, In-Place Coalescer, Contiguity-Aware Compaction.]

13 Outline
- Background
- Key challenges and our goal
- Mosaic
  - Contiguity-Conserving Allocation
  - In-Place Coalescer
  - Contiguity-Aware Compaction
- Experimental evaluations
- Conclusions

14 Mosaic: Data Allocation
[Diagram: (1) Application demands data. (2) Contiguity-Conserving Allocation allocates memory in a large page frame. The page table and data sit on the hardware side.]
- Soft guarantee: A large page frame contains pages from only a single address space
- Conserves contiguity within the large page frame

15 Mosaic: Data Allocation
[Diagram: (1) Application demands data. (2) Allocate memory. (3) Transfer data from CPU memory over the system I/O bus into the large page frame.]
- Data transfer is done at a small page granularity
- A page that is transferred is immediately ready to use

16 Mosaic: Data Allocation
[Diagram: (3) Transfer data from CPU memory over the system I/O bus into the large page frame. (4) Transfer done.]
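The allocation steps above can be sketched in a few lines. This is a minimal illustration of the soft guarantee, not the authors' implementation: the class `ContiguityConservingAllocator`, its fields, and the rule "small page i of a virtual large page goes into slot i of its frame" are assumptions made for the example.

```python
SMALL_PAGE = 4 * 1024
LARGE_PAGE = 2 * 1024 * 1024
PAGES_PER_FRAME = LARGE_PAGE // SMALL_PAGE  # 512 small pages per large page frame


class ContiguityConservingAllocator:
    """Hypothetical allocator: one large page frame per (app, virtual large page)."""

    def __init__(self, num_frames):
        self.free_frames = list(range(num_frames))
        self.frame_of = {}    # (app_id, virtual large page number) -> frame number
        self.used_slots = {}  # frame number -> set of small-page slots in use

    def allocate(self, app_id, vaddr):
        """Map the small page containing vaddr; return its physical base address."""
        vlarge, offset = divmod(vaddr, LARGE_PAGE)
        slot = offset // SMALL_PAGE
        key = (app_id, vlarge)
        if key not in self.frame_of:
            if not self.free_frames:
                raise MemoryError("no free large page frame")
            frame = self.free_frames.pop(0)
            self.frame_of[key] = frame      # frame now belongs to one address space
            self.used_slots[frame] = set()
        frame = self.frame_of[key]
        self.used_slots[frame].add(slot)
        # Same slot in the frame as in the virtual large page: contiguity is kept.
        return frame * LARGE_PAGE + slot * SMALL_PAGE

    def coalesceable_frames(self):
        """Frames whose 512 slots are all allocated."""
        return [f for f, s in self.used_slots.items() if len(s) == PAGES_PER_FRAME]


# Interleaved requests from two applications still land in separate frames.
alloc = ContiguityConservingAllocator(num_frames=4)
a0 = alloc.allocate("app1", 0x0000)
b0 = alloc.allocate("app2", 0x0000)
a1 = alloc.allocate("app1", 0x1000)
```

Because each frame is reserved for a single address space, the interleaved requests from `app1` and `app2` never share a frame, which is what later allows coalescing without data movement.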

18 Mosaic: Coalescing
[Diagram: (1) Contiguity-Conserving Allocation passes a list of large pages to the In-Place Coalescer. Of the two large page frames shown, the fully-allocated one is marked coalesceable.]
- Allocator sends the list of coalesceable pages to the In-Place Coalescer

19 Mosaic: Coalescing
[Diagram: (1) List of large pages arrives at the In-Place Coalescer. (2) The In-Place Coalescer updates the page tables.]
- In-Place Coalescer has: List of coalesceable large pages
- Key Task: Perform coalescing without moving data
  - Simply need to update the page tables

20 Mosaic: Coalescing
[Diagram: (1) List of large pages. (2) Update page tables: entries in the large page table carry a coalesced bit and sit above the small page table.]
- Application-transparent
- Data can be accessed using either page size
- No TLB flush
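A small model helps show why coalescing needs only a page table update. The sketch below is an assumption-laden illustration: `LargePageEntry`, `map_small`, `try_coalesce`, and `translate` are hypothetical names, and the page table is reduced to one large-page entry with a coalesced bit plus its 512 small-page entries.

```python
SMALL_PAGE = 4096
PAGES_PER_FRAME = 512
LARGE_PAGE = SMALL_PAGE * PAGES_PER_FRAME


class LargePageEntry:
    """Hypothetical large page table entry with its small page table."""

    def __init__(self, frame):
        self.frame = frame                          # physical large page frame number
        self.coalesced = False                      # the coalesced bit
        self.small_ptes = [None] * PAGES_PER_FRAME  # slot -> physical small page number


def map_small(entry, slot):
    """Contiguity-conserving placement: virtual slot i uses slot i of the frame."""
    entry.small_ptes[slot] = entry.frame * PAGES_PER_FRAME + slot


def try_coalesce(entry):
    """Set the coalesced bit if every small page is mapped. No data is copied."""
    if all(ppn is not None for ppn in entry.small_ptes):
        entry.coalesced = True
    return entry.coalesced


def translate(entry, offset):
    """Translate an offset within the virtual large page.

    Returns (physical address, page size used for the translation).
    """
    if entry.coalesced:
        return entry.frame * LARGE_PAGE + offset, LARGE_PAGE
    slot, page_offset = divmod(offset, SMALL_PAGE)
    ppn = entry.small_ptes[slot]
    if ppn is None:
        raise KeyError("page fault: small page not mapped")
    return ppn * SMALL_PAGE + page_offset, SMALL_PAGE


entry = LargePageEntry(frame=3)
for slot in range(PAGES_PER_FRAME):
    map_small(entry, slot)

before = translate(entry, 0x1234)   # resolved through the small page table
try_coalesce(entry)
after = translate(entry, 0x1234)    # resolved through the large page entry
```

In this model both translations yield the same physical address, so a translation cached before coalescing stays valid afterwards, which is consistent with the slide's "No TLB flush" claim.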

22 Mosaic: Data Deallocation
- Key Task: Free up not-fully-used large page frames
- Splinter pages: Break down a large page into small pages
- Compaction: Combine fragmented large page frames

23 Mosaic: Data Deallocation
[Diagram: (1) Application deallocates data. (2) Splinter pages (reset the coalesced bit) in the affected large page frame.]
- Splinter only frames with deallocated pages
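Splintering can be sketched as the inverse of coalescing: clearing one bit. The model below is hypothetical (`LargePageFrame` and `deallocate` are illustrative names) and assumes a frame starts out fully allocated and coalesced.

```python
PAGES_PER_FRAME = 512


class LargePageFrame:
    """Hypothetical bookkeeping for one fully allocated, coalesced frame."""

    def __init__(self):
        self.allocated = [True] * PAGES_PER_FRAME
        self.coalesced = True


def deallocate(frame, slot):
    """Free one small page. Returns True if this call splintered the frame."""
    frame.allocated[slot] = False
    if frame.coalesced:
        frame.coalesced = False  # splinter: reset the coalesced bit
        return True
    return False


touched = LargePageFrame()
untouched = LargePageFrame()
first = deallocate(touched, 7)    # first free in this frame splinters it
second = deallocate(touched, 8)   # frame is already splintered
```

Only the frame that lost a page is splintered; frames with no deallocated pages keep their coalesced bit.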

24 Mosaic: Compaction
- Key Task: Free up not-fully-used large page frames
- Splinter pages: Break down a large page into small pages
- Compaction: Combine fragmented large page frames

25 Mosaic: Compaction
[Diagram: (1) Contiguity-Aware Compaction compacts pages across the large page frames, leaving free large pages. (2) List of free pages.]
- Compaction decreases memory bloat
- Happens only when memory is highly fragmented

26 Mosaic: Compaction
- Once pages are compacted, they become non-coalesceable
  - No virtual contiguity
- Maximizes number of free large page frames
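One way to picture compaction is as draining the emptiest frames into the fullest ones. The sketch below is a simplified illustration, not the deck's algorithm: the function `compact`, the greedy emptiest-to-fullest order, and the assumption that all frames passed in are splintered and belong to one application are choices made for the example.

```python
def compact(frames, capacity):
    """Move pages out of sparsely used frames into fuller ones.

    frames: dict mapping frame id -> list of page labels (one application).
    capacity: number of small pages a large page frame can hold.
    Returns the ids of frames that became completely free.
    """
    # Non-empty frames, emptiest first.
    order = sorted((f for f in frames if frames[f]), key=lambda f: len(frames[f]))
    freed = []
    lo, hi = 0, len(order) - 1
    while lo < hi:
        src, dst = order[lo], order[hi]
        # Migrate pages until the source is empty or the destination is full.
        while frames[src] and len(frames[dst]) < capacity:
            frames[dst].append(frames[src].pop())
        if not frames[src]:
            freed.append(src)
            lo += 1
        if len(frames[dst]) >= capacity:
            hi -= 1
    return freed


# Toy frames that hold 4 pages each instead of 512.
frames = {0: ["a", "b", "c"], 1: ["d"], 2: ["e", "f"], 3: []}
freed = compact(frames, capacity=4)
```

In the toy run, frame 1 is emptied into frame 0, so one more large page frame becomes free. The migrated page no longer sits at the slot matching its virtual address, which is why compacted pages are non-coalesceable.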

28 Baseline: State-of-the-Art GPU Virtual Memory
[Diagram: four GPU cores, each with a private TLB, in front of a shared TLB and the page table walkers. GPU-side memory holds the page table (main memory) and data (main memory); CPU-side memory is the CPU memory.]

29 Methodology
- GPGPU-Sim (MAFIA) modeling GTX750 Ti
  - 30 GPU cores
  - Multiple GPGPU applications execute concurrently
  - 64KB 4-way L1, 2048KB 16-way L2
  - 64-entry L1 TLB, 1024-entry L2 TLB
  - 8-entry large page L1 TLB, 64-entry large page L2 TLB
  - 3GB main memory
- Model sequential page walks
- Model page tables and virtual-to-physical mapping
- CUDA-SDK, Rodinia, Parboil, LULESH, SHOC suites
  - 235 total workloads evaluated
- Available at:

30 Comparison Points
- State-of-the-art CPU-GPU memory management: GPU-MMU, based on [Power et al., HPCA '14]
  - Upside: Utilizes parallel page walks, TLB request coalescing, and a page walk cache to improve performance
  - Downside: Limited TLB reach
- Ideal TLB: Every TLB access is an L1 TLB hit

31 Performance
[Figure: weighted speedup of GPU-MMU, Mosaic, and Ideal TLB versus number of concurrently-executing applications. Homogeneous panel, data labels: 95.0%, 61.5%, 55.4%, 33.8%, 39.0%. Heterogeneous panel, data labels: 21.4%, 31.5%, 43.1%, 23.7%.]
- Mosaic consistently improves performance across a wide variety of workloads
- Mosaic performs within 10% of the ideal TLB
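The deck does not spell out the metric. A common definition of weighted speedup, assumed here, sums each application's IPC when sharing the GPU divided by its IPC when running alone. The function name and the sample IPC values below are made up for illustration.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum over applications of IPC when sharing / IPC when running alone."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one shared and one alone IPC per application")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


# Two hypothetical applications, each running at half its standalone IPC.
example = weighted_speedup([1.0, 0.5], [2.0, 1.0])
```

With this definition, n applications that run concurrently without slowing each other down would score n, so values closer to the application count indicate less interference.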

32 Other Results in the Paper
- TLB hit rate
  - Mosaic achieves average TLB hit rate of 99%
- Per-application IPC
  - 97% of all applications perform faster
- Sensitivity to different TLB sizes
  - Mosaic is effective for various TLB configurations
- Memory fragmentation analysis
  - Mosaic reduces memory fragmentation and improves performance regardless of the original fragmentation
- Performance with and without demand paging

34 Summary
- Problem: No single best page size for GPU virtual memory
  - Large pages: Better TLB reach
  - Small pages: Lower demand paging latency
- Our goal: Transparently enable both page sizes
- Key observations
  - Can easily coalesce an application's contiguously-allocated small pages into a large page
  - Interleaved memory allocation across applications breaks page contiguity
- Key idea: Preserve virtual address contiguity of small pages when allocating physical memory to simplify coalescing
- Mosaic is a hardware/software cooperative framework that:
  - Coalesces small pages into a large page without data movement
  - Enables the benefits of both small and large pages
- Key result: 55% average performance improvement over state-of-the-art GPU memory management mechanism

36 Backup Slides

37 Current Methods to Share GPUs
- Time sharing
  - Fine-grained context switching
  - Coarse-grained context switching
- Spatial sharing
  - NVIDIA GRID
  - Multi-Process Service

38 Other Methods to Enforce Protection
- Segmented paging
- Static memory partitioning

39 TLB Flush
- With Mosaic, the contents in the page tables are the same
- A TLB flush in Mosaic occurs when page table content is modified
  - The modification invalidates content in the TLB, so the TLB needs to be flushed
  - Both the large page and small page TLBs are flushed

40 Performance with Demand Paging
[Figure: normalized performance of GPU-MMU no Paging, GPU-MMU with Paging, and Mosaic with Paging, for Homogeneous and Heterogeneous workloads.]

41 In-Place Coalescer: Coalescing
- Key assumption: Soft guarantee
  - Large page range always contains pages of the same application
[Diagram: coalescing sets the large page bit in the L1 page table entry and sets the disabled bit in each of the corresponding L2 page table entries. The virtual address (VA) is shown split into PD, PT, and PO fields.]
- Q: How to access large page base entry?
- Benefit: No data movement

42 In-Place Coalescer: Large Page Walk
- Large page index is available at leaf PTE
[Diagram: coalescing sets the large page bit in the L1 page table entry and sets the disabled bit in each of the corresponding L2 page table entries.]
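The PD, PT, and PO fields can be illustrated by slicing an address. The sketch below assumes an x86-64-style layout (12-bit offset within a 4KB page, 9-bit page table index, remaining upper bits treated as the large page number); the deck does not give the exact bit widths, and `split_va` and `large_page_offset` are hypothetical helpers.

```python
PO_BITS = 12  # assumed: offset within a 4KB small page
PT_BITS = 9   # assumed: index of the small page within its 2MB large page


def split_va(va):
    """Split a virtual address into (PD, PT, PO) fields for a small-page walk."""
    po = va & ((1 << PO_BITS) - 1)
    pt = (va >> PO_BITS) & ((1 << PT_BITS) - 1)
    pd = va >> (PO_BITS + PT_BITS)
    return pd, pt, po


def large_page_offset(va):
    """Offset within the 2MB large page: the PT and PO fields taken together."""
    return va & ((1 << (PO_BITS + PT_BITS)) - 1)


pd, pt, po = split_va(0x603ABC)
offset = large_page_offset(0x603ABC)
```

Under this layout, the large page offset is just the PT and PO fields concatenated, so the same upper bits (PD) select the large page whether the walk ends at a coalesced entry or continues to the small page table.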

43 Sample Application Pairs
[Figure: weighted speedup of GPU-MMU, Mosaic, and Ideal TLB for sample application pairs, grouped into TLB-Friendly and TLB-Sensitive.]

44 TLB Hit Rate
[Figure: L1 and L2 TLB hit rate (0% to 100%) for GPU-MMU and Mosaic versus number of concurrently-executing applications (1 to 5 apps).]

45 Pre-Fragmenting DRAM
[Figure: normalized performance of no CAC, CAC, CAC-BC, and CAC-Ideal versus fragmentation index (recoverable ticks: 50%, 70%, 90%, 95%, 97%, 100%).]

46 Page Occupancy Experiment
[Figure: normalized performance of no CAC, CAC, CAC-BC, and CAC-Ideal versus large page frame occupancy.]

47 Memory Bloat
[Figure: memory bloat vs. GPU-MMU, plotted against page occupancy (1%, 10%, 25%, 35%, 50%, 75%). Legend entries as they survive in the source: "KB Page GPU-MMU" and "CAC".]

48 Individual Application IPC
[Figure: four panels of normalized performance versus sorted application number, each comparing GPU-MMU, Mosaic, and Ideal-TLB.]

49
[Figure: four panels of normalized performance for GPU-MMU and Mosaic, one per swept parameter: Per-SM L1 TLB Base Page Entries, Per-SM L1 TLB Large Page Entries, Shared L2 TLB Base Page Entries, Shared L2 TLB Large Page Entries.]

50 Mosaic: Putting Everything Together
[Diagram: the full Mosaic flow across the GPU Runtime and Hardware layers, with Contiguity-Conserving Allocation, In-Place Coalescer, and Contiguity-Aware Compaction operating on the page table and data.]
- Allocation path: Application Demands Data, Allocate Memory, Transfer Data (over the System I/O Bus), Transfer Done, Coalesce Pages (List of Large Pages)
- Deallocation path: Application Deallocates Data, Splinter Pages, Compact Pages (List of Free Pages)

51 Mosaic: Data Allocation
[Diagram: the allocation path only: Application Demands Data, Allocate Memory, Transfer Data (over the System I/O Bus), Transfer Done, Coalesce Pages (List of Large Pages), acting on the page table and data.]

52 Mosaic: Data Deallocation
[Diagram: the deallocation path only: Application Deallocates Data, Splinter Pages, Compact Pages (List of Free Pages), acting on the page table and data.]
