Supporting x86-64 Address Translation for 100s of GPU Lanes
Jason Power, Mark D. Hill, David A. Wood
Summary. Challenge: CPUs and GPUs are physically integrated but logically separate, which forfeits much of the bandwidth benefit of integration and is inconvenient to program. Contribution: a proof-of-concept GPU MMU design with 1. per-CU TLBs, a highly-threaded page table walker (PTW), and a page walk cache; 2. full x86-64 support; 3. a modest performance cost (about 2% vs. an ideal MMU).
Motivation: GPGPUs are becoming more closely physically integrated with CPUs, yet the programming model remains decoupled. Options today: a separate address space, or NVIDIA's unified virtual addressing. Want: a shared virtual address space (HSA hUMA).
Separate Address Space: data must be explicitly allocated in the GPU address space and replicated by copying, and pointer-based structures (trees, hash tables) must be transformed to new pointers by the programmer (see the sketch below).
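As a concrete illustration of this burden, here is a minimal CUDA sketch (the HostNode/DevNode types and kernel name are hypothetical) in which a pointer-based list is flattened into indices, explicitly allocated on the GPU, and replicated with cudaMemcpy:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical host-side linked list node (pointer-based).
struct HostNode { int value; HostNode* next; };

// Device-friendly copy: pointers replaced by array indices,
// because host pointers are meaningless in the GPU address space.
struct DevNode { int value; int next; };   // next == -1 marks end of list

__global__ void sum_list(const DevNode* nodes, int head, int* out) {
    int sum = 0;
    for (int i = head; i >= 0; i = nodes[i].next) sum += nodes[i].value;
    *out = sum;
}

void run_on_gpu(HostNode* head) {
    // 1. Transform: flatten the pointer structure into index form.
    std::vector<DevNode> flat;
    for (HostNode* n = head; n != nullptr; n = n->next)
        flat.push_back({n->value, n->next ? (int)flat.size() + 1 : -1});

    // 2. Explicitly allocate GPU memory and replicate the data.
    DevNode* d_nodes; int* d_out;
    cudaMalloc(&d_nodes, flat.size() * sizeof(DevNode));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_nodes, flat.data(), flat.size() * sizeof(DevNode),
               cudaMemcpyHostToDevice);

    sum_list<<<1, 1>>>(d_nodes, 0, d_out);

    int sum;
    cudaMemcpy(&sum, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_nodes); cudaFree(d_out);
}
```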
Unified Virtual Addressing: host and GPU addresses map 1-to-1. Advantage: the CUDA API can allocate host memory that the GPU accesses directly. Disadvantages: GPU accesses to host memory perform poorly, and using GPU memory still requires replication and pointer transformation (see the zero-copy sketch below).
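A minimal zero-copy sketch of the UVA model, assuming mapped pinned memory via cudaHostAlloc (kernel name and sizes are illustrative): the GPU dereferences host memory directly, but every access pays the off-chip latency the disadvantage above refers to.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= f;
}

void run_zero_copy(int n) {
    // Allocate pinned, mapped host memory: with unified virtual addressing
    // the GPU can dereference it directly (no cudaMemcpy), but every access
    // crosses the off-chip interconnect, so it is slow.
    float* h_data;
    cudaHostAlloc((void**)&h_data, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float* d_view;  // device-visible alias of the same host buffer
    cudaHostGetDevicePointer((void**)&d_view, h_data, 0);

    scale<<<(n + 255) / 256, 256>>>(d_view, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFreeHost(h_data);
}
```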
Shared Virtual Address Space: simplifies code, enables rich pointer-based data structures (trees, linked lists, etc.), and enables composability (see the sketch below). Need: a low-overhead MMU (memory management unit) for the GPU that supports the CPU's page tables (x86-64), 4 KB pages, page faults, TLB flushes, TLB shootdowns, etc.
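A sketch of what the shared-address-space model enables. cudaMallocManaged is used here only as an approximation available in current CUDA; with full HSA-style shared virtual memory (the paper's target), an ordinary malloc'd structure would work the same way.

```cuda
#include <cuda_runtime.h>

struct Node { int value; Node* next; };

// With a shared virtual address space, the GPU dereferences the same
// pointers the CPU built; no flattening and no replication.
__global__ void sum_list(const Node* head, int* out) {
    int sum = 0;
    for (const Node* n = head; n != nullptr; n = n->next) sum += n->value;
    *out = sum;
}

void run_shared(int n) {
    // cudaMallocManaged approximates the shared-address-space model here;
    // with HSA-style SVM, plain malloc would suffice.
    Node* nodes; int* out;
    cudaMallocManaged(&nodes, n * sizeof(Node));
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < n; ++i) {          // build the list with real pointers
        nodes[i].value = i;
        nodes[i].next  = (i + 1 < n) ? &nodes[i + 1] : nullptr;
    }

    sum_list<<<1, 1>>>(nodes, out);
    cudaDeviceSynchronize();

    cudaFree(nodes); cudaFree(out);
}
```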
Background (figure): heterogeneous architecture overview. CPU cores and GPU compute units (CUs) integrated on one chip, each side with its own L2 cache, sharing DRAM.
GPU Overview (figure): each compute unit contains instruction fetch/decode, a memory-access coalescer, a register file, and shared memory (scratchpad); the compute units share a GPU L2 cache.
Data-driven GPU MMU design: D0: CPU-like MMU; D1: post-coalescer MMU; D2: D1 + highly-threaded page table walker; D3: D2 + shared page walk cache.
GPU MMU Design 0 (figure): mimic the CPU MMU approach by treating every lane as an individual core, as in a multi-core design, with per-lane TLBs placed before the coalescer. Disadvantage: this ignores the address locality within a warp or CTA, so redundant translations inflate translation bandwidth (see the toy sketch below).
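A toy sketch of why per-lane translation is wasteful: after coalescing, the 32 addresses of one warp usually fall on only one or two 4 KB pages, so a post-coalescer TLB sees far fewer lookups. This is illustrative code, not the paper's simulation model.

```cuda
#include <cstdint>
#include <set>

// Counts how many TLB lookups one warp's memory instruction generates
// before vs. after coalescing (4 KB pages).
constexpr uint64_t PAGE_SHIFT = 12;   // 4 KB pages

// Design 0: one lookup per lane, regardless of locality.
int lookups_per_lane(const uint64_t (&addr)[32]) { (void)addr; return 32; }

// Post-coalescer: one lookup per distinct page the warp touches.
int lookups_post_coalescer(const uint64_t (&addr)[32]) {
    std::set<uint64_t> pages;
    for (uint64_t a : addr) pages.insert(a >> PAGE_SHIFT);
    return (int)pages.size();   // often 1 or 2 for well-behaved access patterns
}
```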
GPU MMU Design 1 (motivation): place translation after the scratchpad and the coalescer, which filter most accesses; relative to 1x memory operations issued by the lanes, roughly 0.45x remain after the shared-memory (scratchpad) filter and only about 0.06x reach the TLB after coalescing.
GPU MMU Design 1 (figure): per-CU TLBs placed after the coalescers, backed by a shared page walk unit with a page fault register, in front of the L2 (a TLB sketch follows below).
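A minimal sketch of the per-CU post-coalescer TLB's role, modeled as a map from virtual page number (VPN) to physical page number (PPN); real hardware would be a small set-associative array, and the names here are illustrative.

```cuda
#include <cstdint>
#include <unordered_map>

// Illustrative per-CU post-coalescer TLB. Misses are forwarded to the
// shared page walk unit; a fault found there is posted in the page
// fault register for the OS to service.
struct CuTlb {
    std::unordered_map<uint64_t, uint64_t> vpn_to_ppn;

    bool translate(uint64_t vaddr, uint64_t* paddr) const {
        auto it = vpn_to_ppn.find(vaddr >> 12);          // 4 KB pages
        if (it == vpn_to_ppn.end()) return false;        // miss: go to page walk unit
        *paddr = (it->second << 12) | (vaddr & 0xFFF);
        return true;
    }
    void fill(uint64_t vaddr, uint64_t ppn) { vpn_to_ppn[vaddr >> 12] = ppn; }
};
```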
Performance: Design 1 achieves only about 30% of ideal-MMU performance, and results vary widely across workloads. Bandwidth is already used efficiently; the problem is that many workloads are sensitive to the added global-memory latency.
Multiple outstanding page walks: an average of 60 page table walks are active at a CU at any time, and the worst workload averages 140 concurrent walks; with a blocking page table walker, miss latency skyrockets due to queuing delays.
GPU MMU Design 2 (figure): the per-CU post-coalescer TLBs are backed by a shared page walk unit containing a highly-threaded page table walker (32 threads), page walk buffers, and a page fault register, in front of the L2. A sketch of the walk each walker thread performs follows below.
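For reference, one x86-64 page table walk is sketched below (4 KB pages, four levels, large pages ignored); the read_pte callback stands in for the memory reads the hardware walker issues, and Design 2 keeps up to 32 such walks in flight instead of blocking on one at a time.

```cuda
#include <cstdint>

// Illustrative model of one x86-64 page table walk:
// PML4 -> PDPT -> PD -> PT, up to four memory reads per TLB miss.
constexpr uint64_t PTE_PRESENT = 1ull;                  // entry-valid bit
constexpr uint64_t PTE_ADDR    = 0x000FFFFFFFFFF000ull; // physical address bits

// Returns the physical page frame for 'vaddr', or 0 on a page fault.
uint64_t walk(uint64_t cr3, uint64_t vaddr, uint64_t (*read_pte)(uint64_t)) {
    uint64_t table = cr3 & PTE_ADDR;            // root of the page table
    for (int level = 3; level >= 0; --level) {
        // Each level is indexed by 9 virtual-address bits (512 entries).
        uint64_t idx = (vaddr >> (12 + 9 * level)) & 0x1FF;
        uint64_t pte = read_pte(table + idx * 8);
        if (!(pte & PTE_PRESENT)) return 0;     // not mapped: raise page fault
        table = pte & PTE_ADDR;
    }
    return table;                               // physical frame of the 4 KB page
}
```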
Performance: Design 2 benefits some workloads compared to Design 1; however, backprop, bfs, and nw do not improve.
High TLB miss rate: the average miss rate is 29%.
GPU MMU Design 3 (figure): Design 2 plus a shared page walk cache in the page walk unit (per-CU post-coalescer TLBs, highly-threaded page table walker, page walk buffers, page fault register, page walk cache, L2). A cache sketch follows below.
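A sketch of the idea behind the page walk cache: remember non-leaf page table entries so that walks to nearby virtual addresses skip the upper-level memory reads. The walker would consult this cache before each read_pte in the walk sketch above; the keying scheme here is illustrative, not the paper's exact design.

```cuda
#include <cstdint>
#include <unordered_map>

// Illustrative page walk cache, keyed by (level, virtual-address prefix).
struct PageWalkCache {
    std::unordered_map<uint64_t, uint64_t> entries;

    static uint64_t key(int level, uint64_t vaddr) {
        uint64_t prefix = vaddr >> (12 + 9 * level);   // VA bits above this level
        return (prefix << 2) | (uint64_t)level;        // level fits in 2 bits
    }
    // Returns true and the cached entry if this level of the walk can be skipped.
    bool lookup(int level, uint64_t vaddr, uint64_t* pte) const {
        auto it = entries.find(key(level, vaddr));
        if (it == entries.end()) return false;
        *pte = it->second;
        return true;
    }
    void insert(int level, uint64_t vaddr, uint64_t pte) {
        entries[key(level, vaddr)] = pte;   // a real cache would also evict
    }
};
```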
Performance: Design 3 achieves only a 2% average slowdown compared to an ideal MMU; the worst case is a 12% slowdown.
Conclusions: shared virtual memory is important, and a non-exotic MMU design (post-coalescer TLBs, a highly-threaded page table walker, a page walk cache) delivers full compatibility with minimal overhead.