Supporting x86-64 Address Translation for 100s of GPU Lanes. Jason Power, Mark D. Hill, David A. Wood


Summary Challenge: CPUs and GPUs are physically integrated but logically separate, which squanders the bandwidth of integration and complicates programming. Contribution: a proof-of-concept GPU MMU design with 1. Per-CU TLBs, a highly-threaded page table walker (PTW), and a page walk cache 2. Full x86-64 support 3. Modest performance cost (2% slowdown vs. an ideal MMU)

Motivation GPGPUs are ever more closely integrated physically, yet the programming model remains decoupled. Today we have either a separate address space model or a unified virtual address model (NVIDIA UVA). What we want: a shared virtual address space (HSA hUMA).

Separate Address Space Memory must be explicitly allocated in the GPU address space, and data must be replicated there by copying. Pointer-based structures (trees, hash tables) must be transformed to use new pointers by the programmer.
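To illustrate the pointer-transformation burden the slide describes, here is a minimal sketch (all names are hypothetical, not from the paper): host pointers are meaningless in a separate GPU address space, so a pointer-based tree must be flattened into an index-based array before it can be copied over.

```python
# Sketch: pointers valid on the CPU cannot simply be copied to a separate
# GPU address space, so the tree is rewritten with array indices instead.

class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def flatten(root):
    """Convert a pointer-based tree into a flat list of
    [value, left_idx, right_idx] entries; -1 marks a missing child.
    Indices remain valid after the list is copied to another address space."""
    flat = []
    def rec(n):
        if n is None:
            return -1
        idx = len(flat)
        flat.append([n.value, -1, -1])
        flat[idx][1] = rec(n.left)
        flat[idx][2] = rec(n.right)
        return idx
    rec(root)
    return flat
```

With a shared virtual address space (the paper's goal), this entire step disappears: the original pointers are valid on both devices.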

Unified Virtual Addressing CPU and GPU addresses map 1-to-1. Advantage: the CUDA API can allocate host memory that the GPU accesses directly. Disadvantages: GPU accesses to host memory perform poorly, and using GPU memory still requires replication and pointer-structure transformation.

Shared Virtual Address Space Simplifies code, enables rich pointer-based data structures (trees, linked lists, etc.), and enables composability. Needed: a low-overhead MMU (memory management unit) for the GPU that supports CPU page tables (x86-64, 4 KB pages), including page faults, TLB flushes, TLB shootdowns, etc.
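Supporting x86-64 page tables means walking the standard 4-level structure. As a reminder of what the GPU's walker must handle, this sketch splits a 48-bit virtual address into the four 9-bit table indices and the 12-bit offset used with 4 KB pages (field names follow the conventional PML4/PDPT/PD/PT terminology):

```python
# Decompose a 48-bit x86-64 virtual address for a 4-level walk (4 KB pages):
# bits 47:39 index the PML4, 38:30 the PDPT, 29:21 the PD, 20:12 the PT,
# and bits 11:0 are the offset within the page.

def split_va(va):
    offset = va & 0xFFF
    pt     = (va >> 12) & 0x1FF
    pd     = (va >> 21) & 0x1FF
    pdpt   = (va >> 30) & 0x1FF
    pml4   = (va >> 39) & 0x1FF
    return pml4, pdpt, pd, pt, offset
```

Each of the four indices implies one memory reference during a walk, which is why the later designs work hard to reduce and overlap walks.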

Background Heterogeneous architecture overview: CPU cores and GPU compute units integrated on one chip, each side with its own L2 cache, sharing DRAM.

GPU Overview Each compute unit (CU) contains instruction fetch/decode, a register file, shared memory (scratchpad), and a coalescer; the CUs share a GPU L2 cache.

Data-driven GPU MMU design D0: CPU-like MMU D1: Post-coalescer MMU D2: D1 + highly-threaded page table walker D3: D2 + shared page walk cache

GPU MMU Design 0 Per-lane TLBs placed before the coalescer: this mimics a CPU MMU by treating every lane as an individual core, as in a multi-core design. Disadvantage: it ignores the data locality within a warp or a CTA, needlessly inflating translation bandwidth.
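The locality Design 0 wastes is easy to see with a toy coalescer model (line size and lane count below are illustrative assumptions, not the paper's parameters): 32 lanes reading consecutive words touch a single cache line, so one translation suffices where per-lane TLBs would perform 32.

```python
# Sketch: merge per-lane addresses into unique cache-line requests, as a
# coalescer does. A warp of 32 lanes reading consecutive 4-byte words maps
# to one 128-byte line, i.e. one post-coalescer translation instead of 32.

LINE = 128  # assumed cache-line size in bytes

def coalesce(addrs):
    """Return the sorted set of distinct line-aligned addresses."""
    return sorted({a // LINE * LINE for a in addrs})

lane_addrs = [0x1000 + 4 * lane for lane in range(32)]  # unit-stride warp
```

This is the motivation for Design 1: translate after the coalescer, where the request stream is already an order of magnitude thinner.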

GPU MMU Design 1 Placing translation after the scratchpad and the coalescer filters most traffic: relative to 1x accesses at the lanes, the figure shows the request rate falling to 0.45x and then 0.06x before it ever reaches the TLB.

GPU MMU Design 1 Per-CU post-coalescer TLBs, backed by a shared page walk unit with a page fault register, sitting in front of the L2.
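A minimal behavioral sketch of this organization, with sizes and the eviction policy as assumptions of the example: each CU's post-coalescer TLB caches translations, and a miss is forwarded to the shared walk unit (modeled here as a plain dictionary standing in for the page table).

```python
# Sketch of a per-CU post-coalescer TLB. On a miss the request goes to the
# shared page walk unit; here the "walk" is a dict lookup for simplicity.

PAGE = 4096  # 4 KB pages, as required for x86-64 compatibility

class TLB:
    def __init__(self, entries=64):          # capacity is an assumption
        self.entries = entries
        self.map = {}                        # VPN -> PPN
        self.misses = 0
    def translate(self, va, page_table):
        vpn, offset = va // PAGE, va % PAGE
        if vpn not in self.map:
            self.misses += 1                 # forwarded to shared walker
            if len(self.map) >= self.entries:
                self.map.pop(next(iter(self.map)))  # naive FIFO eviction
            self.map[vpn] = page_table[vpn]
        return self.map[vpn] * PAGE + offset
```

The model is enough to show the failure mode the next slides measure: when many misses arrive at once, the shared walker becomes the bottleneck.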

Performance Design 1 reaches only about 30% of ideal-MMU performance, and the result varies widely across workloads: bandwidth is used efficiently, but many workloads are sensitive to global memory latency.

Multiple outstanding page walks An average of 60 page table walks are active at the CUs, and the worst workload averages 140 concurrent walks. With a blocking page walker, miss latency skyrockets due to queuing delays.
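A back-of-envelope model makes the queuing effect concrete (cycle counts and the fixed-latency walk are illustrative assumptions): with a single blocking walker, the i-th concurrent request waits for all earlier ones, so average latency grows linearly with queue depth, while a 32-way threaded walker keeps it near the raw walk latency.

```python
# Sketch: average miss latency when n_requests walks arrive at once and the
# walk unit has walker_threads concurrent walkers, each walk taking
# walk_cycles. Request i starts in batch i // walker_threads.

def avg_latency(n_requests, walk_cycles, walker_threads):
    done = [(i // walker_threads + 1) * walk_cycles
            for i in range(n_requests)]
    return sum(done) / n_requests
```

With 64 requests and a 100-cycle walk, a blocking walker averages 3250 cycles per miss, while 32 walker threads average 150 — the gap that motivates Design 2's highly-threaded walker.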

GPU MMU Design 2 Design 1 plus a shared highly-threaded page table walker (32 threads), with page walk buffers and a page fault register in the shared page walk unit.

Performance Design 2 improves some workloads over Design 1; however, backprop, bfs, and nw do not improve.

High TLB miss rate The average TLB miss rate remains 29%.

GPU MMU Design 3 Design 2 plus a page walk cache in the shared page walk unit.
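The idea behind a page walk cache can be sketched in a few lines (the structure is simplified and the 4-reference/1-reference split is an assumption of the model): caching the upper levels of the walk means a hit only needs to fetch the leaf PTE, turning a 4-reference x86-64 walk into a single memory reference.

```python
# Sketch of a page walk cache: cache the PML4/PDPT/PD path of a walk so a
# subsequent walk that shares it fetches only the leaf PTE from memory.

class PageWalkCache:
    def __init__(self):
        self.cache = set()           # cached (pml4, pdpt, pd) index paths
    def walk_refs(self, pml4, pdpt, pd):
        """Return the number of memory references this walk performs,
        inserting the upper-level path into the cache on a miss."""
        key = (pml4, pdpt, pd)
        if key in self.cache:
            return 1                 # only the leaf PTE fetch
        self.cache.add(key)
        return 4                     # full PML4 -> PDPT -> PD -> PT walk
```

Because neighboring pages share their upper-level entries, even a small cache absorbs most of the walk traffic, which is how Design 3 closes the gap to the ideal MMU.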

Performance Design 3 averages only a 2% slowdown compared to an ideal MMU; the worst case is a 12% slowdown.

Conclusions Shared virtual memory is important, and a non-exotic MMU design suffices: post-coalescer TLBs, a highly-threaded page table walker, and a page walk cache together deliver full x86-64 compatibility with minimal overhead.