
CUDA Threads
Bedrich Benes, Ph.D.
Purdue University, Department of Computer Graphics

Terminology
Streaming Multiprocessor (SM): an SM processes blocks of threads.
Streaming Processor (SP), also called a CUDA core: an SP processes the threads belonging to a block.

How it works
1) The grid is launched.
2) Blocks are assigned to streaming multiprocessors (SMs) on a block-by-block basis, in arbitrary order. This is what allows scalability; each SM can process more than one block.
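As a concrete sketch (the kernel, array, and sizes are illustrative, not from the slides), a grid launch and the guard for the last, partially filled block look like this; once launched, the hardware distributes the blocks over the SMs in arbitrary order:

    #include <cuda_runtime.h>

    // Illustrative kernel: each thread scales one array element.
    __global__ void scaleKernel(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)                                       // guard the last, partially filled block
            data[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_data = nullptr;
        cudaMalloc(&d_data, n * sizeof(float));

        // 1) The grid is launched ...
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scaleKernel<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);

        // 2) ... and the hardware assigns the blocks to SMs, block by block,
        //    in arbitrary order; the code cannot rely on any block ordering.
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }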

How it works (continued)
3) An assigned block is partitioned into warps, and their execution is interleaved.
4) Warps are assigned to the SM (one thread to one SP).
5) A warp can be stalled if it is idle for some reason (e.g., waiting for memory).

Basic considerations
- The size of a block is limited to 512 threads, e.g. blockDim(512,1,1), blockDim(8,16,2), or blockDim(16,16,2).
- A kernel can be launched with a grid of up to 65,535 x 65,535 blocks.

G80 architecture
- 16 SMs, each can process 8 blocks or 768 threads
- max 16 x 8 = 128 CUDA cores (SPs)
- max 16 x 768 = 12,288 threads

GT200 architecture
- 30 SMs, each can process 8 blocks or 1,024 threads
- max 30 x 8 = 240 CUDA cores (SPs)
- max 30 x 1,024 = 30,720 threads
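A small host-side sketch (compiled as a .cu file; the block shapes are the ones quoted above and the 512-thread limit is the G80/GT200 value) that checks whether a given blockDim fits within the per-block thread limit:

    #include <cstdio>
    #include <cuda_runtime.h>   // for dim3

    int main()
    {
        // The three block shapes from the slide.
        dim3 shapes[] = { dim3(512, 1, 1), dim3(8, 16, 2), dim3(16, 16, 2) };
        const unsigned limit = 512;   // per-block thread limit on G80/GT200

        for (const dim3 &b : shapes) {
            unsigned threads = b.x * b.y * b.z;
            printf("(%u,%u,%u) -> %u threads, %s\n",
                   b.x, b.y, b.z, threads,
                   threads <= limit ? "fits in a block" : "exceeds the limit");
        }
        return 0;
    }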

GT200 architecture
- 30,720 threads max, 240 CUDA cores
- One SM limit: 1,024 threads, e.g. 4 blocks of 256 or 8 blocks of 128
- One block limit: 512 threads, e.g. 2 blocks of 256 or 8 blocks of 64
(Images: Nvidia)

GT400 (Fermi) block assignment
- 16 SMs, each can process 8 blocks
- 1 SM has 32 CUDA cores, 512 CUDA cores in total
- 16 KB or 48 KB of L1 cache per SM
- can issue instructions from two different warps at once (dual warp scheduler)
- if more blocks are launched than the SMs can hold, the extra blocks are scheduled for later execution
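The per-architecture numbers quoted on these slides can be read back at run time; a minimal sketch using the CUDA runtime (device 0, error checking omitted):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0
        printf("SMs:                   %d\n",  prop.multiProcessorCount);
        printf("warp size:             %d\n",  prop.warpSize);
        printf("max threads / block:   %d\n",  prop.maxThreadsPerBlock);
        printf("max threads / SM:      %d\n",  prop.maxThreadsPerMultiProcessor);
        printf("shared memory / block: %zu bytes\n", prop.sharedMemPerBlock);
        return 0;
    }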

Warps
- A thread block is divided into warps.
- A warp is a group of 32 threads (hardware dependent; the size can change).
- Warps are the scheduling units of the SM:
  warp 0: t0, t1, ..., t31
  warp 1: t32, t33, ..., t63

Warps: example
3 blocks are assigned to an SM, each with 128 threads. How many warps are in the SM?
128 threads / 32 (warp length) = 4 warps
4 warps x 3 blocks = 12 warps resident at the same time

Warps: example 2
How many warps fit in one GT200 SM?
1,024 threads / 32 (warp length) = 32 warps

Warp assignment
- One thread is assigned to one SP.
- An SM has 8 SPs and a warp has 32 threads, so a warp is executed in four steps.
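The warp arithmetic in these examples is easy to sanity-check on the host; a small sketch with the example's numbers hard-coded (nothing here runs on the GPU):

    #include <cstdio>

    int main()
    {
        const int warpSize        = 32;
        const int threadsPerBlock = 128;   // from the example
        const int blocksPerSM     = 3;     // from the example

        int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize; // round up for partial warps
        int warpsPerSM    = warpsPerBlock * blocksPerSM;

        // Prints: 4 warps per block, 12 warps resident on the SM
        printf("%d warps per block, %d warps resident on the SM\n",
               warpsPerBlock, warpsPerSM);
        return 0;
    }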

Warps: latency hiding
Why do we need so many warps if there are only a few CUDA cores per SM?
Latency hiding: when a warp executes a global memory read, it may stall for roughly 400 cycles; any other ready warp can be executed in the meantime. If more than one warp is ready, priorities decide which one runs.

Warps: processing
- A warp is SIMT (single instruction, multiple threads): all of its threads run in parallel and execute the same instruction.
- Two different warps behave like MIMD: they can branch, loop, etc. independently of each other.
- Threads within one warp do not need synchronization: they execute the same instruction at the same time.

Warps: zero-overhead scheduling
With many warps available, selecting warps that are ready to go keeps the SM busy (no idle time). That is why caches are usually not necessary.

Example: granularity
Given GT200 and matrix multiplication, which tile size is best: 4x4, 8x8, 16x16, or 32x32?

Example: granularity, 4x4 tiles
- 4x4 tiles need 16 threads per block.
- The SM can take up to 1,024 threads, so we could take 1024/16 = 64 blocks. BUT the SM is limited to 8 blocks.
- There will be 8 x 16 = 128 threads in each SM: 128/32 = 4 warps' worth of work spread over 8 warps, each only half full.
- Heavily underutilized (few warps to schedule).

Example: granularity, 8x8 tiles
- 8x8 tiles need 64 threads per block.
- The SM can take up to 1,024 threads, so we could take 1024/64 = 16 blocks. BUT the SM is limited to 8 blocks.
- There will be 8 x 64 = 512 threads in each SM, i.e. 512/32 = 16 warps.
- Still underutilized (fewer warps to schedule).

Example: granularity, 16x16 tiles
- 16x16 tiles need 256 threads per block.
- The SM can take up to 1,024 threads, so we take 1024/256 = 4 blocks, well within the 8-block limit.
- There will be 4 x 256 = 1,024 threads in each SM, i.e. 1024/32 = 32 warps.
- Full capacity and a lot of warps to schedule.

Example: granularity, 32x32 tiles
- 32x32 tiles need 1,024 threads per block, but a block on GT200 can have at most 512 threads.
- Not even one such block will fit in the SM (no longer true on GT400/Fermi, where a block may have up to 1,024 threads).
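A host-side sketch (using the GT200 limits quoted above; the tile sizes come from the example) that reproduces the blocks/threads/warps-per-SM arithmetic for all four tile sizes:

    #include <cstdio>
    #include <algorithm>

    int main()
    {
        const int maxThreadsPerSM    = 1024;  // GT200 SM limit
        const int maxBlocksPerSM     = 8;
        const int maxThreadsPerBlock = 512;   // GT200 block limit
        const int warpSize           = 32;

        const int tiles[] = { 4, 8, 16, 32 };
        for (int t : tiles) {
            int threadsPerBlock = t * t;
            if (threadsPerBlock > maxThreadsPerBlock) {
                printf("%2dx%-2d: %4d threads/block -> does not fit in a GT200 block\n",
                       t, t, threadsPerBlock);
                continue;
            }
            int blocks        = std::min(maxBlocksPerSM, maxThreadsPerSM / threadsPerBlock);
            int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize; // partial warps still occupy a slot
            int threads       = blocks * threadsPerBlock;
            int warps         = blocks * warpsPerBlock;
            // 4x4 yields 8 warps that are each only half full; 16x16 reaches 32 full warps.
            printf("%2dx%-2d: %d blocks/SM, %4d threads/SM, %2d warps/SM\n",
                   t, t, blocks, threads, warps);
        }
        return 0;
    }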

Example: granularity (continued)
- Good granularity does not automatically mean good performance; that also depends on shared memory use, branching, loops, etc., but it does imply low latency.
- Blocks (i.e., the number of threads per block) should be multiples of 32 for better warp alignment.

Warps/block alignment: 1D case
A block of 100 threads: how many warps?
100/32 = 3 full warps plus 4 remaining threads:
w0: t0 ... t31
w1: t32 ... t63
w2: t64 ... t95
w3: t96 ... t99
The last warp is still allocated entirely, but only 4 of its threads do useful work.

Warps/block alignment: 2D case
blockDim(9,9) = 81 threads: 81/32 = 2 full warps and 17 remaining threads.
The 9x9 grid of threads t(0,0) ... t(8,8) is linearized row by row (x varies fastest) into w0 (32 threads), w1 (32 threads), and w2 (17 threads).

Warps/block alignment: 3D case
blockDim(4,4,5) = 80 threads: 80/32 = 2 full warps and 16 remaining threads.
The threads t(0,0,0) ... t(3,3,4) are linearized (x fastest, then y, then z) into w0 (32 threads), w1 (32 threads), and w2 (16 threads).
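The linearization described above (x fastest, then y, then z) determines which warp a thread lands in; a hedged device-code sketch (the kernel name and output buffer are hypothetical) that computes each thread's warp index:

    // Each thread writes the index of the warp it belongs to.
    __global__ void warpOfThread(int *warpIdOut)
    {
        // Linear index of this thread within its block: x varies fastest, then y, then z.
        int linear = threadIdx.x
                   + threadIdx.y * blockDim.x
                   + threadIdx.z * blockDim.x * blockDim.y;

        // warpSize is the built-in device constant (32 on current hardware).
        warpIdOut[linear] = linear / warpSize;
    }

    // For blockDim(9,9,1) the 81 threads map to warp ids 0, 1, and 2:
    // threads 0-31 -> warp 0, 32-63 -> warp 1, 64-80 -> warp 2 (partially filled).
    // warpOfThread<<<1, dim3(9, 9, 1)>>>(d_warpIds);   // hypothetical launch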

Warp execution
SIMT: single instruction, multiple threads. The same instruction is broadcast to all threads and executed at the same time in the SM; all SPs in the SM execute the same instruction.

Thread divergence
How can all threads execute the same instruction if the code contains an if statement?
Example:
  if (threadIdx.x < 10) { a[0] = 10; } else { a[1] = 10; }
Threads 0-9 take the then branch, the others take the else branch. This is called thread divergence.

Thread divergence (continued)
Both branches are executed: the GPU runs the then branch in one pass and the else branch in a second pass, masking off the threads that did not take that branch.
But not every if causes thread divergence. If all threads of a warp evaluate the condition the same way, there is no divergence:
  a = tex2D(tex, u, v);
  if (a < 0.5f) { b[0] = 10; } else { b[1] = 10; }

Thread divergence: causes
What causes thread divergence?
1) if statements whose condition is a function of threadIdx
2) Loops whose bounds are a function of threadIdx
(ifs are expensive anyway)
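A hedged kernel sketch (names and values are illustrative, not from the slides) contrasting a branch that diverges within a warp with one that is uniform per warp:

    __global__ void divergenceDemo(int *out)
    {
        int i = threadIdx.x;

        // Divergent: within warp 0, threads 0-9 take the then path and threads
        // 10-31 take the else path, so the warp executes both paths in turn.
        if (i < 10) out[i] = 1;
        else        out[i] = 2;

        // Not divergent: i / warpSize is the same for every thread of a warp,
        // so each warp takes exactly one of the two paths.
        if ((i / warpSize) % 2 == 0) out[i] += 10;
        else                         out[i] += 20;
    }

    // divergenceDemo<<<1, 64>>>(d_out);   // hypothetical launch with two warps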

Thread divergence: loops
Example:
  for (int i = 0; i < threadIdx.x; i++) a[i] = i;
Each thread finishes its own loop as soon as its bound is reached, but the warp keeps iterating until the thread with the largest threadIdx.x is done; threads that have already finished sit idle during those extra passes.

Reading
- NVIDIA CUDA Programming Guide
- Kirk, D.B., Hwu, W.W., Programming Massively Parallel Processors, Morgan Kaufmann, 2010