Improving GPU Performance via Large Warps and Two-Level Warp Scheduling


1 Improving GPU Performance via Large Warps and Two-Level Warp Scheduling
  Veynu Narasiman (The University of Texas at Austin)
  Michael Shebanow (NVIDIA)
  Chang Joo Lee (Intel)
  Rustam Miftakhutdinov (The University of Texas at Austin)
  Onur Mutlu (Carnegie Mellon University)
  Yale N. Patt (The University of Texas at Austin)
  MICRO-44, December 6th, 2011, Porto Alegre, Brazil

2 Rise of GPU Computing
  - GPUs have become a popular platform for general-purpose applications
  - New programming models
    - CUDA
    - ATI Stream Technology
    - OpenCL
  - Order-of-magnitude speedup over single-threaded CPU

3 How GPUs Exploit Parallelism
  - Multiple GPU cores (i.e., Streaming Multiprocessors)
    - Focus on a single GPU core
  - Exploit parallelism in 2 major ways:
    - Threads grouped into warps
      - Single PC per warp
      - Warps executed in SIMD fashion
    - Multiple warps concurrently executed
      - Round-robin scheduling
      - Helps hide long latencies

4 The Problem
  - Despite these techniques, computational resources can still be underutilized
  - Two reasons for this:
    - Branch divergence
    - Long latency operations

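5 Branch Divergence
  [Figure: control-flow graph in which block A (active mask 1111) branches to B (taken, mask 1001) and C (not taken, mask 0110), with both paths reconverging at D (mask 1111). A divergence stack with the columns Reconverge PC, Active Mask, and Execute PC is shown alongside the current PC and current active mask.]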
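The cost of the divergence in this example can be worked out from the active masks. The sketch below is illustrative only and assumes each block (A, B, C, D) issues exactly one instruction on a 4-wide SIMD pipeline; the function name `simd_utilization` is hypothetical.

```python
def simd_utilization(active_masks):
    """Fraction of SIMD lanes doing useful work over the issued instructions.

    Assumption: every mask corresponds to one issued instruction, and all
    masks have the same width (the SIMD width).
    """
    width = len(active_masks[0])
    active_lanes = sum(mask.count("1") for mask in active_masks)
    return active_lanes / (width * len(active_masks))


# Masks from the slide: A (all threads), B (taken path),
# C (not-taken path), D (after reconvergence).
masks = ["1111", "1001", "0110", "1111"]
print(f"SIMD utilization: {simd_utilization(masks):.0%}")
```

Under these assumptions, blocks B and C each leave half of the lanes idle, so 12 of the 16 lane slots do useful work.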

6 Long Latency Operations
  [Figure: timeline of the core and the memory system under round-robin scheduling with 16 total warps. All warps compute, the requests of warps 0 through 15 go to the memory system, and then all warps compute again.]

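7 Computational Resource Utilization
  [Figure: stacked bar chart, 0% to 100%, broken down by number of active threads: 32, 24 to 31, 16 to 23, 8 to 15, 1 to 7, and 0, with the categories labeled from "Good" to "Bad".]
  Configuration: 32 warps, 32 threads per warp, SIMD width = 32, round-robin scheduling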

8 Large Warp Microarchitecture (LWM)
  - Alleviates branch divergence
  - Fewer, but larger warps
    - Warp size much greater than SIMD width
    - Total thread count and SIMD width stay the same
  - Dynamically breaks down large warp into sub-warps
    - Can be executed on existing SIMD pipeline
  - Rearrange active mask as 2D structure
    - Number of columns = SIMD width
    - Search each column for an active thread to create new sub-warp
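The column search described on this slide can be sketched as follows. This is a minimal illustration, not the hardware design: the function name `make_sub_warps`, the choice to scan rows from top to bottom, and the 16-thread example mask are all assumptions made for the sketch.

```python
def make_sub_warps(active_mask, simd_width):
    """Pack the active threads of a large warp into sub-warps.

    active_mask: flat list of 0/1 flags, one per thread of the large warp.
    Returns a list of sub-warps; each sub-warp holds, per SIMD lane, the row
    index of the chosen thread, or None if that lane stays idle.
    """
    if len(active_mask) % simd_width != 0:
        raise ValueError("large warp size must be a multiple of the SIMD width")

    # View the mask as a 2D structure whose column count equals the SIMD width.
    rows = [list(active_mask[i:i + simd_width])
            for i in range(0, len(active_mask), simd_width)]

    sub_warps = []
    while any(any(row) for row in rows):
        sub_warp = []
        for lane in range(simd_width):
            chosen = None
            # Search this column for an active thread (top-down is an assumption).
            for r, row in enumerate(rows):
                if row[lane]:
                    row[lane] = 0  # the thread is now scheduled
                    chosen = r
                    break
            sub_warp.append(chosen)
        sub_warps.append(sub_warp)
    return sub_warps


# Hypothetical large warp: 16 threads, SIMD width 4, 8 threads active.
mask = [1, 0, 0, 1,
        0, 1, 0, 0,
        1, 0, 1, 0,
        0, 1, 0, 1]
for i, sub_warp in enumerate(make_sub_warps(mask, 4)):
    print(f"sub-warp {i}: rows per lane = {sub_warp}")
```

In this example the 8 active threads are packed into 2 sub-warps, whereas four separate 4-thread warps with the same masks would each need their own issue slot. Each thread stays in its own column, which is presumably what lets the sub-warps run on the existing SIMD pipeline.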

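9 Large Warp Microarchitecture Example
  [Figure: example of sub-warp creation in the decode stage, showing the masks of the successive sub-warps formed from the large warp's active mask, starting with sub-warp 0.]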

10 More Large Warp Microarchitecture
  - Divergence stack still used
    - Handled at the large warp level
  - How large should we make the warps?
    - More threads per warp → more potential for sub-warp creation
    - Too large a warp size can degrade performance
  - Re-fetch policy for conditional branches
    - Must wait until the last sub-warp finishes
  - Optimization for unconditional branch instructions
    - Don't create multiple sub-warps
    - Sub-warping always completes in a single cycle

11 Two Level Round Robin Scheduling
  - Split warps into equal-sized fetch groups
  - Create initial priority among the fetch groups
  - Round-robin scheduling among warps in same fetch group
  - When all warps in the highest priority fetch group are stalled
    - Rotate fetch group priorities
    - Highest priority fetch group becomes least
  - Warps arrive at a stalling point at slightly different points in time
    - Better overlap of computation and memory latency
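The policy above can be sketched as a small scheduler. This is a simplified illustration under stated assumptions, not the proposed hardware: the class name `TwoLevelScheduler` and its `pick` method are hypothetical, and the sketch assumes that priorities keep rotating until some fetch group has a ready warp.

```python
from collections import deque


class TwoLevelScheduler:
    """Round-robin within the highest-priority fetch group; fetch group
    priorities rotate when every warp in that group is stalled."""

    def __init__(self, num_warps, group_size):
        if num_warps % group_size != 0:
            raise ValueError("warps must split into equal-sized fetch groups")
        # groups[0] is the highest-priority fetch group.
        self.groups = deque(
            deque(range(start, start + group_size))
            for start in range(0, num_warps, group_size)
        )

    def pick(self, stalled):
        """Return the next warp to fetch from, or None if all warps are stalled."""
        for _ in range(len(self.groups)):
            top = self.groups[0]
            for _ in range(len(top)):
                warp = top[0]
                top.rotate(-1)  # round-robin: this warp moves to the back
                if warp not in stalled:
                    return warp
            # All warps in the top group are stalled:
            # the highest-priority fetch group becomes the lowest.
            self.groups.rotate(-1)
        return None


sched = TwoLevelScheduler(num_warps=4, group_size=2)
print([sched.pick(set()) for _ in range(3)])   # only group 0 is scheduled
print(sched.pick({0, 1}))                      # group 0 stalled, so group 1 takes over
```

Because group 1 starts running only once group 0 stalls, the two groups reach their long latency operations at different times, which is the overlap the slide describes.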

12 Round Robin vs Two Level Round Robin
  [Figure, top: round-robin scheduling with 16 total warps. All warps compute, the requests of warps 0 through 15 go to the memory system together, and then all warps compute again.]
  [Figure, bottom: two-level round-robin scheduling with 2 fetch groups of 8 warps each. Group 0 computes and issues its requests (warps 0 to 7) while group 1 computes; group 1 then issues its requests (warps 8 to 15) while group 0's are being serviced. Each group computes again afterwards. The difference in finish time is marked "Saved Cycles".]
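The saved cycles in this comparison can be approximated with a toy timing model. The model is an assumption-heavy sketch: every warp computes, issues one memory request, then computes again; the core executes one warp-cycle per cycle; all requests of a fetch group are issued when the group finishes computing; and memory bandwidth is unlimited. The function name and the compute and latency values are hypothetical.

```python
def total_cycles(num_groups, warps, compute, latency):
    """Finish time when each warp runs `compute` cycles, waits `latency`
    cycles for one memory request, then runs `compute` more cycles."""
    per_group = (warps // num_groups) * compute  # core time for one phase of one group
    clock = 0
    ready_at = []
    # Phase 1: fetch groups compute one after another, then issue their requests.
    for _ in range(num_groups):
        clock += per_group
        ready_at.append(clock + latency)
    # Phase 2: each group resumes once the core is free and its data is back.
    for ready in ready_at:
        clock = max(clock, ready) + per_group
    return clock


# 16 warps, hypothetical 10 compute cycles per phase, hypothetical 300-cycle latency.
round_robin = total_cycles(num_groups=1, warps=16, compute=10, latency=300)
two_level = total_cycles(num_groups=2, warps=16, compute=10, latency=300)
print(round_robin, two_level, round_robin - two_level)
```

With a single group (plain round-robin) the whole latency is exposed, while with two fetch groups part of group 1's computation overlaps with group 0's memory accesses.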

13 More on Two Level Scheduling
  - What should the fetch group size be?
    - Enough warps to keep pipeline busy in the absence of long latency stalls
    - Too small
      - Uneven progression of warps in the same fetch group
      - Destroys data locality among warps
    - Too large
      - Reduces benefits of two-level scheduling
      - More warps stall at the same time
  - Not just for hiding memory latency
    - Complex instructions (e.g., sine, cosine, sqrt)
    - Two-level scheduling allows warps to arrive at such instructions at slightly different points in time

14 Combining LWM and Two Level Scheduling
  - 4 large warps, 256 threads each
    - Fetch group size = 1 large warp
  - Problematic for applications with few long latency stalls
    - No stalls → no fetch group priority changes
    - Single large warp starved
    - Branch re-fetch policy for large warps → bubbles in pipeline
  - Timeout-invoked fetch group priority change
    - 32K instruction timeout period
    - Alleviates starvation
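The timeout can be sketched as a counter that forces a priority rotation. The details here are assumptions, since the slide gives only the 32K figure: the sketch takes 32K to mean 32 × 1024, counts instructions issued since the last priority change, and uses the hypothetical names `FetchGroupPriority` and `step`.

```python
from collections import deque

TIMEOUT_INSTRUCTIONS = 32 * 1024  # assumption: "32K" read as 32 * 1024


class FetchGroupPriority:
    """Tracks fetch group priority order with a stall-driven and a
    timeout-driven rotation."""

    def __init__(self, num_groups, timeout=TIMEOUT_INSTRUCTIONS):
        self.order = deque(range(num_groups))  # order[0] has highest priority
        self.timeout = timeout
        self.issued_since_change = 0

    def _rotate(self):
        self.order.rotate(-1)  # highest-priority group becomes lowest
        self.issued_since_change = 0

    def step(self, top_group_all_stalled):
        """One scheduling step; returns the group with highest priority afterwards."""
        if top_group_all_stalled:
            self._rotate()  # normal two-level rotation on a stall
        else:
            self.issued_since_change += 1  # top group issued an instruction
            if self.issued_since_change >= self.timeout:
                self._rotate()  # timeout: keep other groups from starving
        return self.order[0]


# Four large warps, one per fetch group; a tiny timeout keeps the demo short.
prio = FetchGroupPriority(num_groups=4, timeout=3)
print([prio.step(top_group_all_stalled=False) for _ in range(6)])
```

Without the timeout branch, an application that never stalls would leave the priority order unchanged indefinitely.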

15 Methodology
  Simulate a single GPU core with 1024 thread contexts, divided into 32 warps of 32 threads each

  Scalar Front End                     1-wide fetch, decode
                                       4KB single-ported I-Cache
                                       Round-robin scheduling
  SIMD Back End                        In order, 5 stages, 32 parallel SIMD lanes
  Register File and On-Chip Memories   64KB Register File
                                       128KB, 4-way D-Cache with 128B line size
                                       128KB, 32-banked private memory
  Memory System                        Open row, first-come first-serve scheduling
                                       8 banks, 4KB row buffer per bank
                                       100-cycle row hit latency, 300-cycle row conflict latency
                                       32 GB/s memory bandwidth

16 Overall IPC Results
  [Figure: bar chart of IPC for Baseline, TBC, LWM, 2Lev, and LWM+2Lev across the benchmarks blackjack, sort, viterbi, kmeans, decrypt, blackscholes, needleman, hotspot, matrix_mult, reduction, histogram, and bfs, plus gmean.]
  LWM+2Lev improves performance by 19.1% over baseline and by 11.5% over TBC

17 IPC and Computational Resource Utilization
  [Figure: four panels comparing baseline, LWM, 2LEV, and LWM+2LEV: IPC for blackjack, IPC for histogram, computational resource utilization for blackjack, and computational resource utilization for histogram. The utilization panels are broken down by number of active threads: 32, 24 to 31, 16 to 23, 8 to 15, 1 to 7, and 0.]

18 Conclusion
  - For maximum performance, the computational resources on GPUs must be effectively utilized
  - Branch divergence and long latency operations cause them to be underutilized or unused
  - We proposed two mechanisms to alleviate this
    - Large Warp Microarchitecture for branch divergence
    - Two-level scheduling for long latency operations
    - Improves performance by 19.1% over traditional GPU cores
  - Increases scope of applications that can run efficiently on a GPU
  Questions
