Characterizing and Improving the Performance of Intel Threading Building Blocks

1 Characterizing and Improving the Performance of Intel Threading Building Blocks. Gilberto Contreras and Margaret Martonosi, Princeton University. IISWC '08.

2 Motivation. Chip multiprocessors are the new computing platform: 2 cores, 4 cores, 8 cores. Are we ready? Why is parallelism so challenging? Identifying parallelism; annotating or extracting parallelism; mapping to cores; responding to OS effects, thermal emergencies, variability trends, and reliability issues.

3 How is Parallelism Annotated/Extracted? Compiler: DSWP, SUIF, Polaris, other. Parallel languages: Cilk, StreamIt, Linda, Orca, Parloc, Emerald, etc. Parallelization libraries: TBB, OpenMP, Java threads, pthreads, CUDA, etc. This work answers the following questions: What are some of the major sources of overhead? How do they impact overall parallel performance? How can we improve parallel performance?

4 Focus of Our Work. This talk focuses on Intel Threading Building Blocks (TBB): a task-based parallelization library for C++ applications that supports a wide range of parallelism types and uses task stealing for load balancing. The methodology is applicable to other parallelism management approaches.

5 Presentation Outline. Description of TBB: programming example; task management in TBB. Characterization: methodology; measuring basic operations using simulation and real-system measurements; TBB overheads in PARSEC benchmarks; performance of task stealing. Improving TBB: occupancy-based task stealing. Summary and conclusions.

6 Annotation and Management

Serial loop:

    for (i = k+1; i < size; i++) {
        L[i][k] = M[i][k] / M[k][k];
        for (j = i+1; j < size; j++)
            M[i][j] = M[i][j] - L[i][k]*M[k][j];
    }

TBB version:

    class LUWork {
    public:
        // Data members (L, M, k, size) and the constructor are not shown on the slide.
        // Processes one chunk [b.begin(), b.end()) of the outer loop.
        void operator()(const blocked_range<int> &b) const {
            for (int i = b.begin(); i != b.end(); i++) {
                L[i][k] = M[i][k] / M[k][k];
                for (int j = i+1; j < size; j++)
                    M[i][j] = M[i][j] - L[i][k]*M[k][j];
            }
        }
    };

    LUWork work(l,m,k,size);
    // The third argument of blocked_range is the TBB chunk size.
    parallel_for(blocked_range<int>(k+1, size, CHUNK_SIZE), work);

[Figure: the work scheduler distributes chunks to worker 1, worker 2, and worker 3. Runtime procedure on the workers: spawn(), acquire_queue(), get_task(), steal().]
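The effect of the chunk size passed to blocked_range can be sketched without TBB. The following is a minimal sketch, assuming that a range is halved at its midpoint until each piece holds at most the chunk size. `Range` and `split_range` are hypothetical names introduced for illustration, and the sketch splits everything up front, whereas a task-based runtime would create the pieces as tasks while it runs.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a half-open iteration range [begin, end).
struct Range {
    int begin;
    int end;
    int size() const { return end - begin; }
};

// Recursively halve `r` until each piece holds at most `grain` iterations,
// appending the resulting chunks to `out` in ascending order.
void split_range(Range r, int grain, std::vector<Range>& out) {
    assert(grain > 0);
    if (r.size() <= grain) {           // not divisible any further: one chunk
        if (r.size() > 0) out.push_back(r);
        return;
    }
    int mid = r.begin + r.size() / 2;  // split at the midpoint
    split_range(Range{r.begin, mid}, grain, out);
    split_range(Range{mid, r.end}, grain, out);
}
```

Splitting the range [1, 17) with a chunk size of 4 yields four chunks of four iterations each. A smaller chunk size produces more, smaller tasks, which gives the scheduler more freedom to balance load but also more spawn and get_task events to pay for.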

7 Reducing TBB Library Overhead? Understand overheads: creating tasks (spawn()); assigning tasks to worker threads (get_task(), acquire_queue(), wait_for_all()); stealing or rebalancing parallelism (steal()). Improve parallelism reorganization policies: employ smart redistribution policies. Make this as fast and as efficient as possible.
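To make the roles of spawn(), get_task(), and steal() concrete, here is a deliberately simplified model of a single worker's task queue. It is a sketch under stated assumptions, not TBB's implementation: `WorkerQueue` is a hypothetical class, tasks are plain integers, and a mutex stands in for the queue-acquisition step. It assumes the owner takes the newest task while a thief takes the oldest one, the usual arrangement in work-stealing schedulers.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical, simplified model of one worker's task queue.
class WorkerQueue {
public:
    // spawn: the owner makes a new task available.
    void spawn(int task) {
        std::lock_guard<std::mutex> lock(mutex_);  // stands in for acquiring the queue
        tasks_.push_back(task);
    }
    // get_task: the owner takes the most recently spawned task (LIFO).
    bool get_task(int& task) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return false;
        task = tasks_.back();
        tasks_.pop_back();
        return true;
    }
    // steal: another worker takes the oldest task (FIFO).
    bool steal(int& task) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return false;          // unsuccessful steal
        task = tasks_.front();
        tasks_.pop_front();
        return true;
    }
    // Number of tasks currently queued.
    std::size_t occupancy() {
        std::lock_guard<std::mutex> lock(mutex_);
        return tasks_.size();
    }
private:
    std::mutex mutex_;
    std::deque<int> tasks_;
};

// Small self-check illustrating the two access orders.
inline void worker_queue_example() {
    WorkerQueue q;
    q.spawn(1);
    q.spawn(2);
    q.spawn(3);
    int t = 0;
    bool ok = q.steal(t);
    assert(ok && t == 1);     // thief receives the oldest task
    ok = q.get_task(t);
    assert(ok && t == 3);     // owner receives the newest task
    assert(q.occupancy() == 1);
}
```

Every call in this model pays for a lock acquisition even when the queue is empty, which illustrates why unsuccessful steals and queue acquisition appear as separate cost categories in the characterization.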

8 Methodology. Benchmarks: PARSEC and microbenchmarks. Intel Threading Building Blocks (TBB): open-source version 2.0. Real CMP system: 4-core AMD system (2 processors), 4GB RAM, Linux 2.6; Oprofile is used for performance counter measurements. Cycle-accurate CMP simulator: 2-issue, in-order cores; 32KB D$ (coherent), 32KB I$; 8MB shared L2 cache; MSI directory-based coherence protocol; mesh network, 32b BW/port/cycle.

9 Cost of Parallelism Management. Simulation results (4-32 cores). [Figure: cycles per runtime activity (get_task, spawn, successful stealing, unsuccessful stealing, acquire_queue, wait_for_all) for 4, 8, 12, 16, and 32 cores.]

10 TBB Overheads: PARSEC. [Figure: four panels (fluidanimate, swaptions, blackscholes, streamcluster) showing average time per core, on a 0% to 30% axis, versus number of cores (P8, P12, P16, P25, P32), broken down into scheduler, waiting, synchronization, and stealing. Bars that exceed the axis limit are labeled 41%, 47%, 34%, and 54%.]

11 Improving Stealing. TBB uses random stealing as its victim selection policy. [Figure: success rate and false negatives, on a 0% to 45% axis, at P4, P8, P12, P16, P25, and P32 for Bitcounter, LU, and Matmult.]
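The two metrics on this slide can be illustrated with a small calculation. This is a sketch under stated assumptions rather than the authors' measurement method: it assumes the thief picks uniformly among the other workers, and it takes a "false negative" to mean a steal attempt that fails even though some other worker has queued tasks (the slide does not define the term). Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// occupancy[i] is the number of queued tasks at worker i (a hypothetical snapshot).
// Returns the probability that one uniformly random steal attempt by `thief`
// lands on a victim that has at least one task.
double random_steal_success_probability(const std::vector<std::size_t>& occupancy,
                                        std::size_t thief) {
    assert(thief < occupancy.size());
    if (occupancy.size() < 2) return 0.0;  // nobody to steal from
    std::size_t with_work = 0;
    for (std::size_t v = 0; v < occupancy.size(); ++v) {
        if (v != thief && occupancy[v] > 0) ++with_work;
    }
    return static_cast<double>(with_work) /
           static_cast<double>(occupancy.size() - 1);
}

// Probability that the attempt fails although work exists elsewhere
// (the assumed meaning of "false negative").
double false_negative_probability(const std::vector<std::size_t>& occupancy,
                                  std::size_t thief) {
    double p = random_steal_success_probability(occupancy, thief);
    return p > 0.0 ? 1.0 - p : 0.0;  // no work anywhere: a failure is not "false"
}
```

With four workers and only one of the other three holding tasks, a random attempt succeeds one time in three. Under this model the false-negative probability grows as work concentrates in fewer queues, which is the situation a smarter victim selection policy targets.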

12 Occupancy-based Stealing. [Figure: a scheduler with workers 1 to 4, shown once under random stealing and once under occupancy-based stealing.] Random stealing: pick a random number, then steal. Occupancy-based stealing: scan, then steal.
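The two victim selection policies can be contrasted in a few lines. This is a minimal sketch, assuming that the occupancy scan picks the other worker with the most queued tasks; the slide only says that the policy scans before stealing, so that selection rule and both function names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Random stealing: pick a uniformly random victim other than the thief.
// Requires at least two workers.
std::size_t pick_random_victim(std::size_t num_workers, std::size_t thief,
                               std::mt19937& rng) {
    assert(num_workers >= 2 && thief < num_workers);
    std::uniform_int_distribution<std::size_t> dist(0, num_workers - 2);
    std::size_t v = dist(rng);
    return v >= thief ? v + 1 : v;  // skip over the thief's own index
}

// Occupancy-based stealing: scan all queues and pick the fullest one.
// Returns occupancy.size() when no other worker has any task.
std::size_t pick_fullest_victim(const std::vector<std::size_t>& occupancy,
                                std::size_t thief) {
    assert(thief < occupancy.size());
    std::size_t best = occupancy.size();
    std::size_t best_count = 0;
    for (std::size_t v = 0; v < occupancy.size(); ++v) {
        if (v == thief) continue;
        if (occupancy[v] > best_count) {
            best = v;
            best_count = occupancy[v];
        }
    }
    return best;
}
```

The scan touches every queue, so its cost grows with the number of workers, while the random pick costs the same at any core count but may land on an empty queue. That is the trade-off between scanning cost and stealing cost.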

13 Performance of Occupancy-based Stealing

              Occupancy-based         1-cycle scan,           1-cycle scan,
                                      normal stealing         1-cycle stealing
              P16    P25    P32       P16    P25    P32       P16    P25    P32
Bitcounter    2.5%   2.5%   2.7%      2.4%   2.8%   3.7%      4.7%   6.9%   7.8%
LU            10%    4.1%   9.7%      10.2%  4.6%   8.0%      16%    10.4%  20.6%
Matmult       9.5%   6%     19%       9.8%   7.0%   21.1%     10.8%  9.8%   28.7%

Smarter selection policies are desired. High potential in overhead reduction.

14 Conclusions. Increasing usage of TBB makes it a prime candidate for in-depth characterization. Parallelization libraries help, but tend to exhibit high (dynamic) overheads (>40% at 32 cores). Understanding software overheads is the first step in creating high-performance parallel systems. We have presented a detailed characterization of the Intel Threading Building Blocks and implemented occupancy-based stealing (19% performance improvement over random stealing).

15 Thanks!

16 Summary. Programmers require tools that allow them to take (fast) advantage of increasing core counts. Parallelization libraries help, but tend to exhibit high (dynamic) overheads (>40% at 32 cores). Understanding software overheads is the first step in creating high-performance parallel systems. We have presented a detailed characterization of the Intel Threading Building Blocks and implemented occupancy-based stealing (19% performance improvement over random stealing).

17 Cost of Parallelism Management. Real system: 4-core, 1.8GHz AMD system; Oprofile configured to measure CPU_CLK_UNHALTED. Simulator: 1- to 32-core CMP simulator; 2-issue, in-order cores; shared L2. [Figure: two charts of cycles per runtime activity. Real system (1, 2, 3, and 4 cores): get_task, spawn, steal, acquire_queue, wait_for_all. Simulator (4, 8, 12, 16, and 32 cores): get_task, spawn, stealing (successful), stealing (unsuccessful), acquire_queue, wait_for_all.] Our goals: 1) reduce per-event overheads; 2) improve rebalancing.

18 Static versus Dynamic Management. [Figure: three charts of speedup versus number of cores, each on an axis up to 32, labeled Static (pthread), TBB, and PARSEC, for fluidanimate, swaptions, and blackscholes.]
