The Looming Software Crisis due to the Multicore Menace

1 The Looming Software Crisis due to the Multicore Menace. Saman Amarasinghe, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology.

2 Today: The Happily Oblivious Average Joe Programmer. Joe is oblivious about the processor: Moore's law brings Joe performance, and that performance is sufficient for Joe's requirements. Joe has built a solid boundary between hardware and software: high-level languages abstract away the processors (for example, Java bytecode is machine independent). This abstraction has given Joe a lot of freedom. Parallel programming is practiced only by a few experts.

3 Joe the Parallel Programmer. Moore's law is no longer bringing performance gains. If Joe needs performance, he has to deal with multicores: Joe has to deal with performance, and Joe has to deal with parallelism. Is there a better way?

4 Why Parallelism is Hard. A huge increase in complexity and work for the programmer: the programmer has to think about performance, and parallelism has to be designed in at every level. Humans are sequential beings: deconstructing problems into parallel tasks is hard for many of us. Parallelism is not easy to implement: it cannot be abstracted or layered away, and code and data have to be restructured in very different (non-intuitive) ways. Parallel programs are very hard to debug: there is a combinatorial explosion of possible execution orderings; race-condition and deadlock bugs are non-deterministic and elusive; and non-deterministic bugs go away in the lab environment and under instrumentation.
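To make the "combinatorial explosion of possible execution orderings" concrete, here is a minimal Python sketch (not from the talk). It counts the interleavings of independent threads and enumerates every ordering of a two-thread counter increment to show that the result depends on the ordering. The function names and the read/write model of an increment are illustrative assumptions.

```python
from itertools import permutations
from math import factorial


def interleavings(*steps_per_thread):
    """Count orderings of all steps that keep each thread's own steps in order."""
    count = factorial(sum(steps_per_thread))
    for steps in steps_per_thread:
        count //= factorial(steps)
    return count


def lost_update_results():
    """Final counter values when threads A and B each do: read, then write read+1."""
    ops = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]
    results = set()
    for order in permutations(ops):
        # Discard orders in which a thread would write before it has read.
        if any(order.index((t, "read")) > order.index((t, "write")) for t in "AB"):
            continue
        counter, seen = 0, {}
        for thread, op in order:
            if op == "read":
                seen[thread] = counter      # thread remembers the value it read
            else:
                counter = seen[thread] + 1  # thread writes back its stale value + 1
        results.add(counter)
    return results


print(interleavings(2, 2))            # two threads, two steps each
print(interleavings(10, 10))          # two threads, ten steps each
print(sorted(lost_update_results()))  # every final value the race can produce
```

Two threads of two steps each already have 6 orderings, and two threads of ten steps each have 184756. In the counter example only the two fully serialized orderings produce 2; the other four lose an update and produce 1.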

5 Compiler-Aware Language Design: The StreamIt Experience. [Figure: stream graph in which FMDemod feeds a Scatter, parallel LPF filters, a Gather, and a Speaker.]

6 Stream Application Domain. Graphics; cryptography; databases; object recognition; network processing and security; scientific codes.

7 StreamIt Project. Language semantics / programmability: StreamIt Language (CC '02); Programming Environment in Eclipse (P-PHEC '05). Optimizations / code generation: Phased Scheduling (LCTES '03); Cache Aware Optimization (LCTES '05). Domain-specific optimizations: Linear Analysis and Optimization (PLDI '03); Optimizations for bit streaming (PLDI '05); Linear State Space Analysis (CASES '05). Parallelism: Teleport Messaging (PPOPP '05); Compiling for Communication-Exposed Architectures (ASPLOS '02); Load-Balanced Rendering (Graphics Hardware '05). Applications: SAR, DSP benchmarks, JPEG, MPEG [IPDPS '06], DES and Serpent [PLDI '05]. [Figure: compiler flow. A StreamIt program passes through the front-end (annotated Java) and the stream-aware optimizations to one of four backends: the uniprocessor backend (C), the cluster backend (MPI-like C), the Raw backend (C per tile + msg code), and the IBM X10 backend (streaming X10 runtime).]

8 Compiler-Aware Language Design. Programmability: boost productivity, enable faster development and rapid prototyping. Enable parallel execution: target architectures such as multicores, clusters, tiled architectures, DSPs, and graphics processors.

9 Streaming Application Design: MPEG-2 Decoder. A structured block-level diagram describes the computation and the flow of data. It is conceptually easy to understand and gives a clean abstraction of functionality.

[Figure: block diagram of the MPEG-2 decoder. The MPEG bit stream enters VLD, which supplies the picture type <PT, PT2> and quantization coefficients <QC> to downstream filters. A splitter routes the frequency-encoded macroblocks through ZigZag, IQuantization <QC>, IDCT, and Saturation (producing spatially encoded macroblocks) and routes the differentially coded motion vectors through Motion Vector Decode and Repeat. After the joiner, a second splitter separates the Y, Cb, and Cr channels; each passes through Motion Compensation <PT> with its reference picture, and two of the three branches also pass through Channel Upsample. A joiner feeds the recovered picture to Picture Reorder <PT2> and Color Space Conversion.]

    add VLD(QC, PT, PT2);
    add splitjoin {
        split roundrobin(N*B, V);
        add pipeline {
            add ZigZag(B);
            add IQuantization(B) to QC;
            add IDCT(B);
            add Saturation(B);
        }
        add pipeline {
            add MotionVectorDecode();
            add Repeat(V, N);
        }
        join roundrobin(B, V);
    }
    add splitjoin {
        split roundrobin(4*(B+V), B+V, B+V);
        add MotionCompensation(4*(B+V)) to PT;
        for (int i = 0; i < 2; i++) {
            add pipeline {
                add MotionCompensation(B+V) to PT;
                add ChannelUpsample(B);
            }
        }
        join roundrobin(,, );
    }
    add PictureReorder(3*W*H) to PT2;
    add ColorSpaceConversion(3*W*H);
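The decoder code relies on weighted round-robin splitters and joiners. The following Python sketch models that behavior as it is usually described for StreamIt: a splitter with weights (w1, w2, ...) sends the next w1 items to its first child, the next w2 to its second, and so on, and a joiner collects items in the same pattern. It is an illustrative model, not StreamIt code; the function names are assumptions and all weights are assumed to be positive.

```python
def split_roundrobin(items, weights):
    """Distribute items to len(weights) children, weights[i] items at a time."""
    outputs = [[] for _ in weights]
    pos = 0
    while pos < len(items):
        for out, w in zip(outputs, weights):
            out.extend(items[pos:pos + w])
            pos += w
    return outputs


def join_roundrobin(inputs, weights):
    """Collect weights[i] items at a time from child i, cycling over the children."""
    cursors = [0] * len(inputs)
    result = []
    while any(c < len(stream) for c, stream in zip(cursors, inputs)):
        for i, w in enumerate(weights):
            result.extend(inputs[i][cursors[i]:cursors[i] + w])
            cursors[i] += w
    return result


items = list(range(10))
children = split_roundrobin(items, [3, 2])
print(children)                           # first child gets 3 items per round, second gets 2
print(join_roundrobin(children, [3, 2]))  # joining with the same weights restores the order
```

With weights [3, 2], the first child receives items 0, 1, 2, 5, 6, 7 and the second receives 3, 4, 8, 9. Joining with the same weights restores the original sequence, which is why a splitjoin whose branches pass data through unchanged behaves like an identity.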

10 StreamIt Philosophy. Preserve program structure: natural for application developers to express. Leverage program structure to discover parallelism and deliver high performance. Programs remain clean: portable and malleable. [Figure: the MPEG-2 decoder block diagram and StreamIt code from slide 9.]

11 StreamIt Philosophy. [Figure: the MPEG-2 decoder block diagram and StreamIt code from slide 9, with the result of Color Space Conversion output to the player.]

12 Compiler-Aware Language Design. Programmability: boost productivity, enable faster development and rapid prototyping. Enable parallel execution: target architectures such as multicores, clusters, tiled architectures, DSPs, and graphics processors.

13 Common Machine Languages. Unicores: the common properties are a single flow of control and a single memory image; the differences are the register file (handled by register allocation), the ISA (instruction selection), and the functional units (instruction scheduling). Von Neumann languages represent the common properties and abstract away the differences. Multicores: the common properties are multiple flows of control and multiple local memories; the differences are the number and capabilities of cores, the communication model, and the synchronization model.

14 Bridging the Abstraction Layers. StreamIt exposes the data movement: the graph structure is architecture independent. StreamIt exposes the parallelism: explicit task parallelism, and implicit but inherent data and pipeline parallelism. Each multicore is different in granularity and topology, and communication is exposed to the compiler. The compiler needs to bridge the abstraction efficiently: map the computation and communication pattern of the program to the cores, memory, and communication substrate.

15 Types of Parallelism. Task parallelism (traditionally thread fork/join): parallelism explicit in the algorithm, between filters without a producer/consumer relationship. Data parallelism: peel iterations of a filter and place them within a scatter/gather pair (fission); filters with state can't be parallelized this way. Pipeline parallelism: between producers and consumers; stateful filters can be parallelized. [Figure: stream graph with a Scatter/Gather pair, labeled Task.]

16 Types of Parallelism. Task parallelism (traditionally thread fork/join): parallelism explicit in the algorithm, between filters without a producer/consumer relationship. Data parallelism (traditionally data-parallel loops): between iterations of a stateless filter; place the copies within a scatter/gather pair (fission); can't parallelize filters with state. Pipeline parallelism (traditionally in hardware): between producers and consumers; stateful filters can be parallelized. [Figure: stream graph with Scatter and Gather nodes, with regions labeled Task, Data Parallel, and Pipeline.]
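The restriction that fission applies only to stateless filters can be seen in a small Python model (a sketch, not StreamIt code). It scatters the input round-robin to N independent copies of a filter and gathers the outputs round-robin. The helper names, and the choice of a doubling filter and a running-sum filter, are illustrative assumptions.

```python
def run_filter(make_work, items):
    """Run one fresh copy of a filter over a stream of items."""
    work = make_work()
    return [work(x) for x in items]


def fiss(make_work, items, n):
    """Fission: scatter round-robin to n copies, run each, gather round-robin."""
    lanes = [items[i::n] for i in range(n)]             # scatter
    outs = [run_filter(make_work, lane) for lane in lanes]
    return [outs[k % n][k // n] for k in range(len(items))]  # gather


def make_scale():
    """Stateless filter: each output depends only on the current input."""
    return lambda x: 2 * x


def make_running_sum():
    """Stateful filter: each output depends on every earlier input."""
    total = 0

    def work(x):
        nonlocal total
        total += x
        return total

    return work


data = list(range(1, 9))
print(fiss(make_scale, data, 4) == run_filter(make_scale, data))
print(fiss(make_running_sum, data, 4) == run_filter(make_running_sum, data))
```

The stateless filter produces the same stream whether or not it is fissed. The running sum does not, because each copy only sees a quarter of the inputs; a filter like this has to rely on pipeline parallelism instead.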

17 Problem Statement. Given: a stream graph with compute and communication estimates for each filter, and the computation and communication resources of the target machine. Find: a schedule of execution for the filters that best utilizes the available parallelism to fit the machine resources.

18 Baseline 1: Task Parallelism. There is inherent task parallelism between the two processing pipelines. Task-parallel model: parallelize only the explicit task parallelism (fork/join parallelism). Executing this on a 2-core machine gives ~2x speedup over a single core. What about 4, 16, 1024, ... cores? [Figure: two parallel pipelines, each BandPass, Compress, Expand, BandStop, feeding an Adder.]

19 Evaluation: Task Parallelism. Raw microprocessor: 16 in-order, single-issue cores with D$ and I$; 16 memory banks, each bank with DMA; cycle-accurate simulator. Parallelism: not matched to target! Synchronization: not matched to target! [Figure: bar chart of throughput normalized to single-core StreamIt for BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, and Radar, plus the geometric mean.]

20 Baseline 2: Fine-Grained Data Parallelism. Each of the filters in the example is stateless, so we can introduce data parallelism. Fine-grained data-parallel model: fiss each stateless filter N ways (N is the number of cores) and remove scatter/gather if possible. Example: on 4 cores, each fission group occupies the entire machine. [Figure: the BandPass, Compress, Expand, BandStop pipelines and the Adder, with each filter fissed.]

21 Evaluation: Fine-Grained Data Parallelism. Good parallelism! Too much synchronization! [Figure: bar chart comparing Task and Fine-Grained Data; throughput normalized to single-core StreamIt for the twelve benchmarks and their geometric mean.]

22 Phase 1: Coarsen the Stream Graph. Before data parallelism is exploited, fuse stateless pipelines as much as possible without introducing state: don't fuse stateless filters with stateful ones, and don't fuse a peeking filter with anything upstream. [Figure: the two BandPass, Compress, Expand, BandStop pipelines feeding an Adder, with the BandPass and BandStop filters marked Peek.]

23 Phase 1: Coarsen the Stream Graph. Before data parallelism is exploited, fuse stateless pipelines as much as possible without introducing state: don't fuse stateless filters with stateful ones, and don't fuse a peeking filter with anything upstream. Benefits: reduces global communication and synchronization; exposes inter-node optimization opportunities. [Figure: the coarsened graph, in which BandPass, Compress, and Expand are fused in each pipeline, followed by BandStop and the Adder.]
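The two fusion rules can be read as a single pass over a pipeline. The Python sketch below is one possible reading and not the StreamIt compiler's algorithm: a filter joins the current fused group only if it is stateless, does not peek, and the group holds no stateful filter. The Filter class and the coarsen function are hypothetical names, and the peek flags follow the Peek labels on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    name: str
    stateful: bool = False
    peeks: bool = False


def coarsen(pipeline):
    """Group adjacent filters of a pipeline into fused segments."""
    groups = []
    for f in pipeline:
        can_extend = (
            bool(groups)
            and not f.stateful                            # keep stateful filters separate
            and not f.peeks                               # a peeking filter never fuses upstream
            and not any(g.stateful for g in groups[-1])   # do not mix with a stateful group
        )
        if can_extend:
            groups[-1].append(f)
        else:
            groups.append([f])
    return groups


pipeline = [
    Filter("BandPass", peeks=True),
    Filter("Compress"),
    Filter("Expand"),
    Filter("BandStop", peeks=True),
]
print([[f.name for f in group] for group in coarsen(pipeline)])
```

For the example pipeline this yields one fused BandPass/Compress/Expand segment followed by a separate BandStop, since BandStop peeks and so cannot be fused with the filters upstream of it.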

24 Phase 2: Data Parallelize. Data parallelize for 4 cores: fiss 4 ways, to occupy the entire chip. [Figure: the coarsened graph with the two fused BandPass/Compress/Expand filters and the two BandStop filters, and with the Adder replicated.]

25 Phase 2: Data Parallelize. Data parallelize for 4 cores. Task parallelism! Each fused filter does equal work, so fiss each filter 2 times to occupy the entire chip. [Figure: each fused BandPass/Compress/Expand filter replicated twice, followed by the BandStop filters and the replicated Adder.]

26 Phase 2: Data Parallelize. Data parallelize for 4 cores. Task-conscious data parallelization: preserve task parallelism. The BandStop filters are task parallel and each does equal work, so fiss each filter 2 times to occupy the entire chip. Benefits: reduces global communication and synchronization. [Figure: each fused BandPass/Compress/Expand filter and each BandStop filter replicated twice, followed by the replicated Adder.]
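The arithmetic behind "fiss each filter 2 times" can be sketched as follows, under the slide's assumption that the task-parallel filters in a stage do equal work. The fission_factors function and the stage names are hypothetical; the heuristic in the actual compiler may weigh work estimates differently.

```python
def fission_factors(cores, stages):
    """Map each stage to the fission factor that fills the chip.

    stages maps a stage name to the number of task-parallel filters of
    equal work in that stage (an assumption taken from the slide).
    """
    return {name: max(1, cores // width) for name, width in stages.items()}


stages = {"BandPass+Compress+Expand": 2, "BandStop": 2, "Adder": 1}
factors = fission_factors(4, stages)
print(factors)

# Copies per stage after task-conscious fission versus fine-grained fission.
print({name: width * factors[name] for name, width in stages.items()})
print({name: width * 4 for name, width in stages.items()})
```

With 4 cores, the two task-parallel stages are fissed 2 ways and the lone Adder 4 ways, so every stage ends up with 4 copies. Fissing every filter 4 ways regardless of the existing task parallelism would instead create 8 copies in the two-wide stages, with the extra scatter/gather traffic that implies.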

27 Evaluation: Coarse-Grained Data Parallelism. Good parallelism! Low synchronization! [Figure: bar chart comparing Task, Fine-Grained Data, and Coarse-Grained Task + Data; throughput normalized to single-core StreamIt for the twelve benchmarks and their geometric mean.]

28 Simplified Vocoder: Target a 4-Core Machine. [Figure: stream graph annotated with work estimates. Two AdaptDFT filters (6 each) feed RectPolar (20, data parallel), followed by two parallel Unwrap (2 each), Diff, Amplify, Accum pipelines (data parallel, but too little work!), followed by PolarRect (20, data parallel).]

29 Target a 4-Core Machine: Data Parallelize. [Figure: the vocoder graph after fission. RectPolar and PolarRect are each replicated 4 ways, reducing the work per copy from 20 to 5; the AdaptDFT (6), Unwrap (2), Diff, Amplify, and Accum filters are unchanged.]

30 Target a 4-Core Machine: Data + Task Parallel Execution. [Figure: execution timeline with cores on one axis and time on the other, showing the RectPolar copies (5 each).]

31 Target a 4-Core Machine: We Can Do Better! [Figure: execution timeline with cores on one axis and time on the other, showing the RectPolar copies (5 each).]

32 Phase 3: Coarse-Grained Software Pipelining. After a prologue, the new steady state is free of dependencies. Schedule the new steady state using a greedy partitioning. [Figure: the prologue followed by the new steady state, showing the RectPolar copies.]

33 Target a 4-Core Machine: Greedy Partitioning. [Figure: the filters to schedule are assigned to cores, with cores on one axis and time on the other.]
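Once software pipelining has removed the dependencies within a steady state, scheduling reduces to packing independent filters onto cores. The Python sketch below uses a longest-processing-time-first greedy heuristic, which is one common form of greedy partitioning and not necessarily the one used by the authors. The work estimates of 6, 5, and 2 follow the vocoder figure; the value of 1 for Diff, Amplify, and Accum is a hypothetical placeholder, since those labels are not legible in the slides.

```python
import heapq


def greedy_partition(work, cores):
    """Assign each filter to the least-loaded core, heaviest filters first."""
    heap = [(0, core) for core in range(cores)]   # (current load, core id)
    heapq.heapify(heap)
    assignment = {core: [] for core in range(cores)}
    for name, cost in sorted(work.items(), key=lambda kv: -kv[1]):
        load, core = heapq.heappop(heap)
        assignment[core].append(name)
        heapq.heappush(heap, (load + cost, core))
    loads = {core: sum(work[n] for n in names) for core, names in assignment.items()}
    return assignment, loads


work = {}
for i in range(2):
    work[f"AdaptDFT{i}"] = 6
    work[f"Unwrap{i}"] = 2
    for name in ("Diff", "Amplify", "Accum"):
        work[f"{name}{i}"] = 1                    # hypothetical placeholder estimates
for i in range(4):
    work[f"RectPolar{i}"] = 5                     # 20 units of work fissed 4 ways
    work[f"PolarRect{i}"] = 5

assignment, loads = greedy_partition(work, 4)
print(loads)
print(max(loads.values()))                        # steady-state time per iteration
```

Under these assumed estimates the total work is 62 units, so no 4-core schedule can finish a steady-state iteration in fewer than 16 time units, and the greedy assignment reaches that bound.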

34 Evaluation: Coarse-Grained Task + Data + Software Pipelining. Best parallelism! Lowest synchronization! [Figure: bar chart comparing Task, Fine-Grained Data, Coarse-Grained Task + Data, and Coarse-Grained Task + Data + Software Pipeline; throughput normalized to single-core StreamIt for BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, and Radar, plus the geometric mean.]

35 Next: Scalable Stream Representation. Data parallelism and pipeline parallelism. [Figure: mappings for 4 tiles, 16 tiles, and 64 tiles.]

36 Conclusions. Computer architecture is at a crossroads: a once-in-a-lifetime opportunity to redesign from scratch. How can the Moore's law gains be used to improve programmability? Switching to multicores without losing the gains in programmer productivity may be the grandest of the grand challenges: half a century of work and still no winning solution, and it will affect everyone! The streaming programming model can break the von Neumann bottleneck, is a natural fit for a large class of applications, and is an ideal machine language for multicores. The compiler can extract explicit and inherent parallelism; parallelism is abstracted away from the architectural details of multicores; speedups are sustainable (5x to 9x on the 16-core Raw). Increased abstraction does not have to sacrifice performance.

Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs. Michael Gordon, William Thies, and Saman Amarasinghe, Massachusetts Institute of Technology. ASPLOS, October 2006, San Jose,