1 MIT OpenCourseWare Multicore Programming Primer, January (IAP) 2007. Please use the following citation format: Rodric Rabbah, Multicore Programming Primer, January (IAP) 2007 (Massachusetts Institute of Technology: MIT OpenCourseWare), (accessed MM DD, YYYY). License: Creative Commons Attribution-Noncommercial-Share Alike. Note: please use the actual date you accessed this material in your citation. For more information about citing these materials or our Terms of Use, visit:
2 6.189 IAP 2007 Lecture 17 The Raw Experience IAP 2007 MIT
3 Raw Chips October IAP 2007 MIT
4 Raw Microprocessor Tiled microprocessor with point-to-point pipelined scalar operand network Each tile is 4 mm x 4 mm MIPS-style compute processor Single-issue 8-stage pipeline 32b FPU 32K D-Cache, I-Cache 4 on-chip mesh networks: two for operands, one for cache misses and I/O, one for message passing IAP 2007 MIT
5 Raw Microprocessor 16 tiles (16-issue) 180 nm ASIC (IBM SA-27E) ~100 million transistors, 1 million gates 3-4 years of development, 1.5 years of testing, 200K lines of test code Core frequency competitive with IBM-implemented PowerPCs in the same process 18W average power IAP 2007 MIT
6 One Cycle in the Life of a Tiled Processor Image by MIT OCW. Application uses as many tiles as needed to exploit its parallelism IAP 2007 MIT
7 Raw Motherboard IAP 2007 MIT
8 Raw in Action IAP 2007 MIT
9 MPEG-2 Encoder Performance [Figure: speedup vs. number of tiles for 350 x 240 and 720 x 480 images; dashed line marks 30 frames/s] Square: linear speedup. Diamond: hand-optimized, slice-parallel implementation. Circle: slice-parallel implementation. Triangle: baseline macroblock-parallel implementation. IAP 2007 MIT
10 MPEG-2 Encoder Performance [Table: encoding rate in frames/s by number of tiles and image size; entries marked * are estimated data rates] IAP 2007 MIT
11 Programmable Graphics Pipeline Input -> Vertex -> Sync -> Triangle Setup -> Pixel (simplified graphics pipeline) IAP 2007 MIT
12 Phong Shading Per-pixel phong-shaded polyhedron 162 vertices, 1 light Output, rendered using Raw simulator IAP 2007 MIT
13 Phong Shading (64 tiles) Reconfigurable pipeline vs. fixed pipeline: 33% faster, 150% better utilization IAP 2007 MIT
14 Shadow Volumes 4 textured triangles 1 point light Rendered in 3 passes Output, rendered using Raw simulator IAP 2007 MIT
15 Shadow Volumes (64 tiles) Fixed pipeline (Pass 1, Pass 2, Pass 3) vs. reconfigurable pipeline (Pass 1, Pass 2, Pass 3): 40% faster IAP 2007 MIT
16 1020 Element Microphone Array IAP 2007 MIT
17 Case Study: Beamformer [Bar chart: MFLOPS for a C program on a 1 GHz Pentium III, a C program on a 420 MHz single-tile Raw, unoptimized StreamIt on a 420 MHz 16-tile Raw, and optimized StreamIt on a 420 MHz 64-tile Raw; the best configuration reaches 1,430 MFLOPS] IAP 2007 MIT
18 The Raw Experience Insights into the design Raw architecture Raw parallelizing compiler StreamIt language and Compiler IAP 2007 MIT
19 Scalability Problems in Wide Issue Processors Wide Fetch (16 inst) Control Bypass Net Unified Load/Store Queue IAP 2007 MIT
20 Area and Frequency Scalability Problems [Diagram: centralized structures such as the bypass net grow as ~N^3 and ~N^2 in the issue width N] IAP 2007 MIT
21 Operand Routing is Global + >> Bypass Net IAP 2007 MIT
22 Idea: Make Operand Routing Local Bypass Net IAP 2007 MIT
23 Idea: Exploit Locality Bypass Net IAP 2007 MIT
24 Replace Crossbar with Point-To-Point Network IAP 2007 MIT
25 Replace Crossbar with Point-To-Point Network + >> IAP 2007 MIT
26 Operand Transport Latency [Table: crossbar vs. point-to-point network under non-local and locality-driven placement; crossbar latency grows as ~N, the point-to-point network as ~N^1/2 with non-local placement, and lower still with locality-driven placement] IAP 2007 MIT
27 Distribute the Register File IAP 2007 MIT
28 More Scalability Problems Wide Fetch (16 inst) Control SCALABLE Unified Load/Store Queue IAP 2007 MIT
29 More Scalability Problems Wide Fetch (16 inst) Control Unified Load/Store Queue IAP 2007 MIT
30 Distribute Everything Wide Fetch (16 inst) Control Unified Load/Store Queue IAP 2007 MIT
31 Tiled Processor IAP 2007 MIT
32 Tiled Processor IAP 2007 MIT
33 Tiled Processor (Taylor PhD 2007) Fast inter-tile communication through point-to-point pipelined scalar operand network (SON) Easy to scale for the same reasons as multicores IAP 2007 MIT
34 Raw Compute Processor Internals [Pipeline diagram: fetch, decode, execute, memory, and writeback stages, with the network ports mapped onto registers r24-r27] IAP 2007 MIT
35 Tile-Tile Communication add $25,$1,$2 Route $P->$E Route $W->$P sub $20,$1,$ IAP 2007 MIT
36 Why Communication Is Expensive on Multicores [Diagram: total cost between two multiprocessor nodes = send overhead (send occupancy + send latency) + transport cost + receive overhead (receive occupancy + receive latency)] IAP 2007 MIT
37 Why Communication Is Expensive on Multicores [Diagram, send side: send occupancy covers assembling the message (destination node name, sequence number, value) and the launch sequence; send latency covers commit latency and network injection] IAP 2007 MIT
38 Why Communication Is Expensive on Multicores [Diagram, receive side: receive occupancy covers the receive sequence and injection cost, on top of receive latency] IAP 2007 MIT
39 A Figure of Merit for SONs 5-tuple <SO, SL, NHL, RL, RO> Send occupancy Send latency Network hop latency Receive latency Receive occupancy Tip: Ordering follows timing of message from sender to receiver IAP 2007 MIT
40 The Interesting Region Scalable multiprocessor (on-chip): <2, 14, 3, 14, 4> Raw SON (scalable): <0, 0, 1, 2, 0> Superscalar (not scalable): <0, 0, 0, 0, 0> Where is Cell in this space? IAP 2007 MIT
41 The Raw Experience Insights into the design Raw architecture Raw parallelizing compiler IAP 2007 MIT
42 Raw Parallelizing Compiler (Lee PhD 2005) Sequential program -> compiler -> fine-grained orchestrated parallel program. Compiler phases: data distribution, instruction distribution, coordination (communication, control flow) IAP 2007 MIT
43 Data Distribution load r1 <- addr; add r1 <- r1, 1 [Diagram: which of the distributed memory banks holds addr?] IAP 2007 MIT
44 Instruction Distribution seed.0=seed pval1=seed.0*3.0 pval0=pval1+2.0 tmp0.1=pval0/2.0 tmp0=tmp0.1 v1.2=v1 pval2=seed.0*v1.2 tmp1.3=pval2+2.0 tmp1=tmp1.3 pval7=tmp1.3+tmp2.5 v2.4=v2 pval5=seed.0*6.0 pval3=seed.0*v2.4 pval4=pval5+2.0 tmp2.5=pval3+2.0 tmp3.6=pval4/3.0 tmp2=tmp2.5 tmp3=tmp3.6 pval6=tmp1.3-tmp2.5 v1.8=pval7*3.0 v0.9=tmp0.1-v1.8 v1=v1.8 v0=v0.9 v2.7=pval6*5.0 v2=v2.7 v3.10=tmp3.6-v2.7 v3=v3.10 basic block IAP 2007 MIT
45 Clustering: Parallelism vs. Communication [Figure: the basic block from the previous slide shown twice, partitioned into clusters in two different ways, trading exposed parallelism against inter-cluster communication] IAP 2007 MIT
46 Adjusting Granularity: Load Balancing [Figure: the same basic block re-partitioned; clusters are merged or split so the work is balanced across tiles] IAP 2007 MIT
47 Placement [Figure: the clusters of the example basic block assigned to specific tiles on the mesh] IAP 2007 MIT
48 Communication Coordination [Diagram: compute processors and switch processors; the compiler generates matching send, route, and receive instructions] IAP 2007 MIT
49 Instruction Scheduling [Figure: per-tile compute instructions interleaved with switch route(...) instructions for the example basic block] Inter-tile cycle scheduling schedules communication and can guarantee deadlock freedom IAP 2007 MIT
50 Final Code Representation [Figure: the final per-tile instruction streams, with compute instructions and switch route(...) operations fully scheduled] IAP 2007 MIT
51 Control Coordination IAP 2007 MIT
52 Asynchronous Global Branching [Diagram: the branch condition is distributed over the operand network and every tile executes its own br x, so tiles branch without a global synchronization point] IAP 2007 MIT
53 Summary Tiled microprocessors incorporate the best elements of superscalars and multiprocessors. PE-PE communication: superscalar free, multicore expensive, tiled processor with SON cheap. Exploitation of parallelism: superscalar implicit, multicore explicit, tiled both. Clean semantics: superscalar yes, multicore no, tiled yes. Scalable: superscalar no, multicore yes, tiled yes. Power efficient: superscalar no, multicore yes, tiled yes. IAP 2007 MIT
54 Raw Project Contributors Anant Agarwal Saman Amarasinghe Jonathan Babb Rajeev Barua Ian Bratt Jonathan Eastep Matt Frank Ben Greenwald Henry Hoffmann Paul Johnson Jason Kim Theo Konstantakopoulos Walter Lee Jason Miller James Psota Arvind Saraf Vivek Sarkar Nathan Shnidman Volker Strumpen Michael Taylor Elliot Waingold David Wentzlaff IAP 2007 MIT
More informationFinal Report: DBmbench
18-741 Final Report: DBmbench Yan Ke (yke@cs.cmu.edu) Justin Weisz (jweisz@cs.cmu.edu) Dec. 8, 2006 1 Introduction Conventional database benchmarks, such as the TPC-C and TPC-H, are extremely computationally
More informationArchitecting Systems of the Future, page 1
Architecting Systems of the Future featuring Eric Werner interviewed by Suzanne Miller ---------------------------------------------------------------------------------------------Suzanne Miller: Welcome
More informationHeterogeneous Concurrent Error Detection (hced) Based on Output Anticipation
International Conference on ReConFigurable Computing and FPGAs (ReConFig 2011) 30 th Nov- 2 nd Dec 2011, Cancun, Mexico Heterogeneous Concurrent Error Detection (hced) Based on Output Anticipation Naveed
More informationRANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM
RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM Fengbin Tu, Weiwei Wu, Shouyi Yin, Leibo Liu, Shaojun Wei Institute of Microelectronics Tsinghua University The 45th International
More informationRelative Timing Driven Multi-Synchronous Design: Enabling Order-of-Magnitude Energy Reduction
Relative Timing Driven Multi-Synchronous Design: Enabling Order-of-Magnitude Energy Reduction Kenneth S. Stevens University of Utah Granite Mountain Technologies 27 March 2013 UofU and GMT 1 Learn from
More informationPower Issues with Embedded Systems. Rabi Mahapatra Computer Science
Power Issues with Embedded Systems Rabi Mahapatra Computer Science Plan for today Some Power Models Familiar with technique to reduce power consumption Reading assignment: paper by Bill Moyer on Low-Power
More informationGetting to Work with OpenPiton. Princeton University. OpenPit
Getting to Work with OpenPiton Princeton University http://openpiton.org OpenPit ASIC SYNTHESIS AND BACKEND 2 Whats in the Box? Synthesis Synopsys Design Compiler Static timing analysis (STA) Synopsys
More informationFlexibility, Speed and Accuracy in VLIW Architectures Simulation and Modeling
Flexibility, Speed and Accuracy in VLIW Architectures Simulation and Modeling IVANO BARBIERI, MASSIMO BARIANI, ALBERTO CABITTO, MARCO RAGGIO Department of Biophysical and Electronic Engineering University
More informationHighly Versatile DSP Blocks for Improved FPGA Arithmetic Performance
2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines Highly Versatile DSP Blocks for Improved FPGA Arithmetic Performance Hadi Parandeh-Afshar and Paolo Ienne Ecole
More informationOut-of-Order Execution. Register Renaming. Nima Honarmand
Out-of-Order Execution & Register Renaming Nima Honarmand Out-of-Order (OOO) Execution (1) Essence of OOO execution is Dynamic Scheduling Dynamic scheduling: processor hardware determines instruction execution
More informationOn the Rules of Low-Power Design
On the Rules of Low-Power Design (and Why You Should Break Them) Prof. Todd Austin University of Michigan austin@umich.edu A long time ago, in a not so far away place The Rules of Low-Power Design P =
More informationMulti-Channel FIR Filters
Chapter 7 Multi-Channel FIR Filters This chapter illustrates the use of the advanced Virtex -4 DSP features when implementing a widely used DSP function known as multi-channel FIR filtering. Multi-channel
More informationNeuromorphic Computing based Processors
Neuromorphic Computing based Processors Hao Jiang A collaborative research among San Francisco State University, EI-Lab at University of Pittsburgh, HP Labs, and AFRL Outline Why Neuromorphic Computing?
More informationProc. IEEE Intern. Conf. on Application Specific Array Processors, (Eds. Capello et. al.), IEEE Computer Society Press, 1995, 76-84
Proc. EEE ntern. Conf. on Application Specific Array Processors, (Eds. Capello et. al.), EEE Computer Society Press, 1995, 76-84 Session 2: Architectures 77 toning speed is affected by the huge amount
More informationHigh Performance Computing for Engineers
High Performance Computing for Engineers David Thomas dt10@ic.ac.uk / https://github.com/m8pple Room 903 http://cas.ee.ic.ac.uk/people/dt10/teaching/2014/hpce HPCE / dt10/ 2015 / 0.1 High Performance Computing
More informationEECS 470. Lecture 9. MIPS R10000 Case Study. Fall 2018 Jon Beaumont
MIPS R10000 Case Study Fall 2018 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Multiprocessor SGI Origin Using MIPS R10K Many thanks to Prof. Martin and Roth of University of Pennsylvania for
More informationMosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes
Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes Rachata Ausavarungnirun Joshua Landgraf Vance Miller Saugata Ghose Jayneel Gandhi Christopher J. Rossbach Onur
More informationEECS 470 Lecture 5. Intro to Dynamic Scheduling (Scoreboarding) Fall 2018 Jon Beaumont
Intro to Dynamic Scheduling (Scoreboarding) Fall 2018 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Many thanks to Prof. Martin and Roth of University of Pennsylvania for most of these slides.
More informationEE4800 CMOS Digital IC Design & Analysis. Lecture 1 Introduction Zhuo Feng
EE4800 CMOS Digital IC Design & Analysis Lecture 1 Introduction Zhuo Feng 1.1 Prof. Zhuo Feng Office: EERC 730 Phone: 487-3116 Email: zhuofeng@mtu.edu Class Website http://www.ece.mtu.edu/~zhuofeng/ee4800fall2010.html
More informationSIGNED PIPELINED MULTIPLIER USING HIGH SPEED COMPRESSORS
INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 SIGNED PIPELINED MULTIPLIER USING HIGH SPEED COMPRESSORS 1 T.Thomas Leonid, 2 M.Mary Grace Neela, and 3 Jose Anand
More information