MIT OpenCourseWare | http://ocw.mit.edu
6.189 Multicore Programming Primer, January (IAP) 2007

Please use the following citation format: Rodric Rabbah, 6.189 Multicore Programming Primer, January (IAP) 2007 (Massachusetts Institute of Technology: MIT OpenCourseWare), http://ocw.mit.edu (accessed MM DD, YYYY). License: Creative Commons Attribution-Noncommercial-Share Alike.

Note: Please use the actual date you accessed this material in your citation. For more information about citing these materials or our Terms of Use, visit http://ocw.mit.edu/terms

6.189 IAP 2007
Lecture 17: The Raw Experience

Raw Chips, October 2002
[Photo: Raw chips.]

Raw Microprocessor
- Tiled microprocessor with a point-to-point pipelined scalar operand network
- Each tile is 4 mm x 4 mm
  - MIPS-style compute processor: single-issue 8-stage pipe, 32b FPU, 32K D-cache and I-cache
- 4 on-chip mesh networks
  - Two for operands
  - One for cache misses and I/O
  - One for message passing

Raw Microprocessor
- 16 tiles (16-issue)
- 180 nm ASIC (IBM SA-27E)
- ~100 million transistors, 1 million gates
- 3-4 years of development, 1.5 years of testing, 200K lines of test code
- Core frequency: 425 MHz @ 1.8 V, 500 MHz @ 2.2 V
- Frequency competitive with IBM-implemented PowerPCs in the same process
- 18 W average power

One Cycle in the Life of a Tiled Processor
[Image by MIT OCW.]
An application uses as many tiles as needed to exploit its parallelism.

Raw Motherboard

Raw in Action

MPEG-2 Encoder Performance
[Charts: speedup and frames/s vs. number of tiles (1-16), for 352 x 240 and 720 x 480 images. Series: linear speedup (square); hand-optimized, slice-parallel implementation (diamond); slice-parallel implementation (circle); baseline macroblock-parallel implementation (triangle).]

MPEG-2 Encoder Performance

Encoding rate (frames/s):

# Tiles   352 x 240   640 x 480   720 x 480
      1        4.30        1.14        1.00
      2        8.48        2.24        1.97
      4       16.18        4.45        3.84
      8       30.82        8.69        7.52
     16       58.65       16.74       14.57
     32        103*         30*           -
     64        158*       51.90           -

* Estimated data rates

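Reading the table as speedups makes the scaling concrete: each column can be divided by its one-tile rate. A quick check in C (an illustrative helper, not part of the lecture materials), using the 352 x 240 column:

    #include <stdio.h>

    /* Frame rates (frames/s) for 352 x 240 input, transcribed from the
     * table above; tile counts 1..16 are the measured values. */
    static const int    tiles[] = { 1, 2, 4, 8, 16 };
    static const double fps[]   = { 4.30, 8.48, 16.18, 30.82, 58.65 };

    int main(void) {
        int n = sizeof(tiles) / sizeof(tiles[0]);
        for (int i = 0; i < n; i++) {
            double speedup    = fps[i] / fps[0];     /* vs. 1 tile         */
            double efficiency = speedup / tiles[i];  /* fraction of linear */
            printf("%2d tiles: %5.2f frames/s, speedup %5.2f (%.0f%% of linear)\n",
                   tiles[i], fps[i], speedup, 100.0 * efficiency);
        }
        return 0;
    }

At 16 tiles this gives 58.65 / 4.30 = 13.6x, about 85% of linear speedup.
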
Programmable Graphics Pipeline
[Diagram: a simplified graphics pipeline: Input -> Vertex -> Sync -> Triangle Setup -> Pixel.]

Phong Shading
- Per-pixel Phong-shaded polyhedron
- 162 vertices, 1 light
- Output rendered using the Raw simulator

Phong Shading (64 tiles)
- Fixed pipeline vs. reconfigurable pipeline
- Reconfigurable pipeline: 33% faster, 150% better utilization

Shadow Volumes
- 4 textured triangles, 1 point light
- Rendered in 3 passes
- Output rendered using the Raw simulator

Shadow Volumes (64 tiles)
- Fixed pipeline (passes 1-3) vs. reconfigurable pipeline (passes 1-3)
- Reconfigurable pipeline: 40% faster (in cycles)

1020 Element Microphone Array

Case Study: Beamformer
[Bar chart: beamformer throughput in MFLOPS.
- C program, 1 GHz Pentium III: 240
- C program, 420 MHz single-tile Raw: 19
- Unoptimized StreamIt, 420 MHz 64-tile Raw: 640
- Optimized StreamIt, 420 MHz 16-tile Raw: 1,430]

The Raw Experience
Insights into the design:
- Raw architecture
- Raw parallelizing compiler
- StreamIt language and compiler

Scalability Problems in Wide-Issue Processors
[Diagram: monolithic wide-issue datapath: wide fetch (16 inst), control, bypass network, unified load/store queue.]

Area and Frequency Scalability Problems
[Diagram: the same datapath annotated with growth rates in the issue width N: centralized structures growing as ~N^3 and ~N^2, with the bypass network highlighted.]

Operand Routing is Global
[Diagram: an adder (+) and a shifter (>>) exchange operands through the global bypass network.]

Idea: Make Operand Routing Local
[Diagram: bypass network, with operand routing kept local.]

Idea: Exploit Locality
[Diagram: bypass network, with communicating operations placed near each other.]

Replace Crossbar with Point-to-Point Network
[Diagram: the adder (+) and shifter (>>) now exchange operands over a point-to-point network instead of a crossbar.]

Operand Transport Latency

                          Non-local placement   Locality-driven placement
Crossbar                  ~N                    ~N
Point-to-point network    ~N^(1/2)              ~1

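The ~N^(1/2) entry is mesh geometry: with N tiles in a sqrt(N) x sqrt(N) grid and no locality in placement, the average operand travels O(sqrt(N)) hops. A brute-force check (an illustrative sketch, not from the lecture):

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* Average Manhattan distance between two uniformly random tiles
     * on a k x k mesh, computed exactly by enumeration. */
    static double avg_hops(int k) {
        long long total = 0;
        for (int x1 = 0; x1 < k; x1++)
            for (int y1 = 0; y1 < k; y1++)
                for (int x2 = 0; x2 < k; x2++)
                    for (int y2 = 0; y2 < k; y2++)
                        total += abs(x1 - x2) + abs(y1 - y2);
        return (double)total / ((double)k * k * k * k);
    }

    int main(void) {
        for (int k = 2; k <= 16; k *= 2) {
            int n = k * k;   /* N tiles */
            printf("N = %3d tiles: avg %5.2f hops (%.2f x sqrt(N))\n",
                   n, avg_hops(k), avg_hops(k) / sqrt(n));
        }
        return 0;
    }

The ratio settles near 2/3, i.e. average distance is about (2/3)sqrt(N); locality-driven placement collapses this to a constant by putting communicating instructions on adjacent tiles.
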
Distribute the Register File

More Scalability Problems
[Diagram: the distributed operand network is now SCALABLE, but wide fetch (16 inst), control, and the unified load/store queue remain centralized.]

Distribute Everything
[Diagram: wide fetch (16 inst), control, and the unified load/store queue distributed across tiles.]

Tiled Processor
[Diagram: a tiled processor: an array of identical tiles.]

Tiled Processor (Taylor PhD 2007)
- Fast inter-tile communication through a point-to-point pipelined scalar operand network (SON)
- Easy to scale, for the same reasons as multicores

Raw Compute Processor Internals
[Diagram: compute processor pipeline (stages labeled IF, D, E, M1, M2, A, TL, F, P, U) with network-mapped registers r24-r27 on both the input and output sides of the datapath.]

Tile-Tile Communication

    Sending tile, compute processor:    add $25, $1, $2    (writing $25 injects the result into the static network)
    Sending tile, switch:               route $P->$E
    Receiving tile, switch:             route $W->$P
    Receiving tile, compute processor:  sub $20, $1, $25   (reading $25 pulls the operand off the network)

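The key property is that $25 behaves like a blocking FIFO port, so a send is a single register write and a receive is a plain operand read. As a rough software model of that handshake (a sketch only: the pthread-based one-slot queue below stands in for the hardware network and is not how Raw implements it):

    #include <pthread.h>
    #include <stdio.h>

    /* One-slot blocking FIFO modeling a register-mapped network port:
     * writes block while full, reads block while empty. */
    static int slot;
    static int full = 0;
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

    static void port_write(int v) {            /* like writing $25 */
        pthread_mutex_lock(&m);
        while (full) pthread_cond_wait(&c, &m);
        slot = v; full = 1;
        pthread_cond_signal(&c);
        pthread_mutex_unlock(&m);
    }

    static int port_read(void) {               /* like reading $25 */
        pthread_mutex_lock(&m);
        while (!full) pthread_cond_wait(&c, &m);
        int v = slot; full = 0;
        pthread_cond_signal(&c);
        pthread_mutex_unlock(&m);
        return v;
    }

    static void *sender(void *arg) {
        (void)arg;
        port_write(3 + 4);                     /* add $25, $1, $2  */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, sender, NULL);
        int r = 10 - port_read();              /* sub $20, $1, $25 */
        pthread_join(t, NULL);
        printf("received result: %d\n", r);
        return 0;
    }
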
Why Communication Is Expensive on Multicores
[Diagram: a message travels from multiprocessor node 1 to node 2. End-to-end cost = send occupancy and send latency (send overhead) + transport cost + receive latency and receive occupancy (receive overhead).]

Why Communication Is Expensive on Multicores
[Diagram: the send side on node 1. Send occupancy: assembling the message (destination node name, sequence number, value) and the launch sequence. Send latency: commit latency plus network injection.]

Why Communication Is Expensive on Multicores
[Diagram: the receive side on node 2. Receive latency and receive occupancy: the receive sequence and injection cost.]

A Figure of Merit for SONs
The 5-tuple <SO, SL, NHL, RL, RO>:
- Send occupancy (SO)
- Send latency (SL)
- Network hop latency (NHL)
- Receive latency (RL)
- Receive occupancy (RO)
Tip: the ordering follows the timing of a message from sender to receiver.

The Interesting Region
- Scalable multiprocessor (on-chip): <2, 14, 3, 14, 4>
- Raw SON (scalable): <0, 0, 1, 2, 0>
- Superscalar (not scalable): <0, 0, 0, 0, 0>
- Where is Cell in this space?

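One simple way to compare these points is to read the 5-tuple additively: an operand that travels h hops costs roughly SO + SL + h*NHL + RL + RO cycles end to end. A small calculator (the additive model and the 5-hop example are illustrative assumptions, not from the lecture):

    #include <stdio.h>

    /* 5-tuple <SO, SL, NHL, RL, RO> from the slide above. */
    struct son { const char *name; int so, sl, nhl, rl, ro; };

    /* One simple additive reading of the tuple: total cycles for an
     * operand to travel `hops` network hops, end to end. */
    static int operand_cost(struct son s, int hops) {
        return s.so + s.sl + s.nhl * hops + s.rl + s.ro;
    }

    int main(void) {
        struct son systems[] = {
            { "On-chip multiprocessor", 2, 14, 3, 14, 4 },
            { "Raw SON",                0,  0, 1,  2, 0 },
            { "Superscalar",            0,  0, 0,  0, 0 },
        };
        for (int i = 0; i < 3; i++)
            printf("%-24s %3d cycles for a 5-hop operand\n",
                   systems[i].name, operand_cost(systems[i], 5));
        return 0;
    }

With these numbers, a 5-hop operand costs 49 cycles on the multiprocessor but only 7 on Raw, which is why fine-grained parallelism is practical on a SON.
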
The Raw Experience
Insights into the design:
- Raw architecture
- Raw parallelizing compiler

Raw Parallelizing Compiler (Lee PhD 2005)
Sequential program -> compiler -> fine-grained, orchestrated parallel program
The compiler performs:
- Data distribution
- Instruction distribution
- Coordination of communication and control flow

Data Distribution

    load r1 <- addr
    add  r1 <- r1, 1

[Diagram: four tile memories (M); to which tile's memory does addr map?]

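For the compiler to route this load statically, it must be able to name the tile that owns addr. A minimal sketch of one such mapping, word-level low-order interleaving across four banks (an assumed scheme for illustration; the lecture does not specify the mapping):

    #include <stdint.h>
    #include <stdio.h>

    /* Word-level low-order interleaving across 4 memory banks:
     * the bank is chosen by the address bits just above the byte offset. */
    static unsigned home_tile(uint32_t addr) {
        return (addr >> 2) & 0x3;   /* 4-byte words, 4 banks */
    }

    int main(void) {
        uint32_t addrs[] = { 0x1000, 0x1004, 0x1008, 0x100c, 0x1010 };
        for (int i = 0; i < 5; i++)
            printf("addr 0x%04x -> tile %u\n", addrs[i], home_tile(addrs[i]));
        return 0;
    }

Consecutive words then rotate across the four tiles; when addr cannot be resolved at compile time, the compiler must instead place the data so that the accessing tile owns it, which is exactly the data-distribution problem.
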
Instruction Distribution
A basic block in three-address form (the unit the compiler distributes):

    seed.0 = seed
    pval1 = seed.0 * 3.0
    pval0 = pval1 + 2.0
    tmp0.1 = pval0 / 2.0
    tmp0 = tmp0.1
    v1.2 = v1
    pval2 = seed.0 * v1.2
    tmp1.3 = pval2 + 2.0
    tmp1 = tmp1.3
    v2.4 = v2
    pval3 = seed.0 * v2.4
    tmp2.5 = pval3 + 2.0
    tmp2 = tmp2.5
    pval5 = seed.0 * 6.0
    pval4 = pval5 + 2.0
    tmp3.6 = pval4 / 3.0
    tmp3 = tmp3.6
    pval6 = tmp1.3 - tmp2.5
    pval7 = tmp1.3 + tmp2.5
    v1.8 = pval7 * 3.0
    v2.7 = pval6 * 5.0
    v0.9 = tmp0.1 - v1.8
    v3.10 = tmp3.6 - v2.7
    v0 = v0.9
    v1 = v1.8
    v2 = v2.7
    v3 = v3.10

Clustering: Parallelism vs. Communication
[Diagram: the dependence graph of the basic block above, shown before and after clustering; instructions are grouped to expose parallelism while keeping communicating instructions in the same cluster.]

Adjusting Granularity: Load Balancing
[Diagram: the same dependence graph with clusters merged or split so that each tile receives a comparable amount of work.]

Placement
[Diagram: the clusters assigned to specific tiles; communicating clusters are placed on adjacent tiles to shorten operand routes.]

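Placement can be viewed as minimizing total operand-route length: every dependence edge that crosses tiles pays its Manhattan distance in hops. A toy cost function (illustrative only; the real compiler's objective also models contention and the schedule):

    #include <stdio.h>
    #include <stdlib.h>

    /* A dependence edge between two instruction clusters. */
    struct edge { int src, dst; };

    /* Cost of a placement: every inter-tile operand pays its Manhattan
     * distance in network hops (contention and scheduling ignored). */
    static int placement_cost(const struct edge *e, int n,
                              const int *x, const int *y) {
        int cost = 0;
        for (int i = 0; i < n; i++)
            cost += abs(x[e[i].src] - x[e[i].dst])
                  + abs(y[e[i].src] - y[e[i].dst]);
        return cost;
    }

    int main(void) {
        /* Four clusters; edges loosely mirror the example's dataflow. */
        struct edge e[] = { {0,1}, {0,2}, {0,3}, {1,3}, {2,3} };

        /* Scattered across the corners of a 4x4 mesh... */
        int xs[] = {0, 3, 0, 3}, ys[] = {0, 0, 3, 3};
        /* ...vs. packed into a 2x2 block of adjacent tiles. */
        int xp[] = {0, 1, 0, 1}, yp[] = {0, 0, 1, 1};

        printf("scattered placement: %d hops\n", placement_cost(e, 5, xs, ys));
        printf("packed placement:    %d hops\n", placement_cost(e, 5, xp, yp));
        return 0;
    }

Packing the communicating clusters into a 2x2 block costs 6 hops total versus 18 for the scattered placement.
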
Communication Coordination
[Diagram: compute processors and switch processors on neighboring tiles; the sender's send, the switches' routes, and the receiver's receive are coordinated instruction streams.]

Instruction Scheduling
[Diagram: the example's compute and route instructions scheduled cycle by cycle across the tiles' processor and switch instruction streams: send, recv, and route(dir,dir) operations interleaved with the computation.]
Inter-tile cycle scheduling schedules the communication and can guarantee deadlock freedom.

Final Code Representation
[Listing: the final per-tile code: each compute processor's stream (computation plus send/recv) alongside its switch processor's stream of route instructions, as scheduled above.]

Control Coordination

Asynchronous Global Branching
[Diagram: every tile executes its own copy of br x; the condition x is communicated over the network, so the tiles branch asynchronously rather than waiting on a centralized branch unit.]

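Concretely, the compiler turns one global branch into a local branch on every tile, predicated on a condition value that arrives over the operand network. A per-tile control-flow sketch in C (recv_operand() is a hypothetical stand-in for reading the network-mapped register):

    #include <stdio.h>

    /* Hypothetical blocking read of the branch condition from the
     * operand network (stand-in for a network-mapped register read). */
    static int recv_operand(void) { return 1; }

    /* Each tile runs this same skeleton: no tile waits for a central
     * branch unit; it branches as soon as the condition reaches it. */
    static void tile_main(int tile_id) {
        int x = recv_operand();   /* condition computed on one tile,  */
        if (x) {                  /* forwarded to all the others      */
            printf("tile %d: taken path\n", tile_id);       /* br x  */
        } else {
            printf("tile %d: fall-through path\n", tile_id);
        }
    }

    int main(void) {
        for (int t = 0; t < 4; t++)   /* sequential stand-in for 4 tiles */
            tile_main(t);
        return 0;
    }
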
Summary
Tiled microprocessors incorporate the best elements of superscalars and multiprocessors:

                             Superscalar   Multicore   Tiled Processor with SON
PE-PE communication          Free          Expensive   Cheap
Exploitation of parallelism  Implicit      Explicit    Both
Clean semantics              Yes           No          Yes
Scalable                     No            Yes         Yes
Power efficient              No            Yes         Yes

Raw Project Contributors
Anant Agarwal, Saman Amarasinghe, Jonathan Babb, Rajeev Barua, Ian Bratt, Jonathan Eastep, Matt Frank, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, Theo Konstantakopoulos, Walter Lee, Jason Miller, James Psota, Arvind Saraf, Vivek Sarkar, Nathan Shnidman, Volker Strumpen, Michael Taylor, Elliot Waingold, David Wentzlaff