Instructor: Randy H. Katz, http://inst.eecs.berkeley.edu/~cs61c/fa13. Fall 2013, Lecture #20. Warehouse Scale Computer

CS 61C: Great Ideas in Computer Architecture
Control and Pipelining
Instructor: Randy H. Katz
http://inst.eecs.berkeley.edu/~cs61c/fa13
11/5/13, Fall 2013, Lecture #20

You Are Here!
Software:
- Parallel Requests: assigned to computer, e.g., search "Katz"
- Parallel Threads: assigned to core, e.g., lookup, ads
- Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
- Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words
- Hardware descriptions: all gates @ one time
- Programming Languages
Harness Parallelism & Achieve High Performance
Hardware:
- Warehouse Scale Computer, Smart Phone
- Computer: Core ... Core, Memory (Cache), Input/Output
- Core: Instruction Unit(s), Functional Unit(s) (A0+B0, A1+B1, A2+B2, A3+B3)
- Main Memory
- Logic Gates
(Figure label: Today's Lecture)

Levels of Representation/Interpretation
- High Level Language Program (e.g., C)
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
- Compiler
- Assembly Language Program (e.g., MIPS)
    lw $t0, 0($2)
    lw $t1, 4($2)
    sw $t1, 0($2)
    sw $t0, 4($2)
- Assembler
- Machine Language Program (MIPS)
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111
- Machine Interpretation
- Hardware Architecture Description (e.g., block diagrams)
- Architecture Implementation
- Logic Circuit Description (Circuit Schematic Diagrams)
Anything can be represented as a number, i.e., data or instructions!

Instruction Level Parallelism (ILP)
- Another parallelism form to go with Request Level Parallelism and Data Level Parallelism
  - RLP, e.g., Warehouse Scale Computing
  - DLP, e.g., SIMD, Map-Reduce
- ILP, e.g., Pipelined Instruction Execution
  - 5 stage pipeline => 5 instructions executing simultaneously, one at each pipeline stage

Agenda
- Pipelined Execution
- Pipelined Datapath
- Structural and Data Hazards
- Control Hazards

Review: Single-Cycle Processor
Five steps to design a processor:
1. Analyze instruction set -> datapath requirements
2. Select set of datapath components & establish clock methodology
3. Assemble datapath meeting the requirements: re-examine for pipelining
4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer
5. Assemble the control logic
   - Formulate Logic Equations
   - Design Circuits
[Figure: Processor (Control, Datapath), Memory, Input, Output]

Pipeline Analogy: Doing Laundry
- Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, fold, and put away
  - Washer takes 30 minutes
  - Dryer takes 30 minutes
  - Folder takes 30 minutes
  - Stasher takes 30 minutes to put clothes into drawers

Sequential Laundry
[Figure: task order A, B, C, D against time from 6 PM to 2 AM; each load runs its four 30-minute stages before the next load starts.]
- Sequential laundry takes 8 hours for 4 loads

Pipelined Laundry
[Figure: task order A, B, C, D against time from 6 PM; the loads overlap, with a new load entering the washer every 30 minutes.]
- Pipelined laundry takes 3.5 hours for 4 loads!

Pipelining Lessons (1/2)
[Figure: pipelined laundry, task order A, B, C, D against time from 6 PM, seven 30-minute slots.]
- Pipelining doesn't help latency of single task, it helps throughput of entire workload
- Multiple tasks operating simultaneously using different resources
- Potential speedup = Number pipe stages (4 in this case)
- Time to fill pipeline and time to drain it reduces speedup: 8 hours/3.5 hours or 2.3X v. potential 4X in this example

Pipelining Lessons (2/2)
[Figure: same pipelined laundry diagram.]
- Suppose new Washer takes 20 minutes, new Stasher takes 20 minutes. How much faster is pipeline?
- Pipeline rate limited by slowest pipeline stage
- Unbalanced lengths of pipe stages reduces speedup
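The laundry arithmetic above can be checked with a short sketch. It assumes a lockstep pipeline in which every slot lasts as long as the slowest stage; the function names are illustrative and not from the lecture.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Assumption: lockstep pipeline, every slot lasts as long as the slowest stage."""
    slot = max(stage_minutes)
    # The first load needs one slot per stage; each later load adds one slot.
    return (len(stage_minutes) + loads - 1) * slot


stages = [30, 30, 30, 30]  # wash, dry, fold, stash
seq = sequential_minutes(stages, 4)
pipe = pipelined_minutes(stages, 4)
print(seq / 60, pipe / 60, round(seq / pipe, 1))  # 8.0 3.5 2.3

# A faster washer and stasher do not shorten the slot: the 30-minute stages still set the rate.
print(pipelined_minutes([20, 30, 30, 20], 4))  # 210
```

With many loads the fill and drain time is amortized, so `sequential_minutes(stages, n) / pipelined_minutes(stages, n)` approaches the stage count of 4 as `n` grows.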

Review: RISC Design Principles
- "A simpler core is a faster core"
- Reduction in the number and complexity of instructions in the ISA -> simplifies pipelined implementation
- Common RISC strategies:
  - Fixed instruction length, generally a single word (MIPS = 32b); simplifies process of fetching instructions from memory
  - Simplified addressing modes (MIPS just register + offset); simplifies process of fetching operands from memory
  - Fewer and simpler instructions in the instruction set; simplifies process of executing instructions
  - Simplified memory access: only load and store instructions access memory
  - Let the compiler do it. Use a good compiler to break complex high-level language statements into a number of simple assembly language statements

Review: Single Cycle Datapath
[Figure: single-cycle datapath executing a store, Data Memory{R[rs] + SignExt[imm16]} = R[rt]. Instruction fields: op (bits 31-26), rs (25-21), rt (20-16), immediate (15-0). Components: instruction fetch unit, RegFile (Rw, Ra, Rb; busA, busB, busW), Extender, ALU, Data Memory (WrEn, Adr, Data In). Control signals: nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg.]

Steps in Executing MIPS
1) IF: Instruction Fetch, Increment PC
2) ID: Instruction Decode, Read Registers
3) EX: Execution
   - Mem-ref: Calculate Address
   - Arith-log: Perform Operation
4) Mem:
   - Load: Read Data from Memory
   - Store: Write Data to Memory
5) WB: Write Data Back to Register

Redrawn Single-Cycle Datapath
[Figure: PC, +4, instruction memory, registers (rs, rt, rd), imm, ALU, Data memory, divided into five stages: 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Write Back]

Pipelined Datapath
[Figure: the same datapath and five stages, with registers added between stages]
- Add registers between stages
  - Hold information produced in previous cycle
- 5 stage pipeline; clock rate potential 5X faster

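More Detailed Pipeline
[Figure: detailed pipelined datapath]
- Registers named for adjacent stages, e.g., IF/ID

IF for Load, Store, ...
[Figure: detailed pipelined datapath with the IF stage highlighted]
- Highlight combinational logic components used + right half of state logic on read, left half on write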

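ID for Load, Store, ...
[Figure: detailed pipelined datapath with the ID stage highlighted]

EX for Load
[Figure: detailed pipelined datapath with the EX stage highlighted]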

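MEM for Load
[Figure: detailed pipelined datapath with the MEM stage highlighted]

WB for Load
[Figure: detailed pipelined datapath with the WB stage highlighted]
- Has Bug that was in 1st edition of textbook!
- Wrong register number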

Corrected Datapath for Load
[Figure: detailed pipelined datapath with the write-back path fixed]
- Correct register number

Pipelined Execution Representation
Time ->
IF  ID  EX  Mem WB
    IF  ID  EX  Mem WB
        IF  ID  EX  Mem WB
            IF  ID  EX  Mem WB
                IF  ID  EX  Mem WB
                    IF  ID  EX  Mem WB
- Every instruction must take same number of steps, also called pipeline "stages", so some will go idle sometimes
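The staggered representation above can be generated mechanically. This is a minimal sketch assuming one instruction enters the pipeline per cycle with no stalls; `pipeline_rows` and `total_cycles` are illustrative names.

```python
STAGES = ["IF", "ID", "EX", "Mem", "WB"]


def pipeline_rows(num_instructions, stages=STAGES):
    """One text row per instruction, shifted right by one stage per cycle of delay."""
    width = max(len(s) for s in stages) + 1
    rows = []
    for i in range(num_instructions):
        cells = [""] * i + list(stages)  # instruction i starts i cycles late
        rows.append("".join(c.ljust(width) for c in cells).rstrip())
    return rows


def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Assumption: no stalls, so the last instruction finishes num_stages - 1 cycles after it starts."""
    return num_stages + num_instructions - 1


for row in pipeline_rows(3):
    print(row)
print(total_cycles(6))  # 10
```

Six instructions in a five-stage pipeline occupy ten clock cycles in total, even though each instruction still takes five.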

Graphical Pipeline Diagrams
[Figure: PC, +4, instruction memory, registers (rs, rt, rd), imm, ALU, Data memory, divided into 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Write Back]
- Use datapath figure below to represent pipeline
  IF  ID  EX  Mem WB
  I$  Reg ALU D$  Reg

Graphical Pipeline Representation
(In Reg, right half highlight read, left half write)
[Figure: instruction order (Load, Add, Store, Sub, Or) against time in clock cycles; each instruction passes through I$, Reg, ALU, D$, Reg, starting one cycle after the previous instruction.]

Pipeline Performance
- Assume time for stages is
  - 100ps for register read or write
  - 200ps for other stages
- What is pipelined clock rate?
  - Compare pipelined datapath with single-cycle datapath

Instr      Instr fetch   Register read   ALU op   Memory access   Register write   Total time
lw         200ps         100ps           200ps    200ps           100ps            800ps
sw         200ps         100ps           200ps    200ps           -                700ps
R-format   200ps         100ps           200ps    -               100ps            600ps
beq        200ps         100ps           200ps    -               -                500ps

Pipeline Performance
[Figure: timing diagrams comparing Single-cycle (Tc = 800ps) with Pipelined (Tc = 200ps)]
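The two clock periods follow directly from the stage times in the table. A small sketch, with dictionary and function names chosen here for illustration: the single-cycle clock must fit the slowest instruction, while the pipelined clock must fit only the slowest stage.

```python
# Stage latencies in picoseconds, taken from the table above.
STAGE_PS = {"fetch": 200, "reg_read": 100, "alu": 200, "mem": 200, "reg_write": 100}

# Which stages each instruction class actually uses.
USES = {
    "lw": ["fetch", "reg_read", "alu", "mem", "reg_write"],
    "sw": ["fetch", "reg_read", "alu", "mem"],
    "R-format": ["fetch", "reg_read", "alu", "reg_write"],
    "beq": ["fetch", "reg_read", "alu"],
}


def instruction_ps(name):
    """Total time for one instruction class (the table's last column)."""
    return sum(STAGE_PS[stage] for stage in USES[name])


single_cycle_ps = max(instruction_ps(name) for name in USES)  # slowest instruction
pipelined_ps = max(STAGE_PS.values())                         # slowest stage

# 1 / (period in ps) = 1000 / period GHz
print(single_cycle_ps, pipelined_ps, 1000 / pipelined_ps)  # 800 200 5.0
```

So the pipelined clock rate is 5 GHz, against 1.25 GHz for the single-cycle design.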

Pipeline Speedup
- If all stages are balanced
  - i.e., all take the same time
  - Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
- If not balanced, speedup is less
- Speedup due to increased throughput
  - Latency (time for each instruction) does not decrease
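As a worked instance of the speedup rule, here are the numbers from the stage-time table (800ps single-cycle, five stages, slowest stage 200ps). The subscripts are labels chosen for this example, and the result is a steady-state figure that ignores pipeline fill and drain.

```latex
\begin{align*}
T_{c,\text{balanced}} &= \frac{T_{c,\text{single-cycle}}}{\text{number of stages}}
  = \frac{800\,\text{ps}}{5} = 160\,\text{ps} \\
T_{c,\text{pipelined}} &= \max(\text{stage times}) = 200\,\text{ps} \\
\text{Speedup} &= \frac{T_{c,\text{single-cycle}}}{T_{c,\text{pipelined}}}
  = \frac{800\,\text{ps}}{200\,\text{ps}} = 4 < 5
\end{align*}
```

Because the register read and write stages need only 100ps, the stages are unbalanced and the speedup is 4 rather than the ideal 5.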

Hazards
Situations that prevent starting the next logical instruction in the next clock cycle
1. Structural hazards
   - Required resource is busy (e.g., stasher is studying)
2. Data hazard
   - Need to wait for previous instruction to complete its data read/write (e.g., pair of socks in different loads)
3. Control hazard
   - Deciding on control action depends on previous instruction (e.g., how much detergent based on how clean prior load turns out)

1. Structural Hazards
- Conflict for use of a resource
- In MIPS pipeline with a single memory
  - Load/Store requires memory access for data
  - Instruction fetch would have to stall for that cycle
    - Causes a pipeline "bubble"
- Hence, pipelined datapaths require separate instruction/data memories
  - In reality, provide separate L1 instruction cache and L1 data cache

1. Structural Hazard #1: Single Memory
[Figure: instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4) against time in clock cycles; each instruction passes through I$, Reg, ALU, D$, Reg.]
- Read same memory twice in same clock cycle

1. Structural Hazard #2: Registers (1/2)
[Figure: instruction order (sw, Instr 1, Instr 2, Instr 3, Instr 4) against time in clock cycles; each instruction passes through I$, Reg, ALU, D$, Reg.]
- Can we read and write to registers simultaneously?

1. Structural Hazard #2: Registers (2/2)
- Two different solutions have been used:
  1) RegFile access is VERY fast: takes less than half the time of ALU stage
     - Write to Registers during first half of each clock cycle
     - Read from Registers during second half of each clock cycle
  2) Build RegFile with independent read and write ports
- Result: can perform Read and Write during same clock cycle

2. Data Hazards
- An instruction depends on completion of data access by a previous instruction
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
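The add/sub pair above is a read-after-write dependency, and such pairs can be found by scanning a program. The following is a rough sketch with an illustrative `Instr` tuple and `raw_hazards` function. It assumes no forwarding and a register file that writes in the first half of a cycle and reads in the second, so only the two instructions right after a producer are at risk (`window=2`).

```python
from collections import namedtuple

# Hypothetical instruction record: opcode, destination register, source registers.
Instr = namedtuple("Instr", "op dest srcs")


def raw_hazards(program, window=2):
    """Return (producer_index, consumer_index, register) for each read-after-write hazard.

    Assumption: without forwarding, a consumer within `window` instructions of the
    producer would read the register before the producer's write-back.
    """
    hazards = []
    for i, producer in enumerate(program):
        if producer.dest is None:
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if producer.dest in program[j].srcs:
                hazards.append((i, j, producer.dest))
            if program[j].dest == producer.dest:
                break  # register overwritten; later readers depend on the newer value
    return hazards


program = [
    Instr("add", "$s0", ("$t0", "$t1")),
    Instr("sub", "$t2", ("$s0", "$t3")),
]
print(raw_hazards(program))  # [(0, 1, '$s0')]
```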

Forwarding (aka Bypassing)
- Use result when it is computed
  - Don't wait for it to be stored in a register
  - Requires extra connections in the datapath

Corrected Datapath for Forwarding?
[Figure: pipelined datapath]

Forwarding Paths
[Figure: forwarding paths in the pipelined datapath. Source: Chapter 4, The Processor]

Load-Use Data Hazard
- Can't always avoid stalls by forwarding
  - If value not computed when needed
  - Can't forward backward in time!

Stall/Bubble in the Pipeline
[Figure: pipeline diagram with the annotation "Stall inserted here"]

Pipelining and ISA Design
- MIPS Instruction Set designed for pipelining
- All instructions are 32-bits
  - Easier to fetch and decode in one cycle
  - x86: 1- to 17-byte instructions (x86 HW actually translates to internal RISC instructions!)
- Few and regular instruction formats, 2 source register fields always in same place
  - Can decode and read registers in one step
- Memory operands only in Loads and Stores
  - Can calculate address 3rd stage, access memory 4th stage
- Alignment of memory operands
  - Memory access takes only one cycle
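To illustrate why fixed field positions let the hardware read registers while it decodes, here is a small sketch that pulls every field out of a 32-bit MIPS word with the same shifts and masks, whatever the format. The function name `decode_fields` is illustrative; the two example words are the standard encodings of `add $s0, $t0, $t1` and `lw $t0, 4($t1)`.

```python
def decode_fields(word):
    """Extract all candidate fields of a 32-bit MIPS instruction word.

    rs and rt sit at the same bit positions in R-format and I-format, so both
    can be read before the opcode has been fully interpreted.
    """
    return {
        "op": (word >> 26) & 0x3F,     # bits 31-26
        "rs": (word >> 21) & 0x1F,     # bits 25-21
        "rt": (word >> 16) & 0x1F,     # bits 20-16
        "rd": (word >> 11) & 0x1F,     # bits 15-11 (R-format only)
        "shamt": (word >> 6) & 0x1F,   # bits 10-6  (R-format only)
        "funct": word & 0x3F,          # bits 5-0   (R-format only)
        "imm16": word & 0xFFFF,        # bits 15-0  (I-format only)
    }


add_word = 0x01098020  # add $s0, $t0, $t1
lw_word = 0x8D280004   # lw  $t0, 4($t1)

r = decode_fields(add_word)
i = decode_fields(lw_word)
print(r["rs"], r["rt"], r["rd"], hex(r["funct"]))  # 8 9 16 0x20
print(hex(i["op"]), i["rs"], i["rt"], i["imm16"])  # 0x23 9 8 4
```

Register numbers 8, 9, and 16 are `$t0`, `$t1`, and `$s0`. Note that the destination is `rd` for the add but `rt` for the load.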

Why Isn't the Destination Register Always in the Same Field in MIPS ISA?

R-format (field boundaries at bits 31, 26, 21, 16, 11, 6, 0):
  op       rs       rt       rd       shamt    funct
  6 bits   5 bits   5 bits   5 bits   5 bits   6 bits

I-format (field boundaries at bits 31, 26, 21, 16, 0):
  op       rs       rt       immediate
  6 bits   5 bits   5 bits   16 bits

- Need to have 2 part immediate if 2 sources and 1 destination always in same place
- SPUR processor (a project Dave Patterson and Randy worked on together)

3. Control Hazards
- Branch determines flow of control
  - Fetching next instruction depends on branch outcome
  - Pipeline can't always fetch correct instruction
    - Still working on ID stage of branch
- BEQ, BNE in MIPS pipeline
- Simple solution Option 1: Stall on every branch until have new PC value
  - Would add 2 bubbles/clock cycles for every Branch! (~20% of instructions executed)

Stall => 2 Bubbles/Clocks
[Figure: instruction order (beq, Instr 1, Instr 2, Instr 3, Instr 4) against time in clock cycles; each instruction passes through I$, Reg, ALU, D$, Reg.]
- Where do we do the compare for the branch?

3. Control Hazard: Branching
- Optimization #1:
  - Insert special branch comparator in Stage 2
  - As soon as instruction is decoded (Opcode identifies it as a branch), immediately make a decision and set the new value of the PC
  - Benefit: since branch is complete in Stage 2, only one unnecessary instruction is fetched, so only one no-op is needed
  - Side Note: means that branches are idle in Stages 3, 4 and 5
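The cost of stalling on branches can be estimated with a back-of-the-envelope calculation. This sketch assumes an ideal base CPI of 1 and no stalls other than branch bubbles; the ~20% branch frequency is the figure quoted with Option 1, and the function name is illustrative.

```python
def effective_cpi(branch_fraction, bubbles_per_branch, base_cpi=1.0):
    """Average cycles per instruction when every branch inserts a fixed number of bubbles.

    Assumption: base_cpi = 1 (ideal pipeline) and branches are the only source of stalls.
    """
    return base_cpi + branch_fraction * bubbles_per_branch


stall_in_alu_stage = effective_cpi(0.20, 2)   # compare done in stage 3: 2 bubbles
compare_in_decode = effective_cpi(0.20, 1)    # comparator moved to stage 2: 1 bubble
print(round(stall_in_alu_stage, 2), round(compare_in_decode, 2))  # 1.4 1.2
```

Under these assumptions, moving the comparator into Stage 2 cuts the branch penalty from 40% extra cycles to 20%.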

Corrected Datapath for BEQ/BNE?
[Figure: pipelined datapath]

One Clock Cycle Stall
[Figure: instruction order (beq, Instr 1, Instr 2, Instr 3, Instr 4) against time in clock cycles; each instruction passes through I$, Reg, ALU, D$, Reg.]
- Branch comparator moved to Decode stage.

3. Control Hazards
- Option 2: Predict outcome of a branch, fix up if guess wrong
  - Must cancel all instructions in pipeline that depended on guess that was wrong
- Simplest hardware if we predict that all branches are NOT taken
  - Why?

3. Control Hazard: Branching
- Option #3: Redefine branches
  - Old definition: if we take the branch, none of the instructions after the branch get executed by accident
  - New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (the branch-delay slot)
- Delayed Branch means we always execute inst after branch
- This optimization is used with MIPS

3. Control Hazard: Branching
- Notes on Branch-Delay Slot
  - Worst-Case Scenario: put a no-op in the branch-delay slot
  - Better Case: place some instruction preceding the branch in the branch-delay slot, as long as the change doesn't affect the logic of program
    - Re-ordering instructions is common way to speed up programs
    - Compiler usually finds such an instruction 50% of time
    - Jumps also have a delay slot

Example: Nondelayed vs. Delayed Branch

Nondelayed Branch          Delayed Branch
  or  $8, $9, $10            add $1, $2, $3
  add $1, $2, $3             sub $4, $5, $6
  sub $4, $5, $6             beq $1, $4, Exit
  beq $1, $4, Exit           or  $8, $9, $10
  xor $10, $1, $11           xor $10, $1, $11
Exit:                      Exit:

Delayed Branch/Jump and MIPS ISA?
- Why does JAL put PC+8 in register 31?

Code Scheduling to Avoid Stalls
- Reorder code to avoid use of load result in the next instruction
- C code for A = B + E; C = B + F;

Original order (13 cycles):
        lw  $t1, 0($t0)
        lw  $t2, 4($t0)
stall   add $t3, $t1, $t2
        sw  $t3, 12($t0)
        lw  $t4, 8($t0)
stall   add $t5, $t1, $t4
        sw  $t5, 16($t0)

Reordered (11 cycles):
        lw  $t1, 0($t0)
        lw  $t2, 4($t0)
        lw  $t4, 8($t0)
        add $t3, $t1, $t2
        sw  $t3, 12($t0)
        add $t5, $t1, $t4
        sw  $t5, 16($t0)

Peer Instruction
I. Thanks to pipelining, I have reduced the time it took me to wash my one shirt.
II. Longer pipelines are always a win (since less work per stage & a faster clock).
A) (orange) I is True and II is True
B) (green) I is False and II is True
C) (pink) I is True and II is False
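The 13-cycle and 11-cycle counts in the scheduling example can be reproduced with a short sketch. It assumes a 5-stage pipeline with full forwarding, so the only stall is one bubble when an instruction uses a load result in the very next slot. The tuple layout and function names are illustrative.

```python
# Each instruction is (opcode, destination register or None, source registers).
ORIGINAL = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", None, ("$t3", "$t0")),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]
# Same instructions with the third load hoisted above the first add.
REORDERED = [ORIGINAL[i] for i in (0, 1, 4, 2, 3, 5, 6)]


def load_use_stalls(program):
    """Count one bubble for every instruction that reads the result of the load just before it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls


def cycles(program, stages=5):
    """Assumption: full forwarding, so total = instructions + fill time + load-use bubbles."""
    return len(program) + (stages - 1) + load_use_stalls(program)


print(cycles(ORIGINAL), cycles(REORDERED))  # 13 11
```

Moving `lw $t4` up separates both loads from their consumers, which removes the two bubbles without changing what the code computes.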

And, in Conclusion, ...
- Pipelining improves performance by increasing instruction throughput: exploits ILP
  - Executes multiple instructions in parallel
  - Each instruction has the same latency
  - Key enabler is placing registers between pipeline stages
- Subject to hazards
  - Structure, data, control
- Stalls reduce performance
  - But are required to get correct results
- Compiler can arrange code to avoid hazards and stalls
  - Requires knowledge of the pipeline structure