Lecture 2: Review of Pipelines

The Instruction Set: a Critical Interface

Layers, top to bottom: software, instruction set, hardware.

DAP Spr. 98 UCB

Instruction Set Architecture

"... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation."
Amdahl, Blaauw, and Brooks, 1964

SOFTWARE
-- Organization of Programmable Storage
-- Data Types & Data Structures: Encodings & Representations
-- Instruction Formats
-- Instruction (or Operation Code) Set
-- Modes of Addressing and Accessing Data Items and Instructions
-- Exceptional Conditions

Organization
-- Capabilities & Performance Characteristics of Principal Functional Units (e.g., Registers, ALU, Shifters, Logic Units, ...)
-- Ways in which these components are interconnected
-- Information flows between components
-- Logic and means by which such information flow is controlled
-- Choreography of FUs to realize the ISA
-- Register Transfer Level (RTL) Description

Logic Designer's View: ISA Level, FUs & Interconnect

Review: MIPS R3000 (core)

Programmable storage
-- 2^32 x bytes
-- 31 x 32-bit GPRs (R0=0)
-- 32 x 32-bit FP regs (paired DP)
-- HI, LO, PC
Data types? Format? Addressing Modes?

Arithmetic logical
    Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU,
    AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI
    SLL, SRL, SRA, SLLV, SRLV, SRAV
Memory Access
    LB, LBU, LH, LHU, LW, LWL, LWR
    SB, SH, SW, SWL, SWR
Control
    J, JAL, JR, JALR
    BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
32-bit instructions on word boundary

Review: Basic ISA Classes

Accumulator:
    1 address       add A          acc <- acc + mem[A]
    1+x address     addx A         acc <- acc + mem[A + x]
Stack:
    0 address       add            tos <- tos + next
General Purpose Register:
    2 address       add A B        EA(A) <- EA(A) + EA(B)
    3 address       add A B C      EA(A) <- EA(B) + EA(C)
Load/Store:
    3 address       add Ra Rb Rc   Ra <- Rb + Rc
                    load Ra Rb     Ra <- mem[Rb]
                    store Ra Rb    mem[Rb] <- Ra

Instruction Formats

Variable, Fixed, Hybrid
-- Addressing modes: each operand requires address specifier => variable format
-- Code size => variable length instructions
-- Performance => fixed length instructions: simple decoding, predictable operations
-- With load/store instruction arch, only one memory address and few addressing modes => simple format, address mode given by opcode

MIPS Addressing Modes & Formats

Simple addressing modes. All instructions 32 bits wide.

    Mode                 Fields             Operand
    Register (direct)    op rs rt rd        register
    Immediate            op rs rt immed     immed
    Base+index           op rs rt immed     Memory[register + immed]
    PC-relative          op rs rt immed     Memory[PC + immed]

Register Indirect?

Cray-1: the original RISC

Register-Register format:
    Op (bits 15-9), Rd (8-6), Rs1 (5-3), R2 (2-0)
Load, Store and Branch format:
    Op (bits 15-9), Rd (8-6), Rs1 (5-3), bits 2-0, followed by a 16-bit Immediate (15-0)

VAX-11: the canonical CISC

Variable format, 2 and 3 address instruction
    Byte 0: OpCode; bytes 1 .. n .. m: A/M, A/M, A/M
-- Rich set of orthogonal address modes: immediate, offset, indexed, autoinc/dec, indirect, indirect+offset; applied to any operand
-- Simple and complex instructions: synchronization instructions, data structure operations (queues), polynomial evaluation

Review: Load/Store Architectures

-- 3 address GPR
-- Register to register arithmetic
-- Load and store with simple addressing modes (reg + immediate)
-- Simple conditionals: compare ops + branch z; compare&branch; condition code + branch on condition
-- Simple fixed-format encoding (formats with op + immed and op + offset fields)

-- Substantial increase in instructions
-- Decrease in data BW (due to many registers)
-- Even more significant decrease in CPI (pipelining)
-- Cycle time, Real estate, Design time, Design complexity

MIPS R3000 ISA (Summary)

Instruction Categories
-- Load/Store
-- Computational
-- Jump and Branch
-- Floating Point (coprocessor)
-- Memory Management
-- Special

Registers: R0 - R31, PC, HI, LO

3 Instruction Formats: all 32 bits wide
    OP  rs  rt  rd  sa  funct
    OP  rs  rt  immediate
    OP  jump target

Levels of Representation (61C Review)

High Level Language Program

    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;

Compiler

Assembly Language Program

    lw $15, 0($2)
    lw $16, 4($2)
    sw $16, 0($2)
    sw $15, 4($2)

Assembler

Machine Language Program

    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111

Machine Interpretation

Control Signal Specification

    ALUOP[0:3] <= InstReg[9:11] & MASK

Execution Cycle

    Instruction Fetch     Obtain instruction from program storage
    Instruction Decode    Determine required actions and instruction size
    Operand Fetch         Locate and obtain operand data
    Execute               Compute result value or status
    Result Store          Deposit results in storage for later use
    Next Instruction      Determine successor instruction

What's a Clock Cycle?

[Figure: combinational logic between one latch or register and the next.]
-- Old days: 10 levels of gates
-- Today: determined by numerous time-of-flight issues + gate delays: clock propagation, wire lengths, drivers

Fast, Pipelined Instruction Interpretation

Stages and the state between them: Next Instruction, Instruction Address, Instruction Fetch, Instruction Register, Decode & Operand Fetch, Operand Registers, Execute, Result Registers, Store Results, Registers or Mem.

[Diagram: successive instructions overlap in time, each passing through NI, IF, D, E, W one stage behind its predecessor.]

Pipelining: It's Natural!

Laundry Example
-- Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
-- Washer takes 30 minutes
-- Dryer takes 40 minutes
-- "Folder" takes 20 minutes

Sequential Laundry

[Figure: tasks A, B, C, D in task order against time from 6 PM to Midnight; each load takes 30 + 40 + 20 minutes, one load after another.]
-- Sequential laundry takes 6 hours for 4 loads
-- If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP

[Figure: tasks A, B, C, D overlapped in time; the timeline reads 30, 40, 40, 40, 40, 20 minutes.]
-- Pipelined laundry takes 3.5 hours for 4 loads

Pipelining Lessons

-- Pipelining doesn't help latency of a single task, it helps throughput of the entire workload
-- Pipeline rate is limited by the slowest pipeline stage
-- Multiple tasks operate simultaneously
-- Potential speedup = Number of pipe stages
-- Unbalanced lengths of pipe stages reduce speedup
-- Time to "fill" the pipeline and time to "drain" it reduce speedup
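The laundry arithmetic above can be checked with a short sketch. It assumes each stage (washer, dryer, folder) handles one load at a time and that a finished load can wait between stages; the function names are illustrative and not part of the lecture.

```python
def sequential_time(stage_minutes, loads):
    """Total minutes when each load finishes all stages before the next starts."""
    return loads * sum(stage_minutes)


def pipelined_time(stage_minutes, loads):
    """Total minutes when a load starts a stage as soon as both it and the stage are free."""
    stage_free_at = [0] * len(stage_minutes)  # when each stage next becomes free
    for _ in range(loads):
        ready = 0  # when this load is ready for the next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free_at[s])
            ready = start + minutes
            stage_free_at[s] = ready
    return stage_free_at[-1]


stages = [30, 40, 20]  # washer, dryer, folder (minutes)
print(sequential_time(stages, 4) / 60)  # 6.0 hours
print(pipelined_time(stages, 4) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 x 40 + 20 minutes: the 40-minute dryer, the slowest stage, sets the rate once the pipeline is full.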

Computer Pipelines

-- Execute billions of instructions, so throughput is what matters
-- DLX desirable features: all instructions same length, registers located in same place in instruction format, memory operands only in loads or stores
-- Not visible to the programmer

5 Steps of MIPS Datapath
Figure 3.1, Page 130, CA:AQA 2e

[Figure: unpipelined datapath in five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled elements: Next PC, Address, 4, Adder, Memory, Inst, Next SEQ PC, RS1, RS2, Reg File, Imm, Sign Extend, MUX, Zero?, ALU, Data Memory, LMD, WB Data.]

Steps 1 & 2

IF - instruction fetch step
    IR <-- Mem[PC]       fetch the next instruction from memory
    NPC <-- PC + 4       compute the new PC

ID - instruction decode and register fetch step
    A <-- Regs[IR6..10]
    B <-- Regs[IR11..15]
-- Register fetch is done in parallel with opcode decode
-- Possible since register specifiers are encoded in fixed fields
-- We may fetch register contents that we don't use, but that is OK since the operands will be ready if the opcode is of the type that does use them
-- Also calculate the sign extended immediate in case that's the value that the opcode needs

5 Steps of MIPS Datapath
Figure 3.4, Page 134, CA:AQA 2e

[Figure: the same five-step datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB between the stages.]
-- Data stationary control: local decode for each instruction phase / pipeline stage
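The IF and ID register transfers above can be sketched in a few lines. This is a minimal illustration, assuming a 32-bit instruction word whose bits are numbered from 0 at the most significant end (so IR6..10 and IR11..15 are the two register specifiers and IR16..31 is the immediate); the helper names `field`, `fetch`, and `decode` are hypothetical.

```python
def field(ir, first_bit, last_bit):
    """Extract bits first_bit..last_bit of a 32-bit word, with bit 0 as the MSB (assumed numbering)."""
    width = last_bit - first_bit + 1
    shift = 31 - last_bit
    return (ir >> shift) & ((1 << width) - 1)


def sign_extend_16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def fetch(memory, pc):
    """IF step: IR <-- Mem[PC], NPC <-- PC + 4."""
    return memory[pc], pc + 4


def decode(ir, regs):
    """ID step: read both register operands and sign-extend the immediate,
    whether or not the opcode ends up using them."""
    a = regs[field(ir, 6, 10)]
    b = regs[field(ir, 11, 15)]
    imm = sign_extend_16(field(ir, 16, 31))
    return a, b, imm


# Hypothetical instruction word: opcode 0x23, register fields 2 and 15, immediate -4.
ir = (0x23 << 26) | (2 << 21) | (15 << 16) | 0xFFFC
regs = [100 + n for n in range(32)]  # made-up register contents
print(fetch({0: ir}, 0)[1])  # 4
print(decode(ir, regs))      # (102, 115, -4)
```

Because the register specifiers sit in fixed positions, `decode` does not need to know the opcode before reading the register file, which is the point the slide makes.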

Visualizing Pipelining
Figure 3.3, Page 133, CA:AQA 2e

[Figure: instructions in order (vertical axis) against time in clock cycles, Cycle 1 to Cycle 7; each instruction starts with Ifetch one cycle after its predecessor.]

It's Not That Easy for Computers

Limits to pipelining: Hazards prevent the next instruction from executing during its designated clock cycle
-- Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
-- Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
-- Control hazards: Pipelining of branches & other instructions that change the PC
Common solution is to stall the pipeline until the hazard is resolved, inserting one or more "bubbles" in the pipeline

One Memory Port/Structural Hazards
Figure 3.6, Page 142

[Figure: Load, Instruction 1, Instruction 2, Instruction 3, Instruction 4 in order against time in clock cycles, CC 1 to CC 8.]

FIGURE 3.6 A machine with only one memory port will generate a conflict whenever a memory reference occurs.

One Memory Port/Structural Hazards
Figure 3.7, Page 143

[Figure: Load, Instruction 1, Instruction 2, Stall, Instruction 3 in order against time in clock cycles, CC 1 to CC 8; the stall row is a sequence of bubbles.]

FIGURE 3.7 The structural hazard causes pipeline bubbles to be inserted.

Speed Up Equation for Pipelining

$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

With Ideal CPI = 1:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

Example: Dual-port vs. Single-port

-- Machine A: Dual ported memory
-- Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
-- Ideal CPI = 1 for both
-- Loads are 40% of instructions executed

$$\text{SpeedUp}_A = \frac{\text{Pipeline depth}}{1 + 0} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{pipe}}} = \text{Pipeline depth}$$

$$\text{SpeedUp}_B = \frac{\text{Pipeline depth}}{1 + 0.4 \times 1} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{unpipe}} / 1.05} = \frac{\text{Pipeline depth}}{1.4} \times 1.05 = 0.75 \times \text{Pipeline depth}$$

$$\frac{\text{SpeedUp}_A}{\text{SpeedUp}_B} = \frac{\text{Pipeline depth}}{0.75 \times \text{Pipeline depth}} = 1.33$$

Machine A is 1.33 times faster
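The dual-port versus single-port example can be reproduced numerically. The sketch below assumes a pipeline depth of 5 purely for illustration (the ratio does not depend on it), and the function and parameter names are not from the lecture.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0, ideal_cpi=1.0):
    """Speedup over the unpipelined machine.

    clock_ratio is (unpipelined clock cycle) / (pipelined clock cycle).
    """
    return ideal_cpi * depth / (ideal_cpi + stall_cpi) * clock_ratio


DEPTH = 5            # assumed depth, for illustration only
LOAD_FRACTION = 0.4  # loads are 40% of instructions executed

# Machine A: dual-ported memory, no structural stalls.
speedup_a = pipeline_speedup(DEPTH, stall_cpi=0.0)
# Machine B: every load costs one stall cycle, but the clock is 1.05 times faster.
speedup_b = pipeline_speedup(DEPTH, stall_cpi=LOAD_FRACTION * 1, clock_ratio=1.05)

print(round(speedup_a / speedup_b, 2))  # 1.33
```

The faster clock on Machine B does not make up for stalling on 40% of instructions.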

Data Hazard on R1
Figure 3.9, page 147

    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11

[Figure: the five instructions in program order against time in clock cycles, each passing through IF, ID/RF, EX, MEM, WB.]

[Figure 3.9: program execution order (ADD R1, R2, R3; SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11) against time in clock cycles, CC 1 to CC 6.]

FIGURE 3.9 The use of the result of the ADD instruction in the next three instructions causes a hazard, since the register is not written until after those instructions read it.

Three Generic Data Hazards

InstrI followed by InstrJ
-- Read After Write (RAW): InstrJ tries to read operand before InstrI writes it

[Figure 3.10: the same five instructions against time in clock cycles, CC 1 to CC 6, with forwarding paths.]

FIGURE 3.10 A set of instructions that depend on the ADD result use forwarding paths to avoid the data hazard.

[Figure 3.11: program execution order (ADD R1, R2, R3; LW R4, 0(R1); SW 12(R1), R4) against time in clock cycles, CC 1 to CC 6.]

FIGURE 3.11 Stores require an operand during MEM, and forwarding of that operand is shown here.

Three Generic Data Hazards

InstrI followed by InstrJ
-- Write After Read (WAR): InstrJ tries to write operand before InstrI reads it
-- Gets wrong operand
-- Can't happen in DLX 5 stage pipeline because:
   All instructions take 5 stages, and
   Reads are always in stage 2, and
   Writes are always in stage 5

Three Generic Data Hazards

InstrI followed by InstrJ
-- Write After Write (WAW): InstrJ tries to write operand before InstrI writes it
-- Leaves wrong result (InstrI not InstrJ)
-- Can't happen in DLX 5 stage pipeline because:
   All instructions take 5 stages, and
   Writes are always in stage 5
-- Will see WAR and WAW in later more complicated pipes

Data Hazard Even with Forwarding
Figure 3.12, Page 153

    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9

[Figure: the four instructions in program order against time in clock cycles, CC 1 to CC 5.]
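The three generic data hazards can be stated as simple set checks on the registers an instruction pair reads and writes. The following sketch only classifies the dependences between InstrI and a later InstrJ; whether a dependence becomes an actual hazard depends on the pipeline (in the 5-stage DLX only RAW does, as noted above). The `Instr` record and `dependences` function are illustrative names, not part of the lecture.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                # assembly text, for display only
    dest: Optional[str]      # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]    # registers read


def dependences(i, j):
    """Classify register dependences for InstrI followed by InstrJ."""
    found = []
    if i.dest is not None and i.dest in j.srcs:
        found.append("RAW")  # J reads what I writes
    if j.dest is not None and j.dest in i.srcs:
        found.append("WAR")  # J writes what I reads
    if i.dest is not None and i.dest == j.dest:
        found.append("WAW")  # both write the same register
    return found


add = Instr("add r1,r2,r3", "r1", ("r2", "r3"))
sub = Instr("sub r4,r1,r3", "r4", ("r1", "r3"))
print(dependences(add, sub))  # ['RAW']
print(dependences(sub, add))  # ['WAR']
```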

[Figure 3.12: program execution order (LW R1, 0(R2); SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9) against time in clock cycles, CC 1 to CC 5.]

FIGURE 3.12 The load instruction can bypass its results to the AND and OR instructions, but not to the SUB, since that would mean forwarding the result in "negative time."

[Figure 3.13: the same four instructions against time in clock cycles, CC 1 to CC 6, with a bubble inserted ahead of SUB, AND, and OR.]

FIGURE 3.13 The load interlock causes a stall to be inserted at clock cycle 4, delaying the SUB instruction and those that follow by one cycle.

Software Scheduling to Avoid Load Hazards

A = B + C

    lw  Rb,b
    lw  Rc,c
    add Ra,Rb,Rc
    sw  a,Ra

[Pipeline diagram: stages IF, ID, EX, MEM, WB for each instruction, with two entries marked Stalled.]

Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f in memory.

Slow code:
    LW   Rb,b
    LW   Rc,c
    ADD  Ra,Rb,Rc
    SW   a,Ra
    LW   Re,e
    LW   Rf,f
    SUB  Rd,Re,Rf
    SW   d,Rd

Data Flow Graph

Scheduling using DFG

Software Scheduling to Avoid Load Hazards

Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f in memory.

    Slow code:            Fast code:
    LW   Rb,b             LW   Rb,b
    LW   Rc,c             LW   Rc,c
    ADD  Ra,Rb,Rc         LW   Re,e
    SW   a,Ra             ADD  Ra,Rb,Rc
    LW   Re,e             LW   Rf,f
    LW   Rf,f             SW   a,Ra
    SUB  Rd,Re,Rf         SUB  Rd,Re,Rf
    SW   d,Rd             SW   d,Rd

HW Change for Forwarding
Figure 3.20, Page 161

[Figure: pipeline registers ID/EX, EX/MEM, MEM/WB with the Zero? test, the ALU, and Data Memory.]
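The benefit of the fast schedule can be counted with a small model. This sketch assumes full forwarding, so the only stall is the one-cycle load interlock when an instruction uses a register loaded by the instruction immediately before it; it also assumes a 5-stage pipeline when converting to total cycles. The tuple encoding and function names are illustrative.

```python
# Each instruction is (opcode, destination register or None, source registers).
SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("LW", "Rf", ()), ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]


def load_use_stalls(program):
    """Count instructions that read a register loaded by the immediately preceding LW."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "LW" and prev_dest in cur[2]:
            stalls += 1
    return stalls


def total_cycles(program, depth=5):
    """Cycles to run the sequence: fill the pipeline, one instruction per cycle, plus stalls."""
    return len(program) + (depth - 1) + load_use_stalls(program)


print(load_use_stalls(SLOW), total_cycles(SLOW))  # 2 14
print(load_use_stalls(FAST), total_cycles(FAST))  # 0 12
```

In the slow code both ADD and SUB sit directly behind the load that produces one of their operands; the fast code moves an independent load into each of those positions.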

[Figure 3.20: pipeline registers ID/EX, EX/MEM, MEM/WB, the Zero? test, the ALU, and data memory, with forwarding paths into the ALU multiplexers.]

FIGURE 3.20 Forwarding of results to the ALU requires the addition of three extra inputs on each ALU multiplexer and the addition of three paths to the new inputs.

[Figure 3.4: pipelined datapath with PC, 4, instruction memory, IR, pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB, fields IR6..10, IR11..15, MEM/WB.IR, registers, 16-to-32 sign extend, Zero? test, branch taken, ALU, and data memory.]

FIGURE 3.4 The datapath is pipelined by adding a set of registers, one between each pair of pipe stages.

Control Hazard on Branches: Three Stage Stall

Branch Hazards: when we decide to branch, other instructions are in the pipeline!

    40  beq $1, $3, 7
    44  and $12, $2, $5
    48  or  $13, $6, $2
    52  add $14, $2, $2
    72  lw  $4, 50($7)

[Figure: program execution order against time in clock cycles, CC 1 to CC 9.]

We are predicting "branch not taken": need to add hardware for flushing instructions if we are wrong.

1998 Morgan Kaufmann Publishers

Branch Stall Impact

-- If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
-- Two part solution:
   Determine branch taken or not sooner, AND
   Compute taken branch address earlier
-- DLX branch tests if register = 0 or not 0
-- DLX Solution:
   Move Zero test to ID/RF stage
   Adder to calculate new PC in ID/RF stage
   1 clock cycle penalty for branch versus 3
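The branch stall arithmetic above is a one-line formula, shown here as a sketch; the function name is illustrative.

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, stall_cycles):
    """Effective CPI when every branch costs stall_cycles extra cycles."""
    return base_cpi + branch_fraction * stall_cycles


# Figures from the slide: CPI = 1, 30% branches.
print(round(cpi_with_branch_stalls(1.0, 0.30, 3), 2))  # 1.9 with a 3-cycle stall
print(round(cpi_with_branch_stalls(1.0, 0.30, 1), 2))  # 1.3 with the 1-cycle DLX penalty
```

Moving the zero test and the target adder into ID/RF cuts the branch contribution to CPI from 0.9 to 0.3 under these numbers.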

Pipelined DLX Datapath
Figure 3.22, page 163

[Figure: stages Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back, with PC, 4, instruction memory, IR, pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB, fields IR6..10, IR11..15, MEM/WB.IR, registers, 16-to-32 sign extend, Zero? test, ALU, and data memory.]

This is the correct 1 cycle latency implementation!

FIGURE 3.22 The stall from branch hazards can be reduced by moving the zero test and branch target calculation into the ID phase of the pipeline.

[Figure 3.24: bar chart by benchmark (compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor). Horizontal axis: percentage of instructions executed, 0% to 25%. Series: forward conditional branches, backward conditional branches, unconditional branches.]

FIGURE 3.24 The frequency of instructions (branches, jumps, calls, and returns) that may change the PC.

[Figure 3.25: bar chart by benchmark (same ten benchmarks). Vertical axis: fraction of all conditional branches, 0% to 80%. Series: forward taken, backward taken.]

FIGURE 3.25 Together the forward and backward taken branches account for an average of 67% of all conditional branches.

Four Branch Hazard Alternatives

#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
-- Execute successor instructions in sequence
-- "Squash" instructions in pipeline if branch actually taken
-- Advantage of late pipeline state update
-- 47% DLX branches not taken on average
-- PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
-- 53% DLX branches taken on average
-- But haven't calculated branch target address in DLX
   DLX still incurs 1 cycle branch penalty
   Other machines: branch target known before outcome

Four Branch Hazard Alternatives

#4: Delayed Branch
-- Define branch to take place AFTER a following instruction

    branch instruction
    sequential successor 1
    sequential successor 2
    ...
    sequential successor n
    branch target if taken

   (the n sequential successors form a branch delay of length n)
-- 1 slot delay allows proper decision and branch target address in 5 stage pipeline
-- DLX uses this

Delayed Branch

-- Where to get instructions to fill branch delay slot?
   Before branch instruction
   From the target address: only valuable when branch taken
   From fall through: only valuable when branch not taken
   Cancelling branches allow more slots to be filled
-- Compiler effectiveness for single branch delay slot:
   Fills about 60% of branch delay slots
   About 80% of instructions executed in branch delay slots useful in computation
   About 50% (60% x 80%) of slots usefully filled
-- Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

Evaluating Branch Alternatives

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$

    Scheduling scheme    Branch penalty   CPI    speedup v. unpipelined   speedup v. stall
    Stall pipeline       3                1.42   3.5                      1.0
    Predict taken        1                1.14   4.4                      1.26
    Predict not taken    1                1.09   4.5                      1.29
    Delayed branch       0.5              1.07   4.6                      1.31

Conditional & Unconditional = 14%, 65% change PC

Pipelining Introduction Summary

-- Just overlap tasks, and easy if tasks are independent
-- Speed Up $\le$ Pipeline Depth; if ideal CPI is 1, then:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

-- Hazards limit performance on computers:
   Structural: need more HW resources
   Data (RAW, WAR, WAW): need forwarding, compiler scheduling
   Control: delayed branch, prediction
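The CPI column of the branch-scheme table follows from the speedup formula and the stated branch statistics. The sketch below assumes a pipeline depth of 5 and assumes that "predict not taken" pays its penalty only on the 65% of branches that change the PC; both are readings of the slide rather than something it states outright, and the names are illustrative. The computed speedups land close to the table's values but may differ in the last digit because of rounding.

```python
BRANCH_FREQ = 0.14     # conditional & unconditional branches
TAKEN_FRACTION = 0.65  # fraction of branches that change the PC
DEPTH = 5              # assumed pipeline depth


def branch_cpi(penalty, penalized_fraction=1.0):
    """CPI = 1 + branch frequency x fraction of branches paying the penalty x penalty."""
    return 1.0 + BRANCH_FREQ * penalized_fraction * penalty


# Scheme name -> (branch penalty, fraction of branches that pay it)
SCHEMES = {
    "Stall pipeline": (3, 1.0),
    "Predict taken": (1, 1.0),
    "Predict not taken": (1, TAKEN_FRACTION),
    "Delayed branch": (0.5, 1.0),
}

stall_cpi = branch_cpi(*SCHEMES["Stall pipeline"])
for name, (penalty, fraction) in SCHEMES.items():
    cpi = branch_cpi(penalty, fraction)
    vs_unpipelined = DEPTH / cpi
    vs_stall = stall_cpi / cpi
    print(f"{name:18s} CPI={cpi:.2f}  v.unpipelined={vs_unpipelined:.2f}  v.stall={vs_stall:.2f}")
```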