Lecture 2: Review of Pipelines


DAP Spr. '98 UCB

The Instruction Set: a Critical Interface
[Figure: software / instruction set / hardware]

Instruction Set Architecture
"... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." Amdahl, Blaauw, and Brooks, 1964
-- Organization of Programmable Storage
-- Data Types & Data Structures: Encodings & Representations
-- Instruction Formats
-- Instruction (or Operation Code) Set
-- Modes of Addressing and Accessing Data Items and Instructions
-- Exceptional Conditions

Organization
-- Capabilities & Performance Characteristics of Principal Functional Units (e.g., Registers, ALU, Shifters, Logic Units, ...)
-- Ways in which these components are interconnected
-- Information flows between components
-- Logic and means by which such information flow is controlled
-- Choreography of FUs to realize the ISA
-- Register Transfer Level (RTL) Description
[Figure: Logic Designer's View; ISA Level; FUs & Interconnect]

Review: MIPS R3000 (core)
Programmable storage: 2^32 x bytes; 31 x 32-bit GPRs (R0=0); 32 x 32-bit FP regs (paired DP); HI, LO, PC
Data types? Format? Addressing Modes?
Arithmetic logical: Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU, AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI, SLL, SRL, SRA, SLLV, SRLV, SRAV
Memory Access: LB, LBU, LH, LHU, LW, LWL, LWR, SB, SH, SW, SWL, SWR
Control: J, JAL, JR, JALR, BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
32-bit instructions on word boundary

Review: Basic ISA Classes
Accumulator:
  1 address      add A        acc <- acc + mem[A]
  1+x address    addx A       acc <- acc + mem[A + x]
Stack:
  0 address      add          tos <- tos + next
General Purpose Register:
  2 address      add A B      EA(A) <- EA(A) + EA(B)
  3 address      add A B C    EA(A) <- EA(B) + EA(C)
Load/Store:
  3 address      add Ra Rb Rc   Ra <- Rb + Rc
                 load Ra Rb     Ra <- mem[Rb]
                 store Ra Rb    mem[Rb] <- Ra

Instruction Formats
[Figure: Variable, Fixed, and Hybrid formats]
Addressing modes: each operand requires address specifier => variable format
Code size => variable length instructions
Performance => fixed length instructions: simple decoding, predictable operations
With load/store instruction arch, only one memory address and few addressing modes => simple format, address mode given by opcode

MIPS Addressing Modes & Formats
Simple addressing modes
All instructions 32 bits wide
  Register (direct):  op rs rt rd      -> register
  Immediate:          op rs rt immed
  Base+index:         op rs rt immed   (register + immed -> Memory)
  PC-relative:        op rs rt immed   (PC + immed -> Memory)
Register Indirect?

Cray-1: the original RISC
  Register-Register:       bits 15-9 Op | 8-6 Rd | 5-3 Rs1 | 2-0 R2
  Load, Store and Branch:  bits 15-9 Op | 8-6 Rd | 5-3 Rs1 | second parcel, bits 15-0 Immediate

VAX-11: the canonical CISC
Variable format, 2 and 3 address instruction
  Byte 0: OpCode; bytes 1 .. n .. m: A/M, A/M, A/M
Rich set of orthogonal address modes: immediate, offset, indexed, autoinc/dec, indirect, indirect+offset; applied to any operand
Simple and complex instructions: synchronization instructions, data structure operations (queues), polynomial evaluation

Review: Load/Store Architectures
3 address GPR
Register to register arithmetic
Load and store with simple addressing modes (reg + immediate)
Simple conditionals: compare ops + branch z; compare&branch; condition code + branch on condition
Simple fixed-format encoding (op r r r; op r r immed; op offset)
Substantial increase in instructions
Decrease in data BW (due to many registers)
Even more significant decrease in CPI (pipelining)
Cycle time, Real estate, Design time, Design complexity

MIPS R3000 ISA (Summary)
Instruction Categories: Load/Store; Computational; Jump and Branch; Floating Point (coprocessor); Memory Management; Special
Registers: R0 - R31, PC, HI, LO
3 Instruction Formats, all 32 bits wide:
  OP | rs | rt | rd | sa | funct
  OP | rs | rt | immediate
  OP | jump target

Levels of Representation (61C Review)
High Level Language Program:
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
(Compiler)
Assembly Language Program:
  lw $15, 0($2)
  lw $16, 4($2)
  sw $16, 0($2)
  sw $15, 4($2)
(Assembler)
Machine Language Program:
  0000 1001 1100 0110 1010 1111 0101 1000
  1010 1111 0101 1000 0000 1001 1100 0110
  1100 0110 1010 1111 0101 1000 0000 1001
  0101 1000 0000 1001 1100 0110 1010 1111
(Machine Interpretation)
Control Signal Specification:
  ALUOP[0:3] <= InstReg[9:11] & MASK

Execution Cycle
Instruction Fetch: Obtain instruction from program storage
Instruction Decode: Determine required actions and instruction size
Operand Fetch: Locate and obtain operand data
Execute: Compute result value or status
Result Store: Deposit results in storage for later use
Next Instruction: Determine successor instruction

What's a Clock Cycle?
[Figure: latch or register -> combinational logic -> latch or register]
Old days: 10 levels of gates
Today: determined by numerous time-of-flight issues + gate delays: clock propagation, wire lengths, drivers

Fast, Pipelined Instruction Interpretation
Next Instruction (Instruction Address) -> Instruction Fetch (Instruction Register) -> Decode & Operand Fetch (Operand Registers) -> Execute (Result Registers) -> Store Results (Registers or Mem)
[Figure: overlapped NI, IF, D, E, W steps of successive instructions over time]

Pipelining: It's Natural!
Laundry Example
Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
"Folder" takes 20 minutes

Sequential Laundry
[Figure: timeline from 6 PM to midnight; task order A, B, C, D; each load takes 30 + 40 + 20 minutes]
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight; task order A, B, C, D; segments of 30, 40, 40, 40, 40, 20 minutes]
Pipelined laundry takes 3.5 hours for 4 loads

Pipelining Lessons
Pipelining doesn't help latency of single task, it helps throughput of entire workload
Pipeline rate limited by slowest pipeline stage
Multiple tasks operating simultaneously
Potential speedup = Number pipe stages
Unbalanced lengths of pipe stages reduces speedup
Time to "fill" pipeline and time to "drain" it reduces speedup
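The laundry totals above can be checked with a short sketch. It assumes each stage (washer, dryer, folder) handles one load at a time and that a finished load can wait between stages; the function names are illustrative and not from the slides.

```python
def sequential_finish_time(stage_minutes, num_tasks):
    """Total minutes when each task runs start to finish before the next begins."""
    return num_tasks * sum(stage_minutes)


def pipeline_finish_time(stage_minutes, num_tasks):
    """Total minutes when tasks overlap, one task per stage at a time."""
    # finish[s] is the time at which the most recent task left stage s.
    finish = [0] * len(stage_minutes)
    for _ in range(num_tasks):
        ready = 0  # time at which this task left the previous stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, finish[s])  # wait for the task and for the stage
            finish[s] = start + minutes
            ready = finish[s]
    return finish[-1]


STAGES = [30, 40, 20]  # washer, dryer, folder (minutes, from the slide)
LOADS = 4              # Ann, Brian, Cathy, Dave

sequential = sequential_finish_time(STAGES, LOADS)
pipelined = pipeline_finish_time(STAGES, LOADS)
print(sequential / 60, "hours sequential;", pipelined / 60, "hours pipelined")
```

The pipelined total is 30 + 4 x 40 + 20 minutes: the 40-minute dryer, the slowest stage, sets the rate, which is why the speedup is well below the three-stage ideal.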

Computer Pipelines
Execute billions of instructions, so throughput is what matters
DLX desirable features: all instructions same length, registers located in same place in instruction format, memory operands only in loads or stores
(Annotation: not visible to the programmer)

5 Steps of MIPS Datapath (Figure 3.1, Page 130, CA:AQA 2e)
[Figure: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labels: Next PC, Address, Adder (+4), Memory, Inst, Next SEQ PC, RS1, RS2, RD, Reg File, Imm, Sign Extend, MUX, ALU, Zero?, Data Memory, LMD, WB Data]

Steps 1 & 2
IF - instruction fetch step
  IR <-- Mem[PC]: fetch the next instruction from memory
  NPC <-- PC + 4: compute the new PC
ID - instruction decode and register fetch step
  Register fetch is done in parallel with opcode decode
  A <-- Regs[IR6..10]
  B <-- Regs[IR11..15]
  Possible since register specifiers are encoded in fixed fields
  We may fetch register contents that we don't use, but OK since the operands will be ready if the opcode is of the type that does use them
  Also calculate the sign extended immediate in case that's the value that the opcode needs

5 Steps of MIPS Datapath (Figure 3.4, Page 134, CA:AQA 2e)
[Figure: the same datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB between the stages]
Data stationary control: local decode for each instruction phase / pipeline stage
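The two register-transfer steps above can be sketched in Python. This is a minimal illustration, not the slides' implementation: it assumes the DLX bit-numbering convention in which bit 0 is the most significant bit of the 32-bit instruction, so IR6..10 and IR11..15 are the two five-bit source register fields and the low 16 bits hold the immediate. The memory model (a dict of word addresses) and the function name are hypothetical.

```python
def fetch_and_decode(mem, regs, pc):
    """Model the IF and ID steps for one instruction."""
    # IF: IR <- Mem[PC]; NPC <- PC + 4
    ir = mem[pc]
    npc = pc + 4

    # ID: read both source registers from fixed fields, whether or not
    # the opcode ends up using them.
    rs1 = (ir >> 21) & 0x1F  # IR6..10 (bit 0 = MSB, an assumed convention)
    rs2 = (ir >> 16) & 0x1F  # IR11..15
    a = regs[rs1]
    b = regs[rs2]

    # Sign-extend the 16-bit immediate in case the opcode needs it.
    imm = ir & 0xFFFF
    if imm & 0x8000:
        imm -= 0x10000
    return ir, npc, a, b, imm


# Usage with made-up contents: opcode field 0x23, rs1 = 2, rs2 = 1, immediate = -4.
regs = [10 * i for i in range(32)]
mem = {0: (0x23 << 26) | (2 << 21) | (1 << 16) | 0xFFFC}
ir, npc, a, b, imm = fetch_and_decode(mem, regs, 0)
print(npc, a, b, imm)
```

Because the register specifiers sit in fixed fields, the register reads need no information from the opcode, which is what lets them proceed in parallel with decoding.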

Visualizing Pipelining (Figure 3.3, Page 133, CA:AQA 2e)
[Figure: instruction order vs. time (clock cycles), Cycle 1 to Cycle 7; four instructions, each starting with Ifetch one cycle after the previous one]

It's Not That Easy for Computers
Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
  Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
  Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
  Control hazards: Pipelining of branches & other instructions that change the PC
Common solution is to stall the pipeline until the hazard is resolved, inserting one or more "bubbles" in the pipeline

One Memory Port/Structural Hazards (Figure 3.6, Page 142)
[Figure: instruction order (Load, Instruction 1, Instruction 2, Instruction 3, Instruction 4) vs. time in clock cycles, CC 1 to CC 8]
FIGURE 3.6 A machine with only one memory port will generate a conflict whenever a memory reference occurs.

One Memory Port/Structural Hazards (Figure 3.7, Page 143)
[Figure: instruction order (Load, Instruction 1, Instruction 2, Stall, Instruction 3) vs. time in clock cycles, CC 1 to CC 8; the stall appears as a row of bubbles]
FIGURE 3.7 The structural hazard causes pipeline bubbles to be inserted.

Speed Up Equation for Pipelining

$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

With Ideal CPI = 1:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

Example: Dual-port vs. Single-port
Machine A: Dual ported memory
Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
Ideal CPI = 1 for both
Loads are 40% of instructions executed

$$\text{SpeedUp}_A = \frac{\text{Pipeline Depth}}{1 + 0} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{pipe}}} = \text{Pipeline Depth}$$

$$\text{SpeedUp}_B = \frac{\text{Pipeline Depth}}{1 + 0.4 \times 1} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{unpipe}} / 1.05} = \frac{\text{Pipeline Depth}}{1.4} \times 1.05 = 0.75 \times \text{Pipeline Depth}$$

$$\frac{\text{SpeedUp}_A}{\text{SpeedUp}_B} = \frac{\text{Pipeline Depth}}{0.75 \times \text{Pipeline Depth}} = 1.33$$

Machine A is 1.33 times faster
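The dual-port vs. single-port comparison can be reproduced numerically. In this sketch the pipeline depth of 5 is an assumed value chosen only for illustration (it cancels in the ratio), and the function name is hypothetical.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0, ideal_cpi=1.0):
    """Speedup over the unpipelined machine.

    clock_ratio is (unpipelined clock cycle) / (pipelined clock cycle).
    """
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * clock_ratio


DEPTH = 5            # assumed depth; the A/B ratio does not depend on it
LOAD_FRACTION = 0.4  # loads are 40% of instructions (from the slide)

# Machine A: dual-ported memory, no structural stalls.
speedup_a = pipeline_speedup(DEPTH, stall_cpi=0.0)

# Machine B: single-ported memory, one stall cycle per load,
# but a clock that is 1.05 times faster.
speedup_b = pipeline_speedup(DEPTH, stall_cpi=LOAD_FRACTION * 1, clock_ratio=1.05)

print(round(speedup_a / speedup_b, 2))
```

The faster clock on machine B does not make up for stalling on 40% of its instructions.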

Data Hazard on R1 (Figure 3.9, page 147)
[Figure: instruction order vs. time (clock cycles); stages IF, ID/RF, EX, MEM, WB]
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11

[Figure 3.9: program execution order (in instructions) vs. time (in clock cycles), CC 1 to CC 6: ADD R1, R2, R3; SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11]
FIGURE 3.9 The use of the result of the ADD instruction in the next three instructions causes a hazard, since the register is not written until after those instructions read it.

[Figure 3.10: the same instruction sequence, CC 1 to CC 6, with forwarding paths]
FIGURE 3.10 A set of instructions that depend on the ADD result use forwarding paths to avoid the data hazard.

Three Generic Data Hazards
Inst I followed by Inst J
Read After Write (RAW): Inst J tries to read operand before Inst I writes it

[Figure 3.11: program execution order (in instructions) vs. time (in clock cycles), CC 1 to CC 6: ADD R1, R2, R3; LW R4, 0(R1); SW 12(R1), R4]
FIGURE 3.11 Stores require an operand during MEM, and forwarding of that operand is shown here.

Three Generic Data Hazards
Inst I followed by Inst J
Write After Read (WAR): Inst J tries to write operand before Inst I reads it
  Gets wrong operand
  Can't happen in DLX 5 stage pipeline because:
    All instructions take 5 stages, and
    Reads are always in stage 2, and
    Writes are always in stage 5

Three Generic Data Hazards
Inst I followed by Inst J
Write After Write (WAW): Inst J tries to write operand before Inst I writes it
  Leaves wrong result (Inst I not Inst J)
  Can't happen in DLX 5 stage pipeline because:
    All instructions take 5 stages, and
    Writes are always in stage 5
Will see WAR and WAW in later more complicated pipes

Data Hazard Even with Forwarding (Figure 3.12, Page 153)
[Figure: instruction order vs. time (clock cycles), CC 1 to CC 5]
  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9
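The three definitions can be written as a small classifier over register names. This is a sketch under stated assumptions: each instruction is represented as a hypothetical (destination, sources) pair, and the function only reports which orderings between Inst I and Inst J must be preserved. Whether one actually becomes a hazard depends on the pipeline; as noted above, only RAW can occur in the DLX 5 stage pipeline.

```python
def classify_hazards(inst_i, inst_j):
    """Return the potential hazards for Inst I followed by Inst J.

    Each instruction is (dest, sources); dest is None if nothing is written.
    """
    dest_i, srcs_i = inst_i
    dest_j, srcs_j = inst_j
    found = set()
    if dest_i is not None and dest_i in srcs_j:
        found.add("RAW")  # J reads what I writes
    if dest_j is not None and dest_j in srcs_i:
        found.add("WAR")  # J writes what I reads
    if dest_i is not None and dest_i == dest_j:
        found.add("WAW")  # J writes what I writes
    return found


# ADD R1,R2,R3 followed by SUB R4,R1,R5 (the pair from Figure 3.9)
ADD = ("R1", ("R2", "R3"))
SUB = ("R4", ("R1", "R5"))
print(classify_hazards(ADD, SUB))
```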

[Figure 3.12: program execution order (in instructions) vs. time (in clock cycles), CC 1 to CC 5: LW R1, 0(R2); SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9]
FIGURE 3.12 The load instruction can bypass its results to the AND and OR instructions, but not to the SUB, since that would mean forwarding the result in "negative time."

[Figure 3.13: the same sequence, CC 1 to CC 6, with a bubble inserted before SUB, AND, and OR]
FIGURE 3.13 The load interlock causes a stall to be inserted at clock cycle 4, delaying the SUB instruction and those that follow by one cycle.

Software Scheduling to Avoid Load Hazards
A = B + C
  lw  rb,b
  lw  rc,c
  add ra,rb,rc
  sw  a,ra
[Figure: pipeline diagram (IF, ID, EX, MEM, WB) for the four instructions, with stall cycles before the add]

Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f in memory.
Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf
  SW  d,Rd

Data Flow Graph
Scheduling using DFG
[Figures not recoverable]

Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f in memory.
Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf
  SW  d,Rd
Fast code:
  LW  Rb,b
  LW  Rc,c
  LW  Re,e
  ADD Ra,Rb,Rc
  LW  Rf,f
  SW  a,Ra
  SUB Rd,Re,Rf
  SW  d,Rd

HW Change for Forwarding (Figure 3.20, Page 161)
[Figure: ID/EX, EX/MEM, and MEM/WB pipeline registers feeding multiplexers at the ALU inputs; Zero?; Data Memory]
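The difference between the slow and fast sequences can be counted with a simplified load-interlock model. The sketch assumes full forwarding, so the only stall is one cycle when an instruction uses the register loaded by the instruction immediately before it. The tuple encoding (opcode, destination, sources) and the function name are illustrative, not from the slides.

```python
def count_load_use_stalls(program):
    """Count one-cycle stalls caused by using a loaded value in the next instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "LW" and prev_dest in cur_srcs:
            stalls += 1  # loaded value is not available in time, even with forwarding
    return stalls


# (opcode, destination register, source registers); stores write no register.
SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]
FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

print(count_load_use_stalls(SLOW), count_load_use_stalls(FAST))
```

Under this model the slow code stalls once before ADD and once before SUB, while the fast code separates every load from its first use by an independent instruction.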

[Figure 3.20: ID/EX, EX/MEM, and MEM/WB pipeline registers; ALU input multiplexers; Zero?; Data memory]
FIGURE 3.20 Forwarding of results to the ALU requires the addition of three extra inputs on each ALU multiplexer and the addition of three paths to the new inputs.

[Figure 3.4: PC, adder (+4), Instruction memory, IR, IF/ID, IR6..10, IR11..15, MEM/WB.IR, Registers, 16-to-32 Sign extend, ID/EX, ALU, Zero?, Branch taken, EX/MEM, Data memory, MEM/WB]
FIGURE 3.4 The datapath is pipelined by adding a set of registers, one between each pair of pipe stages.

Branch Hazards
When we decide to branch, other instructions are in the pipeline!
[Figure: program execution order (in instructions) vs. time (in clock cycles), CC 1 to CC 9; 1998 Morgan Kaufmann Publishers]
  40 beq $1, $3, 7
  44 and $12, $2, $5
  48 or  $13, $6, $2
  52 add $14, $2, $2
  72 lw  $4, 50($7)
We are predicting "branch not taken": need to add hardware for flushing instructions if we are wrong

Control Hazard on Branches: Three Stage Stall

Branch Stall Impact
If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
Two part solution:
  Determine branch taken or not sooner, AND
  Compute taken branch address earlier
DLX branch tests if register = 0 or ≠ 0
DLX Solution:
  Move Zero test to ID/RF stage
  Adder to calculate new PC in ID/RF stage
  1 clock cycle penalty for branch versus 3
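The branch stall arithmetic above is a one-line calculation; the sketch below (with a hypothetical function name) evaluates it for the 3-cycle stall and for the 1-cycle penalty that results from moving the zero test and target adder into ID/RF.

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, stall_cycles):
    """Average CPI when every branch adds a fixed number of stall cycles."""
    return base_cpi + branch_fraction * stall_cycles


three_cycle = cpi_with_branch_stalls(1.0, 0.30, 3)  # resolve branch late
one_cycle = cpi_with_branch_stalls(1.0, 0.30, 1)    # resolve branch in ID/RF
print(round(three_cycle, 2), round(one_cycle, 2))
```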

Pipelined DLX Datapath (Figure 3.22, page 163)
[Figure: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back. Labels: PC, adder (+4), Instruction Memory, IR, IF/ID, IR6..10, IR11..15, MEM/WB.IR, Registers, Zero?, 16-to-32 Sign extend, ID/EX, ALU, EX/MEM, Data Memory, MEM/WB]
This is the correct 1 cycle latency implementation!
FIGURE 3.22 The stall from branch hazards can be reduced by moving the zero test and branch target calculation into the ID phase of the pipeline.

[Figure 3.24: bar chart of the percentage of instructions executed (0% to 25%) that are forward conditional branches, backward conditional branches, and unconditional branches, for the benchmarks compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor; individual bar values are not recoverable]
FIGURE 3.24 The frequency of instructions (branches, jumps, calls, and returns) that may change the PC.

[Figure 3.25: bar chart of the fraction of all conditional branches (0% to 80%) that are forward taken and backward taken, for the same benchmarks; individual bar values are not recoverable]
FIGURE 3.25 Together the forward and backward taken branches account for an average of 67% of all conditional branches.

Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
  Execute successor instructions in sequence
  "Squash" instructions in pipeline if branch actually taken
  Advantage of late pipeline state update
  47% DLX branches not taken on average
  PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
  53% DLX branches taken on average
  But haven't calculated branch target address in DLX
    DLX still incurs 1 cycle branch penalty
    Other machines: branch target known before outcome

Four Branch Hazard Alternatives
#4: Delayed Branch
  Define branch to take place AFTER a following instruction
    branch instruction
    sequential successor 1
    sequential successor 2
    ...
    sequential successor n
    branch target if taken
  (Branch delay of length n)
  1 slot delay allows proper decision and branch target address in 5 stage pipeline
  DLX uses this

Delayed Branch
Where to get instructions to fill branch delay slot?
  Before branch instruction
  From the target address: only valuable when branch taken
  From fall through: only valuable when branch not taken
  Cancelling branches allow more slots to be filled
Compiler effectiveness for single branch delay slot:
  Fills about 60% of branch delay slots
  About 80% of instructions executed in branch delay slots useful in computation
  About 50% (60% x 80%) of slots usefully filled
Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
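The delay-slot estimate is a product of the two percentages above. The second function below goes one step further and is an assumption on my part, not something the slides state: it treats every delay slot that is not usefully filled as one wasted cycle, which gives an average penalty of about half a cycle per branch for a single slot.

```python
def usefully_filled_fraction(filled, useful_when_filled):
    """Fraction of delay slots that hold an instruction doing useful work."""
    return filled * useful_when_filled


def effective_branch_penalty(delay_slots, useful_fraction):
    """Average wasted cycles per branch (assumed model: an unfilled slot wastes one cycle)."""
    return delay_slots * (1.0 - useful_fraction)


fraction = usefully_filled_fraction(0.60, 0.80)  # the slide rounds this to about 50%
penalty = effective_branch_penalty(1, 0.50)
print(round(fraction, 2), penalty)
```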

Evaluating Branch Alternatives

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$

Scheduling scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
Stall pipeline       3                1.42   3.5                      1.0
Predict taken        1                1.14   4.4                      1.26
Predict not taken    1                1.09   4.5                      1.29
Delayed branch       0.5              1.07   4.6                      1.31

Conditional & Unconditional = 14%, 65% change PC

Pipelining Introduction Summary
Just overlap tasks, and easy if tasks are independent
Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then:

$$\text{Speedup} = \frac{\text{Pipeline Depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

Hazards limit performance on computers:
  Structural: need more HW resources
  Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  Control: delayed branch, prediction
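The CPI column of the branch table can be regenerated from the speedup formula. Two values in this sketch are assumptions rather than statements from the slides: a pipeline depth of 5 (consistent with 3.5 = 5 / 1.42), and the reading that under "predict not taken" only the 65% of branches that change the PC pay the 1-cycle penalty (which reproduces the 1.09). The computed speedups agree with the table to within about 0.1; the table's figures appear to have been rounded at intermediate steps.

```python
PIPELINE_DEPTH = 5       # assumed; not stated on the slide
BRANCH_FREQUENCY = 0.14  # conditional & unconditional branches
TAKEN_FRACTION = 0.65    # fraction of branches that change the PC

# Average penalty cycles per branch for each scheme.
EFFECTIVE_PENALTY = {
    "Stall pipeline": 3.0,
    "Predict taken": 1.0,
    "Predict not taken": 1.0 * TAKEN_FRACTION,  # assumed: pay only when taken
    "Delayed branch": 0.5,
}


def evaluate_schemes():
    """Return {scheme: (CPI, speedup v. unpipelined, speedup v. stall)}."""
    stall_cpi = 1.0 + BRANCH_FREQUENCY * EFFECTIVE_PENALTY["Stall pipeline"]
    rows = {}
    for scheme, penalty in EFFECTIVE_PENALTY.items():
        cpi = 1.0 + BRANCH_FREQUENCY * penalty
        rows[scheme] = (cpi, PIPELINE_DEPTH / cpi, stall_cpi / cpi)
    return rows


results = evaluate_schemes()
for scheme, (cpi, vs_unpipelined, vs_stall) in results.items():
    print(f"{scheme:18s} {cpi:.2f} {vs_unpipelined:.2f} {vs_stall:.2f}")
```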