Multiple Predictors: BTB + Branch Direction Predictors


Constructive Computer Architecture: Branch Prediction: Direction Predictors
Arvind, Computer Science & Artificial Intelligence Lab., Massachusetts Institute of Technology
October 28, 2015 http://csg.csail.mit.edu/6.175 L16-1

Multiple Predictors: BTB + Branch Direction Predictors

[Figure: pipeline with PC, Fetch, Decode, Reg Read, Execute and Write Back. A Next Addr Pred forms a tight loop with the PC, and a Br Dir Pred follows it. Later stages report "correct" or "mispred"; mispredicted instructions must be filtered.]

Fetch: need next PC immediately.
Decode: instruction type and PC-relative targets available.
Reg Read: simple conditions and register targets available.
Execute: complex conditions available.

Suppose we maintain a table of how a particular branch has resolved before. At the decode stage we can consult this table to check whether the incoming (pc, ppc) pair matches our prediction. If not, redirect the pc.

Branch Prediction Bits

Remember how the branch was resolved previously. Assume 2 BP bits per instruction and use a saturating counter:

11  Strongly taken
10  Weakly taken
01  Weakly not taken
00  Strongly not taken

On taken the counter moves one state toward 11; on not taken it moves one state toward 00.
The direction prediction changes only after two successive bad predictions.

Two-bit versus one-bit Branch prediction

Consider the branch instruction needed to implement a loop:
with one bit, the prediction will always be set incorrectly on loop exit;
with two bits, the prediction will not change on loop exit.
A little bit of hysteresis is good in changing predictions.
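The hysteresis argument above can be checked with a small simulation. This is a minimal sketch rather than anything from the lecture: the class and function names are illustrative, and the workload (a loop branch taken four times and then not taken on exit, with the loop run three times) is an assumption chosen for the example.

```python
class OneBitPredictor:
    """Predicts whatever the branch did last time."""

    def __init__(self, taken=True):
        self.taken = taken

    def predict(self):
        return self.taken

    def update(self, taken):
        self.taken = taken


class TwoBitCounter:
    """2-bit saturating counter: 3 = strongly taken ... 0 = strongly not taken."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2  # states 10 and 11 predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)  # saturate at 11
        else:
            self.state = max(0, self.state - 1)  # saturate at 00


def count_mispredictions(predictor, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# Assumed workload: 4 taken iterations, then the not-taken loop exit, 3 times.
loop_outcomes = ([True] * 4 + [False]) * 3
one_bit_misses = count_mispredictions(OneBitPredictor(), loop_outcomes)
two_bit_misses = count_mispredictions(TwoBitCounter(), loop_outcomes)
```

With this workload the one-bit predictor mispredicts five times (every loop exit plus the first iteration of each re-entry), while the two-bit counter mispredicts three times (the loop exits only).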

Branch History Table (BHT)

[Figure: the pc and instruction arrive from Fetch. k bits of the pc (above the two low-order 00 bits) form the BHT index into a 2^k-entry BHT with 2 bits per entry, which answers "Taken / not Taken?". The opcode answers "Branch?", and the offset is added to the PC to form the target PC.]

At the Decode stage, if the instruction is a branch then the BHT is consulted using the pc; if the BHT shows a different prediction than the incoming ppc, Fetch is redirected.
A 4K-entry BHT with 2 bits per entry gives ~80-90% correct direction predictions.

Exploiting Spatial Correlation (Yeh and Patt, 1992)

    if (x[i] < 7) then y += 1;
    if (x[i] < 5) then c -= 4;

If the first condition is false then so is the second condition.
A history register, H, records the direction of the last N branches executed by the processor, and the predictor uses this information to predict the resolution of the next branch.
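The decode-time BHT lookup described above can be sketched as follows. This is an illustrative sketch, not the course's implementation: the names `BranchHistoryTable` and `decode_check` are hypothetical, and the code assumes 4-byte instructions (fall-through is pc + 4), a target of pc + offset, and counters that start at 01 (weakly not taken).

```python
class BranchHistoryTable:
    """2^k-entry table of 2-bit saturating counters, indexed by pc bits."""

    def __init__(self, index_bits=12):  # 12 index bits -> 4K entries
        self.mask = (1 << index_bits) - 1
        # Assumption: every counter starts at 01 (weakly not taken).
        self.counters = [1] * (1 << index_bits)

    def index(self, pc):
        # Assumption: 4-byte instructions, so the two low pc bits are 00.
        return (pc >> 2) & self.mask

    def predict_taken(self, pc):
        return self.counters[self.index(pc)] >= 2

    def train(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def decode_check(bht, pc, offset, ppc):
    """Return the pc Fetch should be redirected to, or None if ppc agrees."""
    predicted = pc + offset if bht.predict_taken(pc) else pc + 4
    return None if predicted == ppc else predicted


bht = BranchHistoryTable(index_bits=4)
# Fetch predicted fall-through (ppc = pc + 4) for a backward branch at 0x40.
before = decode_check(bht, pc=0x40, offset=-16, ppc=0x44)  # BHT agrees
bht.train(0x40, taken=True)
bht.train(0x40, taken=True)
after = decode_check(bht, pc=0x40, offset=-16, ppc=0x44)   # BHT now says taken
```

Because only k bits of the pc are used, distinct branches can share an entry; with `index_bits=4`, the branches at 0x40 and 0x80 map to the same counter.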

Two-Level Branch Predictor

Pentium Pro uses the result from the last two branches to select one of the four sets of BHT bits (~95% correct).

[Figure: k bits of the pc (above the two low-order 00 bits) index four 2^k-entry, 2-bit BHTs; a 2-bit global branch history shift register selects one of the four to answer "Taken / not Taken?". The Taken / not Taken result of each branch is shifted into the history register.]

Where does BHT fit in the processor pipeline?

The BHT can only be used after instruction decode.
We still need the next instruction address predictor (e.g., BTB) at the fetch stage.
Predictor training: on a pc misprediction, information about redirecting the pc has to be passed to the fetch stage. However, for training the branch predictors, information has to be passed even when there is no misprediction.
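A two-level predictor of the kind described above (a 2-bit global history selecting one of four BHTs) can be sketched as follows. The class name, table sizes, initial counter values and the alternating test pattern are all assumptions made for illustration; this is not a model of the Pentium Pro itself.

```python
class TwoLevelPredictor:
    """Global history register selects one of 2^h BHTs of 2-bit counters."""

    def __init__(self, index_bits=10, history_bits=2):
        self.index_mask = (1 << index_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.history = 0  # most recent branch outcome in the low bit
        # Assumption: all counters start at 01 (weakly not taken).
        self.tables = [[1] * (1 << index_bits)
                       for _ in range(1 << history_bits)]

    def _index(self, pc):
        return (pc >> 2) & self.index_mask

    def predict_taken(self, pc):
        return self.tables[self.history][self._index(pc)] >= 2

    def update(self, pc, taken):
        table, i = self.tables[self.history], self._index(pc)
        table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
        # Shift the resolved direction into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def count_misses(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict_taken(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# Assumed workload: one branch that strictly alternates taken / not taken.
alternating = [True, False] * 10
misses = count_misses(TwoLevelPredictor(), 0x100, alternating)
```

On this alternating pattern the predictor mispredicts only twice while warming up, because each history value ends up selecting a counter that sees a single consistent outcome.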

Multiple predictors in a pipeline

At each stage we need to take two decisions:
1. Whether the current instruction is a wrong-path instruction. This requires looking at epochs.
2. Whether the prediction (ppc) following the current instruction is good or not. This requires consulting the prediction data structure (BTB, BHT, ...).

The Fetch stage must correct the pc unless the redirection comes from a known wrong-path instruction.
Redirections from the Execute stage are always correct, i.e., they cannot come from wrong-path instructions.

Dropping vs poisoning an instruction

Once an instruction is determined to be on the wrong path, the instruction is either dropped or poisoned.
Drop: if the wrong-path instruction has not modified any bookkeeping structures (e.g., the Scoreboard) then it is simply removed.
Poison: if the wrong-path instruction has modified bookkeeping structures then it is poisoned and passed down for bookkeeping reasons (say, to remove it from the scoreboard).
Subsequent stages know not to update any architectural state for a poisoned instruction.
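The drop-versus-poison rule above can be sketched in a few lines. Everything here is a simplified assumption for illustration: the `Inst` record, the `in_scoreboard` flag, and a scoreboard modeled as a set of destination registers (a real scoreboard may track several in-flight writers per register).

```python
from dataclasses import dataclass


@dataclass
class Inst:
    dst: int                     # destination register number
    value: int                   # result that write-back would commit
    in_scoreboard: bool = False  # has it already modified the scoreboard?
    poisoned: bool = False


def kill_wrong_path(inst):
    """Drop the instruction (return None) or poison it and pass it on."""
    if not inst.in_scoreboard:
        return None              # nothing to clean up: simply remove it
    inst.poisoned = True         # must still flow down to free its entry
    return inst


def write_back(inst, regfile, scoreboard):
    scoreboard.discard(inst.dst)        # bookkeeping happens either way
    if not inst.poisoned:
        regfile[inst.dst] = inst.value  # architectural state: only if live


regfile = [0] * 32
scoreboard = {5}
early = kill_wrong_path(Inst(dst=3, value=7))
late = kill_wrong_path(Inst(dst=5, value=9, in_scoreboard=True))
write_back(late, regfile, scoreboard)
```

After this runs, the early instruction has vanished, while the late one has released its scoreboard entry without writing register 5.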

N-Stage pipeline: BTB only

[Figure: Fetch (PC, BTB, fep) -> f2d -> Decode -> d2e -> ... -> Execute (eep). Instructions travel as {pc, ppc, ieep}; Execute sends {pc, newpc, taken, mispredict, ...} back to Fetch through redirect.]

Assume unbounded epochs; fep is attached to every fetched instruction.

At Execute:
(correct pc?) if (ieep < eep) then mark the instruction as poisoned
(correct ppc?) if (correct pc) & mispred then increase eep
For every control instruction send <pc, newpc, taken, mispred, ...> to Fetch for training and redirection

At Fetch:
msg from Execute: train the BTB with <pc, newpc, taken, mispred>, and if the msg from Execute indicates misprediction then set pc, increase fep

N-Stage pipeline: Two predictors

[Figure: Fetch (PC, feep, fdep) -> f2d -> Decode (deep, dep) -> d2e -> ... -> Execute (eep). Decode sends redirects to Fetch through dredirect and Execute sends them through eredirect.]

Both Decode and Execute can redirect the PC; an Execute redirect should never be overruled.
We will use separate epochs for each redirecting stage.
feep and deep are estimates of eep at Fetch and Decode, respectively. deep is updated by the incoming eep.
fdep is Fetch's estimate of dep.
Initially all epochs are set to 0.
The Execute stage logic does not change.
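The BTB-only epoch protocol above can be sketched as two small classes. This is a hedged illustration, not the course's Bluespec code: the class and field names are hypothetical, the BTB is modeled as an unbounded dictionary, instructions are 4 bytes, and only control instructions are passed to `process_control`.

```python
class FetchStage:
    def __init__(self):
        self.pc = 0
        self.fep = 0    # Fetch's epoch, attached to each fetched instruction
        self.btb = {}   # assumption: unbounded map from pc to predicted target

    def next_inst(self):
        ppc = self.btb.get(self.pc, self.pc + 4)  # default: fall through
        inst = {"pc": self.pc, "ppc": ppc, "ieep": self.fep}
        self.pc = ppc
        return inst

    def handle_msg(self, msg):
        # Train on every message, redirect only on a misprediction.
        if msg["taken"]:
            self.btb[msg["pc"]] = msg["newpc"]
        else:
            self.btb.pop(msg["pc"], None)
        if msg["mispred"]:
            self.pc = msg["newpc"]
            self.fep += 1


class ExecuteStage:
    def __init__(self):
        self.eep = 0

    def process_control(self, inst, actual_next_pc):
        """Return (poisoned, msg) for a control instruction."""
        if inst["ieep"] < self.eep:       # correct pc? no: wrong path
            return True, None
        mispred = inst["ppc"] != actual_next_pc
        if mispred:                       # correct pc but wrong ppc
            self.eep += 1
        msg = {"pc": inst["pc"], "newpc": actual_next_pc,
               "taken": actual_next_pc != inst["pc"] + 4, "mispred": mispred}
        return False, msg


fetch, execute = FetchStage(), ExecuteStage()
a = fetch.next_inst()   # branch at pc 0, predicted to fall through
b = fetch.next_inst()   # fetched from pc 4 before the branch resolves
poisoned_a, msg = execute.process_control(a, actual_next_pc=0x40)
fetch.handle_msg(msg)   # redirect: pc becomes 0x40, fep becomes 1
poisoned_b, _ = execute.process_control(b, actual_next_pc=8)
c = fetch.next_inst()   # first correct-path instruction, carries ieep 1
poisoned_c, _ = execute.process_control(c, actual_next_pc=0x44)
```

Instruction b carries the old epoch and is poisoned at Execute, while c, fetched after the redirect, carries the new epoch and is accepted.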

Decode stage: Redirection logic

[Figure: Fetch (PC, feep, fdep) -> f2d -> Decode (deep, dep) -> d2e -> ... -> Execute (eep). Instructions travel from Fetch as {pc, ppc, ieep, idep} and continue to Execute as {..., ieep}. Decode sends {pc, newpc, ieep, ...} to Fetch through dredirect; Execute sends {pc, newpc, taken, mispredict, ...} through eredirect.]

Is ieep = deep?
  yes: Is idep = dep?
    yes: Current instruction is OK; check the ppc prediction via the BHT, increment dep if misprediction.
    no: Wrong-path instruction; drop it.
  no: Current instruction is OK but Execute has redirected the pc; set <deep, dep> to <ieep, idep>.

N-Stage pipeline: Two predictors, Redirection logic

At Execute:
(correct pc?) if (ieep < eep) then poison the instruction
(correct ppc?) if (correct pc) & mispred then increase eep
For every non-poisoned control instruction send <pc, newpc, taken, mispred, ...> to Fetch for training and redirection

At Fetch:
msg from Execute: train the BTB, and if (mispred) set pc, increase feep
msg from Decode: if (no redirect message from Execute) then if (ieep = feep) set pc, increase fdep, else drop it. The ieep check makes sure that the msg from Decode is not from a wrong-path instruction.

At Decode: see the Decode stage redirection logic.
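The Decode-stage decision tree above can be sketched as a single function. This is an illustration under stated assumptions: the function name, the dictionary fields and the `bht_next_pc` parameter (the next pc that Decode's predictor would choose) are hypothetical, and the sketch assumes that the BHT check also runs after the epochs are resynchronized, which the flattened flowchart does not state explicitly.

```python
def decode_stage(state, inst, bht_next_pc):
    """Process one instruction at Decode.

    state holds Decode's epochs {"deep", "dep"}; inst carries
    {"pc", "ppc", "ieep", "idep"}. Returns (keep, redirect_msg).
    """
    if inst["ieep"] == state["deep"]:
        if inst["idep"] != state["dep"]:
            return False, None            # wrong path: drop it
    else:
        # Execute has redirected the pc: adopt the instruction's epochs.
        state["deep"], state["dep"] = inst["ieep"], inst["idep"]

    # Current instruction is OK: check Fetch's ppc against the BHT.
    if bht_next_pc != inst["ppc"]:
        state["dep"] += 1
        msg = {"pc": inst["pc"], "newpc": bht_next_pc, "ieep": inst["ieep"]}
        return True, msg
    return True, None


state = {"deep": 0, "dep": 0}
r1 = decode_stage(state, {"pc": 0, "ppc": 4, "ieep": 0, "idep": 0}, 4)
r2 = decode_stage(state, {"pc": 4, "ppc": 8, "ieep": 0, "idep": 0}, 0x40)
r3 = decode_stage(state, {"pc": 8, "ppc": 12, "ieep": 0, "idep": 0}, 12)
r4 = decode_stage(state, {"pc": 0x80, "ppc": 0x84, "ieep": 1, "idep": 0}, 0x84)
```

Here r2 triggers a Decode redirect and bumps dep, so the following instruction (r3) still carries the old idep and is dropped; r4 arrives with a newer ieep, which signals an Execute redirect, so Decode resynchronizes both epochs and keeps it.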

One bit epoch does not work

[Figure: the same two-predictor pipeline, with Fetch (PC, feep, fdep), Decode (deep, dep), Execute (eep), and the dredirect and eredirect paths.]

A Decode redirect that is issued in a given eep should only kill instructions in the same eep in Fetch.
Suppose a message has a red eepoch and sits for a long time in dredirect; by the time Fetch reads it, the eepoch may have changed to green and again to red. In such a situation the message in dredirect should be discarded.
For a one-bit epoch solution see Khan, Wright and Zhang.

Discussion

The number of entries in the BTB is small, both because of the need for fast access and the need to store the target address (small and fat).
The number of entries in the BHT is large (thin and tall).
We can keep the history bits for branches in the BTB also to improve performance; alternatively we can set the branches to be always-taken.
Jumps through registers (JALR) are problematic and perhaps should not be kept in the BTB.
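The red/green aliasing problem above can be reproduced with a short comparison. This is a minimal sketch: the function name and the `bits` parameter are hypothetical, and epochs are modeled as plain integers that Fetch increments on each Execute redirect.

```python
def accept_decode_redirect(msg_ieep, feep, bits=None):
    """Should Fetch act on a Decode redirect tagged with msg_ieep?

    bits=None models unbounded epochs; bits=1 models a one-bit epoch.
    """
    if bits is not None:
        mask = (1 << bits) - 1
        return (msg_ieep & mask) == (feep & mask)
    return msg_ieep == feep


msg_ieep = 0   # message created in epoch 0 ("red"), then stuck in dredirect
feep = 2       # meanwhile Execute redirected twice: red -> green -> red

stale_accepted_one_bit = accept_decode_redirect(msg_ieep, feep, bits=1)
stale_accepted_unbounded = accept_decode_redirect(msg_ieep, feep)
```

With a one-bit epoch the stale message looks current and would wrongly redirect the pc; with unbounded epochs the comparison fails and the message is discarded.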

Uses of Jump Register (JALR)

Switch statements (jump to address of matching case): the BTB will work well only if the same case is used repeatedly.
Dynamic function call (jump to run-time function address): the BTB will work well only if the same function is called repeatedly (e.g., in C++ programming, when objects have the same type in a virtual function call).
Subroutine returns (jump to return address): the BTB is not likely to work because a function is called from many distinct call sites!
How can we improve subroutine call transfers?

Subroutine Return Stack

A small structure to accelerate JR for subroutine returns; it is typically much more accurate than BTBs.
Push the call address when a function call is executed; pop the return address when a subroutine return is decoded.

    fa() { fb(); }
    fb() { fc(); }
    fc() { fd(); }

[Figure: a stack of k entries (typically k = 8-16) holding, from the top, the pc of the fd call, the pc of the fc call, and the pc of the fb call.]

Don't keep these instructions in the BTB.
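A return stack like the one described above can be sketched with a bounded deque. The class and method names are illustrative, and two details are assumptions rather than statements from the slides: the sketch stores call pc + 4 as the return address (4-byte instructions), and when the stack is full it silently discards the oldest entry.

```python
from collections import deque


class ReturnAddressStack:
    def __init__(self, entries=8):
        # deque(maxlen=k) drops the oldest entry when a push overflows.
        self.stack = deque(maxlen=entries)

    def push_call(self, call_pc):
        # Assumption: 4-byte instructions, so the return address is pc + 4.
        self.stack.append(call_pc + 4)

    def pop_return(self):
        """Predicted return address, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(entries=8)
for call_pc in (0x100, 0x200, 0x300):   # calls to fb, fc, fd
    ras.push_call(call_pc)
predictions = [ras.pop_return() for _ in range(4)]
```

The returns are predicted in reverse call order (0x304, 0x204, 0x104), and the fourth pop finds the stack empty, in which case the pipeline would have to fall back on another predictor or wait for the register value.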

Multiple Predictors: BTB + BHT + Ret Predictors

[Figure: pipeline with PC, Fetch, Decode, Reg Read, Execute and Write Back. A Next Addr Pred forms a tight loop with the PC, followed by a Br Dir Pred and RAS. Later stages report "correct" or "mispred" and correct JR predictions; mispredicted instructions must be filtered.]

Fetch: need next PC immediately.
Decode: instruction type and PC-relative targets available.
Reg Read: simple conditions and register targets available.
Execute: complex conditions available.

Multiple predictors are common; one of the PowerPCs has all three predictors.
Performance analysis is quite difficult; it depends upon the sizes of the various tables and on program behavior.
The system must work even if every prediction is wrong.