Tomasulo's Algorithm


Tomasulo's Algorithm

Load and store buffers
  Contain data and addresses; act like reservation stations

Top-level design: [figure not included in this transcription]

Branch Prediction

Tomasulo's Algorithm

Three steps:

Issue
  Get the next instruction from the FIFO queue
  If a reservation station (RS) is available, issue the instruction to the RS, with operand values if they are available
  If an operand value is not available, the instruction waits in the RS and records which RS will produce the operand
  If no RS is available, instruction issue stalls

Execute
  When an operand becomes available, store it in any reservation stations waiting for it
  When all operands are ready, begin executing the instruction
  Loads and stores are maintained in program order through the effective address calculation
  No instruction is allowed to initiate execution until all branches that precede it in program order have completed

Write result
  Write the result on the common data bus (CDB) into the reservation stations and store buffers
  (Stores must wait until both address and value are received)
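The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the design from the slides: the `Station` class, the `issue`, `execute`, and `write_result` functions, the two-station adder pool, and the register values are all assumptions made for the example, and loads, stores, and branches are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Station:
    """One reservation station (hypothetical layout for this sketch)."""
    name: str
    busy: bool = False
    op: str = ""
    vj: Optional[float] = None  # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None    # name of the station that will produce the operand
    qk: Optional[str] = None

    def ready(self) -> bool:
        return self.busy and self.qj is None and self.qk is None


OPS = {"ADDD": lambda a, b: a + b, "SUBD": lambda a, b: a - b}


def issue(instr, stations, regs, reg_status) -> bool:
    """Issue step: returns False (stall) only when no station is free."""
    op, dest, src_j, src_k = instr
    free = next((s for s in stations if not s.busy), None)
    if free is None:
        return False                      # structural hazard: stay in the queue
    free.busy, free.op = True, op
    free.vj = free.vk = free.qj = free.qk = None
    for src, v, q in ((src_j, "vj", "qj"), (src_k, "vk", "qk")):
        producer = reg_status.get(src)
        if producer is None:
            setattr(free, v, regs[src])   # value available: copy it now
        else:
            setattr(free, q, producer)    # value pending: remember the producer
    reg_status[dest] = free.name          # this station now produces dest
    return True


def execute(station) -> float:
    """Execute step: only legal once both operands are present."""
    assert station.ready()
    return OPS[station.op](station.vj, station.vk)


def write_result(station, value, stations, regs, reg_status) -> None:
    """Write-result step: broadcast (station name, value) on the CDB."""
    for s in stations:
        if s.busy and s.qj == station.name:
            s.vj, s.qj = value, None
        if s.busy and s.qk == station.name:
            s.vk, s.qk = value, None
    for reg, producer in list(reg_status.items()):
        if producer == station.name:      # skip registers renamed since issue
            regs[reg] = value
            del reg_status[reg]
    station.busy = False


stations = [Station("Add1"), Station("Add2")]
regs = {"F2": 2.0, "F6": 6.0, "F8": 0.0, "F10": 0.0}
reg_status = {}

issue(("SUBD", "F8", "F6", "F2"), stations, regs, reg_status)
issue(("ADDD", "F6", "F8", "F2"), stations, regs, reg_status)
stalled = not issue(("ADDD", "F10", "F2", "F2"), stations, regs, reg_status)
waiting_on = stations[1].qj               # ADDD waits for the SUBD station

write_result(stations[0], execute(stations[0]), stations, regs, reg_status)
print(stalled, waiting_on, stations[1].ready(), regs["F8"])
```

The second instruction is issued even though F8 is not yet available; only the third one stalls, because both stations are busy. The broadcast in `write_result` is what wakes up the waiting ADDD.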

Tomasulo's Algorithm

Reservation station fields:
  Op: Operation to perform in the unit (e.g., + or -)
  Vj, Vk: Values of the source operands
    Store buffers have a V field holding the result to be stored
  Qj, Qk: Reservation stations producing the source operands (the values to be written)
    Note: Qj, Qk = 0 means the operand is ready
    Store buffers have only Qj, for the RS producing the result
  A: Holds information for the memory address calculation of a load or store (initially the immediate, then the effective address)
  Busy: Indicates that the reservation station or FU is busy

Register result status:
  Qi: Indicates which functional unit will write each register; 0 means no pending write to this register

Example

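Dealing with WAR

The processor issues both DIV and ADD although there is a WAR hazard.
If F6 is ready when DIV is issued, its value is read and stored in the RS (ADD may change it; that is O.K.).
If it is not ready, the RS will read it from the FU producing it; again ADD may change F6, since the value is read from the FU, not from F6.

Example: initial state (cycle 0)

Resources: 3 load buffers, 3 FP adder reservation stations, 2 FP multiplier reservation stations.
The Time column is the FU countdown; Clock is the clock cycle counter.

Instruction stream:
  Instruction         Issue  Exec comp  Write result
  LD    F6   34+ R2
  LD    F2   45+ R3
  MULTD F0   F2  F4
  SUBD  F8   F6  F2
  DIVD  F10  F0  F6
  ADDD  F6   F8  F2

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
        Add1   No
        Add2   No
        Add3   No
        Mult1  No
        Mult2  No

Register result status (clock 0): no pending writes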

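Cycle 1

Instruction status:
  Instruction         Issue  Exec comp  Write result
  LD    F6   34+ R2   1
  LD    F2   45+ R3
  MULTD F0   F2  F4
  SUBD  F8   F6  F2
  DIVD  F10  F0  F6
  ADDD  F6   F8  F2

Load buffers:
  Name   Busy  Address
  Load1  Yes   34+R2
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
        Add1   No
        Add2   No
        Add3   No
        Mult1  No
        Mult2  No

Register result status (clock 1): F6 = Load1

Cycle 2

Instruction status:
  Instruction         Issue  Exec comp  Write result
  LD    F6   34+ R2   1
  LD    F2   45+ R3   2
  MULTD F0   F2  F4
  SUBD  F8   F6  F2
  DIVD  F10  F0  F6
  ADDD  F6   F8  F2

Load buffers:
  Name   Busy  Address
  Load1  Yes   34+R2
  Load2  Yes   45+R3
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
        Add1   No
        Add2   No
        Add3   No
        Mult1  No
        Mult2  No

Register result status (clock 2): F2 = Load2, F6 = Load1

Note: Can have multiple loads outstanding.

CSE4201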

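Cycle 3

Instruction status:
  Instruction         Issue  Exec comp  Write result
  LD    F6   34+ R2   1      3
  LD    F2   45+ R3   2
  MULTD F0   F2  F4   3
  SUBD  F8   F6  F2
  DIVD  F10  F0  F6
  ADDD  F6   F8  F2

Load buffers:
  Name   Busy  Address
  Load1  Yes   34+R2
  Load2  Yes   45+R3
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
        Add1   No
        Add2   No
        Add3   No
        Mult1  Yes   MULTD         R(F4)  Load2
        Mult2  No

Register result status (clock 3): F0 = Mult1, F2 = Load2, F6 = Load1

Note: Register names are removed ("renamed") in the reservation stations; MULTD is issued.
Load1 completing; who is waiting for Load1?

Cycle 4

Instruction status:
  Instruction         Issue  Exec comp  Write result
  LD    F6   34+ R2   1      3          4
  LD    F2   45+ R3   2      4
  MULTD F0   F2  F4   3
  SUBD  F8   F6  F2   4
  DIVD  F10  F0  F6
  ADDD  F6   F8  F2

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  Yes   45+R3
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
        Add1   Yes   SUBD   M(A1)                Load2
        Add2   No
        Add3   No
        Mult1  Yes   MULTD         R(F4)  Load2
        Mult2  No

Register result status (clock 4): F0 = Mult1, F2 = Load2, F6 = M(A1), F8 = Add1

Load2 completing; what is waiting for Load2?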

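Cycle 5

Instruction status:
  Instruction         Issue  Exec comp  Write result
  LD    F6   34+ R2   1      3          4
  LD    F2   45+ R3   2      4          5
  MULTD F0   F2  F4   3
  SUBD  F8   F6  F2   4
  DIVD  F10  F0  F6   5
  ADDD  F6   F8  F2

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  2     Add1   Yes   SUBD   M(A1)  M(A2)
        Add2   No
        Add3   No
  10    Mult1  Yes   MULTD  M(A2)  R(F4)
        Mult2  Yes   DIVD          M(A1)  Mult1

Register result status (clock 5): F0 = Mult1, F2 = M(A2), F6 = M(A1), F8 = Add1, F10 = Mult2

Timer starts counting down for Add1 and Mult1.

Cycle 6

Instruction status:
  Instruction         Issue  Exec comp  Write result
  LD    F6   34+ R2   1      3          4
  LD    F2   45+ R3   2      4          5
  MULTD F0   F2  F4   3
  SUBD  F8   F6  F2   4
  DIVD  F10  F0  F6   5
  ADDD  F6   F8  F2   6

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  1     Add1   Yes   SUBD   M(A1)  M(A2)
        Add2   Yes   ADDD          M(A2)  Add1
        Add3   No
  9     Mult1  Yes   MULTD  M(A2)  R(F4)
        Mult2  Yes   DIVD          M(A1)  Mult1

Register result status (clock 6): F0 = Mult1, F2 = M(A2), F6 = Add2, F8 = Add1, F10 = Mult2

Issue ADDD here despite name dependency on F6?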
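The cycle 6 question can be answered with a small experiment. The sketch below replays only the issue of DIVD (cycle 5) and ADDD (cycle 6); the numeric register values are hypothetical stand-ins for M(A1), M(A2), and R(F4), and `read_operand` and `issue` are illustrative helpers, not names from the slides.

```python
# Hypothetical register values standing in for M(A2), R(F4), and M(A1).
regs = {"F2": 2.5, "F4": 3.0, "F6": 1.5}
# Pending producers just before DIVD is issued (as in the cycle 4 slide).
reg_status = {"F0": "Mult1", "F8": "Add1"}


def read_operand(reg):
    """Return (value, producer); exactly one of the two is not None."""
    producer = reg_status.get(reg)
    if producer is not None:
        return None, producer
    return regs[reg], None


def issue(station, op, dest, src_j, src_k):
    """Capture operands at issue time, then rename the destination."""
    vj, qj = read_operand(src_j)
    vk, qk = read_operand(src_k)
    reg_status[dest] = station
    return {"name": station, "op": op, "vj": vj, "vk": vk, "qj": qj, "qk": qk}


divd = issue("Mult2", "DIVD", "F10", "F0", "F6")   # cycle 5
addd = issue("Add2", "ADDD", "F6", "F8", "F2")     # cycle 6
f6_producer_after_issue = reg_status["F6"]

# Later, ADDD writes F6 while DIVD is still waiting for Mult1.
regs["F6"] = 99.0          # arbitrary new value, for illustration only
del reg_status["F6"]

print(divd["vk"], divd["qj"], f6_producer_after_issue)
```

DIVD copied the old F6 value into its Vk field when it was issued, so the later write by ADDD cannot disturb it. That is why ADDD may be issued, and may even complete, before DIVD.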

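Cycle 7

Instruction status:
  Instruction         Issue  Exec comp  Write result
  LD    F6   34+ R2   1      3          4
  LD    F2   45+ R3   2      4          5
  MULTD F0   F2  F4   3
  SUBD  F8   F6  F2   4      7
  DIVD  F10  F0  F6   5
  ADDD  F6   F8  F2   6

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  0     Add1   Yes   SUBD   M(A1)  M(A2)
        Add2   Yes   ADDD          M(A2)  Add1
        Add3   No
  8     Mult1  Yes   MULTD  M(A2)  R(F4)
        Mult2  Yes   DIVD          M(A1)  Mult1

Register result status (clock 7): F0 = Mult1, F2 = M(A2), F6 = Add2, F8 = Add1, F10 = Mult2

Add2 (ADDD) is waiting for Add1.
Add1 (SUBD) completing; what is waiting for it?

Cycle 8

Instruction status:
  Instruction         Issue  Exec comp  Write result
  LD    F6   34+ R2   1      3          4
  LD    F2   45+ R3   2      4          5
  MULTD F0   F2  F4   3
  SUBD  F8   F6  F2   4      7          8
  DIVD  F10  F0  F6   5
  ADDD  F6   F8  F2   6

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
        Add1   No
  2     Add2   Yes   ADDD   (M-M)  M(A2)
        Add3   No
  7     Mult1  Yes   MULTD  M(A2)  R(F4)
        Mult2  Yes   DIVD          M(A1)  Mult1

Register result status (clock 8): F0 = Mult1, F2 = M(A2), F6 = Add2, F8 = (M-M), F10 = Mult2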

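Cycle 10

Instruction status:
  Instruction         Issue  Exec comp  Write result
  LD    F6   34+ R2   1      3          4
  LD    F2   45+ R3   2      4          5
  MULTD F0   F2  F4   3
  SUBD  F8   F6  F2   4      7          8
  DIVD  F10  F0  F6   5
  ADDD  F6   F8  F2   6      10

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
        Add1   No
  0     Add2   Yes   ADDD   (M-M)  M(A2)
        Add3   No
  5     Mult1  Yes   MULTD  M(A2)  R(F4)
        Mult2  Yes   DIVD          M(A1)  Mult1

Register result status (clock 10): F0 = Mult1, F2 = M(A2), F6 = Add2, F8 = (M-M), F10 = Mult2

Add2 (ADDD) completing; what is waiting for it?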

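Cycle 11

Instruction status:
  Instruction         Issue  Exec comp  Write result
  LD    F6   34+ R2   1      3          4
  LD    F2   45+ R3   2      4          5
  MULTD F0   F2  F4   3
  SUBD  F8   F6  F2   4      7          8
  DIVD  F10  F0  F6   5
  ADDD  F6   F8  F2   6      10         11

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
        Add1   No
        Add2   No
        Add3   No
  4     Mult1  Yes   MULTD  M(A2)  R(F4)
        Mult2  Yes   DIVD          M(A1)  Mult1

Register result status (clock 11): F0 = Mult1, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Mult2

Write result of ADDD here?
All quick instructions complete in this cycle!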


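Cycle 15

Instruction status:
  Instruction         Issue  Exec comp  Write result
  LD    F6   34+ R2   1      3          4
  LD    F2   45+ R3   2      4          5
  MULTD F0   F2  F4   3      15
  SUBD  F8   F6  F2   4      7          8
  DIVD  F10  F0  F6   5
  ADDD  F6   F8  F2   6      10         11

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
        Add1   No
        Add2   No
        Add3   No
  0     Mult1  Yes   MULTD  M(A2)  R(F4)
        Mult2  Yes   DIVD          M(A1)  Mult1

Register result status (clock 15): F0 = Mult1, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Mult2

Mult1 (MULTD) completing; who is waiting for it?

Cycle 16

Instruction status:
  Instruction         Issue  Exec comp  Write result
  LD    F6   34+ R2   1      3          4
  LD    F2   45+ R3   2      4          5
  MULTD F0   F2  F4   3      15         16
  SUBD  F8   F6  F2   4      7          8
  DIVD  F10  F0  F6   5
  ADDD  F6   F8  F2   6      10         11

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
        Add1   No
        Add2   No
        Add3   No
        Mult1  No
  40    Mult2  Yes   DIVD   M*F4   M(A1)

Register result status (clock 16): F0 = M*F4, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Mult2

Just waiting for Mult2 (DIVD) to complete.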


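Cycle 56

Instruction status:
  Instruction         Issue  Exec comp  Write result
  LD    F6   34+ R2   1      3          4
  LD    F2   45+ R3   2      4          5
  MULTD F0   F2  F4   3      15         16
  SUBD  F8   F6  F2   4      7          8
  DIVD  F10  F0  F6   5      56
  ADDD  F6   F8  F2   6      10         11

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
        Add1   No
        Add2   No
        Add3   No
        Mult1  No
  0     Mult2  Yes   DIVD   M*F4   M(A1)

Register result status (clock 56): F0 = M*F4, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Mult2

Cycle 57

Instruction status:
  Instruction         Issue  Exec comp  Write result
  LD    F6   34+ R2   1      3          4
  LD    F2   45+ R3   2      4          5
  MULTD F0   F2  F4   3      15         16
  SUBD  F8   F6  F2   4      7          8
  DIVD  F10  F0  F6   5      56         57
  ADDD  F6   F8  F2   6      10         11

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
        Add1   No
        Add2   No
        Add3   No
        Mult1  No
        Mult2  Yes   DIVD   M*F4   M(A1)

Register result status (clock 57): F0 = M*F4, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Result

Once again: in-order issue, out-of-order execution, and out-of-order completion.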
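The issue, execution-complete, and write-result cycles in the tables above can be reproduced with a very small timing model. This is a sketch under stated assumptions, not the simulator behind the slides: the latencies (2 cycles for loads and adds, 10 for multiply, 40 for divide) are inferred from the countdown timers and the load timing in the example, one instruction is issued per cycle, and structural hazards (free stations, functional units, CDB conflicts) are ignored because none arise in this instruction stream. The names `LATENCY`, `PROGRAM`, and `schedule` are made up for the illustration.

```python
# Latencies inferred from the example (an assumption of this sketch).
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

# (op, destination, FP source registers). The integer base registers
# R2 and R3 used by the loads are assumed to be ready from the start.
PROGRAM = [
    ("LD",    "F6",  []),
    ("LD",    "F2",  []),
    ("MULTD", "F0",  ["F2", "F4"]),
    ("SUBD",  "F8",  ["F6", "F2"]),
    ("DIVD",  "F10", ["F0", "F6"]),
    ("ADDD",  "F6",  ["F8", "F2"]),
]


def schedule(program, latency):
    """Return (op, dest, issue, exec_complete, write_result) per instruction."""
    written_at = {}   # register -> cycle in which its latest producer writes
    rows = []
    for issue, (op, dest, srcs) in enumerate(program, start=1):
        # Operands are captured at issue, so each source refers to the most
        # recent earlier producer; this is the effect of renaming.
        ready = max([issue] + [written_at.get(s, 0) for s in srcs])
        exec_complete = ready + latency[op]   # execution starts at ready + 1
        write_result = exec_complete + 1
        rows.append((op, dest, issue, exec_complete, write_result))
        written_at[dest] = write_result
    return rows


rows = schedule(PROGRAM, LATENCY)
for op, dest, issue, done, write in rows:
    print(f"{op:<6}{dest:<4} issue={issue:<3} exec_comp={done:<3} write={write}")
```

DIVD looks up F6 before ADDD registers itself as the new producer of F6, so it depends on the first load rather than on ADDD, which matches the WAR discussion earlier in the slides.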

Tomasulo's Algorithm

Loads and stores can be done out of order provided they access different memory locations.
If a load and a store, or two stores, access the same location, their order must be preserved (WAR, RAW, or WAW hazard).
If address calculation is done in program order, a load or store can check whether any uncompleted load or store shares the same address.
If one does, either wait or forward the value if possible.
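The address check described above might look like the following. This is a minimal sketch that only implements the "wait" option (no store-to-load forwarding); the `MemOp` record and the `may_proceed` function are hypothetical names, and the list is assumed to hold memory operations in program order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemOp:
    kind: str                 # "load" or "store"
    address: Optional[int]    # None until the effective address is computed
    done: bool = False


def may_proceed(index: int, ops: List[MemOp]) -> bool:
    """True if ops[index] conflicts with no earlier uncompleted memory op."""
    me = ops[index]
    if me.address is None:
        return False              # own address not yet known
    for earlier in ops[:index]:
        if earlier.done:
            continue
        if me.kind == "load" and earlier.kind == "load":
            continue              # two loads never conflict
        if earlier.address is None or earlier.address == me.address:
            return False          # unknown or matching address: wait
    return True


ops = [
    MemOp("store", 100),   # earlier store, not yet completed
    MemOp("load", 100),    # RAW on address 100: must wait
    MemOp("load", 200),    # different address: may go ahead
    MemOp("store", 200),   # WAR against the load of 200: must wait
]
before = [may_proceed(i, ops) for i in range(len(ops))]

ops[0].done = True         # the store to 100 completes
ops[2].done = True         # the load of 200 completes
after = [may_proceed(i, ops) for i in range(len(ops))]
print(before, after)
```

An earlier operation with an unknown address is treated as a possible conflict, which is the conservative choice when addresses are not guaranteed to be computed in program order.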