CS521 CSE IITG 11/23/2012


Outline:
  Parallel decoding and issue
  Parallel execution
  Preserving the sequential consistency of execution and exception processing

[Figure: decode/issue data path, comparing issue-bound fetch with dispatch-bound fetch through reservation stations (RS)]

Parallel decoding and issue, parallel execution:
  Renaming: removes WAR and WAW data dependencies
  Out-of-order scheduling: scoreboarding and the Tomasulo approach
  Reordering: maintains the order of execution

Preserving the sequential consistency of execution and exception processing

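Data dependencies in a code sequence (hazards marked on the slide: RAW, RAW, WAR, WAW, WAR):

  DIV F0, F2, F4
  ADD F6, F0, F8
  ST  F6, 0(R1)
  SUB F8, F10, F24
  MUL F6, F10, F8

The same sequence after renaming F6 to S and F8 to T, which leaves no WAR or WAW hazard:

  DIV F0, F2, F4
  ADD S, F0, F8
  ST  S, 0(R1)
  SUB T, F10, F24
  MUL F6, F10, T

Register renaming: the register address obtained from the mapping selects an entry in the physical register file (larger than the architectural register file).
  Compiler: done statically; limited by the registers visible to the compiler
  Hardware: done dynamically; limited by the registers available to the hardware

Scoreboarding: introduced with the CDC 6600.
[Figure: register file with a data field and a one-bit status field per register]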
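The hazard labels and the effect of renaming on the DIV/ADD/ST/SUB/MUL sequence above can be checked mechanically. The following is a minimal sketch, not taken from the slides: the names `Instr`, `find_hazards`, `rename` and the physical register names `P0`, `P1`, ... are hypothetical. It also renames every destination, whereas the slide renames only F6 and F8.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str               # register written; "" when nothing is written (store)
    srcs: Tuple[str, ...]   # registers read


# The five-instruction sequence from the slide (the store reads F6 and R1).
PROGRAM = [
    Instr("DIV", "F0", ("F2", "F4")),
    Instr("ADD", "F6", ("F0", "F8")),
    Instr("ST",  "",   ("F6", "R1")),
    Instr("SUB", "F8", ("F10", "F24")),
    Instr("MUL", "F6", ("F10", "F8")),
]


def find_hazards(prog: List[Instr]) -> List[Tuple[str, int, int, str]]:
    """Return (kind, earlier_index, later_index, register) for every pair.

    Simplification: this is a pairwise check and does not filter out
    pairs that have an intervening write to the same register.
    """
    hazards = []
    for j, later in enumerate(prog):
        for i in range(j):
            earlier = prog[i]
            if earlier.dest and earlier.dest in later.srcs:
                hazards.append(("RAW", i, j, earlier.dest))
            if later.dest and later.dest in earlier.srcs:
                hazards.append(("WAR", i, j, later.dest))
            if later.dest and later.dest == earlier.dest:
                hazards.append(("WAW", i, j, later.dest))
    return hazards


def rename(prog: List[Instr]) -> List[Instr]:
    """Give every destination a fresh (hypothetical) physical register."""
    mapping = {}  # architectural register -> latest physical register
    out = []
    for n, ins in enumerate(prog):
        # Sources read the most recent mapping, so true dependences survive.
        srcs = tuple(mapping.get(s, s) for s in ins.srcs)
        dest = ins.dest
        if dest:
            dest = f"P{n}"
            mapping[ins.dest] = dest
        out.append(Instr(ins.op, dest, srcs))
    return out


for hazard in find_hazards(PROGRAM):
    print("before renaming:", hazard)
for hazard in find_hazards(rename(PROGRAM)):
    print("after renaming: ", hazard)
```

Under these assumptions only RAW (true) dependences should remain after renaming, because a fresh destination name can never collide with a register that an earlier instruction reads or writes.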

[Figure: scoreboard issue path. The decoded instruction supplies OC (opcode), Rs1, Rs2 and Rd. The V bits of Rs1, Rs2 and Rd are checked and the V bit of Rd is reset. The register file delivers the operand values Os1 and Os2 to the reservation station. When the result returns with Rd, Rd is updated and its V bit is set.]

Example:

  I1: DIV R3, R1, R2
  I2: MUL R5, R3, R4
  I3: ADD R4, R6, R7

[Figure: DIV, MUL and ADD units on a bus]

Precedence handles hazards, but the scoreboard stalls for all RAW, WAR and WAW hazards.

Scoreboard:

  Register    R1  R2  R3   R4   R5   R6   R7
  Read        Φ   Φ   MUL  MUL  Φ    ADD  ADD
  Write       Φ   Φ   DIV  ADD  MUL  Φ    Φ
  Precedence  Φ   Φ   W    R    W    Φ    Φ

1. Issue I1 to DIV: [R1] and [R2] go to DIV; begin DIV.
2. Issue I2 to MUL (dependency), to the RS: TAG[DIV] (from R3) goes to MUL; TAG[MUL] is placed in the read scoreboard for [R4]; TAG[MUL] is placed in the write scoreboard for [R5].
3. Decode I3 to ADD (dependency), to the RS: TAG[ADD] goes in the read scoreboard for [R6] and [R7], and in the write scoreboard for [R4].

Functional unit status fields:
  Op       Operation to perform in the unit
  Qj, Qk   The FUs from which the unit will get its operands
  Rj, Rk   Ready status of the source registers for the operation; if both Rj and Rk are Yes, the operation can be scheduled
  Busy     Indicates that the reservation station or FU is busy

Scoreboard example. Instruction sequence: LF F6, 34(R2); LF F2, 45(R3); followed by the MUL, SUB and DIV operations shown in the tables.

Slide 17:
  No  Name  Busy  Op
  1   INT
  2   MUL1
  3   MUL2
  4   ADD
  5   DIV
  Register result status (FU No): empty

Slide 18:
  No  Name  Busy  Op
  1   INT   Y     LF
  2   MUL1  Y     MUL
  4   ADD   Y     SUB
  5   DIV   Y     DIV
  Register result status (FU No): empty

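Scoreboard example, functional unit status per slide (instruction sequence begins LF F6, 34(R2); LF F2, 45(R3); "-" marks an empty field).

Slide 19:
  No  Name  Busy  Op   Fi   Fj   Fk
  1   INT   Y     LF   F2   -    R3
  2   MUL1  Y     MUL  F0   F2   F4
  4   ADD   Y     SUB  F8   F6   F2
  5   DIV   Y     DIV  F10  F0   F6
  Register result status (FU No): empty

Slide 20:
  No  Name  Busy  Op   Fi   Fj   Fk   Qj  Qk  Rj  Rk
  1   INT   Y     LF   F2   -    R3   -   -   Y   Y
  2   MUL1  Y     MUL  F0   F2   F4   1   -   N   Y
  4   ADD   Y     SUB  F8   F6   F2   -   1   Y   N
  Register result status (FU No): F0=2, F2=1, F8=4, F10=5

Slide 21:
  No  Name  Busy  Op   Fi   Fj   Fk   Qj  Qk  Rj  Rk
  1   INT   Y     LF   F2   -    R3   -   -   N   N
  2   MUL1  Y     MUL  F0   F2   F4   1   -   N   Y
  4   ADD   Y     SUB  F8   F6   F2   -   1   Y   N
  Register result status (FU No): F0=2, F2=1, F8=4, F10=5

Slide 22:
  No  Name  Busy  Op   Fi   Fj   Fk   Qj  Qk  Rj  Rk
  2   MUL1  Y     MUL  F0   F2   F4   -   -   Y   Y
  4   ADD   Y     SUB  F8   F6   F2   -   -   Y   Y
  Register result status (FU No): F0=2, F8=4, F10=5

Slide 23:
  No  Name  Busy  Op   Fi   Fj   Fk   Qj  Qk  Rj  Rk
  2   MUL1  Y     MUL  F0   F2   F4   -   -   N   N
  4   ADD   N
  Register result status (FU No): F0=2, F10=5

Slide 24:
  No  Name  Busy  Op   Fi   Fj   Fk   Qj  Qk  Rj  Rk
  2   MUL1  Y     MUL  F0   F2   F4   -   -   N   N
  4   ADD   Y     ADD  F6   F8   F2   -   -   Y   Y
  Register result status (FU No): F0=2, F6=4, F10=5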

Scoreboard example, functional unit status per slide (instruction sequence begins LF F6, 34(R2); LF F2, 45(R3); "-" marks an empty field).

Slides 25 and 26 (identical):
  No  Name  Busy  Op   Fi   Fj   Fk   Qj  Qk  Rj  Rk
  2   MUL1  Y     MUL  F0   F2   F4   -   -   N   N
  4   ADD   Y     ADD  F6   F8   F2   -   -   N   N
  Register result status (FU No): F0=2, F6=4, F10=5

Slide 27:
  No  Name  Busy  Op   Fi   Fj   Fk   Qj  Qk  Rj  Rk
  2   MUL1  N
  4   ADD   Y     ADD  F6   F8   F2   -   -   N   N
  5   DIV   Y     DIV  F10  F0   F6   -   -   Y   Y
  Register result status (FU No): F6=4, F10=5

Slide 28:
  No  Name  Busy  Op   Fi   Fj   Fk   Qj  Qk  Rj  Rk
  2   MUL1  N
  4   ADD   Y     ADD  F6   F8   F2   -   -   N   N
  5   DIV   Y     DIV  F10  F0   F6   -   -   N   N
  Register result status (FU No): F6=4, F10=5

Slide 29:
  No  Name  Busy  Op   Fi   Fj   Fk   Qj  Qk  Rj  Rk
  2   MUL1  N
  4   ADD   N
  5   DIV   Y     DIV  F10  F0   F6   -   -   N   N
  Register result status (FU No): F10=5

Tomasulo's approach:
  Out-of-order scheduling
  No renaming required from the compiler; it can be done in hardware using special tagging
  Developed at IBM, but used extensively by Intel and AMD
  Tomasulo received the Eckert-Mauchly Award in 1997
  Almost all modern processors use this method: Pentium Pro, Core 2 Duo, Core i3/i5/i7, AMD Opteron/Phenom
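For contrast with the tag-based approach, the valid-bit rule described for the simple scoreboard earlier (check the V bits of Rs1, Rs2 and Rd at issue, reset the V bit of Rd, set it again when the result is written) can be sketched as below. This is an illustrative model only: the class and function names, the one-issue-attempt-per-cycle loop and the latencies (DIV 10, MUL 4, ADD 1) are assumptions, not values from the slides.

```python
from typing import Dict, List, Tuple


class ValidBitScoreboard:
    """One valid (V) bit per register; V is False while a write is pending."""

    def __init__(self, num_regs: int) -> None:
        self.valid = [True] * num_regs

    def try_issue(self, rd: int, rs1: int, rs2: int) -> bool:
        # Stall unless both sources and the destination are valid.
        if not (self.valid[rs1] and self.valid[rs2] and self.valid[rd]):
            return False
        self.valid[rd] = False      # reset V bit of Rd: a write is now pending
        return True

    def write_result(self, rd: int) -> None:
        self.valid[rd] = True       # update Rd, set V bit


def run(program: List[Tuple[str, int, int, int]],
        latency: Dict[str, int], num_regs: int = 8) -> List[int]:
    """In-order issue, one attempt per cycle; returns each issue cycle."""
    sb = ValidBitScoreboard(num_regs)
    in_flight: List[Tuple[int, int]] = []   # (finish_cycle, rd)
    issue_cycles: List[int] = []
    cycle, pc = 0, 0
    while pc < len(program):
        cycle += 1
        # Results that finish in this cycle make their register valid again.
        for finish, rd in in_flight:
            if finish <= cycle:
                sb.write_result(rd)
        in_flight = [(f, r) for f, r in in_flight if f > cycle]
        op, rd, rs1, rs2 = program[pc]
        if sb.try_issue(rd, rs1, rs2):
            issue_cycles.append(cycle)
            in_flight.append((cycle + latency[op], rd))
            pc += 1                 # otherwise the pipeline stalls here
    return issue_cycles


# DIV R3,R1,R2 ; MUL R5,R3,R4 ; ADD R4,R6,R7 (registers as plain indices)
PROGRAM = [("DIV", 3, 1, 2), ("MUL", 5, 3, 4), ("ADD", 4, 6, 7)]
LATENCY = {"DIV": 10, "MUL": 4, "ADD": 1}   # assumed latencies
print(run(PROGRAM, LATENCY))
```

In this model the ADD has no data dependence on the DIV, yet it cannot issue until the stalled MUL ahead of it has issued. Removing that kind of wait is the motivation for reservation stations and tags.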

This is an important topic to read for the ACA course. Many demos are available online. Ref:
  http://www.ecs.umass.edu/ece/koren/architecture/tomasulo/applettomasulo.html
  http://www.dcs.ed.ac.uk/home/hase/webhase/demo/tomasulo.html
  http://www.ecs.umass.edu/ece/koren/architecture/tomasulo1/tomasulo.htm

Three stages of Tomasulo's algorithm:
1. Issue: get an instruction from the FP Op Queue. If a reservation station is free (no structural hazard), control issues the instruction and sends the operands (renames registers).
2. Execution: operate on operands (EX). When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result: finish execution (WB). Write on the Common Data Bus to all awaiting units; mark the reservation station available.

Issue builds the dependences for the new instruction. Writeback wakes up the dependent instructions.

Issue:
  Renames the instruction's two source registers (source renaming)
  Assigns it to a free RS
  Updates the renaming table (destination renaming)
  Also decodes the instruction and reads register values in parallel

Select and execute:
  Only ready instructions can join the competition
  Select logic chooses the instructions for FU execution
  Some policy may be used, e.g. age-based
  Non-ready instructions can be woken up during the writeback of their parent instruction

Common Data Bus:
  Normal data bus: data + destination ("go to" bus)
  Common data bus: data + source ("come from" bus)
  64 bits of data + 4 bits of source index (tag)
  Broadcasts to every instruction in flight
  How it works: child instructions do tag matching and update their ready bits and value fields (if the tag matches theirs)

Adapted from UCB CS252 S98, Copyright 1998 UCB
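The issue, select and Common Data Bus steps above can be sketched as a small model. It is a simplified illustration under stated assumptions, not the IBM 360/91 design: the class names `Station` and `Tomasulo`, the `kind` prefix used to pick a station group, the register values and the pre-existing pending load tag `LD2` are all hypothetical. Execution latency and the 64-bit data and 4-bit tag widths are not modelled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Station:
    name: str
    busy: bool = False
    op: str = ""
    vj: Optional[float] = None   # source values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # tags of the producers still awaited
    qk: Optional[str] = None
    age: int = 0

    def ready(self) -> bool:
        return self.busy and self.qj is None and self.qk is None


class Tomasulo:
    def __init__(self, names: List[str], regs: Dict[str, float],
                 pending: Optional[Dict[str, str]] = None) -> None:
        self.stations = {n: Station(n) for n in names}
        self.regs = dict(regs)
        self.qi = dict(pending or {})   # register -> tag that will write it
        self.clock = 0

    def _read(self, reg: str) -> Tuple[Optional[str], Optional[float]]:
        tag = self.qi.get(reg)
        # Source renaming: take the tag if a write is pending, else the value.
        return (tag, None) if tag is not None else (None, self.regs[reg])

    def issue(self, op: str, dest: str, src1: str, src2: str,
              kind: str) -> Optional[str]:
        free = [s for s in self.stations.values()
                if not s.busy and s.name.startswith(kind)]
        if not free:
            return None                 # structural hazard: stall
        rs = free[0]
        self.clock += 1
        rs.busy, rs.op, rs.age = True, op, self.clock
        rs.qj, rs.vj = self._read(src1)
        rs.qk, rs.vk = self._read(src2)
        self.qi[dest] = rs.name         # destination renaming
        return rs.name

    def select(self, kind: str) -> Optional[str]:
        # Age-based policy: the oldest ready station of this kind wins.
        ready = [s for s in self.stations.values()
                 if s.ready() and s.name.startswith(kind)]
        return min(ready, key=lambda s: s.age).name if ready else None

    def broadcast(self, tag: str, value: float) -> None:
        # Every waiting station matches the tag on the bus against Qj/Qk.
        for s in self.stations.values():
            if s.busy and s.qj == tag:
                s.qj, s.vj = None, value
            if s.busy and s.qk == tag:
                s.qk, s.vk = None, value
        # A register is updated only if this tag is still its latest writer.
        for reg in [r for r, t in self.qi.items() if t == tag]:
            self.regs[reg] = value
            del self.qi[reg]
        if tag in self.stations:
            self.stations[tag] = Station(tag)   # mark station available


m = Tomasulo(["ADD1", "ADD2", "ADD3", "MUL1", "MUL2"],
             regs={"F0": 0.0, "F2": 0.0, "F4": 2.0,
                   "F6": 10.0, "F8": 0.0, "F10": 0.0},
             pending={"F2": "LD2"})     # assumed: load of F2 still in flight
m.issue("MUL", "F0", "F2", "F4", "MUL")
m.issue("SUB", "F8", "F6", "F2", "ADD")
m.issue("DIV", "F10", "F0", "F6", "MUL")
m.issue("ADD", "F6", "F8", "F2", "ADD")
m.broadcast("LD2", 5.0)                 # the load result appears on the bus
print(m.qi, m.select("ADD"), m.select("MUL"))
```

Because the DIV captured the value of F6 at issue, the later ADD that overwrites F6 cannot disturb it. This is how tagging removes the WAR hazard without any renaming by the compiler.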

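IBM 360/91: Tomasulo's scheme
  Issue-bound fetch
  FUs: LOAD, STORE, 3 x ADD/SUB, 2 x MUL/DIV
  Group RSs with 1 slot per FU
  In-order issue, out-of-order execution

[Figure: Tomasulo issue path. The decoded instruction supplies OC, Rs1, Rs2 and Rd, and the V bit of Rd is reset. Each reservation station entry holds OC, Os1/Is1 with Vs1, Os2/Is2 with Vs2, and Rd. The register file delivers the operand values Os1 and Os2. Vs1 and Vs2 are checked before OC, Os1, Os2 and Rd go to the unit. The result and Rd return on the Common Data Bus, which updates Rd, sets its V bit, and associatively updates Is1 and Is2 with Rd, setting the Vs bits.]

Reservation station fields:
  Op       Operation to perform in the unit
  Vj, Vk   Values of the source operands; store buffers have a V field holding the result to be stored
  Qj, Qk   Reservation stations producing the source registers (value to be written); store buffers only have Qi, for the RS producing the result
  Busy     Indicates that the reservation station or FU is busy

Tomasulo example. Instruction sequence begins LF F6, 34(R2); LF F2, 45(R3). Values are shown in parentheses; "-" marks an empty field.

Slide 40:
  Reservation stations: ADD1, ADD2, ADD3, MUL1, MUL2
  Register status (Qi): empty

Slide 41:
  Name  Busy  Op
  ADD1  Y     SUB
  ADD2  Y     ADD
  MUL1  Y     MUL
  MUL2  Y     DIV
  Register status (Qi): empty

Slide 42:
  Name  Busy  Op   Vj     Vk    Qj    Qk
  ADD1  Y     SUB  (LD1)  -     -     LD2
  ADD2  Y     ADD  -      -     ADD1  LD2
  MUL1  Y     MUL  -      (F4)  LD2   -
  Register status (Qi): F0=MUL1, F2=LD2, F6=ADD2, F8=ADD1, F10=MUL2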

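Tomasulo example, reservation station status per slide (instruction sequence begins LF F6, 34(R2); LF F2, 45(R3); values are shown in parentheses; "-" marks an empty field).

Slide 43:
  Name  Busy  Op   Vj     Vk     Qj    Qk
  ADD1  Y     SUB  (LD1)  (LD2)  -     -
  ADD2  Y     ADD  -      (LD2)  ADD1  -
  MUL1  Y     MUL  (LD2)  (F4)   -     -
  Register status (Qi): F0=MUL1, F6=ADD2, F8=ADD1, F10=MUL2

Slides 44 and 45 (identical):
  Name  Busy  Op   Vj      Vk     Qj    Qk
  ADD1  N
  ADD2  Y     ADD  (ADD1)  (LD2)  -     -
  MUL1  Y     MUL  (LD2)   (F4)   -     -
  Register status (Qi): F0=MUL1, F6=ADD2, F10=MUL2

Slide 46:
  Name  Busy  Op   Vj     Vk    Qj    Qk
  ADD1  N
  ADD2  N
  MUL1  Y     MUL  (LD2)  (F4)  -     -
  Register status (Qi): F0=MUL1, F10=MUL2

Slide 47:
  Name  Busy  Op   Vj      Vk     Qj    Qk
  ADD1  N
  ADD2  N
  MUL1  N
  MUL2  Y     DIV  (MUL1)  (LD1)  -     -
  Register status (Qi): F10=MUL2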