Lecture 8-1: Vector Processors (A. Sohn)


Vector Processors

How many iterations does the following loop go through?

    for i = 1 to n do A[i] = B[i] + C[i]

Sequential processor: n iterations. Vector processor: one instruction!

Characteristics of vector processors:
- No data hazards: a single instruction covers the entire loop.
- No control hazards: there is no loop branch to mispredict.
- Heavily interleaved memory.
- Heavily pipelined operations.

Questions to ask:
- Can the memory system keep up? Vector operands demand enormous memory bandwidth, hence the massively interleaved memory. This is where memory-to-memory versus register-to-register architectures and pipeline chaining come in.
- Can the instructions be vectorized at all? That is the job of the vectorizing compiler.
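To make the contrast concrete, here is the loop in C; the leading comment sketches, with illustrative instruction names (assumptions, not a real ISA), the handful of vector instructions that would replace all n scalar iterations:

    /* The slide's loop in C. On a vector processor the whole loop compiles
       to roughly four vector instructions (names are assumptions for
       illustration, not a real ISA):
           VLOAD  V1, B       ; load B[0..n-1] into vector register V1
           VLOAD  V2, C       ; load C[0..n-1] into vector register V2
           VADD   V3, V1, V2  ; ONE instruction adds all n element pairs
           VSTORE V3, A       ; store the n sums back to A               */
    #include <stdio.h>

    void vadd(double *A, const double *B, const double *C, int n) {
        for (int i = 0; i < n; i++)  /* n trips on a scalar pipeline */
            A[i] = B[i] + C[i];      /* the body a vector unit pipelines */
    }

    int main(void) {
        double A[8], B[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        double C[8] = {8, 7, 6, 5, 4, 3, 2, 1};
        vadd(A, B, C, 8);
        printf("A[0] = %g\n", A[0]);  /* prints A[0] = 9 */
        return 0;
    }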

Vector Processors

Pipeline the operations on individual elements. The building blocks:
- Vector functional units: multiplier, adder
- Vector registers
- Vector load/store units
- A set of scalar registers

Memory Interleaving

A vector processor keeps several functional units busy at once, including the adder and the load/store units, so memory must deliver several operands per cycle. For C[0:7] = A[0:7] + B[0:7], the three vectors are ideally allocated across 8 memory modules with skewed placement:

    M0: A[0] B[6] C[4]    M4: A[4] B[2] C[0]
    M1: A[1] B[7] C[5]    M5: A[5] B[3] C[1]
    M2: A[2] B[0] C[6]    M6: A[6] B[4] C[2]
    M3: A[3] B[1] C[7]    M7: A[7] B[5] C[3]

[Timing chart: with this skewed allocation the reads of A, the reads of B, and the writes of C proceed through all 8 modules without conflict, completing around cycle 15.]
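A short C sketch of that allocation. The mapping rule (element i of a vector with skew s lives in module (i + s) mod 8, with skews 0, 2, and 4 for A, B, and C) is inferred from the table above, not stated in the original slide:

    /* Reproduce the skewed-allocation table: element i of a skew-s vector
       lives in module (i + s) % 8. Skews 0/2/4 for A/B/C are inferred from
       the slide's table. */
    #include <stdio.h>

    #define MODULES 8

    int main(void) {
        for (int m = 0; m < MODULES; m++) {
            /* invert the mapping: which element of each vector lives in m? */
            int a = m;                            /* skew 0 */
            int b = (m - 2 + MODULES) % MODULES;  /* skew 2 */
            int c = (m - 4 + MODULES) % MODULES;  /* skew 4 */
            printf("M%d: A[%d] B[%d] C[%d]\n", m, a, b, c);
        }
        return 0;
    }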

Less Ideal Case

Suppose instead that module m simply holds A[m], B[m], and C[m], with no skew:

    M0: A[0] B[0] C[0]    M4: A[4] B[4] C[4]
    M1: A[1] B[1] C[1]    M5: A[5] B[5] C[5]
    M2: A[2] B[2] C[2]    M6: A[6] B[6] C[6]
    M3: A[3] B[3] C[3]    M7: A[7] B[7] C[7]

[Timing chart: with this unskewed allocation the reads of A and B collide in the same modules, so the work splits into two convoys and finishes around cycle 17 rather than 15.]

Pipeline Chaining

What is pipeline chaining? It is an expansion of the internal-forwarding concept of a pipeline (internal forwarding passes a result from one register to another): the results of one pipeline are fed directly into the operand registers of another pipeline. Example:

    V0 = Memory       (memory fetch, via the load pipe)
    V2 = V0 + V1      (vector add)
    V3 = V2 << A3     (left shift)
    V5 = V3 & V4      (logical product)

Facts and assumptions:
- One vector load pipe with 8 stages.
- One vector add pipe with 3 stages.
- One vector shift pipe with 4 stages.
- One vector logic pipe with 2 stages.
- A transition between a pipe and a vector register takes 1 clock cycle.
- V1 is already loaded in its register.
- Vector length = 8; vector register length = 64.
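A small C calculation contrasting chained and unchained execution under those assumptions. The timing model (one cycle into and one cycle out of each pipe, then one result per cycle) is a simplification of the slide's charts, so treat the exact totals as illustrative:

    /* Rough timing model for the 4-pipe chaining example (a simplification,
       not the slide's exact chart). A pipe's first result costs
       depth + 2 cycles (1 cycle in, 1 cycle out); after that it streams
       one result per cycle. */
    #include <stdio.h>

    int main(void) {
        int depth[] = {8, 3, 4, 2};  /* load, add, shift, logic pipes */
        int n = 8;                   /* vector length */
        int chained = 0, unchained = 0;

        for (int p = 0; p < 4; p++) {
            chained   += depth[p] + 2;            /* downstream pipe starts
                                                     on the first result    */
            unchained += depth[p] + 2 + (n - 1);  /* each op drains fully
                                                     before the next starts */
        }
        chained += n - 1;  /* remaining elements stream from the last pipe */

        printf("chained:   %d cycles\n", chained);    /* 25 + 7 = 32 */
        printf("unchained: %d cycles\n", unchained);  /* 53 */
        return 0;
    }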

Pipeline Chaining with Vector Size 8

[Figure: chained datapath. Memory feeds the load pipe into V0; V0 and V1 feed the add pipe into V2; V2 feeds the shift pipe into V3; V3 and V4 feed the logic pipe into V5.]

Pipeline Chaining with Vector Length 8

[Timing chart: with chaining, each downstream pipe starts as soon as the first result of the pipe feeding it emerges, so the load, add, shift, and logic operations overlap across cycles 8-25.]

No Chaining with Vector Length 8

[Timing chart: without chaining, each pipe waits for the previous operation to complete its entire 8-element vector before starting, so execution stretches well beyond the chained case.]

Chaining with Vector Length 64

[Timing chart: with 64-element vectors the pipes stay full much longer, so each pipe's startup latency is amortized over many more elements and chaining pays off even more.]

For Vector Length N

Suppose the vector register length is 64 but the vector length is n > 64. The loop must then be strip-mined: it executes as ⌈n/64⌉ segments of at most 64 elements each.

Vectorizing Compiler

What is a vectorizing compiler? Case 1:

    S1: A[i] = B[i] + C[i]
    S2: D[i] = E[i] * F[i]

Two vector statements can replace the entire loop:

    S1: A[1:n] = B[1:n] + C[1:n]
    S2: D[1:n] = E[1:n] * F[1:n]

This simple example is easy to vectorize. Case 2: how about this loop?

    S1: A[i] = A[i-1] + B[i]

Can we vectorize it as A[1:n] = A[0:n-1] + B[1:n]? No, we cannot, because S1 uses a value computed by S1 in an earlier iteration: A[i] depends on A[i-1], which is computed in the previous iteration.
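A C sketch of strip mining, assuming a maximum vector length of 64; vadd_strip is a hypothetical stand-in for whatever single vector operation processes one segment:

    /* Strip mining: run an n-element operation as ceil(n/64) segments,
       each short enough to fit a 64-element vector register.
       vadd_strip is a hypothetical stand-in for one vector instruction. */
    #define MVL 64  /* vector register length */

    static void vadd_strip(double *A, const double *B, const double *C,
                           int len) {
        for (int i = 0; i < len; i++)  /* one vector instruction's worth */
            A[i] = B[i] + C[i];
    }

    void vadd_stripmined(double *A, const double *B, const double *C, int n) {
        for (int start = 0; start < n; start += MVL) {
            int len = (n - start < MVL) ? n - start : MVL;
            vadd_strip(A + start, B + start, C + start, len);
        }
    }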

Unrolling the recurrence makes the chain explicit:

    S1: A[i] = A[i-1] + B[i]

    Iteration 1: A[1] = A[0] + B[1]
    Iteration 2: A[2] = A[1] + B[2]
    Iteration 3: A[3] = A[2] + B[3]
    ...
    Iteration n: A[n] = A[n-1] + B[n]

Various Dependences

Now consider a slightly more complex case:

    S1: A[i] = A[i-1] + B[i]
    S2: B[i] = A[i] + B[i-1]
    S3: A[i] = B[i-1] + C[i]

- S1 uses a value computed by S1 in the previous iteration: a loop-carried dependence.
- S2 uses a value computed by S2 in the previous iteration: a loop-carried dependence.
- S2 uses the value A[i] computed by S1 in the same iteration: a flow dependence (RAW hazard).
- S1 uses the value B[i], which S2 will modify in the same iteration: an anti-dependence (WAR hazard).
- S1 and S3 write the same value A[i] in the same iteration: an output dependence (WAW hazard).

Note: anti- and output dependences are name conflicts; they are not true dependences. (The C sketch below marks each dependence in place.)
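Here is the loop in C with each dependence annotated where it occurs (the bounds and element type are illustrative):

    /* The three-statement loop with its dependences annotated in place.
       Array bounds and element type are illustrative. */
    void loop(double A[], double B[], double C[], int n) {
        for (int i = 1; i < n; i++) {
            A[i] = A[i-1] + B[i];  /* S1: loop-carried (reads A[i-1]);
                                      anti-dep: reads B[i] that S2 overwrites */
            B[i] = A[i] + B[i-1];  /* S2: flow dep on S1's A[i] (RAW);
                                      loop-carried (reads B[i-1])             */
            A[i] = B[i-1] + C[i];  /* S3: output dep with S1, both write
                                      A[i] (WAW)                              */
        }
    }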

OK, then can we remove those dependences? How? Simply rename the A written by S1 to X and see what happens:

    S1: X[i] = A[i-1] + B[i]
    S2: B[i] = X[i] + B[i-1]
    S3: A[i] = B[i-1] + C[i]

S2 still uses a value computed by S2 in the previous iteration, so that loop-carried dependence remains. But the output dependence (WAW hazard) is gone: S1 and S3 no longer write the same value. How about the anti-dependence (WAR hazard)? It is also gone!

How To Vectorize?

Can the various dependences be vectorized?
- A loop-carried dependence cannot be vectorized.
- A flow dependence cannot be vectorized.
- An anti-dependence can be vectorized. How?
- An output dependence can be vectorized. How?

The last two can be eliminated by renaming the storage, as the following example shows (there is no loop-carried dependence in this loop):

    S1: B[i] = A[i] + D[i]
    S2: A[i] = A[i] * D[i]
    S3: C[i] = B[i] + D[i]
    S4: B[i] = E[i] * B[i]

There are flow dependences in (S1,S3) and (S1,S4); anti-dependences in (S1,S2) and (S3,S4); and an output dependence in (S1,S4). Renaming the B defined by S1 to T (in S1, S3, and S4) removes the output dependence, and renaming the A defined by S2 to F removes the anti-dependence:

    S1: T[i] = A[i] + D[i]
    S2: F[i] = A[i] * D[i]
    S3: C[i] = T[i] + D[i]
    S4: B[i] = E[i] * T[i]
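A C sketch of the renamed loop. Each statement is now an independent whole-array operation, so each line below could issue as a single vector instruction. T and F are the compiler-introduced temporaries, and the '*' operators are assumptions, since the slide's operators were lost in transcription:

    /* The renamed statements, one whole-array (vector) operation each.
       T and F are compiler-introduced temporaries; '*' is assumed where
       the original operator was lost in transcription. */
    void renamed(const double A[], double B[], double C[], const double D[],
                 const double E[], double T[], double F[], int n) {
        for (int i = 0; i < n; i++) T[i] = A[i] + D[i];  /* S1 */
        for (int i = 0; i < n; i++) F[i] = A[i] * D[i];  /* S2 */
        for (int i = 0; i < n; i++) C[i] = T[i] + D[i];  /* S3 */
        for (int i = 0; i < n; i++) B[i] = E[i] * T[i];  /* S4 */
    }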