CZ3001 ADVANCED COMPUTER ARCHITECTURE

Lab 3 Report

Abstract

Pipelining is a technique in which successive steps of an instruction sequence are executed in turn by a sequence of modules able to operate concurrently, so that another instruction can begin before the previous one has finished execution. This report explains the operation and implementation of three-stage and four-stage pipelined data paths and presents their synthesis results.

Seshadri Madhavan
Matric. No: U1322790J
Lab Group: SSP3

1. Explain the function of the three-stage pipelined data path, with test bench, for the execution of R-type instructions, with the necessary RTL block diagram

An instruction in a program passes through several steps (fetch, decode, execute, write back and so on), each of which takes a clock cycle. In a non-pipelined system these steps occur strictly in sequence, so every instruction takes multiple clock cycles and much of the processor's circuitry sits idle at any given time. A multi-stage pipelined data path improves efficiency by keeping all the functional units busy. Its major advantage is that multiple instructions are in flight concurrently: while the 1st instruction is in the execute phase, the 2nd instruction is in the decode phase and the 3rd instruction is in the fetch phase. In a three-stage pipelined data path, up to 3 instructions run concurrently.

Illustration with an Example

16-BIT R-TYPE INSTRUCTION FORMAT:

Op-code   Rd      Rs      Rt
4-bit     4-bit   4-bit   4-bit

A three-stage pipeline for R-type instructions consists of the following stages: Fetch, Decode and Execute. Let us see a typical example using the first three instructions from the test bench.

INSTRUCTIONS
1. 0x0531 (perform the operation with opcode 0x0, destination register 0x5 and source registers 0x3 and 0x1)
2. 0x1F02 (perform the operation with opcode 0x1, destination register 0xF and source registers 0x0 and 0x2)
3. 0x7E51 (perform the operation with opcode 0x7, destination register 0xE and source registers 0x5 and 0x1)

FIRST CLOCK CYCLE
Fetch: PC = 0x0001; the instruction fetched is 0x0531
Decode: (idle)
Execute: (idle)

SECOND CLOCK CYCLE
Fetch: PC = 0x0002; the instruction fetched is 0x1F02
Decode: INST 0x0531 is decoded as reg5 = reg3 + reg1
  RData1 reads 0 from register 3
  Rdata2 reads 5 from register 1
  ALU_op is set to 0x0 for the ADD operation
Execute: (idle)

THIRD CLOCK CYCLE
Fetch: PC = 0x0003; the instruction fetched is 0x7E51
Decode: INST 0x1F02 is decoded as regF = reg0 - reg2
  RData1 reads 0 from register 0
  Rdata2 reads 2 from register 2
  ALU_op is set to 0x1 for the SUB operation
Execute: reg5 = reg3 + reg1
  ALU_out = 5
  Imm_ID_EXE = 1

Test Bench Explanation

The simulation figure illustrates the instruction FETCH, DECODE and EXECUTE stages as explained above. In the first clock cycle, the first instruction is fetched from memory. In the second clock cycle, the second instruction is fetched from memory and the first instruction is decoded. In the third clock cycle, the first instruction is executed while the second instruction is decoded and the third instruction is fetched; the values in the test bench confirm this. The first instruction is shaded in blue, the second in orange and the third in yellow.
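The stage occupancy walked through above can be reproduced with a small behavioural model. The following is a minimal Python sketch, not the lab's Verilog: the function names `decode` and `pipeline_schedule` are hypothetical, and the field order (opcode, destination, two sources) is an assumption inferred from how the three test-bench instructions are decoded in the text.

```python
# Behavioural sketch of the three-stage pipeline (not the lab's Verilog).
# Assumption: fields are opcode[15:12], rd[11:8], rs[7:4], rt[3:0],
# inferred from the decoded examples in the report.

def decode(inst):
    """Split a 16-bit R-type instruction into its four 4-bit fields."""
    return {
        "opcode": (inst >> 12) & 0xF,
        "rd": (inst >> 8) & 0xF,
        "rs": (inst >> 4) & 0xF,
        "rt": inst & 0xF,
    }

STAGES = ("Fetch", "Decode", "Execute")

def pipeline_schedule(program, cycles):
    """Return, for each clock cycle, which instruction occupies each stage.

    Instruction i (0-based) enters Fetch in cycle i + 1 and advances one
    stage per cycle; None marks a stage that is still empty.
    """
    schedule = []
    for cycle in range(1, cycles + 1):
        row = {}
        for depth, stage in enumerate(STAGES):
            idx = cycle - 1 - depth
            row[stage] = program[idx] if 0 <= idx < len(program) else None
        schedule.append(row)
    return schedule

program = [0x0531, 0x1F02, 0x7E51]  # first three test-bench instructions

for inst in program:
    print(f"0x{inst:04X} -> {decode(inst)}")

for cycle, row in enumerate(pipeline_schedule(program, 3), start=1):
    cells = ", ".join(
        f"{stage}: " + ("-" if inst is None else f"0x{inst:04X}")
        for stage, inst in row.items()
    )
    print(f"cycle {cycle}: {cells}")
```

In the third cycle the model places 0x7E51 in Fetch, 0x1F02 in Decode and 0x0531 in Execute, which is the overlap the test bench shading is meant to show.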

RTL Diagrams for R type instructions

2. What modification did you make to convert the data path from R type so that it can be used for both R and I type? Explain the test bench output for the execution of R and I type instructions on the three-stage pipelined data path

To let the data path execute both R and I type instructions rather than R type only, we need to sign-extend the immediate value. This is done by adding the following code for sign extension:

.imm_in({{12{inst[3]}},inst[3:0]})

This replicates bit 3 of the instruction twelve times in front of the 4-bit immediate inst[3:0] to form a 16-bit value.

The following demonstrates the pipeline:

FIRST CLOCK CYCLE
Fetch: PC = 0x0008; the instruction fetched is 0x5BD1
Decode: the instruction at 0x0007 is being decoded
Execute: the instruction at 0x0006 is being executed

SECOND CLOCK CYCLE
Fetch: PC = 0x0009; the instruction fetched is 0x3B13
Decode: INST 0x5BD1 is decoded as regB = regD SRL 1
  RData1 reads 0x001E from register D
  Rdata2 reads 0x0005 from register 1
  ALU_op is set to 0x5 (binary 0101) for the SRL operation
Execute: the instruction at 0x0007 is being executed

THIRD CLOCK CYCLE
Fetch: PC = 0x000A; the instruction fetched is 0x0DE2
Decode: INST 0x3B13 is decoded as an R type instruction with destination register B and source registers 1 and 3 (RData1 is read from register 1 and Rdata2 from register 3)
Execute: the instruction at 0x0008 is being executed
  Rdata1_ID_EXE = 0x001E
  Rdata2_ID_EXE = 5
  Imm_ID_EXE = 1
  Rdata2_imm_ID_EXE = 1
  Alu_out = 0x000F

In the third clock cycle, the instruction at 0x0008 is in its execute phase. Because it is an I type instruction, the control unit sets ALU_src to 1, so the immediate value is selected instead of the value in Rdata2_ID_EXE. For R type instructions ALU_src is set to 0, so the value in Rdata2_ID_EXE is used instead of the immediate value. In the above figure, the instruction at 0x0008 is an I type instruction (highlighted by the blue shading), and the instructions at 0x0009 and 0x000A are R type instructions (highlighted by the orange and yellow shading). We can observe that alusrc_id_exe changes to 1 to show that the data to be used is the immediate value provided in the instruction and not Rdata2.

3. Explain the function of the four-stage pipelined R and I type data path with the help of the test bench output.

A four-stage pipeline has the stages Fetch, Decode, Execute and Write Back; the result is written back only if the write-enable signal is high. In this lab experiment, write enable is always high. In a four-stage pipelined architecture, up to 4 instructions can be in execution simultaneously in the processor: when the first instruction is writing back its result, the second instruction is executing, the third instruction is in the decode stage and the fourth instruction is in the fetch stage.

FIRST CLOCK CYCLE
Fetch: PC is updated to 0x0001; the instruction fetched is 0x0531
Decode: (idle)
Execute: (idle)
Write Back: (idle)

SECOND CLOCK CYCLE
Fetch: PC is updated to 0x0002; the instruction fetched is 0x1F02
Decode: INST 0x0531 is decoded as reg5 = reg3 + reg1
  RData1 reads 0 from register 3
  Rdata2 reads 5 from register 1
  ALU_op is set to 0x0 for the ADD operation
Execute: (idle)
Write Back: (idle)

THIRD CLOCK CYCLE
Fetch: PC is updated to 0x0003; the instruction fetched is 0x7E51
Decode: INST 0x1F02 is decoded as regF = reg0 - reg2
  RData1 reads 0 from register 0
  Rdata2 reads 2 from register 2
  ALU_op is set to 0x1 for the SUB operation
Execute: reg5 = reg3 + reg1
  ALU_out = 5
  Imm_ID_EXE = 1
  Rdata2_imm_ID_EXE = 1
Write Back: (idle)

FOURTH CLOCK CYCLE
Fetch: PC is updated to 0x0004; the instruction fetched is 0x6A55
Decode: INST 0x7E51 is decoded as regE = reg5 X reg1
  RData1 reads 5
  Rdata2 reads 5
  ALU_op is set to 0x7 for this operation
Execute: regF = reg0 - reg2
  ALU_out = 0xFFFE
  Imm_ID_EXE = 1
  Rdata2_imm_ID_EXE = 2
Write Back: register 5 is written with the value 5

The example above illustrates the stages of the four-stage pipeline, in which up to four instructions can be in execution at any point in time. As illustrated, when the first instruction is writing its result back into the register file, the second instruction is executing, the third instruction is decoding and the fourth instruction is being fetched.

The above figure shows the test bench simulation of the four-stage pipelined data path: while the first instruction is writing back its result, the second instruction is executing, the third is decoding and the fourth is being fetched.
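The four-stage flow, together with the sign extension and the ALU_src operand selection, can be sketched as a cycle-by-cycle behavioural model. This is a Python illustration under stated assumptions rather than the lab's Verilog: the helper names (`sign_extend4`, `alu`, `run_four_stage`) are hypothetical, opcodes 0x0, 0x1 and 0x5 are taken to be ADD, SUB and SRL as in the walkthrough, only opcode 0x5 is treated as I type, the initial register values are the ones read in the text, and the I type instruction 0x5BD1 is placed third in the program purely to exercise the immediate path. Data hazards are not modelled.

```python
MASK16 = 0xFFFF
I_TYPE_OPCODES = {0x5}  # assumption: SRL takes an immediate, as in 0x5BD1

def sign_extend4(nibble):
    """Sign-extend a 4-bit value to 16 bits, like {{12{inst[3]}}, inst[3:0]}."""
    nibble &= 0xF
    return (nibble | 0xFFF0) if (nibble & 0x8) else nibble

def alu(opcode, a, b):
    """Only the three operations named in the walkthrough are modelled."""
    if opcode == 0x0:                      # ADD
        return (a + b) & MASK16
    if opcode == 0x1:                      # SUB
        return (a - b) & MASK16
    if opcode == 0x5:                      # SRL (logical shift right)
        return (a >> (b & 0xF)) & MASK16
    raise ValueError(f"opcode {opcode:#x} is not modelled in this sketch")

def run_four_stage(program, regs):
    """Step a program through Fetch, Decode, Execute and Write Back.

    Stages are evaluated from last to first so that each one consumes the
    pipeline register written in the previous cycle. The register file is
    written before it is read within a cycle (an assumption).
    Returns a list of (cycle, stage, rd, value) tuples.
    """
    if_id = id_exe = exe_wb = None         # pipeline registers
    pc, cycle, trace = 0, 0, []
    while pc < len(program) or any(
        r is not None for r in (if_id, id_exe, exe_wb)
    ):
        cycle += 1
        # Write Back: commit the result latched by Execute last cycle.
        if exe_wb is not None:
            rd, value = exe_wb
            regs[rd] = value
            trace.append((cycle, "WB", rd, value))
        # Execute: ALU_src chooses between Rdata2 and the immediate.
        next_exe_wb = None
        if id_exe is not None:
            opcode, rd, rdata1, rdata2, imm, alu_src = id_exe
            operand2 = imm if alu_src else rdata2
            alu_out = alu(opcode, rdata1, operand2)
            trace.append((cycle, "EXE", rd, alu_out))
            next_exe_wb = (rd, alu_out)
        # Decode: read both registers and sign-extend the low nibble.
        next_id_exe = None
        if if_id is not None:
            inst = if_id
            opcode, rd = (inst >> 12) & 0xF, (inst >> 8) & 0xF
            rs, rt = (inst >> 4) & 0xF, inst & 0xF
            alu_src = 1 if opcode in I_TYPE_OPCODES else 0
            next_id_exe = (opcode, rd, regs[rs], regs[rt],
                           sign_extend4(inst), alu_src)
        # Fetch: bring in the next instruction, if any remain.
        next_if_id = None
        if pc < len(program):
            next_if_id = program[pc]
            pc += 1
        if_id, id_exe, exe_wb = next_if_id, next_id_exe, next_exe_wb
    return trace

regs = [0] * 16
regs[0x1], regs[0x2], regs[0xD] = 5, 2, 0x001E   # values read in the text
program = [0x0531, 0x1F02, 0x5BD1]

trace = run_four_stage(program, regs)
for cycle, stage, rd, value in trace:
    print(f"cycle {cycle}: {stage} reg{rd:X} <- 0x{value:04X}")
```

Under these assumptions the model executes reg5 = reg3 + reg1 in the third cycle and writes register 5 in the fourth, while regF = reg0 - reg2 is executing, in line with the walkthrough. For the I type instruction the immediate 1 is selected rather than the value read from register 1.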

4. Synthesize the three-stage and four-stage pipelined R and I type CPUs and find the number of slices and the minimum period.

CPU                                                   Bit width   No. of LUT slices   No. of registers   Minimum period (ns)
Three-stage pipelined CPU (R and I type)              16          642                 330                8.103
Four-stage pipelined CPU                              16          635                 348                6.404

5. Justify the synthesis result

From the above synthesis result it can be observed that the number of registers increases from the three-stage to the four-stage pipeline (from 330 to 348), because more registers are required to hold the values passed into the additional stage of the pipeline. The minimum period also drops from 8.103 ns to 6.404 ns, which is consistent with the extra pipeline registers shortening the longest combinational path between stages. It can also be observed that the number of LUT slices actually decreases from the three-stage to the four-stage implementation (from 642 to 635), because the Xilinx tools optimize each circuit differently. In an ideal scenario, the number of LUT slices would be expected to increase when the number of stages is increased by one, because more circuitry is required to implement the additional stage of the pipelined data path.
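To put the synthesis figures side by side, the minimum period can be converted into a maximum clock frequency using f_max = 1 / T_min. The sketch below is illustrative arithmetic on the numbers reported in the table; the dictionary layout and the function name `max_frequency_mhz` are hypothetical.

```python
# Synthesis figures copied from the table in section 4.
results = {
    "three-stage": {"luts": 642, "registers": 330, "min_period_ns": 8.103},
    "four-stage": {"luts": 635, "registers": 348, "min_period_ns": 6.404},
}

def max_frequency_mhz(min_period_ns):
    """f_max = 1 / T_min; a period in ns gives MHz as 1000 / T."""
    return 1000.0 / min_period_ns

for name, r in results.items():
    f_max = max_frequency_mhz(r["min_period_ns"])
    print(f"{name}: f_max = {f_max:.2f} MHz")

three, four = results["three-stage"], results["four-stage"]
extra_registers = four["registers"] - three["registers"]
lut_change = four["luts"] - three["luts"]
clock_speedup = three["min_period_ns"] / four["min_period_ns"]

print("extra registers:", extra_registers)
print("LUT change:", lut_change)
print(f"clock speed-up: {clock_speedup:.3f}x")
```

The ratio of the two periods compares clock rates only; it says nothing about cycles per instruction, which this report does not measure.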