MIT OpenCourseWare Multicore Programming Primer, January (IAP) Please use the following citation format:

1 MIT OpenCourseWare
Multicore Programming Primer, January (IAP) 2007
Please use the following citation format:
Rodric Rabbah, Multicore Programming Primer, January (IAP) 2007. (Massachusetts Institute of Technology: MIT OpenCourseWare). (accessed MM DD, YYYY). License: Creative Commons Attribution-Noncommercial-Share Alike.
Note: Please use the actual date you accessed this material in your citation.
For more information about citing these materials or our Terms of Use, visit:

2 6.189 IAP 2007 Lecture 17 The Raw Experience IAP 2007 MIT

3 Raw Chips October

4 Raw Microprocessor
Tiled microprocessor with point-to-point pipelined scalar operand network
Each tile is 4 mm x 4 mm
  MIPS-style compute processor
    Single-issue 8-stage pipe
    32b FPU
    32K D Cache, I Cache
  4 on-chip mesh networks
    Two for operands
    One for cache misses, I/O
    One for message passing

5 Raw Microprocessor
16 tiles (16 issue)
180 nm ASIC (IBM SA-27E)
~100 million transistors
1 million gates
3-4 years of development
1.5 years of testing
200K lines of test code
Core Frequency: V V
Frequency competitive with IBM-implemented Powers in same process
18 W average power

6 One Cycle in the Life of a Tiled Processor
[Images by MIT OCW]
Application uses as many tiles as needed to exploit its parallelism

7 Raw Motherboard

8 Raw in Action

9 MPEG-2 Encoder Performance
[Figure: two plots of speedup vs. # of tiles, one for 350 x 240 images and one for 720 x 480 images, each marking 30 frames/s]
Legend:
  Square: linear speedup
  Diamond: hand-optimized, slice parallel implementation
  Circle: slice parallel implementation
  Triangle: baseline macroblock parallel implementation

10 MPEG-2 Encoder Performance
[Table: encoding rate (frames/s) by # of tiles and image size]
* Estimated data rates

11 Programmable Graphics Pipeline
[Figure: simplified graphics pipeline with stages Input, Vertex, Sync, Triangle Setup, Pixel]

12 Phong Shading
Per-pixel Phong-shaded polyhedron
162 vertices, 1 light
Output, rendered using Raw simulator

13 Phong Shading (64-tiles)
[Figure: fixed pipeline vs. reconfigurable pipeline]
Reconfigurable pipeline: 33% faster, 150% better utilization

14 Shadow Volumes
4 textured triangles
1 point light
Rendered in 3 passes
Output, rendered using Raw simulator

15 Shadow Volumes (64-tiles)
[Figure: Pass 1, Pass 2, and Pass 3 on the fixed pipeline and on the reconfigurable pipeline, measured in cycles]
Reconfigurable pipeline: 40% faster

16 1020 Element Microphone Array

17 Case Study: Beamformer
[Figure: bar chart of MFLOPS; labeled values include 1,430 and 19]
Bars: C program; C program; Unoptimized StreamIt; Optimized StreamIt
Platforms: 1 GHz Pentium III; 420 MHz single tile Raw; 420 MHz 64 tile Raw; 420 MHz 16 tile Raw

18 The Raw Experience
Insights into the design
  Raw architecture
  Raw parallelizing compiler
  StreamIt language and compiler

19 Scalability Problems in Wide Issue Processors
[Figure labels: Wide Fetch (16 inst), Control, Bypass Net, Unified Load/Store Queue]

20 Area and Frequency Scalability Problems
[Figure: bypass net connecting N units, annotated ~N^3 and ~N^2]

21 Operand Routing is Global
[Figure: bypass net]

22 Idea: Make Operand Routing Local
[Figure: bypass net]

23 Idea: Exploit Locality
[Figure: bypass net]

24 Replace Crossbar with Point-To-Point Network

25 Replace Crossbar with Point-To-Point Network

26 Operand Transport Latency
                             Crossbar    Point-to-Point Network
Non-local Placement          ~N          ~N^(1/2)
Locality-driven Placement    ~N          ~
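The scaling trend for a point-to-point network can be illustrated by counting hops on a square mesh. The sketch below is only an illustration: the function names, the Manhattan-distance hop model, and the assumption of a sqrt(N) x sqrt(N) mesh are not taken from the lecture.

```python
import math
from itertools import product

def mesh_side(n_tiles):
    """Side length of a square mesh holding n_tiles (assumes a perfect square)."""
    side = math.isqrt(n_tiles)
    if side * side != n_tiles:
        raise ValueError("n_tiles must be a perfect square")
    return side

def hops(a, b):
    """Manhattan distance between two (row, col) tile coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def worst_case_hops(n_tiles):
    """Corner-to-corner distance, which grows like sqrt(N)."""
    return 2 * (mesh_side(n_tiles) - 1)

def average_hops(n_tiles):
    """Mean distance over all ordered pairs of distinct tiles (placement ignores locality)."""
    side = mesh_side(n_tiles)
    tiles = list(product(range(side), repeat=2))
    total = sum(hops(a, b) for a in tiles for b in tiles if a != b)
    return total / (len(tiles) * (len(tiles) - 1))
```

Under this model, quadrupling the tile count from 16 to 64 roughly doubles the worst-case distance, while a producer and consumer placed on neighboring tiles stay one hop apart regardless of N.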

27 Distribute the Register File

28 More Scalability Problems
[Figure labels: Wide Fetch (16 inst), Control, SCALABLE, Unified Load/Store Queue]

29 More Scalability Problems
[Figure labels: Wide Fetch (16 inst), Control, Unified Load/Store Queue]

30 Distribute Everything
[Figure labels: Wide Fetch (16 inst), Control, Unified Load/Store Queue]

31 Tiled Processor

32 Tiled Processor

33 Tiled Processor (Taylor PhD 2007)
Fast inter-tile communication through point-to-point pipelined scalar operand network (SON)
Easy to scale for the same reasons as multicores

34 Raw Compute Processor Internals
[Figure labels: r24, r25, r26, r27; IF, D, A, TL, M1, M2, F, P, U, E]

35 Tile-Tile Communication
add $25,$1,$2
Route $P->$E
Route $W->$P
sub $20,$1,$
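The four steps on this slide can be traced with a toy model. This is a sketch under stated assumptions, not the Raw hardware: the FIFO names are hypothetical, $25 is assumed to be a network-mapped register, and because the last operand of the sub instruction is cut off in the source, the code assumes it is read from the network.

```python
from collections import deque

def tile_to_tile(r1_a, r2_a, r1_b):
    """Walk one operand from tile A to its east neighbor, tile B."""
    # Hypothetical FIFOs standing in for the network ports named on the slide.
    a_proc_to_switch = deque()   # $P as seen by tile A's switch
    a_east_to_b_west = deque()   # A's $E output, which is B's $W input
    b_switch_to_proc = deque()   # $P as seen by tile B's processor

    # Tile A processor: add $25,$1,$2 (assumes $25 is network-mapped)
    a_proc_to_switch.append(r1_a + r2_a)
    # Tile A switch: Route $P->$E
    a_east_to_b_west.append(a_proc_to_switch.popleft())
    # Tile B switch: Route $W->$P
    b_switch_to_proc.append(a_east_to_b_west.popleft())
    # Tile B processor: sub $20,$1,<operand from the network>
    # (assumption: the truncated operand is the value arriving from tile A)
    r20_b = r1_b - b_switch_to_proc.popleft()
    return r20_b
```

For example, `tile_to_tile(3, 4, 10)` sends 3 + 4 from tile A and computes 10 - 7 on tile B. The point of the sketch is that the sender names no destination and the receiver names no source; the switch routes carry that information.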

36 Why Communication Is Expensive on Multicores
Multiprocessor Node 1: send occupancy, send overhead, send latency
Transport Cost
Multiprocessor Node 2: receive occupancy, receive overhead, receive latency

37 Why Communication Is Expensive on Multicores
Multiprocessor Node 1
  send occupancy: destination node name, sequence number, value, launch sequence
  send latency: commit latency, network injection

38 Why Communication Is Expensive on Multicores
Multiprocessor Node 2
  receive sequence
  injection cost
  receive occupancy
  receive latency

39 A Figure of Merit for SONs
5-tuple <SO, SL, NHL, RL, RO>
  SO: Send occupancy
  SL: Send latency
  NHL: Network hop latency
  RL: Receive latency
  RO: Receive occupancy
Tip: Ordering follows timing of message from sender to receiver

40 The Interesting Region
Scalable Multiprocessor (on-chip): <2, 14, 3, 14, 4>
Raw SON (scalable): <0, 0, 1, 2, 0>
Superscalar (not scalable): <0, 0, 0, 0, 0>
Where is Cell in this space?
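One way to compare these 5-tuples is to add up the components along the path of a single operand. The sketch below assumes a simple additive model in which the hop latency is paid once per hop; the class and method names are illustrative, and the pairing of tuples with labels follows the order in which they appear on the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SonCost:
    so: int   # send occupancy (cycles)
    sl: int   # send latency
    nhl: int  # network hop latency, charged per hop
    rl: int   # receive latency
    ro: int   # receive occupancy

    def end_to_end(self, hops):
        """Cycles from producer to consumer across `hops` network hops (additive model)."""
        return self.so + self.sl + self.nhl * hops + self.rl + self.ro

MULTIPROCESSOR = SonCost(2, 14, 3, 14, 4)
RAW_SON = SonCost(0, 0, 1, 2, 0)
SUPERSCALAR = SonCost(0, 0, 0, 0, 0)
```

Under this model a one-hop transfer costs 37 cycles with the multiprocessor tuple, 3 cycles with the Raw tuple, and 0 with the superscalar tuple, which shows how far apart the two ends of the design space sit.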

41 The Raw Experience
Insights into the design
  Raw architecture
  Raw parallelizing compiler

42 Raw Parallelizing Compiler (Lee PhD 2005)
Sequential Program
Compiler
  Data distribution
  Instruction distribution
  Coordination
    Communication
    Control flow
Fine-grained Orchestrated Parallel Program

43 Data Distribution
load r1 <- addr
add r1 <- r1, 1
[Figure: four tiles marked ?, each with a memory M]

44 Instruction Distribution
[Figure: dataflow graph of a basic block with the following nodes]
seed.0=seed
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
tmp0=tmp0.1
v1.2=v1
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
tmp1=tmp1.3
pval7=tmp1.3+tmp2.5
v2.4=v2
pval5=seed.0*6.0
pval3=seed.0*v2.4
pval4=pval5+2.0
tmp2.5=pval3+2.0
tmp3.6=pval4/3.0
tmp2=tmp2.5
tmp3=tmp3.6
pval6=tmp1.3-tmp2.5
v1.8=pval7*3.0
v0.9=tmp0.1-v1.8
v1=v1.8
v0=v0.9
v2.7=pval6*5.0
v2=v2.7
v3.10=tmp3.6-v2.7
v3=v3.10
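How much parallelism this basic block offers can be estimated from its dependence graph. The sketch below is an illustration, not the Raw compiler's algorithm: it lists the slide's statements in one valid program order (the ordering is an assumption, since the slide shows a graph), assumes every statement takes one cycle, and computes the longest dependence chain.

```python
import re

# The basic block's statements, arranged in a valid program order (assumed).
BLOCK = """
seed.0=seed pval1=seed.0*3.0 pval0=pval1+2.0 tmp0.1=pval0/2.0 tmp0=tmp0.1
v1.2=v1 pval2=seed.0*v1.2 tmp1.3=pval2+2.0 tmp1=tmp1.3
v2.4=v2 pval3=seed.0*v2.4 tmp2.5=pval3+2.0 tmp2=tmp2.5
pval5=seed.0*6.0 pval4=pval5+2.0 tmp3.6=pval4/3.0 tmp3=tmp3.6
pval7=tmp1.3+tmp2.5 pval6=tmp1.3-tmp2.5
v1.8=pval7*3.0 v0.9=tmp0.1-v1.8 v1=v1.8 v0=v0.9
v2.7=pval6*5.0 v2=v2.7 v3.10=tmp3.6-v2.7 v3=v3.10
""".split()

def dependence_depths(stmts):
    """Return depth[i]: length of the longest dependence chain ending at stmts[i]."""
    last_def = {}  # variable name -> index of the most recent statement defining it
    depth = []
    for i, stmt in enumerate(stmts):
        dest, expr = stmt.split("=", 1)
        operands = re.split(r"[-+*/]", expr)
        # Constants and live-in variables have no earlier definition and are skipped.
        producers = [last_def[op] for op in operands if op in last_def]
        depth.append(1 + max((depth[p] for p in producers), default=0))
        last_def[dest] = i
    return depth

depths = dependence_depths(BLOCK)
critical_path = max(depths)
```

With unit latencies, the 27 statements have a critical path of 7, so even with unlimited tiles and free communication the block cannot run faster than 7 cycles, which bounds the achievable speedup at 27/7.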

45 Clustering: Parallelism vs. Communication
[Figure: two copies of the basic block's dataflow graph from the Instruction Distribution slide]

46 Adjusting Granularity: Load Balancing
[Figure: two copies of the basic block's dataflow graph from the Instruction Distribution slide]
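The tension named in these two slide titles can be made concrete by scoring a candidate assignment of instructions to clusters. The sketch below is a hypothetical scoring function, not the compiler's heuristic: it uses a handful of producer/consumer edges from the basic block, counts the edges that cross clusters (communication), and reports the largest cluster (load balance).

```python
from collections import Counter

# A few (producer, consumer) edges taken from the basic block's statements.
EDGES = [
    ("pval2", "tmp1.3"), ("pval3", "tmp2.5"),
    ("tmp1.3", "pval7"), ("tmp2.5", "pval7"),
    ("tmp1.3", "pval6"), ("tmp2.5", "pval6"),
    ("pval7", "v1.8"), ("pval6", "v2.7"),
]

def evaluate(assignment, edges):
    """Return (cross-cluster edges, size of the largest cluster)."""
    cut = sum(1 for p, c in edges if assignment[p] != assignment[c])
    load = Counter(assignment.values())
    return cut, max(load.values())

# Everything in one cluster: no communication, no parallelism.
ONE_CLUSTER = {name: 0 for edge in EDGES for name in edge}

# Two clusters: half the load each, at the price of two communicated values.
TWO_CLUSTERS = {
    "pval2": 0, "tmp1.3": 0, "pval7": 0, "v1.8": 0,
    "pval3": 1, "tmp2.5": 1, "pval6": 1, "v2.7": 1,
}
```

Here `evaluate(ONE_CLUSTER, EDGES)` gives `(0, 8)` and `evaluate(TWO_CLUSTERS, EDGES)` gives `(2, 4)`: splitting halves the work per cluster but introduces two cross-cluster operands.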

47 Placement
[Figure: the basic block's dataflow graph beside the same instructions grouped by tile]
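Placement decides which tile each cluster runs on, and a natural objective is to keep clusters that exchange many operands close together. The sketch below uses exhaustive search on a 2 x 2 mesh as a stand-in for whatever algorithm the Raw compiler uses; the cluster names and traffic weights are made up for illustration.

```python
from itertools import permutations

TILES = [(0, 0), (0, 1), (1, 0), (1, 1)]   # 2 x 2 mesh coordinates
CLUSTERS = ["A", "B", "C", "D"]            # hypothetical clusters
# Hypothetical number of operands exchanged between cluster pairs.
TRAFFIC = {("A", "B"): 3, ("A", "C"): 2, ("B", "D"): 2, ("A", "D"): 1}

def hops(a, b):
    """Manhattan distance between two tile coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def placement_cost(placement, traffic):
    """Total operand-hops: each pair's traffic weighted by its tile distance."""
    return sum(w * hops(placement[a], placement[b]) for (a, b), w in traffic.items())

def best_placement(clusters, tiles, traffic):
    """Try every assignment of clusters to tiles and keep the cheapest one."""
    best = None
    for perm in permutations(tiles, len(clusters)):
        placement = dict(zip(clusters, perm))
        cost = placement_cost(placement, traffic)
        if best is None or cost < best[0]:
            best = (cost, placement)
    return best

cost, placement = best_placement(CLUSTERS, TILES, TRAFFIC)
```

Cluster A talks to all three others, so one partner has to sit on the diagonal tile; the search puts the lightest partner, D, there. Exhaustive search is only practical for a handful of tiles, so a real compiler would need a heuristic.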

48 Communication Coordination
[Figure: processors (Proc) and switches (Switch); labels: send, routes, receive]

49 Instruction Scheduling
[Figure: per-tile schedules for the basic block; processor code adds send() and recv() to the computation, and switch code consists of route(src,dst) instructions such as route(t,e) and route(w,t)]
Inter-tile cycle scheduling schedules communication and can guarantee deadlock freedom

50 Final Code Representation
[Figure: the final processor and switch instruction streams for each tile]

51 Control Coordination

52 Asynchronous Global Branching
[Figure: br x repeated across tiles]

53 Summary
Tiled microprocessors incorporate the best elements of superscalars and multiprocessors

                               Superscalar   Multicore   Tiled Processor with SON
PE-PE communication            Free          Expensive   Cheap
Exploitation of parallelism    Implicit      Explicit    Both
Clean semantics                Yes           No          Yes
Scalable                       No            Yes         Yes
Power efficient                No            Yes         Yes

54 Raw Project Contributors
Anant Agarwal, Saman Amarasinghe, Jonathan Babb, Rajeev Barua, Ian Bratt, Jonathan Eastep, Matt Frank, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, Theo Konstantakopoulos, Walter Lee, Jason Miller, James Psota, Arvind Saraf, Vivek Sarkar, Nathan Shnidman, Volker Strumpen, Michael Taylor, Elliot Waingold, David Wentzlaff

Project 5: Optimizer Jason Ansel

Project 5: Optimizer Jason Ansel Project 5: Optimizer Jason Ansel Overview Project guidelines Benchmarking Library OoO CPUs Project Guidelines Use optimizations from lectures as your arsenal If you decide to implement one, look at Whale

More information

The Looming Software Crisis due to the Multicore Menace

The Looming Software Crisis due to the Multicore Menace The Looming Software Crisis due to the Multicore Menace Saman Amarasinghe Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 2 Today: The Happily Oblivious Average

More information

Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs

Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs Michael Gordon, William Thies, and Saman Amarasinghe Massachusetts Institute of Technology ASPLOS October 2006 San Jose,

More information

Chapter 16 - Instruction-Level Parallelism and Superscalar Processors

Chapter 16 - Instruction-Level Parallelism and Superscalar Processors Chapter 16 - Instruction-Level Parallelism and Superscalar Processors Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 16 - Superscalar Processors 1 / 78 Table of Contents I 1 Overview

More information

Chapter 4. Pipelining Analogy. The Processor. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop:

Chapter 4. Pipelining Analogy. The Processor. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Chapter 4 The Processor Part II Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup p = 2n/(0.5n + 1.5) 4 =

More information

Building Manycore Processor-to-DRAM Networks with Monolithic Silicon Photonics

Building Manycore Processor-to-DRAM Networks with Monolithic Silicon Photonics Building Manycore Processor-to-DRAM Networks with Monolithic Silicon Photonics Christopher Batten 1, Ajay Joshi 1, Jason Orcutt 1, Anatoly Khilo 1 Benjamin Moss 1, Charles Holzwarth 1, Miloš Popović 1,

More information

Overview. 1 Trends in Microprocessor Architecture. Computer architecture. Computer architecture

Overview. 1 Trends in Microprocessor Architecture. Computer architecture. Computer architecture Overview 1 Trends in Microprocessor Architecture R05 Robert Mullins Computer architecture Scaling performance and CMOS Where have performance gains come from? Modern superscalar processors The limits of

More information

7/11/2012. Single Cycle (Review) CSE 2021: Computer Organization. Multi-Cycle Implementation. Single Cycle with Jump. Pipelining Analogy

7/11/2012. Single Cycle (Review) CSE 2021: Computer Organization. Multi-Cycle Implementation. Single Cycle with Jump. Pipelining Analogy CSE 2021: Computer Organization Single Cycle (Review) Lecture-10 CPU Design : Pipelining-1 Overview, Datapath and control Shakil M. Khan CSE-2021 July-12-2012 2 Single Cycle with Jump Multi-Cycle Implementation

More information

CS 110 Computer Architecture Lecture 11: Pipelining

CS 110 Computer Architecture Lecture 11: Pipelining CS 110 Computer Architecture Lecture 11: Pipelining Instructor: Sören Schwertfeger http://shtech.org/courses/ca/ School of Information Science and Technology SIST ShanghaiTech University Slides based on

More information

Asanovic/Devadas Spring Pipeline Hazards. Krste Asanovic Laboratory for Computer Science M.I.T.

Asanovic/Devadas Spring Pipeline Hazards. Krste Asanovic Laboratory for Computer Science M.I.T. Pipeline Hazards Krste Asanovic Laboratory for Computer Science M.I.T. Pipelined DLX Datapath without interlocks and jumps 31 0x4 RegDst RegWrite inst Inst rs1 rs2 rd1 ws wd rd2 GPRs Imm Ext A B OpSel

More information

Technology Timeline. Transistors ICs (General) SRAMs & DRAMs Microprocessors SPLDs CPLDs ASICs. FPGAs. The Design Warrior s Guide to.

Technology Timeline. Transistors ICs (General) SRAMs & DRAMs Microprocessors SPLDs CPLDs ASICs. FPGAs. The Design Warrior s Guide to. FPGAs 1 CMPE 415 Technology Timeline 1945 1950 1955 1960 1965 1970 1975 1980 1985 1990 1995 2000 Transistors ICs (General) SRAMs & DRAMs Microprocessors SPLDs CPLDs ASICs FPGAs The Design Warrior s Guide

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Out-of-Order Schedulers Data-Capture Scheduler Dispatch: read available operands from ARF/ROB, store in scheduler Commit: Missing operands filled in from bypass Issue: When

More information

Design Challenges in Multi-GHz Microprocessors

Design Challenges in Multi-GHz Microprocessors Design Challenges in Multi-GHz Microprocessors Bill Herrick Director, Alpha Microprocessor Development www.compaq.com Introduction Moore s Law ( Law (the trend that the demand for IC functions and the

More information

EN164: Design of Computing Systems Lecture 22: Processor / ILP 3

EN164: Design of Computing Systems Lecture 22: Processor / ILP 3 EN164: Design of Computing Systems Lecture 22: Processor / ILP 3 Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

On-chip Networks in Multi-core era

On-chip Networks in Multi-core era Friday, October 12th, 2012 On-chip Networks in Multi-core era Davide Zoni PhD Student email: zoni@elet.polimi.it webpage: home.dei.polimi.it/zoni Outline 2 Introduction Technology trends and challenges

More information

Dynamic Scheduling II

Dynamic Scheduling II so far: dynamic scheduling (out-of-order execution) Scoreboard omasulo s algorithm register renaming: removing artificial dependences (WAR/WAW) now: out-of-order execution + precise state advanced topic:

More information

EE241 - Spring 2004 Advanced Digital Integrated Circuits. Announcements. Borivoje Nikolic. Lecture 15 Low-Power Design: Supply Voltage Scaling

EE241 - Spring 2004 Advanced Digital Integrated Circuits. Announcements. Borivoje Nikolic. Lecture 15 Low-Power Design: Supply Voltage Scaling EE241 - Spring 2004 Advanced Digital Integrated Circuits Borivoje Nikolic Lecture 15 Low-Power Design: Supply Voltage Scaling Announcements Homework #2 due today Midterm project reports due next Thursday

More information

Dynamic Scheduling I

Dynamic Scheduling I basic pipeline started with single, in-order issue, single-cycle operations have extended this basic pipeline with multi-cycle operations multiple issue (superscalar) now: dynamic scheduling (out-of-order

More information

Multiband RF-Interconnect for Reconfigurable Network-on-Chip Communications UCLA

Multiband RF-Interconnect for Reconfigurable Network-on-Chip Communications UCLA Multiband RF-Interconnect for Reconfigurable Network-on-hip ommunications Jason ong (cong@cs.ucla.edu) Joint work with Frank hang, Glenn Reinman and Sai-Wang Tam ULA 1 ommunication hallenges On-hip Issues

More information

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science !!! Basic MIPS integer pipeline Branches with one

More information

Instructor: Dr. Mainak Chaudhuri. Instructor: Dr. S. K. Aggarwal. Instructor: Dr. Rajat Moona

Instructor: Dr. Mainak Chaudhuri. Instructor: Dr. S. K. Aggarwal. Instructor: Dr. Rajat Moona NPTEL Online - IIT Kanpur Instructor: Dr. Mainak Chaudhuri Instructor: Dr. S. K. Aggarwal Course Name: Department: Program Optimization for Multi-core Architecture Computer Science and Engineering IIT

More information

Lecture 4: Introduction to Pipelining

Lecture 4: Introduction to Pipelining Lecture 4: Introduction to Pipelining Pipelining Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes A B C D Dryer takes 40 minutes Folder

More information

Low Power VLSI Circuit Synthesis: Introduction and Course Outline

Low Power VLSI Circuit Synthesis: Introduction and Course Outline Low Power VLSI Circuit Synthesis: Introduction and Course Outline Ajit Pal Professor Department of Computer Science and Engineering Indian Institute of Technology Kharagpur INDIA -721302 Agenda Why Low

More information

CUDA Threads. Terminology. How it works. Terminology. Streaming Multiprocessor (SM) A SM processes block of threads

CUDA Threads. Terminology. How it works. Terminology. Streaming Multiprocessor (SM) A SM processes block of threads Terminology CUDA Threads Bedrich Benes, Ph.D. Purdue University Department of Computer Graphics Streaming Multiprocessor (SM) A SM processes block of threads Streaming Processors (SP) also called CUDA

More information

Challenges of in-circuit functional timing testing of System-on-a-Chip

Challenges of in-circuit functional timing testing of System-on-a-Chip Challenges of in-circuit functional timing testing of System-on-a-Chip David and Gregory Chudnovsky Institute for Mathematics and Advanced Supercomputing Polytechnic Institute of NYU Deep sub-micron devices

More information

Department Computer Science and Engineering IIT Kanpur

Department Computer Science and Engineering IIT Kanpur NPTEL Online - IIT Bombay Course Name Parallel Computer Architecture Department Computer Science and Engineering IIT Kanpur Instructor Dr. Mainak Chaudhuri file:///e /parallel_com_arch/lecture1/main.html[6/13/2012

More information

ECE 4750 Computer Architecture, Fall 2016 T09 Advanced Processors: Superscalar Execution

ECE 4750 Computer Architecture, Fall 2016 T09 Advanced Processors: Superscalar Execution ECE 4750 Computer Architecture, Fall 2016 T09 Advanced Processors: Superscalar Execution School of Electrical and Computer Engineering Cornell University revision: 2016-11-28-17-33 1 In-Order Dual-Issue

More information

Customized Computing for Power Efficiency. There are Many Options to Improve Performance

Customized Computing for Power Efficiency. There are Many Options to Improve Performance ustomized omputing for Power Efficiency Jason ong cong@cs.ucla.edu ULA omputer Science Department http://cadlab.cs.ucla.edu/~cong There are Many Options to Improve Performance Page 1 Past Alternatives

More information

Instruction Level Parallelism Part II - Scoreboard

Instruction Level Parallelism Part II - Scoreboard Course on: Advanced Computer Architectures Instruction Level Parallelism Part II - Scoreboard Prof. Cristina Silvano Politecnico di Milano email: cristina.silvano@polimi.it 1 Basic Assumptions We consider

More information

Lecture Topics. Announcements. Today: Pipelined Processors (P&H ) Next: continued. Milestone #4 (due 2/23) Milestone #5 (due 3/2)

Lecture Topics. Announcements. Today: Pipelined Processors (P&H ) Next: continued. Milestone #4 (due 2/23) Milestone #5 (due 3/2) Lecture Topics Today: Pipelined Processors (P&H 4.5-4.10) Next: continued 1 Announcements Milestone #4 (due 2/23) Milestone #5 (due 3/2) 2 1 ISA Implementations Three different strategies: single-cycle

More information

Tomasolu s s Algorithm

Tomasolu s s Algorithm omasolu s s Algorithm Fall 2007 Prof. homas Wenisch http://www.eecs.umich.edu/courses/eecs4 70 Floating Point Buffers (FLB) ag ag ag Storage Bus Floating Point 4 3 Buffers FLB 6 5 5 4 Control 2 1 1 Result

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Out-of-Order Execution and Register Rename In Search of Parallelism rivial Parallelism is limited What is trivial parallelism? In-order: sequential instructions do not have

More information

A High Definition Motion JPEG Encoder Based on Epuma Platform

A High Definition Motion JPEG Encoder Based on Epuma Platform Available online at www.sciencedirect.com Procedia Engineering 29 (2012) 2371 2375 2012 International Workshop on Information and Electronics Engineering (IWIEE) A High Definition Motion JPEG Encoder Based

More information

SCALCORE: DESIGNING A CORE

SCALCORE: DESIGNING A CORE SCALCORE: DESIGNING A CORE FOR VOLTAGE SCALABILITY Bhargava Gopireddy, Choungki Song, Josep Torrellas, Nam Sung Kim, Aditya Agrawal, Asit Mishra University of Illinois, University of Wisconsin, Nvidia,

More information

EECS 470. Tomasulo s Algorithm. Lecture 4 Winter 2018

EECS 470. Tomasulo s Algorithm. Lecture 4 Winter 2018 omasulo s Algorithm Winter 2018 Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, yson, Vijaykumar, and Wenisch of Carnegie Mellon University,

More information

COSC4201. Scoreboard

COSC4201. Scoreboard COSC4201 Scoreboard Prof. Mokhtar Aboelaze York University Based on Slides by Prof. L. Bhuyan (UCR) Prof. M. Shaaban (RIT) 1 Overcoming Data Hazards with Dynamic Scheduling In the pipeline, if there is

More information

Recent Advances in Simulation Techniques and Tools

Recent Advances in Simulation Techniques and Tools Recent Advances in Simulation Techniques and Tools Yuyang Li, li.yuyang(at)wustl.edu (A paper written under the guidance of Prof. Raj Jain) Download Abstract: Simulation refers to using specified kind

More information

Compiler Optimisation

Compiler Optimisation Compiler Optimisation 6 Instruction Scheduling Hugh Leather IF 1.18a hleather@inf.ed.ac.uk Institute for Computing Systems Architecture School of Informatics University of Edinburgh 2018 Introduction This

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Out-of-Order Execution and Register Rename In Search of Parallelism rivial Parallelism is limited What is trivial parallelism? In-order: sequential instructions do not have

More information

ECE473 Computer Architecture and Organization. Pipeline: Introduction

ECE473 Computer Architecture and Organization. Pipeline: Introduction Computer Architecture and Organization Pipeline: Introduction Lecturer: Prof. Yifeng Zhu Fall, 2015 Portions of these slides are derived from: Dave Patterson UCB Lec 11.1 The Laundry Analogy Student A,

More information

Learning Outcomes. Spiral 2 8. Digital Design Overview LAYOUT

Learning Outcomes. Spiral 2 8. Digital Design Overview LAYOUT 2-8.1 2-8.2 Spiral 2 8 Cell Mark Redekopp earning Outcomes I understand how a digital circuit is composed of layers of materials forming transistors and wires I understand how each layer is expressed as

More information

CMP 301B Computer Architecture. Appendix C

CMP 301B Computer Architecture. Appendix C CMP 301B Computer Architecture Appendix C Dealing with Exceptions What should be done when an exception arises and many instructions are in the pipeline??!! Force a trap instruction in the next IF stage

More information

Multi-Channel Charge Pulse Amplification, Digitization and Processing ASIC for Detector Applications

Multi-Channel Charge Pulse Amplification, Digitization and Processing ASIC for Detector Applications 1.0 Multi-Channel Charge Pulse Amplification, Digitization and Processing ASIC for Detector Applications Peter Fischer for Tim Armbruster, Michael Krieger and Ivan Peric Heidelberg University Motivation

More information

CS61c: Introduction to Synchronous Digital Systems

CS61c: Introduction to Synchronous Digital Systems CS61c: Introduction to Synchronous Digital Systems J. Wawrzynek March 4, 2006 Optional Reading: P&H, Appendix B 1 Instruction Set Architecture Among the topics we studied thus far this semester, was the

More information

REAL TIME DIGITAL SIGNAL PROCESSING. Introduction

REAL TIME DIGITAL SIGNAL PROCESSING. Introduction REAL TIME DIGITAL SIGNAL Introduction Why Digital? A brief comparison with analog. PROCESSING Seminario de Electrónica: Sistemas Embebidos Advantages The BIG picture Flexibility. Easily modifiable and

More information

Ramon Canal NCD Master MIRI. NCD Master MIRI 1

Ramon Canal NCD Master MIRI. NCD Master MIRI 1 Wattch, Hotspot, Hotleakage, McPAT http://www.eecs.harvard.edu/~dbrooks/wattch-form.html http://lava.cs.virginia.edu/hotspot http://lava.cs.virginia.edu/hotleakage http://www.hpl.hp.com/research/mcpat/

More information

CS4617 Computer Architecture

CS4617 Computer Architecture 1/26 CS4617 Computer Architecture Lecture 2 Dr J Vaughan September 10, 2014 2/26 Amdahl s Law Speedup = Execution time for entire task without using enhancement Execution time for entire task using enhancement

More information

A SCALABLE ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION ON FIELD-PROGRAMMABLE GATE ARRAYS. Theepan Moorthy and Andy Ye

A SCALABLE ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION ON FIELD-PROGRAMMABLE GATE ARRAYS. Theepan Moorthy and Andy Ye A SCALABLE ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION ON FIELD-PROGRAMMABLE GATE ARRAYS Theepan Moorthy and Andy Ye Department of Electrical and Computer Engineering Ryerson University 350

More information

CSE502: Computer Architecture Welcome to CSE 502

CSE502: Computer Architecture Welcome to CSE 502 Welcome to CSE 502 Introduction & Review Today s Lecture Course Overview Course Topics Grading Logistics Academic Integrity Policy Homework Quiz Key basic concepts for Computer Architecture Course Overview

More information

Computer Science 246. Advanced Computer Architecture. Spring 2010 Harvard University. Instructor: Prof. David Brooks

Computer Science 246. Advanced Computer Architecture. Spring 2010 Harvard University. Instructor: Prof. David Brooks Advanced Computer Architecture Spring 2010 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture Outline Instruction-Level Parallelism Scoreboarding (A.8) Instruction Level Parallelism

More information

Silicon-Photonic Clos Networks for Global On-Chip Communication

Silicon-Photonic Clos Networks for Global On-Chip Communication Silicon-Photonic Clos Networks for Global On-Chip Communication Ajay Joshi, Christopher Batten, Yong-Jin Kwon, Scott Beamer, Imran Shamim, Krste Asanović, Vladimir Stojanović NOCS 2009 Massachusetts Institute

More information

Low Power System-On-Chip-Design Chapter 12: Physical Libraries

Low Power System-On-Chip-Design Chapter 12: Physical Libraries 1 Low Power System-On-Chip-Design Chapter 12: Physical Libraries Friedemann Wesner 2 Outline Standard Cell Libraries Modeling of Standard Cell Libraries Isolation Cells Level Shifters Memories Power Gating

More information

Implementation of Pixel Array Bezel-Less Cmos Fingerprint Sensor

Implementation of Pixel Array Bezel-Less Cmos Fingerprint Sensor Article DOI: 10.21307/ijssis-2018-013 Issue 0 Vol. 0 Implementation of 144 64 Pixel Array Bezel-Less Cmos Fingerprint Sensor Seungmin Jung School of Information and Technology, Hanshin University, 137

More information

U. Wisconsin CS/ECE 752 Advanced Computer Architecture I

U. Wisconsin CS/ECE 752 Advanced Computer Architecture I U. Wisconsin CS/ECE 752 Advanced Computer Architecture I Prof. Karu Sankaralingam Unit 5: Dynamic Scheduling I Slides developed by Amir Roth of University of Pennsylvania with sources that included University

More information
