Warp-Aware Trace Scheduling for GPUs. James Jablin (Brown) Thomas Jablin (UIUC) Onur Mutlu (CMU) Maurice Herlihy (Brown)


1 Warp-Aware Trace Scheduling for GPUs James Jablin (Brown) Thomas Jablin (UIUC) Onur Mutlu (CMU) Maurice Herlihy (Brown)

2 Historical Trends in GFLOPS: CPUs vs. GPUs [chart: theoretical single-precision GFLOP/s, roughly 2003-2012, for NVIDIA GPUs (GeForce 5800 through GeForce 680 GTX) versus Intel CPUs (Northwood through Sandy Bridge); reproduced from NVIDIA CUDA C Programming Guide (Version 5.0)]

3 Performance Pitfalls Control flow can negatively affect performance.

4 Performance Pitfalls Pipeline Stall - execution delay in an instruction pipeline to resolve a dependency

5 Hardware: CPU versus GPU [diagram: the CPU devotes most of its die area to control logic and cache with a few ALUs, while the GPU devotes most of its area to many ALUs with minimal control and cache; both attach to DRAM; reproduced from NVIDIA CUDA C Programming Guide (Version 5.0)]

6-17 [animation: a four-stage instruction pipeline (Fetch, Decode, Execute, Write) advancing cycle by cycle, shown side by side with and without branch prediction; with branch prediction the pipeline stays full and instructions complete every cycle, while without it fetch waits for the branch to resolve, opening a pipeline stall (bubble) between the waiting and completed instructions]
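A toy cycle-count model makes the cost of the bubbles concrete. This is a hypothetical sketch: the four stages and the assumption that a branch resolves after Execute follow the animation above, not any specific hardware.

```python
STAGES = 4          # Fetch, Decode, Execute, Write
RESOLVE_STAGE = 3   # branch outcome known after Execute

def cycles(n_instructions, n_branches, predicted):
    """Cycles to run a straight-line program on the toy pipeline."""
    # A full pipeline retires one instruction per cycle after the fill.
    base = STAGES + n_instructions - 1
    if predicted:
        return base  # correct predictions keep the pipeline full
    # Without prediction, fetch stalls until each branch reaches
    # Execute, inserting (RESOLVE_STAGE - 1) bubble cycles per branch.
    return base + n_branches * (RESOLVE_STAGE - 1)

cycles(8, 2, predicted=True)    # 11 cycles
cycles(8, 2, predicted=False)   # 15 cycles
```

With two branches in eight instructions, the unpredicted pipeline loses four cycles to bubbles in this model.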

18-19 Performance Pitfalls Pipeline Stall - execution delay in an instruction pipeline to resolve a dependency. Warp Divergence - threads within a warp take different branch paths, and the divergent paths are serialized.

20-27 Warp Divergence Example [animation: a diamond-shaped control-flow graph with blocks A, B, C, and D; at A's branch some threads of the warp proceed to B while the others proceed to C ("Warp Divergence!"), the two sides execute serially, and the warp reconverges at D ("Warp Reconverges!")]
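The serialization in the example can be modeled in a few lines. This is an illustrative sketch of the masking behavior (32-lane warp, two-way branch), not real hardware semantics:

```python
WARP_SIZE = 32

def execution_passes(branch_taken):
    """Count the serialized passes a warp needs for a two-way branch.

    branch_taken: one boolean per lane, True if that lane takes the
    branch. The SIMT pipeline runs each non-empty path once with the
    other lanes masked off, so a divergent branch costs two passes.
    """
    taken = [lane for lane, t in enumerate(branch_taken) if t]
    not_taken = [lane for lane, t in enumerate(branch_taken) if not t]
    return sum(1 for path in (taken, not_taken) if path)

execution_passes([True] * WARP_SIZE)                        # uniform: 1 pass
execution_passes([lane < 16 for lane in range(WARP_SIZE)])  # divergent: 2 passes
```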

28-29 Warp-Aware Trace Scheduling Schedule instructions across basic block boundaries to expose additional ILP while managing and optimizing warp divergence.

30-34 Origins: Microcode Trace Scheduling - generalizing local and disparate vertical-to-horizontal microcode compaction
Step 1. Trace Selection: partition basic blocks into regions
Step 2. Trace Formation: facilitate local scheduling, potentially adding nodes and edges
Step 3. Local Scheduling: schedule instructions within each region

35-44 Annotate CFG - edge weights come from dynamic profiling or static branch prediction [animation: an example control-flow graph with nodes A through L and weighted edges (e.g. 87, 90, 50, 13, 10, 8, 5, 1); the greedy loop below repeatedly follows the highest-weight edges until every node is assigned to a trace, yielding Trace #1, Trace #2, and Trace #3]

while there are unvisited nodes
    find the next unvisited node with the highest edge weight
    add the node to the trace
end while
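The greedy trace-selection loop on slides 37-44 can be sketched in Python. The CFG encoding (a dict of weighted edges) and the alphabetical tie-breaking are assumptions for illustration, not the paper's implementation:

```python
def select_traces(nodes, edges):
    """Greedily partition a CFG into traces along its heaviest edges.

    nodes: iterable of node names; edges: {(src, dst): weight}, where
    weights come from dynamic profiling or static branch prediction.
    """
    unvisited = set(nodes)
    traces = []

    def heaviest_incident(n):
        return max((w for (u, v), w in edges.items() if n in (u, v)),
                   default=0)

    while unvisited:
        # Seed a new trace at the unvisited node with the heaviest
        # incident edge (ties broken alphabetically).
        node = max(sorted(unvisited), key=heaviest_incident)
        trace = []
        while node is not None:
            trace.append(node)
            unvisited.discard(node)
            # Grow the trace along the heaviest outgoing edge that
            # leads to a still-unvisited successor.
            succs = [(w, v) for (u, v), w in edges.items()
                     if u == node and v in unvisited]
            node = max(succs)[1] if succs else None
        traces.append(trace)
    return traces

# A diamond CFG: the hot path A -> B -> D becomes the first trace.
cfg_nodes = ["A", "B", "C", "D"]
cfg_edges = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}
select_traces(cfg_nodes, cfg_edges)   # [['A', 'B', 'D'], ['C']]
```

The cold side of the branch (C) is left for a later, lower-priority trace, which is what lets the scheduler optimize the hot path first.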

45-49 Before/After [animation: PTX for a kernel with a branch block (BB0), two sides (BB1, BB2), and a join block (BB3); the "After" column starts identical to "Before", then shows the join block's address computation and global load hoisted above the branch]

Before:
BB0:    mov.u32       %r11, %ctaid.y;
        add.s32       %r12, %r8, 1;
        mov.u32       %r1, %tid.y;
        mov.u32       %r2, %tid.x;
        setp.ne.s32   %p2, %r2, 0;
        @%p2 bra      BB2;
BB1:    mul.wide.s32  %rd13, %r4, 4;
        add.s64       %rd14, %rd2, %rd13;
        cvt.s64.s32   %rd7, %r6;
        add.s64       %rd8, %rd6, %rd7;
        bra.uni       BB3;
BB2:    ld.shared.f32 %f5, [%rd3];
        ld.shared.f32 %f6, [%rd3];
        mul.f32       %f7, %f5, %f6;
        st.shared.f32 [%r2], %f7;
        bra.uni       BB3;
BB3:    mul.wide.s32  %rd15, %r3, 4;
        add.s64       %rd12, %rd1, %rd15;
        shl.b64       %rd16, %rd4, 6;
        mov.u64       %rd17, param_0;
        add.s64       %rd18, %rd17, %rd16;
        mul.wide.s32  %rd19, %r2, 4;
        add.s64       %rd2, %rd18, %rd19;
        ld.global.f32 %f4, [%rd2];

After (the BB3 address computation and global load move into BB0, ahead of the branch; BB1 and BB2 are unchanged):
BB0:    mov.u32       %r11, %ctaid.y;
        add.s32       %r12, %r8, 1;
        mov.u32       %r1, %tid.y;
        mov.u32       %r2, %tid.x;
        setp.ne.s32   %p2, %r2, 0;
        shl.b64       %rd16, %rd4, 6;
        mov.u64       %rd17, param_0;
        add.s64       %rd18, %rd17, %rd16;
        mul.wide.s32  %rd19, %r2, 4;
        add.s64       %rd2, %rd18, %rd19;
        ld.global.f32 %f4, [%rd2];
        @%p2 bra      BB2;
BB3:    mul.wide.s32  %rd15, %r3, 4;
        add.s64       %rd12, %rd1, %rd15;
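At the source level, the rewrite on slides 45-49 hoists the join block's address arithmetic above the branch. A hypothetical Python analog (the arithmetic merely stands in for the PTX address computation) shows the transformation preserves the result while letting the hoisted work overlap the branch:

```python
# Before: the join-point work waits until after the branch resolves.
def before(tid, base, scale):
    if tid != 0:
        side = tid * 4           # one side of the branch (BB2-like work)
    else:
        side = base + scale      # the other side (BB1-like work)
    common = base * 64 + tid * 4 # join-point work (BB3-like work)
    return side + common

# After: the join-point work depends on neither side of the branch,
# so it is hoisted above the branch without changing the result.
def after(tid, base, scale):
    common = base * 64 + tid * 4
    if tid != 0:
        side = tid * 4
    else:
        side = base + scale
    return side + common
```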

50 Comparing Speedup and IPC Using Dynamic Profiling [bar chart: per-kernel speedup (y-axis up to 1.40) and instructions executed per cycle (IPC) for kernels from backprop, bfs, hotspot, heartwall, cfd, lavamd, kmeans, leukocyte, lud, nn, mummergpu, nw, pathfinder, particlefilter (naive and float), streamcluster, and srad, plus harmonic and geometric means]

51 Backup Slides

52 [table comparing speculation models]
                 Scheduling Restrictions | Hardware Support | Exception Handling for Speculative Instructions
Restricted [6]:  legal and safe | none | prohibited
General [6]:     legal | none | ignored
Boosting [36]:   none | shadow register file, shadow store buffer, non-trapping instructions, and support for re-executing instructions | supported
Deviant (GPU):   excludes texture, shared, and constant memory operations and all store instructions | none | absent

53-55 GPU Programming Model [animation: execution alternates over time between host code on the CPU and device code on the GPU, with cyclic GPU-CPU communication; device code runs as a grid of blocks, Block (0,0) through Block (1,1), and each block, e.g. Block (0,1), contains a 2-D array of threads, Thread (0,0) through Thread (2,1)]

56-60 Characterizing the Grid, Blocks, and Threads [animation: the grid measures gridDim.x by gridDim.y blocks; each block, identified as Block (blockIdx.x, blockIdx.y), measures blockDim.x by blockDim.y threads; each thread within its block is identified as Thread (threadIdx.x, threadIdx.y)]
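Under these conventions, a thread's global coordinates combine its block index, block dimensions, and thread index. A minimal sketch (pure Python, mirroring the standard CUDA expression blockIdx * blockDim + threadIdx):

```python
def global_coords(block_idx, block_dim, thread_idx):
    """Map a thread's (blockIdx, blockDim, threadIdx) to global (x, y)."""
    return (block_idx[0] * block_dim[0] + thread_idx[0],
            block_idx[1] * block_dim[1] + thread_idx[1])

# Thread (2, 1) of Block (1, 0), with 3x2-thread blocks, sits at (5, 1).
global_coords((1, 0), (3, 2), (2, 1))   # (5, 1)
```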

61-65 Warp Divergence Examples Assuming one block of 128 threads:
if (threadIdx.x < 32) { }   Divergence? NO
if (threadIdx.x > 15) { }   Divergence? YES
if (threadIdx.x > 65) { }   Divergence? YES
if (blockIdx.x > 1)  { }    Divergence? NO
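The table's answers follow mechanically from warp boundaries. A small checker (assuming 32-thread warps and a one-dimensional 128-thread block, as on the slides) reproduces them:

```python
WARP_SIZE = 32

def warp_diverges(pred, nthreads=128):
    """True if any warp has lanes that disagree on pred(threadIdx.x)."""
    for start in range(0, nthreads, WARP_SIZE):
        decisions = {pred(t) for t in range(start, start + WARP_SIZE)}
        if len(decisions) > 1:   # mixed outcomes within one warp
            return True
    return False

warp_diverges(lambda t: t < 32)   # False: the split falls on a warp boundary
warp_diverges(lambda t: t > 15)   # True: warp 0 splits at lane 16
warp_diverges(lambda t: t > 65)   # True: warp 2 splits at lane 66
# if (blockIdx.x > 1): the condition is uniform within each block,
# so every warp agrees and there is no divergence.
```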


More information

Instruction Level Parallelism Part II - Scoreboard

Instruction Level Parallelism Part II - Scoreboard Course on: Advanced Computer Architectures Instruction Level Parallelism Part II - Scoreboard Prof. Cristina Silvano Politecnico di Milano email: cristina.silvano@polimi.it 1 Basic Assumptions We consider

More information

Compiler Optimisation

Compiler Optimisation Compiler Optimisation 6 Instruction Scheduling Hugh Leather IF 1.18a hleather@inf.ed.ac.uk Institute for Computing Systems Architecture School of Informatics University of Edinburgh 2018 Introduction This

More information

PARALLEL ALGORITHMS FOR HISTOGRAM-BASED IMAGE REGISTRATION. Benjamin Guthier, Stephan Kopf, Matthias Wichtlhuber, Wolfgang Effelsberg

PARALLEL ALGORITHMS FOR HISTOGRAM-BASED IMAGE REGISTRATION. Benjamin Guthier, Stephan Kopf, Matthias Wichtlhuber, Wolfgang Effelsberg This is a preliminary version of an article published by Benjamin Guthier, Stephan Kopf, Matthias Wichtlhuber, and Wolfgang Effelsberg. Parallel algorithms for histogram-based image registration. Proc.

More information

GPU-accelerated SDR Implementation of Multi-User Detector for Satellite Return Links

GPU-accelerated SDR Implementation of Multi-User Detector for Satellite Return Links DLR.de Chart 1 GPU-accelerated SDR Implementation of Multi-User Detector for Satellite Return Links Chen Tang chen.tang@dlr.de Institute of Communication and Navigation German Aerospace Center DLR.de Chart

More information

EECS 470 Lecture 8. P6 µarchitecture. Fall 2018 Jon Beaumont Core 2 Microarchitecture

EECS 470 Lecture 8. P6 µarchitecture. Fall 2018 Jon Beaumont   Core 2 Microarchitecture P6 µarchitecture Fall 2018 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Core 2 Microarchitecture Many thanks to Prof. Martin and Roth of University of Pennsylvania for most of these slides. Portions

More information

EN164: Design of Computing Systems Lecture 22: Processor / ILP 3

EN164: Design of Computing Systems Lecture 22: Processor / ILP 3 EN164: Design of Computing Systems Lecture 22: Processor / ILP 3 Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Energy Efficiency Benefits of Reducing the Voltage Guardband on the Kepler GPU Architecture

Energy Efficiency Benefits of Reducing the Voltage Guardband on the Kepler GPU Architecture Energy Efficiency Benefits of Reducing the Voltage Guardband on the Kepler GPU Architecture Jingwen Leng Yazhou Zu Vijay Janapa Reddi The University of Texas at Austin {jingwen, yazhou.zu}@utexas.edu,

More information

EECS 470. Lecture 9. MIPS R10000 Case Study. Fall 2018 Jon Beaumont

EECS 470. Lecture 9. MIPS R10000 Case Study. Fall 2018 Jon Beaumont MIPS R10000 Case Study Fall 2018 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Multiprocessor SGI Origin Using MIPS R10K Many thanks to Prof. Martin and Roth of University of Pennsylvania for

More information

OOO Execution & Precise State MIPS R10000 (R10K)

OOO Execution & Precise State MIPS R10000 (R10K) OOO Execution & Precise State in MIPS R10000 (R10K) Nima Honarmand CDB. CDB.V Spring 2018 :: CSE 502 he Problem with P6 Map able + Regfile value R value Head Retire Dispatch op RS 1 2 V1 FU V2 ail Dispatch

More information

Multi-core Platforms for

Multi-core Platforms for 20 JUNE 2011 Multi-core Platforms for Immersive-Audio Applications Course: Advanced Computer Architectures Teacher: Prof. Cristina Silvano Student: Silvio La Blasca 771338 Introduction on Immersive-Audio

More information

Accelerated Impulse Response Calculation for Indoor Optical Communication Channels

Accelerated Impulse Response Calculation for Indoor Optical Communication Channels Accelerated Impulse Response Calculation for Indoor Optical Communication Channels M. Rahaim, J. Carruthers, and T.D.C. Little Department of Electrical and Computer Engineering Boston University, Boston,

More information

Software-based Microarchitectural Attacks

Software-based Microarchitectural Attacks SCIENCE PASSION TECHNOLOGY Software-based Microarchitectural Attacks Daniel Gruss April 19, 2018 Graz University of Technology 1 Daniel Gruss Graz University of Technology Whoami Daniel Gruss Post-Doc

More information

Monte Carlo integration and event generation on GPU and their application to particle physics

Monte Carlo integration and event generation on GPU and their application to particle physics Monte Carlo integration and event generation on GPU and their application to particle physics Junichi Kanzaki (KEK) GPU2016 @ Rome, Italy Sep. 26, 2016 Motivation Increase of amount of LHC data (raw &

More information

Tomasolu s s Algorithm

Tomasolu s s Algorithm omasolu s s Algorithm Fall 2007 Prof. homas Wenisch http://www.eecs.umich.edu/courses/eecs4 70 Floating Point Buffers (FLB) ag ag ag Storage Bus Floating Point 4 3 Buffers FLB 6 5 5 4 Control 2 1 1 Result

More information

EECE 321: Computer Organiza5on

EECE 321: Computer Organiza5on EECE 321: Computer Organiza5on Mohammad M. Mansour Dept. of Electrical and Compute Engineering American University of Beirut Lecture 21: Pipelining Processor Pipelining Same principles can be applied to

More information

An evaluation of debayering algorithms on GPU for real-time panoramic video recording

An evaluation of debayering algorithms on GPU for real-time panoramic video recording An evaluation of debayering algorithms on GPU for real-time panoramic video recording Ragnar Langseth, Vamsidhar Reddy Gaddam, Håkon Kvale Stensland, Carsten Griwodz, Pål Halvorsen University of Oslo /

More information

GPU Acceleration of the HEVC Decoder Inter Prediction Module

GPU Acceleration of the HEVC Decoder Inter Prediction Module GPU Acceleration of the HEVC Decoder Inter Prediction Module Diego F. de Souza, Aleksandar Ilic, Nuno Roma and Leonel Sousa INESC-ID, IST, Universidade de Lisboa Rua Alves Redol 9, 000-09, Lisbon, Portugal

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Speculation and raps in Out-of-Order Cores What is wrong with omasulo s? Branch instructions Need branch prediction to guess what to fetch next Need speculative execution

More information

ECE473 Computer Architecture and Organization. Pipeline: Introduction

ECE473 Computer Architecture and Organization. Pipeline: Introduction Computer Architecture and Organization Pipeline: Introduction Lecturer: Prof. Yifeng Zhu Fall, 2015 Portions of these slides are derived from: Dave Patterson UCB Lec 11.1 The Laundry Analogy Student A,

More information

EECS 470. Tomasulo s Algorithm. Lecture 4 Winter 2018

EECS 470. Tomasulo s Algorithm. Lecture 4 Winter 2018 omasulo s Algorithm Winter 2018 Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, yson, Vijaykumar, and Wenisch of Carnegie Mellon University,

More information

Lecture 4: Introduction to Pipelining

Lecture 4: Introduction to Pipelining Lecture 4: Introduction to Pipelining Pipelining Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes A B C D Dryer takes 40 minutes Folder

More information

COSC4201. Scoreboard

COSC4201. Scoreboard COSC4201 Scoreboard Prof. Mokhtar Aboelaze York University Based on Slides by Prof. L. Bhuyan (UCR) Prof. M. Shaaban (RIT) 1 Overcoming Data Hazards with Dynamic Scheduling In the pipeline, if there is

More information

Table of Contents HOL ADV

Table of Contents HOL ADV Table of Contents Lab Overview - - Horizon 7.1: Graphics Acceleartion for 3D Workloads and vgpu... 2 Lab Guidance... 3 Module 1-3D Options in Horizon 7 (15 minutes - Basic)... 5 Introduction... 6 3D Desktop

More information

Parallel Programming Design of BPSK Signal Demodulation Based on CUDA

Parallel Programming Design of BPSK Signal Demodulation Based on CUDA Int. J. Communications, Network and System Sciences, 216, 9, 126-134 Published Online May 216 in SciRes. http://www.scirp.org/journal/ijcns http://dx.doi.org/1.4236/ijcns.216.9511 Parallel Programming

More information

Semantic Segmentation on Resource Constrained Devices

Semantic Segmentation on Resource Constrained Devices Semantic Segmentation on Resource Constrained Devices Sachin Mehta University of Washington, Seattle In collaboration with Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh Hajishirzi Project

More information

Instruction Level Parallelism. Data Dependence Static Scheduling

Instruction Level Parallelism. Data Dependence Static Scheduling Instruction Level Parallelism Data Dependence Static Scheduling Basic Block A straight line code sequence with no branches in except to the entry and no branches out except at the exit Loop: L.D ADD.D

More information

Synthetic Aperture Beamformation using the GPU

Synthetic Aperture Beamformation using the GPU Paper presented at the IEEE International Ultrasonics Symposium, Orlando, Florida, 211: Synthetic Aperture Beamformation using the GPU Jens Munk Hansen, Dana Schaa and Jørgen Arendt Jensen Center for Fast

More information

ECE 4750 Computer Architecture, Fall 2016 T09 Advanced Processors: Superscalar Execution

ECE 4750 Computer Architecture, Fall 2016 T09 Advanced Processors: Superscalar Execution ECE 4750 Computer Architecture, Fall 2016 T09 Advanced Processors: Superscalar Execution School of Electrical and Computer Engineering Cornell University revision: 2016-11-28-17-33 1 In-Order Dual-Issue

More information

SATSim: A Superscalar Architecture Trace Simulator Using Interactive Animation

SATSim: A Superscalar Architecture Trace Simulator Using Interactive Animation SATSim: A Superscalar Architecture Trace Simulator Using Interactive Animation Mark Wolff Linda Wills School of Electrical and Computer Engineering Georgia Institute of Technology {wolff,linda.wills}@ece.gatech.edu

More information

CORRECTED VISION. Here be underscores THE ROLE OF CAMERA AND LENS PARAMETERS IN REAL-WORLD MEASUREMENT

CORRECTED VISION. Here be underscores THE ROLE OF CAMERA AND LENS PARAMETERS IN REAL-WORLD MEASUREMENT Here be underscores CORRECTED VISION THE ROLE OF CAMERA AND LENS PARAMETERS IN REAL-WORLD MEASUREMENT JOSEPH HOWSE, NUMMIST MEDIA CIG-GANS WORKSHOP: 3-D COLLECTION, ANALYSIS AND VISUALIZATION LAWRENCETOWN,

More information

Threading libraries performance when applied to image acquisition and processing in a forensic application

Threading libraries performance when applied to image acquisition and processing in a forensic application Threading libraries performance when applied to image acquisition and processing in a forensic application Carlos Bermúdez MSc. in Photonics, Universitat Politècnica de Catalunya, Barcelona, Spain Student

More information

Game Architecture. 4/8/16: Multiprocessor Game Loops

Game Architecture. 4/8/16: Multiprocessor Game Loops Game Architecture 4/8/16: Multiprocessor Game Loops Monolithic Dead simple to set up, but it can get messy Flow-of-control can be complex Top-level may have too much knowledge of underlying systems (gross

More information

CS Computer Architecture Spring Lecture 04: Understanding Performance

CS Computer Architecture Spring Lecture 04: Understanding Performance CS 35101 Computer Architecture Spring 2008 Lecture 04: Understanding Performance Taken from Mary Jane Irwin (www.cse.psu.edu/~mji) and Kevin Schaffer [Adapted from Computer Organization and Design, Patterson

More information

6.S084 Tutorial Problems L19 Control Hazards in Pipelined Processors

6.S084 Tutorial Problems L19 Control Hazards in Pipelined Processors 6.S084 Tutorial Problems L19 Control Hazards in Pipelined Processors Options for dealing with data and control hazards: stall, bypass, speculate 6.S084 Worksheet - 1 of 10 - L19 Control Hazards in Pipelined

More information

Asanovic/Devadas Spring Pipeline Hazards. Krste Asanovic Laboratory for Computer Science M.I.T.

Asanovic/Devadas Spring Pipeline Hazards. Krste Asanovic Laboratory for Computer Science M.I.T. Pipeline Hazards Krste Asanovic Laboratory for Computer Science M.I.T. Pipelined DLX Datapath without interlocks and jumps 31 0x4 RegDst RegWrite inst Inst rs1 rs2 rd1 ws wd rd2 GPRs Imm Ext A B OpSel

More information

CS429: Computer Organization and Architecture

CS429: Computer Organization and Architecture CS429: Computer Organization and Architecture Dr. Bill Young Department of Computer Sciences University of Texas at Austin Last updated: November 8, 2017 at 09:27 CS429 Slideset 14: 1 Overview What s wrong

More information

Reading Material + Announcements

Reading Material + Announcements Reading Material + Announcements Reminder HW 1» Before asking questions: 1) Read all threads on piazza, 2) Think a bit Ÿ Then, post question Ÿ talk to Animesh if you are stuck Today s class» Wrap up Control

More information

Software Pipelining Creates Parallelization Opportunities

Software Pipelining Creates Parallelization Opportunities Software Pipelining Creates Parallelization Opportunities Jialu Huang, Arun Raman, Thomas B Jablin, Yun Zhang, Tzu-Han Hung David I August Liberty Research Group Princeton University 1 DSWP+ DOALL DSWP

More information

NVIDIA SLI AND STUTTER AVOIDANCE:

NVIDIA SLI AND STUTTER AVOIDANCE: NVIDIA SLI AND STUTTER AVOIDANCE: A Recipe for Smooth Gaming and Perfect Scaling with Multiple GPUs NVIDIA SLI AND STUTTER AVOIDANCE: Iain Cantlay (Developer Technology Engineer) Lars Nordskog (Developer

More information

Power of Realtime 3D-Rendering. Raja Koduri

Power of Realtime 3D-Rendering. Raja Koduri Power of Realtime 3D-Rendering Raja Koduri 1 We ate our GPU cake - vuoi la botte piena e la moglie ubriaca And had more too! 16+ years of (sugar) high! In every GPU generation More performance and performance-per-watt

More information

Computer Architecture A Quantitative Approach

Computer Architecture A Quantitative Approach Computer Architecture A Quantitative Approach Fourth Edition John L. Hennessy Stanford University David A. Patterson University of California at Berkeley With Contributions by Andrea C. Arpaci-Dusseau

More information

Real-Time Software Receiver Using Massively Parallel

Real-Time Software Receiver Using Massively Parallel Real-Time Software Receiver Using Massively Parallel Processors for GPS Adaptive Antenna Array Processing Jiwon Seo, David De Lorenzo, Sherman Lo, Per Enge, Stanford University Yu-Hsuan Chen, National

More information

EE382V-ICS: System-on-a-Chip (SoC) Design

EE382V-ICS: System-on-a-Chip (SoC) Design EE38V-CS: System-on-a-Chip (SoC) Design Hardware Synthesis and Architectures Source: D. Gajski, S. Abdi, A. Gerstlauer, G. Schirner, Embedded System Design: Modeling, Synthesis, Verification, Chapter 6:

More information

23270: AUGMENTED REALITY FOR NAVIGATION AND INFORMATIONAL ADAS. Sergii Bykov Technical Lead Machine Learning 12 Oct 2017

23270: AUGMENTED REALITY FOR NAVIGATION AND INFORMATIONAL ADAS. Sergii Bykov Technical Lead Machine Learning 12 Oct 2017 23270: AUGMENTED REALITY FOR NAVIGATION AND INFORMATIONAL ADAS Sergii Bykov Technical Lead Machine Learning 12 Oct 2017 Product Vision Company Introduction Apostera GmbH with headquarter in Munich, was

More information

Suggested Readings! Lecture 12" Introduction to Pipelining! Example: We have to build x cars...! ...Each car takes 6 steps to build...! ! Readings!

Suggested Readings! Lecture 12 Introduction to Pipelining! Example: We have to build x cars...! ...Each car takes 6 steps to build...! ! Readings! 1! CSE 30321 Lecture 12 Introduction to Pipelining! CSE 30321 Lecture 12 Introduction to Pipelining! 2! Suggested Readings!! Readings!! H&P: Chapter 4.5-4.7!! (Over the next 3-4 lectures)! Lecture 12"

More information

Image Processing Architectures (and their future requirements)

Image Processing Architectures (and their future requirements) Lecture 17: Image Processing Architectures (and their future requirements) Visual Computing Systems Smart phone processing resources Qualcomm snapdragon Image credit: Qualcomm Apple A7 (iphone 5s) Chipworks

More information

Memory-Level Parallelism Aware Fetch Policies for Simultaneous Multithreading Processors

Memory-Level Parallelism Aware Fetch Policies for Simultaneous Multithreading Processors Memory-Level Parallelism Aware Fetch Policies for Simultaneous Multithreading Processors STIJN EYERMAN and LIEVEN EECKHOUT Ghent University A thread executing on a simultaneous multithreading (SMT) processor

More information

SKA NON IMAGING PROCESSING CONCEPT DESCRIPTION: GPU PROCESSING FOR REAL TIME ISOLATED RADIO PULSE DETECTION

SKA NON IMAGING PROCESSING CONCEPT DESCRIPTION: GPU PROCESSING FOR REAL TIME ISOLATED RADIO PULSE DETECTION SKA NON IMAGING PROCESSING CONCEPT DESCRIPTION: GPU PROCESSING FOR REAL TIME ISOLATED RADIO PULSE DETECTION Document number... WP2 040.130.010 TD 001 Revision... 1 Author... Aris Karastergiou Date... 2011

More information

Early Adopter : Multiprocessor Programming in the Undergraduate Program. NSF/TCPP Curriculum: Early Adoption at the University of Central Florida

Early Adopter : Multiprocessor Programming in the Undergraduate Program. NSF/TCPP Curriculum: Early Adoption at the University of Central Florida Early Adopter : Multiprocessor Programming in the Undergraduate Program NSF/TCPP Curriculum: Early Adoption at the University of Central Florida Narsingh Deo Damian Dechev Mahadevan Vasudevan Department

More information

Pre-Silicon Validation of Hyper-Threading Technology

Pre-Silicon Validation of Hyper-Threading Technology Pre-Silicon Validation of Hyper-Threading Technology David Burns, Desktop Platforms Group, Intel Corp. Index words: microprocessor, validation, bugs, verification ABSTRACT Hyper-Threading Technology delivers

More information

Introduction to Real-Time Systems

Introduction to Real-Time Systems Introduction to Real-Time Systems Real-Time Systems, Lecture 1 Martina Maggio and Karl-Erik Årzén 16 January 2018 Lund University, Department of Automatic Control Content [Real-Time Control System: Chapter

More information