Warp-Aware Trace Scheduling for GPUs. James Jablin (Brown), Thomas Jablin (UIUC), Onur Mutlu (CMU), Maurice Herlihy (Brown)
2 Historical Trends in GFLOPS: CPUs vs. GPUs - chart of theoretical single-precision GFLOP/s over time: NVIDIA GPUs (GeForce 5800, 6800 Ultra, 7800 GTX, 8800 GTX, 280 GTX, 480 GTX, 580 GTX, 680 GTX) versus Intel CPUs (Northwood, Prescott, Woodcrest, Harpertown, Bloomfield, Westmere, Sandy Bridge), through 2012; reproduced from NVIDIA CUDA C Programming Guide (Version 5.0)
3 Performance Pitfalls Control flow can negatively affect performance.
4 Performance Pitfalls Pipeline Stall - execution delay in an instruction pipeline to resolve a dependency
5 Hardware: CPU versus GPU - diagram contrasting a CPU (large control logic and cache, a few ALUs, DRAM) with a GPU (many small ALUs, minimal control and cache, DRAM); reproduced from NVIDIA CUDA C Programming Guide (Version 5.0)
6-17 With Branch Prediction / Without Branch Prediction - animated pipeline diagram stepping waiting instructions through the stages Fetch, Decode, Execute, Write, one stage per clock cycle. With branch prediction the pipeline stays full and instructions complete every cycle; without it, a pipeline stall (bubble) opens behind the branch until it resolves, delaying all completed instructions behind it.
19 Performance Pitfalls Pipeline Stall - execution delay in an instruction pipeline to resolve a dependency Warp Divergence - threads within a warp take different paths and the different execution paths are serialized
20-27 Warp Divergence Example - animated control-flow walk through blocks A, B, C, D: after A, some lanes of the warp branch to B and others to C (Warp Divergence!), the two paths execute one after the other, and the warp reconverges at D (Warp Reconverges!).
29 Warp-Aware Trace Scheduling Schedule instructions across basic block boundaries to expose additional ILP while managing and optimizing warp divergence.
30-34 Origins: Microcode Trace Scheduling - generalizing local and disparate vertical-to-horizontal microcode compaction.
    Step - Description
    1. Trace Selection - partition basic blocks into regions
    2. Trace Formation - facilitate local scheduling, potentially adding nodes and edges
    3. Local Scheduling - schedule instructions within each region
35 Example CFG with basic blocks A, B, C, D, E, F, G, H, I, J, K, L.
36-44 Annotate CFG - edge weights obtained by dynamic profiling or static branch prediction. Traces are then grown greedily:
    while there are unvisited nodes:
        find the next unvisited node with the highest edge weight
        add the node to the trace
Applied to the example CFG, the loop partitions the blocks into Trace #1, Trace #2, and Trace #3.
45-49 Before / After - PTX for the running example (edge weights 90 and 10 annotate the two successors of BB0). Trace scheduling hoists BB3's address computation and global load into BB0, above the divergent branch.

Before:
    // BB0
    mov.u32 %r11, %ctaid.y;
    add.s32 %r12, %r8, 1;
    mov.u32 %r1, %tid.y;
    mov.u32 %r2, %tid.x;
    setp.ne.s32 %p2, %r2, 0;
    @%p2 bra BB2;
    // BB1
    mul.wide.s32 %rd13, %r4, 4;
    add.s64 %rd14, %rd2, %rd13;
    cvt.s64.s32 %rd7, %r6;
    add.s64 %rd8, %rd6, %rd7;
    bra.uni BB3;
    BB2:
    ld.shared.f32 %f5, [%rd3];
    ld.shared.f32 %f6, [%rd3];
    mul.f32 %f7, %f5, %f6;
    st.shared.f32 [%r2], %f7;
    bra.uni BB3;
    BB3:
    mul.wide.s32 %rd15, %r3, 4;
    add.s64 %rd12, %rd1, %rd15;
    shl.b64 %rd16, %rd4, 6;
    mov.u64 %rd17, param_0;
    add.s64 %rd18, %rd17, %rd16;
    mul.wide.s32 %rd19, %r2, 4;
    add.s64 %rd2, %rd18, %rd19;
    ld.global.f32 %f4, [%rd2];

After:
    // BB0 - address computation and load hoisted from BB3
    mov.u32 %r11, %ctaid.y;
    add.s32 %r12, %r8, 1;
    mov.u32 %r1, %tid.y;
    mov.u32 %r2, %tid.x;
    setp.ne.s32 %p2, %r2, 0;
    shl.b64 %rd16, %rd4, 6;
    mov.u64 %rd17, param_0;
    add.s64 %rd18, %rd17, %rd16;
    mul.wide.s32 %rd19, %r2, 4;
    add.s64 %rd2, %rd18, %rd19;
    ld.global.f32 %f4, [%rd2];
    @%p2 bra BB2;
    // BB1 (unchanged)
    mul.wide.s32 %rd13, %r4, 4;
    add.s64 %rd14, %rd2, %rd13;
    cvt.s64.s32 %rd7, %r6;
    add.s64 %rd8, %rd6, %rd7;
    bra.uni BB3;
    BB2:
    ld.shared.f32 %f5, [%rd3];
    ld.shared.f32 %f6, [%rd3];
    mul.f32 %f7, %f5, %f6;
    st.shared.f32 [%r2], %f7;
    bra.uni BB3;
    BB3:
    mul.wide.s32 %rd15, %r3, 4;
    add.s64 %rd12, %rd1, %rd15;
50 Comparing Speedup and IPC Using Dynamic Profiling - bar chart of per-kernel speedup (y-axis up to 1.40) and instructions executed per cycle (IPC) for kernels from backprop, bfs, cfd, heartwall, hotspot, kmeans, lavamd, leukocyte, lud, mummergpu, nn, nw, particlefilter(n), particlefilter(f), pathfinder, srad, and streamcluster, with HARMEAN and GEOMEAN summaries.
51 Backup Slides
52 Comparison of speculation models:
    Model          | Scheduling Restrictions | Hardware Support | Exception Handling for Speculative Instructions
    Restricted [6] | legal and safe | none | prohibited
    General [6]    | legal | none | ignored
    Boosting [36]  | none | shadow register file, shadow store buffer, non-trapping instructions, and support for re-executing instructions | supported
    Deviant (GPU)  | excludes texture, shared, and constant memory operations and all store instructions | none | absent
53-55 GPU Programming Model - timeline of cyclic CPU/GPU communication: host code runs on the CPU, device code runs on the GPU as a grid of blocks (Block (0,0), (1,0), (0,1), (1,1)), each block comprising threads (e.g. Block (0,1) holds Thread (0,0) through Thread (2,1)); control then returns to host code on the CPU.
56-60 Characterizing the Grid, Blocks, and Threads - the grid spans griddim.x x griddim.y blocks; each block spans blockdim.x x blockdim.y threads and is indexed as Block (blockidx.x, blockidx.y); each thread within a block is indexed as Thread (threadidx.x, threadidx.y).
61-65 Warp Divergence Examples - assuming one block of 128 threads (four warps of 32):
    Example                    | Divergence?
    if (threadidx.x < 32) { }  | NO  - warp 0 takes the branch as a whole
    if (threadidx.x > 15) { }  | YES - warp 0 splits at lane 16
    if (threadidx.x > 65) { }  | YES - warp 2 splits at lane 66
    if (blockidx.x > 1) { }    | NO  - uniform within each block, hence within each warp
omasolu s s Algorithm Fall 2007 Prof. homas Wenisch http://www.eecs.umich.edu/courses/eecs4 70 Floating Point Buffers (FLB) ag ag ag Storage Bus Floating Point 4 3 Buffers FLB 6 5 5 4 Control 2 1 1 Result
More informationEECE 321: Computer Organiza5on
EECE 321: Computer Organiza5on Mohammad M. Mansour Dept. of Electrical and Compute Engineering American University of Beirut Lecture 21: Pipelining Processor Pipelining Same principles can be applied to
More informationAn evaluation of debayering algorithms on GPU for real-time panoramic video recording
An evaluation of debayering algorithms on GPU for real-time panoramic video recording Ragnar Langseth, Vamsidhar Reddy Gaddam, Håkon Kvale Stensland, Carsten Griwodz, Pål Halvorsen University of Oslo /
More informationGPU Acceleration of the HEVC Decoder Inter Prediction Module
GPU Acceleration of the HEVC Decoder Inter Prediction Module Diego F. de Souza, Aleksandar Ilic, Nuno Roma and Leonel Sousa INESC-ID, IST, Universidade de Lisboa Rua Alves Redol 9, 000-09, Lisbon, Portugal
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Speculation and raps in Out-of-Order Cores What is wrong with omasulo s? Branch instructions Need branch prediction to guess what to fetch next Need speculative execution
More informationECE473 Computer Architecture and Organization. Pipeline: Introduction
Computer Architecture and Organization Pipeline: Introduction Lecturer: Prof. Yifeng Zhu Fall, 2015 Portions of these slides are derived from: Dave Patterson UCB Lec 11.1 The Laundry Analogy Student A,
More informationEECS 470. Tomasulo s Algorithm. Lecture 4 Winter 2018
omasulo s Algorithm Winter 2018 Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, yson, Vijaykumar, and Wenisch of Carnegie Mellon University,
More informationLecture 4: Introduction to Pipelining
Lecture 4: Introduction to Pipelining Pipelining Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes A B C D Dryer takes 40 minutes Folder
More informationCOSC4201. Scoreboard
COSC4201 Scoreboard Prof. Mokhtar Aboelaze York University Based on Slides by Prof. L. Bhuyan (UCR) Prof. M. Shaaban (RIT) 1 Overcoming Data Hazards with Dynamic Scheduling In the pipeline, if there is
More informationTable of Contents HOL ADV
Table of Contents Lab Overview - - Horizon 7.1: Graphics Acceleartion for 3D Workloads and vgpu... 2 Lab Guidance... 3 Module 1-3D Options in Horizon 7 (15 minutes - Basic)... 5 Introduction... 6 3D Desktop
More informationParallel Programming Design of BPSK Signal Demodulation Based on CUDA
Int. J. Communications, Network and System Sciences, 216, 9, 126-134 Published Online May 216 in SciRes. http://www.scirp.org/journal/ijcns http://dx.doi.org/1.4236/ijcns.216.9511 Parallel Programming
More informationSemantic Segmentation on Resource Constrained Devices
Semantic Segmentation on Resource Constrained Devices Sachin Mehta University of Washington, Seattle In collaboration with Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh Hajishirzi Project
More informationInstruction Level Parallelism. Data Dependence Static Scheduling
Instruction Level Parallelism Data Dependence Static Scheduling Basic Block A straight line code sequence with no branches in except to the entry and no branches out except at the exit Loop: L.D ADD.D
More informationSynthetic Aperture Beamformation using the GPU
Paper presented at the IEEE International Ultrasonics Symposium, Orlando, Florida, 211: Synthetic Aperture Beamformation using the GPU Jens Munk Hansen, Dana Schaa and Jørgen Arendt Jensen Center for Fast
More informationECE 4750 Computer Architecture, Fall 2016 T09 Advanced Processors: Superscalar Execution
ECE 4750 Computer Architecture, Fall 2016 T09 Advanced Processors: Superscalar Execution School of Electrical and Computer Engineering Cornell University revision: 2016-11-28-17-33 1 In-Order Dual-Issue
More informationSATSim: A Superscalar Architecture Trace Simulator Using Interactive Animation
SATSim: A Superscalar Architecture Trace Simulator Using Interactive Animation Mark Wolff Linda Wills School of Electrical and Computer Engineering Georgia Institute of Technology {wolff,linda.wills}@ece.gatech.edu
More informationCORRECTED VISION. Here be underscores THE ROLE OF CAMERA AND LENS PARAMETERS IN REAL-WORLD MEASUREMENT
Here be underscores CORRECTED VISION THE ROLE OF CAMERA AND LENS PARAMETERS IN REAL-WORLD MEASUREMENT JOSEPH HOWSE, NUMMIST MEDIA CIG-GANS WORKSHOP: 3-D COLLECTION, ANALYSIS AND VISUALIZATION LAWRENCETOWN,
More informationThreading libraries performance when applied to image acquisition and processing in a forensic application
Threading libraries performance when applied to image acquisition and processing in a forensic application Carlos Bermúdez MSc. in Photonics, Universitat Politècnica de Catalunya, Barcelona, Spain Student
More informationGame Architecture. 4/8/16: Multiprocessor Game Loops
Game Architecture 4/8/16: Multiprocessor Game Loops Monolithic Dead simple to set up, but it can get messy Flow-of-control can be complex Top-level may have too much knowledge of underlying systems (gross
More informationCS Computer Architecture Spring Lecture 04: Understanding Performance
CS 35101 Computer Architecture Spring 2008 Lecture 04: Understanding Performance Taken from Mary Jane Irwin (www.cse.psu.edu/~mji) and Kevin Schaffer [Adapted from Computer Organization and Design, Patterson
More information6.S084 Tutorial Problems L19 Control Hazards in Pipelined Processors
6.S084 Tutorial Problems L19 Control Hazards in Pipelined Processors Options for dealing with data and control hazards: stall, bypass, speculate 6.S084 Worksheet - 1 of 10 - L19 Control Hazards in Pipelined
More informationAsanovic/Devadas Spring Pipeline Hazards. Krste Asanovic Laboratory for Computer Science M.I.T.
Pipeline Hazards Krste Asanovic Laboratory for Computer Science M.I.T. Pipelined DLX Datapath without interlocks and jumps 31 0x4 RegDst RegWrite inst Inst rs1 rs2 rd1 ws wd rd2 GPRs Imm Ext A B OpSel
More informationCS429: Computer Organization and Architecture
CS429: Computer Organization and Architecture Dr. Bill Young Department of Computer Sciences University of Texas at Austin Last updated: November 8, 2017 at 09:27 CS429 Slideset 14: 1 Overview What s wrong
More informationReading Material + Announcements
Reading Material + Announcements Reminder HW 1» Before asking questions: 1) Read all threads on piazza, 2) Think a bit Ÿ Then, post question Ÿ talk to Animesh if you are stuck Today s class» Wrap up Control
More informationSoftware Pipelining Creates Parallelization Opportunities
Software Pipelining Creates Parallelization Opportunities Jialu Huang, Arun Raman, Thomas B Jablin, Yun Zhang, Tzu-Han Hung David I August Liberty Research Group Princeton University 1 DSWP+ DOALL DSWP
More informationNVIDIA SLI AND STUTTER AVOIDANCE:
NVIDIA SLI AND STUTTER AVOIDANCE: A Recipe for Smooth Gaming and Perfect Scaling with Multiple GPUs NVIDIA SLI AND STUTTER AVOIDANCE: Iain Cantlay (Developer Technology Engineer) Lars Nordskog (Developer
More informationPower of Realtime 3D-Rendering. Raja Koduri
Power of Realtime 3D-Rendering Raja Koduri 1 We ate our GPU cake - vuoi la botte piena e la moglie ubriaca And had more too! 16+ years of (sugar) high! In every GPU generation More performance and performance-per-watt
More informationComputer Architecture A Quantitative Approach
Computer Architecture A Quantitative Approach Fourth Edition John L. Hennessy Stanford University David A. Patterson University of California at Berkeley With Contributions by Andrea C. Arpaci-Dusseau
More informationReal-Time Software Receiver Using Massively Parallel
Real-Time Software Receiver Using Massively Parallel Processors for GPS Adaptive Antenna Array Processing Jiwon Seo, David De Lorenzo, Sherman Lo, Per Enge, Stanford University Yu-Hsuan Chen, National
More informationEE382V-ICS: System-on-a-Chip (SoC) Design
EE38V-CS: System-on-a-Chip (SoC) Design Hardware Synthesis and Architectures Source: D. Gajski, S. Abdi, A. Gerstlauer, G. Schirner, Embedded System Design: Modeling, Synthesis, Verification, Chapter 6:
More information23270: AUGMENTED REALITY FOR NAVIGATION AND INFORMATIONAL ADAS. Sergii Bykov Technical Lead Machine Learning 12 Oct 2017
23270: AUGMENTED REALITY FOR NAVIGATION AND INFORMATIONAL ADAS Sergii Bykov Technical Lead Machine Learning 12 Oct 2017 Product Vision Company Introduction Apostera GmbH with headquarter in Munich, was
More informationSuggested Readings! Lecture 12" Introduction to Pipelining! Example: We have to build x cars...! ...Each car takes 6 steps to build...! ! Readings!
1! CSE 30321 Lecture 12 Introduction to Pipelining! CSE 30321 Lecture 12 Introduction to Pipelining! 2! Suggested Readings!! Readings!! H&P: Chapter 4.5-4.7!! (Over the next 3-4 lectures)! Lecture 12"
More informationImage Processing Architectures (and their future requirements)
Lecture 17: Image Processing Architectures (and their future requirements) Visual Computing Systems Smart phone processing resources Qualcomm snapdragon Image credit: Qualcomm Apple A7 (iphone 5s) Chipworks
More informationMemory-Level Parallelism Aware Fetch Policies for Simultaneous Multithreading Processors
Memory-Level Parallelism Aware Fetch Policies for Simultaneous Multithreading Processors STIJN EYERMAN and LIEVEN EECKHOUT Ghent University A thread executing on a simultaneous multithreading (SMT) processor
More informationSKA NON IMAGING PROCESSING CONCEPT DESCRIPTION: GPU PROCESSING FOR REAL TIME ISOLATED RADIO PULSE DETECTION
SKA NON IMAGING PROCESSING CONCEPT DESCRIPTION: GPU PROCESSING FOR REAL TIME ISOLATED RADIO PULSE DETECTION Document number... WP2 040.130.010 TD 001 Revision... 1 Author... Aris Karastergiou Date... 2011
More informationEarly Adopter : Multiprocessor Programming in the Undergraduate Program. NSF/TCPP Curriculum: Early Adoption at the University of Central Florida
Early Adopter : Multiprocessor Programming in the Undergraduate Program NSF/TCPP Curriculum: Early Adoption at the University of Central Florida Narsingh Deo Damian Dechev Mahadevan Vasudevan Department
More informationPre-Silicon Validation of Hyper-Threading Technology
Pre-Silicon Validation of Hyper-Threading Technology David Burns, Desktop Platforms Group, Intel Corp. Index words: microprocessor, validation, bugs, verification ABSTRACT Hyper-Threading Technology delivers
More informationIntroduction to Real-Time Systems
Introduction to Real-Time Systems Real-Time Systems, Lecture 1 Martina Maggio and Karl-Erik Årzén 16 January 2018 Lund University, Department of Automatic Control Content [Real-Time Control System: Chapter
More information