Compiler Optimisation

Compiler Optimisation 6 Instruction Scheduling Hugh Leather IF 1.18a hleather@inf.ed.ac.uk Institute for Computing Systems Architecture School of Informatics University of Edinburgh 2018

Introduction
This lecture: scheduling to hide latency and exploit ILP
Dependence graph
Local list scheduling + priorities
Forward versus backward scheduling
Software pipelining of loops

Latency, functional units, and ILP
Instructions take clock cycles to execute (latency)
Modern machines issue several operations per cycle
Cannot use results until ready, but can do something else meanwhile
Execution time is order-dependent
Latencies are not always constant (cache behaviour, early exit, etc)

Operation           Cycles
load, store         3
load (cache miss)   100s
loadI, add, shift   1
mult                2
div                 40
branch              0-8

Machine types
In order: deep pipelining allows multiple instructions in flight
Superscalar: multiple functional units, can issue > 1 instruction per cycle
Out of order: large window of instructions can be reordered dynamically
VLIW: compiler statically allocates operations to FUs

Effect of scheduling
Superscalar, 1 FU: new op each cycle if operands ready
Simple schedule for a := 2*a*b*c (loads/stores 3 cycles, mults 2, adds 1):

Cycle  Operations                  Operands waiting
1      loadAI r_arp, @a => r1      r1
2                                  r1
3                                  r1
4      add r1, r1 => r1            r1
5      loadAI r_arp, @b => r2      r2
6                                  r2
7                                  r2
8      mult r1, r2 => r1           r1
9      loadAI r_arp, @c => r2      r1, r2
10                                 r2
11                                 r2
12     mult r1, r2 => r1           r1
13                                 r1
14     storeAI r1 => r_arp, @a     store to complete
15                                 store to complete
16                                 store to complete
Done

Effect of scheduling
Superscalar, 1 FU: new op each cycle if operands ready
Schedule loads early for a := 2*a*b*c (loads/stores 3 cycles, mults 2, adds 1):

Cycle  Operations                  Operands waiting
1      loadAI r_arp, @a => r1      r1
2      loadAI r_arp, @b => r2      r1, r2
3      loadAI r_arp, @c => r3      r1, r2, r3
4      add r1, r1 => r1            r1, r2, r3
5      mult r1, r2 => r1           r1, r3
6                                  r1
7      mult r1, r3 => r1           r1
8                                  r1
9      storeAI r1 => r_arp, @a     store to complete
10                                 store to complete
11                                 store to complete
Done

Uses one more register; 11 versus 16 cycles, 31% faster!
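The two tables above can be reproduced with a small simulator of the issue model they assume: one op issues per cycle, and a result becomes usable only after its latency. The following Python is a minimal sketch of my own (names and representation are not from the lecture), using the slide's latencies:

LATENCY = {"loadAI": 3, "storeAI": 3, "mult": 2, "add": 1}

def schedule_length(ops):
    """Cycles to run ops in order on a single-issue machine.
    ops: list of (opcode, source_regs, dest_reg_or_None)."""
    ready = {}   # register -> first cycle its value can be used
    issue = 0    # issue cycle of the previous op
    end = 0
    for opcode, srcs, dst in ops:
        stall = max((ready[r] for r in srcs), default=0)
        issue = max(issue + 1, stall)              # wait for operands
        if dst is not None:
            ready[dst] = issue + LATENCY[opcode]   # result available then
        end = max(end, issue + LATENCY[opcode] - 1)
    return end

simple = [
    ("loadAI",  [], "r1"), ("add",  ["r1", "r1"], "r1"),
    ("loadAI",  [], "r2"), ("mult", ["r1", "r2"], "r1"),
    ("loadAI",  [], "r2"), ("mult", ["r1", "r2"], "r1"),
    ("storeAI", ["r1"], None),
]
early = [
    ("loadAI",  [], "r1"), ("loadAI", [], "r2"), ("loadAI", [], "r3"),
    ("add",  ["r1", "r1"], "r1"),
    ("mult", ["r1", "r2"], "r1"), ("mult", ["r1", "r3"], "r1"),
    ("storeAI", ["r1"], None),
]
print(schedule_length(simple), schedule_length(early))  # 16 11

Running it prints 16 and 11, matching the two tables.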

Scheduling problem
A schedule maps operations to cycles: ∀a ∈ Ops, S(a) ∈ ℕ
Respect latency: ∀a, b ∈ Ops, a depends on b ⟹ S(a) ≥ S(b) + λ(b)
Respect functional units: no more ops per type per cycle than the FUs can handle
Length of schedule: L(S) = max over a ∈ Ops of (S(a) + λ(a))
Schedule S is time-optimal if L(S) ≤ L(S') for every schedule S'
Problem: find a time-optimal schedule (a schedule might also be optimal in terms of registers, power, or space)
Even local scheduling with many restrictions is NP-complete
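As a concrete reading of these constraints, here is a small Python sketch (representation and names are mine, not the lecture's) that checks a candidate schedule; it assumes fully pipelined FUs that are occupied only in their issue cycle:

from collections import Counter

def is_valid(S, deps, lat, fu_of, fu_count):
    """S: op -> start cycle; deps: list of (a, b) meaning a depends on b;
    lat: op -> latency; fu_of: op -> FU type; fu_count: type -> #FUs."""
    if any(S[a] < S[b] + lat[b] for a, b in deps):
        return False                       # latency constraint violated
    per_cycle = Counter((S[a], fu_of[a]) for a in S)
    return all(n <= fu_count[t] for (_, t), n in per_cycle.items())

def length(S, lat):
    return max(S[a] + lat[a] for a in S)   # L(S) = max over a of S(a) + lambda(a)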

List scheduling
Local greedy heuristic to produce schedules for single basic blocks:
1 Rename to avoid anti-dependences
2 Build dependency graph
3 Prioritise operations
4 For each cycle: choose the highest-priority ready operation, schedule it, and update the ready queue

List scheduling
Dependence/precedence graph
Schedule an operation only when its operands are ready
Build dependency graph of read-after-write (RAW) deps
Label with latency and FU requirements
Anti-dependences (WAR) restrict movement; renaming removes them
Example: a = 2*a*b*c

List scheduling
List scheduling algorithm:

Cycle <- 1
Ready <- leaves of D
Active <- {}
while (Ready ∪ Active ≠ {})
    for each a ∈ Active where S(a) + λ(a) ≤ Cycle   (a has completed)
        Active <- Active - {a}
        for each b ∈ succs(a) where isready(b)
            Ready <- Ready ∪ {b}
    if ∃ a ∈ Ready with ∀b ∈ Ready, priority(a) ≥ priority(b)
        Ready <- Ready - {a}
        S(a) <- Cycle
        Active <- Active ∪ {a}
    Cycle <- Cycle + 1
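The following is a minimal runnable Python rendering of this algorithm, a sketch assuming a single-issue machine (one op starts per cycle) and my own graph representation (op -> sets of successor/predecessor ops):

def list_schedule(ops, succs, preds, lat, prio):
    """Return S: op -> start cycle."""
    S = {}
    ready = {a for a in ops if not preds[a]}   # leaves of the graph
    active, cycle = set(), 1
    while ready or active:
        for a in [a for a in active if S[a] + lat[a] <= cycle]:
            active.remove(a)                   # a has completed
            for b in succs[a]:                 # b ready once all preds done
                if all(p in S and S[p] + lat[p] <= cycle for p in preds[b]):
                    ready.add(b)
        if ready:
            a = max(ready, key=lambda x: prio[x])  # highest priority first
            ready.remove(a)
            S[a] = cycle
            active.add(a)
        cycle += 1
    return S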

List scheduling
Priorities
Many different priorities are used; the quality of schedules depends on a good choice
The longest-latency path, or critical path, is a good priority
Tie breakers:
Last use of a value: decreases demand for registers, as it moves the use nearer the definition
Number of descendants: encourages the scheduler to pursue multiple paths
Longer latency first: others can fit in its shadow
Random
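A sketch of the critical-path priority on the dependence graph: an op's priority is the longest latency-weighted path from it to any root (helper names are mine):

from functools import lru_cache

def critical_path(ops, succs, lat):
    """Return op -> length of its longest latency path to a root."""
    @lru_cache(maxsize=None)
    def cp(a):
        return lat[a] + max((cp(b) for b in succs[a]), default=0)
    return {a: cp(a) for a in ops}

The result can be passed directly as prio to list_schedule above.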

List scheduling
Example: Schedule with priority by critical path length (worked step by step on the dependence-graph figures in the original slides)

List scheduling
Forward vs backward
Can also schedule from roots to leaves (backward); this may change the schedule length
List scheduling is cheap, so try both and choose the best

List scheduling
Forward vs backward

Opcode   loadI  lshift  add  addI  cmp  store
Latency    1      1      2     1     1     4

List scheduling
Forward vs backward

Forwards:
Cycle  Int     Int     Stores
1      loadI1  lshift
2      loadI2  loadI3
3      loadI4  add1
4      add2    add3
5      add4    addI    store1
6      cmp             store2
7                      store3
8                      store4
9                      store5
10
11
12
13     cbr

Backwards:
Cycle  Int     Int     Stores
1      loadI4
2      addI    lshift
3      add4    loadI3
4      add3    loadI2  store5
5      add2    loadI1  store4
6      add1            store3
7                      store2
8                      store1
9
10
11     cmp
12     cbr

Scheduling Larger Regions
Schedule extended basic blocks (EBBs)
Superblock cloning
Schedule traces
Software pipelining

Scheduling Larger Regions
Extended basic blocks
An extended basic block (EBB) is a maximal set of blocks such that:
the set has a single entry, Bi
each block Bj other than Bi has exactly one predecessor
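From this definition, a sketch (my own, not the lecture's) of how EBBs can be found: the entry block and every block with more than one predecessor roots an EBB, and each single-predecessor block joins its predecessor's region:

def extended_basic_blocks(blocks, preds, succs, entry):
    """preds/succs: block -> set of blocks. Returns a list of EBBs."""
    ebbs = []
    roots = [b for b in blocks if b == entry or len(preds[b]) != 1]
    for r in roots:                      # each root starts one EBB
        ebb, work = set(), [r]
        while work:
            b = work.pop()
            ebb.add(b)
            # absorb successors whose only predecessor is inside the EBB
            work += [s for s in succs[b] if len(preds[s]) == 1 and s not in ebb]
        ebbs.append(ebb)
    return ebbs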

Scheduling Larger Regions
Extended basic blocks
Schedule entire paths through EBBs; the example has four EBB paths
Having B1 in both paths causes conflicts:
moving an op out of B1 means we must insert compensation code
moving an op into B1 also causes problems

Scheduling Larger Regions
Superblock cloning
Join points create context problems
Clone blocks to create more context
Merge any simple control flow
Schedule the resulting EBBs

Scheduling Larger Regions
Trace scheduling
Edge frequencies come from a profile (not block frequencies)
Pick the hot path
Schedule it with compensation code
Remove it from the CFG and repeat
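A simplified sketch of trace picking from profiled edge frequencies (a greedy version of my own; production trace schedulers typically use mutual-most-likely tests, and all helper names here are hypothetical):

def pick_trace(edges):
    """edges: dict (src, dst) -> frequency. Returns a list of blocks."""
    (src, dst), _ = max(edges.items(), key=lambda e: e[1])  # hottest edge
    trace = [src, dst]
    def hottest(cands):
        return max(cands, key=lambda e: edges[e], default=None)
    # extend forwards from the tail of the trace
    while True:
        nxt = hottest([e for e in edges if e[0] == trace[-1] and e[1] not in trace])
        if nxt is None:
            break
        trace.append(nxt[1])
    # extend backwards from the head of the trace
    while True:
        prv = hottest([e for e in edges if e[1] == trace[0] and e[0] not in trace])
        if prv is None:
            break
        trace.insert(0, prv[0])
    return trace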

Loop scheduling
Loop structures can dominate execution time
Specialist technique: software pipelining, which allows application of list scheduling to loops
Why not loop unrolling? Unrolling lets the loop overhead become arbitrarily small, but brings code growth, cache pressure, and register pressure

Software pipelining Consider simple loop to sum array

Software pipelining Schedule on 1 FU - 5 cycles load 3 cycles, add 1 cycle, branch 1 cycle

Software pipelining Schedule on VLIW 3 FUs - 4 cycles load 3 cycles, add 1 cycle, branch 1 cycle

Software pipelining A better steady state schedule exists load 3 cycles, add 1 cycle, branch 1 cycle

Software pipelining Requires prologue and epilogue (may schedule others in epilogue) load 3 cycles, add 1 cycle, branch 1 cycle

Software pipelining Respect dependences and latency including loop carries load 3 cycles, add 1 cycle, branch 1 cycle

Software pipelining Complete code load 3 cycles, add 1 cycle, branch 1 cycle

Software pipelining
Some definitions
Initiation interval (ii): number of cycles between initiating loop iterations
The original loop had an ii of 5 cycles; the final loop had an ii of 2 cycles
Recurrence: loop-based computation whose value is used in a later loop iteration
Might be several iterations later
Has dependency chain(s) on itself; the recurrence latency is the latency of that dependency chain

Software pipelining
Algorithm
1 Choose an initiation interval ii: compute lower bounds on ii (a shorter ii means faster overall execution)
2 Generate a loop body that takes ii cycles: try to schedule into ii cycles using a modulo scheduler; if it fails, increase ii by one and try again
3 Generate the needed prologue and epilogue code: for the prologue, work backward from upward-exposed uses in the scheduled loop body; for the epilogue, work forward from downward-exposed definitions

Software pipelining
Initial initiation interval (ii)
Starting value for ii is based on minimum resource and recurrence constraints
Resource constraint: ii must be large enough to issue every operation
Let Nu = number of FUs of type u, and Iu = number of operations needing type u
⌈Iu / Nu⌉ is a lower bound on ii for type u, so max over u of ⌈Iu / Nu⌉ is a lower bound on ii

Software pipelining
Initial initiation interval (ii)
Recurrence constraint: ii cannot be smaller than the longest recurrence latency
Recurrence r spans kr iterations with latency λr
⌈λr / kr⌉ is a lower bound on ii for recurrence r, so max over r of ⌈λr / kr⌉ is a lower bound on ii

Software pipelining
Initial initiation interval (ii)
Start value = max(max over u of ⌈Iu / Nu⌉, max over r of ⌈λr / kr⌉)

For the simple loop:
loop: a = A[i]
      b = b + a
      i = i + 1
      if i < n goto loop

Resource constraint:
            Memory  Integer  Branch
Iu            1        2       1
Nu            1        1       1
⌈Iu / Nu⌉     1        2       1

Recurrence constraint:
            b   i
kr          1   1
λr          2   1
⌈λr / kr⌉   2   1

So the starting ii is max(2, 2) = 2
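The start value can be computed mechanically; a tiny Python sketch with the numbers from the tables above (names are mine):

from math import ceil

def ii_lower_bound(ops_per_type, fus_per_type, recurrences):
    """recurrences: list of (latency, iterations-spanned) pairs."""
    resource = max(ceil(i / fus_per_type[u]) for u, i in ops_per_type.items())
    recur = max((ceil(lam / k) for lam, k in recurrences), default=1)
    return max(resource, recur)

ii = ii_lower_bound({"mem": 1, "int": 2, "branch": 1},
                    {"mem": 1, "int": 1, "branch": 1},
                    [(2, 1), (1, 1)])    # recurrences on b and i
print(ii)  # 2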

Software pipelining
Modulo scheduling
Schedule with cycle modulo the initiation interval (worked through on the figures in the original slides)
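A minimal sketch of the modulo reservation idea, simplified from real modulo schedulers (my own code; it assumes ops arrive in a topological order of the acyclic loop body and ignores loop-carried dependences, which a real scheduler must honour):

def modulo_schedule(ops, preds, lat, fu_of, fu_count, ii, horizon=1000):
    """ops: topological order of the loop body.
    Returns op -> start cycle, or None if no fit was found for this ii."""
    used = {}                          # (cycle % ii, fu type) -> issue count
    S = {}
    for a in ops:
        # earliest start respecting the latencies of a's predecessors
        t = max((S[p] + lat[p] for p in preds[a]), default=0)
        while t < horizon:
            slot = (t % ii, fu_of[a])  # FU is reserved modulo ii
            if used.get(slot, 0) < fu_count[fu_of[a]]:
                used[slot] = used.get(slot, 0) + 1
                S[a] = t
                break
            t += 1
        else:
            return None                # this ii does not fit
    return S

If it returns None for the chosen ii, increase ii by one and try again, as in step 2 of the algorithm above.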

Software pipelining
Current research
Much research into different software pipelining techniques
Difficult when there is general control flow in the loop; predication (in IA-64, for example) really helps here
Some recent work on exhaustive scheduling, i.e. solving the NP-complete problem exactly for basic blocks

Summary
Scheduling to hide latency and exploit ILP
Dependence graph: dependences between instructions, plus latency
Local list scheduling + priorities
Forward versus backward scheduling
Scheduling EBBs, superblock cloning, trace scheduling
Software pipelining of loops

PPar CDT Advert
4-year programme: MSc by Research + PhD
Research-focused: work on your thesis topic from the start
Collaboration between the University of Edinburgh's School of Informatics (ranked top in the UK by the 2014 REF) and the Edinburgh Parallel Computing Centre (the UK's largest supercomputing centre)
Research topics in software, hardware, theory and application of: parallelism, concurrency, distribution
Full funding available
Industrial engagement programme includes internships at leading companies
Now accepting applications! Find out more and apply at: pervasiveparallelism.inf.ed.ac.uk
"The biggest revolution in the technological landscape for fifty years"