Algorithms for Memory Hierarchies Lecture 14


Lecturer: Nodari Sitchinava
Scribe: Michael Hamann

1 Parallelism and Cache Obliviousness

The combination of parallelism and cache obliviousness is an ongoing topic of research; in this lecture we will only get to know a few basics. In the past lectures we looked at parallel algorithms in which the memory and block size were known, while in this lecture we will look at parallel algorithms in which the memory/cache sizes are unknown. This has many advantages, for example that the algorithms need to be designed only once and then work in different kinds of setups. Today's processor and system architectures get increasingly complex, with several levels of partially shared and partially private caches. Designing cache-aware algorithms for these architectures becomes increasingly difficult, while cache-oblivious algorithms also work on systems with several levels of memory.

Dealing with cache hierarchies in parallel algorithms is a lot more complex than dealing with cache hierarchies in sequential algorithms, as a cache can be used by any processor that shares it. Furthermore, in sequential algorithms it is clear that data that was loaded at one point in the algorithm and is still in the cache at a later point in time can still be used, whereas in parallel algorithms the data might also be used by other processors that don't share that cache. Thus the data could have been modified in other caches and might need to be updated. For the very same reason, multiple processors using the same data within a very short period of time can cause problems in parallel algorithms. Not knowing the block size poses an additional challenge when data that is stored in the same block could be used by different processors. In this case the access would need to be synchronized, but as the block size is unknown it is also unknown whether any synchronization is actually needed.

There are two ways to design a parallel algorithm:

Coarse-grain parallelism: One thread per physical core. The threads are provided by the operating system.

Fine-grain parallelism: Many light-weight user-level threads. The program exposes lots of parallelism; the smallest task should be as small as possible. The runtime system then schedules these tasks on physical cores. An example of such a system is CILK++.

1.1 Work-depth framework

The work-depth framework was initially introduced for PRAM models. In the work-depth framework the algorithm specifies the smallest possible tasks that can be executed concurrently.

The work W is defined as the number of such tasks, while the depth D is the number of (parallel) steps. The computation of an algorithm can be modeled as a DAG (directed acyclic graph) whose nodes represent the tasks and whose edges represent the dependencies between the tasks. The work W is then the sequential running time of the algorithm and the depth D is the length of the longest path in the graph. The properties of the graph that we need to analyze are thus the number of nodes and the longest path. With Brent's theorem we get the following inequality for the parallel execution time $T_P$ with $P$ processors:

$T_P \le \frac{W}{P} + D$

When the number of processors is smaller than the number of tasks that can be executed in parallel, one physical processor simulates more than one virtual processor.

1.2 Scheduling tasks on physical processors

For just one processor any topological order on the DAG is possible. One possibility, shown in Figure 1, is a depth-first scheduler (1DF scheduler) that follows each path in the graph as long as possible (i.e. until it encounters a task whose requirements are not all fulfilled). For two processors, we will have a look at three possibilities:

Greedy scheduler: The greedy scheduler assigns free tasks (tasks that haven't been executed yet but whose predecessors have already been executed) to any available processor; an example is shown in Figure 1. A greedy scheduler will always find a schedule with at most twice as many steps as the optimal schedule would need.

Prioritized scheduler: A prioritized scheduler assigns the free tasks according to a priority. For example, the PDF scheduler (parallel depth-first scheduler) is a greedy scheduler that assigns tasks prioritized by 1DF order; an example is shown in Figure 1.

Figure 1: 1DF-schedule; greedy schedule and PDF schedule for 2 processors

Work stealing scheduler: A work-stealing scheduler is, for example, implemented in CILK++. The algorithm of the work-stealing scheduler can be found in Algorithm 1, an example of its execution in Figure 2.
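The work, the depth, and the greedy scheduler described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the DAG `EXAMPLE_DAG` is hypothetical (the edges of Figure 1 are not recoverable here), the function names `work_and_depth` and `greedy_schedule` are made up for this sketch, every task takes one time step, and the depth is counted as the number of nodes on the longest path.

```python
from collections import deque

# Hypothetical task DAG (not the one from Figure 1): task -> list of successors.
EXAMPLE_DAG = {
    "a": ["b", "c", "d"],
    "b": ["e"],
    "c": ["f"],
    "d": ["g", "h"],
    "e": ["i"], "f": ["i"], "g": ["i"], "h": ["i"],
    "i": [],
}

def in_degrees(dag):
    """Count the unfinished predecessors of every task."""
    indeg = {v: 0 for v in dag}
    for v in dag:
        for w in dag[v]:
            indeg[w] += 1
    return indeg

def work_and_depth(dag):
    """Return (W, D): number of tasks and number of nodes on the longest path."""
    indeg = in_degrees(dag)
    level = {v: 1 for v in dag}
    queue = deque(v for v in dag if indeg[v] == 0)
    while queue:  # process the tasks in topological order
        v = queue.popleft()
        for w in dag[v]:
            level[w] = max(level[w], level[v] + 1)
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return len(dag), max(level.values())

def greedy_schedule(dag, p):
    """Number of steps a greedy schedule needs on p processors (unit-time tasks)."""
    indeg = in_degrees(dag)
    ready = [v for v in dag if indeg[v] == 0]
    steps = 0
    while ready:
        # Greedy: never leave a processor idle while a free task exists.
        running, ready = ready[:p], ready[p:]
        steps += 1
        for v in running:
            for w in dag[v]:
                indeg[w] -= 1
                if indeg[w] == 0:
                    ready.append(w)
    return steps

if __name__ == "__main__":
    W, D = work_and_depth(EXAMPLE_DAG)
    for p in (1, 2, 3, 4):
        print(p, greedy_schedule(EXAMPLE_DAG, p), W / p + D)
```

For every processor count tried on this example, the number of greedy steps stays below $W/P + D$, which is what the inequality above predicts.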

Algorithm 1: The work stealing scheduler

foreach processor do
    if my deque is non-empty then
        pop first task and execute;
    else
        steal a task from the end of a random processor's deque;
    push newly created free tasks on my deque;

Figure 2: Example execution of the work stealing scheduler
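A minimal, sequentialized simulation of Algorithm 1 may help to see how the deques evolve. It is a sketch under assumptions, not the CILK++ runtime: the processors take turns one after another instead of running truly in parallel, a thief executes a stolen task immediately, a failed steal attempt simply costs the processor its turn, and `EXAMPLE_DAG`, `work_stealing`, and the `seed` parameter are names invented for this illustration.

```python
import random
from collections import deque

# Hypothetical task DAG: task -> list of successors.
EXAMPLE_DAG = {
    "a": ["b", "c", "d"],
    "b": ["e"],
    "c": ["f"],
    "d": ["g", "h"],
    "e": ["i"], "f": ["i"], "g": ["i"], "h": ["i"],
    "i": [],
}

def work_stealing(dag, p, seed=0):
    """Simulate work stealing on p processors; return (execution order, steals)."""
    rng = random.Random(seed)
    indeg = {v: 0 for v in dag}
    for v in dag:
        for w in dag[v]:
            indeg[w] += 1

    deques = [deque() for _ in range(p)]
    for v in dag:  # all source tasks start on processor 0
        if indeg[v] == 0:
            deques[0].append(v)

    order, steals = [], 0
    while len(order) < len(dag):
        for me in range(p):
            if deques[me]:
                task = deques[me].popleft()      # pop first task of my deque
            else:
                victim = rng.randrange(p)        # pick a random processor
                if victim == me or not deques[victim]:
                    continue                     # failed steal attempt
                task = deques[victim].pop()      # steal from the end
                steals += 1
            order.append(task)                   # "execute" the task
            newly_free = []
            for w in dag[task]:
                indeg[w] -= 1
                if indeg[w] == 0:
                    newly_free.append(w)
            # Push new tasks on the front so the owner continues depth-first.
            deques[me].extendleft(reversed(newly_free))
    return order, steals

if __name__ == "__main__":
    print(work_stealing(EXAMPLE_DAG, 1))
    print(work_stealing(EXAMPLE_DAG, 2, seed=1))
```

With a single processor no steals happen and the tasks run in depth-first order; with more processors the order depends on the random choice of victims, but every task still runs only after all of its predecessors.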

Figure 3: The different cache setups we consider in this lecture (sequential, private caches, shared cache)

In the sequential cache-oblivious setting it was the paging algorithm that knew the exact parameters of the system. In the parallel setting the scheduler takes over the role of the paging algorithm.

A graph is series-parallel if it can be constructed from series-parallel graphs (with a single edge between two nodes as the smallest series-parallel graph) by serial or parallel composition and has a source and a sink.

Theorem 1. If the execution DAG is series-parallel of depth $D$, then the work stealing scheduler for $P$ processors will perform $O(PD)$ steals in expectation.

Proof. Intuition: Each set of $O(P)$ steals reduces the depth of the remaining DAG by at least one level.

In the following we will look at two simplified scenarios for memory analysis: the I/O complexity of schedulers for private and shared caches (see Figure 3). Consider Algorithm 2 (visualized in Figure 4).

Algorithm 2: Scheduler example algorithm

for parallel i = 1, 2, ..., P do
    for j = 1, ..., R do
        for k = 1, ..., M do
            x_i ← x_i + a_i[k];

The sequential running time $T_{seq}$ of Algorithm 2 is $O(RM)$. The I/O complexity when using the 1DF scheduler is $2M/B$ I/Os. With a parallel schedule for two processors with the PDF scheduler on private caches there are $2M/B$ I/Os in total. On a shared cache of size $M_P = M$ we get $2RM/B$ I/Os, while for $M_P = 2M$ there are only $2M/B$ I/Os.

Theorem 2. The PDF scheduler incurs $Q_1$ I/Os on a shared cache if $M_P = M + PD$, where $M$ is the cache size and $Q_1$ the I/O complexity of the sequential solution.

Theorem 3. The WS scheduler incurs $O(Q_1)$ I/Os on a shared cache if $M_P = P \cdot M$.

On a current desktop machine with $P = 8$ processors and $D = O(\log n) \approx 32$ we get $P \cdot D = 256$, which means that with the PDF scheduler we only need 256 additional words of memory for the parallel solution.

Figure 4: Visualization of Algorithm 2

This means that we should design cache-oblivious algorithms with the smallest possible depth. The cache-oblivious sorting algorithm that we considered in class had depth Ω( n). Blelloch et al. published a cache-oblivious sorting algorithm with depth $O(\log^2 n)$ in 20[…].

For private caches, the WS scheduler incurs $Q_P = Q_1 + O(PD \cdot M/B)$ total I/Os. The proof idea is that each of the $O(PD)$ steals incurs $\Theta(M/B)$ I/Os.

There are schedulers that work for mixed caches; they are a combination of the WS and PDF schedulers with some additions.
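The I/O counts quoted for Algorithm 2 can be checked with a small cache simulation. The sketch below rests on several assumptions that are not part of the notes: the cache is fully associative with LRU replacement, only accesses to the arrays are counted (the accumulators are assumed to live in registers), the PDF schedule for two processors is modeled as a strict interleaving of the two inner loops, and the names `LRUCache`, `one_df_ios`, and `pdf_shared_ios` are invented for this illustration.

```python
from collections import OrderedDict

class LRUCache:
    """Fully associative LRU cache that counts block misses (I/Os)."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()
        self.misses = 0

    def access(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)    # mark as most recently used
        else:
            self.misses += 1
            self.blocks[block_id] = True
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used

def one_df_ios(P, R, M, B, cache_words):
    """I/Os of the sequential 1DF order: finish array i before array i+1."""
    cache = LRUCache(cache_words // B)
    for i in range(P):
        for _ in range(R):
            for k in range(M):
                cache.access((i, k // B))
    return cache.misses

def pdf_shared_ios(P, R, M, B, cache_words):
    """I/Os when P processors advance in lockstep on one shared cache."""
    cache = LRUCache(cache_words // B)
    for _ in range(R):
        for k in range(M):
            for i in range(P):                   # interleaved accesses
                cache.access((i, k // B))
    return cache.misses

if __name__ == "__main__":
    P, R, M, B = 2, 5, 64, 8
    print(one_df_ios(P, R, M, B, cache_words=M))          # 2M/B
    print(pdf_shared_ios(P, R, M, B, cache_words=M))      # 2RM/B
    print(pdf_shared_ios(P, R, M, B, cache_words=2 * M))  # 2M/B
```

With a shared cache of only $M$ words the two interleaved scans keep evicting each other's blocks, so every pass reloads both arrays; doubling the cache removes the effect.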