Algorithms for Memory Hierarchies Lecture 14

Lecturer: Nodari Sitchinava
Scribe: Michael Hamann

1 Parallelism and Cache Obliviousness

The combination of parallelism and cache obliviousness is an ongoing topic of research; in this lecture we will only get to know a few basics. In the past lectures we looked at parallel algorithms in which the memory and block sizes were known, while in this lecture we will look at parallel algorithms in which the memory/cache sizes are unknown. This has many advantages, for example that the algorithms need to be designed only once and then work in different kinds of setups. Today's processor and system architectures are getting increasingly complex, with several levels of partially shared and partially private caches. Designing cache-aware algorithms for these architectures becomes increasingly difficult, while cache-oblivious algorithms also work on systems with several levels of memory.

Dealing with cache hierarchies in parallel algorithms is a lot more complex than dealing with cache hierarchies in sequential algorithms, as a cache can be used by any processor that shares it. Furthermore, in sequential algorithms it is clear that data that has been loaded at one point in the algorithm and is still in the cache at a later point in time can still be used, whereas in parallel algorithms the data might be used by other processors that don't share that cache. Thus the data could have been modified in other caches and might need to be updated. For the very same reason, multiple processors using the same data within a very short period of time can cause problems in parallel algorithms. Not knowing the block size poses an additional challenge when data that is stored in the same block could be used by different processors. In this case the access would need to be synchronized, but as the block size is unknown, it is also unknown whether any synchronization is actually needed.

There are two ways to design a parallel algorithm:

Coarse-grain parallelism. One thread per physical core. The threads are provided by the operating system.

Fine-grain parallelism. Many light-weight user-level threads. The program exposes lots of parallelism; the smallest task should be as small as possible. The runtime system then schedules these tasks on physical cores. An example of such a system is CILK++.

1.1 Work-depth framework

The work-depth framework was initially introduced for PRAM models. In the work-depth framework the algorithm specifies the smallest possible tasks that can be executed concurrently.

The work W is defined as the number of such tasks, while the depth D is the number of (parallel) steps. The computation of an algorithm can be modeled as a DAG (directed acyclic graph) whose nodes represent the tasks and whose edges represent the dependencies between the tasks. The work W is then the running time of the algorithm when executed sequentially, and the depth D is the length of the longest path in the graph. The properties of the graph that we need to analyze are thus the number of nodes and the longest path. With Brent's theorem we get the following inequality for the parallel execution time $T_P$ with P processors:

$T_P \le \frac{W}{P} + D$

When the number of processors is smaller than the number of tasks that can be executed in parallel, one physical processor simulates more than one virtual processor.

1.2 Scheduling tasks on physical processors

For just one processor, any topological order of the DAG is a valid schedule. One possibility, shown in Figure 1, is a depth-first scheduler (1DF-scheduler) that follows each path in the graph as long as possible (i.e. until it encounters a task whose requirements are not all fulfilled). For two processors, we'll have a look at three possibilities:

Greedy scheduler. The greedy scheduler assigns free tasks (tasks that haven't been executed yet but whose predecessors have all been executed) to any available processor; an example is shown in Figure 1. A greedy scheduler will always find a schedule with at most twice as many steps as the optimal schedule would need.

Prioritized scheduler. A prioritized scheduler assigns the free tasks according to a priority. For example, the PDF-scheduler (parallel depth-first scheduler) is a greedy scheduler that assigns tasks prioritized by their 1DF order; an example is shown in Figure 1.

Work-stealing scheduler. A work-stealing scheduler is implemented, for example, in CILK++. The algorithm of the work-stealing scheduler can be found in Algorithm 1, an example of its execution in Figure 2.

Figure 1: 1DF-schedule; greedy schedule and PDF schedule for 2 processors
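The work-depth quantities and the greedy scheduler described above can be illustrated with a minimal Python sketch. The DAG below is a hypothetical example and is not the one from Figure 1; the helper names `in_degrees`, `work_and_depth` and `greedy_schedule` are made up for illustration, and the depth is counted as the number of tasks on a longest path.

```python
from collections import deque

# Hypothetical task DAG (not the one from Figure 1): task -> successor tasks.
DAG = {
    "a": ["b", "c", "d"],
    "b": ["e"], "c": ["f"], "d": ["g", "h"],
    "e": ["i"], "f": ["i"], "g": ["i"], "h": ["i"],
    "i": [],
}

def in_degrees(dag):
    """Number of predecessors of every task."""
    indeg = {v: 0 for v in dag}
    for successors in dag.values():
        for w in successors:
            indeg[w] += 1
    return indeg

def work_and_depth(dag):
    """Return (W, D): number of tasks and number of tasks on a longest path."""
    indeg = in_degrees(dag)
    level = {v: 1 for v in dag}
    ready = deque(v for v in dag if indeg[v] == 0)
    while ready:  # process tasks in topological order
        v = ready.popleft()
        for w in dag[v]:
            level[w] = max(level[w], level[v] + 1)
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    return len(dag), max(level.values())

def greedy_schedule(dag, p):
    """Greedy scheduler: run up to p free tasks per step; return the steps."""
    indeg = in_degrees(dag)
    free = [v for v in dag if indeg[v] == 0]
    steps = []
    while free:
        running, free = free[:p], free[p:]
        steps.append(running)
        for v in running:
            for w in dag[v]:
                indeg[w] -= 1
                if indeg[w] == 0:  # all predecessors done: task becomes free
                    free.append(w)
    return steps

W, D = work_and_depth(DAG)
steps = greedy_schedule(DAG, p=2)
print(W, D, len(steps), W / 2 + D)
```

For this particular DAG the sketch gives W = 9 and D = 4, and the greedy schedule for two processors takes 6 steps, which is within the bound W/P + D = 8.5.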

Algorithm 1: The work-stealing scheduler
  foreach processor do
    if my deque is non-empty then
      pop first task and execute;
    else
      steal a task from the end of a random processor's deque;
    end
    push newly created free tasks on my deque;

Figure 2: Example execution of the work-stealing scheduler
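The work-stealing loop can be sketched as a small round-based simulation in Python. This is only an illustration under simplifying assumptions that are not part of the notes: processors take turns within a round instead of running truly concurrently, a stolen task is executed immediately by the thief, a failed steal attempt costs one idle turn, and the DAG is a hypothetical example. The function name `work_stealing` is made up for this sketch.

```python
import random
from collections import deque

# Hypothetical task DAG: task -> successor tasks.
DAG = {
    "a": ["b", "c", "d"],
    "b": ["e"], "c": ["f"], "d": ["g", "h"],
    "e": ["i"], "f": ["i"], "g": ["i"], "h": ["i"],
    "i": [],
}

def work_stealing(dag, p, seed=0):
    """Simulate work stealing; return (execution order, number of steals)."""
    rng = random.Random(seed)
    indeg = {v: 0 for v in dag}
    for successors in dag.values():
        for w in successors:
            indeg[w] += 1
    deques = [deque() for _ in range(p)]
    # Assumption: all initially free tasks start on processor 0.
    deques[0].extend(v for v in dag if indeg[v] == 0)
    order, steals, remaining = [], 0, len(dag)
    while remaining > 0:
        for me in range(p):
            if deques[me]:
                task = deques[me].popleft()  # pop first task of own deque
            else:
                victim = rng.randrange(p)
                if victim == me or not deques[victim]:
                    continue  # failed steal attempt: idle for this turn
                task = deques[victim].pop()  # steal from the victim's end
                steals += 1
            order.append(task)  # "execute" the task
            remaining -= 1
            for w in dag[task]:
                indeg[w] -= 1
                if indeg[w] == 0:
                    deques[me].appendleft(w)  # push newly free task
    return order, steals

order, steals = work_stealing(DAG, p=2, seed=1)
print(order, steals)
```

The number of steals depends on the random choices, so the sketch only guarantees that every task runs exactly once and after all of its predecessors.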

Figure 3: The different cache setups we consider in this lecture (sequential, private cache, shared cache)

In the sequential cache-oblivious setting it was the paging algorithm that knew the exact parameters of the system. In the parallel setting the scheduler takes over the role of the paging algorithm.

A graph is series-parallel if it has a source and a sink and can be constructed from series-parallel graphs by serial or parallel composition, where the smallest series-parallel graph is a single edge with two nodes.

Theorem 1. If the execution DAG is series-parallel and has depth D, then the work-stealing scheduler for P processors will perform $O(P \cdot D)$ steals in expectation.

Proof. Intuition: Each set of O(P) steals reduces the depth of the remaining DAG by at least one level.

In the following we will look at two simplified scenarios for memory analysis: the I/O complexity of schedulers for private and for shared caches (see Figure 3). Consider Algorithm 2 (visualized in Figure 4).

Algorithm 2: Scheduler example algorithm
  for parallel i = 1, 2, ..., P do
    for j = 1, ..., R do
      for k = 1, ..., M do
        x_i ← x_i + A_i[k];

The sequential running time $T_{seq}$ of Algorithm 2 is $O(R \cdot M)$. For P = 2, the I/O complexity when using the 1DF-scheduler is $2M/B$ I/Os. With a parallel schedule for two processors with the PDF scheduler on private caches there are $2M/B$ I/Os in total. On a shared cache of size $M_P = M$ we get $2RM/B$ I/Os, while for $M_P = 2M$ there are only $2M/B$ I/Os.

Theorem 2. The PDF scheduler incurs Q I/Os on a shared cache if $M_P = M + P \cdot D$, where Q is the sequential I/O complexity of the solution.

Theorem 3. The WS scheduler incurs O(Q) I/Os on a shared cache if $M_P = P \cdot M$.

On a current desktop machine with P = 8 processors and $D = O(\log n) \approx 32$ we get $P \cdot D = 256$, which means that with the PDF scheduler we only need 256 additional words of memory for the parallel solution.
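The I/O counts for the scheduler example can be checked with a small LRU cache simulation in Python. This is a sketch under stated assumptions, not an analysis from the lecture: the cache is fully associative with LRU replacement, the arrays are block-aligned and disjoint, the accumulators are assumed to stay in registers, and the PDF schedule for two processors is modeled as a strict interleaving of the two inner loops. The names `count_ios`, `sequential_order` and `interleaved_order` and the concrete values of M, B and R are chosen only for illustration.

```python
from collections import OrderedDict

def count_ios(accesses, cache_words, block_words):
    """Count block transfers of an LRU cache for a sequence of (array, index)."""
    capacity = cache_words // block_words  # cache size in blocks
    cache = OrderedDict()                  # keys ordered from least to most recent
    ios = 0
    for array_id, index in accesses:
        block = (array_id, index // block_words)
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            ios += 1                       # miss: one I/O loads the block
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return ios

def sequential_order(P, R, M):
    """1DF-like order: finish all R scans of one array before the next array."""
    return [(i, k) for i in range(P) for _ in range(R) for k in range(M)]

def interleaved_order(P, R, M):
    """PDF-like order (modeling assumption): the P scans advance in lockstep."""
    return [(i, k) for _ in range(R) for k in range(M) for i in range(P)]

M, B, R = 64, 8, 5  # hypothetical sizes: array length, block size, repetitions
print(count_ios(sequential_order(2, R, M), M, B))       # 2M/B
print(count_ios(interleaved_order(2, R, M), M, B))      # 2RM/B
print(count_ios(interleaved_order(2, R, M), 2 * M, B))  # 2M/B
```

With these sizes the three calls count 16, 80 and 16 block transfers, matching the pattern $2M/B$, $2RM/B$ and $2M/B$: the interleaved schedule only suffers when both arrays compete for a shared cache that is large enough for one of them but not for both.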

Figure 4: Visualization of Algorithm 2

This means that we should design cache-oblivious algorithms with the smallest possible depth. The cache-oblivious sorting algorithm that we considered in class had depth Ω( n). Blelloch et al. published a cache-oblivious sorting algorithm with depth $O(\log^2 n)$ in 20.

For private caches, the WS scheduler incurs $Q_P = Q + O(P \cdot D \cdot M/B)$ total I/Os. The proof idea is that each of the $O(P \cdot D)$ steals incurs $\Theta(M/B)$ I/Os.

There are schedulers that work for mixed caches; they are a combination of the WS and PDF schedulers with some additions.