Understanding the Spike Algorithm

Victor Eijkhout and Robert van de Geijn

May

1 Introduction

The parallel solution of linear systems has a long history, spanning both direct and iterative methods. While direct methods exist that have great generality, here we consider a subcase of practical importance, that of banded matrices. We note that many PDE problems naturally give rise to banded systems, given a large enough bandwidth. For any banded matrix, we can impose a block structure such that the matrix is block tridiagonal. This structure gives each processor a contiguous block row of the matrix; we assume that the number of processors is low enough that the part owned by any processor comprises one or more of the blocks that define the block tridiagonal structure.

In this paper we present a number of variants on the Spike factorization of Polizzi and Sameh [?]. Instead of the customary algebraic presentation, we view this algorithm as a domain decomposition method, where each processor corresponds to a subdomain, and the problem variables are divided into interior regions and separators. We will do a cost analysis for the case where the algorithm is applied to a finite element type matrix. The Spike algorithm is then seen to be slightly more expensive than a regular domain decomposition.

2 Basic insights

From 1D domain to block tridiagonal matrix

To understand how a block tridiagonal matrix might arise, consider the domain in Figure 1. Imagine a discretization (mesh of unknowns) ordered in column-major order. The dark lines represent columns of unknowns that are viewed as separators.
Figure 1: A one-dimensionally partitioned domain with 4 subdomains and separators.

This then yields a matrix with the following structure:

\[
\begin{pmatrix}
A^{(11)} & B^{(11)} \\
C^{(11)} & A^{(12)} & B^{(12)} \\
 & C^{(12)} & A^{(13)} & D^{(1)} \\
 & & E^{(1)} & S^{(1)} & G^{(1)} \\
 & & & H^{(1)} & A^{(21)} & B^{(21)} \\
 & & & & C^{(21)} & A^{(22)} & B^{(22)} \\
 & & & & & C^{(22)} & A^{(23)} & D^{(2)} \\
 & & & & & & E^{(2)} & S^{(2)} & G^{(2)} \\
 & & & & & & & H^{(2)} & A^{(31)} & B^{(31)} \\
 & & & & & & & & C^{(31)} & A^{(32)} & B^{(32)} \\
 & & & & & & & & & C^{(32)} & A^{(33)}
\end{pmatrix}.
\]

Here the large blocks correspond to the coupling matrix for the subdomains between separators, and the small ones to the separators themselves.
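As a concrete illustration of how this structure arises, here is a small Python/NumPy sketch; the function name, the grid sizes, and the five-point Laplacian stencil are illustrative choices of ours, not anything prescribed by the text. It builds the matrix of a 2D mesh ordered column by column and checks that only the diagonal and the two adjacent block diagonals are nonzero, so that designating some mesh columns as separators yields exactly the block structure displayed above.

    import numpy as np

    def laplacian_2d(nx, ny):
        """Five-point Laplacian on an nx-by-ny grid, unknowns ordered column-major
        (one mesh column at a time), so couplings exist only between a column and
        its two neighboring columns: the matrix is block tridiagonal with ny-by-ny blocks."""
        N = nx * ny
        A = np.zeros((N, N))
        idx = lambda i, j: i * ny + j              # column-major: column i, row j
        for i in range(nx):
            for j in range(ny):
                k = idx(i, j)
                A[k, k] = 4.0
                if j > 0:      A[k, idx(i, j - 1)] = -1.0   # coupling within the column
                if j < ny - 1: A[k, idx(i, j + 1)] = -1.0
                if i > 0:      A[k, idx(i - 1, j)] = -1.0   # coupling to the previous column
                if i < nx - 1: A[k, idx(i + 1, j)] = -1.0   # coupling to the next column
        return A

    nx, ny = 11, 6
    A = laplacian_2d(nx, ny)
    blocks = A.reshape(nx, ny, nx, ny).swapaxes(1, 2)       # blocks[i, k]: column i vs column k
    print(all(not blocks[i, k].any()                        # only the tridiagonal blocks are nonzero
              for i in range(nx) for k in range(nx) if abs(i - k) > 1))
    # Marking, say, mesh columns 3 and 7 as separators splits the remaining columns into
    # three interior groups, which is exactly the structure of the matrix displayed above.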
Permutation

Now, we can order the unknowns differently: still in column-major order, but ordering all the columns for the interiors first, skipping the separators, and then the separators. This yields the matrix

\[
\left(\begin{array}{ccccccccc|cc}
A^{(11)} & B^{(11)} & & & & & & & & & \\
C^{(11)} & A^{(12)} & B^{(12)} & & & & & & & & \\
 & C^{(12)} & A^{(13)} & & & & & & & D^{(1)} & \\
 & & & A^{(21)} & B^{(21)} & & & & & H^{(1)} & \\
 & & & C^{(21)} & A^{(22)} & B^{(22)} & & & & & \\
 & & & & C^{(22)} & A^{(23)} & & & & & D^{(2)} \\
 & & & & & & A^{(31)} & B^{(31)} & & & H^{(2)} \\
 & & & & & & C^{(31)} & A^{(32)} & B^{(32)} & & \\
 & & & & & & & C^{(32)} & A^{(33)} & & \\ \hline
 & & E^{(1)} & G^{(1)} & & & & & & S^{(1)} & \\
 & & & & & E^{(2)} & G^{(2)} & & & & S^{(2)}
\end{array}\right), \qquad (1)
\]

which we recognize as the same matrix, permuted.

LU factorization vs one-sided factorization

Consider the matrix
\[
\begin{pmatrix} A & B \\ C & S \end{pmatrix};
\]
its LU factorization is given by
\[
\begin{pmatrix} A & B \\ C & S \end{pmatrix}
=
\begin{pmatrix} L_A & \\ \hat{C} & L_S \end{pmatrix}
\begin{pmatrix} U_A & \hat{B} \\ & U_S \end{pmatrix},
\]
where A = L_A U_A, L_A B̂ = B, Ĉ U_A = C, and S − Ĉ B̂ = S − C A^{-1} B = L_S U_S.
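The block factorization above is easy to check numerically. The following sketch is ours: it uses a small unpivoted LU helper and random, diagonally dominant test blocks (so that no pivoting is needed), forms L_A, U_A, B̂, Ĉ and the Schur complement, and verifies that the assembled block factors reproduce the original matrix.

    import numpy as np

    def lu_nopivot(M):
        """Unpivoted LU, M = L @ U with L unit lower triangular (adequate for the
        diagonally dominant test blocks used here)."""
        M = M.astype(float).copy()
        L = np.eye(M.shape[0])
        for k in range(M.shape[0]):
            L[k+1:, k] = M[k+1:, k] / M[k, k]
            M[k+1:, k:] -= np.outer(L[k+1:, k], M[k, k:])
            M[k+1:, k] = 0.0
        return L, M

    rng = np.random.default_rng(0)
    n, m = 6, 3                                      # interior / separator block sizes
    A = rng.random((n, n)) + n * np.eye(n)
    B = rng.random((n, m)); C = rng.random((m, n)); S = rng.random((m, m)) + m * np.eye(m)

    L_A, U_A = lu_nopivot(A)                         # A = L_A U_A
    Bhat = np.linalg.solve(L_A, B)                   # L_A Bhat = B
    Chat = np.linalg.solve(U_A.T, C.T).T             # Chat U_A = C
    Schur = S - Chat @ Bhat                          # = S - C A^{-1} B
    L_S, U_S = lu_nopivot(Schur)

    L = np.block([[L_A, np.zeros((n, m))], [Chat, L_S]])
    U = np.block([[U_A, Bhat], [np.zeros((m, n)), U_S]])
    print(np.allclose(L @ U, np.block([[A, B], [C, S]])))         # True
    print(np.allclose(Schur, S - C @ np.linalg.solve(A, B)))      # True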
Step 1 (Factor interior block): for i = 1, ..., R, factor the (updated) diagonal block as L_A U_A and form the multipliers B̂ and Ĉ.
Step 2: for i = 1, ..., R, eliminate the couplings to the separators, forming the D̂, Ĥ, Ê, Ĝ blocks.
Step 3: (+) form T, V, and the contribution to S^{(j+1)}.
Step 4: (x) add the contribution from processor j − 1; (x) factor L_S U_S; (+)(x) T̂ := L_S^{-1} T; (+)(x) V̂ := V U_S^{-1}; (+) S^{(j+1)} := S^{(j+1)} − V̂ T̂; (+) send S^{(j+1)} to processor j + 1.

Figure 2: Parallel LU factorization. L_A and U_A overwrite Â; L_S and U_S overwrite Ŝ. Commands annotated with (*) are not performed in iteration i = 1. Commands annotated with ($) are not performed in iteration i = R. Commands annotated with (x) are not performed by the first processor. Commands annotated with (+) are not performed by the last processor.
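As a sequential reference for the elimination that Figure 2 distributes across processors, the following Python/NumPy sketch factors each interior block independently and accumulates every subdomain's contribution to the reduced system on the separators, then checks the result against the dense Schur complement. The names, sizes, and random test blocks are illustrative choices of ours; the sketch ignores the in-place storage and the passing of S^{(j+1)} between processors.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    rng = np.random.default_rng(3)
    p, n, m = 3, 6, 2                       # subdomains, interior block size, separator block size
    N = p * n + (p - 1) * m                 # total size, permuted ordering: interiors first
    W = [rng.random((n, n)) + 2 * n * np.eye(n) for _ in range(p)]
    D = [rng.random((n, m)) for _ in range(p - 1)]   # subdomain i   -> separator i (right coupling)
    H = [rng.random((n, m)) for _ in range(p - 1)]   # subdomain i+1 -> separator i (left coupling)
    E = [rng.random((m, n)) for _ in range(p - 1)]   # separator i -> subdomain i
    G = [rng.random((m, n)) for _ in range(p - 1)]   # separator i -> subdomain i+1
    S = [rng.random((m, m)) + 2 * m * np.eye(m) for _ in range(p - 1)]

    # Assemble the full permuted matrix for reference
    A = np.zeros((N, N))
    for j in range(p):
        A[j*n:(j+1)*n, j*n:(j+1)*n] = W[j]
    for i in range(p - 1):
        si = p * n + i * m                           # where separator i lives
        A[si:si+m, si:si+m] = S[i]
        A[i*n:(i+1)*n, si:si+m] = D[i];     A[si:si+m, i*n:(i+1)*n] = E[i]
        A[(i+1)*n:(i+2)*n, si:si+m] = H[i]; A[si:si+m, (i+1)*n:(i+2)*n] = G[i]

    # Each subdomain factors its interior block and contributes to the reduced separator system
    Sred = np.zeros(((p - 1) * m, (p - 1) * m))
    for i in range(p - 1):
        Sred[i*m:(i+1)*m, i*m:(i+1)*m] = S[i]
    for j in range(p):
        Wf = lu_factor(W[j])                                         # factor interior block
        rows = ([(j - 1, G[j - 1])] if j > 0 else []) + ([(j, E[j])] if j < p - 1 else [])
        cols = ([(j - 1, H[j - 1])] if j > 0 else []) + ([(j, D[j])] if j < p - 1 else [])
        for ir, Yrow in rows:
            for ic, Xcol in cols:
                Sred[ir*m:(ir+1)*m, ic*m:(ic+1)*m] -= Yrow @ lu_solve(Wf, Xcol)

    # The accumulated reduced system equals the Schur complement of the interior part
    Wfull, Xfull = A[:p*n, :p*n], A[:p*n, p*n:]
    Yfull, Zfull = A[p*n:, :p*n], A[p*n:, p*n:]
    print(np.allclose(Sred, Zfull - Yfull @ np.linalg.solve(Wfull, Xfull)))   # True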
3 Fishbone Algorithm

Let us perform an LU factorization of the permuted matrix in (1), taking advantage of zero blocks, yielding the factors

\[
L =
\begin{pmatrix}
L^{(11)} \\
\hat{C}^{(11)} & L^{(12)} \\
 & \hat{C}^{(12)} & L^{(13)} \\
 & & & L^{(21)} \\
 & & & \hat{C}^{(21)} & L^{(22)} \\
 & & & & \hat{C}^{(22)} & L^{(23)} \\
 & & & & & & L^{(31)} \\
 & & & & & & \hat{C}^{(31)} & L^{(32)} \\
 & & & & & & & \hat{C}^{(32)} & L^{(33)} \\
 & & \hat{E}^{(1)} & \hat{G}^{(21)} & \hat{G}^{(22)} & \hat{G}^{(23)} & & & & L_S^{(1)} \\
 & & & & & \hat{E}^{(2)} & \hat{G}^{(31)} & \hat{G}^{(32)} & \hat{G}^{(33)} & \hat{V}^{(1)} & L_S^{(2)}
\end{pmatrix}
\]
and
\[
U =
\begin{pmatrix}
U^{(11)} & \hat{B}^{(11)} \\
 & U^{(12)} & \hat{B}^{(12)} \\
 & & U^{(13)} & & & & & & & \hat{D}^{(1)} \\
 & & & U^{(21)} & \hat{B}^{(21)} & & & & & \hat{H}^{(21)} \\
 & & & & U^{(22)} & \hat{B}^{(22)} & & & & \hat{H}^{(22)} \\
 & & & & & U^{(23)} & & & & \hat{H}^{(23)} & \hat{D}^{(2)} \\
 & & & & & & U^{(31)} & \hat{B}^{(31)} & & & \hat{H}^{(31)} \\
 & & & & & & & U^{(32)} & \hat{B}^{(32)} & & \hat{H}^{(32)} \\
 & & & & & & & & U^{(33)} & & \hat{H}^{(33)} \\
 & & & & & & & & & U_S^{(1)} & \hat{T}^{(1)} \\
 & & & & & & & & & & U_S^{(2)}
\end{pmatrix}.
\]

An algorithm for this is given in Figure 2. If one performed the computation as if the matrix were permuted, but kept the data in the original matrix, one would end up with a matrix that would look like this:

\[
\begin{pmatrix}
\hat{A}^{(11)} & \hat{B}^{(11)} \\
\hat{C}^{(11)} & \hat{A}^{(12)} & \hat{B}^{(12)} \\
 & \hat{C}^{(12)} & \hat{A}^{(13)} & \hat{D}^{(1)} \\
 & & \hat{E}^{(1)} & \hat{S}^{(1)} & \hat{G}^{(21)} & \hat{G}^{(22)} & \hat{G}^{(23)} & \hat{T}^{(1)} \\
 & & & \hat{H}^{(21)} & \hat{A}^{(21)} & \hat{B}^{(21)} \\
 & & & \hat{H}^{(22)} & \hat{C}^{(21)} & \hat{A}^{(22)} & \hat{B}^{(22)} \\
 & & & \hat{H}^{(23)} & & \hat{C}^{(22)} & \hat{A}^{(23)} & \hat{D}^{(2)} \\
 & & & \hat{V}^{(1)} & & & \hat{E}^{(2)} & \hat{S}^{(2)} & \hat{G}^{(31)} & \hat{G}^{(32)} & \hat{G}^{(33)} \\
 & & & & & & & \hat{H}^{(31)} & \hat{A}^{(31)} & \hat{B}^{(31)} \\
 & & & & & & & \hat{H}^{(32)} & \hat{C}^{(31)} & \hat{A}^{(32)} & \hat{B}^{(32)} \\
 & & & & & & & \hat{H}^{(33)} & & \hat{C}^{(32)} & \hat{A}^{(33)}
\end{pmatrix},
\]

where Â stores L_A and U_A, and Ŝ stores L_S and U_S.

To parallelize this, one first decides where the data resides. Let us assume that the domain was partitioned into p subdomains (with p − 1 separators) and hence there are p large diagonal blocks. We will assign to processor j the submatrix

\[
\left(\begin{array}{c|ccc|c}
S^{(j)} & \ast & & & T^{(j)} \\ \hline
\ast & \ast & \ast & & \\
 & \ast & \ddots & \ast & \\
 & & \ast & \ast & \ast \\ \hline
V^{(j)} & & & \ast & S^{(j+1)}
\end{array}\right)
\qquad\text{or, permuted,}\qquad
\left(\begin{array}{ccc|c|c}
\ast & \ast & & \ast & \\
\ast & \ddots & \ast & & \\
 & \ast & \ast & & \ast \\ \hline
\ast & & & S^{(j)} & T^{(j)} \\ \hline
 & & \ast & V^{(j)} & S^{(j+1)}
\end{array}\right).
\]
Notice that the block in red is actually owned by processor j + 1. A contribution to the Schur complement that updates that block will be computed on this processor, and then passed to the next one, to be added there to S^{(j+1)}. Assuming that each large diagonal block actually has R subblocks on the diagonal, a parallel algorithm (that is executed on each processor) for computing the LU factorization of the matrix is given in Figure 2.

4 The Spike Algorithm

To understand the Spike algorithm, one should first understand the following unconventional sequence of factorizations. We start with the matrix
\[
\begin{pmatrix} W & X \\ Y & Z \end{pmatrix}
\]
and compute the LU factorization of W, W = L_W U_W. Then
\[
\begin{pmatrix} W & X \\ Y & Z \end{pmatrix}
=
\begin{pmatrix} L_W & \\ & I \end{pmatrix}
\begin{pmatrix} U_W & \\ & I \end{pmatrix}
\begin{pmatrix} I & \check{X} \\ Y & Z \end{pmatrix}, \qquad (2)
\]
where X̌ is defined by W X̌ = X. Next, we factor further:
\[
\begin{pmatrix} W & X \\ Y & Z \end{pmatrix}
=
\begin{pmatrix} L_W & \\ & I \end{pmatrix}
\begin{pmatrix} U_W & \\ & I \end{pmatrix}
\begin{pmatrix} I & \\ Y & I \end{pmatrix}
\begin{pmatrix} I & \check{X} \\ & Z - Y\check{X} \end{pmatrix}, \qquad (3)
\]
where Z − Y X̌ is often called a Schur complement. Finally, if we factor Z − Y X̌ into its LU factorization, Z − Y X̌ = L_Z U_Z, we obtain
\[
\begin{pmatrix} W & X \\ Y & Z \end{pmatrix}
=
\begin{pmatrix} L_W & \\ & I \end{pmatrix}
\begin{pmatrix} U_W & \\ & I \end{pmatrix}
\begin{pmatrix} I & \\ Y & I \end{pmatrix}
\begin{pmatrix} I & \\ & L_Z \end{pmatrix}
\begin{pmatrix} I & \\ & U_Z \end{pmatrix}
\begin{pmatrix} I & \check{X} \\ & I \end{pmatrix}. \qquad (4)
\]

Again, consider the permuted matrix in Equation (1):

\[
\left(\begin{array}{ccccccccc|cc}
A^{(11)} & B^{(11)} & & & & & & & & & \\
C^{(11)} & A^{(12)} & B^{(12)} & & & & & & & & \\
 & C^{(12)} & A^{(13)} & & & & & & & D^{(1)} & \\
 & & & A^{(21)} & B^{(21)} & & & & & H^{(1)} & \\
 & & & C^{(21)} & A^{(22)} & B^{(22)} & & & & & \\
 & & & & C^{(22)} & A^{(23)} & & & & & D^{(2)} \\
 & & & & & & A^{(31)} & B^{(31)} & & & H^{(2)} \\
 & & & & & & C^{(31)} & A^{(32)} & B^{(32)} & & \\
 & & & & & & & C^{(32)} & A^{(33)} & & \\ \hline
 & & E^{(1)} & G^{(1)} & & & & & & S^{(1)} & \\
 & & & & & E^{(2)} & G^{(2)} & & & & S^{(2)}
\end{array}\right) \qquad (5)
\]
and assign to processor j the submatrix

\[
\left(\begin{array}{c|ccc|c}
S^{(j)} & \ast & & & T^{(j)} \\ \hline
\ast & \ast & \ast & & \\
 & \ast & \ddots & \ast & \\
 & & \ast & \ast & \ast \\ \hline
V^{(j)} & & & \ast & S^{(j+1)}
\end{array}\right)
\qquad\text{or, permuted,}\qquad
\left(\begin{array}{ccc|c|c}
\ast & \ast & & \ast & \\
\ast & \ddots & \ast & & \\
 & \ast & \ast & & \ast \\ \hline
\ast & & & S^{(j)} & T^{(j)} \\ \hline
 & & \ast & V^{(j)} & S^{(j+1)}
\end{array}\right).
\]

The block in red is again owned by processor j + 1 and will be a contribution to the update of that processor's S^{(j+1)}. The use of thick lines here and in (1)–(5) was purposely chosen. They delineate what we will refer to as W, X, Y, and Z in the subsequent discussion.

Consider the parallel algorithm in Figure 3. In Equations ??, let W consist of the block diagonal matrix with the interior blocks on its diagonal.

Step 1 computes (this processor's part of) the LU factorization W = L_W U_W.
Step 2 solves L_W (U_W X̂) = X.
Step 3 updates Z := Z − Y X̂.
Step 4 factors Z − Y X̂.

If one performs the computation as if the matrix were permuted, but updated the data in the original matrix, one would end up with a matrix that looks like:

\[
\begin{pmatrix}
\hat{A}^{(11)} & \hat{B}^{(11)} & & \hat{D}^{(11)} \\
\hat{C}^{(11)} & \hat{A}^{(12)} & \hat{B}^{(12)} & \hat{D}^{(12)} \\
 & \hat{C}^{(12)} & \hat{A}^{(13)} & \hat{D}^{(13)} \\
 & & \hat{E}^{(1)} & \hat{S}^{(1)} & \hat{G}^{(1)} & & & \hat{T}^{(1)} \\
 & & & \hat{H}^{(21)} & \hat{A}^{(21)} & \hat{B}^{(21)} & & \hat{D}^{(21)} \\
 & & & \hat{H}^{(22)} & \hat{C}^{(21)} & \hat{A}^{(22)} & \hat{B}^{(22)} & \hat{D}^{(22)} \\
 & & & \hat{H}^{(23)} & & \hat{C}^{(22)} & \hat{A}^{(23)} & \hat{D}^{(23)} \\
 & & & \hat{V}^{(1)} & & & \hat{E}^{(2)} & \hat{S}^{(2)} & \hat{G}^{(2)} \\
 & & & & & & & \hat{H}^{(31)} & \hat{A}^{(31)} & \hat{B}^{(31)} \\
 & & & & & & & \hat{H}^{(32)} & \hat{C}^{(31)} & \hat{A}^{(32)} & \hat{B}^{(32)} \\
 & & & & & & & \hat{H}^{(33)} & & \hat{C}^{(32)} & \hat{A}^{(33)}
\end{pmatrix}, \qquad (6)
\]

where Â stores L_A and U_A, and Ŝ stores L_S and U_S. This is the classic sparsity pattern that results from the Spike algorithm.
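Steps 1–4 above translate almost line by line into code. The sketch below is a sequential illustration for a single 2×2 partition: SciPy's lu_factor/lu_solve stand in for the L_W U_W and L_S U_S factorizations, the random W, X, Y, Z blocks are our test data, and the distributed block tridiagonal bookkeeping of Figure 3 is omitted. It also checks the factorization (2) and uses the computed pieces to solve a system.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    rng = np.random.default_rng(2)
    n, m = 8, 4                                   # sizes of W (interiors) and Z (separators)
    W = rng.random((n, n)) + n * np.eye(n)
    X = rng.random((n, m)); Y = rng.random((m, n)); Z = rng.random((m, m)) + m * np.eye(m)
    M = np.block([[W, X], [Y, Z]])
    f = rng.random(n + m)

    # Step 1: factor the interior block, W = L_W U_W
    W_fact = lu_factor(W)
    # Step 2: solve L_W (U_W Xhat) = X, i.e. compute the spikes Xhat = W^{-1} X
    Xhat = lu_solve(W_fact, X)
    # Step 3: update Z := Z - Y Xhat, the Schur complement of equation (3)
    Zhat = Z - Y @ Xhat
    # Step 4: factor the updated Z
    Z_fact = lu_factor(Zhat)

    # Check of factorization (2): M = diag(W, I) [[I, Xhat], [Y, Z]]
    Wd = np.block([[W, np.zeros((n, m))], [np.zeros((m, n)), np.eye(m)]])
    print(np.allclose(Wd @ np.block([[np.eye(n), Xhat], [Y, Z]]), M))        # True

    # Solving M u = f with the pieces computed above
    f1, f2 = f[:n], f[n:]
    g1 = lu_solve(W_fact, f1)                     # interior solve
    u2 = lu_solve(Z_fact, f2 - Y @ g1)            # reduced (separator) solve
    u1 = g1 - Xhat @ u2                           # back-substitution through the spikes
    print(np.allclose(M @ np.concatenate([u1, u2]), f))                      # True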
The factorization of the matrix is now given by (BETTER CHECK THIS!!!)

\[
\begin{pmatrix} L_W & \\ & I \end{pmatrix}
\begin{pmatrix} U_W & \\ & I \end{pmatrix}
\begin{pmatrix} I & \\ \hat{Y} & I \end{pmatrix}
\begin{pmatrix} I & \\ & \hat{L}_S \end{pmatrix}
\begin{pmatrix} I & \\ & \hat{U}_S \end{pmatrix}
\begin{pmatrix} I & \hat{X} \\ & I \end{pmatrix},
\]

where the block diagonal factors L_W and U_W hold the interior factors L_{A^{(ij)}}, Ĉ^{(ij)} and U_{A^{(ij)}}, B̂^{(ij)}; X̂ holds the spikes D̂^{(ij)} and Ĥ^{(ij)}; Ŷ holds the separator couplings Ê^{(i)} and Ĝ^{(i)}; and the factors L̂_S and Û_S of the reduced separator system hold L_S^{(i)}, V̂^{(1)} and U_S^{(i)}, T̂^{(1)}.

5 Cost comparison

We now analyze the predicted cost of the algorithms. For this, we make the following assumptions/observations:

- All blocks are b × b.
- Computing an LU factorization of a b × b matrix takes b³ flops.
- Multiplying two b × b matrices takes 2b³ flops.
- Solving a b × b triangular system with b right-hand sides takes b³ flops.
- Each flop takes γ time.
- Sending a b × b block from one processor to another requires α + b²β time.

With these assumptions, each operation in Figures 2 and 3 is annotated with its cost (ignore for now the cost estimate in parentheses), and a summary of estimated costs is given in Figure 4. Some interesting observations:
- If R ≫ p, the speedup for the LU factorizations that need to be performed approaches p: perfect speedup.
- Even if R ≫ p, the maximal speedup attained combined for the TRSMs and GEMMs is at best a factor p/3. (The combined costs approach 12Rb³γ and 13Rb³γ for Fishbone and Spike, respectively, versus 4pRb³γ for the sequential block triangular factorization, if p is fixed and R → ∞.) The reason is that the fill-in that occurs when factoring the permuted matrix represents a significant overhead.
- The Spike algorithm is less efficient than the Fishbone algorithm. This is not due to how the matrix is partitioned among processors, nor how the inherently sequential part is performed, since both of these are the same for both algorithms.
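Both observations can be reproduced from the cost model. The sketch below uses the Fishbone and sequential totals of Figure 4 as we read that table, so the exact coefficients should be taken as approximate; the limits, however, depend only on the leading terms: the LU speedup tends to p and the combined TRSM/GEMM speedup tends to p/3.

    # Totals taken from Figure 4 (coefficients as reconstructed there)
    def sequential(p, R, b):
        lu   = (p * R + p - 1) * b**3
        trsm = 2 * (p * R + p - 2) * b**3
        gemm = 2 * (p * R + p - 2) * b**3
        return lu, trsm + gemm

    def fishbone(p, R, b):
        lu   = (R + p - 1) * b**3
        trsm = (4 * R + 2 * (p - 1)) * b**3
        gemm = (8 * R + 2 * p - 4) * b**3
        return lu, trsm + gemm

    p, b = 16, 50
    for R in (10, 100, 1000):
        slu, srest = sequential(p, R, b)
        flu, frest = fishbone(p, R, b)
        print(f"R={R:5d}: LU speedup {slu/flu:5.2f}, TRSM+GEMM speedup {srest/frest:5.2f}")
    # As R grows the first ratio approaches p = 16 and the second approaches p/3.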
Step 1: Factor interior block (matrix W): for i = 1, ..., R, (*) update the diagonal block, factor it as L_A U_A, and form Ĉ and B̂.
Step 2: Solve L_W (U_W X̂) = X: a forward sweep for i = 1, ..., R followed by a backward sweep for i = R, ..., 1, producing the spikes D̂ and Ĥ.
Step 3: Update Z = Z − Y X̂: (x) update S^{(j)}; (+) form T, V, and the contribution to S^{(j+1)}.
Step 4: Factor the updated Z: (x) add the contribution from processor j − 1; (x) factor L_S U_S; (+)(x) T̂ := L_S^{-1} T; (+)(x) V̂ := V U_S^{-1}; (+) S^{(j+1)} := S^{(j+1)} − V̂ T̂; (+) send S^{(j+1)} to processor j + 1. (L_A and U_A are stored in Â.)

Figure 3: Spike Algorithm. L_A and U_A overwrite Â; L_S and U_S overwrite Ŝ. Commands annotated with (*) are not performed in iteration i = 1. Commands annotated with ($) are not performed in iteration i = R. Commands annotated with (x) are not performed by the first processor. Commands annotated with (+) are not performed by the last processor.
                      Fishbone               Spike                  Sequential
Step 1   LU fact      R b³γ                  R b³γ                  (pR + p − 1) b³γ
         TRSM         2(R − 1) b³γ           2(R − 1) b³γ           2(pR + p − 2) b³γ
         GEMM         2(R − 1) b³γ           2(R − 1) b³γ           2(pR + p − 2) b³γ
Step 2   TRSM         2R b³γ                 (3R + 1) b³γ
         GEMM         6(R − 1) b³γ           6(R − 1) b³γ
Step 3   TRSM         2 b³γ                  b³γ
         GEMM         6 b³γ                  8 b³γ
Step 4   LU fact      (p − 1) b³γ            (p − 1) b³γ
         TRSM         2(p − 1) b³γ           2(p − 1) b³γ
         GEMM         2(p − 1) b³γ           2(p − 1) b³γ
         Send         (p − 1)(α + b²β)       (p − 1)(α + b²β)
Total    LU fact      (R + p − 1) b³γ        (R + p − 1) b³γ        (pR + p − 1) b³γ
         TRSM         (4R + 2(p − 1)) b³γ    (5R + 2(p − 1)) b³γ    2(pR + p − 2) b³γ
         GEMM         (8R + 2p − 4) b³γ      (8R + 2p − 2) b³γ      2(pR + p − 2) b³γ
         Send         (p − 1)(α + b²β)       (p − 1)(α + b²β)

Figure 4: Cost analysis of the different algorithms.