Separation Constraint Partitioning - A New Algorithm for Partitioning Non-strict Programs into Sequential Threads

Klaus E. Schauser
Department of Computer Science
University of California, Santa Barbara
Santa Barbara, CA 93106
schauser@cs.ucsb.edu

David E. Culler, Seth C. Goldstein
Computer Science Division
University of California, Berkeley
Berkeley, CA 94720
{culler,sethg}@cs.berkeley.edu

Abstract

In this paper we present substantially improved thread partitioning algorithms for modern implicitly parallel languages. We present a new block partitioning algorithm, separation constraint partitioning, which is both more powerful and more flexible than previous algorithms. Our algorithm is guaranteed to derive maximal threads. We present a theoretical framework for proving the correctness of our partitioning approach, and we show how separation constraint partitioning makes interprocedural partitioning viable. We have implemented the partitioning algorithms in an Id90 compiler for workstations and parallel machines. Using this experimental platform, we quantify the effectiveness of different partitioning schemes on whole applications.

1 Introduction

Modern implicitly parallel languages, such as the functional language Id90, allow the elegant formulation of a broad class of problems while exposing substantial parallelism. However, their non-strict semantics require fine-grain dynamic scheduling and synchronization, making an efficient implementation on conventional parallel machines challenging. In compiling these languages for commodity processors, the most important step is partitioning the program into sequential threads. This paper presents a new partitioning algorithm and experimentally quantifies its effectiveness.

Many of the issues that arise in implementing non-strict languages (dynamic scheduling, synchronization, and heap management) are present, independent of the source language, when producing code for parallel machines. Thus, the techniques developed in this paper are also applicable to parallel implementations of other languages.
Moreover, dealing with non-strictness requires that these issues be faced even when compiling for sequential processors, because non-strict programs require a logical parallel execution in order to make forward progress.

The language studied in this paper, Id90 [Nik90], is a non-strict functional language with eager evaluation. This combination, termed lenient evaluation [Tra91], exhibits more parallelism than lazy evaluation while retaining much of its expressive power (see footnote 1). To further increase parallelism, Id90 provides data structures that automatically synchronize between producer and consumers: I-structures and M-structures.

To appear in the ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL'95).

When executing a lenient program on a parallel machine, dynamic scheduling may be required for two reasons. First, the semantics of the language make it impossible in general to statically determine the order of operations. The order in which operations of a function execute may depend on the dynamic context in which the function is invoked (cf. Section 2.1), not just on the value of its arguments. Second, long latency inter-processor communication and accesses to synchronizing data structures require that the computation dependent on the messages be scheduled dynamically.

Dynamic scheduling is expensive on commodity microprocessors, incurring a high cost for context switching. Therefore, these languages have been accompanied by the development of specialized computer architectures, e.g., graph reduction machines [PCSH87, Kie87], dataflow machines [ACI+83, GKW85, SYH+89, PC90], and multithreaded architectures [Jor83, NPA92]. Much research has been done in compiling lenient languages for dataflow architectures [ACI+83, Tra86, AN90, GKW85, Cul90]. As a clearer separation of language and architecture has been obtained, attention has shifted to compilation aspects of these languages for commodity processors [Tra91, SCvE91, Nik93]. The emphasis of the compilation work is to statically schedule groups of instructions into sequential threads and restrict dynamic scheduling to occur only between threads.
A thread forms the basic unit of work: once scheduled it runs until completion. The task of identifying portions of the program that can be scheduled statically and ordered into threads is called partitioning [Tra91]. Partitioning the program into sequential threads requires substantial compiler analysis (dependence analysis) because, unlike in imperative languages, the evaluation order is not specified by the programmer. Care has to be taken to generate threads which obey all dynamic data dependencies and do not lead to deadlock. Partitioning decisions imply trade-offs between parallelism, synchronization cost, and sequential efficiency [SCvE91]. However, given the limits on thread size imposed by the language model, the use of split-phase accesses, and the control paradigm, our goal simply is to maximize the length of the threads and minimize the number of thread switches.

Footnote 1: Usually non-strictness is combined with lazy evaluation (e.g., in LML and Haskell). Under lazy evaluation an expression is only evaluated if it contributes to the final result. Lazy evaluation decreases the parallelism because the evaluation of an expression is only started after it is known to contribute to the result.

1.1 Contributions

The main contribution of this paper is the development of a new thread partitioning algorithm that is substantially more powerful than any previously known [Tra91, HDGS91, SCvE91, TCS92]. A compiler for Id90 has been developed with back-ends for workstations and the CM-5 multiprocessor. It serves as an experimental platform for studying the effectiveness of the partitioning algorithms. The partitioning algorithms presented in [SCvE91] and [TCS92] are a starting point for this work and are extended in several ways. The paper:

- presents a new block partitioning algorithm, separation constraint partitioning, which is more powerful than iterated partitioning, the previously best known block partitioning,
- shows how separation constraint partitioning can be integrated successfully into the interprocedural partitioning algorithm to improve code at the call boundaries,
- outlines the theoretical framework and sketches the proof of correctness for our partitioning approach (see footnote 2),
- implements the partitioning algorithms, resulting in a running execution vehicle for Id90 on workstations and several parallel machines, and
- quantifies the effectiveness of the different partitioning schemes on whole applications.

In addition, although not documented here, we have extended interprocedural analysis to handle recursion and mutually dependent call sites [Sch94].

Section 2 formalizes the problem of thread partitioning and presents an example which shows that non-strict languages require dynamic scheduling. In Section 3 we present our new partitioning algorithm, separation constraint partitioning. Section 4 briefly discusses how separation constraint partitioning can be integrated with interprocedural partitioning. Section 5 presents experimental results which show the effectiveness of our partitioning algorithm. Finally, Section 6 contains the summary and conclusions. A short sketch of the proof of correctness for the partitioning algorithm appears in Appendix A.

1.2 Related Work

Partitioning is similar in spirit to compilation techniques for lazy functional languages [SNvGP91, Pey92].
Strictness analysis [Myc80, CPJ85] tries to determine which arguments can be evaluated before invoking the body of a function, thus avoiding the creation of expensive thunks for strict arguments. Partitioning goes a step further as it may derive that arguments can be evaluated together even if a function is not strict in them [Tra91]. Path analysis [BH87] detects the order in which arguments are evaluated, which may result in a cheaper representation of thunks and reduce the cost of forcing and updating them. Serial combinators by Hudak and Goldberg are one of the first attempts to improve the parallel execution of lazy functional programs by increasing their granularity [Gol88]. Their approach is to group several combinators into larger serial combinators.

Footnote 2: The complete proof can be found in [Sch94].

Partitioning plays a crucial role for the parallel execution of strict functional languages, but unlike non-strict languages, the ordering of instructions can be determined statically. Thus, the difficulty is not what can be put into the same thread, but rather what should be placed into the same thread given communication and load balancing constraints [SH86, NRB93].

Most of the partitioning research for lenient languages was inspired by Traub's seminal theoretical work, which uses dependence analysis to characterize when instructions can be grouped into a thread [Tra91] (see footnote 3). Traub showed that the problem of finding a partitioning with the minimum number of threads is NP-complete. Thus, all of the partitioning approaches rely on heuristics to group nodes into threads. Iannucci developed dependence set partitioning, which groups nodes that depend on the same set of inputs [Ian88]. Demand set partitioning, presented in [SCvE91] and [HDGS91], is analogous to dependence set partitioning, but it groups nodes which are demanded by the same set of outputs. Iterated partitioning combines the power of dependence and demand set partitioning by applying them iteratively [TCS92, HDGS91]. One of the algorithms is applied, then the reduced graph is formed, and the other algorithm is applied. This process is repeated until no further changes occur. Schauser et al.
extended the two basic partitioning schemes with local "merge up" and "merge down" rules, thus achieving essentially the same degree of grouping as iterated partitioning [SCvE91]. Traub et al. [TCS92] extended iterated partitioning with interprocedural analysis to obtain larger threads. Recently, [Coo94] and [Sch94] independently developed extensions to the interprocedural algorithm to handle recursive functions. Separation constraint partitioning identifies all possible "merges", allowing the thread partitioning to be guided by high level heuristics, such as minimizing the cost of procedural boundaries.

2 Block Partitioning

The partitioning algorithm produces a collection of threads. The instructions of each thread are statically scheduled, and all dynamic scheduling required by non-strictness or potentially long latency communication occurs between threads.

Definition 1 (Thread [TCS92]) A thread is a subset of the instructions of a procedure body, such that:

1. a compile-time instruction ordering can be determined for the thread which is valid for all contexts in which the containing procedure can be invoked, and
2. once the first instruction in a thread is executed, it is always possible to execute each of the remaining instructions in the compile-time order without pause, interruption, or execution of instructions from other threads.

Footnote 3: Traub's original framework allows threads to suspend; thus, his threads capture the sequential ordering which is required between instructions, but not the dynamic scheduling which may occur between the threads. Subsequent research in this area disallows thread suspension, which has the advantage of capturing the cost of switching between threads.

Our partitioning algorithms work on structured dataflow graphs [Tra86], the intermediate form used in the Id90 compiler. It is similar to intermediate representations found in other optimizing compilers. A structured dataflow graph consists of a collection of blocks (see footnote 4), one for each function and each arm of a conditional, and interfaces which describe how the blocks relate to one another [TCS92]. Each block is represented by an acyclic dataflow graph; roughly, it corresponds to a group of operators with the same control dependence [FOW87]. For example, all operators comprising the "then" arm of a conditional, excluding those in nested conditionals, are a block.

Definition 2 (Dataflow Graph) A dataflow graph is a directed acyclic graph of vertices and dependence edges, $(V, E_s, E_q)$, where $E_s \subseteq V \times V$ are the certain direct dependence edges and $E_q \subseteq V \times V$ are the certain indirect dependence edges.

The vertices describe the instructions, including arithmetic and logic operators, creation and access of data structures, and the sending and receiving of arguments and results. The edges capture certain data dependencies, which are present in every context in which the procedure can be invoked. We distinguish two kinds of certain dependencies: direct dependencies (represented in the examples by straight arcs, see Figure 2) and indirect dependencies (represented by squiggly arcs, see Figure 2). Indirect dependencies represent potentially long latency dependencies which may involve nodes of other blocks. For example, an indirect edge connects the request and response nodes for a split-phase synchronizing data structure access. Such an access may require a long time to complete due to network latency or synchronization delay, i.e., it may have to wait until another computation completes and stores the referenced value. Definition 1 implies that nodes connected by an indirect dependence must reside in different threads.

In addition, we define a potential indirect dependence (pid) as one which may exist in some but not all invocations of the block. pids may go through nodes of other blocks. This concept of potential indirect dependencies is very important.
A pid is a dependence which could exist in some legal execution of the program, where a legal execution is defined as one which does not lead to deadlock in the absence of partitioning. The key observation is that the compiler does not have to consider pids which are contradicted by certain dependencies because such dependencies would lead to deadlock. Certain dependencies provide the mechanism to reduce the set of pids. We need to be conservative and overapproximate the pids. Due to the non-strict nature of the language the compiler initially assumes that for each function any argument may depend on any of its results. Through the process of analysis some pids are ruled out. The challenge is to represent pids as precisely and as efficiently as possible. We use inlet and outlet annotations to represent pids in the dataflow graph.

Definition 3 (Annotation) An annotation for a block is a 5-tuple $A = (\Sigma_i, \mathit{Inlet}, \Sigma_o, \mathit{Outlet}, \mathit{CID})$, where $\Sigma_i$ is the inlet alphabet, $\mathit{Inlet} : V \to \mathrm{Pow}(\Sigma_i)$ maps each node to a set of inlet names (the inlet annotation), $\Sigma_o$ is the outlet alphabet, $\mathit{Outlet} : V \to \mathrm{Pow}(\Sigma_o)$ maps each node to a set of outlet names (the outlet annotation), and $\mathit{CID} \subseteq V \times V$ are the certain indirect dependencies ($\mathit{CID} = E_q$).

Footnote 4: In previous work the term basic block was used [Tra86, TCS92]. Since this term has a different meaning in the compiler literature for imperative languages we use the term block.

In the graphical representation, we attach incoming circles to nodes for inlet annotations and outgoing circles for outlet annotations. For example, in Figure 1 the inlet annotations are {a} and {b}. Nodes with outlet and inlet annotations form the endpoints of pids. A pid may travel from a node with an outlet annotation to a node with an inlet annotation. An inlet name represents a set of outlet nodes that this node certainly depends on. This set is not known at compile time, but every node which contains the same inlet name in its annotation depends on this same set of outlet nodes, although we cannot identify which outlet nodes they are.
Thus, the initial assumption of any inlet depending on any set of outlets (not contradicted by certain dependencies) is captured by giving each inlet a unique annotation. Likewise, an outlet name represents a set of inlet nodes that depend on this node. The process of assigning the same or partially overlapping inlet (or outlet) names to multiple nodes allows us to express sharing of dependencies between nodes.

As mentioned above, the pids capture the potential dependencies from outlet nodes to inlet nodes that do not lead to deadlock at runtime. We assume that a pid exists unless it is contradicted by certain dependencies. More formally, we define a pid to exist from a node $s$ to a node $r$ if there exist an $o \in \mathit{Outlet}(s)$ and an $i \in \mathit{Inlet}(r)$ such that there does not exist a path over straight and squiggly edges from a node $r'$ with $i \in \mathit{Inlet}(r')$ to a node $s'$ with $o \in \mathit{Outlet}(s')$.

The task of a partitioning algorithm is to take as input a structured dataflow graph and to partition the vertices of each block into non-overlapping regions such that each subset can be mapped into a non-suspending sequential thread. Deriving the threads is done in two steps. First, the nodes of each block are partitioned into disjoint subsets. Then, the instructions of each subset are linearized (any topological ordering will do). The partitioning algorithms presented here only derive the subsets of vertices and leave the actual ordering of instructions within each thread to a later stage of the compiler. We shall refer to each subset of vertices simply as a thread.

A correct partitioning has no circular dependencies between threads, i.e., no static cycles within blocks and dynamic cycles across blocks. Without circular dependencies, it is possible to delay the scheduling of a thread until all of its predecessors have completed and then run the thread until completion. In addition, a correct partitioning must ensure that requests and responses to split-phase operations are in different threads.

2.1 Simple Examples

We now present a simple example to illustrate the concepts just introduced. Figure 1 shows the dataflow graph for the function f which is called by the procedures g and h.
    def f u v = (u*u, v+v);
    def g z = {s,t = (f z s) in t};   % computes (z*z)+(z*z)
    def h z = {s,t = (f t z) in s};   % computes (z+z)*(z+z)

This example illustrates the need for dynamic scheduling even in the absence of conditionals. The function f takes two arguments, u and v, and returns two results, u*u and v+v. Within f there is no dependence between the multiplication and addition. Therefore, they can be scheduled in any order under traditional strict evaluation. This is not true for non-strict evaluation. The function g feeds the first result of the function f back in as the second argument, while the function h feeds the second result back in as the first argument. These two dependencies are pids. In the context of function g the multiplication must be executed before the addition, while in function h the opposite is true. Thus, the multiplication and the addition have to be scheduled independently, and it is impossible to put them together into a single non-suspensive thread.

Figure 1: Small example of a dataflow graph for the function def f u v = (u*u, v+v); and its partitioning into two threads. The arcs represent direct dependencies while the inlet and outlet annotations represent sets of potential dependencies as explained in the text. The shaded regions represent the threads.

The two pids are represented by inlet and outlet annotations. Without any interprocedural analysis to indicate otherwise, each node is given a unique singleton annotation, implying that we have to assume that each depends on (or influences) a different set of unknown nodes. In our example, the argument receive nodes are given the inlet annotations {a} and {b}, while the send nodes have the outlet annotations {x} and {y}. The names themselves are not important; the absence of sharing between the names is what is important. By our definition a potential dependence exists from the send node with outlet annotation {x} back to the receive node with inlet annotation {b} because there does not exist a certain dependence path contradicting this, i.e., from a node with b in its inlet annotation to a node with x in its outlet annotation. Likewise, there exists a pid from Send 2 to Rec 1. Functions g and h contain these pids. On the other hand, there cannot exist a pid from Send 1 back to Rec 1 because this is contradicted by a certain dependence path. Thus, the inlet and outlet annotations correctly capture the two potential dependencies which may arise at run time.
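The context dependence described for g and h can be reproduced with a small data-driven simulation. The Python sketch below is only an illustration and is not part of the Id90 compiler: the helper `fire_order`, the value names `r1` and `r2`, and the input value 3 are assumptions introduced here. It fires whichever operator of f already has its operand and records the resulting order.

```python
def fire_order(wiring, z=3):
    """Fire the two operators of f in data-driven order.

    `wiring` maps each argument of f (u or v) to the name of the value that
    feeds it: the caller's input "z" or one of f's own results ("r1", "r2").
    """
    values = {"z": z}
    # The two operators in the body of f: (name, argument, result, operation).
    pending = [
        ("mul", "u", "r1", lambda u: u * u),
        ("add", "v", "r2", lambda v: v + v),
    ]
    order = []
    while pending:
        # An operator is ready once the value wired to its argument exists.
        ready = [op for op in pending if wiring[op[1]] in values]
        if not ready:
            raise RuntimeError("deadlock: no operator can fire")
        name, arg, result, fn = ready[0]
        values[result] = fn(values[wiring[arg]])
        order.append(name)
        pending.remove(ready[0])
    return order, values


# Context g: the first result of f is fed back in as the second argument.
g_order, g_values = fire_order({"u": "z", "v": "r1"})
# Context h: the second result of f is fed back in as the first argument.
h_order, h_values = fire_order({"u": "r2", "v": "z"})
print(g_order, g_values["r2"])
print(h_order, h_values["r1"])
```

Under these assumptions the multiplication fires first in the g wiring and the addition fires first in the h wiring, so no single static order of the two operators serves both callers.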
As a result, the left and right nodes must stay in separate threads, and the partitioning algorithm can at best obtain two threads, as indicated by the shaded regions.

We can improve the partitioning using interprocedural analysis if we know that the function f is only called in the following context:

    def foo z = {s,t = (f z z) in s+t};   % computes (z*z)+(z+z)

In this case, it is valid to give both receive nodes of the def site the same inlet annotation, say {a}, as both argument send nodes at the call site depend on the same argument of the function foo. Likewise, we can give the same outlet annotation, say {x}, to both send nodes of f. Now when partitioning, the compiler can determine that there cannot exist a potential dependence from a result back to an argument, since under the new annotation there exists a certain dependence path from a receive node with the inlet name a to a send node with the outlet name x. Thus, the compiler can group all of the nodes in f into a single thread.

Dynamic scheduling may also arise when accessing synchronizing data structures. For example, assume that a function contains the following code which manipulates I-structures.

    A[k] = A[m]*A[m];
    A[l] = A[n]+A[n];

The corresponding dataflow graph is shown in Figure 2. This code fetches an element from A[m], multiplies it with itself and stores the result into location A[k]. It also fetches from location A[n], adds this element to itself and stores it into location A[l]. The declarative nature of the non-strict language does not specify the order in which these statements are executed. Actually, that order may depend on the context in which this code is executed. If k = n, the store into location A[k] defines the value which is fetched from A[n]. Therefore there exists a pid from the store to the fetch response, as indicated by the dashed line in Figure 2b. Thus, the multiplication has to be executed before the addition. If l = m the operations would execute in the reverse order (see Figure 2c). Note that these dependencies are not directly present in the function; they are established through the synchronizing I-structure.
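The effect of the index values on the execution order can be mimicked with a toy write-once array. The following Python sketch is an assumption-laden illustration, not the Id90 runtime: the class `IStructure`, the callback-style `fetch`, the helper `run`, and the concrete index values are all hypothetical. A fetch of an empty slot is deferred until the matching store arrives, which is what forces the response to be scheduled dynamically.

```python
class IStructure:
    """Write-once array; a fetch of an empty slot is deferred until the store."""

    def __init__(self, size):
        self.slots = [None] * size
        self.full = [False] * size
        self.deferred = [[] for _ in range(size)]

    def fetch(self, index, on_response):
        # Request half of a split-phase access; the response may run later.
        if self.full[index]:
            on_response(self.slots[index])
        else:
            self.deferred[index].append(on_response)

    def store(self, index, value):
        if self.full[index]:
            raise ValueError("I-structure slot written twice")
        self.slots[index] = value
        self.full[index] = True
        waiting, self.deferred[index] = self.deferred[index], []
        for on_response in waiting:
            on_response(value)  # wake up every deferred fetch


def run(k, l, m, n, preset):
    """Execute A[k] = A[m]*A[m]; A[l] = A[n]+A[n]; and log the operator order."""
    a = IStructure(4)
    for index, value in preset.items():
        a.store(index, value)
    log = []

    def mul_response(x):
        log.append("mul")
        a.store(k, x * x)

    def add_response(x):
        log.append("add")
        a.store(l, x + x)

    a.fetch(m, mul_response)  # both requests are issued up front
    a.fetch(n, add_response)
    return log, a.slots


# k = n: the store into A[k] produces the value fetched from A[n].
order_kn, slots_kn = run(k=1, l=2, m=0, n=1, preset={0: 3})
# l = m: the store into A[l] produces the value fetched from A[m].
order_lm, slots_lm = run(k=2, l=0, m=0, n=1, preset={1: 3})
print(order_kn, slots_kn)
print(order_lm, slots_lm)
```

In the second run the fetch of A[m] is issued first but is deferred, so its response runs only after the addition has stored its result.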
These potential dependencies are captured by the annotations; an inlet annotation on a fetch node represents a dependence on some store.

I-structure accesses have to be represented by split-phase operations which separate the request from the response. There are two reasons why the request and the response may not execute together. First, a fetch may get deferred should it occur before the corresponding store. Second, execution on a parallel machine may result in long communication latency if the accessed element resides on another processor. Both forms require dynamic scheduling. Thus the request and response have to reside in different threads. With split-phase accesses the processor can continue working after issuing the request, making it possible to hide the communication latency with computation that is not dependent on the requested data. The potentially long latency between the request and response is indicated by the squiggly edges in the dataflow graph, which represent certain indirect dependencies.

Figure 2: Simple example of a dataflow graph with I-structures for the code A[k] = A[m]*A[m]; A[l] = A[n]+A[n];. The shaded regions show the four threads. Since a fetch of an I-structure element may defer, it cannot be placed into the same thread as the response. The threads cannot be grouped into a single thread because there may exist potential indirect dependencies which require dynamic scheduling. These pid edges are indicated by the dashed arcs in Part (b) for k = n and Part (c) for l = m.

These examples illustrate that potential dependencies cannot be known at compile time. They can travel through arguments, results, internal call sites, and through I-structure accesses.

2.2 Limits of Iterated Partitioning

The previously best known block partitioning scheme, iterated partitioning [TCS92], is not powerful enough to always find maximal threads. A slightly revised version of the first example, shown in Figure 3, proves that separation constraint partitioning is strictly more powerful than iterated partitioning, which fails to find maximal threads.

Figure 3: Example where iterated partitioning fails to merge two threads.

This example consists of six nodes. Iterated partitioning forms two threads. The inlet/outlet annotations are not unique singleton sets, but instead reflect dependence sharing which could be the result of interprocedural analysis. The dependence set of the three left nodes is {a}, and their demand set is {x}. Iterated partitioning will group them all into a single thread. Likewise, the three right nodes are grouped into a single thread because their dependence set is {a, b}, and their demand set is {x, y}. Iterated partitioning cannot merge the left and the right nodes since their dependence and demand sets are different. However, they can safely be merged for the following reasons. The dependence sets represent the set of (unknown) outlets a node depends on. The dependence set for the left nodes is a subset of that for the right nodes. Since the right nodes depend on a larger set of outlet nodes, they cannot influence any of the left nodes, and thus there cannot exist a pid from the right to the left nodes. The same argument in reverse holds for the demand sets. It is therefore possible to merge the two threads into one. Merging these threads requires a more powerful partitioning rule which is not based solely on equal dependence or demand sets. This observation is formalized by separation constraint partitioning.

3 Separation Constraint Partitioning

Separation constraint partitioning can, with respect to an annotation, precisely determine for any two nodes whether they can be merged or not.
The rule is simple: two nodes of a block cannot be merged (i.e., they must reside in different threads) if there exists either a certain indirect dependence (cid) or a potential indirect dependence (pid) between them. The reason is that both forms of indirect dependencies may require dynamic scheduling.

Given this separation rule, we can easily devise an effective partitioning algorithm. Starting with the unpartitioned dataflow graph, we find two nodes without a separation constraint, merge them, form the reduced graph, and repeat this process until no further nodes can be merged. Although this method is more powerful and elegant than the previous partitioning algorithms, unfortunately it is computationally more expensive. As discussed below, this problem can be alleviated by only running it on a subset of the graph.

Separation constraint partitioning has four advantages. First, it is guaranteed to derive maximal (but not necessarily optimal) threads. After it has finished, every pair of threads has a separation constraint between them, and therefore it is impossible to merge further. Second, it deals in a unified way with the partitioning constraints introduced by certain and potential indirect dependencies, and therefore does not require subpartitioning (as do dependence and demand set partitioning [TCS92]). Third, the algorithm can be combined with heuristics that attempt to merge the nodes in an order which minimizes communication, dynamic scheduling, and synchronization overhead. Finally, it can also be naturally integrated into the interprocedural partitioning algorithm.
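The merge loop just described, together with the constraint computation and update rules that Algorithm 1 in Section 3.1 spells out, can be sketched in Python. This is a minimal illustration under stated assumptions rather than the compiler's implementation: the function names `closure` and `partition`, the dictionary encoding of annotations, the node names, and the annotation names a, b, x, y are all chosen here for the example. The loop also rescans node pairs after every merge instead of testing each pair once, which is simpler but slower.

```python
from itertools import product


def closure(nodes, edges):
    """Step 1: reflexive, transitive closure of the successor relation."""
    succ = {(n, n) for n in nodes} | set(edges)
    for k, i, j in product(nodes, repeat=3):  # k varies slowest (Warshall)
        if (i, k) in succ and (k, j) in succ:
            succ.add((i, j))
    return succ


def partition(nodes, direct, cid, inlet, outlet):
    """Return the threads found by separation constraint partitioning."""
    nodes, cid = list(nodes), set(cid)
    succ = closure(nodes, set(direct) | cid)
    # Step 2: an (inlet, outlet) name pair is contradicted by a certain path.
    contradicted = {(i, o) for r, s in succ
                    for i in inlet.get(r, ()) for o in outlet.get(s, ())}
    pid = {(s, r) for s, r in product(nodes, repeat=2)
           if any((i, o) not in contradicted
                  for i in inlet.get(r, ()) for o in outlet.get(s, ()))}
    # Step 3: close the indirect dependencies under Succ*.
    ident = {(u, v) for s, r in pid | cid for u, v in product(nodes, repeat=2)
             if (u, s) in succ and (r, v) in succ}
    threads = {n: {n} for n in nodes}

    def free_pair():
        # Step 4: any pair without a separation constraint in either direction.
        for u, v in product(nodes, repeat=2):
            if u != v and (u, v) not in ident and (v, u) not in ident:
                return u, v
        return None

    while (pair := free_pair()) is not None:
        u, v = pair  # v represents the merged node, u is discarded
        rest = [n for n in nodes if n != u]

        def through(first, second):
            # Pairs (p, s) that are linked across the merged node {u, v}.
            return {(p, s) for p, s in product(rest, repeat=2)
                    if ((p, u) in first and (v, s) in second)
                    or ((p, v) in first and (u, s) in second)}

        new_succ = {e for e in succ if u not in e} | through(succ, succ)
        ident = ({e for e in ident if u not in e}
                 | through(ident, succ) | through(succ, ident))
        succ, nodes = new_succ, rest
        threads[v] |= threads.pop(u)
    return sorted(sorted(t) for t in threads.values())


names = ["Rec1", "Mul", "Send1", "Rec2", "Add", "Send2"]
chains = [("Rec1", "Mul"), ("Mul", "Send1"), ("Rec2", "Add"), ("Add", "Send2")]
# Unique singleton annotations, as in the first example.
unique = partition(names, chains, [], {"Rec1": {"a"}, "Rec2": {"b"}},
                   {"Send1": {"x"}, "Send2": {"y"}})
# Shared annotation names (assumed here), in the spirit of the revised example.
shared = partition(names, chains, [], {"Rec1": {"a"}, "Rec2": {"a", "b"}},
                   {"Send1": {"x"}, "Send2": {"x", "y"}})
print(unique)
print(shared)
```

With unique singleton annotations the sketch keeps the two chains apart, and with the shared names it merges all six nodes into one thread.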

3.1 The Algorithm

The most complicated aspects of the algorithm are the initial computation of the separation constraints and their update when two nodes are merged. Separation constraints arise from cids, which connect send nodes to receive nodes, and pids, derived from the annotations for the block. We say that any two nodes that are connected through a pid or cid have an indirect dependence and cannot be merged. Deriving the cids is easy, as they are directly represented in the graph. The challenge is to efficiently determine the pids, which the compiler does not know and has to approximate safely.

Algorithm 1 (Separation constraint partitioning) Given a dataflow graph with inlet/outlet annotations:

1. Compute the reflexive, transitive closure $\mathit{Succ}^*$ of the successor relation $\mathit{Succ}$ over $E_s \cup \mathit{CID}$.

2. Compute the set of potential indirect dependence edges, i.e., those edges from outlets to inlets which are not contradicted by certain dependencies.

   $$\mathit{PID} = \{(s, r) \mid \exists i, o : i \in \mathit{Inlet}(r),\ o \in \mathit{Outlet}(s),\ \neg\exists (r', s') \in \mathit{Succ}^* : i \in \mathit{Inlet}(r'),\ o \in \mathit{Outlet}(s')\}$$

3. Combining $\mathit{PID}$ and $\mathit{CID}$, compute the set of nodes with an indirect dependence between them.

   $$\mathit{ID} = \{(u, v) \mid \exists (s, r) \in \mathit{PID} \cup \mathit{CID} : (u, s) \in \mathit{Succ}^*,\ (r, v) \in \mathit{Succ}^*\}$$

4. Find two nodes $u, v$ without an indirect dependence between them, i.e., for which $(u, v) \notin \mathit{ID}$ and $(v, u) \notin \mathit{ID}$. Merge $u, v$ into a single thread, and update the representation.

   (a) Derive the new set of nodes, use $v$ as representative for the two merged nodes and discard $u$.

   $$V_{new} = V \setminus \{u\}$$

   (b) Compute the new reflexive transitive closure.

   $$\begin{aligned}
   \mathit{Succ}^*_{new} ={} & \{(p, s) \mid (p, s) \in \mathit{Succ}^*,\ p \neq u,\ s \neq u\} \\
   \cup\ & \{(p, s) \mid (p, u) \in \mathit{Succ}^*,\ (v, s) \in \mathit{Succ}^*,\ p \neq u,\ s \neq u\} \\
   \cup\ & \{(p, s) \mid (p, v) \in \mathit{Succ}^*,\ (u, s) \in \mathit{Succ}^*,\ p \neq u,\ s \neq u\}
   \end{aligned}$$

   (c) Compute the new set of indirect dependencies.

   $$\begin{aligned}
   \mathit{ID}_{new} ={} & \{(p, s) \mid (p, s) \in \mathit{ID},\ p \neq u,\ s \neq u\} \\
   \cup\ & \{(p, s) \mid (p, u) \in \mathit{ID},\ (v, s) \in \mathit{Succ}^*,\ p \neq u,\ s \neq u\} \\
   \cup\ & \{(p, s) \mid (p, v) \in \mathit{ID},\ (u, s) \in \mathit{Succ}^*,\ p \neq u,\ s \neq u\} \\
   \cup\ & \{(p, s) \mid (p, u) \in \mathit{Succ}^*,\ (v, s) \in \mathit{ID},\ p \neq u,\ s \neq u\} \\
   \cup\ & \{(p, s) \mid (p, v) \in \mathit{Succ}^*,\ (u, s) \in \mathit{ID},\ p \neq u,\ s \neq u\}
   \end{aligned}$$

   (d) Set $V = V_{new}$, $\mathit{Succ}^* = \mathit{Succ}^*_{new}$, and $\mathit{ID} = \mathit{ID}_{new}$.

5. Repeat from Step 4 until no more nodes can be merged.

Observe that existing separation constraints never disappear. Merging two nodes can only introduce new constraints. Thus every pair of nodes has to be tested at most once for merging. After merging two nodes, the transitive closure and the indirect dependencies are updated. Furthermore, the new $\mathit{ID}$ can be computed from the old $\mathit{ID}$ and $\mathit{Succ}^*$.

We apply this algorithm to the example in Figure 1. Following the rule in Step 2, we derive that there exists a pid from the left send to the right receive and from the right send to the left receive. (Send 1, Rec 2) $\in \mathit{PID}$ exists because $b \in \mathit{Inlet}(\text{Rec 2})$ and $x \in \mathit{Outlet}(\text{Send 1})$, and there is no path from a node with $b$ in its inlet set to a node with $x$ in its outlet set to contradict this. A similar argument can be made for the other pid. As a result there exists a separation constraint from any of the left nodes to any of the right nodes, and the left nodes cannot be merged with the right nodes, as observed earlier.

Now we apply this algorithm to the example in Figure 3. Following the rule in Step 2, we derive that there are no pids because they are all contradicted by certain dependencies: for every inlet/outlet name pair there exists a path from a node with the inlet name in its inlet set to a node with the outlet name in its outlet set. Therefore, $\mathit{PID} = \emptyset$. Since $\mathit{CID} = \emptyset$ we ascertain that $\mathit{ID} = \emptyset$. Thus there are no separation constraints and any two nodes can be merged. Separation constraint partitioning will, as expected, end with a single thread.

3.2 Merge Order Heuristics

The algorithm as presented so far does not specify the order in which pairs of nodes are visited and tested for merging. This flexibility is an important advantage, as it permits the algorithm to be combined with a heuristic that visits the nodes in an order that minimizes communication, dynamic scheduling, and synchronization overhead. All three operations are expensive on commodity processors. We decided to address communication first, since on most parallel machines communication has the highest overhead.
Our heuristic is first to try merging nodes belonging to the same function call boundary (which reduces communication), then nodes at conditional boundaries (which reduces dynamic scheduling), and finally the remaining interior nodes of the block.

After interprocedural analysis (explained in Section 4), the annotations for a block may have been refined and the block can be repartitioned. Repeatedly repartitioning using iterated partitioning is very expensive. However, using separation constraint partitioning, we can perform the interprocedural analysis on a restricted graph consisting of the nodes at def and call site boundaries and their connectivity and then partition the interior nodes after the annotations have been completely refined. Extracting this restricted graph from the original program is fairly simple and involves only computing the transitive closure of each block's dataflow graph. The saving is enormous: for our benchmark programs the graph sizes are reduced by a factor of 10 to 20; the largest block was reduced from 619 nodes to 40 nodes. Running separation constraint partitioning on the restricted graph is very fast, making interprocedural partitioning viable.

Finally, after obtaining the best possible partitioning at the function call boundaries, we partition the interior of blocks. Our approach is to run separation constraint partitioning only on a subset of the nodes of the block (the most critical nodes) and for the rest of the block use iterated partitioning, which in practice runs faster.

3.3 Complexity

To compute the complexity of the above algorithm we assume that $\mathit{ID}$ and $\mathit{Succ}^*$ are represented by an adjacency matrix. Assume that the problem size $n$ is the maximum over the number of edges, number of inlet names, and number of outlet names. Since the dataflow graph is acyclic, initially computing the transitive closure is $O(n^2)$. Determining the $\mathit{PID}$ edges in Step 2 is $O(n^3)$. Computing $\mathit{ID}$ is $O(n^2)$. Testing whether two nodes can be merged takes only constant time, since $\mathit{ID}$ is represented as a matrix. Since merging never eliminates separation constraints, at most $O(n^2)$ pairs of nodes have to be tested, thus this part of Step 4 is $O(n^2)$. Merging occurs at most $O(n)$ times, and the complexity of Steps 4(a)-(d) is $O(n^2)$. Overall, the total complexity of the algorithm is $O(n^3)$. In practice the running time is too long for large blocks.

For iterated partitioning the worst case complexity is also $O(n^3)$, since the complexity of dependence and demand set partitioning is $O(n^2)$. However, experimental data indicate that in practice iterated partitioning requires only a small number of iterations to find the final solution. Two cycles (i.e., four partitioning steps) were sufficient for partitioning the blocks of the set of Id90 programs we used for the experimental results section. On the other hand, it is possible to construct examples which require an arbitrary number of iterations (see [Sch94] for details).

3.4 Correctness

Proving correctness of separation constraint partitioning is much harder than for dependence and demand set partitioning, which are quite intuitive. The appendix contains a short discussion of the correctness proof. There are two key aspects to this proof. First, we show that the algorithm correctly updates the set of indirect dependencies $\mathit{ID}$ throughout the execution of the program every time two nodes $u, v$ are merged. This implies that certain and potential indirect dependencies are correctly taken care of. Second, we prove that when the algorithm terminates, all partitions are convex, i.e., there do not exist any static cycles from a thread back to itself. This may not be the case at intermediate steps of the algorithm. Thus separation constraint partitioning is quite different from iterated partitioning. There the partitioning is correct after every step and we could choose to stop at any time if so desired.
Separation constraint partitioning, on the other hand, has to run until termination.

4 Interprocedural Partitioning

The block partitioning algorithm presented so far is limited in its ability to derive threads because without global analysis it must assume that every send in a block may potentially feed back to any receive unless contradicted by certain dependencies. This is captured by the singleton inlet and outlet annotations given initially to send and receive nodes. Global analysis can determine that some of these potential dependencies cannot arise and thereby improve the partitioning [TCS92].

For example, the information gained while partitioning a procedure can be used to improve the inlet and outlet annotations of its call sites. These refined annotations may share names, reflecting the sharing among dependence and demand sets present in the procedure. In addition, squiggly edges from argument send nodes to result receive nodes can be introduced if the procedure has the corresponding paths from the argument receives to result sends. Both the refined annotations and the squiggly edges help to better approximate the pids and thereby improve subsequent partitioning.

The same optimizations are possible in the reverse direction. The annotations of the def site of a procedure can be improved with the information present at its call site. Dependence and demand sets at the call site determine the new sharing in inlet and outlet annotations at the def site. Squiggly edges can be introduced from result send nodes back to argument receive nodes, if the corresponding paths from result receive nodes to argument send nodes exist in the call site. This optimization is more complicated if a procedure has more than one call site, in which case the new annotations and squiggly edges must be compatible with all of the call sites.

Conditionals are handled similarly to function calls. A conditional with two arms can be viewed as a function call, where, depending on the result of the predicate, one of two blocks is called [AA89]. This representation simplifies the partitioning process, as we can use the same unified mechanism to deal with function calls and conditionals.
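
The callee-to-caller direction of the propagation described in this section can be illustrated with a minimal sketch. It is not the formal algorithm of [TCS92]: it only covers the introduction of squiggly edges, assumes a single call site, and ignores annotation refinement. The graph encoding (a dictionary of successor lists), the function names, and the interior node names n1 and n2 are hypothetical and chosen for illustration only.

```python
def reachable(succ, start):
    """Return all nodes reachable from `start` along dependence edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in succ.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def call_site_squiggly_edges(callee_succ, arg_receives, result_sends,
                             arg_sends, result_receives):
    """Sketch: add a squiggly edge AS_i -> RR_j at the call site whenever
    the callee has a path from argument receive AR_i to result send RS_j.
    The i-th entries of arg_receives/arg_sends (and the j-th entries of
    result_sends/result_receives) are assumed to correspond."""
    edges = set()
    for ar, a_send in zip(arg_receives, arg_sends):
        reach = reachable(callee_succ, ar)
        for rs, r_recv in zip(result_sends, result_receives):
            if rs in reach:
                edges.add((a_send, r_recv))
    return edges


# Hypothetical callee: AR1 feeds both results, AR2 feeds only RS2.
callee = {"AR1": ["n1"], "AR2": ["n2"], "n1": ["RS1", "n2"], "n2": ["RS2"]}
edges = call_site_squiggly_edges(callee, ["AR1", "AR2"], ["RS1", "RS2"],
                                 ["AS1", "AS2"], ["RR1", "RR2"])
print(sorted(edges))
```

In this hypothetical callee no path leads from AR2 to RS1, so no squiggly edge is added from AS2 to RR1, and the corresponding potential dependence still has to be assumed at the call site.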
When the analysis is applied to function calls it allows us to reduce communication; when applied to conditionals it allows us to reduce control flow overhead.

4.1 Interprocedural Partitioning Example

We will not present the formal interprocedural partitioning algorithm here as it already has been presented in [TCS92]. An extended version which can deal with recursive functions can be found in [Sch94]. However, we discuss a small example to help illustrate it.

The example shown in Figure 4 consists of two blocks, a caller and a callee. The left part of the figure shows the dataflow graph for the caller, the function g, while the right part shows the dataflow graph for the callee, the function f. Both procedures receive two arguments and return two results. The procedure g contains a call site of the procedure f, as indicated by the interior dashed rectangle, the two argument send nodes (AS1 and AS2), and result receive nodes (RR1 and RR2). The corresponding def site of the procedure f consists of the two argument receive nodes (AR1 and AR2) and two result send nodes (RS1 and RS2).

As shown in Part (a) of the figure, the algorithm starts by initially giving all receive and send nodes a unique singleton inlet or outlet annotation. As shown by the shaded regions in Part (b), partitioning the caller results in four threads, while partitioning the callee results in two threads. This is the best partitioning possible under the trivial annotation. The top four nodes of the caller cannot be placed into a single thread because the partitioning algorithm has to assume that a pid may exist from the node with the outlet annotation {u} back to the node with the inlet annotation {b}. Analogous arguments can be made for why the other threads have to stay separate.

To improve the partitioning, we must apply interprocedural analysis which propagates information across blocks. Propagation involves introducing squiggly edges and refining inlet and outlet annotations. Let us first explore what happens when propagating from the caller to the callee. In this case, no squiggly edge is introduced, since the caller does not have a certain dependence path from a result receive node to an argument send node.
The new inlet annotations given to the argument receive nodes at the def site reflect the dependence sets of the argument send nodes at the call site. As shown in Part (c) of the figure, the node AR1 gets the new inlet annotation {a}, while the node AR2 gets the inlet annotation {a, b}. Likewise, the new outlet annotations given to the result send nodes reflect the demand set of the corresponding result receive nodes, which are {w} and {w, x} respectively. The new annotations correspond precisely to the situation shown in Figure 3. Using separation constraint partitioning, we can group all nodes of the callee into a single thread, as indicated by the shaded region in Part (d) of the figure.

Next we propagate annotations from the callee to the caller. This time we can introduce four squiggly edges at the call site, one from every argument send node to every result receive node, since the corresponding certain dependence paths are present in the callee now that it consists of a single thread. These squiggly edges capture all of the dependencies which can arise at this call site. Therefore, we can give the argument send and result receive nodes at the call site empty inlet and outlet annotations, as shown in Part (e) of the figure. Applying separation constraint partitioning, the two top threads in the caller can be merged into a single thread, as shown in Part (f) of the figure. Likewise, the bottom two threads can be grouped into a single thread. Because the top and the bottom threads are connected by squiggly edges, they have to remain separate. Thus, partitioning the caller results in two threads, the best partitioning that can be obtained for this example. Further reannotation and partitioning does not improve this. Note that the resulting threads are the same as in a strict sequential program.

[Figure 4: Example of interprocedural partitioning with annotation propagation. Each part shows the caller (left) and the callee (right).
(a) Initial Annotation: Every receive is annotated with a unique singleton inlet name, and every send with a unique singleton outlet name.
(b) Initial Partitioning: Partitioning of the caller results in four threads, while the callee gets two threads.
(c) Reannotation of Callee: Annotation propagation from the caller to the callee results in the new inlet and outlet annotations.
(d) Partitioning of Callee: With the new annotation separation constraint partitioning can obtain a single thread.
(e) Reannotation of Caller: Annotation propagation from the callee to the caller introduces four squiggly edges and eliminates the annotations.
(f) Partitioning of Caller: Partitioning now obtains two threads. Further reannotation and partitioning does not improve this.]

5 Experimental Results

In this section we evaluate our partitioning scheme in the context of the Berkeley Id90 compiler. Using various metrics we show how separation constraint partitioning combined with interprocedural analysis approaches the efficiency of an oracular "strict partitioner."

5.1 Methodology

The Berkeley Id90 compiler uses a front-end developed at MIT [Tra86], which produces structured dataflow graphs for the partitioning algorithms presented here. The partitioned graphs are used to generate code for TAM, a threaded abstract machine [CGSvE93]. The TAM code is then translated to the target machine.
Our translation path uses C as a portable "intermediate form" and is producing code for the CM-5, as well as for various standard sequential machines [Gol94]. We used this implementation for statistics collection and measurements.

All of the programs are compiled for parallel execution. As they run, lots of parallelism is exposed. However, in order to factor out a broad family of issues unrelated to partitioning, such as load balancing and locality, we present data here from runs on a single processor. See [CGSvE93, SGS+93] for data and discussion on running these programs on parallel machines.

We use six benchmark programs, shown in Table 1, ranging up to 1,100 source code lines. It should be noted that the code was taken as is, compiled for TAM, and executed on standard workstations or the CM-5 without any modifications. The programs range from very fine grained (e.g., Quicksort) to medium grained (e.g., MMT).

5.2 Evaluation

To measure the effectiveness of partitioning we compare four different partitioning schemes: dataflow partitioning (DF), iterated partitioning (IT), separation constraint partitioning with interprocedural analysis (IN), and strict partitioning (ST). Dataflow partitioning and strict partitioning represent the two extremes of the spectrum. Dataflow partitioning puts unary nodes into the thread of their predecessor, reflecting the limited thread capabilities supported by many dataflow machines. Strict partitioning ignores possible non-strictness and compiles function calls and conditionals strictly, thus representing the best possible interprocedural partitioning algorithm. Although it is not the case for our six benchmark programs, strict partitioning produces an incorrect partitioning for programs which require non-strictness. Iterated and interprocedural partitioning represent the two real partitioning schemes. With iterated partitioning every block is partitioned in isolation.
Separation constraint partitioning with interprocedural analysis applies the techniques discussed in this paper: the interprocedural analysis uses separation constraint partitioning to first group nodes at def and call site boundaries, after which interior nodes are merged using iterated partitioning.

Figure 5 shows the dynamic TAM instruction distribution for the benchmark programs under the four partitioning schemes, each normalized to dataflow partitioning. Since the cost for each TAM instruction differs, this figure does not necessarily reflect execution time, which is presented later. Instructions are classified into one of four categories: ALU operations, heap accesses, communication, and control operations. The programs toward the left of the figure exhibit very fine-grain parallelism and are control intensive. The moderate blocking (4x4) and regular structure of MMT shows a significant contrast.

As expected, improved partitioning substantially reduces the number of control operations. For most programs, iterated partitioning reduces the number of control operations by more than a factor of 2. For Simple and MMT the reduction is much larger (see footnote 5). Interprocedural partitioning further reduces the control operations for the more finely grained programs, while for the coarse grained programs the improvement is insignificant. Interprocedural and strict partitioning also decrease the number of instructions related to communication, as the grouping of arguments and results reduces the number of messages. This effect is particularly important since communication operations are more than ten times as expensive as any other.

In order to see the effectiveness of separation constraint partitioning combined with interprocedural analysis we look at how boundary nodes are grouped into threads. In the code generation to TAM, passing of arguments and results for function invocation requires send instructions. Similarly, the implementation of conditionals is based on switches, which, depending on the result of the predicate, steer the control to one of two successor threads. One distinguishing feature about partitioning across blocks is that it may group nodes at block boundaries.
For example, multiple send nodes residing in the same thread can be grouped into a single send node if the corresponding receive nodes also reside in a single thread. A similar optimization also occurs at boundaries of conditionals. Here multiple switch operations can be replaced by a single switch.

Footnote 5: Just as important as the decrease of the number of control operations is the fact that they also become simpler. For example, forks to a synchronizing thread often turn into forks to unsynchronizing threads.
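
The grouping of send nodes at block boundaries described above can be sketched as a simple counting exercise. This is an illustration under stated assumptions, not the code generator of the Berkeley Id90 compiler: the function name, the pair encoding of send/receive nodes, and the thread identifiers (t1, t2, c1, c2) are all hypothetical.

```python
from collections import defaultdict


def count_sends(pairs, thread_of):
    """Count the send operations needed when sends whose source nodes share
    a thread AND whose receive nodes share a thread are grouped into one.

    pairs:     list of (send_node, receive_node)
    thread_of: mapping from node name to the thread it was placed in
    """
    groups = defaultdict(list)
    for send, recv in pairs:
        # One message per (sending thread, receiving thread) combination.
        groups[(thread_of[send], thread_of[recv])].append((send, recv))
    return len(groups)


pairs = [("AS1", "AR1"), ("AS2", "AR2")]

# Arguments sent from two threads into two threads: no grouping possible.
separate = {"AS1": "t1", "AS2": "t2", "AR1": "c1", "AR2": "c2"}

# After interprocedural partitioning both sends and both receives share a thread.
merged = {"AS1": "t1", "AS2": "t1", "AR1": "c1", "AR2": "c1"}

print(count_sends(pairs, separate), count_sends(pairs, merged))
```

Note that grouping requires both conditions: if the two sends share a thread but their receives do not, the sketch still counts two messages, matching the condition stated in the text.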