A Class of Interconnection Networks for Multicasting

Yuanyuan Yang
Department of Computer Science and Electrical Engineering
University of Vermont, Burlington, VT 05405
yang@cs.uvm.edu

Abstract

Multicast, or one-to-many, communications arise frequently in parallel computing applications and in other communication environments. Multicast networks can simultaneously support multiple multicast connections between the network inputs and network outputs. However, due to the much more complex communication patterns and routing control in multicast networks, there is still a considerably large gap in network cost between even the currently best known multicast networks and permutation networks. In this paper, we present a class of interconnection networks which can support a substantial number of well-defined multiple multicast connections in a nonblocking fashion and yet have a low cost comparable to permutation networks. We also provide an efficient routing algorithm for satisfying multicast connection requests in such networks. Moreover, the multicast connecting capability of the networks is represented as a function of fundamental network structural parameters so that the trade-off between the network multicast capability and the network cost can be determined. This enables system designers to choose the multicast network that fits their particular application needs. By utilizing a network with well-defined multicast capability in a parallel computing system, software or algorithm designers of the system will be able to make full use of the multicast capability provided by the network, and substantial improvements in the performance of the system can be achieved due to significantly shortened delays in data transfer and simplified synchronization mechanisms for shared data.

1 Introduction

Multicast or one-to-many connecting capability is in high demand in parallel applications and in other communication environments. Many parallel applications require that a processor in a parallel computer send data or messages to some subset of the other processors to complete a common task.
One example is the commonly used parallel algorithm for the Fast Fourier Transform (FFT) [1]. This algorithm needs to frequently transpose the data matrix and requires one-to-many communication of a large amount of data among processors. Other examples include barrier synchronization [2] and write update/invalidate in directory-based cache coherence protocols [3]. Also, teleconferencing and video broadcasting are typical applications in a telecommunication environment.

There has been growing interest in supporting multicast in parallel computers. Multicast can be supported in either hardware or software. For example, the nCUBE-2 [4] supports broadcast and a form of restricted multicast in hardware, but since its interconnection network is a direct network, in which each node has a dedicated link to each of its neighbor nodes, the routing algorithm adopted may cause a deadlock when two messages are sent at the same time. There has also been much work on supporting multicast in wormhole-routed direct networks in software [5, 6]. The basic approach is to send a message along a subset of nodes on the multicast tree. This approach needs at least $\log N$ steps to send a message to $N$ destinations. On the other hand, among the parallel computers using indirect networks or multistage networks, both the IBM GF11 [7, 8] and the NEC Cenju-3 [9] support a form of restricted multicast in hardware. In the IBM GF11 a multicast may need two passes through the network, and in the NEC Cenju-3 only a single multicast is supported. Moreover, there has been some work on supporting multicast in software in multistage networks [10, 11]. Since multistage networks can easily have deadlock-free routing and equal communication latency between any network inputs and outputs, they are receiving more and more attention for the interconnection needs of parallel computers [12] and ATM switch architectures in broadband networks [13]. Also, since multicast is a fundamental communication pattern in parallel computers, efficient hardware support for it becomes increasingly important [14].

(This research is supported in part by the National Science Foundation under Grant No. OSR-9350540 and MIP-9522532.)
In this paper, we will be mainly concerned with providing efficient hardware support for multiple multicast communications in multistage networks. A multicast connection in a multistage network can connect a network input port simultaneously to more than one network output port. In the following, we refer to a maximal set of multicast connections between the inputs and outputs of a multistage network as a multicast assignment. A multicast network is a network which can realize all possible multicast assignments. Multicast networks have been extensively studied and there has been much progress in this area [17]-[28]. However, the perceived high network cost and complex routing control of multicast networks might still discourage system designers from seriously considering them for practical parallel computing systems and other communication systems. In fact, due to the much more complex connection patterns in multicast networks, there is still a considerably large gap in network cost between even the currently best known multicast networks and permutation networks.

Meanwhile, many real applications may not need full multicast capability. For example, every processor in a parallel computing system may need multicast capability to simultaneously send data or control information to a group of other processors from time to time, but at any given time, only a small portion of the processors may need to perform multicasting, or each processor may need to multicast to only a limited number of other processors. Although permutation networks with multicast switches may realize some multicast connection patterns, they in general cannot satisfy the needs of such applications. This is because permutation networks are designed for realizing only one-to-one connections. In general, there may not be a clear definition of what type of multicast connection patterns a permutation network can realize. Also, the number of multicast connection patterns realized, if any, is usually very limited. This drawback of permutation networks may prevent software and algorithm designers of parallel computing systems from efficiently utilizing multicast capability, since there is no simple rule for them to judge whether a given multicast connection can be routed in a single pass through a network, and thus no guarantee on the time to complete a multicast connection in the network. In fact, in such a network, a multicast connection may need several passes through the network, depending on the network load and/or state at the time of multicasting, and in the worst case, a multicast connection may have to be performed sequentially.

As discussed above, full multicast networks in general are still too expensive for practical multicast applications, and permutation networks in general cannot support multicast efficiently. Hence, we are motivated to consider compromise network designs for practical multicast applications. In this paper, we will design a class of practical interconnection networks which can realize a substantial number of well-defined multiple multicast connections and yet have a low cost comparable to permutation networks. We will refer to such networks as restricted multicast networks. We will also provide an efficient routing algorithm for satisfying multicast connection requests in such networks.
The proposed networks will enable the software or algorithm designer of a parallel computing system to make full use of the multicast capability provided by the network. By utilizing such a network with well-defined multicast capability, substantial improvements in the performance of a parallel computing system can be achieved due to significantly shortened delays in data transfer and simplified synchronization mechanisms for shared data.

The rest of this paper is organized as follows. Section 2 describes the network structure to be considered. Section 3 gives the necessary definitions and notations for restricted multicast networks. Section 4 reviews the previous results related to this type of network for both permutation and multicast. Section 5 presents the main results of the paper, the nonblocking conditions for the proposed restricted multicast networks. The routing algorithm is described in Section 6. Finally, Section 7 concludes the paper.

2 The Network Structure

In this section, we provide a brief description of the network we will consider. The network structure to be considered is a class of networks based on Clos networks [15]. This type of network belongs to the so-called constant stage networks or limited stage networks. The latency of a network is proportional to the number of stages in the network. Therefore, a constant stage network can guarantee a short constant latency regardless of the number of processor or memory modules in a parallel computing system, whereas most other networks (i.e. so-called growing stage networks) [24]-[28] require at least $\log N$ stages for an $N \times N$ network, which represents the minimum network latency that this type of network can offer. This feature of constant stage networks is attractive for large scale highly parallel computing systems where communication delay is critical.

This type of network was first proposed by Clos [15]. The network has adjustable network parameters and can provide different types of connecting capability by choosing different values of the parameters. The general Clos network can have any odd number of stages and is built in a recursive fashion from smaller size networks. Therefore, it is in general sufficient to consider only the three-stage network. A three-stage Clos network with $N_1$ input ports and $N_2$ output ports (i.e. an $N_1 \times N_2$ network) has $r_1$ switch modules of size $n_1 \times m$ in stage 1, $m$ switch modules of size $r_1 \times r_2$ in stage 2, and $r_2$ switch modules of size $m \times n_2$ in stage 3. The network has exactly one link between every two switch modules in its consecutive stages. Such a three-stage network is denoted as a $v(m, n_1, r_1, n_2, r_2)$ network. In three-stage networks, stage 1 is also referred to as the input stage, stage 2 as the middle stage, and stage 3 as the output stage. A general schematic of a $v(m, n_1, r_1, n_2, r_2)$ network is shown in Figure 1. For the special symmetrical case where $n_1 = n_2 = n$ and $r_1 = r_2 = r$, the three-stage network is denoted as a $v(m, n, r)$ network. In the following, we will mainly discuss the symmetrical $v(m, n, r)$ networks, but the generalization to asymmetrical $v(m, n_1, r_1, n_2, r_2)$ networks is straightforward.

In general, the network cost of such a multistage network is measured by the number of crosspoints in the network. An $a \times b$ switch module is assumed to have $ab$ crosspoints. From the network structure described above, it is easy to see that the total number of crosspoints of a $v(m, n, r)$ network equals

$$nmr + r^2 m + mnr = m(2nr + r^2) = m(2N + r^2),$$

where $N = nr$. In other words, the network cost of a $v(m, n, r)$ network is proportional to the number of middle stage switches, $m$, for fixed $N$ and $r$. Therefore, as we will see later, the main focus of the study of this type of network is on reducing the number of middle switches to yield lower cost networks.

Figure 1: A general schematic of a $v(m, n_1, r_1, n_2, r_2)$ network.

3 Preliminaries

In this section, we present some basic definitions and notations that will be useful in our analysis of restricted multicast networks.

First, it is reasonable to assume that every switch in the $v(m, n, r)$ multicast network has multicast capability, that is, each idle input link of a switch can be simultaneously connected to any subset of idle output links of the switch. Since output stage switches have multicast capability, a multicast connection can therefore be described in terms of connections between an input port and output stage switches. Moreover, the number of output stage switches in a multicast connection is referred to as the fanout of the multicast connection.

Let $O$ denote the set of all output stage switches. Based on the structure of the $v(m, n, r)$ network, we have $O = \{1, 2, \ldots, r\}$. For the $i$-th input port in the input stage, $i \in \{1, 2, \ldots, nr\}$, let $I_i \subseteq O$ denote the subset of the output stage switches to which input port $i$ is to be connected in a multicast connection. $I_i$ is referred to as an input connection request from input port $i$. Furthermore, if input port $i$ can be connected to at most $d$ ($1 \le d \le r$) output stage switches at a time (that is, $|I_i| \le d$), we will refer to this input connection request as a $d$-restricted input connection request. A multicast assignment in which each input switch has at most $\kappa$ ($0 \le \kappa = \kappa(n, r) \le n$) input connection requests with unrestricted fanouts, and all other input connection requests are $d$-restricted ($1 \le d \le r$), is referred to as a $(\kappa, d)$-multicast assignment. Figure 2 shows an example of a $(\kappa, d)$-multicast assignment in a $v(5, 3, 4)$ network.

We will refer to a $v(m, n, r)$ network that can realize all $(\kappa, d)$-multicast assignments as a $v_{\kappa,d}(m, n, r)$ multicast network. Note that in a $v_{\kappa,d}(m, n, r)$ network, the unrestricted multicast connections on each input switch are not tied to any specific subset of input ports: any input port can request an unrestricted multicast connection as long as the total number of unrestricted multicast connections on that input switch does not exceed $\kappa$ at that time. We will simply refer to a $v_{\kappa,1}(m, n, r)$ network, where at most $\kappa$ input ports in each input switch can have unrestricted multicast connections at a time and all other input ports can have only one-to-one connections, as a $v_{\kappa}(m, n, r)$ network.
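The definition of a $(\kappa, d)$-multicast assignment can be checked mechanically. The following is a minimal Python sketch, not part of the original paper: the function name `is_kd_assignment`, the dictionary representation of the requests, and the 1-based port numbering (ports $1, \ldots, n$ on the first input switch, and so on) are assumptions made for illustration.

```python
from collections import Counter


def is_kd_assignment(requests, n, kappa, d):
    """Check whether a set of input connection requests forms a
    (kappa, d)-multicast assignment at the switch level.

    requests: dict mapping an input port i (1-based, hypothetical layout:
              ports 1..n sit on input switch 0, n+1..2n on switch 1, ...)
              to the set I_i of output stage switches it requests.
    n:        number of ports per input/output switch.
    kappa:    max unrestricted (fanout > d) requests per input switch.
    d:        fanout limit for all remaining requests.
    """
    unrestricted = Counter()   # per input switch: requests with fanout > d
    output_load = Counter()    # per output switch: number of incoming requests
    for port, outputs in requests.items():
        if len(outputs) > d:
            unrestricted[(port - 1) // n] += 1
        output_load.update(outputs)
    # Each output switch has only n output ports, so it can appear in
    # at most n requests.
    if any(load > n for load in output_load.values()):
        return False
    return all(count <= kappa for count in unrestricted.values())
```

For instance, with $n = 3$, $\kappa = 1$ and $d = 2$, one request of fanout 4 on an input switch is acceptable, while a second request of fanout 3 on the same input switch is not.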
Clearly, a $v_n(m, n, r)$ network is a full multicast $v(m, n, r)$ network, and a $v_0(m, n, r)$ network is a classical permutation $v(m, n, r)$ network. In addition, the multicast networks we consider in this paper are nonblocking networks in the sense that we can always satisfy an eligible multicast connection request without any rearrangement of existing connections in the network, regardless of the current network state. This eliminates the possible disruption of on-going communications caused by rearrangements and the resulting time delay in path routing.

Figure 2: A $(\kappa, d)$-multicast assignment in a $v(5, 3, 4)$ network (legend: idle link, busy link).

4 Previous Related Work

The $v(m, n, r)$ networks have been extensively studied in the literature [15, 16, 18, 19, 21, 22]. From the network structure described in Section 2, we know that two of the network parameters, $n$ and $r$, are restricted by the network input/output size (in fact $N = nr$), and the network cost is proportional to the number of middle stage switches $m$ for fixed $N$ and $r$. Therefore, the main focus of the study has been on finding the minimum value of the network parameter $m$ for a certain type of connecting capability to achieve the minimum network cost. A recent design [21, 22] shows that a $v(m, n, r)$ network is nonblocking for arbitrary multicast assignments if the number of middle stage switches, $m$, satisfies the condition $m > 3(n - 1)\frac{\log r}{\log \log r}$. This result represents the currently best known design for constant stage multicast networks. In fact, it has been shown [23] that under several typical routing control strategies $m \ge \Theta\!\left(n \frac{\log r}{\log \log r}\right)$ is the necessary condition for a $v(m, n, r)$ multicast network to be nonblocking. However, it was shown [15, 16] that a $v(m, n, r)$ network is nonblocking for permutation assignments if $m \ge 2n - 1$. Clearly, there is still a considerably large gap in network cost between $v(m, n, r)$ multicast networks and $v(m, n, r)$ permutation networks. In the following, we will determine the nonblocking conditions for $v_{\kappa,d}(m, n, r)$ multicast networks.
As we will see, $v_{\kappa,d}(m, n, r)$ networks are a compromise between full multicast networks and permutation networks: they have a low cost comparable to permutation networks and yet multicast capability powerful enough for various multicast applications.

5 Nonblocking Conditions for $v_{\kappa,d}(m, n, r)$ Multicast Networks

In this section, we will present the main results of this paper. We will first give the nonblocking condition for general $v_{\kappa,d}(m, n, r)$ multicast networks. We will then extend the result to $v_{\kappa}(m, n, r)$ multicast networks to yield restricted multicast networks with the same order of network cost as $v(m, n, r)$ permutation networks.

Assume a $v_{\kappa,d}(m, n, r)$ network is currently providing some multicast connections from its input ports to its output ports. For any input port $i \in \{1, \ldots, nr\}$, we will refer to the set of middle stage switches with currently unused links to the input switch associated with input port $i$ as the available middle switches. Moreover, for any middle stage switch $j \in \{1, 2, \ldots, m\}$, we will refer to the subset of output stage switches to which middle switch $j$ is providing connection paths from the input ports as the destination set of middle switch $j$, and denote it as $M_j$. Clearly, we have $M_j \subseteq O$ for any $j \in \{1, 2, \ldots, m\}$. Notice that an output port can be connected to at most one input port at a time in a multicast connection. The following lemma reveals a global constraint on the $M_j$'s.

Lemma 1 At any state of a $v_{\kappa,d}(m, n, r)$ multicast network, there are at most $n$ 1's, $n$ 2's, $\ldots$, $n$ $r$'s distributed in the destination sets $M_1, M_2, \ldots, M_m$.

Proof. Since any output stage switch $k$, $k \in \{1, 2, \ldots, r\}$, has $n$ output ports, it can have at most $n$ disjoint connection paths from the middle stage. This means that there are at most $n$ $k$'s in all destination sets $M_1, M_2, \ldots, M_m$.
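The constraint in Lemma 1 is easy to state over a concrete representation of the destination sets. The sketch below is illustrative only: representing each $M_j$ as a Python set of output switch indices and the name `satisfies_lemma1` are assumptions, not notation from the paper.

```python
from collections import Counter


def satisfies_lemma1(dest_sets, n):
    """Return True if no output switch index occurs more than n times
    across the destination sets M_1, ..., M_m.

    dest_sets: iterable of sets, one per middle switch (hypothetical layout).
    n:         number of output ports on each output stage switch.
    """
    occurrences = Counter(k for dest in dest_sets for k in dest)
    return all(count <= n for count in occurrences.values())
```

A state in which three middle switches all carry a path to output switch 1 is impossible when $n = 2$, because output switch 1 has only two output ports.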

Now, given a new input connection request $I_i$, $i \in \{1, 2, \ldots, nr\}$, we need to find middle stage switches from the available middle switches to satisfy this connection request. The following lemma gives a necessary and sufficient condition for satisfying a connection request $I_i$.

Lemma 2 We can satisfy a connection request $I_i$ using some $x$ ($x \ge 1$) middle switches, say, $j_1, j_2, \ldots, j_x$, from among the available middle switches of a $v_{\kappa,d}(m, n, r)$ network if and only if

$$I_i \cap \Big(\bigcap_{k=1}^{x} M_{j_k}\Big) = \emptyset.$$

Proof. If there exist available middle switches, say, $j_1, j_2, \ldots, j_x$, $x \ge 1$, for which $I_i \cap (\bigcap_{k=1}^{x} M_{j_k}) = \emptyset$, then for every output switch $t \in I_i$, we can always find a middle switch, say $j_k$, $1 \le k \le x$, such that $t \notin M_{j_k}$, through which a connection path to output switch $t$ is available. Thus, we can satisfy the new connection request through these $x$ middle switches. Conversely, if we can satisfy connection request $I_i$ using $x$ middle switches, say, $j_1, j_2, \ldots, j_x$, then $I_i \cap (\bigcap_{k=1}^{x} M_{j_k}) = \emptyset$ before we satisfy this connection request. Otherwise, if there exists some output switch $t \in I_i \cap (\bigcap_{k=1}^{x} M_{j_k})$, then a connection path could not be provided to output switch $t$ through any middle switch in the set of $x$ available middle switches.

Theorem 1 If there are at least $2n$ available middle switches for a connection request with fanout $f$ ($1 \le f \le r$) in a $v_{\kappa,d}(m, n, r)$ network, we can always choose no more than $\log f$ middle switches to satisfy this connection request among these available middle switches.

Proof. Without loss of generality, suppose the input connection request is $I_i = \{1, 2, \ldots, f\}$, $|I_i| = f \le r$, and the available middle switches are $M_1, M_2, \ldots, M_{2n}$. By Lemma 1, and since each output switch in $I_i$ must still have an idle output port for the new request, there are at most $(n-1)$ 1's, $(n-1)$ 2's, $\ldots$, $(n-1)$ $f$'s distributed among $M_1, M_2, \ldots, M_{2n}$. Assign $j_1$ such that

$$|I_i \cap M_{j_1}| = \min_{1 \le j \le 2n} |I_i \cap M_j|.$$

Then we have

$$|I_i \cap M_{j_1}| \le \frac{(n-1)f}{2n} < \frac{f}{2}.$$

Again, without loss of generality, suppose $I_i \cap M_{j_1} = \{1, 2, \ldots, f'\}$, where $f' < f/2$.
Then assign $j_2$ such that

$$|I_i \cap M_{j_1} \cap M_{j_2}| = \min_{1 \le j \le 2n,\ j \ne j_1} |I_i \cap M_{j_1} \cap M_j|.$$

Similarly, we have

$$|I_i \cap M_{j_1} \cap M_{j_2}| < \frac{(n-1)(f/2)}{2n-1} < \frac{f}{2^2}.$$

In general, in step $k$, we assign $j_k$ such that

$$|I_i \cap M_{j_1} \cap M_{j_2} \cap \cdots \cap M_{j_k}| = \min_{1 \le j \le 2n,\ j \ne j_p,\ p < k} |I_i \cap M_{j_1} \cap \cdots \cap M_{j_{k-1}} \cap M_j|$$

and

$$|I_i \cap M_{j_1} \cap M_{j_2} \cap \cdots \cap M_{j_k}| < \frac{f}{2^k}.$$

There exists some $x \le \log f$ such that $|I_i \cap M_{j_1} \cap M_{j_2} \cap \cdots \cap M_{j_x}| = 0$. That is, $I_i \cap (\bigcap_{k=1}^{x} M_{j_k}) = \emptyset$. By Lemma 2, $I_i$ can be satisfied by $M_{j_1}, M_{j_2}, \ldots, M_{j_x}$.

Theorem 2 A $v_{\kappa,d}(m, n, r)$ multicast network is nonblocking if

$$m \ge \kappa \log\frac{r}{d} + (n-1)(2 + \log d) + 2.$$

Proof. We will prove this theorem by considering the worst case network state: the new input connection request $I_i$ has a fanout $d$ and all other input ports on the same input switch as $I_i$ are already connected to some output switches, among which $\kappa$ input ports have a fanout $r$ and $(n - \kappa - 1)$ input ports have a fanout $d$. Clearly, the middle switches providing connection paths for the other input ports on this input switch are not available for satisfying this new connection request. By Theorem 1, there are a total of $\kappa \log r + (n - \kappa - 1)\log d$ middle switches not available to the new connection request. Again, by Theorem 1, if we still have $2n$ middle switches available, then we can satisfy the new connection request. In addition, these $2n$ available middle switches also guarantee that future connection requests from this input switch can always be satisfied. This is because after we satisfy $I_i$, we still have $2n - \log d$ available middle switches for any input port on this input switch, and all input ports are connected to some output switches. Later, if any input port on this input switch wants to request a new connection, it must release the old connection, which yields at least $\log d$ extra available middle switches. Therefore, in any case, we always have at least $2n$ available middle switches. By Theorem 1, we can satisfy any future connection request from this input switch. Similarly, we can apply the above argument to the other input switches.
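Hence the nonblocking condition for a $v_{\kappa,d}(m, n, r)$ network is

$$m \ge \kappa \log r + (n - \kappa - 1)\log d + 2n = \kappa \log\frac{r}{d} + (n-1)(2 + \log d) + 2.$$

Theorem 2 gives the nonblocking condition for general $v_{\kappa,d}(m, n, r)$ multicast networks. In the following, we will discuss some interesting special cases of $v_{\kappa,d}(m, n, r)$ networks. Theorem 3 gives a more explicit nonblocking condition for a $v_{\kappa,d}$ network with certain $\kappa$ and $d$ values.

Theorem 3 In a $v_{\kappa,d}(m, n, r)$ network, if at most $\frac{n\,\theta(r)}{\log r}$, where $1 \le \theta(r) \le \log r$, input ports on each input switch can have unrestricted multicast connections and all other input ports can have multicast connections with fanout at most $2^{\theta(r)}$, the nonblocking condition becomes $m \ge c\,n\,\theta(r)$, where $c$ is a constant.

Proof. Setting $\kappa = \frac{n\,\theta(r)}{\log r}$ and $d = 2^{\theta(r)}$ in Theorem 2, we have that

$$m \ge \kappa \log\frac{r}{d} + (n-1)(2 + \log d) + 2 = \frac{n\,\theta(r)}{\log r}\,[\log r - \theta(r)] + (n-1)[2 + \theta(r)] + 2.$$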

There exists a constant $c$ such that the network is nonblocking if $m \ge c\,n\,\theta(r)$.

Now let us take a look at an example of Theorem 3. Suppose that we let $\theta(r) = \log \log r$ in Theorem 3. Then we have $\kappa = \frac{n \log \log r}{\log r}$ and $d = \log r$. Therefore, the network is nonblocking if $m \ge 3n \log \log r$ for $r \ge 16$.

Furthermore, we are particularly interested in the restricted multicast networks which have the same order of network cost as the permutation networks. Theorem 4 gives the nonblocking condition for such networks.

Theorem 4 In a $v_{\kappa}(m, n, r)$ network, if at most $\frac{cn}{\log r}$, where $c$ is a constant, input ports on each input switch can have unrestricted multicast connections and all other input ports can have one-to-one connections, the nonblocking condition becomes

$$m \ge (2 + c)\,n.$$

Proof. This condition can be derived by setting $\kappa = \frac{cn}{\log r}$ and $d = 1$ in Theorem 2.

Recall that the nonblocking condition for the $v(m, n, r)$ permutation network is $m \ge 2n - 1$. Since the network cost is proportional to the number of middle switches, $m$, it is easy to see that the $v_{\kappa}(m, n, r)$ networks that satisfy the condition in Theorem 4 have the same order of network cost as permutation networks. Theorem 4 suggests that each input switch can have up to $\frac{cn}{\log r}$ input ports out of its $n$ input ports making unrestricted multicast connections at any time while keeping a low network cost comparable to a permutation network. Under this nonblocking condition, the number of input ports that can request unrestricted multicast connections at a time is generally adequate for many multicast applications. For example, in a parallel computing system, we can consider all processors connected to an input switch as a cluster cooperating to complete a common task. At any given time, not all processors in the cluster need to perform multicast, and we can have up to $\frac{cn}{\log r}$ processors in the cluster performing multicast. Moreover, the only thing the higher-level software and algorithm designers need to be concerned with is keeping the number of processors doing multicasting in the cluster below the threshold. This is a fairly simple rule for judging whether a multicast connection can be realized in a single pass through the network.
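To see how the trade-off between multicast capability and cost plays out numerically, the bound $m \ge \kappa \log(r/d) + (n-1)(2 + \log d) + 2$ and the crosspoint count $m(2nr + r^2)$ from Section 2 can be evaluated directly. The sketch below is a hedged illustration: the function names are hypothetical, logarithms are assumed to be base 2, and the bound is rounded up to the next integer, which the text does not state explicitly.

```python
import math


def min_middle_switches(n, r, kappa, d):
    """Smallest integer m meeting the nonblocking bound
    m >= kappa*log2(r/d) + (n-1)*(2 + log2(d)) + 2  (assumed base-2 logs)."""
    if not (0 <= kappa <= n and 1 <= d <= r):
        raise ValueError("need 0 <= kappa <= n and 1 <= d <= r")
    bound = kappa * math.log2(r / d) + (n - 1) * (2 + math.log2(d)) + 2
    return math.ceil(bound)


def crosspoints(m, n, r):
    """Crosspoint count of a symmetric three-stage v(m, n, r) network."""
    return m * (2 * n * r + r * r)


if __name__ == "__main__":
    n, r = 8, 16
    for kappa, d in [(0, 1), (2, 1), (2, 4)]:
        m = min_middle_switches(n, r, kappa, d)
        print(kappa, d, m, crosspoints(m, n, r))
```

With $n = 8$ and $r = 16$, allowing two unrestricted multicast ports per input switch ($\kappa = 2$, $d = 1$) raises the requirement from 16 to 24 middle switches under these assumptions, so the cost grows by a constant factor rather than by the $\log r / \log \log r$ factor of full multicast.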
Furthermore, we can obtain even lower cost networks for those applications which have weaker requirements for multicast capability. For example, if each input switch has up to $c$ (where $c$ is a constant) input ports making unrestricted multicast connections at any time and all other input ports making only one-to-one connections, the nonblocking condition for a $v_{\kappa}(m, n, r)$ network becomes $m \ge c \log r + 2n$. Also, even if each input switch has up to $c\sqrt{n}$ input ports making unrestricted multicast connections, the number of middle switches needed for nonblocking is only $c\sqrt{n} \log r + 2n$.

We summarize the nonblocking conditions for some typical $v_{\kappa,d}(m, n, r)$ networks along with the permutation $v(m, n, r)$ network and the full multicast $v(m, n, r)$ network in Table 1. From Table 1, we can see that the newly designed restricted multicast networks can realize a substantial number of well-defined multicast assignments while keeping network cost comparable to $v(m, n, r)$ permutation networks. Moreover, the multicast capability of the networks is represented as a function of fundamental network structural parameters so that the trade-off between the network multicast capability and the network cost can be determined. This enables system designers to choose the multicast network that fits their particular application needs.

Table 1: Nonblocking conditions for some typical restricted multicast networks

Network                                                                       Nonblocking condition                       Unrestricted multicast ports in each input switch
Permutation $v(m, n, r)$                                                      $m \ge 2n - 1$                              $0$
$v_{\kappa}(m, n, r)$, $\kappa = c$                                           $m \ge c \log r + 2n$                       $c$
$v_{\kappa}(m, n, r)$, $\kappa = 0.5n/\log r$                                 $m \ge 2.5n$                                $0.5n/\log r$
$v_{\kappa}(m, n, r)$, $\kappa = n/\log r$                                    $m \ge 3n$                                  $n/\log r$
$v_{\kappa,d}(m, n, r)$, $\kappa = n \log\log r/\log r$, $d = \log r$         $m \ge 3n \log \log r$                      $n \log\log r/\log r$
Full multicast $v(m, n, r)$                                                   $m > 3(n-1)\log r/\log \log r$              $n$

6 A Routing Algorithm for $v_{\kappa,d}(m, n, r)$ Networks

In this section, we will present a routing algorithm for satisfying connection requests in a $v_{\kappa,d}(m, n, r)$ network. Suppose we are given a $v_{\kappa,d}(m, n, r)$ network satisfying the nonblocking condition in Theorem 2 and an input connection request $I_i$. Then there are at least $2n$ available middle switches for $I_i$. Take any $k = 2n$ middle switches from them.
Without loss of generality, let these available middle switches be $M_1, M_2, \ldots, M_k$. Let $A[j]$ ($1 \le j \le r$) denote the number of input connections with fanout greater than $d$ in the $j$-th input switch. We have the following algorithm for connecting $I_i$:

Algorithm:

    /* Check the eligibility of the connection request and update A[j]. */
    j = ⌈i/n⌉;
    if (|I_i| > d) then {
        if (A[j] < κ) then A[j] = A[j] + 1;
        else exit without making connection;
    }
    /* Find up to x = log |I_i| middle switches for I_i */
    TMP ← I_i; S ← ∅; T ← {1, 2, ..., k};
    for p = 1 to k do H[p] ← ∅;
    while (TMP ≠ ∅) {
        choose middle switch p ∈ T such that |M_p ∩ TMP| = min_{q ∈ T} |M_q ∩ TMP|;
        S ← S ∪ {p}; T ← T − {p};
        H[p] ← TMP − M_p;
        TMP ← M_p ∩ TMP;
    }
    /* Distribute I_i to the selected middle switches in S. */
    while (S ≠ ∅) {
        take p ∈ S;
        M_p ← M_p ∪ H[p];
        S ← S − {p};
    }
    End

We now give some necessary explanations for the routing algorithm. In the above algorithm, set $S$ stores the indexes of the middle switches selected to satisfy the input connection request $I_i$, and $H[p]$ stores the subset of $I_i$ which will be realized by middle switch $p$. The first while loop in the algorithm finds middle switches to satisfy the connection request $I_i$. From Theorem 1 and Theorem 2, we know that at most $\log |I_i|$ middle switches are needed for satisfying $I_i$. At the end of the first while loop, $S$ stores the indexes of the selected middle switches which together will satisfy $I_i$. In fact, we can show that at the end of the first while loop, the following conditions hold:

1. for any $p \in S$, $H[p] \cap M_p = \emptyset$;
2. for any $p, q \in S$ with $p \ne q$, $H[p] \cap H[q] = \emptyset$;
3. $I_i = \bigcup_{p \in S} H[p]$.

Therefore, $I_i$ can be distributed to the set of middle switches indexed by the elements of $S$. This is accomplished in the second while loop of the algorithm. In other words, set $H[p]$ is distributed to middle switch $p$ for all $p \in S$ in the second while loop.

We now analyze the complexity of the above algorithm. The time for one iteration of the first while loop is proportional to $|TMP| \cdot |T|$. Since the number of available middle switches is $2n$, after each iteration $|TMP|$ is reduced to at most half of its previous value. We know that initially $|TMP| = |I_i| \le r$ and $|T| = 2n$. Thus, the total time for the first while loop is proportional to $|I_i| \cdot 2n$, that is, $O(N)$. Clearly, the second while loop also takes $O(N)$ time. The rest of the algorithm takes less than $O(N)$ time.
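The routing procedure described above can be sketched in Python as follows. This is an illustrative rendering under stated assumptions rather than the paper's implementation: destination sets are held in a dictionary of Python sets, ties in the minimum are broken by the smaller switch index, and the function names `admit` and `route` are hypothetical.

```python
def admit(unrestricted, j, fanout, kappa, d):
    """Eligibility check: count requests with fanout > d on input switch j
    (the array A[j] in the text) and refuse the request once kappa such
    connections are already active on that switch."""
    if fanout > d:
        if unrestricted[j] >= kappa:
            return False
        unrestricted[j] += 1
    return True


def route(request, dest_sets, available):
    """Greedy selection of middle switches for one connection request.

    request:   set I_i of output stage switches to reach.
    dest_sets: dict mapping middle switch index p to its destination set M_p;
               updated in place only if routing succeeds.
    available: indexes of the available middle switches (the set T).
    Returns a dict p -> H[p], or None if the candidates run out.
    """
    tmp = set(request)           # outputs still to be covered (TMP)
    candidates = set(available)  # unused available switches (T)
    chosen = {}                  # selected switches with their share (S, H)
    while tmp:
        if not candidates:
            return None
        # Pick the switch whose destination set overlaps TMP the least.
        p = min(candidates, key=lambda q: (len(dest_sets[q] & tmp), q))
        candidates.remove(p)
        chosen[p] = tmp - dest_sets[p]  # outputs p can still reach
        tmp &= dest_sets[p]             # outputs p cannot reach remain
    # Distribute the request over the selected switches.
    for p, served in chosen.items():
        dest_sets[p] |= served
    return chosen
```

Because the destination sets are only modified after the selection loop finishes, a request that cannot be completed leaves the network state untouched in this sketch.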
Thus, the time complexity of the above algorithm is linear in the network size. Moreover, by employing the techniques used in [22], we can obtain a parallel routing algorithm for the above routing process with time complexity of $O(\log r)$.

7 Conclusions

In this paper, we have presented a class of practical interconnection networks for supporting multicast communications in parallel computing systems. The newly designed networks can support a substantial number of well-defined multicast assignments in a nonblocking fashion and still keep the same order of network cost as permutation networks. We have also presented an efficient routing algorithm for satisfying connection requests in such networks. Moreover, the multicast connecting capability of the networks is represented as a function of fundamental network structural parameters so that the trade-off between the network multicast capability and the network cost can be determined. This enables system designers to choose the multicast network that fits their particular application needs. By utilizing a network with well-defined multicast capability in a parallel computing system, software or algorithm designers of the system will be able to make full use of the multicast capability provided by the network, and substantial improvements in the performance of the system can be achieved due to significantly shortened delays in data transfer and simplified synchronization mechanisms for shared data.

Acknowledgements

The author would like to thank the anonymous referees for their helpful comments and suggestions.

References

[1] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, 2nd edition, Morgan Kaufmann Publishers, Inc., 1995.
[2] D.K. Panda, Issues in designing efficient and practical algorithms for collective communication on wormhole-routed systems, Proc. of the 1995 ICPP Workshop on Challenges for Parallel Processing, pp. 8-15, 1995.
[3] M. Tomasevic and V. Milutinovic, The Cache Coherence Problem in Shared-Memory Multiprocessors: Hardware Solutions, IEEE Computer Society Press, 1993.
[4] NCUBE Company, NCUBE 6400 Processor Manual, 1990.
[5] P.K. McKinley, H. Xu, A.-H. Esfahanian and L.M. Ni, Unicast-based multicast communication in wormhole-routed networks, IEEE Trans. on Parallel and Distributed Systems, vol. 5, No. 12, pp. 1252-1265, 1994.
[6] X. Lin, P.K. McKinley and L.M. Ni, Deadlock-free multicast wormhole routing in 2D mesh multicomputers, IEEE Trans. on Parallel and Distributed Systems, vol. 5, No. 8, pp. 793-804, 1994.
[7] M. Kumar, Supporting broadcast connections in Benes networks, IBM Research Report RC-14063, 1988.
[8] J. Beetem, M. Denneau and D. Weingarten, The GF11 supercomputer, Proc. of the 12th Annual International Symposium on Computer Architecture, pp. 108-115, 1985.
[9] N. Koike, NEC Cenju-3: A microprocessor-based parallel computer, Proc. of the 8th International Parallel Processing Symposium, pp. 396-401, 1994.
[10] H. Xu, Y. Gui and L.M. Ni, Optimal software multicast in wormhole-routed multistage networks, Proc. of Supercomputing '94, pp. 703-712, 1994.
[11] C. Chiang, S. Bhattacharya and L.M. Ni, Multicast in extra-stage multistage interconnection networks, Proc. of the 6th IEEE Symposium on Parallel and Distributed Processing, pp. 452-459, 1994.
[12] L.M. Ni and D.K. Panda, A report of the ICPP '94 panel on sea of interconnection networks: what's your choice?, IEEE Computer Society Technical Committee on Computer Architecture Newsletter, Winter 1994-95, pp. 3-44, 1994.
[13] R. Rooholamini, V. Cherkassky and M. Garver, Finding the right ATM switch for the market, Computer, vol. 27, No. 4, pp. 16-28, 1994.
[14] L.M. Ni, Should scalable parallel computers support efficient hardware multicast?, Proc. of the 1995 ICPP Workshop on Challenges for Parallel Processing, pp. 2-7, 1995.
[15] C. Clos, A study of non-blocking switching networks, The Bell System Technical Journal, vol. 32, pp. 406-424, 1953.
[16] V.E. Benes, Heuristic remarks and mathematical problems regarding the theory of switching systems, The Bell System Technical Journal, vol. 41, pp. 1201-1247, 1962.
[17] G.W. Richards and F.K. Hwang, A two-stage rearrangeable broadcast switching network, IEEE Trans. Communications, vol. COM-33, pp. 1025-1034, 1985.
[18] G.M. Masson and B.W. Jordan, Generalized multi-stage connection networks, Networks, vol. 2, pp. 191-209, 1972.
[19] F.K. Hwang and A. Jajszczyk, On nonblocking multiconnection networks, IEEE Trans. Communications, vol. COM-34, pp. 1038-1041, 1986.
[20] P. Feldman, J. Friedman, and N. Pippenger, Wide-sense nonblocking networks, SIAM Journal of Discrete Mathematics, vol. 1, No. 2, pp. 158-173, 1988.
[21] Y. Yang and G.M. Masson, Nonblocking broadcast switching networks, IEEE Trans. Computers, vol. C-40, No. 9, pp. 1005-1015, 1991.
[22] Y. Yang and G.M. Masson, Fast path routing techniques for nonblocking broadcast networks, Proc. of the IEEE Thirteenth International Phoenix Conference on Computers and Communications, pp. 358-364, 1994.
[23] Y. Yang and G.M. Masson, The necessary conditions for Clos-type nonblocking multicast networks, Proc. of the 10th International Parallel Processing Symposium, 1996.
[24] J.P. Ofman, A universal automaton, Trans. Moscow Math. Soc., vol. 14, 1965 (translation published by Amer. Math. Soc. pp. 9-5, 1967).
[25] C.D. Thompson, General connection networks for parallel processor interconnection, IEEE Trans. Computers, vol. C-27, pp. 1119-1125, 1978.
[26] Chin-Tau Lea, A new broadcast switching network, IEEE Trans. Communications, vol. COM-36, pp. 1128-1137, 1988.
[27] C. Lee and A.Y. Oruc, Design of efficient and easily routable generalized connectors, IEEE Trans. Communications, vol. COM-43, pp. 646-650, 1995.
[28] Y. Yang and G.M. Masson, Broadcast ring sandwich networks, IEEE Trans. Computers, vol. C-44, No. 10, pp. 1169-1180, 1995.