On Optimization of Finite-Difference Time-Domain (FDTD) Computation on Heterogeneous and GPU Clusters


Ramtin Shams and Parastoo Sadeghi
College of Engineering and Computer Science (CECS), The Australian National University, Canberra, ACT 0200, Australia

Preprint submitted to Elsevier, May 2010

Abstract

A model for the computational cost of the finite-difference time-domain (FDTD) method, irrespective of implementation details or the application domain, is given. The model is used to formalize the problem of optimal distribution of computational load to an arbitrary set of resources across a heterogeneous cluster. We show that the problem can be formulated as a minimax optimization problem and derive analytic lower bounds for the computational cost. The work provides insight into optimal design of FDTD parallel software. We also propose an efficient algorithm for partitioning of the computational domain for load-balancing FDTD computations across an arbitrary cluster. We demonstrate that significant performance gains, as much as 75%, can be achieved by proper load distribution.

Key words: Finite-Difference Time-Domain (FDTD), Heterogeneous Computing, Parallel Processing, Graphics Processing Unit (GPU), Optimization

1 Introduction

The finite-difference time-domain (FDTD) method, since its introduction by Yee [1], has been widely used to obtain numerical solutions of Maxwell's equations for a broad range of problems. The applications of FDTD in electrodynamics include antenna and radar design, electronic and photonic circuit design, microwave tomography, cellular and wireless network simulation, mobile phone safety studies, and many more [2]. The method is not limited to electrodynamics and can be used to solve other spatiotemporal partial differential equations such as those occurring in acoustics (e.g. see [3]). The explicit nature of the FDTD formulation, its simplicity, accuracy and robustness, together with a well-established theoretical framework, have contributed to a seemingly unending popularity of the method.

Realistic FDTD simulations involve fine discretization of the spatial domain as well as the temporal domain. It is not uncommon to use spatial grids of $10^9$ and spatiotemporal grids of $10^{13}$ or more cells. As a result, FDTD simulations require significant computational resources both in terms of memory and execution time. It is inevitable that many realistic simulations exceed the memory limitations of a single computer and have to be divided across a cluster of computers. The additional incentive to distribute FDTD computation is to minimize execution time where resources are available. This, however, introduces the additional cost of intercommunication between compute nodes in order to execute FDTD simulations on the entire computational domain of the problem.

Parallelization and acceleration of FDTD has been an active area in recent years. In particular, there have been several examples of FDTD acceleration on field programmable gate array (FPGA) hardware and graphics processing units (GPUs). A list of recent contributions in this area is given later in Section 2.2.

Traditionally, computer clusters have been built exclusively from homogeneous compute nodes. With the introduction of accelerator technologies and their rise in popularity for general purpose computing, this is no longer the case. The current generation of accelerators requires a traditional computer host and hence by definition creates hybrid nodes when clustered. A notable example of one such cluster is IBM's Roadrunner supercomputer, which comprises 13,824 Opteron cores and 116,640 Cell processor cores. According to technical staff at Los Alamos (footnote 1), the applications are typically designed for execution on Cell processors and, except for trivial housekeeping and data transfer to and from Cell processors, the Opteron cores remain idle most of the time. This represents a significant computing capacity that remains under-utilized.
Our motivation is to maximize use of available computational resources, whether across an organization's network or on purpose-built heterogeneous clusters, towards solving larger computational problems. A heterogeneous cluster is defined as a group of computational resources with differing technology, execution capability, memory size, and speed. A heterogeneous cluster may comprise accelerators (e.g. GPUs, Cell processors, FPGA boards), desktop computers, and server computers. While heterogeneous resources are commonly found on any modern network, they are rarely used as heterogeneous clusters, particularly across technology boundaries (e.g. x86 and PowerPC). The two main impediments to effective use of heterogeneous clusters are (a) the additional effort involved in development of heterogeneous applications and (b) the need to design an optimal load distribution scheme across heterogeneous resources. In this work, we assume the reader is sufficiently motivated to tackle the former problem and focus on the latter in the context of FDTD parallelization, where we look at the problem of optimal distribution of FDTD computation across a heterogeneous cluster.

Footnote 1: According to discussions at the Path to Petascale Workshop, April 2009, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign.

1.1 Contributions

This work provides insight into the optimal design of a heterogeneous FDTD application by
(1) modeling the cost of FDTD computation on a heterogeneous cluster (Section 3.1);
(2) formalizing the load distribution problem as a minimax optimization problem (Section 3.2);
(3) deriving analytic lower bounds for the execution cost of FDTD on a heterogeneous cluster (Section 3.4); and
(4) proposing a heuristic algorithm for efficient distribution of load to an arbitrary cluster of computational resources (Section 4).

We would like to clarify from the outset that this work is not a software development effort or a specific parallel implementation of FDTD. It is an implementation-agnostic analysis of the inherent computational limitations of FDTD on the most general class of computational clusters (i.e. heterogeneous clusters). It also proposes a near-optimal load distribution algorithm for FDTD implementation for heterogeneous clusters in general and for homogeneous clusters as a subset.

We note that a myriad of parallel implementations of FDTD on homogeneous clusters exist, in the form of commercial packages, open source libraries, and scholarly research, that can benefit from this work. These parallelization efforts cover a wide range of hardware from supercomputers and general purpose programmable CPUs to GPU, application-specific integrated circuit (ASIC) and FPGA clusters. We would particularly like to emphasize the lack of FDTD software that can efficiently run across such technology boundaries in a truly heterogeneous fashion.

2 Concepts

2.1 An Overview of FDTD Method

In this section, we provide a brief overview of the FDTD method based on 3D Maxwell electromagnetic equations. We delve into the subject only to the extent needed for the reader to appreciate the computational model of FDTD presented in Section 3.1. A careful treatment of the subject is outside the scope of this paper and the reader is referred to [2] for detailed discussions.

Maxwell's curl equations for linear, isotropic, lossy and non-dispersive media are given by

$$ -\mu \frac{\partial \mathbf{H}}{\partial t} = \nabla \times \mathbf{E} + \sigma_m \mathbf{H} + \mathbf{M}, \tag{1} $$

$$ -\epsilon \frac{\partial \mathbf{E}}{\partial t} = -\nabla \times \mathbf{H} + \sigma_e \mathbf{E} + \mathbf{J}, \tag{2} $$

where $\mathbf{H}$ is the magnetic field, $\mathbf{E}$ is the electric field, $\mu$ is the magnetic permeability, $\epsilon$ is the electric permittivity, $\mathbf{M}$ is the equivalent magnetic current density, $\mathbf{J}$ is the electric current density, $\sigma_m$ is the equivalent magnetic conductivity, and $\sigma_e$ is the electric conductivity. For the purpose of FDTD simulations the $\mathbf{H}$ and $\mathbf{E}$ fields are unknown and all other quantities are given at each point in space. The equations embody 6 partial differential equations; for example, the derivative of the x-component of the electric field with respect to time is given by

$$ -\epsilon \frac{\partial E_x}{\partial t} = \frac{\partial H_y}{\partial z} - \frac{\partial H_z}{\partial y} + \sigma_e E_x + J_x. \tag{3} $$

The equations are discretized in space and time to derive an explicit solution for the next time step. Due to the mutual dependence of the $\mathbf{E}$ and $\mathbf{H}$ components, it is best to interleave values of $\mathbf{E}$ and $\mathbf{H}$ in time with a $\Delta t/2$ time difference between them. For example, $\mathbf{H}$ can be computed at $n\Delta t$ and $\mathbf{E}$ computed with a half-interval shift at $n\Delta t + \Delta t/2$. Similarly, the $\mathbf{E}$ and $\mathbf{H}$ components are staggered in space according to an arrangement known as the Yee cell [1,2]. For convenience we denote a function $u(i\Delta x, j\Delta y, k\Delta z, n\Delta t)$ by $u|^{n}_{i,j,k}$. Using this notation, (3) can be discretized on a cuboid grid of $(\Delta x, \Delta y, \Delta z)$ using second-order accurate central differences as

$$ -\epsilon_{i,j,k}\, \frac{E_x|^{n+\frac12}_{i,j+\frac12,k+\frac12} - E_x|^{n-\frac12}_{i,j+\frac12,k+\frac12}}{\Delta t} = \frac{H_y|^{n}_{i,j+\frac12,k+1} - H_y|^{n}_{i,j+\frac12,k}}{\Delta z} - \frac{H_z|^{n}_{i,j+1,k+\frac12} - H_z|^{n}_{i,j,k+\frac12}}{\Delta y} + \sigma_{e\,i,j,k}\, E_x|^{n}_{i,j+\frac12,k+\frac12} + J_x|^{n}_{i,j,k}. \tag{4} $$

Since $E_x$ is not computed at $n\Delta t$, it can be approximated by

$$ E_x|^{n}_{i,j+\frac12,k+\frac12} \approx \frac{E_x|^{n+\frac12}_{i,j+\frac12,k+\frac12} + E_x|^{n-\frac12}_{i,j+\frac12,k+\frac12}}{2}, \tag{5} $$

and we have

$$ E_x|^{n+\frac12}_{i,j+\frac12,k+\frac12} = \left( \frac{\epsilon_{i,j,k}/\Delta t - \sigma_{e\,i,j,k}/2}{\epsilon_{i,j,k}/\Delta t + \sigma_{e\,i,j,k}/2} \right) E_x|^{n-\frac12}_{i,j+\frac12,k+\frac12} - \left( \frac{1}{\epsilon_{i,j,k}/\Delta t + \sigma_{e\,i,j,k}/2} \right) \left[ \frac{H_y|^{n}_{i,j+\frac12,k+1} - H_y|^{n}_{i,j+\frac12,k}}{\Delta z} - \frac{H_z|^{n}_{i,j+1,k+\frac12} - H_z|^{n}_{i,j,k+\frac12}}{\Delta y} + J_x|^{n}_{i,j,k} \right]. \tag{6} $$

Based on (6), $E_x$ at each grid point and at time $n\Delta t + \Delta t/2$ can be computed from values of $\mathbf{E}$ and $\mathbf{H}$ at previous times. Similar equations can be derived for other components of the $\mathbf{E}$ and $\mathbf{H}$ fields. These equations allow the field values to be computed explicitly at an arbitrary time index by marching through all previous time indices. We note that the rather unusual notation involving half indices such as $j + \frac12$ is due to the arrangement of the field components in the Yee cell. We refer the reader to [2] if a detailed explanation of the notation is desired.

The discrete grid discussed above, also known as the computational domain, represents a finite, discrete and bounded model of some real domain of interest. If the simulation is sufficiently long, the wavefronts reach the boundaries of the computational domain. With no information about the medium outside the boundary, the wavefronts cannot naturally progress and are reflected back into the computational domain. In effect, the boundary of the computational domain acts as a reflective barrier, which in most circumstances is undesirable and a source of error and clutter. A significant body of work has been dedicated to designing absorbing boundary conditions (ABCs) to address this problem [2]. Commonly used ABCs include Mur's ABC [4], the perfectly matched layer (PML) [5] and its variants such as the uniaxial PML (UPML) [6,7] and the convolutional PML (CPML) [8]. Without getting into the details of different types of ABCs and their characteristics, we note that implementation of the boundary conditions involves additional computation at one or several layers of cells on the border of the computational domain. This makes the computation of boundary cells more expensive than that of regular cells.

As previously noted, use of FDTD is not limited to the solution of Maxwell's equations or to two or three dimensions. In the following sections, a general d-dimensional ($d > 1$) Cartesian computational domain is assumed.

2.2 Parallelization of FDTD

From the discussion of the previous section it should be clear that the FDTD method naturally lends itself to parallelization. At each time step, updating a cell, for example with (6), requires values of the field components of the given cell and its neighboring cells in the previous time step. In a more general setting where higher-order discretization or non-linear wave equations are involved, one may require field values from several previous time steps. Regardless of the complexity of the wave equations, FDTD ensures that each cell update is independent of its neighbors in the current time step. This makes FDTD computations well suited to parallelization.

Several parallel implementations of FDTD have appeared in the literature. The efforts cover a wide range of parallel architectures and hardware including symmetric multiprocessing (SMP) clusters, FPGA hardware, GPUs, and distributed shared memory (DSM) systems. A non-exhaustive summary (post 2000) is given in Table 1, which serves to demonstrate the degree of interest in this problem. SMP clusters typically use a combination of OpenMP [9] for parallelization on a single node and the message passing interface (MPI) [10] for parallelization across multiple nodes.

There has been more interest in using GPUs for acceleration of FDTD in recent years. FDTD is memory intensive and standard caching mechanisms on CPUs are not well suited to FDTD memory access patterns. The latest generation of GPUs, on the other hand, is specifically suited to this task, as it gives programmers control over loading data into a small but efficient shared memory that can significantly boost memory access.
In addition, the bandwidth of device memory on a GPU may exceed 100 GB/s, which is at least an order of magnitude higher than that of standard host memory.
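The independence of cell updates within a time step can be illustrated with a small sketch. The code below is a one-dimensional reduction of the update in (6), keeping only the z-derivative of H_y; it is not the authors' implementation. The function names, the grid layout (n electric cells between n + 1 magnetic samples), the lossless source-free H update and the fixed end values are all assumptions made for illustration.

```python
# Minimal 1D leapfrog sketch in the form of (6); illustrative assumptions only.

def e_coefficients(eps, sigma_e, dt):
    """Return (ca, cb): the factor on the old E value and on the bracketed term in (6)."""
    denom = eps / dt + sigma_e / 2.0
    return (eps / dt - sigma_e / 2.0) / denom, 1.0 / denom

def update_e_cell(k, e_old, h, j_src, eps, sigma_e, dt, dz):
    """New E_x in cell k (time n-1/2 -> n+1/2); reads only previous-step values."""
    ca, cb = e_coefficients(eps, sigma_e, dt)
    dh_dz = (h[k + 1] - h[k]) / dz
    return ca * e_old[k] - cb * (dh_dz + j_src[k])

def update_e(e_old, h, j_src, eps, sigma_e, dt, dz, order=None):
    """Update every E cell; `order` may be any permutation of the cell indices."""
    n = len(e_old)
    e_new = [0.0] * n
    for k in (order if order is not None else range(n)):
        e_new[k] = update_e_cell(k, e_old, h, j_src, eps, sigma_e, dt, dz)
    return e_new

def update_h(h_old, e, mu, dt, dz):
    """Assumed lossless, source-free H_y update; the two end samples are held fixed."""
    h_new = list(h_old)
    for k in range(1, len(h_old) - 1):
        h_new[k] = h_old[k] - (dt / mu) * (e[k] - e[k - 1]) / dz
    return h_new
```

Because `update_e_cell` reads only `e_old` and `h`, the cells can be visited in any order, or concurrently on different resources, with identical results. This is the property that the partitioning schemes discussed later rely on.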

Table 1. A sample list of parallel FDTD contributions in the literature.
Application | ABC (1) | Platform | Perf (2) | Group | Year
3D FDTD | PML | Cray T3E (6 300 MHz), MPI | 8.8 | Guiffaut [11] | 2001
2D FDTD | Mur | Custom 100 MHz | 6.3 | Kawaguchi [12] |
D FDTD | Mur | Firebird FPGA board (@ 70 MHz) | 3.8 | Chen [13] |
D FDTD | PML | Xilinx Virtex-II 8000 FPGA | 30 | Durbano [14] |
D FDTD | Mur (3) | GeForce FX 5800 Ultra, OpenGL | 82 | Krakiwsky [15] |
D FDTD | PML | IBM RS/6000 SP (8 nodes/ MHz), OpenMP/MPI | 36. | Hughes [16] |
3D FDTD | None | GTX 8800 (16 MP/128 cores), OpenGL/Cg | | Adams [17] |
D FDTD | Mur | GTX 280 (30 MP/240 cores), CUDA (4) | | Stefanski [18] |
D ADI-FDTD | Mur | GTX 280 (30 MP/240 cores), CUDA | 40 | Stefanski [18] |
D FDTD | PEC | GTX 280 (30 MP/240 cores), CUDA | - | Liuge [19] |
D FDTD | None | GTX 280 (30 MP/240 cores), CUDA | 795 | Takada [20] | 2009
(1): Absorbing Boundary Condition
(2): Performance in MCells/s
(3): A combination of 1st order Mur and periodic boundary conditions is used.
(4): Compute Unified Device Architecture [21]

3 Method

3.1 Modeling Computational Cost of FDTD on a Heterogeneous Cluster

Consider a hyper-rectangular FDTD computational domain $\Omega$, such as the 2-rectangle shown in Fig. 1, partitioned into a number of non-overlapping hyper-rectangular sub-domains, each denoted by $\Omega_i$ with a boundary given by $\partial\Omega_i$. The sub-domains are non-overlapping (except on the boundary), i.e.

$$ \bigcup_i \Omega_i = \Omega, \qquad \Omega_i \cap \Omega_j = \emptyset \ \text{ for } i \neq j. \tag{7} $$

Each partition is mapped to a computational resource such as a multi-core CPU, a GPU or a Cell processor, and the number of partitions equals the number of available computational resources. We denote the set of all such partitions with n elements by $\Gamma_n$. For each partition the cost of a single time-marching step is broken down into three components:
(1) the cost of updating regular cells, which is proportional to the size of the partition $|\Omega_i|$,
(2) the additional cost of updating cells on the inner boundary of the partition with neighboring cells, due to the need to load information from neighboring partitions through an interconnection between the associated resources, which depends on the size of the common boundary between a given partition and its neighboring partitions $|\partial\Omega_i \cap \partial\Omega_j|$, and
(3) the additional cost of updating cells on the outer boundary of the partition with the boundary of the domain (typically on an absorbing layer), which is proportional to the size of the common boundary between the partition and the computational domain $|\partial\Omega_i \cap \partial\Omega|$.

Fig. 1. A 2D computational domain divided into 5 partitions. The boundary of the third partition is shown with a thicker border for emphasis.

Ignoring the additional cost of handling source cells (that typically comprise only a small number of cells), the cost of the time-marching algorithm (in terms of execution time) for the i-th partition can be written as

$$ t_i = \alpha_i |\Omega_i| + \sum_j \beta_{ij}\, |\partial\Omega_i \cap \partial\Omega_j| + \gamma_i\, |\partial\Omega_i \cap \partial\Omega|, \tag{8} $$

where $|\Omega_i|$ is the size of the partition, $|\partial\Omega_i|$ is the size of the partition boundary, and $\alpha_i$, $\beta_{ij}$ and $\gamma_i$ are constants of proportionality that relate the size of partitions and boundaries to the respective cost of execution. Note that these constants are determined by the computational capability of each resource and the throughput of the interconnect between resources i and j, and are independent of the domain partitioning.

The first term in (8) captures the cost of executing one time-marching step on the interior of the partition, where all the information needed to compute the next time instance for a given cell resides within the computational resource. The second term represents the additional cost associated with the transfer of data from resource j to resource i to enable computation of the next time step for boundary cells. The third term represents the additional cost of computing absorbing boundary conditions on the boundary of the domain.
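As a concrete reading of (8), the sketch below evaluates the three cost terms for each partition and takes the largest cost as the time for one iteration. It is only an illustration: the function names and all numeric values are hypothetical, and the example domain is an assumed 10 × 10 grid cut into a 10 × 4 strip and a 10 × 6 strip.

```python
def partition_cost(alpha, beta_row, gamma, size, shared, outer):
    """Cost of one time step for one partition, following the three terms of (8).

    shared maps a neighbor index j to the size of the common boundary with j;
    outer is the size of the boundary shared with the domain boundary.
    """
    transfer = sum(beta_row[j] * area for j, area in shared.items())
    return alpha * size + transfer + gamma * outer

def iteration_cost(alphas, betas, gammas, sizes, shared, outer):
    """Slowest partition determines the step time; also returns per-partition costs."""
    costs = [
        partition_cost(alphas[i], betas[i], gammas[i], sizes[i], shared[i], outer[i])
        for i in range(len(sizes))
    ]
    return max(costs), costs

# Hypothetical constants (time per cell) for two resources.
alphas = [1.0, 0.5]
betas = [[0.0, 2.0], [2.0, 0.0]]
gammas = [3.0, 3.0]
sizes = [40, 60]                  # 10 x 4 and 10 x 6 cells
shared = [{1: 10}, {0: 10}]       # the two strips share an edge of 10 cells
outer = [18, 22]                  # 10 + 4 + 4 and 10 + 6 + 6 cells

t_m, costs = iteration_cost(alphas, betas, gammas, sizes, shared, outer)
```

With these numbers the two partitions cost 114 and 116 time units, so the faster resource waits for the slower one even though it was given more cells.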

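3.2 Problem formulation

All resources need to complete their computation of the current time step before they can proceed to the next step. In other words, a barrier synchronization primitive is required at the end of each time step iteration. This requires faster resources to wait for slower resources to complete their task and hence the cost of a single iteration is given by

$$ t_m = \max_i \Big\{ \alpha_i |\Omega_i| + \sum_j \beta_{ij}\, |\partial\Omega_i \cap \partial\Omega_j| + \gamma_i\, |\partial\Omega_i \cap \partial\Omega| \Big\}. \tag{9} $$

We are now ready to formalize the problem: we seek the optimal distribution of load to a heterogeneous cluster that minimizes (9). This is a minimax problem over disjoint n-element partitions of the computational domain $\Omega$:

$$ t_{\mathrm{opt}} = \min_{\Gamma_n} \max_i \Big\{ \alpha_i |\Omega_i| + \sum_j \beta_{ij}\, |\partial\Omega_i \cap \partial\Omega_j| + \gamma_i\, |\partial\Omega_i \cap \partial\Omega| \Big\}. \tag{10} $$

Finding the optimal partitions based on (10) is far from trivial. This is because the geometry and position of the partitions need to be known in order to compute the overlap between neighboring partitions and between partitions and the exterior of the domain. However, as shown in the following sections, it is possible to find analytic lower bounds for $t_{\mathrm{opt}}$ (under certain conditions) that are independent of the partitioning scheme and hence provide insight into achievable performance levels without the need to directly solve (10). The problem needs to be relaxed in order to achieve these goals.

3.3 Relaxing the Problem

Let $\beta_i = \min_j \{\beta_{ij}\}$ for $j \neq i$; by replacing $\beta_{ij}$ with $\beta_i$ in (10) and using

$$ \sum_j |\partial\Omega_i \cap \partial\Omega_j| = |\partial\Omega_i| - |\partial\Omega_i \cap \partial\Omega|, \tag{11} $$

we have

$$ t_{\mathrm{opt}} \geq \min_{\Gamma_n} \Big\{ \max_i \big\{ \alpha_i |\Omega_i| + \beta_i |\partial\Omega_i| + (\gamma_i - \beta_i)\, |\partial\Omega_i \cap \partial\Omega| \big\} \Big\}. \tag{12} $$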
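For a single partition, the substitution that leads from (10) to (12) can be written out in two lines. The sketch below uses only the definitions already given, namely $\beta_i \leq \beta_{ij}$ and identity (11).

```latex
\begin{align*}
\sum_{j} \beta_{ij}\,|\partial\Omega_i \cap \partial\Omega_j|
  &\ge \beta_i \sum_{j} |\partial\Omega_i \cap \partial\Omega_j|
   = \beta_i \bigl( |\partial\Omega_i| - |\partial\Omega_i \cap \partial\Omega| \bigr), \\
\alpha_i|\Omega_i| + \sum_{j} \beta_{ij}\,|\partial\Omega_i \cap \partial\Omega_j|
  + \gamma_i\,|\partial\Omega_i \cap \partial\Omega|
  &\ge \alpha_i|\Omega_i| + \beta_i|\partial\Omega_i|
  + (\gamma_i - \beta_i)\,|\partial\Omega_i \cap \partial\Omega|.
\end{align*}
```

Taking the maximum over i and then the minimum over $\Gamma_n$ on both sides preserves the inequality, which gives (12).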

Assuming the third term (involving $|\partial\Omega_i \cap \partial\Omega|$) in the above equation is non-negative ($\gamma_i \geq \beta_i$ for all i), we can write

$$ t_{\mathrm{opt}} \geq \min_{\Gamma_n} \Big\{ \max_i \big\{ \alpha_i |\Omega_i| + \beta_i |\partial\Omega_i| \big\} \Big\}. \tag{13} $$

This is not an unreasonable assumption, given that computing boundary conditions is typically a more expensive task. Finding a solution for the right-hand side of (13) provides a lower bound for $t_{\mathrm{opt}}$. This is also equivalent to solving a special case of (10) where the normalized cost of data transfer between a resource and all other resources is the same (i.e. $\beta_{ij} = \beta_i$) and the normalized cost of computing boundary conditions is assumed to be the same as the cost of transferring boundary information between the resources (i.e. $\gamma_i = \beta_i$).

We also relax the partitioning condition (7) such that we only require the sum of partition sizes to be equal to the size of the domain. This means that we ignore the need to properly pack the partitions in the given domain, at least for now. We denote this reduced optimization problem by

$$ t'_{\mathrm{opt}} = \min_{\Gamma_n} \Big\{ \max_i \big\{ \alpha_i |\Omega_i| + \beta_i |\partial\Omega_i| \big\} \Big\}, \qquad \sum_i |\Omega_i| = |\Omega|. \tag{14} $$

Lemma: If $\Omega^* = \{\Omega_1, \ldots, \Omega_n\}$ is a solution of (14), so is $\tilde\Omega^* = \{\tilde\Omega_1, \ldots, \tilde\Omega_n\}$, where $|\tilde\Omega_i| = |\Omega_i|$ and the partitions of $\tilde\Omega^*$ are hyper-cubes.

Proof: Out of all hyper-rectangles of a given size, the hyper-cube has the least boundary size, hence $|\partial\tilde\Omega_i| \leq |\partial\Omega_i|$ and $\alpha_i |\tilde\Omega_i| + \beta_i |\partial\tilde\Omega_i| \leq \alpha_i |\Omega_i| + \beta_i |\partial\Omega_i|$. So the cost of computation for no resource under $\tilde\Omega^*$ is higher than under $\Omega^*$, and $\tilde\Omega^*$ must be a solution as well.

For a d-dimensional hyper-cube we have $|\partial\tilde\Omega_i| = 2d\, |\tilde\Omega_i|^{(d-1)/d}$ and we can now focus our search on finding a hyper-cubic solution by solving

$$ t'_{\mathrm{opt}} = \min_{\Gamma_n} \Big\{ \max_i \big\{ |\Omega_i| \big( \alpha_i + 2d \beta_i |\Omega_i|^{-1/d} \big) \big\} \Big\}, \qquad \sum_i |\Omega_i| = |\Omega|. \tag{15} $$

3.4 Lower Bounds for the Minimax Load Distribution Problem

In this section we derive analytic lower bounds for (15).

3.4.1 Bound 1

Let us denote the execution time of a given partitioning by $t_m = \max_i \{t_i\}$. Then, for every i,

$$ \frac{t_m}{\alpha_i} \geq |\Omega_i| + 2d\, \frac{\beta_i}{\alpha_i}\, |\Omega_i|^{(d-1)/d}, \tag{16} $$

$$ t_m \sum_i \frac{1}{\alpha_i} \geq \sum_i |\Omega_i| + 2d \sum_i \frac{\beta_i}{\alpha_i}\, |\Omega_i|^{(d-1)/d}, \tag{17} $$

$$ t_m \geq \Big( \sum_i \frac{1}{\alpha_i} \Big)^{-1} \Big( |\Omega| + 2d \sum_i \frac{\beta_i}{\alpha_i}\, |\Omega_i|^{(d-1)/d} \Big), \tag{18} $$

$$ t'_{\mathrm{opt}} = \min_{\Gamma_n} \{t_m\} \geq \min_{\Gamma_n} \Big( \sum_i \frac{1}{\alpha_i} \Big)^{-1} \Big( |\Omega| + 2d \sum_i \frac{\beta_i}{\alpha_i}\, |\Omega_i|^{(d-1)/d} \Big). \tag{19} $$

Minimizing the right-hand side of (19) gives a lower bound for $t'_{\mathrm{opt}}$. This requires minimizing the term involving $|\Omega_i|^{(d-1)/d}$. Let q be the index of the partition with the smallest ratio of the normalized transfer cost to the normalized computation cost (i.e. $q = \arg\min_i \beta_i/\alpha_i$); then

$$ \sum_i \frac{\beta_i}{\alpha_i}\, |\Omega_i|^{(d-1)/d} \geq \frac{\beta_q}{\alpha_q} \sum_i |\Omega_i|^{(d-1)/d} \tag{20} $$

$$ \geq \frac{\beta_q}{\alpha_q} \Big( \sum_i |\Omega_i| \Big)^{(d-1)/d} = \frac{\beta_q}{\alpha_q}\, |\Omega|^{(d-1)/d}, \tag{21} $$

where we use $x^p + y^p \geq (x + y)^p$ for $x, y \geq 0$ and $0 \leq p \leq 1$ to derive (21). Therefore,

$$ t'_{\mathrm{opt}} \geq \Big( \sum_i \frac{1}{\alpha_i} \Big)^{-1} \Big( 1 + 2d\, \frac{\beta_q}{\alpha_q}\, |\Omega|^{-1/d} \Big) |\Omega|. \tag{22} $$

The right-hand side of (22) gives a lower bound for $t'_{\mathrm{opt}}$. A solution close to this lower bound can be found when the additional cost of computing boundary cells is significantly lower than the cost of time-marching regular cells ($\beta_i/\alpha_i \ll 1$), or ideally when $\beta_i = 0$, in which case $t'_{\mathrm{opt}}$ from (18) is given by

$$ t'_{\mathrm{opt}} = \Big( \sum_i \frac{1}{\alpha_i} \Big)^{-1} |\Omega|. \tag{23} $$

We argue that this is possible when the partition sizes are given by

$$ |\Omega_j| = \frac{1}{\alpha_j} \Big( \sum_i \frac{1}{\alpha_i} \Big)^{-1} |\Omega| = \frac{t'_{\mathrm{opt}}}{\alpha_j}. \tag{24} $$

The proof is by contradiction. First, consider that for some j we have $|\Omega_j| > t'_{\mathrm{opt}}/\alpha_j$. This results in $t_j > t'_{\mathrm{opt}}$, which violates the condition that $t'_{\mathrm{opt}}$ is the maximum cost of execution of any partition for a given partitioning scheme. Conversely, consider that for some j we have $|\Omega_j| < t'_{\mathrm{opt}}/\alpha_j$. We have already established that $|\Omega_i| \leq t'_{\mathrm{opt}}/\alpha_i$ for $i \neq j$. Summing up the inequalities for all i we have

$$ |\Omega_j| + \sum_{i \neq j} |\Omega_i| < \frac{t'_{\mathrm{opt}}}{\alpha_j} + \sum_{i \neq j} \frac{t'_{\mathrm{opt}}}{\alpha_i}. \tag{25} $$

Using (23) and noting that $\sum_i |\Omega_i| = |\Omega|$, both sides of the above inequality reduce to $|\Omega|$, which cannot be, and the proof is complete.

According to (24), where the computational costs associated with boundary conditions and data transfers are low, the optimal partitioning scheme is one that ensures all computational resources take the same amount of time to complete one iteration of the algorithm. This is consistent with the desire to ensure computational resources will not be idle when possible.

3.4.2 Bound 2

The bound given in the previous section is tight where $\alpha_i \gg \beta_i$ and becomes less tight as the cost of data transfers increases. We derive a tighter bound under such conditions in this section. Assuming that $|\Omega_i| > 1$ and using (15), we can write

$$ t'_{\mathrm{opt}} \geq \min_{\Gamma_n} \Big\{ \max_i \big\{ |\Omega_i|^{(d-1)/d} (\alpha_i + 2d\beta_i) \big\} \Big\}, \tag{26} $$

where we replaced $\alpha_i |\Omega_i|$ with the smaller term $\alpha_i |\Omega_i|^{(d-1)/d}$. In a manner similar to the proof given in the previous section, it can be shown that the right-hand side of (26) is minimized when, for all i and j,

$$ |\Omega_i|^{(d-1)/d} (\alpha_i + 2d\beta_i) = |\Omega_j|^{(d-1)/d} (\alpha_j + 2d\beta_j), \tag{27} $$

and $|\Omega_j|$ is given by

$$ |\Omega_j| = |\Omega| \Big[ (\alpha_j + 2d\beta_j)^{d/(d-1)} \sum_i (\alpha_i + 2d\beta_i)^{-d/(d-1)} \Big]^{-1}. \tag{28} $$

A new lower bound is then given by

$$ t'_{\mathrm{opt}} \geq |\Omega|^{(d-1)/d} \Big( \sum_i (\alpha_i + 2d\beta_i)^{-d/(d-1)} \Big)^{-(d-1)/d}. \tag{29} $$

The tightness of the bound improves as the number of dimensions increases. We also note that the bound given in (29) is loose when the cost of data transfer is not significant compared to the cost of computations, but improves as data transfer becomes the bottleneck. This trend is opposite to that of the bound derived in the previous section. Therefore, it makes sense to combine the two bounds and use their maximum as the lower bound.

3.5 Numerical Optimization

The problem given in (15) can be solved using constrained numerical optimization methods. The objective function

$$ f(|\Omega_1|, \ldots, |\Omega_n|) = \max_i \big\{ |\Omega_i| \big( \alpha_i + 2d\beta_i |\Omega_i|^{-1/d} \big) \big\}, \qquad \sum_i |\Omega_i| = |\Omega| \tag{30} $$

is nonlinear and non-convex with linear constraints. Alternatively, one can parameterize in terms of the partition dimension size $x_i = |\Omega_i|^{1/d}$, in which case the cost function is convex in $x_i$ but subject to nonlinear constraints. Either way, standard convex optimization methods cannot be used. We use a nonlinear constrained optimization method based on sequential quadratic programming (SQP) and a quasi-Newton algorithm to solve the problem [22] (footnote 2). As usual, initialization close to the global minimum is an important element for improving the success of the optimization algorithm. We initialize the algorithm with an initial guess in accordance with (24) or (28), using the equation that corresponds to the tighter of the two bounds. In practice, for a range of experiments, the optimization converges quickly (typically in less than 100 iterations) and, given the simplicity of the cost function, the computations are efficient. We defer further discussion on the experiments to Section 4.

Footnote 2: An implementation of an active-set method based on SQP and a quasi-Newton algorithm is given by the MATLAB Optimization Toolbox function fmincon.

3.6 A Heuristic Algorithm for Load Distribution

Up to now, we have discussed methods to determine the size and dimensions of partitions subject to the relaxed constraint that the sum of partition sizes equals the size of the computational domain. Once the solution of the relaxed problem is found, the partitions need to be fit into the computational domain under the constraint that they must cover its entire volume. There are several reasons that an exact fit may not be possible. In practice, dimensions of the computational domain and its partitions belong to the set of positive integers. This inherently means that the optimal (footnote 3) partitioning size and dimensions cannot be exactly met except for carefully engineered dimensions. We also note that even where optimal partition sizes and dimensions are feasible, partitioning of the computational domain into an exact set of partitions may not be possible (e.g. try partitioning a square into two squares). Intuitively, as the number of partitions increases these limitations become less of an issue and a close to optimal partitioning can be achieved.

Footnote 3: In this section, the use of the term optimal refers to the results obtained from the numerical optimization algorithm. We realize that given the nonlinear nature of the problem, strict optimality of the optimization algorithm is not guaranteed.

We propose a heuristic algorithm for partitioning. Our intuition is that the algorithm should be faithful to the original partition sizes and shapes (which are hyper-cubes) to the extent possible. The method is called the balanced partitioning algorithm and is given as follows. Let $\Omega$ be a computational domain to be distributed to n computational resources.
(1) Measure the normalized computational cost $\alpha_i$ and transfer cost $\beta_i$ of each resource.
(2) Compute the set of optimal partition sizes $S = \{|\Omega_1|, |\Omega_2|, \ldots, |\Omega_n|\}$ using a properly initialized nonlinear optimization algorithm.
(3) Partition S into two sets $S_1$ and $S_2$ such that the difference between the sums of the elements of $S_1$ and $S_2$ is minimal.
(4) Partition $\Omega$ along the largest dimension into two partitions whose sizes are given by the sums of the elements of $S_1$ and $S_2$. Adjust any integer round-off errors introduced as a result.
(5) Replace S with $S_1$ and $S_2$ and $\Omega$ with the newly created partitions, and continue steps 3-5 until $S_1$ and $S_2$ cannot be further partitioned.

The intuition behind partitioning the domain along its largest dimension is to maintain the lowest possible aspect ratio (the ratio of the largest dimension to the smallest dimension). This is an attempt to make the partitions closer to hyper-cubes as the partitioning algorithm progresses.

The third step of the algorithm is known as the number partitioning problem. The problem is whether a set of numbers can be partitioned into two halves of equal sum or, more generally, finding two partitions that minimize the maximum partition sum.
The number partitioning problem is NP-complete [23]; however, there are simple heuristic algorithms that can, in many instances, solve the problem optimally or near optimally in less than $O(n^2)$ time. The best heuristic algorithm is the differencing algorithm, given in [24]. Briefly, the differencing algorithm reduces the size of the set by one in each iteration by replacing the two largest numbers with their absolute difference. This is equivalent to deciding that the two largest numbers will go into different sets without actually committing which set receives which number at this time. The forward leg of the algorithm terminates when the set is reduced to a single number. The last number standing represents the discrepancy of the two sets (the absolute difference of their sums). The algorithm then backtracks and at each step replaces one difference number with its components in such a way that the discrepancy of the two sets remains constant. For a detailed discussion and an example of the algorithm refer to [24].

An example of partitioning a 2D computational domain of 10 × 10 cells across 5 resources, where the partition sizes are given as S = {4, 17, 22, 26, 31}, is given in Table 2. Using the balanced partitioning algorithm results in slight adjustments to partition sizes at the end, with the final partition sizes being {5, 15, 20, 25, 35} as shown in the last row of the table.

Table 2. Partitioning a sample computational domain.
Partitions | Balanced Partitions | Adjusted Sums | Domain
{4, 17, 22, 26, 31} | {4, 22, 26}, {17, 31} | 50, 50 |
{4, 22, 26} | {4, 22}, {26} | 25, 25 |
{17, 31} | {17}, {31} | 15, 35 |
{4, 22} | {4}, {22} | 5, 20 |

We compare the performance of the balanced partitioning algorithm with the stripe partitioning algorithm, where the computational domain is simply partitioned along a single axis. For the stripe algorithm, we choose partition sizes proportional to each resource's performance (i.e. in accordance with (24)). This is similar to what is typically used in FDTD parallelization on homogeneous clusters today. Fig. 2 shows one simple stripe partitioning of the previous example. The first thing to notice is that the discrepancy between achievable partitions and desired partitions is higher. This is the result of larger round-off errors due to the integral dimensions of the computational domain, which in turn translates into an even less optimal distribution of the computational load.
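Steps 3 to 5 of the balanced partitioning algorithm can be sketched as a short recursion. The code below is an illustration under stated assumptions, not the authors' implementation. The differencing heuristic is written with a heap that carries the two candidate sets along with each difference, which folds the backtracking leg into the forward pass. The function names, the tie-breaking order, and the rounding rule (round the slab thickness of the lexicographically first group and give the remainder to the other) are all hypothetical choices.

```python
import heapq

def differencing_split(sizes):
    """Two-way split by the differencing heuristic; returns two lists of indices."""
    # Heap entries: (-difference, tie_breaker, heavier_side, lighter_side).
    heap = [(-s, i, [i], []) for i, s in enumerate(sizes)]
    heapq.heapify(heap)
    counter = len(sizes)
    while len(heap) > 1:
        d1, _, a_heavy, a_light = heapq.heappop(heap)   # largest difference
        d2, _, b_heavy, b_light = heapq.heappop(heap)   # second largest
        # Put the two heavier sides in opposite sets; the new difference is |d1| - |d2|.
        heapq.heappush(heap, (d1 - d2, counter, a_heavy + b_light, a_light + b_heavy))
        counter += 1
    _, _, left, right = heap[0]
    return left, right

def balanced_partition(shape, sizes):
    """Recursively bisect `shape`; returns a list of (target_size, sub_shape) pairs.

    Assumes the domain has at least two cells along the axis being cut.
    """
    if len(sizes) == 1:
        return [(sizes[0], tuple(shape))]
    left, right = differencing_split(sizes)
    groups = sorted([sorted(sizes[i] for i in left), sorted(sizes[i] for i in right)])
    axis = max(range(len(shape)), key=lambda a: shape[a])   # largest dimension
    cross = 1
    for a, n in enumerate(shape):
        if a != axis:
            cross *= n                                      # cells per unit thickness
    # Assumed rounding rule: integer thickness for the first group, remainder to the second.
    thick = min(max(round(sum(groups[0]) / cross), 1), shape[axis] - 1)
    shape_a, shape_b = list(shape), list(shape)
    shape_a[axis] = thick
    shape_b[axis] = shape[axis] - thick
    return balanced_partition(shape_a, groups[0]) + balanced_partition(shape_b, groups[1])

def cells(sub_shape):
    """Number of cells in a hyper-rectangle."""
    total = 1
    for n in sub_shape:
        total *= n
    return total

result = balanced_partition((10, 10), [4, 17, 22, 26, 31])
achieved = sorted(cells(sub_shape) for _, sub_shape in result)
```

For the 10 × 10 example, this particular rounding rule yields achieved sizes of 5, 15, 20, 25 and 35 cells. A different tie-breaking or rounding convention could give a different but equally valid layout.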

16 (a) Strpe (b) Balanced Fg. 2. Comparson of the strpe and balanced parttonng algorthms. The balanced parttonng results n less dscrepancy compared wth the desred partton szes. 4 Results In ths secton, we present a number of smulaton results for heterogeneous and homogeneous clusters and compare the performance of the balanced and strpe parttonng methods n respect to the derved bounds. We fnd t more ntutve to show the results n terms of achevable throughput rather than the computaton tme. The throughput s defned as the rato of the number of cells to the processng tme (.e. computaton or transfer tme). Ths has the added advantage of havng the results normalzed to the sze of the computatonal doman. The measurements wll be gven as recprocals of α and β n mega cells per second (MCells/s). We also note that n the context of throughput we wll be talkng about upper bounds whch are nversely related to computaton tme lower bounds. Example : Homogeneous GPU Cluster Doman: cells Cluster: homogeneous, 2-8 NVIDIA GT200 GPUs on a sngle host (certan motherboards allow up to 8 GT200 GPUs to be nstalled), 3D FDTD performance on a GT200 GPU α = 493 MCells/s Lnk: PCI-E x6, nomnal bandwdth: 8 GB/s, actual bandwdth measured at 2.4 GB/s and 4.3 GB/s on two dfferent motherboards, combned throughput s β = 50 and 93 MCells/s respectvely Notes: The throughput measurements were performed wth host memory allocated as standard page-able memory. The throughput can be mproved by usng page-locked (pnned) memory. However, pnned memory s a scarce resource and not sutable for typcally large memory demands of FDTD applcatons. The total bandwdth s constant and as we ncrease the number of GPUs the data transfer rate per GPU decreases. In Fg. 3, we show the performance of the cluster as a functon of the number of GPUs on two dfferent hosts wth dfferent actual PCI-E bandwdths. A number of observatons can be made from Fg. 
3: (a) the scalablty of the cluster mproves wth ncreased throughput; (b) the performance of the balanced par- 6

17 Performance (MCells/s) Example : Homogeneous Cluster of GPUs Bandwdth: 2.40 GB/s Upper Bound Optmal Parttonng Balanced Parttonng Strpe Parttonng Performance (MCells/s) Example : Homogeneous Cluster of GPUs Bandwdth: 4.30 GB/s Upper Bound Optmal Parttonng Balanced Parttonng Strpe Parttonng GPUs (a) Bandwdth: 2.4 GB/s GPUs (b) Bandwdth: 4.3 GB/s Fg. 3. Comparson of balanced and strpe parttonng algorthms. The balanced parttonng results n up to 47% mprovement n performance and s close to the performance of the optmal soluton. ttonng s close to the optmal soluton; (c) the balanced parttonng method outperforms the stpe method by up to 47% for the slower host and up to 38% for the faster host; (d) as the number of GPUs ncreases, the transfer rate per GPU decreases and the advantage of better parttonng s more emphatcally demonstrated; (e) the strpe method exhbts poor scalablty and the performance plateaus wth 5 or 6 GPUS whereas the balanced parttonng contnues to scale. Example 2: Heterogeneous GPU Cluster: Doman: cells Cluster: heterogeneous, (a) -4 NVIDIA GT200 GPUs, 3D FDTD performance measured at α = 493 MCells/s, (b) -4 NVIDIA GT80 GPUs, 3D FDTD performance measured at α = 40 MCells/s Lnk: PCI-E x6, nomnal bandwdth: 8 GB/s, actual bandwdth measured at 2.4 GB/s and 4.3 GB/s on two dfferent motherboards, combned throughput s β = 50 and 93 MCells/s respectvely In ths example, we look at a heterogeneous cluster of GPUs nstalled n a sngle host. The cluster comprses equal numbers of GT80 and GT200 GPUs. The results depcted n Fg. 4 demonstrate the superor performance and scalablty of the balanced parttonng for a heterogeneous cluster. The results are more or less consstent wth earler observatons n the prevous example. Example 3: Homogeneous CPU Cluster: Doman: cells Cluster: homogeneous, 4-32 nodes each wth Quad-core Intel Core GHz CPUs, 3D FDTD performance measured at α = 72.8 MCells/s Lnk: (a) Ggabt Ethernet, actual bandwdth: 0. GB/s, β = 2. 7

18 Performance (MCells/s) Example 2: Heterogeneous Cluster of GPUs Bandwdth: 2.40 GB/s Upper Bound Optmal Parttonng Balanced Parttonng Strpe Parttonng Performance (MCells/s) Example 2: Heterogeneous Cluster of GPUs Bandwdth: 4.30 GB/s Upper Bound Optmal Parttonng Balanced Parttonng Strpe Parttonng GPUs (a) Bandwdth: 2.4 GB/s GPUs (b) Bandwdth: 4.3 GB/s Fg. 4. Comparson of balanced and strpe parttonng algorthms on a heterogeneous cluster. The balanced parttonng results n up to 34% mprovement n performance and s close to the performance of the optmal soluton. MCells/s (b) 0 Ggabt Ethernet throughput, actual bandwdth: 0.6 GB/s, β = 2.5 MCells/s Note: OpenMP s used to parallelze the FDTD code on each node. In ths example, a larger computatonal doman s dstrbuted to a cluster of PCs. We compare the scalablty and performance of the cluster over a Ggabt and 0 Ggabt network. The smulatons predct that for a Ggabt network the cluster saturates wth 8 nodes when balanced parttonng s used. Usng the strpe parttonng method saturates the cluster wth only 4 nodes. Fg. 5(a) also demonstrates that usng the balanced parttonng method results n more than 3% mprovement n peak performance compared to the strpe method. In Fg. 5(b) the network bandwdth s ncreased by a factor of 6. Ths has a sgnfcant mpact on the performance of the cluster. The balanced parttonng method scales up to 32 nodes now and acheves a peak performance of 669 MCells/s compared to 80 MCells/s on the Ggabt network. Also note that the peak performance of the balanced parttonng s almost 76% hgher than the peak performance of the strpe parttonng method. The advantage of PC clusters over GPU clusters s ther larger memory sze. Ths makes smulaton of larger computatonal domans possble, albet at the cost of lower performance. As shown n prevous examples, a cluster of GPUs on a sngle host exceeds a performance level of 000 MCells/s. 
Perhaps to solve the dilemma, one can create a cluster of multi-GPU nodes to address both the memory capacity and the performance problems. However, such a cluster will hardly scale unless one is prepared to invest in higher-bandwidth technologies such as quad data rate (QDR) InfiniBand.

Fig. 5. Comparison of balanced and stripe partitioning algorithms on a homogeneous cluster of PCs and for different network bandwidths. The network bandwidth is the main bottleneck. Increasing network bandwidth improves the scalability of the cluster. [Example 3: Homogeneous Cluster of CPUs. Panels: (a) Bandwidth: 0.1 GB/s; (b) Bandwidth: 0.6 GB/s. Vertical axis: performance (MCells/s). Curves: upper bound, optimal partitioning, balanced partitioning, stripe partitioning.]

Example 4: Heterogeneous CPU Cluster:
Domain: cells
Cluster: heterogeneous, (a) 2-16 nodes each with quad-core Intel Core i7 GHz CPUs, 3D FDTD performance measured at α = 72.8 MCells/s, (b) 2-16 nodes each with quad-core Intel Core Duo 2.66 GHz CPUs, 3D FDTD performance measured at α = 39.1 MCells/s
Link: (a) Gigabit Ethernet, actual bandwidth: 0.1 GB/s, β = 2.1 MCells/s (b) 10 Gigabit Ethernet throughput, actual bandwidth: 0.6 GB/s, β = 12.5 MCells/s

Fig. 6. Comparison of balanced and stripe partitioning algorithms on a heterogeneous cluster of PCs and for different network bandwidths. The network bandwidth is the main bottleneck. Increasing network bandwidth improves the scalability of the cluster. [Example 4: Heterogeneous Cluster of CPUs. Panels: (a) Bandwidth: 0.1 GB/s; (b) Bandwidth: 0.6 GB/s. Vertical axis: performance (MCells/s). Curves: upper bound, optimal partitioning, balanced partitioning, stripe partitioning.]

In our last example, we look at results from a heterogeneous cluster of up to 32 nodes. The cluster comprises equal numbers of quad-core Core i7 and quad-core Core Duo nodes. Despite the disparity in performance of the nodes, the balanced partitioning achieves reasonable scalability with the faster network.
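The scaling behaviour reported in the examples can be illustrated with a toy model. The sketch below is not the balanced partitioning algorithm of this paper and does not reproduce its simulations. It assumes a cubic domain of `n_side`^3 cells (the side length 256 is hypothetical), identical devices with compute throughput `alpha` in MCells/s, a link whose throughput `beta_total` in MCells/s is shared equally by all devices, and an interior block that exchanges one layer of cells per neighbouring face in every step. The function names are illustrative only. Under these assumptions it compares cutting the domain into slabs (stripes) with cutting it into a regular grid of blocks.

```python
def boundary_cells(n_side, grid):
    """Cells exchanged per step by an interior block of a px*py*pz grid.

    Assumed model: one layer of cells per neighbouring face.
    """
    dims = [n_side / p for p in grid]          # block edge lengths
    total = 0.0
    for axis, p in enumerate(grid):
        faces = min(2, p - 1)                  # neighbours along this axis
        others = [dims[k] for k in range(3) if k != axis]
        total += faces * others[0] * others[1]  # face area in cells
    return total


def performance(n_side, grid, alpha, beta_total):
    """Predicted throughput in MCells/s for a given block grid.

    alpha: per-device compute throughput (MCells/s).
    beta_total: link throughput (MCells/s), assumed shared by all devices.
    """
    n_dev = grid[0] * grid[1] * grid[2]
    total_mcells = n_side ** 3 / 1e6
    compute_time = total_mcells / n_dev / alpha
    exchange_time = boundary_cells(n_side, grid) / 1e6 / (beta_total / n_dev)
    return total_mcells / (compute_time + exchange_time)


def best_grid(n_side, n_dev):
    """Block grid with n_dev blocks that minimizes the exchanged cells."""
    grids = [
        (px, py, n_dev // (px * py))
        for px in range(1, n_dev + 1)
        for py in range(1, n_dev + 1)
        if n_dev % (px * py) == 0
    ]
    return min(grids, key=lambda g: boundary_cells(n_side, g))


if __name__ == "__main__":
    N, ALPHA, BETA = 256, 493.0, 50.0  # N is hypothetical
    for n in range(1, 9):
        stripe = performance(N, (n, 1, 1), ALPHA, BETA)
        block = performance(N, best_grid(N, n), ALPHA, BETA)
        print(f"{n} devices: stripe {stripe:7.1f}  block {block:7.1f} MCells/s")
```

In this toy model the slab cut stops improving after a handful of devices, because every slab always exchanges two full faces while its share of the link shrinks. The block grid does better whenever the device count factorizes into a compact grid, which is the qualitative effect that a surface-aware partitioning exploits.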

5 Discussion

For ease of reference, the main results of the paper are summarized here:

The problem of distribution of FDTD load to a heterogeneous cluster can be formulated as a minimax problem over n-element partitions of the computational domain $\Omega$:

$$t_{opt} = \min_{\Gamma_n} \max_i \Big\{ \alpha_i |\Omega_i| + \sum_j \beta_{ij} |\partial\Omega_i \cap \partial\Omega_j| + \gamma_i |\partial\Omega_i \cap \partial\Omega| \Big\}.$$

A simpler problem is formulated by relaxing the conditions of the original problem as

$$t_{opt} = \min_{\Gamma_n} \max_i \big\{ \alpha_i |\Omega_i| + \beta_i |\partial\Omega_i| \big\}, \qquad \Gamma_n = \Big\{ \Omega_i : \sum_i |\Omega_i| = |\Omega| \Big\}.$$

Two lower bounds can be found for the relaxed problem. Their combination gives a single lower bound on $t_{opt}$, expressed in terms of $\alpha_i$, $\beta_i$, $d$ and $|\Omega|$.

The results set an upper bound on achievable performance improvements that can be used to predict the extent to which parallelization is practically beneficial. In a dynamic cluster, where resources may become available during the lifetime of a computation, one may have to decide whether it is beneficial to repartition the problem to utilize the newly available resources. Redistribution of the problem may incur significant traffic. An algorithm may therefore defer redistribution until enough computational resources are available to justify the overhead, or may even determine that redistribution is detrimental to the overall performance. One can simply use the bounds or, better still, estimate the performance of the redistributed configuration before making such decisions.

We proposed a heuristic algorithm for partitioning the computational domain and showed by experiment that the algorithm achieves performance levels close to those of the ideal partition sizes obtained by the numerical optimization algorithm. The algorithm is simple and efficient.

The results show that significant performance gains can be achieved simply by using a better partitioning algorithm. Optimal partitioning also improves scalability, which means that computational resources can be utilized more efficiently.

The burden of optimizing partitions as described in this paper is negligible compared to the effort of parallelizing FDTD code. A properly parallelized FDTD code should in principle be able to run with non-equal partitions and should benefit from the method presented in this paper with minimal effort. As an additional benefit, existing FDTD applications can run efficiently on heterogeneous clusters of similar technology. Fully heterogeneous applications that can run across technology boundaries will be a natural extension.

The question remains whether the numerical optimization finds the global minimum of (5). Based on the experiments, we believe the numerical optimization results are optimal or very close to optimal. This claim could be asserted more confidently if one were able to derive tighter bounds or indeed prove the optimality of the solution.

We suspect the methodology presented in this work is not limited to FDTD and can lend itself to similar analyses in other computational problem domains.

6 Acknowledgements

This work was supported in part by the Australian Research Council (ARC) Discovery Project DP09349 and in part by the ARC/Microsoft Linkage Project LP. The views expressed herein are those of the authors and are not necessarily those of the funding organizations.

References

[1] Yee, K.: Numerical solution of initial boundary value problems involving Maxwell's equations in isotropic media. IEEE Transactions on Antennas and Propagation 14 (1966)
[2] Taflove, A., Hagness, S.C.: Computational Electrodynamics: The Finite-Difference Time-Domain Method. Third edn. Artech House Inc., Norwood, MA, USA (2005)
[3] Pinton, G.F., Dahl, J., Rosenzweig, S., Trahey, G.E.: A heterogeneous nonlinear attenuating full-wave model of ultrasound. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 56(3) (March 2009)
[4] Mur, G.: Absorbing boundary conditions for the finite-difference approximation of the time-domain electromagnetic-field equations. IEEE Trans. on Electromagnetic Compatibility 23(4) (1981)
[5] Berenger, J.: A perfectly matched layer for the absorption of electromagnetic waves. Journal of Computational Physics 114(2) (October 1994)

[6] Sacks, Z.S., Kingsland, D.M., Lee, R., Lee, J.F.: A perfectly matched anisotropic absorber for use as an absorbing boundary condition. IEEE Trans. on Antennas and Propagation 43(12) (1995)
[7] Gedney, S.D.: An anisotropic perfectly matched layer-absorbing medium for the truncation of FDTD lattices. IEEE Trans. on Antennas and Propagation 44(12) (1996)
[8] Roden, J.A., Gedney, S.D.: Convolutional PML (CPML): An efficient FDTD implementation of the CFS-PML for arbitrary media. Microwave and Optical Technology Letters 27(5) (2000)
[9] OpenMP Application Programming Interface, version 3.0. OpenMP (2009)
[10] Gropp, W., Lusk, E., Skjellum, A.: Using MPI: Portable Parallel Programming with the Message Passing Interface. Second edn. MIT Press, Cambridge, MA, USA (1999)
[11] Guiffaut, C., Mahdjoubi, K.: A parallel FDTD algorithm using the MPI library. IEEE Antennas and Propagation Magazine 43(2) (April 2001)
[12] Kawaguchi, H., Takahara, K., Yamauchi, D.: Design study of ultrahigh-speed microwave simulator engine. IEEE Transactions on Magnetics 38(2) (April 2002)
[13] Chen, W., Kosmas, P., Leeser, M., Rappaport, C.: An FPGA implementation of the two-dimensional finite-difference time-domain (FDTD) algorithm. In: Proc. International Symposium on Field Programmable Gate Arrays. (2004)
[14] Durbano, J.P., Humphrey, J.R., Ortiz, F.E., Curt, P.F., Prather, D.W., Mirotznik, M.S.: Hardware acceleration of the 3D finite-difference time-domain method. In: Proc. IEEE Antennas and Propagation Society Int. Symposium. Volume 1. (2004)
[15] Krakiwsky, S.E., Turner, L.E., Okoniewski, M.M.: Acceleration of finite-difference time-domain (FDTD) using graphics processor units (GPU). In: IEEE Int. Microwave Symposium. Volume 2. (2004)
[16] Hughes, M.C., Stuchly, M.A.: Hybrid parallel finite difference time domain simulation of nanoscale optical phenomena. In: Int. Conf. on Wireless Communications and Applied Computational Electromagnetics. (2005)
[17] Adams, S., Payne, J., Boppana, R.: Finite difference time domain (FDTD) simulations using graphics processors. In: High Performance Computing Modernization Program Users Group Conference. (2007)
[18] Stefanski, T.P., Drysdale, T.D.: Acceleration of the 3D ADI-FDTD method using graphics processor units. In: IEEE Int. Microwave Symposium. (2009)
[19] Liuge, D., Kang, L., Fanmin, K.: Parallel 3D finite difference time domain simulations on graphics processors with CUDA. In: Int. Conf. on Computational Intelligence and Software Engineering. (December 2009) 4
[20] Takada, N., Shimobaba, T., Masuda, N., Ito, T.: High-speed FDTD simulation algorithm for GPU with compute unified device architecture. In: Proc. IEEE Antennas and Propagation Society Int. Symposium. (2009) 4
[21] Compute Unified Device Architecture (CUDA) Programming Guide, version 2.2. NVIDIA (2009)
[22] Bonnans, J.F., Gilbert, J.C., Lemaréchal, C., Sagastizábal, C.A.: Numerical Optimization: Theoretical and Practical Aspects. Second edn. Springer (2006)
[23] Mertens, S.: The easiest hard problem: Number partitioning. In Percus, A., Istrate, G., Moore, C., eds.: Computational Complexity and Statistical Physics, New York, Oxford University Press (2006)
[24] Karmarkar, N., Karp, R.M.: The differencing method of set partitioning. Technical report, University of California at Berkeley, Berkeley, CA, USA (1983)
