Hardware-Software Co-Synthesis of Low Power Real-Time Distributed Embedded Systems with Dynamically Reconfigurable FPGAs

Li Shang and Niraj K. Jha
Dept. of EE, Princeton University
{lshang, jha}@ee.princeton.edu

Abstract

In this paper, we present a multi-objective hardware-software co-synthesis system for multi-rate, real-time, low power distributed embedded systems consisting of dynamically reconfigurable FPGAs, processors, and other system resources. We use an evolutionary algorithm based framework for automatically determining the quantity and type of different system resources, and then assigning tasks to different processing elements (PEs) and communications to communication links. For FPGAs, we propose a two-dimensional, multi-rate cyclic scheduling algorithm, which determines task priorities based on real-time constraints and reconfiguration overhead information, and then schedules tasks based on the resource utilization and reconfiguration condition in both space and time. The FPGA scheduler is integrated in a list-based system scheduler. To the best of our knowledge, this is the first multi-objective co-synthesis system that uses dynamically reconfigurable devices to synthesize a distributed embedded system and targets simultaneous optimization of system price and power. Experimental results indicate that our method can reduce schedule length by an average of 41.0% and reconfiguration power by an average of 46.0% compared to the previous method. It also yields multiple system architectures which trade off system price and power under real-time constraints.

1. Introduction

Hardware-software co-synthesis entails automatic derivation of the hardware-software architecture of distributed embedded systems to satisfy multi-objective goals, such as performance, price and power. Allocation, assignment and scheduling are the three key steps in the hardware-software co-synthesis design flow. Allocation determines the type and number of PEs and communication links in the system architecture. Assignment determines the mapping of tasks (communications) to PEs (links). Scheduling determines the time when tasks and communications are executed.
An FPGA is a commonly used PE in distributed embedded systems. Compared with ASICs, FPGAs offer a parallel and flexible hardware platform. In order to reduce the reconfiguration overhead, many new reconfigurable architectures have been proposed [1]-[5]. In dynamically reconfigurable FPGAs, the embedded configuration storage circuitry can be updated selectively in a few clock cycles, without disturbing the execution of the remaining logic. Such FPGAs offer the potential for higher performance as well as the ability to efficiently support multi-mode [6] requirements for embedded systems.

With the success of battery-based personal computing devices and wireless communication systems, low power has become a key issue in system design. Although their flexibility makes dynamically reconfigurable FPGAs a good solution for portable applications, the power consumption problem cannot be neglected. On-line reconfiguration not only introduces a delay overhead in execution, but also a power overhead (which can account for half of the energy consumption). This makes the FPGA power optimization problem more complex than that for general-purpose processors or ASICs.

Acknowledgements: This work was supported by DARPA under contract no. DAAB07-00-C-L516.

1.1 Previous Work

The problem of dynamically reconfigurable FPGAs is addressed both in high-level synthesis [7]-[9] and system-level synthesis [15]-[20]. However, in system-level synthesis, the problem is much more complex. The execution time, power consumption, and reconfiguration overhead for each task and also the resource utilization and reconfiguration condition in the FPGA need to be considered. Allocation/assignment and scheduling, which are known to be NP-complete [10], need to be addressed in both the time and space domains. Most hardware-software co-synthesis algorithms do not tackle FPGAs [11]-[14]. In those that do [15]-[20], system price is the single optimization objective. In [15], multiple tasks are not allowed to execute concurrently on the same FPGA.
The approach in [17] uses mixed integer linear programming, which does not scale well to larger program sizes. Also, many algorithms make the simplifying assumption that the embedded system consists of just one processor and one FPGA [18]-[20].

1.2 Our Approach and Contributions

We use an evolutionary algorithm to tackle the problem of allocation and assignment. Such an algorithm has been shown to produce high-quality solutions in small run-times for the co-synthesis problem [14,15]. Multi-objective system requirements, such as price and power consumption, can be simultaneously optimized with this method. No limitation is imposed on the quantity of system resources.

Two requirements follow for the scheduler. First, since scheduling is performed in the inner loop of co-synthesis, a relatively accurate heuristic scheduler with a low time complexity is a must. Second, efficient methods for reducing the delay and power overheads of dynamic reconfiguration are required. We propose a two-dimensional multi-rate cyclic scheduling heuristic. Depending on the resource and reconfiguration information, the scheduler treats each task fairly and tries to globally minimize the reconfiguration overhead.

Our co-synthesis system simultaneously optimizes system price and power consumption under real-time constraints. Multiple non-dominated solutions are provided to the system designer with different trade-offs between system price and power.

The rest of this paper is organized as follows. In Section 2, we define the various terms and models used in our co-synthesis system. In Section 3, we introduce the co-synthesis framework. In Section 4, we describe the scheduling algorithm. We present the experimental results in Section 5. Finally, we conclude in Section 6.

2. Preliminaries

In this section, we define the concepts and models used in our co-synthesis system.

2.1 Input Specification

The input specification is assumed to be in the form of a set of task graphs, as shown in Figure 1. A task graph is a directed acyclic graph, in which a node denotes a task while an edge between tasks represents data dependency and the amount of data transmitted. Each task graph has a period, which represents the interval between the earliest start times of its consecutive executions. In real-time systems, hard deadlines are associated with some of the tasks. An embedded system containing multiple task graphs with different periods is called multi-rate. The least common multiple (LCM) of the different task graph periods is defined as the hyperperiod. A valid schedule is defined over a hyperperiod [21].

2.2 Resource Library Model

In addition to task graphs, a co-synthesis algorithm also needs to be fed information from a resource library. This library consists of general-purpose processors, dynamically reconfigurable FPGAs, communication links and memories that can be used for co-synthesis.
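The hyperperiod definition of Section 2.1 can be illustrated with a few lines of Python. This is only a sketch: the function names are hypothetical, and the periods 15 and 30 are the ones that appear in Figure 1. It assumes Python 3.9 or later for `math.lcm`.

```python
from math import lcm


def hyperperiod(periods):
    """Least common multiple of all task-graph periods."""
    return lcm(*periods)


def copies_per_hyperperiod(periods):
    """How many times each task graph executes within one hyperperiod."""
    h = hyperperiod(periods)
    return [h // p for p in periods]


periods = [15, 30]                      # periods shown in Figure 1
print(hyperperiod(periods))             # 30
print(copies_per_hyperperiod(periods))  # [2, 1]
```

A schedule built over one hyperperiod therefore has to place two copies of the first task graph and one copy of the second.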
[Figure 1: Task graphs. The example shows task graphs with periods 15 and 30 (hyperperiod = 30) and task deadlines of 13, 19 and 34.]

In dynamically reconfigurable FPGAs, a one-dimensional reconfiguration model is commonly used, as shown in Figure 2 [1,3,5]. In this model, the atomic reconfiguration storage unit that can be dynamically updated is a frame. The reconfiguration of one frame does not disturb the execution of other frames. A task may reutilize a configuration pattern left behind by an earlier task. Multiple frames can only be reconfigured one by one. Each ready task needs to be loaded into contiguous frames in the FPGA reconfiguration memory before its execution. For each frame, the task has a specific configuration pattern. If the required configuration pattern cannot be found in the corresponding frame in the FPGA, a pattern miss is said to occur. Similar to caches in computers, compulsory, conflict, capacity and coherence misses can occur in the reconfiguration memory of FPGAs.

[Figure 2: A dynamically reconfigurable FPGA model. The dynamic reconfiguration resource is organized as Frame 0, Frame 1, ..., Frame N.]

The following parameters are defined for each dynamically reconfigurable FPGA in the resource library: price, number of configuration frames, reconfiguration bandwidth, number of reconfiguration bits for each frame, number of I/Os, idle power, and reconfiguration power per frame. For each task, the worst-case execution time, average power consumption, and memory requirement to store reconfiguration and computation data on each FPGA type in the resource library are specified.

General-purpose processors are described by price and a variable indicating whether or not the processor has a communication buffer. For each task, the worst-case execution time, average power consumption, preemption time, and memory load are specified for each type of processor in the resource library.

Communication links are described by price, packet size, average power consumption per packet, worst-case communication time per packet, pin requirement, idle power consumption, and contact counts.

Memory blocks are modeled by price, power and size. The memory requirement for computation and communication is specified for each task.
The information for each task, such as execution time and power consumption, can be characterized with the help of techniques such as those presented in [22]-[25].

3. Hardware/Software Co-synthesis Framework

Allocation, assignment and scheduling are the three main steps that need to be carried out in co-synthesis. We use an evolutionary algorithm based framework for allocation and assignment [15]. However, in [15], only system price was minimized. Also, it used an FPGA model that supported the execution of only one task at a time. This model is not suitable for the current generation of FPGAs. Our co-synthesis system does not impose any restrictions on the quantity of different system resources. We propose a new two-dimensional multi-rate scheduling algorithm for dynamically reconfigurable FPGAs in an embedded system. This aids the system-level scheduler. Scheduling is discussed in detail in Section 4.

An overview of our co-synthesis system is shown in Figure 3. Co-synthesis solutions are organized in clusters. Solutions within a cluster share the same allocation, but have different assignments. Solutions are initialized first. Then evolution operators, i.e., reproduction, mutation, and information trading, are used to transform allocation and assignment to obtain the next generation of solutions.
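Each generation of this evolutionary loop has to compare solutions in the price/power plane, using the Pareto-ranking step described in the paragraph that follows. A minimal sketch of that comparison is given here; the function names and the four sample solutions are hypothetical, and dominance is taken as strictly better in both objectives, as the text states.

```python
def dominates(a, b):
    """a dominates b if it is better (lower) in both price and power."""
    return a[0] < b[0] and a[1] < b[1]


def non_dominated(solutions):
    """Solutions that no other solution dominates."""
    return [s for i, s in enumerate(solutions)
            if not any(dominates(o, s)
                       for j, o in enumerate(solutions) if j != i)]


def pareto_rank(solutions):
    """Rank of a solution = number of other solutions that do not dominate it."""
    return [sum(1 for j, o in enumerate(solutions)
                if j != i and not dominates(o, s))
            for i, s in enumerate(solutions)]


# Hypothetical (price, power) pairs, not taken from the paper's results.
candidates = [(100, 500.0), (150, 300.0), (200, 350.0), (250, 200.0)]
print(non_dominated(candidates))  # [(100, 500.0), (150, 300.0), (250, 200.0)]
print(pareto_rank(candidates))    # [3, 3, 2, 3]
```

Here (200, 350.0) is dominated by (150, 300.0), so it receives a lower rank and would not be reported to the designer.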

Within each cluster, the assignment information may be mutated or traded between different solutions. Allocation information may be mutated or traded between different clusters. The rank of solutions is determined in a two-dimensional space: system price and power consumption, as shown in Figure 4. The Pareto-ranking method is used for this purpose. A solution's rank is equal to the number of other solutions that do not dominate it. In the figure, the solutions marked as Pareto-optimal are not dominated by any other solution, whereas the solutions denoted by B are dominated by at least one other solution. Finally, when a pre-specified number of generations has passed without improvement, invalid solutions, i.e., those that do not meet the deadlines, are pruned out, and the remaining non-dominated solutions are reported to the system designer (a solution dominates another if it is better in both power consumption and system price).

[Figure 3: Hardware/software co-synthesis framework. Blocks: input specification, resource library, initialization, allocation transformation, assignment transformation, scheduling engine (FPGA scheduling, processor scheduling, communication link scheduling), solution prioritizing, solution pruning, non-dominated solutions.]

[Figure 4: Potential solution space. Axes: system price and power consumption. Pareto-optimal solutions are distinguished from dominated solutions (B).]

4. Scheduling Algorithm

The scheduling algorithm is invoked in the inner loop of co-synthesis after the allocation and assignment steps. Tasks (communication events) need to be scheduled on different processors and FPGAs (communication links). Processors and communication links represent a sequential resource. Hence, they require a one-dimensional scheduling problem to be solved. However, scheduling for dynamically reconfigurable FPGAs is a two-dimensional problem, including both the time and space domains, as described next.

1. Scheduling sequence: At each scheduling point, multiple ready tasks may reside in the candidate pool. Each task may have a different time, resource and reconfiguration requirement, and power consumption. Thus, changing the scheduling order may have a significant impact on scheduling quality.

2. Location assignment policy: FPGAs are a parallel hardware platform. When a candidate task needs to be scheduled, there are many possible positions in the FPGA where the circuit implementing the task can be located. Assigning a task to a different location not only influences the current task, but may also impact the tasks scheduled either after or before it.

In this section, we dwell on the FPGA scheduling problem in significant detail.

4.1 Motivational Example

We next present an example to motivate our scheduling approach.

Example 1: Consider a system specification with three simple task graphs as shown in Figure 5. The allocation and assignment information for each task and communication event is shown in Table 1. Tasks 10 and 31 are assumed to have the same configuration patterns, while the configuration patterns for other tasks are assumed to be different. The reconfiguration time for each frame is 3.4 units. Based on the allocated PEs, the worst-case execution time for each task is shown in Table 2. The communication events C31 and C20 are executed on the bus that links the three PEs. Their communication times are 15 and 10 units, respectively. Based on the traditional assumption in distributed computing, we assume that the communication time between two tasks assigned to the same PE is zero. Two different scheduling approaches are applied to these task graphs as described below.

[Figure 5: Task graphs. Three task graphs (tasks 10-12, 20-21 and 30-32) with communication events C10, C11, C20, C30 and C31. Each graph has a period of 150 (hyperperiod = 150); the deadlines shown are 85, 90, 80 and 100.]

Table 1: Allocation and assignment information

Resource   Assigned tasks / events
Proc 1     Task 21
Proc 2     Task 32
FPGA       Other tasks
Bus        C20, C31

Table 2: Task execution time

Task                    10  11  12  20  21  30  31  32
Worst-case exec. time   33  11  25  50  20  26  33  37

Scheduling approach I:

Scheduling sequence: The order of scheduling tasks is based on a static slack-based priority [26]. The priority of a task is:

$$P_{task} = -\left(T_{latestready} - T_{earliestready}\right)$$

where T_earliestready is the earliest ready time of the task and T_latestready is its latest ready time. These two values are computed by conducting a topological search of the task graphs based on as-soon-as-possible (ASAP) and as-late-as-possible (ALAP) scheduling.
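The ASAP/ALAP computation behind the static slack of Approach I can be sketched as follows. This is an illustrative sketch rather than the implementation used in the paper: the function name and data layout are hypothetical, communication times are taken as zero, every sink task is assumed to carry a deadline, and the two example graphs are assumed to be simple chains using the execution times of Table 2 and deadlines of 85 and 90. It needs Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter


def static_slack(exec_time, preds, deadline):
    """Return task -> (latest ready time - earliest ready time).

    exec_time: task -> worst-case execution time
    preds:     task -> set of predecessor tasks
    deadline:  task -> deadline (must cover at least every sink task)
    """
    graph = {t: set(preds.get(t, ())) for t in exec_time}
    order = list(TopologicalSorter(graph).static_order())
    succs = {t: [] for t in exec_time}
    for t, ps in graph.items():
        for p in ps:
            succs[p].append(t)

    earliest = {}
    for t in order:                      # ASAP pass, sources first
        earliest[t] = max((earliest[p] + exec_time[p] for p in graph[t]),
                          default=0)

    latest = {}
    for t in reversed(order):            # ALAP pass, sinks first
        latest_finish = min([latest[s] for s in succs[t]]
                            + [deadline.get(t, float("inf"))])
        latest[t] = latest_finish - exec_time[t]

    return {t: latest[t] - earliest[t] for t in order}


exec_time = {"t10": 33, "t11": 11, "t12": 25, "t20": 50, "t21": 20}
preds = {"t11": {"t10"}, "t12": {"t11"}, "t21": {"t20"}}
deadline = {"t12": 85, "t21": 90}
slack = static_slack(exec_time, preds, deadline)
print(slack["t10"], slack["t20"])  # 16 20
```

Under a static slack-based ordering, the tasks with the smaller slack (here the first chain) would always be picked ahead of the others.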

Location assignment policy: Configuration patterns are allowed to be loaded into the FPGA before the task ready time. Configuration patterns left by earlier tasks can be utilized by later tasks. If there are several candidate positions in the FPGA where the task can be placed, the heuristic is to find a position that allows the task to start as soon as possible. This location assignment policy is similar to the greedy heuristic proposed in [18].

Table 3 (first row) shows the schedule length, reconfiguration resource utilization (the lower the better), and reconfiguration power consumption. The deadline is violated in this case. Figure 6 shows the FPGA, processor and bus schedule. The shaded blocks represent frame-wise reconfiguration. Reconfigurations introduced by compulsory misses are not shown, as they occur only once in the beginning of the first hyperperiod. The numbers in brackets indicate the sequence in which the tasks are scheduled.

Scheduling approach II: This is the approach we take.

Scheduling sequence: The order of scheduling tasks is determined dynamically by task priorities, which consider both real-time constraints and the reconfiguration overhead information (details given in Section 4.2).

Location assignment policy: The global reconfiguration information for all the tasks assigned to the FPGA is considered, as is the current state of the FPGA.

Table 3 (second row) and Figure 7 indicate the schedule quality for this approach.

From the above example, we find, not surprisingly, that different FPGA scheduling policies may dramatically influence the scheduling quality, i.e., the satisfaction of real-time constraints, reconfiguration resource utilization, and reconfiguration power consumption. First, since reconfiguration itself consumes a significant amount of power, minimizing the reconfiguration overhead is important for reducing system power consumption. Second, solutions that cannot satisfy real-time constraints necessitate faster (and generally more expensive) PEs. This increases system price. A good scheduling approach reduces schedule length and, indirectly, the system price and power consumption.

Table 3: Scheduling results

App.  Deadline   Schedule length  Reconfig. utilization  Reconfig. power
I     Violation  117              48%                    127 mW
II    Satisfied  80               23%                    61 mW

4.2 Two-Dimensional FPGA Scheduling Algorithm

In this section, we describe the two-dimensional scheduling algorithm for the dynamically reconfigurable FPGAs in the embedded system. Scheduling sequence and location assignment policy are the two important factors that need to be considered.

4.2.1 Scheduling Sequence

As in Approach I in Example 1, static slack-based priorities are commonly used to order tasks for scheduling on processors. The intuitive idea behind this approach is that a task with a longer slack can tolerate some delay and should yield to another task with a shorter slack. This approach works well on sequential resources. However, it is not suitable for FPGAs, which can execute multiple tasks concurrently. In the static slack-based priority approach, tasks along the critical path of one task graph may always be scheduled before tasks in other task graphs. This can prove to be quite sub-optimal for FPGAs. Our experimental results show that scheduling tasks from different task graphs in an interleaved fashion in FPGAs leads to better global schedules.

[Figure 6: Scheduling result for Approach I. FPGA schedule (with reconfiguration resource utilization), Processor 1, Processor 2 and bus schedules over the 150-unit hyperperiod. Order in which FPGA tasks are scheduled: 10 (1), 11 (2), 12 (3), 20 (4), 30 (5), 31 (6).]

[Figure 7: Scheduling result for Approach II. FPGA schedule (with reconfiguration resource utilization), Processor 1, Processor 2 and bus schedules over the 150-unit hyperperiod. Order in which FPGA tasks are scheduled: 10 (1), 20 (2), 30 (3), 11 (4), 31 (5), 12 (6).]

Another difference between processors and FPGAs is that in FPGAs, reconfiguration degrades performance and increases power consumption. Hence, in order to reduce the reconfiguration overhead, among the multiple ready tasks, those that can utilize the configuration patterns that already reside in the FPGA should be preferred. This means that the reconfiguration overhead should also influence task priority.
We propose a dynamic priority based approach, which dynamically updates the task priority, as follows.

$$priority_i = -\,latest\_finish\_time_i + exec\_time_i + reconf\_overhead_i - reconf\_inter_{i,j}$$

where latest_finish_time_i is the latest possible finish time for task i, which is computed by conducting a backward topological search of the task graph based on the task graph deadline information. exec_time_i is the worst-case execution time for task i on the assigned PE. reconf_overhead_i is the reconfiguration overhead of task i. reconf_inter_{i,j} is the inter-task reconfiguration time between adjacent tasks, which is updated dynamically, as follows. For each task i in the candidate pool that has the same configuration patterns as task j, which has been removed from the candidate pool for scheduling on the FPGA, the value of this variable is zero. In this approach, both the real-time constraints and reconfiguration overhead are considered, and tasks from different task graphs are treated fairly.

4.2.2 Location Assignment Policy

When a task is selected based on the above approach, multiple candidate locations may exist in the FPGA. The location assignment policy for a task not only influences the current task, but also the scheduling result for other tasks. Several factors need to be considered in this context, as discussed next.

Reconfiguration prefetch: Each task needs to be loaded into the FPGA before starting its execution. When the task implementation is large, the reconfiguration overhead may be substantial even in dynamically reconfigurable FPGAs. Reconfiguration prefetch can be employed to alleviate this problem. The system can try loading the task earlier and finish the reconfiguration before the ready time of the task. This may allow the reconfiguration time for the task to be hidden.

Configuration pattern reutilization: When a new task needs to be loaded into an FPGA, its configuration patterns need to be mapped into a set of contiguous frames. If subsets of the requisite configuration patterns already reside in the FPGA, loading of those data can be avoided. This helps reduce the reconfiguration overhead.

Eviction candidate: If not enough free space is left in the FPGA for new configuration patterns, some existing configuration patterns need to be evicted from the device. This problem is similar to the paging problem [27] and the weighted caching problem [28]. However, for our problem, all the frames assigned to a task need to be contiguous, which makes the problem more complex. The frames that need to be reconfigured for the incoming task may contain configuration patterns from different tasks, each executing at a different recurrent frequency (this is the number of times the task executes in the hyperperiod). When a configuration pattern with a higher recurrent frequency is evicted, it may introduce a new reconfiguration overhead later in the hyperperiod.
We define the eviction cost for a candidate position for this task based on a weighted sum of all the configuration patterns that need to be replaced, as follows:

$$evictioncost = \sum_{frame = start}^{end} freq_{recurrent}(frame)$$

where freq_recurrent is the recurrent frequency of the configuration pattern in the frame. The evictioncost is the weighted cost for this candidate position. The candidate positions with lower evictioncost should be preferred.

Fitting policy: The algorithm should try to avoid fragmentation of the FPGA configuration memory when choosing the candidate position from the FPGA.

Slack time utilization: Some of the possible candidate positions for a ready task may already have configuration patterns similar to the newly required ones. Using these positions would lower the eviction cost. However, the task may not be able to start execution immediately if assigned to such candidate positions. A greedy policy may neglect such candidate positions. This may adversely impact the schedule quality for other tasks. This is because reconfiguration hardware is a sequential resource: reconfiguration of one task delays reconfiguration of others. Therefore, reconfiguration overhead minimization should have a high priority. Thus, a better approach to the candidate position selection problem is to possibly choose a slightly inferior solution for the given task which helps find a better global solution. The slack of a task indicates to what extent an inferior solution can be tolerated for it. Since the task may share the slack with other tasks, which may not have been scheduled yet, the slack should not be completely used up by the current task. The portion of the slack which can be utilized for the task in question should be the slack divided by the depth of the sub task graph (the root vertex of the sub task graph is the current task, j), as follows:

$$toleratestarttime_j = starttime_j + \frac{slack_j}{depth_{subgraph}}$$

where starttime_j is the ready time of task j, depth_subgraph is the depth of the sub task graph in terms of the number of tasks, and slack_j is the slack of task j. toleratestarttime_j is the delayed start time that task j can tolerate.

Our FPGA location selection policy is based on the above analyses.
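The eviction cost and the tolerable start time can be combined into a small position-selection sketch. Everything below is a simplification under stated assumptions: the function names, the per-frame frequency list (0 meaning an empty frame) and the `earliest_start` input (the time at which the task could start in each candidate window) are hypothetical, and the fitting policy and pattern reuse are not modeled.

```python
def eviction_cost(frame_freq, start, end):
    """Sum of the recurrent frequencies of the patterns in frames start..end."""
    return sum(frame_freq[start:end + 1])


def tolerable_start_time(ready_time, slack, depth_subgraph):
    """Ready time plus the share of the slack this task may consume."""
    return ready_time + slack / depth_subgraph


def pick_position(frame_freq, n_frames, earliest_start,
                  ready_time, slack, depth_subgraph):
    """Cheapest window of n_frames contiguous frames whose start is tolerable.

    Returns the index of the first frame of the window, or None.
    """
    limit = tolerable_start_time(ready_time, slack, depth_subgraph)
    best = None
    for p in range(len(frame_freq) - n_frames + 1):
        if earliest_start[p] > limit:
            continue                      # delay exceeds what the task tolerates
        key = (eviction_cost(frame_freq, p, p + n_frames - 1),
               earliest_start[p], p)
        if best is None or key < best:
            best = key
    return None if best is None else best[2]


frame_freq = [3, 3, 0, 0, 1, 1, 2, 2]          # hypothetical FPGA state
earliest_start = [10, 10, 40, 12, 12, 12, 10]  # one entry per 2-frame window
print(pick_position(frame_freq, 2, earliest_start,
                    ready_time=10, slack=12, depth_subgraph=3))  # 3
```

In this example a purely greedy earliest-start choice would take window 0 and evict two frequently used patterns (cost 6), whereas the window starting at frame 3 costs 1 and delays the task by only 2 time units, which is within the tolerable start time of 14.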
The influence of reconfiguration overhead on the dispatch time of each task is minimized. Candidate positions with lower weighted-reconfiguration overhead and tolerable delay are always chosen. Reconfiguration data can be effectively shared among tasks with similar reconfiguration patterns. The reconfiguration overhead is, therefore, effectively reduced and sometimes hidden. This also minimizes reconfiguration power, a significant part of the power consumption in FPGAs.

4.2.3 The Algorithm

The pseudo-code for the two-dimensional scheduling algorithm is shown in Figure 8. First, root nodes from all the task graphs are put into the candidate pool (line 2). The priority of each task in the candidate pool is updated dynamically (line 4), and the task with the highest priority is chosen (line 5). Since the parent tasks of this task may be assigned to PEs other than its own, the corresponding communication events need to be scheduled on the communication resource first (line 6). Then the task is scheduled on the candidate PE (line 7). Finally, scheduling of the task leads to other tasks becoming ready (line 8).

The key part of the scheduling algorithm is schedule_task( ), whose working is illustrated next. Consider task C in the partial FPGA schedule shown in Figure 9. When this task is being loaded into the FPGA, the reconfiguration overhead may be introduced before or after the task, shown as shaded blocks.

Two issues need to be considered for the reconfiguration blocks introduced before task C. First, the timespans of the empty slots in the different frames among the possible candidate positions for task C may be different. Since the reconfiguration hardware is a sequential resource, reconfiguration of one frame will delay the reconfiguration of other frames and even the start time of the task. Second, the reconfiguration slots left unused between the reconfiguration events and task C cannot be utilized by tasks with different configuration patterns. In our approach, the priority, P, to determine the reconfiguration sequence of frames is defined as follows:

$$P = \begin{cases} -(r_t - s_t), & r_t \ge s_t \\ -(hyperperiod - s_t + r_t), & r_t < s_t \end{cases}$$

$$r_t = ready\_time \bmod hyperperiod, \qquad s_t = start\_time \bmod hyperperiod$$

where ready_time is the ready time of the task and start_time is the start time of the empty time slot in the frame. The idea is that if the duration between the reconfiguration slot start time and the task ready time is short, reconfiguration of the corresponding frame needs to be scheduled first. Otherwise, reconfiguration may not be completed by the task ready time and hence delay task execution. The reconfiguration slots in each frame are scheduled before this ready task based on a nonincreasing priority order.

In order to hide the reconfiguration overhead whenever possible, a function called scheduleback() is used. This function looks backward for the first available reconfiguration slot from r_t to s_t in the current frame. If the function returns false, it means that reconfiguration cannot start during [s_t, r_t]. In this case, another function, schedulefront(), is invoked. This looks for the first available reconfiguration slot in the current frame from r_t to the finish time of the empty timespan. With this approach, the reconfiguration events are scheduled as soon as possible before the task ready time and also as closely as possible to this task, addressing both the issues raised before. In Figure 9, before candidate task C, frames 8 and 9 are scheduled first, then frames 0 to 3.

We next discuss the issues involved in scheduling reconfiguration slots after the task. To leave enough flexibility for future tasks, the reconfiguration slots need to be placed as close to the next task as possible. Also, a priority needs to be defined to determine the scheduling order for all the needed frames in order to tackle the interrelationships among them, as follows:

$$P = \begin{cases} -(f_t - r_t), & f_t \ge r_t \\ -(hyperperiod - r_t + f_t), & f_t < r_t \end{cases}$$

$$r_t = ready\_time \bmod hyperperiod, \qquad f_t = finish\_time \bmod hyperperiod$$

where finish_time is the finish time of the empty timespan in the frame.
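Both frame priorities above measure a distance on a schedule that wraps around at the hyperperiod, which a single modulo operation captures. The sketch below orders frames by the gap between the start of their empty slot and the task ready time. The function names and the sample slot times are hypothetical, and returning the frames with the shortest gap first is an assumption based on the statement that frames with a short gap need to be reconfigured first.

```python
def cyclic_gap(later, earlier, hyperperiod):
    """Distance from `earlier` to `later` on a timeline that wraps at the hyperperiod.

    Equals later - earlier when later >= earlier (after reduction modulo the
    hyperperiod), and hyperperiod - earlier + later otherwise.
    """
    return (later % hyperperiod - earlier % hyperperiod) % hyperperiod


def reconfig_order(slot_start, ready_time, hyperperiod):
    """Frames sorted so that the shortest slot-start-to-ready gap comes first.

    slot_start: frame index -> start time of the empty slot in that frame
    """
    return sorted(slot_start,
                  key=lambda f: cyclic_gap(ready_time, slot_start[f], hyperperiod))


slot_start = {0: 140, 1: 5, 8: 15}      # hypothetical empty-slot start times
print(cyclic_gap(20, 140, 150))         # 30 (the slot wraps around the hyperperiod)
print(reconfig_order(slot_start, ready_time=20, hyperperiod=150))  # [8, 1, 0]
```

The same `cyclic_gap` helper applies to the second priority by passing the finish time of the empty timespan as `later` and the ready time as `earlier`.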
Function scheduleback() is called for each frame based on a nonincreasing priority order. It looks for the first available reconfiguration slot from f_t to r_t in the current frame, and chooses the first available slot. With this approach, in Figure 9, in frames 0 to 3, the reconfiguration slots after task C are scheduled close to task A (note that tasks repeat after the hyperperiod). In frames 8 and 9, the reconfiguration slots are scheduled close to task B.

  1. scheduling_algorithm(){
  2.   candidate_pool <- root nodes
  3.   while (pending tasks != NULL){
  4.     priority_calculation(candidate_pool)
  5.     task <- extract(candidate_pool)
  6.     sched_input_communication(task)
  7.     schedule_task(task)
  8.     candidate_pool <- introduce_ready_tasks(task)}}

Figure 8: Pseudo-code of the scheduling algorithm

[Figure 9: A scheduling example. Task A (type 3), task B (type 2) and candidate task C (type 1) placed on the reconfiguration resource (frames 0 to 13) over one hyperperiod.]

Function schedule_task( ) contains two steps. First, candidate_position_sort( ) calculates the priority for each candidate position. Its pseudo-code is shown in Figure 10. In lines 3 to 6, the algorithm calculates the priority of the frames in each candidate position. Then, for each candidate position, it schedules reconfiguration slots before the task based on the priorities (lines 7-9). From all the frames in this candidate position, it chooses the latest reconfiguration finish time to be the actual task ready time for this position. Then it uses the location assignment policy described in Section 4.2.2 to calculate the priority for each candidate position (line 10).

Second, function schedule_task_p( ) is invoked to schedule the task. Its pseudo-code is shown in Figure 11. The candidate position with the highest priority is chosen from candidate_position_pool (line 3). The reconfiguration slots before the task are scheduled first (line 4), then the reconfiguration slots after the task (line 5). Finally, the task itself is inserted into the schedule (line 6). If any of these three steps fails, the frame at which the failure occurs is chosen. The next time slot is searched from this frame, and using this a new priority for the candidate position is calculated (line 9).
The candidate position is inserted into the priority queue at the appropriate location (line 10). Then a new candidate position is chosen to try to schedule the task (line 11).

For the FPGA scheduling algorithm, the time complexity is O(n^2 log n), where n is the number of tasks. However, in the average case, it behaves like an n log n algorithm.

  1. candidate_position_sort(task){
  2.   for (i = 0; i < num_candidate_positions; i++){
  3.     for (j = position_start; j <= position_finish; j++){
  4.       slot_j = candidate_time_slot_find(frame_j)
  5.       slot_priority_j = slot_priority_calculation(slot_j)
  6.       slot_priority_pl.insert(slot_priority_j)}
  7.     for (j = slot_priority_pl.begin; j < slot_priority_pl.end; j++){
  8.       if (reconfig_j = false){
  9.         schedule_reconfig(frame_j)}
 10.     update_position_priority(candidate_position)}}

Figure 10: Candidate position priority calculation
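The top-level loop of Figure 8 is a list scheduler, and its control flow can be sketched in a few lines. This is a deliberately reduced sketch: it schedules on a single sequential PE, treats communication time as zero, and takes the priority as a caller-supplied function, so none of the two-dimensional FPGA placement is modeled. The function name, the sample tasks and the priority values are hypothetical.

```python
def list_schedule(exec_time, preds, priority):
    """Schedule tasks on one sequential PE; returns [(task, start, finish)]."""
    remaining = {t: len(preds.get(t, ())) for t in exec_time}
    succs = {t: [] for t in exec_time}
    for t, ps in preds.items():
        for p in ps:
            succs[p].append(t)

    pool = [t for t, n in remaining.items() if n == 0]   # line 2: root nodes
    finish, schedule, pe_free = {}, [], 0
    while pool:                                          # line 3
        pool.sort(key=priority, reverse=True)            # line 4: update priorities
        task = pool.pop(0)                               # line 5: highest priority
        # line 6 stand-in: inputs are available when all parents have finished
        ready = max((finish[p] for p in preds.get(task, ())), default=0)
        start = max(ready, pe_free)                      # line 7: place on the PE
        finish[task] = start + exec_time[task]
        pe_free = finish[task]
        schedule.append((task, start, finish[task]))
        for s in succs[task]:                            # line 8: newly ready tasks
            remaining[s] -= 1
            if remaining[s] == 0:
                pool.append(s)
    return schedule


exec_time = {"a": 3, "b": 2, "c": 4}
preds = {"c": {"a"}}
latest_finish = {"a": 5, "b": 10, "c": 9}   # hypothetical deadlines
print(list_schedule(exec_time, preds, lambda t: -latest_finish[t]))
# [('a', 0, 3), ('c', 3, 7), ('b', 7, 9)]
```

Because the priority function is re-evaluated on every iteration, a state-dependent priority such as the dynamic one of Section 4.2.1 could be plugged in without changing the loop.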

4.3 Scheduling Algorithms for Other Resources

FPGA scheduling is compatible with scheduling for processors and communication links. We use the same approach to schedule tasks (communication events) on different processors (communication links). The only difference is that reconfiguration times can be made zero for processors and links, and the scheduling problem is one-dimensional (analogous to having only one frame in the FPGA).

  1. schedule_task_p(task){
  2.   while (candidate_position_pool != NULL){
  3.     candidate_position <- extract(candidate_position_pool)
  4.     schedule_reconfig_before(task)
  5.     schedule_reconfig_after(task)
  6.     schedule_exec(task)
  8.     if (false){
  9.       calculate_priority(candidate_position.next_slot())
 10.       candidate_position_pool.insert(candidate_position)
 11.       next_candidate_position_chosen()
 12.       continue}}}

Figure 11: Task scheduling

5. Experimental Results

In this section, we present experimental results for our FPGA scheduling algorithm and the hardware/software co-synthesis system. The system is implemented in C++ using the standard template library (STL). The resource library consists of various system resources available from industry and academia. We use processors, memory blocks and communication links provided in [29]. The parameters of our dynamically reconfigurable FPGA model are based on Xilinx Virtex-E FPGAs [5]. The task graphs, which are input to the co-synthesis system, are generated by TGFF [29]. All the experiments were performed on a Pentium-III 667MHz PC (512MB memory) running Linux OS.

We first demonstrate the performance of our FPGA scheduling algorithm. We compare the results of scheduling for Approach I (Sect. 4.1), which is based on static slack-based priority, configuration prefetch, and preconfiguration utilization [18], and our Approach II. The results are shown in Table 4, which includes schedule length, reconfiguration power consumption and CPU time. Compared with Approach I, the improvements in schedule length and reconfiguration power are shown in columns 4 and 7, respectively, and also in Figure 12.
For these examples, the number of task graphs varies from 4 to 6, and the total number of tasks in these task graphs is around 200. In Figure 12, the bars represent schedule length and the lines represent reconfiguration power.

Table 4: FPGA scheduling results

      Schedule length             Reconf. power              CPU time
      (in 10^3 time units)        (mW)                       (seconds)
Ex.   I       II      Imp.        I       II      Imp.       I     II
1     4815    1625    66.3%       101.4   12.0    88.2%      3.2   2.2
2     12530   5302    57.7%       186.7   88.1    52.8%      0.7   0.3
3     8353    5488    34.3%       114.8   81.3    29.2%      7.5   3.6
4     5992    2392    60.1%       88.4    37.3    57.8%      3.2   1.4
5     9139    6903    24.5%       120.2   94.0    21.8%      5.9   4.3
6     3282    2852    13.1%       223.3   193.3   13.4%      1.2   1.1
7     2066    1351    34.6%       33.1    19.9    39.9%      2.4   1.5
8     4270    1600    62.5%       99.3    33.1    66.7%      0.7   0.5
9     4600    4717    -2.5%       67.9    74.7    -10.0%     3.8   3.2
10    6444    2588    59.8%       110.3   0       100%       0.5   0.3

[Figure 12: FPGA scheduling results.]

As opposed to Approach I, our algorithm always meets the real-time constraints (for Approach I, only solutions for Examples 3, 5 and 9 meet the real-time constraints). The average reduction in schedule length is 41.0% and the average reduction in reconfiguration power is 46.0%. Recall that reconfiguration power is frequently of the same order as the task power consumption. Hence, it is very important to reduce reconfiguration power. Reduction of the schedule length helps the co-synthesis system choose lower cost (and potentially slower) PEs without violating the real-time constraints, thus reducing the system price.

In Example 9, our approach gets worse results. The reason is that in this example, because of the tight FPGA resource constraints, not much flexibility is left for our scheduling algorithm to explore the globally optimal solution. Since our approach may not choose a locally optimal solution for each task, it may at times get a worse result than Approach I, which is much more greedy.

Also, our algorithm needs slightly less run-time. This is because our algorithm looks ahead to the needs of future tasks and makes it easier to schedule them.
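The improvement columns and the reported averages can be re-derived from the raw values of Table 4 with a short script. The variable and function names are hypothetical; the numbers are copied from the table.

```python
# (schedule length I, schedule length II, reconf. power I, reconf. power II)
rows = [
    (4815, 1625, 101.4, 12.0),
    (12530, 5302, 186.7, 88.1),
    (8353, 5488, 114.8, 81.3),
    (5992, 2392, 88.4, 37.3),
    (9139, 6903, 120.2, 94.0),
    (3282, 2852, 223.3, 193.3),
    (2066, 1351, 33.1, 19.9),
    (4270, 1600, 99.3, 33.1),
    (4600, 4717, 67.9, 74.7),
    (6444, 2588, 110.3, 0.0),
]


def improvement(approach_1, approach_2):
    """Relative reduction of Approach II over Approach I, in percent."""
    return 100.0 * (approach_1 - approach_2) / approach_1


length_imp = [improvement(a, b) for a, b, _, _ in rows]
power_imp = [improvement(c, d) for _, _, c, d in rows]
avg_length = sum(length_imp) / len(length_imp)
avg_power = sum(power_imp) / len(power_imp)
print(f"{avg_length:.1f}% {avg_power:.1f}%")  # 41.0% 46.0%
```

The averages are plain means of the per-example percentages, so the single negative entry (Example 9) lowers both figures slightly.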
Since Approach I is greedy and makes locally optimal choices, it needs more time to schedule tasks encountered in the later part of the scheduling process.

The results for our hardware/software co-synthesis system are shown in Table 5. In this table, rows 2 and 3, respectively, show the corresponding system price and power consumption of all the non-dominated solutions, and the last row shows the CPU time for co-synthesis. The system price is calculated by summing up the prices of all the processors, FPGAs, communication links and memory in the synthesized distributed embedded system. The system power consumption is calculated by summing up all the execution, reconfiguration, communication and idle energies in the hyperperiod and dividing by the hyperperiod. Table 5 illustrates the ability of our co-synthesis system to effectively explore the design space. Our multi-objective optimization approach achieves a good trade-off between system price and power consumption. All real-time constraints are satisfied. The run-time indicates that large task graphs can be handled in a reasonable amount of time.

6. Conclusions

We presented a multi-objective hardware/software co-synthesis system for real-time distributed embedded systems. A novel two-dimensional multi-rate cyclic scheduling algorithm was proposed to tackle the scheduling problem in dynamically reconfigurable FPGAs. This algorithm not only minimizes schedule length (thus allowing cheaper PEs), but also significantly reduces reconfiguration power. Reconfiguration power is the main bottleneck in exploiting the reconfiguration capability of modern dynamically reconfigurable FPGAs. Ours is the first co-synthesis system to target both price and power optimization in distributed embedded systems containing dynamically reconfigurable FPGAs.

Table 5: Hardware/software co-synthesis results

Example   Price (dollar)   Power consumption (mW)   CPU time (minutes)
144.7 66.1 394.5
1 209 389 99.7
2 42 212 253.6 133.6 57 619.7 153 305.5
3 173 271.1 19.8 198 121.9 525 108.4 159 745.5
4 174 626.9 54.2 209 503.6 153 815.6
5 385 699.8 28.8 420 489.4 232 922.7
6 367 829.6 14.9 394 557.5
7 156 684.2 353 462.9 3.0
8 156 790.5 204 345.6 18.0 209 852.0
9 238 345.8 39.2 250 265.3
10 156 353.8 2.1

References

[1] J. Hauser and J. Wawrzynek, "Garp: A MIPS processor with a reconfigurable coprocessor," in Proc. Symp. Field-Programmable Custom Computing Machines, pp. 12-21, Apr. 1997.
[2] S. Trimberger, D. Carberry, A. Johnson, and J. Wong, "A time-multiplexed FPGA," in Proc. Symp. Field-Programmable Custom Computing Machines, pp. 22-28, Apr. 1997.
[3] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee, "CHIMAERA: A high-performance architecture with a tightly-coupled reconfigurable functional unit," in Proc. Int. Symp. Computer Architecture, pp. 225-232, June 2000.
[4] T. Fujii et al., "A dynamically reconfigurable logic engine with a multi-context/multi-mode unified-cell architecture," in Proc. Int. Solid-State Circuits Conf., Feb. 1999.
[5] Virtex-E data sheet, http://www.xilinx.com.
[6] Y. Shin, D. Kim, and K. Choi, "Schedulability-driven performance analysis of multiple mode embedded real-time systems," in Proc. Design Automation Conf., pp. 495-500, June 2000.
[7] Z. Li, K. Compton, and S. Hauck, "Configuration caching techniques for FPGA," in Proc. Symp. Field-Programmable Custom Computing Machines, Apr. 2000.
[8] X. Tang, M. Aalsma, and R. Jou, "A compiler directed approach to hiding configuration latency in chameleon processors," in Proc.
Int. Conf. Field Programmable Logic and Applications, pp. 29-38, Aug. 2000.
[9] M. Kaul, R. Vemuri, S. Govindarajan, and I. Ouaiss, "An automated temporal partitioning and loop fission approach for FPGA based reconfigurable synthesis of DSP applications," in Proc. Design Automation Conf., pp. 616-622, June 1999.
[10] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman and Company, NY, 1979.
[11] S. Prakash and A. Parker, "SOS: Synthesis of application-specific heterogeneous multiprocessor systems," J. Parallel & Distributed Comput., vol. 16, pp. 338-351, Dec. 1992.
[12] T.-Y. Yen and W. Wolf, "Communication synthesis for distributed embedded systems," in Proc. Int. Conf. Computer-Aided Design, pp. 703-708, June 1997.
[13] D. Kirovski and M. Potkonjak, "System-level synthesis of low-power hard real-time systems," in Proc. Design Automation Conf., pp. 697-702, June 1997.
[14] R. P. Dick and N. K. Jha, "MOGAC: A multiobjective genetic algorithm for hardware-software co-synthesis of distributed embedded systems," IEEE Trans. Computer-Aided Design, vol. 17, pp. 920-935, Oct. 1998.
[15] R. P. Dick and N. K. Jha, "CORDS: Hardware-software co-synthesis of reconfigurable real-time distributed embedded systems," in Proc. Int. Conf. Computer-Aided Design, pp. 62-68, Nov. 1998.
[16] B. P. Dave, "CRUSADE: Hardware/software co-synthesis of dynamically reconfigurable heterogeneous real-time distributed embedded systems," in Proc. Design, Automation & Test in Europe, pp. 97-104, Mar. 1999.
[17] N. Shenoy, A. Choudhary, and P. Banerjee, "An algorithm for synthesis of large time-constrained heterogeneous adaptive systems," ACM Trans. Design Automation of Electronic Systems, vol. 6, no. 2, pp. 207-225, Apr. 2001.
[18] B. Jeong, S. Yoo, S. Lee, and K. Y. Choi, "Hardware-software cosynthesis for run-time incrementally reconfigurable FPGAs," in Proc. Asia & South Pacific Design Automation Conf., pp. 169-174, Jan. 2000.
[19] Y. B. Li, T. Callahan, E. Darnell, R. Harr, U. Kurkure, and J.
Stockwood, "Hardware-software co-design of embedded reconfigurable architectures," in Proc. Design Automation Conf., pp. 507-512, June 2000.
[20] J. Noguera and R. Badia, "A HW/SW partitioning algorithm for dynamically reconfigurable architectures," in Proc. Design, Automation & Test in Europe, pp. 729-734, Mar. 2001.
[21] E. L. Lawler and C. U. Martel, "Scheduling periodically occurring tasks on multiple processors," Information Processing Letters, vol. 7, pp. 9-12, Feb. 1981.
[22] S. Malik, M. Martonosi, and Y.-T. Li, "Static timing analysis of embedded software," in Proc. Design Automation Conf., pp. 147-152, June 1997.
[23] K. S. Khouri, G. Lakshminarayana, and N. K. Jha, "High-level synthesis of low power control-flow intensive circuits," IEEE Trans. Computer-Aided Design, vol. 18, no. 12, pp. 1715-1729, Dec. 1999.
[24] V. Tiwari, S. Malik, and A. Wolfe, "Power analysis of embedded software: A first step toward software power minimization," IEEE Trans. VLSI Systems, vol. 2, no. 4, pp. 437-445, Apr. 1994.
[25] S. Gupta and F. Najm, "Power modeling for high-level power estimation," IEEE Trans. VLSI Systems, vol. 8, no. 1, pp. 18-29, Feb. 2000.
[26] B. Dave, G. Lakshminarayana, and N. K. Jha, "COSYN: Hardware-software co-synthesis of embedded systems," in Proc. Design Automation Conf., pp. 703-708, June 1997.
[27] L. A. Belady, "A study of replacement algorithms for virtual storage computers," IBM Sys. J., vol. 5, pp. 78-101, 1966.
[28] N. Young, "The k-server dual and loose competitiveness for paging," Algorithmica, vol. 11, pp. 525-541, June 1994.
[29] R. P. Dick, D. R. Rhodes, and W. H. Wolf, "TGFF: Task graphs for free," in Proc. Int. Workshop HW/SW Co-Design, pp. 97-101, Mar. 1998.