Use of compiler optimization of software bypassing as a method to improve energy efficiency of exposed data path architectures

Guzma et al. EURASIP Journal on Embedded Systems 2013, 2013:9
RESEARCH Open Access

Vladimír Guzma*, Teemu Pitkänen and Jarmo Takala

Abstract
In the design of embedded systems, hardware and software need to be co-explored together to meet targets of performance and energy. With the use of application-specific instruction-set processors, as a stand-alone solution or as a part of a system on chip, the customization of processors for a particular application is a known method to reduce energy requirements and provide performance. In particular, processor designs with exposed data paths trade compile time complexity for simplified control hardware and lower running costs. An exposed data path also allows the removal of unused components of the interconnection network, once the application is compiled. In this paper, we propose the use of a compiler technique for processors with exposed data paths, called software bypassing. Software bypassing allows the compiler to schedule data transfers between execution units directly, bypassing the use of a general-purpose register file. This increases scheduling freedom by reducing the dependencies induced by the reuse of registers, decreases the number of read and write accesses to register files, and allows the use of register files with fewer read and write ports while maintaining or improving performance and maintaining reprogrammability. We compare our proposal against an architecture exploration technique, connectivity reduction, which finds in a compiled application all interconnection network components that are used and removes those which are not, leading to an energy-efficient application-specific instruction-set processor. We observe that the use of software bypassing leads to improvements in application speed, with architectures having the smallest number of register file ports consistently outperforming architectures with a larger number of ports, and to a reduction in energy consumption.
In contrast, connectivity reduction maintains the same application speed, reduces energy consumption, and allows for an increase in processor frequency; however, with the clock frequency increased to match the performance of software bypassing, energy consumption grows. We also observe that in case reprogrammability is not an issue, the most energy-efficient solution is a combination of software bypassing and connectivity reduction.

*Correspondence: vladimir.guzma@tut.fi
Department of Computer Systems, Tampere University of Technology, Tampere 33720, Finland

2013 Guzma et al.; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction
In an embedded domain, unlike in more traditional high-performance computing, performance closely relates to energy. Efficient solutions utilize the knowledge of an application or application domain to explore hardware and software techniques, eventually leading to application-specific instruction-set processors, and to provide enough performance for a particular task, or set of tasks, while minimizing energy requirements. The exploration of available instruction-level parallelism (ILP) and the use of instruction-set extensions are common ways to improve clock cycle performance, leading to lower operation frequencies and lower energy requirements.

When exploring ILP in computation, a large number of general-purpose registers contributes to the increase in performance. Having program variables in independent registers allows data-independent computation paths to be scheduled in parallel, on different execution units. However, in terms of algorithm computation, the time and energy spent on transferring values between function units and register files are wasted, not contributing to actual computation directly, since only the energy spent while computing in function units is actually useful to compute results. In addition, when parallel execution requires access to several general-purpose registers in the same cycle, register files need to provide enough read and write ports to allow such access, leading to increased complexity and higher energy requirements of register files.

Another method to increase the performance of embedded processors is the customization of the instruction set [1,2]. Complex computation patterns can be implemented as custom function units, providing better performance and freeing other processor resources for independent computation paths. This customization, however, often leads to implementations with a higher number of input values that produce several results, further elevating the problem of the number of register file ports and necessary data transports between function units and register files.

Yet another method to reduce energy requirements of a custom processor is the optimization of the data path. To maintain the best programmability and allow maximum ILP exploitation, processor interconnection networks tend to be designed for the worst case scenario. However, once the application schedule is set, we can simply remove all the unused components of the interconnection network, eventually maintaining only the required connections (connectivity reduction). This reduces interconnection network complexity and instruction fetch and decode energy requirements, and can allow for an increase in processor clock frequency. An unfortunate effect of connectivity reduction, however, is limited reprogrammability, or no reprogrammability at all, of such a processor. In case the application is modified and needs to be recompiled for the reduced architecture, the compiler may produce an inefficient schedule working around missing connections or fail to schedule completely.
In this paper, we propose the use of a compiler optimization technique called software bypassing, suitable for architectures with exposed data paths, as a tool for improving energy efficiency. By allowing the compiler to schedule data transfers directly between function unit outputs and inputs, reading the value of a previous computation from the register file becomes unnecessary. This helps reduce unnecessary energy costs of register files. In addition, in the case where all of the uses of a produced value can be bypassed directly, the actual write of the value to a general-purpose register can be discarded during compilation (dead result move elimination), reducing the total number of data transfers on the interconnection network and further contributing to a reduction in processor energy consumption. An additional benefit of this optimization is the reduction of false data dependencies created by register allocation when several program variables reuse the same register to store the data, effectively allowing the scheduler more scheduling freedom and increasing the available ILP to explore during instruction scheduling.

We reason that the combined benefits of software bypassing (reduction in register file reads and writes, reduction in the number of data transfers on the interconnection network, and improved cycle count) match those of connectivity reduction when it comes to energy efficiency while maintaining full reprogrammability. In situations where reprogrammability is not an issue, we propose the use of a combination of software bypassing and connectivity reduction. We reason that the benefits of these two complement each other, providing for large energy savings. In particular, in order to match the clock cycle improvements gained by the use of software bypassing, a processor with only reduced connectivity needs to run with a higher clock frequency, eventually leading to an increase in energy requirements, possibly exceeding the energy budget.
In our previous work [3], we considered a conservative approach to software bypassing and investigated its effect on clock cycle performance and reduction in register file reads and writes depending on the heuristic parameter guiding bypassing decisions. We also discussed the impact of the heuristic deciding when to bypass on register file and interconnection network energy consumption, however, using only a hardware estimator [4] and a single architecture. In this paper, we propose a novel bypassing algorithm and use an extensive set of register file architecture types to investigate the effect of bypassing and connectivity reduction on the energy of various processor core components directly influenced by one or both of the methods. In addition, we investigate the cost of matching the performance improvements brought by bypassing via synthesis for a higher clock frequency when only connectivity reduction is available. We evaluate our approach using commercially available tools for processor synthesis, gate level simulation, and power analysis.

The rest of this paper is organized as follows: Section 2 discusses previous work. Section 3 gives a short introduction to architectures with exposed data paths and describes our choice, transport-triggered architectures. Section 4 gives an overview of our novel software bypassing algorithm with integrated dead result move elimination, as well as connectivity reduction. Section 5 outlines our experimental setup, and Section 6 provides a discussion of the results of our experiments. Finally, Section 7 concludes this paper.

2 Related work
Effects of bypassing register files are known and appreciated in processor design [5,6], with reported register file power reduction of 12% on average for Intel XScale processor and performance loss of 2% in [5] and up to 8% register file energy reduction compared to Reduced Instruction Set Computer (RISC)/Very Long Instruction Word (VLIW) counterparts in [6]. More and more effort is spent in focusing on computation and distancing from the temporary data storage.

The traditional use of register files for storing data becomes a problem with monolithic register files (RF) in VLIW processors with a large number of function units. The requirement of a large number of RF ports in such a case makes the use of a monolithic RF prohibitively expensive. Common solutions involve the clustering of the RF into a number of smaller ones. Intercluster communication can then be implemented using RF to RF copying and/or read/write between dedicated function units and RF across clusters. However, as shown in [7,8], using only register to register copies between clusters reduces achievable ILP when compared to a monolithic RF. Results closer to a monolithic RF can be achieved with the use of direct reads and writes from a dedicated function unit to RFs in different clusters, suggesting that the RF to RF copies between clusters should be avoided when possible.

Another step towards better performance and more achievable ILP is bypassing data directly from function unit to function unit, avoiding the use of the RF altogether. Such a solution improves performance and reduces the energy required by the RF but can also be used to reduce the number of required RF ports while retaining performance. The effective use of RF bypassing is dependent on the architecture's division of work between the software and the hardware. In order to bypass the RF, the compiler or hardware logic must be able to determine what the consumers of the bypassed value are, effectively requiring data flow information, and how the direct operand transfer can be performed in the hardware.
While hardware implementations of RF bypassing may be transparent to the programmer, they also require additional logic and wiring in the processor and can only analyze a limited instruction window for the required data flow information. Hardware implementations of bypassing cannot get the benefit of reduced register pressure since the registers are already allocated to the variables when the program is executing. However, the benefits from the reduced number of RF accesses are achieved. Register renaming [9] also increases available ILP by the removal of false dependencies.

Dynamic strands presented in [10] are an example of an alternative hardware implementation of RF bypassing. Strands are dynamically detected atomic units of execution where registers can be replaced by direct data transports between operations. In Explicit Data Graph Execution (EDGE) architectures [11], operations are statically assigned to execution units, but they are scheduled dynamically in data-flow fashion. Instructions are organized in blocks, and each block specifies its register and memory inputs and outputs. Execution units are arranged in a matrix, and each unit in the matrix is assigned a sequence of operations from the block to be executed. Each operation is annotated with the address of the execution unit to which the result should be sent. Intermediate results are thus transported directly to their destinations.

Static strands in [12] follow an earlier work [10] to decrease hardware costs. Strands are found statically during compilation and annotated to pass the information to the hardware. As a result, the number of required registers is reduced already at compile time. This method was, however, applied only to transient operands with a single definition and single use, effectively up to 72% of dynamic integer operands, bypassing about half of them [12].
The authors reported 16% to 24% savings in issue energy, 17% to 20% savings in bypass energy, 13% to 14% savings in register file energy, and 15% improvement in instructions per cycle, using a cycle-accurate simulator for two hardware models: Renesas (formerly Hitachi) SuperH SH4 and IBM PowerPC 750FX embedded microprocessor.

Dataflow mini-graphs [13] are treated as atomic units by a processor. They have the interface of a single instruction, with intermediate variables live only in the bypass network. In [14], architecturally visible virtual registers are used to reduce register pressure through bypassing. In this method, a virtual register is only a tag marking data dependence between operations without having a physical storage location in the RF.

Software implementations of bypassing analyze code during compile time and pass to the processor the exact information about the sources and the destinations of bypassed data transports, thus avoiding any additional bypassing and analyzing logic in the hardware. This requires an architecture with an exposed data path that allows such direct programming, like the transport-triggered architectures (TTA) [15,16], synchronous transfer architecture [17], FlexCore [18], no-instruction-set-computer [19], or static pipelining [20]. A commercial application of the TTA paradigm is the Maxim MAXQ general-purpose microcontroller family [21]. The assignment of destination addresses in an EDGE architecture corresponds to software bypassing in a transport-triggered setting.

Software-only bypassing was previously implemented for TTA architecture using the experimental MOVE framework [22,23] and MOVE-Pro [6]. TTAs are a special type of VLIW architectures. They allow programs to define explicitly the operations executed in each function unit (FU) as well as to define how (with position in instruction defining bus) and when data are transferred (moved) to each particular port of each unit. With the option of having registers in input and output ports of FUs, TTA allows the scheduler to move operands to FUs in different cycles and read results several cycles after they are computed. Therefore, the limiting factor for bypassing is the availability of connections between the source FU and destination FUs.

In our previous work [3], we introduced a simple, conservative software bypassing implementation. We, however, only focused on the improvements in cycle counts and register file read and write accesses when changing bypass aggressiveness heuristics. In [24], we discussed the mentioned simple bypassing algorithm in terms of energy only using a hardware cost estimation model [4] and a single architecture.

3 Exposing data paths: transport triggering approach
TTA [15] is an exposed data path architecture which allows the number of architectural resources to be selected, e.g., selection of the number and size of register files and the number of read and write ports for each register file individually. Similarly, the number of function units as well as the operation set of each function unit can be defined by an architecture designer. To connect them together, the interconnection network is designed, with a choice of the number of buses and sockets to be used. Each socket provides a connection between the function unit or register file port and one or more buses. This allows for fully connected architectures, with most compiler freedom to choose how to transport data between the source and destination, as well as heavily reduced connectivity, with buses connecting only a small number of components. It is necessary to point out that a complex fully connected interconnection network is expensive and limits the maximum clock frequency. Alternatively, an optimized connectivity could allow for a higher frequency; however, it reduces scheduling freedom. With fewer alternatives for transport, potentially parallel data transports often need to be serialized.
Figure 1 shows an example of TTA. There are two function units in Figure 1, denoted as FU:, one serving as an arithmetic and logic unit (ALU) and the other as a load-store unit (LSU). GCU: denotes a global control unit, which is responsible for branching operations, and RF: denotes a single register file with two read and two write ports. Figure 1 also shows five transport buses and 13 sockets connected to units and buses. It can be seen in Figure 1 that not all the sockets are connected to all the buses (as the black dot denotes a connection).

Another interesting aspect of TTA comes from VLIW inheritance. An instruction defines what data transports are to be performed on each bus, which leads to wide instruction words. As a matter of fact, for each bus in the system, the instruction word encodes the source field of a transport as well as the destination field of a transport. While an increasing number of buses leads to more freedom for the compiler to schedule, it also increases the instruction width. Reducing the connectivity between sockets and buses typically leads to a lower number of bits required to encode a data transport for an individual bus and narrows the instruction width. However, in order to significantly reduce the negative impact of the instruction width, instruction compression can be applied. Using dictionary compression [25,26], for example, the code density can be improved significantly with a decrease in spent energy as well.

4 Software bypassing and connectivity reduction algorithms
In this section, we first discuss our two implementations of software bypassing: the first is a conservative approach, previously published in [3], while the second one is an opportunistic approach, which has not been published previously.

4.1 Software bypassing and dead result move elimination
Software bypassing and dead result move elimination on TTA processors can be illustrated with an example.
Let us consider the following code excerpt on a RISC-type architecture:

    add R3,r1,r2
    add R5,r4,R3
    mul r1,R3,R5

The same code in typical TTA syntax on a machine with two buses, one instruction per line, could be the following:

    r2 -> add.t; r1 -> add.o;
    add.r -> R3; r4 -> add.o;
    R3 -> mul.o; R3 -> add.t;
    add.r -> R5; ...;
    R5 -> mul.t; ...;
    mul.r -> r1; ...;

We can see that the individual data transports to operand registers (denoted with the .o suffix) and trigger registers (.t) from the general-purpose registers (rX or RX) are explicitly defined. In the same manner, the data transports from the result register (.r) to general-purpose registers are explicit. It is worth noting that the timing of data transports is not fixed. For instance, register r4 is written to the operand register of add one cycle before register R3 is written to the trigger register of add, starting the actual computation. Similarly, the operand write of mul is two cycles before the trigger write of the same operation starts execution.

[Figure 1 An example of TTA with five buses and reduced connectivity. A TTA architecture with reduced connections between sockets and buses. A fully connected architecture would have a connection to all buses in each socket. The figure shows function units FU: ALU { add, and, gtu } and FU: LSU { ldw, stw }, register file RF: RF 42x32, control unit GCU: gcu { jump, call }, and the buses, sockets, connections, and trigger ports.]

The same code with software bypassing and dead result move elimination applied could be the following:

    r2 -> add.t; r1 -> add.o;
    add.r -> mul.o; r4 -> add.o;
    add.r -> add.t; ...;
    add.r -> mul.t; ...;
    mul.r -> r1; ...;

We can observe direct data transports from the result to operand or trigger registers. Data transports from the result register to the general-purpose register are completely scrapped for both additions (dead result move elimination), as the results are used just once in multiplication, where they are transported directly from the result register of the adder. Compared to the first RISC-like example, we can observe that all the registers denoted with capital RX have disappeared, while the amount of work function units performed remains the same.

As the number of registers available in the architecture is limited, the compiler reuses the same registers to store different variables through the execution of the program. This leads to false dependencies during instruction scheduling, as the reordering of data transports can be limited not by a real data dependence, such as producer-consumer, but by a false dependence such as write-after-read or write-after-write. The removal of the uses of registers reduces this problem induced by register allocation.

Our first bypassing algorithm is based on a data dependence graph as a part of list scheduling, previously published in [3]. In order to prevent possible deadlocks, the algorithm uses a conservative implementation, which first schedules data transports of all operands before attempting to bypass operands directly from the result registers of other function units which produced the required values.
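The transformation in the example above can be sketched in a few lines of Python. This is only an illustrative sketch, not the scheduler used in this work: it operates on an already ordered list of moves, ignores operation latency and bus connectivity, and assumes that a result port keeps its value until the same function unit is triggered again. The names Move, software_bypass, and eliminate_dead_result_moves are hypothetical, and the set of registers live after the block (here r1) is assumed to be known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Move:
    """One data transport, written 'src -> dst' in the TTA syntax above."""
    src: str
    dst: str


def is_register(name):
    # Assumption: FU ports contain a dot (add.o, add.t, add.r);
    # anything else is a general-purpose register (r1, R3, ...).
    return "." not in name


def parse(text):
    """Turn 'a -> b; c -> d' into a list of Move objects."""
    return [Move(*(p.strip() for p in part.split("->")))
            for part in text.split(";") if part.strip()]


def software_bypass(moves):
    """Replace register reads by reads of the result port that produced them."""
    producer = {}  # register -> result port whose current value it mirrors
    out = []
    for m in moves:
        src = producer.get(m.src, m.src)
        out.append(Move(src, m.dst))
        if is_register(m.dst):
            if src.endswith(".r"):
                producer[m.dst] = src
            else:
                producer.pop(m.dst, None)
        elif m.dst.endswith(".t"):
            # Triggering an FU overwrites its result port, so registers that
            # mirror the old result can no longer be bypassed from it.
            stale = m.dst.split(".")[0] + ".r"
            producer = {r: p for r, p in producer.items() if p != stale}
    return out


def eliminate_dead_result_moves(moves, live_out=()):
    """Drop register writes whose value is never read again (backward pass)."""
    live = set(live_out)
    kept = []
    for m in reversed(moves):
        if is_register(m.dst):
            if m.dst not in live:
                continue  # dead write: nobody reads this register afterwards
            live.discard(m.dst)
        if is_register(m.src):
            live.add(m.src)
        kept.append(m)
    kept.reverse()
    return kept


EXAMPLE = ("r2 -> add.t; r1 -> add.o; add.r -> R3; r4 -> add.o; "
           "R3 -> mul.o; R3 -> add.t; add.r -> R5; R5 -> mul.t; mul.r -> r1")

if __name__ == "__main__":
    result = eliminate_dead_result_moves(
        software_bypass(parse(EXAMPLE)), live_out={"r1"})
    print("; ".join(f"{m.src} -> {m.dst}" for m in result))
```

On the example, the sketch removes both writes to R3 and R5 and leaves seven moves, the same set of transports as in the bypassed listing; the packing of moves into two-bus instructions is left to the scheduler.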
Redundant result writes to the register file can be removed only once all the uses of the value written to the register file get bypassed and the value is not used outside the current basic block. As a heuristic for when not to bypass, a simple distance in cycles between the original write to the register, where the producer produces a value, and the cycle where the value is scheduled to be read from the register to the operand of the consumer is used (lookback distance). It is notable that this implementation works with top-down scheduling algorithms [27].

Our second bypassing algorithm is also based on a data dependence graph. However, while in the first case we used a top-down scheduling algorithm, in this case, we reversed the direction and implemented bypassing during bottom-up scheduling. Due to the nature of bottom-up scheduling, we start the scheduling of an operation by scheduling all result moves of the operation. This has the advantage of immediate availability of information whether or not all of the uses of the result value become bypassed and an unnecessary write to the register file can be removed immediately. In addition, our implementation starts with the scheduling of result moves with an attempt to find all bypass destinations and create direct bypass moves - early bypassing. Only in case some of the destinations cannot be bypassed, or the result value needs to be used outside the scope of the current basic block, the result move to a register in the register file is scheduled. Afterwards, the bypassing is attempted once again for the destinations that did not get bypassed during early bypassing. While this late bypassing does not contribute to improvement in cycle counts of the current operation anymore, it still removes an unnecessary read from the register file and frees a register file read port.

Only once all of the result moves of the operation are scheduled, with or without bypasses, the algorithm attempts to schedule input operand moves as well. In case the scheduling of operand moves fails, the result moves are unscheduled and a reschedule is attempted, with only early bypassing enabled. If the scheduling of operands still fails, result moves are unscheduled again and a reschedule is attempted, with only late bypassing enabled. Once again, if the scheduling of operands still fails, scheduled moves are unscheduled and a reschedule is attempted, without any bypassing enabled. Only if all previous attempts fail, the starting cycle of the scheduling is decreased and a reschedule is attempted.

The outline of our scheduling algorithm is presented in Algorithm 1, with inputs denoting the set of the input operands of the operation being scheduled and with outputs denoting the possibly empty set of the results the operation produces. The ScheduleOperandWrites method simply tries to schedule input operands of the operation as late as possible once the results of the operation are successfully scheduled, taking into account the latency and pipeline characteristics of the operation on the selected function unit. The method UnscheduleResultReads simply unschedules all the previously scheduled results of the operation and undoes possible bypasses. Actual bypassing of result moves is presented in Algorithm 2, with cycle denoting the starting cycle from which the scheduling starts, outputs denoting the possibly empty set of results the operation being scheduled produces, and the two flags bypassearly and bypasslate indicating if bypassing should be attempted early or late or both.
The method ScheduleCandidateALAP tries to schedule the original result move to a register as late as possible, starting from cycle, in case not all of the result reads were successfully bypassed or if the result is used in a different basic block. Other bypassing strategies are possible, including pre-register-allocation bypassing [28], recursively bypassing a chain of operations on the critical path, bypassing after the block is fully scheduled without changing the schedule to reduce only register file accesses, etc.

4.2 Simple connectivity reduction
With the freedom of design choices offered by transport-triggered architectures, the process of manually optimizing the connectivity of an actual processor can become rather difficult. We start by manually selecting a register file configuration, as will be described in Section 5, and a fully connected interconnection network. We used a simple connectivity reduction. The idea behind the algorithm is to schedule an application for a fully connected TTA and then simply remove the connections of function units and register files to the buses that were never used in the existing schedule. In addition, whole function units and their respective sockets could be removed, if unused by the application. The reduction in the number of socket-to-bus connections should lead to fewer bits required to encode the source and destination fields of data transports for buses and possibly allow for a higher clock frequency to be achieved.

5 Experimental setup
We selected the open-source TTA-based Co-design Environment [29,30] as the platform for our experiments. To evaluate the effect of software bypassing and connectivity reduction, we selected an integer subset of the CHStone v1.7 [31] benchmark, as described in Table 1. The selected designs were identical except for their register file configurations. The number of interconnection buses and function units remains the same for all the tested architectures as well as the total number of registers.
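The simple connectivity reduction of Section 4.2 can be sketched as a set operation over socket-to-bus connections. The sketch below is only an illustration under stated assumptions: the schedule is represented as (bus, source socket, destination socket) triples, the socket names are hypothetical, and the bit-width helper assumes a plain binary encoding of the sockets attached to a bus, which is a simplification of a real TTA instruction encoding.

```python
def reduce_connectivity(connections, schedule):
    """Keep only the socket-to-bus connections used by the schedule.

    connections: set of (socket, bus) pairs of the original machine.
    schedule: iterable of (bus, src_socket, dst_socket) data transports.
    """
    used = set()
    for bus, src, dst in schedule:
        used.add((src, bus))
        used.add((dst, bus))
    missing = used - set(connections)
    if missing:
        raise ValueError(f"schedule uses non-existent connections: {missing}")
    return set(connections) & used


def socket_field_bits(connections, bus):
    """Bits needed to select one of the sockets attached to a bus.

    Assumption: simple binary encoding, ceil(log2(n)) bits for n sockets.
    """
    n = sum(1 for _, b in connections if b == bus)
    return 0 if n <= 1 else (n - 1).bit_length()


if __name__ == "__main__":
    sockets = ["alu.in", "alu.out", "rf.r", "rf.w"]  # hypothetical names
    full = {(s, b) for s in sockets for b in (0, 1)}
    schedule = [(0, "rf.r", "alu.in"), (1, "alu.out", "rf.w"),
                (0, "alu.out", "alu.in")]
    reduced = reduce_connectivity(full, schedule)
    print(len(full), "->", len(reduced), "connections")
    for bus in (0, 1):
        print("bus", bus, socket_field_bits(full, bus), "->",
              socket_field_bits(reduced, bus), "bits")
```

Because the reduced machine keeps exactly the connections the existing schedule needs, the schedule itself is unchanged; only a recompiled or modified application can run into the missing connections.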
We selected the number of interconnection buses in such a way that they do not limit the maximum achievable instruction-level parallelism available in the selected benchmarks. Selected register file configurations could be split into three categories:

1. Architectures with a single multi-ported (SM) register file
   SM 1 4 4 - 1 RF - 4 read 4 write ports (42 registers; Figure 2a)
   SM 1 3 3 - 1 RF - 3 read 3 write ports (42 registers; Figure 2b)
   SM 1 2 2 - 1 RF - 2 read 2 write ports (42 registers; Figure 2c)
   SM 1 4 2 - 1 RF - 4 read 2 write ports (42 registers; Figure 2d)

Table 1 Integer subset of CHStone benchmark used in our experiments

   Benchmark   Origin
   adpcm       CHStone/SNU
   aes         CHStone/AILab
   blowfish    CHStone/MiBench
   gsm         CHStone/MediaBench
   jpeg        CHStone/The Portable Video Research Group
   mips        CHStone
   motion      CHStone/MediaBench
   sha         CHStone/MiBench

Algorithm 1 Scheduling moves of a single operation in bottom-up fashion

    ScheduleOperation(inputs, outputs)
    resultsfailed := true
    operandsfailed := true
    bypassearly := true
    bypasslate := true
    cycle := DDGLatestCycle(outputs)
    while cycle > 0 or resultsfailed or operandsfailed do
      resultsfailed := ScheduleResultReads(cycle, outputs, bypassearly, bypasslate)
      if not resultsfailed then
        {Result moves scheduled}
      else if bypassearly and bypasslate then
        bypasslate := false
        continue {Retry with the same start cycle without late bypassing}
      else if bypassearly and not bypasslate then
        bypassearly := false
        bypasslate := true
        continue {Retry with the same start cycle without early bypassing}
      else if not bypassearly and bypasslate then
        bypasslate := false
        continue {Retry with the same start cycle without any bypassing}
      else
        bypassearly := true
        bypasslate := true
        cycle := cycle - 1
        continue {Retry with earlier start cycle and both bypassing methods}
      end if
      operandsfailed := ScheduleOperandWrites(cycle, inputs)
      if not operandsfailed then
        {Both, results and operands of operation, successfully scheduled}
        return true
      else
        {Operand moves cannot be scheduled}
        {with current position of result moves.}
        UnscheduleResultReads(outputs)
        if bypassearly and bypasslate then
          bypasslate := false
        else if bypassearly and not bypasslate then
          bypassearly := false
          bypasslate := true
        else if not bypassearly and bypasslate then
          bypasslate := false
        else
          bypassearly := true
          bypasslate := true
          cycle := cycle - 1
        end if
      end if
    end while
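The retry order of Algorithm 1 (both bypassing modes, early only, late only, none, then an earlier start cycle) can be written compactly as two nested loops. The following Python sketch is an illustration only, under several assumptions: the three scheduling helpers are passed in as hypothetical callbacks that return True on success, the search stops once cycle 0 has been tried, and the function returns the start cycle that worked, or None, instead of a Boolean.

```python
# Order in which (bypass_early, bypass_late) combinations are attempted.
BYPASS_MODES = ((True, True), (True, False), (False, True), (False, False))


def schedule_operation(inputs, outputs, latest_cycle,
                       schedule_result_reads,
                       schedule_operand_writes,
                       unschedule_result_reads):
    """Bottom-up scheduling of one operation with bypass fallbacks.

    Callbacks (hypothetical, supplied by the caller):
      schedule_result_reads(cycle, outputs, early, late) -> bool
      schedule_operand_writes(cycle, inputs) -> bool
      unschedule_result_reads(outputs) -> None
    Returns the start cycle that succeeded, or None if none did.
    """
    for cycle in range(latest_cycle, -1, -1):
        for early, late in BYPASS_MODES:
            if not schedule_result_reads(cycle, outputs, early, late):
                continue  # results failed: try the next bypass mode
            if schedule_operand_writes(cycle, inputs):
                return cycle  # results and operands both scheduled
            # Operands do not fit with this placement of the result moves:
            # undo the result moves (and their bypasses) and try again.
            unschedule_result_reads(outputs)
    return None
```

A caller would supply callbacks bound to its own data dependence graph and resource tables; the sketch only fixes the order in which the alternatives are tried.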

Algorithm 2 Schedule and bypass result reads

    ScheduleResultReads(cycle, outputs, bypassearly, bypasslate)
    for all result in outputs do
      bypasssuccess := false
      if bypassearly then
        {Try to bypass all uses of the result move}
        bypasssuccess := BypassMove(result, cycle)
        if bypasssuccess and result not used outside the current block then
          {All moves that use result were bypassed}
          continue
        end if
      end if
      {Not all uses of result were bypassed}
      {or the result is used in different block.}
      ScheduleCandidateALAP(result, cycle)
      if not result is scheduled then
        UndoBypass(result)
        return false
      end if
      if bypasslate then
        {Still try to bypass result uses to reduce antidependencies}
        BypassMove(result, cycle)
      end if
    end for
    return true

[Figure 2 Register file with multiple read and write ports. An example of a single register file (RF 42x32) with multiple write ports, connected to the interconnection bus. (a) SM 1 4 4. (b) SM 1 3 3. (c) SM 1 2 2. (d) SM 1 4 2.]

2. Architectures with a single register file with a single read and a single write port (SS) or multiple register files with a single read and a single write port (MS)
   SS 1 1 1 - 1 RF - 1 read 1 write port (42 registers; Figure 3a)
   MS 2 1 1 - 2 RFs - 1 read 1 write port in each (2 x 21 registers; Figure 3b)
   MS 3 1 1 - 3 RFs - 1 read 1 write port in each (3 x 14 registers; Figure 3c)

3. Architectures with multiple register files with multiple read and write ports (MM)
   MM 2 2 1 - 2 RFs - 2 read 1 write port in each (2 x 21 registers; Figure 4a)
   MM 2 2 2 - 2 RFs - 2 read 2 write ports in each (2 x 21 registers; Figure 4b)

[Figure 3 Register file with one read and one write port. An example of a register file with one read and one write port, connected to the interconnection bus. (a) SS 1 1 1 (RF 42x32). (b) MS 2 1 1 (RF and RF_1, 21x32 each). (c) MS 3 1 1 (RF, RF_1, and RF_2, 14x32 each).]

[Figure 4 Two register files with two read and one or two write ports each. An example of two register files (RF and RF_1, 21x32 each) with two read and one or two write ports in each, connected to the interconnection bus. (a) MM 2 2 1. (b) MM 2 2 2.]

Examples of register file implementations are shown in Figure 5a,b. The structure of the register file can be divided into three parts:

   input control
   register bank
   output control

For each data input port, the register file contains an input opcode, which specifies the register to be written in the register bank, and an input trigger, which describes when the data are to be written to the register described in the corresponding opcode. For each data output port, the register file contains an output opcode, which describes which register is fed to the output port. When using the register file with single input and output ports, the complexity of the write and read control is transferred to the interconnection network and, while being visible in the data path, can be optimized more effectively by compiler techniques and in the design space exploration. Usually, the register file tends to be the end point of the critical path of the processors. Increasing the input control capacitance and delay by adding a write port has an effect not only on the input control logic of the register file, but also on the interconnection network. The addition of a read port to the register file has minor effects on the capacitance and the delay of the control logic. Remaining parts of our processor stay the same, with three integer ALUs, one multiplier, a load store unit, and eighteen buses to accommodate the ILP available across our set of benchmarks.

We schedule our selected benchmarks for the set of preselected TTA designs four times. First, we use our previously published top-down scheduling algorithm [3], with actual software bypassing disabled, and collect resulting data, such as the number of clock cycles the benchmarked application needs to end and the number of reads and writes of the registers in all available register files. We also collect information about the instruction width and the number of socket-to-bus connections. Afterwards, we enable software bypassing with the top-down scheduler and recompile our set of benchmark applications for all the selected architectures again, collecting the same data as above. Having collected information from the more conservative software bypassing implementation, we repeat the steps above using our new, early software bypassing during the bottom-up scheduling algorithm, presented in Section 4.1. We collect data without software bypassing enabled and again with software bypassing enabled.
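The nine register file configurations listed in this section can be captured as plain data, which makes it easy to check that every design offers the same 42 registers and to compare port counts. The sketch below is an illustration only; the class name, field names, and the total_ports measure are hypothetical conveniences, and the numbers are taken directly from the lists above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterFileConfig:
    name: str
    files: int               # number of register files
    read_ports: int          # read ports per register file
    write_ports: int         # write ports per register file
    registers_per_file: int

    @property
    def total_registers(self):
        return self.files * self.registers_per_file

    @property
    def total_ports(self):
        return self.files * (self.read_ports + self.write_ports)


CONFIGS = [
    RegisterFileConfig("SM 1 4 4", 1, 4, 4, 42),
    RegisterFileConfig("SM 1 3 3", 1, 3, 3, 42),
    RegisterFileConfig("SM 1 2 2", 1, 2, 2, 42),
    RegisterFileConfig("SM 1 4 2", 1, 4, 2, 42),
    RegisterFileConfig("SS 1 1 1", 1, 1, 1, 42),
    RegisterFileConfig("MS 2 1 1", 2, 1, 1, 21),
    RegisterFileConfig("MS 3 1 1", 3, 1, 1, 14),
    RegisterFileConfig("MM 2 2 1", 2, 2, 1, 21),
    RegisterFileConfig("MM 2 2 2", 2, 2, 2, 21),
]

if __name__ == "__main__":
    for cfg in sorted(CONFIGS, key=lambda c: c.total_ports):
        print(f"{cfg.name}: {cfg.total_registers} registers, "
              f"{cfg.total_ports} ports in total")
```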
This will allow us to consider the differences between the scheduling and bypassing strategies. For the second test, for each combination of benchmark and architecture, we remove unused connections. This has no effect on the actual cycle counts or the number of register file reads and writes, but it reduces the number of socket-to-bus connections and in some cases narrows the instruction word, as the number of bits required to address all sockets connected to any bus can drop. The application of bypassing of course changes the schedule, so in some cases, the number of removed connections can be higher for the case without bypassing and vice versa.

Taking those eight sets of data, we synthesize each architecture to a 130-nm CMOS standard cell ASIC technology with Synopsys Design Compiler and run gate-level simulation with Mentor ModelSim. From the results of the gate-level simulation, we acquire gate activity for the Synopsys Power Compiler. From the Synopsys Power Compiler, we acquire the power used by individual architectural components of the processor core, such as the interconnection network, individual register files, function units, instruction fetch, and instruction decode. The processors were synthesized to a 250-MHz clock frequency (4-ns clock period) since for this value, architectures with a larger number of read and write ports can still be synthesized and meet timing constraints.

In addition, after collecting the results from the experiments as described above, we select the architecture configuration that showed the best energy efficiency with software bypassing across our set of benchmarks and take the benchmarks' cycle counts as a measure of real-time performance at the 250-MHz frequency. Connectivity reduction does not improve clock cycle performance; therefore, to achieve the same real-time performance, we compute the required frequency as follows:

Required frequency = 250 MHz × (reduced connectivity cycles / bypassed cycles).

We then attempt to synthesize each of the benchmarks for its required frequency, run gate-level simulation, and collect power data, as described above.
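The required-frequency computation can be sketched in Python as follows. This is a minimal illustration and not the authors' tooling: the function names and the cycle counts in the usage lines are hypothetical, and only the 250-MHz baseline and the cycle ratio come from the text.

```python
def required_frequency_mhz(reduced_connectivity_cycles: int,
                           bypassed_cycles: int,
                           base_frequency_mhz: float = 250.0) -> float:
    """Frequency a reduced-connectivity design (without bypassing) would need
    to finish in the same wall-clock time as the bypassed design running at
    the base frequency."""
    if reduced_connectivity_cycles <= 0 or bypassed_cycles <= 0:
        raise ValueError("cycle counts must be positive")
    return base_frequency_mhz * reduced_connectivity_cycles / bypassed_cycles


def clock_period_ns(frequency_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / frequency_mhz


# Hypothetical cycle counts, chosen only to illustrate the arithmetic.
example_mhz = required_frequency_mhz(1_200_000, 1_000_000)
print(example_mhz, clock_period_ns(example_mhz))
```

Because connectivity reduction leaves the cycle count unchanged, the ratio exceeds 1 whenever bypassing shortens the schedule, so the required frequency is then above 250 MHz.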
The results of our experiments are discussed in detail in Section 6.
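To illustrate why removing socket-to-bus connections can narrow the instruction word, as noted in this section, the sketch below counts the bits needed to select a socket on each bus. It assumes a deliberately simplified encoding that is not taken from the paper: each bus has one source field and one destination field, and each field is a plain binary index over the sockets connected to that bus plus one reserved no-operation code. All names and socket counts are hypothetical.

```python
def field_bits(connected_sockets: int) -> int:
    """Bits needed to index the connected sockets plus one reserved
    no-operation code (the reserved code is an assumption of this sketch)."""
    encodings = connected_sockets + 1
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1.
    return (encodings - 1).bit_length()


def instruction_width(buses) -> int:
    """Sum of source and destination field widths over all buses.

    `buses` is a list of (source_sockets, destination_sockets) pairs."""
    return sum(field_bits(src) + field_bits(dst) for src, dst in buses)


# Hypothetical four-bus machine before and after connectivity reduction.
fully_connected = [(15, 15)] * 4
reduced = [(7, 3), (7, 3), (15, 7), (3, 3)]

print(instruction_width(fully_connected), instruction_width(reduced))
```

Under this assumed encoding, a field narrows only when its number of encodings crosses a power-of-two boundary, which is one way to see why removing connections narrows the instruction word only in some cases.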

Figure 5 Examples of implementations of register files. In this figure, we display the implementation of (a) two register files, each with one read and one write port and (b) a single register file with two read and two write ports.

6 Results

In the following, we first present the results collected while setting up our experiment in Subsection 6.1. Afterwards, we discuss the energy results obtained by synthesis and simulation in Subsection 6.2. In addition, in Subsection 6.3, we discuss the results collected when trying to match the real-time performance obtained with software bypassing by synthesizing for a higher frequency.

6.1 Data collected before synthesis and simulation

Figure 6a shows the results we collected when scheduling our set of benchmarks for the selected architectures with and without software bypassing. The cycle counts do not change when connectivity reduction is applied, as the reduction is done after the schedule is generated. All the results are normalized to the worst case of the top-down scheduler (TD), which is the single register file with a single read and a single write port. It is clear from the figure that bypassing has a dramatic effect on the clock cycles. The additional scheduling freedom and the direct read of a result from the result register to the operand save a significant number of clock cycles. This is a factor we expect to contribute hugely to the total energy reduction when using software bypassing.

The difference between our conservative top-down scheduling and bypassing and the more opportunistic bottom-up scheduling (BU) and bypassing is also visible in this figure. The top-down scheduling algorithm gives slightly better clock cycle counts than the bottom-up scheduler, in particular for an architecture with a single read and a single write port in the single register file. This is mainly caused by another optimization, delay slot filling. Our implementation of delay slot filling takes advantage of predicated execution, which allows it to fill more than just the delay slots after the branch operation. Practically, as soon as the predicate used by the branch instruction is computed, the operations from the following basic blocks, including the top of the loop body, can be scheduled, guarded by the same predicate. When scheduling the loop body, the top-down scheduler schedules operations that are not on the critical path, such as the loop counter increment and the loop repeat condition evaluation, as early as possible, while the bottom-up scheduler schedules them as late as possible, virtually just before the branch takes place.
Figure 6 Statistics collected from software bypassing and reduced connectivity for each architecture. In this figure, we present (a) the number of clock cycles normalized to the worst case of no bypassing and top-down scheduler, (b) the number of register reads and writes normalized to the case with no bypassing and top-down scheduler, (c) reduction in instruction width normalized to the case without connectivity reduction with top-down scheduler, (d) the number of connections left after connectivity reduction normalized to the case without connectivity reduction with top-down scheduler.

In case

the loop counter increment is used immediately at the beginning of the basic block, the top-down scheduler can therefore schedule the conditional execution of the operations from the top of the basic block even before the branch operation. The branch operation then transfers control to the later part of the basic block, after those filled operations, eventually creating a smaller loop body. In the case of the bottom-up scheduler, the late computation of the loop counter and loop predicate prevents this from happening. A further optimization that identifies such dependencies and takes a mixed approach of bottom-up and top-down scheduling would bring the best of both worlds. However, once bypassing is enabled, the cycle count differences become much smaller, removing the penalty of the worse starting point of the bottom-up scheduler.

More detailed results of the achieved clock cycles of two typical cases of our benchmarks, gsm and motion, are presented in Figure 7a,b, respectively. Here, we observe that the effects of using bypassing as well as of different register file configurations vary between individual benchmarks. The addition of read and write ports improves the cycle count to a lesser extent than software bypassing, with the exception of the gsm benchmark, where bypassing with a single register file with a single read and a single write port does not improve performance as much as the addition of read and write ports or more register files. In contrast, for the motion benchmark, bypassing leads to rather uniform clock cycle counts across the range of architectures.

In Figure 6b, we can see the reduction in the number of reads and writes of the register files for each of the selected architectures when software bypassing is applied. The results are normalized to the number of reads and writes without bypassing with TD.
Figure 7 The number of clock cycles with and without software bypassing for each architecture. In this figure, we display the number of clock cycles normalized to the worst case without software bypassing with top-down scheduler for (a) gsm and (b) motion. Vertical axis: clock cycles (normalized). Series: no bypassing TD, with bypassing TD, no bypassing BU, with bypassing BU.

It is clear from Figure 6b that

the use of bottom-up scheduling and bypassing leads to a significant decrease in the number of register reads and writes compared to the more conservative bypassing using top-down scheduling. As before, a more detailed analysis of two individual benchmarks, mips and motion, is shown in Figure 8a,b, respectively. We observe that in the case of the mips benchmark, about 45% of register reads and writes are eliminated when using software bypassing, which represents the worst result from our set of benchmarks. On the other hand, the motion benchmark shows a dramatic difference in the reduction of register file reads between the top-down and bottom-up schedulers, with over 80% of register writes and 60% of register reads eliminated.

Figure 8 The number of register file reads and writes left after bypassing for each architecture. In this figure, we display the number of dynamic register file reads and writes after the application of software bypassing normalized to the worst case without software bypassing with top-down scheduler for (a) mips and (b) motion. Vertical axis: % of maximum reads/writes without bypassing. Series: register reads TD, register writes TD, register reads BU, register writes BU.

In Figure 6c, we can see the reduction in instruction width when connectivity reduction is applied, and Figure 6d shows the number of socket-to-bus connections left after connectivity reduction. It can be seen from Figure 6c,d

that there is variation when applying connectivity reduction for the cases without bypassing and with bypassing. Since bypassing changes the schedule, there are added direct data transports between function units and the schedule is more compact, leading to more activity per cycle. On the other hand, the number of data transports between function units and the register file decreases with software bypassing; therefore, connectivity to the register file can be reduced. It can be seen from those two figures that once again bottom-up scheduling and early bypassing lead to a larger reduction of instruction width and a lower number of connections left; the variation is, however, only about 5%.

A detailed view of the reduction of instruction width with software bypassing for the benchmarks adpcm and blowfish is presented in Figure 9a,b, respectively. Here, we observe that for three of the architectures with a single register file, the removal of unused connections did not lead to a decrease in instruction width at all. For the other architectures, we observe that the reduction for architectures with several register files varies, following patterns very similar to those seen in Figure 6c, with variations of only a few percentage points.

Figure 9 Reduction in instruction width after removing connections for each architecture. In this figure, we display reduction in instruction width after connectivity reduction compared to the case without reduced connections and software bypassing with top-down scheduler for (a) adpcm and (b) blowfish. Vertical axis: relative instruction width reduction in % of fully connected. Series: no bypassing TD, with bypassing TD, no bypassing BU, with bypassing BU.

A detailed view of the number of socket-to-bus connections left after connectivity reduction for adpcm and blowfish is shown in Figure 10a,b, respectively. We observe a clear general pattern that actually holds across all the benchmarks. The smallest number of connections left is generally found with several simple register files, while a single multi-ported register file requires a larger number

of connections to remain. It can also be seen from Figure 10 that software bypassing has little effect on the number of connections removed, with the largest difference of only about 5% and a variation between the top-down and bottom-up schedules of only about 10%.

Figure 10 The normalized number of connections left after connectivity reduction for each architecture. In this figure, we display the number of connections left after connectivity reduction normalized to the case without connectivity reduction and software bypassing applied with top-down scheduler for (a) adpcm and (b) blowfish. Vertical axis: relative number of connections left in % of fully connected. Series: no bypassing TD, with bypassing TD, no bypassing BU, with bypassing BU.

Overall, Figure 6 shows that connectivity reduction indeed leads to a reduction in instruction width and successfully removes a large number of socket-to-bus connections, and that software bypassing produces a large reduction in dynamic register reads and writes as well as a large drop in cycle counts, so that, in the end, simpler architectures with a single read and a single write port in the register file outperform much larger architectures with multi-ported register files without bypassing.

The combination of these reductions affects the power of individual processor components and results in an energy reduction. However, we can also observe that the effect of software bypassing on the successful removal of connections is clearly limited. There is typically at most 5% variation in the number of connections removed between using software bypassing or not, and a similarly small variation in instruction width reduction. A larger variation is visible between top-down scheduling with conservative bypassing and bottom-up scheduling with early bypassing. We observe that while the older top-down scheduler provides a better starting point in terms of clock cycles than the new bottom-up scheduler implementation, the difference narrows when bypassing is enabled, and overall, in terms of register file accesses, connectivity removal, and instruction width, the novel algorithm performs better.

6.2 Results of synthesis and simulation

After performing gate-level simulation on all our benchmarks, architectures, and optimization combinations as described in Section 5, we collected power data for individual processor components. We computed the energy of individual processor components using the common formula Energy = Power × Cycles / Frequency. In addition to the individual benchmarks, we computed the averages over all benchmarked applications to focus on overall trends.

Figure 11 shows how the energy of all processor core components changes across architectures and with connectivity reduction and bypassing applied. Specifically, in Figure 11a, FC TD denotes a fully connected architecture configuration with top-down scheduling, RC TD denotes an architecture with reduced connectivity with top-down scheduling, FC BU denotes a fully connected architecture configuration with bottom-up scheduling, and RC BU denotes an architecture with reduced connectivity with bottom-up scheduling.
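The energy formula used in this subsection can be written out as a short Python sketch. The component names follow the list given in Section 5, but the power and cycle values are hypothetical placeholders rather than measured results, and the function names are illustrative only.

```python
def component_energy_joules(power_watts: float, cycles: int,
                            frequency_hz: float) -> float:
    """Energy = Power x Cycles / Frequency, where Cycles / Frequency is the
    run time in seconds."""
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    return power_watts * cycles / frequency_hz


def core_energy_joules(component_power_watts: dict, cycles: int,
                       frequency_hz: float) -> dict:
    """Per-component energy for one benchmark run on one architecture."""
    return {name: component_energy_joules(power, cycles, frequency_hz)
            for name, power in component_power_watts.items()}


# Hypothetical per-component power in watts; not taken from the paper.
example_power = {
    "interconnection network": 0.010,
    "register files": 0.004,
    "function units": 0.006,
    "instruction fetch": 0.003,
    "instruction decode": 0.002,
}
example_energy = core_energy_joules(example_power, cycles=1_000_000,
                                    frequency_hz=250e6)
example_total = sum(example_energy.values())
print(example_energy, example_total)
```

Since Cycles / Frequency is the run time, a lower cycle count reduces energy even when component power is unchanged.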
In Figure 11b, FC + bypass TD denotes a fully connected architecture with software bypassing applied with the top-down scheduler, RC + bypass TD denotes software bypassing followed by connectivity reduction with the top-down scheduler, FC + bypass BU denotes a fully connected architecture with software bypassing applied with the bottom-up scheduler, and RC + bypass BU denotes software bypassing followed by connectivity reduction with the bottom-up scheduler.

Figure 11 Average spent energy for each architecture. In this figure, we display the relative energy for (a) fully connected architectures and reduced connectivity without bypassing and (b) fully connected architectures and reduced connectivity with software bypassing. Vertical axis: % of maximum energy.

It can be seen from the figure that the effect of connectivity reduction is larger for cases with single-ported register files. In each of the cases, the effect of software

bypassing on the overall energy is larger than that of connectivity reduction. The minimal energy is found with the combination of both software bypassing and connectivity reduction, as they complement each other, with about 5% reduction compared to the largest architecture without bypassing and connectivity reduction. It is notable that bottom-up scheduling without software bypassing produces worse energy results due to the penalty of slightly higher clock cycle counts. However, the application of connectivity reduction erases this penalty, leading to results similar to top-down scheduling, and the application of bypassing and connectivity reduction produces slightly better energy results for bottom-up scheduling than for top-down scheduling for most of the architectures.

For the individual benchmarks mips and motion, Figure 12a,b shows the same data, with the left side of the figure displaying data without bypassing and the right side of the figure with bypassing enabled. The effect of connectivity reduction and software bypassing follows the patterns observed previously in Figure 6, with Figure 12a representing the worst observed result and Figure 12b the best observed result. Overall, the results in Figure 11 show that the combination of software bypassing and connectivity reduction leads to energy savings of up to 5% using single-ported register files compared to the energy required by the largest architecture with four read and four write ports, while maintaining or improving cycle counts.

Looking in more detail, Figure 13a shows the effect of bypassing and connectivity reduction on the energy of register files, with the left graph displaying data without bypassing and the right graph with bypassing.
Due to its nature, connectivity reduction does not contribute significantly to the reduction of the energy of register files (with the exception of possible buffer distribution, as discussed previously in Section 5), and the main benefit comes from bypassing, which combines a drop in cycle counts (Figure 6a) with an actual reduction in the number of dynamic register reads and writes (Figure 6b). It can be seen from Figure 13a that the addition of a write port increases the energy dramatically; this trend is visible regardless of the use of software bypassing. As a matter of fact, for the single register file, we observed a linear progression when adding register file write ports, starting from the architecture with a single register file with a single read and write port (denoted as 1 1 1) through two read and two write ports (1 2 2) and three read and write ports (1 3 3) to the most expensive four read and write ports (1 4 4). The addition of read ports (architecture with four read and two write ports, denoted as 1 4 2), however, does not differ significantly from two read and two write ports. A similar observation can be made for the case of two register files. The architecture with two register files, each with a single read and a single write port (2 1 1), does not differ significantly from the architecture with two read ports and a single write port in each register file (2 2 1). However, the addition of a second write port to both register files (2 2 2) leads to a significant jump in energy consumption. While in Figure 6 we observed fairly consistent clock cycle performance across the range of architectures after the application of software bypassing, as well as a reduction in register file reads and writes, Figure 13a clearly shows how expensive more complex register file configurations are, even with software bypassing.
Figure 14a,b shows the breakdown of the energy consumption of register files for the mips and motion benchmarks. The dramatic reduction in register file reads and writes observed in Figure 8b has a visible effect, causing a significant difference in energy between software bypassing with the top-down scheduler and software bypassing with the bottom-up scheduler in the case of motion.

Figure 13b shows the effect of bypassing and connectivity reduction on the interconnection network. Both connectivity reduction and bypassing result in a drop in energy, with the combination of both providing the best results. However, while connectivity reduction lowers energy by actually removing components that consume energy, the benefit from software bypassing is largely due to a decrease in cycle counts and interconnection network traffic. The use of the new bottom-up scheduling and bypassing algorithm results in better interconnection energy results for all of the architectures. Figure 15a,b shows the same information for mips and motion, respectively.

As discussed in Section 5, the register files and the interconnection network interact. Therefore, when synthesizing for reduced connectivity, the synthesis tool redistributed capacitance between the interconnection network and the register file, which shows, for the architecture marked as 3 1 1, as a slight increase in interconnection network energy. In order to investigate the claim that connectivity reduction leads to a decrease in processor core energy, we present Figure 16, where the breakdown for the mips and motion benchmarks shows the combined energy of the register file(s) and the interconnection network, respectively. We observe that in cases where connectivity reduction caused a slight increase in register file energy or interconnection energy, the sum of those two still shows a minimal difference compared to fully connected architectures. The same result has been observed for all the benchmarks.

Figure 17 shows the results of energy reduction in decode, which follow a similar pattern. A notable exception