April 9, 2000 DIS chapter 10 CHAPTER 3 : INTEGRATED PROCESSOR-LEVEL ARCHITECTURES FOR REAL-TIME DIGITAL SIGNAL PROCESSING

3.1. INTRODUCTION

The purpose of this chapter is twofold. Firstly, basic processor-level architectural strategies or styles will be investigated to map algorithms into integrated embedded software and custom hardware systems (architectures). The methods are suited especially for real-time multi-media and telecom signal processing (RSP) applications, but solutions for off-line and numerical processing will also be included.

Secondly, the basic modules out of which all digital hardware is composed will be investigated (Fig.1.1). This will result in requirements for communication, data-path (supporting the algorithmic operations), controller and storage components. These modules will be treated in more detail in the subsequent chapters.

It is also important to be aware that a complete architectural methodology has to include ways to map the underlying modules (components) into an IC layout both fast and efficiently. Moreover, the characteristics of the parametrisable building block library have to be predefined, to allow using them during the architectural exploration phase. This includes views of the area, power and speed measures for adders, registers, ROM's, standard cells and the like.

3.2. PROCESSOR-LEVEL ARCHITECTURAL STYLES

In order to clarify the need for different architectural styles if an efficient embedded integrated system has to be obtained, first a typical test-case will be treated, namely image processing systems.

3.2.1 Image processing subtasks and architectures.

When dealing with images, in general, several processing stages can be defined that are closely related to the human perception system [Off85] (see also course on image processing of Prof. Van Gool). They range from high-rate local algorithms, which can be executed in parallel, over medium-rate sequential ones, which look at larger neighbourhoods, to low-rate global ones, which require information from the entire image (Fig.3.1). To clarify this, a robot vision system will be adopted as an example.

[Fig.3.1 processing chain:
  Image    -> Front-end    (10-50 MHz): 1D or 2D arrays with regular communication
  Pixels   -> Medium-level (1-20 MHz):  lowly muxed custom processors
  Features -> Back-end     (< 1 kHz):   microcoded processors
           -> Data]

Fig.3.1 Basic submodules in most image and video processing systems. Pixel information is processed locally at the front-end, then compressed to features in the medium-stage and then interpreted into the desired data at the back-end.

Appropriate architectural design strategies for these submodules heavily depend on the throughput requirements and the properties of the signal and control flow.

Front-end image and video processing (10-40 MHz range) typically involves local enhancement and restoration operations on the original image. For instance, for the robot vision system, the pixel information in the original images from the camera can be transformed to represent the potential location of edges. This is typically achieved based on gradient information gathered in a local neighbourhood (e.g. 3x3) around every pixel position. These extremely regular and local operations should also be reflected in the way they are mapped onto silicon. This typically results in modular, highly parallel architectures such as systolic arrays (see subsection 3.2.4).

At the medium-level (1-10 MHz range) the still massive amount of local (pixel-level) information is transformed and compressed to be useful for the final image interpretation stage. Here, higher-level image information (so-called features) such as the actual edges, texture or faces of objects is identified, resulting in the removal of redundancy (i.e. the noise and the unnecessary details). Typically, partly recursive and thus sequential algorithms are needed (as in recursive or IIR filters). For instance, in the robot vision system, in the transformed image with the potential position of edges, first a "begin-point" can be identified. From there on, an iterative (recursive) process can start where the local neighbourhood of the current end point of the edge is scanned for the best candidate to become the new end-point. As a result of this combination between irregular, recursive algorithms and relatively high throughput, the architectures should combine customized arithmetic operators (i.e. a fast "hard-wired" data-path) with multiple communication and storage modules that accommodate high clock rates (and thus data rates). This lowly time multiplexed alternative will be explained in subsections 3.2.3 and 3.2.4, and illustrated in detail for the 2nd order filter example in section 3.3.

Finally, at the back-end, the periodic data streams are now the sets of features for the repetitive image frames. Hence, the frame rate is quite low (from a few Hz to a few kHz). However, for each of these frames, sophisticated algorithms for e.g. further compression or image understanding have to be realized. They involve global operations and a lot of decision-making. For instance, for our robot vision system, once the many individual edges have been clustered into complete contours, the actual object recognition can start. For this purpose, the shapes of the contours are compared with possible templates stored in an object library (which contains e.g. cubes, disks, spheres, pyramids). This involves complex pruning (with condition trees) of the many possible matched constructs to obtain a robust recognition process. Consequently, the architectures most suited for this range of applications will be highly flexible and will have to provide an efficient solution when the clock rate achievable in the hardware is much higher than (in this case) the frame rate. This will lead to the use of highly multiplexed processors (see subsection 3.2.4 and section 3.3).
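To make the front-end stage more tangible, the following minimal sketch computes a gradient-based edge strength in a 3x3 neighbourhood around every pixel. The text only states that gradient information in a local neighbourhood is used; the choice of Sobel kernels, the function name `gradient_magnitude` and the tiny test image are assumptions made for illustration. Note that every output pixel depends only on its own neighbourhood, which is exactly the regularity and locality that favours a parallel array architecture.

```python
def gradient_magnitude(image):
    """Edge strength |gx| + |gy| from 3x3 Sobel kernels (assumed kernels).

    `image` is a list of rows of integer pixel values. Border pixels are
    left at 0 because their 3x3 neighbourhood is incomplete.
    """
    rows, cols = len(image), len(image[0])
    gx_kernel = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))    # horizontal gradient
    gy_kernel = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))    # vertical gradient
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = gy = 0
            for dr in range(3):
                for dc in range(3):
                    pixel = image[r + dr - 1][c + dc - 1]
                    gx += gx_kernel[dr][dc] * pixel
                    gy += gy_kernel[dr][dc] * pixel
            out[r][c] = abs(gx) + abs(gy)
    return out


# Hypothetical 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 10, 10] for _ in range(4)]
edges = gradient_magnitude(image)

# A uniform image contains no edges at all.
flat = gradient_magnitude([[5] * 4 for _ in range(4)])
```

In this sketch the interior pixels next to the intensity step receive a large value while the uniform image yields zeros everywhere; a threshold on `edges` would then give the "potential location of edges" that the medium-level stage starts from.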

3.2.2 The architectural design problem

So we see that in order to map such applications onto hardware, first a suitable architectural style for the processors has to be selected. Many alternatives are available here, which will be addressed in subsection 3.2.3. From the image processing example, it is clear that this important design decision depends on the characteristics of the flow-graph, the sample rate specifications and economical aspects.

Crucial DDG characteristics include the desired reprogrammability, the amount of modularity or regularity, the dominance of either parallel or recursive operations, and the presence of complex control flow, i.e. conditions and loops. The sample rate is determined by the periodicity of the data streams that are offered at the input. As already mentioned, a word of caution is needed on the interpretation of what a "sample" is for a particular subsystem in an application: does it correspond e.g. to frames or to individual words? Economical issues are related to the volume produced on a yearly basis, the time to market affordable and whether the area or the power cost has to be optimized.

Selecting an efficient architectural style is a complicated step as no "rigid" theory has been (or can be?) developed to guide it. Architectural experts base their decisions on experience, i.e. "heuristics". An attempt will be made to summarize some of this knowledge into a more or less deterministic "decision tree", or to be more precise into a set of such decision trees (Fig.3.3) which will be discussed in subsection 3.2.3. A set of trees is needed because many of the necessary choices can be made almost independently. Architectural design thus starts with "pruning" the tree branches within this variety of alternatives, resulting in a motivated choice for each of the submodules in your application (see image processing example in subsection 3.2.1).

[Fig.3.2 axes: Power (Pmax: 1-2 W); Area/space (Amax: 1 cm²); Speed, Fcl (Fmax: 50-100 MHz); design points range from serial to parallel.]

Fig.3.2 Architecture selection problem: optimisation within throughput, area and power constraints

Clearly, the "optimal" architectural decisions will also depend on the limitations of a given technology, in terms of size, timing and power consumption. For instance, current IC's should preferably be smaller than about 1 cm² for reasons of yield, though quite a bit more is reachable for very high volume circuits (not addressed in this course). In addition, the maximal power consumption should be less than 1 W in order to limit the heat dissipation and thus the package cost (plastic) and also to increase the reliability (e.g. electro-migration effects for which usually peak or root-mean-square criteria are the most relevant). For mobile and other power-conscious computing the restrictions on average power are even stricter.

Moreover, the basic gate delay is limited to between 0.5 and a few ns depending on the gate length and the technology used, so the clock frequencies are currently limited to about 50-100 MHz (Fig.3.2). Note that for very high volume circuits like microprocessors, again much higher clock rates (up to 600 MHz for the DEC-Alpha) are feasible, but these come at the price of a very expensive design and process so they are not acceptable in a consumer market context. The gate delay determines the maximal clock frequency Fcl = 1/Tcl, which can range between 10 and 100 MHz depending on the length (number of gates) of the critical path between clocked registers. The clock rates offered by IC vendors for the off-the-shelf components in their catalogs are typically lower due to overhead (on-/off-chip delay, extra hardware for flexibility).

Of course, the absolute figures (especially for the gate delay) will also depend on the minimal feature size (e.g. 0.5 µm) and nature (e.g. CMOS or bipolar) of the available IC technology. In this course, whenever absolute data are required for delays, area or power, it is assumed that a CMOS technology in the 3 to 1.25 µm range has been used (for historical reasons).
However, typically the main trends as presented here for older process technologies will continue to hold for more advanced technologies. Only the exact location of the boundary between the application domains of some of the architectural methodologies will shift a little.

In summary, a realizable architecture for a given algorithm should reside within the constrained size*time*power cube (Fig.3.2) and optimize a given cost function. In most cases, this cost function will be mainly the size of the resulting circuit. However, the overall design and test cost should be taken into account too in this trade-off with the production cost (wafer + package cost). Other important factors are the necessary volume of the production and the competition in terms of time-to-market.

This makes architecture design a very interesting and creative task which offers many challenges for the engineer. It comes very close to the creativity associated with the work of an artist. Moreover, architectural design affects both the board and the IC level, and is thus important both for "systems" (options Automatisatie en computersystemen, or Telecommunicatie) and "micro-electronics" types of engineers. In addition, the gains in terms of the optimisation of the cost function are much higher at the system and register-transfer (or architecture) level than the impact of the optimisations at the gate or transistor level!

Important notes:

1. Lower level decisions, below the RT level, are required (see 4th year course). One of the most important decisions is related to the technology platform, i.e. the choice between a fully application-specific circuit (ASIC) or a reconfigurable platform (e.g. field-programmable gate array or interconnect target). In principle, also a heterogeneous mix of these technologies is now feasible in so-called systems-on-chip platforms. Clearly, the reconfiguration of the FPGA and FPIC platforms allows more flexibility but this comes at the price of a (heavily) increased area and especially power cost.

2. At this stage, it is assumed that choices to be taken at higher abstraction levels are already fixed in the entry specification. This includes in particular the algorithmic issues like the selection of data types (see Chapter 2, Section 2), the task-level decisions (Chapter 2, Section 3) and the DTSE issues (Chapter 6, Section 7) which involve especially algorithmic transformations on loop structure and data-flow. In general, all algorithmic transformations should have been executed already. Their main objective is to minimize the waste in needless computations, data transfer and data storage. Moreover, the overall system control should have been simplified as much as possible. These issues will not be the main topic of this course but indirectly they will pop up during the illustration of the principles in the small demonstrators in the course text, and also during the exercise sessions.

[Fig.3.3 decision trees, roots and recoverable branches:
  Programmability:            General Purpose Processor; Domain Specific Processor; Custom Processor Architecture
  Data storage:               Centralized; Distributed; RAM; PAM; Regfiles
  Data distribution:          Pipelining; Broadcasting
  Control mechanism:          Control flow (centralized): hard wired, micro coded; Data flow (distributed): large grain, small grain
  Parallelism on bit level:   Bit serial; Bit parallel
  Parallelism on word level:  Single PE (SISD, hardwired data path); Multiple PE's (SIMD, MIMD, MISD, regular arrays, multi proc., cooperating data paths)
  Communication (timing):     Synchronous (clocked); Asynchronous; shared bus; self-timed (distributed); bus protocol (arbiter)]

Fig.3.3 Architectural exploration: main architectural choices. All of these alternatives can be selected completely independently (in parallel!). For this reason they are not combined into the conventional single complex tree.

3.2.3 The basic decision trees.

The many architectural design choices will now be treated in detail. In Fig.3.3, every "independent" tree has a root which is framed (e.g. programmability). Guidelines will be included whenever possible. More details can be found also in [Cat90]. It should be stressed that in many cases "hybrid" solutions, which combine two or more branches within a tree, can be even more preferable.

3.2.3.1 Programmability

Firstly, the designer has to choose how much programmability is necessary for his/her application. For production volumes, in principle, the system hardware can be fully customized to the application in order to exploit all ways of minimizing size and/or power consumption for a given timing spec. Obviously this can only happen if the algorithm to be realized is completely fixed. This results in a custom processor (CP) architecture solution. The design times of such CP's are still very high though and design cost is a major obstruction to a wider acceptance. However, today's design cycle figures can be reduced dramatically by maintaining only a limited parameterizable module library at the RT level ("meet-in-the-middle" strategy as discussed in Chapter 3 [DeM86]). This approach should be combined with the application of powerful CAD tools which support the design tasks both at the architecture synthesis and the macrocell generation or RT synthesis levels. This is demonstrated in architectural synthesis/silicon compiler environments such as the different CATHEDRALs [DeM90] at IMEC, PHIDEO at Philips [Lip93] and HYPER/LAGER [Rab91] at U.C.Berkeley.

For prototyping purposes however, programmable instruction-set processor (IP) solutions of the general-purpose (GP) type are usually favoured (see section 3.4 for examples). These will allow you to change some of the not yet fixed parts in your algorithm "on-the-fly" by modifying the "program" executed by the hardware. It should be stressed though that these solutions involve a relatively large overhead in size and power consumption to achieve this, especially at the high clock rates employed for modern RISCs.
In general-purpose DSP processors however, the clock rate is usually limited to about 50 MHz and the ratio of useful instructions is not 100%, so for leading-edge telecom and multi-media processing these (single) processors still exhibit severe limitations. The main issue in the future will be the energy-delay product that can be obtained by such processors. Recent research at U.C.Berkeley has demonstrated that this criterion is about 3 to 4 orders of magnitude larger for GPs than for CPs.

A way in between is offered by the domain-specific processors (DPs) which are optimized for a particular application domain but which still allow you to (partially) change your algorithm on-the-fly by loading another program. Examples include the current generation of Japanese video and image processors customized especially towards the front-end modules in Fig.3.1. Also the current generation of multi-media processors like the Philips TriMedia and the TI C60 series are bridging the gap between GPs and CPs. Even with the limited programmability however, the price paid in hardware or power overhead is at present considerable compared to full CP solutions. Much more architecture research is needed to make these solutions fully attractive for large market segments.

3.2.3.2 Control mechanism

Basically two ways exist to control the sequence of operations which has to be applied to the data "present" in the hardware. In the first alternative, namely control flow architectures, every action on the operators is steered from a central mechanism. Still, a hierarchy can be present in the distribution of the "commands" to the operators. This can be compared with a "dictator-ruled society" (Fig.3.4a). Here, the organisation of the controller can be hard-wired (typically for higher sample rates) or based on programmable (domain-specific) micro-code. This distinction will be discussed in detail in Chapter 5.

[Fig.3.4 panels: (a) control hierarchy from the "big boss" over the "office chiefs" to the "clerks", steering the data-path; (b) values (VAL) travelling between operators together with their tags.]

Fig.3.4 Control mechanism: control flow (a), data flow (b)

Alternatively, the decisions on the actions to be taken can be "coded" in a separate field (so-called tag) which is attached to every signal. In this case, the signals can move autonomously along the operators. In a sense, they carry their "program", which decides what to do with them, on their back (Fig.3.4b). In some ways, this data flow concept can be compared with an extremely democratic and decentralized society.

The decision on the operation control format is usually in favour of the more centralized control flow alternative for the time being. Indeed, data-flow architectures still tend to cost (too) much hardware, especially when combined with the asynchronous timing discussed below. Both an area and a power penalty are usually paid. Moreover, keeping the operators in the pure data-flow processor busy is sometimes difficult, especially when this principle is applied up to the primitive operations (small-grain data flow). Still, some more regular applications can benefit from large-grain data-flow principles where a combination is made with control flow at the lowest level of the operation hierarchy.

3.2.3.3 Data communication and timing

Either a synchronous or an asynchronous (or self-timed [Sei80]) protocol has to be decided upon.

Within the synchronous category, a basic clock exists which determines the time intervals for which both internal and external signals have to remain stable and those in which the signals may change (Fig.3.5a). The transfer between "storage locations" (registers) takes place at the rhythm of the clock. Multiple clocks may be present as long as they are derived from one master. Within the synchronous systems, communication can be fully distributed on a point-to-point basis (without data conflicts or congestion), or based on a shared bus system where every bus can have several "masters" which write and several "slaves" which read the data (Fig.3.5b).

[Fig.3.5 (a): registers (REG) clocked by Phi1 and Phi2; base clock: e.g. 2 phase non-overlapping; signals stable [i-1], stable [i].]

Fig.3.5 (a) Timing: synchronous mechanism

[Fig.3.5 (b): distributed (point-to-point) versus shared bus connection of modules 1 to 4; on the shared bus, at every cycle: 1 master => PROTOCOL.]

Fig.3.5 (b) Timing: communication options

In contrast, an asynchronous system allows signals to change at arbitrary moments in time so data streams flow autonomously through the network. In order to provide a way of synchronisation, whenever these streams have to interact, operators check the status of incoming data and only produce an "output" when all operands are available (Fig.3.6a). Again, the communication can be bus based, with an arbiter protocol which is mainly useful for the communication with the "outside world", or distributed (Fig.3.6b). The latter option usually leads to so-called self-timed data flow systems, as discussed above.

In the literature, mainly rigorously clocked architectures have been discussed so far, as the area and power overhead of completely self-timed systems is still too large in (V)LSI, even with recent advances in circuit design and design automation. Hence, we will restrict ourselves to the synchronous option in the course. This may change in the future when extremely large systems can reside on a single chip (ULSI). Most probably the concept of (large) synchronous islands in an asynchronous sea will be adopted then, where locally clocked uniform zones are separated by asynchronous interfaces at some key points in the system net-list (at locations where the synchronisation overhead between the zones can be kept negligible). The asynchronous communication allows for a large clock-skew between the islands. Moreover, it enables an easier shut-down of the components (islands) that are not needed at a particular instance in time (for data-dependent application loads).
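The firing rule of a self-timed operator, which only produces an output once all of its operands are available, can be sketched in a few lines. This is a behavioural illustration only and not a circuit-level model: the class name `SelfTimedOperator`, the unbounded input queues and the two-input adder are assumptions introduced here.

```python
from collections import deque


class SelfTimedOperator:
    """Behavioural sketch of an operator that fires when all operands are present."""

    def __init__(self, func, n_inputs):
        self.func = func
        self.inputs = [deque() for _ in range(n_inputs)]  # one queue per input port
        self.outputs = []                                 # results in firing order

    def put(self, port, value):
        """A data token arrives on `port` at an arbitrary moment in time."""
        self.inputs[port].append(value)
        self._try_fire()

    def _try_fire(self):
        # An empty deque is falsy, so all(...) is True only when every
        # input port holds at least one token.
        while all(self.inputs):
            operands = [queue.popleft() for queue in self.inputs]
            self.outputs.append(self.func(*operands))


adder = SelfTimedOperator(lambda a, b: a + b, n_inputs=2)
adder.put(0, 3)
waiting = list(adder.outputs)   # no output yet: the second operand is missing
adder.put(1, 4)
fired = list(adder.outputs)     # both operands present, so the operator fired
```

No global clock appears anywhere in this sketch; the arrival of the last missing operand is what triggers the computation. In hardware, the status check that `all(self.inputs)` performs for free has to be paid for with handshake logic, which is one source of the area and power overhead mentioned above.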

[Fig.3.6 (a): operator (Oper) preceded by a test on the availability of its inputs.]

Fig.3.6 (a) Timing: asynchronous mechanism

[Fig.3.6 (b): distributed (self-timed) connection of operators (Op) versus bus based connection of ASICs and a microprocessor with an arbiter; safe protocol: e.g. 4-way handshake.]

Fig.3.6 (b) Communication options

3.2.3.4 Data storage

Data storage can be distributed (in fast register-banks or local pointer-addressed memories) or centralized (e.g. in large off-chip frame memories) dependent on the amount of the data to be stored and on the frequency of access. All these choices will be discussed in more detail in Chapter 6. These architecture choices are very important for area and power cost reduction, as demonstrated by studies for CPs intended for multi-media applications both at Stanford (group of Theresa Meng) and at IMEC. Moreover, data storage and transfer issues should be considered in a combined way because they are heavily related. Also for IPs the main power and area cost in modern processors is situated in memory storage and the communication, as shown by recent studies at U.C.Berkeley (Dave Patterson's group), Stanford (Mark Horowitz et al.) and Princeton (Sharad Malik et al.). Therefore, system-level DTSE is a crucial stage (see Chapter 6, Section 7).

[Fig.3.7 (a): register-file with 8-bit write address and read address decoders around storage locations 1 to 4, between data in and data out. (b): PAM with incrementing write pointer and read pointer around storage locations 1 to 4, between data in and data out.]

Fig.3.7 Data storage: register-file (a), PAM (b)

The register-files usually combine several ports for reading and/or writing with a small size and an efficient address decoding mechanism (Fig.3.7a). They are preferred for fast access needed in short-term storage. Unfortunately, the area and power overhead for larger sizes is rapidly becoming unacceptable. For larger memories, several options are available. If the access mechanism can be restricted to particular types of "incrementing" the address, a pointer-based realisation (Fig.3.7b) is advantageous. If random access is needed we have to fall back on more conventional RAM's which are usually restricted to a single port.

Also the use of customized caches with data read in once but consumed many times at the output is very important. This is especially so when the communication bottleneck is situated in the I/O and when data is (heavily) reused. This is typical for e.g. buffering between a large off-chip memory and fast on-chip processing hardware.

3.2.3.5 Register distribution

One of the main ways to speed up the achievable throughput in hardware is the introduction of additional storage and synchronisation points (clocked registers in the synchronous case or data-flow synchronizers in the asynchronous case) between hardware components. This technique, called pipelining, can be employed both for a set of cascaded operators in a data-path and for long busses in the communication (as opposed to broadcasting the data, where the data are distributed all over the system in a single "clock cycle").

Pipelining can be partially compared with the introduction of several specialized workers in a factory who are responsible for only a small part of the total job and who obtain a partially completed "product", apply an incremental step towards completion and pass it to the next co-worker in the line. As a result, the throughput of the products increases compared to a situation with the same amount of workers who would combine their efforts but who have to perform all of the steps in sequence before a completed product rolls off the line and a new one can be started.
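The factory analogy can be turned into a small back-of-the-envelope calculation. The sketch below assumes a chain of operators whose delays are known, places a register after every stage, and ignores the register set-up and clock-to-output delays; the function name `pipeline_figures` and the 10 ns operator delays are hypothetical values chosen for illustration.

```python
def pipeline_figures(stages):
    """Clock period, latency in cycles and latency in ns for a pipelined chain.

    `stages` is a list of pipeline stages, each a list of operator delays
    in ns. Register overhead is ignored (an assumption of this sketch).
    """
    clock_period_ns = max(sum(stage) for stage in stages)  # slowest stage sets the clock
    latency_cycles = len(stages)                           # one cycle per stage
    return clock_period_ns, latency_cycles, clock_period_ns * latency_cycles


# Three cascaded operators of 10 ns each (hypothetical delays).
unpipelined = pipeline_figures([[10, 10, 10]])     # all operators in one stage
pipelined = pipeline_figures([[10], [10], [10]])   # a register after every operator

# A new sample can enter once per clock period, so the gain in sample rate
# equals the ratio of the clock periods.
speedup = unpipelined[0] / pipelined[0]
```

Under these assumptions the clock period drops from 30 ns to 10 ns, so three times as many samples can be entered per second, while every individual sample now needs three cycles instead of one to travel from input to output. An unbalanced partition such as `[[10, 10], [10]]` would only bring the clock period down to 20 ns, which shows why the registers should cut the critical path into stages of similar delay.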

[Fig.3.8: small data-path with two branches of cascaded operators (Op, constant multipliers, Abs) and pipeline registers between the operators.]

Fig.3.8 Illustration of pipeline principle

This principle is illustrated for a small data-path in Fig.3.8. Without the internal registers, the input signals have to pass 3 operators before a result is available and before a new input signal can be entered. Hence the critical path delay is large, resulting in both a small clock rate and a small maximal achievable sample rate. With the presence of the 4 pipeline registers (2 in each branch), the critical path in each of the pipeline stages is now reduced significantly. In the best case, the clock rate can be 3 times higher and the same applies for the sample rate. This comes at the cost of a slight increase in area but also of a significant increase in power consumption.

There is however a trade-off involved as there will be an increase in the total amount of cycles which passes between the moment at which a new signal arrives at the input and the moment at which the final result of the algorithm leaves the system. This amount of cycles is called the input-output delay (or latency). So even though the clock period itself decreases with increased pipelining and the maximal rate at which new data can be entered into the system correspondingly increases, it takes many more cycles before completion of the task for a particular signal.

[Fig.3.9: feedback loop containing a z^-1 delay; Fs = restricted even if Fcl increases!]

Fig.3.9 Recursive bottle-neck

As a result, pipelining is extremely difficult or even impossible when recursive bottlenecks are present where in principle no additional sample delays can be introduced. This is illustrated for the bottle-neck in a recursive filter in Fig.3.9. If additional pipelines are added the clock rate can increase, but any input sample still has to wait until the processed result of the previous sample has passed all the operator stages in the feedback loop before the new sample can enter it. Hence the maximal sample rate equals the clock rate corresponding to a feedback loop with only a single register!

Moreover, it has to be stressed that when the extent of pipelining continues to be raised, the extra registers will eventually increase the area and power consumption more than what can be motivated by the gain in clock speed and maximal sample rate. Still, studies at the Univ. of Aachen (group of Tobias Noll) have shown that (not too heavy) pipelining significantly reduces spurious switching due to hazards inside the pipeline sections. Overall the effect on power is positive. Moreover, gated clocking techniques should be employed to reduce the power overhead if no activity is needed in a particular section of the logic.

3.2.3.6 Parallelism on the bit- or word-level

In principle, space can be exchanged for time during the architectural exploration by a number of techniques. This can happen either by sequential treatment of the bits (or groups of bits) within a word, or of the words in (a part of) an algorithm. Both options will be analyzed in more detail in subsection 3.2.4. It has to be stressed here that the selection of the way to perform this area-power-time trade-off is THE most crucial issue in architectural design of real-time signal processing systems. Moreover, this statement applies also largely for the design of other types of systems.

3.2.4 Methodologies for efficient time multiplexing.

In order to clarify the basic options for time multiplexing or hardware sharing, and the methods to select between them, we will make some assumptions that simplify the issue considerably. In particular, it is assumed that the overhead of communication, storage and control cost is neglected, unless mentioned explicitly. Moreover, we will mostly deal with uni-dimensional (scalar type) signals, except for a few input signals stored in a separate memory.
Taking into account these other considerations as well would lead us too far here, but it has to be stressed of course that a practical methodology, applicable for real-life designs, should incorporate these complicating factors too.

3.2.4.1 Basic method

In this course, our basic approach for time-multiplexing at the architecture level is the following:

a. Derive an initial hard-wired architecture by substituting every operation in the SFG/DDG with the corresponding operator available in the building block library. If no suitable operator is available, then the high-level operation has to be expanded into more primitive operations.

b. Optimize the maximally achievable clock rate Fcl by increasing the pipeline level. This will also be beneficial for the power consumption (due to the above-mentioned reduction of spurious switching) for execution of the periodically repeated signal processing application. It will lead to an increased maximal sample rate Fs. If recursive bottle-necks are present, the achievable clock rate will not be increased in those however. As a result, the maximal sample rate can differ for different parts of the system.

c. Evaluate which parts of the design are:
- too fast: save area by sharing hardware at the bit- or word-level (see below). Avoid too extreme multiplexing (sharing) because that will increase the power. A good trade-off between area and power requires CAD tool support however.
- too slow: transform the initial SFG/DDG to increase the available parallelism. This is possible by e.g. unrolling loops responsible for recursive bottle-necks, possibly combined with resubstitution of algebraic statements (so-called look-ahead transformations). This step is however largely part of the algorithmic design trajectory where algorithmic transformations are applied to remove the redundant operations in the application and to improve the concurrency.

3.2.4.2 Principle of bit/digit-serial design

An important option for hardware sharing is the sequential treatment of the WL bits in a signal (word) in time. This is illustrated in Fig.3.10-11 for a 3-bit addition. If every bit in a word is processed with individual hardware and communicated over a separate wire, we call this a bit-parallel computation (Fig.3.10). Then, every cycle a full 3-bit word is produced. On the other hand, if all the bits are treated in sequence on a single hardware unit (usually least-significant bit or LSB first) requiring one clock cycle per bit, bit-serial mode is used (Fig.3.11a). Now, WL clock cycles are necessary to produce the full word. However, pipelining can happen at a very fine granularity so the achievable clock rates are higher than for the bit-parallel case. A problem that occurs in this case is the initialisation of the carry bit. In order to solve this, a start signal to control the carry-in has to be provided which goes high every three cycles (Fig.3.11b).

[Fig.3.10: three cascaded full adders, ordered from MSB to LSB, with operand bits a0..a2 and b0..b2, carries c0..c2, carry-in c_in = 0 and sum bits s0..s2.]

Fig.3.10 Principle of bit-parallel addition
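The bit-serial addition described above, including the start signal that initialises the carry, can be mimicked in software as a minimal behavioural sketch. The helper names `to_bits`, `from_bits` and `bit_serial_add` are introduced here for illustration, and the streams are modelled as Python lists holding the bits of consecutive words, LSB first.

```python
def to_bits(value, wl):
    """Bits of `value`, LSB first, truncated to `wl` bits."""
    return [(value >> i) & 1 for i in range(wl)]


def from_bits(bits):
    """Inverse of to_bits: rebuild the integer from LSB-first bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def bit_serial_add(a_stream, b_stream, wl):
    """One full adder reused for every bit: one clock cycle per bit."""
    carry = 0
    sum_stream = []
    for cycle, (a, b) in enumerate(zip(a_stream, b_stream)):
        start = (cycle % wl == 0)    # start pulse: high when an LSB enters
        if start:
            carry = 0                # initialise the carry-in for the new word
        total = a + b + carry
        sum_stream.append(total & 1)
        carry = total >> 1           # stored in the carry register for the next cycle
    return sum_stream


WL = 3
# Two consecutive 3-bit additions on the same hardware: 7 + 1 and 1 + 1.
a_stream = to_bits(7, WL) + to_bits(1, WL)
b_stream = to_bits(1, WL) + to_bits(1, WL)
s_stream = bit_serial_add(a_stream, b_stream, WL)
results = [from_bits(s_stream[i:i + WL]) for i in range(0, len(s_stream), WL)]
```

The first addition overflows the 3-bit word and leaves a carry of 1 behind. Without the start pulse this carry would leak into the LSB of the next word and the second result would become 3 instead of 2, which is the initialisation problem the start signal solves.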

[Fig.3.11 (a): single full adder with serial inputs a_i and b_i, serial output s_i and a carry register c_i controlled by Start. (b): Start pulse train.]

Fig.3.11 Principle of bit-serial addition: data-path (a), control signal (b)

Ideally, the area can be WL times smaller and the clock rate can be WL times faster for the bit-serial case compared to a fully bit-parallel realisation. However, in practice, the additional register cost and the many ways of speeding up also bit-parallel hardware complicate a detailed comparison. Typically, the maximal clock rate of bit-serial hardware lies between 20 and 50 MHz, which means that the serial sample rate at word level is still lower than what can be achieved in bit-parallel architectures. Also the issue of power consumption is not easy to answer in general. The power consumed by the arithmetic is largely unchanged, but the capacitive load increases and also the logic overhead requires more power.

In between these two extremes, a range of digit-serial alternatives is available where k digits of WL/k bits each are processed sequentially. This is illustrated in Fig.3.12 for an addition based on digits of 2 bits. Similar considerations for area, clock rate, sample rate and power apply as for the pure bit-serial case.

[Fig.3.12: two cascaded full adders processing the bit pairs (a_i, b_i) and (a_i-1, b_i-1) per cycle, producing s_i and s_i-1, with a carry register controlled by Start.]

Fig.3.12 Principle of digit-serial addition for digits of 2 bits

We can summarize as follows: if a given amount of data has to be processed in a given sample period, the trade-off will be in favour of digit- or bit-serial if the ratio R = Fcl/Fs allows an efficient use of the hardware. The latter is the case when WL/R-bit wide operators make sense and when the area and power consumption overhead introduced by converting the operators to bit-serial mode is not too high. As a result, the approach is mostly restricted to modular linear operations because non-linear arithmetic and decision-making are difficult to combine with partitioning words into groups of bits: they require complex control. Moreover, a problem occurs in algorithms that contain iterations or loops: in recursive structures such as accumulators, register overhead is introduced because a full word-delay has to be foreseen in the feed-back paths (see example of digital filter in section 3.3). As a result, the bit-serial architecture is mostly suited for all types of digital filters and similar applications.

3.2.4.3 Bit-serial design methodology

The following approach does not necessarily lead to "optimal" solutions but it is simple to apply manually.

a. The start is an optimized bit-parallel (hard-wired) architecture or SBD.

b. IF R = Fcl/Fs >= WL THEN bit-serial ELSE digit-serial over WL/R bits.

c. Substitute all parallel operators by serial operators. Several libraries are feasible for this. In this course, it is assumed that all serial operators are internally pipelined to ensure a high clock rate. As a result a so-called "arithmetic delay" is assigned to every operator (Fig.3.13).

  Parallel oper.     Serial oper.                                Arithmetic delay
  a + b -> s         full adder a_i, b_i, c_i -> s_i,            1 bit
                     Start (LSB)
  z^-1               WL-bit shift reg                            WL bit
  2^-k               Bit-repeater, Start (MSB), Stop (MSB+k)     k+1 bit

Fig.3.13 Library of bit-serial building blocks

Notes:
- the START control signal for the adder goes high when the LSB's enter the adder.
- a WL bit arithmetic delay is assigned to a sample delay at the word level.
- a (k+1)-bit delay is assigned to a down-shifter over k bits.

The (k+1)-bit delay is necessary due to the physical operation of a two's-complement down-shift, where the k LSB's have to be overwritten with more significant bits and where the sign has to be extended (repeated) over k bits after the sign bit has passed, hence the name "bit-repeater". The START and STOP control signals steer the extension of the sign bit. It is assumed that the START signal goes high only when the MSB enters the bit-repeater and that the STOP signal goes high exactly k clock cycles later, i.e. while the LSB's of the next word have already entered the bit-repeater. The reason why all control signals have been chosen to be "pulses" going high for exactly one cycle is to simplify the controller, which will become more clear in Chapter 5.

Brain teaser: why are k bits of delay needed for the operation of the shifter itself, without taking into account the extra bit due to the pipelining?

d. In order to compensate for the arithmetic delays, which are needed in the operators present in the SBD, the available "algorithmic delays" due to the initial $z^{-1}$ delays have to be distributed over the architecture. This is also called "delay management". Note that additional delays are assumed to be "entered" from the input and output nodes. As a result of this delay management, so-called "shimming delays" will be added into some of the arcs in the SBD to compensate for differences in arithmetic delays over parallel paths in the graph. Several solutions are feasible in general and many methods are available to achieve this delay distribution. In order to achieve "optimal" results, CAD tools are needed as the problem is NP-complete. An example of such a tool is the COMPASS program developed at IMEC [Goo85].

A particularly simple method to apply manually is to use "potentials". Definition: the potential of a node is the number of the control pulse under which the LSB's of all signals (bit-serial words) pass on that node. Here, the control pulses are assumed to be periodic signals with period WL because all events are repeated within that interval. They are numbered from 0 to WL-1.
The 0 potential is assigned to a reference node. For multi-rate systems this definition can be extended. These potentials are assigned to all nodes in the SBD taking into account the following rules:

- start from 0 for the outputs of sample delays, or for the input nodes if no sample delays are present, and count upwards for the other nodes on the paths branching off from the delays.
- the potential for the output node of an operator is the maximal potential of its inputs PLUS the arithmetic delay associated with it.
- IF the input potentials for an operator with more than 1 input node differ THEN assign shimming delays to compensate for this difference.
- IF the final potential P at the input of a sample delay is smaller than or equal to WL THEN substitute the WL-bit delay by (WL-P) bits ELSE additional bit delays have to be moved to this node EITHER by removing them from other parts of the flow-graph OR by increasing the overall word-length, which is quite costly as it also requires converters at the global inputs and outputs to compensate for the change in signal types.
- at the end, the final potentials are recomputed in modulo arithmetic with a base of WL.

Note however that the number of delays present in loops should remain equal to the number of algorithmic delays present in the initial bit-parallel SBD times WL. Otherwise, an infeasible solution would be produced.

Fig.3.14 Illustration of bit-serial design methodology (panels: pipelined bit-parallel architecture; bit-serial architecture, assuming 8x too fast)

An example to illustrate this method is shown in Fig.3.14. The number of arithmetic delays associated with each operator is indicated in bold below the operator symbol. The potentials assigned to each node are indicated above the corresponding arc in plain letters. Note the addition of 5 shimming delays (shaded) to compensate for the difference in potential between the inputs of the subtractor.

3.2.4.4 Principle of word-serial design

Different operations within a given algorithm can be multiplexed in time on the same operator whenever the ratio $\lfloor F_{cl}/F_s \rfloor$ (rounded to the nearest lower integer) is higher than 1. For instance, if 10 multiplications have to be performed in cascade and if the ratio equals 10, we can use a single multiplier sequentially (word-serial mode as opposed to word-parallel). As a result, the clock rate is about constant, except for the additional communication delays, and the sample rate is reduced by a factor equal to this ratio. However, the area decreases by about the same factor for the arithmetic operators, resulting in a perfect area-time trade-off. Hence, the power consumption budget remains almost unchanged.

Unfortunately, additional storage is needed if the intermediate results have to be retained, so the amount of "state" memory can never be multiplexed. Moreover, additional control is needed to steer the communication of the results. As a result, the area-power-time trade-off is again complicated for the complete architecture through the effect of the memory and controller cost. Still, the area and power overhead is usually smaller than in the bit-serial case when decision-making or non-linear operations are present. This is illustrated for a simple example in Fig.3.15.

Fig.3.15 Illustration of word-serial architecture

3.2.4.5 Word-serial design methodology

Our basic approach for this will be to multiplex groups of similar operations on the same (data-path) hardware. This is achieved through the following steps:

a. Start from an SFG which is optimized for maximally achievable clock rate by adding pipeline delays (see bit-serial case).

b. Partition the SFG in clusters of operations which are similar in terms of the signal types (e.g. integer word-length), operation types and connections (shape of the graph). Hence, these clusters can be sequentially executed on the same hardware with little overhead. A major requirement is that at most $F_{cl}/F_s$ clusters are shared on the same unit.

The starting points for finding these clusters are the bodies of (critical) loops and conditions which are part of the control flow, or functions provided by the user which indicate repetition. The remaining, less critical parts of the algorithm are then assigned to units that are not yet fully utilized for all available time slots.

Fig.3.16 Illustration of word-serial method: cluster selection (a), resulting ASU (b)

An example illustrating this principle for the 2nd order digital filter of Fig.2.3 is shown in Fig.3.16 for the case when the ratio $F_{cl}/F_s$ equals 2 (an illustration of the detailed application of the steps will be distributed during the exercise sessions). Note the cluster boundaries that are selected to contain the function of the two 1st order segments out of which the 2nd order section is composed. In order to make them more similar, a transformation has first been applied on the original coefficient 0.25, which has been decomposed into 0.5 x 0.5, followed by a move of the common 0.5 factor from both coefficients up to the input of the adder. Another transformation has been applied to the upper $z^{-1}$ sample delay, which has been decomposed into two clock delays (registers) and then distributed over the boundaries between the clusters in order to make time sharing possible. It should be noted that this splitting of delays is a special case which is only needed for flow-graphs which are partitioned into clusters breaking directed loops (see exercises). In the simple example of Fig.3.15, this is not needed.

c. Once the clusters assigned to the same unit have been selected, the actual composition of this unit can be derived from the signal types, the operations to be executed and their connections. A good approach is to start from the most complex graph and derive from this the "initial unit" by substituting all operations by their corresponding hard-wired operators.
Next, all other clusters are matched onto this initial unit until the final unit is capable of executing all of them. In this process, gradually more programmability is added in the operators and the connections. This approach leads to customized, so-called "application-specific units" or ASU's as long as the multiplexing factor is low. An example for the filter is shown in Fig.3.16b. In order to derive this ASU, an initial unit based on e.g. the first cluster can be selected, resulting in a hard-wired shifter over 2 and a hard-wired adder. By also matching cluster 2 onto this initial unit, we arrive at the final ASU where the adder becomes a lowly programmable adder/subtractor and where the sample delay is bypassed or not by the multiplexer controlled with Sel1. Note the very limited flexibility needed for the shift and add-type operators and the very restricted programmability in the connections.

d. When the composition of the ASU's is known, and the scheduling of the execution modes is fixed, the controller to steer all the arithmetic operators and the multiplexers can be designed. Due to the limited amount of different control signals needed and the usually high-speed requirements, a hard-wired but hierarchically decomposed controller is usually preferred. Methods for this will be treated in Chapter 5.

All these steps are feasible to apply manually as long as the examples remain simple. However, real-life applications require advanced CAD tools to support such a design methodology. A CAD environment supporting the methodology described above while allowing extensive user interaction has been developed at IMEC in the CATHEDRAL-III project (see also CAD part, Chapter 3).

3.2.4.6 Combining the approaches

Parallelism at bit- and word-level can of course be combined, resulting in the 4 cases illustrated in Fig.3.17. Here, m independent operations (e.g. with different input sources) on n-bit wide words are assumed. The figure illustrates the area-time trade-off with the number of wires in space and the number of clock cycles in time.

Fig.3.17 Illustration of bit-serial (BS) - bit-parallel (BP) and word-serial (WS) - word-parallel (WP) combination (panels BPWP, BPWS, BSWP and BSWS, each drawn as wires in space versus clock cycles in time)

Note though that the trade-offs between these options are much more complicated in real-life applications than what is suggested by the basic principles summarized in this figure.
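The idealised space-time bookkeeping behind these 4 cases can be written down as a small sketch. It assumes exactly the simplified model of the text (m independent operations on n-bit words, one wire per bit transferred in parallel, one clock cycle per serial step); the function name `sharing_options` and the dictionary layout are hypothetical and only serve the illustration.

```python
def sharing_options(m, n):
    """Wires (space) and clock cycles (time) for m operations on n-bit words.

    Idealised counts only: register, multiplexer and control overhead are ignored.
    """
    return {
        "BPWP": {"wires": n * m, "cycles": 1},      # all bits of all words at once
        "BPWS": {"wires": n,     "cycles": m},      # one n-bit word per cycle
        "BSWP": {"wires": m,     "cycles": n},      # one bit of every word per cycle
        "BSWS": {"wires": 1,     "cycles": m * n},  # one bit per cycle
    }


for style, cost in sharing_options(m=4, n=8).items():
    product = cost["wires"] * cost["cycles"]
    print(f"{style}: {cost['wires']} wires x {cost['cycles']} cycles = {product}")
```

In this model the product of wires and cycles is m*n for every style, which is the perfect area-time trade-off suggested by the figure; the overheads left out of the model are what make real-life trade-offs more complicated.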

Fig.3.18 Illustration of bit-serial and word-serial design methods (panels: 0. Initial DDG; 1. Initial architecture; 2. Pipelined architecture; 3. Assume 8x too fast -> bit-serial; 4. Assume 2x too fast -> word-serial)

For the power, the trade-off is more complex still. The main reason is the impact of spurious switching (partly restricted by the additional pipelining proposed in the pipelining step of our overall methodology). Another issue is the data correlation, which obscures the picture. If no correlation is present, the power trend will largely follow the area trend but in practice this is too pessimistic for the power axis. Tools are needed however to incorporate this in real designs. A third major issue which complicates the power trade-off is the major effect of the $V_{dd}$ choice, due to the $0.5 \cdot C \cdot V_{dd} \cdot V_{sw} \cdot F$ formula, where the voltage swing $V_{sw}$ is usually equal to $V_{dd}$. For real-time processing, the best approach is probably to fix $V_{dd}$ at the lowest possible value which is technologically acceptable (based on noise margin and leakage criteria). For this $V_{dd}$, the above mentioned methodology can then be applied again. For heavily data-dependent applications where the timing is not fixed, this is far from optimal however. Then it is better to provide higher $V_{dd}$ values during periods of heavy execution loading and a minimal $V_{dd}$ during non-time-critical periods, as proposed by Anantha Chandrakasan et al. at MIT and Bob Brodersen et al. at U.C. Berkeley.

An example which illustrates the two design methods for bit-serial and word-serial architectures is shown in Fig.3.18. Note the relatively small overhead for the bit-serial case. This is mainly due to the fact that the initial DDG meets the requirements for nice bit-serial applications: linear, no decisions, no recursion. The word-serial solution for a multiplexing factor of 2 is also quite optimized, but a problem would of course occur if we had to find more than 2 similar clusters which can be matched on the same unit. In this case, we would have to go to digit-serial or, if the factor is even larger than 8, we have to combine the 2 multiplexing approaches. The maximal hardware sharing factor in this example equals 16, resulting in a purely bit-serial/word-serial architecture. The heavily shared bit-serial adder/subtractor is then quite small but the problem is situated in the very large overhead in terms of storage (registers), communication (mux, bus) and control. Obviously, going to an extreme hardware sharing is not optimal at all in this case and a more efficient solution with reduced time multiplexing would be preferable.

3.2.4.7 Terminology for word-serial architectures (Fig.3.3)

The cost of time multiplexing typically depends heavily on the reusability of hardware and on the ratio $F_{cl}/F_s$. If the algorithm is very modular or when this ratio is low, usually enough operations can be found which are similar enough in nature to profit from multiplexing. In that case, compact and specialized operators connected in a hard-wired fashion with a limited number of multiplexers can do the job. This approach results in a hard-wired lowly multiplexed data-path architecture based on ASU's as derived above.
It features small overhead in logic: few multiplexers, few lowly programmable operators, and little control. This is typically the case for medium-level image and video processing subsystems as illustrated in Fig.3.1. The same applies for front-end audio and telecom modules. This is the CATHEDRAL-III domain.

However, if the ratio is very large compared to the regularity in the algorithm, the operators have to become more flexible and in the end, a "universal" operator such as a fully programmable arithmetic-and-logic unit (ALU) or address computation unit (ACU) will have to be provided. In that case, more programmable connections (multiplexers, busses) are also required, leading to a highly multiplexed processor architecture with execution unit type data-paths. Usually the controller is chosen to be of the microcoded type (see Chapter 5). The disadvantage of these processors is located in their overhead, both in terms of the programmable operators and the additional control and communication hardware. However, for large ratios, they provide the only efficient solution. Typical application domains for this style are situated in back-end video and image processing (Fig.3.1) but also back-end audio, user-end telecom and automotive processing contain such modules. This is the CATHEDRAL-II domain (see CAD course, Chapter 3).

Sometimes, the number of operations to be executed per incoming sample is very high while the sample rate approaches the achievable clock rate. Fortunately, the algorithm to be executed is then usually very regular. This is for instance the case for many front-end image or radar processing (Fig.3.1) subsystems. Then, we need very parallel and modular architectures, which efficiently exploit the inherent parallelism available. A typical example of this class are regular arrays where the communication between the "matrix" of processing units is fully regular and localized (Fig.3.19). If all the local connections between neighbours are fully pipelined and if all units are identical, a so-called "systolic array" is designed [Kun82]. When the pipelining is achieved by an asynchronous data-flow mechanism (subsection 3.2.3), it becomes a wave-front array [Kun87].

Fig.3.19 Regular array style (a matrix of processing elements, PE, between input data and output data)

It should be noted that for general-purpose programmable solutions, the nomenclature for the 3 architectural styles described above is different [Flyn66]. In that case, the distinction is made based on single or multiple data (SD or MD) which are controlled by single or multiple instructions (SI or MI). With this terminology, regular 1D or 2D arrays (e.g. systolic) are SIMD machines. A single lowly or highly multiplexed data-path represents a MISD unit and arrays of independently controlled MISD's form a MIMD machine (see also section 3.4).

The efficient exploitation of parallelism at the bit and word level is one of the major challenges in present day VLSI design. The detection of parallelism inherently available in complex algorithms is a very complex problem and a topic of much research. This research is the basis for new generations of computer architectures that will eventually form the basis for artificial intelligence machines of the so-called 5th generation.