Asynchronous Data-Driven Circuit Synthesis

Sam Taylor, Doug Edwards, Luis A. Plana, Senior Member, IEEE, and Luis A. Tarazona D., Student Member, IEEE

Abstract: A method is described for synthesising asynchronous circuits based on the Handshake Circuit paradigm but employing a data-driven, rather than a control-driven, style. This approach attempts to combine the performance advantages of data-driven asynchronous design styles with the handshake circuit style of construction used in existing syntax-directed synthesis. The method is demonstrated on a significant design, a 32-bit microprocessor. This example shows that the data-driven circuit style provides better performance than control-driven synthesised circuits. The paper extends previous reported work by illustrating how conditional execution, oft-cited as a problem for data-driven descriptions, is handled within the system, and by a more detailed analysis of the design example.

I. INTRODUCTION

Before asynchronous synthesis techniques will be seriously considered over their synchronous counterparts, they must demonstrate that they can achieve competitive performance. The research reported here aims to improve the performance of large synthesised asynchronous circuits. The focus of the approach is on a handshake circuit representation of the circuit; that is to say, an abstract representation of the structure of the circuit which is independent of technologies, protocols, data encodings or any other details of the actual circuit implementation. The handshake circuit paradigm allows the construction of large scale circuits by the composition of small handshake components that are straightforward to implement in isolation. Hardware descriptions are written in a high-level language and compiled in a syntax-directed fashion into the handshake circuit representation. This means the structure of the resulting circuit is directly related to the source code, allowing optimisations and trade-offs to be made at the source code level. Furthermore, writing circuit descriptions in languages such as Balsa [1], [2], [8] and Haste [12] is relatively straightforward, even for novices.
Manuscript received October 13, 2008; revised March 11, 2009. This research was supported by EPSRC. The authors are with the Advanced Processor Technologies Group, School of Computer Science, The University of Manchester, Manchester M13 9PL, UK (e-mail: [...]@manchester.ac.uk).

However, control overhead in the conventional control-driven style of handshake circuit synthesis is a major obstacle to performance. Previous work [34], [33] gives details of how the control overhead arises in a conventional control-driven synthesis style, an overview of previous attempts to mitigate its effects [4], [5], [6], [10], [14], [23] and an in-depth justification for a data-driven approach. For definitive background material on handshake circuit compilation, see [3]. Essentially there are three principal effects contributing to the control overhead:

- All inputs are synchronised with each other before any operations within a block can proceed, because only control may activate processing operations and therefore it is necessary for the control to know that the inputs it will use are available.
- Reads and writes are sequenced to ensure that variables are not written and read concurrently.
- Data processing operations only begin after the control initiates them, due to the pull style of operation. If the data processing were to operate in parallel with the control then the overhead of the control should be reduced.

Data-driven asynchronous design styles are thus much less prone to the problem of control overhead. However, attempts to automate transformations to existing Balsa handshake circuits to produce more efficient structures along the lines of existing data-flow style compilation strategies [35], [36], [38] were not successful, because it was not clear what the result of such an optimisation should look like and it was very difficult to guarantee the resulting circuit would behave in the same fashion as the original. Techniques such as data-driven decomposition (DDD) [38] rely on pipelining sequential programs and produce modified circuit behaviour.
A Balsa designer may depend on the design behaving in the manner it was written, which could easily not be the case after optimisation; indeed if a DDD-type strategy were to have been applied to the SPA processor [22] the memory interface would have broken.

The approach presented here attempts to combine the benefits of a data-driven style with the convenience and flexibility of the handshake circuit paradigm which allows the robust synthesis of large circuits. To this end, the handshake circuit structures of the control-driven Balsa synthesis method have been examined and data-driven alternatives are proposed. To generate these structures, a data-driven description style is proposed and a compiler has been developed to compile these descriptions into a handshake circuit representation. This compiler is integrated into the Balsa design flow, enabling the use of existing Balsa tools for moving from the handshake circuit representation to a gate-level circuit.

The organization of the paper is as follows. Section II reviews other related work. Section III examines how classic Balsa handshake circuit templates can be more efficiently replaced by data-driven variants. Section IV examines problems with conditional structures and introduces new handshake components required for a data-driven system. Section V introduces a new data-driven Balsa language. Section VI describes the implementation of a significant design example. Finally, the performance, area and power consumption are compared for the design example synthesised both by classic Balsa and the new data-driven Balsa.

II. RELATED WORK

A. Data-driven approach

The syntax-directed synthesis approach generates control-driven structures, but it is noticeable that in clocked design much emphasis is placed on pipelined datapath design; similarly most asynchronous techniques are based on data-flow pipelined style approaches. Muller [20], [19] described the first asynchronous pipelines using C-elements as latches with combinational logic between the latch stages. The latency of each pipeline stage can be reduced by incorporating logic within registration stages. This has led to the concept of pipeline templates which define the registration control of a stage and allow for transistor-level designs to be incorporated within them. Williams [37] developed the PC0 dual-rail pipeline. Subsequently Lines [16] developed the concept of templates based on precharge half- and full-buffers. Other implementation styles, aimed at high performance and often requiring careful timing control, include GasP from Sun Laboratories [31], [26] and the IPCMOS pipelines from IBM [26]. Singh and Nowick [29], [28], [27] have developed a range of high performance pipeline styles with a variety of trade-offs; these papers contain an excellent review of asynchronous pipeline techniques.

Whilst the works described above are indicative of the interest in pipelining techniques, they are not directly applicable to the problem addressed in this paper: they are concerned with circuit level implementations of pipeline architectures rather than the problems of how to specify the composition of data-driven circuits from a behavioural synthesis language. Sparsø [30] analysed the performance of a variety of pipeline topologies in terms of token flow through the structures and quantified performance in terms of forward and reverse latency, and cycle time. Again, although this material leads to a good understanding of the complexity of pipeline structures, it is not directly applicable to automated synthesis techniques.

B. Other synthesis systems

Automated synthesis techniques for large scale systems do not have to be restricted to a syntax-directed handshake circuit approach.
1) De-synchronisation based methods: These involve converting conventional synchronous design descriptions into asynchronous designs [7], [15]. Typically existing CAD tools are used for much of the datapath synthesis and asynchronous control synthesis tools are used to produce controllers that replace the global clock. This approach has the advantage that designers need little specialist knowledge of asynchronous techniques. A drawback is that by using a design targeted at a synchronous implementation, potential advantages of asynchronous techniques are not exploited. For example, concurrency is restricted to the synchronous pipeline structure and so the fine-grained concurrency possible in asynchronous design is not exploited. It is also difficult to exploit the possibility for asynchronous designs to use data-dependent delays instead of the worst-case delays of synchronous design.

2) CHP based methods: The CSP [13] based Communicating Hardware Processes (CHP) language is the basis of some asynchronous synthesis systems [18], [24], [32]. These systems use manual or automatic program transformations to refine a design into a more concurrent version. The final program is then translated into a production-rule set which is used to generate a transistor implementation of the design. The Caltech synthesis tools (CAST) have been used to produce some high performance circuits [17] but these rely on significant manual intervention in the synthesis flow to arrive at the most effective program transformations and also rely on the use of the PCHB (precharge half-buffer) circuit style. This circuit style is not widely used and requires a specialised cell library. The automatic program transformations employed in CAST are not behaviour preserving and are only correct for designs that meet particular requirements. An inexperienced designer may struggle to understand and meet these requirements.

C. Handshake Circuit Optimisations

Attempts have been made to apply control re-synthesis to the control of both Tangram/Haste [14] and Balsa [4], [5].
Control re-synthesis attempts to improve the performance of the control tree by clustering sections of the tree, determining the overall behaviour, and synthesising a new controller to implement this behaviour using a controller synthesis tool [6], [10], [9]. By removing the communications between clusters of components, the resulting controller should improve performance over the original control tree. Control re-synthesis is effective but limited. Improving the speed of the control tree will obviously help reduce control overhead but only so much improvement can be gained. The control still synchronises with data at the same points and so the sequential operation of the control-driven structure is still maintained. Control re-synthesis is complementary to other approaches to improving control overhead, including the data-driven style introduced in this paper.

Hansen and Singh [11] describe a source-to-source transformation of the original specification into a new one using a variety of concurrency-enhancing optimizations: automatic parallelisation, automatic pipelining, arithmetic optimisation, and reordering of channel communication. Considerable speedups are claimed. However, some of the examples used start with extremely naive code sequences, so it is easy to obtain significant improvements. Unlike the techniques described in this paper, their work is not guaranteed to preserve the original behaviour of the hardware and frequently the designer is given responsibility for ensuring that the behavioural changes are acceptable. This reduces the usefulness of an automated approach as it is necessary for the designer to understand the nature of the transformations to ensure they are safe. In contrast, the work described in this paper maintains the source-level predictability of the description ("what you write is what you get") and the performance improvements are made over highly optimised and efficient control-driven descriptions.

Nielsen [21] has explored optimisations of circuits synthesised by Balsa; the work is concerned with resource allocation
and bindings to explore the optimisation space but within the confines of the conventional control-driven approach offered by Balsa. The work has recently been extended to Haste. A similar approach is taken by the Moods system [25]. No attempt is made to generate data-driven circuits in these approaches although it is noted that the control structure arising from syntax-directed translation results in poor performance.

III. DATA-DRIVEN CIRCUIT STRUCTURES

In the context of handshake circuits, a data-driven style should achieve greater speed performance because:

- All control is activated in parallel.
- Sequencing is localised to storage elements; read and write sections of control can operate in parallel as the localised sequencing ensures that storage elements are not concurrently read and written.
- Control and datapath can operate in parallel because of the push nature of data processing.

There are possible disadvantages to data-driven circuits; they are likely to require more area and to consume more energy. The localised control of the data-driven style consumes more area than the control-driven tree as, instead of appearing once, the control is distributed in many places. This effect is exaggerated in delay-insensitive implementations where an increased amount of completion detection is required and the implementation of push-style variables is particularly expensive. However, the increased concurrency in this distributed control is a major factor in the increased performance. Energy consumption due to switching can also be expected to increase as a result of the increase in concurrent activity. Speculation can also be expected to have an impact on energy consumption as this involves extra switching activity in the datapath that need not occur in the control-driven style.

The data-driven circuit style will be introduced in this section by comparison with conventional handshake circuits generated by the Balsa system because the source code is freely available, making it feasible to experiment with the system.
However, it should be emphasised that the results here apply to any system using a similar syntax directed compilation method such as Haste. The data-driven style was largely developed by examining and adapting handshake circuit structures so comparison provides the most instructive method of introduction. Some of the eight new handshake components required to support the new style are mentioned. More details can be found in [33].

A. Input

The conventional Balsa input structure is shown in figure 1. This structure is produced by the active enclosure construct shown below.

    a, b -> then <body - one input used once, the other used twice> [...]

Figure 1. Balsa input structure.

In Balsa, channels (that is, data wrapped in a request/acknowledge handshake) stand on the left-hand side of an expression and the symbol -> denotes a read from the channels, either into a variable or into another channel, or to a code block that has channel-like interfaces. Similarly the symbol <- denotes writing to the channel on the left hand side of an expression. In this example, the activation of the input command is used to initiate pulling data from the environment on the input channels a and b. The Fork component passes the activation request in parallel to the two FalseVariable components. These are used to hold open data (the data on the channels a and b is not released until the handshakes on those channels complete) and implement multicast on the input channels. The body of the structure is activated following the signal ports of the FalseVariable components being synchronised at the Sync component. This activation indicates the availability of the data for the body to then pull it from the read ports of the FalseVariable when required.

The data-driven style makes use only of push structures. Instead of using the FalseVariable to implement multicast, an alternative push structure must be used. As the input channels are now push channels, there is no need to pull the input data. For inputs that are used in only one place, the data can be pushed directly to the body. For inputs that are used more than once, a duplicate of the data must be sent to all the required places. The Dup component is used to implement this broadcast behaviour. Figure 2 shows the data-driven version of the circuit example given in figure 1.

Figure 2. Data-driven input structure.

An advantage of this approach is that the input channels do not need to be synchronised before activating the body as the body no longer needs an activation to indicate the availability of the data; the data will be pushed to the required places at some point. The obvious drawback with this approach is that, as the original structure implemented multicast, the body was free to select which read ports, if any, of the FalseVariable to use. Where conditional structures are used, the data is only conditionally
required. In the broadcast structure, the data is sent to all possible destinations whether they need it or not. The resolution of this problem is discussed in section IV.

B. Variables

Balsa variables are implemented by the Variable handshake component. This component has a passive input known as the write port and one or more passive outputs known as the read ports. The control-driven approach allows data to be written to the Variable component by pushing to the write port and read from the variable by pulling from the read ports. The language ensures that the variable is not written at the same time it is read. To the designer, a Balsa variable therefore looks very much like a variable found in most imperative programming languages.

In the data-driven style, the storage component is called the VariablePush and has active push read ports. Unlike the original Variable component, this component has write-once, read-once behaviour; each time a data value is written it is automatically pushed on all read ports and the handshake on all read ports must then complete before the next write data is accepted. This makes a data-driven variable much more akin to a channel that has storage, thereby allowing each end of the channel to complete independently. In the control-driven style, one is free to use each read port one or more times or not at all. The drawback of the write-once read-once nature of data-driven variables is that each port must be used once and only once. This allows more flexibility in the control-driven style but where the designer uses this flexibility, they do so at the cost of performance. In common with the data-driven input structure from the previous section, the drawback of this approach is that the data that is pushed on the read ports of the variable may not actually be required by the destination.

C. Data processing

The original Balsa data-processing structure is a pull structure implemented using the Fetch component to initiate a read of the required data from the required Variable or FalseVariable components, pull it through pass-through data components, and then push it to the destination.
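The write-once, read-once behaviour of the VariablePush component described in section III-B can be illustrated with a small behavioural sketch. This is only a toy software model under stated assumptions, not the handshake component or its gate-level implementation: the class name reuses the component name purely for readability, and the method names (`write`, `acknowledge`, `can_write`) are hypothetical.

```python
class VariablePush:
    """Toy behavioural model of write-once, read-once storage.

    All method names are hypothetical; the real component is a handshake
    circuit, not a software object.
    """

    def __init__(self, read_ports):
        if read_ports < 1:
            raise ValueError("need at least one read port")
        self.read_ports = read_ports
        self.value = None
        # Read ports whose push handshake has not yet completed.
        self.pending = set()

    def can_write(self):
        """A new write is accepted only when every read port has completed."""
        return not self.pending

    def write(self, value):
        """Store a value and push it on every read port."""
        if self.pending:
            raise RuntimeError("previous value not yet consumed on all read ports")
        self.value = value
        self.pending = set(range(self.read_ports))
        return {port: value for port in range(self.read_ports)}

    def acknowledge(self, port):
        """Complete the handshake on one read port (used exactly once per write)."""
        if port not in self.pending:
            raise RuntimeError(f"read port {port} has no pushed value outstanding")
        self.pending.remove(port)
```

In this model a second `write` is refused until `acknowledge` has been called once for every read port, which mirrors the requirement that each port is used once and only once per written value.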
The following Balsa code produces the example handshake circuit structure shown in figure 3.

    a, b -> then o1 <- a + b || o2 <- [...]

The input channels a and b are read (->) by the code block that follows. Here the sum of a and b is written (<-) to the destination channel o1 and concurrently (||) one of the inputs is written to channel o2. As shown in the preceding sections, in conventional Balsa, Variables and FalseVariables have passive read ports whereas in the data-driven style, data is always pushed to all places where it may be required. In the data-driven style this data is pushed straight through the push datapath components to the destination as shown in figure 4.

Figure 3. Balsa data processing structure.

Figure 4. Data-driven data processing.

The handshake circuit graph for the data-driven circuit is certainly a lot smaller but what impact does it have on the control part of the circuit? Table I summarises the key parameters in a typical dual-rail implementation of the control circuitry for the two approaches.

Table I
CONTROL COMPLEXITY FOR N-BIT DATAPATH

Parameter         Balsa                               Data-driven
Gate count        8n+6 AND/OR gates, 2n+8 C-gates     5n AND/OR gates, n+2 C-gates
Forward latency   4 AND/OR gates, 1 C-gate            1 AND gate
Reverse latency   2 AND/OR gates, 3 C-gates           1 AND/OR gate, 1 C-gate

IV. CONDITIONAL STRUCTURES

A. Conditional Execution

Conditional execution is supported by the case and if structures in Balsa. This section will take the case construct as an example as it is more commonly used than if, and the implementation of if is fundamentally the same as that of case with a few extensions. The following Balsa code is an example of the use of the case construct. The control input c is used to determine whether to send the sum of a and b, or just one of the inputs, to the output o1. This code is compiled into the handshake circuit shown in figure 5.

    a, b, c -> then [...] o1 <- a + b [...] o1 <- [...]

Figure 5. Balsa conditional structure.

As usual, the handshake circuit operates by requesting the three inputs, synchronising on their arrival and then activating the body. The body pulls c from the FalseVariable into a Case component that decides which of its sync outputs to activate based on the value of the control data that has been input. The standard data-processing structure is then used to pull the required data and send it to the output. Additionally in this example, the CallMux component merges the two possible sources for output o1 onto a single output channel. As the Case component will only activate one of its outputs at any time, the CallMux will only receive an input on one input channel at a time, thereby avoiding any hazards.

The data-driven equivalent of this circuit is shown in figure 6. The difference between the data-driven style and the control-driven style is that as all inputs are pushed (see sections III-A and III-B), all the data processing operations are initiated, even though the result may not be required. In order for the circuit to operate correctly these extra results must not be allowed to propagate. The FetchReject component is introduced to reject the unwanted data. FetchReject is so named because it is rather like a push version of the Fetch component. Instead of pulling data and sending it to the output, it waits for pushed data to arrive on the input and then either passes it through to the output or completes on the input channel without sending anything on the output, thereby rejecting the data. Two sync ports are provided on the component: the activation port, which is used to instruct that the data should be passed, and the reject port, which is used to instruct that the data should be rejected. Once the FetchReject components are in place, all that remains is to connect the activation and reject ports to the correct outputs of the Case component. In this simple example, one is activated while the other is rejected.
This arrangement allows the CallMux component to be used as in the original Balsa circuit because concurrent input handshakes are avoided by correctly using the FetchReject components.

Figure 6. Data-driven conditional structure.

As the data-driven style does not require synchronisation of the inputs, there is potential for performance improvements over the control-driven circuit. The logic in the Case component is able to proceed as soon as the control data arrives, and in parallel with the data processing rather than always having to complete before initiating the pull data processing. However, the data-driven style is essentially speculating on needing the results of all operations. When using a conditional structure the unwanted results must be rejected and the overhead of this operation may harm performance. However, it is believed that generally this overhead should rarely be significant for the following reasons. As the rejection will often occur in parallel with other useful operations, its effect on the overall performance should be limited. Only where the reject takes longer than useful processing will it reduce the overall performance as both must be completed before the next cycle of the operation. The reject operation itself is quite efficient but if the arrival of the data is slow then the overall impact may be greater. In cases where there is no operation in parallel with the reject, it may often be the case that the data will arrive in advance of the reject signal and the rejection will therefore be concluded quite swiftly. Note that in the Balsa circuit, it is still necessary for all the inputs to arrive before the operation can complete even if no data processing is actually performed.
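The pass-or-reject behaviour of FetchReject, and the way a slow rejected result can lengthen a cycle, can be made concrete with a toy timing sketch. Everything here is an assumption made for illustration: the function names, the abstract time units, the default delays, and the choice of `a` as the passed-through input are all hypothetical and are not taken from the paper or from any Balsa tool.

```python
def fetch_reject(data, data_arrival, command, command_arrival, latency=1):
    """Toy model of one FetchReject.

    Returns (forwarded, done_time), where forwarded is [data] if the data
    was passed to the output and [] if it was rejected.
    """
    if command not in ("activate", "reject"):
        raise ValueError("command must be 'activate' or 'reject'")
    # The component cannot complete until both the pushed data and the
    # sync command (activate or reject) have arrived.
    done_time = max(data_arrival, command_arrival) + latency
    forwarded = [data] if command == "activate" else []
    return forwarded, done_time


def conditional_cycle(a, b, select_sum, add_delay=5, pass_delay=1, control_delay=8):
    """Speculatively compute both results, keep one; return (o1, cycle_time)."""
    if select_sum:
        sum_cmd, pass_cmd = "activate", "reject"
    else:
        sum_cmd, pass_cmd = "reject", "activate"
    # Both datapath operations start as soon as the inputs are pushed.
    sum_out, sum_done = fetch_reject(a + b, add_delay, sum_cmd, control_delay)
    pass_out, pass_done = fetch_reject(a, pass_delay, pass_cmd, control_delay)
    # CallMux-like merge: exactly one FetchReject forwards its data.
    merged = sum_out + pass_out
    assert len(merged) == 1
    # The next cycle can start only when the pass and the reject are both done.
    return merged[0], max(sum_done, pass_done)
```

With the default delays the control is slower than either datapath operation, so `conditional_cycle(2, 3, False)` yields `(2, 9)` and the rejected addition costs nothing extra. With `control_delay=2` the same call yields `(2, 6)`: the cycle is now limited by waiting for the slow addition to arrive so that it can be rejected.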
Furthermore, in the Balsa circuit, the logic in the Case component does not begin evaluating until all the inputs have arrived whereas in the data-driven approach the evaluation can occur in parallel with the arrival of the inputs and so the FetchReject may have received the reject by the time the data arrives, so it will at least be immediately rejected, albeit following a possible additional delay through some data processing logic. In the conventional Balsa case, all the inputs must arrive before the process of deciding what to do with them can begin.

Even so, it may be the case that unbalanced datapaths could cause a problem. Consider the example shown in figure 6. Here one of the operations is an addition while the other is simply passing through the data from an input unchanged. The addition is most likely to incur a significantly longer delay than the pass-through operation. If the second operation is selected frequently, and assuming the environment can supply inputs and consume outputs quickly enough, there is the potential for the rejection of the add operation to reduce the throughput of the overall circuit. However, experience in designing with Balsa has shown that the delay of the control nearly always exceeds that of the datapath so it is reasonable to be optimistic that many datapath delays incurred as a result of speculation will be
entirely masked by the delay of the control that works out whether or not to reject. Additionally, the inputs needed for the datapath operation may arrive earlier than those for the control, allowing the datapath to complete before the control signals arrive at the FetchRejects.

B. Conditional input

Conditional inputs may occur in Balsa code when an input is made as part of the body of a conditional structure. For example, in the code below, channel b is a conditional input in the else clause of the case construct.

    a, c -> then [...] o1 <- a [...] b -> then o1 <- b [...]

During the operation of this code, data is only pulled on input channel b if the else clause is activated. Otherwise no communication occurs on channel b. This code is compiled into the handshake circuit shown in figure 7.

Figure 7. Balsa conditional input structure.

The important thing to notice when this circuit is converted to the data-driven style is that when data arrives on input b, it is always used; there is no need to reject any data if the else clause of the case is not executed, as the input never arrives. Of course, in a data-driven style there may be a request pending on channel b but this should be acknowledged by a subsequent cycle of the circuit when the else clause is executed. It is important that, until the else clause is taken, this request is not propagated to a downstream component which may acknowledge out of sequence, possibly causing erroneous behaviour. To avoid this possibility the FetchPush component is used. This component can be considered as a push version of Fetch, or a version of FetchReject without reject.

To further explain the above, consider the example in figure 8 which is the data-driven equivalent of the example in figure 7. The FetchPush component is used on channel b to ensure any request on b is not passed to the CallMux component before the Case has decided that that operation should occur. This ensures the inputs to the CallMux cannot occur concurrently.

Combining conditional and unconditional inputs in an expression is more challenging.
Consider the following code example, only a small modification to the last example given above: the data on one of the channels (a) is used twice.

    a, c -> then
        o1 <- a
        b -> then o1 <- a + b

In this example, if the clause containing the conditional input is not taken then any data that is pending on b is not to be rejected, but the data on a into the adder must be rejected. In general, this problem will occur any time conditional inputs are combined in an expression with unconditional inputs. In order to avoid this problem, a scheme could be devised to reject the unconditional inputs before they are combined with the conditional inputs [33]. However, such a scheme reverses part of the advantage of adopting a push style, as the datapath operations are once again stalled waiting for control to decide whether the result of the operation is required, instead of control and datapath operating in parallel. Furthermore, this scheme presents additional complexity in compilation as the placing of rejections is now much less straightforward. For these reasons, such a scheme has not been used. Instead, combinations of conditional and unconditional inputs within expressions are considered invalid by the compiler, avoiding the need to produce an implementation at the expense of some reduction in flexibility. However, as discussed in section V-G, users are still able to implement this scheme in the source description if they choose to.

C. Nested conditionals

Conditional structures in Balsa can be nested within one another as demonstrated by the following code.

    c, d -> then
        case d of <body X>
        e -> then
            case e of <body Y>

In the control-driven style the output activations from one conditional structure are simply used to activate the nested conditional. In the data-driven style, the evaluation of the logic in all Case components proceeds concurrently, but the output

activations of nested conditionals must be delayed pending an activation from the outer structure as shown in figure 9. This example demonstrates the use of the CasePush and CasePushR components.

Figure 8. Data-driven conditional input structure.

Figure 9. Data-driven nested conditional structure.

CasePush is used where it is necessary to synchronise with an activation before output activations are made from the Case component. This is true for the Case component whose input is channel e, as data will only arrive on e when it is required. It may be necessary to reject the input to a CasePush if data will arrive that is not required, as in this example with the Case component whose input is channel d. CasePushR is simply a CasePush with a reject input that upon activation will discard the input data without activating any outputs. The reject port is then activated on all conditions where the activate port is not.

V. NEW INPUT LANGUAGE

This section will briefly introduce the high-level language that is translated in a syntax-directed fashion into the new circuit structures. Note that the language was primarily conceived as a means to an end; that is, to generate the data-driven handshake circuits. In the same way that all valid Balsa descriptions may be compiled to functional circuits, so all data-driven descriptions may similarly be compiled to data-driven structures. This means that the language reflects the data-flow style of the circuits and is less flexible and less familiar than the sequential programming language style possible in Balsa. The language is designed to resemble conventional Balsa wherever possible.

Unlike Balsa, where a circuit consists of commands linked by sequential or parallel control, the data-driven approach consists of lists of commands that operate independently and in parallel. Unlike the control-driven approach, control sections of the circuit do not wait for an activation but proceed as far as they are able, pausing only when awaiting data.

A. Hello World!

The equivalent of a Hello World program in Balsa is the one-place buffer. This serves equally well as an introduction to the data-driven language and is shown in figure 10.

    -- One-place buffer
    procedure buf (input i : 1 bits; output o : 1 bits) is
      variable x : 1 bits
    begin
      input i
      output x
        x <- i
      input x
      output o
        o <- x

Figure 10. Data-driven one-place buffer description & handshake circuit.

It can be seen from this small example that much of the language is very similar to conventional Balsa. The declaration of the procedure and the input and output ports is identical. Unlike conventional Balsa, the procedure input ports will always be passive due to the push style of implementation. Internally to the procedure the input ports are treated as read-only channels and the output ports as write-only channels.

The main new feature in evidence here is the division of the procedure into blocks consisting of input and output declarations and a body containing the commands that use the inputs and generate the outputs. Unlike Balsa, the control structures of the circuit are largely implicit. Blocks implicitly operate in parallel, as do the lists of commands within the blocks. The only synchronisation between the two blocks in this example takes place at the variable; the read must complete before the next write can overwrite the data in the variable. This allows the variable reads and writes to overlap to the largest possible extent. The handshake circuit for this buffer is simply a VariablePush component; this should be compared with the control-heavy Balsa generated circuit of figure 11.

B. Variables

The control-driven style of Balsa allows variables to be accessed in a very general fashion, so as to appear very similar to variables in a standard programming language.

Variables can be read and written in any arbitrary sequence. The Variable component has passive read and write ports and the control tree initiates communication on these as required. In the data-driven approach, the VariablePush immediately pushes any data written to it out of its active read ports. This means that a variable must always be read after it has been written. Variables therefore resemble less those of standard programming languages and are much more similar to channels. In fact, it may be more helpful to think of a variable in the data-driven style as a channel that contains storage, or even as a type of channel which each communicant can use at different times, rather than having to synchronise like a normal channel. Reflecting this, variables are specified as inputs and outputs (to blocks; procedure ports only connect using channels) in precisely the same fashion as channels. In the following discussion use of the term channel generally implies channel or variable except where otherwise stated.

    -- One-place buffer
    procedure buf (input i : 1 bits; output o : 1 bits) is
      variable x : 1 bits
    begin
      loop
        i -> x    -- Input communication
        ;
        o <- x    -- Output communication
      end
    end

Figure 11. Balsa one-place buffer description & handshake circuit.

C. Input control

In a control-driven approach, it is necessary for the control and data to synchronise to release the data once all required reads have been completed on the channel. As reads are now to be pushed, this synchronisation is unnecessary, as the release of all the read ports will indicate that all reads on the channel are completed. In the data-driven approach, therefore, inputs are merely specified as arriving at some point during the operation of the commands; the control waits for the arrival of inputs at any points where they are read (if they have not arrived already).

D. Write command

The write command (e.g. x <- i) is used to output the result of an expression to an output channel (or variable). The channels written to must have been declared as an output from the block. Compilation of the write command involves compiling the expression into appropriate push datapath components and connecting the result to the destination.

E. Arrays

Channels and variables can be arrayed in a similar fashion to Balsa. However, there are some differences in the semantics of variable arrays. The code below demonstrates the full flexibility offered by Balsa for using arrayed variables.

    input i : array 0..3 of 2 bits
    input c : 3 bits
    input d : 2 bits
    output o : 2 bits
    output p : array 0..3 of 2 bits
    variable v : array 0..3 of 2 bits

    i -> v ;
    loop
      c -> then
        01xx then o <- v[(#c[0..1] as 2 bits)]
        00xx then d -> v[(#c[0..1] as 2 bits)]
      ;
      p <- v

Firstly, a single value is written to the entire array, then an individual element is read or written, and then the entire array is read as a single value. The strategy adopted by Balsa is to implement the arrayed variable using multiple Variable components, one for each element in the array. The control can then initiate reads and writes of the passive-ported Variables individually or as a group, splitting the write data and combining the read data as required.

A data-driven equivalent of this circuit structure presents substantial problems. Once each Variable component has been converted to a VariablePush component, it is necessary to write to each VariablePush before it is read. After writing to a single element in the array, only that element would be available to read. One option is to leave the management of the structure to the user, who must only attempt to read elements of the array that are written. Alternatively, the user could be restricted to always writing to every element if they wish to use run-time indexing, or an elaborate scheme could be devised to write back the original data to those variable elements that are not written. This would ensure that every time any element in the array is written, all the other elements are also written (with unchanged data). To the read side, the arrayed variable always appears as if the entire array has been written.

None of these suggestions have been fully adopted. An arrayed variable declared in the data-driven language in the same fashion as a Balsa variable generates a single VariablePush that holds an entire value of the array type. The whole of the array must therefore be written to at once.
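The write-back option mentioned above can be illustrated with a small behavioural model. The following Python sketch is not part of the data-driven language or its compiler: the class VariablePush here is only a toy stand-in for the handshake component, the helper write_with_writeback and the initial array contents are hypothetical, and a stall is modelled by raising an exception. It shows that a single logical element write becomes one physical write per element, after which every element can be read.

```python
class VariablePush:
    """Toy model of a push-style variable (an illustrative assumption, not the
    real handshake component): a write pushes data towards the reader, and the
    reader must take it before the next write."""

    def __init__(self):
        self._value = None
        self._full = False   # True while written data is waiting to be read

    def write(self, value):
        if self._full:
            # In hardware the writer would stall; the model raises instead.
            raise RuntimeError("write would stall: previous data not yet read")
        self._value = value
        self._full = True

    def read(self):
        if not self._full:
            raise RuntimeError("read would stall: nothing has been written")
        self._full = False
        return self._value


def write_with_writeback(cells, contents, index, data):
    """Write `data` to element `index` and recycle the old value into every
    other element, so that all cells push. Returns the number of cell writes."""
    for i, cell in enumerate(cells):
        if i == index:
            contents[i] = data
        cell.write(contents[i])
    return len(cells)


cells = [VariablePush() for _ in range(4)]
contents = [0, 0, 0, 0]            # hypothetical initial array value
writes = write_with_writeback(cells, contents, index=2, data=7)
values = [cell.read() for cell in cells]
print(values, writes)
```

In this model the cost of the scheme is visible directly: the number of cell writes equals the array size regardless of how many elements actually change.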

Variables can also be declared in a similar fashion to arrayed channels, producing multiple variables in the implementation. Each of these variables must be written individually; the whole array may not be written by a single command. This second type of variable can be used by the user to generate a fairly close approximation of the functionality of the multi-variable Balsa structure by implementing, in the source description, the second of the schemes offered above. Although the functionality may be similar, the area used is substantially greater.

F. Structural iteration

Structural iteration is a very useful language feature, especially when combined with arrayed channels and variables. Essentially it allows the same code to be compiled multiple times with different channel and variable connections. For example, the following code is a simplified excerpt from the data-driven description of the register bank of the nanoSpa processor.

    constant REGNUM = log REGCOUNT bits
    array REGCOUNT of variable reg_usrw
    array REGCOUNT of variable reg_usrr

    input reg_usrr
    output reg_usrw
      for i in 0..REGCOUNT - 1
        reg_usrw[i] <- reg_usrr[i]

    input reg_usrw, wc, wd
    output reg_usrr, reg_svcr
      foreach i in reg_usrr
        case wc of
          (i as REGNUM) then reg_usrr[i] <- wd
          reg_usrr[i] <- reg_usrw[i]

The code generates REGCOUNT instances of the circuit in figure 12. (The positions of the channels that take data to the read ports are indicated on the diagram but the code for the read ports is not given above.) Effectively this code generates a register cell for each register. In each cycle of operation the write control (wc) and data (wd) are duplicated to each cell and that cell compares the register address in the control against its own index. If they match then the write-back data is written to that register, otherwise the original value from the register is written.

G. Restrictions

1) Combining inputs: A conditional input is an input that is part of the body of a conditional structure.
As explained in section IV-B, such inputs cannot be combined with unconditional inputs in any expression: the following code will produce a compiler error since the operation a + b cannot be used where the input of b is conditional and a is not.

Figure 12. Simplified register cell.

    input a, c
    output o1
      o1 <- a
      input b
        o1 <- a + b

This restriction can be worked around by declaring another channel and making both inputs to the expression conditional as follows:

    channel t
    input a, c
    output o1, t
      o1 <- a
      t <- a
      input b, t
        o1 <- b + t

Note that by using this technique, less advantage is taken of the speculation, as the case must be resolved before the channel t is written and the expression begins evaluation. Note also, however, that the speculative evaluation of the addition is avoided in the case where the clause is not chosen. This may be exploited for the purposes of improving performance or reducing energy consumption.

2) All inputs and outputs must be used: All inputs and outputs that are declared must appear in the body of the block. (They must also be declared if they appear.) It is only necessary for the possibility to exist for each output to be produced. It is not necessary for every, or indeed any, output to actually be produced by the block when it is operating. Once an input is declared it will be assumed that a value will arrive from that channel or variable, but an output declaration means only that the block is the one that writes to the channel/variable, not that a value will definitely be written in any particular cycle of the block.
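The two restrictions above lend themselves to a mechanical check. The following Python sketch is an illustration only and is not taken from the compiler: the Block and Command records, the selectors field (inputs used only to choose a clause) and the function check_block are all hypothetical, and the channel names a, b, c, t and o1 simply mirror the examples in this section.

```python
from dataclasses import dataclass, field


@dataclass
class Command:
    target: str       # channel or variable written by the command
    sources: tuple    # channels or variables read by its expression


@dataclass
class Block:
    inputs: set                      # unconditional inputs
    outputs: set
    commands: list
    conditional_inputs: set = field(default_factory=set)
    selectors: set = field(default_factory=set)   # inputs used only to choose a clause


def check_block(block):
    """Return a list of violations of restrictions 1 and 2 (hypothetical checker)."""
    errors = []
    declared = block.inputs | block.conditional_inputs
    read = block.selectors | {s for c in block.commands for s in c.sources}
    written = {c.target for c in block.commands}

    # Restriction 2: declared and used names must coincide.
    for name in sorted(declared - read):
        errors.append(f"input {name} is declared but never used")
    for name in sorted(read - declared):
        errors.append(f"input {name} is used but not declared")
    for name in sorted(block.outputs - written):
        errors.append(f"output {name} is declared but never written")
    for name in sorted(written - block.outputs):
        errors.append(f"output {name} is written but not declared")

    # Restriction 1: an expression may not mix conditional and unconditional inputs.
    for c in block.commands:
        srcs = set(c.sources)
        if srcs & block.conditional_inputs and srcs - block.conditional_inputs:
            errors.append(f"{c.target}: mixes conditional and unconditional inputs")
    return errors


rejected = Block(inputs={"a", "c"}, outputs={"o1"}, conditional_inputs={"b"},
                 selectors={"c"},
                 commands=[Command("o1", ("a",)), Command("o1", ("a", "b"))])
accepted = Block(inputs={"a", "c"}, outputs={"o1", "t"},
                 conditional_inputs={"b", "t"}, selectors={"c"},
                 commands=[Command("o1", ("a",)), Command("t", ("a",)),
                           Command("o1", ("b", "t"))])
print(check_block(rejected))
print(check_block(accepted))
```

The second block corresponds to the workaround with the extra channel t: because both operands of the addition are conditional, the check for mixed inputs no longer fires.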

3) Output to input dependencies must not be disjoint: An output depends on an input if the input must arrive before the output can be produced. For example, in the following code o1 depends on c and a, t depends on c and b, and o2 depends on a and t.

    channel t
    input a, b, c, t
    output o1, o2, t
      o1 <- a
      t <- b
      o2 <- a + t

This gives three sets of input dependencies, one for each output: {c,a}, {c,b}, {a,t}. These are not disjoint, as c appears in the first two and a appears in the first and third. This code is therefore valid; however, the following is not valid:

    channel t
    input a, b, c, t
    output o1, o2, t
      o1 <- a
      t <- a
      o2 <- b + t

The sets of input dependencies for this code are: {c,a}, {c,a}, and {b,t}. The set containing b and t is disjoint from the other two sets. A separate block should be used to produce o2:

    channel t
    input a, c
    output o1, t
      o1 <- a
      t <- a

    input b, t
    output o2
      o2 <- b + t

This rule helps to ensure the design is understandable, as each block has a single cycle of operation due to the fact that all inputs are synchronised somewhere, though not necessarily with all others. For example, if the following code were valid then its meaning would be open to question, but presumably, following the method of operation so far defined, o1 would be written every time a arrives, o2 would be written every time b arrives, and there would be no synchronisation between the two operations.

    input a, b
    output o1, o2
      o1 <- a
      o2 <- b

In Balsa, if one were to write both o1 <- a and o2 <- b, then there is an explicit synchronisation that takes place in the control. The data-driven style is designed to avoid making such synchronisations. In Balsa, there will be one communication on o1 and one on o2 before another takes place on either channel. In the data-driven style there could be infinite communications on o1 before any occur on o2 or vice versa. This could make designs much more difficult to understand.

VI. A DESIGN EXAMPLE: NANOSPA

The benefits and drawbacks of the data-driven style have been explored using a large design example, nanoSpa, which is a 32-bit microprocessor implementing what is essentially a slightly cut-down version of the ARM instruction set and which is a development of SPA [22], the first large-scale design described in Balsa. The nanoSpa has been gradually developed with the sole objective of making a Balsa-synthesised asynchronous ARM of the maximum possible performance. Development has reached the stage where the processor implements all the main features of the instruction set and benchmark programs can be run in simulation to produce a good idea of the performance (which is almost ten times that of the original SPA). This makes it an excellent example for demonstrating whether a data-driven circuit can offer performance improvements over the best available conventional Balsa circuit. The demonstrator was chosen to:

- demonstrate that the data-driven synthesis flow can be used to construct a significant design;
- compare the performance of a high-performance Balsa design with the closest possible equivalent in the data-driven style;
- demonstrate the integration into the existing Balsa design flow and the use of mixed Balsa and data-driven designs;
- attempt some level of qualitative comparison between the features and flexibility offered to the designer in both description styles.

A. Data-driven nanoSpa

The data-driven nanoSpa has been described in the new data-driven input language. The description is roughly the same length as the Balsa original (around 3000 lines). As far as possible, the micro-architecture of the processor has been precisely copied from the Balsa description. As a consequence, most of the synthesised datapath logic is the same as the Balsa nanoSpa, and the control contains most of the significant differences. The intention is to explore the advantage gained by using the data-driven style in describing a design that is as close as possible to a Balsa description, rather than by tailoring the design specifically to suit the data-driven style.

The two major exceptions where it was necessary to make significant changes to the architecture are in the decode unit, due to its use of (temporal) iteration, and the register bank, due to its reliance on Balsa-style variables. These issues are discussed below.

Figure 13. Data-driven nanoSpa decode structure.

B. Decode

Unusually for a RISC-style processor, the ARM instruction set contains support for multi-cycle load and store instructions. These load and store multiple (ldm/stm) instructions allow any given subset of registers to be loaded from or stored to contiguous words in memory using a single instruction. The nanoSpa implements these instructions in the decode stage by simply generating and issuing multiple single memory transfer operations to the execute unit. The iterative decode for ldm/stm instructions makes use of the Balsa while loop structure to repeatedly generate memory transfer operations. In the control-driven style the handshake for the inputs to the decode can enclose all of this iterative operation, allowing the inputs to be read repeatedly by each iteration. In a data-driven style, iteration is implemented in a different way: instead of enclosing several variable and channel reads within a single input handshake, the input handshake is repeated as many times as required, each time reading variables and channels only once.

It is quite straightforward to rearrange the structure of the decode to implement the multi-cycle instructions as shown in figure 13. In this structure the whole decode can be viewed as iterative, with regular instructions simply being a special case requiring only a single iteration. When an instruction arrives at decode it is passed through the multiplexer to the decode logic. If the instruction is an ldm/stm, the necessary data for the next iteration is passed back to the multiplexer and the control signal is set so as to re-inject the data as the next instruction.
When the ldm/stm is finished, or after a single cycle if the instruction is a regular instruction, the multiplexer is signalled to inject the next instruction being sent from fetch. Although this may not be the most efficient implementation, it has the important advantage that the two blocks shown in the shaded area in figure 13 (for regular or ldm/stm instructions) can be copied directly to the data-driven description.

C. Register bank

The Balsa nanoSpa register bank uses the general read and write structure for variable arrays discussed previously. The passive-ported Variable component allows reads and writes to occur to variables in any arbitrary order. As discussed earlier, it is not so easy to provide this general structure when using push-style variables. In order to read from any variable, it is necessary for that variable to push its data. Therefore, in order to implement the register bank in the data-driven style it is necessary to write to every variable (i.e. register) every cycle.

The data-driven register bank write structure is illustrated in figure 14. The write control and data are here duplicated to individual write control units belonging to each register. These individual units decide whether to write the data to their respective register. If they do not write the data, they recycle the existing value and write this to the register instead. The subsequent read may therefore pick the appropriate data from any register, as all registers will push data.

Figure 14. Data-driven nanoSpa register structure.

The data-driven register bank structure results in an individual cell for each register that controls the writes to that particular register (figure 14). A read unit is generated for each read port. This structure results in improved performance but also significantly increases the area over the Balsa counterpart. It will also significantly increase the energy consumption, as every register is written on every cycle.

D. Simulation Results

1) Individual Modules: Rather than simulating the entire processor, it is more instructive to examine the results from simulating individual modules within the processor.
This avoids issues associated with the pipelining and processor architecture and demonstrates the performance improvements gained by using the data-driven logic style. The environments used in the test benches for these simulations all have zero delay. Generally, this favours the control-driven approach as, for example, the cost of synchronising inputs that all arrive simultaneously is minimal. The results then show (to a close approximation) the minimum improvement achieved by the

data-driven style. The fact that the data-driven style does not synchronise all inputs before beginning the operation and does not wait until the operation is complete before releasing them can potentially further improve performance. The fetch, decode and execute units, together with some individual modules from the execute unit, have been simulated. These latter modules were simulated, where appropriate, with different input data to demonstrate the data-dependent variation in performance. The relative performance of the modules in nanoSpa for both dual-rail and single-rail implementations is shown in table II.

Table II
PERFORMANCE OF DUAL-RAIL AND SINGLE-RAIL NANOSPA MODULES

                                         Dual-rail Gates/Cycle               Single-rail Gates/Cycle
Module          Test                     Balsa  Data-driven  Improvement     Balsa  Data-driven  Improvement
Fetch                                    59     29           2.0             47     33           1.4
Decode          regular                  52     39           1.3             179    76           1.0
                ldm/stm (5 registers)    604    254          2.4             578    311          1.9
Register bank   1 write                  134    69           1.9             82     61           1.3
                2 writes                 182    74           2.5             99     61           1.6
ALU             and                      74     41           1.8             57     33           1.7
                add 0 carry              85     74           1.1             75     42           1.8
                add 5 carry              86     74           1.2             83     52           1.6
                add 32 carry             107    65           1.6             137    106          1.3
                mov                      77     57           1.4             56     32           1.8
ExecuteControl  non-memory               44     24           1.8             41     32           1.3
                memory store             57     30           1.9             46     32           1.4
                memory load              64     32           2.0             50     32           1.6
Execute         nop                      83     58           1.4             70     56           1.3
                and                      93     58           1.6             91     59           1.5
                and with shift           133    65           2.0             129    63           2.0
                ands (update flags)      95     58           1.6             91     59           1.5
                ldr/str                  116    65           1.8             150    106          1.4
                branch                   92     74           1.2             88     56           1.6

Table III
DUAL-RAIL AND SINGLE-RAIL NANOSPA AREA

                  Dual-rail Transistor Count     Single-rail Transistor Count
Module            Balsa      Data-driven         Balsa      Data-driven
Fetch             7667       17957               5741       5280
Decode            64394      271369              37114      58825
Register bank     68456      376914              31468      71999
Execute           143750     265707              56754      72492
ExecuteControl    5073       5754                2887       3470
ALU               38687      53492               9179       11270
Shifter           28987      85431               13677      26249
Other execute     64311      107642              28261      27720
Other             36482      24806               20380      11234
Total             320749     956753              151457     219830
The Gates/cycle figure is the number of gates the request-in signal goes through before emerging as an acknowledge signal to its data provider. It is a measure of the maximum throughput of each module in the design. The relative areas of the units within nanoSpa (as measured by transistor count) are shown in table III.

2) Register bank hybrid design: The register bank has been highlighted as a particular problem in terms of area and energy consumption. A possible solution that may be easily implemented is to use the conventional Balsa register bank in place of the data-driven register bank. As the interface to both register bank designs is the same and the two design styles are integrated into the same flow, it is trivial to produce this hybrid design. This provides an excellent example of how designs with mixed Balsa and data-driven modules can be used. The lower area and energy requirements of the control-driven style can be exploited for non-critical modules, while the performance of the data-driven style is exploited for others.

3) Processor Performance: The processor was simulated at the transistor level using Nanosim to measure both speed and energy consumption with the processor running the Dhrystone benchmark. Table IV compares the energy consumption for a single Dhrystone loop and the performance of the control-driven design, the data-driven design and a hybrid design with a conventional Balsa register bank.

The dual-rail control-driven nanoSpa achieves 54 Dhrystone MIPS. The data-driven version achieves 85 Dhrystone MIPS, an improvement of 1.6 times the original. As can be seen from table III, the area is significantly increased, from 320749 to 956753 transistors. As anticipated, a significant proportion of this increase is found in the register bank (from 68456 to 376914 transistors). If the increase in register bank area is ignored, then the data-driven nanoSpa is just over twice the size of the original Balsa version. As expected (for the reasons given in section III), the energy consumption is considerably greater for the fully data-driven version.
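The ratios quoted in the paragraph above can be reproduced from the dual-rail figures in the text and in table III. The short Python sketch below is only a worked check of that arithmetic; the variable names are illustrative, and the reading of "ignoring the increase in register bank area" as subtracting that increase from the data-driven total is an assumption.

```python
# Figures quoted in the text and in table III (dual-rail designs).
balsa_mips, data_driven_mips = 54, 85
balsa_area, data_driven_area = 320749, 956753   # transistors, whole processor
balsa_regs, data_driven_regs = 68456, 376914    # transistors, register bank

# Performance improvement of the data-driven design over the Balsa original.
speedup = data_driven_mips / balsa_mips

# Overall area growth, and the share of that growth due to the register bank.
area_ratio = data_driven_area / balsa_area
reg_increase = data_driven_regs - balsa_regs
reg_share = reg_increase / (data_driven_area - balsa_area)

# Area ratio if the register bank had not grown (assumed interpretation).
ratio_without_reg_increase = (data_driven_area - reg_increase) / balsa_area

print(f"speed-up: {speedup:.2f}")
print(f"area ratio: {area_ratio:.2f}")
print(f"register bank share of the increase: {reg_share:.2f}")
print(f"area ratio without register bank increase: {ratio_without_reg_increase:.2f}")
```

Under these assumptions the register bank accounts for a little under half of the extra transistors, which is consistent with the statement that the remaining design is just over twice the size of the Balsa version.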
Table V shows the contribution of the various modules to the energy consumption. It confirms that the largest energy consumer in the data-driven approach is the register bank, whereas for the Balsa implementation, the execute unit is the most significant. The dual-rail hybrid design achieves 68 Dhrystone MIPS ( speed