Power-Aware FPGA Logic Synthesis Using Binary Decision Diagrams

Kevin Oo Tinmaung, David Howland, and Russell Tessier
Department of Electrical and Computer Engineering
University of Massachusetts
Amherst, MA 01003

ABSTRACT
Power consumption in field programmable gate arrays (FPGAs) has become an important issue as the FPGA market has grown to include mobile platforms. In this work we present a power-aware logic optimization tool that is specialized to facilitate subsequent power-aware technology mapping. Our synthesis framework uses binary decision diagram (BDD) based collapsing and decomposition techniques in conjunction with signal switching estimates to achieve power-efficient circuit networks. The results of synthesis and subsequent power-aware technology mapping are evaluated using two distinct physical design platforms: academic VPR and Altera Quartus II. Our approach achieves an average energy reduction of 13% for Altera Cyclone II devices versus synthesis with SIS-based algebraic optimization at the cost of 11% average circuit performance if performance-optimal technology mapping is performed after synthesis. If technology mapping is tuned to achieve the same average delay for both SIS and BDD-based flows, a 3% average energy reduction is achieved by our new synthesis approach.

Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Design Aids

General Terms
Algorithms

Keywords
FPGA, Binary decision diagram, Dynamic power

1. INTRODUCTION
The deployment of FPGAs in a variety of portable and embedded systems has demonstrated the need for device power efficiency. This need has led to a series of FPGA power reduction approaches at both the architectural and computer-aided design levels. Most power-aware FPGA CAD techniques have been focused on technology mapping [1], placement [5], and routing [5] and use signal switching information to reduce dynamic power. Although these techniques are effective at reducing area and signal switching activity, they rely on a circuit netlist which has first been processed by logic optimization.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. FPGA'07, February 18-20, 2007, Monterey, CA, USA. Copyright 2007 ACM 1-59593-292-5/07/0002...$5.00.

Previously, it has been shown that effective power-aware FPGA technology mapping algorithms limit signal fanout and encapsulate high-transition nets within FPGA look-up tables (LUTs) [5][13][7][1]. These techniques reduce the need for area-consuming logic replication and limit signal switches on inter-LUT paths. Ideally, a circuit netlist generated by logic optimization has nodes which exhibit low switching and low fanout. These characteristics help power-aware technology mapping evaluate a broader range of low-power mapping choices.

Our new synthesis algorithms use binary decision diagram (BDD) clustering and decomposition to minimize circuit fanout and signal switching characteristics. Prior to BDD decomposition, circuit nodes are collapsed into clusters which minimize node fanout. Subsequently, clusters are decomposed so that signal switching is minimized. Efficient decomposition is facilitated with the use of a fast statistical switching estimator which allows for the evaluation of numerous intermediate decompositions.

Our new FPGA optimized synthesis system has been integrated into a compilation flow which includes EMap [5], a power-aware technology mapping tool. Placement and routing is performed with Quartus II, a commercial system from Altera, and alternatively, with VPR, an academic place and route tool. For a series of benchmark designs, our synthesis approach shows a 13% energy reduction versus synthesis with SIS [12] for Altera Cyclone II, and an 11% energy reduction for an academic FPGA model at a fixed clock frequency.
These energy reductions come at the cost of a substantial loss in achievable peak design clock frequency for a subset of designs. If technology mapping performance constraints are relaxed in the SIS compilation flow so that compilation achieves the same average clock frequency obtained with our BDD-based compilation flow, a 3% energy savings is observed for BDD-based synthesis.

This paper is organized as follows. Section 2 describes necessary background regarding BDDs and power-aware technology mapping, and related logic synthesis work. Our power-aware logic synthesis and technology mapping framework is presented in Section 3 and the experimental approach is outlined in Section 4. Section 5 provides an analysis of the experimental results. Section 6 concludes the paper and offers directions for future work.

2. Background

2.1 BDD-Based Logic Synthesis
A BDD is a rooted directed acyclic graph (DAG) representing a switching function. BDD-based logic synthesis based on reduced ordered binary decision diagrams (ROBDDs) is an effective logic optimization approach for both standard cell [14] and FPGA technology [15]. Typically, BDD synthesis relies on functional decomposition to explore a range of both algebraic and Boolean optimizations. This step is an important part of our BDD processing flow which also includes clustering, variable reordering, and shared subcircuit extraction steps.

An early step in BDD based synthesis involves collapsing portions of a circuit DAG into clusters. This eliminate procedure makes the resulting BDD network easier to evaluate and optimize. Each collapsed cluster is represented as a BDD which is subsequently decomposed. Partial collapsing helps remove logic redundancy in multi-level logic representations, such as the redundancy caused by reconvergence. In [15], it was determined that maximum fanout free cone (MFFC) based collapsing is especially effective for FPGA synthesis with BDDs. Collapsed multi-input cones often match the structure of FPGA LUTs, leading to area-efficient implementations.

Cluster generation with partial collapsing is followed by iterative BDD decomposition based on variable partitioning approaches. The theory of dominators [14] has been shown to be an effective basis for BDD decomposition. The dominator approach quickly identifies cuts which divide a target subcircuit into two parts. The subcircuits which result from these cuts can then be combined with either an AND, OR, or XOR function to reproduce the original BDD functionality. Algebraic decompositions result in subcircuits which do not share input variables; Boolean decomposition allows sharing. Typically a depth-first BDD scan is performed to identify dominators. Unlike strictly algebraic synthesis tools, dominator based decomposition can identify AND/OR, AND/XOR, and multiplexer decompositions equally efficiently. Since arithmetic FPGA designs often contain numerous XOR and multiplexer operations, this flexibility is desirable for FPGA synthesis. To explore a variety of decompositions, the variables in the BDD may be reordered to expose more efficient cuts [14]. Previously, it was determined that frequent variable reorderings provide area-efficient FPGA implementations [15].
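The difference between algebraic and Boolean decompositions can be illustrated with a small truth-table check. The sketch below is not the dominator-based BDD procedure of [14]; it only verifies by exhaustive enumeration that two candidate subfunctions recombine to the original function, and reports whether they share input variables. The helper names (`truth_table`, `support`, `classify_decomposition`) and the example functions are assumptions chosen for illustration.

```python
from itertools import product
import operator

NAMES = ["a", "b", "c", "d"]

def truth_table(f, names):
    """Evaluate f on every 0/1 assignment of the named variables."""
    return [bool(f(dict(zip(names, bits))))
            for bits in product([0, 1], repeat=len(names))]

def support(f, names):
    """Variables that f really depends on (found by flipping each input)."""
    deps = set()
    for bits in product([0, 1], repeat=len(names)):
        env = dict(zip(names, bits))
        for v in names:
            flipped = dict(env)
            flipped[v] = 1 - env[v]
            if bool(f(env)) != bool(f(flipped)):
                deps.add(v)
    return deps

def classify_decomposition(f, g, h, op, names):
    """Return 'algebraic' or 'Boolean' if f == op(g, h), else None."""
    combined = lambda env: op(bool(g(env)), bool(h(env)))
    if truth_table(f, names) != truth_table(combined, names):
        return None
    shared = support(g, names) & support(h, names)
    return "Boolean" if shared else "algebraic"

# Hypothetical example 1: F = ab + cd splits into parts with disjoint inputs.
f_alg = lambda e: (e["a"] and e["b"]) or (e["c"] and e["d"])
g_alg = lambda e: e["a"] and e["b"]
h_alg = lambda e: e["c"] and e["d"]

# Hypothetical example 2: F = ab + ad + cd, split two ways that both share 'a'.
f_bool = lambda e: (e["a"] and e["b"]) or (e["a"] and e["d"]) or (e["c"] and e["d"])
g_or = lambda e: e["a"] and e["b"]
h_or = lambda e: (e["a"] and e["d"]) or (e["c"] and e["d"])
g_and = lambda e: e["a"] or e["c"]
h_and = lambda e: e["d"] or (e["a"] and e["b"])

kind_alg = classify_decomposition(f_alg, g_alg, h_alg, operator.or_, NAMES)
kind_or = classify_decomposition(f_bool, g_or, h_or, operator.or_, NAMES)
kind_and = classify_decomposition(f_bool, g_and, h_and, operator.and_, NAMES)
print(kind_alg, kind_or, kind_and)
```

Exhaustive enumeration is only practical for a handful of variables; a BDD-based tool would find such cuts structurally rather than by checking candidates.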
2.2 Related Work
Several power-aware technology independent synthesis algorithms have been proposed. In a paper by Iman and Pedram [4], a power-aware synthesis approach which uses algebraic decompositions is proposed. Given the Boolean function of a node, dynamic power is optimized by minimizing its load in terms of the number of product terms in the function and by minimizing the estimated switching activity of internal sum-of-product nodes. Lennard et al. [6] use the probability distribution of don't care sets to optimize area with an eye towards power reduction. An approach presented by Roy et al. [11] transforms sum-of-product expressions into factor-tree forms and uses capacitance information of AND/OR gates to guide logic optimization. Holt and Tyagi [3] consider interconnect capacitance using a bounding box during synthesis. In a paper by Lindgren et al. [8], a power-aware BDD decomposition approach for pass transistor logic is defined. In general, these techniques support power reduction in fine-grained ASICs which are based on logic libraries. The presence of coarse-grained LUTs and logic clusters makes these power-aware synthesis approaches inappropriate for FPGAs.

To effectively perform power-aware FPGA logic optimization, the goals of subsequent power-aware technology mapping must be understood. Most related mapping approaches examine the reduction of dynamic power in FPGA interconnect, where nearly two-thirds of contemporary FPGA power is consumed. An early research study by Farrahi and Sarrafzadeh [2] showed that a power consumption cost metric based on net signal switching can be the basis for a LUT packing heuristic.

[Figure 1: Power-aware logic synthesis and technology mapping flow. Initial Boolean network -> Build BDDs -> Sweep -> Power-Aware MFFC-Based Eliminate -> Power-Aware BDD Decomposition -> Shared Sub-circuit Extraction -> Power-optimized 2-input nodes -> Power-Aware Technology Mapping (EMap) -> Final power-optimized LUT network. Three Switching Activity Estimation blocks accompany the flow.]
Although this approach showed a modest improvement in circuit power (14% reduction), a net increase in LUT area was noted. An approach by Li et al. [7] first maps circuits using a depth-optimal cut-based heuristic and then re-cuts logic node outputs on the critical path to reduce power. This technique resulted in a 19% power reduction when compared to the approach in [2].

Recent efforts at achieving power reduction in FPGA technology mapping focus on two specific goals: the reduction of LUT output signal switching and the minimization of logic replication. The first goal minimizes the switching activity of the nets between LUTs by encapsulating high activity nets inside LUTs. The second goal minimizes the number of LUT outputs by minimizing node duplication for performance enhancement. Anderson and Najm [1] use a depth-first, cut-based mapping to determine an initial mapping. This work focuses on eliminating node replications along non-critical paths to reduce power. A cost function based on LUT area, depth, and power is derived to allow for mapping tradeoffs. The authors note an additional 10% power reduction if optimal technology mapping depth is relaxed. A similar mapping approach is applied in EMap [5]. The EMap approach uses both signal switching and signal fanout as cost metrics used to encapsulate logic in LUTs.

Our power-aware FPGA synthesis approach follows up on these recent efforts in the logic optimization step by optimizing node fanout, output switching, and area during partial collapsing and decompositions.

3. Power-aware Logic Optimization Flow
Our BDD-based logic synthesis system uses a series of steps to convert a logic design into a technology mapped circuit. Figure 1 shows our power-aware FPGA synthesis flow. The initial circuit input is an un-optimized Boolean network. BDDs are built for each Boolean node in the network. After pre-processing to remove logic redundancy (sweep), power-aware iterative eliminate and decomposition procedures are performed to generate a collection of 2-input one-output nodes. Power-aware shared subcircuit extraction is performed to minimize power and area in the final LUT network. As a final step, the 2-input nodes are mapped to LUTs using EMap.

Accurate switching activity estimation forms an important part of our logic optimization flow. To allow for rapid re-evaluation of switching activity during circuit processing, a probabilistic, rather than simulation based, approach is used. An effective probabilistic approach for estimating switching activity in logic circuits is the transition density model [9]. The model has two parameters: transition density D, the average number of transitions per unit time, and static probability P, the probability of the signal being high for a certain time period. For a signal y, the transition density of the signal D(y) is:

$$D(y) = \sum_{i=1}^{n} P\left(\frac{\partial y}{\partial x_i}\right) D(x_i) \qquad (1)$$

where n is the number of signals input to y, P(∂y/∂x_i) is the probability that a change in x_i will cause a change in y, and D(x_i) is the transition density of x_i. A modeling tool based on this calculation approach [10] provides switching activity estimation for our flow for each node from primary inputs to primary outputs. The following sections detail our approaches for power-aware BDD-based eliminate and decomposition.

3.1 Power-Aware Eliminate
Global BDD eliminate procedures create a single large global BDD for each design primary output (design output/flip-flop input) using design primary inputs (design inputs/flip-flop outputs) as the BDD inputs. The resulting size and complexity of these BDDs has been shown to lead to poor subsequent decompositions [14].
As a result, we develop a collapsing approach which results in multiple smaller BDDs for each primary output. To reduce logic minimization complexity, the combinational nodes needed for each primary output are reduced to a collection of smaller subcircuits via partial collapsing of the circuit. Each collapsed logic cluster is represented as a local BDD that is later processed by BDD decomposition.

To achieve clusters that are suitable for LUTs and that are also power efficient, we use a two-phase eliminate procedure. Each eliminate phase requires one or more DAG traversals with a specific cost and fanout objective. In the first phase, MFFCs are constructed in an effort to identify cones which match the properties of LUTs. In the second phase, nodes remaining from the first phase are collapsed in an effort to minimize cluster size and fanout. For each phase, collapsing is performed iteratively until the process converges based on the specific cost objective.

The first phase of eliminate creates MFFCs by iteratively collapsing nodes with a single fanout into their children. Collapsing takes place during a straightforward DAG traversal from primary inputs to primary outputs. This approach has been shown [15] to lead to effective decompositions which can be packed into K-input LUTs during subsequent technology mapping. As a result of collapsing, all adjacent single fanout nodes can be placed into the same LUT, reducing the required circuit area. BDD node count is used as a collapsing constraint during this iterative phase.

The objective for the power-aware second phase of eliminate is to minimize node fanout during collapsing while allowing for sufficient cluster growth to enable subsequent decomposition. This phase considers both BDD node count and node fanout. During each iteration, possible collapsing is determined with a power-aware cost function which takes the fanout of nodes into account. Nodes with large fanout are undesirable since they can create a large number of wires between LUTs after technology mapping, increasing interconnect capacitance.
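The first-phase collapsing described above can be sketched on a plain netlist graph. This is a minimal illustration under stated assumptions, not the authors' implementation: nodes are names, logic functions are ignored, and a cluster-size limit (`max_cluster`, a hypothetical parameter) stands in for the BDD node count constraint. The function name `collapse_single_fanout` and the example network are also assumptions.

```python
def collapse_single_fanout(fanins, outputs, max_cluster=None):
    """Iteratively merge every single-fanout node into its child (fanout) node.

    fanins:  dict mapping each internal node to the list of its fanin names;
             names without an entry are treated as primary inputs.
    outputs: set of nodes that must survive (primary outputs).
    Returns (fanins_after, clusters), where clusters maps each surviving node
    to the set of original nodes merged into it.
    """
    fanins = {n: list(ins) for n, ins in fanins.items()}
    clusters = {n: {n} for n in fanins}
    changed = True
    while changed:
        changed = False
        # Recompute fanout lists for the current network.
        fanouts = {}
        for node, ins in fanins.items():
            for src in set(ins):
                fanouts.setdefault(src, []).append(node)
        for cand in list(fanins):
            outs = fanouts.get(cand, [])
            if cand in outputs or len(outs) != 1:
                continue
            child = outs[0]
            size = len(clusters[cand]) + len(clusters[child])
            if max_cluster is not None and size > max_cluster:
                continue
            # Replace cand in the child's fanin list by cand's own fanins.
            merged = [i for i in fanins[child] if i != cand]
            for i in fanins[cand]:
                if i not in merged:
                    merged.append(i)
            fanins[child] = merged
            clusters[child] |= clusters.pop(cand)
            del fanins[cand]
            changed = True
            break  # fanout information is stale; rebuild it
    return fanins, clusters

# Hypothetical network: n3 feeds two nodes, so it is not collapsed.
network = {
    "n1": ["a", "b"],
    "n2": ["n1", "c"],
    "n3": ["c", "d"],
    "n4": ["n2", "n3"],
    "n5": ["n3", "d"],
}
collapsed, clusters = collapse_single_fanout(network, outputs={"n4", "n5"})
print(collapsed)
print(clusters)
```

In this example n1 and n2 are absorbed into n4, while n3 keeps its own cluster because it has two fanouts. Rebuilding the fanout lists after every merge is simple rather than efficient; it lets reconvergent nodes be absorbed once all of their fanouts have landed in the same cluster.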
Collapsing for the second phase is evaluated via two cost metrics for each potential eliminate operation:
1. A cost metric which determines the possible increase in average node fanout as a result of collapsing.
2. A cost metric which measures the possible increase in BDD nodes.

The following definitions are used to better clarify the cost metrics. A candidate node is a node which may be collapsed into its fanout (or child nodes). Parent nodes have outputs which fan into the candidate node. The fanout cost metric for the collapse of a candidate node into its child nodes is the resulting average fanout of the parent nodes divided by a scaling constant. This value can be defined as follows:

$$\text{Fanout Cost} = \frac{\sum_{i \in parent\_nodes} fanouts(i)}{\beta \cdot |parent\_nodes|} \qquad (2)$$

where fanouts(i) is the number of fanouts of parent node i after the collapse, |parent_nodes| is the number of parent nodes, and β is a scaling constant. A β value of 4 was found to give the best results through experimentation. A potential node collapse is only accepted if the result of Eq. (2) is less than 1, indicating that fanouts greater than 4 are undesirable. A β value of greater than 4 often leads to at least one high capacitance inter-cluster connection. Values less than 4 penalize connections which are generally routed using low-capacitance intra-cluster wiring.

The second cost metric, which evaluates the BDD node increase as a result of collapsing, is defined as follows:

$$\text{Node Cost} = \sum_{i \in new\_nodes} nodes(i) - \Big( nodes(candidate) + \sum_{i \in child\_nodes} nodes(i) \Big) \qquad (3)$$

where nodes(candidate) is the BDD node count of the candidate node before collapsing, the sum over child_nodes is the BDD node count of all the child nodes before collapsing, and the sum over new_nodes is the BDD node count of the combined nodes created by collapsing the candidate node into its children. The cost gives the difference between the BDD node count before and after collapsing the candidate node. If the Node Cost is greater than the predefined threshold value, collapsing is not performed. A node count growth of greater than 10 generally requires an additional LUT, adversely affecting area.

[Figure 2: Example of power-aware partial collapsing]

Figure 2 illustrates power-aware partial collapsing of a candidate node z into child nodes x and y. First, Fanout Cost is calculated. There are two parent nodes of node z (b and c). After collapsing, nodes b and c have 2 fanouts each. The Fanout Cost is (fanouts(b) + fanouts(c)) / (4 * |parent_nodes|) = (2 + 2) / (4 * 2) = 0.5. Since Fanout Cost is less than 1, Node Cost is calculated to determine if the collapse should be performed. As shown in Figure 2, the number of nodes before collapsing is (8 + 3 + 5) = 16 and the number of nodes after collapsing is (6 + 4) = 10. Therefore, the Node Cost is 10 - 16 = -6, indicating collapsing can proceed.

Figure 3 lists the per-phase steps in the identification and subsequent collapse of nodes during the power-aware second phase. The node-based design is topologically ordered and traversed from primary inputs (PIs) to primary outputs (POs). An ROBDD is then implemented for each node in the design. Clusters are determined and collapsible nodes are identified and collapsed. During each iteration these steps are performed until no additional collapsible nodes are found in the network. Node collapsing is performed via variable substitution.

3.2 Power-Aware BDD Decomposition
Unlike previous BDD decomposition approaches which have focused on area reduction [14][15], our approach simultaneously optimizes both area and signal switching characteristics of decomposed functions. As a result, switching activity, in the form of transition density, is directly used in evaluating the costs of decompositions.
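Returning briefly to the second-phase eliminate procedure of Section 3.1, the two cost metrics and the worked example of Figure 2 can be checked with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, and because the text gives only the total of 16 BDD nodes before collapsing, the split into a candidate count of 8 and child counts of 3 and 5 is assumed for illustration. The constants 4 and 10 are the values reported in the text.

```python
BETA = 4              # scaling constant for Eq. (2), as reported in the text
NODE_THRESHOLD = 10   # node-count growth limit used with Eq. (3)

def fanout_cost(parent_fanouts_after, beta=BETA):
    """Eq. (2): average post-collapse fanout of the parent nodes over beta."""
    return sum(parent_fanouts_after) / (beta * len(parent_fanouts_after))

def node_cost(new_node_counts, candidate_count, child_counts):
    """Eq. (3): BDD node count after collapsing minus the count before."""
    return sum(new_node_counts) - (candidate_count + sum(child_counts))

def accept_collapse(parent_fanouts_after, new_node_counts,
                    candidate_count, child_counts):
    """Accept only if both metrics pass, following the test shown in Figure 3."""
    return (fanout_cost(parent_fanouts_after) < 1
            and node_cost(new_node_counts, candidate_count,
                          child_counts) < NODE_THRESHOLD)

# Figure 2: parents b and c each have 2 fanouts after the collapse.
fc = fanout_cost([2, 2])
# Assumed split of the 16 pre-collapse nodes: candidate 8, children 3 and 5.
nc = node_cost(new_node_counts=[6, 4], candidate_count=8, child_counts=[3, 5])
ok = accept_collapse([2, 2], [6, 4], 8, [3, 5])
print(fc, nc, ok)
```

A parent whose fanout would exceed 4 on average pushes the fanout cost to 1 or above, so the collapse is rejected before the node count is even considered.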
The result of BDD decomposition is an optimized 2-input node network that is suitable for subsequent shared subcircuit extraction and technology mapping. As mentioned earlier in Section 3, node transition densities can be quickly evaluated for each proposed decomposition based on probabilistic techniques.

    Begin
      Topologically order network nodes from PIs to POs
      Build BDD for each node
      Identify collapsible nodes
      While number of collapsible nodes is not 0
        Evaluate Eq. (2) and (3) for node
        If Fanout Cost < 1 and Node Cost < 10
          Collapse node into immediate fanout
        Update network and re-identify all collapsible nodes
      End while
    End

Figure 3: Iterative eliminate algorithm, second phase

Our decomposition engine, which is based on BDS [14], performs a heuristic search for efficient BDD decompositions, including both algebraic and Boolean decompositions. For each BDD node under decomposition, the engine first searches for algebraic decompositions, which are based on 0, 1, and x-dominators, to perform simple AND, OR, and XOR decompositions, respectively. Subsequently, a series of Boolean decompositions based on generalized AND, OR, and XOR decompositions are attempted. If none of these decompositions are successful, the BDD is co-factored with respect to its top variable.

For each BDD, several different decompositions may exist depending on where the BDD is cut. Following decomposition, the source BDD is broken into two parts, the dominator dm and decomposed dp functions. These parts are then combined via a simplified function (e.g. AND, OR, XOR). For example, in Figure 4 both disjunctive (top, right) and conjunctive (bottom, right) decomposition of the BDD on the left are possible. Each potential BDD decomposition is evaluated with a cost function which considers both the resulting area and switching characteristics of the resulting circuit.
Our cost function is defined as follows:

$$\text{Cost} = \alpha \, \frac{D(dm * dp)}{D(orig)} + (1 - \alpha) \left( \frac{nodes(dm) + nodes(dp)}{2 \, nodes(orig)} + \frac{variables(dm \cap dp)}{2 \, variables(orig)} \right) \qquad (4)$$

where α is a scaling constant, D(dm*dp) is the transition density of the BDD function after decomposition, D(orig) is the transition density of the original BDD before decomposition, nodes() indicates the number of BDD nodes in a BDD, variables(dm ∩ dp) is the number of shared variables in dp and dm, and variables(orig) is the number of variables in the original BDD. The first part of the cost function optimizes the switching activity (power cost) and the second part optimizes area for BDD decompositions. Through experimentation, an α value of 0.25 was determined to generate the best results since this value puts more of a bias on area than switching activity. Since minimizing area also minimizes dynamic power for most designs due to power reductions in logic and associated clocking and routing, this factor is given more weight than the switching activity metric. Values of α less than 0.25 tend to cause our algorithm to ignore decompositions that have slightly greater area but significantly reduced switching.
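The interplay of Eq. (1) and Eq. (4) can be sketched with a small truth-table based estimator. The code below is an illustration under stated assumptions, not the estimator of [10]: it assumes independent inputs, treats intermediate signals as independent when densities are propagated, and uses hypothetical function names. The example uses the function F = ab + ad + cd of Figure 4 with the input values given there (all static probabilities 0.5, D(a) = D(c) = D(d) = 0.5, D(b) = 0.6). The BDD node counts passed to the cost function are hypothetical placeholders chosen to be equal for both cuts.

```python
from itertools import product

def _weight(env, probs):
    """Probability of one input assignment, assuming independent inputs."""
    w = 1.0
    for name, bit in env.items():
        w *= probs[name] if bit else 1.0 - probs[name]
    return w

def static_prob(f, names, probs):
    """Probability that f evaluates to 1."""
    total = 0.0
    for bits in product([0, 1], repeat=len(names)):
        env = dict(zip(names, bits))
        if f(env):
            total += _weight(env, probs)
    return total

def boolean_difference_prob(f, names, probs, var):
    """P(dy/dx): probability that toggling `var` changes the output of f."""
    others = [n for n in names if n != var]
    total = 0.0
    for bits in product([0, 1], repeat=len(others)):
        env = dict(zip(others, bits))
        if bool(f({**env, var: 0})) != bool(f({**env, var: 1})):
            total += _weight(env, probs)
    return total

def transition_density(f, names, probs, dens):
    """Eq. (1): sum over inputs of P(dy/dx_i) * D(x_i)."""
    return sum(boolean_difference_prob(f, names, probs, v) * dens[v]
               for v in names)

def decomposition_cost(d_after, d_orig, nodes_dm, nodes_dp, nodes_orig,
                       shared_vars, vars_orig, alpha=0.25):
    """Eq. (4): weighted sum of a switching term and an area term."""
    area = ((nodes_dm + nodes_dp) / (2 * nodes_orig)
            + shared_vars / (2 * vars_orig))
    return alpha * d_after / d_orig + (1 - alpha) * area

def density_of_cut(dm, dp, combine, names, probs, dens):
    """Propagate P and D through dm and dp, then through their combination."""
    p = {"dm": static_prob(dm, names, probs), "dp": static_prob(dp, names, probs)}
    d = {"dm": transition_density(dm, names, probs, dens),
         "dp": transition_density(dp, names, probs, dens)}
    return transition_density(combine, ["dm", "dp"], p, d)

NAMES = ["a", "b", "c", "d"]
P = {"a": 0.5, "b": 0.5, "c": 0.5, "d": 0.5}
D = {"a": 0.5, "b": 0.6, "c": 0.5, "d": 0.5}

F = lambda e: (e["a"] and e["b"]) or (e["a"] and e["d"]) or (e["c"] and e["d"])
# Disjunctive cut: G = ab, H = ad + cd, F = G + H
d_or = density_of_cut(lambda e: e["a"] and e["b"],
                      lambda e: (e["a"] and e["d"]) or (e["c"] and e["d"]),
                      lambda e: e["dm"] or e["dp"], NAMES, P, D)
# Conjunctive cut: D = a + c, Q = d + ab, F = DQ
d_and = density_of_cut(lambda e: e["a"] or e["c"],
                       lambda e: e["d"] or (e["a"] and e["b"]),
                       lambda e: e["dm"] and e["dp"], NAMES, P, D)

d_orig = transition_density(F, NAMES, P, D)
# Hypothetical, equal BDD node counts; both cuts share the single variable a.
cost_or = decomposition_cost(d_or, d_orig, 3, 3, 6, shared_vars=1, vars_orig=4)
cost_and = decomposition_cost(d_and, d_orig, 3, 3, 6, shared_vars=1, vars_orig=4)
print(round(d_or, 4), round(d_and, 4), cost_or > cost_and)
```

With equal area terms, the cut with the lower transition density receives the lower cost, which is the behavior the α-weighted first term of Eq. (4) is meant to produce.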

[Figure 4: Different cuts for the BDD of F = ab + ad + cd. The disjunctive decomposition is G = ab, H = ad + cd, F = G + H, with transition densities 0.55 for G, 0.625 for H, and 0.8125 for F. The conjunctive decomposition is D = a + c, Q = d + ab, F = DQ, with transition densities 0.65 for Q and 0.8 for F. Input values: P(a) = P(b) = P(c) = P(d) = 0.5; D(a) = D(c) = D(d) = 0.5, D(b) = 0.6.]

Figure 4 illustrates two possible cuts which generate two different decompositions of the BDD for the function F. Both cuts have the same area costs based on the number of nodes after decomposition. But using the transition density based switching activity estimations (shown in italics at the top of the BDDs), the signal switching for the two cuts differs. From the figure it can be seen that the bottom cut is more desirable for F since it has the lowest transition density (0.8). Static probabilities are labeled on each BDD edge, and transition density and static probability values for input variables are shown at the bottom left.

The decomposed nodes are stored in factoring trees after BDD decomposition. To reduce the area in the final 2-input logic gate network, extraction of shared subtrees is performed to find logic sharing among different parts of the network. This technique reduces area (number of LUTs) since node redundancies are eliminated. Sharing extraction also reduces the number of duplicated fanins of a node in the subtree, thus reducing the number of wires at the input and within the extracted subtree.

4. EXPERIMENTAL APPROACH
Our new logic optimization system has been integrated into two FPGA mapping flows to evaluate its power and energy reduction potential. Following logic optimization, we use EMap, a power-aware technology mapper, for LUT packing. Following mapping, two place and route options are possible. In the first flow, we use VPR, an academic place and route tool, for physical synthesis. Post-route core dynamic power for this flow is determined with the transition density estimator described in [10]. The target architecture for this flow is an FPGA model with four LUTs per logic block and wire segments of length 4.
A 0.18 µm technology is assumed for the FPGA model. A second flow uses Altera Quartus II 5.0 to perform placement and routing to Cyclone II devices. Altera's PowerPlay power analyzer was used with random input waveforms to determine core dynamic power results. The timing requirements are set to achieve maximum performance for all designs during place and route by setting the clock frequency constraint to an unattainable 1 GHz.

To evaluate the synthesis benefits of our new approach, we compare the power and area results of our power-aware BDD-based FPGA synthesis tool with two other logic synthesis tools using the two flows mentioned above. A synthesis flow which uses SIS and DMIG [12] is used to allow for comparison to previous work with EMap [5]. In the SIS flow, technology independent logic optimization is performed on the initial circuit designs using SIS (script.rugged). The optimized circuits are then transformed into networks of 2-input simple gates using SIS's tech_decomp and DMIG. In addition to SIS comparisons, we also compare the synthesis results of our power-aware BDD tool to our original BDD decomposition tool (BDS-pga [15]) that optimizes FPGA area. All other steps in the evaluation flows remain the same for these two alternate synthesis approaches.

For each experiment, 16 large MCNC benchmarks were used, as listed in Table 1. The experiments were conducted on a Pentium-4/1.8GHz machine with 512MB of RAM. All designs were targeted to the smallest FPGA which would hold them.

Table 1: Detailed results of SIS flow, BDS-pga flow, and power-aware BDD flow for Altera 90 nm Cyclone II FPGAs
Delay in ns; energy evaluated at 50 MHz.

                SIS Flow (A)               Original BDS-pga Flow (B)  Power-Aware BDD Flow (C)   Energy ratio
Design          Area     Delay    Energy   Area     Delay    Energy   Area     Delay    Energy   (C)/(A)  (C)/(B)
alu4            1400     7.65     1.60     552      6.94     0.79     528      7.11     0.81     0.51     1.02
apex2           1779     9.00     1.61     1846     10.84    1.69     1740     9.77     1.61     0.99     0.95
apex4           1294     7.06     0.85     1302     9.39     0.90     1203     9.09     0.70     0.83     0.78
bigkey          1810     3.92     2.59     1365     4.62     2.46     1478     4.49     2.42     --       0.98
des             1391     6.18     3.59     1069     4.89     2.71     1077     5.30     2.82     0.78     1.04
diffeq          1069     10.01    0.53     1090     8.54     0.55     1086     10.30    0.53     --       0.96
dsip            1366     3.76     2.43     1433     4.36     2.39     1460     4.18     2.33     0.96     0.97
elliptic        2319     11.72    1.87     2494     12.10    1.91     2444     12.56    1.77     0.95     --
ex1010          4405     9.51     1.26     4655     13.55    1.54     4610     11.98    0.81     0.64     0.52
ex5p            1057     8.37     0.86     1112     9.25     0.90     1095     9.78     0.75     0.88     0.84
frisc           2563     15.67    1.07     2825     16.76    1.12     2819     17.70    --       --       0.89
misex3          1324     7.83     1.34     2032     10.25    2.33     1443     9.00     1.68     1.25     0.72
pdc             4498     10.71    1.33     3600     11.90    1.45     3472     13.10    1.24     --       0.86
s298            1738     15.54    0.83     1880     19.68    1.16     1113     17.07    0.72     0.86     0.61
seq             1605     7.30     1.61     1631     8.50     1.66     1619     8.51     1.60     --       0.96
spla            3725     9.64     1.44     2842     11.67    1.20     2818     11.49    1.01     0.70     0.84
Geometric Ave.  1858     8.38     1.39     1739     9.38     1.41     1629     9.33     1.20     0.87     0.85

Table 2: Detailed results of SIS flow, BDS-pga flow, and power-aware BDD flow for VPR 0.18 µm model
Energy evaluated at 5 MHz.

                SIS Flow (A)               Original BDS-pga Flow (B)  Power-Aware BDD Flow (C)   Energy ratio
Design          Area     Delay    Energy   Area     Delay    Energy   Area     Delay    Energy   (C)/(A)  (C)/(B)
alu4            1400     42.23    27.61    552      33.84    16.88    528      33.61    16.79    0.61     0.99
apex2           1779     47.54    26.53    1846     59.64    25.29    1740     53.00    25.22    0.95     --
apex4           1294     36.82    16.78    1302     52.96    14.18    1203     50.23    14.16    0.84     --
bigkey          1810     45.86    71.23    1365     22.76    73.03    1478     38.69    57.25    0.80     0.78
des             1391     31.31    75.03    1069     29.36    65.73    1077     29.98    73.46    0.98     1.12
diffeq          1069     47.31    12.56    1090     50.24    12.88    1086     52.62    11.76    0.94     --
dsip            1366     24.58    55.23    1433     37.00    50.28    1460     36.42    50.45    --       --
elliptic        2319     69.19    28.05    2494     66.91    30.79    2444     75.38    27.87    0.99     --
ex1010          4405     49.76    37.13    4655     83.14    55.61    4610     62.03    28.95    0.78     0.53
ex5p            1057     43.52    15.77    1112     48.08    14.62    1095     53.80    14.63    --       --
frisc           2563     87.96    17.08    2825     97.95    17.91    2819     103.92   16.88    0.99     0.94
misex3          1324     42.40    22.55    2032     61.56    32.19    1443     55.30    20.41    --       0.63
pdc             4498     55.13    44.21    3600     66.95    41.36    3472     76.41    40.21    --       0.97
s298            1738     82.88    16.32    1880     139.54   15.91    1113     110.66   14.46    0.89     --
seq             1605     43.39    27.30    1631     47.64    26.77    1619     46.18    26.72    0.98     --
spla            3725     57.86    39.85    2842     61.79    38.54    2818     61.75    36.58    0.92     0.95
Geometric Ave.  1858     47.99    28.82    1739     54.40    28.45    1629     54.96    25.63    0.89     --

5. EXPERIMENTAL RESULTS

In an initial experiment, all 16 designs were synthesized using SIS/DMIG, our original BDD tool (BDS-pga), and our new power-aware BDD tool. Following technology mapping with EMap using default settings, the designs were packed, placed, and routed by Quartus II with clock constraints. The results from this experiment are summarized in Table 1. Two main results are apparent from the table. For a fixed design clock frequency, our power-aware BDD synthesis tool achieves about 13% overall energy reduction versus a flow with SIS synthesis and about 15% reduction versus a flow which includes our previous BDD synthesis approach. In the table, energy values are based on dynamic power values of the design (except I/Os) determined via simulation.
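The "Geometric Ave." entries of Table 1 can be checked with a few lines of Python. The sketch below is only an illustration: the function name and list names are hypothetical, and the per-benchmark energy values are copied from the SIS (A) and BDS-pga (B) energy columns of Table 1 in benchmark order (alu4 through spla).

```python
import math


def geometric_mean(values):
    """Geometric mean, computed as exp of the arithmetic mean of the logs."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Energy at 50 MHz per benchmark, copied from Table 1 (alu4 ... spla).
ENERGY_SIS = [1.60, 1.61, 0.85, 2.59, 3.59, 0.53, 2.43, 1.87,
              1.26, 0.86, 1.07, 1.34, 1.33, 0.83, 1.61, 1.44]
ENERGY_BDS_PGA = [0.79, 1.69, 0.90, 2.46, 2.71, 0.55, 2.39, 1.91,
                  1.54, 0.90, 1.12, 2.33, 1.45, 1.16, 1.66, 1.20]

if __name__ == "__main__":
    print(f"SIS flow (A):     {geometric_mean(ENERGY_SIS):.2f}")
    print(f"BDS-pga flow (B): {geometric_mean(ENERGY_BDS_PGA):.2f}")
```

With these inputs the two means come out at about 1.39 and 1.41, matching the table's summary row. A geometric mean is a natural choice here because it weights relative changes equally across small and large benchmarks.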
Table 3: Pre-map fanout and transition density values

Tool                    Average Fanout   % Diff   Ave. Trans. Density   % Diff
BDS-pga                 2.05             0        0.30                  0
BDS-power minus elim    2.07             1.0      0.27                  -9.0
BDS-power minus decomp  1.91             -7.3     0.28                  -6.2
BDS-power               1.93             -5.4     0.24                  -18.0

Note that although we list energy results at 50 MHz (the clock speed used for simulation), the relative results for dynamic power and energy will not change if the clock frequency for all three mappings for each design is increased to the maximum value allowed among the three. For example, design alu4 is constrained by a 7.65 ns period for SIS optimization, so all designs could have been evaluated at this frequency with the same relative result. This constant energy ratio occurs since both dynamic power and energy due to dynamic power are linearly proportional to clock frequency.

This significant energy reduction comes at the cost of maximum achievable clock frequency. The peak clock frequency for our new BDD approach is about 11% worse than the peak frequency achievable with SIS. This delay increase is not surprising since our eliminate and decomposition steps focus on reduced area and signal switching rather than depth reduction. We believe that for many applications, energy savings will be of utmost importance during design, especially for low to moderate performance embedded applications. It should be noted that although most designs for the Cyclone II experiments have simulation coverage of close to 100% of design nodes, designs ex1010 and pdc only have coverage of 82% and 89% respectively, which may skew their results.

To verify our results for Cyclone II, we performed the same set of experiments using the VPR flow described in Section 4. As seen in Table 2, an 11% energy savings due to dynamic power reduction was also achieved with this flow versus the SIS flow. The average delay increase of 15% is also similar. The overall energy savings are likely lower for the 180 nm model used by VPR versus Cyclone II due to VLSI technology differences and since VPR does not model the same interconnect richness found in the commercial 90 nm Cyclone II device. As mentioned earlier in this section, the SIS flow results (A) in Tables 1 and 2 were generated with default EMap technology mapping settings which attempt to simultaneously optimize for both power and performance.
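The claim earlier in this section that the energy ratio between two mappings does not depend on the evaluation clock frequency can be illustrated numerically. The sketch below assumes the standard first-order CMOS model, P = 0.5 · C_sw · Vdd² · f, where C_sw is the activity-weighted capacitance switched per cycle. The function names, the capacitance values, the 1.2 V supply, and the 1 ms observation window are all hypothetical and are not taken from the experiments.

```python
import math


def dynamic_power(switched_cap_f, vdd_v, clock_hz):
    """First-order dynamic power: 0.5 * C_sw * Vdd^2 * f (assumed model)."""
    return 0.5 * switched_cap_f * vdd_v ** 2 * clock_hz


def energy_ratio(cap_new_f, cap_ref_f, vdd_v, clock_hz, window_s=1e-3):
    """Ratio of dynamic energy over a fixed time window for two mappings."""
    e_new = dynamic_power(cap_new_f, vdd_v, clock_hz) * window_s
    e_ref = dynamic_power(cap_ref_f, vdd_v, clock_hz) * window_s
    return e_new / e_ref


if __name__ == "__main__":
    # Hypothetical switched capacitances for two mappings of one design.
    cap_new, cap_ref, vdd = 87e-12, 100e-12, 1.2
    at_50mhz = energy_ratio(cap_new, cap_ref, vdd, 50e6)
    at_fmax = energy_ratio(cap_new, cap_ref, vdd, 1 / 7.65e-9)  # 7.65 ns period
    print(f"ratio at 50 MHz: {at_50mhz:.3f}")
    print(f"ratio at 1/7.65 ns: {at_fmax:.3f}")
    print("equal:", math.isclose(at_50mhz, at_fmax))
```

Because the clock frequency multiplies both numerator and denominator, it cancels, which is why only the ratio of switched capacitance (area and activity) matters when both mappings run at the same clock.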
In an additional set of experiments, the netlists created by SIS synthesis were remapped by EMap under relaxed timing constraints so that average delay increased by 11%, the same amount achieved by power-aware BDD synthesis followed by default EMap technology mapping (Flow C). This remapping achieved an 8% area and 10% energy reduction versus the initial SIS flow (A). The energy savings are consistent with previous results [1] which examine power savings versus delay tradeoffs in technology mapping. In comparison to the delay-relaxed SIS/EMap flow, the power-aware BDD flow reduced area by 4% and energy by 3%.

In our experimentation, we only examine dynamic power since the static power of each design varies only a small amount per implementation for a fixed FPGA device. Although the logic area required by the designs varied based on the synthesis tools used, the smallest package required to fit each design did not change. The area reduction achieved by the power-aware BDD synthesis tool may have more of an impact on static power if future FPGAs provide the capability to power down device regions.

Although the percentage reduction in energy for Cyclone II (13%) is roughly the same as the amount of logic area reduction (12%) for the power-aware BDD synthesis flow, the use of power-aware eliminate and decomposition plays a role in energy reduction. From Table 1 it can be seen that if the original BDS-pga synthesis flow (B) is used, logic area is reduced by about 6% versus the SIS flow (A) but energy remains roughly constant.

The benefit of each step of the power-aware BDD synthesis approach can be seen in Table 3. This table lists average node fanout and transition density values for designs that have been synthesized but not technology mapped. In addition to synthesis with our original and new BDD tools, we also performed synthesis with the power-aware eliminate shut off (minus elim) and power-aware decomposition shut off (minus decomp). For those cases, their non-power-aware counterparts were used instead.
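As a rough illustration of the two metrics reported in Table 3, the following sketch computes average fanout and average transition density for a tiny, made-up gate network. The network representation (a dictionary from each node to the signals it reads), the function names, and the density values are assumptions for illustration only. Averaging fanout over signals that drive at least one node is likewise an assumption, and the tools may define these averages differently.

```python
def average_fanout(fanins):
    """Average number of readers per signal that drives at least one node.

    fanins maps each gate to the list of signals it reads.
    """
    fanout = {}
    for node, drivers in fanins.items():
        fanout.setdefault(node, 0)
        for driver in drivers:
            fanout[driver] = fanout.get(driver, 0) + 1
    driving = [count for count in fanout.values() if count > 0]
    return sum(driving) / len(driving)


def average_transition_density(density):
    """Plain mean of per-signal transition densities (transitions per cycle)."""
    return sum(density.values()) / len(density)


# Hypothetical network: primary inputs a, b, c; gates n1, n2, n3.
NETWORK = {"n1": ["a", "b"], "n2": ["b", "c"], "n3": ["n1", "n2", "a"]}
DENSITY = {"a": 0.5, "b": 0.5, "c": 0.5, "n1": 0.3, "n2": 0.3, "n3": 0.2}

if __name__ == "__main__":
    print(f"average fanout: {average_fanout(NETWORK):.2f}")
    print(f"average transition density: {average_transition_density(DENSITY):.2f}")
```

Lower values of either metric tend to reduce dynamic power: fewer fanouts mean less capacitance per signal, and a lower transition density means that capacitance is charged less often.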
As seen from the table, the power-aware eliminate was particularly important in reducing average fanout, and power-aware decomposition reduces transition density. This effect was also seen in the post-map LUT designs created by EMap. The average post-map fanout for BDS-pga versus BDS-power is 2.91 vs. 2.73 and the average transition density is 0.23 vs. 0.21. The reduction of both of these values plays a role in reducing dynamic power along with design area reduction.

To address the issue of increased circuit delay for designs mapped with our new power-aware BDD synthesis tool, an additional delay resynthesis approach was attempted on designs following the BDD synthesis shown in Figure 1 and before EMap. This approach uses tree height reduction [12] on a design after power-aware BDD decomposition. Functionally equivalent gates in the circuit paths are collapsed together and re-decomposed into 2-input gates using DMIG, a tree-height reduction decomposition algorithm. A Huffman encoding procedure [12] is used in DMIG to reduce the depth of the logic gate network. With this approach, an average 2% improvement in maximum delay versus results in Table 1 is achieved for BDS-power, but the maximum delay results are still 9% larger on average than the SIS flow. The energy saving over SIS is reduced from 13% to 10% on average while the area result remains the same.

6. CONCLUSION

This work presents a power-aware BDD-based synthesis system for FPGAs. Our BDD synthesis system performs partial collapsing during logic optimization to reduce node fanout. Additionally, signal switching information is used during logic decomposition to achieve decompositions that are both area and power efficient. Both of these goals have been shown to favorably assist subsequent power-aware technology mapping. Several advancements warrant evaluation as next steps. It may be possible to limit node depth during decomposition by using a more global view of BDDs. Alternatively, it may be possible to initially explore algebraic decompositions along critical paths followed by BDD-based decomposition along non-critical paths.
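For readers unfamiliar with the Huffman-style tree-height reduction used for delay resynthesis in Section 5, the idea can be sketched as follows. This is a generic textbook construction, not DMIG's actual implementation. It assumes a wide associative gate (for example a many-input AND) whose inputs arrive at known logic levels, and it repeatedly combines the two earliest-arriving signals into a 2-input gate. The function names and the example levels are hypothetical.

```python
import heapq


def huffman_tree_depth(arrival_levels):
    """Output level of a 2-input gate tree built by always merging the two
    earliest-arriving signals (Huffman-style construction)."""
    if not arrival_levels:
        raise ValueError("need at least one input")
    heap = list(arrival_levels)
    heapq.heapify(heap)
    while len(heap) > 1:
        first = heapq.heappop(heap)
        second = heapq.heappop(heap)
        # A 2-input gate is one level deeper than its later-arriving input.
        heapq.heappush(heap, max(first, second) + 1)
    return heap[0]


def chain_depth(arrival_levels):
    """Output level when inputs are combined in a simple left-to-right chain."""
    level = arrival_levels[0]
    for arrival in arrival_levels[1:]:
        level = max(level, arrival) + 1
    return level


if __name__ == "__main__":
    levels = [3, 0, 0, 0, 0]  # one late input and four early ones
    print("chain depth:  ", chain_depth(levels))
    print("huffman depth:", huffman_tree_depth(levels))
```

For the example input, the chain places the late signal at the bottom and reaches level 7, while the Huffman-style tree merges the early signals first and reaches level 4. This kind of rebalancing recovers depth at the cost of the switching-driven structure chosen by the power-aware decomposition.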

7. ACKNOWLEDGMENTS

This work was funded by a grant from Altera Corporation. We thank Julien Lamoureux from the University of British Columbia for providing the EMap software. We acknowledge the efforts of Mohammed Alhussein in preparing the final version of the paper.

8. REFERENCES

[1] J. Anderson and F. N. Najm, "Power-Aware Technology Mapping for LUT-Based FPGAs," IEEE Int. Conf. on Field Programmable Technology, Dec. 2002, pp. 211-218.
[2] A. H. Farrahi and J. Sarrafzadeh, "FPGA Technology Mapping for Power Minimization," Int. Workshop on Field-Programmable Logic and Applications, 1994, pp. 66-77.
[3] G. Holt and A. Tyagi, "Minimizing Interconnect Energy Through Integrated Low-Power Placement and Combinational Logic Synthesis," Proc. ISPD, Apr. 1997, pp. 48-53.
[4] S. Iman and M. Pedram, "POSE: Power Optimization and Synthesis Environment," Proceedings of the 33rd Design Automation Conference, June 1996, pp. 21-26.
[5] J. Lamoureux and S. J. E. Wilton, "On the Interaction between Power-Aware FPGA CAD Algorithms," Proc. ICCAD, Nov. 2003.
[6] C. Lennard, P. Buch, and A. Newton, "Logic Synthesis using Power-Sensitive Don't Care Sets," Proc. of Int. Symposium on Low Power Electronics and Design, 1990, pp. 293-296.
[7] H. Li, W.-K. Mak, and S. Katkoori, "LUT-Based FPGA Technology Mapping for Power Minimization with Optimal Depth," IEEE Computer Society Workshop on VLSI, 2001.
[8] P. Lindgren, M. Kerttu, M. Thornton, and R. Drechsler, "Low Power Optimization Technique for BDD Mapped Circuits," Proc. ASP-DAC, Jan. 2001, pp. 615-621.
[9] F. N. Najm, "Transition Density: A New Measure of Activity in Digital Circuits," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 12, no. 2, Feb. 1993, pp. 310-323.
[10] K. Poon, A. Yan, and S. Wilton, "A Flexible Power Model for FPGAs," International Conference on Field-Programmable Logic and Applications, Sep. 2002.
[11] S. Roy, A. Harm, and P. Banerjee, "PowerShake: A Low Power Driven Clustering and Factoring Methodology for Boolean Expressions," Proceedings of Design, Automation and Test in Europe Conference, Feb. 1998.
[12] E. M. Sentovich et al., "SIS: A System for Sequential Circuit Synthesis," UC Berkeley, Memorandum No. UCB/ERL M92/41, Electronics Research Laboratory, May 1992.
[13] Z.-H. Wang, E.-C. Liu, J. Lai, and T.-C. Wang, "Power Minimization in LUT-Based FPGA Technology Mapping," Proc. ASP-DAC, Jan. 2001, pp. 635-640.
[14] C. Yang and M. Ciesielski, "BDS: A BDD-Based Logic Optimization System," IEEE Trans. on Comp.-Aided Design of Integrated Circuits and Sys., vol. 21, no. 7, July 2002.
[15] N. Vemuri, P. Kalla, and R. Tessier, "BDD-based Logic Synthesis for LUT-Based FPGAs," ACM Transactions on Design Automation of Electronic Systems, vol. 7, no. 4, Oct. 2002, pp. 501-52.