Dynamic Power Consumption in Virtex-II FPGA Family

Li Shang
Princeton University, EE Dept., Princeton, NJ 08540
lshang@ee.princeton.edu

Alireza S Kaviani
Xilinx Inc., 2100 Logic Dr., San Jose, CA 95124
alireza.kaviani@xilinx.com

Kusuma Bathala
Xilinx Inc., 2100 Logic Dr., San Jose, CA 95124
kusuma.bathala@xilinx.com

ABSTRACT
This paper analyzes the dynamic power consumption in the fabric of Field Programmable Gate Arrays (FPGAs) by taking advantage of both simulation and measurement. Our target device is the Xilinx Virtex-II family, which contains the most recent and largest programmable fabric. We identify important resources in the FPGA architecture and obtain their utilization, using a large set of real designs. Then, using a number of representative case studies, we calculate the switching activity corresponding to each resource. Finally, we combine the effective capacitance of each resource with its utilization and switching activity to estimate its share of power consumption. According to our results, the power dissipation shares of routing, logic, and clocking resources are 60%, 16%, and 14%, respectively. We also conclude that the dynamic power dissipation of a Virtex-II CLB is 5.9 µW per MHz for typical designs, but it may vary significantly depending on the switching activity.

1. INTRODUCTION
Recent advances in semiconductor process technology have led to rapid scaling of transistor dimensions, allowing a large number of them to be packed on the same chip. Field Programmable Devices (FPDs), which consume a higher number of transistors than their alternative, Application Specific Integrated Circuits (ASICs), have also enjoyed rapid growth due to these technology advancements. The high density of transistors on the same chip has made power consumption one of the major challenges of deep submicron IC design [1]. Traditionally, FPD power consumption has been less of a concern than speed and area efficiency. However, it is likely that large FPDs at the leading edge of CMOS design will soon face tough challenges regarding power consumption.
Large FPDs, which are often called Field Programmable Gate Arrays (FPGAs), consist of a set of logic blocks and a flexible routing structure to connect them together. Using automated CAD tools, designers may program the logic blocks and their corresponding interconnect to implement any desired application within a reasonable amount of time. Such flexibility and fast time to market, however, come at the expense of additional transistors and metal resources that are only partially utilized. Therefore, to analyze the dynamic power consumption, we need to identify the utilized logic and routing resources that contribute to a signal. Our analysis and results in this paper can be used in two ways:
1) A better understanding of where power is consumed in FPGAs will help the design of future power-efficient FPGAs.
2) A detailed understanding of the power consumption distribution will help expert designers reduce or control the power characteristics of their designs.

The remainder of this section goes over the related work. Section 2 outlines the necessary background, including an introduction to Virtex-II and the various types of power consumption, to narrow down the focus of this paper. Section 3 explains the role of effective capacitance, followed by Section 4, which explains our methodology for power estimation. In Section 5 we summarize the results and provide an overview of power distribution in the FPGA fabric. Finally, in Section 6 we discuss the results to achieve a better understanding and propose future work.

1.1. Related Work
Analysis and estimation of the power dissipation of large FPGAs have received little attention compared to that of standard cell ASICs, which have been extensively studied in the literature. Previous works [2,3] have analyzed the power distribution of the Xilinx 4000 family, determining the distribution of various resources in the device. Both works have also suggested architectural improvements based on reducing the power supply to make FPGAs power-efficient. A recent work [4] characterizes the power dissipation of the FPGA interconnection using Manhattan distances between logic blocks. This work uses an iterative approach to update the signal values, but the iteration process does not always converge. The work in [4] also considers the 4000 family, but it does not capture the power distribution of internal resources.

Our work is distinguished from the above in several aspects. Unlike previous work, we consider state-of-the-art FPGA fabric, which already uses advanced process technology with a reduced power supply. Therefore, relevant architectural suggestions in those works need to be revisited based on the new data. In addition, our results are more accurate due to our access to the detailed schematics of the FPGA circuits. Finally, we have based our methodology and results on real large designs, as opposed to previous works that used smaller circuits, resulting in somewhat different results.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. FPGA 02, February 24-26, 2002, Monterey, California, USA. Copyright 2002 ACM 1-58113-452-5/02/0002 $5.00.

2. BACKGROUND
In this section, we first introduce and justify our choice of target device. Then, we explain the various components of power consumption. At the end, we describe the focus of this paper and introduce our methodology.

2.1. Virtex-II FPGA
The largest FPGA device recently introduced to the market is in the Virtex-II family. Virtex-II uses a 0.15-micron process with eight layers of metal at a 1.5 V power supply. In addition to advancements in its process technology, Virtex-II is the first Xilinx FPGA with fully buffered interconnect, which may be considered a turning point in its routing architecture. Figure 1 shows the 2v40, which is the smallest member of the Virtex-II family. As shown in the figure, Virtex-II includes a number of hard cores, including memory blocks, IO blocks, digital clock managers, encryption circuitry, and custom multipliers. However, most of the silicon area in the largest members of the family is consumed by what is referred to as programmable fabric.

While we admit to the importance of the power consumption of hard cores on the FPGA, we focus on the fabric for the following reason. Power and performance of FPGAs are often compared to their standard cell ASIC counterparts, which use less silicon area for realizing the same functionality. The power inefficiency of FPGAs is often attributed to the programmable fabric, which trades additional silicon area for flexibility. The hard cores in FPGAs are expected to perform as well as their ASIC equivalents, or in some cases even better due to their custom design. Since we would like to identify the source of power inefficiency in FPGAs, we only consider the programmable fabric.

Virtex-II fabric consists of Configurable Logic Blocks (CLBs), which are connected using a rich set of routing resources. Each Virtex-II CLB contains four slices, where each slice consists of two 4-input Lookup Tables (LUTs), two Flip-flops (FFs), and a variety of dedicated circuitry to accommodate more efficient implementation of some specific logic.
Virtex-II uses a segmented routing structure to minimize the number of transistors and wires that a signal needs to traverse to reach its destination.

Figure 1. Virtex-II platform FPGA (labeled blocks: Block RAM, Multiplier, Fabric, I/O, Digital Clock Manager).

The segmented routing architecture includes wires that travel two CLBs (called Doubles), six CLBs (called Hexes), and the length of the chip (called Longs), in both vertical and horizontal dimensions. There are also pass transistors and buffers associated with each set of wires. For example, when we refer to a switch for its power consumption, we are considering both the wire and its supporting transistors. There are also two sets of switches to connect the wire segments to the inputs and outputs of each CLB; we refer to these sets as the Input Crossbar and the Output Crossbar. The CLB slices are also referred to as logic, and the above five sets of switches comprise the interconnect. In addition to logic and interconnect, we will consider the global resources and switches that accommodate the clocking for the circuits. More detailed information regarding the Virtex-II architecture can be found in [5].

2.2. Power Consumption
There are two types of power consumption in FPGAs: static and dynamic. In CMOS logic, which includes SRAM-based FPGAs, leakage current is the only source of static power dissipation. There are two major sources of leakage current: 1) reverse-biased PN-junction current, and 2) subthreshold channel conduction. Both components have similar characteristics, such as a high dependency on temperature, process variation, and the logic states of the circuit. Leakage current has often been ignored in the past because it was negligible, but this is likely to change with the scaling of transistor dimensions. Scaling often comes with a reduction in power supply voltage (Vdd), and a lower Vdd reduces the speed. To maintain or increase the speed, we need to reduce the threshold voltage (Vth) of the transistor along with the scaling.
However, the subthreshold channel current of a transistor increases exponentially with any decrease in Vth, leading to a rapid increase in static power consumption. We believe in the importance of static power dissipation in future FPGAs, and have analyzed it elsewhere. In this paper, however, we focus on the dynamic components of power dissipation. According to our empirical results, the static power is between 5% and 20% of the total power dissipation in Virtex-II, depending on the temperature, device, running frequency, and the design.

2.2.1. Dynamic power consumption
Dynamic power dissipation is caused by signal transitions in the circuit. A higher operating frequency leads to more frequent signal transitions and results in increased power dissipation. The most significant source of dynamic power consumption in CMOS circuits is the charging and discharging of capacitance. This can be modeled as

$$P = C V^2 f \qquad (1)$$

where C, V, and f are the capacitance, voltage swing, and operating frequency of the resource, respectively [6].

Another component of dynamic power dissipation, also caused by signal switching, is short-circuit power. According to our simulations, short-circuit current in FPGA interconnect is less than 10% of the total. This is consistent with the literature [6] because interconnect short-circuit power dissipation is mostly caused by the switching of inverters in the buffers. However, short-circuit power in the logic inside the CLB slice is a higher percentage of its total power. For the sake of simplicity, we emulate short-circuit power with an additional capacitance.
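As an illustration of Equation (1), the following minimal sketch evaluates the dynamic power of a single switched capacitance at several clock frequencies. The 1.5 V supply is the Virtex-II value quoted in Section 2.1; the 10 pF capacitance and the chosen frequencies are hypothetical values picked only for the example, and the function name is not part of any tool.

```python
def dynamic_power_watts(capacitance_f: float, voltage_v: float, frequency_hz: float) -> float:
    """Equation (1): P = C * V^2 * f, with all quantities in SI units."""
    return capacitance_f * voltage_v ** 2 * frequency_hz


if __name__ == "__main__":
    C_EXAMPLE_F = 10e-12  # hypothetical 10 pF load (assumption, not a measured value)
    VDD_V = 1.5           # Virtex-II core supply voltage from Section 2.1

    # Power grows linearly with frequency for a fixed capacitance and voltage.
    for f_mhz in (50, 100, 200):  # illustrative frequencies (assumption)
        p_w = dynamic_power_watts(C_EXAMPLE_F, VDD_V, f_mhz * 1e6)
        print(f"{f_mhz:>4} MHz -> {p_w * 1e3:.3f} mW")
```

Because the model is linear in f, doubling the clock frequency doubles the estimated power, which is why results in this paper can be quoted per MHz.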

To calculate total power dissipation we consider three factors. First, we define effective capacitance as the sum of parasitic effects due to interconnection wires and transistors, and the emulated capacitance due to short-circuit current. We obtain this capacitance for each resource as explained in the next section. The second important factor is the resource utilization. In typical FPGA designs, the majority of the resources are not used after configuration and thus do not consume any dynamic power. Since the resource utilization varies with the design, we consider a large set of real circuits to obtain statistically valid results. The third factor in determining power dissipation is the switching activity, which is defined as the number of signal transitions in a clock period. For example, a clock signal has a switching activity of two. The switching activity for each resource also requires a statistical representation, because it depends not only on the type of design, but also on the input stimuli. We explain our methodology to obtain statistical representations for resource utilization and switching activity in Section 4, and present the corresponding results in Section 5.

3. EFFECTIVE CAPACITANCE
We obtain the effective capacitance of each resource using two methods: measurement and SPICE simulation. Using two sources of data helped us verify our results and improve their accuracy.

3.1. Measurement
In order to measure the effective capacitance of each resource, we first implement a simple reference circuit in the FPGA and measure its power. Then the target resource is added to the reference design and the power is measured again. The difference between the first and the second power measurement determines the power dissipation of the target resource. To improve the accuracy, both the reference design and the target resource are replicated to fill up the device in both measurements. We used a 2V1000, with 5120 slices (10240 LUTs), for all our measurements.
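The differential measurement described above can be sketched as follows. This is only an illustration under stated assumptions: every power reading, the replica count, and the frequency are hypothetical numbers, and the sketch assumes each replica of the target resource is exercised so that Equation (1), P = C V^2 f, applies to it directly.

```python
def effective_capacitance_f(
    p_reference_w: float,
    p_with_target_w: float,
    replicas: int,
    voltage_v: float,
    frequency_hz: float,
) -> float:
    """Per-replica effective capacitance from two power measurements.

    The power attributed to the target resource is the difference between
    the measurement with the target added and the reference measurement.
    Dividing by replicas * V^2 * f inverts P = C * V^2 * f for one replica.
    """
    if replicas <= 0 or voltage_v <= 0 or frequency_hz <= 0:
        raise ValueError("replicas, voltage, and frequency must be positive")
    delta_p_w = p_with_target_w - p_reference_w
    return delta_p_w / (replicas * voltage_v ** 2 * frequency_hz)


if __name__ == "__main__":
    # All values below are hypothetical and chosen only to exercise the formula.
    c_eff = effective_capacitance_f(
        p_reference_w=0.300,    # reference design alone
        p_with_target_w=0.525,  # reference design plus replicated target resource
        replicas=1000,
        voltage_v=1.5,          # Virtex-II core supply from Section 2.1
        frequency_hz=50e6,
    )
    print(f"effective capacitance per replica: {c_eff * 1e12:.2f} pF")
```

Replicating the resource many times, as the text describes, makes the power difference large relative to measurement noise before it is divided back down to a per-resource value.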
For further verification we have used several frequencies for each resource; the linear change of our power measurements with respect to frequency ensures the correctness of our results. In addition, the power supplies of the FPGA core, FPGA I/O, and the testing environment are isolated to minimize the possibility of errors. The important FPGA resources that are used for measurement are described in Subsection 2.1.

3.2. Transistor-level Simulation
Since we have access to the schematics of the internal circuits in Virtex-II, transistor-level simulation is also used to identify the effective capacitance of resources. The primary reason for using simulation is that there are some resources whose capacitance cannot be isolated for measurement. In addition, using simulation will enable us to investigate future architectures for the purpose of power efficiency. Simulation also helps us improve the accuracy of our results. In many cases, we repeated either our measurements or our simulation until the results were consistent. We isolate the circuits corresponding to each resource, generate the netlist using the Cadence tool, and use HSPICE to simulate the netlist. Although our netlists are extracted before the layout, we examined the layout in each case to add the correct metal and transistor loading to our circuits. This further enables us to reuse our circuits for investigation of other architectures, as opposed to using a post-layout netlist.

3.3. Ceff Results
Table 1 summarizes the effective capacitance for the major resources in the Virtex-II device. The capacitance data in this table is characterized based on device 2v1000FG256-5. Our choice of target was solely due to its availability. The capacitance of all resources except Long and Clocking is the same for other members of the Virtex-II family. The capacitance of long lines and the global clock tree varies according to the width and height of the device. In our estimations for Long and Clocking, we linearly extend the wire capacitance based on the device.

Table 1. Effective capacitance summary

Type                    Resource          Capacitance (pF)
Interconnect per CLB    Input Crossbar    9.44
                        Output Crossbar   5.12
                        Double            13.20
                        Hex               18.40
                        Long              26.10
Logic per CLB           LUT inputs        26.40
                        FF inputs         2.88
                        Carry             2.68
Clocking                Global wiring     300
                        Local             0.72

Each of the interconnect crossbars consists of a number of switches, where the switch capacitance can be further divided into that of the pass transistor network, the buffer, and the metal wire. For the Input and Output Crossbars the buffer capacitance is the dominant part, but for Double and Hex the wire capacitance exceeds the rest. Long switches drive a wire crossing the whole chip; therefore, the metal wire dominates the capacitance of Long. Note that the capacitance of switches in the same set varies slightly due to the layout. While we consider some of those variations in our final results, presenting their details is beyond the scope of this paper. For the same reason, we have presented the total capacitance of all the LUT inputs in the CLB as a lumped value. The LUT inputs can be classified into two groups, fast and slow, and their individual capacitance varies accordingly.

The resources in Table 1 are not only partially utilized, but also contribute to the final capacitance at various rates of occurrence depending on the design. Therefore, it is imperative to consider the resource utilization in order to obtain meaningful results. We do this in the next two sections.

4. ESTIMATION METHODOLOGY
In this section we explain how we augmented the Xilinx standard design flow to estimate the power share for each resource. The first step in the standard flow is synthesis, which creates a structural EDIF netlist. The Xilinx CAD tool then reads the EDIF netlist to map, place, and route the design in several steps. Finally, the resulting bitstream can be generated and configured into the FPGA. The routed design, which is available in an internal format (called NCD), can be used for power estimation as follows.

4.1. Resource Utilization Flow
While the routed NCD file contains all the routing information, including the utilized resources, it cannot be read directly. Therefore, we require another step in addition to the Xilinx standard flow, as shown on the left side of Figure 2. We use the Xilinx Design Language (XDL) utility in the Xilinx tools to convert the binary NCD file to text format. Then, using a number of Perl programs, we obtain the utilization for each resource in the routed designs. Our design set includes more than sixty real circuits of all sizes. We present the resource utilization of the largest ten circuits in Section 5.

4.2. Switching Activity Flow
The right side of Figure 2 shows the switching activity flow, which starts with the routed design, similar to the utilization flow. We back-annotate the routed NCD to generate a structural VHDL file that contains all the resources used in the design and their corresponding delays. Then, ModelSim is used to feed the design with input stimuli and perform a real-delay timing simulation. Finally, the result of the simulation is read into our Perl script along with the routed design in XDL format to obtain the statistical representation of the switching activity for each resource.

The switching activity depends on both the design and its input test vector. While we had access to a large set of real designs, realistic test vectors were not available. Therefore, in some cases, we applied random inputs to the design. However, this was not possible in most cases, and thus our design set for the calculation of switching activity was smaller than that for resource utilization. Nonetheless, we believe our results are conclusive, as discussed in the next section.

Figure 2. Power estimation flow (blocks: Xilinx design flow, routed circuit (NCD), transformation, routed circuit (XDL), resource utilization analysis, back annotation, timing model, input stimuli, real-delay timing simulation, switching activity analysis, effective capacitances, power distribution analysis).

Calculation of switching activity requires considering two types of elements in the design: nets and logic.
Nets often have one source and multiple destinations, and signals maintain their switching activity going through the nets. However, logic manipulation occurs in the LUTs, and occasionally in the other parts of the slice, and may change the switching activity. The switching activity of a LUT output depends on both the activity of its inputs and the configuration of the LUT, which determines its logic. It is possible to obtain the switching activity of LUT inputs and outputs from the simulation results, but we need to consider the LUT configuration to be able to identify the power dissipation at intermediate nodes inside the LUT.

In order to model the switching activity of LUTs, we apply a statistical estimation approach similar to that of [7]. First, we calculate the partial switching activity of the output due to its i-th input as

$$S_{p_i}(OUT) = P\!\left(\frac{\partial OUT}{\partial INP_i}\right) S(INP_i)$$

where $S(INP_i)$ is the switching activity of input signal i, and $P(\partial OUT / \partial INP_i)$ is the probability of the Boolean difference equation, which is determined by the LUT configuration. The value of $S_{p_i}(OUT)$ estimates the transitions of the intermediate output signal caused by the toggling of input signal $INP_i$. The total switching activity of the intermediate output signal is then calculated as

$$S(OUT) = \sum_i S_{p_i}(OUT) = \sum_i P\!\left(\frac{\partial OUT}{\partial INP_i}\right) S(INP_i)$$

over all its corresponding inputs. Note that in this approach we have assumed that the switching activities of the input signals are not correlated. This assumption introduces some error, but the analysis of the correlation of input signals is too complicated.

5. RESULTS
There are three factors that determine the overall results of dynamic power dissipation: effective capacitance, utilization, and switching activity. A summary of capacitance measurements and simulation is given in Section 3.3. This section presents the results for the second and third factors to obtain the overall results of power dissipation.

5.1. Utilization
Resource utilization strongly depends on the design. Although we have access to a large set of benchmarks, we observed that the resource utilization is dominated by a few large designs.
Therefore, we intentionally excluded the small designs to bias our results toward future devices. Our benchmark designs fit in 2v3000 to 2v6000 Virtex-II devices, and contain more than a quarter of a million LUTs. To ensure statistically valid results, we made sure all the circuits occupy most of their corresponding device. This is important because if a circuit were too small for the device, some of the global resources such as long wires would have lower than usual utilization, which would lead to inaccuracy.

Figure 3 presents the overall utilization of the various interconnect resources. The results in the figure depend on the quality of the tools as well as the timing and physical constraints of the design. It is recommended that the timing constraints for circuits be as tight as possible, even if the requested speed is not required. A tighter timing requirement directs the place and route tools to choose the resources with lower capacitance where possible, resulting in lower power consumption for the same frequency. The detailed resource utilization is combined with the capacitance measures to calculate the overall results in Section 5.3.

Figure 3. Interconnect resource utilization.

5.2. Switching Activity
Perhaps the most complicated part of accurate power estimation is calculating the average number of signal transitions corresponding to each resource. This is because the signal transitions, or switching activity, are determined by the input patterns that are applied to the design. It is a tedious task to obtain test vectors that fully emulate the real behavior of a design. Due to the flexibility of FPGAs, a large number of designers tend to drop their design into the device and test it. Therefore, they only partially simulate the design, which does not require a complete test vector. Nonetheless, we managed to obtain a complete set of test vectors for one of our large benchmark circuits. Our benchmark circuit occupies more than 90% of a 2V3000 device (with 14336 slices), and its number of flip-flops is as high as 85% of its number of LUTs. The designers have supplied the input stimuli, which in our opinion is the next best thing to having real inputs.

Figure 4 shows the switching activity results for all five types of resources in our test case. The horizontal axis in the figure represents the switching activity, as defined in Subsection 2.2, with an accuracy of 0.02. The vertical axis, which is logarithmic for the sake of clarity, represents the number of occurrences for switches in each set of resources.

Figure 4. Switching activity with real input stimuli (horizontal axis: switching activity, 0 to 1; vertical axis: occurrence, logarithmic; legend averages: 0.11, 0.15, 0.2, 0.13, 0.36).

The first observation based on Figure 4 is that all resources follow the same statistical behavior. This is intuitive because these resources are driving each other and routing resources do not change the switching activity. The numbers in parentheses next to a resource name in the figure are the switching activity averages.
The average switching activities of the five interconnect resources in the circuit are 0.11, 0.15, 0.2, 0.13, and 0.36. In the next subsection, we will use these averages as representatives of the switching activity to calculate the overall power dissipation. There are two dominant local peaks in the curves of Figure 4, at or around switching activities of 0 and 1. Zero activity for some signals is expected, and a switching activity of one corresponds to signals that change with the clock edge. Also, a switching activity of 50%, which is equivalent to the activity of a flip-flop whose output is fed back to its input, occurs frequently. Keep in mind that the clock signal has a switching activity of two, but it often uses dedicated resources instead of the routing resources shown. We account for the clock power consumption separately in our Perl program.

To investigate switching activity further, we also applied random input patterns to a set of designs to observe their behavior. Unfortunately, this restricts our choice of benchmark circuits, because the majority of our real circuits will not work with random input patterns. Our example circuits include FIR and FFT filters, DES encryption, and a circuit with a number of multipliers. These designs utilize 7790 slices (with 13276 LUTs) and 2483 flip-flops. Figure 5 summarizes the switching activity results when a new random input is supplied every five clock cycles. Since random inputs change with a probability of 50%, the input patterns represent a switching activity of 10% (or 0.1). The local peaks around a switching activity of 0.1 in Figure 5 confirm the dependency of the circuit switching activity on the activity of the input patterns.

One noticeable difference from the previous results is that the average switching activities for the various resources are close to each other. The averages for two of the resources in Figure 4 (0.2 and 0.36) are higher than those of the other resources. Closer examination shows that the high average is caused by a high occurrence of a switching activity of one, which is associated with a clock-enable signal. Therefore, a slight design modification is likely to reduce the average switching activity of these two resources in our real circuit.
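The relation between the input update period and the input switching activity quoted above (a new random value every five clock cycles gives an activity of about 0.1) can be checked with a small simulation. This is a minimal sketch, not the flow used in the paper: the function names, the cycle count, and the fixed seed are assumptions made for the example, and it models a single input bit that is redrawn uniformly at random every `hold_cycles` clock cycles.

```python
import random


def expected_input_activity(hold_cycles: int) -> float:
    """A redrawn random bit flips with probability 0.5, once per hold period."""
    return 0.5 / hold_cycles


def simulated_input_activity(hold_cycles: int, n_cycles: int = 100_000, seed: int = 0) -> float:
    """Count transitions per clock cycle of one randomly driven input bit."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    bit = rng.getrandbits(1)
    transitions = 0
    for cycle in range(1, n_cycles + 1):
        if cycle % hold_cycles == 0:  # a new random value arrives
            new_bit = rng.getrandbits(1)
            if new_bit != bit:
                transitions += 1
            bit = new_bit
    return transitions / n_cycles


if __name__ == "__main__":
    for hold in (5, 1):
        est = simulated_input_activity(hold)
        print(f"hold={hold}: simulated {est:.3f}, expected {expected_input_activity(hold):.3f}")
```

With a hold of five cycles the simulated value should land close to 0.1, and with a hold of one cycle close to 0.5, matching the 50% change probability divided by the update period.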
Figure 5. Switching activity with random test vectors. (Vertical axis: occurrence, logarithmic; horizontal axis: switching activity, 0 to 1; per-resource averages: 0.12, 0.11, 0.12, 0.10, 0.11.)
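The input activity quoted for the random-vector experiment follows from a simple expectation: a bit that is redrawn at random changes with probability 50%, so holding each input for five cycles gives 0.5 / 5 = 0.1 transitions per cycle. The Python sketch below illustrates this under the assumption that switching activity is counted as transitions per clock cycle; the function names and the Monte Carlo setup are illustrative and are not taken from the authors' tools.

```python
import random


def expected_input_activity(period, p_change=0.5):
    """Expected transitions per clock cycle for a bit that is redrawn
    every `period` cycles and differs from its previous value with
    probability `p_change`."""
    return p_change / period


def simulated_input_activity(period, cycles=100_000, seed=0):
    """Monte Carlo estimate of the same quantity for a single input bit."""
    rng = random.Random(seed)
    value = rng.getrandbits(1)
    transitions = 0
    for cycle in range(1, cycles + 1):
        if cycle % period == 0:  # a new random input arrives
            new_value = rng.getrandbits(1)
            transitions += new_value != value
            value = new_value
    return transitions / cycles


print(expected_input_activity(5))   # new input every five cycles
print(expected_input_activity(1))   # new input every cycle
print(simulated_input_activity(5))  # should land close to the first value
```

Calling `expected_input_activity(1)` covers the case of a new random input on every clock cycle, which corresponds to an input activity of 0.5.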

A final experiment shows the switching activity for the same four benchmark circuits when random inputs are supplied every clock cycle. This corresponds to 50% switching activity of the input patterns. Figure 6 shows the new curves and averages when high activity is enforced by random inputs. The dominant peak moves to 0.5 from the 0.1 in Figure 5. The average activities, however, have not increased linearly, due to the existence of a large number of nets with zero activity in both cases.

An interesting observation from Figure 6 is that the switching activity for some signals is higher than one. In synchronous circuits, a switching activity higher than one may be attributed to spurious transitions. These unwanted transitions, often called glitches, dissipate additional dynamic power. Glitching occurs when the inputs to a LUT arrive at different times, so that the LUT output makes multiple transitions in a single clock cycle before settling to the correct logic level. Glitching of static CMOS circuits in ASICs has been studied previously, with the conclusion that it may contribute 20% to 70% of power dissipation [8]. At this point we have no reason to believe that FPGAs differ from their ASIC alternatives with respect to glitches. However, glitches increase with the depth of combinational logic, and flip-flops at the right places will significantly reduce glitching. Since there is an abundance of flip-flops in FPGAs, techniques such as pipelining or retiming can be used to reduce glitches without a significant area or power penalty.

Figure 6 also shows the average switching activity for each resource in parentheses. The highest activity for the examined circuits is 4.26, but it is not shown in the figure for the sake of clarity. If we consider switching activity higher than one as glitching, the power dissipation due to glitches in four of the resource types is 10%, 11%, 9%, and 18% of the total power dissipation of that resource, respectively. Glitching power in the real circuit (Figure 4) is less than 1%, and thus negligible. We believe the glitching in Figure 6 is higher than that of the real circuit for two main reasons.
First, the number of flip-flops in our real circuit is higher than in our examples, and second, typical circuit input stimuli do not cause as many glitches as random inputs.

5.3. Overall Results

So far, we have discussed all three important factors that are required for dynamic power dissipation. Equation (1) in Subsection 2.2.1 can be rewritten as

$$P = \frac{1}{2} V^2 f \sum_i C_i U_i S_i,$$

where $V$ is the supply voltage, $f$ is the operating frequency, and $C_i$, $U_i$, and $S_i$ are the effective capacitance, utilization, and switching activity of each resource, respectively. The ½ factor is a result of the way we define switching activity. Using this new equation, we first obtain the power distribution for our real circuit, whose switching activity is shown in Figure 4. Overall results for this benchmark circuit are presented in Figure 7. The switching activity for logic is calculated as explained in Subsection 4.2, and the switching activity for clocking resources is two. This design uses flip-flops heavily, and as a result the power consumption of the clocking resources is as high as 22%.

A single benchmark circuit is not representative, and we have already obtained statistically valid utilization results for a number of circuits. Therefore, we present three sets of results in Figure 8, using all our real circuits but with the switching activity averages obtained from Figures 4, 5, and 6.

Although our focus in this paper is the FPGA fabric, we would also like to ensure that the power dissipation of Input-Output Blocks (IOBs) is not the dominant part. Virtex-II FPGAs can be configured for several IOB standards, and their detailed investigation is beyond the scope of this paper. We only consider the LVTTL standard with fast slew rate and a drive strength of 12 mA, and measure the power dissipation of each IOB running at a known frequency. We add 10 pF for the package- and board-dependent capacitance associated with each output buffer. The IOB power supplies are both 1.5 V and 3.3 V, depending on the part and type of the block.
We also consider the toggle rate of IOBs as the activity average of all routing resources, which is reflected in the switching activity.

Figure 6. High switching activity (random inputs). (Vertical axis: occurrence, logarithmic; horizontal axis: switching activity, 0 to 1.5; per-resource averages: 0.46, 0.38, 0.38, 0.40, 0.32.)

Figure 7. Power distribution for a real circuit. (Logic 26%, Clocking 22%, Interconnect 52%; interconnect sub-shares: 5%, 13%, 11%, 11%, 12%.)
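The power equation of Subsection 5.3 can be evaluated resource by resource to produce a percentage breakdown of the kind shown in Figure 7. The following Python sketch is only an illustration: the resource names, capacitances, and utilization counts are made-up placeholders rather than values measured in this work, and treating utilization as a count of used instances is an assumption. Only the formula, P = ½ V² f Σ C·U·S, and the clock activity of two come from the text.

```python
def dynamic_power(resources, voltage, frequency):
    """Return (total_watts, {name: watts}) using
    P = 1/2 * V^2 * f * sum(C * U * S) over all resources.

    `resources` maps a name to a tuple (C, U, S):
      C: effective capacitance per used instance, in farads
      U: number of instances used (assumed meaning of utilization)
      S: switching activity, in transitions per clock cycle
    """
    scale = 0.5 * voltage ** 2 * frequency
    per_resource = {name: scale * c * u * s
                    for name, (c, u, s) in resources.items()}
    return sum(per_resource.values()), per_resource


# Hypothetical inputs for illustration only (not data from the paper).
example = {
    "logic":        (10e-15, 20000, 0.15),
    "interconnect": (25e-15, 30000, 0.15),
    "clocking":     (15e-15, 5000, 2.0),  # the clock switches twice per cycle
}

total, parts = dynamic_power(example, voltage=1.5, frequency=100e6)
for name, watts in parts.items():
    share = 100 * watts / total
    print(f"{name:12s} {1e3 * watts:8.3f} mW {share:6.1f}%")
print(f"{'total':12s} {1e3 * total:8.3f} mW")
```

Because the voltage and frequency enter only through a common scale factor, the percentage shares printed by this sketch do not change when either one is varied; only the absolute power does.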

Figure 8. Power dissipation distribution.
Part (a), switching activity from Figure 4: Logic 12%, Clocking 10%, IOB 7%, Interconnect 71% (interconnect sub-shares: 19%, 3%, 9%, 17%, 23%).
Part (b), switching activity from Figure 5: Logic 16%, Clocking 14%, IOB 10%, Interconnect 60% (interconnect sub-shares: 18%, 4%, 19%, 12%, 7%).
Part (c), switching activity from Figure 6: Logic 21%, Clocking 4%, IOB 12%, Interconnect 63% (interconnect sub-shares: 4%, 15%, 6%, 20%, 18%).

According to Figure 8, most of the power dissipation in the FPGA fabric occurs in the interconnect resources. Part (b) of the figure also helps us observe the effects of utilization and effective capacitance, because there the switching activity of the resources is roughly the same. Keep in mind that the switching activity of the clock is two in all parts of the figure. The high wire capacitance of some routing resources accounts for most of their power dissipation, while high utilization is the main cause of the large share of others in the pie charts. It is clear from our results that designs should take advantage of locality as much as possible to reduce the power consumption for the same functionality. Both careful design and improved CAD tools can achieve this goal.

The above results are independent of the supply voltage and the operating frequency. We also estimated the total power dissipation of all the circuits at 100 MHz and a supply voltage of 1.5 V to obtain the average consumption of a Virtex-II CLB. According to our results, one CLB consumes approximately 5.9 µW per MHz. This number is a good measure of CLB power dissipation in typical designs, but high switching activity can significantly raise the CLB power dissipation. For example, a switching activity of 50% (one new random input in every clock cycle) would cause CLB power consumption as high as 23 µW per MHz. On the other hand, a switching activity of 5% would reduce the CLB power dissipation to 3.1 µW per MHz. Uncertainty management and a better understanding of switching activity are important potential future work.

6. CONCLUDING REMARKS

The power dissipation of semiconductor devices is rapidly growing to be a major concern as device sizes increase.
FPGA devices contain the largest number of transistors on the same chip, but most of those transistors do not dissipate dynamic power. However, the rapid growth of FPGAs will soon leave them on the hot plate of power-hungry devices. In this paper we thoroughly analyzed the dynamic power dissipation in Virtex-II, which is the most recent and the largest FPGA product. We pinpointed three important factors that contribute to total power dissipation: effective capacitance, resource utilization, and switching activity. We investigated these three factors in detail and concluded by presenting the distribution of total power dissipation for a number of real circuits. Our results are somewhat different from previous work, mostly because we used larger circuits and state-of-the-art FPGAs.

The results of our work can be extended to investigate various techniques for reducing power consumption. There are three possible avenues that can help reduce power dissipation: design changes, architectural modifications, and CAD tool improvements. All three approaches require the results of our work as the first step. We intend to continue this work toward the goal of reducing power using any of these three techniques. Finally, we plan to obtain other realistic test vectors to expand our understanding of switching activity.

7. ACKNOWLEDGEMENTS

We would like to thank Nabeel Shirazi and Suresh Sivasubramaniam from Xilinx for their help. Comments from other Xilinx employees also improved the quality of this work. The opinions expressed by the authors are theirs alone; they do not represent the opinions of Xilinx and are not an indication of any future policy on FPGA software or hardware held by Xilinx.

8. REFERENCES

[1] D. Sylvester and H. Kaul, "Future Performance Challenges in Nanometer Design," Design Automation Conference, pp. 3-8, June 2001.
[2] E. A. Kusse and J. Rabaey, "Low-energy embedded FPGA structures," Int. Symp. on Low Power Electronics & Design, pp. 155-160, Aug. 1998.
[3] A. Gracia, "Power consumption and optimization in field programmable gate arrays," Ph.D. thesis, Département Communications et Électronique, Ecole Nationale Supérieure des Télécommunications, 2000.
[4] T. Osmulski et al., "A probabilistic power prediction tool for the Xilinx 4000-series FPGA," in Proc. 5th Int. Wksp. Embedded/Distributed HPC Systems and Applications, pp. 776-783, May 2000.
[5] Xilinx Inc., Virtex-II Platform FPGA Handbook, 2000.
[6] G. Yeap, Practical Low Power Digital VLSI Design, Kluwer Academic Publishers, 1998.
[7] S. Gupta and F. N. Najm, "Analytical models for RTL power estimation of combinational and sequential circuits," IEEE Trans. on Computer-Aided Design, vol. 19, no. 7, pp. 808-814, July 2000.
[8] A. Shen et al., "On average power dissipation and random pattern testability of CMOS combinational logic networks," IEEE ICCAD, pp. 402-407, 1992.