Block-wise Extraction of Rent's Exponents for an Extensible Processor

Tapani Ahonen, Tero Nurmi, Jari Nurmi, and Jouni Isoaho
Tampere University of Technology, and University of Turku, Finland
tapani.ahonen@tut.fi, tnurmi@utu.fi, jari.nurmi@tut.fi, jisoaho@utu.fi

Abstract

It is envisioned that future system-on-chip hardware platform designs will be based on reuse of a customizable processor core. Consequently, being able to quickly evaluate the key performance metrics associated with specific points in the design space becomes essential. This paper describes the development of an early design phase performance estimation method for logic blocks of an extensible processor core. The processor blocks were systematically synthesized with varying constraints for reference, and the corresponding Rent's exponents were extracted from the results. The impact of synthesis-originated design space discontinuities on the accuracy of physical performance estimation was evaluated by applying linear regression to the resulting design points.

1. Introduction

With growing complexity, System-on-Chip (SoC) developers have to resort to platform-based design. One of the key components in SoC hardware platform design is a customizable processor core. System-level design methodologies often utilize logic synthesis of reusable blocks. Since synthesis can drastically alter the physical-level design, this design approach makes it difficult to predict physical-level performance metrics such as clock frequency, power dissipation, and device area. It therefore becomes essential to be able to quickly evaluate performance metrics at different design points while recognizing the design space limits of the block at hand. Accurate knowledge of the actual design, architecture, and technology parameters is needed to realize the benefits of early design phase performance estimation models. Inaccurate parameters are likely to cause significant errors in the estimation results.
The existing processor performance estimation models yield good results if used for certain types of designs with similar architecture, logic style, and technology generation. These models include SUSPENS by Bakoglu [1], the models of Sai-Halasz and Mii [2][3], BACPAC [4], and RIPE [5]. They are commonly based on a globally defined clock cycle time and a random logic representation of sub-blocks. Average on-chip wire length is estimated using Donath's wiring statistics [6], and organization is described by the well-known Rent's rule [7] using only one global Rent's exponent. The rule states that if a block contains N_gates logic gates, has #I/O input and output pins/connections between the block and its environment, and an individual gate/instance inside the block has on average K_p I/O connections, then these variables are linked by the following equation [7]:

$$\#\mathrm{I/O} = K_p \, N_{\mathrm{gates}}^{\,p} \qquad (1)$$

where K_p is Rent's constant and p is Rent's exponent, which is derived experimentally. If a logic synthesis tool is used to generate varying block implementations, a common Rent's exponent fails to accurately predict performance characteristics. To solve this problem we have extracted Rent's exponent as a function of delay separately for each block, and developed a method for estimating block performance characteristics when the delay varies. The analysis was made for XIRISC [8], an extensible reduced instruction set computer (RISC) developed at the University of Bologna.

2. The Extensible RISC Core

Hardware platforms for SoC development need to be customizable through either a reconfigurable interconnection scheme or programmable functionality. The major shortcomings of reconfigurable interconnections are unpredictable delays caused by variable-length interconnections, high implementation costs due to inevitably complex communications, slow reconfiguration, and, possibly, the need to adopt new data transport oriented software languages. Approaching SoC platform design from the programmable functionality point of view pushes complexity towards less expensive implementation features and allows use of legacy software.

Customizable functionality can be achieved at low cost by using an extensible RISC processor core like XIRISC as a basic building block. Extensible means that the instruction set architecture (ISA) is easily modifiable for specific needs, so the hardware resources can be selected accordingly. XIRISC is provided as an open source synthesizable Very High Speed Integrated Circuit Hardware Description Language (VHDL) description accompanied by a GNU Compiler Collection (GCC) based C/C++ software development environment. Modifications to the ISA and the word widths used can be easily made through single VHDL files. XIRISC is based on the Harvard architecture with five pipeline stages, as depicted in figure 1. The Harvard architecture is a processor architecture that has separate buses for instruction and data memories. In addition to the traditional single data path implementation, a double data path Very Long Instruction Word (VLIW) implementation is supported.

[Figure 1. An abstract view of the XIRISC architecture. Pipeline stages: fetch, decode, execute, memory access, write back. Blocks: program memory, control logic and system control coprocessor, register file, ALU, shifter, multiplier, MAC, data memory.]

The top-level block structure of the XIRISC core consists of roughly seven parts. The System Control Coprocessor (SCC) is an entity responsible for handling exceptions and interrupts, maintaining processor state, and controlling context switches. The remaining control logic consists of address generation, program counter (PC) value calculation, instruction decoding, and hazard handling. Data memory is accessed in a load-store fashion using the register file (RF) to store the operands and results of operations. Selectable Functional Units (FUs) include the Arithmetic Logic Unit (ALU), shifter, multiplier, Multiply and ACcumulate (MAC) logic, and even a divider, which is not considered in this study. The multiplier can be implemented either as a single-cycle device producing results of the same width as the operands, or as a two-cycle device producing full precision results (double width compared to the operands).
This two-cycle device implements the multiply part of the MAC logic.

Performance of XIRISC can be significantly affected by modifications to the ISA, data word width, or block-level architecture. Because the core is described at a high abstraction level, the constraints given to the synthesis program also have a dramatic effect. A typical 32-bit implementation of the XIRISC core on a 0.18 µm CMOS process occupies a silicon area of less than half a square millimeter, dissipates approximately 0.4 watts of power, and operates at or above 100 MHz.

The scope of this study had to be tightly restricted due to the extensiveness of the XIRISC design space. To cover as wide a design space as possible, the focus was on the block-level architecture and a simple system-level case study. When the extensibility of XIRISC is exploited and modifications to the ISA are made, the set of needed FUs may change while the design space of a specific FU may not. Of course, modifying the ISA results in altered performance metrics for the control logic, and may affect the SCC. The scope was further refined to block-level design space exploration of 32-bit implementations emphasizing timing constraints. The standard ISA of the XIRISC distribution was not modified for this study.

3. Rent's Exponent Extraction Methodology

To predict system-level performance metrics early in the design cycle without executing many time-consuming synthesis runs, we have developed a method that links synthesis runs with given timing constraints to an early estimation method. The estimation method has only a few organizational parameters, with which the performance metrics of a processor can be estimated accurately enough in the early phase of design without knowing too many details about the design itself. The key component in early estimation analysis is Rent's rule [7], presented in the first section. In this paper we have used the organizational parameters Rent's constant (K_p) and Rent's exponent (p).
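Solving Rent's rule (1) for the exponent gives p = ln(#I/O / K_p) / ln(N_gates). The following is a minimal Python sketch of that inversion. The function names and the example numbers are illustrative assumptions and are not taken from the synthesis reports.

```python
import math


def rent_io(k_p: float, n_gates: float, p: float) -> float:
    """Rent's rule (1): block I/O count predicted from the gate count."""
    return k_p * n_gates ** p


def rent_exponent(n_io: float, k_p: float, n_gates: float) -> float:
    """Invert Rent's rule (1): p = ln(#I/O / K_p) / ln(N_gates)."""
    if n_io <= 0 or k_p <= 0 or n_gates <= 1:
        raise ValueError("need n_io > 0, k_p > 0 and n_gates > 1")
    return math.log(n_io / k_p) / math.log(n_gates)


# Hypothetical block: 100 I/O pins, 3 connections per average gate, 1000 gates.
p = rent_exponent(n_io=100, k_p=3.0, n_gates=1000)
print(f"Rent's exponent p = {p:.4f}")
```

Because p appears as an exponent of the gate count, even a small change in p shifts the predicted I/O count noticeably for large blocks.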
A systematic workflow was used to derive Rent's exponent p and a regression curve for the exponent as a function of delay. First, one block was taken under examination. Using (1), Rent's exponent values corresponding to varying delay figures were calculated. Values for Rent's constant K_p (the average number of gate I/O) and the number of gates N_gates are extracted from synthesis reports. The number of block I/O pins/connections (#I/O) is defined in the specification of the block. In this way we get the various values of p for the block under examination. This result is still inadequate if we need to change the delay figure more freely. Assume that we want to use a delay constraint that does not match any of the reference points but lies somewhere in between the synthesized values. Then we have to plot Rent's exponent p as a function of delay and perform a regression analysis on the curve. We decided to use linear regression for curve fitting, although some of the curves appear to follow a polynomial rather than a linear expression. The regression process is explained in more detail in section 5.

4. Block-Level Design Space Exploration

Before the actual design space exploration of individual blocks, a technology exploration was carried out. The focus was on the deviation of performance metrics with changes in the operating environment. Environment parameters include the actual operating voltage and chip temperature. For a best-case environment, the assumed operating voltage is higher than nominal and the chip temperature is at or below freezing. For a worst-case environment, the assumed operating voltage is lower than nominal and the chip temperature is somewhere around boiling point. Different 0.18 µm static CMOS processes with either 1.8 V or 1.3 V nominal operating voltage were examined using the XIRISC multiplier as a case study. It was observed that the multiplier delay degraded from the best-case environment to the worst-case environment by a factor ranging from 2.3 to 3.5 for the 1.8 V nominal voltage, and by a factor of up to 4.2 for the 1.3 V nominal voltage. Of course, power consumption is also affected by the lowered operating voltage assumption and by the higher resistance due to the raised temperature assumption. In this case study the power consumption decreased by a factor ranging from 1.6 to 1.7 for the 1.8 V nominal voltage, and by a factor ranging from 1.8 to 1.9 for the 1.3 V nominal voltage. The differences between technologies with the same nominal voltage are caused by slight variations in the characterized best-case and worst-case environments, in addition to differences in sensitivity to environmental changes. The higher factors for the 1.3 V nominal voltage are due to a higher relative deviation between the best-case and worst-case operating voltage assumptions compared to the 1.8 V nominal voltage.
A nominal voltage of 1.8 V and the best-case operating environment were chosen for the block syntheses. This yields the highest power consumption and the lowest delay. Hence, the delay figures from the block syntheses should be used for comparison purposes only, not to determine the actual clock frequency, as the assumed environment parameters are not realistic in the vast majority of operating environments. Referring to the technology exploration results, a rough estimate of the delay in a typical environment could be obtained by multiplying the best-case delay by two.

The design space exploration was realized by synthesizing the individual blocks using seven different delay constraints for each. In addition, synthesis runs for minimized power consumption and minimized silicon area were performed. Initial synthesis runs for all blocks were executed with no constraints set for the synthesis tool. For the following six runs, the delay constraint was lowered in equal-sized steps. The magnitude of these steps was chosen to be 15 percent of the delay figure obtained from the initial synthesis run. All of the synthesis runs were performed with the highest possible mapping effort, i.e. the maximum number of heuristic optimization cycles. Optimization over sub-block boundaries was allowed for the synthesis tool.

XIRISC block synthesis results are gathered into tables 1, 2, and 3, presenting delay, area, and power dissipation figures respectively. The figures have been evaluated from the gate-level designs by the synthesis tool. Performance metrics obtained from the initial runs without constraints are listed under the label "initial result". The following column labels describe the magnitude of the given delay constraint as a percentage of the delay figure in the initial result column. The results for the power consumption and silicon area minimization runs are given in the last two columns. Power dissipation figures in table 3 are normalized to a clock frequency of 100 MHz to ease comparison between implementations.
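The constraint schedule described above (six further runs, each lowering the delay constraint by 15 percent of the initial delay) can be sketched in Python as follows. The function names are illustrative assumptions, and the script is not the flow actually used to drive the synthesis tool. The example input of 3.58 ns is the initial ALU delay from table 1.

```python
def delay_constraints(initial_delay_ns, steps=6, step_fraction=0.15):
    """Delay constraints for the runs that follow the unconstrained one.

    Each step lowers the constraint by step_fraction of the initial delay,
    giving 85%, 70%, 55%, 40%, 25% and 10% of the initial figure.
    """
    return [initial_delay_ns * (1.0 - step_fraction * k)
            for k in range(1, steps + 1)]


def typical_delay_estimate(best_case_delay_ns):
    """Rough typical-environment delay: best-case delay multiplied by two."""
    return 2.0 * best_case_delay_ns


constraints = delay_constraints(3.58)
for percent, value in zip((85, 70, 55, 40, 25, 10), constraints):
    print(f"{percent:3d}% constraint: {value:.3f} ns")
```

The tightest constraints are not necessarily met by the synthesis tool, so the resulting delay can be larger than the requested value.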
When interpreting the power consumption figures of individual blocks in table 3, one has to bear in mind that these figures are dominated by the internal power consumption of logic cells inside the block, whereas the overall power consumption of the processor core is mostly determined by interconnect switching power.

Some general synthesis features are evident in the results. Because area is the highest priority parameter in design optimization, forcing the synthesis tool to minimize occupied silicon area yields practically the same results as giving no constraints at all. This can be verified by comparing the initial result and minimum area columns of tables 1, 2, and 3. Minimum area implementations are generally considered to be close to minimum power implementations. In this study the observed difference in power consumption between the minimum area and minimum power implementations ranged from 15% to 26%, being 19% on average. The area penalty due to minimization of power consumption ranged from 8% to 30%, being 22% on average, while the delay penalty ranged from 0% to 27%, being 13% on average. These figures exclude the RF, which exhibited a 51% increase in delay and an 11% increase in area for a power consumption drop of 13%. The exceptional behavior of the RF is caused by the dominance of a single standard cell, the flip-flop register, which has only a few discrete implementations.

The results also indicate that considerable differences in performance metrics are incurred by synthesis alone. The lowest delay figures differ from the highest by a factor ranging from 2.5 to 5.8, being 3.7 on average. The corresponding factor for area figures ranges from 1.3 to 3.5, being 2.3 on average, and for power consumption figures from 1.6 to 3.9, being 2.5 on average.

Block               Initial  85%      70%      55%      40%      25%      10%      Min. power  Min. area
ALU                 3.58     2.80     2.51     1.97     1.40     0.90     0.6      3.51        3.51
control logic       3.50     2.96     2.45     1.91     1.40     1.10     1.3      3.65        3.50
MAC                 9.33     7.85     5.51     4.78     3.63     2.85     2.76     9.99        9.7
multiplier          3.66     2.91     2.56     2.00     1.79     1.8      1.8      4.61        3.71
2-cycle multiplier  4.69     3.85     3.6      2.61     2.1      2.3      2.15     5.97        4.69
register file       1.07     0.91     0.75     0.59     0.45     0.43     0.43     1.6         1.07
SCC                 2.37     1.81     1.65     1.30     0.95     0.60     0.64     2.64        2.38
shifter             1.74     1.08     1.08     0.96     0.70     0.44     0.45     1.99        1.74

Table 1. Delay figures from synthesis in nanoseconds.

Block               Initial  85%      70%      55%      40%      25%      10%      Min. power  Min. area
ALU                 1143     15536    17551    18055    18739    381      39739    184         11383
control logic       1848     0386     1189     1701     6501     3790     41935    3065        18387
MAC                 173351   18175    183833   189653   30498    36333    35080    1715        171999
multiplier          79581    87618    95633    13331    154673   1576     155677   101171      78963
2-cycle multiplier  168088   176599   194167   54771    3605     34886    330416   10518       166736
register file       10918    1560     15645    1734     3117     3505     36454    134140      10918
SCC                 9680     31355    31400    31551    33534    37167    40014    3368        9667
shifter             8905     1131     1131     11444    1571     353      30867    11563       8905

Table 2. Silicon area figures from synthesis in square microns.

Block               Initial  85%      70%      55%      40%      25%      10%      Min. power  Min. area
ALU                 4.69     6.68     4.43     4.58     4.70     5.71     10.7     3.47        4.70
control logic       1.58     1.63     1.66     1.6      1.89     2.48     3.05     1.0         1.58
MAC                 10.6     10.8     10.6     11.0     13.1     19.6     19.3     8.98        10.6
multiplier          1.61     1.69     1.77     2.46     2.78     2.80     2.8      1.5         1.61
2-cycle multiplier  10.4     10.7     1.0      15.0     18.6     18.3     19.0     8.84        10.4
register file       11.0     11.1     11.2     11.4     17.1     19.6     19.9     9.5         11.0
SCC                 2.98     3.18     3.16     3.0      3.35     3.67     3.86     2.46        2.99
shifter             1.77     1.87     1.87     1.90     3.04     4.46     5.75     1.48        1.77

Table 3. Power dissipation figures from synthesis in milliwatts at 100 MHz.
Block               Initial  85%      70%      55%      40%      25%      10%      Min. power  Min. area
ALU                 0.4800   0.4678   0.4509   0.4491   0.4509   0.4377   0.4188   0.4837      N/A
control logic       0.574    0.5707   0.5677   0.5639   0.554    0.5313   0.556    0.558       N/A
MAC                 0.3413   0.3515   0.346    0.3363   0.3545   0.3385   0.3480   0.3697      N/A
multiplier          0.967    0.941    0.974    0.306    0.977    0.93     0.998    0.985       N/A
2-cycle multiplier  0.336    0.3387   0.3361   0.3633   0.3559   0.3565   0.3577   0.3695      N/A
register file       0.3441   0.3445   0.3443   0.3439   0.36     0.311    0.305    0.341       N/A
SCC                 0.4580   0.4618   0.4617   0.4611   0.4684   0.4666   0.468    0.4545      N/A
shifter             0.4347   0.409    0.409    0.4089   0.4380   0.4168   0.4039   0.441       N/A

Table 4. Rent's exponents (area-optimized) as a function of delay constraint.

Block               Initial  85%      70%      55%      40%      25%      10%      Min. power  Min. area
ALU                 0.4378   0.457    0.4404   0.4384   0.4409   0.473    0.4068   0.4645      N/A
control logic       0.6534   0.6573   0.6554   0.6555   0.6460   0.61     0.6053   0.6796      N/A
MAC                 0.394    0.4053   0.3963   0.3886   0.4094   0.3904   0.4009   0.4440      N/A
multiplier          0.405    0.4030   0.4088   0.410    0.4031   0.3973   0.4054   0.4360      N/A
2-cycle multiplier  0.386    0.3897   0.3853   0.417    0.4086   0.4099   0.410    0.443       N/A
register file       0.3791   0.3805   0.3801   0.3793   0.3600   0.3543   0.3537   0.3868      N/A
SCC                 0.5067   0.5098   0.5103   0.5090   0.5177   0.5159   0.5188   0.55        N/A
shifter             0.4391   0.448    0.448    0.441    0.4438   0.431    0.4108   0.4764      N/A

Table 5. Rent's exponents (power-optimized) as a function of delay constraint.

5. Rent's Exponents and Regression Analysis

The values of Rent's exponent p found in the literature very often refer to a specific design case implemented with a specific technology and a specific logic design style. Hence, there is a need to derive Rent's exponent for individual blocks separately. Because in the logic synthesis process a CAD tool changes the organization and type of standard cell components according to various delay, area, and power consumption constraints, we need to define our own Rent's exponent for each individual block as a function of a specific constraint. In this paper we have used only seven different delay constraints, as explained earlier in section 4. We have used the synthesis results to define a specific Rent's exponent p for each block used in our XIRISC processor case study as a function of delay constraint.

Linear regression analysis is performed on the variation of Rent's exponent as a function of delay constraint [9]. The error function is given by (2):

$$J = \frac{1}{2N} \sum_{i=1}^{N} \varepsilon_i^{2} \qquad (2)$$

where \varepsilon_i = d_i - (b + w x_i), d_i is the real exponent value, and x_i is the delay constraint. This is a mean square error (MSE) function. Our aim is to minimize the error function by setting the partial derivatives of J with respect to the intercept b and the slope w to zero. This yields the following equations [9]:

$$b = \frac{\sum_i x_i^{2} \sum_i d_i - \sum_i x_i \sum_i x_i d_i}{N \left[ \sum_i (x_i - x_{avg})^{2} \right]} \qquad (3)$$

$$w = \frac{\sum_i (x_i - x_{avg})(d_i - d_{avg})}{\sum_i (x_i - x_{avg})^{2}} \qquad (4)$$

where x_i is the delay constraint and d_i is the real exponent value.
The symbols x_avg and d_avg denote the average (mean) values of the delay constraint and Rent's exponent, respectively.

Additionally, there are two separately optimized cases for deriving Rent's exponent: the area-optimized case and the power consumption-optimized case. For both cases we use the information received from the synthesis report and assume that a standard 2-input NAND gate with normal drive strength, a fan-out of 2, and an input rise time of 17 ps represents an average gate. In the first case we use the area value extracted from the synthesis report, and in the latter case we use the power consumption value from the report. These two cases give slightly different Rent's exponent values and also different regression lines, which can then be applied separately in area and power consumption estimation. Note that these values and regression lines apply only to this specific design style (static CMOS gates) and this specific technology (0.18 µm in our case). The advantage is that we can vary the block delays as long as we stay inside the boundary values. Thus the analysis presented here applies equally to totally synchronous (Sync), totally asynchronous (Async), or globally asynchronous, locally synchronous (GALS) design schemes of future SoC platforms.

Tables 4 and 5 give the real area-optimized and power-optimized Rent's exponents calculated using Rent's rule and the synthesis reports for the blocks. Those values are then used in the linear regression analysis, which finally yields a formula for Rent's exponent as a linear function of delay constraint. Some linear graphs are presented in section 6.
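As an illustration of the regression step, the closed-form intercept (3) and slope (4) can be evaluated directly. The Python sketch below is a minimal version under stated assumptions: the function names are chosen here for illustration, and the delay and exponent values are hypothetical rather than rows of tables 1, 4, or 5.

```python
def fit_line(x, d):
    """Least-squares intercept b and slope w, following (3) and (4)."""
    n = len(x)
    if n != len(d) or n < 2:
        raise ValueError("x and d must have the same length, at least 2")
    x_avg = sum(x) / n
    d_avg = sum(d) / n
    sxx = sum((xi - x_avg) ** 2 for xi in x)
    if sxx == 0.0:
        raise ValueError("all delay values are identical")
    # Slope, equation (4).
    w = sum((xi - x_avg) * (di - d_avg) for xi, di in zip(x, d)) / sxx
    # Intercept, equation (3).
    b = (sum(xi * xi for xi in x) * sum(d)
         - sum(x) * sum(xi * di for xi, di in zip(x, d))) / (n * sxx)
    return b, w


def mse(x, d, b, w):
    """Error function J of equation (2)."""
    n = len(x)
    return sum((di - (b + w * xi)) ** 2 for xi, di in zip(x, d)) / (2 * n)


# Hypothetical design points: delay in ns and extracted Rent's exponent.
delays = [3.0, 2.5, 2.0, 1.5, 1.0, 0.8, 0.7]
exponents = [0.48, 0.47, 0.45, 0.45, 0.45, 0.44, 0.42]

b, w = fit_line(delays, exponents)
print(f"p(delay) = {b:.4f} + {w:.4f} * delay, J = {mse(delays, exponents, b, w):.2e}")
print(f"estimated p at 1.75 ns: {b + w * 1.75:.4f}")
```

The fitted line is only meaningful between the smallest and largest delay among the design points, which matches the boundary-value restriction stated above.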
Figures 4 and 5 depict regression lines for the power consumption-optimized Rent's exponent values of the control logic and the two-cycle multiplier. An arrow is used to mark the point of greatest deviation from the regression line. For all blocks, the curve shape is very similar for the area-optimized and power consumption-optimized Rent's exponents. In the regression procedure, all design points have been given equal weight. In general, there are more evaluated design points in the neighborhood of the minimum achievable delay due to the synthesis procedure. Hence, the designs close to the minimum delay collectively have more influence on the derivation of the regression line.

To evaluate the highest possible error in the estimation results, the point on the regression line corresponding to the point of greatest deviation was used. The obtained results were then compared to the synthesis results associated with that point. The results of this comparison are gathered in table 6. In addition to the error percentages of the area and power consumption estimates, the error percentages of the respective Rent's exponents are shown to illustrate the fact that a small change in the value of Rent's exponent results in a considerable change in the estimation result. Note that the absolute maximum error might be found in between some of the evaluated design points.

[Figure 2. Regression line derived for area-optimized Rent's exponents of the arithmetic logic unit.]

[Figure 3. Regression line derived for area-optimized Rent's exponents of the register file.]

[Figure 4. Regression line derived for power-optimized Rent's exponents of the control logic.]

[Figure 5. Regression line derived for power-optimized Rent's exponents of the 2-cycle multiplier.]

Block               Error in exponent (A / P)  Error in area  Error in power consumption
ALU                 2.0 / 2.5%                 13.4%          17.7%
control logic       2.1 / 2.5%                 17.5%          17.3%
MAC                 1.9 / 2.8%                 17.4%          7.5%
multiplier          2.5 / 1.8%                 7.4%           13.8%
2-cycle multiplier  2.8 / 3.3%                 33.4%          3.3%
register file       3.6 / 3.3%                 40.8%          33.3%
SCC                 0.6 / 0.7%                 4.4%           6.3%
shifter             5.3 / 4.3%                 49.3%          38.2%

Table 6. Maximum errors in area and power consumption when using Rent's exponent values with the greatest deviation from the regression line.

Apparently, the modeling accuracy varied with logic type. Since Rent's rule assumes random logic, the blocks that are mostly comprised of combinational logic (the ALU, control logic, and SCC) were modeled most accurately. The RF and shifter blocks exhibit a high error percentage due to a regular inner structure that does not obey Rent's rule. They consist mainly of flip-flop registers, for which only a few discrete implementations are available in the standard cell library. This causes nonlinear discontinuities, such as in figure 3, to appear inside their physical-level design spaces. For the group of multiplier-based blocks, moderate modeling accuracy was achieved. Referring to figure 5, this group exhibits abrupt changes that are almost orthogonal to the regression line. This can be explained by the topology changes allowed for the synthesis tool. Also characteristic of, but not unique to, this group was the tendency of Rent's exponent to grow towards the low-delay implementations.

It was observed that whenever Rent's exponent exhibited an abrupt change, Rent's constant was also dramatically altered. Because Rent's constant is defined by the number of connections between a logic gate and its surroundings, this abruptness can be attributed to the synthesis tool being forced to use smaller gates to meet the timing constraint. For example, the most dramatic increase of Rent's exponent visible in figure 5 is accompanied by a 34% drop in Rent's constant, indicating considerably smaller logic gates in the design.

Finally, a cycle time estimate was made for the processor based on the assumption that the cycle time is defined by the execution stage. In this case we did not take on-chip memory into account in our calculations. The cycle time was assumed to consist of the sum of the delays in the logic block itself (based on the synthesis results, we assumed that the MAC has the longest delay), in the global wire, and in a pipeline register. We used a 0.3 µm wide and 0.565 µm thick metal 4 wire for global signaling. The RLC delay equation presented in [10] was used for the global wire delay, synthesis results for the logic delay, and standard cell library information for the pipeline register.
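The cycle time model described above, the sum of the block logic delay, the global wire delay, and the pipeline register delay, can be sketched as follows. This is a minimal Python sketch with assumed inputs: the logic delay of 7.85 ns is the MAC figure for the 85% constraint in table 1, whereas the wire and register delays are placeholder values chosen for illustration and are not taken from the RLC model of [10] or from the standard cell library.

```python
def cycle_time_breakdown(logic_ns, wire_ns, register_ns):
    """Return the total cycle time and the percentage share of each part."""
    parts = {
        "logic": logic_ns,
        "global wire": wire_ns,
        "pipeline register": register_ns,
    }
    total = sum(parts.values())
    shares = {name: 100.0 * value / total for name, value in parts.items()}
    return total, shares


# logic_ns from table 1 (MAC, 85% constraint); the other two are placeholders.
total_ns, shares = cycle_time_breakdown(logic_ns=7.85, wire_ns=0.01, register_ns=0.38)
print(f"cycle time: {total_ns:.2f} ns")
for name, share in shares.items():
    print(f"  {name}: {share:.1f}%")
```

With any inputs of this shape the logic block dominates the total, which is the pattern the reported percentages show.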
For the 85% delay constraints (see tables 1-3), the cycle time of the XIRISC processor was 8.2359 ns (logic 95.3%, global wire 0.1%, pipeline register 4.6%). For the 70% and 55% delay constraints the cycle times were 5.8961 ns and 5.1671 ns, respectively. The relative delay of logic was still 93.5% and 92.5%, respectively. These relatively high logic delay percentages suggest that global wire delay is not so critical if proper pipelining is used and there are bottleneck logic blocks in the design causing high cycle times for the processor.

7. Conclusion

The development of an early design phase performance estimation method was described for logic blocks of the XIRISC processor core. Compared to traditional approaches, which use a single average exponent for all logic, we used block-wise exponents and took the variation due to varying timing constraints into account. We noticed a large variation in Rent's exponent between different blocks: the exponent value ranged from 0.29 to 0.68. The biggest estimation inaccuracies originate from the tendency of the synthesis tool to abruptly change the average logic gate size to meet a tighter delay constraint. Restricting the set of logic gates allowed for the synthesis tool, e.g. to gates having a maximum of 3 inputs, would not affect the delay or power optimization results very much, but would facilitate more accurate performance modeling. Future research will focus on the neighborhood of the abrupt changes in average gate size, that is, abrupt changes in Rent's constant. It is anticipated that using polynomial regression instead of linear regression would result in more accurate estimates of the performance metrics.

Acknowledgements

We acknowledge Fabio Campi for providing us access to invaluable profiling information on algorithms running on XIRISC.

References

[1] H.B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI, Addison-Wesley Publishing Company Inc., 1990.
[2] G.A. Sai-Halasz, "Directions in Future High-End Processors", in Proc. IEEE International Conference on Computer Design, 1992, pp. 230-233.
[3] D.C. Edelstein, G.A. Sai-Halasz, Y.J. Mii, "VLSI On-Chip Interconnection Performance Simulations and Measurements", IBM Journal of Research and Development, Vol. 39, No. 4, July 1995, pp. 383-401.
[4] D. Sylvester, K. Keutzer, "Getting to the bottom of deep submicron", Proceedings of International Conference on CAD, 1998, pp. 203-211.
[5] B. Geuskens, K. Rose, Modeling Microprocessor Performance, Kluwer Academic Publishers, 1998.
[6] W. Donath, "Placement and Average Interconnection Lengths of Computer Logic", IEEE Transactions on Circuits and Systems, No. 4, April 1979, pp. 272-277.
[7] B.S. Landman, R.L. Russo, "On a Pin Versus Block Relationship for Partitions of Logic Graphs", IEEE Transactions on Computers, Vol. C-20, No. 12, December 1971, pp. 1469-1479.
[8] F. Campi, R. Canegallo, R. Guerrieri, "IP-Reusable 32-Bit VLIW Risc Core", in Proc. European Solid-State Circuits Conference, 2001, pp. 456-459.
[9] J.C. Principe, N.R. Euliano, W.C. Lefebvre, Neural and Adaptive Systems: Fundamentals through Simulations, John Wiley & Sons, Inc., 2000, pp. 10-13.
[10] Y.I. Ismail, E.G. Friedman, "Effects of Inductance on the Propagation Delay and Repeater Insertion in VLSI Circuits", IEEE Transactions on VLSI Systems, Vol. 8, No. 2, April 2000, pp. 195-206.