Fast Algorithm of A 64-bit Decimal Logarithmic Converter

JOURNAL OF COMPUTERS, VOL. 5, NO. 12, DECEMBER 2010

Ramin Tajallipour, Md. Ashraful Islam, and Khan A. Wahid
Dept. of Electrical and Computer Engineering, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
Email: rat177@mail.usask.ca, md142@mail.usask.ca, khan.wahid@usask.ca

Abstract: The paper presents an efficient algorithm to compute the base-10 logarithm of a decimal number. The algorithm uses 64-bit floating-point arithmetic and is based on a digit-by-digit iterative computation that does not require look-up tables, curve fitting, decimal-binary conversion, or division operations. It is the first FPGA prototype of its kind that uses a 64-bit (decimal 16-digit) precision. Two numerical examples are presented for the purpose of illustration. The algorithm produces a very accurate result, with a maximum absolute error of $3.53 \times 10^{-14}$. The architecture is pipelined and implemented on the Xilinx Virtex2p FPGA. It costs 6,752 logic cells, outputs at a minimum rate of 51 mega-samples/sec, and consumes 125.7 mW of power. The scheme is very suitable for timing- and accuracy-critical applications and is compliant with the IEEE 754-2008 standard (decimal64 format).

Index Terms: decimal logarithm converter, floating-point arithmetic, iterative computation, IEEE 754-2008

I. INTRODUCTION

Elementary functions such as the logarithm and the exponential operations have become very useful in many applications such as financial analysis, tax calculation, internet-based applications, and e-commerce [1], where these operations are used to avoid hardware-expensive multiplication and division operations. In the past, several hardware-efficient methods have been proposed for computing the base-2 logarithm of binary numbers [4][5][11]-[16]. However, after the inclusion of decimal floating-point (FP) operation in the latest IEEE 754-2008 standard [6], more researchers have devoted their effort to developing decimal FP algorithms and architectures to efficiently compute logarithms [7][17], exponentiation [28], trigonometric operations, etc.
Manuscript received January 1, 2009; revised June 1, 2009; accepted July 1, 2009.

A study has shown that 55% of the numbers stored in the databases of 50 big organizations are decimal [21]. Several software packages are available to customers for computing with decimal arithmetic to minimize error [23], but software-implemented decimal arithmetic takes much longer to execute than a hardware version [1], which has given momentum to its implementation in hardware. IBM has recently implemented decimal FP architecture in their POWER6 [3][19], z9 [29], and z10 microprocessors [20]. Several decimal architectures, such as the multi-operand carry-save adder [24][25], carry look-ahead adder [26], parallel BCD adder [27], and signed-digit adder [18], have been proposed.

There are several applications which require the direct computation of the decimal (or radix-10) logarithm, such as measuring pH in chemistry, earthquake intensity on the Richter scale, optical density in spectrometry and optics, and the brightness of stars in astronomy [2]. Moreover, the radix-10 logarithm is widely used in computing the ratio of voltage and power levels (called bel) in telecommunications, electronics and acoustics. In most base-10 logarithmic converters, the decimal input is first converted to binary, followed by base-2 logarithm computation; after completion, the result is converted back to decimal radix. These back-and-forth conversions between bases introduce errors into the system. A generalized iterative algorithm to compute the base-k logarithm has been presented in [8]; however, the division operation in that work limits the performance by incurring erroneous computation. Moreover, the use of lookup tables and the lack of user control over the number of iterations make this algorithm very inefficient for hardware implementation. We have recently presented a 32-bit decimal logarithm (in short, log) converter [22].
While the decimal32 format, as defined in the IEEE 754-2008 standard [6], is only used for storage, the decimal64 and decimal128 formats are used for more accurate decimal computation. Motivated by this fact, in this paper we extend the algorithm to compute the radix-10 log using decimal64 precision, which is the first FPGA prototype of its kind. The algorithm is based on a digit-by-digit iterative computation that does not require error correction circuitry, look-up tables, curve fitting, or division operations. The number of iterations of the log converter depends on the user-defined precision. The previous 32-bit design [22] suffers from high latency (e.g., 40 clock cycles) due to an inefficient power-10 algorithm and unpipelined operation. Here, we present a very efficient power-10 module with a pipelined architecture that may take only 4 clock cycles to produce the log result, with a high process throughput of 51 mega-samples/sec. The error analysis shows that the proposed scheme produces very accurate results. The architecture is developed based on 64-bit binary coded decimal (BCD) representation and is compliant with the IEEE decimal64 FP standard.

The paper is organized as follows: Section II presents the background. In Section III, the digit-by-digit algorithm is presented, together with its pseudo-code and two examples for illustration. A detailed hardware implementation is discussed in Section IV, where the different internal modules are described. Section V discusses the performance analysis, with a hardware comparison among related log converter designs. The paper is concluded in Section VI.

doi:10.4304/jcp.5.12.1847-1855

II. BACKGROUND

The general form of any positive number, L, can be expressed as:

$$L = \cdots + \alpha_2 R^{2} + \alpha_1 R^{1} + \alpha_0 R^{0} + \alpha_{-1} R^{-1} + \alpha_{-2} R^{-2} + \cdots \quad (1)$$

where R is the numerical base and $\alpha_i$ is the coefficient for the ith power of that base, ranging from 0 to R − 1. For a decimal base, R equals 10. After taking the logarithm of any positive decimal number, P, (1) results in the following (2):

$$L = \log_{10} P = \cdots + \alpha_1 \times 10^{1} + \alpha_0 \times 10^{0} + \alpha_{-1} \times 10^{-1} + \alpha_{-2} \times 10^{-2} + \cdots, \quad P > 0 \quad (2)$$

The coefficients in (2) can now be divided into two categories: integer (or "characteristic": $\ldots, \alpha_1, \alpha_0$) and fraction (or "mantissa": $\alpha_{-1}, \alpha_{-2}, \alpha_{-3}, \ldots$). The procedure to compute these coefficients is described in the following section.

III. ALGORITHM FOR DECIMAL64 LOG

A. Decimal64 format in IEEE 754-2008

The IEEE 754-2008 decimal FP arithmetic supports the decimal32, decimal64, and decimal128 computation and data interchange formats, and implements all the operations and conversions [6]. The basic decimal FP format is illustrated in Fig. 1. The Sign is a 1-bit field that indicates the sign of the number, where S is 0 or 1. The Combination Field is a (w+5)-bit field that encodes the two most significant bits (MSBs) of the exponent and the most significant digit (MSD) of the coefficient. Not-a-Number (NaN) and infinity (Inf) are also indicated in the Combination Field. The biased exponent is a (w+2)-bit quantity, where the value of the first two bits of the biased exponent taken together is 0, 1, or 2.
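To make the role of the Combination Field concrete, the following is a minimal Python sketch of how its five leading bits can be decoded, following the encoding rules of the IEEE 754-2008 decimal interchange formats [6]. The function name `decode_combination` and its return convention are illustrative assumptions, not part of the paper's hardware.

```python
def decode_combination(g: int):
    """Decode the five leading combination-field bits G0..G4 (G0 is the MSB).

    Returns "NaN", "Inf", or a tuple (exponent_msbs, msd), where
    exponent_msbs is 0, 1 or 2 and msd is the leading coefficient digit.
    Hypothetical helper for illustration only.
    """
    if not 0 <= g <= 0b11111:
        raise ValueError("expected a 5-bit value")
    if g == 0b11111:
        return "NaN"
    if g == 0b11110:
        return "Inf"
    if (g >> 3) != 0b11:
        # G0G1 are the two exponent MSBs; G2G3G4 hold an MSD of 0..7.
        return (g >> 3, g & 0b111)
    # G0G1 == 11: G2G3 are the exponent MSBs and the MSD is 8 or 9.
    return ((g >> 1) & 0b11, 8 + (g & 1))
```

For instance, `decode_combination(0b01101)` yields exponent MSBs 1 and MSD 5, while `decode_combination(0b11001)` yields exponent MSBs 0 and MSD 9. The two exponent MSBs never take the value 3, which is why the first two bits of the biased exponent are limited to 0, 1, or 2.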
The whole encoded exponent is an unsigned binary integer with a largest unsigned value of $3 \times 2^{w} - 1$. The value of the exponent is calculated by subtracting an exponent bias from the value of the encoded exponent, so that both negative and positive exponents can be represented.

Figure 1. Decimal floating-point number format in IEEE 754-2008

The Trailing Significand Field is j × 10 bits wide. The significand is formed by appending the 3j decoded continuation digits as a suffix to the most significant digit (MSD) derived from the combination field. Each 10-bit group represents three decimal digits, using Densely Packed Decimal (DPD) encoding [30]. The format encodes a total of p = 3j + 1 decimal digits, where p is the number of digits in the significand (precision). For the decimal64 format: w = 8, j = 5, exponent bias = 398, and p = 16.

B. Proposed iterative algorithm

The IEEE 754-2008 standard [6] defines any non-normalized unsigned decimal fraction as $d_0.d_1 d_2 d_3 \ldots d_{15}$, where $0 \le d_i \le 9$. To be compatible with the standard, we extend our input decimal number, P, as follows:

$$P = (-1)^{S} \times 10^{a} \times P_c \quad (3)$$

where S is the sign, a is the exponent, and $P_c$ is the coefficient (in integer form). Taking a 64-bit log of (3) results in (4):

$$L = \log_{10}(P) = \log_{10}(10^{a}) + \log_{10}(P_c) \quad (4)$$

Here, $000000000000001 \le P_c \le 999999999999999$. The computation of $\log_{10}(P_c)$ follows the iterative algorithm. It starts with the computation of the upper limit of i, called $i_{max}$, which is the number of mantissa digits desired in the final converted answer. $i_{max}$ is set by the user and defines the number of iterations. In order to perform the initial range reduction, we extend (4) further as given below:

$$L = \log_{10}(P) = a + b + \log_{10}(k) \quad (5)$$

where b (range: $1 \le b \le 15$) is the characteristic of $\log_{10}(P)$ and is obtained by detecting leading zeros; k is a decimal fraction (range: $0.1 \le k < 1$). After separating a, and combining (2) and (5), we get the following (6):

$$P = 10^{b} \times 10^{\alpha_{-1} \times 10^{-1} + \alpha_{-2} \times 10^{-2} + \cdots} \quad (6)$$

The intermediate data is accumulated into a temporary variable, A, where $1 \le A < 10$:
$$A = \frac{P}{10^{b}} = 10^{\alpha_{-1} \times 10^{-1} + \alpha_{-2} \times 10^{-2} + \cdots} \quad (7)$$

This division (by $10^{b}$) can be easily implemented by right shifting the input digits. In order to determine the fractional parts (e.g., $\alpha_{-1}, \alpha_{-2}, \alpha_{-3}, \ldots$), we take the power-10 of (7) as shown below:

$$A^{10} = \left(10^{\alpha_{-1} \times 10^{-1} + \alpha_{-2} \times 10^{-2} + \cdots}\right)^{10} = 10^{\alpha_{-1}} \times 10^{\alpha_{-2} \times 10^{-1} + \alpha_{-3} \times 10^{-2} + \cdots} \quad (8)$$

Now (8) has a structure similar to (7). The first mantissa coefficient, $\alpha_{-1}$, is computed by simply counting the number of integer digits in (8). The temporary value (stored in A) undergoes another power range reduction and is accumulated back into A. The process continues for the remaining mantissa coefficients until the number of iterations reaches $i_{max}$, set earlier by the user.

For cases where the input lies between 0 and 1, the log produces a negative result. The proposed algorithm is capable of handling such cases. For this purpose, we first adjust the decimal point as follows:

$$P = m \times 10^{n}, \quad \text{where } m \ge 1,\ n < 0 \quad (9)$$

Taking the 64-bit radix-10 log on both sides of (9) leads to the following (10):

$$L = \log_{10} P = n + \log_{10} m \quad (10)$$

Now, the computation of $\log_{10} m$ follows the procedure described in (5)-(8).

C. Pseudocode of the algorithm

The pseudo-code of the proposed algorithm is summarized below and illustrated in Fig. 2:

1. Read the decimal input, P, and the number of mantissa digits in the log result, i
2. Transfer P to A and perform initial range reduction
3. Detect the number of integer digits in A
4. Start computing the mantissa coefficients, $\alpha_{-1}, \alpha_{-2}, \ldots$
5. Compute power-10 of A, and perform power range reduction
6. Decrease i by 1
7. If i > 0, repeat steps 3-6

Figure 2. Flow graph of the decimal log algorithm

D. Examples

In order to better illustrate the algorithm, we present two examples:

1) Determine the logarithm of the decimal number P = 123456789.123456 up to three fractional digits. The computation steps are as follows:

- Here the user sets the number of fractional digits; so, $i_{max} = 3$; hence, the computation process will continue up to the computation of $\alpha_{-3}$, and the final answer will be in the format $\alpha_0.\alpha_{-1}\alpha_{-2}\alpha_{-3}$
- The number of integer digits in P: N = 9; hence, $\alpha_0 = (9 - 1) = 8$
- Right shift the digits of P by eight digits: $A = (P \gg 8) = 1.23456789123456$
- Compute power-10 of A and accumulate the result: $A = A^{10} = (1.23456789123456)^{10} = 8.2252626737231264115782140871879$
- Compute the number of integer digits in A: N = 1; hence, $\alpha_{-1} = (1 - 1) = 0$
- Right shift the digits of A by zero digits: $A = (8.2252626737231264115782140871879 \gg 0) = 8.2252626737231264115782140871879$
- Compute power-10 of A and accumulate the result: $A = A^{10} = (8.2252626737231264115782140871879)^{10} = 1417417401.8443859394911136893318$
- Compute the number of integer digits in A: N = 10; hence, $\alpha_{-2} = (10 - 1) = 9$
- Right shift the digits of A by nine digits: $A = (1417417401.8443859394911136893318 \gg 9) = 1.4174174018443859394911136893318$
- Compute power-10 of A and accumulate the result: $A = A^{10} = (1.4174174018443859394911136893318)^{10} = 32.732381445138501594634638993872$
- Once again, compute the number of integer digits in A: N = 2; hence, $\alpha_{-3} = (2 - 1) = 1$
- The iteration stops, and the final answer is: L = 8.091

2) Determine the logarithm of the decimal number P = 0.00000123456789 up to two fractional digits. The computation steps are as follows:

- Here the input is a fraction (0 < P < 1); so, $i_{max} = (2 + 1) = 3$; hence, the computation process will
continue up to the computation of $\alpha_{-3}$, and the final answer will be in the format $\alpha_0.\alpha_{-1}\alpha_{-2}\alpha_{-3}$
- Adjust the decimal point: $P = 1.23456789 \times 10^{-6}$; m = 1.23456789, n = −6
- The number of integer digits in m: N = 1; hence, $\alpha_0 = (1 - 1) = 0$
- Right shift the digits of m by zero digits: $A = (m \gg 0) = 1.23456789$
- Compute power-10 of A and accumulate the result: $A = A^{10} = (1.23456789)^{10} = 8.22526259147257950476114366154$
- Compute the number of integer digits in A: N = 1; hence, $\alpha_{-1} = (1 - 1) = 0$
- Right shift the digits of A by zero digits: $A = (8.22526259147257950476114366154 \gg 0) = 8.22526259147257950476114366154$
- Compute power-10 of A and accumulate the result: $A = A^{10} = (8.22526259147257950476114366154)^{10} = 1417417260.35587702142524239761$
- Compute the number of integer digits in A: N = 10; hence, $\alpha_{-2} = (10 - 1) = 9$
- Right shift the digits of A by nine digits: $A = (1417417260.35587702142524239761 \gg 9) = 1.41741726035587702142524239761$
- Compute power-10 of A and accumulate the result: $A = A^{10} = (1.41741726035587702142524239761)^{10} = 32.7323487129826281766581287077$
- Once again, compute the number of integer digits in A: N = 2; hence, $\alpha_{-3} = (2 - 1) = 1$
- The iteration stops, and the final answer is: L = −6 + 0.091 = −5.909

Thus, it can be seen that the algorithm does not require any lookup tables, curve fitting, FP division operations, or error correction circuitry. The following section describes the hardware implementation of the proposed scheme.

IV. HARDWARE IMPLEMENTATION

The architecture of the radix-10 log converter is shown in Fig. 3. It consists of two major units, both connected to a controller: the Synchronous Register-Counter and the Core Unit.

Figure 3. Block diagram of the entire system

The core architecture is developed using unsigned BCD representation with an internal precision of 16 digits (64-bit binary). The DPS (decimal-point separator) module detects and separates the DP, and then stores the unsigned magnitude in a temporary register. The DP follows a separate path (DP Accumulator → DP Update) that is parallel to the core computation. The DP Update module tracks the position of the decimal point and updates it after every computation step.
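Before going further into the hardware, the digit-by-digit procedure of Section III can be sketched in software. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: it uses the standard `decimal` module, keeps 16 significant digits with truncation to mimic the 16-digit internal precision, and forms the tenth power with four multiplications. The names `log10_digits` and `format_log` are hypothetical.

```python
from decimal import Decimal, localcontext, ROUND_DOWN


def log10_digits(p: str, n_digits: int, work_digits: int = 16):
    """Return (characteristic, mantissa_digits) of log10(p) for p > 0.

    Software sketch of the digit-by-digit iteration: range-reduce A into
    [1, 10), raise it to the tenth power, and read off the next digit as
    the number of integer digits minus one.
    """
    with localcontext() as ctx:
        ctx.prec = work_digits      # digits retained after every multiply
        ctx.rounding = ROUND_DOWN   # truncate instead of rounding
        a = Decimal(p)
        if a <= 0:
            raise ValueError("input must be positive")
        # Initial range reduction: adjusted() is (integer digits - 1) and
        # is negative for inputs between 0 and 1.
        characteristic = a.adjusted()
        a = a.scaleb(-characteristic)       # now 1 <= a < 10
        digits = []
        for _ in range(n_digits):
            x2 = a * a
            x4 = x2 * x2
            x8 = x4 * x4
            a = x8 * x2                     # a ** 10 in four multiplies
            d = a.adjusted()                # next mantissa digit, 0..9
            digits.append(d)
            a = a.scaleb(-d)                # power range reduction
    return characteristic, digits


def format_log(characteristic: int, digits) -> Decimal:
    """Combine the characteristic and the mantissa digits into one value."""
    frac = Decimal(int("".join(map(str, digits)))).scaleb(-len(digits))
    return characteristic + frac
```

With the inputs of the two worked examples, `log10_digits("123456789.123456", 3)` gives the characteristic 8 and the digits 0, 9, 1 (that is, 8.091), and `log10_digits("0.00000123456789", 3)` gives −6 and the same digits, which `format_log` combines into −5.909.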
The converter accepts two inputs: a 16-digit decimal number (P, in BCD) including the decimal point (inputted as hex C), and the desired number of digits after the decimal radix point (i, in binary) in the final log result. Depending on i, a down counter is set which defines the number of iterations. The core unit performs the fundamental computation supervised by the Controller. The width of the data lines in all the following figures is in decimal digits, unless otherwise specified. The architecture of the core unit is shown in Fig. 4. It does not show the interaction with the controller.

Figure 4. Architecture of the core unit

The 16-digit unsigned input is passed to the Zero Detector (ZD) module, which determines the number of integer digits and hence the first digit (or coefficient) of the final log result. The coefficient is updated at the same time in the Coefficient Update module. The right-shift (RS) operation to be performed on A is also achieved in the same clock cycle by simply updating the position of the DP in the DP Update module. The intermediate data is passed into the Power-10 Unit and fed back to ZD for further processing. The data flow to ZD is controlled by
the controller (lines not shown) through a 2:1 multiplexer. The process continues until the counter reaches zero, and the controller then stops the computation. The final result is accumulated in the Result Update unit after combining the outputs of Coefficient Update and DP Update.

A. Power-10 architecture

The Power-10 module is a key unit of the log converter, and the accuracy of the final result greatly depends on its efficient implementation. Because of its complexity, we have explored several options based on the divide-and-conquer algorithm to efficiently implement the unit, which are shown in Fig. 5. Considering the tradeoff between hardware cost and speed, we have chosen option 3 for our implementation. The algorithm is based on a recursive powering that requires one parallel multiplier unit and 4 cc to complete. This is a significant improvement over the previous implementation [22], which took 40 cc for such computation.

Figure 5. Power-10 algorithms and the cost of implementation: (a) Option 1; (b) Option 2; (c) Option 3 (one multiplier unit producing $X^2$, $X^4$, $X^8$, $X^{10}$; latency 4 cc)

Figure 6. (a) Architecture of the Power-10 unit; (b) Timing diagram

The overall architecture of the Power-10 unit is shown in Fig. 6(a), where the width of all data lines is 16 digits. It consists of a 16-digit combinational multiplier, an accumulator, and a few latches. The selection bit (Sel) dictates the multiplication operation: 0 for A*A; 1 for A*B. A key step of the proposed algorithm is to count the number of integer digits before the decimal point to evaluate the coefficients, $\alpha_i$, which may take any value between 0 and 9 (where $0 \le \alpha_i \le 9$). Inside the multiplier, the most significant 16 digits are retained and accumulated for further processing. The data flow is controlled using a 2:1 multiplexer by the controller (lines not shown). The timing diagram for each clock pulse and the selection/control sequence are shown in Fig. 6(b) [here, n indicates the instance of the clock pulse at any given time, t].
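The recursive powering of option 3 can be sketched in a few lines of Python. This is a software illustration only, assuming that each product is truncated to its 16 most significant digits; the function name `power10` and its `digits` parameter are hypothetical, and no claim is made about how the hardware schedules the stages beyond what the text states.

```python
from decimal import Decimal, localcontext, ROUND_DOWN


def power10(x: Decimal, digits: int = 16) -> Decimal:
    """Compute x**10 with four multiplications, keeping `digits` digits.

    Stage 1 forms X^2, which is kept aside so that the fourth stage can
    multiply it with X^8 to give X^10.
    """
    with localcontext() as ctx:
        ctx.prec = digits          # retain the most significant digits
        ctx.rounding = ROUND_DOWN  # truncate the rest
        x2 = x * x                 # stage 1: X^2 (saved for stage 4)
        x4 = x2 * x2               # stage 2: X^4
        x8 = x4 * x4               # stage 3: X^8
        return x8 * x2             # stage 4: X^10 = X^8 * X^2
```

Reusing the stored square in the last stage is what keeps the count at four multiplications; a naive repeated multiplication would need nine.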
After the first multiplication stage, the output (i.e., $X^2$) is stored in a temporary latch so that it can be used later at the fourth multiplication stage.

B. 16-digit decimal multiplier

Several efficient methods for decimal multiplication have been proposed in the past [9][10]. Here, we have used a general-purpose 16-digit combinational multiplication algorithm which is a modified version of [9]. This is another improvement over the previous 32-bit design [22], where a sequential multiplier was used. The multiplier architecture (as shown in Fig. 7) is optimized to reach the desired throughput. The multiplier input is recoded, and the partial products are first kept in a redundant format and then accumulated by a tree of redundant adders. Finally, the 32-digit product is obtained by converting the carry-save tree's outputs into BCD format. The presented combinational architecture results in low latency, which is why it is chosen for the log converter. The product, p, is given below, where A and B are the signed multiplier and the multiplicand, respectively:

$$p = A \cdot B = \sum_{i=0}^{n-1} A \cdot B_i \cdot 10^{i}, \quad B_i \in [0, 9] \quad (11)$$

In order to make the computation simple, $B_i$ is recoded into two groups, $B_i^{h} \in \{0, 5, 10\}$ and $B_i^{l} \in \{-2, -1, 0, 1, 2\}$, where $B_i = B_i^{h} + B_i^{l}$. Negative numbers are represented in radix-10 complement and are implemented by performing the 9's complement of the BCD digit and adding a 1 to the least significant digit. In order to compute 2X (two times the input), we first double each digit of the input and record it using 5 bits:
one carry (1 bit) and one digit (4 bits). For a 16-digit input, this generates 16 digits and 16 carry bits. In the second step, the carry bits are added; the carry is not propagated to the next digit. The generation of 5X (five times the input) is performed by first computing 10X (ten times the input), followed by a simple division operation. The overall architecture of the partial product generator (PPG) is shown in Fig. 8.

Figure 7. 16-digit general purpose BCD multiplier

Figure 8. Block diagram of the partial product generator (PPG); blocks include a 5:1 MUX, a 3:1 MUX, and a radix-10 adder

The partial products are added using an adder tree and converted to BCD. For this conversion, we have used an efficient signed-digit decimal adder [18], which has the benefit of carry-free addition; however, a carry-propagation adder (CPA) must be used to transform the signed-digit sum into an unsigned sum. The operation is shown in Fig. 9 and described by (12):

(12) [equation not recoverable from the transcription; it relates the sum digits $S_i$, the signed digits $u_i$, and the carries $c_i$]

Figure 9. Final conversion to BCD using signed-digit adder

V. PERFORMANCE EVALUATION

The architecture of the 64-bit log converter has been prototyped using Verilog and synthesized onto a Xilinx Virtex2p FPGA (xc2vp30ff1152-7). The breakdown of the cost of the different units is shown in Table I. It can be seen that the Power-10 unit (including a combinational decimal multiplier) consumes the most resources of the entire system.

TABLE I. HARDWARE COST OF THE PROPOSED 64-BIT DECIMAL LOG CONVERTER

| Module | Submodule | Bit length | Reg. | Logic cells |
| Core Unit (inc. Power-10) | Others | 64 | 269 | 213 |
| Core Unit (inc. Power-10) | Decimal mult | 64 | 0 | 6,403 |
| Controller and others | | 16 | 73 | 136 |
| Total | | 64 | 342 | 6,752 |

A. Error analysis

For the targeted 64-bit decimal FP applications, the proposed log converter must be able to achieve the minimum accuracy (i.e., 32-bit binary precision as defined in [31]) to guarantee correct operation. In order to compute the maximum error of the converted log result, we have performed an error analysis, where a long test vector comprising 100 16-digit positive decimal numbers (ranging from 0000000000000000 to
9999999999999999 with arbitrary position of the decimal point) is used. The error is computed taking a precision of 22 digits (written in Matlab) as reference. Fig. 10 shows the normalized error plot (in log scale) for only five arbitrary samples for different internal digit precisions. On the x-axis, the number of decimal digits retained after one power-10 iteration is shown. It can be seen from the plot that the algorithm produces less error if a higher number of digits is retained. The proposed architecture is based on a 64-bit binary precision, and thus can only handle up to 16 decimal digits. As a result, the error, as seen in the plot, is fixed after 16 digits. The maximum normalized error at this precision is estimated to be $3.53 \times 10^{-14}$.

Figure 10. Normalized error of log converter for different precision (x-axis: number of decimal digits retained; y-axis: normalized error, in log scale)

B. Hardware comparisons

Table II compares the results of the proposed log converter with other similar designs. In the cases where the number of logic cells is not available, we have computed it from the count of slices (e.g., one slice is equivalent to two logic cells). The latency indicates the minimum number of clock cycles required to produce a decimal 1-digit output. In all cases, similar Xilinx FPGA technology is used, and the maximum absolute error along with the number of digits of accuracy is presented.

We start with two binary designs [12][13], which give a rough estimate of the relative comparison between binary and radix-10 designs. Note that the results presented in [12] and [13] are based on the synthesis of HDL code that was originally generated automatically by a C++ program. The work in [7] is based on a curve fitting (linear approximation) algorithm. Due to the use of a look-up table (i.e., ROM mapping), the design takes only 1 cc to produce the output, but the crude approximation algorithm results in a large error (e.g., the maximum absolute error is 0.09 with only 3-digit accuracy). There are 16 partition regions used to achieve such accuracy, and complex error-correction circuitry is required at the end. The work in [17] is based on a digit-recurrence algorithm. With the use of large look-up tables and complex mapping, the error is largely minimized, but at the expense of low operational frequency and reduced throughput (e.g., the latency is 18 cc), which makes the scheme unsuitable for time-critical applications. The decimal digit accuracy is 14, which is still lower than that of the proposed algorithm. [17] also briefly discusses the extension to a 64-bit design, but the cost of an actual FPGA implementation is not reported. As a result, we have estimated the cost from the 7-digit core. In [22], the authors have presented an iterative scheme with a sequential multiplier unit that consumes relatively few resources; however, the sequential nature of the architecture and the unpipelined operation limit the performance by yielding a very large latency and low throughput.

Compared to all existing designs, the proposed scheme has much lower computation error and higher digit accuracy (naturally, as it uses higher precision), lower hardware cost, and a higher frequency of operation. Compared to [22], the hardware cost of the proposed 64-bit scheme is higher (and so is the estimated power consumption) for the following two reasons: (1) use of a combinational multiplier; (2) use of a much higher digit precision. However, the (minimum) latency is 4 cc, which makes the proposed scheme very suitable for time- and accuracy-critical applications. The computation algorithm is generalized and scalable, which means that the architecture can be extended to the decimal128 format without causing a large increase in complexity and hardware cost. This is another advantage of the proposed scheme. As an example, for compliance with the IEEE 754-2008 decimal64 format, [17] requires a significant increase in two LUTs from 14-digit to 34-digit, and a moderate increase in other processing blocks, with latency increasing by at least a factor of two.
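The dependence of the error on the number of retained digits, discussed in the error analysis of this section, can be explored with a small software experiment. The sketch below is an approximation under stated assumptions and is not the authors' Matlab test bench: it re-implements the digit-by-digit iteration with a configurable working precision, truncates after every multiplication, and compares against a 40-digit reference from Python's `Decimal.log10`. The input value, the list of precisions, and the function names are illustrative choices.

```python
from decimal import Decimal, localcontext, ROUND_DOWN


def log10_truncated(p: str, n_digits: int, work_digits: int) -> Decimal:
    """Digit-by-digit log10 of p > 0, keeping work_digits digits per multiply."""
    with localcontext() as ctx:
        ctx.prec = work_digits
        ctx.rounding = ROUND_DOWN
        a = Decimal(p)
        characteristic = a.adjusted()
        a = a.scaleb(-characteristic)       # range-reduce into [1, 10)
        mantissa = 0
        for _ in range(n_digits):
            x2 = a * a
            x4 = x2 * x2
            x8 = x4 * x4
            a = x8 * x2                     # tenth power via four multiplies
            d = a.adjusted()                # next mantissa digit
            mantissa = mantissa * 10 + d
            a = a.scaleb(-d)
    return characteristic + Decimal(mantissa).scaleb(-n_digits)


def error_vs_precision(p: str, n_digits: int = 16,
                       precisions=(6, 10, 16, 22)) -> dict:
    """Absolute error of the sketch for several working precisions."""
    with localcontext() as ctx:
        ctx.prec = 40                       # high-precision reference value
        reference = Decimal(p).log10()
    return {w: abs(log10_truncated(p, n_digits, w) - reference)
            for w in precisions}
```

Calling `error_vs_precision("2.5")` returns one absolute error per working precision. In this sketch the error shrinks as more digits are retained, which is the same qualitative trend the text reports for Fig. 10; the absolute values are not expected to match the hardware figures.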
VI. CONCLUSION

The paper presents a fast algorithm and an efficient implementation for computing the decimal logarithm using 64-bit floating-point arithmetic that complies with the IEEE 754-2008 standard. The algorithm is based on a digit-by-digit iterative computation that does not require look-up tables, curve fitting, decimal-binary conversion, or division algorithms. The final logarithmic output is very accurate, with a maximum absolute error of $3.53 \times 10^{-14}$; no correction or rounding circuitry is required, which makes the scheme suitable for timing- and accuracy-critical applications. The architecture is generalized and scalable, and can be extended to the decimal128 format. Future research is directed towards such an extension, as well as the VLSI implementation of the algorithm.

JOURNAL OF COMPUTERS, VOL. 5, NO. 12, DECEMBER 2010

TABLE II. HARDWARE COMPARISONS OF DIFFERENT LOG CONVERTERS

| Design | Base | Bit length (binary) | Scheme | Slices | Logic cells | Freq. (MHz) | Power (mW) | Latency (cc) | No. of digit accuracy | Max. abs. error |
|---|---|---|---|---|---|---|---|---|---|---|
| Detrey et al. [12]¹ | 2 | 31 | ROM | 881 | 1,762 | 11.4 | -- | 30 | -- | -- |
| Detrey et al. [13]¹ | 2 | 31 | ROM | 1,893 | 2,736 | 14.5 | -- | 16 | -- | -- |
| Dongdong et al. [7] | 10 | 32 | Curve fitting | 999 | 1,998 | 50.9 | 8 | 1 | 3 | $0.9 \times 10^{-1}$ |
| Dongdong et al. [17]² | 10 | 32 | Digit recurrence | 2,842 | 5,684 | 47.7 | -- | 18 | 14 | $0.66 \times 10^{-14}$ |
| Ramin et al. [22]³ | 10 | 32 | Iterative | 526 | 1,053 | 44 | 79 | 40 | 10 | $0.1 \times 10^{-11}$ |
| Dongdong et al. [17] | 10 | 64⁴ | Digit recurrence | 4,603 | 9,602 | -- | -- | 36 | -- | -- |
| Proposed | 10 | 64 | Iterative | 3,376 | 6,752 | 51 | 125.7 | 4 | 16 | $3.53 \times 10^{-14}$ |

¹ Using a Xilinx Virtex-II FPGA (xc2v1000-4) device; ² the max. abs. error reported is based on theoretical calculation; ³ using a Virtex-II FPGA (xc2v1000-6) device; ⁴ the estimated hardware count for 64-bit, since the actual cost was not reported.

ACKNOWLEDGMENT

The authors would like to acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC) for its support of this research work.

REFERENCES

[1] M. Cowlishaw, "Decimal Floating-Point: Algorism for Computers," Proc. of the IEEE Symposium on Computer Arithmetic, pp. 104-111, 2003.
[2] Wikipedia, [Online], Available: http://en.wikipedia.org/wiki/logarithm, November 2009.
[3] IBM Power6, IBM Corporation, May 2007.
[4] J. Mitchell, "Computer Multiplication and Division Using Binary Logarithms," IRE Trans. Electron. Computers, pp. 512-517, 1962.
[5] D. Kostopoulos, "An Algorithm for the Computation of Binary Logarithms," IEEE Trans. on Computers, vol. 40, no. 11, pp. 1267-1270, 1991.
[6] The IEEE Standard for Floating-Point Arithmetic (IEEE 754-2008), IEEE Computer Society, Aug 2008.
[7] D. Chen, Y. Choi, L. Chen, D. Teng, K. Wahid, S. Ko, "A Novel Decimal-to-decimal Logarithmic Converter," Proc. of the IEEE Int. Symposium on Circuits and Systems, pp. 688-691, 2008.
[8] H. Lo and J. Chen, "A Hardwired Generalized Algorithm for Generating the Logarithm Base-k by Iteration," IEEE Trans. Computers, vol. C-36, pp. 1363-1367, 1987.
[9] T. Lang and A. Nannarelli, "A Radix-10 Combinational Multiplier," Proc. of the Asilomar Conference on Signals, Systems and Computers, pp. 313-317, 2006.
[10] H. C. Neto and M. P. Vestias, "Decimal Multiplier on FPGA Using Embedded Binary Multipliers," Proc. of the Int. Conf. on Field Programmable Logic and Applications, pp. 197-202, 2008.
[11] M. Ercegovac, "Radix-16 Evaluation of Certain Elementary Functions," IEEE Trans. on Computers, vol. C-22(6), pp. 561-566, 1973.
[12] J. Detrey, F. Dinechin, and X. Pujol, "Return of the Hardware Floating-Point Elementary Function," Proc. of the IEEE Symposium on Computer Arithmetic, pp. 161-168, 2007.
[13] J. Detrey and F. de Dinechin, "A Parameterizable Floating-point Logarithm Operator for FPGAs," Proc. of the 39th Asilomar Conf. on Signals, Systems & Computers, pp. 1186-1190, 2005.
[14] P. T. P. Tang, "Table-driven Implementation of the Logarithm Function in IEEE Floating-point Arithmetic," ACM Trans. on Mathematical Software, vol. 16(4), pp. 378-400, 1990.
[15] C. Wrathall and T. C. Chen, "Convergence Guarantee and Improvements for a Hardware Exponential and Logarithm Evaluation Scheme," Proc. of the IEEE Symposium on Computer Arithmetic, pp. 175-182, 1978.
[16] W. Wong and E. Goto, "Fast Hardware-based Algorithms for Elementary Function Computations using Rectangular Multipliers," IEEE Trans. on Computers, vol. 43(3), pp. 278-294, 1994.
[17] Dongdong Chen, Yu Zhang, Younhee Choi, Moon Ho Lee, Seok-Bum Ko, "A 32-bit Decimal Floating-Point Logarithmic Converter," Proc. of the IEEE Symposium on Computer Arithmetic, pp. 195-203, 2009.
[18] J. Rebacz, E. Oruklu, and J. Saniie, "Performance Evaluation of Multi-Operand Fast Decimal Adders," Proc. of the IEEE Int. Midwest Symposium on Circuits and Systems, pp. 535-538, 2009.
[19] E. Schwarz and S. Carlough, "Power6 decimal divide," Proc. of the IEEE Int. Conf. on Application-specific Systems, Architectures and Processors, pp. 128-133, 2007.
[20] C. Webb, "IBM z10: The next-generation mainframe microprocessor," IEEE Micro, vol. 28, no. 2, pp. 19-29, 2008.
[21] A. Tsang and M. Olschanowsky, "A study of database 2 customer queries," IBM Santa Teresa Laboratory, San Jose, CA, USA, Tech. Rep. TR-03.413, Apr. 1991.
[22] R. Tajallipour, D. Teng, S-B Ko, and K. Wahid, "On the Fast Computation of Decimal Logarithm," Proc. of the IEEE Int. Conf. on Computer and Information Technology, pp. 32-36, 2009.
[23] BigDecimal, [Online], Available: http://java.sun.com/j2se/1.5.0/docs/api/java/math/bigdecimal.html, November 2009.
[24] R. Kenney and M. Schulte, "High-speed multioperand decimal adders," IEEE Trans. on Computers, vol. 54, no. 8, pp. 953-963, 2005.
[25] I. D. Castellanos and J. E. Stine, "Compressor trees for decimal partial product reduction," Proc. of the 18th ACM Great Lakes Symposium on VLSI, pp. 107-110, 2008.

[26] A. Bayrakci and A. Akkas, "Reduced delay BCD adder," Proc. of the IEEE Int. Conf. on Application-specific Systems, Architectures and Processors, pp. 266-271, 2007.
[27] L. Dadda, "Multioperand parallel decimal adder: A mixed binary and BCD approach," IEEE Trans. on Computers, vol. 56, no. 10, pp. 1320-1328, 2007.
[28] D. Chen, Y. Zhang, D. Teng, K. Wahid, M. Lee, and S-B. Ko, "A New Decimal Antilogarithmic Converter," Proc. of the IEEE Int. Symposium on Circuits and Systems, pp. 445-448, 2009.
[29] A. Duale, M. Decker, H. Zipperer, M. Aharoni, and T. Bohizic, "Decimal Floating-Point in z9: An Implementation and Testing Perspective," IBM Journal of Res. and Dev., Jan 2007.
[30] M. Cowlishaw, "Densely Packed Decimal Encoding," IEEE Computers and Digital Techniques, pp. 102-104, 2002.
[31] The IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754-1985), IEEE Computer Society.

Ramin Tajallipour completed his B.Sc. in Electrical Engineering at Azad South Tehran University in 2006. After graduation, with the help of two others, he built a testing ring instrument that was granted a patent and is in use at SAIPA, a well-known automobile manufacturing company in Iran. In 2007, he joined the Qom power station in Iran as an automation expert and worked there for about 1.5 years. He started his M.Sc. program with Dr. Khan A. Wahid in the Department of Electrical Engineering at the University of Saskatchewan in fall 2008. He is currently working in the Digital Systems Research Group, and his research interests are in the fields of video and image compression and processing, digital design, FPGAs, and real-time embedded systems.

Md. Ashraful Islam received his B.Sc. degree in Electrical and Electronic Engineering from the Bangladesh University of Engineering & Technology, Bangladesh, in 2005. He is currently an M.Sc. candidate in the Electrical & Computer Engineering Department at the University of Saskatchewan. Prior to that, he worked as a Lecturer at Southeast University, Bangladesh, and as a Radio Planner at Grameenphone Ltd, Bangladesh, and Telecom Malaysia in Bangladesh, respectively. He is a member of the Digital Systems Research Group at the College of Engineering, University of Saskatchewan. His research interests include high-performance digital circuits, FPGA and ASIC design, and VLSI architectures for image processing. He has designed and co-designed two IC chips with area- and power-efficient architectures in 0.18 µm CMOS (TSMC) technology, which have been reported in his publications.

Khan A. Wahid earned his B.Sc. degree from Bangladesh University of Engineering and Technology (BUET) in 2000. He received his M.Sc. (2003) and Ph.D. (2007) from the University of Calgary. He was the recipient of numerous prestigious awards and scholarships, including the most distinguished Killam Scholarship and the NSERC Canada Graduate Scholarship for his doctoral research. Dr. Wahid has been working as an Assistant Professor in the Department of Electrical and Computer Engineering at the University of Saskatchewan since July 2007. He has authored over 40 peer-reviewed journal and international conference papers in the fields of digital arithmetic techniques, FPGA and ASIC design, real-time embedded systems, video and image compression, and biomedical imaging systems. He has been serving as a reviewer for the IEEE Transactions on Circuits and Systems for Video Technology, BioMedical Engineering OnLine, EURASIP Journal on Signal Processing, and Elsevier Journal on Computers and Electrical Engineering since 2006. He is a registered Professional Engineer in the province of Saskatchewan, Canada, and a Member of the Institute of Electrical and Electronics Engineers (IEEE).