Some Modular Adders and Multipliers for Field Programmable Gate Arrays


Jean-Luc Beuchat
Laboratoire de l'Informatique du Parallélisme (CNRS, ENSL, INRIA)
46, Allée d'Italie, F-69364 Lyon Cedex 07
Jean-Luc.Beuchat@ens-lyon.fr

Abstract

This paper is devoted to the study of number representations and algorithms leading to efficient implementations of modular adders and multipliers on recent Field Programmable Gate Arrays. Our hardware operators take advantage of the building blocks available in such devices: carry-propagate adders, memory blocks, and sometimes embedded multipliers. The first part of the paper describes three basic methodologies to carry out a modulo $m$ addition and presents in more detail the design of modulo $(2^n \pm 1)$ adders. The major result is a novel modulo $(2^n + 1)$ addition algorithm leading to an area-time efficient implementation of this arithmetic operation on FPGAs. The second part describes a modulo $m$ multiplication algorithm involving small multipliers and memory blocks, and modulo $(2^n + 1)$ multipliers based on Ma's algorithm. We also suggest some improvements of this operator in order to perform a multiplication in the group $(\mathbb{Z}^*_{2^n+1}, \cdot)$.

1 Introduction

Modular arithmetic plays a crucial role in various fields such as residue number system arithmetic or cryptography. Several algorithms for modular addition and multiplication have been proposed and numerous papers describe both theoretical and practical results (see for instance [4, 6, 7]). Those algorithms are generally designed for standard integrated circuits and are based on very low-level basic elements such as NAND or XOR gates. However, recent Field Programmable Gate Arrays (FPGA) embed dedicated carry logic, memory blocks, and sometimes small multipliers. Arithmetic operators taking advantage of these new building blocks could outperform classic architectures. A recent study of unsigned multiplication and division on Virtex-II devices has already shown that embedded multipliers allow significant speed improvements compared to standard solutions based only on Configurable Logic Blocks (CLB) [2].
This paper is devoted to the study of number representations and algorithms leading to efficient implementations of modular adders (Section 2) and multipliers (Section 3) on Virtex-E and Virtex-II devices. Virtex-E and Virtex-II CLBs provide functional elements for synchronous and combinatorial logic. Each CLB includes two (Virtex-E) or four (Virtex-II) slices, each containing basically two 4-input look-up tables (LUT), two storage elements, and fast carry logic dedicated to addition and subtraction. Furthermore, Virtex-E FPGAs incorporate large memory blocks organized in columns. Each block is a fully synchronous dual-ported 4096-bit RAM whose data width can be configured (1, 2, 4, 8 or 16 bits). A Virtex-II device embeds many 18-bit × 18-bit multipliers supporting two independent dynamic data input ports: 18-bit signed or 17-bit unsigned. 18-Kbit true dual-port RAM blocks (called block SelectRAM resources) accepting various data/address aspect ratios are also available. Arithmetic operators dedicated to FPGAs should therefore involve such building blocks.

0-7695-1926-1/03/$17.00 (C) 2003 IEEE

2 Modular Addition

The modulo $m$ addition of two numbers $x$ and $y$ belonging to $\{0, \ldots, m-1\}$ is defined by:

\[ (x + y) \bmod m = \begin{cases} x + y & \text{if } x + y < m, \\ x + y - m & \text{if } x + y \geq m, \end{cases} \tag{1} \]

and can be straightforwardly implemented by an adder, a comparator, and a subtracter. The comparison is however expensive, both in terms of area and delay. The algorithms studied in this section eliminate it and lead to more efficient hardware operators. In this paper, $k = \lfloor \log_2 m \rfloor + 1$ denotes the number of bits required to encode both inputs and output of a modulo $m$ arithmetic operator. There are basically three methodologies to carry out a modulo $m$ addition [3]:

Table-Based Operators. This solution consists in storing in a table the values $(x + y) \bmod m$ for each

pair of inputs $x$ and $y$ (Figure 1a). Its main drawback lies in the exponential growth of the required memory size ($k \cdot 2^{2k}$ bits).

Hybrid Operators. Figure 1b describes a modulo $m$ adder involving a standard binary adder followed by a table which corrects the sum. This architecture reduces the memory requirements from $k \cdot 2^{2k}$ bits to $k \cdot 2^{k+1}$ bits.

Adder-Based Operators. Another way to implement Equation (1) is described by Algorithm 1 and leads to the circuit of Figure 1c. Reference [3] provides for instance a proof of correctness of this method. This architecture requires only two carry-propagate adders and a multiplexer and is therefore well suited for FPGAs.

Algorithm 1 Modulo $m$ addition.
  Choose $j$ such that $2^{j-1} < m < 2^j$
  $s_0 \leftarrow x + y$
  $s_1 \leftarrow (s_0 \bmod 2^j) + 2^j - m$
  if the carry-out bit of $s_0$ or $s_1$ is one then
    $(x + y) \bmod m \leftarrow s_1 \bmod 2^j$
  else
    $(x + y) \bmod m \leftarrow s_0 \bmod 2^j$
  end if

2.1 Modulo $(2^n \pm 1)$ Addition

Some improvements of the adder-based operator previously described are possible for specific values of $m$. For instance, modulo $(2^n - 1)$ addition, or one's complement addition, is defined by [10]:

\[ (x + y) \bmod (2^n - 1) = \begin{cases} (x + y + 1) \bmod 2^n & \text{if } x + y + 1 \geq 2^n, \\ x + y & \text{if } x + y + 1 < 2^n. \end{cases} \tag{2} \]

Figure 1d depicts the architecture of the corresponding hardware operator. Due to the condition $x + y + 1 \geq 2^n$, we perform two additions in parallel and select the correct result with a multiplexer. Remember that zero has a double representation in one's complement, namely $00\ldots0$ and $11\ldots1$ (i.e. $0$ is congruent to $2^n - 1$ modulo $2^n - 1$). If the computation path accommodates the second encoding of zero, Equation (2) can be rewritten as follows:

\[ (x + y) \bmod (2^n - 1) = \begin{cases} (x + y + 1) \bmod 2^n & \text{if } x + y \geq 2^n, \\ x + y & \text{if } x + y < 2^n. \end{cases} \tag{3} \]

Note that the carry-out $c_{out}$ from the sum $x + y$ indicates whether the incrementation must be performed. It is still possible to evaluate $x + y$ and $x + y + 1$ in parallel, and to choose the correct result according to $c_{out}$ (Figure 1e). An alternate architecture, illustrated in Figure 1f, simply adds $c_{out}$ to $x + y$.

Designing a modulo $(2^n + 1)$ adder is a little bit trickier. Such an operator is however useful in a wider range of applications including for instance the modulo $(2^n + 1)$ multiplier of the IDEA block cipher [8].
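The adder-based scheme of Algorithm 1 and the end-around-carry addition of Equation (3) can be modelled in a few lines of software. The following Python sketch is only a behavioural illustration, not the VHDL produced by the author's generator: the function names `add_mod_m` and `add_mod_2n_minus_1` are hypothetical, Python integers stand in for hardware words, and `j` is taken as the bit length of `m` (an assumption that satisfies the condition of Algorithm 1 whenever `m` is not a power of two).

```python
def add_mod_m(x, y, m):
    """Behavioural model of Algorithm 1: two additions and a select, no comparator.

    Assumes 0 <= x, y < m. The name and the choice of j are illustrative.
    """
    j = m.bit_length()                 # 2**(j-1) <= m < 2**j
    mask = (1 << j) - 1
    s0 = x + y                         # first carry-propagate adder
    s1 = (s0 & mask) + (1 << j) - m    # second adder: subtract m via 2**j - m
    carry = (s0 >> j) | (s1 >> j)      # carry-out of s0 or of s1
    return (s1 & mask) if carry else (s0 & mask)


def add_mod_2n_minus_1(x, y, n):
    """Behavioural model of Equation (3): add the carry-out back into the sum.

    Assumes 0 <= x, y <= 2**n - 1. The all-ones word 2**n - 1 may be returned
    as the second encoding of zero.
    """
    mask = (1 << n) - 1
    s = x + y
    return (s + (s >> n)) & mask       # s >> n is the carry-out c_out


if __name__ == "__main__":
    print(add_mod_m(5, 6, 7))          # (5 + 6) mod 7
    print(add_mod_2n_minus_1(6, 5, 3)) # (6 + 5) mod 7 with end-around carry
```

Because the second function accepts and may return the all-ones encoding of zero, its result should be compared with the reference value modulo $2^n - 1$ rather than bit for bit.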
This arithmetic operation is often performed in the diminished-one number system, where a number $x$ is represented by $x' = x - 1$ and the number $0$ is not used or treated as a special case [10]:

\[ (x' + y' + 1) \bmod (2^n + 1) = \begin{cases} (x' + y') \bmod 2^n & \text{if } x' + y' \geq 2^n, \\ (x' + y' + 1) \bmod 2^n & \text{if } x' + y' < 2^n \end{cases} \;=\; (x' + y' + \overline{c_{out}}) \bmod 2^n. \tag{4} \]

Figure 1g and Figure 1h depict two hardware operators performing the modulo $(2^n + 1)$ addition according to this algorithm. The principle of these architectures is the same as for the modulo $(2^n - 1)$ adder.

Let us study the modulo $(2^n + 1)$ addition of two numbers in normal representation. The algorithms described in this paper return the desired result increased by one. Nevertheless, this property facilitates the design of the circuit and can be dealt with in many applications. The modulo $(2^n + 1)$ addition is now defined by:

\[ (x + y + 1) \bmod (2^n + 1) = \begin{cases} 2^n & \text{if } x = 2^n \text{ and } y = 2^n, \\ (x + y) \bmod 2^n + \overline{c_{out}} & \text{if } x + y < 2^{n+1}. \end{cases} \tag{5} \]

Two direct implementations of Equation (5) are illustrated by Figure 1i and Figure 1j [5]. Their main drawback lies in the multiplexer handling the special case where both operands are equal to $2^n$. We suggest here an alternative architecture suppressing the multiplexer. Let us define the $(n+2)$-bit integer $s = s_{n+1} s_n \ldots s_0 = x + y$. The modulo $(2^n + 1)$ addition can be expressed as:

\[ (x + y + 1) \bmod (2^n + 1) = (x + y) \bmod 2^n + s_{n+1} 2^n + \overline{s_{n+1} \vee s_n}. \tag{6} \]

A proof of correctness of this algorithm is provided in Annex A. Figure 1k depicts the resulting hardware operator which requires only two carry-propagate adders and a NOR gate. On Virtex-E or Virtex-II devices, this logic gate is implemented within the carry chain and the modulo $(2^n + 1)$ adder fits into a single CLB column. Note that this operator also deals with numbers in diminished-one representation,

while eliminating the constraint $x, y \neq 0$. The conversion from normal representation to diminished-one number system is now defined by $\xi = (x - 1) \bmod (2^n + 1)$: the number $(0 - 1)$ is congruent to $2^n$ (modulo $(2^n + 1)$).

Figure 1. Several architectures of modulo $m$ adders. (a) Table-based operator. (b) Hybrid operator. (c) Adder-based operator. (d), (e), (f) Modulo $(2^n - 1)$ operators, with single (d) or double (e, f) representation of zero. (g), (h) Modulo $(2^n + 1)$ operators (diminished-one number system). (i), (j), (k) Modulo $(2^n + 1)$ operators (numbers in normal representation).

2.2 Implementation Results

We have written a C program which generates the synthesizable VHDL code of each circuit illustrated in Figure 1. Three parameters select one of the modulo adders, specify the modulus $m$, and insert an optional pipeline stage. We have then conducted a series of experiments with this tool in order to evaluate the area and the delay of each modular adder according to $m$.

All experiments described in this paper were performed on a Sun Microsystems Ultra-10 workstation (440 MHz, 1 GB of memory). All input and output signals were routed through the D-type flip-flops available in the Input/Output blocks of Virtex-E or Virtex-II devices. The optional pipeline stage has not been used. The automatically generated VHDL code was synthesized using Synplify Pro 7.0.3 and implemented on Virtex-E and Virtex-II devices employing Xilinx Alliance Series 4.1.03i. No specific constraints were given to the synthesis tools and it should therefore be possible to improve the results.

Our first experiment aims to compare four architectures of a modulo $(2^n - 1)$ adder (Table 1). The operators depicted by Figure 1d and Figure 1e do not significantly improve the adder-based operator defined by Algorithm 1.
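As an aside on the new modulo $(2^n + 1)$ addition of Section 2.1, Equation (6) is small enough to be checked exhaustively in software for small $n$. The Python sketch below is a behavioural model under stated assumptions, not the generated VHDL: the function name `add_plus_one_mod_2n_plus_1` is hypothetical, and plain integers stand in for the $(n+2)$-bit sum $s = s_{n+1} s_n \ldots s_0$.

```python
def add_plus_one_mod_2n_plus_1(x, y, n):
    """Behavioural model of Equation (6).

    Returns (x + y + 1) mod (2**n + 1) for 0 <= x, y <= 2**n,
    using one sum, its two top bits, and a NOR of those bits.
    """
    s = x + y                           # (n+2)-bit sum s_{n+1} s_n ... s_0
    s_n = (s >> n) & 1
    s_n1 = (s >> (n + 1)) & 1
    nor = 1 - (s_n1 | s_n)              # NOR gate on the two top bits
    low = s & ((1 << n) - 1)            # (x + y) mod 2**n
    return low + (s_n1 << n) + nor


if __name__ == "__main__":
    n = 4
    # Both operands equal to 2**n: the special case handled without a multiplexer.
    print(add_plus_one_mod_2n_plus_1(16, 16, n))
    # Conversion to the diminished-one system: xi = (x - 1) mod (2**n + 1),
    # obtained by adding 2**n - 1, since x + (2**n - 1) + 1 is congruent to x - 1.
    print(add_plus_one_mod_2n_plus_1(5, (1 << n) - 1, n))
```

The second call illustrates how the same operator can serve as a converter to the diminished-one system, since adding $2^n - 1$ and the built-in increment amounts to subtracting one modulo $2^n + 1$.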
The last modulo $(2^n - 1)$ adder described in this paper (Figure 1f) does not require a multiplexer and is therefore smaller. The delay of the four circuits is comparable. This experiment illustrates that a peculiar number encoding (i.e. the double representation of zero) can sometimes lead to a better hardware implementation of an arithmetic operator.

Table 2 describes the main characteristics of some modulo $(2^n + 1)$ adders on a Virtex-E device. Due to the required memory size, the hybrid operator (Figure 1b) is rather unattractive and is limited to small moduli ($n \leq 8$). Note that the table-based method works only if $n \leq 5$. For $n = 4$, the operator requires for instance two 4096-bit RAM blocks and four slices, and its delay is equal to 4.3 ns. For $n = 7$, 529 slices are needed and the delay is then equal to 44.5 ns. This experiment also shows that our new modulo $(2^n + 1)$ addition algorithm leads to the smallest circuits.

3 Modular Multiplication

A basic modulo $m$ multiplication algorithm consists in computing $w = xy$, where $x, y < m$, and dividing this product by $m$. Since division is hard to perform, several

algorithms have been proposed to overcome this problem. Reference [6] provides a good bibliography on this subject. All these solutions are however dedicated to VLSI implementations; consequently, we propose here a study of some modular multipliers based on the building blocks available in recent FPGAs. Analogously to addition, modulo $m$ multiplication can be implemented by means of tables (Figure 2a). This approach is however limited to small moduli due to the exponential growth of the required memory, and other architectures must be investigated.

Table 1. Comparison of some modulo $(2^n - 1)$ adders on a XCV50E-6 device.

n | 4 | 8 | 12 | 16 | 20 | 24 | 28 | 32
Fig. 1c, area [slices] | 6 | 3 | 9 | 25 | 3 | 37 | 43 | 49
Fig. 1c, delay [ns] | 9.4 | .5 | 2.3 | 3.4 | 3.7 | 4.3 | 4.5 | 4.9
Fig. 1d, area [slices] | 6 | 2 | 8 | 24 | 3 | 36 | 42 | 48
Fig. 1d, delay [ns] | 8. | 9. | 9.4 | .2 | .9 | . | 2.6 | 3.3
Fig. 1e, area [slices] | 6 | 2 | 8 | 24 | 3 | 36 | 42 | 48
Fig. 1e, delay [ns] | 8. | 8.6 | 9.4 | 9.7 | .9 | .5 | 2.3 | 3.
Fig. 1f, area [slices] | 5 | 8 | 2 | 6 | 2 | 24 | 28 | 32
Fig. 1f, delay [ns] | 8.4 | 8.8 | 9.4 | .7 | 4.8 | 3.7 | 5.5 | 4.2

Table 2. Comparison of some modulo $(2^n + 1)$ adders on a XCV50E-6 device.

n | 4 | 8 | 12 | 16 | 20 | 24 | 28 | 32
Fig. 1b, area [slices] | | 2 | | | | | |
Fig. 1b, delay [ns] | .2 | 2.5 | | | | | |
Fig. 1c, area [slices] | 7 | 5 | 2 | 27 | 33 | 39 | 45 | 5
Fig. 1c, delay [ns] | 9.4 | .8 | 2.2 | 3. | 3.3 | 3.6 | 4.3 | 6.9
Fig. 1j, area [slices] | 6 | 3 | 9 | 25 | 3 | 37 | 43 | 49
Fig. 1j, delay [ns] | 8.7 | .6 | 2.2 | 3.6 | 4.4 | 4.9 | 5.7 | 8.7
Fig. 1k, area [slices] | 6 | | 5 | 9 | 23 | 27 | 3 | 35
Fig. 1k, delay [ns] | 7.8 | .6 | .6 | 3.4 | 4.5 | 3.8 | 4.7 | 9.2

3.1 Multiplication with Subsequent Modulo Correction

This modulo $m$ multiplication scheme is dedicated to FPGAs embedding small multipliers and memory blocks. The principle consists in computing the $2k$-bit wide product $xy$ and then performing a modulo correction by means of a table and a modulo $m$ addition.
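Given a number $2^j$, the algorithm is described as follows:

\[ (xy) \bmod m = \bigl( (xy) \bmod 2^j + 2^j \cdot ((xy) \operatorname{div} 2^j) \bigr) \bmod m = \begin{cases} \bigl( (xy) \bmod 2^j + ( -(m - 2^j) \cdot ((xy) \operatorname{div} 2^j) ) \bmod m \bigr) \bmod m & \text{if } m > 2^j, \\ \bigl( (xy) \bmod 2^j + ( (2^j - m) \cdot ((xy) \operatorname{div} 2^j) ) \bmod m \bigr) \bmod m & \text{if } m < 2^j. \end{cases} \tag{7} \]

The case where $m = 2^j$ is straightforward and will not be addressed in this paper. This scheme requires an unsigned multiplier, a memory to store all possible values of $( -(m - 2^j) \cdot ((xy) \operatorname{div} 2^j) ) \bmod m$ or $( (2^j - m) \cdot ((xy) \operatorname{div} 2^j) ) \bmod m$, and a modulo $m$ adder. Let us define $p_0 = (xy) \bmod 2^j$ and $p_1 = ( -(m - 2^j) \cdot ((xy) \operatorname{div} 2^j) ) \bmod m$ (or $p_1 = ( (2^j - m) \cdot ((xy) \operatorname{div} 2^j) ) \bmod m$). We now have to distinguish the two following cases:

- For $m > 2^j$, $0 \leq p_0, p_1 \leq m - 1$ and we deduce that $p_0 + p_1 \leq 2m - 2$. The final addition can therefore be performed with the modulo $m$ adder described in Section 2 (Figure 2b).

- For $m < 2^j$, we have $0 \leq p_0 < 2^j$, $0 \leq p_1 < m$, and $p_0 + p_1 \leq 2^j + m - 2$. The architecture of the modulo $m$ multiplier depends on $2^j + m - 2$. If this value is strictly smaller than $2m$, the operator is defined by:

\[ (xy) \bmod m = \begin{cases} p_0 + p_1 & \text{if } p_0 + p_1 < m, \\ p_0 + p_1 - m & \text{if } m \leq p_0 + p_1 < 2m. \end{cases} \tag{8} \]

From $2^j + m - 2 < 2m$, we deduce that $2^j \leq m + 1$. Since $m < 2^j$, Equation (8) holds iff $2^j = m + 1$. For other values of $2^j$, the modulo $m$ multiplication is formulated as:

\[ (xy) \bmod m = \begin{cases} p_0 + p_1 & \text{if } p_0 + p_1 < m, \\ p_0 + p_1 - m & \text{if } m \leq p_0 + p_1 < 2m, \\ p_0 + p_1 - 2m & \text{if } p_0 + p_1 \geq 2m. \end{cases} \]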
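The multiplication with subsequent modulo correction can be sketched in software as follows. This is a minimal behavioural model, not the author's hardware operator: the name `make_mul_mod_m` is hypothetical, a Python list stands in for the memory block, and `j` is assumed to be either $k - 1$ or $k$ (with $k$ the bit length of $m$), which keeps $p_0 + p_1$ below $3m$ so that the three-case correction above suffices.

```python
def make_mul_mod_m(m, j):
    """Build a modulo m multiplier following Equation (7).

    Assumes m is not a power of two and j is k - 1 or k, where k = m.bit_length().
    The returned function expects 0 <= x, y < m.
    """
    k = m.bit_length()
    assert j in (k - 1, k) and (1 << j) != m
    hi_max = ((m - 1) * (m - 1)) >> j            # largest value of (xy) div 2**j
    # Table contents: ((2**j - m) * h) mod m. Python's % yields a non-negative
    # result, so the case m > 2**j, written (-(m - 2**j) * h) mod m, is covered.
    table = [(((1 << j) - m) * h) % m for h in range(hi_max + 1)]

    def mul(x, y):
        w = x * y                                # unsigned multiplier
        p0 = w & ((1 << j) - 1)                  # (xy) mod 2**j
        p1 = table[w >> j]                       # table look-up on (xy) div 2**j
        s = p0 + p1
        if s >= 2 * m:                           # only reachable when m < 2**j
            return s - 2 * m
        if s >= m:
            return s - m
        return s

    return mul


if __name__ == "__main__":
    mul13 = make_mul_mod_m(13, 4)                # m = 13 < 2**4
    print(mul13(7, 11))                          # (7 * 11) mod 13
```

In this model the table has one entry per possible value of $(xy) \operatorname{div} 2^j$; a hardware table would instead be sized by the address width, which is what limits $m$ when a single memory block is used.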

Figure 2c illustrates the corresponding hardware operator.

3.2 Modulo $(2^n + 1)$ Multiplication

Modulo $(2^n + 1)$ multiplication can be performed according to Equation (7). If the target FPGA does not embed small multipliers, the implementation of this scheme is however expensive. We propose an alternate architecture based on the algorithm described by Ma in [9]. Assume that $\psi = \psi_n \psi_{n-1} \ldots \psi_0$ is the diminished-one representation of $y$, i.e. $\psi = (y - 1) \bmod (2^n + 1)$. When $n$ is even, Ma has proved that:

\[ xy \bmod (2^n + 1) = \Bigl( \sum_{i=0}^{n/2 - 1} P_i + \frac{n}{2} \Bigr) \bmod (2^n + 1), \tag{9} \]

where $P_i = \bigl( 2^{2i} x ( -2\psi_{2i+1} + \psi_{2i} + \psi_{2i-1} ) - 1 \bigr) \bmod (2^n + 1)$ and $\psi_{-1} = \overline{\psi_n \vee \psi_{n-1}}$. Each partial product $P_i$ can be easily computed from the diminished-one representation of $x$. When $n$ is odd, the product $xy \bmod (2^n + 1)$ is given by:

\[ xy \bmod (2^n + 1) = \Bigl( \underbrace{\bigl( 2^{n-1} x ( \psi_{n-1} + \psi_{n-2} ) - 1 \bigr) \bmod (2^n + 1)}_{P_{(n-1)/2}} + \sum_{i=0}^{(n-3)/2} P_i + \frac{n-1}{2} + 1 \Bigr) \bmod (2^n + 1), \tag{10} \]

where $\psi_{-1} = \overline{\psi_n}$. Ma computes the sum of the partial products and the constant with a carry-save adder, then performs a modulo $(2^n + 1)$ reduction with two modulo $(2^n + 1)$ carry-save adders and one modulo $(2^n + 1)$ carry-propagate adder [9]. This architecture does not take advantage of the fast carry logic available in FPGAs and we suggest here an implementation of Equations (9) and (10) based on our new modulo $(2^n + 1)$ adder described in Section 2. Both equations require summing $\lceil n/2 \rceil$ partial products $P_i$ and the constant $\lceil n/2 \rceil$. Remember that our modulo $(2^n + 1)$ adder returns the sum of its two operands increased by one. Consequently, if we compute the sum $\sum_{i=0}^{\lceil n/2 \rceil - 1} P_i$ with $(\lceil n/2 \rceil - 1)$ such adders, we obtain $\bigl( \sum_{i=0}^{\lceil n/2 \rceil - 1} P_i + \lceil n/2 \rceil - 1 \bigr) \bmod (2^n + 1)$, which is the diminished-one representation of the product. Figure 3a depicts the corresponding hardware operator which takes as inputs the diminished-one representations of $x$ and $y$, and returns $(xy) \bmod (2^n + 1)$. Since $(x + (2^n - 1) + 1) \bmod (2^n + 1) = (x - 1) \bmod (2^n + 1)$, the conversion from unsigned integer to diminished-one number system can be achieved with our modulo $(2^n + 1)$ adder. The inverse conversion is performed with a carry-propagate adder (Figure 3a): it is easy to verify that $(a + 1) \bmod (2^n + 1) = \sum_{i=0}^{n-1} a_i 2^i + \overline{a_n}$ when $a \in \{0, \ldots, 2^n\}$.

Let us consider the multiplication in $\mathbb{Z}^*_{2^n+1} = \{ a \in \mathbb{Z}_{2^n+1} \mid \gcd(a, 2^n + 1) = 1 \}$ (i.e.
the multiplicative group of $\mathbb{Z}_{2^n+1}$). Since $(\mathbb{Z}^*_{2^n+1}, \cdot)$ is a group, we know that $(xy) \bmod (2^n + 1) \neq 0$ and it is therefore possible to represent the number $2^n$ by $0$. This trick saves one bit and allows two improvements of the multiplier based on Ma's algorithm:

- Due to the special encoding of $2^n$, the diminished-one representation of a number $x \in \mathbb{Z}^*_{2^n+1}$ is $(x - 1) \bmod 2^n$. We obtain for instance $(0 - 1) \bmod 2^n = 2^n - 1$, which is the diminished-one representation of $2^n$. It is easy to check that $(x - 1) \bmod (2^n + 1) = (x - 1) \bmod 2^n$ when $1 \leq x \leq 2^n - 1$.

- The conversion from diminished-one number system to unsigned integer no longer requires an additional stage (Figure 3b). It is indeed possible to modify the last adder of the tree in order to compute $(a + b + 2) \bmod (2^n + 1)$ according to:

\[ (a + b + 2) \bmod (2^n + 1) = \Bigl( \sum_{i=0}^{n-1} s_i 2^i + \overline{s_{n+1} \vee s_n} \Bigr) \bmod 2^n, \tag{11} \]

where $s = s_{n+1} s_n \ldots s_0 = a + b + 1$. The proof of correctness of this algorithm is straightforward.

Multiplication in $\mathbb{Z}^*_{2^{16}+1}$ is for instance the critical operation of the IDEA block cipher [8]. Several modulo $(2^n + 1)$ multipliers have consequently been investigated over the past years (see for instance [1, 4, 10]). Another implementation of Ma's algorithm has been proposed by Hämäläinen et al. in [5]. This architecture is also based on carry-propagate adders. However, modulo $(2^n + 1)$ additions are carried out by the circuit of Figure 1j, and an additional stage performs the conversion. This modular multiplier is therefore larger and slower than ours.

Another way to implement Equation (9) or Equation (10) consists in computing the sum $s = \sum_{i=0}^{\lceil n/2 \rceil - 1} P_i + \lceil n/2 \rceil - 1$ with a carry-propagate adder tree, then in performing a modulo $(2^n + 1)$ correction. We assume that $n$ is even and define $s_0 = s \bmod 2^n$ and $s_1 = s \operatorname{div} 2^n$. Since $s_0 \leq 2^n - 1$ and $s_1 \leq n/2$, and writing $\overline{s_1} = 2^n - 1 - s_1$ for the $n$-bit complement of $s_1$, we obtain:

\[ \begin{aligned} s \bmod (2^n + 1) &= (s_0 + 2^n s_1) \bmod (2^n + 1) \\ &= (s_0 - s_1) \bmod (2^n + 1) \\ &= (s_0 + 2^n - s_1 + 1) \bmod (2^n + 1) \\ &= (s_0 + \overline{s_1} + 2) \bmod (2^n + 1) \\ &= \begin{cases} s_0 + \overline{s_1} + 2 & \text{if } s_0 + \overline{s_1} + 1 < 2^n, \\ s_0 + \overline{s_1} + 1 - 2^n & \text{if } s_0 + \overline{s_1} + 1 \geq 2^n, \end{cases} \end{aligned} \]

which is the diminished-one representation of $(xy) \bmod (2^n + 1)$. Figure 4a depicts the corresponding hardware operator. Small improvements are again possible when $x, y \in \mathbb{Z}^*_{2^n+1}$. The conversion from normal representation to the diminished-one number system is exactly the same as for the operator based on modulo $(2^n + 1)$ adders (Figure 3b). Note finally that $s \bmod (2^n + 1) \leq 2^n - 1$. Due to the special encoding of zero, $(xy) \bmod (2^n + 1) = (s \bmod (2^n + 1) + 1) \bmod 2^n$ and we perform the conversion by setting the input carry of the final adder to one (Figure 4b).

Figure 2. Three architectures of modulo $m$ multipliers. (a) Table-based operator. (b) Multiplication with subsequent modulo correction ($m > 2^j$). (c) Multiplication with subsequent modulo correction ($m < 2^j$).

Figure 3. Architectures of a modulo $(2^n + 1)$ multiplier based on Ma's algorithm. (a) Diminished-one number system. (b) Multiplication in $\mathbb{Z}^*_{2^n+1}$.

3.3 Implementation Results

This section describes the main characteristics of some modulo $m$ multipliers studied in this paper. The experimental setup is the same as for modular adders.

Table 3 summarizes the main characteristics of modulo $m$ multipliers based on a multiplication with subsequent modulo correction for Virtex-II devices. We only consider here operators requiring a single 18-Kbit memory block, which defines the maximum value for $m$. Remember that $k = \lfloor \log_2 m \rfloor + 1$ denotes the number of bits required to encode $m$. Choosing $j = k$, so that $m < 2^k$, Equation (7) yields: $(xy) \bmod m = \bigl( (xy) \bmod 2^k + ( (2^k - m) \cdot ((xy) \operatorname{div} 2^k) ) \bmod m \bigr) \bmod m$. The table is addressed by the $k$-bit word $(xy) \operatorname{div} 2^k$ and also returns a $k$-bit number.
Consequently, the block SelectRAM is configured in the 1K × 18-bit mode (i.e. 10 address bits and 18 data bits) and the modulus $m$ lies between 3 and 1023.

Table 4 summarizes the area and the delay of several modulo $(2^n + 1)$ multipliers when $x, y \in \mathbb{Z}^*_{2^n+1}$. In this experiment, we compare the two architectures discussed in this paper (Figures 3b and 4b) to an operator based on the modified Low-High algorithm proposed in [1]. This circuit involves an unsigned multiplier, a multiplexer to handle the special cases where $x = 0$ or $y = 0$, and a modulo correction (Figure 4c). Our experiments illustrate that:

- The architecture based on a carry-propagate adder tree and a modulo $(2^n + 1)$ correction seems to be the better implementation of Ma's algorithm for FPGAs. This result is not surprising: note that our modulo $(2^n + 1)$ adder is roughly twice as large as an $(n + 1)$-bit carry-propagate adder. Consequently, the two modulo $(2^n + 1)$ multipliers previously described respectively

require $2 \lceil n/2 \rceil$ (Figure 3b) and $\lceil n/2 \rceil + 2$ (Figure 4b) carry-propagate adders to sum the partial products.

- The modified Low-High algorithm leads to smaller circuits when the modulo $(2^n + 1)$ multiplier is purely combinatorial. The $n$-bit × $n$-bit unsigned multiplier sums the partial products $P_i = 2^i x y_i$, $i \in \{0, \ldots, n - 1\}$, with a tree of carry-propagate adders. This architecture takes advantage of the dedicated AND gate associated with each LUT in order to generate the partial products. Although Ma's algorithm reduces the number of partial products, their generation involves many more hardware resources (LUTs and multiplexers). However, when pipeline stages are inserted to reduce the delay, the circuit illustrated in Figure 4b is attractive for $n \geq 28$. This result is explained by the fact that pipelining the unsigned multiplier of the operator based on the modified Low-High algorithm is expensive.

Figure 4. Two other architectures of a modulo $(2^n + 1)$ multiplier based on Ma's algorithm and the architecture based on the modified Low-High algorithm described in [1]. (a) Diminished-one number system. (b) Multiplication in $\mathbb{Z}^*_{2^n+1}$. (c) Modified Low-High algorithm.

4 Conclusion

We have described and compared several modular adders and multipliers involving various building blocks (carry-propagate adders, tables, and small multipliers). Our main results include the design of a new modulo $(2^n + 1)$ adder, a modulo $m$ multiplier based on the embedded multipliers and memory blocks available in Virtex-II devices, and implementations of Ma's algorithm carefully optimized for FPGAs. Our experiments indicate that the choice of an operator depends on several parameters such as the modulus $m$, the target FPGA family, and the number of internal pipeline stages.
However, our VHDL generators make it possible to quickly explore a wide parameter space and to determine which architecture is most appropriate for a given application.

Acknowledgments

The author would like to thank the Ministère Français de la Recherche (grant # 48 CDR ACI jeunes chercheurs), the Swiss National Science Foundation, and the Xilinx University Program for their support.

A Proof of the New Modulo $(2^n + 1)$ Addition Algorithm

Let us demonstrate that the algorithm defined by Equation (6) carries out $(x + y + 1) \bmod (2^n + 1)$ when $0 \leq x, y \leq 2^n$. First of all, let us note that $x + y \leq 2^{n+1}$ and

\[ (x + y + 1) \bmod (2^n + 1) = \begin{cases} x + y + 1 & \text{if } x + y < 2^n, \\ x + y - 2^n & \text{if } x + y \geq 2^n. \end{cases} \]

Table 3. Multiplication with subsequent modulo correction on a XC2V4-6 device. Each operator requires a small multiplier and a single 18-Kbit memory block.

m | 5 | 13 | 29 | 61 | 125 | 253 | 509 | 1021
Area [slices] | 6 | 8 | 2 | 3 | 5 | 6 | 2 | 9
Delay [ns] | 5.6 | 8.5 | 8.7 | 9.2 | 9.5 | .2 | 9.5 | 9.7

Table 4. Modulo $(2^n + 1)$ multiplication in $\mathbb{Z}^*_{2^n+1}$ on a XCV2E-6 device.

n | 4 | 8 | 12 | 16 | 20 | 24 | 28 | 32
Figure 3b (without pipeline), area [slices] | 4 | 6 | 4 | 254 | 398 | 575 | 885 | 56
Figure 3b (without pipeline), delay [ns] | .6 | 2.5 | 3.9 | 35.5 | 47.3 | 5.8 | 56.6 | 55.8
Figure 3b (with pipeline), area [slices] | 5 | 66 | 56 | 264 | 433 | 63 | 4 | 3
Figure 3b (with pipeline), delay [ns] | 6.7 | 8.9 | 9.7 | .8 | 3.3 | 2.4 | 5. | 5.
Figure 4b (without pipeline), area [slices] | 5 | 57 | 23 | 22 | 325 | 46 | 725 | 939
Figure 4b (without pipeline), delay [ns] | .2 | 8.8 | 24. | 28.7 | 34.8 | 35.9 | 39. | 42.6
Figure 4b (with pipeline), area [slices] | 8 | 63 | 38 | 223 | 362 | 494 | 856 | 89
Figure 4b (with pipeline), delay [ns] | 5.8 | 8.5 | 9.9 | .6 | .9 | .5 | 2. | 3.2
Figure 4c (without pipeline), area [slices] | 9 | 53 | | 82 | 27 | 372 | 492 | 629
Figure 4c (without pipeline), delay [ns] | 6. | 2.8 | 27.6 | 3. | 34.5 | 37.9 | 39.3 | 44.2
Figure 4c (with pipeline), area [slices] | 25 | 77 | 39 | 22 | 353 | 46 | 592 | 743
Figure 4c (with pipeline), delay [ns] | 7. | 9.4 | .4 | 2.6 | 2.8 | 4. | 3.5 | 5.9

We have to distinguish the three following cases to establish the correctness of our algorithm:

- For $x + y = 2^{n+1}$ (i.e. $x = y = 2^n$), we have $(x + y) \bmod 2^n = 0$, $s_n = 0$, and $s_{n+1} = 1$. Our algorithm returns $(x + y + 1) \bmod (2^n + 1) = 2^n + 0 = 2^n$, which is the correct result.

- For $2^n \leq x + y < 2^{n+1}$, we know that $s_{n+1} = 0$ and $s_n = 1$. Consequently, $(x + y + 1) \bmod (2^n + 1) = (x + y) \bmod 2^n = x + y - 2^n$.

- Finally, for $x + y < 2^n$, $s_{n+1} = s_n = 0$ and $(x + y) \bmod 2^n = x + y$. We obtain $(x + y + 1) \bmod (2^n + 1) = x + y + 1$.

References

[1] J.-L. Beuchat. Modular Multiplication for FPGA Implementation of the IDEA Block Cipher. Technical Report 2002-32, Laboratoire de l'Informatique du Parallélisme, Ecole Normale Supérieure de Lyon, 46 Allée d'Italie, 69364 Lyon Cedex 07, Sept. 2002.

[2] J.-L. Beuchat and A. Tisserand. Small Multiplier-based Multiplication and Division Operators for Virtex-II Devices. In M.
Glesner, P. Zipf, and M. Renovell, editors, Field-Programmable Logic and Applications: Reconfigurable Computing Is Going Mainstream, number 2438 in Lecture Notes in Computer Science, pages 513-522. Springer, 2002.

[3] A. V. Curiger. VLSI Architectures for Computations in Finite Rings and Fields, volume 26 of Series in Microelectronics. Hartung-Gorre Verlag, 1993.

[4] A. V. Curiger, H. Bonnenberg, and H. Kaeslin. Regular VLSI Architectures for Multiplication Modulo $(2^n + 1)$. IEEE Journal of Solid-State Circuits, 26(7):990-994, 1991.

[5] A. Hämäläinen, M. Tommiska, and J. Skyttä. 6.78 Gigabits per Second Implementation of the IDEA Cryptographic Algorithm. In M. Glesner, P. Zipf, and M. Renovell, editors, Field-Programmable Logic and Applications: Reconfigurable Computing Is Going Mainstream, number 2438 in Lecture Notes in Computer Science, pages 760-769. Springer, 2002.

[6] A. A. Hiasat. New Efficient Structures for a Modular Multiplier for RNS. IEEE Transactions on Computers, 49(2):170-174, 2000.

[7] A. A. Hiasat. High-Speed and Reduced-Area Modular Adder Structures for RNS. IEEE Transactions on Computers, 51(1):84-89, 2002.

[8] X. Lai. On the Design and Security of Block Ciphers. ETH Series in Information Processing. Hartung-Gorre Verlag Konstanz, 1992.

[9] Y. Ma. A Simplified Architecture for Modulo $(2^n + 1)$ Multiplication. IEEE Transactions on Computers, 47(3):333-337, 1998.

[10] R. Zimmermann. Efficient VLSI Implementation of Modulo $(2^n \pm 1)$ Addition and Multiplication. In Proceedings of the 14th IEEE Symposium on Computer Arithmetic, pages 158-167, Adelaide, Australia, April 1999.