High-Performance Pipelined Architecture of Elliptic Curve Scalar Multiplication Over GF(2^m)


Abstract:

This paper proposes an efficient pipelined architecture for elliptic curve scalar multiplication (ECSM) over GF(2^m). The architecture uses a bit-parallel finite field (FF) multiplier accumulator (MAC) based on the Karatsuba-Ofman algorithm. The Montgomery ladder algorithm is modified for better sharing of execution paths. The data path is designed so that the critical path contains few extra logic primitives apart from the FF MAC. To find the optimal number of pipeline stages, scheduling schemes with different numbers of pipeline stages are proposed, and the placement of pipeline registers is analyzed. We implement ECSM over the five binary fields recommended by the National Institute of Standards and Technology (NIST) on Xilinx Virtex-4 and Virtex-5 field-programmable gate arrays (FPGAs). The three-stage pipelined architecture is shown to have the best performance, achieving a scalar multiplication over GF(2^163) in 6.1 μs using 7354 slices on Virtex-4. On Virtex-5, scalar multiplication for m = 163, 233, 283, 409, and 571 takes 4.6, 7.9, 10.9, 19.4, and 36.5 μs, respectively, which is faster than previous results. This project analyzes the logic size, area, and power consumption of the proposed architecture using Xilinx ISE 14.2.

Enhancement of the project:

Existing System:

Elliptic curve scalar multiplication (ECSM) is the key operation that dominates the performance of an ECC cryptosystem. Various architectures have been proposed to speed up ECSM. Most of them exploit pipelining and parallelism to raise the working frequency and to reduce the number of clock cycles required for ECSM. Leong and Leung developed a microcoded elliptic curve processor supporting ECSM over GF(2^m) for arbitrary m. Sakiyama et al. proposed a superscalar coprocessor that accelerated ECSM by exploiting instruction-level parallelism (ILP) dynamically. A pipelined application-specific instruction set processor for ECC was also proposed, which performed ECSM over GF(2^163) in 19.55 μs on a Xilinx XC4VLX200. Other designs implemented high-speed scalar multiplication over special classes of curves, such as Koblitz curves, binary Edwards curves, and Hessian curves. In this paper, we focus on optimizing ECSM over generic curves in GF(2^m).

Some designs duplicate arithmetic blocks to maximize the parallelism in ECSM. For GF(2^163), Kim et al. used three Gaussian normal basis multipliers to achieve ECSM in 10 μs on a Xilinx XC4VLX80. Zhang et al. developed three finite-field (FF) cores and a main controller to achieve ECSM in 7.7 μs on a Xilinx XC4VLX80. The best such design performed ECSM in 5.5 μs on a Xilinx Virtex-5 using three digit-serial FF multipliers and one FF divider. Despite their high speed, these designs require massive logic resources and are therefore not practical for FPGA implementation.

Considering the tradeoff between area and speed, many designs use word-serial or digit-serial FF multipliers to implement ECSM. These designs usually require a large number of clock cycles for a scalar multiplication. Ansari and Hasan proposed an efficient scheme that kept the pseudopipelined word-serial FF multiplier working without idle cycles; a scalar multiplication over GF(2^163) costs 4050 clock cycles and 21 μs on a Xilinx XC4VLX200. FF multipliers with different word sizes (w) were also developed, and the best design, with w = 55, performed ECSM over GF(2^163) in 2751 clock cycles and 9.6 μs on a Xilinx XC4VLX200.

Disadvantages:

Area usage is high.

Performance is slow.

Proposed System:

Data Dependence Analysis of ECSM

The modified Montgomery ladder scalar multiplication takes m(6M + 5S + 3A) + (11M + 5A + I) operations in total, where M, S, A, and I denote multiplication, squaring, addition, and inversion in GF(2^m), respectively, and m is the dimension of the binary field GF(2^m). The original Montgomery ladder scalar multiplication requires (m - 1)(6M + 5S + 3A) + (10M + 7A + 3S + I) operations. The additional operations come from the merged initialization and the modified postprocessing, which allow better sharing of the data path with the main loop. Because squaring and addition are much cheaper than multiplication, and inversion occurs only once, optimizing the operations in the main loop, especially the FF multiplication, is the key to high-performance ECSM.

Fig. 1. Data dependence graph of (a) point addition and (b) point doubling in the Montgomery ladder algorithm.
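The operation counts above can be checked with a small software model. The sketch below is not the paper's hardware design: it assumes the standard Lopez-Dahab projective formulas for the Montgomery ladder over binary fields, the NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1 for GF(2^163), and hypothetical helper names (ff_add, ff_mul, ff_sqr, ladder_step, total_ops). It tallies the field operations of one point addition plus one point doubling, and evaluates the two total-count formulas from the text.

```python
from collections import Counter

M_BITS = 163
# Assumed field: NIST GF(2^163), f(x) = x^163 + x^7 + x^6 + x^3 + 1
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

ops = Counter()  # tallies M (multiply), S (square), A (add)


def _mul_mod(a, b):
    """Shift-and-add multiplication in GF(2^163) with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M_BITS:      # degree reached 163: subtract (XOR) f(x)
            a ^= POLY
    return r


def ff_add(a, b):
    ops["A"] += 1
    return a ^ b             # addition in GF(2^m) is a bitwise XOR


def ff_mul(a, b):
    ops["M"] += 1
    return _mul_mod(a, b)


def ff_sqr(a):
    ops["S"] += 1
    return _mul_mod(a, a)


def ladder_step(X1, Z1, X2, Z2, x, b):
    """One ladder iteration: returns (X, Z) of P1 + P2 and of 2*P1.

    x is the affine x-coordinate of the base point, b the curve constant.
    """
    # Point addition: 4M + 1S + 2A
    t1 = ff_mul(X1, Z2)
    t2 = ff_mul(X2, Z1)
    Za = ff_sqr(ff_add(t1, t2))
    Xa = ff_add(ff_mul(x, Za), ff_mul(t1, t2))
    # Point doubling: 2M + 4S + 1A
    xs = ff_sqr(X1)
    zs = ff_sqr(Z1)
    Zd = ff_mul(xs, zs)
    Xd = ff_add(ff_sqr(xs), ff_mul(b, ff_sqr(zs)))
    return Xa, Za, Xd, Zd


def total_ops(m, modified=True):
    """Totals from the formulas in the text; I is the single inversion."""
    if modified:
        return {"M": 6 * m + 11, "S": 5 * m, "A": 3 * m + 5, "I": 1}
    return {"M": 6 * (m - 1) + 10, "S": 5 * (m - 1) + 3,
            "A": 3 * (m - 1) + 7, "I": 1}
```

Calling ladder_step once on any inputs leaves ops at 6 multiplications, 5 squarings, and 3 additions. For m = 163, total_ops gives 989M + 815S + 494A + I for the modified ladder against 982M + 813S + 493A + I for the original, so the modification costs only a handful of extra operations.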

Each iteration of the main loop performs a point addition and a point doubling, which together take 6M + 5S + 3A. The data dependence of point addition and doubling in the Montgomery ladder algorithm is shown in Fig. 1. The critical path lies in calculating the X-coordinate of the point addition, which takes 2M + 1S + 2A, as shaded in Fig. 1. Thus, a design may use at most three FF multipliers to achieve maximum parallelism in scalar multiplication.

PROPOSED ARCHITECTURE OF ELLIPTIC CURVE SCALAR MULTIPLICATION:

We propose a high-performance architecture based on the improved Montgomery ladder scalar multiplication algorithm, as shown in Fig. 2.

Fig. 2. Proposed architecture of ECSM.

The proposed ECSM architecture consists of one bit-parallel FF MAC, one FF squarer, a register bank, a finite-state machine, and a 6 × 18 control ROM. The FF MAC is implemented using the Karatsuba-Ofman algorithm and is pipelined. An n-stage pipelined FF MAC takes n clock cycles to finish one multiplication. The FF squarer is not pipelined and needs one clock cycle to finish one squaring. The inputs to the FF MAC (A, B, and C) and the input to the FF squarer (S) are all registered. Another four registers, T1, T2, T3, and T4, are used in the data path for data caching.

Fig. 3. Data path of ECSM using a three-stage pipelined FF MAC.
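To illustrate what the FF MAC computes, here is a minimal software sketch of a Karatsuba-Ofman multiply-accumulate A·B + C in GF(2^163). It is an assumption-laden model, not the paper's bit-parallel circuit: the field polynomial (NIST x^163 + x^7 + x^6 + x^3 + 1), the recursion threshold of 16 bits, and all function names are chosen here for illustration, and the pipelining is not modeled.

```python
M_BITS = 163
# Assumed field: NIST GF(2^163), f(x) = x^163 + x^7 + x^6 + x^3 + 1
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def clmul_schoolbook(a, b):
    """Carry-less (polynomial) product over GF(2), no reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def karatsuba(a, b, n, threshold=16):
    """Carry-less product of two polynomials of at most n bits each.

    Splits each operand into a low part of h bits and a high part,
    then uses three half-size products instead of four.
    """
    if n <= threshold:
        return clmul_schoolbook(a, b)
    h = (n + 1) // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    lo = karatsuba(a0, b0, h, threshold)
    hi = karatsuba(a1, b1, n - h, threshold)
    mid = karatsuba(a0 ^ a1, b0 ^ b1, h, threshold)
    # (a0 + a1)(b0 + b1) + a0*b0 + a1*b1 is the middle coefficient
    return (hi << (2 * h)) ^ ((mid ^ lo ^ hi) << h) ^ lo


def reduce_mod_poly(c):
    """Reduce a polynomial of degree up to 2*(M_BITS - 1) modulo f(x)."""
    for i in range(c.bit_length() - 1, M_BITS - 1, -1):
        if (c >> i) & 1:
            c ^= POLY << (i - M_BITS)
    return c


def ff_mac(a, b, c):
    """Multiply-accumulate in GF(2^163): returns a*b + c."""
    return reduce_mod_poly(karatsuba(a, b, M_BITS)) ^ c
```

The accumulate input lets an addition that follows a multiplication be folded into the same unit, which is one way to read the MAC's role in the ladder formulas: for example, ff_mac(x, Z, t) evaluates x·Z + t in a single call.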

The data path of ECSM using a three-stage pipelined FF MAC is shown as an example in Fig. 3. The terms X1, X2, Z1, and Z2 are not shown, because they are intermediate results of the FF MAC or the FF squarer. The bold dashed line in Fig. 3 marks the critical path of the three-stage pipelined architecture, which consists of a pipelined FF MAC, an addition (XOR), and a 4:1 MUX. Data paths with other numbers of pipeline stages are similar to Fig. 3 except for different data connections. The control signals stored in the control ROM also differ, but the critical path delay remains unchanged.

Advantages:

Area is reduced.

Speed is increased.

Software implementation:

ModelSim

Xilinx ISE