Reconfigurable Hardware Implementation and Analysis of Mesh Routing for the Matrix Step of the Number Field Sieve Factorization


Reconfigurable Hardware Implementation and Analysis of Mesh Routing for the Matrix Step of the Number Field Sieve Factorization
Sashisu Bajracharya, MS CpE Candidate
Master's Thesis Defense
Advisor: Dr. Kris Gaj
George Mason University

Outline
1. RSA and Factoring with the Number Field Sieve (NFS)
2. Matrix Step of NFS
3. Basic Mesh Routing
4. Improved Mesh Routing
5. Results and Conclusions: FPGA Array
6. Results and Conclusions: SRC-6e Reconfigurable Computer
7. Summary & Conclusions

RSA: Major Public Key Cryptosystem
Public key: (e, N); private key: (d, P, Q)
(figure: Alice → Encryption → Network → Decryption → Bob, with Alice holding { e, N } and Bob holding { d, P, Q })
N = P · Q, where P and Q are large prime factors

RSA was developed by Ron Rivest, Adi Shamir & Leonard Adleman in 1977.

Applications of RSA
Secure WWW: SSL (95% of e-commerce)
Secure e-mail: S/MIME, PGP
(figure: Alice's browser ↔ network ↔ Bob's web server)

How hard is it to break RSA?
Largest number factored: RSA-576, a 576-bit (174 decimal digit) number (December 2003)
Resources & effort: workstations from 8 different sites around the world, 3 months

Recommended key sizes for RSA
Old standard: 512 bits (155 decimal digits), individual users; broken in 1999
New standard:
- Individual users: 768 bits
- Organizations (short term): 1024 bits
- Organizations (long term): 2048 bits

Estimated difficulty of factoring a 1024-bit number, by RSA Security, Inc.: 342 million PCs, each with a 500 MHz CPU and 170 GB of RAM, working for 1 year

Our Task
Determine how hard it is to break RSA with large key sizes using reconfigurable hardware:
- a generic array of FPGAs
- the SRC-6e Reconfigurable Computer

Best Algorithm to Factor: the Number Field Sieve
Complexity: sub-exponential time and memory
N = number to factor, k = number of bits of N
- Exponential function: e^k
- Sub-exponential function: e^(k^(1/3) · (ln k)^(2/3))
- Polynomial function: a · k^m

Number Field Sieve (NFS) Steps
1. Polynomial Selection
2. Sieving
3. Matrix (Linear Algebra)
4. Square Root
Sieving and the Matrix step are the computationally intensive steps.

Hardware architectures for NFS proposed to date:
- Daniel Bernstein (Univ. of Illinois at Chicago): Mesh Sorting for the Matrix step (and Sieving), Fall 2001
- Adi Shamir & Eran Tromer (Weizmann Institute, Israel): Mesh Routing for the Matrix step (Asiacrypt 2002); TWIRL for Sieving (Crypto 2003, Asiacrypt 2003)
The mesh method improves the asymptotic complexity of NFS.
So far only analytical estimates: no real implementation, no concrete numbers.

My Objective
Bring this mesh algorithm to a practical hardware implementation with concrete numbers.
Focus of research: the Matrix (Linear Algebra) step.

My Objective
- Detailed RTL design of the mesh algorithm
- Synthesis and implementation results for an array of Virtex FPGAs and for the SRC-6e Reconfigurable Computer

Function of the Matrix Step
Find a linear dependency among the columns of the large sparse matrix obtained from the sieving step:
c_i1 ⊕ c_i2 ⊕ ... ⊕ c_il = 0
D = number of matrix columns (or rows): about 10^6 for 512-bit, about 10^7 for 1024-bit

Mesh-based hardware circuits, proposed by Bernstein and by Shamir & Tromer, decrease the time and cost of matrix-vector multiplications.
Block Wiedemann Algorithm for the Matrix Step
1) Multiple matrix-vector multiplications of the sparse matrix A with K random vectors v_i: A·v_i, A^2·v_i, ..., A^k·v_i, where k = 2D/K
2) Post-computation leading to the determination of a linear dependence among the columns of matrix A
Most time-consuming operation: A [D×D] · v [D×1]
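Over GF(2), the product A·v that dominates the matrix step reduces to XOR accumulation. The sketch below is an illustrative software model only (the function name and storage layout are hypothetical, not the thesis RTL): each column stores its non-zero row indices, mirroring what a mesh cell holds.

```c
#include <stddef.h>

/* Software sketch of the matrix step's core operation y = A*v over GF(2).
 * Column j owns nz_count[j] non-zero row indices stored in
 * nz_rows[j*d .. j*d + nz_count[j] - 1], with d the maximum number of
 * non-zero entries per column.  The caller must zero y[] beforehand. */
static void spmv_gf2(size_t D, size_t d, const size_t *nz_rows,
                     const size_t *nz_count, const unsigned char *v,
                     unsigned char *y)
{
    for (size_t j = 0; j < D; j++) {
        if (!v[j])
            continue;                   /* column contributes only if v[j] = 1 */
        for (size_t k = 0; k < nz_count[j]; k++)
            y[nz_rows[j * d + k]] ^= 1; /* GF(2) addition is XOR */
    }
}
```

In the mesh, the same XOR happens when a packet carrying v[j] reaches the cell that owns result bit y[r].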

Two Architectures for Matrix-Vector Multiplication
- Mesh Sorting (Bernstein): based on recursive sorting; does one multiplication at a time; large area
- Mesh Routing (Shamir & Tromer): based on routing; does K multiplications at a time; compact area (handles a larger matrix size)

Mesh Routing

Matrix-Vector Multiplication
(figure: sparse matrix A multiplied by vector v, giving A·v)

Mesh Routing
An m × m mesh, where m = √D: one cell per matrix column.
(figure: example mesh; each cell S holds the non-zero row indices of its column of A and one bit of v)
d = maximum number of non-zero entries over all columns

Routing in the Mesh
(figure: packets moving toward their target cells)
Each time a packet arrives at its target cell, the packet's vector bit is XORed with the partial result bit held at the target cell.

Mesh Routing
(figure) After routing completes, the mesh contains the result of the multiplication.

Mesh Routing with K Parallel Multiplications
Example for K = 2: each cell of the mesh holds the corresponding bits of both vectors v_1 and v_2.
(figure: matrix A mapped onto the mesh together with v_1 and v_2)

Clockwise Transposition Routing
In each step a cell holds one packet and receives one packet from a neighbor for a compare-exchange.
The exchange is performed only if it reduces the distance to target of the farthest-traveling packet.

Clockwise Transposition Routing
Four compare-exchange directions, repeated in a fixed cycle.
(figure: cells and compare-exchange directions)
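The compare-exchange rule can be sketched in software (hypothetical names; the thesis implements this in VHDL): neighbours swap packets only when the swap strictly reduces the distance of the farthest-traveling packet, and invalid packets contribute no distance.

```c
#include <stdlib.h>

/* One compare-exchange along a row (the column phases are symmetric).
 * A packet's cost at a cell is its remaining row distance to its target;
 * an annihilated (invalid) packet costs nothing.  The neighbours swap
 * only if that strictly reduces the larger of the two costs. */
typedef struct {
    int valid;      /* becomes 0 once the packet has been delivered */
    int dest_row;   /* target cell's row coordinate */
    int vbit;       /* the vector bit carried by the packet */
} packet_t;

static int row_dist(const packet_t *p, int row)
{
    return p->valid ? abs(p->dest_row - row) : 0;
}

static void compare_exchange_row(packet_t *a, int row_a,
                                 packet_t *b, int row_b)
{
    int da = row_dist(a, row_a), db = row_dist(b, row_b);
    int sa = row_dist(a, row_b), sb = row_dist(b, row_a);
    int keep = da > db ? da : db;   /* farthest distance if we keep   */
    int swap = sa > sb ? sa : sb;   /* farthest distance if we swap   */
    if (swap < keep) {              /* strict improvement required    */
        packet_t t = *a;
        *a = *b;
        *b = t;
    }
}
```

The strict inequality is what prevents packets from oscillating between two cells without making progress.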

Types of Packets
1) Valid packet
2) Invalid packet: a packet becomes invalid once it reaches its destination

Compare-Exchange Cases
Four cases for a cell:
a) Both packets valid (may need to exchange)
b) Current packet invalid, incoming packet valid (may need to exchange)
c) Current packet valid, incoming packet invalid (may need to annihilate)
d) Both packets invalid (no action)

Basic and Improved Mesh Routing Designs

Basic Mesh Routing Design
Each cell of the mesh handles one column of matrix A.
K = number of vectors multiplied by matrix A concurrently (e.g., K = 1 or K ≥ 32).
Total routing takes d · 4 · m compare-exchange operations.

Basic Loading and Unloading Design
(figure: vector, non-zero matrix entries, and result vector shifted serially through the mesh)

Parallel Loading & Unloading Design
(figure: vector, non-zero matrix entries, and result vector loaded over multiple mesh columns in parallel)
Restricted by the number of I/O pins available.

Basic Cell Design for Basic Mesh Routing
(block diagram: LUT-RAM holding P[i] and R[i]; control unit (CU); current-packet register (CR); Check-Dest logic; row/col comparator; status bits and cell coordinate (r, c); signals: annihilate, exchange, eq_packet, en_equal, en_cur, row/col, oper)

Comparator Design
(block diagram: comparator of the cell's coordinate against the row/col fields of the current and new packets, feeding control-signal logic that produces exchange, annihilate, and eq_packet)

Improved Mesh Routing Design (proposed for cost reduction)
Each cell of the mesh handles p columns of matrix A.
Compact area => handles a larger matrix size.
Total routing takes p · d · 4 · m compare-exchange steps.

Mesh Cell Design for Improved Mesh Routing
(block diagram: as in the basic cell, with the LUT-RAM addressed by addr/addr2 to select among the p columns handled by the cell)

Target FPGA Devices
Xilinx Virtex II XC2V8000: 46,592 CLB slices, 93,184 LUTs, 93,184 flip-flops, 18×18 multipliers, Block RAMs
Xilinx Virtex II XC2V6000: 33,792 CLB slices, 67,584 LUTs, 67,584 flip-flops
(figure: CLB slice containing LUT, carry & control logic, and flip-flop; I/O blocks)

Results and Analysis

Synthesis Results for one Virtex II XC2V8000 using the Basic Mesh Routing Design
Matrix 144×144 (Mesh 12×12); K = number of concurrent matrix-vector multiplications; Time for K mult = d · 4 · m · clock period

K = 1:  823 CLB slices (7%), 5,495 LUTs (6%), 5,38 FFs (5%); clock period 14 ns, time for K mult 672 ns, time per mult 672 ns
K = 32: 23,949 CLB slices (51%), 46,944 LUTs (50%), 23,149 FFs (25%); clock period 16.6 ns, time for K mult 797 ns, time per mult 25 ns
K = 70: 43,065 CLB slices (92%), 84,836 LUTs (91%), 45,378 FFs (48%); clock period 17.8 ns, time for K mult 854 ns, time per mult 12.2 ns

Speedup vs. Software Implementation
Reference optimized SW implementation: PC, Pentium IV 2.768 GHz, 1 GB RAM
Matrix 144×144 (Mesh 12×12), K = 70: one multiplication takes 344 ns in SW vs. 12.2 ns in HW, a speedup of 28.2

Distributed Computation (Geiselmann, Steinwandt)
Partition A into s × s sub-matrices A_{i,j} and v into s sub-vectors v_j:
(A·v)_i = Σ_{j=1..s} A_{i,j} · v_j
so A·v is assembled from s^2 independent sub-matrix by sub-vector products.

512-bit & 1024-bit performance for different numbers of FPGAs connected in a two-dimensional square array:
1) the FPGA array performs a single sub-matrix by sub-vector multiplication
2) the FPGA array is reused for the next sub-computation
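A small software model (illustrative only: dense toy matrices over GF(2), hypothetical names) can check the decomposition: XOR-summing the per-block partial products A_{i,j}·v_j, one sub-multiplication at a time, reproduces the full product A·v, which is exactly what allows a single FPGA array to be reused for each sub-computation.

```c
#include <stddef.h>

#define N 4   /* toy dimension; the real D is on the order of 10^6..10^7 */
#define S 2   /* blocks per side */
#define B (N / S)

/* Dense partial product over GF(2) for the sub-matrix with top-left corner
 * (r0, c0); the result is XOR-accumulated into y[r0 .. r0+B-1]. */
static void block_mult(unsigned char A[N][N], unsigned char *v,
                       unsigned char *y, size_t r0, size_t c0)
{
    for (size_t r = 0; r < B; r++)
        for (size_t c = 0; c < B; c++)
            y[r0 + r] ^= A[r0 + r][c0 + c] & v[c0 + c];
}

/* Full product assembled from S*S sub-multiplications, one at a time,
 * the way one FPGA array is reused for each sub-computation. */
static void blocked_mult(unsigned char A[N][N], unsigned char *v,
                         unsigned char *y)
{
    for (size_t i = 0; i < N; i++)
        y[i] = 0;
    for (size_t i = 0; i < S; i++)
        for (size_t j = 0; j < S; j++)
            block_mult(A, v, y, i * B, j * B);
}
```

Because GF(2) addition is associative and commutative, the order in which the S² partial products are accumulated does not matter.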

512-bit Performance with one chip & multiple chips connected in a mesh, for Basic Mesh Routing
D = number of columns of matrix A = 6.7 × 10^6
m = mesh dimension
n = number of times the multiplications are repeated = D^2 / m^4
T_K = routing time for K multiplications in the mesh = d · 4 · m · clock period
T_Load = time for loading and unloading for K multiplications
T_Total = total time for a Matrix step = 3 · (D/K) · n · (T_K + T_Load)
(table over 1, 2×2, 16×16, and 32×32 Virtex II chips, i.e. combined mesh dimensions m = 12, 24, 192, and 384, listing n, T_K, T_Load, T_Total in days, and speedup vs. one chip)

1024-bit Performance with one chip & multiple chips connected in a mesh, for Basic Mesh Routing
D = number of columns of matrix A = 4 × 10^7; other definitions as in the 512-bit case: n = D^2 / m^4, T_K = d · 4 · m · clock period, T_Total = 3 · (D/K) · n · (T_K + T_Load)
(table over 1, 2×2, 16×16, and 32×32 Virtex II chips, i.e. combined mesh dimensions m = 12, 24, 192, and 384)

Analysis & Conclusion
Polynomial speedup with the number of FPGAs: speedup approximately proportional to (#FPGA)^(3/2)
T_Total = 3 · (D/K) · D^2/(m·√#chips)^4 · (d · 4 · (m·√#chips) · T_clk + T_Load)
where m = mesh size in one Virtex II chip.

Speedup vs. number of FPGA chips
(plot: speedup over 1 chip as a function of the number of Virtex II chips)

Synthesis Results on one Virtex II XC2V8000 for the Improved Mesh Routing Design
Matrix 2304×2304 (Mesh 12×12, p = 16); Time for K mult = p · d · 4 · m · clock period

K = 1:  6,738 CLB slices (14%), 10,438 LUTs (11%), 6,279 FFs (7%); clock period 14.5 ns, time for K mult 11,136 ns, time per mult 11,136 ns
K = 32: 29,938 CLB slices (64%), 50,983 LUTs (54%), 19,650 FFs (21%); clock period 16.7 ns, time for K mult 12,826 ns, time per mult 401 ns
K = 50: 43,420 CLB slices (93%), 74,3 LUTs (89%), 27,460 FFs (29%); clock period 17.7 ns, time for K mult 13,593 ns, time per mult 270 ns

512-bit Performance with one chip & multiple chips connected in a mesh, for Improved Mesh Routing
D = number of columns of matrix A = 6.7 × 10^6
p = number of columns handled in one cell = 16
n = number of times the sub-multiplications are repeated = D^2 / (m^2 · p)^2
T_K = routing time for K multiplications in the mesh = p · d · 4 · m · clock period
T_Load = time for loading and unloading for K multiplications
T_Total = total time for a Matrix step = 3 · (D/K) · n · (T_K + T_Load)
(table over 1, 2×2, 16×16, and 32×32 Virtex II chips; T_Total drops from about 5,930 days on one chip to 1.32 days on 32×32 = 1024 chips, a speedup of about 4,492)

1024-bit Performance with one chip & multiple chips connected in a mesh, for Improved Mesh Routing
D = number of columns of matrix A = 4 × 10^7, p = 16; other definitions and the T_Total formula as in the 512-bit case
(table over 1, 2×2, 16×16, and 32×32 Virtex II chips; best case: 27 days on 32×32 = 1024 chips, a speedup of about 4,698 vs. one chip)

Comparison of Basic & Improved Mesh Routing performance vs. the number of FPGAs
(plots: total time in days vs. number of Virtex II chips, for 512-bit and 1024-bit; the Improved Mesh Routing curve lies below the Basic Mesh Routing curve in both)

Speedup of Improved over Basic Mesh Routing vs. number of Virtex II FPGAs
(plots: speedup ratio vs. number of Virtex II chips, for 512-bit and 1024-bit)

Comparison vs. Cray Implementation (512-bit number, Improved Mesh Routing Design)
Cray C916: 93 days
1024 Virtex II FPGAs: 1.32 days (32 hours)

Conclusions for Basic Mesh Routing & Improved Mesh Routing
Best case for 1024-bit: Improved Mesh Routing Design, 1024 Virtex II chips, total execution time 27 days.
Improved Mesh Routing is faster than Basic Mesh Routing on the Virtex II 8000 by a factor of around 10-15:
- the larger sub-matrix size handled in the same FPGA sharply decreases the number of iterations of sub-multiplications
- the influence of K dropping from 70 to 50 is very low

SRC-6e Reconfigurable Computer

SRC-6e Reconfigurable Computer Hardware Architecture
(block diagram: µP board with two P3 processors (1 GHz), L2 caches, MIOC, computer memory (1.5 GB), PCI-X at 528 MB/s, and a DDR interface/SNAP at 800 MB/s; MAP board with a Control FPGA (XC2V6000) and two user FPGAs (XC2V6000), On-Board Memory (24 MB) accessed at 4800 MB/s (6 × 64 bits), 2400 MB/s (192 bits) links between the FPGAs, and chain ports)

MAP Programming Model of SRC
MAP C sub-routine → FPGA contents:

MAP_Function(a, d, e)
{
  Macro_1(a, b, c)
  Macro_2(b, d)
  Macro_2(c, e)
}

(figure: dataflow a → Macro_1 → b, c; b → Macro_2 → d; c → Macro_2 → e)

SRC Program Partitioning
µP system: C function for the µP (HLL)
FPGA system: C function for the MAP (HLL) calling VHDL macros (HDL)

SRC-6e Designs

SRC-Mesh Design
State machine and cells: the complete mesh is written in VHDL.

SRC-Cells Design
Control in C: only the cell is a VHDL macro; the mesh is assembled in MAP C.

Modified Architecture of the Cell for SRC-Mesh
(block diagram: as in the basic cell, with LUT-RAM holding P[i] and R[i], control unit, current-packet register, Check-Dest logic, comparator, and an added output register R)

SRC-Cells Design Entry & Circuit

for ( ) {
  cell (a, &b);
  cell (a2, &b2);
  a = b2;
  a2 = b;
}

(figure: two cell instances with outputs b and b2 cross-connected back to inputs a2 and a)
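The loop above can be modeled in plain C (illustrative only; this cell() is a trivial pass-through stand-in for the VHDL cell macro): two cell instances fire each iteration and their outputs are cross-wired back to the opposite inputs.

```c
/* Software stand-in for the SRC-Cells wiring.  cell() is a placeholder for
 * the VHDL cell macro (the real macro would compare-exchange; identity
 * suffices to show the dataflow), and the loop body cross-connects the two
 * outputs back to the opposite inputs, as in the MAP C design entry. */
static void cell(int in, int *out)
{
    *out = in;
}

static void ping_pong(int *a, int *a2, int steps)
{
    for (int i = 0; i < steps; i++) {
        int b, b2;
        cell(*a, &b);    /* first cell consumes a,  produces b   */
        cell(*a2, &b2);  /* second cell consumes a2, produces b2 */
        *a = b2;         /* cross-connection: b2 feeds cell one  */
        *a2 = b;         /* and b feeds cell two                 */
    }
}
```

With identity cells the two values simply swap every iteration, making the cross-connection visible.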

Cell Architecture for SRC-Cells Design
(block diagram: current-packet register, Check-Dest logic, comparator, status bits and cell coordinate; signals: annihilate, exchange, eq_packet, en_equal, en_cur, row/col, oper)

Results and Analysis

SRC Basic Mesh Routing Results
K = number of parallel sub-matrix by sub-vector multiplications
n = number of times the sub-multiplications are repeated = D^2 / m^4
x = clock cycles per compare-exchange
T_Kroute = routing time for K multiplications in the mesh = d · 4 · m · x · clock period
T_KTot = time for K multiplications including loading, unloading, and routing
T_512Compute = computational time for a 512-bit Matrix step
T_512Total = total time for a 512-bit Matrix step = 3 · (D/K) · n · T_KTot

SRC-Mesh, mesh 12×12 (matrix 144×144), K = 42: 31,743 CLB slices (94%), 54,660 LUTs (80%), 43,545 FFs (64%); period 10 ns, x = 2, T_Kroute = 960 ns, T_KTot ≈ 1,870 ns; T_512Compute ≈ 11,520 days, T_512Total ≈ 22,460 days
SRC-Mesh, mesh 20×20 (matrix 400×400), K = 1: 31,533 CLB slices (93%), 54,690 LUTs (80%), 28,636 FFs (42%); period 10 ns, x = 2, T_Kroute = 1,600 ns, T_KTot ≈ 2,270 ns; T_512Compute ≈ 104,222 days, T_512Total ≈ 147,865 days
SRC-Mesh, mesh 10×10 (matrix 100×100), K = 70: 31,566 CLB slices (93%), 55,528 LUTs (82%), 46,647 FFs (69%); period 10 ns, x = 2, T_Kroute = 800 ns, T_KTot ≈ 1,870 ns; T_512Compute ≈ 11,938 days, T_512Total ≈ 27,898 days
SRC-Cells, mesh 11×11 (matrix 121×121), K = 1: 32,840 CLB slices (97%), 29,959 LUTs (44%), 47,759 FFs (71%); period 10 ns, x = 3, T_Kroute = 1,320 ns; T_512Compute ≈ 939,676 days, T_512Total ≈ 1,046,200 days

Comparison of 512-bit Performance for different mesh sizes & K values with equivalent area
(bar chart: computational time and total time in days, for 20×20 K=1, 12×12 K=42, and 10×10 K=70)

Conclusion for performance of different mesh sizes & K values
Comparing performance for different mesh sizes and K at equivalent FPGA resources (≈90%):
- the 12×12 mesh with K = 42 is better than the 20×20 mesh with K = 1
- the 10×10 mesh with K = 70 is similar to the 12×12 mesh with K = 42
T_Total = 3 · (D/K) · (D^2/m^4) · (d · 4 · m · x · T_clk + T_Load)

SRC-Mesh vs. SRC-Cells Area for a 10×10 mesh with K = 1

SRC-Cells, 10×10 (matrix 100×100): 25,325 CLB slices (74%), 22,560 LUTs (33%), 36,042 FFs (53%); period 10 ns, x = 3, T_Kroute = 1,200 ns
SRC-Mesh, 10×10 (matrix 100×100): 9,347 CLB slices (27%), 13,427 LUTs (19%), 10,439 FFs (15%); period 10 ns, x = 2, T_Kroute = 800 ns

SRC-Mesh vs. SRC-Cells Area for a 10×10 mesh
(bar chart, % of chip used, SRC-Cells vs. SRC-Mesh: CLB 74 vs. 27, LUT 33 vs. 19, FF 53 vs. 15)

Conclusions for SRC-Mesh and SRC-Cells
SRC-Cells takes about 2.7 times more area than SRC-Mesh for the same mesh parameters, and performs worse than SRC-Mesh (only a small mesh can fit, and K stays small).
Benefit: ease of programming in a high-level language.

SRC Improved Mesh Routing Results (Area)

Improved SRC-Mesh, mesh 10×10, p = 16 (matrix 1600×1600), K = 32: 31,020 CLB slices (91%), 51,950 LUTs (76%), 29,954 FFs (44%)
Improved SRC-Mesh, mesh 8×8, p = 16 (matrix 1024×1024), K = 64: 31,456 CLB slices (93%), 53,016 LUTs (78%), 30,812 FFs (45%)

SRC Improved Mesh Routing Results (Performance)
K = number of simultaneous vectors being multiplied
p = number of columns of A handled in one cell = 16
n = number of times the sub-multiplications are repeated = D^2 / (m^2 · p)^2
x = clock cycles per compare-exchange operation
T_Kroute = routing time for K multiplications in the mesh = p · d · 4 · m · x · clock period
T_KTot = time for K multiplications including loading, unloading, and routing
T_512Compute = computational time for a 512-bit Matrix step
T_512Total = total time for a 512-bit Matrix step = 3 · (D/K) · n · T_KTot

Improved SRC-Mesh, mesh 10×10 (matrix 1600×1600), K = 32: period 10 ns, x = 3, T_Kroute = 19,200 ns, T_KTot ≈ 31,480 ns; T_512Compute ≈ 2,444 days, T_512Total ≈ 4,130 days
Improved SRC-Mesh, mesh 8×8 (matrix 1024×1024), K = 64: period 10 ns, x = 3, T_Kroute = 15,360 ns, T_KTot ≈ 25,260 ns; T_512Compute ≈ 2,440 days, T_512Total ≈ 3,930 days

Analysis & Conclusion for SRC-6e Improved & Basic Mesh Routing
The Improved SRC-Mesh is faster than the Basic SRC-Mesh design by a factor of 5.7 on the SRC-6e's Virtex II 6000: about 3,930 days compared to about 22,460 days in the best case.
The larger sub-matrix size significantly decreases the number of sub-multiplications.

Standalone FPGA vs. SRC Design
Standalone FPGA (Virtex II 8000) vs. SRC (Virtex II 6000):
- Virtex II 8000 designs allow larger K and m
- routing latency increases on the SRC-6e: to reach the 100 MHz target frequency, each compare-exchange takes 2-3 extra clock cycles
- limited I/O from the 6 OBM banks on the SRC-6e means more loading/unloading time
- hence the results compare a two-dimensional array of standalone Virtex II FPGAs against one FPGA on the SRC-6e

Summary & Conclusions
- First practical hardware implementation of Mesh Routing for the Number Field Sieve, implemented and tested
- Practical, concrete numbers obtained for the theoretical Mesh Routing algorithm, assessing the current hardness of the matrix step in reconfigurable hardware
- Two architectures, Basic and Improved, implemented and compared
- All designs compared on two platforms: a generic array of FPGA devices and the SRC-6e Reconfigurable Computer

Summary & Conclusions
- Assuming constant area, the Improved Mesh Routing Design is faster than the Basic Mesh Routing Design by a factor of 10-15 on the Virtex II 8000, thanks to the larger sub-matrix handled
- A two-dimensional array of Virtex II chips performs the computation faster than a single FPGA by a factor proportional to (number of FPGAs)^(3/2)
- The Matrix step for a 1024-bit number can be performed using 1024 Virtex II chips in 27 days

Summary & Conclusions
- Two design entry approaches were developed for the SRC-6e: SRC-Mesh is written entirely in VHDL; SRC-Cells is written mostly in C, with only the cell in VHDL
- SRC-Mesh outperforms SRC-Cells by a factor of 5, at the cost of the harder development of the VHDL code
- A manually optimized circuit in VHDL is the suitable approach on the SRC platform for the distributed mesh computation

Acknowledgements
Dr. Kris Gaj
SRC Computers, Inc.
Deapesh Misra

Questions