Reconfigurable Hardware Implementation and Analysis of Mesh Routing for the Matrix Step of the Number Field Sieve Factorization

Sashisu Bajracharya, MS CpE Candidate
Master's Thesis Defense
Advisor: Dr. Kris Gaj
George Mason University
Outline
1. RSA and Factoring with the Number Field Sieve (NFS)
2. Matrix Step of NFS
3. Basic Mesh Routing
4. Improved Mesh Routing
5. Results and Conclusions: FPGA Array, SRC-6e Reconfigurable Computer
6. Summary & Conclusions
RSA: the Major Public-Key Cryptosystem
Public key: (e, N); Private key: (d, P, Q)
Alice encrypts with { e, N }; the ciphertext travels over the network; Bob decrypts with { d, P, Q }
N = P * Q, where P and Q are large prime factors
RSA was developed by Ron Rivest, Adi Shamir & Leonard Adleman in 1977.
Applications of RSA
Secure WWW (SSL): about 95% of e-commerce (network browser to web server)
Secure e-mail: S/MIME, PGP (Alice to Bob)
How hard is it to break RSA?
Largest number factored: 576 bits, RSA-576 (Dec 2003)
[174-digit decimal value of RSA-576 omitted]
Resources & effort: workstations from 8 different sites around the world, 3 months
Recommended Key Sizes for RSA
Old standard (individual users): 512 bits (155 decimal digits); broken in 1999
New standard:
- Individual users: 768 bits
- Organizations (short term): 1024 bits
- Organizations (long term): 2048 bits
Estimated difficulty of factoring a 1024-bit number (by RSA Security, Inc.): 342 million PCs with 500 MHz and 170 GB RAM, for 1 year.
Our Task
Determine how hard it is to break RSA for large key sizes using reconfigurable hardware:
- a generic array of FPGAs
- the SRC-6e Reconfigurable Computer
Best Algorithm to Factor: the NUMBER FIELD SIEVE
Complexity: sub-exponential in time and memory
N = number to factor, k = number of bits of N
Exponential function: e^k
Sub-exponential function: e^(k^(1/3) * (ln k)^(2/3))
Polynomial function: a * k^m
Number Field Sieve (NFS) Steps
1. Polynomial Selection
2. Sieving
3. Matrix (Linear Algebra)
4. Square Root
The computationally intensive steps are Sieving and the Matrix step.
Hardware Architectures for NFS Proposed to Date
Daniel Bernstein (Univ. of Illinois at Chicago): the Mesh Sorting approach for the Matrix step and Sieving (Fall 2001)
Adi Shamir & Eran Tromer (Weizmann Institute, Israel): Mesh Routing for the Matrix step (AsiaCrypt 2002) and TWIRL for Sieving (Crypto 2003, AsiaCrypt 2003)
The mesh method improves the asymptotic complexity of NFS performance.
So far only analytical estimates: no real implementation, no concrete numbers.
My Objective
Bring this mesh algorithm to a practical hardware implementation and concrete numbers.
Focus of research: the Matrix (Linear Algebra) step.
My Objective
Detailed design of the mesh algorithm in RTL code.
Synthesis and implementation results for an array of Virtex FPGAs and the SRC-6e Reconfigurable Computer.
Function of the Matrix Step
Find a linear dependency among the columns of the large sparse matrix obtained from the sieving step.
D = number of matrix columns (or rows): about 10^6 for 512-bit, about 10^7 for 1024-bit
Find columns c_i1, c_i2, ..., c_il such that c_i1 + c_i2 + ... + c_il = 0 (over GF(2))
Mesh-based hardware circuits, proposed by Bernstein and Shamir-Tromer, decrease the time and cost of matrix-vector multiplications.

Block Wiedemann Algorithm for the Matrix Step
1) Multiple matrix-vector multiplications of the sparse matrix A with K random vectors: A*v_i, A^2*v_i, ..., A^k*v_i, where k = 2D/K
2) A post-computation leading to the determination of a linear dependence among the columns of matrix A
Most time-consuming operation: A [D x D] * v [D x 1]
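The block Wiedemann inner loop above can be sketched in C at toy scale. This is an illustrative software model, not the hardware design: D is shrunk to 4, A is stored dense, and the names `gf2_matvec` and `krylov_sequence` are invented for the sketch.

```c
#include <stdint.h>

#define D 4  /* toy dimension; the real D is on the order of 10^6 */

/* One matrix-vector product over GF(2): bit r of the result is the
 * parity of (row r of A) AND v.  Rows are stored as D-bit masks. */
static uint8_t gf2_matvec(const uint8_t A[D], uint8_t v) {
    uint8_t out = 0;
    for (int r = 0; r < D; r++) {
        uint8_t x = A[r] & v;        /* entries selected by v    */
        x ^= x >> 2; x ^= x >> 1;    /* parity of the low 4 bits */
        out |= (uint8_t)((x & 1u) << r);
    }
    return out;
}

/* Block Wiedemann inner loop for one random vector:
 * the sequence v, A*v, A^2*v, ..., A^k*v (k = 2D/K in the slides). */
static void krylov_sequence(const uint8_t A[D], uint8_t v, int k, uint8_t *seq) {
    seq[0] = v;
    for (int j = 1; j <= k; j++)
        seq[j] = gf2_matvec(A, seq[j - 1]);
}
```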
Two Architectures for Matrix-Vector Multiplication
Mesh Sorting (Bernstein): based on recursive sorting; does one multiplication at a time; large area.
Mesh Routing (Shamir-Tromer): based on routing; does K multiplications at a time; compact area (handles a larger matrix size).
Mesh Routing
Matrix-Vector Multiplication: A * v, where A is a sparse matrix.
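Because A is sparse over GF(2), a multiplication only needs to touch the nonzero entries. A minimal software sketch, with hypothetical names and column-wise storage mirroring the mesh cells (each cell holds the nonzero row indices of one column):

```c
#include <stddef.h>

/* Sparse GF(2) matrix stored column-wise: col_rows[c] lists the rows
 * containing a 1 in column c, and col_len[c] is its length. */
static void sparse_gf2_matvec(size_t D,
                              const size_t *const *col_rows,
                              const size_t *col_len,
                              const unsigned char *v,  /* v[c] in {0,1}   */
                              unsigned char *out) {    /* out[r] in {0,1} */
    for (size_t r = 0; r < D; r++)
        out[r] = 0;
    for (size_t c = 0; c < D; c++) {
        if (!v[c])
            continue;                 /* column contributes only if v[c] = 1 */
        for (size_t i = 0; i < col_len[c]; i++)
            out[col_rows[c][i]] ^= 1; /* addition mod 2 is XOR               */
    }
}
```

The inner XOR into `out[row]` is exactly what a packet arriving at its target cell does in the mesh.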
Mesh Routing
m x m mesh, where m = sqrt(D).
Each cell holds the nonzero row indices of one column of A and the corresponding bit of v.
d = maximum number of nonzero entries over all columns.
Routing in the Mesh
Packets travel toward their target cells. Each time a packet arrives at the target cell, the packet's vector bit is XORed with the partial-result bit held at the target cell.
Mesh Routing
After routing completes, the mesh contains the result of the multiplication.
Mesh Routing with K Parallel Multiplications
Example for K = 2: vectors v_1 and v_2 are multiplied by A concurrently in the same mesh.
Clockwise Transposition Routing
At each step, a cell holds one packet and receives one packet from a neighbor for a compare-exchange.
The exchange is performed only if it reduces the distance to target of the farthest-traveling packet.
Clockwise Transposition Routing
The four compare-exchange directions with the neighboring cells are repeated in four successive iterations.
Types of Packets
1) Valid packet
2) Invalid packet: a packet becomes invalid once it has reached its destination
Compare-Exchange Cases
Four cases for a cell:
a) Both packets valid (may need to exchange)
b) Current packet invalid, incoming packet valid (may need to exchange)
c) Current packet valid, incoming packet invalid (may need to annihilate)
d) Both packets invalid (no action)
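The farthest-first exchange rule for case (a) can be modeled in a few lines of C. `should_exchange` and its parameters are illustrative names, and the invalid-packet cases are omitted:

```c
#include <stdlib.h>

static int row_dist(int row, int target_row) { return abs(row - target_row); }

/* Simplified model of case (a), both packets valid, during a vertical
 * compare-exchange between adjacent rows r_a and r_b (r_a above r_b).
 * Return 1 iff swapping strictly reduces the distance to target of the
 * farthest-traveling packet. */
static int should_exchange(int r_a, int r_b, int target_a, int target_b) {
    int da = row_dist(r_a, target_a);     /* packet currently in upper cell */
    int db = row_dist(r_b, target_b);     /* packet currently in lower cell */
    if (da >= db)
        return row_dist(r_b, target_a) < da;  /* would moving a down help? */
    return row_dist(r_a, target_b) < db;      /* would moving b up help?   */
}
```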
Basic and Improved Mesh Routing Designs
Basic Mesh Routing Design
Each cell of the mesh handles one column of matrix A.
K = 1 or K = 32, where K = number of vectors multiplied by matrix A concurrently.
Total routing takes d * 4 * m compare-exchange operations.
Basic Loading and Unloading Design
Inputs: the vector and the nonzero matrix entries; output: the result vector.
Parallel Loading & Unloading Design
The vector, nonzero matrix entries, and result vector are transferred in parallel.
Restricted by the number of I/O pins available.
Basic Cell Design for Basic Mesh Routing
[Block diagram] Components: LUT-RAM (addressed storage for packets P[i] and result bits R[i]), control unit (CU), current-packet register (CR), comparator, destination-check logic (Check Dest), and status bits holding the cell's (r, c) coordinate. Control signals: en_cur, exchange, annihilate, en_equal, eq_packet, row/col, oper.
Comparator Design
[Block diagram] Compares the cell's coordinate with the (row, col, s) fields of the current and new packets; a greater-than/equal comparator feeds control-signal logic that produces exchange, annihilate, and eq_packet, steered by row/col, oper, s1, s2, and en_equal.
Improved Mesh Routing Design (proposed for cost reduction)
Each cell of the mesh handles p columns of the matrix A.
Compact area => handles a larger matrix size.
Total routing takes p * d * 4 * m compare-exchange steps.
Mesh Cell Design for Improved Mesh Routing
[Block diagram] Same structure as the basic cell (LUT-RAM, CU, CR, comparator, Check_Dest, status bits), extended with column addresses (addr, addr2) to select among the p columns stored in each cell.
Target FPGA Devices
Xilinx Virtex II XC2V8000: 46,592 CLB slices; 93,184 LUTs; 93,184 flip-flops; 18x18 multipliers; Block RAMs
Xilinx Virtex II XC2V6000: 33,792 CLB slices; 67,584 LUTs; 67,584 flip-flops
CLB slice: LUT + carry & control logic + flip-flop; I/O blocks surround the array.
Results and Analysis
Synthesis Results for one Virtex II XC2V8000 using the Basic Mesh Routing Design

Matrix size | K | CLB slices | LUTs | FFs | Clock period (ns) | Time for K mult | Time per mult
14400x14400 (mesh 120x120) | 1 | 823 (7%) | 5,495 (6%) | 5,38 (5%) | 4 | 672 | 672
14400x14400 (mesh 120x120) | 32 | 23,949 (5%) | 46,944 (5%) | 23,49 (25%) | 66 | 797 | 25
14400x14400 (mesh 120x120) | 7 | 43,65 (92%) | 84,836 (9%) | 45,378 (48%) | 78 | 854 | 2

K = number of concurrent matrix-vector multiplications
Time for K mult = d * 4 * m * Clock period
Speedup vs. Software Implementation
Reference optimized SW implementation: PC, Pentium IV, 2.768 GHz, 1 GB RAM

Matrix size | K | One mult time in SW (ns) | One mult time in HW (ns) | Speedup
14400x14400 (mesh 120x120) | 7 | 344 | 2 | 282
Distributed Computation (Geiselmann, Steinwandt)

Partition A into s x s sub-matrices A_i,j and v into s sub-vectors v_j (example for s = 3):

        | A_1,1  A_1,2  A_1,3 |   | v_1 |
A * v = | A_2,1  A_2,2  A_2,3 | * | v_2 |
        | A_3,1  A_3,2  A_3,3 |   | v_3 |

(A * v)_i = sum over j = 1..s of A_i,j * v_j (addition over GF(2), i.e. XOR)
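A toy software model of this block decomposition over GF(2), with invented names and an 8x8 matrix split into 2x2 blocks of size 4. Each block product could run on a separate FPGA, or be time-multiplexed on one array as in the slides:

```c
#include <stdint.h>

#define S 2   /* blocks per dimension (toy value)       */
#define B 4   /* rows/cols per block; full D = S*B = 8  */

/* Dense GF(2) block: row r is a B-bit mask. */
typedef struct { uint8_t row[B]; } block;

static uint8_t parity4(uint8_t x) { x ^= x >> 2; x ^= x >> 1; return x & 1u; }

/* One block-by-subvector product over GF(2). */
static uint8_t block_matvec(const block *blk, uint8_t v) {
    uint8_t out = 0;
    for (int r = 0; r < B; r++)
        out |= (uint8_t)(parity4(blk->row[r] & (v & 0xFu)) << r);
    return out;
}

/* Distributed scheme: (A*v)_i = XOR over j of A[i][j] * v_j. */
static void distributed_matvec(block A[S][S], const uint8_t v[S], uint8_t out[S]) {
    for (int i = 0; i < S; i++) {
        out[i] = 0;
        for (int j = 0; j < S; j++)
            out[i] ^= block_matvec(&A[i][j], v[j]);  /* GF(2) sum of block products */
    }
}
```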
512-bit & 1024-bit performance with different numbers of FPGAs connected in a two-dimensional square array:
1) The FPGA array performs a single sub-matrix by sub-vector multiplication
2) The FPGA array is reused for the next sub-computation
512-bit Performance with one chip & multiple chips connected in a mesh for Basic Mesh Routing

D = number of columns in matrix A = 6.7 x 10^6
m = mesh dimension
n = number of times to repeat multiplications = D^2 / m^4
T_K = routing time for K multiplications in the mesh = d * 4 * m * Clock Period
T_Load = time for loading and unloading for K multiplications
T_Total = total time for the Matrix step = 3*D/K * n * (T_K + T_Load)

[Table: n, T_K, T_Load, T_Total (days), and speedup vs. one chip for increasing numbers of Virtex II chips (values illegible)]
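The timing model in the definitions above translates directly into code. The numeric inputs in the check are arbitrary placeholders, not figures from the table:

```c
/* Total matrix-step time model from the slides (Basic Mesh Routing):
 *   n       = D^2 / m^4          repetitions of the multiplication
 *   T_K     = d * 4 * m * T_clk  routing time for K multiplications
 *   T_total = (3*D/K) * n * (T_K + T_load)
 * All times in nanoseconds; inputs are illustrative. */
static double t_total_ns(double D, double m, double K, double d,
                         double t_clk_ns, double t_load_ns) {
    double n   = (D * D) / (m * m * m * m);
    double t_k = d * 4.0 * m * t_clk_ns;
    return 3.0 * D / K * n * (t_k + t_load_ns);
}
```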
1024-bit Performance with one chip & multiple chips connected in a mesh for Basic Mesh Routing

D = number of columns in matrix A = 4 x 10^7
m = mesh dimension
n = number of times to repeat multiplications = D^2 / m^4
T_K = routing time for K multiplications in the mesh = d * 4 * m * Clock Period
T_Load = time for loading and unloading for K multiplications
T_Total = total time for the Matrix step = 3*D/K * n * (T_K + T_Load)

[Table: n, T_K, T_Load, T_Total (days), and speedup vs. one chip for increasing numbers of Virtex II chips (values illegible)]
Analysis & Conclusion
Polynomial speedup with the number of FPGAs: speedup approximately proportional to (#FPGAs)^(3/2).

T_Total = 3*D/K * D^2 / (m * sqrt(#chips))^4 * (d * 4 * m * sqrt(#chips) * T_clk + T_Load)

m = mesh size in one Virtex II chip
Speedup vs. number of FPGA chips
[Chart: speedup over 1 chip as a function of the number of Virtex II chips]
Synthesis Results on one Virtex II XC2V8000 for the Improved Mesh Routing Design

Matrix size | K | CLB slices | LUTs | FFs | Clock period (ns) | Time for K mult | Time per mult
230400x230400 (mesh 120x120, p=16) | 1 | 6738 (4%) | 10,438 (11%) | 6,279 (7%) | 45 | 36 | 36
230400x230400 (mesh 120x120, p=16) | 32 | 29,938 (64%) | 5,983 (54%) | 9,65 (2%) | 67 | 2826 | 4
230400x230400 (mesh 120x120, p=16) | 5 | 43,42 (93%) | 74,3 (89%) | 27,46 (29%) | 77 | 3593 | 27

Time for K mult = p * d * 4 * m * Clock period
512-bit Performance with one chip & multiple chips connected in a mesh for Improved Mesh Routing

D = number of columns in matrix A = 6.7 x 10^6
p = number of columns handled in one cell = 16
n = number of times to repeat sub-multiplications = D^2 / (m^2 * p)^2
T_K = routing time for K multiplications in the mesh = p * d * 4 * m * Clock Period
T_Load = time for loading and unloading for K multiplications
T_Total = total time for the Matrix step = 3*D/K * n * (T_K + T_Load)

[Table: n, T_K, T_Load, T_Total (days), and speedup vs. one chip for increasing numbers of Virtex II chips (values illegible)]
1024-bit Performance with one chip & multiple chips connected in a mesh for Improved Mesh Routing

D = number of columns in matrix A = 4 x 10^7
p = number of columns handled in one cell = 16
n = number of times to repeat sub-multiplications = D^2 / (m^2 * p)^2
T_K = routing time for K multiplications in the mesh = p * d * 4 * m * Clock Period
T_Load = time for loading and unloading for K multiplications
T_Total = total time for the Matrix step = 3*D/K * n * (T_K + T_Load)

[Table: n, T_K, T_Load, T_Total (days), and speedup vs. one chip for increasing numbers of Virtex II chips (values illegible)]
Comparison of Basic & Improved Mesh Routing performance with the number of FPGAs
[Two charts: total time (days) vs. number of Virtex II chips for Basic and Improved Mesh Routing; left: 512-bit, right: 1024-bit]
Speedup of Improved over Basic Mesh Routing vs. Number of Virtex II FPGAs
[Two charts: speedup ratio vs. number of Virtex II chips; left: 512-bit, right: 1024-bit]
Comparison vs. a Cray Implementation
512-bit number, Improved Mesh Routing design:
Cray C916: 93 days
24 Virtex II FPGAs: 1.32 days (32 hours)
Conclusions for Basic Mesh Routing & Improved Mesh Routing
Best case for 1024-bit: Improved Mesh Routing design, 24 Virtex II chips, total execution time 27 days.
Improved Mesh Routing is faster than Basic Mesh Routing in the Virtex II 8000 by a factor of around 10-15:
- the larger sub-matrix size handled in the same FPGA sharply decreases the number of iterations of sub-multiplications
- the influence of K dropping from 7 to 5 is very low
SRC-6e Reconfigurable Computer
SRC-6e Reconfigurable Computer Hardware Architecture

µP board: two P3 (1 GHz) microprocessors with L2 caches, MIOC, computer memory (1.5 GB), PCI-X; 800 MB/s paths; SNAP interface (800 MB/s; 64 bits data + 8 bits flags)
MAP board (1/2 MAP shown): Control FPGA XC2V6000 with DDR interface (528 MB/s each way to the SNAP); two user FPGAs XC2V6000 (FPGA 1, FPGA 2) linked by a 2400 MB/s (192-bit) direct connection and chain ports; On-Board Memory (24 MB), 4800 MB/s (6 x 64 bits) to each FPGA
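Assuming the on-board-memory figure is 4800 MB/s, i.e. 6 OBM banks of 64 bits each transferred per 100 MHz clock, the arithmetic checks out:

```c
/* Sanity check of an FPGA memory-bandwidth figure:
 * banks * (bits per bank / 8) bytes per cycle, at clock_mhz million
 * cycles per second, gives MB/s.  Values below are assumptions taken
 * from this slide's annotations, not measured numbers. */
static long bandwidth_mb_per_s(long banks, long bits_per_bank, long clock_mhz) {
    return banks * (bits_per_bank / 8) * clock_mhz;
}
```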
MAP Programming Model of SRC

MAP C subroutine -> FPGA contents:

MAP_Function(a, d, e) {
  Macro_1(a, b, c);
  Macro_2(b, d);
  Macro_2(c, e);
}

Each macro call instantiates a hardware macro on the FPGA: a feeds Macro_1, whose outputs b and c feed the two Macro_2 instances producing d and e.
SRC Program Partitioning
µP system: C function for the µP (HLL)
FPGA system: C function for the MAP and VHDL macros (HDL)
SRC-6e Designs
SRC-Mesh Design: state machine plus cells; the complete mesh is written in VHDL.
SRC-Cells Design: control written in C; only the cell is a VHDL macro; the mesh is described in MAP C.
Modified Architecture of the Cell for SRC-Mesh
[Block diagram] Same as the basic cell (LUT-RAM, CU, CR, comparator, Check Dest, status bits), with an additional register R inserted on the result path.
SRC-Cells Design Entry & Circuit

Two cells are instantiated by calling the cell macro twice inside a loop and cross-connecting their outputs:

for ( ... ) {
  cell (a1, &b1);
  cell (a2, &b2);
  a1 = b2;
  a2 = b1;
}
Cell Architecture for SRC-Cells Design
[Block diagram] The cell macro contains only the current-packet register (CR), comparator, destination check (Check Dest), and status bits; there is no LUT-RAM inside the macro.
Results and Analysis
SRC Basic Mesh Routing Results

K = number of parallel sub-matrix by sub-vector multiplications
n = number of times to repeat sub-multiplications = D^2 / m^4
x = clock cycles per compare-exchange
T_Kroute = routing time for K multiplications in the mesh = d * 4 * m * x * Period
T_KTot = time for K multiplications, including loading, unloading, and routing
T_512Compute = computational time for a 512-bit Matrix step
T_512Total = total time for a 512-bit Matrix step = 3*D/K * n * T_KTot

Design | Mesh size | K | CLB slices | LUTs | FFs | Period (ns) | x | T_Kroute | T_KTot | T_512Compute (days) | T_512Total (days)
SRC-Mesh | 120x120 (matrix 14400x14400) | 42 | 3,743 (9%) | 54,66 (8%) | 43,545 (64%) | 10 | 2 | 96 ns | 87 ns | ,52 | 22,46
SRC-Mesh | 200x200 (matrix 40000x40000) | 1 | 3,533 (93%) | 54,69 (8%) | 28,636 (42%) | 10 | 2 | 6 ns | 227 ns | 4,222 | 47,865
SRC-Mesh | 100x100 (matrix 10000x10000) | 7 | 3,566 (93%) | 55,528 (82%) | 46,647 (69%) | 10 | 2 | 8 ns | 87 ns | ,938 | 27,898
SRC-Cells | 11x11 (matrix 121x121) | 1 | 32,84 (97%) | 29,959 (44%) | 47,759 (7%) | 10 | 3 | 32 ns | 6 ns | 939,676 | ,46,2
Comparison of 512-bit Performance for different mesh sizes & K values with equivalent area
[Bar chart: computational time and total time (days) for mesh 200x200 with K=1, mesh 120x120 with K=42, and mesh 100x100 with K=7]
Conclusion for performance of different mesh sizes & K values
Comparing performance for different mesh sizes and K values at equivalent FPGA resources (about 90% utilization):
- the 120x120 mesh with K=42 is better than the 200x200 mesh with K=1
- the 100x100 mesh with K=7 is similar to the 120x120 mesh with K=42

T_Total = 3*D/K * D^2/m^4 * (d * 4 * m * x * T_clk + T_Load)
SRC-Mesh vs SRC-Cells: Area for a 10x10 mesh with K = 1

Design | Mesh size | K | CLB slices | LUTs | FFs | Period (ns) | x | T_Kroute
SRC-Cells | 10x10 (matrix 100x100) | 1 | 25,325 (74%) | 22,256 (33%) | 36,042 (53%) | 10 | 3 | 2 ns
SRC-Mesh | 10x10 (matrix 100x100) | 1 | 9,347 (27%) | 13,427 (19%) | 10,439 (15%) | 10 | 2 | 8 ns
SRC-Mesh vs SRC-Cells: Area for the 10x10 mesh
[Bar chart, % utilization]
SRC-Mesh: CLB 27%, LUT 19%, FF 15%
SRC-Cells: CLB 74%, LUT 33%, FF 53%
Conclusions for SRC-Mesh and SRC-Cells
SRC-Cells has about 2.7 times larger area than SRC-Mesh for the same mesh parameters, and performs worse than SRC-Mesh (only a small mesh can fit, and K stays small).
Benefit: ease of programming in a high-level language.
SRC Improved Mesh Routing Results (Area)

Design | Mesh size (m x m, p = 16) | K | CLB slices | LUTs | FFs
Improved SRC-Mesh | 100x100 (matrix 160000x160000) | 32 | 3,2 (9%) | 5,95 (76%) | 29,954 (44%)
Improved SRC-Mesh | 80x80 (matrix 102400x102400) | 64 | 3,456 (93%) | 53,6 (78%) | 3,82 (45%)
SRC Improved Mesh Routing Results (Performance)

K = number of simultaneous vectors being multiplied
p = number of columns of A handled in one cell = 16
n = number of times to repeat sub-multiplications = D^2 / (m^2 * p)^2
x = clock cycles per compare-exchange operation
T_Kroute = routing time for K multiplications in the mesh = p * d * 4 * m * x * Period
T_KTot = time for K multiplications, including loading, unloading, and routing
T_512Compute = computational time for a 512-bit Matrix step
T_512Total = total time for a 512-bit Matrix step = 3*D/K * n * T_KTot

Design | Mesh size | K | Period (ns) | x | T_Kroute (ns) | T_KTot (ns) | T_512Compute (days) | T_512Total (days)
Improved SRC-Mesh | 100x100 (matrix 160000x160000) | 32 | 10 | 3 | 9,2 | 3,48 | 2444 | 4,3
Improved SRC-Mesh | 80x80 (matrix 102400x102400) | 64 | 10 | 3 | 5,36 | 25,26 | 244 | 3,93
Analysis & Conclusion for SRC-6e Improved & Basic Mesh Routing
The Improved SRC-Mesh is faster than the Basic SRC-Mesh design by a factor of 57 on the SRC-6e Virtex II 6000: 393 days compared to 22,46 days in the best case.
The larger sub-matrix size significantly decreases the number of sub-multiplications.
Standalone FPGA vs. SRC Design
Standalone Virtex II 8000 vs. SRC Virtex II 6000: the Virtex II 8000 designs allow larger K and m.
The latency of routing increases in the SRC-6e: to reach the 100 MHz target frequency, each compare-exchange takes 2-3 extra clock cycles.
I/O is limited to the 6 OBM banks in the SRC-6e, so loading and unloading take longer.
Hence the comparison of a two-dimensional array of standalone Virtex II FPGAs vs. one FPGA on the SRC-6e.
Summary & Conclusions
The first practical hardware implementation of Mesh Routing for the Number Field Sieve was implemented and tested.
Practical, concrete numbers were obtained for the theoretical Mesh Routing algorithm, assessing the current hardness of the matrix step in reconfigurable hardware.
Two architectures, Basic and Improved, were implemented and compared.
All designs were compared on two platforms: a generic array of FPGA devices and the SRC-6e Reconfigurable Computer.
Summary & Conclusions
Assuming constant area, the Improved Mesh Routing design is faster than the Basic Mesh Routing design by a factor of 10-15 in the Virtex II 8000, thanks to the larger sub-matrix handled.
A two-dimensional array of Virtex II chips performs the computation faster than a single FPGA by a factor proportional to (number of FPGAs)^(3/2).
The Matrix step for a 1024-bit number can be performed using 24 Virtex II chips in 27 days.
Summary & Conclusions
Two design-entry approaches were developed for the SRC-6e:
- SRC-Mesh is written entirely in VHDL
- SRC-Cells is written mostly in C, with only the cell in VHDL
SRC-Mesh outperforms SRC-Cells by a factor of 5, at the cost of the difficulty of developing the VHDL code.
A manually optimized circuit in VHDL is the suitable approach on the SRC platform for the distributed mesh computation.
Acknowledgements
Dr. Kris Gaj
SRC Computers, Inc.
Deapesh Misra
Questions