Reconfigurable Hardware Implementation and Analysis of Mesh Routing for the Matrix Step of the Number Field Sieve Factorization

Sashisu Bajracharya, MS CpE Candidate
Master's Thesis Defense
Advisor: Dr. Kris Gaj
George Mason University
Outline
1. RSA and Factoring with the Number Field Sieve (NFS)
2. Matrix Step of NFS
3. Basic Mesh Routing
4. Improved Mesh Routing
5. Results and Conclusions: FPGA Array, SRC-6e Reconfigurable Computer
6. Summary & Conclusions
RSA: the Major Public-Key Cryptosystem
Public key: (e, N); Private key: (d, P, Q)
Alice encrypts with { e, N }; the ciphertext travels over the network; Bob decrypts with { d, P, Q }
N = P * Q, where P and Q are large prime factors
RSA was developed by Ron Rivest, Adi Shamir & Leonard Adleman in 1977.
Applications of RSA
Secure WWW (SSL): about 95% of e-commerce (network browser to web server)
Secure e-mail: S/MIME, PGP (Alice to Bob)
How hard is it to break RSA?
Largest number factored: 576 bits, RSA-576 (Dec 2003)
[174-digit decimal value of RSA-576 omitted]
Resources & effort: workstations from 8 different sites around the world, 3 months
Recommended Key Sizes for RSA
Old standard (individual users): 512 bits (155 decimal digits); broken in 1999
New standard:
- Individual users: 768 bits
- Organizations (short term): 1024 bits
- Organizations (long term): 2048 bits
Estimated difficulty of factoring a 1024-bit number (by RSA Security, Inc.): 342 million PCs with 500 MHz and 170 GB RAM, for 1 year.
Our Task
Determine how hard it is to break RSA for large key sizes using reconfigurable hardware:
- a generic array of FPGAs
- the SRC-6e Reconfigurable Computer
Best Algorithm to Factor: the NUMBER FIELD SIEVE
Complexity: sub-exponential in time and memory
N = number to factor, k = number of bits of N
Exponential function: e^k
Sub-exponential function: e^(k^(1/3) * (ln k)^(2/3))
Polynomial function: a * k^m
Number Field Sieve (NFS) Steps
1. Polynomial Selection
2. Sieving
3. Matrix (Linear Algebra)
4. Square Root
The computationally intensive steps are Sieving and the Matrix step.
Hardware Architectures for NFS Proposed to Date
Daniel Bernstein (Univ. of Illinois at Chicago): the Mesh Sorting approach for the Matrix step and Sieving (Fall 2001)
Adi Shamir & Eran Tromer (Weizmann Institute, Israel): Mesh Routing for the Matrix step (AsiaCrypt 2002) and TWIRL for Sieving (Crypto 2003, AsiaCrypt 2003)
The mesh method improves the asymptotic complexity of NFS performance.
So far only analytical estimates: no real implementation, no concrete numbers.
My Objective
Bring this mesh algorithm to a practical hardware implementation and concrete numbers.
Focus of research: the Matrix (Linear Algebra) step.
My Objective
Detailed design of the mesh algorithm in RTL code.
Synthesis and implementation results for an array of Virtex FPGAs and the SRC-6e Reconfigurable Computer.
Function of the Matrix Step
Find a linear dependency among the columns of the large sparse matrix obtained from the sieving step.
D = number of matrix columns (or rows): about 10^6 for 512-bit, about 10^7 for 1024-bit
Find columns c_i1, c_i2, ..., c_il such that c_i1 + c_i2 + ... + c_il = 0 (over GF(2))
Mesh-based hardware circuits, proposed by Bernstein and Shamir-Tromer, decrease the time and cost of matrix-vector multiplications.

Block Wiedemann Algorithm for the Matrix Step
1) Multiple matrix-vector multiplications of the sparse matrix A with K random vectors: A*v_i, A^2*v_i, ..., A^k*v_i, where k = 2D/K
2) A post-computation leading to the determination of a linear dependence among the columns of matrix A
Most time-consuming operation: A [D x D] * v [D x 1]
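The block Wiedemann inner loop above can be sketched in C at toy scale. This is an illustrative software model, not the hardware design: D is shrunk to 4, A is stored dense, and the names `gf2_matvec` and `krylov_sequence` are invented for the sketch.

```c
#include <stdint.h>

#define D 4  /* toy dimension; the real D is on the order of 10^6 */

/* One matrix-vector product over GF(2): bit r of the result is the
 * parity of (row r of A) AND v.  Rows are stored as D-bit masks. */
static uint8_t gf2_matvec(const uint8_t A[D], uint8_t v) {
    uint8_t out = 0;
    for (int r = 0; r < D; r++) {
        uint8_t x = A[r] & v;        /* entries selected by v    */
        x ^= x >> 2; x ^= x >> 1;    /* parity of the low 4 bits */
        out |= (uint8_t)((x & 1u) << r);
    }
    return out;
}

/* Block Wiedemann inner loop for one random vector:
 * the sequence v, A*v, A^2*v, ..., A^k*v (k = 2D/K in the slides). */
static void krylov_sequence(const uint8_t A[D], uint8_t v, int k, uint8_t *seq) {
    seq[0] = v;
    for (int j = 1; j <= k; j++)
        seq[j] = gf2_matvec(A, seq[j - 1]);
}
```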
Two Architectures for Matrix-Vector Multiplication
Mesh Sorting (Bernstein): based on recursive sorting; does one multiplication at a time; large area.
Mesh Routing (Shamir-Tromer): based on routing; does K multiplications at a time; compact area (handles a larger matrix size).
Mesh Routing
Matrix-Vector Multiplication: A * v, where A is a sparse matrix.
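Because A is sparse over GF(2), a multiplication only needs to touch the nonzero entries. A minimal software sketch, with hypothetical names and column-wise storage mirroring the mesh cells (each cell holds the nonzero row indices of one column):

```c
#include <stddef.h>

/* Sparse GF(2) matrix stored column-wise: col_rows[c] lists the rows
 * containing a 1 in column c, and col_len[c] is its length. */
static void sparse_gf2_matvec(size_t D,
                              const size_t *const *col_rows,
                              const size_t *col_len,
                              const unsigned char *v,  /* v[c] in {0,1}   */
                              unsigned char *out) {    /* out[r] in {0,1} */
    for (size_t r = 0; r < D; r++)
        out[r] = 0;
    for (size_t c = 0; c < D; c++) {
        if (!v[c])
            continue;                 /* column contributes only if v[c] = 1 */
        for (size_t i = 0; i < col_len[c]; i++)
            out[col_rows[c][i]] ^= 1; /* addition mod 2 is XOR               */
    }
}
```

The inner XOR into `out[row]` is exactly what a packet arriving at its target cell does in the mesh.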
Mesh Routing
m x m mesh, where m = sqrt(D).
Each cell holds the nonzero row indices of one column of A and the corresponding bit of v.
d = maximum number of nonzero entries over all columns.
Routing in the Mesh
Packets travel toward their target cells. Each time a packet arrives at the target cell, the packet's vector bit is XORed with the partial-result bit held at the target cell.
Mesh Routing
After routing completes, the mesh contains the result of the multiplication.
Mesh Routing with K Parallel Multiplications
Example for K = 2: vectors v_1 and v_2 are multiplied by A concurrently in the same mesh.
Clockwise Transposition Routing
At each step, a cell holds one packet and receives one packet from a neighbor for a compare-exchange.
The exchange is performed only if it reduces the distance to target of the farthest-traveling packet.
Clockwise Transposition Routing
The four compare-exchange directions with the neighboring cells are repeated in four successive iterations.
Types of Packets
1) Valid packet
2) Invalid packet: a packet becomes invalid once it has reached its destination
Compare-Exchange Cases
Four cases for a cell:
a) Both packets valid (may need to exchange)
b) Current packet invalid, incoming packet valid (may need to exchange)
c) Current packet valid, incoming packet invalid (may need to annihilate)
d) Both packets invalid (no action)
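The farthest-first exchange rule for case (a) can be modeled in a few lines of C. `should_exchange` and its parameters are illustrative names, and the invalid-packet cases are omitted:

```c
#include <stdlib.h>

static int row_dist(int row, int target_row) { return abs(row - target_row); }

/* Simplified model of case (a), both packets valid, during a vertical
 * compare-exchange between adjacent rows r_a and r_b (r_a above r_b).
 * Return 1 iff swapping strictly reduces the distance to target of the
 * farthest-traveling packet. */
static int should_exchange(int r_a, int r_b, int target_a, int target_b) {
    int da = row_dist(r_a, target_a);     /* packet currently in upper cell */
    int db = row_dist(r_b, target_b);     /* packet currently in lower cell */
    if (da >= db)
        return row_dist(r_b, target_a) < da;  /* would moving a down help? */
    return row_dist(r_a, target_b) < db;      /* would moving b up help?   */
}
```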
Basic and Improved Mesh Routing Designs
Basic Mesh Routing Design
Each cell of the mesh handles one column of matrix A.
K = 1 or K = 32, where K = number of vectors multiplied by matrix A concurrently.
Total routing takes d * 4 * m compare-exchange operations.
Basic Loading and Unloading Design
Inputs: the vector and the nonzero matrix entries; output: the result vector.
Parallel Loading & Unloading Design
The vector, nonzero matrix entries, and result vector are transferred in parallel.
Restricted by the number of I/O pins available.
Basic Cell Design for Basic Mesh Routing
[Block diagram] Components: LUT-RAM (addressed storage for packets P[i] and result bits R[i]), control unit (CU), current-packet register (CR), comparator, destination-check logic (Check Dest), and status bits holding the cell's (r, c) coordinate. Control signals: en_cur, exchange, annihilate, en_equal, eq_packet, row/col, oper.
Comparator Design
[Block diagram] Compares the cell's coordinate with the (row, col, s) fields of the current and new packets; a greater-than/equal comparator feeds control-signal logic that produces exchange, annihilate, and eq_packet, steered by row/col, oper, s1, s2, and en_equal.
Improved Mesh Routing Design (proposed for cost reduction)
Each cell of the mesh handles p columns of the matrix A.
Compact area => handles a larger matrix size.
Total routing takes p * d * 4 * m compare-exchange steps.
Mesh Cell Design for Improved Mesh Routing
[Block diagram] Same structure as the basic cell (LUT-RAM, CU, CR, comparator, Check_Dest, status bits), extended with column addresses (addr, addr2) to select among the p columns stored in each cell.
Target FPGA Devices
Xilinx Virtex II XC2V8000: 46,592 CLB slices; 93,184 LUTs; 93,184 flip-flops; 18x18 multipliers; Block RAMs
Xilinx Virtex II XC2V6000: 33,792 CLB slices; 67,584 LUTs; 67,584 flip-flops
CLB slice: LUT + carry & control logic + flip-flop; I/O blocks surround the array.
Results and Analysis
Synthesis Results for one Virtex II XC2V8000 using the Basic Mesh Routing Design

Matrix size | K | CLB slices | LUTs | FFs | Clock period (ns) | Time for K mult | Time per mult
14400x14400 (mesh 120x120) | 1 | 823 (7%) | 5,495 (6%) | 5,38 (5%) | 4 | 672 | 672
14400x14400 (mesh 120x120) | 32 | 23,949 (5%) | 46,944 (5%) | 23,49 (25%) | 66 | 797 | 25
14400x14400 (mesh 120x120) | 7 | 43,65 (92%) | 84,836 (9%) | 45,378 (48%) | 78 | 854 | 2

K = number of concurrent matrix-vector multiplications
Time for K mult = d * 4 * m * Clock period
Speedup vs. Software Implementation
Reference optimized SW implementation: PC, Pentium IV, 2.768 GHz, 1 GB RAM

Matrix size | K | One mult time in SW (ns) | One mult time in HW (ns) | Speedup
14400x14400 (mesh 120x120) | 7 | 344 | 2 | 282
Distributed Computation (Geiselmann, Steinwandt)

Partition A into s x s sub-matrices A_i,j and v into s sub-vectors v_j (example for s = 3):

        | A_1,1  A_1,2  A_1,3 |   | v_1 |
A * v = | A_2,1  A_2,2  A_2,3 | * | v_2 |
        | A_3,1  A_3,2  A_3,3 |   | v_3 |

(A * v)_i = sum over j = 1..s of A_i,j * v_j (addition over GF(2), i.e. XOR)
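A toy software model of this block decomposition over GF(2), with invented names and an 8x8 matrix split into 2x2 blocks of size 4. Each block product could run on a separate FPGA, or be time-multiplexed on one array as in the slides:

```c
#include <stdint.h>

#define S 2   /* blocks per dimension (toy value)       */
#define B 4   /* rows/cols per block; full D = S*B = 8  */

/* Dense GF(2) block: row r is a B-bit mask. */
typedef struct { uint8_t row[B]; } block;

static uint8_t parity4(uint8_t x) { x ^= x >> 2; x ^= x >> 1; return x & 1u; }

/* One block-by-subvector product over GF(2). */
static uint8_t block_matvec(const block *blk, uint8_t v) {
    uint8_t out = 0;
    for (int r = 0; r < B; r++)
        out |= (uint8_t)(parity4(blk->row[r] & (v & 0xFu)) << r);
    return out;
}

/* Distributed scheme: (A*v)_i = XOR over j of A[i][j] * v_j. */
static void distributed_matvec(block A[S][S], const uint8_t v[S], uint8_t out[S]) {
    for (int i = 0; i < S; i++) {
        out[i] = 0;
        for (int j = 0; j < S; j++)
            out[i] ^= block_matvec(&A[i][j], v[j]);  /* GF(2) sum of block products */
    }
}
```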
512-bit & 1024-bit performance with different numbers of FPGAs connected in a two-dimensional square array:
1) The FPGA array performs a single sub-matrix by sub-vector multiplication
2) The FPGA array is reused for the next sub-computation
512-bit Performance with one chip & multiple chips connected in a mesh for Basic Mesh Routing

D = number of columns in matrix A = 6.7 x 10^6
m = mesh dimension
n = number of times to repeat multiplications = D^2 / m^4
T_K = routing time for K multiplications in the mesh = d * 4 * m * Clock Period
T_Load = time for loading and unloading for K multiplications
T_Total = total time for the Matrix step = 3*D/K * n * (T_K + T_Load)

[Table: n, T_K, T_Load, T_Total (days), and speedup vs. one chip for increasing numbers of Virtex II chips (values illegible)]
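The timing model in the definitions above translates directly into code. The numeric inputs in the check are arbitrary placeholders, not figures from the table:

```c
/* Total matrix-step time model from the slides (Basic Mesh Routing):
 *   n       = D^2 / m^4          repetitions of the multiplication
 *   T_K     = d * 4 * m * T_clk  routing time for K multiplications
 *   T_total = (3*D/K) * n * (T_K + T_load)
 * All times in nanoseconds; inputs are illustrative. */
static double t_total_ns(double D, double m, double K, double d,
                         double t_clk_ns, double t_load_ns) {
    double n   = (D * D) / (m * m * m * m);
    double t_k = d * 4.0 * m * t_clk_ns;
    return 3.0 * D / K * n * (t_k + t_load_ns);
}
```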
1024-bit Performance with one chip & multiple chips connected in a mesh for Basic Mesh Routing

D = number of columns in matrix A = 4 x 10^7
m = mesh dimension
n = number of times to repeat multiplications = D^2 / m^4
T_K = routing time for K multiplications in the mesh = d * 4 * m * Clock Period
T_Load = time for loading and unloading for K multiplications
T_Total = total time for the Matrix step = 3*D/K * n * (T_K + T_Load)

[Table: n, T_K, T_Load, T_Total (days), and speedup vs. one chip for increasing numbers of Virtex II chips (values illegible)]
Analysis & Conclusion
Polynomial speedup with the number of FPGAs: speedup approximately proportional to (#FPGAs)^(3/2).

T_Total = 3*D/K * D^2 / (m * sqrt(#chips))^4 * (d * 4 * m * sqrt(#chips) * T_clk + T_Load)

m = mesh size in one Virtex II chip
Speedup vs. number of FPGA chips
[Chart: speedup over 1 chip as a function of the number of Virtex II chips]
Synthesis Results on one Virtex II XC2V8000 for the Improved Mesh Routing Design

Matrix size | K | CLB slices | LUTs | FFs | Clock period (ns) | Time for K mult | Time per mult
230400x230400 (mesh 120x120, p=16) | 1 | 6738 (4%) | 10,438 (11%) | 6,279 (7%) | 45 | 36 | 36
230400x230400 (mesh 120x120, p=16) | 32 | 29,938 (64%) | 5,983 (54%) | 9,65 (2%) | 67 | 2826 | 4
230400x230400 (mesh 120x120, p=16) | 5 | 43,42 (93%) | 74,3 (89%) | 27,46 (29%) | 77 | 3593 | 27

Time for K mult = p * d * 4 * m * Clock period
512-bit Performance with one chip & multiple chips connected in a mesh for Improved Mesh Routing

D = number of columns in matrix A = 6.7 x 10^6
p = number of columns handled in one cell = 16
n = number of times to repeat sub-multiplications = D^2 / (m^2 * p)^2
T_K = routing time for K multiplications in the mesh = p * d * 4 * m * Clock Period
T_Load = time for loading and unloading for K multiplications
T_Total = total time for the Matrix step = 3*D/K * n * (T_K + T_Load)

[Table: n, T_K, T_Load, T_Total (days), and speedup vs. one chip for increasing numbers of Virtex II chips (values illegible)]
1024-bit Performance with one chip & multiple chips connected in a mesh for Improved Mesh Routing

D = number of columns in matrix A = 4 x 10^7
p = number of columns handled in one cell = 16
n = number of times to repeat sub-multiplications = D^2 / (m^2 * p)^2
T_K = routing time for K multiplications in the mesh = p * d * 4 * m * Clock Period
T_Load = time for loading and unloading for K multiplications
T_Total = total time for the Matrix step = 3*D/K * n * (T_K + T_Load)

[Table: n, T_K, T_Load, T_Total (days), and speedup vs. one chip for increasing numbers of Virtex II chips (values illegible)]
Comparison of Basic & Improved Mesh Routing performance with the number of FPGAs
[Two charts: total time (days) vs. number of Virtex II chips for Basic and Improved Mesh Routing; left: 512-bit, right: 1024-bit]
Speedup of Improved over Basic Mesh Routing vs. Number of Virtex II FPGAs
[Two charts: speedup ratio vs. number of Virtex II chips; left: 512-bit, right: 1024-bit]
Comparison vs. a Cray Implementation
512-bit number, Improved Mesh Routing design:
Cray C916: 93 days
24 Virtex II FPGAs: 1.32 days (32 hours)
Conclusions for Basic Mesh Routing & Improved Mesh Routing
Best case for 1024-bit: Improved Mesh Routing design, 24 Virtex II chips, total execution time 27 days.
Improved Mesh Routing is faster than Basic Mesh Routing in the Virtex II 8000 by a factor of around 10-15:
- the larger sub-matrix size handled in the same FPGA sharply decreases the number of iterations of sub-multiplications
- the influence of K dropping from 7 to 5 is very low
SRC-6e Reconfigurable Computer
SRC-6e Reconfigurable Computer Hardware Architecture

µP board: two P3 (1 GHz) microprocessors with L2 caches, MIOC, computer memory (1.5 GB), PCI-X; 800 MB/s paths; SNAP interface (800 MB/s; 64 bits data + 8 bits flags)
MAP board (1/2 MAP shown): Control FPGA XC2V6000 with DDR interface (528 MB/s each way to the SNAP); two user FPGAs XC2V6000 (FPGA 1, FPGA 2) linked by a 2400 MB/s (192-bit) direct connection and chain ports; On-Board Memory (24 MB), 4800 MB/s (6 x 64 bits) to each FPGA
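Assuming the on-board-memory figure is 4800 MB/s, i.e. 6 OBM banks of 64 bits each transferred per 100 MHz clock, the arithmetic checks out:

```c
/* Sanity check of an FPGA memory-bandwidth figure:
 * banks * (bits per bank / 8) bytes per cycle, at clock_mhz million
 * cycles per second, gives MB/s.  Values below are assumptions taken
 * from this slide's annotations, not measured numbers. */
static long bandwidth_mb_per_s(long banks, long bits_per_bank, long clock_mhz) {
    return banks * (bits_per_bank / 8) * clock_mhz;
}
```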
MAP Programming Model of SRC

MAP C subroutine -> FPGA contents:

MAP_Function(a, d, e) {
  Macro_1(a, b, c);
  Macro_2(b, d);
  Macro_2(c, e);
}

Each macro call instantiates a hardware macro on the FPGA: a feeds Macro_1, whose outputs b and c feed the two Macro_2 instances producing d and e.
SRC Program Partitioning
µP system: C function for the µP (HLL)
FPGA system: C function for the MAP and VHDL macros (HDL)
SRC-6e Designs
SRC-Mesh Design: state machine plus cells; the complete mesh is written in VHDL.
SRC-Cells Design: control written in C; only the cell is a VHDL macro; the mesh is described in MAP C.
Modified Architecture of the Cell for SRC-Mesh
[Block diagram] Same as the basic cell (LUT-RAM, CU, CR, comparator, Check Dest, status bits), with an additional register R inserted on the result path.
SRC-Cells Design Entry & Circuit

Two cells are instantiated by calling the cell macro twice inside a loop and cross-connecting their outputs:

for ( ... ) {
  cell (a1, &b1);
  cell (a2, &b2);
  a1 = b2;
  a2 = b1;
}
Cell Architecture for SRC-Cells Design
[Block diagram] The cell macro contains only the current-packet register (CR), comparator, destination check (Check Dest), and status bits; there is no LUT-RAM inside the macro.
Results and Analysis
SRC Basic Mesh Routing Results

K = number of parallel sub-matrix by sub-vector multiplications
n = number of times to repeat sub-multiplications = D^2 / m^4
x = clock cycles per compare-exchange
T_Kroute = routing time for K multiplications in the mesh = d * 4 * m * x * Period
T_KTot = time for K multiplications, including loading, unloading, and routing
T_512Compute = computational time for a 512-bit Matrix step
T_512Total = total time for a 512-bit Matrix step = 3*D/K * n * T_KTot

Design | Mesh size | K | CLB slices | LUTs | FFs | Period (ns) | x | T_Kroute | T_KTot | T_512Compute (days) | T_512Total (days)
SRC-Mesh | 120x120 (matrix 14400x14400) | 42 | 3,743 (9%) | 54,66 (8%) | 43,545 (64%) | 10 | 2 | 96 ns | 87 ns | ,52 | 22,46
SRC-Mesh | 200x200 (matrix 40000x40000) | 1 | 3,533 (93%) | 54,69 (8%) | 28,636 (42%) | 10 | 2 | 6 ns | 227 ns | 4,222 | 47,865
SRC-Mesh | 100x100 (matrix 10000x10000) | 7 | 3,566 (93%) | 55,528 (82%) | 46,647 (69%) | 10 | 2 | 8 ns | 87 ns | ,938 | 27,898
SRC-Cells | 11x11 (matrix 121x121) | 1 | 32,84 (97%) | 29,959 (44%) | 47,759 (7%) | 10 | 3 | 32 ns | 6 ns | 939,676 | ,46,2
Comparison of 512-bit Performance for different mesh sizes & K values with equivalent area
[Bar chart: computational time and total time (days) for mesh 200x200 with K=1, mesh 120x120 with K=42, and mesh 100x100 with K=7]
Conclusion for performance of different mesh sizes & K values
Comparing performance for different mesh sizes and K values at equivalent FPGA resources (about 90% utilization):
- the 120x120 mesh with K=42 is better than the 200x200 mesh with K=1
- the 100x100 mesh with K=7 is similar to the 120x120 mesh with K=42

T_Total = 3*D/K * D^2/m^4 * (d * 4 * m * x * T_clk + T_Load)
SRC-Mesh vs SRC-Cells: Area for a 10x10 mesh with K = 1

Design | Mesh size | K | CLB slices | LUTs | FFs | Period (ns) | x | T_Kroute
SRC-Cells | 10x10 (matrix 100x100) | 1 | 25,325 (74%) | 22,256 (33%) | 36,042 (53%) | 10 | 3 | 2 ns
SRC-Mesh | 10x10 (matrix 100x100) | 1 | 9,347 (27%) | 13,427 (19%) | 10,439 (15%) | 10 | 2 | 8 ns
SRC-Mesh vs SRC-Cells: Area for the 10x10 mesh
[Bar chart, % utilization]
SRC-Mesh: CLB 27%, LUT 19%, FF 15%
SRC-Cells: CLB 74%, LUT 33%, FF 53%
Conclusions for SRC-Mesh and SRC-Cells
SRC-Cells has about 2.7 times larger area than SRC-Mesh for the same mesh parameters, and performs worse than SRC-Mesh (only a small mesh can fit, and K stays small).
Benefit: ease of programming in a high-level language.
SRC Improved Mesh Routing Results (Area)

Design | Mesh size (m x m, p = 16) | K | CLB slices | LUTs | FFs
Improved SRC-Mesh | 100x100 (matrix 160000x160000) | 32 | 3,2 (9%) | 5,95 (76%) | 29,954 (44%)
Improved SRC-Mesh | 80x80 (matrix 102400x102400) | 64 | 3,456 (93%) | 53,6 (78%) | 3,82 (45%)
SRC Improved Mesh Routing Results (Performance)

K = number of simultaneous vectors being multiplied
p = number of columns of A handled in one cell = 16
n = number of times to repeat sub-multiplications = D^2 / (m^2 * p)^2
x = clock cycles per compare-exchange operation
T_Kroute = routing time for K multiplications in the mesh = p * d * 4 * m * x * Period
T_KTot = time for K multiplications, including loading, unloading, and routing
T_512Compute = computational time for a 512-bit Matrix step
T_512Total = total time for a 512-bit Matrix step = 3*D/K * n * T_KTot

Design | Mesh size | K | Period (ns) | x | T_Kroute (ns) | T_KTot (ns) | T_512Compute (days) | T_512Total (days)
Improved SRC-Mesh | 100x100 (matrix 160000x160000) | 32 | 10 | 3 | 9,2 | 3,48 | 2444 | 4,3
Improved SRC-Mesh | 80x80 (matrix 102400x102400) | 64 | 10 | 3 | 5,36 | 25,26 | 244 | 3,93
Analysis & Conclusion for SRC-6e Improved & Basic Mesh Routing
The Improved SRC-Mesh is faster than the Basic SRC-Mesh design by a factor of 57 on the SRC-6e Virtex II 6000: 393 days compared to 22,46 days in the best case.
The larger sub-matrix size significantly decreases the number of sub-multiplications.
Standalone FPGA vs. SRC Design
Standalone Virtex II 8000 vs. SRC Virtex II 6000: the Virtex II 8000 designs allow larger K and m.
The latency of routing increases in the SRC-6e: to reach the 100 MHz target frequency, each compare-exchange takes 2-3 extra clock cycles.
I/O is limited to the 6 OBM banks in the SRC-6e, so loading and unloading take longer.
Hence the comparison of a two-dimensional array of standalone Virtex II FPGAs vs. one FPGA on the SRC-6e.
Summary & Conclusions
The first practical hardware implementation of Mesh Routing for the Number Field Sieve was implemented and tested.
Practical, concrete numbers were obtained for the theoretical Mesh Routing algorithm, assessing the current hardness of the matrix step in reconfigurable hardware.
Two architectures, Basic and Improved, were implemented and compared.
All designs were compared on two platforms: a generic array of FPGA devices and the SRC-6e Reconfigurable Computer.
Summary & Conclusions
Assuming constant area, the Improved Mesh Routing design is faster than the Basic Mesh Routing design by a factor of 10-15 in the Virtex II 8000, thanks to the larger sub-matrix handled.
A two-dimensional array of Virtex II chips performs the computation faster than a single FPGA by a factor proportional to (number of FPGAs)^(3/2).
The Matrix step for a 1024-bit number can be performed using 24 Virtex II chips in 27 days.
Summary & Conclusions
Two design-entry approaches were developed for the SRC-6e:
- SRC-Mesh is written entirely in VHDL
- SRC-Cells is written mostly in C, with only the cell in VHDL
SRC-Mesh outperforms SRC-Cells by a factor of 5, at the cost of the difficulty of developing the VHDL code.
A manually optimized circuit in VHDL is the suitable approach on the SRC platform for the distributed mesh computation.
Acknowledgements
Dr. Kris Gaj
SRC Computers, Inc.
Deapesh Misra
Questions