Reconfigurable Hardware Implementation and Analysis of Mesh Routing for the Matrix Step of the Number Field Sieve Factorization


1 Reconfigurable Hardware Implementation and Analysis of Mesh Routing for the Matrix Step of the Number Field Sieve Factorization. Sashisu Bajracharya, MS CpE Candidate. Master's Thesis Defense. Advisor: Dr. Kris Gaj, George Mason University

2 Outline: RSA and Factoring with Number Field Sieve (NFS); Matrix Step of NFS; Basic Mesh Routing; Improved Mesh Routing; Results and Conclusions for the FPGA Array; SRC-6e Reconfigurable Computer; Summary & Conclusions

3 RSA: Major Public Key Cryptosystem. Public key (e, N); private key (d, P, Q). Alice encrypts with {e, N}, the ciphertext crosses the network, and Bob decrypts with {d, P, Q}. N = P · Q, where P and Q are large prime factors

4 RSA was developed by Ron Rivest, Adi Shamir & Leonard Adleman in 1977

5 Applications of RSA: Secure WWW, SSL, 95% of e-commerce (between browser and web server over the network); S/MIME, PGP (between Alice and Bob)

6 How hard is it to break RSA? Largest number factored: 576 bits, RSA-576 (Dec 2003). Resources & effort: workstations from 8 different sites around the world, 3 months

7 Recommended key sizes for RSA. Old standard: individual users, 512 bits (155 decimal digits), broken in 1999. New standard: individual users, 768 bits; organizations (short term), 1024 bits; organizations (long term), 2048 bits

8 Estimated difficulty of factoring a 1024-bit number, by RSA Security, Inc.: 342 million PCs, 500 MHz, 170 GB RAM, 1 year

9 Our Task: Determine how hard it is to break RSA by factoring large key sizes using reconfigurable hardware: a generic array of FPGAs and the SRC-6e Reconfigurable Computer

10 Best Algorithm to Factor: NUMBER FIELD SIEVE. Complexity: sub-exponential time and memory. N = number to factor, k = number of bits of N. Exponential function: $e^{k}$. Sub-exponential function: $e^{k^{1/3} (\ln k)^{2/3}}$. Polynomial function: $a \cdot k^{m}$

11 Number Field Sieve (NFS) Steps: Polynomial Selection; Sieving; Matrix (Linear Algebra); Square Root. Sieving and Matrix are the computationally intensive steps

12 Hardware Architectures of NFS proposed to date. Daniel Bernstein (Univ. of Illinois, Chicago): Mesh Approach for Matrix and Sieving; Mesh Sorting for the Matrix step, Fall 2001. Adi Shamir, Eran Tromer (Weizmann Institute, Israel): Mesh Routing for the Matrix step, AsiaCrypt 2002; TWIRL for Sieving, Crypto 2003, AsiaCrypt 2003. The mesh method improves the asymptotic complexity of NFS. So far there are only analytical estimations: no real implementation, no concrete numbers

13 My Objective: Bring this mesh algorithm to a practical hardware implementation and obtain concrete numbers. Focus of research: the Matrix (Linear Algebra) step

14 My Objective: Detailed design of the mesh algorithm in RTL code. Synthesis and implementation results for an array of Virtex FPGAs and for the SRC-6e Reconfigurable Computer

15 Function of Matrix Step: Find a linear dependency among the columns of the large sparse matrix obtained after the sieving step, i.e. columns with $c_{i_1} + c_{i_2} + \dots + c_{i_l} = 0 \pmod 2$. D = number of matrix columns or rows: about $10^6$ for 512-bit, $10^7$ for 1024-bit

16 Mesh-based hardware circuits, proposed by Bernstein and Shamir-Tromer, decrease the time and cost of matrix-vector multiplications. Block Wiedemann Algorithm for the Matrix Step: 1) Uses multiple matrix-vector multiplications of the sparse matrix A with K random vectors: $A v_i, A^2 v_i, \dots, A^k v_i$, where $k = 2D/K$. 2) Post-computation leading to the determination of a linear dependence among the columns of matrix A. Most time-consuming operation: $A_{[D \times D]} \cdot v_{[D \times 1]}$
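
The repeated multiplications in step 1 can be sketched in software as follows. This is only an illustrative model, not the thesis code: the function names `matvec_gf2` and `krylov_sequence` and the column-list storage format are assumptions made here, and arithmetic is taken over GF(2) because the mesh combines bits with XOR.

```python
from typing import List


def matvec_gf2(cols: List[List[int]], v: List[int]) -> List[int]:
    """Multiply a sparse GF(2) matrix by a bit vector.

    cols[j] lists the row indices i where A[i][j] = 1 (assumed storage format).
    """
    result = [0] * len(v)
    for j, rows in enumerate(cols):
        if v[j]:
            for i in rows:
                result[i] ^= 1  # addition over GF(2) is XOR
    return result


def krylov_sequence(cols: List[List[int]], v: List[int], k: int) -> List[List[int]]:
    """Return [A v, A^2 v, ..., A^k v] by repeated multiplication."""
    out = []
    w = v
    for _ in range(k):
        w = matvec_gf2(cols, w)
        out.append(w)
    return out


# Toy 3x3 matrix A = [[1,1,0],[0,1,0],[1,0,1]] stored column by column.
toy_cols = [[0, 2], [0, 1], [2]]
toy_sequence = krylov_sequence(toy_cols, [1, 0, 1], 2)
```

Each call to `matvec_gf2` corresponds to one A·v operation, which is the part the mesh hardware is meant to accelerate.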

17 Two Architectures for Matrix-Vector Multiplication. Mesh Sorting (Bernstein): based on recursive sorting; does one multiplication at a time; large area. Mesh Routing (Shamir-Tromer): based on routing; does K multiplications at a time; compact area (handles large matrix size)

18 Mesh Routing

19 Matrix-Vector Multiplication: the sparse matrix A is multiplied by the vector v to give A · v

20 Mesh Routing: m x m mesh where $m = \sqrt{D}$, holding A and v in its cells; d = maximum number of non-zero entries over all columns

21 Routing in the Mesh: Each time a packet arrives at its target cell, the packet's vector bit is XORed with the partial result bit in the target cell

22 Mesh Routing Mesh contains the result of the multiplication
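
A rough software model of what the mesh computes is sketched below. It assumes one matrix column per cell and d routing rounds, and it delivers each packet directly to its target instead of simulating the routing path; `mesh_multiply` and the column-list format are hypothetical names chosen for this sketch.

```python
from typing import List, Tuple


def mesh_multiply(cols: List[List[int]], v: List[int], m: int) -> Tuple[List[int], int]:
    """Model an m x m mesh where cell j stores column j and vector bit v[j].

    Returns the result bits held by the cells and d, the number of rounds.
    """
    size = m * m
    assert len(cols) == size and len(v) == size
    d = max(len(rows) for rows in cols)  # max non-zero entries in any column
    result = [0] * size
    for r in range(d):
        # In round r, every cell with an r-th non-zero entry emits one packet
        # (target cell, vector bit).
        packets = [(rows[r], v[j]) for j, rows in enumerate(cols) if r < len(rows)]
        for target, bit in packets:
            result[target] ^= bit  # XOR into the partial result at the target
    return result, d


# Toy 2x2 mesh (D = 4) with illustrative data.
toy_result, toy_rounds = mesh_multiply([[0, 2], [1], [0, 3], [2]], [1, 1, 0, 1], 2)
```

After the last round, the list of partial results equals A·v over GF(2), which is the sense in which the mesh "contains the result of the multiplication".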

23 Mesh Routing with K parallel multiplications. Example for K = 2: two vectors are multiplied by A in the same mesh

24 Clockwise Transposition Routing: In each step, a cell holds one packet and receives one packet from a neighbor for compare-exchange. The exchange is done only if it reduces the distance to target of the farthest-traveling packet

25 Clockwise Transposition Routing: four iterations, repeated (figure: cells and compare-exchange directions)

26 Types of Packets: 1) Valid packet. 2) Invalid packet: a packet becomes invalid once it has reached its destination

27 Compare-Exchange Cases: four cases for a cell. a) Both packets are valid (may need to exchange). b) Current packet invalid, incoming new packet valid (may need to exchange). c) Current packet valid, incoming new packet invalid (may need to annihilate). d) Current packet invalid, incoming new packet invalid (no action)
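
The exchange rule and the four cases can be illustrated with a minimal one-dimensional sketch. It is an interpretation, not the thesis RTL: the function `should_exchange` is hypothetical, packets are reduced to a target coordinate along the current direction, an invalid packet is modeled as `None`, and the separate annihilate signal is not modeled.

```python
from typing import Optional


def should_exchange(pos: int, held: Optional[int], incoming: Optional[int]) -> bool:
    """Decide a compare-exchange between the cell at pos and its neighbor at pos + 1.

    held / incoming are the target coordinates of the two packets, or None if
    the packet is invalid. Exchange only if it reduces the distance to target
    of the farthest-traveling valid packet.
    """
    def farthest(pos_a: int, a: Optional[int], pos_b: int, b: Optional[int]) -> int:
        dists = [abs(t - p) for p, t in ((pos_a, a), (pos_b, b)) if t is not None]
        return max(dists) if dists else 0

    before = farthest(pos, held, pos + 1, incoming)
    after = farthest(pos + 1, held, pos, incoming)
    return after < before


case_a_swap = should_exchange(2, 5, 0)        # both valid, swap helps both
case_a_keep = should_exchange(2, 0, 5)        # both valid, swap would hurt
case_b = should_exchange(2, None, 0)          # only the incoming packet is valid
case_d = should_exchange(2, None, None)       # both invalid: no action
```

With both packets invalid the function always returns `False`, matching the "no action" case.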

28 Basic and Improved Mesh Routing Designs

29 Basic Mesh Routing Design: Each cell of the mesh handles one column of matrix A. K = 1 or K ≥ 32, where K = number of vectors multiplied by matrix A concurrently. Total routing takes d · 4 · m compare-exchange operations

30 Basic Loading and Unloading Design (figure: the vector and the non-zero matrix entries are loaded into the mesh; the result vector is unloaded)

31 Parallel Loading & Unloading Design (figure: the vector and the non-zero matrix entries are loaded, and the result vector unloaded, in parallel). Restricted by the number of I/O pins available

32 Basic Cell Design for Basic Mesh Routing (block diagram: LUT-RAM holding P[i] and R[i] with address decode and enable; CU; CR; Check Dest; Status bits; cell coordinate (r, c); Comparator. Signals: en, en_cur, en_equal, equal, eq_packet, exchange, annihilate, row/col, oper)

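33 Comparator Design (block diagram: inputs are the cell's coordinate, the current packet (row, col, status s1) and the new packet (row, col, status s2), with row/col selection; a comparator (>, =) and control signal logic take oper, s1, s2 and en_equal and produce exchange, annihilate and eq_packet)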

34 Improved Mesh Routing Design: Each cell of the mesh handles p columns of the matrix A. Compact area, so a larger matrix size can be handled. Total routing takes p · d · 4 · m compare-exchange steps. Proposed for cost reduction
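
The two routing-length expressions, d · 4 · m for the basic design and p · d · 4 · m for the improved one, can be compared with a short calculation. The helper names and the numeric values below (d = 10, m = 12, p = 16, a 10 ns clock) are illustrative assumptions, not figures taken from the results tables.

```python
def routing_steps(d: int, m: int, p: int = 1) -> int:
    """Compare-exchange steps for one batch of K multiplications.

    p = 1 models the basic design (one column per cell);
    p > 1 models the improved design (p columns per cell).
    """
    return p * d * 4 * m


def routing_time_ns(d: int, m: int, clock_period_ns: float, p: int = 1) -> float:
    """Routing time, assuming one compare-exchange per clock cycle."""
    return routing_steps(d, m, p) * clock_period_ns


basic_steps = routing_steps(d=10, m=12)            # hypothetical d and m
improved_steps = routing_steps(d=10, m=12, p=16)   # hypothetical p
basic_time = routing_time_ns(d=10, m=12, clock_period_ns=10.0)
```

The improved design takes p times more steps per batch, but the same mesh then covers p times as many matrix columns.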

35 Mesh Cell Design for Improved Mesh Routing (block diagram: LUT-RAM holding R[i] and P[i] with address decode and enable; CU; CR; Check_Dest; Status bits; cell coordinate (r, c); Comparator. Signals: en, en_cur, en_equal, equal, eq_packet, exchange, annihilate, addr, addr2, row/col, oper)

36 Target FPGA Devices: Xilinx Virtex II. XC2V8000: 46,592 CLB slices, 93,184 LUTs (look-up tables), 93,184 FFs (flip-flops). XC2V6000: 33,792 CLB slices, 67,584 LUTs, 67,584 FFs. (Figure: device layout with columns of 18 x 18 multipliers and block RAMs, I/O blocks, and CLB slices; each CLB slice contains LUTs, carry & control logic, and FFs, driven by CLK)

37 Results and Analysis

38 Synthesis Results for one Virtex II XC2V8000 using Basic Mesh Routing Design. Table columns: Matrix Size, K, CLB slices, LUTs, FFs, Clock Period (ns), Time for K mult (ns), Time per mult (ns). Rows: 144x144 (Mesh 12x12) 823 (7%) 5,495 (6%) 5,38 (5%); 144x144 (Mesh 12x12) 32 23,949 (5%) 46,944 (5%) 23,49 (25%); 144x144 (Mesh 12x12) 7 43,65 (92%) 84,836 (9%) 45,378 (48%). K = number of concurrent matrix-vector multiplications. Time for K mult = d * 4 * m * Clock period

39 Speedup vs Software Implementation. Reference optimized SW implementation: PC, Pentium IV, 2.768 GHz, GB RAM. Table columns: Matrix Size, K, One Multiplication Time in SW (ns), One Multiplication Time in HW (ns), Speedup. Matrix size: 144x144 (Mesh 12x12)

40 Distributed Computation (Geiselmann, Steinwandt): A is partitioned into s x s sub-matrices $A_{i,j}$ and v into s sub-vectors $v_j$ (example shown for s = 3), so that $A v = \begin{pmatrix} \sum_{j=1}^{s} A_{1,j} v_j \\ \vdots \\ \sum_{j=1}^{s} A_{s,j} v_j \end{pmatrix}$
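
The block decomposition can be checked with a small software sketch. It uses dense bit lists and GF(2) arithmetic for readability; `blocked_matvec_gf2` and `matvec_gf2_dense` are hypothetical helpers, and the 4x4 matrix is toy data.

```python
from typing import List


def matvec_gf2_dense(A: List[List[int]], v: List[int]) -> List[int]:
    """Dense matrix-vector product over GF(2)."""
    return [sum(a & x for a, x in zip(row, v)) & 1 for row in A]


def blocked_matvec_gf2(A: List[List[int]], v: List[int], s: int) -> List[int]:
    """Compute A v as s x s sub-matrix by sub-vector products, XOR-accumulated."""
    size = len(v)
    assert size % s == 0
    b = size // s  # block dimension
    result: List[int] = []
    for i in range(s):
        acc = [0] * b
        for j in range(s):
            # One sub-computation: block A_{i,j} times sub-vector v_j.
            block = [row[j * b:(j + 1) * b] for row in A[i * b:(i + 1) * b]]
            part = matvec_gf2_dense(block, v[j * b:(j + 1) * b])
            acc = [x ^ y for x, y in zip(acc, part)]
        result.extend(acc)
    return result


toy_A = [[1, 0, 1, 1], [0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
toy_v = [1, 1, 0, 1]
toy_blocked = blocked_matvec_gf2(toy_A, toy_v, 2)
```

Each inner iteration is the kind of sub-matrix by sub-vector multiplication that a fixed-size mesh could perform before being reused for the next block.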

41 512-bit & 1024-bit performance with square arrays of different numbers of FPGAs connected in two dimensions: 1) The FPGA array performs a single sub-matrix by sub-vector multiplication. 2) The FPGA array is reused for the next sub-computation

42 512-bit Performance with one chip & multiple chips connected in mesh for Basic Mesh Routing. D = number of columns in matrix A; m = mesh dimension; n = number of times to repeat multiplications = $D^2/m^4$; $T_K$ = routing time for K multiplications in the mesh = d * 4 * m * Clock Period; $T_{Load}$ = time for loading and unloading for K multiplications; $T_{Total}$ = total time for a Matrix step = 3 * D/K * n * ($T_K$ + $T_{Load}$). Virtex II chips D m n 67x x 6 2 T K (ns) T Load (ns) T Total (days) Speedup vs 1 chip 2 x x x , x ,

43 1024-bit Performance with one chip & multiple chips connected in mesh for Basic Mesh Routing. D = number of columns in matrix A; m = mesh dimension; n = number of times to repeat multiplications = $D^2/m^4$; $T_K$ = routing time for K multiplications in the mesh = d * 4 * m * Clock Period; $T_{Load}$ = time for loading and unloading for K multiplications; $T_{Total}$ = total time for a Matrix step = 3 * D/K * n * ($T_K$ + $T_{Load}$). Virtex II chips D m n 4 x x x 7 92 T K (ns) T Load (ns) 77 x T Total (days) Speedup vs 1 chip 4 x 6 77 x x x ,

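44 Analysis & Conclusion: Polynomial speedup with the number of FPGAs. Speedup is approximately proportional to $(\#FPGA)^{3/2}$. $T_{Total} = 3 \cdot \frac{D}{K} \cdot \frac{D^2}{(m \sqrt{\#chip})^4} \left( d \cdot 4 \cdot m \sqrt{\#chip} \cdot \text{Clock Period} + T_{load} \right)$, where m = mesh size in one Virtex II chip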
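
The 3/2 exponent can be seen from the total-time model with a small calculation. This is a sketch under stated assumptions: the function `total_time` is hypothetical, the chip count must be a perfect square, the loading time is set to zero to isolate the routing term, and all numeric parameters are illustrative rather than measured values.

```python
import math


def total_time(D: float, K: int, d: int, m: int, chips: int,
               clock_period: float, t_load: float = 0.0) -> float:
    """Total Matrix-step time for a square array of `chips` FPGAs.

    m is the mesh dimension inside one chip; the array as a whole behaves
    like a mesh of dimension m * sqrt(chips).
    """
    root = math.isqrt(chips)
    assert root * root == chips, "chips must form a square array"
    side = m * root
    n = D ** 2 / side ** 4                 # number of sub-multiplications
    t_k = d * 4 * side * clock_period      # routing time for K multiplications
    return 3 * D / K * n * (t_k + t_load)


# Hypothetical parameters, chosen only to show the scaling.
params = dict(D=1.0e6, K=32, d=10, m=12, clock_period=10.0)
speedup_4 = total_time(chips=1, **params) / total_time(chips=4, **params)
speedup_16 = total_time(chips=1, **params) / total_time(chips=16, **params)
```

With the loading time ignored, the number of sub-multiplications falls as (#chip)^2 while the routing time grows as (#chip)^(1/2), giving a speedup of (#chip)^(3/2); a non-zero loading time would pull the result away from this ideal.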

45 Speedup vs number of FPGA chips (plot: speedup over 1 chip vs number of Virtex II chips)

46 Synthesis Results on one Virtex II XC2V8000 for Improved Mesh Routing Design. Table columns: Matrix Size, K, CLB slices, LUTs, FFs, Clock Period (ns), Time for K mult (ns), Time per mult (ns). Rows: 2304x2304 (Mesh 12x12, p=16) 6738 (4%) ,438 (%) 6,279 (7%); 2304x2304 (Mesh 12x12, p=16) 32 29,938 (64%) 5,983 (54%) 9,65 (2%); 2304x2304 (Mesh 12x12, p=16) 5 43,42 (93%) 74,3 (89%) 27,46 (29%). Time for K mult = p * d * 4 * m * Clock period

47 512-bit Performance with one chip & multiple chips connected in mesh for Improved Mesh Routing. D = number of columns in matrix A; p = number of columns handled in one cell = 16; n = number of times to repeat sub-multiplications = $D^2/(m^2 p)^2$; $T_K$ = routing time for K multiplications in the mesh = p * d * 4 * m * clock period; $T_{Load}$ = time for loading and unloading for K multiplications; $T_{Total}$ = total time for a Matrix step = 3 * D/K * n * ($T_K$ + $T_{Load}$). Virtex II chips D m n 67x 6 2 T K (ns) T Load (ns) T Total (days) Speedup vs 1 chip 84 x x x 5 2 x x x 6 38 x x x 6 9 x

48 1024-bit Performance with one chip & multiple chips connected in mesh for Improved Mesh Routing. D = number of columns in matrix A; p = number of columns handled in one cell = 16; n = number of times to repeat sub-multiplications = $D^2/(m^2 p)^2$; $T_K$ = routing time for K multiplications in the mesh = p * d * 4 * m * clock period; $T_{Load}$ = time for loading and unloading for K multiplications; $T_{Total}$ = total time for a Matrix step = 3 * D/K * n * ($T_K$ + $T_{Load}$). Virtex II chips D m n T K (ns) T Load (ns) T Total (days) Speedup vs 1 chip 4 x x x x x 4 8 x 5 2 x x 6 38 x x x 6 9 x
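
The effect of handling p columns per cell on the repetition count can be illustrated as follows. The helper `sub_multiplications` and the values D = 10^6, m = 12 and p = 16 are assumptions for illustration; the sketch ignores loading time and any differences in K or clock period between the two designs.

```python
def sub_multiplications(D: float, m: int, p: int = 1) -> float:
    """Number of sub-matrix by sub-vector multiplications, n = D^2 / (m^2 p)^2.

    p = 1 reduces to the basic design, n = D^2 / m^4.
    """
    return D ** 2 / (m ** 2 * p) ** 2


D_example = 1.0e6   # hypothetical matrix dimension
n_basic = sub_multiplications(D_example, m=12)
n_improved = sub_multiplications(D_example, m=12, p=16)

# Fewer repetitions by a factor of p^2, each taking p times more routing steps.
repetition_ratio = n_basic / n_improved
net_routing_ratio = repetition_ratio / 16
```

In this simplified model the repetition count drops by p squared while each routing pass is p times longer, so the routing work shrinks by a factor of p; the measured gain depends on the loading time and on the K that fits in the device.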

49 Comparison of Basic & Improved Mesh Routing performance with the number of FPGAs (two plots, 512-bit and 1024-bit: time in days vs number of Virtex II chips, for Basic Mesh Routing and Improved Mesh Routing)

50 Speedup of Improved to Basic Mesh Routing vs number of Virtex II FPGAs (two plots, 512-bit and 1024-bit: speedup ratio vs number of Virtex II chips)

51 Comparison vs Cray Implementation, 512-bit number, Improved Mesh Routing Design. Cray C916: 9.3 days. 1024 Virtex II FPGAs: 1.32 days (32 hours)

52 Conclusions for Basic Mesh Routing & Improved Mesh Routing. Best case for 1024-bit: Improved Mesh Routing Design, 1024 Virtex II chips, total execution time 27 days. Improved Mesh Routing is faster than Basic Mesh Routing in Virtex II 8000 by a factor of around -5: the larger sub-matrix size handled in the same FPGA sharply decreases the number of iterations needed to repeat sub-multiplications. The influence of K reducing from 7 to 5 is very low

53 SRC-6e Reconfigurable Computer

54 SRC-6e Reconfigurable Computer Hardware Architecture (block diagram: a µP board with two P3 processors, L2 caches, MIOC, PCI-X and computer memory, connected through the SNAP DDR interface to ½ MAP board; the MAP board contains a Control FPGA (XC2V6000), On-Board Memory accessed over 6 x 64-bit banks, and two user FPGAs (XC2V6000) with chain ports)

55 MAP Programming Model of SRC. MAP C sub-routine: MAP_Function(a, d, e) { Macro_1(a, b, c); Macro_2(b, d); Macro_2(c, e); } FPGA contents: input a feeds Macro_1, whose outputs b and c feed two instances of Macro_2, which produce d and e

56 SRC Program Partitioning. µP system: C function for µP (HLL). FPGA system: C function for MAP (HLL) and VHDL macro (HDL)

57 SRC-6e Designs

58 SRC-Mesh: state machine and cells; the complete mesh is in VHDL

59 SRC-Cells: control in C; the cell is a VHDL macro; the mesh is in MAP C

60 Modified Architecture of the cell for SRC-Mesh (block diagram: LUT-RAM holding P[i] and R[i] with address decode and enable; CU; CR; register R; Check Dest; Status bits; cell coordinate (r, c); Comparator. Signals: en, en_cur, en_equal, equal, eq_packet, exchange, annihilate, row/col, oper)

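61 SRC-Cells Design Entry & Circuit. MAP C code: for ( ) { cell (a, &b); cell (a2, &b2); a = b2; a2 = b; } Circuit: two cell macros connected in a loop, with output b of the first cell feeding input a2 of the second, and output b2 of the second cell feeding input a of the first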

62 Cell Architecture for SRC-Cells Design (block diagram: CR; Check Dest; Status bits; cell coordinate (r, c); Comparator. Signals: en_cur, en_equal, equal, eq_packet, exchange, annihilate, row/col, oper)

63 Results and Analysis

64 SRC Basic Mesh Routing Results. K = number of parallel sub-matrix by sub-vector multiplications; n = number of times to repeat sub-multiplications = $D^2/m^4$; x = clock cycles per exchange; $T_{Kroute}$ = routing time for K multiplications in the mesh = d * 4 * m * x * period; $T_{KTot}$ = time for K multiplications including loading, unloading and routing; $T_{512\,Compute}$ = computational total time for a 512-bit Matrix step; $T_{512\,Total}$ = total time for a 512-bit Matrix step = 3 * D/K * n * ($T_{KTot}$). Table columns: Design Type, Mesh Size, K, CLB slices, LUTs, FFs, Period (ns), x, T_Kroute, T_KTot, T_512 Compute (days), T_512 Total (days). Rows: SRC-Mesh 12x12 (Matrix 144x144) 42 3,743 (9%) 54,66 (8%) 43,545 (64%) 2 96 ns 87 ns,52 22,46; SRC-Mesh 20x20 (Matrix 400x400) 3,533 (93%) 54,69 (8%) 28,636 (42%) 2 6 ns 227 ns 4,222 47,865; SRC-Mesh 10x10 (Matrix 100x100) 7 3,566 (93%) 55,528 (82%) 46,647 (69%) 2 8 ns 87 ns,938 27, 898; SRC-Cells x (Matrix 2x2) 32,84 (97%) 29,959 (44%) 47,759 (7%) 3 32 ns 6 ns 939,676,46,2

65 Comparison of 512-bit Performance for different mesh sizes & K values with equivalent area (bar chart: computational time and total time in days, by mesh type: 20x20 with K=1, 12x12 with K=42, 10x10 with K=7)

66 Conclusion for performance of different mesh sizes & K values. Comparing performance for different mesh sizes and K with equivalent FPGA resources (9%): a mesh of 12x12 with K=42 is better than a mesh of 20x20 with K=1; a mesh of 10x10 with K=7 is similar to a mesh of 12x12 with K=42. $T_{Total} = 3 \cdot \frac{D}{K} \cdot \frac{D^2}{m^4} \left( d \cdot 4 \cdot m \cdot x \cdot \text{period} + T_{load} \right)$

67 SRC-Mesh vs SRC-Cells Area for a 10x10 mesh with K=1. Table columns: Design Type, Mesh Size, K, CLB slices, LUTs, FFs, Period (ns), x, T_Kroute. Rows: SRC-Cells 10x10 (Matrix 100x100) (74%) 2256 (33%) 3642 (53%) 3 2 ns; SRC-Mesh 10x10 (Matrix 100x100) 9,347 (27%) 3,427 (9%) 439 (5%) 2 8 ns

68 SRC-Mesh vs SRC-Cells Area for a 10x10 mesh (bar chart: % of CLB, LUT and FF resources used by SRC-Mesh and SRC-Cells)

69 Conclusions for SRC-Mesh and SRC-Cells. SRC-Cells has about 2.7 times larger area than SRC-Mesh for the same mesh parameters, and performs worse than SRC-Mesh (only a small mesh can fit, K is small). Benefit: ease of programming in a high-level language

70 SRC Improved Mesh Routing Results (Area). Table columns: Design Type, Mesh Size m x m with p = 16, K, CLB slices, LUTs, FFs. Rows: Improved SRC-Mesh 10x10 (matrix 1600x1600) 32 3,2 (9%) 5,95 (76%) 29,954 (44%); Improved SRC-Mesh 8x8 (matrix 1024x1024) 64 3,456 (93%) 53,6 (78%) 3,82 (45%)

71 SRC Improved Mesh Routing Results (Performance). K = number of simultaneous vectors being multiplied; p = number of columns of A handled in one cell = 16; n = number of times to repeat sub-multiplications = $D^2/(m^2 p)^2$; x = clock cycles per compare-exchange operation; $T_{Kroute}$ = routing time for K multiplications in the mesh = p * d * 4 * m * x * period; $T_{KTot}$ = time for K multiplications including loading, unloading and routing; $T_{512\,Compute}$ = computational total time for a 512-bit Matrix step; $T_{512\,Total}$ = total time for a 512-bit Matrix step = 3 * D/K * n * ($T_{KTot}$). Table columns: Design Type, Mesh Size m x m, K, Period (ns), x, T_Kroute (ns), T_KTot (ns), T_512 Compute (days), T_512 Total (days). Rows: Improved SRC-Mesh 10x10 (matrix 1600x1600); Improved SRC-Mesh 8x8 (matrix 1024x1024). ,2 3, , ,36 25, ,93

72 Analysis & Conclusion for SRC-6e Improved & Basic Mesh Routing. Improved SRC-Mesh is faster than the Basic SRC-Mesh design by a factor of 57 in SRC-6e Virtex II: days compared to 22,46 days in the best case. The larger sub-matrix size significantly decreases the number of sub-multiplications

73 Standalone FPGA vs SRC design: standalone FPGA Virtex II 8000 vs SRC Virtex II 6000. Virtex II 8000 designs have larger K and m. Latency of routing increases in the SRC-6e: to improve the frequency to 100 MHz, the time of a compare-exchange was increased by 2-3 clock cycles. Limited I/O from the 6 OBM banks in the SRC-6e means more loading-unloading time. Results are for a two-dimensional array of standalone Virtex II FPGAs vs one FPGA on the SRC-6e

74 Summary & Conclusions. First practical hardware implementation of Mesh Routing for the Number Field Sieve, implemented and tested. Practical, concrete numbers obtained for the theoretical Mesh Routing algorithm, to assess the current hardness of the matrix step in reconfigurable hardware. Two architectures, Basic and Improved, implemented and compared. All designs compared on two platforms: a generic array of FPGA devices and the SRC-6e Reconfigurable Computer

75 Summary & Conclusions. Assuming constant area, the Improved Mesh Routing Design is faster than the Basic Mesh Routing Design by a factor of -5 in Virtex II 8000, because a larger sub-matrix is handled. A two-dimensional array of Virtex II chips can perform computations faster than a single FPGA by a factor proportional to $(\text{number of FPGAs})^{3/2}$. The Matrix step for a 1024-bit number can be performed using 1024 Virtex II chips in 27 days

76 Summary & Conclusions. Two design entry approaches were developed for the SRC-6e: SRC-Mesh is written entirely in VHDL; SRC-Cells is written mostly in C, with only the cell in VHDL. SRC-Mesh outperforms SRC-Cells by a factor of 5, at the cost of more difficult development of the VHDL code. A manually optimized circuit in VHDL is suitable on the SRC platform for the distributed computation of the mesh

77 Acknowledgement: Dr. Kris Gaj; SRC Computers, Inc.; Deapesh Misra

78 Questions
