Lessons Learned from Designing a 65 nm ASIC for Third Round SHA-3 Candidates


Frank K. Gürkaynak, Kris Gaj, Beat Muheim, Ekawat Homsirikamol, Christoph Keller, Marcin Rogawski, Hubert Kaeslin, Jens-Peter Kaps
ETH - George Mason University
March 2012

Motivation
- Present comparative ASIC performance results on all SHA-3 third round candidates
In this work:
- No claims about the cryptographic security
- Authors' recommendations for SHA equivalent security have been followed
Microelectronics Design Center

Two Groups, Two Different Approaches
George Mason University: academic approach
- Optimized for maximum throughput per area
- VHDL code taken from extensive architecture evaluations for FPGAs
ETH: quasi-industrial approach
- Specific throughput target: Gbit/s
- Selected the smallest design that meets the throughput target
- Deliberately tried to increase architectural diversity

Background
Timeline
- Earlier: GMU releases ATHENa, a database for FPGA results; ETH publishes study on 2nd round candidates
- May 2011: Quo Vadis 2011 Workshop in Warsaw, start of collaboration
- Jun 2011: Start of project
- Aug 2011: Common interface, all cores (ETH - GMU) compatible
- Oct 2011: Tape-out
- Dec 2011: Production problem with I/O transistors
- Feb 2012: Measured 5 ASICs from first batch

SHABZIGER: Our ASIC with all SHA-3 Candidates
- Technology: UMC LL 65nm
- Supply: 1.2V VDD
- Metallization: 8-Metal
- Package: 56-pin QFN56
- Total size: 1.825mm x 1.825mm
- Area unit: 1 GE = 1.44 µm²

Main Problem
EDA tools are designed for industry requirements
- Constraints target worst-case conditions.
- Tools are not designed for finding peak (faster/smaller) performance.
In general, academia is interested in limits
- It is not easy to get fair numbers from industrial tools.
- Constraints are misused for exploration.

The Design Flow
[Figure: design flow. Blocks shown: Specifications; Architecture (GMU) and Architecture (ETH); RTL Description (VHDL) with Constraints; Synthesis (Synopsys DC); Place and Route (Cadence EDI); Wireload Model feeding a second Synthesis (Synopsys DC) and Place and Route (Cadence EDI) pass; ASIC (UMC65nm). A side scale marks the accuracy of results from low to high.]

The Verification Flow
[Figure: verification flow. Blocks shown: Mentor Modelsim testbench with Control (select algorithm/mode), LFSR random input stimuli, Formatter, Padding Unit, and NIST KAT; RTL/Netlist under test; expected response and simulated response feeding Check Results; Generate TV producing test vectors for the manufactured ASIC on the HP83000; simulation result and measurement result.]

Reporting Performance: Area
How much silicon area is used by the circuit
- Area is reported in Gate Equivalents (GE).
- For the UMC65 technology and the standard cell library used, 1 GE = 1.44 µm².
- Includes overhead for clock trees, scan chains, reset circuitry.
Area in Gate Equivalents is not very accurate
- Additional overhead for power, routability, and signal integrity.
- These depend on circuit and operating conditions.
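
The GE conversion above is simple arithmetic. A minimal Python sketch, assuming only the 1 GE = 1.44 µm² figure and the 1.825 mm die edge quoted in the slides; the function names are hypothetical:

```python
# Conversion between silicon area and gate equivalents (GE).
# The constant comes from the slides; everything else is illustrative.
GE_AREA_UM2 = 1.44  # 1 GE = 1.44 um^2 for the UMC 65 nm library used


def um2_to_ge(area_um2: float) -> float:
    """Convert an area in square micrometres to gate equivalents."""
    return area_um2 / GE_AREA_UM2


def ge_to_um2(area_ge: float) -> float:
    """Convert an area in gate equivalents to square micrometres."""
    return area_ge * GE_AREA_UM2


if __name__ == "__main__":
    # Whole die: 1.825 mm x 1.825 mm = 1825 um x 1825 um.
    die_area_um2 = 1825.0 * 1825.0
    print(f"die area: {um2_to_ge(die_area_um2) / 1e6:.2f} MGE")
```

The whole-die figure is only a loose upper bound on usable logic, since it does not subtract whatever part of the die is taken by pads and other non-core area.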

Reporting Performance: Time, Speed, Throughput
Finding the correct unit
- Clock period [ns]: main constraint for speed in a digital circuit.
- Throughput [Gbit/s]: useful when comparing different architectures. In this work: long message hashing performance.
- Time per data item [ns/bit]: more practical for AT (Area-Time) plots, where one axis is time. Similar to [cycles/byte] used for software performance.
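
The three units are related by simple formulas. The sketch below assumes the usual long-message model (one block of `block_bits` bits is absorbed every `latency_cycles` clock cycles); the example core and its numbers are hypothetical and are not results from this work:

```python
# Relating clock frequency, throughput, time per bit and throughput per area.
# Assumption: long-message hashing, one block processed every latency_cycles.


def throughput_gbit_s(block_bits: int, latency_cycles: int, clock_mhz: float) -> float:
    """Throughput in Gbit/s: bits per block * blocks per second."""
    return block_bits * clock_mhz / latency_cycles / 1000.0


def time_per_bit_ns(tput_gbit_s: float) -> float:
    """Time per bit in ns/bit; 1 Gbit/s is exactly 1 bit per ns."""
    return 1.0 / tput_gbit_s


def throughput_per_area(tput_gbit_s: float, area_kge: float) -> float:
    """Throughput per area in kbit/s per GE."""
    return (tput_gbit_s * 1e6) / (area_kge * 1e3)


if __name__ == "__main__":
    # Hypothetical core: 512-bit block, 64 cycles per block, 400 MHz, 40 kGE.
    tput = throughput_gbit_s(512, 64, 400.0)
    print(tput, time_per_bit_ns(tput), throughput_per_area(tput, 40.0))
```

For the hypothetical core this gives 3.2 Gbit/s, 0.3125 ns/bit and 80 kbit/s per GE.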

The AT plot
[Figure: area (vertical axis, A to 5A, increasing area) against time (horizontal axis, T to 8T, increasing critical path / decreasing operating frequency). Implementations closer to the origin are more efficient. Annotations added over the successive builds: a circuit implementation and its constant AT product curve; implementations with different constraints along that curve; different constant AT lines; large variation of results, typically +/- 10%; overconstrained for speed => too large; overconstrained for area => too slow; a region labeled "Efficient Implementations".]
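
The constant-AT reasoning in the plot can be sketched in a few lines of Python. The design points below are invented purely for illustration, and the `Impl` type and function names are hypothetical. With area in kGE and time in ns/bit, throughput per area in kbit/s per GE is 1000 divided by the AT product, so a constant AT line is also a line of constant throughput per area:

```python
from typing import List, NamedTuple


class Impl(NamedTuple):
    name: str
    area_kge: float    # area in kGE
    ns_per_bit: float  # time per bit in ns/bit


def at_product(impl: Impl) -> float:
    """Area-time product; a smaller value means a more efficient design."""
    return impl.area_kge * impl.ns_per_bit


def kbit_s_per_ge(impl: Impl) -> float:
    """Throughput per area, the reciprocal view of the AT product."""
    return 1000.0 / at_product(impl)


def pareto_front(impls: List[Impl]) -> List[Impl]:
    """Keep designs that no other design beats in both area and time."""
    front = []
    for a in impls:
        dominated = any(
            b.area_kge <= a.area_kge and b.ns_per_bit <= a.ns_per_bit
            and (b.area_kge < a.area_kge or b.ns_per_bit < a.ns_per_bit)
            for b in impls
        )
        if not dominated:
            front.append(a)
    return front


if __name__ == "__main__":
    # Invented design points, not measurements from the ASIC.
    points = [
        Impl("fast", 50.0, 0.2),
        Impl("balanced", 30.0, 0.3),
        Impl("small", 25.0, 0.5),
        Impl("poor", 40.0, 0.4),
    ]
    for p in sorted(points, key=at_product):
        print(p.name, at_product(p), kbit_s_per_ge(p))
    print([p.name for p in pareto_front(points)])
```

Here "poor" is dropped from the front because "balanced" is both smaller and faster, which corresponds to a point lying above the efficient region in the plot.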

Synthesis Results
[Figure: area [kgate eq] against throughput [Gbit/s] and time per bit [ns/bit]; axis ticks at 10, 5.0, 3.3, 2.5, 2.0, 1.6 Gbit/s correspond to 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 ns/bit. Iso-lines of constant throughput per area run from 25 to 1000 kbit/s/gate; arrows mark the faster, smaller, and more efficient directions. Curves show GMU synthesis run results with different timing constraints for SHA-2, BLAKE, Grøstl, JH, Keccak, and Skein; the final build marks the selected implementation on each curve.]

The Story of Wireload Models
Wireload models reflect the routing overhead of the circuit
- Parasitic effects are major contributors to overall delay.
- During synthesis, wireload models approximate this delay.
- Each circuit is different and will require a different wireload.
- Wireload can be extracted after place and route.
- Subsequent synthesis runs will be more accurate.
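
To make the idea concrete, here is a toy Python sketch of a fanout-based wireload model and of "extracting" one from routed net lengths. It is loosely modeled on the common fanout-to-length table form and is not the format or algorithm used by Synopsys DC or Cadence EDI; all names, units, and numbers are assumptions chosen for illustration:

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class WireLoadModel:
    lengths: List[float]   # lengths[i] = estimated wire length for fanout i + 1
    slope: float           # extra length per fanout beyond the table
    cap_per_length: float  # capacitance per unit length (arbitrary units)

    def length(self, fanout: int) -> float:
        if fanout < 1:
            raise ValueError("fanout must be at least 1")
        if fanout <= len(self.lengths):
            return self.lengths[fanout - 1]
        # Extrapolate linearly past the end of the table.
        return self.lengths[-1] + self.slope * (fanout - len(self.lengths))

    def capacitance(self, fanout: int) -> float:
        return self.length(fanout) * self.cap_per_length


def extract_wireload(nets: List[Tuple[int, float]], cap_per_length: float) -> WireLoadModel:
    """Average the routed length per fanout; nets holds (fanout, length) pairs."""
    by_fanout = defaultdict(list)
    for fanout, length in nets:
        by_fanout[fanout].append(length)
    lengths: List[float] = []
    for f in range(1, max(by_fanout) + 1):
        if f in by_fanout:
            lengths.append(mean(by_fanout[f]))
        else:
            # No net with this fanout: reuse the previous entry.
            lengths.append(lengths[-1] if lengths else 0.0)
    slope = lengths[-1] - lengths[-2] if len(lengths) > 1 else 0.0
    return WireLoadModel(lengths, slope, cap_per_length)


if __name__ == "__main__":
    # Invented post-layout data: (fanout, routed length).
    model = extract_wireload([(1, 10.0), (1, 14.0), (2, 20.0), (3, 33.0)], 0.2)
    print(model.lengths, model.slope, model.length(5), model.capacitance(2))
```

Because the table is derived from one particular layout, it describes that circuit only, which matches the point that each circuit requires a different wireload.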

Synthesis Results with Extracted Wireload
[Figure: same axes and throughput-per-area iso-lines as the Synthesis Results plot (area [kgate eq] against throughput [Gbit/s] / time per bit [ns/bit]) for the GMU cores SHA-2, BLAKE, Grøstl, JH, Keccak, and Skein. Successive builds show the result of the synthesis exploration, the change in performance when the synthesis run uses the extracted wireload (results for different timing constraints), and the selected implementation from the synthesis run with extracted wireload.]

Obtaining Postlayout Results
Cores synthesized separately, combined during backend
- Constraints specified individually for each core.
- SoC Encounter can optimize all modes simultaneously.
- Due to parasitic effects, constraints are relaxed for P&R.
- Backend affects each circuit differently.
- Used several runs to find an acceptable solution.

Postlayout Results
[Figure: same axes and throughput-per-area iso-lines as the Synthesis Results plot (area [kgate eq] against throughput [Gbit/s] / time per bit [ns/bit]). For each GMU core (SHA-2, BLAKE, Grøstl, JH, Keccak, Skein) the builds show the initial synthesis result, the synthesis result with extracted wireload, and the final postlayout result; the last build shows the final postlayout results of both the GMU and ETHZ implementations.]

Normalized Energy/bit, Measurement vs Estimation
[Figure: bar charts of energy per bit for all SHA-3 candidates (SHA-2, BLAKE, Grøstl, JH, Keccak, Skein), GMU and ETHZ implementations, normalized to GMU SHA, with bar labels in pJ/bit. First chart: postlayout results, typical conditions, VDD = 1.2V. Second chart: measurement results, average of 5 ASICs, VDD = 1.2V.]
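
Energy per bit follows directly from power and throughput: 1 mW divided by 1 Gbit/s is 1 pJ/bit. A small Python sketch of that conversion and of normalizing to a reference implementation; the power and throughput figures are invented for illustration and are not values from the charts:

```python
# Energy per bit from power and throughput, plus normalization to a reference.


def energy_pj_per_bit(power_mw: float, tput_gbit_s: float) -> float:
    """mW / (Gbit/s) = 1e-3 J/s / 1e9 bit/s = 1e-12 J/bit = pJ/bit."""
    return power_mw / tput_gbit_s


def normalized(value: float, reference: float) -> float:
    """Express a value relative to a reference implementation."""
    return value / reference


if __name__ == "__main__":
    # Invented numbers for two hypothetical cores.
    reference = energy_pj_per_bit(20.0, 2.5)   # 8 pJ/bit
    candidate = energy_pj_per_bit(30.0, 2.5)   # 12 pJ/bit
    print(reference, candidate, normalized(candidate, reference))
```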

Throughput/Area, Measurement vs Estimation
[Figure: bar charts of throughput per area for all SHA-3 candidates (SHA-2, BLAKE, Grøstl, JH, Keccak, Skein), GMU and ETHZ implementations, normalized to GMU SHA, with bar labels in kbit/s/GE. First chart: postlayout results, typical case, VDD = 1.2V. Second chart: measurement results, average of 5 ASICs, VDD = 1.2V.]

Concluding Remarks (I)
SHA-2
- Very efficient in hardware
- By far the smallest
- Algorithm has been around longer, perhaps the reason for more optimized implementations
BLAKE
- Compact, easy to implement
- Allows good scalability
- Not the fastest

Concluding Remarks (II)
Grøstl
- Best scalability (Speed/Area trade-off)
- Low throughput per area
- Cumbersome for hardware
JH
- Consistently ranks in the middle
- So far, unable to find good scaling options
- All modes use identical hardware

Concluding Remarks (III)
Keccak
- Hands down the fastest algorithm
- Large block size and small latency are key to speed
- Not a very good Area/Speed trade-off
Skein
- Low throughput per area
- Interesting hardware trade-offs due to adder
- Longer combinational delay per clock cycle, perhaps the reason for the better match between expectation and measurement

Lessons Learned
- Synthesis results can be far from actual performance
- Measurement on ASIC is necessary
- Industrial EDA tools are ill suited for best performance
- Different implementations should be compared

Thank you...

Additional Material
All sources and scripts:

One ASIC, Many Cores
A common I/O interface for all cores
- LFSR-based input assembles a random input message
- FinalBlock signal indicates that the current message block is the last
- Last message block is padded (fixed padding length)
- All inputs are applied in parallel: 1088 bits for Keccak, 512 for the others
- A multiplexer selects 16 bits out of the 256 output bits
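
A software model of this interface might look like the Python sketch below. The slides do not give the LFSR width or polynomial, so a common 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) stands in for it; the bit and word ordering, the seed, and all function names are likewise assumptions:

```python
# Toy model of the common I/O interface: LFSR-generated message blocks and a
# 16-bit output multiplexer. LFSR polynomial and orderings are assumptions.
from typing import Tuple


def lfsr16_step(state: int) -> int:
    """One step of a 16-bit Fibonacci LFSR, x^16 + x^14 + x^13 + x^11 + 1."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def lfsr_block(state: int, block_bits: int) -> Tuple[int, int]:
    """Assemble one message block bit by bit; returns (block, next_state)."""
    block = 0
    for _ in range(block_bits):
        block = (block << 1) | (state & 1)
        state = lfsr16_step(state)
    return block, state


def select_word(digest: int, index: int) -> int:
    """Multiplexer: pick 16-bit word `index` (0 = least significant) of 256 bits."""
    if not 0 <= index < 16:
        raise ValueError("index must be in 0..15")
    return (digest >> (16 * index)) & 0xFFFF


if __name__ == "__main__":
    state = 0xACE1  # arbitrary non-zero seed
    for name, bits in (("Keccak", 1088), ("others", 512)):
        block, state = lfsr_block(state, bits)
        print(name, bits, hex(block)[:18] + "...")
    print(hex(select_word(0xBEEF << 240 | 0x1234, 15)))
```

Reading a 256-bit result through the multiplexer takes 16 selections, one per 16-bit word.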

Post Layout Results: Speed, Typical Case
[Table: columns Alg., Block Size [bits], Impl., Area (FFs) [kGE], Max. Clk [MHz], Tput [Gbit/s], TpA [kbit/s/GE]. Of the numeric columns, only the flip-flop share of the area is preserved.]
Alg.    Block Size [bits]  FFs (ETHZ)  FFs (GMU)
SHA-2   512                29%         35%
BLAKE   512                26%         34%
Grøstl  512                17%         9%
JH      512                27%         31%
Keccak  1088               25%         19%
Skein   512                19%         22%

Measurement Results: Speed, Average of 5 ASICs
[Table: columns Alg., Block Size [bits], Impl., Area (FFs) [kGE], Max. Clk [MHz], Tput [Gbit/s], TpA [kbit/s/GE]. Of the numeric columns, only the flip-flop share of the area is preserved.]
Alg.    Block Size [bits]  FFs (ETHZ)  FFs (GMU)
SHA-2   512                29%         35%
BLAKE   512                26%         34%
Grøstl  512                17%         9%
JH      512                27%         31%
Keccak  1088               25%         19%
Skein   512                19%         22%

Post Layout Results: Gb/s, Typical
[Table: for each algorithm (SHA-2, BLAKE, Grøstl, JH, Keccak, Skein) and implementation (ETHZ, GMU): Block Size [bits], Latency [cycles], Clk Freq. [MHz], Power [mW], Energy/bit [pJ/bit]. Block size is 1088 bits for Keccak and 512 bits for BLAKE, Grøstl, JH, and Skein.]

Measurement Results: Gb/s - 1.2V
[Table: for each algorithm (SHA-2, BLAKE, Grøstl, JH, Keccak, Skein) and implementation (ETHZ, GMU): Block Size [bits], Latency [cycles], Clk Freq. [MHz], Power [mW], Energy/bit [pJ/bit]. Block size is 1088 bits for Keccak and 512 bits for BLAKE, Grøstl, JH, and Skein.]
