Lessons Learned from Designing a 65 nm ASIC for Third Round SHA-3 Candidates

Lessons Learned from Designing a 65 nm ASIC for Third Round SHA-3 Candidates Frank K. Gürkaynak, Kris Gaj, Beat Muheim, Ekawat Homsirikamol, Christoph Keller, Marcin Rogawski, Hubert Kaeslin, Jens-Peter Kaps ETH - George Mason University 22-23 March 2012

Motivation
- Present comparative ASIC performance results on all SHA-3 third round candidates.
- In this work:
  - No claims are made about cryptographic security.
  - Authors' recommendations for SHA-2-256 equivalent security have been followed.

Microelectronics Design Center

Two Groups, Two Different Approaches
George Mason University: academic approach
- Optimized for maximum throughput per area.
- VHDL code taken from extensive architecture evaluations for FPGAs.
ETH: quasi-industrial approach
- Specific throughput target: 2.488 Gbit/s.
- Selected the smallest design that meets the throughput target.
- Deliberately tried to increase architectural diversity.
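The two selection strategies above can be sketched in a few lines of Python. This is only an illustration: the `Candidate` type, the function names, and all design points are hypothetical and do not come from the slides; only the 2.488 Gbit/s target is taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    area_kge: float         # area in kilo gate equivalents
    throughput_gbps: float  # long-message throughput in Gbit/s


# Hypothetical design points, invented purely for illustration.
CANDIDATES = [
    Candidate("arch_a", 20.0, 1.9),
    Candidate("arch_b", 30.0, 3.0),
    Candidate("arch_c", 45.0, 6.3),
    Candidate("arch_d", 80.0, 8.0),
]


def best_throughput_per_area(candidates):
    """Academic criterion: maximize throughput per area."""
    return max(candidates, key=lambda c: c.throughput_gbps / c.area_kge)


def smallest_meeting_target(candidates, target_gbps=2.488):
    """Quasi-industrial criterion: smallest design that reaches the target."""
    feasible = [c for c in candidates if c.throughput_gbps >= target_gbps]
    if not feasible:
        raise ValueError("no candidate reaches the throughput target")
    return min(feasible, key=lambda c: c.area_kge)


if __name__ == "__main__":
    print(best_throughput_per_area(CANDIDATES).name)
    print(smallest_meeting_target(CANDIDATES).name)
```

With the same set of design points, the two criteria can pick different architectures, which is one reason the two groups ended up with different cores for the same algorithm.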

Background
Timeline
- Earlier: GMU releases ATHENa, a database for FPGA results; ETH publishes a study on 2nd round candidates.
- May 2011: Quo Vadis 2011 Workshop in Warsaw; start of collaboration.
- Jun 2011: Start of project.
- Aug 2011: Common interface; all cores (ETH-GMU) compatible.
- Oct 2011: Tape-out.
- Dec 2011: Production problem with I/O transistors.
- Feb 2012: Measured 5 ASICs from first batch.

SHABZIGER: Our ASIC with all SHA-3 Candidates
Technology:     UMC LL 65 nm
Supply:         1.2 V VDD
Metallization:  8-Metal
Package:        56-pin QFN56
Total size:     1.825 mm x 1.825 mm
Area unit:      1 GE = 1.44 µm²

Main Problem
EDA tools are designed for industry requirements
- Constraints are meant for worst-case conditions.
- Tools are not designed for finding peak (fastest/smallest) performance.
In general, academia is interested in limits
- It is not easy to get fair numbers from industrial tools.
- Constraints are misused for exploration.

The Design Flow
[Figure: design flow. Specifications -> Architecture (GMU) and Architecture (ETH) -> RTL Description (VHDL) plus Constraints -> Synthesis (Synopsys DC) -> Place and Route (Cadence EDI) -> Synthesis (Synopsys DC) with Wireload Model -> Place and Route (Cadence EDI) -> ASIC (UMC 65 nm). A side axis labeled "Accuracy of Results" runs between Low and High.]

The Verification Flow
[Figure: verification flow. Blocks shown for simulation in Mentor Modelsim: Control (select algorithm/mode), LFSR random input stimuli, Formatter, Padding Unit, NIST KAT, RTL/Netlist, Expected Response, Simulated Response, Check Results, Generate TV. Blocks shown for measurement: manufactured ASIC, Test Vectors, HP83000. The Simulation Result and the Measurement Result are compared.]
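The "expected response vs. simulated response" check in the verification flow can be sketched as a known-answer test in Python. This is a software stand-in, not the authors' testbench: `check_responses` and `software_sha256` are hypothetical names, and `hashlib.sha256` plays the role of the device under test. The two digests are the standard SHA-256 values for the empty message and for "abc".

```python
import hashlib

# Known-answer vectors: (message, expected SHA-256 digest in hex).
KAT_VECTORS = [
    (b"", "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"),
    (b"abc", "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"),
]


def check_responses(dut, vectors):
    """Run every vector through `dut` and return the messages that mismatch."""
    failures = []
    for message, expected_hex in vectors:
        if dut(message).hex() != expected_hex:
            failures.append(message)
    return failures


def software_sha256(message: bytes) -> bytes:
    """Stand-in for the simulated or measured core (assumption: SHA-256)."""
    return hashlib.sha256(message).digest()


if __name__ == "__main__":
    print("mismatches:", check_responses(software_sha256, KAT_VECTORS))
```

In a real flow, `dut` would wrap the RTL or netlist simulation, or the tester readout, instead of a software model.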

Reporting Performance: Area
How much silicon area is used by the circuit
- Area is reported in gate equivalents (GE).
- For the UMC65 technology and the standard cell library used, 1 GE = 1.44 µm².
- Includes overhead for clock trees, scan chains, and reset circuitry.
Area in gate equivalents is not very accurate
- Additional overhead for power, routability, and signal integrity.
- These depend on the circuit and the operating conditions.
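As a small worked example of the area unit, the conversion between silicon area and gate equivalents can be written as follows. The factor 1.44 µm² per GE is the one stated on the slide; the function names and the 100 kGE example value are illustrative assumptions.

```python
UM2_PER_GE = 1.44  # from the slide: 1 GE = 1.44 µm² for this library


def um2_to_ge(area_um2: float) -> float:
    """Convert an area in square micrometers to gate equivalents."""
    return area_um2 / UM2_PER_GE


def ge_to_mm2(area_ge: float) -> float:
    """Convert gate equivalents to square millimeters (1 mm² = 1e6 µm²)."""
    return area_ge * UM2_PER_GE * 1e-6


if __name__ == "__main__":
    # A hypothetical 100 kGE core occupies about 0.144 mm² of cell area.
    print(ge_to_mm2(100_000))
```

The result covers standard cell area only; as the slide notes, power routing, routability, and signal integrity add overhead that the GE figure does not capture.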

Reporting Performance: Time, Speed, Throughput
Finding the correct unit
- Clock period [ns]: the main constraint for speed in a digital circuit.
- Throughput [Gbit/s]: useful when comparing different architectures. In this work: long message hashing performance.
- Time per data item [ns/bit]: more practical for AT (Area-Time) plots, where one axis is time. Similar to the [cycles/byte] used for software performance.
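The three units are related by simple arithmetic. The sketch below assumes that long-message throughput equals block size divided by the cycles per block, times the clock frequency; that relation is an assumption consistent with the result tables, not a formula given on the slide. Function names and the example numbers are hypothetical.

```python
def throughput_gbps(block_bits: int, latency_cycles: int, f_clk_mhz: float) -> float:
    """Long-message throughput in Gbit/s: bits per block / cycles per block * clock."""
    return block_bits / latency_cycles * f_clk_mhz * 1e-3


def time_per_bit_ns(tput_gbps: float) -> float:
    """Time per data item in ns/bit; 1 Gbit/s corresponds to 1 ns/bit."""
    return 1.0 / tput_gbps


if __name__ == "__main__":
    # Hypothetical core: 512-bit block, 64 cycles per block, 500 MHz clock.
    tput = throughput_gbps(512, 64, 500.0)
    print(tput, "Gbit/s ->", time_per_bit_ns(tput), "ns/bit")
```

Because time per bit is the reciprocal of throughput, 2.5 Gbit/s corresponds to 0.4 ns/bit and 10 Gbit/s to 0.1 ns/bit, which is how both scales can share one axis in the AT plots.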

The AT plot
[Figure: area-time (AT) plot. Vertical axis: increasing area, from A to 5A. Horizontal axis: time from T to 8T, i.e. increasing critical path / decreasing operating frequency. An arrow marks the direction of more efficient implementations. Successive builds of the slide add: a circuit implementation lying on a constant AT product line; implementations obtained with different constraints; different constant AT lines; a note that results show a large variation, typically +/- 10%; and three regions along the curve: overconstrained for speed (too large), efficient implementations, and overconstrained for area (too slow).]
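The constant AT lines can be made concrete with a short calculation. The sketch below assumes area in GE and time in ns/bit, so that a constant AT product corresponds to a constant throughput per area in kbit/s per GE; the function names and the example point are illustrative, not taken from the slides.

```python
def throughput_per_area(area_ge: float, t_ns_per_bit: float) -> float:
    """Throughput per area in kbit/s per GE for a point on the AT plot.

    Throughput is 1e9 / T bit/s, so per GE and in kbit/s it is 1e6 / (A * T).
    """
    return 1e6 / (area_ge * t_ns_per_bit)


def area_on_constant_at_line(tpa_kbps_per_ge: float, t_ns_per_bit: float) -> float:
    """Area (GE) at which a given time per bit lies on a constant-AT line."""
    return 1e6 / (tpa_kbps_per_ge * t_ns_per_bit)


if __name__ == "__main__":
    # A hypothetical 40 kGE design at 0.25 ns/bit (4 Gbit/s).
    print(throughput_per_area(40_000, 0.25))
    # Doubling the time per bit on the same line halves the area.
    print(area_on_constant_at_line(100.0, 0.5))
```

Points on the same line are equally efficient; a point closer to the origin lies on a line with a smaller AT product and is therefore the more efficient implementation.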

Synthesis Results
[Figure: AT plot of synthesis results. Vertical axis: area [kgate eq], 0 to 180. Horizontal axis: time per bit [ns/bit], 0 to 0.6, with the corresponding throughput [Gbit/s] marked from 10 Gbit/s down to 1.6 Gbit/s. Guide lines mark constant throughput per area at 25, 50, 100, 200, 500 and 1000 kbit/s/gate; arrows indicate the Faster, Smaller and More Efficient directions. Curves show the GMU synthesis run results with different timing constraints for SHA-2, BLAKE, Grøstl, JH, Keccak and Skein. The final build marks the selected implementation.]

The Story of Wireload Models
Wireload models reflect the routing overhead of the circuit
- Parasitic effects are major contributors to overall delay.
- During synthesis, wireload models approximate this delay.
- Each circuit is different and will require a different wireload model.
- A wireload model can be extracted after place and route.
- Subsequent synthesis runs will be more accurate.
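To illustrate what a wireload model does during synthesis, the following sketch estimates the wire capacitance of a net from its fanout alone, using a lookup table with linear extrapolation. It is a simplified toy in the spirit of library wireload tables, not the format or the values used by the authors; the class name and every number are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WireloadModel:
    """Fanout-based wire estimate (toy model, hypothetical values)."""
    cap_per_length: float  # capacitance per unit of estimated wire length
    slope: float           # extra length per fanout beyond the table
    fanout_length: dict    # fanout -> estimated wire length

    def length(self, fanout: int) -> float:
        # The table is assumed to list every fanout from 1 up to its largest entry.
        if fanout < 1:
            raise ValueError("fanout must be at least 1")
        largest = max(self.fanout_length)
        if fanout <= largest:
            return self.fanout_length[fanout]
        # Beyond the table, extrapolate linearly using the slope.
        return self.fanout_length[largest] + self.slope * (fanout - largest)

    def capacitance(self, fanout: int) -> float:
        return self.length(fanout) * self.cap_per_length


# A generic model versus one "extracted" after place and route (both invented).
generic = WireloadModel(0.0002, 10.0, {1: 10.0, 2: 20.0, 3: 30.0})
extracted = WireloadModel(0.0002, 25.0, {1: 15.0, 2: 35.0, 3: 60.0})

if __name__ == "__main__":
    for fanout in (1, 3, 5):
        print(fanout, generic.capacitance(fanout), extracted.capacitance(fanout))
```

If the extracted table predicts longer wires than the generic one, a synthesis run that uses it sees larger loads and reports slower or larger circuits, which is the kind of shift the slides describe between the initial and the wireload-aware synthesis results.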

Synthesis Results with Extracted Wireload
[Figure: same AT plot as for the synthesis results (area [kgate eq] versus time per bit [ns/bit] and throughput [Gbit/s], with constant throughput-per-area lines from 25 to 1000 kbit/s/gate) for the GMU cores SHA-2, BLAKE, Grøstl, JH, Keccak and Skein. Successive builds show: the result of the synthesis exploration; the change in performance for synthesis runs with the extracted wireload at different timing constraints; and the selected implementation from the synthesis run with the extracted wireload.]

Obtaining Postlayout Results
Cores synthesized separately, combined during backend
- Constraints are specified individually for each core.
- SoC Encounter can optimize all modes simultaneously.
- Due to parasitic effects, constraints are relaxed for P&R.
- The backend affects each circuit differently.
- Several runs were used to find an acceptable solution.

Postlayout Results
[Figure: same AT plot (area [kgate eq] versus time per bit [ns/bit] and throughput [Gbit/s], with constant throughput-per-area lines from 25 to 1000 kbit/s/gate). Successive builds show, for the GMU cores, the initial synthesis result, the result of synthesis with the extracted wireload, and the final postlayout result; the last build shows the postlayout results of both the GMU and the ETHZ cores for SHA-2, BLAKE, Grøstl, JH, Keccak and Skein, with the 2.488 Gbit/s target marked.]

Normalized Energy/bit, Measurement vs Estimation
[Figure: bar charts of energy per bit for SHA-2 and all SHA-3 candidates, normalized to GMU SHA-2, at VDD = 1.2 V. One chart shows postlayout results under typical conditions, the other measurement results averaged over 5 ASICs. The bar labels give the absolute values in pJ/bit:]

Algorithm  Impl.  Postlayout  Measured
SHA-2      GMU     3.68        3.98
SHA-2      ETHZ    4.77        5.05
BLAKE      GMU     6.62       10.16
BLAKE      ETHZ   13.99       20.67
Grøstl     GMU    18.49       23.15
Grøstl     ETHZ   20.30       27.38
JH         GMU     7.15       11.20
JH         ETHZ    6.65        9.85
Keccak     GMU     4.01        6.28
Keccak     ETHZ    3.28        4.98
Skein      GMU    10.53       16.02
Skein      ETHZ   20.10       28.42

Throughput/Area, Measurement vs Estimation
[Figure: bar charts of throughput per area for SHA-2 and all SHA-3 candidates, normalized to GMU SHA-2, at VDD = 1.2 V. One chart shows postlayout results for the typical case, the other measurement results averaged over 5 ASICs. The bar labels give the absolute values in kbit/s per GE:]

Algorithm  Impl.  Postlayout  Measured
SHA-2      GMU    273         215
SHA-2      ETHZ   162         174
BLAKE      GMU    179         167
BLAKE      ETHZ    77          85
Grøstl     GMU    115          86
Grøstl     ETHZ    42          41
JH         GMU    208         154
JH         ETHZ   146         139
Keccak     GMU    518         394
Keccak     ETHZ   770         685
Skein      GMU    117         121
Skein      ETHZ    44          46
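The normalization used for the bar charts can be reproduced with a few lines of Python. The values below are the postlayout throughput-per-area labels from the slide (kbit/s per GE); the dictionary layout and the `normalize` helper are illustrative choices, not part of the original material.

```python
# Postlayout throughput per area in kbit/s per GE, as labeled on the slide.
POSTLAYOUT_TPA = {
    ("SHA-2", "GMU"): 273, ("SHA-2", "ETHZ"): 162,
    ("BLAKE", "GMU"): 179, ("BLAKE", "ETHZ"): 77,
    ("Grøstl", "GMU"): 115, ("Grøstl", "ETHZ"): 42,
    ("JH", "GMU"): 208, ("JH", "ETHZ"): 146,
    ("Keccak", "GMU"): 518, ("Keccak", "ETHZ"): 770,
    ("Skein", "GMU"): 117, ("Skein", "ETHZ"): 44,
}


def normalize(values, reference):
    """Divide every entry by the reference entry (here: GMU SHA-2)."""
    ref = values[reference]
    return {key: value / ref for key, value in values.items()}


NORMALIZED = normalize(POSTLAYOUT_TPA, ("SHA-2", "GMU"))

if __name__ == "__main__":
    for (algorithm, impl), value in NORMALIZED.items():
        print(f"{algorithm:7s} {impl:5s} {value:.2f}")
```

After normalization the reference bar has height 1, and every other bar reads directly as a multiple of the GMU SHA-2 throughput per area.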

Concluding Remarks (I)
SHA-2
- Very efficient in hardware.
- By far the smallest.
- The algorithm has been around longer, which is perhaps the reason for more optimized implementations.
BLAKE
- Compact, easy to implement.
- Allows good scalability.
- Not the fastest.

Concluding Remarks (II)
Grøstl
- Best scalability (speed/area trade-off).
- Low throughput per area.
- Cumbersome for hardware.
JH
- Consistently ranks in the middle.
- So far, unable to find good scaling options.
- All modes use identical hardware.

Concluding Remarks (III)
Keccak
- Hands down the fastest algorithm.
- Large block size and small latency are key to its speed.
- Not a very good area/speed trade-off.
Skein
- Low throughput per area.
- Interesting hardware trade-offs due to the adder.
- Longer combinational delay per clock cycle, which is perhaps the reason for the better match between expectation and measurement.

Lessons Learned
- Synthesis results can be far from actual performance.
- Measurement on the ASIC is necessary.
- Industrial EDA tools are ill suited for finding the best performance.
- Different implementations should be compared.

Thank you...

Additional Material
All sources and scripts: http://www.iis.ee.ethz.ch/~sha3

One ASIC, Many Cores
A common I/O interface for all cores
- An LFSR-based input stage assembles a random input message.
- The FinalBlock signal indicates that the current message block is the last one.
- The last message block is padded (fixed padding length).
- All inputs are applied in parallel: 1088 bits for Keccak, 512 bits for the others.
- A multiplexer selects 16 bits out of the 256 output bits.
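The interface described above can be modeled in software as follows. This is a behavioral sketch under stated assumptions: the slides do not give the LFSR width or polynomial, so a common 16-bit maximal-length Fibonacci LFSR (taps 16, 14, 13, 11) stands in for it, and the word ordering of the output multiplexer is likewise assumed. All function names are hypothetical.

```python
def lfsr16_step(state: int) -> int:
    """One step of a 16-bit Fibonacci LFSR (assumed polynomial, taps 16/14/13/11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def random_block(state: int, nbits: int):
    """Assemble an nbits-wide pseudo-random message block, one LFSR bit per step.

    Returns the block as an integer together with the updated LFSR state.
    """
    block = 0
    for _ in range(nbits):
        block = (block << 1) | (state & 1)
        state = lfsr16_step(state)
    return block, state


def select_output_word(digest: int, index: int) -> int:
    """Multiplexer: pick one 16-bit word out of a 256-bit digest.

    Word 0 is taken to be the least significant word (an assumption).
    """
    if not 0 <= index < 16:
        raise ValueError("index must be in 0..15")
    return (digest >> (16 * index)) & 0xFFFF


if __name__ == "__main__":
    seed = 0xACE1                      # arbitrary non-zero seed
    block, seed = random_block(seed, 512)   # 1088 bits would be used for Keccak
    print(hex(block))
    print(hex(select_output_word(0x1234ABCD, 1)))
```

Generating stimuli on chip and reading the digest out 16 bits at a time keeps the pin count low, which matters for a 56-pin package that has to serve many wide cores.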

Post Layout Results: Speed, Typical Case

Alg.    Block size  Impl.  Area (FFs)    Max. clk  Tput      TpA
        [bits]             [kGE]         [MHz]     [Gbit/s]  [kbit/s/GE]
SHA-2    512        ETHZ    24.30 (29%)  516.00     3.943    162.255
                    GMU     25.14 (35%)  870.32     6.855    272.691
BLAKE    512        ETHZ    39.96 (26%)  344.12     3.091     77.347
                    GMU     43.02 (34%)  436.30     7.703    179.039
Grøstl   512        ETHZ    69.39 (17%)  460.83     2.913     41.977
                    GMU    160.28 (9%)   757.58    18.470    115.239
JH       512        ETHZ    46.79 (27%)  558.97     6.814    145.626
                    GMU     54.35 (31%)  947.87    11.286    207.655
Keccak  1088        ETHZ    46.31 (25%)  786.16    35.639    769.550
                    GMU     80.65 (19%)  920.81    41.743    517.587
Skein    512        ETHZ    71.87 (19%)  564.33     3.141     43.697
                    GMU     71.90 (22%)  312.11     8.411    116.977
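The TpA column follows from the area and throughput columns. The sketch below recomputes it for four rows of the table; the helper name and the choice of rows are illustrative, and small differences in the last digits are expected because the table entries are rounded.

```python
# (algorithm, implementation, area [kGE], throughput [Gbit/s], reported TpA [kbit/s/GE])
ROWS = [
    ("SHA-2", "ETHZ", 24.30, 3.943, 162.255),
    ("Grøstl", "GMU", 160.28, 18.470, 115.239),
    ("Keccak", "ETHZ", 46.31, 35.639, 769.550),
    ("Skein", "GMU", 71.90, 8.411, 116.977),
]


def throughput_per_area(area_kge: float, tput_gbps: float) -> float:
    """Throughput per area in kbit/s per GE."""
    kbit_per_s = tput_gbps * 1e6   # 1 Gbit/s = 1e6 kbit/s
    gate_equivalents = area_kge * 1e3
    return kbit_per_s / gate_equivalents


if __name__ == "__main__":
    for algorithm, impl, area, tput, reported in ROWS:
        computed = throughput_per_area(area, tput)
        print(f"{algorithm:7s} {impl:5s} computed {computed:8.3f}  reported {reported:8.3f}")
```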

Measurement Results: Speed, Average of 5 ASICs

Alg.    Block size  Impl.  Area (FFs)    Max. clk  Tput      TpA
        [bits]             [kGE]         [MHz]     [Gbit/s]  [kbit/s/GE]
SHA-2    512        ETHZ    24.30 (29%)  552.79     4.224    173.826
                    GMU     25.14 (35%)  685.40     5.399    214.751
BLAKE    512        ETHZ    39.96 (26%)  377.93     3.395     84.947
                    GMU     43.02 (34%)  405.84     7.165    166.541
Grøstl   512        ETHZ    69.39 (17%)  445.63     2.817     40.593
                    GMU    160.28 (9%)   563.70    13.743     85.747
JH       512        ETHZ    46.79 (27%)  532.48     6.491    138.725
                    GMU     54.35 (31%)  704.72     8.391    154.387
Keccak  1088        ETHZ    46.31 (25%)  700.28    31.746    685.482
                    GMU     80.65 (19%)  701.75    31.813    394.456
Skein    512        ETHZ    71.87 (19%)  588.24     3.274     45.548
                    GMU     71.90 (22%)  323.21     8.710    121.036
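How far the post-layout estimates are from the measurements can be quantified per core. The sketch below uses the maximum clock frequencies from the post-layout and measurement tables; the relative-deviation metric and all names are illustrative choices rather than the authors' analysis.

```python
# (algorithm, implementation): (post-layout max clock, measured max clock) in MHz
MAX_CLK_MHZ = {
    ("SHA-2", "ETHZ"): (516.00, 552.79), ("SHA-2", "GMU"): (870.32, 685.40),
    ("BLAKE", "ETHZ"): (344.12, 377.93), ("BLAKE", "GMU"): (436.30, 405.84),
    ("Grøstl", "ETHZ"): (460.83, 445.63), ("Grøstl", "GMU"): (757.58, 563.70),
    ("JH", "ETHZ"): (558.97, 532.48), ("JH", "GMU"): (947.87, 704.72),
    ("Keccak", "ETHZ"): (786.16, 700.28), ("Keccak", "GMU"): (920.81, 701.75),
    ("Skein", "ETHZ"): (564.33, 588.24), ("Skein", "GMU"): (312.11, 323.21),
}


def relative_deviation(estimated: float, measured: float) -> float:
    """Signed deviation of the measurement from the estimate, as a fraction."""
    return (measured - estimated) / estimated


DEVIATION = {core: relative_deviation(*clocks) for core, clocks in MAX_CLK_MHZ.items()}
WORST = min(DEVIATION, key=DEVIATION.get)  # core that falls furthest below its estimate

if __name__ == "__main__":
    for (algorithm, impl), dev in sorted(DEVIATION.items(), key=lambda item: item[1]):
        print(f"{algorithm:7s} {impl:5s} {dev * 100:+6.1f} %")
```

In this data the cores with the highest estimated clock rates tend to fall furthest below their estimates, while the slower cores meet or slightly exceed theirs.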

Post Layout Results: Power @2.488 Gb/s, Typical

Algorithm  Block size  Impl.  Latency   Clk freq.  Power  Energy/bit
           [bits]             [cycles]  [MHz]      [mW]   [pJ/bit]
SHA-2       512        ETHZ   67        324        11.86   4.76
                       GMU    65        316         9.16   3.68
BLAKE       512        ETHZ   57        276        34.80  13.99
                       GMU    29        140        16.47   6.62
Grøstl      512        ETHZ   81        392        50.50  20.30
                       GMU    21        102        46.01  18.49
JH          512        ETHZ   42        204        16.54   6.67
                       GMU    43        209        17.80   7.15
Keccak     1088        ETHZ   24         54         8.16   3.28
                       GMU    24         54         9.98   4.01
Skein       512        ETHZ   92        446        50.00  20.10
                       GMU    19         92        26.19  10.53
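Energy per bit is power divided by throughput, and with power in mW and throughput in Gbit/s the quotient comes out directly in pJ/bit. The sketch below checks this against four rows of the table at the common 2.488 Gbit/s operating point; the function name and the selection of rows are illustrative.

```python
TARGET_GBPS = 2.488  # common throughput at which all cores were compared


def energy_per_bit_pj(power_mw: float, tput_gbps: float = TARGET_GBPS) -> float:
    """Energy per bit in pJ/bit: (1e-3 W) / (1e9 bit/s) = 1e-12 J/bit."""
    return power_mw / tput_gbps


# (algorithm, implementation, power [mW], reported energy [pJ/bit])
ROWS = [
    ("SHA-2", "GMU", 9.16, 3.68),
    ("BLAKE", "ETHZ", 34.80, 13.99),
    ("Keccak", "ETHZ", 8.16, 3.28),
    ("Skein", "GMU", 26.19, 10.53),
]

if __name__ == "__main__":
    for algorithm, impl, power, reported in ROWS:
        computed = energy_per_bit_pj(power)
        print(f"{algorithm:7s} {impl:5s} computed {computed:6.2f}  reported {reported:6.2f}")
```

Comparing all cores at the same throughput means that the power ranking and the energy-per-bit ranking are identical; the division only changes the unit.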

Measurement Results: Power @2.488 Gb/s - 1.2V

Algorithm  Block size  Impl.  Latency   Clk freq.  Power  Energy/bit
           [bits]             [cycles]  [MHz]      [mW]   [pJ/bit]
SHA-2       512        ETHZ   67        324        12.57   5.05
                       GMU    65        316         9.90   3.98
BLAKE       512        ETHZ   57        276        51.42  20.67
                       GMU    29        140        25.27  10.16
Grøstl      512        ETHZ   81        392        68.12  27.38
                       GMU    21        102        57.59  23.15
JH          512        ETHZ   42        204        24.51   9.85
                       GMU    43        209        27.89  11.20
Keccak     1088        ETHZ   24         54        12.38   4.98
                       GMU    24         54        15.62   6.28
Skein       512        ETHZ   92        446        70.71  28.42
                       GMU    19         92        39.86  16.02