How a processor can permute n bits in O(1) cycles


Ruby Lee, Zhijie Shi, Xiao Yang
Princeton Architecture Lab for Multimedia and Security (PALMS)
Department of Electrical Engineering, Princeton University
Proceedings of IEEE Hot Chips 14, August 2002

Motivation
  - Secure information processing increases in importance in an interconnected world
  - Word-oriented microprocessors today can handle cryptography algorithms well, except for:
      - Bit-level permutations
      - Multi-word arithmetic
  - The larger architectural question: can a word-oriented processor handle complex bit-level operations within the word efficiently?

Today - microprocessor or ASIC
  - Logic operations
      - MASK-Gen/AND/SHIFT/OR: 4n instructions
      - EXTRACT/DEPOSIT: 2n instructions
  - Table lookup
      - small set of fixed permutations only
      - 8 x 2KB tables, about 32 instructions for a 64-bit permutation
  - Subword permutation instructions for multimedia
      - work on 8-bit or larger subwords
  - ASIC permutation
      - very fast in hardware, BUT small set of fixed permutations only

Goal: add new Permutation Functional Unit to Processor
  - Achieve any one of n! permutations in log(n) instructions
  - Figure labels: Register File, ALU, Shifter, Permutation FU; n-bit source to be permuted, n configuration bits, n-bit intermediate result
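To make the cost of the logic-operations baseline concrete, here is a minimal Python sketch of a bit-by-bit software permutation. The function name `permute_bitwise` and the convention that `perm[dst]` holds the source bit index are assumptions for illustration, not part of the slides. Each destination bit costs a handful of shift, mask, and OR steps, which is where an instruction count proportional to n comes from.

```python
from typing import Sequence


def permute_bitwise(x: int, perm: Sequence[int]) -> int:
    """Permute the low len(perm) bits of x, one bit at a time.

    Assumed convention: perm[dst] is the index of the source bit
    that ends up at destination position dst (bit 0 = least significant).
    """
    result = 0
    for dst, src in enumerate(perm):
        bit = (x >> src) & 1   # SHIFT + AND: extract one source bit
        result |= bit << dst   # SHIFT + OR: deposit it at its destination
    return result


# Reverse the order of 8 bits: bit 0 moves to bit 7, and so on.
reverse8 = [7 - i for i in range(8)]
print(bin(permute_bitwise(0b00000001, reverse8)))
```

The loop body runs once per bit, so a 64-bit permutation takes 64 iterations of several operations each, in contrast to the log(n) instructions targeted by the new functional unit.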

Initial Problem Definition
  - Efficient bit permutation instructions for arbitrary permutations of n bits
  - Focus on n = 32 or 64 (word sizes)
  - Standard instruction format and datapaths
      - 2 reads, 1 write per instruction
      - No extra state (to save and restore)
      - Single cycle, simple hardware
  - In log(n) instructions - optimal
      - Number of different n-bit permutations = n!
      - log(n!) ≈ n log(n)   (n > 0)
      - n log(n) bits needed to specify an arbitrary permutation

Outline
  - Permute n bits: from O(n) to O(log(n)) instructions
      - ISA definitions
      - Chip/Circuit Implementations
      - Performance, Cycletime, Versatility
  - Permute n bits: from O(log(n)) to O(1) cycles
  - Conclusion
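The counting argument above can be checked numerically. The sketch below compares the information needed to name one of n! permutations, log2(n!), with the n log2(n) control bits that log(n) instructions supply. It assumes each instruction carries one n-bit control register as its second read operand. The function name `permutation_spec_bits` is hypothetical.

```python
import math


def permutation_spec_bits(n: int):
    """Return (bits needed to select one of n! permutations,
    bits supplied by log2(n) instructions with n control bits each).

    Assumes n is a power of two, as for the word sizes 32 and 64.
    """
    needed = math.log2(math.factorial(n))
    stages = n.bit_length() - 1          # exact log2(n) for a power of two
    supplied = n * stages
    return needed, supplied


for n in (32, 64):
    needed, supplied = permutation_spec_bits(n)
    print(f"n={n}: need about {needed:.1f} bits, log(n) instructions supply {supplied}")
```

For n = 64 the requirement is a little under 300 bits, so five 64-bit control words (320 bits) would be enough by pure counting, and the six words used by log(n) instructions are within a small constant factor of that bound.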

Alternative permutation methods to reduce O(n) to O(log n) instructions for achieving any one of n! permutations
  - Partitioning
      - GRP
  - Building virtual interconnection networks
      - CROSS (log(n) types of stages)
      - OMFLIP (2 types of stages)
  - Select source bit by its numeric index
      - PPERM
      - SWPERM and SIEVE

8-bit GRP operation
  GRP Rs, Rc, Rd

    Bit position   0 1 2 3 4 5 6 7
    Data     Rs    a b c d e f g h
    Control  Rc    1 0 0 1 1 0 1 0
    Result   Rd    b c f h a d e g
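The 8-bit GRP example above is consistent with a stable partition: data bits whose control bit is 0 are gathered on the left, those whose control bit is 1 on the right, and relative order is kept within each group. The Python sketch below models that behaviour on a sequence of symbols rather than a machine word. The function name `grp` and the list representation are illustrative assumptions.

```python
def grp(data, control):
    """Model of the GRP operation on a sequence of bit 'symbols'.

    Elements paired with control bit 0 are grouped to the left,
    elements paired with control bit 1 to the right, preserving
    the original order inside each group.
    """
    if len(data) != len(control):
        raise ValueError("data and control must have the same length")
    left = [d for d, c in zip(data, control) if c == 0]
    right = [d for d, c in zip(data, control) if c == 1]
    return left + right


# Reproduces the example from the slide: Rs = abcdefgh, Rc = 10011010.
result = grp("abcdefgh", [1, 0, 0, 1, 1, 0, 1, 0])
print("".join(result))  # bcfhadeg
```

Because the partition is stable, a sequence of such steps behaves like a radix-style sort driven by the control words, which is why the same instruction is also useful for sorting.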

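GRP64 Implementation
  - One input: 64 data bits and 64 control bits
  - Other input: 64 data bits and 64 inverted control bits in reverse order
  - Figure: six stages (labels 1 to 6) that merge groups of doubling width, from 1-bit groups into 2 bits up to 16-bit groups into 32 bits (stage 5) and 32-bit groups into 64 bits (stage 6)
  - 64 OR gates produce the output
  - Figure: Chip with Permutation Unit (GRP)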

8-bit CROSS instruction: building a virtual Benes network
  - Figure: input -> butterfly network -> inverse butterfly network -> output
  - Performs any 2 butterfly stages in one instruction
  - Performs any n-bit permutation with 2log(n) stages
  - log(n) different types of stages
  - Scalable for subword permutation
  - Shortest latency

8-bit OMFLIP: building a virtual Omega-Flip network
  - Figure: input -> omega network -> flip network -> output
  - Performs 2 omega or flip stages in one instruction
  - Performs any n-bit permutation with 2log(n) stages
  - Only 2 different types of stages
  - Scalable for subword permutation
  - Smallest area for a permutation unit
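A butterfly stage can be sketched as a row of 2-input switches, each of which either passes or swaps a pair of bits a fixed distance apart. The Python model below is a sketch under stated assumptions: the pairing rule (position i with i + distance), the ordering of control bits, and the names `butterfly_stage` and `cross_sketch` are illustrative and are not taken from the slides. It shows why two stages fit in one instruction: each stage needs n/2 control bits, so one n-bit control register configures two stages.

```python
def butterfly_stage(bits, distance, control):
    """Apply one butterfly stage to a sequence of symbols.

    Position i is paired with i + distance for every i whose
    'distance' bit is clear; control holds one swap flag per pair,
    in increasing order of i (an assumed ordering).
    """
    out = list(bits)
    pairs = [i for i in range(len(out)) if i & distance == 0]
    if len(control) != len(pairs):
        raise ValueError("need one control bit per switch (n/2 total)")
    for swap, i in zip(control, pairs):
        if swap:
            out[i], out[i + distance] = out[i + distance], out[i]
    return out


def cross_sketch(bits, d1, d2, control):
    """Two butterfly stages driven by a single n-bit control word."""
    half = len(bits) // 2
    mid = butterfly_stage(bits, d1, control[:half])
    return butterfly_stage(mid, d2, control[half:])


# One 8-bit control word configures a distance-4 stage and a distance-1 stage.
print("".join(cross_sketch("abcdefgh", 4, 1, [1, 0, 0, 1, 0, 0, 0, 1])))
```

With distances 4, 2, 1 followed by 1, 2, 4, six such stages on 8 bits form the butterfly plus inverse butterfly arrangement named on the slide, and three two-stage instructions would cover them.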

An OMFLIP Implementation
  - To implement any 2 combinations of Omega or Flip stages, it is enough to implement a circuit with only 4 stages: 2 omega, 2 flip
  - This allows OO, FF, OF and FO combinations
  - Other circuit organizations are also possible, e.g., O-F-O-F, F-O-F-O and F-O-O-F
  - Figure: 64 bits -> omega stage -> flip stage -> flip stage -> omega stage -> 64 permuted bits, with bypassing connections
  - Figure: Chip with Permutation Unit (OMFLIP)

Comparison: Maximum Number of Instructions Required for Any Permutation

                                                   Current ISA   Table lookup   GRP        OMFLIP or CROSS
  Bit permutation, n elements, each 1-bit          Θ(n)          Θ(n)           log(n)     log(n)
  Subword permutation, n/k elements, each k-bit    Θ(n/k)        Θ(n)           log(n/k)   log(n/k)

Speedup of DES
  - Figure: bar chart of speedup for Table Look-Up, GRP, and OMFLIP or CROSS under cache 1 and cache 2 (bar labels: 2.24, 2.14, 1, 1, 1.17, 1.12; vertical axis from 0 to 2.5)
  - For key generation, speedup is 11x-16x!
  - Cache 1: one-level cache, 16KB (50 cycles miss penalty)
  - Cache 2: two-level cache, L1: 16KB (10 cycles miss penalty), L2: 256KB (50 cycles)

Speedup for sorting 64 elements using GRP instruction

  Subword size          4 bits    8 bits    16 bits
  vs. Bubble sort       408.3     128.9     43.7
  vs. Selection sort    272.7     86.1      29.2
  vs. Quick sort        94.4      29.8      10.1

  Demonstrates versatility of GRP instructions for sorting as well as permutations.

How to execute log(n) instructions in O(1) cycles?
  Instruction sequence to permute 64 bits:

    OMFLIP,oo R1,R2,R10
    OMFLIP,oo R10,R3,R10
    OMFLIP,oo R10,R4,R10
    OMFLIP,ff R10,R5,R10
    OMFLIP,ff R10,R6,R10
    OMFLIP,ff R10,R7,R10
    ...

  - RISC ISA constraint: instructions with only 2 operands
  - n-bit permutation needs 1+log(n) operands
  - Supplying these operands results in register data dependencies
  - But could 7 operands be supplied in 4 RISC instructions rather than 6?

Leverage microarchitecture features in 2-way superscalar processors
  Original instruction sequence to permute 64 bits:

    OMFLIP,oo R1,R2,R10
    OMFLIP,oo R10,R3,R10
    OMFLIP,oo R10,R4,R10
    OMFLIP,ff R10,R5,R10
    OMFLIP,ff R10,R6,R10
    OMFLIP,ff R10,R7,R10

  - Enable Data-rich functional units utilizing existing parallel register ports and data buses
  - Replace 6 instructions with 4 (ISA or microarchitecture):

    OMFLIP,oo R1,R2,R10
    OMcont R4,R3,R10
    OMFLIP,ff R10,R5,R10
    OMcont R7,R6,R10

2-way Superscalar with a (4,2) Data-rich Functional Unit
  - Figure labels: from memory, 7-port register file, ALU1, ALU2, (4,2)-FU
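The move from 6 instructions to 4 follows from simple operand arithmetic, sketched below in Python. The model is an assumption made for illustration: a dependent chain spends one of its two reads on the intermediate result after the first instruction, while a data-rich unit can consume two fresh operands per instruction. The function names are hypothetical.

```python
import math


def operands_for_permutation(n: int) -> int:
    """1 data word plus log2(n) control words (n a power of two)."""
    return 1 + (n.bit_length() - 1)


def chained_instruction_count(n: int) -> int:
    """2-read instructions in a dependent chain: the first brings in
    2 fresh operands, every later one only 1, because its other read
    is the intermediate result."""
    return operands_for_permutation(n) - 1


def packed_instruction_count(n: int, reads_per_instruction: int = 2) -> int:
    """Lower bound when every read port carries a fresh operand."""
    return math.ceil(operands_for_permutation(n) / reads_per_instruction)


print(chained_instruction_count(64), packed_instruction_count(64))  # 6 4
```

For n = 64 this gives 7 operands, 6 chained instructions, and 4 packed instructions, matching the R1 through R7 register usage in the two sequences above.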

Two (4,1) functional units, each with log(n) stages (Butterfly is faster than Omega-flip)
  - Figure: n = 64 bits -> Butterfly network (BFLY), 6 types of butterfly stages -> 64 permuted bits
  - Figure: n = 64 bits -> Inverse butterfly network (IBFLY), inverse butterfly stages -> 64 permuted bits
  - 2log(n) = 12 stages in total

Performing any permutation of n bits with 2 cycles latency, 1 cycle throughput
  - Consider n = 64 bits
  - Implement 2 permutation functional units, each with log(n) stages
      - e.g., 6-stage Butterfly network, 6-stage Inverse Butterfly network
  - Use Data-rich (4,1) functional unit leveraging datapaths of 2-way superscalar microarchitecture
  - Replace former log(n) = 6 instructions by 4 instructions via ISA or microarchitecture
  - Execute these 4 instructions, two at a time
      - 2 cycles latency but 1 cycle throughput
  - Can achieve any one of n! permutations at the rate of one per cycle
      - different permutation possible every cycle

Conclusions
  - Very fast, easily implementable, general-purpose permutation instructions for any processor
  - Radical speedup: from O(n) to O(log n) instructions
  - Latest result: down to O(1) cycles!!
      - Can achieve any one of n! permutations at the rate of one per cycle
  - Important applications:
      - accelerates both secure and multimedia information processing
      - single-bit and multi-bit subword permutations
      - big speedup in current algorithms, e.g., DES
      - opens field for faster, more secure new algorithms
      - versatile, multi-purpose primitives, e.g., for sorting
  - Validates basic word-orientation of processors even for complex bit operations within a word