Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Recent development of large-scale reconfigurable data-paths using RSFQ circuits
21st International Symposium on Superconductivity, Tsukuba, Japan, October 27-29, 2008
Nobuyuki Yoshikawa, Department of Electrical and Computer Engineering, Yokohama National University, Yokohama, Japan
Coworkers:
- H. Park, H. Hara, K. Taketomi, T. Kainuma, Y. Yamanashi (Yokohama National University)
- I. Kataeva, R. Kasagi, S. Iwasaki, H. Akaike, A. Fujimaki, M. Tanaka, K. Obata, Y. Ito, K. Takagi, N. Takagi (Nagoya University)
- H. Honda, K. Inoue, K. Murakami (Kyushu University)
- S. Nagasawa, M. Hidaka (SRL/ISTEC)

Outline of This Talk
- Background
- Architecture
- Target system
- Component developments: floating-point adders/multipliers (FPA/FPM), 2 x 2 RDP
- New process and cell library
- Road map
- Summary

Demand on High-Performance Computer
- The amount of calculation for the electronic structure of molecules using the molecular orbital method grows as $O(N^4)$.
- A molecule with 1000 atoms requires 600 TB of ERI (electron repulsion integral) calculations, composed of a large number of product-sum operations.

Breakdown of Moore's Law
Trends of the clock frequency of recent microprocessors
[Figure: clock frequency (GHz, logarithmic axis from 0.2 to 5) versus year, 1998 to 2004, for Celeron, Pentium III, Pentium 4, and Xeon, with trend lines annotated 1.6x / year and 1.1x / year. Source: http://www.intel.com/]

Problem in High-Performance Computers and Our Approach
- Large power consumption
- Memory wall problem
Single Flux Quantum (SFQ) circuits + new architecture solve these problems.
Josephson junction: $\Phi_0 = h/2e = 2.07\ \mathrm{mV \cdot ps}$
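
The flux quantum quoted on this slide can be checked from the exact SI values of the Planck constant and the elementary charge. The snippet below is only a numerical sanity check; the variable names are illustrative and not taken from the talk.

```python
# Magnetic flux quantum Phi_0 = h / (2e), using the exact SI (2019) constants.
PLANCK_H = 6.62607015e-34            # J*s
ELEMENTARY_CHARGE = 1.602176634e-19  # C

phi0_wb = PLANCK_H / (2 * ELEMENTARY_CHARGE)   # weber = volt * second

# 1 mV*ps = 1e-3 V * 1e-12 s = 1e-15 V*s, so divide by 1e-15 to convert.
phi0_mv_ps = phi0_wb / 1e-15

print(f"Phi_0 = {phi0_wb:.4e} Wb = {phi0_mv_ps:.2f} mV*ps")
```

Expressed in mV·ps, the value says that an SFQ pulse of roughly 1 mV height lasts on the order of 2 ps, which is why SFQ logic can be clocked at tens of GHz.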

Large-Scale Reconfigurable Data-Path (LSRDP) using RSFQ Circuits
- Many FPUs + reconfigurable network
- The data are transferred directly between FPUs.
- Reduction of the memory wall problem
N. Takagi et al., IEICE Technical Report, SCE2006-36, January 2007.

Example of Application of LSRDP while (I < 1000): Electron repulsion integral calculations of molecular orbit tei(4,4,4,4)=(((3+2*p*(4*pax*pbx+pbx**2+pax**2*(1+2*p*pbx**2)))*(3+2*q*(4*qcx*qdx+qdx**2+qcx**2*(1+2*q*qdx**2)))*f(0,t))/(p**2*q**2)+(4*(3+2*p*(4*pax*pbx+pbx* *2+PAx**2*(1+2*p*PBx**2)))*PQx*(QCx+QDx)*(3+2*q*QCx*QDx)*f(1,t))/(p*q*(p+q))(4*(PAx+PBx)*(3+2*p*PAx*PBx)*PQx*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2 )))*f(1,t))/(p*q*(p+q))(8*(pax+pbx)*(3+2*p*pax*pbx)*(qcx+qdx)*(3+2*q*qcx*qdx)*(((p+q)*f(1,t))+2*p*pqx**2*q*f(2,t)))/(p*q*(p+q)**2)+(2*(3+2*p*(4*pax*pbx+pbx**2+pax**2 *(1+2*p*PBx**2)))*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(((p+q)*f(1,t))+2*p*PQx**2*q*f(2,t)))/(p*q**2*(p+q)**2)+(2*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*(3+2*q*(4*QCx*QDx+Q Dx**2+QCx**2*(1+2*q*QDx**2)))*(((p+q)*f(1,t))+2*p*PQx**2*q*f(2,t)))/(p**2*q*(p+q)**2)+(4*(3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*PQx*(QCx+QDx)*(3*(p+q)*f( 2,t)+2*p*PQx**2*q*f(3,t)))/(q*(p+q)**3) +(8*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*PQx*(QCx+QDx)*(3+2*q*QCx*QDx)*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(p*(p+q)**3)(8*(PAx+ PBx)*(3+2*p*PAx*PBx)*PQx*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(q*(p+q)**3)(4*(PAx+PBx)*PQx*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*( 1+2*q*QDx**2)))*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(p*(p+q)**3)+((3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p* PQx**2*q*f(4,t))))/(q**2*(p+q)**4)(8*(PAx+PBx)*(3+2*p*PAx*PBx)*(QCx+QDx)*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(q*(p+q)**4)(8*(PAx+PBx)*( QCx+QDx)*(3+2*q*QCx*QDx)*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(p*(p+q)**4)+(4*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*(3+q*(QCx**2+4*QCx* QDx+QDx**2))*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(p*q*(p+q)**4)+((3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*(3*(p+q)**2*f(2,t) 
+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(p**2*(p+q)**4)(4*p*(PAx+PBx)*(3+2*p*PAx*PBx)*PQx*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t)) ))/(q*(p+q)**5)+(8*(3+p*(pax**2+4*pax*pbx+pbx**2))*pqx*(qcx+qdx)*(15*(p+q)**2*f(3,t)+4*p*pqx**2*q*(5*(p+q)*f(4,t)+p*pqx**2*q*f(5,t))))/(p+q)**5+(4*pqx*q*(qcx+qdx)*( 3+2*q*QCx*QDx)*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(p*(p+q)**5)(8*(PAx+PBx)*PQx*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(15*(p+q)**2*f(3,t )+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(p+q)**5+(8*(PAx+PBx)*(QCx+QDx)*(15*(p+q)**3*f(3,t)+30*p*PQx**2*q*(p+q)*(3*(p+q)*f(4,t)+2*p*PQx**2*q*f(5,t))8*p**3*P Qx**6*q**3*f(6,t)))/(p+q)**6+(2*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*(15*(p+q)**3*f(3,t)30*p*PQx**2*q*(p+q)*(3*(p+q)*f(4,t)+2*p*PQx**2*q*f(5,t))+8*p**3*PQx**6*q**3*f(6,t)))/( q*(p+q)**6)+(2*(3+q*(qcx**2+4*qcx*qdx+qdx**2))*(15*(p+q)**3*f(3,t)30*p*pqx**2*q*(p+q)*(3*(p+q)*f(4,t)+2*p*pqx**2*q*f(5,t))+8*p**3*pqx**6*q**3*f(6,t)))/(p*(p+q)**6) 787 MUL, 261 ADD, 69 FUNC I = I+1: Data-flow graph mapped to the LSRDP
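
The slide summarizes the expression by its operation counts (787 MUL, 261 ADD, 69 FUNC). A rough way to obtain such counts is to parse the expression and walk its syntax tree. The sketch below uses Python's `ast` module on a short fragment of the expression; the function name `count_ops` is hypothetical, and the counting rules (for example, whether a power counts as a multiplication) are assumptions that need not match how the authors arrived at their totals.

```python
import ast
from collections import Counter

def count_ops(expr: str) -> Counter:
    """Count binary operators and function calls in an arithmetic expression."""
    counts = Counter()
    for node in ast.walk(ast.parse(expr, mode="eval")):
        if isinstance(node, ast.BinOp):
            counts[type(node.op).__name__] += 1   # 'Mult', 'Add', 'Pow', 'Div', ...
        elif isinstance(node, ast.Call):
            counts["Call"] += 1                   # e.g. f(0, t)
    return counts

# A short fragment in the style of the ERI expression above.
fragment = "(3 + 2*p*(4*PAx*PBx + PBx**2)) * f(0, t) / (p**2 * q**2)"
ops = count_ops(fragment)
print(dict(ops))
```

Each counted node corresponds to one vertex of the data-flow graph, so the same walk could in principle emit the graph that is mapped onto the FPUs.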

LSRDP Architecture: Suitable for RSFQ Circuits
- Data flow in one direction.
- No loop structure.
- Need high throughput.
- Latency is not so important.
- Suitable for bit-serial processing.
- Reduced requirement on memory bandwidth.
- High switching activity.
- Heating is serious in semiconductor circuits.

Application Fields of LSRDP Processors
- Molecular orbit calculation
- Diffusion equation
- Wave equation
- Poisson equation
- etc.

Target System: 10-TFLOPS RSFQ-LSRDP Computer
- 2 TB memory module: FB-DIMM [DDR3@1333MHz, 128 GB] x 16 modules
- CMOS CPU (1 chip)
- SFQ part at 4.2 K, SFQ 0.5 μm process
- SFQ RDP: 32 FPU x 32 chips (4 GFLOPS/FPU)
- SFQ streaming buffer (SB): 64 Kb x 2 chips
- 1024 FPU per MCM (34 chips) x 4 MCM
- Memory bandwidth per MCM: 256 GB/s (= 16 GB/s x 16 channels)
[Block diagram: rows of FPUs connected through ORNs, with streaming buffers (SB) and SMAC blocks on the memory side.]
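
The figures on this slide are mutually consistent, which the following arithmetic makes explicit. It is a back-of-the-envelope check only; the variable names are illustrative, and the peak value assumes every FPU delivers its full 4 GFLOPS at all times, which the slide does not claim.

```python
# Figures quoted on the slide.
FPUS_PER_CHIP = 32
RDP_CHIPS_PER_MCM = 32
BUFFER_CHIPS_PER_MCM = 2
GFLOPS_PER_FPU = 4
MCM_COUNT = 4
DIMM_GB, DIMM_MODULES = 128, 16
CHANNEL_GB_S, CHANNELS = 16, 16

fpus_per_mcm = FPUS_PER_CHIP * RDP_CHIPS_PER_MCM              # 1024 FPUs
chips_per_mcm = RDP_CHIPS_PER_MCM + BUFFER_CHIPS_PER_MCM      # 34 chips
peak_tflops = fpus_per_mcm * GFLOPS_PER_FPU * MCM_COUNT / 1000
memory_tb = DIMM_GB * DIMM_MODULES / 1024                     # 2 TB
bandwidth_gb_s = CHANNEL_GB_S * CHANNELS                      # 256 GB/s per MCM

print(fpus_per_mcm, chips_per_mcm, peak_tflops, memory_tb, bandwidth_gb_s)
```

Under these assumptions the peak is about 16 TFLOPS, so the 10-TFLOPS target corresponds to a sustained utilization of roughly 60% of the FPUs.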

Organization of the Project
- Profs. K. Murakami, H. Honda (Kyushu Univ.): LSRDP architecture, compiler, algorithm
- Profs. N. Takagi, K. Takagi (Nagoya Univ.): CAD for logic design, arithmetic circuits
- Prof. N. Yoshikawa (Yokohama National Univ.): RSFQ-FPU chip, cell library
- Profs. A. Fujimaki, H. Akaike (Nagoya Univ.): network, RSFQ-LSRDP chip, cell library
- Dr. S. Nagasawa (SRL): advanced process

Component Development
- Floating-point adder (FPA)
- Floating-point multiplier (FPM)
- Operand routing network (ORN)
- 2 x 2 LSRDP prototype

Floating-Point Numbers
Value: $(-1)^S \times F \times 2^E$
- S: sign
- E: exponent
- F: significand or fraction
Single-precision layout: S (1 bit), E (8 bit), F (23 bit)
Example (single precision, 32 bit): $1.101_2 \times 2^4$ -> 0 11000011 10100000000000000000000
Data format in IEEE754 standard (number of bits):
Format | Sign | Exponent | Fraction
Half-precision | 1 | 5 | 11
Single-precision | 1 | 8 | 24
Double-precision | 1 | 11 | 53

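Bit-Serial Floating-Point Calculation
Two bit-serial data-paths are used for the calculation of significand and exponent.
[Figure: timing diagram versus time t. Significand stream: n_f bits, LSB first, MSB last. Exponent stream: n_e bits, LSB first, MSB last, together with the sign bit.]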

Timing Parameters in Bit-Serial Calculation
[Figure: an operation unit with clock and data inputs and clock and data outputs; data words enter and leave LSB first. The timing diagram shows inputs 1 to 3 and output 1 and defines three parameters:]
- A. Skew
- B. Latency
- C. Minimum interval

Floating-Point Addition: Example
Operands: $+(1.110_2 \times 2^2)$ and $-(1.010_2 \times 2^4)$
1. Subtraction of exponents.
2. Alignment of significands: $+(1.110_2 \times 2^2) = +(0.01110_2 \times 2^4)$
3. Addition (or subtraction) of two significands: $+(0.01110_2 \times 2^4) - (1.010_2 \times 2^4) = 11.00110_2 \times 2^4$ (two's complement)
4. Production of the sign of the result: $-(0.11010_2 \times 2^4)$
5. Normalization of result.
6. Rounding: $-(1.101_2 \times 2^3)$
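
The six steps can be traced on the slide's own operands with a small software model. This is a sketch under stated assumptions, not the RSFQ circuit: the function `fp_add`, the tuple encoding `(sign, significand, exponent)`, the two guard bits, and the round-half-up rule are all illustrative choices, and bits shifted out during alignment are simply dropped.

```python
def fp_add(a, b, frac_bits=3, guard_bits=2):
    """Add two toy floats (sign, significand, exponent).

    The significand is an integer encoding 1.xxx with `frac_bits` fractional bits.
    """
    (sa, fa, ea), (sb, fb, eb) = a, b
    # 1. Subtraction of exponents gives the alignment shift.
    shift = ea - eb
    # 2. Alignment: widen by guard bits, then shift the smaller operand right.
    fa <<= guard_bits
    fb <<= guard_bits
    if shift >= 0:
        fb >>= shift
        exp = ea
    else:
        fa >>= -shift
        exp = eb
    # 3. Signed addition of the aligned significands.
    total = (-fa if sa else fa) + (-fb if sb else fb)
    # 4. Sign of the result.
    sign = 1 if total < 0 else 0
    mag = abs(total)
    if mag == 0:
        return (0, 0, 0)
    # 5. Normalization: bring the leading 1 to the units position.
    top = frac_bits + guard_bits
    while mag >> (top + 1):
        mag >>= 1
        exp += 1
    while not (mag >> top):
        mag <<= 1
        exp -= 1
    # 6. Rounding: drop the guard bits (round half up), renormalize on overflow.
    mag = (mag + (1 << (guard_bits - 1))) >> guard_bits
    if mag >> (frac_bits + 1):
        mag >>= 1
        exp += 1
    return (sign, mag, exp)

# +(1.110b x 2^2) + -(1.010b x 2^4)
result = fp_add((0, 0b1110, 2), (1, 0b1010, 4))
print(result[0], bin(result[1]), result[2])
```

With these operands the model returns sign 1, significand 1101 and exponent 3, that is $-(1.101_2 \times 2^3)$, in agreement with the slide.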

Block Diagram of Bit-Serial FPA
Stages: (1) Align significand & rounding, (2) Addition (or subtraction), (3) Normalization
- Inputs: significand of A (Fa), significand of B (Fb), exponent & sign of A (Ea, Sa), exponent & sign of B (Eb, Sb)
- Significand path: Shifter of A, Shifter of B, Adder & Subtractor, Normalizer -> significand of result
- Exponent and sign path: Separator circuit, Comparator of magnitude (A > B), Subtractor (result of A-B, shift value), Controller, Buffers, MUX, Normalizer & sign and exponent combine circuit -> exponent & sign of result
- Control signals: shift value, effective operation, result of operation, amount of correction, sign of result
[Figure legend: data signals and control signals are drawn with different line styles.]

Chip Photograph of Half-Precision FPA
Half-precision FPA:
- Minimum interval: 12 clocks (n_f + 1)
- Latency: 23 clocks (2n_f + 1)
- Operating frequency: 20 GHz
- Performance: 1.67 GFLOPS
- # of JJs: 11700 JJs
- Power consumption: 3.5 mW
- Size: 6.76 x 4.96 mm²
*n_f: bit length of significand
*SRL Nb 2.5 kA/cm² standard process
[Chip photograph, 1 mm scale bar. Labeled blocks: Clock Generator, Shifter Register of Significands, Shifter Register of Exponent and Sign, Controller, Shifter of A, Shifter of B, Adder & Subtractor, Normalizer, Shifter Register for Confirmation.]
CONNECT cooperated with SRL, NiCT, NU & YNU

DC Bias Margin of Each Component Circuit @20GHz
[Figure: bar chart of DC bias margins on an axis from -10% to 40% for Shifter_A, Shifter_B, Adder, Normalizer of significand, Normalizer of exponent, Controller of shifter, and Controller of shifter and adder.]

Floating-Point Multiplier
Value: $(-1)^S \times F \times 2^E$ (S: sign, E: exponent, F: fraction); single-precision layout: S (1 bit), E (8 bit), F (23 bit)
- Significand part is calculated by a systolic-array multiplier: $Z_f = X_f \times Y_f$
- Exponent part is calculated by a bit-serial adder: $Z_e = X_e + Y_e$

Systolic-Array Multiplier
- Composed of a 1D array of 1-bit processing elements (PEs).
- Small hardware cost: ~ (bit length)
- High throughput: ~ 1/(bit length)
[Figure: operand bits enter and product bits leave the array serially, LSB first.]
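
To make the idea of a 1D array of 1-bit PEs concrete, the sketch below models a generic carry-save serial-parallel multiplier in software: one operand is held in the array, the other is streamed in LSB first, and one product bit leaves per clock. This is an assumed textbook scheme chosen for illustration; the slide does not specify the internal structure of the authors' PEs, and the function name `serial_parallel_multiply` is hypothetical.

```python
def serial_parallel_multiply(x: int, y: int, n: int) -> int:
    """Multiply two n-bit unsigned integers with a 1D array of n one-bit PEs.

    Each PE holds one bit of y, a sum bit and a carry bit. x is streamed in
    LSB first; one product bit is emitted per clock, LSB first.
    """
    s = [0] * n   # sum registers, one per PE
    c = [0] * n   # carry registers, one per PE
    product = 0
    for t in range(2 * n):                     # n input clocks + n flush clocks
        xt = (x >> t) & 1 if t < n else 0      # serial input bit
        sums, carries = [], []
        for i in range(n):                     # all PEs operate in parallel
            pp = xt & (y >> i) & 1             # 1-bit partial product
            total = pp + s[i] + c[i]           # full adder
            sums.append(total & 1)
            carries.append(total >> 1)
        product |= sums[0] << t                # PE 0 emits the next product bit
        s = sums[1:] + [0]                     # sums shift one PE toward the output
        c = carries                            # carries stay in place
    return product

# Significands from the FPM test pattern, read as 11-bit integers.
p = serial_parallel_multiply(0b11010110111, 0b11001010011, 11)
print(p, p == 0b11010110111 * 0b11001010011)
```

The model uses n PEs and needs on the order of n clocks per operand, which matches the slide's scaling of hardware cost with bit length and throughput with its inverse.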

Chip Photograph of Half-Precision FPM
Half-precision FPM:
- Minimum interval: 12 clocks (n_f + 1)
- Latency: 11 clocks (n_f)
- Operating frequency: 25 GHz
- Performance: 2.1 GFLOPS
- # of JJs: 11044 JJs
- Power consumption: 3.5 mW
- Size: 6.22 x 3.78 mm²
*n_f: bit length of significand
*SRL Nb 2.5 kA/cm² standard process
[Chip photograph, 1 mm scale bar. Labeled blocks: Clock Generator, Shifter Register, Multiplier, Normalizer, Adder and Normalizer.]

Test Result of FPM@25GHz
Input and output patterns:
- FX: 11010110111, EX: 11001 (10)
- FY: 11001010011, EY: 01101 (-2)
- FXY: 10101001110, EXY: 11000
Calculation of exponent part: (10) + (-2) + 1 = 9, where the three terms are EX, EY, and the carry from the fraction part.
Correct operation was confirmed at high speed. Maximum operating frequency: 31.5 GHz
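
The test pattern can be reproduced with integer arithmetic. The check below assumes that the bit strings are written MSB first, that the 11-bit significands include the leading 1, that the exponent uses the IEEE 754 half-precision bias of 15, and that the product is truncated rather than rounded. None of these conventions is stated on the slide; they are the ones under which the numbers agree. The function name `fpm_reference` is hypothetical.

```python
EXP_BIAS = 15   # IEEE 754 half-precision exponent bias (assumed here)
N_F = 11        # significand length in bits, including the leading 1

def fpm_reference(fx: int, ex: int, fy: int, ey: int):
    """Reference model of the half-precision multiply on raw bit fields."""
    prod = fx * fy                           # up to 2 * N_F bits
    carry = prod >> (2 * N_F - 1)            # 1 if the significand product >= 2.0
    fxy = prod >> (N_F - 1 + carry)          # normalize and truncate to N_F bits
    exy = ex + ey - EXP_BIAS + carry         # biased exponent of the result
    return fxy, exy, carry

fxy, exy, carry = fpm_reference(0b11010110111, 0b11001, 0b11001010011, 0b01101)
print(f"FXY = {fxy:011b}, EXY = {exy:05b}, carry = {carry}")
print("unbiased exponents:", 0b11001 - EXP_BIAS, 0b01101 - EXP_BIAS, exy - EXP_BIAS)
```

Under these assumptions the model yields FXY = 10101001110 and EXY = 11000, and the unbiased exponents are 10, -2 and 9, matching the measured output.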

Summary of Half-Precision FPUs
Item | Floating Point Adder | Floating Point Multiplier
# of JJs | 11700 | 11044
Size (mm²) | 6.76 x 4.96 | 6.22 x 3.78
Minimum interval (clocks): 12 (n_f + 1)
Latency (clocks): 23 (2n_f + 1)
n_f: bit length of fraction part
[Chip photographs of the FPA and the FPM, each with a 1 mm scale bar and the block labels shown on the individual chip slides.]
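
The quoted performance figures follow from the clock frequency and the minimum interval, if one assumes that a bit-serial unit completes one floating-point operation per minimum interval. That relation is an inference from the numbers rather than a statement in the talk; the helper names below are illustrative.

```python
def throughput_gflops(clock_ghz: float, min_interval_clocks: int) -> float:
    """One operation is completed every `min_interval_clocks` clock cycles."""
    return clock_ghz / min_interval_clocks

def latency_ns(clock_ghz: float, latency_clocks: int) -> float:
    """Time from first input bit to first output bit."""
    return latency_clocks / clock_ghz

fpa_gflops = throughput_gflops(20, 12)   # FPA: 20 GHz, 12 clocks
fpm_gflops = throughput_gflops(25, 12)   # FPM: 25 GHz, 12 clocks

print(f"FPA: {fpa_gflops:.2f} GFLOPS, latency {latency_ns(20, 23):.2f} ns")
print(f"FPM: {fpm_gflops:.2f} GFLOPS, latency {latency_ns(25, 11):.2f} ns")
```

The same relation indicates the clock rate needed for a given per-FPU throughput: at a 12-clock interval, 4 GFLOPS per FPU would need a clock of about 48 GHz.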

NDRO-based and crossbar-based architectures of ORN
ORN requirements:
- 1-to-N connections, where N is an odd number
- connections to either input of the FPU
NDRO-based ORN [Figure: two rows of FPUs (M FPUs per row) connected through NDRO switches]
- Advantage: small number of Josephson junctions required
- Drawback: irregular, non-pipelined structure, which becomes cumbersome as the complexity increases
Crossbar-based ORN [Figure: two rows of FPUs connected through a triangular array of CBT and ½CBT crossbar switches]
- Advantage: scalable, pipelined, easily re-designed for any N and M
- Drawback: large number of Josephson junctions required

Comparison of the ORN architectures

NDRO-based ORN
ORN complexity | Latency (ps) | Skew (ps) | Minimum interval | Number of control lines | Bias current (A) | Power (mW) | Number of JJs
N=3, M=8 | ~60 | ~60 | n_f + 60 ps | 96 | 0.6 | 1.5 | ~5500
N=5, M=10 | ~80 | ~80 | n_f + 80 ps | 200 | 0.9 | 2.25 | ~8000
N=9, M=32 | ~100 | ~100 | n_f + 100 ps | 1152 | 5.5 | 13.75 | ~50500
The number of JJs of the NDRO-based ORN in the table is an estimate based on the design of the switch for the RDP prototype (N=3, M=4), which consists of 2750 JJs and requires 300 mA of bias current (Iwasaki, not published yet).

Crossbar-based ORN
ORN complexity | Latency (clocks) | Skew (ps) | Minimum interval | Number of control lines | Bias current (A) | Power (mW) | Number of JJs
N=3, M=8 | 6 | ~300 | n_f | 100 | 0.63 | 1.575 | 6230
N=5, M=10 | 10 | ~500 | n_f | 208 | 1.41 | 3.525 | 13930
N=9, M=32 | 18 | ~900 | n_f | 1168 | 8.28 | 20.7 | 77440
A crossbar switch with broadcasting function: 296 JJs

Note that almost the same number of JJs is required for both ORNs if an isometric (equal-length wiring) network is employed in the NDRO-based ORN.
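
The power column of both tables equals the bias current multiplied by 2.5 mV. The slides do not state the bias voltage, so the 2.5 mV used below is an inference from the tabulated numbers (it is also a common supply voltage for RSFQ circuits). The check is a sketch with illustrative names.

```python
BIAS_VOLTAGE_MV = 2.5   # assumed DC bias voltage; inferred from the table, not stated

# (architecture, complexity): (bias current in A, tabulated power in mW)
orn_rows = {
    ("NDRO", "N=3, M=8"): (0.6, 1.5),
    ("NDRO", "N=5, M=10"): (0.9, 2.25),
    ("NDRO", "N=9, M=32"): (5.5, 13.75),
    ("crossbar", "N=3, M=8"): (0.63, 1.575),
    ("crossbar", "N=5, M=10"): (1.41, 3.525),
    ("crossbar", "N=9, M=32"): (8.28, 20.7),
}

def static_power_mw(bias_current_a: float) -> float:
    """Static power of a DC-biased circuit: A * mV = mW."""
    return bias_current_a * BIAS_VOLTAGE_MV

mismatches = [
    key for key, (current, power) in orn_rows.items()
    if abs(static_power_mw(current) - power) > 1e-9
]
print("rows inconsistent with 2.5 mV:", mismatches)
```

The same relation applied to the 2x2 SFQ-RDP bias current of 1.27 A would give about 3.2 mW of static power.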

1-to-2 ORN test
- 1-to-2 ORN: 2043 JJs, bias current 226 mA
- Total test circuit: 3098 JJs, total bias current 359 mA
[Figure: layout of the test circuit with three crossbar switches (CBT0, CBT1, CBT2), data inputs din0 to din2, data outputs dout01, dout02, dout11, dout12, and bias lines bias_kern0 to bias_kern2.]
Example of the low-frequency test: din0 -> dout01, dout02, dout12
Frequency dependence of the bias margins: din0 -> dout11
[Figure: upper and lower bias margins of bias_kern1 (in %, axis from -30 to 20) for din0 -> dout11 routing through CBT1, measured at frequencies from 10.842 to 23.480.]
Example: open466, no. 4 chip F2. Completely functional, exhaustive test.
- bias_kern0 = -14.6/5.3 %, does not depend on the pattern
- Minimum margins: bias_kern1 = -16.1/18.3 % and bias_kern2 = -20.7/12.6 % for din0 -> dout11, dout12
- Maximum margins: bias_kern1 = -40.3/17.2 % for din1 -> dout01; bias_kern2 = -38/12.6 % for din2 -> dout02, dout12

Design of 2x2 SFQ-RDP
- 11 pipeline stages
- Bias current: 1.27 A
- Designed frequency: 25 GHz
- Circuit area: 5.90 x 3.68 mm²
- InSR & OutSR length: 16 bits
- 10839 JJs
- Data length: 7 bits
[Chip photograph, 1 mm scale bar. Labeled blocks: Input SR, Output SR, Buffers, ALUs, ORN, Controller.]

Demonstration of 2x2 SFQ-RDP
[Figure: frequency characteristic of the RDP, with input patterns and output patterns.]
Maximum operating frequency: 23 GHz
The function for each ALU is chosen as shown in the figure.

Device Structure of Nb 10-layer Fabrication Process
10 kA/cm² Nb process
[Figure: cross section, from top to bottom: active layer including JJ and R; main GP and CPL; 2nd PTL layer; 1st PTL layer; DC power layer. Bias pillars connect the DC power layer to the active layer.]

Layout
- DCP (M1)
- Bias pillar (C1, 2, 3, 4, 5, 6, GC): 5 x 5 μm², 6 layers
- Moat (M2, 3, 4, 5, 6, 7)
- PTL (M3, 5): width 4.8, 5.5 μm
- Via of PTLs: less than 12 x 12 μm²
- Cell size: 30 μm
- Maximum current value: 2.4 mA (limited by size of contacts)

Cell library
$J_c$: 10 kA/cm², $\beta_c = 2$
Cell size: 30 μm x 30 μm
Cells: D2FF, CBE, DC/SFQ, SFQ/DC, DFF, JAND, JANDF, JNOR, RTFFB, JNOT, SPL3, T1, JOR, SPLL

Design of Bit-Serial Half Adder using a New Cell Library
[Figure: logic simulation results of the bit-serial half adder.]
$J_c$: 10 kA/cm², $\beta_c = 2$

On-Chip High-Speed Test Results of Bit-Serial Half Adder
[Chip photograph. Labeled blocks: Clock Generator, Shift Register for Input, Bit-Serial Adder, Shift Register for Output.]
$J_c$: 10 kA/cm², $\beta_c = 2$

Road Map of RSFQ LSRDP Processor
[Figure: road map along a 2007 to 2014 timeline.]
- Processes: 2.5 kA/cm² process, 10 kA/cm² process, 40 kA/cm² process
- Milestones: 25 GHz FPU/RDP; 60 GHz FPU & LSRDP prototype; 100 GHz FPU & LSRDP prototype; 10 TFLOPS LSRDP system development

Summary
- Our target is to establish a fundamental technology for high-end supercomputers based on the large-scale reconfigurable data-path (LSRDP) architecture.
- Some key components were designed and implemented using the standard Nb process, and their correct operation was demonstrated: half-precision RSFQ FPA and FPM, operand routing network (ORN), and 2 x 2 RDP.
- The structure of the SRL advanced II process was determined, and a new cell library is under development.
- 85 GHz operation of a bit-serial half adder was demonstrated.