Recent development of large-scale reconfigurable data-paths using RSFQ circuits

21st International Symposium on Superconductivity, Tsukuba, Japan, October 27-29, 2008
Nobuyuki Yoshikawa, Department of Electrical and Computer Engineering, Yokohama National University, Yokohama, Japan
Coworkers:
- H. Park, H. Hara, K. Taketomi, T. Kainuma, Y. Yamanashi (Yokohama National University)
- I. Kataeva, R. Kasagi, S. Iwasaki, H. Akaike, A. Fujimaki, M. Tanaka, K. Obata, Y. Ito, K. Takagi, N. Takagi (Nagoya University)
- H. Honda, K. Inoue, K. Murakami (Kyushu University)
- S. Nagasawa, M. Hidaka (SRL/ISTEC)

Outline of This Talk
- Background
- Architecture
- Target system
- Component developments: floating-point adders/multipliers (FPA/FPM), 2 x 2 RDP
- New process and cell library
- Road map
- Summary

Demand for High-Performance Computers
- Computational cost of calculating the electronic structure of molecules using the molecular orbital method: O(N^4)
- A molecule with 1000 atoms: 600 TB of ERI calculations, composed of a large number of product-sum operations

Breakdown of Moore's Law
Trends of the clock frequency of recent microprocessors.
[Figure: clock frequency (axis from 0.2 to 5 GHz) versus year, 1998 to 2004, for Celeron, Pentium III, Pentium 4 and Xeon, with annotated growth rates of 1.6x/year and 1.1x/year. Source: http://www.intel.com/]

Problems in High-Performance Computers and Our Approach
- Large power consumption
- Memory wall problem
Single flux quantum (SFQ) circuits combined with a new architecture address these problems.
Josephson junction: Φ0 = h/2e = 2.07 mV·ps
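The flux quantum value quoted on this slide can be checked from the SI-defined constants. This is only a small numerical sketch; the variable names are illustrative and do not come from the talk.

```python
# Exact SI values (2019 redefinition) of the Planck constant and elementary charge.
PLANCK_H = 6.62607015e-34            # J*s
ELEMENTARY_CHARGE = 1.602176634e-19  # C

# Magnetic flux quantum: Phi0 = h / (2e), in webers (1 Wb = 1 V*s).
phi0_wb = PLANCK_H / (2 * ELEMENTARY_CHARGE)

# 1 mV*ps = 1e-3 V * 1e-12 s = 1e-15 V*s, a convenient unit for SFQ pulses.
phi0_mv_ps = phi0_wb / 1e-15

print(f"Phi0 = {phi0_wb:.4e} Wb = {phi0_mv_ps:.2f} mV*ps")
```

The mV·ps unit reflects how an SFQ pulse is usually pictured: its voltage integrated over time equals Φ0, for example a pulse of roughly 1 mV lasting roughly 2 ps.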

Large-Scale Reconfigurable Data-Path (LSRDP) using RSFQ Circuits
- A large number of FPUs + a reconfigurable network.
- The data are transferred directly between FPUs.
- Reduction of the memory wall problem.
N. Takagi et al., IEICE Technical Report, SCE2006-36, January 2007.

Example of Application of LSRDP while (I < 1000): Electron repulsion integral calculations of molecular orbit tei(4,4,4,4)=(((3+2*p*(4*pax*pbx+pbx**2+pax**2*(1+2*p*pbx**2)))*(3+2*q*(4*qcx*qdx+qdx**2+qcx**2*(1+2*q*qdx**2)))*f(0,t))/(p**2*q**2)+(4*(3+2*p*(4*pax*pbx+pbx* *2+PAx**2*(1+2*p*PBx**2)))*PQx*(QCx+QDx)*(3+2*q*QCx*QDx)*f(1,t))/(p*q*(p+q))(4*(PAx+PBx)*(3+2*p*PAx*PBx)*PQx*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2 )))*f(1,t))/(p*q*(p+q))(8*(pax+pbx)*(3+2*p*pax*pbx)*(qcx+qdx)*(3+2*q*qcx*qdx)*(((p+q)*f(1,t))+2*p*pqx**2*q*f(2,t)))/(p*q*(p+q)**2)+(2*(3+2*p*(4*pax*pbx+pbx**2+pax**2 *(1+2*p*PBx**2)))*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(((p+q)*f(1,t))+2*p*PQx**2*q*f(2,t)))/(p*q**2*(p+q)**2)+(2*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*(3+2*q*(4*QCx*QDx+Q Dx**2+QCx**2*(1+2*q*QDx**2)))*(((p+q)*f(1,t))+2*p*PQx**2*q*f(2,t)))/(p**2*q*(p+q)**2)+(4*(3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*PQx*(QCx+QDx)*(3*(p+q)*f( 2,t)+2*p*PQx**2*q*f(3,t)))/(q*(p+q)**3) +(8*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*PQx*(QCx+QDx)*(3+2*q*QCx*QDx)*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(p*(p+q)**3)(8*(PAx+ PBx)*(3+2*p*PAx*PBx)*PQx*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(q*(p+q)**3)(4*(PAx+PBx)*PQx*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*( 1+2*q*QDx**2)))*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(p*(p+q)**3)+((3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p* PQx**2*q*f(4,t))))/(q**2*(p+q)**4)(8*(PAx+PBx)*(3+2*p*PAx*PBx)*(QCx+QDx)*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(q*(p+q)**4)(8*(PAx+PBx)*( QCx+QDx)*(3+2*q*QCx*QDx)*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(p*(p+q)**4)+(4*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*(3+q*(QCx**2+4*QCx* QDx+QDx**2))*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(p*q*(p+q)**4)+((3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*(3*(p+q)**2*f(2,t) 
+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(p**2*(p+q)**4)(4*p*(PAx+PBx)*(3+2*p*PAx*PBx)*PQx*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t)) ))/(q*(p+q)**5)+(8*(3+p*(pax**2+4*pax*pbx+pbx**2))*pqx*(qcx+qdx)*(15*(p+q)**2*f(3,t)+4*p*pqx**2*q*(5*(p+q)*f(4,t)+p*pqx**2*q*f(5,t))))/(p+q)**5+(4*pqx*q*(qcx+qdx)*( 3+2*q*QCx*QDx)*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(p*(p+q)**5)(8*(PAx+PBx)*PQx*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(15*(p+q)**2*f(3,t )+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(p+q)**5+(8*(PAx+PBx)*(QCx+QDx)*(15*(p+q)**3*f(3,t)+30*p*PQx**2*q*(p+q)*(3*(p+q)*f(4,t)+2*p*PQx**2*q*f(5,t))8*p**3*P Qx**6*q**3*f(6,t)))/(p+q)**6+(2*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*(15*(p+q)**3*f(3,t)30*p*PQx**2*q*(p+q)*(3*(p+q)*f(4,t)+2*p*PQx**2*q*f(5,t))+8*p**3*PQx**6*q**3*f(6,t)))/( q*(p+q)**6)+(2*(3+q*(qcx**2+4*qcx*qdx+qdx**2))*(15*(p+q)**3*f(3,t)30*p*pqx**2*q*(p+q)*(3*(p+q)*f(4,t)+2*p*pqx**2*q*f(5,t))+8*p**3*pqx**6*q**3*f(6,t)))/(p*(p+q)**6) 787 MUL, 261 ADD, 69 FUNC I = I+1: Data-flow graph mapped to the LSRDP
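To give a feel for how an expression of this kind turns into operation counts, the sketch below tallies the arithmetic nodes of a short fragment adapted from the first term of the integral. It uses Python's `ast` module purely as an illustration. How the 787 MUL, 261 ADD and 69 FUNC figures were obtained is not stated on the slide, and a real mapping tool would probably treat powers and common subexpressions differently. The helper name `count_operations` is hypothetical.

```python
import ast
from collections import Counter

def count_operations(expr: str) -> Counter:
    """Count binary operators and function calls in an arithmetic expression."""
    tree = ast.parse(expr, mode="eval")
    counts = Counter()
    for node in ast.walk(tree):
        if isinstance(node, ast.BinOp):
            # Operator class names are Mult, Add, Sub, Div, Pow, ...
            counts[type(node.op).__name__] += 1
        elif isinstance(node, ast.Call):
            counts["Call"] += 1
    return counts

# Shortened fragment: one polynomial factor times one f(n, t) call.
TERM = "(3+2*p*(4*pax*pbx+pbx**2+pax**2*(1+2*p*pbx**2)))*f(0,t)"
ops = count_operations(TERM)
print(dict(ops))
```

Each counted node corresponds to one vertex of the data-flow graph, which is the form in which the computation is mapped onto FPUs.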

LSRDP Architecture: Suitable for RSFQ Circuits
- Data flow in one direction; no loop structure.
- High throughput is needed; latency is less important. This suits bit-serial processing.
- Reduced requirement on memory bandwidth.
- High switching activity, for which heating is a serious problem in semiconductor circuits.

Application Fields of LSRDP Processors
- Molecular orbital calculation
- Diffusion equation
- Wave equation
- Poisson equation
- etc.

Target System: 10-TFLOPS RSFQ-LSRDP Computer
- CMOS CPU (1 chip)
- 2 TB memory module: FB-DIMM (DDR3 @ 1333 MHz, 128 GB) × 16 modules
- SFQ RDP (4.2 K, SFQ 0.5 μm process): 32 FPUs × 32 chips, 4 GFLOPS/FPU
- SFQ streaming buffer: 64 Kb × 2 chips
- 1024 FPUs per MCM (34 chips), × 4 MCMs
- Memory bandwidth per MCM: 256 GB/s (= 16 GB/s × 16 channels)
[Block diagram labels: FPU, ORN, SB, SMAC]
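The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch assumes that the separators lost between the numbers are multiplications (32 FPUs × 32 chips, 128 GB × 16 modules, 16 GB/s × 16 channels, × 4 MCMs) and that the two streaming-buffer chips account for the difference between 32 and 34 chips per MCM. All constant names are illustrative.

```python
# Values read from the slide; the multiplication structure is an assumption.
FPUS_PER_CHIP = 32
RDP_CHIPS_PER_MCM = 32
BUFFER_CHIPS_PER_MCM = 2
GFLOPS_PER_FPU = 4
MCM_COUNT = 4
DIMM_CAPACITY_GB = 128
DIMM_COUNT = 16
CHANNEL_BANDWIDTH_GB_S = 16
CHANNELS_PER_MCM = 16

fpus_per_mcm = FPUS_PER_CHIP * RDP_CHIPS_PER_MCM
chips_per_mcm = RDP_CHIPS_PER_MCM + BUFFER_CHIPS_PER_MCM
peak_tflops = fpus_per_mcm * GFLOPS_PER_FPU * MCM_COUNT / 1000
memory_tb = DIMM_CAPACITY_GB * DIMM_COUNT / 1024
bandwidth_per_mcm_gb_s = CHANNEL_BANDWIDTH_GB_S * CHANNELS_PER_MCM

print(f"FPUs per MCM:      {fpus_per_mcm}")
print(f"Chips per MCM:     {chips_per_mcm}")
print(f"Peak performance:  {peak_tflops:.3f} TFLOPS")
print(f"Memory capacity:   {memory_tb:.0f} TB")
print(f"Bandwidth per MCM: {bandwidth_per_mcm_gb_s} GB/s")
```

The peak value is simply the product of the listed numbers with every FPU busy; the 10-TFLOPS figure in the slide title is the stated system target.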

Organization of the Project
- Profs. K. Murakami, H. Honda (Kyushu Univ.): LSRDP architecture, compiler, algorithm
- Profs. N. Takagi, K. Takagi (Nagoya Univ.): CAD for logic design, arithmetic circuits
- Prof. N. Yoshikawa (Yokohama National Univ.): RSFQ-FPU chip, cell library
- Profs. A. Fujimaki, H. Akaike (Nagoya Univ.): network, RSFQ-LSRDP chip, cell library
- Dr. S. Nagasawa (SRL): advanced process

Component Development
- Floating-point adder (FPA)
- Floating-point multiplier (FPM)
- Operand routing network (ORN)
- 2 x 2 LSRDP prototype

Floating-Point Numbers
Value: (-1)^S × F × 2^E, where S: sign, E: exponent, F: significand or fraction.
Field layout (single precision): S (1 bit), E (8 bits), F (23 bits).
Example (single precision, 32 bit): 1.101 (binary) × 2^4 is stored as 0 11000011 10100000000000000000000.
Data format in the IEEE 754 standard (bit widths):
                   Sign   Exponent   Fraction (significand, including the implicit leading bit)
Half-precision     1      5          11
Single-precision   1      8          24
Double-precision   1      11         53
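As a cross-check of the field layout, the sketch below packs the example value 1.101 (binary) × 2^4 = 26.0 into an IEEE 754 single-precision word with Python's standard `struct` module and splits it into sign, exponent and fraction fields. It assumes the standard exponent bias of 127. Under that assumption the exponent field for 2^4 is 10000011 (4 + 127 = 131), whereas the field 11000011 printed in the example would decode to 2^68, so the example may follow a different convention or carry a transcription slip. The helper name `float32_fields` is illustrative.

```python
import struct

def float32_fields(x: float) -> tuple[str, str, str]:
    """Return the (sign, exponent, fraction) bit strings of x as IEEE 754 binary32."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{word:032b}"
    return bits[0], bits[1:9], bits[9:]

# 1.101 (binary) = 13/8 = 1.625, scaled by 2^4 gives 26.0.
value = 0b1101 / 2**3 * 2**4
sign, exponent, fraction = float32_fields(value)
print(sign, exponent, fraction)
```

The stored fraction has 23 bits because the leading 1 of the significand is implicit, which is why the table lists 24 bits of precision for single precision.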

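Bit-Serial Floating-Point Calculation
Two bit-serial data-paths are used, one for the calculation of the significand and one for the exponent.
[Figure: along the time axis t, the significand stream (n_f bits, LSB to MSB) and the exponent stream (n_e bits, LSB to MSB) together with the sign bit.]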
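Feeding the LSB first suits addition because the carry moves from low to high bits, so one 1-bit adder with a stored carry can process a word of any length, one bit per clock. The following is a generic software model of that idea, not the circuit used in the talk; all function names are illustrative.

```python
def bit_serial_add(a_bits, b_bits):
    """Add two equal-length LSB-first bit streams, one bit pair per clock."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        total = a + b + carry
        out.append(total & 1)   # sum bit leaves the adder this clock
        carry = total >> 1      # carry is held for the next clock
    out.append(carry)           # final carry becomes the top bit
    return out

def to_lsb_first(value, width):
    """Convert an unsigned integer to a list of bits, LSB first."""
    return [(value >> i) & 1 for i in range(width)]

def from_lsb_first(bits):
    """Convert an LSB-first bit list back to an unsigned integer."""
    return sum(bit << i for i, bit in enumerate(bits))

a, b = 0b1011, 0b0110
result = from_lsb_first(bit_serial_add(to_lsb_first(a, 4), to_lsb_first(b, 4)))
print(result)  # 17
```

The hardware cost stays constant as the word length grows, while the time per word grows linearly, which matches a throughput-oriented design where latency matters less.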

Timing Parameters in Bit-Serial Calculation
[Figure: an operation unit with clock and data inputs and clock and data outputs; the timing chart shows the input clock, the input data words (Input 1, Input 2, Input 3, each from LSB to MSB), the output clock and the output data word (Output 1, LSB to MSB).]
Parameters marked in the timing chart:
A. Skew
B. Latency
C. Minimum interval

Floating-Point Addition: Example
Operands: +(1.110 × 2^2) and -(1.010 × 2^4)
1. Subtraction of exponents.
2. Alignment of significands: +(1.110 × 2^2) becomes +(0.01110 × 2^4).
3. Addition (or subtraction) of two significands: +(0.01110 × 2^4) - +(1.010 × 2^4) = +(11.00110 × 2^4), in two's-complement form.
4. Production of the sign of the result: -(0.11010 × 2^4).
5. Normalization of the result.
6. Rounding: -(1.101 × 2^3).
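The six steps can be traced in software with a toy model in which a number is a tuple (sign, significand, exponent) and the significand is an integer holding 1.xxx with three fractional bits. This is a minimal sketch under several assumptions: two guard bits, round-half-up, and plain truncation when bits are shifted out during alignment. It is not the bit-serial circuit itself, and the name `fp_add` is hypothetical.

```python
def fp_add(a, b, frac_bits=3, guard_bits=2):
    """Add two toy floating-point numbers given as (sign, significand, exponent).

    The value of an operand is (-1)**sign * significand / 2**frac_bits * 2**exponent.
    """
    (sa, fa, ea), (sb, fb, eb) = a, b
    # Extend both significands with guard bits before shifting.
    fa <<= guard_bits
    fb <<= guard_bits
    # 1. Subtraction of exponents.
    shift = ea - eb
    # 2. Alignment: shift the operand with the smaller exponent to the right.
    if shift >= 0:
        fb >>= shift
        exp = ea
    else:
        fa >>= -shift
        exp = eb
    # 3. Addition or subtraction of the signed significands.
    total = (-fa if sa else fa) + (-fb if sb else fb)
    # 4. Sign of the result.
    sign = 1 if total < 0 else 0
    mag = abs(total)
    if mag == 0:
        return (0, 0, 0)
    # 5. Normalization: move the leading 1 to the units position.
    one = 1 << (frac_bits + guard_bits)
    while mag >= 2 * one:
        mag >>= 1
        exp += 1
    while mag < one:
        mag <<= 1
        exp -= 1
    # 6. Rounding: round half up, then drop the guard bits.
    mag = (mag + (1 << (guard_bits - 1))) >> guard_bits
    if mag == 2 << frac_bits:   # rounding overflowed to 10.000
        mag >>= 1
        exp += 1
    return (sign, mag, exp)

# +(1.110 x 2^2) + -(1.010 x 2^4), i.e. 7 - 20 = -13
result = fp_add((0, 0b1110, 2), (1, 0b1010, 4))
print(result)  # (1, 13, 3), i.e. -(1.101 x 2^3)
```

Python's signed integers stand in for the two's-complement intermediate value of step 3, so the sign and magnitude are read off directly in step 4.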

Block Diagram of Bit-Serial FPA
Stages: (1) alignment of significands and rounding, (2) addition (or subtraction), (3) normalization.
[Figure: block diagram.
Inputs: significand of A, significand of B, exponent & sign of A, exponent & sign of B.
Blocks: separator circuit, comparator of magnitude, subtractor, controller, buffers, shifter of A, shifter of B, adder & subtractor, MUX, normalizer (significand), normalizer of sign and exponent, combine circuit.
Internal signals: Fa, Fb, Sa, Sb, Ea, Eb, shift value, effective operation, result of operation, amount of correction, sign of result.
Outputs: significand of result, exponent & sign of result.
Legend: data signals and control signals.]

Chip Photograph of Half-Precision FPA
Half-precision FPA:
- Minimum interval: 12 clocks (n_f + 1)
- Latency: 23 clocks (2n_f + 1)
- Operating frequency: 20 GHz
- Performance: 1.67 GFLOPS
- Number of JJs: 11700
- Power consumption: 3.5 mW
- Size: 6.76 × 4.96 mm²
n_f: bit length of significand
[Chip photograph with 1 mm scale bar; labeled blocks: Clock Generator, Shift Register of Significands, Shift Register of Exponent and Sign, Shifter of A, Shifter of B, Adder & Subtractor, Normalizer, Controller, Shift Register for Confirmation.]
SRL Nb 2.5 kA/cm² standard process. CONNECT: cooperation with SRL, NiCT, NU & YNU.

DC Bias Margin of Each Component Circuit @ 20 GHz
[Figure: bar chart of DC bias margins (axis from -10% to 40%) for Shifter_A, Shifter_B, Adder, Normalizer of significand, Normalizer of exponent, Controller of shifter, and Controller of shifter and adder.]

Floating-Point Multiplier
Value: (-1)^S × F × 2^E, where S: sign, E: exponent, F: fraction. Fields: S (1 bit), E (8 bits), F (23 bits).
- The significand part is calculated by a systolic-array multiplier: Z_f = X_f × Y_f
- The exponent part is calculated by a bit-serial adder: Z_e = X_e + Y_e

Systolic-Array Multiplier
- Composed of a 1D array of 1-bit processing elements (PEs).
- Small hardware cost: proportional to (bit length)
- High throughput: ~ 1/(bit length)
[Figure: 1D array of PEs with bit-serial input and output streams, each labeled from MSB to LSB.]

Chip Photograph of Half-Precision FPM
Half-precision FPM:
- Minimum interval: 12 clocks (n_f + 1)
- Latency: 11 clocks (n_f)
- Operating frequency: 25 GHz
- Performance: 2.1 GFLOPS
- Number of JJs: 11044
- Power consumption: 3.5 mW
- Size: 6.22 × 3.78 mm²
n_f: bit length of significand
[Chip photograph with 1 mm scale bar; labeled blocks: Clock Generator, Shift Registers, Multiplier, Normalizer, Adder and Normalizer.]
SRL Nb 2.5 kA/cm² standard process.

Test Result of FPM @ 25 GHz
[Figure: measured waveforms, labeled from LSB to MSB.]
FX:  11010110111   EX:  11001 (10)
FY:  11001010011   EY:  01101 (-2)
FXY: 10101001110   EXY: 11000
Calculation of the exponent part: (10) + (-2) + 1 = 9, that is, EX + EY + carry from the fraction part.
Correct operation was confirmed at high speed. Maximum operating frequency: 31.5 GHz.
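The test vector can be reproduced with integer arithmetic. The sketch rests on three assumptions that the slide does not state: the bit strings are read MSB first with the binary point after the leading bit, the 5-bit exponent uses a bias of 15 as in IEEE 754 half precision, and the low product bits are truncated rather than rounded. The function name `fp_mul_check` is hypothetical.

```python
def fp_mul_check(fx_bits, ex_bits, fy_bits, ey_bits, bias=15):
    """Multiply two significand/exponent pairs given as MSB-first bit strings."""
    n = len(fx_bits)                        # significand length n_f (11 here)
    fx, fy = int(fx_bits, 2), int(fy_bits, 2)
    product = fx * fy                       # up to 2n bits, 2(n-1) fractional bits
    carry = product >> (2 * n - 1)          # 1 if the product is >= 2.0
    fz = product >> (n - 1 + carry)         # renormalize and truncate to n bits
    ez = (int(ex_bits, 2) - bias) + (int(ey_bits, 2) - bias) + carry
    return format(fz, f"0{n}b"), format(ez + bias, f"0{len(ex_bits)}b"), ez

fxy, exy, exponent_value = fp_mul_check("11010110111", "11001",
                                        "11001010011", "01101")
print(fxy, exy, exponent_value)
```

With these assumptions the sketch yields the same FXY and EXY bit patterns as the measurement, including the carry of 1 from the fraction part into the exponent.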

Summary of Half-Precision FPUs
                            Floating-point adder   Floating-point multiplier
Number of JJs               11700                  11044
Size (mm²)                  6.76 x 4.96            6.22 x 3.78
Minimum interval (clocks)   12 (n_f + 1)           12 (n_f + 1)
Latency (clocks)            23 (2n_f + 1)          11 (n_f)
n_f: bit length of fraction part
[Chip photographs of the FPA and the FPM with 1 mm scale bars and the same block labels as on the individual chip slides.]
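The performance figures follow from the clock frequency and the minimum interval if one floating-point operation completes per minimum interval, which is the assumption made in this sketch. The operating frequencies (20 GHz for the FPA, 25 GHz for the FPM) are taken from the chip slides; the function names are illustrative.

```python
def throughput_gflops(clock_ghz: float, min_interval_clocks: int) -> float:
    """Operations per nanosecond, assuming one result every minimum interval."""
    return clock_ghz / min_interval_clocks

def latency_ns(clock_ghz: float, latency_clocks: int) -> float:
    """Latency in nanoseconds for a given number of clock cycles."""
    return latency_clocks / clock_ghz

fpa_gflops = throughput_gflops(20, 12)
fpm_gflops = throughput_gflops(25, 12)
fpa_latency = latency_ns(20, 23)
fpm_latency = latency_ns(25, 11)

print(f"FPA: {fpa_gflops:.2f} GFLOPS, latency {fpa_latency:.2f} ns")
print(f"FPM: {fpm_gflops:.2f} GFLOPS, latency {fpm_latency:.2f} ns")
```

The results, about 1.67 and 2.08 GFLOPS, agree with the 1.67 and 2.1 GFLOPS quoted for the two chips.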

NDRO-based and Crossbar-based Architectures of ORN
ORN requirements:
- 1-to-N connections, where N is an odd number
- connections to either input of the FPU
NDRO-based ORN (FPUs connected through NDRO switches to M FPUs):
+ : small number of Josephson junctions required
- : irregular, non-pipelined structure that becomes cumbersome as the complexity increases
Crossbar-based ORN (FPUs connected through stages of CBT and ½CBT switches):
+ : scalable, pipelined, easily redesigned for any N and M
- : large number of Josephson junctions required

Comparison of the ORN Architectures

NDRO-based ORN
ORN complexity   Latency (ps)   Skew (ps)   Minimum interval   Control lines   Bias current (A)   Power (mW)   Number of JJs
N=3, M=8         ~60            ~60         n_f + 60 ps        96              0.6                1.5          ~5500
N=5, M=10        ~80            ~80         n_f + 80 ps        200             0.9                2.25         ~8000
N=9, M=32        ~100           ~100        n_f + 100 ps       1152            5.5                13.75        ~50500
The numbers of JJs for the NDRO-based ORN are estimates based on the design of the switch for the RDP prototype (N=3, M=4), which consists of 2750 JJs and requires a bias current of 300 mA (Iwasaki, unpublished).

Crossbar-based ORN
ORN complexity   Latency (clocks)   Skew (ps)   Minimum interval   Control lines   Bias current (A)   Power (mW)   Number of JJs
N=3, M=8         6                  ~300        n_f                100             0.63               1.575        6230
N=5, M=10        10                 ~500        n_f                208             1.41               3.525        13930
N=9, M=32        18                 ~900        n_f                1168            8.28               20.7         77440
A crossbar switch with broadcasting function: 296 JJs.

Note that almost the same number of JJs is required for both ORNs if an isometric network (equal-length wiring) is employed in the NDRO-based ORN.
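The power column in both tables is consistent with static power equal to the bias current times a fixed bias voltage. The sketch below recovers that voltage from each row. The 2.5 mV value is inferred from the ratio of the two columns and is not stated on the slide; the list name `ORN_ROWS` is illustrative.

```python
# (architecture, N, M, bias current in A, power in mW), copied from the tables.
ORN_ROWS = [
    ("NDRO", 3, 8, 0.6, 1.5),
    ("NDRO", 5, 10, 0.9, 2.25),
    ("NDRO", 9, 32, 5.5, 13.75),
    ("crossbar", 3, 8, 0.63, 1.575),
    ("crossbar", 5, 10, 1.41, 3.525),
    ("crossbar", 9, 32, 8.28, 20.7),
]

# mW divided by A gives mV: the bias voltage implied by each row.
implied_bias_mv = [power_mw / bias_a for _, _, _, bias_a, power_mw in ORN_ROWS]

for (arch, n, m, bias_a, power_mw), v_mv in zip(ORN_ROWS, implied_bias_mv):
    print(f"{arch:8s} N={n}, M={m}: {bias_a} A x {v_mv:.2f} mV = {power_mw} mW")
```

Because every row gives the same ratio, the power comparison between the two architectures reduces to a comparison of their total bias currents.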

1-to-2 ORN Test
[Figure: test circuit built from three crossbar switches (CBT0, CBT1, CBT2) with data inputs din0, din1, din2; data outputs dout01, dout02, dout11, dout12; clock inputs and outputs (clkin, clkin_hf, clkin_lfin, clkin_lfout1, clkin_lfout2, clkout, clkout1, clkout2); control lines (bar00, bar02, bar12, cross01, cross10, cross11); and bias lines bias_kern0, bias_kern1, bias_kern2.]
- 1-to-2 ORN: 2043 JJs, bias current 226 mA
- Total test circuit: 3098 JJs, total bias current 359 mA
Example of the low-frequency test: din0 -> dout01, dout02, dout12.
Frequency dependence of the bias margins for din0 -> dout11 routing (bias_kern1):
[Figure: upper and lower bias margins (axis from -30 to 20) versus frequency (ticks from 10.842 to 23.480).]
Example: open466, no. 4 chip F2, completely functional in an exhaustive test.
- bias_kern0 = -14.6/5.3 %, does not depend on the pattern
- bias_kern1 = -16.1/18.3 % for din0 -> dout11, dout12
- bias_kern2 = -20.7/12.6 % for din0 -> dout11, dout12 (minimum)
- bias_kern1 = -40.3/17.2 % for din1 -> dout01
- bias_kern2 = -38/12.6 % for din2 -> dout02, dout12 (maximum)

Design of 2x2 SFQ-RDP
- 11 pipeline stages
- 10839 JJs
- Bias current: 1.27 A
- Designed frequency: 25 GHz
- Circuit area: 5.90 x 3.68 mm²
- InSR & OutSR length: 16 bits
- Data length: 7 bits
[Layout with 1 mm scale bar; labeled blocks: Input SR, Output SR, Buffers, ALUs, ORN, Controller.]

Demonstration of 2x2 SFQ-RDP
[Figure: frequency characteristic of the RDP, with input patterns and output patterns.]
Maximum operating frequency: 23 GHz.
The function for each ALU is chosen as shown in the figure.

Device Structure of Nb 10-layer Fabrication Process
10 kA/cm² Nb process.
[Figure: cross-section with labeled layers: active layer (including JJ and R), main GP and CPL, 2nd PTL layer, 1st PTL layer, DC power layer, and bias pillars.]

Layout
- DCP (M1)
- Bias pillar (C1, 2, 3, 4, 5, 6, GC): 5 x 5 μm²
- Moat (M2, 3, 4, 5, 6, 7): 6 layers
- PTL (M3, 5): width 4.8, 5.5 μm
- Via of PTLs: less than 12 x 12 μm²
- Cell size: 30 μm
- Maximum current value: 2.4 mA (limited by the size of contacts)

Cell Library
Jc: 10 kA/cm², βc = 2. Cell size: 30 μm x 30 μm.
Cells: D2FF, CBE, DC/SFQ, SFQ/DC, DFF, JAND, JANDF, JNOR, RTFFB, JNOT, SPL3, T1, JOR, SPLL

Design of Bit-Serial Half Adder using a New Cell Library
[Figure: logic simulation results of the bit-serial half adder.]
Jc: 10 kA/cm², βc = 2

On-Chip High-Speed Test Results of Bit-Serial Half Adder
[Chip photograph; labeled blocks: Clock Generator, Shift Register for Input, Bit-Serial Adder, Shift Register for Output.]
Jc: 10 kA/cm², βc = 2

Road Map of RSFQ LSRDP Processor
[Timeline from 2007 to 2014.]
- 2.5 kA/cm² process: 25 GHz FPU/RDP
- 10 kA/cm² process: 60 GHz FPU & LSRDP prototype
- 40 kA/cm² process: 100 GHz FPU & LSRDP prototype
- 10 TFLOPS LSRDP system development

Summary
- Our target is to establish a fundamental technology for high-end supercomputers based on the large-scale reconfigurable data-path (LSRDP) architecture.
- Several key components were designed and implemented using the standard Nb process, and their correct operation was demonstrated: half-precision RSFQ FPA and FPM, operand routing network (ORN), and 2 x 2 RDP.
- The structure of the SRL advanced II process was determined, and a new cell library is under development.
- 85 GHz operation of a bit-serial half adder was demonstrated.