A 64Gb/s PAM-4 Transmitter with 4-Tap FFE and 2.26pJ/b Energy Efficiency in 28nm CMOS FDSOI
G. Steffan¹, E. Depaoli¹, E. Monaco¹, N. Sabatino¹, W. Audoglio¹, A. A. Rossi¹, S. Erba¹, M. Bassi², A. Mazzanti²
¹ STMicroelectronics, Pavia, Italy
² Università degli Studi di Pavia, Pavia, Italy
Outline
- Motivation
- Proposed TX Architecture
- Reconfigurable FFE
- Output Driver
- High-Speed Serializer
- Clock Generation
- Measurement Results
- Conclusions
Network Traffic: Growth and Challenges
[Figure: network traffic in exabytes per month, 2015 to 2020, annotated "3x"]
Year | Exabytes per month
2015 | 72.5
2016 | 88.7
2017 | 108.5
2018 | 132.1
2019 | 160.6
2020 | 194.4
Sources: [Cisco, The Zettabyte Era: Trends and Analysis], [OIF-FD-Client-400G/1T-01.0 White Paper]
Challenges
- Gate count increases faster than I/O speed
- Power dissipation, rather than technology and routing, mostly limits max I/O density
- Increasing the data rate beyond 25Gb/s increases link losses and power consumption
PAM-4 Modulation
- Helps maintain the loss budget by decreasing the Nyquist frequency
- SNR degradation can be recovered by using FEC
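The Nyquist-frequency argument can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the eye-height penalty uses the standard assumption that PAM-4 splits the same peak-to-peak swing into three equal eyes.

```python
import math

def nyquist_ghz(bit_rate_gbps, levels):
    """Nyquist frequency (GHz) of a PAM signal with the given number of levels."""
    bits_per_symbol = math.log2(levels)
    symbol_rate_gbaud = bit_rate_gbps / bits_per_symbol
    return symbol_rate_gbaud / 2.0

nrz_nyquist = nyquist_ghz(64, 2)    # NRZ: 1 bit per symbol
pam4_nyquist = nyquist_ghz(64, 4)   # PAM-4: 2 bits per symbol

# Each PAM-4 eye is one third of the full swing, hence the SNR penalty
# that FEC is expected to recover.
pam4_eye_penalty_db = 20 * math.log10(3)

print(nrz_nyquist, pam4_nyquist, round(pam4_eye_penalty_db, 2))
```

At 64Gb/s this gives a Nyquist frequency of 32GHz for NRZ and 16GHz for PAM-4, at the cost of roughly 9.5dB of eye height.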
High-Speed PAM-4/NRZ TX Design
Challenges
- PAM-4: High output amplitude and linearity to preserve SNR and H/V opening
- PAM-4: Very high bandwidth to speed up nonadjacent level transitions
- PAM-4/NRZ: Precise and reliable serialization with low power
- PAM-4/NRZ: Reconfigurable FFE to be compliant with several standards
- PAM-4/NRZ: PAM-4/NRZ high/low speed modes for auto-negotiation and substitution of legacy components
TX Block Diagram
[Figure: TX block diagram highlighting the FFE data path: 40b LSB and MSB inputs, 40:8 stages, C-2, C-1, C1 and C2 tap streams, MUXs M_L and M_M, FFE_L and FFE_M, supply V_dd,cmos]
- Shift registers delay 8-bit bundles and generate five C[-2:2] FFE data streams
- MUXs M_M and M_L enable C[-2:2] selection
- In PAM-4 mode, up to 4 FFE taps
- In NRZ mode, the 40b LSB/MSB data is merged, but M_M and M_L can still be operated independently to provide up to 5 FFE taps
TX Block Diagram
[Figure: TX block diagram highlighting the output driver: FFE_L feeds 24 driver elements (segment labels 3, 12, 6, 3) and FFE_M feeds 48 driver elements (segment labels 6, 24, 12, 6), all connected to the output network; supplies V_dd,cmos and V_dd,dr]
- Output driver is composed of 72 elements
- 24 driver elements are driven by LSB data, 48 by MSB data
- Dedicated voltage supply V_dd,dr = 1.2V
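The 24/48 split of the 72 driver elements is what produces four equally spaced PAM-4 levels. A minimal sketch, assuming a plain binary mapping of the MSB/LSB pair onto the number of active elements (the slides do not state the symbol coding, and the names below are hypothetical):

```python
from fractions import Fraction

LSB_ELEMENTS = 24   # driver elements steered by LSB data
MSB_ELEMENTS = 48   # driver elements steered by MSB data
TOTAL_ELEMENTS = LSB_ELEMENTS + MSB_ELEMENTS

def pam4_level(msb, lsb):
    """Fraction of the 72 elements switched high for one 2-bit symbol."""
    return Fraction(MSB_ELEMENTS * msb + LSB_ELEMENTS * lsb, TOTAL_ELEMENTS)

# Assumed binary mapping: (MSB, LSB) = 00, 01, 10, 11
levels = [pam4_level(m, l) for m in (0, 1) for l in (0, 1)]
steps = [b - a for a, b in zip(levels, levels[1:])]

print(levels)
print(steps)
```

Because the MSB bank carries exactly twice the weight of the LSB bank, every step between adjacent levels equals one third of the full swing.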
TX Block Diagram
[Figure: TX block diagram highlighting clock generation: REF CK, PLL (2-8GHz), I/Q generation, CK4-I and CK4-Q]
- PLL generates a 2-8GHz clock signal
- High-precision I/Q signal generator feeds the 40:1 serializer
Reconfigurable TX FFE
[Figure: FFE_L and FFE_M tap-to-segment mapping (C-2, C-1, C1, C2 onto the LSB and MSB driver segments) for full-speed, half-speed and quarter-speed modes]
Minimum normalized amplitude of the coefficients:
Tap | PAM-4 FS | NRZ FS | NRZ HS | NRZ QS
2-PRE | | -21/72 | |
1-PRE | -3/24 | -21/72 | -21/72 |
MAIN | 12/24 | 36/72 | 36/72 | 45/72
1-POST | -9/24 | -36/72 | -36/72 | -27/72
2-POST | -9/24 | -36/72 | |
- At Full-Speed (FS), it provides up to 4 FFE taps in PAM-4 mode and 5 taps in NRZ mode, meeting the OIF CEI 56Gb/s MR and 28Gb/s KP4 standards
- At Half-Speed (HS), data is oversampled and [C-2, C2] are mapped as 1-Pre/Post cursor, respectively, meeting 10Gb/s KR10 and 8.5Gb/s PCI Exp-3
- At Quarter-Speed (QS), C2 is mapped as 1-Postcursor while C[-2:1] are all set to the Main cursor. This configuration is compliant with 2.5Gb/s PCI-Exp1
State-of-the-art PAM-4 Output Drivers
Hybrid voltage/current driver [Bassi et al., ISSCC 2016, JSSC 2017]
- Very good linearity and high output amplitude with 1V supply
- Bandwidth limited by increased load
- Low FFE programmability
Pure current-mode driver [Nazemi et al., ISSCC 2016]
- Simple implementation, high bandwidth
- Two supply domains and need for a level shifter operating at the output symbol rate
- High FFE programmability
Proposed Current Mode Driver
[Figure: current-mode driver schematic: 72 slices between V_dd,dr and the differential outputs Out_P/Out_N, current sources M_C1 and M_C2 biased by V_bias from a replica referenced to V_ref, CMOS inputs In_P/In_N in the V_dd,cmos domain]
- In_N and In_P are CMOS-level input data streams from the serializer
- Gate voltages of the M_C1,2 current sources are constant, set by a replica bias based on the desired output swing V_ref
- When the output node is high, the M_C1,2 source is pulled to V_dd,cmos, relaxing reliability constraints and allowing the use of thin-oxide devices
- High output swing with good linearity and large bandwidth
Output Network
[Figure: output network schematic with driver, two T-coils per output (Coil #1, Coil #2), ESD devices, resistor load bank and capacitances C_DRIVER, C_ESD, C_BUMP, C_LOAD. Plots: transfer function [dB] and return loss [dB] versus frequency, 0 to 30GHz, with and without coils]
- ESD protection: 200V MM / 500V CDM, >>2kV HBM
- Driver capacitance is comparable with ESD capacitance
- Double T-coil network enhances bandwidth by a factor of 1.5 and improves impedance matching at high frequency
High-Speed Serializer Architectures
[Figure: half-rate architecture (4:2 and 2:1 MUX stages, output load C_PAR) and quarter-rate architecture (4:1 MUX with SEL<3:0>, output load 2xC_PAR), with timing diagrams showing t_DIV, t_D, t_MUX and t_PULSE]
Half-rate architecture
- $t_{BIT} > t_{Setup} + t_{MUX} + t_{DIV} - t_{D}$
- Low C_PAR load of the half-rate architecture leads to very fast commutations
Quarter-rate architecture
- $t_{BIT} > t_{Setup} + t_{MUX} - t_{PULSE}$
- Higher C_PAR load of the quarter-rate architecture leads to increased ISI
Takeaways
- Propagating the clock forward relaxes serializer timing constraints
- Low load is highly desirable to limit ISI
Proposed MUX Architecture
[Figure: proposed serializer: D0 to D3 enter a 4:2 MUX clocked by CK4-I/CK4-Q, producing B0 and B1; a local X2 multiplier derives CK2 for the final 2:1 MUX. Timing diagram shows t_MULT, t_MUX and t_Setup, with output order b0, b1, b2, b3]
- Quarter-rate architecture to enhance speed and lower ISI
- Local X2 clock multiplier to save power
- Forward-propagated delay implemented with X2 allows relaxed timing constraints: $t_{BIT} > t_{Setup} + t_{MUX} - t_{MULT}$
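To see how a forward-propagated clock delay relaxes the final-stage timing, the constraint can be read as t_BIT > t_Setup + t_MUX - t_MULT, that is, the multiplier delay is subtracted from the data-path delay. The sketch below evaluates the slack under that reading. All delay values are hypothetical placeholders, not figures from the design.

```python
def timing_slack_ps(t_bit, t_setup, t_mux, t_forward=0.0):
    """Slack (ps) of the constraint t_bit > t_setup + t_mux - t_forward.

    t_forward is the clock delay propagated along with the data
    (t_MULT for the X2 multiplier); a positive result means timing is met.
    """
    return t_bit - (t_setup + t_mux - t_forward)

T_BIT_PS = 1e3 / 32      # 31.25 ps symbol period at 32 Gsym/s

# Hypothetical delays, chosen only to illustrate the effect
T_SETUP_PS = 15.0
T_MUX_PS = 20.0
T_MULT_PS = 10.0

slack_without = timing_slack_ps(T_BIT_PS, T_SETUP_PS, T_MUX_PS)
slack_with = timing_slack_ps(T_BIT_PS, T_SETUP_PS, T_MUX_PS, T_MULT_PS)

print(slack_without, slack_with)
```

With these placeholder numbers the path fails by 3.75ps without the forwarded clock delay and passes with 6.25ps of slack once t_MULT is accounted for.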
Proposed MUX Architecture
[Figure: circuit detail of the 4:2 MUX (pass gates driven by CK4-I/CK4-Q phase pairs P/N, P/P, N/P, N/N), the X2 multiplier generating CK2 P/N, the latch and the 2:1 output MUX. Plot: peak-to-peak jitter [ps] versus symbol rate [Gsym/s], 10 to 50, for the traditional MUX and the proposed MUX]
- 4:2 MUX based on pass gates to save power and guarantee $t_{MUX} > t_{MULT}$, respecting hold-time constraints
- NAND-based frequency doubler generates the half-rate clock for the last 2:1 MUX
- At 32 Gsym/s the peak-to-peak jitter on the output node is reduced by a factor of 1.3 compared to a traditional direct MUX
Effects of I/Q Mismatches
[Figure: CK4-I, CK4-Q and CK2 waveforms. A quadrature error Δ between the 2UI-spaced quarter-rate edges turns the CK2 half-periods into 1UI-Δ and 1UI+Δ. Cases shown: Δ=5.6°, 1.12UI / 0.88UI; Δ<1.4°, 1UI / 1UI]
- I/Q mismatches on the quarter-rate clocks create DCD on the half-rate clock
- I/Q phase error must be lower than 1.4°
Effects of I/Q Duty-Cycle Distortion
[Figure: CK4-I, CK4-Q and CK2 waveforms. A duty-cycle error on CK4-I (2UI-Δ / 2UI+Δ) produces CK2 intervals of 1UI, 1UI-Δ, 1UI+Δ, 1UI. Cases shown: Δ=0.11UI, 1UI / 0.89UI / 1.11UI / 1UI; Δ<0.01UI, 1UI / 1UI / 1UI / 1UI]
- DCD on the quarter-rate I/Q clocks translates to DCD on the half-rate clocks with a period of 4UI
- Generation of precise I/Q quarter-rate clocks is key, especially at high speed
Clock Generation Tree
[Figure: clock generation tree: integer-N PLL (REF CK, PFD, CP, LPF, 6-8GHz and 4-6GHz VCOs, /1 and /2 output divider, /N divider, bandgap and regulator), injection-locked ring oscillator with locking signal and Vtune, phase rotator, CML-to-CMOS converters, duty-cycle correction (DCC-I, DCC-Q), outputs CK4-I and CK4-Q]
- Integer-N PLL with two VCOs and an output divider generates the 2-8GHz master clock
- Injection-locked ring oscillator provides 8 high-accuracy phases across PVT variations
- Phase rotators interpolate between 8 π/4-spaced phases to improve DNL and INL
- Quarter-rate clocks are fed to the serializer after the DCC circuit
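Why interpolating between π/4-spaced phases helps linearity can be illustrated with an idealized model. The sketch below assumes purely sinusoidal inputs and linear weighting of two adjacent phasors in 32 steps, then compares the worst-case deviation from the ideal linear phase for 45° and 90° input spacing. It is a textbook approximation with hypothetical function names, not a model of the actual circuit.

```python
import cmath
import math

def interpolated_phases_deg(phi1_deg, phi2_deg, steps=32):
    """Output phase for each weight setting when two phasors are summed linearly."""
    p1 = cmath.exp(1j * math.radians(phi1_deg))
    p2 = cmath.exp(1j * math.radians(phi2_deg))
    phases = []
    for k in range(steps + 1):
        w = k / steps
        phases.append(math.degrees(cmath.phase((1 - w) * p1 + w * p2)))
    return phases

def max_phase_error_deg(spacing_deg, steps=32):
    """Worst-case deviation from an ideal linear phase ramp over one segment."""
    actual = interpolated_phases_deg(0.0, spacing_deg, steps)
    ideal = [spacing_deg * k / steps for k in range(steps + 1)]
    return max(abs(a - i) for a, i in zip(actual, ideal))

err_45 = max_phase_error_deg(45.0)   # 8 input phases, pi/4 apart
err_90 = max_phase_error_deg(90.0)   # 4 input phases, pi/2 apart

print(round(err_45, 2), round(err_90, 2))
```

In this idealized model the worst-case error drops from about 4° with quadrature-only inputs to under 0.5° with π/4-spaced inputs, which is the motivation for generating 8 phases before interpolation.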
[Figure: injection-locked ring oscillator with quadrature calibration [Anzalone et al., ESSCIRC 2016]. Analog loop: phase detector and loop filter driving Vtune. Digital loop: window comparator (thresholds v_TH, v_TL), up/down logic and frequency code register with preset. Plots: quadrature phase error [°] (a) versus supply voltage, 0.8V to 1.2V, and (b) versus temperature, -40°C to 120°C, at f_IN=8GHz, for no calibration, analog calibration ON, and analog + digital calibration]
- A phase detector based on passive mixers measures the quadrature error and continuously tunes the oscillator Vtune for fine phase correction
- Concurrently, a window comparator monitors Vtune and drives a coarse digital calibration in the background
- The quadrature phase error is kept below 1.5° for supply voltages in [0.9V, 1.2V] and temperatures in [-40°C, 120°C]
Phase Rotator
[Figure: phase rotator schematic with four slices, each weighting a pair of adjacent input phases (φ1=0°, φ2=45°; φ1=45°, φ2=90°; φ1=90°, φ2=135°; φ1=135°, φ2=180°). Plots: DNL [LSB] and INL [LSB] versus code, 0 to 128, at 2GHz with AQC, 11GHz with AQC and 11GHz without AQC; f_IN=11GHz supplied externally]
- The phase rotator consists of four slices driven by the ILRO outputs
- Each slice consists of 32 thermometer-weighted differential pairs to reduce switching glitches and guarantee the monotonicity of the output phase
- At 11GHz, the maximum DNL and INL are 0.5 and 1 LSB, respectively
DCD Correction Circuit
[Figure: DCD correction circuit with inputs IN_P,N, outputs OUT_N,P and control words SELP<6:0> and SELN<6:0>. Plot: DCD @8GHz [%], from -2 to 2, versus code, 0 to 14]
- PMOS and NMOS switches operate independently
- Two 7-bit thermometer codes avoid glitches and guarantee the monotonicity of the correction
- DCD correction range is ±1.5% at 8GHz
Chip Photo and Power Break-Down
[Figure: chip photo with the transmitter and PLL areas marked, and power break-down pie chart]
Block | Share of power
Output Driver | 16%
LF Serializer | 8%
Clock Gen. | 31%
HF Serializer | 45%
- Power consumption: 145mW @ 64Gb/s
- V_dd,cmos = 1V, V_dd,dr = 1.2V
- 10-metal-layer 28nm FDSOI CMOS from STMicroelectronics
- Chips encapsulated in flip-chip BGA packages
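The headline energy efficiency follows directly from the power and data rate quoted on this slide. A minimal check (the function name is hypothetical):

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Energy per bit in pJ/b; mW divided by Gb/s is numerically pJ/b."""
    return power_mw / data_rate_gbps

efficiency = energy_per_bit_pj(145, 64)
print(efficiency)
```

145mW at 64Gb/s works out to about 2.27pJ/b, consistent with the 2.26pJ/b figure in the title.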
Measurement Setup
[Figure: measurement setup with reference clock, test board and Agilent DCA-X 86100D. Plot: package and board insertion gain [dB], from 0 to -3, versus frequency, 0 to 20GHz]
- Package and board trace loss at 16GHz is 2.5dB
- Connectors and cable add about 2dB more loss
- Total loss at 16GHz is 4.5dB
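The loss budget can be tallied and converted into an amplitude ratio to show how much of a 16GHz tone survives the measurement path. This is simple dB arithmetic; the 2.0dB connector and cable figure is the approximate value quoted above, and the names are hypothetical.

```python
def db_loss_to_amplitude_ratio(loss_db):
    """Voltage ratio remaining after a given loss in dB."""
    return 10 ** (-loss_db / 20)

PKG_BOARD_LOSS_DB = 2.5        # package and board traces at 16GHz
CONNECTOR_CABLE_LOSS_DB = 2.0  # approximate, "about two more dB"

total_loss_db = PKG_BOARD_LOSS_DB + CONNECTOR_CABLE_LOSS_DB
remaining_amplitude = db_loss_to_amplitude_ratio(total_loss_db)

print(total_loss_db, round(remaining_amplitude, 3))
```

A total of 4.5dB means that only about 60% of the amplitude at the PAM-4 Nyquist frequency reaches the scope, which is why the FFE settings matter for the measured eyes.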
Output Eyes at 28/56 Gb/s
PRBS-9 @ 28Gb/s
- FIR setting: [C-1, main, C1] = [-1/24, 18/24, -3/24]
- Vertical opening: 0.73V
- Horizontal opening: 0.84UI
QPRBS-13 @ 56Gb/s
- FIR setting: [C-1, main, C1] = [-1/24, 18/24, -3/24]
- Vertical opening: 0.18V
- Horizontal opening: 0.48UI
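The effect of this FIR setting on a data pattern can be sketched with a direct-form FFE. The code assumes the three quoted values are the precursor, main cursor and first postcursor in that order, applies them to an NRZ pattern of ±1 symbols, and treats symbols outside the sequence as zero. The pattern and function name are illustrative only.

```python
from fractions import Fraction

# Assumed tap order: index -1 = precursor, 0 = main cursor, +1 = postcursor
TAPS = {-1: Fraction(-1, 24), 0: Fraction(18, 24), 1: Fraction(-3, 24)}

def ffe(symbols, taps=TAPS):
    """y[n] = sum over k of c_k * x[n-k]; out-of-range symbols count as 0."""
    out = []
    for n in range(len(symbols)):
        acc = Fraction(0)
        for k, coeff in taps.items():
            i = n - k
            if 0 <= i < len(symbols):
                acc += coeff * symbols[i]
        out.append(acc)
    return out

pattern = [-1, -1, -1, 1, 1, 1, 1, -1]
shaped = ffe(pattern)

print([str(v) for v in shaped])
```

The first symbol after the rising transition is emphasized to 20/24 of full scale, while the repeated symbols that follow settle to 14/24, the usual high-frequency boost of a de-emphasis FFE.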
Output Eyes at 32/64 Gb/s
PRBS-9 @ 32Gb/s
- FIR setting: [C-1, main, C1] = [-1/24, 18/24, -3/24]
- Vertical opening: 0.6V
- Horizontal opening: 0.75UI
QPRBS-13 @ 64Gb/s
- FIR setting: [C-1, main, C1] = [-1/24, 18/24, -3/24]
- Vertical opening: 0.14V
- Horizontal opening: 0.36UI
S22 and PLL Phase Noise
- Return loss is better than the mask limit, with margin
- Jitter of the clock is estimated by integrating phase noise starting from 500kHz offset
- The random jitter integrated up to 8GHz is 290fs
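Estimating jitter from phase noise amounts to integrating the single-sideband noise L(f) over the offset range and converting the resulting RMS phase to time. The sketch below uses a trapezoidal integration over a hypothetical flat profile and an assumed 8GHz carrier. Neither the profile nor the carrier frequency comes from the measurement, so the result is not meant to reproduce the 290fs figure.

```python
import math

def rms_jitter_s(profile, carrier_hz):
    """RMS jitter from (offset_hz, L_dBc_per_Hz) points sorted by offset.

    Integrates L(f) with trapezoids in linear units, doubles the result to
    account for both sidebands, and converts RMS phase (rad) to seconds.
    """
    area = 0.0
    for (f_a, l_a), (f_b, l_b) in zip(profile, profile[1:]):
        s_a = 10 ** (l_a / 10)
        s_b = 10 ** (l_b / 10)
        area += 0.5 * (s_a + s_b) * (f_b - f_a)
    phase_rms_rad = math.sqrt(2.0 * area)
    return phase_rms_rad / (2 * math.pi * carrier_hz)

# Hypothetical flat -150 dBc/Hz profile from 500kHz to 8GHz offset
example_profile = [(500e3, -150.0), (8e9, -150.0)]
example_jitter = rms_jitter_s(example_profile, carrier_hz=8e9)

print(round(example_jitter * 1e15, 1), "fs")
```

For a measured profile, a denser set of points (or interpolation on a logarithmic frequency axis) would be needed, since linear trapezoids between widely spaced offsets overestimate the area under a falling curve.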
Comparison with State of the Art
[Table: comparison with prior work, not reproduced]
Conclusions
- Delivering high TX amplitude while preserving linearity and large bandwidth is key for high-speed PAM-4 transmitters
- A new output driver allows high swing and good linearity with an increased supply while still employing thin-oxide devices operated reliably
- A smart FFE structure is proposed for backward compatibility with legacy standards
- Measurements on test chips realized in 28nm CMOS FDSOI technology by STMicroelectronics prove the effectiveness of the proposed TX
Acknowledgement
The authors are thankful to Dr. Guido Albasini, Daniele Baldi, Dr. Davide Sanzogni and the layout team for their contributions.