LOW-POWER HIGH-SPEED SERIAL LINK DESIGN


By JIKAI CHEN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

© 2013 Jikai Chen

To my wife and my parents

ACKNOWLEDGEMENTS

During the seven years as a PhD student at the University of Florida, I received much help from many people. Although there is only one person listed as the author, the work presented in this Dissertation would not have been possible without them. To each one of them I owe many thanks.

I want to thank my advisor, Dr. Rizwan Bashirullah, for his encouragement when things might go wrong, his tolerance and patience when things did go wrong, and his high standard, which I will carry through the rest of my life. I want to thank Dr. Jenshan Lin, Dr. Robert Fox, and Dr. Sanjay Ranka for being on my committee and spending their precious time on this Dissertation.

My special thanks go to my friends at ICR. Walker Turner, Qiuzhong Wu, Hang Yu, Chris Dougherty, Chun-ming Tang, Lin Xue, Zhiming Xiao, Chun-chin Peng, Yan Hu, Pawan Sabharwal, Deepak Bhatia, Lawrence Fomundam, and Felipe Garay offered me help when I needed it the most, and brought fun to my supposedly dull PhD life. I will miss the basketball games that we played in those hot summer days. I want to thank Professor Paul Kohl and his group at Georgia Institute of Technology for their wonderful cooperation, especially Brad Chen and Todd Spencer.

I feel blessed to have such wonderful friends outside ICR, including Shuo Cheng, Mingqi Chen, Changzhi Li, Xiaogang Yu and Yan Yan. There is no doubt I enjoyed and will always cherish our friendship.

I am grateful to my manager, Yanli Fan, and my colleagues, Karl Muth, Archie Hu and Huawen Jin, at Texas Instruments. Yanli has been very supportive when I needed to take time off for my defense. I learnt a lot from each one of them, and look forward to making my own contribution to the team.

I want to thank my parents, my parents-in-law, and my sister. Throughout the ups and downs in the past years, they supported me with their love without condition. If there is only one thing that I want to achieve in my life, I want to make them proud.

Finally, I want to thank my dear wife, Yuan Rao, the most caring and lovely woman in my life. I cannot thank her enough for her love, encouragement, patience, and everything she has done for me. Marrying her is by far the best thing that ever happened to me. I won't hesitate a moment to give everything in the world for my wife, and dedicating this Dissertation to her is the least I can do.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER

1 INTRODUCTION
  Research Motivation
  Dissertation Organization

2 HIGH-SPEED SERIAL LINK OVERVIEW
  Chapter Overview
  The Channel
  Equalization
    FFE
    CTLE
    DFE
  Clocking
    Clock Generation
    Clock Recovery
  Signaling
    Signaling Efficiency
    Effects of Channel Loss
    Effects of FFE and DFE
    Effects of Back Termination
    Effects of Signaling and Termination Modes
  Summary

3 AN ACTIVE LINK WITH AIR-CAVITY TRANSMISSION LINES
  Chapter Overview
  Transmission Line Design
  Fabrication
  Link Implementation
    Link Architecture
    TX Design
    RX Design
      Preamp design
      DFE design
  Experimental Results
    Air-Cavity Transmission Line Measurement
    Link Measurement
  Summary

4 A 4.5-Gb/s 12.4-mW RX WITH BAUD-RATE CDR
  Chapter Overview
  Baud-Rate CDR
  Majority-Voting DFE
  Chip Implementation
    Architecture
    Slicer
    DMUX
    Clocking
  Experimental Results
  Summary

5 A 5-Gb/s 0.75-pJ/BIT VOLTAGE-MODE TRANSCEIVER
  Chapter Overview
  TX Implementation
    TX Architecture
    PRBS Generator
    LDO
    TX Driver
  RX Implementation
    RX Architecture
    Slicer Design
    Level Shifting and DFE Tap Generation
    DFE with Look-Ahead Selection Tree
    Decimated Baud-Rate CDR
  Injection-Locking-Based Clock Generation
    Clock Generation Overview
    ILRO Core
    Delay Line
  Experimental Results
    TX Measurement
    Clocking Measurement
    RX Measurement
    Transceiver Measurement
  Summary

6 A DIGITAL BACKGROUND ADC CALIBRATION TECHNIQUE
  Chapter Overview
  Background Calibration
    Review of Prior Art
    Proposed Background Calibration Scheme
      Calibration accuracy
      Convergence speed
      Calibration overhead and performance considerations
  Chip Implementation
    ADC Architecture
    Resistor Ladder
    T/H
    Comparator
    Digital Backend
    Reference ADC
    Calibration Engine and Supporting Circuitry
    Clock and Power Distribution
  Experimental Results
  Summary

CONCLUSIONS
LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Summary of signaling and termination modes
Final air-cavity microstrip dimensions
Performance summary
CDR truth table update
Clock phase update
Selector truth table
Majority-voter truth table
Performance summary
Performance summary of the receiver
Performance summary of the transceiver
Comparison of proposed and existing background calibration schemes
Comparison with recently published work

LIST OF FIGURES

1-1 Evolution of Intel Microprocessors
ITRS predictions for transistor count and on-chip clock frequency for the next decade
ITRS predictions of I/O and power for the next decade
Power efficiency of high-speed links vs. year
A typical high-speed serial link
Conductor loss
Physical mechanism of dielectric loss
Channel loss
A sample SBR
Main cursor vs. Nyquist loss
Eye degradation due to channel loss
FFE
CTLE
DFE block diagrams
Block diagrams of a PLL and a DLL
Block diagrams of an injection-locked 5-stage ring oscillator
Simulated phase noise suppression with injection-locking
CDR block diagram
Block diagram and principle of Alexander PD
Simulated performances of an inverter in a 0.13-μm CMOS technology
A typical link frontend
Main cursor amplitude and signaling power penalty vs. channel loss
2-19 Post-cursor amplitudes vs. channel loss
The effects of channel loss and equalization on
Effects of FFE and DFE in frequency domain
Lattice diagram for reflection calculation
Eye opening vs. RX mismatch
CM signaling
VM signaling
Cross-sections of microstrips
Simulated of conventional and air-cavity microstrip
Simulated of conventional and air-cavity microstrip
Simulated dielectric loss of conventional and air-cavity microstrip
Picture of the 3D model and simulated loss at various line widths
Simulated dielectric loss of air-cavity and conventional transmission lines
Improvement with air-cavity transmission line
Signaling power reduction with air-cavity
Fabrication process for the air-cavity structure
Picture and cross-section of the fabricated air-cavity structure
Link block diagram
Schematics of the latch and multiplexer
Schematic of the 5-b DAC
Preamp model for gain optimization
Preamp design
Input impedance tuning
Simulated RX eye diagrams
Layout of the test board with the air-cavity active link
3-20 Measured performances of a 5-cm air-cavity microstrip
Loss of the air-cavity line
Chip micrographs of the TX and the RX
Picture of the populated test board
Test setup
Measured waveforms
Measured link performances
Different ISI seen by the edge and data samples
CDR block diagrams
Operation principle of the proposed baud-rate CDR
Block diagram of a 1-tap speculative DFE
Proposed majority voter schematic
Simulated delay
Simulated selector and majority-voter performances
Block diagram of the RX
Schematic of the slicer with threshold control
Simulated slicer performances
Schematics of the CML and CMOS DMUX cells
Schematic of the divider for I/Q generation
Principle of PI
Schematic of the phase interpolator
Level-converter schematic
Die micrograph and board picture
Test setup
Measured 20 channel performances
4-20 Measured DFE performances
CDR measurement results
Measured CDR jitter tolerance
TX block diagram
PRBS block diagram
All-zero detector
Schematic of the self-biased comparator with offset
Simulated waveforms confirming the function of the all-zero detector
Stability of the LDO
RX block diagram
Schematic of the slicer
Level shifters
Detailed schematic of the level shifter
Simulated frequency response of the level shifter at different gain settings
Simulated pre-layout selector delay vs. power supply
DFE selection tree
Block diagram of the injection-locking-based clock generation
Schematic of the ILRO core
Start-up issue of the pseudo-differential oscillator
Schematic of the current-starved delay line
Simulated delay line tuning curve
Chip micrograph and transceiver layout
TX measurement results at 6.25 Gb/s
ILRO measurement results
Measured phase noise with and without injection locking
5-23 Measured CDR delay line tuning curve showing >2-UI tuning range
Measured loss characteristics of the 20 channel
Measured 4-Gb/s eye diagrams before and after the 20 channel
RX bathtubs with and without DFE
Jitter histogram of the recovered clock
Measured 5-Gb/s TX eye diagrams
Measured CDR waveforms
RX bathtubs with and without DFE
An ADC-based serial link
Schematic of a preamp
Correlation-based calibration
Redundancy-based calibration
Reference-ADC-based calibration
Principle of reference-ADC-based calibration
Proposed reconfigurable-comparator-based calibration
Mechanism of noise-induced calibration error
Required conversions for convergence with different resolutions
Block diagram of the ADC
T/H Design
T/H Bandwidth vs. switch width
Comparator block diagram
Schematics of the first two stages of the preamplifier
Effects of M
Current-steering DAC and the DAC bias generator. The bias generator is shared by all the comparators
Simulated comparator performances
6-21 Block diagram of the digital backend
FSM flow chart. N is the calibration index, which is also the SRAM address
Chip micrograph
Measured ADC linearity
Test setup for dynamic performance evaluation
Output spectrums
ENOB w/ and w/o calibration

LIST OF ABBREVIATIONS

Term    Definition
ADC     Analog-to-digital converter
CDR     Clock and data recovery
CG      Common-gate
CM      Current mode
CML     Current-mode logic
CTLE    Continuous-time linear equalization
DFE     Decision-feedback equalization
DLL     Delay-locked loop
DMUX    De-multiplexer
DNL     Differential non-linearity
DSP     Digital signal processor
ENOB    Effective number of bits
FFE     Feedforward equalization
FSM     Finite-state machine
ILRO    Injection-locked ring oscillator
INL     Integral non-linearity
ISI     Inter-symbol-interference
ITRS    International technology roadmap of semiconductors
I/O     Input/output
LFSR    Linear-feedback shift register
LPF     Low-pass filter
LSB     Least significant bit
MUX     Multiplexer
NRZ     Non-return-to-zero
PD      Phase detector
PFD     Phase-and-frequency detector
PI      Phase interpolator
PLL     Phase-locked loop
PM      Phase modulation
PRBS    Pseudo-random bit sequence
RX      Receiver
SAFF    Sense-amplifier flip-flop
SBR     Single-bit response
SNR     Signal-to-noise ratio
TX      Transmitter
UI      Unit interval
VCDL    Voltage-controlled delay line
VCO     Voltage-controlled oscillator
VM      Voltage mode

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

LOW-POWER HIGH-SPEED SERIAL LINK DESIGN

By Jikai Chen

May 2013

Chair: Rizwan Bashirullah
Major: Electrical and Computer Engineering

With ever increasing integrated functionalities and on-chip clock frequency on a processor, the off-chip bandwidth is increasing at even higher rates. The ITRS predicts that the aggregate off-chip bandwidth of future processors will reach 100 Tb/s in the next ten years, delivered by multiple high-speed serial links in parallel, each running at multi-Gb/s. At the same time, the total power budget of a processor is practically flat due to package and cooling technology limitations. To accommodate the increase of off-chip bandwidth, the power efficiency of high-speed interconnects must be dramatically improved over the next decade.

Various factors come into play when improving the power efficiency of high-speed serial links. For multi-Gb/s off-chip signaling, the electrical channel presents the most difficult challenge with its latency and frequency-dependent attenuation. As a result, clock and data recovery (CDR) and channel equalization have become essential functions in all high-speed off-chip serial links. To truly optimize the link power efficiency, the impact of channel condition, CDR and equalization on the link power must be well understood, in addition to that of such design choices as signaling mode and termination topology. This Dissertation is the result of such an effort.

The Dissertation starts with an overview of the high-speed serial link. The channel loss mechanisms are first reviewed and dielectric loss is shown to be the dominant factor in future high-speed channels. The dependence of the signaling power on signaling modes, termination topologies and equalization techniques is analyzed to identify power-efficient solutions. CDR is also briefly reviewed, revealing the need for a better baud-rate scheme than existing ones.

To reduce the dielectric loss, a low-power active link is presented in Chapter 3 with an air-cavity transmission line, which reduces the channel latency and the dielectric loss by replacing the dielectric material between the signal lines and the ground plane with air. Other techniques include the use of DFE, a current-sharing frontend, and the removal of back termination for better power efficiency. The link works up to 6.25 Gb/s with a power efficiency of 0.6 pJ/bit.

Clock recovery is addressed in Chapter 4. A novel digital baud-rate CDR scheme is proposed which automatically tracks the maximum eye-opening. Chapter 4 also proposes replacing the selectors in a traditional speculative DFE with majority-voters, which are faster and more power-efficient. A receiver that incorporates the proposed baud-rate CDR and majority-voting DFE works at 4.5 Gb/s while consuming 12.4 mW, yielding a power efficiency of 2.8 pJ/bit.

Building upon the results of Chapters 3 and 4, Chapter 5 presents a complete 5-Gb/s transceiver which dissipates only 3.7 mW. To improve the power efficiency, the transceiver uses exclusively static CMOS logic gates instead of the CML gates in Chapters 3 and 4, and employs injection-locking-based clock generation. Heavy parallelism and speculation in the DFE selection tree further reduce the power consumption. The measured 0.75-pJ/bit power efficiency is among the best reported to date.

While most serial links currently still rely on some analog signal processing, the continuous scaling of CMOS technology has recently made attractive an ADC-based serial link, in which equalization and timing recovery are all carried out in the digital domain. One of the key challenges in this ADC-based architecture is the power consumption of the high-speed ADC. Chapter 6 presents a novel digital background calibration scheme suitable for high-speed ADCs which features negligible hardware and power overhead. The efficacy of the proposed calibration scheme is experimentally confirmed with a 50-mW 2.5-GS/s 5-bit full-flash ADC.

All the test chips in this Dissertation are in a 0.13-µm bulk CMOS technology. However, the techniques are readily applicable to more advanced technologies. It is therefore expected that the techniques proposed in this Dissertation should help enable future off-chip serial links with high aggregate bandwidth and low power consumption.

CHAPTER 1
INTRODUCTION

1.1 Research Motivation

The past few decades have witnessed tremendous advances in semiconductor technology. Governed by Moore's Law [1] [2], the functionality (represented by the number of transistors) integrated on a single chip and the on-chip clock frequency both grew exponentially, as can be observed in Figure 1-1, which shows the transistor count and on-chip clock frequency of Intel's microprocessors over the past 40 years. Consequently, higher and higher I/O bandwidth is needed for the communication between microprocessors, accelerators, and memories [3]. Recently, the aggregate off-chip bandwidth has entered the Tb/s range, necessitating the integration of multiple (tens or even hundreds of) high-speed serial-link transceivers on the same chip, each operating at multi-Gb/s. For example, in [4], a 16-core SPARC processor has 1.1 Tb/s aggregate I/O bandwidth provided by 112 transmitters and 176 receivers with a peak signaling rate of 4.08 Gb/s each.

Such exponential growth of functionality and clock frequency is expected to continue in the coming decade, as predicted by ITRS [5] and shown in Figure 1-2(A) and (B), giving rise to an even faster increase of the I/O bandwidth over the same period. Figure 1-3(A) and Figure 1-3(B) show the predicted off-chip clock frequency and the total number of pads, while the resulting aggregate off-chip bandwidth is plotted in Figure 1-3(C), assuming that differential NRZ signaling is used and that 50% of the pads are dedicated to off-chip signaling. It can be seen that within 10 years, the total bandwidth will extend to the hundred-Tb/s range.

Figure 1-1. Evolution of Intel Microprocessors. A) Transistor count. B) On-chip clock frequency.

Figure 1-2. ITRS predictions for transistor count and on-chip clock frequency for the next decade. A) Transistor count. B) On-chip clock frequency.

However, due to packaging and cooling limitations, it is also predicted that the total power consumption of a processor will be kept practically flat at about 140 W over the same period, as shown in Figure 1-3(D) [5]. State-of-the-art power efficiency of high-speed serial-link transceivers is around 1 pJ/bit (1 mW/Gb/s), which means 100 W of I/O power consumption if 100 Tb/s aggregate bandwidth is desired. Clearly, the power efficiency of high-speed transceivers must be greatly improved in order to maintain such growth of I/O bandwidth. For example, if the I/O power is to be kept around 20% of the whole chip, the power efficiency should improve to approximately 0.2 pJ/bit.

Figure 1-3. ITRS predictions of I/O and power for the next decade. (Panels A to D: off-chip clock (GHz), I/O pads, aggregate I/O bandwidth (Tb/s), and power (W), each vs. year.)

Figure 1-4. Power efficiency of high-speed links vs. year. (Axes: power efficiency (pJ/bit) vs. year; trend line of about -20%/year.)

In response, the power efficiency of high-speed serial links has been steadily improving at about 20% each year [6] [7], driven by the joint effort of technology scaling and design innovations. Figure 1-4 shows the power efficiency of the high-speed serial links published in ISSCC and the VLSI Symposium in past years.

Extrapolating this trend to 2022 gives about 0.7 pJ/bit, which is more than 3× the 0.2 pJ/bit goal. This clearly indicates that more drastic improvement is needed in the future, and it is the motivation behind the research work presented in this Dissertation.

1.2 Dissertation Organization

A high-speed serial link involves functions such as equalization, clocking, and signaling. To improve the power efficiency of the whole link, it is vital to understand each of these components and their inter-dependencies, which is the topic of Chapter 2. Chapter 2 starts with the channel, with special emphasis on the intrinsic loss of transmission lines. It then introduces a few popular equalization techniques to compensate for channel loss. The important topic of clock generation and recovery follows, revealing the attractiveness of injection-locking-based clock generation and baud-rate CDR. After that, the signaling power is related to channel loss, equalization, termination, and signaling modes. The advantages of DFE and voltage-mode signaling with differential termination are demonstrated.

Chapter 3 focuses on reducing the signaling power by joint channel and circuit optimization. An air-cavity transmission line structure is proposed to reduce the dielectric loss, which dominates at high frequencies. To further reduce the power dissipation, the link also features speculative DFE and a current-sharing frontend without back termination. The active link dissipates 3.7 mW at 6.25 Gb/s, which translates to a power efficiency of 0.6 pJ/bit.

A digital eye-tracking baud-rate CDR scheme is proposed in Chapter 4. The baud-rate CDR automatically tracks the maximum eye-opening while reducing the clocking power by more than 50% compared to a conventional oversampling-based CDR. A majority-voting 1-tap speculative DFE is also proposed, which is more amenable to low-power and high-speed designs than the selectors in conventional speculative DFEs. Implemented with CML gates, a receiver with the proposed baud-rate CDR and majority-voting DFE consumes 12.4 mW at 4.5 Gb/s including the clocking circuitry.

To further improve the power efficiency, Chapter 5 presents a complete transceiver built exclusively from static CMOS gates. The RX employs heavy parallelism to reduce the power supply from the nominal 1.2 V to 1.0 V. Other design features include a speculative DFE with a look-ahead selection tree, a decimated baud-rate eye-tracking CDR, and an injection-locked ring oscillator for multi-phase clock generation. The TX uses a voltage-mode driver with differential termination to reduce the signaling power. The transceiver consumes 3.7 mW at 5 Gb/s. At 0.75 pJ/bit, the power efficiency is among the best to date.

With advanced CMOS technologies offering transistors with cut-off frequencies above 100 GHz and gate delays of around 10 ps, it is now possible for the RX to directly digitize the incoming signal and perform equalization and timing recovery in the digital domain [8]. One of the key challenges, however, is the ADC's power consumption. With a given architecture, an ADC's power consumption is limited by mismatch, which prevents the use of small transistors. In response, Chapter 6 describes a novel background ADC calibration scheme that is suitable for high-speed ADCs and incurs negligible hardware and power overhead. The proposed calibration scheme is implemented in a 50-mW 2.5-GS/s 5-bit flash ADC and its effectiveness is demonstrated with experimental results.

All the reported results are in 0.13-μm bulk CMOS technology. It is expected that the migration to more advanced technologies will lead to even better performance. The proposed techniques should therefore help pave the way toward low-power high-speed serial links to meet the requirements of future high-performance electronic systems.
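The power-budget argument in Section 1.1 and the efficiency figures quoted for Chapters 3 to 5 all rest on one relation: power equals energy per bit times bit rate. The following is a minimal sketch of that arithmetic; the function names are chosen here for illustration and do not come from the Dissertation.

```python
def io_power_w(energy_pj_per_bit, bandwidth_tbps):
    """I/O power in watts.

    (pJ/bit) * (Tb/s) = 1e-12 J/bit * 1e12 bit/s = W, so the scale
    factors cancel and the product is already in watts.
    """
    return energy_pj_per_bit * bandwidth_tbps


def energy_pj_per_bit(power_mw, rate_gbps):
    """Energy per bit in pJ.

    (mW) / (Gb/s) = 1e-3 W / 1e9 bit/s = 1e-12 J/bit.
    """
    return power_mw / rate_gbps


if __name__ == "__main__":
    # 100 Tb/s of aggregate bandwidth at 1 pJ/bit
    print("I/O power at 1 pJ/bit, 100 Tb/s:", io_power_w(1.0, 100.0), "W")

    # Power and data-rate pairs quoted in the text
    for name, p_mw, r_gbps in [
        ("Chapter 3 active link", 3.7, 6.25),
        ("Chapter 4 receiver", 12.4, 4.5),
        ("Chapter 5 transceiver", 3.7, 5.0),
    ]:
        print(name, round(energy_pj_per_bit(p_mw, r_gbps), 2), "pJ/bit")
```

The three computed efficiencies come out at roughly 0.59, 2.76, and 0.74 pJ/bit, close to the quoted 0.6, 2.8, and 0.75 pJ/bit; the small differences presumably reflect rounding of the quoted power figures.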

CHAPTER 2
HIGH-SPEED SERIAL LINK OVERVIEW

2.1 Chapter Overview

Figure 2-1 shows a typical high-speed serial link, which consists of a TX, a channel, and an RX. The TX multiplexes a low-speed parallel bus into a high-speed serial stream and drives it toward the channel. The RX resolves the stream into digital bits with a slicer and de-multiplexes them back to a parallel format. The equalizer (EQ) compensates for the frequency-dependent loss of the channel, and the clock and data recovery (CDR) unit adaptively adjusts the RX clock phase so that the slicer digitizes the incoming stream with enough timing margin.

Figure 2-1. A typical high-speed serial link. (Blocks: TX with MUX and driver (DRV); channel; RX with EQ, CDR, and DMUX.)

To improve the power efficiency of a serial link, the various parts of the link must be well understood. We first examine the channel, with emphasis on transmission line loss because it plays a vital role in determining the link performance. We then introduce some popular equalization techniques to compensate for the channel loss, including FFE, CTLE, and DFE. Clocking, including clock generation and clock recovery, is presented next. We show in this part that injection-locking is an attractive clock-generation technique, and that baud-rate CDR schemes are generally preferred over their over-sampling counterparts. In the end, we relate the signaling power to channel loss, equalization, impedance mismatch, signaling modes, and termination schemes. We demonstrate that DFE usually gives better signaling efficiency than FFE, and that voltage-mode signaling with differential termination reduces the signaling power significantly.

2.2 The Channel

At multi-Gb/s, the channel delay is comparable to or even larger than the bit time, rendering the signaling sensitive to reflections due to impedance mismatch. For this reason, the channel is usually a transmission line with controlled 50-Ω impedance to accommodate measurement equipment, properly terminated at both the TX and the RX. Discontinuities along the channel such as vias, packages, and connectors should all be carefully evaluated and controlled.

However, even a perfectly uniform transmission line with proper termination presents challenges to high-speed signaling. At multi-Gb/s, the channel suffers from two frequency-dependent loss mechanisms, and it is the channel rather than the transistors that limits the total signaling bandwidth. For example, it is shown in [9] that, in theory, an NMOS in 0.8-μm technology is able to resolve a 48-Gb/s binary bit stream. However, the experimental results fall well short of the theoretical prediction due to the channel bottleneck (including the pads and packages).

The first loss mechanism is the conductor resistance. At low frequencies, the current flows evenly through the conductor cross-sectional area. At high frequencies, however, the current tends to follow the path with least inductance, flowing only in a shallow band underneath the conductor surface, a phenomenon known as skin effect, as shown in Figure 2-2(A). The skin depth, the depth at which the current density decays to $e^{-1}$ of that at the surface, is given by [10]

$$\delta = \frac{1}{\sqrt{\pi f \mu \sigma}}$$

where $\delta$ is the skin depth, $f$ is the frequency, $\mu$ is the permeability, and $\sigma$ is the conductivity. Figure 2-2(B) plots the skin depth in copper as a function of frequency. In the GHz range, the skin depth is only on the order of μm.

Figure 2-2. Conductor loss. A) Skin effect. B) Skin depth vs. frequency in copper. (Panel B axes: δ (μm) vs. frequency (GHz).)

The crowding of current to the conductor surface increases the effective resistance at high frequencies. Since the skin depth is inversely proportional to $\sqrt{f}$, the conductor loss (in dB) increases proportionally to $\sqrt{f}$.
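As a quick numerical check of the skin-depth expression, the sketch below evaluates it for copper. The conductivity value (5.8e7 S/m) is a typical handbook figure assumed here, not a number taken from the Dissertation, and copper is treated as non-magnetic.

```python
import math

MU_0 = 4e-7 * math.pi   # permeability of free space in H/m
SIGMA_CU = 5.8e7        # assumed conductivity of copper in S/m


def skin_depth_m(freq_hz, sigma=SIGMA_CU, mu=MU_0):
    """Skin depth in meters: 1 / sqrt(pi * f * mu * sigma)."""
    return 1.0 / math.sqrt(math.pi * freq_hz * mu * sigma)


if __name__ == "__main__":
    for f_ghz in (0.1, 1.0, 4.0, 10.0):
        depth_um = skin_depth_m(f_ghz * 1e9) * 1e6
        print(f"{f_ghz:5.1f} GHz: skin depth = {depth_um:.2f} um")
```

At 1 GHz this gives about 2.1 μm, in line with the statement that the skin depth is on the order of μm in the GHz range, and quadrupling the frequency halves the depth, which is the inverse-square-root dependence behind the conductor loss trend.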

The second loss mechanism is the dielectric dissipation, which originates from the polarization of the molecules in the dielectric material. As illustrated in Figure 2-3, when an alternating electric field is applied to a dielectric material, the molecules rotate to align with the external field and in doing so rub against each other and convert some of the electric energy into heat [11]. Because the molecules rotate every time the field polarity changes, the dielectric loss (in dB) is proportional to frequency and to the loss tangent of the dielectric material [12].

Figure 2-3. Physical mechanism of dielectric loss.

The total loss is the combined effect of the conductor loss $\alpha_R$ and the dielectric loss $\alpha_D$, and can be expressed as

$$\alpha(f) = \alpha_R + \alpha_D = k_R\sqrt{f} + k_D f$$

where $k_R$ and $k_D$ are constants determined by the transmission line construction. Since both $\alpha_R$ and $\alpha_D$ increase with frequency, the channel displays a low-pass profile. Figure 2-4 shows an example channel loss, where $f_{DR}$ is the data rate. The loss at half the data rate, $\alpha(f_{DR}/2)$, is also known as the Nyquist loss. $f_C$ denotes the frequency at which the two loss mechanisms contribute equally and is given by

$$f_C = \left(\frac{k_R}{k_D}\right)^2$$

For a differential 100-Ω 8-mil 0.5-oz microstrip line on FR4, $f_C$ is around 2 GHz. For high-quality cables, $f_C$ may be much higher. For example, a 50-Ω RG-58 cable with polyethylene dielectric material may have an $f_C$ of around 100 GHz.

Figure 2-4. Channel loss. (Axes: $\alpha(f)$ vs. frequency; the Nyquist loss, $f_C$, and $f_{DR}$ are marked.)

In the time domain, this low-pass characteristic can be captured by the channel's single-bit response (SBR). Figure 2-5 shows a sample SBR, where $h_{ch}(0)$ is the main cursor, those with negative index are pre-cursors, and those with positive index are post-cursors. It can be seen that due to the limited channel bandwidth, a single bit spans more than one UI and interferes with neighboring bits, a phenomenon known as inter-symbol-interference (ISI).

To evaluate the impact of channel loss on the link performance, it is desirable to establish a relationship between the Nyquist loss and the SBR. However, since the Nyquist loss does not completely characterize the channel, an exact mapping between the Nyquist loss and the SBR is not possible. Figure 2-6 shows the main cursor amplitude at different Nyquist losses. Depending on the relationship between $f_C$ and $f_{DR}$, $\alpha_R$ and $\alpha_D$ may have varying significance, and channels with the same Nyquist loss may have different SBRs. Without loss of generality, the discussion in this chapter considers a single representative case.
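The two-term loss model described above (a conductor term growing with the square root of frequency plus a dielectric term growing linearly) can be sketched in a few lines. The coefficient names `k_r` and `k_d` and the example values are assumptions made for illustration; they are chosen so that the crossover falls at 2 GHz, the figure quoted for the FR4 microstrip, with an arbitrary 3 dB from each mechanism at that frequency.

```python
import math


def channel_loss_db(f_hz, k_r, k_d):
    """Total loss in dB: conductor term k_r*sqrt(f) plus dielectric term k_d*f."""
    return k_r * math.sqrt(f_hz) + k_d * f_hz


def crossover_freq_hz(k_r, k_d):
    """Frequency where both terms are equal: k_r*sqrt(f) = k_d*f gives f = (k_r/k_d)^2."""
    return (k_r / k_d) ** 2


def nyquist_loss_db(data_rate_bps, k_r, k_d):
    """Loss evaluated at half the data rate."""
    return channel_loss_db(data_rate_bps / 2.0, k_r, k_d)


if __name__ == "__main__":
    # Hypothetical coefficients: 3 dB from each mechanism at 2 GHz
    f_c = 2e9
    k_r = 3.0 / math.sqrt(f_c)
    k_d = 3.0 / f_c

    print("crossover (GHz):", crossover_freq_hz(k_r, k_d) / 1e9)
    for rate_gbps in (2.5, 5.0, 10.0):
        loss = nyquist_loss_db(rate_gbps * 1e9, k_r, k_d)
        print(f"{rate_gbps:4.1f} Gb/s: Nyquist loss = {loss:.2f} dB")
```

Below the crossover the square-root term dominates and above it the linear term does, which is why dielectric loss becomes the limiting factor as data rates rise.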

Figure 2-5. A sample SBR. (Axes: voltage (V) vs. time (UI); cursors $h_{ch}(-1)$, $h_{ch}(0)$, $h_{ch}(1)$, $h_{ch}(2)$, and $h_{ch}(3)$ are marked.)

Figure 2-6. Main cursor vs. Nyquist loss. (Axes: $h_{ch}(0)$ vs. Nyquist loss (dB).)

2.3 Equalization

Figure 2-7 shows the simulated eye diagrams for channels with 6-, 12-, and 18-dB Nyquist losses. The channel loss degrades both the voltage and timing margins seen by the RX. When the Nyquist loss is about 12 dB, the eye completely closes. To extend the bandwidth of the channel, equalization is often employed in high-speed serial links. This section reviews some of the most popular techniques.

Figure 2-7. Eye degradation due to channel loss. (Three panels, one per Nyquist loss; axes: voltage (V) vs. time (UI).)
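The link between the SBR and eye closure can be made concrete with the standard worst-case (peak-distortion) estimate: for NRZ data with symbols of ±1, the worst-case sample for a transmitted +1 is the main cursor minus the sum of the magnitudes of all other cursors. The sketch below applies that estimate to two hypothetical SBRs; the cursor values are illustrative assumptions, not data from the Dissertation.

```python
def worst_case_margin(sbr, main_index):
    """One-sided worst-case voltage margin for +/-1 NRZ symbols.

    The main cursor carries the wanted bit; every other cursor can add
    or subtract depending on neighboring bits, so the worst case
    subtracts the magnitude of each of them. A result <= 0 means the
    eye is closed.
    """
    main = sbr[main_index]
    isi = sum(abs(h) for i, h in enumerate(sbr) if i != main_index)
    return main - isi


if __name__ == "__main__":
    # Hypothetical cursors, listed as [pre-cursor, main, post-cursors...]
    mild_channel = [0.05, 0.80, 0.10, 0.03]
    lossy_channel = [0.08, 0.40, 0.25, 0.12, 0.05]

    for name, sbr in [("mild", mild_channel), ("lossy", lossy_channel)]:
        margin = worst_case_margin(sbr, main_index=1)
        state = "open" if margin > 0 else "closed"
        print(f"{name}: margin = {margin:+.2f} V, eye {state}")
```

In the lossy case the main cursor has shrunk while the post-cursors have grown, so the residual ISI exceeds the main cursor and the eye closes, which is the situation equalization is meant to repair.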

2.3.1 FFE

Since the ISI originates from the channel's low-pass characteristic, it is possible to reverse it with a linear high-pass filter. One way of doing this is through a discrete-time FIR filter [13] [14] at the TX or RX, of which TX feedforward equalization (FFE) is the most popular, as Figure 2-8(A) shows. By adjusting the tap weights, a relatively flat composite frequency response can be obtained, as shown in Figure 2-8(B).

Figure 2-8. FFE. A) Block diagram. B) Working principle. (Panel B: gain (dB) vs. frequency for the channel, the FFE, and the composite response.)

Although more drivers are used for FFE, their total size is the same as that of the driver without FFE if the same peak gain is maintained. The electronic power overhead of FFE stems mainly from the additional flip-flops and the associated wiring.
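To illustrate how FIR tap weights reshape the pulse response, the sketch below applies a 2-tap FFE to a hypothetical SBR. The cursor values, the tap ratio, and the normalization that keeps the sum of tap magnitudes at 1 (standing in for a fixed peak TX swing) are all assumptions made for this example.

```python
def convolve(a, b):
    """Discrete convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def normalize_taps(taps):
    """Scale taps so the sum of their magnitudes is 1 (fixed peak swing)."""
    total = sum(abs(t) for t in taps)
    return [t / total for t in taps]


def worst_case_margin(response, main_index):
    """Main cursor minus the summed magnitude of all other cursors."""
    isi = sum(abs(h) for i, h in enumerate(response) if i != main_index)
    return response[main_index] - isi


if __name__ == "__main__":
    sbr = [0.6, 0.3, 0.1]                 # hypothetical: main cursor, two post-cursors
    taps = normalize_taps([1.0, -0.5])    # main tap plus one negative post tap

    equalized = convolve(sbr, taps)
    print("equalized response:", [round(h, 3) for h in equalized])
    print("margin without FFE:", round(worst_case_margin(sbr, 0), 3))
    print("margin with FFE:   ", round(worst_case_margin(equalized, 0), 3))
```

The negative post tap cancels the first post-cursor at the cost of a smaller main cursor, yet the worst-case margin still improves, which illustrates why FFE is often described as trading signal amplitude for reduced ISI.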

2.3.2 CTLE

Another linear equalization technique is the continuous-time linear equalizer (CTLE) [6] [7]. Figure 2-9(A) and (B) show the schematic and transfer function of such a CTLE. The transfer function has two poles and one zero. The product of the gain, the peaking factor, and the bandwidth is constrained [15], which means that the performance of the CTLE is limited by the cut-off frequency of the technology.

Figure 2-9. CTLE. A) Circuit detail. B) Frequency response. (Panel B axes: gain vs. frequency.)

Due to the high bandwidth and linearity requirements, a CTLE tends to be power hungry. For example, implemented in 90-nm CMOS, the CTLE in [6] provides 8.7-dB peaking and accounts for 27% of the total RX power at 6.25 Gb/s. For a link implemented in 65-nm CMOS, the CTLE provides 7.5-dB peaking and represents 38% of the RX power [7].

2.3.3 DFE

Besides the linear equalizers discussed above, a non-linear equalization technique, known as decision-feedback equalization (DFE), has found interest in recent high-speed serial links [16] [17] [18]. A 1-tap DFE is depicted in Figure 2-10(A). It works by directly removing the ISI of the previous bit from the current analog sample. Another way of viewing it is that the DFE adjusts the slicer threshold depending on the previous bit. The power overhead of the DFE shown in Figure 2-10(A) consists mainly of the summer.

The feedback path in Figure 2-10(A) must settle within one UI, a difficult design challenge at high data rates. To relax this stringent timing requirement, speculative DFE can be used, where possible results are pre-computed and then selected by the previous bits [19], as shown in Figure 2-10(B). The power overhead of speculative DFE consists of the additional slicers.
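The equivalence between the direct 1-tap DFE and its speculative form can be shown with a short behavioral model. This is a sketch under simplifying assumptions (ideal slicers, ±1 symbols, a known first-post-cursor weight `h1`, and a known starting bit); the cursor values are hypothetical and were picked so that a plain slicer without feedback fails.

```python
def plain_slicer(samples):
    """Slicer with a fixed zero threshold and no equalization."""
    return [1 if y >= 0 else -1 for y in samples]


def dfe_direct(samples, h1, first_prev=1):
    """1-tap DFE: subtract the previous bit's ISI, then slice."""
    prev, out = first_prev, []
    for y in samples:
        d = 1 if (y - h1 * prev) >= 0 else -1
        out.append(d)
        prev = d
    return out


def dfe_speculative(samples, h1, first_prev=1):
    """Speculative 1-tap DFE: two slicers pre-compute both outcomes,
    and the previous decision only selects between them."""
    prev, out = first_prev, []
    for y in samples:
        d_if_prev_pos = 1 if y >= +h1 else -1   # slicer with threshold +h1
        d_if_prev_neg = 1 if y >= -h1 else -1   # slicer with threshold -h1
        d = d_if_prev_pos if prev == 1 else d_if_prev_neg
        out.append(d)
        prev = d
    return out


if __name__ == "__main__":
    h0, h1 = 0.4, 0.5                     # hypothetical main and first post-cursor
    bits = [1, -1, -1, 1, -1, 1, 1, -1]
    history = [1] + bits[:-1]             # bit before the first one assumed to be +1
    samples = [h0 * b + h1 * p for b, p in zip(bits, history)]

    print("sent:       ", bits)
    print("plain:      ", plain_slicer(samples))
    print("direct DFE: ", dfe_direct(samples, h1))
    print("speculative:", dfe_speculative(samples, h1))
```

Both DFE forms recover the transmitted bits while the plain slicer does not. In the speculative form the summation leaves the feedback loop and only a selection remains, at the cost of a second slicer, which matches the overhead described in the text.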

Figure 2-10. DFE block diagrams. A) Conventional DFE. B) Speculative DFE.

2.4 Clocking

At multi-Gb/s, both the timing offset and uncertainty must be well controlled, and clocking, including clock generation and clock recovery, may constitute a significant or even dominant portion of the total link power [6] [20]. This section looks at both clock generation and clock recovery, and identifies ways to reduce the clocking power.

Clock Generation

Clock generation in high-speed serial links is usually done with a PLL or a DLL. Figure 2-11(A) depicts a PLL block diagram, which consists of a phase detector (PD), a low-pass loop filter (LPF), a voltage-controlled oscillator (VCO), and an optional divider. At steady state, the negative feedback loop ensures that the VCO output phase is aligned with that of the reference clock. A DLL block diagram is shown in Figure 2-11(B), where the VCO in a PLL is replaced with a voltage-controlled delay line (VCDL). Under locked condition, the delay

of the VCDL is equal to one reference clock cycle. Compared to a PLL, a DLL is usually easier to design because the loop is of first order. While the cores of a PLL and a DLL are the VCO and VCDL, the other loop components may consume significant power. For example, in [6], the VCO consumes only 12% of the total PLL power. Besides, the PD and loop filter also occupy considerable area.

Figure 2-11. Block diagrams of a PLL and a DLL. A) PLL. B) DLL.

Another clock generation technique that is found in some recent serial links is the injection-locked oscillator [21] [22]. Figure 2-12 depicts the block diagram of an injection-locked 5-stage ring oscillator. In the absence of injection signal, each stage of the oscillator contributes a delay of, resulting in a free-running frequency of When a clock with frequency is injected to one of the nodes, the delay of the injected stage changes by and at rising and falling edges respectively.

Designating, under locked condition, the oscillation is sustained at, and the following equation holds:

Injection-locking a ring oscillator to a clean reference clock can dramatically improve its noise performance because periodic correction by the injected clock prevents jitter from accumulating indefinitely [23]. This can be observed in the frequency domain as a reduction in the phase noise with injection-locking, as illustrated in Figure 2-13.

Compared to a PLL or a DLL, an injection-locked oscillator avoids the power and area overhead of the PD, the LPF and the dividers, while still offering good jitter performance [24] [23] [25]. Besides, since no feedback loop is involved, injection-locking-based clock generation does not have the stability issues of a PLL or DLL.

Figure 2-12. Block diagram of an injection-locked 5-stage ring oscillator.

Figure 2-13. Simulated phase noise suppression with injection-locking (phase noise in dBc versus offset frequency in Hz, with and without injection).

Clock Recovery

A clock recovery unit is essentially a feedback system consisting of three basic blocks, namely a phase detector (PD), a phase shifter or rotator, and a loop filter, as shown in Figure 2-14. The PD determines whether the sampling clock is too early or too late. The early/late information, after being processed by the loop filter, is used to steer the phase shifter or rotator toward the desired position.

Figure 2-14. CDR block diagram.

Various architectures exist for clock recovery [26]. The PD can be either linear [27] or non-linear [28], with the former giving both the direction and magnitude of the phase deviation, while the latter gives only the direction. In high-speed serial links, the non-linear PD is more popular because it does not require processing of narrow pulses [29]. The loop filter can be analog [30], digital [31], or hybrid [32]. The phase shifter or rotator can be implemented with an oscillator, a delay line, or a phase interpolator (PI), among others.
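A non-linear PD reports only the direction of the phase error. As a minimal sketch, the early/late decision of an Alexander-type PD [28] can be written in terms of two consecutive data samples (D0, D1) and the edge sample (E0) taken between them. The function names, the string return values, and the sign convention of the vote counter are assumptions made for illustration.

```python
def alexander_pd(d0, e0, d1):
    """Early/late decision from two data samples and the edge sample between them.

    Returns "early" when the edge sample still equals the old bit,
    "late" when it already equals the new bit, and "none" otherwise
    (for example when there is no data transition).
    """
    if d0 == e0 and e0 != d1:
        return "early"
    if d0 != e0 and e0 == d1:
        return "late"
    return "none"


def phase_votes(data, edges):
    """Sum the decisions over a run of samples: +1 per "late", -1 per "early".

    `data` holds n+1 data samples and `edges` the n edge samples between them.
    A loop filter would process this kind of early/late information.
    """
    total = 0
    for i, e in enumerate(edges):
        decision = alexander_pd(data[i], e, data[i + 1])
        total += {"early": -1, "late": 1, "none": 0}[decision]
    return total


print(alexander_pd(0, 0, 1), alexander_pd(0, 1, 1), alexander_pd(1, 1, 1))
print(phase_votes([0, 1, 1, 0], [0, 1, 1]))
```

Only data transitions produce a decision, so the effective update rate of such a PD depends on the transition density of the data.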

Non-linear phase detection is usually achieved via oversampling. Figure 2-15(A) shows the block diagram of an Alexander PD [28]. The input signal is sliced twice in each UI, once at the eye center (data) and once at the eye boundary (edge). Whenever a data transition is detected, the edge sample in between is compared with the two data samples to determine whether the sampling clock is too early or too late, as illustrated in Figure 2-15(B). Assuming the clock phases are evenly spaced, at locked condition, the data-sampling phase is automatically placed at the eye center.

Figure 2-15. Block diagram and principle of Alexander PD. A) Block diagram. B) Decision rule: (D0 = E0 && E0 != D1) CK too early; (D0 != E0 && E0 = D1) CK too late; (D0 = E0 && E0 = D1) no transition.

The power overhead of oversampling CDR consists of the additional slicers and clocking circuitry. While the additional slicers may be disabled to reduce their power consumption if a low CDR bandwidth is acceptable [6], it is still necessary to generate the extra clock phases. Moreover, since oversampling requires timing resolution better

than the bit time, the clocking power overhead is more than it appears because doubling the timing resolution requires more than doubling the clocking power. This can be observed in Figure 2-16, which shows the delay and energy of an inverter in a 0.13-μm CMOS technology. For this reason, baud-rate CDR is preferred to reduce clocking power.

Figure 2-16. Simulated performance of an inverter in a 0.13-μm CMOS technology versus V_DD (V). A) Delay t_pd (ps). B) Energy per cycle (fJ).

2.5 Signaling

In a high-speed serial link, the TX driver needs to produce a large enough voltage swing over the low channel impedance. The power consumed by the TX driver, also known as the signaling power, may constitute a significant portion of the total link power. For instance, in [7], nearly 40% of the link power is consumed by the TX driver.

To improve the power efficiency of the whole link, it is imperative to gain insight into the various factors that affect the signaling power.

Signaling Efficiency

Figure 2-17 shows a typical frontend found in high-speed links [17]. The analysis in this section assumes that the DC loss of the channel is negligible. Without DC loss, the signal swings at the TX and RX are the same, as shown in Figure 2-17. For the ideal case with a lossless channel and perfect termination, the eye opening is the same as the signal swing and the signaling power is

Figure 2-17. A typical link frontend (Z_TX = Z_0, Z_RX = Z_0).

Factors such as channel loss, equalization, termination, and signaling modes cause the eye opening to deviate from the signal swing. If we define the signaling efficiency as

the signaling power now becomes

By studying the relationship between the signaling efficiency and the various factors such as channel loss, equalization, termination, and signaling mode, their impacts on the signaling power can be understood.

Effects of Channel Loss

With the SBR given, the worst-case eye opening can be found using the peak-distortion technique [33], and is calculated to be

For a uniform channel with perfect matching, all the cursors are positive. Since the DC loss is negligible, i.e.

Equation 2-9 can be simplified to

Figure 2-18. Main cursor amplitude and signaling power penalty vs. channel loss (Nyquist loss in dB).

Figure 2-18 shows the simulated amplitudes of the main cursor as a function of the channel Nyquist loss. Assuming the post-cursors are completely removed by DFE, the main cursor amplitude equals the RX eye opening. The signaling power penalty of the channel loss is therefore calculated accordingly and is also plotted in Figure 2-18. It

can be seen that when the Nyquist loss exceeds about 9 dB, 50% more signaling power is needed to restore the eye opening seen by the RX slicers.

Besides mandating more signaling power, higher channel loss also necessitates more equalization and induces a power penalty for the associated signal processing. This is explained with the help of Figure 2-19, which shows the amplitudes of the first three post-cursors normalized to the main cursor. Generally speaking, with increasing channel loss, the post-cursors become more and more significant compared to the main cursor. Specifically, when the Nyquist loss is 9 dB, the second post-cursor is around 10% of the main cursor. While 1-tap DFE may be enough when the Nyquist loss is less than about 6 to 9 dB, extra DFE taps are desired beyond that, incurring a power penalty for the extra latches and related circuitry.

Figure 2-19. Post-cursor amplitudes h(1), h(2) and h(3) vs. channel loss (Nyquist loss in dB), all normalized to h(0).

Figure 2-20 plots the signaling power for different channel losses. When the Nyquist loss goes beyond 9 dB, the eye opening quickly degrades, and error-free signaling without equalization becomes impractical or even impossible near 12 dB.
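The peak-distortion reasoning used in this section can be illustrated numerically. The sketch below is a simplified illustration rather than the exact expressions of this chapter. It assumes an SBR normalized to the signal swing, takes the worst-case eye opening as the main cursor minus the sum of the magnitudes of all uncancelled cursors, and assumes the signaling power scales inversely with the eye opening. The SBR values and the function names are hypothetical.

```python
def worst_case_eye(sbr, main_index=0, dfe_taps=0):
    """Peak-distortion eye opening: main cursor minus the remaining |ISI|.

    The first `dfe_taps` post-cursors are assumed to be removed
    perfectly by an ideal DFE (no error propagation).
    """
    main = sbr[main_index]
    isi = 0.0
    for k, h in enumerate(sbr):
        if k == main_index:
            continue
        cancelled = main_index < k <= main_index + dfe_taps
        if not cancelled:
            isi += abs(h)
    return main - isi


def relative_signaling_power(eye):
    """Signaling power needed to restore a unit eye, relative to the ideal case.

    Assumes power scales inversely with the normalized eye opening.
    """
    return 1.0 / eye


# Hypothetical SBR with all-positive cursors that sum to 1 (negligible DC loss).
sbr = [0.6, 0.25, 0.1, 0.05]
for taps in (0, 1, 3):
    eye = worst_case_eye(sbr, dfe_taps=taps)
    print(f"{taps} DFE tap(s): eye = {eye:.2f}, "
          f"relative power = {relative_signaling_power(eye):.2f}")
```

With this hypothetical SBR the eye grows from about 0.2 without equalization to 0.45 with one DFE tap, and to the main-cursor value of 0.6 once all post-cursors are removed.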

Figure 2-20. The effects of channel loss and equalization on the signaling power (normalized P_SIG versus Nyquist loss in dB; without equalization, with FFE, and with DFE).

Effects of FFE and DFE

To facilitate signaling over lossy channels, equalization is often employed in high-speed serial links. The impact on the signaling power depends on the specific equalization scheme.

The FFE operates with an FIR filter in cascade with the channel. With proper tap weights, the FIR filter inverts the channel response so that the composite frequency response is flat up to the Nyquist frequency, i.e.

The peak gain of the FIR filter occurs at the Nyquist frequency, and is kept at unity for fair comparison, i.e.

Equation 2-12 can then be simplified to

The signaling efficiency with FFE is then given by [34]

The DFE, on the other hand, directly removes the ISI of the previous bits and is better understood in the time domain. In the absence of detection errors (no error propagation), the DFE can be analyzed in a linear fashion and the composite SBR is

The signaling efficiency with DFE is then given by

The normalized signaling power with FFE and DFE is also plotted in Figure 2-20. While both FFE and DFE extend the achievable data rate, DFE always yields the lowest signaling power. For example, when the Nyquist loss is 9 dB, the signaling power with DFE is 40% lower than that with FFE.

Figure 2-21. Effects of FFE and DFE in the frequency domain (attenuation in dB versus frequency; without equalization, with DFE, and with FFE).

Intuitively, this benefit of DFE stems from the fact that DFE boosts the high-frequency component [16]. This is in contrast to FFE, which merely attenuates the low-frequency component of the signal so that the high- and low-frequency components have the same amplitude when arriving at the RX. This is shown in Figure 2-21, which compares the composite frequency responses with FFE and DFE of a hypothetical channel which has an SBR of [0.8, 0.2].

Effects of Back Termination

As shown in Figure 2-17, a typical link has termination at both the TX and RX. Although the TX back termination helps mitigate reflections, it reduces the signal swing by 50%, which must be compensated for by doubling the signaling power. Note,

however, that this back termination is not necessary if the channel is relatively uniform and good impedance matching is ensured at the RX. With the back termination removed and assuming perfect RX matching, the signaling power now becomes

Comparing Equation 2-17 to Equation 2-7, removing the back termination reduces the signaling power by half because it doubles the impedance seen by the TX driver [35]. However, without the damping of the back termination, reflections due to RX impedance mismatch may make multiple trips along the channel before dying out. The resulting degradation of the eye opening must be evaluated.

The effect of RX impedance mismatch can be studied with the help of the lattice diagram [36], as shown in Figure 2-22, where and are the reflection coefficients at the TX and RX respectively. When a pulse first arrives at the RX, the transmitted pulse is given by

The reflected pulse travels back and gets fully reflected at the TX. When it arrives again at the RX, the transmitted pulse is

where denotes convolution. Since the channel DC loss is negligible, the worst-case eye opening degradation due to the first reflected pulse is

Similarly, the degradation due to the n-th reflection is

The total effect is obtained by taking the sum and is

The signaling efficiency without back termination is therefore

where the factor 2 accounts for the amplitude doubling due to the removal of the back termination.

Figure 2-22. Lattice diagram for reflection calculation.

Figure 2-23 depicts the eye opening improvement with the back termination removed as a function of RX impedance mismatch. With 9-dB Nyquist loss and 10% impedance mismatch, the signaling power is reduced by nearly 40%.

Figure 2-23. Eye opening vs. RX mismatch (normalized eye opening for Nyquist losses of 9 dB and 12 dB, RX mismatch from -40% to 40%).

Also plotted in Figure 2-23 is the effect of RX mismatch when the Nyquist loss is 12 dB. The eye degradation becomes more sensitive to RX mismatch without back

termination when the channel loss increases. Intuitively, this is because the main cursor decreases with increasing channel loss, while the reflection remains the same as long as the DC loss is negligible.

Note that the above discussion assumes negligible DC loss of the channel. If the channel has substantial DC loss, the reflections may be heavily attenuated and good termination may not be required at either TX or RX [37].

Effects of Signaling and Termination Modes

The above discussion considers exclusively current-mode signaling. However, both current-mode (CM) [16] [20] and voltage-mode (VM) [6] [38] [39] signaling have been used for high-speed serial links. Besides, the termination may be single-ended or differential. Their signaling powers are analyzed below.

Figure 2-24(A) shows the schematic of a current-mode frontend with single-ended termination. The differential pair works in the saturation region and steers the tail current to either branch according to the bit being transmitted. The voltage levels at the TX outputs are

The voltage swing and the signaling power are therefore

When the termination is differential, as shown in Figure 2-24(B), the voltage levels become

while the single-ended voltage swing and the signaling power are the same as with single-ended termination.

Figure 2-24. CM signaling. A) Single-ended termination. B) Differential termination.

Figure 2-25 shows the schematic for VM signaling. The transistors work in the linear region and connect the outputs to either voltage rail according to the bit being transmitted. Termination is provided by series resistance, either the on-resistance of the transistors or explicit resistors in series with the transistors. With single-ended termination, the voltage levels at the TX outputs are

The single-ended voltage swing and the signaling power are

For the case of differential termination, the voltage levels become

The single-ended voltage swing and the signaling power now are

It can be seen that using differential termination reduces the signaling power by 50% for VM signaling.

Figure 2-25. VM signaling. A) Single-ended termination (R = Z_0). B) Differential termination (series R = Z_0, RX termination 2R).
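The relative signaling powers of the CM and VM frontends can be checked with a simple DC analysis. The sketch below is an illustration under stated assumptions, not a reproduction of this chapter's equations. All terminations equal Z_0, the single-ended terminations of the VM driver return to ground, every driver is fed from the same supply V_DD (for the VM driver through an ideal linear regulator that generates V_DRV), and the single-ended swing at the RX is the same in all cases. The numerical values and function names are hypothetical.

```python
def supply_current(mode, v_swing, z0):
    """Supply current needed for a given single-ended swing v_swing."""
    if mode == "CM":
        # Tail current I sees the TX and RX terminations in parallel (Z0/2):
        # v_swing = I * Z0 / 2. The same swing holds for differential termination.
        return 2.0 * v_swing / z0
    if mode == "VM-SE":
        # V_DRV = 2 * v_swing. Only the line driven high conducts,
        # through Z0 (driver) + Z0 (termination to ground).
        return v_swing / z0
    if mode == "VM-Diff":
        # V_DRV = 2 * v_swing across Z0 + 2*Z0 + Z0 in series.
        return v_swing / (2.0 * z0)
    raise ValueError(f"unknown mode: {mode}")


def signaling_power(mode, v_swing, z0, vdd):
    """Power drawn from V_DD. A linear regulator passes the load current unchanged."""
    return vdd * supply_current(mode, v_swing, z0)


V_SWING, Z0, VDD = 0.1, 50.0, 1.2   # hypothetical: 100 mV swing, 50 ohm, 1.2 V
p_cm = signaling_power("CM", V_SWING, Z0, VDD)
for mode in ("CM", "VM-SE", "VM-Diff"):
    p = signaling_power(mode, V_SWING, Z0, VDD)
    print(f"{mode:8s} {p * 1e3:.2f} mW ({100 * p / p_cm:.0f}% of CM)")
```

Under these assumptions the differential termination halves the VM signaling power, and the regulated VM driver with differential termination draws one quarter of the CM power.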

Table 2-1 summarizes the performance of current-mode and voltage-mode drivers with single-ended and differential terminations. It can be seen that even with a linear regulator to generate V_DRV, VM signaling with differential termination consumes only 25% of the CM signaling power.

Table 2-1. Summary of signaling and termination modes
Mode    CM    CM      VM    VM
Term.   SE    Diff.   SE    Diff.

2.6 Summary

Various factors come into play when one tries to improve the power efficiency of a high-speed serial link, with the channel posing the most difficult challenge. At multi-Gb/s, conductor loss and dielectric loss limit the channel bandwidth and cause temporal spreading of the transmitted pulses. To compensate for the resulting ISI, high-speed serial links usually employ equalization such as FFE, CTLE and DFE, with each involving a different level of complexity.

Clocking, including clock generation and clock recovery, is challenging at high data rates and sometimes may dominate the total link power budget. Conventional

solutions such as PLL and DLL entail considerable area and power overhead due to the PD and LPF. Injection-locking-based clock generation, on the other hand, is a promising technique because it avoids such overhead while still featuring low jitter. To reduce the clocking power, baud-rate CDR is preferred over its oversampling counterpart, such as the Alexander-type CDR, which has found popular use in recent high-speed serial links.

Due to the low channel impedance, the signaling power, i.e. the power dissipated by the TX driver, accounts for a considerable percentage of the link power. Using the peak-distortion technique and the concept of signaling efficiency, this chapter shows the attractiveness of DFE and VM signaling with differential termination. It is also shown that with moderate channel loss and reasonable termination tolerance, back termination can be removed to further reduce the signaling power.

The rest of this dissertation reports a few TX and RX implementations that embody the analysis results presented in this chapter. Their usefulness is demonstrated with experimental results.

CHAPTER 3
AN ACTIVE LINK WITH AIR-CAVITY TRANSMISSION LINES

3.1 Chapter Overview

As discussed in Chapter 2, the bandwidth of transmission lines is limited primarily by conductor loss and dielectric loss. Because the conductor loss is proportional to the square root of frequency while the dielectric loss is proportional to frequency [36], the latter mechanism dominates at high frequencies. For conventional dielectric materials such as FR4, the dielectric loss significantly degrades the channel bandwidth for multi-Gb/s signaling. While resorting to materials with low loss tangents or even optics is possible, such solutions incur significant cost overhead.

Figure 3-1(A) shows the cross-section of a conventional microstrip on FR4. Since the field of a microstrip transmission line resides in both the air and the FR4, the effective dielectric constant lies somewhere between that of air and that of FR4. The extent to which the FR4 dominates is characterized by a so-called filling factor [40], which satisfies

The effective loss tangent can also be related to the filling factor by [40]

Because the dielectric loss is determined by both the effective dielectric constant and the effective loss tangent through [12]

reduction of the filling factor will reduce the dielectric loss.

Intuitively, since most of the field energy is confined between the signal lines and the ground plane, if we can somehow fill the space between them with air, the filling

factor will be reduced. This can be done by employing the air-cavity microstrip structure (also known as inverted microstrip [41]), as shown in Figure 3-1(B).

Air-cavity microstrips can be formed by selectively post-processing the FR4 boards for high-speed interconnects. This avoids the cost overhead associated with expensive substrate materials for non-critical signals.

Figure 3-1. Cross-sections of microstrips. A) Conventional. B) Air-cavity.

Figure 3-2 shows the simulated of conventional and air-cavity differential microstrips, with the conductor thickness kept at 5 µm. The calculated filling factor is shown in Figure 3-3. It can be seen that air-cavity microstrip has lower and, and that when, is reduced by 30% by employing the air-cavity. According to Equation 3-7, such reductions translate to an improvement of 36% of, as shown in Figure 3-4. It should be noted that not only is the air-cavity structure attractive for low loss, it also features lower latency for the same channel length because is reduced.

Encouraged by these results, this chapter presents the design and fabrication of air-cavity transmission lines, and their use in an active link. The active link features a

current-sharing frontend and speculative DFE to reduce the signaling power. Back termination at the TX is also removed for further power saving. Experimental results confirm that the dielectric loss is reduced by 26% by the air-cavity structure. Operating at 6.25 Gb/s, the link consumes 3.7 mW, yielding a 0.6 pJ/bit power efficiency.

Figure 3-2. Simulated of conventional and air-cavity microstrip (versus W/H).

Figure 3-3. Simulated filling factor of conventional and air-cavity microstrip (versus W/H).

Figure 3-4. Simulated dielectric loss of conventional and air-cavity microstrip (α_d in dB/cm versus W/H).
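The quoted power efficiency follows directly from the link power and the data rate. As a worked check (the symbols E_bit and P_link are introduced here only for illustration):

```latex
\[
E_{\mathrm{bit}} \;=\; \frac{P_{\mathrm{link}}}{f_{\mathrm{DR}}}
\;=\; \frac{3.7\ \mathrm{mW}}{6.25\ \mathrm{Gb/s}}
\;\approx\; 0.59\ \mathrm{pJ/bit} \;\approx\; 0.6\ \mathrm{pJ/bit}
\]
```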

3.2 Transmission Line Design

The main design parameters of the proposed air-cavity structure include the signal line width W and spacing S, the conductor thickness t, and the height of the air-cavity H. The design goals include 100 Ω differential impedance, low loss and high density. Considering the process capability, the conductor thickness is chosen to be 5 μm. For simplicity, the signal line width W and spacing S are assumed to be the same. A meandered transmission line length of 20 cm is used as a representative channel length for chip-to-chip interconnects [42]. The channel loss is evaluated at 5 GHz with a target of 10 dB or less, or equivalently an attenuation constant of 0.5 dB/cm at this frequency.

The transmission line is simulated in a 3D electromagnetic simulator. Figure 3-5(A) shows the picture of the 3D model. To reduce the requirement on computation resources, a short line of 1 cm is simulated. The obtained S-parameters are then cascaded to get the characteristics of longer lines.

Figure 3-5(B) shows the simulated air-cavity loss performance at 5 GHz at various signal line widths. While the conductor loss decreases with increasing conductor size due to the larger effective conducting surface area, the dielectric loss stays relatively constant since it is primarily determined by the material properties. From a loss reduction perspective, it is desirable to use as large a W as possible. However, to achieve the desired impedance, a proper W/H ratio must be maintained. The fabrication process limits the air-cavity height to about 20 μm. Accordingly, the final W is chosen to be 40 μm, which gives an 8 dB total loss for a 20 cm channel at 5 GHz. The transmission line dimensions are listed in Table 3-1.
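The cascading step can be sketched for a single-ended two-port at one frequency point by converting the S-parameters of the 1-cm segment to transfer (T) parameters, multiplying, and converting back. This is an illustrative sketch, not the tool flow used in this work. The segment is idealized as perfectly matched, its 0.4 dB/cm loss is taken from the 8 dB over 20 cm quoted above, and the phase per segment and all function names are hypothetical.

```python
import cmath
import math


def s_to_t(s):
    """S-parameters to T-parameters, with [b1, a1] = T [a2, b2]."""
    (s11, s12), (s21, s22) = s
    det = s11 * s22 - s12 * s21
    return ((-det / s21, s11 / s21), (-s22 / s21, 1 / s21))


def t_to_s(t):
    """Inverse of s_to_t."""
    (t11, t12), (t21, t22) = t
    det = t11 * t22 - t12 * t21
    return ((t12 / t22, det / t22), (1 / t22, -t21 / t22))


def matmul2(a, b):
    """Product of two 2x2 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def cascade(s_unit, n):
    """S-parameters of n identical segments connected in series."""
    t_unit = s_to_t(s_unit)
    t_total = ((1, 0), (0, 1))
    for _ in range(n):
        t_total = matmul2(t_total, t_unit)
    return t_to_s(t_total)


def insertion_loss_db(s):
    """Insertion loss in dB from S21."""
    return -20 * math.log10(abs(s[1][0]))


def matched_segment(loss_db, phase_rad):
    """Idealized, perfectly matched line segment (an assumption for this sketch)."""
    s21 = 10 ** (-loss_db / 20) * cmath.exp(-1j * phase_rad)
    return ((0j, s21), (s21, 0j))


segment = matched_segment(loss_db=0.4, phase_rad=0.3)   # hypothetical 1-cm segment
channel = cascade(segment, 20)                           # 20-cm line
print(f"{insertion_loss_db(channel):.2f} dB")
```

For a matched segment the result reduces to adding the loss in dB per segment. With simulated S-parameters that include a finite S11 and S22, the same T-matrix product also accounts for the interaction of reflections between segments.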


Figure 3-5. Picture of the 3D model and simulated loss at various line widths. A) 3D model (ground, signal P, signal N, FR4). B) Conductor loss, dielectric loss and total loss (S21 in dB/cm) versus width (μm).

Table 3-1. Final air-cavity microstrip dimensions
W        S        t       H
40 µm    40 µm    5 µm    19 µm

Figure 3-6 compares the dielectric loss in the proposed air-cavity transmission line and the conventional FR4-based microstrip transmission line (in dB/cm) with the same conductor width and spacing. The air-cavity structure reduces the dielectric loss by around 26%.

Figure 3-6. Simulated dielectric loss of air-cavity and conventional transmission lines (versus frequency in GHz).

The effective dielectric constants are calculated from the simulated phase characteristic. The air-cavity structure reduces the effective dielectric constant by 25% from 2.75.


Figure 3-7 compares the simulated losses of conventional and air-cavity transmission lines. The loss of the air-cavity transmission line is 0.25 dB/cm at GHz and is 8% less than that of the conventional structure. Figure 3-8 shows the signaling power reduction with the air-cavity structure assuming FFE and DFE respectively. The improvement of the air-cavity topology becomes more pronounced at higher frequencies as the dielectric loss becomes more significant. For example, at 10 GHz, the loss improvement is nearly 15%, and for a 20-cm channel the signaling power is reduced by more than 10% with DFE and 16% with FFE. It is therefore expected that the air-cavity structure is especially attractive for future high-speed interconnects.

Figure 3-7. Improvement with air-cavity transmission line (loss in dB/cm versus frequency in GHz, air-cavity and conventional).

Figure 3-8. Signaling power reduction with air-cavity as a function of channel length (cm) and data rate (Gb/s). A) With FFE. B) With DFE.

3.3 Fabrication

Figure 3-9 illustrates the process flow for fabricating the proposed air-cavity interconnects. The process begins with electroplating the first copper pattern, representing the differential signal lines, on an FR4 substrate (Figure 3-9(A)). Following this step, a sacrificial polymer layer is spin-coated to the desired thickness and patterned to act as a temporary placeholder in the formation of the air-cavity (Figure 3-9(B)). The sacrificial polymer contains poly-propylene carbonate (PPC) (Novomer Inc., Ithaca, NY). A photoacid generator is added in order to obtain a photosensitive polymer mixture, and γ-butyrolactone (GBL) serves as the solvent. A similar formulation is available as Unity 2203P from Promerus LLC, Brecksville, OH.

Two different approaches for patterning the PPC layer are studied, photo-patterning and self-patterning [43]. With photo-patterning, a photo mask is used. With the PPC self-patterning process, no photo mask is needed, and the slightly sloped sidewalls of the PPC patterns give the subsequent layers better step coverage. The copper ground layer is then patterned on top of the PPC patterns. The entire surface is then overcoated with Avatrel 8000P (functionalized polynorbornene) to hermetically seal the transmission line and to provide mechanical support for the top ground copper layer (Figure 3-9(C)).

PPC polymer backbone unzipping occurs upon heating up to 220 °C during Avatrel overcoat curing, during which the solid PPC is converted to gaseous products. The gaseous products gradually permeate through the overcoat sidewalls and the openings in the ground layer patterns, leaving an air-cavity region of the same physical shape as the patterned PPC with little residue (Figure 3-9(D)). The air-cavity transmission line structure is thus formed. The overcoat also serves as a solder mask for later die and cable attachments.


Figure 3-9. Fabrication process for the air-cavity structure. A) Signal lines on FR4. B) PPC patterned over the signal lines. C) Ground layer and Avatrel overcoat. D) Air-cavity after PPC decomposition.

Figure 3-10. Picture and cross-section of the fabricated air-cavity structure (ground, signal, air-cavity, FR4).

Figure 3-10(A) shows the picture of the finished air-cavity differential transmission lines. The ground plane is patterned in a grid style, with holes for gas release during PPC evaporation. Figure 3-10(B) shows a cross-section of the finished air-cavity structure.

3.4 Link Implementation

3.4.1 Link Architecture

Figure 3-11 shows the block diagram of the link. The RX has a common-gate (CG) preamp and a half-rate 1-tap speculative DFE. The TX consists of a half-rate PRBS core, a MUX, and an open-drain driver. To reduce signaling power, the back termination at the TX output usually found in high-speed serial links is removed in this design. For the same voltage swing seen by the RX, removing the back termination reduces the required signaling power by 50% because it doubles the impedance seen by the TX.

Figure 3-11. Link block diagram (2^7-1 PRBS core, TX MUX and driver, 20-cm air-cavity channel, current-sharing frontend with CG amp, impedance control and offset control, and DFE).

Channel equalization is primarily done by the DFE for better power efficiency as discussed above. However, because DFE only cancels post-cursors, a 2-tap FFE is still

built in the TX driver for pre-cursor cancellation. Note that this TX FFE can also be configured for post-cursor cancellation, which facilitates the comparison between FFE and DFE in terms of power efficiency.

3.4.2 TX Design

The latches, multiplexers and drivers in the TX are all implemented in current-mode logic (CML) for fast operation and good supply noise immunity, as shown in Figure 3-12. Considering the fact that the pre-cursor is usually only a fraction of the main cursor, the pre-cursor driver is sized at half of the main cursor driver. The multiplexers are sized in such a manner that the signal path comprising the latch, the multiplexer and the driver has a uniform fan-out.

Figure 3-12. Schematics of the latch and multiplexer. A) Latch. B) Multiplexer.

Figure 3-13. Schematic of the 5-b DAC (binary-weighted branches X1, X2, X4, X8 and X16 controlled by W0 to W4).

To facilitate debugging and testing, a serial interface is integrated on-chip. The bias currents of all the gates are controlled with 5-b DACs, the schematic of which is shown in Figure 3-13.

3.4.3 RX Design

Preamp design

The RX consists of the CG preamp and the DFE. The CG frontend at the RX side serves multiple purposes. First, it provides low-to-high impedance transformation and increases the voltage swing seen by the following DFE stage. This accommodates a smaller input voltage swing, which is important for high power efficiency as discussed before. Second, it accomplishes level-shifting of the input signal so that NMOS input stages can be used in the DFE. Third, the input impedance looking into the source of the CG amplifier provides partial impedance matching for the channel.

The most important design metrics of the CG preamp are bandwidth and gain, which are both closely related to power. With the bandwidth design target set to 67% of the data rate, or 4.2 GHz for 6.25 Gb/s NRZ signaling, the gain of the CG preamp is optimized for minimum link power. A higher preamp gain yields better RX sensitivity and lower signaling power, but requires more power for the preamp. For a given channel condition and technology, an optimum gain therefore exists that minimizes the total frontend power P_FE.

Figure 3-14. Preamp model for gain optimization.

Figure 3-14 shows a preamp model for gain optimization. For a given load capacitance, gain A and 3-dB bandwidth, the following equations hold:

where W is the transistor width, is the transistor transconductance per unit width, R is the load resistance, and is the transistor drain capacitance per unit width. For each transistor current density, W and R can be solved and the amplifier current is found to be

Figure 3-15. Preamp design. A) Amplifier current I_AMP (mA) vs. current density (μA/μm). B) Frontend power (mW) vs. preamp gain.

Figure 3-15(A) plots the amplifier current as a function of at different gains in the target 0.13-μm CMOS technology when driving the four slicers of the DFE. For each

gain, there exists an optimum current density, and the optimum current density increases with increasing gain. Figure 3-15(B) shows the signaling power, the preamp power and the frontend power at different gains with optimum current density over a channel with 9-dB Nyquist loss. The slicer sensitivity is 100 mV, and it is assumed that DFE is used and that back termination is removed. The minimum frontend power is attained when the preamp gain is around 4, and is about 50% lower than the case without the preamp.

The frontend power is further reduced with a current-sharing frontend, as shown in Figure 3-11. By stacking the CG preamp and the open-drain TX driver, the tail current of the TX driver is reused by the RX amplifier. According to Figure 3-15(B), this reduces the frontend power by nearly 50%. The fact that the TX driver is powered from the RX supply also helps to suppress the noise coupling from the TX supply.

Back termination is removed in this work to reduce signaling power. The downside of this practice is the risk of potential reflections due to TX impedance mismatch. To mitigate the effect of reflections, good impedance matching at the RX side must be maintained. Since the input impedance of the CG frontend is bias dependent and non-linear, a programmable resistor is connected across the RX inputs to provide better matching, as shown in Figure 3-16(A). The programmable range of the resistor is chosen so that a differential input impedance of 100 Ω is maintained over a wide bias range between 0.5 mA and 5 mA, as shown in Figure 3-16(B). Figure 3-17 compares the RX eye diagrams with and without back termination. It can be seen that, as expected, removing the back termination nearly doubles the RX eye opening without

any noticeable degradation of the eye quality. Given the same RX sensitivity, this means the signaling power is reduced by nearly 50%.

Figure 3-16. Input impedance tuning. A) Schematic. B) Simulated result (differential input impedance Z_DM in Ω vs. tail current in mA).

Figure 3-17. Simulated RX eye diagrams (voltage in mV vs. time in UI). A) With back termination. B) Without back termination.
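The factor of two in eye opening, and the matching saving in signaling power, can be checked with a back-of-the-envelope calculation. The sketch below assumes an ideal current-mode driver in steady state and ignores reflections; the drive current, the 50-Ω termination value and the function names are illustrative assumptions, not values taken from the design.

```python
def parallel(r1, r2):
    """Parallel combination of two resistances (ohms)."""
    return r1 * r2 / (r1 + r2)


def rx_swing(i_drive, r_rx, r_back=None):
    """Voltage developed at the receiver by an ideal current-mode driver.

    With back termination the drive current splits between the
    back-termination resistor and the receiver termination; without it,
    all of the current flows into the receiver termination.
    """
    load = r_rx if r_back is None else parallel(r_back, r_rx)
    return i_drive * load


I_DRIVE = 2e-3   # hypothetical drive current (A)
R_TERM = 50.0    # hypothetical termination resistance (ohms)

with_bt = rx_swing(I_DRIVE, R_TERM, r_back=R_TERM)
without_bt = rx_swing(I_DRIVE, R_TERM)

# Swing gained for the same current, and current needed for the same swing.
print("swing ratio:", without_bt / with_bt)
print("current ratio for equal swing:", with_bt / without_bt)
```

Under these assumptions the same receiver swing is reached with half the drive current, which is consistent with the nearly 50% reduction in signaling power stated above.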

To prevent RX sensitivity degradation due to small transistor sizes, offset cancellation is also built into the CG amplifier, as shown in Figure. Both the polarity and the magnitude of the offset cancellation are adjustable via digital control.

DFE design

The DFE employs a speculative architecture and half-rate clocking to ease the timing requirement. The slicers are implemented as CML latches with adjustable built-in offset, as shown in Figure 3-18. When the latch is in its amplification phase (CKP is HIGH), an auxiliary differential amplifier injects static current into the output nodes to introduce a desired offset. This is in contrast to [44], where the offset is introduced during the regeneration phase. Injecting the offset during amplification leads to more robust latch operation since the regenerative gain is not affected by the offset-injecting differential pair. Another highlight of the DFE design is that a single latch stage is employed before the selector, unlike [44] where a complete flip-flop is used. To account for different channel profiles, both the polarity and the magnitude of the offset-injecting current are programmable via an on-chip serial interface. The programmable range of the slicer threshold is simulated to be ±140 mV, which is large enough to cover the DFE tap weights required by different channel profiles.

Figure 3-18. Slicer schematic.

The designs of the CML latches and multiplexers in the DFE are the same as those in the TX except for sizing. Unlike the multiplexers in the TX, which see the large input capacitances of the pre-cursor and main-cursor drivers, the multiplexers in the RX only see the CML latch inputs. Accordingly, they are sized the same as the latches to save power.

3.5 Experimental Results

To evaluate the performance of the proposed air-cavity structure, a test board is designed. The layout of the test board is shown in Figure 3-19. The center area is occupied by the active link, which includes footprints for a TX chip and an RX chip, and the air-cavity transmission lines. The rectangular board for the active link is cut using a dicing saw and interfaced with test equipment to evaluate overall link performance. CPW lines are used to connect the SMA connectors to the chip footprint.

Figure 3-19. Layout of the test board with the air-cavity active link.

The top and bottom areas of the test board are used to implement air-cavity test structures of various lengths. To improve measurement accuracy, open-short-thru de-embedding structures are also implemented. To facilitate processing, custom alignment marks are placed at multiple locations. The entire board footprint is designed to fit into a circular area with a diameter of 4 inches to accommodate the in-house fabrication capabilities.

3.5.1 Air-Cavity Transmission Line Measurement

The performance of the air-cavity transmission line was obtained by measuring a 5-cm test structure using a vector network analyzer with high-frequency probes. Figure 3-20 shows the measured loss and phase responses. The effective dielectric constant is calculated to be 1.7 from the measured phase, which is lower than previously predicted. This is probably because the dielectric constant of the base material (~3.9) is lower than the value of 4.4 used in previous simulations. The lower dielectric constant also leads to higher line impedance, which causes ripples in the measured loss due to impedance mismatch [12].

Figure 3-20. Measured performances of a 5-cm air-cavity microstrip. A) Loss (dB) vs. frequency (GHz). B) Phase (degrees) vs. frequency (GHz).
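For readers who want to reproduce the step from measured phase to effective dielectric constant, the following is a minimal sketch. It assumes a quasi-TEM line whose unwrapped phase delay is 2πfL√ε_eff/c; the function names are hypothetical, and the data point is synthetic (generated from ε_eff = 1.7), not a value read from the measurement.

```python
import math

C0 = 299_792_458.0  # speed of light in vacuum (m/s)


def eps_eff_from_phase(phase_deg, freq_hz, length_m):
    """Effective dielectric constant from the unwrapped phase delay of a line."""
    phase_rad = math.radians(abs(phase_deg))
    beta = phase_rad / length_m                    # propagation constant (rad/m)
    return (beta * C0 / (2 * math.pi * freq_hz)) ** 2


def phase_from_eps_eff(eps_eff, freq_hz, length_m):
    """Inverse relation: phase delay in degrees for a given eps_eff."""
    return math.degrees(
        2 * math.pi * freq_hz * length_m * math.sqrt(eps_eff) / C0
    )


# Synthetic example: a 5-cm line at 3.125 GHz with eps_eff = 1.7.
phi = phase_from_eps_eff(1.7, 3.125e9, 0.05)
print(f"phase delay = {phi:.1f} degrees")
print(f"recovered eps_eff = {eps_eff_from_phase(phi, 3.125e9, 0.05):.3f}")
```

In practice the VNA phase must be unwrapped before it is used this way, since the raw reading is only known modulo 360 degrees.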

The true loss of the line (excluding the effects of impedance mismatch) is calculated from the extracted propagation constant using the technique in [45], and the result is shown in Figure 3-21. The loss is 0.28 dB/cm at 3.125 GHz, which readily meets our design goal. The simulation result is also overlaid for comparison, demonstrating good agreement between measurement and simulation.

Figure 3-21. Loss of the air-cavity line (loss in dB/cm vs. frequency in GHz; simulation and measurement).

3.5.2 Link Measurement

The TX and RX test chips are fabricated in a 0.13-μm 1.2-V CMOS process. Figure 3-22 shows the chip micrographs. The TX and RX cores occupy 0.03 mm² and 0.02 mm², respectively. The test chips are wire-bonded to QFN packages and mounted on the test board with a 20-cm (8") air-cavity interconnect. Figure 3-23 shows the picture of the populated test board with air-cavity lines in the center of the board.

Figure 3-22. Chip micrographs of the TX and the RX.

Figure 3-23. Picture of the populated test board.

Figure 3-24. Test setup.

The test setup is depicted in Figure 3-24. The TX and RX work mesochronously, deriving their clocks from the same signal generator, with their phase relationship adjusted by a mechanically tunable delay line. The full link operates successfully at 6.25 Gb/s with a half-rate input clock of 3.125 GHz. Figure 3-25(A) and Figure 3-25(B) show the measured single-ended eye diagrams at the outputs of the RX CG amplifier (driven off-chip for testing purposes) before and after enabling the TX FFE, respectively. The closed eye diagram is

successfully opened by enabling the TX FFE. Figure 3-25(C) shows the eye diagram at the output of the DFE for a PRBS pattern, with the corresponding transient waveform shown in Figure 3-25(D). The correct 2^7-1 PRBS sequence is verified with both visual inspection and BER measurements.

Figure 3-25. Measured waveforms.

Figure 3-26 shows the measured RX bathtub curves and energy-per-bit performance with different equalization settings. At 6.25 Gb/s and a BER of 10^-12 with only the TX FFE enabled, the eye opening is 30% UI. Enabling the RX DFE and disabling the TX FFE improves the eye opening to 37% UI, while the overall power efficiency improves from 0.9 to 0.6 mW/(Gb/s). Enabling both FFE and DFE further improves the horizontal eye opening to 56% UI but degrades the power efficiency. When the link is operated at 6.25 Gb/s with only the DFE enabled, the TX core, the current-sharing frontend, and the DFE dissipate 1.44 mW, 1.2 mW and 1.06 mW, respectively.
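The power-efficiency figure follows directly from the block powers quoted above, since 1 mW/(Gb/s) equals 1 pJ/bit. The short sketch below only repeats that arithmetic; the function and variable names are illustrative.

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Energy per bit in pJ/bit; mW divided by Gb/s is numerically pJ/bit."""
    return power_mw / data_rate_gbps


# Block powers quoted in the text for DFE-only operation at 6.25 Gb/s.
blocks_mw = {"TX core": 1.44, "current-sharing frontend": 1.2, "DFE": 1.06}

total_mw = sum(blocks_mw.values())
efficiency = energy_per_bit_pj(total_mw, 6.25)

print(f"total power    = {total_mw:.2f} mW")
print(f"energy per bit = {efficiency:.2f} pJ/bit")
```

The sum reproduces the 3.7 mW total reported for the link, and the quotient rounds to the quoted 0.6 pJ/bit.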

Figure 3-26. Measured link performances. A) RX bathtub curves. B) Power efficiency.

Table 3-2 summarizes the link performance in relation to a recently published paper. Compared to previously published results, a large portion of the TX and RX power is saved by using the current-sharing frontend.

Table 3-2. Performance summary

                     This work                [7]
Technology           0.13 μm                  65 nm
Supply voltage       1.2 V                    1.0 V
Data rate            6.25 Gb/s                12.5 Gb/s
Front-end swing      125 mV                   100 mV
BER                  1e-12                    1e-12
Horizontal eye       56% at 6.25 Gb/s         -
Power                3.7 mW                   12 mW
Energy-per-bit       0.6 pJ/bit               0.98 pJ/bit
TX/RX core area      0.03 mm² / 0.02 mm²      mm² / 0.24 mm²

3.6 Summary

The bandwidth of the channel poses difficult challenges for high-speed serial links. At high frequencies, dielectric loss dominates over conductor loss. The design and

fabrication of the air-cavity transmission line structure are presented in this Chapter to reduce the dielectric loss. The measured effective dielectric constant is 1.73 and the loss is about 0.4 dB/cm. The air-cavity transmission lines are used in an active link. The active link features a low-power current-sharing frontend with a 1-tap speculative DFE. To further reduce power consumption, the back termination is also removed. The active link achieves successful 6.25-Gb/s operation and consumes 3.7 mW from a 1.2-V power supply, demonstrating the potential of the techniques for future low-power high-speed interconnects.

CHAPTER 4
A 4.5-Gb/s 12.4-mW RX WITH BAUD-RATE CDR

4.1 Chapter Overview

The receiver presented in Chapter 3 does not include CDR, an essential function in high-speed receivers as discussed in Chapter 2. CDR in high-speed serial links is usually achieved with oversampling. However, oversampling CDRs have a few issues. The first, explained in Chapter 2, is the requirement for power-hungry clock generation and distribution with sub-bit-time resolution. The second lies in the assumption that the maximum voltage margin occurs at the eye center [31]. When the input eye is horizontally asymmetric, locking to the eye center may lead to sub-optimal voltage margin. The third is that oversampling tightens the already challenging settling-time requirement for the DFE [17] [46]: because the input signal is oversampled, the time allowed for the DFE to settle is less than one UI. Sampling at the eye edges may also require dedicated edge equalization, since the edge samples experience different ISI than the data samples, as shown in Figure 4-1. For low-power high-speed serial link design, a baud-rate CDR that circumvents these issues is therefore of interest.

Figure 4-1. Different ISI seen by the edge and data samples.

In this Chapter we present an RX with a novel digital baud-rate eye-tracking CDR which employs an auxiliary slicer (CDR slicer) with adjustable threshold voltage. By jointly updating the sampling phase and the threshold voltage of the CDR slicer, the CDR loop drives the decision point of the CDR slicer to the peak of the eye opening, and thus automatically locks to the maximum voltage-margin point. Because the CDR slicer samples at exactly the same instant as the main data slicers, it does not interfere with DFE operation.

We also present a majority-voting DFE architecture that replaces the selectors in a traditional speculative DFE with majority-voters. Compared to a selector, a majority-voter is more amenable to low-power and high-speed designs because it reduces the transistor stacking levels and features equal delay to all data inputs. A majority-voter also eliminates the need for a level shifter in bipolar designs.

A receiver was implemented with the proposed CDR scheme and the majority-voting DFE. Details of the RX implementation are given in this Chapter, together with measurement results, which confirm the correct function of both techniques.

4.2 Baud-Rate CDR

A few baud-rate CDR schemes have been proposed in the past. The Mueller-Muller CDR [47], used in several recently published serial link receivers [8] [46], operates by adjusting the clock phase so that the sampled pulse response satisfies a predefined timing criterion. However, this type of CDR does not necessarily ensure maximum voltage margin of the sampled eye at lock. The CDR in [48] improves the voltage sampling margin but is only suitable for integrating-type RX frontends. The baud-rate CDR in [7] relies on auxiliary slicers that have a larger sampling window than the main data slicers to keep the sampling phase away from the eye edges, but it does

not take into account the voltage margin. Another baud-rate CDR reported in [49] locks to the maximum voltage-margin point, but requires analog slope detection circuitry and is therefore not as amenable to technology scaling and migration as digital solutions.

Figure 4-2. CDR block diagrams. A) Alexander CDR. B) Proposed baud-rate CDR.

Figure 4-2 shows the block diagrams of the Alexander CDR and the proposed baud-rate CDR. The Alexander CDR employs two slicers sampling half a UI away from each other, hence 2× oversampling. The PD in an Alexander CDR only produces information for updating the clock phase. The proposed CDR also employs two slicers (main and CDR slicers). However, unlike the Alexander CDR, these two slicers sample the input signal at the same time, so no oversampling is involved. The PD in the proposed CDR controls not only the clock phase, but also the offset of the CDR slicer.
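As a point of reference for the comparison above, the decision logic of a conventional Alexander (bang-bang) phase detector can be sketched as follows. This is the textbook rule rather than anything specific to this design, and the +1/-1 sign convention and the function name are assumptions made for illustration.

```python
def alexander_pd(d_prev, edge, d_curr):
    """Bang-bang phase decision from two data samples and the edge sample
    taken half a UI between them.

    Returns +1 if the clock is early, -1 if it is late, and 0 when there
    is no data transition (no timing information).
    """
    if d_prev == d_curr:
        return 0        # no transition between the two data samples
    if edge == d_prev:
        return +1       # edge sample still shows the old bit: clock early
    return -1           # edge sample already shows the new bit: clock late


for bits in [(0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 1, 0)]:
    print(bits, alexander_pd(*bits))
```

Note that the rule needs the edge sample, which is exactly the extra half-UI clock phase that the proposed baud-rate scheme avoids.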

The algorithm of the proposed CDR drives the sampling point of the CDR slicer to the position with maximum vertical eye opening. Since the CDR slicer and the main slicer are triggered by the same clock phase, this automatically locks the clock phase to the point with maximum voltage margin.

The operation principle of the proposed CDR is explained with the help of the CDR truth table shown in Table 4-1. The table is expressed in terms of three consecutive outputs of the data slicer, the output of the CDR slicer sampled at the same time as the middle data bit, and the threshold voltage of the CDR slicer; some entries are don't care. The CDR takes action whenever the middle data bit is 1, tracking only the upper part of the eye. The discussion below therefore considers this case exclusively. If higher CDR bandwidth is desired, the lower portion of the eye can also be utilized using an additional CDR slicer.

Table 4-1. CDR truth table

Figure 4-3(A) illustrates an example eye diagram. The upper portion of the eye is divided into five numbered regions by the different waveform trajectories corresponding to input patterns (010), (011), (110) and (111). According to Table 4-1, the CDR updates only when the data pattern equals (010), (011) or (110) since pattern (111)

does not contain any timing information. Assuming equal probability for pattern occurrences, the CDR behavior is summarized in Table 4-2 and Table 4-3 and is graphically depicted in Figure 4-3(B), where the circles indicate possible decision points of the CDR slicer, the vertical arrows indicate the threshold updating direction, and the horizontal arrows indicate the clock phase updating direction. By inspecting Figure 4-3(B), it can be seen that the CDR drives the CDR slicer's decision point until it dithers around the maximum eye-opening position (denoted by a star). Since the CDR slicer and the DFE are clocked at the same phase, this automatically locks the DFE to the maximum voltage-margin point.

The proposed CDR has a few noteworthy advantages. First, baud-rate operation saves clocking power by eliminating the need to generate extra clock phases for oversampling. Second, the CDR automatically locks to the point with maximum voltage margin without using any eye-opening monitor circuits. Third, the proposed CDR does not constrain the frontend interface to any particular architecture. Moreover, decimation of the CDR slicer output is easily accommodated in this CDR, whereas some other schemes cannot decimate freely because they require consecutive CDR slicer results [46]. It should also be noted that the CDR slicer can be reused for equalization adaptation to reduce hardware and power overhead.

Figure 4-3. Operation principle of the proposed baud-rate CDR.

Table 4-2. Threshold update

Region   (010)   (011)   (110)   (111)   Total

Table 4-3. Clock phase update

Region   (010)   (011)   (110)   (111)   Total

4.3 Majority-Voting DFE

DFE has been used extensively in high-speed links to compensate for inter-symbol interference (ISI) in band-limited electrical channels [12] [17] [16] due to its noise immunity and high signaling power efficiency, as explained in Chapter 2. To relax the stringent timing requirement, a speculative DFE architecture [19] [50] is often used. As shown in Figure 4-4, a 1-tap speculative DFE makes two tentative decisions, one for each possible value of the previous bit, and the correct decision is then selected by the actual previous bit. The timing requirement for the DFE loop is that the selector delay plus the delay and setup time of the CML DFF fit within one UI.
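To make the loop timing requirement concrete, the sketch below computes the slack of a 1-tap speculative DFE loop. It assumes the usual budget in which the selector delay plus the flip-flop delay and setup time must fit within one UI; the delay values and the function name are hypothetical and are not taken from the simulations in this Chapter.

```python
def dfe_loop_slack_ps(data_rate_gbps, t_sel_ps, t_dff_ps, t_setup_ps):
    """Timing slack (ps) of a 1-tap speculative DFE loop, assuming a one-UI budget."""
    ui_ps = 1e3 / data_rate_gbps          # one unit interval in picoseconds
    return ui_ps - (t_sel_ps + t_dff_ps + t_setup_ps)


# Hypothetical delays, for illustration only.
slack = dfe_loop_slack_ps(4.5, t_sel_ps=60.0, t_dff_ps=80.0, t_setup_ps=40.0)
print(f"slack at 4.5 Gb/s = {slack:.1f} ps")

# The same delays no longer fit at a higher data rate.
print(f"slack at 10 Gb/s  = {dfe_loop_slack_ps(10.0, 60.0, 80.0, 40.0):.1f} ps")
```

A negative slack means the loop cannot close in time, which is why any reduction in selector delay translates directly into a higher achievable data rate.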

Figure 4-4. Block diagram of a 1-tap speculative DFE.

From Equation 4-1, the selector and flip-flop delays in the critical timing path determine the maximum operating speed of the 1-tap speculative DFE. While significant work has been published on CML latches/FFs [51] [52], the following observations can be made regarding the operation of a CML selector, which is shown in Figure 4-5. First, because the selection of the current bit decision is made by connecting the previous-bit input in series with the data inputs, the CML selector employs three transistors in the stack (including the tail current source), and is therefore not optimal for low-voltage/low-power designs. Second, to maximize the timing margin of the critical DFE feedback loop, it is desirable to minimize the delay from the previous-bit input to the output, yet in Figure 4-5 this input experiences the largest delay among the three inputs. The third issue concerns the common-mode level of the previous-bit input: since it is supplied from a CML latch, its common-mode level is close to VDD, and this may necessitate an explicit level-shifting stage, which incurs power and speed overhead (especially in bipolar implementations [53]).

Figure 4-5. Schematic of a CML selector.

Table 4-4. Selector truth table

Table 4-4 shows the truth table of the CML selector in a speculative DFE. Note, however, that in a low-pass electrical channel both the main-cursor and the first post-cursor coefficients of the pulse response are positive, and thus the feedback tap weight in the DFE is always negative. This implies that one combination of the two tentative decisions in Table 4-4 does not occur (indicated in gray), and inverting the corresponding row outputs therefore does not affect the DFE function. Thus, the truth table can be rewritten as shown in Table 4-5, and the selected decision can be expressed as the sign of a sum of the three inputs (Equation 4-2).

Figure 4-6. Proposed majority-voter schematic.

Equation 4-2 can be readily implemented with a majority-voter, as shown in Figure 4-6. Compared to the selector in Figure 4-5, the majority-voter avoids the disadvantages mentioned previously. The number of transistors in the stack is reduced from

three to two, making the majority-voter more amenable to low-voltage designs. The majority-voter is fully symmetric with respect to the three inputs, and as a result, the critical delay from input to output is identical for all inputs. Moreover, no level shifting is required for the previous-bit input.

Table 4-5. Majority-voter truth table

Figure 4-7(A) compares the simulated delay from the previous-bit input to the output for a selector and a majority-voter as a function of the input transistors' current density. For comparison, the input transistors are of the same size, the single-ended input swing is 300 mV, the fanout is assumed to be two, and the supply is set to 1.2 V. The load resistors are adjusted so that both the selector and the majority-voter have a small-signal gain of one. The delay of both the selector and the majority-voter decreases with larger current densities as the transistors become faster, and saturates once the transistor speed reaches its maximum. For equal current densities, the majority-voter exhibits ~50% less delay. Figure 4-7(B) shows the overall DFE loop delay using the proposed majority-voter and the traditional selector. In this comparison, the latches in the DFF are biased with equal current density in both cases. The majority-voter-based DFE shows >10% improvement in delay over a wide range of current densities. Further improvement can be achieved by increasing the current-density bias point and speed of the CML DFFs.
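The equivalence between the selector and the majority-voter can be checked exhaustively. The sketch below is one possible formulation under stated assumptions: `d_hi` and `d_lo` are hypothetical names for the tentative decisions made with the higher and lower slicer thresholds (previous bit assumed 1 and 0, respectively), and with a positive post-cursor the combination `d_hi = 1, d_lo = 0` is taken to be unreachable. Under this labelling the previous decision enters the vote inverted, which in differential logic amounts to swapping the signal polarity; the sign convention used in the dissertation's own equation may differ.

```python
from itertools import product


def selector(d_hi, d_lo, prev):
    """Speculative selection: d_hi assumed prev = 1, d_lo assumed prev = 0."""
    return d_hi if prev == 1 else d_lo


def majority(a, b, c):
    """Majority vote of three bits."""
    return 1 if a + b + c >= 2 else 0


def reachable(d_hi, d_lo):
    """The d_hi slicer has the higher threshold, so it cannot output 1
    while the d_lo slicer outputs 0."""
    return not (d_hi == 1 and d_lo == 0)


# Collect every reachable input combination on which the two circuits differ.
mismatches = [
    (d_hi, d_lo, prev)
    for d_hi, d_lo, prev in product((0, 1), repeat=3)
    if reachable(d_hi, d_lo)
    and selector(d_hi, d_lo, prev) != majority(d_hi, d_lo, 1 - prev)
]
print("mismatches on reachable inputs:", mismatches)
```

The two functions differ only on the unreachable combination, which corresponds to the inverted (gray) row outputs described in the text.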

Figure 4-7. Simulated delay vs. current density (μA/μm). A) Selector and majority-voter. B) Overall DFE loop.

Figure 4-8(A) shows the selector and the majority-voter delay as a function of bias current. Although the majority-voter has three static tail current paths compared to the single current bias leg of the selector, the overall current consumption to achieve the same delay is comparable. This is because the majority-voter requires a lower current density than the selector to achieve the same speed. That is, the majority-voter has a lower effort delay [54], and thus it exhibits higher power efficiency. This can be related to the majority-voter having one transistor less in the stack, which also enables operation at lower supply voltages, as shown in Figure 4-8(B). A comparison of the selector and majority-voter delay normalized to their respective delays at the

nominal supply voltage of 1.2 V shows that 1) the majority-voter is significantly less sensitive to supply voltage variation and 2) it can operate at a lower supply voltage. For instance, the selector delay quickly degrades below 0.8 V, while the majority-voter exhibits a more gradual degradation below 0.6 V.

Figure 4-8. Simulated selector and majority-voter performances. A) Delay vs. total bias current. B) Normalized delay variation with supply voltage (VDD) for a current density of 100 μA/μm.

4.4 Chip Implementation

4.4.1 Architecture

Figure 4-9 shows the block diagram of the RX core. The input data is sampled by a half-rate 1-tap speculative DFE and a CDR slicer. The DFE output is then demultiplexed by 8, whereas the CDR slicer output is decimated by 8. A CDR logic block

processes the output of the DFE and the CDR slicer according to the CDR algorithm described above, and updates both the threshold of the CDR slicer with a 6-b DAC and the clock phase with a phase interpolator (PI). The I/Q inputs to the PI are generated by dividing down a full-rate external clock.

To minimize power, the RX employs high-speed CML circuits only in the first two stages and static CMOS logic for the later stages, as shown in Figure 4-9. In addition, the data output of the CDR slicer is decimated by 8 instead of being fully demultiplexed. Although this decimation reduces the CDR bandwidth, experimental results reported in the following sections confirm that the CDR bandwidth is sufficiently large for plesiochronous chip-to-chip interconnects. All blocks are built with custom layout except the CDR logic block, which is synthesized with standard cells.

Figure 4-9. Block diagram of the RX.

4.4.2 Slicer

The slicer is implemented as a CML latch with digital offset control, as shown in Figure 4-10, where all transistors without length annotation are of minimum channel length. During pre-amplification mode, a current is injected into the output nodes to introduce a desired offset. To reduce power supply noise, the offset-injection current is kept active even when the slicer is in regeneration mode. Both the polarity and the magnitude of the injected current are controlled through the serial interface.

An important design parameter of the slicer is the offset tuning range, which must be large enough to override the intrinsic slicer offset while generating the desired DFE tap weight. Figure 4-11(A) shows the simulated offset of the slicer, while the simulated offset tuning characteristic of the slicer is shown in Figure 4-11(B) when the sign of the offset is set to 1. The slicer offset is 34 mV, and the offset tuning range is ±220 mV. With 6-b digital control, this gives a maximum DFE tap weight of nearly 200 mV with a nominal step of 3 mV.

Figure 4-10. Schematic of the slicer with threshold control.

Figure 4-11. Simulated slicer performances. A) Slicer offset. B) Offset tuning (offset voltage in mV vs. offset control code).

4.4.3 DMUX

The DMUX is constructed by cascading 1:2 DMUX cells. Figure 4-12 illustrates the schematics of the latch-based CML and CMOS 1:2 DMUX cells, together with their transistor-level details. The CML latch has the same topology as the slicer, except that it does not have the offset adjustment. Also note that the bias current and the transistor sizes are reduced by 50% since offset is not critical. The CMOS latches are implemented as sense-amplifier flip-flops (SAFFs).

Figure 4-12. Schematics of the CML and CMOS DMUX cells.

4.4.4 Clocking

The clocking circuitry generates clocks for the DFE and the DMUX. A full-rate external clock is first divided down by a CML divider to obtain I/Q clocks, as shown in Figure 4-13. Since phase inversion is simply swapping the differential signal polarity, I and IB are obtained simultaneously. The same is true for Q and QB.

Figure 4-13. Schematic of the divider for I/Q generation.

A phase interpolator (PI) combines the I/Q clocks with digitally controlled weights to adjust the receiver sampling phase. The principle of the PI is depicted in Figure 4-14. Phase interpolation is achieved by combining the I/Q clock phases with different

weightings. Figure 4-15 shows the schematic of the PI, which consists of four differential pairs. Phase tuning is achieved by adjusting the tail currents of the four differential pairs. To guarantee monotonicity, the tail current in each differential pair is split into eight identical current sources, and the binary phase control word PI[5:0] is converted to thermometer code W[0:31] to control the 32 current sources. With this half-rate architecture, the phase resolution of the PI is 1/16 UI.

Figure 4-14. Principle of PI. The four clock phases are I (0°), Q (90°), IB (180°) and QB (270°).

Figure 4-15. Schematic of the phase interpolator.
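The reason thermometer coding guarantees monotonicity can be illustrated with a generic decoder. The sketch below is not the actual mapping from the PI control word to the four differential pairs, which is not spelled out in the text; it is a plain binary-to-thermometer conversion with hypothetical function names, using 32 unit sources to match W[0:31].

```python
def binary_to_thermometer(code, width):
    """Thermometer code: the first `code` bits are 1, the rest are 0."""
    if not 0 <= code <= width:
        raise ValueError("code out of range")
    return [1] * code + [0] * (width - code)


def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


WIDTH = 32  # number of unit current sources, matching W[0:31]
codes = [binary_to_thermometer(k, WIDTH) for k in range(WIDTH + 1)]

# Each step switches exactly one unit source, so the enabled current can
# only grow with the code, regardless of mismatch between the sources.
steps = [hamming(codes[k], codes[k + 1]) for k in range(WIDTH)]
print("bits toggled per step:", set(steps))
```

With a binary-weighted array, by contrast, a single code step can toggle many sources at once, and mismatch between them can produce a step in the wrong direction.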

The output of the PI is further divided down to clock the DMUX. Figure 4-16 shows the level-converter schematic used to convert CML logic levels to full-swing CMOS for clocking the SAFFs in the last two DMUX stages. The CML clock is AC-coupled to inverters with resistive feedback. The feedback resistor and coupling capacitor values are chosen so that the lower cut-off frequency is well below the target clock frequency.

Figure 4-16. Level-converter schematic.

4.5 Experimental Results

The receiver chip was implemented in 0.13-μm bulk CMOS technology, mounted on a QFN package and assembled on an FR4 test board. Figure 4-17 shows the die micrograph along with the test board picture. The receiver core occupies an area of 0.14 mm².

Figure 4-17. Die micrograph and board picture.

Figure 4-18 depicts the measurement setup. A PRBS generator and a 20-inch differential microstrip FR4 channel were used to validate the receiver. The PRBS generator and the RX were clocked by two different RF sources. When evaluating the DFE, the two RF sources were synchronized and the RX CDR was disabled; when the CDR loop was enabled, they ran independently. Phase modulation (PM) was added for the jitter tolerance measurement. The recovered data was monitored using a BERT and a high-speed sampling oscilloscope. Measurements were performed up to 4.5 Gb/s with a PRBS pattern; higher data rates were limited by equipment capability.

Figure 4-18. Test setup.

Figure 4-19 shows the measured channel insertion loss and the resulting eye diagram at 4.5 Gb/s, showing complete eye closure due to severe ISI. The loss at the Nyquist frequency is 22 dB. The measured bathtub curves at different DFE settings are shown in Figure 4-20; they were obtained by sweeping the PI control code while monitoring the receiver BER. Without the DFE, error-free operation was not possible. The eye opening enlarges with increasing DFE settings, and decreases due to over-equalization after reaching the maximum eye opening. The peak eye opening is 0.5 UI.

Figure 4-21(A) shows the measured PI linearity. The minimum DNL indicates monotonic operation, as guaranteed by the thermometer coding. The maximum DNL is 1.5 LSB, giving a maximum phase step of 0.09 (=1.5/16) UI. The

repetitive DNL and INL patterns are due to the use of a simple I/Q interpolation scheme [55].

Figure 4-19. Measured 20" channel performances. A) Loss (S21 in dB vs. frequency in GHz; -22 dB at 2.25 GHz). B) Eye diagram.

The CDR function was evaluated by setting the frequency of the PRBS generator slightly different from that of the RX clock source. The CDR lock range was measured to be ±100 ppm, confirming plesiochronous operation even though the CDR bandwidth is low due to decimation. The histogram of the recovered clock at the limit of the lock range is shown in Figure 4-21(B). The RMS jitter is 13 ps. The jitter is relatively high because the clock output buffer chain shares the same power domain with the noisy digital circuitry.
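A rough feel for what a ±100 ppm lock range demands of the loop can be obtained from the phase drift rate. The sketch below is only an order-of-magnitude estimate: it assumes a PI step of 1/16 UI (the LSB implied by the DNL figure quoted in this section), uses the decimation factor of 8, and ignores the fact that only some data patterns produce a CDR update. The function name is hypothetical.

```python
def ui_per_pi_step(offset_ppm, lsb_ui):
    """Number of bit periods over which the phase drifts by one PI LSB."""
    drift_per_ui = offset_ppm * 1e-6      # phase drift in UI per bit period
    return lsb_ui / drift_per_ui


DATA_RATE = 4.5e9      # b/s, from the measurement
OFFSET_PPM = 100.0     # edge of the measured lock range
LSB_UI = 1.0 / 16.0    # assumed PI step
DECIMATION = 8         # decimation of the CDR slicer output

n_ui = ui_per_pi_step(OFFSET_PPM, LSB_UI)
offset_khz = DATA_RATE * OFFSET_PPM * 1e-6 / 1e3

print(f"frequency offset = {offset_khz:.0f} kHz")
print(f"one PI step every {n_ui:.0f} UI "
      f"(about {n_ui / DECIMATION:.0f} decimated CDR samples)")
```

Under these assumptions the loop has on the order of tens of decimated samples per required phase step, which is consistent with the observation that decimation lowers the CDR bandwidth without preventing plesiochronous operation.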

Figure 4-20. Measured DFE performances. A) Bathtub curves (BER vs. phase in UI). B) Eye openings (eye opening in UI vs. DFE setting).

Figure 4-21. CDR measurement results. A) PI linearity (INL/DNL in LSB vs. PI control word). B) Recovered clock.

Jitter tolerance of the CDR was measured by phase modulating the clock of the PRBS generator and recording the modulation depth at which bit errors occurred. The measured jitter tolerance is shown in Figure 4-22. Below a jitter frequency of 30 kHz, the jitter tolerance is larger than 1 UI.

Figure 4-22. Measured CDR jitter tolerance (jitter tolerance in UI vs. jitter frequency in kHz).

The RX core consumes 12.4 mW from a 1.2-V supply, which translates to an FoM of 2.75 pJ/bit. Table 4-6 shows the performance summary.

Table 4-6. Performance summary

Input data rate     4.5 Gb/s
De-multiplexing     1:16
Equalization        1-tap speculative DFE
Clock recovery      Baud-rate eye-tracking
Power supply        1.2 V
Power               12.4 mW
Process             0.13 μm CMOS
Area                360 μm × 400 μm
FoM                 2.8 pJ/bit

4.6 Summary

Traditional oversampling CDR involves a few design issues, including the requirement of power-hungry generation and distribution of clocks with sub-bit-time resolution, the stringent constraint on the settling time of the DFE, and the possibility of suboptimal equalization of edge samples. It also locks to the center of the eye regardless of the specific eye shape, potentially leading to degraded voltage margins. Various baud-rate CDRs have been proposed over the years. However, they either do not take into account the voltage margin, still require sampling at instants other than the data

sampling instants, entail analog circuitry for slope detection, or are only suitable for integrating-type frontends.

In this Chapter, we propose a novel digital baud-rate eye-tracking CDR scheme that avoids the above disadvantages. It employs a CDR slicer in parallel with the main slicers, and the CDR algorithm controls both the clock phase and the threshold voltage of the CDR slicer to drive the decision point of the CDR slicer to the peak of the eye opening. Since the CDR slicer shares the same clock phase as the main slicer, this automatically locks the RX to the point with the maximum eye opening.

A majority-voting DFE architecture is also presented in this Chapter, wherein the selectors in a speculative DFE are replaced with majority-voters. The majority-voter has one less level of transistors in the stack, and is therefore more amenable to low-power and high-speed designs than a selector. It also reduces the DFE loop delay due to its structural simplicity. Furthermore, the majority-voting DFE obviates the need for a level shifter in bipolar designs.

Experimental results confirm the effectiveness of the proposed CDR scheme and the majority-voting DFE. Implemented in 0.13-μm CMOS, the RX works reliably at 4.5 Gb/s while consuming 12.4 mW. Higher data rates are limited by the measurement equipment. The CDR displays a lock range of ±100 ppm, and the DFE is able to equalize a channel with 22 dB Nyquist loss while producing a 50% UI equalized eye opening.

CHAPTER 5
A 5-Gb/s 0.75-pJ/BIT VOLTAGE-MODE TRANSCEIVER

5.1 Chapter Overview

Chapter 3 and Chapter 4 apply some of the results from Chapter 2 to improve the link power efficiency at the architecture level, namely the removal of back termination, the channel loss reduction with air-cavity transmission lines, and the use of DFE and baud-rate CDR. A few circuit techniques are also used in Chapters 3 and 4, such as the current-sharing frontend and the majority-voting speculative DFE. The 6.25-Gb/s transceiver in Chapter 3 achieves 0.6-pJ/bit power efficiency without CDR, whereas the 4.5-Gb/s RX in Chapter 4 achieves 2.8 pJ/bit including CDR and clocking circuitry.

Based upon these results, this Chapter attempts to build a complete transceiver with better power efficiency in the same technology. To attain this goal, the transceiver employs a combination of architectural improvements and circuit techniques.

One major improvement is the signaling mode. The transceiver uses voltage-mode signaling with differential termination in place of the current-mode signaling used in the air-cavity active link in Chapter 3. According to Chapter 2, this reduces the signaling power by 75%. The other major improvement is the exclusive use of static CMOS logic gates instead of the CML logic gates in Chapters 3 and 4. This avoids the static current consumption of the CML gates, since CMOS gates only consume power during state transitions.

To further improve the power efficiency, the RX operates from a 1-V power supply instead of the nominal 1.2-V supply. To cope with the resulting speed degradation of the gates, the slicers are heavily parallelized and a look-ahead selection tree

is used in the DFE. Heavy parallelism in the frontend also saves power by eliminating the need for an explicit DMUX.

The RX in this Chapter uses the same baud-rate CDR algorithm as presented in Chapter 4. However, further decimation is applied to reduce the power consumption. An injection-locked ring oscillator is used for clock generation to avoid the power overhead of a PLL or DLL. In place of the PI used for phase rotation in Chapter 4, a delay line adjusts the injection clock phase so that all the RX clock phases can be moved simultaneously.

The result is a complete 5-Gb/s transceiver in a 0.13-μm bulk CMOS process with 3.7-mW power consumption. This translates to a power efficiency of 0.75 pJ/bit, which is among the best reported to date.

5.2 TX Implementation

TX Architecture

Figure 5-1 shows the TX block diagram. A full-swing restorer (FSR) converts the output of a CML PRBS generator (reused from a previous design) to full-swing CMOS logic levels. A tapered inverter chain acts as a pre-driver between the FSR and the VM driver. To preserve high speed, the fan-out of the pre-driver is designed to be two. An on-chip LDO generates the supply V_DRV for the VM driver from the unregulated chip supply.

Figure 5-1. TX block diagram (blocks: PRBS, FSR, pre-driver, driver, and the LDO that generates V_DRV from V_DD using V_REF)

PRBS Generator

Figure 5-2 shows the block diagram of the PRBS generator. It consists of a clock buffer, a PRBS core, a buffer, and an all-zero detector. This PRBS generator is reused from a previous design, and all the buffers and gates are implemented in fully differential CML, although the drawing is single-ended for simplicity. The PRBS core is a linear feedback shift register (LFSR) comprising 14 D latches clocked at 2.5 GHz. The linear feedback through the XOR gates implements the polynomial $x^7 + x^6 + 1$ to generate a maximum-length sequence. A half-rate architecture is chosen for easier clock distribution [56]. The two 2.5-Gb/s PRBS streams, with proper phase shift, are multiplexed to obtain the 5-Gb/s PRBS.

Figure 5-2. PRBS block diagram (two rows of seven D latches clocked by CK form the PRBS core, which is monitored by the all-zero detector)
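The behavior of the LFSR described above can be illustrated with a short software model. This is a minimal sketch, not the CML implementation: the function names `lfsr_step`, `prbs7_bits`, and `lfsr_period` are hypothetical, the register is modeled as seven full-rate stages rather than the half-rate latch arrangement of Figure 5-2, and the output is assumed to be taken from the last stage.

```python
def lfsr_step(state):
    """Advance a 7-bit Fibonacci LFSR for x^7 + x^6 + 1 by one clock."""
    feedback = ((state >> 6) ^ (state >> 5)) & 1   # XOR of stages 7 and 6
    return ((state << 1) | feedback) & 0x7F


def prbs7_bits(seed=0x7F, nbits=127):
    """Return nbits output bits, taken from stage 7 (assumed output tap)."""
    state, bits = seed & 0x7F, []
    for _ in range(nbits):
        bits.append((state >> 6) & 1)
        state = lfsr_step(state)
    return bits


def lfsr_period(seed):
    """Count clocks until the register returns to its starting state."""
    start = state = seed & 0x7F
    n = 0
    while True:
        state = lfsr_step(state)
        n += 1
        if state == start:
            return n
```

With any non-zero seed the model cycles through all 127 non-zero states, whereas a seed of zero returns to itself after a single clock, which is the lock-up condition that the all-zero detector guards against.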

One well-known design issue in PRBS generators is the all-zero state of the LFSR, which circulates indefinitely once the LFSR falls into it. To prevent this from happening, [57] [58] use a reset signal to manually insert a one into the LFSR. This solution does not work if the LFSR accidentally falls into the all-zero state during normal operation (for instance, due to a power supply disturbance). A better solution is to monitor the LFSR and automatically reset it when the all-zero state is detected. [59] uses logic gates to detect the all-zero state, which is complex and timing-critical. [60] instead detects the average DC level of the LFSR outputs. Although this solution is not timing-critical, it still needs additional routing for all the LFSR outputs and thereby incurs extra loading and complicates the layout.

Note, however, that it is not necessary to monitor all the LFSR outputs to detect the all-zero state. Monitoring the final generator output suffices, which avoids the loading and layout complication. Figure 5-3 shows the all-zero detection used in this work. The RC filter has a cut-off frequency of 2 MHz and filters out the high-frequency component. Since a PRBS is nearly DC balanced, P and N should have nearly the same DC voltage. When the LFSR falls into the all-zero state, however, P has a lower DC voltage than N, and the comparator senses this condition and resets the LFSR. Figure 5-4 shows the schematic of the self-biased comparator. For robust operation, the comparator has a built-in offset of roughly 60 mV so that it does not activate reset during normal operation.

Figure 5-5 shows the simulated waveforms of the all-zero detector. At start-up, the PRBS is stuck at the all-zero state. The detector senses this state and inserts ones into the LFSR so that a proper PRBS pattern can be initiated.
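The detection principle can be checked with a few lines of arithmetic. The sketch below is illustrative only: the resistor and capacitor values are those annotated in Figure 5-3, while the 0.3-V differential swing, the ideal averaging model, and the names `rc_cutoff_hz`, `filtered_diff_v`, and `reset_asserted` are assumptions introduced here.

```python
import math

R_OHMS = 42e3        # filter resistor annotated in Figure 5-3
C_FARADS = 1.7e-12   # filter capacitor annotated in Figure 5-3


def rc_cutoff_hz(r_ohms=R_OHMS, c_farads=C_FARADS):
    """Cut-off frequency of a first-order RC low-pass filter."""
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)


def filtered_diff_v(ones_fraction, swing_v):
    """Average of (P - N) for a given fraction of ones in the bit stream."""
    return (2.0 * ones_fraction - 1.0) * swing_v


def reset_asserted(ones_fraction, swing_v=0.3, offset_v=0.060):
    """Reset fires when P sits below N by more than the built-in offset.

    swing_v is a hypothetical value; the source does not state the swing.
    """
    return filtered_diff_v(ones_fraction, swing_v) < -offset_v
```

Under these assumptions the filter corner evaluates to roughly 2.2 MHz, in line with the quoted 2 MHz. A PRBS7 stream with 64 ones in 127 bits leaves P and N within about 1% of the swing of each other, well inside the 60-mV offset, whereas an all-zero stream pulls the average of P below N by the full swing.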

Figure 5-3. All-zero detector (PRBS core outputs P and N, RC filter with 42 kΩ and 1.7 pF, comparator output Reset)

Figure 5-4. Schematic of the self-biased comparator with offset (device size annotations: 10 µm/0.5 µm, 12 µm, 3 µm, 0.2 µm/5 µm, 10 µm/0.5 µm)

Figure 5-5. Simulated waveforms confirming the function of the all-zero detector (all-zero state, reset, then normal operation)

LDO

The LDO powers the TX driver for better supply noise rejection and also provides a convenient means of adjusting the TX output swing. For a single-ended output swing of 100 mV, the driver current consumption is 1 mA with differential RX termination. The pass element is sized large enough to source 10 mA to support larger swings in measurement. The error amplifier is a simple two-stage opamp. The dominant pole is located at the V_DRV node due to the large decoupling capacitor. Figure 5-6 shows the stability simulation results. The phase margin is 72 degrees.

Figure 5-6. Stability of the LDO (panels A and B: gain in dB and phase in degrees versus frequency in Hz)

TX Driver

Since the targeted TX swing is less than 100 mV, the TX employs an N-over-N VM driver [6] [39], as shown in Figure 5-1. Exclusive use of NMOS in the driver reduces the input capacitance, and therefore the pre-driver power consumption, compared to an inverter driver [61]. The transistors are sized for 50-Ω R_on for proper channel back

termination. Note that the top NMOS is sized slightly larger than the bottom one since it sees less overdrive voltage.

5.3 RX Implementation

RX Architecture

Figure 5-7 depicts the receiver block diagram with differential termination. Because the TX output has a low common-mode voltage, the input signals V_P and V_N are first shifted up to enable NMOS transistors at the input of the slicers. Equalization is done with a 1-tap speculative DFE for its high signaling efficiency compared to TX FFE [16]. A bank of 32 slicers performs digitization and direct 1:16 de-multiplexing. Two additional CDR slicers facilitate timing recovery. The slicer bank's 34 output bits are synchronized, and 17 of them are selected to accomplish DFE. The ILRO, locked to a MHz external source, generates 16 clock phases CK[0:15] for the slicer bank. The CDR logic extracts timing information from the 17 bits and adjusts the phase of the injection clock to track the maximum eye opening.

Figure 5-7. RX block diagram (5-Gb/s inputs V_P and V_N with common mode V_CM; level shifter producing V_LS + V_DFE and V_LS − V_DFE so that the slicers see V_P − V_N + 2V_DFE and V_P − V_N − 2V_DFE; slicer outputs Q[0] to Q[15] plus the CDR slicer output Q*[8]; SYNC, DFE selection tree, and CDR logic; ILRO with delay line generating CK[0:15])
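The selection of 16 data bits from the 32 speculative slicer outputs can be illustrated with a short behavioral model. This is only a sketch under stated assumptions: the names `ripple_select`, `lookahead_select`, and `critical_path_selectors` are hypothetical, each clock phase is assumed to provide one decision for a previous bit of 1 (`hi`) and one for a previous bit of 0 (`lo`), and the look-ahead split after Q[7] follows the description of the look-ahead selection tree later in this Chapter. The selector-depth count is a simple structural model, not a timing result from this work.

```python
def ripple_select(hi, lo, prev_bit):
    """Resolve speculative decisions one after another (conventional tree)."""
    out = []
    for h, l in zip(hi, lo):
        prev_bit = h if prev_bit else l   # one selector per bit
        out.append(prev_bit)
    return out


def lookahead_select(hi, lo, prev_bit, split=8):
    """Resolve the first half by rippling and pre-compute the second half."""
    first = ripple_select(hi[:split], lo[:split], prev_bit)
    cand0 = ripple_select(hi[split:], lo[split:], 0)   # assumes Q[split-1] = 0
    cand1 = ripple_select(hi[split:], lo[split:], 1)   # assumes Q[split-1] = 1
    return first + (cand1 if first[-1] else cand0)     # final selection


def critical_path_selectors(n=16, split=None):
    """Selectors in series on the longest path (assumed structural model)."""
    if split is None:
        return n
    # The pre-computed branches need no selector for their first bit.
    return max(split, n - split - 1) + 1
```

Both functions return identical decisions for every input, while the modeled critical path shrinks from 16 selectors to 9, which matches the statement that the timing constraint is relaxed by nearly 50%.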

5.3.2 Slicer Design

The most important design goals of the slicer are power, speed, and sensitivity. The slicers are implemented as SAFFs to avoid static power consumption, as shown in Figure 5-8. With 16-way interleaving, the speed requirement on the slicer is much relaxed, leaving sensitivity as the focus of design optimization.

One factor that impacts the slicer sensitivity is transistor mismatch. To reduce the input capacitance and power consumption, the slicers are sized near minimum. As a result, the simulated 1-σ slicer offset is 38 mV. To improve RX sensitivity, all the slicers have 8-b offset trimming. The trimming range is designed to be ±160 mV, yielding a trimming resolution of 1.25 mV.

Figure 5-8. Schematic of the slicer

Another factor that impacts the slicer sensitivity is hysteresis, including the hysteresis due to incomplete resetting of the SA core and the hysteresis due to the imbalanced input capacitances of the RS latch that follows the SA core [62]. With heavy front-end parallelism, the SA core has enough time to reset completely, and no hysteresis is observed due to the SA core. To remove the hysteresis due to the imbalanced RS latch input capacitance, a buffer stage is inserted between the SA core and the RS latch, as shown in Figure 5-8. Simulation indicates that without this buffer

stage, the slicer has a hysteresis of 30 mV, whereas inserting the buffer stage makes the hysteresis negligible.

Level Shifting and DFE Tap Generation

The slicers use NMOS input transistors for faster operation. However, the RX input has a common-mode level close to ground due to the use of VM signaling. A level shifter is therefore required before the slicers to shift up the input signals by V_LS.

Level shifting can be accomplished with an AC-coupling capacitor [63] or a common-gate (CG) amplifier [64], as shown in Figure 5-9(A) and (B). AC coupling does not consume power but cuts off the low-frequency component of the input signal. A CG amplifier, on the other hand, provides DC coverage but dissipates excessive power due to the stringent bandwidth requirement. This is especially true when driving the large input capacitance of the heavily parallelized slicer bank. Figure 5-9(C) shows the basic idea of the proposed level shifter, which combines the advantages of both: a capacitor provides a high-frequency signal path while a source follower enables DC coverage.

Figure 5-9. Level shifters. A) Capacitor-based. B) CG-amp-based. C) Proposed, with separate AC and DC paths.

Figure 5-10 shows the detailed schematic of the level shifter. The AC-coupling capacitor is implemented as an NMOS transistor with source and drain shorted to the input. The shifting voltage is adjusted by tuning VB. To control the low-frequency gain, the source follower is broken into 4 identical segments, with the input of each segment

switchable between the input and the common-mode voltage by GAIN[3:0]. When all four inputs are switched to the common-mode voltage (GAIN=0), the DC path of the level shifter is disabled. Figure 5-11 shows the simulated frequency response of the level shifter at different gain settings. When the DC path is disabled, the level shifter has a low cut-off frequency of 3 MHz. Because of its much relaxed bandwidth requirement, the source follower consumes negligible (<10 μW) power.

Figure 5-10. Detailed schematic of the level shifter (signals: V_IN, V_CM, VB, GAIN[3:0], output to slicers)

Figure 5-11. Simulated frequency response of the level shifter at different gain settings (gain in dB versus frequency in Hz, for GAIN = 0 to 4)

The level shifters also provide a convenient means of generating the DFE tap. This is achieved by introducing an offset in the shifting voltages of V_P and V_N, as shown in Figure 5-7. Although it is possible to embed the DFE tap into the slicer offset, doing so would require too large a slicer trimming range when the input swing is high.
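The trimming budget behind this choice can be made concrete with a small calculation. The sketch below uses the ±160-mV range, 8-b resolution, and 38-mV 1-σ offset quoted in Section 5.3.2; the Gaussian offset model, the 100-mV tap value, the worst-case subtraction of the tap from the range, and the names `trim_lsb_mv` and `offset_coverage` are assumptions made for illustration.

```python
import math


def trim_lsb_mv(range_mv=160.0, bits=8):
    """Step size of a trim DAC spanning -range_mv to +range_mv."""
    return 2.0 * range_mv / (2 ** bits)


def offset_coverage(range_mv=160.0, sigma_mv=38.0, embedded_tap_mv=0.0):
    """Probability that a Gaussian offset fits within the usable trim range.

    If the DFE tap were embedded in the slicer offset, the tap would use up
    part of the range; the worst-case side is modeled here (an assumption).
    """
    usable_mv = range_mv - embedded_tap_mv
    if usable_mv <= 0.0:
        return 0.0
    return math.erf(usable_mv / (sigma_mv * math.sqrt(2.0)))
```

With the full ±160 mV available, the range spans more than four standard deviations of the offset. Under the hypothetical 100-mV embedded tap, only about 1.6 standard deviations would remain, which illustrates why the tap is generated in the level shifter instead.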

DFE with Look-Ahead Selection Tree

The slicer bank is implemented using a 16-way parallel architecture to relax the speed requirement and avoid the added power consumption of an explicit demultiplexer. A critical issue in the speculative DFE is the stringent timing constraint, which arises because decisions are selected based on previously received bits. For a straightforward implementation of the DFE selection tree, shown in Figure 5-13(A), the previous bits must ripple through all 16 selectors under worst-case conditions, and the resulting timing constraint is

$t_{ck\text{-}q} + 16\,t_{sel} + t_{setup} < 16\,T_b$

where $t_{ck\text{-}q}$ and $t_{setup}$ are the delay and set-up times of the D flip-flop, $t_{sel}$ is the selector delay, and $T_b$ is the bit time. Figure 5-12 shows the simulated $t_{sel}$ as a function of V_DD before layout extraction. At 1.0 V, the delay is about 120 ps. Considering the parasitics due to wiring, such a delay is marginal for 5-Gb/s operation.

Figure 5-12. Simulated pre-layout selector delay vs. power supply (delay in ps versus V_DD in V)

This work uses a look-ahead selection tree to expedite the selection process. Two possible sets of decisions for Q[8:15] are pre-computed and then selected, as shown in Figure 5-13(B). The timing constraint now becomes

$t_{ck\text{-}q} + 9\,t_{sel} + t_{setup} < 16\,T_b$

which is relaxed by nearly 50% compared to the straightforward implementation.

Figure 5-13. DFE selection tree. A) Conventional. B) Look-ahead.

Decimated Baud-Rate CDR

The RX employs the same baud-rate CDR scheme as in Chapter 4 to reduce the clocking power compared to Alexander-type CDRs [64]. Monitoring all of CK[0:15] would require 32 more slicers, leading to considerable power and area overhead. To further reduce power consumption, only CK[8] is monitored in this work. This reduces the number of CDR slicers by more than 90%, from 32 to 2. Although this decimation reduces the CDR bandwidth, it is generally acceptable for mesochronous chip-to-chip links [64]. Note that because of the heavy parallelism, the reduction in input capacitance and area is more pronounced than with the decimation in [64].

5.4 Injection-Locking-Based Clock Generation

Clock Generation Overview

Despite the 50% reduction in the number of clock phases afforded by the baud-rate CDR, generating the required 16 phases for the slicer bank is still non-trivial. Injection-locking-based clock generation is chosen in place of PLL- or DLL-based schemes for its low power and superior jitter performance. Figure 5-14 shows the block diagram of the clock

generation circuitry. At the core lie two cascaded (master and slave) low-power injection-locked ring oscillators (ILROs). Both ILROs are digitally trimmed to ensure reliable locking. The slave ILRO helps correct the master ILRO's phase mismatch and duty-cycle distortion due to injection locking [65]. A bank of current-starved delay lines facilitates further phase calibration.

Phase tuning of an ILRO is usually done by adjusting its free-run frequency [66] [22] [67]. However, tuning the free-run frequency may change the phase relationship between the ILRO outputs and degrade the RX timing margin. In this work, the phases of the ILRO outputs are tuned by adjusting the injection clock phase with an additional delay line controlled by the CDR logic, as shown in Figure 5-14.

Figure 5-14. Block diagram of the injection-locking-based clock generation (external reference, CDR-controlled delay line, master and slave ILROs with frequency trimming, and 16 delay lines for phase trimming)

ILRO Core

The master and slave ILROs are of the same design. Figure 5-15 shows the ILRO core schematic. Eight pseudo-differential delay cells constructed from inverters are used instead of CML delay cells to avoid static current consumption. The input clock phases are injected through NMOS transistors. To ensure locking, the free-run frequency of the oscillator is digitally trimmed.
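The clocking requirement implied by the 16-way interleaved frontend can be worked out numerically. This is a hedged sketch rather than a description of the actual design: it assumes that the ILRO runs at the data rate divided by 16, that adjacent phases are spaced by one UI, and that an ideal ring with eight pseudo-differential stages has a period equal to twice the number of stages times the stage delay. The names `clock_plan` and `stage_delay_s` are hypothetical.

```python
def clock_plan(data_rate_bps=5e9, n_phases=16):
    """Return (clock frequency, UI, sampling offsets) for an interleaved RX."""
    ui_s = 1.0 / data_rate_bps
    f_clk_hz = data_rate_bps / n_phases             # assumed ILRO frequency
    offsets_s = [k * ui_s for k in range(n_phases)]  # CK[k] samples bit k
    return f_clk_hz, ui_s, offsets_s


def stage_delay_s(f_clk_hz, n_stages=8):
    """Per-stage delay of an ideal ring: period = 2 * n_stages * delay."""
    return 1.0 / (2 * n_stages * f_clk_hz)
```

Under these assumptions a 5-Gb/s link needs a 312.5-MHz oscillator whose eight stages each contribute 200 ps, that is, one UI per stage.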

Figure 5-15. Schematic of the ILRO core (frequency control bits CTRL[7:0] with binary-weighted sizes X1 to X128; clock phases P[0] to P[15]; injection inputs INJ[0] to INJ[15])

One design issue of the pseudo-differential oscillator is its start-up. Because there is an even number of delay stages, a stable DC solution exists in which the whole ring behaves like a latch, as shown in Figure 5-16. To prevent that from happening, the cross-coupled inverters must be sized large enough compared to the main inverters. In this design, the sizing of the cross-coupled inverters relative to the main inverters is chosen for reliable start-up, as annotated in Figure 5-15.

Figure 5-16. Start-up issue of the pseudo-differential oscillator

Delay Line

The delay lines are constructed by cascading current-starved delay cells, the schematic of which is shown in Figure 5-17, where a 4-b digitally controlled current sets the bias current of the inverters. Figure 5-18 shows the simulated tuning curve of one delay cell. The tuning range is 30 ps. The CDR delay line consists of 8 delay cells. The

total tuning range of 240 ps is larger than 1 UI for reliable CDR operation, especially when the extra delay caused by parasitics is considered.

Figure 5-17. Schematic of the current-starved delay line (4-b control [3:0], IN, OUT)

Figure 5-18. Simulated delay line tuning curve (delay in ps versus control code)

5.5 Experimental Results

The transceiver was fabricated in a 0.13-μm bulk CMOS process using only nominal-V_T devices. The test chip was assembled in a 32-pin QFN package and mounted on an FR4 board. Figure 5-19 shows the chip micrograph, with the RX and TX dimensions annotated.

TX Measurement

The TX is measured at different supply voltages. With a 1.5-V supply the TX works up to 6.25 Gb/s, whereas at 1.2 V it works at 5 Gb/s. Below 1.2 V the TX does not work properly, probably limited by the CML PRBS core. Figure 5-20(A) shows the measured TX eye diagram at 6.25 Gb/s. The RMS jitter is 11 ps.


Figure 5-20(B) shows the captured transient of the TX output, which confirms correct pattern generation.

Figure 5-19. Chip micrograph and transceiver layout (labeled blocks: RX with level shifters, DFE, buffers, CDR logic, ILROs, and delay lines; TX with DRV, FSR, PRBS, LDO, and decoupling capacitor; dimension annotations 230 μm, 300 μm, and 500 μm)

Figure 5-20. TX measurement results at 6.25 Gb/s. A) Output eye diagram (scale bars: 20 mV, 50 ps). B) TX transient showing correct PRBS pattern (scale bars: 20 mV, 2.5 ns).

Clocking Measurement

Figure 5-21 shows the measured tuning curve and locking range of the ILRO. The ILRO has a tuning range of more than 500 MHz, and the locking range is larger than 10% when the free-run frequency is MHz.

Figure 5-21. ILRO measurement results. A) Frequency tuning (frequency in MHz versus frequency control word). B) Locking range (lock range in percent versus frequency control word).

Figure 5-22 shows the measured phase noise with and without injection. At 100-kHz offset, injection locking suppresses the phase noise by more than 70 dB.

Figure 5-22. Measured phase noise with and without injection locking (phase noise in dBc versus frequency offset, 1 kHz to 100 MHz)

The measured CDR delay line tuning curve is shown in Figure 5-23. The tuning range is 400 ps, which covers 2 UI when the data rate is 5 Gb/s. The measured tuning range is more than 60% larger than the simulation result, indicating heavy parasitics due to routing.

Figure 5-23. Measured CDR delay line tuning curve showing >2-UI tuning range (delay increase in ps versus control code)

RX Measurement

Standalone RX measurement is done up to 4 Gb/s due to equipment limits. Figure 5-24 shows the measured loss profile of the 20-inch channel. The loss is 19.2 dB at 2 GHz. Figure 5-25 shows the 4-Gb/s eye diagrams before and after the channel. Due to severe channel loss, the eye is completely closed after the channel.

Figure 5-24. Measured loss characteristics of the 20-inch channel (S21 in dB versus frequency in GHz; −19.2 dB at 2 GHz)

Figure 5-25. Measured 4-Gb/s eye diagrams before and after the 20-inch channel (scale bars: 30 mV, 100 ps)

Figure 5-26 shows the measured bathtubs with and without DFE. Error-free operation cannot be attained without DFE, while the eye opening is 30% when DFE is enabled. Figure 5-27 shows the recovered clock. The RMS jitter is 4.85 ps, while the peak-to-peak jitter is 42 ps.

Figure 5-26. RX bathtubs with and without DFE (BER versus delay in UI)

Figure 5-27. Jitter histogram of the recovered clock (J_RMS = 4.85 ps, J_P-P = 42 ps)

The receiver core is powered from a 1-V supply and dissipates 1.1 mW, which translates to a power efficiency of 0.28 pJ/bit. Table 5-1 compares the performance to some recently published work. The power efficiency is nearly a 2× improvement over the best result of previously published complete receivers.

Table 5-1. Performance summary of the receiver
(columns: [6] | [7] | [22] | This work)
Equalization: CTLE | CTLE | CTLE | DFE
Sub-rate: 1/2 | 1/2 | 1/10 | 1/16
Clock generation: PLL | PLL | ILRO | ILRO
CDR: Alexander | Baud-rate | NA | Baud-rate eye-tracking
Technology: 90-nm | 65-nm | 65-nm | 0.13-μm
Rows without recoverable entries: data rate (Gb/s), Nyquist loss (dB), J_rms (ps), V_DD (V), power (mW), area (mm^2), FoM (pJ/bit).

Transceiver Measurement

The whole link is then tested with a 10-inch channel on FR4 at 5 Gb/s, although the TX is capable of operating at 6.25 Gb/s.

Figure 5-28 shows the TX eye diagrams before and after passing through the 10-inch channel. Although the Nyquist channel loss is less than in the standalone RX measurement, the eye is still completely closed due to the bandwidth and jitter of the TX. The near-end TX RMS jitter is 13 ps.

Figure 5-28. Measured 5-Gb/s TX eye diagrams. A) Before the channel. B) After the 10-inch channel. (scale bars: 20 mV, 50 ps)

Figure 5-29 shows the recovered data and clock of the RX. The recovered clock has an RMS jitter of 6.9 ps. Figure 5-30 shows the RX bathtubs before and after enabling the DFE. The eye opening with DFE enabled is 18%.

Figure 5-29. Measured CDR waveforms. A) Recovered 312.5-Mb/s data (scale bars: 20 mV, 1 ns). B) Recovered 312.5-MHz clock (J_RMS = 6.9 ps, J_P-P = 57.8 ps).

Figure 5-30. RX bathtubs with and without DFE (BER versus delay in UI)

The TX works from a 1.2-V supply and consumes 2.1 mW, while the RX consumes 1.6 mW from a 1-V supply. The total power consumption of the transceiver is 3.7 mW, and the power efficiency is 0.75 pJ/bit. Table 5-2 compares the transceiver performance with some recent publications. Even though we use a relatively less advanced technology, the power efficiency is among the best.

Table 5-2. Performance summary of the transceiver
(columns: [42] | [6] | [7] | [68] | This work)
Technology: 65 nm | 90 nm | 65 nm | 45 nm | 0.13 μm
BER: 1e-12 | 1e-15 | 1e-12 | 1e-14 | 1e-12
Eye opening (UI): - | 30% | 43% | - | 18%
Rows without recoverable entries: TX V_DD (V), RX V_DD (V), data rate (Gb/s), Nyquist loss (dB), TX swing (mVpp), power (mW), energy efficiency (pJ/bit), TX/RX area (mm^2).

Summary

Building on the results in Chapter 3 and Chapter 4, this Chapter presents a 5-Gb/s 0.75-pJ/bit transceiver in 0.13-μm bulk CMOS technology. Various design techniques are combined to attain this high power efficiency: VM signaling with differential termination, which reduces the signaling power by 75% compared to CM signaling; the exclusive use of static CMOS gates to avoid the static power consumption of CML gates; injection-locking-based clock generation; decimation in the CDR circuitry; and low-voltage RX operation enabled by the heavy frontend parallelism and the look-ahead DFE selection tree. The heavy parallelism also eliminates the need for an explicit DMUX, leading to further power reduction.

Even though the transceiver is implemented in a less advanced 0.13-μm CMOS technology, the achieved power efficiency of 0.75 pJ/bit is among the best reported to date at comparable data rates. It is therefore believed that the techniques presented in this Chapter will help enable the Tb/s aggregate off-chip signaling of future electronic systems.
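The power-efficiency figures quoted in this Chapter and in Chapter 4 follow from dividing power by data rate, since 1 mW divided by 1 Gb/s equals 1 pJ/bit. The helper below, whose name `energy_per_bit_pj` is hypothetical, is a minimal sketch for reproducing them from the measured values stated in the text.

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Energy per bit in pJ: mW divided by Gb/s gives pJ/bit directly."""
    return power_mw / data_rate_gbps


# Values taken from the text of Chapters 4 and 5.
transceiver = energy_per_bit_pj(2.1 + 1.6, 5.0)    # TX plus RX at 5 Gb/s
standalone_rx = energy_per_bit_pj(1.1, 4.0)        # RX core at 4 Gb/s
chapter4_rx = energy_per_bit_pj(12.4, 4.5)         # Chapter 4 RX at 4.5 Gb/s
```

The three results are about 0.74, 0.275, and 2.76 pJ/bit, which agree with the quoted 0.75, 0.28, and 2.75 to 2.8 pJ/bit to within the rounding of the reported powers.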

CHAPTER 6
A DIGITAL BACKGROUND ADC CALIBRATION TECHNIQUE

6.1 Chapter Overview

The continuous scaling of CMOS technology has made digital signal processing more powerful and affordable. Compared to analog signal processing, digital solutions offer greater flexibility and better scalability. As a result, there is a trend of moving more and more signal processing into the digital domain. This trend is also reflected in high-speed serial links [8] [69] [70], where an ADC digitizes the distorted incoming bit stream and a DSP carries out signal processing such as equalization and timing recovery in the digital domain, as shown in Figure 6-1.

Figure 6-1. An ADC-based serial link (TX, ADC, DSP)

One of the key challenges in such ADC-based serial links is the design of a high-speed low-power ADC. Due to its high speed, a flash ADC is often the architecture of choice. For low power consumption, it is desirable to use small transistors in the flash ADC. However, the mismatch between transistors becomes worse with small transistor sizes, which degrades the linearity of the ADC if left unaddressed. For example, consider the preamp in Figure 6-2, often found in flash ADCs. Around the balanced condition, the input and output are related by

$V_O = A\,(V_{IN} - V_R - V_{OS})$    (6-1)

where $A$ is the preamp gain, and $V_O$, $V_{IN}$, and $V_R$ are the differential output, input, and reference voltages, respectively. The last term, $V_{OS}$, is the offset voltage of the preamp due to device mismatches. With proper design and layout, $V_{OS}$ has a zero mean (no systematic offset) and a certain spread $\sigma_{V_{OS}}$ determined by circuit details and the fabrication technology. For typical bias conditions, $\sigma_{V_{OS}}$ is dominated by transistor threshold voltage mismatch [71] and can be expressed as

$\sigma_{V_{OS}} = A_{VT} / \sqrt{WL}$    (6-2)

where $A_{VT}$ is a parameter determined by the technology, and $WL$ is the gate area of the transistors. To satisfy the linearity requirement, the transistors must be sized large enough that $\sigma_{V_{OS}}$ is kept within a fraction of the ADC step size. With the transistor length and current density largely determined by the speed requirement, W is the only design variable that can be exploited to reduce $\sigma_{V_{OS}}$. According to Equation 6-2, to decrease $\sigma_{V_{OS}}$ by half, the transistor width, and therefore the current consumption, must be increased by 4×, a very unfavorable tradeoff for low-power designs. As technology scales down, this tradeoff is expected to become more and more challenging due to effects such as random dopant fluctuation (RDF) and line-edge roughness (LER) [72].

Figure 6-2. Schematic of a preamp (load R_D; outputs V_OP and V_ON; input pairs of size W/L driven by V_INP, V_RP and V_INN, V_RN; tail currents 2I_D)

Since offset changes slowly over time with environmental (supply voltage and temperature) variations and device aging, it can be cancelled effectively with some form of calibration. Various calibration schemes have been proposed in the past for

flash ADCs, which fall into either the foreground [73] [74] [75] or the background category [76] [77]. A foreground calibration scheme requires temporarily interrupting normal ADC operation and is therefore usually run at power-up or during idle times when allowed by the system. However, as the supply voltage and temperature change over time, the calibration results may no longer be optimal, leading to degraded performance [78]. In contrast, a background calibration scheme does not require interrupting the ADC operation and can run continuously to track environmental variations and device aging. Thus, background calibration schemes are generally preferred.

Some of the critical challenges in background calibration for high-speed ADCs are accuracy, convergence speed, area/power overhead, and performance penalty. Despite the many background calibration techniques proposed in the past, a quick literature review demonstrates the need for an improved background calibration scheme that is suitable for high-speed ADCs. In response, this Chapter describes a novel background calibration scheme for ADCs which features negligible hardware and power overhead. The proposed calibration scheme is implemented in a 50-mW 2.5-GS/s 5-bit flash ADC, and its effectiveness is verified with experimental results.

6.2 Background Calibration

Review of Prior Art

Several background calibration schemes for flash ADCs have been reported in the literature and are briefly reviewed here. Correlation-based calibration operates by modulating the analog input signal with pseudo-random sequences to extract offset information from the resulting statistics of the digital output, and has been proposed for both pipeline and flash ADCs [79] [80] [81] [82]. In [79] and [80], the analog input is

converted to a white signal with little energy at DC by chopping it with a pseudo-random binary sequence. The DC component in the resulting signal stems mainly from the ADC offset. By forcing this DC component to zero, the comparator offset can be effectively removed. A more general approach is proposed in [81], where the offset of a comparator is detected by chopping the analog input with a sequence from an on-chip random-number generator (RNG) and observing the code distribution of the digital outputs, as illustrated in Figure 6-3 (drawn single-ended for simplicity). The chopping operation degrades the ADC sample rate because it needs finite time to settle. Due to the statistical nature of this approach, the analog input must be uncorrelated with the on-chip generated random sequence, and the calibration results are prone to fluctuation, which can only be minimized at the cost of convergence speed [81]. Furthermore, correlation-based calibration invariably introduces a performance penalty because it interferes with the analog signal path through chopping or noise injection. For fast and robust calibration, deterministic schemes are generally preferred.

Figure 6-3. Correlation-based calibration. When RNG=1, Q = sgn(V_IN − V_R − V_os); when RNG=0, Q = sgn(V_IN − V_R + V_os). The two resulting output distributions (PDFs versus V_IN − V_R) are shifted by +V_os and −V_os.
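The statistical nature of correlation-based offset detection can be seen in a small Monte Carlo model built on the two expressions in Figure 6-3. The sketch is an illustration under stated assumptions, not the method of [81]: the input is assumed uniformly distributed between −1 and 1 and independent of the random bit, V_R is set to zero, and the function name `estimate_offset` is hypothetical.

```python
import random


def estimate_offset(v_os, n_samples=200_000, v_ref=0.0, seed=0):
    """Estimate a comparator offset from chopped output statistics.

    Follows Figure 6-3: Q = sgn(V_IN - V_R - V_os) when RNG = 1 and
    Q = sgn(V_IN - V_R + V_os) when RNG = 0. For a uniform input on
    [-1, 1] the difference of the two ones-densities equals V_os.
    """
    rng = random.Random(seed)
    ones = [0, 0]
    count = [0, 0]
    for _ in range(n_samples):
        v_in = rng.uniform(-1.0, 1.0)      # assumed input distribution
        chop = rng.getrandbits(1)          # on-chip random bit
        sign = -1.0 if chop else 1.0
        q = (v_in - v_ref + sign * v_os) > 0.0
        ones[chop] += int(q)
        count[chop] += 1
    return ones[0] / count[0] - ones[1] / count[1]
```

The estimate recovers both the sign and the approximate size of the offset, but its error shrinks only with the square root of the number of samples, which reflects the tradeoff between fluctuation and convergence speed noted above.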

Redundancy-based calibration [83] [77] [84] achieves deterministic operation by employing redundant elements so that ADC operation continues uninterrupted while some of the elements undergo calibration. Figure 6-4 shows the block diagram of a 6-b ADC with background calibration as reported in [76], where 64 instead of 63 comparators (C1-C64) are employed in parallel. When C1 is being calibrated, the other 63 comparators (C2-C64) work together as a normal ADC. After C1's calibration is done, the comparator array is reconfigured so that C1 and C3-C64 work together as a normal ADC while C2 undergoes calibration, with the ADC operation uninterrupted. This process repeats continuously until all the comparators are calibrated. The advantage of this technique is its low hardware overhead. However, it still incurs a speed penalty because the ADC must be reconfigured during normal operation.

Figure 6-4. Redundancy-based calibration

Reference-ADC-based calibration schemes proposed in [85] [86] [87] employ a slow but accurate reference ADC to improve the linearity of the fast but inaccurate main ADC. Figure 6-5 shows a simplified block diagram of the reference-ADC-based calibration scheme, while Figure 6-6 shows its working principle. For simplicity, we assume that the main ADC has 3-b resolution. In Figure 6-6, the transfer curves of the main ADC and the ideal reference ADC are overlaid. Any offset causes the transition levels of the main ADC to differ from those of the reference ADC. These differences are marked by gray bars in Figure 6-6 and are referred to as calibration windows hereafter. Whenever the input falls within a calibration window, a discrepancy occurs between the reference and main ADC outputs. The calibration engine then examines such discrepancies and drives the main ADC's transition levels toward the ideal ones.

Figure 6-5. Reference-ADC-based calibration

Figure 6-6. Principle of reference-ADC-based calibration

Although reference-ADC-based calibration is deterministic and incurs negligible performance penalty, there is considerable design overhead when the reference and main ADCs are entirely different; for example, a Σ-Δ ADC is used to calibrate a pipeline ADC in [87]. Furthermore, because the main and reference ADCs operate from different sampling clocks, mismatch in their track-and-hold (T/H) circuits can degrade the calibration accuracy. To alleviate this problem, one has to resort to either power-hungry T/H circuits to drive both ADCs [86] or dedicated timing calibration for the two sampling clocks [88], both of which are very challenging at high speeds. These disadvantages can be avoided with the so-called split-ADC architecture, where the reference ADC is simply a replica of the main ADC and operates at the same speed [78] [89]. The replica ADC, however, incurs significant area, input capacitance, and power overhead.

Proposed Background Calibration Scheme

In the reference-ADC-based calibration scheme, all the transition levels are calibrated simultaneously. This necessitates a reference ADC with at least the same resolution as the main ADC, and thus high overhead seems inevitable. However, because offset varies slowly over time, the transition levels can be calibrated sequentially instead of simultaneously. The benefit of this sequential calibration is the greatly reduced complexity of the reference ADC. In the extreme case, as in our proposed calibration scheme, 1-b resolution is sufficient, and the reference ADC degenerates to a single comparator.

Figure 6-7 shows a block diagram of the proposed calibration scheme. The reference ADC is now replaced with a single comparator, whose threshold voltage is reconfigurable through a digital-to-analog converter (DAC). At the beginning, the calibration engine sets the comparator's threshold voltage to the ideal value of the first transition level, as shown in Figure 6-8(A). By monitoring the outputs of the ADC and the comparator, the calibration engine adjusts the first transition level V_TH[1] of the main ADC until it matches the comparator's threshold. After V_TH[1] is calibrated, the comparator's threshold voltage is set to the ideal value of the second transition level and calibration of V_TH[2] begins, as shown in Figure 6-8(B). By iterating the same process, all the transition levels of the main ADC can be calibrated.

The resulting fully-calibrated transfer curve of the ADC is shown in Figure 6-8(H). The performance metrics of the proposed calibration scheme are discussed below.

Figure 6-7. Proposed reconfigurable-comparator-based calibration

Figure 6-8. Principle of the proposed calibration scheme. The transition levels V_TH[1] through V_TH[7] are calibrated sequentially in A)-G), and the resulting transfer curve is shown in H).
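The sequential procedure of Figure 6-8 can be mimicked with a short behavioral model. This is a sketch under simplifying assumptions, not the implemented engine: the reference comparator and its DAC are ideal, there is no noise, the input is uniform over the full scale, and every discrepancy immediately moves the transition level by one step. The function name `calibrate_sequentially` and all parameter values are hypothetical.

```python
import random

def calibrate_sequentially(offsets, lsb=1.0, step=0.125,
                           samples_per_level=5000, seed=0):
    """Calibrate the transition levels one at a time; return the residual errors."""
    rng = random.Random(seed)
    n_levels = len(offsets)                      # 7 levels for a 3-b ADC
    v_fs = (n_levels + 1) * lsb
    ideal = [(i + 1) * lsb for i in range(n_levels)]
    v_th = [r + o for r, o in zip(ideal, offsets)]   # main ADC transition levels
    for i in range(n_levels):
        v_ref = ideal[i]                         # reference comparator threshold (via DAC)
        for _ in range(samples_per_level):
            v_in = rng.uniform(0.0, v_fs)
            q_main = v_in > v_th[i]              # main ADC decision at level i
            q_ref = v_in > v_ref                 # reference comparator decision
            if q_ref and not q_main:
                v_th[i] -= step                  # transition level too high
            elif q_main and not q_ref:
                v_th[i] += step                  # transition level too low
    return [t - r for t, r in zip(v_th, ideal)]
```

In this idealized model a discrepancy can only occur while the input falls inside the calibration window, so each level ends up within one step of its target once enough samples have been seen.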

Calibration accuracy

The calibration accuracy is determined by a few factors, including the reference ADC accuracy, the calibration step size, and noise. The discussion above assumes an ideal reference ADC. In reality, however, both the DAC and the comparator in the reference ADC introduce errors and ultimately limit the calibration accuracy. Moreover, due to the digital nature of the calibration scheme, the main ADC can only be adjusted in discrete steps. The reference ADC accuracy, together with the finite calibration step size, limits the overall calibration accuracy. Once the ADC is calibrated, the residual error in each transition level is bounded by a combination of three terms: the DAC error, the offset of the comparator in the reference ADC, and the calibration step size. The calibrated INL and DNL are bounded accordingly. Notice that the offset of the reference comparator does not impact the calibrated DNL. This is because it appears in all the calibrated transition levels and merely causes a DC shift in the calibrated transfer curve.

The effect of noise on calibration accuracy is shown in Figure 6-9 for one polarity of the mean offset. For convenience, the noise is lumped into a single quantity. Ideally, whenever a discrepancy occurs, it should indicate the true polarity of the offset, and correct calibration can be made. However, due to noise, the instantaneous polarity may temporarily be reversed, as indicated by the dashed line in Figure 6-9,

and this may cause incorrect calibration to occur. To improve immunity to noise, the calibration engine can average multiple discrepancies before making a decision.

Figure 6-9. Mechanism of noise-induced calibration error

Because the reference ADC shares the same T/H and sampling clock as the main ADC, the calibration accuracy of the proposed scheme does not suffer from the T/H mismatch issue as the conventional reference-ADC-based approach does. Nor is it sensitive to the statistics of the input signal, since it does not rely on the correlation between the input signal and an on-chip pseudo-random sequence.

Convergence speed

To calculate the convergence speed, we assume the input is uniformly distributed within the full-scale input range V_FS. Similar calculations can be carried out for other input distributions, such as those of sine waves. Suppose a certain transition level has a given initial offset. The probability that an input sample produces a discrepancy is the ratio of the offset magnitude to V_FS, so on average the reciprocal of this probability gives the number of conversions needed to reduce the offset by one step, with the number of steps rounded up to an integer. Summing over the successive steps gives the number of conversions needed to calibrate the offset.

If we assume the offset follows a normal distribution with a mean of zero and a standard deviation of σ, the average number of conversions required to calibrate a particular transition level is obtained by integrating the per-offset conversion count over that distribution. Exploiting the symmetry of the integrand and assuming the offset is within [-3σ, 3σ], the integral can be approximated over that finite range. For an N-bit ADC, there are 2^N - 1 transition levels, so the total number of conversions for the calibration to converge is 2^N - 1 times the per-level average. Since the full-scale range spans 2^N LSBs, Equations 6-8 and 6-9 are combined to express the total in terms of N and σ. Figure 6-10 plots the total number of conversions as a function of the ADC resolution for different σ, and also gives the number of conversions needed for a 5-bit ADC to converge. Note that while the total grows at a rate of 2^(2N), it is a relatively

weak function of σ. For example, tripling σ from 1 V_LSB to 3 V_LSB increases the required number of conversions by only 37%. This is because calibrating small offsets takes more conversions, as the input has a lower chance of producing a discrepancy when the offset is small.

Figure 6-10. Required conversions for convergence with different resolutions (number of conversions vs. resolution in bits, for σ = 1 V_LSB and σ = 3 V_LSB)

Calibration overhead and performance considerations

The calibration overhead consists mainly of the reference ADC, the calibration engine, the memory to store the offset control words, and the circuitry to adjust the main ADC offset. With the calibration engine, the memory, and the adjustment circuitry being common to all digital calibration schemes, the major overhead advantage of the proposed scheme lies in the simplicity of the reference ADC. The comparator in the reference ADC can reuse the design available in the main ADC and entails no extra design effort. The DAC in the reference ADC is only used to set the threshold voltage, and its speed requirement is much relaxed compared to the main ADC's sample rate. The power, area, and design overhead of the reference ADC are therefore trivial. The proposed calibration scheme does not require noise injection or chopping as seen in correlation-based calibration. While redundancy-based calibration reconfigures the main ADC during normal operation, the calibration scheme herein does not.

Moreover, it does not insert extra conversion cycles, thereby avoiding any speed penalty. Although the reference ADC does increase the input capacitance, this penalty is minimal because only a single comparator is used. For example, calibrating a 5-b flash ADC with the proposed scheme increases the input capacitance by less than 4%. This is in stark contrast to the split-ADC architecture, whose replica ADC adds a far larger input capacitance. Table 6-1 shows a comparison of various background calibration schemes. The proposed calibration scheme achieves deterministic operation, introduces little performance penalty, and incurs low hardware and design overhead. Because the calibration is sequential, its convergence is slower than that of the split-ADC architecture. This is usually not detrimental since environmental variations are slow. When fast convergence is desired (for example, to reduce the test time during mass production), foreground calibration can be performed at power-up before the background calibration is enabled.

Table 6-1. Comparison of proposed and existing background calibration schemes

Scheme              Deterministic   Performance penalty   Hardware overhead   Design effort   Convergence speed
Correlation-based   No              Yes                   Medium              Medium          Low
Redundancy-based    Yes             Yes                   Low                 Low             High
Ref.-ADC-based      Yes             No                    High                High            Medium
Split-ADC           Yes             Yes                   High                Low             High
This work           Yes             No                    Low                 Low             Medium

6.3 Chip Implementation

6.3.1 ADC Architecture

Figure 6-11 depicts a block diagram of the implemented 5-bit flash ADC with the calibration circuitry (drawn single-ended for simplicity, though the real implementation is

differential). The main ADC consists of a track-and-hold (T/H), a resistor ladder, a comparator array, and a digital backend. The comparator array comprises comparators C[1:31], which digitize the sampled analog input against 31 evenly spaced reference voltages V_R[1:31] from the resistor ladder. The resulting thermometer codes are then converted to binary format by the digital backend, which also corrects first-order bubble errors.

Figure 6-11. Block diagram of the ADC. C[0]~C[31]: comparators; W[1]~W[31]: offset control words.

The calibration circuitry consists of the resistor ladder and the shaded blocks in Figure 6-11. The switch bank SR, the resistor ladder, and the comparator C[0] make up the reference ADC. The SRAM (31 x 5 b) stores the offset control words W[1:31] for C[1:31]. The finite-state machine (FSM) communicates with the SRAM through the address decoder and serves as the calibration engine. The chip also houses a serial interface, which facilitates digital control of the bias generator and allows clearing the SRAM content to disable calibration.
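The thermometer-to-binary step with first-order bubble correction can be illustrated with a few lines of code. The sketch below assumes one common realization, a three-input AND per position that fires only when a bit is 1 and the two bits above it are 0; the padding bits, the tie-break for multiple hot lines, and the name `thermometer_to_code` are illustrative choices rather than details taken from the chip.

```python
def thermometer_to_code(therm):
    """therm[i] is the output of comparator C[i+1] (1 if the input exceeds V_R[i+1])."""
    n = len(therm)
    # Pad with a virtual always-high bit below and two always-low bits above.
    t = [1] + [int(b) for b in therm] + [0, 0]
    # Three-input AND per position: hot only if this bit is 1 and the next two are 0.
    one_hot = [t[k] & (1 - t[k + 1]) & (1 - t[k + 2]) for k in range(n + 1)]
    # The hot position is the output code; if several lines are hot
    # (higher-order bubbles), this sketch simply takes the highest one.
    code = max(k for k, hot in enumerate(one_hot) if hot)
    return code, one_hot
```

For the clean code `[1, 1, 1, 0, 0, 0, 0]` the result is 3. For `[1, 1, 0, 1, 0, 0, 0]`, which contains a single bubble, a plain two-input detector would raise two lines, whereas this version raises only one, so a ROM addressed by it sees a valid one-hot word.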

6.3.2 Resistor Ladder

Since the resistor ladder generates the reference voltages for the reference ADC, its linearity ultimately determines the achievable calibration accuracy. For an N-bit ADC, the requirement on the resistors used in the ladder is given in [90] in terms of the nominal resistance R and its variance. The resistor ladder consists of identical poly resistor units with a W/L of 8 μm/4 μm and an estimated mismatch below 0.35%, which is better than 8-bit accuracy [91]. To stabilize the reference voltages and suppress input feedthrough, decoupling PMOS capacitors are connected to all resistor ladder output taps [92]. The resistor ladder consumes 0.21 mW.

6.3.3 T/H

A passive T/H precedes the comparator array; its schematic is shown in Figure 6-12(A). By presenting a static signal to the comparator array during quantization, the T/H helps minimize linearity degradation due to signal-dependent comparator delays and the clock and signal skew between comparators. Since the input voltage swing is from V_DD - 0.4 V to V_DD, PMOS transistors are used. This also eliminates the need for a buffer to shift the input common-mode level [93] [92]. The bandwidth of the T/H is determined by the on-resistance of the switch and the sampling capacitor. Figure 6-12(B) shows the small-signal model of the T/H, where C_PAD is the pad parasitic capacitance, C_sample is the sampling capacitance, and the 25 Ω resistor is the parallel combination of the channel impedance and the on-chip termination resistor. A simple π model is used in the transistors' places, with R' and C' being the channel resistance and the gate capacitance of a unit width transistor

respectively. A larger transistor has a lower on-resistance and thus tends to give a higher bandwidth. However, when the on-resistance is comparable to 25 Ω, the bandwidth will drop with increasing transistor width because the parasitic capacitance begins to dominate. An optimum transistor size therefore exists which maximizes the total T/H bandwidth. Figure 6-13 plots the T/H bandwidth as a function of the transistor width. It can be seen that a width of 28 µm gives the highest bandwidth. However, the optimum is not very sharp. A transistor width of 14 µm is chosen instead, with only a 10% drop in bandwidth, while saving about 0.2 mW on clocking.

Figure 6-12. T/H design. A) Schematic (14 µm sampling switch clocked by CK, with 7 µm dummy switches clocked by CKB on both sides). B) Its small-signal model.

Figure 6-13. T/H bandwidth (GHz) vs. switch width (µm)
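The existence of an optimum width can be made explicit with a first-order estimate. The derivation below is an assumption-laden sketch, not the model behind Figure 6-13: it applies an Elmore (open-circuit time-constant) approximation to the network of Figure 6-12(B), writes the 25 Ω source resistance as R_s, takes the switch resistance as R'/W with a capacitance C'W on each side, and ignores the dummy switches.

```latex
% Assumed first-order (Elmore) estimate; R_s = 25 ohm, switch modeled as R'/W with C'W per side.
\begin{align}
\tau(W) &\approx R_s\left(C_{PAD} + 2C'W + C_{sample}\right)
          + \frac{R'}{W}\left(C'W + C_{sample}\right), \\
\frac{d\tau}{dW} &= 2R_sC' - \frac{R'C_{sample}}{W^{2}} = 0
  \;\Rightarrow\; W_{opt} = \sqrt{\frac{R'\,C_{sample}}{2R_sC'}}, \\
f_{-3\,\mathrm{dB}} &\approx \frac{1}{2\pi\,\tau(W)} .
\end{align}
```

Under this approximation the width-dependent part of τ has the form aW + b/W, which is flat around its minimum: halving W from the optimum raises that part by only 25%, and the total by less because of the width-independent terms. This is qualitatively in line with the shallow optimum described above.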

A few mechanisms limit the T/H linearity, including signal-dependent charge injection, clock feedthrough, and nonlinear channel resistance during track mode [94]. Dummy switches driven by a delayed complementary clock are used at both sides of the sampling switch to cancel the charge injection [92]. With second-order distortion largely removed by differential signaling, the third-order term dominates the distortion performance. Simulation shows that, when sampling a 1.4 GHz full-scale sine wave at 2.5 GS/s, the T/H achieves -45 dBc third-order harmonic distortion, with a 1.5 dB improvement from the dummy switches.

6.3.4 Comparator

Figure 6-14 shows the block diagram of the comparator. A three-stage preamplifier followed by a regenerative latch digitizes the difference between the input and reference voltages. Another two latch stages reduce metastability and convert current-mode-logic (CML) levels to full-swing CMOS logic levels. A current-steering DAC accepts the control word from the SRAM and injects static current into the output of the first preamplifier stage to cancel the offset of the whole comparator.

Figure 6-14. Comparator block diagram (blocks P1, P2, P3, L1, P4, L2, L3, with the DAC driven from the SRAM; the latches are labeled CML latch, CML latch, and SAFF).

Compared to a dynamic comparator [74], the preamplifier expedites the regeneration in the latch [95], suppresses charge kickback, and provides better power-supply and common-mode rejection. The preamplifier consists of three stages (P1~P3) for fast overdrive recovery [90] [93]. Figure 6-15 shows the schematics of P1, P2, and

the DAC. Resistor loads are used instead of diode-connected transistors to avoid the voltage headroom loss due to the transistor V_T [93].

Figure 6-15. Schematics of the first two stages of the preamplifier. Device sizes: M_1A, M_1B: 1µ/0.12µ; M_2A, M_2B: 1µ/0.12µ; M_3A, M_3B: 1µ/0.12µ; M_4A, M_4B: 0.4µ/0.12µ; M_5A, M_5B: 1µ/0.12µ; R_1A, R_1B: 6 kΩ; R_2A, R_2B: 6 kΩ; I_T1, I_T2, I_T3: 100 µA; I_DAC: 0~40 µA.

For high-speed operation, the bandwidth of the preamplifiers must be maximized. For that reason, it is desirable to bias the transistors at high current densities. However, this practice is limited by two factors. First, the transit frequency of a transistor increases slowly at high current densities, as shown in Figure 6-16(A), which means the current efficiency drops at high current densities, even without considering the f_T drop caused by velocity saturation. Second, the highest current density is limited by the supply voltage due to voltage headroom issues. For P1, ignoring the currents through M3, the gain can be expressed in terms of g_m, I_1, and V_R, where g_m is the transconductance of M1 and M2, I_1 is the current through M1 and M2, and V_R is the voltage drop on R1 when the differential pair is balanced. A factor in this expression accounts for the fact that half of the bias current flows through M_1B and M_2B and does not produce any gain. Since V_INP, V_INN, V_RP, and V_RN all vary between V_DD - 0.4 V and V_DD, to prevent M1 and M2 from entering the linear region, V_R must be kept below or about

V considering the body effect. Figure 6-16(B) plots the required V_R as a function of the current density, assuming a moderate gain of 2. It can be seen that the speed and gain requirements cannot be met without violating the V_R limit. To solve this problem, two transistors biased in the saturation region (M_3A and M_3B) are used to bypass half of the current and thereby halve the voltage drop on R_1A and R_1B [96], as also shown in Figure 6-16(B). The chosen current density is 50 μA/μm. Since P2 has less self-loading, it can achieve a larger GBW than P1 given the same bias condition and fanout. The gain of P2 is therefore designed to be 70% higher than that of P1, while the bandwidths of P1 and P2 are kept the same. No inductive peaking is used, in order to save area.

Figure 6-16. Effects of M_3. A) Transit frequency f_T (GHz) vs. current density (μA/μm). B) Required voltage drop V_R (V) on the load resistor vs. current density (μA/μm), without and with M_3.

Figure 6-17. Schematic of the CML latches (stages P3, L1, P4, L2, clocked by CK and CKB). Device sizes: M_1A, M_1B: 1µ/0.12µ; M_2A, M_2B: 0.8µ/0.12µ; M_3A, M_3B: 1µ/0.12µ; M_4A, M_4B: 0.8µ/0.12µ; M_5A, M_5B: 2µ/0.12µ; M_6A, M_6B: 2µ/0.12µ; R_1A, R_1B: 8 kΩ; R_2A, R_2B: 8 kΩ; I_T1, I_T2: 60 µA.

A CML flip-flop and a sense-amplifier flip-flop (SAFF) complete the comparator. Figure 6-17 shows the CML flip-flop, which is constructed with the conventional master-slave topology. Figure 6-18 shows the SAFF schematic. It consists of a sense amplifier (SA) and a set-reset (SR) latch. The SAFF provides additional gain to suppress metastability errors and converts CML levels to full-swing CMOS levels. With the additional gains of the latches, the ADC's BER is estimated to be better than [97].

Figure 6-18. Schematic of the SAFF

Figure 6-19 shows the current-steering DAC. A bias generator shared by all the comparators generates three bias voltages. The offset control word W[N] selects from these three bias voltages and VSS to inject an appropriate current into comparator C[N] and cancel its offset.

Figure 6-19. Current-steering DAC and the DAC bias generator. The bias generator is shared by all the comparators. Labeled values: M_6A, M_6B, M_6C: 2µ/0.16µ; M_4A, M_4B: 0.4µ/0.12µ; M_3A: 1.6µ/0.16µ; M_3B: 0.4µ/0.16µ; I_B: 13 µA. Control bits: W[N][4], W[N][3:2], W[N][1:0]; bias voltages: VB[1], VB[2], VB[3].

One important design parameter of the current-steering DAC is its calibration range. This range is selected based on the comparator offset and the yield target. To reduce area and power consumption, the transistors in the comparators are sized close to the minimum. Figure 6-20(A) shows the simulated comparator offset, which is 22.5 mV (0.9 LSB) and is dominated by the preamplifier. For a given calibration range, the yield is the probability that the offsets of all 32 comparators fall within this range; assuming a Gaussian distribution for the comparator offset, it is the 32nd power of the probability that a single offset falls within the range. Figure 6-20(B) shows the yield as a function of the normalized calibration range. To achieve a yield higher than 90%, the normalized calibration range should be higher than 6. In this prototype, the maximum I_DAC is programmable through the serial interface, and the simulated calibration range is shown in Figure 6-20(C). The other key parameter of the current-steering DAC is its resolution, which determines the calibration step and the achievable calibration accuracy as discussed previously. In this prototype, 5-b resolution is chosen. When the calibration range is

programmed to 5.4 LSB, the calibration step is 0.19 LSB. With the resistor ladder providing higher than 8-b linearity, this guarantees a calibration accuracy of 0.5 LSB according to the calibration-accuracy bound discussed earlier.

Figure 6-20. Simulated comparator performance. A) Offset (mV). B) Yield vs. normalized DAC range. C) Calibration range V_cal (mV) vs. maximum I_DAC (µA).
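The yield requirement can be checked numerically. The sketch below assumes that the normalized calibration range is the full (peak-to-peak) range divided by the offset standard deviation, that the range is centered on zero, and that the 32 comparator offsets are independent and Gaussian. The function name `calibration_yield` is hypothetical, and the interpretation of the normalization is an assumption, chosen because it reproduces the 90% threshold near a normalized range of 6.

```python
import math

def calibration_yield(normalized_range, n_comparators=32):
    """Probability that all comparator offsets fall inside the calibration range."""
    half_range_in_sigma = normalized_range / 2.0
    # P(|offset| <= k * sigma) for a zero-mean Gaussian is erf(k / sqrt(2)).
    p_single = math.erf(half_range_in_sigma / math.sqrt(2.0))
    return p_single ** n_comparators

for r in (4.0, 5.0, 6.0, 7.0):
    print(f"normalized range {r:.0f}: yield {calibration_yield(r):.3f}")
```

With these assumptions a normalized range of 6 (that is, ±3σ) gives a yield slightly above 90%, while a range of 4 gives well under 50%. It may also be noted that 6 times the simulated 0.9 LSB offset equals the 5.4 LSB programmed range.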

6.3.5 Digital Backend

A digital backend converts the output thermometer codes of the comparator array to binary format. It also provides the capability of correcting or minimizing errors due to bubbles or metastability. Figure 6-21 shows the block diagram of the digital backend. The three-input AND gate array converts the thermometer codes to one-hot codes and provides first-order bubble error correction. The one-hot codes are then used to address a quasi-Gray-code ROM encoder [98]. Simple XOR gates convert the quasi-Gray code to binary code. The binary codes are then decimated by 64 to accommodate the limited bandwidth of the test equipment.

Figure 6-21. Block diagram of the digital backend (thermometer code, one-hot code, pipelined ROM encoder, quasi-Gray code, binary code, decimator /64)

6.3.6 Reference ADC

The reference ADC comprises the resistor ladder, the switch bank SR, and the comparator C[0]. The resistor ladder is reused from the main ADC to reduce the calibration overhead. The switch bank SR is built with CMOS transmission gates and is controlled by the one-hot code S[1:31] to select the desired reference voltage for C[0] from the resistor ladder. The switch bank SR is implemented with simple CMOS

transmission gates. C[0] shares the same design as C[1:31] and does not involve any extra design effort.

6.3.7 Calibration Engine and Supporting Circuitry

The other calibration circuitry includes the FSM as the calibration engine, the SRAM to store the offset control words, the address decoder to facilitate the communication between the FSM and the SRAM, and the switch bank SQ. The FSM, the SRAM, and the address decoder are all built with standard cells, while the switch bank SQ is implemented with CMOS transmission gates, same as SR.

Figure 6-22. FSM flow chart. N is the calibration index, which is also the SRAM address. Flow: set N = 1; clear the error counter; compare Q[N] and Q[0] and update the error counter until 128 comparisons are done; update W[N]; if N = 31, return to N = 1, otherwise set N = N + 1 and repeat.

Figure 6-22 shows the flow chart of the FSM operation. At the beginning, the FSM sets N to 1. This sets S[1] HIGH so that the reference inputs of both C[0] and C[1] are connected to V_R[1]. Meanwhile, C[1]'s output is also selected. To improve noise immunity, the FSM then accumulates the results of 128 comparisons between the outputs of C[0] and C[1] before updating the control word W[1] in the SRAM. After that, the FSM sets N to 2 and calibrates C[2]. This process repeats cyclically for C[1:31] so that all the comparators are continuously calibrated in the background.
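The flow chart of Figure 6-22 maps naturally onto a nested loop. The model below is a behavioral sketch only. Several details are assumptions rather than facts from the text: the error counter is taken to be a signed up/down count, W[N] moves by one code per update in the direction of the counter's sign, the control word is treated as a signed integer limited to ±15, the input is uniform over the full scale, and C[0] is ideal. All voltages are expressed in LSB, and `run_fsm` is a hypothetical name.

```python
import random

N_COMP = 31     # comparators C[1:31] under calibration
N_VOTES = 128   # comparisons accumulated before each update of W[N]
W_LIMIT = 15    # assumed signed range of the 5-b control word (hypothetical)

def run_fsm(offsets_lsb, passes, step_lsb=0.19, noise_lsb=0.0, seed=0):
    """Return the control words after `passes` sweeps over C[1:31]."""
    rng = random.Random(seed)
    w = [0] * N_COMP                       # w[n-1] models W[n], cleared at start
    for _ in range(passes):
        for n in range(1, N_COMP + 1):     # calibration index N = 1..31
            v_r = float(n)                 # V_R[n], shared by C[0] and C[n]
            v_th = v_r + offsets_lsb[n - 1] + w[n - 1] * step_lsb
            counter = 0                    # error counter, cleared for each N
            for _ in range(N_VOTES):
                v_in = rng.uniform(0.0, N_COMP + 1)
                q_ref = v_in + rng.gauss(0.0, noise_lsb) > v_r    # Q[0]
                q_n = v_in + rng.gauss(0.0, noise_lsb) > v_th     # Q[N]
                if q_n and not q_ref:
                    counter += 1           # C[n] trips too early: threshold too low
                elif q_ref and not q_n:
                    counter -= 1           # C[n] trips too late: threshold too high
            if counter > 0:
                w[n - 1] = min(w[n - 1] + 1, W_LIMIT)
            elif counter < 0:
                w[n - 1] = max(w[n - 1] - 1, -W_LIMIT)
    return w
```

Setting `noise_lsb` above zero lets occasional wrong-polarity discrepancies enter the counter, which is the situation the 128-comparison accumulation is meant to guard against.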

Note that, with the help of SQ, the FSM directly reads the ADC's raw thermometer output instead of its decoded binary output. This eliminates the need for a 5-b digital comparator and bypasses the possible complication introduced by bubble error correction.

6.3.8 Clock and Power Distribution

Clock distribution is of crucial importance in high-speed ADC design. The clock buffers are sized for the same fan-out. Dummy loads are inserted in the clock tree to compensate for unbalanced loads. To account for the finite delay through the preamplifier, the clock of the T/H leads that of the comparators by one inverter delay. Since the clock of the FSM and the decimator is divided down from the full-speed clock and its phase relationship with the full-speed clock is unknown, multiple phases are generated for selection through the on-chip serial interface. The power is split into analog and digital domains. Decoupling capacitors are inserted wherever there is spare area. To prevent noise coupling through the substrate, a guard ring is inserted between the analog part and the digital part. The guard ring is connected to a dedicated ground pad, separate from the analog and digital ground pads [99].

6.4 Experimental Results

The prototype 5-bit flash ADC was fabricated in a 0.13 μm 1-poly 8-metal bulk CMOS process and was measured in a QFN package. Figure 6-23 shows the chip micrograph. The ADC core occupies an active area of 0.24 mm². Even without any layout optimization, the calibration circuitry takes less than 10% of the core area.


Figure 6-23. Chip micrograph (labeled blocks: SRAM, comparator, digital backend, FSM, R ladder, bias, clock tree; scale bar 100 μm).

The ADC was powered from a 1.2-V supply. The reference voltages V_RP and V_RN were set to 1.2 V and 0.8 V respectively, giving a differential full-scale input range of 0.8 V. The ADC's decimated digital output was captured by a mixed-signal oscilloscope and post-processed in Matlab. The ADC's static performance was evaluated by stepping the DC input voltage to the ADC and recording the levels at which the output toggles. The peak-to-peak noise observed during DC measurement is 2.5 mV, or roughly 0.1 LSB. To remove the effect of noise during the DC measurement, the output codes were averaged to find the transition levels. Figure 6-24 shows the measured INL and DNL with and without calibration. When calibration is disabled, i.e., when all the SRAM bits are cleared to 0 through the serial interface, the ADC has an INL of -1.85/1.48 LSB and a DNL as high as 2.75 LSB. Enabling calibration improves the INL to -0.21/0.17 LSB and the DNL to -0.07/0.04 LSB. The low calibrated DNL and INL clearly demonstrate the efficacy of the proposed calibration scheme.
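The static metrics follow directly from the measured transition levels. The sketch below shows one conventional way to compute them. It assumes the INL is referenced to an ideal staircase anchored at the first measured transition, which may differ from the fitting method actually used for Figure 6-24, and `dnl_inl` is a hypothetical helper name. With a 0.8 V differential full scale and 5 bits, one LSB is 25 mV, consistent with 2.5 mV being roughly 0.1 LSB.

```python
def dnl_inl(transitions, lsb):
    """Compute DNL and INL (in LSB) from a list of measured transition levels."""
    # DNL: deviation of each code width from one LSB.
    dnl = [(hi - lo) / lsb - 1.0 for lo, hi in zip(transitions, transitions[1:])]
    # INL: deviation of each transition from an ideal staircase anchored
    # at the first measured transition (an assumed reference line).
    inl = [(t - transitions[0]) / lsb - k for k, t in enumerate(transitions)]
    return dnl, inl

LSB = 0.8 / 2 ** 5                       # 25 mV for the 5-bit prototype
levels = [0.025, 0.050, 0.0875, 0.100]   # illustrative values, not measured data
dnl, inl = dnl_inl(levels, LSB)
print([round(d, 3) for d in dnl], [round(i, 3) for i in inl])
```

In this made-up example the third transition sits half an LSB too high, which widens one code and narrows the next by the same amount.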
