High Speed Clock and Data Recovery Techniques. Behrooz Abiri


High Speed Clock and Data Recovery Techniques
by Behrooz Abiri

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science, Graduate Department of Electrical and Computer Engineering, University of Toronto

Copyright © 2011 by Behrooz Abiri

Abstract

High Speed Clock and Data Recovery Techniques
Behrooz Abiri
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2011

This thesis presents two contributions in the area of high-speed clock and data recovery systems. These contributions focus on fast phase recovery and adaptive equalization techniques. The first contribution is an adaptation engine for a 2x blind sampling receiver. The proposed adaptation engine is able to find the phase-dependent DFE coefficients of the receiver on the fly. The second contribution is a burst-mode clock and data recovery architecture which uses an analog phase interpolator. The proposed burst-mode CDR is capable of locking to the first data transition it receives. The phase interpolator uses the inherent timing information in the data transition to rotate the phase of a reference clock and align it with the incoming data edge. The feasibility of the concept is demonstrated through fabrication and measurements.

Acknowledgements

I sincerely express my gratitude to my supervisor, Professor Ali Sheikholeslami, for his support and guidance throughout the time I've been with him. I thank Professors Tony Chan Carusone, Roman Genov, and Joyce Poon for serving on my thesis examination committee and for providing comments and recommendations that have enriched this thesis. I am thankful for the support and design reviews provided by Fujitsu's staff, especially Hirotaka Tamura, Bill Walker, and Masaya Kibune. My special thanks go to Ravi Shivnaraine and Saeid Rezaei for their great help during the measurements. I would also like to thank Siamak Sarvari for kindly providing all the information on his work which was used in one of my designs, and Clifford Ting for sharing his valuable experience with digital synthesis. Finally, I thank my mother, father, and sister for their encouragement, support, and understanding throughout my life.

Contents

Nomenclature
1 Introduction
  1.1 Motivation
  1.2 Thesis Objectives
  1.3 Thesis Outline
Part I - Adaptation Engine for Blind ADC-Based Receiver
2 Background: ADC-Based CDR
  2.1 Feed-forward Blind-sampling CDR
  2.2 DFE for Blind-sampling CDR
  2.3 ADC
  2.4 Summary
3 Proposed Adaptation Engine
  3.1 Architecture
  3.2 Proposed Adaptation Engine
  3.3 Simulation and Measurement Results
  3.4 Summary
Part II - A Phase-Interpolator-Based Burst-Mode CDR
4 Background: Burst-Mode CDR
  4.1 TDMA PON
  4.2 Previous work on BMCDR
    4.2.1 Oversampling Architecture
    4.2.2 Dual GVCO-based Architecture
    4.2.3 Single GVCO-based Architecture
    4.2.4 Injection Locking Architecture
  4.3 Summary
5 Proposed Burst-Mode CDR
  5.1 Proposed Phase Interpolator Approach
  5.2 Proposed Architecture
    5.2.1 Clock Recovery Unit
    5.2.2 Delay Locked Loop
  5.3 Simulation and Measurement Results
  5.4 Summary
6 Conclusion and Future Work
  6.1 Thesis Contributions
  6.2 Future Work
Bibliography

List of Tables

3.1 Comparison of BER, horizontal and vertical eye opening, before and after adaptation for the channels with S21 shown in Fig. 3.7(c)
3.2 Performance summary
5.1 Comparison of proposed architecture with previous work

List of Figures

2.1 Block diagram of (a) a typical binary CDR and (b) an ADC-based CDR
2.2 ADC sampling method in digital CDR: (a) phase-tracked clocking, (b) blind (asynchronous) sampling
2.3 Complete block diagram of the blind sampling ADC-based receiver
2.4 Block diagram of a feed-forward blind sampling CDR
2.5 (a) Digital phase (Φ_X) recovery from digital samples. (b) The digital 3rd-order low-pass filter [46]
2.6 Data selection scheme. (a) Selection of UI, based on Φ_AVG and number of transitions, (b) selection of sample in the UI
2.7 (a) DFE in phase tracking CDR, (b) DFE in a blind sampling CDR
2.8 Selection of DFE coefficients based on sampling phase. (a) DFE coefficients shown on pulse response, (b) full-rate implementation using a look-up table [39]
2.9 Interpolating flash ADC (PA: pre-amplifier, L: latch)
2.10 Clocked amplifier used in ADC pre-amplifier
2.11 Reference voltage generation by interpolation (a) without the interpolating amplifier, (b) with interpolating amplifier (IA)
2.12 Proposed interpolating flash ADC with clocked pre-amplifier (PA), interpolating amplifier (IA), and latch (L)
3.1 Conventional LMS adaptation engine in phase tracking CDR
3.2 1-tap DFE equalization in (a) phase tracking, and (b) blind sampling CDR. Note that in the blind case, the samples assume four levels corresponding to four different desired levels after equalization
3.3 Proposed LMS adaptation engine for blind-sampling CDR
3.4 Detailed block diagram of the proposed adaptation engine
3.5 Simulated high-frequency jitter tolerance comparison of desired waveforms, generated based on triangular and averaging technique with (a) added random jitter (RJ) to receiver clock and (b) added random noise (V_n) to the received signal. (Sinusoidal jitter frequency: 170 MHz, BER < 10^-6, RJ and V_n have Gaussian distribution for BER < 10^-6, used channel has 10 dB loss at Nyquist frequency.)
3.6 Block diagram of desired waveform generator
3.7 Channel insertion loss for (a) a 26-inch and (b) a 34-inch FR4 channel. (c) S21 of the channels used in simulation for Table 3.1
3.8 Simulated jitter tolerance comparison of adaptive DFE (this work) versus manual DFE (based on [39])
3.9 Die photo
3.10 Measurement setup
3.11 Eye diagrams before and after equalization for (a) a 26-inch and (b) a 34-inch Tyco channel
3.12 Measured learning curves
3.13 Simulated and measured jitter tolerance for (a) 26-inch and (b) 34-inch FR4 channels
4.1 PON splitting strategies: (a) simple one-stage, (b) multi-stage splitting, and (c) optical bus
4.2 TDMA in PON
4.3 OLT receiver
4.4 Burst-mode TIA
4.5 (a) Block diagram of oversampling burst-mode CDR, (b) principle of operation
4.6 (a) Block diagram of dual GVCO burst-mode CDR, (b) timing diagram
4.7 (a) Block diagram of single GVCO burst-mode CDR, (b) timing diagram
4.8 (a) Block diagram of injection-locking burst-mode CDR, (b) the gating circuit, (c) schematic of VCO [24], and (d) principle of operation
5.1 Error generation in the presence of frequency offset and long consecutive identical digits (CID)
5.2 Conceptual block diagram of a phase interpolator (PI)
5.3 Analog PI [20]
5.4 Digital PI [16]
5.5 Selection of PI coefficients; (a) ideal and (b) practical coefficients
5.6 Proposed BMCDR block diagram
5.7 (a) Block diagram of proposed clock recovery unit and (b) operation
5.8 Proposed BMCDR block diagram
5.9 (a) Double-edge-triggered S/H consisting of two (b) single-edge-triggered S/Hs
5.10 Bang-bang phase detector including a decimator
5.11 7-bit synchronous up/down counter
5.12 Post-layout simulation result for burst-mode acquisition. (a) Input data after CML-to-CMOS block, (b) PI coefficients provided by S/Hs, (c) recovered clock, and (d) recovered data. Note the instantaneous phase recovery
5.13 Die photo and area of building blocks
5.14 Measurement setup
5.15 (a) Measured recovered clock and (b) data for 6 Gb/s PRBS10 data
5.16 Measured PI linearity at (a) 4 Gb/s and (b) 6 Gb/s
5.17 Measured phase locking speed at (a) 1 Gb/s, (b) 2.5 Gb/s, (c) 4 Gb/s, and (d) 6 Gb/s

Nomenclature

ADC: analog-to-digital converter
ADSL: asymmetric digital subscriber line
ATM: asynchronous transfer mode
BER: bit error rate
BMCDR: burst-mode clock and data recovery
CDR: clock and data recovery
CID: consecutive identical digits
CO: central office
DAC: digital-to-analog converter
DCDL: digitally controlled delay line
DFE: decision feed-back equalizer
DLL: delay locked loop
DOCSIS: data over cable service interface specification
FFE: feed forward equalizer
FIFO: first in first out
FTTH: fiber to the home
GVCO: gated voltage controlled oscillator
HDMI: high-definition multimedia interface
I/O: input/output
IA: interpolating amplifier
IC: integrated circuit
LA: limiting amplifier
LMS: least mean square
LSB: least significant bit
MSB: most significant bit
OLT: optical line termination
ONU: optical network unit
PA: pre-amplifier
PD: phase detector
PD: photo diode
PLL: phase locked loop
PON: passive optical network
PVT: process, voltage, and temperature
S/H: sample and hold
TDMA: time division multiple access
TIA: transimpedance amplifier
UI: unit interval
USB: universal serial bus
VCDL: voltage controlled delay line
WDMA: wavelength division multiple access

Chapter 1 Introduction

Advances in silicon technology have provided cheap computational power to the public and resulted in the widespread use of electronic gadgets as a means of communication, entertainment, and business. As their computational power increases, the rate of data exchange between the building blocks of these gadgets increases. To maintain low production cost, it is necessary to limit the number of I/O pins of the ICs. Thus, a higher data rate per pin is required to accommodate the demand for a higher aggregate inter-chip data exchange rate.

The increase in bandwidth demand is not limited to the intra-device scale. The emergence of applications such as video conferencing, video on-demand, cloud computing, and on-line data storage services has introduced a great demand for connection bandwidth between end-users and service providers. This demand has triggered the replacement of copper-based connections with optical fibers. To reduce the tremendous cost of such replacement, networks with fewer active optical components are preferred. The passive optical network (PON) has gained popularity as the last-mile optical delivery network because of the lower cost that it imposes on data service providers.

1.1 Motivation

Channel imperfections have started to influence data integrity as the per-pin data rate of ICs has increased. The higher cost of improved backplanes compared to the cost of the extra on-chip circuitry required to resolve the data integrity degradation has been the driving force of research on high-speed signaling [17, 13, 34, 10, 30]. The main non-ideality of the channel that manifests itself at higher data rates is the limited bandwidth of the channel. Part of this bandwidth limitation arises from dielectric losses and the other part arises from conductive losses, which become worse at higher frequencies due to the skin effect.

To overcome this bandwidth limitation, equalization techniques have been widely used. Many of these equalization schemes use analog circuitry [30, 45, 19, 25]. While analog circuits may require less power than their digital counterparts, they suffer greatly from process, voltage, and temperature variations, especially in higher technology nodes. Analog circuits do not benefit as much as digital circuits from technology scaling, and they also require complete redesign when the design is to be ported from one process to another.

The digital equalizers that have been reported in [5, 9, 15, 49, 39] either use a recovered clock to sample the data at the center of the UI [5, 9, 15] or use blind sampling [49, 39]. The use of a blind sampling architecture decouples the digital and analog blocks, yielding faster development and easier portability. The adaptive digital equalizer reported in [49] is a feed forward equalizer (FFE) that suffers from cross-talk and quantization noise enhancement. While the decision feed-back equalizer (DFE) reported in [39] removes such noise enhancement, it is not adaptive, meaning that the DFE coefficients must be obtained and programmed manually for each channel that the equalizer has to work with.
While this scheme may be acceptable for some applications, it cannot be used for applications in which the channel is not fixed (e.g. universal serial bus (USB) and high-definition multimedia interface (HDMI)). The first part of this thesis proposes an adaptation engine which integrates with a DFE [39] in a blind sampling ADC-based clock and data recovery (CDR).

The current revisions of passive optical network (PON) standards [35, 2] use time division multiple access (TDMA) to allow several users to connect to a central office (CO). This type of multiple access scheme requires the use of a so-called burst-mode CDR (BMCDR) in the receiver of the CO. The task of the BMCDR is to quickly recover the phase of incoming data packets sent from different users, and thus reduce TDMA overhead [22]. Due to stability issues, it is hard to design a wide-band closed-loop CDR that provides such fast phase locking. Open-loop CDRs, on the other hand, are the designer's choice for fast phase recovery. In the second part of this thesis, a BMCDR based on phase interpolation is proposed.

1.2 Thesis Objectives

This thesis presents the design and implementation of an adaptation engine for the DFE of a blind sampling, ADC-based CDR and a phase-interpolator-based burst-mode CDR. The specific objectives of this thesis are the following:

- Investigating the feasibility of an adaptation engine for the DFE reported in [39] and proposing a scheme to overcome the challenges introduced by blind sampling.
- Design, implementation, and demonstration of the operation of the proposed adaptation engine, integrated with the DFE and receiver proposed in [39] and [46].
- Investigating the possibility of fast data phase acquisition using phase interpolators and proposing an architecture for a BMCDR based on phase interpolation.
- Design, implementation, and testing of the proposed BMCDR to verify the operation of the scheme.

1.3 Thesis Outline

This thesis is organized in two parts. Part I describes the adaptation engine. The PI-based BMCDR is presented in Part II.

Part I includes two chapters. Chapter 2 provides background, while Chapter 3 provides details of the implementation and the simulation and measurement results for the proposed adaptation engine integrated with the ADC-based receiver.

Part II also includes two chapters. Chapter 4 presents a brief background on BMCDRs and on PON as their main application. Chapter 5 describes the proposed BMCDR, including the simulation and measurement results.

The results of this thesis are summarized in Chapter 6, along with a discussion of potential future directions for research.

Part I
Adaptation Engine for Blind ADC-Based Receiver

Chapter 2 Background: ADC-Based CDR

The increasing demand for higher data rates through legacy backplane channels with limited bandwidth has introduced severe signal degradation due to inter-symbol interference (ISI) in the received signal. To recover data from this severely degraded signal, high equalization levels are required [18]. While analog equalization can be used in binary CDRs as shown in Fig. 2.1(a), the use of an ADC as the sampler provides another layer of equalization in the digital domain. The combined analog and digital equalization can be used to recover data from higher-attenuation channels (Fig. 2.1(b)). Digital equalizers are easy to design and are portable across technology nodes because they can be implemented in RTL. In addition, digital equalizers consume less power with technology advancement and are more robust to PVT variations.

Figure 2.1: Block diagram of (a) a typical binary CDR and (b) an ADC-based CDR.

Figure 2.2: ADC sampling method in digital CDR: (a) phase-tracked clocking, (b) blind (asynchronous) sampling.

As shown in Fig. 2.2, the sampling clock in ADC-based receivers can either track the phase of the incoming data (CK_REC) or ignore it, in which case a blind (asynchronous) clock, CK_Blind, is used. In a phase tracking system, as shown in Fig. 2.2(a), a digital phase detector compares the phase of the incoming data with the phase of the sampling clock. A low-pass filter then sends digital control bits to a digitally-controlled oscillator (DCO) or a phase interpolator (PI) in order to adjust the phase of the sampling clock [15]. In this system, there is a feedback loop containing both digital and analog components and, as a result, the delay of the feedback loop plays an important role in the stability of the system [8]. During the design, the delay of both the digital and analog blocks in the loop should be taken into account, which complicates the mixed-signal design. On the other hand, a blind sampling CDR [46], as shown in Fig. 2.2(b), eliminates the feedback path and hence is unconditionally stable. This allows for independent design of the ADC and the remaining digital building blocks.

As mentioned earlier, the main advantage of an ADC-based CDR is the availability of extra equalization in the digital domain. This extra equalization can be done either as a feed-forward equalizer (FFE) or a decision-feedback equalizer (DFE). An FFE [49] boosts both the signal and the noise at high frequencies. This noise, in the case of an ADC-based CDR, includes the ADC quantization noise, which may limit the performance. A DFE for blind sampling CDR is proposed in [39] to address this noise enhancement. In [39], the DFE coefficients are obtained manually by measuring the pulse response of the channel and subtracting it from a desired pulse response, where the latter is defined so as not to contain any ISI. This approach, however, does not lend itself easily to adaptation unless the data communication is interrupted or initiated by a training sequence so as to obtain the channel pulse response. To overcome this limitation, we propose [4] an adaptive DFE where the DFE coefficients are obtained during data transmission, i.e. without interruption by a training sequence. We have further integrated this adaptive DFE with the rest of the building blocks to demonstrate a complete receiver as shown in Fig. 2.3.

Figure 2.3: Complete block diagram of the blind sampling ADC-based receiver.

2.1 Feed-forward Blind-sampling CDR

Fig. 2.4 shows a simplified block diagram of a blind-sampling, feed-forward CDR [46], where the ADC samples the data at twice the data rate and a digital phase detector calculates the sampling phase of the ADC with respect to the incoming data. A data selection block then decides which sample to choose as the recovered data, based on the sampling phase. In this section, the details of phase recovery and data selection are presented. The ADC implementation is described in Section 2.3.

Figure 2.4: Block diagram of a feed-forward blind sampling CDR.

Figure 2.5: (a) Digital phase (Φ_X) recovery from digital samples, Φ_X = 0.5 · S_0 / (S_0 − S_1). (b) The digital 3rd-order low-pass filter [46] (block labels: gains 1/8, 1/256, 1/512; integrators z^-1 / (1 − z^-1)).

Fig. 2.5(a) shows the method of phase recovery using linear interpolation. The samples are first arranged in groups of three, with one sample being shared between two adjacent groups. The position of a possible zero-crossing with respect to the first sample of the group, Φ_X, is calculated using linear interpolation. A digital 3rd-order low-pass filter (shown in Fig. 2.5(b)) averages these instantaneous zero-crossings and produces an average phase, Φ_AVG.

The data selection scheme depicted in Fig. 2.6 is a two-step process. If there is at most one transition between the three samples, two of them are selected based on Φ_AVG. If 0 ≤ Φ_AVG < 0.5, the second and third samples are chosen. Otherwise, the first and second samples are chosen. If there is a double transition between the three samples, the middle sample is chosen and the second step of the data selection process is bypassed. In the second step of the process, Φ_X is used to select the sample whose sign is not affected by jitter. Fig. 2.6(b) shows the selection for the case of 0 ≤ Φ_AVG < 0.5.

Figure 2.6: Data selection scheme. (a) Selection of UI, based on Φ_AVG and number of transitions, (b) selection of sample in the UI.

The linear interpolation in the PD requires smooth data transitions for accurate phase recovery. While a loss of 5 dB or more in typical channels is sufficient for this purpose, an anti-aliasing filter [42] has to be integrated with the receiver for shorter channels or when the CDR operates at lower data rates, where channel attenuation drops significantly.

A frequency offset between the transmitter data rate and the receiver sampling rate causes the samples to drift in the UI. Whenever the sampling phase moves one UI forward (backward), one sample needs to be inserted (dropped). The CDR produces a signal, N_valid, which is sent to the FIFO to add or drop the extra bit (refer to Fig. 2.3). In this work, the FIFO data is read out at the exact rate of the incoming data and, hence, the FIFO never overflows or underflows. In a commercial product, flow control in the data link layer is needed to adjust data throughput so that the FIFO will not overflow or underflow.
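The phase recovery and the first data selection step described in this section can be sketched in Python. This is only a minimal illustration under stated assumptions, not the RTL used in [46]: the function names are hypothetical, samples are taken as signed values spaced 0.5 UI apart, and the expression for a crossing between the second and third samples is an assumed extension of the Φ_X formula in Fig. 2.5(a). The second selection step, which uses Φ_X, is left out because its exact rule is not spelled out in the text.

```python
def zero_crossing_phase(s0, s1, s2):
    """Estimate the zero-crossing position (in UI) relative to s0.

    Samples are assumed to be 0.5 UI apart. Returns None when the
    three-sample window contains no sign change.
    """
    if s0 * s1 < 0:
        # Crossing between the first and second samples (Fig. 2.5(a)).
        return 0.5 * s0 / (s0 - s1)
    if s1 * s2 < 0:
        # Assumption: same interpolation, offset by half a UI.
        return 0.5 + 0.5 * s1 / (s1 - s2)
    return None


def select_candidates(samples, phi_avg):
    """First selection step: pick the samples that belong to the current UI.

    samples is a (s0, s1, s2) window; phi_avg is the filtered phase in UI.
    """
    s0, s1, s2 = samples
    transitions = (s0 * s1 < 0) + (s1 * s2 < 0)
    if transitions == 2:
        # Double transition: the middle sample is taken directly.
        return (s1,)
    if 0 <= phi_avg < 0.5:
        return (s1, s2)
    return (s0, s1)
```

For example, a window of (1.0, -1.0, -1.0) places the crossing a quarter of a UI after the first sample, since 0.5 · 1.0 / (1.0 + 1.0) = 0.25.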

Chapter 2. Background: ADC-Based CDR 12 2.2 DFE for Blind-sampling CDR The structure of digital DFE depends on the sampling scheme. In a phase tracking CDR, the sampling is performed at the eye-center of incoming data. This fixed sampling phase implies that the main cursor and the first post-cursor ISI are fixed for a given channel, thus the ISI replica generation block is simply providing a constant DFE coefficient, as illustrated in Fig. 2.7(a). On the other hand, in a blind sampling CDR, as the sampling phase sweeps the UI, the value of main cursor and first post-cursor ISI change, as shown in Fig. 2.7(b). This implies that the ISI replica generation should take into account the sampling phase and dynamically change the DFE coefficients according to the sampling phase. To address the variable DFE coefficients of different sampling phases, the authors in [39] propose dividing the UI into eight bins (as shown in Fig. 2.8(a)) and choosing an appropriate DFE coefficient from a look-up table based on where the sampling phase falls within one UI. Fig. 2.8(b) shows the simplified full-rate implementation of DFE for the CDR as proposed in [39]. As can be seen from this figure, two look-up tables produce the phase-dependent DFE coefficients for even and odd samples based on Φ AV G. Since the samples are half a UI apart, the corresponding DFE coefficients are shifted by four in the look-up table. The CDR uses three-sample windows to calculate the sampling phase. The samples are arranged such that two of the samples correspond to the current UI (b n ) while the other corresponds to the previous UI (b n 1 ). Hence for the implementation of 1-tap DFE, both b n 1 and b n 2 are required to remove the first post cursor ISI from b n and b n 1 respectively. The measurement in [39] show that the manual 1-tap DFE is only capable of equalizing up to 13.3 db of attenuation. 
However, for typical channels with higher attenuation, a 2-tap DFE or a combination of the 1-tap DFE with a linear equalizer should be used. Theoretically, the DFE combined with the FFE presented in [49] is capable of equalizing channels with up to 28 dB of attenuation.
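The phase-binned coefficient selection described in this section can be sketched as follows. This is a minimal illustration, not the implementation of [39]: the function names, the 0-based bin indexing, the bit encoding of +1/-1, and the subtraction form of the 1-tap DFE are all assumptions.

```python
NUM_BINS = 8  # the UI is divided into eight phase bins


def phase_to_bin(phi_avg):
    """Map a sampling phase (in UI) to a bin index 0..7 (0-based indexing is assumed)."""
    return int((phi_avg % 1.0) * NUM_BINS) % NUM_BINS


def select_coefficients(lut, phi_avg):
    """Return the DFE coefficients for the even and odd samples.

    The two samples are half a UI apart, so the odd-sample coefficient
    is read four entries away from the even-sample one.
    """
    even_bin = phase_to_bin(phi_avg)
    odd_bin = (even_bin + NUM_BINS // 2) % NUM_BINS
    return lut[even_bin], lut[odd_bin]


def dfe_1tap(sample, coeff, prev_bit):
    """Remove the first post-cursor ISI replica; prev_bit is +1 or -1 (assumed encoding)."""
    return sample - coeff * prev_bit
```

For a sampling phase of 0.30 UI, for instance, the even sample uses the entry of bin 2 and the odd sample the entry of bin 6.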

Figure 2.7: (a) DFE in phase tracking CDR, (b) DFE in a blind sampling CDR.

Figure 2.8: Selection of DFE coefficients based on sampling phase. (a) DFE coefficients shown on pulse response, (b) full-rate implementation using a look-up table [39].

Figure 2.9: Interpolating flash ADC (PA: pre-amplifier, L: latch).

2.3 ADC

Flash ADCs are known to have a higher conversion rate than other ADC architectures. The feed-forward blind-sampling CDR requires a sampling rate of 10 GS/s, which in [46, 39] is provided by four time-interleaved 5-bit flash ADCs, each sampling at 2.5 GS/s. The ADC sampling clocks are generated by a 4-phase clock divider which is driven by an external 5 GHz clock source.

In high-speed applications, the ADC is the major contributor to the overall power consumption of the receiver, so it is important to reduce the power consumption of the flash ADC as much as possible. While a 10 GS/s non-interleaved flash ADC can be realized in a 65 nm process by using CML latches, the resulting power consumption is considerably higher than that obtained by time-interleaving four slower ADCs that use CMOS latches to achieve the same sampling rate.

Although time interleaving increases the aggregate sampling rate, it reduces the input bandwidth of the ADC because it increases the input capacitance. To reduce the input capacitance of each ADC, an interpolating flash ADC [47] was used in [46, 39], which lowers the number of pre-amplifiers (PAs) that load the input node. The remainder of

Figure 2.10: Clocked amplifier used in ADC pre-amplifier.

this section describes the detailed implementation of the ADC used in [46, 39].

Fig. 2.9 shows an overall block diagram of an interpolating flash ADC. The PAs amplify the difference between the input signal and the reference voltages. For a typical 5-bit flash ADC, a total of 31 PAs are required at the front end. In an interpolating ADC, a total of 17 PAs are used instead, relying on a resistive ladder to generate the remaining 14 levels. The PA and resistive ladder outputs are then latched and sent to a thermometer-to-binary encoder.

It is desirable for a PA to have a high gain, as this reduces the effect of latch offset and the probability of metastability [27]. The gain offered by a continuous-time PA is not sufficient for high-speed applications due to the inherent tradeoff posed by the gain-bandwidth product of the PA [12]. Instead, as shown in Fig. 2.10, a Strong-Arm regenerative PA was used, where the overdrive recovery is improved by resetting the previous state of the amplifier.

In an interpolating flash ADC, the PAs must be linear; otherwise the interpolated values will not correspond to the correct intermediate reference voltages. The implemented regenerative PA has a high gain and its output will easily enter a nonlinear region. To demonstrate this point, Fig. 2.11(a) shows the outputs of two adjacent

Figure 2.11: Reference voltage generation by interpolation (a) without the interpolating amplifier, (b) with interpolating amplifier (IA).

Figure 2.12: Proposed interpolating flash ADC with clocked pre-amplifier (PA), interpolating amplifier (IA), and Latch (L).

PAs, PA(N) and PA(N+2), when the input voltage lies between their two input reference voltages but is closer to V_ref(N). As a result, the output of PA(N), V_O(N), has a smaller slope magnitude than that of PA(N+2), V_O(N+2). The difference in slope causes the interpolated voltage, V_OI(N+1), to initially become negative (which is the expected correct value) and then move towards zero as the outputs of the PAs saturate. This would be an incorrect interpolated value and may send the following latch into a metastable state.

To overcome this problem, another regenerative amplifier, denoted by IA in Fig. 2.12, was added, with its sampling aperture occurring after the PAs' aperture and before the settling of their outputs. The timing diagram of this modified structure is shown in Fig. 2.11(b). The IA performs interpolation by amplifying the transient output difference of two PAs while it is valid. The same clock that triggers the PAs also triggers the IAs. The amplifying window of the interpolating amplifiers is delayed with respect to the PAs by reducing the size of M_1 and M_2 (Fig. 2.10) in the IAs with respect to the corresponding transistors in the PAs.
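The saturation problem described above can be reproduced with a toy model. The sketch below assumes that each regenerative PA output grows exponentially from its initial input difference and clips at a saturation level; the exponential law, the time constant, the saturation voltage, and the function names are illustrative assumptions and are not taken from the circuit in [46, 39].

```python
import math


def pa_output(v_diff, t, tau=1.0, v_sat=1.0):
    """Toy regenerative PA: exponential growth of the input difference, clipped at +/- v_sat."""
    v = v_diff * math.exp(t / tau)
    return max(-v_sat, min(v_sat, v))


def interpolated_output(v_in, vref_lo, vref_hi, t, tau=1.0, v_sat=1.0):
    """Resistor-only interpolation: the midpoint of the two adjacent PA outputs."""
    v_lo = pa_output(v_in - vref_lo, t, tau, v_sat)  # PA(N): small positive overdrive
    v_hi = pa_output(v_in - vref_hi, t, tau, v_sat)  # PA(N+2): larger negative overdrive
    return 0.5 * (v_lo + v_hi)
```

With an input closer to the lower reference, the interpolated value is negative while both outputs are still in their linear growth phase, and it collapses to zero once both outputs have saturated. This is why the IA aperture has to fall before the PA outputs settle.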

2.4 Summary

This chapter provided background on the blind-sampling ADC-based receiver and on a DFE for this receiver that equalizes the received signal despite its blind sampling. Some of the building blocks of the digital receiver were further described.

Chapter 3 Proposed Adaptation Engine

3.1 Architecture

Fig. 3.1 shows the block diagram of the conventional LMS adaptation engine for a phase-tracking CDR. In this diagram, rx_n represents the received signal at a discrete time n, which corresponds to the center of the n-th UI. Similarly, s_n and b_n represent the equalized signal and the recovered bit corresponding to the same n-th interval. The core of the adaptive engine subtracts a reference level, d_ref, from the equalized signal, s_n, to produce an error signal, e_n. This error signal is then correlated with the previous recovered bit, b_{n-1}, to produce the DFE coefficient, c, for a 1-tap DFE.

Figure 3.1: Conventional LMS adaptation engine in phase tracking CDR.

If we limit the channel to one that produces only one post-cursor ISI, rx_n can take one of four values, as depicted in Fig. 3.2(a). After the ISI is removed, the equalized

signal, s_n, can only take one of two values. In fact, these two values are used as the reference levels in Fig. 3.1.

For a CDR with a blind clock, on the other hand, the choice of d_ref is more complicated, as illustrated in Fig. 3.2(b). In this case, the sampling clock is not phase-aligned to the center of the UI, and hence the equalized signal at the sampling phase may assume any of four possible values. The two values corresponding to no transition do not depend on the sampling phase, while the two that correspond to data transitions do. With transition filtering, the adaptation engine can use either set as the desired levels; however, the phase-dependent set provides a better reference for the samples near the zero-crossings and can thus guide the adaptation engine to equalize those samples. Equalization of the edge samples has a great impact on the performance of the 2x blind-sampling receiver, where at least one of the samples is usually close to the edge of the UI.

To accommodate these phase-dependent desired levels, we propose the modified LMS engine shown in Fig. 3.3. The d_ref generator block in this diagram produces a desired level corresponding to the sampling phase. The only remaining problem is that the DFE coefficient has to change with the sampling phase; if the adaptation is slower than the rate at which the sampling phase changes, the adaptation may not converge to its final value for that phase. To resolve this issue, eight registers are used to store the DFE coefficients, as in [39], but they are updated dynamically. At each sampling phase, only the corresponding DFE coefficient is updated. In this way, each coefficient reaches its final value corresponding to that sampling phase. This may require several passes of the sampling phase through that phase bin.

3.2 Proposed Adaptation Engine

Fig. 3.4 shows the detailed implementation of the proposed adaptation engine. To reduce adaptation area and power overhead, only two consecutive ADC samples, S_{8,9}, are

Figure 3.2: 1-Tap DFE equalization in (a) phase tracking, and (b) blind sampling CDR. Note that in the blind case, the samples assume four levels corresponding to four different desired levels after equalization. (Each panel shows the unequalized and equalized levels for the bit pairs (b_{n-1}, b_n) = (1, 1), (0, 1), (1, 0), and (0, 0).)

Figure 3.3: Proposed LMS adaptation engine for blind-sampling CDR.

Figure 3.4: Detailed block diagram of the proposed adaptation engine.

used. Based on these samples and Φ_AVG, the desired waveform generator block produces phase-dependent desired levels, d_{ref-1,2}, which correspond to sampling phases 1/2 UI apart. d_{ref-1,2} are then compared with the corresponding equalized samples, S_{EQ-8,9}. The resulting errors are multiplied by the adaptation loop gain, g, and the previous recovered bit. DFE coefficient updates are produced after transition filtering, which removes errors that do not correspond to data transitions. Two 1:8 DMUXes use Φ_AVG to select the two accumulators that store the corresponding DFE coefficients to be updated. The DFE coefficient select block (DCS) then selects the two DFE coefficients, c_{1,2}, that are used in the DFE adders.

The shape of the desired waveform can be derived from an equalized eye by dividing the UI into 8 bins and then averaging the samples that fall in each bin. One drawback of this averaging scheme is the extra hardware required to store and update the desired waveform. Another drawback is the formation of interacting adaptation and desired-waveform generation loops, which can cause unpredictable behavior. As an example, if the adaptation starts with zero initial conditions, the eye opening at the output of the DFE would be small, producing in turn small desired levels. As a result, the adaptation would not be able to work properly and the eye opening would not improve.
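The per-bin update performed by the adaptation engine can be sketched as below. This is a simplified single-sample illustration under stated assumptions: bits are encoded as +1/-1, the error is defined as the equalized sample minus the signed desired level, and the function and argument names are hypothetical. It is not the hardware datapath of Fig. 3.4.

```python
def lms_update(coeffs, bin_idx, s_eq, d_ref, bit, prev_bit, gain):
    """Update only the DFE coefficient of the current phase bin (in place).

    coeffs:   list of eight per-bin DFE coefficients (the accumulators)
    bin_idx:  phase bin selected by the sampling phase
    s_eq:     equalized sample
    d_ref:    magnitude of the desired level for this phase (assumed non-negative)
    bit, prev_bit: current and previous recovered bits, encoded as +1 / -1
    gain:     adaptation loop gain g
    """
    if bit == prev_bit:
        # Transition filter: errors without a data transition are discarded.
        return coeffs[bin_idx]
    error = s_eq - d_ref * bit                   # assumed error sign convention
    coeffs[bin_idx] += gain * error * prev_bit   # correlate error with previous bit
    return coeffs[bin_idx]
```

With this sign convention, a coefficient that is too small leaves residual post-cursor ISI that correlates positively with the previous bit, so the accumulator increases until the residual vanishes. Only the selected bin changes, which is what lets each coefficient settle over several passes of the sampling phase.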

Figure 3.5: Simulated high-frequency jitter tolerance comparison of desired waveforms, generated based on triangular and averaging technique with (a) added random jitter (RJ) to receiver clock and (b) added random noise (V_n) to the received signal. (Sinusoidal jitter frequency: 170 MHz, BER < 10^-6, RJ and V_n have Gaussian distribution for BER < 10^-6, used channel has 10 dB loss at Nyquist frequency. Axes: HF jitter tolerance (UI_pp) versus (a) RX RJ (UI_pp) and (b) RX V_n (V_pp).)

Figure 3.6: Block diagram of desired waveform generator.

Another way to produce the desired waveform is to use a fixed, pre-defined shape with adjustable amplitude to accommodate different input power levels. A triangular waveform is a suitable candidate because it is consistent with the linear interpolation performed by the PD. In other words, if the adaptation converges perfectly so that the equalized eye becomes diamond-shaped, then the error in the PD due to the linear interpolation should be minimal.

It is possible to merge the two methods described above to produce the desired levels. First, we let the engine adapt based on a pre-defined desired waveform and then switch to the averaging technique. Fig. 3.5 compares the performance of this combined approach against that of a triangular waveform only. It can be seen that the receiver jitter tolerance is better with the averaging scheme whenever high levels of random noise or jitter are added to the received signal or the receiver clock, respectively.

In the actual implementation, we used the triangular desired waveform because of its simplicity and lower overhead compared to the other method. The desired waveform generator is shown in Fig. 3.6. Two dynamic look-up tables calculate the desired levels for two samples that are 1/2 UI apart, based on a stored triangular waveform and Φ_AVG. The height of the triangular waveform is adjusted based on the incoming data amplitude. The ADC samples that are closer to the center of the eye are rectified and averaged to produce an approximation of the incoming data amplitude.
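A minimal sketch of such a triangular desired-waveform generator is given below. It assumes that the phase is measured from the edge of the UI, so that the triangle peaks at 0.5 UI and is zero at the UI edges; this phase reference, the function names, and the use of a plain mean for the amplitude estimate are assumptions, not details of the implemented look-up tables.

```python
def estimate_amplitude(center_samples):
    """Rectify and average the samples taken near the eye center."""
    if not center_samples:
        return 0.0
    return sum(abs(s) for s in center_samples) / len(center_samples)


def triangular_level(phi, amplitude):
    """Triangular desired level: zero at the UI edges, `amplitude` at the UI center.

    phi is the sampling phase in UI, assumed to be measured from the UI edge.
    """
    return amplitude * (1.0 - abs(2.0 * (phi % 1.0) - 1.0))


def desired_levels(phi_avg, amplitude):
    """Desired levels for the two samples that are 1/2 UI apart."""
    return (triangular_level(phi_avg, amplitude),
            triangular_level(phi_avg + 0.5, amplitude))
```

Because the waveform shape is fixed and only its height tracks the input amplitude, the desired levels do not depend on the state of the DFE, which avoids the interacting loops of the averaging scheme.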

The limited bandwidth and non-linearity of the analog front end (AFE) and the quantization noise of the ADC may adversely affect the adaptation or equalization. The bandwidth limitation of the AFE can be absorbed into the channel loss; thus it only reduces the equalization range of the 1-tap DFE. Both non-linearity and quantization noise can be represented as additive noise, and as a result they can also reduce the equalization range, since they degrade the received signal on top of the ISI degradation. For a random bit sequence, however, the adaptation loop remains almost unaffected, because it finds the DFE coefficients by correlating the equalization error with the previous bit and averages out any high-speed uncorrelated variation caused by the quantization noise and non-linearity.

3.3 Simulation and Measurement Results

The channels used in measurement consist of two FR4 daughter cards with 5-inch traces each and a backplane with adjustable trace length. The total lengths of the FR4 channels are 26 inches and 34 inches, corresponding to insertion losses of 9.9 dB and 13.3 dB at the Nyquist frequency of 2.5 GHz (Fig. 3.7(a) and 3.7(b)).

The functional simulations were performed in Simulink using event-driven modeling [48] to increase simulation speed. The pulse responses of the channels, extracted from measured S-parameters, were used in the simulation to emulate channel attenuation. The effect of the adaptive 1-tap DFE on the vertical and horizontal eye opening of the received signal and on the BER of the receiver for different channels is presented in Table 3.1. Although the 1-tap DFE is not able to open the eye for the lossiest channels, the adaptation still improves the BER.

Fig. 3.8 compares the simulated jitter tolerance of the receiver with the adaptive DFE (this work) against the manual DFE (based on [39]). In both simulations, the target BER is 10^-6 (as contrasted with 10^-12 in measurements) and PRBS7 is used. A frequency offset of 50 ppm is introduced between the receiver and

Channel att.   Before EQ                                After EQ
at Nyq. (dB)   BER       Horizontal    Vertical         BER       Horizontal    Vertical
                         eye opening   eye opening                eye opening   eye opening
6.9            < 10^-6   0.552 UI      320 mV           < 10^-6   0.6185 UI     288 mV
9.4            < 10^-6   0.474 UI      224 mV           < 10^-6   0.6538 UI     272 mV
10.9           2.25e-4   0.365 UI      96 mV            < 10^-6   0.568 UI      200 mV
12.4           0.0019    0.245 UI      80 mV            < 10^-6   0.475 UI      192 mV
14.9           0.169     0             0                0.0046    0.176 UI      40 mV
19.8           0.381     0             0                0.164     0             0
22.9           0.46      0             0                0.323     0             0

Table 3.1: Comparison of BER, horizontal and vertical eye opening, before and after adaptation for the channels with S21 shown in Fig. 3.7(c).

transmitter clock frequencies to emulate blind sampling. In addition, Gaussian random jitter of 0.17 UI_PP and 0.23 UI_PP is introduced to the transmitter and the receiver clock, respectively. The simulation results confirm that adaptation is achieved with little or no loss in performance (jitter tolerance) in the 34-inch channel.

To find the limit of adaptation, the manual DFE coefficients were swept for a given channel, and the sets of coefficients that reduced the receiver BER to less than 10^-6 were compared to the adapted coefficients in the adaptive DFE. It was observed that the manual 1-tap DFE is able to reach the target BER for channels with up to 14.8 dB of attenuation, whereas the adaptive DFE, in spite of the convergence of its coefficients, was unable to achieve the target BER at that attenuation. Although the adaptive DFE falls behind the manual DFE by 1.5 dB, it automatically provides DFE coefficients that are otherwise quite time-consuming to find.

The receiver test chip was implemented in Fujitsu's 65 nm CMOS process. The die photo is shown in Fig. 3.9. The ADC and the digital CDR, including all the test structures, occupy areas of 400 × 490 µm² and 420 × 640 µm², respectively.

Figure 3.7: Channel insertion loss for (a) a 26-inch and (b) a 34-inch FR4 channel. (c) S21 of the channels used in simulation for Table 3.1. (Panel (c) plots S21 in dB versus frequency in GHz for channels with 6.9, 9.4, 10.9, 12.4, 14.9, 19.8, and 22.9 dB of attenuation at Nyquist.)

Figure 3.8: Simulated jitter tolerance comparison of adaptive DFE (this work) versus manual DFE (based on [39]). (Plot conditions: BER < 10^-6, pattern PRBS7, Tx-Rx clock frequency offset of 50 ppm, Tx RJ = 0.17 UI_PP, Rx RJ = 0.23 UI_PP.)

Figure 3.9: Die photo. (Die size: 1.9 mm × 1.9 mm. Labeled blocks: A: input buffers, B: 4 × 2.5 GSa/s ADC, C: 4:16 DeMUX, D: digital CDR/DFE + test structures, E: BGR and bias generator, F: pad drivers. Process: 65-nm CMOS, data rate: 5 Gb/s, supply: 1.2 V, ADC power: 114 mW, digital power: 78 mW.)

Figure 3.10: Measurement setup.

A simplified measurement setup is shown in Fig. 3.10. A Centellax board generating PRBS7 at 5 Gb/s was used as the data source. The output amplitude of the PRBS generator did not cover the ADC's input range; therefore, we used a wide-band amplifier with a gain of 7 dB after the PRBS generator. Based on the on-chip PRBS checker, the receiver operates at 5 Gb/s with BER < 10^-12.

Fig. 3.11 shows the reconstructed eye diagrams of the received data before and after the 1-tap DFE equalization. A small frequency offset between the receiver and the transmitter was used so that the sampling points sweep the UI. The samples from the ADC and DFE were extracted and post-processed to produce the eye diagrams. For the 34-inch channel, the adaptive DFE is able to open the otherwise closed eye of the received data by 320 mV.

The learning curves of the DFE coefficients are shown in Fig. 3.12. Coefficients 1 to 4 are shown in the first row and 5 to 8 in the second row. It can be seen that the DFE coefficients converge in around 80 µs. The implemented adaptation engine uses 2 out of 16 ADC samples to perform the adaptation. The adaptation speed can be increased by utilizing more samples, at the expense of more hardware and power consumption. Increasing the adaptation loop gain can also speed up the adaptation; however, this may cause the coefficients to drift whenever a non-random bit sequence is received.

The measurement results of receiver jitter tolerance for BER < 10^-12 are plotted and compared with simulation results in Fig. 3.13. Sinusoidal jitter was applied to the transmitted data by modulating the clock frequency of the PRBS board. Using an Agilent E8257D signal generator as the clock source, the maximum modulation frequency

Figure 3.11: Eye diagrams before and after equalization for (a) a 26-inch and (b) a 34-inch Tyco channel. (Eye opening after equalization: (a) 26 LSB = 416 mV, (b) 20 LSB = 320 mV, at 16 mV_diff/LSB.)

Figure 3.12: Measured learning curves. (Panels are shown for the 9.9 dB and 13.3 dB at 2.5 GHz channels; coefficients 1 to 4 in the first row and 5 to 8 in the second row.)

that this signal generator supports is 8 MHz; thus the jitter tolerance measurement was limited to this frequency. It can be seen that at 8 MHz the receiver tolerates 0.29 UI_PP and 0.2 UI_PP of sinusoidal jitter for the 26-inch and the 34-inch channels, respectively. Finally, a performance summary is presented in Table 3.2.

3.4 Summary

In this chapter, an adaptive DFE for a 2x blind-sampling ADC-based CDR was described. The adaptation engine, which provides the DFE coefficients, uses phase-dependent desired levels for adaptation. A triangular waveform was used as the ideal reference waveform to guide the adaptation. While the CDR cannot provide error-free operation at 5 Gb/s for the 34-inch FR4 channel without equalization, it does provide a jitter tolerance of 0.2 UI_PP with BER < 10^-12 after adaptive equalization. The receiver consumes 192 mW, of which 114 mW is consumed by the flash ADC and 78 mW by the digital blocks.

Figure 3.13: Simulated and measured jitter tolerance for (a) 26-inch and (b) 34-inch FR4 channels. (Axes: jitter tolerance (UI_pp) versus jitter frequency from 100 kHz to 100 MHz. Simulation: BER < 10^-6; measurement: BER < 10^-12.)

Technology                                             65 nm CMOS
Data rate                                              5 Gb/s
Area                                                   0.46 mm²
Channel attenuation                                    13.3 dB
Adaptation time                                        80 µs
High freq. jitter tolerance                            0.2 UI_pp
ADC + input buffer + 4:16 DMUX power consumption       114 mW
Digital CDR power consumption                          78 mW

Table 3.2: Performance summary.

Part II

A Phase-Interpolator-Based Burst-Mode CDR

Chapter 4 Background: Burst-Mode CDR

The emergence of bandwidth-hungry web applications such as video-on-demand, online gaming, and online storage drives, as well as the proliferation of web-ready devices such as TVs, tablets, and gaming consoles, has created a great demand for faster internet connections among home and office users. Although standards such as ADSL [38] and DOCSIS [1] have provided broadband access to home and office users through the available copper-based telephone or cable TV infrastructures, their intrinsic bandwidth limitation makes them unlikely to accommodate the bandwidth growth of the upcoming decade. Optical fiber, on the other hand, has a large available bandwidth which can easily accommodate the bandwidth demand. Unfortunately, the high cost of replacing copper with fiber has been the main drawback of fiber to the home (FTTH). In order to reduce the cost and the payback time of this replacement, the current approach is to share the optical fiber between several users and to minimize the number of active optical and electronic components.

The passive optical network (PON) is widely adopted by internet service providers as an FTTH solution. In a PON, passive optical couplers (power splitters) are used to connect several consumers, or optical network units (ONUs), to the optical line termination (OLT) in the central office (CO). Depending on the locations of the subscribers,

Figure 4.1: PON splitting strategies: (a) Simple one-stage, (b) multi-stage splitting, and (c) optical bus.

different topologies [21] may be used to reduce the cost of the implementation. These topologies are shown in Fig. 4.1. A one-stage splitting architecture (Fig. 4.1(a)) is the simplest form of the network, but it may increase the fiber mileage, which increases the cost of implementation. An optical bus, as depicted in Fig. 4.1(c), may reduce the fiber mileage, but the last ONU suffers from high optical loss due to splitting and fiber losses. In order to reduce the cost per user, it is desirable to connect as many ONUs as possible to a single OLT; however, the impact of splitting losses limits the maximum practical splitting ratio. In current standards this maximum splitting ratio is limited to 128 [2, 37], but practical implementations usually limit it to 16 or 32 [21].

In all of the splitting topologies shown in Fig. 4.1, the data at the CO is sent and received through a single fiber, and passive optical splitters are then used to split or combine the data from the ONUs. Two multiple access techniques can be used in these networks: wavelength division multiple access (WDMA) or time division multiple access (TDMA). In WDMA, a different wavelength is assigned to each ONU, whereas in TDMA the same wavelength is used by all the ONUs and each ONU sends its packets in its allocated time slot. While WDMA does not suffer from bandwidth reduction when more users start to use the channel, it poses a practical difficulty, as each ONU is required to have a laser with a different wavelength and the OLT needs multiple receivers and transmitters. TDMA, on the other hand, is easier to implement, and the fact that home or office data consumers usually send and receive data in bursts of packets makes TDMA a reasonable choice. At the time of writing this thesis there was no finalized standard on WDMA PON.

4.1 TDMA PON

Fig. 4.2 shows the simplest TDMA PON network architecture. 
The duplex transmission is realized by using two different wavelengths for upstream and downstream channels.

Figure 4.2: TDMA in PON. (Upstream: burst; downstream: continuous.)

The downstream transmission sent by the OLT in the CO is continuous and is received by all the ONUs. Each ONU selects only its intended packets. The fact that the information intended for other ONUs is available to each ONU is a significant security concern and thus mandates the use of encryption in the downstream [36].

The upstream transmission protocol is more complicated than the downstream. Depending on the bandwidth requirements of each ONU and its distance from the CO, the OLT assigns a time slot to each ONU to ensure that no packets sent by the ONUs collide. The packets sent by each ONU are combined into a single optical fiber by a passive optical coupler and received by the OLT. Since the optical coupler is a directional coupler, the upstream from an ONU is not available to other ONUs; thus no encryption is required in the upstream.

Each ONU receives a continuous data stream, so a typical optical receiver can be used to recover the data. The OLT receiver, however, receives bursts of data packets from the ONUs, which have different phases and power levels. Each packet from an ONU arrives with a different phase because of the difference in the distances of the ONUs from the OLT. The power level of the received data also depends on the distance of the ONU from the OLT and on the number of optical couplers between them.

The optical receiver in the OLT is shown in Fig. 4.3 [31]. A photodiode (PD) converts the received optical signal into an electrical current, which is then amplified and converted to a differential voltage by a transimpedance amplifier (TIA). Fig. 4.4 shows a burst-mode

Figure 4.3: OLT receiver.

Figure 4.4: Burst-mode TIA.

TIA as reported in [32]. Fast level detectors are used to adjust the gain of the TIA as well as the threshold voltage of the single-ended to differential voltage converter. A limiting amplifier (LA) is then used to further amplify the voltage from the TIA. The LA is designed such that it amplifies the minimum signal level from the TIA to output levels that can be used by the following CDR, while larger signals are limited by the maximum available swing of the LA [29]. Data packets after the LA have constant amplitude; however, they still have different phases. The following clock and data recovery circuit has to be able to quickly recover the phase of the incoming data; otherwise, the efficiency of data transmission will suffer greatly.

The maximum time allocated for signal threshold detection, TIA gain adjustment, and phase recovery depends on the PON standard. While IEEE 802.3 allows the relaxed

time of 400 ns for this purpose [43], the ITU-T G.984 standard has allocated only 44 bits, corresponding to around 40 ns [36]. This small settling time makes the design of the OLT receiver challenging.

4.2 Previous work on BMCDR

The small time slot allocated for signal threshold detection, TIA gain adjustment, and data phase recovery has led to the invention of many schemes for fast level detection, gain adjustment, and phase recovery [32, 31, 33, 14, 44, 11]. Given the fixed time allocated for data acquisition, reducing the settling time of any one of these components relaxes the design of the other blocks. Specific research has been done on the CDR as a key component of the OLT receiver. Wide-band, stable closed-loop CDRs that can achieve lock within the given timing budget are hard to design. Open-loop receivers, on the other hand, are able to provide a fast response without sacrificing stability. In this section, an overview of the BMCDRs reported in the literature is presented.

4.2.1 Oversampling Architecture

The block diagram of a 10.3 Gb/s oversampling BMCDR as reported in [31, 41] is shown in Fig. 4.5(a). This CDR integrates a 10.3125 GHz phase-locked loop (PLL), a multi-phase clock generator, and a data over-sampler; however, the digital edge detector and data selector are implemented on an FPGA. In this design, a multi-phase clock generator produces eight 10.3125 GHz clocks with 45° phase spacing from the clock generated by the PLL. The data sampler block samples the incoming burst data at 82.5 GS/s, generating eight samples per UI. The edge detector detects the rising and falling edges of the data by comparing adjacent samples. Then the data selector selects the data sampled by the clock which has the largest phase

Figure 4.5: (a) Block diagram of oversampling burst-mode CDR, (b) principle of operation.

Figure 4.6: (a) Block diagram of dual GVCO burst-mode CDR, (b) timing diagram.

margin, i.e., as shown in Fig. 4.5(b), the sample which is closest to the center of the UI. This architecture is capable of fast data recovery and good jitter tolerance at the expense of a large power consumption of 5.8 W (excluding FPGA power consumption), due to the large number of high-speed clocks and samplers. The design of the multi-phase clock generator and of clock routing with equal phase separation is also a challenge at 10 GHz.

4.2.2 Dual GVCO-based Architecture

The work presented in [7] was the first reported BMCDR based on a gated VCO (GVCO), targeted at asynchronous transfer mode (ATM) networks. A GVCO is a voltage-controlled ring oscillator whose oscillation can be started or stopped by closing or opening the ring. As shown in Fig. 4.6(a), the GVCO oscillates

whenever the NOR gate in the ring oscillator acts as an inverter, which happens when the gating signal is low. If the gating signal is high, however, the output of the NOR gate is zero and there is no oscillation. As a result, when the gating signal makes a high-to-low transition, the oscillation in the GVCO starts with a low-to-high transition at its output, and thus the phase of oscillation can be set accordingly. The BMCDR uses this characteristic of the GVCO to align the phase of the recovered clock with the incoming data. The incoming data is connected to the gating control of two GVCOs: one oscillates when the data level is low, the other when the data level is high. As shown in Fig. 4.6(b), the falling and rising edges of the data align the rising edges of the GVCO A and GVCO B outputs, respectively. A NOR gate then combines the oscillations of both GVCOs to produce a continuously oscillating recovered clock.

The gating control can only adjust the phase of the recovered clock. Its frequency is set by the control voltage of the GVCOs, which is provided by a PLL locked to a reference clock. The PLL uses a GVCO matched to the ones used in the BMCDR to ensure that they oscillate at the locked frequency. This BMCDR architecture is able to recover the phase of the incoming data within the first transition; however, the recovered clock suffers from ISI-induced deterministic jitter, which results from the limited bandwidth of the GVCOs and the wide-bandwidth content of their outputs caused by intermittent oscillation.

4.2.3 Single GVCO-based Architecture

To reduce the clock jitter induced by ISI, a single GVCO-based architecture was proposed in [33]. The block diagram of this BMCDR is shown in Fig. 4.7(a). In this design the GVCO is almost identical to the one used in the dual GVCO architecture. The only difference is the use of an AND gate instead of a NOR gate to control the state of oscillation.
As a result, the GVCO oscillates whenever the gating signal is high. When the gating signal is low, the oscillation is stopped and the output is low. In this design, the incoming data does not directly gate the oscillation; instead, the GVCO is

preceded by a gating circuit whose output is high in steady state and which produces a narrow zero pulse (half a UI wide) whenever there is a rising edge in the incoming data, as shown in Fig. 4.7(b). These narrow pulses reset the phase of oscillation, so that at each rising edge of the data the clock becomes aligned with the center of the data UI. A narrow gating pulse only stops the GVCO's oscillation and forces its output to zero for half a UI; thus the oscillation is not interrupted if the recovered clock is already low, which is the case when the data and the recovered clock are in phase. If a new packet of data with a new phase arrives, however, the oscillation is interrupted only at the first rising edge of the data, and the clock phase jumps to align with the new data. The data-dependent jitter of the clock in this design is almost removed, as the frequency spectrum of the GVCO output is much narrower than in the dual GVCO-based architecture. The gating circuitry, however, may still suffer from ISI if it is not carefully designed.

Figure 4.7: (a) Block diagram of single GVCO burst-mode CDR, (b) timing diagram.

4.2.4 Injection Locking Architecture

The main advantage of GVCO-based BMCDRs is their fast phase recovery. Fast phase recovery, however, means that the recovered clock is aligned with jittery data edges, and thus any high-frequency jitter on the data also appears on the recovered clock. It is possible to trade speed for high-frequency jitter rejection by using an injection-locking BMCDR. The block diagram of such a BMCDR is shown in Fig. 4.8(a) [24]. As in the GVCO architectures, the frequency of oscillation is set by sharing the control voltage of a VCO locked to the reference frequency inside a PLL with the VCO in the BMCDR. To achieve injection locking, it is necessary to provide injections at the frequency of oscillation. The NRZ data spectrum, however, has a null at the baud-rate frequency. A gating circuit is therefore required to produce injections with nonzero spectral content at the baud-rate frequency. Fig. 4.8(b) shows the block diagram of the gating circuit. This circuit produces an injection pulse at every data edge. The power spectrum at the baud-rate frequency is maximized if the delay introduced by the cascaded buffers is half a UI. In the implemented injection-locked VCO, shown in Fig. 4.8(c), the injections are added in current mode to the output of the oscillator. The added injection current rotates the phase of the oscillator until it becomes equal to the injection phase. The speed of locking depends on the injection amplitude, which is controlled by adjusting the tail current of the injecting differential pair. By reducing the injection power, and hence slowing the response of the recovered clock phase to changes in the incoming data phase, it is possible to reject high-frequency jitter of the incoming data; a slower response, however, means that recovering the phase of a new data packet requires more time.
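The trade-off between locking speed and jitter rejection can be illustrated with a very simple behavioral model. The sketch below assumes (this is not taken from [24]) that each injection pulls the clock phase toward the data-edge phase by a fixed fraction `k`, a hypothetical parameter standing in for the injection strength. A real injection-locked oscillator follows nonlinear dynamics, so this is only a first-order approximation.

```python
def track_phase(edge_phases, k):
    """Return the clock phase (in UI) after each data edge.

    Assumed first-order model: every injection moves the clock phase a
    fraction k (0 < k <= 1) of the way toward the phase of the data edge.
    """
    clock_phase = 0.0
    history = []
    for edge in edge_phases:
        clock_phase += k * (edge - clock_phase)
        history.append(clock_phase)
    return history


# A new packet arrives 0.4 UI away from the current clock phase.
step = [0.4] * 40
strong = track_phase(step, k=0.5)    # strong injection (assumed value)
weak = track_phase(step, k=0.05)     # weak injection (assumed value)

# Edge-to-edge jitter of +/-0.05 UI around a fixed data phase.
jitter = [0.05 if n % 2 == 0 else -0.05 for n in range(200)]
strong_ripple = max(abs(p) for p in track_phase(jitter, k=0.5)[-20:])
weak_ripple = max(abs(p) for p in track_phase(jitter, k=0.05)[-20:])

print(f"phase after 10 edges: strong={strong[9]:.3f} UI, weak={weak[9]:.3f} UI")
print(f"residual jitter:      strong={strong_ripple:.4f} UI, weak={weak_ripple:.4f} UI")
```

In this model the strong injection reaches the new packet phase within a few edges but passes more of the edge jitter to the clock, while the weak injection filters the jitter at the cost of a slower acquisition.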

Figure 4.8: (a) Block diagram of injection-locking burst-mode CDR, (b) the gating circuit, (c) schematic of VCO [24], and (d) principle of operation.

4.3 Summary

In this chapter, a background on PON, where a burst-mode receiver is an essential part of the OLT, was provided. The burst-mode receiver, comprising a burst-mode TIA, an LA, and a BMCDR, has to settle quickly for each data packet despite differences in power level and timing (phase) between packets. Previous works on BMCDRs that address the need for quick clock recovery were briefly described. An oversampling architecture, although providing good jitter tolerance and fast data recovery, is quite power hungry. The other burst-mode CDRs that were described provide better power efficiency by using a GVCO or injection locking; however, as will be described in the next chapter, they are sensitive to PVT variations because their operation relies on frequency matching between the oscillators in the PLL and the BMCDR.

Chapter 5 Proposed Burst-Mode CDR

In the previous chapter, different BMCDR architectures were discussed. All of these architectures, except the oversampling architecture, rely on perfect matching between the oscillator inside the PLL and the one inside the BMCDR, so that with the same control voltage they oscillate at the same frequency. However, this perfect matching cannot be realized because of layout and process mismatches. The result is a frequency offset between the reference frequency and the recovered clock frequency in the absence of data transitions.

Figure 5.1: Error generation in the presence of frequency offset and long consecutive identical digits (CID).

Since the recovered clock phase of the BMCDR is adjusted on receipt of each data edge, the recovered clock can track the phase drift induced by the frequency offset if the data

transition density is high enough. During a long run of consecutive identical digits (CID), on the other hand, the recovered clock is not updated, and as a result the phase drift accumulates over time. If the accumulated phase error exceeds half a UI, an error is generated. Fig. 5.1 shows an example of how this frequency offset can produce errors whenever there is a long CID pattern. In this example it is assumed that, because of the mismatch between the oscillators, the recovered clock has a slightly lower frequency than the baud rate of the received data. Thus, in the presence of a long CID of zeros, the number of recovered zeros is smaller than the number of transmitted zeros. The length of CID that the CDR can tolerate is called the CID tolerance ($N_{CID}$). In a BMCDR this CID tolerance is inversely proportional to the frequency offset and can be expressed as [44]:

$$N_{CID} = \frac{P_m B_r}{F_{off}} \tag{5.1}$$

where $P_m$ is the horizontal phase margin in UI, $B_r$ is the data baud rate, and $F_{off}$ is the frequency offset.

The other problem that makes the design of GVCO-based and injection-locking BMCDRs challenging is the possible pulling of the oscillator in the BMCDR by the one in the PLL. If the isolation between the two oscillators is poor, the injection introduced by the PLL oscillator competes with the injection from the gating circuit. This causes an unaccounted phase offset in the injection-locking architecture. In both architectures this unwanted injection reduces the CID tolerance, because in the absence of injection from the gating circuit the BMCDR oscillator is pulled by the PLL's oscillator and hence the recovered clock phase is disturbed. To achieve good isolation, it is necessary to place the two oscillators far apart from each other, and hence good matching between them cannot be achieved.
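As a rough numerical illustration of Equation 5.1, the sketch below evaluates the CID tolerance for a few frequency offsets. The function name and the example values (a 0.5 UI phase margin, a 10 Gb/s baud rate, and offsets expressed in ppm of the baud rate) are illustrative assumptions, not figures from this thesis.

```python
def cid_tolerance(phase_margin_ui, baud_rate_hz, freq_offset_hz):
    """CID tolerance in bits, per Equation 5.1: N_CID = P_m * B_r / F_off."""
    if freq_offset_hz == 0:
        return float("inf")  # no drift, so any run length is tolerated
    return phase_margin_ui * baud_rate_hz / abs(freq_offset_hz)


baud_rate = 10e9      # assumed baud rate, Hz
phase_margin = 0.5    # assumed horizontal phase margin, UI

for ppm in (100, 1000, 10000):
    offset = baud_rate * ppm * 1e-6
    bits = cid_tolerance(phase_margin, baud_rate, offset)
    print(f"{ppm:>6} ppm offset -> about {bits:.0f} bits of CID")
```

With these assumed numbers, an oscillator mismatch of 1000 ppm limits the tolerable run to 0.5 / 10^-3 = 500 identical bits, and every tenfold increase in offset cuts the tolerance by a factor of ten.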

Figure 5.2: Conceptual block diagram of a phase interpolator (PI).

5.1 Proposed Phase Interpolator Approach

As mentioned earlier, the oversampling BMCDR is free from the frequency offset problem, but it is quite power hungry. In this thesis, a phase interpolation scheme resolves the issue of frequency offset without consuming much power. Instead of using the control voltage from the reference PLL, a PI-based BMCDR uses the PLL's quadrature output clocks and generates a recovered clock that has exactly the same frequency as the reference PLL. In this section, we briefly describe the operation of a phase interpolator, as it will be used in the proposed burst-mode CDR in Section 5.2.

The conceptual block diagram of a PI is shown in Fig. 5.2. A clock with an arbitrary phase and amplitude, $CK_O$, can be generated by multiplying two quadrature clocks, $CK_I$ and $CK_Q$, by appropriate coefficients $\alpha$ and $\beta$ and then summing them. The PI output can therefore be expressed as:

$$CK_O = \alpha\, CK_I + \beta\, CK_Q \tag{5.2}$$

Based on the type of coefficients used, phase interpolators are categorized as digital or analog. An example of an analog PI is shown in Fig. 5.3 and a digital one in Fig. 5.4. In the analog PI, the analog values of the tail currents set the PI coefficients such that $\alpha = I_{\alpha+} - I_{\alpha-}$ and $\beta = I_{\beta+} - I_{\beta-}$. In this way it is possible to obtain both positive and negative PI coefficients. The summation is then done in current mode on

the load resistor. By replacing the analog current sources with current digital-to-analog converters (DACs), a digital PI is obtained; however, it is possible to save power and area by halving the number of current DACs and introducing a single-pole, double-throw switch that selects the sign of each PI coefficient.

Figure 5.3: Analog PI [20].

Ideally, a PI should provide an output clock with constant amplitude. Assuming that the input quadrature clocks are sinusoidal, the weight factors $\alpha$ and $\beta$ should satisfy:

$$\alpha^2 + \beta^2 = \text{const.} \tag{5.3}$$

These coefficients are shown as functions of the output clock phase in Fig. 5.5(a). As can be seen, the coefficients are sinusoidal functions of the output phase, and hence providing them might not be easy, especially in the analog domain [20]. To simplify the implementation, the sinusoidal functions are usually approximated by triangular ones, as shown in Fig. 5.5(b). This simplification, however, results in coefficients that do not satisfy Equation 5.3, and as a result there is an amplitude variation in the output clock. This variation will not affect the performance of the CDR

as the CDR only relies on the zero-crossing of the clock.

Figure 5.4: Digital PI [16].

5.2 Proposed Architecture

In a typical PI-based CDR, the PI coefficients are produced by a controller operating in a slow feedback loop. In a burst-mode CDR, however, it is desirable to obtain the coefficients as fast as possible, and hence an open-loop controller is preferable. Fig. 5.6 shows the block diagram of the proposed BMCDR. The main building block of the CDR is the burst-mode clock recovery unit, which uses a phase interpolator to instantly align the phase of the recovered clock with the incoming data. The recovered clock from the clock recovery unit needs to be buffered to drive the following circuitry, so the delay introduced by the buffer has to be accounted for. A matched buffer in the data path can compensate for this delay; however, any PVT variations, layout mismatches, or

Figure 5.5: Selection of PI coefficients; (a) ideal and (b) practical coefficients.

Figure 5.6: Proposed BMCDR block diagram.

clock recovery non-idealities can cause a phase offset between the data and the recovered clock. To resolve this issue, the buffer is integrated into a digital delay-locked loop (DLL) to ensure that clock and data are aligned before slicing. The sliced data is then demultiplexed by 16 and fed to an on-chip BERT to verify the functionality of the CDR.

Figure 5.7: (a) Block diagram of proposed clock recovery unit and (b) operation.

5.2.1 Clock Recovery Unit

The heart of the proposed BMCDR is the clock recovery unit, which aligns the recovered clock with the incoming data edge. As shown in Fig. 5.7(a), the clock recovery unit consists of a phase interpolator and two double-edge-triggered sample-and-holds. The

phase interpolator rotates the phase of the reference quadrature clocks based on the coefficients provided by the sample-and-holds. Assuming a data edge at $t = t_0$ and a sinusoidal recovered clock $CK_{REC} = \sin(2\pi f t - \varphi_0)$, aligning the rising edge of the clock with the data edge (as shown in Fig. 5.7(b)) means $\varphi_0 = 2\pi f t_0$. Thus the recovered clock can be rewritten as:

$$
\begin{aligned}
CK_{REC}(t) &= \sin(2\pi f (t - t_0)) && (5.4) \\
&= \cos(2\pi f t_0)\,\sin(2\pi f t) - \sin(2\pi f t_0)\,\cos(2\pi f t) && (5.5) \\
&= \beta\, CK_Q(t) - \alpha\, CK_I(t) && (5.6)
\end{aligned}
$$

where $CK_I(t) = \cos(2\pi f t)$ and $CK_Q(t) = \sin(2\pi f t)$ are the reference quadrature clocks, and $\alpha = \sin(2\pi f t_0) = CK_Q(t_0)$ and $\beta = \cos(2\pi f t_0) = CK_I(t_0)$ are the PI coefficients. An interesting consequence of these equations is that the PI coefficients are samples of the reference clocks at time $t = t_0$; thus the data edge can be used to sample the reference clocks and produce the coefficients. This sampling is performed by the two double-edge-triggered sample-and-holds (S/Hs).

The schematic of the implemented PI is shown in Fig. 5.8. This analog PI is composed of two Gilbert cells with their outputs connected together. Each Gilbert cell multiplies a PI coefficient from the S/Hs by the corresponding reference clock. The results of the multiplications are then subtracted in current mode and converted back to voltage by the load resistors. The high-speed clocks are connected to the upper transistors of the Gilbert cells, and the slower PI coefficients are connected to the lower transistors. The differential architecture ensures that supply noise and common-mode variations in the reference clocks and coefficients are rejected. The differential PI coefficients also allow both positive and negative coefficients; thus the PI can rotate the reference clocks through a complete 360°. The double-edge-triggered S/H is implemented by connecting two edge-triggered S/Hs in parallel, as shown in Fig.
5.9(a), one operating with the rising edge and the other with

the falling edge. The edge-triggered S/H is realized by connecting two level-triggered S/Hs in a master-slave configuration, as shown in Fig. 5.9(b). Parasitic capacitors are sufficient to hold the charge, because the longest CID in typical data transmissions is never longer than a thousand UI, which corresponds to a microsecond at 1 Gb/s. A differential pair acts as a buffer between the master and slave S/Hs and prevents charge sharing and kickback from the output transmission gate to the input pass transistors. This differential pair also helps reduce the common-mode component of clock feedthrough from the pass transistors.

Figure 5.8: Proposed BMCDR block diagram.

5.2.2 Delay Locked Loop

The open-loop nature of the BMCDR makes it vulnerable to PVT variations, as there is no sensing mechanism to ensure that sampling happens at the optimum phase. For example, in the PI-based BMCDR, mismatches between the in-phase and quadrature

Figure 5.9: (a) Double edge triggered S/H consisting of two (b) single edge triggered S/Hs.

Figure 5.10: Bang-bang phase detector including a decimator.

paths of the PI, or the delay in the S/Hs and the PI, can shift the clock phase from its intended value. During the design process, it is possible to match the simulated delays of the components in the clock and data paths. This matching, however, is valid only for a specific process corner. Different process corners, or mismatches between devices, will cause a phase shift in the recovered clock. It is possible to put variable delay elements in the clock and data paths and match the delays for each corner separately; however, such manual adjustment is not desirable in a commercial product. This process can easily be automated by using a DLL, as shown in Fig. 5.6.

DLLs are mainly used in high-performance digital designs where it is necessary to remove the skew of the clock as it is buffered before driving digital blocks [26]. In a DLL, a phase detector (PD) detects the alignment of a reference clock with the buffered clock; the corresponding early/late signal is averaged and fed to a variable delay element, which adds to the delay of the buffer such that the buffered clock is aligned with the reference clock. In typical DLL designs, the variable delay element is placed only in the path of the signal that is to be aligned with the reference clock, and the delay is set such that the signal is aligned with the next edge of the reference clock. In our design, however, two variable delay elements are used, one in the data path and the other in the recovered clock path. Whenever the delay of the data path increases, the delay of the clock path decreases, and vice versa. This allows compensation of both positive and negative skews without the extra UI of delay. The delay lines are digitally controlled, so once the delays of the data and clock paths are matched it is possible to turn off the other components of the DLL and hold the delay settings, thereby saving power.
The implemented PD has a bang-bang [6] architecture, as shown in Fig. 5.10. Instead of using high-speed, power-hungry current-mode XOR gates to produce the early and late signals, the edge and center samples are demultiplexed by 16 and CMOS XOR gates are then used. A majority voter then decides whether the clock was early or late by

comparing the number of early and late signals. Although majority voting discards part of the information from the PD (i.e., the difference between the numbers of early and late signals), it reduces the dependence of the DLL loop gain on transition density [40]. The output of the majority voter is integrated by the 7-bit up/down counter shown in Fig. 5.11. The counter is synchronous, i.e., all of its outputs change at the same time. This architecture, although more complicated than a ripple counter, ensures that glitch-free control signals are sent to the digitally controlled delay lines (DCDLs). The four most significant bits (MSBs) of the counter are sent to the DCDLs, and the other three bits, which act as dithering bits, are discarded. The dithering bits prevent the clock and data phases from changing rapidly and also reduce the DLL loop gain. The DCDL in the data path uses the counter outputs, while the bitwise inverse of the counter outputs is sent to the DCDL in the clock path. In this way, whenever the delay of the data path increases, the delay of the clock path decreases. When the counter is reset, its output is set to 1000000; thus the delay code of the DCDL in the data path is initialized to 1000 and the delay code of the DCDL in the clock path to 0111. This means that the minimum delay difference between the clock and data paths is one LSB of the DCDL. This small delay mismatch can be ignored, however, since the resolution of the DCDL is high enough.

A DCDL with a digital integrator (counter) was chosen over a voltage-controlled delay line (VCDL) because it is possible to turn off the DLL and hold the delay after the DLL locks. Turning off the DLL saves power and also removes the extra jitter introduced by the bang-bang PD. However, the DLL must be activated periodically to track any delay mismatches caused by variations in temperature and supply voltage.
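The counter and delay-code mapping described above can be summarized in a short behavioral sketch. This is only an illustrative software model, not the implemented logic: the class and method names are hypothetical, the choice that a "late" majority counts up is an assumed polarity, and saturation at the ends of the counter range is an assumption that the text does not state.

```python
class DelayCodeCounter:
    """Behavioral model of a 7-bit up/down counter driving two 4-bit DCDLs."""

    COUNTER_BITS = 7
    CODE_BITS = 4
    RESET_VALUE = 0b1000000  # reset state given in the text

    def __init__(self):
        self.count = self.RESET_VALUE

    def reset(self):
        self.count = self.RESET_VALUE

    def update(self, early_flags, late_flags):
        """Apply one majority vote over the demultiplexed early/late flags."""
        early, late = sum(early_flags), sum(late_flags)
        max_count = (1 << self.COUNTER_BITS) - 1
        if late > early:      # assumed polarity: a late majority counts up
            self.count = min(self.count + 1, max_count)  # assumed saturation
        elif early > late:
            self.count = max(self.count - 1, 0)          # assumed saturation
        # a tie leaves the counter unchanged

    def delay_codes(self):
        """Return (data_code, clock_code): the 4 MSBs and their bitwise inverse."""
        dither_bits = self.COUNTER_BITS - self.CODE_BITS
        data_code = self.count >> dither_bits
        clock_code = ~data_code & ((1 << self.CODE_BITS) - 1)
        return data_code, clock_code


counter = DelayCodeCounter()
print([format(c, "04b") for c in counter.delay_codes()])
for _ in range(8):  # eight consecutive "late" votes
    counter.update(early_flags=[0] * 16, late_flags=[1] * 16)
print([format(c, "04b") for c in counter.delay_codes()])
```

In this model, starting from the reset state, eight consecutive up votes are needed before the data-path code advances from 1000 to 1001, which is how the three discarded bits slow down changes in the delay setting.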

Figure 5.11: 7-bit synchronous up/down counter.

5.3 Simulation and Measurement Results

The initial proof-of-concept simulations were performed in Simulink®. The typical approach in CDR design is to build more realistic models of the components in Simulink and simulate the performance of the CDR there, as it is much faster to optimize the design in Simulink than in transistor-level simulators such as HSPICE®. The simple architecture of the BMCDR, however, lends itself to design and optimization at the transistor level. The initial transistor-level schematic was designed in a 65 nm STMicroelectronics process and was then ported to Fujitsu's 65 nm CMOS process.

The post-layout simulation result for burst-mode acquisition of the receiver is shown in Fig. 5.12. In this simulation the DLL was disabled and the delays of the DCDL elements were fixed to the values provided by the DLL when it had matched the delays of the clock and data paths. Two packets of data with different relative phases were sent to the receiver. Upon the arrival of the next packet, the PI coefficients were quickly updated to align the phase of the recovered clock with the phase of the incoming data. As can be seen from this simulation, the PI coefficients reach their final values within the first received bit, and the recovered clock is completely aligned in the following bit. The small ripples on the PI coefficients are the result of feedthrough from the S/Hs. These ripples, however, have no effect on the recovered clock, assuming that the feedthrough of the in-phase clock is the same as that of the quadrature clock. This assumption is valid as long as the S/Hs are perfectly matched and both quadrature clocks have the same amplitude. To prove this, we rewrite Equation 5.5, replacing $\alpha = \sin(2\pi f t_0)$ and $\beta = \cos(2\pi f t_0)$ by $\sin(2\pi f t_0) + \epsilon\sin(2\pi f t)$ and $\cos(2\pi f t_0) + \epsilon\cos(2\pi f t)$, respectively,

Figure 5.12: Post-layout simulation result for burst-mode acquisition. (a) Input data after CML-to-CMOS block, (b) PI coefficients provided by S/Hs, (c) recovered clock, and (d) recovered data. Note the instantaneous phase recovery.

where $\epsilon$ represents the ratio of feedthrough:

$$
\begin{aligned}
CK_{REC}(t) &= \beta \sin(2\pi f t) - \alpha \cos(2\pi f t) \\
&= [\cos(2\pi f t_0) + \epsilon\cos(2\pi f t)]\sin(2\pi f t) - [\sin(2\pi f t_0) + \epsilon\sin(2\pi f t)]\cos(2\pi f t) \\
&= \sin(2\pi f (t - t_0)) + \epsilon[\cos(2\pi f t)\sin(2\pi f t) - \sin(2\pi f t)\cos(2\pi f t)] \\
&= \sin(2\pi f (t - t_0)) && (5.7)
\end{aligned}
$$

so $CK_{REC}(t)$ is not affected by feedthrough.

Figure 5.13: Die photo and area of building blocks. (A) Clock recovery, 50 × 70 µm; (B) BUF/DFF/DeMUX, 200 × 70 µm; (C) BERT, 60 × 30 µm; (D) CML output drivers, 90 × 140 µm.

The test chip was fabricated in Fujitsu's 65 nm CMOS process. Fig. 5.13 shows the die photo of the receiver and the area of the building blocks. The total area of the receiver, excluding the output drivers and test structures, is 250 × 70 µm². Due to a mistake during the design, the pad frame and the probe-card pad configuration were incompatible, and the DLL-enable pad on the chip was connected to the probe card's ground. As a result, the DLL could not be enabled during the measurement and the DCDL elements could

only work as fixed delay elements. Fortunately, we were able to adjust the delay codes manually to match the delays of the clock and data paths. The receiver consumes 22 mW, of which 3.8 mW is consumed by the clock recovery unit and 18.2 mW by the DLL delay elements, slicer, and DMUX.

Figure 5.14: Measurement setup.

Fig. 5.14 shows the simplified measurement setup. A Centellax PRBS generator was used to produce the input data. The in-phase and quadrature clocks were generated by synchronizing two signal generators and adjusting the phase of one of them to obtain a 90° phase shift with respect to the other. The full-rate recovered data and clock were observed with a digital communication analyzer, and the functionality of the receiver was determined by observing the error-flag output of the internal PRBS error checker. The receiver achieved a BER of $10^{-12}$ over the 1-6 Gb/s range for PRBS7 data. The eye diagrams of the recovered clock and recovered data for a 6 Gb/s PRBS10 sequence are shown in Fig. 5.15. The peak-to-peak jitter of the recovered clock and data is 24.2 ps and 30.9 ps, respectively. Part of this jitter is the result of PI non-linearity: because the PI is non-linear, the interpolated clock has a phase-dependent error. Thus, if the PI coefficients change over time, the recovered clock exhibits jitter. Hence, any frequency offset between the transmitter and reference clocks generates periodic jitter in the recovered clock.
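For comparison with the measured behavior, the ideal interpolator of Equations 5.4 to 5.6 has no phase-dependent error. The sketch below checks this numerically, together with the feedthrough cancellation of Equation 5.7, under the same assumptions as the derivations: ideal sinusoidal quadrature clocks and ideal multiplication. The function names, the separate feedthrough ratios `eps_i` and `eps_q`, and the numeric values are illustrative only.

```python
import math


def recovered_clock(theta, theta0, eps_i=0.0, eps_q=0.0):
    """Ideal PI output at reference phase theta = 2*pi*f*t.

    The coefficients are the reference clocks sampled at the data edge
    (theta0 = 2*pi*f*t0) plus a feedthrough fraction of each clock.
    """
    ck_i, ck_q = math.cos(theta), math.sin(theta)
    alpha = math.sin(theta0) + eps_q * ck_q  # sampled CK_Q plus feedthrough
    beta = math.cos(theta0) + eps_i * ck_i   # sampled CK_I plus feedthrough
    return beta * ck_q - alpha * ck_i


def worst_case_error(eps_i, eps_q, steps=360):
    """Largest deviation from sin(theta - theta0) over a grid of phases."""
    worst = 0.0
    for i in range(steps):
        theta0 = 2 * math.pi * i / steps
        for j in range(steps):
            theta = 2 * math.pi * j / steps
            ideal = math.sin(theta - theta0)
            actual = recovered_clock(theta, theta0, eps_i, eps_q)
            worst = max(worst, abs(actual - ideal))
    return worst


print("no feedthrough:        ", worst_case_error(0.0, 0.0))
print("equal feedthrough:     ", worst_case_error(0.1, 0.1))
print("mismatched feedthrough:", worst_case_error(0.1, 0.05))
```

In this idealized model the output matches the target clock for every data phase, and equal feedthrough cancels as Equation 5.7 predicts, while unequal feedthrough leaves a residual term. This suggests that the phase-dependent error discussed above originates from non-idealities that the model leaves out.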

Figure 5.15: (a) Measured recovered clock and (b) data for 6 Gb/s PRBS10 data. Panel annotations: (a) recovered clock jitter RMS = 4.17 ps, jitter p-p = 24.2 ps; (b) recovered data jitter p-p = 30.9 ps.

Figure 5.16: Measured PI linearity at (a) 4 Gb/s and (b) 6 Gb/s. Each panel plots the change in output phase (deg.) against the input phase (deg.) for CK_REC outputs at 30° data phase steps. Maximum deviation from the ideal line: (a) 6.5°, (b) 2°.

Figure 5.17: Measured phase locking speed at (a) 1 Gb/s, (b) 2.5 Gb/s, (c) 4 Gb/s, and (d) 6 Gb/s.

In order to measure the linearity of the phase interpolator, the phase of the input data was shifted in 20° steps and the change in the output phase was measured. Fig. 5.16 shows the measured linearity of the PI at 4 Gb/s and 6 Gb/s. The INL at these two rates is 6.5° and 2°, respectively. The lower INL at the higher frequency can be attributed to the lower clock amplitude entering the PI as a result of higher attenuation. The linearity of the PI also depends on the quality of the reference clocks; sinusoidal reference clocks are crucial for PI linearity. In this design, high-quality reference clocks are provided by signal generators. In a product, however, it would be necessary to place low-pass filters before the PI to ensure sinusoidal clocks.
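To relate the measured INL to the jitter figures quoted earlier, the phase deviation can be converted to time. The sketch below assumes a full-rate clock, so that one clock period equals one UI; the helper name is hypothetical.

```python
def phase_error_to_ps(error_deg, data_rate_gbps):
    """Convert a clock phase error in degrees to picoseconds.

    Assumes a full-rate clock: period = 1 UI = 1000 / data_rate_gbps ps.
    """
    period_ps = 1000.0 / data_rate_gbps
    return error_deg / 360.0 * period_ps


for rate, inl in ((4, 6.5), (6, 2.0)):
    ps = phase_error_to_ps(inl, rate)
    print(f"{rate} Gb/s: INL of {inl} deg is about {ps:.2f} ps")
```

Under that assumption, the 6.5° deviation at 4 Gb/s corresponds to roughly 4.5 ps and the 2° deviation at 6 Gb/s to just under 1 ps.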