Low Power Digital Receivers for Multi-Gb/s Wireline/Optical Communication


by A K M Delwar Hossain

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Integrated Circuits and Systems

Department of Electrical and Computer Engineering
University of Alberta

A K M Delwar Hossain, 07

Abstract

As gate length scaling continues to less than 0nm, the digital performance of integrated circuits (ICs) continues to improve at a faster rate than their analog performance. This naturally leads to a trend where traditional high-speed mixed-signal transceivers have to be replaced by their digital counterparts. For a wireline receiver, this means replacing traditional analog mixed-signal equalizers with digital equalization techniques while still fitting within a challenging power budget. Similarly, traditional optical receivers use analog mixed-signal techniques to solve data-dependent DC offset, burst mode timing recovery, etc. This work introduces fully digital techniques that address these challenges in order to design compact, low power, digitally-enhanced optical receivers.

The first receiver architecture of this dissertation describes a design technique for energy-efficient sequence detection and equalization without the use of any ADC and DSP. This scheme takes advantage of the inter-symbol interference (ISI) in the channel to reconstruct the time domain bit sequence. It is the most power efficient digital receiver reported to date. It improves power efficiency (power consumed per bit) by.5x and power consumed per bit per dB of channel loss by.65x over the state-of-art.

The second receiver takes the architecture of the first receiver and modifies it to introduce data trace-back. This is the first implementation of data trace-back in SerDes. The data trace-back improves the noise immunity and the voltage margin of the system. Adding data trace-back to the decision feedback equalization (DFE) improved BER from 0-0 (DFE only) to 0 -.

The third receiver architecture describes a low power 7-0 Gb/s burst mode DC-coupled receiver for photonic switch networks. The concurrent operation of DC and timing recovery implies low latency in burst mode receivers. DC is recovered using SAR (successive approximation register) logic within 6 cycles of /8th of data rate clock, which is a 6.5X improvement over the current state-of-art. It consumes only 3.7 mW during runtime.

Preface

The concept, architecture, and measurement results of the work in Chapter 2 have been published in the Symposium on VLSI Circuits: "A 35 mw 0 Gb/s ADC-DSP less direct digital sequence detector and equalizer in 65nm CMOS," A. D. Hossain, Aurangozeb, M. Mohammad and M. Hossain, 06 IEEE Symposium on VLSI Circuits (VLSI-Circuits), Honolulu, HI, 06, pp. -. I was responsible for the system design, data collection, and manuscript composition. Aurangozeb assisted me during design and data collection. Dr. Masum Hossain was the supervisor of the project.

The paper containing the overall concept and measurement results of the optical receiver described in Chapter 4 has been submitted to the peer-reviewed journal IEEE Transactions on Circuits and Systems I: Regular Papers: "Burst Mode Optical Receiver with 0ns Lock Time Based on Concurrent DC Offset and Timing Recovery Technique," A. D. Hossain, Aurangozeb and M. Hossain, IEEE Transactions on Circuits and Systems I. I was responsible for the system design, data collection, and manuscript composition. Aurangozeb assisted me during system design, manuscript editing, and data collection. Dr. Masum Hossain was the supervisor of the project.

Acknowledgements

First and foremost, I would like to express my sincere gratitude to my supervisor, Prof. Masum Hossain, for giving me the golden opportunity of being part of his group when I knew little about circuits. Dr. Hossain has been a great inspiration for me over the last three years. His way of teaching and supervising has helped me a lot in understanding the very basics of circuit design. His friendly attitude toward me made working with him productive and fun. He will be a role model that I will look up to as I start my career. Special thanks to him for providing me with the funds to carry out my studies.

I would also like to thank Prof. Duncan Elliott and Prof. Kambiz Moez for being on my supervisory committee. I would like to thank Prof. Pedram Mousavi and CMC Microsystems for lending us the equipment for testing the chips.

It has been my privilege to work with the team members of the Mixed-signal Integrated systems in Nano-Technology (MINT) lab at the University of Alberta. I would like to thank Amlan, Aurangozeb and Waleed for their friendship and technical insight.

My sincere thanks to all friends and well-wishers, particularly Manir, Ishtiza and Shibli, for making my life in Edmonton enjoyable. Thanks to my better half, Raka, for all her support and encouragement. These three years in Edmonton would have been impossible without her support and patience. Finally, I thank my parents and siblings for always believing in me.

Contents

ABSTRACT
PREFACE
ACKNOWLEDGEMENTS
CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1. INTRODUCTION
  MOTIVATION
  ORGANISATION OF THE THESIS
CHAPTER 2. A 0 GB/S DIRECT DIGITAL SEQUENCE DETECTOR AND EQUALIZER WITHOUT ADC-DSP IN 65NM CMOS
  EQUALIZATION: MOVING FROM ANALOG TO DIGITAL
  CONCEPT OF SEQUENCE DECODING
    Reference Placement: ADC vs. Sequence Detector
    Digital Sequence Detection
    Advantages & Challenges in Direct Sequence Decoding
    Sequence DFE
  LINK DESIGN
    Comparator Reference for Sequence Generation
    Sequence Generation and DFE Implementation
  HARDWARE COST AND PERFORMANCE OF THE SYSTEM
  ERROR TOLERANCE
    Bottom Prediction Error Tolerance
    Mid Prediction Error Tolerance
    Top Prediction Error Tolerance
    In-Bank Comparator Error Tolerance
  SYSTEM DESIGN
    Passive Equalizer
    Sample & Hold (SH)
    Comparator
    SR Latch
    Reference Muxing Timing Issue
    Sequence DFE Critical Timing
  IMPLEMENTATION & EXPERIMENTAL RESULTS
CHAPTER 3. A 6 GB/S DIRECT DIGITAL SEQUENCE DETECTOR WITH DATA TRACE-BACK AND EQUALIZER IN 65NM CMOS
  NOISE TOLERANCE LIMIT OF CURRENT RECEIVER
    Case I
    Case II
  IMPROVING NOISE TOLERANCE OF CURRENT RECEIVER
    Setting Fixed Data Comparators
    Case I
    Case II
    Trace-back Sequence Generation
    Conditions of Data Trace-back
    Improved Noise Margin
  SYSTEM DESIGN
    Reference Muxing Timing
    DFE and Trace-back Feedback
  EXPERIMENTAL RESULTS
CHAPTER 4. BURST MODE OPTICAL RECEIVER WITH 0NS LOCK TIME BASED ON CONCURRENT DC OFFSET AND TIMING RECOVERY TECHNIQUE
  CONVENTIONAL OPTICAL RECEIVER
  CHALLENGES IN BURST-MODE RECEIVER
  PROPOSED BURST MODE RECEIVER
    Trans-impedance Amplifier (TIA)
    DC Recovery
    Clock Recovery
    Timing Skew Correction
  IMPLEMENTATION AND MEASUREMENT RESULTS
  COMPARISON WITH STATE-OF-ART
CHAPTER 5. CONCLUDING REMARKS
  FUTURE WORKS
BIBLIOGRAPHY

List of Tables

Table.: Performance Summary of the 0 Gb/s Receiver
Table 3.: Reference placement for checking probable bank miss
Table 3.: Trace-back choice logic
Table 3.3: Detecting Strong /
Table 3.4: Receiver Summary
Table 4.: Performance Comparison Summary of the TIA
Table 4.: Performance Comparison Summary of the Receiver
Table 5.: Performance summary of the implemented receivers

List of Figures

Figure.: Data rate trend in digital receivers ([3] [])
Figure.: Power efficiency of high-speed digital receivers over the years ([4], [] [6])
Figure.: Conventional wireline receiver
Figure.: N-tap feedforward equalization technique
Figure.3: N-tap decision feedback equalization technique
Figure.4: -tap loop unrolled DFE
Figure.5: Conventional ADC-based solution for the wireline receivers
Figure.6: Digital receiver using FFE and maximum likelihood sequence detector
Figure.7: ADC-based receiver block diagram and reference placement
Figure.8: Sequence detector reference placement depending on tap coefficient
Figure.9: Digital sequence detection technique at t=t0 and t=t
Figure.0: Digital sequence detection technique at t=t and t=t
Figure.: Digital sequence detection technique (summarized)
Figure.: ADC vs. Sequence Detection
Figure.3: Sequence DFE choices for 3 taps
Figure.4: Example showing sequence DFE technique
Figure.5: Channel response and single bit response before and after the passive equalizer of example transmission link
Figure.6: References for the link
Figure.7: Placement of edge comparator for prediction
Figure.8: Use of the edge comparators for placing the data comparators
Figure.9: Sequence DFE feedback for the link
Figure.0: Reduction of bank comparators using DFE
Figure.: (a) Placement of floating comparators. (b) Verification of edge prediction. (c) Example sequence generation and DFE implementation
Figure.: Sequence sorting process of the overall system
Figure.3: Comparison of the required number of comparators among loop unrolled DFE, ADC, and sequence DFE
Figure.4: Comparison of noise margin between ADC-based DFE and sequence DFE
Figure.5: Error tolerance of edge comparator prediction to send the floating data comparators to bottom at time t=t
Figure.6: Error tolerance of edge comparator prediction to send the floating data comparators to the middle at time t=t
Figure.7: Error tolerance of edge comparator prediction to send the floating data comparators to bottom at time t=t
Figure.8: Error tolerance of edge comparator prediction to send the floating data comparators to the top at time t=t
Figure.9: In-bank comparator error tolerance of the system using DFE
Figure.30: Maximum likelihood sequence detector with passive equalization and timing recovery
Figure.3: Schematic of passive equalizer with its AC response for different settings
Figure.3: Passive equalizer response for example link. (a) AC analysis, (b) input to the equalizer and (c) output of the equalizer
Figure.33: Layout of the implemented passive equalizer
Figure.34: Concept of sample and hold
Figure.35: Implemented sample and hold circuitry (a) and its simulated differential operation (b)
Figure.36: Layout of sample and hold circuit
Figure.37: Schematic of the implemented comparator
Figure.38: Operational stages of the comparator
Figure.39: Simulation results showing sampling, regeneration, decision, and reset stages of the comparator
Figure.40: Layout of the implemented comparator
Figure.4: SR Latch implementation and truth table
Figure.4: Layout of the implemented SR latch
Figure.43: Edge and data SH and comparator clocking for CH
Figure.44: Reference settling behavior
Figure.45: Sequence DFE and loop unrolling for B+ feedback (red box)
Figure.46: Implemented prototype in 65nm CMOS
Figure.47: Complete receiver with its power consumption
Figure.48: Pin diagram and test setup of the prototype
Figure.49: Measured half rate 4-bit sequence DAC output
Figure.50: Measured 0 Gb/s input eye (a) and 6 levels output eye of sequence detector (b)
Figure.5: Measured.5 GHz recovered clock eye (a) and histogram (b)
Figure.5: BER bathtub (a) and the recovered.5 Gb/s PRBS (Pseudo-random bit sequence) checked data eye
Figure 3.: Case I error in sequence detection. B0 and B- detected incorrect
Figure 3.: Case II error in sequence detection. B- detected incorrect
Figure 3.3: Introduction of two fixed data comparator reference instead of fixed edge comparators
Figure 3.4: Fixing error with little margin using fixed data comparators
Figure 3.5: Introduction of check comparator to find out whether it may miss a bank or not
Figure 3.6: Overall reference placement of the architecture with trace-back
Figure 3.7: Resolving the issue of case II
Figure 3.8: Comparison of noise margins of 4 tap sequence DFE with and without data trace-back with ADC-based DFE
Figure 3.9: System architecture of quad-rate receiver with data trace-back
Figure 3.0: Reference Muxing for quadrate channel running at 4 GHz
Figure 3.: Feedback for DFE and Trace-back
Figure 3.: Implemented prototype in 65nm
Figure 3.3: Pin diagram and test setup of the prototype
Figure 3.4: Channel Response at 8 GHz
Figure 3.5: Measured 6 Gb/s input eye after passive equalizer (a) and 6-level output eye of the sequence decoder (b)
Figure 3.6: Measured 4 GHz clock eye (a) and PRBS checked recovered 4 Gb/s data eye (b)
Figure 3.7: BER bathtub comparing DFE and Trace-back
Figure 3.8: D color-map of BER showing voltage margin in Y-axis and timing margin in X-axis of sequence DFE (a) and data trace-back (b)
Figure 4.: Conventional vs. proposed implementation of the optical receiver
Figure 4.: Conventional implementation of burst mode receiver with DC offset calibration
Figure 4.3: Settling time vs. bandwidth of DC recovery loop
Figure 4.4: Conventional timing recovery loop
Figure 4.5: Effect of DC on duty cycle of the data
Figure 4.6: Block Diagram of the receiver with power breakdown
Figure 4.7: Detailed schematic of TIA and its DC recovery loop
Figure 4.8: TIA transfer function and its simulated and measured gain
Figure 4.9: Measured 0 Gb/s output eye of the front end
Figure 4.0: Simulated amplifier stage output eye with (left) and without (right) C & C for input currents of 00 µA, 50 µA and 600 µA (from top to bottom)
Figure 4.: Simulated and calculated input referred noise of TIA
Figure 4.: Proposed DC recovery technique
Figure 4.3: Measured scope shot of DC recovery operation within 4.8 ns
Figure 4.4: DC recovery without LPF and use of rising edge pulses allowing concurrent operation of DC and timing recovery
Figure 4.5: Ring oscillator with pulse filtering and its timing diagram
Figure 4.6: Timing Skew compensation
Figure 4.7: Slope Detection
Figure 4.8: Slope detection logic
Figure 4.9: Implemented die photo in 0.3 µm
Figure 4.0: Pin diagram and test setup of the prototype
Figure 4.: Measured scope shot of DC and timing recovery with preamble and data pattern
Figure 4.: Recovered clock eye
Figure 4.3: Recovered clock histogram
Figure 4.4: Phase noise plot of the recovered clock
Figure 4.5: On-chip PRBS checked recovered.5 Gb/s channel data eye

Chapter 1. Introduction

Wired data communication systems can be classified into two major categories based on their transmission medium: wireline and optical. Wireline systems use a copper medium for short-distance data transmission. For chip-to-chip backplane data transmission, copper is a cost-effective option. Optical systems, on the other hand, use fibre optic cable to carry a large amount of data over a long distance. Fibre optic cables can provide significantly higher data bandwidth than copper. As a result, optical receivers are finding their place in data centers.

Wireline and optical receivers implemented in the analog domain have shown excellent performance and power efficiency at high speed ([], []). However, with shrinking technology nodes and increasing data rates, the performance of these receivers is not improving as much as that of their digital counterparts due to supply scaling, increased device leakage, increased noise and process variations. As a result, digital implementations of these receivers are attracting more interest from the circuit design industry than their analog counterparts. Figure. shows the increasing data rates of state-of-art digital receivers over the years, implying that good high-speed performance can be achieved in digital receivers as well. However, digital implementations require analog-to-digital converters (ADC) followed by digital signal processing (DSP) units, which make them exceed the existing transceiver power budget. Figure. illustrates the power efficiency (consumed power per bit) of high-speed digital receivers over the years.

Figure.: Data rate trend in digital receivers ([3] []). (Plot of data rate in Gb/s against year of publication.)

Figure.: Power efficiency of high-speed digital receivers over the years ([4], [] [6]). (Plot of power efficiency in pJ/bit, log scale, against year.)

1.1. Motivation

The main motivation of this thesis is to make digital receivers power efficient in both the wireline and the optical case. In wireline receivers, high-speed data passing through the copper medium suffers from dielectric loss and parasitic crosstalk. The channel acts as a low pass filter on the high-frequency components of the signal and attenuates them, while random and deterministic noise is added. So, in the presence of high-speed data and channel loss, un-equalized transceivers cannot provide adequate performance. Different equalization schemes have therefore become a necessary part of receiver design. One of the motivations of this thesis is to search for a cost-effective, low power equalization scheme for wireline digital receivers.

Burst mode optical receivers are used in rapidly reconfigurable photonic switch networks. This reconfiguration requires both burst mode DC recovery and burst mode timing recovery in optical receivers. This thesis also concentrates on a low latency, low power digital burst mode optical receiver.

Although pulse amplitude modulation (PAM) has a lower bandwidth requirement, non-return-to-zero (NRZ) binary data is used in this thesis for the sake of simplicity and a higher signal to noise ratio (SNR).

1.2. Organisation of the Thesis

A total of three data receivers are designed for this thesis. Two of them are focused on wireline applications and the other one is an optical receiver.

Chapter 2 describes an energy efficient 0 Gb/s sequence detector and equalizer without the use of any ADC and DSP, implemented in 65nm CMOS technology. The chapter contains the concept of sequence detection and equalization using a highly digital architecture. The implemented analog front end, the digital logic and the timing issues of the system are also described in detail.

Chapter 3 modifies the receiver design from Chapter 2 to improve its noise immunity. The receiver design is modified to have -bit data trace-back, which provides better noise performance and voltage margin than before. The front-end and digital logic are made to handle a higher data rate of 6 Gb/s for this receiver.

Chapter 4 describes a 7-0 Gb/s burst mode optical receiver implemented in 0.3 μm. Performing DC recovery without a low-pass filter allows timing recovery to run at the same time, resulting in a low latency receiver. The chapter covers the analog front-end with its noise performance, successive approximation register (SAR) logic based DC recovery, injection locking based timing recovery and timing skew correction using slope detection of the signal.

Chapter 5 summarizes the performance of all three receivers and proposes future work.

Chapter 2. A 0 Gb/s Direct Digital Sequence Detector and Equalizer without ADC-DSP in 65nm CMOS

This chapter describes a low power 0 Gb/s sequence detector and equalizer without the use of ADC and DSP in TSMC 65nm. The chapter begins with a brief description of conventional receivers in Section 2.1. This section compares state-of-art mixed signal and ADC-based solutions. Section 2.2 discusses the technique of digital sequence detection and compares this technique with ADC-based solutions. Digital sequence decoding naturally leads to the sequence DFE technique. Section 2.3 takes an example transmission link and discusses the sequence generation and DFE implementation techniques used in this receiver. Section 2.4 discusses the hardware cost and noise margin of the system and compares them with ADC-based solutions. Section 2.5 discusses the error tolerance of the receiver with examples. Section 2.6 provides details of the receiver design, starting from the top-level block diagram and then descending to the overall design. This section includes a description of the sensitive analog front-end blocks and the critical timing margins of the system. Section 2.7 shows experimental results and a comparison with state-of-art ADC-based receivers.

2.1. Equalization: Moving from Analog to Digital

In high-speed wireline transceivers, the frequency dependent channel loss is the main source of inter-symbol interference (ISI). In simple words, ISI is the residue of the current symbol that affects the following symbols (post-cursor) as well as the previous symbols (pre-cursor). For high loss channels, conventional receiver designs ([], [7] [0]) usually feature analog continuous time linear equalization (CTLE), implemented using active or passive elements, in the front end as shown in Figure.. In addition, feed-forward equalization (FFE) (Figure.) and decision feedback equalization (DFE) (Figure.3) techniques are used for further ISI cancellation and bit detection.

Figure.: Conventional wireline receiver.

Figure.: N-tap feedforward equalization technique.

Figure.3: N-tap decision feedback equalization technique.

There are feedback timing concerns for the first tap cancellation in DFE for high-speed data, since this feedback has to be completed within one unit interval (UI). Although prior work [9] has demonstrated that direct cancellation of the first tap is possible, it becomes hard to comply with the DFE timing margin for the first tap as data rates go up. To resolve this timing issue, a commonly used approach is loop unrolled DFE ([], []). In this approach, the candidate decisions are pre-calculated and then one of them is selected using the previously decoded bits. This approach can be extended to other taps if there are feedback timing concerns with them. Figure.4 shows the -tap loop unrolled DFE architecture. The loop unrolled approach doubles the front-end hardware and power for each additional tap that has timing concerns.
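The difference between direct feedback and loop unrolling can be sketched for a single post-cursor tap. This is only an illustrative behavioral model, not the circuit described in this thesis: the 2-tap channel, the tap value `beta = 0.4`, the test pattern and all function names are assumptions chosen for the example.

```python
def sign(x):
    """Ideal slicer: map a sample to a +1/-1 NRZ decision."""
    return 1 if x >= 0 else -1

def channel(symbols, h_main, h_post):
    """Hypothetical 2-tap channel: main cursor plus one post-cursor."""
    out, prev = [], 0
    for s in symbols:
        out.append(h_main * s + h_post * prev)
        prev = s
    return out

def dfe_direct(samples, beta):
    """Direct 1-tap DFE: subtract beta * previous decision, then slice."""
    decisions, prev = [], 0
    for y in samples:
        prev = sign(y - beta * prev)
        decisions.append(prev)
    return decisions

def dfe_loop_unrolled(samples, beta):
    """Loop-unrolled 1-tap DFE: slice against both candidate thresholds,
    then let the previous decision select one of the two results."""
    decisions, prev = [], 0
    for y in samples:
        if_prev_high = sign(y - beta)   # assumes previous decision was +1
        if_prev_low = sign(y + beta)    # assumes previous decision was -1
        if prev == 0:                   # no history yet: plain slicer
            prev = sign(y)
        else:
            prev = if_prev_high if prev > 0 else if_prev_low
        decisions.append(prev)
    return decisions

# Assumed test pattern and tap values (not taken from the thesis).
tx = [1, -1, -1, 1, 1, -1, 1, -1]
rx = channel(tx, h_main=1.0, h_post=0.4)
```

Both functions return the same decisions. The unrolled version needs two comparisons per sample instead of one, which is the hardware doubling mentioned above, but the feedback path shrinks to a multiplexer select.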

Figure.4: -tap loop unrolled DFE.

In general, analog mixed-signal solutions can equalize with excellent energy efficiency (~3 pJ/bit ([], [0])). However, their performance can be limited by the analog performance of the technology, such as gain-bandwidth product and comparator resolution. First, the CTLE generally inverts the channel response, which amplifies noise (including crosstalk) and degrades SNR. Second, the linearity requirement is hard to meet because a scaled supply reduces the maximum achievable linear swing. Third, process variation makes it very difficult to control the zero and pole frequencies reliably enough to achieve the desired frequency response. All these factors limit the performance of the symbol-by-symbol detection technique. In such SNR-limited cases, sequence decoders outperform symbol-by-symbol detectors. However, existing maximum likelihood sequence detectors (MLSD) require an analog-to-digital converter (ADC) in the front-end ([3], [6]).

Figure.5: Conventional ADC-based solution for the wireline receivers.

There are previously reported ADC-based solutions where equalization (FFE/DFE) is moved to the digital domain (Figure.6) ([4], [4], [6], [3]). Harwood et al. [3] implemented a receiver with -tap FFE and 5-tap DFE in the digital domain using a baud rate sampling ADC for.5 Gb/s operation. The architecture used two 6.5 Gb/s (half rate) 4.5-bit flash ADCs and a DSP to perform the numerical FFE and DFE, compensating 4 dB channel loss with a power efficiency of 6.4 pJ/bit. Zhang et al. [6] combined both conventional approaches by designing a dual path receiver: CTLE and DFE for high SNR input and an ADC-based solution for low SNR data. The time-interleaved 6-bit ADC alone takes 95 mW for 0.35 GS/s operation. Moreover, the additional DSP consumes 500 mW [4]. Recent works ([4], []) demonstrate that ADC-based solutions can achieve a power efficiency of ~0 pJ/bit. Chen et al. [4] demonstrated a power efficiency of 3 pJ/bit using a four-way time interleaved 4-bit flash ADC. A three-stage continuous time high pass filter and a -tap FFE were implemented in the analog front-end, and a 5-tap DFE was implemented in the digital domain to equalize 9 dB channel loss. Shafik et al. [] demonstrated a power efficiency of 8.7 pJ/bit using a 3-way time interleaved 6-bit SAR ADC. A 4-tap FFE and a 3-tap DFE were implemented in the digital domain to compensate channel loss up to 5.3 dB.

In [3], Agazzi et al. demonstrated the first maximum likelihood sequence detector (MLSD) for multimode fibers (Figure.6). The reported sequence decoder outperforms symbol-by-symbol detectors by at least electrical dB for 0.35 Gb/s operation. The analog front-end of the digital receiver incorporates an 8-way time-interleaved 0-stage pipelined ADC with self-calibration, where each path has two slices. At any given time in one path, one slice is in normal operation mode and the other one is in calibration or power down mode. For on-chip DSP, the outputs of the ADCs are further demultiplexed by a factor of. The digital back-end of the receiver has a nonlinear MIMO (multiple-input, multiple-output) channel estimator, a 5-5 tap FFE and an 8-state sliding block Viterbi decoder (SBVD).

Figure.6: Digital receiver using FFE and maximum likelihood sequence detector.

In all these cases, while the ADC alone approaches 0 pJ/bit power consumption, including the DSP [6] significantly exceeds the receiver power budget for the SerDes solution space. Therefore, a solution without ADC and DSP is an attractive low-power alternative to existing approaches.

2.2. Concept of Sequence Decoding

Any high-speed data going through a lossy medium suffers from ISI, crosstalk, and random noise. In most cases, ISI is the dominant one of these impairments. Due to ISI, we can say the channel has memory. This memory can extend up to hundreds of previous symbols and a few following symbols. For simplicity, we can consider a channel where ISI is limited to three taps: one pre-cursor (h-), main (h0) and one post-cursor (h+). We can assume a transmitter from which bits are transmitted, and a receiver that receives those bits with ISI after they pass through the lossy channel. The received signal is essentially the result of convolving the transmitted bit symbols with the impulse response of the channel. Therefore, baud spaced sampled values will be a combination of these three taps. An example of a transmitted bit sequence and the received signal after channel loss is illustrated in Figure.7, which will be used throughout this section to discuss the concept of sequence detection.
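The three-tap channel model above can be written out as a short sketch. The tap values (`h_pre = 0.15`, `h_main = 0.6`, `h_post = 0.25`), the unipolar 0/1 bit representation and the bit pattern are assumptions made for illustration; they are not the measured channel of this work.

```python
def received_samples(bits, h_pre, h_main, h_post):
    """Baud-spaced samples of a 3-tap channel:
    y[n] = h_pre * b[n+1] + h_main * b[n] + h_post * b[n-1].
    Bits outside the pattern are taken as 0."""
    n = len(bits)
    samples = []
    for i in range(n):
        prev_bit = bits[i - 1] if i > 0 else 0
        next_bit = bits[i + 1] if i < n - 1 else 0
        samples.append(h_pre * next_bit + h_main * bits[i] + h_post * prev_bit)
    return samples

# Assumed example pattern and tap values (hypothetical).
bits = [0, 0, 1, 1, 0, 1]
rx = received_samples(bits, h_pre=0.15, h_main=0.6, h_post=0.25)
rounded = [round(v, 6) for v in rx]
```

With these assumed taps, the second sample already rises to 0.15 although its own bit is 0, because the following 1 leaks in through the pre-cursor. Each sample therefore carries information about three neighbouring bits.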

2.2.1. Reference Placement: ADC vs. Sequence Detector

For ADC-based solutions, if we consider a flash ADC, the received signal is fed to a comparator bank (Figure.7). The total signal space is divided into 2^N levels for an N-bit ADC to cover the whole dynamic range. As we are considering 3 taps of ISI for the sequence detector, we can also consider a 3-bit ADC, where the signal space will be divided into 2^3 = 8 levels. The comparator bank output is ideally thermometric in nature. A thermometric to binary converter gives us the ADC output, which is then processed in the digital domain for DFE and FFE operation. The digital implementation of the FFE is realized by subtracting the ADC outputs of the nearby cursors from the ADC output of the main cursor. In the digital subtractors, the main cursor output of the ADC is passed as is, and the nearby-cursor outputs are scaled by the ratio of their tap coefficients before subtraction. For the DFE implementation in the digital domain, the previous bit decisions are fed back and subtracted from the ADC output to get the current bit decision.

Figure.7: ADC-based receiver block diagram and reference placement.

For sequence detection, the received signal at any point in time can be seen as a combination of the pre-cursor and post-cursor components of the neighboring bits and the main cursor component of the current bit. Therefore, from the received signal sample, we can reconstruct the corresponding time sequence of previous, current and next bit. In this case (Figure.8), as there are only three taps, the bit sequence is B+B0B-. Here, B+ is the previous bit, B0 is the current bit, and B- is the next bit. The received signal has to go through a comparator bank. To set the references for these comparators, the simplest approach is to directly calculate the distance from the received sampled value to each sequence constellation. This can be done by comparing the sampled value to a set of references based on different combinations of the main (h0), pre-cursor (h-) and post-cursor (h+) taps. In general, we will have a time sequence of length N if there are N un-equalized taps in the single bit response. In the signal space, these N taps can combine in 2^N ways, creating 2^N signal levels corresponding to 2^N unique sequences.
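The reference construction described above can be illustrated with a small sketch that enumerates the 2^3 sequence levels and picks the closest one for a sample. The tap values are hypothetical, and the nearest-level search stands in for the comparator bank; it is not the comparator arrangement implemented in this receiver.

```python
from itertools import product

def sequence_levels(h_post, h_main, h_pre):
    """Ideal noiseless level for every 3-bit sequence (B+, B0, B-).
    The previous bit B+ contributes through the post-cursor tap and the
    next bit B- through the pre-cursor tap."""
    levels = {}
    for b_prev, b_cur, b_next in product((0, 1), repeat=3):
        levels[(b_prev, b_cur, b_next)] = (
            h_post * b_prev + h_main * b_cur + h_pre * b_next
        )
    return levels

def decode_sample(sample, levels):
    """Return the sequence whose ideal level is closest to the sample."""
    return min(levels, key=lambda seq: abs(levels[seq] - sample))

# Assumed tap values, chosen so that all eight levels are distinct.
levels = sequence_levels(h_post=0.25, h_main=0.6, h_pre=0.15)
distinct = len({round(v, 6) for v in levels.values()})
```

For example, a sample of 0.42 decodes to (1, 0, 1) under these assumptions: the current bit is 0, but the previous and next 1s hold the signal in the middle of the signal space.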

Figure.8: Sequence detector reference placement depending on tap coefficient.

2.2.2. Digital Sequence Detection

Figure.-Figure. present the sequence detection technique. For digital sequence detection, we place the comparator references as combinations of tap values representing the time sequence, as illustrated in Figure.8. At time t=t0, the signal is at the bottom of the signal space (Figure.9 (a)). At this time, from the transmitted bit sequence, the previous bit is 0, the current bit is 0, and the next bit is also 0. So, at this point, the 3-bit sequence decoder should give 000.

Figure.9: Digital sequence detection technique at (a) t=t0 and (b) t=t1.

The current bit for t=t0 becomes the previous bit at t=t1 (Figure.9 (b)). Similarly, the next bit for t=t0 becomes the current bit at t=t1. Now at t=t1, the next bit is 1. Due to this bit, the received signal starts to increase. The decoded sequence at t=t1 should be 001.

Figure.0: Digital sequence detection technique at (a) t=t2 and (b) t=t3.

Figure.: Digital sequence detection technique (summarized).

Similar things happen at t=t2 (Figure.0 (a)) and t=t3 (Figure.0 (b)). At t=t2 and t=t3, the signal is in the middle of the signal space. Although the main bit of the sequence is 0 at t=t3, the signal cannot go down due to its post-cursor and pre-cursor taps. Figure. shows the decoded time sequence from t=t0 to t=t5. As we are considering a sequence length of 3 bits corresponding to three taps of ISI from the channel, each bit is decoded three times: first as the next bit, then as the current bit and after that as the previous bit. In general, if there are N taps present in the signal, each bit should be decoded N times. Decoding the same bit N times results in improved signal to noise ratio (SNR) and can be used later for verification purposes. The post-cursor bits can be used for DFE operation, whereas the pre-cursor bits can be used for further error correction, as will be discussed later. If the received signal is pushed up or down due to noise, the DFE can compensate for the resulting sequence decoder errors.
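The fact that each bit appears in three consecutive decoded sequences can be sketched as follows. The majority vote used here is an assumed, simplified way of combining the three estimates for illustration; it is not the DFE or trace-back logic of this thesis, and the bit pattern and injected error are hypothetical.

```python
def collect_estimates(decoded):
    """decoded[n] = (B+, B0, B-) at time n, i.e. estimates of bits
    n-1, n and n+1. Returns {bit index: [estimates in arrival order]}."""
    estimates = {}
    for n, (b_prev, b_cur, b_next) in enumerate(decoded):
        for idx, bit in ((n - 1, b_prev), (n, b_cur), (n + 1, b_next)):
            estimates.setdefault(idx, []).append(bit)
    return estimates

def majority_vote(estimates, length):
    """Combine the available estimates of each bit by simple majority."""
    return [
        1 if 2 * sum(estimates[i]) > len(estimates[i]) else 0
        for i in range(length)
    ]

# Assumed transmitted pattern; bits before and after it are taken as 0.
bits = [0, 0, 1, 0, 1, 1]
decoded = [
    (0, 0, 0), (0, 0, 1), (0, 1, 0),
    (1, 1, 1),             # injected error: the correct sequence is (1, 0, 1)
    (0, 1, 1), (1, 1, 0),
]
estimates = collect_estimates(decoded)
recovered = majority_vote(estimates, len(bits))
```

Bit 3 is seen first as a next bit, then as a current bit and finally as a previous bit, so its estimates are `[0, 1, 0]` and the single wrong decision is outvoted.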

..3. Advantages & Challenges in Direct Sequence Decoding

The basic concepts of the ADC and of sequence detection have been discussed in prior sections, which gives us the background to compare the two techniques and to identify the challenges in implementing a sequence decoder.

Advantages of sequence decoding over ADC-based links:

The sequence decoder uses ISI in a constructive way. References are set using the ISI tap coefficient of each dominant tap in the single bit response. ADCs work independently of ISI; however, the DSP and the FFE/DFE do depend on ISI. In the FFE/DFE, ISI components are subtracted. Due to this subtraction, signal power is lost and quantization noise is added.

If there are N dominant taps in the single bit response seen by the sequence decoder, each bit is decoded N times. The example discussed so far has 3 dominant taps, so each bit is detected 3 times, which improves SNR. In ADC-based links, additional quantization noise degrades SNR.

The sequence decoder detects the time sequence directly. The main symbol comes as part of the sequence, so there is no need for additional DSP. The sequence decoder decodes previous bits and predicts upcoming bits; these bits can be used later for further error correction. For ADC-based links, however, an additional power-hungry DSP is required for the DFE/FFE implementation.

Figure.: ADC vs. Sequence Detection.

Challenges of sequence decoding over ADC-based links:

In ADC-based links, the comparator references are binary and monotonically increasing. In the sequence decoder, however, the references are non-binary and non-monotonic (Figure.).

In ADC-based links, the required comparator resolution for an N-bit ADC is: total signal space / 2^N. The required comparator resolution is not well defined for sequence decoding, as the references are non-binary and non-monotonic. It can be less than total signal space / 2^N for N-bit sequence detection.

In ADC-based links, the thermometric comparator output can easily be converted to binary values. However, converting the comparator output to possible sequences appears, so far, to be difficult.

..4. Sequence DFE

The channel assumed for explaining this concept has three dominant taps. Of these taps, only one is a post-cursor (h+1), which corresponds to B+1 in the bit sequence. As shown for sequence detection, this B+1 at any time was detected as B0 just 1 unit interval (UI) earlier. We can use this B0 from the previous time (1 UI before) as feedback for the current sequence detection. This post-cursor bit (B+1) feedback eases the pressure on the sequence detector to give the correct sequence directly. From the sequence decoder, we can take a set of two possible sequences that differ in B+1, and choose one of the two depending on the value of B+1 (Figure.3). This post-cursor feedback is the same as a DFE implementation. As we are implementing it for a sequence detector, we call it sequence DFE. In general, if there are N post-cursor taps present in the single bit response of the received signal at the sequence detector input, we can generate 2^N sequences. Out of these 2^N sequences, only one is chosen using the N post-cursor bits.

For this example channel, to give choices to the DFE, let us have eight comparators in the comparator bank. Each of them gets a reference for a different possible combination of h+1, h0, and h-1. Although the sequences do not increase monotonically, comparators C0-7 are set here in monotonically increasing order. The truth table in Figure.3 shows the choices for the DFE implementation as each comparator output changes. We will discuss how easily these choices can be generated in a later section. In the truth table, if all the comparators give 0 or all give 1, there is only one choice for the DFE. For the all-0 case, 000 goes as the choice, which is already paired with 00 when only one comparator is 1. We can get rid of this single option

as this choice is already available. The same holds for the all-1 case. So, the DFE feedback gives us the option of removing the topmost and bottommost comparators while still having all the possible sequences in the DFE choice list.

Figure.3: Sequence DFE choices for 3 taps. (The truth table maps the outputs of comparators C7-C0 to the DFE choices; a single choice at either extreme is already given in another pair, which makes the corresponding comparator redundant.)

Figure.4 shows an example of the DFE implementation. At time t=t, four of the bottom comparators are 1 and the rest of them are 0. According to the truth table in Figure.3, the DFE choices would be 00 and 0, going to a 2:1 MUX. As the previously decoded bit is 0, out of these two choices the one with B+1=0 goes to the output. So, the output here is 00. Sequence generation for the DFE, the DFE implementation, and the timing margin of the DFE will be discussed in detail in later sections.
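The 2:1 selection described above can be sketched as follows. This is a minimal illustration, not the implemented logic: sequences are written as strings in the order B+1 B0 B-1, the function names are hypothetical, and the candidate pairs in the usage lines are made-up examples rather than rows of the truth table in Figure.3.

```python
def sequence_dfe_select(choices, feedback_bit):
    """Pick the candidate whose post-cursor bit B+1 equals the fed-back decision.

    Each candidate is a 3-character string in the order B+1 B0 B-1.
    """
    matching = [seq for seq in choices if int(seq[0]) == feedback_bit]
    if len(matching) != 1:
        raise ValueError("choices must differ in B+1 so feedback selects exactly one")
    return matching[0]


def run_feedback(choice_pairs, initial_bit=0):
    """Chain the selection: the B0 decided now is the B+1 feedback one UI later."""
    feedback, outputs = initial_bit, []
    for pair in choice_pairs:
        chosen = sequence_dfe_select(pair, feedback)
        outputs.append(chosen)
        feedback = int(chosen[1])  # decoded main bit becomes the next feedback
    return outputs


# Hypothetical candidate pairs for three consecutive unit intervals.
pairs = [("001", "101"), ("011", "111"), ("010", "110")]
print(run_feedback(pairs))
```

The chained version makes the timing dependency explicit: each selection needs the main bit decided one UI earlier, which is why the timing margin of the feedback path matters.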

Figure.4: Example showing sequence DFE technique.

.3. Link Design

The receiver is designed for a voltage mode transmitter whose differential swing varies from 600mVpp to Vpp. Therefore, signal attenuation is needed to scale the received signal to match the dynamic range at the comparator bank input. Since the high-frequency content of the signal is already attenuated by the channel (top of Figure.5), only the low-frequency content is attenuated, which amounts to passive equalization.

Figure.5: Channel response and single bit response before and after the passive equalizer of example transmission link.

For a channel giving 7 dB loss, only 5 to 7 dB boost (or DC attenuation) is sufficient to contain the ISI components within four taps: one pre-cursor, the main cursor, and two post-cursor taps (Figure.5). This partially equalized signal is then fed to a 4-bit sequence decoder. The same scheme works for other channels as well, provided the passive equalizer offers a programmable boost.

.3.. Comparator Reference for Sequence Generation

The channel assumed in section. had only 3 taps at the input of the comparator bank or sequence decoder. The practical channel after passive equalization has 4 dominant taps in its single bit response. For reference generation, instead of ordering the taps by their time sequence, we can order them by descending weight. The main cursor tap (h0) always has the highest weight among them. In most cases, we will find h0 > h+1 > h-1 > h+2. So, the MSB represents B0, the MSB-1 bit represents B+1, the MSB-2 bit represents B-1, and the LSB represents B+2 (Figure.6).

Figure.6: References for the link. (Reference sequence: B0 B+1 B-1 B+2.)

The taps do not have a binary relation with each other, which gives rise to overlaps between reference levels. For example, here 00 goes above 000 levels. There are a total of 3 overlaps for this link. If the channel loss increases further, the number of overlaps will increase. However, if we divide the references into different banks of B0 and B+1, there will be no overlap within each bank, as usually h-1 > h+2. The banks will overlap with each other, but not within themselves.

.3.. Sequence Generation and DFE Implementation

For an N-bit sequence detector, the input signal needs to be compared to 2^N reference levels, which results in a power and area penalty similar to that of a flash ADC. So, for a 4-bit sequence, there will be 16 reference levels. Note that, due to ISI, the sample-to-sample signal variation is limited; therefore, it is not necessary to cover the entire signal space. Rather, based on the previous sample position, covering only 50% of the signal space is sufficient.

Figure.7: Placement of edge comparator for prediction
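The overlap between reference levels, and its absence inside a bank, can be checked numerically. The sketch below assumes hypothetical tap weights in arbitrary integer units (chosen only to respect h0 > h+1 > h-1 > h+2, not taken from this work) and maps bits to -1/+1; the number of overlaps it reports depends entirely on those assumed weights.

```python
# Hypothetical tap weights in arbitrary integer units (not taken from this work),
# ordered by weight as in the text: h0 > h+1 > h-1 > h+2.
TAPS = (60, 30, 20, 15)


def level(code):
    """Reference level for a 4-bit code B0 B+1 B-1 B+2, bits mapped to -1/+1."""
    bits = [(code >> shift) & 1 for shift in (3, 2, 1, 0)]
    return sum(tap * (2 * bit - 1) for tap, bit in zip(TAPS, bits))


def overlaps():
    """Consecutive codes whose reference levels are out of order."""
    return [(code, code + 1) for code in range(15) if level(code) > level(code + 1)]


def monotonic_within_banks():
    """True if levels rise with the code inside every bank (fixed B0 and B+1)."""
    return all(level(bank * 4 + k) < level(bank * 4 + k + 1)
               for bank in range(4) for k in range(3))


for low, high in overlaps():
    print(f"{low:04b} sits above {high:04b}")
print("no overlap within banks:", monotonic_within_banks())
```

With these assumed weights, the out-of-order pairs all fall on bank boundaries, which matches the observation that banks overlap with each other but not internally.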

As we are covering 50% of the signal space, rather than having 2^4=16 comparators, we can have half of them. In other words, rather than having 4 banks, we can have only 2 of them at a time. The selection of the two banks depends on the previous sample value. In clock and data recovery circuits, the sample before a data sample is an edge sample. We can place two comparators at the edge sample and use their outputs to reposition the data comparators each time (Figure.7).

Figure.8 illustrates how two edge comparators (Cedge1 & Cedge0) predict the data position and place the references for the data comparators. In Figure.8(a), at time t=t0, the data was at the bottom of the signal space. The edge sample between t=t0 and t=t1 results in Cedge1=0 and Cedge0=1. This edge information implies that the next sample may be in the middle of the signal space. So, rather than the bottom bank (00) references, we can give the 10 bank references at t=t1. However, the 01 bank references remain in their position.

In Figure.8(b), at time t=t, the data was in the middle of the signal space. The edge sample in between t=t and t=t results in Cedge1=1 and Cedge0=1. This edge data implies that the next sample may be in the top part of the signal space. So, rather than the references of one of the middle banks (01), we can give the 11 bank references at t=t3. However, the 10 bank references remain in their position.

So, these two edge comparators directly control the references going to the floating comparators. Cedge1 controls reference switching between the 11 and 01 banks, and Cedge0 does the same for the 10 and 00 banks. Figure.8(c) illustrates reference placement for all the cases. The number of comparators now reduces from 16 to 10, of which 2 are edge comparators and 8 are floating data comparators.
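The reference switching described above amounts to a small lookup. The sketch below is an illustration under assumptions: banks are labelled by B0 B+1 as 00, 01, 10, 11 from bottom to top, Cedge1 is taken to swap the 01/11 references and Cedge0 the 00/10 references, and the function names are hypothetical.

```python
def active_banks(cedge1, cedge0):
    """Return the two banks whose references are applied to the data comparators."""
    if cedge1 and not cedge0:
        # Assumed: the Cedge1 reference sits above the Cedge0 reference, so a
        # valid thermometer code never has Cedge1=1 with Cedge0=0.
        raise ValueError("inconsistent edge comparator outputs")
    slot_a = "11" if cedge1 else "01"   # slot switched by Cedge1
    slot_b = "10" if cedge0 else "00"   # slot switched by Cedge0
    return tuple(sorted((slot_a, slot_b)))


def predicted_position(cedge1, cedge0):
    """Name the half of the signal space covered by the selected banks."""
    names = {("00", "01"): "bottom", ("01", "10"): "mid", ("10", "11"): "top"}
    return names[active_banks(cedge1, cedge0)]


for outputs in ((0, 0), (0, 1), (1, 1)):
    print(outputs, active_banks(*outputs), predicted_position(*outputs))

# Comparator budget at this stage: 2 edge comparators + 2 banks x 4 references.
print("comparators:", 2 + 2 * 4)
```

Only one bank changes between neighbouring positions, so only half of the floating references need to be switched at any edge sample.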

Figure.8: Use of the edge comparators for placing the data comparators. (Panels (a) and (b): reference switch at the edge sample; panel (c): reference placement for all cases.)

In sequence detection, we have two post-cursor bits. Sequence DFE allows the sequence decoder to give four possible sequences differentiated by B+1 and B+2. At first, the B+2 feedback comes into action, as in Figure.9. Within the banks, sequences are differentiated by B+2. So, when we are using a bank reference, we can take two possible sequences from it with different B+2 values and sort out the correct value of B+2 using the DFE feedback. Figure.0 takes 00 bank as an example. Here, even if we remove the comparators C0 and C3 of the bank, we still get all possible sequences.

Figure.9: Sequence DFE feedback for the link.

Figure.0: Reduction of bank comparators using DFE.

So, the number of comparators per bank is now 2. In total, we need 6 comparators for sequence detection: two edge comparators for prediction and four floating data comparators (CF0-3) for in-bank comparison. In-bank comparison gives us the possible combinations of B-1 B+2. B0 and B+1 differentiate the banks. But out of four banks, two are selected using the edge comparators. We cannot use the edge prediction directly to generate the possible combinations of B0 B+1. That prediction needs to be verified using the data comparators. Although the data comparators are floating, the data comparator having the top floating reference is called CF3 and the bottom one is called CF0.

Figure.(a) shows the positions of all six comparators at different times. For edge prediction verification in Figure.(b), when the edge comparators send the floating comparators to the bottom two banks, we check the top floating comparator. If CF3=0, the signal is actually at the bottom. For the middle case, both the top (CF3) and the bottom (CF0) floating comparators are checked. CF3=0, in this case, implies we have not missed the top bank, and CF0=1 implies we have not missed the bottom bank. The possible combinations of B0 B+1 are also given in Figure.(b).

Figure.(c) shows two examples of how sequence generation and DFE work. At time t=t, the edge comparators predict and the data comparators verify that the signal is actually on top. The top position implies that 11 and 10 are the possible sequences of B0 B+1. For the 11 bank, the floating comparators CF3 and CF2 are 0 at t=t, giving possible sequences of 00 and 01 for B-1 B+2. So, from the 11 bank, the two possible sequences are 1101 and 1100.

Figure.: (a) Placement of floating comparators. (b) Verification of edge prediction. (c) Example sequence generation and DFE implementation.

Verification of edge prediction, panel (b):
Cedge1  Cedge0  Verification      Position
0       0       CF3=0             Bottom
0       1       CF3=0 & CF0=1     Middle
1       1       CF0=1             Top

CF1 and CF0 do the in-bank comparison for the 10 bank. At t=t, both of them are 1, which implies that 11 and 10 are the two possible combinations of B-1 B+2 in the 10 bank. So, from the 10 bank, the two possible combinations are 1011 and 1010. The four generated combinations go to the DFE multiplexer. One sequence from each bank is selected using the B+2 feedback. At t=t, the B+2 feedback is 0. As B+2 is the LSB here, the feedback is compared with the LSB, and the combinations with B+2=0 are passed to the next MUX. Now, there are two

sequences that can be differentiated using B+1. At t=t, the B+1 feedback is 1. In the combinations coming to the MUX, the B+1 bit is the MSB-1 bit. After comparison with the B+1 feedback, the sequence 1100 is passed as the DFE output. A similar example is shown for t=t3.

Figure. shows the overall sorting process of the system. As there are 4 dominant taps left in the single bit response after partial equalization by the passive equalizer, we have a total of 16 sequences in 4 banks. The edge comparator prediction and the floating data comparator verification give us 8 possible sequences from 2 banks. The in-bank floating comparators eliminate half of them, giving a total of four sequences, two from each of the 2 banks. These four sequences enter the DFE MUX. The B+2 feedback chooses one sequence from each bank. The B+1 feedback chooses the final sequence, which comes out as the DFE output.

Figure.: Sequence sorting process of the overall system. (Input with taps h0, h-1, h+1, h+2: 2^4=16 sequences; after prediction: 8 sequences; after the floating comparators in the bank: 4 sequences; after post-cursor B+2 feedback: 2 sequences; after post-cursor B+1 feedback: DFE output.)
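The last two stages of this sorting process can be sketched as a two-step filter. This is an illustration, not the implemented multiplexer: words are written as strings in the order B0 B+1 B-1 B+2, the function name is hypothetical, and the candidate values in the usage lines follow one reading of the worked example above (top position, B+2 feedback 0, B+1 feedback 1).

```python
def sequence_dfe(bank_candidates, fb_b_plus2, fb_b_plus1):
    """Reduce four candidate words to one using the two post-cursor feedbacks.

    bank_candidates maps a bank label 'B0B+1' to its two in-bank candidates
    'B-1B+2'. Full words are in the order B0 B+1 B-1 B+2.
    """
    survivors = []
    for bank, in_bank in bank_candidates.items():
        words = [bank + tail for tail in in_bank]
        # First MUX: B+2 is the LSB, so keep the word whose LSB matches.
        survivors += [w for w in words if int(w[3]) == fb_b_plus2]
    # Second MUX: B+1 is the MSB-1 bit and separates the two banks.
    final = [w for w in survivors if int(w[1]) == fb_b_plus1]
    if len(final) != 1:
        raise ValueError("feedback did not resolve a unique sequence")
    return final[0]


# Candidates assumed from the worked example: banks 11 and 10 (top position).
candidates = {"11": ("00", "01"), "10": ("11", "10")}
print(sequence_dfe(candidates, fb_b_plus2=0, fb_b_plus1=1))
```

The order of the two filters mirrors the text: the in-bank ambiguity is resolved first, and the bank itself is chosen last.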

.4. Hardware Cost and Performance of the System

For a comparison among the loop unrolled DFE, the ADC-based DFE, and the sequence DFE, we consider a general channel that has N dominant taps remaining to be equalized. Of these taps, let L taps be pre-cursors and M taps be post-cursors. There is always one main cursor. Therefore, the number of comparators required for the loop unrolled DFE architecture can be written as

$$C_{\text{Loop Unrolled DFE}} = 2^{M} \quad \text{or} \quad 2^{N-L-1}$$

For an N tap channel, if we consider a P-bit ADC, the number of comparators required, C_ADC, is set by the resolution P. In the sequence DFE architecture discussed so far, there are edge comparators for prediction and data comparators for sequence generation and verification of prediction. The number of comparators required, C_Sequence DFE, is the sum of two terms that depend on M, L, and the prediction factor (Eq..3), where the prediction factor is a measure of the prediction capability of the edge comparators. The first term in Eq..3 gives the number of comparators required for prediction and the second term gives the number of comparators required for in-bank comparison.

Figure.3 shows the comparison among the three cases. For the sequence DFE, the prediction factor for the example 4 tap channel is considered to be 2, as we are covering half of the total signal space. This figure shows the required number of comparators for the sequence DFE for three values of the prediction factor (2, 3, and 4). In all cases, the number

of pre-cursors is considered to be 1. Figure.3 shows that the number of comparators required for the sequence DFE is comparable to that of the loop unrolled DFE.

Figure.3: Comparison of the required number of comparators among loop unrolled DFE, ADC, and sequence DFE. (Number of comparators vs. resolution; sequence DFE shown for prediction factors of 2, 3, and 4.)

The noise margin of the sequence DFE can be defined as half the distance between two banks having the same B+1 value. For an N tap sequence DFE having L taps of pre-cursors (h-i), a main cursor (h0), and M taps of post-cursors (h+i), the noise margin can be written as

$$NM_{\text{Sequence DFE}} = h_{0} - \sum_{i=1}^{L} h_{-i} - h_{+M} \qquad (.4)$$

We get the noise margin of a conventional DFE when the un-equalized taps (in this case, the pre-cursors (h-i)) interfere destructively with the main cursor (h0). In the case of an ADC-based DFE, the noise margin is further reduced by the quantization noise (QN) from the ADC. The quantization noise decreases as the resolution of the ADC increases.
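As a rough numerical illustration of this noise margin, the sketch below reads it as the main cursor minus all pre-cursor taps minus the last post-cursor tap, which is what half the gap between two banks with the same B+1 works out to for the 4-tap example. Both that reading and the tap values are assumptions made for illustration; they are not values from this work.

```python
def sequence_dfe_noise_margin(h_main, pre_cursors, post_cursors):
    """Noise margin read as h0 - sum(h_-i) - h_+M (an assumed interpretation)."""
    return h_main - sum(pre_cursors) - post_cursors[-1]


# Hypothetical, decaying tap values in arbitrary units.
H_MAIN = 0.60
PRE = [0.12]
POST = [0.30, 0.10, 0.05, 0.02]

for m in range(2, len(POST) + 1):
    margin = sequence_dfe_noise_margin(H_MAIN, PRE, POST[:m])
    print(f"{m} post-cursor taps handled: margin = {margin:.2f}")
```

With decaying taps, handling more post-cursors leaves a smaller residual tap in the expression, so the margin grows with the number of post-cursor taps.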

Figure.4 shows the comparison of noise margins between the ADC-based DFE and the sequence DFE. In this figure, practical channels with 4-7 dominant taps are considered. The tap values are: h0=0.6 mV, h+1=0.6 mV, h-1=0. mV, h+2=0.08 mV, h+3=0.04 mV, h+4=0.0 mV, h+5=0.0 mV. For all cases, only one pre-cursor is considered. For an N tap channel, an N bit sequence DFE and an N bit ADC-based DFE are considered. As the number of bits of the ADC increases, the noise margin also increases. For the sequence DFE, the noise margin increases as the number of post-cursor taps increases.

Figure.4: Comparison of noise margin between ADC-based DFE and sequence DFE. (Voltage margin in mV vs. number of bits in the ADC, for 4-tap to 7-tap channels.)

.5. Error Tolerance

Section.3. discusses sequence generation and DFE implementation. The system also includes prediction error tolerance logic that gives it margin to operate in the presence of errors.

Figure.(b) discusses the verification of the prediction made by the edge comparators. The next three sections discuss the prediction error correction techniques of the edge comparators. Section.5.4 discusses the in-bank comparator error tolerance due to DFE feedback.

.5.. Bottom Prediction Error Tolerance

Figure.5 shows one example of an edge comparator prediction error. With correct prediction, the floating data comparators should be in the middle at time t=t. Due to noise or sampling error, the edge comparators predicted wrongly and the floating data comparators stayed at the bottom. At the bottom position, the possible B0 B+1 combinations are 00 and 01. The DFE feedback at time t=t should be 0, as the previous bit is 0. This feedback makes the final output for B0 equal to 0. So, due to the error in prediction, the main cursor would be detected wrongly. However, the prediction error tolerance logic in the system prevents this from happening. As the edge comparators send the data comparators to the bottom, all the floating data comparator outputs go to 1. At the bottom, if all the data comparators give 1, this implies an error in the predicted position of the floating data comparators. So, we should overwrite the position of the data comparators to the middle. For the middle position, CF3 and CF2 become CF1 and CF0 respectively; their outputs are written into CF1 and CF0. In this error case, had the comparators been placed in the middle, the most probable outputs of the top two comparators are 0. So, 0 is written there. This error check gives the correct choice for the DFE operation.

Figure.5: Error tolerance of edge comparator prediction to send the floating data comparators to bottom at time t=t. (While at the bottom, if all floating comparators give 1, it implies an error that needs correction; the position is overwritten from bottom to mid.)

.5.. Mid Prediction Error Tolerance

The prediction error tolerance of the mid position has two cases. In case I, the correct position for the floating data comparators is on top but, due to noise, they ended up in the middle (Figure.6). Any time an error happens in predicting the position of the floating data comparators, the system will make an error in B0. Case I of the mid position prediction error tolerance is similar to the bottom position prediction error tolerance. While in the

middle, if all the floating data comparators give 1, it suggests an error has occurred. As all of them are 1, the signal may be on top instead. So, after verification, the mid position is overwritten to the top. For the top position, CF3 and CF2 become CF1 and CF0 respectively; their outputs are written into CF1 and CF0. CF3 and CF2 are overwritten with 0, as it is the most probable outcome.

Figure.6: Error tolerance of edge comparator prediction to send the floating data comparators to the middle at time t=t. (While in mid, if all floating comparators give 1, it implies an error that needs correction; the position is overwritten from mid to top.)

Case II of the mid prediction error tolerance is illustrated in Figure.7. At time t=t, the correct floating comparator position is at the bottom but, due to an error, it ended up in the

middle. While in the mid position, if all the floating data comparators give 0, it implies an error has occurred. As all of them are 0, the signal may be at the bottom instead. So, the mid position is now overwritten to the bottom. For the bottom position, CF1 and CF0 become CF3 and CF2 respectively; their outputs are written into CF3 and CF2. CF0 and CF1 are overwritten with 1, as it is the most probable outcome.

Figure.7: Error tolerance of edge comparator prediction to send the floating data comparators to bottom at time t=t3. (While in mid, if all floating comparators give 0s, it implies an error that needs correction; the position is overwritten from mid to bottom.)

.5.3. Top Prediction Error Tolerance

Top prediction error tolerance is similar to case II of the mid prediction error tolerance. From the top position, the most probable miss is the mid position. While on top, if all the data comparators are 0, prediction verification overwrites the position to the mid. The outputs of CF1 and CF0 go to CF3 and CF2 respectively. CF1 and CF0 are overwritten with 1.

Figure.8: Error tolerance of edge comparator prediction to send the floating data comparators to the top at time t=t3. (While on top, if all floating comparators give 0, it implies an error that needs correction; the position is overwritten from top to mid.)
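The four prediction-error cases (bottom, mid case I, mid case II, and top) follow one pattern, which the sketch below captures. It is an illustration under assumptions: outputs are given as a tuple (CF3, CF2, CF1, CF0), positions are the strings "bottom", "mid", and "top", and the function name is hypothetical rather than a block from the implemented logic.

```python
def verify_prediction(position, outputs):
    """Apply the position overwrite rules to the floating comparator outputs.

    outputs is (CF3, CF2, CF1, CF0). Returns the corrected position and outputs.
    """
    cf3, cf2, cf1, cf0 = outputs
    if all(outputs) and position in ("bottom", "mid"):
        # Signal is above every reference: move up one position. The old top
        # pair becomes the new bottom pair and the new top pair is set to 0.
        new_position = "mid" if position == "bottom" else "top"
        return new_position, (0, 0, cf3, cf2)
    if not any(outputs) and position in ("mid", "top"):
        # Signal is below every reference: move down one position. The old
        # bottom pair becomes the new top pair and the new bottom pair is set to 1.
        new_position = "bottom" if position == "mid" else "mid"
        return new_position, (cf1, cf0, 1, 1)
    return position, outputs


print(verify_prediction("bottom", (1, 1, 1, 1)))
print(verify_prediction("mid", (0, 0, 0, 0)))
print(verify_prediction("mid", (0, 1, 1, 1)))
```

In every corrected case the overwritten outputs place the signal between the two banks of the new position, which is the most probable location after a one-step miss.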

.5.4. In-Bank Comparator Error Tolerance

In prior sections, we discussed how to fix comparator and sampling errors that result in incorrect positioning of the floating data comparators. We showed that an error in predicting the position of the floating data comparators can be fixed using the verification logic. Now, if there is an error in the data comparators, the DFE is there to fix it.

Figure.9: In-bank comparator error tolerance of the system using DFE. (Samples without noise, with positive noise, and with negative noise, all in the mid position.)

Figure.9 shows three cases of sampling at time t=t3. In the first case, when there is no noise, the correct output is 0. Now, with some noise, if the signal is pushed up, one of the floating comparators, CF, gives 1 where it gave 0 in the no-noise case. So, from 0 bank DFE choices are now 00 and 00. However, from 0 bank DFE choices do not change, as the comparators related to this bank are still giving the correct result. Now, the DFE comes into action and detects the correct sequence. If noise pushes the signal down, as long as it is above the CF reference level, no error will occur.

.6. System Design

The overall block diagram of the quarter-rate implementation of the system is illustrated in Figure.30. The architecture requires three edge comparators and four data comparators in each time-interleaved path. The three edge comparators serve dual purposes: (a) they provide timing error information with higher resolution and (b) they place the data comparators in the vicinity of the next sample. In addition, the edge samples, together with the decoded sequence, allow us to filter edges with ISI. Two edge comparators and four data comparators are used to generate sequences with redundancy for the sequence DFE. The decoded main bits from the previous two time-interleaved paths come in as B+1 and B+2 feedback and help in the DFE operation. The sections that follow discuss the analog front-end, namely the passive equalizer, sample and hold (SH), comparator, and SR latch in turn, and then the critical timing issues of the system.

Figure.30: Maximum likelihood sequence detector with passive equalization and timing recovery.

.6.. Passive Equalizer

The passive equalizer used in this system is a C-R high-pass filter that attenuates the low-frequency portion of the incoming signal. Figure.3 shows the schematic and AC response of the equalizer. The transfer function of the equalizer can be written as:

$$\text{Transfer Function},\ \frac{V_{out}}{V_{in}} = \frac{R_{4} + sCR_{EQ}R_{4}}{(R_{EQ} + R_{4}) + sCR_{EQ}R_{4}} \qquad (.5)$$

$$\text{DC Gain} = \frac{R_{4}}{R_{EQ} + R_{4}}$$

$$\text{High Frequency Gain} = 1$$

$$\text{Zero of the Equalizer},\ z = \frac{1}{CR_{EQ}}$$

$$\text{Pole of the Equalizer},\ p = \frac{1}{C\,(R_{EQ} \parallel R_{4})}$$

The choice of values of the resistors and the capacitor is critical. Small values of R imply degradation in input matching, and the equalizer then needs a bigger capacitor to place the zero at the proper frequency. Larger values of R give a higher input time constant. The thermal noise generated by these resistors is small enough that it does not affect the sequence DFE operation. The tunable REQ of the equalizer makes it possible to have a DC attenuation from 3 to 8 dB. It also moves the pole and the zero of the system. The tunable boost also makes it possible to work with channels of different loss. As long as the passive equalizer gives a single bit response with four dominant taps remaining to be equalized, the system can equalize the rest. If channel taps remain unequalized, their impact appears as deterministic noise.

Figure.3(a) shows the AC analysis of the equalizer for the example link with 7 dB loss. Here, the passive equalizer equalizes 6 dB of channel loss and the maximum likelihood sequence detector equalizes the remaining loss. The transient eyes at the input and output of the passive equalizer are shown in Figure.3(b) and Figure.3(c). As the equalizer output has only 4 dominant taps, 16 levels with overlap appear at the equalizer output. Figure.33 shows the layout of the passive equalizer.

Figure.3: Schematic of passive equalizer with its AC response for different settings. (Differential schematic with C, R1, R2, R3, R4, and load CL between INP/INN and OUTP/OUTN; gain vs. frequency for different settings of REQ.)
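To make the role of REQ concrete, the sketch below evaluates a first-order model of the equalizer: a capacitor C in parallel with a series resistance REQ, followed by a shunt resistance R4. The component values are hypothetical round numbers chosen for illustration, not the implemented values, and the load capacitance is ignored.

```python
import math


def equalizer_params(r_eq, r4, c):
    """DC gain, zero, and pole (both in rad/s) of the first-order C-R equalizer."""
    dc_gain = r4 / (r_eq + r4)
    zero = 1.0 / (c * r_eq)
    pole = (r_eq + r4) / (c * r_eq * r4)  # equals 1 / (C * (R_EQ || R4))
    return dc_gain, zero, pole


def gain_db(freq_hz, r_eq, r4, c):
    """Magnitude of Vout/Vin in dB at the given frequency."""
    s = 2j * math.pi * freq_hz
    h = (r4 + s * c * r_eq * r4) / ((r_eq + r4) + s * c * r_eq * r4)
    return 20.0 * math.log10(abs(h))


# Hypothetical component values (assumed, not taken from this work).
R_EQ, R4, C = 400.0, 400.0, 1e-12

dc_gain, zero, pole = equalizer_params(R_EQ, R4, C)
print(f"DC gain: {20 * math.log10(dc_gain):.2f} dB")
print(f"zero: {zero / (2 * math.pi) / 1e6:.0f} MHz, pole: {pole / (2 * math.pi) / 1e6:.0f} MHz")
for f in (1e6, 1e8, 1e9, 1e10):
    print(f"{f:.0e} Hz: {gain_db(f, R_EQ, R4, C):.2f} dB")
```

The pole always sits above the zero by the factor 1/(DC gain), so increasing REQ deepens the DC attenuation and moves the zero down in frequency at the same time, which is the coupling the text refers to.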


Figure.3: Passive equalizer response for example link. (a) AC analysis, (b) input to the equalizer and (c) output of the equalizer. (Panel (a): gain vs. frequency showing channel attenuation, equalizer boost, and the response after equalization; panels (b) and (c): amplitude vs. time.)

Figure.33: Layout of the implemented passive equalizer. (Labelled blocks: resistors, capacitors, and boost control.)

.6.. Sample & Hold (SH)

Figure.34 shows the concept of a conventional sample and hold circuit. The input data, Vin, arrives at a switch controlled by a Clock signal. Whenever the Clock is high, the switch provides a direct path from the input to the output. The RC time constant formed by the switch resistance and Chold has to be small so that the SH has enough bandwidth to follow the high-speed input. When the Clock goes low, Chold holds the last value it received from the input.

Figure.34: Concept of sample and hold.

High-performance, high-speed data receivers require good linearity from their SH circuits. High-performance SH circuits are usually implemented using switched capacitor (SC) circuits. The input sampling switch limits the linearity of these SH circuits. The non-linearity associated with the sampling switch is mainly attributed to the non-linear on-resistance (RON) and the associated parasitic capacitance of the transistors. These transistor switches produce harmonic distortion when sampling high-frequency signals. As a result, the SNDR (Signal to Noise and Distortion Ratio), SFDR (Spurious Free Dynamic Range), and THD (Total Harmonic Distortion) of the incoming signal deteriorate. The on-resistance (RON) of a transistor switch is given by [4]

$$R_{ON}(t) = \frac{1}{\mu C_{ox} \frac{W}{L} \left( V_{gs}(t) - V_{TH}(t) \right)} \qquad (.6)$$

where V_gs(t) is the gate to source voltage, μ is the mobility of the charge carriers of the transistor, C_ox is the oxide capacitance per unit area, and V_TH(t) is the threshold voltage of the transistor.

Vgs(t) and VTH(t) depend on the incoming data signal, Vin(t):

$$V_{gs}(t) = V_{g}(t) - V_{in}(t) \qquad (.7)$$

$$V_{TH}(t) = V_{TH0} + \gamma \left( \sqrt{2\phi_{F} + V_{SB}(t)} - \sqrt{2\phi_{F}} \right) \qquad (.8)$$

where V_TH0 is the threshold voltage when V_SB = 0, V_SB(t) is the source to body voltage, and γ and φ_F are device parameters.

$$V_{SB}(t) = V_{in}(t) - V_{B} \qquad (.9)$$

where V_B is the body voltage.

The RON, together with Chold, defines the tracking bandwidth of the SH circuit, as given by Eq..0. The dependence of RON on the time-varying input signal means that f-3dB will be different for different values of the input signal. Therefore, the SH will not track all values of the input signal equally.

$$f_{-3dB} = \frac{1}{2\pi R_{ON} C_{hold}} = \frac{\mu C_{ox} \frac{W}{L} \left( V_{gs}(t) - V_{TH}(t) \right)}{2\pi C_{hold}} \qquad (.0)$$
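The input dependence of the tracking bandwidth can be illustrated numerically. The sketch below is a simplified model under assumptions: every device parameter is a hypothetical value, the body effect is ignored (constant threshold), the gate is held at 0 V during tracking, and magnitudes are used for the gate-source voltage and threshold so that the same expression covers a PMOS switch.

```python
import math

# Hypothetical device and circuit parameters (assumed, not from this work).
MU_COX = 100e-6    # mobility times oxide capacitance, A/V^2
W_OVER_L = 100.0   # switch aspect ratio
V_TH = 0.4         # threshold magnitude in volts, body effect ignored
V_GATE = 0.0       # gate voltage while the switch is tracking
C_HOLD = 50e-15    # hold capacitor in farads


def r_on(v_in):
    """On-resistance of the sampling switch for a given input voltage."""
    overdrive = abs(V_GATE - v_in) - V_TH
    if overdrive <= 0:
        raise ValueError("switch is off at this input voltage")
    return 1.0 / (MU_COX * W_OVER_L * overdrive)


def f_3db(v_in):
    """Tracking bandwidth set by the on-resistance and the hold capacitor."""
    return 1.0 / (2.0 * math.pi * r_on(v_in) * C_HOLD)


# Sweep an input range of about 600 mV around an 800 mV common mode.
for v_in in (0.5, 0.8, 1.1):
    print(f"Vin = {v_in:.1f} V: RON = {r_on(v_in):.0f} ohm, "
          f"f-3dB = {f_3db(v_in) / 1e9:.1f} GHz")
```

Under these assumptions the bandwidth varies severalfold across the input range, which is the signal-dependent tracking behaviour that causes distortion.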

Figure.35: Implemented sample and hold circuitry (a) and its simulated differential operation (b). (Panel (b): input INP-INN and sampled value OUTP-OUTN with the sampling pulse, track time, and hold time; amplitude vs. time in ns.)

For a fixed Chold, the RON has to be small enough to achieve high bandwidth in the SH. So, in our design, as illustrated in Figure.35 (a), MP1 and MP2 have a high W/L ratio, which allows low RON and high bandwidth. Differential operation gets rid of the clock feedthrough that occurs when sampling pulses arrive or when the comparator in the stage following the SH is clocked. The expected signal swing at the input of the SH is about 600 mV with a common mode of around 800 mV. Using transmission gates as switches will not help here. Bootstrapping makes RON independent of Vin(t) but adds a lot of complexity.

The sampling time is also a function of Vin(t). When the voltage of the capacitor node is lower than the input voltage, the input node acts as the source of the sampling PMOS. The relation among Vin(t), Vgs(t), and VTH(t) is given by the equations above. In the implemented SH, for low values of Vin(t), the PMOS switch turns off too fast, while for high values of Vin(t) it turns off too late, causing distortion.

Charge injection is another source of non-linearity in the SH. In the case of PMOS switches, when the CLK signal goes high, the charge in the channel is distributed between the drain and the source of the MOSFET. The amount of charge escaping from the channel to each side is a complex function of the impedance from each terminal to ground and of the transition time of the controlling clock. This charge injection causes gain error and DC offset in the SH. The transistors MP3 and MP4 are there to absorb the injected charge from the channel, which is a crude way of addressing this problem without adding complexity to the SH. MP3 and MP4 are sized at half of MP1 and MP2, assuming that half of the injected charge flows each way.

Figure.35(b) shows the differential operation of the implemented SH. The SH in this design is triggered with 25% duty cycle pulses. As a quarter-rate system, the SH tracks the incoming data for 1 UI and holds the data for 3 UI. Figure.36 shows the layout of the implemented sample and hold circuit. The figure specifies the positions of the sampling PMOS, the compensating PMOS, and the capacitor in the sample and hold.


Figure.36: Layout of sample and hold circuit.

.6.3. Comparator

Figure.37 shows the implemented 4-input comparator. First, for simplicity, we can remove MN5, MN6, and MN7 and consider this as a 2-input comparator. The strong-arm latch based comparator gets the sampled value from the SH circuit. The conceptual waveforms in Figure.38 show the four stages of operation of the comparator. First, the sampling phase starts when the CLK signal goes from low to high at t=t0. The transistors MN3 and MN4 discharge the VN and VP nodes, respectively, and MN and MN discharge OUTN and OUTP, respectively. The sampling phase ends when the PMOS transistors MP5 and MP6 turn on. The regeneration phase starts at t=t. At this stage, as the PMOS transistors are on, cross-coupled inverters are formed by the MN and MP5 pair and the MN and MP6 pair. The positive feedback of the cross-coupled inverters amplifies the signal. When the comparator outputs reach the rails at t=t, the comparator enters the decision phase. The decision from this stage is then latched using an SR latch. When the CLK signal goes low at t=t3, the comparator outputs reset, i.e. both outputs go to VDD. The speed of the comparator depends on its sampling and regeneration stages. Figure.39 and Figure.40 show the simulation results and the layout of the implemented comparator.

Figure.37: Schematic of the implemented comparator.

Figure.38: Operational stages of the comparator (sampling, regeneration, decision, reset).
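The dependence of comparator speed on the regeneration stage can be made concrete with the usual first-order model of a cross-coupled pair, in which the differential voltage grows as V(t) = V0 * exp(t / tau). This is a textbook approximation, not a result from this design; the time constant and voltages below are hypothetical.

```python
import math


def regeneration_time_s(v_initial: float, v_rail: float, tau_s: float) -> float:
    """Time for exponential regeneration to grow v_initial up to v_rail.

    Assumes V(t) = v_initial * exp(t / tau_s), with tau_s the
    regeneration time constant of the cross-coupled inverters.
    """
    if v_initial <= 0 or v_rail <= v_initial:
        raise ValueError("need 0 < v_initial < v_rail")
    return tau_s * math.log(v_rail / v_initial)


if __name__ == "__main__":
    # Hypothetical: 1 mV left after sampling, 1 V rail, tau = 10 ps.
    print(regeneration_time_s(1e-3, 1.0, 10e-12))   # about 69 ps
    # A ten times larger initial difference saves about 2.3 tau.
    print(regeneration_time_s(1e-2, 1.0, 10e-12))   # about 46 ps
```

The logarithmic dependence shows why a small sampled difference lengthens the decision time only moderately, while the time constant scales it directly.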


Figure.39: Simulation results showing sampling, regeneration, decision, and reset stages of the comparator.

Figure.40: Layout of the implemented comparator.

.6.4. SR Latch

The implemented comparator resets to VDD as CLK goes low. Therefore, the set-reset (SR) latch has to hold the previous value when both of its inputs are high. Figure.4 shows the NAND implementation of the SR latch. In this implementation, S=0 and R=0 together form an invalid input. In our case, the comparator outputs (OUTP and OUTN) never go to VSS at the same time. When both signals are high, the SR latch retains its previous value. When R=0, it forces Q(t+)=1; as S=1 and Q(t+)=1, QB(t+) goes to 0. When S=0, it forces QB(t+)=1; as R=1 and QB(t+)=1, Q(t+) goes to 0. In this way, cascading a strong-arm comparator with an SR latch gives a strong-arm flip-flop. The advantages of this strong-arm flip-flop are zero static power consumption and full rail-to-rail swing.

S  R  Q(t+)  QB(t+)
1  1  Q(t)   QB(t)
1  0  1      0
0  1  0      1
0  0  Not Allowed

Figure.4: SR Latch implementation and truth table.
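The behavior described above can be checked with a small gate-level sketch of two cross-coupled NAND gates. The wiring Q = NAND(R, QB) and QB = NAND(S, Q) is inferred from the statement that R=0 forces Q high and S=0 forces QB high; the function names are illustrative.

```python
def nand(a: int, b: int) -> int:
    """Two-input NAND on 0/1 values."""
    return 0 if (a and b) else 1


def sr_latch_step(s: int, r: int, q: int, qb: int) -> tuple:
    """Settle a cross-coupled NAND latch with active-low inputs.

    Wiring assumed from the text: Q = NAND(R, QB), QB = NAND(S, Q).
    The gates are re-evaluated until the outputs stop changing.
    """
    for _ in range(4):
        new_q = nand(r, qb)
        new_qb = nand(s, new_q)
        if (new_q, new_qb) == (q, qb):
            break
        q, qb = new_q, new_qb
    return q, qb


if __name__ == "__main__":
    print(sr_latch_step(1, 0, 0, 1))  # R=0 forces Q=1, QB=0 -> (1, 0)
    print(sr_latch_step(0, 1, 1, 0))  # S=0 forces QB=1, Q=0 -> (0, 1)
    print(sr_latch_step(1, 1, 1, 0))  # both high: hold      -> (1, 0)
    print(sr_latch_step(1, 1, 0, 1))  # both high: hold      -> (0, 1)
```

The hold case corresponds to the comparator reset phase, when both comparator outputs sit at VDD and the previous decision must be retained.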


Figure.4: Layout of the implemented SR latch.

.6.5. Reference Muxing Timing Issue

The receiver architecture uses the output of the edge comparators to place the references of the data comparators. The references of the data comparators have to settle before the data comparators are clocked. The timing of the SHs and comparators for the edge and data paths is shown in Figure.43. For CH0, the edge SH is clocked by P35, i.e. the pulse rising in line with Φ35. The edge comparators are clocked 0.5 UI later using Φ0, and at the same time the data signal is sampled using P0. As the generated pulses have 3 UI of hold time, the data comparators have to be clocked within these 3 UI. In this receiver, the data comparators are clocked using Φ80, which is UI later. The simulation results of Figure.44 show that the reference settles to within 0% of its final value within 50 ps. Therefore, in this case, it gives a 50 ps timing margin.
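The timing budget described above amounts to comparing the delay between two clock phases with the reference settling time. The sketch below shows that bookkeeping for a multi-phase clock; the clock period, phases, and settling time in the example are purely illustrative assumptions and are not the measured values of this receiver.

```python
def phase_delay_s(phase_deg: float, clock_period_s: float) -> float:
    """Delay corresponding to a phase offset of a periodic clock."""
    return (phase_deg / 360.0) * clock_period_s


def timing_margin_s(update_phase_deg: float, compare_phase_deg: float,
                    clock_period_s: float, settle_s: float) -> float:
    """Slack between the reference being settled and the comparator clock.

    update_phase_deg  -- phase at which the reference update is triggered
    compare_phase_deg -- phase that clocks the comparator using it
    settle_s          -- time the reference needs to settle
    """
    window = phase_delay_s((compare_phase_deg - update_phase_deg) % 360.0,
                           clock_period_s)
    return window - settle_s


if __name__ == "__main__":
    # Illustrative only: 400 ps clock period, update at 0 degrees,
    # comparison at 180 degrees, 150 ps settling time.
    margin = timing_margin_s(0.0, 180.0, 400e-12, 150e-12)
    print(margin)  # about 50 ps of slack
```

A negative result would mean the comparator is clocked before its reference has settled.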

Figure.43: Edge and data SH and comparator clocking for CH0.

Figure.44: Reference settling behavior.

Sequence DFE Critical Timing

The sequence generator gives the sequence DFE four 4-bit sequences, which are distinguishable by their post-cursors, B+ and B+. The DFE feedback in this quad-rate system is shown in Figure.45. In this -tap DFE, the B+ feedback comparison is done using the usual XOR operation. However, as in all DFE architectures, the timing of the first-tap feedback is critical. To ease this timing, the sequences for the B+ feedback are pre-calculated for the XOR comparison, which saves one gate delay. Whenever the B+ feedback arrives, the data has to pass through just one MUX (Figure.45). The logic implemented is given by the equations below:

Figure.45: Sequence DFE and loop unrolling for B+ feedback (red box).

(Equations defining D0_PR<3:0>, D_PR<3:0>, and OP<3:0> from the candidate sequences and B_FB.)

.3.7. Implementation & Experimental Results

The implemented quad-rate receiver shown in Figure.46 occupies only 0.3 mm² with each time-interleaved path taking 40 μm × 40 μm. The digital sequence decoder consumes only 35 mW of power, of which 4 mW is taken by digital operation (Figure.47). The additional clock recovery circuit consumes around 9 mW. The test setup with all the required instruments is shown in Figure.48. Different amounts of channel loss are realized using different lengths of coaxial cable. For each channel loss, the single bit response of the channel is observed and the dominant tap values are measured. These tap values are used to set the reference values for the comparators. Tunable current sources in the on-chip reference generator allow the references to be tuned for different channels. The received signal and the decoded half-rate time sequence DAC output are overlaid in Figure.49. In each case, the decoded sequence DAC output closely tracks the received signal after the passive equalizer. The 0 Gb/s input eye and the 6 level sequence detector output eye from one of the four channels are shown in Figure.50. Figure.5 shows the measured recovered clock of the system with only .56 ps peak-to-peak jitter. Without any transmit equalization, the 4-bit sequence decoder operates error-free over a 7 dB loss channel with 90 mV voltage margin and 5 ps timing margin (Figure.5).

Figure.46: Implemented prototype in 65nm CMOS.
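The step from measured tap values to comparator references can be sketched under a simple linear superposition assumption: each 4-bit sequence maps to the sum of its taps weighted by +1 or -1, and a reference sits midway between neighbouring levels. The tap values, the NRZ mapping, and the midpoint placement below are illustrative assumptions; the grouping of references into banks and their multiplexing in the actual receiver are not modelled.

```python
from itertools import product


def sequence_levels(taps):
    """Expected sample amplitude for every bit pattern.

    taps -- one tap weight per bit in the sequence, e.g. pre-cursor,
            main cursor and two post-cursors (hypothetical values).
    Bits 0/1 are mapped to -1/+1 (NRZ signalling assumed).
    """
    levels = {}
    for bits in product((0, 1), repeat=len(taps)):
        levels[bits] = sum((2 * b - 1) * h for b, h in zip(bits, taps))
    return levels


def midpoint_references(levels):
    """References placed halfway between adjacent distinct levels."""
    ordered = sorted(set(levels.values()))
    return [(lo + hi) / 2.0 for lo, hi in zip(ordered, ordered[1:])]


if __name__ == "__main__":
    # Hypothetical single-bit-response taps in volts.
    taps = (0.10, 0.60, 0.20, 0.05)
    levels = sequence_levels(taps)
    refs = midpoint_references(levels)
    print(len(levels), len(refs))  # 16 levels, 15 midpoints
```

With four taps this yields sixteen signal levels, which is why a change in channel loss requires retuning the references.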


Figure.47: Complete receiver with its power consumption.

Figure.48: Pin diagram and test setup of the prototype. Instruments shown: Tektronix AWG 7000A (input data pattern), Tektronix DPO 7604C oscilloscope (transient single bit response used for characterizing the channel and for tuning references), Agilent 8600D Infiniium DCA-X oscilloscope (eye diagrams), on-chip PRBS checker output, and control bit load using an Arduino.


Figure.49: Measured half rate 4-bit sequence DAC output.

Figure.50: Measured 0 Gb/s input eye (a) and 6 levels output eye of sequence detector (b).

Figure.5: Measured .5 GHz recovered clock eye (a) and histogram (b).


Figure.5: BER bathtub (a) and the recovered .5 Gb/s PRBS (pseudo-random bit sequence) checked data eye (b) coming out of the chip using -bit DAC (50 Ω buffer).

The comparison between this work and the existing state-of-the-art ADC-based solutions is listed in Table.. Chen et al. [] and Zhang et al. [6] used a 4-way time-interleaved flash ADC architecture with baud-rate timing recovery. Although the receiver of Zhang et al. can compensate channel loss up to 34 dB, the additional DSP required for that consumes 500 mW of power. Shafik et al. [] demonstrated a power efficiency of 8.7 pJ/bit using a 3-way time-interleaved SAR ADC without any clock recovery system. That architecture can compensate up to 5.3 dB channel loss, which gives it a figure of merit (FoM) of pJ/bit/dB. This work improves on the state of the art by compensating 7 dB channel loss at 0 Gb/s while consuming only 35 mW. Therefore, the design achieves a power efficiency of 3.5 pJ/bit and a FoM of only 0.3 pJ/bit/dB.
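The two metrics used in this comparison follow directly from their units: power efficiency is power divided by data rate, and the FoM divides that by the compensated channel loss. A minimal sketch is given below; the function names are illustrative and the numbers in the usage lines are placeholder inputs chosen to show the unit conversion, not entries from the comparison table.

```python
def energy_per_bit_pj(power_mw: float, data_rate_gbps: float) -> float:
    """Power efficiency in pJ/bit.

    1 mW / 1 Gb/s = 1e-3 J/s / 1e9 bit/s = 1e-12 J/bit = 1 pJ/bit,
    so the ratio of the two numbers is already in pJ/bit.
    """
    return power_mw / data_rate_gbps


def fom_pj_per_bit_per_db(power_mw: float, data_rate_gbps: float,
                          loss_db: float) -> float:
    """Figure of merit: energy per bit normalised to compensated loss."""
    return energy_per_bit_pj(power_mw, data_rate_gbps) / loss_db


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    print(energy_per_bit_pj(100.0, 10.0))            # 10.0 pJ/bit
    print(fom_pj_per_bit_per_db(100.0, 10.0, 20.0))  # 0.5 pJ/bit/dB
```

Normalising by loss is what allows receivers tested over different channels to be compared on one axis.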

Table.: Performance Summary of the 0 Gb/s Receiver.

Parameter | Chen JSSC [4] | Zhang ISSCC 3[6] | Shafik ISSCC 5[] | This work
Equalizer Architecture | 4x Variable Ref. Flash ADC | 4x Rectified Flash | 3x SAR | 4x Sequence DFE
Data Rate | 0 Gb/s | 0.35 Gb/s | 0 Gb/s | 0 Gb/s
Technology | 65nm | 40nm | 65nm | 65nm
Supply Voltage (V) | . N/A .
Compensated Channel loss | 9/3 Gb/s | 34 GS/s | 5.3 GS/s | 7 GS/s
BER | <0 - | <0 - | <0-0 | <0 -
Timing Recovery | Baud-rate | Baud-rate | None | Data-Edge Sampled
Power Consumption (mW) | Rx 30 | ADC-95, DSP-500 | ADC 79, Dig. EQ 8 | Sampler, Digital 4, Clocking 9
Area (mm²) |
Efficiency (pJ/bit) | 3/
FoM (pJ/bit/dB) | 0.45/

Chapter 3. A 6 Gb/s Direct Digital Sequence Detector with Data Trace-back and Equalizer in 65nm CMOS

The previous chapter introduced the concept of sequence decoding, where the ISI components are used constructively to recreate the time sequence of bits transmitted from the other side of the channel. This naturally leads to a fundamentally different way of processing the pre-cursor. In the traditional approach, the pre-cursor is reduced at the cost of SNR, whereas here the pre-cursor ISI is used to predict the next bit. This opens up the potential for a better equalization strategy that results in a higher voltage margin. Due to the fixed supply, SNR will degrade, especially when higher-order modulations are introduced. The system may benefit from the techniques discussed in this chapter by achieving a lower BER.

The main idea of pre-cursor ISI utilization is to correct and mitigate DFE error propagation. Since the next bit at time TN- is part of the sequence selected through the sequence DFE, one UI later, at time TN, we can compare the current bit with the next bit from the previous cycle to detect an error. In addition to error detection, we can also correct the error. The error correction mechanism involves three steps. First, we identify bit decisions that are not corrupted by the DFE; we call these high-confidence decisions, and only they are used for error correction. Second, we make use of the pre-cursor to generate the next bit: rather than generating a single sequence, we generate two potential sequences that differ in the next bit. Third, when the bit is detected with high confidence, we use it to select the sequence whose next bit matches the high-confidence bit.

The chapter begins with the performance-limiting factors of the receiver architecture described in Chapter. Section 3. discusses two cases where noise limits the performance of the current architecture. Section 3. discusses what can be done to overcome these limitations and give the system better noise immunity. This section arrives at the technique of -bit data trace-back. Sequence generation and feedback for data trace-back are also discussed here. Section 3.3 describes the modified block diagram of the receiver. The receiver in this chapter also works at a higher speed than the previous one. Section 3.4 presents the experimental results and a summary of the modified receiver performance with data trace-back.

3.. Noise Tolerance Limit of Current Receiver

The receiver architecture discussed in Chapter is capable of compensating 7 dB channel loss with a voltage margin of around 0% of the total signal space at a BER of 0 -. Section.5 of Chapter discussed how we can recover from errors introduced by noise and by the prediction of the edge comparators. If there is an error in the prediction by the edge comparators, we can verify that prediction using the floating data comparators. Errors caused by the floating data comparators can be fixed using the DFE feedback.

In this section, we discuss two cases where noise limits the performance of the current receiver architecture. In case I, the edge comparators predict incorrectly and the verification using the data comparators cannot fix that error. In case II, the edge comparators predict correctly, but the data comparators see a large noise for in-bank comparison.

Case I

Figure 3. shows a case of simultaneous error by both the edge and the data comparators due to noise at time t=t3. In the correct operational case, the prediction by the edge comparators sends the data comparators to the middle position with references from the 0 and 0 banks. In this position, two of the floating comparators, CF,0, give as output and the other two, CF3,, give 0. So, the generated sequences from the 0 bank are 000 and 00; the generated sequences from the 0 bank are 00 and 0. The B+ feedback comes first and, as it is, the 00 sequence from the 0 bank and 0 from the 0 bank are forwarded for B+ feedback sorting. At t=t3, the B+ feedback is. The correct DFE output should be the 0 sequence.

For the error case, the edge sample of the incoming data is pushed up by noise, and this noise sends the data comparators to the top. The position verification system does not work here, as the bottom data comparators see data that is higher than their reference level. Here, CF0= and the rest are 0. For the top position, the sequences from the bank are 00 and 0 and those from the 0 bank are 00 and 00. If we assume the DFE feedback is still present and working as it should, the final output will be 0. In this situation, both B0 and B-, which are not in feedback, are incorrect.

Figure 3.: Case I error in sequence detection. B0 and B- detected incorrectly.

Case II

Figure 3. shows an error by the in-bank floating comparators at time t=t3. The correct case is already discussed in Section 3... In this case, the prediction by the edge comparators is correct. At time t=t3, the possible banks are 0 and 0. Due to large noise, the incoming signal is pushed down and CF becomes 0; in the ideal case, however, it should be. So, the possible sequences from the 0 bank are 00 and 00. After DFE feedback, 00 is selected as the output, which implies an error in B- detection.

Figure 3.: Case II error in sequence detection. B- detected incorrectly.

3.. Improving Noise Tolerance of Current Receiver

The current architecture, as discussed in Chapter, uses the DFE feedback and detects the main cursor, B0. The pre-cursor, B-, comes as a by-product. This by-product can be used further to verify the detection of B0. The current architecture works with the prediction from the edge comparators, which is later verified using the floating data comparators. Instead of the prediction, we can use a sub-ranging ADC approach, where the floating data comparators are placed based on the data sample rather than the edge sample.

3... Setting Fixed Data Comparators

Two fixed data comparators can be used instead of the edge comparators. These two data comparators directly drive the reference multiplexers of the floating comparators. The reference levels of these two comparators are well defined (Figure 3.3). The two fixed comparators have to differentiate between banks. CFIX differentiates between the and 0 banks, so it can be placed between the top of the 0 bank and the bottom of the bank, i.e. between levels 0 and 00. If CFIX=0, the reference levels of the 0 bank are passed to the floating comparators. If it is, the reference levels of the bank are passed to the floating comparators. The same is true for the 0 and 00 banks. CFIX0 is between the 00 and 000 levels. CFIX0=0 implies that the references of the 00 bank are passed to the floating comparators, and CFIX0= implies that the references of the 0 bank are passed to the floating comparators.

Figure 3.3: Introduction of two fixed data comparator references instead of fixed edge comparators.
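The sub-ranging idea of letting two fixed comparators place the floating comparators can be sketched as a coarse decision followed by a lookup. The sketch assumes that the two fixed comparator outputs behave as a thermometer code and that the four banks are labelled by two bits, with the top, middle, and bottom positions each spanning two adjacent banks; these labels, the threshold values, and the function names are assumptions made for illustration.

```python
# Assumed bank pairs available to the floating comparators per position.
POSITION_BANKS = {
    "top": ("11", "10"),
    "mid": ("10", "01"),
    "bottom": ("01", "00"),
}


def fixed_comparators(sample_v: float, ref_upper_v: float,
                      ref_lower_v: float) -> tuple:
    """Outputs (c_fix1, c_fix0) of two comparators with fixed references."""
    c_fix1 = 1 if sample_v > ref_upper_v else 0
    c_fix0 = 1 if sample_v > ref_lower_v else 0
    return c_fix1, c_fix0


def floating_position(c_fix1: int, c_fix0: int) -> str:
    """Coarse position selected by the fixed comparators."""
    if c_fix1 and c_fix0:
        return "top"
    if c_fix0:
        return "mid"
    if not c_fix1:
        return "bottom"
    # c_fix1=1 with c_fix0=0 cannot occur when ref_upper_v > ref_lower_v.
    raise ValueError("inconsistent fixed comparator outputs")


if __name__ == "__main__":
    # Hypothetical fixed references at +0.2 V and -0.2 V.
    for v in (0.35, 0.0, -0.3):
        pos = floating_position(*fixed_comparators(v, 0.2, -0.2))
        print(v, pos, POSITION_BANKS[pos])
```

Because the coarse decision is taken on the data sample itself, a noisy edge sample no longer moves the floating comparators to the wrong position.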

3... Case I

The fixed data comparators fix the error of case I with little margin (Figure 3.4(a)). Higher noise will push the signal up a little more, again producing an error (Figure 3.4(b)). The additional noise sends the floating comparators to the top again and the verification logic cannot recover: the error prevails. To fix this issue, we can use the decoded B- bit of the sequence. Figure 3.5 shows how the concept of using the B- bit works. At time t=t3, both B+ and B+ are. To illustrate the concept, we start with the feedback. The B+ and B+ DFE feedback leaves 4 possible combinations out of 6 possible sequences; two combinations are from the bank and two are from the 0 bank. As the fixed data comparators place the floating comparators in the top position, it is highly unlikely to have 00 as a possible sequence: 00 is the bottom-most possible sequence, while the fixed data comparators have placed the floating comparators at the top. So, we can neglect the 00 sequence.

Figure 3.4: Fixing error with little margin using fixed data comparators.

Figure 3.5: Introduction of check comparator to find out whether it may miss a bank or not.

When the fixed data comparators place the floating comparators for the and 0 banks, the most probable sequence we can miss is the top of the 0 bank, i.e. the 0 sequence. To see if there is any chance of missing the top of the 0 bank, we can use a check comparator. In this case, the check comparator is placed between the top of the probable missing bank and the in-bank bottom sequence having the same values of B+ and B+. The check comparator at t=t3 gives 0. Out of the probable 3 sequences, this implies that the topmost one can be neglected. As the next bit is a definite, it can come as feedback and correct the bit decision. Here, the trace-back options are the 0 and 0 sequences. The LSB+ bit is the B- bit. After comparing the B- bit with the LSB+ bit, the output will be 0, which is the correct output.

If we go back to what we had at the beginning of case I, the DFE output was already 0. We added the check comparator, and it indicated that the data might be from the missing bank. As the next bit was a definite, we checked back to see whether we had decoded the correct sequence (0) and found that we had the wrong sequence (0). We corrected the error using the B- bit.

When the floating data comparators are in the top position ( and 0 bank), the check comparator checks the probability of a missing sequence from the nearest bank (0 bank). Table 3. shows all the possible cases of missing the probable banks and the corresponding comparator placement. When the signal is supposed to be in the middle (0 and 0 bank), there are two cases: (i) missing the bottom of the bank (00 sequence) and (ii) missing the top of the 00 bank (00 sequence). 00 has B+= and B+=0. The closest in-bank sequence having the same B+ and B+ is 00. So, the check comparator reference is set between 00 and 00. The case of missing the top of the 00 bank is similar to the case of missing the top of the 0 bank. So, for the mid position we need two comparators to check whether there is any probable missing bank. For the top and the bottom positions, however, we need only one comparator.

Table 3.: Reference placement for checking probable bank miss.
Columns: Position (Top, Mid, Bottom).
Rows: Available Banks; Most Probable Bank Missed; Most Probable Sequence outside Banks; B+ of the Most Probable Sequence; B+ of the Most Probable Sequence; Closest Ref. in bank having same B+ & B+; Check Comp Placement.

The references for the check comparators can also be multiplexed using the fixed data comparators. However, when the signal is at the top or at the bottom, one of the check comparators is not clocked, and that one receives the common mode voltage (VCM) as its reference. Figure 3.6 illustrates the placement of all the comparator references; for the check comparators, it also describes why each one is used.

Figure 3.6: Overall reference placement of the architecture with trace-back.

Case II

The check comparators also resolve the issue of Section 3... At t=t3, the check comparators check whether the system has missed the bottom of the bank and the top of the 00 bank. As the DFE output was 00 in case II, the check comparator that checks whether it missed the bank becomes the one to be considered for trace-back. As it gives 0, there is no chance of the data being at the top. Another probable sequence is the 0 in-bank option. As the next bit is a definite, it comes for trace-back and fixes the sequence as shown in Figure 3.7.

Figure 3.7: Resolving the issue of case II.

Trace-back Sequence Generation

There are two types of sequence generated for trace-back in this system: (i) outside-bank trace-back sequence (case I) and (ii) within-bank trace-back sequence (case II). Table 3. lists the logic showing which trace-back option is considered for the different outputs of the check comparators.

Table 3.: Trace-back choice logic.
Bank | Check Comp Output | Trace Back Choice
Outside Within 0 Outside Within 0 Within Outside 0 Within Outside

For outside-bank trace-back, there may be an error in B0 and B-. So, the sequence choices given for the trace-back are:
(i) the DFE-selected sequence (B0BB-B);
(ii) the other option, with B0 and B- flipped.

Within-bank trace-back implies there may be an error in B- only. So, the sequence choices given for the trace-back are:
(i) the DFE-selected sequence (B0BB-B);
(ii) the other option, with B- flipped.

Conditions of Data Trace-back

For the data trace-back discussed so far, we have always taken the next bit to be a definite /0. When both fixed data comparators detect the received signal to be in the top or in the bottom position and the floating top and bottom check comparators verify the sample location, the data is considered to be a strong /0. The logic for detecting a strong /0 is listed in Table 3.3. If the next bit is not a strong /0, the incorrect DFE feedback will propagate. The strong /0 logic ensures that the next bit does not depend on the DFE feedback for its main cursor decision. If the next bit is a strong /0, it can be used for error detection by the DFE.

Table 3.3: Detecting Strong /0.
Fixed Data Comparators (CFIX, CFIX0) | Floating Top/Bottom Check Comparators (CCHK_TOP, CCHK_BOT) | Bit
Strong x 0 Not strong 0 0 x Not strong 0 0 Strong 0

We cannot do trace-back when the top-bottom-mid verification has already resolved the errors of the fixed data comparators.

Improved Noise Margin

The data trace-back adds two additional comparators to increase the noise margin of the system, so the required number of comparators increases by only two. We can modify Eq..3 to get the required number of comparators:

M L PredictionFactor M C Sequence DFE and Trace-back (3.)
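The trace-back choice and the strong-bit condition described in the preceding subsections can be summarised in a short behavioural sketch. The field names, the encoding of a sequence as a named tuple, and the exact strong-bit conditions (both fixed comparators high and the top check comparator high for a strong 1, the mirror case for a strong 0) are assumptions inferred from the prose, not the implemented logic.

```python
from typing import NamedTuple, Optional


class Seq(NamedTuple):
    """A 4-bit sequence: pre-cursor, main cursor, two post-cursors."""
    b_pre: int
    b0: int
    b_post1: int
    b_post2: int


def strong_bit(c_fix1: int, c_fix0: int,
               c_chk_top: int, c_chk_bot: int) -> Optional[int]:
    """Return 1 or 0 for a strong decision, None otherwise (assumed rule)."""
    if c_fix1 and c_fix0 and c_chk_top:
        return 1
    if not c_fix1 and not c_fix0 and not c_chk_bot:
        return 0
    return None


def alternative(seq: Seq, outside_bank: bool) -> Seq:
    """Second trace-back candidate: flip B- only, or B0 and B- together."""
    if outside_bank:
        return seq._replace(b_pre=1 - seq.b_pre, b0=1 - seq.b0)
    return seq._replace(b_pre=1 - seq.b_pre)


def trace_back(dfe_out: Seq, outside_bank: bool,
               next_strong: Optional[int],
               current_is_strong: bool = False) -> Seq:
    """Pick the candidate whose pre-cursor matches the strong next bit."""
    if next_strong is None or current_is_strong:
        return dfe_out  # no trusted next bit, or nothing to correct
    if dfe_out.b_pre == next_strong:
        return dfe_out
    return alternative(dfe_out, outside_bank)


if __name__ == "__main__":
    wrong = Seq(b_pre=0, b0=1, b_post1=1, b_post2=1)
    print(trace_back(wrong, outside_bank=True, next_strong=1))
    print(trace_back(wrong, outside_bank=False, next_strong=1))
    print(trace_back(wrong, outside_bank=True, next_strong=None))
```

Since the two candidates always differ in the pre-cursor bit, exactly one of them agrees with a strong next bit, so the selection is unambiguous whenever trace-back is enabled.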

The noise margin of the modified sequence DFE architecture can be defined as half the distance between two banks having a similar B+ value, as discussed in Section.4. For a 4-tap sequence DFE having one pre-cursor (h-), the main cursor (h0), and two post-cursors (h+ & h+), the noise margin can be written as

h h h 0 NM Sequence DFE (3.)

The noise margin with data trace-back increases because similar values of B+ are considered as well. However, the additional noise margin is available if and only if the next bit is a strong /0. Figure 3.8 compares the noise margins for these three cases. As the number of bits of the ADC increases, the noise margin also increases. However, the noise margin achieved by the 4-tap sequence DFE with data trace-back is always higher than that of the ADC-based DFE. In this figure, h0=0.6 mV, h+=0.6 mV, h-=0. mV and h+=0.08 mV.

Figure 3.8: Comparison of noise margins of 4 tap sequence DFE with and without data trace-back with ADC-based DFE. The plot shows voltage margin (mV) against number of bits.

3.3. System Design

The overall implementation of the quad-rate system is shown in Figure 3.9. The receiver uses a passive equalizer at the front end similar to the one used in Chapter. The system requires two fixed data comparators, four floating data comparators, and two check comparators, i.e. eight comparators overall. The outputs of the fixed comparators place the references of the floating and check comparators. When the signal is at the top or the bottom, only one of the check comparators is clocked. A comparator that is not clocked gets the common mode voltage as its reference input. The system uses an external clock, as it does not have any built-in timing recovery. The receiver works with 6 Gb/s input data with each time-interleaved path running at 4 GHz.

Figure 3.9: System architecture of quad-rate receiver with data trace-back.

Reference Muxing Timing

As the system runs at a higher speed than the receiver described in Chapter, the comparators are resized to get a similar timing margin for reference muxing. As the prediction by the edge comparators is not used, the coarse SH (for the fixed data comparators) and the fine SH (for the floating comparators) sample the data at P0, i.e. the pulse rising at Φ0. The coarse comparators are clocked immediately using a delayed version of Φ0. We then have time up to the hold time of P0, which is less than 3 UI. The fine and top/bottom check comparators are clocked UI later using Φ80. The outputs of the fixed comparators take around 00 ps to update the references of the floating comparators (Figure 3.0). This still gives a 5 ps timing margin to clock the floating comparators and achieve 6 Gb/s operation. The check comparators are clocked using the same scheme.

Figure 3.0: Reference muxing for quad-rate channel running at 4 GHz.

3.3.. DFE and Trace-back Feedback

The DFE feedback works as described in Section.6.6. The loop-unrolled architecture is carried over from the previous architecture described in Figure.33. The new addition here is a sequence generator after the DFE output. The sequence generator generates a choice for trace-back as described under Trace-back Sequence Generation. The DFE output and the generated sequence have different values of B-. The B- feedback comes only if the next channel has a strong /0 and the current channel does not have a strong /0. After this comparison, we get the trace-back output.

Figure 3.: Feedback for DFE and Trace-back.
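The loop-unrolling idea carried over into this architecture can be sketched behaviourally: the candidate for each possible value of the late-arriving feedback bit is prepared in advance, so that the arrival of the bit only drives a final selection. The tuple layout (pre-cursor, main cursor, first post-cursor, second post-cursor), the function names, and the example candidates are assumptions for illustration and do not reproduce the gate-level equations of the design.

```python
def precompute(candidates, b_post2_fb: int) -> tuple:
    """Prepare one candidate for each value of the first post-cursor.

    candidates -- four sequences (b_pre, b0, b_post1, b_post2), one for
                  each combination of the two post-cursor bits
    b_post2_fb -- the second post-cursor feedback, known early
    Returns (choice_if_fb_is_0, choice_if_fb_is_1).
    """
    survivors = [s for s in candidates if s[3] == b_post2_fb]
    by_post1 = {s[2]: s for s in survivors}
    return by_post1[0], by_post1[1]


def mux(prepared: tuple, b_post1_fb: int):
    """Final 2:1 selection once the critical feedback bit arrives."""
    return prepared[b_post1_fb]


if __name__ == "__main__":
    # Hypothetical candidates, distinguishable by their post-cursors.
    cands = [(0, 1, 0, 0), (0, 1, 0, 1), (1, 0, 1, 0), (1, 1, 1, 1)]
    prepared = precompute(cands, b_post2_fb=1)   # done ahead of time
    print(mux(prepared, b_post1_fb=1))           # (1, 1, 1, 1)
    print(mux(prepared, b_post1_fb=0))           # (0, 1, 0, 1)
```

Only the `mux` step sits in the timing-critical path, which mirrors the statement that the data passes through just one multiplexer after the first-tap feedback arrives.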

3.4. Experimental Results

The prototype of the architecture, implemented in TSMC 65nm, is shown in Figure 3.. The design occupies 300 µm × 560 µm, with each channel taking 300 µm × 40 µm. The prototype works with 6 Gb/s input data. The pin diagram and the test setup of the chip, with all the required instruments, are shown in Figure 3.3. Different amounts of channel loss are realized with different lengths of coaxial cable, as discussed in Chapter. For each channel loss, the single-bit response of the channel is observed and the dominant tap values are measured. These tap values are used to set the reference values for the comparators. Tunable current sources in the on-chip reference generator allow the references to be tuned for different channels. The outputs of the sequence DFE with and without data trace-back are taken separately.

Figure 3.4 shows the channel response for which the receiver was tested. Figure 3.5 shows the input after passive equalization and the channel output side by side. The 6 Gb/s input after the passive equalizer has 4 dominant taps. The generated clock and the PRBS-checked recovered data eye are given in Figure 3.6. The BER curves in Figure 3.7 (BER bathtub curve) and Figure 3.8 (2D color-map of BER showing voltage margin on the Y-axis and timing margin on the X-axis) show the improvement in BER after data trace-back. For 6 Gb/s operation, the receiver achieves a BER of 0-0 when the DFE works alone; with added trace-back, the BER of the receiver improves to 0 -. The BER plots of Figure 3.8 also show the improvement in voltage margin of the modified architecture.

The design consumes 7.8 mW with both the DFE and data trace-back in action. In the low-loss case, the data trace-back can be turned off to save power; without the trace-back, the design consumes only 55 mW. Table 3.4 summarizes the overall receiver performance. In higher-loss cases (~35 dB), both the trace-back and the DFE are applied, whereas in lower-loss cases the system can achieve the desired performance using the DFE only while consuming less power. For 6 Gb/s operation, when both the trace-back and the DFE are active, the receiver consumes 7.8 mW at an efficiency of pJ/bit. The figure of merit (FoM) during 35 dB operation turns out to be 0.8 pJ/bit/dB, which is better than the FoM of pJ/bit/dB during DFE-only operation, because DFE-only operation compensates less loss.

Figure 3.: Implemented prototype in 65nm.
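The efficiency and FoM figures quoted above are simple ratios: energy per bit is power divided by data rate, and the FoM divides that by the compensated channel loss. The following Python sketch illustrates the arithmetic. The function names and all input values are hypothetical placeholders chosen for illustration; they are not the measured values of this work.

```python
def energy_per_bit_pj(power_mw: float, data_rate_gbps: float) -> float:
    """Energy per bit in pJ/bit.

    mW / (Gb/s) = 1e-3 W / 1e9 bit/s = 1e-12 J/bit, i.e. pJ/bit.
    """
    return power_mw / data_rate_gbps


def fom_pj_per_bit_per_db(power_mw: float, data_rate_gbps: float,
                          loss_db: float) -> float:
    """Figure of merit in pJ/bit/dB: energy per bit per dB of compensated loss."""
    return energy_per_bit_pj(power_mw, data_rate_gbps) / loss_db


# Hypothetical operating points (NOT the measured numbers of the prototype):
# a higher-power mode that compensates more loss can still have the better FoM.
fom_high_loss = fom_pj_per_bit_per_db(power_mw=120.0, data_rate_gbps=10.0, loss_db=30.0)
fom_low_loss = fom_pj_per_bit_per_db(power_mw=100.0, data_rate_gbps=10.0, loss_db=20.0)
print(f"high-loss mode: {fom_high_loss:.2f} pJ/bit/dB")
print(f"low-loss mode:  {fom_low_loss:.2f} pJ/bit/dB")
```

In this made-up example the high-loss mode spends more energy per bit but less energy per bit per dB, which is the same effect described for the trace-back and DFE-only modes.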

Figure 3.3: Pin diagram and test setup of the prototype. Instruments shown: Agilent 8600D Infiniium DCA-X oscilloscope (eye diagrams), signal generator SMB 00A (clock generation), Tektronix DPO 7604C oscilloscope (transient single-bit response used for characterizing the channel), Tektronix AWG 7000A (input data pattern), and an Arduino (control bit loading, used for tuning references).

Figure 3.4: Channel Response at 8 GHz (35 dB loss at 8 GHz).

Figure 3.5: Measured 6 Gb/s input eye after passive equalizer (a) and 6-level output eye of the sequence decoder (b).

Figure 3.6: Measured 4 GHz clock eye (a) and PRBS checked recovered 4 Gb/s data eye (b).

Figure 3.7: BER bathtub comparing DFE and Trace-back (BER versus clock phase in UI).

Figure 3.8: 2D color-map of BER showing voltage margin in Y-axis and timing margin in X-axis of sequence DFE (a) and data trace-back (b).

Table 3.4: Receiver Summary.

Compared designs: Shafik ISSCC 5 [] | Hossain VLSI 6 [] | This Work
Technology: TSMC 65 nm | TSMC 65 nm | TSMC 65 nm
Remaining rows: Supply Voltage (V); Data Rate (Gb/s); Active Die Area (mm²); Timing Margin (UI); BER; Loss Compensation (dB); Power Consumption (mW); Power Efficiency (pJ/bit); FoM (pJ/bit/dB)
For this work, BER, power consumption, power efficiency and FoM are listed separately for "DFE and TB" and "DFE only" operation.

Chapter 4. Burst Mode Optical Receiver with 0ns Lock Time Based on Concurrent DC Offset and Timing Recovery Technique

This chapter describes a low-power, low-latency 7-0 Gb/s burst mode DC-coupled receiver for photonic switch networks. The chapter begins with a discussion of conventional optical receivers in Section 4.1. Section 4.2 covers the state-of-the-art DC and timing recovery methods applied in burst-mode optical receivers and their performance-limiting factors. Section 4.3 discusses the overall architecture of the proposed receiver. The analog front-end, comprising a trans-impedance amplifier (TIA) followed by three stages of amplifiers, the DC recovery loop, the timing recovery loop and the timing skew correction loop are also discussed in detail in Section 4.3. Section 4.4 covers the implementation of the system in IBM 0.3 µm and the measured experimental results. Section 4.5 compares the proposed burst mode optical receiver with the current state-of-the-art optical receivers. The implemented TIA is also compared with state-of-the-art TIAs.

4.1. Conventional Optical Receiver

The continuous growth of cloud computing, big data and social media applications is placing enormous bandwidth demand on data communication networks. The interconnection between servers, which has traditionally relied on electrical cables, is becoming a major bottleneck in today's data networks as we approach the limits of copper wires in terms of speed, loss and power consumption. To meet this demand, optical interconnects are already finding their place in data centers to connect racks that are only a few meters apart. Optics can provide large transmission bandwidth, which reduces latency and increases processing speed. Extending these benefits, cross-point network switches can also be optical to provide a significant increase in cross-sectional bandwidth. In addition, a rapidly reconfigurable optical switching network can significantly improve latency and throughput [5]. There are electrical transceivers to interface these optical switches. However, unlike existing SerDes, these optical transceivers need to accommodate different optical power levels when the cross point is reconfigured. Interestingly, existing passive optical networks (PON) provide such functionality. Unfortunately, to fit in the data center solution space, existing PON solutions need to improve significantly. First, in existing PON receivers, the TIA and the limiting amplifier (LA) are located on one IC and the clock and data recovery (CDR) is located on a different IC ([6], Figure 4.1). It is desirable to integrate them into a single complete receiver and then to integrate multiple receivers in a single IC. Second, the power efficiency of burst mode PON transceivers is on the order of 50 to 00 pJ/bit. This efficiency has to improve to better than 0 pJ/bit considering the cooling cost of data centers.

Figure 4.1: Conventional vs. proposed implementation of the optical receiver.

Currently, most of these links are VCSEL based; their performance is excellent, but they lack integration density and bandwidth. To meet the next generation's bandwidth demand, VCSELs have to improve their bandwidth and, at the same time, we have to look for alternative solutions such as silicon-photonic modulators that have higher integration density. Optical interconnects based on silicon integrated photonic waveguides and devices aim to replace metal wires with optical dielectric waveguides on the same CMOS chip. They are considered a promising technology that can meet the projected bandwidth demand while offering the advantage of being compatible with CMOS electronics.

Energy efficiency of the link has become as critical as its bandwidth. From an application point of view, there is growing demand for energy efficiency improvement over the entire data traffic, including idle mode, not just at the peak data rate. Therefore, transceivers have to adapt to the bursty nature of data traffic and thereby achieve both the highest bandwidth and excellent efficiency. When used in such a power-down mode, the transceiver is required to wake up, receive and transmit data without a significant preamble in order to keep the burst efficiency high. Transceivers require nanosecond-scale bursts to take advantage of the dynamically reconfigurable high-speed optical switches based on silicon photonics. After each switching event, the receiver has to support a data stream that may come from a different transmitter, having a different modulator with a different extinction ratio and a different switching loss. In addition, the receiver has to support an un-encoded data stream, where the data pattern to be received can have long strings of consecutive identical data (CID) bits, i.e. long strings of 1s and 0s.

In an AC-coupled link, a long string of 1s or 0s may cause baseline wander. To avoid it, the length of the non-transition period of the incoming optical signal has to be small compared to the time constant set by the capacitance used for coupling between the optical and electrical interfaces. Encoding can avoid the DC balancing problem, but it adds latency and degrades the overall efficiency of the transmission. DC-coupled interfaces can easily overcome these issues. However, the receiver has to compensate for the DC offset created by the switching event during its fast lock time in this burst mode format.

4.2. Challenges in Burst-Mode Receiver

In conventional implementations ([6], Figure 4.1), the TIA and the CDR are designed in two different ICs interfaced with a high-speed interconnect that mandates sufficient signal swing to meet the signal integrity requirement. Usually, in a 50-ohm environment, meeting the swing specification leads to significant power consumption in the limiting amplifier (LA).
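The baseline-wander condition for AC-coupled links discussed above can be made concrete with a first-order model: across a coupling capacitor with time constant τ, a run of identical bits decays as exp(-t/τ). The Python sketch below computes the fraction of amplitude lost after a CID run. The first-order high-pass model, the function name and the numeric values are assumptions made for illustration only.

```python
import math


def baseline_droop_fraction(n_cid_bits: int, data_rate_gbps: float,
                            tau_ns: float) -> float:
    """Fraction of the signal level lost after n_cid_bits identical bits.

    Assumes a first-order high-pass (AC coupling) with time constant tau_ns.
    One unit interval (UI) lasts 1 / data_rate_gbps nanoseconds.
    """
    ui_ns = 1.0 / data_rate_gbps
    return 1.0 - math.exp(-n_cid_bits * ui_ns / tau_ns)


# Hypothetical link: 10 Gb/s data and a 100 ns coupling time constant.
for run_length in (8, 72, 1000):
    droop = baseline_droop_fraction(run_length, data_rate_gbps=10.0, tau_ns=100.0)
    print(f"{run_length:5d} identical bits -> {100.0 * droop:5.1f} % droop")
```

The droop stays small only while the run length times the bit period is much shorter than τ, which is why long CID patterns force either a large coupling capacitor, line coding, or DC coupling.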

Figure 4.2: Conventional implementation of burst mode receiver with DC offset calibration: (a) DC recovery using feedback path; (b) DC recovery using feedforward path.

In addition, multiple-chip implementations have a cost penalty. Therefore, in the state-of-the-art DC-coupled burst mode optical receiver [], the TIA and the CDR are integrated on the same chip. However, lock time still remains an issue, which can be broken into two components: DC offset correction and timing recovery. The DC offset correction loop can be based on either a feedback ([], Figure 4.2(a)) or a feedforward ([7], Figure 4.2(b)) concept; in both cases, a low-pass filter senses the DC offset, which is then subtracted at the input of the receiver. Feedback can be implemented in the analog domain using a V-to-I converter [8] or in the digital domain using a DAC []. However, in both cases, the pole frequency of the feedback path (ωp) sets the trade-off between lock time and bandwidth. For the feedback system, if AF (with pole frequency ωf) and ADC are the transfer functions of the forward path and the DC recovery path respectively, then the system transfer function (AFB) can be written as shown in Eq. 4.1. The step response of the feedback system gives the lock time, and the high-pass pole frequency fc defines the usable bandwidth (Eq. 4.2) [9].

\[ A_{FB} = \frac{A_F}{1 + A_F A_{DC}} \tag{4.1} \]

\[ f_c = \frac{A_F A_{DC}}{2\pi R_F C_F} \tag{4.2} \]

where \(R_F C_F = 1/\omega_p\) is the time constant of the LPF.

In general, if the pole (ωp) in the feedback path is at a relatively higher frequency, the feedback loop settles faster, but that also reduces the usable effective bandwidth (Figure 4.3). This problem can be avoided by shifting the feedback pole to a lower frequency, but that increases the lock time. A good compromise can be to place the pole at /0th of the preamble frequency and to use additional digital post-processing; this way, it is possible to reduce the lock time to less than 50 cycles of the digital clock [].

Figure 4.3: Settling time vs. bandwidth of DC recovery loop (feedback and feedforward lock time in ms and feedback bandwidth penalty in GHz, versus pole frequency in MHz).
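The lock-time versus bandwidth trade-off of the feedback DC recovery loop can be illustrated numerically. The sketch below assumes a frequency-independent forward gain and a single-pole LPF in the feedback path, so the closed loop behaves as a first-order high-pass whose corner sits at (1 + loop gain) times the LPF pole, and whose settling time is a few time constants of that corner. The function name, the loop gain and the pole frequencies are hypothetical and not taken from the measured design.

```python
import math


def feedback_dc_loop(f_pole_hz: float, loop_gain: float,
                     settle_time_constants: float = 5.0):
    """Return (high-pass corner in Hz, settling time in s) of the DC loop.

    First-order model: forward gain is flat and the feedback path is a
    single-pole low-pass at f_pole_hz with DC loop gain loop_gain.
    """
    f_corner = (1.0 + loop_gain) * f_pole_hz      # low-frequency cut-off
    tau = 1.0 / (2.0 * math.pi * f_corner)        # closed-loop time constant
    return f_corner, settle_time_constants * tau


# Hypothetical sweep of the feedback pole with a loop gain of 20.
for f_pole in (10e3, 100e3, 1e6, 10e6):
    f_c, t_lock = feedback_dc_loop(f_pole, loop_gain=20.0)
    print(f"pole {f_pole / 1e6:7.3f} MHz -> corner {f_c / 1e6:8.3f} MHz, "
          f"settling {t_lock * 1e9:10.1f} ns")
```

Moving the pole up by a factor of ten shortens the settling time tenfold, but raises the low-frequency cut-off by the same factor, which is the bandwidth penalty described in the text.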

For the feedforward DC recovery concept, the system transfer function A_FF(s) is determined by the forward-path pole ωf and the LPF pole ωp (Eq. 4.3). Compared to the feedback solution, DC settles faster in this concept and the bandwidth penalty is set by ωp. As a result, the fundamental trade-off between usable bandwidth and settling time is still significant. In both the feedback and feedforward cases of DC recovery, if the DC information recovered during the preamble period is not stored in memory, the system faces baseline wander during CID patterns.

Similarly, feedback-based timing recovery loops (Figure 4.4(a)) also suffer from long lock times in burst mode applications. Fortunately, open-loop approaches such as injection-lock-based timing recovery loops have reduced the lock time to a few cycles ([30], Figure 4.4(b)). However, open-loop timing recovery loops suffer from several limitations. First, the VCO drifts away in the absence of injection pulses, which leads to poor tolerance to CID and a suboptimal lock position. Second, unlike traditional SerDes, where half-rate or quarter-rate solutions are often used to reduce power, injection locking is usually done at full rate; therefore, receiver power is usually high.

Figure 4.4: Conventional timing recovery loop.

Although injection locking can achieve a fast lock time, receiver initialization time is still gated by DC offset correction, as shown in Figure 4.5. At the beginning, when the DC offset is still uncorrected, the outputs of the limiting amplifier do not have a 50% duty cycle. This means that the generated injection pulses are spaced such that the generated frequency is offset from the correct one. Therefore, when injected with these pulses, the VCO does not lock at the correct phase or frequency. Only after the DC offset is corrected are the generated pulses properly spaced to lock the VCO. Although existing works have shown very fast burst mode clock recovery [3], receiver lock time does not improve without addressing the DC offset correction.

Figure 4.5: Effect of DC on duty cycle of the data.
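The effect of an uncorrected DC offset on the spacing of the injection pulses (the T>UI and T<UI cases of Figure 4.5) can be sketched with a simple waveform model. The example below approximates the alternating preamble as a sine wave with a period of two unit intervals and finds where it crosses a slicing threshold that is shifted by the offset. The sine approximation, the function name and the normalisation are assumptions made for illustration.

```python
import math


def edge_spacing_ui(offset_over_amplitude: float):
    """Edge spacings (in UI) of a sliced alternating preamble with DC offset.

    The preamble is modelled as sin(pi * t / UI), so one full period is 2 UI.
    offset_over_amplitude is the threshold shift divided by the signal
    amplitude and must lie strictly between -1 and 1.
    Returns (rise_to_fall, fall_to_rise, rise_to_rise).
    """
    x = offset_over_amplitude
    if not -1.0 < x < 1.0:
        raise ValueError("offset saturates the slicer: no edges are produced")
    t_rise = math.asin(x) / math.pi          # rising crossing, in UI
    t_fall = 1.0 - t_rise                    # falling crossing, in UI
    rise_to_fall = t_fall - t_rise
    fall_to_rise = 2.0 - rise_to_fall
    return rise_to_fall, fall_to_rise, rise_to_fall + fall_to_rise


for ratio in (0.0, 0.25, 0.5, 0.9):
    r2f, f2r, r2r = edge_spacing_ui(ratio)
    print(f"offset/amplitude {ratio:4.2f}: rise->fall {r2f:.3f} UI, "
          f"fall->rise {f2r:.3f} UI, rise->rise {r2r:.3f} UI")
```

In this model the spacing between opposite edges departs from 1 UI as soon as the offset is non-zero, while the spacing between edges of the same polarity stays at one full pattern period.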

This work introduces a burst mode quad-rate 7-0 Gb/s receiver architecture for dynamically reconfigurable photonic switch networks. The entire receiver, including the TIA, three-stage amplifiers and CDR, is implemented on the same IC; the inductor-less receiver fits within an area of 465 µm × 65 µm. Using a limiting amplifier leads to a power- and area-consuming design; therefore, in this design, we eliminate the limiting amplifier to reduce power. Unlike traditional DC offset correction, which relies on sensing DC at the output, we propose a novel DC offset correction that directly measures the DC offset from alternating data. Since no LPF is needed, this DC offset correction loop can settle significantly faster than the conventional approach. To achieve fast wake-up, the DC offset correction and timing recovery loops run concurrently, which reduces the lock time to less than 0ns. To achieve a robust solution, a timing skew adaptation algorithm is used to obtain the optimum timing margin for the receiver.

Proposed Burst Mode Receiver

The overall block diagram of the implemented receiver is shown in Figure 4.6. The system consists of an analog front-end (a TIA followed by three-stage differential amplifiers), a DC recovery loop, an injection-lock-based quad-rate CDR and a timing skew adaptation loop.

Figure 4.6: Block diagram of the receiver with power breakdown (timing recovery loop and DC recovery loop; one-time initialization power and runtime power).

Trans-impedance Amplifier (TIA)

The TIA receives the current from the photodiode, and its single-ended output voltage goes to one input of the differential amplifier chain, whereas the other input is the recovered DC. A CMOS common-gate amplifier is used as the TIA, followed by a common-source amplifier with source degeneration (Figure 4.7). The trans-impedance gain of the common-gate stage is

\[ A_{F1} = R_1 \, \frac{\dfrac{1}{g_{m2}} + \dfrac{1}{sC_1}}{\dfrac{1}{g_{m1}} + \dfrac{1}{g_{m2}} + \dfrac{1}{sC_1}} \tag{4.4} \]

The voltage gain of the common-source stage with source degeneration impedance is

\[ A_{F2} = \frac{g_{m3} R_3}{1 + g_{m3}\left(R_4 \parallel \dfrac{1}{sC_2}\right)} \tag{4.5} \]

So, the overall gain of the forward path is

\[ A_F = R_1 \, \frac{\dfrac{1}{g_{m2}} + \dfrac{1}{sC_1}}{\dfrac{1}{g_{m1}} + \dfrac{1}{g_{m2}} + \dfrac{1}{sC_1}} \cdot \frac{g_{m3} R_3}{1 + g_{m3}\left(R_4 \parallel \dfrac{1}{sC_2}\right)} \tag{4.6} \]

Figure 4.7: Detailed schematic of TIA and its DC recovery loop (forward path A_F, boost path A_HP).

The path through M2 serves two purposes. First, the SAR DAC creates the DC output (OUTN) that corrects the data-dependent DC offset. Second, the signal path through the AC coupling capacitor C1 goes through the high-pass transfer function

\[ A_{HP} = R_2 \, \frac{\dfrac{1}{g_{m1}}}{\dfrac{1}{g_{m1}} + \dfrac{1}{g_{m2}} + \dfrac{1}{sC_1}} \tag{4.7} \]

At the input of the differential amplifier, these two paths appear such that the DC components of the signals cancel each other but the high-frequency parts of the signals combine constructively. By boosting the high-frequency signals, the front-end bandwidth is extended to GHz without using any inductor. Figure 4.8 shows the simulated gain of the analog front end with and without the high-pass path.

Figure 4.8: TIA transfer function and its simulated and measured gain (gain in dB versus frequency in GHz; forward path, high-pass path, gain with and without capacitor, measured gain).

Figure 4.9: Measured 0 Gb/s output eye of the front end.

Figure 4.10: Simulated amplifier stage output eye with (left) and without (right) C1 & C2 for input currents of 00 µA, 50 µA and 600 µA (from top to bottom).

The benefit of the high-pass frequency response is visible in the simulated eye diagrams shown in Figure 4.10 for different input currents. The measured TIA output eye is consistent with the simulated results and indicates the extended-bandwidth benefit of the front-end.

The TIA has to sense a small (~60 µA) current from the photodiode and convert it to a voltage. If the noise from this TIA front end is high, it will propagate through the whole front end, which may result in incorrect data and clock recovery. One of the design considerations of this TIA is therefore to keep the noise as low as possible while getting the most gain and boost out of it. The trade-off is that increasing the boost in the DC recovery path produces higher noise. So, a nominal point was chosen that gives a high boost and low noise.

The noise sources in this TIA are all the transistors and resistors handling the 7~0 Gb/s high-speed data. Noise from these sources appears at both OUTP and OUTN of the TIA, as shown in Figure 4.7. However, these two paths have different gains, which are derived in Eq. 4.6 and 4.7. For the noise calculation, we first show all the noise accumulated at OUTP and OUTN. Then, the noise at OUTP and OUTN is referred to the input by dividing it by the respective path gain. The noise current of any transistor is given by [4]

\[ \overline{I_{n,M}^2} = 4KTg_m \tag{4.8} \]

Noise generated by MN1 sees two impedances in parallel: one from MN1 and the other from C1 and MN2. The part of the noise current that passes through MN1 appears at OUTP with the gains of the TIA common-gate stage and of the common-source stage with source degeneration (A_CS):

\[ \overline{V_{n,OUTP,M_{N1}}^2} = 4KTg_{m1} \left| R_1 A_{CS} \, \frac{\dfrac{1}{g_{m2}} + \dfrac{1}{sC_1}}{\dfrac{1}{g_{m1}} + \dfrac{1}{g_{m2}} + \dfrac{1}{sC_1}} \right|^2 \tag{4.9} \]

The part of the noise current passing through the DC path appears at the OUTN node:

\[ \overline{V_{n,OUTN,M_{N1}}^2} = 4KTg_{m1} \left| R_2 \, \frac{\dfrac{1}{g_{m1}}}{\dfrac{1}{g_{m1}} + \dfrac{1}{g_{m2}} + \dfrac{1}{sC_1}} \right|^2 \tag{4.10} \]

Noise generated by MN2 sees two impedances in parallel: one from MN2 and the other from C1 and MN1. The noise appearing at OUTP passes through C1 and MN1. For OUTN, this noise comes through MN2 and sees the gain of its path:

\[ \overline{V_{n,OUTP,M_{N2}}^2} = 4KTg_{m2} \left| R_1 A_{CS} \, \frac{\dfrac{1}{g_{m2}}}{\dfrac{1}{g_{m1}} + \dfrac{1}{g_{m2}} + \dfrac{1}{sC_1}} \right|^2 \tag{4.11} \]

\[ \overline{V_{n,OUTN,M_{N2}}^2} = 4KTg_{m2} \left| R_2 \, \frac{\dfrac{1}{g_{m1}} + \dfrac{1}{sC_1}}{\dfrac{1}{g_{m1}} + \dfrac{1}{g_{m2}} + \dfrac{1}{sC_1}} \right|^2 \tag{4.12} \]

The noise due to MN3 appears at OUTP only. The corresponding term, \(\overline{V_{n,OUTP,M_{N3}}^2}\) (Eq. 4.13), is the channel noise of MN3 (Eq. 4.8 with g_m3) seen through the gain of the common-source stage with degeneration.

The noise voltage and current of any resistor R are given by [4]

\[ \overline{V_{n,R}^2} = 4KTR \tag{4.14} \]

\[ \overline{I_{n,R}^2} = \frac{4KT}{R} \tag{4.15} \]

The noise from R1 of the TIA appears at both OUTP and OUTN. The noise voltage generated by R1 sees the gain of the common-source stage with degeneration and shows up at OUTP:

\[ \overline{V_{n,OUTP,R_1}^2} = 4KTR_1 \left| A_{CS} \right|^2 \tag{4.16} \]

For the noise from R1 to appear at OUTN, it has to go through the impedances introduced by MN1, C1 and MN2. The noise current from R1 generates a voltage across MN1, which sees a gain from MN2 and appears at the OUTN node as \(\overline{V_{n,OUTN,R_1}^2}\) (Eq. 4.17).

Noise from R2 only appears at OUTN:

\[ \overline{V_{n,OUTN,R_2}^2} = 4KTR_2 \tag{4.18} \]

Noise from R3 and R4 only appears at the OUTP node:

\[ \overline{V_{n,OUTP,R_3}^2} = 4KTR_3 \tag{4.19} \]

The contribution of R4, \(\overline{V_{n,OUTP,R_4}^2}\) (Eq. 4.20), depends on R4, the degeneration capacitor and g_m3.

All the noise from the different sources appearing at OUTP and OUTN is summed up to get the total noise at the respective nodes:

\[ \overline{V_{n,OUTP}^2} = \overline{V_{n,OUTP,M_{N1}}^2} + \overline{V_{n,OUTP,M_{N2}}^2} + \overline{V_{n,OUTP,M_{N3}}^2} + \overline{V_{n,OUTP,R_1}^2} + \overline{V_{n,OUTP,R_3}^2} + \overline{V_{n,OUTP,R_4}^2} \tag{4.21} \]

\[ \overline{V_{n,OUTN}^2} = \overline{V_{n,OUTN,M_{N1}}^2} + \overline{V_{n,OUTN,M_{N2}}^2} + \overline{V_{n,OUTN,R_1}^2} + \overline{V_{n,OUTN,R_2}^2} \tag{4.22} \]

To get the input-referred noise, we divide the noise at OUTP and OUTN by their path gains, which we get from Eq. 4.6 and 4.7:

\[ \overline{I_{n,IN}^2} = \frac{\overline{V_{n,OUTP}^2}}{\left|A_F\right|^2} + \frac{\overline{V_{n,OUTN}^2}}{\left|A_{HP}\right|^2} \tag{4.23} \]

The noise estimated by the above equation correlates well with the simulated noise over the frequency band, as shown in Figure 4.11. Consuming only 4.3 mW, the TIA achieves a sensitivity of -3.98 dBm, assuming a photodetector responsivity of 0.5 A/W, at a BER of 0 -.
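The link between input-referred noise, responsivity, BER and optical sensitivity can be sketched with the usual Gaussian-noise estimate: the required average optical power is Q times the rms input-referred noise current divided by the responsivity, where Q follows from the target BER. The sketch below assumes Gaussian noise, an ideal extinction ratio and equal noise on both levels. The function names and the 1 µA rms noise value are hypothetical; only the 0.5 A/W responsivity is taken from the text.

```python
import math
from statistics import NormalDist


def q_factor(ber: float) -> float:
    """Q such that the Gaussian tail probability beyond Q equals ber."""
    return -NormalDist().inv_cdf(ber)


def sensitivity_dbm(noise_rms_ua: float, responsivity_a_per_w: float,
                    ber: float) -> float:
    """Average optical power (dBm) needed to reach the target BER.

    Assumes Gaussian noise and an ideal extinction ratio, so that
    P_avg = Q * i_n,rms / responsivity.
    """
    p_avg_w = q_factor(ber) * noise_rms_ua * 1e-6 / responsivity_a_per_w
    return 10.0 * math.log10(p_avg_w / 1e-3)


# Hypothetical example: 1 uA rms integrated input-referred noise.
print(f"Q for BER 1e-12: {q_factor(1e-12):.3f}")
print(f"sensitivity: {sensitivity_dbm(1.0, 0.5, 1e-12):.2f} dBm")
```

Because the sensitivity scales directly with the rms noise current, every halving of the input-referred noise improves the sensitivity by about 3 dB in optical power.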

Figure 4.11: Simulated and calculated input-referred current noise (pA/√Hz) of the TIA versus frequency (GHz).

DC Recovery

The DC recovery loop works during the preamble period (alternating 1010... pattern). The differential voltage from the amplifier chain is sampled using the sample-and-hold (SH) circuit. Two consecutive samples, S(n) and S(n-T), from two adjacent time-interleaved paths are combined to get S(n)+S(n-T), which is then fed to the SAR logic block (Figure 4.6). The SAR block is clocked at 1/8th of the data-rate clock (the C8 clock) of the receiver so that the DC is well settled before the next decision cycle. The overall operation takes six cycles to complete: the first cycle is used to reset the SAR logic, and the other five cycles are used to update the DAC.

Figure 4.12 illustrates the implemented DC recovery technique, where AMP_OUTP and AMP_OUTN are the outputs of the amplifier stage. At the beginning of each burst, in the worst case, the DC offset may even fully saturate the amplifier output. In that scenario, both sampled values appear to be -ve. Therefore, the DC offset continues to be reduced until the output starts to toggle. During the preamble period, the signal goes through a transition in each cycle. Therefore, the sampler that is sampling a 1 will have a +ve value and the other one, which is sampling a 0, should have a -ve value. Since we are not using any limiting amplifier, their amplitudes do not saturate. When these two samples are added, their sum indicates the DC offset in the input signal. To implement a binary search algorithm, we only consider the polarity of S(n)+S(n-T). Each time, the SAR logic updates the DAC depending on this polarity. As the algorithm progresses, the sample amplitudes vary until the DC voltage at the SAR DAC output matches the incoming data-dependent offset. As the DC recovery information is stored digitally during the preamble period, the system does not face any baseline wander during CID patterns. Figure 4.13 shows a measured scope shot of the DC recovery operation, which takes only 4.8 ns for incoming 0 Gb/s data.

Figure 4.12: Proposed DC recovery technique.

Figure 4.13: Measured scope shot of DC recovery operation within 4.8 ns.

Clock Recovery

In the proposed receiver architecture, DC and timing recovery work concurrently. During DC recovery, as there is no LPF, the DC settles quickly because the SAR updates the DAC at every C8 clock. As the DC moves through the different SAR steps, once it settles within the signal range of the TIA output, injection pulses begin to appear (Figure 4.14). A non-50% duty cycle at the amplifier output during DC recovery gives pulses that are not 1 unit interval (UI) apart. For the injection-locked oscillator (ILO) to lock at the proper frequency, the separation between pulses has to be an integer multiple of UI. Although the distance between rising- and falling-edge pulses depends on the DC offset, the distance from one rising edge to the next rising edge is always 2 UI. Therefore, rather than using both rising and falling edges, only rising-edge pulses are used to facilitate concurrent operation. For a regular data pattern, this distance will always be an integer multiple N·UI.
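The polarity-driven binary search used by the DC recovery loop described in this section can be sketched in a few lines. The model below is a behavioural illustration only: it assumes an ideal 5-bit DAC, an unsaturated amplifier, and a sign convention in which a positive S(n)+S(n-T) means the DAC output is still below the offset. The function name, the normalised voltage scale and the sign convention are assumptions, not details taken from the implemented circuit.

```python
def sar_dc_recovery(offset: float, amplitude: float = 0.1,
                    bits: int = 5, full_scale: float = 1.0):
    """Binary-search a DAC code so that the DAC output tracks `offset`.

    Each step adds one sample taken on a 1 and one taken on a 0 of the
    alternating preamble. The data amplitude cancels in the sum, leaving
    twice the residual offset, and only its polarity is used.
    Returns (final_code, list of (trial_code, keep_bit)).
    """
    code = 0
    history = []
    for bit in reversed(range(bits)):            # MSB first, one bit per cycle
        trial = code | (1 << bit)
        dac = trial * full_scale / (1 << bits)
        residual = offset - dac
        sample_on_one = residual + amplitude     # S(n) taken on a 1
        sample_on_zero = residual - amplitude    # S(n-T) taken on a 0
        keep_bit = (sample_on_one + sample_on_zero) > 0.0
        if keep_bit:                             # DAC still below the offset
            code = trial
        history.append((trial, keep_bit))
    return code, history


code, steps = sar_dc_recovery(offset=0.40)
print("final code:", code)
for trial, keep in steps:
    print(f"trial code {trial:2d} -> {'keep bit' if keep else 'drop bit'}")
```

After five decisions the DAC output lies within one LSB of the offset, and the result is held as a digital code, which is why no low-pass filter settling is involved.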

Figure 4.14: DC recovery without LPF and use of rising edge pulses allowing concurrent operation of DC and timing recovery.

A ring oscillator is chosen for this work as it provides a wide tuning range in a compact area. The quad-rate operation of the oscillator helps to achieve a higher data rate with better power efficiency compared to half-rate or full-rate clocking. The oscillator has to operate in the GHz range, as the data rate is between 7-0 Gb/s. The ring oscillator has four stages, as shown in Figure 4.15. Stage I, which gives Φ0 and Φ180, is chosen for data pulse injection. The injected pulse corrects the zero crossings of Φ0 and Φ180. As injection happens in only one stage of the quad-rate oscillator, only rising-edge pulses separated by 4 UI can do the correction. These pulses also have to arrive at the time of the zero crossing of Φ0. In this work, a pulse window is created by an AND between Φ225 and Φ315. However, any harmonic component of the pulses generated from the data can still go through and provide phase correction if it falls within the pulse window. Figure 4.15 illustrates the whole pulse filtering function of the oscillator. With this pulse filtering in place, during the preamble period each window gets a pulse transferred to the oscillator, which in turn makes it lock instantaneously. During the DC recovery operation, as the DC goes up and down, the lock point of the oscillator may shift, but the oscillator always remains locked during that time, enabling concurrent operation.

Figure 4.15: Ring oscillator with pulse filtering and its timing diagram.

Timing Skew Correction

There are delays associated with the CML-to-CMOS converter and the injection pulse generator that correct the phase of the oscillator. The corrected data phases of the injection-locked oscillator go through buffers and pulse generators to sample the incoming data. In this way, the timing margin of the data samplers becomes a function of the path delay (Figure 4.16). Ideally, this path delay should be 0.5 UI. However, the delay varies across corners, which reduces the timing and voltage margin of the data recovery path. To get rid of this issue, a slope detection technique is applied. The slope detection logic only works during non-transition periods of the data, as the ILO already takes care of the transition edges. If the data is sampled at the middle of the eye (Figure 4.17(a)), the slope between two consecutive samples should ideally be zero. This is true for both 11 and 00 data sequences. However, if the non-transition data is sampled before/after 0.5 UI from the edge, there will be a +ve/-ve slope. As illustrated in Figure 4.17(b), when the data sequence is 011x, if the slope between the two samples of the consecutive 1s is +ve, the sampling phase should be moved to the right to bring it to the middle of the eye. For a 100x data sequence, the phase has to move right if the slope between the two samples of the consecutive 0s is -ve. Figure 4.17(c) shows the situations where the data sampling phase has to move left; the data sequences to consider here are x110 and x001.

Figure 4.16: Timing Skew compensation.

Figure 4.17: Slope Detection: (a) sampling at the middle of the eye; (b) sampling before 0.5 UI from edge; (c) sampling after 0.5 UI from edge.

This adaptation logic runs after the preamble period, during the regular data pattern, and updates the phase rotator to obtain the correct sampling phase with the highest timing margin. To implement the slope detection, we can reuse the S/H, comparator and SAR algorithm used for DC recovery. The comparison between two consecutive samples, S(n) and S(n-T), during the non-transition data period gives the slope of the incoming signal. The phase rotator has 5-bit control, allowing 32 phase steps between two data phases. The slope detection algorithm output goes to a majority voter, and C6 (/6th of data rate) is used to take data out of the majority voter. After majority voting, the phase rotator control bits are updated using the successive approximation algorithm. This logic runs once during each burst. There is a frequency (f) code that controls the frequency of the oscillator. This frequency code ensures that the oscillator is at the correct frequency. As the frequency is locked, the oscillator cannot drift away during CID data patterns.

Figure 4.18: Slope detection logic.

Data Pattern | Slope | Phase Movement
011x | +ve | Right
x110 | -ve | Left
100x | -ve | Right
x001 | +ve | Left

Implementation and Measurement Results

The die photo of the implemented receiver is shown in Figure 4.19. The quad-rate receiver takes only 465 µm × 65 µm in 0.3 µm technology. Figure 4.20 shows the test setup with all the necessary instruments. The receiver recovers the DC offset in only 4.8 ns, compared to the current state-of-the-art architecture taking.5 ns. The burst mode clock recovery takes 1 ns more (i.e. 5.8 ns in total) during the preamble period (Figure 4.21). The receiver works at 0 Gb/s consuming only 4.6 mW, out of which 5. mW goes to the initial DC recovery and 3.8 mW goes to the timing skew correction. The power efficiency of the receiver during runtime is only 3.7 pJ/bit. At lower data rates, the consumed power goes down, but the power efficiency is reduced. The recovered clock has around 0 ps measured peak-to-peak jitter (Figure 4.23). Figure 4.24 shows the phase noise plot for 7 Gb/s operation, giving only.3 ps rms jitter when jitter is integrated from kHz to GHz.

Figure 4.19: Implemented die photo in 0.3 µm (area = 465 µm × 65 µm).

Figure 4.20: Pin diagram and test setup of the prototype. Instruments shown: Agilent 8600D Infiniium DCA-X oscilloscope (eye diagrams), R&S FSV 40GHz spectrum analyzer (phase noise plots), Tektronix DPO 7604C oscilloscope (high-speed transient screenshots), Tektronix AWG 7000A (preamble and PRBS data pattern loading), and an Arduino (control bit loading).
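The slope detection and majority voting used for timing skew correction can be summarised as a small decision routine. The sketch below assumes that the four rows of the slope detection table correspond to the bit patterns 011x, 100x (move right) and x110, x001 (move left), where the middle two bits are the non-transition pair whose samples are compared. The function names, the +1/-1 vote encoding and the sample values are hypothetical.

```python
def slope_vote(prev_bit: int, pair_bit: int, next_bit: int,
               s_first: float, s_second: float) -> int:
    """Vote for one non-transition pair: +1 = move right, -1 = left, 0 = none.

    pair_bit is the value of the two identical bits, s_first and s_second
    are their analog samples, prev_bit and next_bit are the neighbours.
    """
    slope = s_second - s_first
    if pair_bit == 1:
        if slope > 0 and prev_bit == 0:      # 011x with +ve slope
            return +1
        if slope < 0 and next_bit == 0:      # x110 with -ve slope
            return -1
    else:
        if slope < 0 and prev_bit == 1:      # 100x with -ve slope
            return +1
        if slope > 0 and next_bit == 1:      # x001 with +ve slope
            return -1
    return 0


def majority(votes) -> str:
    """Combine many votes into a single phase rotator decision."""
    total = sum(votes)
    if total > 0:
        return "right"
    if total < 0:
        return "left"
    return "hold"


votes = [
    slope_vote(0, 1, 1, 0.20, 0.30),    # 011x, rising  -> right
    slope_vote(1, 0, 0, -0.20, -0.30),  # 100x, falling -> right
    slope_vote(1, 1, 0, 0.30, 0.20),    # x110, falling -> left
    slope_vote(1, 1, 1, 0.30, 0.30),    # flat, no information
]
print(votes, "->", majority(votes))
```

Voting over many pairs before updating the phase rotator keeps a single noisy comparison from moving the sampling phase in the wrong direction.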

Figure 4.21: Measured scope shot of DC and timing recovery with preamble and data pattern.

Figure 4.22: Recovered clock eye.

Figure 4.23: Recovered clock histogram.

Figure 4.24: Phase noise plot of the recovered clock.

Figure 4.25: On-chip PRBS checked recovered.5 Gb/s channel data eye.
