Silicon Nanophotonics for Many-Core On-Chip Networks


University of Colorado, Boulder
CU Scholar
Electrical, Computer & Energy Engineering Graduate Theses & Dissertations
Electrical, Computer & Energy Engineering
Spring 4-1-2013

Silicon Nanophotonics for Many-Core On-Chip Networks
Moustafa Mohamed
University of Colorado at Boulder

Part of the Computer Engineering Commons, and the Nanoscience and Nanotechnology Commons

Recommended Citation
Mohamed, Moustafa, "Silicon Nanophotonics for Many-Core On-Chip Networks" (2013). Electrical, Computer & Energy Engineering Graduate Theses & Dissertations.

This Dissertation is brought to you for free and open access by Electrical, Computer & Energy Engineering at CU Scholar. It has been accepted for inclusion in Electrical, Computer & Energy Engineering Graduate Theses & Dissertations by an authorized administrator of CU Scholar.

Silicon Nanophotonics for Many-Core On-Chip Networks
by
Moustafa Mohamed
B.Sc., Cairo University, 2003
M.Sc., Cairo University, 2006

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of Electrical, Computer and Energy Engineering
2013

This thesis entitled:
Silicon Nanophotonics for Many-Core On-Chip Networks
written by Moustafa Mohamed
has been approved for the Department of Electrical, Computer and Energy Engineering

Prof. Alan Mickelson & Prof. Li Shang
Prof. Douglas Sicker
Prof. Edward Kuester
Prof. Won Park
Prof. Yifu Ding

Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

Mohamed, Moustafa (Ph.D., Computer Engineering)
Silicon Nanophotonics for Many-Core On-Chip Networks
Thesis directed by Prof. Alan Mickelson & Prof. Li Shang

The number of cores in many-core architectures is scaling to unprecedented levels, requiring ever-increasing communication capacity. Traditionally, architects have followed the path of higher throughput at the expense of latency. This trend has become problematic for performance in many-core architectures. Moreover, power consumption is increasing with system scaling, mandating nontraditional solutions. Nanophotonics can address these problems, offering benefits on the three frontiers of many-core processor design: latency, bandwidth, and power. Nanophotonics leverages circuit-switched flow control, allowing low latency; in addition, the power consumption of optical links is significantly lower than that of their electrical counterparts for intermediate and long links. Finally, through wavelength division multiplexing, we can sustain the high bandwidth trends without sacrificing throughput.

This thesis focuses on realizing nanophotonics for communication in many-core architectures at different design levels, considering the reliability challenges that our fabrication and measurements reveal. First, we study how to design on-chip networks for low latency, low power, and high bandwidth by exploiting the full potential of nanophotonics. The design process considers device-level limitations and capabilities on one hand, and system-level demands in terms of power and performance on the other. The design involves the choice of devices and the design of the optical link, the topology, the arbitration technique, and the routing mechanism.

Next, we address the problem of reliability in on-chip networks. Reliability problems not only degrade performance but can block communication. Hence, we propose a reliability-aware design flow and present a reliability management technique based on this flow to address reliability in the system. In the proposed flow, reliability is modeled and analyzed at the device, architecture, and system levels. Our reliability management technique is superior to existing solutions in terms of power and performance. In fact, our solution can scale to thousand-core systems with low overhead (footnote 1).

Footnote 1: This material is based upon work supported by the National Science Foundation under Grant No. CCF and CCF. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Dedication

I would like to dedicate this thesis to my father who knows the future.

Acknowledgements

This PhD program was a long journey that started with an intention to learn and ended with a degree. During this journey I met several people who helped me at both the academic and personal level. I would like to thank them all, and this acknowledgment is to show them my gratitude. I would like to thank my advisor, Prof. Mickelson, who has guided me throughout the program through advice and discussion. I would like to thank Prof. Vachharajani and Prof. Shang for advising me during the first couple of semesters. I would also like to thank Zheng Li, with whom I worked while he was a senior PhD student and a postdoc. I enjoyed working with him and learned from him too. Also, I would like to thank the EMTNano team, who contributed to this thesis directly and indirectly through a great collaboration that enabled us to publish excellent papers and reach novel and solid results. Without their contribution, this thesis would be missing the experimental side that makes it solid and practical.

At the personal level I would like to thank my family. They were of great support in every way. They have always encouraged and helped me, from when I was a kid to this very day. I owe them a lot. Last, but certainly not least, I would like to thank my friends and colleagues at CU Boulder, who were of help in different ways. I would also like to thank Prof. Alan Mickelson, Prof. Li Shang, Prof. Manish Vachharajani, and Prof. Michael Lightner (Chairman of the ECEE department) for funding me at different periods of my PhD program.

This material is based upon work supported by the National Science Foundation under Grant No. CCF and CCF. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Contents

Chapter 1 Introduction
    Drawbacks of electrical interconnections
    Potential of optical interconnections
    Challenges for designing optical interconnections

Chapter 2 Silicon photonic devices for on-chip networks
    Background
    Devices for on-chip nanophotonic networks
    Stacked Racetracks for switching
    Motivation
    Analytical model
    Numerical Simulation
    Adiabatic couplers
    Motivation
    Analytical model
    Numerical simulation
    Summary

Chapter 3 On-chip Network Design
    Motivation: Characterizing on-chip traffic
    Characterizing electrical interconnect
    Characterizing photonic interconnect
    Iris: Antenna-based design
    Network Architecture Design
    Iris: Physical Design
    Simulation results
    Evaluation platform and configuration
    Comparison of Iris and alternatives
    Pareto-Space of design
    Linear power division networks
    Summary

Chapter 4 Reliability challenges in nanophotonic on-chip networks
    Variation-induced Reliability Challenges
    System Reliability Impact
    Summary

Chapter 5 Reliability-aware design flow
    Overview
    Reliability-Aware Device Analysis and Design
    Reliability-Aware Device Modeling and Analysis
    Reliability-Aware Device Design
    Reliability-Aware Architecture Analysis and Design
    Reliability-aware network-on-chip simulator
    Case Studies: Reliability-Analysis
    Reliability-Aware Architecture Design Guidelines
    Summary

Chapter 6 Reliability management solutions
    Device-level Reliability Management
    Athermal Silicon Photonic Devices
    Process-Variation-Immune Silicon Photonic Devices
    Case Study: Reliability-aware design and management of a single channel optical link
    Reliability Management: Problem formulation
    Reliability Management Solution I
    Proposed solution
    Simulation Results
    Reliability Management Solution II
    Proposed Solution
    Simulation Results
    Summary

Chapter 7 Future work
    Cost-effective nanophotonics
    Applications of silicon nitride nanophotonics in data centers

Chapter 8 Conclusion

Bibliography

Tables

Table 3.1 Power budget
    Configuration
    Athermal devices characteristics
    Process and thermal variations analysis of the devices
    Resonant wavelength characteristics of the devices
    Thermal tuning power of different optical links
    Voltage tuning power of different optical links
    Nanophotonic network configuration

Figures

Figure 1.1 Transistor delay scaling with technology according to ITRS projections
    Copper wire delay scaling with technology according to ITRS projections
    Copper wire power scaling with technology according to ITRS projections
    Our optical measurement set-up including the chip, fibers, and controllers. Also there are different micro-ring configurations
    Simulated result
    Simulation results
    Schematic of Iris: both networks used in the design
    Arbitration and arbiter's finite state machine (FSM)
    Physical design of Iris
    Scanning electron microscope images of nanophotonic devices fabricated through epixfab and measured in our labs
    Measurement results of nanophotonic racetrack-based switch which enhances the bandwidth of the switch and reduces overall area
    Power consumption comparison of seven different networks including optical and electrical networks
    Performance comparison: average local L2 miss latency of seven different networks including state of the art optical and electrical networks
3.8 Throughput analysis of our broadcast-based networks
    Linear power distribution topology using adiabatic coupler
    Super-linear power distribution topology used by other power splitters
    Passband shift of micro-rings due to process variations of devices on different chips on the same wafer, with the same nominal design parameters
    Passband shift of the same micro-ring due to thermal variations induced experimentally
    Signal loss and crosstalk in silicon photonic on-chip network due to variations in the system
    Reliability-aware design flow based on abstracting our system into two layers. The flow involves the analysis and design of each layer in a two-step process
    T-matrix of directional coupler and illustration of the directional coupler structure
    T-matrix models of micro-ring resonator and an illustration of its structure
    Transmission spectrum of the measurement and model fit of a directional coupler measured and calibrated in our labs
    Simulation results of a forward-biased micro-ring modulator and calibration of the proposed models at equilibrium (equ), on, and off states
    Doped PN-junction micro-ring: top and side view of the p- and n-regions
    Doped PIN-junction micro-ring: top and side view of the p-, i-, n-regions
    Simulation results of a reverse-biased micro-ring modulator and calibration of the proposed models
    Traditional design process for silicon photonic devices using slow and inaccurate electromagnetic simulations
    Proposed design process for silicon photonic devices based on the T-matrix models
    Electrical and optical power trade-offs in micro-ring design based on device radius
    Performance reliability tradeoffs based on channel passband width
5.13 The radii and resonant wavelength of a 16-channel WDM ring designed using T-matrix model for a 63% yield
    Operating temperature of doped micro-ring due to dopant-induced optical power loss
    Power and reliability trade-offs in micro-ring design based on dopant concentration in the micro-ring
    The simulation infrastructure for evaluating the power, performance, and reliability of on-chip optical networks
    Performance decomposition and comparison in ideal environment (left) and in presence of process and thermal variations (right)
    Power decomposition and comparison in ideal environment (left) and in presence of process and thermal variations (right)
    Performance comparison of 16 WDM channel setup (left) and 32 channel setup (right) in presence of process and thermal variation
    Design space exploration in the architecture domain by varying power, channel bandwidth, and network
    Illustration of inter-channel hopping in on-chip network at switches due to variations
    λ_r change and power loss in voltage tuning as we increase dopant concentration
    Left: fabricated devices in epixfab. Right: illustration of power loss and crosstalk in on-chip networks due to variations in the system
    Network reliability analysis of four different reliability-management techniques in terms of connection ratio to overall number of communicating nodes
    Power efficiency analysis of three different reliability management techniques under realistic workloads
    Network performance analysis of the proposed technique against ideal case of on-chip optical network under realistic workloads and variations
    Scalability of routing algorithm with number of cores in system
6.8 Reliability results for four different reliability management techniques under realistic workloads and variations
    Power evaluation for five different reliability management techniques under realistic workloads and variations
    Power scalability of the proposed reliability management solution against thermal tuning and channel hopping for future many-core systems

Chapter 1

Introduction

As we progress towards more advanced technologies and smaller transistor dimensions, power is the main constraint that will dictate the architecture of future systems. Hence, our optimization problem is how to get the most performance within the power envelope determined by the application. Today, the answer is that many-core architectures potentially offer the best tradeoff. Looking further down the road, we see different views on how to architect the system. Ideas like dynamic power management of the system [1] or adding more cache [2] have been proposed. These kinds of solutions come with a performance cost, which makes them less attractive. Hence, more transformational solutions are needed (footnote 1).

In Intel's 48-core processor, 10% of the power is dissipated in the on-chip network [11]. In Intel's 80-core teraflop chip, the network (links and routers) consumed 28% of the total chip power [12]. From academia, RAW, a processor developed at MIT, spent 36% of its power on communication [13]. Digging further into this problem, we see that the portion of power dissipated in communication is expected to increase with new technologies [14]. Hence, a solution is needed if we are to target thousand-core systems. In this study, we focus on how to offer a power-efficient communication fabric for future systems in a scalable manner. Such an enhancement cannot be achieved using traditional electrical interconnects, as we will show. Hence, we opt for optics, one of the more mature alternative technologies for interconnects. Optics can offer a low-power solution with high performance and solve this equation. However, to reach that goal several obstacles need to be addressed. In this thesis we address some of the major concerns for this emerging technology and offer practical solutions. Unlike traditional system-level research, we address this problem at different abstraction levels, from fabrication and device-level modeling to architectures and systems. Next we explain in detail our motivations and how optics can help us reach peta-scale computing on-chip.

Footnote 1: The content of this thesis is copied from publications that I have published or that are under review [3, 4, 5, 6, 7, 8, 9, 10].

1.1 Drawbacks of electrical interconnections

Electrical interconnects have long been used in on-chip communication. They are CMOS compatible, have high integration density, and when repeated have high signal integrity. Moreover, electrical interconnects have been known for their high performance and low power. However, as we march towards new fabrication technologies, electrical interconnects are not scaling. On one hand, transistor delay, shown in Figure 1.1, is improving. On the other hand, copper wires are deteriorating in both performance and power. The International Technology Roadmap for Semiconductors (ITRS) projects that interconnect delay will deteriorate as shown in Figure 1.2 [14]. The trend is growing super-linearly. A repeated interconnect would linearize the delay but would not reverse the degradation in the delay trend we observe today. Moreover, power levels are deteriorating as shown in Figure 1.3 according to ITRS projections [14]. Even though a repeated line improves power by making it scale linearly with length, the overall power level is still high [15]. With both performance and power deteriorating, traditional global copper interconnects will fail to meet petascale performance demands and power constraints. Hence, copper interconnects are evolving into a bottleneck for the system. There is little chance for copper interconnects in the realm of thousand-core systems, and a replacement is necessary.
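
The remark that repeaters linearize wire delay can be illustrated with a small numerical sketch. It assumes the textbook distributed-RC approximation, in which the 50% delay of an unrepeated wire is about 0.38 times its total resistance times its total capacitance, and it lumps each repeater into a single fixed stage delay. The per-millimeter resistance, capacitance, and repeater delay below are hypothetical placeholders chosen for illustration; they are not ITRS values and do not come from this thesis.

```python
# Hypothetical per-unit-length wire parameters (assumed, not ITRS data).
R_PER_MM = 1.0e3      # wire resistance in ohm per mm
C_PER_MM = 0.2e-12    # wire capacitance in farad per mm
T_REPEATER = 20e-12   # lumped delay of one repeater stage in seconds (assumed)


def unrepeated_delay(length_mm):
    """Approximate 50% delay of a distributed RC wire: 0.38 * R_total * C_total.

    Both R_total and C_total grow with length, so the delay is quadratic in length.
    """
    return 0.38 * R_PER_MM * C_PER_MM * length_mm ** 2


def repeated_delay(length_mm, segments):
    """Delay when the wire is cut into equal segments, each followed by a repeater."""
    segment_length = length_mm / segments
    return segments * (unrepeated_delay(segment_length) + T_REPEATER)


def best_repeated_delay(length_mm):
    """Search over segment counts and return the smallest total delay found."""
    max_segments = max(1, int(4 * length_mm) + 1)
    return min(repeated_delay(length_mm, k) for k in range(1, max_segments + 1))


if __name__ == "__main__":
    print("length_mm  unrepeated_ps  repeated_ps")
    for length in (1, 2, 4, 8, 16):
        plain = unrepeated_delay(length) * 1e12
        repeated = best_repeated_delay(length) * 1e12
        print(f"{length:9d}  {plain:13.1f}  {repeated:11.1f}")
```

Under these assumptions, doubling the length quadruples the unrepeated delay but only about doubles the best repeated delay. The repeated delay per millimeter is still set by the wire's resistance and capacitance per unit length, which is consistent with the point made above that repeaters do not reverse the technology trend.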

Figure 1.1: Transistor delay scaling with technology according to ITRS projections. (Plot: transistor latency (ps) versus technology node (nm) and year.)
Figure 1.2: Copper wire delay scaling with technology according to ITRS projections. (Plot: 1 mm global wire RC delay (ps) versus technology node (nm) and year.)
Figure 1.3: Copper wire power scaling with technology according to ITRS projections. (Plot: global metal wires power (W) versus technology node (nm) and year.)

1.2 Potential of optical interconnections

On-chip optical interconnects are viewed as a potential replacement for electrical interconnects. Among the emerging technologies, optics is considered the most mature [16]. All the components necessary for building a network have been demonstrated successfully. Micro-ring filters for wavelength division multiplexing (WDM) have a small footprint, low insertion loss, and a narrow passband [17]. Micro-ring modulators have been demonstrated using doped PN- and PIN-junctions [18, 19]. We have demonstrated how to build wideband switches using racetracks [20]. Also, low-power, high-data-rate receivers have been experimentally demonstrated [21]. Finally, optical links containing the different components discussed here have been integrated to demonstrate high-data-rate, low-power optical communication. For instance, Chen et al. demonstrated a 3 Gbps link with a power of 120 fJ/bit [19]. Oracle demonstrated a four-channel WDM link operating at 10 Gbps per channel and a power between fJ/bit [22].

The advancement in optical devices has pushed system-level research forward. For instance, research on silicon photonic on-chip network topologies has produced serpentine-shaped buses, serpentine-shaped crossbars, antenna-based broadcast, and switch-based point-to-point interconnects [23, 24, 25, 26, 27]. In addition, silicon photonics has found its way to off-chip network topologies [25]. Routers [28] and flow control mechanisms [29] have been designed and analyzed as well. These studies indicate that optical interconnects are a promising technology and a potential replacement for copper wires.

Optical interconnects hold several advantages over traditional copper interconnects, as discussed by Beausoleil et al. [30, 31, 32]. The main advantages can be summarized as follows:

- Light speed latency: The fastest signal that can propagate in a material is an electromagnetic wave. However, its speed depends on the group index of the material. In metal interconnects, transmission latency is catching up with light speed [33]. We propose light speed communication in silicon. In addition to light speed latency, the interconnect is relay-free.

- High bandwidth: WDM offers high bandwidth in a narrow area. Simply by increasing the number of channels at the sender and receiver, a single nm waveguide can carry tens or even hundreds of channels. The exciting part is that this density is expected to improve with new technology generations.

- Low power: An optical channel can operate in the sub-mW range. Hence, for a 64-channel system, the total optical power of an optical link is below 100 mW. The total network power depends greatly on topology and architecture. Some low-power designs, like Iris [34, 27], consume a couple of watts, while others, such as Corona [23], break even with the power consumption of electrical interconnects.

1.3 Challenges for designing optical interconnections

Optical on-chip interconnect is a new field and remains less understood than electrical interconnects. Initial studies imitated electrical interconnect topologies and architectures for use in optics. With little device support, the proposed architectures did not unleash the full potential of silicon photonics and were unscalable. Moreover, the reliability challenges that silicon photonic networks face were poorly addressed. This has driven us to study the problem of communication in many-core architectures.

In 2008, the first silicon photonics paper was published in the architecture community. It projected that silicon photonics can provide the system with high performance and reduce power consumption. However, the solution proposed by Kirman et al. [24] re-used electrical concepts and applied them to optics. It was clear to us (the EMTNano group) that this was far from optimal and that there was more room for innovation. This pushed us to understand the problem of many-core architectures. We found that electrical interconnects provided very high throughput but at the expense of latency, which degraded system performance. The exa-scale computing report by DARPA confirmed this observation [35].
So we started with an architecture goal, a low-latency design, and began an interdisciplinary effort to solve the problem. We have proposed a full solution including physical, device, architecture, and system level components. At the physical level we proposed optics but, unlike earlier research, we moved from e-beam lithography, which is suitable for small-scale production and used in academia, to photolithography, a step towards large-scale production. At the device level we designed and fabricated traditional silicon photonic devices as well as novel devices like a nano-scale antenna operating in the infrared regime. These devices were measured and characterized in our labs. In addition, we have built models for simulation-based studies at higher abstraction levels. At the architecture level, we designed different networks that define a Pareto space of designs in terms of power, area, and latency. Finally, we studied the interaction of the network with the other components of the system, more specifically the CMOS computing layers. By studying all the layers of the system we have moved this technology one step closer to realization in many-core architectures or other applications, and one step closer to mass production.

Silicon photonics technology has been advanced through the work presented in this thesis, the EMTNano project contributions, and other groups working in this domain. Research is moving away from simple experimental prototypes to real system realization. In fact, this technology is gaining wide interest in industry from companies like Oracle, HP, Intel, and IBM. These companies follow the same path as we do in this research, a multidisciplinary approach that studies all layers of the system, but they do not share results of the kind we report here. In this thesis we share not only our results but also our vision of how to proceed with this technology. In my opinion, studying the technology from all aspects is the most important step. Other research efforts lack either the device-level understanding or the system-level goals.
Both hinder the advancement of the technology. Another important criterion for progress is studying this technology with mass production in mind, as we did in this research. This technology has moved away from lab experiments to real-world application. This can be pursued in different ways, such as moving to photolithography, which has lower cost and faster time to market, and considering the cost of the system being designed and the return on investment. Fabrication and measurement efforts are a must, since simulations usually deviate from real systems and measurements give a quantitative understanding of the problem.

As for how to advance this technology further and what is needed to have a product, here are a couple of ideas:

(1) System-level application: Silicon photonics has great potential; however, the final application that will adopt this technology is still unclear. Earlier, researchers in this domain pushed for on-chip communication. Samsung, one of the largest memory producers, studies the processor-memory interface. Recently, Intel has been working on data center applications. All these are attempts to find a proper application, and they will eventually converge as we gain a better understanding of the technology. In my opinion, this approach of understanding the device limitations and then switching to a new application is far from ideal. If we are to reach a consensus on a target application, device-level and system-level efforts need to be unified. Both levels of design must advance together rather than one dragging the other. However, we do not have an answer for what the ultimate application is.

(2) Device-level innovations: There are some major challenges that need to be addressed in silicon photonic devices before they find widespread use. The first challenge is the power efficiency of modulators and switches. The best approach to this problem is polymer-based devices, since their static power consumption is negligible and their dynamic power is very low. The next problem is process variations, which can be addressed by brute force (going to smaller technology nodes at a hefty fabrication price), by device-level innovations, or by system-level innovations as presented in this thesis. The final solution remains unknown and will probably be a combination of all of these. Building a reliable system from unreliable components is not a new idea and has been demonstrated in fiber optics, the father of silicon photonics.
Finally, the light source needs to be integrated at low cost; however, we have not studied this problem in this thesis, and other groups are actively working on it.

The problems studied and researched during the EMTNano project extend beyond this thesis. In this thesis, we narrow our focus, due to limited space, to two main system-level problems in many-core architectures:

- Network design: Designing networks for on-chip interconnects. Performance and power are the main criteria that guide us in the design process. The design is defined in terms of topology, flow control, and routing, under system constraints on latency, bandwidth, and power on one hand, and device constraints on functionality, power, and performance on the other. Thus, we touch the different layers of design to make design decisions and compare alternatives.

- Reliability-aware design flow: The sensitivity of silicon devices to process and thermal variations makes them vulnerable to reliability problems. The result can range from degraded performance to a complete block of communication, hindering the operation of the system and the performance benefits of silicon photonics. Addressing this issue needs to be done at different design phases in order to have a reliable system. We propose a two-phase design process that leads to a system-level reliability management solution, yielding a reliable system.

The topics mentioned above are too broad to cover fully. Instead, we provide a general overview of the technology with references and focus on the work that we have contributed to. First, we discuss network design to show why electrical design techniques cannot be applied to silicon nanophotonics. In optical on-chip networks, optical buffers at the nano-scale remain infeasible. These buffers are huge in area and power hungry. Hence, they cannot be integrated in the system [36]. This is a main difference from electrical interconnects that requires re-architecting the network. Moreover, optics supports WDM with very low power and area overhead, unlike electrical interconnects, where power and area scale linearly with the number of channels between two communicating ends. Next, in the design process we follow a bottom-up approach where we present the essential components for building nanophotonic on-chip networks and characterize them in terms of power and performance.
We focus on two devices which are part of our contributions: (1) stacked racetracks for switching and (2) adiabatic couplers for power splitters. Next, we proceed to the architecture level, where we show how broadcast-based networks can solve the latency problem and thus improve the overall system performance. Moreover, the power levels are low, enabling a lower power budget. In light of broadcast-based architectures we show the Pareto space of three nanophotonic on-chip networks that leverage an antenna for broadcast of low-latency packets and an additional network for high throughput. Through different design choices for the second network, we provide the architect with different options for power, latency, and area depending on the target application.

Next, we discuss the reliability-aware design flow to relieve the system of the deleterious effect of variations. As we show, variations arise from fabrication inaccuracies, which lead to process variations, and from workload variations in time and space, which lead to thermal variations. These variations occur naturally and the designer has little control over them. Even though process variations are expected to improve, fabrication can never reach the precision required by a micro-ring passband, which is as small as 0.1 nm. Thermal variations, on the other hand, will persist in future systems, as ITRS predicts [14]. This problem has gained the attention of device and system designers, and several solutions have been proposed to overcome variations. However, these solutions come at a cost and do not scale. The novelty of our work is that we address this problem through a reliability-aware design flow, not just a solution. While designing our system we account for three metrics: power, performance, and reliability. We propose abstracting the system into two layers: the device layer and the network layer. The device layer models individual devices, while the network layer models the whole network. These abstraction layers enable a two-phase design flow. The first phase is an analysis step where we understand the capabilities and limitations of our devices and networks in light of variations and other system constraints such as power and performance.
Based on this analysis we design a reliability management solution. In the second phase, we perform the detailed design of the devices and network that provide a basis for our reliability management solution. We demonstrate our flow through multiple examples and design cases to show in practice how to design a reliable system.

My contributions in this thesis can be summarized as follows at each design level:

Device level:
- The suggestion of stacked racetracks for switching in on-chip networks [4].
- The design space tradeoffs of stacked racetracks in terms of power, area, and bandwidth [4].
- The optimization of adiabatic couplers for 3 dB coupling in on-chip networks through simulations [5].
- The device-level tradeoffs of different silicon photonic devices in terms of power, performance, and reliability based on the T-matrix model [6].

Architecture level:
- The proposal of the Pareto space of design for antenna-based on-chip networks in terms of power, latency, and throughput [8].
- The architecture-level reliability design factors for on-chip silicon photonic communication, in collaboration with Zheng Li [7].
- The design of a one-to-many broadcast network based on adiabatic couplers that can be used for clock or power distribution [5].
- The physical design of Iris, with Zheng Li [8].

System level:
- A routing algorithm at the kernel level that is reliability-aware, as part of a reliability management solution [3].
- The design of our first reliability management solution based on channel hopping and voltage tuning, in collaboration with Zheng Li [3].
- A reliability-aware design flow for silicon photonic on-chip networks that addresses reliability at the device, architecture, and system levels [9].
- A reliability management solution based on the design flow suggested above that is superior to earlier techniques in terms of power and performance. Moreover, it is scalable to thousand-core systems [9].

Out of this thesis, the most exciting contribution is defining the design space at the device and architecture level for system-level design. Understanding the design techniques, their limitations, and their capabilities will enable any designer to optimally design a system for any application.

The next chapter, Chapter 2, discusses the device-level design of silicon photonic on-chip networks. Chapter 3 discusses the design of our antenna-based broadcast on-chip network, Iris. After discussing network design we turn our attention to reliability-aware design. Chapter 4 discusses the challenges of variations in our system with an assessment of how much variation to expect. Chapter 5 discusses the reliability-aware design flow. Chapter 6 discusses two reliability management solutions. Finally, Chapter 7 presents our future work and Chapter 8 concludes.

Chapter 2

Silicon photonic devices for on-chip networks

In this chapter we discuss the basic components and devices that enable building an on-chip optical network and, more importantly, enable WDM in a power-efficient manner. We characterize our devices in terms of their basic characteristics and design metrics to give the reader a sense of their capabilities and limitations. Next, we discuss in detail two structures that are contributions of this thesis: (1) a stacked racetrack for switching and (2) an adiabatic coupler for 3 dB coupling.¹

¹ The content of this thesis is copied from publications that I have published or that are under review [3, 4, 5, 6, 7, 8, 9, 10].

2.1 Background

Silicon photonics has opened the door for optical technology to function as an on-chip communication technology. The components in an on-chip nanophotonic interconnect mimic those found in wavelength division multiplexed (WDM) optical telecommunication systems. The link consists of light sources, transmitters, waveguides with attendant routing, and receivers. Except for the light source, all of the components necessary for complete WDM on-chip interconnection networks have been demonstrated [37, 38]. Optical power is coupled from an off-chip light source via a grating or inverse taper [37] to an on-chip waveguide. When the source is broadband, the transmitter [39, 21] demultiplexes the light into wavelength channels using bandpass filters and then modulates each of the channels with a digital data stream generated by the electronic processing/storage component to be interconnected. Photonic signals are then routed to various on-chip destinations via on-chip waveguides, switches, and hubs (that re-multiplex broadcast signals). Once the optical signals arrive at their destinations, the receiver demultiplexes the wavelengths and directs the channels to detectors. Photonic signals are converted into electrical current via Ge-doped photodiodes and are then amplified by CMOS circuitry [21].

Figure 2.1: Our optical measurement set-up including the chip, fibers, and controllers. Also shown are different micro-ring configurations.

We have designed and fabricated silicon photonic devices in the ePIXfab commercial foundry [40], and then post-processed and measured them in our laboratories. These devices are used as examples to illustrate one possible (complete) set of components for an optical on-chip network. In this complete set, a grating coupler is used to couple broadband light from off-chip to on-chip. Ring-resonator-based filters [41] and racetrack resonators [4], shown in Figure 2.1, serve as passive filters and demultiplexers and, when doped and fitted with electrodes, as modulators and switches. Rings and racetracks are especially compact and use the strong guidance of silicon photonics to the best advantage. The main difference between them is the width of the WDM channel they provide, which gives the designer a wide range of components for network design.

2.2 Devices for on-chip nanophotonic networks

In this section we survey the basic devices needed to build the components of a nanophotonic on-chip network. More specifically, we discuss waveguides, which are the signal conduit, and passive filters realized through micro-rings, racetracks, and directional couplers. In addition, we discuss modulators and switches enabled through two types of modulation techniques: (1) doping and (2) electro-optic polymer. Finally, we discuss the photodetector, which converts the optical signal to current. Next, we describe each basic device in more detail and characterize it in terms of power and performance:

Light source: To date, all components of the system have been integrated and miniaturized into nano-scale devices except for the source. There are numerous types of sources; in this thesis we focus on (1) broadband sources such as the SLED that we use in our experimental setup and (2) narrow-band sources such as lasers. The operating wavelength we focus on in this thesis is 1550 nm, and later we show applications at 650 nm. However, other operating wavelengths are possible. In fact, HP and Intel are working at different wavelengths.

Waveguide: There are two basic kinds of waveguides: ridge waveguides and slot waveguides. In our design we opt for ridge waveguides due to their low loss. The loss we measured for waveguides fabricated at ePIXfab is 2.3 dB/cm [40]. In addition, the dimensions are designed for low loss and low sensitivity to variations. Our ridge waveguide is nm.

Passive filter: The support of WDM through filters is an important feature of nanophotonics. There are three basic devices we have fabricated for WDM: (1) micro-rings, (2) racetracks, and (3) directional couplers. Micro-rings have a passband as narrow as 0.1 nm [42]; racetracks are wider and can reach 0.35 nm [43]; and directional couplers have a sinusoidal spectrum whose full width at half maximum can reach 3.5 nm, according to one of our fabricated and measured devices. The basic characteristics of filters are full width at half maximum, free spectral range, and extinction ratio.
These can be controlled through the following design parameters: (1) micro-rings have radius, gap, and waveguide size; (2) racetracks have a coupling length in addition to the micro-ring design parameters; and (3) directional couplers have waveguide size, gap, and coupling length.

Modulators and switches: Modulators and switches are based on shifting the resonant

wavelength. Passive filters can be converted into modulators and switches through two basic techniques: (1) doping and (2) polymer. Doping can change the refractive index of the waveguide through carrier injection [44] or depletion [45]. Polymer-based modulators use slot structures, whether in micro-rings [46], racetracks, or Mach-Zehnder interferometers [47]. The important characteristics of a modulator or switch are the voltage, the static and dynamic power for operation, and the area. These characteristics depend on the design parameters discussed for the passive filter devices and, in addition, on dopant concentration, polymer electro-optic coefficient, and power concentration in the polymer.

Photodetector: This is the device responsible for converting the optical signal into an electrical signal. The device is realized by doping the waveguide with germanium [48, 49, 50]. The main design parameter is the length, where greater length leads to more absorption of light. The photodetector characteristics of concern include shot noise, dark current noise, responsivity, bandwidth, switching speed, and bit-error-rate.

2.3 Stacked Racetracks for switching

In this section we present one of our contributions, which is using stacked racetracks for switching. In this application, the broadening effect can be leveraged to switch multiple channels together with a single device. Next, we discuss the design and tradeoffs in more detail. Ring resonators [51] can be used as compact filters. The most compact micro-rings have a sharp resonance (minimal spectral width) but a large free spectral range (FSR). In order to repeatedly slice a broadband source into channels, a channel width of a few nm with a controllable finesse (= FSR/spectral width, ranging from 2 to the total number of channels) is desirable.

Motivation

Racetrack resonators [52] and stacked ring resonators [53] have been proposed to broaden the spectral width and reduce the free spectral range.
Racetrack resonators provide longer coupling lengths than rings, allowing for broader spectral responses with manageable fabrication tolerances [54]. Stacking resonators provides a mechanism for broadening and sharpening the spectral response [55, 56]. In this section, we discuss the stacking of racetrack resonators. We pay particular attention to the tradeoffs between spectral width and area. Our fabricated devices are measured to have a 2-5 nm spectral width with a 17 nm FSR.

Analytical model

Designing a stacked racetrack resonator with a sufficient number of resonances in the available near infrared (NIR) band and sufficient bandwidth for each requires calculation of the free spectral range (FSR) and knowledge of the spectral width of each resonance. The FSR depends on the resonance wavelengths of the racetrack, which occur where the optical path length equals an integer multiple of the wavelength [57]. For a racetrack with radius $R$ and straight section of length $L$, the difference between adjacent resonance wavelengths, $\Delta\lambda$ or FSR, is:

$$\Delta\lambda = \frac{\lambda_m \lambda_{m+1}}{(\pi R + L)\, n_{mode}} \approx \frac{\lambda_m^2}{(\pi R + L)\, n_{mode}} \tag{2.1}$$

where $n_{mode}$ is the mode index of the light propagating in the waveguide. Using stacked racetrack resonators between the input and output tangential waveguides can broaden and sharpen the spectral response. While each racetrack resonator functions as a directional coupler with a broad spectral response, stacking resonators yields a filter with a sharp response [55]. This effect has been shown for both ring [17] and racetrack configurations [56, 58]. While narrowing the gap between circular ring resonators and the tangential waveguides can broaden the resonant spectral width, the desired widths require gaps smaller than or too close to fabrication tolerances. Using racetrack-shaped resonators increases the coupling distance such that the same spectral width can be achieved with a larger gap that is easier to fabricate [54].
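As a rough numerical illustration of Eq. (2.1), the sketch below evaluates the FSR for one racetrack geometry. The radius, straight-section length, and mode index used here are hypothetical values chosen only to land in the tens-of-nanometres range discussed in this section; they are not the dimensions of the fabricated devices. The path term is taken as $(\pi R + L)$, exactly as it appears in Eq. (2.1).

```python
import math


def racetrack_fsr_nm(wavelength_nm, n_mode, radius_um, straight_um):
    """Free spectral range in nm, following the form of Eq. (2.1)."""
    # Path term (pi*R + L) as written in Eq. (2.1), converted from um to nm.
    path_nm = (math.pi * radius_um + straight_um) * 1e3
    return wavelength_nm ** 2 / (path_nm * n_mode)


# Hypothetical geometry and mode index, for illustration only.
fsr = racetrack_fsr_nm(wavelength_nm=1550.0, n_mode=4.0,
                       radius_um=5.0, straight_um=10.0)
print(f"FSR = {fsr:.2f} nm")
```

Because the FSR is inversely proportional to the path term, lengthening the straight section to widen the passband also shrinks the FSR, which is the tradeoff the analytical model is meant to expose.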

Numerical Simulation

Figure 2.2: Simulated results. (a) Two-racetrack transmission using COMSOL, where two peaks broaden the whole response (transmission in dB versus wavelength in µm). (b) Three-racetrack transmission using COMSOL, where the three peaks further increase the bandwidth (transmission in dB versus wavelength in µm). (c) Bandwidth and FSR trends of stacked racetracks (bandwidth and free spectral range in nm versus number of racetracks). (d) Area and power trends of stacked racetracks (area in µm² and insertion loss in dB versus number of racetracks).

In this section, we discuss the simulation of our design. We have used the measured dimensions of the fabricated devices in the simulations. Our design goal is a filter with minimal area, a wide passband (spectral width), and a controllably low finesse. We numerically explore the design space to determine the optimal number of stacked racetracks for a single device as well as the number of devices necessary to cover the whole band of interest. We have designed stacked racetrack devices with one to five racetracks to compare their characteristics. All five designs have resonant wavelengths within a 3-4 nm range of one another. As for the free spectral range, all designs have an FSR of approximately 20 nm. The quality factors (Q) (center wavelength divided by spectral width) vary from 736 to 295. The one-racetrack design exhibits a Q of 736, the two-racetrack design a Q of 370, and the three-, four-, and five-racetrack designs a Q of 295.
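The quoted quality factors can be converted back into spectral widths and finesse values with a few lines of arithmetic. The sketch below assumes the usual definition Q = center wavelength / spectral width, a center wavelength of 1550 nm (the operating wavelength of this thesis), and an FSR of 20 nm; the function names are illustrative only.

```python
def spectral_width_nm(center_nm, q_factor):
    """Spectral width implied by Q = center wavelength / spectral width."""
    return center_nm / q_factor


def finesse(fsr_nm, width_nm):
    """Finesse as defined in Section 2.3: FSR / spectral width."""
    return fsr_nm / width_nm


CENTER_NM = 1550.0  # assumed center wavelength
FSR_NM = 20.0       # approximate simulated FSR

# Q values quoted for the one-, two-, and three-racetrack designs.
results = {}
for racetracks, q in [(1, 736), (2, 370), (3, 295)]:
    width = spectral_width_nm(CENTER_NM, q)
    results[racetracks] = (width, finesse(FSR_NM, width))
    print(f"{racetracks} racetrack(s): width = {width:.2f} nm, "
          f"finesse = {results[racetracks][1]:.2f}")
```

Under these assumptions the widths come out at roughly 2.1 nm, 4.2 nm, and 5.3 nm, which is consistent with the 2-5 nm range measured on the fabricated devices.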

Figure 2.2(a) depicts the transmission for a two-racetrack device, while Figure 2.2(b) is for a three-racetrack device. The bandwidths of the devices are 4 nm and 5 nm respectively, which cover 20-25% of the total band. On the other hand, at resonance the transmission is not flat, and we can see peaks and dips in both devices. Figures 2.2(c) and 2.2(d) present the trends of the different designs as a function of the number of stacked racetracks. In Figure 2.2(c), the bandwidth of the switch rises until it reaches 5 nm at three stacked racetracks and then saturates. Hence, for our design space exploration we should consider only up to three racetracks. On the other hand, the free spectral range does not change, so it will not affect our design. In Figure 2.2(d), the area increases linearly, which makes it an important factor in our design decisions. Meanwhile, the power is almost constant, which makes it less important. From the above trends, area appears to be the determining factor in our design. For a two-racetrack device, we need 5 devices, a total of µm², to cover the whole band. For a three-racetrack device, on the other hand, we need 4 devices, a total of 1, µm². This makes the two-racetrack device more attractive for our switch design. The numerical simulation shows that a two-racetrack device best suits a switch design, although the differences between the two- and three-racetrack devices are quite small. Having a broadband device with comparatively low area allows us to build a wide-band switch with minimal area compared to other micro-ring-based approaches [51].

2.4 Adiabatic couplers

In this section we focus on one of the devices that we have optimized, the adiabatic coupler. This coupler can be used as a 3 dB coupler, which has several applications in on-chip nanophotonic networks. Next, we discuss the importance and applications of adiabatic couplers, and then the design and characteristics that we have been able to demonstrate through simulations.

Motivation

Broadcast is a crucial operation in the interconnection of processing elements. Passive broadcast requires that the signal power be evenly distributed to the end destinations. The requirements are stringent. The optical power budget is limited by the receiver sensitivity. Assuming identical receivers, the broadcast power budget is determined by the poorest transmission path. Uneven distribution in a multi-stage splitter network increases the optical source power required to preserve a given bit-error-rate (BER). The optical source power in SOI is limited by the non-linearity of on-chip waveguides. The power cannot be increased past a given point, which is determined by the balance of the broadcast paths. In addition to the balance limit, power splitters must also meet the on-chip area constraints and be compatible with mass-production CMOS fabrication processes. Power splitting has been implemented in a multitude of different ways, for example:

Antenna array: An on-chip optical antenna array [59] provides power-efficient broadcast for a moderate number of nodes. Unfortunately, it is difficult to add additional waveguides, which limits the total bandwidth.

Y splitters: 1-to-2 and 1-to-N Y-shaped power splitters can be used for broadcast and multicast [60]. However, Y-splitters are sensitive to fabrication inaccuracy. Their precise geometry challenges photolithography, which is the only exposure technique compatible with mass-production CMOS fabrication.

Trench-based splitters: By designing a ridge waveguide that splits the incoming optical power into transmitting and reflecting outputs, a 1-to-2 trench splitter can be formed [61, 62]. However, such designs suffer from poor transmission efficiency (80%).

Adiabatic couplers: Mode coupling is an inherently wavelength-dependent process. Adiabatic couplers exhibit no mode coupling at all, which dictates that 3 dB adiabatic couplers have broad operating bandwidths.
Broad bandwidth goes hand in hand with resistance to process and thermal variations. Without the conversion to radiation modes that mode coupling would

entail, the adiabatic coupler is not subject to any significant excess loss.

Analytical model

An adiabatic coupler operates in a single-mode system. System and waveguide modes are indistinguishable when the waveguides are of sufficiently different size. Waveguides of greatly different sizes may be brought into proximity without exchange of power if they are sufficiently different and brought together sufficiently slowly. The waveguides cannot remain one large and one small if the light is to be transferred to a different output guide. To be transferred equally to each arm, the structure must be symmetric. Two parallel guides will exhibit minimal perturbation to each other if they are changing in opposite senses, that is, the larger one decreasing and the smaller one increasing. The challenging part of adiabatic coupler design is determining the coupling length over which the guides are brought to the same size in a parallel region. The adiabatic coupler system is governed by a set of coupled local normal mode equations. These equations describe how power is distributed through the system [63].

$$\frac{dA_i}{dz} = C_{ij}\,\frac{dp}{dz}\,A_j + j\beta_i A_i \tag{2.2}$$

$$\frac{dA_j}{dz} = C_{ij}\,\frac{dp}{dz}\,A_i + j\beta_j A_j \tag{2.3}$$

where $A_i$, $A_j$ and $\beta_i$, $\beta_j$ are the amplitudes and propagation constants of the fundamental and next-highest-order system modes respectively, and $C_{ij}$ is the coupling coefficient between them. A general solution to the above equations for adiabatic coupling structures is shown in [63], and from this solution the optimal coupling length can be computed analytically for a given set of input waveguides:

$$L_{3dB} = \frac{1}{\epsilon}\,\frac{X_{IIo}}{\kappa_{II}}\,(1 + \alpha_{opt})^{3/2} \tag{2.4}$$

where $L_{3dB}$ is the optimal length of the coupler, $\epsilon$ is the error tolerance, $X_{IIo}$ is the asynchronicity parameter between the waveguides at the start of the coupling region, $\kappa_{II}$ is the coupling coefficient between the two waveguides, and $\alpha_{opt}$ is the optimum length ratio between the input and coupling region lengths.

Numerical simulation

Figure 2.3: Simulation results. (a) Transmission of the adiabatic coupler using COMSOL simulations under ideal conditions (output power at output ports 1 and 2, in percent, versus wavelength in nm). (b) Process variation analysis of the adiabatic coupler under 10% variations (output power at output ports 1 and 2, in percent, versus wavelength in nm).

The adiabatic coupler we designed is composed of three regions. In the first region, the waveguides start with a width of 450 nm and are spaced 1 µm apart. The two waveguides evolve into 300 nm and 600 nm widths. Next, the gap tapers down linearly to 200 nm over a 50 µm distance. In the second region, the gap between the two waveguides is constant at 200 nm while the two waveguides evolve adiabatically back to 450 nm widths. In the third region, the two waveguides quickly separate to a gap of 6 µm. Using FEM simulations, we study the different characteristics of the device. First, within the C+L band, the power splitting is 48-52%, as shown in Figure 2.3(a). This wide band offers a plethora of wavelength-division multiplexing channels. Meanwhile, the insertion loss is calculated to be 0.14 dB, since the structure is adiabatic and has negligible loss. Altogether, the length of the structure is 300 µm with a coupling length of 200 µm. We have also conducted a process variation analysis for the design. Based on our previous fabrication efforts, we have computed the worst-case process variation and simulated the design. The simulation results, shown in Figure 2.3(b), indicate a degradation in splitting to 45-55%. This splitting is still acceptable for power-splitting purposes.

2.5 Summary

In this chapter we have discussed the basic devices necessary for building an optical on-chip network. The system closely resembles communication in fiber optics, which provides low

power and high performance through WDM that electrical interconnects cannot offer. Based on the basic devices, we have presented two basic functionalities: power splitting using adiabatic couplers and optical switching using stacked racetracks. The next step is to show how to assemble these devices into a network. Unlike other communication systems, on-chip networks have distinctive characteristics that require novel design approaches. In the next chapter we present our broadcast-based designs, with a focus on Iris, an all-optical solution.
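To connect the splitting figures from Section 2.4 to the broadcast power budget, the sketch below estimates how much extra source power a splitter tree needs when every stage is slightly unbalanced. It assumes a binary tree of identical 1-to-2 couplers and a worst-case path that always takes the weaker output; the 64-output fan-out is a hypothetical choice, while the 48% and 45% weak-port fractions and the 0.14 dB insertion loss are the simulated values quoted above.

```python
import math


def broadcast_penalty_db(n_outputs, weak_split, insertion_loss_db):
    """Extra source power (dB) versus an ideal lossless 50/50 splitter tree.

    Assumes a binary tree in which the worst path takes the weaker
    output port (fraction `weak_split`) at every stage.
    """
    if n_outputs < 2 or n_outputs & (n_outputs - 1):
        raise ValueError("n_outputs must be a power of two and at least 2")
    stages = n_outputs.bit_length() - 1  # log2(n_outputs)
    per_stage_db = 10 * math.log10(0.5 / weak_split) + insertion_loss_db
    return stages * per_stage_db


# Hypothetical 64-way broadcast using the simulated coupler figures.
ideal = broadcast_penalty_db(64, weak_split=0.48, insertion_loss_db=0.14)
varied = broadcast_penalty_db(64, weak_split=0.45, insertion_loss_db=0.14)
print(f"ideal fabrication:  {ideal:.2f} dB")
print(f"process variations: {varied:.2f} dB")
```

Under these assumptions the penalty grows linearly with the number of stages, which is why a small per-coupler imbalance matters for large fan-outs.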

Chapter 3

On-chip Network Design

In this chapter we discuss on-chip network design leveraging the components we have studied so far. In the design process we follow a top-down approach in which we study the workload characteristics to motivate our design decisions. Earlier studies show that on-chip traffic is heterogeneous in nature and that a single network cannot meet the target performance [64]. On one hand, broadcast packets sent over point-to-point networks deteriorate the performance of the unicast network, which suggests a dedicated broadcast medium. On the other hand, unicast packets need a high-throughput point-to-point network. The conflicting requirements of the different traffic components suggest a hybrid network consisting of a low-latency broadcast network and a high-throughput unicast network. Zheng et al. designed multiple broadcast-based optical networks [65, 10, 8, 66]; however, the tradeoffs between these designs have not been identified. In this chapter we study one of the broadcast-based designs, Iris, a hybrid all-optical network, at both the architecture and physical levels, and compare it to the other broadcast-based architectures proposed by Zheng et al. [65, 10, 8, 66]. Then the Pareto space of antenna-based broadcast designs is presented to show the tradeoffs. Next, we discuss a one-to-many broadcast topology based on the adiabatic coupler discussed in Chapter 2. This network, unlike earlier broadcast networks, has high efficiency, enabling linear power division. Moreover, it can be used for power or clock distribution in the system.

Motivation: Characterizing on-chip traffic

In many-core systems, on-chip traffic is heterogeneous. Part of the traffic is intended to synchronize thread execution, coordinate distributed data placement, and allocate global resources. It consists of short, latency-critical, and often-multicast protocol messages. The other part of the traffic transfers data between cores and usually consists of long, throughput-hungry, and often-unicast data packets. Consider a cache-coherent shared-memory many-core system. To complete one data transaction, triggered by a read or write cache miss, a request protocol message is first sent out to locate valid cache line copies. This message, containing only the memory address and the processor ID, is usually short. Once the data is located, cache lines containing large data packets are transferred back to the requester. To locate cache copies quickly, broadcasting the request protocol message is preferred, as is found in various protocols, e.g., Snoopy and Token Coherence. Otherwise, directory indirection is incurred, which slows the whole transaction. The two classes of traffic have different characteristics. Sending short protocol messages to multiple destinations in a unicast network worsens the overall protocol message delivery latency, which is determined by the slowest path. A recent study shows that even if multicast messages make up only a small portion of on-chip traffic, the overall network latency and throughput can degrade substantially, negating the point-to-point efficiency and performance advantages [64]. It is crucial to provide a broadcast/multicast interconnect fabric to minimize, or share, multiple path setup latencies and reduce the broadcast-induced bursty contention. On the other hand, for transferring throughput-hungry data messages, which are typically hundreds of bytes long, serialization and contention dominate the overall latency.
To minimize their latencies, communication links need to be shared effectively so as to sustain a scalable high throughput.

Characterizing electrical interconnect

The lack of power and latency scalability limits the potential of electrical interconnect to satisfy global on-chip communication requirements, especially for the often-multicast, latency-critical protocol messages. Fundamentally, neither the propagation delay nor the power consumption of global electrical wires scales with technology advances. The propagation delay of metal wires is quadratic in the propagation distance $L$ and is proportional to the electrical RC time constant. The energy required to send information over a wire is determined by the wire capacitance and swing voltage as $C V_{DD}^2 L$. Numerically, [67] predicts that, from 65 nm to 32 nm technology, the signal propagation delay in a 1 mm minimal-pitch global copper wire will increase from 227 ps to 1129 ps, and the power index range (dynamic power per unit frequency per unit area of one metal layer) for global wires increases from to W/GHz-cm². Routers or switches, which are used to share the segmented wiring resources, introduce excessive and non-deterministic latency for arbitration, switching, and contention. These operations require significant power overhead. Consider the recently developed 45 nm Intel Single-Chip Cloud Computer [68]. The router in this 4-by-6 mesh network includes four pipeline stages. To traverse the network without contention, a packet would take at most 10 hops, where each hop is 5 clock cycles. Each router consumes 500 mW, contributing 10% of the tile power consumption. Multicast protocol traffic can further exacerbate this situation. As the number of on-chip processor cores increases, worst-case latency and power consumption suffer as the average number of hops per packet increases. As the technology scales, the increasing leakage power, especially in router buffers, worsens this situation.

Characterizing photonic interconnect

In principle, nanophotonics can offer performance and power improvement over the electrical counterparts [69].
However, on-chip nanophotonic interconnect exhibits unique characteristics.

Electrical network designs with photonic wires cannot unleash the potential of photonic interconnect. In this section, models of the nanophotonic interconnect demonstrate a unique method for handling heterogeneous traffic. As part of our efforts in characterizing and modeling nanophotonic devices, we have fabricated three batches of devices at ePIXfab [40] and measured them in our labs. Specifically, we have fabricated devices using 193 nm deep UV lithography on 200 mm silicon-on-insulator (SOI) wafers with a 220 nm top Si film and 2000 nm buried oxide. Where noted, we also use recently published results of other groups that have employed similar SOI process technology. The analysis below is guided by our results from the physical study of these devices.

Performance

Nanophotonic interconnect offers low-latency data propagation and ultra-wide bandwidth by leveraging wavelength-division multiplexing (WDM) and time-division multiplexing (TDM). Therefore, the latency of a packet is significantly reduced compared to the electrical solution. The main overhead is due to the serialization latency of transferring large packets over multiple clock cycles, the circuit path setup latency, and the electrical-to-optical and optical-to-electrical conversion latency at the transceiver ends. These three sources of overhead may be reduced by innovation in network design.

Power

The power consumption of a nanophotonic on-chip network consists of both the optical power consumed in propagation and the electrical power consumed in the electrical control circuitry. Off-chip light sources cannot presently be controlled dynamically by on-chip traffic. Optical power is therefore usually statically allocated according to a worst-case aggregated loss budget.
Total optical power is given by:

$$P_{static} = P_{optical} = \frac{P_{detect}}{\prod_i \left(1 - \eta^{i}_{loss}\right)} \tag{3.1}$$

where $\eta^{i}_{loss}$ is the proportional power loss at the $i$th component along the signal propagation path, and $P_{detect}$ is the required optical power for arrival at the photodetector, i.e., the sensitivity of

the photodetector. If the optical power needs to be delivered to multiple locations sequentially, the increasing number of $\eta^{i}_{loss}$ terms inevitably leads to significant growth of $P_{optical}$. Numerically, the worst-case optical loss budget analysis has to account for various components, including: the grating coupler loss, which may reach 1.6 dB [70]; straight waveguide loss, which is 1.34 dB/cm [71]; waveguide bend loss, which is approximately dB for a bending radius of 1 µm [72]; waveguide crossing loss, which may reach dB/crossing [73]; micro-ring-based filters, which have a loss of 1-2 dB depending on the dimension [74]; and micro-ring-based modulators, which have a loss of 2 dB and an extinction ratio of 9 dB [39]. Finally, the sensitivity of current photodetectors, $P_{detect}$, is less than 1 µW based on the analysis of [69] and photodetector designs demonstrated by [75] (where sensitivity is defined by the electrical signal-to-noise ratio required in order to achieve a bit-error-rate (BER) of in a typical threshold detection digital repeater). Another important component of the nanophotonic power is the electrical power at the transceiver necessary to convert from optical to electrical and back, together with the electrical power necessary to drive photonic switches. This power is given by:

$$P_{transfer} = P_{transmitter} + P_{receiver} + P_{switch} \tag{3.2}$$

where $P_{transmitter}$ is the power to modulate the electrical signal onto the laser carrier at the transmitter, $P_{receiver}$ is the power to convert the optical signal back into an electrical signal at the receiver, and $P_{switch}$ is the power necessary to route the photonic signal to its destination. The power for each component includes the electrical dissipation and the power expended tuning devices to the proper passband.
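The loss budget of Eq. (3.1) can be evaluated directly once each component loss is expressed as a proportional loss. The sketch below is a minimal illustration, not the budget of any specific design in this thesis: it assumes a single path containing one grating coupler, 1 cm of straight waveguide (an assumed length), one filter at the upper end of its quoted range, and one modulator, and it takes the photodetector sensitivity as 1 µW, the upper bound quoted above.

```python
import math


def loss_fraction(loss_db):
    """Convert a loss in dB to the proportional power loss eta in Eq. (3.1)."""
    return 1.0 - 10.0 ** (-loss_db / 10.0)


def required_optical_power_w(p_detect_w, loss_fractions):
    """Eq. (3.1): P_optical = P_detect / prod_i (1 - eta_i)."""
    transmitted = math.prod(1.0 - eta for eta in loss_fractions)
    return p_detect_w / transmitted


# Component losses in dB taken from the figures quoted in the text.
# The 1 cm waveguide length and the choice of components are assumptions.
losses_db = [
    1.6,         # grating coupler
    1.34 * 1.0,  # straight waveguide, 1.34 dB/cm over an assumed 1 cm
    2.0,         # micro-ring filter (upper end of the 1-2 dB range)
    2.0,         # micro-ring modulator
]
p_optical = required_optical_power_w(
    p_detect_w=1e-6,
    loss_fractions=[loss_fraction(db) for db in losses_db],
)
print(f"required optical power per channel: {p_optical * 1e6:.2f} uW")
```

Because the product in Eq. (3.1) is equivalent to summing the losses in dB, every additional component on the path multiplies the required source power, which is why sequential delivery to many destinations is expensive.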
Transmitters and switches are usually designed with ring resonators, whose operating energy has been demonstrated to be 2 pJ/bit [76], while the per-bit energy consumption of receivers is on the order of 690 fJ/bit [21].

Functionality

Photonic on-chip interconnect will remain bufferless for the foreseeable future. The circuit-switched nature of nanophotonic interconnect directly affects the performance and power characteristics of


on-chip communication. To configure the nanophotonic network, past work uses either a packet-switched electrical network [77], tokens that circle around all cores [29], or address-based reservation mechanisms [78]. These methods significantly increase arbitration time and counteract the performance benefit of nanophotonic interconnect. On the other hand, point-to-point communication [79, 25] alleviates arbitration overhead; however, the static allocation of bandwidth limits system performance. Other designs that leverage intermediate electro-optic conversion, such as the work of Joshi et al. [80], are particularly unattractive for short and often-multicast protocol messages. In all the previous designs, the broadcast delay is determined by the worst-case propagation path, which is significant. These unique challenges will be analyzed and addressed in the following sections.

3.2 Iris: Antenna-based design

This section presents the design of Iris, the proposed hybrid nanophotonic network. The goal is to develop a high-performance and low-power on-chip network that optimally parses the heterogeneous on-chip communication workload by taking into account the underlying physical limitations of the interconnect medium and devices. Iris is a hybrid of two subnetworks: a low-latency WDM broadcast/multicast network based on 2D antenna array structures, and a throughput-optimized circuit-switched network based on optical channel waveguides.

Figure 3.1: Schematic of Iris showing both networks used in the design. (a) Top-down conceptual diagram of message transmissions in Iris (labels: switch, transmitter, waveguide, antenna feed, antenna, receiver). (b) Layout of the broadcast network (antenna and waveguides to the antenna rods).

The broadcast and circuit-switched subnetworks operate in tandem to provide low-latency, high-throughput communication at low power levels. The interaction is straightforward. Short multicast protocol messages are transmitted via WDM on an antenna array. Since the network is inherently broadcast-based, multicast requires no additional effort. In addition, global resources can be efficiently managed in this network. For larger, throughput-sensitive data messages, the throughput-optimized circuit-switched subnetwork is used, as it can deliver a higher total cut-through bandwidth. To allocate a channel, we use the broadcast network to reserve the appropriate wavelengths and physical waveguides and set up paths with very low latency. The details, as shown in Figure 3.1, are described in the following sections.

Iris exhibits three distinct advantages over competing networks: (1) Nanophotonic interconnect provides low-latency propagation with bandwidth-to-power ratio advantages. (2) The hybrid interconnect design matches the heterogeneous requirements of on-chip communication. (3) The novel broadcast network liberates the on-chip network from delivering short multicast messages, which require expensive hop-by-hop arbitration for multiple communication resources.

Network Architecture Design

This section describes the architecture-level design of Iris's hybrid broadcast/circuit-switched network. In Iris, the broadcast subnetwork handles short, multicast-like coherence protocol messages and coordinates the global interconnect resources, while the circuit-switched subnetwork transfers large, throughput-hungry data packets.

Low-latency broadcast subnetwork

The broadcast subnetwork collects signals from all on-chip cores into a small central antenna array, where signals are broadcast and distributed to all cores as shown in Figure 3.1(b). The WDM channels in this subnetwork are divided into several channel groups. Each channel group is able to deliver all bits in one short protocol message simultaneously. The broadcast subnetwork provides full connectivity and serves two kinds of traffic: (1) Short, latency-critical, often-multicast messages, which are used by coherence protocols and synchronization mechanisms and are on the critical path of system performance. The broadcast nature of the subnetwork provides a global order for events and thus simplifies system-level coherence and synchronization. (2) Global resource arbitration messages for both broadcast and circuit-switched network path setup. Because the broadcast network has global reach with low latency, it enables an efficient resource allocation mechanism.

The key logical design consideration for the broadcast network is how to perform arbitration, i.e. determining the sequence of concurrent protocol message requests. To do this, we leverage the broadcast nature of the subnetwork and WDM. As we will see, the net result is a distributed global arbitration scheme that provides high broadcast throughput as well as low arbitration latency.

Figure 3.2(a) illustrates the arbitration process. W nodes can leverage W wavelengths in the broadcast subnetwork to arbitrate for one shared resource. One broadcast arbitration operation is followed by multiple broadcast protocol messages. The arbitration scheme starts with each node being assigned a unique dynamic priority. Each node broadcasts a signal "1" on the wavelength corresponding to its dynamic priority. For example, as shown in Figure 3.2(a), in an eight-node network, three nodes, A, C and D, want to access a shared communication resource. Node A can broadcast on W1 (denoted [00000010]), Node C can broadcast on W6 ([01000000]), and Node D can broadcast on W2 ([00000100]). As the node sends the one-bit message, it also listens for the combined messages broadcast across all wavelengths. If multiple nodes try to access

Figure 3.2: Arbitration and arbiter's finite state machine (FSM). (a) Single-cycle arbitration and arbitration protocol in the broadcast network. (b) Finite state machine in the broadcast network. (c) Finite state machine in the unicast network.

the same resource, they will see the same bit vector, reflecting the dynamic priorities of all the requesting nodes. For example, if Nodes A, C, and D arbitrate for the same set of wavelengths, they will all see a combined priority vector of [01000110], the bit-wise OR of the dynamic priorities of Nodes A, C, and D. Then, each node participating in the arbitration gains access to the broadcast network in a sequence corresponding to the requesters' dynamic priorities. The number of protocol messages equals the number of requesters in the arbitration. The node with the highest priority will be able to broadcast its multi-destination protocol message in the next broadcast time slot after a quick check with OR gates. Other nodes will have to keep checking until their turn arrives. After each arbitration, each node changes its dynamic priority using a deterministic pseudo-random number generator. Since all nodes share the same random seed, they always agree on the same global priority. Figure 3.2(b) illustrates the state transitions in broadcasting protocol messages.

To arbitrate for K resources, i.e., multiple broadcast channel groups, W nodes can still leverage W wavelengths in the broadcast network. The only difference is that each node is pre-assigned a resource id for arbitration. The arbitration scheme then becomes K instances of (W/K):1 arbitration running in parallel. This design sacrifices some arbitration efficiency for reduced arbitration latency and simplified arbitration logic.

Throughput-optimized circuit-switched subnetwork

From an architecture point of view, the throughput-optimized subnetwork can be one of three options: (1) a circuit-switched mesh network using photonic channel waveguides [8], (2) a circuit-switched mesh network using electrical interconnects and bufferless routers [65], and (3) a buffered electrical interconnect [66]. In this section we focus on option one, the circuit-switched optical network; the other designs have been discussed elsewhere [66, 65].
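The single-resource WDM arbitration described earlier in this section can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the function name `arbitrate` is hypothetical, and the rule that the highest wavelength index is served first is an assumption, since the text only states that access follows the requesters' dynamic priorities.

```python
from typing import Dict, List, Tuple


def arbitrate(requests: Dict[str, int], num_wavelengths: int = 8) -> Tuple[int, List[str]]:
    """Resolve one arbitration round on the broadcast subnetwork.

    `requests` maps each requesting node to its dynamic priority, i.e. the
    index of the wavelength on which it broadcasts a "1".
    """
    if len(set(requests.values())) != len(requests):
        raise ValueError("dynamic priorities must be unique")
    # The shared medium delivers the bitwise OR of all one-hot vectors.
    combined = 0
    for wavelength in requests.values():
        if not 0 <= wavelength < num_wavelengths:
            raise ValueError("priority outside the wavelength range")
        combined |= 1 << wavelength
    # Every node scans the same vector, so all nodes derive the same order.
    # ASSUMPTION: the highest wavelength index is served first.
    owner = {wavelength: node for node, wavelength in requests.items()}
    order = [owner[w] for w in range(num_wavelengths - 1, -1, -1)
             if (combined >> w) & 1]
    return combined, order


# Nodes A, C and D request on W1, W6 and W2, as in the example in the text.
combined, order = arbitrate({"A": 1, "C": 6, "D": 2})
print(format(combined, "08b"), order)  # 01000110 ['C', 'D', 'A']
```

Because every node observes the same combined vector, no central arbiter is needed: each node can compute its own position in the sequence locally.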
In our throughput-optimized network there are horizontal and vertical wires/waveguides between each row and each column of processors in a many-core system, as depicted in Figure 3.9. Dimension-order routing is used for its minimal distance and for simplicity of path setup. The broadcast subnetwork provides a low-latency means of allocating routing paths for long data messages. The above design decisions are based on the following considerations:

- The mesh network organizes the waveguides in a regular pattern and suits the tile-based processor core floorplan.
- Performance-wise, mesh networks with dimension-order routing provide the shortest distances between two on-chip nodes.
- Power-wise, the mesh topology and simplified switch design reduce waveguide length and the number of bends and crossings. With dimension-order routing, the maximum number of turns for each data transfer is one.

The flow control of the circuit-switched network requires a path setup mechanism, which is handled by the broadcast subnetwork. Unlike arbitration for the broadcast network, path setup arbitrates for multiple serially connected optical waveguide segments. Consider an 8-by-8 mesh network with 16 horizontal and 16 vertical waveguides. Each of the waveguides is decomposed into 7 segments. In dimension-order X-Y routing, the path to be set up consists of segments in one horizontal waveguide and segments in one vertical waveguide.

An efficient circuit-switched setup mechanism has been designed. The broadcast network is leveraged for arbitration of unicast network resources. In addition, to reduce congestion, each node keeps track of the occupation of the links (i.e. waveguide segments) in order to avoid requesting preoccupied links in the circuit-switched network. Specifically, during each unicast network arbitration cycle, all the nodes that attempt to set up a path send a signal "1" on the WDM channels of the broadcast network corresponding to their dynamic priorities. To avoid setting up multiple conflicting paths, only the node with the highest priority wins. The winning node notifies the rest of the nodes of the circuit-switched path that has just been reserved by sending out a status update in the next cycle indicating the destination node id.
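As a hedged illustration of which waveguide segments an X-Y path reserves, the sketch below enumerates them for an 8-by-8 mesh. The segment naming `(orientation, waveguide index, segment index)` and the function `xy_path` are hypothetical, and the model uses one waveguide per row and per column for simplicity, whereas the text describes 16 waveguides in each orientation.

```python
from typing import List, Tuple

Node = Tuple[int, int]          # (row, column) of a tile in the mesh
Segment = Tuple[str, int, int]  # (orientation, waveguide index, segment index)

MESH_SIZE = 8  # 8-by-8 mesh, so each waveguide has MESH_SIZE - 1 = 7 segments


def xy_path(src: Node, dst: Node) -> List[Segment]:
    """Segments reserved by X-Y dimension-order routing from src to dst."""
    (src_row, src_col), (dst_row, dst_col) = src, dst
    for value in (src_row, src_col, dst_row, dst_col):
        if not 0 <= value < MESH_SIZE:
            raise ValueError("node outside the mesh")
    # Travel along the source row first (X), then along the destination column (Y).
    # Segment index i is the piece between position i and position i + 1.
    horizontal = [("H", src_row, c)
                  for c in range(min(src_col, dst_col), max(src_col, dst_col))]
    vertical = [("V", dst_col, r)
                for r in range(min(src_row, dst_row), max(src_row, dst_row))]
    return horizontal + vertical


# Corner to corner: 7 horizontal plus 7 vertical segments, with a single turn.
print(len(xy_path((0, 0), (7, 7))))  # 14
```

A path never uses more than one horizontal and one vertical waveguide, which is why at most one turn is needed per transfer.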
Each node is equipped with a local path occupation table that records the occupation status, i.e. occupied or free, and timeout counters of all segments in all the channel waveguides. This

table will be updated in two cases. When any node in the network wins the circuit-switched arbitration and sends out a status update, all table entries corresponding to the segments of the circuit-switched path are set to occupied for a time equal to the serialization time of the data packet being sent plus the time of flight. As time progresses through each network clock cycle, all of the occupied segment table entries decrement their timeout counters to reflect that one flit of the packet has been transmitted through the corresponding segments. On an attempt to access the circuit-switched network, this table is checked to see whether the communication path is free before an arbitration request for circuit-switched setup is sent on the broadcast network. In summary, this mechanism reduces contention in the broadcast network by avoiding arbitration for resources that are currently in use.

When a communicating node tries to send a unicast packet over the circuit-switched network, it first checks the local occupation table; if the path is free, it sends out a unicast arbitration request on the broadcast network, and finally it broadcasts the status update and sends the packets over the unicast network. The finite state machine of this process is shown in Figure 3.2(c).

Summary of broadcast operations

This section summarizes all the messages that need to be transmitted over the broadcast network, and describes the organization of these messages using time-division multiplexing and wavelength-division multiplexing.

Four types of messages are transmitted over the broadcast network, as shown in Figure 3.2(a): protocol messages (labeled "protocol message"), arbitration for broadcast (labeled "broadcast arbitration"), arbitration for the circuit-switched subnetwork (labeled "unicast arbitration"), and status update information (labeled "status update").
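The local path occupation table described earlier in this section can be sketched as a map from segment to remaining timeout. This is an illustrative model under stated assumptions, not the thesis implementation: the class name `OccupationTable`, its method names, and the use of whole clock cycles for serialization time and time of flight are all hypothetical.

```python
from typing import Dict, Hashable, Iterable


class OccupationTable:
    """Per-node record of which waveguide segments are currently occupied."""

    def __init__(self) -> None:
        # Only occupied segments are stored; the value is the remaining cycles.
        self._timeout: Dict[Hashable, int] = {}

    def is_free(self, path: Iterable[Hashable]) -> bool:
        """Check performed before sending a unicast arbitration request."""
        return all(segment not in self._timeout for segment in path)

    def reserve(self, path: Iterable[Hashable],
                serialization_cycles: int, flight_cycles: int) -> None:
        """Applied by every node when a winner broadcasts its status update."""
        duration = serialization_cycles + flight_cycles
        if duration <= 0:
            raise ValueError("occupation time must be positive")
        for segment in path:
            self._timeout[segment] = duration

    def tick(self) -> None:
        """Advance one network clock cycle and release expired segments."""
        for segment in list(self._timeout):
            self._timeout[segment] -= 1
            if self._timeout[segment] == 0:
                del self._timeout[segment]


table = OccupationTable()
path = [("H", 0, 0), ("V", 3, 0)]  # hypothetical segment identifiers
table.reserve(path, serialization_cycles=3, flight_cycles=1)
print(table.is_free(path))  # False
```

Since every node applies the same status updates and the same per-cycle decrement, all local tables stay consistent without any extra synchronization traffic.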
The first two types are broadcast-related, as described in the section on the low-latency broadcast subnetwork. The last two types are unicast-related, as described in the section on the throughput-optimized circuit-switched subnetwork. As depicted in Figure 3.2(a), we design a protocol to organize multiple types of messages

on the broadcast network. The broadcast-related messages and unicast-related messages are interleaved to provide a balance between unicast and broadcast messages. The percentage of time allocated to each of these two types is statically fixed at 50%.

Support for coherence and consistency

Iris addresses the heterogeneous characteristics of on-chip traffic, so cache coherence protocols are efficiently supported by the broadcast network, which delivers latency-critical protocol transaction messages at the speed of light to all on-chip nodes through WDM photonic broadcast channels. Compared to other electrical or photonic interconnect designs, Iris provides the following features for efficient protocol support:

- Direct support of snoopy and other broadcast-based low-latency protocols. Compared to a directory-protocol-based design, which introduces directory indirection, Iris reduces latency in locating outstanding cache copies.
- Global order, which simplifies system-level coherence and synchronization. Global ordering can be enforced by providing each core with a queue that tracks all of the in-flight transactions on the nanophotonic network (since the nanophotonic network is a global broadcast medium, every core sees every transaction that occurs in it). As new in-flight transactions occur in the network, they are enqueued. When simultaneous transactions occur in separate channel groups, the channel group numbers serve as a tie breaker: the transaction broadcast on the lowest channel group is enqueued first, the next highest channel group second, and so on.
- Optimal direct cache-to-cache transfer. When a cache miss occurs, various transaction acknowledgments can occur, and the fastest responder will be served. When an acknowledgment is broadcast in the network, the pending transaction's entry in the queue is marked as acknowledged. If the transaction is at the front of the queue, it is dequeued and the

transaction completes (commits). If an acknowledgment occurs for a transaction that has already been acknowledged by others, it is discarded.

In addition to the classical snoopy protocol, various cache coherence protocols that rely heavily on broadcasting can be supported. Iris benefits Token Coherence by providing an efficient way to broadcast transient requests and a low-latency token transfer mechanism. Protocols in AMD's Coherent HyperTransport and Intel's QuickPath Interconnect make heavy use of broadcast messages, which can also be improved by the proposed design.

Iris: Physical Design

This section focuses on the physical design of the hybrid photonic interconnection network, Iris. The broadcast network is implemented using dielectric antenna arrays to provide the any-1-to-n fan-outs, while the circuit-switched subnetwork is implemented using channel waveguides and broadband switches. To allocate these interconnect resources, the arbitration messages are prepared by electrical gates and broadcast via the photonic antennas.

Figure 3.3(a) illustrates the proposed network architecture and many-core system integration. The proposed nanophotonic broadcast network consists of photonic transceivers, channel waveguides, photonic switches and a photonic broadcast antenna array. The nanophotonic components are fabricated in separate silicon layers and integrated with the CMOS silicon die through three-dimensional integration [81]. The design and fabrication of each layer can be optimized independently.

The optical path is composed of similar basic components for both the broadcast and circuit-switched networks, as shown in Figure 3.3(b): (1) Off-chip light source: a broadband supercontinuum fiber laser emitting at a power level of 10 mW/nm [82] is chosen to provide cost-effective WDM laser power; (2) WDM filters: passive micro-ring filters are used to demultiplex the broadband light into wavelength channels and deliver these channels to the system. We have fabricated several micro-rings and they have proven to be suitable candidates for filter design. Figure 3.4(a) shows a filter design fabricated and measured in our labs;

Figure 3.3: Physical design of Iris. (a) 3D layer conceptual view where different layers are interconnected through 3D integration. (b) Components and frequency responses through WDM to improve the bandwidth.

(3) Modulators: Figure 3.4(b) shows a compact modulator design, fabricated and measured within our labs, with a 1.5 µm radius ring resonator sitting alongside a waveguide. Micro-ring based modulators are chosen for their compact footprint, low power consumption, and high modulation rates; (4) Straight and bent waveguides: waveguides transmit the optical signal across the chip; (5) Photodetectors: epitaxially grown Ge detectors are used at the receiver end. Similar optical links have been demonstrated to be compact and low power, and to deliver over-GHz speed [39].

Figure 3.4: Scanning electron microscope (SEM) images of nanophotonic devices fabricated through ePIXfab and measured in our labs. (a) Micro-ring based add-drop filter. (b) Micro-ring based modulator. (c) Antenna. (d) Racetrack-based switch.

However, the broadcast and circuit-switched networks differ in the switching mechanism. The broadcast network is based on an efficient antenna array, serving as an any-1-to-n fan-out element. Figure 3.4(c) shows an antenna structure that we have fabricated. In contrast, the circuit-switched network leverages optical switches. Figure 3.4(d) shows a third-order stacked racetrack designed for switching optical packets. This design is chosen for its wide bandwidth and small free spectral range, as shown in the transmission graph in Figure 3.5. Next, the physical design of the two subnetworks is discussed in turn.

Broadcast subnetwork

The broadcast subnetwork scales a typical free-space communication system design to the micrometer length scale. It provides direct all-to-all connectivity for on-chip cores at low cost, low latency, and low power. The antenna array is the key to efficient on-chip broadcasting since a single packet is sent

once and received by all other communicating nodes. We have designed and fabricated the antenna structure [59, 83]. Initial measurement results are in good agreement with simulations. The antenna array consists of dielectric-rod antennas in which the silicon ridge guide's width is tapered up from the nominal waveguiding value and then down to below cutoff. The structure of the waveguide taper determines the antenna radiation pattern. The apertures of the multiple antennas lie on the circumference of a circle, approximately 3 µm from each other, as shown in Figure 3.4(c). To increase the coupling efficiency among these antennas, a 500 nm thick electro-optic polymer is coated on these structures.

Figure 3.5: Measurement results of the nanophotonic racetrack-based switch, which enhances the bandwidth of the switch and reduces overall area (power in dBm versus wavelength in nm).

The broadcast network design, as shown in Figure 3.1(b), provides low-latency transmission at high bandwidth. Light entering the chip is demultiplexed into 64 wavelengths and modulated using micro-ring modulators. The modulated light is carried by a waveguide running toward the center of the chip, where the antenna lies. Next, the optical signal radiates through the antenna and is transmitted to the other ports. Then the signal is carried through the waveguides to the rest of the cores without encountering any crossovers.

Electromagnetic simulation shows that the working range of our antenna structure extends from 1350 nm to 1650 nm, completely covering the C+L bands (1530 nm to 1625 nm) [83]. As a broadband passive device, the antenna array is not susceptible to thermal variations. A large number of WDM channels can therefore be supported. The actual communication bandwidth equals the number of WDM channels times the bit rate per channel. The bit rate per channel could reach 4 Gbps. Specifically, the propagation latency is composed of the following parts:

(1) T_CLK-Q: Clock-to-Q propagation latency of the driving latch storing the information to be transferred. T_CLK-Q ≈ 31 ps, assuming 32 nm technology.
(2) T_m: Optical transmission delay of the modulator and its driver. T_m ≈ 36.3 ps [84].
(3) T_p: Time of flight. T_p ≈ 149 ps. We assume that the group index of the SOI waveguide is 4.2 [85, 86]. The horizontal and vertical dimension of the broadcast network is 0.75 cm, which is three quarters of the overall die side length. This assumes that the antenna array access points are placed in the corner of each tile.
(4) T_r: Receiver photodetector and amplifier delay. T_r ≈ 7.2 ps [84].
(5) T_setup: Setup time of the latch receiving the signal. T_setup = 19 ps, assuming 32 nm technology.

The total delay is approximately 243 ps. The bandwidth could reach 4 Gbps per WDM channel. For a system with 64 WDM channels at 4 Gbps per channel, the bandwidth of the broadcast network is 256 Gbps.

Next, we focus on analyzing the power efficiency of our design. The following analysis shows that the proposed nanophotonic broadcast network is highly power efficient and can reliably operate in the milliwatt range per WDM channel at 4 Gbps per channel.

To determine the optical power budget for our broadcast network, we perform calculations according to the loss budgeting method [87]. First, we determine the minimum required signal power at the detector for a given bit error rate (BER) and system noise, taking into account the modulation loss. Then, the radiation in the broadcast network is calculated to estimate the required input power according to Equation 3.1. Specifically, the optical receiver sensitivity is estimated to be -35 dBm [69]. Table 3.1 gives a detailed account of the power loss of the different components along the optical path. The antenna power loss was computed through simulations. From full-wave simulations of the antenna structures, we computed an end-to-end transmission loss of 21.5 dB

Table 3.1: Power budget

Broadcast network

| Device | Insertion loss | Electrical power | Number of devices | Ref. |
| Grating coupler | 1.6 dB | - | 1 | [70] |
| Demultiplexer at transmitters | 1 dB | - | - | [74] |
| Micro-ring modulator with driver | 2 dB | 5.9 mW | 4096 | [39] [76] |
| Waveguide loss (2.82 cm) | 3.79 dB | - | - | [71] |
| Antenna array | 25 dB | - | 1 | |
| Demultiplexer at receivers | 1 dB | - | - | [74] |
| Receiver with Ge photodetector | | mW | 4096 | [21] |
| Total loss per wavelength | dB | - | - | |

Channel-waveguide network

| Device | Insertion loss | Electrical power | Number of devices | Ref. |
| Grating coupler | 1.6 dB | - | 1 | [70] |
| Demultiplexer at transmitters | 1 dB | - | - | [74] |
| Micro-ring modulator with driver | 2 dB | 5.9 mW | 4096 | [39] [76] |
| Waveguide loss (4 cm) | 5.36 dB | - | - | [71] |
| Waveguide crossings | dB | - | 8/path | [73] |
| Switch loss | 2 dB | 5.9 mW | 256 | [20] [39] [76] |
| Demultiplexer at receivers | 1 dB | - | - | [74] |
| Receiver with Ge photodetector | | mW | 4096 | [21] |
| Total loss per wavelength | dB | - | - | |

Receiver sensitivity: -35 dBm [69]

for a 32-port antenna [83]. Using computer models, we project the transmission loss to reach 25 dB for the 64-port antenna used in our study. Hence, an input laser power of 0.86 mW/channel is required at the target BER. For our 64-channel, 64-core system, a total power of approximately 3.6 W is needed, which is quite low compared to electrical alternatives.

The electrical power needed for the broadcast network is also minimized. Assuming the same setup with a 20% activity factor, the aggregated receiver power consumption is only 2.2 W, owing to significant improvements in the power of the transceiver electrical circuitry. In addition, during a broadcast operation, only a single sender's transmitters are involved. The power efficiency will improve along with CMOS technology.

Circuit-switched subnetwork

This section describes the physical design of the circuit-switched subnetwork in terms of performance and power.

The circuit-switched network is composed of channel waveguides and high-order racetrack resonators, which function as links and switches, respectively. SOI-based channel waveguides can deliver unicast traffic at high throughput with negligible loss. They are chosen to form the circuit-switched subnetwork, as depicted in Figure 3.9. In this design, they are arranged both horizontally and vertically to form a mesh topology without bends and curves. As a result, the optical power loss is significantly reduced. High-order racetrack resonators, controlled by electrical tuning signals, sit at the waveguide crossovers to form high-bandwidth switches, as shown in Figure 3.4(d). To save operating electrical power and area, they can switch multiple WDM channels simultaneously owing to their wide passband (shown in Figure 3.5). Such a wide passband is achieved by increasing the coupling efficiency and by the high-order filter design described in [20].
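The broadcast-network laser power quoted earlier in this section can be cross-checked against the insertion losses listed in Table 3.1. The sketch below is a hedged reconstruction of that loss-budget arithmetic: it assumes the total loss is the plain sum of the broadcast-network rows (the total row itself is not legible in this copy) and that the system total is 64 channels times 64 cores. All variable names are illustrative.

```python
# Insertion losses of the broadcast path in dB, taken from Table 3.1.
BROADCAST_LOSSES_DB = {
    "grating coupler": 1.6,
    "demultiplexer at transmitters": 1.0,
    "micro-ring modulator": 2.0,
    "waveguide (2.82 cm)": 3.79,
    "antenna array (64-port projection)": 25.0,
    "demultiplexer at receivers": 1.0,
}
RECEIVER_SENSITIVITY_DBM = -35.0
NUM_CHANNELS = 64
NUM_CORES = 64


def dbm_to_mw(power_dbm: float) -> float:
    """Convert a power level in dBm to milliwatts."""
    return 10 ** (power_dbm / 10)


# ASSUMPTION: losses in dB simply add along the optical path.
total_loss_db = sum(BROADCAST_LOSSES_DB.values())
# The launch power must exceed the receiver sensitivity by the total loss.
launch_dbm = RECEIVER_SENSITIVITY_DBM + total_loss_db
launch_mw = dbm_to_mw(launch_dbm)
# ASSUMPTION: every core drives every wavelength channel.
total_w = launch_mw * NUM_CHANNELS * NUM_CORES / 1000

print(f"total loss   : {total_loss_db:.2f} dB")
print(f"launch power : {launch_mw:.3f} mW per channel")
print(f"system total : {total_w:.2f} W")
```

Under these assumptions the sum comes to 34.39 dB, which gives roughly 0.87 mW per channel and about 3.56 W in total, in line with the 0.86 mW/channel and approximately 3.6 W stated in the text.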
In summary, the regular network design minimizes optical loss, WDM can be leveraged to increase throughput at low operating power, and light-speed propagation shortens latency.

The network delivers strong performance. Throughput-wise, doped resonators can exhibit high operating speed [39]. In this study, we have assumed 4 Gb/s per wavelength channel per link, taking into account path propagation delay and electrical circuitry limitations at the sender and receiver

ends. By further assuming a 64 WDM channel setup, the 8-by-8 mesh network provides over 4 Tb/s aggregated bisection bandwidth. Latency-wise, once the transmitting core gains access to the unicast network, the full cache line is transmitted without any non-deterministic delay due to contention or routers. Since these messages are less sensitive to delay because of their large serialization latency, the arbitration delay has an insignificant impact on system performance.

Table 3.2: Configuration

| Nanophotonic network in Iris | | Memory hierarchy and processors | |
| Throughput per wavelength | 4 Gb/s | L1 cache per core | 64 KB, 2-way, 64-byte line |
| Number of wavelengths | 64 | L2 cache per core | 256 KB, 16-way, 128-byte line |
| Broadcast channels | 2 | L2 access latency | 6 cycles |
| Coherence protocol | Snoopy protocol, MESI | Memory access latency | 12 cycles (a) |
| Protocol msg size | 32 bits | Processor | 2 GHz Alpha |

(a) Memory access latency is composed of DRAM access time, calculated by the CACTI 6.5 tool [88] using 32 nm high-performance technology, and chip-to-chip link latency, assuming the DRAM is stacked via three-dimensional hybrid integration technology.

Next we analyze the power of the unicast network. The optical power loss is small compared to traditional photonic networks for two reasons: first, the unicast network is free of bends; second, owing to dimension-order routing, a packet experiences only one switch delay in the worst case. Hence, the total insertion loss is the per-wavelength total listed for the channel-waveguide network in Table 3.1 (taking into consideration the grating coupler loss). The electrical power of the network is dissipated mainly in modulators, switches and receivers. By leveraging the optical switch presented by Chen et al. [20], we can greatly reduce the overall number of devices and thus the overall power consumption. In this study we estimated the total electrical power to be 7.4 W, at an activity factor of 0.2, for the whole Iris network, including the broadcast and channel-waveguide networks, following the power estimations presented in Table 3.1.

Simulation results

In this section, we evaluate Iris, the proposed nanophotonic on-chip network, on a 64-core chip-multiprocessor. The performance and power efficiency of Iris are compared against several recently proposed electrical and photonic alternatives. The simulation studies demonstrate that

the proposed design significantly reduces the power dissipation and cache-network system latency for a set of typical scientific and commercial workloads.

Evaluation platform and configuration

This simulation study targets a 64-core chip-multiprocessor design. The chip-multiprocessor occupies about 1 cm² of silicon die. 3D integrated memory [23] and chip-to-chip fiber-connected memory [89] are assumed in this setup. These technologies are likely to be widely adopted before on-chip nanophotonic networks are realized. Table 3.2 shows the configuration of the network and memory hierarchy.

We have implemented a trace-driven, cycle-accurate cache-network simulator, which simulates the activities of the memory hierarchy, the interconnection networks, and cache coherence protocol transactions. Network traffic traces are gathered using the M5 full-system simulator [90] running several SPECOMP [91], SPLASH2 [92] and ALPBench [93] multi-threaded benchmarks, including ammp, applu, apsi, art, equake, fma3d, swim, wupwise, ocean, radix, cholesky, waternsq, lu, fft, fmm, and mpgenc. We also consider memory access traces collected from three commercial server workloads: TPC-H, TPC-W [94], and SpecJBB [95]. Each benchmark is spawned into 16 threads, which are distributed among the processor cores and executed concurrently. We consolidated four benchmarks onto the 64-core chip and executed them concurrently to emulate virtual machine run-time workload management.

The performance and power consumption of Iris are estimated based on recent nanophotonic device studies [89, 39, 96]. An earlier section presented an analysis of the basic components composing nanophotonic networks, and the preceding sections gave more details about Iris and both its broadcast and unicast networks. Performance analysis using state-of-the-art silicon photonic devices shows an approximately 243 ps sender-to-receiver communication latency in the broadcast network.
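The 243 ps latency quoted above is the sum of the five delay components listed in the physical design of the broadcast subnetwork. The sketch below redoes that arithmetic. Two points are assumptions rather than statements from the text: the unpipelined link is taken to carry one bit per end-to-end latency, and the time of flight is checked against a diagonal path across the 0.75 cm square, a geometric reading that happens to reproduce the stated 149 ps.

```python
import math

# Delay components of the broadcast path in picoseconds, as listed in the text.
DELAYS_PS = {
    "clock-to-Q of driving latch": 31.0,
    "modulator and driver": 36.3,
    "time of flight": 149.0,
    "photodetector and amplifier": 7.2,
    "setup time of receiving latch": 19.0,
}
NUM_WDM_CHANNELS = 64
BIT_RATE_PER_CHANNEL_GBPS = 4

total_ps = sum(DELAYS_PS.values())
# ASSUMPTION: without pipelining, one bit is sent per end-to-end latency.
peak_rate_gbps = 1e3 / total_ps
broadcast_bandwidth_gbps = NUM_WDM_CHANNELS * BIT_RATE_PER_CHANNEL_GBPS

# Cross-check of the time of flight from the stated group index and dimensions.
GROUP_INDEX = 4.2
SIDE_CM = 0.75
SPEED_OF_LIGHT_M_PER_S = 299_792_458
# ASSUMPTION: the worst-case path is the diagonal of the 0.75 cm square.
path_m = math.hypot(SIDE_CM, SIDE_CM) * 1e-2
time_of_flight_ps = GROUP_INDEX * path_m / SPEED_OF_LIGHT_M_PER_S * 1e12

print(f"total latency  : {total_ps:.1f} ps")
print(f"peak bit rate  : {peak_rate_gbps:.2f} Gbps per channel")
print(f"broadcast total: {broadcast_bandwidth_gbps} Gbps")
print(f"time of flight : {time_of_flight_ps:.1f} ps")
```

The components add up to 242.5 ps, which rounds to the stated 243 ps and corresponds to slightly more than 4 Gbps per wavelength channel.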
Therefore, the peak bit rate per wavelength channel is approximately 4 Gbps, twice as fast as the electrical counterpart. We also consider a 64 WDM channel setting in order to provide sufficient communication bandwidth and concurrent transmission support. Future technology advances and proper pipelining schemes can further improve the nanophotonic broadcast network speed.

Comparison of Iris and alternatives

The performance and power efficiency of Iris are evaluated against the following recently proposed electrical and photonic alternatives.

- E-mesh: A packet-switched electrical mesh network using virtual channel flow control and supporting a directory-based MESI protocol. It is equipped with a recently proposed latency-optimized two-pipeline-stage router design with speculative virtual channel allocation [97].
- P-mesh: A circuit-switched mesh network based on photonic channel waveguides and supporting a directory protocol. It includes a latency-optimized electrical network to set up the photonic circuit-switched path. P-mesh is an approximation of a recently proposed optical network design [77].
- E-tree: A tree-based electrical network operated in one coherence domain, derived from the Sun Fireplane System Interconnect and supporting the snoopy protocol [98]. The address bus is implemented in a hierarchical tree topology with a fan-out of four at each node. The data bus is a packet-switched electrical mesh network.
- P-bus: A channel-waveguide-based photonic network supporting the snoopy protocol [24]. It consists of a set of snake-shaped single-write multi-read waveguides connecting on-chip processing units. The optical signal is broadcast along each waveguide.
- Corona: A serpentine multi-write single-read waveguide-based crossbar network [23]. Each waveguide with WDM support is statically dedicated to a single read node while other nodes compete by cycling tokens. Network performance is limited by the token cycling, which requires 8 clock cycles per round trip, and by the directory coherence protocol.

[Figure 3.6: Power consumption comparison of seven different networks including optical and electrical networks. Vertical axis: power (W), broken down into Router, Link, TX-RX-Switch, and Optical components, with one bar per workload for E-tree, P-bus, P-mesh, E-mesh, Packet, Circuit, and Iris.]

In this study, E-mesh and E-tree are used to quantify the overall performance improvement and power savings of the proposed nanophotonic solution over existing electrical alternatives. The photonic alternatives P-mesh, P-bus, and Corona are used to evaluate the latency and power impact of Iris relative to recently published nanophotonic networks. In addition, in order to evaluate the throughput and power trade-offs, the circuit-switched photonic subnetwork in Iris is replaced with electrical packet-switched and circuit-switched mesh networks, giving the following two alternatives.

Packet: A hybrid network consisting of a broadcast photonic subnetwork and an electrical packet-switched mesh subnetwork. Protocol messages are sent via the broadcast network as in Iris, but the large data packets are sent via the electrical subnetwork with hop-by-hop arbitration and buffering. Packet is evaluated against the Iris circuit-switched photonic subnetwork in terms of throughput and power efficiency.

Circuit: Similar to Packet, a hybrid electrical-photonic interconnect, except that the electrical network in Circuit is circuit switched. The circuit-switched path setup is scheduled by the broadcast network, as in Iris. This configuration shows how the photonic interconnect outperforms electrical alternatives by providing low latency and high throughput.

Power efficiency evaluation

Figure 3.6 shows the on-chip power dissipation breakdown of Iris and the other seven alternatives. The electrical network power consumption of each of the alternatives is calculated by ORION2.0 [99] at 32 nm technology. The power consumption of the electrical network is contributed by routers and link circuitry (labeled as "Routers and links"). The power dissipation of the nanophotonic network is contributed by the light source (labeled as "Optical source") and by the electrical transmitter, receiver, and switch power for unicasting/broadcasting (labeled as "Electro-optical").

This study shows that Iris is the most power-efficient on-chip interconnect solution. Compared to the electrical alternatives E-mesh and E-tree, Iris reduces the average power consumption by 19.5% and 50.6%, respectively. This shows that global electrical interconnects are not power-efficient for on-chip communication, especially for chip-scale broadcast. Compared to the photonic alternatives P-mesh, P-bus, and Corona, Iris also reduces the average power consumption by 20.3%, 92.7%, and 1.8%, respectively.

In P-mesh, the power overhead mainly comes from the non-scalable static power consumption of the electrical network used to set up the bufferless photonic link paths. In P-bus, photonic splitters are introduced along the channel waveguides for broadcasting. As each node along the waveguide taps the same portion of the incoming optical power, the required optical power grows super-linearly with the number of cores. In Corona, serpentine waveguides are long and introduce excess insertion loss. To minimize the waveguide propagation and bending loss, the optical access points of neighboring cores are placed close to each other, which increases electrical dissipation because of the longer electrical wires.

On the other hand, Packet and Circuit do not show power benefits because the major traffic load, the data packets, is still transferred via the electrical network. The removal of small protocol messages does not lead to significant power reduction. Compared to Packet and Circuit, Iris reduces the overall network power consumption by 45.3% and 33.1%, respectively.

As technology scales further, the power efficiency of nanophotonic devices and components is expected to improve. The power consumption of the electrical network, on the other hand, is

expected to increase. Therefore, using nanophotonic on-chip interconnect becomes increasingly power beneficial. More specifically, Iris shows better power efficiency and is expected to dominate in future technologies.

[Figure 3.7: Performance comparison: average local L2 miss latency of seven different networks including state of the art optical and electrical networks. Vertical axis: L2 miss latency (clock cycles), broken down into Request, Protocol, Ack, and Memory components, with one bar per workload for E-tree, P-bus, P-mesh, E-mesh, Packet, Circuit, and Iris.]

Performance evaluation

Next, we evaluate the performance of Iris. Cache miss latency is used as the critical cache-network performance criterion. Figure 3.7 shows the average L2 cache miss latency (read and write) of Iris compared to the seven alternatives.
In this study, latency is decomposed into the following four components:

Request: cache miss request latency, the time from when a cache miss occurs until the node gains access to the network.

Protocol: protocol transaction latency, the time from when a request is sent until the node receives a reply from the directory or another cache.

Memory: memory access latency, the time to access the memory.

Ack: acknowledgment or data reply latency, the time for the cache line to be sent in response to a read request, or for an acknowledgment allowing the node to write.

This study shows that the broadcast network efficiently delivers coherence messages and simplifies protocol transactions, while the channel-waveguide-based circuit-switched network improves the network throughput for large data packets. Therefore, compared with directory-based solutions

E-mesh, P-mesh, and Corona, Iris can significantly reduce the protocol transaction latency and request latency. In addition, the broadcast network can effectively reduce the setup latency of the circuit-switched network compared to electrical network assistance in P-mesh or token circulation in Corona. Overall, Iris provides a high-performance on-chip communication solution with a 43.0% latency reduction over E-mesh, 40.9% over P-mesh, and 36.7% over Corona.

Iris also provides better performance than the other snoopy-based solutions. E-tree does not show a latency reduction over the directory protocol alternatives because of the slow electrical hierarchical tree and network congestion under temporarily high traffic workloads. P-bus offers a similar but smaller latency reduction than Iris, because it works at a lower frequency due to the longer propagation distance in its snake-shaped waveguide. Overall, compared to E-tree and P-bus, the proposed solution reduces the average local L2 cache miss latency by 49.9% and 7.2%, respectively.

Iris also provides a power-efficient, high-performance solution for data packets. Equipped with the same antenna-array-based subnetwork, the overall network latency in Iris shows a 32.7% reduction over Circuit and a 17.2% reduction over Packet.

[Figure 3.8: Throughput analysis of our broadcast-based networks. Coherence protocol delay (cycles) versus cache miss injection rate (misses per cycle) for Iris, Circuit, and Packet.]

Pareto-Space of design

In the next simulation study we conduct a throughput analysis. Three hybrid designs (Iris, Packet, and Circuit) are all equipped with a broadcast antenna-array-based subnetwork and handle

messages heterogeneously. However, throughput-hungry data packets are carried by different subnetworks, which consequently show different performance and power. These three designs represent various points in the Pareto space of designing a high-throughput subnetwork with low-latency broadcast subnetwork support. Trade-offs among throughput, CMOS area, and power are involved. As shown in Figure 3.7 and Figure 3.8, no design dominates the rest in all design metrics. Packet, for instance, exhibits higher throughput at high workloads at the expense of high power consumption and large area. On the other hand, Circuit throughput does not scale at very high loads, while showing lower power levels and minimal area. Finally, Iris is a midpoint between the two former designs in terms of throughput scalability and area, while having the lowest power levels among the three designs. From our analysis, we expect that Iris will dominate in both performance and power up to a packet injection rate of 0.02 packet/cycle (which corresponds to 64 cores). However, photonic-broadcast-based designs such as Packet can still provide high performance with reasonable power for higher workload levels.

These results demonstrate that Iris can effectively speed up on-chip data transactions compared to the state-of-the-art electrical and photonic alternatives. These results also demonstrate that the performance benefit of Iris increases with on-chip data sharing, as the broadcast/multicast nanophotonic subnetwork can effectively handle latency-critical cache coherence protocol messages. On the other hand, it is challenging to efficiently handle short on-chip broadcast and multicast messages using electrical and other channel waveguide alternatives.
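The statement that no design dominates the rest can be made concrete with a Pareto-dominance check. The following is a minimal sketch, not the procedure used in this study: the function names `dominates` and `pareto_front` are hypothetical, every metric is assumed to be "lower is better", and the metric tuples for the generic designs A to D are made-up placeholders rather than measured results.

```python
from typing import Dict, List, Tuple

Metrics = Tuple[float, ...]  # e.g. (latency, power, area); lower is better


def dominates(a: Metrics, b: Metrics) -> bool:
    """Return True if a is no worse than b in every metric and strictly better in one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(designs: Dict[str, Metrics]) -> List[str]:
    """Return the names of designs that are not dominated by any other design."""
    front = []
    for name, metrics in designs.items():
        dominated = any(
            dominates(other, metrics)
            for other_name, other in designs.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder values for illustration only (not results from this work).
designs = {
    "A": (1.0, 3.0, 3.0),
    "B": (3.0, 2.0, 1.0),
    "C": (2.0, 1.0, 2.0),
    "D": (3.0, 3.0, 3.0),
}
front = pareto_front(designs)
print(front)
```

In this toy data set, A, B, and C each win on a different metric, so none dominates another, while D is dominated by A. This mirrors the situation described above, where each of the three hybrid designs occupies a different point of the trade-off.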
3.4 Linear power division networks

In this section we show how 3 dB adiabatic couplers can be interconnected for linear power division among several receivers, and we analyze the optical characteristics of the system. The objective of the proposed interconnect is to divide the input signal linearly among multiple receivers in a power-efficient manner. Super-linear power distribution topologies, as shown in Figure 3.10, split the power consecutively, tapping non-proportional portions of power at each receiver

due to process variations. Hence, the input optical power grows super-linearly with the number of receivers. However, by restructuring that interconnect into a balanced tree with receivers at the leaves, as shown in Figure 3.9, we find that we can achieve linear power division. The consequence is a power-area tradeoff. In the super-linear power distribution, the number of power splitters is linear in, and equal to, the number of receivers, while the required input power grows super-linearly with the number of receivers. Meanwhile, in the linear power division interconnect, the number of power splitters is also linear in the number of receivers but doubled, and the required input power grows only linearly with the number of receivers. Although the area of the linear power division interconnect is doubled, the small size of the adiabatic coupler does not pose a serious overhead and can easily fit in an on-chip network.

For this structure to fit on a chip, we use an H-tree layout for the interconnect, as shown in Figure 3.9. H-trees are widely used in VLSI clock trees due to several advantages. First, at the device level, there are no crossings. Second, at the system level, the latency of the signal from the source to any destination is balanced. The source signal emanates from the center and travels to every on-chip receiver.

Next, in order to boost the performance of the network, multiple waveguides are added in parallel. In addition to WDM, this increases the available bandwidth for the network. This performance improvement comes at the cost of introducing waveguide crossings, which introduce power loss and crosstalk. Careful choice of the number of waveguides is necessary for a sustainable bit-error-rate.

The low insertion loss of the designed adiabatic coupler and its wide bandwidth offer a broadcast medium for on-chip communication and the microprocessor-memory interface. This broadcast medium can also be used for power or clock distribution.
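The difference between the two topologies can be illustrated numerically. The sketch below is an idealized model under stated assumptions, not the analysis used in this work: every splitter is taken to be an ideal 3 dB (50/50) device, the serial chain is read from the labels of Figure 3.10 (the power halves at each tap, so the last receiver sets the budget), and the function names and the `excess_loss_db` parameter are hypothetical.

```python
def chain_input_power(m: int, p_rx: float) -> float:
    """Input power for a serial chain of m ideal 3 dB taps.

    Each tap passes half of the remaining power downstream, so the last
    receiver sees input / 2**m. The input must therefore be p_rx * 2**m.
    """
    if m < 1:
        raise ValueError("need at least one receiver")
    return p_rx * 2 ** m


def tree_input_power(m: int, p_rx: float, excess_loss_db: float = 0.0) -> float:
    """Input power for a balanced binary tree with m receivers at the leaves.

    m must be a power of two. Ideal 3 dB splitters deliver input / m to each
    leaf; an optional excess loss per splitter (in dB) is accumulated over the
    log2(m) levels between the root and a leaf.
    """
    if m < 1 or (m & (m - 1)) != 0:
        raise ValueError("m must be a power of two")
    levels = m.bit_length() - 1  # exact log2 for powers of two
    return m * p_rx * 10 ** (excess_loss_db * levels / 10)


for m in (4, 16, 64):
    print(m, chain_input_power(m, 1.0), tree_input_power(m, 1.0))
```

For 64 receivers the ideal tree needs 64 times the per-receiver power, whereas the idealized chain needs 2^64 times, which is why the balanced tree is attractive despite its larger splitter count.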

[Figure 3.9: Linear Power distribution topology using adiabatic coupler. A sender injects a broadband optical signal of power mP into a tree of power splitters; each of the m optical receivers (Receiver 1 to Receiver m) receives power P.]

[Figure 3.10: Super-Linear Power distribution topology used by other power splitters. A sender injects a broadband optical signal of power 2^m P into a serial chain of power splitters and optical receivers; labels along the chain: Receiver m-2, 4P; Receiver m-1, 2P; Receiver m, P.]

Summary

Optics is a fundamentally different technology from electrical interconnects, and it requires different design methodologies. In this chapter, we have shown the drawbacks of electrical interconnects and the shortcomings of earlier approaches that borrow electrical interconnect design techniques. The first step of design is to understand the performance requirements and the power constraints. Based on the problem formulation, we propose an architecture and then show how to build it physically using available nanophotonic devices. In fact, we have gone one step further by designing a new antenna in the infra-red regime to support our architecture. Our work in this chapter demonstrates the second stack of design, which is network design under ideal conditions. The next chapters introduce the impact of the system on our network and how the two interact. The ideal conditions considered here are far from realistic and require reconsideration. We study the operating and fabrication conditions in detail and propose solutions to the variations in the following chapters.

Chapter 4 Reliability challenges in nanophotonic on-chip networks

In this chapter, we turn our attention to one of the important challenges facing nanophotonic on-chip networks, namely reliability. The reliability of on-chip optical communication is threatened by variations in the system. There are typically two kinds of variations that may lead to communication degradation and, in some extreme cases, total failure of the network and consequently of the system. These are process variations due to inaccuracies in fabrication, and thermal variations due to temporal and spatial variations in workload. The goal of this chapter is to provide a quantitative measure of variations and their impact on the device and system. We provide models that evaluate the system-level metrics based on device-level information.

1 The content of this thesis is copied from publications that I have published or that are under review [3, 4, 5, 6, 7, 8, 9, 10].

4.1 Variation-induced Reliability Challenges

The silicon photonic interconnection fabric depends upon matched resonant wavelengths of multiple, spatially separated resonant devices used for signal modulation, switching, and filtering, as described earlier in Chapter 2. However, resonant device characteristics (e.g., passband wavelengths of rings and racetracks) are sensitive to device dimensions and to the refractive index of materials. When the passbands of receivers and transmitters do not fully overlap, signal loss and crosstalk may occur, which in turn lead to degradation of system performance and power efficiency.

Variations are the product of actual processes and can only be accurately revealed in data sets of measurements. Fabrication-induced process variations affect the critical dimensions of silicon

photonic devices, most prominently the waveguide width and thickness, resulting in passband shift. To quantify process variations, we have fabricated more than 6 batches of wafers at ePIXfab [40]. On one of the 200 mm wafers, we have measured and characterized over 20 dies with the same nominal design dimensions. Figure 4.1 displays the passbands of four different micro-rings for the same design. As shown, the passband shift can be great enough to eliminate overlap completely in some cases.

[Figure 4.1: Passband shift of micro-rings due to process variations of devices on different chips on the same wafer, with the same nominal design parameters. Transmission (dB) versus wavelength (nm).]

Thermal variations result from exposure of silicon photonic devices to spatial and temporal variations in temperature, mainly caused by power-hungry CMOS processors. Previous thermal analysis shows that spatial variations may reach 17 °C [100] and temporal variations may reach 40 °C [101]. Thermal variations also lead to passband drift of silicon photonic devices due to the large thermo-optic coefficient of silicon. To quantify the effect on photonic devices, we experimentally generated thermal variations of such magnitude in our measurement setup, shown earlier in Figure 2.1. The setup includes a CMOS chip attached to a thermal heater and a thermocouple to control the exact temperature. The optical spectrum is then measured under the different conditions. The ring and racetrack resonators under test exhibited changes in spectral transmission. Figure 4.2 displays the dependence of the passband of ring resonators on temperature over a 50 °C temperature range. As can be seen, the drift in the center wavelength of a passband

is as much as 0.11 nm/°C. The wavelength shift for racetracks is smaller, approximately 0.09 nm/°C according to our measurements.

[Figure 4.2: Passband shift of the same micro-ring due to thermal variations induced experimentally. Transmission (dB) versus wavelength (nm), with one curve per temperature in 5 °C steps up to 50 °C.]

4.2 System Reliability Impact

To model the system reliability impact of variation-induced device passband mismatches, we investigate the optical signal path and determine the signal-to-noise degradation at the receiver. In a silicon photonic interconnect, there are four separate devices whose passbands must overlap: the demultiplexer at the broadband light source, the multiplexer that modulates signals before transmission, the switch, and the demultiplexer at the receiver end. We depict these devices conceptually in Figure 4.3.

Additionally, noise is introduced through crosstalk, a secondary effect of passband drift. Power that lies outside of the overlap may still find its way through the system to another receiver. This is illustrated in Figure 4.3, where at the switch a weakened signal propagates in one direction to reach the receiver, while the rest of the signal follows another path and is added as crosstalk to another signal. The damage is two-fold: lower signal levels and higher noise levels, which lead to a lower signal-to-noise ratio (SNR) at the detector. This signal-to-noise ratio is quantitatively defined by [3]:

\[ \mathrm{SNR} = \frac{R\,P_{rec}}{N_r + N_x} \tag{4.1} \]

where SNR is the signal-to-noise ratio, R is the photodetector responsivity, P_rec is the signal level received at the detector, N_r is the receiver noise, such as thermal noise, shot noise, and dark current noise (dominated by thermal noise), and N_x is the crosstalk noise. Degradation in SNR leads to transient errors at the receiver, where the receiver cannot distinguish between the on and off signals representing ones and zeros, thus impacting the reliability of communication.

[Figure 4.3: Signal loss and crosstalk in silicon photonic on-chip network due to variations in the system. The sketch shows an off-chip laser source and sender, (1) modulator, (2) switch, (3) filter, and (4) detector at the receiver, with the Path 1 and Path 2 signals, the passband of the switch, and the resulting power loss and crosstalk at the detector under process and thermal variation.]

The reliability of a communication link is measured in terms of bit-error-rate, defined as:

\[ \mathrm{BER} = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{\sqrt{t\,\mathrm{SNR}}}{\sqrt{2}}\right) \tag{4.2} \]

where t is the minimum distance of the error correcting code.

Variations too small to result in catastrophic failure may still result in performance degradation. When a bit error is detected, re-transmission is necessary, resulting in increased latency for the packet. The packet latency, in terms of bit-error-rate, can be defined as [100]:

\[ t = t_o + (1-p)\,t_o + (1-p)^2\,t_o + \dots = \frac{t_o}{(1-\mathrm{BER})^{m}} \tag{4.3} \]

The geometric series sums to t_o/p, so the closed form corresponds to p = (1 - BER)^m, the probability that all m bits of a packet arrive without error.

Re-transmission has two side effects. First, congestion occurs in the network, and the latency of the whole network deteriorates. Second, delayed packets on the critical path of execution harm the system throughput, bringing down the performance of the whole system [64].

As one can see from Equation 4.2, the bit-error-rate of the communication channel depends

exponentially on the SNR. Small fluctuations in received power may lead to dramatic reliability degradation. For a bit-error-rate of 10^-12 (which is as low as one bit error every 250 sec per channel at 4 Gbps, having negligible power and performance overhead), as required for reliable communication [87], a signal-to-noise ratio of 49 is the minimum level allowed.

Measurements indicate that drifts in the micro-ring passband may exceed the passband width. Using a drift of 0.11 nm/°C and a 1.48 nm passband for a 64 channel system operating in the C+L band, a 13 °C spatial thermal variation is enough to totally block communication [100]. Similarly, process variations may be large enough to result in reliability failure. According to our measurements of SOI micro-rings fabricated at ePIXfab, we can have up to a 1.2 nm standard deviation in resonant wavelength shift at the die level [102]. This figure is specific to our design and layout, which have high device density, a gap near the fabrication limit (130 nm), and a radius as small as 5 µm. Selvaraja et al.'s designs had a lower device density, a larger gap of 180 nm, and radii of 5, 5.01, 5.02, and 5.03 µm, leading to a smaller standard deviation in resonant wavelength shift. Thus, our variation is larger than the numbers presented by Selvaraja et al. [103].

4.3 Summary

In this chapter we moved on to the second part of our thesis. We discuss the real fabrication and operating conditions for our on-chip optical network. We identify an important problem: variations, both process and thermal, that alter the specifications of our devices and threaten the reliability of our network. We quantitatively evaluate their impact on the devices and the system. The problem is severe, and a solution is necessary if optical on-chip networks are to find wide application in many-core architectures. The goal of this chapter is to provide a motivation for the next two chapters, which present a solution to this problem.
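The numbers quoted in Section 4.2 can be checked with a few lines of arithmetic. The sketch below is illustrative only and rests on stated assumptions: Equation 4.2 is evaluated for an uncoded link (t = 1), the success probability in Equation 4.3 is read as p = (1 - BER)^m for a packet of m bits, and the function and parameter names are hypothetical.

```python
import math


def ber_uncoded(snr: float) -> float:
    """Equation 4.2 evaluated with t = 1 (assumed: no error-correcting code)."""
    return 0.5 * math.erfc(math.sqrt(snr / 2.0))


def packet_latency(t_o: float, ber: float, m_bits: int) -> float:
    """Equation 4.3: expected latency when a packet is resent until error-free.

    Assumes bit errors are independent, so a packet of m_bits bits succeeds
    with probability (1 - ber) ** m_bits.
    """
    p_success = (1.0 - ber) ** m_bits
    return t_o / p_success


def blocking_temperature(passband_nm: float, drift_nm_per_c: float) -> float:
    """Temperature difference at which the drift equals one passband width."""
    return passband_nm / drift_nm_per_c


# SNR of 49 gives a bit-error-rate on the order of 1e-12.
print(ber_uncoded(49.0))

# One error every 250 s at 4 Gbps corresponds to 1 / (250 * 4e9) = 1e-12.
print(1.0 / (250 * 4e9))

# A 1.48 nm passband and 0.11 nm per degree C drift give roughly 13 degrees C.
print(blocking_temperature(1.48, 0.11))
```

Because the bit-error-rate falls off steeply with SNR, a modest drop in received power moves the link far from the target, and the latency in Equation 4.3 then grows with packet length.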

Chapter 5 Reliability-aware design flow

We have discussed reliability challenges in on-chip optical networks and shown that variations threaten the reliability of the network and can lead to latency deterioration, power inefficiency, and in some cases total failure of communication and of the system. Thus, this problem needs to be addressed and solved before silicon photonics finds wide application in the many-core realm. Device-level solutions come at a cost or have limitations, and do not address the problem to the full extent. Earlier system-level work focused on designing reliability management solutions that are independent of the underlying network. However, these techniques have several shortcomings, including high power cost, poor scalability to future thousand-core architectures, and no awareness of the different reliability management techniques, their costs, and their tradeoffs. Thus, the solution is inefficient and leaves room for further improvement.

The goal of this chapter is to propose a flow that accounts for the design metrics: power, performance, and reliability. The flow is based on abstracting the system into two abstraction layers: (1) the device layer and (2) the network layer. The device layer focuses on designing the devices to provide the required specification in terms of power, performance, and functionality. In addition, yield is a new metric that we introduce to account for reliability. The network layer focuses on designing the bandwidth, topology, flow control, and routing technique. Moreover, the reliability of the network is accounted for by computing the latency degradation or the increase in power requirements needed to counteract variations in the system.

Based on these abstraction layers we propose a two-step flow: (1) analysis and (2) design. The first step includes analyzing the devices and understanding their potential and limitations, which feed into the analysis of the

network level through reliability models. The network analysis defines the Pareto-space of design points through a simulation-based study. Based on our analysis, we can make design decisions, and we propose a reliability management solution. The next and final step is the detailed design of our devices and network to support our reliability management solution.

5.1 Overview

To address the reliability challenge of silicon photonic interconnect, a reliability-aware design flow is proposed. As shown in Figure 5.1, there are three steps in this flow: analysis, management, and design. In the analysis part, a fabrication-calibrated device model is built to quantify the impact of variations on device reliability. Based on this model, the Pareto-space of device and network designs is explored. The impact of variations on system reliability is minimized by choosing optimal design points. Because no design is variation-free, a light-weight reliability management solution is supplied based on the optimal choice of device and network. Finally, a detailed design stage of devices and network is conducted to complete the flow.

The comprehensive flow is founded on abstracting the interconnect into two tightly coupled layers, device and network, as shown in Figure 5.1.

(1) The device layer is concerned with designing individual devices and point-to-point optical links. At this level, we can quantify the variations and their impact on the device response statistically (for yield).

(2) The network layer is concerned with designing a network architecture under the conventional performance and power constraints, in addition to the reliability constraint unique to silicon photonic interconnect. At this layer, the architect has quantitative analysis and models from the device level to model variations. Based on these models, architecture-level simulation is possible, and we can define the Pareto-space of designs for design space exploration.


[Figure 5.1: Reliability-aware design flow based on abstracting our system into two layers. The flow involves the analysis and design of each layer in a two-step process. Blocks shown: starting point, device-level analysis (statistical T-matrix models of variations), device-level variation analysis, device-level parameters, network-level analysis, network-level parameters, Pareto-space designs, reliability-management solution, device-level design, network-level design, reliability-aware devices, and reliability-aware network.]

The device and network levels are tightly coupled to address reliability challenges. Device-level design decisions need network-level information to proceed, and vice versa. For instance, to decide on the resonating device, one can use a wide-band directional coupler or a narrow-band micro-ring. Without an understanding of the performance requirements and area constraints of the system, we cannot design the devices.

To uniformly assess reliability across two layers and multiple steps, we use yield as a measure. Yield is defined as the number of chips in a single wafer that conform to the design specifications and are reliable (having a bit-error-rate no higher than the level required for reliable communication in Section 4.2). The yield numbers presented herein are based on fabrication results that we have measured and analyzed in our labs. In addition, we also study the power and performance design metrics in detail.

5.2 Reliability-Aware Device Analysis and Design

In this section, we discuss the device layer analysis and design steps in the flow. We present a fabrication-calibrated, reliability-aware model that helps both analysis and design. Specifically, in analysis, the model can be used for studying the statistical behavior of the model parameters, and, in design, it can be used for specific designs by varying the design parameters and evaluating


the response of the device. We present several case studies for analysis and design, respectively.

[Figure 5.2: T-matrix of directional coupler and illustration of the directional coupler structure. The sketch shows an input light source of power P, input ports i1 and i2, the coupling region S with coupling length L_c and gap width d_g, and output ports o1 and o2 with output powers P1 and P2.]

Reliability-Aware Device Modeling and Analysis

In this section, we develop models for the basic silicon photonic devices that are the building blocks of an optical on-chip network. The model is based on the Transfer-matrix (T-matrix) method. We illustrate the model using WDM devices such as the directional coupler and the micro-ring. They have different passband widths, and are widely used as filters and modulators. The T-matrix model describes the power transmission spectrum of a multi-port device, taking into account different variation sources.

The T-matrix model for a directional coupler is used here as an example, as shown in Figure 5.2. A directional coupler consists of two symmetric waveguides. The gap between the two waveguides is tapered down until it reaches the desired gap width (d_g). When light propagates in one arm, part of the input power is coupled to the second waveguide in the coupling region. The amount of power coupled from one waveguide to the other depends on the gap width (d_g) and the coupling length (L_c) at the specific wavelength (λ). The transmission curve in terms of resonant wavelength is a cosine function, as shown in Figure 5.4. The wavelength spectrum of the two-input two-output directional coupler can be modeled as follows [104]:

\[ \begin{pmatrix} o_1 \\ o_2 \end{pmatrix} = \begin{pmatrix} r & it \\ it^{*} & r^{*} \end{pmatrix} \begin{pmatrix} i_1 \\ i_2 \end{pmatrix} \tag{5.1} \]

where i_1 and i_2 are the input ports, o_1 and o_2 are the output ports, t is the coupling coefficient of energy transferring to the neighboring waveguide, whereas r is the energy that stays within the same waveguide. r^* and t^* are the complex conjugates of r and t.

\[ r = \cos\!\left(L_c\sqrt{\kappa^2+\delta^2}\right) + i\,\frac{\delta}{\sqrt{\kappa^2+\delta^2}}\,\sin\!\left(L_c\sqrt{\kappa^2+\delta^2}\right) \tag{5.2} \]

\[ t = \frac{\kappa}{\sqrt{\kappa^2+\delta^2}}\,\sin\!\left(L_c\sqrt{\kappa^2+\delta^2}\right) \tag{5.3} \]

In Equation 5.2 and Equation 5.3, L_c is the length of the two parallel coupling waveguides, and κ is the coupling coefficient that determines the coupling efficiency. δ is defined as Δβ(λ)/2, where Δβ is the propagation constant difference between the first and the second modes in the coupling region S.

The parameters in the proposed model include both design parameters and fabrication-specific parameters. For instance, L_c is a design parameter controlled by the designer; on the other hand, δ is a fabrication-specific parameter (and also depends on the gap parameter determined by the designer).

Fabrication-specific parameters in the model provide a more accurate depiction of the impact of variations compared to conventional methods such as computation-intensive electromagnetic simulations. For instance, the change in temperature, process variations in the gap, and process variations in the width and thickness of the waveguide are all modeled in δ.

Design parameters provide a powerful tool to explore the design space of the device under constraints. For instance, by varying the coupling length L_c, one can get the wavelength spectrum of different designs.

Similarly, the T-matrix model for the micro-ring can be used for variability analysis. A micro-ring

in Figure 5.3 can be modeled as a concatenation of T-matrices of the form [104]:

$$\begin{pmatrix} o_1 \\ d_f \end{pmatrix} = \begin{pmatrix} r & it \\ it & r \end{pmatrix} \begin{pmatrix} i_1 \\ e^{i\phi/2}\,e^{-\alpha l/2}\,u_f \end{pmatrix}, \qquad (5.4)$$

$$\begin{pmatrix} u_f \\ o_2 \end{pmatrix} = \begin{pmatrix} r & it \\ it & r \end{pmatrix} \begin{pmatrix} e^{i\phi/2}\,e^{-\alpha l/2}\,d_f \\ i_2 \end{pmatrix}, \qquad (5.5)$$

where $i_1$ and $i_2$ are the input ports, $o_1$ and $o_2$ are the output ports, $d_f$ is the downward stream in the ring, and $u_f$ is the upward stream in the ring, as shown in Figure 5.3. $\alpha$ is the loss per unit length, $l$ is the length of a single round trip, and $\phi$ is the phase accumulated over a round trip, which can be expressed as $\phi = 2\pi n_{eff}\,l/\lambda$, where $n_{eff}$ is the effective refractive index, which varies with applied voltage, device geometry, and ambient temperature, and $\lambda$ is the wavelength in free space. In Equation 5.4 and Equation 5.5, we assume the coupling regions (S1 and S2 in Figure 5.3) are symmetric and almost lossless. $t$ is the field coupling coefficient from the waveguides into the ring, and also from the ring to the waveguides. $r$ represents the amplitude of the remaining field. Hence, we have $r^2 + t^2 = 1$.

[Figure 5.3: T-matrix models of the micro-ring resonator and an illustration of its structure. Labels: input ports $i_1$, $i_2$; output ports $o_1$, $o_2$; coupling regions S1 and S2; ring streams $u_f$ and $d_f$.]

Next, we extend the passive micro-ring model to the doped micro-ring design. We start with a discussion of the plasma dispersion effect, in which the optical properties of the micro-ring are altered by the applied voltage through the change in carrier concentration. We provide first-order models for calculating the carrier concentration. These models are calibrated against measurement results for accurate modeling. Finally, we discuss calibration and design parameters.
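The passive micro-ring relations in Equation 5.4 and Equation 5.5 can be solved in closed form when only one input is driven. The following is a minimal sketch, assuming $i_2 = 0$, real $r$ and $t$ with $r^2 + t^2 = 1$, and the half-ring factor $e^{i\phi/2}e^{-\alpha l/2}$ taken as written above. The function name and all numerical values (effective index, radius, mode order, $r$) are hypothetical and chosen only for illustration.

```python
import cmath
import math


def ring_add_drop(wavelength, n_eff, length, r, alpha=0.0, i1=1.0):
    """Return (through_power, drop_power) of an add-drop micro-ring with i2 = 0.

    wavelength, length in meters; alpha in 1/m; r is the field transmission
    of each (assumed identical, lossless) coupling region.
    """
    t = math.sqrt(1.0 - r * r)                          # r^2 + t^2 = 1
    phi = 2.0 * math.pi * n_eff * length / wavelength   # round-trip phase
    # Propagation factor over half of the ring, as used in Eqs. 5.4 and 5.5.
    a = cmath.exp(1j * phi / 2.0) * math.exp(-alpha * length / 2.0)
    # Eq. 5.5 with i2 = 0 gives u_f = r*a*d_f; substituting into the second
    # row of Eq. 5.4 (d_f = i*t*i1 + r*a*u_f) and solving for d_f:
    d_f = 1j * t * i1 / (1.0 - (r * a) ** 2)
    u_f = r * a * d_f
    o1 = r * i1 + 1j * t * a * u_f    # through port, first row of Eq. 5.4
    o2 = 1j * t * a * d_f             # drop port, second row of Eq. 5.5
    return abs(o1) ** 2, abs(o2) ** 2


# Hypothetical example: 5 um radius ring, resonance of order m = 48.
N_EFF = 2.4
RING_LENGTH = 2.0 * math.pi * 5e-6
RESONANT_WAVELENGTH = N_EFF * RING_LENGTH / 48

through, drop = ring_add_drop(RESONANT_WAVELENGTH, N_EFF, RING_LENGTH, r=0.95)
```

Sweeping `wavelength` over a range gives the full through-port and drop-port spectra. Changing `n_eff` or `alpha` then shows how index and loss variations move and broaden the resonance.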

The change in the carrier concentration in the micro-ring alters the optical properties of silicon. Two main properties are affected: (1) the refractive index and (2) the optical loss of the silicon waveguide. The change around the 1.55 µm wavelength can be quantified using the following equations [105]:

$$\Delta n = -\left[a_e\,\Delta N_e + a_h\,(\Delta N_h)^{0.8}\right], \qquad (5.6)$$

$$\Delta\alpha_d = b_e\,\Delta N_e + b_h\,\Delta N_h, \qquad (5.7)$$

where $\Delta n$ is the change in refractive index due to doping, $\Delta N_h$ is the change in hole concentration, $\Delta N_e$ is the change in electron concentration, $\Delta\alpha_d$ is the increase in loss due to doping, and $a_e$, $a_h$, $b_e$, and $b_h$ denote the empirical coefficients given in [105].

The change in the refractive index of silicon leads to a change in the effective index $n_{eff}$ and the group index $n_g$. The changes in the group and effective indices can be computed through a modal analysis applied to the waveguide cross-section, which is a simulation with low computational overhead. The result of the refractive index change is a shift of the resonant peak, through which modulation and switching can be achieved.

The change in waveguide loss increases the micro-ring loss ($\alpha$). This deleterious increase in loss widens the passband of the micro-ring response and increases the insertion loss. Hence, careful design of the dopant concentration is required to control the loss of the ring.

The altered values of the effective and group indices and of the waveguide loss are used in the transfer matrix to model the doped micro-ring at different carrier concentrations. Hence, full-spectrum analysis of the doped micro-ring becomes possible, just as in the case of passive micro-rings.

The carrier concentration, which is responsible for the plasma dispersion effect, must be calculated accurately. The carrier concentrations for PIN- and PN-junctions are derived from analytical models, given by:

$$N_e = n_i\,e^{(\phi_{fn}-\phi_{fi})/V_t}, \qquad (5.8)$$

$$N_h = n_i\,e^{(\phi_{fp}-\phi_{fi})/V_t}, \qquad (5.9)$$

where $N_e$ is the electron concentration, $N_h$ is the hole concentration, $n_i$ is the intrinsic carrier

concentration, $V_t$ is the thermal voltage, $\phi_{fi}$ is the intrinsic Fermi-level potential, $\phi_{fn}$ is the quasi-Fermi-level potential for electrons, and $\phi_{fp}$ is the quasi-Fermi-level potential for holes.

In the forward-biased case, knowledge of the potential at different points across the PIN- or PN-junction is crucial for accurate calculation of the carrier concentration. At high voltage, which is the case in doped micro-rings, the p-region and n-region can be treated as resistances; consequently, the voltage drop across them follows Ohm's law. For the depletion region, the voltage must be computed by solving the following non-linear equation, which accounts for the voltage drop across the p- and n-regions:

$$V_o - V_d = I_s R_j\,e^{V_d/V_t}, \qquad (5.10)$$

where $V_o$ is the applied voltage across the junction, $V_d$ is the voltage across the depletion region, $I_s$ is the saturation current, and $R_j$ is the junction resistance of the n- and p-regions.

In the case of a reverse-biased PIN- or PN-junction, the carrier concentration profiles are step functions. The electron concentration is zero until the end of the depletion region on the n-side, where it steps up to the n-region doping ($N_d$); the holes follow a similar profile in the p-region.

The above models can be used to give first-order indications of the carrier concentrations in the waveguide for different doping profiles. In addition, they can be fitted against measured curves of the junction for extrapolation to different operating voltages. The fitting parameters in that case are the resistance of the p- and n-regions and the saturation current of the junction.

In addition, two design parameters can be used to change the characteristics of the ring:

(1) Doping profile: Depending on the junction type (PIN- or PN-junction) and the doping profile, one obtains different carrier concentration profiles, which in turn change the ring characteristics. Different carrier concentrations lead to a change in the resonant peak and passband width.

(2) Applied voltage: The applied voltage also changes the carrier concentration and leads to the plasma dispersion effect (a change in resonant peak and passband width).
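The forward-bias relation in Equation 5.10 has no closed-form solution for $V_d$, but its left-minus-right side decreases monotonically in $V_d$, so a simple bisection works. The sketch below solves it that way and then evaluates the plasma dispersion expressions of Equation 5.6 and Equation 5.7. The coefficients in `plasma_dispersion` are the commonly quoted Soref-Bennett values near 1.55 µm and are an assumption here; they may differ from those used in [105]. The function names and the junction parameters in the example are likewise hypothetical.

```python
import math


def solve_depletion_voltage(v_applied, i_sat, r_series, v_thermal=0.02585, tol=1e-12):
    """Solve V_o - V_d = I_s * R_j * exp(V_d / V_t) (Eq. 5.10) for V_d by bisection."""
    def residual(v_d):
        return v_applied - v_d - i_sat * r_series * math.exp(v_d / v_thermal)

    lo, hi = 0.0, v_applied          # residual(hi) is always negative
    if residual(lo) <= 0.0:
        raise ValueError("applied voltage too small for a forward-bias solution")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:      # root lies above mid
            lo = mid
        else:                        # root lies at or below mid
            hi = mid
    return 0.5 * (lo + hi)


def plasma_dispersion(delta_n_e, delta_n_h):
    """Return (delta_n, delta_alpha) for carrier changes given in cm^-3.

    ASSUMPTION: commonly quoted Soref-Bennett coefficients near 1.55 um;
    delta_alpha is in cm^-1.
    """
    delta_n = -(8.8e-22 * delta_n_e + 8.5e-18 * delta_n_h ** 0.8)
    delta_alpha = 8.5e-18 * delta_n_e + 6.0e-18 * delta_n_h
    return delta_n, delta_alpha


# Hypothetical junction: I_s = 1 pA, R_j = 100 ohm, 1.1 V applied.
v_d = solve_depletion_voltage(1.1, 1e-12, 100.0)
current = (1.1 - v_d) / 100.0        # Ohm's law across the p- and n-regions
```

In a calibration setting, `i_sat` and `r_series` would be the two fitting parameters adjusted until the computed current matches the measured voltage-current curve.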

There are three prominent features of the proposed T-matrix model compared to time-consuming electromagnetic simulation or empirical analysis:

In T-matrix models, the fabrication parameters are obtained from calibration against fabricated devices. Specifically, in the above model, $\alpha$, $n_{eff}$, and the coupling coefficients are obtained from calibration. By fitting against multiple fabricated devices of the same design at different temperatures, the resulting model parameters capture the impact of fabrication inaccuracies and thermal variations. The accuracy is much higher than that of time-consuming electromagnetic simulations. Our analysis shows that $n_{eff}$ holds invaluable information about waveguide thickness and width variations and about the thermo-optic coefficient responsible for thermal sensitivity.

The statistical analysis of calibrated parameters provides the foundation of the reliability-aware device design flow. By calibrating our models against fabricated devices, we can compute our model parameters, which follow a Gaussian distribution with a mean and a standard deviation. These statistical models form the basis of our analysis phase, as we show in our case study of designing a 16-channel micro-ring-based WDM filter. In this phase, we can statistically evaluate the response of different devices under different process and thermal variations; the results are then passed on to later stages for design. For instance, knowing that $n_{eff}$ in micro-rings has a mean of $\mu_n$ and a standard deviation of $\sigma_n$, one can analyze the variations for micro-rings of different radii and gaps to obtain a mean resonant wavelength $\mu_r$ and a standard deviation $\sigma_r$ for different designs with high accuracy and low computational cost. For instance, a micro-ring with a standard deviation of 2.59 nm in waveguide linewidth [106] would exhibit a 1.5 nm shift in resonant peak and hence a yield of 68% in a 64-channel system.

Device-level information can be translated to the network level through the calibrated T-matrix models. A set of connected devices that represents a network-on-chip can be modeled by multiplying the T-matrices. This network-level analysis can provide worst-case power

loss and noise levels, and thus the signal-to-noise ratio and bit-error rate, which are our reliability metrics. Moreover, optical and electrical power and analytical performance metrics can be computed for the whole network. Consider an optical link built with the micro-ring from the last example: the 68% yield drops to 7% for the link, which is too low to build a system. Hence, a design space exploration of the network is possible through matrix multiplications with low computational overhead [6].

Case Study: T-matrix calibration of directional coupler

As a case study for our proposed reliability-aware models, we take the modeling of a directional coupler as an example (micro-rings have been calibrated and designed elsewhere [6]). We have fabricated several directional couplers at ePIXfab with various gaps and coupling lengths. In this case study we choose a 1063 µm directional coupler for calibration. We first start with the T-matrix model [104] and discuss the fabrication-specific parameters. We then perform the calibration step and fit our model against the measurement results. Next, we show how to leverage the model to analyze variations and their impact on the spectral response.

Analytically, the output power in each arm derived from the T-matrix model is defined as [104]:

$$P_1(\lambda) = P(\lambda)\cos^2(\delta L_c), \qquad (5.11)$$

$$P_2(\lambda) = P(\lambda)\sin^2(\delta L_c), \qquad (5.12)$$

where $L_c$ is the coupling length, $P_1$ is the transmitted power, $P_2$ is the coupled power, $P$ is the input power, $\lambda$ is the operating wavelength, and $\delta$ is $\Delta\beta(\lambda)/2$.

In Equation 5.11 and Equation 5.12, $\Delta\beta$ (implicitly defined in $\delta$) depends on the gap width $d_g$ and the operating wavelength $\lambda$, and it depends strongly on the fabrication process. Hence, calibration of this parameter is necessary. In this model, one needs to fit the model to the measured transmission of a fabricated device to obtain a numerical value of $\delta$, which captures the difference of propagation constants.
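To make Equation 5.11 and Equation 5.12 concrete, the sketch below evaluates the power in each arm for a given $\delta$ and $L_c$, and computes the shortest coupling length at which all power transfers to the second arm ($\delta L_c = \pi/2$). The function names and the value of $\delta$ are hypothetical; in practice $\delta$ would come from calibration against a fabricated device.

```python
import math


def coupler_outputs(p_in, delta, l_c):
    """Return (P1, P2) from Eqs. 5.11 and 5.12 for input power p_in."""
    phase = delta * l_c
    return p_in * math.cos(phase) ** 2, p_in * math.sin(phase) ** 2


def full_transfer_length(delta):
    """Shortest coupling length with complete transfer to arm 2 (delta * L_c = pi/2)."""
    return math.pi / (2.0 * delta)


# Hypothetical calibrated value of delta (rad/m) at one wavelength,
# applied to the 1063 um coupler discussed in the text.
DELTA = 1500.0
p1, p2 = coupler_outputs(1.0, DELTA, 1063e-6)
```

Because the two outputs are complementary, `p1 + p2` always equals the input power in this lossless model, which is a quick sanity check when fitting measured spectra.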

[Figure 5.4: Transmission spectrum of the measurement and model fit of a directional coupler measured and calibrated in our labs. Axes: output power (µW) versus wavelength (nm). Curves: measured transmission and fitted transmission.]

Analytically, one can compute $\delta$ from the period of the cosine function of the power transmitted in one arm (which is complementary to the other arm). Knowing that $\delta$ is, to first order, linear in the reciprocal of wavelength, as our simulation results indicate, we can use a linear fit in the reciprocal of wavelength to compute $\delta$. Figure 5.4 presents the transmission according to the measurements and according to the model after calibration. The calibration is a two-step process: first, we use analytical formulas and the first-order linear approximation to compute $\delta$; then, least-squares curve fitting is used to obtain more accurate results. The two curves are in good agreement, and the error is less than 8% of the maximum transmission on average.

In order to quantify the impact of variations, one needs to plug in the fabrication-specific parameters that represent the different variations and recompute the wavelength spectrum. Since this computation only involves evaluating Equation 5.11 and Equation 5.12, the computational overhead is low and the computation can be repeated for thousands of fabrication and operating conditions. This enables a statistical account of the device behavior under different variations. Hence, analysis and modeling of variations and their impact on the response of the device is possible. The proposed model, unlike other simulation-based models, is straightforward and has low computational overhead. Moreover, it can be extended to study devices with different design parameters by simply changing these parameters

in the model, thus enabling statistical analysis and design.

[Figure 5.5: Simulation results of a forward-biased micro-ring modulator and calibration of the proposed models at equilibrium (equ), on, and off states. Axes: output power (a.u.) versus wavelength (nm). Curves: simulation and calibrated model for each of the three states.]

Case Study: Forward-Biased PIN Junction Micro-Rings

Our T-matrix model has also been examined and calibrated against a recently published forward-biased PIN junction micro-ring modulator design by Manipatruni et al. [107]. This is a very compact micro-ring with a 2.5 µm radius. The PIN junction is formed across the ridge of the micro-ring, similar to Figure 5.7. The junction is forward-biased and operates between 1.1 V and 0.96 V to maximize the change in carrier concentration. The power consumption is dominated by the static power (267 µW) needed to sustain the carrier concentration. As the optical power transfer spectra at the two voltage levels are not fully reported in the publication, we leverage device simulation (Cogenda Genius [108]) and electromagnetic simulation (COMSOL Multiphysics) to produce the electrical and optical properties of the design at full scale. These simulations are not necessary in real calibration scenarios.

The calibration of the forward-biased micro-ring follows the same steps as the calibration of the passive micro-ring. However, in the forward-biased micro-ring, the effective index and loss factor need to be adjusted according to the carrier concentration distribution in all states, including equilibrium, on, and off. This adjustment is performed by first fitting the voltage-current curve of

the active device to find $R_j$ and $I_s$. The carrier concentration can then be solved from Equation 5.8, Equation 5.9, and Equation 5.10. Figure 5.5 shows our final calibration results in all three states. The difference between the two voltage levels changes the effective refractive index and consequently shifts the resonant wavelength by 1.4 nm. Due to the carrier-induced loss increase, the passband width (BW) increases from 0.57 nm in the off state to 0.75 nm in the on state. Finally, the calibration error is less than 3% of the maximum transmission in all three states.

[Figure 5.6: Doped PN-junction micro-ring, top and side views of the p- and n-regions. Labels: n-type region, p-type region, micro-ring waveguide.]

Case Study: Reverse-Biased PN Junction Micro-Rings

We tested our calibration procedure on a recently published reverse-biased PN junction micro-ring design. The device is a micro-ring with a 5 µm radius proposed by Dong et al. [109]. A PN junction is formed across the ridge of the micro-ring, as shown in Figure 5.6. The junction is mostly reverse-biased and is operated between 1 V and -2 V to vary the length of the depletion region in the waveguide. As the current flow is small under reverse bias, even at large voltage, the static power is comparatively small and the total power is dominated by dynamic power (50 fJ/bit). Due to the paucity of published measurement data, we use simulation to characterize the doped micro-ring for calibration, though this would not be necessary in a real-world calibration scenario.

The calibration of a reverse-biased micro-ring follows the same steps as the forward-biased

micro-ring, except that the carrier concentration profile is not constant and needs to be modeled in both the on and off states. Following that, eigenvalue mode analysis is necessary to compute the change in the effective and group indices of the waveguide. Figure 5.8 shows our final calibration results. Our proposed model demonstrates a good match after calibration, with an error of less than 1.25% of the maximum transmission. The difference between the two voltage levels changes the effective refractive index and consequently shifts the resonant wavelength by 0.3 nm. Both the index change and the wavelength shift are much smaller than in the forward-biased design. The change in carrier concentration also leads to a small loss increase: the passband width (BW) increases from 0.12 nm in the off state to 0.15 nm in the on state, values that are also smaller than in the forward-biased case.

[Figure 5.7: Doped PIN-junction micro-ring, top and side views of the p-, i-, and n-regions. Labels: n-type region, intrinsic region, p-type region, micro-ring waveguide.]

Reliability-Aware Device Design

In this section, we describe reliability-aware device design, which takes variations and yield into account. We leverage the analysis results at both the device and architecture levels for detailed specification of device characteristics such as full width at half maximum, free spectral range, and resonant wavelengths. We propose a new design process that improves on the current slow and inaccurate process, and we demonstrate it through the example of a 16-channel WDM micro-ring-based multiplexer.
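Yield figures such as the one-standard-deviation 68% quoted in this chapter follow from a Gaussian model of the resonance shift. The sketch below is a minimal illustration under two assumptions that are not stated in the source: the shift is zero-mean Gaussian, and the devices in a link vary independently. The function names are hypothetical, and the count of seven devices is chosen only because it reproduces a value close to the 7% link yield quoted earlier; the source does not say how many devices the link contains.

```python
import math


def device_yield(tolerance, sigma):
    """Probability that a zero-mean Gaussian shift stays within +/- tolerance."""
    return math.erf(tolerance / (sigma * math.sqrt(2.0)))


def link_yield(single_device_yield, n_devices):
    """Yield of a link whose n_devices must all work, assuming independence."""
    return single_device_yield ** n_devices


# A tolerance equal to one standard deviation gives the familiar 68.27%.
one_sigma_yield = device_yield(1.5, 1.5)

# ASSUMPTION: seven independent devices per link (illustrative only).
seven_device_link = link_yield(0.68, 7)
```

The rapid drop from device yield to link yield is the reason yield has to be treated as a design constraint rather than checked after the fact.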

[Figure 5.8: Simulation results of a reverse-biased micro-ring modulator and calibration of the proposed models. Axes: output power (a.u.) versus wavelength (nm). Curves: simulation and calibrated model at the off, on, and equilibrium (equ) states.]

Traditional techniques for the design of silicon photonic devices are slow, iterative, and expensive, and they do not explicitly account for variations. The traditional design process shown in Figure 5.9 starts with a device specification in the form of a wavelength spectrum, which is then translated into design parameters using first-order models. Next, simulations are conducted iteratively to fine-tune the design parameters. If process and thermal variations are to be taken into consideration, more simulations are required for each variation point. This iterative procedure is long: a single iteration can take up to ten hours, depending on the complexity of the structure under study. After that, the design parameters are available and the device is laid out for fabrication. The fabrication process is an expensive and time-consuming step. Despite this effort, the fabricated device, when measured, does not match the device specifications. This is due to several reasons, such as process variations and inaccuracies in simulations. Current electromagnetic simulations present an idealized framework that does not account for real fabrication effects such as surface roughness and impurities introduced in the fabrication process. Hence, the fabricated device does not meet the specified wavelength spectrum. This has motivated us to pursue the alternative design process presented here.

We propose a novel reliability-aware design process. The new design process is a two-step process, as shown in Figure 5.10. In the first step, our goal is to calibrate the fabrication-specific

parameters of the T-matrix model discussed in Section 5.2.1, statistically analyze the model parameters to compute their mean and standard deviation, and build a library of devices. The second step involves the design of the target device with a pre-defined wavelength response spectrum that conforms to the design specifications recommended by the analysis phase and the reliability management solution. For instance, a directional coupler design would involve specifying the wavelength of peak transmission and the period of the peaks, as shown in Figure 5.4. The model contains both fabrication-specific parameters and design parameters. We leverage the device library, which includes the fabrication-specific parameters, and tune the design parameters to obtain the required response. This design process is more accurate than long electromagnetic simulations, since it accounts for process-specific effects.

[Figure 5.9: Traditional design process for silicon photonic devices using slow and inaccurate electromagnetic simulations. Flow: design specifications, first-order models, electromagnetic simulation (time-consuming iterations), expensive device fabrication, measurements, compare; the result does not match.]

Our main design goal is a low-power, high-performance design. This flow adds reliability as one of the design constraints. In this section we study power, performance, and reliability trade-offs in designing micro-rings. We present four case studies on the different design metrics.
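The first step of the proposed flow, turning calibrated parameters into a statistical device library, can be sketched as follows. This is a minimal illustration, not the thesis implementation: the calibrated $n_{eff}$ samples, the radius, the resonance order, and the function names are all hypothetical, and the spread is propagated through the resonance condition $m\lambda = 2\pi n_{eff} R$ assuming that only $n_{eff}$ varies while $R$ and $m$ stay fixed.

```python
import math
import statistics


def build_library_entry(calibrated_n_eff):
    """Summarize n_eff values calibrated from several fabricated devices."""
    return {
        "mean": statistics.mean(calibrated_n_eff),
        "std": statistics.stdev(calibrated_n_eff),   # sample standard deviation
    }


def resonance_spread(entry, radius, order):
    """Mean and standard deviation of the resonant wavelength.

    Uses m * lambda = 2 * pi * n_eff * R, which is linear in n_eff for a
    fixed radius and resonance order, so mean and std scale by the same factor.
    """
    scale = 2.0 * math.pi * radius / order
    return scale * entry["mean"], scale * entry["std"]


# Hypothetical calibration results from five copies of the same design.
SAMPLES = [2.398, 2.401, 2.400, 2.403, 2.399]
entry = build_library_entry(SAMPLES)
mean_wavelength, std_wavelength = resonance_spread(entry, radius=5e-6, order=48)
```

In the second step of the flow, a designer would keep `entry` fixed and vary only design parameters such as `radius` until the predicted resonance statistics meet the specification.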

[Figure 5.10: Proposed design process for silicon photonic devices based on the T-matrix models. One-time process: sample design specifications, first-order models, electromagnetic simulation, device fabrication, measurements, model calibration, calibration parameters, statistical analysis, statistical calibration parameters, device library. Design process: design specifications, transfer matrix model, device fabrication, measurements, compare; the result matches.]

Case study: Power Design Trade-off

The micro-ring dissipates both electrical and optical power. The electrical power is expended in the doped micro-ring junction and the driver circuit. Optical power is depleted by intrinsic insertion loss, off-resonance coupling, and dopant-enhanced optical absorption. Using the models discussed above, we conducted a numerical experiment to elucidate the trade-offs between electrical and optical power by varying the micro-ring radius. As shown in Figure 5.11, the electrical power increases linearly with the micro-ring radius due to the increase in junction cross-sectional area; meanwhile, the optical power loss grows exponentially as the radius decreases, according to the model of Vlasov et al. [110]. Hence, when designing the micro-ring for a target free spectral range and resonant wavelength by controlling the radius, it is crucial to take the optical and electrical power trade-offs into consideration and minimize the overall power dissipation.

Case study: Performance Design Trade-off

Figure 5.12 shows an experiment that demonstrates the trade-off between system reliability and network bandwidth, with the micro-ring bandwidth as the controlled design parameter. A 0.8 nm

process variation is considered herein. The design parameter, micro-ring bandwidth, is typically controlled by the dopant concentration and the waveguide-ring gap. As shown in Figure 5.12, increasing the bandwidth of the micro-ring reduces the number of channels the system supports; hence, performance deteriorates. Meanwhile, the tolerance of the system to process variation improves. A suitable operating point for the system under study is at 6.5 nm (which corresponds to 15 WDM channels). In this case, the bit-error rate is reduced to below $10^{-12}$, which is typically required for reliable communication [87].

[Figure 5.11: Electrical and optical power trade-offs in micro-ring design based on device radius. Axes: optical power (dB/cm) and electrical power (mW) versus micro-ring radius (µm).]

Case study: Reliability-Power Design Trade-off

Leveraging the models presented in this section, we have conducted two reliability experiments that are important to system-level design decisions. We examine the thermal stability of a doped micro-ring under electrical and optical power dissipation.

Electrical power dissipation in a doped micro-ring might lead to thermal runaway [111] and thus affect the stability of the interconnect. Specifically, the electrical power dissipated in a doped micro-ring may increase its operating temperature and lead to a red shift in the resonant wavelength. To counteract the red shift due to heating, a higher voltage is needed to produce a blue shift. If this rise in voltage leads to more power dissipation and a higher temperature, then no operating voltage converges to a stable operating wavelength. To examine this effect, we modeled a PIN-junction micro-ring similar to the design of Manipatruni et al. [107]. We

varied the operating voltage from zero to the maximum allowed voltage and computed the static power consumed (which is the dominant power [107]). Using the evaluated power, we conducted a thermal analysis using COMSOL Multiphysics to compute the range of operating temperatures. The results show that, over the entire range of voltages, the temperature change is around 1 degree, and it is smaller still within the operating voltage range. For this specific design, thermal runaway can be neglected, as the resulting wavelength shift is small compared to the passband width. In a PN-junction micro-ring such as the one designed by Dong et al. [109], the electrical power consumed is much lower (one fifth of the power in the micro-ring we study). This leads to a much smaller thermal variation, which can be safely ignored. Hence, thermal runaway induced by electrical power has negligible impact on the stability of a doped micro-ring.

[Figure 5.12: Performance and reliability trade-offs based on channel passband width. Axes: number of channels and bit-error rate (error/sec) versus channel passband width (nm).]

Thermal runaway can also occur due to excessive heating resulting from optical losses, which we study in our second experiment. The rise in temperature leads to a red shift, which needs to be compensated by raising the operating voltage. Raising the operating voltage increases the carrier concentration and thus the optical losses, leading to positive feedback. This power-related reliability effect is crucial in power-intensive designs such as Corona [23], where the optical power losses in the doped micro-ring are significant.

In the first part of this experiment we study the temperature rise due to optical-loss heating. We account for the loss due to excess carrier concentration through theoretical models, while bending loss and fabrication inaccuracies have been measured in our labs. As shown in Figure 5.14, the

increase in dopant concentration leads to an exponential increase in optical loss, and thus in operating temperature. The dissipated power increases with dopant concentration, and consequently the temperature increases. At very high doping levels, almost all of the input power is dissipated in the micro-ring and the temperature saturates. In this example, thermal runaway becomes imminent beyond a threshold dopant concentration (see Figure 5.14). Hence, lower optical power levels are required for the thermal stability of this network design.

[Figure 5.14: Operating temperature of a doped micro-ring due to dopant-induced optical power loss. Axes: operating temperature (K) and optical power (dB) versus dopant concentration (cm$^{-3}$).]

In the second part of this experiment, we study the impact of the dopant concentration in a doped micro-ring on system reliability. We use the bit-error rate as the reliability metric, and we consider the impact of process variation in addition to the optical loss due to excess carriers considered in the first part of the experiment. For the process variation, we assume a 0.8 nm shift in passband width. As shown in Figure 5.15, the bit-error rate of the system improves at first, since the passband widens and can therefore tolerate more process variation. However, as the dopant concentration increases further, the optical loss increases significantly, which reduces the signal level and degrades the overall bit-error rate of the system. This study demonstrates the power and reliability trade-offs in micro-ring-based network design.

[Figure 5.13: The radii and resonant wavelengths of a 16-channel WDM ring designed using the T-matrix model for a 63% yield. Axes: micro-ring radius (µm) versus resonant wavelength (nm).]

Case Study: Reliability-Performance-aware Design

In this case study, we demonstrate the design of a 16-channel WDM multiplexer. The fabrication-calibrated model parameters are obtained from a fabricated 4-channel WDM multiplexer. The target 16-channel multiplexer has tighter inter-channel spacing and thus higher bandwidth.

The fabricated 4-channel WDM design is composed of micro-rings of different radii: 4.98, 5, 5.02, and 5.04 µm. Measurement results give a mean channel spacing of 2.45 nm together with its standard deviation. This design was fabricated and measured on 12 different dies; the structure was repeated twice on each die, and all the dies were on the same wafer. Out of the 24 structures measured, only one failed, with overlapping channels due to process variation, giving a yield of 96%. Based on our standard deviation of channel spacing variation, the yield is expected to exceed 99% for a large sample count.

Based on the process variation data calibrated from measurement, the design of a 16-channel WDM multiplexer proceeds as follows:

Design goal: In this analysis we target a 16-channel WDM design and a yield of 68.27%, which corresponds to one standard deviation. The channels should be designed to fit in the minimal bandwidth for the given yield and channel count.

Design parameters: According to the T-matrix model, the design parameters are the effective index $n_{eff}$ and the radius $R$. The designer has less control over the effective index, since the waveguide dimensions are designed for minimal loss (and least sensitivity to process variation). Our target design has a channel spacing of 0.84 nm to realize our 68.27% yield. Hence, we need to design the radius

$R$ accordingly.

Design process: From our T-matrix models, the resonant wavelength $\lambda$, effective index $n_{eff}$, and radius $R$ are related by $m\lambda = 2\pi n_{eff} R$. Using perturbation theory, one can model process variation in this formula:

$$m\,\Delta\lambda = 2\pi\left(R\,\Delta n_{eff} + n_{eff}\,\Delta R\right), \qquad (5.13)$$

where $\Delta\lambda$ is the channel spacing, $\Delta R$ is the difference in radius between micro-rings in the structure, and $\Delta n_{eff}$ quantifies the process variation in waveguide thickness and width. By plugging in the values of $n_{eff}$ and $\Delta n_{eff}$ from our measurement results and $\Delta\lambda$ for our target design, one can determine the radius change $\Delta R$ between subsequent micro-rings in the structure. The final design radii and corresponding wavelengths are provided in Figure 5.13.

[Figure 5.15: Power and reliability trade-offs in micro-ring design based on dopant concentration in the micro-ring. Axes: optical power (dB/cm) and bit-error rate (error/sec) versus doping concentration (cm$^{-3}$).]

Through our simple T-matrix model we have accurately designed a new 16-channel WDM filter, knowing the yield in advance and with low computational overhead. If conventional electromagnetic simulations were used for the design instead, the process would take longer, with more iterations and less confidence in the yield.

5.3 Reliability-Aware Architecture Analysis and Design

The second level in our design abstraction is the network level. This level is essential for analyzing the performance of the network under realistic variations and loads, defining

a Pareto-space of design points, and designing the network in detail in terms of topology, optical power, and bandwidth to meet the power and latency requirements of the system. In this section, we present a simulation framework for reliability-aware analysis and design, a case study on the reliability analysis of different designs, and the design guidelines that result from our analysis-phase simulations.

[Figure 5.16: The simulation infrastructure for evaluating the power, performance, and reliability of on-chip optical networks. Components: 3D integration and layout information; Wattch power model; ISAC thermal analysis; benchmarks (SPECOMP, SPLASH2, PARSEC); M5 full-system simulator for multi-core architectures; cycle-accurate NoC network-cache simulator modeling network, cache, and coherence protocol behaviors; power, performance, and reliability analysis; fabricated devices (micro-rings, racetracks, etc.) with physical measurements and characterizations feeding the T-matrix device library.]

Reliability-aware network-on-chip simulator

In order to perform analysis from the device level through the network and system levels to the application level, our simulation infrastructure includes all of the tools depicted in Figure 5.16. The M5 full-system simulator [90] models many-core processors running multi-threaded benchmarks such as PARSEC [112], SPECOMP, and SPLASH2 [92]. Cache traces with time stamps can be generated from this simulator to stress the network under evaluation with the actual characteristics of on-chip communication. Wattch [113] and ISAC [114] have been built into the full-system simulator to determine the ambient temperature for the nanophotonic devices. A cycle-accurate network-cache simulator models the network design and the cache coherence protocol. This simulator accepts the realistic traces from the full-system simulator to characterize the behavior, power usage, and performance of the nanophotonic network. Reliability models evaluated from

experimental measurements are also built into this network-cache simulator to model the possible performance degradation due to spatial thermal and process variation. Thermal variation is fed dynamically from the thermal analyzer ISAC. Process variations at the one-σ level are determined by scanning electron microscope (SEM) measurements collected from fabricated devices. The reliability models described in Chapter 4 characterize the physical properties of the nanophotonic devices in a fast and accurate manner. Supplementary electrical power of network components, such as routers and links, is modeled using ORION 2.0 [99].

[Figure 5.17: Performance decomposition and comparison in ideal environment (left) and in presence of process and thermal variations (right). Y-axis: L2 cache miss latency (cycles), decomposed into Memory, Response, Protocol, and Request. Networks: Iris, Mesh, Bus, Clos, Corona. Benchmarks (left): ammp, applu, apsi, cholesky, fluidanimate, freqmine, lu, oceannon, radix, waternsq. Benchmarks (right): ammp, applu, apsi, fluidanimate.]

Case Studies: Reliability-Analysis

In this section, using the proposed modeling and analysis framework, we study the performance, power consumption, and reliability of five network designs. These designs differ in topology, floorplan, resource allocation, and more. They represent a wide variety of state-of-the-art nanophotonic on-chip interconnect designs. We evaluate these designs in a realistic process and thermal variation environment (see Section 5.3.1) while stressing them with realistic many-core traffic.

Corona [23]: This is a serpentine multi-write single-read crossbar network.
Each waveguide with WDM support is statically dedicated to a single read node, and all other nodes that want to write compete by cycling tokens around the nodes. Network performance is limited by the token cycling that requires 8 clock cycles per round trip, and by the directory

coherence protocol.

[Figure 5.18: Power decomposition and comparison in ideal environment (left) and in presence of process and thermal variations (right). Y-axis: power (W), decomposed into optical light source, electrical routers and links, and electrical-optical conversion. Networks: Iris, Mesh, Bus, Clos, Corona. Benchmarks (left): ammp, applu, apsi, cholesky, fluidanimate, freqmine, lu, oceannon, radix, waternsq. Benchmarks (right): ammp, applu, apsi, fluidanimate.]

Bus [24]: This network consists of a set of snake-shaped single-write multi-read waveguides connecting on-chip processing units. The optical signal is broadcast along each waveguide to support a snoopy coherence protocol. Power consumption is a challenge here, as the input optical power grows super-linearly with the number of cores.

Mesh [115]: This is a circuit-switched mesh network based on channel photonic waveguides, supporting the directory protocol. The network includes a latency-optimized electrical network to set up photonic circuit-switched paths.

Clos [80]: This network uses three stages of intermediate electrical routers to create a larger non-blocking all-to-all network. Single-read single-write WDM waveguides transmit signals from one stage to another. Arbitration is provided in the electrical routers, which introduce additional latency but provide higher throughput and alternative paths.
Iris [34]: Iris consists of two subnetwork components: 1) a low-latency optical-antenna-based broadcast network to support short, often-multicast, latency-critical on-chip traffic, and 2) a high-throughput channel-waveguide-based circuit-switched mesh network for throughput-constrained data packet transfer. These two subnetworks operate in tandem to efficiently support the snoopy protocol. However, the broadcast network also brings

reliability challenges.

All of the above networks have been configured to serve a 64-node many-core processor. Each node has an Alpha core running at 2 GHz, a 16 KB private L1 cache, and a 16 MB shared L2 cache. The optical network is characterized by CMOS-compatible SoI fabrication technology, 64 WDM channels per waveguide, 4 Gbps per WDM channel, a 320 fJ/bit transmitter with a modulator [18], and a 690 fJ/bit receiver with a detector [21].

Ten multi-threaded benchmarks have been used in the proposed simulation infrastructure to stress the underlying network designs. These benchmarks are selected from three representative benchmark sets: PARSEC [112], SPLASH2 [92], and SPECOMP. They represent the current and future on-chip communication demand at the application level.

Performance evaluation: Figure 5.17 shows the ideal network performance (no variation effects) of each design (left), and the degradation due to variation-induced reliability effects (right). L2 cache miss latency is chosen as a representative performance metric because it includes the transmission latency of both the request and response packets needed to serve one cache transaction. As can be seen, the simulation infrastructure is able to model all the network behaviors under realistic traffic patterns. Furthermore, the latency can be decomposed to unveil potential performance bottlenecks. As shown in Figure 5.17, there are four components: cache miss request latency (labeled "Request"), protocol transaction latency (labeled "Protocol"), memory access latency (labeled "Memory"), and acknowledgment or data reply latency (labeled "Response"). In the absence of process and thermal variation, nanophotonic interconnect is able to deliver high communication performance, especially Iris and Bus, which leverage the on-chip broadcast snoopy protocol to accelerate locating outstanding cache copies.
It is also worth noting that Corona has higher latency because it uses a directory-based protocol and has a long token round-trip latency. Similarly, Mesh has a high latency because of the slow directory-based protocol and electrical

network to set up the optical path. In the presence of process and thermal variation, most of these networks would not operate, especially when running applications that generate heterogeneous thermal profiles, as complete passband mismatch leads to communication failures. Even for the four thermally balanced applications (shown on the right of Figure 5.17), network performance is still seriously affected by process variation. The excessive packet re-transmission latency caused by the significant bit error rate makes the network unreliable. This effect appears as L2 cache miss latencies that extend beyond reasonable values, and performance is greatly degraded. Some networks, including Clos and Mesh, show tolerance to process variation because they can provide alternative paths for traffic. In contrast, for broadcast-based architectures like Iris and Bus, communication requires the participation of all nodes and is vulnerable to transmission errors.

Power evaluation: Figure 5.18 characterizes the power efficiency of all the network designs for different applications with and without process and thermal variation. The total power is composed of the optical power injected by the light source, the electrical power dissipated in links and routers, and the electro-optical power consumed by modulators, switches, and receivers. The figure on the left shows the power consumption without any variations, while the figure on the right shows the light source power required to guarantee network reliability in the presence of thermal and process variation. As can be seen, some of the network designs have poor power efficiency even in the ideal situation. For instance, Bus leverages cascaded splitters to implement broadcast, which leads to approximately 20 dB loss in a 64-node network.
In a realistic environment, six of the ten benchmarks encounter complete mismatch between transmitter and receiver, so no increase in power will resolve the mismatch. Hence, in the right figure we focus on the four benchmarks that have almost homogeneous thermal profiles, for which an increase in light source power can reduce the bit error rate to $10^{-12}$, our nominal level. Some designs, such as Corona, require high optical power levels close to the non-linear limit. Other designs exhibit more resilience to variations. For instance, Clos has a short optical path, low optical power loss, and several intermediate electrical relay points implemented as electrical routers, which make it more resilient to power loss due to passband mismatch. Intermediate electrical relay points regenerate and reroute the optical signal, thus avoiding signal loss and crosstalk at optical switches. However, this comes with an electrical power overhead in the variation-free case.

[Figure 5.19: Performance comparison of 16 WDM channel setup (left) and 32 channel setup (right) in presence of process and thermal variation. Y-axis: L2 cache miss latency (cycles), decomposed into Memory, Response, Protocol, and Request. Networks: Iris, Mesh, Bus, Clos, Corona. Benchmarks (left): ammp, applu, apsi, cholesky, fluidanimate, freqmine, lu, oceannon, radix, waternsq. Benchmarks (right): ammp, applu, apsi, cholesky, fluidanimate, freqmine, lu, oceannon, radix.]

WDM and reliability evaluation: The number of WDM channels provides a trade-off between reliability and performance. As the channel count decreases, the bandwidth of the network decreases, and system throughput is degraded. However, the optical bandwidth per wavelength increases if we operate within the fixed C+L band (1,530 nm to 1,625 nm). From a reliability perspective, more bandwidth per channel means tolerating more mismatch.
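As a rough illustration of this trade-off, the sketch below divides the fixed C+L band evenly among the WDM channels and converts the per-channel width into a tolerable temperature drift. The even split with no guard bands and the use of the 0.11 nm per degree micro-ring thermal sensitivity (the value listed in Table 6.2) are assumptions of this sketch, and the function names are hypothetical.

```python
# Sketch: per-channel optical bandwidth within the fixed C+L band.
# Assumption: the band is split evenly among channels (no guard bands).

C_L_BAND_NM = (1530.0, 1625.0)    # band edges quoted in the text
THERMAL_SHIFT_NM_PER_DEG = 0.11   # assumed micro-ring sensitivity (Table 6.2)


def channel_width_nm(num_channels: int) -> float:
    """Width of one WDM channel when the band is divided evenly."""
    low, high = C_L_BAND_NM
    return (high - low) / num_channels


def tolerable_temperature_shift(num_channels: int) -> float:
    """Temperature change that shifts a passband by one full channel width."""
    return channel_width_nm(num_channels) / THERMAL_SHIFT_NM_PER_DEG


if __name__ == "__main__":
    for n in (16, 32, 64, 128):
        print(f"{n:3d} channels: {channel_width_nm(n):.2f} nm per channel, "
              f"about {tolerable_temperature_shift(n):.0f} degrees of drift")
```

Under these assumptions, 32 channels give roughly 3 nm per channel and 128 channels roughly 0.75 nm, in line with the figures used in the bandwidth design guideline later in this section.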
Such trade-offs have been studied in Figure. We have reduced the WDM channels to 32 and 16 for all the network designs, and compared performance in the presence of process and thermal variation. For the 32 WDM channel setup, one of the benchmarks (waternsq) still results in a heterogeneous thermal profile that prevents the network from functioning. For the 16 WDM channel setup, the thermal variations of all benchmarks are tolerable, but the nanophotonic network shows performance degradation due to the limited bandwidth. The studies in Figure 5.17 and Figure 5.19 demonstrate the performance and reliability trade-offs in nanophotonic interconnect

designs. With more WDM channels, the nanophotonic networks are more susceptible to process and thermal variation: more communication errors occur, data re-transmission becomes more frequent, and overall system performance degrades. On the other hand, a setup with fewer WDM channels introduces more contention for communication resources and longer serialization latency when transmitting large packets. Different networks show different trade-offs, which should be examined carefully in the simulation infrastructure. For instance, Corona relies on WDM to provide point-to-point crossbar functionality, and severe performance degradation has been observed when going from 32 to 16 WDM channels. On the other hand, Clos can still provide decent performance with limited channels.

[Figure 5.20: Design space exploration in the architecture domain by varying power, channel bandwidth, and network. Axes: power (W) and latency (clock cycles). Design points: Iris-32Channel-15XPower, Iris-16Channel-1XPower, Iris-No Variations, Corona-32Channel-15XPower, Corona-16Channel-1XPower, Corona-No Variations, CBC-32Channel-5XPower, CBC-16Channel-1XPower, CBC-No Variations, Electrical-32Channel, Electrical-16Channel.]

Reliability-Aware Architecture Design Guidelines

The design parameters under the architect's control are topology, flow control, routing, power budget, error correction and detection, and bandwidth. These parameters need to be decided in light of variations. For example, the loss along each optical path needs to be selected such that the network will operate reliably under a variety of workloads. As we will see, worst-case design leads to unnecessary overhead; worst cases are best treated with active control. The design parameters need to be studied in light of the design goals. Due to the strong correlation between the different design goals, (1) performance, (2) power, and (3) reliability,

designing in this three-dimensional space becomes more complicated, and a simulation framework is necessary to show the architect the Pareto space of design points. Simulation frameworks based on reliability models can do this job. They provide analysis of performance in terms of latency and bandwidth, power in terms of optical and electrical power, and reliability in terms of bit error rate. The goal of this design space exploration step is two-fold: first, to assess our architecture design options, and second, to quantitatively understand the slack in our power budget and performance constraints. Next, we demonstrate through a case study how to perform an analysis step, produce the Pareto space of design points, and understand our decision space.

Our simulation study analyzes the impact of variations on the network and defines the design space for an architect that accounts for the reliability of a silicon photonic network. We simulate five different networks: (1) an electrical mesh network [116], (2) Iris [8], (3) Corona [23], (4) the Macrochip network [25], and (5) the Channel-Borrowing Crossbar (CBC) [117]. The network configuration includes 64 nodes. The benchmark lu from SPLASH2 [92] is used, which has a spatial thermal variation of 15 °C, and process variations are based on the process variation analysis conducted in our lab, where a standard deviation of 1.2 nm in resonant wavelength is assumed. We leverage our simulation framework that models performance, power, and reliability, where error detection and re-transmission are assumed to accommodate communication errors. Two parameters were varied during simulations. The bandwidth per link (the number of bits that can be transmitted concurrently) was varied, considering 64, 32, 16, and 4 channels per link for the optical network (through different resonators including micro-rings, racetracks, and directional couplers), and 32-, 16-, and 4-bit links in the electrical network.
Moreover, we have orthogonally varied the input optical power for the silicon photonic networks, increasing it by factors of 2x, 3x, 5x, 10x, and 15x. Figure 5.20 shows the results for the networks within an acceptable power and performance design space (results beyond this range are omitted for clarity). From this study we can verify our design guidelines as follows.

Optical path design (topology and flow control): The topology and flow control are crucial to the tolerance of the network to variations. Topologies to date fall into one of five categories: (1) Single-Write-Multiple-Read (SWMR) networks that leverage a serpentine link that broadcasts data [24, 118]; (2) Multiple-Write-Single-Read (MWSR) networks, also called crossbars, that leverage a shared serpentine link to deliver data to a single receiver [119, 23, 120, 78]; (3) repeated-link networks, which use short optical links where the optical signal is regenerated through electrical routers [80, 121]; (4) antenna-based broadcast networks such as Iris [8]; and (5) optical-switch-based designs such as the mesh network in the work of Petracca et al. [115] and Iris [8]. First, repeated-link networks are favored over SWMR and MWSR networks due to their short optical paths with lower optical loss, where high signal power levels tolerate higher noise and variation levels. Optical switches are a major source of crosstalk in optical networks [122]. Switches provide flexibility in network design, but the number of switches along an optical path should be minimized through careful design. Broadcast-based and SWMR networks mandate that all nodes in the system are matched, since they all participate in the communication. This adds constraints from a reliability perspective.

Our simulations confirm these guidelines through the power and latency results they provide. The macrochip network survived the process variations at high channel count (32 channels) due to its very short optical path. However, its latency was too high to include in the design space due to the small bandwidth allocated per core-to-core communication link. Meanwhile, broadcast-based networks such as Corona suffer from sensitivity to variations, where any mismatch at one of the communicating nodes leads to re-transmission; hence the high latency of these networks.
Finally, unicast networks such as the channel-borrowing crossbar are less sensitive to variations and can operate at lower optical power levels, but are more sensitive to bandwidth variations.

Bandwidth design: The number of WDM channels is another factor in the design decision. The more WDM channels used, the smaller the bandwidth per channel. Narrow channels mean less tolerance to variations. Assume a channel bandwidth of 3 nm for a 32-channel network. It can then tolerate up to 3 nm of passband shift, which is equivalent to 27 °C of thermal variation. A 128-channel network, on the other hand, can tolerate only 0.75 nm, which is equivalent to 7 °C of thermal variation. However, a 128-channel system has higher bandwidth and better performance

under ideal conditions. Our simulation study for our specific design space gives more detail, as follows. At high bandwidth (64 channels) all optical networks fail, and at low channel count (four channels) the variation has negligible influence but the bandwidth is so small that the latency deteriorates beyond our acceptable design space, as shown for all the networks under study in Figure. At nominal power levels, at most 16 channels can be used to accommodate the variations in the system. However, the performance suffers from temporal variations (some packets cannot be received) and low bandwidth.

Error detection and correction: Redundancy in data through added error detection and correction bits can greatly improve reliability. Nitta et al. propose using error detection codes and forward error correction codes to detect and correct different types of errors in the network, including inter-channel crosstalk [123] and partial and total passband mismatch [124]. The number of data channels used for error detection and correction is a tradeoff between reliability and performance.

Power budget design: The input laser power of the network can play a major role in reliability. Increasing the input laser power increases the signal level and consequently improves the signal-to-noise ratio and bit error rate. This works in the case of partial mismatch between sender and receiver(s). The drawback is that more power is put into the system, which increases the power budget, a major design criterion, and introduces thermal reliability problems in the CMOS many-core layer. Therefore, a tradeoff between power and reliability arises, and careful design is necessary to balance the two demands [7].
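To make the link between laser power and bit error rate concrete, the sketch below uses the textbook Gaussian-noise relation BER = 0.5 erfc(Q / sqrt(2)). This model is not taken from the thesis; it is an assumption of this sketch, as is the further simplification that Q scales linearly with received optical power (a receiver-noise-limited link). The function names are hypothetical.

```python
import math


def ber_from_q(q: float) -> float:
    """Bit error rate for a given Q factor under Gaussian noise."""
    return 0.5 * math.erfc(q / math.sqrt(2.0))


def q_from_ber(ber: float, lo: float = 0.0, hi: float = 40.0) -> float:
    """Invert ber_from_q by bisection (BER decreases monotonically with Q)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if ber_from_q(mid) > ber:
            lo = mid   # BER still too high, so a larger Q is needed
        else:
            hi = mid
    return 0.5 * (lo + hi)


def power_factor_needed(ber_now: float, ber_target: float) -> float:
    """Optical power multiplier to move from ber_now to ber_target.

    Assumes Q is proportional to received optical power.
    """
    return q_from_ber(ber_target) / q_from_ber(ber_now)


if __name__ == "__main__":
    # Example: a partially mismatched link at BER 1e-3 brought back to 1e-12.
    print(f"Q at 1e-12: {q_from_ber(1e-12):.2f}")
    print(f"power factor 1e-3 -> 1e-12: {power_factor_needed(1e-3, 1e-12):.2f}")
```

Under this simplified model a modest power increase recovers a partially mismatched link, while a complete mismatch (no signal in the passband) cannot be recovered by any power factor, which is consistent with the behavior described above.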
Quantitatively, our simulation results verify this as follows: at 32 channels, a 15X optical power increase for broadcast-based networks (like Corona and Iris) and a 5X optical power increase for other networks (like the channel-borrowing crossbar) are needed to reduce the latency to our acceptable design space in Figure.

Fabrication cost: As the number of sub-networks increases, the number of CMOS layers in the 3D heterogeneous stack increases; thus, yield is reduced and cost is increased. From a reliability point of view, there are multiple points of potential failure, which negatively impacts the reliability of communication. In our design space, Iris and Corona leverage two networks, which implies two SOI layers in the 3D stack. Hence, they have higher cost and lower reliability, unlike the channel-borrowing crossbar

which would be lower cost. From this study we conclude that Iris dominates in terms of power and performance but at a higher cost, while Corona and the channel-borrowing crossbar provide a Pareto space of designs. As for electrical networks, they cannot compete with their optical counterparts and will eventually be replaced. However, this study also shows that variations hinder exploiting the full potential of on-chip nanophotonic networks. For instance, Iris, which represents the best power-performance point, has a 3X higher latency than in the ideal case (no variations). Hence, additional techniques are necessary to mitigate the impact of variations and unleash the potential of silicon photonics, which promises very low latency at low power levels, as earlier studies of Iris [8] for 64 cores and Atac [125] for a thousand cores indicate.

5.4 Summary

Earlier solutions to reliability problems focused on high-level system solutions that integrate different techniques. However, these solutions are power-inefficient and unscalable. This has motivated us to consider the whole problem from the first step of design. Instead of proposing a solution, we propose a reliability-aware flow. This flow is a two-step process and results in a reliability management solution. Our flow is based on the T-matrix model, which enables us to analyze and design our devices and network in a time-efficient manner and with high accuracy. Based on these models we propose a reliability management solution to be integrated with the network to relieve the system of the impact of variations on communication. This will be the topic of the next chapter.
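The Pareto space of design points used in this chapter's design-space exploration can be extracted from simulation output with a simple dominance filter. The sketch below is a minimal illustration, assuming each design point is summarized by its total power and average latency, with lower being better for both. The `DesignPoint` type, the function names, and the numeric values are hypothetical placeholders, not results from Figure 5.20.

```python
from typing import List, NamedTuple


class DesignPoint(NamedTuple):
    name: str
    power_w: float          # total network power (lower is better)
    latency_cycles: float   # average latency (lower is better)


def dominates(a: DesignPoint, b: DesignPoint) -> bool:
    """True if `a` is no worse than `b` in both metrics and better in one."""
    no_worse = a.power_w <= b.power_w and a.latency_cycles <= b.latency_cycles
    better = a.power_w < b.power_w or a.latency_cycles < b.latency_cycles
    return no_worse and better


def pareto_front(points: List[DesignPoint]) -> List[DesignPoint]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Hypothetical placeholder values, for illustration only.
    candidates = [
        DesignPoint("net-A-16ch-1x", 2.0, 400.0),
        DesignPoint("net-A-32ch-15x", 6.0, 150.0),
        DesignPoint("net-B-16ch-1x", 3.0, 450.0),  # dominated by net-A-16ch-1x
        DesignPoint("net-C-32ch-5x", 4.0, 250.0),
    ]
    for p in pareto_front(candidates):
        print(p.name, p.power_w, p.latency_cycles)
```

A third metric such as bit error rate could be added to `DesignPoint` and to `dominates` in the same way; the two-metric form is kept here only to mirror the power and latency axes of the design-space plot.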

Chapter 6 Reliability management solutions ¹

In Chapter 5 we showed a design flow that can help us decide on a reliability management solution based on the device constraints and system requirements. In this chapter we show different approaches to reliability management. We start with device-level techniques: we present a survey of different approaches and indicate their limitations. Some of these techniques can address variations, but at a hefty cost that makes them unsuitable for on-chip nanophotonic networks. Others have limited effect and cannot address the full range of variations. In the following section, we present one of our early reliability management solutions. This solution is an improvement over earlier device-level reliability management solutions and one of the first system-level solutions to reliability in the literature. It is based on an initial understanding of the problem and the design space. Finally, in the last section we present our final reliability management solution, which is based on our reliability-aware design flow. We compare this solution to other reliability management solutions to date. Our latest solution shows small power overhead and negligible performance overhead. Moreover, as we show, our solution is scalable to future thousand-core architectures.

¹ The content of this thesis is copied from publications that I have published or that are under review [3, 4, 5, 6, 7, 8, 9, 10].

Device-level Reliability Management

Existing reliability management techniques are mainly developed at the device level, but these device-only techniques are not cost-effective by themselves. To prove this point and provide the

Table 6.1: Athermal devices characteristics
Column headings: Reliability Technique; Extra Optical Loss; Electrical Power; Switching Speed; Post-fabrication Step; Additional Area; Sensitivity; Temperature Range; Wavelength Range
Voltage Tuning [126]: 2-5 dB; 4.52 mW/bit; 1 Gbps; Doping; High; C+L band
Bandwidth Tuning [127]: 0 dB; 26.4 mW/bit; few Mbps; Add Heater; Low; C+L band
Thermal Tuning [128]: 0 dB; 0.2 mW/bit; >6 Kbps; Add Heater; Low; C+L band
Polymer Cladding I [129]: 0 dB; 0 mW; Gbps; Polymer coating; High; -; 5 pm/°C; nm
Polymer Cladding II [130]: 0 dB; 0 mW; Gbps; Polymer coating; High; pm/°C; nm
Polymer Cladding III [131]: 0 dB; 0 mW; Gbps; Polymer coating; High; pm/°C; nm
Slotted Device [132]: 5-8 dB; 0 mW; Gbps; Polymer coating; High; -; 52 pm/°C; nm
All-Polymer waveguide [133]: 0 dB; 0 mW; unverified; CMOS-Incompatible; High; several mm; pm/°C; nm
Couple to MZI [134]: 3 dB; 0 mW; Gbps; -; High; µm; nm
Stress Control [135]: unknown; 0 mW; Gbps; Several steps; Low; pm/°C; nm
Micro-fluid Tuning [136]: unknown; 0 mW; slow; Bond Micro-fluid Chip; Low; nm
Evanescent Field Perturbation [137]: unknown; 0 mW; slow; -; Mechanical Probe area; -; > nm

system designer with insight into device-level techniques, we conduct a survey of existing device-level reliability management options for mitigating the impact of process and thermal variations. The techniques are discussed from a system perspective, and we evaluate them from the point of view of designing an on-chip optical network while taking yield into account. We present a comparison of different techniques for variation-resilient design in Table 6.1 in terms of their capabilities and overhead. We have categorized these techniques based on their targeted variation: athermal techniques and process-variation-immune techniques.

Athermal Silicon Photonic Devices

Herein, we present the different device-level techniques to mitigate thermal variations. We focus on techniques that apply to micro-ring devices, although some of the presented approaches could also be applied to other devices. There are three main categories: the first is tuning the characteristics of the micro-ring, the second is using polymers to counteract the effect of silicon's high thermo-optic coefficient, and the third includes various less promising techniques. First, we discuss tuning-based approaches:

Voltage tuning: The resonant wavelength of a micro-ring can be tuned by carrier injection/depletion, which alters the carrier concentration in the silicon waveguide and leads to a blue-shift in resonant wavelength [126, 39]. However, the tuning range of micro-rings in this approach is a few nanometers, since the optical loss increases exponentially as the tuning range increases [3, 100]. Manipatruni et al. demonstrated athermal micro-ring operation over a temperature range of 15 °C, which corresponds to 1.65 nm [126]. Hence, this approach is limited to narrow-range tuning.

Bandwidth tuning: By re-structuring the coupling region and using the thermo-optic effect of silicon, one can tune the bandwidth of the micro-ring from 0.01 nm to a few nanometers. Chen et al.
demonstrated the technique through fabrication and measurements where they achieved a tuning range of nm [127]. By increasing the bandwidth of the micro-ring,

shifts in resonant wavelength can be tolerated. However, since the bandwidth is still small in this experiment, bandwidth tuning can only tolerate variations less than

Thermal tuning: In this approach, a micro-heater heats up the micro-ring and leads to a red-shift in the resonant wavelength. Several groups have demonstrated thermal tuning for micro-rings across a wide range of wavelengths [109]. The basic idea is to counteract the shift in resonant wavelength at one side of the communication by heating up the micro-ring at the other side. This approach has a negative impact on the system from two perspectives: first, it greatly increases the total power consumption of the system, making it inefficient [101, 3, 100]; second, it increases the temperature of the system, which leads to thermal reliability problems in the many-core system [100].

The second class of approaches leverages polymers and their thermal qualities to provide an athermal device. Using polymers involves an extra post-processing step, but no complicated control is required at run-time, which puts them at an advantage in comparison to class-one techniques. The main idea is to guide part or all of the light in the polymer, which is athermal. This can be achieved through (1) polymer cladding with a negative thermo-optic coefficient, which works for all silicon devices and introduces negligible overhead; (2) slotted structures, which exhibit higher loss than ridge structures and hence are used for limited distances and do not fully resolve the problem; and (3) all-polymer waveguides (no silicon involved), which have a large area and cannot be used on-chip.

Our third class of athermal micro-rings includes a miscellaneous set of approaches; however, they are less promising than the techniques above, and we include them for completeness in Table 6.1.

In conclusion, different techniques come with different overheads or limitations.
As shown in Table 6.1, the second class of techniques of athermal polymer are the most promising techniques to eliminate thermal variations. Moreover, they have little overhead which is basically an extra postprocessing step. However, using polymer for thermal variations does not allow us to use polymer

for process variations, as we show in the next section. As for class-one techniques, they come with a cost or limited effect, as we show in the case study on designing optical links. Finally, class-three techniques require more improvement to become practical for the on-chip communication domain.

Process-Variation-Immune Silicon Photonic Devices

Techniques to overcome process variations overlap with athermal device management techniques. More specifically, the tuning techniques of athermal micro-ring designs listed in Table 6.1 can be used to counteract process variations. These include thermal tuning, bandwidth tuning, and voltage tuning; the main difference is that tuning for process variation is done once, after fabrication. Since there is no need for run-time management, the implementation is easier.

Besides tuning, trimming is another technique for process variations. In this technique, the micro-ring or silicon photonic device is coated with an ultraviolet (UV) sensitive polymer overlay. Later, the device is measured to determine the target refractive index of the polymer needed to achieve the required wavelength spectrum. The polymer is then exposed to a predetermined dose of UV to achieve the target properties [138]. Similar approaches have been demonstrated using photobleaching instead of UV exposure, with different polymers [139]. The main challenge is that each micro-ring exhibits a different range of process variations and requires a different dose of UV; thus a different mask per micro-ring is required, adding to the total cost of the system. As a compromise, one can leverage the locality of process variations to apply the same dose of UV to the same local region and reduce the mismatch in resonant wavelength of non-local devices.
To compare the various device-level reliability management techniques at the system level, we use yield (the number of chips that lie within a pre-specified range of design parameters) as our metric, and we compute the probability that the passbands of a sender and a receiver overlap, given a Gaussian distribution of resonant wavelength. This simply involves integrating the tuning range of the applied technique across a three-standard-deviation range of process variation to find the probability of overlap. The standard deviation of process variation of a micro-ring is 1.2 nm

according to our measurements. Meanwhile, the tuning range of the different techniques is given in Table 6.1. Based on this method, we present representative results to compare device-level techniques. Thermal tuning can totally eliminate the impact of variations, at a high power and switching-speed cost, giving a yield of 100%. Voltage tuning can eliminate the impact of process variations with a yield of 82% and of thermal variations with a yield of 68%, in both cases at a lower power cost. However, it cannot overcome both variations. Athermal polymer can eliminate the run-time overhead at a low cost but cannot counteract process variations, giving a yield of 82%. Finally, trimming can reduce process variations but not thermal variations, giving a yield of 68%. No single solution can solve the reliability problem without significant overhead. Hence, a system-level view is necessary. Next, we demonstrate the tradeoffs involved in different device-level reliability management techniques through a case study.

Case Study: Reliability-aware design and management of a single channel optical link

In this case study, we design a single-channel WDM optical link that can operate reliably under process and thermal variations. We leverage our analysis of process and thermal variations for different WDM structures, including micro-rings, racetracks, and directional couplers, to make device design and reliability management decisions. In our analysis we account for power and yield. Our process variation analysis is for specific designs, and changes in the design specification will lead to different process variations, as our analysis shows. For instance, changes in bending radius or gap can adversely affect the standard deviation of process variations. More specifically, we focus on the following designs for our analysis:

Micro-ring: The micro-ring has a radius of 4.98 µm and a gap of 200 nm.
The design was fabricated and measured on 12 dies, replicated twice per die, on the same wafer. The wafer was fabricated at LETI [40].

Racetrack: The racetrack has a bending radius of 3 µm, a coupling length of 7 µm, and a gap of 130 nm (the minimal feature size). The design was fabricated and measured on 18 dies on the same wafer. The wafer was fabricated at IMEC [40].

Directional Coupler: The directional coupler design we use has a 130 nm gap (the minimal feature size) and a length of 1063 µm. The design was fabricated and measured on 18 different dies on the same wafer. The wafer was fabricated at IMEC [40].

Table 6.2: Process and thermal variations analysis of the devices.

| Device | Process variations: µ | Process variations: σ | Thermal variations |
|---|---|---|---|
| Micro-ring | [missing] nm | 1.2 nm | 0.11 nm/°C |
| Racetrack | [missing] nm | 2.16 nm | 0.09 nm/°C |
| Directional Coupler | 1550 nm | 1.12 nm | 0.12 nm/°C |

To design a reliable optical link with minimal power overhead, the designer needs to account for the following factors that impact the operation of silicon photonic devices:

Variation in resonant wavelengths: Reliability analysis provides the standard deviation of the resonant wavelength due to process and thermal variations. The required tuning power is determined by such variations. For the three types of devices, a summary of our variation analysis is provided in Table 6.2.

Channel Full-Width-Half-Maximum (FWHM) and Free Spectral Range (FSR): Both FWHM and FSR impact the sensitivity of device operation to variations. The larger the FWHM of the resonant peak, the more tolerant the channel is to variations. The shorter the FSR, the smaller the tuning range needed. Among the three types of devices, the directional coupler has the widest FWHM and the smallest FSR, implying the lowest tuning range, followed by racetracks, and finally micro-rings, as shown in Table 6.3.

Table 6.3: Resonant wavelength characteristics of the devices

| Device | FWHM (nm) | FSR (nm) |
|---|---|---|
| Micro-ring | 0.65 | 16.5 |
| Racetrack | 0.85 | 15.7 |
| Directional Coupler | 3.56 | 6.7 |

Table 6.4: Thermal tuning power of different optical links

| Device | Average case: tuning range | Average case: power | Average case: yield | Worst case: tuning range | Worst case: power | Worst case: yield |
|---|---|---|---|---|---|---|
| Micro-ring | 1.06 nm | [missing] mW | 68.27% | 8.1 nm | 1.62 mW | 99.9% |
| Racetrack | 1.81 nm | [missing] mW | 68.27% | 7.5 nm | 1.5 mW | 99.9% |
| Directional Coupler | 0 nm | 0 mW | 68.27% | 0.29 nm | [missing] mW | 99.9% |

Next, we compare the different device-level reliability management techniques for an optical link given the design goal (minimizing tuning power for a given yield) and the design parameters (tuning technique and device choice). The tuning techniques listed in Table 6.1 have different overheads. We focus on voltage tuning and thermal tuning, as in the first portion of Table 6.1, since they can compensate for both process and thermal variations. The comparison is based on the electrical tuning power computed for each device and tuning technique for both the average and the worst case. The worst case puts an upper limit on power dissipation, while the average case gives a statistical value for power dissipation, which is useful for a multi-channel network with multiple links. Our models from Chapter 5 provide the variation ranges of the effective index due to process and thermal variations; consequently, a simple analysis (from phase 1 of our design flow) gives us the resonant wavelength of the different devices. According to our calculations, the minimal overlap is 0.5 nm for a light source of 780 µW/nm and an optical path loss of 10 dB. Hence, the total worst-case power and the total average (statistical) case power are given by:

$$P_{total} = P_{tuning}\,\frac{FSR - 2\,FWHM + 1}{2} \tag{6.1}$$

$$P_{total} = \frac{2P_{tuning}}{\sigma\sqrt{2\pi}} \int_{\mu}^{\mu+\sigma} (\lambda - \mu)\, e^{-\frac{1}{2}\left(\frac{\lambda - \mu}{\sigma}\right)^{2}}\, d\lambda \tag{6.2}$$

where $P_{total}$ is the total tuning power for the respective tuning technique, $P_{tuning}$ is the tuning power per nanometer, $FSR$ is the free spectral range, $FWHM$ is the full width at half maximum, $\mu$ is the mean of the resonant wavelength, $\sigma$ is the standard deviation of process variations, and $\lambda$ is the resonant wavelength.
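As a rough numerical check, the two expressions can be evaluated directly. The sketch below reads Equation 6.1 as P_total = P_tuning (FSR - 2 FWHM + 1) / 2 and Equation 6.2 as the Gaussian-weighted integral from µ to µ + σ. It assumes a tuning power of 0.2 mW/nm, a value inferred from the worst-case thermal rows of Table 6.4 (1.62 mW over 8.1 nm) rather than stated in the text. All function and variable names are illustrative.

```python
import math

# Assumption: tuning power per nanometer, inferred from Table 6.4
# (1.62 mW / 8.1 nm and 1.5 mW / 7.5 nm both give 0.2 mW/nm).
P_TUNING_MW_PER_NM = 0.2

# FWHM and FSR in nm (Table 6.3); sigma of process variation in nm (Table 6.2).
DEVICES = {
    "micro-ring": {"fwhm": 0.65, "fsr": 16.5, "sigma": 1.2},
    "racetrack": {"fwhm": 0.85, "fsr": 15.7, "sigma": 2.16},
    "directional coupler": {"fwhm": 3.56, "fsr": 6.7, "sigma": 1.12},
}


def worst_case_range_nm(fsr, fwhm):
    """Worst-case tuning range: the (FSR - 2*FWHM + 1) / 2 factor of Eq. 6.1."""
    return (fsr - 2.0 * fwhm + 1.0) / 2.0


def worst_case_power_mw(fsr, fwhm, p_tuning=P_TUNING_MW_PER_NM):
    """Equation 6.1: tuning power per nm times the worst-case range."""
    return p_tuning * worst_case_range_nm(fsr, fwhm)


def average_power_mw(sigma, p_tuning=P_TUNING_MW_PER_NM, steps=10000):
    """Equation 6.2 via the trapezoidal rule, with x = lambda - mu in [0, sigma]."""
    def integrand(x):
        return x * math.exp(-0.5 * (x / sigma) ** 2)

    h = sigma / steps
    area = 0.5 * (integrand(0.0) + integrand(sigma))
    area += sum(integrand(i * h) for i in range(1, steps))
    area *= h
    return 2.0 * p_tuning / (sigma * math.sqrt(2.0 * math.pi)) * area


def average_power_closed_form_mw(sigma, p_tuning=P_TUNING_MW_PER_NM):
    """Same integral in closed form: the integral equals sigma^2 * (1 - e^(-1/2))."""
    return 2.0 * p_tuning * sigma * (1.0 - math.exp(-0.5)) / math.sqrt(2.0 * math.pi)


for name, d in DEVICES.items():
    rng = worst_case_range_nm(d["fsr"], d["fwhm"])
    pwr = worst_case_power_mw(d["fsr"], d["fwhm"])
    print(f"{name}: worst-case range {rng:.2f} nm, power {pwr:.3f} mW")
```

With the Table 6.3 values, the worst-case ranges come out at 8.1 nm, 7.5 nm, and 0.29 nm, matching the worst-case columns of Table 6.4. The average-case functions follow Equation 6.2 literally; the average-case entries in the table appear to account for the passband width as well, so they need not coincide with these numbers.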
The results of applying Equation 6.1 and Equation 6.2 to the three device designs under study and the two tuning techniques are provided in Table 6.4 and Table 6.5.

Table 6.5: Voltage tuning power of different optical links

| Device | Average case: tuning range | Average case: power | Average case: yield | Worst case: tuning range | Worst case: power | Worst case: yield |
|---|---|---|---|---|---|---|
| Micro-ring | 1.06 nm | [missing] mW | 68.27% | 8.1 nm | - | - |
| Racetrack | 1.81 nm | [missing] mW | 55.5% | 7.5 nm | - | - |
| Directional Coupler | 0 nm | 0 mW | 68.27% | 0.29 nm | [missing] mW | 99.9% |

As one can see, voltage tuning can compensate for process variation within one standard deviation for the different devices, but beyond 1.65 nm the optical loss is too high, as is the case for the worst-case tuning range of micro-rings and racetracks; hence, it cannot be used there. On the other hand, thermal heating can tune any range of variations, but at slightly higher power levels (and at the expense of post-processing the wafer to create undercut structures, and of low switching frequencies). In terms of device choice, directional couplers dominate: their large FWHM and small FSR make them immune to process variations within one standard deviation and give them very low worst-case tuning ranges. The yield for directional couplers is optimal for both average and worst-case tuning ranges.

In this case study, we have demonstrated how to design a variation-aware single-channel optical link using our analysis of variations of fabricated devices and our models for both process and thermal variations. Since we are considering a single link, power is the main design criterion, but at higher bandwidth the tradeoff between power and bandwidth needs to be considered in light of application constraints.
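The yield columns of Table 6.4 and Table 6.5 can be related to the Gaussian model described at the start of this section. The following is a minimal sketch, assuming that yield is the probability that a device's resonant wavelength drift stays within the range a technique can correct, and that the drift is zero-mean Gaussian with the measured standard deviation. The function name `yield_within` is illustrative and does not come from the thesis.

```python
import math


def yield_within(correctable_nm, sigma_nm):
    """P(|lambda - mu| <= correctable_nm) for lambda ~ N(mu, sigma_nm^2)."""
    if correctable_nm <= 0.0:
        return 0.0
    return math.erf(correctable_nm / (sigma_nm * math.sqrt(2.0)))


sigma = 1.2  # nm, measured micro-ring process variation
one_sigma = yield_within(1.0 * sigma, sigma)    # drift corrected up to one sigma
three_sigma = yield_within(3.0 * sigma, sigma)  # drift corrected up to three sigma

print(f"one-sigma yield:   {one_sigma:.2%}")
print(f"three-sigma yield: {three_sigma:.2%}")
```

Correcting drift up to one standard deviation gives about 68.27%, the value listed for the average case. A three-sigma correction range covers about 99.73% of devices under the same model.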

Table 6.6: Nanophotonic network configuration

| Item | Notation | Value |
|---|---|---|
| Nanophotonic on-chip network characterization | | |
| Number of wavelength channels | $N$ | 64 |
| Data bit rate per channel | $B$ | 4 Gbps |
| Passband of one channel | $\lambda_c$ | 0.74 nm |
| Maximum shift w/ worst-case crosstalk | $\Delta\lambda_x$ | 0.07 nm |
| Maximum shift w/o crosstalk | $\Delta\lambda_i$ | 0.63 nm |
| Thermal tuning coefficient | $\Delta\lambda/\Delta T$ | 0.11 nm/°C |
| Process and thermal variation characterization | | |
| Maximum thermal variation | $\Delta T$ | 20 °C |
| Temperature change rate | $\Delta T/\Delta t$ | 0.1 °C/ms |
| Maximum process variation | $\Delta\lambda_{fab}$ | 0.76 nm |

6.2 Reliability Management: Problem formulation

In this section we define and formulate the reliability management problem in concrete mathematical terms. This paves the way for the reliability management solutions we propose. The nanophotonic communication management problem is formulated as follows. A nanophotonic network is described by a graph $G = (V, E)$, where the vertices $V$ are resonant switching nodes, i.e., modulators, filters, and intermediate switches, and the edges $E$ are the waveguides connecting the resonant switching nodes. Optimizing nanophotonic communication performance, power efficiency, and reliability thus requires:

Channel management: For each communication path $P = \{s, n_1, n_2, \ldots, n_m, d\}$, from a sender's modulator $s$, through a number of intermediate switches $n_1, \ldots, n_m$, to a receiver's filter $d$, control the resonant wavelength of each resonant device to minimize path $P$'s passband mismatch, as follows:

$$\min\{\Delta\lambda_r(s, n_1, n_2, \ldots, n_m, d)\} \tag{6.3}$$

Routing: For each run-time communication transaction, identify a set of $L$ routing paths $P_i = \{s, n_1, n_2, \ldots, n_m, d\}$, $i = 1, \ldots, L$, with minimal network congestion and $SNR > SNR_{min}$.
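The two requirements can be made concrete with a small sketch. It is only an illustration under stated assumptions: passband mismatch is taken to be the spread (maximum minus minimum) of the resonant wavelengths along a path, every device is assumed tunable by the same limited amount in either direction, and congestion is a precomputed score per candidate path. The names `Path`, `align`, and `select_routes` are hypothetical, and the simple filter-and-sort routing stands in for whatever algorithm is actually used.

```python
from dataclasses import dataclass


@dataclass
class Path:
    nodes: list          # [s, n1, ..., nm, d]: resonant devices along the path
    resonances_nm: list  # current resonant wavelength of each device, in nm
    snr_db: float        # estimated signal-to-noise ratio of the path
    congestion: float    # hypothetical congestion score (lower is better)


def align(resonances_nm, tuning_range_nm):
    """Tune each device toward the midpoint of the spread, limited to
    +/- tuning_range_nm, and return (tuned wavelengths, residual mismatch)."""
    target = (max(resonances_nm) + min(resonances_nm)) / 2.0
    tuned = [
        w + max(-tuning_range_nm, min(tuning_range_nm, target - w))
        for w in resonances_nm
    ]
    return tuned, max(tuned) - min(tuned)


def select_routes(candidates, snr_min_db, count):
    """Keep paths with SNR above the threshold, then take the `count`
    least congested ones."""
    feasible = [p for p in candidates if p.snr_db > snr_min_db]
    return sorted(feasible, key=lambda p: p.congestion)[:count]


# Example: three devices drifted apart by 0.6 nm, each tunable by 0.2 nm.
tuned, mismatch = align([1550.0, 1550.4, 1549.8], tuning_range_nm=0.2)
print(f"residual mismatch: {mismatch:.2f} nm")
```

In this example the 0.6 nm spread shrinks to 0.2 nm, because each of the two extreme devices can move at most 0.2 nm toward the midpoint.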

Reliability Management Solution I

This section presents a power-efficient run-time management solution for nanophotonic on-chip communication in many-core systems. The proposed solution leverages channel management and network routing techniques to minimize power loss and crosstalk induced by channel passband mismatch, as well as resource contention.

Proposed solution

This section presents the proposed run-time management solution, which consists of the following components:

Channel management: Leveraging inter-channel hopping and intra-channel wavelength tuning, the proposed solution enables wide-range resonant wavelength control and effectively minimizes nanophotonic passband mismatch.

Variation-aware routing: The problem of identifying a routing path with minimal crosstalk or resource contention is NP-complete. We propose an efficient heuristic-based routing algorithm that adjusts the routing decisions at run-time to effectively minimize communication crosstalk and network resource contention.

Together, the proposed techniques give us power-efficient run-time control that enables high-performance, reliable nanophotonic communication.

Channel management

This section presents run-time channel management for photonic passband mismatch minimization. As described in Chapter 2, the nanophotonic channel passband is narrow, typically within one or a few nanometers. Therefore, the resonant wavelength drift induced by process and thermal variation may cross multiple channels. Herein, we propose inter-channel hopping and intra-channel wavelength tuning for power-efficient passband mismatch minimization.
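To give a sense of scale, the drift can be expressed in units of the channel passband using the values of Table 6.6. This is a back-of-the-envelope sketch, assuming that adjacent channels are spaced by one passband width (no guard band) and that the thermal shift is the tuning coefficient times the maximum temperature swing. The constant and function names are illustrative.

```python
# Values from Table 6.6.
PASSBAND_NM = 0.74                # passband of one channel
THERMAL_COEFF_NM_PER_DEG = 0.11   # thermal tuning coefficient
MAX_TEMP_SWING_DEG = 20.0         # maximum thermal variation
MAX_PROCESS_SHIFT_NM = 0.76       # maximum process variation


def passbands_crossed(shift_nm, passband_nm=PASSBAND_NM):
    """Drift expressed as a multiple of one channel passband
    (assumes channels are spaced by exactly one passband width)."""
    return shift_nm / passband_nm


thermal_shift_nm = THERMAL_COEFF_NM_PER_DEG * MAX_TEMP_SWING_DEG
thermal_channels = passbands_crossed(thermal_shift_nm)
process_channels = passbands_crossed(MAX_PROCESS_SHIFT_NM)

print(f"thermal drift: {thermal_shift_nm:.2f} nm = {thermal_channels:.2f} passbands")
print(f"process drift: {MAX_PROCESS_SHIFT_NM:.2f} nm = {process_channels:.2f} passbands")
```

Under these assumptions the maximum thermal drift of 2.2 nm spans nearly three passbands, and the maximum process drift alone exceeds one passband.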


[Figure 6.1: Illustration of inter-channel hopping in on-chip network at switches due to variations. Panel labels: inter-channel management technique; process and thermal variation induced channel drifting.]

Inter-channel hopping: Essentially, inter-channel hopping does not change the resonant wavelength of the resonant devices. Instead, it shifts the communication to a different set of channels. As shown in Figure 6.1, due to process and thermal variation, the overall passband of the $m$ receiver filters shifts from $[\lambda_1, \ldots, \lambda_m]$ to $[\lambda_{k+1}, \ldots, \lambda_{k+m}]$. By introducing $k$ extra micro-ring resonators operating at $[\lambda_{m+1}, \ldots, \lambda_{m+k}]$, the sender can use the $[k+1, \ldots, k+m]$ passband to transfer the data and overcome the $k$-channel passband mismatch problem. The overhead, i.e., the value of $k$, can be determined by characterizing the worst-case resonant wavelength shift induced by process and thermal variation ($\Delta\lambda_{max}$) with respect to the overall optical wavelength bandwidth ($\lambda_{bandwidth}$), as follows:

$$k = m\,\frac{\Delta\lambda_{max}}{\lambda_{bandwidth}} \tag{6.4}$$

For example, in a 64-channel WDM setting ($m = 64$), $k$ equals 3, resulting in less than 5% area overhead. Note that similar ideas can also be applied at the receiver end by adding extra micro-rings to each receiver filter.

Intra-channel wavelength tuning: It leverages voltage tuning to adjust the drifted resonant wavelength. Electro-optic modulators and switches already leverage this effect to change their operation according to an electrical control signal; in our solution, this effect is also used for reliability management to align the mismatched passbands. Optical power loss is a primary challenge of
