More-than-Moore with Integrated Silicon-Photonics. Vladimir Stojanović, Berkeley Wireless Research Center, UC Berkeley
Acknowledgments: Milos Popović (Boulder/BU), Rajeev Ram, Jason Orcutt, Hanqing Li (MIT), Krste Asanović (UC Berkeley); Jeffrey Shainline, Christopher Batten, Ajay Joshi, Anatoly Khilo; Mark Wade, Karan Mehta, Jie Sun, Josh Wang; Chen Sun, Sen Lin, Sajjad Moazeni, Nandish Mehta, Michael Georgas, Benjamin Moss, Jonathan Leu; Yong-Jin Kwon, Scott Beamer, Yunsup Lee, Andrew Waterman, Miquel Planas, Rimas Avizienis, Henry Cook, Huy Vo; Roy Meade, Gurtej Sandhu and the Micron Fab12 team (Zvi, Ofer, Daniel, Efi, Elad); DARPA, Micron, NSF, BWRC; IBM Trusted Foundry, Global Foundries.
More-than-Moore perspective: enhanced CMOS enables new applications. 1990: inductors in an IC process (Nguyen & Meyer). 1997: one of the first CMOS radios (Rudell & Gray). 2004: world's first 60 GHz CMOS amplifier (Niknejad & Brodersen). 2012: world's first Si-photonic transmitter in 45nm SOI (Stojanovic, Popovic, Ram).
IBM/GF 12SOI (45nm) CMOS: 300mm wafers, a commercial process with MOSIS and TAPO MPW access. This advanced process is used in microprocessors (IBM Cell, IBM Espresso, IBM Power 7). Photonic enhancement enables VLSI photonic systems with no process changes required.
Zero-Change Photonics in 45nm [C. Sun, JSSC 2016]: photonics for free (no modification to the process). Closest proximity of electronics and photonics; a single substrate-removal post-processing step; a monolithic photonics platform with the fastest transistors.
Integrated photonic interconnects: each wavelength (λ) carries one serial bit stream. Bandwidth density is achieved through DWDM; energy efficiency is achieved through low-loss optical components and tight integration.
Single-channel link tradeoffs (figure: tradeoff curves for link loss of 10 dB vs. 15 dB and Rx capacitance of 5 fF vs. 25 fF).
Need to optimize carefully (512 Gb/s aggregate throughput, assuming 32nm CMOS [Georgas CICC 2011]): laser energy increases with data rate because Rx sensitivity is limited; modulation becomes more expensive, which lowers the extinction ratio; tuning costs decrease with data rate. Moderate data rates are the most energy-efficient.
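The tradeoff above can be illustrated with a toy model. It is a sketch under stated assumptions, not the model from [Georgas CICC 2011]. Laser and modulation energy per bit are assumed to grow linearly with the channel rate. Thermal tuning is a fixed power per ring, so its energy per bit falls as 1/rate. All three coefficients (`LASER_MOD_FJ_PER_GBPS`, `TUNING_POWER_UW`, `FIXED_FJ`) are hypothetical values picked only to show the shape of the curve. Only the 512 Gb/s aggregate comes from the slide.

```python
import math

AGGREGATE_GBPS = 512.0  # aggregate throughput target from the slide

# Hypothetical coefficients, chosen only to show the shape of the trade-off.
LASER_MOD_FJ_PER_GBPS = 20.0  # laser + modulation energy grows with rate (fJ/bit per Gb/s)
TUNING_POWER_UW = 2000.0      # fixed thermal-tuning power per ring (uW)
FIXED_FJ = 50.0               # rate-independent circuit energy (fJ/bit)


def energy_per_bit_fj(rate_gbps: float) -> float:
    """Toy energy per bit of one DWDM channel running at rate_gbps."""
    laser_and_mod = LASER_MOD_FJ_PER_GBPS * rate_gbps
    # uW / (Gb/s) = fJ/bit, so the fixed tuning power is amortized over more bits.
    tuning = TUNING_POWER_UW / rate_gbps
    return laser_and_mod + tuning + FIXED_FJ


def optimal_rate_gbps() -> float:
    """The minimum of a*r + b/r + c lies at r* = sqrt(b / a)."""
    return math.sqrt(TUNING_POWER_UW / LASER_MOD_FJ_PER_GBPS)


def wavelengths_needed(rate_gbps: float) -> int:
    """DWDM channels needed to reach the aggregate throughput."""
    return math.ceil(AGGREGATE_GBPS / rate_gbps)


if __name__ == "__main__":
    for rate in (1, 2, 5, 10, 20, 40):
        print(f"{rate:>3} Gb/s: {energy_per_bit_fj(rate):7.1f} fJ/b, "
              f"{wavelengths_needed(rate):4d} wavelengths")
    print(f"toy optimum: {optimal_rate_gbps():.1f} Gb/s")
```

With these made-up coefficients, the minimum falls at 10 Gb/s per wavelength (450 fJ/b, 52 wavelengths). The point is only that a rising term plus a 1/rate term yields an interior optimum. The real optimum depends on the device and circuit parameters.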
DWDM link efficiency optimization: optimize for minimum energy cost. Bandwidth density is dominated by circuit and photonics area, not by coupler pitch.
Towards an Optical DRAM System (DARPA POEM): EOS22 test chip, high-performance 45nm SOI; Photonic-Interconnected DRAM (PIM) on the Micron D1L test chip, power-optimized 0.22µm bulk [ISCA 2010]. Die dimensions shown: 6mm and 5mm. Device counts shown: 2M transistors with 1000s of optical devices, and 70M transistors with 1000 optical devices.
World's First Processor to Communicate with Light: silicon-photonic components integrated directly in the chip, in zero-change 45nm SOI. DARPA POEM & PERFECT; Stojanović, Asanović; C. Sun et al., Nature, Dec. 2015.
Processor cores, 45nm SOI (figure: shmoo plot of energy efficiency in Gflops/W vs. Vdd from 0.65 V to 1.20 V and frequency from 200 MHz to 1350 MHz; high-frequency, low-Vdd points are not operational). RISC-V open ISA; scalar-vector cores; boots Linux; 0.2-1.35 GHz, 4-16 Gflops/W [Lee ESSCIRC 2014].
Si Waveguides 470nm width 700nm width
Key Device Components: vertical couplers (waveguide diffraction grating with waveguide taper) [Wade OIC 2015]. Submitted to Nature, August 2015; funded in part by the DARPA POEM program. Stojanovic, Asanovic, Popovic, Ram.
Key Device Components: SiGe photodetectors, reusing the SiGe from PMOS strain engineering (figure labels: SiGe photodetector, waveguide taper, waveguide) [Orcutt 2013, Alloatti APL 2015].
Key Device Components: microring modulator (figure labels: input waveguide, microring, integrated heater, drop waveguide, output waveguide) [Shainline OL 2013, Wade OFC 2014].
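The integrated heater tunes the ring by shifting its resonance. The Platform Performance Summary quotes tuning efficiency in µW/GHz, while the link-efficiency footnote quotes the tuning range in nm. A minimal sketch of the conversion between the two follows. It assumes an operating wavelength of 1180 nm (inside the 1170-1190 nm band of the DWDM spectra), uses the first-order relation Δf = cΔλ/λ², and assumes efficiency stays constant over the tuning range.

```python
C_M_PER_S = 299_792_458.0  # speed of light in vacuum (m/s)


def shift_nm_to_ghz(shift_nm: float, wavelength_nm: float) -> float:
    """Small wavelength shift to frequency shift: df = c * dlambda / lambda^2 (GHz)."""
    return C_M_PER_S * (shift_nm * 1e-9) / (wavelength_nm * 1e-9) ** 2 / 1e9


def heater_power_uw(shift_nm: float, wavelength_nm: float, eff_uw_per_ghz: float) -> float:
    """Heater power for a given shift, assuming constant tuning efficiency (uW/GHz)."""
    return shift_nm_to_ghz(shift_nm, wavelength_nm) * eff_uw_per_ghz


if __name__ == "__main__":
    # Assumed operating point: 0.5 nm shift near 1180 nm.
    for name, eff in (("45nm SOI", 3.8), ("Bulk", 10.0), ("Beamer ISCA 2010", 1.6)):
        print(f"{name}: {heater_power_uw(0.5, 1180.0, eff):.0f} uW per ring")
```

Under these assumptions, a 0.5 nm shift near 1180 nm is about 108 GHz. That corresponds to roughly 0.41 mW of heater power per ring at 3.8 µW/GHz.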
Transmitter (figure labels: transmit site, 50um; in/out grating couplers, modulator, V BIAS, PRBS31, 8:1, modulator driver). The modulator is driven by a CMOS logic inverter (1.2 V pp): 5 Gb/s at ~30 fJ/b with >6 dB extinction ratio and 3 dB insertion loss; up to 12 Gb/s with 2-3 dB extinction ratio [Wade OFC 2014] [Sun VLSI 2015].
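The transmitter numbers translate into average power and linear ratios with two one-line conversions. This is a minimal sketch. The helper names are illustrative, and the only inputs are the figures quoted on the slide.

```python
def db_to_ratio(db: float) -> float:
    """Convert a power ratio in dB to a linear ratio."""
    return 10 ** (db / 10)


def avg_power_uw(energy_fj_per_bit: float, rate_gbps: float) -> float:
    """Average power from energy per bit: fJ/bit x Gb/s = uW."""
    return energy_fj_per_bit * rate_gbps


if __name__ == "__main__":
    print(f"driver power at 5 Gb/s: {avg_power_uw(30, 5):.0f} uW")          # 150 uW
    print(f"ER 6 dB -> P1/P0 = {db_to_ratio(6):.2f}")                       # 3.98
    print(f"IL 3 dB -> {1 / db_to_ratio(3):.0%} of light transmitted")      # 50%
    print(f"ER 2-3 dB -> P1/P0 = {db_to_ratio(2):.2f}-{db_to_ratio(3):.2f}")  # 1.58-2.00
```

At 5 Gb/s the driver therefore draws about 150 µW. Going to 12 Gb/s shrinks the ratio between the one and zero optical levels from about 4:1 to 1.6-2:1.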
Receiver (figure labels: receive site, 50um; input grating coupler, photodetector with V PD, 5kΩ TIA and a 5kΩ dummy TIA, clock buffers with phases φ, outputs Out A and Out B, 2:8, BER checker). Low parasitics from monolithic integration enable a single-stage 5kΩ TIA receiver: 10 Gb/s operation at 290 fJ/bit with 8.3 µA sensitivity [Georgas VLSI 2014].
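A minimal sketch of what the 8.3 µA sensitivity means at the TIA output and at the optical input. Two assumptions are involved. First, the TIA is treated as ideal, with output swing I × Rf and no finite-gain or bandwidth effects. Second, the responsivity is taken from the 45nm SOI column of the Platform Performance Summary (0.44 A/W).

```python
import math


def tia_swing_mv(photocurrent_ua: float, feedback_kohm: float) -> float:
    """Ideal single-stage TIA output: V = I * Rf (uA x kOhm = mV)."""
    return photocurrent_ua * feedback_kohm


def required_optical_dbm(photocurrent_ua: float, responsivity_a_per_w: float) -> float:
    """Optical power the PD must receive to produce the given photocurrent, in dBm."""
    power_mw = (photocurrent_ua * 1e-3) / responsivity_a_per_w  # mA / (A/W) = mW
    return 10 * math.log10(power_mw)


if __name__ == "__main__":
    print(f"swing at sensitivity: {tia_swing_mv(8.3, 5.0):.1f} mV")
    # Assumed responsivity: 0.44 A/W (45nm SOI platform value).
    print(f"required optical power: {required_optical_dbm(8.3, 0.44):.1f} dBm")
```

Under these assumptions, 8.3 µA gives about 41.5 mV at the TIA output and needs about -17.2 dBm at the photodetector. A less responsive detector shifts the optical requirement up by 10·log10 of the responsivity ratio.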
5 Gb/s Chip-to-Chip Link (figure: link diagram with optical power levels, received eye with decision threshold [µA] vs. time [0-200 ps], and BER from 1e-1 down to <1e-10). Labeled elements: 1189nm laser (3.8 dBm), Chip 1 (PRBS31, 8:1, Tx, thermal tuner, PD, Rx Out A), optical amplifier, Chip 2 (PD, Rx Out B, 2:8, BER check). Received levels: bit 1 at -3.2 dBm (9.65 µA), bit 0 at -9.9 dBm (2.04 µA). Other power levels labeled along the path: -0.2, -3.2, -5.9, -7.2, -10.5, -14.5, and 0.8 dBm.
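The received levels in the link figure imply a photodetector responsivity and a received extinction ratio. This is a minimal sketch. It assumes the -3.2 dBm/9.65 µA and -9.9 dBm/2.04 µA pairs are the bit-1 and bit-0 levels at the same detector. It only rearranges those four numbers.

```python
def dbm_to_mw(dbm: float) -> float:
    """Optical power in mW from dBm."""
    return 10 ** (dbm / 10)


# (received optical power in dBm, photocurrent in uA), read from the link figure.
RX_LEVELS = {"bit 1": (-3.2, 9.65), "bit 0": (-9.9, 2.04)}


def responsivity_a_per_w(dbm: float, current_ua: float) -> float:
    """Implied PD responsivity: photocurrent divided by optical power."""
    return (current_ua * 1e-6) / (dbm_to_mw(dbm) * 1e-3)


def extinction_ratio_db() -> float:
    """Received extinction ratio: P1(dBm) - P0(dBm)."""
    return RX_LEVELS["bit 1"][0] - RX_LEVELS["bit 0"][0]


if __name__ == "__main__":
    for bit, (dbm, ua) in RX_LEVELS.items():
        print(f"{bit}: {responsivity_a_per_w(dbm, ua):.3f} A/W")
    print(f"ER at Rx: {extinction_ratio_db():.1f} dB")
```

Both levels give about 0.02 A/W, with a received extinction ratio of about 6.7 dB. That 0.02 A/W figure is the baseline against which the 0.1-0.2 A/W on-chip photodetectors mentioned in the efficiency summary are 5-10x better.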
5 Gb/s Link Efficiency Summary: thermal tuning* 275 fJ/bit (42%), modulator driver 30 fJ/bit (5%), receiver 357 fJ/bit (54%), for a total of 662 fJ/bit for circuits. Optical power: 3.6 mW (13 mW extrapolated without the amplifier). Zero-change monolithic integration is competitive with state-of-the-art heterogeneous platforms: 680 fJ/bit, 14 mW optical power [Zheng PTL 2012**]. *Includes all closed-loop circuits + 0.5 nm tuning power. **0.5 nm tuning power only.
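A quick check of the breakdown, with the optical power converted to energy per bit at 5 Gb/s. This is a minimal sketch; all inputs are the slide's numbers. The rounded shares add up to 101% because each one is rounded independently.

```python
RATE_GBPS = 5.0
# Circuit energy breakdown from the slide (fJ/bit).
BREAKDOWN_FJ = {"thermal tuning": 275, "modulator driver": 30, "receiver": 357}


def total_fj() -> int:
    """Total circuit energy per bit."""
    return sum(BREAKDOWN_FJ.values())


def shares_percent() -> dict:
    """Share of total circuit energy per block, rounded to whole percent."""
    total = total_fj()
    return {name: round(100 * fj / total) for name, fj in BREAKDOWN_FJ.items()}


def optical_fj_per_bit(optical_mw: float) -> float:
    """Optical energy per bit: mW / (Gb/s) = pJ/bit, times 1000 for fJ/bit."""
    return optical_mw / RATE_GBPS * 1000


if __name__ == "__main__":
    print(total_fj(), shares_percent())
    print(f"circuit power: {total_fj() * RATE_GBPS / 1000:.2f} mW")
    print(f"optical: {optical_fj_per_bit(3.6):.0f} fJ/b (with amplifier), "
          f"{optical_fj_per_bit(13):.0f} fJ/b (extrapolated, no amplifier)")
```

At 5 Gb/s, the 662 fJ/bit of circuits is about 3.3 mW. The 3.6 mW and 13 mW optical powers correspond to 720 and 2600 fJ/bit.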
5 Gb/s Link Efficiency Summary (continued): 662 fJ/bit for circuits*; 560 fJ/bit for laser wall-plug**. The link does not use our best devices. On the same chip there are 1 dB loss couplers [Wade, OIC 2015], versus the 4 dB couplers in the link, and a 5-10x better photodetector (0.1-0.2 A/W). We expect to obtain >40x smaller laser power (65 fJ/b optical). *Includes all closed-loop circuits + 0.5 nm tuning power. **11.6% QD laser wall-plug efficiency.
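The laser figures can be checked against each other. This is a minimal sketch, assuming the >40x reduction is applied to the 13 mW extrapolated optical power and the 11.6% wall-plug efficiency then converts optical energy to electrical energy. Under that reading, 65 fJ/b optical and about 560 fJ/bit wall-plug follow directly.

```python
RATE_GBPS = 5.0
OPTICAL_MW_NO_AMP = 13.0  # extrapolated optical power without the amplifier
IMPROVEMENT = 40.0        # ">40x smaller laser power" (lower bound)
WALL_PLUG_EFF = 0.116     # 11.6% QD laser wall-plug efficiency


def fj_per_bit(power_mw: float) -> float:
    """mW / (Gb/s) = pJ/bit; times 1000 gives fJ/bit."""
    return power_mw / RATE_GBPS * 1000


today_optical = fj_per_bit(OPTICAL_MW_NO_AMP)            # optical energy per bit in the link
projected_optical = today_optical / IMPROVEMENT          # with the better couplers and PD
projected_wall_plug = projected_optical / WALL_PLUG_EFF  # electrical energy at the laser

if __name__ == "__main__":
    print(f"{today_optical:.0f} -> {projected_optical:.0f} fJ/b optical, "
          f"{projected_wall_plug:.0f} fJ/b wall-plug")
```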
Electronic-Photonic Packaging: a die-thinned chip is flip-chipped onto an FR4 PCB using C4 bumps, with epoxy. The substrate is selectively removed under the WDM optical transceiver regions; the processor and SRAM regions are labeled separately.
Optical Memory System Demo: Chip 1 acts as the processor (RISC-V processor and memory controller; its 1MB memory array inactive), and Chip 2 acts as memory (1MB memory array; its RISC-V processor inactive). The processor-to-memory link carries command + address + write data; the memory-to-processor link carries read data (figure components: laser, 50/50 power splitter, single-mode fiber, transmitters, optical amplifiers, receivers, processor interface, PD). A custom memory controller and DRAM interface emulator take advantage of a full-duplex (as opposed to half-duplex) memory interface. Video demonstration (thermal stress test): http://www.nature.com/nature/journal/v528/n7583/fig_tab/nature16454_sv1.html
Tx and Rx DWDM Transceiver Banks (figure: transmission [dB] vs. wavelength [1170-1190 nm] for the Tx and Rx ring banks, rings numbered 0-10). Advanced lithography enables tight ring resonance control.
11 × 8 Gb/s Tx Demonstration: 11 rings (slices 0-10), each demonstrating 8 Gb/s modulation, tested independently one at a time, for a potential 88 Gb/s on a single fiber/waveguide. Each ring is auto-locked [Sun JSSC 2016]. Rates above 8 Gb/s were limited by duty-cycle distortion of the off-chip clock source.
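To illustrate what "auto-locked" involves, here is a generic sweep-based ring lock. It is a simplified, hypothetical scheme, not the locking loop of [Sun JSSC 2016]. A practical controller would also track drift continuously after the initial lock. The through port is modeled as a Lorentzian notch. The heater is stepped in codes, each code red-shifting the resonance by `nm_per_code`, and the code with the least through-port power is kept. Linewidth, notch depth, nm per code, and the laser grid are all assumed values. The sketch relies on one physical fact: heating a silicon ring can only red-shift it, so each cold resonance must sit blue of its laser line.

```python
def through_power(laser_nm: float, resonance_nm: float,
                  fwhm_nm: float = 0.1, depth: float = 0.9) -> float:
    """Normalized through-port power of a ring as a Lorentzian notch (hypothetical parameters)."""
    half = fwhm_nm / 2
    detuning = laser_nm - resonance_nm
    return 1.0 - depth * half ** 2 / (detuning ** 2 + half ** 2)


def lock_ring(laser_nm: float, cold_resonance_nm: float,
              nm_per_code: float = 0.01, max_code: int = 255) -> int:
    """Sweep heater codes and keep the one with the least through-port power.

    Heating only red-shifts a silicon ring, so each code moves the resonance
    by +nm_per_code from its cold (unheated) position.
    """
    return min(
        range(max_code + 1),
        key=lambda code: through_power(laser_nm, cold_resonance_nm + code * nm_per_code),
    )


if __name__ == "__main__":
    # Hypothetical grid: 11 laser lines 1 nm apart, each ring 0.3 nm blue of its line.
    lasers = [1175.0 + i for i in range(11)]
    codes = [lock_ring(lam, lam - 0.3) for lam in lasers]
    print(codes)
    print(f"aggregate: {len(lasers) * 8} Gb/s at 8 Gb/s per ring")
```

If a ring starts red of its laser line, the sweep can only return code 0. That is why the cold resonances are assumed to sit on the blue side of their laser lines.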
Going Faster: PAM2 and PAM4 [Moazeni et al., ISSCC 2017]. Direct digital-to-optical conversion!
Chip floorplan
Transmitter eye diagrams: extinction ratio (ER) 3 dB, insertion loss (IL) 5.5 dB; PAM4 coding uses levels (0, 5, 10, 15); 42 fJ/b driver energy efficiency.
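One way to read the PAM4 coding above is sketched here. Two assumptions are involved, both hypothetical. First, (0, 5, 10, 15) are four equally spaced driver codes out of a 0-15 range, one per PAM4 level. Second, bit pairs map to levels with a Gray code, so adjacent levels differ by one bit. The slide does not specify either point.

```python
# Assumption: each PAM4 symbol selects one of the four driver codes listed on the slide.
PAM4_CODES = (0, 5, 10, 15)
# Assumption: Gray mapping, so adjacent levels differ by a single bit.
GRAY_LEVEL = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}


def pam4_encode(bits: list[int]) -> list[int]:
    """Map bit pairs (MSB first) to driver codes."""
    if len(bits) % 2:
        raise ValueError("PAM4 needs an even number of bits")
    return [PAM4_CODES[GRAY_LEVEL[(bits[i], bits[i + 1])]] for i in range(0, len(bits), 2)]


def symbol_rate_gbaud(bit_rate_gbps: float, bits_per_symbol: int = 2) -> float:
    """PAM4 carries 2 bits per symbol, halving the symbol rate for a given bit rate."""
    return bit_rate_gbps / bits_per_symbol


if __name__ == "__main__":
    print(pam4_encode([0, 0, 0, 1, 1, 1, 1, 0]))
    print(symbol_rate_gbaud(20.0))
```

The design point is that PAM4 doubles the bits carried per symbol. A given bit rate then needs half the symbol rate of PAM2, at the cost of a smaller eye between adjacent levels.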
Transmitter specs
Improved Rx Topologies: leverage tight electronic-photonic integration to create new, more sensitive receiver structures, e.g., a differential, double-data-rate (DDR) receiver [Nandish Mehta et al., ESSCIRC 2016].
Platform Performance Summary: comparison to the processor-memory system proposal we published 6 years ago [Beamer ISCA 2010]; the platforms meet or exceed most system specs.
Metric | [Beamer ISCA 2010] conservative estimates | 45nm SOI platform | Bulk photonics platform*
Waveguide loss | 4 dB/cm | 3.7 dB/cm | 10.5 dB/cm
Vertical coupler loss | 1 dB | 1 dB | 3 dB
Tx data rate | 10 Gb/s | 20 Gb/s | 5 Gb/s
Tx energy per bit | 120 fJ/b | 42 fJ/b | 350 fJ/b
Rx data rate | 10 Gb/s | 12 Gb/s | 5 Gb/s
Rx energy per bit | 80 fJ/b | 297 fJ/b | 1700 fJ/b
Rx sensitivity | 10 µA | 8 µA | 36 µA
PD responsivity | 0.9 A/W | 0.44 A/W | 0.2 A/W
Thermal tuning efficiency | 1.6 µW/GHz | 3.8 µW/GHz | 10 µW/GHz
*Considerably slower process than the one assumed in [Beamer ISCA 2010].
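The table's loss, sensitivity and responsivity rows combine into a per-wavelength laser power requirement. This is a minimal sketch under hypothetical assumptions: a 1 cm waveguide, two vertical couplers, and no other losses (ring through-loss and modulator insertion loss are ignored). Required power at the photodetector is sensitivity divided by responsivity, and the link loss is added on top.

```python
# (waveguide loss dB/cm, vertical coupler loss dB, Rx sensitivity uA, PD responsivity A/W)
PLATFORMS = {
    "Beamer ISCA 2010": (4.0, 1.0, 10.0, 0.9),
    "45nm SOI": (3.7, 1.0, 8.0, 0.44),
    "Bulk": (10.5, 3.0, 36.0, 0.2),
}


def link_loss_db(wg_db_per_cm: float, coupler_db: float,
                 length_cm: float = 1.0, n_couplers: int = 2) -> float:
    """Waveguide plus coupler loss; other losses (rings, modulator IL) are ignored."""
    return wg_db_per_cm * length_cm + coupler_db * n_couplers


def laser_power_uw(wg_db_per_cm: float, coupler_db: float, sens_ua: float,
                   resp_a_per_w: float, length_cm: float = 1.0, n_couplers: int = 2) -> float:
    """Per-wavelength optical power the laser must launch to meet Rx sensitivity."""
    at_pd_uw = sens_ua / resp_a_per_w  # uA / (A/W) = uW at the photodetector
    loss = link_loss_db(wg_db_per_cm, coupler_db, length_cm, n_couplers)
    return at_pd_uw * 10 ** (loss / 10)


if __name__ == "__main__":
    for name, (wg, cpl, sens, resp) in PLATFORMS.items():
        print(f"{name}: {link_loss_db(wg, cpl):.1f} dB, "
              f"{laser_power_uw(wg, cpl, sens, resp):.0f} uW per wavelength")
```

Under these assumptions, the bulk platform's higher coupler and waveguide loss, together with its lower sensitivity, multiply into roughly two orders of magnitude more laser power per wavelength than the 45nm SOI platform.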
Poly-Si Photonics in Bulk CMOS (figure: 8 mm × 8 mm die with array and periphery). DRAM processes are heavily optimized for cost: Micron wafers, DDR3-1333 technology, 2 Gb die cost ~90. Key constraints: bulk substrate, low cost, no SiGe [Meade et al. OI 13, VLSI Tech Symp 14].
Memory: bulk photonics integration. First-ever link result with bulk CMOS photonics; DTI (deep trench isolation) adjacent to STI (shallow trench isolation); Micron D1L reticle, 180nm bulk chip [Meade et al. VLSI Tech Symp 14, Sun et al. VLSI Ckts Symp 14].
WDM in bulk-photonics - Tx: all slices BER-checked at 5 Gb/s, for a 45 Gb/s aggregate rate per waveguide.
WDM in bulk-photonics - Rx: all receive slices functional and BER-checked at 5 Gb/s. A single fiber carries more I/O bandwidth than a x16 DDR4 part.
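A minimal sketch of the bandwidth arithmetic behind the bulk-photonics WDM results, under stated assumptions. The single-fiber figure is taken to be the 45 Gb/s per-waveguide aggregate. The x16 DDR4 comparison counts data pins only. DDR4-2400 (2.4 Gb/s per pin) is used purely as an example speed grade.

```python
SLICE_GBPS = 5.0       # per-slice rate, BER-checked on the slides
AGGREGATE_GBPS = 45.0  # aggregate rate per waveguide


def slices_needed(aggregate_gbps: float, slice_gbps: float) -> int:
    """Number of WDM slices (wavelengths) that make up the aggregate."""
    return round(aggregate_gbps / slice_gbps)


def x16_bandwidth_gbps(per_pin_gbps: float, width: int = 16) -> float:
    """Peak data bandwidth of a xN DRAM interface (data pins only)."""
    return width * per_pin_gbps


def breakeven_pin_rate_gbps(aggregate_gbps: float, width: int = 16) -> float:
    """Per-pin rate at which a xN part matches the single-waveguide aggregate."""
    return aggregate_gbps / width


if __name__ == "__main__":
    print(slices_needed(AGGREGATE_GBPS, SLICE_GBPS))  # wavelengths in use
    print(breakeven_pin_rate_gbps(AGGREGATE_GBPS))    # Gb/s per pin
    print(x16_bandwidth_gbps(2.4))                    # e.g. DDR4-2400
```

Under this reading, 45 Gb/s corresponds to nine 5 Gb/s slices and exceeds a x16 part running below about 2.8 Gb/s per pin. Matching a faster speed grade would take more wavelengths per fiber.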
Conclusions: silicon photonics is an enabler of new capabilities; think of it as a new on-chip inductor or a new on-chip t-line. It can potentially revolutionize many applications despite the slowdown in CMOS scaling, and VLSI compute and network infrastructure are just a start. Realizing this requires process-, device-, circuit-, and system-level understanding.