PROBE: Prediction-based Optical Bandwidth Scaling for Energy-efficient NoCs
  Li Zhou and Avinash Kodi
  Technologies for Emerging Computer Architecture Laboratory (TEAL)
  School of Electrical Engineering and Computer Science
  Ohio University, Athens, OH, USA
  7th International Symposium on Networks-on-Chip (NOCS), April 21-24, 2013
  Contact website: http://oucsace.cs.ohiou.edu/~avinashk/

Multicores & Networks-on-Chip
  Examples: TILE-Gx72 [1], 80-core Intel TeraFlops [2], 2880-core KEPLER (Nvidia) [3]
  As the number of cores increases, the communication-centric design paradigm (Networks-on-Chip) is becoming important.
  Energy for communication is increasing.
  Delivered throughput is decreasing.
  [1] http://www.tilera.com/products/processors/tile-gx_family
  [2] http://www.intel.com/pressroom/kits/teraflops/
  [3] http://www.nvidia.com/object/nvidia-kepler.html
NOCS-3 TEAL 2

Energy Discrepancy & Throughput
  Energy discrepancy between computation and global communication with technology scaling.
    Need to reduce global communication energy.
    [Figure: relative compute energy and interconnect energy versus technology node (nm). Source: Shekar Borkar, Intel]
  Reduced throughput due to aggressive voltage and clock scaling.
    [Figure: on-die energy, interconnect versus compute; tile power of the Intel Tera-Flops processor (65 nm) [1]]
  Need to provide scalable bandwidth without sacrificing performance.
  Potential solutions: nanophotonics, wireless, 3D stacking.
  [1] Y. Hoskote, "A 5-GHz Mesh Interconnect for A Teraflops Processor," IEEE Computer Society, 2007, pp. 51-61.

Why Photonics?
  Photonics provides:
    Low energy (7.9 fJ/bit)
    Small footprint (~2.5 μm)
    High bandwidth (~40 Gbps)
    Low latency (0.45 ps/mm)
    CMOS compatibility
  1. L. Xu, W. Zhang, Q. Li, J. Chan, H. L. R. Lira, M. Lipson, K. Bergman, "40-Gb/s DPSK Data Transmission Through a Silicon Microring Switch," IEEE Photonics Technology Letters 24.
  2. S. Manipatruni, K. Preston, L. Chen, and M. Lipson, "Ultra-low voltage, ultra-small mode volume silicon microring modulator," Opt. Express 18, 18235-18242 (2010).

Nanophotonic Link
  [Figure: nanophotonic link between Core A and Core B. An off-chip laser feeds micro-ring resonator transmitters (Tx) and receivers (Rx) on wavelengths λ1 to λ4. Labeled components: buffer chain, photodetector, TIA, limiting amplifier, driver for electronics, micro-ring resonator.]
  Laser power: compensates for a variety of light losses along its path.
  Trimming power: microring resonators are sensitive to temperature variations, so they require additional trimming power to maintain their resonant wavelength.

Power Breakdown: Static Power Challenge
  [Figure: power breakdown (0% to 100%) for Radix-32 SWMR, Corona, and Flexishare. Legend: laser, trimming power, others (routing, O/E, E/O conversion). Annotation: more than 60% of total power budget!]
  The off-chip laser source and the trimming power of the on-chip microring resonators represent the majority of network power.

PROBE: Targeting the Static Power (Preview)
  Key goal
    Save significant static optical power while meeting performance constraints.
  Hardware mechanisms
    Tunable splitters -> adaptive channels
    Binary-tree-based waveguide
    Global and local bandwidth controllers
  Approach
    Traffic load prediction
    Dynamic bandwidth scaling in the background
    Three pre-defined bandwidth modes
  Main results
    Static power savings of more than 60%, with % penalty on throughput and 20% on execution time.

Outline
  Introduction & Motivation
  PROBE Architecture & Implementation
  Traffic Prediction
  Dynamic Bandwidth Scaling
  Performance Analysis
  Conclusions & Future Work

Architecture (1/2)
  [Figure: chip layout with tiles 0 to 15, routers R0 to R15, and lasers L0 to L15. Legend: R: router, L: laser; a third symbol marks a voltage regulator.]

Splitter
  Key component
    Essential for signal distribution in optical networks.
    Splits a signal from a single waveguide into a large number of waveguides.
  Passive splitter [1]
    Fixed power ratio
    Power inefficient
  Tunable splitter [2]
    Tunable power ratio
    More flexibility
    Tuning range: 0~99%
    Tuning speed: 6 ns
    Power loss: 0.2~0.8 dB
    CMOS compatible (0.9 V, 5~40 μm)
  [Figure: splitter distributing one input to Dest. 1, Dest. 2, Dest. 3, and Dest. 4]
  [1] M. Olivero and M. Svalgaard, "UV-written Integrated Optical 1xN Splitter," Optics Express, Vol. 14, Issue 1, pp. 162-170 (2006).
  [2] R. Thapliya, T. Kikuchi, and S. Nakamura, "Tunable Power Splitter Based on An Electro-optic Multimode Interference Device," Journal of Applied Optics, vol. 46, no. 9, 2007.

Channel Design - Prototype [1]
  [Figure: an optical signal passes splitters 1 to 3 (example ratios α1 = 1/4, α2 = 1/3, α3 = 1/2) that feed Branches 1 to 4. The power left on the main waveguide is (1-α1)(1-e1) after splitter 1 and (1-α1)(1-α2)(1-e1)(1-e2) after splitter 2.]
  α: power ratio
  e: the access optical power loss
  β: power portion in that branch
  $\beta_1 = \alpha_1 (1 - e_1)$
  $\beta_2 = \alpha_2 (1 - \alpha_1)(1 - e_1)(1 - e_2)$
  $\beta_3 = \alpha_3 (1 - \alpha_1)(1 - \alpha_2)(1 - e_1)(1 - e_2)(1 - e_3)$
  $\beta_4 = (1 - \alpha_1)(1 - \alpha_2)(1 - \alpha_3)(1 - e_1)(1 - e_2)(1 - e_3)$
  $e_1 = e_2 = e_3$
  [1] B. Z. Fu, Y. H. Han, H. W. Li, and X. W. Li, "Accelerating Lightpath Setup Via Broadcasting in Binary-Tree Waveguide in Optical NoCs," in Proceedings of the Conference on Design, Automation and Test in Europe (DATE), pp. 933-936, 2010.

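The branch power portions above can be checked numerically. The following is a minimal sketch, assuming a chain of splitters with an equal access loss e at each one (as the slide states, e1 = e2 = e3); the function name `branch_portions` is hypothetical and not part of PROBE.

```python
from fractions import Fraction


def branch_portions(alphas, loss):
    """Return [beta_1, ..., beta_n] for a chain of len(alphas) splitters.

    alphas[k] is the power ratio tapped into branch k+1 at splitter k+1.
    The light left after the last splitter goes to the final branch.
    loss is the access loss e, assumed identical at every splitter.
    """
    betas = []
    through = 1  # fraction of input power still on the main waveguide
    for alpha in alphas:
        after_loss = through * (1 - loss)      # pay the access loss e
        betas.append(alpha * after_loss)       # portion tapped into this branch
        through = (1 - alpha) * after_loss     # portion that continues
    betas.append(through)                      # last branch takes the rest
    return betas


# Ratios from the slide: alpha1 = 1/4, alpha2 = 1/3, alpha3 = 1/2.
ratios = [Fraction(1, 4), Fraction(1, 3), Fraction(1, 2)]
lossless = branch_portions(ratios, Fraction(0))
lossy = branch_portions(ratios, Fraction(1, 10))
```

With zero loss these ratios give every branch the same share, which suggests why 1/4, 1/3, 1/2 is a natural choice for four active branches. With a nonzero loss, later branches receive slightly less because they pass more splitters.
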
Channel Design - Four Power States (1/2)
  [Figure: Pstate 1 uses α1 = 1/4, α2 = 1/3, α3 = 1/2 across Branches 1 to 4, with β4 = (1/4)(1-e)^3 and Bw = 1.28 Tb/s. Pstate 2 uses α1 = 1/3, α2 = 1/2, with β4 ≈ 0 and Bw = 960 Gb/s.]

Channel Design - Four Power States (2/2)
  Pstate  BW (Tb/s)  α1   α2   α3   Power loss (dB)
  1       1.28       1/4  1/3  1/2  0.49
  2       0.96       1/3  1/2  1    0.39
  3       0.64       1/2  1    NA   0.30
  4       0.32       1    NA   NA   0.2
  [Figure: Pstates 1 to 4 side by side. Pstate 1: β4 = (1/4)(1-e)^3, Bw = 1.28 Tb/s. Pstates 2, 3, and 4: β4 ≈ 0, with Bw = 960 Gb/s, 640 Gb/s, and 320 Gb/s respectively.]

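The pattern in the power-state table can be generated rather than stored. The sketch below rests on two assumptions that the slides do not state explicitly: the active branches share the light equally, and each active branch carries 320 Gb/s (1.28 Tb/s divided over four branches). The names `pstate_settings` and `BRANCH_BW_GBPS` are illustrative only.

```python
from fractions import Fraction

# Assumption: per-branch bandwidth implied by 1.28 Tb/s over 4 branches.
BRANCH_BW_GBPS = 320


def pstate_settings(active_branches, total_branches=4):
    """Return (splitter ratios, bandwidth in Gb/s) for a power state.

    The k-th splitter (0-based) taps 1/(branches still to be fed), so all
    active branches get an equal share when losses are ignored. Splitters
    that lie beyond the last active branch are unused and reported as None.
    """
    if not 1 <= active_branches <= total_branches:
        raise ValueError("active_branches out of range")
    ratios = []
    for k in range(total_branches - 1):
        remaining = active_branches - k  # branches not yet fed
        ratios.append(Fraction(1, remaining) if remaining >= 1 else None)
    return ratios, active_branches * BRANCH_BW_GBPS


full = pstate_settings(4)   # all four branches on
three = pstate_settings(3)  # last branch switched off
one = pstate_settings(1)    # only the first branch on
```

A ratio of 1 means the splitter sends all remaining light into its branch, so nothing reaches the branches after it.
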
Waveguide Design
  Three-level binary-tree-based waveguide
    Level 1: direction (X or Y)
    Level 2: channel
    Level 3: branch
  [Figure: Laser 0 feeds splitters α(1,1), α(2,1), and α(2,2). In the X direction, Channel 1 goes to R1, Channel 2 to R2, and Channel 3 to R3. In the Y direction, the channels go to R4, R8, and R12.]

Traffic Prediction (1/2)
  Traffic indicators
    Link and buffer utilization [1]
  First predictor: for low traffic variation
  Second predictor: for high traffic variation
    Based on prior work inspired by history-based branch predictors and by the observed repetitive behavior of real traffic [2]
  [1] X. Chen, L-S. Peh, G-Y. Wei, Y-K. Huang, and P. Prucnal, "Exploring the Design Space of Power-Aware Opto-electronic Network Systems," International Symposium on High-Performance Computer Architecture (HPCA), pp. 120-131, 2005.
  [2] Y. S-C. Huang, K. C-K. Chou, C-T. King, "Application-Driven End-to-End Traffic Predictions for Low Power NoC Design," IEEE Transactions on Very Large Scale Integration Systems, pp. 1-10, 2012.

Traffic Prediction (2/2)
  Second predictor: history based
  HTPT: history traffic pattern table (link level)
    One row per channel, for example (i, x, 0), (i, x, 1), (i, x, 2), (i, y, 0).
    Columns H5~H1: history traffic pattern. Column H0: current link utilization.
    Link utilization is quantized into five ranges: 0.0~0.2, 0.2~0.4, 0.4~0.6, 0.6~0.8, 0.8~1.0.
  PT: prediction table
    Columns: Tag, Index, P, LRU.
    P: predicted traffic load.
  [Figure: example HTPT and PT contents]

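A history-based predictor of this kind can be sketched in a few lines. This is an illustrative model, not the PROBE hardware: it keeps the last five quantized utilization levels as the pattern, looks the pattern up in a small table with least-recently-used replacement, and falls back to the most recent level on a miss. The fallback rule, the table size of 8, the 0-based level numbering, and all names are assumptions.

```python
from collections import OrderedDict, deque


def quantize(util, levels=5):
    """Map a utilization in [0, 1] to a level 0..levels-1 (0.0~0.2 -> 0)."""
    return min(int(util * levels), levels - 1)


class HistoryPredictor:
    """Pattern table keyed by recent history, with LRU replacement."""

    def __init__(self, history_len=5, table_size=8):
        self.history = deque(maxlen=history_len)  # plays the role of H5..H1
        self.table = OrderedDict()                # pattern -> next level
        self.table_size = table_size

    def predict(self):
        key = tuple(self.history)
        if key in self.table:
            self.table.move_to_end(key)           # mark as recently used
            return self.table[key]
        # Assumed fallback on a miss: repeat the latest observed level.
        return self.history[-1] if self.history else 0

    def update(self, level):
        """Record the level observed in the window that just ended."""
        if len(self.history) == self.history.maxlen:
            key = tuple(self.history)
            self.table[key] = level
            self.table.move_to_end(key)
            if len(self.table) > self.table_size:
                self.table.popitem(last=False)    # evict least recently used
        self.history.append(level)


predictor = HistoryPredictor()
pattern = [1, 3, 2, 4, 0] * 4
predictions = []
for level in pattern:
    predictions.append(predictor.predict())
    predictor.update(level)
```

On a repeating pattern the table fills during the first two periods, after which every prediction matches the observed level. That is the repetitive behavior the slide refers to.
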
Dynamic Bandwidth Scaling (1/3)
  Step 1: Prediction (Rw -> Lu, Bu -> Predict)
    Rw: reconfiguration window, set to 1000 cycles in the simulation.
    Lu, Bu: link and buffer utilization are gathered at each output port.
    Predict: the resource utilization is predicted based on the traffic fluctuation.

Dynamic Bandwidth Scaling (2/3)
  Step 2: Decision (Bw, three modes)
    Compare the predicted link utilization with the pre-defined bounds of the selected mode:
      Performance mode (0.2 ~ 0.4)
      Balanced mode (0.4 ~ 0.6)
      Power-aware mode (0.6 ~ 0.8)
    Increase the bandwidth if the prediction is over the upper bound; decrease it if the prediction is below the lower bound.
    Check the buffer utilization.

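The decision step can be illustrated as a simple threshold rule. This sketch assumes that power state 1 is the highest bandwidth and 4 the lowest, that the bandwidth moves one state per reconfiguration window, and that the buffer check only blocks a bandwidth reduction when buffers are filling up. The slide says the buffer utilization is checked but not how, so `buffer_limit` and the function name are hypothetical.

```python
# Link-utilization bounds per mode, taken from the slide.
MODES = {
    "performance": (0.2, 0.4),
    "balanced": (0.4, 0.6),
    "power-aware": (0.6, 0.8),
}


def next_pstate(pstate, predicted_link_util, buffer_util, mode,
                buffer_limit=0.5):
    """Pick the power state for the next reconfiguration window.

    pstate 1 is the highest bandwidth, pstate 4 the lowest.
    buffer_limit is an assumed threshold, not a value from the slides.
    """
    low, high = MODES[mode]
    if predicted_link_util > high and pstate > 1:
        return pstate - 1  # over the upper bound: add bandwidth
    if (predicted_link_util < low and pstate < 4
            and buffer_util < buffer_limit):
        return pstate + 1  # under the lower bound: remove bandwidth
    return pstate          # inside the band, or blocked by the buffer check


step_up = next_pstate(2, 0.9, 0.1, "balanced")
step_down = next_pstate(2, 0.1, 0.1, "balanced")
held = next_pstate(2, 0.1, 0.9, "balanced")
```

The same prediction can lead to different actions in different modes: a predicted utilization of 0.5 is inside the balanced band but above the performance band, so performance mode would add bandwidth where balanced mode would not.
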
Dynamic Bandwidth Scaling (3/3)
  Step 3: Tuning (lasers, microrings)
    Calculate the splitter power ratios and the required laser power.
    Tune the lasers, the splitters, and the on-chip microrings.
    Delay is critical! It comes from:
      Off-chip communication
      Tuning

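As a rough illustration of the laser power calculation, the sketch below estimates the laser power of each power state relative to power state 1. It assumes a simple model that the slides do not spell out: every active branch needs the same optical power, and the laser must additionally compensate the splitter power loss (in dB) listed in the power-state table. The model and all names are assumptions.

```python
# Splitter power loss in dB per power state, as listed in the table.
POWER_LOSS_DB = {1: 0.49, 2: 0.39, 3: 0.30, 4: 0.2}
# Active branches per power state (1.28, 0.96, 0.64, 0.32 Tb/s).
ACTIVE_BRANCHES = {1: 4, 2: 3, 3: 2, 4: 1}


def relative_laser_power(pstate, reference=1):
    """Laser power of pstate as a fraction of the reference power state.

    Assumed model: power = active branches x loss compensation, where
    a loss of L dB requires 10**(L/10) times more input power.
    """
    def raw(p):
        return ACTIVE_BRANCHES[p] * 10 ** (POWER_LOSS_DB[p] / 10)

    return raw(pstate) / raw(reference)


fractions_by_pstate = {p: relative_laser_power(p) for p in (1, 2, 3, 4)}
```

Under this model the lowest power state needs a little under a quarter of the full laser power, which is in line with the roughly 75% saving reported at low network load.
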
Methodology
  64-core system
    5 GHz processor
    64KB private L1 and 4MB per tile shared L2 caches, 4 GB DRAM, 60 cycle access latency, 6 on-chip DRAM controllers
  Detailed Networks-on-Chip model
    Cycle-accurate simulator based on Booksim
    Virtual channel flow control (2 VCs, 6 flits buffer depth)
    256-bit channel width
  Performance analysis
    Latency, throughput, execution time, optical power
  Benchmarks
    SPLASH-2, PARSEC, and SPEC CPU 2006 traces
    Synthetic traffic patterns

Load / Latency Curve
  [Figure: latency (# of cycles) versus injection rate (flit/node/cycle) for Uniform and Bit Complement traffic. Curves: Without PROBE, Power-aware Mode, Balanced Mode, Performance Mode. Annotations: "Fluctuation" on the Power-aware Mode curve; 4.8% on the Bit Complement panel.]
  Power-aware Mode fluctuation: link utilizations go back and forth over the boundary.
  Throughput: at most % penalty compared to the baseline.
  Balanced Mode and Performance Mode approach the baseline and have different closing points.

Latency vs. Optical Power (1/2)
  [Figure: latency (# of cycles) and optical power consumption versus injection rate (flit/node/cycle) for Uniform traffic. Curves: Without PROBE, Power-aware Mode, Balanced Mode, Performance Mode. Annotations: 25%, 75%.]
  Critical points (injection rate)
    0.05 (three modes)
    0.23 (Performance Mode)
    0.45 (Balanced Mode)
  Optical power saving
    25% optical power saving due to % throughput loss (Power-aware Mode)
    Save ~75% optical power at low network load (three modes)

Latency vs. Optical Power (2/2)
  [Figure: latency (# of cycles) and optical power consumption versus injection rate (flit/node/cycle) for Bit Complement and Transpose traffic. Curves: Without PROBE, Power-aware Mode, Balanced Mode, Performance Mode. Annotations: Bit Complement 4.8%, 50%, 75%; Transpose 4.7%, 57%, 75%.]

Real Traffic Traces: Execution Time
  [Figure: normalized execution time per benchmark. Bars: Without PROBE, Performance Mode, Balanced Mode, Power-aware Mode.]
  Performance Mode: close to the baseline
  Balanced Mode: % penalty on average
  Power-aware Mode: 25% penalty on average

Real Traffic Traces: Optical Power
  [Figure: optical power consumption per benchmark. Bars: Without PROBE, Performance Mode, Balanced Mode, Power-aware Mode.]
  Performance Mode: 59% more optical power saving
  Balanced Mode: 70% optical power saving on average
  Power-aware Mode: 72% optical power saving on average

Conclusions
  The photonic interconnect design is boosted by the evolution of optical devices.
  PROBE is an energy-efficient solution to reduce the high static power consumption in photonic networks.
  PROBE further improves the on-chip resource utilization.

Questions?
THANK YOU!