Interconnect-Power Dissipation in a Microprocessor
N. Magen, A. Kolodny, U. Weiser, N. Shamir
Intel Corporation; Technion - Israel Institute of Technology
4/2/2004
Interconnect-Power Definition
Interconnect power is the dynamic power consumed by switching the interconnect capacitance.
How much power is consumed by interconnections?
What are the trends for future generations?
How can interconnect power be reduced?
[Figure: 0.13 µm interconnect cross-section. Source: Intel]
Background
Power is becoming a major design issue.
Scope: dynamic power, which is the majority of the power:
$P = \sum_i AF_i \, C_i \, V^2 f$
This work focuses on the capacitance term $C_i$.
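As a quick illustration of the dynamic power formula, the sketch below sums AF × C over nets and multiplies the result by V²f. The `Net` record, its field names and the example values are hypothetical. Following the slide, no factor of 1/2 is applied, so AF is assumed to already reflect whichever transition-counting convention is used.

```python
from dataclasses import dataclass


@dataclass
class Net:
    name: str
    af: float     # activity factor (assumed to already include any 1/2 convention)
    cap_f: float  # switched capacitance of the net, in farads


def dynamic_power(nets: list[Net], vdd: float, freq_hz: float) -> float:
    """P = sum_i AF_i * C_i * V^2 * f, in watts."""
    switched_cap = sum(n.af * n.cap_f for n in nets)  # sum_i AF_i * C_i
    return switched_cap * vdd ** 2 * freq_hz


# Hypothetical nets, for illustration only (not case-study data).
example = [Net("clk_local", 1.0, 50e-15), Net("data0", 0.05, 20e-15)]
print(dynamic_power(example, vdd=1.0, freq_hz=1e9))  # roughly 5.1e-05 W
```

Because the capacitance enters linearly, any reduction in switched wire capacitance maps one-to-one onto the interconnect share of this sum.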
Outline
Research methodology
Interconnect power analysis
Power-aware router experiment
Interconnect power prediction
Summary
Case Study
A low-power, state-of-the-art µ-processor.
Dynamic switching power analysis.
Interconnect attributes: length, capacitance, fan-out (FO), hierarchy data, net type, activity factors (AF), and miscellaneous.
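Interconnect Length Model
The total wire length is stitched across hierarchies and summed over repeaters.
Net model: the net capacitance comprises the diffusion capacitance (Cdiff), the wire capacitance (Cwire) and the gate capacitance (Cgate).
[Figure: example net split across hierarchies and repeaters, with per-segment lengths, and the Cdiff/Cwire/Cgate net model]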
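A minimal sketch of the length model follows. It assumes each net is available as a list of extracted pieces, one per hierarchy or repeater stage. The total length is the sum of the pieces, and the net capacitance follows the Cdiff + Cwire + Cgate model with a single per-micron wire capacitance. The `Segment` and `StitchedNet` names and all numbers are illustrative assumptions, not the tool used in the study.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    hierarchy: str    # block in which this piece of the net was extracted (hypothetical)
    length_um: float  # routed length of this piece


@dataclass
class StitchedNet:
    name: str
    segments: list[Segment] = field(default_factory=list)
    c_diff_f: list[float] = field(default_factory=list)  # diffusion caps (driver/repeater outputs)
    c_gate_f: list[float] = field(default_factory=list)  # gate caps (receiver/repeater inputs)

    def total_length_um(self) -> float:
        # Stitched across hierarchies and summed over repeaters.
        return sum(s.length_um for s in self.segments)

    def capacitance_f(self, c_wire_per_um: float) -> float:
        # Net model: C = Cdiff + Cwire + Cgate
        c_wire = c_wire_per_um * self.total_length_um()
        return sum(self.c_diff_f) + c_wire + sum(self.c_gate_f)


net = StitchedNet(
    "bus_req",
    segments=[Segment("unit_a", 120.0), Segment("top", 300.0), Segment("unit_b", 80.0)],
    c_diff_f=[2e-15],
    c_gate_f=[1e-15, 1.5e-15],
)
print(net.total_length_um())  # 500.0
```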
Activity Factors Generation
1. Power test vector generation (worst case for high power, unit stressing).
2. RTL full-chip simulation (results in block primary inputs: activity, probability).
3. Monte Carlo based block input generation (based on the RTL statistics).
4. Transistor-level simulation per block (unit delay, tuning for glitches).
5. Result: per-node activity factor.
Source: Intel Pentium M Processor Power Estimation, Budgeting, Optimization, and Validation, ITJ 2003
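The Monte Carlo step has to turn RTL statistics (activity and probability per block input) into input vectors. One simple way to do this is a two-state Markov chain. Its transition probabilities are chosen so that the stationary probability of "1" is p and the expected number of toggles per cycle is a. This is an assumed generator, not necessarily the method of the cited flow. Here "activity" means toggles per cycle; if AF counts only 0→1 transitions, halve it.

```python
import random


def markov_bitstream(prob_one: float, activity: float, n_cycles: int, seed: int = 0) -> list[int]:
    """0/1 stream with signal probability ~prob_one and toggle rate ~activity.

    Stationary two-state Markov chain:
      P(0 -> 1) = a / (2 (1 - p)),  P(1 -> 0) = a / (2 p)
    which gives P(bit = 1) = p and an expected toggle rate of a per cycle.
    """
    p, a = prob_one, activity
    if not 0 < p < 1 or not 0 <= a <= 2 * min(p, 1 - p):
        raise ValueError("need 0 < p < 1 and 0 <= a <= 2*min(p, 1-p)")
    p01 = a / (2 * (1 - p))
    p10 = a / (2 * p)
    rng = random.Random(seed)
    bit = 1 if rng.random() < p else 0  # start in the stationary distribution
    stream = []
    for _ in range(n_cycles):
        stream.append(bit)
        if rng.random() < (p10 if bit else p01):
            bit ^= 1
    return stream


def measured_stats(bits: list[int]) -> tuple[float, float]:
    """(signal probability, toggles per cycle) of a generated stream."""
    toggles = sum(1 for x, y in zip(bits, bits[1:]) if x != y)
    return sum(bits) / len(bits), toggles / (len(bits) - 1)


bits = markov_bitstream(prob_one=0.3, activity=0.1, n_cycles=200_000, seed=1)
print(measured_stats(bits))  # close to (0.3, 0.1)
```

The bound a ≤ 2·min(p, 1 − p) is needed because a signal that is rarely 1 cannot toggle often.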
Interconnect Length Distribution
[Figure: number of nets vs. net length (µm), log-log, for Pentium (0.5 µm), Pentium MMX (0.35 µm), Pentium Pro (0.5 µm), Pentium II (0.35 µm), Pentium II (0.25 µm), Pentium III (0.18 µm) and the low-power processor (0.13 µm). Source: Shekhar Y. Borkar, CRL - Intel]
Interconnect Length Distribution
Nets vs. net length, log-log scale.
Exponential decrease with length.
Global clock not included.
[Figure: number of nets vs. length (µm) for local, global and total nets]
Total Dynamic Power
Global clock not included.
Local nets = 66%, global nets = 34%.
[Figure: normalized total power vs. net length (µm) for interconnect, local, global and total, with two peaks. Peak 1: nets: 390k, cap: 0 [nF], FO: 2, AF: 0.0485. Peak 2: nets: 75k, cap: 20 [nF], FO: 20, AF: 0.055]
Total Dynamic Power Breakdown
Global clock included.
Interconnect 51%, gate 34%, diffusion 15%.
Power Breakdown by Net Types
Global clock included.
Interconnect power (interconnect only): global signals 34%, global clock 19%, local signals 27%, local clock 20%.
Total power (gate, diffusion and interconnect): global signals 21%, global clock 13%, local signals 37%, local clock 29%.
Interconnect Power Breakdown
Interconnect consumes 50% of the dynamic power.
Clock power is ~40% (of both interconnect power and total power).
90% of the power is consumed by 10% of the nets.
Interconnect design is NOT power-aware!
A predictive model can project the interconnect power.
[Figure: interconnect power and total power breakdowns by net type (global signals, global clock, local signals, local clock) and by component (interconnect, gate, diffusion)]
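The "90% of the power in 10% of the nets" observation is a cumulative-share statistic. The sketch below shows one way to compute it from per-net power estimates: sort the nets in descending order of power, then accumulate until the target share is reached. The input values are made up.

```python
def fraction_of_nets_for_power(net_powers: list[float], target: float = 0.9) -> float:
    """Smallest fraction of nets whose combined power reaches `target` of the total."""
    if not net_powers:
        raise ValueError("empty net list")
    ordered = sorted(net_powers, reverse=True)  # most power-hungry nets first
    goal = target * sum(ordered)
    running = 0.0
    for count, power in enumerate(ordered, start=1):
        running += power
        if running >= goal:
            return count / len(ordered)
    return 1.0


# Hypothetical per-net powers (arbitrary units), not case-study data.
powers = [60, 25, 8, 4, 2, 1]
print(fraction_of_nets_for_power(powers))  # 0.5
```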
Experiment - Power-Aware Router
A routing experiment optimizing the processor's blocks.
Local nodes (clock and signals) consume 66% of the dynamic power.
10% of the nets consume 90% of the power.
Minimum spanning trees can save over 20% of the interconnect power.
Routing with spacing can save up to 40% of the interconnect power.
[Figure: a small block's local clock network]
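The minimum-spanning-tree claim can be illustrated with a rectilinear MST (RMST) built by Prim's algorithm over Manhattan distances. The comparison against a star topology, with the driver wired directly to each sink, is a baseline assumed for illustration only. The slides do not say what routing the RMST was compared with.

```python
def manhattan(a: tuple[int, int], b: tuple[int, int]) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def rmst_length(pins: list[tuple[int, int]]) -> float:
    """Rectilinear minimum spanning tree length via Prim's algorithm, O(n^2)."""
    if len(pins) < 2:
        return 0
    in_tree = [False] * len(pins)
    best = [float("inf")] * len(pins)  # cheapest known connection of each pin to the tree
    best[0] = 0
    total = 0
    for _ in range(len(pins)):
        u = min((i for i in range(len(pins)) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(len(pins)):
            if not in_tree[v]:
                best[v] = min(best[v], manhattan(pins[u], pins[v]))
    return total


def star_length(pins: list[tuple[int, int]]) -> int:
    """Length if the driver (pins[0]) is wired point-to-point to every sink."""
    return sum(manhattan(pins[0], p) for p in pins[1:])


pins = [(0, 0), (10, 0), (10, 10), (0, 10)]  # driver first, then sinks (hypothetical)
print(rmst_length(pins), star_length(pins))  # 30 40
```

Since wire capacitance grows roughly with routed length, the shorter tree switches less capacitance. A real router would typically produce a rectilinear Steiner tree, which can be shorter still.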
Power-Aware Router Flow
1. Power grid routing.
2. Clock tree routing, with spacing and avoiding congestion (the clock tree has high FO and long lines, and is very active).
3. Routing of the top n% power-consuming signal nets.
4. Global and detailed routing of the un-routed nets (timing- and congestion-driven).
5. All nets routed? If not: power-aware rip-up and re-route (only nets that are not high-power are ripped up), then check again. If yes: continue.
6. Driver downsizing follows; finish.
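The ordering part of this flow can be sketched as a sort. The power grid comes first, then the clock, then signal nets ranked by an estimated switched capacitance AF × C, with the top fraction protected from rip-up. Several details are assumptions: the net kinds, the `est_cap_f` estimate, and the choice to also protect power and clock nets. Routing itself is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Net:
    name: str
    kind: str         # "power", "clock" or "signal" (hypothetical labels)
    af: float         # activity factor
    est_cap_f: float  # pre-route capacitance estimate (hypothetical)


def routing_order(nets: list[Net], top_fraction: float = 0.1):
    """Route order: power grid, clock, top-n% power signal nets, then the rest.

    Returns (ordered nets, names protected from rip-up).
    """
    power = [n for n in nets if n.kind == "power"]
    clock = [n for n in nets if n.kind == "clock"]
    signals = sorted((n for n in nets if n.kind == "signal"),
                     key=lambda n: n.af * n.est_cap_f, reverse=True)
    n_top = max(1, round(top_fraction * len(signals))) if signals else 0
    protected = {n.name for n in power + clock + signals[:n_top]}
    return power + clock + signals, protected


def rip_up_candidates(congested: list[Net], protected: set[str]) -> list[Net]:
    """Only nets that are not high-power may be ripped up and re-routed."""
    return [n for n in congested if n.name not in protected]


nets = [
    Net("s_low", "signal", 0.01, 5e-15),
    Net("vdd", "power", 0.0, 0.0),
    Net("clk", "clock", 1.0, 100e-15),
    Net("s_high", "signal", 0.5, 10e-15),
]
order, protected = routing_order(nets, top_fraction=0.5)
print([n.name for n in order])  # ['vdd', 'clk', 's_high', 's_low']
```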
Results - Power Saving
[Figure: average dynamic power saving per block (A to E), split into driver downsizing saving and router power saving]
Average saving results: 4.3% for ASIC blocks.
Router saving is estimated based on the clock interconnect power.
Future of Interconnect Power
[Figure: dynamic power breakdown (gate, diffusion, interconnect; % of dynamic power) vs. technology generation (µm). Source: ITRS 2001 Edition, adapted data]
Interconnect power grows to 65%-80% within 5 years! (using optimistic interconnect scaling)
Interconnect Power Prediction
[Figure, left: the number of nets vs. unit length with a modified Davis model: normalized number of nets vs. length (µm), showing the interconnect length projection (upper local bound, lower global bound) and measured data]
[Figure, right: dynamic power breakdown model: the average dynamic power breakdown (interconnect, diffusion, gate) for local, intermediate and global nets]
Interconnect Power Model
Multiplying the number of interconnects by the power breakdowns gives the projected dynamic power vs. net length.
[Figure: normalized projected dynamic power vs. net length (µm): measured power and projection]
The power model matches the processor power distribution!
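Below is a minimal version of the multiplication described above: for each length bin, the number of nets is multiplied by an average per-net dynamic power. This sketch assumes a single average AF and a fixed gate plus diffusion load per net. The slide instead uses measured breakdowns per local, intermediate and global class. All parameter values are hypothetical.

```python
def projected_power(bins, af, c_wire_per_um, c_gate_f, c_diff_f, vdd, freq_hz):
    """Power per length bin = (number of nets) x (average dynamic power per net).

    bins: (length_um, number_of_nets) pairs, e.g. from a wire-length distribution.
    Returns (length_um, bin_power_w, interconnect_share) tuples.
    """
    rows = []
    for length_um, count in bins:
        c_wire = c_wire_per_um * length_um
        c_net = c_diff_f + c_gate_f + c_wire
        power = count * af * c_net * vdd ** 2 * freq_hz
        rows.append((length_um, power, c_wire / c_net))
    return rows


rows = projected_power([(10, 1000), (1000, 10)], af=0.1, c_wire_per_um=0.2e-15,
                       c_gate_f=1e-15, c_diff_f=1e-15, vdd=1.0, freq_hz=1e9)
for length_um, power, ic_share in rows:
    print(f"{length_um:>6} um: {power:.3e} W, interconnect share {ic_share:.2f}")
```

Even with this crude model, the interconnect share rises with length, which is why many short nets and a few long ones can both contribute noticeable power.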
Summary
Interconnect is 50% of the dynamic power of processors, and getting worse. Interconnect-power-aware design is recommended.
Clock consumes 40% of the interconnect power. Clock interconnect spacing is suggested.
Interconnect power is a sum over nearly all net lengths and types. Router-level interconnect power reduction addresses all of them.
Interconnect power has a strong dependency on the hierarchy. Per-hierarchy analysis and optimization algorithms.
Future Research
1. Interconnect power characterization and prediction.
2. Investigate interconnect power reduction techniques: interconnect spacing for power, interconnect-power-aware physical design, aspect ratio optimization for power, and architectural communication reduction.
Questions?
Backup Slides
Processor Case Study
Analysis subject: processor, 0.13 µm, 77 million transistors, die size of 88 mm².
Data sources: AF, capacitance, length.
Excluded: L2 cache, global clock, analog units.
Global Communication
Global power is important.
Global power is mostly IC (interconnect).
For higher-power benchmarks, global power is higher.
Global clock excluded.
[Figure: global power percentage vs. total test power (µW), with a polynomial fit]
Benchmark Selection
High-power test benchmarks: worst-case design; suitable for thermal design and power grid design; average power is a fraction of peak power.
Unit-stressing benchmarks.
Averaging of all high-power benchmarks.
High node coverage.
Interconnect Power Implications
Interconnect power can be reduced by minimizing the switched capacitance through:
fabrication process (wire parameters),
power-driven physical design,
logic optimization for power,
architectural interconnect optimization.
Interconnect Capacitance
Side capacitance is increasing: from 70% to 80% of the self-capacitance.
[Figure: global capacitance breakdown into side (H-cap) and vertical (V-cap) components for layers 1 to 3 vs. technology generation (µm). Source: ITRS 2001 Edition, adapted data]
The majority of interconnect capacitance is side-capacitance!
Fabrication Process - Aspect Ratio (AR)
Interconnect aspect ratio: $AR = \frac{\text{Thickness}}{\text{Width}}$
Low AR = low interconnect power.
Low AR = high resistance.
Frequency modeling:
Local: average gate, average IC.
Global: optimally buffered global IC, with a repeater every $L_{crit}$.
[Figure: wire cross-section (thickness, width) and RC models ($R_{int}$, $C_{int}$) of local and repeated global interconnect]
Aspect Ratio Trade-offs
Power depends on capacitance.
Frequency: local paths depend on gate and IC capacitance; global paths depend mostly on IC RC.
Per-layer AR optimization!
Scaling: more power saving, less frequency loss.
[Figure: dynamic power, local path speed and global path speed vs. relative AR (50% to 150%)]
Aspect ratio optimization can save over 10% of dynamic power!
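Below is a first-order sketch of the trade-off, built on several simplifying assumptions:
- Width and spacing are fixed.
- Only the side (coupling) capacitance scales with thickness (parallel-plate, no fringing).
- Resistance scales as 1/thickness.
- Local path delay tracks the total load.
- Optimally buffered global delay tracks sqrt(RC).

The constants `SIDE_FRACTION`, `IC_POWER_SHARE` and `LOCAL_WIRE_SHARE` are assumed values. The printed numbers therefore show the trend only and do not reproduce the slide's curve.

```python
import math

# Assumed first-order parameters (not from the case study):
SIDE_FRACTION = 0.75     # side (coupling) share of wire capacitance at nominal AR
IC_POWER_SHARE = 0.5     # interconnect share of dynamic power
LOCAL_WIRE_SHARE = 0.5   # wire share of the load on local paths


def wire_cap_rel(x: float) -> float:
    """Wire capacitance vs. relative AR x (width fixed); only side cap scales with thickness."""
    return SIDE_FRACTION * x + (1 - SIDE_FRACTION)


def dynamic_power_rel(x: float) -> float:
    return (1 - IC_POWER_SHARE) + IC_POWER_SHARE * wire_cap_rel(x)


def local_freq_rel(x: float) -> float:
    # Local paths: delay roughly proportional to the total load (gates + wire).
    return 1 / ((1 - LOCAL_WIRE_SHARE) + LOCAL_WIRE_SHARE * wire_cap_rel(x))


def global_freq_rel(x: float) -> float:
    # Optimally buffered global wires: delay per length ~ sqrt(R * C), with R ~ 1 / x.
    return 1 / math.sqrt(wire_cap_rel(x) / x)


for x in (0.5, 0.75, 1.0, 1.25, 1.5):
    print(f"AR {x:4.0%}: power {dynamic_power_rel(x):.3f}, "
          f"local f {local_freq_rel(x):.3f}, global f {global_freq_rel(x):.3f}")
```

Lowering AR lowers power and speeds up local paths, but slows global paths. This is consistent with the slide's suggestion to optimize AR per layer.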
Physical Design - Spacing
Spacing can save up to 40%; about 30% is saved with double spacing.
Spacing advantages: scaling, frequency, reliability, noise; it is easy to modify the minimum space.
[Figure: relative global IC capacitance vs. spacing (1x to 5x), 0.13 µm]
Wire spacing can save up to 20% of the dynamic power!
Spacing Calculation
Back-of-the-envelope estimation:
10% of the interconnects consume 90% of the power.
2X spacing = extra 20% wiring.
The global clock is not spaced (inductance); the global clock is 20% of the interconnect power.
Saving: 30% of (90% - 20%) = ~20% of the interconnect power.
Interconnect is 50% of the dynamic power, so the saving is ~10% of the dynamic power.
Expected 20% with downsizing.
Minor losses: congestion.
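Here is the same back-of-the-envelope arithmetic, written out so that the inputs can be varied. The default values are the ones on the slide; the function and parameter names are hypothetical. The unrounded results are 21% and 10.5%, which the slide rounds to ~20% and ~10%.

```python
def spacing_saving(top_nets_power=0.90, global_clock_share=0.20,
                   cap_reduction=0.30, ic_share_of_dynamic=0.50):
    """Return (interconnect power saving, dynamic power saving) as fractions."""
    spaceable = top_nets_power - global_clock_share  # the global clock is not spaced
    ic_saving = cap_reduction * spaceable
    return ic_saving, ic_saving * ic_share_of_dynamic


ic, dyn = spacing_saving()
print(f"interconnect: {ic:.0%}, dynamic: {dyn:.1%}")  # interconnect: 21%, dynamic: 10.5%
```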
µ-Architecture - CMP
Comparing two scaling methods by IC power, from generation 1 (processor P with L2) to generation 2:
Uniprocessor: processor P' (IC predicted by Rent's rule) with L2'.
CMP: two P cores with L2'.
L2: identical, minor. Clock: identical! Same average AF.
Result: ~5% less dynamic power for CMP.
Power Critical vs. Timing Critical
[Figure: accumulated power vs. timing slack (ps), with the timing-critical region marked]
Outline
Research methodology
Interconnect power analysis
Future trends analysis
Interconnect power implications
Summary
Interconnect Length Prediction
Technology projections: ITRS.
Interconnect length predictions:
ITRS model: 1/3 of the routing space (most optimistic).
Davis model: based on Rent's rule; predicts the number of nets as a function of the number of gates and complexity factors.
Models calibrated based on the case study.
Rent's Parameters
Rent's rule: $T = k N^r$
T = number of I/O terminals (pins)
N = number of gates
k = average number of I/Os per gate
r = Rent's exponent; it can be $0 < r < 1$, but commonly $0.5 < r < 0.75$ (from simple to complex logic).
[Figure: a block of N gates with T terminals]
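Rent's parameters are typically extracted by fitting a straight line to log T vs. log N. The sketch below assumes (N, T) pairs are already available, for example from blocks of the design hierarchy. The synthetic samples follow T = 3N^0.6 exactly, so the least-squares fit should recover k = 3 and r = 0.6.

```python
import math


def rent_terminals(k: float, n_gates: float, r: float) -> float:
    """Rent's rule: T = k * N^r."""
    return k * n_gates ** r


def fit_rent(samples: list[tuple[float, float]]) -> tuple[float, float]:
    """Least-squares fit of log T = log k + r log N over (N, T) samples."""
    xs = [math.log(n) for n, _ in samples]
    ys = [math.log(t) for _, t in samples]
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / sxx
    k = math.exp(mean_y - r * mean_x)
    return k, r


samples = [(n, rent_terminals(3.0, n, 0.6)) for n in (4, 16, 64, 256, 1024)]
k, r = fit_rent(samples)
print(round(k, 3), round(r, 3))  # 3.0 0.6
```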
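Donath's Length Estimation Model
For the i-th level:
There are $4^i$ blocks.
Each block has $k\left(\frac{N}{4^i}\right)^r$ terminals.
Assuming two-terminal nets, there are $\frac{4^i}{2}\,k\left(\frac{N}{4^i}\right)^r$ nets.
The nets of level i-1 must be subtracted, so the number of nets at level i is:
$n_i = \frac{4^i}{2}\,k\left(\frac{N}{4^i}\right)^r - \frac{4^{i-1}}{2}\,k\left(\frac{N}{4^{i-1}}\right)^r = \frac{k N^r}{2}\,4^{i(1-r)}\left(1 - 4^{r-1}\right)$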
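The level-by-level count can be checked numerically. The sketch computes the nets of level i as the raw count of two-terminal nets on the 4^i blocks of that level, minus the raw count at level i-1. It then compares the result with the factored closed form. Level 0 is taken as the whole chip (one block), and the parameter values are arbitrary.

```python
def level_nets_raw(i: int, n_gates: float, k: float, r: float) -> float:
    """Two-terminal nets on the 4**i blocks of level i: (4^i / 2) * k * (N / 4^i)^r."""
    blocks = 4 ** i
    return blocks / 2 * k * (n_gates / blocks) ** r


def donath_nets(i: int, n_gates: float, k: float, r: float) -> float:
    """n_i: nets of level i after subtracting those already counted at level i - 1."""
    if i < 1:
        raise ValueError("level must be >= 1")
    return level_nets_raw(i, n_gates, k, r) - level_nets_raw(i - 1, n_gates, k, r)


def donath_nets_closed_form(i: int, n_gates: float, k: float, r: float) -> float:
    """n_i = (k N^r / 2) * 4^(i (1 - r)) * (1 - 4^(r - 1))."""
    return k * n_gates ** r / 2 * 4 ** (i * (1 - r)) * (1 - 4 ** (r - 1))


N, K, R = 4 ** 6, 3.0, 0.6
for level in range(1, 7):
    print(level, round(donath_nets(level, N, K, R), 1))
```

For r < 1 the count grows with i, so the deeper levels, which connect small neighbouring blocks, hold most of the (short) nets.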
Average Interconnection Length
Taken from a SLIP 2001 tutorial by Dirk Stroobandt.
The wires can be of two types: A (between adjacent sub-blocks) and D (between diagonal sub-blocks). For sub-blocks of λ × λ gates:
$L_A = \frac{4\lambda}{3} - \frac{1}{3\lambda}, \quad L_D = 2\lambda$
The average at level i: $r_i = \frac{4L_A + 2L_D}{6} = \frac{2}{9}\left(7\lambda - \frac{1}{\lambda}\right)$
Overall: $\bar{R} = \frac{\sum_i n_i r_i}{\sum_i n_i}$, which equals
$\bar{R} = \frac{2}{9}\left[7\,\frac{N^{r-0.5}-1}{4^{r-0.5}-1} - \frac{1-N^{r-1.5}}{1-4^{r-1.5}}\right]\frac{1-4^{r-1}}{1-N^{r-1}}$
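The sketch below does two checks:
- It verifies the mean adjacent-block length L_A by brute force on a small grid.
- It evaluates the overall average length both as the weighted sum over levels and in closed form.

It assumes N = 4^M gates on a square grid with unit gate pitch and level 0 = the whole chip. Under that convention, the blocks joined at level i have side λ = 2^(M-i). The common factor kN^r/2 of the level net counts cancels in the ratio. The closed form divides by 4^(r-0.5) - 1, so it is undefined at exactly r = 0.5.

```python
import math


def avg_adjacent_length_bruteforce(lam: int) -> float:
    """Mean Manhattan distance between points of two side-by-side lam x lam grids."""
    total = 0
    for xa in range(lam):
        for ya in range(lam):
            for xb in range(lam, 2 * lam):  # block B sits right next to block A
                for yb in range(lam):
                    total += abs(xb - xa) + abs(yb - ya)
    return total / lam ** 4


def level_length(lam: float) -> float:
    """r_i = (4 L_A + 2 L_D) / 6 = (2/9)(7 lam - 1/lam)."""
    l_a = 4 * lam / 3 - 1 / (3 * lam)
    l_d = 2 * lam
    return (4 * l_a + 2 * l_d) / 6


def donath_average_length(n_gates: int, r: float) -> float:
    """R = sum(n_i r_i) / sum(n_i), assuming N = 4**M gates on a square grid."""
    levels = round(math.log(n_gates, 4))
    if 4 ** levels != n_gates:
        raise ValueError("n_gates must be a power of 4")
    num = den = 0.0
    for i in range(1, levels + 1):
        lam = 2 ** (levels - i)                        # side of the level-i blocks being joined
        n_i = 4 ** (i * (1 - r)) * (1 - 4 ** (r - 1))  # common factor k N^r / 2 cancels
        num += n_i * level_length(lam)
        den += n_i
    return num / den


def donath_average_length_closed(n_gates: float, r: float) -> float:
    a = 7 * (n_gates ** (r - 0.5) - 1) / (4 ** (r - 0.5) - 1)
    b = (1 - n_gates ** (r - 1.5)) / (1 - 4 ** (r - 1.5))
    return 2 / 9 * (a - b) * (1 - 4 ** (r - 1)) / (1 - n_gates ** (r - 1))


print(donath_average_length(4 ** 6, 0.6), donath_average_length_closed(4 ** 6, 0.6))
```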
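Davis Model
From Rent's rule, $T = kN^p$ (here $p$ denotes the Rent exponent), the interconnect density function (IDF) is:
Region 1, $1 \le l < \sqrt{N}$: $i(l) = \frac{\alpha k}{2}\,\Gamma\left(\frac{l^3}{3} - 2\sqrt{N}\,l^2 + 2Nl\right) l^{2p-4}$
Region 2, $\sqrt{N} \le l < 2\sqrt{N}$: $i(l) = \frac{\alpha k}{6}\,\Gamma\left(2\sqrt{N} - l\right)^3 l^{2p-4}$
where $\alpha = \frac{FO}{FO+1}$ and
$\Gamma = \frac{2N\left(1 - N^{p-1}\right)}{-N^p\,\frac{1 + 2p - 2^{2p-1}}{p(2p-1)(p-1)(2p-3)} - \frac{1}{6p} + \frac{2\sqrt{N}}{2p-1} - \frac{N}{p-1}}$
Interconnect total number and length:
Nets: $I_{total} = \int_1^{2\sqrt{N}} i(\zeta)\,d\zeta$
Length: $L_{total} = \int_1^{2\sqrt{N}} \zeta\, i(\zeta)\,d\zeta$
Multipoint length: $L_{multi\_terminal} = L_{total}\,\chi$, where $\chi = \frac{4}{FO+3}$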
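The Davis model can be sanity-checked numerically. Integrating the IDF over 1 ≤ l ≤ 2√N should give the total number of point-to-point nets, αkN(1 - N^(p-1)), which is the quantity Γ normalizes to. The sketch uses a hand-written composite Simpson rule to stay dependency-free. Lengths are in gate pitches, and the parameter values are arbitrary.

```python
import math


def davis_gamma(n: float, p: float) -> float:
    """Normalization constant Gamma of the Davis IDF (p = Rent exponent)."""
    denom = (-n ** p * (1 + 2 * p - 2 ** (2 * p - 1))
             / (p * (2 * p - 1) * (p - 1) * (2 * p - 3))
             - 1 / (6 * p) + 2 * math.sqrt(n) / (2 * p - 1) - n / (p - 1))
    return 2 * n * (1 - n ** (p - 1)) / denom


def davis_idf(length: float, n: float, k: float, p: float, fan_out: float) -> float:
    """i(l): density of point-to-point nets of length l (gate pitches)."""
    alpha = fan_out / (fan_out + 1)
    s = math.sqrt(n)
    if 1 <= length < s:
        shape = (length ** 3 / 3 - 2 * s * length ** 2 + 2 * n * length) / 2
    elif s <= length <= 2 * s:
        shape = (2 * s - length) ** 3 / 6
    else:
        return 0.0
    return alpha * k * davis_gamma(n, p) * shape * length ** (2 * p - 4)


def simpson(f, a: float, b: float, steps: int = 2000) -> float:
    """Composite Simpson rule; `steps` must be even."""
    h = (b - a) / steps
    acc = f(a) + f(b)
    for j in range(1, steps):
        acc += (4 if j % 2 else 2) * f(a + j * h)
    return acc * h / 3


def total_nets(n: float, k: float, p: float, fan_out: float) -> float:
    """Integrate i(l) over 1..2*sqrt(N), splitting at the kink at sqrt(N)."""
    def integrand(length: float) -> float:
        return davis_idf(length, n, k, p, fan_out)
    s = math.sqrt(n)
    return simpson(integrand, 1, s) + simpson(integrand, s, 2 * s)


def multipoint_length(p2p_length: float, fan_out: float) -> float:
    """Davis multi-terminal correction: L_multi = L_total * 4 / (FO + 3)."""
    return p2p_length * 4 / (fan_out + 3)


N, K, P, FO = 10_000, 4.0, 0.75, 3.0
expected = FO / (FO + 1) * K * N * (1 - N ** (P - 1))  # alpha k N (1 - N^(p-1))
print(round(total_nets(N, K, P, FO)), round(expected))
```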
Davis Model - Extension
The constant factor favors shorter nets: a short point-to-point net has a higher chance of being part of a multipoint net.
Correction factor:
$\text{multi-terminal factor}(l) = \frac{\text{number of point-to-point nets shorter than } l}{\text{total point-to-point nets}}$
$I_{multi\text{-}terminal}(l) = \int i(\zeta)\,\text{multi-terminal factor}(\zeta)\,d\zeta$
[Figure: number of nets vs. length (µm): extracted Davis model, measured, and extended Davis model]
RMST (Rectilinear Minimum Spanning Tree) - Example
[Figure]
Total Dynamic Power
Global clock not included.
Local nets = 66%, global nets = 34%.
[Figure: normalized total power and interconnect power (Total_IC) vs. net length (µm)]
Local and Global IC
Local and global IC differ in: number-by-length breakdown, IC breakdown (capacitance and power), fan-out, and metal usage.
AF is similar.
[Figure: local and global power breakdown (IC, diffusion, gate) vs. net length (µm)]
Benchmarks Comparison
[Figure: global dynamic power vs. length (µm) for high-power tests and benchmarks]
High-power tests show similar behavior to average SPEC!
Interconnect Peaks
[Figure: total wire length vs. length (µm), measured and Davis model]
[Figure: average relative gate sizing vs. length (µm)]
ITRS Power Trends
The interconnect power reduction in the ITRS power projection, which happens in 2006-2007, is based on:
1. Aggressive voltage reduction.
2. Low-k dielectric improvements.
The device capacitance increases by +30% (trend -5%).
The combined effect: interconnect power is reduced (relative to voltage), while device power remains constant.
Dynamic Power - ITRS Trend
[Figure: dynamic power projection (W, normalized) for IC, diffusion and gate vs. technology generation (1/2 minimum pitch, µm)]
The black curve is the ITRS maximum heat removal capability.
Power-Aware Flow
1. Placement.
2. Power-aware routing.
3. RC extraction.
4. Timing analysis and power analysis.
5. All slacks positive? If yes: power-driven, timing-constrained driver downsizing. If no: timing-driven driver upsizing.
6. Sizing modified? If yes, iterate; if no, finish.
The reduced IC capacitance allows for driver downsizing. On average, it reduced the dynamic power by .4 of the IC power saving.
Downsizing is timing-verified.
Cell downsizing reduced the total area and leakage by 0.4%.
No signal spacing was applied (over 30% unused metal).
Post-layout optimizations are possible.
FUBs Description
Block A: medium, randomly picked.
Block B: small, highest clock power.
Block C: small, good potential.
Block D: medium, good potential.
Block E: worse than average.

Metric | Block A | Block B | Block C | Block D | Block E | Average
Area [µm²] | 3880.6 | 0274.6 | 6586. | 64229. | 59766.3 | 209537.8
Devices | 4574 | 8644 | 768 | 894 | 609 | 6675
Inactive nodes | 63.66% | 98.78% | 82.36% | 39.22% | 35.38% | 52.94%
Power [µW] | 770.22 | 25.5 | 786.76 | 8.90 | 6757. | 5373.86
RMST potential power saving | 4.3% | 7% | 22% | 29% | 4.% | 7%
Clock cap. | .25% | 2.59% | 2.75% | 3.6% | 3.27% | 8.0%
Clock power | 72.0% | 99.99% | 96.46% | 94.99% | 33.84% | 60.47%
IC cap. | 34.00% | 27.70% | 38.4% | 36.05% | 29.86% | 34.67%
IC power | 28.89% | 59.54% | 46.74% | 48.62% | 40.65% | 36.83%
Clock IC power | 20.9% | 59.54% | 45.48% | 46.26% | 6.87% | 23.87%
Clock IC length | .7% | 2.34% | 2.05% | 2.09% | 0.74% | 3.85%
Relative capacitance per length unit | 82.23% | 3.5% | 87.46% | 83.74% | 85.97% | 88.46%
Miller Factor - Power
[Figure: two coupled wires driven through R1 and R2 by sources V1 and V2, with coupling capacitance C]
Opposite-direction switching:
The current: $I_c = \frac{dq}{dt} = \frac{d(C V_c)}{dt} = C\,\frac{dV_c}{dt}$
Energy: $E = \int_0^T I_c V_{dd}\,dt = \int_0^T C V_{dd}\,\frac{dV_c}{dt}\,dt = C V_{dd}\int_{-V_{dd}}^{V_{dd}} dV_c = 2 C V_{dd}^2$
That is 4 times a single switching energy. Decoupling by a Miller factor of 2.
Same-direction switching => no current. Decoupling by a Miller factor of 0.
Average case: a Miller factor of 1, suitable for the power average-case sum metric.
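The energy argument generalizes to any neighbour transition. With E = C·Vdd·ΔVc, where ΔVc is the change of the voltage across the coupling capacitor, dividing by the quiet-neighbour energy C·Vdd² gives the Miller factor directly. In this sketch the victim always rises from 0 to Vdd, and the capacitance and supply values are arbitrary.

```python
def supply_energy(c_couple: float, vdd: float, neighbor_delta: float) -> float:
    """Energy drawn from the victim's supply while the victim rises 0 -> Vdd.

    E = integral of I_c * Vdd dt with I_c = C dVc/dt, i.e. E = C * Vdd * (change of Vc),
    where Vc is the voltage across the coupling capacitor.
    neighbor_delta: voltage change of the neighbouring wire during the same transition.
    """
    delta_vc = vdd - neighbor_delta
    return c_couple * vdd * delta_vc


def miller_factor(c_couple: float, vdd: float, neighbor_delta: float) -> float:
    """Energy relative to a quiet neighbour (C * Vdd^2)."""
    return supply_energy(c_couple, vdd, neighbor_delta) / (c_couple * vdd ** 2)


C, VDD = 10e-15, 1.2
cases = {"opposite": -VDD, "quiet": 0.0, "same": +VDD}
for name, delta in cases.items():
    print(name, round(miller_factor(C, VDD, delta), 6))
# A neighbour equally likely to rise, stay or fall averages to a factor of 1.
print(sum(miller_factor(C, VDD, d) for d in cases.values()) / len(cases))
```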
Routing Model
Via blockage.
Router efficiency: 0.6.
Power grid: 20% of routing.
Clock grid: 10% of the top tier.
$\text{Layer multiplier} = \frac{\text{Low layer pitch}}{\text{High layer pitch}}\left(1 - \text{blocking fraction}\right)$
More accurate than ITRS 2001.
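The sketch below implements the layer multiplier, plus one hypothetical way of combining it with the other routing-model parameters: router efficiency, power grid share, and clock grid share on the top tier. The slide does not state how the study actually combined these terms, so `usable_capacity` should be read as an assumption. The two-layer stack is made up.

```python
def layer_multiplier(low_pitch: float, high_pitch: float, blocking_fraction: float) -> float:
    """Tracks of a layer expressed in low-layer track units, reduced by via blockage."""
    return (low_pitch / high_pitch) * (1 - blocking_fraction)


def usable_capacity(layers, low_pitch, router_efficiency=0.6, power_grid_share=0.20):
    """Hypothetical combination: sum of layer multipliers, each reduced by its clock-grid
    share, then scaled by router efficiency and the routing left after the power grid.

    layers: (pitch, blocking_fraction, clock_share) per layer.
    """
    tracks = sum(layer_multiplier(low_pitch, pitch, block) * (1 - clock)
                 for pitch, block, clock in layers)
    return tracks * router_efficiency * (1 - power_grid_share)


# Hypothetical two-layer stack: a low layer and a top-tier layer with a 10% clock grid.
stack = [(0.4, 0.10, 0.0), (0.8, 0.05, 0.10)]
print(round(usable_capacity(stack, low_pitch=0.4), 4))  # 0.6372
```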