Microprocessor Design in the Nanoscale Era
Stefan Rusu, Senior Principal Engineer, Intel Corporation, IEEE Fellow
stefan.rusu@intel.com
July 2012
Agenda
- Microprocessor Design Trends
- Process Technology Directions
- Active Power Management
- Leakage Reduction Techniques
- Packaging and Thermal Modeling
- Future Directions and Summary
Microprocessor Evolution
              4004 Processor   Westmere-EX Processor
Year          1971             2011
Transistors   2,300            2.6 B
Process       10 µm            32 nm
Die area      12 mm²           513 mm²
(Die photos not to scale)
Scaling Trends
[Figure: feature size in microns (log scale, 0.01 to 10) vs. year, 1970 to 2020, with the 65nm, 45nm, 32nm and 22nm generations marked; trend: 0.7x every 2 years]
Transistor dimensions scale to improve performance, reduce power and reduce cost per transistor.
Source: M. Bohr
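The 0.7x-per-generation trend can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes an ideal, constant 0.7x linear shrink starting from 65 nm. Under that assumption the results land close to the 45, 32 and 22 nm generation names, and the area per transistor roughly halves each generation.

```python
def scaled_feature_nm(start_nm: float, generations: int, factor: float = 0.7) -> float:
    """Linear feature size after a number of process generations (ideal shrink)."""
    return start_nm * factor ** generations


def area_factor(linear_factor: float = 0.7) -> float:
    """Area scales with the square of the linear shrink."""
    return linear_factor ** 2


if __name__ == "__main__":
    # Start at the 65 nm generation and apply three ideal 0.7x shrinks.
    for gen in range(4):
        print(f"generation {gen}: {scaled_feature_nm(65.0, gen):.1f} nm")
    print(f"area per transistor per generation: {area_factor():.2f}x")
```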
Client Processor Trend: Integrated Graphics
Ivy Bridge: 22nm client processor with monolithic integrated graphics
- Up to 4 dual-threaded cores and 8MB L3 cache
- Dual-channel DDR3 memory controller at 1600MT/s
- Integrated PCIe interface (16 Gen3 + 4 Gen2 + 4 DMI lanes); first client CPU to support PCIe Gen3
- Three independent displays
- 1.4B transistors in a 160mm² die
Source: S. Damaraju, ISSCC 2012
Client Processor Trend: Integrated WiFi RF
[Figure: transceiver block diagram. Chip: TX RF, PA, Gm, RX RF, TX and RX switches with matching network. Package: 50Ω differential filter, balun and 50Ω transmission line to the antenna]
- Sensitive RF circuits integrated with 32nm ATOM and PCH
- Integration of traditional III-V RF components: -21dBm power amp, 34dBm T/R switch, 3.5dB NF LNA
Source: H. Lakdawala, ISSCC 2012
Server Processor Trends: More Cores
[Figure: number of cores (0 to 12) for Xeon EX processors across the 65nm, 45nm and 32nm generations: Tulsa, Tigerton, Dunnington, Nehalem-EX, Westmere-EX]
Server core count increases every generation while staying within a flat power budget.
Server Processor Trends: More Cache
[Figure: on-die L3 cache in MB for Xeon EX processors by process generation: 180nm: 1, 130nm: 4, 90nm: 8, 65nm: 16, 45nm: 24, 32nm: 30]
Cache size increases with every process generation.
Server Processors Power Trends
[Figure: power in W (log scale, 0.001 to 1000) vs. year, 1990 to 2010; curves for total power, active power and leakage]
Voltage Scaling Has Slowed Down
[Figure: supply voltage in V (log scale, 0.1 to 10) vs. year, '91 to '13; trend segments labeled ~0.7X scaling and ~0.95X scaling]
30 Years of MOSFET Scaling
                        Dennard 1974   Intel 2005
Gate Length:            1.0 µm         35 nm
Gate Oxide Thickness:   35 nm          1.2 nm
Operating Voltage:      4.0 V          1.2 V
Classical scaling ended in the early 2000s due to gate oxide leakage limits.
Source: M. Bohr, ISSCC 2009
90 nm Strained Silicon Transistors
- NMOS: high-stress film (SiN cap layer), tensile channel strain
- PMOS: SiGe source-drain, compressive channel strain
Strained silicon provided increased drive currents, making up for the lack of gate oxide scaling.
Source: M. Bohr, ISSCC 2009
45 nm High-k + Metal Gate Transistors
- 65 nm transistor: SiO2 dielectric, polysilicon gate electrode
- 45 nm HK+MG transistor: hafnium-based dielectric, metal gate electrode
High-k + metal gate transistors break through the gate oxide scaling barrier.
Source: M. Bohr, ISSCC 2009
HK/MG Gate Leakage Reduction
[Figure: gate leakage comparison with 1000x and 25x reductions annotated]
HK+MG significantly reduces gate leakage.
Source: K. Mistry, IEDM 2007
6T SRAM Bit Cell Leakage Reduction
[Figure: normalized cell leakage (0 to 12) at 1.0V, 25C, split into IGATE, IOFF and IJUNCT, for 65nm and 45nm; 10x reduction annotated]
SRAM bit cell leakage is reduced ~10x.
Source: M. Bohr, ISSCC 2009
Traditional Planar Transistor
[Figure: planar transistor with gate, high-k dielectric, source, drain, oxide and silicon substrate]
Traditional 2-D planar transistors form a conducting channel in the silicon region under the gate electrode when in the on state.
Source: M. Bohr, 2011
22 nm Tri-Gate Transistor
[Figure: Tri-Gate transistor with gate, drain, source, oxide and silicon substrate]
3-D Tri-Gate transistors form conducting channels on three sides of a vertical fin structure, providing fully depleted operation.
Transistor Scaling Trends
[Figure: micrographs of 32 nm planar transistors and 22 nm Tri-Gate transistors, with gates and fins labeled]
Source: M. Bohr, 2011
Transistor Gate Delay
[Figure: normalized transistor gate delay (0.6 to 2.2) vs. operating voltage (0.5 to 1.1 V) for 45nm planar and 32nm planar; 22% faster annotated]
32nm planar transistors are 22% faster than 45nm planar.
Source: D. Perlmutter, ISSCC 2012
Transistor Gate Delay
[Figure: normalized transistor gate delay (0.6 to 2.2) vs. operating voltage (0.5 to 1.1 V) for 45nm planar, 32nm planar and 22nm planar; 22% faster and 14% faster annotated]
22nm planar transistors would have been only 14% faster.
Transistor Gate Delay
[Figure: normalized transistor gate delay (0.6 to 2.2) vs. operating voltage (0.5 to 1.1 V) for 32nm planar and 22nm Tri-Gate; 37% faster and 18% faster annotated]
22nm Tri-Gate transistors provide improved performance at high voltage and an unprecedented 37% speedup at low voltage.
Intel Transistor Leadership
Year   Node    Milestone
2003   90 nm   Invented SiGe Strained Silicon
2005   65 nm   2nd Gen. SiGe Strained Silicon
2007   45 nm   Invented Gate-Last High-k Metal Gate
2009   32 nm   2nd Gen. Gate-Last High-k Metal Gate
2011   22 nm   First to Implement Tri-Gate
Technology eras: Strained Silicon, High-k Metal Gate, Tri-Gate
Lithography Challenges
[Figure: feature size and lithography wavelength in nm (log scale, 10 to 1000) vs. year of initial production, '89 to '15; wavelengths of 248nm, 193nm and 13nm (EUVL) marked, with the gap between wavelength and feature size highlighted]
193 nm enhancements enable the 22 nm generation.
Extreme Ultraviolet Lithography
EUV lithography uses extremely short wavelength light
- Visible light: 400 to 700 nm
- DUV lithography: 193 and 248 nm
- EUV lithography: 13 nm
[Photos: EUV Micro Exposure Tool; World's First EUV Mask]
Layout Restrictions
- 65 nm layout style: bi-directional features, varied gate dimensions, varied pitches
- 32 nm layout style: uni-directional features, uniform gate dimension, gridded layout
Source: M. Bohr, ISSCC 2009
450mm in the Era of Complex Scaling
Must coordinate demand drivers, technical requirements and resources:
- End-user demand drivers
- Integrated IC maker coordination
- Equipment and materials development
- University and government support
[Image: "Projected 2000 Wafer", circa 1975 (Gordon Moore, ISSCC 03)]
Process Variations
- Die-to-die variations
- Within-die variations: systematic and random
Examples shown: resist thickness, lens aberrations, random placement of dopant atoms
Voltage and Temperature Variations
Voltage
- Chip activity change
- Current delivery RLC
- Dynamic: ns to 10-100µs
- Within-die variation
Temperature
- Activity and ambient change
- Dynamic: 100-1000µs
- Within-die variation
[Figure: on-die temperature map in °C]
Impact on Design Methodology
Due to variations in Vdd, Vt and temperature:
[Figure: number of paths vs. path delay against the delay target, and frequency vs. leakage power distribution, each shown for deterministic and probabilistic design; annotations: 10X variation, ~50% total power]
Major paradigm shift from deterministic design to probabilistic / statistical design.
SRAM Cell Size Scaling
[Figure: cell area in µm² (log scale, 0.01 to 10) vs. process technology, 180 to 22 nm; trend: 0.5x every 2 years; 45 nm: 0.346 µm², 32 nm: 0.171 µm², 22 nm: 0.092 µm²]
Memory density continues to double every 2 years.
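The three cell areas quoted on this slide can be compared against the 0.5x-per-generation trend directly. The sketch below is only a sanity check: the helper names are hypothetical, and the bits-per-mm² figure counts raw bit cells only, ignoring sense amplifiers, decoders, redundancy and other array overhead, so it is an upper bound rather than a real array density.

```python
# Cell areas in square microns, as quoted on the slide.
CELL_AREA_UM2 = {"45 nm": 0.346, "32 nm": 0.171, "22 nm": 0.092}


def shrink_ratios(areas: dict) -> list:
    """Area ratio between each generation and the one before it."""
    values = list(areas.values())
    return [new / old for old, new in zip(values, values[1:])]


def raw_bits_per_mm2(cell_area_um2: float) -> float:
    """Upper bound on bit density: 1 mm^2 = 1e6 um^2, one bit per cell."""
    return 1e6 / cell_area_um2


if __name__ == "__main__":
    for ratio in shrink_ratios(CELL_AREA_UM2):
        print(f"generation-to-generation area ratio: {ratio:.3f}")
    for node, area in CELL_AREA_UM2.items():
        print(f"{node}: up to {raw_bits_per_mm2(area) / 1e6:.1f} Mbit/mm^2 of raw cells")
```

Both ratios come out near 0.5, which is consistent with the doubling of memory density stated on the slide.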
Interconnect Trends
[Figure: number of metal layers (0 to 10) vs. technology generation, 500 to 22 nm, with the aluminum (Al) and copper (Cu) eras marked]
22nm Interconnects
[Figure: M1 to M8 cross-section] M1-M6 use ultra-low-k ILD and self-aligned vias, providing 13-18% capacitance reduction.
[Figure: cross-section of integrated MIM capacitor] Enables capacitance density of >20fF/µm².
Source: C. Auth, VLSI Symposium 2012
On-chip Interconnect Trend
[Figure: relative delay (log scale, 0.1 to 100) vs. feature size, 250 to 32 nm; curves for gate delay (FO4), local interconnect (M1, M2), global interconnect with repeaters and global interconnect without repeaters. Source: ITRS, 2001]
- Local interconnects scale with gate delay
- Global interconnects do not keep up with scaling
Voltage and Frequency Scaling
[Figure: frequency vs. VDD, with VT, VLOW and VHI marked on the voltage axis; regions bounded by the data retention limit, performance limit and reliability limit; sub-threshold logic region below VT; max target frequency and required frequency marked; +/-10% band. Source: J. Rosal, ISSCC 2006]
1 - Fixed VDD, frequency scaling: linear power reduction
2 - Fixed frequency, VDD scaling: square power reduction
3 - Voltage and frequency scaling: cubic power reduction
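The linear, square and cubic reductions listed above follow from the standard dynamic power relation P ∝ C·V²·f. The sketch below illustrates this under stated assumptions: switched capacitance is held constant, leakage is ignored, and in case 3 frequency is assumed to scale in proportion to voltage. The function name is hypothetical.

```python
def relative_dynamic_power(v_scale: float, f_scale: float) -> float:
    """Dynamic power relative to baseline, using P ~ C * V^2 * f with C fixed."""
    return v_scale ** 2 * f_scale


if __name__ == "__main__":
    s = 0.8  # example scaling factor (20% reduction)
    cases = {
        "1: fixed VDD, scale frequency (linear)": relative_dynamic_power(1.0, s),
        "2: fixed frequency, scale VDD (square)": relative_dynamic_power(s, 1.0),
        "3: scale both together (cubic)": relative_dynamic_power(s, s),
    }
    for label, power in cases.items():
        print(f"{label}: {power:.3f}x baseline power")
```

With a 20% reduction, the three cases give roughly 0.8x, 0.64x and 0.51x of the baseline power, which is why combined voltage and frequency scaling is the most effective of the three.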
Memory and RF Vmin Reduction
- A write-assist circuit temporarily drops the array supply node (CVCC, shared across several cells) to make it easier to write into the bit-cell
- Both cache and register files (RF) use this technique to improve write Vmin in the 22nm Ivy Bridge processor
- 22nm transistor and circuit improvements enable a Vmin reduction of >100mV for cache and 60mV for RF
[Figure: bit-cell write example: WL (0 --> 1), Write Data 0 on BL, Data 1 --> 0, Data# 0 --> 1, Write Data# 1 on BL#]
Source: S. Damaraju, ISSCC 2012
NTV Pentium Processor
[Figure: die photo, 1.8 mm x 1.1 mm, showing IA-32 core logic, L1$-I, L1$-D, ROM, scan, and level shifters + clock spine]
[Figure: energy efficiency vs. voltage from zero to max, with subthreshold, NTV and normal operating range regions; ~5x efficiency gain demonstrated at NTV]
              Ultra-low Power   Energy Efficient   High Performance
Voltage       280 mV            0.45 V             1.2 V
Frequency     3 MHz             60 MHz             915 MHz
Power         2 mW              10 mW              737 mW
Efficiency    1500 Mips/W       5830 Mips/W        1240 Mips/W
Source: S. Jain, ISSCC 2012
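The Mips/W figures on this slide can be approximately reproduced from the frequency and power at each operating point. The sketch below assumes one instruction per clock cycle, which is an assumption made here for illustration and is not stated on the slide; the function names are hypothetical. The small mismatch at the middle point (6000 computed vs. 5830 quoted) presumably reflects rounding in the quoted power and frequency values.

```python
# Operating points from the slide: (frequency in MHz, power in mW).
OPERATING_POINTS = {
    "ultra-low power (280 mV)": (3.0, 2.0),
    "energy efficient (0.45 V)": (60.0, 10.0),
    "high performance (1.2 V)": (915.0, 737.0),
}


def mips_per_watt(freq_mhz: float, power_mw: float, ipc: float = 1.0) -> float:
    """Millions of instructions per second per watt; ipc=1.0 is an assumption."""
    return freq_mhz * ipc / (power_mw / 1000.0)


def energy_per_instruction_pj(freq_mhz: float, power_mw: float, ipc: float = 1.0) -> float:
    """Energy per instruction in picojoules (mW / MHz gives nJ per cycle)."""
    return power_mw / (freq_mhz * ipc) * 1000.0


if __name__ == "__main__":
    for name, (freq, power) in OPERATING_POINTS.items():
        print(f"{name}: {mips_per_watt(freq, power):.0f} Mips/W, "
              f"{energy_per_instruction_pj(freq, power):.0f} pJ/instruction")
```

The comparison between the 0.45 V and 1.2 V points is where the roughly 5x efficiency gain on the slide comes from.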
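Clock Gating
[Figure: two register circuits with D, En, Clk and Q signals: one with a 0/1 select on the data input controlled by En, and one with the clock gated by En]
- Save power by gating the clock when data activity is low
- Requires detailed logic validation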
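Core Power Management
States: C0 HFM, C0 LFM, C1/C2, C4, C6
[Table, flattened in extraction; entries listed in source order]
- Core voltage: shown graphically per state
- Core clock: OFF, OFF, OFF
- PLL: OFF, OFF
- L1 caches / L2 caches: flushed, flushed, partial flush, off, off
- Wakeup time: active, active, <1µs, <30µs, <100µs
- Power: shown graphically per state
Modulating the processor core voltage and frequency enables lower power states.
Source: Gerosa, A-SSCC 2008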
Multiple Voltage Domains
[Figure: die floorplan with core supply and uncore supply regions, fuse block, and QPI and SMI interfaces]
- I/O domain: 1.1V fixed
- Un-core domain: 0.9-1.1V fixed
- Core domain: 0.85-1.1V variable
Multiple voltage domains minimize power consumption across the core and uncore areas.
Source: Rusu, ISSCC 2009
Multiple Clock Domains
[Figure: die floorplan showing BCLK, filter PLL, un-core PLL, core PLLs, I/O PLLs and I/O DLLs, with QPI and SMI interfaces]
- Three primary clock domains: core, un-core, I/O
- Total of 16 PLLs and 8 DLLs
Source: Rusu, ISSCC 2009
Subthreshold Leakage Trend
[Figure: Ioff in A/µm (log scale, 1.E-14 to 1.E-04) vs. physical gate length (10 to 1000 nm); points for Intel 15nm, 20nm and 30nm transistors, plus research data and production data from the literature]
Leakage Reduction Techniques
- Body bias
- Stack effect
- Sleep transistor
[Figure: circuit sketches of the three techniques (Vdd, +Ve Vbp, -Ve Vbn, equal loading, logic block); annotated reductions: 2-3X and 2-1000X]
Leakage is a Strong Function of Voltage
[Figure: normalized leakage (0 to 100) vs. voltage (0 to 1.5 V) for a 130nm process; curves for subthreshold leakage and gate leakage]
Sub-threshold and gate leakage both decrease with lower supply voltage.
Cache Sleep and Shut-off Modes
[Figure: sub-array in active, sleep and shut-off modes, with block select, sleep bias and shut-off controls on the virtual VSS node; voltage plot with a 1.1V supply and virtual VSS levels of 0V, 250mV and 520mV, with 2x lower leakage annotated at each of the two raised levels]
Source: S. Rusu, US Pat. 7,657,767
Leakage Shut-off Infrared Images
- 16MB part: 16MB in sleep mode
- 8MB part: 8MB in sleep mode, 8MB in shut-off
- 4MB part: 4MB in sleep mode, 12MB in shut-off
Leakage reduction: 3W (8MB part), 5W (4MB part)
Cache Dynamic Shut-off
[Figure: cache ways 15 to 0 with data and tag arrays and controller, shown in both operating states]
- Normal operation: in the full-load state, all 16 ways are enabled (green)
- Cache-by-demand operation: under idle or low-load states, cache ways are dynamically flushed and put in shut-off mode (red)
Cache Leakage Management
Three PMOS sleep transistor groups for sub-array leakage reduction.
Source: Y. Wang, ISSCC 2009
Cache Leakage Reduction Benefit
The leakage management circuit reduces sub-array leakage by 58%.
Leakage Mitigation: Long-Le Transistors
[Figure: nominal Le device next to long Le (nominal + 10%) device]
- All transistors can be either nominal or long-Le
- Most library cells are available in both flavors
- Long-Le transistors are ~10% slower, but have 3x lower leakage
- All paths with timing slack use long-Le transistors
- Initial design uses only long-channel devices
Source: S. Rusu, ISSCC 2006
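The selection rule on this slide (start with long-Le everywhere, keep it wherever timing slack allows) can be sketched as follows. This is a simplified illustration, not the actual design flow: it decides per path rather than per cell, the `Path` type and function names are hypothetical, and the 1.10 delay penalty and 1/3 leakage ratio are taken from the "~10% slower, 3x lower leakage" figures above.

```python
from dataclasses import dataclass

LONG_LE_DELAY_PENALTY = 1.10      # long-Le is ~10% slower (from the slide)
LONG_LE_LEAKAGE_RATIO = 1.0 / 3   # long-Le has 3x lower leakage (from the slide)


@dataclass
class Path:
    name: str
    nominal_delay_ps: float   # path delay when built from nominal devices
    nominal_leakage: float    # path leakage (arbitrary units) with nominal devices


def assign_devices(paths, cycle_time_ps):
    """Keep long-Le on every path that still meets timing; otherwise use nominal."""
    assignment = {}
    for p in paths:
        long_delay = p.nominal_delay_ps * LONG_LE_DELAY_PENALTY
        assignment[p.name] = "long-le" if long_delay <= cycle_time_ps else "nominal"
    return assignment


def total_leakage(paths, assignment):
    """Sum leakage across paths, applying the long-Le reduction where used."""
    total = 0.0
    for p in paths:
        ratio = LONG_LE_LEAKAGE_RATIO if assignment[p.name] == "long-le" else 1.0
        total += p.nominal_leakage * ratio
    return total


if __name__ == "__main__":
    paths = [Path("slack_path", 800.0, 3.0), Path("critical_path", 950.0, 3.0)]
    chosen = assign_devices(paths, cycle_time_ps=1000.0)
    print(chosen, total_leakage(paths, chosen))
```

In this toy example the path with slack keeps long-Le devices, while the critical path would miss timing with the 10% penalty and falls back to nominal devices.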
Long-Le Transistors Usage Map
[Figure: Nehalem-EX die map of long-channel device usage in percent, color-coded in bands of 50-60, 60-70, 70-80, 80-90 and 90-100, with QPI and SMI interfaces marked]
Massive long-channel usage in the uncore reduces leakage.
Power & Leakage Breakdown
Nehalem-EX 45nm example
Power breakdown by supply:
- Vcore: 54.6%
- Vuncore: 33.4%
- Vio: 11.2%
- Vpll: 0.8%
Leakage breakdown:
- Active: 84%
- Leakage: 16%
Reduction techniques:
- Clock gating
- Run uncore at 0.9V
- Long-channel device usage: 58% cores, 85% uncore
Source: S. Rusu, ISSCC 2009
Core and Cache Recovery Example
[Figure: die floorplan with QPI0 to QPI3, SMI and system interface; Core0, Core1, Core2, Core5, Core6 and Core7 enabled, with the remaining cores and their cache slices marked Disabled]
Defective core and cache slices can be disabled in horizontal pairs.
Source: S. Rusu, ISSCC 2009
Minimize Leakage in Disabled Blocks
- Disabled cores are power gated: virtual VCC goes from 0.85V (active) to 0V (shut-off), for a 40x leakage reduction
- Disabled cache slices have all major arrays in shut-off: SRAM array virtual VCC is 0.9V when active; leakage reduction is 35% in sleep and 83% in shut-off
Source: S. Rusu, ISSCC 2009
Core/Cache Recovery Infrared Image
All cores and cache slices are enabled.
Source: S. Rusu, ISSCC 2009
Core/Cache Recovery Infrared Image
- Shut-off: 2 cores (top row) and 2 cache slices (bottom row)
- Disabled blocks are clock and power gated
Source: S. Rusu, ISSCC 2009
Microprocessor Package Evolution
1971: 4004 Processor
- 16-pin ceramic package
- Wire bond attach
- 750 kHz I/O
2012: Xeon E5 Processor
- 2011-contact organic package
- Flip-chip attach
- 8.0 GHz I/O
Power Density Models
[Figure: power map, heat flux in W/cm² (0 to 250), next to on-die temperature map in °C (40 to 110)]
With increasing power density and large on-die caches, detailed, non-uniform power models are required.
Thermal Modeling
[Figure: simulated power density next to infrared emission microscope measurement]
Source: D. Genossar and N. Shamir, "Intel Pentium M Processor Power Estimation, Budgeting, Optimization and Validation", Intel Technology Journal, 5/2003
Thermal Sensors
[Figure: die floorplan with QPI0 to QPI3 and SMI interfaces and sensor locations]
Multiple temperature sensors:
- One in each core hot spot
- One in the die center
Temperature information is available through the PECI bus for system fan management.
Power Management Unit
[Figure: eight cores (Core 0 to Core 7), each with its own power gate and sensors, around a central power management unit with its own sensors; the unit drives power gate control and external voltage regulator control]
The PMU controls processor voltage and frequency based on compute loading and thermal data.
Future Directions
- 2D mesh network with multiple voltage / frequency islands
- Communication across islands achieved through FIFOs
Source: Ogras (CMU), DAC 2007
Fine Grain Power Management
25-core processor example (TPT = throughput):
[Figure: 5x5 core grid with each core labeled f, f/2 or 0; voltage plot with levels VDD (Hi-Act, Pwr=1), 0.7*VDD (Lo-Act, Pwr=¼) and 0V (Shut-off, Pwr=0)]
- Cores with critical tasks: Freq = f, at Vdd; TPT = 1, Power = 1
- Non-critical cores: Freq = f/2, at 0.7*Vdd; TPT = 0.5, Power = 0.25
- Temporarily shut down: TPT = 0, Power = 0
- Permanently disabled: TPT = 0, Power = 0
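The per-core figures above follow from P ∝ V²·f: at 0.7*Vdd and f/2 the relative power is 0.49 × 0.5 ≈ 0.25. The sketch below totals throughput and power for a 25-core chip. The per-state values come from the slide, but the mix of cores in each state is a hypothetical example, since the exact counts in the original figure are not recoverable here; the names are likewise illustrative.

```python
# Per-core (throughput, power) relative to a core at full frequency and Vdd.
# These values are taken from the slide.
CORE_STATES = {
    "critical": (1.0, 1.0),        # f at Vdd
    "non_critical": (0.5, 0.25),   # f/2 at 0.7*Vdd
    "shut_off": (0.0, 0.0),        # temporarily shut down
    "disabled": (0.0, 0.0),        # permanently disabled
}


def relative_core_power(v_scale: float, f_scale: float) -> float:
    """Dynamic power relative to baseline, P ~ V^2 * f (leakage ignored)."""
    return v_scale ** 2 * f_scale


def chip_totals(core_counts: dict) -> tuple:
    """Return (total throughput, total power) for a given mix of core states."""
    throughput = sum(CORE_STATES[s][0] * n for s, n in core_counts.items())
    power = sum(CORE_STATES[s][1] * n for s, n in core_counts.items())
    return throughput, power


if __name__ == "__main__":
    # Hypothetical mix of 25 cores; not read from the original figure.
    mix = {"critical": 8, "non_critical": 10, "shut_off": 5, "disabled": 2}
    tpt, pwr = chip_totals(mix)
    print(f"throughput = {tpt}, power = {pwr}, throughput per unit power = {tpt / pwr:.2f}")
    print(f"0.7*Vdd at f/2 -> relative power {relative_core_power(0.7, 0.5):.3f}")
```

For this hypothetical mix, the chip delivers more throughput per unit power than it would with every core running at f and Vdd, where the ratio is 1.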
Summary
Moore's Law has fueled the worldwide technology revolution for over 40 years and will continue for at least another decade
- 0.7x transistor dimension scaling every two years
- Tri-gate devices provide significant benefits
Continued microprocessor performance improvement depends on our ability to manage active power and leakage
- Clock and power gate unused or disabled blocks
- Multiple voltage and clock domains
- Dynamic voltage and frequency adjustment
Core and cache recovery enables multiple product options
- Disabled cores and cache slices are clock and power gated