Data Center Energy Trends


Data Center Energy Trends
- Data center electricity usage increased by 56% from 2005 to 2010
- 1.1% to 1.5% of total world electricity usage
- 1.7% to 2.2% of total US electricity usage
- Note: includes the impact of the 2008 recession
- Note: 2x increase from 2000 to 2005; growth since then was below prediction
Source: Koomey 2011

Data center with 10K servers
- Servers per rack: 26; total rack requirement: 385
- Power usage per year: 52 GWh (est. for a 297 W server)
Source: Samsung 2008

The Consequence
- At the 2000-2005 growth rate in data center energy usage, 30 new coal-fired or nuclear power plants will be needed by 2015
- Four-fold increase in data center CO2 emissions by 2020: surpasses the airline industry!
[Figure: % of world CO2 emissions: data centers 0.3, airlines 0.6, shipyards 0.8, steel plants 1. Metric megatons of CO2: Argentina 142, Netherlands 146, data centers 170, Malaysia 178, data centers (2020) 670]
Source: Koomey 2011
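As a rough check on the 10K-server figures above, the following sketch recomputes the rack count and the yearly energy. The server count, servers per rack, and 297 W per server come from the slide. The factor of 2 for facility overhead (cooling, power delivery) is an assumption that is not stated on the slide; it is included only because it happens to reproduce the quoted 52 GWh.

```python
import math

SERVERS = 10_000           # from the slide
SERVERS_PER_RACK = 26      # from the slide
SERVER_WATTS = 297         # from the slide
HOURS_PER_YEAR = 24 * 365
OVERHEAD_FACTOR = 2.0      # ASSUMPTION: facility overhead, not given in the source

# Racks needed, rounding up because a partly filled rack is still a rack.
racks = math.ceil(SERVERS / SERVERS_PER_RACK)

# Energy drawn by the servers alone, in GWh per year (1 GWh = 1e9 Wh).
it_energy_gwh = SERVERS * SERVER_WATTS * HOURS_PER_YEAR / 1e9

# Energy including the assumed facility overhead.
facility_energy_gwh = it_energy_gwh * OVERHEAD_FACTOR

print(racks, round(it_energy_gwh, 1), round(facility_energy_gwh, 1))
```

Under these assumptions the servers alone account for about 26 GWh per year, so the 52 GWh on the slide presumably includes more than the server power itself.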

Increasing Memory Demand
- Parallelism (core count)
- Larger and more complex data sets
- More sophisticated applications
- Virtualization & consolidation
- Today: 10s (to 100s) of GB
- Tomorrow: terabyte and beyond???
[Figure: number of cores and memory capacity (GB) per year, 2003 to 2017, log scale from 1 to 1000]
Source: Kevin Te-Ming Lim, Disaggregated Memory Architectures for Blade Servers, Ph.D. Thesis, University of Michigan, 2010

More Memory
- Energy/power consumption shift
- Terabyte in buffered (FB) or DDR3 SDRAM, 8GB DIMMs: 125 DIMMs, 400 W with DDR3, 1.25 kW with FB
- Up to 4-10x more than already power-hungry machines!
[Figure: server power consumption (watts), split into Memory, CPU, and Other; values shown: 97, 50, 150]
Source server power: Samsung, 2008
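The terabyte example above can be unpacked with a little arithmetic. This sketch assumes a decimal terabyte (1000 GB), which is what yields the slide's 125 DIMMs, and derives the implied power per DIMM from the slide's totals; the per-DIMM numbers are a derived illustration, not figures from the source.

```python
TOTAL_GB = 1000          # ASSUMPTION: decimal terabyte, matches the slide's 125 DIMMs
DIMM_GB = 8              # from the slide
DDR3_TOTAL_WATTS = 400   # from the slide
FB_TOTAL_WATTS = 1250    # from the slide (1.25 kW)

dimms = TOTAL_GB // DIMM_GB

# Implied average power per DIMM for each memory type.
ddr3_watts_per_dimm = DDR3_TOTAL_WATTS / dimms
fb_watts_per_dimm = FB_TOTAL_WATTS / dimms

print(dimms, ddr3_watts_per_dimm, fb_watts_per_dimm)
```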

DRAM: a long-time winner
- Decades old!
- Cost, power, performance trade-offs have favored it
- Massive future capacity leads to a different outcome!

Limitations to DRAM
- Destructive reads: must replace data after a read
- Limited data retention: periodic refresh
- Susceptibility to errors: charge can be disturbed
- Scalability: projections (ITRS) question scaling below 22nm

The Wave Rolling In
- DRAM has long been the best choice until now
- DRAM does offer advantages
  - Effectively unlimited write endurance (doesn't wear out)
  - Fast read/write (symmetric) latency
  - (And, of course, it's a commodity, here today, etc.)
- Can we use it judiciously? Just a little bit, please?
  - Combine with alternative technology
  - Small DRAM has reasonable energy, capacity
  - We've seen this before: SRAM cache vs. DRAM?

The Wave Rolling In
[Figure: US patents granted for Phase-Change Memory (PCM/PRAM), MRAM, and FRAM. Source: Lam, VLSI-TSA 2008]
- For an old technology, a dramatic change of events with tremendous interest!

Alternative Memory Technology

Technology   Read Speed   Write Speed   Cell Area   Endurance      Addressability
DRAM         20~50ns      20~50ns       6F^2        10^15          Yes
SRAM         ~2ns         ~2ns          146F^2      10^15~10^16    Yes
NAND Flash   25us         500us         5F^2        10^4~10^5      No
STT-RAM      2ns          10ns          37~40F^2    10^12          Yes
PCM          30~50ns      ~1us          5~8F^2      10^7~10^8      Yes

Alternative Memory Technology: PCM
- Fast, non-destructive reads: nearing parity with DRAM
- Non-volatile, non-destructive, no refresh -> low energy

Alternative Memory Technology: PCM
- Density on par with DRAM; 2.5nm prototype (Liang et al., A 1.4uA Reset Current Phase Change Memory Cell with Integrated Carbon Nanotube Electrodes for Cross-Point Memory Applications, IEEE Symp. on VLSI (VLSIT), 2011)
- Write performance limited
  - Relatively slow bit cell writes, but no block erasure required as in Flash
  - Multiple write rounds of bit groups, leading to 1us (Numonyx prototype)

Alternative Memory Technology: PCM
- Repeated writes lead to wear on bit cell
  - Writes cause stress to bit cells, leading to failure
  - Limited write cycles, but better than Flash
- Similar array structure/operation as DRAM: bit (byte) addressability

Alternative Memory Technology: PCM
- Nearly ideal complement (maybe replacement?) for DRAM (scales, low standby power, bit addressable, fast reads)
- BUT: must find techniques to overcome limitations

PCM: The Fundamental Idea
- Similar process as CD-R
- Chalcogenide (GST)
- Application of heat changes state of material
- Resistance associated with each state stores a bit
  - Crystalline (low resistance, SET, 1)
  - Amorphous (high resistance, RESET, 0)
- Operation
  - Write: heat/cool
  - Read: measure resistance
[Figure: programmed volume of GST (heated and then cooled to change phase). Diagram/photo: Micron Technology, http://www.micron.com/innovations/pcm.html]

PCM Read/Write Operations
Read
- Measure resistance
  - Low: logic 1 (SET)
  - High: logic 0 (RESET)
- Relatively fast
- Power efficient
- Non-destructive
Writes
- Slow bit writes: heating/cooling takes 50ns ~ 150ns
- Limited parallel bit writes: large programming current
- Long latency: 1000ns
- High write energy
- Heat stress leads to failure, with limited endurance (10^7)

Consequences of PCM
- Asymmetric read/write latency and bandwidth
  - Reads projected to reach parity with DRAM
  - Writes will remain slow due to heating/cooling
- Wear-out and endurance management
  - Integrated relatively near CPU leads to heavy usage
  - E.g., one write/second: fails in 110 days
  - Memory will quickly fail without precautions
- Nonvolatility, reliability
  - Important, desirable properties. Most focus has been on making it work first, then finding ways to exploit these properties
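The wear-out example above (one write per second) can be reproduced with a back-of-the-envelope calculation. This is a minimal sketch that assumes every write lands on the same cell and that the cell fails exactly at the quoted endurance of 10^7 writes; the function name is illustrative only.

```python
SECONDS_PER_DAY = 86_400

def days_until_wearout(endurance_writes: float, writes_per_second: float) -> float:
    """Days until a single cell reaches its write limit at a constant write rate."""
    return endurance_writes / writes_per_second / SECONDS_PER_DAY

# 10^7 writes at one write per second to the same cell.
days = days_until_wearout(1e7, 1.0)
print(round(days, 1))
```

With these inputs the result is a little under four months, the same order as the 110 days quoted on the slide.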

Rethinking Main Memory for PCM
- Starting point: DRAM main memory
[Diagram: Sandy Bridge with cores C0-C3 and System Agent, attached to Main Memory (DRAM)]

Hybrid Memory Archetype
- Conventional memory adapted to PCM
- Essential idea: small DRAM combined with a large PCM
- Large PCM
  - (+) Capacity, low standby power
  - (-) Write performance
  - (-) Write energy
  - (-) Endurance
- Small DRAM (single fast DIMM)
  - (+) Write performance
  - (+) Write energy
  - (+) Endurance
  - (-) Capacity, standby power
- Degree of change/tech driven
  1. Partitioned DRAM + PCM
  2. DRAM r/w cache
  3. DRAM write buffer
[Diagram: Sandy Bridge with cores C0-C3 and System Agent]

DRAM read/write cache: Phase-change Main Memory Architecture (PMMA)
- Replacement
  - Maintain same interfaces
  - Commodity components
  - Isolate changes to memory controller
- System Agent
  - Acts as controller to DRAM/PCM
  - Hit: check tags, access AEB
  - Miss: check tags, access PCM & AEB
- AEB (DRAM) acts as cache
  - Accesses to main memory are made through the cache
  - Write performance
  - Endurance management
[Fer10a, Fer10b, Fer11a, Fer11b]
[Diagram: cores C0-C3, System Agent, AEB (DRAM), Main Memory (PCM)]

PMMA
[Diagram: CPU interface passes the physical address (PA) to the System Agent; a DRAM Controller/DMAC fronts the AEB (page cache, map PA to DA); a PCM Controller/DMAC fronts the PCM memory (allocated pages, map PA to DA, spare pages)]

PMMA: Request Controller
[Diagram: CPU interface, Request Controller (control and data paths), In Flight Buffer (IFB), DRAM Controller/DMAC with AEB (page cache, map PA to DA), PCM Controller/DMAC with PCM memory (allocated pages, map PA to DA, spare pages)]
- Operates on pages (larger than a cache block from the CPU)
- Processes requests & allocates resources
  - Multiple outstanding requests
  - Page allocation & eviction (AEB)
  - Map physical to device address
- Bookkeeping
  - Tracks resources used, including what is cached & where
  - Maps physical address (PA) to device address (DA)
- IFB: high-speed memory that buffers in-flight pages (AEB/PCM)

Request Controller
[Diagram: request bookkeeping table with fields St, PAdr, R/W, Size, Tag; AEB bookkeeping; paths to/from the CPU interface and to/from the AEB, PCM, and IFB; memory controllers]

RC: Read Hit
- Read cache block A: mapped to a page for A
- Tag comparison: hit in AEB

RC: Read Hit (continued)
- Read A: hit in AEB
- Hand-off to DRAM controller

RC: Read Miss
- Read B: tag comparison misses in AEB
- Select eviction candidate page C from AEB

RC: Read Miss (continued)
- Map PA -> DA for pages B and C
- Miss in ARQ (not active): allocate entry

RC: Read Miss w/o Writeback
- Suppose evicted page, C, is clean
- Allocate ARQ/IFB entries
  - Page B: PCM to AEB
  - Page C is clean

RC: Read Miss w/o Writeback (continued)
- Page B, PCM to AEB: make PCM request, copy to IFB
- Page B, PCM to AEB: copy from IFB to AEB

RC: Read Miss w/o Writeback (continued)
- Hand-off to DRAM controller to finish read

RC: Read Miss w/writeback
- Suppose evicted page, C, was dirty: miss with eviction

RC: Read Miss w/writeback (continued)
- Allocate ARQ/IFB entries (2)
  - Page B: PCM to AEB
  - Page C: AEB to PCM
- Start page copying (sub-blocks)
  - B: copy to IFB
  - C: copy to IFB

RC: Read Miss w/writeback (continued)
- Page copying via IFB
  - B: copy to IFB
  - C: finished, in IFB, free in AEB
- Page copying to IFB done
  - B: finished, in IFB
  - C: finished, in IFB

RC: Read Miss w/writeback (continued)
- Complete page transfers
  - B: copy from IFB to AEB
  - C: low priority, finish as able
- Complete page transfers
  - B: finished, release resources
  - C: low priority, finish as able
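The read hit, read miss without writeback, and read miss with writeback cases walked through above can be summarized in a toy model. This is a minimal sketch, not the controller described in the slides: the class name `AebCache`, the LRU eviction order, and write-allocate behaviour are all assumptions made for illustration, and the IFB appears only as a label in the log. It also assumes the backing memory is PCM, as the PMMA name suggests.

```python
from collections import OrderedDict

class AebCache:
    """Toy page cache standing in for the AEB (hypothetical model)."""

    def __init__(self, capacity_pages: int):
        self.capacity = capacity_pages
        self.pages = OrderedDict()   # page id -> dirty flag, oldest first
        self.log = []                # human-readable trace of actions

    def read(self, page: str) -> str:
        if page in self.pages:
            # Hit: the request is served from the AEB.
            self.pages.move_to_end(page)
            self.log.append(f"hit {page}")
            return "hit"
        if len(self.pages) >= self.capacity:
            # Miss with a full AEB: pick a victim (LRU is an assumption).
            victim, dirty = self.pages.popitem(last=False)
            if dirty:
                self.log.append(f"writeback {victim} AEB->IFB->PCM")
            else:
                self.log.append(f"drop clean {victim}")
        # Bring the requested page in through the in-flight buffer.
        self.log.append(f"fill {page} PCM->IFB->AEB")
        self.pages[page] = False
        return "miss"

    def write(self, page: str) -> None:
        # Write-allocate (an assumption): make the page resident, then mark dirty.
        self.read(page)
        self.pages[page] = True

cache = AebCache(capacity_pages=1)
r1 = cache.read("A")    # miss, fill A
r2 = cache.read("A")    # hit
cache.write("A")        # A becomes dirty
r3 = cache.read("B")    # miss, dirty A must be written back first
print(cache.log)
```

The trace makes the difference between the two miss cases visible: a clean victim is simply dropped, while a dirty victim adds a page transfer back to main memory.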

RC: Read Miss w/writeback (continued)
- Complete page transfers
  - B: finished, released resources
  - C: low priority, finish as able

Optimization 1: Page Partitioning
- "Page" is the data unit in the AEB and the logical unit
- PA-DA map is kept at page level
- E.g., 2KB, 1KB, 512B, ...
- Larger page size
  - (+) Smaller tag store
  - (+) Smaller mapping table
  - (-) Unnecessary movement
  - (-) Writes of clean data
[Figure: a block of data with the page size marked]

Optimization 1: Page Partitioning (continued)
- Sub-page is the request unit
- 1x tag/map per page
- Requested on demand
- Presence/absence tracked
[Figure: a page divided into sub-pages, with some sub-pages marked "present"]

Optimization 1: Page Partitioning (continued)
- Asymmetric size
- Small dirty granularity
[Figure: a page divided into sub-pages and write sub-pages, with some write sub-pages marked "dirty"]
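The presence and dirty tracking described above can be sketched with two bitmasks per cached page. This is a hypothetical illustration: the class and field names are invented here, and the sizes (2048B page, 512B read sub-page, 256B write sub-page) are an assumed reading of the "asymmetric size" bullet rather than a layout confirmed by these slides.

```python
from dataclasses import dataclass

PAGE_BYTES = 2048           # ASSUMPTION: example page size
READ_SUBPAGE_BYTES = 512    # ASSUMPTION: unit fetched on demand
WRITE_SUBPAGE_BYTES = 256   # ASSUMPTION: unit of dirty tracking

@dataclass
class CachedPage:
    tag: int            # one tag/map entry per page
    present: int = 0    # bitmask over read sub-pages
    dirty: int = 0      # bitmask over write sub-pages

    def fetch(self, read_idx: int) -> None:
        """Mark a read sub-page as brought into the cache."""
        self.present |= 1 << read_idx

    def is_present(self, read_idx: int) -> bool:
        return bool(self.present >> read_idx & 1)

    def mark_dirty(self, write_idx: int) -> None:
        """Mark a write sub-page as modified."""
        self.dirty |= 1 << write_idx

    def dirty_subpages(self) -> list:
        count = PAGE_BYTES // WRITE_SUBPAGE_BYTES
        return [i for i in range(count) if self.dirty >> i & 1]

    def writeback_bytes(self) -> int:
        """Bytes written on eviction when only dirty write sub-pages go back."""
        return len(self.dirty_subpages()) * WRITE_SUBPAGE_BYTES

page = CachedPage(tag=7)
page.fetch(1)          # only the demanded sub-page is fetched
page.mark_dirty(3)
page.mark_dirty(5)
print(page.dirty_subpages(), page.writeback_bytes())
```

In this example an eviction writes 512 bytes instead of the full 2048-byte page, which is the point of tracking dirtiness at a smaller granularity.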

Optimization 1: Page Partitioning (continued)
- Block: transfer unit
  - Smallest data transfer
  - Sized to banks
  - Higher priority requests pre-empt between blocks
[Figure: a page divided into sub-pages, write sub-pages, and blocks]

Optimization 2: CW + AEB bypass
- Critical block (word) first
  - Deliver block generating miss to CPU
  - Transfer remaining blocks on page
- AEB bypass
  - In-flight pages can service requests, if data available
  - Data delivered directly from AEB
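Critical-block-first ordering can be illustrated in a few lines. The slide only says that the block that caused the miss is delivered first and the remaining blocks follow; the wrap-around order used below is an assumption, as is the function name.

```python
def transfer_order(blocks_per_page: int, critical_block: int) -> list:
    """Block indices in transfer order, starting at the block that missed.

    Wrap-around order for the remaining blocks is an assumption.
    """
    return [(critical_block + i) % blocks_per_page for i in range(blocks_per_page)]

order = transfer_order(blocks_per_page=8, critical_block=5)
print(order)
```

Because transfers proceed block by block, a higher priority request could be slotted in between any two entries of this list, matching the pre-emption point named on the slide.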

Optimization 3: RWR
- Read-write-read (RWR)
- RWR avoids writing unchanged blocks in sub-page
- Read verify detects failed page
- Failed write leads to spare allocation
[Flow diagram: for each block of an evicted dirty sub-page: read old block; if the new block equals the old block, skip the write; otherwise write the block, read it back, and compare; if the block read back differs from the new block, allocate a spare]

First read (before the write):
1. Read old block
2. Check for difference
3. If different, write block

Optimization 3: RWR (continued)
Second read (after the write):
1. Read newly written block
2. Check for difference
3. If different, the write failed: allocate spare

Optimization 4: Endurance
- AEB eviction policy (N-chance) to minimize writes
- Non-uniform writes to memory
  - Uneven writes cause some pages to fail before others
  - Failed page(s): memory is now broken
- Wear-leveling to uniformly distribute writes
  - Wear pages at same level
  - Pages will fail at approximately same time
- Spare capacity
  - Replace failed pages on-demand
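The read-write-read steps above can be sketched against a toy memory model. This is an illustrative sketch under stated assumptions: the `PcmModel` class, its remap table, and the "stuck" blocks that silently ignore writes are all invented for the example, and spares are allocated per block here for brevity whereas the slides speak of failed pages.

```python
class PcmModel:
    """Toy block store with spare blocks and simulated failed cells."""

    def __init__(self, num_blocks: int, spare_blocks: int, stuck=()):
        self.cells = [bytes(4)] * (num_blocks + spare_blocks)
        self.remap = {}                  # logical block -> spare device block
        self.next_spare = num_blocks
        self.limit = num_blocks + spare_blocks
        self.stuck = set(stuck)          # device blocks that ignore writes
        self.writes = 0                  # count of device writes issued

    def _device(self, block: int) -> int:
        return self.remap.get(block, block)

    def read(self, block: int) -> bytes:
        return self.cells[self._device(block)]

    def _raw_write(self, dev: int, data: bytes) -> None:
        self.writes += 1
        if dev not in self.stuck:
            self.cells[dev] = data

    def rwr_write(self, block: int, data: bytes) -> str:
        # Read 1: skip the write if the stored block is unchanged.
        if self.read(block) == data:
            return "skipped"
        dev = self._device(block)
        self._raw_write(dev, data)
        # Read 2: verify that the write took effect.
        if self.cells[dev] == data:
            return "written"
        # Verification failed: move the block to a spare.
        if self.next_spare >= self.limit:
            raise RuntimeError("out of spare blocks")
        spare = self.next_spare
        self.next_spare += 1
        self.remap[block] = spare
        self._raw_write(spare, data)
        return "remapped"

mem = PcmModel(num_blocks=4, spare_blocks=1, stuck={2})
first = mem.rwr_write(0, b"abcd")    # differs from stored data, so it is written
second = mem.rwr_write(0, b"abcd")   # unchanged, so no device write is issued
third = mem.rwr_write(2, b"wxyz")    # block 2 is stuck, so a spare is used
print(first, second, third, mem.writes)
```

A fuller model would also verify the write to the spare; that step is left out to keep the sketch short.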

PMMA Energy-Delay
[Figure: normalized energy-delay (%) for Canneal, Facesim, Bwaves, GCC, MCF, SPECjbb, and SPECmix, with AEB sizes of 112, 224, and 448 MB and page sizes of 512, 1024, 2048, and 4096 B; y-axis from 80 down to -120, with outliers at -420 and -960]
- Compared to equivalent capacity in a DRAM-only system (16GB, 4 core)
- PMMA: small DRAM (speed optimized) with large PCM
- E*D improved (small losses/gains are wins, e.g., bwaves)
- 256MB (224MB AEB + 32MB meta) is a good compromise
- 1024B or 2048B page is a good compromise: tag area vs. locality
- Small performance gain (~10%)
  - Inherently, not much better than DRAM
  - IFB + spatial locality + faster DRAM

PMMA Energy-Delay (continued)
- E*D improved from PCM's low read power, smaller DRAM power, and filtering of writes at DRAM
- Poor spatial locality combined with large footprint: brings in lots of pages, which are shortly evicted due to footprint. Lots of extra cost

PMMA Energy-Delay (continued)
- Compromise: small E*D gain, with small pages and moderate sized AEB (224 MB)
- 1024B vs 2048B page trades tag/spare table vs. locality

Read-Write Page Partitioning
[Figure: normalized energy-delay (%), y-axis from -10% to 70%, for configurations 1024, 2048, 2048-512-256, 2048-1024-256, and 2048-2048-256 across Canneal, Facesim, Bwaves, GCC, MCF, SPECjbb, SPEC mix, and Average; one bar is labeled -13]
- Results for AEB size 224 MB (+32MB metadata)
- 1024B gives the best overall result but needs larger metadata storage
- R/W page partitioning recoups losses from 2048B
- 1KB gains, then 2KB lost
- 1KB has larger tag store/spare table
- Sub-paging helps recoup performance with less tag store & smaller spare table

Lifetime: Cumulative Impact

Technique         Lifetime (cumulative)   Gain
Baseline (LRU)    0.47 months
7-Chance          0.86 months             1.83X
+RWR              3.36 months             3.91X
+GC512-Random     97.29 months            28.91X

- Wear-leveling is essential to achieve 8 years
- 7-chance and RWR also have a large impact

Summary
- PCM architectures
  - Complement DRAM for main memory?
  - Flash replacement
  - Memory + storage combination
- Current front-runners share the essential idea: small DRAM + large PCM
- Endurance on the way to being solved?
- Write bandwidth and energy limitations are likely to persist
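The gains in the lifetime table can be cross-checked against the lifetimes themselves. The sketch below uses only the numbers from the table; reading each gain as the ratio to the previous row is an interpretation, and the small mismatch on the last row is presumably due to rounding of the displayed lifetimes.

```python
# (technique, lifetime in months) taken from the table above
lifetimes = [
    ("Baseline (LRU)", 0.47),
    ("7-Chance", 0.86),
    ("+RWR", 3.36),
    ("+GC512-Random", 97.29),
]

# Gain of each row relative to the row before it.
step_gains = [
    round(curr[1] / prev[1], 2)
    for prev, curr in zip(lifetimes, lifetimes[1:])
]

# Overall gain of the final configuration over the baseline.
total_gain = lifetimes[-1][1] / lifetimes[0][1]

# Final lifetime expressed in years.
final_years = lifetimes[-1][1] / 12

print(step_gains, round(total_gain), round(final_years, 1))
```

Read this way, the listed gains are per-step factors, the combined improvement over the baseline is roughly 200X, and the final lifetime corresponds to the "8 years" mentioned in the bullets.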