RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM

1 RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM
Fengbin Tu, Weiwei Wu, Shouyi Yin, Leibo Liu, Shaojun Wei
Institute of Microelectronics, Tsinghua University
The 45th International Symposium on Computer Architecture (ISCA 2018)

2 Ubiquitous Deep Neural Networks (DNNs)
Applications: image classification, object detection, video surveillance, speech recognition.

3 DNN Requires Large On-Chip Buffer
The layer data storage of modern DNNs ranges from 0.3 MB to 6.27 MB. These numbers grow if the network processes higher-resolution images or uses a larger batch size.
[1] Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, NIPS '12.
[2] Simonyan et al., Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR '15.
[3] Szegedy et al., Going Deeper with Convolutions, CVPR '15.
[4] He et al., Deep Residual Learning for Image Recognition, CVPR '16.

4 SRAM-based DNN Accelerators
The small footprint limits the on-chip buffer size of conventional SRAM-based DNN accelerators: usually <500 KB, at an area cost of 3~20 mm² (normalized).
[Figure: accelerator block diagrams showing PE arrays, buffer controllers, and banked data buffers.]
On-chip buffer size and area (normalized):
Thinker: 348 KB, 19.4 mm²
DianNao: 44 KB, 3.0 mm²
Eyeriss: 182 KB, 12.3 mm²
Envision: 77 KB, 10.1 mm²
Thinker: Yin et al., A High Energy Efficient Reconfigurable Hybrid Neural Network Processor for Deep Learning Applications, JSSC '18.
DianNao: Chen et al., DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning, ASPLOS '14.
Eyeriss: Chen et al., Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks, ISSCC '16.
Envision: Moons et al., ENVISION: A 0.26-to-10TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable Convolutional Neural Network Processor in 28nm FDSOI, ISSCC '17.

5 SRAM vs. eDRAM (Embedded DRAM)
eDRAM has higher density than SRAM.
Refresh is required for data retention: charge leaks over time and can cause retention failures.

6 Refresh is an Energy Bottleneck
[Figures: eDRAM power breakdown [1]; system power breakdown [2].]
Overhead: eDRAM refresh energy.
[1] Chang et al., Technology Comparison for Large Last-Level Caches (L3Cs): Low-Leakage SRAM, Low Write-Energy STT-RAM, and Refresh-Optimized eDRAM, HPCA '13.
[2] Wilkerson et al., Reducing Cache Power with Low-Cost, Multi-bit Error-Correcting Codes, ISCA '10.

7 Opportunity to Remove eDRAM Refresh
Refresh interval = retention time.
Ghosh, Modeling of Retention Time for High-Speed Embedded Dynamic Random Access Memories, TCAS-I.

8 Opportunity to Remove eDRAM Refresh
Refresh is unnecessary if data lifetime < retention time.
Opportunity 1: Increase retention time by training.
Opportunity 2: Reduce data lifetime by scheduling.
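The refresh condition on this slide can be written as a one-line check. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, the 500 μs data lifetime is an invented example value, and the refresh count is a simplified estimate that assumes refresh is issued at a fixed interval for as long as the data is live.

```python
def needs_refresh(data_lifetime_us: float, retention_time_us: float) -> bool:
    """Refresh is unnecessary only if the data lifetime is shorter than the retention time."""
    return data_lifetime_us >= retention_time_us


def refreshes_during_lifetime(data_lifetime_us: float, refresh_interval_us: float) -> int:
    """Simplified estimate (assumption): one refresh per elapsed interval while the data is live."""
    return int(data_lifetime_us // refresh_interval_us)


# Hypothetical data lifetime of 500 us, checked against two retention times
# that appear elsewhere in this deck (45 us and 734 us).
lifetime_us = 500.0
with_worst_case = needs_refresh(lifetime_us, 45.0)     # True: refresh is required
with_tolerable = needs_refresh(lifetime_us, 734.0)     # False: refresh can be skipped
```

Under these assumptions, lengthening the tolerable retention time (Opportunity 1) and shortening the data lifetime (Opportunity 2) both push the comparison toward the no-refresh case.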

9 RANA: Retention-Aware Neural Acceleration Framework
Inputs: 1. DNN accelerator; 2. target DNN model.
Compilation phase:
(Training) Retention-aware training method (1. accuracy constraint; 2. eDRAM retention time distribution), producing the tolerable retention time.
(Scheduling) Hybrid computation pattern (1. energy modeling; 2. data lifetime analysis; 3. buffer storage analysis), producing the layerwise configurations.
Execution phase:
(Architecture) Refresh-optimized eDRAM controller (1. data mapping; 2. memory controller modification), producing the optimized energy consumption.
Strengthen DNN accelerators with refresh-optimized eDRAM:
Increase on-chip buffer size by replacing SRAM with eDRAM.
Reduce energy overhead by removing unnecessary eDRAM refresh.

10 RANA: Retention-Aware Neural Acceleration Framework
Inputs: DNN accelerator and target DNN model.
(Training) Retention-aware training method, producing the tolerable retention time.
(Scheduling) Hybrid computation pattern, producing the layerwise configurations. For each layer, the scheduling scheme takes the layer description (from the DNN model) and the hardware constraints (from the DNN accelerator) and outputs a computation pattern <OD/WD, Tm, Tn, Tr, Tc>. If the layer is not the last one, it switches to the next layer; otherwise it outputs the configurations for each layer.
(Architecture) Refresh-optimized eDRAM controller, producing the optimized energy consumption. The eDRAM controller contains a reference clock, a programmable clock divider, refresh issuers, and eDRAM refresh flags, and serves the eDRAM banks of the unified buffer system.
Refresh control is driven by retention time and data lifetime.

11 Tech1: Retention-Aware Training Method
Retention time varies among cells.
Retention failure rate: the fraction of cells whose retention time is below the given retention time.
The weakest cell appears at the 45 μs point.
[Figure: Typical eDRAM retention time distribution (32 KB).]
Kong et al., Analysis of Retention Time Distribution of Embedded DRAM - A New Method to Characterize Across-Chip Threshold Voltage Variation, ITC.

12 Tech1: Retention-Aware Training Method
Retrain the network to tolerate a higher failure rate and thereby obtain a longer tolerable retention time.
Inputs: target DNN model, failure rate (r).
Step 1: Fixed-point pretrain, giving a fixed-point DNN model.
Step 2: Add layer masks that inject random bit-level errors.
Step 3: Retrain (weight adjustment), giving the retention-aware DNN model.
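The "random bit-level errors" step can be illustrated with a small error-injection helper. This is a sketch under stated assumptions rather than the authors' training code: the function name is hypothetical, the 16-bit word width is an assumed default, and a retention failure is modeled as an independent bit flip with probability r, which may differ from how a real eDRAM cell fails.

```python
import random


def inject_bit_errors(words, failure_rate, bits=16, rng=None):
    """Flip each bit of each fixed-point word independently with probability failure_rate.

    words: iterable of non-negative ints holding raw fixed-point bit patterns.
    bits:  word width (16 is an assumption, not taken from the slides).
    """
    rng = rng or random.Random()
    word_mask = (1 << bits) - 1
    corrupted = []
    for word in words:
        flip_mask = 0
        for bit in range(bits):
            if rng.random() < failure_rate:
                flip_mask |= 1 << bit
        corrupted.append((word ^ flip_mask) & word_mask)
    return corrupted


# Example: corrupt a few raw weight words at a failure rate of 1e-5.
weights = [0x0123, 0x7FFF, 0x8000, 0x0000]
noisy = inject_bit_errors(weights, failure_rate=1e-5, rng=random.Random(0))
```

During retraining, a mask of this kind would be applied to the stored data on every forward pass so that the weights adapt to the injected errors; how the masks are wired into a particular training framework is outside the scope of this sketch.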

13 Tech1: Retention-Aware Training Method
Failure rate of $10^{-5}$: no accuracy loss, 734 μs.
Failure rate of $10^{-4}$: accuracy decreases.
[Figure: Relative accuracy under different retention failure rates; retention times marked at 45 μs, 734 μs, and 1030 μs.]

14 Tech2: Hybrid Computation Pattern
A computation pattern is expressed as a loop nest.
Data lifetime and buffer storage depend on the loop ordering, especially the outermost loop.
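To make the role of the outermost loop concrete, here is a small tiled convolution written as a loop nest. It is an illustrative sketch only: the function name and the `outer` switch are hypothetical, tiling over rows and columns (Tr, Tc) is omitted for brevity, and stride 1 with no padding is assumed.

```python
def conv_tiled(inp, wgt, Tm, Tn, outer="m"):
    """Tiled convolution over nested lists.

    inp[n][h][l]: N input maps; wgt[m][n][k][k]: M x N kernels of size K x K.
    Returns out[m][r][c]. `outer` selects which tile loop is outermost.
    """
    N, H, L = len(inp), len(inp[0]), len(inp[0][0])
    M, K = len(wgt), len(wgt[0][0])
    R, C = H - K + 1, L - K + 1
    out = [[[0] * C for _ in range(R)] for _ in range(M)]

    m_tiles = range(0, M, Tm)
    n_tiles = range(0, N, Tn)
    if outer == "m":
        # Each output tile is fully accumulated before the next one starts,
        # so partial sums of a tile are live only during its own iterations.
        order = [(m0, n0) for m0 in m_tiles for n0 in n_tiles]
    else:
        # Every output tile is revisited once per input tile, so partial
        # sums stay live across the whole layer.
        order = [(m0, n0) for n0 in n_tiles for m0 in m_tiles]

    for m0, n0 in order:
        for m in range(m0, min(m0 + Tm, M)):
            for n in range(n0, min(n0 + Tn, N)):
                for r in range(R):
                    for c in range(C):
                        for i in range(K):
                            for j in range(K):
                                out[m][r][c] += wgt[m][n][i][j] * inp[n][r + i][c + j]
    return out


# Tiny example: 2 input maps of 3x3, 2 output maps, 2x2 kernels, all ones.
ones_in = [[[1] * 3 for _ in range(3)] for _ in range(2)]
ones_w = [[[[1] * 2 for _ in range(2)] for _ in range(2)] for _ in range(2)]
out_m = conv_tiled(ones_in, ones_w, Tm=1, Tn=1, outer="m")
out_n = conv_tiled(ones_in, ones_w, Tm=1, Tn=1, outer="n")
```

Both orderings compute the same result; what changes is how long each tile of inputs, outputs, or weights has to stay in the buffer, which is the quantity the scheduling step reasons about.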

15 Tech2: Hybrid Computation Pattern
Outputs are dynamically updated by accumulation, which recharges the cells like a periodic refresh.
Different computation patterns have different data lifetime and buffer storage requirements.
Patterns: input dependent (ID), output dependent (OD), weight dependent (WD).

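16 Tech2: Hybrid Computation Pattern
Scheduling scheme:
Input: parameters of the DNN accelerator and the network.
Optimization: minimize total system energy.
Output: layerwise configurations.
Flow: for each layer, run the scheduling scheme on the layer description (from the DNN model) and the hardware constraints (from the DNN accelerator) to obtain a computation pattern <OD/WD, Tm, Tn, Tr, Tc>. If the layer is not the last one, switch to the next layer; otherwise output the configurations for each layer.
Optimization problem:
$$
\begin{aligned}
\min\ & \mathit{Energy} \\
\text{s.t.}\ & \mathit{Energy} = \text{Equation (14)}, \\
& T_n \cdot T_h \cdot T_l \le R_i, \\
& T_m \cdot T_r \cdot T_c \le R_o, \\
& T_m \cdot T_n \cdot K^2 \le R_w, \\
& 1 \le T_m \le M,\quad 1 \le T_n \le N,\quad 1 \le T_r \le R,\quad 1 \le T_c \le C.
\end{aligned}
$$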
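The optimization on this slide can be approached, for small layers, by exhaustive search over the tiling parameters. The sketch below makes several assumptions that are not in the slides: the function names are hypothetical, the input tile size is derived as Th = (Tr - 1) * stride + K and Tl = (Tc - 1) * stride + K, and the energy model is passed in as a callable because Equation (14) is not reproduced in this deck. The `toy_energy` proxy simply counts tile iterations and is not the paper's energy model.

```python
from itertools import product


def schedule_layer(M, N, R, C, K, Ri, Ro, Rw, energy, stride=1, patterns=("OD", "WD")):
    """Return (energy, (pattern, Tm, Tn, Tr, Tc)) minimizing `energy`, or None if infeasible."""
    best = None
    for pattern, Tm, Tn, Tr, Tc in product(
        patterns, range(1, M + 1), range(1, N + 1), range(1, R + 1), range(1, C + 1)
    ):
        Th = (Tr - 1) * stride + K  # assumed input tile height
        Tl = (Tc - 1) * stride + K  # assumed input tile width
        if Tn * Th * Tl > Ri:       # input tile must fit in the input buffer
            continue
        if Tm * Tr * Tc > Ro:       # output tile must fit in the output buffer
            continue
        if Tm * Tn * K * K > Rw:    # weight tile must fit in the weight buffer
            continue
        e = energy(pattern, Tm, Tn, Tr, Tc)
        if best is None or e < best[0]:
            best = (e, (pattern, Tm, Tn, Tr, Tc))
    return best


def toy_energy(pattern, Tm, Tn, Tr, Tc, M=4, N=4, R=4, C=4):
    """Placeholder energy proxy (assumption): number of tile iterations, pattern ignored."""
    def ceil_div(a, b):
        return -(-a // b)
    return ceil_div(M, Tm) * ceil_div(N, Tn) * ceil_div(R, Tr) * ceil_div(C, Tc)


best = schedule_layer(M=4, N=4, R=4, C=4, K=3, Ri=144, Ro=64, Rw=144, energy=toy_energy)
```

Running this per layer and collecting the results mirrors the "switch to the next layer" loop in the flow. A real scheduler would plug in an energy model that accounts for refresh, buffer access, and off-chip access, and would likely prune the search space rather than enumerate it.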

17 Tech3: Refresh-Optimized eDRAM Controller
eDRAM controller:
Programmable clock divider: sets the refresh interval.
Refresh issuers and flags for each eDRAM bank.
Configuration comes from Tech1 and Tech2.
[Figure: eDRAM controller (reference clock, programmable clock divider, refresh issuers, eDRAM refresh flags) connected to the eDRAM banks of the unified buffer system.]
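A behavioral model helps show how the clock divider and the per-bank flags interact. The class below is a software sketch, not the authors' RTL: the class and method names are hypothetical, and it assumes the divider counts reference-clock cycles and that a refresh pulse goes only to banks whose flag is set.

```python
class RefreshController:
    """Behavioral sketch of a refresh controller with a programmable clock divider."""

    def __init__(self, num_banks, ref_clock_mhz):
        self.ref_clock_mhz = ref_clock_mhz
        self.flags = [False] * num_banks  # per-bank refresh flags
        self.divider = 1                  # reference cycles per refresh pulse
        self._count = 0

    def configure(self, refresh_interval_us, flags):
        """Program the divider from a refresh interval and load the per-bank flags."""
        # cycles = interval (us) * clock (MHz)
        self.divider = max(1, int(refresh_interval_us * self.ref_clock_mhz))
        self.flags = list(flags)
        self._count = 0

    def tick(self):
        """Advance one reference cycle; return the banks refreshed on this cycle."""
        self._count += 1
        if self._count < self.divider:
            return []
        self._count = 0
        return [bank for bank, flagged in enumerate(self.flags) if flagged]


# Example: 4 banks, 200 MHz reference clock, 45 us interval, banks 0 and 3 flagged.
ctrl = RefreshController(num_banks=4, ref_clock_mhz=200)
ctrl.configure(refresh_interval_us=45, flags=[True, False, False, True])
```

In this model, a layer whose data lifetime is below the tolerable retention time would be configured with all flags cleared, so `tick()` never issues a refresh for it; the interval itself would be set from the tolerable retention time obtained by training.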

18 Evaluation Platform
RTL-level cycle-accurate simulation for performance estimation and memory access tracing.
System-level energy estimation based on synthesis, Destiny, and CACTI.
Platform configurations:
DNN accelerator: 256 MACs, 384 KB SRAM, 200 MHz, 5.682 mm², 65 nm.
eDRAM: 1.454 MB, retention time = 45 μs, 65 nm.
Kong et al., Analysis of Retention Time Distribution of Embedded DRAM - A New Method to Characterize Across-Chip Threshold Voltage Variation, ITC.

19 Experimental Results
Reduction in eDRAM refresh operations: 99.7%
Reduction in off-chip memory access: 41.7%
Reduction in system energy consumption: 66.2%

20 Scalability to Other Architectures
DaDianNao: 4096 MACs, 36 MB eDRAM, 606 MHz.
Reduction in eDRAM refresh operations: 99.9%
Reduction in system energy consumption: 69.4%
Chen et al., DaDianNao: A Machine-Learning Supercomputer, MICRO.

21 Takeaway
RANA: Retention-Aware Neural Acceleration Framework
Training: Retention-aware training method. Exploit the DNN's error resilience to improve the tolerable retention time.
Scheduling: Hybrid computation pattern. Different computing orders and degrees of parallelism lead to different data lifetime and buffer storage requirements.
Architecture: Refresh-optimized eDRAM controller. No need to refresh all the banks. No need to always use the worst-case refresh interval.
Not limited to applying eDRAM to DNN acceleration.
Approximate computing: retention and error resilience.

22 Thank you for your attention!
