Binary Neural Network and Its Implementation with 16 Mb RRAM Macro Chip


Binary Neural Network and Its Implementation with 16 Mb RRAM Macro Chip
Shimeng Yu, Assistant Professor of Electrical Engineering and Computer Engineering
shimengy@asu.edu | http://faculty.engineering.asu.edu/shimengyu/
3/22/2017, School of Electrical, Computer, and Energy Engineering (ECEE)

Outline
- Challenges of Analog Synapses and Why We Need to Binarize the Neural Network
- Binary Neural Network and Its Implementation on Tsinghua's 16 Mb RRAM Macro Chip
- Benchmark of Binary and Analog Synapses
- Summary

Demands for Neuromorphic Hardware
- Deep learning in the cloud: huge labeled training datasets, high-precision training, power-hungry, etc. (Google Cat: 16,000 CPU cores; MS Residual-CNN: 8 GPUs)
- Edge (IoT) computing needs novel hardware and algorithms: local to the sensor, real-time inference (~30 frames/s), small area, and low power
- Adaptive on-line learning with continuous (possibly unlabeled) data
[Figure: edge inference hardware options, from GPU and FPGA to a neuromorphic ASIC]

A Shift in Computing Paradigm towards Neuro-inspired Computing
- Resistive synaptic devices
- Long-term vision: a brain-like computer

Current Status of eNVM-based Neuromorphic Research
- Mostly focused on device-level engineering. Desired targets for the key performance metrics:
  Device dimension: < 10 nm
  Number of multilevel states: > 100*, with a linear and symmetric update
  Energy consumption: < 10 fJ per programming pulse
  Dynamic range: > 100*
  Retention: > 10 years*
  Endurance: > 10^9 updates*
  (* these numbers are application-dependent)
- A few array-level demos with simple pattern classification, such as:
  UCSB's 12x12 crossbar array with memristors (Nature 2015)
  IBM's 256x256 1T1R array with PCM (IEDM 2015)
  ASU's 12x12 crossbar array with multilevel RRAM (EDL 2016)
  ASU-Tsinghua's 400x400 1T1R array with binary RRAM (IEDM 2016)

Cross-point Architecture for Accelerating Weighted Sum and Weight Update
- Weighted sum (inference): all cells are activated in parallel, and summing the column currents performs a vector-matrix multiplication: I_i = Σ_j G_ij · V_j
- Weight update (training): a cell's conductance can be updated by applying programming voltages from the row and column at the same time: ΔG_ij = η · V_i · V_j
- The computation is analog and happens inside the array; an ADC may be needed at the edge of the array.
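As a concrete illustration of these two array operations, the sketch below emulates the weighted sum and the outer-product weight update in NumPy; the conductance range, voltages, and learning rate are placeholder values, not measurements from the chip.

```python
# Minimal NumPy sketch of the crossbar's weighted sum and outer-product update.
# All device numbers below are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 5e-5, size=(200, 400))   # conductance matrix G_ij in siemens
V_in = rng.uniform(0.0, 0.5, size=400)         # read voltages V_j applied to 400 inputs

# Weighted sum (inference): I_i = sum_j G_ij * V_j, all cells activated in parallel
I_out = G @ V_in                               # 200 output currents in one "analog" step

# Weight update (training): dG_ij = eta * V_i * V_j from simultaneous row/column pulses
eta = 1e-4
V_post = rng.uniform(-0.1, 0.1, size=200)      # voltages on the 200 output lines
delta_G = eta * np.outer(V_post, V_in)         # rank-1 (outer-product) update of the array
G = np.clip(G + delta_G, 1e-6, 5e-5)           # conductance stays within the device window
```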

Binary RRAM and Analog RRAM Synaptic Devices
- Binary synapses: conventional filamentary-switching RRAM (e.g., Pt/HfO2/TiN stack) with abrupt set and gradual reset; multilevel states are achievable in the reset regime; could be used for offline training.
- Analog synapses: special interfacial-switching RRAM (e.g., Ta/TaOx/TiO2/Ti stack, T.-H. Hou's group, NCTU, Taiwan) with both smooth set and smooth reset; attractive for online training.
[Figure: measured current-voltage switching curves of the two device types]

Realistic Analog Device's Weight Update Behaviors
- Nonlinearity in weight update
- Device variations
- Non-zero off-state conductance
How would these non-ideal effects impact learning accuracy?
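For readers who want to experiment with the nonlinearity effect, the sketch below implements one commonly used exponential conductance-update model; the parameters (on/off ratio, number of pulses, nonlinearity constant) are illustrative assumptions, not the fitted values used in NeuroSim.

```python
# Sketch of a commonly used nonlinear weight-update model (illustrative parameters).
import numpy as np

G_min, G_max, P_max = 1.0 / 200e3, 50.0 / 200e3, 64   # on/off ratio 50, 64 pulses (6-bit)

def conductance(pulse, A):
    """Conductance after `pulse` identical potentiation pulses.
    A sets the nonlinearity: small A saturates early, very large A is nearly linear."""
    B = (G_max - G_min) / (1.0 - np.exp(-P_max / A))
    return G_min + B * (1.0 - np.exp(-pulse / A))

pulses = np.arange(P_max + 1)
print(conductance(pulses, A=10.0)[:5])   # strongly nonlinear update trajectory
print(conductance(pulses, A=1e6)[:5])    # approaches an ideal linear update
```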

NeuroSim: A Simulator from Device to Algorithm
- Algorithm level: a multilayer perceptron with input, hidden, and output layers trained on MNIST data; parameters include network size, learning rate, thresholding value, etc. Key operations are feed-forward (weighted sum) and back-propagation (weight update).
- Circuit level: synapse arrays with read peripherals, thresholding circuits, and buffers; supported array types include the true crossbar array, the pseudo-crossbar (1T1R) array with WL/BL/SL, and a 6T SRAM array using n SRAM cells as one synapse.
- Device level: NVM device models for digital and analog RRAM, with device parameters (cell height and width, maximum and minimum conductance, read/write voltage and pulse width) and non-ideal properties (nonlinear weight update with a finite number of states; device-to-device and cycle-to-cycle weight-update variation; read noise); an SRAM device model with cell height and width, transistor width, sensing voltage, and read/write latency and energy.
- Input: network structure, training/testing traces, array type and technology node, device type and non-ideal factors.
- Output: area, latency, energy, accuracy.
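The algorithm-level loop that NeuroSim evaluates can be summarized in a few lines; the sketch below is a generic sigmoid MLP training step on placeholder data, not the NeuroSim source code.

```python
# Generic 400-200-10 MLP training step: feed-forward (weighted sums) followed by
# back-propagation (outer-product weight updates). Data and labels are placeholders.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, size=(400, 200))   # input -> hidden weights
W2 = rng.normal(0, 0.1, size=(200, 10))    # hidden -> output weights
lr = 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = rng.random(400)                        # stands in for a flattened 20x20 MNIST-like image
t = np.eye(10)[3]                          # placeholder one-hot target class

# Feed forward: each weighted sum maps onto one synaptic array
h = sigmoid(x @ W1)
y = sigmoid(h @ W2)

# Back propagation: errors flow backward, then weights get outer-product updates
delta_out = (y - t) * y * (1 - y)
delta_hid = (delta_out @ W2.T) * h * (1 - h)
W2 -= lr * np.outer(h, delta_out)
W1 -= lr * np.outer(x, delta_hid)
```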

Model Calibration (Latency, Energy, Leakage)
- Benchmarked at 45 nm with the PTM model.

Model Calibration (Area)
- Layout of a 256x256 pseudo-crossbar eNVM array with its peripherals (SL and BL switch matrices, crossbar WL decoder, ADC, mux with decoder, shift register, adder) using the 45 nm NanGate PDK; layout footprint roughly 123.65 µm x 127.86 µm.
- Layout area: 1.5810E+04 µm²; model-estimated area: 1.5454E+04 µm².

Impact of Weight Precision and Weight-Update Nonlinearity in Analog Synapses
- A multilayer perceptron (MLP) with a 400-200-10 network is used for benchmarking.
- At least 6-bit precision is required for online learning on the MNIST dataset, while 1- or 2-bit may work for offline classification.
- Nonlinearity significantly degrades accuracy for online learning with analog synapses.
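A simple way to see why precision matters is to quantize a weight matrix to n bits and look at the resulting error; the uniform quantizer below is a hypothetical stand-in for whatever mapping the benchmark actually used.

```python
# Uniform n-bit quantization of a weight matrix (illustrative only).
import numpy as np

def quantize(w, n_bits, w_max=1.0):
    """Clip to [-w_max, w_max] and snap to 2^n_bits - 1 evenly spaced levels."""
    levels = 2 ** n_bits - 1
    w_clipped = np.clip(w, -w_max, w_max)
    return np.round((w_clipped + w_max) / (2 * w_max) * levels) / levels * 2 * w_max - w_max

w = np.random.default_rng(1).normal(0, 0.1, size=(400, 200))
for n in (1, 2, 6):
    err = np.abs(quantize(w, n) - w).mean()
    print(f"{n}-bit mean quantization error: {err:.4f}")
```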

Benchmark of Reported Analog Resistive Synapses
Reported analog eNVMs for learning:
- PCMO: 5 bits; nonlinearity (weight increase/decrease) 3.25/5.82; R_ON 23 MΩ; ON/OFF ratio 6.84; weight-update cycle-to-cycle variation (σ) <1%; online learning accuracy 10%; offline classification accuracy ~13%
- Ag:a-Si: 6 bits; nonlinearity 1.13/2.65; R_ON 26 MΩ; ON/OFF 12.5; σ 3.5%; online ~75%; offline ~51%
- TaOx/TiO2: 6 bits; nonlinearity 1.13/0.72; R_ON 5 MΩ; ON/OFF 2; σ <1%; online ~10%; offline ~10%
- AlOx/HfO2: 5 bits; nonlinearity 3/1; R_ON 16.9 kΩ; ON/OFF 4.43; σ 5%; online ~10%; offline ~10%
Desired analog eNVMs for learning:
- Targeted eNVM: 6 bits; nonlinearity 1/1; R_ON 200 kΩ; ON/OFF 50; σ 2%; online ~90%; offline ~94.5%
- Ideal eNVM: 6 bits; nonlinearity 0/0; R_ON 200 kΩ; ON/OFF 50; σ 0%; online ~94.8%; offline ~94.5%
(In the original slide, green marks good attributes and red marks the major cause of learning failure.)

Outline
- Challenges of Analog Synapses and Why We Need to Binarize the Neural Network
- Binary Neural Network and Its Implementation on Tsinghua's 16 Mb RRAM Macro Chip
- Benchmark of Binary and Analog Synapses
- Summary

Binary Neural Network (BNN)
- Precision is reduced to ternary weights (+1, 0, -1) and binary neurons for propagation: ternary values are used for both feed-forward inference and back-propagation of errors.
- Higher precision (e.g., 8-bit) is kept for the weight update only, because ΔW is small (n-bit gradient descent for the weight updates).
- Benchmarked on the MNIST dataset with a 400-200-10 MLP (input, hidden, and output layers); larger hidden layers (400-400-10, 400-800-10) were also evaluated.
[Figure: accuracy vs. training epoch comparing all-floating-point, 8-bit weight & neuron, 8-bit weight & 1-bit neuron, and ternary weight & 1-bit neuron configurations, and the three network sizes]
- This follows recent trends in machine/deep learning, e.g., BinaryNet and XNOR-Net. (S. Yu, et al., IEDM 2016)
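In the spirit of BinaryNet/XNOR-Net, and only as an illustrative sketch rather than the exact training code behind these results, the snippet below shows ternary weights and binary neurons used for propagation while a full-precision shadow copy of the weights absorbs the small gradient updates.

```python
# Ternary-weight / binary-neuron propagation with a high-precision shadow weight.
import numpy as np

rng = np.random.default_rng(0)
W_shadow = rng.normal(0, 0.1, size=(400, 200))      # high-precision weights (for updates)

def ternarize(w, thr=0.05):
    """Map real-valued weights to {-1, 0, +1} for the forward/backward passes."""
    return np.sign(w) * (np.abs(w) > thr)

def binarize_neuron(a):
    """Binary activation: +1 if the weighted sum is positive, else -1."""
    return np.where(a > 0, 1.0, -1.0)

x = rng.integers(0, 2, size=400).astype(float)       # a binarized input pattern
a_hidden = binarize_neuron(x @ ternarize(W_shadow))  # propagation uses ternary weights

grad = rng.normal(0, 0.01, size=W_shadow.shape)      # placeholder gradient
W_shadow -= 0.01 * grad                              # the update keeps full precision
```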

16 Mb Macro Chip (Tsinghua)
- Chip designed and fabricated by Huaqiang Wu's group at Tsinghua University.
- The array is organized into 512x1024-cell blocks (Block0-Block15, mirrored across two halves), with 3-stage gating, a data buffer, 8-bit data-out buses (Dobus), and analog/digital I/O.
- Capacity: 16 Mb; technology node: 130 nm
- V_DD (digital): 1.8 V; V_DD (analog): 5 V
- V_WL_SET: 2-5 V / 50 ns; V_BL_SET: 2-3 V / 50 ns
- V_WL_RESET: 3.5-5 V / 50 ns; V_SL_RESET: 2-3 V / 50 ns
- I/O width: 8

RRAM Stack and Endurance of RRAM
- HfOx-based RRAM cell (TEM cross-section dimensions: 54.3 nm and 9.1 nm), integrated between M4 and M5 on top of CMOS.
- Measured endurance: ~1E6 cycles.
- Courtesy of Huaqiang Wu (Tsinghua University).

Implementation of BNN on the 16 Mb RRAM Chip for Offline Classification
- Network topology: 400-200-10.
- W1-2 is programmed as a 400x400 pattern: 400 rows are driven from the input images, and 400 columns produce the weighted sums for the hidden layer, followed by subtraction and activation. W2-3 is programmed as a 200x20 pattern: 200 row inputs and 20 columns produce the weighted sums for the output. (The doubled column counts relative to the 400-200-10 topology come from representing each signed weight with a pair of columns whose currents are subtracted.)
- Experimental data: the programmed weight-matrix pattern occupies one 512x1024 block of the 16 Mb chip; programming errors occur (shown in red), corresponding to a bit yield of ~99%.

Impact of RRAM Finite Bit Yield on Classification
- The software baseline with high-precision classification reaches ~97% accuracy.
- BNN with 1-bit classification (sign only) reaches ~96.3% accuracy.
- For the MNIST dataset, 99% bit yield is sufficient to maintain ~96.3% accuracy.
[Figures: accuracy vs. training epoch, and accuracy vs. RRAM bit yield swept from 99% down to 90%]
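Finite bit yield can be emulated in software by corrupting a random fraction of the programmed ternary weights before running classification; the helper below is a hypothetical illustration of that experiment, not the measurement flow used on the chip.

```python
# Inject random cell errors into a programmed ternary weight matrix.
import numpy as np

def inject_bit_errors(W_ternary, bit_yield, rng):
    """Corrupt (1 - bit_yield) of the cells with random ternary values."""
    W_faulty = W_ternary.copy()
    mask = rng.random(W_ternary.shape) > bit_yield
    W_faulty[mask] = rng.choice([-1.0, 0.0, 1.0], size=mask.sum())
    return W_faulty

rng = np.random.default_rng(0)
W = rng.choice([-1.0, 0.0, 1.0], size=(400, 200))
W_99 = inject_bit_errors(W, bit_yield=0.99, rng=rng)   # ~99% bit yield
print("cells changed:", int((W_99 != W).sum()))
```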

Precision Reduction for Training
- Online training needs higher precision than offline classification, because small errors must accumulate during back-propagation.
- Sweeping the weight precision from 32 bits down to 4 bits shows that 6-bit precision maintains >96% accuracy for BNN online training.
- Since 6-bit is needed for the MNIST dataset, 6 binary RRAM cells are grouped to implement one synapse.
[Figure: accuracy vs. precision, and the 1T1R binary-synapse array with WL/column decoders, mux, voltage sense amplifiers (VSA), adders, registers, and shift registers]
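One plausible way to group binary cells into a multi-bit synapse is a sign-magnitude encoding, sketched below; the chip's actual mapping is not specified here, so treat this encoding as an assumption.

```python
# Sign-magnitude encoding of a multi-bit synapse onto binary RRAM cells (assumed scheme).
N_BITS = 6   # 6-bit precision was found sufficient for MNIST online training

def encode(w_int):
    """Signed integer in [-(2**(N_BITS-1)-1), 2**(N_BITS-1)-1] -> sign cell + magnitude cells."""
    sign = 1 if w_int < 0 else 0
    mag = abs(int(w_int))
    bits = [(mag >> k) & 1 for k in range(N_BITS - 1)]   # magnitude bits, LSB first
    return [sign] + bits

def decode(cells):
    """Inverse mapping: read the sign cell and sum the magnitude bits back into an integer."""
    sign, bits = cells[0], cells[1:]
    mag = sum(b << k for k, b in enumerate(bits))
    return -mag if sign else mag

print(encode(13), decode(encode(13)), decode(encode(-13)))
```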

Distribution of RRAM Updates During Training
- CDF of the number of switching cycles per cell during online training, plotted separately for W1-2 and W2-3 and broken down by bit position (sign bit, and b7 [MSB] down to b1 [LSB]).
- Most cells switch fewer times than the endurance limit (10^4 cycles).
- The LSB updates more often than the MSB, and W2-3 updates more often than W1-2.

Impact of RRAM Finite Endurance on Training
- Accuracy vs. training epoch is compared for endurance limits of 1e3, 3e3, 5e3, 8e3, and 1e4 switching cycles.
- Lower endurance results in a lower peak accuracy.
- With 10^4 cycles, ~96.9% accuracy is achievable for online training.
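A per-cell endurance limit can be modeled by tracking how many times each binary cell has switched and freezing cells that exceed the budget; the snippet below is an assumed illustration of that idea, not the paper's exact failure model.

```python
# Track per-cell switching counts and freeze cells beyond an endurance budget.
import numpy as np

ENDURANCE = 10_000                     # assumed per-cell switching budget (10^4 cycles)
rng = np.random.default_rng(0)

cells = rng.integers(0, 2, size=(400, 400))    # binary RRAM states (0 = HRS, 1 = LRS)
switch_count = np.zeros_like(cells)            # how many times each cell has switched

def apply_update(flip_mask):
    """Toggle the cells selected by flip_mask, but ignore cells past the endurance limit."""
    alive = switch_count < ENDURANCE
    effective = flip_mask & alive
    cells[effective] = 1 - cells[effective]
    switch_count[effective] += 1

# One (placeholder) training step that asks ~1% of the cells to switch
apply_update(rng.random(cells.shape) < 0.01)
print("cells frozen so far:", int((switch_count >= ENDURANCE).sum()))
```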

Outline
- Challenges of Analog Synapses and Why We Need to Binarize the Neural Network
- Binary Neural Network and Its Implementation on Tsinghua's 16 Mb RRAM Macro Chip
- Benchmark of Binary and Analog Synapses
- Summary

NeuroSim Simulation Set-up for Analog and Binary Synapses
- Analog synapse: 6 bits; nonlinearity 0.72/0.72 (weight increase/decrease); R_ON 200 kΩ; ON/OFF ratio 50; read voltage 0.5 V; write voltage 2 V (for both weight increase and decrease); write pulse width 100 ns per pulse; access-transistor resistance in 1T1R 10 kΩ; read noise 2.89%; pseudo-crossbar array; array sizes 400x100 and 100x10; 14 nm technology node; 40 nm wire width.
- Binary synapse: 6 bits; no nonlinearity; R_ON 200 kΩ; ON/OFF ratio 50; read voltage 0.5 V; write voltage 2 V; write pulse width 100 ns; access-transistor resistance in 1T1R 10 kΩ; no read noise; traditional 1T1R array; array sizes 400x600 and 100x60; 14 nm technology node; 40 nm wire width.

Benchmark Results of Analog and Binary Synapses
- Accuracy: 82.17% (analog) vs. 94.03% (binary)
- Area: 1560.8 µm² vs. 2678.2 µm²
- Total feed-forward latency: 1.1044e-01 s vs. 2.7063e+00 s
- Total weight-update latency: 1.7640e+05 s vs. 3.2283e+03 s
- Total feed-forward energy: 4.4835e-04 J vs. 2.3709e-03 J
- Total weight-update energy: 2.7115e+00 J vs. 8.0447e+00 J
- Leakage: 26.631 µW vs. 15.397 µW
Binary synapses could be a near-term solution, while a perfect analog synapse could bring many benefits in the long run.

Summary
- Today's RRAM technology (even binary) can support offline classification with low-power, fast, and accurate recognition.
- For online training, analog synapses with continuous weights need to overcome grand challenges such as nonlinear weight update and slow programming speed (multiple pulses are needed to tune the weights).
- Binarizing the neural network with low-precision weights allows today's binary RRAM to perform online training with high accuracy, and also shows good resilience to limited yield and endurance, as shown in our demonstration on the 16 Mb RRAM chip.
- Trade-offs exist between binary and analog synapse implementations: binary synapses achieve high accuracy and fast training speed, but with overhead in chip area and dynamic energy.

Acknowledgement
- Students: Pai-Yu Chen, Zhiwei Li
- Collaborator: Huaqiang Wu, Tsinghua University
- NSF-CCF-1552687: CAREER: Scaling-up Resistive Synaptic Arrays for Neuro-inspired Computing