Binary Neural Network and Its Implementation with 16 Mb RRAM Macro Chip


Binary Neural Network and Its Implementation with 16 Mb RRAM Macro Chip
Shimeng Yu, Assistant Professor of Electrical Engineering and Computer Engineering
shimengy@asu.edu | http://faculty.engineering.asu.edu/shimengyu/
3/22/2017, School of Electrical, Computer, and Energy Engineering (ECEE)

Outline
- Challenges of Analog Synapses and Why We Need to Binarize the Neural Network
- Binary Neural Network and Its Implementation on Tsinghua's 16 Mb RRAM Macro Chip
- Benchmark of Binary and Analog Synapses
- Summary

Demands for Neuromorphic Hardware
- Deep learning in the cloud: huge labeled training datasets, high-precision training, power-hungry, etc. (Google Cat: 16,000 CPU cores; MS Residual-CNN: 8 GPUs)
- Edge (IoT) computing needs novel hardware and algorithms: local to the sensor, real-time inference (~30 frames/s), small area, and low power
- Adaptive on-line learning with continuous (possibly unlabeled) data
[Figure: edge inference hardware options, from GPU and FPGA to a neuromorphic ASIC]

A Shift in Computing Paradigm towards Neuro-inspired Computing
- Resistive synaptic devices
- Long-term vision: a brain-like computer

Current Status of eNVM-based Neuromorphic Research
- Mostly focused on device-level engineering. Desired targets for the key performance metrics:
  Device dimension: < 10 nm
  Number of multilevel states: > 100*, with a linear and symmetric update
  Energy consumption: < 10 fJ per programming pulse
  Dynamic range: > 100*
  Retention: > 10 years*
  Endurance: > 10^9 updates*
  (* these numbers are application-dependent)
- A few array-level demos with simple pattern classification, such as:
  UCSB's 12x12 crossbar array with memristors (Nature 2015)
  IBM's 256x256 1T1R array with PCM (IEDM 2015)
  ASU's 12x12 crossbar array with multilevel RRAM (EDL 2016)
  ASU-Tsinghua's 400x400 1T1R array with binary RRAM (IEDM 2016)

Cross-point Architecture for Accelerating Weighted Sum and Weight Update
- Weighted sum (inference): all cells are activated in parallel, and summing the column currents performs a vector-matrix multiplication: I_i = Σ_j G_ij · V_j
- Weight update (training): a cell's conductance can be updated by applying programming voltages from the row and column at the same time: ΔG_ij = η · V_i · V_j
- The computation is analog and happens inside the array; an ADC may be needed at the edge of the array.
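As a concrete illustration of these two array operations, the sketch below emulates the weighted sum and the outer-product weight update in NumPy; the conductance range, voltages, and learning rate are placeholder values, not measurements from the chip.

```python
# Minimal NumPy sketch of the crossbar's weighted sum and outer-product update.
# All device numbers below are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 5e-5, size=(200, 400))   # conductance matrix G_ij in siemens
V_in = rng.uniform(0.0, 0.5, size=400)         # read voltages V_j applied to 400 inputs

# Weighted sum (inference): I_i = sum_j G_ij * V_j, all cells activated in parallel
I_out = G @ V_in                               # 200 output currents in one "analog" step

# Weight update (training): dG_ij = eta * V_i * V_j from simultaneous row/column pulses
eta = 1e-4
V_post = rng.uniform(-0.1, 0.1, size=200)      # voltages on the 200 output lines
delta_G = eta * np.outer(V_post, V_in)         # rank-1 (outer-product) update of the array
G = np.clip(G + delta_G, 1e-6, 5e-5)           # conductance stays within the device window
```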

Binary RRAM and Analog RRAM Synaptic Devices
- Binary synapses: conventional filamentary-switching RRAM (e.g., Pt/HfO2/TiN stack) with abrupt set and gradual reset; multilevel states are achievable in the reset regime; could be used for offline training.
- Analog synapses: special interfacial-switching RRAM (e.g., Ta/TaOx/TiO2/Ti stack, T.-H. Hou's group, NCTU, Taiwan) with both smooth set and smooth reset; attractive for online training.
[Figure: measured current-voltage switching curves of the two device types]

Realistic Analog Device's Weight Update Behaviors
- Nonlinearity in weight update
- Device variations
- Non-zero off-state conductance
How would these non-ideal effects impact learning accuracy?
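For readers who want to experiment with the nonlinearity effect, the sketch below implements one commonly used exponential conductance-update model; the parameters (on/off ratio, number of pulses, nonlinearity constant) are illustrative assumptions, not the fitted values used in NeuroSim.

```python
# Sketch of a commonly used nonlinear weight-update model (illustrative parameters).
import numpy as np

G_min, G_max, P_max = 1.0 / 200e3, 50.0 / 200e3, 64   # on/off ratio 50, 64 pulses (6-bit)

def conductance(pulse, A):
    """Conductance after `pulse` identical potentiation pulses.
    A sets the nonlinearity: small A saturates early, very large A is nearly linear."""
    B = (G_max - G_min) / (1.0 - np.exp(-P_max / A))
    return G_min + B * (1.0 - np.exp(-pulse / A))

pulses = np.arange(P_max + 1)
print(conductance(pulses, A=10.0)[:5])   # strongly nonlinear update trajectory
print(conductance(pulses, A=1e6)[:5])    # approaches an ideal linear update
```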

NeuroSim: A Simulator from Device to Algorithm
- Algorithm level: a multilayer perceptron with input, hidden, and output layers trained on MNIST data; parameters include network size, learning rate, thresholding value, etc. Key operations are feed-forward (weighted sum) and back-propagation (weight update).
- Circuit level: synapse arrays with read peripherals, thresholding circuits, and buffers; supported array types include the true crossbar array, the pseudo-crossbar (1T1R) array with WL/BL/SL, and a 6T SRAM array using n SRAM cells as one synapse.
- Device level: NVM device models for digital and analog RRAM, with device parameters (cell height and width, maximum and minimum conductance, read/write voltage and pulse width) and non-ideal properties (nonlinear weight update with a finite number of states; device-to-device and cycle-to-cycle weight-update variation; read noise); an SRAM device model with cell height and width, transistor width, sensing voltage, and read/write latency and energy.
- Input: network structure, training/testing traces, array type and technology node, device type and non-ideal factors.
- Output: area, latency, energy, accuracy.
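The algorithm-level loop that NeuroSim evaluates can be summarized in a few lines; the sketch below is a generic sigmoid MLP training step on placeholder data, not the NeuroSim source code.

```python
# Generic 400-200-10 MLP training step: feed-forward (weighted sums) followed by
# back-propagation (outer-product weight updates). Data and labels are placeholders.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, size=(400, 200))   # input -> hidden weights
W2 = rng.normal(0, 0.1, size=(200, 10))    # hidden -> output weights
lr = 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = rng.random(400)                        # stands in for a flattened 20x20 MNIST-like image
t = np.eye(10)[3]                          # placeholder one-hot target class

# Feed forward: each weighted sum maps onto one synaptic array
h = sigmoid(x @ W1)
y = sigmoid(h @ W2)

# Back propagation: errors flow backward, then weights get outer-product updates
delta_out = (y - t) * y * (1 - y)
delta_hid = (delta_out @ W2.T) * h * (1 - h)
W2 -= lr * np.outer(h, delta_out)
W1 -= lr * np.outer(x, delta_hid)
```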

Model Calibration (Latency, Energy, Leakage)
- Benchmarked at 45 nm with the PTM model.

Model Calibration (Area)
- Layout of a 256x256 pseudo-crossbar eNVM array with its peripherals (SL and BL switch matrices, crossbar WL decoder, ADC, mux with decoder, shift register, adder) using the 45 nm NanGate PDK; layout footprint roughly 123.65 µm x 127.86 µm.
- Layout area: 1.5810E+04 µm²; model-estimated area: 1.5454E+04 µm².

Impact of Weight Precision and Weight-Update Nonlinearity in Analog Synapses
- A multilayer perceptron (MLP) with a 400-200-10 network is used for benchmarking.
- At least 6-bit precision is required for online learning on the MNIST dataset, while 1- or 2-bit may work for offline classification.
- Nonlinearity significantly degrades accuracy for online learning with analog synapses.
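A simple way to see why precision matters is to quantize a weight matrix to n bits and look at the resulting error; the uniform quantizer below is a hypothetical stand-in for whatever mapping the benchmark actually used.

```python
# Uniform n-bit quantization of a weight matrix (illustrative only).
import numpy as np

def quantize(w, n_bits, w_max=1.0):
    """Clip to [-w_max, w_max] and snap to 2^n_bits - 1 evenly spaced levels."""
    levels = 2 ** n_bits - 1
    w_clipped = np.clip(w, -w_max, w_max)
    return np.round((w_clipped + w_max) / (2 * w_max) * levels) / levels * 2 * w_max - w_max

w = np.random.default_rng(1).normal(0, 0.1, size=(400, 200))
for n in (1, 2, 6):
    err = np.abs(quantize(w, n) - w).mean()
    print(f"{n}-bit mean quantization error: {err:.4f}")
```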

Benchmark of Reported Analog Resistive Synapses
Reported analog eNVMs for learning:
- PCMO: 5 bits; nonlinearity (weight increase/decrease) 3.25/5.82; R_ON 23 MΩ; ON/OFF ratio 6.84; weight-update cycle-to-cycle variation (σ) <1%; online learning accuracy 10%; offline classification accuracy ~13%
- Ag:a-Si: 6 bits; nonlinearity 1.13/2.65; R_ON 26 MΩ; ON/OFF 12.5; σ 3.5%; online ~75%; offline ~51%
- TaOx/TiO2: 6 bits; nonlinearity 1.13/0.72; R_ON 5 MΩ; ON/OFF 2; σ <1%; online ~10%; offline ~10%
- AlOx/HfO2: 5 bits; nonlinearity 3/1; R_ON 16.9 kΩ; ON/OFF 4.43; σ 5%; online ~10%; offline ~10%
Desired analog eNVMs for learning:
- Targeted eNVM: 6 bits; nonlinearity 1/1; R_ON 200 kΩ; ON/OFF 50; σ 2%; online ~90%; offline ~94.5%
- Ideal eNVM: 6 bits; nonlinearity 0/0; R_ON 200 kΩ; ON/OFF 50; σ 0%; online ~94.8%; offline ~94.5%
(In the original slide, green marks good attributes and red marks the major cause of learning failure.)

Outline
- Challenges of Analog Synapses and Why We Need to Binarize the Neural Network
- Binary Neural Network and Its Implementation on Tsinghua's 16 Mb RRAM Macro Chip
- Benchmark of Binary and Analog Synapses
- Summary

Binary Neural Network (BNN)
- Precision is reduced to ternary weights (+1, 0, -1) and binary neurons for propagation: ternary values are used for both feed-forward inference and back-propagation of errors.
- Higher precision (e.g., 8-bit) is kept for the weight update only, because ΔW is small (n-bit gradient descent for the weight updates).
- Benchmarked on the MNIST dataset with a 400-200-10 MLP (input, hidden, and output layers); larger hidden layers (400-400-10, 400-800-10) were also evaluated.
[Figure: accuracy vs. training epoch comparing all-floating-point, 8-bit weight & neuron, 8-bit weight & 1-bit neuron, and ternary weight & 1-bit neuron configurations, and the three network sizes]
- This follows recent trends in machine/deep learning, e.g., BinaryNet and XNOR-Net. (S. Yu, et al., IEDM 2016)
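In the spirit of BinaryNet/XNOR-Net, and only as an illustrative sketch rather than the exact training code behind these results, the snippet below shows ternary weights and binary neurons used for propagation while a full-precision shadow copy of the weights absorbs the small gradient updates.

```python
# Ternary-weight / binary-neuron propagation with a high-precision shadow weight.
import numpy as np

rng = np.random.default_rng(0)
W_shadow = rng.normal(0, 0.1, size=(400, 200))      # high-precision weights (for updates)

def ternarize(w, thr=0.05):
    """Map real-valued weights to {-1, 0, +1} for the forward/backward passes."""
    return np.sign(w) * (np.abs(w) > thr)

def binarize_neuron(a):
    """Binary activation: +1 if the weighted sum is positive, else -1."""
    return np.where(a > 0, 1.0, -1.0)

x = rng.integers(0, 2, size=400).astype(float)       # a binarized input pattern
a_hidden = binarize_neuron(x @ ternarize(W_shadow))  # propagation uses ternary weights

grad = rng.normal(0, 0.01, size=W_shadow.shape)      # placeholder gradient
W_shadow -= 0.01 * grad                              # the update keeps full precision
```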

16 Mb Macro Chip (Tsinghua)
- Chip designed and fabricated by Huaqiang Wu's group at Tsinghua University.
- The array is organized into 512x1024-cell blocks (Block0-Block15, mirrored across two halves), with 3-stage gating, a data buffer, 8-bit data-out buses (Dobus), and analog/digital I/O.
- Capacity: 16 Mb; technology node: 130 nm
- V_DD (digital): 1.8 V; V_DD (analog): 5 V
- V_WL_SET: 2-5 V / 50 ns; V_BL_SET: 2-3 V / 50 ns
- V_WL_RESET: 3.5-5 V / 50 ns; V_SL_RESET: 2-3 V / 50 ns
- I/O width: 8

RRAM Stack and Endurance of RRAM
- HfOx-based RRAM cell (TEM cross-section dimensions: 54.3 nm and 9.1 nm), integrated between M4 and M5 on top of CMOS.
- Measured endurance: ~1E6 cycles.
- Courtesy of Huaqiang Wu (Tsinghua University).

Implementation of BNN on the 16 Mb RRAM Chip for Offline Classification
- Network topology: 400-200-10.
- W1-2 is programmed as a 400x400 pattern: 400 rows are driven from the input images, and 400 columns produce the weighted sums for the hidden layer, followed by subtraction and activation. W2-3 is programmed as a 200x20 pattern: 200 row inputs and 20 columns produce the weighted sums for the output. (The doubled column counts relative to the 400-200-10 topology come from representing each signed weight with a pair of columns whose currents are subtracted.)
- Experimental data: the programmed weight-matrix pattern occupies one 512x1024 block of the 16 Mb chip; programming errors occur (shown in red), corresponding to a bit yield of ~99%.

Impact of RRAM Finite Bit Yield on Classification
- The software baseline with high-precision classification reaches ~97% accuracy.
- BNN with 1-bit classification (sign only) reaches ~96.3% accuracy.
- For the MNIST dataset, 99% bit yield is sufficient to maintain ~96.3% accuracy.
[Figures: accuracy vs. training epoch, and accuracy vs. RRAM bit yield swept from 99% down to 90%]
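Finite bit yield can be emulated in software by corrupting a random fraction of the programmed ternary weights before running classification; the helper below is a hypothetical illustration of that experiment, not the measurement flow used on the chip.

```python
# Inject random cell errors into a programmed ternary weight matrix.
import numpy as np

def inject_bit_errors(W_ternary, bit_yield, rng):
    """Corrupt (1 - bit_yield) of the cells with random ternary values."""
    W_faulty = W_ternary.copy()
    mask = rng.random(W_ternary.shape) > bit_yield
    W_faulty[mask] = rng.choice([-1.0, 0.0, 1.0], size=mask.sum())
    return W_faulty

rng = np.random.default_rng(0)
W = rng.choice([-1.0, 0.0, 1.0], size=(400, 200))
W_99 = inject_bit_errors(W, bit_yield=0.99, rng=rng)   # ~99% bit yield
print("cells changed:", int((W_99 != W).sum()))
```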

Precision Reduction for Training
- Online training needs higher precision than offline classification, because small errors must accumulate during back-propagation.
- Sweeping the weight precision from 32 bits down to 4 bits shows that 6-bit precision maintains >96% accuracy for BNN online training.
- Since 6-bit is needed for the MNIST dataset, 6 binary RRAM cells are grouped to implement one synapse.
[Figure: accuracy vs. precision, and the 1T1R binary-synapse array with WL/column decoders, mux, voltage sense amplifiers (VSA), adders, registers, and shift registers]
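One plausible way to group binary cells into a multi-bit synapse is a sign-magnitude encoding, sketched below; the chip's actual mapping is not specified here, so treat this encoding as an assumption.

```python
# Sign-magnitude encoding of a multi-bit synapse onto binary RRAM cells (assumed scheme).
N_BITS = 6   # 6-bit precision was found sufficient for MNIST online training

def encode(w_int):
    """Signed integer in [-(2**(N_BITS-1)-1), 2**(N_BITS-1)-1] -> sign cell + magnitude cells."""
    sign = 1 if w_int < 0 else 0
    mag = abs(int(w_int))
    bits = [(mag >> k) & 1 for k in range(N_BITS - 1)]   # magnitude bits, LSB first
    return [sign] + bits

def decode(cells):
    """Inverse mapping: read the sign cell and sum the magnitude bits back into an integer."""
    sign, bits = cells[0], cells[1:]
    mag = sum(b << k for k, b in enumerate(bits))
    return -mag if sign else mag

print(encode(13), decode(encode(13)), decode(encode(-13)))
```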

Distribution of RRAM Updates During Training
- CDF of the number of switching cycles per cell during online training, plotted separately for W1-2 and W2-3 and broken down by bit position (sign bit, and b7 [MSB] down to b1 [LSB]).
- Most cells switch fewer times than the endurance limit (10^4 cycles).
- The LSB updates more often than the MSB, and W2-3 updates more often than W1-2.

Impact of RRAM Finite Endurance on Training
- Accuracy vs. training epoch is compared for endurance limits of 1e3, 3e3, 5e3, 8e3, and 1e4 switching cycles.
- Lower endurance results in a lower peak accuracy.
- With 10^4 cycles, ~96.9% accuracy is achievable for online training.
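A per-cell endurance limit can be modeled by tracking how many times each binary cell has switched and freezing cells that exceed the budget; the snippet below is an assumed illustration of that idea, not the paper's exact failure model.

```python
# Track per-cell switching counts and freeze cells beyond an endurance budget.
import numpy as np

ENDURANCE = 10_000                     # assumed per-cell switching budget (10^4 cycles)
rng = np.random.default_rng(0)

cells = rng.integers(0, 2, size=(400, 400))    # binary RRAM states (0 = HRS, 1 = LRS)
switch_count = np.zeros_like(cells)            # how many times each cell has switched

def apply_update(flip_mask):
    """Toggle the cells selected by flip_mask, but ignore cells past the endurance limit."""
    alive = switch_count < ENDURANCE
    effective = flip_mask & alive
    cells[effective] = 1 - cells[effective]
    switch_count[effective] += 1

# One (placeholder) training step that asks ~1% of the cells to switch
apply_update(rng.random(cells.shape) < 0.01)
print("cells frozen so far:", int((switch_count >= ENDURANCE).sum()))
```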

Outline
- Challenges of Analog Synapses and Why We Need to Binarize the Neural Network
- Binary Neural Network and Its Implementation on Tsinghua's 16 Mb RRAM Macro Chip
- Benchmark of Binary and Analog Synapses
- Summary

NeuroSim Simulation Set-up for Analog and Binary Synapses
- Analog synapse: 6 bits; nonlinearity 0.72/0.72 (weight increase/decrease); R_ON 200 kΩ; ON/OFF ratio 50; read voltage 0.5 V; write voltage 2 V (for both weight increase and decrease); write pulse width 100 ns per pulse; access-transistor resistance in 1T1R 10 kΩ; read noise 2.89%; pseudo-crossbar array; array sizes 400x100 and 100x10; 14 nm technology node; 40 nm wire width.
- Binary synapse: 6 bits; no nonlinearity; R_ON 200 kΩ; ON/OFF ratio 50; read voltage 0.5 V; write voltage 2 V; write pulse width 100 ns; access-transistor resistance in 1T1R 10 kΩ; no read noise; traditional 1T1R array; array sizes 400x600 and 100x60; 14 nm technology node; 40 nm wire width.

Benchmark Results of Analog and Binary Synapses
- Accuracy: 82.17% (analog) vs. 94.03% (binary)
- Area: 1560.8 µm² vs. 2678.2 µm²
- Total feed-forward latency: 1.1044e-01 s vs. 2.7063e+00 s
- Total weight-update latency: 1.7640e+05 s vs. 3.2283e+03 s
- Total feed-forward energy: 4.4835e-04 J vs. 2.3709e-03 J
- Total weight-update energy: 2.7115e+00 J vs. 8.0447e+00 J
- Leakage: 26.631 µW vs. 15.397 µW
Binary synapses could be a near-term solution, while a perfect analog synapse could bring many benefits in the long run.

Summary
- Today's RRAM technology (even binary) can support offline classification with low-power, fast, and accurate recognition.
- For online training, analog synapses with continuous weights need to overcome grand challenges such as nonlinear weight update and slow programming speed (multiple pulses are needed to tune the weights).
- Binarizing the neural network with low-precision weights allows today's binary RRAM to perform online training with high accuracy, and also shows good resilience to limited yield and endurance, as shown in our demonstration on the 16 Mb RRAM chip.
- Trade-offs exist between binary and analog synapse implementations: binary synapses achieve high accuracy and fast training speed, but with overhead in chip area and dynamic energy.

Acknowledgement
- Students: Pai-Yu Chen, Zhiwei Li
- Collaborator: Huaqiang Wu, Tsinghua University
- NSF-CCF-1552687: CAREER: Scaling-up Resistive Synaptic Arrays for Neuro-inspired Computing