Creating Intelligence at the Edge

1 Creating Intelligence at the Edge Vladimir Stojanović E3S Retreat September 8, 2017

2 The growing importance of machine learning

3 Applications are exploding in the cloud
- Huge interest in moving them to the edge
- Big issues: speed and available power for local computation
- Opportunity for new technologies to help: enter E3S

4 Deep Neural Network Operation
Each layer learns to do a single thing (extract features, compose objects, etc.).

5 Deep Neural Network Example: AlexNet
Stages: feature extraction, then classification
Key operations:
- Convolution (feature extraction, convolutional layers)
- Matrix-vector multiply (classification, fully-connected layers)

6 Classes and Uses of Neural Nets [Jouppi et al., ISCA 2017]
- Multi-layer perceptrons: LeNet-300 handwritten digit recognition; fully-connected layers, feed-forward
- Recurrent neural networks: long short-term memory; voice recognition, natural language processing (FC layers with feedback)
- Convolutional neural networks: image processing, computer vision (e.g. AlexNet)

7 Intelligence confined to the Cloud
However, even in the cloud, special hardware is needed [Facebook Big Sur]

8 Hardware specialization in the Cloud: Repurposing Graphics Processing Units
[Nvidia Volta] 120 TFLOP/s
- Specialized Tensor Cores optimize matrix-vector multiplies
- High-bandwidth memory close to the GPU chip on the interposer
- Specialized GPU-GPU interconnect (NVLink)
[Microsoft HGX-1]

9 Hardware specialization in the Cloud: Custom ASICs (Tensor Processing Unit)
- Custom chip specializing in matrix-vector operations, with lots of on-chip memory
- Optimized for inference
- 8-bit integer arithmetic
- Custom interconnect
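To make the "8-bit integer arithmetic" point concrete, here is a minimal sketch of a quantized matrix-vector multiply. It assumes simple symmetric per-tensor quantization, which is a generic textbook scheme and not necessarily what the TPU uses; the helper name `quantize_int8` and the tensor shapes are illustrative only.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization (assumed scheme): returns (int8 values, scale)."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)   # illustrative FC weights
x = rng.standard_normal(8).astype(np.float32)        # illustrative activations

Wq, w_scale = quantize_int8(W)
xq, x_scale = quantize_int8(x)

# Integer multiply-accumulate: 8-bit operands, 32-bit accumulator.
acc = Wq.astype(np.int32) @ xq.astype(np.int32)

# A single floating-point rescale recovers an approximation of W @ x.
y_approx = acc * (w_scale * x_scale)
y_exact = W @ x
print(np.max(np.abs(y_approx - y_exact)))
```

All multiplies and adds in the inner loop are integer operations; floating point appears only in the final rescale.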

10 Why specialized hardware in the Cloud?
[Figure: TPU, GPU, CPU; Jouppi et al., ISCA 2017]
- Saves energy (primary OpEx)
- Speeds up discovery of new networks (faster training)
- Offered as a service (inference: TPUs, FPGAs; training: GPUs, FPGAs)

11 TPU architecture [Jouppi et al., ISCA 2017]
- Weight coefficients streamed from the outside
- Significant energy burned on weight fetch
- Large on-chip memory for intermediate results
- 8-bit systolic datapath partially saves energy on SRAM writes

12 Limitations of centralized intelligence
Costs at the mobile terminal communicating with a central server/cloud
- Local energy/op: ~1 pJ
- Energy per bit: to register 100 fJ; to SRAM 1 pJ; to DRAM 20 pJ; over radio >1 µJ
- Latency: to register 100 ps; to SRAM 1 ns; to DRAM 100 ns; over radio to a local node >1 ms, to the cloud >50 ms
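One way to read these figures is to ask how many local operations cost as much energy as moving a single bit. The sketch below uses only the numbers quoted on the slide; the radio figure is a lower bound, so the real ratio is at least this large. The names are illustrative.

```python
# Per-bit transfer energies quoted on the slide, in joules.
ENERGY_PER_BIT_J = {
    "register": 100e-15,
    "SRAM": 1e-12,
    "DRAM": 20e-12,
    "radio": 1e-6,  # the slide gives this as a lower bound (>1 uJ)
}
LOCAL_ENERGY_PER_OP_J = 1e-12  # ~1 pJ per local operation

def local_ops_per_bit(destination):
    """Number of local operations that cost as much as moving one bit."""
    return ENERGY_PER_BIT_J[destination] / LOCAL_ENERGY_PER_OP_J

for name in ENERGY_PER_BIT_J:
    print(f"{name:>8}: {local_ops_per_bit(name):,.1f} local ops per bit moved")
```

With these numbers, sending one bit over the radio costs as much as roughly a million local operations, which is the core argument for computing at the edge.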

13 Opportunity for Autonomous Intelligence at the Edge
- The rise of autonomous systems
- Great need for local intelligence
- Need to be extremely energy-efficient and low-latency

14 How to justify specialization?
- Volumes are huge (starting with smartphones); every major hardware manufacturer is looking into these now
- Lifetime is short (IoT and smartphone chips last at most 3 years before being replaced)
- Can afford to create custom chips that are very efficient

15 How hard do we need to work to make it happen?
- TPU2 draws 40 W for a sustained ~10 TOp/s
- At 20 Hz, with 10^9 weights and 200 Ops/weight: need 50 x 10^9 x 200 Op/s = 10 TOp/s
- 50 ms x 40 W = 2 J/inference
- Radio energy to send an HD image for inference: ~ a few J/image
- How to make this work with a 0.5 W power budget (e.g. a minidrone or a phone), or 0.05 W for an IoT device?
- TPU2 still fetches all weights from DRAM (approx. a third of the power above); the rest is spent on logic and local SRAM stores
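The budget arithmetic on this slide can be reproduced in a few lines. This is a sketch using only the slide's figures; the function names are illustrative. Note that the slide's product uses a factor of 50, whereas the stated 20 Hz rate in the same formula gives 4 TOp/s, so both are printed.

```python
WEIGHTS = 10**9          # weights in the network
OPS_PER_WEIGHT = 200     # operations per weight per inference
TPU2_POWER_W = 40.0      # sustained power quoted for TPU2
FRAME_TIME_S = 0.050     # 50 ms per inference

def ops_per_second(rate_factor):
    """Throughput from the slide's formula: rate x weights x ops/weight."""
    return rate_factor * WEIGHTS * OPS_PER_WEIGHT

def energy_per_inference_j(power_w, frame_time_s=FRAME_TIME_S):
    """Energy spent on one inference at a given sustained power."""
    return power_w * frame_time_s

print(ops_per_second(50) / 1e12, "TOp/s with the factor used on the slide")
print(ops_per_second(20) / 1e12, "TOp/s at the stated 20 Hz")
print(energy_per_inference_j(TPU2_POWER_W), "J per inference at 40 W")

# How far TPU2-class power is from the edge budgets named on the slide.
for budget_w in (0.5, 0.05):
    print(f"{budget_w} W budget: {TPU2_POWER_W / budget_w:.0f}x less power available")
```

Either way, the gap to a 0.5 W or 0.05 W budget is about two to three orders of magnitude, which is what the rest of the talk sets out to close.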

16 Opportunity for E3S: Cross-layer design
- E3S strength is in seeing the big picture, from devices to system level
- Cross-layer design is needed to solve this problem

17 AlexNet1000 revisited
- >90% of parameters are in FC layers
- Speed bottleneck in FC layers (large matrices); problems: 1) weight fetching, 2) storing of partial results
- >90% of operations are in Conv layers
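Both 90% figures can be checked with a short count. The layer shapes below are an assumption: they are the commonly cited AlexNet configuration (with grouped convolutions in conv2, conv4 and conv5) and are not given on the slide. Biases are ignored and one multiply-accumulate is counted per weight use.

```python
# Assumed AlexNet shapes: (name, out_h, out_w, out_channels, in_channels_per_group, kernel)
CONV_LAYERS = [
    ("conv1", 55, 55, 96, 3, 11),
    ("conv2", 27, 27, 256, 48, 5),
    ("conv3", 13, 13, 384, 256, 3),
    ("conv4", 13, 13, 384, 192, 3),
    ("conv5", 13, 13, 256, 192, 3),
]
# (name, inputs, outputs)
FC_LAYERS = [("fc6", 9216, 4096), ("fc7", 4096, 4096), ("fc8", 4096, 1000)]

# Conv weights are reused at every output position; FC weights are used once.
conv_params = sum(oc * ic * k * k for _, _, _, oc, ic, k in CONV_LAYERS)
conv_macs = sum(h * w * oc * ic * k * k for _, h, w, oc, ic, k in CONV_LAYERS)
fc_params = sum(i * o for _, i, o in FC_LAYERS)
fc_macs = fc_params

fc_param_share = fc_params / (fc_params + conv_params)
conv_mac_share = conv_macs / (conv_macs + fc_macs)
print(f"FC share of parameters:  {fc_param_share:.1%}")
print(f"Conv share of operations: {conv_mac_share:.1%}")
```

The asymmetry comes from weight reuse: each convolution weight is applied at every output pixel, while each FC weight is fetched for a single multiply-accumulate, which is why weight fetching dominates the FC layers.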

18 Algorithmic improvements: Making the problem smaller with compression
- FC layers are huge and sparse: use compression (pruning) [Han et al. 2015]
- Running a 1 billion connection neural network (a bit larger than AlexNet) at 20 Hz would require (20 Hz)(1 G)(640 pJ) = 12.8 W for DRAM weight fetches alone!
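The power estimate is a product of three terms and is easy to rerun for other network sizes. In the sketch below the function name is illustrative, and the 10x compressed case is a hypothetical that ignores any storage overhead for sparse indices.

```python
def dram_fetch_power_w(rate_hz, connections, energy_per_fetch_j=640e-12):
    """Power spent fetching every weight from DRAM once per inference."""
    return rate_hz * connections * energy_per_fetch_j

dense_power_w = dram_fetch_power_w(20, 1e9)
# Hypothetical: 10x fewer weights to fetch, index overhead ignored.
compressed_power_w = dram_fetch_power_w(20, 1e9 / 10)

print(f"dense:      {dense_power_w:.2f} W")
print(f"compressed: {compressed_power_w:.2f} W")
```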

19 Problems with focusing on a single layer, e.g. threshold pruning (Han et al.)
- Resulting matrix is randomly sparse (nonzeros at unpredictable locations)
- Sparsity adds significant hardware overhead and prevents a systolic implementation with local coefficients
- Techniques that are oblivious to the underlying hardware do not yield the desired improvements
- Threshold pruning example: shrinks the coefficient storage, but only marginally improves speed because of the random locations of the remaining coefficients

20 Randomized sparsity affects performance
- Techniques that are oblivious to the underlying hardware do not yield the desired improvements
- Improvements are only a fraction of the pruning ratio
- Threshold pruning example: shrinks the coefficient storage, but only marginally improves speed because of the random locations of the remaining coefficients
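A minimal sketch of magnitude-threshold pruning shows where the irregularity comes from. It is a generic illustration on random weights, not the procedure or data from Han et al.; `threshold_prune` and the matrix size are made up for the example.

```python
import numpy as np

def threshold_prune(weights, keep_fraction):
    """Zero all but the largest-magnitude `keep_fraction` of the weights."""
    k = int(round(keep_fraction * weights.size))
    # The magnitude of the k-th largest weight becomes the threshold.
    threshold = np.sort(np.abs(weights), axis=None)[-k]
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))            # illustrative FC weights
W_pruned, mask = threshold_prune(W, keep_fraction=0.125)

# Nonzeros land wherever the large weights happen to be, so the
# per-row workload is generally uneven and needs index bookkeeping.
print("nonzeros kept:", int(mask.sum()))
print("nonzeros per row:", mask.sum(axis=1))
```

Storage shrinks by the pruning ratio, but hardware that processes rows in parallel has to wait for the busiest row and decode an index for every surviving weight.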

21 A better way: Architecture-aware compression
- Need an approach that exposes hardware features at the FC layer level
- Example: sub-block matrix, a good fit for parallel hardware (multiple threads, systolic, etc.)
- Example: 8x structured FC layer compression
- However, accuracy is significantly degraded with this FC layer topology
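As an illustration of a sub-block structure, the sketch below builds a block-diagonal mask. That the slide's sub-block matrix is block-diagonal is an assumption, as are the function name and the 64 x 64 size; with 8 diagonal blocks the mask keeps one eighth of the weights, matching the 8x figure.

```python
import numpy as np

def block_diagonal_mask(n_out, n_in, n_blocks):
    """Boolean mask keeping only n_blocks dense blocks on the diagonal."""
    mask = np.zeros((n_out, n_in), dtype=bool)
    rows, cols = n_out // n_blocks, n_in // n_blocks
    for b in range(n_blocks):
        mask[b * rows:(b + 1) * rows, b * cols:(b + 1) * cols] = True
    return mask

mask = block_diagonal_mask(64, 64, 8)
compression = mask.size / mask.sum()
print("weights kept:", int(mask.sum()), "compression:", compression)
```

Each block touches only its own slice of the input and output, so the blocks can be computed independently. The same property means that outputs in one block never see inputs from another, which is consistent with the accuracy loss noted on the slide.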

22 Scaffold pruning example: Permutation-based pruning (PBP)
- Need a transformation that randomizes the hardware-friendly structure so that it captures the FC information
- Simple example: block permutation
- Architecture-friendly template (efficient implementation) -> random row-column permutation -> algorithm-friendly template (good accuracy)
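The template-plus-permutation idea can be sketched as follows, assuming a block-diagonal template and independent uniformly random row and column permutations. The function name and sizes are illustrative, and the actual templates and permutations used in the talk are not specified here.

```python
import numpy as np

def permuted_block_mask(n_out, n_in, n_blocks, rng):
    """Block-diagonal template scattered by random row and column permutations."""
    block = np.ones((n_out // n_blocks, n_in // n_blocks), dtype=int)
    template = np.kron(np.eye(n_blocks, dtype=int), block).astype(bool)
    row_perm = rng.permutation(n_out)
    col_perm = rng.permutation(n_in)
    # mask[i, j] = template[row_perm[i], col_perm[j]]
    return template[np.ix_(row_perm, col_perm)], row_perm, col_perm

rng = np.random.default_rng(0)
mask, row_perm, col_perm = permuted_block_mask(32, 32, 8, rng)

# The mask looks random, yet every row and column keeps the template's count.
print("nonzeros per row:", mask.sum(axis=1))
print("nonzeros per column:", mask.sum(axis=0))
```

The permutation changes which inputs feed which outputs without changing the amount of work per row or column, so the hardware still sees a regular workload.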

23 Scaffold training: PBP training results
- Many random permutations work
- Example: LeNet-300 on the MNIST dataset (10x compression)

24 PBP impact on micro-architecture
- Pipeline: input vector shuffle -> MxV multiply-accumulates with local weight storage (dense sub-blocks) -> output vector shuffle
- Algorithmic transformations enable a simple, systolic (flow-through) architecture with in-situ coefficient storage (minimal energy)
- Fixed or reconfigurable input/output shuffle (permutation)
- Already runs 3-7x faster than sparse on mobile GPUs
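The shuffle, dense-block multiply, shuffle pipeline can be checked numerically against an explicit permuted sparse matrix. This is a sketch under the assumption that the pruned layer is a block-diagonal matrix with permuted rows and columns; `pbp_matvec`, the block sizes and the random data are illustrative.

```python
import numpy as np

def pbp_matvec(blocks, row_perm, col_perm, x):
    """y = W @ x where W is a row/column-permuted block-diagonal matrix."""
    n_blocks, _, cols = blocks.shape
    x_shuffled = np.empty_like(x)
    x_shuffled[col_perm] = x                              # input vector shuffle
    # One small dense matrix-vector product per block, all independent.
    z = np.einsum("bij,bj->bi", blocks, x_shuffled.reshape(n_blocks, cols))
    return z.reshape(-1)[row_perm]                        # output vector shuffle

rng = np.random.default_rng(0)
n_blocks, rows, cols = 4, 3, 5
blocks = rng.standard_normal((n_blocks, rows, cols))
row_perm = rng.permutation(n_blocks * rows)
col_perm = rng.permutation(n_blocks * cols)
x = rng.standard_normal(n_blocks * cols)

# Reference: build the equivalent sparse-looking matrix explicitly.
D = np.zeros((n_blocks * rows, n_blocks * cols))
for b in range(n_blocks):
    D[b * rows:(b + 1) * rows, b * cols:(b + 1) * cols] = blocks[b]
W = D[np.ix_(row_perm, col_perm)]

print(np.allclose(W @ x, pbp_matvec(blocks, row_perm, col_perm, x)))
```

Only the dense blocks are stored and multiplied; the two permutations are pure data movement, which is why they map onto a fixed or reconfigurable shuffle network.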

25 Need for local storage: new devices
- Spin devices or vertical NEM relays are great candidates
- 10^9 weights are more than enough to represent today's and future networks
- At 10x structured compression, 10^8 weights at 4-8 bits are still 0.4-0.8 Gbit (50-100 MB)!
- Need to look across layers to optimize the resolution
- Local, NV storage to minimize fetch energy during run-time and power-cycling
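The storage requirement follows directly from the weight count and resolution. The short sketch below evaluates it for both the uncompressed 10^9 weights and the 10x compressed 10^8 weights; for comparison, the uncompressed case at 4-8 bits comes to 0.5-1 GB. The function name is illustrative.

```python
def storage_bytes(n_weights, bits_per_weight):
    """Raw weight storage in bytes, ignoring any index or metadata overhead."""
    return n_weights * bits_per_weight / 8

for n_weights in (10**9, 10**8):
    for bits in (4, 8):
        size_mb = storage_bytes(n_weights, bits) / 1e6
        print(f"{n_weights:.0e} weights at {bits} bits: {size_mb:.0f} MB")
```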

26 NV reconfigurable logic
Potentially use NV logic to fuse weight storage and multiply
[Figure: circuit schematic; labels include bit lines (BL), inputs I_A to I_E, outputs O_X and O_Y, read enable (RE) and pass gates (PG)]
K. Kato et al., "Embedded Nano-Electro-Mechanical Memory for Energy-Efficient Reconfigurable Logic," EDL 2016

27 Need for reconfigurable interconnect: new devices
- Reconfigurable shuffle allows infrequent but efficient network change
- Adds flexibility to the custom chip
- Can utilize vertical NEM relays

28 Where next?
- Cross-layer design: model and compare with optimized CMOS
- A concrete system benchmark for E3S
