REVOLUTIONIZING THE COMPUTING LANDSCAPE AND BEYOND.

1 December 3-6, 2018 Santa Clara Convention Center CA, USA REVOLUTIONIZING THE COMPUTING LANDSCAPE AND BEYOND.

2 ACCELERATING INFERENCING ON THE EDGE WITH RISC-V Russell Klein Technical Director Mentor, A Siemens Company

3 Accelerating inferencing: Inferencing algorithms exceed the performance capabilities of embedded processors. We examine creating accelerators as a RoCC co-processor for RISC-V and as a bus-based peripheral. Instruction extensions were considered, but they were not a good fit for the algorithms we were working on; we would have needed to contort things too far.

4 Driving automation: ADAS already uses 6 HD cameras at 30 frames/sec. 1920 x 1080 pixels x 30 frames/sec x 3 colors (RGB) = 186,624,000 color values/sec for a single camera, or more than 1 billion per second for the car. More radar, lidar, and cameras are added at every level. Camera resolutions and frame rates are expected to double every few years.
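
The data-rate arithmetic on this slide can be checked with a few lines of C++. This is only a sketch: the function and constant names are made up for illustration, and the figures are the ones quoted on the slide.

```cpp
#include <cstdint>

// Color values produced per second by one camera:
// width x height x frames per second x color channels.
constexpr std::uint64_t values_per_second(std::uint64_t width, std::uint64_t height,
                                          std::uint64_t fps, std::uint64_t channels) {
    return width * height * fps * channels;
}

// Figures quoted on the slide: 1920 x 1080, 30 frames/sec, RGB, 6 cameras.
constexpr std::uint64_t kOneCamera = values_per_second(1920, 1080, 30, 3);
constexpr std::uint64_t kSixCameras = 6 * kOneCamera;
```

With these inputs `kOneCamera` is 186,624,000 and `kSixCameras` is 1,119,744,000, which is where the "more than 1 billion per second" figure comes from.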

5 All embedded systems are not equal: The system should get smart enough not to wake me when the neighbor's cat is on the front porch at 4 a.m., but to wake me when a person is there. One inference every 1 or 2 seconds is fine: <1 million pixels per second.

6 Yolo Tiny (v3): An algorithm for detecting and classifying objects in pictures, used on cell phones and computationally limited systems. Over 5.5 billion floating point operations per inference. Over 36 million weight and bias values. The neural network has 23 layers; full Yolo has 106 layers.

7 Yolo-Tiny Profile

8 What is convolution? Multiply one array by another, element by element, and sum the results. Source: Embedded-Vision.com

9 Algorithm Description 3x3 convolution with an optional 2x2 max pooling Kernel Image

10 Algorithm Description: 3x3 convolution with an optional 2x2 max pooling. Image x kernel: 0 x 1 = 0, 1 x 2 = 2, 2 x 3 = 6, 10 x 4 = 40, 11 x 5 = 55, 12 x 6 = 72, 20 x 7 = 140, 21 x 8 = 168, 22 x 9 = 198; sum 681.

11 Algorithm Description: 3x3 convolution with an optional 2x2 max pooling. Image x kernel: 1 x 1 = 1, 2 x 2 = 4, 3 x 3 = 9, 11 x 4 = 44, 12 x 5 = 60, 13 x 6 = 78, 21 x 7 = 147, 22 x 8 = 176, 23 x 9 = 207; sum 726.

12 Algorithm Description: 3x3 convolution with an optional 2x2 max pooling. Image x kernel: 10 x 1 = 10, 11 x 2 = 22, 12 x 3 = 36, 20 x 4 = 80, 21 x 5 = 105, 22 x 6 = 132, 30 x 7 = 210, 31 x 8 = 248, 32 x 9 = 288; sum 1131.

13 Algorithm Description: 3x3 convolution with an optional 2x2 max pooling. Image x kernel: 11 x 1 = 11, 12 x 2 = 24, 13 x 3 = 39, 21 x 4 = 84, 22 x 5 = 110, 23 x 6 = 138, 31 x 7 = 217, 32 x 8 = 256, 33 x 9 = 297; sum 1176.

14 Algorithm Description Repeat across the entire image Kernel Image

15 Algorithm Description Repeat across the entire image Kernel Image

16 Algorithm Description Repeat across the entire image Kernel Image

17 Algorithm Description Repeat across the entire image Kernel Image
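
The sliding-window walk-through above can be sketched in C++ as follows. This is a minimal illustration, not the code used in the talk: the type and function names are made up, the window is only placed where it fits entirely inside the image (no padding), and the kernel is applied without flipping, as on the slides.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

using Image = std::vector<std::vector<int>>;
using Kernel3x3 = std::array<std::array<int, 3>, 3>;

// One output value: multiply the 3x3 window whose top-left corner is (r, c)
// by the kernel, element by element, and sum the nine products.
int convolve_at(const Image& img, const Kernel3x3& k, std::size_t r, std::size_t c) {
    int sum = 0;
    for (std::size_t i = 0; i < 3; ++i)
        for (std::size_t j = 0; j < 3; ++j)
            sum += img[r + i][c + j] * k[i][j];
    return sum;
}

// Repeat across the entire image (every position where the window fits).
// Assumes the image is at least 3x3.
Image convolve(const Image& img, const Kernel3x3& k) {
    const std::size_t rows = img.size() - 2;
    const std::size_t cols = img[0].size() - 2;
    Image out(rows, std::vector<int>(cols, 0));
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            out[r][c] = convolve_at(img, k, r, c);
    return out;
}

// Optional 2x2 max pooling with stride 2: keep the largest of each 2x2 block.
Image max_pool_2x2(const Image& in) {
    Image out(in.size() / 2, std::vector<int>(in[0].size() / 2, 0));
    for (std::size_t r = 0; r < out.size(); ++r)
        for (std::size_t c = 0; c < out[r].size(); ++c)
            out[r][c] = std::max({in[2 * r][2 * c], in[2 * r][2 * c + 1],
                                  in[2 * r + 1][2 * c], in[2 * r + 1][2 * c + 1]});
    return out;
}
```

With a kernel holding 1 through 9 and a 4x4 image whose pixel at row r, column c is 10r + c (values consistent with the 681, 726, and 1131 sums on the slides), `convolve` produces the four window sums and `max_pool_2x2` keeps the largest of them.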

18 Some observations: For the 416x416 pixel image, a 3x3 convolution requires 1,557,504 multiplications. It is embarrassingly parallel: all multiplies could be done in parallel, if you could get 1.5 million multipliers on a piece of silicon and if you could get the data to and from the multipliers. This is 0.002% of the multiplies for the inference.

19 High Level Synthesis: Enables faster design. Write more abstractly, i.e. less coding. Synthesis handles many details you probably don't want to handle yourself. Creates a control/data flow graph (CDFG) from the original C/C++. Automatically constructs parallelism, pipelining, and resource sharing. Infers and interfaces memories.

20 High Level Synthesis: Makes verification easier. Reuse stimulus from the original C/C++ code. The original C/C++ can be an oracle for the created RTL. Can enable formal verification: the CDFG can be used to compare against the implemented RTL. Even if complete equivalence cannot be proved, this reduces the verification space. [Figure labels: Original Algorithm, Original Testbench, Compare, Transactor, HLS RTL, Transactor]
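
The "original C/C++ as an oracle" idea can be sketched as a small comparison harness. Everything here is illustrative: the names are hypothetical, and a restructured software function stands in for the generated RTL, which in a real flow would be driven through transactors.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Stimulus = std::vector<int>;
using Model = std::function<int(const Stimulus&)>;

// Run the same stimulus through the golden model and the design under test,
// and count how many test vectors disagree.
std::size_t count_mismatches(const Model& golden, const Model& dut,
                             const std::vector<Stimulus>& tests) {
    std::size_t mismatches = 0;
    for (const Stimulus& s : tests)
        if (golden(s) != dut(s)) ++mismatches;
    return mismatches;
}

// Golden model: the straightforward original algorithm (sum of squares).
int sum_sq_reference(const Stimulus& s) {
    int acc = 0;
    for (int v : s) acc += v * v;
    return acc;
}

// Stand-in for the implementation: same function, restructured with two
// accumulators the way a 2-way unrolled datapath might be.
int sum_sq_unrolled(const Stimulus& s) {
    int acc0 = 0, acc1 = 0;
    std::size_t i = 0;
    for (; i + 1 < s.size(); i += 2) {
        acc0 += s[i] * s[i];
        acc1 += s[i + 1] * s[i + 1];
    }
    if (i < s.size()) acc0 += s[i] * s[i];  // odd leftover element
    return acc0 + acc1;
}
```

A regression then reduces to checking that `count_mismatches(sum_sq_reference, sum_sq_unrolled, tests)` is zero for the stimulus already written for the original code.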

21 80% reduction in verification effort: Traditional RTL functional regression: 3 months on 1000 CPUs. HLS C++ functional regression using formal: 2 weeks on 14 CPUs. [Chart axes: Time, Resources] NVIDIA Xavier 12nFF SoC. NVIDIA case study available on mentor.com.

22 Yolo Tiny 1000 image verification suite: Original C implementation on a desktop computer: 1.5 seconds per inference, ~25 minutes. Behavioral C implementation with overhead: 5 seconds per inference, ~1 hour 25 minutes. RTL implementation: 389 minutes per inference, ~9 months. Verifying the behavioral code and proving equivalence is much, much faster.

23 3x3 convolve + 2x2 max pool

24 High Level Synthesis: By unrolling loops and pipelining, we created various implementations with no changes to the source code: 1 multiplier @ 84 clocks, 4 multipliers @ 22 clocks, 9 multipliers @ 8 clocks, 36 multipliers @ 3 clocks, 216 multipliers @ 2/clock. [Figure: 36 multiplier schematic]
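
One way to picture where the 36 multipliers come from: a single pooled output needs four 3x3 convolutions over a 4x4 patch, i.e. 2 x 2 x 3 x 3 = 36 multiplies. The sketch below is an assumption about how such a kernel might be written, not the talk's source code; all names are hypothetical. Every loop has a constant bound, which is the property that lets a synthesis tool unroll and pipeline them under tool directives without source edits.

```cpp
#include <algorithm>
#include <limits>

// Multiplies needed per pooled output: 2x2 pool positions x 3x3 kernel taps.
constexpr int kMultipliesPerOutput = 2 * 2 * 3 * 3;

// 3x3 convolution followed by 2x2 max pooling over one 4x4 input patch.
// Fully unrolled, the four nested loops become 36 parallel multipliers
// feeding adder trees and a comparator stage.
int conv3x3_maxpool2x2(const int patch[4][4], const int kernel[3][3]) {
    int best = std::numeric_limits<int>::min();
    for (int pr = 0; pr < 2; ++pr) {          // pool row
        for (int pc = 0; pc < 2; ++pc) {      // pool column
            int sum = 0;
            for (int i = 0; i < 3; ++i)       // kernel row
                for (int j = 0; j < 3; ++j)   // kernel column
                    sum += patch[pr + i][pc + j] * kernel[i][j];
            best = std::max(best, sum);
        }
    }
    return best;
}
```

Executed sequentially in software, the same function performs the 36 multiplies one after another.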

25 Operation sizing: Move from floating point to fixed point. Select the fixed point precision to meet the needs of the algorithm. For Yolo-Tiny, accuracy improves moving from 32 bit floating point to 20 bit fixed point. Fixed point algorithms can be verified on the desktop. An open source fixed point library is available. Multiplier costs in area and power are roughly proportional to the square of the size of the operands. Moving from 32 bits to 12 bits is about 1/7th the area and power.
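
A rough sketch of arbitrary-width fixed point on the desktop, using plain integers rather than the open source library the slide refers to (which is not named in this transcript). The struct, the function names, and the choice of 12 fraction bits in the usage note are all assumptions for illustration. The last function encodes the slide's rule of thumb that multiplier cost scales with the square of operand width.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// A signed fixed-point format: total width and number of fraction bits.
// Widths need not be powers of two (e.g. 20 bits total).
struct FixedFormat {
    int total_bits;
    int frac_bits;
};

// Convert a real value to the raw integer representation,
// rounding to nearest and saturating at the format limits.
inline std::int64_t quantize(double x, FixedFormat f) {
    const std::int64_t max_raw = (std::int64_t{1} << (f.total_bits - 1)) - 1;
    const std::int64_t min_raw = -max_raw - 1;
    const double scaled = std::round(std::ldexp(x, f.frac_bits));
    const double clamped = std::min(std::max(scaled, static_cast<double>(min_raw)),
                                    static_cast<double>(max_raw));
    return static_cast<std::int64_t>(clamped);
}

inline double to_double(std::int64_t raw, FixedFormat f) {
    return std::ldexp(static_cast<double>(raw), -f.frac_bits);
}

// Multiply two raw values and rescale back to the same format.
// The division truncates toward zero; the result is not saturated here.
inline std::int64_t multiply(std::int64_t a, std::int64_t b, FixedFormat f) {
    return (a * b) / (std::int64_t{1} << f.frac_bits);
}

// Rule of thumb from the slide: area and power grow with the square of the width.
constexpr double relative_multiplier_cost(int bits, int baseline_bits) {
    return (static_cast<double>(bits) * bits) /
           (static_cast<double>(baseline_bits) * baseline_bits);
}
```

For example, with `FixedFormat{20, 12}` the value 1.5 is stored as 6144, and `relative_multiplier_cost(12, 32)` evaluates to 144/1024, about 0.14, in line with the "about 1/7th" estimate.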

26 RoCC accelerator: Created the accelerator in HDL using HLS. The interface was an AC channel (ready/valid/data), a good match for RISC-V protocols. Instantiated a protocol converter between the accelerator and the RoCC interface. Interacted with the accelerator through the custom0 instruction, with no change to compilers. Accessed kernel and image data through the L1 data cache port.
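
For readers unfamiliar with custom0, the sketch below builds the 32-bit instruction word using the RoCC instruction layout from the Rocket Chip documentation (funct7, rs2, rs1, xd, xs1, xs2, rd, opcode) and the RISC-V custom-0 major opcode 0001011. The function name is hypothetical, and the meaning of funct7 for this particular accelerator is not given in the talk. Emitting such a word from inline assembly (for instance with a `.word` directive) is one way to use the instruction without changing the compiler.

```cpp
#include <cstdint>

// RISC-V custom-0 major opcode (0001011).
constexpr std::uint32_t kCustom0Opcode = 0x0B;

// Assemble a RoCC-format instruction word:
//   [31:25] funct7  [24:20] rs2  [19:15] rs1
//   [14] xd  [13] xs1  [12] xs2  [11:7] rd  [6:0] opcode
// xd/xs1/xs2 flag whether rd, rs1, rs2 refer to core integer registers.
constexpr std::uint32_t encode_custom0(std::uint32_t funct7, std::uint32_t rd,
                                       std::uint32_t rs1, std::uint32_t rs2,
                                       bool xd, bool xs1, bool xs2) {
    return ((funct7 & 0x7Fu) << 25) |
           ((rs2 & 0x1Fu) << 20) |
           ((rs1 & 0x1Fu) << 15) |
           (static_cast<std::uint32_t>(xd) << 14) |
           (static_cast<std::uint32_t>(xs1) << 13) |
           (static_cast<std::uint32_t>(xs2) << 12) |
           ((rd & 0x1Fu) << 7) |
           kCustom0Opcode;
}
```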

27 RoCC Accelerator Details: Followed excellent directions from Colin Schmidt from the 2015 rocket chip tutorial bootcamp, along with course notes from CS250 at Berkeley and GitHub examples, which define a method for creating and interfacing a RoCC accelerator. They explain it so much better than I can. Many thanks.

28 Software vs. Accelerator: Software was executing 1 multiply in 7 instructions (found by examination of object code from an optimized compile). The accelerator was executing 36 parallel multiplies in one clock, then adds and compares in 2 more clocks. The accelerator was about 100x faster, as expected: 3 clock cycles per conv/max_pool for the accelerator, versus 7 clock cycles x 36 multiplies = 252, plus some loop control, for software.

29 But wait, where's my performance? A team member computed 3 x the number of conv/max_pools per inference, compared it against the clock cycles, and found we were 4x too slow. Looked at the bus: nice burst cycles, almost 100% utilization. Looked at cache misses and found lots of them.

30 What's going on?

31 What's going on?

32 What's going on?

33 What's going on?

34 What's going on? All these pixels are on different cache lines. At 100% of bandwidth on the bus we would keep the accelerator fed. With cache thrashing, the accelerator was starving.

35 What to do about it? Create a shift register 3 lines + 4 pixels long. Add 2 pixels for each computation.

36 What to do about it? Create a shift register 3 lines + 4 pixels long. Add 2 pixels for each computation.

37 What to do about it? Create a shift register 3 lines + 4 pixels long. Add 2 pixels for each computation. This makes the caches much happier.
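
A software model of such a shift register might look like the sketch below. It assumes pixels arrive in raster order and that the accelerator needs a 4x4 patch per computation (a 3x3 convolution feeding a 2x2 max pool); a register chain of 3 lines + 4 pixels is exactly long enough to expose that patch. The class and member names are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Window4x4 = std::array<std::array<int, 4>, 4>;

// Shift register holding 3 image lines + 4 pixels. Each pixel is read from
// memory once, in order, instead of re-fetching scattered cache lines.
class LineBuffer {
public:
    explicit LineBuffer(std::size_t line_width)
        : width_(line_width), regs_(3 * line_width + 4, 0) {}

    // Shift every register by one position and insert the newest pixel.
    void push(int pixel) {
        for (std::size_t i = 0; i + 1 < regs_.size(); ++i) regs_[i] = regs_[i + 1];
        regs_.back() = pixel;
        ++pushed_;
    }

    // True once enough pixels have arrived to fill the whole register chain.
    bool primed() const { return pushed_ >= regs_.size(); }

    // Taps: window element (r, c) sits r full lines plus c pixels after the
    // oldest register. These 16 positions are where the multipliers connect.
    Window4x4 window() const {
        Window4x4 w{};
        for (std::size_t r = 0; r < 4; ++r)
            for (std::size_t c = 0; c < 4; ++c)
                w[r][c] = regs_[r * width_ + c];
        return w;
    }

private:
    std::size_t width_;
    std::vector<int> regs_;
    std::size_t pushed_ = 0;
};
```

Pushing two more pixels moves the window two columns to the right, which matches "add 2 pixels for each computation" when the pooled output advances with a stride of 2.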

38 What to do about it? Too many registers to be efficient (1260). Add memories where there are no multiplier taps. This requires 7 array declarations in C++; the large arrays are automatically put into memories. [Figure labels: memory, memory, memory]

39 Peripheral Accelerator: An alternate approach is to attach the accelerator to the bus. Instead of custom0 instructions, software reads and writes memory mapped control registers. The accelerator has a master interface to read image and kernel data directly from memory, but no longer shares caches with the Rocket core. The core tile to interconnect path was the bottleneck. [Figure label: 3x3 conv 2x2 max accel]
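
Driving a bus-attached accelerator through memory mapped control registers could look roughly like this. The register layout, bit assignments, and function names are entirely hypothetical, since the talk does not give the register map; on real hardware the struct would be placed at the peripheral's base address rather than in ordinary memory.

```cpp
#include <cstdint>

// Hypothetical register block. 'volatile' keeps the compiler from caching
// or reordering away the device accesses.
struct AccelRegs {
    volatile std::uint32_t image_addr;   // where the accelerator reads the image
    volatile std::uint32_t kernel_addr;  // where it reads the kernel weights
    volatile std::uint32_t result_addr;  // where it writes the output
    volatile std::uint32_t control;      // bit 0: start
    volatile std::uint32_t status;       // bit 0: done
};

constexpr std::uint32_t kCtrlStart = 1u;
constexpr std::uint32_t kStatusDone = 1u;

// Program the addresses, then set the start bit last.
inline void accel_start(AccelRegs& regs, std::uint32_t image,
                        std::uint32_t kernel, std::uint32_t result) {
    regs.image_addr = image;
    regs.kernel_addr = kernel;
    regs.result_addr = result;
    regs.control = kCtrlStart;
}

// Software polls this; the accelerator fetches data itself as a bus master.
inline bool accel_done(const AccelRegs& regs) {
    return (regs.status & kStatusDone) != 0;
}
```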

40 Memory Architecture and Power Considerations: Keeping data local is key to minimizing power consumption; this is very important for ASICs. Floating-point is costly: it is used in training of networks but is not needed in a network inference engine, which can use fixed point. Fixed-point widths don't need to be a power of two; for custom hardware they can be anything. *NVIDIA 2017

41 Yolo-Tiny Profile: Small number of coefficients in early layers. Small(ish) number of features in later layers.

42 Reordering Loops to Minimize Weight Reads: Loops are organized so that weights are only read once from system DRAM. Weights are held stationary across feature maps. Feature maps are computed in order and stored in local SRAM.

Original Algorithm

    OUT_CHAN: for (int oc = 0; oc < out_channels; oc++) {
      FMAP_HEIGHT: for (int r = 0; r < in_height; r++) {
        FMAP_WIDTH: for (int c = 0; c < in_width; c++) {
          acc = 0;  // restart the sum for each output pixel
          IN_CHAN: for (int ic = 0; ic < in_channels; ic++) {
            KERNEL_Y: for (int i = 0; i < 3; i++) {
              KERNEL_X: for (int j = 0; j < 3; j++) {
                // kernel centered on (r, c); border pixels need padding
                acc += fmap[ic][r + i - 1][c + j - 1] * kernel[ic][oc][i][j];
              }
            }
          }
          fmap_out[oc][r][c] = acc;
        }
      }
    }

Reordered Algorithm

    OUT_CHAN: for (int oc = 0; oc < max_out_chan; oc++) {
      KERNEL_Y: for (int i = -1; i < 2; i++) {
        KERNEL_X: for (int j = -1; j < 2; j++) {
          FMAP_HEIGHT: for (int r = 0; r < max_height; r++) {
            FMAP_WIDTH: for (int c = 0; c < max_width; c++) {
              acc = 0;  // partial sum for this kernel tap
              IN_CHAN: for (int ic = 0; ic != max_in_chan; ic += pack) {
                // < FMAP ping-pong memory read PACK values >
                // < Read and cache weights once for reuse >
                MAC: for (int p = 0; p < pack; p++)
                  acc += fmap_data[p] * kernel_data[p];
              }
              acc_mem[r][c] += acc;  // fmap accumulator memory
            }
          }
        }
      }
    }
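
To see that the reordering changes only the order of the arithmetic and not the result, the two loop nests can be written as runnable functions and compared. This is a sketch under several assumptions: integer data, zero padding at the borders, no channel packing, and hypothetical type and function names. In the second function the weights for one (output channel, kernel tap) pair are fetched once and reused across the whole feature map.

```cpp
#include <vector>

using Plane = std::vector<std::vector<int>>;     // [row][col]
using Fmap = std::vector<Plane>;                 // [channel][row][col]
using Kernels = std::vector<std::vector<Plane>>; // [in_chan][out_chan][3][3]

// Zero-padded read: pixels outside the feature map count as 0.
inline int padded(const Fmap& f, int ic, int r, int c) {
    const Plane& p = f[ic];
    if (r < 0 || r >= static_cast<int>(p.size()) ||
        c < 0 || c >= static_cast<int>(p[0].size())) return 0;
    return p[r][c];
}

// Original order: finish each output pixel before moving to the next.
Fmap conv_direct(const Fmap& in, const Kernels& k, int out_ch) {
    const int in_ch = static_cast<int>(in.size());
    const int h = static_cast<int>(in[0].size());
    const int w = static_cast<int>(in[0][0].size());
    Fmap out(out_ch, Plane(h, std::vector<int>(w, 0)));
    for (int oc = 0; oc < out_ch; ++oc)
        for (int r = 0; r < h; ++r)
            for (int c = 0; c < w; ++c) {
                int acc = 0;
                for (int ic = 0; ic < in_ch; ++ic)
                    for (int i = 0; i < 3; ++i)
                        for (int j = 0; j < 3; ++j)
                            acc += padded(in, ic, r + i - 1, c + j - 1) * k[ic][oc][i][j];
                out[oc][r][c] = acc;
            }
    return out;
}

// Reordered: weights stay stationary while the whole feature map streams by,
// and partial sums build up in an accumulator memory.
Fmap conv_weight_stationary(const Fmap& in, const Kernels& k, int out_ch) {
    const int in_ch = static_cast<int>(in.size());
    const int h = static_cast<int>(in[0].size());
    const int w = static_cast<int>(in[0][0].size());
    Fmap acc_mem(out_ch, Plane(h, std::vector<int>(w, 0)));
    std::vector<int> weights(in_ch);
    for (int oc = 0; oc < out_ch; ++oc)
        for (int i = -1; i < 2; ++i)
            for (int j = -1; j < 2; ++j) {
                // Read each weight once for this kernel tap.
                for (int ic = 0; ic < in_ch; ++ic) weights[ic] = k[ic][oc][i + 1][j + 1];
                for (int r = 0; r < h; ++r)
                    for (int c = 0; c < w; ++c) {
                        int acc = 0;
                        for (int ic = 0; ic < in_ch; ++ic)
                            acc += padded(in, ic, r + i, c + j) * weights[ic];
                        acc_mem[oc][r][c] += acc;
                    }
            }
    return acc_mem;
}
```

Both functions return identical feature maps for the same inputs; the difference is in how often each weight has to be fetched.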

43 In-place Architecture: Layers are processed one after another. Feature maps are stored locally in SRAM. Weights are read from system memory. [Figure labels: Layer Control, Kernel I/F, Feature memory A, Multiply/Accumulate Engine, Feature memory B, Ping-Pong memory]
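
The ping-pong arrangement can be modeled in software with two buffers that swap roles after every layer. This is an illustrative sketch only: the names are hypothetical, and a `std::function` stands in for the multiply/accumulate engine configured for one layer.

```cpp
#include <functional>
#include <utility>
#include <vector>

using Buffer = std::vector<int>;
// A layer reads its input feature maps from one memory and writes to the other.
using Layer = std::function<void(const Buffer& in, Buffer& out)>;

// Process layers one after another. Feature memory A and B alternate between
// source and destination, so intermediate results never leave local storage.
Buffer run_layers(Buffer input, const std::vector<Layer>& layers) {
    Buffer mem_a = std::move(input);
    Buffer mem_b;
    Buffer* src = &mem_a;
    Buffer* dst = &mem_b;
    for (const Layer& layer : layers) {
        dst->clear();
        layer(*src, *dst);
        std::swap(src, dst);  // this layer's output is the next layer's input
    }
    return *src;
}
```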

44 Reorganize Feature Maps: Feature maps are interleaved by input channels. Example: PACK=4, input chan=8; fmaps come in order i0 i1 i2 i3 i4 i5 i6 i7. Pack 4 input channels side-by-side in memory. [Figure labels: mem, Feature maps]
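
One plausible reading of this interleaving is sketched below: channels are taken in groups of PACK, and within a group the PACK values for the same pixel sit next to each other, so a single wide memory read delivers one value per multiplier lane. The exact memory layout used in the talk is not spelled out, so both the layout and the function name are assumptions.

```cpp
#include <cstddef>
#include <vector>

// channels[ic][p] holds pixel p of input channel ic (all channels same size).
// Assumes the channel count is a multiple of 'pack'.
std::vector<int> pack_channels(const std::vector<std::vector<int>>& channels,
                               std::size_t pack) {
    const std::size_t num_chan = channels.size();
    const std::size_t pixels = channels.empty() ? 0 : channels[0].size();
    std::vector<int> mem;
    mem.reserve(num_chan * pixels);
    for (std::size_t group = 0; group < num_chan; group += pack)   // i0..i3, then i4..i7
        for (std::size_t p = 0; p < pixels; ++p)                   // each pixel position
            for (std::size_t lane = 0; lane < pack; ++lane)        // PACK values side by side
                mem.push_back(channels[group + lane][p]);
    return mem;
}
```

With PACK=4 and 8 input channels, the first words in memory hold pixel 0 of i0..i3, then pixel 1 of i0..i3, and so on; channels i4..i7 follow after the first group is complete.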

45 Conclusion: Built an accelerator for 3x3 convolution and 2x2 max pool, which comprises 96% of the load for the object recognition algorithm. It has 36 parallel multipliers, a pipelined adder tree, and comparators. Implemented as a RoCC accelerator: shares the L1 cache with the core, so a good choice when co-operating with software on common data. Implemented as a bus based peripheral: has an independent bus interface, so a good choice when communication bound. High-level synthesis enables highly customized accelerators, accommodates last minute changes, and targets ASICs, eFPGAs, and FPGAs.

46 THANK YOU
