Convolution Engine: Balancing Efficiency and Flexibility in Specialized Computing. Paper by: Wajahat Qadeer, Rehan Hameed, Ofer Shacham, Preethi Venkatesan, Christos Kozyrakis, Mark Horowitz. Presentation by: Justin Selig, Patrick Wang, Shaurya Luthra
Convolution: What is it used for? Computational photography, image processing, video processing, convolutional neural networks. Why do we care? Cheap imagers are on the rise, and the technology is proliferating.
Convolution Example. Kernel: Sum. (Output row built up step by step across the animated slides:)
1 18 3 4 5
1 18 43 4 5
1 18 43 76 5
Convolution Example: 1 18 43 76 5. This is a form of MapReduce. Map: multiply kernel coefficients by elements of the matrix. Reduce: compute a single output from the multiple operands.
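The map/reduce view of convolution can be made concrete with a short Python sketch (the function name `conv1d_mapreduce` is ours, purely illustrative, not from the paper):

```python
from functools import reduce

def conv1d_mapreduce(img, kernel):
    """1D convolution expressed as a Map (multiply) followed by a
    Reduce (sum) over each window of the input."""
    c = len(kernel)
    out = []
    for n in range(len(img) - c + 1):
        window = img[n:n + c]
        mapped = [x * k for x, k in zip(window, kernel)]   # Map: multiply
        out.append(reduce(lambda a, b: a + b, mapped))     # Reduce: sum
    return out

# A "sum" kernel of all ones simply adds up each window.
print(conv1d_mapreduce([1, 2, 3, 4, 5], [1, 1, 1]))  # -> [6, 9, 12]
```

With an all-ones kernel the Reduce step dominates, matching the "Kernel: Sum" example on the previous slides.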
Computation Model: Basic 1D convolution of an image Img with a filter f of size c: each output is a sum of products, out[n] = sum_i Img[n - i] * f[i]. This generalizes to a map function Map and a reduce function R with convolution size c: out[n] = R_i { Map(Img[n - i], f[i]) }.
Computation Model: For basic convolution, Map = multiplication and Reduce = summation. Operation kernels define Map and Reduce for the different operation types.
Example Operations:
Motion Estimation (H.264): Map = absolute difference, Reduce = summation
SIFT (Gaussian blur): Map = multiply, Reduce = summation
Upsampling (½ pixel): Map = multiply, Reduce = summation
Demosaic interpolation: Map = multiply, Reduce = complex...
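The operations above differ only in which Map and Reduce they plug in. A hedged sketch of that generalization (names and example values are ours, not the paper's):

```python
import functools

def generalized_conv(img, f, map_fn, reduce_fn):
    """Generalized 1D 'convolution': apply map_fn pairwise to each
    window of img and the filter f, then collapse with reduce_fn."""
    c = len(f)
    out = []
    for n in range(len(img) - c + 1):
        mapped = [map_fn(x, k) for x, k in zip(img[n:n + c], f)]
        out.append(functools.reduce(reduce_fn, mapped))
    return out

# Blur-style kernel: Map = multiply, Reduce = sum
blur = generalized_conv([1, 2, 3, 4], [0.25, 0.5, 0.25],
                        lambda a, b: a * b, lambda a, b: a + b)

# Motion-estimation-style SAD: Map = absolute difference, Reduce = sum
sad = generalized_conv([1, 2, 3, 4], [2, 2, 2],
                       lambda a, b: abs(a - b), lambda a, b: a + b)

print(blur)  # -> [2.0, 3.0]
print(sad)   # -> [2, 3]
```

Swapping the two lambdas is all it takes to move from Gaussian blur to sum-of-absolute-differences, which is exactly the flexibility the CE's operation kernels exploit.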
Convolution: What is the problem? It is very computation heavy, and it consumes too much energy on both CPUs and GPUs.
The Energy Cost of General-Purpose Processing. Source: Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing, by W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and M. Horowitz. Presentation given at Stanford University.
Flexibility vs Efficiency
General Requirements: Hundreds of ops per instruction (ideally ~100x performance gains). Minimize data fetches. Conflict? Convolution reuses intermediate values.
What about SIMD? Single Instruction Multiple Data extensions can achieve up to 10x better performance, and they are programmable! But they are limited by the register file structure. Source: Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing, by W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and M. Horowitz. Presentation given at Stanford University.
What Makes Specialization Better Data structures are optimized for data flow and data locality requirements.
Convolution Engine Architecture
Shift Registers:
2D shift register: supports vertical and 2D convolution; vertical shift (shifts in a row of the image); simultaneous access to multiple registers.
1D shift register: data for horizontal convolution flows through it.
Coefficient Register: a 2D register that stores static data (filter coefficients, static pixel data).
ALUs/Multipliers: 128 of them allow parallel execution.
Convolution Engine Architecture
Interface Units (IF): provide parallel access to the register file, arranging data into blocks accessible by the functional units.
Functional Units (FU): fixed-point, two-input ALUs; support multiply, absolute difference, and other arithmetic.
Reduce Unit (RU): programmable degree of reduction for arithmetic and logical stages, implemented with a combining tree.
Convolution Engine Architecture
Map/Reduce Logic: abstracts convolution into a map step and a reduce step that transform input pixels into output pixels.
Map stage: the ALUs work with the interface units.
Reduce stage: the programmable reduce unit, implemented as a combining tree.
Data shuffle stage: a flexible swizzle network that allows permutation of data between stages.
Convolution Engine Architecture
Other Hardware: a 32-element SIMD unit that interfaces with the 2D output register; it performs only intermediate ops (so no extra data accesses) and supports vector add and vector subtract ops only.
Sizing: amortizing instruction cost requires hundreds of ops per instruction; 50-100 ops/instruction is good enough; more ALUs bring diminishing returns; 128 were chosen to keep all ALUs busy, and half of the ALUs and compute structures can be powered off.
Programming: adds instructions to the processor ISA.
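The data-reuse benefit of the shift registers can be modeled in software (this is our own simplification for intuition, not the paper's hardware design): each pixel is fetched from memory exactly once, and every overlapping window reads it out of the register instead of re-fetching it.

```python
from collections import deque

def shifted_windows(pixel_stream, width):
    """Software model of a 1D shift register: one memory fetch per
    pixel; all taps of each full window are visible simultaneously,
    so overlapping windows reuse data instead of re-fetching it."""
    reg = deque(maxlen=width)          # the shift register
    for px in pixel_stream:
        reg.append(px)                 # shift the new pixel in
        if len(reg) == width:
            yield tuple(reg)           # read out all taps at once

# Five fetches yield three overlapping 3-wide windows
# (a naive windowing scheme would need nine fetches).
print(list(shifted_windows([1, 2, 3, 4, 5], 3)))
# -> [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
```

The 2D shift register applies the same idea across image rows, which is what lets the CE feed 128 ALUs without proportionally more register-file ports.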
Evaluation: Mapped each target application onto a chip multiprocessor with 2 CEs. Tested against SIMD and custom heterogeneous chip multiprocessors (CMPs). Used Tensilica's Xtensa Modeling Platform. Created a floorplan to account for interconnect energy.
Conclusion: Convolution Engines (CEs) are flexible specialized processors that increase energy efficiency while maintaining programmability. CEs take advantage of data-reuse patterns, eliminate data-transfer overheads, and enable a large number of operations per cycle. CEs can support numerous applications based on convolution-like patterns. Compared to single-kernel accelerators, the CE remains within 2-3x in energy and area. CEs use 8-15x less energy than SIMD engines.
Thanks for Listening! Questions?
Quiz: Convolution. Use the following kernel to compute the average of the neighboring pixels OF THE TOP 2 SQUARES by convolving the filter over the given matrix. (*Don't use intermediate values for the computation.)
Matrix:
4 2 4 1
4 2 4 1
4 2 4 1
4 2 4 1
Kernel (Average):
1/4 1/4
1/4 1/4
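Assuming the flattened matrix above is four rows of `4 2 4 1` and the four 1/4 coefficients form a 2x2 averaging kernel (our reading of the garbled slide), a small sketch can check an answer:

```python
def conv2d_avg(matrix, kh, kw):
    """Slide a kh x kw averaging window over the matrix (valid
    positions only) and return the mean of each window."""
    rows, cols = len(matrix), len(matrix[0])
    out = []
    for r in range(rows - kh + 1):
        row_out = []
        for c in range(cols - kw + 1):
            window = [matrix[r + i][c + j]
                      for i in range(kh) for j in range(kw)]
            row_out.append(sum(window) / (kh * kw))
        out.append(row_out)
    return out

# Assumed quiz matrix: four identical rows of 4 2 4 1.
matrix = [[4, 2, 4, 1]] * 4
print(conv2d_avg(matrix, 2, 2)[0])  # top row of outputs -> [3.0, 3.0, 2.5]
```

Because every row is identical here, each output row comes out the same; the interesting part of the quiz is doing this by hand without reusing intermediate values.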