Convolution Engine: Balancing Efficiency and Flexibility in Specialized Computing

Paper by: Wajahat Qadeer, Rehan Hameed, Ofer Shacham, Preethi Venkatesan, Christos Kozyrakis, Mark Horowitz
Presentation by: Justin Selig, Patrick Wang, Shaurya Luthra

Convolution
What is it used for?
- Computational photography
- Image processing
- Video processing
- Convolutional neural networks
Why do we care?
- Cheap imagers on the rise
- Proliferating technology

Convolution Example
[Figure: animation of a kernel labeled "Sum" sliding across a row of values. Successive frames show the row as 1 18 3 4 5, then 1 18 43 4 5, then 1 18 43 76 5.]

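Convolution Example
[Figure: final frame of the sliding-kernel example; the row reads 1 18 43 76 5.]
This is a form of MapReduce:
- Map: multiply the kernel coefficients by the corresponding elements of the matrix.
- Reduce: compute a single output from the multiple operands.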
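The map and reduce steps described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function name `convolve_1d` and the sample `signal` and `kernel` values are made up, and the kernel is applied without flipping (the sliding-window form), which is an assumption.

```python
from functools import reduce
from operator import add, mul


def convolve_1d(signal, kernel):
    """Sliding-window sum of products over a 1D signal.

    Assumption: the kernel is applied as-is (not flipped), and only
    positions where the whole kernel fits inside the signal are computed.
    """
    c = len(kernel)
    out = []
    for n in range(len(signal) - c + 1):
        window = signal[n:n + c]
        products = map(mul, window, kernel)  # Map: coefficient * element
        out.append(reduce(add, products))    # Reduce: many operands -> one output
    return out


# Hypothetical data, chosen only for illustration.
signal = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]
result = convolve_1d(signal, kernel)
print(result)  # [-2, -2, -2]
```

Each output value depends only on its own window, so the map step for all windows could run in parallel; the reduce step is what collapses each window to one number.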

Computation Model
- Basic 1D convolution of image Img with filter f
- Generalized to a map function Map and a reduce function R, with convolution size c
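The two formulas this slide describes can be sketched as follows. The first line is the standard definition of discrete 1D convolution for a filter with c taps. The second line is an assumed notation for the generalization, in which the product becomes Map and the sum becomes the reduce operator R. The symbol for the generalized operator and the index convention are illustrative and may differ from the ones used in the paper.

```latex
(\mathrm{Img} * f)[n] = \sum_{k=0}^{c-1} \mathrm{Img}[n-k] \cdot f[k]

(\mathrm{Img} \overset{CE}{*} f)[n] = R_{k=0}^{c-1} \left\{ \mathrm{Map}\left(\mathrm{Img}[n-k],\, f[k]\right) \right\}
```

Setting Map to multiplication and R to summation turns the second line back into the first.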

Computation Model
For basic convolution:
- Map: multiplication
- Reduce: summation
Operation kernels define Map and Reduce for different operation types.
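One way to picture "operation kernels define Map and Reduce" is a single sliding-window routine that takes the two functions as parameters. The sketch below is an assumption about how that could look in software; `generalized_conv` and the sample data are hypothetical names and values, not part of the presentation.

```python
from functools import reduce


def generalized_conv(img, f, map_fn, reduce_fn):
    """Apply map_fn pairwise over each window of size c = len(f),
    then fold the mapped values into one output with reduce_fn."""
    c = len(f)
    return [
        reduce(reduce_fn, (map_fn(p, w) for p, w in zip(img[n:n + c], f)))
        for n in range(len(img) - c + 1)
    ]


# Hypothetical data, chosen only for illustration.
img = [3, 1, 4, 1, 5]
f = [1, 2, 1]

# Basic convolution: Map = multiplication, Reduce = summation.
basic = generalized_conv(img, f, lambda p, w: p * w, lambda a, b: a + b)

# A different operation type: Map = absolute difference, Reduce = summation.
abs_diff = generalized_conv(img, f, lambda p, w: abs(p - w), lambda a, b: a + b)

print(basic)     # [9, 10, 11]
print(abs_diff)  # [6, 2, 8]
```

Only the two function arguments change between the calls; the windowing logic stays the same.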

Example Operations
Operation                    Map                  Reduce
Motion Estimation (H.264)    Absolute Difference  Summation
SIFT (Gaussian blur)         Multiply             Summation
Up Sampling (½ pixel)        Multiply             Summation
Demosaic Interpolation       Multiply             Complex...
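As an illustration of the motion-estimation row, the sketch below scores candidate positions with a sum of absolute differences (SAD) and picks the lowest score. It is a simplified 1D toy, not the H.264 search itself; `sad`, `best_match`, and the sample data are hypothetical.

```python
def sad(block, candidate):
    """Map: absolute difference per element. Reduce: summation."""
    return sum(abs(a - b) for a, b in zip(block, candidate))


def best_match(block, reference):
    """Slide the block over the reference row and return the offset
    with the smallest SAD, plus all the scores."""
    c = len(block)
    scores = [sad(block, reference[n:n + c])
              for n in range(len(reference) - c + 1)]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical data: the block appears exactly at offset 2.
reference = [10, 12, 50, 52, 49, 11, 10]
block = [50, 52, 49]
offset, scores = best_match(block, reference)
print(offset)  # 2
print(scores)  # [81, 43, 0, 43, 81]
```

The final minimum over scores is a separate step from the per-window reduce; only the SAD inside each window follows the Map and Reduce columns listed above.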

Convolution
What is the problem?
- Very computation heavy
- Too much energy consumption on both CPUs and GPUs

The Energy Cost of General Purpose Processing
Source: "Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing" by W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, M. Horowitz. Presentation given at Stanford University.

Flexibility vs Efficiency

General Requirements
- Hundreds of ops per instruction (ideal: 100x performance gains)
- Minimize data fetch
- Conflict? Convolution uses intermediate values

What about SIMD?
- Single Instruction Multiple Data extension
- Can achieve up to 10x better performance
- And programmable!
- Limited by register file structure
Source: "Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing" by W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, M. Horowitz. Presentation given at Stanford University.

What Makes Specialization Better
Data structures are optimized for data flow and data locality requirements.

Convolution Engine Architecture
Shift registers:
- 2D shift registers
  - Vertical and 2D convolution
  - Vertical shift (shift in a row of the image)
  - Simultaneous register access
- 1D shift registers
  - Data for the horizontal convolution flow
Coefficient registers:
- 2D registers
- Store static data (filter coefficients, static pixel data)
ALUs/multipliers:
- 128 of these allow parallel execution
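The benefit of a 1D shift register for horizontal convolution can be sketched in software. The model below is an assumption made for illustration, not a description of the actual hardware: a `deque` with a fixed length stands in for the shift register, and a counter tracks how many pixel loads are needed compared with refetching every window.

```python
from collections import deque


def horizontal_conv_shift_register(row, coeffs):
    """Horizontal convolution where each pixel is loaded once and then
    reused by every window that covers it."""
    c = len(coeffs)
    shift_reg = deque(maxlen=c)  # oldest pixel drops out when a new one shifts in
    loads = 0
    out = []
    for pixel in row:
        shift_reg.append(pixel)  # one load per incoming pixel
        loads += 1
        if len(shift_reg) == c:  # register is full: one output per shift
            out.append(sum(p * w for p, w in zip(shift_reg, coeffs)))
    return out, loads


# Hypothetical data, chosen only for illustration.
row = [1, 2, 3, 4, 5, 6]
coeffs = [1, 1, 1]
out, loads = horizontal_conv_shift_register(row, coeffs)

# Loads needed if every window refetched all of its pixels.
naive_loads = (len(row) - len(coeffs) + 1) * len(coeffs)

print(out)          # [6, 9, 12, 15]
print(loads)        # 6
print(naive_loads)  # 12
```

In this toy case the shift register halves the number of loads; the gap widens as the filter gets longer.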

Convolution Engine Architecture
- Interface Unit (IF): parallel access to the register file to arrange data into blocks accessible by the functional units.
- Functional Units (FU): fixed-point, two-input ALUs. Support multiply, absolute difference, and arithmetic.
- Reduce Units (RU): programmable degree of reduction for arithmetic and logical stages, implemented with a combining tree.

Convolution Engine Architecture
Map/Reduce logic:
- Abstracts convolution into a map step and a reduce step
- Transforms input pixels into output pixels
- Map stage: ALUs work with the interface units
- Reduce stage: programmable reduce unit implemented as a combining tree
- Data shuffle stage: flexible swizzle network that allows permutation of data between stages
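A combining tree with a programmable degree of reduction can be sketched as repeated pairwise combining, stopping once each group of `degree` inputs has collapsed to one value. This is a software analogy under stated assumptions (power-of-two degree, input length a multiple of the degree); `tree_reduce` is a hypothetical name, not an instruction of the Convolution Engine.

```python
from operator import add


def tree_reduce(values, op, degree):
    """Combine each group of `degree` adjacent values into one output,
    using log2(degree) levels of pairwise combining."""
    if degree < 1 or degree & (degree - 1):
        raise ValueError("degree must be a power of two")
    if len(values) % degree:
        raise ValueError("len(values) must be a multiple of degree")
    level = list(values)
    span = 1  # number of original inputs each element currently covers
    while span < degree:
        level = [op(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        span *= 2
    return level


# Hypothetical map-stage outputs, chosen only for illustration.
products = [1, 2, 3, 4, 5, 6, 7, 8]
print(tree_reduce(products, add, 4))  # [10, 26]
print(tree_reduce(products, add, 8))  # [36]
print(tree_reduce(products, max, 2))  # [2, 4, 6, 8]
```

Changing `degree` trades the number of outputs per pass against the size of each reduction, which is the kind of flexibility a programmable reduce unit is meant to provide.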

Convolution Engine Architecture
Other hardware:
- 32-element SIMD unit
- Interfaces with the 2D output register
- Only intermediate ops, so no data access
- Vector add and vector subtract ops only
Sizing:
- Amortization of instruction costs: hundreds of ops per instruction
- 50-100 ops/instruction is good enough
- More ALUs give diminishing returns
- 128 chosen to keep all ALUs busy
- Can power off half of the ALUs and compute structures
Programming:
- Adds instructions to the processor ISA

Evaluation
- Map each target application onto a chip multiprocessor with 2 CEs
- Test against:
  - SIMD
  - CMPs (custom heterogeneous chip multiprocessors)
- Used Tensilica's Xtensa Modeling Platform
- Created a floorplan to account for interconnection energy

Conclusion
- Convolution Engines (CEs) are flexible specialized processors that increase energy efficiency while maintaining programmability.
- CEs take advantage of data-reuse patterns, eliminate data-transfer overheads, and enable a large number of operations per cycle.
- CEs can support numerous applications based on convolution-like patterns.
- Compared to single-kernel accelerators, a CE remains within 2-3x of the energy and area.
- CEs use 8-15x less energy than SIMD engines.

Thanks for Listening! Questions?

Quiz: Convolution
Use the following kernel to compute an average of the neighboring pixels OF THE TOP 2 SQUARES by convolving the filter over the given matrix. (*Don't use intermediate values for computation.)
Matrix:
4 2 4 1
4 2 4 1
4 2 4 1
4 2 4 1
Kernel (Average):
1/4 1/4
1/4 1/4