Exploring Computation-Communication Tradeoffs in Camera Systems

Amrita Mazumdar, Thierry Moreau, Sung Kim, Meghan Cowan, Armin Alaghi, Luis Ceze, Mark Oskin, Visvesh Sathe
IISWC 2017

Camera applications are a prominent workload with tight constraints
Constraints: low power, light weight, real-time processing, large data size
Example systems: energy-harvesting camera, augmented reality glasses, video surveillance cameras, 3D-360 virtual reality camera rig

Hardware implementations compound the camera system design space
Camera system constraints: bandwidth, power, time, size
Implementations: ASIC, FPGA, GPU, DSP, CPU
Example application: DogChat

We can represent camera applications as camera processing pipelines to clarify design space exploration
Generic pipeline (functions in the application): sensor -> block 1 -> block 2 -> block 3 -> block 4
Example (DogChat): sensor -> image processing -> face feature tracking -> image rendering

Developers can trade off between computation and communication costs
DogChat pipeline: sensor -> image processing -> face feature tracking -> image rendering
Each block can run as in-camera processing or be offloaded to the cloud
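The trade-off on this slide can be sketched as a search over split points: everything before the split runs in-camera and pays a compute cost, and the data crossing the split pays a communication cost. The block below is a minimal illustration under assumed numbers; the `Block` type, the cost units, and every value in `pipeline` are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    compute_cost: float  # cost of running this block in-camera (hypothetical units)
    output_size: float   # size of the data this block emits (hypothetical units)


def split_costs(blocks, cost_per_unit_transferred):
    """Return (in_camera_names, compute, transfer, total) for every split point.

    Split k means the first k blocks run in-camera and the output of
    block k is what gets sent to the cloud.
    """
    results = []
    compute = 0.0
    for k, block in enumerate(blocks, start=1):
        compute += block.compute_cost
        transfer = block.output_size * cost_per_unit_transferred
        names = [b.name for b in blocks[:k]]
        results.append((names, compute, transfer, compute + transfer))
    return results


# Hypothetical numbers chosen only to show the shape of the trade-off.
pipeline = [
    Block("sensor", 0.0, 100.0),
    Block("image processing", 5.0, 100.0),
    Block("face feature tracking", 20.0, 1.0),
    Block("image rendering", 30.0, 50.0),
]

costs = split_costs(pipeline, cost_per_unit_transferred=1.0)
best = min(costs, key=lambda r: r[3])
print("cheapest split runs in-camera:", best[0], "total cost:", best[3])
```

With these made-up values the cheapest split is after face feature tracking, because that block shrinks the data enough to pay for its own compute cost.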

Optional and required blocks in camera pipelines introduce more tradeoffs
Required blocks: sensor -> image processing -> face feature tracking -> image rendering
Optional blocks: edge, motion, motion tracking

Custom hardware platforms explode the camera system design space
Required blocks: sensor -> image processing -> face feature tracking -> image rendering
Optional blocks: edge, motion, motion tracking
Candidate platforms for each block: ASIC, DSP, FPGA, GPU, CPU
In-camera processing pipelines can help us evaluate these tradeoffs!

Challenges for modern camera systems
Low-power: face authentication for energy-harvesting cameras with ASIC design (blocks: motion, face, neural network)
Low latency: real-time virtual reality for multi-camera rigs with FPGA acceleration (blocks: prep, align, depth, stitch)


Face authentication with energy-harvesting cameras
WISPCam: energy-harvesting camera powered by RF
1 frame / second
~1 mW processing / frame
Task: "Is this Armin?"

CPU-based face authentication neural networks can exceed WISPCam power budgets
CPU pipeline: sensor -> neural network (on-chip CPU) -> other application functions (cloud)
ASIC pipeline: sensor -> motion -> face -> neural network (on-chip ASIC hardware) -> other application functions (cloud)
Adding optional blocks can reduce power consumption for a neural network

Exploring design tradeoffs in ASIC accelerators
[Figure: neural network accelerator (SNNAP) with DMA master, bus scheduler, SRAM, PU control, processing elements PE0 to PE3 (multiply and add units), accumulator FIFO, and sigmoid unit; datapath widths of 8, 16, and 26 bits are labeled]
[Figure: streaming face accelerator (VJ) with integral image accumulator, window buffer, and a classifier unit built from feature, stage, and threshold units]
Neural network: selected a 400-8-1 network topology and used 8-bit datapaths for the optimal energy/accuracy point
Streaming face accelerator: explored classifier and other algorithm parameters to optimize energy
NN topology and hardware impact on energy and accuracy are evaluated in the paper; many more details there
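To make the 400-8-1 topology and the 8-bit datapath concrete, here is a rough software sketch of a forward pass with 8-bit quantized inputs and weights. It is not the SNNAP hardware design: the quantization scheme, the scale factor of 127, the random weights, and all function names are assumptions made for illustration.

```python
import math
import random


def quantize(values, bits=8, scale=127.0):
    """Map floats in [-1, 1] to signed integers that fit in `bits` bits."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [max(lo, min(hi, round(v * scale))) for v in values]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs_q, weights_q, scale=127.0):
    """One fully connected layer: integer multiply-accumulate, then sigmoid."""
    outputs = []
    for neuron_weights in weights_q:
        acc = sum(x * w for x, w in zip(inputs_q, neuron_weights))  # integer MACs
        outputs.append(sigmoid(acc / (scale * scale)))  # undo both scale factors
    return outputs


def forward(pixels, w_hidden_q, w_out_q):
    """400 inputs -> 8 hidden neurons -> 1 output score in (0, 1)."""
    hidden = layer(quantize(pixels), w_hidden_q)
    return layer(quantize(hidden), w_out_q)[0]


random.seed(0)
N_IN, N_HIDDEN, N_OUT = 400, 8, 1

# Random placeholder weights; a real network would be trained.
w_hidden_q = [quantize([random.uniform(-1, 1) for _ in range(N_IN)])
              for _ in range(N_HIDDEN)]
w_out_q = [quantize([random.uniform(-1, 1) for _ in range(N_HIDDEN)])
           for _ in range(N_OUT)]

pixels = [random.random() for _ in range(N_IN)]  # stand-in for an image window
score = forward(pixels, w_hidden_q, w_out_q)
macs = N_IN * N_HIDDEN + N_HIDDEN * N_OUT  # multiply-accumulates per inference
print(f"score={score:.3f}, MACs per inference={macs}")
```

A 400-8-1 network needs 3,208 multiply-accumulate operations per inference, which gives a sense of why a small topology with narrow datapaths suits a roughly 1 mW budget.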

Evaluation: Which pipeline achieves the lowest overall power?
- Synthesized ASIC accelerators in Synopsys
- Constructed simulator to evaluate power consumption on real-world video input
- Computed power for computation and transfer of resulting data for each pipeline configuration

Which pipeline achieves the lowest power consumption?
[Bar chart: power per pipeline configuration, log scale, in µW; the compute and transfer columns give each one's share of total power]

| pipeline configuration                | compute | transfer | power (µW) |
|---------------------------------------|---------|----------|------------|
| sensor                                | <1%     | >99%     | 11,340     |
| sensor -> motion                      | <1%     | >99%     | 3,731      |
| sensor -> face detect                 | 10%     | 90%      | 374        |
| sensor -> NN                          | 16%     | 84%      | 782,090    |
| sensor -> motion -> face detect       | >99%    | <1%      | 132        |
| sensor -> motion -> NN                | >99%    | <1%      | 257,236    |
| sensor -> face detect -> NN           | >99%    | <1%      | 419        |
| sensor -> motion -> face detect -> NN | >99%    | <1%      | 160        |

- Prefilters reduce overall power
- Prefilters with the NN use less power than just using the NN
- Most power-efficient: sensor -> motion -> face detect (132 µW)
- Most power-efficient with on-chip NN: sensor -> motion -> face detect -> NN (160 µW)
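The ranking on this slide can be reproduced directly from the reported power figures. The snippet below is a small sketch that uses only the µW values shown on the chart; the dictionary keys and variable names are illustrative labels, and the "improvement over sensor-only" ratio is a derived quantity rather than a number quoted in the talk.

```python
# Reported total power per pipeline configuration, in microwatts.
power_uw = {
    "sensor": 11_340,
    "sensor + motion": 3_731,
    "sensor + face detect": 374,
    "sensor + NN": 782_090,
    "sensor + motion + face detect": 132,
    "sensor + motion + NN": 257_236,
    "sensor + face detect + NN": 419,
    "sensor + motion + face detect + NN": 160,
}

baseline = power_uw["sensor"]  # transmit every raw frame, no in-camera compute

# Rank all configurations from lowest to highest power.
ranked = sorted(power_uw.items(), key=lambda kv: kv[1])
best_overall = ranked[0][0]

# Restrict to configurations that run the neural network on-chip.
best_with_nn = min((k for k in power_uw if "NN" in k), key=power_uw.get)

# Derived ratio: how many times less power than the sensor-only baseline.
improvement = {k: baseline / v for k, v in power_uw.items()}

for name, uw in ranked:
    print(f"{name:38s} {uw:>9,d} uW  ({improvement[name]:.2f}x vs sensor only)")
```

Ratios below 1 mark configurations that draw more power than simply transmitting raw frames, which is the case for both NN pipelines that lack a face-detect prefilter.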

In-camera processing for face authentication (motion -> face -> neural network)
- In isolation, even well-designed hardware can show sub-optimal performance
- Optional blocks can improve the overall cost, if they balance compute and communication better than the original design


Producing real-time VR video from a camera rig
Input: 16 GoPro cameras, 4K at 30 fps, 3.6 GB/s raw video
Goal: 30 fps 3D-360 stereo video, 1.8 GB/s output
Cloud processing prevents real-time video
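The data rates on this slide translate into per-frame payloads with simple arithmetic. The sketch below assumes that the 3.6 GB/s figure is the combined rate of all 16 cameras and that 1 GB equals 1000 MB; both are assumptions, since the slide does not say either way.

```python
RAW_GB_PER_S = 3.6      # raw video rate from the slide (assumed: all cameras combined)
OUTPUT_GB_PER_S = 1.8   # stitched 3D-360 output rate from the slide
FPS = 30
CAMERAS = 16
MB_PER_GB = 1000        # assumption: decimal units

# Data produced by the whole rig for one time step.
raw_mb_per_frame_set = RAW_GB_PER_S * MB_PER_GB / FPS

# Share of that data coming from a single camera.
raw_mb_per_camera_frame = raw_mb_per_frame_set / CAMERAS

# Size of one stitched output frame.
output_mb_per_frame = OUTPUT_GB_PER_S * MB_PER_GB / FPS

# How much larger the raw stream is than the final output.
raw_to_output_ratio = RAW_GB_PER_S / OUTPUT_GB_PER_S

print(f"raw per frame set:    {raw_mb_per_frame_set:.1f} MB")
print(f"raw per camera frame: {raw_mb_per_camera_frame:.1f} MB")
print(f"output per frame:     {output_mb_per_frame:.1f} MB")
print(f"raw / output:         {raw_to_output_ratio:.1f}x")
```

Under these assumptions the raw stream carries twice the data of the final output, so sending raw frames to the cloud costs more bandwidth than sending the finished video.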

VR pipeline is usually offloaded to perform heavy computation
Pipeline (offloaded to cloud): sensor -> prep -> image align -> depth from flow -> image stitch -> stream to viewer
Processing time: prep 5%, image align 20%, depth from flow 70%, image stitch 5%
Need to accelerate depth from flow to achieve high performance
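Why depth from flow is the stage to accelerate follows from Amdahl's law. The sketch below assumes the 70% figure is depth from flow's share of total processing time, which is how the slide's callout reads; the function name and the example speedups are illustrative, not taken from the talk.

```python
def overall_speedup(fraction, stage_speedup):
    """Amdahl's law: whole-pipeline speedup when one stage gets faster.

    fraction      -- share of total time spent in the accelerated stage
    stage_speedup -- how many times faster that stage becomes
    """
    return 1.0 / ((1.0 - fraction) + fraction / stage_speedup)


DEPTH_FRACTION = 0.70  # assumed share of processing time for depth from flow

# Upper bound if depth from flow became infinitely fast.
limit = 1.0 / (1.0 - DEPTH_FRACTION)

for s in (2, 10, 100, 1000):
    print(f"depth {s:>4}x faster -> pipeline {overall_speedup(DEPTH_FRACTION, s):.2f}x faster")
print(f"upper bound: {limit:.2f}x")

# For comparison: accelerating a 5% stage can never give more than about 1.05x.
small_stage_limit = 1.0 / (1.0 - 0.05)
```

Accelerating depth alone caps the pipeline at about 3.3x, so once it is fast the remaining stages and the data transfer become the next bottlenecks.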

Offloading before the costly step doesn't avoid compute-communication tradeoffs
[Bar chart: video frame size (MB, axis 0 to 600) after each stage: sensor, prep, image align, depth from flow, image stitch, stream to viewer]
- The image alignment step produces significant intermediate data
- Offloading early on is still 2x the final output size

Evaluation: Which pipeline achieves the highest frame rate?
- Designed a simple parallel accelerator for Xilinx Zynq SoC, simulated for Virtex UltraScale+ (implementation details in paper)
- Evaluated against CPU and GPU implementations in Halide
- Assumed 2 GB/s network link for communication

Which pipeline achieves the highest frame rate?
[Bar chart: effective FPS per pipeline configuration, axis 0 to 35; compute and transfer columns are the frame rates each can sustain]

| pipeline configuration                           | compute (FPS) | transfer (FPS) | effective FPS |
|--------------------------------------------------|---------------|----------------|---------------|
| sensor                                           | 100           | 15.8           | 15.8          |
| sensor -> prep                                   | 100           | 15.8           | 15.8          |
| sensor -> prep -> align                          | 100           | 3.95           | 4.0           |
| sensor -> prep -> align -> depth (CPU)           | 0.09          | 5.27           | 0.09          |
| sensor -> prep -> align -> depth (GPU)           | 11.2          | 5.27           | 5.3           |
| sensor -> prep -> align -> depth (FPGA)          | 174           | 5.27           | 5.3           |
| sensor -> prep -> align -> depth (CPU) -> stitch | 0.09          | 31.6           | 0.09          |
| sensor -> prep -> align -> depth (GPU) -> stitch | 11.2          | 31.6           | 11.2          |
| sensor -> prep -> align -> depth (FPGA) -> stitch| 174           | 31.6           | 31.6          |

- CPU results are slowest
- Data size is too big after depth for offloading
- The full pipeline with FPGA is the only one that achieves a real-time frame rate
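The effective frame rates on the chart match a simple bottleneck model: a configuration runs no faster than the slower of its compute rate and its transfer rate. The sketch below applies that model to the reported numbers. Treating effective FPS as `min(compute, transfer)` is an interpretation that matches the bar labels, not a formula stated in the talk, and the 30 fps threshold comes from the stated goal.

```python
TARGET_FPS = 30  # real-time goal stated for the VR rig

# (compute FPS, transfer FPS) per pipeline configuration, as reported.
configs = {
    "sensor": (100, 15.8),
    "sensor + prep": (100, 15.8),
    "sensor + prep + align": (100, 3.95),
    "sensor + prep + align + depth (CPU)": (0.09, 5.27),
    "sensor + prep + align + depth (GPU)": (11.2, 5.27),
    "sensor + prep + align + depth (FPGA)": (174, 5.27),
    "sensor + prep + align + depth (CPU) + stitch": (0.09, 31.6),
    "sensor + prep + align + depth (GPU) + stitch": (11.2, 31.6),
    "sensor + prep + align + depth (FPGA) + stitch": (174, 31.6),
}

# Bottleneck model: the slower of compute and transfer sets the frame rate.
effective = {name: min(compute, transfer)
             for name, (compute, transfer) in configs.items()}

# Which resource limits each configuration.
bottleneck = {name: "compute" if compute < transfer else "transfer"
              for name, (compute, transfer) in configs.items()}

realtime = [name for name, fps in effective.items() if fps >= TARGET_FPS]

for name, fps in effective.items():
    print(f"{name:48s} {fps:6.2f} FPS  (limited by {bottleneck[name]})")
print("real-time configurations:", realtime)
```

The FPGA depth accelerator is transfer-bound both with and without stitching; only adding the stitch step in-camera shrinks the data enough for the link to keep up with 30 fps.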

In-camera processing for real-time VR video (prep -> align -> depth -> stitch)
- Computation and communication together highlight benefits not seen when considered separately
- For VR video, in-camera processing pipelines enable applications that could not even be achieved via cloud offload

In-camera processing pipelines help characterize camera systems
- In-camera pipelines evaluate computation-communication trade-offs
- Use hardware-software co-design to balance constraints and optimize designs
- Achieve optimal performance by considering bottlenecks in context of the full system
Thank you!