Exploring Computation-Communication Tradeoffs in Camera Systems
Amrita Mazumdar, Thierry Moreau, Sung Kim, Meghan Cowan, Armin Alaghi, Luis Ceze, Mark Oskin, Visvesh Sathe
IISWC 2017
Camera applications are a prominent workload with tight constraints
- Energy-harvesting cameras: low power
- Augmented-reality glasses: light weight, low power
- Video surveillance cameras: real-time processing, large data size
- 3D-360 virtual reality camera rigs: large data size, real-time processing
Hardware implementations compound the camera system design space
- Camera system constraints: bandwidth, power, time, size
- Candidate implementations: ASIC, FPGA, GPU, DSP, CPU
- Running example application: DogChat
We can represent camera applications as camera processing pipelines to clarify design space exploration
- A pipeline chains the functions in the application: sensor -> block 1 -> block 2 -> block 3 -> block 4
- DogChat example: sensor -> image processing -> face feature tracking -> image rendering
Developers can trade off between computation and communication costs
- Each pipeline stage can run in-camera or be offloaded to the cloud
- DogChat example: the pipeline can be split at any point, with early stages as in-camera processing and the rest offloaded to the cloud
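The split-point tradeoff above can be sketched as a tiny cost model: each stage has a per-frame compute cost if run in-camera, and the output of the last in-camera stage must cross the link to the cloud. Stage names echo the DogChat example, but every number (compute costs, output sizes, radio energy) is an illustrative assumption, not a figure from the paper.

```python
# Cost model for choosing where to split a camera pipeline between
# in-camera processing and cloud offload. All numbers are illustrative
# assumptions, not measurements from the paper.

STAGES = [
    # (name, in-camera compute cost (mJ/frame), output size (KB/frame))
    ("sensor",           0.0, 640.0),
    ("image processing", 1.0, 320.0),
    ("face tracking",    4.0,   2.0),
    ("image rendering",  2.0,  64.0),
]

TX_MJ_PER_KB = 0.02  # assumed radio energy per kilobyte transmitted

def total_cost(split):
    """Energy per frame if STAGES[:split] run in-camera and the rest offload."""
    compute = sum(cost for _, cost, _ in STAGES[:split])
    transfer = STAGES[split - 1][2] * TX_MJ_PER_KB  # ship the last stage's output
    return compute + transfer

for split in range(1, len(STAGES) + 1):
    print(f"split after {STAGES[split - 1][0]:16s}: {total_cost(split):5.2f} mJ/frame")

# The cheapest split lands after the stage that shrinks the data the most.
best = min(range(1, len(STAGES) + 1), key=total_cost)
```

With these made-up numbers the optimum is to run through face tracking in-camera, because that stage collapses the data volume; the same structure drives both case studies in the talk.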
Optional and required blocks in camera pipelines introduce more tradeoffs
- Required blocks: image processing, face feature tracking, image rendering
- Optional blocks: edge detection, motion detection, motion tracking
Custom hardware platforms explode the camera system design space
- Each block, required or optional, can map to an ASIC, DSP, FPGA, GPU, or CPU
- In-camera processing pipelines can help us evaluate these tradeoffs!
Challenges for modern camera systems
- Low power: face authentication for energy-harvesting cameras with ASIC design (motion -> face -> neural network)
- Low latency: real-time virtual reality for multi-camera rigs with FPGA acceleration (prep -> align -> depth -> stitch)
Face authentication with energy-harvesting cameras
- WISPCam: an energy-harvesting camera powered by RF
- 1 frame / second, ~1 mW of processing per frame
- Task: is this Armin?
CPU-based face authentication neural networks can exceed the WISPCam power budget
- Baseline: sensor feeds a neural network on the on-chip CPU, with the other application functions in the cloud
- Adding optional blocks (motion and face detection) as ASIC hardware can reduce the power the neural network consumes
Exploring design tradeoffs in ASIC accelerators
- Neural network accelerator: SNNAP (DMA master, bus scheduler, SRAM, array of processing elements, sigmoid unit) [datapath diagram omitted]
- Streaming face accelerator: Viola-Jones-style detector (integral image accumulator, window buffer, feature unit, threshold unit, classifier unit) [datapath diagram omitted]
- Selected a 400-8-1 network topology and 8-bit datapaths as the best energy/accuracy point
- Explored classifier and other algorithm parameters to minimize energy
- Evaluated NN topology and hardware impact on energy and accuracy in the paper
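The integral-image step at the heart of the streaming Viola-Jones face accelerator can be sketched in software: each entry holds the sum of all pixels above and to the left, so any rectangular feature sum costs four lookups. This is a generic illustration of the technique, not the accelerator's implementation, which computes the same recurrence row by row in hardware.

```python
# Integral image: ii[y][x] = sum of all pixels img[j][i] with j < y, i < x.
# A one-row/one-column zero border keeps the recurrence branch-free.

def integral_image(img):
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = (img[y][x] + ii[y][x + 1]
                                + ii[y + 1][x] - ii[y][x])
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the w-by-h rectangle with top-left corner (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```

Each Haar-like feature in the classifier reduces to a handful of `rect_sum` calls, which is why the window buffer and feature unit in the accelerator stay so small.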
Evaluation: which pipeline achieves the lowest overall power?
- Synthesized the ASIC accelerators with Synopsys tools
- Constructed a simulator to evaluate power consumption on real-world video input
- Computed power for computation and for transfer of the resulting data in each pipeline configuration
Which pipeline achieves the lowest power consumption?

configuration                        compute / transfer    power (µW)
sensor                               <1% / >99%                11,340
sensor + motion                      <1% / >99%                 3,731
sensor + face detect                 10% / 90%                    374
sensor + NN                          16% / 84%                782,090
sensor + motion + face detect        >99% / <1%                   132
sensor + motion + NN                 >99% / <1%               257,236
sensor + face detect + NN            >99% / <1%                   419
sensor + motion + face detect + NN   >99% / <1%                   160

- Prefilters reduce overall power: the NN alone costs 782,090 µW, while prefiltered configurations with the NN use far less
- Most power-efficient overall: sensor + motion + face detect (132 µW)
- Most power-efficient with the NN on-chip: sensor + motion + face detect + NN (160 µW)
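The takeaways follow mechanically from the totals: a few lines suffice to rank the configurations. The µW values below are the ones reported above; the dictionary encoding is just for illustration.

```python
# Rank the pipeline configurations by total power (compute + transfer).
# The µW totals are the values reported in the evaluation.

power_uw = {
    ("sensor",):                                   11_340,
    ("sensor", "motion"):                           3_731,
    ("sensor", "face detect"):                        374,
    ("sensor", "NN"):                             782_090,
    ("sensor", "motion", "face detect"):              132,
    ("sensor", "motion", "NN"):                   257_236,
    ("sensor", "face detect", "NN"):                  419,
    ("sensor", "motion", "face detect", "NN"):        160,
}

best = min(power_uw, key=power_uw.get)
best_with_nn = min((c for c in power_uw if "NN" in c), key=power_uw.get)

print(" + ".join(best), power_uw[best], "uW")                  # overall winner
print(" + ".join(best_with_nn), power_uw[best_with_nn], "uW")  # winner with on-chip NN
```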
In-camera processing for face authentication (motion -> face -> neural network)
- In isolation, even well-designed hardware can show sub-optimal performance
- Optional blocks can improve the overall cost if they balance compute and communication better than the original design
Challenges for modern camera systems
- Low power: face authentication for energy-harvesting cameras with ASIC design (motion -> face -> neural network)
- Low latency: real-time virtual reality for multi-camera rigs with FPGA acceleration (prep -> align -> depth -> stitch)
Producing real-time VR video from a camera rig
- Input: 16 GoPro cameras, 4K at 30 fps, 3.6 GB/s of raw video
- Goal: 30 fps 3D-360 stereo video, 1.8 GB/s output
- Cloud processing prevents real-time video
The VR pipeline is usually offloaded to perform heavy computation
- Pipeline: sensor -> prep -> image align -> depth from flow -> image stitch -> stream to viewer
- Processing time breakdown: prep 5%, image align 20%, depth from flow 70%, image stitch 5%
- Depth from flow must be accelerated to achieve high performance
Offloading before the costly step doesn't avoid compute-communication tradeoffs
- The image alignment step produces significant intermediate data
- Offloading early still means shipping roughly 2x the final output size
[chart of per-frame video data size (MB) at each pipeline stage omitted]
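With the evaluation's assumed 2 GB/s link, the transfer-limited frame rate at each candidate offload point is just link bandwidth over per-frame size. The per-frame sizes below are approximate, back-computed from the transfer rates reported on the results slide (size = link / FPS), so they should be read as illustrative rather than measured.

```python
# Transfer-limited frame rate at each possible offload point, over the
# evaluation's assumed 2 GB/s link. Sizes are back-computed approximations.

LINK_MB_PER_S = 2000.0  # 2 GB/s expressed in MB/s

frame_mb = {        # data produced per frame at each stage (MB, approximate)
    "sensor": 127,
    "prep":   127,
    "align":  506,  # alignment inflates the intermediate data
    "depth":  380,
    "stitch":  63,  # final stereo output
}

for stage, mb in frame_mb.items():
    fps = LINK_MB_PER_S / mb
    print(f"offload after {stage:6s}: {fps:5.1f} FPS over the link")
```

The pattern matches the slide: offloading right after align or depth is hopeless at 30 fps, while the stitched output is small enough to stream.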
Evaluation: which pipeline achieves the highest frame rate?
- Designed a simple parallel accelerator for a Xilinx Zynq SoC, simulated for a Virtex UltraScale+ part (implementation details in the paper)
- Evaluated against CPU and GPU implementations in Halide
- Assumed a 2 GB/s network link for communication
Which pipeline achieves the highest frame rate?

pipeline configuration                          compute FPS   transfer FPS   effective FPS
sensor                                          100           15.8           15.8
sensor + prep                                   100           15.8           15.8
sensor + prep + align                           100            3.95           4.0
sensor + prep + align + depth (CPU)               0.09         5.27           0.09
sensor + prep + align + depth (GPU)              11.2          5.27           5.3
sensor + prep + align + depth (FPGA)            174            5.27           5.3
sensor + prep + align + depth (CPU) + stitch      0.09        31.6            0.09
sensor + prep + align + depth (GPU) + stitch     11.2         31.6           11.2
sensor + prep + align + depth (FPGA) + stitch   174           31.6           31.6

- CPU results are the slowest
- After align and depth, the intermediate data is too large to offload at full rate
- Only the full pipeline with the FPGA accelerator achieves a real-time frame rate
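The effective-FPS column is simply the minimum of the compute and transfer rates; a short sketch over the reported numbers reproduces the conclusion that only the full FPGA pipeline reaches real time.

```python
# Effective frame rate of each configuration: the slower of the compute
# rate and the transfer rate. Rates (FPS) are the reported values.

configs = {
    "sensor":                                (100.0, 15.8),
    "sensor+prep":                           (100.0, 15.8),
    "sensor+prep+align":                     (100.0, 3.95),
    "sensor+prep+align+depth (CPU)":         (0.09, 5.27),
    "sensor+prep+align+depth (GPU)":         (11.2, 5.27),
    "sensor+prep+align+depth (FPGA)":        (174.0, 5.27),
    "sensor+prep+align+depth (CPU)+stitch":  (0.09, 31.6),
    "sensor+prep+align+depth (GPU)+stitch":  (11.2, 31.6),
    "sensor+prep+align+depth (FPGA)+stitch": (174.0, 31.6),
}

effective = {name: min(c, t) for name, (c, t) in configs.items()}
for name, fps in effective.items():
    tag = "  <- real-time" if fps >= 30.0 else ""
    print(f"{name:40s} {fps:6.2f} FPS{tag}")
```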
In-camera processing for real-time VR video (prep -> align -> depth -> stitch)
- Considering computation and communication together highlights benefits not seen when they are considered separately
- For VR video, in-camera processing pipelines enable applications that cloud offload alone cannot achieve
In-camera processing pipelines help characterize camera systems
- In-camera pipelines evaluate computation-communication tradeoffs
- Use hardware-software co-design to balance constraints and optimize designs
- Achieve optimal performance by considering bottlenecks in the context of the full system
Thank you!