Semantic Segmentation on Resource Constrained Devices


Semantic Segmentation on Resource Constrained Devices Sachin Mehta University of Washington, Seattle In collaboration with Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh Hajishirzi Project page: https://sacmehta.github.io/espnet/

Problem Statement Limited computational resources: only 256 CUDA cores, compared with standard GPU cards such as the TitanX, which has 3,500+ CUDA cores. The CPU and GPU share the RAM. Limited power: the TX2 can run in two modes, with TDP budgets of 7.5 W (Max-Q) and 15 W (Max-P). Max-Q's performance is identical to the TX1 (GPU clock at 828 MHz); Max-P boosts the clock rates to their maximum value (GPU clock at 1300 MHz). Figure: Hardware-level resource comparison between (a) a desktop and (b) an embedded device

Problem Statement Accurate segmentation networks are deep and learn many parameters. As a consequence, they are slow and power hungry, and they cannot be used on embedded devices because of hardware constraints: limited computational resources, limited energy overhead, and restrictive memory constraints.

Agenda What is semantic segmentation? CNN basics Overview of SOTA efficient networks ESPNet Results

What is Semantic Segmentation? Input: an RGB image. Output: a segmentation mask

Overview A standard CNN architecture stacks convolutional layers, pooling layers, activation and batch-normalization layers (see [r1]), and linear (fully connected) layers. Figure: Example of stacking layers in a CNN. Source: [r1] Xu, Bing, et al. "Empirical evaluation of rectified activations in convolutional network." arXiv preprint arXiv:1505.00853 (2015).

Overview: Convolution A convolution layer computes the output of neurons that are connected to local regions in the input. For a CNN processing RGB images, a convolutional kernel is usually 3-dimensional (M × n × n) and is applied over the M input channels to produce one output feature map; N such kernels produce N feature maps. Figure: An example of a 3x3 convolutional kernel processing an input of size 5x5. Source: http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html Figure: A convolutional kernel visualization (M × n × n kernel, N output maps)
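The shape arithmetic above can be checked with a minimal sketch (assuming PyTorch): N kernels of size M × n × n slide over an M-channel input and yield N output feature maps.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 5, 5)  # a batch of one RGB image (M = 3), 5x5 pixels
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)  # N = 8, n = 3
y = conv(x)
print(y.shape)  # torch.Size([1, 8, 3, 3]): 5 - 3 + 1 = 3 without padding
```

With no padding, each 3x3 kernel fits in 3 positions per axis, so the 5x5 input shrinks to 3x3 per output map.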

Pooling Pooling operations help the network learn scale-invariant representations. Common pooling operations are: max pooling, average pooling, and strided convolution.

Pooling: Max Pooling Figure: Max pooling example. Note: An average pooling layer is the same as a max pooling layer, except that the kernel performs an averaging function instead of taking the maximum. Source: http://cs231n.stanford.edu/

Pooling: Strided Convolution Figure: 3x3 convolution with a stride of 1 Figure: 3x3 convolution with a stride of 2 Source: http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html
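The three downsampling options above can be compared in a short sketch (assuming PyTorch; the 8x8, 16-channel input is illustrative): all three halve the spatial resolution, but only the strided convolution has learnable weights.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 8, 8)
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
# 3x3 convolution with stride 2; padding=1 keeps the halving exact
strided = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)

for op in (max_pool, avg_pool, strided):
    print(op(x).shape)  # each prints torch.Size([1, 16, 4, 4])
```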

Efficient Networks

MobileNet Uses depth-wise separable convolution: first, a depth-wise convolution computes one kernel per input channel; then, a point-wise (1x1) convolution is applied to increase the number of channels. Figure: A standard convolution kernel. Figure: Depth-wise separable convolution kernel. Figure: Block-wise representation. Source: Howard, Andrew G., et al. "MobileNets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017).
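A depth-wise separable convolution as described above can be sketched as follows (assuming PyTorch, where `groups=in_channels` gives one kernel per input channel; the channel counts are illustrative):

```python
import torch
import torch.nn as nn

in_ch, out_ch = 32, 64
# Depth-wise: one 3x3 kernel per input channel (groups = in_channels)
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
# Point-wise: 1x1 convolution mixes channels and increases their number
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

x = torch.randn(1, in_ch, 56, 56)
y = pointwise(depthwise(x))
print(y.shape)  # torch.Size([1, 64, 56, 56])

# Parameter comparison against a standard 3x3 convolution
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(depthwise) + count(pointwise), count(standard))  # 2432 vs 18496
```

The separable version needs roughly 7.6x fewer parameters here, which is the source of MobileNet's efficiency.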

ShuffleNet ShuffleNet uses a block structure similar to ResNet's, but with the following modifications: 1x1 point-wise convolutions are replaced with grouped convolutions, and 3x3 standard convolutions are replaced with depth-wise convolutions. Figure: ShuffleNet block. Figure: Standard convolution. Figure: Grouped convolution. Sources: Zhang, Xiangyu, et al. "ShuffleNet: An extremely efficient convolutional neural network for mobile devices." arXiv preprint arXiv:1707.01083 (2017); https://blog.yani.io/filter-group-tutorial/
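A grouped 1x1 convolution, plus the channel shuffle that lets information cross group boundaries in the next grouped layer, can be sketched like this (assuming PyTorch; `channel_shuffle` is a hypothetical helper written here for illustration, not a library function):

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    # Reshape to (n, groups, c//groups, h, w), swap the two channel axes,
    # then flatten back so channels from different groups interleave.
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

g = 4
gconv = nn.Conv2d(32, 32, kernel_size=1, groups=g)  # each group sees 32/4 = 8 channels
x = torch.randn(1, 32, 14, 14)
y = channel_shuffle(gconv(x), groups=g)
print(y.shape)  # torch.Size([1, 32, 14, 14])
```

Without the shuffle, each group's outputs would only ever depend on the same fixed subset of input channels.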

ESPNet

ESP Block ESP is the basic building block of ESPNet. The standard convolution is replaced by a point-wise convolution followed by a spatial pyramid of dilated convolutions. Figure: ESP kernel-level visualization. Figure: ESP block-level visualization

Dilated/Atrous Convolution Dilated convolutions are a special form of standard convolution in which the effective receptive field is increased by inserting zeros (or holes) between the weights of the convolutional kernel. Source: http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html Figure: Dilated convolution
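The receptive-field growth can be seen in a small sketch (assuming PyTorch): a 3x3 kernel with dilation d spans an effective area of (3 + 2(d-1))² while keeping the same nine weights.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)
for d in (1, 2, 4):
    # padding=d keeps the 32x32 spatial size for a dilated 3x3 kernel
    conv = nn.Conv2d(8, 8, kernel_size=3, dilation=d, padding=d)
    eff = 3 + 2 * (d - 1)
    print(d, conv(x).shape, f"effective kernel {eff}x{eff}")
```

Dilation rates 1, 2, and 4 give effective kernels of 3x3, 5x5, and 9x9, which is exactly what the spatial pyramid in the ESP block exploits.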

Gridding problem with Dilated Convolutions Figure: Gridding artifact in dilated convolution

Gridding problem with Dilated Convolutions Solution: add convolution layers with lower dilation rates at the end of the network (see the sources below for details). Con: the number of network parameters increases. Sources: Yu, Fisher, Vladlen Koltun, and Thomas Funkhouser. "Dilated residual networks." CVPR, 2017; Wang, Panqu, et al. "Understanding convolution for semantic segmentation." WACV, 2018.

Hierarchical feature fusion (HFF) for de-gridding Figure: ESP block with hierarchical feature fusion (HFF). Figure: Feature map visualization with and without HFF
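An ESP-style block with HFF can be roughly sketched as below, assuming PyTorch. The channel widths, number of branches, and dilation rates are illustrative, not the paper's exact configuration; the point is the structure: point-wise reduction, a pyramid of dilated convolutions, and hierarchical addition before concatenation.

```python
import torch
import torch.nn as nn

class ESPBlock(nn.Module):
    def __init__(self, channels=64, branches=4):
        super().__init__()
        d = channels // branches                         # reduced width per branch
        self.reduce = nn.Conv2d(channels, d, kernel_size=1)   # point-wise projection
        self.pyramid = nn.ModuleList(
            nn.Conv2d(d, d, kernel_size=3, dilation=2**k, padding=2**k)
            for k in range(branches)                     # dilation rates 1, 2, 4, 8
        )

    def forward(self, x):
        r = self.reduce(x)
        outs = [conv(r) for conv in self.pyramid]
        # HFF: add each branch to the previous (lower-dilation) one before
        # concatenating, which suppresses the gridding artifact.
        for k in range(1, len(outs)):
            outs[k] = outs[k] + outs[k - 1]
        return torch.cat(outs, dim=1) + x                # residual connection

x = torch.randn(1, 64, 32, 32)
print(ESPBlock()(x).shape)  # torch.Size([1, 64, 32, 32])
```

Because de-gridding is done by addition rather than extra convolution layers, HFF costs no additional parameters.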

Input reinforcement: An efficient way of improving performance Information is lost due to filtering (convolution) operations; reinforcing the input inside the network helps it learn better representations. Without input reinforcement: mIOU 0.40, 0.186 M parameters. With input reinforcement: mIOU 0.42, 0.187 M parameters. * Results on the Cityscapes urban visual scene understanding dataset; mIOU is mean intersection over union. Figure: ESPNet without and with input reinforcement

ESPNet with a light-weight decoder Adding 20,000 more parameters improved the accuracy by 6%. Figure: Comparison between ESPNet without and with the light-weight decoder on the Cityscapes validation dataset. Figure: ESPNet without and with the light-weight decoder

Comparison with efficient networks

Network size vs Accuracy Network size is the amount of space required to store the network parameters. Under similar constraints, ESPNet outperforms MobileNet and ShuffleNet by about 6%.

Inference Speed vs Accuracy Inference speed is measured in frames processed per second. Device: a laptop GPU with 640 CUDA cores. Under similar constraints, ESPNet outperforms MobileNet and ShuffleNet by about 6%.

Comparison with state-of-the-art networks

Accuracy vs Network size Network size is the amount of space required to store the network parameters ESPNet is small in size and well suited for edge devices.

Accuracy vs Network parameters ESPNet learns fewer parameters while delivering competitive accuracy.

Power Consumption vs Inference Speed ESPNet is fast and consumes less power while delivering good segmentation accuracy. Figure: Standard GPU (NVIDIA TitanX: 3,500+ CUDA cores). Figure: Mobile GPU (NVIDIA GTX 960M: 640 CUDA cores)

Inference Speed and Power Consumption on an Embedded Device (NVIDIA TX2) ESPNet processes an RGB image of size 1024x512 at a frame rate of 9 FPS. Figure: Inference speed at different GPU frequencies. Figure: Power consumption vs. samples

Visual Results on the Cityscapes validation set

Visual Results on an unseen set

Results on Breast Biopsy Whole Slide Image Dataset

Results on Breast Biopsy dataset The average size of a breast biopsy whole-slide image is 10,000 x 12,000 pixels. 58 images, annotated by expert pathologists into 8 different tissue categories, were split into equal training and validation sets. ESPNet delivered the same segmentation performance while learning 9.46x fewer parameters than state-of-the-art networks.

Visual results RGB Image Ground Truth Predicted Semantic Mask

Visual results RGB Image Ground Truth Predicted Semantic Mask RGB Image Ground Truth Predicted Semantic Mask

References [1] (PSPNet) Zhao, Hengshuang, et al. "Pyramid scene parsing network." IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 2017. [2] (FCN-8s) Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. [3] (SegNet) Badrinarayanan, Vijay, Alex Kendall, and Roberto Cipolla. "SegNet: A deep convolutional encoder-decoder architecture for image segmentation." IEEE Transactions on Pattern Analysis and Machine Intelligence 39.12 (2017): 2481-2495. [4] (DeepLab) Chen, Liang-Chieh, et al. "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs." IEEE Transactions on Pattern Analysis and Machine Intelligence 40.4 (2018): 834-848. [5] (SQNet) Treml, Michael, et al. "Speeding up semantic segmentation for autonomous driving." MLITS, NIPS Workshop. 2016. [6] (ERFNet) Romera, Eduardo, et al. "ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation." IEEE Transactions on Intelligent Transportation Systems 19.1 (2018): 263-272.

References [7] (ENet) Paszke, Adam, et al. "ENet: A deep neural network architecture for real-time semantic segmentation." arXiv preprint arXiv:1606.02147 (2016). [8] (MobileNet) Howard, Andrew G., et al. "MobileNets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017). [9] (ShuffleNet) Zhang, Xiangyu, et al. "ShuffleNet: An extremely efficient convolutional neural network for mobile devices." arXiv preprint arXiv:1707.01083 (2017). [10] (ResNeXt) Xie, Saining, et al. "Aggregated residual transformations for deep neural networks." Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017. [11] (ResNet) He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. [12] (Inception) Szegedy, Christian, et al. "Inception-v4, Inception-ResNet and the impact of residual connections on learning." AAAI. Vol. 4. 2017.

Thank You