Convolutional Networks for Image Segmentation: U-Net 1, DeconvNet 2, and SegNet 3



1 Olaf Ronneberger, Philipp Fischer, Thomas Brox (Freiburg, Germany)
2 Hyeonwoo Noh, Seunghoon Hong, Bohyung Han (POSTECH, Korea)
3 Vijay Badrinarayanan, Alex Kendall, Roberto Cipolla (Cambridge, U.K.)
12 January 2018
Presented by: Gregory P. Spell

Outline
1. Image Segmentation
2. U-Net
3. DeconvNet
4. SegNet

Image Segmentation
- The goal is to perform pixel-wise classification on images.
- Useful for scene understanding (as in autonomous driving).
- Modern methods adopt deep architectures for image classification and extend them to pixel-wise labelling.

General Segmentation Architecture
- Encoder network: extracts image features using a deep convolutional network.
  - Each layer: a bank of trainable convolutional filters, followed by ReLUs and max-pooling to downsample the image features.
- Decoder network: upsamples the feature map back to image resolution, with the final output having as many channels as there are pixel classes.
  - This is where the methods differ most dramatically.
  - The network mirrors the encoder network.
- Training: pixel-wise softmax over the final feature map, with a cross-entropy loss minimized by SGD.
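The pixel-wise softmax and cross-entropy loss named on this slide can be sketched in NumPy as follows. This is a minimal illustration, not code from any of the three papers; the function names, the (classes, height, width) array layout, and the random toy inputs are assumptions made for the example.

```python
import numpy as np

def pixelwise_softmax(logits):
    """Softmax over the class axis of a (num_classes, H, W) score map."""
    # Subtract the per-pixel maximum for numerical stability.
    shifted = logits - logits.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

def pixelwise_cross_entropy(logits, labels):
    """Mean cross-entropy between (C, H, W) logits and (H, W) integer labels."""
    probs = pixelwise_softmax(logits)
    rows, cols = np.indices(labels.shape)
    # Probability assigned to the true class at every pixel.
    true_class_probs = probs[labels, rows, cols]
    return float(-np.log(true_class_probs + 1e-12).mean())

# Hypothetical toy input: 3 classes on a 4x5 image.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, 5))
labels = rng.integers(0, 3, size=(4, 5))

probs = pixelwise_softmax(logits)
predicted = probs.argmax(axis=0)   # per-pixel class prediction
loss = pixelwise_cross_entropy(logits, labels)
```

In a real network the logits would be the decoder's final feature map, and the loss would be backpropagated by SGD rather than just evaluated.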

Encoder Schemes
- Both SegNet and DeconvNet use the convolutional layers of VGG16, a network designed for image classification.
- DeconvNet keeps two fully-connected layers from VGG16.
- SegNet discards the fully-connected layers to decrease the number of parameters.
- U-Net uses a shallower network and no fully-connected layers.
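To give a sense of how much the fully-connected layers contribute, the following sketch counts parameters for the standard VGG16 layout. It assumes the usual configuration (thirteen 3x3 convolutions on RGB input, and two 4096-unit fully-connected layers fed by a 7x7x512 feature map); the helper names are made up for this example.

```python
# Output channels of the thirteen 3x3 convolutions in standard VGG16.
VGG16_CONV_CHANNELS = [64, 64, 128, 128, 256, 256, 256,
                       512, 512, 512, 512, 512, 512]

def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of one convolutional layer."""
    return in_channels * out_channels * kernel * kernel + out_channels

def fc_params(in_features, out_features):
    """Weights plus biases of one fully-connected layer."""
    return in_features * out_features + out_features

conv_total = 0
in_channels = 3  # RGB input
for out_channels in VGG16_CONV_CHANNELS:
    conv_total += conv_params(in_channels, out_channels)
    in_channels = out_channels

# The two 4096-unit fully-connected layers (assumes a 7x7x512 input to the first).
fc_total = fc_params(7 * 7 * 512, 4096) + fc_params(4096, 4096)

print(f"convolutional layers:       {conv_total:,}")
print(f"two fully-connected layers: {fc_total:,}")
```

Under these assumptions the two fully-connected layers hold roughly eight times as many parameters as all the convolutional layers combined, which is why dropping them shrinks the encoder so much.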


Decoder Networks: Upsampling
- Upsampling is needed to return the feature map to a higher resolution for pixel classification.
- Pooling destroys spatial information, which is useful for precise localization.
- To (partially) reconstruct it: store the max-pooling indices from the encoder and place each activation back at its original pooled location.
- Fill all other locations with zeros.
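The store-indices-then-unpool scheme can be sketched for a single-channel map as follows. This is an illustrative NumPy version with non-overlapping 2x2 windows; the function names and the toy input are assumptions, and deep learning frameworks provide their own batched equivalents.

```python
import numpy as np

def max_pool_with_indices(x, size=2):
    """Non-overlapping max-pooling on an (H, W) map.

    Returns the pooled map and, for each window, the flat index
    (into x) of the element that was selected.
    """
    h, w = x.shape
    ph, pw = h // size, w // size
    # Rearrange so the last axis enumerates the elements of one window.
    windows = (x[:ph * size, :pw * size]
               .reshape(ph, size, pw, size)
               .transpose(0, 2, 1, 3)
               .reshape(ph, pw, size * size))
    local = windows.argmax(axis=2)
    pooled = windows.max(axis=2)
    # Convert window-local positions into flat indices of the input map.
    rows = np.arange(ph)[:, None] * size + local // size
    cols = np.arange(pw)[None, :] * size + local % size
    return pooled, rows * w + cols

def max_unpool(pooled, indices, output_shape):
    """Place each pooled activation back at its stored location; zeros elsewhere."""
    out = np.zeros(output_shape, dtype=pooled.dtype)
    out.flat[indices.ravel()] = pooled.ravel()
    return out

x = np.array([[1., 2., 0., 1.],
              [3., 0., 5., 2.],
              [0., 1., 2., 2.],
              [4., 0., 1., 7.]])
pooled, indices = max_pool_with_indices(x)
unpooled = max_unpool(pooled, indices, x.shape)
```

Here `unpooled` has the same shape as `x` but keeps only the four window maxima, which is the sparse map that the decoder then has to densify.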

Decoder Networks: Deconvolution
- Upsampling by unpooling produces sparse feature maps.
- Trainable (de)convolution filters are used to densify these maps.
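The densifying effect can be seen by running a filter over a sparse unpooled map. The sketch below uses a plain zero-padded convolution with a fixed 3x3 averaging kernel standing in for trained weights; both the kernel and the toy map are assumptions for illustration, and a real decoder would learn many such filters.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Zero-padded 2D cross-correlation that preserves the input size.

    Assumes an odd-sized kernel.
    """
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (padded[i:i + kh, j:j + kw] * kernel).sum()
    return out

# Sparse map of the kind produced by unpooling: one activation per 2x2 window.
sparse = np.zeros((4, 4))
sparse[1, 0], sparse[1, 2], sparse[3, 0], sparse[3, 3] = 3.0, 5.0, 4.0, 7.0

kernel = np.ones((3, 3)) / 9.0   # stand-in for a trained filter
dense = conv2d_same(sparse, kernel)

sparse_fraction = np.count_nonzero(sparse) / sparse.size
dense_fraction = np.count_nonzero(dense) / dense.size
```

Comparing `sparse_fraction` with `dense_fraction` shows how a single filter pass spreads the few unpooled activations over their neighbourhoods.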

Decoder Analysis
- Unpooling captures example-specific structures.
- Deconvolution captures class-specific shapes.
- The hierarchical structure reconstructs shape details from coarse to fine.


U-Net Specifics
- Designed for biomedical image processing: cell segmentation.
- Data augmentation by applying elastic deformations, which is natural since deformation is a common variation of tissue.
- Concatenates features from the encoder network with the corresponding arm of the decoder network instead of reusing pooling indices.
- Introduces a weight map to compensate for the class imbalance of pixels and to force the network to learn the borders between touching cells.
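The encoder-to-decoder concatenation can be sketched as a channel-wise join of two feature maps. The version below also center-crops the encoder map, on the assumption that it is spatially larger than the decoder map (as happens when convolutions are unpadded); the function names and tensor shapes are hypothetical.

```python
import numpy as np

def center_crop(features, target_hw):
    """Crop a (C, H, W) feature map to the given (height, width) around its center."""
    _, h, w = features.shape
    th, tw = target_hw
    top, left = (h - th) // 2, (w - tw) // 2
    return features[:, top:top + th, left:left + tw]

def concat_skip(encoder_features, decoder_features):
    """Concatenate encoder features onto decoder features along the channel axis."""
    cropped = center_crop(encoder_features, decoder_features.shape[1:])
    return np.concatenate([cropped, decoder_features], axis=0)

# Hypothetical shapes: 64 channels on each side of the skip connection.
encoder = np.ones((64, 12, 12))
decoder = np.zeros((64, 8, 8))
merged = concat_skip(encoder, decoder)   # 128 channels at the decoder's resolution
```

Unlike reusing pooling indices, this passes the full encoder feature map to the decoder, at the cost of doubling the channel count that the next convolution has to process.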

U-Net Architecture [figure]

DeconvNet Specifics
- Instance-wise segmentation: use the Edge Boxes algorithm (1) to generate object proposals from which to predict pixel classes.
- Aggregate all proposal outputs for an image via a pixel-wise max (or average).
- Two-stage training: train first on easy examples (cropped bounding boxes centered on a single object), then on more difficult examples (proposals from Edge Boxes).
(1) Edge Boxes: Locating Object Proposals from Edges, C. L. Zitnick and P. Dollár (ECCV, 2014)
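The aggregation step can be sketched as pasting each proposal's class-score map into an image-sized canvas and combining overlaps with a pixel-wise max or average. The data layout here (a top-left corner plus a (classes, height, width) score map per proposal, with non-negative scores) is an assumption for illustration, not the paper's implementation.

```python
import numpy as np

def aggregate_proposals(image_hw, num_classes, proposals, mode="max"):
    """Combine per-proposal score maps into one (C, H, W) map.

    proposals: list of ((top, left), scores) with scores shaped (C, h, w)
    and assumed non-negative, so uncovered pixels stay at zero.
    """
    h, w = image_hw
    total = np.zeros((num_classes, h, w))
    count = np.zeros((h, w))
    for (top, left), scores in proposals:
        _, ph, pw = scores.shape
        region = (slice(None), slice(top, top + ph), slice(left, left + pw))
        if mode == "max":
            total[region] = np.maximum(total[region], scores)
        else:
            total[region] += scores
            count[region[1:]] += 1
    if mode == "max":
        return total
    # Average only over the proposals that cover each pixel.
    return total / np.maximum(count, 1)

# Two hypothetical overlapping 2x2 proposals on a 4x4 image with 2 classes.
p1 = ((0, 0), np.stack([np.full((2, 2), 0.6), np.full((2, 2), 0.4)]))
p2 = ((1, 1), np.stack([np.full((2, 2), 0.2), np.full((2, 2), 0.8)]))

agg_max = aggregate_proposals((4, 4), 2, [p1, p2], mode="max")
agg_avg = aggregate_proposals((4, 4), 2, [p1, p2], mode="average")
```

At the overlapping pixel (1, 1), the max keeps the most confident proposal per class, while the average blends the two.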


DeconvNet Results
- Evaluated on the PASCAL VOC 2012 benchmark with the Intersection-over-Union (IoU) metric between ground-truth and predicted segmentations.
- E denotes an ensemble with Fully Convolutional Networks (FCNs, an earlier framework), and CRF denotes the use of conditional random field post-processing (2).
(2) Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials, P. Krähenbühl and V. Koltun (NIPS, 2011)
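The IoU metric can be sketched for a single pair of label maps as follows. This is a simplified per-image version with hypothetical names; benchmark evaluations typically accumulate intersections and unions over the whole test set before averaging across classes.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean over classes of |pred AND target| / |pred OR target|.

    Classes absent from both maps are skipped.
    """
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious)) if ious else float("nan")

target = np.array([[0, 0],
                   [1, 1]])
pred = np.array([[0, 1],
                 [1, 1]])
score = mean_iou(pred, target, num_classes=2)
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so a single mislabelled pixel lowers both classes' scores.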

SegNet Architecture [figure]

SegNet Results
- CamVid Dataset: 3433 training road scenes
- SUN-RGB-D Dataset: 5250 training indoor scenes

Road Scene Examples [figure]

Indoor Scene Examples [figure]
