Convolutional Networks for Image Segmentation: U-Net¹, DeconvNet², and SegNet³

¹ Olaf Ronneberger, Philipp Fischer, Thomas Brox (Freiburg, Germany)
² Hyeonwoo Noh, Seunghoon Hong, Bohyung Han (POSTECH, Korea)
³ Vijay Badrinarayanan, Alex Kendall, Roberto Cipolla (Cambridge, U.K.)

12 January 2018
Presented by: Gregory P. Spell
Outline
1. Image Segmentation
2. U-Net
3. DeconvNet
4. SegNet
Image Segmentation
- Goal: perform pixel-wise classification on images
- Useful for scene understanding (as in autonomous driving)
- Modern methods adapt deep image-classification architectures, extending them to pixel-wise labelling
General Segmentation Architecture
- Encoder network: extracts image features using a deep convolutional network
  - Each layer: a bank of trainable convolutional filters, followed by ReLUs and max-pooling to downsample the feature maps
- Decoder network: upsamples the feature map back to image resolution, with the final output having as many channels as there are pixel classes
  - This is where the methods differ most dramatically
  - The decoder mirrors the encoder network
- Training: pixel-wise softmax over the final feature map with a cross-entropy loss, optimized by SGD
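The shared training objective above can be sketched in NumPy. This is a minimal illustration of a pixel-wise softmax followed by cross-entropy, not any paper's exact implementation; the function name and shapes are assumptions for the example.

```python
import numpy as np

def pixelwise_cross_entropy(logits, labels):
    """Pixel-wise softmax + cross-entropy (illustrative sketch).

    logits: (C, H, W) raw decoder outputs, one channel per class
    labels: (H, W) integer ground-truth class ids
    """
    # Softmax over the class axis, stabilized by subtracting the max logit
    z = logits - logits.max(axis=0, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=0, keepdims=True)
    # Pick each pixel's predicted probability of its true class
    h, w = labels.shape
    p_true = probs[labels, np.arange(h)[:, None], np.arange(w)[None, :]]
    # Average negative log-likelihood over all pixels
    return -np.log(p_true).mean()
```

In practice the loss is averaged over a mini-batch and minimized with SGD, as the slide describes.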
Encoder Schemes
- Both SegNet and DeconvNet use the convolutional layers of VGG16 (originally designed for image classification)
- DeconvNet keeps two fully-connected layers from VGG16
- SegNet discards the fully-connected layers to decrease the number of parameters
- U-Net uses a shallower network and no fully-connected layers
Decoder Networks: Upsampling
- Upsampling is needed to return feature maps to higher resolution for pixel classification
- Pooling destroys spatial information that is useful for precise localization
- To (partially) reconstruct it: store the max-pooling indices from the encoder and place each activation back at its original pre-pooling location
- Pad the remaining locations with zeros
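The index-preserving unpooling described above can be sketched as follows. This is an illustrative NumPy version (the helper names and 2×2 window are assumptions), not SegNet's or DeconvNet's actual code.

```python
import numpy as np

def max_pool_with_indices(x, k=2):
    """k-by-k max-pooling that also records where each maximum came from."""
    h, w = x.shape
    pooled = np.zeros((h // k, w // k))
    idx = np.zeros((h // k, w // k), dtype=int)  # flat index into the input
    for i in range(h // k):
        for j in range(w // k):
            window = x[i*k:(i+1)*k, j*k:(j+1)*k]
            r, c = np.unravel_index(window.argmax(), window.shape)
            pooled[i, j] = window[r, c]
            idx[i, j] = (i*k + r) * w + (j*k + c)
    return pooled, idx

def max_unpool(pooled, idx, out_shape):
    """Place each activation back at its recorded location; zeros elsewhere."""
    out = np.zeros(np.prod(out_shape))
    out[idx.ravel()] = pooled.ravel()
    return out.reshape(out_shape)
```

The unpooled map is mostly zeros, which is why the decoder follows it with trainable filters.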
Decoder Networks: Deconvolution
- Unpooling produces sparse feature maps
- Trainable (de)convolution filters then densify the maps
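A deconvolution (more precisely, a transposed convolution) can be sketched as each input activation scattering a scaled copy of the kernel into the output. This is a minimal single-channel NumPy illustration; the function name and stride handling are assumptions for the example.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=1):
    """Transposed convolution ("deconvolution"), single channel, no padding.

    Each input activation x[i, j] adds x[i, j] * kernel to the output,
    offset by the stride, so overlapping contributions densify the map.
    """
    h, w = x.shape
    kh, kw = kernel.shape
    out = np.zeros((stride * (h - 1) + kh, stride * (w - 1) + kw))
    for i in range(h):
        for j in range(w):
            out[i*stride:i*stride+kh, j*stride:j*stride+kw] += x[i, j] * kernel
    return out
```

With stride > 1 the output is larger than the input, which is how the decoder upsamples; the kernel weights are learned during training.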
Decoder Analysis
- Unpooling captures example-specific structure
- Deconvolution captures class-specific shapes
- The hierarchical structure reconstructs shape details from coarse to fine
U-Net Specifics
- Designed for biomedical image processing: cell segmentation
- Data augmentation via elastic deformations, which is natural since deformation is a common variation of tissue
- Instead of reusing pooling indices, concatenates feature maps from the encoder with the corresponding stage of the decoder (skip connections)
- Introduces a weight map in the loss to compensate for the class imbalance of pixels and to force the network to learn borders between touching cells
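The weight-map idea can be sketched as a per-pixel weighted cross-entropy. This is an illustrative NumPy version under assumed shapes; U-Net's actual weight map additionally combines class-balancing with a border term computed from distances to the two nearest cells.

```python
import numpy as np

def weighted_pixel_loss(probs, labels, weight_map):
    """Cross-entropy where each pixel's contribution is scaled by a weight.

    probs: (C, H, W) softmax outputs; labels: (H, W) class ids;
    weight_map: (H, W) per-pixel weights (large on borders, rare classes)
    """
    h, w = labels.shape
    # Predicted probability of the true class at each pixel
    p_true = probs[labels, np.arange(h)[:, None], np.arange(w)[None, :]]
    # Weighted negative log-likelihood, normalized by the total weight
    return -(weight_map * np.log(p_true)).sum() / weight_map.sum()
```

Pixels on the thin background strips between touching cells receive large weights, so the network is penalized heavily for merging adjacent cells.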
U-Net Architecture
DeconvNet Specifics
- Instance-wise segmentation: uses the edge-box¹ algorithm to generate object proposals, from which pixel classes are predicted
- Aggregates all proposal outputs for an image via pixel-wise max (or average)
- Two-stage training: first on easy examples (cropped bounding boxes centered on a single object), then on more difficult examples (proposals from edge-box)

¹ Edge Boxes: Locating Object Proposals from Edges, C.L. Zitnick and P. Dollár (ECCV, 2014)
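The proposal-aggregation step can be sketched as a pixel-wise reduction over per-proposal score maps. This is an assumed simplification: it takes score maps already placed in the full image frame, whereas DeconvNet's pipeline also handles resizing each proposal's output back to its box.

```python
import numpy as np

def aggregate_proposals(score_maps, reduce="max"):
    """Combine per-proposal class score maps into one image-level map.

    score_maps: list of (C, H, W) arrays, one per object proposal,
    all aligned to the full image frame.
    """
    stacked = np.stack(score_maps)      # (P, C, H, W)
    if reduce == "max":
        return stacked.max(axis=0)      # pixel-wise max across proposals
    return stacked.mean(axis=0)         # pixel-wise average across proposals
```

The aggregated map is then followed by the usual pixel-wise softmax to produce the final segmentation.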
DeconvNet Results
- Evaluated on the PASCAL VOC 2012 benchmark with the Intersection-over-Union (IoU) metric between ground-truth and predicted segmentations
- E denotes an ensemble with Fully Convolutional Networks (FCNs, an earlier framework); CRF denotes conditional random field post-processing²

² Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials, P. Krähenbühl and V. Koltun (NIPS, 2011)
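The IoU metric used above can be computed per class from two label maps. A minimal NumPy sketch (the function name is an assumption; benchmarks average this over classes and images):

```python
import numpy as np

def iou(pred, target, cls):
    """Intersection-over-Union for one class.

    pred, target: (H, W) integer label maps (predicted and ground truth)
    cls: the class id to evaluate
    """
    p, t = (pred == cls), (target == cls)
    union = np.logical_or(p, t).sum()
    if union == 0:
        return float("nan")  # class absent from both maps: undefined
    return np.logical_and(p, t).sum() / union
```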
SegNet Architecture
SegNet Results
- CamVid dataset: 3433 training road scenes
- SUN RGB-D dataset: 5250 training indoor scenes
Road Scene Examples
Indoor Scene Examples