Fully Convolutional Networks for Semantic Segmentation

1 Fully Convolutional Networks for Semantic Segmentation. Jonathan Long*, Evan Shelhamer*, Trevor Darrell (UC Berkeley). Presented by: Gordon Christie

2 Overview
- Reinterpret standard classification convnets as fully convolutional networks (FCNs) for semantic segmentation
- Use AlexNet, VGG, and GoogLeNet in experiments
- Novel architecture: combine information from different layers for segmentation
- State-of-the-art segmentation for PASCAL VOC 2011/2012, NYUDv2, and SIFT Flow at the time
- Inference takes less than one fifth of a second for a typical image

3 pixels in, pixels out: monocular depth estimation (Liu et al. 2015), semantic segmentation, boundary prediction (Xie & Tu 2015)

4 convnets perform classification: < 1 millisecond; "tabby cat"; 1000-dim vector; end-to-end learning

5 R-CNN does detection: many seconds; "dog", "cat"

6 R-CNN (figure: Girshick et al.)

7 < 1/5 second; ???; end-to-end learning

8 a classification network: "tabby cat"

9 becoming fully convolutional

10 becoming fully convolutional

11 upsampling output

12 end-to-end, pixels-to-pixels network: conv, pool, nonlinearity; upsampling; pixelwise output + loss
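The "pixelwise output + loss" on this slide can be made concrete with a small sketch. It assumes the loss is a softmax cross-entropy evaluated independently at every pixel and averaged over the image; the function name `pixelwise_cross_entropy` and the nested-list layout are hypothetical and chosen only for illustration, not taken from the authors' Caffe code.

```python
import math

def pixelwise_cross_entropy(scores, labels):
    """Average softmax cross-entropy over all pixels.

    scores[r][c] is a list of class scores for pixel (r, c);
    labels[r][c] is the index of the true class at that pixel.
    """
    total, count = 0.0, 0
    for score_row, label_row in zip(scores, labels):
        for s, y in zip(score_row, label_row):
            m = max(s)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in s))
            total += log_z - s[y]  # negative log-probability of the true class
            count += 1
    return total / count

# A 1x2 "image" with two classes and uninformative (all-zero) scores.
uniform_loss = pixelwise_cross_entropy([[[0.0, 0.0], [0.0, 0.0]]], [[0, 1]])

# A single pixel that is confidently and correctly classified.
confident_loss = pixelwise_cross_entropy([[[10.0, 0.0]]], [[0]])
```

With uninformative scores the loss equals log(number of classes); it approaches zero as the true class dominates at every pixel.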

13 Dense Predictions
- Shift-and-stitch: trick that yields dense predictions without interpolation
- Upsampling via deconvolution
- Shift-and-stitch was used in preliminary experiments but not included in the final model
- Upsampling was found to be more effective and efficient
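Upsampling via deconvolution (transposed convolution) can be sketched in one dimension. The example below assumes a fixed bilinear kernel, which is one common way to initialize such a layer; in a trained network the kernel weights could instead be learned. The function names `bilinear_kernel_1d` and `upsample_1d` are hypothetical.

```python
def bilinear_kernel_1d(factor):
    """1D bilinear interpolation weights for an integer upsampling factor."""
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    return [1 - abs(i - center) / factor for i in range(size)]

def upsample_1d(x, factor):
    """Transposed convolution with stride `factor`, cropped to len(x) * factor."""
    kernel = bilinear_kernel_1d(factor)
    # Each input value "stamps" a scaled copy of the kernel into the output,
    # with consecutive stamps offset by the stride.
    full = [0.0] * ((len(x) - 1) * factor + len(kernel))
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            full[i * factor + j] += value * weight
    # Crop the border so the output is exactly `factor` times longer.
    pad = (len(kernel) - factor) // 2
    return full[pad:pad + len(x) * factor]

kernel = bilinear_kernel_1d(2)
flat = upsample_1d([1.0, 1.0, 1.0], 2)
ramp = upsample_1d([0.0, 2.0, 4.0, 6.0], 2)
```

Away from the borders a constant signal stays constant and a linear ramp stays linear, which is the behaviour expected of interpolation; the border samples are attenuated because no padding is modelled here.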

14 Classifier to Dense
- Convolutionalize proven classification architectures: AlexNet, VGG, and GoogLeNet (reimplementation)
- Remove the classification layer and convert all fully connected layers to convolutions
- Append a 1x1 convolution with one output channel per class to predict scores at each of the coarse output locations (20 categories + background, i.e. 21 channels, for PASCAL)
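The conversion of a fully connected layer into a convolution can be illustrated with a toy sketch. It assumes a single-channel input and a single output unit, and the helper names `fc` and `conv_valid` are hypothetical. The point is only that the same weights, read as a convolution kernel, produce one score on the original input size and a map of scores on a larger input.

```python
def fc(x, w):
    """Fully connected unit: x and w are equally sized grids; returns one score."""
    return sum(x[i][j] * w[i][j] for i in range(len(w)) for j in range(len(w[0])))

def conv_valid(x, w):
    """'Valid' 2D cross-correlation of grid x with kernel w, stride 1."""
    kh, kw = len(w), len(w[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[r + i][c + j] * w[i][j] for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

# Weights of one "fully connected" unit defined on 2x2 inputs.
w = [[1.0, -1.0],
     [0.5, 2.0]]

# On a 2x2 input the convolution yields a 1x1 map equal to the FC output.
small = [[1.0, 2.0],
         [3.0, 4.0]]
fc_score = fc(small, w)
conv_small = conv_valid(small, w)

# On a larger 3x4 input the same weights yield a 2x3 map of scores,
# one per 2x2 window, without any change to the weights.
large = [[1.0, 2.0, 0.0, 1.0],
         [3.0, 4.0, 1.0, 0.0],
         [0.0, 1.0, 2.0, 2.0]]
score_map = conv_valid(large, w)
```

In a real network the kernel would span all input channels and there would be one kernel per output unit, but the equivalence is the same.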

15 Classifier to Dense
Cast ILSVRC classifiers into FCNs and compare performance on the validation set of PASCAL VOC 2011. (Numbers obtained at a fixed learning rate, not the best performance possible.)

               FCN-AlexNet   FCN-VGG16   FCN-GoogLeNet
forward time   50 ms         210 ms      59 ms
parameters     57M           134M        6M

Remaining table rows: mean IU, conv. layers, rf size, max stride [values lost in extraction]

16 spectrum of deep features: combine where (local, shallow) with what (global, deep); fuse features into a deep jet (cf. Hariharan et al. CVPR15 "hypercolumn")

17 skip layers: interp + sum at each skip; end-to-end, joint learning of semantics and location; dense output

18 skip layers (figure)
- pool5: 32x upsampled prediction (FCN-32s)
- 2x upsampled prediction + pool4 prediction: 16x upsampled prediction (FCN-16s)
- 2x upsampled prediction + pool3 prediction: 8x upsampled prediction (FCN-8s)
Caption (truncated): ... to combine coarse, high layer information with fine, low layer information. Layers are shown as grids that ...
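The "upsample, then sum with a finer prediction" pattern of the skip layers can be sketched on tiny single-class score grids. This is an illustration under stated assumptions: nearest-neighbour 2x upsampling stands in for the interpolation or learned upsampling used in practice, the score values are made up, and the names `upsample2x_nearest` and `fuse` are hypothetical.

```python
def upsample2x_nearest(grid):
    """Repeat every value twice along both axes (stand-in for 2x upsampling)."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def fuse(coarse, fine):
    """Upsample the coarse score map 2x and add the finer score map to it."""
    up = upsample2x_nearest(coarse)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("fine map must be exactly 2x the coarse map")
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]

# Made-up per-layer predictions for one class.
pool5_scores = [[1.0]]                    # coarsest prediction (stride 32)
pool4_scores = [[0.5, -0.5],
                [0.0, 1.0]]               # stride 16
pool3_scores = [[0.25] * 4 for _ in range(4)]  # stride 8

fused_16s = fuse(pool5_scores, pool4_scores)   # corresponds to the stride-16 prediction
fused_8s = fuse(fused_16s, pool3_scores)       # corresponds to the stride-8 prediction
```

Each fused map would then be upsampled back to the input resolution (16x or 8x respectively) to give the final dense output.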

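19 Comparison of skip FCNs. Results on a subset of the validation set of PASCAL VOC. FCN-32s is FCN-VGG16, renamed to highlight stride.
Table columns: pixel acc., mean acc., mean IU, f.w. IU. Rows: FCN-32s-fixed, FCN-32s, FCN-16s, FCN-8s. [table values lost in extraction]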
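The four column headings (pixel accuracy, mean accuracy, mean IU, and frequency-weighted IU) can all be computed from a class confusion matrix. The sketch below uses the standard definitions of these metrics; the function name `segmentation_metrics` and the example matrix are hypothetical, and it assumes every class occurs at least once in the ground truth so that no division by zero arises.

```python
def segmentation_metrics(conf):
    """conf[i][j] = number of pixels of true class i predicted as class j."""
    n_cl = len(conf)
    true_count = [sum(row) for row in conf]                      # pixels per true class
    pred_count = [sum(conf[i][j] for i in range(n_cl)) for j in range(n_cl)]
    correct = [conf[i][i] for i in range(n_cl)]
    total = sum(true_count)
    # Intersection over union per class: correct / (true + predicted - correct).
    iu = [correct[i] / (true_count[i] + pred_count[i] - correct[i])
          for i in range(n_cl)]
    return {
        "pixel_acc": sum(correct) / total,
        "mean_acc": sum(correct[i] / true_count[i] for i in range(n_cl)) / n_cl,
        "mean_iu": sum(iu) / n_cl,
        "fw_iu": sum(true_count[i] * iu[i] for i in range(n_cl)) / total,
    }

# Made-up two-class confusion matrix over 10 pixels.
metrics = segmentation_metrics([[3, 1],
                                [1, 5]])

# A perfect prediction scores 1.0 on every metric.
perfect = segmentation_metrics([[4, 0],
                                [0, 6]])
```

Mean IU weights every class equally, while the frequency-weighted variant weights each class by how many ground-truth pixels it has, so the two diverge when classes are imbalanced.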

20 skip layer refinement: input image; stride 32 (no skips); stride 16 (1 skip); stride 8 (2 skips); ground truth

21 training + testing
- train on a full image at a time without patch sampling
- reshape network to take input of any size
- forward time is ~150 ms for a 500 x 500 x 21 output

22 Results: PASCAL VOC 2011/12
VOC 2011: 8498 training images (from additional labeled data ...
Table caption (truncated): ... test sets, and reduces inference time.
Table columns: mean IU (VOC2011 test), mean IU (VOC2012 test), inference time. Rows: R-CNN [12], SDS [16], FCN-8s. [table values lost in extraction]

23 Results: NYUDv2
1449 RGB-D images with pixelwise labels → 40 categories
Table 4. Results on NYUDv2. RGBD is early-fusion of the RGB and depth channels at the input. HHA is the depth embedding of [14] as horizontal disparity, height above ground, and the angle of the local surface normal with the inferred gravity direction. RGB-HHA is the jointly trained late fusion model that sums RGB and HHA predictions.
Table columns: pixel acc., mean acc., mean IU, f.w. IU. Rows: Gupta et al. [14], FCN-32s RGB, FCN-32s RGBD, FCN-32s HHA, FCN-32s RGB-HHA, FCN-16s RGB-HHA. [table values lost in extraction]

24 Results: SIFT Flow
2688 images with pixel labels → 33 semantic categories, 3 geometric categories
Learn both label spaces jointly → learning and inference have similar performance and computation as independent models
Table 5. Results on SIFT Flow with class segmentation (center) and geometric segmentation (right). Tighe [33] is a non-parametric transfer method. Tighe 1 is an exemplar SVM while 2 is SVM + MRF. Farabet is a multi-scale convnet trained on class-balanced samples (1) or natural frequency samples (2). Pinheiro is a multi-scale, recurrent convnet, denoted RCNN3 (3). The metric for geometry is pixel accuracy.
Table columns: pixel acc., mean acc., mean IU, f.w. IU, geom. acc. Rows: Liu et al. [23], Tighe et al. [33], Tighe et al. [34] 1, Tighe et al. [34] 2, Farabet et al. [8] 1, Farabet et al. [8] 2, Pinheiro et al. [28], FCN-16s. [table values lost in extraction]

25 Figure columns: FCN, SDS*, Truth, Input
Relative to prior state-of-the-art SDS:
- 20% relative improvement for mean IoU
- faster
*Simultaneous Detection and Segmentation, Hariharan et al. ECCV14

26 leaderboard == segmentation with Caffe

27 conclusion
fully convolutional networks are fast, end-to-end models for pixelwise problems
- code in Caffe branch (merged soon)
- models for PASCAL VOC, NYUDv2, SIFT Flow, PASCAL-Context
caffe.berkeleyvision.org
fcn.berkeleyvision.org
github.com/bvlc/caffe

A Fuller Understanding of Fully Convolutional Networks. Evan Shelhamer* Jonathan Long* Trevor Darrell UC Berkeley in CVPR'15, PAMI'16
