Lecture 23 Deep Learning: Segmentation

1 Lecture 23 Deep Learning: Segmentation COS 429: Computer Vision Thanks: most of these slides shamelessly adapted from Stanford CS231n: Convolutional Neural Networks for Visual Recognition Fei-Fei Li, Andrej Karpathy, Justin Johnson COS429 : : Andras Ferencz

3 Szegedy et al, Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, arXiv

4 Computer Vision Tasks
- Classification: CAT (single object)
- Classification + Localization: CAT (single object)
- Object Detection: CAT, DOG, DUCK (multiple objects)
- Instance Segmentation: CAT, DOG, DUCK (multiple objects)

5 Simple Recipe for Classification + Localization
Step 2: Attach a new fully-connected regression head to the network
- Image -> Convolution and Pooling -> Final conv feature map
- Classification head: Fully-connected layers -> Class scores
- Regression head: Fully-connected layers -> Box coordinates

6 Sliding Window: Overfeat
Network input: 3 x 221 x 221
Larger image: 3 x 257 x
Classification scores P(cat): 0.5, 0.75

8 Sliding Window: Overfeat
Network input: 3 x 221 x 221
Larger image: 3 x 257 x
Classification scores: P(cat)

9 Sliding Window: Overfeat
Greedily merge boxes and scores (details in paper)
Network input: 3 x 221 x 221
Larger image: 3 x 257 x
Classification score P(cat): 0.8

10 Sliding Window: Overfeat
In practice use many sliding window locations and multiple scales
- Window positions + score maps
- Box regression outputs
- Final predictions
Sermanet et al, Integrated Recognition, Localization and Detection using Convolutional Networks, ICLR

11 Efficient Sliding Window: Overfeat
Efficient sliding window by converting fully-connected layers into convolutions
- Image: 3 x 221 x 221 -> Convolution + pooling -> Feature map: 1024 x 5 x 5
- Classification branch: 5 x 5 conv -> 4096 x 1 x 1 -> 1 x 1 conv -> 1024 x 1 x 1 -> 1 x 1 conv -> Class scores: 1000 x 1 x 1
- Regression branch: 5 x 5 conv -> 4096 x 1 x 1 -> 1 x 1 conv -> 1024 x 1 x 1 -> 1 x 1 conv -> Box coordinates: (4 x 1000) x 1 x 1

12 Efficient Sliding Window: Overfeat
Training time: Small image, 1 x 1 classifier output
Test time: Larger image, 2 x 2 classifier output, only extra compute at yellow regions
Sermanet et al, Integrated Recognition, Localization and Detection using Convolutional Networks, ICLR
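
The fully-connected-to-convolution trick can be checked numerically. The sketch below is an illustration, not the Overfeat implementation: it uses toy channel counts (8 and 16 rather than 1024 and 4096), and the 6 x 6 feature-map size for the larger image is an assumption chosen so that the output comes out 2 x 2. The helper name `conv2d_valid` is hypothetical.

```python
import numpy as np

def conv2d_valid(x, w):
    """Valid cross-correlation. x: (C, H, W), w: (K, C, kh, kw) -> (K, H-kh+1, W-kw+1)."""
    K, C, kh, kw = w.shape
    _, H, W = x.shape
    out = np.zeros((K, H - kh + 1, W - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + kh, j:j + kw]
            # Contract the filter bank against the patch over (C, kh, kw).
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
C, K = 8, 16  # toy sizes (assumption); the slide uses 1024 and 4096

# A fully-connected layer acting on a flattened C x 5 x 5 feature map.
fc_weights = rng.standard_normal((K, C * 5 * 5))
# The same weights viewed as K filters of size C x 5 x 5.
conv_weights = fc_weights.reshape(K, C, 5, 5)

small = rng.standard_normal((C, 5, 5))  # training-size feature map
large = rng.standard_normal((C, 6, 6))  # assumed feature map of a larger test image

fc_out = fc_weights @ small.reshape(-1)
conv_out_small = conv2d_valid(small, conv_weights)  # shape (K, 1, 1)
conv_out_large = conv2d_valid(large, conv_weights)  # shape (K, 2, 2)
```

On the small map the convolution reproduces the fully-connected output exactly; on the larger map each of the 2 x 2 output positions equals the fully-connected layer applied to one 5 x 5 window, computed in a single pass.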

13 Computer Vision Tasks
- Classification
- Classification + Localization
- Object Detection
- Instance Segmentation

14 Region Proposals
- Find blobby image regions that are likely to contain objects
- Class-agnostic object detector
- Look for blob-like regions

15 Region Proposals: Selective Search
- Bottom-up segmentation, merging regions at multiple scales
- Convert regions to boxes
Uijlings et al, Selective Search for Object Recognition, IJCV

16 R-CNN
Girshick et al, Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014
Slide credit: Ross Girshick

17 Fast R-CNN
R-CNN Problems: Slow at test-time due to independent forward passes of the CNN
Solution: Share computation of convolutional layers between proposals for an image
R-CNN Problems:
- Post-hoc training: CNN not updated in response to final classifiers and regressors
- Complex training pipeline
Solution: Just train the whole system end-to-end all at once!

18 Fast R-CNN: Region of Interest Pooling
- Hi-res input image: 3 x 800 x 600 with region proposal
- Convolution and Pooling -> Hi-res conv features: C x H x W with region proposal
- RoI conv features: C x h x w for region proposal
- Fully-connected layers expect low-res conv features: C x h x w
- Can back propagate similar to max pooling
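
A minimal sketch of RoI max pooling, assuming the proposal has already been projected onto the feature map and is given in integer cell coordinates. The function name `roi_max_pool`, the end-exclusive box convention, and the bin-edge rounding are illustrative choices, not details taken from the Fast R-CNN paper.

```python
import numpy as np

def roi_max_pool(features, roi, out_h, out_w):
    """Max-pool one region of a (C, H, W) feature map to a fixed (C, out_h, out_w) grid.

    roi = (y0, x0, y1, x1) in feature-map cells, end-exclusive (assumed convention).
    """
    y0, x0, y1, x1 = roi
    region = features[:, y0:y1, x0:x1]
    C, rh, rw = region.shape
    # Bin edges that split the region into an out_h x out_w grid of roughly equal cells.
    ys = np.linspace(0, rh, out_h + 1).astype(int)
    xs = np.linspace(0, rw, out_w + 1).astype(int)
    out = np.empty((C, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Force every bin to cover at least one cell.
            ya, yb = ys[i], max(ys[i + 1], ys[i] + 1)
            xa, xb = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = region[:, ya:yb, xa:xb].max(axis=(1, 2))
    return out

features = np.arange(2 * 8 * 8, dtype=float).reshape(2, 8, 8)
pooled = roi_max_pool(features, roi=(1, 2, 7, 8), out_h=2, out_w=2)
print(pooled[0])  # [[28. 31.] [52. 55.]]
```

Because each output cell is the maximum over a fixed set of input cells, gradients flow back only to the arg-max locations, which is the sense in which it back-propagates like ordinary max pooling.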

19 Faster R-CNN: Training
In the paper: Ugly pipeline
- Use alternating optimization to train RPN, then Fast R-CNN with RPN proposals, etc.
- More complex than it has to be
Since publication: Joint training! One network, four losses
- RPN classification (anchor good / bad)
- RPN regression (anchor -> proposal)
- Fast R-CNN classification (over classes)
- Fast R-CNN regression (proposal -> box)
Slide credit: Ross Girshick

20 Faster R-CNN: Results
                                        R-CNN        Fast R-CNN   Faster R-CNN
Test time per image (with proposals)    50 seconds   2 seconds    0.2 seconds
(Speedup)                               1x           25x          250x
mAP (VOC 2007)

21 Object Detection State-of-the-art: ResNet + Faster R-CNN + some extras
He et al, Deep Residual Learning for Image Recognition, arXiv

22 ImageNet Detection

23 YOLO: You Only Look Once
Detection as Regression
Divide image into S x S grid
Within each grid cell predict:
- B Boxes: 4 coordinates + confidence
- Class scores: C numbers
Regression from image to 7 x 7 x (5 * B + C) tensor
Direct prediction using a CNN
Redmon et al, You Only Look Once: Unified, Real-Time Object Detection, arXiv
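
To make the output tensor concrete, the sketch below splits a flat prediction into per-cell boxes and class scores. The values B = 2 and C = 20 are example settings, and the memory layout (all box values first, then class scores) and the name `split_yolo_output` are assumptions for illustration only.

```python
import numpy as np

S, B, C = 7, 2, 20  # example values (assumption): 7 x 7 grid, 2 boxes per cell, 20 classes

def split_yolo_output(pred, S, B, C):
    """Split a flat prediction of length S*S*(5*B + C) into boxes and class scores."""
    grid = pred.reshape(S, S, 5 * B + C)
    # Each box carries 4 coordinates plus 1 confidence value.
    boxes = grid[..., :5 * B].reshape(S, S, B, 5)
    # C class scores per grid cell, shared by that cell's boxes.
    class_scores = grid[..., 5 * B:]
    return boxes, class_scores

pred = np.zeros(S * S * (5 * B + C))  # stand-in for the CNN's regression output
boxes, class_scores = split_yolo_output(pred, S, B, C)
print(boxes.shape, class_scores.shape)  # (7, 7, 2, 5) (7, 7, 20)
```

With these example values the network regresses 7 * 7 * 30 = 1470 numbers per image.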

24 YOLO: You Only Look Once
Detection as Regression
Faster than Faster R-CNN, but not as good
Redmon et al, You Only Look Once: Unified, Real-Time Object Detection, arXiv

25 Computer Vision Tasks
- Classification: CAT (single object)
- Classification + Localization: CAT (single object)
- Object Detection: CAT, DOG, DUCK (multiple objects)
- Segmentation: CAT, DOG, DUCK (multiple objects)

26 Today
- Classification
- Classification + Localization
- Object Detection
- Segmentation (today)

27 Semantic Segmentation
- Label every pixel!
- Don't differentiate instances (cows)
- Classic computer vision problem
Figure credit: Shotton et al, TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context, IJCV

28 Instance Segmentation
- Detect instances, give category, label pixels
- "Simultaneous detection and segmentation" (SDS)
- Lots of recent work (MS-COCO)
Figure credit: Dai et al, Instance-aware Semantic Segmentation via Multi-task Network Cascades, arXiv

29 Semantic Segmentation: Extract patch

30 Semantic Segmentation: Extract patch, run through a CNN

31 Semantic Segmentation: Extract patch, run through a CNN, classify center pixel (COW)

32 Semantic Segmentation: Extract patch, run through a CNN, classify center pixel (COW), repeat for every pixel

33 Semantic Segmentation
- Run fully convolutional network to get all pixels at once
- Smaller output due to pooling

34 Semantic Segmentation: Multi-Scale
Farabet et al, Learning Hierarchical Features for Scene Labeling, TPAMI

35 Semantic Segmentation: Multi-Scale
- Resize image to multiple scales
Farabet et al, Learning Hierarchical Features for Scene Labeling, TPAMI

36 Semantic Segmentation: Multi-Scale
- Resize image to multiple scales
- Run one CNN per scale
Farabet et al, Learning Hierarchical Features for Scene Labeling, TPAMI

37 Semantic Segmentation: Multi-Scale
- Resize image to multiple scales
- Run one CNN per scale
- Upscale outputs and concatenate
Farabet et al, Learning Hierarchical Features for Scene Labeling, TPAMI

38 Semantic Segmentation: Multi-Scale
- Resize image to multiple scales
- Run one CNN per scale
- Upscale outputs and concatenate
- External bottom-up segmentation
Farabet et al, Learning Hierarchical Features for Scene Labeling, TPAMI

39 Semantic Segmentation: Multi-Scale
- Resize image to multiple scales
- Run one CNN per scale
- Upscale outputs and concatenate
- External bottom-up segmentation
- Combine everything for final outputs
Farabet et al, Learning Hierarchical Features for Scene Labeling, TPAMI

40 Semantic Segmentation: Refinement
- Apply CNN once to get labels
Pinheiro and Collobert, Recurrent Convolutional Neural Networks for Scene Labeling, ICML

41 Semantic Segmentation: Refinement
- Apply CNN once to get labels
- Apply AGAIN to refine labels
Pinheiro and Collobert, Recurrent Convolutional Neural Networks for Scene Labeling, ICML

42 Semantic Segmentation: Refinement
- Apply CNN once to get labels
- Apply AGAIN to refine labels
- And again!
- Same CNN weights: recurrent convolutional network
- More iterations improve results
Pinheiro and Collobert, Recurrent Convolutional Neural Networks for Scene Labeling, ICML

43 Semantic Segmentation: Upsampling
- Learnable upsampling!
Long, Shelhamer, and Darrell, Fully Convolutional Networks for Semantic Segmentation, CVPR

44 Semantic Segmentation: Upsampling
Long, Shelhamer, and Darrell, Fully Convolutional Networks for Semantic Segmentation, CVPR

45 Semantic Segmentation: Upsampling
- "Skip connections"
Long, Shelhamer, and Darrell, Fully Convolutional Networks for Semantic Segmentation, CVPR

46 Semantic Segmentation: Upsampling
- "Skip connections"
- Skip connections = Better results
Long, Shelhamer, and Darrell, Fully Convolutional Networks for Semantic Segmentation, CVPR

47 Learnable Upsampling: Deconvolution
Typical 3 x 3 convolution, stride 1 pad 1
Input: 4 x 4, Output: 4 x 4

49 Learnable Upsampling: Deconvolution
Typical 3 x 3 convolution, stride 1 pad 1
Input: 4 x 4, Output: 4 x 4
- Dot product between filter and input

50 Learnable Upsampling: Deconvolution
Typical 3 x 3 convolution, stride 2 pad 1
Input: 4 x 4, Output: 2 x 2

52 Learnable Upsampling: Deconvolution
Typical 3 x 3 convolution, stride 2 pad 1
Input: 4 x 4, Output: 2 x 2
- Dot product between filter and input
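
The two output sizes follow from the usual formula floor((n + 2 * pad - k) / stride) + 1. The sketch below is a plain single-channel implementation for square inputs and filters, written for illustration; the function name `conv2d` and the all-ones filter are assumptions.

```python
import numpy as np

def conv2d(x, w, stride=1, pad=0):
    """2-D cross-correlation of a square single-channel input x with a square filter w."""
    k = w.shape[0]
    xp = np.pad(x, pad)  # zero padding on every side
    out_size = (x.shape[0] + 2 * pad - k) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = xp[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * w)  # dot product between filter and input patch
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((3, 3))
same = conv2d(x, w, stride=1, pad=1)
down = conv2d(x, w, stride=2, pad=1)
print(same.shape, down.shape)  # (4, 4) (2, 2)
```

With stride 2 the filter moves two input pixels for every output pixel, which is why the 4 x 4 input shrinks to 2 x 2.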

53 Learnable Upsampling: Deconvolution
3 x 3 "deconvolution", stride 2 pad 1
Input: 2 x 2, Output: 4 x 4

54 Learnable Upsampling: Deconvolution
3 x 3 "deconvolution", stride 2 pad 1
Input: 2 x 2, Output: 4 x 4
- Input gives weight for filter

55 Learnable Upsampling: Deconvolution
3 x 3 "deconvolution", stride 2 pad 1
Input: 2 x 2, Output: 4 x 4
- Input gives weight for filter
- Sum where output overlaps
- Same as backward pass for normal convolution!
- "Deconvolution" is a bad name, already defined as "inverse of convolution"
- Better names: convolution transpose, backward strided convolution, 1/2 strided convolution, upconvolution
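
A minimal single-channel sketch of this operation: each input value scales a copy of the filter, the copies are placed `stride` pixels apart, and overlapping contributions are summed. Getting exactly 4 x 4 from a 2 x 2 input with a 3 x 3 filter, stride 2 and pad 1 needs one extra row and column on the bottom/right; the `output_padding` parameter below is an assumption introduced to express that, and the name `conv_transpose2d` is illustrative.

```python
import numpy as np

def conv_transpose2d(x, w, stride=2, pad=1, output_padding=1):
    """Transposed convolution of a square single-channel input x with a square filter w."""
    assert output_padding <= pad, "this sketch only crops, it never zero-extends"
    n, k = x.shape[0], w.shape[0]
    full = (n - 1) * stride + k
    canvas = np.zeros((full, full))
    for i in range(n):
        for j in range(n):
            # The input value gives the weight for a copy of the filter;
            # copies that overlap in the output are summed.
            canvas[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * w
    out_size = (n - 1) * stride - 2 * pad + k + output_padding
    # Crop `pad` from the top/left and `pad - output_padding` from the bottom/right.
    return canvas[pad:pad + out_size, pad:pad + out_size]

x = np.ones((2, 2))
w = np.ones((3, 3))
up = conv_transpose2d(x, w)
print(up.shape)  # (4, 4)
```

With these settings the result is the adjoint of the 3 x 3, stride 2, pad 1 convolution on a 4 x 4 input, which matches the "same as backward pass for normal convolution" remark.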

56 Semantic Segmentation: Upsampling
- Normal VGG followed by "upside down" VGG
- 6 days of training on Titan X
Noh et al, Learning Deconvolution Network for Semantic Segmentation, ICCV

57 Instance Segmentation
- Detect instances, give category, label pixels
- "Simultaneous detection and segmentation" (SDS)
- Lots of recent work (MS-COCO)
Figure credit: Dai et al, Instance-aware Semantic Segmentation via Multi-task Network Cascades, arXiv

58 Instance Segmentation
- Similar to R-CNN, but with segments
Hariharan et al, Simultaneous Detection and Segmentation, ECCV

59 Instance Segmentation
- Similar to R-CNN, but with segments
- External segment proposals
Hariharan et al, Simultaneous Detection and Segmentation, ECCV

60 Instance Segmentation
- Similar to R-CNN
- External segment proposals
Hariharan et al, Simultaneous Detection and Segmentation, ECCV

63 Instance Segmentation
- Similar to R-CNN, but with segments
- External segment proposals
- Mask out background with mean image
Hariharan et al, Simultaneous Detection and Segmentation, ECCV
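
The masking step can be sketched in a few lines, assuming the segment proposal is available as a boolean mask over the cropped box. The function name `mask_background` and the toy values are hypothetical; this illustrates the idea rather than the SDS code.

```python
import numpy as np

def mask_background(crop, segment_mask, mean_image):
    """Keep pixels inside the segment; replace everything else with the mean image.

    crop, mean_image: (H, W, 3) arrays; segment_mask: (H, W) boolean, True on the object.
    """
    return np.where(segment_mask[..., None], crop, mean_image)

crop = np.full((4, 4, 3), 200.0)          # toy crop of a region proposal
mean_image = np.full((4, 4, 3), 120.0)    # toy dataset mean image
segment_mask = np.zeros((4, 4), dtype=bool)
segment_mask[1:3, 1:3] = True             # toy segment proposal

masked = mask_background(crop, segment_mask, mean_image)
```

After mean subtraction the background pixels become zero, so the network that receives the masked crop sees only the proposed segment.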

64 Instance Segmentation: Cascades
- Similar to Faster R-CNN
- Won COCO 2015 challenge (with ResNet)
Dai et al, Instance-aware Semantic Segmentation via Multi-task Network Cascades, arXiv

65 Instance Segmentation: Cascades
- Similar to Faster R-CNN
- Region proposal network (RPN)
- Won COCO 2015 challenge (with ResNet)
Dai et al, Instance-aware Semantic Segmentation via Multi-task Network Cascades, arXiv

66 Instance Segmentation: Cascades
- Similar to Faster R-CNN
- Region proposal network (RPN)
- Reshape boxes to fixed size, figure / ground logistic regression
- Mask out background, predict object class
- Learn entire model end-to-end!
- Won COCO 2015 challenge (with ResNet)
Dai et al, Instance-aware Semantic Segmentation via Multi-task Network Cascades, arXiv

67 Instance Segmentation: Cascades
Predictions vs. ground truth
Dai et al, Instance-aware Semantic Segmentation via Multi-task Network Cascades, arXiv 2015

68 Segmentation Overview
Semantic segmentation
- Classify all pixels
- Fully convolutional models, downsample then upsample
- Learnable upsampling: fractionally strided convolution
- Skip connections can help
Instance Segmentation
- Detect instance, generate mask
- Similar pipelines to object detection

69 Quick overview of Other Topics

70 Recurrent Neural Networks (RNN)
Vanilla Neural Networks

71 Recurrent Neural Networks (RNN)
e.g. Image Captioning: image -> sequence of words

72 Recurrent Neural Networks (RNN)
e.g. Sentiment Classification: sequence of words -> sentiment

73 Recurrent Neural Networks (RNN)
e.g. Machine Translation: seq of words -> seq of words

74 Recurrent Neural Networks (RNN)
e.g. Video classification on frame level

75 RNN: input x, output y

76 Character RNN during training: samples after successively more training

78 Generated C code

79 Searching for interpretable cells: quote detection cell

80 Sequential Processing of fixed inputs
Multiple Object Recognition with Visual Attention, Ba et al.

81 Sequential Processing of fixed outputs
DRAW: A Recurrent Neural Network For Image Generation, Gregor et al.

82 Image Captioning
- Explain Images with Multimodal Recurrent Neural Networks, Mao et al.
- Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei
- Show and Tell: A Neural Image Caption Generator, Vinyals et al.
- Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.
- Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick

83 Recurrent Neural Network, Convolutional Neural Network

84 Soft Attention for Captioning
- Image: H x W x 3 -> CNN -> Features: L x D
- Hidden states: h0, h1, h2
- a1, a2: Distribution over L locations
- z1, z2: Weighted features: D (weighted combination of features)
- y1, y2: input words (y1 is the first word)
- d1: Distribution over vocab
Xu et al, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015
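
The weighted-combination step can be sketched as follows. How the per-location scores are computed from the hidden state is left out; the sketch assumes they are already given as a length-L vector, and the function name `soft_attention` is illustrative.

```python
import numpy as np

def soft_attention(features, scores):
    """Soft attention over L locations.

    features: (L, D) grid of CNN features; scores: (L,) unnormalized attention scores.
    Returns a (distribution over L locations) and z (weighted features, length D).
    """
    a = np.exp(scores - scores.max())  # numerically stable softmax
    a = a / a.sum()
    z = a @ features                   # weighted combination of features
    return a, z

L, D = 6, 4
features = np.arange(L * D, dtype=float).reshape(L, D)
scores = np.zeros(L)  # equal scores give uniform attention
a, z = soft_attention(features, scores)
```

Because z is a smooth function of the scores, gradients can flow through the attention weights, which is what separates soft attention from hard attention that samples a single location.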

85 Soft Attention for Captioning
Soft attention vs. hard attention
Xu et al, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML

86 Soft Attention for Captioning
Xu et al, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML

87 Spatial Transformer Networks
- Input image: H x W x 3
- Box Coordinates: (xc, yc, w, h)
- Cropped and rescaled image: X x Y x 3
Can we make this function differentiable?
Jaderberg et al, Spatial Transformer Networks, NIPS

88 Spatial Transformer Networks
- Input image: H x W x 3
- Box Coordinates: (xc, yc, w, h)
- Cropped and rescaled image: X x Y x 3
Can we make this function differentiable?
- Idea: Function mapping pixel coordinates (xt, yt) of output to pixel coordinates (xs, ys) of input
- Repeat for all pixels in output to get a sampling grid
- Then use bilinear interpolation to compute output
- Network attends to input by predicting
Jaderberg et al, Spatial Transformer Networks, NIPS

89 Spatial Transformer Networks
Input: Full image
Output: Region of interest from input
- A small Localization network predicts transform
- Grid generator uses the predicted transform to compute sampling grid
- Sampler uses bilinear interpolation to produce output
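
A sketch of the grid generator and sampler for the simple crop-and-rescale case, assuming the transform is given directly as a box (xc, yc, w, h) in pixel units. The function name `bilinear_crop`, the edge clamping, and the requirement that the output be at least 2 x 2 are illustrative choices; no localization network is included.

```python
import numpy as np

def bilinear_crop(image, box, out_h, out_w):
    """Crop and rescale a box from an (H, W, C) image with bilinear interpolation.

    box = (xc, yc, w, h) in pixels (assumed parameterization); out_h, out_w >= 2.
    """
    H, W, _ = image.shape
    xc, yc, w, h = box
    # Sampling grid: source coordinate (xs, ys) for every output pixel (xt, yt).
    xs = np.clip(xc + (np.linspace(0, 1, out_w) - 0.5) * w, 0, W - 1)
    ys = np.clip(yc + (np.linspace(0, 1, out_h) - 0.5) * h, 0, H - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (xs - x0)[None, :, None]  # horizontal interpolation weights
    wy = (ys - y0)[:, None, None]  # vertical interpolation weights
    top = image[y0][:, x0] * (1 - wx) + image[y0][:, x1] * wx
    bottom = image[y1][:, x0] * (1 - wx) + image[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

# Toy image whose pixel value equals its x coordinate.
image = np.tile(np.arange(8.0)[None, :, None], (6, 1, 1))
crop = bilinear_crop(image, box=(3.5, 2.5, 4.0, 2.0), out_h=3, out_w=4)
```

The output is a weighted sum of neighbouring pixels whose weights vary continuously with the box coordinates, so gradients with respect to the box exist almost everywhere; that is what makes the crop usable inside a network trained by backpropagation.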

90 Spatial Transformer Networks
- Insert spatial transformers into a classification network and it learns to attend and transform the input
- Differentiable "attention / transformation" module
