Convolutional neural networks

Hubert Cummings
5 years ago

1 Convolutional neural networks

2 Themes Curriculum: Ch 9.1, 9.2 and The simple motivation and idea; How it's done; Receptive field; Pooling; Dilated convolutions

3 Resources: Chapter 9 (not great); cs231n video: g; video relevant for the motivation part: V98. Learning goals: Why are convolutional networks good for images and audio? How does a normal convolutional network work? Why is the receptive field important? How can we increase the receptive field? What is pooling, why is it used, and what are its possible downsides?

4 The simple motivation and idea

5 The simple idea: Image filters can enhance image attributes. Convolutional neural networks are similar to conventional image filtering. Filter kernels are learnt.

6 How does a fully connected network see the world? A fully connected neural network, or a standard machine learning model, has to learn that pixels close to each other are more related. A cat moved from one part of a picture to another is viewed as a completely different object.

7 A shifted frog is seen as completely different

9 Most image applications are invariant to absolute position

10 Building absolute position invariance: We can make a sliding classifier (e.g. an SVM), reusing the same classifier many times for each picture. Problems?

11 Building absolute position invariance: We can make a sliding classifier, reusing the same classifier many times for each picture. Problems: restricted field of view; still problems with different sizes.

12 Make every layer in a neural network slide

13 Make every layer in a neural network slide: Not only the cat classifier is reused, but also partial representations (edge, fur, eye, grass detectors). More tolerant to changes in shape and size? Large receptive field? Reuse from sliding is combined with reuse with depth.

14 Make every layer in a neural network slide: Reuse from sliding is combined with reuse with depth. With depth, a detector can be reused for different classes etc. With sliding, a detector can also be reused for every position. The two kinds of reuse multiply rather than add: a product relationship instead of a sum (have not seen any studies).
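The reuse that sliding gives can be illustrated in one dimension. This is a minimal sketch, assuming NumPy; the signal, the detector kernel and the fully connected weights are made-up values, not taken from the lecture. Shifting the input shifts the sliding detector's response by the same amount, while a single fully connected unit, with a separate weight per position, gives an unrelated value.

```python
import numpy as np

signal = np.zeros(12)
signal[2:5] = [1.0, 2.0, 1.0]     # a small "object" (illustrative values)
shifted = np.roll(signal, 4)      # the same object, 4 positions to the right

# Sliding detector: the same (hypothetical) weights are reused at every position.
kernel = np.array([1.0, 0.0, -1.0])
response = np.correlate(signal, kernel, mode="valid")
response_shifted = np.correlate(shifted, kernel, mode="valid")

# One fully connected unit: a separate (hypothetical) weight for every position.
fc_weights = np.arange(12, dtype=float)
fc_out = fc_weights @ signal
fc_out_shifted = fc_weights @ shifted

# The sliding response is the same pattern, moved 4 steps to the right.
same_pattern = np.allclose(response_shifted, np.roll(response, 4))
```

Here `same_pattern` is true, whereas `fc_out` and `fc_out_shifted` differ even though the object itself is unchanged.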

15 How it's done

16 Convolutional neural network: You should all know convolution. The difference between convolution and correlation is irrelevant here (it only amounts to flipping the filter, and the filter is learnt anyway). When we deal with channels or features there are some options.
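The flipping remark can be checked numerically. A minimal sketch, assuming NumPy's `np.convolve` and `np.correlate` on a 1D signal; the signal and kernel values are made up for illustration.

```python
import numpy as np

signal = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
kernel = np.array([1.0, 0.0, -1.0])

# Convolution flips the kernel before sliding it over the signal.
conv = np.convolve(signal, kernel, mode="valid")

# Correlation slides the kernel as it is; flipping it first gives the convolution.
corr_of_flipped = np.correlate(signal, kernel[::-1], mode="valid")

identical = np.allclose(conv, corr_of_flipped)
```

Since the kernel weights are learnt, a network trained with correlation simply ends up with the flipped version of the kernel it would have learnt with convolution.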

17 Filters and channels (standard method): An input image has a third dimension (say RGB). A filter/kernel always has the same third dimension.

18 Filters and channels: The overlapping area is multiplied elementwise and then summed (a dot product). With sliding you get a 28x28x1 output.
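The sliding dot product can be sketched directly in NumPy. The input and filter sizes below (a 32x32x3 image and a 5x5x3 filter, stride 1, no padding) are assumptions: the text does not state them, but they are one combination that reproduces the 28x28 output. The function name is hypothetical.

```python
import numpy as np

def conv2d_single_filter(image, kernel):
    """Slide one kernel over an image (stride 1, no padding).

    image:  (H, W, C); kernel: (k, k, C) with the same depth C.
    Returns an (H - k + 1, W - k + 1) activation map.
    """
    H, W, C = image.shape
    k = kernel.shape[0]
    assert kernel.shape == (k, k, C), "kernel depth must match image depth"
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k, :]
            out[i, j] = np.sum(patch * kernel)  # dot product over k*k*C values
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))   # assumed input size
kernel = rng.standard_normal((5, 5, 3))    # assumed filter size
activation_map = conv2d_single_filter(image, kernel)
```

Each output value depends on all three input channels, which is why one filter gives a single-channel map.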

19 Usually we use multiple filters per layer: Each new kernel/filter slides over the same image and creates a new filtered image.

20 Many activation maps create a new image: If we filter the image 6 times, we get a new image with 6 channels.

21 A one-layer, two-filter network

25 In convolutional networks, layers are 3D...

26 ... kernels are 4D: If we combine all the filters we get a 4D tensor. The operation can be viewed as a matrix multiplication for each spatial position of the kernel, followed by a sum over the kernel's spatial dimensions. This is a useful representation, as many deep learning frameworks present it in this way.
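The matrix-multiplication view can be sketched as follows. This assumes a (k, k, C_in, C_out) weight layout and stride 1 without padding; the layout, sizes and function name are illustrative choices, not something the lecture prescribes.

```python
import numpy as np

def conv2d_as_matmuls(x, w):
    """x: (H, W, C_in); w: (k, k, C_in, C_out). Stride 1, no padding."""
    H, W, _ = x.shape
    k, _, _, c_out = w.shape
    out_h, out_w = H - k + 1, W - k + 1
    out = np.zeros((out_h, out_w, c_out))
    for di in range(k):          # loop over the kernel's spatial positions
        for dj in range(k):
            shifted = x[di:di + out_h, dj:dj + out_w, :]  # (out_h, out_w, C_in)
            out += shifted @ w[di, dj]                     # C_in -> C_out matmul
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))     # assumed input size
w = rng.standard_normal((5, 5, 3, 6))    # 6 filters, combined into one 4D tensor
y = conv2d_as_matmuls(x, w)
```

With 6 filters the output has 6 channels, one activation map per filter.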

27 A convolutional neural network consists of multiple layers

29 Some stack many layers

30 Can a convolutional network remember positions? A fully connected network treats each position differently...

31 Can a convolutional network remember positions? A fully connected network treats each position differently. A convolutional network can, first of all, keep spatial information in the spatial dimension of the filter bank. More on this later.

32 Receptive field: how much can the algorithm see?

33 How large an area influences the end result? With a sliding classifier (e.g. an SVM), the receptive field equals the classifier's input size. Why do we even want a large receptive field?

34 How large an area influences the end result? With a convolutional network the receptive field increases with each layer.

35 How large an area influences the end result? With a convolutional network the receptive field increases with each layer: 3 inputs influence each node in the first hidden layer.

36 How large an area influences the end result? With a convolutional network the receptive field increases with each layer: 3 inputs influence each node in the first hidden layer, 5 influence the next...

38 How many inputs can influence each output?

39 The receptive field grows by k-1 for each layer

40 The receptive field grows by k-1 for each layer: two 3x3 layers = one 5x5 layer.

41 The receptive field grows by k-1 for each layer: two 3x3 layers = one 5x5 layer. So should we use 3x3 or 5x5?

42 The receptive field grows by k-1 for each layer: two 3x3 layers = one 5x5 layer. So should we use 3x3 or 5x5? A 5x5 kernel has 5*5*(filters_in*filters_out) parameters. Two 3x3 kernels have 2*3*3*(filters_in*filters_out) parameters.
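The parameter comparison can be worked through with concrete numbers. A small sketch, assuming 64 input and 64 output channels for every layer (an arbitrary choice for illustration) and ignoring biases; the helper name is hypothetical.

```python
def conv_params(kernel_size, filters_in, filters_out, layers=1):
    """Weights in `layers` stacked k x k convolutions (biases ignored)."""
    return layers * kernel_size * kernel_size * filters_in * filters_out

filters_in = filters_out = 64   # assumed channel counts
one_5x5 = conv_params(5, filters_in, filters_out)
two_3x3 = conv_params(3, filters_in, filters_out, layers=2)
ratio = two_3x3 / one_5x5       # 18 / 25, independent of the channel counts
```

Two 3x3 layers cover the same 5x5 area with 18/25 = 72% of the weights, as long as the intermediate layer keeps the same number of channels.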

43 Smaller spatial filter size is more parameter efficient: A network with many parameters generally needs more training data and computation time. A larger receptive field per parameter is good. More layers can give more reuse.

44 How large a receptive field did the 152-layer ResNet have (it used 3x3 convolutions)?

45 How large a receptive field did the 152-layer ResNet have (it used 3x3 convolutions)? 305, i.e. 1 + 152*(3-1), counting every layer as a stride-1 3x3 convolution.
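The k-1 growth rule is easy to turn into a small calculator. This sketch assumes stride 1 everywhere and treats all 152 layers as 3x3 convolutions, which is a simplification of the real architecture but is the assumption under which the rule gives 305. The function name is hypothetical.

```python
def receptive_field(kernel_sizes):
    """Receptive field (in input pixels, along one axis) of stacked
    stride-1 convolutions with the given kernel sizes."""
    rf = 1                      # a single input pixel sees only itself
    for k in kernel_sizes:
        rf += k - 1             # each layer adds k - 1
    return rf

two_3x3 = receptive_field([3, 3])
one_5x5 = receptive_field([5])
deep_stack = receptive_field([3] * 152)
```

`two_3x3` and `one_5x5` are both 5, and `deep_stack` is 305.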

46 Increasing the receptive field more efficiently: Why do we need to?

47 Increasing the receptive field more efficiently: Why do we need to? We only need a certain level of abstraction (still a research topic, but indicated in: "Wide Residual Networks"; "Wider or Deeper: Revisiting the ResNet Model for Visual Recognition"; "Residual Networks are Exponential Ensembles of Relatively Shallow Networks"). Low-level features also need spatial context. Large networks are expensive in computation time and memory.

48 Strided convolutions: By skipping positions we can cover a larger area with less computation. The effect on the receptive field of the next layer is important.

49 The effect of strided convolutions

50 The effect of strided convolutions: We still cover the whole input. Do we have a larger receptive field? The next layer has a larger receptive field: 7 compared to 5.

51 The effect of strided convolutions: We still cover the whole input. Do we have a larger receptive field? The next layer has a larger receptive field: 7 compared to 5. The effect can be seen from:

52 The effect of strided convolutions: Essentially, the receptive-field growth of all the following layers is multiplied by the stride S. Figure legend: green: stride = 2; red: stride = 2 for the first layer; blue: stride = 1.
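The 7-versus-5 comparison follows from tracking how far apart neighbouring units are in input pixels. A minimal sketch, assuming 3x3 kernels and three layers for the green/red/blue cases (the number of layers in the original figure is not recoverable, so three is an assumption); the function name is hypothetical.

```python
def receptive_field_strided(layers):
    """layers: list of (kernel_size, stride) pairs, from input to output.
    Returns the receptive field along one axis, in input pixels."""
    rf, jump = 1, 1      # jump: spacing of neighbouring units, in input pixels
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump   # growth is scaled by earlier strides
        jump *= stride
    return rf

stride_1_twice = receptive_field_strided([(3, 1), (3, 1)])
stride_2_first = receptive_field_strided([(3, 2), (3, 1)])

green = receptive_field_strided([(3, 2)] * 3)                # stride 2 everywhere
red = receptive_field_strided([(3, 2), (3, 1), (3, 1)])      # stride 2 first only
blue = receptive_field_strided([(3, 1)] * 3)                 # stride 1 everywhere
```

This gives 5 and 7 for the two-layer cases, and 15, 11 and 7 for the green, red and blue three-layer cases.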

53 With strides, spatial dimensions will become smaller: Usually some of the network capacity is preserved through an increasing number of channels.
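How much smaller the spatial dimensions get can be computed with the usual output-size formula. A small sketch; the layer sizes in the example (32x32x16 in, stride 2, padding 1, 32 channels out) are assumptions chosen for illustration, and the helper name is hypothetical.

```python
def conv_output_size(n, kernel_size, stride, padding=0):
    """Output length along one axis for input length n."""
    return (n + 2 * padding - kernel_size) // stride + 1

# Assumed example: a 32x32x16 layer followed by a 3x3, stride-2 convolution
# with padding 1 that doubles the number of channels.
side_in, channels_in = 32, 16
side_out = conv_output_size(side_in, kernel_size=3, stride=2, padding=1)
channels_out = 2 * channels_in

values_in = side_in * side_in * channels_in
values_out = side_out * side_out * channels_out
```

Halving each spatial side while doubling the channels leaves half as many activation values (8192 instead of 16384), rather than a quarter.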

54 Can the network still remember positions?

55 Can the network still remember positions? Yes, the network can still encode positional information in the depth dimension. A network can pass positional information (right, left etc.) to different channels.

56 Pooling: spatial reduction and forcing invariance

57 Max pooling: a strided maximum filter, choosing the maximum value inside the kernel range.
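Max pooling as a strided maximum filter can be sketched in a few lines of NumPy. The 2x2 window with stride 2 and the input values are assumptions chosen for illustration; the function name is hypothetical.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """x: (H, W). Returns the maximum of each size x size window."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + size, c:c + size].max()
    return out

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 0, 5, 6],
              [1, 2, 7, 8]])
pooled = max_pool2d(x)   # [[4, 2], [2, 8]]
```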

58 Max pooling: invariance built in. We saw that a network could learn max or average functions to create invariance. With max pooling you explicitly remove some spatial information. This can help both position and rotation invariance. As we know, many image analysis applications seek results invariant to position.

59 Max pooling has some important problems: Even if we want our final results to be positionally invariant, we may need positional information in the earlier representations. Only a small part of the network is updated with gradients each step (slower learning). We calculate a lot of values that are not used.
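The point about gradients can be made concrete: in the backward pass, only the input that produced each window's maximum receives a gradient. A minimal sketch, assuming non-overlapping 2x2 windows and made-up input values; the function name is hypothetical.

```python
import numpy as np

def max_pool_grad_mask(x, size=2):
    """0/1 mask of the inputs that receive gradient from non-overlapping
    size x size max pooling (ties go to the first maximum)."""
    mask = np.zeros(x.shape, dtype=float)
    for i in range(0, x.shape[0] - size + 1, size):
        for j in range(0, x.shape[1] - size + 1, size):
            window = x[i:i + size, j:j + size]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            mask[i + r, j + c] = 1.0
    return mask

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 0, 5, 6],
              [1, 2, 7, 8]])
mask = max_pool_grad_mask(x)
fraction_updated = mask.mean()   # 1 of every 4 inputs with 2x2 windows
```

With 2x2 windows only a quarter of the inputs get a non-zero gradient in a given step; the other three quarters were computed but do not influence the update.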

60 Can the network still remember positions?

61 Can the network still remember positions? Yes, in a similar way as with strides: give a high value to one channel if the target is to the right and a high value to another channel if the target is to the left. The book calls pooling approximately invariant to small translations. Variant features will be harder to learn than invariant features.
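What "approximately invariant to small translations" means can be seen on a toy 1D signal. A minimal sketch, assuming non-overlapping windows of size 2 and a made-up signal; the function name is hypothetical.

```python
import numpy as np

def max_pool1d(x, size=2):
    """Non-overlapping 1D max pooling (stride equal to the window size)."""
    usable = len(x) - len(x) % size
    return x[:usable].reshape(-1, size).max(axis=1)

signal = np.array([0, 0, 9, 0, 0, 0, 0, 0])   # peak at index 2
shift_1 = np.roll(signal, 1)                    # peak at index 3, same window
shift_2 = np.roll(signal, 2)                    # peak at index 4, next window

pooled = max_pool1d(signal)
pooled_shift_1 = max_pool1d(shift_1)
pooled_shift_2 = max_pool1d(shift_2)
```

A one-step shift that stays inside the same pooling window leaves the output unchanged, while a two-step shift moves the peak to the next output position. The invariance is therefore only approximate and depends on where the feature falls relative to the window boundaries.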

62 Dilated convolutions: larger receptive field, without reducing spatial dimensions or increasing the number of parameters.

63 Dilated convolutions: Skipping values in the kernel. Same as filling the kernel with every other value as zero. Still covers all inputs. Larger kernel with no extra parameters.
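The equivalence between skipping input values and padding the kernel with zeros can be checked in 1D. A minimal sketch, assuming NumPy and a dilation factor of 2; the signal, kernel and function names are made up for illustration.

```python
import numpy as np

def dilate_kernel(kernel, dilation):
    """Insert dilation - 1 zeros between the taps of a 1D kernel."""
    dilated = np.zeros((len(kernel) - 1) * dilation + 1, dtype=kernel.dtype)
    dilated[::dilation] = kernel
    return dilated

def dilated_correlate(signal, kernel, dilation):
    """Apply the kernel to every dilation-th sample, without storing zeros."""
    span = (len(kernel) - 1) * dilation + 1
    return np.array([
        np.dot(signal[i:i + span:dilation], kernel)
        for i in range(len(signal) - span + 1)
    ])

signal = np.arange(10, dtype=float) ** 2
kernel = np.array([1.0, -2.0, 1.0])            # 3 parameters

skipping = dilated_correlate(signal, kernel, dilation=2)
zero_filled = np.correlate(signal, dilate_kernel(kernel, 2), mode="valid")
```

Both routes give the same output. The dilated kernel spans 5 inputs but still has only 3 learnable values, and because it slides with stride 1, every input position is still covered.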

64 A growing dilation factor can give a similar effect to stride: With a constant dilation factor you get the same effect as using a larger kernel. With growing dilation you can get an even larger receptive field, while still covering all inputs.
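The difference between constant and growing dilation shows up when the receptive field is computed layer by layer. A small sketch, assuming four stride-1 layers with 3-tap kernels and a doubling dilation schedule (both are illustrative choices); the function name is hypothetical.

```python
def receptive_field_dilated(kernel_size, dilations):
    """Receptive field along one axis of stacked stride-1 convolutions,
    one layer per entry in `dilations`."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d   # a dilated kernel spans (k - 1) * d + 1
    return rf

no_dilation = receptive_field_dilated(3, [1, 1, 1, 1])
constant = receptive_field_dilated(3, [2, 2, 2, 2])   # same as four 5-tap layers
growing = receptive_field_dilated(3, [1, 2, 4, 8])
```

Four layers give a receptive field of 9 without dilation, 17 with a constant dilation of 2 (the same as four layers with kernel size 5), and 31 when the dilation doubles per layer, all with the same number of parameters.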

65 A growing dilation factor can give a similar effect to stride: With audio signals, as in this application, the receptive field is even more important.

66 Next week: Monday: Introduction to TensorFlow (small lecture and coding): Why use TensorFlow? TensorFlow compared to NumPy. Friday: Residual networks; convolutional neural networks for segmentation and localisation.
