Convolutional neural networks


Themes. Curriculum: Ch 9.1, 9.2 and http://cs231n.github.io/convolutional-networks/. The simple motivation and idea. How it's done. Receptive field. Pooling. Dilated convolutions.

Resources / Learning goals. Chapter 9 (not great), http://cs231n.github.io/convolutional-networks/, cs231n video: https://www.youtube.com/watch?v=aqirpkraydg, video relevant for the motivation part: https://www.youtube.com/watch?v=sq67nbclV98. Why is a convolutional network good for images and audio? How does a normal convolutional network work? Why is the receptive field important? How can we increase the receptive field? What is pooling, why is it used, and what are possible downsides?

The simple motivation and idea

The simple idea. Image filters can enhance image attributes. Convolutional neural networks are similar to conventional image filtering, except that the filter kernels are learnt.
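
As a concrete illustration (a minimal NumPy sketch, not from the lecture): a hand-designed 3x3 edge filter slid over a grayscale image. In a convolutional network, exactly this kind of kernel is what gets learnt instead of hand-designed.

```python
import numpy as np

def filter2d(image, kernel):
    """Slide a kernel over a grayscale image (valid padding, cross-correlation)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return out

# A classic hand-designed edge filter; in a CNN these weights would be learnt.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

image = np.zeros((8, 8))
image[:, 4:] = 1.0          # left half dark, right half bright
edges = filter2d(image, sobel_x)
print(edges.shape)          # (6, 6); responds only at the vertical edge
```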

How does a fully connected network see the world? A neural network or standard machine learning method has to learn that pixels close to each other are more related. A cat moved from one part of a picture to another is viewed as a completely different object.

A shifted frog is seen as completely different

Most image applications are absolute position invariant

Building absolute position invariance. We can make a sliding classifier, reusing the same classifier (e.g. an SVM) many times for each picture. Problems?

Building absolute position invariance. We can make a sliding classifier, reusing the same classifier many times for each picture. Problems: a restricted field of view, and still problems with different sizes.

Make every layer in a neural network slide

Make every layer in a neural network slide. Not only the cat classifier is reused, but also partial representations: edge, fur, eye and grass detectors. More tolerant to changes in shape and size? Large receptive field? Reuse from sliding is combined with reuse with depth.

Make every layer in a neural network slide. Reuse from sliding is combined with reuse with depth: with depth a detector can be reused for different classes etc., and with sliding a detector can also be reused for every position. A product relationship instead of a sum (no studies seen on this).

How it's done

Convolutional neural network. You should all know convolution. The difference between convolution and correlation is irrelevant here (it is just a flipped filter). When we deal with channels or features there are some options.
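
To see why the distinction does not matter for learnt filters, a small 1-D NumPy sketch (not from the lecture): correlation with a flipped kernel equals true convolution, so a network that learns its weights can absorb the flip.

```python
import numpy as np

def correlate(signal, kernel):
    """1-D cross-correlation, valid padding."""
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i+len(kernel)], kernel) for i in range(n)])

x = np.array([1., 2., 3., 4., 5.])
k = np.array([1., 0., -1.])

# True convolution = correlation with the kernel flipped.
conv = correlate(x, k[::-1])
print(np.allclose(conv, np.convolve(x, k, mode='valid')))  # True
```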

Filters and channels (standard method). An input image has a third dimension (say RGB). A filter/kernel always has the same third dimension.

Filters and channels. The overlapping area is multiplied elementwise and then summed (a dot product). With sliding you get a 28x28x1 output.
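
A minimal sketch of this sliding dot product (assuming, as on the slide, a 32x32x3 input and a 5x5x3 filter, which gives the 28x28 output):

```python
import numpy as np

def conv_single_filter(image, kernel):
    """Slide one kernel over the full depth of the input (valid padding).

    image: H x W x C, kernel: kh x kw x C  ->  output: (H-kh+1) x (W-kw+1)
    """
    kh, kw, _ = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for y in range(oh):
        for x in range(ow):
            # Overlapping volume multiplied elementwise, then summed: a dot product.
            out[y, x] = np.sum(image[y:y+kh, x:x+kw, :] * kernel)
    return out

image  = np.random.rand(32, 32, 3)   # e.g. a 32x32 RGB image
kernel = np.random.rand(5, 5, 3)     # the filter always spans all input channels
out = conv_single_filter(image, kernel)
print(out.shape)                     # (28, 28)
```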

Usually we use multiple filters per layer. A new kernel/filter slides over the same image and creates a new filtered image.

Many activation maps create a new image. If we filter the image 6 times, we get a new image with 6 channels.

A one-layer, two-filter network

In convolutional networks, layers are 3D...

...kernels are 4D. If we combine all the filters we get a 4D tensor. The operation can be viewed as a matrix multiplication for each spatial position, summed over the spatial dimensions. This is a useful representation, as many deep learning frameworks present it in this way.
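
A sketch of this view in NumPy (hypothetical shapes, not from the lecture): the 4D kernel tensor applied as one matrix multiplication per kernel offset, summed over the window.

```python
import numpy as np

H, W, C_in, C_out, k = 8, 8, 3, 6, 3
x = np.random.rand(H, W, C_in)
weights = np.random.rand(k, k, C_in, C_out)   # the 4D kernel tensor

# Conv as "a matrix multiply per spatial offset, summed over the kernel window":
out = np.zeros((H - k + 1, W - k + 1, C_out))
for dy in range(k):
    for dx in range(k):
        patch = x[dy:dy + out.shape[0], dx:dx + out.shape[1], :]
        # (oh, ow, C_in) @ (C_in, C_out): a matrix multiplication at each position
        out += patch @ weights[dy, dx]
print(out.shape)  # (6, 6, 6)
```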

Convolutional neural networks consist of multiple layers

Some stack many layers

Can a convolutional network remember positions? A fully connected network treats each position differently...

Can a convolutional network remember positions? A fully connected network treats each position differently. A convolutional network can first of all keep spatial information in the spatial dimensions of the filter bank. More on this later.

Receptive field: how much can the algorithm see?

How large an area influences the end result? With a sliding classifier (e.g. an SVM) you get the input size as the receptive field. Why do we even want a large receptive field?

How large an area influences the end result? With a convolutional network the receptive field increases with each layer.

How large an area influences the end result? With a convolutional network the receptive field increases with each layer. 3 inputs influence each node in the first hidden layer.

How large an area influences the end result? With a convolutional network the receptive field increases with each layer. 3 inputs influence each node in the first hidden layer, 5 influence the next...

How many inputs can influence each output?

The receptive field grows by k-1 for each layer

The receptive field grows by k-1 for each layer. Two 3x3 layers = one 5x5 layer.

The receptive field grows by k-1 for each layer. Two 3x3 layers = one 5x5 layer. So should we use 3x3 or 5x5?

The receptive field grows by k-1 for each layer. Two 3x3 layers = one 5x5 layer. So should we use 3x3 or 5x5? A 5x5 kernel has 5*5*(filters_in*filters_out) parameters; two 3x3 kernels have 2*3*3*(filters_in*filters_out) parameters.

Smaller spatial filter sizes are more parameter efficient. A network with many parameters generally needs more training data and computation time. A larger receptive field per parameter is good. More layers can give more reuse.
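
The arithmetic, with hypothetical channel counts (64 in, 64 out; biases ignored):

```python
# Same 5x5 receptive field, two different parameter costs:
filters_in, filters_out = 64, 64
one_5x5 = 5 * 5 * filters_in * filters_out       # a single 5x5 layer
two_3x3 = 2 * 3 * 3 * filters_in * filters_out   # two stacked 3x3 layers
print(one_5x5, two_3x3)  # 102400 73728: the 3x3 stack uses ~28% fewer parameters
```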

How large a receptive field did the 152-layer ResNet have (it used 3x3 convolutions)?

How large a receptive field did the 152-layer ResNet have (it used 3x3 convolutions)? 305
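
A sketch of the calculation. This is a simplification: the real ResNet-152 also uses strides, pooling and 1x1 convolutions, which are ignored here; the point is only the grows-by-k-1-per-layer rule.

```python
def receptive_field(num_layers, k=3):
    """Receptive field of stacked stride-1 k x k convolutions: grows by k-1 per layer."""
    rf = 1
    for _ in range(num_layers):
        rf += k - 1
    return rf

print(receptive_field(152))  # 305
```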

Increasing the receptive field more efficiently. Why do we need to?

Increasing the receptive field more efficiently. Why do we need to? We only need a certain level of abstraction (still a research topic, but indicated in: Wide Residual Networks; Wider or Deeper: Revisiting the ResNet Model for Visual Recognition; Residual Networks are Exponential Ensembles of Relatively Shallow Networks). Low-level features also need spatial context. Large networks are expensive in computation time and memory.

Strided convolutions. By skipping positions we can cover a larger area with less computation. The effect on the receptive field of the next layer is important.

The effect of strided convolutions

The effect of strided convolutions. We still cover the whole input. Do we have a larger receptive field? The next layer has a larger receptive field: 7 compared to 5. The effect can be seen from the receptive-field formula (shown on the slide).

The effect of strided convolutions. Essentially all the following layers will have their receptive-field growth multiplied by S. (Plot: green: stride = 2 everywhere; red: stride = 2 for the first layer only; blue: stride = 1.)

With strides, the spatial dimensions become smaller. Usually some of the network capacity is preserved through an increasing number of channels.

Can the network still remember positions?

Can the network still remember positions? Yes, the network can still encode positional information in the depth dimension: a network can pass positional information (right, left etc.) to different channels.

Pooling: spatial reduction and forcing invariance

Max-pooling. A strided maximum filter: choosing the maximum value inside the kernel range.
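
A minimal NumPy sketch of 2x2 max-pooling with stride 2 (not from the lecture):

```python
import numpy as np

def max_pool(x, k=2):
    """k x k max-pooling with stride k: keep the maximum in each window."""
    h, w = x.shape[0] // k, x.shape[1] // k
    return x[:h*k, :w*k].reshape(h, k, w, k).max(axis=(1, 3))

x = np.array([[1., 3., 2., 1.],
              [4., 2., 0., 1.],
              [0., 1., 5., 2.],
              [1., 0., 2., 6.]])
print(max_pool(x))
# [[4. 2.]
#  [1. 6.]]
```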

Max-pooling: invariance built in. We saw that a network could learn max or average functions to create invariance. With max-pooling you explicitly remove some spatial information. This can help with both position and rotation invariance. As we know, many image analysis applications seek results that are invariant to position.

Max-pooling has some important problems. Even if we want our final results to be position invariant, we may need positional information in the earlier representations. Only a small part of the network is updated with gradients at each step (slower learning). We calculate a lot of values that are not used.

Can the network still remember positions?

Can the network still remember positions? Yes, in a similar way as with strides: give a high value to one channel if the target is to the right and a high value to another channel if the target is to the left. The book calls it "approximately invariant to small translations". Variant features will be harder to learn compared to invariant features.

Dilated convolutions: a larger receptive field without reducing spatial dimensions or increasing the number of parameters

Dilated convolutions. Skipping values in the kernel is the same as filling the kernel with every other value set to zero. We still cover all inputs: a larger kernel with no extra parameters.
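
A sketch of the zero-filling view in NumPy (hypothetical 3x3 kernel, not from the lecture): spreading the taps apart enlarges the kernel's extent while keeping the parameter count fixed.

```python
import numpy as np

def dilate_kernel(kernel, d):
    """Insert d-1 zeros between kernel taps: same parameters, larger extent."""
    k = kernel.shape[0]
    out = np.zeros((d * (k - 1) + 1, d * (k - 1) + 1))
    out[::d, ::d] = kernel
    return out

k3 = np.ones((3, 3))
k5 = dilate_kernel(k3, 2)
print(k5.shape)  # (5, 5): covers a 5x5 area with only 9 nonzero weights
```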

A growing dilation factor can give a similar effect as stride. With a constant dilation factor you get the same effect as using a larger kernel; with a growing dilation factor you can get an even larger receptive field while still covering all inputs.

A growing dilation factor can give a similar effect as stride. With audio signals, as in this application, the receptive field is even more important.

Next week. Monday: introduction to TensorFlow (small lecture and coding): why use TensorFlow? TensorFlow compared to NumPy. Friday: residual networks; convolutional neural networks for segmentation and localisation.