Lesson 08. Convolutional Neural Network. Ing. Marek Hrúz, Ph.D. Katedra Kybernetiky Fakulta aplikovaných věd Západočeská univerzita v Plzni.


Convolution
We will consider 2D convolution.
The result of the convolution: for each point it tells us the sum of the products of the signal and the shifted kernel:

(f * g)[m, n] = \sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} f[m - i, n - j] \, g[i, j]    (1)
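To make the definition concrete, here is a minimal numpy sketch of a direct 2D convolution in "valid" mode (no padding); the function name and the random test data are illustrative only. CNN frameworks typically compute cross-correlation, i.e. they skip flipping the kernel, and this sketch does the same.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Direct 2D 'valid' convolution of a single-channel image with a kernel.

    Note: eq. (1) is a true convolution (flipped kernel); CNN layers usually
    implement cross-correlation (no flip), as done here -- the distinction
    does not matter for kernels that are learned anyway.
    """
    H, W = image.shape
    k_h, k_w = kernel.shape
    out = np.zeros((H - k_h + 1, W - k_w + 1))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            out[m, n] = np.sum(image[m:m + k_h, n:n + k_w] * kernel)
    return out

image = np.random.rand(10, 10)
kernel = np.random.rand(3, 3)
print(conv2d_valid(image, kernel).shape)  # (8, 8), matching the stride-one example later on
```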

Architecture of Convolutional Neural Network
There are many possible variations of the architecture.
The standard architecture for image classification:
Convolutional Layer
Max-pooling
Fully Connected Layer
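A minimal sketch of this standard pipeline, assuming PyTorch and a hypothetical 32 x 32 RGB input with 10 output classes (the layer sizes are illustrative, not prescribed by the lesson):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, padding=2),   # convolutional layer
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),        # max-pooling
    nn.Conv2d(32, 64, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 10),                    # fully connected layer -> class scores
)

x = torch.randn(1, 3, 32, 32)                     # one dummy image
print(model(x).shape)                             # torch.Size([1, 10])
```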

Convolutional Layer
A special layer inside a Neural Network.
We define the size and number of kernels (filters) to be learned, the stride, and the padding.
The layer uses the defined kernels to compute feature maps over the input map as a convolution.
This dramatically reduces the number of parameters in the layer as opposed to a fully connected layer.
Usually the input is the image: width x height x channels.
The kernels operate through the channels: a kernel of defined size 3 x 3 really has size 3 x 3 x channels.
This holds for all other convolutional layers: the number of kernels in a layer defines the number of channels of its output feature map.

Output of Convolutional Layer - example
The first convolutional layer has 32 kernels, thus the output feature map has depth (number of channels) equal to 32.
If we define the size of the kernels in the consecutive layer to be 5 x 5, the kernels will have (really) the size of 5 x 5 x 32.
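A quick way to check this, assuming PyTorch (the variable name is illustrative): the weight tensor of a 5 x 5 convolution that follows a 32-channel feature map spans all 32 channels.

```python
import torch.nn as nn

# Second convolutional layer: 32 input channels (from the 32 first-layer kernels),
# 64 kernels of nominal size 5 x 5.
conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=5)
print(conv2.weight.shape)   # torch.Size([64, 32, 5, 5]) -- each kernel is really 5 x 5 x 32
```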

Output of Convolutional Layer - strides
The stride defines by how many pixels we move the kernel until we apply it next time.
(Figures: the kernel applied across a 10 x 10 input map with stride one, followed by the visited locations.)
(Figures: the kernel applied across the same input map with stride three, followed by the visited locations.)

Output of Convolutional Layer - strides
The stride defines the size of the output feature map.
In the previous example we had a 10 x 10 image.
With stride one, the output map will be of size 8 x 8.
With stride three, the output map will be of size 3 x 3.
The stride can be rectangular (e.g. 3 x 1).
There are several strategies for choosing strides.
Very often the strides are chosen so that consecutive kernel applications overlap.

Output of Convolutional Layer - padding
Padding is important for managing the shape of the output feature map.
It is a scalar parameter that determines the width of the boundary of pixels added around the input map.
Current implementations support zero-valued boundaries.

Output of Convolutional Layer - padding
Example of padding equal to one, stride equal to three:
(Figures: the kernel applied across the zero-padded input map, followed by the visited locations.)

Computing the convolution
Let's consider M kernels K_c, c = 1, ..., M, each of size k x k.
The size of the input map is W_I x H_I x C_I.
The depth of the output map C_O is the number of kernels M.
The width and height of the output map are determined by the size of the kernels, the stride, and the padding:

W_O = \frac{W_I + 2 \cdot pad - k}{stride} + 1    (2)

For each output location (x, y, c) and each kernel K_c, where c = 1, ..., M, we compute the convolution:

O[x, y, c] = \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} \sum_{n=0}^{C_I - 1} I[x_s + i, y_s + j, n] \, K_c[i, j, n]    (3)

where (x_s, y_s, n) is the corresponding location in the input map given the stride and padding.
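A small numpy sketch of equations (2) and (3), assuming zero padding for the single-location computation (the names and test shapes are illustrative):

```python
import numpy as np

def conv_output_size(w_in, k, pad, stride):
    """Eq. (2): W_O = (W_I + 2*pad - k) / stride + 1, rounded down."""
    return (w_in + 2 * pad - k) // stride + 1

def conv_at(I, K_c, x, y, stride):
    """One output value of eq. (3): sum over kernel rows, columns and input channels.

    Assumes pad = 0, so the top-left corner in the input map is simply (x*stride, y*stride).
    """
    k = K_c.shape[0]
    xs, ys = x * stride, y * stride
    return np.sum(I[xs:xs + k, ys:ys + k, :] * K_c)

I = np.random.rand(10, 10, 3)                      # W_I x H_I x C_I
K_c = np.random.rand(3, 3, 3)                      # k x k x C_I, one of the M kernels
print(conv_output_size(10, 3, pad=0, stride=1))    # 8  (stride-one example)
print(conv_output_size(10, 3, pad=0, stride=3))    # 3  (stride-three example)
print(conv_at(I, K_c, x=2, y=4, stride=1))
```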

Activation function
The output of the convolution is then activated using an activation function; b is the bias term:

O_map[x, y, c] = f(O[x, y, c] + b)    (4)

The choice of the activation function is arbitrary, up to the point of being differentiable (or at least having a defined derivative) on the whole domain.
The mathematical purpose of the activation function is to model the non-linearity.
But it is not necessary: f(x) = a x is a proper activation function.

Activation function - sigmoidal
A family of S-shaped functions.
Commonly used in the past to model the activity of a neuron cell.
Sigmoid:

f(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1}    (5)

Hyperbolic tangent:

f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}    (6)

There are more possibilities...

Activation function - sigmoidal examples
(Figure: plots of the sigmoid and hyperbolic tangent functions.)

Activation function - Rectified Linear Unit
The most commonly used activation function in CNNs:

f(x) = \max(x, 0)    (7)

Non-linear, easy gradient.

Activation function - Rectified Linear Unit Modifications
PReLU - parametrized ReLU, where the slope of the negative part is handled as a parameter to be learned via backpropagation.
Maxout - several linear functions are learned via backpropagation and the activation is the maximum of these.
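The activation functions above can be summarized in a few lines of numpy; the PReLU slope value below is just an illustrative placeholder for a parameter that would normally be learned:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                              # eq. (5)

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))   # eq. (6), equals np.tanh(x)

def relu(x):
    return np.maximum(x, 0.0)                                    # eq. (7)

def prelu(x, a=0.25):
    # slope a of the negative part would be learned via backpropagation
    return np.where(x > 0, x, a * x)

x = np.linspace(-3, 3, 7)
print(relu(x))
print(prelu(x))
```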

Activation function - Rectified Linear Unit - Impact on Training
Krizhevsky (2012) reports much faster learning with ReLU as opposed to tanh.

Pooling
Pooling is used to compress the information propagated to the next level of the network.
In the past, average pooling was used.
More recently (2012), max pooling was re-introduced and experiments show its superiority.
Overlapping pooling seems to be important.
Parameters: size of the pooling window, stride of the pooling window.
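A minimal numpy sketch of max pooling over a single-channel feature map; the window size and stride below are chosen to illustrate overlapping pooling (3 x 3 windows with stride 2) and are not prescribed by the lesson:

```python
import numpy as np

def max_pool2d(fmap, size=2, stride=2):
    """Max pooling of a single-channel feature map with a given window size and stride."""
    H, W = fmap.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = fmap[i * stride:i * stride + size,
                             j * stride:j * stride + size].max()
    return out

fmap = np.arange(36, dtype=float).reshape(6, 6)
print(max_pool2d(fmap, size=3, stride=2))   # overlapping pooling: 2 x 2 output map
```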

Batch Normalization
Any kind of normalization is important.
Batch Normalization is widely used and is the leading form of normalization in terms of performance.
The statistics of the output map are computed.
They are normalized so that they have zero mean and unit variance - a well-behaved input for the next layer.
The normalization statistics (mean and variance) are tracked as a running average throughout the training phase, for use at test time.
Furthermore, the zero-mean, unit-variance statistics are scaled and shifted via learned parameters γ, β.
The main idea: decorrelation, any slice of the network has similar inputs/outputs, faster training.
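A sketch of the batch-normalization forward pass in training mode, assuming a mini-batch of flattened features of shape (N, C); the momentum and epsilon values are common defaults, not taken from the lesson:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, running_mean, running_var, momentum=0.9, eps=1e-5):
    """Batch normalization of a mini-batch x of shape (N, C), training mode.

    The batch mean/variance normalize x to zero mean and unit variance and are
    folded into the running statistics used at test time; gamma and beta are
    the learned scale and shift.
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)                       # zero mean, unit variance
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return gamma * x_hat + beta, running_mean, running_var

x = np.random.randn(16, 8)
gamma, beta = np.ones(8), np.zeros(8)
y, rm, rv = batch_norm_train(x, gamma, beta, np.zeros(8), np.ones(8))
print(y.mean(axis=0).round(6), y.std(axis=0).round(6))
```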

Classification layer - Softmax
The best practice for classification is to use softmax.
Softmax is a function

σ(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}    (8)

that takes the input vector and transforms it so that its entries lie in [0, 1] and sum up to one.
It is a generalization of the logistic function.
If j is the index of a class, then σ(z)_j is the probability of the input belonging to class C_j.
The targets are so-called one-hot vectors - a vector with a one at index j and zeros elsewhere.

Learning - Objective function - Classification
To be able to learn the parameters of the network we need an objective (criterion, loss) function to be optimized.
For the classification task with a softmax layer we use the so-called categorical cross-entropy:

L(p, q) = - \sum_{x} p(x) \log q(x)    (9)

where x is the index of the class, p is the true distribution (one-hot vector) and q is the approximated distribution (softmax).
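A minimal numpy sketch of the softmax of equation (8) and the categorical cross-entropy of equation (9); the small constant inside the logarithm is only there to avoid log(0):

```python
import numpy as np

def softmax(z):
    """Eq. (8); subtracting max(z) keeps the exponentials numerically stable."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def categorical_cross_entropy(p, q):
    """Eq. (9): p is the one-hot target, q is the softmax output."""
    return -np.sum(p * np.log(q + 1e-12))

z = np.array([2.0, 1.0, -1.0])
q = softmax(z)
p = np.array([1.0, 0.0, 0.0])        # one-hot target for class 0
print(q, categorical_cross_entropy(p, q))
```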

Learning - Objective function - Regression
Regression is a form of approximation where we provide inputs and outputs and look for parameters that minimize the difference between generated outputs (predictions) and provided outputs.
Mean squared error:

L(Y, \hat{Y}) = \frac{1}{N} \sum_{i} (y_i - \hat{y}_i)^2    (10)

Mean absolute error:

L(Y, \hat{Y}) = \frac{1}{N} \sum_{i} |y_i - \hat{y}_i|    (11)

Hinge loss:

L(Y, \hat{Y}) = \frac{1}{N} \sum_{i} \max(1 - y_i \hat{y}_i, 0)    (12)
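The three losses can be written directly from equations (10)-(12); the test values below are illustrative (hinge loss assumes targets in {-1, +1}):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)                 # eq. (10)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))                # eq. (11)

def hinge(y, y_hat):
    return np.mean(np.maximum(1 - y * y_hat, 0))     # eq. (12)

y = np.array([1.0, -1.0, 1.0])
y_hat = np.array([0.8, -0.3, 1.5])
print(mse(y, y_hat), mae(y, y_hat), hinge(y, y_hat))
```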

Learning - Stochastic Gradient Descent
To find the optimal values of the parameters ω of the network (weights and biases), we need to use backpropagation.
That is, to compute the partial derivatives of the objective function with respect to the individual parameters.
A CNN has far fewer parameters than a fully connected net - faster convergence.
The most widespread approach is to use stochastic gradient descent:

ω = \arg\min_{ω} L(ω)    (13)

ω_{t+1} = ω_t - ε \left\langle \frac{\partial L}{\partial ω} \Big|_{ω_t} \right\rangle_{D_t}    (14)

where t is the iteration step, ε is the learning rate, and \left\langle \frac{\partial L}{\partial ω} \Big|_{ω_t} \right\rangle_{D_t} is the average over the t-th batch D_t of the derivative with respect to ω, evaluated at ω_t.

Learning - Stochastic Gradient Descent
SGD uses mini-batches to optimize the parameters.
A mini-batch is a subset of the training data - not too small, not too big.
One run through a mini-batch is called an iteration; a run over all mini-batches in the training dataset is called an epoch.
It is very useful to use momentum when computing the gradient update:

v_{t+1} = α v_t - β ε ω_t - ε \left\langle \frac{\partial L}{\partial ω} \Big|_{ω_t} \right\rangle_{D_t}    (15)

where α is the momentum (0.9), β is the weight decay (0.0005), and then

ω_{t+1} = ω_t + v_{t+1}    (16)
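A sketch of one update of equations (15) and (16) in numpy, using the momentum and weight-decay values quoted above and an illustrative learning rate; the gradient vector is a stand-in for the mini-batch average of the partial derivatives:

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One update: v <- alpha*v - beta*eps*w - eps*grad (eq. 15), w <- w + v (eq. 16)."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    w = w + v
    return w, v

w = np.random.randn(10)              # parameters
v = np.zeros_like(w)                 # momentum buffer
grad = np.random.randn(10)           # stand-in for the averaged mini-batch gradient
w, v = sgd_momentum_step(w, v, grad)
print(w)
```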

Overfitting
Overfitting is a common phenomenon when training neural networks:
very good results on training data, very bad results on testing data.

Reducing Overfitting - Data Augmentation
It is the easiest way of fighting overfitting:
by applying label-preserving transformations and thus enlarging the dataset.
Different methods (which can be combined):
1. Taking random (large) crops of the images (and resizing)
2. Horizontal reflection
3. Altering the RGB values of pixels
4. Small geometric transformations
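A minimal numpy sketch of methods 1 and 2 (random crop and horizontal reflection); the crop size is an illustrative choice:

```python
import numpy as np

def augment(image, crop=24):
    """Random crop plus random horizontal flip of an H x W x C image (label preserving)."""
    H, W, _ = image.shape
    top = np.random.randint(0, H - crop + 1)
    left = np.random.randint(0, W - crop + 1)
    patch = image[top:top + crop, left:left + crop, :]
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1, :]              # horizontal reflection
    return patch

image = np.random.rand(32, 32, 3)
print(augment(image).shape)                    # (24, 24, 3)
```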

Reducing Overfitting - Dropout
This method tries to make the individual neurons independent of each other.
Mostly used with fully connected layers.
We set a probability of dropout p_d.
For each training batch we set the output of a neuron to zero with probability p_d.
It is often implemented as a layer.
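A sketch of dropout in training mode, using the common "inverted dropout" formulation that rescales the surviving activations so that nothing has to change at test time (the rescaling is an implementation detail not spelled out in the lesson):

```python
import numpy as np

def dropout_train(x, p_drop=0.5):
    """Zero each activation with probability p_drop and rescale the survivors."""
    mask = (np.random.rand(*x.shape) >= p_drop) / (1.0 - p_drop)
    return x * mask

x = np.random.randn(4, 8)
print(dropout_train(x, p_drop=0.5))
```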

Examples - learned kernels
These are the kernels of the first layer of AlexNet.
Trained on ImageNet, 1000 classes, roughly 1.2 million training images.

Examples - results