Deep Neural Networks (2) Tanh & ReLU layers; Generalisation and Regularisation

Deep Neural Networks (2): Tanh & ReLU layers; Generalisation and Regularisation
Steve Renals
Machine Learning Practical, MLP Lecture 4, 9 October 2018

Recap: Training multi-layer networks

[Figure: a deep network with inputs x_i, two sigmoid hidden layers h^(1) and h^(2), and softmax outputs y_k, annotated with the forward pass (AffineLayer.fprop, SigmoidLayer.fprop, CrossEntropySoftmaxError.grad) and the corresponding backward pass (AffineLayer.bprop, SigmoidLayer.bprop) through each layer.]

Key backprop relations for the second hidden layer:

g^(2)_k = ( Σ_m g^(3)_m w^(3)_mk ) h^(2)_k (1 - h^(2)_k)

∂E^n / ∂w^(2)_kj = g^(2)_k h^(1)_j
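To make the recap concrete, here is a minimal numpy sketch of the fprop/bprop pattern above. The class and method names (AffineLayer, SigmoidLayer, fprop, bprop) follow the slide, but the constructor arguments, the grads_wrt_params helper and the weight shapes are illustrative assumptions, not the course framework's actual code.

```python
import numpy as np

# Minimal sketch of the fprop / bprop chain from the recap slide.
# Names mirror the slide; signatures and weight shapes are assumptions.

class AffineLayer:
    def __init__(self, input_dim, output_dim, rng=np.random):
        self.W = rng.uniform(-0.1, 0.1, (input_dim, output_dim))
        self.b = np.zeros(output_dim)

    def fprop(self, inputs):
        # a = x W + b
        return inputs @ self.W + self.b

    def bprop(self, inputs, grads_wrt_outputs):
        # Gradients with respect to this layer's inputs
        return grads_wrt_outputs @ self.W.T

    def grads_wrt_params(self, inputs, grads_wrt_outputs):
        # dE/dW_ij = sum_n g_j^n x_i^n   (cf. dE^n/dw_kj = g_k h_j)
        return inputs.T @ grads_wrt_outputs, grads_wrt_outputs.sum(axis=0)

class SigmoidLayer:
    def fprop(self, inputs):
        return 1.0 / (1.0 + np.exp(-inputs))

    def bprop(self, outputs, grads_wrt_outputs):
        # dE/da = dE/dh * h(1 - h)
        return grads_wrt_outputs * outputs * (1.0 - outputs)

# Forward then backward through one hidden layer:
x = np.random.randn(8, 20)               # batch of 8 inputs
affine, sigmoid = AffineLayer(20, 50), SigmoidLayer()
a = affine.fprop(x)
h = sigmoid.fprop(a)
grads_wrt_h = np.random.randn(*h.shape)  # stands in for gradients from the layer above
grads_wrt_a = sigmoid.bprop(h, grads_wrt_h)
dW, db = affine.grads_wrt_params(x, grads_wrt_a)
```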

Are there alternatives to Sigmoid Hidden Units?

Sigmoid function

Logistic sigmoid activation function: g(a) = 1 / (1 + exp(-a))

[Figure: plot of g(a) against a for a in [-5, 5], rising from 0 towards 1 through g(0) = 0.5.]

Sigmoid Hidden Units (SigmoidLayer)

Compress unbounded inputs to (0,1), saturating high magnitudes to 1
Interpretable as the probability that the feature defined by the unit's weight vector is present
Interpretable as the (normalised) firing rate of a neuron

However...
Saturation causes gradients to approach 0: if the output of a sigmoid unit is h, then its local gradient is h(1 - h), which approaches 0 as h saturates to 0 or 1, so the gradients it multiplies into also approach 0. Small gradients result in small parameter changes, so learning becomes slow.
Outputs are not centred at 0: the output of a sigmoid layer will have mean > 0, which is numerically undesirable.
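To see the saturation problem numerically, here is a small illustrative check (plain numpy, not lab framework code) of the local gradient h(1 - h) as the pre-activation grows:

```python
import numpy as np

# Illustrative check of sigmoid saturation: the local gradient h(1 - h)
# collapses once the pre-activation a is a few units away from zero, so
# little error signal passes back through the unit.

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for a in [0.0, 2.0, 5.0, 10.0]:
    h = sigmoid(a)
    print(f"a = {a:5.1f}   h = {h:.5f}   h(1 - h) = {h * (1 - h):.1e}")
# The gradient falls from 0.25 at a = 0 to roughly 5e-5 at a = 10.
```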

tanh

tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x})

Relation to the logistic sigmoid: sigmoid(x) = (1 + tanh(x/2)) / 2

Derivative: d/dx tanh(x) = 1 - tanh^2(x)

tanh hidden units (TanhLayer)

tanh has the same shape as the sigmoid but has output range (-1, 1)
Results about the approximation capability of sigmoid layers also apply to tanh layers
A possible reason to prefer tanh over sigmoid: because units can take positive or negative values, the gradients for the weights into a hidden unit can have different signs

[Figure: two hidden layers, showing the weights from the first hidden layer units h^(1)_j into a second-layer unit h^(2)_k with gradient g^(2)_k.]

Saturation is still a problem

Rectified Linear Unit (ReLU)

relu(x) = max(0, x)

Derivative: d/dx relu(x) = 0 if x ≤ 0; 1 if x > 0
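For reference, the three activations discussed in these slides and their derivatives can be written in a few lines of numpy; this is an illustrative sketch, not the lab framework's layer code:

```python
import numpy as np

# The three hidden-unit activations from these slides, with their
# derivatives as functions of the pre-activation x. Illustrative only.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    h = sigmoid(x)
    return h * (1.0 - h)            # h(1 - h)

def tanh_deriv(x):
    return 1.0 - np.tanh(x) ** 2    # 1 - tanh^2(x)

def relu(x):
    return np.maximum(0.0, x)

def relu_deriv(x):
    return np.where(np.asarray(x) > 0, 1.0, 0.0)  # 0 for x <= 0, 1 for x > 0
```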

ReLU hidden units (ReluLayer)

Similar approximation results to tanh and sigmoid hidden units
Empirical results for speech and vision show consistent improvements using relu over sigmoid or tanh
Unlike tanh or sigmoid there is no positive saturation (saturation results in very small derivatives, and hence slower learning)
Negative input to relu results in zero gradient (and hence no learning)
Relu is computationally efficient: max(0, x)
Relu units can "die" (i.e. respond with 0 to everything)
Relu units can be very sensitive to the learning rate
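A plausible sketch of how TanhLayer and ReluLayer fprop/bprop could look, in the same style as the SigmoidLayer sketch earlier. The class names appear on the slides; the bodies here are assumptions, not the mlp framework's implementation:

```python
import numpy as np

# Sketch of TanhLayer and ReluLayer in the style used above. Illustrative only.

class TanhLayer:
    def fprop(self, inputs):
        return np.tanh(inputs)

    def bprop(self, outputs, grads_wrt_outputs):
        # d tanh / da = 1 - tanh^2(a) = 1 - h^2
        return grads_wrt_outputs * (1.0 - outputs ** 2)

class ReluLayer:
    def fprop(self, inputs):
        return np.maximum(0.0, inputs)

    def bprop(self, outputs, grads_wrt_outputs):
        # Gradient is 1 where the unit was active, 0 elsewhere
        return grads_wrt_outputs * (outputs > 0)
```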

Generalisation

Generalization

Generalization: what is the expected error on a test set? How can we compare the accuracy of different networks trained on the same data?

Causes of error:
Network too flexible: too many weights compared with the number of training examples
Network not flexible enough: not enough weights (hidden units) to represent the desired mapping

When comparing models, it can be helpful to compare systems with the same number of trainable parameters (i.e. the number of trainable weights in a neural network)

Optimizing training set performance does not necessarily optimize test set performance...

Training / Test / Validation Data

Partitioning the data:
Training data: data used for training the network
Validation data: frequently used to measure the error of a network on unseen data (e.g. after each epoch)
Test data: less frequently used unseen data, ideally only used once

Frequent use of the same test data can indirectly tune the network to that data (e.g. by influencing the choice of hyperparameters such as learning rate, number of hidden units, number of layers, ...)

Measuring generalisation

Generalisation error: the expected error on unseen data. How can the generalisation error be estimated?

Training error?
E_train = - Σ_{n ∈ training set} Σ_{k=1..K} t^n_k ln y^n_k

Validation error?
E_val = - Σ_{n ∈ validation set} Σ_{k=1..K} t^n_k ln y^n_k

Cross-validation

Optimize network performance given a fixed training set
Hold out a set of data (the validation set) and predict generalization performance on this set:
1 Train the network in the usual way on the training data
2 Estimate the performance of the network on the validation set
If several networks are trained on the same data, choose the one that performs best on the validation set (not the training set)

n-fold cross-validation: divide the data into n partitions; select each partition in turn to be the validation set, and train on the remaining (n - 1) partitions. Estimate the generalization error by averaging over all validation sets (see the sketch below).
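A short sketch of n-fold cross-validation as described above. The train_and_evaluate callable is a hypothetical stand-in for "train the network, then return its validation-set error"; it is not part of the lab framework:

```python
import numpy as np

# Sketch of n-fold cross-validation over numpy arrays of inputs and targets.
# `train_and_evaluate` is a hypothetical helper supplied by the caller.

def n_fold_cross_validation(inputs, targets, n_folds, train_and_evaluate, rng=np.random):
    indices = rng.permutation(len(inputs))
    folds = np.array_split(indices, n_folds)
    errors = []
    for i in range(n_folds):
        valid_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        errors.append(train_and_evaluate(inputs[train_idx], targets[train_idx],
                                         inputs[valid_idx], targets[valid_idx]))
    # Estimate of generalisation error: average over the n validation folds
    return np.mean(errors)
```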

Overtraining

Overtraining corresponds to a network function fitted too closely to the training set (too much flexibility)
Undertraining corresponds to a network function not well fitted to the training set (too little flexibility)

Solutions:
If possible, scale network complexity in line with the training set size
Use prior information to constrain the network function
Control the flexibility: structural stabilization
Control the effective flexibility: early stopping and regularization

Structural Stabilization

Directly control the number of weights:
Compare models with different numbers of hidden units
Start with a large network and reduce the number of weights by pruning individual weights or hidden units
Weight sharing: use prior knowledge to constrain the weights on a set of connections to be equal, as in convolutional neural networks

Lab 4: 04 Generalisation and overfitting

Lab 4 explores overfitting and how we can measure how well the models we train generalise their predictions to unseen data:
Setting up a one-dimensional regression problem
Using a radial basis function (RBF) network as a model for this problem
Exploring the behaviour of the RBF network as the number of model parameters (basis functions) increases

Early Stopping

Use the validation set to decide when to stop training:
Training set error monotonically decreases as training progresses
Validation set error will reach a minimum and then start to increase
Best generalization is predicted to be at the point of minimum validation set error (a sketch of the corresponding training loop follows below)

[Figure: error E against training time t; the training-set curve decreases monotonically while the validation-set curve reaches its minimum at t*, the early-stopping point.]

Why does early stopping improve generalisation?
Effective flexibility increases as training progresses: the network has an increasing number of effective degrees of freedom, and its weights become more tuned to the training data
Early stopping is very effective and is used in many practical applications such as speech recognition and optical character recognition
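A sketch of early stopping as a training loop with a patience counter. The train_one_epoch and validation_error callables, and the get_weights/set_weights methods, are hypothetical stand-ins rather than the course framework's API:

```python
# Sketch of early stopping with a "patience" counter. The helpers passed in
# (train_one_epoch, validation_error) and the model's get_weights/set_weights
# methods are assumptions for illustration only.

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=100, patience=5):
    best_error = float("inf")
    best_weights = None
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        error = validation_error(model)
        if error < best_error:
            best_error, best_weights = error, model.get_weights()
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                    # validation error stopped improving
    model.set_weights(best_weights)      # roll back to the minimum (t* in the figure)
    return model, best_error
```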

Generalisation by design

Regularisation: penalise the weights, L1 (sparsity) or L2 (weight decay)
Data augmentation: generate additional (noisy) training data
Model combination: smooth together multiple networks
Dropout: randomly delete a fraction of hidden units each minibatch (see the sketch below)
Parameter sharing: e.g. convolutional networks
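Dropout is only named on this slide, but a minimal inverted-dropout sketch (an illustration, not course code) makes the idea of randomly deleting hidden units each minibatch concrete:

```python
import numpy as np

# Minimal sketch of (inverted) dropout on a layer's hidden activations:
# each minibatch, a random fraction of units is zeroed; the survivors are
# rescaled so the expected activation is unchanged at test time.

def dropout(hiddens, incl_prob=0.8, rng=np.random):
    mask = rng.binomial(n=1, p=incl_prob, size=hiddens.shape)
    return hiddens * mask / incl_prob

h = np.random.rand(4, 10)   # a minibatch of hidden activations
h_dropped = dropout(h)      # roughly 20% of entries set to zero
```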

Weight Decay (L2 Regularisation)

Weight decay puts a "spring" on the weights:
If the training data puts a consistent force on a weight, it will outweigh the weight decay
If training does not consistently push a weight in one direction, then the weight decay will dominate and the weight will decay to 0
Without weight decay, such a weight would wander randomly without being well determined by the data
Weight decay can therefore allow the data to determine how to reduce the effective number of parameters

Penalizing Complexity

Consider adding a complexity term E_W to the network error function, to encourage smoother mappings:

E^n = E^n_train (data term) + β E_W (prior term)

E_train is the usual error function:

E^n_train = - Σ_{k=1..K} t^n_k ln y^n_k

E_W should be a differentiable flexibility/complexity measure, e.g. the L2 penalty:

E_W = E_L2 = (1/2) Σ_i w_i^2,   with   ∂E_L2 / ∂w_i = w_i

Gradient Descent Training with Weight Decay

Weight decay corresponds to adding E_L2 = (1/2) Σ_i w_i^2 to the error function
Adding such complexity terms to the error function is called regularisation
When used with gradient descent, weight decay corresponds to L2 regularisation:

∂E^n / ∂w_i = ∂(E^n_train + β E_L2) / ∂w_i = ∂E^n_train / ∂w_i + β ∂E_L2 / ∂w_i = ∂E^n_train / ∂w_i + β w_i

Δw_i = -η ( ∂E^n_train / ∂w_i + β w_i )
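A one-line sketch of the resulting update rule, Δw_i = -η (∂E^n_train/∂w_i + β w_i); the grads_wrt_train_error array is assumed to hold the data-term gradients for the current minibatch:

```python
import numpy as np

# Single gradient-descent parameter update with L2 weight decay.
# The beta * weights term is the gradient of E_L2 = 0.5 * sum_i w_i^2.

def sgd_update_with_weight_decay(weights, grads_wrt_train_error,
                                 learning_rate=0.1, beta=1e-4):
    return weights - learning_rate * (grads_wrt_train_error + beta * weights)

W = np.random.randn(100, 50)
dW = np.random.randn(100, 50)   # stands in for the data-term gradient dE_train/dW
W = sgd_update_with_weight_decay(W, dW)
```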

L1 Regularisation

L1 regularisation corresponds to adding a term based on the sum of the absolute values of the weights to the error:

E^n = E^n_train (data term) + β E^n_L1 (prior term),   E_L1 = Σ_i |w_i|

Gradients:

∂E^n / ∂w_i = ∂E^n_train / ∂w_i + β ∂E_L1 / ∂w_i = ∂E^n_train / ∂w_i + β sgn(w_i)

where sgn(w_i) is the sign of w_i: sgn(w_i) = +1 if w_i > 0 and sgn(w_i) = -1 if w_i < 0

L1 vs L2

L1 and L2 regularisation both have the effect of penalising larger weights:
In L2, weights shrink towards 0 at a rate proportional to their size (β w_i)
In L1, weights shrink towards 0 at a constant rate (β sgn(w_i))

Behaviour of L1 and L2 regularisation with large and small weights:
when w_i is large, L2 shrinks it faster than L1
when w_i is small, L1 shrinks it faster than L2

So L1 tends to shrink some weights to exactly 0, leaving a few large important connections: L1 encourages sparsity

∂E_L1 / ∂w_i is undefined when w_i = 0; assume it is 0 (i.e. take sgn(0) = 0 in the update equation), as illustrated in the snippet below
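The difference in shrinkage behaviour is easy to see by evaluating the two penalty gradients on a few weights (an illustrative snippet; note that np.sign already uses the slide's convention sgn(0) = 0):

```python
import numpy as np

# L2 shrinks each weight in proportion to its size; L1 shrinks every
# non-zero weight at the same constant rate, driving small weights to 0.

w = np.array([-2.0, -0.1, 0.0, 0.05, 3.0])
beta = 0.01

l2_penalty_grad = beta * w           # beta * w_i
l1_penalty_grad = beta * np.sign(w)  # beta * sgn(w_i), with sgn(0) = 0

print(l2_penalty_grad)  # large weights get large shrinkage
print(l1_penalty_grad)  # every non-zero weight shrinks by the same amount
```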

Data Augmentation

Adding "fake" training data:
Generalisation performance improves with the amount of training data (change MNISTDataProvider to give training sets of 1 000 / 5 000 / 10 000 examples to see this)
Given a finite training set we could create further training examples...
Create new examples by making small rotations of existing data
Add a small amount of random noise
Using realistic distortions to create new data is better than adding random noise (see the sketch below)
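A sketch of this kind of augmentation for MNIST-style 28×28 images: small random rotations plus a little additive noise. It uses scipy.ndimage.rotate and is a standalone illustration, not a drop-in replacement for MNISTDataProvider:

```python
import numpy as np
from scipy.ndimage import rotate

# Create extra training examples from existing flattened 28x28 images by
# applying a small random rotation and a little additive Gaussian noise.

def augment(images, max_angle=10.0, noise_std=0.05, rng=np.random):
    augmented = []
    for flat in images:
        img = flat.reshape(28, 28)
        angle = rng.uniform(-max_angle, max_angle)
        rotated = rotate(img, angle, reshape=False, mode="nearest")
        noisy = rotated + rng.normal(0.0, noise_std, rotated.shape)
        augmented.append(np.clip(noisy, 0.0, 1.0).reshape(-1))
    return np.stack(augmented)
```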

Model Combination

Combining the predictions of multiple models can reduce overfitting
Model combination works best when the component models are complementary: no single model works best on all data points

Creating a set of diverse models:
Different NN architectures (number of hidden units, number of layers, hidden unit type, input features, type of regularisation, ...)
Different model families (NN, SVM, decision trees, ...)

How to combine models?
Average their outputs
Linearly combine their outputs (averaging and linear combination are sketched below)
Train another combiner neural network whose input is the outputs of the component networks
Use architectures designed to create a set of specialised models which can be combined (e.g. mixtures of experts)
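The two simplest combination rules from the list above can be sketched in a few lines; model_outputs is assumed to be a list of equally shaped (batch, classes) arrays of softmax outputs, one per component model:

```python
import numpy as np

# Plain average and weighted linear combination of component model outputs.

def average_outputs(model_outputs):
    return np.mean(model_outputs, axis=0)

def linear_combination(model_outputs, weights):
    weights = np.asarray(weights) / np.sum(weights)  # normalise the mixing weights
    return np.tensordot(weights, np.stack(model_outputs), axes=1)
```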

Lab 5: 05 Regularisation

Lab 5 explores different methods for regularising networks to reduce overfitting and improve generalisation. In the context of a feed-forward network using ReLU hidden layers, the lab explores:
L1 and L2 regularisation
Data augmentation

Summary

Tanh and ReLU hidden units
Generalisation and overfitting
Preventing overfitting:
L2 regularisation (weight decay)
L1 regularisation (sparsity)
Creating additional training data
Model combination

Reading:
Nielsen, Neural Networks and Deep Learning, chapter 3 (section on overfitting and regularization): http://neuralnetworksanddeeplearning.com/chap3.html#overfitting_and_regularization
Goodfellow et al, Deep Learning, chapter 7 (sections 7.1-7.5): http://www.deeplearningbook.org/contents/regularization.html