Are there alternatives to Sigmoid Hidden Units?

Hidden Unit Transfer Functions; Initialising Deep Networks
Steve Renals, Machine Learning Practical, MLP Lecture 6, 28 October 2015

tanh
tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x});  sigmoid(x) = (1 + tanh(x/2)) / 2
Derivative: d/dx tanh(x) = 1 - tanh^2(x)
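As a quick check of these identities, here is a small NumPy sketch (my own addition, not part of the slides) that verifies sigmoid(x) = (1 + tanh(x/2))/2 and compares the analytic derivative 1 - tanh^2(x) against a finite difference:

```python
import numpy as np

x = np.linspace(-4.0, 4.0, 9)

tanh = np.tanh(x)
sigmoid = 1.0 / (1.0 + np.exp(-x))

# Identity from the slide: sigmoid(x) = (1 + tanh(x/2)) / 2
assert np.allclose(sigmoid, (1.0 + np.tanh(x / 2.0)) / 2.0)

# Derivative from the slide: d/dx tanh(x) = 1 - tanh(x)^2,
# checked against a central finite difference.
eps = 1e-5
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2.0 * eps)
analytic = 1.0 - np.tanh(x) ** 2
assert np.allclose(numeric, analytic, atol=1e-8)

print("tanh outputs stay in (-1, 1):", tanh.min(), tanh.max())
```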

tanh hidden units
- tanh has the same shape as sigmoid but has output range ±1
- Results about the approximation capability of sigmoid networks also apply to tanh networks
- Possible reason to prefer tanh over sigmoid: allowing units to be positive or negative allows the gradients for the weights into a hidden unit to have different signs
[Diagram: hidden layer h^(1) (units h^(1)_1 ... h^(1)_H) feeding hidden layer h^(2) (units h^(2)_1 ... h^(2)_H)]

Rectified Linear Unit (ReLU)
relu(x) = max(0, x)
Derivative: d/dx relu(x) = 0 if x <= 0, 1 if x > 0
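A minimal NumPy sketch (my own illustration, not the course framework; the function names are made up) of a ReLU forward pass and of how gradients pass back through it. Negative inputs block the gradient completely, which is behind the "dying ReLU" issue discussed on the next slide.

```python
import numpy as np

def relu_fprop(x):
    """relu(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def relu_bprop(x, grad_out):
    """Backpropagate gradients through ReLU.

    d relu(x)/dx is 1 where x > 0 and 0 otherwise, so gradients
    for units with negative (or zero) input are zeroed out.
    """
    return grad_out * (x > 0.0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
grad_out = np.ones_like(x)          # pretend upstream gradient of 1 everywhere
print(relu_fprop(x))                # [0.  0.  0.  0.5 2. ]
print(relu_bprop(x, grad_out))      # [0.  0.  0.  1.  1. ]
```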

ReLU hidden units
- Similar approximation results to tanh and sigmoid hidden units
- Empirical results for speech and vision show consistent improvements using ReLU over sigmoid or tanh
- Unlike tanh or sigmoid there is no positive saturation; saturation results in very small derivatives (and hence slower learning)
- Negative input to ReLU results in zero gradient (and hence no learning)
- ReLU is computationally efficient: max(0, x)
- ReLU units can die (i.e. respond with 0 to everything)
- ReLU units can be very sensitive to the learning rate

Maxout units
- A unit that takes the max of two linear functions z_i = w_i · h^{L-1}: h = max(z_1, z_2) (if w_2 = 0 then we recover ReLU)
- Has the benefits of ReLU (piecewise linear, no saturation), without the drawback of dying units
- Twice the number of parameters
[Diagram: maxout units in layer L, each taking the max over two linear combinations of layer L-1 outputs]

Generalising maxout
- Units can take the max over G linear functions z_i: h = max_{i=1..G}(z_i)
- Maxout can be generalised to other functions, e.g. the p-norm (typically p = 2): h = ||z||_p = (sum_{i=1..G} |z_i|^p)^{1/p}
- p can be learned by gradient descent. (Exercise: What is the gradient ∂E/∂p for a p-norm unit?)
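A small NumPy sketch of maxout and p-norm pooling over G linear functions (my own illustration; the weight shapes and sizes are assumptions, not the course framework):

```python
import numpy as np

rng = np.random.RandomState(0)
G, d_in, d_out = 3, 5, 4            # G linear functions per output unit

h_prev = rng.randn(d_in)            # previous layer outputs h^{L-1}
W = rng.randn(G, d_out, d_in)       # one weight matrix per linear function
b = rng.randn(G, d_out)

z = np.einsum('gij,j->gi', W, h_prev) + b     # z_i, shape (G, d_out)

h_maxout = z.max(axis=0)                      # maxout: h = max_i z_i

p = 2.0
h_pnorm = (np.abs(z) ** p).sum(axis=0) ** (1.0 / p)   # p-norm: h = ||z||_p

print(h_maxout.shape, h_pnorm.shape)          # (4,) (4,)
```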

How should we initialise deep networks?

Initialising deep networks (Pretraining)
Why is training deep networks hard?
- Vanishing (or exploding) gradients: gradients for layers closer to the input layer are computed multiplicatively using backprop
- If sigmoid/tanh hidden units near the output saturate then the back-propagated gradients will be very small
- Good discussion in chapter 5 of Neural Networks and Deep Learning
Solve by stacked pretraining:
- Train the first hidden layer
- Add a new hidden layer, and train only the parameters relating to the new hidden layer. Repeat.
- Then use the pretrained weights to initialise the network and fine-tune the complete network using gradient descent
Approaches to pre-training:
- Supervised: layer-by-layer cross-entropy training
- Unsupervised: autoencoders
- Unsupervised: restricted Boltzmann machines (not covered in this course)
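A rough numerical illustration of the vanishing-gradient point (my own toy example; the layer widths, weight scale and pre-activations are arbitrary): each backprop step multiplies by a weight matrix and by the sigmoid derivative, which is at most 0.25, so the gradient reaching the earliest layers collapses.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.RandomState(42)
n_layers, width = 8, 100
grad = np.ones(width)               # pretend gradient arriving at the top layer

for layer in reversed(range(n_layers)):
    a = rng.randn(width) * 2.0      # pre-activations; large values saturate the sigmoid
    W = rng.randn(width, width) * 0.1
    # one backprop step: multiply by W^T and by the sigmoid derivative (<= 0.25)
    grad = (W.T @ grad) * sigmoid(a) * (1.0 - sigmoid(a))
    print(f"layer {layer}: mean |grad| = {np.abs(grad).mean():.2e}")
```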

Greedy layer-by-layer cross-entropy training
1. Train a network with one hidden layer
2. Remove the output layer and the weights leading to the output layer
3. Add an additional hidden layer and train only the newly added weights
4. Go to 2, or fine-tune and stop if the network is deep enough
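A compact NumPy sketch of this procedure (my own illustration with toy data and made-up sizes, not the course framework): the network is grown one sigmoid hidden layer at a time, only the newly added weights (plus a fresh output layer) are trained with cross-entropy at each stage, and the whole stack is fine-tuned at the end.

```python
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(X, hidden_Ws, out_W):
    """Return the activations at every layer plus the softmax output."""
    hs = [X]
    for W in hidden_Ws:
        hs.append(sigmoid(hs[-1] @ W))
    return hs, softmax(hs[-1] @ out_W)

def train(X, T, hidden_Ws, out_W, train_from, n_epochs=20, lr=0.1):
    """Cross-entropy gradient descent, updating only hidden layers >= train_from."""
    for _ in range(n_epochs):
        hs, Y = forward(X, hidden_Ws, out_W)
        delta = (Y - T) / len(X)                 # dE/d(output pre-activation)
        out_W -= lr * hs[-1].T @ delta
        grad_h = delta @ out_W.T
        for i in reversed(range(len(hidden_Ws))):
            delta = grad_h * hs[i + 1] * (1.0 - hs[i + 1])
            if i >= train_from:                  # freeze earlier (pretrained) layers
                hidden_Ws[i] -= lr * hs[i].T @ delta
            grad_h = delta @ hidden_Ws[i].T
    return hidden_Ws, out_W

# Toy data: 200 examples, 10 inputs, 3 classes (purely illustrative).
X = rng.randn(200, 10)
T = np.eye(3)[rng.randint(0, 3, size=200)]

hidden_Ws, n_hidden = [], 50
for depth in range(3):                           # grow the network one layer at a time
    d_in = X.shape[1] if depth == 0 else n_hidden
    hidden_Ws.append(0.1 * rng.randn(d_in, n_hidden))   # step 3: new hidden layer
    out_W = 0.1 * rng.randn(n_hidden, 3)                # step 2: fresh output layer
    # train only the newly added hidden layer (and the new output layer)
    hidden_Ws, out_W = train(X, T, hidden_Ws, out_W, train_from=depth)

# Step 4: fine-tune the complete network.
hidden_Ws, out_W = train(X, T, hidden_Ws, out_W, train_from=0)
```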

Autoencoders
- An autoencoder is a neural network trained to map its input into a distributed representation from which the input can be reconstructed
- Example: a single hidden layer network, with an output of the same dimension as the input, trained to reproduce the input using the squared error cost function E = (1/2) ||y - x||^2
- y: d-dimension outputs; x: d-dimension inputs; the hidden layer gives the learned representation
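A minimal NumPy sketch of such an autoencoder (my own illustration; the sizes and learning rate are arbitrary): a sigmoid hidden layer, a linear output of the same dimension as the input, and squared-error training by gradient descent.

```python
import numpy as np

rng = np.random.RandomState(0)
d, n_hidden, n_examples = 20, 8, 500

X = rng.randn(n_examples, d)                 # toy inputs (d-dimensional)
W1 = 0.1 * rng.randn(d, n_hidden)            # encoder weights
W2 = 0.1 * rng.randn(n_hidden, d)            # decoder weights
lr = 0.05

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for epoch in range(50):
    H = sigmoid(X @ W1)                      # learned representation
    Y = H @ W2                               # reconstruction (linear output)
    E = 0.5 * np.sum((Y - X) ** 2) / n_examples
    # backprop of the squared-error cost
    dY = (Y - X) / n_examples                # dE/dY
    dW2 = H.T @ dY
    dH = dY @ W2.T
    dW1 = X.T @ (dH * H * (1.0 - H))
    W1 -= lr * dW1
    W2 -= lr * dW2

print(f"final reconstruction error: {E:.4f}")
```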

Stacked autoencoders
- Can the hidden layer just copy the input (if it has an equal or higher dimension)?
- In practice, experiments show that nonlinear autoencoders trained with stochastic gradient descent result in useful hidden representations
- Early stopping acts as a regulariser
- Stacked autoencoders: train a sequence of autoencoders, layer-by-layer
- First train a single hidden layer autoencoder
- Then use the learned hidden layer as the input to a new autoencoder

Stacked Autoencoders
[Diagram: a stack of autoencoder layers: Input, Hidden 1, Hidden 2, Hidden 3]

Pretraining using a stacked autoencoder
[Diagram: network built from Input, Hidden 1, Hidden 2, Hidden 3 and, finally, an Output layer]
1. Initialise the hidden layers (with the weights learned by the stacked autoencoder)
2. Train the output layer
3. Fine-tune the whole network
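A sketch of the full pipeline (my own illustration, not the course framework): train a stack of autoencoders layer by layer, use the encoder weights to initialise the hidden layers, then add and train an output layer. Here a least-squares readout stands in for cross-entropy training of the output layer, and fine-tuning is left as a comment.

```python
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(X, n_hidden, n_epochs=50, lr=0.05):
    """Train a single-hidden-layer autoencoder on X; return the encoder weights."""
    W1 = 0.1 * rng.randn(X.shape[1], n_hidden)
    W2 = 0.1 * rng.randn(n_hidden, X.shape[1])
    for _ in range(n_epochs):
        H = sigmoid(X @ W1)
        dY = (H @ W2 - X) / len(X)
        dH = dY @ W2.T
        W2 -= lr * H.T @ dY
        W1 -= lr * X.T @ (dH * H * (1.0 - H))
    return W1

# Toy data: 300 examples, 30 inputs, 3 classes (purely illustrative).
X = rng.randn(300, 30)
labels = rng.randint(0, 3, size=300)

# 1. Initialise the hidden layers with a stack of autoencoders.
layer_sizes = [20, 15, 10]
encoder_Ws, inputs = [], X
for n_hidden in layer_sizes:
    W = train_autoencoder(inputs, n_hidden)
    encoder_Ws.append(W)
    inputs = sigmoid(inputs @ W)             # feed hidden codes to the next autoencoder

# 2. Add an output layer and train it (a linear least-squares readout on the
#    top-level codes, standing in for cross-entropy training of the output layer).
T = np.eye(3)[labels]
out_W = np.linalg.lstsq(inputs, T, rcond=None)[0]

# 3. Fine-tune the whole network (encoder_Ws + out_W) with backprop,
#    e.g. using the greedy-training sketch shown earlier with train_from=0.
```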

Denoising Autoencoders
- Basic idea: map from a corrupted version of the input to a clean version (at the output)
- Forces the learned representation to be stable and robust to noise and variations in the input
- Performing the denoising task well requires a representation which models the important structure in the input
- The aim is to learn a representation that is robust to noise, not to perform the denoising mapping as well as possible
- Noise in the input:
  - Random Gaussian noise added to each input vector
  - Masking: randomly setting some components of the input vector to 0
  - Salt & pepper: randomly setting some components of the input vector to 0 and others to 1
- Stacked denoising autoencoders: noise is only applied to the input vectors, not to the learned representations
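Small NumPy sketches (my own; the parameter values are arbitrary) of the three corruption schemes; each produces the noisy input that is fed to the denoising autoencoder, while the clean input is kept as the training target.

```python
import numpy as np

rng = np.random.RandomState(0)

def gaussian_noise(x, sigma=0.1):
    """Add random Gaussian noise to each component of the input."""
    return x + sigma * rng.randn(*x.shape)

def masking_noise(x, p=0.3):
    """Randomly set a fraction p of the components to 0."""
    mask = rng.rand(*x.shape) >= p
    return x * mask

def salt_and_pepper_noise(x, p=0.3):
    """Randomly set some components to 0 and others to 1."""
    corrupted = x.copy()
    flip = rng.rand(*x.shape) < p
    corrupted[flip] = rng.randint(0, 2, size=flip.sum()).astype(x.dtype)
    return corrupted

x_clean = rng.rand(5, 8)                 # a small batch of clean inputs
x_noisy = masking_noise(x_clean)         # train the autoencoder to map x_noisy -> x_clean
```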

Denoising Autoencoder
E = (1/2) ||y - x||^2
y: d-dimension outputs; x̃: d-dimension inputs (noisy); x: d-dimension inputs (clean); the hidden layer gives the learned representation. The network maps the noisy input x̃ to a reconstruction y, which is trained against the clean x.

Summary
- Hidden unit transfer functions: tanh, ReLU, maxout
- Layer-by-layer pretraining and autoencoders
- For many tasks (e.g. MNIST) pre-training seems to be necessary / useful for training deep networks
- For some tasks with very large sets of training data (e.g. speech recognition) pre-training may not be necessary
- (Can also pre-train using stacked restricted Boltzmann machines)
Reading:
- Michael Nielsen, chapter 5 of Neural Networks and Deep Learning, http://neuralnetworksanddeeplearning.com/chap5.html
- Pascal Vincent et al, "Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion", JMLR, 11:3371-3408, 2010. http://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf