Recurrent Neural Networks 1: Modelling sequential data
Steve Renals
Machine Learning Practical: MLP Lecture 9
15 November 2017 / 20 November 2017
Sequential Data
We often wish to model data that is a sequence or trajectory through time, for instance audio signals, text (sequences of characters/words), currency exchange rates, or the motion of an animal.
Modelling sequential data:
- Invariances across time
- The current state depends on the past
- Need to share data across time
Convolutional networks model invariances across space; can we do something similar across time?
- Yes: time-delay neural networks
Can we use units to act as memories?
- Yes: recurrent networks
Recap: Space invariance
[Figure: a convolutional network on a 28x28 input: 3x24x24 feature maps, 3x12x12 pooling layers, 6x8x8 feature maps, 6x4x4 pooling layers.]
Key ideas: local connectivity and weight sharing.
Modelling sequences
Imagine modelling a time sequence of 3D vectors (x1, x2, x3) at times t = 0, 1, 2, 3, ...
- We can model a fixed context with a feed-forward network, with the input vectors from previous time steps added to the network input (e.g. inputs at t-2, t-1 and t: 2 frames of context).
- Alternatively, model the sequence using 1-dimensional convolutions in time: a time-delay neural network (TDNN), with a 1D conv layer followed by a fully-connected layer, covering T frames of context (t-T, ..., t-1, t). A sketch of such a convolution in time follows below.
- Either way, the network takes into account only a finite context.
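To make the finite-context idea concrete, here is a minimal numpy sketch (not part of the original slides) of a 1D convolution across time over a sequence of 3-dimensional input vectors; the function and variable names are illustrative, and the kernel width k plays the role of the fixed context window.

```python
import numpy as np

def conv1d_in_time(x, w, b):
    """1D convolution across time.

    x: (T, d_in)        sequence of T input vectors (here d_in = 3)
    w: (k, d_in, d_out) kernel spanning k time steps (the context window)
    b: (d_out,)         bias
    returns: (T - k + 1, d_out), one output vector per valid window
    """
    k, T = w.shape[0], x.shape[0]
    out = []
    for t in range(T - k + 1):
        window = x[t:t + k]                               # k frames of context
        out.append(np.einsum('kd,kdo->o', window, w) + b)
    return np.maximum(np.stack(out), 0.0)                 # ReLU non-linearity

# Example: 10 time steps of 3D vectors, 3 frames of context, 8 output channels
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 3))
w = rng.normal(size=(3, 3, 8)) * 0.1
h = conv1d_in_time(x, w, np.zeros(8))
print(h.shape)   # (8, 8)
```

Each output at time t depends only on the frames inside its window, which is exactly why such a network can only model a finite context.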
TDNNs
[Figure: computation in a TDNN operating on 23 frames of input context, with sub-sampling (red) and without sub-sampling (blue+red), across layers 1-4.]
Peddinti et al (2015), Reverberation robust acoustic modeling using i-vectors with time delay neural networks, Interspeech 2015, http://www.danielpovey.com/files/2015_interspeech_aspire.pdf
WaveNet
van den Oord et al (2016), WaveNet: A Generative Model for Raw Audio, https://arxiv.org/abs/1609.03499
Networks with state
- Feed-forward = finite context: feed-forward networks (even fancy ones like WaveNet) compute the output based on a finite input history. Sometimes the required context is known, but often it is not.
- State units: we would like a network with state across time, so that if an event happens we can potentially know about it many time steps in the future.
  - State units as memory: remember things for (potentially) an infinite time
  - State units as information compression: compress a sequence into a state representation
- Recurrent networks have state units. [Figure: hidden state h feeds back on itself through a delay, driven by input x.]
Recurrent networks
[Figure: at each time t, the 3D input (x1, x2, x3) feeds a recurrent hidden layer, which feeds an output layer; the hidden layer also feeds back to itself at the next time step.]
Graphical model of a recurrent network
[Figure: output y, hidden state h with a self-connection through a delay, input x.]
Unfolding the recurrent network in time gives a chain: at each step, x_t feeds h_t, h_t feeds y_t, and h_{t-1} feeds h_t (shown for t-1, t, t+1).
Simple recurrent network

$$h_j(t) = \mathrm{sigmoid}\Bigg( \sum_{s=0}^{d} w^{(1)}_{js} x_s(t) + \underbrace{\sum_{r=0}^{H} w^{(R)}_{jr} h_r(t-1)}_{\text{recurrent part}} + b_j \Bigg)$$

$$y_k(t) = \mathrm{softmax}\Bigg( \sum_{r=0}^{H} w^{(2)}_{kr} h_r(t) + b_k \Bigg)$$

Hidden(t) depends on Input(t) and Hidden(t-1); Output(t) depends on Hidden(t).
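A minimal numpy sketch of this forward pass (not from the original slides); the names W1, WR, W2 and the parameter shapes are assumptions chosen to match the equations above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def srn_forward(xs, W1, WR, W2, b_h, b_y):
    """Forward pass of a simple recurrent network.

    xs: (T, d) input sequence
    W1: (H, d) input-to-hidden weights   w^(1)
    WR: (H, H) hidden-to-hidden weights  w^(R), the recurrent part
    W2: (K, H) hidden-to-output weights  w^(2)
    """
    h = np.zeros(W1.shape[0])                 # initial hidden state
    hs, ys = [], []
    for x in xs:
        h = sigmoid(W1 @ x + WR @ h + b_h)    # Hidden(t) from Input(t) and Hidden(t-1)
        y = softmax(W2 @ h + b_y)             # Output(t) from Hidden(t)
        hs.append(h)
        ys.append(y)
    return np.stack(ys), np.stack(hs)
```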
Recurrent network unfolded in time
[Figure: input, hidden, and output layers at t-1, t, t+1; the input-to-hidden weights w^(1) and hidden-to-output weights w^(2) are shared across time, and successive hidden layers are linked by recurrent connections.]
- View an RNN processing a sequence of T inputs as a T-layer network with shared weights
- Train an RNN by doing backprop through this unfolded network
- Weight sharing: if two weights are constrained to be equal (w1 = w2), they will stay equal provided their weight changes are equal; achieve this by updating each with the summed gradient (∂E/∂w1 + ∂E/∂w2), as for conv nets (a small sketch follows below)
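A tiny illustration (not from the slides) of that update rule: a weight matrix shared across the unfolded time steps is updated once, with the gradients summed over time, so every "copy" changes in the same way. The gradient values here are just placeholders standing in for what backprop through the unfolded network would produce.

```python
import numpy as np

rng = np.random.default_rng(1)
W_shared = rng.normal(size=(4, 4))            # the same W is used at every time step
grads_per_step = [rng.normal(size=(4, 4))     # placeholder dE/dW from each time step
                  for _ in range(5)]

total_grad = sum(grads_per_step)              # sum the derivatives across time
learning_rate = 0.1
W_shared -= learning_rate * total_grad        # one update applied to the shared weight
```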
Bidirectional RNN
[Figure: forward hidden units FHid and reverse hidden units RHid at t-1, t, t+1; both read the input at each step and both feed the output at each step.]
- Output a prediction that depends on the whole input sequence
- Bidirectional RNN: combine an RNN moving forward in time with one moving backwards in time (a sketch follows below)
- The state units provide a combined representation that depends on both the past and the future
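A minimal numpy sketch (an illustration, not the lecture's code) of how a bidirectional RNN combines a forward-in-time pass with a backward-in-time pass; rnn_states and birnn_states are illustrative names.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rnn_states(xs, W_in, W_rec, b):
    """Hidden states of a simple RNN, run over xs in the order given."""
    h = np.zeros(W_rec.shape[0])
    states = []
    for x in xs:
        h = sigmoid(W_in @ x + W_rec @ h + b)
        states.append(h)
    return np.stack(states)

def birnn_states(xs, fwd_params, bwd_params):
    """Concatenate a forward-in-time RNN with a backward-in-time RNN.

    The combined state at time t depends on the whole sequence: the forward
    half has seen x(0..t), the backward half has seen x(t..T-1).
    """
    h_fwd = rnn_states(xs, *fwd_params)
    h_bwd = rnn_states(xs[::-1], *bwd_params)[::-1]   # run backwards, re-align in time
    return np.concatenate([h_fwd, h_bwd], axis=1)

# Example
rng = np.random.default_rng(0)
T, d, H = 6, 3, 5
xs = rng.normal(size=(T, d))
fwd = (rng.normal(size=(H, d)), rng.normal(size=(H, H)), np.zeros(H))
bwd = (rng.normal(size=(H, d)), rng.normal(size=(H, H)), np.zeros(H))
print(birnn_states(xs, fwd, bwd).shape)   # (6, 10)
```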
Back-propagation through time (BPTT)
- We can train the network by unfolding it and back-propagating through time, summing the derivatives for each weight as we go through the sequence
- More efficiently, run it as a recurrent network:
  - cache the unit outputs at each timestep
  - cache the output errors at each timestep
  - then backprop from the final timestep back to time zero, computing the derivatives at each step
  - compute the weight updates by summing the derivatives across time
- Expensive: backprop for a 1,000-item sequence is equivalent to backprop through a 1,000-layer feed-forward network
- Truncated BPTT: backprop through just a few time steps (e.g. 20); a sketch follows below
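A minimal sketch of (truncated) BPTT, not from the lecture: it assumes a tanh hidden layer, linear outputs, and a squared-error loss rather than the sigmoid/softmax network above, and all function and parameter names are illustrative. Gradients are accumulated backwards from the final timestep, but only over the last k_steps steps.

```python
import numpy as np

def truncated_bptt(xs, targets, params, k_steps=20):
    """Gradients of a squared-error loss for a tanh RNN with linear outputs,
    back-propagated through (at most) the last k_steps time steps."""
    W_in, W_rec, W_out, b_h, b_y = params
    H = W_rec.shape[0]

    # Forward pass: cache the hidden states and outputs at every timestep
    hs, ys = [np.zeros(H)], []
    for x in xs:
        hs.append(np.tanh(W_in @ x + W_rec @ hs[-1] + b_h))
        ys.append(W_out @ hs[-1] + b_y)

    # Backward pass: from the final timestep back through at most k_steps
    grads = {name: np.zeros_like(p) for name, p in
             zip(('W_in', 'W_rec', 'W_out', 'b_h', 'b_y'), params)}
    dh_next = np.zeros(H)
    T = len(xs)
    for t in reversed(range(max(0, T - k_steps), T)):
        dy = ys[t] - targets[t]                  # squared-error output gradient
        grads['W_out'] += np.outer(dy, hs[t + 1])
        grads['b_y'] += dy
        dh = W_out.T @ dy + dh_next              # error from the output and the future
        da = dh * (1.0 - hs[t + 1] ** 2)         # back through tanh
        grads['W_in'] += np.outer(da, xs[t])
        grads['W_rec'] += np.outer(da, hs[t])    # hs[t] is the previous hidden state
        grads['b_h'] += da
        dh_next = W_rec.T @ da                   # error passed to the previous timestep
    return grads
```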
Example 1: speech recognition with recurrent networks
[Figure: speech acoustics (a spectrogram, 0-8000 Hz over about 1400 ms) fed to a recurrent neural network, which outputs phoneme probabilities over time.]
T Robinson et al (1996), The use of recurrent networks in continuous speech recognition, in Automatic Speech and Speaker Recognition: Advanced Topics (Lee et al, eds), Kluwer, 233-258. http://www.cstr.ed.ac.uk/downloads/publications/1996/rnn4csr96.pdf
Example 2: recurrent network language models
T Mikolov et al (2010), Recurrent Neural Network Based Language Model, Interspeech. http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_is100722.pdf
Summary
- Model sequences with finite context using feed-forward networks with convolutions in time (TDNNs, WaveNet)
- Model sequences with (potentially) infinite context using recurrent neural networks (RNNs)
- Unfolding an RNN gives a deep feed-forward network with shared weights
- Train using back-propagation through time (BPTT)
- (Historical) examples in speech recognition and language modelling
- Reading: Goodfellow et al, chapter 10 (sections 10.1, 10.2, 10.3), http://www.deeplearningbook.org/contents/rnn.html
- Next lecture: LSTMs, sequence-to-sequence models