Generating an appropriate sound for a video using WaveNet.


Australian National University
College of Engineering and Computer Science
Master of Computing

Generating an appropriate sound for a video using WaveNet.
COMP 8715 Individual Computing Project

Taku Ueki

1. Supervisor: Dr. Christian Walder
2. Supervisor: Dr. Benjamin Swift

October 27, 2017

Contents

1 Introduction
   1.1 Motivation
   1.2 Overview
2 Methods
   2.1 Artificial Neural Networks
       Overview
       The Artificial Neuron
       Feedforward Neural Network
       Multi Layer Perceptron
       Each Layer's Function in a Classification Task
       Training Neural Networks
   2.2 Convolutional Neural Networks
       Overview of CNN
       Convolutional Layer
       Pooling Layer
       VGG
       Architecture of VGG model
       Performance
       Discussion
       Generalisation of VGG model
   2.3 Recurrent Neural Networks
       RNN's Model
       Long Short Term Memory
   2.4 WaveNet
       Overview of WaveNet
       Model Overview
3 Implementation and Experiment Results
   3.1 Implementation for the Local Conditioning
       Simple Local Conditioning
       Upsampled Local Conditioning
   3.2 Testing the Model
       Fast and Slow Generation
   3.3 Converting a Video Frame to a Vector Using VGG Pre-trained Model
   3.4 Training the WaveNet with Videos
       Experiment 1: Train the model only with wave sound files
       Experiment 2: Training the Model with Sounds and Videos, and Generating Sounds without Local Condition
       Experiment 3: Training the Model with Sounds and Videos, and Generating Sound with Local Condition
4 Discussion
   4.1 Why doesn't local conditioning work for the video, while it works for the toy problem?
5 Future Works
6 Conclusion
Bibliography

Acknowledgments

Firstly, I would like to thank my supervisors, Christian Walder and Benjamin Swift, for their great support and accurate advice. I could not have done this project without them. I would also like to thank Florin Schimbinschi for his help starting off this project, specifically for helping me understand the WaveNet model, and for our insightful discussions. When I came up with this project's idea I did not expect to get this much support, so I cannot express my gratitude enough.

List of Figures

2.1 A depicted (a) biological neuron and (b) an artificial neuron.
2.2 Plotted activation functions: (a) sigmoid, (b) tanh and (c) ReLU.
2.3 An illustrated multi layer perceptron.
2.4 An example of simple convolution.
2.5 Max pooling layer with 2x2 filters and stride 2.
2.6 Configuration of each CNN they tested in their research.
2.7 A list of results of the evaluations for each CNN.
2.8 Performance comparison with other research groups' CNNs.
2.9 An unrolled recurrent neural network.
2.10 An unfolded LSTM cell.
2.11 Forget gate.
2.12 Input gate.
2.13 Output gate.
2.14 Overview of the WaveNet model.
2.15 Visualization of a stack of causal convolutional layers.
2.16 Visualization of a stack of dilated causal convolutional layers.
2.17 WaveNet overview with local condition.
3.1 Visualisation of the simple local conditioning for the second hidden layer.
3.2 Visualisation of the upsampled local conditioning for the second hidden layer.
3.3 Example of the training data.
3.4 Expected result for the test.
3.5 Upsampled model, 1000 iterations.
3.6 Simple model, 1000 iterations.
3.7 Differences between the logits from the slow and fast generation algorithms.
3.8 Error value through the training for 1000 epochs without local condition.
3.9 Error value through the training for 300 epochs without local condition.
5.1 Model overview of WaveNet with LSTM.
5.2 Visualisation of the second idea of improving the local conditioning with LSTM.
5.3 Model overview of the future work idea.

1 Introduction

1.1 Motivation

What does snow sound like? It was from this simple question that this project was born. If an artificial intelligence can learn the sounds of videos and predict an appropriate sound for a video, it may be able to answer that question. In this project, my goal was to develop a system that generates an appropriate sound for a video. Perhaps most people have not thought about the practice of generating sounds for videos, and may therefore think there is little value in creating appropriate sounds, so I would like to elaborate on my motivation for starting such a project. Firstly, as stated in the abstract, this is a practice used in the film industry. When a film is dubbed into other languages, the film makers have to record or create appropriate sounds for each scene [10]. This is sometimes because the actor's voice cannot be separated from the original sound data; the environmental sounds, such as footsteps, street noise, etc., therefore have to be recreated for the video. Also, animation and science fiction films need appropriate sounds for scenes that are created with computer graphics technologies, because those technologies only generate the visual effects and not the sound effects.

Additionally, this project may propose a new way of composing or performing music in digital art fields. For example, it is difficult to put music to an abstract video; this project may be able to generate music or sound for such videos.

1.2 Overview

Among research on raw audio generation, the topic of Text-to-Speech has been gaining increasing attention. As the name implies, the Text-to-Speech task is to convert speech data in text form into speech data in sound form. By combining voice recognition and Text-to-Speech technology, a new interface which does not require physical interaction between the user and the system can be implemented. As the system understands and communicates with the user through human language, this interface requires only conversation for its operation. WaveNet [6] achieved state-of-the-art performance on Text-to-Speech tasks in both English and Mandarin Chinese. WaveNet is a generative deep learning model for raw audio waveform generation. Its approach is to predict new audio samples based on a long range of previous audio samples using dilated causal convolution layers. Audio data is normally stored as an array of float values, and the sample rate is usually at least 16,000 Hz, so to capture only one second of audio waveform the model has to deal with 16,000 values. Moreover, audio data holds sequential temporal information as well as the audio signal of each time step. Hence, the model needs to be able to consider a long range of previous audio samples while keeping the temporal causality of the waveform in order to capture its patterns. To treat sequential data with deep learning techniques, recurrent neural networks or causal convolutions are used. However, for raw audio generation, as mentioned above, the input sequence is very long, and because of this unusual length these techniques cannot learn the patterns of the waveform efficiently.

Therefore, they used dilated causal convolutional layers to cover the long input array. In the hidden layers of dilated causal convolutions, the convolution skips a certain number of inputs, so that the model can cover a long input array more efficiently without losing the temporal causality of the original waveform. Additionally, WaveNet receives a local condition as another input. With this local condition, the WaveNet model predicts new audio samples based not only on the previous audio samples but also on other external information. For example, for Text-to-Speech tasks, the phonemes of the input text's words are fed into the WaveNet as local conditions, so that the model can generate the audio samples the user intends. For this project, video frames are given to the model as local conditions. Just as we utter different sounds based on the words and their phonemes, environmental sounds occur based on phenomena or physical interactions between objects, so there are correlations between videos of these actions and the corresponding sounds. Thus, the goal of this project is to develop a system that recognises the differences between video frames and generates an appropriate sound for each scene. To feed video frames into the WaveNet model, they need to be converted into a form that the WaveNet can treat easily and effectively. Additionally, image data holds spatial information, so it is crucial for the model to consider this spatial information to recognise each video frame. Accordingly, the input video frames are converted to vectors through a pre-trained convolutional neural network. For image recognition tasks, the spatial information of the input images is extracted through the CNN and converted into a vector so that the fully connected layers following the CNN layers can treat it adequately to conduct the classification. Therefore, a trained CNN can be used to extract the spatial information of images and convert them to vectors. The Visual Geometry Group achieved state-of-the-art performance on image recognition tasks using a CNN. They publicly released the CNN model, which was trained on an image classification benchmark task, to facilitate computer vision related research.

It has also been proved that this pre-trained model can be used for different types of image recognition tasks, so it has been used in much related research.

2 Methods

In this section, I introduce the basics of the existing models and tools that I used in this project. Before going into the details of my project's model, since I mainly used several different types of deep neural networks, I will start by explaining the most standard architecture of neural networks.

2.1 Artificial Neural Networks

Overview

An artificial neural network (ANN) is a biologically inspired network of artificial neurons which is designed to perform certain tasks such as classification, regression, etc. The basic idea of an ANN is to imitate the architecture of a human brain and simulate its operations on a computer.

The Artificial Neuron

The artificial neuron is the smallest unit that constitutes an ANN. As the name implies, its function is almost identical to that of a biological neuron.

Figure 2.1: A depicted (a) biological neuron and (b) an artificial neuron.

A biological neuron cell receives signals from other neurons through its dendrites and sums these signals up. If the summation of the signals exceeds a threshold, the neuron cell becomes activated. Activated neuron cells transmit a signal to other connected neurons through synapses. Artificial neurons work almost identically. An artificial neuron receives input signals from connected neurons. Each input signal has an associated weight value; the signals are multiplied by their weights as they are passed between connected neurons. As the biological neuron sums up the received signals, the artificial neuron also sums up all the signals transmitted from connected neurons. The summation is passed to an arbitrary activation function which decides whether to activate the neuron or not. If the neuron is activated by the received signals, it transmits a signal to other connected neurons through the axon. The mathematical form of this process is expressed as

$$y(\mathbf{x}, \mathbf{w}) = f\!\left(\sum_{i=0}^{M} w_i x_i\right)$$

where $f(\cdot)$ is an activation function, $x_i$ is the input from the previous layer's $i$-th neuron, and $w_i$ is the associated weight for that input.
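To make this concrete, here is a minimal NumPy sketch of a single artificial neuron; the function and variable names are illustrative, not taken from the report:

```python
import numpy as np

def sigmoid(z):
    # Squash a real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def artificial_neuron(x, w, f=sigmoid):
    # x: inputs from the previous layer's neurons, shape (M+1,)
    # w: associated weights, shape (M+1,); w[0] acts as a bias
    #    if x[0] is fixed to 1, matching the sum from i = 0 to M.
    z = np.dot(w, x)       # weighted sum of the incoming signals
    return f(z)            # the activation decides the neuron's output

# Example: three inputs (the first one is the bias input fixed to 1)
x = np.array([1.0, 0.5, -0.2])
w = np.array([0.1, 0.8, 0.3])
print(artificial_neuron(x, w))
```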

There are many types of activation functions; I will introduce three popular ones.

Figure 2.2: Plotted activation functions: (a) sigmoid, (b) tanh and (c) ReLU.

Sigmoid: The input real value is squashed to a value between 0 and 1. The backpropagation algorithm, which is used to update the learnable weight values in a neural network, computes the gradient of the cost function with respect to each weight value and updates the weight based on that gradient. However, the sigmoid function's gradient vanishes as the input value becomes very large or very small. Consequently, the weights cannot be updated efficiently.

Tanh: The input real value is squashed to a value between -1 and 1. This activation function has the same vanishing-gradient problem as the sigmoid.

Rectified Linear Unit (ReLU): The ReLU activation function ignores all negative inputs and returns positive inputs as they are. As the gradient of this function does not vanish or explode, it can update the weight values more efficiently and helps the model converge faster.
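A minimal NumPy sketch of the three activation functions described above (illustrative only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # output in (0, 1)

def tanh(z):
    return np.tanh(z)                 # output in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # zero for negatives, identity otherwise

z = np.linspace(-5, 5, 11)
for f in (sigmoid, tanh, relu):
    print(f.__name__, np.round(f(z), 3))
```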

Feedforward Neural Network

The feedforward neural network is the simplest type of neural network. Feedforward neural networks are comprised of layers of artificial neurons, and each layer's artificial neurons are connected to the adjacent layers' artificial neurons. An artificial neuron receives signals from the previous layer's artificial neurons and, if activated, transmits a signal to the next layer's artificial neurons.

Multi Layer Perceptron

A multi layer perceptron is comprised of an input layer, one or more hidden layers, and an output layer. Although a single layer perceptron can only solve linearly separable classification problems, a multi layer perceptron can solve nonlinear problems [3]. The basic architecture of a multi layer perceptron is shown below.

Figure 2.3: An illustrated multi layer perceptron.

Input layer: The left layer of the figure is the input layer. In this figure, there are two input nodes and one bias node. All input nodes are connected to all the hidden nodes except for the hidden layer's bias node. Each connection has an associated weight value. The value of each input node is multiplied by the associated weight and passed on to the hidden nodes.

Hidden layer: The hidden nodes receive the product of each input node's value and the associated weight from all connected nodes. Each node sums these values up and passes the summation to its activation function. The output of each hidden node is multiplied by its associated weight value and passed on to the next layer's nodes.

Output layer: The output layer's nodes receive values from all the connected nodes of the previous hidden layer. They also pass the summation of these received values to their activation function.

Each Layer's Function in a Classification Task

The number of input nodes is the same as the number of parameters from which you predict the correct class. To respond to many different inputs, the neurons in the hidden layers activate other nodes when a certain input is transmitted. The number of output layer nodes indicates the number of classes in the classification task, and each output node corresponds to a certain class. Therefore, in a classification task, the label for each data point is expressed in the form of a one-hot vector. For example, if there are three classes, the label for class one is expressed as [1, 0, 0], class two as [0, 1, 0], and likewise for class three. Hence, in a classification task, the outcome of each output node represents the probability that the data point is classified to that class.
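Putting the layers together, here is a hedged NumPy sketch of a forward pass through a small multi layer perceptron for a three-class problem, using the softmax function introduced below; all sizes and names are illustrative:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def mlp_forward(x, W1, b1, W2, b2):
    # One hidden layer with tanh, then an output layer with softmax.
    h = np.tanh(W1 @ x + b1)          # hidden activations
    return softmax(W2 @ h + b2)       # class probabilities

rng = np.random.default_rng(0)
x = rng.normal(size=2)                          # two input nodes
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)   # four hidden nodes
W2, b2 = rng.normal(size=(3, 4)), np.zeros(3)   # three classes
print(mlp_forward(x, W1, b1, W2, b2))           # probabilities summing to 1
```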

To convert the outcomes to probabilities, the softmax function is usually used:

$$\sigma(x)_i = \frac{e^{x_i}}{\sum_{k=1}^{K} e^{x_k}}$$

where $K$ is the number of classes; the softmax function converts the outcome of the $i$-th node into a probability.

Training Neural Networks

There are two phases for neural networks. One is the prediction phase, which propagates forward from input to output to get results from the input values. The other is the training phase. During the training phase, the error between the correct answer and the prediction is calculated. The objective of the training phase is to minimise this error so as to optimise the network's trainable weight values. The backpropagation algorithm is used to train neural networks: it calculates the gradient of the error and updates the weights to reduce the error according to the gradient. There are several error functions, of which I will briefly introduce two common ones.

Loss Functions

The Mean Squared Error (MSE) function:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{Y}_i - Y_i\right)^2$$

The MSE function computes the average of the squared difference between the prediction and the label of each data point.

The Cross Entropy loss function:

$$\mathrm{Loss}(X, Y) = -\frac{1}{n}\sum_{i=1}^{n}\left[Y_i \ln(X_i) + (1 - Y_i)\ln(1 - X_i)\right]$$

where $X$ is the prediction and $Y$ is the corresponding label. The average of all the cross entropies is computed.
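A small NumPy sketch of the two loss functions, assuming binary labels for the cross entropy case (names illustrative):

```python
import numpy as np

def mse(y_pred, y_true):
    # Mean of the squared differences between predictions and labels.
    return np.mean((y_pred - y_true) ** 2)

def binary_cross_entropy(y_pred, y_true, eps=1e-12):
    # Clip to avoid log(0); average the cross entropy over all points.
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
print(mse(y_pred, y_true), binary_cross_entropy(y_pred, y_true))
```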

The learnable weight values are updated based on the computed gradients of the loss function with respect to them. There are several optimisation algorithms, such as Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam), and the weights are updated differently depending on the algorithm. Adam is one of the most popular optimisation algorithms and is used for a variety of deep learning tasks. It uses exponentially decaying averages of past gradients and squared gradients to help SGD converge to a better value, and faster [8].

2.2 Convolutional Neural Networks

The Convolutional Neural Network (CNN) is a type of deep neural network that has achieved outstanding performance on image classification tasks. In image classification tasks, a CNN is used to extract features from an input image. The extracted features are expressed as a vector and passed to a feedforward neural network which conducts the classification. CNNs can be trained to extract features that help the feedforward neural network classify the input image more accurately. Therefore, a trained CNN is often used to extract features and convert an input image into a descriptive vector of that image. The Visual Geometry Group (VGG) from the University of Oxford published a paper that investigated the effect of the depth of a CNN on its accuracy in large-scale image classification tasks [9]. In this research, they demonstrated state-of-the-art performance on image classification benchmark tasks, and they released the CNN model trained on one of those benchmarks. In this project, the input video's frames are converted into vectors that describe their features; thus, VGG's pre-trained model was used to convert each frame of the video into a descriptive vector.

Overview of CNN

I will explain the overview of the CNN, the VGG CNN architecture, and their pre-trained model. A basic CNN is comprised of a number of convolutional layers, pooling layers and fully connected layers. Convolutional layers have an arbitrary number of convolutional filters. These filters are slid over the input image, convolving the filter with each area of the image. Pooling layers are applied to reduce the spatial size of the outcomes of the convolutional layers. The purpose of applying these layers is to reduce the number of parameters and the computation cost, and also to avoid overfitting the model [2]. The outcomes of the convolutional and pooling layers are passed to a fully connected layer, which conducts the prediction based on them.

Convolutional Layer

The convolutional layer is the core part of the CNN. A set of trainable filters constitutes this layer. Since the input data (an image) has three dimensions, width, height and depth (colour channels), these filters also have a 3-dimensional shape. However, the shape and the number of dimensions depend on the input data's dimensionality and the designer of the CNN; the filters' spatial size has to be smaller than the input image. In the forward propagation, each filter is slid across the entire input image, and the dot products between the filter and the corresponding regions of the input image are computed.

Figure 2.4: An example of simple convolution.

This figure shows how the convolution is performed. The yellow region of the left-side matrix is the filter and the red values are the weights of the filter. The dot product between the filter's weights and the image's overlapping values is computed, and the dot products are stored in the matrix on the right side of the figure, which has the same dimensions. This matrix is the input data for the next convolutional layer. Although only 2D convolution is demonstrated in this example, the filter and the input data can have a third dimension (depth). In this case, the filter is applied across the corresponding depth of the image when calculating the dot product. Hence, the output of the convolution is transformed from 3D space into 2D space.
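A minimal NumPy sketch of the sliding dot-product computation described above, for a single-channel image without padding (illustrative, not the report's code):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    # Valid (no padding) 2D convolution of a single-channel image.
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            region = image[y*stride:y*stride+kh, x*stride:x*stride+kw]
            out[y, x] = np.sum(region * kernel)  # dot product with the filter
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
print(conv2d(image, kernel))   # a 4x4 feature map
```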

Pooling Layer

Pooling layers play an important role in CNNs. One of the objectives of using a CNN is to reduce the dimensionality of the input data to make it more tractable. For example, an image whose size is 300x300x3 (width x height x depth) has 270,000 parameters, which can easily lead to overfitting. Therefore, pooling layers are used to reduce the spatial size of the input volume. There are several algorithms for pooling layers: max pooling, average pooling, L2-norm pooling, etc. However, it has been demonstrated that max pooling performs better than the other algorithms [2].

Figure 2.5: Max pooling layer with 2x2 filters and stride 2.

This figure shows how max pooling reduces the spatial size of 2D input data. In each differently coloured region of the matrix, only the highest value is picked and stored, as that value represents the region. Consequently, if a max pooling layer with a size of 2x2 is used, the input image is reduced to half its width and half its height.
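A corresponding sketch of max pooling with a 2x2 filter and stride 2 (illustrative):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    # Keep only the maximum value in each size x size region.
    h, w = feature_map.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            region = feature_map[y*stride:y*stride+size, x*stride:x*stride+size]
            out[y, x] = region.max()
    return out

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 1],
               [3, 4, 1, 8]], dtype=float)
print(max_pool(fm))   # [[6, 4], [7, 9]] -- width and height halved
```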

VGG

As mentioned above, the Visual Geometry Group (VGG) is a research group at the University of Oxford. They investigated the effect of the depth of a CNN on its accuracy in a large-scale image classification task, constructing CNNs with depths ranging from 11 to 19 layers. They used only 3x3 convolution filters so as to evaluate the effect of depth consistently. As a result, they not only achieved exceptional performance on the ILSVRC classification and localisation tasks, but also demonstrated the applicability of the model to other image recognition datasets. To facilitate further computer vision related research, they released their pre-trained model publicly.

Architecture of VGG model

In their work, they adopted convolutional layers comprised of 3x3 convolution filters with a certain number of channels. Apart from these layers, they used max pooling layers, 1x1 convolutional layers and three fully connected layers. Max pooling is performed by a 2x2 filter with a stride of 2. The 1x1 convolutional layers are utilised to increase nonlinearity without affecting the receptive fields of the convolutional layers. The last layer of the stack of convolutional and pooling layers is connected to three fully connected layers; the first two fully connected layers have 4096 hidden nodes each, and the last layer has 1000 nodes. The last layer is followed by a softmax layer that computes the probability for each class.

Figure 2.6: Configuration of each CNN they tested in their research.

Figure 2.7: A list of results of the evaluations for each CNN.

Performance

This table shows the results of the evaluations for each configuration. They showed that as the CNN gets deeper, the performance improves. Additionally, they demonstrated that the model can be improved by changing the training image size randomly (in this work, from 256 to 512); the third row for configurations C, D and E shows the performance with varying training image sizes.

Discussion

Other CNNs that achieved outstanding accuracy on the image classification competition task consist of convolutional filters with larger receptive fields and larger strides, especially in the first convolutional layer. The effect of adopting small 3x3 filters throughout the entire CNN is discussed. Firstly, in terms of the size of the receptive field, a set of three 3x3 filters is equivalent to a single 7x7 filter. Thus, three non-linear rectification layers can be obtained by using 3x3 filters, while only one layer is obtained with a single 7x7 filter. As the 3x3 filters produce more outcomes, they are more informative for the decision functions than the single outcome from a 7x7 filter. Secondly, the number of parameters is decreased by adopting smaller filters: if the input volume has C channels, the number of parameters for a single 7x7 filter is 7 x 7 x C x C = 49C^2, whereas the number of parameters for a set of three 3x3 filters is 3 x (3 x 3 x C x C) = 27C^2. Hence, using a set of three 3x3 filters is more discriminative for the decision function and also decreases the number of parameters, which implies the filters can be trained more efficiently with less computation cost [2].
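A quick sanity check of these parameter counts (illustrative):

```python
# Parameter counts for equal receptive fields (C input and C output channels).
def params_7x7(C):
    return 7 * 7 * C * C            # one 7x7 layer: 49 C^2

def params_three_3x3(C):
    return 3 * (3 * 3 * C * C)      # three stacked 3x3 layers: 27 C^2

C = 64
print(params_7x7(C), params_three_3x3(C))   # 200704 vs 110592
```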

Generalisation of VGG model

They utilised their model pre-trained on the ILSVRC dataset for image classification on other datasets. Specifically, they compared their model's performance on 4 different benchmarks, VOC-2007, VOC-2012, Caltech-101 and Caltech-256, against other state-of-the-art models.

Figure 2.8: Performance comparison with other research groups' CNNs.

This figure shows the comparison with the other CNNs. VGG's 19-layer CNN achieved the highest scores on VOC-2007, VOC-2012 and Caltech-256. In addition, it shows that a combination of the 16-layer and 19-layer CNNs improves accuracy further. Since the pre-trained model was released, it has been used in a wide range of image recognition research.

2.3 Recurrent Neural Networks

We understand sequential data such as text, audio, video, etc. by recognising each component of the sequence step by step from the start to the end.

Sequential data contains information not only in its contents but also in the ordering of the sequence. Therefore, for tasks which treat sequential data, the temporal dependencies of the input data have to be considered to achieve good performance. Recurrent neural networks (RNNs) have achieved great performance on various types of problems, including language modelling, translation, speech recognition, etc. Specifically, Long Short Term Memory (LSTM), which is one architecture of RNN, has achieved outstanding results on these tasks. As I treat sequential data (audio and video) in this project, I will explore RNNs in the following sections.

RNN's Model

Figure 2.9: An unrolled recurrent neural network.

A cell receives an input x_t and produces an output o_t, and also feeds information to itself recurrently, as the left side of the figure shows. As the cell receives each step's data one by one in order and generates an output for the corresponding step, it can be unfolded as the right side of the figure shows. Additionally, each time step's cell stores its state s_t, which can be interpreted as the memory of the network at that time step. The state is computed as

$$s_t = f(U x_t + W s_{t-1})$$

where $f(\cdot)$ is an arbitrary activation function, $U$ is a learnable weight for the input, and $W$ is a learnable weight for the previous step's hidden state. The output is computed as

$$o_t = \mathrm{softmax}(V s_t)$$

where $V$ is a learnable weight for the current hidden state.
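A minimal NumPy sketch of one step of this basic RNN, with tanh as the activation f (all sizes and names illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, s_prev, U, W, V):
    # s_t = tanh(U x_t + W s_{t-1});  o_t = softmax(V s_t)
    s_t = np.tanh(U @ x_t + W @ s_prev)
    o_t = softmax(V @ s_t)
    return s_t, o_t

rng = np.random.default_rng(0)
U = rng.normal(size=(8, 4))    # input-to-state weights
W = rng.normal(size=(8, 8))    # state-to-state weights
V = rng.normal(size=(3, 8))    # state-to-output weights
s = np.zeros(8)
for x in rng.normal(size=(5, 4)):   # a sequence of 5 input vectors
    s, o = rnn_step(x, s, U, W, V)
print(o)
```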

Although the temporal features of the input sequence are captured by this basic RNN model, it is not flexible enough to solve complicated tasks. For instance, as this model simply sums up the previous steps' states, the information from far-previous nodes is gradually lost. Hence, this simple RNN model can deal with short memories, but cannot keep longer ones; for tasks which require both short and long memory, it cannot handle the data flexibly. To overcome this problem, a different RNN architecture called Long Short Term Memory (LSTM) was developed.

Long Short Term Memory

Long Short Term Memory (LSTM) is an advanced type of RNN which can learn both long and short term dependencies in sequential data. An LSTM is capable of selecting features to remember and also to forget, and this regulation system consists of three different gates: the forget gate, the input gate and the output gate [4]. Each gate takes a different role; in this section, I will introduce these gates one by one.

Overview of the model

In an LSTM cell, there are two parallel flows of information. One of the flows (the upper flow in Figure 2.10) carries the previous step's state information through the cell. Meanwhile, the other flow (the lower flow in Figure 2.10) also carries the previous step's information; however, it receives the external input x_t and selects which parameters to forget, input and output.

Figure 2.10: An unfolded LSTM cell.

As a result, the lower flow interacts with the upper flow to regulate its information. This process is repeated through the LSTM network.

Forget gate

Figure 2.11: Forget gate.

The forget gate selects the values to forget from the previous step's state information. As mentioned above, this state acts as the memory of the network, so this gate makes the network forget specific values. In this gate, the input h_{t-1} received from the previous step and the external input x_t are multiplied by different weight matrices and summed element-wise. The outcome is passed to the sigmoid function, mapping all the values to values between 0 and 1. These values are multiplied with the upper flow, and the values of the previous state are decreased depending on the corresponding outcome of the sigmoid function. For instance, if the outcome of the sigmoid function is 0, the parameter is forgotten completely.

Input gate

Figure 2.12: Input gate.

The input gate computes two things. Firstly, it decides which values of the input to use to update the current state. As in the forget gate, the input gate maps values to between 0 and 1 using the sigmoid function, so it can select the parameters to update by multiplying with the input values. Secondly, it creates candidate values that may be added to the current state to update it; the tanh activation function is applied to these values. The values to be added to the current state are selected by the outcomes of the sigmoid function.

Output gate

Figure 2.13: Output gate.

The output gate selects which values of the current state to produce as the output of this time step. Both the input from the previous step and the external input are multiplied by their weight matrices and passed to the sigmoid function to obtain a selective vector. This vector is multiplied with the updated current state so that only the values selected by this gate are output.
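To tie the three gates together, here is a hedged NumPy sketch of one LSTM cell step; bias terms are omitted for brevity and all names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wc, Wo):
    # z concatenates the previous output and the external input.
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z)             # forget gate: what to drop from the state
    i = sigmoid(Wi @ z)             # input gate: which values to update
    c_tilde = np.tanh(Wc @ z)       # candidate values for the state
    c_t = f * c_prev + i * c_tilde  # updated cell state (the upper flow)
    o = sigmoid(Wo @ z)             # output gate: which values to expose
    h_t = o * np.tanh(c_t)          # selected view of the updated state
    return h_t, c_t

rng = np.random.default_rng(0)
n_h, n_x = 8, 4
Wf, Wi, Wc, Wo = (rng.normal(size=(n_h, n_h + n_x)) for _ in range(4))
h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(5, n_x)):
    h, c = lstm_step(x, h, c, Wf, Wi, Wc, Wo)
print(h)
```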

2.4 WaveNet

Overview of WaveNet

WaveNet is a deep neural network for generating raw audio waveforms, developed by Google DeepMind. This model predicts each audio sample by computing a predictive distribution based on a certain length of previous samples, a global condition and a local condition. It has achieved state-of-the-art performance on text-to-speech tasks in both English and Mandarin [6].

Figure 2.14: Overview of the WaveNet model.

Model Overview

The predictive distribution for each audio sample x_t is computed conditioned on the previous timesteps. Therefore, the joint probability of a waveform $\mathbf{x} = \{x_1, \ldots, x_T\}$ is expressed as a product of conditional probabilities:

$$p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$$

The WaveNet consists of three parts: a causal convolutional layer, a stack of dilated causal convolutional layers, and fully connected layers. Firstly, the input raw audio samples are transformed into one-hot vectors with a certain number of categories. A 2x1 causal convolution is applied to these transformed vectors. (These new arrays have the same depth as the dilated causal convolutional layers' outputs.) This sequence of vectors is passed to the first layer of the dilation layers.

Secondly, dilated causal convolutions are applied to the passed sequence. In each layer, the dilated causal convolution produces two identical sequences of convolved vectors; the tanh function is applied to one of them, and the sigmoid function to the other. An element-wise multiplication is applied to these vectors, and a 1x1 1D convolution is applied to the outcome of this element-wise multiplication. The result is stored as the skip output of this layer; it is also added to the layer's input sequence of vectors and passed to the next layer as the residual output. This process is repeated with different dilation values. Finally, the skip outputs are summed up and passed through two fully connected layers and a softmax layer to compute the predictive distribution for each audio sample.

Causal Convolutions

Figure 2.15: Visualization of a stack of causal convolutional layers.

To keep the temporal dependencies of the audio samples, causal convolution is adopted in signal processing tasks. It is implemented as a 1D convolution with filters of width 2, and the same operation is applied to the output of each layer to build the stack of causal convolutional layers. An RNN is also a neural network that can process sequential data; as there are no recurrent connections in causal convolutions, they can learn time series data faster than an RNN. However, many layers are required to increase the receptive field, and a long receptive field is a crucial factor in capturing the patterns of audio waveforms. Hence, the main part of the WaveNet uses a stack of dilated causal convolutional layers.

Dilated Causal Convolutions

Figure 2.16: Visualization of a stack of dilated causal convolutional layers.

A dilated causal convolution is comprised of causal convolutions with a certain number of skips. As the figure shows, a 1D convolution is computed in each layer over the output of the previous layer; however, dilation - 1 nodes are skipped when selecting the pairs of nodes to be convolved. Therefore, the size of the receptive field can be dramatically increased with fewer layers compared to a stack of ordinary causal convolutional layers.
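A minimal NumPy sketch of a dilated causal convolution with a width-2 filter, stacked with doubling dilations as in the figure (illustrative, not the WaveNet implementation itself):

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    # x: input sequence, shape (T,); w: filter of width 2, shape (2,).
    # Each output sees the current sample and the one `dilation` steps back,
    # so no future samples leak into the prediction (causality).
    T = len(x)
    out = np.zeros(T)
    for t in range(T):
        past = x[t - dilation] if t - dilation >= 0 else 0.0
        out[t] = w[0] * past + w[1] * x[t]
    return out

x = np.sin(np.linspace(0, 4 * np.pi, 32))
h = x
for d in (1, 2, 4, 8):    # doubling dilations grow the receptive field quickly
    h = np.tanh(dilated_causal_conv(h, np.array([0.5, 0.5]), d))
print(h[-4:])             # receptive field is 1 + 1 + 2 + 4 + 8 = 16 samples
```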

Softmax Distributions

Typically, audio data is stored with a 16-bit audio bit depth, so each sample has 65,536 possible values. To make this more tractable, each sample of the input raw audio is quantised to 256 possible values by applying a mu-law companding transformation [5]:

$$f(x_t) = \mathrm{sign}(x_t)\,\frac{\ln(1 + \mu |x_t|)}{\ln(1 + \mu)}$$

where $-1 < x_t < 1$ and $\mu = 255$. This quantisation produces a significantly better reconstruction than a simple linear quantisation scheme, especially for speech data [6].
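A small NumPy sketch of mu-law encoding and decoding with mu = 255 (illustrative):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    # x in (-1, 1) -> 256 integer categories, following the mu-law formula.
    fx = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((fx + 1) / 2 * mu + 0.5).astype(np.int32)   # map (-1,1) to {0..255}

def mu_law_decode(q, mu=255):
    # Inverse transform back to a waveform in (-1, 1).
    fx = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(fx) * ((1 + mu) ** np.abs(fx) - 1) / mu

x = np.sin(np.linspace(0, 2 * np.pi, 8)) * 0.9
q = mu_law_encode(x)
print(q)                               # quantised categories
print(np.round(mu_law_decode(q), 3))   # close to the original samples
```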

Gated Activation Units

In the stack of dilated causal convolutional layers, gated activation units are applied to the outcomes of the dilated causal convolutions. These units consist of two activation functions: sigmoid and tanh. The outcomes of the dilated causal convolutions are multiplied by learnable weight matrices and passed through the two activation functions separately, and the outcomes of the two activation functions are multiplied element-wise. This is identical to the input gate of the LSTM cell, so the gated activation units can be interpreted as working as an input gate: they select which parameters of the outcome to feed to the next layer of the stacked dilated causal convolutional layers as the residual output.

Residual and Skip connections

The outcomes of the gated activation units are passed to skip connections and also to the next dilated causal convolutional layer. As the dilated causal convolutional stack gets deeper, a longer range of audio samples is considered. In each layer, the important parameters are selected by the gated activation units, fed to the next layer, and convolved with a longer range of audio samples. Each layer also outputs the outcomes of its gated activation units; the outcomes of all layers of the stacked dilated causal convolutional layers are summed up and fed to the following fully connected layers.

Conditional WaveNet

Figure 2.17: WaveNet overview with local condition.

The WaveNet can also compute the conditional distribution $p(\mathbf{x} \mid \mathbf{h})$ of the next sample of the waveform:

$$p(\mathbf{x} \mid \mathbf{h}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}, \mathbf{h})$$

By feeding a condition into the dilated causal layers, the WaveNet can compute the predictive distribution based on both the previous steps and the given condition. For instance, for text-to-speech tasks, the phonemes of each word in the given sentence are fed into the layers so that the WaveNet can generate different waveforms for different words. In their work, it is reported that the WaveNet generates human language-like sound without this condition, while with the condition it generates recognisable human language sound.

The model can accept two types of conditions: a global condition and a local condition. The global condition does not change through time, but the local condition does. Therefore, in text-to-speech tasks, the global condition is used to tell the model the type of speaker, while the local condition guides the model on which phoneme to generate at each time step.

3 Implementation and Experiment Results

3.1 Implementation for the Local Conditioning

There is an open source implementation of WaveNet without local conditioning, so I implemented the local conditioning on top of that code. In this project, each video frame's vector is passed as a local condition to predict the audio samples corresponding to that frame. For example, if the sample rate of the sound file is 16,000 samples per second and the frame rate of the video is 25 frames per second, each video frame covers 16,000 / 25 = 640 samples. Therefore, the WaveNet receives video frame vectors and predicts 640 audio samples based on a receptive field of previous audio samples and the corresponding video frame vector. One way of implementing the local conditioning is to pass only one video frame vector to the WaveNet to predict the corresponding 640 audio samples (I call this simple local conditioning). The other way is to upsample the video frame vectors to match their frame rate to the audio data's sample rate, and pass this sequence of video frame vectors (I call this upsampled local conditioning).
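A minimal sketch of the upsampling step under the 16,000 Hz / 25 fps assumption above (names illustrative):

```python
import numpy as np

SAMPLE_RATE = 16000   # audio samples per second
FRAME_RATE = 25       # video frames per second
SAMPLES_PER_FRAME = SAMPLE_RATE // FRAME_RATE   # 640

def upsample_conditions(frame_vectors):
    # frame_vectors: shape (n_frames, dim) -- one descriptive vector per frame.
    # Repeat each frame vector so every audio sample has a matching condition.
    return np.repeat(frame_vectors, SAMPLES_PER_FRAME, axis=0)

frames = np.random.rand(3, 512)    # 3 frames of 512-dim descriptive vectors
conds = upsample_conditions(frames)
print(conds.shape)                 # (1920, 512): one row per audio sample
```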

Simple Local Conditioning

Figure 3.1: Visualisation of the simple local conditioning for the second hidden layer.

In the simple local conditioning implementation, only the newest local condition, which corresponds to the newest audio sample, is added to the dilated causal convolutions. For instance, in this figure, even though the local condition changes every 4 audio samples, only the newest local condition is added. During the training phase, audio samples and a video frame vector are passed to the model. The length of the audio samples is the receptive field's size plus an arbitrary number of target samples. As only one local condition can be passed to the model, the target audio samples must correspond to the video frame. Additionally, in the dilated causal convolutional layers, this local condition is added to the outcome of each layer. It should be noted that all the audio samples given to the model are convolved with different dilations and added to the local condition. The size of the audio samples is the receptive field plus the number of samples per video frame. Concretely, the number of samples per video frame is 640 and the receptive field's size is usually more than 5,000, which is equivalent to about 8 video frames. Therefore, the samples in the receptive field originally had different local conditions; however, in this simple implementation, the given video frame vector is added to the convolved samples even though they do not correspond to it.
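To illustrate where the condition enters, here is a hedged NumPy sketch of one gated layer that broadcast-adds a single projected condition vector to every sample, as the simple variant does; the projection matrices and shapes are my assumptions, not the actual implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_layer_with_condition(x_conv, h_cond, Vf, Vg):
    # x_conv: two pre-activation copies from the dilated causal convolution,
    #         each of shape (T, channels); h_cond: one condition, shape (dim,).
    # The projected condition is broadcast-added before the gated activation,
    # so the same condition touches every sample in the receptive field.
    filt = np.tanh(x_conv[0] + h_cond @ Vf)     # filter half of the gate
    gate = sigmoid(x_conv[1] + h_cond @ Vg)     # gate half of the gate
    return filt * gate                          # element-wise gated output

rng = np.random.default_rng(0)
T, ch, dim = 16, 32, 512
x_conv = rng.normal(size=(2, T, ch))
h_cond = rng.normal(size=dim)
Vf, Vg = rng.normal(size=(dim, ch)), rng.normal(size=(dim, ch))
print(gated_layer_with_condition(x_conv, h_cond, Vf, Vg).shape)   # (16, 32)
```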

Upsampled Local Conditioning

Figure 3.2: Visualisation of the upsampled local conditioning for the second hidden layer.

In the upsampled implementation, on the other hand, the local conditions are upsampled before they are passed to the model, so the number of audio samples and the number of local conditions are the same. Hence, in each dilated causal convolutional layer, every convolved sample is added to its corresponding local condition. To do this, the size of the local conditions is adjusted as the dilated convolution layers go deeper, because the size of the convolution output changes according to each layer's dilation.

3.2 Testing the Model

Before I started working on the training and audio synthesis for a video, both the simple and upsampled models were tested on a toy problem. The toy problem is to learn and generate sine waves of specific frequencies based on the local conditions. The training data consists of three different notes.

One of these notes lasts for an arbitrary duration and then changes to another note, and this is repeated several times in one dataset. The ID of each frequency is given to the model as a local condition. Therefore, the model learns the waveform patterns of each note from the audio samples, and it also learns the relationship between the audio samples and the given local condition. For the training, 30 training datasets were prepared, with 3 local conditions (notes): D# (155 Hz), G (196 Hz) and A# (233 Hz). Figure 3.3 shows one of the training datasets.

Figure 3.3: Example of the training data.

The objective of the generation part is to generate waveforms of different frequencies by changing the local condition through time. In the generation phase, 900 audio samples are generated, and every 300 samples the local condition is incremented from 1 to 3. Therefore, the expected result changes its frequency as the local condition varies.
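A sketch of how such toy training data could be generated; the frequencies follow the text, while the function name and the fixed 300-sample segments are illustrative (phase continuity between notes is ignored):

```python
import numpy as np

SAMPLE_RATE = 16000
NOTES = {1: 155.0, 2: 196.0, 3: 233.0}   # D#, G, A# in Hz

def make_toy_example(note_ids, samples_per_note=300):
    # Concatenate sine segments; emit one local-condition ID per sample.
    audio, conds = [], []
    for nid in note_ids:
        t = np.arange(samples_per_note) / SAMPLE_RATE
        audio.append(np.sin(2 * np.pi * NOTES[nid] * t))
        conds.append(np.full(samples_per_note, nid))
    return np.concatenate(audio), np.concatenate(conds)

audio, conds = make_toy_example([1, 2, 3])   # 900 samples, condition 1 -> 2 -> 3
print(audio.shape, conds[::300])             # (900,) [1 2 3]
```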

Figure 3.4: Expected result for the test.

Figure 3.5 shows the results of the test for the upsampled model. The power spectrum of the generated waveform is shown in the upper graph, and the lower graph shows the waveform. For each local condition, the waveform is cut out and its power spectrum is shown separately in (a), (b) and (c). As mentioned, 3 different notes (155 Hz, 196 Hz and 233 Hz) were to be generated, and the power spectrum in (d) shows that these frequencies are present in the waveform. Also, (a), (b) and (c) show that the expected frequencies were generated for each local condition.

Figure 3.5: Upsampled model, 1000 iterations.

Figure 3.6 shows the results for the simple model. Panel (d) shows the overall generated waveform. It clearly shows that although 3 different local conditions were input to the model every 300 samples, only 2 notes were generated. I ran this test program repeatedly, but it only generated one of the notes, chosen seemingly at random, for an arbitrary length. Hence, the local condition is totally ignored by the simple model.

Figure 3.6: Simple model, 1000 iterations.

Fast and Slow Generation

A fast generation algorithm has been introduced [7]. The basic concept of this algorithm is to store the outcomes of each layer of the dilated causal convolutional layers and reuse these values to generate the next audio sample. For instance, in Figure 2.16, the output is computed by convolving the newest audio sample with the outcomes of the other hidden layers' convolutions. The hidden layers' values can therefore be reused when computing the convolutions for new samples, because they are not affected by the new samples. This algorithm accelerates the generation dramatically, so it is highly recommended.

Therefore, the generation with this algorithm was examined to see whether it produces the same logits as the slow generation algorithm, which computes all the dilated causal convolutions from scratch.

Figure 3.7: Differences between the logits from the slow and fast generation algorithms.

To test the fast and slow generation, the logits for each time step are computed based on the same previous audio samples. A logit is a 256-dimensional vector which represents the probability distribution of that time step's audio sample. The element-wise differences between the logits generated by the slow and fast algorithms are computed and squashed to a scalar value by summing over the 256-dimensional vector. The logits are generated for 900 samples and the squashed differences are plotted in Figure 3.7. As the slow algorithm adds the local condition to the entire dilated causal convolution, while the fast algorithm only updates the local condition of the new sample, there are large differences when a new local condition is fed into the WaveNet. For this test, the size of the receptive field was set to 256, and at time steps 300 and 600 the local condition was changed. After 256 steps, the WaveNet receives exactly the same audio samples and local conditions, so there are no differences between slow and fast, as the figure shows.

3.3 Converting a Video Frame to a Vector Using the VGG Pre-trained Model

As the WaveNet's performance was confirmed in the previous section, I now move to the next step: converting an image into a descriptive vector using VGG's pre-trained model. The VGG pre-trained model is accessible via the Internet. The original pre-trained model is compatible with Caffe [1]; however, the same pre-trained model in formats compatible with other deep learning libraries (e.g. TensorFlow [2], Keras [3], etc.) is also publicly available. TensorFlow is used for all deep learning implementations throughout this project, so the pre-trained model for TensorFlow was obtained. Two pre-trained models are available, a 16-layer and a 19-layer model; the 19-layer model is used for this project. This 19-layer model is comprised of 16 CNN layers and 3 fully connected layers. The filters and biases stored in the pre-trained model are loaded as constant values: the 16 CNN layers are modelled with TensorFlow library functions, and the loaded filters and biases are passed to the corresponding layers' convolution functions as constant filters and biases. The size of the video frames is reduced to 32 x 18 before they are passed to the CNN. Although the size of the input images in the original research is 224 x 224, there are some benefits to reducing the size. Firstly, the conversion becomes computationally efficient: as the filters are slid across the image in each convolutional layer, a smaller input image reduces the computation cost. Secondly, the size of the output vector is reduced. While original-size input images (224 x 224) are converted into 1024-dimensional vectors, 32 x 18 images become 512-dimensional vectors. It may seem that conversion from 32 x 18 (= 576) to 512 is not an efficient dimension reduction; however, the image has 3 channels (RGB), so in fact 32 x 18 x 3 (= 1,728) values are reduced to 512, and this vector holds the spatial information of the input image. Additionally, too many parameters could lead to overfitting [2], and the objects in the image do not need to be recognised in this project; all the WaveNet model needs to do is find the correlation between the image's spatial information and the sound. It is for these reasons that the input video frames are downsampled.

[1] One of the most popular deep learning frameworks, originally developed at UC Berkeley.
[2] An open-source machine learning library for Python and C++, developed by the Google Brain team.
[3] An open-source neural network library for Python; the second fastest growing deep learning framework after TensorFlow.
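The report loads the converted VGG weights by hand and feeds downsampled 32 x 18 frames; as a rough illustrative alternative, the sketch below uses the stock Keras VGG19 at its native input size and average-pools an intermediate feature map into a 512-dimensional descriptor. The layer choice and pooling are my assumptions, not the report's exact pipeline:

```python
import numpy as np
import tensorflow as tf

# Pre-trained VGG19 convolutional layers only (no fully connected head).
base = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
# Take features from the last conv layer of block 5, which has 512 channels.
feat = tf.keras.Model(base.input, base.get_layer("block5_conv4").output)

def frame_to_vector(frame):
    # frame: HxWx3 uint8 RGB image taken from the video.
    x = tf.image.resize(frame, (224, 224))            # network's expected size
    x = tf.keras.applications.vgg19.preprocess_input(x[tf.newaxis, ...])
    fmap = feat(x)                                    # (1, 14, 14, 512)
    return tf.reduce_mean(fmap, axis=[1, 2])[0]       # 512-dim descriptor

frame = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
print(frame_to_vector(frame).shape)                   # (512,)
```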

3.4 Training the WaveNet with Videos

Now I move on to the training and audio generation with WaveNet using videos as training data. For this experiment I used a video featuring waves on a beach: https://. The wave sound varies as the wave movements change: when a wave comes to the shore, the wave sound becomes louder, and as the wave goes back to the sea, the sound becomes quiet. The original video is divided into three videos: 1 minute for training and 30 seconds for validation. The 1 minute training video is further divided into 6 videos of 10 seconds each. I conducted three different experiments:

1. Training the model only with the wave sound files.

2. Training the model with the wave sound files and corresponding videos, and generating a sound file without feeding the video frames (local conditions).

3. Training the model with the wave sound files and corresponding videos, and generating a sound file with the video frames (local conditions).

(The second and third experiments were done for both simple and upsampled local conditioning.)

Experiment 1: Train the model only with wave sound files

This experiment was conducted to confirm that the WaveNet model can learn and generate wave sounds. Additionally, the model was trained with different hyperparameters to inspect how they affect the quality of the generation.

Figure 3.8: Error value through the training for 1000 epochs without local condition.

You can listen to the generated wave sound here: As you can hear, the generated wave sounds are stable, wave-like sounds. As I did not feed the video frames as local conditions, the model captured the pattern of the wave sound in the training videos. Figure 3.8 shows the plotted error values throughout the training phase. For each epoch I plotted the error value 50 times, so there are 50 x 1000 = 50,000 points plotted. The initial error value is approximately 5.5, and it decreased to around 2. However, there are error values around 4 consistently throughout the training, so it can be surmised that the model fitted to some particular parts of the training videos; the generated sound may also be considered an imitation of those parts. It may be impossible to generate sound that perfectly matches the training sound data; however, the model captured the patterns of the training wave sounds and generated a sound that resembles a wave sound. Therefore, it is confirmed that this WaveNet model can learn the wave sound and generate it.

Experiment 2: Training the Model with Sounds and Videos, and Generating Sounds without Local Condition

In this experiment, the model is trained with audio data and the corresponding video frames as local conditions. However, in the generation phase, the local condition is not given to the model; therefore, the expected result would be similar to the results of the first experiment. This experiment was conducted to illustrate the effect of local conditioning on the generation clearly, by comparing its result with the next experiment's result. The generated sound can be listened to here: TPDuPzg

Figure 3.9: Error value through the training for 300 epochs without local condition.

I trained the model for 300 epochs. Unlike the training without local conditioning, the error value did not decrease to around 2. Also, the generated sound is similar to the one generated by the model trained without local conditions, but it is close to white noise, and people may not be able to recognise it as a wave sound.

Experiment 3: Training the Model with Sounds and Videos, and Generating Sound with Local Condition

In this experiment, the video frames are fed into the model in both the training and generation phases. Hence, the generated sound is expected to correspond to the given video.

The generated sound can be heard here: Although I fed the local conditions to the model in both the training and generation phases, unfortunately the generated sound does not correspond to the video frames. Moreover, the generated sound does not vary through time, even though the model receives different local conditions throughout the generation phase.

4 Discussion

4.1 Why doesn't local conditioning work for the video, while it works for the toy problem?

There are several possible causes.

Implementation mistakes or bugs. The first possible cause is a wrong implementation. Although the performance of the local conditioning and the generation were tested with the toy problem, there may still be bugs in the code. Specifically, for the generation, despite giving the local conditions to the model during the generation phase, the generated sound was stable white noise, which is a strange result. However, the training error did not decrease greatly, so the model may have generated essentially random values for any input values. This is also a possible reason why the generated sound does not vary even when the local condition is changed.

The training video may not be appropriate for this task.

First of all, the wave sound is composed of white noise with certain dynamics which correspond to the movements of the waves; therefore, white noise can be said to be one of the patterns of wave sounds. Additionally, the cycle of the waves' dynamics is more than 5 seconds long, but the WaveNet model I used for the training can deal with only approximately 0.3 seconds of audio waveform. So, the model may not be able to capture the pattern of the waves (not the pattern of the wave sounds, but the pattern of the waves themselves). However, to help the model recognise the pattern of the waves, the video frames are fed in as local conditions. There is another problem with using the wave videos as training data: as the video was recorded at a beach, waves which were not in the video frame also make wave sounds and are recorded, so the video frames and the wave sounds do not match perfectly.

Alternative training video ideas: Although I could only conduct the experiments with the wave videos in the short amount of time available, other types of videos were also prepared, such as fireworks, street scenes and speech. It has been proved that the WaveNet model can generate human language-like sound even when trained with only speech sound data; therefore, if the model were trained with the speech videos, it might generate human language-like sound only when the speaker opens his or her mouth.

The training was not enough. Although the training without local conditioning was conducted for 1,000 epochs, the model with local conditioning was trained for only 300 epochs, so the training may not have been sufficient. Despite this, it is more likely that the other possible causes are the actual ones, because the plotted log of the training error values differs from that of the training without local conditioning.

The hyperparameters need to be tuned well.

There are many hyperparameters in this project's model, but their effects on the performance have not yet been examined comprehensively. In the first experiment, the training and generation were conducted with different values of the dilations, residual channels, dilation channels, skip channels and initial filter width, but these were not inspected comprehensively: for each experiment, only one of them was changed from the original setting. In the model, these hyperparameters affect each other, so the performance could change dramatically with different combinations of parameters. However, each iteration is quite time consuming, which is why they could not be examined comprehensively. Once an adequate result for the training and generation with video frame local conditions is obtained, they will need to be inspected more carefully to achieve the best result.

5 Future Works

For future projects, I have considered training the model with multiple types of videos at the same time, and improving the model with recurrent neural networks. I am curious about what kinds of sounds the model can generate for videos that it does not see during the training phase. If the model can learn the sounds and video frames of multiple videos, it could guess the sounds of videos which originally have no sound, such as snow or blooming flowers. Additionally, the model could be improved by applying a recurrent neural network to the local condition [1]. As mentioned, although the WaveNet model's receptive field size is equivalent to about 0.3 seconds, the cycle of the waves is more than 5 seconds. Therefore, the prediction could be improved by considering a longer range of the input. However, because of the computational cost, the receptive field cannot be increased to 5 seconds. On the other hand, the frame rate of the input video is 25 frames per second, so it is more feasible to consider a longer range of video frames: even if 5 seconds of video frames are taken into account, that is only 125 frames (= 25 frames per second x 5 seconds). The recurrent neural network model to apply to the video frame local conditions is as follows.

For the normal video frame local conditioning, the descriptive vector of each video frame is upsampled to 16 kHz, the same as the sample rate of the audio data; therefore, each audio sample is given exactly one vector, the one corresponding to it. For the model with a recurrent neural network, the local condition is instead a sequence of video frames. The sequence is fed into the recurrent neural network, and the output at the end of the recurrent neural network is passed to the WaveNet as the local condition.

Figure 5.1: Model overview of WaveNet with LSTM.

This first idea feeds the same local condition to all the audio samples, including the ones covered by the receptive field. However, as the model test showed, each audio sample should receive a corresponding local condition, and a second idea is designed to solve this problem. The problem with the first idea is that only the last output of the recurrent neural network is taken and fed into the WaveNet. In the second idea, each audio sample receives the corresponding part of the recurrent neural network's output.

Figure 5.2 shows how this model feeds local conditions.

Figure 5.2: Visualisation of the second idea of improving the local conditioning with LSTM.

The blue line in Figure 5.2 is the waveform, and the diagram above it illustrates the unfolded recurrent neural network. The inputs to the recurrent neural network (X_0, ..., X_{n+3}) are the video frames. The normal model with upsampled local conditioning (without a recurrent neural network) receives X_0, ..., X_{n+3} as local conditions: X_0 is given to the corresponding audio samples, which lie inside the same red grid in the figure. In this idea, the corresponding output of the recurrent neural network is given to the audio samples instead; namely, h_0, ..., h_{n+3} are fed into the WaveNet model in place of X_0, ..., X_{n+3}. Figure 5.3 shows the model overview of this idea.
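Under the same assumptions as the previous sketch, the second idea can be expressed by upsampling the per-frame LSTM outputs rather than taking only the last one: each h_t is repeated for the 640 audio samples (16,000 samples per second / 25 frames per second) that belong to frame t.

    import torch
    import torch.nn as nn

    frames = torch.randn(1, 125, 1000)   # placeholder frame descriptors
    lstm = nn.LSTM(input_size=1000, hidden_size=128, batch_first=True)
    outputs, _ = lstm(frames)            # shape (1, 125, 128): h_0, ..., h_124

    # 16,000 audio samples per second / 25 video frames per second
    samples_per_frame = 16000 // 25      # 640

    # Second idea: each audio sample receives the recurrent output h_t of its
    # own frame, instead of the raw frame vector X_t or only the final output.
    local_condition = outputs.repeat_interleave(samples_per_frame, dim=1)
    # shape (1, 80000, 128): one conditioning vector per audio sample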
