An Improved Voice Activity Detection Based on Deep Belief Networks

e-ISSN 2455-1392  Volume 2 Issue 4, April 2016  pp. 676-683  Scientific Journal Impact Factor: 3.468  http://www.ijcter.com

Shabeeba T. K. 1, Anand Pavithran 2
1,2 Department of Computer Science and Engineering, MES College of Engineering, Kuttippuram, Kerala, 679573, India

Abstract: Multiple acoustic features are important for the robustness of Voice Activity Detection (VAD). Statistical and machine learning methods have recently been used for VAD; machine learning methods concentrate on combining multiple acoustic features. The Deep Belief Network (DBN) is a powerful hierarchical model for feature extraction, and it provides a nonlinear method of fusing multiple features through a deep model. Here, multiple serially-concatenated features are taken as the input layer of the DBN, and a new feature is extracted by transferring these features through multiple nonlinear hidden layers. Finally, a linear classifier predicts the class of the new feature. The approach is able to capture the deep regularity of the acoustic features, so that their combined advantage can be fully exploited. However, as the number of features increases, the complexity also increases. An improved method with lower time complexity is proposed here.

Keywords: Voice Activity Detection, Deep Belief Network, Feature Fusion.

I. INTRODUCTION

A voice activity detection (VAD) algorithm distinguishes speech signals from noise. An effective VAD algorithm can differentiate between speech that contains background noise and signals with background noise only. The result of a VAD decision is a binary value indicating the presence of speech in the input signal (for example, an output value of 1) or the presence of noise only (for example, an output value of 0). A VAD algorithm is an integral part of a variety of speech communication systems, such as speech recognition, speech coding in mobile phones, and IP telephony.
In telecommunication systems an effective VAD algorithm plays an important role, especially in automatic speech recognition (ASR) systems. VAD is used to reduce computation by eliminating unnecessary transmission and processing of non-speech segments, and to reduce potential mis-recognition errors in non-speech segments. The typical design of a VAD algorithm is as follows:
1) There may be a noise reduction stage.
2) Some features or quantities are calculated from a section of the input signal.
3) A classification rule is applied to classify the section as speech or non-speech.
VAD algorithms can be classified into four groups: VADs in standard speech processing systems; statistical signal processing based VADs; supervised machine learning based VADs; and unsupervised machine learning based VADs.
@IJCTER-2016, All rights Reserved 676
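The three-step design above can be sketched as a minimal rule-based VAD, with per-frame log energy and zero-crossing rate as the features of step 2 and a fixed threshold rule as step 3. This is an illustrative sketch only, not the DBN method of this paper; the frame length, hop size, and thresholds are assumed values:

```python
import numpy as np

def frame_signal(x, frame_len=200, hop=80):
    """Split a signal into overlapping short-time frames (input to step 2)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def rule_based_vad(x, energy_thresh=-30.0, zcr_thresh=0.25):
    """Steps 2-3 of the typical VAD design: compute per-frame log energy and
    zero-crossing rate, then apply a threshold rule (1 = speech, 0 = non-speech).
    Speech frames tend to have high energy and, for voiced speech, low ZCR."""
    frames = frame_signal(x)
    log_energy = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) / 2.0, axis=1)
    return ((log_energy > energy_thresh) & (zcr < zcr_thresh)).astype(int)
```

With an 8 kHz signal, a 200-sample frame corresponds to the 25 ms short-time scale over which speech can be treated as approximately stationary.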

Research on multiple feature fusion is important for two reasons [1]. First, the discriminability of a VAD based on a single acoustic feature is limited. Traditional VADs pay much attention to exploring new, complicated acoustic features that are more discriminative; however, few features perform overwhelmingly better than the others. Second, the topic of feature fusion has not been fully explored. Although most machine learning based VADs make some effort at feature fusion, the main advantage of these VADs still lies in the superiority of machine learning approaches over non-machine-learning approaches, while the feature fusion methods themselves still lack thorough study. The desirable aspects of a VAD algorithm are listed below:
A Good Decision Rule: a physical property of speech that can be exploited to give consistent and accurate judgment in classifying segments of the signal as silence or otherwise.
Adaptability to Background Noise: adapting to non-stationary background noise improves robustness.
Low Computational Complexity: the complexity of the VAD algorithm must be low to suit real-time applications.
This paper deals with the design of an improved VAD based on the DBN. The time complexity of the proposed method is less than that of the existing system.

II. RELATED WORKS

There are many methods for VAD: statistical signal processing based, supervised machine learning based, unsupervised machine learning based, etc. Some of them are discussed in this section. Tao Yu et al. proposed a supervised machine learning based VAD [2], a discriminative training method that uses a linear weighted sum instead of the simple sum in the multiple observation technique. In this method, the optimal combination weights from two discriminative training methods are studied to directly improve VAD performance, in terms of reduced misclassification errors and improved receiver operating characteristic (ROC) curves.
The weights are optimized by the gradient descent algorithm, with Minimum Classification Error (MCE) as the optimization objective. J. W. Shin et al. proposed an SVM based VAD [3]; the Support Vector Machine (SVM) provides effective generalization performance on classification problems. This VAD employs three effective feature vectors: a posteriori SNR, a priori SNR, and predicted SNR. The a posteriori SNR is estimated as the ratio of the input signal power to the noise variance, and the predicted SNR is estimated from the power spectra of the noise and speech. The SVM constructs a hyperplane that separates the classes without error. Maximum margin clustering (MMC) [4] is an unsupervised learning approach for statistical voice activity detection. MMC can improve the robustness of SVM based VAD while requiring no data labeling for model training. In the MMC framework, the multiple observation compound feature (MO-CF) is proposed to improve accuracy. MO-CF is composed of two sub-features: the multiple observation signal-to-noise ratio (MO-SNR) and the multiple observation maximum probability (MO-MP). Dongwen Ying et al. propose an unsupervised learning framework [5] to construct statistical models for VAD. This framework is realized by a sequential Gaussian mixture model (GMM) and comprises an initialization process and an updating process. The smoothed subband logarithmic energy is selected as the acoustic feature: the input signal is grouped into several Mel subbands in the frequency domain; the logarithmic energy is calculated as the logarithm of the sum of absolute magnitudes in each subband; and the result is smoothed to form an envelope for classification. Two Gaussian models are employed as the classifier to describe the logarithmic energy distributions of speech and nonspeech, respectively. These two models are incorporated into a two-component GMM whose parameters are estimated in an unsupervised way. Speech/nonspeech classification is first conducted at each subband, and all subband decisions are then summarized by a voting procedure. This VAD does not rely on the assumption, widely used in most VADs, that the first several frames of an utterance are nonspeech.

Various VAD methods have been discussed; to further improve performance, it is proposed to introduce the DBN to VAD. Fundamentally, the advantage of the DBN-based VAD is that the DBN has a much stronger ability to describe the variations of the features. The DBN based VAD is discussed in the following section, along with the proposed improvement.

III. DBN BASED VAD

The DBN-based VAD concatenates multiple acoustic features of an observation in serial into a long feature vector, which is used as the visible layer (i.e., input) of the DBN. Then, a new feature is extracted by transferring the long feature vector through multiple nonlinear hidden layers. Finally, the new feature is given as the input of the linear classifier (i.e., softmax output layer) of the DBN to predict the class of the observation. Since VAD contains only two classes (speech and non-speech), the prediction function of the DBN-based VAD is formulated as [1]:

    o(H1) - o(H0) >= η  →  H1,   otherwise H0

where o(H1) and o(H0) are the two softmax outputs, H1/H0 denotes the speech/noise hypothesis, and η is a tunable decision threshold, usually set to 0. The hidden-layer outputs are computed recursively as

    h_i^(L) = g^(L)( Σ_j w_ij^(L) h_j^(L-1) )

where g^(L)(.) is the activation function of the Lth hidden layer, defined as the sigmoid g^(L)(a) = 1 / (1 + e^(-a)); w_ij^(L) is the weight between the adjacent two layers, with i as the i-th unit of the Lth layer and j as the j-th unit of the (L-1)th layer; and h^(0) = {x_r}_r is the input feature vector. The training process of the DBN consists of two phases. First, a greedy layer-wise unsupervised pre-training phase over the stacked RBMs finds initial parameters that are close to a good solution of the deep neural network.
Then, a supervised back-propagation training phase fine-tunes the initial parameters. The key point that contributes to the success of the DBN is the greedy layer-wise unsupervised pre-training of the RBM models: it acts like a regularizer of the supervised training phase and prevents the DBN from over-fitting to the training set.
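As a rough illustration of the prediction pass described above (serial feature concatenation, sigmoid hidden layers, a two-unit softmax output layer, and a tunable threshold η), the following sketch uses placeholder layer sizes and random, untrained weights; it shows the data flow only, not the trained model of the paper:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dbn_predict(features, weights, biases, eta=0.0):
    """Forward pass of a DBN-style VAD: concatenate the acoustic features into
    one long vector, pass it through sigmoid hidden layers, apply a 2-unit
    softmax, and decide H1 (speech, 1) if the output difference exceeds eta."""
    h = np.concatenate(features)                 # serial feature fusion
    for W, b in zip(weights[:-1], biases[:-1]):  # nonlinear hidden layers
        h = sigmoid(W @ h + b)
    z = weights[-1] @ h + biases[-1]             # linear (softmax) output layer
    p = np.exp(z - z.max()); p /= p.sum()        # softmax over {H0, H1}
    return 1 if (p[1] - p[0]) > eta else 0

# Placeholder 2-hidden-layer network (DBN2 in the paper's notation);
# an input dimension of 24 for two 12-dim features is an assumption.
rng = np.random.default_rng(0)
sizes = [24, 16, 16, 2]
weights = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
label = dbn_predict([rng.standard_normal(12), rng.standard_normal(12)],
                    weights, biases)
```

The pre-training and back-propagation phases would fit these weights; with random weights the decision is only a shape check of the pipeline.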

An RBM is an energy-based, two-layer, bipartite, undirected stochastic graphical model. One layer of the RBM is composed of visible units v, and the other layer is composed of hidden units h; there are symmetric connections between the two layers, no connections within each layer, and the connection weights can be represented by a weight matrix W. In this paper, only the Bernoulli(visible)-Bernoulli(hidden) RBM is considered, i.e., v_i ∈ {0,1} and h_j ∈ {0,1}. The RBM tries to find a model that maximizes the likelihood of the visible data, which is equivalent to the optimization problem

    max_W Σ_v log P(v; W)

where

    P(v; W) = Σ_h exp(-Energy(v, h; W)) / Σ_{v,h} exp(-Energy(v, h; W))

and

    Energy(v, h; W) = -b^T v - c^T h - h^T W v

where b and c are the bias terms of the visible layer and hidden layer, respectively.

The DBN is a powerful hierarchical generative model for feature extraction, and it can fuse the advantages of multiple features. In the existing DBN-based VAD, eight features are considered for feature extraction. As the number of features increases, the complexity of the DBN also increases and voice activity detection takes more time; the limitation of that work is therefore a time complexity that grows with the number of features. To overcome this problem, a method is designed in which the same accuracy is achieved using fewer, more significant features.

IV. IMPROVED VAD BASED ON DBN

An improved method is introduced, based on more significant features for feature extraction. The modification considers the following features: pitch, discrete Fourier transform (DFT), mel-frequency cepstral coefficients (MFCC), linear predictive coding (LPC), energy, and zero crossing rate (ZCR). In this method, energy and zero crossing rate are taken instead of the relative-spectral perceptual linear predictive analysis (RASTA-PLP) and amplitude modulation spectrograms (AMS) [8] used in the former case.

A. Feature Extraction

1) Pitch: The pitch is estimated using the cepstral method. The steps involved are: the original signal is transformed using a Fast Fourier Transform (FFT) algorithm; the resulting spectrum is converted to a logarithmic scale; and it is then transformed again with the same FFT algorithm to obtain the power cepstrum. The power cepstrum reverts to the time domain and exhibits peaks corresponding to the period of the frequency spacings. The power cepstrum is given by

    C(τ) = | F^{-1}{ log( |F{x[n]}|^2 ) } |^2

where τ is the quefrency (a time-like variable), F is the Fourier transform, x[n] is the signal in the time domain, and |F{x[n]}|^2 is the power spectrum estimate of the signal.

2) MFCC: Sounds are filtered by the shape of the vocal tract, including the tongue, teeth, etc. This shape determines what sound comes out, and it manifests itself in the envelope of the short-time power spectrum; the job of MFCCs is to accurately represent this envelope [9]. The steps involved are:
Preemphasis: passing the signal through a filter which emphasizes higher frequencies.
Framing: the speech signal is divided into short-time frames.
Hamming windowing: Y(n) = X(n) W(n), where the Hamming window is W(n) = 0.54 - 0.46 cos(2πn/(N-1)), 0 ≤ n ≤ N-1, N is the number of samples in each frame, Y(n) is the output signal, and X(n) is the input signal.
Fast Fourier Transform: converts each frame of N samples from the time domain into the frequency domain, Y(w) = H(w) X(w), where X(w), H(w) and Y(w) are the Fourier transforms of X(t), h(t) and Y(t), respectively.
Mel Filter Bank Processing: the frequency range of the FFT spectrum is very wide and the voice signal does not follow a linear scale, so a weighted sum of filter spectral components is used.
Discrete Cosine Transform: converts the log Mel spectrum back into the time domain; the result of the conversion is called the Mel Frequency Cepstral Coefficients.

3) LPC: LPC is a tool used in audio signal processing which represents the spectral envelope of a digital signal in compressed form [7]. The function [A,E] = lpc(X,N) finds the coefficients A = [1 A(2) ... A(N+1)] of an Nth-order forward linear predictor

    Xp(n) = -A(2)X(n-1) - A(3)X(n-2) - ... - A(N+1)X(n-N)

such that the sum of the squares of the errors err(n) = X(n) - Xp(n) is minimized; E is the variance (power) of the prediction error. X can be a vector or a matrix; if X is a matrix containing a separate signal in each column, lpc returns a model estimate for each column in the rows of A. N specifies the order of the polynomial A(z), which must be a positive integer.
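The lpc(X,N) behaviour described above can be approximated in a few lines. This sketch is an assumption, not MATLAB's implementation: it uses the autocorrelation method and solves the normal equations with the Levinson-Durbin recursion:

```python
import numpy as np

def lpc_coeffs(x, N):
    """Estimate Nth-order LPC coefficients by the autocorrelation method,
    solving the Yule-Walker normal equations with Levinson-Durbin recursion.
    Returns (a, e): a = [1, a2, ..., a(N+1)] and e = prediction error power,
    mirroring the [A,E] = lpc(X,N) interface described in the text."""
    x = np.asarray(x, dtype=float)
    # Biased autocorrelation estimates r[0..N]
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(N + 1)]) / len(x)
    a = np.zeros(N + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, N + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e  # reflection coefficient
        a[1:i] += k * a[i - 1:0:-1]                      # update a[1..i-1]
        a[i] = k
        e *= (1.0 - k * k)                               # shrink error power
    return a, e
```

For a signal generated by an Nth-order autoregressive model, a approaches the true error-filter coefficients and e approaches the innovation variance.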
N must be less than or equal to the length of X; if X is a matrix, N must be less than or equal to the length of each column of X.

4) RASTA-PLP: PLP speech analysis is based on the short-term spectrum of speech. RASTA applies a band-pass filter to the energy in each frequency subband, smoothing over short-term noise variations and removing any constant offset caused by static spectral coloration in the speech channel; RASTA-PLP thus makes PLP more robust to linear spectral distortions. The steps involved are: compute the critical band spectrum and take its logarithm.

Estimate the temporal derivative of the log critical band spectrum; re-integrate the log critical band temporal derivative; take the inverse logarithm of this relative log spectrum, yielding a relative auditory spectrum; and compute an all-pole model of this spectrum.

5) Energy: The energy of the speech signal provides a representation that reflects the signal's amplitude variations. The short-time energy can be defined as

    E_n = Σ_m [ x(m) w(n - m) ]^2

where w(m) is the Hamming window.

6) Zero Crossing Rate (ZCR): For discrete-time signals, a zero crossing is said to occur if successive samples have different algebraic signs. The rate at which zero crossings occur is a simple measure of the frequency content of a signal; the zero-crossing rate measures the number of times, in a given time interval/frame, that the amplitude of the speech signal passes through zero. The short-time zero-crossing rate is defined as

    Z_n = Σ_m | sgn[x(m)] - sgn[x(m-1)] | w(n - m)

where

    sgn[x(n)] = 1 for x(n) ≥ 0, and -1 for x(n) < 0

and

    w(n) = 1/(2N) for 0 ≤ n ≤ N-1, and 0 otherwise.

V. EXPERIMENTS AND RESULTS

All experiments are conducted with MATLAB 2013a on the Windows operating system.

A. Dataset

The training set consists of twenty signals, and the test set contains the same twenty signals. The sampling rate is 8 kHz. Because speech can be approximated as a stationary process on short time scales, the speech signals are divided into a sequence of short-time frames, and the frame is used as the basic detection unit. Given a frame, if more than half of its samples are labeled as speech, the frame is labeled as speech; otherwise, it is labeled as noise.

B. Acoustic Features for VAD

To better show the advantages of the feature fusion techniques, six acoustic features are extracted from each observation: pitch, discrete Fourier transform (DFT), mel-frequency cepstral coefficients (MFCC), linear predictive coding (LPC) [7], energy, and zero crossing rate (ZCR).

C. Parameter settings

For the proposed DBN based VAD, the depth of the DBN (i.e., the number of hidden layers, or the number of RBM models) is taken as 2. The n-layer DBN is denoted DBNn; that is, the DBN with only one hidden layer is denoted DBN1.

D. Results

Twenty speech signals are used for training the network. 40000 samples (from sample number 10000 to 50000) of each signal are taken for training, and testing is conducted on the same 40000 samples of the selected signal. The voice activity present in a signal is detected. Figures 1 and 2 show the VAD output for two example test signals; voice frames are represented by 1 and silence by 0. The comparison between the basic method and the improved method is given in the graph in Figure 3, from which it is clear that the proposed method performs better than the existing method.

Fig. 1. The original signal and the result of VAD for example 1.
Fig. 2. The original signal and the result of VAD for example 2.

Fig. 3. Comparison between the basic method and the improved method.

VI. CONCLUSION

The voice activity detector (VAD) is an important front-end of modern speech signal processing systems. The DBN-based VAD aims to extract a new feature that can fully express the advantages of all acoustic features by transferring the acoustic features through multiple nonlinear hidden layers. The complexity of the DBN grows as the number of features increases. A less complex VAD is designed by considering fewer, more significant features, and the improved method outperforms the existing method. The scope for future work lies in the wide variety of application areas of speech processing and in improving the recognition and transmission of speech; a modified VAD with fewer features will result in a less complex DBN.

REFERENCES

[1] Xiao-Lei Zhang and Ji Wu, "Deep Belief Networks Based Voice Activity Detection," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 4, April 2013.
[2] Tao Yu and John H. L. Hansen, "Discriminative Training for Multiple Observation Likelihood Ratio Based Voice Activity Detection," IEEE Signal Processing Letters, vol. 17, no. 11, November 2010.
[3] Ji Wu and Xiao-Lei Zhang, "VAD based on statistical models and machine learning approaches," Elsevier, Computer Speech and Language, 2010.
[4] Ji Wu and Xiao-Lei Zhang, "Maximum Margin Clustering Based Statistical VAD With Multiple Observation Compound Feature," IEEE Signal Processing Letters, vol. 18, no. 5, May 2011.
[5] Dongwen Ying, Yonghong Yan, Jianwu Dang, and Frank K. Soong, "Voice Activity Detection Based on an Unsupervised Learning Framework," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 8, November 2011.
[6] D. Yu and L. Deng, "Deep learning and its applications to signal and information processing," IEEE Signal Processing Magazine, vol. 28, no. 1, pp. 145-154, Jan. 2011.
[7] Lawrence R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Pearson Education, Jan. 2003.
[8] Jurgen Tchorz and Birger Kollmeier, "Automatic classification of the acoustical situation using amplitude modulation spectrograms."
[9] Lindasalwa Muda, Mumtaj Begam and I. Elamvazuthi, "Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques," Journal of Computing, vol. 2, issue 3, Mar 2010, ISSN 2151-9617.