CHAPTER 4 VOICE ACTIVITY DETECTION ALGORITHMS


4.1 INTRODUCTION

New frontiers of speech technology are demanding increased levels of performance in many areas. With the advent of wireless communications, new speech services are becoming a reality through the development of modern robust speech processing technology. Many researchers have discussed the ill effects of environmental noise on the performance of speech processing systems. Abhijeet Sangwan et al (2002) discussed the desirable aspects of Voice Activity Detection (VAD) algorithms: a good decision rule, adaptability to background noise and low computational complexity in estimating the noise spectrum. Background noise acoustically added to speech can degrade the performance of digital voice processors used for applications such as speech compression, recognition and authentication (Isrel 2003). Digital voice systems are used in a variety of environments, and their performance must be maintained at a level near that measured with noise-free input speech. To ensure continued reliability, the effects of background noise can be reduced either by internal modification of the voice processor algorithms to explicitly compensate for signal contamination, or by preprocessor noise reduction and noise-cancelling microphones.

Khaled et al (1997) observed that high-energy voiced speech segments are always detected by all VADs under very noisy conditions such as car, bus, babble and street noise; low-energy unvoiced speech, however, is commonly missed. The background noise which contaminates the signal results in either noise-only or speech-plus-noise segments. The VAD developed by Javier Ramirez et al (2005) makes it possible to define an effective endpoint detection algorithm employing novel noise reduction techniques and order statistics filters in the formulation of the decision rule. This VAD performs an advanced detection of word beginnings and a delayed detection of word endings, which in part avoids the inclusion of additional hangover schemes. In addition, the VAD provides speech / non-speech discrimination. It has been observed that low-energy portions of speech are the first to be falsely rejected, and a hangover scheme is required to lower the probability of false rejections (Alan Davis et al 2006). Robustness can be achieved by an appropriate extraction of robust features in the front-end and/or by adapting the references to the noise situation. Noise signals are selected to represent the most probable application scenarios for telecommunication terminals. Some noises, such as car noise and a recording in an exhibition hall, are fairly stationary; others, such as recordings on the street and at the airport, are non-stationary. A fast noise estimation algorithm proposed by Sundarrajan Rangachari et al (2004) gave good performance for a single sentence; the noise estimate was obtained by averaging past spectral power values using a smoothing parameter adjusted by the signal presence probability in subbands.

A novel VAD algorithm developed by Dong Kook Kim et al (2007) is based on the Gaussian distribution and the uniformly most powerful (UMP) test to detect speech or nonspeech in the input noisy signal. This method provides the decision rule by comparing the magnitude of the noisy speech signal with an adaptive threshold estimated from the noise statistics. A conditional Maximum A Posteriori (MAP) criterion decides the hypothesis with the maximum conditional probability given both the observation and the voice activity in the previous frame. This criterion leads to two separate thresholds for the Likelihood Ratio Test (LRT), depending on the VAD result of the previous frame, as discussed by Jong Won Shin et al (2008). Several VAD algorithms have been proposed for detecting voiced / unvoiced regions (Boll Steven et al 1980, Dhananjaya et al 2010, Falk Tiago et al 2006, Haitian Xu et al 2007, Jongseo Sohn et al 1999, Juan Manuel Gorriz et al 2008, Matteo Gerosa et al 2007, Plante et al 1998, Qi Li et al 2002, Richard et al 2000, Yutaka Kaneda et al 1986, Zenton Goh et al 1999, Zhong Lin et al 2007). In this chapter, the Voice Activity Detection (VAD) developed by Ramirez et al (2005) is presented along with the noise estimation algorithm discussed in Sundarrajan Rangachari et al (2004) and Abhijeet Sangwan et al (2002). Various VAD algorithms are studied and their performance is compared, using Zero Crossing Detection (ZCD), Weak Fricative Detection (WFD), Pitch Based Detection (PBD), Energy Based Detection (EBD) and the Subband Order Statistics Filter (OSF), in the presence of different types of noise such as suburban train, babble, car, exhibition hall, restaurant, street, airport and train-station noise for Automatic Speech Recognition (ASR).

4.2 VOICE ACTIVITY DETECTION ALGORITHMS

A straightforward approach is Voice Activity Detection (VAD), i.e., the process of discriminating speech from silence or other background noise. VAD algorithms are based on some combination of general speech properties such as temporal energy variations, periodicity and spectrum. The detection task is not as trivial as it appears, since an increasing level of background noise degrades the classifier effectiveness. VAD indicates the presence or absence of speech, as observed by Ramirez et al (2005). Voice is differentiated into speech or silence based on speech characteristics. The signal is sliced into contiguous frames and a real-valued nonnegative parameter is associated with each frame; if this parameter exceeds a certain threshold, the frame is classified as active, otherwise as inactive. The basic principle of a VAD device is that it extracts some measured features or quantities from the input signal and compares these values with thresholds. Voice activity (VAD = 1) is declared if the measured value exceeds the threshold; otherwise no speech activity (VAD = 0) is declared. In general, a VAD algorithm outputs a binary decision on a frame-by-frame basis, where a frame of the input signal is a short unit of time such as 20-40 ms. The following are some of the required features of a good VAD algorithm: (i) Good Decision Rule: a physical property of speech that can be exploited to give consistent and accurate judgment in classifying segments of the signal into silence or otherwise.

(ii) Adaptability to Background Noise: adapting to non-stationary background noise improves the robustness, especially in wireless telephony. (iii) Low Computational Complexity: the complexity of the VAD algorithm must be low to suit real-time applications.

A tree diagram representing the classification of VAD algorithms is shown in Figure 4.1; it groups them into parameter based methods (thresholding: ZCD; linear variance: EBD; segmentation: PBD) and frequency based methods (transform / power spectral density: WFD; subband OSF).

Figure 4.1 Tree diagram for VAD algorithms

The VAD algorithms are thus classified into two types: (i) Parameter Based VAD algorithms and (ii) Frequency Based VAD algorithms.

Parameter Based VAD algorithms are further classified into three types: (i) the Zero Crossing Detector, which is based on thresholding; (ii) Energy Based Detection, which is implemented through the linear variance; and (iii) Pitch Based Detection, through segmentation. The Frequency Based VAD algorithms consist of the Weak Fricative detector and the Subband Order Statistics Filter, which are formed under the transform method.

Zero Crossing Detector (ZCD)

The Zero Crossing Detector (ZCD) counts the number of times in a sound sample that the amplitude of the sound wave changes sign. The zero crossing count of a signal is the number of times it crosses the line of no disturbance, or zero line (Abhijeet Sangwan et al 2002). The number of zero crossings for a voice signal lies in a fixed range: for a duration of 10 ms it lies between 5 and 15. The number of zero crossings for noise, on the other hand, is random and unpredictable. This motivates a decision rule that is independent of energy and hence able to detect some low-energy phonemes:

If $Z_j \in R$, the frame is ACTIVE; else the frame is INACTIVE (4.1)

where $Z_j$ is the number of zero crossings detected in frame $f_j$ and $R = \{5, \ldots, 15\}$ is the range of zero crossing counts expected for speech over a duration of 10 ms.
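To make the rule concrete, the following is a minimal Python/NumPy sketch of frame-wise zero-crossing counting and decision rule (4.1). The function name, the 10 ms framing and the default range {5, ..., 15} mirror the description above; they are illustrative choices, not a reference implementation.

import numpy as np

def zcd_vad(signal, fs, frame_ms=10, zc_range=(5, 15)):
    """Classify frames as ACTIVE/INACTIVE from their zero-crossing count (rule 4.1)."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    decisions = np.zeros(n_frames, dtype=bool)
    for j in range(n_frames):
        frame = signal[j * frame_len:(j + 1) * frame_len]
        # Count sign changes within the frame
        zc = np.sum(np.abs(np.diff(np.sign(frame))) > 0)
        # Speech-like frames have a zero-crossing count inside the expected range
        decisions[j] = zc_range[0] <= zc <= zc_range[1]
    return decisions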

Weak Fricatives Detector (WFD)

The main drawback of the ZCD is that noise frames are misclassified as active whenever their zero crossing counts satisfy equation (4.1). The problem of discriminating speech from background noise is not trivial, except in acoustic environments with an extremely high Signal to Noise Ratio (SNR). In such high-SNR environments the energy of the lowest-level speech sounds exceeds the background noise energy, and thus a simple energy measurement suffices; however, such ideal recording conditions are not practical for most applications (Rabiner 2004). Therefore a method is required to classify weak fricatives against noise independently of the SNR or other noise characteristics. This problem can be overcome by using the autocorrelation function, which exploits the high correlation found in speech signals. The unbiased autocorrelation function is

$A[k] = \frac{1}{N-k} \sum_{n=0}^{N-k-1} y[n]\, y[n+k]$ (4.2)

where $A$ is the autocorrelation vector, $y[n]$ is the frame under consideration and $N$ is the frame length. The incoming signal is segmented into frames of duration 20 ms and the autocorrelation vector of each frame is computed and divided into subframes of length $L$. The energy of each subframe is computed as

$E_i = \sum_{m=1}^{L} A[(i-1)L + m]^2$ (4.3)

where the subframe index $i$ runs from 1 to the total number of subframes and $m$ indexes the samples within each subframe. Thus a vector of 20 such energy values is computed for each frame, denoted

$\mathbf{E}_j = (E_1, E_2, \ldots, E_{20})$ (4.4)

where $j$ is the frame under consideration. The classification parameter used is the variance of this vector. The Autocorrelation Vector Variance (AVV) is determined as

$\mathrm{AVV}_j = \mathrm{var}(\mathbf{E}_j)$ (4.5)

A reference value of the AVV for silence is computed by assuming the first 20 frames to be inactive,

$\mathrm{AVV}_{\mathrm{ref}} = \frac{1}{20} \sum_{j=1}^{20} \mathrm{AVV}_j$ (4.6)

The AVV of each subsequent frame is compared with a scalar multiple of this reference value to determine speech activity:

If $\mathrm{AVV}_j > k \cdot \mathrm{AVV}_{\mathrm{ref}}$, the frame is ACTIVE; else the frame is INACTIVE (4.7)

The value of $k$ was set to 7 by trial and error. Only active frames are marked as voiced; inactive frames are treated as unvoiced.
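The following Python/NumPy sketch summarises the AVV computation and decision rule (4.7) as reconstructed above. The subframe layout (20 subframes per 20 ms frame), the reference built from the first 20 frames and k = 7 follow the text; the function names and remaining defaults are illustrative assumptions.

import numpy as np

def unbiased_autocorr(y):
    """Unbiased autocorrelation estimate of one frame (equation 4.2)."""
    n = len(y)
    full = np.correlate(y, y, mode="full")[n - 1:]   # lags 0 .. n-1
    return full / (n - np.arange(n))                 # unbiased scaling

def avv_vad(signal, fs, frame_ms=20, n_subframes=20, k=7.0, n_ref_frames=20):
    """Weak-fricative detection via the Autocorrelation Vector Variance (AVV)."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    avv = np.zeros(n_frames)
    for j in range(n_frames):
        frame = signal[j * frame_len:(j + 1) * frame_len]
        ac = unbiased_autocorr(frame)
        # Split the autocorrelation vector into subframes and take their energies (eq. 4.3)
        sub_len = len(ac) // n_subframes
        energies = [np.sum(ac[i * sub_len:(i + 1) * sub_len] ** 2)
                    for i in range(n_subframes)]
        avv[j] = np.var(energies)                    # equation 4.5
    # Reference AVV from the leading frames, assumed to contain no speech (eq. 4.6)
    avv_ref = np.mean(avv[:n_ref_frames])
    return avv > k * avv_ref                         # rule 4.7: True means ACTIVE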

Pitch Period Based Detector (PBD)

Pitch period estimation is one of the most important problems in speech processing; pitch detectors are used in vocoders and in speaker identification and verification systems. Pitch period estimation can be done using the autocorrelation function, which provides a convenient representation and forms the basis for pitch detection. One of the major limitations of the autocorrelation representation is that it retains too much information about the speech signal, and as a result the autocorrelation function has too many peaks. To mitigate this problem it is useful to process the speech signal so as to make the periodicity more prominent while suppressing other distracting features of the signal. Numerous techniques have been proposed; the technique called centre clipping is reported in this thesis. The centre clipped speech signal (Sondhi 1968) is obtained by the nonlinear transformation

$y(n) = C[x(n)]$ (4.8)

where $C[\cdot]$ is shown in Figure 4.2.

Figure 4.2 Centre clipper transformation function

The operation of centre clipping is depicted in Figure 4.3.

Figure 4.3 Effect of centre clipping on a speech waveform

It can be seen that for samples above the clipping level $C_L$ the output of the centre clipper is equal to the input minus the clipping level, while for samples below the clipping level the output is zero. For high clipping levels fewer peaks exceed the clipping level, and thus fewer pulses appear in the output; if the clipping level is decreased, more peaks pass through the clipper and the autocorrelation function becomes more complex (Rabiner 2004). The problem of extraneous peaks in the autocorrelation function can be eliminated by centre clipping prior to computing it. A remaining difficulty with the autocorrelation representation is the large amount of computation required. A simple modification of the centre clipping function leads to a great simplification of the autocorrelation computation: the output of the clipper is +1 if $x(n) > +C_L$

and $-1$ if $x(n) < -C_L$; otherwise the output is zero. The computation of the autocorrelation function for a 3-level centre clipped signal is particularly simple. Most of the extraneous peaks are eliminated, and a clear indication of periodicity is retained. The three-level centre clipping function is shown in Figure 4.4.

Figure 4.4 Three level centre clipping function

A novel algorithm for estimating the pitch period from the short-time autocorrelation function was proposed by Dubnowski et al (1976). The steps in the pitch based VAD algorithm are given below:

i. The speech signal is filtered with a 900 Hz low pass analog filter and sampled at a rate of 10 kHz.
ii. Segments of length 30 ms are selected at 10 ms intervals.
iii. Using the clipping level, the speech signal is processed by a 3-level centre clipper and the autocorrelation function is computed over a range spanning the expected range of pitch periods.
iv. The largest peak of the autocorrelation function is located and the peak value is compared to a fixed threshold. If the peak falls below the threshold, the segment is classified as unvoiced; otherwise the segment is voiced.
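A minimal Python/NumPy sketch of steps (iii) and (iv) is given below for a single frame. The 3-level clipper and the peak test follow the description above; the clipping ratio, the pitch search range (50-400 Hz) and the voicing threshold are illustrative assumptions, and the 900 Hz low-pass prefiltering of step (i) is omitted.

import numpy as np

def three_level_clip(x, clip_level):
    """Three-level centre clipper: +1 above +C_L, -1 below -C_L, 0 otherwise."""
    return np.where(x > clip_level, 1.0, np.where(x < -clip_level, -1.0, 0.0))

def pitch_based_vad(frame, fs, f0_min=50.0, f0_max=400.0, clip_ratio=0.3,
                    voiced_threshold=0.3):
    """Classify one frame as voiced (True) or unvoiced (False) from the largest
    autocorrelation peak of its 3-level centre-clipped version."""
    clip_level = clip_ratio * np.max(np.abs(frame))
    c = three_level_clip(frame, clip_level)
    ac = np.correlate(c, c, mode="full")[len(c) - 1:]   # lags 0 .. N-1
    if ac[0] <= 0:
        return False                                    # silent frame
    ac = ac / ac[0]                                     # normalise by the lag-0 value
    lag_min = int(fs / f0_max)                          # shortest expected pitch period
    lag_max = min(int(fs / f0_min), len(ac) - 1)        # longest expected pitch period
    peak = np.max(ac[lag_min:lag_max + 1])
    # Voiced if the strongest peak in the pitch range exceeds a fixed threshold
    return peak > voiced_threshold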

Energy Based Detector (EBD)

The amplitude of a speech signal varies appreciably with time, and the amplitude of unvoiced segments is generally much lower than that of voiced segments. The energy of a signal gives a convenient representation that reflects the amplitude of the signal. The energy of a frame indicates the possible presence of voice data and is an important parameter used in VAD algorithms. Let $X(i)$ be the $i$-th sample of speech. If the length of a frame is $k$ samples, then the $j$-th frame can be represented in the time domain by the sequence

$f_j = \{X(i)\}, \quad i = (j-1)k+1, \ldots, jk$ (4.9)

and $E_j$, the energy of the $j$-th frame, is

$E_j = \sum_{i=(j-1)k+1}^{jk} X(i)^2$ (4.10)

The VAD algorithm is trained for a short period with a prerecorded sample that contains only background noise, and the initial thresholds of the various parameters are computed from these samples. The initial energy threshold is obtained by taking the mean of the energies $E_m$ of the noise-only frames,

$E = \frac{1}{N} \sum_{m=1}^{N} E_m$ (4.11)

where $E$ is the initial threshold estimate and $N$ is the number of frames in the prerecorded sample; the initial 20 frames are considered INACTIVE.

The classification rule for speech is as follows:

If $E_j > kE$ (with $k > 1$), the frame is ACTIVE; else the frame is INACTIVE (4.12)

Here $E$ represents the noise energy estimate, while $k$ is the threshold factor used in the decision making. Active frames are transmitted while inactive frames are not. Energy based decisions are not good for low-energy phonemes, and weak fricatives are sometimes silenced completely. High-energy voiced speech segments are detected by all VAD algorithms even under noisy conditions, but low-energy unvoiced speech is commonly missed, reducing speech quality.
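A minimal Python/NumPy sketch of the energy-based rule (4.12) is shown below. The 20 ms framing, the number of leading noise-only frames and the factor k = 2 are illustrative assumptions; only the structure (noise-trained threshold, then per-frame comparison) follows the description above.

import numpy as np

def energy_vad(signal, fs, frame_ms=20, n_noise_frames=20, k=2.0):
    """Energy Based Detection: frames whose energy exceeds k times the initial
    noise-energy estimate are declared ACTIVE (equations 4.11 and 4.12)."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    energies = np.array([np.sum(signal[j * frame_len:(j + 1) * frame_len] ** 2)
                         for j in range(n_frames)])
    # Initial threshold: mean energy of the leading frames, assumed to be noise only
    e_noise = np.mean(energies[:n_noise_frames])
    return energies > k * e_noise                     # True means ACTIVE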

Subband OSF Based VAD

Javier Ramirez et al (2005) propose determining the speech / nonspeech divergence by means of specialised Order Statistics Filters (OSF) working on the subband log-energies. Filters based on order statistics have been successfully employed in the restoration of signals and images corrupted by additive noise; the most common OSF is the median filter, which is easy to implement and exhibits good performance in removing impulsive noise. Figure 4.5 shows the block diagram of the subband based VAD. The algorithm operates on the subband log-energies: noise reduction is performed first, and the VAD decision is formulated on the de-noised signal. The noisy speech signal is decomposed into 25 ms frames with a 10 ms window shift. Let $X(m, l)$ be the spectrum magnitude for the $m$-th band at frame $l$. The design of the noise reduction block is based on Wiener Filter (WF) theory, whereby the attenuation is a function of the Signal to Noise Ratio (SNR) of the input signal. The VAD decision is formulated in terms of the de-noised signal, whose subband log-energies are processed by means of order statistics filters.

Figure 4.5 Block diagram of the Subband OSF based VAD (FFT, spectral smoothing, noise estimation and update, WF design and frequency domain filtering, followed by the VAD decision)

The noise reduction block consists of four stages.

i) Spectrum smoothing: the power spectrum is averaged over two consecutive frames and two adjacent spectral bands.

ii) Noise estimation: the noise spectrum $N_e(m, l)$ is updated by means of a first order IIR filter on the smoothed spectrum $X_s(m, l)$,

$N_e(m, l) = \lambda\, N_e(m, l-1) + (1 - \lambda)\, X_s(m, l)$ (4.13)

where $\lambda = 0.99$ and $m = 0, 1, \ldots, \mathrm{NFFT}/2$, NFFT being the FFT length.

iii) Wiener Filter (WF) design: first, the clean signal is estimated by combining smoothing and spectral subtraction,

$S_e(m, l) = \mu\, \hat{S}(m, l-1) + (1 - \mu)\, \max\big(X_s(m, l) - N_e(m, l),\, 0\big)$ (4.14)

where $\mu$ is a smoothing constant. Then, the WF $W(m, l)$ is designed as

$W(m, l) = \frac{\eta(m, l)}{1 + \eta(m, l)}$ (4.15)

where

$\eta(m, l) = \max\!\left(\frac{S_e(m, l)}{N_e(m, l)},\, \eta_{\min}\right)$ (4.16)

and $\eta_{\min}$ is selected so that the filter yields a 20 dB maximum attenuation. $\hat{S}(m, l)$, the spectrum of the cleaned speech signal, is assumed to be zero at the beginning of the process and is used for designing the WF through equations (4.13) to (4.15). It is given by

$\hat{S}(m, l) = W(m, l)\, X_s(m, l)$ (4.17)

The filter $W(m, l)$ is smoothed in order to eliminate rapid changes between neighbouring frequencies that may often cause musical noise. Thus the variance of the residual noise is reduced and, consequently, the robustness when detecting nonspeech is enhanced. The smoothing is performed by truncating the impulse response of the corresponding causal FIR filter to 17 taps using a Hanning window. With this operation performed in the time domain, the frequency response of the Wiener Filter is smoothed and the performance of the VAD is improved.

iv) Frequency domain filtering: the smoothed filter $H(m, l)$ is applied in the frequency domain to obtain the de-noised spectrum

$Y(m, l) = H(m, l)\, X(m, l)$ (4.18)

Once the input speech has been de-noised, the log-energies for the $l$-th frame, $E(k, l)$, in $K$ subbands ($k = 0, 1, \ldots, K-1$) are computed by means of

$E(k, l) = \log\!\left(\frac{K}{\mathrm{NFFT}} \sum_{m \in B_k} |Y(m, l)|^2\right), \quad k = 0, 1, \ldots, K-1$ (4.19)

where an equally spaced subband assignment $B_k$ is used. The algorithm uses two OSFs for the multiband quantile (MBQ) SNR estimation. A first OSF estimates the subband signal energy as the $p$ sampling quantile of the ordered subband log-energies in a window of $2N+1$ frames centred on frame $l$,

$Q(k, l) = (1 - f)\, E_{(r)}(k, l) + f\, E_{(r+1)}(k, l)$ (4.20)

where $E_{(r)}(k, l)$ denotes the $r$-th order statistic of the windowed log-energies, $r = \lfloor 2Np \rfloor$ and $f = 2Np - r$. Finally, the SNR in each subband is measured by

$\mathrm{QSNR}(k, l) = Q(k, l) - N(k, l)$ (4.21)

where $N(k, l)$ is the noise level in the $k$-th band, which needs to be estimated. For the initialization of the algorithm, the first $N$ frames are assumed to be nonspeech frames and the noise level in the $k$-th band is estimated as the median of the set $\{E(k, 0), E(k, 1), \ldots, E(k, N-1)\}$.

In order to track nonstationary noisy environments, the noise references are updated during nonspeech periods by means of a second OSF (a median filter),

$N(k, l) = \beta\, N(k, l-1) + (1 - \beta)\, M(k, l), \quad k = 0, 1, \ldots, K-1$ (4.22)

where $M(k, l)$ is the output of the median filter and $\beta = 0.97$ was experimentally selected. On the other hand, the sampling quantile $p = 0.9$ is selected as a good estimate of the subband spectral envelope. The decision rule is then formulated in terms of the average subband SNR,

$\overline{\mathrm{SNR}}(l) = \frac{1}{K} \sum_{k=0}^{K-1} \mathrm{QSNR}(k, l)$ (4.23)

If this SNR is greater than a threshold $\gamma$, the current frame is classified as speech; otherwise it is classified as nonspeech. It is assumed that the system will work under different noisy conditions and that optimal thresholds $\gamma_0$ and $\gamma_1$ can be determined for the system working in the cleanest and noisiest conditions, respectively. Thus, the threshold is made adaptive to the measured fullband noise energy $E$,

$\gamma = \begin{cases} \gamma_0, & E \le E_0 \\ \gamma_0 + \dfrac{\gamma_1 - \gamma_0}{E_1 - E_0}\,(E - E_0), & E_0 < E < E_1 \\ \gamma_1, & E \ge E_1 \end{cases}$ (4.24)

enabling the VAD to select the optimum working point for different SNR conditions. Note that the threshold is linearly decreased as the noise level increases between $E_0$ and $E_1$, the fullband noise energies that define the cleanest and noisiest conditions with their optimum thresholds $\gamma_0$ and $\gamma_1$, respectively.
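The following Python/NumPy sketch illustrates the decision part of this scheme: subband log-energies, a p-quantile OSF as the signal-level estimate, a median-based noise reference updated during nonspeech, and the averaged subband SNR test. It is a simplified, offline illustration only: the noise reduction block and the adaptive threshold of equation (4.24) are omitted, and all parameter values (number of bands, window length, fixed threshold) are illustrative assumptions.

import numpy as np

def subband_osf_vad(x, fs, n_bands=4, frame_ms=25, shift_ms=10,
                    n_init=10, win=8, p=0.9, beta=0.97, snr_threshold=2.0):
    """Simplified subband OSF VAD decision on the (noisy) input signal x."""
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, shift)]
    # Power spectra of windowed frames
    spec = np.array([np.abs(np.fft.rfft(f * np.hanning(frame_len))) ** 2 for f in frames])
    # Subband log-energies E(k, l) for K equally spaced bands (cf. eq. 4.19)
    edges = np.linspace(0, spec.shape[1], n_bands + 1, dtype=int)
    E = np.log(np.array([[np.sum(s[edges[k]:edges[k + 1]]) + 1e-12
                          for k in range(n_bands)] for s in spec]))
    # Noise reference: median of the first frames, assumed to be nonspeech
    noise = np.median(E[:n_init], axis=0)
    decisions = []
    for l in range(len(E)):
        lo, hi = max(0, l - win), min(len(E), l + win + 1)
        # p-quantile OSF over neighbouring frames as the signal-level estimate (cf. eq. 4.20)
        q = np.quantile(E[lo:hi], p, axis=0)
        snr = np.mean(q - noise)          # averaged subband SNR in the log domain (cf. eq. 4.23)
        speech = snr > snr_threshold
        if not speech:
            # Update the noise reference during nonspeech periods (cf. eq. 4.22)
            noise = beta * noise + (1 - beta) * np.median(E[lo:hi], axis=0)
        decisions.append(speech)
    return np.array(decisions)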

4.3 DRAWBACKS OF EXISTING ALGORITHMS

The existing algorithm is based on the assumption that the noise spectrum does not vary significantly within the $N$-frame neighbourhood of the $l$-th frame. However, this does not hold in the case of highly non-stationary noise. The noise estimate obtained from the first frame is used to de-noise the following eight frames, and this estimate is very low for the first frame, so the algorithm fails at the beginning to evaluate the noise spectrum and the detection afterwards can be totally erroneous. The existing algorithm also fails to update the threshold in low noise conditions, which degrades the performance of the VAD.

Proposed Algorithm

The proposed algorithm does not depend on a feedback loop for noise spectrum estimation. Instead it uses a noise estimation algorithm which updates the noise estimate for every frame. This method of noise estimation is well suited to highly non-stationary environments, thus increasing robustness, as discussed in Sundarrajan Rangachari et al (2004).

Figure 4.6 Block diagram of the proposed VAD (FFT, spectral smoothing, the frame-wise noise estimation and update, WF design and frequency domain filtering, followed by the VAD decision)

The noise estimate is updated by averaging the noisy speech power spectrum using a time and frequency dependent smoothing factor, which is adjusted based on the signal presence probability in subbands. This improves the speech / non-speech discriminability and the speech recognition performance in noisy environments. Two problems are addressed by the VAD: its performance in low noise conditions and its performance in noisy environments. The block diagram of the proposed VAD is shown in Figure 4.6. The noise estimation algorithm is as follows. The smoothed power spectrum of the noisy speech signal is estimated using a first-order recursive formula,

$P(\lambda, k) = \eta\, P(\lambda - 1, k) + (1 - \eta)\, |Y(\lambda, k)|^2$ (4.25)

where $|Y(\lambda, k)|^2$ is an estimate of the short-time power spectrum of the noisy speech, $\eta$ is a smoothing constant, $\lambda$ is the frame index and $k$ is the frequency bin index. Since the noisy speech power spectrum in speech-absent frames is equal to the power spectrum of the noise, the estimate of the noise spectrum can be updated by tracking the speech-absent frames. To do so, the ratio of the energy of the noisy speech power spectrum in three frequency bands (low: 0-1 kHz, middle: 1-3 kHz, high: 3 kHz and above) to the energy of the corresponding frequency band in the previous noise estimate is computed. The following three ratios are used:

$\xi_L(\lambda) = \frac{\sum_{k \in \mathrm{low}} P(\lambda, k)}{\sum_{k \in \mathrm{low}} N(\lambda - 1, k)}$ (4.26)

$\xi_M(\lambda) = \frac{\sum_{k \in \mathrm{mid}} P(\lambda, k)}{\sum_{k \in \mathrm{mid}} N(\lambda - 1, k)}$ (4.27)

$\xi_H(\lambda) = \frac{\sum_{k \in \mathrm{high}} P(\lambda, k)}{\sum_{k \in \mathrm{high}} N(\lambda - 1, k)}$ (4.28)

where $N(\lambda, k)$ is the estimate of the noise power spectrum at frame $\lambda$, and the low, middle and high bands extend up to 1 kHz, from 1 kHz to 3 kHz, and from 3 kHz to half the sampling frequency $F_s$, respectively. Each incoming frame is then classified as speech present or speech absent in the following manner. The frame is declared speech absent if

$\xi_L(\lambda) < \delta, \quad \xi_M(\lambda) < \delta, \quad \xi_H(\lambda) < \delta$ (4.29)

where $\delta$ is a threshold. For a speech-absent frame the noise estimate is updated according to

$N(\lambda, k) = \mu\, N(\lambda - 1, k) + (1 - \mu)\, |Y(\lambda, k)|^2$ (4.30)

where $\mu$ is a smoothing constant. If any or all of the three ratios are larger than the threshold, a different rule is used for updating and estimating the noise spectrum. For speech-present frames the noise update is as follows: frequency bins are classified as speech present or absent by tracking the local minimum of the noisy speech spectrum, and speech presence in each frequency bin is then decided separately using the ratio of the noisy speech power to its local minimum. A non-linear rule is used for tracking the minimum of the noisy speech by continuously averaging past spectral values:

if $P_{\min}(\lambda - 1, k) < P(\lambda, k)$ then

$P_{\min}(\lambda, k) = \gamma\, P_{\min}(\lambda - 1, k) + \frac{1 - \gamma}{1 - \beta}\big(P(\lambda, k) - \beta\, P(\lambda - 1, k)\big)$ (4.31)

else $P_{\min}(\lambda, k) = P(\lambda, k)$, where $P_{\min}(\lambda, k)$ is the local minimum of the noisy speech power spectrum and $\gamma$ and $\beta$ are constants whose values are determined experimentally. Let $S_r(\lambda, k) = P(\lambda, k) / P_{\min}(\lambda, k)$ denote the ratio between the energy of the noisy speech and its local minimum. This ratio is compared against a frequency-dependent threshold, and if it is larger than that threshold the corresponding frequency bin is considered to contain speech; the bin-wise decision is denoted $I(\lambda, k)$. Using this ratio, the new frequency-dependent smoothing constant $\alpha_s(\lambda, k)$ is estimated as

$\alpha_s(\lambda, k) = \alpha_d + (1 - \alpha_d)\, p(\lambda, k)$ (4.32)

where $\alpha_d$ and $\alpha_p$ are smoothing constants and $p(\lambda, k) = \alpha_p\, p(\lambda - 1, k) + (1 - \alpha_p)\, I(\lambda, k)$ is the smoothed speech presence decision. The frequency-dependent threshold $\delta(k)$ is given as

$\delta(k) = \begin{cases} \delta_L, & k \le 1\ \mathrm{kHz} \\ \delta_M, & 1\ \mathrm{kHz} < k \le 3\ \mathrm{kHz} \\ \delta_H, & 3\ \mathrm{kHz} < k \le F_s/2 \end{cases}$ (4.33)

with band-wise constants chosen experimentally. Finally, after computing the frequency-dependent smoothing factor $\alpha_s(\lambda, k)$, the noise spectrum estimate is updated according to

$N(\lambda, k) = \alpha_s(\lambda, k)\, N(\lambda - 1, k) + \big(1 - \alpha_s(\lambda, k)\big)\, |Y(\lambda, k)|^2$ (4.34)
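The following Python/NumPy sketch gathers the above steps (smoothed spectrum, minimum tracking, bin-wise speech presence and the smoothing-factor-driven noise update) into one routine. It is a sketch in the spirit of equations (4.25)-(4.34), not a faithful reimplementation: the constants are illustrative, a single ratio threshold replaces the frequency-dependent threshold of (4.33), and the band-level test of (4.26)-(4.29) is not included.

import numpy as np

def estimate_noise(Y_power, eta=0.7, gamma=0.998, beta_c=0.8,
                   alpha_d=0.85, alpha_p=0.2, delta=2.0):
    """Minimum-tracking noise estimation; Y_power is an (n_frames, n_bins)
    noisy-speech power spectrogram. Returns the per-frame noise estimate."""
    n_frames, n_bins = Y_power.shape
    P = Y_power[0].copy()           # smoothed noisy-speech power spectrum (eq. 4.25)
    P_min = Y_power[0].copy()       # tracked local minimum (eq. 4.31)
    p_speech = np.zeros(n_bins)     # smoothed per-bin speech presence
    noise = Y_power[0].copy()       # noise power estimate (eq. 4.34)
    noise_track = np.zeros_like(Y_power)
    for l in range(n_frames):
        P_prev = P.copy()
        P = eta * P + (1 - eta) * Y_power[l]                      # eq. 4.25
        rising = P_min < P
        P_min = np.where(rising,
                         gamma * P_min + ((1 - gamma) / (1 - beta_c))
                         * (P - beta_c * P_prev),                 # cf. eq. 4.31
                         P)
        # Per-bin speech presence: power well above its tracked local minimum
        speech_present = (P / np.maximum(P_min, 1e-12)) > delta
        p_speech = alpha_p * p_speech + (1 - alpha_p) * speech_present
        # Time- and frequency-dependent smoothing factor and noise update
        alpha_s = alpha_d + (1 - alpha_d) * p_speech              # cf. eq. 4.32
        noise = alpha_s * noise + (1 - alpha_s) * Y_power[l]      # cf. eq. 4.34
        noise_track[l] = noise
    return noise_track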

4.4 RESULTS AND DISCUSSIONS

The proposed structure for increasing the recognition accuracy of a robust speech recognition system using VAD algorithms is shown in Figure 4.7. The system consists of two main parts, a preprocessor and the ASR. The preprocessor includes the Voice Activity Detector (VAD), which identifies the presence or absence of speech and extracts the speech from the noise-corrupted input.

Figure 4.7 Structure of the speech recognition system (input speech, noise estimation and VAD, followed by the ASR and the recognition accuracy measurement)

Figure 4.8 shows the original clean speech signal. Figure 4.9 shows the output of the existing algorithm when the original signal corrupted by airport noise at 0 dB SNR is given as input: due to false estimation of the noise spectrum, the algorithm fails at the beginning of the utterance itself, so most of the noise-only frames are classified as speech-present frames. Figure 4.10 shows the output of the proposed algorithm; the speech frames are extracted correctly from the noisy speech signal.

One hundred words were taken for speech recognition (isolated word recognition with statistical modelling using a Hidden Markov Model) after adding various noise environments. The input word utterances were analysed under the most commonly encountered noise environments, namely suburban train, babble, car, exhibition hall, restaurant, street, airport and train-station noise, taken from the AURORA database. In the training phase, the uttered words, 100 samples of each digit 0-9 in both male and female voices (ages 15-25), were recorded using 8-bit Pulse Code Modulation (PCM) at a sampling rate of 8 kHz from a single channel input and saved as wave files using sound recorder software. The proposed framework uses a speech processing module that includes Hidden Markov Model (HMM) based classification and noise language modelling to achieve effective noise knowledge estimation, as discussed in Chapter 2. The performance of the ASR was analysed under noisy conditions with and without the VAD, and the accuracy in percentage is reported below. The Subband Order Statistics Filter (OSF) algorithm performs better than the other VAD algorithms, and the recognition accuracy of all the VAD algorithms can be improved if noise estimation in the non-stationary environment is taken into account. This chapter presents a proposed structure for speech recognition systems in which a Subband Order Statistics Filter (OSF) improves speech detection robustness in noisy environments. The approach is based on an effective endpoint detection algorithm employing noise reduction techniques and order statistic filters for the formulation of the decision rule. Automatic speech recognition systems work reasonably well under clean conditions but become fragile in practical applications involving real-world environments.

Figure 4.8 Original clean speech signal

Figure 4.9 Output of the existing VAD (noisy input signal)

Figure 4.10 Output of the proposed algorithm (noisy input signal)

Tables 4.1 through 4.8 depict the performance of the Subband Order Statistics Filter (OSF) based Voice Activity Detection of Ramirez et al (2005) and of the proposed algorithm under various noise conditions in terms of the improvement in Recognition Accuracy (RA). From Tables 4.1 and 4.2 it was observed that the ASR with VAD in the presence of the babble noise source performed best, with a 20.81% improvement in RA over the existing algorithm at 0 dB SNR. Overall, the speech recognition accuracy of the proposed algorithm shows an improvement of 13.54% in RA compared with the algorithm proposed by Ramirez et al (2005) in the presence of the various noise sources.

From Tables 4.3 and 4.4 it was found that, in the presence of exhibition hall noise at the 5 dB SNR level, the proposed algorithm performed better with an 11.71% improvement in RA; overall it shows an improvement in RA of 8.07% compared with the existing algorithm (Ramirez et al). From Tables 4.5 and 4.6 it was observed that the proposed algorithm at the 10 dB noise level for the train noise source shows an RA improvement of 8.27%; the existing algorithm has an average RA of 80.01%, while the proposed algorithm achieves an average RA of 85.18%. From Tables 4.7 and 4.8 it was inferred that, in the presence of the airport noise source at the 15 dB level, the proposed algorithm performed better with a 5.64% improvement in RA, and overall it has an improvement in RA of 3.67% compared with the existing algorithm.

Table 4.9 shows the performance of the ASR. The proposed method performs best with a maximum improvement of 20.81% in RA for babble noise and a minimum improvement of 2.26% in RA for street noise. The overall performance analysis of the existing VAD algorithm and the proposed algorithm is shown in Table 4.10.

Table 4.9 Overall performance analysis of the proposed VAD algorithm in terms of % improvement in RA

Percentage improvement   0 dB               5 dB                  10 dB             15 dB
Better                   Babble (20.81 %)   Exhibition (11.71 %)  Train (8.27 %)    Airport (5.64 %)
Least                    Airport (6.33 %)   Airport (6.13 %)      Babble (4.21 %)   Street (2.26 %)

Table 4.10 Overall performance analysis of VAD algorithms

VAD Method      0 dB (% accuracy)   5 dB (% accuracy)   10 dB (% accuracy)   15 dB (% accuracy)
EBD
ZCD
WFD
PBD
Ramirez et al
Proposed

The VAD recognition accuracies for different SNR values for the Subband OSF based VAD and the proposed method are shown in Figure 4.11. It was observed that the best recognition occurred for restaurant noise (84.225%) and the least for exhibition hall noise (78.625%).

Figure 4.11 Comparison of the Ramirez et al and proposed VAD methods for various noise environments (overall % of RA for the proposed and existing VAD across the noise sources)

The proposed VAD works well for non-stationary signals. In most speech enhancement schemes the noise is suppressed and the speech signal is enhanced; in the proposed VAD algorithm a new noise estimation algorithm is combined with the OSF, which improves both the quality and the RA of the speech recognition system.

4.5 CONCLUSION

The algorithms based solely on energy did not give acceptable speech recognition accuracy with all the test templates. The other techniques (the autocorrelation function and zero crossing detection) gave better speech recognition accuracy. The ZCD was used to recover some low-energy phonemes that were rejected by the energy based detector; however, it also picked up certain noise frames that matched the zero crossing criterion.

The WFD technique performed better than the ZCD in the detection of weak fricatives. A pitch based detection algorithm is designed to estimate the pitch, or fundamental frequency, of a quasi-periodic or virtually periodic signal; the performance of the PBD differs from that of the other techniques and is comparable to that of the WFD. The proposed method combines the noise estimation algorithm with the VAD algorithms so that improved speech recognition accuracy can be obtained under these noise conditions. This chapter presented a proposed structure for speech recognition systems in which a Subband Order Statistics Filter (OSF) improves speech detection robustness in noisy environments. The approach is based on an effective endpoint detection algorithm employing noise reduction techniques and order statistic filters for the formulation of the decision rule. The proposed algorithm performs better than the existing algorithm in the case of non-stationary noise.


More information

Adaptive Noise Reduction Algorithm for Speech Enhancement

Adaptive Noise Reduction Algorithm for Speech Enhancement Adaptive Noise Reduction Algorithm for Speech Enhancement M. Kalamani, S. Valarmathy, M. Krishnamoorthi Abstract In this paper, Least Mean Square (LMS) adaptive noise reduction algorithm is proposed to

More information

Signal Processing for Speech Applications - Part 2-1. Signal Processing For Speech Applications - Part 2

Signal Processing for Speech Applications - Part 2-1. Signal Processing For Speech Applications - Part 2 Signal Processing for Speech Applications - Part 2-1 Signal Processing For Speech Applications - Part 2 May 14, 2013 Signal Processing for Speech Applications - Part 2-2 References Huang et al., Chapter

More information

Kalman Tracking and Bayesian Detection for Radar RFI Blanking

Kalman Tracking and Bayesian Detection for Radar RFI Blanking Kalman Tracking and Bayesian Detection for Radar RFI Blanking Weizhen Dong, Brian D. Jeffs Department of Electrical and Computer Engineering Brigham Young University J. Richard Fisher National Radio Astronomy

More information

KONKANI SPEECH RECOGNITION USING HILBERT-HUANG TRANSFORM

KONKANI SPEECH RECOGNITION USING HILBERT-HUANG TRANSFORM KONKANI SPEECH RECOGNITION USING HILBERT-HUANG TRANSFORM Shruthi S Prabhu 1, Nayana C G 2, Ashwini B N 3, Dr. Parameshachari B D 4 Assistant Professor, Department of Telecommunication Engineering, GSSSIETW,

More information

SOUND SOURCE RECOGNITION AND MODELING

SOUND SOURCE RECOGNITION AND MODELING SOUND SOURCE RECOGNITION AND MODELING CASA seminar, summer 2000 Antti Eronen antti.eronen@tut.fi Contents: Basics of human sound source recognition Timbre Voice recognition Recognition of environmental

More information

Design and Implementation on a Sub-band based Acoustic Echo Cancellation Approach

Design and Implementation on a Sub-band based Acoustic Echo Cancellation Approach Vol., No. 6, 0 Design and Implementation on a Sub-band based Acoustic Echo Cancellation Approach Zhixin Chen ILX Lightwave Corporation Bozeman, Montana, USA chen.zhixin.mt@gmail.com Abstract This paper

More information

CG401 Advanced Signal Processing. Dr Stuart Lawson Room A330 Tel: January 2003

CG401 Advanced Signal Processing. Dr Stuart Lawson Room A330 Tel: January 2003 CG40 Advanced Dr Stuart Lawson Room A330 Tel: 23780 e-mail: ssl@eng.warwick.ac.uk 03 January 2003 Lecture : Overview INTRODUCTION What is a signal? An information-bearing quantity. Examples of -D and 2-D

More information

Voiced/nonvoiced detection based on robustness of voiced epochs

Voiced/nonvoiced detection based on robustness of voiced epochs Voiced/nonvoiced detection based on robustness of voiced epochs by N. Dhananjaya, B.Yegnanarayana in IEEE Signal Processing Letters, 17, 3 : 273-276 Report No: IIIT/TR/2010/50 Centre for Language Technologies

More information

A Correlation-Maximization Denoising Filter Used as An Enhancement Frontend for Noise Robust Bird Call Classification

A Correlation-Maximization Denoising Filter Used as An Enhancement Frontend for Noise Robust Bird Call Classification A Correlation-Maximization Denoising Filter Used as An Enhancement Frontend for Noise Robust Bird Call Classification Wei Chu and Abeer Alwan Speech Processing and Auditory Perception Laboratory Department

More information

Enhancement of Speech in Noisy Conditions

Enhancement of Speech in Noisy Conditions Enhancement of Speech in Noisy Conditions Anuprita P Pawar 1, Asst.Prof.Kirtimalini.B.Choudhari 2 PG Student, Dept. of Electronics and Telecommunication, AISSMS C.O.E., Pune University, India 1 Assistant

More information

ON THE RELATIONSHIP BETWEEN INSTANTANEOUS FREQUENCY AND PITCH IN. 1 Introduction. Zied Mnasri 1, Hamid Amiri 1

ON THE RELATIONSHIP BETWEEN INSTANTANEOUS FREQUENCY AND PITCH IN. 1 Introduction. Zied Mnasri 1, Hamid Amiri 1 ON THE RELATIONSHIP BETWEEN INSTANTANEOUS FREQUENCY AND PITCH IN SPEECH SIGNALS Zied Mnasri 1, Hamid Amiri 1 1 Electrical engineering dept, National School of Engineering in Tunis, University Tunis El

More information

A Computational Efficient Method for Assuring Full Duplex Feeling in Hands-free Communication

A Computational Efficient Method for Assuring Full Duplex Feeling in Hands-free Communication A Computational Efficient Method for Assuring Full Duplex Feeling in Hands-free Communication FREDRIC LINDSTRÖM 1, MATTIAS DAHL, INGVAR CLAESSON Department of Signal Processing Blekinge Institute of Technology

More information

Denoising Of Speech Signal By Classification Into Voiced, Unvoiced And Silence Region

Denoising Of Speech Signal By Classification Into Voiced, Unvoiced And Silence Region IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) e-issn: 2278-2834,p- ISSN: 2278-8735.Volume 11, Issue 1, Ver. III (Jan. - Feb.216), PP 26-35 www.iosrjournals.org Denoising Of Speech

More information

Speech Synthesis using Mel-Cepstral Coefficient Feature

Speech Synthesis using Mel-Cepstral Coefficient Feature Speech Synthesis using Mel-Cepstral Coefficient Feature By Lu Wang Senior Thesis in Electrical Engineering University of Illinois at Urbana-Champaign Advisor: Professor Mark Hasegawa-Johnson May 2018 Abstract

More information

Vocal Command Recognition Using Parallel Processing of Multiple Confidence-Weighted Algorithms in an FPGA

Vocal Command Recognition Using Parallel Processing of Multiple Confidence-Weighted Algorithms in an FPGA Vocal Command Recognition Using Parallel Processing of Multiple Confidence-Weighted Algorithms in an FPGA ECE-492/3 Senior Design Project Spring 2015 Electrical and Computer Engineering Department Volgenau

More information

Lab/Project Error Control Coding using LDPC Codes and HARQ

Lab/Project Error Control Coding using LDPC Codes and HARQ Linköping University Campus Norrköping Department of Science and Technology Erik Bergfeldt TNE066 Telecommunications Lab/Project Error Control Coding using LDPC Codes and HARQ Error control coding is an

More information

Evaluation of clipping-noise suppression of stationary-noisy speech based on spectral compensation

Evaluation of clipping-noise suppression of stationary-noisy speech based on spectral compensation Evaluation of clipping-noise suppression of stationary-noisy speech based on spectral compensation Takahiro FUKUMORI ; Makoto HAYAKAWA ; Masato NAKAYAMA 2 ; Takanobu NISHIURA 2 ; Yoichi YAMASHITA 2 Graduate

More information

Biomedical Signals. Signals and Images in Medicine Dr Nabeel Anwar

Biomedical Signals. Signals and Images in Medicine Dr Nabeel Anwar Biomedical Signals Signals and Images in Medicine Dr Nabeel Anwar Noise Removal: Time Domain Techniques 1. Synchronized Averaging (covered in lecture 1) 2. Moving Average Filters (today s topic) 3. Derivative

More information

OFDM Transmission Corrupted by Impulsive Noise

OFDM Transmission Corrupted by Impulsive Noise OFDM Transmission Corrupted by Impulsive Noise Jiirgen Haring, Han Vinck University of Essen Institute for Experimental Mathematics Ellernstr. 29 45326 Essen, Germany,. e-mail: haering@exp-math.uni-essen.de

More information

Modified Kalman Filter-based Approach in Comparison with Traditional Speech Enhancement Algorithms from Adverse Noisy Environments

Modified Kalman Filter-based Approach in Comparison with Traditional Speech Enhancement Algorithms from Adverse Noisy Environments Modified Kalman Filter-based Approach in Comparison with Traditional Speech Enhancement Algorithms from Adverse Noisy Environments G. Ramesh Babu 1 Department of E.C.E, Sri Sivani College of Engg., Chilakapalem,

More information

Speech Enhancement Using Spectral Flatness Measure Based Spectral Subtraction

Speech Enhancement Using Spectral Flatness Measure Based Spectral Subtraction IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 7, Issue, Ver. I (Mar. - Apr. 7), PP 4-46 e-issn: 9 4, p-issn No. : 9 497 www.iosrjournals.org Speech Enhancement Using Spectral Flatness Measure

More information

CHAPTER 3 ADAPTIVE MODULATION TECHNIQUE WITH CFO CORRECTION FOR OFDM SYSTEMS

CHAPTER 3 ADAPTIVE MODULATION TECHNIQUE WITH CFO CORRECTION FOR OFDM SYSTEMS 44 CHAPTER 3 ADAPTIVE MODULATION TECHNIQUE WITH CFO CORRECTION FOR OFDM SYSTEMS 3.1 INTRODUCTION A unique feature of the OFDM communication scheme is that, due to the IFFT at the transmitter and the FFT

More information

Keywords Decomposition; Reconstruction; SNR; Speech signal; Super soft Thresholding.

Keywords Decomposition; Reconstruction; SNR; Speech signal; Super soft Thresholding. Volume 5, Issue 2, February 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Speech Enhancement

More information

An Improved Voice Activity Detection Based on Deep Belief Networks

An Improved Voice Activity Detection Based on Deep Belief Networks e-issn 2455 1392 Volume 2 Issue 4, April 2016 pp. 676-683 Scientific Journal Impact Factor : 3.468 http://www.ijcter.com An Improved Voice Activity Detection Based on Deep Belief Networks Shabeeba T. K.

More information