CHAPTER 3 SPEECH ENHANCEMENT ALGORITHMS


3.1 INTRODUCTION

Personal communication today is impaired by nearly ubiquitous noise. Speech communication becomes difficult under these conditions; speech sounds are masked by the noise, and speech attributes such as quality and intelligibility may be degraded. Today, a great deal of personal communication is carried out using some sort of communication equipment, such as cell phones and intercom devices. With the proliferation of such portable communication devices, speech enhancement has received a great deal of attention. Noise-corrupted speech forces the user of the communication equipment to strain both hearing and voice. Altogether, acoustic noise directly affects human communication and also dramatically decreases the performance of speech coding and speech recognition algorithms. This calls for effective speech enhancement methods (Nils Westerlund et al 2004).

Speech enhancement refers to the restoration of clean speech. Its main objective is to improve one or more perceptual aspects of speech, such as human or machine speech recognition or the degree of listener fatigue. The additive noise source may take the form of wideband noise, which includes white or coloured noise, stationary or nonstationary noise, or a periodic signal.

A microphone receives a desired audio signal (such as human speech) together with ambient noise from other sources, which often interferes with the desired signal. One conventional approach to this problem is speech enhancement, a signal processing technique that exploits differences in the statistical characteristics of speech and noise. Most Speech Enhancement Algorithms (SEA) assume that an estimate of the noise spectrum is available. Such an estimate is critical for the performance of speech enhancement algorithms, as it is needed, for instance, to evaluate the Wiener filter in the Wiener algorithms (Lim and Oppenheim 1978), to estimate the a priori SNR in the MMSE algorithms (Ephraim and Malah 1984), or to estimate the noise covariance matrix in the subspace algorithms (Ephraim and Van Trees 1993).

A frequently used digital method for effective noise reduction in speech communication is spectral subtraction. This frequency domain method is based on the Fast Fourier Transform and is a nonlinear, yet straightforward, way of reducing unwanted broadband noise acoustically added to the signal. The noise bias is estimated in the frequency domain during speech pauses and then subtracted from the noisy speech spectra (Boll 1979). Speech enhancement systems capitalize on the major importance of the Short Time Spectral Amplitude (STSA) of the speech signal in its perception. A system utilizing a Minimum Mean Square Error (MMSE) STSA estimator was proposed by Ephraim and Malah (1984); it models speech and noise spectral components as statistically independent Gaussian random variables. The noise estimate can have a major impact on the quality of the enhanced signal: if the noise estimate is too low, annoying residual noise will be audible; if it is too high, speech will be distorted, possibly resulting in
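The basic magnitude spectral subtraction described above can be sketched as follows. This is a minimal sketch, not Boll's full algorithm: the frame length, the synthetic test signals, and the absence of overlap-add windowing are all simplifying assumptions made here for illustration.

```python
import numpy as np

def spectral_subtraction(noisy, noise_frames, frame_len=256):
    """Minimal magnitude spectral subtraction (in the spirit of Boll 1979).

    `noise_frames` are frames known to contain only noise (speech pauses);
    their average magnitude spectrum serves as the noise bias estimate.
    """
    noise_mag = np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames], axis=0)
    out = []
    for start in range(0, len(noisy) - frame_len + 1, frame_len):
        frame = noisy[start:start + frame_len]
        spec = np.fft.rfft(frame)
        mag = np.abs(spec)
        # subtract the noise bias; half-wave rectify to avoid negative magnitudes
        clean_mag = np.maximum(mag - noise_mag, 0.0)
        # keep the noisy phase and resynthesize the frame
        out.append(np.fft.irfft(clean_mag * np.exp(1j * np.angle(spec)), n=frame_len))
    return np.concatenate(out)

rng = np.random.default_rng(0)
t = np.arange(4096) / 8000.0
speech = np.sin(2 * np.pi * 440 * t)              # stand-in for a speech signal
noisy = speech + 0.5 * rng.standard_normal(t.size)
pauses = [0.5 * rng.standard_normal(256) for _ in range(20)]
enhanced = spectral_subtraction(noisy, pauses)
err_before = np.mean((noisy - speech) ** 2)
err_after = np.mean((enhanced - speech[:enhanced.size]) ** 2)
```

With matched noise statistics the residual error after subtraction drops well below that of the noisy input, at the cost of some residual (musical) noise, as discussed later in this chapter.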

intelligibility loss. The simplest approach is to estimate and update the noise spectrum during the silent segments of the signal (e.g., during pauses) using a voice activity detection (VAD) algorithm (Sohn and Kim 1999). Although such an approach might work satisfactorily in stationary noise (e.g., white noise), it will not work well in more realistic environments (e.g., in a restaurant) where the spectral characteristics of the noise change constantly. Hence the noise spectrum needs to be updated continuously over time, which can be done using noise estimation algorithms. Several noise estimation algorithms have been proposed for speech enhancement applications (Sallberg et al 2005, Malah et al 1999, Martin 2001, Cohen 2002, Cohen 2003, Doblinger 1995, Hirsch and Ehrlicher 1995, Lin et al 2003, Stahl et al 2000, Rangachari et al 2004, Ris and Dupont 2001).

Martin (2001) proposed a method for estimating the noise spectrum based on tracking the minimum of the noisy speech spectrum over a finite window. As the minimum is typically smaller than the mean, unbiased estimates of the noise spectrum were computed by introducing a bias factor based on the statistics of the minimum estimates. In a related line of work, the noisy speech feature vectors were modeled using a mixture of Gaussians, and the noise feature vectors were obtained by maximizing a conditional likelihood function based on a recursive EM algorithm. Stochastic approximations were made to sequentially update the noise feature vectors, and some of those updates resembled the time-recursive updates of the noise spectrum used in the above noise estimation algorithms. Garcia et al (2008) proposed the use of optimum smoothing factors for the noise updates, similar to Martin (2001). Improvements to the EM-based methods were reported by Huijun et al (2009) using sequential Monte Carlo techniques.
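The core of the minimum-tracking idea can be sketched as follows. The window length and bias factor here are illustrative placeholders, not the optimal values derived by Martin (2001), and the smoothing of the noisy power spectrum is omitted for brevity.

```python
import numpy as np

def min_statistics_noise(power_frames, window=10, bias=1.5):
    """Sketch of minimum-statistics noise tracking (after Martin 2001).

    Per frequency bin, track the minimum of the noisy-speech power over a
    finite window of past frames, then compensate the bias of the minimum
    (which is smaller than the mean) with a multiplicative factor.
    """
    power_frames = np.asarray(power_frames, dtype=float)
    estimates = []
    for i in range(len(power_frames)):
        lo = max(0, i - window + 1)
        p_min = power_frames[lo:i + 1].min(axis=0)  # minimum over the finite window
        estimates.append(bias * p_min)              # bias compensation
    return np.asarray(estimates)

frames = np.ones((30, 4))   # stationary noise power of 1.0 in each of 4 bins
frames[5:8] += 9.0          # a short speech burst raises the observed power
est = min_statistics_noise(frames)
```

During the burst the window still contains noise-only frames, so the estimate stays at the noise level rather than tracking the speech power.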

Most of the aforementioned noise estimation algorithms do not adapt quickly to increasing noise levels. A noise estimation algorithm proposed by Rangachari et al (2004) updates the noise estimate faster than the above methods and also avoids overestimation of the noise level. The noise estimate is updated in each frame based on voice activity detection: if speech is absent in a given frame, the noise estimate is updated with a constant smoothing factor. The speech-presence decision in each frame is based on the ratio of the noisy speech spectrum to its local minimum.

Improving the quality of speech hidden in noise is of great importance in speech communication systems, which are used in common real acoustic environments. Nonlinear subtraction is used when a frequency-dependent Signal to Noise Ratio (SNR) is available (Jiri Poruba 2002). Westerlund et al (2003) proposed the Adaptive Gain Equalizer (AGE) method, in which the input signal is divided into a number of subbands that are individually weighted in the time domain according to a short-term SNR estimate. A gammatone filter bank for speech enhancement based on a short-term temporal Masking threshold to Noise Ratio (MNR) was proposed by Teddy Surya Gunawan et al (2004); further research is required to fine-tune its parameters for different speech and/or noise characteristics.

Various speech enhancement algorithms have been proposed to improve the performance of modern communication devices in noisy environments. Yet it still remains unclear which speech enhancement algorithm performs well in real-world listening situations where the background noise level and characteristics are constantly changing. Reliable and fair comparison between algorithms has been elusive for several reasons, including the lack of a common speech database for evaluation of new algorithms,
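The speech-presence logic just described can be sketched per frequency bin as follows. The smoothing factor and ratio threshold are illustrative values, not the tuned parameters of Rangachari et al (2004).

```python
import numpy as np

def update_noise(noise_est, noisy_power, local_min, alpha=0.8, threshold=2.0):
    """Ratio-based noise update (in the spirit of Rangachari et al 2004).

    Speech is declared present in a bin when the noisy-speech power exceeds
    `threshold` times its tracked local minimum; the noise estimate is only
    updated (with smoothing factor `alpha`) where speech is judged absent.
    """
    ratio = noisy_power / np.maximum(local_min, 1e-12)
    speech_absent = ratio < threshold
    return np.where(speech_absent,
                    alpha * noise_est + (1.0 - alpha) * noisy_power,
                    noise_est)          # freeze the estimate during speech

noise = np.array([1.0, 1.0])
power = np.array([1.2, 10.0])           # bin 0: noise-like, bin 1: speech-dominated
new = update_noise(noise, power, local_min=np.array([1.0, 1.0]))
```

The noise-like bin is smoothed towards the observed power, while the speech-dominated bin leaves the noise estimate untouched, which is what prevents overestimation.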

differences in the types of noise used, and differences in the testing methodology (Yi Hu et al 2006). In this chapter, the speech enhancement algorithm developed by Westerlund et al (2003) is presented, along with the nonlinear spectral subtraction method discussed by Jiri Poruba (2002).

3.2 ADAPTIVE GAIN EQUALIZATION

The Adaptive Gain Equalization (AGE) method for speech enhancement, introduced by Westerlund et al (2003), separates itself from traditional methods of improving the SNR of a noise-corrupted signal by moving away from noise suppression and focusing primarily on speech boosting. Traditional noise suppression, like spectral subtraction, subtracts an estimated noise bias from the noise-corrupted signal, whereas speech boosting aims to enhance the speech part of the signal by adding an estimate of the speech itself. The difference between the two is presented in Figure 3.1: Figure 3.1 (a) shows a noise estimate being subtracted from a noise-corrupted signal, while in Figure 3.1 (b) an estimate of the speech signal is used to boost the speech in the noise-corrupted signal (Nils Westerlund et al 2004).

Figure 3.1 Difference between (a) noise suppression and (b) speech boosting

An acoustic discrete-time speech signal is denoted by s(n) and a discrete-time noise signal by w(n). The noise-corrupted speech signal x(n) can then be written as

x(n) = s(n) + w(n)                                        (3.1)

By filtering the input signal x(n) with a bank of K bandpass filters h_k(n), the signal is divided into K subbands, each denoted x_k(n), where k is the subband index. This filtering operation can be written in the time domain as

x_k(n) = x(n) * h_k(n)                                    (3.2)

where * is the convolution operator. The original signal can then be described as

x(n) = Σ_k x_k(n) = Σ_k [s_k(n) + w_k(n)]                 (3.3)

where s_k(n) is the speech part and w_k(n) the noise part of subband k. The output y(n) is defined by

y(n) = Σ_k G_k(n) x_k(n)                                  (3.4)

where G_k(n) is a weighting function that amplifies band k during speech activity. Since G_k(n) introduces a gain to each subband, it will be denoted the gain function; it weights the input signal subbands using the ratio between s_k(n) and w_k(n), a short-term SNR estimate. A block scheme illustrating the subband decomposition, weighting and final summation is shown in Figure 3.2. The input signal is divided into a number of frequency subbands that are individually and adaptively weighted in the time domain according to a short-term Signal to Noise Ratio (SNR) estimate in each subband at every time instant. When dividing a signal into subbands, a number of different filter banks could be employed.
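The decomposition and weighting of equations (3.1)-(3.4) can be sketched as follows. The trivial two-tap "filters" and constant gains are illustrative; in the AGE the gains vary with time, as derived below.

```python
import numpy as np

def subband_weighting(x, filters, gains):
    """Subband decomposition and weighting, equations (3.2)-(3.4).

    Each bandpass impulse response h_k splits x(n) into a subband
    x_k(n) = x(n) * h_k(n); each subband is scaled by a gain G_k and
    the weighted subbands are summed into the output y(n).
    """
    y = np.zeros(len(x))
    for h_k, g_k in zip(filters, gains):
        x_k = np.convolve(x, h_k, mode="same")   # x_k(n) = x(n) * h_k(n)
        y += g_k * x_k                           # y(n) = sum_k G_k(n) x_k(n)
    return y

x = np.arange(8.0)
# two trivial "subband filters" whose responses sum to an identity system,
# so unit gains reconstruct the input exactly
y = subband_weighting(x, filters=[[0.5], [0.5]], gains=[1.0, 1.0])
```

With unit gains the filter bank is transparent; raising the gain of one subband boosts only the signal content that falls in that band.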

Figure 3.2 Block diagram of adaptive gain equalization

The gain function in each subband is formed from a short-term exponential magnitude average A_k(n), calculated as

A_k(n) = (1 − α) A_k(n − 1) + α |x_k(n)|                  (3.5)

where α is a small positive constant controlling how sensitive the algorithm is to rapid changes in input signal amplitude in subband k. Human speech can be considered approximately short-time stationary, and α should be chosen with this in mind. A suitable value for α can be estimated using

α = 1 / (F_s T)                                           (3.6)

where F_s is the sampling frequency and T is a time constant.
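Equation (3.6) can be evaluated directly. Note that the extracted text lost the symbols of this equation, so the form α = 1/(F_s·T) used here is an assumption consistent with the surrounding description; the 20 ms time constant is likewise only an example motivated by the short-time stationarity of speech.

```python
def smoothing_constant(fs, t_const):
    """Smoothing constant of eq (3.6), assumed form alpha = 1 / (Fs * T),
    with Fs the sampling frequency in Hz and T a time constant in seconds."""
    return 1.0 / (fs * t_const)

# e.g. a 20 ms time constant at an 8 kHz sampling rate
alpha = smoothing_constant(8000, 0.020)
```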

The slowly varying noise floor level estimate for each subband k, Ā_k(n), is calculated according to

Ā_k(n) = (1 + ε) Ā_k(n − 1)   if A_k(n) > Ā_k(n − 1)
         A_k(n)               if A_k(n) ≤ Ā_k(n − 1)      (3.7)

where ε is a small positive constant controlling how fast the noise floor level estimate in subband k adapts to changes in the noise environment. Note that Ā_k(n) ≤ A_k(n) and that ε ≪ α.

The variables A_k(n) and Ā_k(n) are used to form the gain function G_k(n) according to

G_k(n) = (A_k(n) / Ā_k(n))^p ;   p ≥ 0,  Ā_k(n) > 0      (3.8)

where p decides the gain rise individually applied to each of the subband signals. The resulting speech-enhanced output signal y(n) is then calculated as in equation (3.4). Since the calculation of G_k(n) involves division, care must be taken to ensure that the quotient does not become excessively large due to a small Ā_k(n). In a situation with a fair SNR, G_k(n) will become considerable if no limit is imposed on this function, resulting in unacceptably high speech amplification. A limiter can therefore be imposed on G_k(n) as

G_k(n) = G_k(n)   if G_k(n) ≤ L
         L        if G_k(n) > L                           (3.9)

where L is a positive constant, expressed in decibels.
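The recursions (3.5), (3.7) and the limited gain (3.8)-(3.9) for a single subband can be sketched as follows. All parameter values are illustrative, not the tuned ones of Westerlund et al (2003), and the floor initialisation from the first sample is a simplifying assumption.

```python
import numpy as np

def age_gains(x_k, alpha=0.01, eps=1e-3, p=1.0, limit_db=12.0):
    """Per-subband AGE gain, equations (3.5)-(3.9).

    A_k(n): short-term exponential magnitude average (3.5).
    Abar_k(n): slowly rising noise-floor estimate (3.7).
    Gain (A_k/Abar_k)^p (3.8), clipped at a limit L given in dB (3.9).
    """
    L = 10.0 ** (limit_db / 20.0)
    a = 0.0
    a_floor = None
    gains = np.empty(len(x_k))
    for n, sample in enumerate(x_k):
        a = (1.0 - alpha) * a + alpha * abs(sample)       # eq (3.5)
        if a_floor is None:
            a_floor = max(a, 1e-12)                       # initialise the floor
        elif a > a_floor:
            a_floor *= (1.0 + eps)                        # eq (3.7): slow rise
        else:
            a_floor = a                                   # eq (3.7): track down
        gains[n] = min((a / a_floor) ** p, L)             # eqs (3.8) and (3.9)
    return gains

# stationary "noise" followed by a louder burst standing in for speech
x = np.ones(20000)
x[15000:15500] = 10.0
g = age_gains(x)
```

During the stationary segment the average and the floor converge, so the gain settles near unity; when the burst arrives, A_k rises much faster than the floor and the gain increases until the limiter L caps it.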

The concept of obtaining a speech bias estimate to perform speech boosting may seem daunting, but it need not be: the AGE method relies on a few basic ideas. The first is that a speech signal corrupted by bandlimited noise can be divided into a number of subbands, and each of these subbands can be individually and adaptively boosted according to an SNR estimate in that particular subband. In each subband, a short-term average is calculated simultaneously with an estimate of a slowly varying noise floor level. From these, a gain function is calculated per subband by dividing the short-term average by the floor estimate. This gain function is multiplied with the corresponding subband signal to form an output per subband, and the sum of the subband outputs forms the final output signal, which should have a higher SNR than the original noisy signal.

The AGE thus acts as a speech booster that adaptively looks for a subband speech signal to boost. Speech energy shows up as highly non-stationary input amplitude excursions; if there are no such excursions, no alteration of the subband is performed and the AGE remains idle, because the quotient between the short-term magnitude average and the noise floor estimate stays close to unity when the two are approximately the same. If speech is present, the short-term magnitude average changes while the noise floor level remains approximately unchanged, and the signal in the subband at hand is amplified because the quotient becomes larger than unity. During periods of no speech activity the AGE leaves the background noise free of distortion, and during speech activity any noise amplification is perceptually masked by the speech. This results in increased speech quality, with an output signal that has a natural sound and minimal distortion and artifacts.

The AGE algorithm can be implemented in either digital or analogue circuits, proving versatile and flexible. The speech enhancement is performed continuously in each subband, which means no voice activity detectors are required. The method is stand-alone: it works independently of different speech coding schemes or other adaptive algorithms, and requires minimal adjustment for good performance.

3.2.1 Drawbacks of AGE

Analysis under various noise conditions shows that the algorithm has an obvious failing point at an SNR of −5 dB, with inadequate levels of noise suppression below this point. This is because the short-term average fails to track the speech spectra of a speech signal that is heavily corrupted by noise.

3.3 SPECTRAL SUBTRACTION METHOD

The standard spectral subtraction method is described by the following equations. A short-term noise spectral magnitude is subtracted from the degraded speech signal by

|Ŝ_i(ω)| = |X̄_i(ω)| − |W̄_i(ω)|                            (3.10)

where

|X_i(ω)| = |F{x_i(n)}|                                    (3.11)
|W_i(ω)| = |F{w_i(n)}|                                    (3.12)

are the short-term magnitude spectra of the i-th frame of the noisy speech and of the noise (the latter estimated during speech pauses), respectively.

Here |X_i(ω)| is a short-term spectral estimate of the speech (frame i), |W_i(ω)| a short-term estimate of the noise (frame i), |X̄_i(ω)| a smoothed estimate of the corrupted magnitude at time i, |W̄_i(ω)| a smoothed estimate of the noise magnitude at time i, and |Ŝ_i(ω)| a clean speech estimate. The magnitudes |X̄_i(ω)| and |W̄_i(ω)| can be computed from the following equations:

|X̄_i(ω)| = γ_x |X̄_{i−1}(ω)| + (1 − γ_x) |X_i(ω)|          (3.13)
|W̄_i(ω)| = γ_w |W̄_{i−1}(ω)| + (1 − γ_w) |W_i(ω)|          (3.14)

where the memory factors lie in the intervals 0.1 ≤ γ_x ≤ 0.5 and 0.5 ≤ γ_w ≤ 0.9.

3.4 NONLINEAR SPECTRAL SUBTRACTION

Nonlinear spectral subtraction (NSS) is used when a frequency-dependent Signal to Noise Ratio (SNR) is available. An improved noise model in the nonlinear spectral subtraction scheme is determined by a frequency-dependent overestimation factor α(ω), which can be estimated during the speech pauses jointly with the noise magnitude. The factor α(ω) can be determined from

α(ω) = max_i |W_i(ω)|                                     (3.15)

where the maximum is taken over the frames of the speech pause. Provided the noise magnitude |W̄_i(ω)| and the noise overestimation model α(ω) are known, the nonlinear subtraction can be performed through the equation

|Ŝ_i(ω)| = |X̄_i(ω)| − Φ(|X̄_i(ω)|, α(ω), ρ_i(ω))          (3.16)

A biased estimate of the SNR ρ_i(ω) at frame i is determined by

ρ_i(ω) = |X̄_i(ω)| / |W̄_i(ω)|                              (3.17)

Φ is a nonlinear function which determines a subtraction measure on the basis of the SNR ρ_i(ω), constrained to the interval

|W̄_i(ω)| ≤ Φ(|X̄_i(ω)|, α(ω), ρ_i(ω)) ≤ 3 |W̄_i(ω)|        (3.18)

It is essentially an arbitrary function which realizes the following idea: minimal subtraction is used when the signal-to-noise ratio ρ_i(ω) is high and, conversely, more noise is subtracted when ρ_i(ω) is low (there the subtraction factor is at its maximum). For example, it is possible to use the function

Φ(α(ω), ρ_i(ω)) = α(ω) / (1 + γ ρ_i(ω))                   (3.19)

where γ is a weight factor depending on the variation range of the SNR.

3.5 PROPOSED SPEECH ENHANCEMENT ALGORITHM

The AGE coupled with Nonlinear Spectral Subtraction (AGE+NSS) performs better than the AGE alone when the SNR drops below −5 dB. The first step filters the signal into a number of subbands; here, eight subbands are chosen. The signal, sampled at 16 kHz, is filtered into the eight subbands and nonlinear spectral subtraction is applied to each of them. The short-term exponential magnitude average and the noise floor are computed simultaneously. Using the short term exponential

magnitude average and the noise floor, the gain is calculated and multiplied with the subband spectra. The block diagram of Adaptive Gain Equalization with Nonlinear Spectral Subtraction (AGE+NSS) is shown in Figure 3.3.

Figure 3.3 Block diagram of the proposed algorithm

3.6 RESULTS AND DISCUSSION

Noise sources were taken from the AURORA database, which includes suburban train, babble, car, exhibition hall, restaurant, street, airport and train station noise. In the training phase, uttered words (100 samples of each digit 0-9, both male and female) were recorded using 8-bit Pulse Code Modulation (PCM) at a sampling rate of 8 kHz and saved as wave files using sound recorder software. Automatic Speech Recognition (ASR) systems work reasonably well under clean conditions but become fragile in practical applications involving real-world environments.
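The nonlinear subtraction of Section 3.4, which the proposed algorithm applies inside each subband, can be sketched per frequency bin as follows. The weight factor γ, the overestimation factors and the synthetic magnitudes are illustrative values, and the final rectification to non-negative magnitudes is an added practical detail.

```python
import numpy as np

def nss_subtract(x_bar, w_bar, alpha, gamma=0.5):
    """Nonlinear spectral subtraction, equations (3.16)-(3.19).

    rho is the biased SNR estimate (3.17); the subtraction measure
    phi = alpha / (1 + gamma * rho) (3.19) is constrained to the
    interval [|Wbar|, 3|Wbar|] of (3.18) before being subtracted (3.16).
    """
    rho = x_bar / np.maximum(w_bar, 1e-12)          # eq (3.17)
    phi = alpha / (1.0 + gamma * rho)               # eq (3.19)
    phi = np.clip(phi, w_bar, 3.0 * w_bar)          # eq (3.18)
    return np.maximum(x_bar - phi, 0.0)             # eq (3.16), rectified

# smoothed noisy magnitudes: one high-SNR bin and one low-SNR bin
x_bar = np.array([10.0, 1.2])
w_bar = np.array([1.0, 1.0])
alpha = np.array([4.0, 4.0])    # noise overestimation factors, eq (3.15)
s_hat = nss_subtract(x_bar, w_bar, alpha)
```

The high-SNR bin receives only the minimal subtraction (the lower bound |W̄|), while the low-SNR bin is subtracted much more aggressively, which is exactly the behaviour equations (3.18)-(3.19) are designed to produce.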

Analysis was carried out for different environmental noises for digits 0-9 at different SNR values. The proposed SEA system works better for the different noises at the different SNR values. From Tables 3.1 and 3.2 it is observed that the proposed algorithm yields better Recognition Accuracy (RA) than the existing AGE algorithm under the different noise conditions. At 0 dB SNR, the proposed algorithm yields an improvement of 20.89% in RA for Exhibition noise, higher than for the other noise sources. This is because the spectral components of Exhibition noise are distributed over different frequencies and their amplitudes are not at a level that disturbs the audio frequency range of the isolated words zero to nine.

From Tables 3.3 and 3.4 it can be inferred that at 5 dB SNR, an improvement of 13.25% in RA was observed for the ASR in the presence of Station noise, higher than for the other noise sources; the proposed ASR with SEA performs better here because the Station noise is not of sufficient amplitude to disturb the signal strength of the isolated words zero to nine. At the 10 dB level, the proposed algorithm yields an improvement of 10.03% in RA in the presence of Car noise compared to the other noise sources, as tabulated in Tables 3.5 and 3.6; the strength of the spectral components is at a medium level for Car noise compared to the other sources, which results in better RA. As the noise spectral components are defined clearly at the 15 dB level, it is easier for the proposed ASR-SEA algorithm to remove the noise components than in the presence of the other noise sources at different levels, as shown in Tables 3.7 and 3.8.

[Tables 3.1-3.8 Recognition accuracy of the proposed and existing SEA algorithms for the various noise sources at 0, 5, 10 and 15 dB SNR]

Table 3.9 shows the performance of the ASR: the proposed method performs best with a maximum improvement of 20.89% RA for Exhibition noise, and worst with a minimum improvement of 4.58% RA for Station noise. The proposed SEA works better than the existing SEA algorithm for the different noises at the different SNR values. The best recognition was observed for Street noise and the least for Airport noise, as shown in Figure 3.4.

Table 3.9 Overall performance analysis of the proposed SEA algorithm in terms of % improvement in RA

         0 dB                 5 dB               10 dB             15 dB
Better   Exhibition (20.89%)  Station (13.25%)   Car (10.03%)      Airport (7.59%)
Least    Airport (13.0%)      Airport (9.67%)    Airport (7.13%)   Station (4.58%)

Figure 3.4 Overall percentage RA of the proposed and existing SEA for various noises

From Tables 3.1 through 3.8 it can be inferred that the best improvement in RA was observed for isolated words in the presence of Street noise, followed by equally good performance in the presence of Car and Train noise. The improvement in RA of the proposed algorithm in the presence of the various noise sources at different levels arises because the NSS combined with the adaptive gain equalizer works better for non-stationary signals than for stationary signals.

3.7 CONCLUSION

Several experiments were conducted to evaluate the SEA algorithm, with the analysis mainly focused on error probabilities. The proposed SEA was evaluated in terms of its ability to discriminate speech from non-speech at different SNR values. The SEA avoids losing speech periods, which would otherwise lead to an extremely conservative behaviour in detecting speech pauses. The proposed framework uses a speech processing module including a noise estimation algorithm with HMM-based classification and noise language modeling to achieve effective noise knowledge estimation. Analysis was carried out for different environmental noises for digits 0-9 at different SNR values. Recognition Accuracy (RA) increases when the noise is estimated for each frame compared to when no noise estimation is performed. The proposed method combining the Adaptive Gain Equalizer (AGE) and Nonlinear Spectral Subtraction (NSS) works better for the different noises at the different SNR values.