Signal Processing 9 (2) 55 6

Fast communication

Minima-controlled speech presence uncertainty tracking method for speech enhancement

Woojung Lee, Ji-Hyun Song, Joon-Hyuk Chang
School of Electronic Engineering, Inha University, Incheon 42-75, Republic of Korea

Article history: Received 3 February 2; received in revised form April 2; accepted 8 June 2; available online 25 June 2

Keywords: Soft decision; Speech absence probability; Minima-controlled recursive averaging

Abstract

In speech enhancement, soft decision, in which the speech absence probability (SAP) is introduced to modify the spectral gain or update the noise power, is known to be efficient. In many previous works, a fixed a priori probability of speech absence (q) is assumed in estimating the SAP, which is not realistic since speech is quasi-stationary and may not be present in each frequency bin. To address this problem, Malah et al. devised a method to obtain distinct values of q for each frequency bin in each frame by comparing the a posteriori SNR to a threshold value [9]. Building on this idea, we propose an algorithm that takes advantage of the minima-controlled recursive averaging (MCRA) technique, which allows robust tracking of speech absence over time. This leads to improved tracking of speech absence in speech enhancement and to better results in objective and subjective evaluation tests. © 2 Elsevier B.V. All rights reserved.

1. Introduction

In general, listening to speech becomes more difficult as the ambient noise level increases. To mitigate this problem, speech enhancement techniques attempt to remove the effect of the additive noise [1-7].
Among them, soft decision has been considered an effective strategy because the probability of speech absence (or speech presence) is incorporated as a key parameter for modifying the spectral gain and updating the noise power [8]. In much of the literature, a fixed value of q, the a priori probability of speech absence, is assumed for all frequency components in the analyzed input frames [8,9]. In [2], q was set to 0.5 to address the worst-case scenario in which speech and noise are equally likely to occur, while q was set to 0.2 based on the listening test in [1]. Several algorithms have been proposed for estimating and updating q [9,10]. In particular, Malah et al. proposed an algorithm to obtain distinct values of q for each frequency in each frame based on a simple hypothesis test that compares the a posteriori SNR with a given threshold [9]. However, the a posteriori SNR is sensitive to outliers, especially for time-varying noise. On the other hand, Cohen proposed a technique for estimating noise by averaging past spectral power values with a smoothing parameter that is adjusted by the speech presence probability in subbands [11]. In particular, the presence of speech in subbands is determined by the ratio between the local energy of noisy speech and its minimum within a given time window. Cohen's method is known to be insensitive to the type and intensity of ambient noise. It is also computationally efficient and able to adapt quickly to sudden changes in the noise spectrum. In this paper, we develop a method to track the a priori probability of speech absence, which is a dominant parameter in computing the speech absence probability from the observation.

(Corresponding author: J.-H. Chang. Tel.: +82 32 86 7423; fax: +82 32 868 3654. E-mail address: changjh@inha.ac.kr.)
To do this, we devise a method to track the a priori probability of speech absence by comparing the local energy of the noisy speech and its
corresponding minimum value in each frequency bin. This is found to enable a more robust estimate of q, analogous to the advantage of Cohen's method [11]. We incorporated the proposed approach into a speech enhancement system and evaluated it with objective and subjective quality tests, in which it produced better results.

2. Review of tracking speech presence uncertainty

In this section, we first review the notion of tracking speech presence uncertainty introduced in [9]. Let y(n) denote a noisy speech signal, which is the sum of a clean speech signal, x(n), and an uncorrelated additive noise signal, d(n); that is, y(n) = x(n) + d(n). Applying a short-time Fourier transform (STFT), we have in the time-frequency domain

$$Y(k,l) = X(k,l) + D(k,l), \tag{1}$$

where k is the frequency bin index and l is the frame index. Given two hypotheses, $H_0(k,l)$ and $H_1(k,l)$, which indicate speech absence and presence, respectively, it is assumed that

$$H_0(k,l):\ Y(k,l) = D(k,l), \qquad H_1(k,l):\ Y(k,l) = X(k,l) + D(k,l). \tag{2}$$

Like a number of other speech enhancement algorithms [8], we also assume that X(k,l) and D(k,l) are characterized by separate zero-mean complex Gaussian distributions, so that

$$p(Y(k,l) \mid H_0) = \frac{1}{\pi \lambda_d(k,l)} \exp\left\{ -\frac{|Y(k,l)|^2}{\lambda_d(k,l)} \right\}, \qquad p(Y(k,l) \mid H_1) = \frac{1}{\pi [\lambda_d(k,l) + \lambda_x(k,l)]} \exp\left\{ -\frac{|Y(k,l)|^2}{\lambda_d(k,l) + \lambda_x(k,l)} \right\}, \tag{3}$$

in which $\lambda_x(k,l)$ and $\lambda_d(k,l)$ are the variances of the clean speech and the noise, respectively, in the kth frequency bin and lth frame.
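As a worked step, the ratio of the two densities in (3) can be written in closed form. The sketch below assumes the usual definitions of the a posteriori SNR, $\gamma = |Y|^2/\lambda_d$, and the a priori SNR, $\xi = \lambda_x/\lambda_d$, as in [8]; the indices (k,l) are dropped for brevity.

```latex
\begin{aligned}
\frac{p(Y \mid H_1)}{p(Y \mid H_0)}
  &= \frac{\pi \lambda_d}{\pi (\lambda_d + \lambda_x)}
     \exp\left\{ \frac{|Y|^2}{\lambda_d} - \frac{|Y|^2}{\lambda_d + \lambda_x} \right\} \\
  &= \frac{\lambda_d}{\lambda_d + \lambda_x}
     \exp\left\{ \frac{|Y|^2}{\lambda_d} \cdot \frac{\lambda_x}{\lambda_d + \lambda_x} \right\} \\
  &= \frac{1}{1 + \xi} \exp\left\{ \frac{\gamma \xi}{1 + \xi} \right\}.
\end{aligned}
```

The second line uses $1/\lambda_d - 1/(\lambda_d + \lambda_x) = \lambda_x / [\lambda_d (\lambda_d + \lambda_x)]$, and the last line divides numerator and denominator by $\lambda_d$.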
Conditioned on the current observation, Y(k,l), the speech absence probability (SAP), $p(H_0 \mid Y(k,l))$, is given by [8]

$$p(H_0 \mid Y(k,l)) = \frac{p(Y(k,l) \mid H_0)\, p(H_0)}{p(Y(k,l))} = \frac{p(Y(k,l) \mid H_0)\, p(H_0)}{p(Y(k,l) \mid H_0)\, p(H_0) + p(Y(k,l) \mid H_1)\, p(H_1)} = \frac{1}{1 + q \Lambda(Y(k,l))}, \tag{4}$$

in which $\Lambda(Y(k,l))$ is the likelihood ratio computed in the kth subband and lth frame as follows:

$$\Lambda(Y(k,l)) = \frac{p(Y(k,l) \mid H_1)}{p(Y(k,l) \mid H_0)} = \frac{1}{1 + \xi(k,l)} \exp\left\{ \frac{\gamma(k,l)\, \xi(k,l)}{1 + \xi(k,l)} \right\}, \tag{5}$$

where $\gamma(k,l)$ and $\xi(k,l)$ are the a posteriori SNR and the a priori SNR [8], respectively, defined as

$$\gamma(k,l) \triangleq \frac{|Y(k,l)|^2}{\lambda_d(k,l)}, \tag{6}$$

$$\xi(k,l) \triangleq \frac{\lambda_x(k,l)}{\lambda_d(k,l)}, \tag{7}$$

and $q$ $(= p(H_1)/p(H_0))$ is the ratio of the a priori probabilities of speech presence and speech absence []. Indeed, q is a rough estimate of the ratio of silence time intervals between speech activities and the time duration of speech. This ratio q is assumed to be fixed in many previous works [1,5,8]. However, Malah et al. proposed a method that allows different values of q in different frequency bins for each frame, since this number varies in time due to the non-stationarity of speech. Specifically, in the method of Malah et al., (4) becomes

$$p(H_0 \mid Y(k,l)) = \frac{1}{1 + q(k,l)\, \Lambda(Y(k,l))}, \tag{8}$$

where

$$q(k,l) = \alpha_q\, q(k,l-1) + (1 - \alpha_q)\, I(k,l), \tag{9}$$

and $\alpha_q$ $(0 < \alpha_q < 1)$ is a smoothing parameter. In particular, I(k,l) is an index function denoting the result of the following hypothesis test on the a posteriori SNR:

$$\gamma(k,l) \underset{H_0}{\overset{H_1}{\gtrless}} \gamma_{TH}, \tag{10}$$

where $\gamma_{TH}$ is a given threshold (i.e., I(k,l) = 1 if $H_1$ is accepted, and I(k,l) = 0 if $H_0$ is accepted). Note that, in the method of Malah et al. [9], the availability of a separate estimate of q in each bin for each frame adaptively controls the update of the noise power in the case of speech presence.

3. Proposed minima-controlled speech presence uncertainty tracking method

In the previous section, the estimation of $p(H_0 \mid Y(k,l))$ given by (4) is controlled by distinct values of q obtained from the a posteriori SNR-based hypothesis test, as in the previous approach [9].
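To make the conventional scheme concrete, the SAP of (4) and (5) and the SNR-driven update of (9) and (10) can be sketched for a single frequency bin as follows. This is a minimal illustration, not the authors' implementation: the function names, the initial value of q, and the parameter values in the demo are hypothetical, the exponent clamp is a numerical safeguard added here, and the indicator is assumed to equal 1 when speech presence is accepted, which is the reading consistent with (4).

```python
import math


def likelihood_ratio(gamma, xi):
    """Likelihood ratio of (5) from the a posteriori SNR gamma and a priori SNR xi."""
    exponent = gamma * xi / (1.0 + xi)
    # Clamp the exponent so math.exp cannot overflow for very large SNRs
    # (a numerical safeguard added here, not part of the paper).
    return math.exp(min(exponent, 700.0)) / (1.0 + xi)


def speech_absence_probability(gamma, xi, q):
    """SAP of (4)/(8): p(H0 | Y) = 1 / (1 + q * Lambda)."""
    return 1.0 / (1.0 + q * likelihood_ratio(gamma, xi))


def update_q_snr(q_prev, gamma, alpha_q, gamma_th):
    """Recursive update of (9), driven by the a posteriori SNR test of (10)."""
    # Assumption: the indicator is 1 when speech presence (H1) is accepted.
    indicator = 1.0 if gamma > gamma_th else 0.0
    return alpha_q * q_prev + (1.0 - alpha_q) * indicator


if __name__ == "__main__":
    q = 0.5  # hypothetical initial value
    for gamma in (0.4, 0.6, 3.0, 5.0):  # made-up a posteriori SNR values
        q = update_q_snr(q, gamma, alpha_q=0.9, gamma_th=1.0)
        sap = speech_absence_probability(gamma, xi=2.0, q=q)
        print(f"gamma={gamma:.1f}  q={q:.4f}  SAP={sap:.4f}")
```

Because the indicator follows the instantaneous a posteriori SNR, a single outlier frame moves q immediately; this is the sensitivity that motivates the minima-controlled test.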
However, we note that the a posteriori SNR may be unreliable for this purpose due to its high variation over successive short-time frames [2]. For this reason, we consider a hypothesis test on the ratio between the local energy of the noisy speech and its derived minimum, as in the MCRA method proposed by Cohen [11]. This method is reported to be insensitive to the type and strength of noise, which is a very desirable characteristic [11]. To exploit these characteristics, we first introduce the smoothed local energy of the noisy speech, obtained by first-order recursive averaging:

$$S(k,l) = \alpha_s\, S(k,l-1) + (1 - \alpha_s)\, S_f(k,l), \tag{11}$$

where $S_f(k,l)$ is the local energy of the current frame and $\alpha_s$ $(0 < \alpha_s < 1)$ is a smoothing parameter. The minimum of the local energy, $S_{\min}(k,l)$, is searched for by a samplewise comparison such that

$$S_{\min}(k,l) = \min\{ S_{\min}(k,l-1),\, S(k,l) \}, \qquad S_{tmp}(k,l) = \min\{ S_{tmp}(k,l-1),\, S(k,l) \}, \tag{12}$$

where the minimum value for the current frame is yielded by a comparison of the local energy of the noisy speech
and the minimum value of the previous frame. Whenever L frames have been read, i.e., l is divisible by L, the temporary value is employed and re-initialized by

$$S_{\min}(k,l) = \min\{ S_{tmp}(k,l-1),\, S(k,l) \}, \qquad S_{tmp}(k,l) = S(k,l), \tag{13}$$

and (12) continues to search for the minimum values. The implementation of the minima tracking is summarized as follows:

    Initialize variables at the first frame (l = 0) for all frequency bins k:
        S(k,0) = S_f(k,0)
        S_min(k,0) = S_f(k,0)
    For all time frames l (l >= 1):
        For all frequency bins k:
            compute S(k,l) using (11) and S_min(k,l) = min{S_min(k,l-1), S(k,l)} using (12)
            save S_tmp(k,l) = min{S_tmp(k,l-1), S(k,l)} using (12)
            When l % L == 0:
                compute S_min(k,l) = min{S_tmp(k,l), S(k,l)} using (13)
                update S_tmp(k,l) = S(k,l) using (13)

Using the obtained $S_{\min}(k,l)$, we now consider $S_r(k,l) \triangleq S(k,l) / S_{\min}(k,l)$, which denotes the ratio between the local energy of the noisy speech and its derived minimum [11]. From this, we can derive the following decision rule:

$$S_r(k,l) \underset{H_0}{\overset{H_1}{\gtrless}} \delta, \tag{14}$$

where $\delta$ is a simple threshold. As an example, Fig. 1 compares two statistics (the a posteriori SNR vs. $S_r(k,l)$) when the speech enhancement algorithm operates on noisy speech corrupted by the car noise. From the figure, it can be seen that the a posteriori SNR tends to fluctuate strongly during noise intervals. In contrast, $S_r(k,l)$ does not exhibit large variation over successive frames during noise-only periods, while it follows the speech energy adequately during speech. Using the decision rule of (14) in the MCRA scheme, we propose $\hat{q}$, which takes the place of q in the conventional scheme for tracking speech presence uncertainty, such that $\hat{q}(k,l)$ is given by

$$\hat{q}(k,l) = \alpha_p\, \hat{q}(k,l-1) + (1 - \alpha_p)\, I(k,l), \tag{15}$$

in which $\alpha_p$ $(0 < \alpha_p < 1)$ is a smoothing parameter and I(k,l) is an indicator function for the result of the decision rule of (14), i.e., I(k,l) = 1 if $S_r(k,l) > \delta$ and I(k,l) = 0 if $S_r(k,l) < \delta$.
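The minima tracking of (11) to (13), the ratio test of (14), and the recursive update of (15) can be sketched together as below. This is an illustrative sketch under stated assumptions rather than the authors' code: the function name `track_q_hat` and the input layout (a list of frames, each a list of per-bin local energies $S_f(k,l)$) are hypothetical, $\hat{q}$ is initialized to 0, $S_{tmp}$ is initialized to $S_f(k,0)$, and a zero minimum is treated as a ratio of 1. None of these three choices is specified in the text.

```python
def track_q_hat(local_energy, alpha_s, alpha_p, delta, window_len):
    """Track q_hat(k, l) for every bin k and frame l.

    local_energy: list of frames; each frame is a list of S_f(k, l) per bin.
    Returns a list (one entry per frame) of per-bin q_hat values.
    """
    num_bins = len(local_energy[0])
    # Initialization at the first frame (l = 0).
    s = list(local_energy[0])          # smoothed local energy S(k, 0)
    s_min = list(s)                    # running minimum S_min(k, 0)
    s_tmp = list(s)                    # assumption: S_tmp(k, 0) = S_f(k, 0)
    q_hat = [0.0] * num_bins           # assumption: q_hat starts at 0
    history = [list(q_hat)]

    for l in range(1, len(local_energy)):
        for k in range(num_bins):
            # (11): first-order recursive averaging of the local energy.
            s[k] = alpha_s * s[k] + (1.0 - alpha_s) * local_energy[l][k]
            if l % window_len == 0:
                # (13): use the temporary minimum, then restart it.
                s_min[k] = min(s_tmp[k], s[k])
                s_tmp[k] = s[k]
            else:
                # (12): samplewise minimum search.
                s_min[k] = min(s_min[k], s[k])
                s_tmp[k] = min(s_tmp[k], s[k])
            # (14): ratio between local energy and its minimum.
            # Assumption: a zero minimum is treated as "no evidence of speech".
            s_r = s[k] / s_min[k] if s_min[k] > 0.0 else 1.0
            indicator = 1.0 if s_r > delta else 0.0
            # (15): recursive update of q_hat.
            q_hat[k] = alpha_p * q_hat[k] + (1.0 - alpha_p) * indicator
        history.append(list(q_hat))
    return history


if __name__ == "__main__":
    # One bin: two low-energy frames followed by an energy burst (made-up data).
    frames = [[1.0], [1.0], [100.0], [100.0]]
    for l, q in enumerate(track_q_hat(frames, 0.5, 0.8, 5.0, 10)):
        print(l, q)
```

The resulting $\hat{q}(k,l)$ is then used wherever q(k,l) appears in the SAP expression. Since the smoothed energy and its windowed minimum both change slowly, the indicator flips far less often during noise-only frames than one driven by the instantaneous a posteriori SNR.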
Then, (8) becomes

$$p(H_0 \mid Y(k,l)) = \frac{1}{1 + \hat{q}(k,l)\, \Lambda(Y(k,l))}. \tag{16}$$

Fig. 2 suggests that the SAP obtained by the proposed method is more accurate than that of the conventional (a posteriori SNR-based) method.

Fig. 1. Comparison of two statistics (k=2, around 3 Hz) under street noise (SNR = 5 dB). (a) Clean speech waveform, (b) noisy speech waveform, (c) $\gamma(k,l)$ (dashed line) vs. $S_r(k,l)$ (solid line).

4. Experiments and results

The proposed minima-controlled speech presence uncertainty tracking method was adopted for soft-decision-based speech enhancement, as in [8], and was evaluated with extensive objective and subjective tests. For these tests, phrases, spoken by four male and four
female speakers, were employed as the experimental data. Each phrase consists of two different meaningful sentences, and its duration was 8 s. For real-time processing, the proposed method was applied to each frame of ms with a sampling frequency of 8 kHz. Four types of noise (white, car, street, and office noise) were digitally added to the clean speech waveform at SNRs of 5, 10, and 15 dB. In all cases, speech enhancement was conducted with the experimentally optimized parameter values $\alpha_q = 0.95$, $\gamma_{TH} = 0.8$, $\alpha_p = 0.2$, and $\delta = 5$.

Fig. 2. Comparison of probability (k=2, around 3 Hz) under car noise (SNR = 5 dB). (a) Clean speech waveform, (b) noisy speech waveform, (c) speech presence probability in short-time frames: probability using the a posteriori SNR (dashed line), probability of the proposed algorithm (bold line).

First, we carried out the perceptual evaluation of speech quality (PESQ) based on the ITU-T P.862 tests [3]. From Table 1, which shows the PESQ results, we can see that the proposed minima-controlled speech presence uncertainty tracking method outperformed the three conventional methods proposed by McAulay [2], Ephraim [1], and Malah [9], as well as the ideal q-based method, under the given noise conditions. Specifically, the ideal q-based method uses fixed values of q that are determined from the ratio of speech and noise in each speech segment. Note that the performance gain becomes larger for non-stationary noise such as car and street noise. We also carried out a set of informal tests under the same noise conditions to evaluate the subjective quality of the proposed method. Subjective opinions were given by a group of 2 listeners; each listener gave a score for each test sentence: 5 (Excellent), 4 (Good), 3 (Fair), 2 (Poor), and 1 (Bad).
Table 1. PESQ scores of the conventional methods and the proposed method.

    Noise    Method             SNR (dB)
                                5       10      15
    White    McAulay (q=0.5)    .68     .95     2.33
             Ephraim (q=0.2)    .96     2.34    2.67
             Ideal              2.8     2.4     2.73
             Malah              2.8     2.4     2.72
             Proposed           2.9     2.42    2.75
    Street   McAulay (q=0.5)    2.49    2.8     3.6
             Ephraim (q=0.2)    2.83    3.2     3.37
             Ideal              2.85    3.3     3.38
             Malah              2.83    3.2     3.39
             Proposed           2.89    3.6     3.4
    Car      McAulay (q=0.5)    2.97    3.2     3.4
             Ephraim (q=0.2)    3.26    3.54    3.83
             Ideal              3.35    3.63    3.88
             Malah              3.34    3.63    3.88
             Proposed           3.39    3.67    3.9
    Office   McAulay (q=0.5)    .96     2.34    2.68
             Ephraim (q=0.2)    2.2     2.62    2.95
             Ideal              2.32    2.65    2.94
             Malah              2.3     2.63    2.93
             Proposed           2.34    2.67    2.96

All listener scores were then averaged to yield a mean opinion score (MOS). The MOS test results, with a 95% confidence interval, are summarized in Table 2, in which a higher value indicates preference. It is noted that performance was found to improve for most of the
noises at all SNRs. Indeed, it is observed that the performance differences in the MOS are more significant than those in the PESQ in many cases. This phenomenon can be attributed to the fact that all parameters have been optimized for subjective quality enhancement. These results confirm that the proposed algorithm is consistently better than the conventional methods.

Table 2. MOS of the conventional methods and the proposed method (with 95% confidence interval).

    Noise    Method      SNR (dB)
                         5             10            15
    White    McAulay     .6 ± .9       .89 ± .9      2.36 ± .2
             Ephraim     .79 ± .9      2.4 ± .26     2.43 ± .2
             Ideal       .68 ± .26     2.39 ± .9     2.7 ± .8
             Malah       .84 ± .7      2.45 ± .5     2.84 ± .9
             Proposed    .8 ± .6       2.52 ± .7     2.87 ± .9
    Car      McAulay     3.5 ± .22     3.68 ± .26    3.82 ± .3
             Ephraim     3.7 ± .23     4. ± .27      4.25 ± .2
             Ideal       3.7 ± .7      4. ± .22      4.36 ± .22
             Malah       3.72 ± .26    4.7 ± .23     4.42 ± .3
             Proposed    3.75 ± .23    4.7 ± .23     4.42 ± .6
    Street   McAulay     2.69 ± .7     3.53 ± .24    3.78 ± .27
             Ephraim     3.3 ± .25     3.78 ± .2     3.88 ± .27
             Ideal       3.3 ± .9      3.7 ± .23     3.9 ± .2
             Malah       3.3 ± .6      3.68 ± .8     3.84 ± .2
             Proposed    3.42 ± .2     3.9 ± .23     3.95 ± .2
    Office   McAulay     .88 ± .2      2.53 ± .8     3.9 ± .7
             Ephraim     .8 ± .2       2.59 ± .22    3.6 ± .24
             Ideal       .84 ± .24     2.47 ± .8     3.9 ± .29
             Malah       .94 ± .2      2.48 ± .8     3.22 ± .2
             Proposed    2.6 ± .2      2.63 ± .8     3.37 ± .8

Table 3. CCR test of the conventional method (Malah-based) and the proposed method (with 95% confidence interval).

    Noise    SNR (dB)    Overall       Speech        Noise
    White    5           .32 ± .8      .24 ± .3      .8 ± .9
             10          .23 ± .3      .2 ± .3       .2 ± .7
             15          .3 ± .5       .2 ± .3       .32 ± .7
    Car      5           .8 ± .        . ± .6        .3 ± .9
             10          .2 ± .6       . ± .3        .7 ± .6
             15          .3 ± .5       . ± .3        . ± .9
    Street   5           .7 ± .6       .56 ± .2      .78 ± .7
             10          .72 ± .3      .39 ± .2      .4 ± .3
             15          .72 ± .3      .23 ± .       .25 ± .2
    Office   5           .52 ± .3      .6 ± .9       .3 ± .9
             10          .38 ± .4      .9 ± .        .2 ± .8
             15          .42 ± .3      .8 ± .6       .8 ± .7

We also conducted additional subjective tests using the ITU-T comparison category rating (CCR) to assess the performance difference [4]. Ten listeners with normal hearing (six male and four female) participated in the experiment. The CCR test rates the perceived quality of the signal of method A (proposed) relative to method B (Malah).
The grades of the seven-point scale are as follows: 3 (much better), 2 (better), 1 (slightly better), 0 (about the same), -1 (slightly worse), -2 (worse), -3 (much worse). The results of the CCR test between the proposed method and the conventional method based on Malah [9] are organized in Table 3. From the table, we confirm that the proposed method improves the ratings for speech quality, background noise, and overall quality. Finally, the speech spectrograms obtained with the conventional and proposed approaches are presented in Fig. 3. From the figure, we can see that the proposed method suppresses the background noise more effectively than the conventional method.

5. Conclusions

In this paper, we have proposed a novel method to incorporate the minima-controlled technique into the
tracking of speech presence uncertainty for speech enhancement. The ratio between the local energy and its minimum, which is adopted from the MCRA, controls the values of q for different bins since it provides robust tracking of speech presence. Compared to the conventional method for tracking speech presence uncertainty, the proposed technique performed better under various noise environments in both subjective and objective tests.

Fig. 3. Speech spectrograms (car noise, SNR = 5 dB), plotted against time (s). (a) Spectrogram of the clean speech (Original), (b) spectrogram of the noisy speech (Noisy Speech), (c) spectrogram of the output signal obtained by Malah [9] (Malah), (d) spectrogram of the output signal obtained by the proposed method (Proposed).

Acknowledgements

This research was supported by the MKE, Korea, under the ITRC support program supervised by the NIPA (NIPA- 2-C9-2-7). This work was also supported by the IT R&D program of MKE/KEIT [29-S-36-, Development of New Virtual Machine Specification and Technology].

References

[1] Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator, IEEE Transactions on Acoustics, Speech and Signal Processing ASSP-32 (6) (984) 9 2.
[2] R.J. McAulay, M.L. Malpass, Speech enhancement using a soft-decision noise suppression filter, IEEE Transactions on Acoustics, Speech and Signal Processing ASSP-28 (2) (98) 37 45.
[3] J.-H. Chang, Q.-H. Jo, D.K. Kim, N.S. Kim, Global soft decision employing support vector machine for speech enhancement, IEEE Signal Processing Letters 6 () (29) 57 6.
[4] R. Martin, Spectral subtraction based on minimum statistics, in: Proceedings of the EUSIPCO, Edinburgh, UK, September 994, pp. 82 85.
[5] I. Cohen, B. Berdugo, Speech enhancement for non-stationary noise environments, Signal Processing 8 () (2) 243 248.
[6] G.
Doblinger, Computationally efficient speech enhancement by spectral minima tracking in subbands, in: Proceedings of the Eurospeech, Madrid, Spain, September 995, pp. 53 56.
[7] J. Meyer, K.U. Simmer, K.D. Kammeyer, Comparison of one- and two-channel noise-estimation techniques, in: Proceedings of the IWAENC, London, UK, September 997, pp. 37 45.
[8] N.S. Kim, J.-H. Chang, Spectral enhancement based on global soft decision, IEEE Signal Processing Letters 7 (5) (2) 8.
[9] D. Malah, R. Cox, A.J. Accardi, Tracking speech-presence uncertainty to improve speech enhancement in nonstationary noise environments, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Phoenix, AZ, March 999, pp. 789 792.
[10] I. Soon, S. Koh, C. Yeo, Improved noise suppression filter using self-adaptive estimator of probability of speech absence, Signal Processing 75 (2) (999) 5 59.
[11] I. Cohen, B. Berdugo, Noise estimation by minima controlled recursive averaging for robust speech enhancement, IEEE Signal Processing Letters 9 () (22) 2 5.
[12] O. Cappé, Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor, IEEE Transactions on Speech Audio Processing 2 (April) (994) 345 349.
[13] ITU-T P.862, Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, February 2.
[14] ITU-T P.8, Methods for subjective determination of transmission quality, August 996.