On using acoustic environment classification for statistical model-based speech enhancement
Speech Communication 54 (2012)

Jae-Hun Choi, Joon-Hyuk Chang
School of Electrical Engineering, Hanyang University, Seoul 133-791, Republic of Korea

Received 4 April 2011; received in revised form 7 October 2011; accepted 3 October 2011; available online November 2011

Abstract

In this paper, we present a statistical model-based speech enhancement technique using acoustic environment classification supported by a Gaussian mixture model (GMM). In the data training stage, the principal parameters of the statistical model-based speech enhancement algorithm, namely the weighting parameter of the decision-directed (DD) method, the long-term smoothing parameter of the noise estimation, and the control parameter of the minimum gain value, are set to optimal operating points according to the given noise information to ensure the best performance for each noise. These optimal operating points, which are specific to the different background noises, are estimated using composite measures, objective quality measures that show the highest correlation with the actual quality of speech processed by noise suppression algorithms. In the on-line environment-aware speech enhancement step, noise classification is performed on a frame-by-frame basis using maximum likelihood (ML) classification with the GMM. The speech absence probability (SAP) is used to detect speech absence periods and to update the GMM likelihoods. According to the noise type classified for each frame, we assign the optimal values to the three aforementioned parameters for speech enhancement. We evaluated the performance of the proposed method using objective speech quality measures and subjective listening tests under various noise environments.
Our experimental results showed that the proposed method yields better performance than a conventional algorithm with fixed parameters. © 2011 Elsevier B.V. All rights reserved.

Keywords: Speech enhancement; Noise classification; Gaussian mixture model; DFT

Corresponding author. E-mail address: jchang@hanyang.ac.kr (J.-H. Chang).

1. Introduction

Speech enhancement is a fundamental part of speech processing because environmental background noise drastically degrades the performance of processing systems (Boll, 1979; Sim et al., 1998; McAulay and Malpass, 1980; Ephraim and Malah, 1984, 1985). Among the many approaches developed to enhance speech, spectral subtraction has been shown to be effective in suppressing stationary noise (Boll, 1979). However, this technique has limited ability to deal with nonstationary noises and tends to produce musical noise, an artifact characterized by randomly occurring tonal components. To avoid such artifacts in a practical speech enhancement system, two major components must be considered: noise power estimation and estimation of the uncorrupted speech (Sim et al., 1998; McAulay and Malpass, 1980; Ephraim and Malah, 1984, 1985; Cappé, 1994; Park and Chang, 2007; Martin, 2001; Sohn and Sung, 1998; Cohen and Berdugo, 2002). For the estimation of speech, Ephraim and Malah derived the minimum mean-square error (MMSE) estimator, which is very efficient at reducing the musical noise phenomenon (Ephraim and Malah, 1984, 1985; Cappé, 1994). Other spectral weighting rules such as Wiener filtering, maximum a posteriori, and MMSE log-spectral amplitude criteria have also been considered (Sim et al., 1998; McAulay and Malpass, 1980; Ephraim and Malah, 1984, 1985). These algorithms are further enhanced through the use of a soft decision scheme in which the speech absence probability (SAP) is
derived based on the likelihood ratio test (LRT) and used for gain modification. The spectral gain is modified by the SAP, which is estimated for each frequency bin in each frame. The SAP based on the statistical model of speech is generally computed with the help of an a priori SNR, which is estimated using a non-linear recursive procedure called the decision-directed (DD) approach (Ephraim and Malah, 1984). The a priori SNR determined by the DD rule combines the current short-time frame, with a fixed weight (1 − α), and the processing output of the previous frame, with weight α. Note that the parameter α should be set carefully, since it largely controls the trade-off between the degree of smoothing of the a priori SNR in noisy regions and the acceptable level of transient distortion in the signal. In contrast to the conventional DD estimator with its fixed weight factor, an adaptive weight factor determined by the deviation of the a posteriori SNR was proposed in (Park and Chang, 2007). Unfortunately, this estimator interacts with the estimated SNR and does not consider a wide variety of noise conditions. Regarding noise power estimation, minimum statistics (MS) obtains the noise estimate from the minima of a smoothed power estimate of the noisy signal (Martin, 2001). The MS method is motivated by the observation that the power of a noisy speech signal frequently decays to the power level of the noise. It is known to be sensitive to outliers, is generally biased, and has a variance about twice as large as that of a conventional noise estimator (Cohen and Berdugo, 2002). On the other hand, the aforementioned soft decision has been applied to the noise power estimation module by incorporating the SAP into the long-term smoothed power spectrum of the background noise (Sohn and Sung, 1998; Kim and Chang, 2000; Chang and Kim, 2001).
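As a rough illustration of the minimum-statistics idea just described, the following sketch tracks the minimum of a recursively smoothed periodogram over a short history. This is a deliberate simplification: Martin's method additionally uses time-varying optimal smoothing and bias compensation, which are omitted here, and `win` and `alpha` are assumed values, not from the paper.

```python
import numpy as np

def ms_noise_floor(noisy_power, win=8, alpha=0.9):
    """Toy minimum-statistics noise floor.
    noisy_power: array (frames, bins) of noisy-signal power spectra."""
    smoothed = np.empty_like(noisy_power)
    acc = noisy_power[0].copy()
    for t in range(len(noisy_power)):
        acc = alpha * acc + (1.0 - alpha) * noisy_power[t]  # recursive smoothing
        smoothed[t] = acc
    # noise estimate: minimum of the smoothed power over the last `win` frames
    return np.stack([smoothed[max(0, t - win + 1): t + 1].min(axis=0)
                     for t in range(len(smoothed))])
```

Because the minimum is taken over a window, a short burst of speech power raises the smoothed trajectory but not the tracked floor, which is exactly the observation motivating the MS method.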
In (Cohen and Berdugo, 2002), the noise power estimate is updated during periods of speech absence as well as speech presence. Considering both noise power estimation and speech estimation, most speech enhancement algorithms come packaged with tunable parameters that substantially affect their performance. For example, the weight parameter of the DD approach could be tuned using off-line knowledge of the acoustic background noise. Indeed, the environmental sniffing framework proposed by Akbacak and Hansen improves speech recognition in a car environment (Akbacak and Hansen, 2007). Krishnamurthy and Hansen further improved speech enhancement performance by providing a more accurate estimate of the noise update rate for a given environment (Krishnamurthy and Hansen, 2006). The environmentally-aware voice activity detector used in the method of Sangwan et al. (2007) builds an accurate noise model by employing a support vector machine (SVM). Regarding the classification of acoustic environments, numerous studies have been conducted for context-aware applications (Ma et al., 2006; Kraft et al., 2005). In this paper, we propose a novel speech enhancement approach using acoustic noise classification. The target platform is statistical model-based speech enhancement in which the SAP is derived from the LRT, with the DD method used to estimate the a priori SNR, and is used to modify the spectral gain and to update the noise power. First, we identify the optimal operating points of the principal parameters, namely the weight parameter of the DD approach, the long-term smoothing parameter, and the control parameter of the MMSE gain function, for a wide variety of noise environments. This is achieved with the help of the composite measure, which is known to be relevant for estimating actual speech quality (Hu and Loizou, 2006, 2008).
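The composite measure mentioned here is a fixed linear combination of basic objective scores; the coefficients below are those reported by Hu and Loizou (2008). A minimal sketch (the PESQ, LLR, WSS, and segmental-SNR scores themselves must come from their respective estimators):

```python
def composite_measures(s_pesq, s_llr, s_wss, s_segsnr):
    """Composite objective measures with the coefficients reported by
    Hu and Loizou (2008); inputs are the raw PESQ/LLR/WSS/segSNR scores."""
    c_sig = 3.093 - 1.029 * s_llr + 0.603 * s_pesq - 0.009 * s_wss
    c_bak = 1.634 + 0.478 * s_pesq - 0.007 * s_wss + 0.063 * s_segsnr
    c_ovl = 1.594 + 0.805 * s_pesq - 0.512 * s_llr - 0.007 * s_wss
    return c_sig, c_bak, c_ovl
```

Each output lies on a five-point MOS-like scale, so the same three scores yield the signal-distortion, background-intrusiveness, and overall-quality ratings used later in the paper.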
Secondly, we perform noise classification on a frame-by-frame basis to recognize the noise type of the current frame. A Gaussian mixture model (GMM)-based maximum likelihood (ML) classifier is used, applied only during speech absence as detected by the SAP. The feature vectors fed to the GMM are carefully selected from the relevant parameters of the 3GPP2 selectable mode vocoder (SMV), as in (Song et al., 2008; 3GPP2-C.R3-, v3.0, 2005). Subsequently, we use this per-frame noise knowledge to assign to the three parameters the optimal values that give the best performance for the specific type of underlying additive noise. Our approach responds quickly to noise variation since a running average is used to track evolving noise. In a number of experiments, the proposed speech enhancement technique is found to yield better performance than the conventional approach with fixed parameters. The rest of the paper is organized as follows. Section 2 briefly reviews the soft decision-based speech enhancement technique and Section 3 presents the proposed algorithm. Section 4 describes the experimental setup and results in detail, and conclusions are presented in Section 5.

2. Review of soft decision-based speech enhancement

Let x(n) and d(n) denote the clean speech and uncorrelated additive noise signals, respectively. The observed noisy speech signal y(n) is the sum of the clean speech signal x(n) and the noise d(n), where n is a discrete-time index. Taking a discrete Fourier transform (DFT), we then have

Y_k(t) = X_k(t) + D_k(t),   (1)

where k (= 1, 2, ..., K) is the frequency bin and t is the frame index.
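Because the DFT is linear, the additive model of (1) holds per frame and per bin whenever the speech and the noise are framed and windowed identically. A small analysis sketch (the frame length, hop, and window are illustrative choices, not values from the paper):

```python
import numpy as np

def stft_frames(y, frame_len=256, hop=128):
    """Windowed DFT analysis of a signal: returns Y[t, k] for frame t, bin k.
    frame_len and hop are assumed values for illustration."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    return np.stack([np.fft.rfft(win * y[t * hop: t * hop + frame_len])
                     for t in range(n_frames)])
```

Linearity can be checked numerically: the frames of x + d equal the frames of x plus the frames of d, bin by bin.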
Given the two hypotheses H_0 and H_1, which indicate speech absence and presence, respectively, we assume that

H_0 (speech absent): Y_k(t) = D_k(t),
H_1 (speech present): Y_k(t) = X_k(t) + D_k(t).   (2)

Assuming that the clean speech X_k(t) and the additive noise D_k(t) are statistically independent and that the noisy spectral components follow zero-mean complex Gaussian distributions, the probability density functions (PDFs) conditioned on the two hypotheses H_0 and H_1 are given by
p(Y_k(t) \mid H_0) = \frac{1}{\pi \lambda_{d,k}(t)} \exp\left\{ -\frac{|Y_k(t)|^2}{\lambda_{d,k}(t)} \right\},   (3)

p(Y_k(t) \mid H_1) = \frac{1}{\pi (\lambda_{x,k}(t) + \lambda_{d,k}(t))} \exp\left\{ -\frac{|Y_k(t)|^2}{\lambda_{x,k}(t) + \lambda_{d,k}(t)} \right\},   (4)

where λ_{x,k}(t) and λ_{d,k}(t) denote the variances of the clean speech and the noise for the kth spectral component of the tth frame, respectively (Kim and Chang, 2000). For the soft decision, the global SAP (GSAP) p(H_0 | Y(t)) conditioned on the current observations is derived as

p(H_0 \mid Y(t)) = \frac{p(Y(t) \mid H_0) P(H_0)}{p(Y(t) \mid H_0) P(H_0) + p(Y(t) \mid H_1) P(H_1)} = \frac{1}{1 + \frac{P(H_1)}{P(H_0)} \prod_{k=1}^{K} \Lambda(Y_k(t))},   (5)

where P(H_0) (= 1 − P(H_1)) is the a priori probability of speech absence. Substituting (3) and (4) into (5), the likelihood ratio Λ(Y_k(t)) at the kth frequency is expressed as follows (Kim and Chang, 2000):

\Lambda(Y_k(t)) = \frac{p(Y_k(t) \mid H_1)}{p(Y_k(t) \mid H_0)} = \frac{1}{1 + \xi_k(t)} \exp\left\{ \frac{\gamma_k(t) \xi_k(t)}{1 + \xi_k(t)} \right\},   (6)

where the a posteriori SNR γ_k(t) and the a priori SNR ξ_k(t) are defined in (7) and (8) below. For the noise statistics, λ̂_{d,k}(t) denotes the estimate of λ_{d,k}(t), obtained by the long-term smoothing of (10) below, in which ζ_d (= 0.99) is a smoothing parameter under a general stationarity assumption on D_k(t) (Kim and Chang, 2000). Taking into account the uncertainty of speech absence or presence, the GSAP is applied to the expectation of the noise power spectrum as shown below:

E[|D_k(t)|^2 \mid Y_k(t)] = E[|D_k(t)|^2 \mid Y_k(t), H_0] \, p(H_0 \mid Y(t)) + E[|D_k(t)|^2 \mid Y_k(t), H_1] \, p(H_1 \mid Y(t)).   (11)

Let X̂_k(t) represent the estimated clean speech spectrum at the kth frequency bin of the tth frame. In general, speech enhancement techniques estimate X̂_k(t) by applying a spectral gain to each spectral component of the input noisy spectrum.
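The DD a priori SNR estimate of (9) and the smoothed noise-power update of (10), both used below, are simple recursions; a minimal sketch (variable names are mine, not the paper's):

```python
import numpy as np

def dd_a_priori_snr(prev_clean_amp2, prev_noise_var, gamma, alpha_n=0.99):
    """Decision-directed a priori SNR, eq. (9): weighted mix of the previous
    frame's SNR estimate and the half-wave rectified (gamma - 1)."""
    return (alpha_n * prev_clean_amp2 / prev_noise_var
            + (1.0 - alpha_n) * np.maximum(gamma - 1.0, 0.0))

def smooth_noise_power(noise_var, expected_noise_power, zeta_d=0.99):
    """Long-term smoothed noise power, eq. (10)."""
    return zeta_d * noise_var + (1.0 - zeta_d) * expected_noise_power
```

With alpha_n close to 1, the estimate leans heavily on the previous frame, which is precisely the smoothing-versus-transient-distortion trade-off discussed in the introduction.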
For effective reduction of the musical noise phenomenon, we adopt the MMSE-based noise suppression rule proposed by Ephraim and Malah (1984):

\hat{X}_k(t) = \max\{ G(\xi_k(t), \gamma_k(t)), \, G_{min} \} \, Y_k(t),   (12)

where G_min is the minimum gain that controls the perceived noise and G(·,·) denotes the actual noise suppression gain, given by

G(\xi, \gamma) = \frac{\sqrt{\pi}}{2} \sqrt{\frac{\xi}{\gamma (1 + \xi)}} \, F\!\left[ \frac{\gamma \xi}{1 + \xi} \right],   (13)

with

F[m] = \exp\left(-\frac{m}{2}\right) \left[ (1 + m) \, I_0\!\left(\frac{m}{2}\right) + m \, I_1\!\left(\frac{m}{2}\right) \right],   (14)

in which I_0 and I_1 are the modified Bessel functions of order zero and one, respectively. The a posteriori signal-to-noise ratio (SNR) γ_k(t) and the a priori SNR ξ_k(t) are defined by

\gamma_k(t) \triangleq \frac{|Y_k(t)|^2}{\lambda_{d,k}(t)},   (7)

\xi_k(t) \triangleq \frac{\lambda_{x,k}(t)}{\lambda_{d,k}(t)}.   (8)

If ξ̂_k(t) and γ̂_k(t) are the estimates of ξ_k(t) and γ_k(t), respectively, ξ̂_k(t) can be obtained using the well-known decision-directed (DD) approach (Ephraim and Malah, 1984):

\hat{\xi}_k(t) = \alpha_n \frac{|\hat{X}_k(t-1)|^2}{\hat{\lambda}_{d,k}(t-1)} + (1 - \alpha_n) \, C[\hat{\gamma}_k(t) - 1],   (9)

where X̂_k(t − 1) is the estimated clean speech spectrum of the previous frame and C[x] = x if x ≥ 0, C[x] = 0 otherwise. Here, α_n (0 ≤ α_n ≤ 1) is a weighting factor that controls the trade-off between noise reduction and transient signal distortion, and is empirically chosen close to 1 (i.e., α_n = 0.99). Also, γ̂_k(t) is obtained directly as the ratio of the input power |Y_k(t)|^2 to the estimate of λ_{d,k}(t). On the other hand, the estimation of the noise power spectrum is a major component of speech enhancement. In particular, the soft decision method adopts a long-term smoothed noise power spectrum of the background noise as the estimate of λ_{d,k}(t) (Kim and Chang, 2000):

\hat{\lambda}_{d,k}(t+1) = \zeta_d \hat{\lambda}_{d,k}(t) + (1 - \zeta_d) \, E[|D_k(t)|^2 \mid Y_k(t)].   (10)

Notice that G_min should be set carefully, since it controls the trade-off between the residual noise level and the musical-noise effect.
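The gain of (13)-(14) can be evaluated numerically. A sketch that stays NumPy-only by computing exponentially scaled Bessel values through their integral representation (the quadrature helper is my implementation choice, not part of the paper); the exp(−ν/2) factor of F[ν] is absorbed into the scaled Bessel values, which avoids overflow at high SNR:

```python
import numpy as np

def ive(n, x):
    """exp(-x) * I_n(x): exponentially scaled modified Bessel function,
    computed by trapezoidal quadrature of its integral representation."""
    th, dth = np.linspace(0.0, np.pi, 4001, retstep=True)
    f = np.exp(x * (np.cos(th) - 1.0)) * np.cos(n * th)
    return (f.sum() - 0.5 * (f[0] + f[-1])) * dth / np.pi

def mmse_stsa_gain(xi, gamma):
    """Ephraim-Malah MMSE-STSA gain, eqs. (13)-(14)."""
    nu = gamma * xi / (1.0 + xi)
    f = (1.0 + nu) * ive(0, nu / 2.0) + nu * ive(1, nu / 2.0)
    return (np.sqrt(np.pi) / 2.0) * (np.sqrt(nu) / gamma) * f
```

At high SNR the gain approaches the Wiener gain ξ/(1 + ξ), a standard sanity check for implementations of this rule.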
Similarly to the value used in (TIA/EIA/IS-127, 1996), 0.1248 is used as a fixed value in (Kim and Chang, 2000).

3. Proposed environment-aware speech enhancement

In the previous section, we noted that the principal parameters of the soft decision-based speech enhancement technique of (Kim and Chang, 2000), namely the weight α_n of the DD approach, the long-term smoothing parameter ζ_d of the noise power estimation, and the minimum gain parameter G_min, are fixed values. However, since these parameters should vary with the noise type to ensure the best performance, we organize environmental knowledge about the noise so as to select the parameters adaptively during speech enhancement. The overall environment-aware speech enhancement scheme based on GMM noise classification is shown in Fig. 1. The following subsections describe each part of the proposed algorithm in more detail.

3.1. Finding optimal operating points for given noises

The operating points of α_n, ζ_d, and G_min for specific noises should be determined based on a relevant criterion
Fig. 1. Overall block diagram of the proposed environment-aware speech enhancement.

in terms of speech quality. The most accurate evaluation of speech quality is an exhaustive subjective listening test. However, since such tests are costly and time-consuming, we adopt the well-known composite measure of (Hu and Loizou, 2006, 2008) to quantify overall speech quality as the parameters are varied. The composite measure for overall quality, C_ovl, combines basic objective measures into a new measure as follows:

C_{ovl} = 1.594 + 0.805 \, S_{PESQ} - 0.512 \, S_{LLR} - 0.007 \, S_{WSS},   (15)

where S_PESQ, S_LLR, and S_WSS denote the scores of the perceptual evaluation of speech quality (PESQ), the log-likelihood ratio (LLR), and the weighted-slope spectral distance (WSS), respectively (ITU-T Rec. P.862, 2001; Quackenbush et al., 1988). It is known from (Hu and Loizou, 2008) that the composite measure correlates significantly with overall perceptual speech quality ratings such as the mean opinion score (MOS). We then prepared 6 speech samples taken from the NTT database, consisting of speech material from four male and four female speakers, each 8 s in duration. To create noisy environments, we applied twelve different noise types (babble, car1, car2, destroyer-engine, destroyer-operation, factory1, factory2, HF-channel, office, street, white, and wind noise) to the clean speech data at SNR levels of 5, 10, and 15 dB. The speech enhancement technique of (Kim and Chang, 2000) was applied to these noisy speech sentences while the parameters were varied. Based on the enhanced speech signal, we first investigated the behavior of C_ovl as α_n and ζ_d were varied, plotted graphically for clear understanding. For each noise type, we obtained a 3D mesh curve over the various values of α_n and ζ_d, as shown in Fig. 2. Based on the data in Fig.
2, we discovered that the four points indicated by the arrows represent the optimal points in terms of C_ovl for babble, factory1, HF-channel, and office noise, respectively. By repeating this procedure and incorporating G_min as an additional parameter to be optimized, we obtained the unique points (α_n*, ζ_d*, G_min*) for the given noise types, as shown in Table 1. As the table shows, different parameters are chosen for different noises at the optimal operating points. Note that the variation of these points with the input SNR is small. This observation tells us that the points can be applied without an additional SNR estimation step.

3.2. On-line acoustic noise classification employing a Gaussian mixture model

As described in the previous subsection, the optimal operating points of the principal parameters for the various noise types are obtained off-line. For real-time determination of the optimal point under time-varying noise conditions, we must classify the noise signal on a frame-by-frame basis during speech pauses. To achieve successful classification, a feature vector that effectively discriminates among the various noise environments must be chosen. As in (3GPP2-C.R3-, v3.0, 2005), we select a 14-dimensional feature vector comprising ten linear predictive coding (LPC) coefficients, the frame energy, the partial residual energy, the running mean of the energy, and the running mean of the partial residual energy, chosen for their superior classification performance. Fig. 3 presents the normalized distributions of the selected features for each noise, demonstrating that their multi-modal characteristics can be successfully modeled by a GMM. For the GMM with feature vectors \vec{x} = \{x_1, x_2, \ldots, x_N\}, the Gaussian mixture density, a weighted sum of M mixture components, is written as follows:
p(\vec{x} \mid \lambda) = \sum_{i=1}^{M} a_i \, p_i(\vec{x}),   (16)

where p_i(\vec{x}) and a_i denote, respectively, the Gaussian distribution and the weight of the ith mixture component, defined by

p_i(\vec{x}) = \frac{1}{(2\pi)^{N/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (\vec{x} - \mu_i)^T \Sigma_i^{-1} (\vec{x} - \mu_i) \right\},   (17)

\sum_{i=1}^{M} a_i = 1.   (18)

Each noise is thus modeled by a GMM parameter set λ comprising the mixture weights a_i, the mean vectors μ_i, and the covariance matrices Σ_i. In the noise classification, each noise is characterized by its GMM, i.e., λ_s where s = 1 (babble), 2 (car1), 3 (car2), 4 (destroyer-engine), 5 (destroyer-operation), 6 (factory1), 7 (factory2), 8 (HF-channel), 9 (office), 10 (street), 11 (white), 12 (wind), or 13 (universal background model). We used 6 mixture components, based on the trade-off between performance and the additional computational load; the dependence on the mixture order was marginal for M ≥ 6. Based on the established models, the objective is to identify the noise model with the maximum a posteriori probability for the input feature vector \vec{x}(t). Specifically, assuming equally likely noises, we determine the noise model ŝ(t) with the maximum a posteriori probability for the current frame:

\hat{s}(t) = \arg\max_{s = 1, 2, \ldots, 13} \log \hat{p}(\lambda_s \mid \vec{x}(t)).   (19)

As shown in the flow diagram of the GMM-based noise classification in Fig. 4, the likelihoods of the GMM for the individual noises are constructed during the initial ten frames.

Fig. 2. 3D mesh curves for the estimated optimal operating points: (a) babble noise (SNR = 5 dB); (b) factory1 noise (SNR = 10 dB); (c) HF-channel noise (SNR = 5 dB); (d) office noise (SNR = 10 dB).

Table 1 gives the optimal operating points α_n*, ζ_d*, and G_min* for the various noise types.
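The GMM-based classification step above reduces to an argmax over per-model log-likelihoods; a minimal sketch with diagonal covariances (a common simplification, assumed here rather than stated in the paper):

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | lambda) for a diagonal-covariance GMM, eqs. (16)-(18).
    weights: (M,); means, variances: (M, D); x: (D,)."""
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances)
                               + (x - means) ** 2 / variances, axis=1))
    return np.logaddexp.reduce(log_comp)  # log sum_i a_i p_i(x)

def classify_noise(x, models):
    """Eq. (19): pick the noise model with the highest likelihood
    (with equal priors, ML equals maximum a posteriori)."""
    return int(np.argmax([gmm_log_likelihood(x, *m) for m in models]))
```

Working in the log domain with `logaddexp` keeps the mixture sum numerically stable even when individual component densities underflow.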
Once the GMM likelihood for each noise is available, the likelihoods are updated frame-by-frame during noise-only periods, which is a major contribution of this work. For this, we use a long-term smoothed likelihood incorporating the SAP to prevent likelihood updates during speech periods:

\log \hat{p}(\vec{x}(t) \mid \lambda_s) = p(H_0 \mid Y(t)) \left\{ \beta \log \hat{p}(\vec{x}(t-1) \mid \lambda_s) + (1 - \beta) \log p(\vec{x}(t) \mid \lambda_s) \right\} + (1 - p(H_0 \mid Y(t))) \log \hat{p}(\vec{x}(t-1) \mid \lambda_s),   (20)
Fig. 3. Normalized distributions of the adopted features for noise classification: (a) frame energy; (b) partial residual energy; (c) running mean energy; (d) running mean of the partial residual energy.

where β (= 0.985) is the smoothing parameter. Misadaptation of the likelihoods during speech presence could cause the noise classification to fail. To address this problem, we employ a SAP counter that counts the number of successive noise-only frames; the likelihoods are updated according to (20) only when the SAP counter exceeds a given threshold (i.e., 3). The background noise is then successfully classified by the GMM, as displayed in Fig. 5. As can be seen from the classification result after t = 4 s, the classification is slightly delayed by the SAP counter and the long-term smoothing of the likelihood.

3.3. Acoustic noise classification-based speech enhancement

Using the classified noise information ŝ(t) for the current frame, the three key parameters α_n, ζ_d, and G_min are replaced in every frame by α_n*, ζ_d*, and G_min*, respectively, based on Table 1.
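The per-frame substitution is not applied abruptly: the parameters are recursively smoothed toward the classified noise's optimal operating point, as formalized in (22) and (24) below. A sketch, where the lookup-table entries are made-up values for illustration only (the paper's Table 1 values are not reproduced in this transcription):

```python
# Hypothetical per-noise optimal operating points (alpha_n*, zeta_d*, G_min*)
# in the spirit of Table 1; these numbers are invented for illustration.
TABLE1 = {"babble": (0.98, 0.98, 0.10), "car1": (0.99, 0.99, 0.15)}

def smooth_parameter(prev_value, optimal_value, kappa=0.9):
    """Long-term smoothing toward the classified noise's optimal point,
    as in eqs. (22) and (24); kappa = 0.9 as in the paper."""
    return kappa * prev_value + (1.0 - kappa) * optimal_value
```

Repeated application converges geometrically to the table value, so a misclassified single frame perturbs the parameters only slightly.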
Accordingly, the proposed a priori SNR estimate becomes

\hat{\xi}_k(t) = \hat{\alpha}_n(t) \frac{|\hat{X}_k(t-1)|^2}{\hat{\lambda}_{d,k}(t-1)} + (1 - \hat{\alpha}_n(t)) \, C[\hat{\gamma}_k(t) - 1].   (21)

Here, α̂_n(t) is obtained by long-term smoothing, to prevent abrupt changes in α_n and to ensure robust performance:

\hat{\alpha}_n(t) = \kappa_\alpha \hat{\alpha}_n(t-1) + (1 - \kappa_\alpha) \, \alpha_n^*(t),   (22)

with κ_α (= 0.9) a smoothing parameter. Similarly, the noise power estimate is changed using ζ_d* such that

\hat{\lambda}_{d,k}(t) = \hat{\zeta}_d(t) \hat{\lambda}_{d,k}(t-1) + (1 - \hat{\zeta}_d(t)) \, E[|D_k(t)|^2 \mid Y_k(t)],   (23)

in which

\hat{\zeta}_d(t) = \kappa_\zeta \hat{\zeta}_d(t-1) + (1 - \kappa_\zeta) \, \zeta_d^*(t),   (24)

with smoothing parameter κ_ζ (= 0.9). As a result, the soft decision-based speech enhancement is finally performed using (21) and (23). From the newly derived ξ̂_k(t) and γ̂_k(t), the clean speech spectrum X̂_k(t) is obtained using the aforementioned MMSE-based spectral gain:

\hat{X}_k(t) = G(\hat{\xi}_k(t), \hat{\gamma}_k(t)) \, Y_k(t).   (25)

With the soft decision, the noise suppression rule G(·,·) is modified to \tilde{G}(·,·), which incorporates the SAP:

\tilde{G}(\hat{\xi}_k(t), \hat{\gamma}_k(t)) = (1 - p(H_0 \mid Y_k(t))) \, G(\hat{\xi}_k(t), \hat{\gamma}_k(t)).   (26)

When deriving the suppression gain, the lower limit G_min of the spectral gain should be chosen to minimize the
Fig. 4. Block diagram of the GMM-based noise classification.

disturbing residual noise and the speech signal distortion (TIA/EIA/IS-127, 1996), as follows:

\tilde{G}_k(t) = \max\{ \tilde{G}(\hat{\xi}_k(t), \hat{\gamma}_k(t)), \, G_{min} \},   (27)

where max{·} is the maximum operator. As (27) shows, a higher minimum gain value results in more residual noise. Conversely, as the minimum gain approaches zero, the residual noise is minimized at the cost of speech distortion. Thus, the minimum gain G_min must clearly be chosen with care. However, in (TIA/EIA/IS-127, 1996) a fixed value (= 0.1248) is used, which is not reasonable given the variety of noise types. We therefore adopt the G_min* given in Table 1 according to the classified noise type, and obtain the final suppression gain as

\tilde{G}_k(t) = \max\{ \tilde{G}(\hat{\xi}_k(t), \hat{\gamma}_k(t)), \, G_{min}^*(t) \}.   (28)

Accordingly, the residual noise is adjusted based on the noise information, in clear contrast to the previous method (Kim and Chang, 2000).

4. Experiments and results

The proposed environment-aware speech enhancement technique using noise classification was evaluated with objective speech quality measures and subjective listening tests. The experiments are divided into a performance evaluation of the noise classification and a comparison of the noise suppression performance of the proposed algorithm with the conventional method (speech enhancement based on global soft decision, denoted SEGSD) (Kim and Chang, 2000). First, to evaluate the noise classification, different data sets were used for training and testing. The test speech files for GMM noise classification comprised 456 s of speech data from four male and four female speakers, sampled at 8 kHz. Reference decisions were obtained on the clean speech files by manually labeling every 10-ms frame.
Using the hand-marked speech files, we divided the test material into
Fig. 5. Result of the GMM-based noise classification: (a) the GSAP; (b) clean speech waveform; (c) noisy speech corrupted by babble and car noise (SNR = 5 dB); (d) classification result.

noise-only frames and active speech frames. The hand-marked test material included 57% active speech frames, consisting of 44% voiced and 13% unvoiced sounds. To simulate various noise environments, the aforementioned 12 noise sources, which were different from those in the training data set, were added to the clean speech data at SNRs of 5, 10, and 15 dB. The test data comprised phrases from the NTT database (Chang, 2005), spoken by four male and four female speakers; in this database of 96 phrases, each phrase includes two different meaningful sentences and each file lasts 8 s. We first compared the noise classification technique used in our enhancement method with the conventional method of (Ma et al., 2006). Since the noise-only periods are known from the hand-labeled information, we measured the detection probability (P_d) for each frame of the noise periods. For the given test files, the performance of the proposed algorithm is shown in Tables 2-4 in the form of confusion matrices. The confusion matrices of the conventional method are given in Tables 5-7 for comparison. These results show that the noise classification achieves high accuracy (>96%) for the given noises at all SNRs. This demonstrates that the noise classification technique in our approach is suitable for environmental discrimination in speech enhancement.
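The frame-level evaluation just described amounts to accumulating a confusion matrix and reading the per-noise detection probability off its diagonal; a small sketch (function and variable names are mine):

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, n_classes):
    """Frame-level confusion matrix; row = true noise, column = decision.
    The diagonal divided by the row sums gives the per-noise P_d."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        cm[t, p] += 1
    p_d = np.diag(cm) / cm.sum(axis=1)
    return cm, p_d
```

Averaging `p_d` over the noise classes gives the single average-accuracy figures reported in the tables.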
Note that the performance differences across SNRs were negligible, implying that the proposed noise classification is robust to SNR variation. On the other hand, the conventional method employing MFCC features gave average accuracies of 95.37% (SNR = 5 dB), 95.77% (SNR = 10 dB), and 95.75% (SNR = 15 dB), which are lower than those of our approach. This indicates that the proposed noise classification technique is well suited to the speech enhancement framework. We also investigated the computational complexity of the SEGSD and the proposed method to assess the additional computational burden. Table 8 summarizes the computational complexity in terms of the MIPS claimed by each algorithm; for a clear comparison, the MIPS of the proposed method is divided into two parts, the noise classification and the speech enhancement. The computation was measured on the TMS320C55x (TMS320C55x, 2002). The results show that classifying the 13 different noises in the proposed method requires additional computation for the noise classification. However, the additional load could be reduced with minimal performance degradation by merging similar noise types into a single class (e.g., grouping car1 and car2 into a vehicle class) for a commercial application. Next, to compare the proposed speech enhancement method with the conventional soft-decision algorithm with its fixed smoothing parameters, we
Table 2. Confusion matrix of the proposed noise classification (SNR = 5 dB); average accuracy: 96.73%.

Table 3. Confusion matrix of the proposed noise classification (SNR = 10 dB); average accuracy: 97.75%.

Table 4. Confusion matrix of the proposed noise classification (SNR = 15 dB); average accuracy: 96.83%.

adopted the composite measures to evaluate speech quality objectively as a combination of representative objective quality measures. Specifically, the composite measures consist of signal distortion (C_sig), background noise distortion (C_bak), and overall quality (C_ovl), as defined in (Hu and Loizou, 2006, 2008):

C_{sig} = 3.093 - 1.029 \, S_{LLR} + 0.603 \, S_{PESQ} - 0.009 \, S_{WSS},   (29)

C_{bak} = 1.634 + 0.478 \, S_{PESQ} - 0.007 \, S_{WSS} + 0.063 \, S_{segSNR},   (30)

C_{ovl} = 1.594 + 0.805 \, S_{PESQ} - 0.512 \, S_{LLR} - 0.007 \, S_{WSS},   (31)

where C_sig, C_bak, and C_ovl denote a five-point scale of signal distortion, a five-point scale of background intrusiveness,
Table 5. Confusion matrix of the conventional MFCC-based noise classification (SNR = 5 dB); average accuracy: 95.37%.

Table 6. Confusion matrix of the conventional MFCC-based noise classification (SNR = 10 dB); average accuracy: 95.77%.

Table 7. Confusion matrix of the conventional MFCC-based noise classification (SNR = 15 dB); average accuracy: 95.75%.

and the overall quality on the mean opinion score (MOS) scale, respectively. S_segSNR denotes the segmental SNR score. Tables 9 and 10 present the results for signal distortion and background noise distortion. In particular, we added experimental results for a test set of open noise types, ambulance and truck noise, which were not part of the training set. We also include the results for the ideal case in which the noises are perfectly classified; this indicates the performance limit of acoustic noise classification. As seen in the tables, the proposed algorithm yielded better performance
Table 8
Comparison of the computational complexity (in MIPS) of the SEGSD method and the proposed method. For the proposed method, the acoustic noise classification module adds 7.84 MIPS (feature extraction: 1.14; GMM-likelihood: 6.70) on top of the noise suppression routine.

Table 9
Signal distortion (C_sig) results obtained from the proposed algorithm, the SEGSD method, and the perfect-classification case, for babble, destroyer-engine, HF-channel, office, ambulance, and truck noises at several SNRs.

than the conventional SEGSD method under all conditions. These results show that the proposed method consistently outperforms the SEGSD method in terms of both residual noise and signal distortion. It can also be seen that the proposed algorithm works well on the open noise sets, which implies that the proposed algorithm is not dependent on the training set. In particular, the performance difference between the proposed algorithm and the perfect-classification case is negligible, which means that our noise classification algorithm provides accurate and robust performance. On the other hand, Table 11 shows the results for the overall speech quality through the composite measure. Based on these results, we can see that the proposed method yields better performance than the previous method for all SNRs and all given noise types, including the open noises. This finding is consistent with the previous results, showing that our approach

Table 10
Background noise distortion (C_bak) results obtained from the proposed algorithm, the SEGSD method, and the perfect-classification case (babble, destroyer-engine, HF-channel, office, ambulance, and truck noises).

Table 11
Overall quality (C_ovl) results obtained from the proposed algorithm, the SEGSD method, and the perfect-classification case (babble, destroyer-engine, destroyer-operation, HF-channel, office, ambulance, and truck noises).
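The C_sig, C_bak, and C_ovl values reported in Tables 9-11 follow directly from the linear combinations in Eqs. (29)-(31). A minimal sketch, assuming the component scores S_LLR, S_PESQ, S_WSS, and S_segSNR have already been obtained from the respective objective measures (the clamping to the five-point scale is an implementation choice here, not something specified above):

```python
# Sketch of the composite quality measures of Eqs. (29)-(31)
# (Hu and Loizou, 2008). Only the linear mapping is shown; the
# component scores are assumed to be computed beforehand.

def composite_measures(s_llr, s_pesq, s_wss, s_segsnr):
    """Map the four component objective scores to (C_sig, C_bak, C_ovl)."""
    c_sig = 3.093 - 1.029 * s_llr + 0.603 * s_pesq - 0.009 * s_wss
    c_bak = 1.634 + 0.478 * s_pesq - 0.007 * s_wss + 0.063 * s_segsnr
    c_ovl = 1.594 + 0.805 * s_pesq - 0.512 * s_llr - 0.007 * s_wss

    def clamp(x):
        # The composites are read on a five-point MOS-like scale;
        # limiting the output to [1, 5] is our implementation choice.
        return min(max(x, 1.0), 5.0)

    return clamp(c_sig), clamp(c_bak), clamp(c_ovl)
```

Note how C_bak is the only composite that rewards the segmental SNR term, consistent with its role as a measure of background intrusiveness.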
Table 12
PESQ results obtained from the proposed algorithm, the SEGSD method, and the perfect-classification case.

Table 13
MOS test results obtained from the proposed algorithm and the SEGSD method (with 95% confidence intervals), together with the outcome of the hypothesis test against the SEGSD reference for each noise environment and SNR: better than (B), not worse than (NW), or worse than (W).

improves the qualities of both the speech signal and the background noise. In addition, we evaluated the performance in terms of the well-known objective quality measure PESQ, which is recommended by the ITU-T for speech quality assessment of narrow-band telephony (ITU-T Rec. P.862, 2001), even though the PESQ measure

Fig. 6. Speech spectrograms (destroyer-operation noise, SNR = 5 dB): (a) clean speech; (b) noisy speech; (c) enhanced speech by the SEGSD; (d) enhanced speech by the proposed method.
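Each entry of Table 13 pairs a mean opinion score with a 95% confidence half-width over the ten listeners' ratings. A minimal sketch of one way such an interval can be formed, assuming a normal approximation with z = 1.96 (the paper does not state the exact interval construction used):

```python
import math

def mos_confidence_interval(scores, z=1.96):
    """Return (mean, half_width): the mean opinion score over the
    listener ratings and the 95% confidence half-width under a
    normal approximation (z = 1.96)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half_width = z * math.sqrt(var / n)
    return mean, half_width
```

For a sample as small as ten listeners, a t-based interval (t with 9 degrees of freedom, about 2.262) would be somewhat wider than the normal approximation shown here.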
is included as a basic element in the composite measure. As shown in Table 12, the proposed approach is superior to the SEGSD method. In addition, subjective listening tests were performed in order to validate the objective performance evaluation. The listening tests were performed with ten listeners, each of whom scored every test file between one and five; this scale corresponds to the MOS scale. The results are presented in Table 13, where a higher value is preferred. The outcomes of the corresponding hypothesis test against the reference (SEGSD) are classified into three categories, (1) better than (B), (2) not worse than (NW), and (3) worse than (W), to check the statistical significance (Chang and Kim, 2001). Table 13 shows that the proposed method outperformed the conventional SEGSD method under the given noise environments at all SNRs. The subjective listening tests, supported by the statistical hypothesis test, confirm that the proposed enhancement method leads to better or comparable results relative to the previous method, even though its parameters were optimized on the objective composite measure. Thus, it can be concluded that the proposed method improves the audible speech quality with the help of the acoustic noise classification. Finally, speech spectrograms are presented in Fig. 6. Fig. 6(c) and (d) show the spectrograms obtained with the SEGSD and the proposed algorithm, respectively. With the proposed method, the residual noise spectra are successfully reduced while the speech spectra are well preserved.

5. Conclusion

In this paper, we proposed a novel speech enhancement technique using environment awareness provided by noise classification.
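The frame-by-frame, SAP-gated classification at the core of this technique can be sketched as follows. The diagonal-covariance GMM parameters, the feature dimensionality, and the SAP gating threshold below are all illustrative assumptions rather than the paper's actual models or settings:

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of a feature vector x (shape (d,)) under a
    diagonal-covariance GMM with K components (weights: (K,),
    means/variances: (K, d))."""
    log_probs = (np.log(weights)
                 - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
                 - 0.5 * np.sum((x - means) ** 2 / variances, axis=1))
    m = np.max(log_probs)
    return m + np.log(np.sum(np.exp(log_probs - m)))  # stable log-sum-exp

def classify_noise(frames, sap, gmms, sap_threshold=0.9):
    """Accumulate per-noise-type log-likelihoods over frames whose speech
    absence probability (SAP) is high, and return the maximum-likelihood
    noise label. The 0.9 threshold is illustrative, not the paper's value."""
    scores = {label: 0.0 for label in gmms}
    for x, p in zip(frames, sap):
        if p < sap_threshold:  # likely speech-present frame: skip update
            continue
        for label, (w, mu, var) in gmms.items():
            scores[label] += gmm_log_likelihood(x, w, mu, var)
    return max(scores, key=scores.get)
```

In practice, one GMM per trained noise type (plus the UBM) would be estimated offline, and the accumulated likelihoods would be reset or recursively smoothed so the decision can track a changing acoustic environment.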
The principal contribution of this work is the identification of optimal operating points for the principal parameters of a statistical model-based speech enhancement algorithm, which enables the performance improvement. To perform the noise classification on a frame-by-frame basis, the GMM-based likelihood is used. It should be noted that the GMM-based likelihood is updated only during the noise frames, which are identified by the SAP of each frame within the unified framework. The performance of the proposed approach was found to be superior to that of the conventional technique in extensive objective and subjective quality tests.

Acknowledgements

This work was supported by the IT R&D program of MKE/KEIT [29-S-36-, Development of New Virtual Machine Specification and Technology], by a National Research Foundation of Korea (NRF) grant funded by the Korean Government (MEST) (NRF-2-982), and by the research fund of Hanyang University (HY-2-22).

References

Akbacak, M., Hansen, J., 2007. Environmental sniffing: noise knowledge estimation for robust speech systems. IEEE Trans. Audio Speech Lang. Process. 15 (2).
Boll, S.F., 1979. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 27 (2), 113-120.
Cappé, O., 1994. Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor. IEEE Trans. Speech Audio Process. 2 (2).
Chang, J.-H., Kim, N.S., 2001. Speech enhancement: new approaches to soft decision. IEICE Trans. Inform. Systems E84-D (9).
Chang, J.-H., 2005. Warped discrete cosine transform-based noisy speech enhancement. IEEE Trans. Circuits Syst. II: Express Briefs 52 (9).
Cohen, I., Berdugo, B., 2002. Noise estimation by minima controlled recursive averaging for robust speech enhancement. IEEE Signal Process. Lett. 9 (1), 12-15.
Ephraim, Y., Malah, D., 1984. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. ASSP-32 (6), 1109-1121.
Ephraim, Y., Malah, D., 1985. Speech enhancement using a minimum mean-square error log spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. ASSP-33 (2), 443-445.
Hu, Y., Loizou, P., 2006. Evaluation of objective measures for speech enhancement. In: Proc. Interspeech.
Hu, Y., Loizou, P., 2008. Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process. 16 (1), 229-238.
ITU-T Rec. P.862, 2001. Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs.
Kim, N.S., Chang, J.-H., 2000. Spectral enhancement based on global soft decision. IEEE Signal Process. Lett. 7 (5), 108-110.
Kraft, F., Malkin, R., Schaaf, T., Waibel, A., 2005. Temporal ICA for classification of acoustic events in a kitchen environment. In: Proc. Interspeech.
Krishnamurthy, N., Hansen, J., 2006. Noise update modeling for speech enhancement: when do we do enough? In: Proc. Interspeech.
Ma, L., Milner, B.P., Smith, D., 2006. Acoustic environment classification. ACM Trans. Speech Lang. Process. 3 (2).
Martin, R., 2001. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process. 9 (5), 504-512.
McAulay, R.J., Malpass, M.L., 1980. Speech enhancement using a soft-decision noise suppression filter. IEEE Trans. Acoust. Speech Signal Process. ASSP-28 (2), 137-145.
Park, Y.S., Chang, J.-H., 2007. A novel approach to a robust a priori SNR estimator in speech enhancement. IEICE Trans. Comm. E90-B (8).
Quackenbush, S., Barnwell, T., Clements, M., 1988. Objective Measures of Speech Quality. Prentice-Hall, Englewood Cliffs, NJ.
Sangwan, A., Krishnamurthy, N., Hansen, J., 2007. Environmentally aware voice activity detector. In: Proc. Interspeech.
Sim, B.L., Tong, Y.C., Chang, J.S., Tan, C.T., 1998. A parametric formulation of the generalized spectral subtraction method. IEEE Trans. Speech Audio Process. 6 (4), 328-337.
Sohn, J., Sung, W., 1998. A voice activity detector employing soft decision based noise spectrum adaptation. In: Proc. IEEE Internat. Conf. on Acoustics, Speech, and Signal Processing (ICASSP'98), Vol. 1.
Song, J.-H., Lee, K.-H., Chang, J.-H., Kim, J.K., Kim, N.S., 2008. Analysis and improvement of speech/music classification for 3GPP2 SMV based on GMM. IEEE Signal Process. Lett. 15, 103-106.
TIA/EIA/IS-127, 1996. Enhanced variable rate codec, speech service option 3 for wideband spread spectrum digital systems.
TMS320C55x, 2002. TMS320C55x DSP library programmer's reference. TI Inc., Dallas, TX, USA.
3GPP2 Spec., 2005. Software distribution for selectable mode vocoder (SMV), service option 56, specification. 3GPP2-C.R3-, v3.
More informationSpeech Enhancement in Noisy Environment using Kalman Filter
Speech Enhancement in Noisy Environment using Kalman Filter Erukonda Sravya 1, Rakesh Ranjan 2, Nitish J. Wadne 3 1, 2 Assistant professor, Dept. of ECE, CMR Engineering College, Hyderabad (India) 3 PG
More informationSpeech Synthesis using Mel-Cepstral Coefficient Feature
Speech Synthesis using Mel-Cepstral Coefficient Feature By Lu Wang Senior Thesis in Electrical Engineering University of Illinois at Urbana-Champaign Advisor: Professor Mark Hasegawa-Johnson May 2018 Abstract
More informationAvailable online at ScienceDirect. Procedia Computer Science 89 (2016 )
Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 89 (2016 ) 666 676 Twelfth International Multi-Conference on Information Processing-2016 (IMCIP-2016) Comparison of Speech
More informationComparison of Spectral Analysis Methods for Automatic Speech Recognition
INTERSPEECH 2013 Comparison of Spectral Analysis Methods for Automatic Speech Recognition Venkata Neelima Parinam, Chandra Vootkuri, Stephen A. Zahorian Department of Electrical and Computer Engineering
More informationSOUND SOURCE RECOGNITION AND MODELING
SOUND SOURCE RECOGNITION AND MODELING CASA seminar, summer 2000 Antti Eronen antti.eronen@tut.fi Contents: Basics of human sound source recognition Timbre Voice recognition Recognition of environmental
More informationROBUST echo cancellation requires a method for adjusting
1030 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 3, MARCH 2007 On Adjusting the Learning Rate in Frequency Domain Echo Cancellation With Double-Talk Jean-Marc Valin, Member,
More informationNOISE POWER SPECTRAL DENSITY MATRIX ESTIMATION BASED ON MODIFIED IMCRA. Qipeng Gong, Benoit Champagne and Peter Kabal
NOISE POWER SPECTRAL DENSITY MATRIX ESTIMATION BASED ON MODIFIED IMCRA Qipeng Gong, Benoit Champagne and Peter Kabal Department of Electrical & Computer Engineering, McGill University 3480 University St.,
More informationNon-intrusive intelligibility prediction for Mandarin speech in noise. Creative Commons: Attribution 3.0 Hong Kong License
Title Non-intrusive intelligibility prediction for Mandarin speech in noise Author(s) Chen, F; Guan, T Citation The 213 IEEE Region 1 Conference (TENCON 213), Xi'an, China, 22-25 October 213. In Conference
More informationVoice Activity Detection
Voice Activity Detection Speech Processing Tom Bäckström Aalto University October 2015 Introduction Voice activity detection (VAD) (or speech activity detection, or speech detection) refers to a class
More informationMULTICHANNEL systems are often used for
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 52, NO. 5, MAY 2004 1149 Multichannel Post-Filtering in Nonstationary Noise Environments Israel Cohen, Senior Member, IEEE Abstract In this paper, we present
More informationAdaptive Noise Reduction of Speech. Signals. Wenqing Jiang and Henrique Malvar. July Technical Report MSR-TR Microsoft Research
Adaptive Noise Reduction of Speech Signals Wenqing Jiang and Henrique Malvar July 2000 Technical Report MSR-TR-2000-86 Microsoft Research Microsoft Corporation One Microsoft Way Redmond, WA 98052 http://www.research.microsoft.com
More informationModulation Domain Spectral Subtraction for Speech Enhancement
Modulation Domain Spectral Subtraction for Speech Enhancement Author Paliwal, Kuldip, Schwerin, Belinda, Wojcicki, Kamil Published 9 Conference Title Proceedings of Interspeech 9 Copyright Statement 9
More informationCommunications Theory and Engineering
Communications Theory and Engineering Master's Degree in Electronic Engineering Sapienza University of Rome A.A. 2018-2019 Speech and telephone speech Based on a voice production model Parametric representation
More informationPERFORMANCE ANALYSIS OF SPEECH SIGNAL ENHANCEMENT TECHNIQUES FOR NOISY TAMIL SPEECH RECOGNITION
Journal of Engineering Science and Technology Vol. 12, No. 4 (2017) 972-986 School of Engineering, Taylor s University PERFORMANCE ANALYSIS OF SPEECH SIGNAL ENHANCEMENT TECHNIQUES FOR NOISY TAMIL SPEECH
More informationEnhancement of Speech Signal by Adaptation of Scales and Thresholds of Bionic Wavelet Transform Coefficients
ISSN (Print) : 232 3765 An ISO 3297: 27 Certified Organization Vol. 3, Special Issue 3, April 214 Paiyanoor-63 14, Tamil Nadu, India Enhancement of Speech Signal by Adaptation of Scales and Thresholds
More informationNarrow-Band Interference Rejection in DS/CDMA Systems Using Adaptive (QRD-LSL)-Based Nonlinear ACM Interpolators
374 IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 52, NO. 2, MARCH 2003 Narrow-Band Interference Rejection in DS/CDMA Systems Using Adaptive (QRD-LSL)-Based Nonlinear ACM Interpolators Jenq-Tay Yuan
More informationKONKANI SPEECH RECOGNITION USING HILBERT-HUANG TRANSFORM
KONKANI SPEECH RECOGNITION USING HILBERT-HUANG TRANSFORM Shruthi S Prabhu 1, Nayana C G 2, Ashwini B N 3, Dr. Parameshachari B D 4 Assistant Professor, Department of Telecommunication Engineering, GSSSIETW,
More informationDifferent Approaches of Spectral Subtraction method for Enhancing the Speech Signal in Noisy Environments
International Journal of Scientific & Engineering Research, Volume 2, Issue 5, May-2011 1 Different Approaches of Spectral Subtraction method for Enhancing the Speech Signal in Noisy Environments Anuradha
More informationPerformance analysis of voice activity detection algorithm for robust speech recognition system under different noisy environment
BABU et al: VOICE ACTIVITY DETECTION ALGORITHM FOR ROBUST SPEECH RECOGNITION SYSTEM Journal of Scientific & Industrial Research Vol. 69, July 2010, pp. 515-522 515 Performance analysis of voice activity
More information