ARTICLE IN PRESS. Signal Processing

Signal Processing 9 (2) 737-74
journal homepage: www.elsevier.com/locate/sigpro

Fast communication

Double-talk detection based on soft decision for acoustic echo suppression

Yun-Sik Park, Joon-Hyuk Chang
School of Electronic Engineering, Inha University, Incheon 42-75, Republic of Korea

Corresponding author: J.-H. Chang. Tel.: +82 32 86 7423; fax: +82 32 868 3654. E-mail address: changjh@inha.ac.kr.

Article history: Received 22 May 29. Received in revised form 22 September 29. Accepted 2 November 29. Available online 2 November 29.

Keywords: Double-talk detection; Speech presence probability; Voice activity detection

Abstract

In this paper, we propose a novel double-talk detection (DTD) technique based on a soft decision in the frequency domain. The proposed method provides an efficient procedure to detect the double-talk situation by using the global near-end speech presence probability (GNSPP) and voice activity detection (VAD) of the near-end and far-end signals. Specifically, the GNSPP is derived from a statistical model of speech and is employed to determine double-talk presence in a given frame. The performance of our approach is evaluated by objective tests under different environments, and the suggested method is found to yield better results than the conventional scheme.

© 29 Elsevier B.V. All rights reserved.

1. Introduction

In most hands-free mobile communication systems, acoustic echoes occur because the loudspeaker and microphone are acoustically coupled. To address this problem, numerous acoustic echo cancellation (AEC) techniques incorporating an adaptive filter, such as the least mean square (LMS) and normalized LMS (NLMS) filters, have been reported [1-3].
One of the major problems of AEC techniques, however, is that performance degrades significantly during double-talk periods, in which the near-end and far-end signals coexist, because the double-talk acts as a very large interference to the adaptive filter. The problem can be alleviated by freezing the adaptive filter coefficients through the use of a double-talk detection (DTD) algorithm [4]. Many studies have therefore been dedicated to the problem of DTD. In practice, cross-correlation and coherence-based algorithms are relevant, as they are straightforward approaches. Adopting hard decisions [4-6], these schemes classify each frame into one of two cases (double-talk or not) by comparing decision statistics with given threshold values. However, they are sensitive to the tuned parameters and do not always provide reliable performance under various conditions.

In this paper, we propose a novel DTD algorithm based on a global soft decision [7], where the term "global" means that DTD is performed globally in a given frame and "soft decision" [8,9] denotes that the probability of double-talk is introduced as the decision and is applied to update the adaptive filter in the acoustic echo suppressor (AES) algorithm [10]. Specifically, the global near-end speech presence probability (GNSPP), based on a statistical model, is computed in each frame and is applied to the proposed DTD algorithm in conjunction with the results of voice activity detection (VAD) of the near-end and far-end signals. It is worth noting that our approach provides, for the first time, an effective framework for DTD based on a soft decision by taking advantage of a statistical model, in contrast with the conventional hard decision-based methods. The performance of the proposed algorithm is evaluated by echo return loss enhancement (ERLE) and speech attenuation (SA) tests during double-talk and is shown to be better than that of the conventional method.

65-684/$ - see front matter © 29 Elsevier B.V. All rights reserved. doi:.6/j.sigpro.29..3

2. Global near-end speech presence probability

In this section, we consider how to derive the global near-end speech presence probability (GNSPP) in the frequency domain. To this end, we first assume two hypotheses, $H_0$ and $H_1$, which indicate near-end speech absence and presence, respectively:

$$
\begin{aligned}
H_0 &: \text{near-end speech absent}: \; \mathbf{Y}(i) = \mathbf{D}(i) \\
H_1 &: \text{near-end speech present}: \; \mathbf{Y}(i) = \mathbf{D}(i) + \mathbf{S}(i)
\end{aligned} \tag{1}
$$

where $\mathbf{D}(i) = [D(i,1), D(i,2), \ldots, D(i,M)]$, $\mathbf{S}(i) = [S(i,1), S(i,2), \ldots, S(i,M)]$ and $\mathbf{Y}(i) = [Y(i,1), Y(i,2), \ldots, Y(i,M)]$ represent the Fourier-domain spectra of the echo signal, the near-end speech and the microphone input signal, respectively, at frame index $i$. Also, $\mathbf{X}(i) = [X(i,1), X(i,2), \ldots, X(i,M)]$ denotes the Fourier spectrum of the far-end signal, as shown in Fig. 1. The background noise is not taken into account since we assume that near-end speech absence is not correlated with the background noise. Under the assumption that $D(i,k)$ and $S(i,k)$ are characterized by separate zero-mean complex Gaussian distributions, the following is obtained [7]:

$$ p(Y(i,k) \mid H_0) = \frac{1}{\pi \lambda_d(i,k)} \exp\left\{ -\frac{|Y(i,k)|^2}{\lambda_d(i,k)} \right\} \tag{2} $$

$$ p(Y(i,k) \mid H_1) = \frac{1}{\pi (\lambda_s(i,k) + \lambda_d(i,k))} \exp\left\{ -\frac{|Y(i,k)|^2}{\lambda_s(i,k) + \lambda_d(i,k)} \right\} \tag{3} $$

where $\lambda_s(i,k)$ and $\lambda_d(i,k)$ are the variances of the near-end speech and the estimated echo, respectively. Accordingly, the GNSPP $p(H_1 \mid \mathbf{Y}(i))$ is derived from Bayes' rule, such that [7]

$$ p(H_1 \mid \mathbf{Y}(i)) = \frac{p(\mathbf{Y}(i) \mid H_1)\, p(H_1)}{p(\mathbf{Y}(i) \mid H_0)\, p(H_0) + p(\mathbf{Y}(i) \mid H_1)\, p(H_1)} \tag{4} $$

where $p(H_0)\,(= 1 - p(H_1))$ represents the a priori probability of near-end speech absence.
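As a minimal sketch of Eqs. (2)-(4), the following Python fragment evaluates the two Gaussian likelihoods for a single frequency bin and combines them through Bayes' rule. The function names `bin_likelihoods` and `posterior_presence`, and all numerical values, are hypothetical and chosen only for illustration; applying Eq. (4) to one bin rather than to the whole frame is a simplification made here, not part of the method described above.

```python
import numpy as np

def bin_likelihoods(Y, lam_s, lam_d):
    """Eqs. (2)-(3): zero-mean complex Gaussian densities of one
    spectral component Y under H0 (echo only) and H1 (echo + speech)."""
    power = np.abs(Y) ** 2
    p_h0 = np.exp(-power / lam_d) / (np.pi * lam_d)
    lam_y = lam_s + lam_d  # variance of Y when near-end speech is present
    p_h1 = np.exp(-power / lam_y) / (np.pi * lam_y)
    return p_h0, p_h1

def posterior_presence(p_y_h0, p_y_h1, prior_h1):
    """Eq. (4): Bayes' rule, with p(H0) = 1 - p(H1)."""
    num = p_y_h1 * prior_h1
    return num / (p_y_h0 * (1.0 - prior_h1) + num)

# Illustrative (assumed) values: one bin with strong near-end speech.
p0, p1 = bin_likelihoods(2.0 + 1.0j, lam_s=4.0, lam_d=1.0)
posterior = posterior_presence(p0, p1, prior_h1=0.5)
```

When `lam_s` is zero the two likelihoods coincide and the posterior falls back to the prior, which is a quick sanity check on the formulas.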
Since the spectral components in the frequency bins are assumed to be statistically independent, (4) can be rewritten as [7]

$$ p(H_1 \mid \mathbf{Y}(i)) = \frac{p(H_1) \prod_{k=1}^{M} p(Y(i,k) \mid H_1)}{p(H_0) \prod_{k=1}^{M} p(Y(i,k) \mid H_0) + p(H_1) \prod_{k=1}^{M} p(Y(i,k) \mid H_1)} = \frac{q \prod_{k=1}^{M} \Lambda_k(Y(i,k))}{1 + q \prod_{k=1}^{M} \Lambda_k(Y(i,k))} \tag{5} $$

in which $q = p(H_1)/p(H_0)$ is determined by a rough estimate of the relative time durations of near-end speech presence and absence, and $\Lambda_k(Y(i,k))$ is the likelihood ratio computed in the $k$th frequency bin, as given by [7]

$$ \Lambda_k(Y(i,k)) = \frac{p(Y(i,k) \mid H_1)}{p(Y(i,k) \mid H_0)} = \frac{1}{1 + \xi(i,k)} \exp\left\{ \frac{\gamma(i,k)\, \xi(i,k)}{1 + \xi(i,k)} \right\} \tag{6} $$

where the a posteriori signal-to-echo ratio (SER) $\gamma(i,k)$ and the a priori SER $\xi(i,k)$ are defined by

$$ \gamma(i,k) \triangleq \frac{|Y(i,k)|^2}{\lambda_d(i,k)} \tag{7} $$

$$ \xi(i,k) \triangleq \frac{\lambda_s(i,k)}{\lambda_d(i,k)} \tag{8} $$

where $\lambda_d(i,k)$ is estimated by $\hat{\lambda}_d(i,k)$. The power spectrum of the echo signal is obtained in the absence of the near-end speech signal, as given by

$$ \hat{\lambda}_d(i,k) = \zeta_D\, \hat{\lambda}_d(i-1,k) + (1 - \zeta_D)\, |\hat{Y}(i,k)|^2 \tag{9} $$

in which $\zeta_D$ (= .93) is the smoothing parameter. Also, in (8), $\xi(i,k)$ is estimated with the help of the well-known decision-directed approach with $\alpha_{DD}$ = .6 [11]. Then,

$$ \hat{\xi}(i,k) = \alpha_{DD} \frac{|\hat{S}(i-1,k)|^2}{\lambda_d(i-1,k)} + (1 - \alpha_{DD})\, P\{\gamma(i,k) - 1\} \tag{10} $$

where $P\{z\} = z$ if $z \ge 0$, and $P\{z\} = 0$ otherwise. As specified in (9), robust estimation of the echo magnitude spectrum $|\hat{Y}(i,k)|$ plays an essential role in the performance. In our approach, we follow the parameter estimation procedure proposed in [10] as follows:

$$ |\hat{Y}(i,k)| = \hat{H}(i,k)\, |X(i,k)| \tag{11} $$

where $\hat{H}(i,k)$ is the estimate of the echo path response, mimicking the actual echo path. Specifically, $\hat{H}_{opt}(i,k)$ is obtained based on the magnitude of the least squares estimator as follows [10]:

$$ \hat{H}_{opt}(i,k) = \frac{E[X^{*}(i,k)\, Y(i,k)]}{E[X^{*}(i,k)\, X(i,k)]} \tag{12} $$

where $*$ denotes the complex conjugate and $E[\cdot]$ represents the expected value.
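The computation in Eqs. (5), (6) and (10) can be sketched in Python as below. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, the default `q = 1.0` is an assumed value, and the product of likelihood ratios is evaluated in the log domain (a numerical choice made here) so that many bins do not overflow.

```python
import numpy as np

def decision_directed_xi(S_prev, lam_d_prev, gamma, alpha_dd):
    """Eq. (10): decision-directed estimate of the a priori SER.
    P{z} = max(z, 0) keeps the instantaneous term non-negative."""
    return (alpha_dd * np.abs(S_prev) ** 2 / lam_d_prev
            + (1.0 - alpha_dd) * np.maximum(gamma - 1.0, 0.0))

def gnspp(gamma, xi, q=1.0):
    """Eqs. (5)-(6): global near-end speech presence probability.
    gamma, xi: arrays of a posteriori / a priori SER over the M bins.
    q: ratio p(H1)/p(H0); the default value is an assumption."""
    # log of Eq. (6) for every bin: gamma*xi/(1+xi) - log(1+xi)
    log_lr = gamma * xi / (1.0 + xi) - np.log1p(xi)
    z = np.log(q) + np.sum(log_lr)
    # q*prod(L) / (1 + q*prod(L)) is the logistic function of z;
    # the tanh form stays finite for large |z|.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

# Illustrative (assumed) frame of 16 bins dominated by near-end speech.
p_present = gnspp(np.full(16, 20.0), np.full(16, 10.0))
```

With `xi` equal to zero in every bin, each likelihood ratio is one and the result reduces to `q / (1 + q)`, i.e. the prior alone.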
Fig. 1. Block diagram of the proposed DTD algorithm. [Blocks: DFT and IDFT on the send and receive paths; microphone; loudspeaker; gain G; near-end and far-end VAD decisions; SER estimation; GNSPP; DTD; echo path response estimation.]

Note that there exists some delay between the far-end speech $X(i,k)$ and the microphone input signal $Y(i,k)$ (due to a digital amplifier, for example). In our approach, it is assumed that the echo time delay is separately estimated and compensated (i.e., no delay) at the near-end. Since the echo path is time varying, the estimated echo path response $\hat{H}(i,k)$ is obtained using the iterative procedure such that [10]

$$ \hat{H}(i,k) = \frac{C(i,k)}{R(i,k)} \tag{13} $$

where

$$ C(i,k) = \zeta_C\, C(i-1,k) + (1 - \zeta_C)\, |X^{*}(i,k)\, Y(i,k)| \tag{14} $$

$$ R(i,k) = \zeta_R\, R(i-1,k) + (1 - \zeta_R)\, |X^{*}(i,k)\, X(i,k)| \tag{15} $$

and $\zeta_C$ (= .998) and $\zeta_R$ (= .998) are smoothing parameters. Note that this update iteration allows changes in the room to be tracked.

3. Double-talk detection based AES

As noted earlier, the update of the echo path response must be frozen in the case of double-talk. For this, we propose a DTD technique that incorporates the newly derived GNSPP, $p(H_1 \mid \mathbf{Y}(i))$, with the help of the VAD results of the near-end and far-end signals, as shown in Fig. 1. We inherently consider near-end speech presence in the case of far-end signal presence, where the GNSPP substantially determines the double-talk situation and is used to update the echo path response based on (12). Note that the VAD has an impact on the near-end speech presence and the far-end speech presence only. Specifically, we derive a novel update routine for the echo path response by utilizing the soft decision as follows:

$$ \hat{H}(i,k) = \begin{cases} p(H_1 \mid \mathbf{Y}(i))\, \hat{H}(i-1,k) + (1 - p(H_1 \mid \mathbf{Y}(i)))\, \hat{H}_{opt}(i,k) & \text{if } I(\mathbf{Y}(i)) = 1 \text{ and } I(\mathbf{X}(i)) = 1 \\ \hat{H}_{opt}(i,k) & \text{otherwise} \end{cases} \tag{16} $$

where $I(\cdot)$ denotes an indicator function of the VAD result provided by the IS-127 noise suppression algorithm, which is known to give robust performance under various noise environments [12]. Furthermore, we modified the VAD algorithm to reduce false decisions. For example, $I(\mathbf{Y}(i)) = 1$ if the near-end signal $\mathbf{Y}(i)$ exists at the $i$th frame and $I(\mathbf{Y}(i)) = 0$ otherwise.
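The recursion of Eqs. (13)-(15) and the soft update of Eq. (16) can be illustrated with the following Python sketch. The function names, the small constant `eps` guarding the division, and the smoothing values used in the example are assumptions introduced here for illustration. The caller supplies the candidate response `H_opt`; how it is obtained in practice is left open in this sketch.

```python
import numpy as np

def track_echo_path(C_prev, R_prev, X, Y, zeta_c, zeta_r, eps=1e-12):
    """Eqs. (13)-(15): recursively smoothed cross- and auto-spectrum
    magnitudes and their ratio. eps is an added safeguard (assumption)."""
    C = zeta_c * C_prev + (1.0 - zeta_c) * np.abs(np.conj(X) * Y)
    R = zeta_r * R_prev + (1.0 - zeta_r) * np.abs(np.conj(X) * X)
    return C, R, C / (R + eps)

def soft_update(H_prev, H_opt, p_h1, near_active, far_active):
    """Eq. (16): blend the previous response and the new candidate
    with the GNSPP when both VAD indicators equal 1."""
    if near_active and far_active:
        return p_h1 * H_prev + (1.0 - p_h1) * H_opt
    return H_opt

# Illustrative (assumed) data: three bins, echo only, known path gains.
X = np.array([1.0 + 1.0j, 2.0 + 0.0j, 0.0 + 0.5j])
true_gain = np.array([0.5, 0.2, 0.8])
Y = true_gain * X
C, R, H_cand = track_echo_path(np.zeros(3), np.zeros(3), X, Y, 0.9, 0.9)
# Both talkers active and GNSPP close to 1: response is nearly frozen.
H_new = soft_update(np.full(3, 0.4), H_cand, 0.95, True, True)
```

A GNSPP of 1 reproduces the hard "freeze" decision and a GNSPP of 0 reproduces the unconditional update, so the hard-decision detector appears as the two end points of the soft rule.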
Therefore, within double-talk regions, $\hat{H}(i,k)$ in each frequency bin retains $\hat{H}(i-1,k)$ (i.e., no update), whereas in the case of single-talk it takes $\hat{H}_{opt}(i,k)$ as specified in (12). In particular, during abrupt transitions between double-talk and single-talk, as shown in Fig. 2, the GNSPP can take a soft value between 0 and 1. This accounts for why the soft decision scheme is less sensitive to detection errors than the conventional hard decision methods.

Fig. 2. DTD results for the acoustic echo signal under the vehicular noise condition (SNR = 2 dB). [Panels: far-end echo signal; near-end speech signal; microphone input signal; $p(H_1 \mid \mathbf{Y}(i))$ with the far-end echo, double-talk and near-end speech regions marked. Horizontal axis: time (s).]

Based on the proposed DTD method, we finally apply it to the AES algorithm proposed by Faller et al. [10] as follows:

$$ \hat{S}(i,k) = G(i,k)\, Y(i,k) \tag{17} $$

where the Wiener filter gain $G(i,k)$ is given by [10]

$$ G(i,k) = \left[ \frac{\max(|Y(i,k)| - |\hat{Y}(i,k)|,\, 0)}{|Y(i,k)|} \right] \tag{18} $$

4. Experiments and results

In order to verify the performance of the proposed DTD algorithm, we conducted objective comparison experiments under various noise conditions. Twenty test phrases, spoken by seven speakers and sampled at 8 kHz, were used as the experimental data. For assessing the performance of the proposed method, we artificially created 2 data files, where each file was produced by mixing the far-end signal with the near-end signal. Each frame of the windowed signal was transformed into its corresponding spectrum through a 28-point DFT after zero padding. We then constructed 6 frequency bands by combining subbands to cover the full frequency range (0-4 kHz) of the narrow-band speech signal, which is analogous to the procedure of the IS-127 noise suppression algorithm [12]. The far-end speech signal was passed through a filter simulating the acoustic echo path, modeled by a time-invariant FIR filter based on the analysis of room acoustics, before being mixed electrically [13,14]. The simulation environment was designed to fit a small office room with a size of 5 × 4 × 3 m^3. The echo level measured at the input microphone was 3.5 dB lower than that of the input near-end speech on average. In order to create noisy conditions, white, babble and vehicular noises from the NOISEX-92 database were added to clean near-end speech signals at signal-to-noise ratios (SNRs) of , 2, and 3 dB.

For the purpose of an objective comparison, we evaluated the performance of the proposed scheme and that of the conventional DTD algorithm proposed by Park et al. [6], in which a double-talk detection method based on cross-correlation coefficients is used. The performance of the two approaches was measured in terms of echo return loss enhancement (ERLE) and speech attenuation (SA) during double-talk, which are defined by [14]:

$$ \mathrm{ERLE}(t) = 10 \log_{10} \frac{E[y^2(t)]}{E[e^2(t)]} \tag{19} $$

$$ \mathrm{SA}(t) = \frac{1}{N} \sum_{t=1}^{N} 10 \log_{10} \frac{E[s^2(t)]}{E[\tilde{s}^2(t)]} \tag{20} $$

where $t$ is a sample index, $N$ is the number of samples during the double-talk periods and $\tilde{s}(t)$ denotes the near-end speech component in the output signal $e(t)$. For each of the three types of noise environments, the ERLE and SA scores were averaged to give the final mean scores presented in Tables 1 and 2. From Table 1, which reports the results in the single echo path case, it is evident that in most noisy conditions the proposed DTD algorithm based on a soft decision yielded a lower SA than the conventional hard decision-based technique while maintaining an ERLE during single-talk similar to that of the conventional method. Also, from Table 2, which shows the results in the case of echo path changes with different room sizes [13], we can observe that the SAs (measured during double-talk periods) of the proposed scheme based on a soft decision were better than those of the previous method [6] in all the tested conditions. It is noted that the performance gain of the proposed method becomes smaller as the SNR becomes lower. This is attributed to the imperfection of the GNSPP under adverse noise conditions. Summarizing the overall results, the proposed approach is found to be effective in the AES technique.

Table 1
ERLE during single-talk and SA during double-talk: test results obtained from the proposed DTD algorithm based on a soft decision and those yielded by the conventional hard decision method, with no changes of the echo path.

Environments    SNR (dB)    ERLE (dB)           SA (dB)
White                       4.27     4.256      .75      .64
                2           6.8      6.69       .99      .864
                3           6.853    6.853      2.28     .9
Babble                      4.22     4.88       .89      .7
                2           6.8      6.5        .998     .872
                3           6.828    6.83       2.28     .9
Vehicle                     3.649    3.649      .557     .469
                2           5.85     5.849      .958     .833
                3           6.763    6.688      2.24     .895
Clean speech                6.968    6.967      2.34     .95

Table 2
ERLE during single-talk and SA during double-talk: test results obtained from the proposed DTD algorithm based on a soft decision and those yielded by the conventional hard decision method, with changes of the echo path.

Environments    SNR (dB)    ERLE (dB)           SA (dB)
White                       5.48     5.44       .783     .738
                2           8.9      8.5        2.85     2.4
                3           8.922    8.97       2.4      2.99
Babble                      6.82     6.76       .926     .884
                2           8.338    8.335      2.24     2.82
                3           8.968    8.968      2.49     2.7
Vehicle                     4.374    4.373      .79      .685
                2           7.384    7.383      2.85     2.44
                3           8.743    8.742      2.49     2.7
Clean speech                9.9      9.8        2.5      2.8

5. Conclusions

In this paper, we have proposed a novel DTD algorithm based on a soft decision scheme in the frequency domain. The GNSPP, based on a statistical model of the near-end and far-end signals, is applied to the DTD algorithm in conjunction with VAD decisions for effective echo suppression. The performance of the proposed algorithm has been found to be superior to that of the conventional technique through objective evaluation tests.

Acknowledgements

This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-29-C9-92-), and by a National Research Foundation of Korea (NRF) Grant funded by the Korean Government (MEST) (NRF-29-8562).

References

[1] P.S.R. Diniz, Adaptive Filtering: Algorithm and Practical Implementation, Kluwer, Norwell, MA, 1997.
[2] C. Avendano, Acoustic echo suppression in the STFT domain, in: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 2.
[3] H. Ye, B.X. Wu, A new double-talk detection algorithm based on the orthogonality theorem, IEEE Trans. Commun. 39 (November 99) 542-545.
[4] T. Gänsler, M. Hansson, C.J. Ivarsson, A double-talk detector based on coherence, IEEE Trans. Commun. 44 (November 1996) 42-427.
[5] J. Benesty, D.R. Morgan, J.H. Cho, A new class of doubletalk detectors based on cross-correlation, IEEE Trans. Speech Audio Process. 8 () (March 2) 68-72.
[6] S.J. Park, C.G. Cho, C. Lee, D.H. Youn, Integrated echo and noise canceler for hands-free applications, IEEE Trans. Circuits Syst. II 49 (3) (March 22) 86-95.
[7] N.S. Kim, J.-H. Chang, Spectral enhancement based on global soft decision, IEEE Signal Process. Lett. 7 (5) (May 2) 8.
[8] Y.-S. Park, J.-H. Chang, A novel approach to a robust a priori SNR estimator in speech enhancement, IEICE Trans. Commun. E9-B (8) (August 27) 282-285.
[9] Y.-S. Park, J.-H. Chang, A probabilistic combination method of minimum statistics and soft decision for robust noise power estimation in speech enhancement, IEEE Signal Process. Lett. 5 (5) (January 28) 95-98.
[10] C. Faller, C. Tournery, Robust echo control using a simple echo path model, in: Proceedings of the IEEE International Conference on Acoustics, Speech Signal Processing, vol. 5, 26, pp. V28-V284.
[11] Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator, IEEE Trans. Acoust. Speech Signal Process. ASSP-32 (6) (December 1984) 9-2.
[12] TIA/EIA/IS-127, Enhanced variable rate codec, speech service option 3 for wideband spread spectrum digital systems, 1996.
[13] S. McGovern, A Model for Room Acoustics, 23 [Online]. Available: http://2pi.us/rir.html.
[14] S.Y. Lee, N.S. Kim, A statistical model based residual echo suppression, IEEE Signal Process. Lett. 4 () (October 27) 758-76.