IMPROVEMENT OF SPEECH SOURCE LOCALIZATION IN NOISY ENVIRONMENT USING OVERCOMPLETE RATIONAL-DILATION WAVELET TRANSFORMS

Di Liu, Andy W. H. Khong
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
{LIUDI,

International Conference on Cyberworlds

Abstract

The generalized cross-correlation using the phase transform prefilter remains popular for the estimation of time-differences-of-arrival. However, it is not robust to noise and, as a consequence, the performance of direction-of-arrival algorithms is often degraded under low signal-to-noise conditions. We propose to address this problem through the use of a wavelet-based speech enhancement technique, since the wavelet transform can achieve good denoising performance. The overcomplete rational-dilation wavelet transform is exploited to process speech signals effectively due to its higher frequency resolution. In addition, we exploit the joint distribution of the speech in the wavelet domain and develop a novel local noise variance estimator based on the bivariate shrinkage function. As will be shown, our proposed algorithm achieves good direction-of-arrival performance in the presence of noise.

Keywords-denoising, wavelet, speech source localization, DOA estimation

I. INTRODUCTION

Research into speech source localization has received much attention for cyberworld applications including automatic camera steering, online video surveillance and speaker tracking. One of the widely adopted approaches for speech source localization is the generalized cross-correlation (GCC) based time-differences-of-arrival (TDOA) estimation algorithm [1]. This algorithm computes the interchannel delays by locating the maximum of the weighted cross-correlation between each pair of received signals. While many different prefilters can be applied, the heuristic-based phase transform (PHAT) prefilter has been found to perform very well under practical conditions [2].
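To make the GCC-PHAT idea concrete, the following is a minimal NumPy sketch of delay estimation between two microphone signals. It is an illustration under stated assumptions rather than the implementation used in this work: the function name `gcc_phat`, the zero-padding length, the small regularization constant and the optional `max_tau` search limit are all hypothetical choices.

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the delay (in seconds) of x2 relative to x1 using GCC-PHAT.

    A positive result means x2 lags x1. All parameter names are illustrative.
    """
    n = len(x1) + len(x2)                  # zero-pad so the correlation is linear, not circular
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X2 * np.conj(X1)               # cross-power spectrum
    cross /= np.abs(cross) + 1e-12         # PHAT prefilter: keep the phase, discard the magnitude
    cc = np.fft.irfft(cross, n=n)          # weighted cross-correlation
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so that lags run from -max_shift to +max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(cc)) - max_shift # lag of the correlation peak, in samples
    return shift / fs
```

Restricting the search with `max_tau` to the largest physically possible delay (microphone spacing divided by the speed of sound) is a common way to reject spurious peaks.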
As reported in [2], the PHAT prefilter is optimal in the maximum likelihood (ML) sense in the presence of reverberation. However, this prefilter is not robust under low signal-to-noise ratio (SNR) conditions and, as a result, the performance of direction-of-arrival (DOA) estimation algorithms degrades with reducing SNR. Figure 1 shows an illustrative example of this degradation, where both the mean and the standard deviation of the bearing error increase as the SNR is reduced. As can be seen, the degradation in DOA estimation performance becomes more pronounced at lower SNR.

[Figure 1. Variation of the mean and standard deviation of the bearing error against SNR for DOA estimation using the PHAT-GCC algorithm. Axes: bearing error (degree) versus SNR (dB).]

This work is supported by the Singapore National Research Foundation Interactive Digital Media R&D Program, under research grant NRF8IDM-IDM4-1.

A common approach to this problem is to preprocess the noisy signals by denoising. Although speech denoising has been an active area of research, these efforts have mainly been focused on improving the subjective quality or intelligibility of the speech. In this work, however, we focus on denoising with the aim of improving the performance of DOA estimation. It has been shown that wavelet-based methods have become an important tool to address the difficult problem of denoising [3], [4]. This is achieved by taking advantage of the sparseness of signals in the wavelet domain. In this work, we propose to incorporate such wavelet denoising techniques to improve DOA performance in the presence of noise. The wavelet-based denoising algorithm consists of three steps: 1) computing the wavelet transform (WT) of the noisy signal, 2) modifying the noisy wavelet coefficients and 3) computing the inverse WT using the modified wavelet coefficients. It is therefore important, in this work, to determine the type of wavelet transform and the threshold selection method in order to achieve good DOA estimation.
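The three steps can be sketched with a deliberately simple stand-in. The example below uses a one-level Haar transform and soft thresholding, which are NOT the rational-dilation WT and bivariate shrinkage adopted in this paper; they are swapped in only because they fit in a few lines. All function names are hypothetical, and the input is assumed to have even length.

```python
import numpy as np

def haar_forward(x):
    """One-level Haar analysis: returns (approximation, detail) coefficients."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def haar_inverse(approx, detail):
    """One-level Haar synthesis: exact inverse of haar_forward."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

def soft_threshold(d, t):
    """Shrink coefficients toward zero by t; coefficients below t become zero."""
    return np.sign(d) * np.maximum(np.abs(d) - t, 0.0)

def denoise_three_steps(noisy, threshold):
    approx, detail = haar_forward(noisy)        # 1) wavelet transform of the noisy signal
    detail = soft_threshold(detail, threshold)  # 2) modify the noisy wavelet coefficients
    return haar_inverse(approx, detail)         # 3) inverse transform of the modified coefficients
```

With `threshold = 0` the signal is reconstructed exactly, which is a quick check that steps 1 and 3 form a valid transform pair.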
We note that the speech and noise signals can be better separated if an appropriate transform is selected. The overcomplete rational-dilation WT [5] is a recent enhancement in which the frequency resolution can be varied. Because the speech spectrum varies significantly across frequency bands, the rational-dilation WT with high frequency resolution can be effective for processing speech in the wavelet domain. In contrast, the poor frequency resolution of the dyadic WT limits its effectiveness for analyzing signals that are quasi-periodic in nature, including speech, electroencephalograms and signals arising from mechanical vibrations [6].

In addition, among the variety of nonlinear thresholding rules for wavelet-based denoising, bivariate shrinkage thresholding [7] can improve SNR performance significantly. This is achieved by taking into account the statistical dependencies between wavelet coefficients and their parents using Bayesian estimation theory. As a priori knowledge, we will discuss the joint distribution of wavelet coefficients for a typical speech signal. In addition, we show that direct application of existing approaches does not address the noise robustness issue. This thresholding requires a noise variance estimator, which will be computed locally for each frequency subband, making it suited to the distribution of the speech spectrum.

[Figure 2. Analysis and synthesis filter banks for the implementation of the rational-dilation wavelet transform (after [5]).]

II. REVIEW OF OVERCOMPLETE RATIONAL-DILATION WAVELET TRANSFORMS

The overcomplete rational-dilation WTs [5] form a class of WTs with constant quality (Q)-factor, where the Q-factor of a band-pass filter is the ratio of its center frequency to its bandwidth. We note that WTs with high Q-factors are desirable for processing quasi-periodic signals such as speech due to their higher frequency resolution compared to the dyadic WT with its low Q-factor. The iterated filter banks shown in Fig. 2 can be used to implement rational-dilation WTs [5], where p is an upsampling factor, q and s are downsampling factors and q/p is the rational dilation factor. These parameters affect the Q-factor, the redundancy of the WT and the time-bandwidth product; for a given q/p, there is often a trade-off between the Q-factor and the time-bandwidth product.
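As a small worked example of how p, q and s interact, the sketch below computes the dilation factor and the redundancy implied by the sampling rates of the iterated filter bank. It assumes that the low-pass branch changes the sample rate by p/q, that the high-pass branch changes it by 1/s, and that the number of stages is large enough for the final low-pass residual to be negligible, so the total coefficient count is the geometric series (1/s)(1 + p/q + (p/q)^2 + ...). The function names are hypothetical.

```python
from fractions import Fraction

def dilation_factor(p, q):
    """Rational dilation factor q/p between adjacent scales."""
    return Fraction(q, p)

def redundancy(p, q, s):
    """Coefficients per input sample for an iterated filter bank.

    Each stage keeps 1/s of its input rate in the high-pass branch and passes
    p/q of it on to the next stage; summing the geometric series gives
    (1/s) / (1 - p/q). Requires p < q.
    """
    low_rate = Fraction(p, q)
    return Fraction(1, s) / (1 - low_rate)

# Illustrative values: p = 9, q = 10, s = 5.
print(float(dilation_factor(9, 10)))  # about 1.11
print(redundancy(9, 10, 5))           # 2
print(redundancy(1, 2, 2))            # 1, i.e. the critically sampled dyadic case
```

Exact fractions are used so that the rational nature of the dilation factor is preserved and no rounding enters the comparison.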
One generally requires higher frequency resolution when analyzing or filtering quasi-periodic signals such as speech. In this work, we set p = 9, q = 10, s = 5, giving a dilation factor of q/p ≈ 1.11. Figure 3 illustrates the corresponding frequency responses of the iterated filter bank and the wavelets. As can be seen from these figures, good time-frequency localization is achieved, with more band-pass filters covering the same frequency range. In addition, these parameters give rise to a high Q-factor and avoid ringing with a modest redundancy of less than 3. This WT, set with higher frequency resolution, can better separate the speech and noise signals. In addition, the noise reduction filter on each subband can be manipulated independently, which in turn determines the amount of noise reduction in each subband.

[Figure 3. Frequency response and wavelets at several scales (after [5]). Panel axes: frequency (cycles/sample); subband versus time (samples).]

III. WAVELET-BASED SPEECH DENOISING FOR DIRECTION-OF-ARRIVAL ESTIMATION

To describe the wavelet-based denoising problem for speech, we define $\omega_k(j)$ to be the kth wavelet coefficient in the high-pass (H) subband of scale j, where j = 1, ..., J denotes the wavelet scale index and k = 1, ..., K denotes the wavelet coefficient index. Here, J denotes the total number of wavelet scales and K denotes the total number of wavelet coefficients in each scale after resizing. We next define $y_k(j)$ as the noisy observation of $\omega_k(j)$ and $n_k(j)$ as the additive noise, giving $y_k(j) = \omega_k(j) + n_k(j)$. We also note that $\omega_k(j+1)$ is the wavelet coefficient at the next coarser scale to $\omega_k(j)$, and we therefore say that $\omega_k(j+1)$ is the parent of $\omega_k(j)$. Treating these as statistical processes, we define $W_k(j)$, $Y_k(j)$ and $N_k(j)$ as the random variables of $\omega_k(j)$, $y_k(j)$ and $n_k(j)$, respectively.
Using this notation, we can write

$$\mathbf{y} = \mathbf{w} + \mathbf{n}, \qquad (1)$$

where $\mathbf{w} = [W_k(j), W_k(j+1)]^T$, $\mathbf{y} = [Y_k(j), Y_k(j+1)]^T$ and $\mathbf{n} = [N_k(j), N_k(j+1)]^T$. Taking into account the statistical dependency between wavelet coefficients at adjacent scales and employing the maximum a posteriori (MAP) estimator, we can estimate $\mathbf{w}$ of the clean speech given the noisy observation $\mathbf{y}$ using

$$\hat{\mathbf{w}}(\mathbf{y}) = \arg\max_{\mathbf{w}} \left[ p_n(\mathbf{y} - \mathbf{w})\, p_w(\mathbf{w}) \right], \qquad (2)$$

where $p_n(\cdot)$ and $p_w(\cdot)$ are the joint probability density functions (pdfs) of $\mathbf{n}$ and $\mathbf{w}$, respectively. Hence, to estimate the clean wavelets $\hat{\mathbf{w}}(\mathbf{y})$ using (2), both $p_w(\mathbf{w})$ and $p_n(\mathbf{n})$ must be specified. Here, the noise is assumed to be i.i.d. white Gaussian and we can express the noise pdf as

$$p_n(\mathbf{n}) = \frac{1}{2\pi\sigma_n^2} \exp\left( -\frac{N_k(j)^2 + N_k(j+1)^2}{2\sigma_n^2} \right), \qquad (3)$$

where $\sigma_n^2$ is the variance of the additive noise.

[Figure 4. Empirical joint parent-child histogram of wavelet coefficients from the speech signal database, and the bivariate pdf (4) used as the joint pdf of parent-child wavelet coefficient pairs. Axes: parent versus child.]

A. Bivariate shrinkage thresholding for speech signal

It is therefore important to determine an analytical expression for the joint pdf that models the wavelet distribution of typical speech. The empirical joint child-parent histogram can then be used to estimate $p_w(\mathbf{w})$. As presented in [7], a possible pdf model is given by

$$p_w(\mathbf{w}) = \frac{3}{2\pi\sigma_\omega^2} \exp\left( -\frac{\sqrt{3}}{\sigma_\omega} \sqrt{W_k(j)^2 + W_k(j+1)^2} \right), \qquad (4)$$

where $\sigma_\omega^2$ is defined as the variance of the clean speech wavelet. To evaluate whether this pdf model is suitable for speech signals, we performed the overcomplete rational-dilation WT as described in Section II, using q/p = 10/9 and s = 5, for a set of 3 speech signals extracted from the NOIZEUS database [8]. The joint histogram between $W_k(j)$ and $W_k(j+1)$ and the joint pdf model defined in (4) are both plotted in Fig. 4. Comparing the two plots, we note the close similarity between the analytical expression given by (4) and the empirical distribution of the speech signals. We therefore propose to employ (4) for the estimation of $p_w(\mathbf{w})$. Substituting (3) and (4) into (2), the MAP estimator in (2) can be rewritten as [7]

$$\hat{W}_k(j) = \frac{\left( \sqrt{Y_k(j)^2 + Y_k(j+1)^2} - \frac{\sqrt{3}\,\sigma_n^2}{\sigma_\omega} \right)_+}{\sqrt{Y_k(j)^2 + Y_k(j+1)^2}}\; Y_k(j), \qquad (5)$$

where the function $(g)_+$ in the numerator is defined as

$$(g)_+ = \begin{cases} 0 & \text{if } g < 0, \\ g & \text{otherwise.} \end{cases} \qquad (6)$$

This is the bivariate shrinkage function in each wavelet scale used for speech denoising.

B. Variance estimation for thresholding

Considering the wavelet shrinkage function in (5), we define $T = \sqrt{3}\,\sigma_n^2 / \sigma_\omega$ as the threshold. It is therefore essential to estimate the noise variance $\sigma_n^2$ and the wavelet variance $\sigma_\omega^2$ for each wavelet scale. In our algorithm, $\sigma_\omega$ can be estimated as

$$\hat{\sigma}_\omega = \sqrt{\left( \hat{\sigma}_y^2 - \hat{\sigma}_n^2 \right)_+}, \qquad (7)$$

where $\sigma_y^2$ is the variance of the noisy wavelets.
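The shrinkage rule in (5) and (6) translates almost directly into array code. The sketch below is one possible NumPy rendering, assuming the child and parent coefficients have already been resized to the same length; the function name `bivariate_shrink` and the small constants guarding against division by zero are illustrative additions, not part of the original formulation.

```python
import numpy as np

def bivariate_shrink(y, y_parent, sigma_n, sigma_w):
    """Bivariate shrinkage of child coefficients y given their parents.

    y, y_parent : noisy child and parent coefficients (same shape)
    sigma_n     : noise standard deviation
    sigma_w     : clean-signal standard deviation (scalar or per coefficient)
    """
    y = np.asarray(y, dtype=float)
    y_parent = np.asarray(y_parent, dtype=float)
    magnitude = np.sqrt(y ** 2 + y_parent ** 2)
    # Threshold T = sqrt(3) * sigma_n^2 / sigma_w; a zero sigma_w gives a huge T.
    threshold = np.sqrt(3.0) * sigma_n ** 2 / np.maximum(sigma_w, 1e-12)
    # (g)_+ keeps g when it is non-negative and returns zero otherwise.
    gain = np.maximum(magnitude - threshold, 0.0) / np.maximum(magnitude, 1e-12)
    return gain * y
```

For example, a child of 3 with a parent of 4 has joint magnitude 5; with a threshold of 1 the gain is 4/5 and the estimate is 2.4, whereas a pair whose joint magnitude lies below the threshold is set to zero.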
If one assumes that $Y_k(j)$ has a Gaussian distribution, $\sigma_y^2$ for the kth coefficient in each wavelet scale j can be estimated in the ML sense using the coefficients in the neighborhood B(k),

$$\hat{\sigma}_y^2 = \frac{1}{M} \sum_{y_k(j) \in B(k)} y_k(j)^2, \qquad (8)$$

where M is the size of the neighborhood B(k) and B(k) is defined as all coefficients within a window that is centered at the kth coefficient.

Although a typical speech signal occupies a wide frequency spectrum, its significant energy is confined to a limited frequency range. The wavelets in the finest scale correspond to the highest frequency subband, denoted as $H_1$, and do not contain significant speech content. This assumption is valid since we utilize the high frequency resolution of the given rational-dilation WT. In addition, we assume that the noise is white with equivalent energy throughout the whole frequency band and, as a result, $y(H_1) \approx n(H_1)$. We can therefore estimate the overall noise variance from the finest-scale wavelet coefficients, and a robust median estimator for the noise is [9]

$$\hat{\sigma}_n = \frac{\mathrm{median}\left( |y_k(1)| \right)}{0.6745}, \qquad y_k(1) \in \text{subband } H_1. \qquad (9)$$

We note that direct application of (9) is not suitable for our DOA application. Simulations using (9) exhibit a degradation in DOA performance, and the bearing errors are sensitive to the noise variance estimate. This is because the energy of the speech spectrum varies significantly across different scales. A poor noise estimate can therefore result in an inappropriate threshold T, which in turn can lead to additional unwanted high-frequency noise components. In view of the above, we should consider the degree of shrinkage for the wavelets of the speech signals, and we propose that the new estimator $\hat{\sigma}_n$ be given as

$$\hat{\sigma}_n = \frac{\mathrm{median}\left( |y_k(1)| \right)}{c}, \qquad y_k(1) \in \text{subband } H_1. \qquad (10)$$

The performance of the DOA estimation algorithm is therefore dependent on the choice of c.

C. Factor c selection

We determine a suitable value of c that gives rise to good DOA performance.
This can be achieved empirically by studying how the bearing error varies with c across different speech signals under different SNR conditions. We first perform denoising using (10), (8) and (5) for 3 speech signals extracted from the NOIZEUS database [8]. The DOA of the denoised speech is subsequently estimated using GCC-PHAT. Figure 5 shows the variation of the bearing error with c for two SNR conditions. As can be seen, the bearing error first reduces with c, after which it increases modestly. Accordingly, c = 1 is a good choice, i.e.,

$$\hat{\sigma}_n = \frac{\mathrm{median}\left( |y_k(1)| \right)}{1}, \qquad y_k(1) \in \text{subband } H_1. \qquad (11)$$
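The variance estimates in (7), (8) and (10) can be sketched as follows. This is a hedged illustration: the function names, the default window length of 7 coefficients and the use of zero padding at the subband edges are assumptions made for the example, since the text does not specify the neighborhood size M or the boundary handling.

```python
import numpy as np

def noise_sigma(finest_coeffs, c=0.6745):
    """Median-based noise estimate from the finest subband H_1, as in (10).

    c = 0.6745 recovers the classical robust estimator in (9); other values
    of c rescale the estimate and hence the shrinkage threshold.
    """
    return np.median(np.abs(finest_coeffs)) / c

def local_signal_sigma(coeffs, sigma_n, window=7):
    """Per-coefficient clean-signal standard deviation from (8) and (7).

    The local noisy variance is the mean of squared coefficients in a window
    centered on each coefficient. Edges are zero-padded, so values within
    window // 2 of either end are underestimated.
    """
    y2 = np.asarray(coeffs, dtype=float) ** 2
    kernel = np.ones(window) / window
    var_y = np.convolve(y2, kernel, mode="same")          # local variance (8)
    return np.sqrt(np.maximum(var_y - sigma_n ** 2, 0.0)) # signal std (7)
```

Because the estimate is divided by c, a smaller c yields a larger noise estimate and therefore a higher threshold T in the shrinkage rule.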

[Figure 5. Variation of the mean bearing error with c for two SNR conditions. Axes: bearing error (degree) versus c.]

[Figure 6. Variation of the mean and standard deviation of the bearing error with SNR for different values of the factor c(1). Legend entries include c(1) = .7, c(1) = .3 and c(1) = .5.]

Additional simulations show similar variation for different SNR conditions. We propose to further improve the performance of DOA estimation through a level-dependent factor c(j). We achieve this by noting that the ratio between clean and noisy signals is different in each scale, so that each scale may be processed independently in order to estimate its noise variance. We determine a good choice of c(j) empirically for realistic applications through an iterative procedure, first initializing c(j) = 1 for j = 2, ..., J. The value of c(1) is then set to the value which gives rise to the lowest DOA error using the GCC-PHAT algorithm. The value of c(j + 1) is subsequently obtained in a similar manner after finding the c(j) that gives rise to the lowest DOA error. The same process is then applied to 3 speech signals from the NOIZEUS database [8] under different SNRs. Experiments conducted in this manner reveal that the performance of GCC-PHAT after denoising is relatively insensitive to c(j), j = 2, ..., J, under different SNR conditions, and that c(j) = 1 can be considered a good choice for j = 2, ..., J.

Figure 6 shows the variation of the mean and standard deviation of the bearing errors with SNR for different values of c(1). We note that the choice of c(1) affects the DOA performance under different SNR conditions. This can occur since, for the finest wavelet scale, corresponding to the highest frequency subband, noise is expected to dominate the signal component under low SNR. Therefore, compared with other scales, the noise energy in scale 1 is more significant relative to the energy of the clean wavelet. Hence, one should set a higher threshold for the finest scale.
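The iterative, scale-by-scale search for c(j) described above can be sketched as a greedy coordinate search. The callable `doa_error` is hypothetical: it stands for a routine that denoises the recordings with a given vector of factors, runs GCC-PHAT and returns the mean bearing error. The candidate grid is likewise an assumption made for the example.

```python
import numpy as np

def select_scale_factors(doa_error, num_scales, candidates):
    """Greedy search for level-dependent factors c(1), ..., c(J).

    doa_error  : callable mapping a length-J array of factors to a scalar error
    num_scales : J, the number of wavelet scales
    candidates : values of c tried at each scale
    """
    c = np.ones(num_scales)            # initialize every factor to 1
    for j in range(num_scales):        # finest scale first, then coarser ones
        errors = []
        for value in candidates:
            trial = c.copy()
            trial[j] = value           # vary only scale j, keep the rest fixed
            errors.append(doa_error(trial))
        c[j] = candidates[int(np.argmin(errors))]
    return c

# Toy stand-in for the DOA error, with its minimum at c(1) = 0.3 and c(j) = 1 elsewhere.
def toy_error(c):
    return (c[0] - 0.3) ** 2 + np.sum((c[1:] - 1.0) ** 2)

print(select_scale_factors(toy_error, 4, [0.3, 0.5, 0.7, 1.0]))
```

A greedy search of this kind evaluates J times the number of candidates rather than every combination, which matters when each evaluation involves a full set of localization trials.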
As can be seen, a good choice for c(1) that gives rise to good DOA performance for the GCC-PHAT is c(1) = .3 across the SNRs considered. In addition, we note that, for c(1) = .7, a low mean bearing error can be achieved while its standard deviation is modestly high compared to the case when c(1) = .3. We therefore conclude that c(1) = .3 and c(j) = 1, j = 2, ..., J, are good choices for DOA estimation.

Although a good choice of c(1) is given by .3, we further provide a means of estimating the SNR so that c(1) can be determined based on Fig. 6. We first define $\gamma_w(j)$, $\gamma_y(j)$ and $\gamma_n(j)$ as the energy at scale j of the clean signal wavelets, the received signal wavelets and the noise, respectively. We next define

$$r_w(j) = \frac{\gamma_w(j)}{\sum_{i=1}^{J} \gamma_w(i)}, \qquad r_y(j) = \frac{\gamma_y(j)}{\sum_{i=1}^{J} \gamma_y(i)}, \qquad r_n(j) = \frac{\gamma_n(j)}{\sum_{i=1}^{J} \gamma_n(i)}$$

as the energy ratios for wavelets corresponding to the clean, received and noise signals. Since energy in the wavelet domain is equivalent to the time-domain energy, the SNR can be computed by

$$\mathrm{SNR} = 10 \log_{10} \left( \frac{\sum_{j=1}^{J} \gamma_w(j)}{\sum_{j=1}^{J} \gamma_n(j)} \right). \qquad (12)$$

The ratio $r_y(j)$ can be obtained using

$$r_y(j) = \frac{\gamma_y(j)}{\sum_{i=1}^{J} \gamma_y(i)} = \frac{\gamma_w(j) + \gamma_n(j)}{\sum_{i=1}^{J} \gamma_y(i)} = \frac{r_w(j) \left( \sum_{i=1}^{J} \gamma_y(i) - \sum_{i=1}^{J} \gamma_n(i) \right) + r_n(j) \sum_{i=1}^{J} \gamma_n(i)}{\sum_{i=1}^{J} \gamma_y(i)}, \qquad (13)$$

from which we obtain

$$r_y(j) = r_w(j) + \beta \left( r_n(j) - r_w(j) \right), \qquad (14)$$

where

$$\beta = \frac{\sum_{j=1}^{J} \gamma_n(j)}{\sum_{j=1}^{J} \gamma_y(j)}. \qquad (15)$$

We note that when the number of decomposition levels J is large, the signal energy in the coarsest scale approaches zero, i.e., $r_w(J) \approx 0$. Hence, (14) can be rewritten as $r_y(J) = \beta\, r_n(J)$ and $\beta$ in (15) can then be expressed as

$$\beta = \frac{r_y(J)}{r_n(J)}. \qquad (16)$$

Since white Gaussian noise should have a constant energy ratio across the scales, $r_n(j)$ can be computed for a given WT. By using (15), (16) and (12), the SNR can now be rewritten as

$$\mathrm{SNR} = 10 \log_{10} \left( \frac{1 - \beta}{\beta} \right) \ \text{dB}, \qquad (17)$$

from which we can now select a value of c(1) based on Fig. 6.
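The SNR estimate in (16) and (17) needs only the per-scale energies of the received signal and the known noise energy ratios of the transform. A minimal sketch follows, assuming the scales are stored finest first so that the last entry corresponds to the coarsest scale J; the function name and the clipping of β to the open interval (0, 1) are illustrative safeguards rather than part of the derivation.

```python
import numpy as np

def estimate_snr_db(gamma_y, r_n):
    """Estimate the SNR in dB from per-scale energies.

    gamma_y : energy of the received-signal wavelets at scales 1, ..., J
    r_n     : noise energy ratio at each scale (sums to 1), known for a given WT
    """
    gamma_y = np.asarray(gamma_y, dtype=float)
    r_n = np.asarray(r_n, dtype=float)
    r_y = gamma_y / gamma_y.sum()          # received-signal energy ratios
    beta = r_y[-1] / r_n[-1]               # (16): coarsest scale holds almost no speech
    beta = np.clip(beta, 1e-12, 1.0 - 1e-12)
    return 10.0 * np.log10((1.0 - beta) / beta)   # (17)

# Toy check: clean energies [1, 2, 3, 0] plus noise energies [1, 1, 1, 1].
gamma_w = np.array([1.0, 2.0, 3.0, 0.0])
gamma_n = np.array([1.0, 1.0, 1.0, 1.0])
print(estimate_snr_db(gamma_w + gamma_n, gamma_n / gamma_n.sum()))
```

In the toy case the coarsest scale contains no speech energy, so the estimate coincides with the true value 10 log10(6/4).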

[Figure 7. DOA performance comparison between our proposed method and those of [10], [11] under different SNRs: mean bearing errors and standard deviation of bearing errors. Legend: Martin's approach, Berouti's approach, without denoising, wavelet based.]

Using the above, we can therefore apply a MAP estimator using (2), and our proposed algorithm for speech source localization is summarized as follows:
1) select c or c(1) using Fig. 6 or estimate the SNR using (17);
2) compute the noise variance $\hat{\sigma}_n$ using (10);
3) for each wavelet coefficient k = 1, ..., K in each scale,
   a) calculate $\hat{\sigma}_y$ using (8);
   b) calculate $\hat{\sigma}_\omega$ using (7);
4) estimate each coefficient $\hat{W}_k(j)$ using (5);
5) estimate the DOA using the GCC-PHAT.

IV. EXPERIMENT RESULTS

We evaluate the performance of our proposed algorithm and compare it with that of two well-known techniques [10], [11] in the context of DOA estimation. A virtual room of size 1 m × 1 m × 1 m is created using the method of images. A linear array of four microphones with spacing .5 m and centroid position (5, 5, 1.6) m is used. We evaluate the performance of the algorithms by varying the source bearing with a constant source-sensor distance of 3.6 m. We introduce white noise with different SNRs at each microphone. The speech signals used are obtained from the NOIZEUS database [8]. Bearing errors of our proposed wavelet-based denoising algorithm and of the spectral-subtraction (SS) techniques of Berouti [10] and Martin [11] are computed for 3 different speech signals, each using 1 independent trials, under different SNR conditions. For our method, we have used factors c(j) = 1, j = 2, ..., J, and c(1) is chosen using Fig. 6 based on the SNR estimated using (17). The mean and standard deviation of the bearing errors are illustrated in Fig. 7.
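The text does not spell out how a bearing is obtained from a TDOA, so the following sketch uses one common choice that is assumed here rather than taken from the paper: a far-field (plane-wave) model for a single microphone pair, with the bearing measured from the array broadside. The helper that summarizes the bearing errors mirrors the mean and standard deviation reported in the evaluation. Function names and the speed of sound of 343 m/s are illustrative.

```python
import numpy as np

def tdoa_to_bearing_deg(tau, mic_spacing, speed_of_sound=343.0):
    """Far-field bearing (degrees from broadside) for one microphone pair.

    tau         : time-difference-of-arrival in seconds
    mic_spacing : distance between the two microphones in meters
    """
    # sin(theta) = c * tau / d; clip to guard against delays beyond d / c.
    s = np.clip(speed_of_sound * tau / mic_spacing, -1.0, 1.0)
    return np.degrees(np.arcsin(s))

def bearing_error_stats(estimates_deg, true_deg):
    """Mean and standard deviation of the absolute bearing error in degrees."""
    err = np.abs(np.asarray(estimates_deg, dtype=float) - true_deg)
    return err.mean(), err.std()
```

For a source only a few meters from the array the plane-wave assumption is approximate, so a near-field model may be more appropriate in a setup like the one simulated here.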
As can be seen, the approach of [11] does not give rise to good DOA estimation, although it is well known for offering better speech intelligibility. Using our proposed algorithm, the mean bearing errors are reduced by approximately 4 over Berouti's approach in low-SNR environments. In addition, the standard deviation for our proposed algorithm is reduced by approximately 8 over Berouti's approach. This improvement is significantly higher than the improvement of the SS method over the GCC-PHAT processor without denoising. This shows that our approach based on wavelet denoising can improve DOA performance over that of the existing SS speech denoising method.

V. CONCLUSION

We presented a novel wavelet-based speech denoising algorithm for achieving high DOA performance for speech signals. We estimate the local noise variance, which can improve DOA performance further. Simulation results showed that our proposed method outperforms the spectral subtraction technique under low-SNR conditions, where the original PHAT algorithm is not robust.

REFERENCES

[1] C. Knapp and G. Carter, "The generalized correlation method for estimation of time delay," IEEE Trans. Acoust., Speech and Signal Process., vol. 24, no. 4, pp. 320-327, Aug. 1976.
[2] C. Zhang, D. Florencio, and Z. Y. Zhang, "Why does PHAT work well in low noise, reverberative environments?" in Proc. IEEE Int'l Conf. Acoust., Speech and Signal Process., Mar.-Apr. 2008.
[3] M. Miller and N. Kingsbury, "Image denoising using derotated complex wavelet coefficients," IEEE Trans. Image Process., vol. 17, no. 9, Nov. 2008.
[4] V. Bruni and D. Vitulano, "Wavelet-based signal de-noising via simple singularities approximation," Signal Processing, vol. 86, no. 4, Apr. 2006.
[5] I. Bayram and I. W. Selesnick, "Frequency-domain design of overcomplete rational-dilation wavelet transforms," IEEE Trans. Signal Process., vol. 57, no. 8, Aug. 2009.
[6] C. S. Burrus, R. Gopinath, and H. Guo, Introduction to wavelets and wavelet transform: a primer, Prentice Hall.
[7] L. Sendur and I. W. Selesnick, "Bivariate shrinkage functions for wavelet-based denoising exploiting interscale dependency," IEEE Trans. Signal Process., vol. 50, no. 11, pp. 2744-2756, Nov. 2002.
[8] loizou/speech/noizeus/.
[9] D. Donoho and I. Johnstone, "Ideal spatial adaptation by wavelet shrinkage," Biometrika, vol. 81, no. 3, pp. 425-455, 1994.
[10] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," in Proc. IEEE Int'l Conf. Acoust., Speech and Signal Process., pp. 208-211, 1979.
[11] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Trans. Speech and Audio Process., vol. 9, no. 5, pp. 504-512, Jul. 2001.