Nonlinear postprocessing for blind speech separation

Dorothea Kolossa and Reinhold Orglmeister

TU Berlin, Berlin, Germany, D.Kolossa@ee.tu-berlin.de, WWW home page: http://ntife.ee.tu-berlin.de/personen/kolossa/home.html

Abstract. Frequency domain ICA has been used successfully to separate the utterances of interfering speakers in convolutive environments (see e.g. [6], [7]). Improved separation results can be obtained by applying a time frequency mask to the ICA outputs. After using the direction of arrival information for permutation correction, the time frequency mask is obtained with little computational effort. The proposed postprocessing is applied in conjunction with two frequency domain ICA methods and a beamforming algorithm; it increases separation performance for reverberant as well as in-car speech recordings by 3.8 dB on average. By combining ICA and time frequency masking, SNR improvements of up to 15 dB are obtained in the car environment. Because it is robust to both the environment and the choice of ICA algorithm, time frequency masking appears to be a good choice for enhancing the output of convolutive ICA algorithms at a marginal computational cost.

1 Introduction

Frequency domain blind source separation can be employed to obtain estimates of clean speech signals in reverberant environments. One successful approach uses independent component analysis to obtain an estimate of the mixing system (i.e. the room transfer function) and subsequently inverts it. Applying this unmixing system to the signals yields estimates of the short time spectra of the speech signals $\hat{S}_{1..n}(k, \Omega)$. This ICA-based estimate can be further enhanced by taking advantage of the approximate disjoint orthogonality of speech signals. Two signals $s_1(t)$ and $s_2(t)$ are called W-disjoint orthogonal when the supports of their windowed Fourier transforms do not overlap, i.e.
when

$$S_1(k, \Omega)\, S_2(k, \Omega) = 0 \quad \forall\, k, \Omega \tag{1}$$

for the window function $W(t)$, where $k$ refers to the frame number and $\Omega$ to the frequency bin. This condition does not hold exactly for interfering speech signals; however, it holds approximately for an appropriate choice of time frequency representation, as shown in [10].

Thus, a postprocessing scheme is proposed as follows: in each frequency bin $\Omega$ and at each frame $k$, the magnitudes of the ICA outputs are compared. Based
on the assumption of disjoint orthogonality, only one of the outputs should have a non-zero value at any given frame and bin. Therefore, only the output with the greatest magnitude is retained in each frame and bin; the other outputs are set to zero. An overview of the entire system is given in Figure 1. While the approach was first tested on a frequency domain implementation of JADE [5], it is also successful as postprocessing for other ICA and beamforming algorithms.

The remainder of this paper is organized as follows. Section 2 gives an overview of the entire signal processing system and describes the ICA and beamforming algorithms which were used to arrive at an initial speech signal estimate $\hat{S}(k, \Omega)$. Subsequently, Section 3 deals with the nonlinear postprocessing stage. The algorithm was evaluated on three data sets: real-room recordings made in a reverberant office environment, the ICA99 evaluation data sets, and in-car speech data, which was recorded in cooperation with DaimlerChrysler¹. Details of the evaluation data and methods are given in Section 4. Finally, in Section 5, the results are collected and conclusions are drawn.

¹ The authors wish to thank DaimlerChrysler for the cooperation and support.

2 Algorithms

The block diagram of the algorithm is shown in Figure 1 for the case of two signals. While the algorithm is applicable for demixing an arbitrary number of sources, provided that they meet the requirement of approximate disjoint orthogonality, it was tested here only for the case of two sources and sensors.

[Figure 1 (block diagram): x_1(t), x_2(t) -> STFT -> ICA -> Permutation Correction -> Time-Frequency Masking -> IFFT -> y_1(t), y_2(t)]
Fig. 1. Overview of the algorithm

First, the microphone signals, sampled at 16 kHz, are transformed into the time frequency domain via STFT using a Hamming window of 512 samples, i.e.
32 ms duration, and a frame shift of 8 ms. In the ICA stage, the unmixing filters $W(\Omega)$ are determined for each frequency bin. This can be accomplished with any ICA algorithm, provided it operates on complex data. For this work, two different ICA approaches were tested and were also compared to a fixed direction nullbeamformer. The unmixing filters determined by ICA are applied to the microphone signals to obtain initial speech estimates $\hat{S}(k, \Omega)$.

The permutation problem is solved by beampattern analysis, which assumes that the incoming signal obeys the farfield beamforming model, i.e. all incoming sound waves are planar. In this case, the directivity patterns of a demixing filter $W(\Omega)$ can be calculated as a function of the angular frequency $\omega = \frac{F_s}{N}\Omega$ and the angle of incidence of the signal relative to broadside, $\varphi$, via

$$F_l(\Omega, \varphi) = \sum_{k=1}^{2} W_{lk}(\Omega) \exp\left(j\, \frac{\Omega F_s d \sin\varphi}{N c}\right). \tag{2}$$

Here, $N$ is the number of frequency bins and $F_s$ the sample rate. The permutation matrix $P(\Omega)$ is determined by aligning the minima of directivity patterns between frequency bins, as described in [6].

The result of this procedure is, on each channel, a linear, filtered combination of the input signals. Since speech signals are sparse in the chosen time frequency representation, subsequent time frequency masking (TF masking) can be used to further suppress noise and interference in those frames and bins where the desired signal is dominated by interference. Finally, the unmixed signals $Y(\Omega, k)$ are transformed back into the time domain using the overlap-add method.

2.1 Complex JADE with Beampattern Correction

A frequency domain implementation of JADE results in a set of unmixing matrices, one for each frequency bin. The scaling problem is avoided by using a normalized mixing model, and permutations are corrected by beampattern analysis as described above.
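The beampattern analysis used for permutation correction can be illustrated with a short Python sketch. It evaluates a directivity pattern in the spirit of Eq. (2) for one row of W(Ω) and scans a grid of angles for the minimum. This is only a sketch under stated assumptions: the function names, the speed of sound of 343 m/s, and the explicit per-microphone positions `mic_pos` are hypothetical choices made for illustration, and the angular frequency is passed in directly in rad/s.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for c, not stated in the text


def directivity(w_row, omega, phi, mic_pos):
    """Directivity of one demixing row for angle phi (rad, relative to broadside).

    w_row   : complex weights W_l1, W_l2 of one output channel
    omega   : angular frequency in rad/s
    mic_pos : assumed microphone positions along the array axis, in metres
    """
    total = 0j
    for w_lk, d_k in zip(w_row, mic_pos):
        phase = omega * d_k * math.sin(phi) / SPEED_OF_SOUND
        total += w_lk * cmath.exp(1j * phase)
    return total


def find_null(w_row, omega, mic_pos, n_angles=181):
    """Return the grid angle (rad) in [-90, 90] degrees with minimum |F_l|."""
    angles = [math.radians(-90.0 + 180.0 * i / (n_angles - 1))
              for i in range(n_angles)]
    return min(angles, key=lambda phi: abs(directivity(w_row, omega, phi, mic_pos)))
```

Comparing the null directions found in this way across frequency bins is what allows the outputs to be assigned consistently to the same source, following the alignment procedure of [6].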
2.2 Minimum Cross Statistics Nullbeamforming

The second algorithm is also a frequency domain convolutive approach, which is based on searching for the minimum cross cumulant nullbeamformer in each frequency bin. Here, the cross statistics up to fourth order are used, similar to [2]. The idea is to parameterize the unmixing system in such a way that it becomes a nullbeamformer, cancelling as many directional interferers as the number of microphones allows. When the microphones are sufficiently close and well adjusted so that no damping occurs, and when the sources obey the farfield model, the mixing model can be written as

$$X(j\omega) = A_{ph}(j\omega)\, S(j\omega) \tag{3}$$
with the phase shift mixing matrix

$$A_{ph}(j\omega) = \begin{bmatrix} 1 & 1 \\ e^{j\omega \frac{d}{c} \sin\varphi_1(\omega)} & e^{j\omega \frac{d}{c} \sin\varphi_2(\omega)} \end{bmatrix} \tag{4}$$

which depends on the angular frequency $\omega$, the speed of sound $c$ and the distance $d$ between microphones. To cancel one of the signals, the inverse of the mixing model

$$W(j\omega) = \frac{1}{e_2 - e_1} \begin{bmatrix} e_2 & -1 \\ -e_1 & 1 \end{bmatrix} \tag{5}$$

is used, with

$$e_1 = e^{j\omega \frac{d}{c} \sin\varphi_1(\omega)} \quad \text{and} \quad e_2 = e^{j\omega \frac{d}{c} \sin\varphi_2(\omega)}. \tag{6}$$

This nullbeamformer is optimized for each frequency bin separately, so that it is possible to compensate phase distortions introduced by the impulse response. The optimization is carried out by stochastic gradient descent for the cost function

$$J(\hat{S}_1, \hat{S}_2) = E(\hat{S}_1 \hat{S}_2) + \mathrm{Cum}(\hat{S}_1, \hat{S}_2), \tag{7}$$

where $\mathrm{Cum}(\hat{S}_1, \hat{S}_2)$ refers to the fourth order cross-cumulant of $\hat{S}_1$ and $\hat{S}_2$.

2.3 Why parameterize each bin?

Both ICA algorithms find an unmixing system separately in each frequency bin, and subsequently use only those time frequency points in which one ICA output dominates the others by a set margin. This approach is strongly reminiscent of a family of algorithms described by Yilmaz and Rickard [10], where the following mixing model was used in the windowed Fourier transform domain:

$$\begin{bmatrix} X_1(\omega, \tau) \\ X_2(\omega, \tau) \end{bmatrix} = \begin{bmatrix} 1 & \cdots & 1 \\ a_1 e^{j\omega\delta_1} & \cdots & a_N e^{j\omega\delta_N} \end{bmatrix} \begin{bmatrix} S_1(\omega, \tau) \\ \vdots \\ S_N(\omega, \tau) \end{bmatrix}. \tag{8}$$

The main difference of this mixing model is that the delay $\delta$ is not adjusted independently in different frequency bins. In anechoic environments, in which the far-field beamforming assumption is valid, it is sufficient to use one angle of incidence estimate $\varphi$, corresponding to one delay estimate $\delta$, for all frequencies. In this case, source separation performance does not profit notably from introducing frequency-variant null directions, as shown by [1]. However, when reverberation or noise is present in the signal, the phase shift varies strongly over frequency. Thus it becomes difficult to estimate one best direction of arrival (DOA) for each source, and demixing performance suffers from localization errors.
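As a plausibility check of the phase shift model in Eqs. (4) to (6), the following Python sketch builds the mixing matrix and the nullbeamformer for one frequency bin and multiplies them, which should give the identity matrix. The helper names, the speed of sound of 343 m/s, the 5 cm microphone spacing, and the 1 kHz example frequency are assumptions chosen for illustration; the sign of the phase exponent follows the equations as written above.

```python
import cmath
import math


def phase_factor(omega, d, phi, c=343.0):
    """e_i = exp(j*omega*(d/c)*sin(phi_i)) as in Eq. (6); c = 343 m/s is assumed."""
    return cmath.exp(1j * omega * (d / c) * math.sin(phi))


def mixing_matrix(e1, e2):
    """Phase shift mixing matrix A_ph of Eq. (4) as a nested list."""
    return [[1.0, 1.0], [e1, e2]]


def nullbeamformer(e1, e2):
    """Inverse of A_ph, Eq. (5); requires e1 != e2, i.e. distinct directions."""
    det = e2 - e1
    return [[e2 / det, -1.0 / det], [-e1 / det, 1.0 / det]]


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


# Example values (assumed): 1 kHz, 5 cm spacing, sources at 45 and -25 degrees.
omega = 2.0 * math.pi * 1000.0
e1 = phase_factor(omega, 0.05, math.radians(45.0))
e2 = phase_factor(omega, 0.05, math.radians(-25.0))
product = matmul2(nullbeamformer(e1, e2), mixing_matrix(e1, e2))
```

Each row of the resulting W places a spatial null on one source direction, which is why only the two angles need to be optimized per frequency bin in this parameterization.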
To assess the improvements gained from the extra computational effort of an ICA stage, we compared the separation performance of the two above algorithms to that of a constant DOA nullbeamformer, which was pointed to the directions giving minimum cross statistics of the outputs. This beamformer was used in the same structure as the ICA algorithms.
3 Nonlinear Postprocessing

In the postprocessing stage, a time frequency mask is applied to the ICA or beamformer outputs, as shown in Figure 2 for the special case of two signals.

Fig. 2. Postprocessing for the 2x2 case

The time-frequency mask is determined from the ratio of demixed signal energies, which provides an estimate of the local SNR. The masking function

$$M_i = \Psi\left( \log |\hat{S}_i(\Omega)|^2 - \max_{j \neq i} \log |\hat{S}_j(\Omega)|^2 - \frac{T}{10} \right) \tag{9}$$

is obtained by comparing this SNR estimate to an acceptance threshold $T$, with $\Psi$ defined by

$$\Psi(x) = \begin{cases} 0 & \text{for } x \leq 0, \\ 1 & \text{for } 0 < x < \infty. \end{cases} \tag{10}$$

The threshold $T$ was varied between -3 dB and 5 dB, with higher thresholds leading to better SNR gains but, in some test cases, also to musical noise.

4 Evaluation

To test the proposed postprocessing method, three datasets were used, on which separation was carried out with and without nonlinear postprocessing.

4.1 Datasets

ICA1999 Evaluation Data (Real Room)

The tracks, which were suggested for evaluating ICA performance for the 1999 ICA Workshop [4], are sampled at 16 kHz and are 10 seconds long (160000 samples). A male and a female speaker are speaking simultaneously and there is some background noise.
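The masking rule of Section 3, Eqs. (9) and (10), can be sketched in a few lines of Python. The sketch assumes that the demixed spectra are stored as nested lists indexed as `spectra[i][k][b]` (output i, frame k, bin b), that the logarithm is base 10 so that the threshold is in dB, and that a small constant `eps` guards against taking the logarithm of zero; the function names and this data layout are hypothetical.

```python
import math


def tf_masks(spectra, threshold_db=0.0, eps=1e-12):
    """Binary masks M_i per frame and bin: 1 where output i exceeds the
    strongest other output by more than threshold_db, else 0."""
    n_src = len(spectra)
    n_frames = len(spectra[0])
    n_bins = len(spectra[0][0])
    masks = [[[0] * n_bins for _ in range(n_frames)] for _ in range(n_src)]
    for k in range(n_frames):
        for b in range(n_bins):
            # Energy of each output in dB; eps is an assumed numerical guard.
            level_db = [10.0 * math.log10(abs(spectra[i][k][b]) ** 2 + eps)
                        for i in range(n_src)]
            for i in range(n_src):
                strongest_other = max(level_db[j] for j in range(n_src) if j != i)
                # Psi(x) = 1 only for x > 0, see Eq. (10).
                if level_db[i] - strongest_other - threshold_db > 0.0:
                    masks[i][k][b] = 1
    return masks


def apply_masks(spectra, masks):
    """Multiply every time frequency point by its mask value."""
    return [[[m * s for m, s in zip(m_frame, s_frame)]
             for m_frame, s_frame in zip(m_ch, s_ch)]
            for m_ch, s_ch in zip(masks, spectra)]
```

With `threshold_db=0.0` this reduces to the simplest form of the scheme, in which a point is kept only on the output with the greatest magnitude. Raising the threshold additionally zeroes points where no output dominates clearly.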
Reverberant Room Recording

Recordings were made in an office room with dimensions of about 10 m x 15 m x 3.5 m. The distance between the loudspeakers and the two microphones (Behringer ECM 8000) was set to one meter. At this distance, the reverberation time was measured to be 300 ms. Speech signals from the TIDigits database [9] were played back and recorded in two different setups of loudspeakers, with the angles of incidence, relative to broadside, as shown in Table 1.

Table 1. Recording configurations.

config   θ1 (deg)   θ2 (deg)   recordings
A        45         -25        speaker 1, speaker 2, both speakers
B        10         -25        speaker 1, speaker 2, both speakers

In-Car Speech Data

In the final dataset, recordings were made inside a Mercedes S 320 at standstill and at 80 and 100 km/h. Speech from the TIDigits database was reproduced with artificial heads and recorded simultaneously with four cardioid microphones, an eight channel microphone array mounted in the center of the ceiling near the rearview mirror, and two reference signals on a 16-channel recorder. For evaluation, two recordings were used, one of a male and a female speaker and one of two male speakers. The impulse response of the car was measured, and the reverberation time was determined to lie between 60 and 150 ms, depending on the position of the artificial head relative to the microphone.

4.2 Results

Evaluation of Separation Performance

To measure separation quality, the SNR improvement between the mixed and the demixed signal is used. For this purpose, two SNRs are calculated: the SNR at the input of the ICA stage and the output SNR. The output SNR is proposed as a measure of separation performance in [8], and it is calculated for channel $j$ via

$$\mathrm{SNR}_{out,j} = 10 \log_{10} \frac{E(y_{j,j}^2)}{E\left(\sum_{i \neq j} y_{j,i}^2\right)}. \tag{11}$$

Here, the term $y_{j,i}$ stands for the $j$th separation output, calculated from the microphone signals recorded with only source $i$ active.
The input SNR is calculated in a similar way, so that the SNR improvement is obtained by

$$\mathrm{SNRI}_j = 10 \log_{10} \frac{E(y_{j,j}^2)}{E\left(\sum_{i \neq j} y_{j,i}^2\right)} - 10 \log_{10} \frac{E(x_{j,j}^2)}{E\left(\sum_{i \neq j} x_{j,i}^2\right)} \tag{12}$$
with $x_{j,i}$ denoting the $j$th microphone signal when only source $i$ is active.

To determine the influence of time frequency masking on the performance of ICA algorithms, the SNR improvement was calculated with and without nonlinear postprocessing for the three datasets of actual recordings. Table 2 shows the comparison.

Table 2. Average SNR improvements for real room recordings.

                       JADE      JADE      MCC Nullbeamformer   MCC Nullbeamformer   Fixed DOA NBF   Fixed DOA NBF
                       without   with      without              with                 without         with
                       TF Mask   TF Mask   TF Mask              TF Mask              TF Mask         TF Mask
Reverberant Room (A)   5.2 dB    5.5 dB    7.3 dB               9.3 dB               6.8 dB          9.7 dB
Reverberant Room (B)   5.8 dB    7.1 dB    6.4 dB               10.1 dB              5.3 dB          9.4 dB
ICA 99 Dataset         2.9 dB    8.3 dB    2.5 dB               4.4 dB               0.7 dB          3.0 dB
Car Data standstill    13.8 dB   15.4 dB   8.8 dB               12.0 dB              4.6 dB          10.9 dB
Car Data 100 km/h      6.3 dB    12.3 dB   5.4 dB               10.4 dB              3.1 dB          8.5 dB

The best values are marked in bold. As can be seen, nonlinear postprocessing adds between 1 and 6 dB, on average 3.8 dB, to the output SNR. Also, it is interesting to see that ICA performance in the noisy recordings (ICA99 and in-car data) is significantly higher than that of the constant DOA beamformer.

When the threshold for the local SNR is increased, the SNR can be improved further. On the other hand, listening quality can profit from lower thresholds. The average SNR improvement for different thresholds is shown in Table 3, where the average was taken over all datasets.

Table 3. Average SNR improvements for all configurations.

Threshold     JADE                MCC Nullbeamformer   Fixed DOA Estimate
no TF-Mask    7.9 dB              6.3 dB               4.2 dB
-3 dB         9.2 dB (+1.3 dB)    7.6 dB (+1.3 dB)     6.0 dB (+1.8 dB)
0 dB          8.9 dB (+1.0 dB)    8.3 dB (+2.0 dB)     7.3 dB (+3.1 dB)
3 dB          10.0 dB (+2.1 dB)   9.8 dB (+3.5 dB)     8.1 dB (+3.9 dB)
5 dB          10.4 dB (+2.5 dB)   10.2 dB (+3.9 dB)    9.2 dB (+5.0 dB)

The values in parentheses are the SNR gains due to TF masking.
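The evaluation measure of Eqs. (11) and (12) can be sketched as follows. The sketch assumes that, for one channel j, the separately recorded contributions of each source are available as plain lists of samples, and it estimates the expectations by time averages; the function names and argument layout are hypothetical.

```python
import math


def power(signal):
    """Time-averaged power, used here as an estimate of E(.^2)."""
    return sum(abs(v) ** 2 for v in signal) / len(signal)


def snr_db(contributions, j):
    """SNR of channel j in dB, cf. Eq. (11).

    contributions[i] is the signal observed on channel j when only
    source i is active (x_{j,i} at the input, y_{j,i} at the output).
    """
    target = power(contributions[j])
    interference = sum(power(c) for i, c in enumerate(contributions) if i != j)
    return 10.0 * math.log10(target / interference)


def snr_improvement_db(output_contributions, input_contributions, j):
    """SNR improvement of channel j in dB, cf. Eq. (12)."""
    return snr_db(output_contributions, j) - snr_db(input_contributions, j)
```

For example, if the interferer is as strong as the target at the microphone (0 dB input SNR) and is attenuated to one tenth of its amplitude at the separation output while the target is unchanged, the function returns an improvement of about 20 dB.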
5 Conclusions

A combination of ICA and time frequency masking has been applied to in-car speech recordings as well as to reverberant room recordings and artificial speech mixtures. In the car environment, SNR improvements of 15 dB and more can be obtained with this combination, and SNR improvements due to time frequency masking alone in the range of 3 dB and more are noted for most test cases. In the scenarios considered, using a frequency variant look direction improved separation performance by an average of 1.9 dB. However, in the noisy test cases, the output SNR of the ICA processor was greater than that of frequency invariant processing by a margin of 4.5 dB.

Generally speaking, time frequency masking as a postprocessing step for frequency domain ICA algorithms can improve signal separation significantly. In the simplest form, where a signal in a frequency bin is retained only if its magnitude exceeds that of all other signals, the extra computational effort is negligible, and additional SNR gains of 5 dB and more can be obtained. The postprocessing has been tested in conjunction with two ICA algorithms and one beamformer, and it can be expected to yield similar improvements on other frequency domain source separation algorithms.

References

1. Balan, R.; Rosca, J. and Rickard, S.: Robustness of Parametric Source Demixing in Echoic Environments. Proc. Int. Workshop on Independent Component Analysis and Blind Signal Separation, San Diego, California (2001) 144-149
2. Baumann, W.; Kolossa, D. and Orglmeister, R.: Beamforming-based convolutive source separation. Proceedings ICASSP 03 5 (2003) 357-360
3. Baumann, W.; Kolossa, D. and Orglmeister, R.: Maximum Likelihood Permutation Correction for Convolutive Source Separation. Proc. Int. Workshop on Independent Component Analysis and Blind Signal Separation, Nara, Japan (2003) 373-378
4. Available at URL: http://www2.ele.tue.nl/ica99/
5.
Cardoso, J.-F.: High order contrasts for independent component analysis. Neural Computation 11 (1999) 157-192
6. Kurita, S.; Saruwatari, H.; Kajita, S.; Takeda, K. and Itakura, F.: Evaluation of blind signal separation method using directivity pattern under reverberant conditions. Proceedings ICASSP 00 5 (2000) 3140-3143
7. Parra, L. and Alvino, C.: Geometric Source Separation: Merging convolutive source separation with geometric beamforming. IEEE Trans. on Speech and Audio Processing 10:6 (2002) 352-362
8. Schobben, D.; Torkkola, K. and Smaragdis, P.: Evaluation of Blind Signal Separation. Proc. Int. Workshop on Independent Component Analysis and Blind Signal Separation, Aussois, France (1999)
9. TIDigits Speech Database: Studio Quality Speaker-Independent Connected-Digit Corpus. Readme file on CD-ROM. See also at URL: http://morph.ldc.upenn.edu/catalog/ldc93s10.html
10. Yilmaz, Ö. and Rickard, S.: Blind Separation of Speech Mixtures via Time-Frequency Masking. Submitted to IEEE Transactions on Signal Processing (2003)