Speaker Isolation in a Cocktail-Party Setting


M.K. Alisdairi
Columbia University, M.S. Candidate, Electrical Engineering
Spring

Abstract — The human auditory system is capable of performing many interesting tasks, several of which could find useful applications in engineering settings. One such capability is the ability to perceptually separate sound sources, allowing a listener to focus on a single speaker in a noisy environment. This effect is often referred to as the cocktail-party effect (in reference to a cocktail-party environment where several simultaneous conversations are taking place in the background) or as Auditory Scene Analysis. This paper introduces two methodologies for isolating a desired speaker's audio stream from a binaural recording of multiple speakers in conversation. An implementation of a system for speaker isolation based on one of these methods is also presented. Note that some of the graphics presented in this document are best viewed in color; for the electronic version please visit www.columbia.edu/~mka3/speech.html.

INTRODUCTION

Systems capable of performing Auditory Scene Analysis (ASA) [10] could find numerous useful applications. The most evident application is as a front-end for speech recognition systems: the development of systems capable of ASA could provide improvements in speech recognition in unconstrained auditory environments [7] [9].

[Figure 1: Speaker Extraction System as a Front-End to Voice Recognition. Sound input from a noisy environment feeds the speaker extraction system, whose output feeds the voice recognition system.]

Another possible use for an ASA-capable system could be in theatrical/movie settings as a substitute for wireless microphones. In such instances sound engineers could have a versatile means of controlling audio quality without the physical imposition of hardware on the speaker's person.

This paper discusses two methodologies for speech extraction. The first method is based on the Interaural Intensity Difference (IID), and the second on the Time Difference of Arrival (TDOA). After the preliminary discussion, an implementation of the TDOA-based method is presented. Analysis and implementation are carried out on sound recordings from the ShATR corpus.

The ShATR Corpus

The sound files used were taken from the ShATR corpus of dummy-head recordings. The recordings are of five speakers (Guy, Martin, Phil, Inge Marie, and Malcolm) oriented around the dummy head. Two files from the corpus are of primary interest in this document: one contains a recording of each of the five speakers introducing themselves,

while the other is a recording of the five carrying on in conversation.

THEORETICAL BACKGROUND

THE HRTF

The theoretical undergirding of speaker isolation is the fact that the signals arriving at the left and right ears traverse different paths, resulting in different filtering for each channel. The following figure and equations illustrate the concept.

[Figure 2: Depiction of the Path-Dependent Filtering Effect. A sound source reaches the left and right ear channels of the dummy head along different paths.]

$Y_L(\omega, \phi) = H_L(\omega, \phi)\, X(\omega)$   (Equation 1)
$Y_R(\omega, \phi) = H_R(\omega, \phi)\, X(\omega)$   (Equation 2)

where $Y_L$ and $Y_R$ are the signals received by the left and right ears, $H_L$ and $H_R$ are the impulse responses of the paths to the left and right ears, and $X(\omega)$ is the original speech signal.

The two transfer functions $H_L$ and $H_R$ are called the Head-Related Transfer Functions (HRTFs) and are functions of position as well as frequency (more precisely, the HRTF is a function of frequency ω, azimuth φ, and elevation θ) [5].

It is averred that by decomposing a multi-speaker signal into several time/frequency (TF) cells and weighting each cell appropriately, a desired speaker's speech data may be extracted from an aggregate of speakers [3]. The localization cues that are implanted in the received signals by the HRTF may be used to determine the appropriate weighting for each of the TF cells.

[Figure 3: The TF Cell Concept. A sample spectrogram divided into time/frequency cells.]

There are two methods which may be implemented towards the goal of categorizing and weighting the TF cells appropriately: the interaural intensity difference (IID) method and the time delay of arrival (TDOA) method.

The Inter-aural Intensity Difference

The IID method of cell-weight estimation is based on intensity differentials as a function of frequency [6]. For example, a sound originating from a source in the first quadrant of the figure below will be detected by the left ear as a signal that has undergone a low-pass filtering effect. The low-pass effect is a result of the shadowing of high-frequency components by the head; low frequencies, on the other hand, are able to wrap around the head with little attenuation.

[Figure 4: Sound Source Plane. The space around the head is divided into quadrants.]

The interaural intensity difference may be obtained by taking the ratio of the left- and right-channel magnitudes in the frequency domain (or, correspondingly, taking the difference between the frequency-domain log magnitudes in dB). Because the source spectrum $X(\omega)$ cancels in the ratio, the IID depends only on the HRTFs:

$\mathrm{iid}(\omega, \phi) = \log\frac{|Y_L(\omega,\phi)|}{|Y_R(\omega,\phi)|} = \log\frac{|H_L(\omega,\phi)|}{|H_R(\omega,\phi)|} = \log|Y_L(\omega,\phi)| - \log|Y_R(\omega,\phi)|$   (Equation 3)

By comparing incoming speech data to predetermined categorical information, the previously mentioned TF cells may be classified appropriately [4].
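To make Equation 3 concrete, the following is a minimal MATLAB sketch of a per-TF-cell IID computation. It is illustrative only (the implementation in this paper uses the TDOA method instead); the stereo filename is hypothetical, and the STFT parameters simply mirror the N = 2^9 used in Appendix A.

[x, fs] = audioread('binaural.wav');                     % hypothetical stereo recording
N = 2^9;                                                 % FFT length, as in Appendix A
[YL, F, T] = spectrogram(x(:,1), hann(N), N/2, N, fs);   % left-ear STFT
[YR, ~, ~] = spectrogram(x(:,2), hann(N), N/2, N, fs);   % right-ear STFT
iid = 20*log10(abs(YL) + eps) - 20*log10(abs(YR) + eps); % IID in dB for each TF cell
imagesc(T, F, iid); axis xy;                             % visualize the IID map
xlabel('Time (s)'); ylabel('Frequency (Hz)'); colorbar;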

The Time Delay of Arrival

The distance differential between the propagation paths to the right and left ears causes a phase difference between the two channels. This phase difference may be used as a localization cue and, as with the IID method, may be used to derive the necessary weighting information for each TF cell.

Theoretically, extraction of the desired time-delay information may be accomplished through direct analysis of the phase components, via the cross-spectrum of the two channels:

$\mathrm{itd}(\omega, \phi) = \arg\{ Y_L(\omega,\phi)\, Y_R^{*}(\omega,\phi) \}$   (Equation 4)

Despite the plausibility of the above approach, it is avoided here due to difficulties resulting from the nonlinearity of the phase functions. An alternative means to the same end is the use of the cross-correlation function: the TDOA is obtained by retaining the lag index of maximum cross-correlation between the left and right channels.

[Figure 5: Determining the TDOA by Cross-Correlation. The cross-correlation of the left and right channels peaks at the lag corresponding to the time delay of arrival.]

IMPLEMENTATION

This paper concentrates on the TDOA method of speaker isolation. The first step is the analysis of the introduction sound samples from the ShATR corpus to develop a source model that describes each speaker's localization cues. The second step uses the localization cues captured by the previously attained models to determine the appropriate weighting for each TF cell.

Analysis — Source Models

The first step in building the speech isolation system is to study the behavior of the TDOA as a speaker- (i.e., position-) dependent feature. This analysis was conducted on the speech samples of each speaker's introduction, such as Inge Marie's, depicted below.

[Figure 6: Inge Marie's Introduction Sample (Inge Marie on microphone four).]

Each sound sample is first passed through a Bark-scaled filter bank, and each band-limited output is broken into short time windows. The left- and right-channel windows are then cross-correlated, and the lag index of the maximum cross-correlation is retained for each time window. The filter bank is a four-channel filter bank; this selection is based on the results given in [3], which show a maximization of the SNR for four frequency bands. Initially, the time window length was based solely on calculations of the maximum possible lag index; however, [3] also shows a maximization of the SNR for a window length of 256 samples, so a window length of 256 samples was used.

The following figure depicts histograms of the lag indexes that result from the process described above:

[Figure 7: Histogram of Lag Index of Maximum Correlation.]

As the histograms show, the lag indexes of maximum cross-correlation appear to be normally distributed. The appropriate mean and standard deviation were then manually extracted for each speaker and channel. The results for Guy and Inge Marie follow:

Speaker: Guy

    Channel   Mean (µ)   Std. Dev. (σ)
    One       -8         —
    Two       -6.5       —
    Three     -7.5       0.5
    Four      —          5

Speaker: Inge Marie

    Channel   Mean (µ)   Std. Dev. (σ)
    One       6.5        4
    Two       7.5        3
    Three     7.5        —
    Four      6.5        0.5
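The per-window measurement just described amounts to the following condensed MATLAB sketch (the full implementation, including the filter bank, appears in Appendix A; the filename is hypothetical, and the channels are used unfiltered here for brevity):

[x, fs] = audioread('binaural.wav');   % hypothetical stereo recording
bl = x(:,1);  br = x(:,2);             % left/right; Appendix A band-limits these first
win  = 256;                            % window length from [3]
nwin = floor(length(bl)/win);          % number of analysis windows
lag  = zeros(1, nwin);
for k = 1:nwin
    seg = (k-1)*win + (1:win);         % sample indexes of the k-th window
    c = xcorr(bl(seg), br(seg));       % cross-correlation at lags -(win-1)..(win-1)
    [~, imax] = max(c);
    lag(k) = imax - win;               % lag index of maximum correlation
end
histogram(lag)                         % per-speaker lag distribution (cf. Figure 7)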

The weight of a particular TF cell, as a function of its lag index $i$, is then

$w_{ch}(i) = e^{-(i - \mu_{ch})^2 / (2\sigma_{ch}^2)}$   (Equation 5)

Speech Isolation System

Once the distributions of the lag indexes for each speaker have been determined, the sound file containing simultaneous speech may be analyzed and a desired speaker extracted. The simultaneous speech sample is first passed through the previously mentioned filter bank. Each of the four band-limited signals is then broken into time windows of length 256 samples. The left and right channels are cross-correlated for each time frame, and the lag index of maximum correlation is retained. The sequence of lag indexes is then compared to the desired speaker's lag-index distribution model, and a weight corresponding to the likelihood of each TF cell belonging to the desired speaker is used as that cell's weight.

[Figure 8: Block Diagram of the Speech Isolation System. Speech data passes through the filter bank and TDOA analysis; weighting based on the desired speaker's TDOA model produces the isolated speech.]

RESULTS

The above methodology of speech extraction was successful in isolating the desired speaker's signal. The following is a spectrogram of a sample of cocktail-party speech and the resulting extraction of Guy's stream.

[Figure 9: Extracting Guy's Speech Signal.]

Although the system performed the desired task of extracting a single speech track from the mixture, the weighting process introduced some undesirable noise. It was postulated that the source of the noise was the discontinuity of the weighting matrix over time, and that the problem could be ameliorated by filtering the weight matrix. The following figure illustrates short sections of the four weight sequences before and after filtering.

[Figure 10: Illustration of the Weight Matrix Filtering Process. Original and filtered weight sequences for the four channels.]

Reconstruction of the desired track using the newly filtered weighting matrix resulted in the desired improvement in quality. The following figures illustrate the reconstructed signals for Guy with and without weight-matrix filtering.
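As a sketch of how Equation 5 and the weight filtering combine, continuing from the TDOA sketch above: the mean µ = -8 is Guy's channel-one value from the table, but the standard deviation here is illustrative only (the corresponding table entry is not recoverable), and since the paper does not specify the smoothing filter, a simple moving average stands in for it.

mu = -8;  sigma = 2;                    % mu from the table; sigma illustrative only
w = exp(-(lag - mu).^2 / (2*sigma^2));  % Equation 5: one weight per time window
wSmooth = filter(ones(1,5)/5, 1, w);    % assumed smoothing: 5-point moving average
wSamp = repelem(wSmooth(:), win);       % expand window weights to per-sample weights
yIso = bl(1:length(wSamp)) .* wSamp;    % weighted (isolated) band signal, one channel
soundsc(yIso, fs)                       % audition the result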

[Figure 11: Reconstructed Signals for Guy, Before and After Filtering the Weighting Function.]

CONCLUDING REMARKS

The design presented in this paper illustrates a simple implementation of a speech extraction system and establishes the feasibility of such a system. This design was successful in achieving the objective of extracting a single speaker's track from a group recording. Included in the appendices are the code for the implementation as well as larger spectrogram depictions of the results.

Despite the successes presented in this paper, there exists room for enhancement in future work. For example, the source model in this implementation was obtained manually. Automating the source-model extraction would allow the system to behave in a more versatile manner, possibly allowing the relaxation of the a priori assumption that the speakers' positions remain constant. A second potential improvement could be the incorporation of a broader feature set, possibly including the IID in addition to the TDOA.

REFERENCES

[1] S. Mitra, Digital Signal Processing: A Computer-Based Approach, McGraw-Hill Irwin.
[2] R. Ziemer and W. Tranter, Signals and Systems: Continuous and Discrete, Prentice Hall, 1998.
[3] E. Tessier and F. Berthommier, "Speech Enhancement and Segregation Based on the Localisation Cue for Cocktail-Party Processing."
[4] W. Chau and R. Duda, "Combined Monaural and Binaural Localization of Sound Sources," IEEE Proceedings of ASILOMAR, 1996.
[5] R. Duda, "Modeling Head Related Transfer Functions," IEEE Proceedings of ASILOMAR, 1993.
[6] K. Martin, "Estimating Azimuth and Elevation from Interaural Differences."
[7] S. Choi, H. Glotin, F. Berthommier, and E. Tessier, "A CASA Front-End Using the Localization Cue for Segregation and then Cocktail-Party Speech Recognition."
[8] F. Berthommier and S. Choi, "Evaluation of CASA and BSS Models for Subband Cocktail-Party Speech Separation."
[9] D. Wang and G. Brown, "Separation of Speech from Interfering Sounds Based on Oscillatory Correlation," IEEE Transactions on Neural Networks, Vol. 10, No. 3, May 1999.
[10] A. Bregman, Auditory Scene Analysis, Cambridge, MA: MIT Press, 1990.

APPENDIX A: MATLAB CODE

function banddat = bands(dd, nb)
% M.K. Alisdairi, Spring
% b = bands(sound, number_bands)
% Given original stereo data 'dd', bands() returns an
% nb-by-length(dd)-by-2 matrix containing versions of 'dd' band-limited
% according to the Bark scale. nb is the number of bands used.
% Note: nb may be {8, 4, 2, 1}.

if (nb~=1 & nb~=2 & nb~=4 & nb~=8)    % make sure it fits the criteria
    clc
    disp(sprintf('Error: nb must be 8, 4, 2, or 1'))
    banddat = [];
    return
end

home
disp(sprintf('Working...'))

N = 2^9;                              % number of frequency points for the STFT

% Separate into left and right channels
d_left  = dd(:,1)';
d_right = dd(:,2)';

% Calculate the STFTs
DL = stft(d_left,  N, N, N/2);
DR = stft(d_right, N, N, N/2);

% Define the Bark-scaled windows in Hz, then convert to FFT index (w).
% Note: eight channels max. (Band-edge digits are partially unreadable in
% the source; the values below are a reconstruction.)
F = [0 330 690 1060 1880 3000 4600 9300 15000];
w = floor(N*F./48000) + 1;

banddat = zeros(nb, (size(DL,2)+1)*N/2, 2);   % initializing is faster

% Go through and produce the proper band-limited signals
inc = 8/nb;
for i = 1:inc:8
    FtempL = zeros(size(DL));
    FtempL(w(i):(w(i+inc)-1), :) = DL(w(i):(w(i+inc)-1), :);
    FtempR = zeros(size(DR));
    FtempR(w(i):(w(i+inc)-1), :) = DR(w(i):(w(i+inc)-1), :);
    banddat(ceil(i/inc), :, 1) = istft(FtempL, N, N, N/2);   % forward the data
    banddat(ceil(i/inc), :, 2) = istft(FtempR, N, N, N/2);
end


function [ii, yy, c] = tdoa(b, win, ch)
% M.K. Alisdairi, Spring
% [i, y, c] = tdoa(band_data, window_length, channels)
% Accepts matrix 'b' (produced by bands()), which contains nb channels of
% band-limited stereo audio data. The function conducts a cross-correlation
% of left and right time windows of length 'win'.

bl = b(:,:,1);                   % break data into left and right channels
br = b(:,:,2);
nb   = size(b,1);                % number of bands
stop = floor(size(b,2)/win);     % number of windows

c = NaN*ones(2*win-1, stop);     % xcorr matrix

for k = 1:length(ch)
    j = ch(k);
    for i = 1:win:(stop*win)
        c(:, ceil(i/win)) = xcorr(bl(j, i:(i+win-1)), br(j, i:(i+win-1)))';
    end
    home
    disp(size(c))
    [y, i] = max(c);             % determine the TDOA for each frame
    yy(k,:) = y;                 % forward actual xcorr maxima
    ii(k,:) = i - win;           % forward lag indexes
end


function [ys, wgt] = extract(b, ii, person, win)
% M.K. Alisdairi, Spring
% [ys, wgt] = extract(band_data, lag_indexes, desired_person, window_size)
% Extracts the desired speaker's voice from the sound data in b.
% If person == 1, extract Guy; if person == 2, extract Inge Marie.

u = [ -8   -6.5  -7.5   NaN ;    % means for Guy (channel-four value unreadable in the source)
       6.5  7.5   7.5   6.5 ];   % means for Inge Marie
s = [  NaN  NaN   0.5   5   ;    % std. devs for Guy (channels one and two unreadable in the source)
       4    3     NaN   0.5 ];   % std. devs for Inge Marie (channel three unreadable in the source)

u = u(person, :);                % take the correct speaker's data
s = s(person, :);

w   = zeros(1, size(ii,2)*win);       % initialize short weight vector
len = min(size(b,2), length(w));
ys  = zeros(2, len);                  % initialize place for the extracted data
wgt = zeros(size(ii,1), length(w));   % initialize actual weight matrix

for ch = 1:size(ii,1)                 % look at all channels
    % Calculate the appropriate weight for each window from its lag index (Equation 5)
    wf = exp(-(ii(ch,:) - u(ch)).^2 / (2*s(ch)^2));
    w(:) = wf(ceil((1:length(w))/win));    % convert to a per-sample vector
    wgt(ch,:) = w;                         % forward the info
    ys(1,:) = ys(1,:) + b(ch, 1:len, 1).*w(1:len);   % calculate the extracted voice
    ys(2,:) = ys(2,:) + b(ch, 1:len, 2).*w(1:len);   % calculate the other channel
end
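A sketch of how the three functions above chain together to reproduce the extraction described in the body of the paper; the ShATR filename is hypothetical, and person = 1 selects Guy per extract()'s convention.

[dd, fs] = audioread('shatr_conversation.wav');   % hypothetical name for the conversation file
b = bands(dd, 4);                                 % four Bark-scaled band-limited stereo signals
[ii, yy, c] = tdoa(b, 256, 1:4);                  % per-window lag of maximum cross-correlation
[ys, wgt] = extract(b, ii, 1, 256);               % weight TF cells by Guy's lag model
soundsc(ys', fs)                                  % audition the isolated track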

APPENDIX B: ENLARGED SPECTROGRAMS

[Enlarged spectrogram depictions of the extraction results.]
