Spectral Methods for Single and Multi Channel Speech Enhancement in Multi Source Environment


Spectral Methods for Single and Multi Channel Speech Enhancement in Multi Source Environment A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY by KARAN NATHWANI to the DEPARTMENT OF ELECTRICAL ENGINEERING INDIAN INSTITUTE OF TECHNOLOGY KANPUR December, 2014

CERTIFICATE It is certified that the work contained in the thesis entitled Spectral Methods for Single and Multi Channel Speech Enhancement in Multi Source Environment submitted by Karan Nathwani has been carried out under my supervision. The results embodied in this thesis have not been submitted elsewhere for the award of any degree or diploma. December, 2014 Dr. Rajesh M. Hegde Associate Professor, Department of Electrical Engineering, Indian Institute of Technology, Kanpur

Abstract

Hands-free mobile telephony has become ubiquitous in modern living, with a wide variety of applications in day-to-day life. In this context, the acquisition of clean speech from both close and distant microphones is very important for building high quality speech based systems. Generally, the speech signal received at the microphone is degraded by background noise and room reverberation. In multi source environments, additional speech sources are present, making the problem even more challenging. These challenges have motivated research on single and multi channel speech enhancement methods. In this thesis, several new methods for single and multi channel speech enhancement are proposed. In particular, spectral methods for source separation, dereverberation, and noise cancellation are described. Additionally, methods that jointly perform the aforementioned tasks are developed. The utility of these methods is demonstrated by incorporating them into two speech based information retrieval systems. The group delay spectrum has hitherto not been used for single channel speech enhancement, although it has been widely used in feature extraction. The high resolution and robustness properties of the GDS are effectively used in this thesis to propose novel methods for source separation and dereverberation. The first method addresses the problem of source separation using the group delay cross correlation spectrum and an iterative graph cut method. The second method jointly addresses the problem of single channel source separation and dereverberation in an NMF framework. The enhancement problem is formulated herein by assuming a different RIR for each source location. The group delay spectral magnitude used in the NMF framework is shown to exhibit an accurate decomposition property. Both methods give significant improvements in the perceptual quality of the separated signals and in speech recognition performance when compared to conventional methods. Multi channel systems utilize spatial diversity, which is not present in single channel systems. Novel beamforming based spatial spectrum estimation methods for multi channel speech enhancement are proposed in this thesis. Under the fixed beamforming framework, a new reverberant speech enhancement method that utilizes the LP residual cepstrum is developed. In addition, an LCMV based spectral method is developed for joint noise cancellation and dereverberation in a beamforming framework.

This is realized as a multi channel LCMV filter that constrains both the early and late parts of the speech frame. The filter outputs are then beamformed to remove late reverberation. These methods indicate significant improvements in the perceptual quality of the separated signals and in distant speech recognition performance when compared to conventional methods. Information retrieval systems on a cell phone and in a teleconferencing environment are developed to demonstrate the effectiveness of the methods proposed in this thesis. Blind source separation (BSS) in a multi channel framework can be investigated in future as part of related work. In this context, a Bayesian approach for the separation of convolutive mixtures in the spectral domain can also be explored.

Acknowledgements

I take this opportunity to thank God for giving me the power to believe in myself. I would never have been able to do this without the faith I have in You, the Almighty. I express my deepest sense of gratitude to my thesis supervisor Dr. Rajesh M. Hegde for his constant motivation and valuable guidance. His keen interest and enthusiastic approach have helped me throughout and beyond this thesis. Regular discussions with him have always resolved bottlenecks and given this thesis a proper shape. His expert direction has taught me valuable qualities, which I will treasure throughout my life. It has been a great experience working under his guidance. I am grateful to him for making my last year at IIT Kanpur fruitful. I am thankful to the MiPS lab assistant Mr. Narendra Singh for his constant support, and to the other Ph.D. scholars for their continuous guidance, motivation and support in the tasks given to me during my training. I express my heartiest thanks to my friends Sudhir, Rupesh, Lalan, Waquar, Sandeep and Sachin for their support and help. I take this opportunity to express my heartiest thanks and affectionate gratitude to my family members. My parents Mr. Prakash Nathwani and Mrs. Nisha Nathwani receive my deepest gratitude and love for their inspiration, patience and encouragement. Additionally, my uncle, Mr. Meghraj Nathwani, has been a constant source of strong motivation during my doctoral study. I also want to thank my wife Mrs. Pallavi Nathwani and my brother Mr. Piyush Nathwani for their support and encouragement. Last but not the least, I am extremely thankful to Late Ramesh Nathwani and Late Radha Hemrajani, whose belief motivated me to pursue a doctoral degree.

6 Contents List of Figures List of Tables List of Symbols List of Abbreviations xii xvi xviii xix 1 Introduction Motivation and Scope Reverberation Modeling Effects of Acoustic Distortions on Speech Intelligibility Effects of Reverberation on Speech Intelligibility Effects of Noise on Speech Intelligibility Effects of Competing Speaker on Speech Intelligibility Classification of Speech Enhancement Methods Single Channel Speech Enhancement Multi Channel Speech Enhancement Objectives of the Thesis Contributions of the Thesis Organization of the Thesis Review of Single and Multi Channel Speech Enhancement Methods in Spectral Domain 15 vi

7 2.1 Single Channel Speech Enhancement Techniques Spectral Subtraction Minimum Mean Square Error (MMSE) based Speech Enhancement Speaker Separation in a CASA Framework Amplitude Modulation based Speaker Segregation Multi-pitch Tracking based Speaker Segregation Speaker Separation using Instantaneous Frequency Speech Enhancement using Temporal Envelope Filtering Multi Channel Speech Enhancement Techniques Speech Enhancement using Linear Prediction (LP) Residual Blind Source Separation (BSS) based Speech Enhancement Beamforming based Speech Enhancement Fixed Beamforming Adaptive Beamforming Summary Speech Enhancement Quality Measures Subjective Measures Objective Measures Objective Measures for Noise Cancellation Segmental SNR Frequency Weighted Segmental SNR SNR Loss Objective Measures for Speech Dereverberation Signal to Reverberation Ratio Log Likelihood Ratio Log Spectral Distortion Bark Spectral Distortion Objective Measures for Speaker Separation Perceptual Similarity Measure vii

8 Perceptual Evaluation of Speech Quality Target to Interference Ratio Summary Spectral Methods for Single Channel Speech Enhancement Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework Introduction Estimation of the Group Delay Function of a Speech Signal The Group Delay Cross Correlation Approach to Speaker Segregation Multi-Pitch Estimation from Mixed Speech Signals Subband Decomposition using Filter Bank Analysis The Group Delay Cross Correlation Function Harmonic Extraction and Grouping using Iterative Graph Cut Method Spectrographic Mask Generation Algorithm for Speaker Segregation using Group Delay Cross Correlation Function Experiments on Speaker Segregation Segregation of Vowels using the Group Delay Cross Correlation Function Segregation of Mixed Speech Signals using the Group Delay Cross Correlation Function Experiments on Speaker Segregation Database used Subjective Evaluation Results Objective Evaluation Results Experiments on Multi-Speaker Speech Recognition Experimental Conditions Experimental Results viii

9 4.1.8 Summary Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization Introduction System Model for Speaker Separation under Reverberant Environment Formulation of Speaker Separation Problem using Constrained Spectral Divergence Optimization Spectral Divergence Minimization for Joint Speaker Separation and Dereverberation Modified Spectral Subtraction Reconstruction of Individual Signals Incorporating the Group Delay Spectral Magnitude in the Proposed Framework Computing the Spectral Magnitude from Group Delay Function High Resolution and Robustness Properties of Group Delay Spectral Magnitude Accurate Decomposition of Group Delay Subband Envelope Algorithm for Joint Speaker Separation and Dereverberation Spectrographic Analysis Experiments on Speaker Separation Experimental Conditions Subjective Evaluation Results Objective Evaluation Results Evaluation of Target to Interference Ratio Experiments on Speech Dereverberation Objective Evaluation Results Statistical Experiments using One Way ANOVA Experiments on Distant Speech Recognition Summary Discussion ix

10 5 Spectral Methods for Multi Channel Speech Enhancement Multi Channel Reverberant Speech Enhancement using LP Residual Cepstrum Introduction Linear Prediction Analysis of Reverberant Speech Multi Channel Speech Enhancement using LP Residual Cepstrum in Fixed Beamforming Framework Single Channel Speech Dereverberation Temporal Averaging The MC-LPRC Algorithm for Speech Dereverberation Spectrographic Analysis Performance Evaluation Experimental Conditions Subjective and Objective Evaluation Experimental Results on Distant Speech Recognition Summary Joint Noise Cancellation and Dereverberation using Multi Channel LCMV Filter Introduction Problem Formulation Multi Channel LCMV Filter for Noise Cancellation and Speech Dereverberation Suppression of Noise and Early Reverberation Spectrographic Analysis Performance Evaluation Experimental Conditions Experimental Results on Noise Cancellation and Speech Dereverberation Experimental Results on Distant Speech Recognition Summary Discussion x

11 6 Application of Speech Enhancement in the Development of Information Retrieval Systems Application of Single Channel Speaker Segregation Method in Multi-media Information Retrieval Design of a Meeting Capture and Audio Archiving System Speaker Demography in the Audio Archives Design of a Cell Phone based Multi-media Information Retrieval System Application of Multi Channel Speech Enhancement Method for Audio Retrieval in Teleconferencing Environment Experimental Setup for Audio Archiving in Teleconferencing Environment Experiments on Audio Retrieval in Teleconferencing Environment Audio Data Retrieval in Active Meetings Experimental Results for Keyword Retrieval Summary Conclusions and Future Scope Conclusions Future Scope References 139 Publications Related to Thesis Work 157 xi

12 List of Figures 1.1 Audio Scene comprising multiple sources such as noise, reverberations and interfering speaker components The cocktail party environment containing multiple speakers Diagram illustrating the AIR of a closed room. The signal beyond 15 ms is considered to be the reverberant components Time domain representation (top) and narrowband spectrographic representation (bottom) of a clean speech signal containing sentence \bin blue at d one now\ taken from GRID database Time domain representation (top) and narrowband spectrographic representation (bottom) of a reverberant version of a clean speech signal containing sentence \bin blue at d one now\ Block diagram of a simplified Computational Auditory Scene Analysis (CASA) system A schematic structure for temporal envelope filtering Block digram of K sub-band temporal envelope filtering Block diagram illustrating a LCMV beamformer using a GSC structure Block diagram illustrating sequence of steps for speaker segregation using the group delay cross correlation function An undirected graph, G, with 9 vertices. The graph can be clearly divided into two sub-graphs, with nodes 1 through 4 falling in one sub-graph and nodes 5 through 9 in the other sub-graph xii

13 4.3 The two sub-graphs obtained after first iteration of graph cut method The final undirected graph obtained by iterative graph cut method Illustration of average correlation distribution variance between two pairs of channels using 1-D Projection method and Graph Cut method Spectrograms of the mixed sound signal, the reference target signal (above), the reconstructed signals with the application of mask, and without the application of the mask (below) when the TIR is 0 db Spectrograms of the mixed sound signal, the reference target signal (above), the reconstructed signal with the application of mask, and without the application of the mask (below) when the TIR is -6 db Comparative performance of the various algorithms for quality of reconstructed speech (target speaker) Comparative performance of the various algorithms for quality of separated speech (interfering speaker) Comparison of SNR Loss of the reconstructed target speaker for various methods The system model for reverberation for two sources mixed at a single microphone in subband envelope domain under noise Comparison of Fourier transform spectral magnitude (d), GDSM (e) and cepstrally smooth version of FTSM (f) for the system shown in (a) Comparing the robustness property of group delay spectral magnitude with Fourier transform spectral magnitude Comparison of average error in decomposition of observed subband envelope computed from group delay spectral magnitude and Fourier transform spectral magnitude Block diagram of the joint speaker separation and dereverberation method using GDSM Spectrograms of the target signal (above), the mixed reverberated signal (middle) and the reconstructed target signal (below), when the TIR is 0 db and DRR=-3 db Variation in output TIR versus input TIR for various methods xiii

14 4.18 Comparison of PESQ scores for various methods at different DRR Percentage increase in WER with increase in distance between the source and microphone Comparison of the spectrograms of clean and reverberated speech obtained from FFT and LP analysis. FFT spectrogram (Top row) and LP spectrogram (Bottom row) Block diagram of the multi channel speech enhancement method using LP residual cepstrum Illustration of remaining spurious peaks after single channel speech dereverberation. LP residual of (a) clean speech (b) reverberated speech (c) dereverberated speech A Tukey window for one larynx cycle with ψ = Spectrographic analysis of (a) clean speech (b) reverberated speech and (c) dereverberated speech using multi channel LP residual cepstrum method Comparison of LSD, BSD and SRR of various methods at different DRR s for TIMIT ((a)-(c)) and MONC ((d)-(f)) database respectively Comparison of the word error rate for various speech enhancement methods as a function of the distance between source and microphone Block diagram illustrating the joint noise cancellation and dereverberation method using a multi channel LCMV filter Spectrographic analysis of clean (Top), reverberant at DRR = -3dB (Middle), and the dereverberated speech signal (Bottom) obtained from the proposed method Variation in WER for various methods with increase in distance between source and microphone array (a) Screen shot of the interface for multi-media information retrieval system and (b) Photograph of the MIRS working on a cell phone The block diagram for data collection over T1 digital lines in a teleconferencing environment xiv

15 6.3 Flow diagram for archiving meeting audio data over VOIP Block diagram of the audio retrieval system in teleconferencing environment Flow diagram of data retrieval system on T1 digital lines. The filled circles represent active set of microphones (MA-R) and unfilled circles represent set of inactive microphones (MA-C) xv

16 List of Tables 4.1 Possible choices in the sentences of the GRID corpus Comparison of the reconstruction algorithms for GQ, TP, OSS and ANS in terms of mean and standard deviation Mean PSM and PSMt scores for the proposed method (GDCC) and other conventional methods at various TIR values Comparison of the word error rate (%) for various methods at several TIR values Mean error in decomposition (%) for the subband envelope of group delay spectral magnitude and Fourier transform spectral magnitude Comparison of mean opinion score for various methods in terms of GQ, TP and OSS Mean PSM and PSMt scores for the proposed method (GDCC) and other methods at various TIR values Experimental results of speech dereverberation using objective measures (LSD, SRR and BSD) on GRID Database Comparison of one way ANOVA test results for dereverberated target signals at various DRRs Variation in WER (%) for all the methods with increase in distance between source and microphone Comparison of mean opinion scores on the TIMIT and MONC database for various methods xvi

17 5.2 Percentage increase in WER with increase in distance between the source and microphone for various methods Experimental results on noise cancellation using segmental SNR as measure on the TIMIT database Experimental results on speech dereverberation using various methods on TIMIT database Percentage increase in WER for various methods with the varying distance from source to microphone Comparison of percentage accuracy (correct recognition of selected keyword) for various methods when distance between microphone array and speaker is one meter xvii

List of Symbols

γ  Flooring factor
ζ  Domain of spectral subtraction
η  Time frequency subtraction factor
φ  Covariance matrix of a signal
B 0  Zeroth order Bessel function
B 1  First order Bessel function
l  Delay for the l th microphone
ν  Step size window in short time analysis
α i  Real part of the pole of the i th resonator
β i  Imaginary part of the pole of the i th resonator
θ  Phase spectrum of a signal
τ g (k 0 )  Group delay for the frequency band k 0
C(k 0 , k 1 )  Covariance of the group delay for the frequency bands with indexes k 0 and k 1
R  Affinity matrix
m(i, j)  Weight on edge between node i and j
κ  Binary weight vector
D  Sampling period
ɛ(k, m)  Reconstruction error
β 1  Learning rate parameter for subband envelope of speaker L(m, k)
λ 1  Weight enforcing sparsity constraint for L(m, k)
J  Duration of reverberation
ω  Frequency index
σ 2  Variance of a signal
ψ  Taper ratio of the window
O  Length of one larynx cycle
I  Neighboring windows
ρ  Correlation matrix of a spectrum

List of Abbreviations

GDF  Group Delay Function
GDS  Group Delay Spectrum
CASA  Computational Auditory Scene Analysis
BSS  Blind Source Separation
MMSE  Minimum Mean Square Error
STFT  Short Time Fourier Transform
LPRC  Linear Prediction Residual Cepstrum
ITU  International Telecommunications Union
DOA  Direction of Arrival
SOI  Signal of Interest
DSB  Delay and Sum Beamforming
MVDR  Minimum Variance Distortionless Response
LCMV  Linearly Constrained Minimum Variance
GSC  Generalized Sidelobe Canceller
MOS  Mean Opinion Score
SRR  Signal to Reverberation Ratio
SNR  Signal to Noise Ratio
LSD  Log Spectral Distortion
BSD  Bark Spectral Distortion
PESQ  Perceptual Evaluation of Speech Quality
PSM  Perceptual Similarity Measure
TIR  Target to Interference Ratio
FR  Fourier Transform
ASR  Automatic Speech Recognition
GDCC  Group Delay Cross Correlation
SHR  Subharmonic to Harmonic Ratio

OSE  Observed Subband Envelope
TSE  True Subband Envelope
NMF  Non Negative Matrix Factorization
WER  Word Error Rate
FTSM  Fourier Transform Spectral Magnitude
GDSM  Group Delay Spectral Magnitude
MSS  Modified Spectral Subtraction
OLA  Overlap and Add
DRR  Direct to Reverberation Ratio
GQ  Global Quality
TP  Target Preservation
OSS  Other Signal Suppression
AIR  Acoustic Impulse Response
DYPSA  Dynamic Programming Projected Phase Slope Algorithm
GCI  Glottal Closure Instant

Chapter 1 Introduction

This chapter deals with the challenges and issues related to speech enhancement. The scope and motivation for speech enhancement in the spectral domain are first described. This is followed by the objectives and contributions of the thesis. The chapter ends with the organization of this thesis.

1.1 Motivation and Scope

The rapidly growing market for speech communication systems has been the prime motivation for this thesis. In general, speech communication systems can be categorized into hands free communication systems, voice controlled systems and hearing aids. Hands free communication systems are widely used in scenarios where limited use of the hands is desired, such as hands free driving and personal navigation systems, where Bluetooth is typically used for communication. Voice controlled systems are used in operating theaters by doctors and nurses to move freely around patients. Hearing aids are typically used by the wearer to amplify sound and make speech more intelligible. In all of the above speech communication systems, the speech source is at a considerable distance from the microphone in a room. The microphone is assumed to be ideal in this thesis, in the sense that its electrical output is equivalent to the local sound pressure. Acoustic signals radiated within a room are linearly distorted by reflections from walls and other objects. Apart from these reflections, background noise and other interferences are

also present, as shown in Figure 1.1.

Figure 1.1: Audio scene comprising multiple sources such as noise, reverberation and interfering speaker components.

The main difference between noise and reverberation is that, in the case of reverberation, the degrading component is dependent on the desired signal. On the other hand, in the case of noise, it can be assumed that the degrading components are independent of the desired signal. These distortions degrade the fidelity and intelligibility of the speech signal. Additionally, the recognition performance of automatic speech recognition (ASR) systems is also affected by these distortions. Reverberation degrades speech intelligibility due to the effect of overlap masking, in which segments of an acoustic signal are affected by reverberation components of previous segments. Early reflections mainly contribute coloration, or spectral distortion, while late reflections contribute noise-like perceptions or tails to speech signals [1]. These late reflections are called late reverberation in the literature. Furthermore, reverberation causes blurring of speech phonemes due to the spread in the time of arrival of reflections at the microphone. In the case of interference generated by a competing speaker, the removal of the competing speaker components is the most challenging, since there exists a high correlation between the temporal structures of the target and interfering speakers, resulting in low separation accuracy.

This problem is known as the cocktail party effect, where multiple speakers are present as shown in Figure 1.2.

Figure 1.2: The cocktail party environment containing multiple speakers.

Hearing aid users often complain of being unable to distinguish one voice from another in a crowded room. This is due to spectral coloration, late reverberation, noise and interference from other speakers. In order to counteract the distortions caused by reverberation, background noise and other interferences, acoustic signal enhancement techniques are required. These techniques increase the perceptual speech quality for listeners and also improve the recognition accuracy of ASR systems. Hence, reducing the detrimental effects caused by these distortions is the prime focus of this thesis. The enhancement of speech intelligibility is thus quite essential for the future development of applications with hands-free speech acquisition. These challenges have also motivated the development of distant speech recognition (DSR) systems.

1.2 Reverberation Modeling

Reverberation is described by the notion of multi path reflection, which is considered to be the major challenge in speech enhancement. The problem of reverberation occurs when the distance between the speaker and the microphone is large. This creates multiple paths for the speech signal to arrive at the microphone. Each wavefront arrives at the microphone with a different amplitude and phase. This is due to the different propagation path lengths to the microphone and also to differences in the amount of sound energy absorbed by the walls and other objects.

Figure 1.3: Diagram illustrating the AIR of a closed room. The signal beyond 15 ms is considered to be the reverberant component.

A typical acoustic impulse response (AIR) is shown in Figure 1.3. The direct path propagation from the source to the microphone involves a slight delay depending on the distance between them. This is manifested in the AIR as an initial region where the amplitude is almost zero. After this initial region, the strong peaks corresponding to direct path propagation are received. The sound radiated by the source is also reflected from the surrounding walls and other objects and is received at the microphone after additional delays. These reflected sounds are typically separated in both time and direction of arrival, and are called early reverberation components. The early reverberation components show large variation when the source or the microphone moves significantly; information regarding the size of the space and the position of the source in the space can thus be obtained from them. The samples corresponding to the first 50 ms of the impulse response are classified as early reflections [2]. The early reflection region has well defined peaks. Late reverberation components result from reflections which reach the microphone with large delays, after the direct sound has arrived. These late reflections are perceived either as echoes or as reverberation.
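The early/late decomposition and the notion of reverberation time discussed in this section can be summarized by a short sketch. The following Python fragment is an illustration only, not part of the thesis: the 50 ms boundary after the direct-path peak, the sampling rate, the Schroeder backward-integration fit range and the function names are assumptions made here. It splits an AIR into early and late parts, computes a direct-to-reverberation style energy ratio, and estimates RT60 from the energy decay curve.

    import numpy as np

    def split_air(h, fs, early_ms=50.0):
        """Split an acoustic impulse response into early (first 50 ms after the
        direct-path peak) and late parts."""
        n_direct = int(np.argmax(np.abs(h)))          # direct-path arrival
        n_early = n_direct + int(early_ms * 1e-3 * fs)
        return h[:n_early], h[n_early:]

    def drr_db(h, fs, early_ms=50.0):
        """Ratio of early to late energy in dB."""
        early, late = split_air(h, fs, early_ms)
        return 10.0 * np.log10(np.sum(early**2) / (np.sum(late**2) + 1e-12))

    def rt60_schroeder(h, fs, decay_db=(-5.0, -35.0)):
        """Estimate RT60 by fitting a line to the Schroeder energy decay curve
        between -5 dB and -35 dB and extrapolating to -60 dB."""
        edc = np.cumsum(h[::-1] ** 2)[::-1]           # backward integration
        edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)
        hi, lo = decay_db
        idx = np.where((edc_db <= hi) & (edc_db >= lo))[0]
        slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)   # dB per second
        return -60.0 / slope

    # Toy AIR: exponentially decaying noise (for illustration only).
    fs = 16000
    t = np.arange(int(0.5 * fs)) / fs
    h = np.random.randn(t.size) * np.exp(-t / 0.1)
    h[0] = 1.0
    print(drr_db(h, fs), rt60_schroeder(h, fs))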

Samples beyond 50 ms are classified as late reverberation components. In the context of speech dereverberation, the separate delayed impulses in the AIR correspond to early reflections, while the late reflections appear as a continuum. Additionally, it should be noted that the energy of the reflections decays at an exponential rate, which is a well-known property of the AIR. This has motivated the notion of reverberation time. The reverberation time quantifies the severity of reverberation within a room and is denoted by RT 60. It is defined as the time taken by the sound energy to decay by 60 dB after the source is switched off.

1.3 Effects of Acoustic Distortions on Speech Intelligibility

In this section, the effects of the distortions caused by reverberation, ambient noise and interference from a competing speaker on speech intelligibility are discussed.

Effects of Reverberation on Speech Intelligibility

Figure 1.4: Time domain representation (top) and narrowband spectrographic representation (bottom) of a clean speech signal containing the sentence \bin blue at d one now\ taken from the GRID database.

Reverberant speech can be defined as speech containing a noticeable amount of echo and coloration. The effects of reverberation on speech are clearly visible in a spectrogram and can also be observed perceptually on listening. Figure 1.4 illustrates the time domain waveform and narrowband spectrogram of a clean speech signal. The speech signal corresponding to the sentence \bin blue at d one now\ of the GRID corpus [3] is used in generating the spectrogram. In the clean spectrogram, the resonance frequencies associated with the vocal tract correspond to the speech formants. It can also be seen from the spectrogram that the phonemes are well separated in time. The time domain waveform and narrowband spectrogram of the reverberant version of this speech signal are shown in Figure 1.5. It can be clearly seen from Figure 1.5 that the speech signal is severely distorted by the acoustic channel.

Figure 1.5: Time domain representation (top) and narrowband spectrographic representation (bottom) of a reverberant version of a clean speech signal containing the sentence \bin blue at d one now\.

These distortions take the form of blurring of the speech formants visible in the spectrogram. Additionally, the smearing of the phonemes in time is clearly visible in both the spectrogram and the time domain waveform. Due to this smearing, the gaps between words and syllables are generally filled by reverberation, and phonemes subsequently appear to overlap. These distortions result in a significant difference between clean and reverberant speech in terms of speech intelligibility and fidelity.

In addition to reduced speech intelligibility, the performance of automatic speech recognition is also reduced due to reverberation.

Effects of Noise on Speech Intelligibility

In general, noise is assumed to be additive in nature, unlike reverberation, and it affects the intelligibility of speech in a different manner. Noisy speech signals typically contain ambient noise and other interfering signals. In addition, speech can also be corrupted by imperfections in the frequency or temporal response of the communication channel. It is observed that weak consonants are in general masked by noise to a greater degree than the higher intensity vowels. However, unlike reverberation, noise masking is independent of the energy of the preceding segments. This distorts the information present in the speech, and the speech recognition task in noisy environments becomes particularly challenging. It is observed that listeners with hearing loss have greater difficulty in perceiving speech in background noise. A standard way to quantify speech intelligibility in noise is the speech reception threshold (SRT) [4]. The SRT is defined as the mixture signal to noise ratio required to achieve a certain intelligibility score, generally 50%. A hearing impaired person requires a 3-6 dB higher SNR compared to normal hearing listeners for understanding the same level of speech. Speech shaped noise (SSN), which is considered to be the most annoying among noise types, is defined as a steady noise whose long term spectrum matches that of natural speech. In the case of SSN, the SRT for hearing impaired listeners increases from 2.5 dB to 7 dB [5]. Thus, hearing impaired listeners have to cope with degraded temporal envelope information and poor spectral resolution, which results in a poor level of speech understanding in the presence of noise. Present hearing aids improve the audibility and comfort level of noisy speech. However, their ability to improve the intelligibility of noisy speech is still very limited [6].

Effects of Competing Speaker on Speech Intelligibility

Speech intelligibility and recognition are severely affected by the presence of different types of interference. These interferences include white noise, colored noise, background noise,

speech babble, competing speech and reverberation. Competing speech is considered to be the most challenging among the different types of interference. The high correlation between the temporal structures of the target speech and the interfering speech is one strong reason for poor recognition accuracy. Competing speech is one of the most commonly occurring types of interference in daily human communication. A popular example of competing speaker interference is speech acquired during television debates. During news reading, the speech of the news anchor is mixed with speech from background speakers, resulting in degradation of speech intelligibility. Another example is multiple speakers talking simultaneously in a teleconferencing environment. In such examples, the target speech is generally affected by the interfering speech. For hearing impaired listeners in the presence of competing speech, an increase in SRT is required compared to normal hearing listeners. It is observed that for typical speech materials, a 1 dB increase in SRT leads to a 7%-19% reduction in the percent correct score [4]. This indicates that a competing speaker affects speech intelligibility the most when compared to other distortions. Speaker separation by machines still performs poorly in terms of recognizing the combined speech correctly. Human beings, however, have an innate ability either to extract the target speech after suppressing the interfering speech sources or to extract both with reasonably good recognition accuracy. The aforementioned discussion thus motivates the development of speech enhancement algorithms for reducing the effects of reverberation, noise and competing speakers on clean speech. The classification of speech enhancement methods is described in the ensuing section.

1.4 Classification of Speech Enhancement Methods

The classification of speech enhancement methods is discussed in this section. It is generally difficult for a particular algorithm to perform homogeneously across all types of distortions. Hence, speech enhancement methods require certain assumptions and constraints, which are generally dependent on the specific application and on the environment in which the method is used.

In general, there are many factors on which the performance of a speech enhancement algorithm depends. One factor is the number of interfering sources in the multi source environment. In addition, the a priori information assumed about the signal of interest or the corrupting signal can also affect the performance of an enhancement algorithm. Another factor is the limitation on the time variations allowed for the corrupting signal. A final factor is model based limitations, such as the restriction of an algorithm to uncorrelated noise. In general, speech enhancement methods can be classified in a number of ways. One way is based on single versus multiple input channels. They can also be classified based on time versus frequency domain processing. A third way of classification is based on adaptive versus non adaptive algorithms. In this thesis, the classification based on the number of input channels is used. A brief overview of single and multi channel speech enhancement is given below.

Single Channel Speech Enhancement

In most real time speech based applications, a second channel is generally not available. Such systems are easy to build due to lower hardware requirements. Moreover, single channel systems are comparatively less expensive than multiple input systems. In the context of noise cancellation, the single channel case constitutes the most difficult situation for speech enhancement: no reference signal for the noise is available and the clean speech cannot be pre-processed prior to being affected by the noise. Several single channel speech enhancement methods are available in the literature, such as Wiener filtering [7], spectral subtraction [8] and cepstral inverse filtering [9]. Such single channel systems utilize different statistics of speech and noise. These systems also assume that the noise is stationary during speech intervals. Thus, the performance of single channel methods degrades drastically at lower signal to noise ratios. A detailed description of such single channel methods is presented in Chapter 2.

Multi Channel Speech Enhancement

Single microphone systems only utilize the temporal and spectral diversity of the received signal. Reverberation also induces spatial diversity, and to exploit this additional diversity, multiple microphones should be used. Thus, beamforming based spatial spectrum estimation techniques have been used in the literature for multiple microphone speech enhancement. In the context of noise cancellation, multi channel systems make use of multiple signal inputs to the system and a noise reference in an adaptive noise cancellation device. Moreover, a multi channel system utilizes phase alignment to reject undesired noise components. Thus, by exploiting the spatial properties of the signal and the noise source, the non-stationarity of noise can be better addressed, overcoming the limitations inherent in single channel systems. Multi channel systems are more complex in structure and more expensive due to the increase in hardware requirements. However, they show better speech enhancement results compared to single channel systems. In the literature, blind speaker separation, linear prediction residual and beamforming methods are used for multi channel speech enhancement; these are explained in detail in Chapter 2.

1.5 Objectives of the Thesis

A large body of work on single and multi channel speech enhancement exists in the literature. However, speech enhancement in a multi source environment has not received a lot of attention. Also, speech distortions caused by reverberation and noise are generally dealt with separately in the literature. The objectives of this thesis are to reduce the effects of reverberation, noise and interference from competing speakers in single and multi channel speech based systems. More specifically, this thesis tries to address the issues of speaker separation, speech dereverberation and noise cancellation both separately and in a joint fashion. The thesis also investigates the application of such speech enhancement methods in the development of multi-media information retrieval systems.

1.6 Contributions of the Thesis

The main contributions of this thesis are in the development of spectral methods for single and multi channel speech enhancement. In particular, spectral methods for speaker separation, dereverberation, and noise cancellation are proposed. Additionally, methods that jointly perform the aforementioned tasks are also developed.

Spectral Methods for Single Channel Speech Enhancement: The group delay spectrum has hitherto not been used for single channel speech enhancement, although it has been widely used in feature extraction. The high resolution and robustness properties of the GDS are effectively used in this thesis to propose two novel methods for speaker separation and dereverberation, which are outlined below.

Single Channel Speaker Separation using Group Delay Spectrum: In this method, the problem of speaker separation is addressed using the group delay cross correlation (GDCC) spectrum and an iterative graph cut method. The group delay spectral estimates are first computed over frequency subbands after passing the speech signal through a bank of filters (a minimal sketch of the group delay computation is given below). The filter bank spacing is based on a multi-pitch algorithm [10] that computes the pitch estimates of the competing speakers. An affinity matrix is then computed from the group delay spectral estimates of each frequency subband. This affinity matrix represents the correlations of the different subbands in the mixed speech signal. The grouping of the correlated harmonics present in the mixed speech signal is then carried out using a new iterative graph cut method. The respective harmonic groups are then utilized to reconstruct the individual speakers in the mixed speech signal. Post processing is then performed using spectrographic masks [11] to obtain separated signals with improved perceptual quality.
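As a point of reference for the group delay based methods summarized here, the following Python sketch computes the group delay function of a short-time frame using the standard DFT identity τ(k) = (X_R(k) Y_R(k) + X_I(k) Y_I(k)) / |X(k)|², where Y is the DFT of n·x(n). It is an illustrative sketch only; the window, FFT size and the small regularization term are assumptions made here and this is not the exact implementation used in the thesis.

    import numpy as np

    def group_delay_function(frame, n_fft=512, eps=1e-8):
        """Group delay function of a windowed speech frame.

        Uses the identity tau(k) = (X_R Y_R + X_I Y_I) / |X|^2,
        where X = DFT{x(n)} and Y = DFT{n x(n)}.
        """
        x = frame * np.hamming(len(frame))        # analysis window (assumed)
        n = np.arange(len(x))
        X = np.fft.rfft(x, n_fft)
        Y = np.fft.rfft(n * x, n_fft)
        tau = (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)
        return tau                                # one value per frequency bin

    # Example: group delay spectrum of a synthetic voiced-like frame.
    fs = 8000
    t = np.arange(256) / fs
    frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
    tau = group_delay_function(frame)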

Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization: The problem of single channel speaker separation and dereverberation is jointly addressed in this method in an NMF framework. The enhancement problem is formulated herein by assuming a different RIR for each source location. The divergence [12] between the observed and true spectral subband envelopes is minimized along with certain non-negative constraints to obtain the enhanced subband envelopes. Additionally, the joint speaker separation and dereverberation framework described herein utilizes the spectral subband envelope obtained from the group delay spectral magnitude (GDSM) [13]. In order to obtain the spectral subband envelope from the GDSM, the equivalence of the magnitude and the group delay spectrum via the weighted cepstrum is used. Since the subband envelope of the group delay spectral magnitude is robust and has a high spectral resolution, less error is noted in the NMF decomposition. Late reverberation components present in the separated signals are then removed using a modified spectral subtraction technique.

Spectral Methods for Multi Channel Speech Enhancement: Multi channel systems utilize spatial diversity, which is not available in single channel systems. Novel beamforming based spatial spectrum estimation methods for multi channel speech enhancement are proposed in this work and are outlined below.

Speech Dereverberation using LP Residual Cepstrum in a Fixed Beamformer: Under the fixed beamforming framework, a new reverberant speech enhancement method that utilizes the LP residual cepstrum is developed. The method deconvolves the acoustic impulse response from each microphone output in the cepstral domain. The deconvolution of the acoustic impulse response from the reverberated signal in each individual channel removes early reverberation [14]. The dereverberated output from each channel is then spatially filtered using a delay and sum beamformer (DSB). The late reverberation components are then removed by temporal averaging of the glottal closure instants (GCI) computed using the dynamic programming projected phase-slope algorithm (DYPSA) [15]. The GCIs obtained herein correspond to the LP residual peaks. These residual peaks are excluded from the averaging process, since they have a significant impact on speech quality and should remain unmodified.

Joint Speech Dereverberation and Noise Cancellation using Multi Channel LCMV Filter in a Fixed Beamforming Framework: Speech acquired

from an array of distant microphones is affected by ambient noise and reverberation. Single channel linearly constrained minimum variance (LCMV) filters have been proposed in the literature to remove ambient noise. In this work, an LCMV based spectral method is developed for joint noise cancellation and dereverberation in a beamforming framework. This is realized as a multi channel LCMV filter that constrains both the early and late parts of the speech frame. The notion of inter frame correlation [16] is utilized at each microphone to remove early reverberation and noise components. The filter outputs are then beamformed to remove late reverberation [17].

1.7 Organization of the Thesis

The thesis is organized as follows. Chapter 2 reviews the different single and multi channel speech enhancement techniques in the spectral domain. The single channel speech enhancement techniques described in this chapter are spectral subtraction, temporal envelope filtering, the minimum mean square error technique and speech enhancement in a computational auditory scene analysis (CASA) framework. Multi channel speech enhancement techniques such as linear prediction residual based enhancement, blind speaker separation and beamforming based speech enhancement are also discussed. Chapter 3 describes the various measures used for determining the quality of speech enhancement techniques. These include the measures used for noise cancellation, speech dereverberation and speaker separation. Chapter 4 provides a description of single channel speech enhancement using group delay based spectrum estimation techniques. The first part of this chapter presents the group delay cross correlation function for single channel speaker segregation. The second part deals with joint blind speaker separation and dereverberation using group delay spectral divergence optimization. The performance evaluation of each part is discussed in its respective section.

Chapter 5 describes the novel beamforming based multi channel speech enhancement methods proposed in this thesis. The first part of this chapter uses the linear prediction residual cepstrum in a fixed beamforming framework to obtain speech dereverberation. A joint noise cancellation and speech dereverberation method is then proposed using a multi channel linearly constrained minimum variance filter based spectral estimation technique. The multi channel LCMV filter output is then beamformed using a delay and sum beamformer to remove late reverberation components. Chapter 6 discusses the development of two applications that utilize the single and multi channel speech enhancement methods proposed in this thesis. Information retrieval systems on a cell phone and in a teleconferencing environment are described in this chapter. Chapter 7 concludes the thesis. The future scope of the methods proposed in this thesis is also detailed.

Chapter 2 Review of Single and Multi Channel Speech Enhancement Methods in Spectral Domain

In this chapter, various spectral domain speech enhancement techniques for the single and multi channel scenarios are discussed. In the literature, speech enhancement methods deal with noise cancellation, speech dereverberation and speaker separation separately; the problems of dereverberation, noise cancellation and speaker separation have not been addressed jointly in earlier work. A review of single and multi channel speech enhancement methods follows.

2.1 Single Channel Speech Enhancement Techniques

A review of various spectral domain speech enhancement methods is presented in this section. The techniques described here are used for denoising, dereverberation and speaker separation, and use a single microphone for acquisition of the speech signal.

Spectral Subtraction

Steven Boll proposed the spectral subtraction technique [8] for single channel noise reduction. This technique can be considered as a baseline for comparing novel speech enhancement methods. The spectral subtraction method estimates the clean speech spectrum by subtracting an estimate of the noise spectrum from the noisy speech spectrum. In general, spectral subtraction can be performed either in the magnitude spectral domain or in the power spectral domain. The basic assumption made herein is the statistical independence of the noise and the speech signal, and an additive noise model is considered. In general, a noisy speech signal under AWGN is represented as

y(n) = s(n) + v(n)    (2.1)

Here, y(n) is the noisy speech signal, s(n) is clean speech and v(n) is the AWGN added to the clean speech. Equation 2.1 can be written in the short time Fourier transform (STFT) domain as

Y(k, m) = S(k, m) + V(k, m)    (2.2)

Here, Y(k, m), S(k, m) and V(k, m) are the STFTs of y(n), s(n) and v(n) respectively at time frame m and frequency bin k ∈ {0, 1, ..., K − 1}, where K is the total number of frequency bins in each frame. The spectral domain subtraction is then derived from [7] and [8] as

|Ŝ(k, m)|^ζ = |Y(k, m)|^ζ − η(k, m)|V̂(k, m)|^ζ,   if |Y(k, m)|^ζ − η(k, m)|V̂(k, m)|^ζ > γ|V̂(k, m)|^ζ
|Ŝ(k, m)|^ζ = γ|V̂(k, m)|^ζ,   otherwise    (2.3)

Here, V̂(k, m) represents an estimate of the noise spectral magnitude obtained during periods of silence, and Ŝ(k, m) is the estimate of the clean speech spectrum. If ζ = 1, magnitude spectral subtraction is used, and ζ = 2 corresponds to the power spectral domain. The parameter η(k, m) is a time frequency dependent subtraction factor introduced to compensate for under or over estimation of the instantaneous noise spectrum.
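The procedure in Equations 2.1-2.3 can be illustrated with a short sketch. The Python fragment below is a minimal illustration, not the thesis implementation: the frame length and overlap, the use of the first few frames as the silence region, the magnitude-domain choice ζ = 1 and the values η = 2 and γ = 0.01 are all assumptions made here. It performs spectral subtraction and resynthesizes the signal with the noisy phase by overlap-add.

    import numpy as np
    from scipy.signal import stft, istft

    def spectral_subtraction(y, fs, eta=2.0, gamma=0.01, zeta=1.0, n_silence=10):
        """Basic spectral subtraction (Eq. 2.3) with noisy-phase resynthesis."""
        f, t, Y = stft(y, fs=fs, nperseg=512, noverlap=384)
        mag, phase = np.abs(Y) ** zeta, np.angle(Y)

        # Noise magnitude estimate from the first few (assumed silent) frames.
        noise = np.mean(mag[:, :n_silence], axis=1, keepdims=True)

        # Over-subtraction with spectral flooring (Eq. 2.3).
        sub = mag - eta * noise
        s_mag = np.where(sub > gamma * noise, sub, gamma * noise) ** (1.0 / zeta)

        # Reconstruct with the noisy phase and the inverse STFT.
        _, s_hat = istft(s_mag * np.exp(1j * phase), fs=fs, nperseg=512, noverlap=384)
        return s_hat

    # Example with synthetic noisy speech (a tone preceded by silence, plus white noise).
    fs = 16000
    t = np.arange(fs) / fs
    clean = np.concatenate([np.zeros(fs // 4), 0.5 * np.sin(2 * np.pi * 300 * t)])
    noisy = clean + 0.1 * np.random.randn(clean.size)
    enhanced = spectral_subtraction(noisy, fs)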

The optimal value of η(k, m), which uses an SNR weighted subtraction factor, is explained in [7]. Several methods for obtaining the optimal subtraction factor have been described in [18]. It is sometimes possible for |Ŝ(k, m)|^ζ to become negative when the instantaneous signal spectrum becomes smaller than the estimated noise spectrum. In order to avoid such over subtraction, the factor γ is used in Equation 2.3. γ is also known as the flooring factor [7, 19] and is generally taken between 0 and 1. The flooring factor can be applied to the estimated noise spectrum [7] or to the instantaneous noisy speech spectrum [20, 21] to set the highest level of signal attenuation. Several modifications to the spectral subtraction technique have been proposed in the literature. Recently, the number of subtraction parameters has been reduced as in [18, 22]. This reduction is obtained by using predetermined frequency bands, and the resulting methods are known as multi band spectral subtraction methods. They have shown reasonable improvement in noise attenuation compared to spectral subtraction and are more useful for automatic speech recognition applications. In order to reconstruct the signal, the enhanced spectral magnitude obtained after spectral subtraction is modulated by the corresponding phase. It has been shown in [23] that the noisy speech phase spectrum is a good estimate of the clean speech phase spectrum. Thus, the enhanced time domain signal can be obtained by applying the inverse STFT.

Minimum Mean Square Error (MMSE) based Speech Enhancement

The MMSE technique is based on modeling the speech and noise spectral components as statistically independent variables. The MMSE estimator does not assume a linear relationship between the observed and estimated spectral information. However, it assumes that the statistical distributions of the speech and noise magnitude spectra are known. The signal model used in this speech enhancement technique is the same as that of the spectral subtraction method shown in Equation 2.1. In the MMSE estimation technique, a Bayesian approach is used to determine the clean speech amplitude. Additionally, a Gaussian distribution is assumed for both the speech and noise spectral magnitudes. Spectral amplitude estimation using the MMSE technique can be found in [24]. The derivation of the spectral estimator using the MMSE technique, which

yields the spectral gain function denoted by F(·), is given below:

F(c1(k, m), c2(k, m)) = (√π / 2) · (√P(k, m) / c2(k, m)) · exp(−P(k, m)/2) · [(1 + P(k, m)) B0(P(k, m)/2) + P(k, m) B1(P(k, m)/2)]    (2.4)

Here, B0(·) and B1(·) are the zeroth and first order Bessel functions, and P(k, m) is defined as

P(k, m) = [c1(k, m) / (c1(k, m) + 1)] · c2(k, m)    (2.5)

In Equation 2.4, c1(k, m) and c2(k, m) correspond to the a priori and a posteriori SNRs respectively. The MMSE estimator also relies on the a posteriori SNR, which plays an important role in removing musical artifacts [25], although its effect on the overall attenuation is less than that of the a priori SNR. An important aspect of the MMSE estimator is its dependence on estimates of the variance of the noise spectral magnitude and of the a priori SNR. The variance of the noise is calculated during periods of silence, as done for spectral subtraction. The maximum likelihood approach [24] and decision directed approaches [24, 26] are used to compute the a priori SNR. In [24], the authors have also derived an MMSE estimator for the phase spectrum. However, with phase spectrum estimation it is observed that the magnitude and phase estimators cannot both be optimal simultaneously; when the magnitude estimator is kept optimal, the phase of the noisy speech turns out to be the optimal phase spectrum. Other alternatives to the MMSE estimator are the log-MMSE estimator [27] and the r th power spectral magnitude estimator [28]. These alternatives have also been used in the literature under uncertainty of signal presence in the noisy observations. During enhanced signal reconstruction, the MMSE short time spectral amplitude (STSA) estimate is modulated by the complex exponential of the noisy phase. It was shown in [27] that the latter is the MMSE estimator of the complex exponential of the original phase, which does not affect the STSA estimation.
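A compact numerical sketch of this gain function is given below. It is an illustration only: the decision-directed smoothing constant and the assumption that B0 and B1 denote the modified Bessel functions of the first kind, as in the classical MMSE-STSA estimator, are choices made here and not statements from the thesis.

    import numpy as np
    from scipy.special import i0, i1  # modified Bessel functions of the first kind

    def mmse_stsa_gain(c1, c2):
        """Spectral gain of Eq. 2.4: c1 = a priori SNR, c2 = a posteriori SNR."""
        p = c1 / (1.0 + c1) * c2                              # Eq. 2.5
        gain = (np.sqrt(np.pi) / 2.0) * (np.sqrt(p) / c2) * np.exp(-p / 2.0) \
               * ((1.0 + p) * i0(p / 2.0) + p * i1(p / 2.0))
        return gain

    def decision_directed_snr(prev_s_mag, noise_var, c2, alpha=0.98):
        """Decision-directed a priori SNR estimate (an assumed smoothing rule)."""
        return alpha * (prev_s_mag ** 2) / noise_var + (1.0 - alpha) * np.maximum(c2 - 1.0, 0.0)

    # Example: apply the gain to one noisy STFT frame (synthetic values).
    noise_var = 0.01
    Y_frame = np.random.randn(257) * 0.2                      # stand-in noisy spectrum
    c2 = (np.abs(Y_frame) ** 2) / noise_var                   # a posteriori SNR
    c1 = np.maximum(c2 - 1.0, 1e-3)                           # crude a priori SNR init
    S_hat = mmse_stsa_gain(c1, c2) * np.abs(Y_frame)          # enhanced magnitude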

Speaker Separation in a CASA Framework

Computational auditory scene analysis (CASA) is the study of auditory scene analysis (ASA) with the aim of separating sources or speakers in the way human listeners do. CASA in general utilizes knowledge of the human auditory system. A brief introduction to CASA follows. The term auditory scene analysis was used by Albert Bregman [29] to define the process by which the human auditory system analyzes incoming speech into semantically useful components. According to this theory, the human auditory system first transforms speech into a neural representation which is processed in a fashion similar to image processing. The entire ASA procedure can be broadly separated into two stages, namely segregation and regrouping. Segregation is itself often divided into finer categories such as segmentation, integration and segregation. However, these finer subdivisions relate more to psychophysics and are of lesser relevance to this work. In the segregation stage, speech, which is a one dimensional temporal signal, is transformed into a multi-dimensional space, such as the time frequency plane. In the next stage, these sub-regions are grouped together based on similarities in acoustic features. The target speech and the interfering speech can then be reconstructed from these groups to complete the separation.

Figure 2.1: Block diagram of a simplified Computational Auditory Scene Analysis (CASA) system.

Figure 2.1 shows a simplified diagram of a typical CASA system. After the short time Fourier transform (STFT), the time frequency cells that have similar characteristics are determined based on various acoustic cues. The different time frequency regions are then regrouped into different streams, and the target speech (or the interfering speech, or both) is reconstructed. A key part of regrouping in a CASA system is mask generation. Mask generation generally refers to the judgment of the reliability of each individual time frequency cell, where reliability refers to whether the time frequency cell in question belongs to the target speaker or not. The final reconstruction of either the target or the competing speech is performed based on reliability and the subsequent application of a binary mask to the grouped regions, as sketched below.
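As an illustration of the masking step, the following sketch computes an ideal binary mask under the assumption that reference target and interferer signals are available; this is for exposition only and is not the grouping-based mask of an actual CASA system.

    import numpy as np
    from scipy.signal import stft, istft

    def ideal_binary_mask(target, interferer, fs, threshold_db=0.0):
        """Binary mask: 1 where the target dominates the interferer in a T-F cell."""
        _, _, T = stft(target, fs=fs, nperseg=512)
        _, _, I = stft(interferer, fs=fs, nperseg=512)
        local_snr_db = 20.0 * np.log10((np.abs(T) + 1e-12) / (np.abs(I) + 1e-12))
        return (local_snr_db > threshold_db).astype(float)

    def apply_mask(mixture, mask, fs):
        """Apply a T-F mask to the mixture and resynthesize with the mixture phase."""
        _, _, Y = stft(mixture, fs=fs, nperseg=512)
        _, y_hat = istft(mask * Y, fs=fs, nperseg=512)
        return y_hat

    # Example with two synthetic 'speakers' (tones) mixed at 0 dB TIR.
    fs = 8000
    n = np.arange(2 * fs) / fs
    target = np.sin(2 * np.pi * 220 * n)
    interferer = np.sin(2 * np.pi * 340 * n)
    mixture = target + interferer
    mask = ideal_binary_mask(target, interferer, fs)
    separated = apply_mask(mixture, mask, fs)

In an actual CASA system the mask is of course not computed from reference signals but from the grouped acoustic cues described above.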

A brief review of three techniques that can be used as acoustic cues for speaker segregation is given here.

Amplitude Modulation based Speaker Segregation

Speech is often represented as a combination of a low frequency modulating signal and a high frequency modulated carrier [30], [31], [32], [33]. The goal of many current researchers is to use the envelope of the amplitude modulated speech signal as an acoustic cue for speaker segregation. Amplitude modulation based algorithms start with the decomposition of the original speech into multiple narrow band frequency channels. The amplitude modulation features of the output of each frequency channel are then used to group similar sources together [34].

Multi-pitch Tracking based Speaker Segregation

Since pitch is an intrinsic speech parameter that is highly speaker dependent, pitch tracking is a valuable and extensively studied tool in speech segregation. Though there are various ways to determine the pitch of a speech signal (temporal, spectral and spectro-temporal [35], [36], [37]), it is very difficult to determine the pitch of a speaker when there is interference from another human source. Many algorithms [38], [39], [40] have already been proposed for multiple pitch detection. Weintraub's [41] work on multi-pitch detection uses the autocorrelation function to calculate and remove the dominant speaker's pitch value and then repeats the algorithm for the weaker speaker. Though this method is computationally less expensive, it does not significantly reduce the word error rate in multi-speaker speech recognition. In this thesis, a knowledge based multi-pitch tracking algorithm [10] that detects and separates pitch values based on the subharmonic to harmonic ratio (SHR) is used.

Speaker Separation using Instantaneous Frequency

Instantaneous frequency (IF) has been used to separate speakers in a knowledge based framework in [34]. The idea of employing the instantaneous frequency characteristics of a speech signal to separate competing speakers is based on the concept of frequency modulation (FM) in a communication system. FM theory uses a signal that is variable not only in time, but

41 2.1 Single Channel Speech Enhancement Techniques 21 also in frequency. Speech signals have spectral characteristics that are also variable in time. Hence, the phase of a speech signal is not a constant, but varies in time Speech Enhancement using Temporal Envelope Filtering In this section, the speech dereverberation is explained using temporal envelope filtering. In this technique, the relation between the envelopes of the clean and reverberant speech waveforms are modeled for single channel reverberant speech enhancement. A general structure used for temporal envelope filtering based speech dereverberation is shown in Figure 2.2 (reproduced from [42]). During the first stage, the extraction of temporal envelope of the mixed Figure 2.2: A schematic structure for temporal envelope filtering. speech signal takes place. The parameters like reverberation time are then estimated from the envelope obtained from the first stage. During the second stage, an estimate of the clean envelope is obtained by filtering the envelope signal. The fine structure of reverberant signal is then used to reconstruct clean speech signal during the final stage. It may be noted that the phase modifications in the fine structure are not considered herein. An alternative to the temporal envelope filtering is proposed by extending envelope filtering to each frequency subbands. In [43], authors showed that by performing envelope deconvolution in each frequency subband, the dereverberation can be obtained. In this method, the envelope of the clean speech signal is recovered from the measured speech envelope effectively. The desired signal is then reconstructed by modulating the enhanced spectral magnitude by the original phase spectrum. The structure of the such a scheme is illustrated in Figure 2.3 (reproduced from [42]). It has been assumed in this work that the inverse filter can be estimated in advance. In later work, authors in [44] propose the power envelope inverse filtering technique. This is different from [43] in terms of the definition of power envelope.
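As a concrete illustration of the sub-band envelope processing structure described above, the following is a minimal Python sketch. It assumes a Hilbert-envelope definition of the temporal envelope and uses a placeholder envelope filter; the actual envelope deconvolution or power envelope inverse filtering of [43, 44] is not reproduced here.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def envelope_fine_structure(x):
    """Split a (sub-band) signal into its temporal envelope and fine structure."""
    analytic = hilbert(x)
    envelope = np.abs(analytic)            # temporal (Hilbert) envelope
    carrier = np.cos(np.angle(analytic))   # unit-amplitude fine structure
    return envelope, carrier

def subband_envelope_processing(x, fs, edges, envelope_filter):
    """Decompose x into sub-bands, process each envelope, and resynthesize.
    `envelope_filter` stands in for the envelope inverse filter of the cited
    methods, which is assumed to be estimated elsewhere."""
    y = np.zeros_like(x)
    for lo, hi in edges:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        env, carrier = envelope_fine_structure(band)
        y += envelope_filter(env) * carrier  # keep the original fine structure (phase)
    return y

# Example usage with an identity envelope filter (no actual dereverberation).
fs = 16000
x = np.random.randn(fs)                          # stand-in for reverberant speech
edges = [(100, 500), (500, 1500), (1500, 4000)]  # illustrative sub-band edges
y = subband_envelope_processing(x, fs, edges, envelope_filter=lambda e: e)
```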

42 2.2 Multi Channel Speech Enhancement Techniques 22 Figure 2.3: Block digram of K sub-band temporal envelope filtering. Additionally, the carrier sine wave definition is also different from [43] which is based on the amplitude modulation representation. The review of multi channel speech enhancement techniques are explained in the ensuing section. 2.2 Multi Channel Speech Enhancement Techniques In this section, the several multi channel speech enhancement techniques relevant to this thesis are discussed. Speech enhancement in multi channel scenario is receiving considerable attention due to its effectiveness in eliminating the distortions like background noise and multi path reflections. In multi channel case, the speech signals are captured simultaneously by all the microphones present in the system. This multi channel information is then utilized to filter the observed signal to obtain an estimate of the clean speech. The multi channel speech enhancement techniques generally follow spatial filtering principle to enhance or attenuate signals emanating from the particular directions. Using these techniques, reverberation and noise can be spatially separated from the desired signal. Due to these reasons, multi channel techniques are called spatial spectrum estimators. These techniques exploit the spatial infor-

43 2.2 Multi Channel Speech Enhancement Techniques 23 mation to preserve the signal coming from the desired direction. A review of multi channel techniques is detailed in the ensuing section Speech Enhancement using Linear Prediction (LP) Residual The motivation for using the LP residual methods comes from the observation present in the reverberant environments. It is noticed that the original impulses are followed by several other peaks in the LP residual of voiced speech segments. These peaks are caused by multi path reflections. Moreover, it is assumed that the LP coefficients (LPC) are unaffected by the reverberation. The technique proposed by authors in [45] processes the speech from multiple channels to enhance speech which is degraded by noise and multi path reflections. In this technique, the features of the excitation source used in the speech production are exploited. More specifically, the voiced speech characteristics are used to derive a coherently added signal from the LP residuals of the degraded speech data obtained from different microphones. From these coherently added signals, a weight function is then computed. In case the speech data is available from two or three spatially distributed microphones, then it is possible to derive a weight function [45] for the LP residual to reduce the effects of additive noise and multi path reflections in the enhanced speech. In this work [45], authors have assumed that the signals have the same sequence of significant instants from multiple microphones, except for the fixed delay between a pair of microphones. Degradation due to reflections will lead to many false significant instants at each microphone. However, it is observed that at each microphone, these instants occur at random time. Thus, authors have added the microphone outputs after compensating the time-delays between a pair of microphones. In this work, the time-delay between a pair of microphones is computed using the source information knowledge present in the LP residual. The signal which gets added coherently at the significant instants correspond to the direct path. On the other hand, the reverberant components get added incoherently. Although, there is improvement in the SNR due to this coherent addition but reverberant components are still present.
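A minimal sketch of the delay compensation and coherent addition of LP residuals described above is given below. The LP residual is computed here with the autocorrelation method, and the inter-microphone delays are estimated by plain cross correlation of the residuals, which stands in for the excitation-based delay estimate used in [45].

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_residual(x, order=12):
    """LP residual via the autocorrelation method (inverse filtering with the LPC)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])  # predictor coefficients
    inverse_filter = np.concatenate(([1.0], -a))  # A(z) = 1 - sum_k a_k z^{-k}
    return lfilter(inverse_filter, [1.0], x)

def coherent_residual_sum(channels, ref=0):
    """Delay-compensate each channel's LP residual against a reference channel
    and add the residuals coherently."""
    residuals = [lp_residual(c) for c in channels]
    ref_res = residuals[ref]
    out = np.zeros_like(ref_res)
    for res in residuals:
        lag = np.argmax(np.correlate(ref_res, res, mode="full")) - (len(res) - 1)
        out += np.roll(res, lag)  # circular shift as a simple alignment stand-in
    return out / len(residuals)

# Example with two synthetic channels (stand-ins for microphone signals).
channels = [np.random.randn(8000), np.random.randn(8000)]
coherent_excitation = coherent_residual_sum(channels)
```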

44 2.2 Multi Channel Speech Enhancement Techniques 24 Authors [45] have also suggested one possible solution to remove these reverberant components along with simultaneously increasing SNR by modifying the LP residual signal. This is done by generating a weight function for the LP residual which enhances the coherent part around the significant instants with respect to the other regions. Since authors [45] observed that both positive and negative samples of LP residual depends on the phase. Therefore, the Hilbert envelope of the LP residual signal [45] is computed to obtain the strength of the LP residual at each instant. The Hilbert transform of a signal is obtained by first computing the discrete Fourier transform (DFT) of a signal and then exchanging the real and imaginary parts of the DFT of a signal. And then finally time domain signal is the obtained by computing the inverse DFT (IDFT). The strength of the excitation at that instant is obtained from the amplitude of the Hilbert envelope, and the amplitudes are typically large around the instants of glottal closure. There are several large amplitude spikes in the Hilbert envelope of the LP residual due to the effects of noise and reverberation. In order to reduce these spikes, authors [45] have determined the Hilbert envelopes of the LP residuals from several microphones which are then added coherently. The enhanced speech is finally synthesized by exciting the time varying all pole filters with the weighted LP residual Blind Source Separation (BSS) based Speech Enhancement Independent component analysis (ICA) [46] based BSS methods has received great importance due to its applications in signal processing such as in speech recognition systems, telecommunications and medical signal processing. However, ICA fails to represent the dynamic nature of a realistic audio environment. For example, the estimation of inverse mixing matrix is adaptively performed on frame by frame basis for characterizing speaker movements. This makes ICA inefficient for BSS in realistic scenarios. In literature, there are BSS which uses only time frequency (TF) information for separating the mixed speaker signals. In other words, these BSS techniques do not consider spatial information for estimating individual speaker signals. In [47], authors have applied blind source separation using time frequency masking and multiple target tracking. In this technique, authors [47] have obtained time frequency mask which contains either a binary or

45 2.2 Multi Channel Speech Enhancement Techniques 25 a real valued coefficient for every time frequency point. This decides whether the TF point belongs to the sound source which need to be separated. When the real-valued masks are considered, the mask value can be taken as a likelihood of the TF point which originate from the target source. Application of this mask results in the separation of the source spectrum from the mixture. Other BSS based methods which uses only TF information is based on degenerate unmixing and estimation technique (DUET) [48]. This method is applicable when sources are disjoint orthogonal. In other words, when the time frequency representations of any two signals in the mixtures are disjoint sets. These method utilize gradient search of the mixing parameters and simultaneously constructs binary time frequency masks that are used to partition one of the mixtures and to recover the original source signals. The problem with TF masking in [47] is the actual mask estimation. In other words, it is difficult to estimate a good mask from the noisy input data using TF masking. In [49], the spatial information from multiple microphones or a microphone array is used to obtain TF mask values. More specifically, the TF mask in [49] is estimated using the speaker s direction of arrival (DOA) information. If a multisource environment is considered, the DOA of each individual overlapping speaker is obtained by using Bayesian filtering or particle filtering techniques. This Bayesian filtering technique [50] is fused with a track before detect (TBD) scheme so that new active speakers can be added and silent speakers can be deleted. In this approach, source movement during the separation is allowed and it operates online. In other words, this technique processes only one frame at a time and does not see the whole spectrum before producing the output. Therefore, the approach can be used in real time provided the algorithm is implemented in an efficient manner. Another technique which utilizes spatial time frequency distributions (STFD) is detailed in [51]. In this technique, the authors introduce a new blind identification method based on joint diagonalization of a combined set of STFD. It is shown in this work that STFD has the similar structure than the data spatial correlation matrix under the assumption of a linear data model. The authors of this work also show the benefits of STFD over the spatial correlation matrix in a nonstationary signal environment. The significance of STFD is observed in the direct utilization of the nonstationarity of the speech signals. Therefore,

46 2.2 Multi Channel Speech Enhancement Techniques 26 this approach utilizes the difference between the TF signatures of the sources. Moreover, this approach separates Gaussian sources which have identical spectral shape with different TF localization properties. In [52], authors have combined the time and frequency information utilized in convolutive blind source separation and spatial information of source signals or sensor array used in adaptive beamforming for BSS Beamforming based Speech Enhancement Beamforming is a spatial filtering method which separates the signal of interest (SOI) from the ambient noise and other interfering speakers utilizing an array of microphones. Under the condition that the geometry of the microphones with respect to the desired source is known before hand, a beam can be formed in the direction of desired source. Thus, beamforming preserves the desired source present in the look direction and removes all the other unwanted or undesired source which are in the non look direction. In beamforming, beam is created by compensating the respective propagation delays between the source and each microphone. This process is generally done to time align all the microphones. Beamforming can also be viewed as a spatial spectrum estimation method in which spectrum of the desired signal in look direction is unattenuated and the rest of the undesired signals are excluded. Beamforming can be classified as fixed beamforming and adaptive beamforming Fixed Beamforming Fixed beamformers belong to the category of beamformers where the weights or spatial filter coefficients do not depend on the input signal and take into consideration the steering vector of the SOI. In such beamformers, once the propagation delays between the source and each microphone are obtained, the microphone channels are individually weighted and combined to produce the desired speech signal. Due to this reason, this technique is known as filter and sum beamformer (FSB). The filter and sum beamformer results in cancellation of noise and other signals in non look direction. In this beamformer, it has been assumed that noise and other signals are uncorrelated with the desired signal in each microphone. The filter and sum beamformer

output in the frequency domain is given by

d_{fsb}(f) = \frac{1}{M} \sum_{l=1}^{M} w_l(f)\, y_l(f)\, e^{-j2\pi f \Delta_l}    (2.6)

Here, $d_{fsb}(f)$ is the FSB output, $w_l(f)$ is the filter weight required to produce the desired signal, $y_l(f)$ is the observation received at the $l$th microphone, and the exponential term in Equation 2.6 compensates for the propagation delay $\Delta_l$. $M$ is the total number of microphones. In the FSB, different phase weights are applied to the input channels [53], which steers the main lobe of the directivity pattern towards the location of the desired acoustic source. Several other filter and sum beamformers have been proposed in the literature that optimize the filter weights $w_l(f)$ for particular noise fields and conditions [54]. Most of them obtain the optimal weights by maximizing a signal level criterion in a likelihood maximization framework. The delay and sum beamformer (DSB), in which $w_l(f) = 1$ [55], is a special case of the filter and sum beamformer. Delay and sum beamforming assumes that all microphone channels have an identical frequency response and applies equal amplitude weights to all channels. The filter and sum beamformer differs from the delay and sum beamformer in that an independent weight is applied to each microphone channel before the channels are added (a minimal implementation sketch of the DSB is given below).

Adaptive Beamforming

Fixed beamformers are simple to implement, but they cannot adapt to a changing acoustic environment. In adaptive beamforming, the array processing parameters are adjusted dynamically according to an optimization criterion, either on a sample by sample or a frame by frame basis. The array processing parameters here refer to the FIR filter coefficients used in a FSB. The objective of an adaptive beamformer is to preserve the signal arriving from the desired look direction while minimizing the overall energy of the filter output. This is achieved by placing null responses in the directions of the interfering noise sources.
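Before turning to specific adaptive designs, the following minimal sketch illustrates the fixed delay and sum beamformer of Equation 2.6 with $w_l(f) = 1$. It assumes the propagation delays towards the look direction are known; the array geometry and signal values below are placeholders.

```python
import numpy as np

def delay_and_sum(Y, delays, freqs):
    """Frequency-domain delay and sum beamformer (w_l(f) = 1 in Equation 2.6).
    Y      : (M, F) per-microphone STFT coefficients for one frame.
    delays : (M,) propagation delays in seconds towards the look direction.
    freqs  : (F,) frequency axis in Hz."""
    M = Y.shape[0]
    steering = np.exp(-1j * 2 * np.pi * np.outer(delays, freqs))  # delay compensation
    return np.sum(steering * Y, axis=0) / M

# Illustrative usage with synthetic data (all values are placeholders).
fs, nfft, M = 16000, 512, 4
freqs = np.fft.rfftfreq(nfft, 1.0 / fs)
Y = np.random.randn(M, freqs.size) + 1j * np.random.randn(M, freqs.size)
delays = np.array([0.0, 1.0e-4, 2.0e-4, 3.0e-4])   # assumed known array geometry
d_dsb = delay_and_sum(Y, delays, freqs)
```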

In the literature, adaptive beamformers such as the Capon or minimum variance distortionless response (MVDR) beamformer [56, 57], the linearly constrained minimum variance (LCMV) beamformer [42, 58], the maximum SNR beamformer [59, 60] and the linear predictive beamformer have been used for speech enhancement. In this section, the MVDR and LCMV beamformers are explained, together with an efficient implementation of the LCMV beamformer through the generalized sidelobe canceller (GSC) [61].

Minimum Variance Distortionless Response Beamformer : The frequency domain signal model, consisting of the speech signal and the uncorrelated noise present in the environment, is given by

Y(f) = S(f) + V(f)    (2.7)

where $Y(f)$, $S(f)$ and $V(f)$ are the received microphone signal, the speech signal and the noise signal respectively, with $Y(f) = [Y_1(f), Y_2(f), \ldots, Y_M(f)]^T$, and $S(f)$ and $V(f)$ defined analogously. Here $M$ is the total number of microphones and $T$ denotes the matrix transpose. The MVDR beamformer proposed by Capon [62] is derived by minimizing the mean square error (MSE) of the residual noise under the constraint that the desired signal is not distorted. Mathematically,

\min_{W(f)} \; W^H(f)\, \phi_Y(f)\, W(f) \quad \text{subject to} \quad W^H(f)\, P_s(f) = 1    (2.8)

Here, $W(f) = [w_0(f), w_1(f), \ldots, w_M(f)]^T$ is the weight vector required to obtain the distortionless output, and $P_s(f) = [1, e^{-jf}, \ldots, e^{-jMf}]^T$ is the steering vector, since it determines the direction of the desired signal. $\phi_Y(f)$ is the covariance matrix of $Y(f)$. The solution to Equation 2.8 is given by [63, 64]

W(f) = \frac{\phi_Y^{-1}(f)\, P_s(f)}{P_s^H(f)\, \phi_Y^{-1}(f)\, P_s(f)}    (2.9)
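The following sketch computes the MVDR weights of Equation 2.9 for a single frequency bin. The covariance matrix and steering vector in the example are placeholders; in practice $\phi_Y(f)$ would be estimated from the observed data.

```python
import numpy as np

def mvdr_weights(phi_y, steering):
    """MVDR weights of Equation 2.9 for one frequency bin.
    phi_y    : (M, M) covariance matrix of the received signals.
    steering : (M,) steering vector P_s(f) for the desired direction."""
    phi_inv = np.linalg.inv(phi_y)
    num = phi_inv @ steering
    return num / (steering.conj() @ num)

def mvdr_output(Y, W):
    """Apply the weights to one frame of microphone STFT coefficients Y (M,)."""
    return W.conj() @ Y

# Illustrative usage for a single bin with a placeholder covariance matrix.
M = 4
phi_y = np.eye(M) + 0.1 * np.ones((M, M))
steering = np.exp(-1j * 2 * np.pi * 0.1 * np.arange(M))
W = mvdr_weights(phi_y, steering)
out = mvdr_output(np.random.randn(M) + 1j * np.random.randn(M), W)
```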

An extended version of the MVDR beamformer is obtained by adding constraints that minimize the gain of the undesired signals at different frequencies. This is done by exploiting the structure of the noise covariance matrix $\phi_V(f)$ [64, 65]. This extension is known as the linearly constrained minimum variance beamformer, because it can handle more than one linear constraint. LCMV beamforming is limited by the fact that the number of microphones must exceed the number of nulls by one. An efficient realization of the LCMV beamformer is the generalized sidelobe canceller structure.

Figure 2.4: Block diagram illustrating a LCMV beamformer using a GSC structure.

Generalized Sidelobe Structure : In the GSC framework [42, 61], the structure consists of two parts. The first part is a fixed beamformer ($W_{fb}$) which generates a non adaptive output. The second part is an adaptive structure for the cancellation of sidelobes. In the adaptive path, a blocking matrix (BM) first blocks the desired signal arriving from the look direction, so that a correct estimate of the signals arriving from the non look directions can be made. Any signal common to both paths is then cancelled by adjusting the weights ($W_a$) of the adaptive structure. A standard GSC structure is illustrated in Figure 2.4.

50 2.3 Summary Summary In this chapter, various single and multi channel speech enhancement techniques in spectral domain are discussed. Single channel speech enhancement techniques are computationally less complex when compared to multi channel techniques. But these techniques do not exploit spatial cues. Single channel methods uses only the temporal diversity present in the signal. Hence, they are unable to suppress the reverberation components completely. Multi channel methods are therefore better utilized to handle the reverberation problem.

51 Chapter 3 Speech Enhancement Quality Measures Various quality measures used for speech enhanacement are described in this chapter. The speech quality can be quantified using the subjective and several objective measures for speech enhancement task. In order to investigate the speech quality improvement, speech quality is compared before and after processing. Subjective measures require listening tests which are then averaged over different subjects. The objective quality measures used in this thesis are intrusive measures which are also known as end-to-end or reference measure [42]. In intrusive measures, the distorted signal is compared with the undistorted signal. The undistorted signal is usually known as reference signal. It has been found that if an objective quality measure is highly correlated with subjective evaluation results, then the performance of speech enhancement task would be acceptable. 3.1 Subjective Measures To obtain subjective speech quality measures, the listening tests are performed where the human subjects rate the performance of the quality of a signal in accordance with an opinion scale [66]. In literature, the most commonly used methods for measuring the subjective

52 3.1 Subjective Measures 32 quality of speech transmission over voice communication systems are standardized by the International Telecommunications Union (ITU-T). To assess the quality of the separated data, an established subjective test protocol [67] is used to measure the quality of the separated signals. The separated speech files from different methods are compared to the reference files and rated comparatively. Since there are multiple kinds of distortions that may manifest into the separated speech signal, a subjective test with multiple criteria are necessary. This leads to a more holistic evaluation of the speech separation algorithm than the use of a simple mean opinion score (MOS). In the subjective test protocol, used in this work, the separated signals are rated using four parameters, namely The global quality of the separated signal compared to the clean reference signal. The quality in terms of preservation of the target signal. The quality in terms of absence of other (interfering) speakers. The quality in terms of absence of artificial noise. The tests are performed in the same order as mentioned above, with a small break at the end of each task. A training phase is first conducted where the subject listens to the sounds of all mixtures, aimed to train the subject to address the required task, and to learn the range of perceived sound quality. A grading phase is then performed for each mixture and target speaker. During this phase, the subjects rate the quality of each signal by providing their opinion after listening to both the degraded and enhanced speech signals. The two different scales for rating the scores are used in literature. In first case, the scale of 0 to 100 is used [67] to rate the scores. The second case uses scale of 0 to 5 as described in [68]. Mean opinion score (MOS) is the averaged opinion score of the subjects and indicates the subjective quality of the system or algorithm under test. A histogram plot is then used to indicate relative performance of the various systems. To obtain a meaningful result in MOS, a large number of subjects are required. Due to this, subjective test requires a large effort in order to have opinion scores from several subjects [69]. Sounds can be listened to as many times as desired, and cross checking is also allowed. Signals have to be rated consistently in pairs. The guidelines of the test were presented on a separate written document, lucidly

explaining the details, in order to avoid any influence of the supervisor. The objective evaluations used for noise cancellation, speech dereverberation and speaker separation are discussed in the ensuing section.

3.2 Objective Measures

Due to the rapid development of speech enhancement and voice communication systems, the demand for robust objective speech quality measures has increased. These objective measures also correlate well with subjective speech quality measures. This section discusses the different objective evaluation measures used for evaluating noise cancellation, speech dereverberation and speaker separation methods.

Objective Measures for Noise Cancellation

The segmental SNR (SSNR), the frequency weighted segmental SNR (FWSSNR) and the SNR loss have been used as objective measures for noise cancellation and are described below.

Segmental SNR

The segmental SNR [70] has been widely used for noise cancellation and is defined as

\mathrm{SSNR}(s_n, \hat{s}_n) = 10 \log_{10} \frac{\| s_n \|^2}{\| s_n - \hat{s}_n \|^2}    (3.1)

In Equation 3.1, $s_n$ is the original signal and $\hat{s}_n$ is the enhanced signal obtained from the speech enhancement algorithm. Moreover, $\|\cdot\|$ is the 2-norm, given by

\| v \| = \sqrt{ \sum_{k=1}^{N} |v(k)|^2 }    (3.2)

Both $s_n$ and $\hat{s}_n$ are of length $N$ samples, and the subscript $n$ denotes the frame number. The $n$th frame of length $N$ of the original signal is represented as $s_n = [s(n\nu), \ldots, s(n\nu + N - 1)]$. Here $\nu$ is the step size of the window used in the short time analysis.
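A minimal frame based implementation of the segmental SNR of Equation 3.1 is sketched below; the frame length, step size and clipping limits are illustrative choices rather than values taken from this thesis.

```python
import numpy as np

def segmental_snr(s, s_hat, frame_len=256, step=128, clip=(-10.0, 35.0)):
    """Segmental SNR of Equation 3.1, averaged over frames.
    Per-frame values are clipped to a fixed range, as is common in practice;
    the clipping limits here are an assumption."""
    scores = []
    for start in range(0, min(len(s), len(s_hat)) - frame_len + 1, step):
        sn = s[start:start + frame_len]
        en = sn - s_hat[start:start + frame_len]
        snr = 10.0 * np.log10(np.sum(sn ** 2) / (np.sum(en ** 2) + 1e-12))
        scores.append(np.clip(snr, *clip))
    return float(np.mean(scores))
```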

Frequency Weighted Segmental SNR

The frequency weighted segmental SNR is another objective measure [70], given by

\mathrm{FWSSNR} = \frac{10}{N_T} \sum_{m=0}^{N_T - 1} \frac{ \sum_{k=1}^{K} w(k, m)\, \log_{10} \dfrac{ |s(k, m)|^2 }{ \left( |s(k, m)| - |\hat{s}(k, m)| \right)^2 } }{ \sum_{k=1}^{K} w(k, m) }    (3.3)

In Equation 3.3, $w(k, m)$ is the weight applied to the $k$th frequency band in the $m$th frame, $K$ is the total number of bands, and $N_T$ is the total number of frames obtained after performing the STFT of the time domain signal. Here, $s(k, m)$ is the spectrum of the clean speech signal, while $\hat{s}(k, m)$ is the weighted enhanced signal spectrum in the same band. The signal spectrum is weighted by a Gaussian shaped window.

SNR Loss

The SNR (signal to noise ratio) loss [71] between the reference and the reconstructed sound files is also used in this work to evaluate the performance of the proposed algorithms. The term SNR loss indicates the loss introduced by noise suppression algorithms, but it is a general measure for assessing the noise quality of the signal obtained after enhancement. The SNR loss in band $k \in \{0, 1, \ldots, K-1\}$ and frame $m$ is defined as

\mathrm{Loss}(k, m) = \mathrm{SNR}_X(k, m) - \mathrm{SNR}_{\hat{X}}(k, m)    (3.4)

\mathrm{Loss}(k, m) = 10 \log_{10} \frac{ |X(k, m)|^2 }{ |D(k, m)|^2 } - 10 \log_{10} \frac{ |\hat{X}(k, m)|^2 }{ |D(k, m)|^2 }    (3.5)

\mathrm{Loss}(k, m) = 10 \log_{10} \frac{ |X(k, m)|^2 }{ |\hat{X}(k, m)|^2 }    (3.6)

Here, $\mathrm{SNR}_X(k, m)$ is the SNR of the reference signal and $\mathrm{SNR}_{\hat{X}}(k, m)$ is the SNR of the reconstructed signal. $X(k, m)$, $\hat{X}(k, m)$ and $D(k, m)$ are the spectra of the reference, reconstructed and noise signals respectively in band $k \in \{0, 1, \ldots, K-1\}$ and time frame $m$. The SNR loss is calculated per band, and the results across all the frequency bands are used to compute the final SNR loss.
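The following sketch computes a frequency weighted segmental SNR in the spirit of Equation 3.3. The band weights used here (clean spectral magnitudes raised to a small power) are a common choice in the literature and stand in for the Gaussian shaped weighting mentioned above.

```python
import numpy as np
from scipy.signal import stft

def fwssnr(s, s_hat, fs, nperseg=512, gamma=0.2):
    """Frequency weighted segmental SNR in the spirit of Equation 3.3."""
    _, _, S = stft(s, fs=fs, nperseg=nperseg)
    _, _, S_hat = stft(s_hat, fs=fs, nperseg=nperseg)
    mag, mag_hat = np.abs(S), np.abs(S_hat)
    w = mag ** gamma                                 # assumed band weighting
    num = mag ** 2
    den = (mag - mag_hat) ** 2 + 1e-12
    per_frame = np.sum(w * 10.0 * np.log10(num / den + 1e-12), axis=0) / np.sum(w, axis=0)
    return float(np.mean(per_frame))
```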

Objective Measures for Speech Dereverberation

In this section, the objective measures used to determine the quality of speech dereverberation are discussed.

Signal to Reverberation Ratio

The instantaneous segmental signal to reverberation ratio (SRR) [72] of the $k$th frame is defined similarly to the segmental SNR [73] as

\mathrm{SRR}_{seg}(k) = 10 \log_{10} \frac{ \sum_{n=kR}^{kR+P-1} x_d^2(n) }{ \sum_{n=kR}^{kR+P-1} \left( x_d(n) - \hat{x}_d(n) \right)^2 }    (3.7)

In Equation 3.7, $R$ is the frame rate and $P$ is the frame length, both in samples. Here, $x_d(n)$ is the direct path signal, which can also be viewed as a delayed version of the clean signal, and $\hat{x}_d(n)$ is the enhanced signal. The frame rate $R$ is set such that the frame overlap is 75%. The mean segmental SRR is obtained by averaging Equation 3.7 over all frames.

Log Likelihood Ratio

The log likelihood ratio (LLR) is a linear prediction (LP) based measure [74]. Linear prediction is a widely used technique for studying speech production. The LLR is a distance measure obtained directly from the LPC vectors of the clean and distorted speech.

\mathrm{LLR}(x_{LP}, \hat{x}_{LP}) = \log \frac{ x_{LP}\, R_x\, x_{LP}^{H} }{ \hat{x}_{LP}\, R_x\, \hat{x}_{LP}^{H} }    (3.8)

where $x_{LP}$ and $\hat{x}_{LP}$ are the LPC vectors of the clean and enhanced speech respectively. In Equation 3.8, $R_x$ is the autocorrelation matrix of the clean speech signal $x(n)$.

Log Spectral Distortion

The log spectral distortion (LSD) is a widely used distortion measure for speech dereverberation. It is computed as the $L_p$ norm of the difference between the log spectra of the desired signal $x_d(n)$ and the enhanced signal $\hat{x}_d(n)$ [42].

The LSD is a spectral domain objective measure which utilizes short-time spectra obtained by computing the STFT of the signals. Let $x_d(k, m)$ denote the STFT of the signal $x_d(n)$ at the $k$th frequency bin and $m$th frame index. The frame length is taken to be 64 ms and the overlap is kept at 75%. The $L_p$ norm of the difference between $x_d(k, m)$ and $\hat{x}_d(k, m)$ in the $m$th frame is defined as

\mathrm{LSD}(m) = \left( \frac{2}{K} \sum_{k=0}^{K/2 - 1} \left| \mathcal{L}\{\hat{x}_d(k, m)\} - \mathcal{L}\{x_d(k, m)\} \right|^{p} \right)^{1/p}    (3.9)

In Equation 3.9, $K$ is the total number of frequency bins in each frame. Here $\mathcal{L}\{x_d(k, m)\} = \max\{ 20 \log_{10} |x_d(k, m)|, \delta \}$ is the log spectrum confined to a dynamic range of about 50 dB [42], where $\delta = \max_{k,m}\{ 20 \log_{10} |x_d(k, m)| \} - 50$. The mean log spectral distortion is obtained by averaging Equation 3.9 over all frames containing speech. For $p$ equal to 1, 2 and $\infty$, Equation 3.9 yields the mean absolute, root mean square, and maximum deviation, respectively. The Bark spectral distortion has also been used for measuring speech dereverberation quality and is explained in the ensuing section.

Bark Spectral Distortion

The Bark spectral distortion (BSD) is obtained by transforming the speech signal into a perceptually relevant domain. The Bark spectra $B_{x_d}$ and $B_{\hat{x}_d}$ [42] are used to obtain the BSD metric, where $B_{x_d}(k_b, m)$ and $B_{\hat{x}_d}(k_b, m)$ are the Bark spectra of the original signal $x_d$ and the enhanced signal $\hat{x}_d$ respectively, and $k_b$ and $m$ denote the Bark frequency bin index and the frame index. The Bark spectrum is obtained in three steps: critical band filtering, equal loudness pre-emphasis, and an intensity-loudness power law that yields the perceived loudness. The magnitude squared spectrum of the current analysis frame $m$ is used as the input to this process. The BSD score is then computed from the Bark spectra as

\mathrm{BSD} = \frac{1}{N_T} \sum_{m=0}^{N_T - 1} \frac{ \sum_{k_b=1}^{K} \left( B_{x_d}(k_b, m) - B_{\hat{x}_d}(k_b, m) \right)^2 }{ \sum_{k_b=1}^{K} \left( B_{\hat{x}_d}(k_b, m) \right)^2 }    (3.10)

Here, $N_T$ is the number of analysis frames.
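A minimal sketch of the mean log spectral distortion of Equation 3.9, including the 50 dB dynamic range floor, is given below. scipy's one-sided STFT stands in for the lower half of the $K$ bin spectrum, so the normalization is approximate.

```python
import numpy as np
from scipy.signal import stft

def log_spectral_distortion(x_d, x_hat, fs, p=2, frame_ms=64, overlap=0.75):
    """Mean log spectral distortion in the spirit of Equation 3.9."""
    nperseg = int(fs * frame_ms / 1000)
    noverlap = int(nperseg * overlap)
    _, _, X = stft(x_d, fs=fs, nperseg=nperseg, noverlap=noverlap)
    _, _, X_hat = stft(x_hat, fs=fs, nperseg=nperseg, noverlap=noverlap)
    log_x = 20.0 * np.log10(np.abs(X) + 1e-12)
    delta = log_x.max() - 50.0                       # confine to a 50 dB dynamic range
    L = np.maximum(log_x, delta)
    L_hat = np.maximum(20.0 * np.log10(np.abs(X_hat) + 1e-12), delta)
    K = X.shape[0]                                   # one-sided bins stand in for K/2
    per_frame = ((2.0 / K) * np.sum(np.abs(L_hat - L) ** p, axis=0)) ** (1.0 / p)
    return float(np.mean(per_frame))
```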

The resulting BSD score for a speech signal is the weighted average of the BSD scores over all of the analysis frames.

Objective Measures for Speaker Separation

The perceptual similarity measure (PSM), the perceptual evaluation of speech quality (PESQ) and the target to interference ratio (TIR) are used as objective measures for speaker separation.

Perceptual Similarity Measure

To predict perceived quality differences between audio signals, the PEMO-Q technique [75] is used. PEMO-Q is based on the work of Huber and Kollmeier [76], which in turn follows the approach of Hansen and Kollmeier [77], using the Oldenburg perception model (PEMO) of Dau et al. [78] to compute internal representations of signal pairs. These pairs are compared quantitatively by calculating the linear cross correlation coefficient, and the resulting correlation value serves as an objective measure of the perceptual similarity between the two audio signals. Apart from the overall correlation between the internal representations (output value PSM), the method also computes an estimate of the instantaneous audio quality as a function of time using frame-wise correlation (output vector PSM_inst). The non-linear average of this time series represents another estimate of the perceived overall quality, the output value PSMt. The PSM has been shown to be one of the best methods for objective evaluation and exhibits high correlation with subjective results [79]. The perceptual quality measure (PSM) is obtained by performing the cross correlation operation separately for each modulation channel. In the internal representation, the sampling frequencies of the different channels do not have to match; hence this independent processing results in a high computational efficiency. The two dimensional submatrix $x_{tf}|_{m_f=\mathrm{constant}}$ corresponds to the internal representation of the reference signal for a given modulation channel. Here, the subscripts $t$, $f$ and $m_f$ denote time, frequency and modulation frequency

respectively. The linear cross correlation coefficient of two $N \times M$ matrices is given by

r = \frac{ \sum_{t,f=1}^{N,M} (x_{tf} - \bar{x})(y_{tf} - \bar{y}) }{ \sqrt{ \sum_{t,f} (x_{tf} - \bar{x})^2 } \; \sqrt{ \sum_{t,f} (y_{tf} - \bar{y})^2 } }    (3.11)

Here $N$ and $M$ represent the number of time samples and frequency channels respectively, and $\bar{x}$ and $\bar{y}$ denote the corresponding mean values. $y_{tf}|_{m_f=\mathrm{constant}}$ is the internal representation of the distorted test signal. The correlation coefficients in Equation 3.11 are weighted by the normalized mean squared values of the corresponding submatrices and summed to compute the final quality measure as

\mathrm{PSM} = \sum_{m_f} w_{m_f}\, r_{m_f}, \quad \text{with} \quad w_{m_f} = \frac{ \sum_{t,f=1}^{N,M} y_{t,f,m_f}^2 }{ \sum_{t,f,m_f=1}^{N,M,L} y_{t,f,m_f}^2 }    (3.12)

In addition to PSM, the successive cross correlation of 10 ms frames of the internal representations is also computed to obtain the instantaneous audio quality (PSMt). PSMt is subsequently weighted by the moving average of the internal representation of the test signal. Thus, PSMt models the relation between the overall perceived audio quality and the instantaneous loudness.

Perceptual Evaluation of Speech Quality

The PESQ is another objective measure, described in ITU-T Recommendation P.862 [42, 80]. PESQ has been validated in a number of experiments that test its performance across many combinations of factors such as filtering, variable delay, coding distortions and channel errors. PESQ is recommended for speech quality assessment of 3.1 kHz narrow-band speech codecs and handset telephony. In this objective measure, the original signal $s(t)$ is compared with an enhanced signal $\hat{s}(t)$ produced by the algorithm under test. The output of PESQ is a prediction of the perceived quality and is corroborated with subjective listening tests. During the PESQ calculation, a series of delays between the original input and the test signal is computed as the first step; for each time interval, the delay may differ significantly from that of the previous interval. Additionally, a corresponding start and

stop point is calculated for each of these intervals. The alignment algorithm computes the confidence of having two different delays within a particular interval and compares it with the confidence of a single delay for that interval. The key idea of the alignment is to transform both the original and the reconstructed signals into an internal representation that is analogous to the psychophysical representation [42] of audio signals in the human auditory system, accounting for perceptual frequency and loudness. Several stages are involved in this transformation, including time alignment, level alignment, time frequency mapping, frequency warping, and compressive loudness scaling. The processing of the internal representation accounts for linear filtering and local gain variations during the alignment. The algorithm can handle delay changes both during silence and during voiced regions of speech. PESQ then uses a perceptual model to compare the original signal with the aligned reconstructed signal based on the set of delays found [42]. It may also be noted that PESQ can be used as an objective measure for speaker separation as well as for speech dereverberation.

Target to Interference Ratio

The target to interference ratio (TIR) is calculated as

\mathrm{TIR} = \frac{\sigma_T^2}{\sigma_I^2}    (3.13)

where $\sigma_T^2$ and $\sigma_I^2$ are the variances of the target and the interfering speakers respectively, computed across all frames of the test utterance. The TIR is thus the target to interference signal power ratio. This measure is widely used to evaluate speaker separation algorithms. If the values of the output TIR exactly overlap with the input TIR, the corresponding algorithm is effective for speaker separation.
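The TIR of Equation 3.13 is straightforward to compute from the separated target and interference components, as in the following sketch.

```python
import numpy as np

def target_to_interference_ratio(target, interference, in_db=True):
    """Target to interference ratio of Equation 3.13, computed from the
    separated target and interfering signals (or their frame-wise estimates)."""
    tir = np.var(target) / (np.var(interference) + 1e-12)
    return 10.0 * np.log10(tir) if in_db else tir
```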

60 3.3 Summary Summary In this chapter, different objective measures used for speech enhancement are discussed. Objective measures specifically used for noise cancellation, speech dereverberation and speaker separation are also discussed. Some of the measures defined in this section can be used for multiple purposes. For example PESQ defined for speech dereverberation can also be used as an objective measure for noise cancellation and speaker separation. These objective measures discussed herein are used to evaluate the quality of speech enhancement methods proposed in this thesis.

61 Chapter 4 Spectral Methods for Single Channel Speech Enhancement The Fourier transform phase spectrum is robust to additive noise for feature extraction when compared to Fourier transform magnitude spectrum. However, the processing the phase spectrum is difficult due to inevitable wrapping of phase spectrum which results in masking of resonance features. The group delay spectrum is thus used as an alternative to process the phase spectrum. However, the group delay spectrum has hitherto not been used for speech enhancement in a multi source environment. In this thesis, novel methods for speaker separation and speech dereverberation are developed using group delay spectrum. The first part of this chapter discuss the group delay cross correlation method used for speaker separation. The second part of this chapter deals with joint speaker separation and dereverberation using group delay spectral matrix factorization. The performance evaluation of both the proposed methods is presented and is compared with conventional speaker separation and speech dereverberation methods available in literature.

62 4.1 Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework In this section, a novel method that uses the group delay cross correlation function for single channel speaker segregation is proposed. The group delay function, which is the negative derivative of the phase spectrum yields robust spectral estimates. Hence the group delay spectral estimates are first computed over frequency subbands after passing the speech signal through a bank of filters. The filter bank spacing is based on a multi-pitch algorithm that computes the pitch estimates of the competing speakers. An affinity matrix is then computed from the group delay spectral estimates of each frequency subband. This affinity matrix represents the correlations of the different subbands in the mixed broadband speech signal. The grouping of correlated harmonics present in the mixed speech signal is then carried out by using a new iterative graph cut method. The signals are reconstructed from the respective harmonic groups which represent individual speakers in the mixed speech signal. Spectrographic masks are then applied on the reconstructed signals to refine their perceptual quality. The quality of separated speech is evaluated using several objective and subjective criteria. Experiments on multi-speaker automatic speech recognition are conducted using mixed speech data from the GRID corpus Introduction Humans have an innate ability to separate out competing speakers in any environment. One of the most popular computational models that try to use this fundamental ability of humans, is the auditory scene analysis proposed by Bregman [29], [81]. It is based on the theory that the humans separate the one-dimensional speech signal by first projecting it onto a twodimensional time-frequency space. After this projection, other techniques like edge detection in image processing are used to further aid the identification process. The computational aspects of this theory is captured in what is called computational auditory scene analysis. However, single channel speaker segregation by machines is a very challenging task. The presence of noise and reverberation can be handled by different signal processing techniques given that the nature of these interferences are different from that of the speech signal.

63 4.1 Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework 43 When a competing human interference is present within the same channel the separation becomes even more complex. The interference generated by a competing speaker is most challenging to remove. There exists a high correlation in the temporal structures of the target and the interfering speakers resulting in low separation accuracy by temporal structure based algorithms. On the other hand, multi-speaker speech recognition which includes single channel speaker separation has also been a related topic of contemporary research. Single channel separation systems can either be based on the knowledge of the structure of the speech signal or on the statistical information derived from the speech signal [82]. The method for single channel speaker separation proposed in this work utilizes the harmonic relationship in the group delay spectrum within the CASA framework. The basic assumption in most of the speech separation systems based on CASA is that any sound can be interchangeably represented in the time domain by a waveform or in the frequency domain by its corresponding spectrum. Masking is the most popular technique used for monaural separation within this framework. A spectrographic mask is first estimated which yields spectral components that belong to the target speaker and masks out those that do not belong to target speaker. There are several cues in the time-frequency plane that manifest as acoustic events in most cases. They are common onset, common offset and common modulation. Models of perceptual grouping aim to group relevant spectral components to estimate the spectrographic mask for a speaker. Weintraub [83] and Parsons [84] have grouped spectral components that have the harmonic relationship to the fundamental frequencies of the speakers. Wang and Brown [85] grouped spectral components based on the synchrony of network neural oscillators representing time-frequency components. In these and other similar approaches [86], [87], grouping is performed by the direct identification of coeval and similar related cues. Bach and Jordan [88] have taken a data-driven approach instead, and they represented time-frequency components by perceptually motivated features and grouped them by a procedure called spectral clustering (where the term spectrum refers to the eigen spectrum) to estimate the spectrographic masks. In this work, a speech enhancement method in a CASA framework is described. The group delay function is used as an acoustic cue instead of amplitude modulation (AM), instantaneous frequency (IF) and pitch (as explained in detail in Chapter 2) for speaker separation within the CASA framework herein.

Estimation of the Group Delay Function of a Speech Signal

The Fourier transform (FT) magnitude has been widely used for feature extraction from the speech signal. It is preferred over the FT phase mainly because the features of a signal can be perceived visually in the magnitude spectrum: each resonant frequency corresponds to one peak in the spectral envelope. In the FT phase, on the other hand, the resonances appear as transitions of the phase in the phase spectrum. While the peaks in the magnitude spectrum are clearly visible, the phase transitions are masked completely by the unavoidable wrapping of the phase spectrum. This motivates the use of the group delay function to extract the phase information. Extracting features from the FT phase rather than the FT magnitude has certain advantages. The phase spectrum of a signal represents the time delay of each of its sinusoidal components. Under additive noise, these time delays are not corrupted as significantly as the magnitude spectrum, so the phase spectrum can be considered a more reliable source for estimating features from a noisy signal. Another advantage of the FT phase is better edge preservation in multidimensional signal processing compared to the magnitude spectrum. The group delay function, which is the negative derivative of the unwrapped short time phase spectrum, can be computed directly from the speech signal as in [89] without unwrapping the short time phase spectrum. It has been used effectively to extract various source and system parameters [90] when the signal under consideration is a minimum phase signal, primarily because the magnitude spectrum of a minimum phase signal and its group delay function resemble each other [90, 91]. Mathematically, the group delay function is defined as

\tau(\omega) = -\frac{d\theta(\omega)}{d\omega}    (4.1)

where the phase spectrum $\theta(\omega)$ of the signal is treated as a continuous function of $\omega$. Values of the group delay function away from a constant indicate the degree of nonlinearity of the phase. The Fourier transform phase and the Fourier transform magnitude are related as described in [92].

The group delay function can also be computed directly from the signal as in [93] using

\tau(\omega) = -\mathrm{Im}\left[ \frac{d\left(\log X(\omega)\right)}{d\omega} \right]    (4.2)

\tau(\omega) = \frac{X_R(\omega)\, Y_R(\omega) + Y_I(\omega)\, X_I(\omega)}{|X(\omega)|^2}    (4.3)

where the subscripts R and I denote the real and imaginary parts of the Fourier transform, and $X(\omega)$ and $Y(\omega)$ are the Fourier transforms of $x(n)$ and $n\,x(n)$, respectively. It is also important to note in this context that the group delay function can be expressed in terms of the cepstral coefficients as

\tau(\omega) = -\theta'(\omega) = \sum_{n=1}^{\infty} n\, c(n) \cos(n\omega)    (4.4)

where $c(n)$ are the cepstral coefficients. Hence the group delay function $\tau(\omega)$ can in general be viewed as the Fourier transform of the weighted cepstrum.

Relation Between the Group Delay Spectrum (GDS) and Magnitude Spectrum: In general, if the spectrum of a signal is modelled as a cascade of M resonators, the frequency response of the overall filter is given by [91]

X(e^{j\omega}) = \prod_{i=1}^{M} \frac{1}{\alpha_i^2 + \beta_i^2 - \omega^2 + 2j\omega\alpha_i}    (4.5)

where $\alpha_i \pm j\beta_i$ is the complex pole pair of the $i$th resonator. The squared magnitude spectrum is given by

|X(e^{j\omega})|^2 = \prod_{i=1}^{M} \frac{1}{(\alpha_i^2 + \beta_i^2 - \omega^2)^2 + 4\omega^2\alpha_i^2}    (4.6)

and the phase spectrum is given by

\theta(\omega) = \angle X(e^{j\omega}) = -\sum_{i=1}^{M} \tan^{-1}\!\left( \frac{2\omega\alpha_i}{\alpha_i^2 + \beta_i^2 - \omega^2} \right)    (4.7)

It is well known that the magnitude spectrum of an individual resonator has a peak at $\omega^2 = \beta_i^2 - \alpha_i^2$ and a half-power bandwidth of $\alpha_i$. The group delay function can be derived

using Equation 4.1 and is given by

\tau(\omega) = -\theta'(\omega) = -\frac{d\theta(\omega)}{d\omega} = \sum_{i=1}^{M} \frac{2\alpha_i\,(\alpha_i^2 + \beta_i^2 + \omega^2)}{(\alpha_i^2 + \beta_i^2 - \omega^2)^2 + 4\omega^2\alpha_i^2}    (4.8)

It was shown in [91] that at the resonance frequency $\omega^2 = \beta_i^2 - \alpha_i^2$, the group delay function behaves like a squared magnitude response. The group delay function is an effective means of representing spectral information in the speech signal [90] and is highly robust to noise [94] compared to related methods such as IF and AM based spectral estimation. Its most important spectral properties are the additive and high resolution properties [91]. In this work, the group delay cross correlation (GDCC) function is used for speaker segregation. The motivation is that the group delay function has been widely used for temporal spectrum estimation in speech processing and captures information complementary to the magnitude spectrum [95], but it has hitherto not been used for speaker separation through cross correlation estimates across subbands. This work utilizes the group delay cross correlation estimates to group the individual speakers in a mixed signal using an iterative graph cut method.

The Group Delay Cross Correlation Approach to Speaker Segregation

In this section, an overview of the proposed approach to speaker segregation using the group delay cross correlation function is given. Figure 4.1 shows the block diagram outlining the sequence of steps of the proposed single channel speaker segregation method. Multi-pitch estimation is conducted on the mixed speech signal by passing it through a linear bank of filters. The bandwidth of each channel is equal and is selected based on the pitch of each individual speaker in the mixture. The group delay of each frequency band is then calculated, and the pairwise group delay cross correlation function is found between each of the frequency channels over the entire frequency range. The correlation matrix thus generated, also known as the affinity matrix, is used to extract correlated harmonics. The harmonics are then grouped together using the proposed iterative graph cut method, which exploits the two dimensional property of the affinity matrix.

Figure 4.1: Block diagram illustrating the sequence of steps for speaker segregation using the group delay cross correlation function.

The signal is reconstructed using the overlap and add method after removing the undesired channels. Thereafter, a binary mask is generated using the min-max method from the target and interfering speaker signals after the initial separation. This binary mask is then applied on the reconstructed signals.

Multi-Pitch Estimation from Mixed Speech Signals

Pitch determination is known to be one of the most difficult problems in speech analysis. However, using pitch to estimate the bandwidth of the bandpass filters led to a significant improvement in the results of this work. A multi-pitch determination algorithm based on the SHR [10] is used to determine the pitch of the speakers contained in the mixed speech signal. Let $A_m(f)$ be the amplitude spectrum, and let $f_0$ and $f_{max}$ be the fundamental and the maximum frequency of $A_m(f)$ respectively. The SHR is then obtained as the ratio of the sum of the subharmonic amplitudes ($S_S$) to the sum of the harmonic amplitudes ($S_H$), where

S_S = \sum_{p=1}^{L} A_m\!\left( \left(p - \tfrac{1}{2}\right) f_0 \right)    (4.9)

S_H = \sum_{p=1}^{L} A_m\!\left( p f_0 \right)    (4.10)

\mathrm{SHR} = \frac{S_S}{S_H}    (4.11)

In the above equations, $L$ is the number of harmonics in the spectrum, and $A_m(f) = 0$ for $f > f_{max}$. When the pitch estimation is confined to the range $[F0_{min}, F0_{max}]$,

68 4.1 Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework 48 then L = floor( fmax F 0 min ). The pitch range for male speaker is selected between 50 Hz and 250 Hz, while for the female speaker it is selected between 50 Hz and 400 Hz. Since the highest harmonic is not too useful as a candidate for pitch estimation, f max is limited to within a frequency range of 1200 Hz to 1500 Hz. Based on a certain threshold value of the SHR, one of either the harmonics or subharmonics is selected as a pitch candidate for the respective speakers. The threshold value equal to 0.2 is generally selected based on the pitch perception results. Generally, when the ratio is smaller than this threshold, the subharmonics (non target speaker) do not have effects on pitch perception. Hence, the harmonic frequency index corresponding to the target speaker is selected as pitch frequency index. The final pitch value is calculated as twice this harmonic frequency index, and is the pitch candidate for the target speaker. Else if the SHR is higher than the threshold, the subharmonic frequency index corresponding to the non target speaker is selected. Thus, a final pitch value that is twice this subharmonic frequency index is selected as the pitch for the non target speaker. As the ratio increases approximately to twice the threshold, the pitch is mostly perceived as one octave less than the corresponding lowest subharmonic frequency. When SHR is between the threshold and twice the threshold, the pitch seems to be ambiguous. In this case either of the harmonic or subharmonic frequency indices is selected as the pitch frequency index. These findings suggest that pitch could be determined by computing SHR and comparing it with the pitch perception data which is the method followed in this work. It must be noted in this context that computation of the multi-pitch candidates using the basic definition of SHR [96] as in Equation 4.11 is non trivial. In this work, the SHR is found using a procedure that requires no information about the pitch of either of the speakers. This procedure computes the difference spectrum of the log shifted versions of even and odd orders respectively. The aforementioned harmonic and the subharmonic frequency indices are computed from this difference spectrum. The final pitch value is generally twice the harmonic (target speaker) or subharmonic (non target speaker) frequency index to ensure that individual pitch harmonics are resolved in a reasonable manner. The multiple pitch estimates are then used to estimate the bandwidth of the individual filters in the filter bank. For each individual speaker, the bandwidth is taken to be about twice the pitch determined by the multi-pitch determination algorithm. This is because window lengths are generally

69 4.1 Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework 49 taken to be two or three times the lowest fundamental frequency of the desired speaker. This ensures that the characteristics of the desired speaker are contained in the frequency band. It must be noted here that only the voiced parts are used in the pitch estimation and the unvoiced regions are retained with no processing Subband Decomposition using Filter Bank Analysis Since speech is a broad band signal, it is essential that we first split the speech signal into a series of narrow band signals by passing it through a bank of filters. The output of each of the frequency band corresponds to a narrow band signal with a frequency equal to the center frequency of the band. Consider a speech signal X(ω) in the spectral domain. It is passed through a bank of band pass filters, with a total of K channels, where the k th channel is represented by its center frequency ω k and the corresponding output of the k th band of the filter bank is X k (ω). The calculation of the group delay of the k th channel is equivalent to the calculation of the group delay of a narrow band signal with a center frequency of ω k. A bank of bandpass Butter worth filters is used in this work. The filters are spaced linearly in frequency from 100 Hz to 8000 Hz. The number of filters is variable, but a suitable value can be decided depending on the bandwidth required. Care has to be taken that the bandwidth of each filter is not less than twice the difference among the center frequency of consecutive bands. This puts a constraint on the number of filters. Also, the number of filters should not be large, when the similarities in the group delay of consecutive frequency bands will be high. The number of filters used in filter bank is adaptively determined based on the fundamental frequency of the target speaker. It can also lead to erroneous calculations of the group delay itself. Since, the samples present in the particular frequency channel would be too small to be useful in the correct estimation of the group delay of that particular channel. The bandwidth is also found to be dependent on the value of pitch of the speaker in consideration. The pitch of the desired speaker itself is calculated using a multi-pitch estimation algorithm The Group Delay Cross Correlation Function The term cross-channel correlation in this work refers to the pairwise correlation between the concerned pair of frequency bands within the meaningful frequency range. The full band

cross channel correlation can thus be found by computing the pairwise correlation between every possible pair of frequency bands across the desired frequency range. Since the aim is to extract all those frequency bands that belong to the same speaker, the cross correlation identifies the speaker to frequency band associations. One underlying assumption in the proposed method is that the fundamental frequencies of the two speakers do not overlap to a considerable extent [97], and hence their harmonic structures are also different. In case of a large overlap between the fundamental frequencies, the task of separation becomes nearly impossible, as is well known. In this work, the short time group delay cross correlation function computed on the subbands is proposed. Let $C_g(k_0, k_1)$ denote the cross-channel correlation evaluated for the two frequency channels with indices $k_0$ and $k_1$. Mathematically, $C_g(k_0, k_1)$ is expressed as

C_g(k_0, k_1) = \frac{C(k_0, k_1)}{\sqrt{C(k_0, k_0)\, C(k_1, k_1)}}    (4.12)

where $C(k_0, k_1)$ is the covariance of the group delay for the frequency bands with indices $k_0$ and $k_1$ and is given by

C(k_0, k_1) = E\left[ \left( \tau_g[k_0] - \overline{\tau_g[k_0]} \right) \left( \tau_g[k_1] - \overline{\tau_g[k_1]} \right) \right]    (4.13)

where $E[\cdot]$ denotes the mean, and $\tau_g[k_0]$ is the group delay of the frequency band $k_0$. Thus the affinity matrix $C_g$, containing the pairwise cross correlation of the group delay of each frequency channel, is obtained. This matrix is used to identify and group similar sources together. For any particular frequency index $r$ in a given correlation matrix $C_g$, a row vector (or a column vector, since $C_g$ is symmetric) contains the correlation values of that frequency band $r$ with respect to all other frequency bands. Frequency bands that are harmonically related are expected to be highly correlated and to yield a high average correlation value. The average group delay cross correlation value for each frequency band can be calculated as

C_{g_{av}}(r) = \frac{1}{K} \sum_{t=1}^{K} C_g(r, t)    (4.14)

71 4.1 Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework 51 where r and t are frequency band indices, K is the total number of frequency bands. Once the average group delay cross correlation is obtained, frequency bands that have a high average correlation value will be grouped together. Since the motivation for using cross-channel correlation is to identify common changes of group delay between frequency channels. Higher correlation values can be expected for channels that share a common speaker as described by the group delay function Harmonic Extraction and Grouping using Iterative Graph Cut Method Although averaging generally leads to satisfactory harmonic grouping performance when the target to interference ratio (TIR) is high. However, performance of the averaging method goes down at low TIRs resulting low separation performance. The structure of the correlation matrix or affinity matrix R containing cross correlation values between all pair of frequency channels, is not exploited by mere averaging as it transforms the matrix to a one dimensional projection and then proceeds with the analysis. There have been developments in image segmentation and grouping which have led to techniques that provide many promising directions towards better solutions to this problem. Image segmentation aims to group similar components of an image together by devising a way of minimizing the distance between similar components and maximizing the distance between dissimilar components. In our case, the image to be grouped is the group delay cross correlation matrix developed earlier. Among many present techniques, Shi and Malik s algorithm [98] has been chosen to incorporate finer grouping into our algorithm. The graph cut method discussed here aims to extract a global impression of the 2-D image or the correlation matrix. Image segmentation or correlation matrix division is treated as a graph partitioning problem here. The graph cut criterion therefore effectively measures the total dissimilarity between the different groups as well as the total similarity within the groups. Thus the problem is effectively transformed into a simple eigenvalue decomposition problem. In the graph cut method, the correlation matrix is represented by a weighted undirected graph [98]. The weight on each edge m(i, j), is a function of the similarity between nodes i and j. Mathematically, the binary weight

72 4.1 Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework 52 vector U in an affinity matrix or graph is given by Equation if i C κ i = 0 otherwise (4.15) where C is the total number of the desired sub-graph and κ i represents i th element of U which belongs to C. This vector has a dimension equal to the number of nodes in the entire graph G, where each of its component takes a unit value when the corresponding node is present in a desired sub-graph C, as in Equation U T RU = m(i, j) (4.16) i,j C where R is the affinity matrix, m(i, j) is weight on each edge and superscript T is transpose operator. One can determine the average association among all nodes in the same sub-graph C using Equation d = max U T RU U T U (4.17) The average association calculated by Equation 4.17, should yield a high value if the choice of sub-graph C represents a good cluster. If U T U is also used to normalize the calculated average association, the problem of identifying the optimal sub-graph can be solved by finding a vector U that maximizes the average value of U T RU. If the requirement that vector U be binary is removed, it is expected that element κ i in vector U would be of a higher value compared to other elements when the node i is a member of sub-graph C. Based on Rayleigh s ratio theorem, for any symmetric affinity matrix R, the maximum value of d in Equation 4.17 can be obtained by picking up the eigenvector corresponding to the largest eigenvalue of R. For the particular affinity matrix R, which is a numeric representation of the graph G in Figure 4.2, the largest eigenvalue λ max is noted to be Noting the corresponding eigenvector, it is easy to set up a threshold for this particular example. Let us assume that threshold is set to 0.1, to make the decision that node 5 through node 9 belong to the subgraph C. Given the assumption that there are only two sub-graphs, we can place the nodes 1 through 4 in the other sub-graph. In case of multiple sub-graphs, the number of groups

73 4.1 Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework 53 has to be known beforehand and we can set different threshold to determine which nodes belong to which sub-graph. This is similar to the problem that if we know the number of speakers beforehand, we can set suitable thresholds to determine which channels belong to which speaker. Figure 4.2: An undirected graph, G, with 9 vertices. The graph can be clearly divided into two sub-graphs, with nodes 1 through 4 falling in one sub-graph and nodes 5 through 9 in the other sub-graph. The threshold has been set to be equal to the mean of the eigenvector corresponding to the maximum eigenvalue. Let this eigenvector be named EV. Through empirical testing of the graph cut algorithm on many affinity matrices, it was found that the sub-regions constructed using the threshold defined above are generally uneven, with one sub-region occupying a very large space of the graph compared to the other region. In order to further tune the graph segmentation, the region that was found to be larger, say a L, was further subject to regrouping using the same algorithm. It was found that while most the channels remain within the new target sub-region a Ln, some unwanted channels that are highly uncorrelated are pruned. This method, though computationally expensive, leads to finer separation of the frequency channels Illustration of Grouping in the Iterative Graph Cut Method : The aim of grouping in the iterative graph cut method is to partition the set of vertices (or points) in a graph, into disjoint sets V 1, V 2..., V m, where the similarity among vertices that belong to a particular set V i, is high and the similarity between vertices of different sets are low. Figure 4.2 shows an undirected graph G, which has the nine nodes. The index of each node is written within the circle and the correlation value is shown between each node. When we

74 4.1 Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework 54 Figure 4.3: The two sub-graphs obtained after first iteration of graph cut method. apply a graph cut method to the Figure 4.2, the entire graph is divided into two sub-graphs. The nodes 1 to 4 fall into one sub-graph and nodes 5 to 9 belong to the other sub-graph as shown in Figure 4.3. Note, in this particular case of correlation matrix R, we stop after the first iteration of graph cut method because during the next iteration all the elements of highest eigenvector for the current iteration is below the threshold. It may be noted that the threshold computed herein is obtained by calculating mean of all the elements of eigenvector of second highest eigenvalue of first iteration. The final undirected sub-graph corresponding to the target speaker as obtained from the two sub-graphs is shown in Figure 4.4. This sub-graph belongs to target speaker since it contains the maximum number of subbands. In a similar fashion, we can obtain the final undirected sub-graph for the interfering speaker. However in general the initial settings, iteration process, and the stopping criterion is Figure 4.4: The final undirected graph obtained by iterative graph cut method. very important in the proposed method and are described herein. The initial number of nodes selected to begin the partitioning is always equal to the number of filters used in the subband decomposition of the mixed speech signal. During the first iteration, the eigenvector

75 4.1 Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework 55 corresponding to the highest eigenvalue is selected. A threshold is set that is equal to the mean of the highest eigenvector. The elements of the highest eigenvector are then compared with this threshold to compute two sub-graphs containing the subbands. The sub-graph that has the highest number of members (subbands) is selected. Subsequent grouping is performed using the graph cut method in an iterative fashion. The iterations are stopped if all the elements of the highest eigenvector are below the new threshold which is the mean of the second highest eigenvector in the first iteration. Experimental results listed in the later part of the work illustrate that the graph cut method is more reliable than the onedimensional projection method with the minor penalty of a larger computational load for decomposing each correlation matrix. Although the iterative graph cut method effectively groups the speakers, the target speaker signal will still contain some frequency components of the interfering speaker. To further improve the quality of the target speaker signal, a spectrographic mask is estimated using the min-max method. This masking technique is described in the ensuing section Spectrographic Mask Generation Spectrographic masks are widely used in knowledge based methods for speech enhancement and separation. In [11], an ideal binary mask has been used. Soft masking techniques have also been used in this context [99]. Since the separation method proposed in this work is knowledge based, the spectrographic mask is computed herein as mask(ω, t) = 1; if Y T (ω, t) > Y U (ω, t) (4.18) mask(ω, t) = 0; if Y T (ω, t) < Y U (ω, t) In Equation 4.18, mask(ω, t) is a spectrographic mask computed using Y T (ω, t) and Y U (ω, t) and is a function of frequency and time. Y T (ω, t) and Y U (ω, t), are defined as the logarithm of the absolute squared magnitude spectrum of the target and the non target speaker respectively as computed from the reconstructed audio signals. Note that in Equation 4.18, ω = 2πnk Q, and Q is the short window over which the mask is estimated. The mask thus estimated is applied to the target signal synthesized which results in an improved version of

76 4.1 Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework 56 the original signal corresponding to the target speaker. Similarly, binary mask is generated by replacing the target speaker with the interfering speaker and applied to the interfering signal, which results in significant improvements in the extraction of the interfering signal Algorithm for Speaker Segregation using Group Delay Cross Correlation Function The steps involved in the separation of individual speakers from a mixed speech signal using the group delay cross correlation function and the graph cut method is enumerated in Algorithm 1. Algorithm 1 function. Single channel speaker separation using the group delay cross correlation 1. Multi Pitch Estimation : The multiple pitch periods are computed from the mixed speech signal using a multi pitch determination algorithm based on subharmonic to harmonic amplitude ratio. 2. Subband Computation : The mixed speech signal is passed through a linear filter bank whose bandwidth and the spacing are determined by the pitch of the target speaker as estimated by the multi pitch algorithm in the previous step. 3. Computation of the Group Delay Cross Correlation Function : The Group delay function is computed for each frequency subband and a cross correlation function is computed across the subbands. 4. Computing the Affinity Matrix : The pairwise correlation is found between all frequency bands of the broadband speech signal. A matrix of group delay cross correlation functions is formed and called as the affinity matrix. 5. Grouping the Correlated Harmonics : The harmonics that have a high degree of correlation are found by eigenvalue decomposition of the affinity matrix and grouped together. 6. Reconstruction of Desired Speaker s Signal : The desired speaker s signal is reconstructed using an overlap and add method of synthesis from the group of harmonics

77 4.1 Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework 57 that are derived from the previous step. 7. Refinement using Spectrographic Masking : A binary mask is generated using the min-max method and applied to the reconstructed signal of the target speaker to refine the perceptual quality of the target speaker. end Experiments on Speaker Segregation In this section, experiments are performed to separate both synthetic vowels and real speech utterances spoken by a multiple speakers. The proposed method using the group delay cross correlation function is used to separate the mixed speech signals. The experiments and the results are described in the following section Segregation of Vowels using the Group Delay Cross Correlation Function The proposed algorithm is used to segregate synthetic vowels first. In order to generate synthetic vowels, mathematical representation of the acoustics of speech producing systems [95, 100], are used. Since speech is considered to be an all pole model, the transfer function for producing a synthetic vowel can be represented as [100] G k (z) = 1 (1 2 z k cos(2πf k D)z 1 + z k 2 z 2 ) (4.19) where D is the sampling period, and it often incorporates the pitch of the vowel, and z k implicitly contains the information of the bandwidth. In the artificial synthesis of vowels, a three formant vowel structure is used. Six synthetic vowels with formants at 1000, 2000 and 3000 Hz with different pitch periods are generated. It is observed that vowels 1, 2 and 3 have harmonically related pitch periods between them, while vowels 4, 5 and 6 are harmonically related pitch periods between them. The group delay cross correlation matrix C g is calculated for the different vowels, and the results using both the averaging and the iterative graph cut method is plotted in terms of average correlation distribution variance. Figure 4.5 gives the illustration of the average correlation distribution (ACD) between two pairs of channels using 1-D Projection and graph cut method. A lower variance of the ACD indicates the

higher confidence in grouping the channels into two groups. Hence, the graph cut method (2-D projection) performs better than simple averaging (1-D projection) and results in a reasonably better separation of the synthetic vowels. The channels thus found to be correlated are used for reconstruction.

Figure 4.5: Illustration of average correlation distribution variance between two pairs of channels using the 1-D projection method and the graph cut method.

Segregation of Mixed Speech Signals using the Group Delay Cross Correlation Function

The proposed Algorithm 1 is used to segregate the speakers from a mixed speech signal. Experiments are conducted on mixed speech samples from the GRID database at different TIR values. Figure 4.6 shows the spectrograms of the mixed signal, the target signal, and the reconstructed signals with and without the application of the mask at a TIR of 0 dB. Figure 4.7 illustrates similar plots at a TIR of -6 dB. The target male speaker utters the sentences \bin white with g eight soon\ and \set white at o two please\ in Figures 4.6 and 4.7, respectively. It can be noted from Figures 4.6 and 4.7 that the group delay cross correlation method is able to segregate the target speaker effectively. It can also be noted that the quality of the target speaker improves when the spectrographic mask is applied. In the ensuing section, the performance of the proposed separation method is evaluated in terms of speech enhancement and multi-speaker speech recognition. In order to evaluate

79 4.1 Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework 59 Figure 4.6: Spectrograms of the mixed sound signal, the reference target signal (above), the reconstructed signals with the application of mask, and without the application of the mask (below) when the TIR is 0 db. Figure 4.7: Spectrograms of the mixed sound signal, the reference target signal (above), the reconstructed signal with the application of mask, and without the application of the mask (below) when the TIR is -6 db. the results of separation, experiments are conducted on GRID corpus [3] sentences. Both subjective and objective measures are used to evaluate the performance. Experiments are also conducted on multi-speaker speech recognition on the GRID corpus and the results are discussed in the ensuing sections Experiments on Speaker Segregation Subjective evaluation of the separated speech is done by human listeners in which they have to hear the separated sounds and rate them on a relative scale. In subjective test,

80 4.1 Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework sentences from GRID database are used as a reference signal for training each subject. The corresponding separated signal computed from all the methods are used for grading the corresponding methods used for comparison. In objective evaluation, several techniques are used to calculate the similarities between the reconstructed speech signals and the reference speech signals. For objective evaluation, 1000 reconstructed sentences obtained from various methods for both target and interfering speakers are used. Table 4.1: Possible choices in the sentences of the GRID corpus. command colour preposition letter number adverb bin(b) blue(b) at(a) A-Z 1-9 and again(a) lay(l) green(g) by(b) excluding W zero(z) now(n) place(p) red(r) in(i) please(p) set(s) white(w) with(w) soon(s) Database used The GRID [3] database is used in the experiments on speaker segregation. GRID is a large multi-talker audiovisual sentence corpus to support joint computational-behavioral studies in speech perception. In brief, the corpus consists of high-quality audio and video (facial) recordings of 1000 sentences spoken by each of 34 talkers (18 male, 16 female). Sentences are of the form \put red at G9 now\, that is, [command:4] [color:4] [preposition:4] [letter:25] [number:10] [adverb:4]. Table 4.1 lists all the possible choices in the six positions. The TIR of the mixtures vary from 6 db to -6 db. In GRID, all files are single channel speech data sampled at 25 khz. All material is end-pointed. In other words, there is little or no initial or final silence. The corpus together with transcriptions is freely available for research use Subjective Evaluation Results Any task of speech separation is best judged by the most efficient filter, humans know of which is the ear. Hence, subjective methods of evaluation incorporate the subjective judgment of the listener, which is an essential and perhaps the most accurate way of assessing the quality of any separation system. In subjective evaluation, the separated signals are rated using four

81 4.1 Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework 61 parameters global quality (GQ), target preservation (TP), other signal suppression (OSS) and artificial noise suppression (ANS) [101], as explained elaborately in Chapter 3. The results of subjective evaluation using the proposed GDCC method are compared with six other algorithms, namely the group delay cross correlation-without pitch estimation (GDCC-WP), group delay cross correlation-without masking (GDCC-WM), instantaneous Frequency method, fast fourier transform (FFT) based separation method, Latent Variable Decomposition (LVD) based separation method [102] and Latent Dirichlet Decomposition (LDD) based separation method [103] at 0 db TIR. It must be noted that GDCC-WP is defined as group delay cross correlation without pitch, wherein the pitch estimate is not used for calculating the filter bandwidth but spectrographic masks are applied to the reconstructed signals. GDCC-WM is defined as the group delay cross correlation without masking, where the spectrographic masks are not applied on the reconstructed signals, but the pitch estimate is used to compute the filter bandwidth. The comparative subjective evaluation of the various algorithms are illustrated in Figure 4.8 and Figure 4.9 for the reconstructed target and interference signal respectively. Demonstration of audio files to compare the performance of proposed method with other conventional methods are available at The tabulated mean (M) and standard deviation Subjective Scores Global Quality Target Preservation Other Signal Suppression Noise Suppression Reference GDCC GDCC WP GDCC WM FFT IF LVD LDD Figure 4.8: Comparative performance of the various algorithms for quality of reconstructed speech (target speaker). (SD) for the aforementioned seven different methods for four parameters (GQ, TP, OSS and

82 4.1 Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework Subjective Scores Global Quality Target Preservation Other Signal Suppression Noise Suppression Reference GDCC GDCC WP GDCC WM FFT IF LVD LDD Figure 4.9: Comparative performance of the various algorithms for quality of separated speech (interfering speaker). ANS) are listed in Table 4.2. The results indicate that the group delay cross correlation al- Table 4.2: Comparison of the reconstruction algorithms for GQ, TP, OSS and ANS in terms of mean and standard deviation. GQ TP OSS ANS Methods M SD M SD M SD M SD GDCC GDCC-WP GDCC-WM Target FFT IF LVD LDD GDCC GDCC-WP GDCC-WM Interference FFT IF LVD LDD gorithm performs reasonably better than the standard FFT, Instantaneous frequency based correlation approaches, Latent Variable Decomposition method and Latent Dirichlet Decomposition method. The incorporation of the multi-pitch algorithm followed by spectrographic masking significantly improves the quality of the reconstructed target signals. However, there are instances wherein the FFT based correlation methods perform better for noise suppression only where the noise is assumed to be the undesired signal. It can also be noted that the

83 4.1 Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework 63 instantaneous frequency based algorithm gives very low quality of noise suppression. Similar conclusions can be drawn from results for the reconstructed interfering signals. Performance of Latent Dirichlet Decomposition is almost similar to Latent Variable Decomposition for low number of speakers but if competitive number of speakers increases, it is expected to perform better as it is generalization over Latent Variable Decomposition Objective Evaluation Results To determine the quality of the separated speaker signals on an objective test scale, the perceptual similarity measure is computed between the signal pairs by calculating the linear cross correlation coefficient. The resulting correlation value serves as an objective measure of the perceptual similarity between two audio signals. Apart from the overall correlation between internal representations (output value PSM), it also computes an estimate of the instantaneous audio quality as a function of time by frame-wise correlation (output vector PSM inst). The other objective measure used in this work is signal to noise ratio (SNR) loss [71] between the reference and reconstructed sound files. The term SNR Loss is used to indicate the loss introduced by noise suppression algorithms, but it presents a general measure to assess the noise quality of the signal obtained after separation. In this work, we have calculated the SNR Loss in individual bands and the results across all frequency bands are added to get the final SNR Loss measure. Table 4.3: Mean PSM and PSMt scores for the proposed method (GDCC) and other conventional methods at various TIR values. Methods TIR=-6 db TIR=-3 db TIR=0 db TIR=3 db TIR=6 db PSM PSMt PSM PSMt PSM PSMt PSM PSMt PSM PSMt GDCC GDCC-WP GDCC-WM FFT IF LVD LDD Perceptual Similarity Measure : The mean PSM and PSMt scores at various TIR, for the proposed GDCC method along with other methods like GDCC-WP, GDCC-WM, FFT, IF, LVD and LDD is listed in Table 4.3. As expected, the perceptual similarity measure

84 4.1 Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework 64 drops at low TIR levels. When the values of TIR is closer or smaller to zero, PSMt (the averaged value of instantaneous PSM vector) drops below zero, indicating poor separation. This corroborates with the results obtained from the subjective evaluation of the sound files. SNR Loss : The SNR loss values of the target speaker on reconstruction using GDCC, GDCC-WM, GDCC-WP, IF, LVD, LDD and direct separation (FFT) methods is illustrated in Figure It is observed that lesser the SNR loss value, closer the signal is to the reference in terms of noise suppression. FFT based correlation method however seems to perform better, as noise suppression is concerned. This is probably due to the lesser resolution that FFT spectrum exhibits when compared to the group delay spectrum. However, from the viewpoint of separation the group delay cross correlation algorithm performs reasonably better than other techniques compared herein. SNR Loss GDCC GDCC WP GDCC WM FFT IF LVD LDD Figure 4.10: Comparison of SNR Loss of the reconstructed target speaker for various methods Experiments on Multi-Speaker Speech Recognition The database used for these experiments is the two talker data from the GRID Corpus. The training set consists of clean utterances of about sentences (500 from each of the 34 speakers), while the testing utterances are degraded at various TIRs, and each condition contains 600 sentences, consisting of a limited number of possible choices.
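To make the objective evaluation above concrete, a simplified sketch of the frame-wise correlation idea that underlies the PSM and PSM_inst scores is shown below. This is only an illustrative stand-in and not the perceptual-model-based implementation used for the reported scores; the frame length, hop size and function name are assumptions.

```python
import numpy as np

def framewise_correlation(ref, est, frame_len=400, hop=200):
    """Overall and per-frame linear correlation between a reference and a
    reconstructed signal -- a simplified stand-in for PSM and PSM_inst."""
    n = min(len(ref), len(est))
    ref, est = ref[:n], est[:n]
    overall = np.corrcoef(ref, est)[0, 1]
    inst = []
    for start in range(0, n - frame_len + 1, hop):
        r = ref[start:start + frame_len]
        e = est[start:start + frame_len]
        if r.std() > 0 and e.std() > 0:          # skip silent frames
            inst.append(np.corrcoef(r, e)[0, 1])
    return overall, np.array(inst)
```

Averaging the instantaneous vector gives a single score comparable in spirit to PSMt, while the SNR loss measure is accumulated over subbands before being reported.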

85 4.1 Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework Experimental Conditions Hidden Markov Models (HMM) are built for the speech recognition system using the entire two-talker sentence pairs files from the GRID corpus. Each clip is about one second in length and consists of two simultaneous spoken sentences from two speakers in the same channel. However, the sentences in each mixed clip are different from one another. In the speech recognition experiments, we have used 15 state, and 3 mixtures triphone HMM with 39 MFCC with delta and acceleration coefficients. Data from the GRID corpus is first divided into training and test sets. The clean speech training data from GRID corpus are used in training the baseline triphone models of the recognition system. The test data sets from GRID corpus are generated at different TIR and applied to the proposed algorithm along with other methods compared in this work. The reconstructed files from these methods are used for testing the recognition system. However, it must be noted that the recognition results are presented as word error rates after recognizing the complete sentence, as is done with in a state of the art continuous speech recognition system which uses only acoustic models. It must also be noted that the structure of the GRID corpus sentences ensure that acoustic models are fit enough to measure recognition accuracy in a reasonable manner. WER is the qualitative measure used to evaluate the performance of the proposed method. Percentage word error rates are calculated for a limited number of test files generated using the proposed method. WER is defined as WER = S + D + I N (4.20) where S is the number of substitutions, D is the number of deletions, I is the number of insertions and N is the number of words in the reference. The WER of the algorithm is also compared with human WER performance (HP) [104] along with several other methods as shown in Table 4.4 on the same data at various TIRs. Human WER has been calculated by running a series of listening and identification experiments and then calculating the WER for the task at various TIR values.
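As an illustration of Equation 4.20, a compact word error rate computation based on Levenshtein alignment of the reference and hypothesis word sequences is sketched below; whitespace tokenisation is assumed and the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N, obtained from the minimum edit distance
    between the reference and hypothesis word sequences (Equation 4.20)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] is the minimum number of edits aligning ref[:i] with hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, word_error_rate("set white at o two please", "set white at two please") returns 1/6, corresponding to a single deletion.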

86 4.1 Single Channel Speaker Separation using Group Delay Spectrum in a CASA Framework 66 Table 4.4: Comparison of the word error rate (%) for various methods at several TIR values. TIR ( db) GDCC HP FFT IF LVD LDD Experimental Results Table 4.4 lists the percentage WER of the extracted target files at various TIR values. It is placed alongside human performance rates, FFT, Instantaneous Frequency, Latent Variable Decomposition and Latent Dirichlet Decomposition in the same tasks at the respective TIR values. From the Table 4.4, it is clear that the proposed algorithm GDCC has better WER compared to other algorithms. This signifies the importance of the proposed GDCC method for speech recognition task compared to other methods Summary A method for single channel speaker segregation using the group delay cross correlation is discussed in this section. An iterative graph cut method has also been proposed to improve the performance of the grouping of correlated frequency bands. This method exploits the crosschannel correlation to select the frequency bands belonging to the same speaker using the iterative graph cut method. Hence, the proposed group delay cross correlation function along with the iterative graph cut method is an efficient two dimensional projection approach, when compared to the conventional one dimensional averaging approaches based on correlation. It can be noted that the separation performance of the group delay cross correlation method is reasonably higher than several conventional methods from literature. These results also shed light in understanding of the analogous nature of the group delay and the instantaneous frequency based methods. Possible improvements in the algorithm can include the integration of long time and short time windowing methods. Integration of instantaneous frequency and group delay methods to select spectro-temporally correlated harmonics can lead to better

87 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 67 separation. A joint instantaneous frequency-group delay (IF-GD) affinity matrix can be explored as a robust estimator of the correlated harmonics present in a mixed speech signal. 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization A novel method of joint speaker separation and dereverberation that minimizes the divergence between the observed and true spectral subband envelopes is discussed in this section. This divergence minimization is carried out within the non-negative matrix factorization (NMF) framework by imposing certain non-negative constraints on the subband envelopes. Additionally, the joint speaker separation and dereverberation framework described in this work utilizes the spectral subband envelope obtained from group delay spectral magnitude (GDSM). The GDSM is computed from the group delay function as in [94]. In order to obtain the spectral subband envelope from the GDSM, the equivalence of the magnitude spectrum and the group delay function via the weighted cepstrum is used. Since the subband envelope of the group delay spectral magnitude is robust and has a high spectral resolution, less error is noted in the NMF decomposition. Late reverberation components present in the separated signals are then removed using a modified spectral subtraction technique. The quality of separated and dereverberated speech signal is evaluated using several objective and subjective criteria. Experiments on distant speech recognition are then conducted at various direct-to-reverberant ratios (DRR) on the GRID [3] corpus. Experimental results indicate significant improvements over existing methods of dereverberation in the literature Introduction The objective of any speaker separation method is to recover the original signals from a composite signal. The problem of separation becomes challenging when signals are mixed in a reverberant environment. Reverberation occurs when the distance between the speaker and microphone is large enough to create multiple paths for the speech signal to arrive at the microphone. The reverberation results in degradation in intelligibility of the speech signal and the performance of a automatic speech recognition (ASR) system.

88 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 68 Several algorithms have been developed for single channel speech dereverberation. The temporal averaging method is proposed in [105] to estimate the room acoustic impulse response (AIR). This is done by complex cepstrum which utilizes an adaptive segmentation technique. The inverse filter solution is then obtained after pre estimation of RIR. In [17], blind estimation of an inverse filters required to obtain the dereverberated signal is explained. The inverse filters in [17] are estimated by computing the correlation matrix between input signals, instead of room impulse response. A two stage algorithm for single microphone has been proposed in [106], where an inverse filter is estimated to reduce coloration effects during first stage. The spectral subtraction is then applied as post processing step to minimize the influence of long-term reverberation. In [107], the maximum kurtosis of the speech residual is proposed for blind dereverberation of speech signal. A non-negative matrix factorization (NMF) method which utilizes gammatone subband magnitude domain dereverberation is proposed in [12]. In [12], the Fourier transform spectral magnitude is used in the NMF framework for automatic speech recognition applications. In [9], the dereverberation is carried out by using cepstrum to determine the acoustic impulse response and then used for inverse filtering to obtain the estimate of clean speech. The truncation error present in [9] is removed in [105], but still inverse filtering is required. Authors in [108] present a blind dereverberation method designed to recover the subband envelope of an original speech signal from its reverberant version. The problem is formulated as a blind deconvolution problem with non-negative constraints, and utilizes the sparse nature of speech spectrograms. In [109], a harmonicity-based dereverberation method to reduce the amount of reverberation in the signal picked up by a single microphone is discussed. A variant of spectral subtraction described in [110] utilizes multi-step forward linear prediction for speech dereverberation. It precisely estimates and suppresses the late reverberations, which result in enhancing the ASR performance. All these methods deal with speech dereverberation problem in a single source environment. Considerable work has also been done to address the speaker separation problem in a nonreverberant environment. In instantaneous frequency method [82], the objective is to extract target component of speech mixed with interfering speech, and to improve the recognition accuracy that is obtained using the recovered speech signal. The instantaneous frequency is used to reveal the underlying harmonic structures of a complex auditory scene. In latent

89 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 69 variable decomposition [102], each magnitude spectral vector of speech signal is represented as the outcome of a discrete random process. The latent Dirichlet decomposition method [103] is a generalization of latent variable decomposition that models distribution process as a mixture of multinomial distributions. In this model, the mixture weights of the component multinomial varies from one analysis window to the other. Non-negative matrix factorization [111], [112], [113] is also an effective method in the context of mixed speaker separation by decomposing the STFT magnitude matrix [114]. A convolutive version of NMF is described in [115] that utilizes temporal variations into the account for speaker separation. The single channel separation of speech and music is discussed in [116] by utilizing discrete energy separation algorithm (DESA). Apart from single channel, multi channel underdetermined blind speaker separation in clean environment is discussed in [117] and [118]. In [119], a non-negative blind source separation (BSS) in noise free environment using multiplicative updates and subspace projection is presented. In general, the problem of speaker separation and dereverberation are looked at separately and solutions have been proposed for each of them individually as can be noted from aforementioned discussion. However, efforts have also been made in addressing the joint speaker separation and dereverberation problem. The joint optimization method for BSS and dereverberation for multi channel is discussed in [120] by optimizing the parameters for the prediction matrices and for the separation matrices. A BSS framework in noisy and reverberant environment based on a matrix formulation is proposed in [121]. The method in [121] allows simultaneous exploitation of nonwhiteness and nonstationarity of the source signals using second-order statistics. In [122], the joint block Toeplitzation and block-inner diagonalization (JBTBID) of a set of correlation matrices of the observed vector sequence is obtained for convolutive BSS. In [123], conditional separation and dereverberation method (CSD) for simultaneously achieving blind speaker separation and dereverberation of sound mixtures is discussed. A tractable BSS framework is explained in [124] for estimating and combining spectral source models from noisy source estimates. In [125], general broadband approach to BSS for convolutive mixtures based on second-order statistics is discussed. The optimum inverse filtering algorithm based on Bezouts theorem is used at the dereverberation stage. This is computationally more efficient and allows the inversion of long impulse responses in

90 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 70 real-time applications. An integrated method for joint multi channel blind dereverberation and separation of convolutive audio mixtures is discussed in [126]. All the above methods follows the tandem approach to solve the separation and reverberation problem for multi channel scenario. Additionally, the above joint blind speaker separation and dereverberation methods requires multi-channel input. This assumption has been relaxed in this work by considering the single channel case. The contributions of the work are as follows. The work proposes a new model for joint speaker separation and dereverberation, in multisource environment. In this work, different impulse responses are considered for different locations of the speakers. Additionally, the proposed method uses subband envelope of the mixed speaker signal computed from group delay spectral magnitude (GDSM) [94], [127] within the NMF framework. Due to the high resolution property of group delay function [95], [94], [128], [129], [130], this method reduces the error in the decomposition of observed subband envelope (OSE) sequence of the mixed signal into its constituent convolutional components. The system model for speaker separation under reverberant environment is discussed in the ensuing section System Model for Speaker Separation under Reverberant Environment Figure 4.11: The system model for reverberation for two sources mixed at a single microphone in subband envelope domain under noise.

91 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 71 The system model for speaker separation under reverberant environment is first formulated. Figure 4.11 illustrates the model for reverberation of two sources mixed at a single microphone. Let L(k, m) and F (k, m) be the subband envelope for two speaker signals located at two different positions. Here, m is frame index and k {1,..., K} correspond to frequency bin index respectively. K is total number of subbands in each frame. The subband envelope of room impulse response (RIR) associated with two speaker signals are denoted by H 1 (k, m) and H 2 (k, m) respectively. The parameters for H 1 (k, m) and H 2 (k, m) are different due to different location of two speakers. The subband envelope Y 1 (k, m) for first reverberated signal can be represented as the convolution of subband envelope of first speaker L(k, m) with its corresponding subband envelope of RIR H 1 (k, m) in an expectation sense [108]. Y 1 (k, m) = τ L(k, τ)h 1 (k, m τ) (4.21) In convolution form, Equation 4.21 can be expressed as Y 1 (k, m) = L(k, m) H 1 (k, m) (4.22) Similarly, the subband envelope Y 2 (k, m) for second reverberated signal can be written as Y 2 (k, m) = τ F (k, τ)h 2 (k, m τ) (4.23) In convolution form, Equation 4.23 can be expressed as Y 2 (k, m) = F (k, m) H 2 (k, m) (4.24) Hence, the mixed reverberated subband envelope Z(k, m) can be written as Z(k, m) = Y (k, m) + V (k, m) (4.25) Here, Y (k, m) = Y 1 (k, m) + Y 2 (k, m) and V (k, m) is the subband envelope of noise which is uncorrelated to both Y 1 (k, m) and Y 2 (k, m). In the mixture, one of the speaker is assumed to be a target speaker.
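A minimal sketch of this mixing model is given below, assuming that the subband envelopes are stored as K x M arrays (subbands x frames); the variable names follow the notation above, but the routines themselves are only illustrative.

```python
import numpy as np

def reverberate_envelope(S, H):
    """Per-subband convolution of a source envelope S (K x M) with its RIR
    envelope H (K x J), truncated to M frames (Equations 4.21-4.24)."""
    K, M = S.shape
    Y = np.zeros((K, M))
    for k in range(K):
        Y[k] = np.convolve(S[k], H[k])[:M]
    return Y

def observed_mixture(L, F, H1, H2, V=None):
    """Observed mixture envelope Z = L*H1 + F*H2 + V (Equation 4.25)."""
    Z = reverberate_envelope(L, H1) + reverberate_envelope(F, H2)
    return Z if V is None else Z + V
```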

92 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 72 The model presented herein is different in many aspects when compared to earlier methods discussed in [12], [108], [131] and [132] which deal with multipath effect in reverberant environment. Firstly, this model is formulated as a joint speaker separation and dereverberation problem in multi source environment. Secondly, while traditional approaches [12] utilize the Fourier transform spectral magnitude (FTSM) in an NMF framework, the proposed method utilizes the subband envelope of spectral magnitude computed from group delay function. This is primarily because the GDSM is more robust and has high resolution when compared to Fourier transform spectral magnitude. The formulation of the speaker separation and dereverberation problem in NMF framework is discussed in the ensuing section Formulation of Speaker Separation Problem using Constrained Spectral Divergence Optimization The joint speaker separation and dereverberation method in the subband envelope domain is shown in Figure This model tries to estimate the clean spectrum of two individual speakers through a decomposition of the subband envelope of mixed reverberated speech signal Y (k, m) into its convolutive components L(k, m), H 1 (k, m) and F (k, m), H 2 (k, m) respectively. To achieve this decomposition, a divergence criterion is formulated. In this work, a priori knowledge of nature of subband envelope of RIRs H 1 (k, m) and H 2 (k, m), subband envelope of speech signals mixed at a single microphone L(k, m) and F (k, m) and subband envelope of noise V (k, m) is not required. The filter parameters H 1 (k, m) and H 2 (k, m) are not observed directly but are inferred through the mixed reverberated speech signal Y (k, m). This is an unconstrained problem, where infinite number of decomposition of Y (k, m) into its convolutive components L(k, m), H 1 (k, m) and F (k, m), H 2 (k, m) exists. In order to constrain the solution space, it is required to assume some a priori information about either L(k, m), F (k, m) or H 1 (k, m), H 2 (k, m). In this work, the constraint on positivity of L(k, m), F (k, m), H 1 (k, m) and H 2 (k, m) is considered along with constraining the individual sum of H 1 (k, m) and H 2 (k, m) to unity to avoid scaling problems. The clean speech of two speakers are assumed to be sparse herein. Given an observed subband envelope Z(k, m), the goal now is to find an approximation such that Z(k, m) Y (k, m) (4.26)
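The constraints described above can be made explicit with a small helper that clips the factors to non-negative values and rescales each subband RIR envelope to unit sum; this is an illustrative sketch only, and the epsilon used to avoid division by zero is an assumption.

```python
import numpy as np

def enforce_constraints(L, F, H1, H2, eps=1e-12):
    """Non-negativity of all factors and sum_m H(k, m) = 1 for each subband,
    which removes the scaling ambiguity between the sources and the filters."""
    L, F = np.maximum(L, 0.0), np.maximum(F, 0.0)
    H1, H2 = np.maximum(H1, 0.0), np.maximum(H2, 0.0)
    H1 = H1 / (H1.sum(axis=1, keepdims=True) + eps)
    H2 = H2 / (H2.sum(axis=1, keepdims=True) + eps)
    return L, F, H1, H2
```

Sparsity of L(k, m) and F(k, m) is not enforced by projection but enters through the L1 terms of the objective function defined next.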

Given this approximation, Z(k, m) can be realized using the model shown below

Z(k, m) = Y(k, m) + \epsilon(k, m)    (4.27)

where Y(k, m) is the true subband envelope and \epsilon(k, m) is the reconstruction error. The reconstruction error is generally due to the noise V(k, m) and the artifacts generated during decomposition. Thus, the problem now is to minimize the divergence between Z(k, m) and Y(k, m). In order to do this, an objective function with non-negative constraints is defined as follows

\min\; D(Z(k, m)\,\|\,Y(k, m)) = \sum_j \Big( Z(k, j)\log\frac{Z(k, j)}{Y(k, j)} - Z(k, j) + Y(k, j) \Big) + \lambda_1 \sum_j L(k, j) + \lambda_2 \sum_j F(k, j)    (4.28)

where L(k, m) \geq 0, F(k, m) \geq 0, H_1(k, m) \geq 0, H_2(k, m) \geq 0, \sum_m H_1(k, m) = 1 and \sum_m H_2(k, m) = 1. Here the parameters \lambda_1 and \lambda_2 are \geq 0 and are obtained empirically. L(k, m) and F(k, m) are assumed to be sparse. Y(k, j) in Equation 4.28 is given by

Y(k, j) = \sum_m \big( L(k, m) H_1(k, j - m) + F(k, m) H_2(k, j - m) \big)    (4.29)

The first term in Equation 4.28 is the generalized Kullback-Leibler divergence. The second and third terms in Equation 4.28 impose sparsity on L(k, m) and F(k, m) respectively. The constraint on H_1(k, m) and H_2(k, m) avoids scaling problems [12], [108]. The sparsity constraint ensures that only a small number of spectral components in L(k, m) and F(k, m) exhibit high values. In this work, sparsity is induced by using an L_1-regularization.

Spectral Divergence Minimization for Joint Speaker Separation and Dereverberation

Given the subband envelope domain model discussed earlier, the minimization of the objective function in Equation 4.28 is now performed by a variant of the gradient descent approach [12]. The spectral components at the end of each iteration of the gradient descent are ensured

to be non-negative. Differentiating the objective function with respect to L(k, m), we obtain

\frac{dD}{dL(k, m)} = -\sum_j \frac{(Z(k, j) - Y(k, j))\, H_1(k, j - m)}{Y(k, j)} + \lambda_1    (4.30)

Similarly, the objective function is differentiated with respect to F(k, m), H_1(k, m) and H_2(k, m). Hence, we obtain

\frac{dD}{dF(k, m)} = -\sum_j \frac{(Z(k, j) - Y(k, j))\, H_2(k, j - m)}{Y(k, j)} + \lambda_2    (4.31)

\frac{dD}{dH_1(k, m)} = -\sum_j \frac{(Z(k, j) - Y(k, j))\, L(k, j - m)}{Y(k, j)}    (4.32)

\frac{dD}{dH_2(k, m)} = -\sum_j \frac{(Z(k, j) - Y(k, j))\, F(k, j - m)}{Y(k, j)}    (4.33)

The update equations for L(k, m), F(k, m), H_1(k, m) and H_2(k, m) can therefore be written as

\tilde{L}(k, m) = L(k, m) - \beta_1 \frac{dD}{dL(k, m)}    (4.34)

\tilde{F}(k, m) = F(k, m) - \beta_2 \frac{dD}{dF(k, m)}    (4.35)

\tilde{H}_1(k, m) = H_1(k, m) - \beta_3 \frac{dD}{dH_1(k, m)}    (4.36)

\tilde{H}_2(k, m) = H_2(k, m) - \beta_4 \frac{dD}{dH_2(k, m)}    (4.37)

where \beta_1, \beta_2, \beta_3 and \beta_4 are learning rate parameters for L(k, m), F(k, m), H_1(k, m) and H_2(k, m) respectively. In general, it cannot be guaranteed that the updates \tilde{L}(k, m), \tilde{F}(k, m), \tilde{H}_1(k, m) and \tilde{H}_2(k, m) of L(k, m), F(k, m), H_1(k, m) and H_2(k, m) respectively are non-negative. The learning rate parameters \beta_1, \beta_2, \beta_3 and \beta_4 are therefore selected such that the updates of L(k, m), F(k, m), H_1(k, m) and H_2(k, m) are always non-negative. Thus, keeping the non-negative constraint in mind, the parameters \beta_1, \beta_2, \beta_3 and \beta_4 are set as shown

below

\beta_1 = \frac{L(k, m)}{\sum_j H_1(k, j - m) + \lambda_1}

\beta_2 = \frac{F(k, m)}{\sum_j H_2(k, j - m) + \lambda_2}

\beta_3 = \frac{H_1(k, m)}{\sum_j L(k, j - m)}

\beta_4 = \frac{H_2(k, m)}{\sum_j F(k, j - m)}

The values of \beta_1, \beta_2, \beta_3 and \beta_4 are substituted into Equations 4.34, 4.35, 4.36 and 4.37 respectively. The updates \tilde{L}(k, m), \tilde{F}(k, m), \tilde{H}_1(k, m) and \tilde{H}_2(k, m) can now finally be written in an explicit form as

\tilde{L}(k, m) = \frac{L(k, m)}{\sum_j H_1(k, j - m) + \lambda_1} \sum_j \frac{Z(k, j)\, H_1(k, j - m)}{Y(k, j)}    (4.38)

\tilde{F}(k, m) = \frac{F(k, m)}{\sum_j H_2(k, j - m) + \lambda_2} \sum_j \frac{Z(k, j)\, H_2(k, j - m)}{Y(k, j)}    (4.39)

\tilde{H}_1(k, m) = \frac{H_1(k, m)}{\sum_j L(k, j - m)} \sum_j \frac{Z(k, j)\, L(k, j - m)}{Y(k, j)}    (4.40)

\tilde{H}_2(k, m) = \frac{H_2(k, m)}{\sum_j F(k, j - m)} \sum_j \frac{Z(k, j)\, F(k, j - m)}{Y(k, j)}    (4.41)

A fixed number of iterative updates is performed to obtain the final estimates. Initializing L(k, m), F(k, m), H_1(k, m) and H_2(k, m) with non-negative values ensures non-negative updates. Equations 4.38, 4.39, 4.40 and 4.41 provide the iterative updates for the m th frame. Similarly, the updates can be obtained for the other frames. Having obtained the subband envelopes of the individual speakers as in Equations 4.38 and 4.39, the corresponding spectral magnitudes L_a(k, m) and F_a(k, m) are then computed. Due to the fixed number of iterations in the NMF processing and estimation or modeling errors, some amount of late reverberation and noise is still present in the updates of the spectral magnitude

96 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 76 of the separated speakers. Hence, the remaining late reverberation and noise components are removed by modified spectral subtraction and Wiener filtering method respectively as described in the ensuing section Modified Spectral Subtraction The modified spectral subtraction (MSS) [17] method helps in removing the late reverberation components present in the separated spectral magnitude (which are not completely removed by NMF processing). The spectral magnitude L a(k, m) and F a(k, m) of both the speaker is a linear combination of the spectral magnitude obtained after NMF update L a (k, m) and F a (k, m) respectively, that is L a(k, m) = L a (k, m) + F a(k, m) = F a (k, m) + J α 1i (k)l a (k, m i) (4.42) i=1 J α 2i (k)f a (k, m i) (4.43) i=1 where α 1i (k) and α 2i (k) are the coefficient of the late reverberation for previous i frames for L a(k, m) and F a(k, m) respectively, and J is the duration of the reverberation. Here, α 1i (k) 1 and α 2i (k) 1, because the early reflection components that has most of the power of reverberation is reduced by NMF processing. Therefore, the power spectrum of late reverberation can be approximated by U L (k, m) U F (k, m) J α 1i (k) 2 L a(k, m i) 2 (4.44) i=1 J α 2i (k) 2 F a(k, m i) 2 (4.45) i=1 The coefficients of the late reverberation α 1i (k) and α 2i (k) are estimated by [ L α 1i (k) = E a (k, m)l ] a (k, m i) L a(k, m i) 2 [ F α 2i (k) = E a (k, m)f a ] (k, m i) F a(k, m i) 2 (4.46) (4.47)
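For concreteness, a simplified per-subband implementation of the multiplicative updates in Equations 4.38-4.41 above is sketched here before moving on to the subtraction step. It assumes envelopes stored as K x M arrays and RIR envelopes of J <= M frames, ignores frame-boundary effects in the denominators, and renormalises the RIR envelopes to unit sum after each pass; it is an illustrative sketch rather than the exact implementation evaluated in this work.

```python
import numpy as np

def corr_lags(r, h, n_lags):
    """sum_j r[j] * h[j - m] for m = 0 .. n_lags-1 (terms falling outside the
    observed frames are simply dropped in this sketch)."""
    full = np.convolve(r, h[::-1])
    return full[len(h) - 1:len(h) - 1 + n_lags]

def joint_nmf_updates(Z, L, F, H1, H2, lam1=0.1, lam2=0.1, n_iter=50, eps=1e-12):
    """Iterative updates of Equations 4.38-4.41, applied independently per subband."""
    K, M = Z.shape
    J = H1.shape[1]
    for _ in range(n_iter):
        # Current model Y = L*H1 + F*H2 (per-subband convolution, Equation 4.29).
        Y = np.stack([np.convolve(L[k], H1[k])[:M] +
                      np.convolve(F[k], H2[k])[:M] for k in range(K)])
        R = Z / (Y + eps)                      # ratio Z(k, j) / Y(k, j)
        for k in range(K):
            L_new = L[k] * corr_lags(R[k], H1[k], M) / (H1[k].sum() + lam1 + eps)
            F_new = F[k] * corr_lags(R[k], H2[k], M) / (H2[k].sum() + lam2 + eps)
            H1_new = H1[k] * corr_lags(R[k], L[k], J) / (L[k].sum() + eps)
            H2_new = H2[k] * corr_lags(R[k], F[k], J) / (F[k].sum() + eps)
            L[k], F[k] = L_new, F_new
            H1[k] = H1_new / (H1_new.sum() + eps)   # keep sum_m H1(k, m) = 1
            H2[k] = H2_new / (H2_new.sum() + eps)
    return L, F, H1, H2
```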

97 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 77 Spectral subtraction is now employed to obtain the dereverberated signal for both the speakers. ˆL a (k, m) = L a(k, m)g L (k, m) (4.48) ˆF a (k, m) = F a(k, m)g F (k, m) (4.49) The estimated spectral magnitude obtained after post processing is denoted by a ˆL a (k, m) and ˆF a (k, m) for both the speaker respectively. The gain function G L (k, m) and G F (k, m) are given as G L (k, m) = G F (k, m) = [ [ ] L a(k, m i) U L (k, m) L a(k, m i) 2 F a(k, m i) 2 U F (k, m) F a(k, m i) 2 ] 1 2 (4.50) (4.51) Reconstruction of Individual Signals The reconstruction of each speaker signal after MSS is obtained by using a variant of Wiener filtering technique [115], [133]. In this style of reconstruction, the estimated subband spectral magnitude is modulated by the corresponding subband phase computed from its spectral magnitude [134] for each speaker signal. This Wiener filter also eliminates the residual noise components present after spectral subtraction. Hence, the reconstructed signal ˆL(k, m) and ˆF(k, m) for each speaker signal is given by ˆL a (k, m) ˆL(k, m) = ( L(k, m)). ˆL a (k, m) + ˆF a (k, m) ˆF a (k, m) ˆF(k, m) = ( F(k, m)). ˆL a (k, m) + ˆF a (k, m) (4.52) (4.53) The phase L(k, m) and F(k, m) for each speaker signal is calculated [134] from the corresponding enhanced spectral magnitude ˆL a(k,m) ˆL a(k,m)+ ˆF a(k,m) and ˆFa(k,m) ˆL a(k,m)+ ˆF a(k,m) respectively estimated from Wiener filter. The minimum mean squared error (MMSE) phase estimation [134] technique is utilized in this work to reconstruct the phase for each speaker. The inverse

98 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 78 STFT (ISTFT) which uses overlap and add (OLA) method is used to convert reconstructed STFT ˆL(k, m) and ˆF(k, m) into time domain signal ˆl(n) and ˆf(n) respectively Incorporating the Group Delay Spectral Magnitude in the Proposed Framework The importance of the group delay spectral magnitude is discussed here in the context of joint speaker separation and dereverberation. The high resolution and robustness properties [95] of group delay spectral magnitude results in smooth and robust subband envelopes. This reduces the error in NMF decomposition of observed subband envelope of mixture into its convolutional components. This is primarily due to the accurate decomposition of the subband envelope computed from GDSM which is discussed in the ensuing section Computing the Spectral Magnitude from Group Delay Function The spectral magnitude is computed from the group delay function as in [94]. The group delay function τ s (k, m) of sequence s(n) as defined by the negative derivative of the unwrapped short time phase spectrum [135] is used in this work for obtaining magnitude, which is then used in NMF processing. The relation between the weighted cepstrum ŵ s (n) sequence and group delay function τ s (k, m) is given by ŵ s (n) = ISTFT(τ s (k, m)), n = 0, 1,...N 1 (4.54) where N is the total number of samples point present in ŵ s (n). The new cepstral sequence w 1 (n) is formed from ŵ s (n) as w 1 (0) = 0 w 1 (n) = ŵ s (n)/n (4.55) w 1 (N n + 1) = w 1 (n) In Equation 4.55, n represents the sample points going from 1 to N/2. Now, the N point STFT of w 1 (n) denoted by Y 1 (k, m), is computed. Here k goes from 0 to N 1. The estimated

99 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 79 smooth spectrum S a (k, m) obtained from group delay function is thus given by S a (k, m) = 2 ln(y 2a (k, m)). Where magnitude spectrum Y 2a (k, m) is obtained as ln(y 2a (k, m)) = Real[Y 1 (k, m)]. The group delay function has high resolution property [95] which makes it more robust in spectrum estimation compared to conventional FFT based method of calculating magnitude spectrum. The synthetically generated vowels and speech signals are used to illustrate some useful properties of group delay spectral magnitude in the ensuing section High Resolution and Robustness Properties of Group Delay Spectral Magnitude The high resolution and robustness properties of group delay spectral magnitude [95], [127] are discussed using synthetically generated vowels and actual speech signal. The high resolution property of GDSM explains spectral smoothness of the magnitude computed from GDF. Robustness property of GDSM is then observed by averaging the 300 overlaid realizations of magnitude spectrum computed from group delay function. Imaginary part (a) System Z plane Real part (d) Magnitude from FFT 80 Amplitude (b) Time domain signal Time (ms) (e) Magnitude from GDF 80 Group delay (c) Group delay samples Normalized frequency (f) Cepstrally smooth signal 80 Magnitude Normalized frequency Magnitude Normalized frequency Magnitude Normalized frequency Figure 4.12: Comparison of Fourier transform spectral magnitude (d), GDSM (e) and cepstrally smooth version of FTSM (f) for the system shown in (a).
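Before examining these properties, the computation of the spectral magnitude from the group delay function (Equations 4.54 and 4.55) can be sketched as follows; the routine assumes a single group delay frame of even length N and is an illustrative sketch rather than the exact implementation of [94].

```python
import numpy as np

def magnitude_from_group_delay(tau_g):
    """Estimate a smooth magnitude spectrum from one group delay frame:
    inverse transform to the weighted cepstrum (Eq. 4.54), divide by the
    quefrency index with even symmetry (Eq. 4.55), and transform back."""
    N = len(tau_g)
    w_hat = np.real(np.fft.ifft(tau_g))          # weighted cepstrum
    w1 = np.zeros(N)
    for n in range(1, N // 2 + 1):
        w1[n] = w_hat[n] / n
        w1[(N - n) % N] = w1[n]                  # mirror: w1(N - n) = w1(n)
    Y1 = np.fft.fft(w1)
    log_mag = np.real(Y1)                        # ln Y_2a(k)
    return np.exp(log_mag)                       # estimated magnitude spectrum
```

The smooth log spectrum S_a(k, m) in the text corresponds to twice the real part of Y_1.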

Consider the system shown in Figure 4.12(a). The signals are generated synthetically with formants at 1020 Hz, 1690 Hz and 2300 Hz, sampled at 10 kHz with a pitch period of 10 ms, in the presence of additive white Gaussian noise (AWGN) at a signal to noise ratio (SNR) of 20 dB, as shown in Figure 4.12(b). In Figure 4.12(c), the group delay is calculated for the system shown in Figure 4.12(a). All three formants are clearly distinguished in Figure 4.12(c) due to the high resolution property of the GDF, in contrast to the FFT spectrum [95]. It can be noted that the magnitude spectrum computed from the conventional FFT (FTSM) in Figure 4.12(d) has large fluctuations caused by the variance of the noise. The fluctuations are significantly reduced in the magnitude spectrum estimated from the GDF in Figure 4.12(e). It can also be seen from Figure 4.12(e) that the magnitude spectrum computed from the GDF is smoother [94] than the FTSM (Figure 4.12(d)). The magnitude spectrum from the GDF in Figure 4.12(e) is also compared with the cepstrally smoothed signal (Figure 4.12(f)) obtained by smoothing the Fourier transform spectral magnitude. It can be observed that cepstral smoothing of the FTSM in Figure 4.12(f) is not able to distinguish closely spaced formants, unlike the GDSM shown in Figure 4.12(e). Additionally, the spectrum computed from the GDF in Figure 4.12(e) preserves the high resolution property of the GDF.

Figure 4.13: Comparing the robustness property of group delay spectral magnitude with Fourier transform spectral magnitude: (a) single realization of the FTSM, (b) 300 overlaid realizations of the FTSM, (c) average of 300 FTSM realizations, (d) single realization of the GDSM, (e) 300 overlaid realizations of the GDSM, (f) average of 300 GDSM realizations.
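The 300-realization robustness comparison of Figure 4.13 can be reproduced in outline with the sketch below, where spectrum_estimator is any frame-level magnitude estimator (for instance an FFT magnitude, or the group-delay-based routine above); the trial count, SNR and names are assumptions for illustration.

```python
import numpy as np

def spectrum_statistics(signal, spectrum_estimator, n_trials=300, snr_db=20, seed=0):
    """Mean and per-bin standard deviation of an estimated magnitude spectrum
    over repeated noisy realizations of the same frame (cf. Figure 4.13)."""
    rng = np.random.default_rng(seed)
    noise_power = np.mean(signal ** 2) / (10 ** (snr_db / 10))
    spectra = [spectrum_estimator(signal +
                                  rng.normal(0.0, np.sqrt(noise_power), signal.shape))
               for _ in range(n_trials)]
    spectra = np.asarray(spectra)
    return spectra.mean(axis=0), spectra.std(axis=0)
```

A lower per-bin standard deviation for the group-delay-based estimator than for the FFT-based estimator reflects the reduced fluctuation seen in Figures 4.13(e) and 4.13(f).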

101 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 81 The robustness property of GDSM is compared with that of FTSM by averaging 300 overlaid realizations of the magnitude spectrum estimated from the FFT (FTSM) and from the GDF (GDSM) in the presence of noise, as shown in Figure 4.13(b) and Figure 4.13(e), respectively. The magnitude spectrum is obtained for the synthetic signal generated from the system shown in Figure 4.12(a). It can be noted from Figure 4.13(c) that averaging the 300 overlaid realizations of the Fourier transform spectral magnitude reduces the fluctuations caused by the variance of the noise, but at the expense of a large bias. These fluctuations are significantly reduced in the averaged group delay spectral magnitude in Figure 4.13(f). It is even possible to reduce the fluctuations by processing a single realization, as shown in Figure 4.13(d), compared to a single realization of the Fourier transform spectral magnitude (Figure 4.13(a)). In fact, a single realization of GDSM contains all the information that can be obtained from averaging, as shown in Figure 4.13(f). Thus, from Figure 4.13, it can be observed that for every realization of magnitude estimation in the presence of AWGN at an SNR of 20 dB, the spectrum estimated from the GDF is least affected by the additive noise compared to the FTSM. It can also be noted from Figure 4.13(e) that there is an almost perfect overlap among the multiple realizations of GDSM, in contrast to those of FTSM in Figure 4.13(b). These properties of GDSM help in the accurate decomposition of the observed subband envelope Z(k, m) into its convolutional components.
Accurate Decomposition of Group Delay Subband Envelope
The significance of the group delay spectral magnitude is illustrated via the accurate decomposition of the subband envelopes. This is illustrated by analyzing the divergence between the observed subband envelope Z(k, m) and the true subband envelope Y(k, m). In order to make the analysis results statistically meaningful, 400 decomposition trials are conducted. During decomposition, the observed subband envelope Z(k, m) computed from the GDSM and from the Fourier transform spectral magnitude is used in the NMF framework. The spectral divergence between Z(k, m) and Y(k, m) is minimized more effectively when the group delay spectral magnitude is used to compute the subband envelope than when the FTSM is used. The average error in decomposition is plotted for different direct-to-reverberant ratios (DRR) [136] over 20 frequency bins, as shown in Figure 4.14. The DRR changes with the change in

102 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 82
[Figure 4.14 panels: average error in decomposition (AED) versus frequency bins using FTSM and GDSM at DRR = -5 dB, -4 dB, -3 dB and -1 dB.]
Figure 4.14: Comparison of average error in decomposition of observed subband envelope computed from group delay spectral magnitude and Fourier transform spectral magnitude.
distance between the source and the microphone location. A decrease in DRR is equivalent to an increase in the distance between the source and the microphone. As the DRR decreases, the error in decomposition increases, as shown in Figure 4.14. This is due to the stronger reverberation effect, which causes errors in the decomposition.
Table 4.5: Mean error in decomposition (%) for the subband envelope of group delay spectral magnitude and Fourier transform spectral magnitude (rows: FTSM, GDSM; columns: DRR = -5 dB, -4 dB, -3 dB, -1 dB).
From Table 4.5, it can be observed that the mean error in decomposition is lowest for GDSM compared to FTSM across the different DRRs. This is due to the robust and high resolution subband envelope computed from the GDF, which minimizes the error in decomposing Z(k, m) into its convolutional components. The GRID database [3] has been used in

103 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 83 Figure 4.15: Block diagram of the joint speaker separation and dereverberation method using GDSM. this evaluation. It may be noted that the window size, frame length and sampling rate used in this simulation is 800 samples, 200 samples and 25 KHz respectively Algorithm for Joint Speaker Separation and Dereverberation Block diagram of the proposed joint speaker separation and dereverberation algorithm is illustrated in Figure The algorithmic steps involved in joint speaker separation and dereverberation are detailed in Algorithm 2. Algorithm 2 Joint speaker separation and dereverberation using group delay spectral divergence minimization. Computing Spectral Magnitude from GDF : The mixed reverberated speech signal is first windowed to obtain short time sequence. The hanning window is used to perform the task. The magnitude spectrum for mixed reverberated speech is estimated for all subbands in each frame using group delay function.

104 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 84 Computing Subband Envelope : The subband envelope is then computed by squaring the spectral magnitude obtained from the GDF in each frame.
NMF Processing : The subband envelope updates given by the NMF update equations are computed for the m-th frame. Similarly, the subband envelope updates of each speaker are obtained for all frames.
Computing Spectral Magnitude : The spectral magnitude of each speaker signal is then obtained by taking the square root of the estimated subband envelopes.
Modified Spectral Subtraction : The modified spectral subtraction is then applied to the estimated spectral magnitude of both speakers. This step helps in removing the late reverberation components.
Reconstruction : Each speaker signal is reconstructed using a variant of the Wiener filtering technique. This Wiener filter also eliminates the residual noise components present after spectral subtraction. The enhanced spectral magnitude of each speaker signal is combined with the corresponding subband phase, and the ISTFT, which uses the overlap and add method, converts the frequency domain signal back into a time domain signal.
end
A minimal end-to-end sketch of these algorithmic steps is given below.
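The following skeleton illustrates the sequence of steps in Algorithm 2. It is a sketch only: a plain KL-divergence NMF with randomly initialized bases stands in for the thesis update rules (which additionally include sparsity and room-impulse-response terms), the FFT magnitude stands in for the GDSM, and the modified spectral subtraction and Wiener post-processing are reduced to simple placeholders.

import numpy as np
from scipy.signal import stft, istft

def kl_nmf(V, rank, n_iter=200, eps=1e-10):
    """Plain KL-divergence NMF with multiplicative updates (a stand-in for the
    thesis update rules)."""
    K, M = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((K, rank)) + eps
    H = rng.random((rank, M)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (np.ones_like(V) @ H.T + eps)
    return W, H

def separate_and_dereverberate(mixture, fs, bases_per_speaker=40):
    """Skeleton of the joint separation/dereverberation pipeline."""
    # 1) STFT with 800-sample windows and 75% overlap; a group-delay-based
    #    magnitude would replace np.abs here.
    f, t, Y = stft(mixture, fs, nperseg=800, noverlap=600)
    envelope = np.abs(Y) ** 2                      # subband envelope (power)

    # 2) NMF decomposition; the first/second half of the bases are attributed
    #    to speaker 1 / speaker 2 (speaker-dependent dictionaries in the thesis).
    W, H = kl_nmf(envelope, 2 * bases_per_speaker)
    outputs = []
    for sl in (slice(0, bases_per_speaker), slice(bases_per_speaker, None)):
        env_i = W[:, sl] @ H[sl, :]
        mag_i = np.sqrt(np.maximum(env_i, 0.0))    # back to spectral magnitude

        # 3) Placeholder "modified spectral subtraction" for late reverberation:
        #    subtract a scaled, delayed estimate of the late reverberant energy.
        late = np.zeros_like(mag_i)
        late[:, 8:] = 0.3 * mag_i[:, :-8]
        mag_i = np.maximum(mag_i - late, 0.0)

        # 4) Wiener-style gain and reconstruction with the mixture phase.
        gain = mag_i ** 2 / (mag_i ** 2 + late ** 2 + 1e-10)
        _, x_i = istft(gain * mag_i * np.exp(1j * np.angle(Y)), fs,
                       nperseg=800, noverlap=600)
        outputs.append(x_i)
    return outputs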

105 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 85
Spectrographic Analysis
Figure 4.16 illustrates the spectrograms of the clean target speech, the mixed reverberated speech and the reconstructed target speech, respectively. In Figure 4.16, the target speaker utters the sentence "place blue by c six again" and the interfering speaker utters "set red at s two soon", mixed at a target to interference ratio (TIR) of 0 dB and a DRR of -3 dB. It can be noted from Figure 4.16 that the proposed algorithm is able to separate and dereverberate the reconstructed target speaker signal effectively. The post processing methods are able to remove the late reverberation and noise components. In Figure 4.16, it can also be seen that the spectrogram of the reconstructed target speaker is similar to the spectrogram of the original target speaker.
[Figure 4.16 panels: target speech, mixed speech under reverberation and noise, and reconstructed target speech; frequency versus time.]
Figure 4.16: Spectrograms of the target signal (above), the mixed reverberated signal (middle) and the reconstructed target signal (below), when the TIR is 0 dB and DRR = -3 dB.
In the ensuing sections, experiments on speaker separation, dereverberation and distant speech recognition are presented. Performance of the speaker separation method is evaluated in terms of subjective, objective and target to interference ratio (TIR) measures. The reconstructed target signal obtained from the proposed GDSM method is compared with other separation methods at various TIRs. Additionally, experiments are conducted to evaluate the quality of speech dereverberation using objective measures and a one way ANOVA statistical test at various DRRs. Performance of the proposed method is also compared to other state of the art speech dereverberation methods. Experiments on distant speech recognition (DSR) are also conducted by varying the distance between the source and the microphone. Significant improvements in terms of word error rate (WER) are noted using the proposed method.
Experiments on Speaker Separation
In this section, the experimental results for speaker separation using the proposed method and other conventional methods are presented. The subjective evaluation of separated signals is rated on three parameters as done in [67]. The objective evaluation uses the perceptual similarity measure (PSM) to predict the difference between the clean signal and the reconstructed target signal. The third set of experiments on speaker separation evaluates the output TIR for varying input TIR. In the objective and TIR evaluations, 1200 reconstructed sentences obtained from the various methods for both target and interfering speakers are used.

106 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 86 In this section, the separated target signal obtained from the GDSM and FTSM methods at a DRR of -3 dB is compared with other speaker separation methods operated in a clean environment. The reconstructed (separated) target signals obtained from the various methods are used in the subjective and objective evaluations.
Experimental Conditions
All speech data used in the experiments are obtained from the GRID corpus. The utterances in the GRID database are reverberated at different DRRs using the image method [137], [138]. The dimension of the room used for simulation is 10.4 × 10.4 × 4.2 meters. The noise used in all the experiments is AWGN at an SNR of 20 dB. In all experiments, the window size is 800 samples (32 ms at 25 kHz) and the frame shift is 200 samples, so that 75% overlap is present between neighboring windows. The Kaiser window is used for windowing. Additionally, the reverberation time T_60 used in generating the various DRRs [139], [140] is 400 ms. As the DRR changes with the distance between the source and the microphone, the room impulse response also changes correspondingly. In the speech dereverberation experiments, the energies of the two speakers are considered to be equal (TIR = 0 dB). In this work, the L1-norm is chosen to introduce sparsity, corresponding to q = 1 in the sparsity term of the cost function. The sparsity parameters λ1 and λ2 are determined as λ1 = λ2 = U^{2-q}, where U = 10^{-8} Σ_{k,m} Z(k, m) [108]. The value of q is taken to be unity so that λ1 = U = λ2. The algorithm is run for 200 iterations to obtain the subband envelope updates for the individual speakers.
Subjective Evaluation Results
The human ear is the most efficient filter for evaluating a speech separation task. In subjective methods, the evaluation incorporates the subjective judgment of the listener using the MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) protocol [141]. Subjective measures are perhaps the most accurate way of evaluating the quality of any separation system. In this evaluation, the separated signals are rated on three parameters: global quality (GQ), target preservation (TP) and other signal separation (OSS) [67].

107 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 87 Table 4.6: Comparison of mean opinion score for various methods in terms of GQ, TP and OSS. TIR=-6 db TIR=0 db TIR=6 db Methods GQ TP OSS GQ TP OSS GQ TP OSS GDSM FTSM IF LVD LDD NMF Overall, a total of 25 subjects performed the subjective evaluation of reconstructed (separated) target signal. In listening experiment, 150 sentences from GRID database are used as a reference signals for training each subject and the corresponding separated signal obtained from all the method are used for rating the methods used in comparison. A mean opinion score (MOS) for all the three tasks are calculated from all the candidates. In this experiment, the separated target signal obtained from proposed GDSM method and its variant FTSM method at DRR equal to -3 db is compared with other speaker separation methods operated in clean environment. The methods used in comparison are instantaneous frequency (IF) [82] based method, Latent Variable Decomposition (LVD) [102], Latent Dirichlet Decomposition (LDD) [103] and NMF [114]. It may be noted that, the separated signals obtained from both GDSM and FTSM method employ post processing methods (MSS and Wiener filtering). In Table 4.6, the proposed method GDSM has higher subjective scores for all the three parameters (GQ, TP, OSS) in comparison to other methods for various TIRs. At low TIRs, the performance of GDSM is somewhat similar to LVD, LDD and NMF. There is significant improvement in subjective scores of GDSM in comparison to other method at higher TIRs. This is due to the robust and high resolution property of group delay function which result in robust subband envelope used in NMF decomposition. This in turn result in minimal error in decomposition and produces separated speaker signal which is robust against noise and reverberation.

108 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization Objective Evaluation Results The objective evaluation for speaker separation in this part is also performed using perceptual similarity measure (PSM) and its instantaneous audio quality (PSMt). In Table 4.7, the mean Table 4.7: Mean PSM and PSMt scores for the proposed method (GDCC) and other methods at various TIR values. Methods TIR=-6 db TIR=0 db TIR=6 db PSM LCI UCI PSMt PSM LCI UCI PSMt PSM LCI UCI PSMt GDSM FTSM IF LVD LDD NMF perceptual similarity measure and its instantaneous audio quality scores [67] for the proposed GDSM method is better compared to other methods at various TIR values for reconstructed target speaker signal. As expected, the perceptual similarity measure drops at low TIR levels. When the values of TIR is closer or smaller to zero, PSMt (the averaged value of instantaneous PSM vector) scores are negative, which indicates poor separation. From Table 4.7, it can be noted that the performance of GDSM is similar to LVD, LDD and NMF at lower TIRs but the performance has been increased as TIR increases. This is due to the fact that when subband envelope computed from GDSM is used in NMF decomposition, the spectral divergence between observed subband envelope and true subband envelope is minimized efficiently. The separated signals improves further when MSS and Wiener filter is used to remove remaining late reverberation and noise components respectively. In general, higher PSM and lower PSMt scores, better is the method in terms of objective evaluation. Moreover, the width of the confidence interval (upper confidence interval (UCI) - lower confidence interval (LCI)) for PSM scores is also lowest for the proposed method compared to other methods. These PSM and PSMt scores corroborate with the results obtained from the experiments on subjective evaluation.

109 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 89
Evaluation of Target to Interference Ratio
The target to interference ratio is a measure used to evaluate the quality of speaker separation. In this measure, the output target to interference ratio [142] is calculated for the various speaker separation methods as a function of the input target to interference ratio. The ratio of the reconstructed target and interfering speaker signals from each method is computed and plotted against the ratio of the original clean target and interfering speaker signals. It must be noted that the estimated target and interfering signals for the GDSM and FTSM methods have been obtained at a DRR of -3 dB in the experiments reported here. Figure 4.17 illustrates the quality of speaker separation by plotting the output TIR versus the input TIR. It can be seen from Figure 4.17 that the performance of the proposed GDSM method is significantly better at higher input TIRs compared to the other methods. This is because the estimated target and interfering speaker signals are robust against noise and reverberation, as explained in the previous sections. At low input TIRs, the performance of GDSM is similar to LVD, LDD and NMF, but better than FTSM and IF.
[Figure 4.17: output TIR versus input TIR (-6 dB to 6 dB) for GDSM, FTSM, IF, LVD, LDD and NMF.]
Figure 4.17: Variation in output TIR versus input TIR for various methods.
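The output TIR reported in Figure 4.17 can be computed from the energies of the target and interference components that remain in the reconstructed signal, for example as follows; the exact definition used in [142] may differ in how the components are projected or aligned.

import numpy as np

def tir_db(target, interference):
    """Target-to-interference ratio in dB, computed from signal energies.

    A common formulation; the exact definition in the cited evaluation
    protocol may differ.
    """
    return 10.0 * np.log10(np.sum(target ** 2) / (np.sum(interference ** 2) + 1e-12))

# input_tir  = tir_db(clean_target, clean_interference)
# output_tir = tir_db(target_part_of_estimate, interference_part_of_estimate)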

110 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization Experiments on Speech Dereverberation In this section, experimental results for speech dereverberation using proposed method and other single source dereverberation method are evaluated. Objective measures [73] used for evaluating the speech dereverberation performance are log spectral distortion (LSD), signalto-reverberation ratio (SRR), log likelihood ratio (LLR) and perceptual evaluation of speech quality (PESQ). In objective evaluation, 1200 dereverberated sentences for target speaker signal obtained from various methods are used Objective Evaluation Results The Log spectral distortion is a speech distortion measure well suited for the assessment of dereverberation algorithms [143]. The LSD is obtained by root mean square (RMS) value of the difference between log spectra of clean speech signal and dereverberated signal [144]. On the other hand, the signal to reverberation ratio [143] is a measure of reverberation which is dependent on the signal before and after processing. The log likelihood ratio (LLR) [73], [74] is also an important measure for speech dereverberation. In this measure, the log likelihood ratio between the LPC vector of the original speech signal and the LPC vector of the enhanced speech is computed by using autocorrelation matrix of the original speech signal. Table 4.8: Experimental results of speech dereverberation using objective measures (LSD, SRR and BSD) on GRID Database. DRR=3 db DRR=1 db DRR=-1 db DRR=-3 db DRR=-5 db Methods LSD SRR LLR LSD SRR LLR LSD SRR LLR LSD SRR LLR LSD SRR LLR TA CC GDSM SS TS KA FTSM From Table 4.8, it is observed that proposed algorithm (GDSM) has lower LSD, LLR values and higher SRR values at various DRRs for dereverberated target signal compared to various methods. The method used in comparison are complex cepstrum (CC) [105] based dereverberation, temporal averaging (TA) [9] method, spectral subtraction (SS) method [17], two stage (TS) method [106] for dereverberation, kurtosis algorithm (KA) [107] and FTSM.

111 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 91 This indicates better reverberation suppression compared to the other methods used herein. This is because the dereverberated target signal from GDSM is highly robust to noise and reverberation compared to the other methods. In general, the lower the LSD and LLR values and the higher the SRR values, the better the method is at suppressing reverberation. Perceptual evaluation of speech quality (PESQ) is a family of standards comprising a test methodology for automated assessment of speech quality as experienced by a user of a telephony system. It is standardized as ITU-T recommendation P.862 [145]. PESQ is another objective measure well suited to analyzing the quality of dereverberated speech. PESQ is a full-reference algorithm and analyzes the speech signal sample by sample after a temporal alignment of corresponding excerpts of the reference and dereverberated target signals. PESQ results principally model MOS results that cover a scale from 1 (bad) to 5 (excellent).
[Figure 4.18: PESQ scores at DRR = 5, 3, 1, -1, -3 and -5 dB for GDSM, FTSM, TA, CC, SS, TS and KA.]
Figure 4.18: Comparison of PESQ scores for various methods at different DRR.
PESQ scores are illustrated in Figure 4.18 for the various methods, plotted against the different DRRs. The proposed GDSM method has better PESQ scores than the other methods used herein. There is a decrease in PESQ scores as the DRR decreases for all the methods. This is due to the increase in the reverberation effect with decreasing DRR, although the decrease in PESQ is noted to be smallest for GDSM. Thus, all four objective measures used herein indicate the significance of the proposed method for speech dereverberation.
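A minimal computation of the LSD measure described above is sketched below; the window length and the spectral floor are illustrative choices and are not taken from the thesis.

import numpy as np
from scipy.signal import stft

def log_spectral_distortion(clean, processed, fs, nperseg=512, floor_db=-80.0):
    """Frame-averaged RMS distance between log-magnitude spectra (in dB).

    One common LSD formulation; framing parameters and the spectral floor
    are illustrative.
    """
    _, _, C = stft(clean, fs, nperseg=nperseg)
    _, _, P = stft(processed, fs, nperseg=nperseg)
    n = min(C.shape[1], P.shape[1])
    lc = 20.0 * np.log10(np.maximum(np.abs(C[:, :n]), 1e-12))
    lp = 20.0 * np.log10(np.maximum(np.abs(P[:, :n]), 1e-12))
    lc, lp = np.maximum(lc, floor_db), np.maximum(lp, floor_db)
    return np.mean(np.sqrt(np.mean((lc - lp) ** 2, axis=0)))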

112 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 92 The statistical test based on one way ANOVA is also presented in the ensuing section to illustrate the significance of the proposed method.
Statistical Experiments using One Way ANOVA
ANOVA returns a test decision for the null hypothesis that two populations come from normal distributions with the same variance, using the two-sample F-test [146]. The two populations are represented by the clean speech of the target speaker and the dereverberated target speaker obtained from the various methods. The ANOVA produces an F-statistic defined as
F = a_1^2 / a_2^2 (4.56)
where a_1^2 and a_2^2 are the sample variances. This test statistic is the ratio of the two sample variances; the further this ratio deviates from 1, the more likely the null hypothesis is to be rejected. The one-way ANOVA results can be considered reliable as long as the following conditions are met: the response variable is normally or approximately normally distributed; the samples are independent; and the population variances are equal, satisfying the null hypothesis.
Table 4.9: Comparison of one way ANOVA test results (Fstat, UCI, LCI, WCI) for dereverberated target signals from GDSM, FTSM, TA, CC, SS, TS and KA at DRR = -5 dB, -1 dB, 1 dB and 5 dB.
From Table 4.9, it is clear that the F statistics for the proposed method deviate least from unity. The F statistics also have a lower width of confidence interval (WCI) for the proposed method when compared to the other methods at the different DRRs. In these experiments, the dereverberated target signals from GDSM and FTSM are obtained at a TIR of 0 dB.
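The F statistic of Equation 4.56 and its confidence interval can be computed as sketched below; the per-DRR pooling of sentences used in the thesis experiments is not reproduced here.

import numpy as np
from scipy import stats

def variance_ratio_test(clean, processed, alpha=0.05):
    """F statistic (ratio of sample variances) and its confidence interval.

    A minimal version of the two-sample variance test described above.
    """
    a1, a2 = np.var(clean, ddof=1), np.var(processed, ddof=1)
    n1, n2 = len(clean), len(processed)
    f_stat = a1 / a2
    # Confidence interval for the true variance ratio based on the F distribution.
    lci = f_stat / stats.f.ppf(1.0 - alpha / 2.0, n1 - 1, n2 - 1)
    uci = f_stat / stats.f.ppf(alpha / 2.0, n1 - 1, n2 - 1)
    return f_stat, lci, uci, uci - lci   # the last value is the WCI

# Example: f, lci, uci, wci = variance_ratio_test(clean_signal, dereverberated_signal)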

113 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 93 The F statistics are computed at different DRRs. The WCI is defined as the lower confidence interval (LCI) subtracted from the upper confidence interval (UCI). The WCI turns out to be lowest for the proposed method among all the methods at the different DRRs. This indicates that the dereverberated target signal from GDSM has a variance closer to that of the clean speech signal and approaches the null hypothesis with higher confidence than the other methods. In general, for any method satisfying the null hypothesis, the sample variance of that method is closer to the sample variance of the clean speech signal. Thus, the dereverberated target signal generated from the proposed method satisfies the three conditions of one way ANOVA with higher confidence compared to the other methods. These statistical tests indicate the significance of the proposed method for speech dereverberation.
Experiments on Distant Speech Recognition
Experiments on distant speech recognition are conducted to evaluate the performance of the proposed method. The proposed method is also compared with other state of the art dereverberation methods for distant speech recognition. The GRID corpus is used for these experiments [147], [148], [97]. Clean utterances from the GRID database form the training set, and each training utterance is about one second in length. Sentences which are not present in the training set are used to generate the speech mixtures at different DRRs; they are subsequently separated and dereverberated using the proposed method. The 2000 sentences obtained in this manner are used as the test set in the experiments on speech recognition. Triphone HMMs with 15 states and 3 mixtures, together with 39 dimensional MFCC features including delta and acceleration coefficients, are used in the speech recognition experiments on the GRID database. The baseline triphone models of the recognition system are trained using clean speech sentences from the GRID corpus. Test data of 2000 sentences are synthesized at different DRRs from the GRID database and applied to the proposed (GDSM) algorithm along with the other methods used in comparison. For testing the recognition system, the dereverberated target signals reconstructed from these methods are used. The word error rate has been used to present the distant speech recognition results.
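The word error rate used here can be computed with the standard edit-distance recursion, as sketched below; the scoring tool used in the actual experiments may differ in details such as normalization.

import numpy as np

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with the standard edit-distance dynamic programme."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution / match
    return 100.0 * d[len(ref), len(hyp)] / max(len(ref), 1)

print(word_error_rate("place blue by c six again", "place blue by t six again"))  # about 16.7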

114 4.2 Joint Speaker Separation and Dereverberation using Group Delay Spectral Divergence Optimization 94
Table 4.10: Variation in WER (%) for all the methods (TA, CC, GDSM, SS, TS, KA, GMD and CTM) with increase in the distance between source and microphone (0.5 m, 1 m, 1.8 m, 2.3 m, 3 m and 4 m).
In Table 4.10, the WER for all the methods increases as the distance between the source and the microphone increases, except for the close talking microphone (CTM). This is because, as the distance between the source and the microphone increases, the effect of reverberation increases. The WER for the CTM is constant and lowest because there is no reverberation effect during close talk. The proposed method has a lower WER than the other enhancement methods and is closest to the CTM, indicating higher recognition accuracy and effective dereverberation compared to the SS, TA, IF, TS, KA and GMD methods.
[Figure 4.19: percentage increase in WER at 2, 3 and 4 meters for FTSM, KA, TS, SS, GDSM, CC and TA.]
Figure 4.19: Percentage increase in WER with increase in distance between the source and microphone.

115 4.3 Discussion 95 Figure 4.19 illustrates the percentage increase in WER when the distance between the source and the microphone is increased. The percentage increase in WER, obtained with respect to a distance of one meter between source and microphone, is also lowest for the proposed method. This results in better intelligibility of the reconstructed speech and gives higher speech recognition performance.
Summary
A method for performing joint speaker separation and dereverberation by minimizing the divergence between the observed and true subband envelopes obtained from the group delay spectral magnitude (GDSM) is proposed in this section. Advantages of the GDSM include robustness to noise and reverberation when compared to the FFT spectral magnitude. Due to the high resolution property of the group delay spectral magnitude, this method reduces the error in the decomposition of the mixed signal into its convolutional constituents under different direct-to-reverberant ratio (DRR) conditions. Subjective, objective and statistical quality evaluations of the separated and dereverberated signals are carried out in this work. The proposed method indicates significant improvements over other conventional methods in the literature. Lower word error rates are also noted in distant speech recognition experiments at various DRRs. The extension of the proposed method to a multi channel scenario is a potential area of further investigation.
4.3 Discussion
In this chapter, the different methods proposed in this thesis for single channel speech enhancement are analysed with respect to each other. The two methods proposed in this chapter utilize the group delay function for speech enhancement. In the first method, the group delay function is used to obtain the cross correlation between the bands of a filter bank. Rather than directly calculating the correlation between individual bands, the proposed group delay cross correlation (GDCC) method uses the group delay since it suitably captures the phase variations of the speech signal in the spectral domain. The performance of the proposed method is compared with other speaker separation methods.

116 4.3 Discussion 96 It is observed that the proposed method outperforms the other methods and is closest to human performance. On the other hand, the joint speaker separation and dereverberation method shows significant speech enhancement in reverberant and noisy conditions. The group delay function in this work is used to estimate the spectral magnitude of each speaker, which is then used in the NMF decomposition. This method is able to jointly address the problems of noise cancellation, speech dereverberation and speaker separation. The performance evaluation of the proposed method on the GRID corpus indicates that it is highly robust to noise and reverberation.

117 Chapter 5 Spectral Methods for Multi Channel Speech Enhancement Single channel speech enhancement systems utilize only the temporal and spectral diversity of the received signal. Reverberation also induces spatial diversity. To additionally utilize this diversity, multiple microphones can be used. In this chapter, novel beamforming based spatial spectrum estimation methods for multi channel speech enhancement are proposed. Beamforming is a spatial spectrum estimation method used in many speech enhancement applications. Spatial processing techniques have been widely used in speech enhancement. A detailed overview can be found in [42, 58]. In this work, a new reverberant speech enhancement method that utilizes the LP residual cepstrum is developed under the fixed beamforming framework. On the other hand, a LCMV based spectral method is also developed for joint noise cancellation and dereverberation. This is realized as a multi channel LCMV filter that constrains both the early and late parts of the speech frame. The filter outputs are then beamformed to remove late reverberations. These methods indicate significant improvement in perceptual quality of separated signals and distant speech recognition performance when compared to conventional methods.

118 5.1 Multi Channel Reverberant Speech Enhancement using LP Residual Cepstrum Multi Channel Reverberant Speech Enhancement using LP Residual Cepstrum In this work, a novel method for multi channel speech enhancement using linear prediction residual cepstrum (LPRC) is proposed. At each microphone output, a deconvolution is performed in the cepstral domain. This deconvolution of acoustic impulse response from reverberated signal results in removal of early reverberation from each individual channel. This dereverberated output from each channel is then spatially filtered using delay and sum beamformer (DSB). The late reverberation components are further removed by temporal averaging of the glottal closure instants (GCI) computed using the dynamic programming projected phase-slope algorithm (DYPSA) [15]. The DYPSA algorithm plays an important role in detecting the true GCI under reverberant environment. DYPSA algorithm utilizes peaks in LP residual of speech. Most of the GCI candidates fall around the peaks of the LP residual. The reverberated LP residual will have some spurious peaks along with true GCI s. In such case, DYPSA is highly robust to detect true GCI s from the spurious peaks. The experiments on subjective and objective evaluation are conducted on TIMIT and MONC databases for proposed method and compared with other methods. The experimental results of the proposed method on speech dereverberation and distant speech recognition indicate reasonable improvement when compared to conventional methods Introduction A speech signal captured using a distant microphone is smeared due to reverberation. Reverberation in general can be defined as a phenomenon in which multiple delayed and attenuated versions of a signal are added to itself. This is due to the multiple reflections from the surrounding walls and other objects. The reverberant speech signal can be modeled as convolution of the clean speech signal and the acoustic room impulse response (AIR) which exists between the source and the microphones. These reverberation components degrade the fidelity, intelligibility and recognition performance of a speech based system. In multi channel scenario, the clean speech signal, s(n), propagates through an array of M acoustic channels. The clean speech signal is estimated by computing the acoustic room impulse response. This

119 5.1 Multi Channel Reverberant Speech Enhancement using LP Residual Cepstrum 99 in turn requires a dereverberation algorithm to process the microphone output. Hence, speech dereverberation is viewed as a blind deconvolution problem, since neither the clean speech signal nor the AIR is generally available. In [149], a spatio-temporal averaging method is defined which operates on the linear prediction residual (LPR) of spatially averaged multi-microphone observations for the enhancement of reverberant speech. Speech enhancement using source information, by computing the coherently added signal from the LPR of degraded speech from different microphones, is discussed in [45]. In [150], a multi-channel speech dereverberation algorithm is proposed to suppress late reverberation components. The authors in [150] employ an MVDR beamformer followed by a single channel MMSE estimator which operates on the beamformer output signal. The ensuing section discusses linear prediction analysis of reverberant speech. This is followed by a discussion of the multi channel method for speech dereverberation using the LP residual cepstrum.
Linear Prediction Analysis of Reverberant Speech
Linear Predictive Coding (LPC) is a powerful tool for speech and audio signal analysis [100]. The basic idea behind linear predictive analysis is that a speech sample at a particular instant can be expressed as a linear combination of past speech samples. The predictor coefficients are calculated by minimizing the sum of squared differences between the actual and predicted speech samples over a finite interval. The prediction error, the difference between the actual and estimated speech samples, is called the linear prediction (LP) residual. The LP coefficients of the clean speech signal are given in vector form as
A_{LP} = R_{ss}^{-1} r_{ss} (5.1)
where A_{LP} = [a_1, a_2, ..., a_p]^T is the vector of p-th order LP coefficients, R_{ss} is the autocorrelation matrix and r_{ss} is the autocorrelation vector. Consider a reverberated signal x(n) which is the convolution of a clean speech signal s(n) and the room impulse response h(n):
x(n) = s(n) * h(n) (5.2)

120 5.1 Multi Channel Reverberant Speech Enhancement using LP Residual Cepstrum 100 The LP coefficients \hat{B}_{LP} = [\hat{b}_1, \hat{b}_2, ..., \hat{b}_p]^T of the reverberated speech x(n) are obtained using
\hat{B}_{LP} = R_{xx}^{-1} r_{xx} (5.3)
The i-th autocorrelation coefficient of R_{xx} is given by
r_{xx,i} = E\{x(n) x(n-i)\} (5.4)
The above equation can be written equivalently in the frequency domain as
r_{xx,i} = \frac{1}{2\pi} \int_{-\pi}^{\pi} |X(e^{j\omega})|^2 e^{j\omega i} \, d\omega (5.5)
Here, X(e^{j\omega}) is the Fourier transform of x(n). Substituting X(e^{j\omega}) = H(e^{j\omega}) S(e^{j\omega}),
r_{xx,i} = \frac{1}{2\pi} \int_{-\pi}^{\pi} |H(e^{j\omega})|^2 |S(e^{j\omega})|^2 e^{j\omega i} \, d\omega, \quad i = 1, 2, ..., p (5.6)
Taking the spatial expectation on both sides of Equation 5.3, we get
E\{\hat{B}_{LP}\} = E\{R_{xx}^{-1} r_{xx}\} (5.7)
By using the zeroth order Taylor series, the expectation can be reduced to
E\{\hat{B}_{LP}\} \approx E\{R_{xx}\}^{-1} E\{r_{xx}\} (5.8)
Now considering the spatial expectation of r_{xx,i},
E\{r_{xx,i}\} = \frac{1}{2\pi} \int_{-\pi}^{\pi} E\{|H(e^{j\omega})|^2 |S(e^{j\omega})|^2\} e^{j\omega i} \, d\omega (5.9)
The term |S(e^{j\omega})|^2 is the PSD of the clean signal s(n). It is taken outside the spatial expectation as it is independent of the source-microphone position. The spatial expectation of the energy density spectrum of the acoustic impulse response (AIR) can be split into direct

121 5.1 Multi Channel Reverberant Speech Enhancement using LP Residual Cepstrum 101 component and a reverberant component [151] as follows
E\{|H(e^{j\omega})|^2\} \approx |H_d(e^{j\omega})|^2 + E\{|H_r(e^{j\omega})|^2\} (5.10)
The direct and the reverberant components can together be expressed as [151]
E\{|H(e^{j\omega})|^2\} = \frac{1}{(4\pi D)^2} + \frac{1 - \bar{\alpha}}{\pi \bar{\alpha} V_A} (5.11)
where D is the distance between the microphone and the source, \bar{\alpha} is the average wall absorption coefficient and V_A is the total surface area of the room. The expected energy density spectrum of the AIR is therefore a constant, say η, independent of the frequency ω. Therefore Equation 5.9 now becomes
E\{r_{xx,i}\} = \frac{\eta}{2\pi} \int_{-\pi}^{\pi} |S(e^{j\omega})|^2 e^{j\omega i} \, d\omega (5.12)
E\{r_{xx,i}\} = \eta \, r_{ss,i} (5.13)
Substituting the above result into Equation 5.8, the constant η cancels and we have
E\{\hat{B}_{LP}\} = R_{ss}^{-1} r_{ss} (5.14)
E\{\hat{B}_{LP}\} \approx A_{LP} (5.15)
Thus, Equation 5.15 illustrates that the LP coefficients of reverberant speech \hat{B}_{LP} are, under spatial expectation, approximately equal to the LP coefficients A_{LP} of clean speech. The theoretical analysis of the LP coefficients shown above is verified experimentally through spectrographic analysis. The robustness of the LP coefficients to reverberation is illustrated in Figure 5.1 using the LP spectrograms of clean speech and reverberated speech at a direct to reverberant ratio of -3 dB. The spectrograms are computed from one sentence of the TIMIT database. The difference between the FFT spectrograms of clean and reverberated speech is clearly observed. However, the LP spectrogram of reverberated speech is approximately similar to the LP spectrogram of clean speech.
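A minimal sketch of the LP analysis and residual extraction discussed above is given below, using the autocorrelation method; frame-based processing and pre-emphasis are omitted for brevity.

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_residual(x, order=12):
    """Autocorrelation-method LP analysis and residual extraction.

    A standard formulation of Equations 5.1/5.3: the LP coefficients solve the
    Toeplitz normal equations, and inverse filtering with A(z) yields the
    LP residual.
    """
    x = x - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation r[0], r[1], ...
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    lpc = np.concatenate(([1.0], -a))                  # A(z) = 1 - sum_k a_k z^-k
    residual = lfilter(lpc, [1.0], x)                  # e(n) = A(z) x(n)
    return lpc, residual

# Applying the same routine to clean and reverberated speech illustrates the
# near-invariance of the LP coefficients discussed above.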

122 5.1 Multi Channel Reverberant Speech Enhancement using LP Residual Cepstrum 102 Figure 5.1: Comparison of the spectrograms of clean and reverberated speech obtained from FFT and LP analysis. FFT spectrogram (Top row) and LP spectrogram (Bottom row) Multi Channel Speech Enhancement using LP Residual Cepstrum in Fixed Beamforming Framework A method for speech dereverberation using multi channel LP residual cepstrum in a fixed beamforming framework is described in this section. From Figure 5.2, it can be seen that multi channel methods requires an initial estimation of partially dereverberated signal at each microphone. This is achieved by applying LPRC at each microphone output to perform deconvolution of AIR from residual signal of reverberated speech so that early reverberation components are suppressed. The single channel enhancement method is elaborated in the ensuing discussion Single Channel Speech Dereverberation A method proposed in [14] is utilized to perform partial deconvolution at each microphone output using cepstral domain herein. The reverberated speech signal is first subjected to linear prediction analysis. After computing the linear prediction coefficients, the LP residual is extracted. The LP residual is modeled as a convolution of LP residual of clean speech

123 5.1 Multi Channel Reverberant Speech Enhancement using LP Residual Cepstrum 103
Figure 5.2: Block diagram of the multi channel speech enhancement method using LP residual cepstrum.
signal and the AIR. Consider the frequency domain formulation of the source-filter model of speech production. The Fourier transform of the speech signal is given by
S(e^{j\omega}) = E(e^{j\omega}) V(e^{j\omega}) (5.16)
where E(e^{j\omega}) is the Fourier transform of the prediction residual and V(e^{j\omega}) is the transfer function of the all-pole filter evaluated at z = e^{j\omega}. Assuming an acoustic impulse response H(e^{j\omega}), the Fourier transform of the reverberant speech signal can be written as
X(e^{j\omega}) = S(e^{j\omega}) H(e^{j\omega}) (5.17)
= E(e^{j\omega}) V(e^{j\omega}) H(e^{j\omega}) (5.18)
An inverse filter B(e^{j\omega}) = 1 + \sum_{k=1}^{p} \hat{b}_k e^{-j\omega k} is defined with reference to [14], such that the LP coefficients from reverberant speech are approximately equal to those from clean speech in terms of spatial expectation. Here, \hat{b}_k represents the k-th LP coefficient of the reverberated speech and p is the order of prediction. Thus, inverse filtering the reverberant speech signal results

124 5.1 Multi Channel Reverberant Speech Enhancement using LP Residual Cepstrum 104 in R(e jω ) E(e jω )H(e jω ) (5.19) where R(e jω ) is the Fourier transform of the reverberant prediction residual. It can be inferred from the equation 5.19 that the prediction residual obtained from reverberant speech is approximately equal to the clean speech residual convolved with the room impulse response. Therefore, it is reasonable to assume that, if the LP coefficients were identical to those from clean speech, this approximation would be an equivalence. This equivalence is due to the robustness of the LP coefficients under reverberation [14]. It can be inferred from the aforementioned discussion that, if the clean speech residual is recovered from the reverberated speech residual, the dereverberated speech signal can be synthesized reasonably well. The separation of clean residual from reverberated residual is performed via deconvolution. The deconvolution is performed using cepstral subtraction [17]. The cepstrum [105] of the reverberated residual is obtained and the peaks in higher quefrency of the cepstrum correspond to the AIR [9]. Hence peak picking is applied to the cepstrum of reverberated signal. The peaks obtained correspond to the cepstrum of AIR. The peaks are then subtracted from the reverberated residual signal so as to perform deconvolution and obtain an estimate of clean speech residual signal. The initially dereverberated signal is finally synthesized using the estimated clean speech residual signal and the LP coefficients of reverberated signal. The residual of the dereverberated signal in one of the channels along with the clean and the reverberated residual is illustrated in Figure 5.3. It can be seen from the Figure 5.3, that the spurious peaks in the partially enhanced signal are reduced but not completely removed. This is because the single channel algorithm is unable to remove the tail effect of the room impulse response (RIR). The dereverberated output from each microphone output are then spatially filtered using a delay and sum beamformer (DSB) [152]. In order to eliminate the remaining spurious peaks, a temporal averaging [153] method is used as shown in Figure 5.3.
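The cepstral deconvolution step described above can be sketched as follows. The quefrency cutoff and the peak-detection threshold are illustrative assumptions, and the function operates on a single residual segment rather than frame by frame.

import numpy as np

def cepstral_dereverb_residual(residual, fs, min_quefrency_ms=20.0, thresh=3.0):
    """Suppress AIR-related peaks in the real cepstrum of an LP residual.

    A minimal sketch of the cepstral subtraction step: peaks in the
    high-quefrency region are attributed to the room impulse response and
    removed, then the residual is resynthesized with the original phase.
    """
    if len(residual) % 2:                               # even length keeps the
        residual = np.append(residual, 0.0)             # cepstrum symmetry simple
    spec = np.fft.fft(residual)
    log_mag = np.log(np.abs(spec) + 1e-12)
    cep = np.real(np.fft.ifft(log_mag))                 # real cepstrum

    half = len(cep) // 2
    start = int(fs * min_quefrency_ms / 1000.0)
    region = cep[start:half]
    region[np.abs(region) > thresh * np.std(region)] = 0.0   # remove AIR peaks
    cep[half + 1:] = cep[1:half][::-1]                   # restore even symmetry

    enhanced_mag = np.exp(np.real(np.fft.fft(cep)))
    return np.real(np.fft.ifft(enhanced_mag * np.exp(1j * np.angle(spec))))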

125 5.1 Multi Channel Reverberant Speech Enhancement using LP Residual Cepstrum 105 Figure 5.3: Illustration of remaining spurious peaks after single channel speech dereverberation. LP residual of (a) clean speech (b) reverberated speech (c) dereverberated speech Temporal Averaging The temporal averaging is applied on the LP residual of DSB output as shown in the Figure 5.2. The temporal averaging requires an accurate detection of glottal closure instants (GCI) which is computed using the DYPSA [15], [153], [154]. The DYPSA is preferred since it is robust to reverberation. The brief explanation of DYPSA follows herein. DYPSA : DYPSA consists of three steps. Computing the phase slope function and the phase slope projection followed by dynamic programming. The detailed explanation of each step is follows. 1. Phase Slope Function : It is defined as the average slope of the unwrapped phase spectrum of the short time Fourier transform of the prediction residual [153]. The GCI candidates are selected based on positive going zero crossing of the phase slope function. 2. Phase Slope Projection : There are certain GCI s which can go undetected when phase slope function fails to go zero crossing appropriately. Although, the turning points and general shape of the waveform are consistent with the presence of the impulsive event indicating a GCI [153]. Phase slope projection is then used to generate GCI

126 5.1 Multi Channel Reverberant Speech Enhancement using LP Residual Cepstrum 106 candidates when a local minimum is followed by a local maximum without zero crossing. The midpoint between these is identified and projected onto the time axis with unit slope. In this way, GCIs whose positive going slope does not cross the zero point are identified. 3. Dynamic Programming (DP) : The dynamic programming is the last step of DYPSA for selecting GCIs from a set of candidates. From the above two process, the number of GCIs candidates are increased. However, there are many GCIs which are not the true candidates. The objective is now to choose a subset corresponding to the true GCIs. GCIs are then selected from a set of candidates by minimizing a cost function using N-best dynamic programming. In this way, spurious candidates will not be selected because they are assigned a high cost within the DP. The advantage of the proposed method multi channel-lp residual cepstrum (MC-LPRC) over spatio-temporal averaging method [153] is that, the DYPSA is applied on enhanced speech as explained in [14]. Hence, DYPSA is more accurate in the detection of GCI. The motivation for using temporal averaging of GCIs [149] comes from two observations. The spurious peaks in LP residual that exist at the output are left unattenuated due to spatial filtering. Also, the spurious peaks are uncorrelated among the consecutive larynx cycles after spatial filtering due to the quasi periodic nature of voiced speech. The speech related features between the consecutive larynx cycles for clean speech do not differ much. Moreover, the features in a particular larynx cycle change slowly compared to its consecutive larynx cycles and show high intercycle correlation. The issue of exclusion of LP residual peaks from the averaging process now needs to be addressed since the peaks corresponding to GCIs have significant impact on speech quality [155], [156], [157] and should remain unmodified. In order to leave the residual peak unmodified, the windowing function [153] is used before averaging. Ideally, the window should be selected such that it should exclude only the residual peak and include rest of the larynx cycle. The GCIs can be detected upto an uncertainty of the order of 1 ms [15]. The peak will not be a true impulse but rather spread in time [158]. A windowing function which meets

127 5.1 Multi Channel Reverberant Speech Enhancement using LP Residual Cepstrum 107 the requirements is given by [159]
w_t = \frac{1}{2}\left[1 + \cos\left(\frac{2\pi t}{\psi(O-1)} - \pi\right)\right], \quad 0 \le t < \frac{\psi(O-1)}{2}
w_t = \frac{1}{2}\left[1 + \cos\left(\frac{2\pi}{\psi} - \frac{2\pi t}{\psi(O-1)} - \pi\right)\right], \quad (O-1) - \frac{\psi(O-1)}{2} < t \le O-1
w_t = 1, \quad \text{otherwise}
where O is the length of one larynx cycle (in samples) and the parameter ψ is the taper ratio of the window. Figure 5.4 shows an example of the weighting function. The taper ratio offers a tunable parameter to control the amount of the larynx cycle to be included in the averaging process.
Figure 5.4: A Tukey window for one larynx cycle.
The inverse window function, 1 - w_t, is applied to the larynx cycle under enhancement so as to restore only the original glottal pulse. The enhanced larynx cycle is obtained by summing the average of the windowed larynx cycle along with its I neighbours and the inverse-windowed current larynx cycle. The mathematical expression for the l-th enhanced larynx cycle becomes [153]
\hat{R}(l) = (\mathbf{I} - W_T) R(l) + \frac{1}{2I} \sum_{i=-I}^{I} W_T R(l + i) (5.20)
where R(l) = [r(n), r(n+1), ..., r(n+O-1)]^T is the larynx cycle under enhancement at the output of the DSB, \mathbf{I} is the identity matrix and W_T = diag\{w_t(0), w_t(1), ..., w_t(O-1)\}.

128 5.1 Multi Channel Reverberant Speech Enhancement using LP Residual Cepstrum 108 The larynx cycles are not always periodic and can vary by a few samples. When the lengths of the neighboring larynx cycles are not the same, the current cycle length O is taken as the reference. If a neighbouring cycle has fewer samples, zero padding is used to make its length equal to O; if it has more samples, truncation is used. The parameter I plays an important role in the algorithm. If I is too large, the averaging may happen across uncorrelated cycles, and if I is too small, spurious peaks will not be removed by the averaging. From the experiments, it is empirically found that the optimal value of I is equal to three.
The MC-LPRC Algorithm for Speech Dereverberation
The multi channel speech enhancement algorithm using the LP residual cepstrum is listed in Algorithm 3.
Algorithm 3 Multi channel speech dereverberation using LP residual cepstrum
Input : Short-term reverberant speech signals acquired with different delays through multiple distant microphones.
1. Perform single channel speech dereverberation using the LP residual cepstrum [14] at each channel output.
2. Apply the delay and sum beamformer to the channel outputs.
3. Apply temporal averaging to the LP residual of the DSB output. The temporal averaging uses a window function to exclude the LP residual peaks from the averaging process, because these peaks correspond to GCIs.
4. Use the overlap add (OLA) method to reconstruct the enhanced LP residual.
5. Obtain the enhanced signal by synthesizing the estimated enhanced LP residual with the LP coefficients obtained from the DSB output.
Output : The enhanced speech signal.
end
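A minimal sketch of the temporal averaging step (Equation 5.20) in Algorithm 3 is given below. GCI detection is assumed to have been performed already (for example by DYPSA), and the taper ratio and the zero padding of unequal cycles are illustrative choices.

import numpy as np
from scipy.signal.windows import tukey

def temporal_average_residual(residual, gci, I=3, taper=0.4):
    """Temporal averaging of larynx cycles (sketch of Equation 5.20).

    `gci` is a list of detected glottal closure instants (sample indices);
    a GCI detector such as DYPSA is assumed to have produced it.
    """
    out = residual.copy()
    cycles = [residual[gci[j]:gci[j + 1]] for j in range(len(gci) - 1)]
    for l in range(I, len(cycles) - I):
        O = len(cycles[l])
        w = tukey(O, alpha=taper)                      # windowed part to be averaged

        # Average the windowed current cycle with its I neighbours on each side,
        # padding/truncating neighbours to the reference length O.
        acc = np.zeros(O)
        for i in range(-I, I + 1):
            c = cycles[l + i]
            c = np.pad(c, (0, max(0, O - len(c))))[:O]
            acc += w * c
        enhanced = (1.0 - w) * cycles[l] + acc / (2 * I)

        out[gci[l]:gci[l] + O] = enhanced
    return out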

129 5.1 Multi Channel Reverberant Speech Enhancement using LP Residual Cepstrum 109 Figure 5.5: Spectrographic analysis of (a) clean speech (b) reverberated speech and (c) dereverberated speech using multi channel LP residual cepstrum method Spectrographic Analysis Spectrographic analysis is performed on a sentence uttered by a female speaker from TIMIT database [160] sampled at 16 KHz. The AIR is simulated using image method [138], [161]. The dimension of the room used for simulation is meters. Figure 5.5, illustrates the spectrogram for the clean speech signal, reverberated speech signal and dereverberated speech signal respectively. The proposed method gives reasonably performance in terms of enhancing the reverberated signal as can be seen from Figure 5.5. It can also be noted that the spectrogram of dereverberated speech is quite similar to the clean speech spectrogram Performance Evaluation Experiments on speech enhancement and distant speech recognition are conducted at various DRRs to evaluate the performance of the multi channel speech enhancement using LP residual cepstrum. The results are presented as objective measures, subjective measures, and word error rate (WER). Log spectral distortion (LSD), bark spectral distortion (BSD) and signal

130 5.1 Multi Channel Reverberant Speech Enhancement using LP Residual Cepstrum 110 to reverberation ratio (SRR) [162] are used as objective measures. On other hand, mean opinion score (MOS) [68], [163] is used as a subjective measure to quantify the experimental results on speech enhancement. The distant speech recognition results are illustrated using WER by varying the distance between source and the microphone. These experiments are conducted on the TIMIT and MONC [164] databases Experimental Conditions The MONC and TIMIT databases are used in experiments on speech dereverberation, perceptual evaluation and distant speech recognition. Sentences and continuous digits from TIMIT and MONC database respectively are reverberated at different DRRs and used in the experiments. For all the experiments, the AIR is simulated using image method [138] with source at four different location corresponding to four different DRR. The subjective evaluation (MOS) was done by averaging the ratings on a 0 to 5 scale from 25 listeners in the age group of 21 to 25 years Subjective and Objective Evaluation For conducting a subjective evaluation of the proposed method, a mean opinion score [68] is computed as listed in Table 5.1. It is observed that the MOS values are highest for the proposed MC-LPRC technique compared to speech enhancement using excitation source information (ESI) [45], spatio-temporal averaging [149] (STA) and spectral enhancement [150] (SE) using joint MVDR and MMSE estimation method. Table 5.1: Comparison of mean opinion scores on the TIMIT and MONC database for various methods. TIMIT MONC Methods DRR=-1dB DRR=-3dB DRR=-4dB DRR=-5dB DRR=-1dB DRR=-3dB DRR=-4dB DRR=-5dB MC-LPC ESI STA SE The Log Spectral Distortion is a speech distortion measure well suited for the assessment of dereverberation algorithms [151]. The Bark Spectral Distortion [162] is similar to LSD, but it utilizes the perceptual significance of bark scale to enhance the quality of speech as perceived by human listeners. On the other hand, the signal to reverberation ratio [151], is

131 5.1 Multi Channel Reverberant Speech Enhancement using LP Residual Cepstrum 111 a measure of reverberation which is dependent on the signal. The SRR requires the direct path signal component and is therefore an intrusive measure. All three measures, LSD, BSD and SRR, are used as objective measures to illustrate the results of dereverberation.
[Figure 5.6 panels: LSD, BSD and SRR at DRR = -1, -3, -4 and -5 dB for MC-LPRC, ESI, STA and SE; (a)-(c) TIMIT, (d)-(f) MONC.]
Figure 5.6: Comparison of LSD, BSD and SRR of various methods at different DRRs for the TIMIT ((a)-(c)) and MONC ((d)-(f)) databases respectively.
Objective evaluation results on multi channel speech enhancement are shown in Figure 5.6, which includes the LSD, BSD and SRR values. The DRRs used in these experiments are -1 dB, -3 dB, -4 dB and -5 dB. It is noted from Figure 5.6 that the proposed method has lower LSD and BSD values and higher SRR values compared to the other methods used herein. This is due to the advantage of applying temporal averaging to the enhanced speech obtained after the DSB. The results for MC-LPRC indicate a significant improvement over the other methods. For speech enhancement, a method should have a higher SRR and lower LSD and BSD values. It is observed that the SRR decreases and the LSD and BSD increase with decreasing DRR. This is due to the increase in the distance between the source and the microphone. At lower DRRs the performance of the proposed method decreases, but it still performs better than the other methods.

132 5.1 Multi Channel Reverberant Speech Enhancement using LP Residual Cepstrum 112
Experimental Results on Distant Speech Recognition
Distant speech recognition experiments are conducted on the reverberated sentences of the TIMIT and MONC databases at different DRRs. In these experiments, triphone HMMs with 15 states and 3 mixtures, together with 39 dimensional MFCC features including delta and acceleration coefficients, are used for both databases. The baseline triphone models of the recognition system are trained on 1000 sentences of clean speech training data from both databases. Reverberant versions of 500 different sentences from the MONC and TIMIT databases are generated at several DRRs and applied to the proposed algorithm (MC-LPRC) along with the other methods used in comparison. For testing the recognition system, the dereverberated signals reconstructed from these methods are used. The word error rate has been used to present the sentence recognition results.
[Figure 5.7: word error rate (%) versus distance (m) for the TIMIT and MONC databases, for CTM, MC-LPRC, ESI, STA and SE.]
Figure 5.7: Comparison of the word error rate for various speech enhancement methods as a function of the distance between source and microphone.
In Figure 5.7, the word error rates of the proposed method and other conventional methods, such as excitation source information [45], spatio-temporal averaging [149] and spectral enhancement [150] using joint MVDR and MMSE estimation, are illustrated.

133 5.1 Multi Channel Reverberant Speech Enhancement using LP Residual Cepstrum 113 These results are plotted by additionally varying the distance between the source and the microphone. The WER for all the methods increases as the distance between the source and the microphone increases, except for the close talking microphone (CTM). This is because, as the distance between the source and the microphone increases, the effect of reverberation increases. The proposed method has a lower WER for varying distances between the source and the microphone, indicating higher recognition accuracy and the least reverberation effect compared to the ESI, STA and SE methods. Also, the WER for MC-LPRC is closest to that of the CTM when compared to the other methods.
Table 5.2: Percentage increase in WER with increase in distance between the source and microphone for various methods (columns: MC-LPRC, ESI, STA and SE for the TIMIT and MONC databases; rows: source-microphone distances in meters).
In Table 5.2, the percentage increase in WER, obtained with respect to a distance of one meter between the source and the microphone, is noted to be lowest for the proposed method on both databases. This indicates the significance of the proposed method for distant speech recognition applications used in hands free communication.
Summary
A multi channel speech enhancement method based on the LP residual cepstrum is proposed in this work. The deconvolution of the acoustic impulse response from the reverberated signal in each individual channel removes early reverberation. The dereverberated output from each channel is then spatially filtered using a delay and sum beamformer. The late reverberation components are then removed by temporal averaging around the glottal closure instants (GCI) computed using the dynamic programming projected phase-slope algorithm (DYPSA). The multi channel technique performs reasonably better when compared to the spatio-temporal averaging method. The method described in this work is computationally efficient compared to other conventional deconvolution methods, which often rely on the estimation of the acoustic impulse response. However, in large rooms, spurious peaks are present in those regions

134 5.2 Joint Noise Cancellation and Dereverberation using Multi Channel LCMV Filter 114 where late reverberation exists. In such situations, the peak detection algorithm is prone to errors at very low DRR. The performance of the proposed method has potential to be further investigated for long reverberation times and at low signal to noise ratios. 5.2 Joint Noise Cancellation and Dereverberation using Multi Channel LCMV Filter In this section, multi channel linearly constrained minimum variance (LCMV) filtering in a fixed beamforming framework is discussed which achieve noise cancellation and dereverberation jointly. Speech acquired from an array of distant microphones is affected by ambient noise and reverberation. Single channel LCMV filter has been developed in literature to remove ambient noise. In this work, a LCMV based spectral method is developed for joint noise cancellation and dereverberation in a beamforming framework. This is realized as a multi channel LCMV filter that constrains both the early and late parts of the speech frame. A single channel LCMV filter which accounts for the inter frame correlation is applied on each channel to remove the early reverberation components. The partially enhanced signal from each microphone is then combined using delay and sum beamforming to decorrelate the remaining spurious peaks. A modified spectral subtraction method is also proposed to remove the late reverberation components present in the speech signal. Experimental results on joint noise cancellation and dereverberation indicate a reasonable improvement over conventional speech enhancement methods. Experiments on distant speech recognition are also conducted to illustrate the significance of the method in the context of ASR Introduction Hands free audio source acquisition from distant microphones are often smeared by reverberation. Reverberation is defined as multiple delayed and attenuated versions of a signal added to signal itself due to multiple reflections from the surrounding walls and other objects. In addition to reverberation, the background noise will also affect the speech quality. The distortions caused by reverberation and noise results in degradation of fidelity, intelligibility of speech signal and also affect the recognition performance.

In [16], a single channel minimum variance distortionless response (MVDR) filter has been proposed for noise cancellation. In the literature, most existing approaches for single channel noise reduction in the spectral domain assume that consecutive time frames are uncorrelated with each other. Hence, algorithms based on this assumption do not take interframe correlation into account. However, speech signals are generally colored over long periods of time, and hence in [16] the interframe correlation is taken into account in the derivation of the noise reduction algorithm. This can improve both narrowband and fullband output SNRs and is extended to the LCMV filter in [165]. In addition to noise cancellation, the MVDR has also been widely used in spectrum estimation [166], [167], [168], [169] and feature extraction [170], [171]. The ensuing section formulates the problem of reverberation and surrounding noise in a closed room.

Problem Formulation

The audio signal recorded by a microphone is smeared by reverberation and surrounding noise. In general, the acoustic impulse response (AIR) h(n) is assumed to be time-invariant. The microphone output y(n) is given by

y(n) = z(n) + v(n)   (5.21)

where n is the time sample index, v(n) is additive noise and z(n) is the reverberated signal given by

z(n) = h(n) * s(n)   (5.22)

z(n) = \sum_{n'=0}^{N} h(n') \, s(n - n')   (5.23)

where s(n) is the clean speech signal of N time samples. In the short time Fourier transform (STFT) domain, the microphone signal y(n) is given by

Y(k, m) = Z(k, m) + V(k, m)   (5.24)
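To make the signal model of Equations 5.21 to 5.24 concrete, the following sketch synthesizes y(n) = h(n) * s(n) + v(n) and computes its STFT. It is illustrative only: the acoustic impulse response is a toy decaying impulse train rather than an image-method AIR, and the noise level and STFT settings are assumed values.

```python
# Toy illustration of Eqs. 5.21-5.24: y(n) = h(n) * s(n) + v(n) and its STFT.
# The AIR below is a synthetic decaying impulse train, not an image-method RIR.
import numpy as np
from scipy.signal import stft, fftconvolve

fs = 16000
rng = np.random.default_rng(0)
s = rng.standard_normal(fs)                        # stand-in for clean speech s(n)
h = np.zeros(2048)
h[0] = 1.0
h[200::150] = 0.6 ** np.arange(len(h[200::150]))   # delayed, attenuated copies
z = fftconvolve(s, h)[:len(s)]                     # reverberated signal z(n) = h * s
v = 0.05 * rng.standard_normal(len(z))             # additive noise v(n)
y = z + v                                          # microphone output y(n)

f, m, Y = stft(y, fs=fs, nperseg=512, noverlap=256)   # Y(k, m)
print("STFT shape (K frequency bins x N_T frames):", Y.shape)
```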

In Equation 5.24, Y(k, m), Z(k, m) and V(k, m) represent the STFTs of y(n), z(n) and v(n) respectively, at the m-th time frame and frequency bin k \in {0, 1, ..., K-1}. For m \in {0, 1, ..., N_T - 1}, Equation 5.24 is now represented in vector form as

\mathbf{Y}(k, m) = \mathbf{Z}(k, m) + \mathbf{V}(k, m)   (5.25)

In Equation 5.25, \mathbf{Z}(k, m) is the reverberated component, which can also be written as

\mathbf{Z}(k, m) = [\mathbf{Z}_d^T(k, m) \; \mathbf{Z}_l^T(k, m)]^T   (5.26)

where \mathbf{Z}_d(k, m) contains the direct spectral components, defined as

\mathbf{Z}_d(k, m) = [Z(k, m), Z(k, m-1), ..., Z(k, m-N_e+1)]^T   (5.27)

and \mathbf{Z}_l(k, m) denotes the early and late reflection spectral components,

\mathbf{Z}_l(k, m) = [Z(k, m-N_e), ..., Z(k, m-N_T+1)]^T   (5.28)

The first N_e frames correspond to the direct path, which has strong peaks. The early and late reflection components are contained in the frame range [N_e, N_T-1]. N_T is the total number of frames obtained after the STFT of y(n). The vector \mathbf{Y}(k, m) is defined as \mathbf{Y}(k, m) = [Y(k, m), Y(k, m-1), ..., Y(k, m-N_T+1)]^T, and \mathbf{Z}(k, m) and \mathbf{V}(k, m) are defined similarly. Substituting Equation 5.26 in Equation 5.25, the microphone output is obtained as

\mathbf{Y}(k, m) = [\mathbf{Y}_d^T(k, m) \; \mathbf{Y}_l^T(k, m)]^T   (5.29)

Here, \mathbf{Y}_d(k, m) and \mathbf{Y}_l(k, m) are the microphone output vectors for the frame ranges [0, N_e-1] and [N_e, N_T-1], defined similarly to \mathbf{Z}_d(k, m) and \mathbf{Z}_l(k, m), respectively.
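A small sketch of the frame partitioning in Equations 5.26 to 5.29 is given below, assuming an STFT matrix Y of size K x (total frames) such as the one computed earlier; the values of m, N_e and N_T are arbitrary choices made purely for illustration.

```python
# Illustrative partitioning of STFT frames into Y_d (first N_e frames looking
# back from frame m) and Y_l (the remaining N_T - N_e frames), per Eqs. 5.26-5.29.
import numpy as np

def partition_frames(Y, m, N_e, N_T):
    """Y: complex STFT matrix (K x total frames). Returns Y_d (N_e x K) and
    Y_l ((N_T - N_e) x K) built from frames m, m-1, ..., m-N_T+1."""
    frames = [Y[:, m - i] for i in range(N_T)]          # Y(k,m), Y(k,m-1), ...
    Yd = np.stack(frames[:N_e])                          # direct / early part
    Yl = np.stack(frames[N_e:])                          # late part
    return Yd, Yl

rng = np.random.default_rng(1)
Y = rng.standard_normal((257, 200)) + 1j * rng.standard_normal((257, 200))
Yd, Yl = partition_frames(Y, m=60, N_e=4, N_T=12)
print(Yd.shape, Yl.shape)   # (4, 257) and (8, 257)
```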

Mathematically, \mathbf{Y}_d(k, m) and \mathbf{Y}_l(k, m) are represented as

\mathbf{Y}_d(k, m) = \mathbf{Z}_d(k, m) + \mathbf{V}_d(k, m)   (5.30)

\mathbf{Y}_l(k, m) = \mathbf{Z}_l(k, m) + \mathbf{V}_l(k, m)   (5.31)

A multi channel LCMV filter which utilizes an initial estimate of the enhanced speech is discussed in the ensuing section.

Multi Channel LCMV Filter for Noise Cancellation and Speech Dereverberation

The joint noise cancellation and speech dereverberation method requires an initial estimate of the enhanced speech signal at each microphone output. This initial estimator suppresses the noise and early reverberation from each microphone. The multi channel LCMV filter takes the observed degraded microphone outputs as input and generates a partially enhanced output from each microphone.

Figure 5.8: Block diagram illustrating the joint noise cancellation and dereverberation method using a multi channel LCMV filter. The microphone signals y_1(n), ..., y_M(n) are individually processed to suppress noise and early reverberation, combined by delay and sum beamforming, post processed by spectral subtraction, and reconstructed by the inverse STFT.
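The signal flow of Figure 5.8 can be sketched compactly as below. The per channel suppression stage and the modified spectral subtraction stage are only placeholders here (the actual LCMV filtering is developed in the following subsections), and the delay and sum stage assumes the channels are already time aligned; none of this is the thesis implementation.

```python
# Sketch of the Figure 5.8 pipeline: per-channel suppression, delay-and-sum
# combination of the channel spectra, and a stubbed late-reverberation stage.
# enhance_channel / spectral_subtract are placeholders, not the thesis code.
import numpy as np
from scipy.signal import stft, istft

def enhance_channel(Y):
    return Y                      # placeholder for the per-channel LCMV filter

def spectral_subtract(Z, late_psd_estimate=0.0):
    mag = np.maximum(np.abs(Z) - late_psd_estimate, 0.0)
    return mag * np.exp(1j * np.angle(Z))   # crude magnitude-subtraction stub

def mlcmv_pipeline(mics, fs=16000, nperseg=512):
    """mics: list of time-domain microphone signals (already time-aligned)."""
    specs = []
    for y in mics:
        _, _, Y = stft(y, fs=fs, nperseg=nperseg)
        specs.append(enhance_channel(Y))
    Z_dsb = np.mean(specs, axis=0)           # delay-and-sum (delays assumed zero)
    Z_hat = spectral_subtract(Z_dsb)
    _, z_hat = istft(Z_hat, fs=fs, nperseg=nperseg)
    return z_hat

rng = np.random.default_rng(2)
out = mlcmv_pipeline([rng.standard_normal(16000) for _ in range(8)])
print(out.shape)
```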

The output of the LCMV filter at each channel of an array of M microphones is denoted by \hat{Z}_1(k, m), \hat{Z}_2(k, m), ..., \hat{Z}_M(k, m). These outputs are generated individually using the notion explained in the ensuing subsection. The block diagram of the proposed multi channel LCMV (M-LCMV) algorithm is shown in Figure 5.8. The application of the LCMV filter to each microphone signal removes the noise and the early reflection components. The delay and sum beamformer [152] is used to combine the multi channel LCMV filter outputs, resulting in an enhanced speech spectrum denoted by \hat{Z}_{DSB}(k, m). The late reverberant components are then eliminated by post processing the DSB output using the modified spectral subtraction method (as explained in Chapter 4.2). It may be noted that the partially enhanced speech is free from noise and early reverberation components. The initial estimation of the partially enhanced speech at each microphone is explained in the ensuing section.

Suppression of Noise and Early Reverberation

The inter frame correlation is taken into account in the cancellation of noise and early reverberation from the degraded speech. This is due to the highly correlated nature of the speech signal. The output obtained by removing the noise and early reverberation components at the m-th frame of the first microphone (denoted by subscript 1) using the LCMV filter over the first N_e frames is given by

\hat{Z}_1(k, m) = \mathbf{Y}_d^T(k, m) \mathbf{W}_d^*   (5.32)

where the superscripts T and * denote the transpose and complex conjugate operations respectively. \mathbf{W}_d = [w_1(m), ..., w_1(m-N_e+1)]^T is a vector of length N_e, and \mathbf{Y}_d(k, m) = [Y_1(k, m), ..., Y_1(k, m-N_e+1)]^T is the corresponding observation vector (stacking over all K frequency bins gives an N_e x K matrix). In the vector \mathbf{W}_d, w_1(m) is the complex weight applied at the m-th frame, and the remaining entries are defined similarly. On the other hand, the output obtained by removing the noise and early reverberation components at the (m-N_e)-th frame using the LCMV filter over the remaining N_T - N_e frames is given by

\hat{Z}_1(k, m-N_e) = \mathbf{Y}_l^T(k, m) \mathbf{W}_l^*   (5.33)

Here, \mathbf{W}_l = [w_1(m-N_e), ..., w_1(m-N_T+1)]^T. Equations 5.32 and 5.33 can also be written as

\hat{Z}_1(k, m) = \mathbf{Z}_d^T(k, m) \mathbf{W}_d^* + \mathbf{V}_d^T(k, m) \mathbf{W}_d^*   (5.34)

\hat{Z}_1(k, m-N_e) = \mathbf{Z}_l^T(k, m) \mathbf{W}_l^* + \mathbf{V}_l^T(k, m) \mathbf{W}_l^*   (5.35)

Here, \mathbf{Z}_d(k, m) = [Z_1(k, m), ..., Z_1(k, m-N_e+1)]^T, and \mathbf{Z}_l(k, m) is defined over the last N_T - N_e frames. Similarly, \mathbf{V}_d(k, m) and \mathbf{V}_l(k, m) are defined in the same manner as \mathbf{Z}_d(k, m) and \mathbf{Z}_l(k, m) respectively. As shown in [16], \mathbf{Z}_d(k, m) can be decomposed into the desired signal Z_1(k, m) at time frame m and an interference signal vector \mathbf{Z}'_d(k, m) as

\mathbf{Z}_d(k, m) = \gamma_{Z_d}^* Z_1(k, m) + \mathbf{Z}'_d(k, m)   (5.36)

where \mathbf{Z}'_d(k, m) is the interference signal vector, each element of which is defined in detail in [16], and the normalized inter frame correlation vector \gamma_{Z_d} = [\gamma_{Z_1}(m), ..., \gamma_{Z_1}(m-N_e+1)]^T is defined as in [16] by

\gamma_{Z_d} = \frac{E[\mathbf{Z}_d(k, m) Z_1^*(k, m)]}{E[|Z_1(k, m)|^2]}   (5.37)

Similarly, \mathbf{Z}_l(k, m), \mathbf{V}_d(k, m) and \mathbf{V}_l(k, m) can be decomposed into their desired signals and the corresponding interference vectors:

\mathbf{Z}_l(k, m) = \gamma_{Z_l}^* Z_1(k, m-N_e) + \mathbf{Z}'_l(k, m)   (5.38)

\mathbf{V}_d(k, m) = \gamma_{V_d}^* V_1(k, m) + \mathbf{V}'_d(k, m)   (5.39)

\mathbf{V}_l(k, m) = \gamma_{V_l}^* V_1(k, m-N_e) + \mathbf{V}'_l(k, m)   (5.40)

where \gamma_{Z_l}, \gamma_{V_d} and \gamma_{V_l} are defined in a similar way to \gamma_{Z_d}. Likewise, \mathbf{Z}'_l(k, m), \mathbf{V}'_d(k, m) and \mathbf{V}'_l(k, m) are defined as \mathbf{Z}'_d(k, m).
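A minimal sketch of estimating the normalized inter frame correlation vector of Equation 5.37 is given below, with the expectations replaced by sample averages over the frames of a single frequency bin; this averaging choice is an assumption made for illustration, and [16] should be consulted for the estimators used there.

```python
# Sample-average estimate of the normalized interframe correlation vector
# gamma_{Z_d}(k) = E[Z_d(k,m) Z_1^*(k,m)] / E[|Z_1(k,m)|^2]  (Eq. 5.37).
import numpy as np

def interframe_correlation(Zk, N_e):
    """Zk: complex STFT values of one frequency bin over time (1-D array).
    Returns the length-N_e vector gamma estimated by averaging over frames."""
    num = np.zeros(N_e, dtype=complex)
    den = 0.0
    for m in range(N_e - 1, len(Zk)):
        zd = Zk[m - N_e + 1:m + 1][::-1]      # [Z(m), Z(m-1), ..., Z(m-N_e+1)]
        num += zd * np.conj(Zk[m])
        den += np.abs(Zk[m]) ** 2
    return num / den

rng = np.random.default_rng(3)
x = rng.standard_normal(500) + 1j * rng.standard_normal(500)
Zk = np.convolve(x, [1.0, 0.7, 0.3])[:500]    # correlated frame sequence
gamma = interframe_correlation(Zk, N_e=4)
print(np.round(gamma, 3))                     # gamma[0] equals 1 by construction
```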

Substituting Equations 5.36 and 5.39 into Equation 5.34 gives \hat{Z}_1(k, m). In a similar manner, substituting Equations 5.38 and 5.40 into Equation 5.35 gives \hat{Z}_1(k, m-N_e):

\hat{Z}_1(k, m) = Z_{fd}(k, m) + Z_{id}(k, m) + V_{rd}(k, m)   (5.41)

\hat{Z}_1(k, m-N_e) = Z_{fl}(k, m) + Z_{il}(k, m) + V_{rl}(k, m)   (5.42)

where Z_{fd}(k, m) = Z_1(k, m) \gamma_{Z_d}^H \mathbf{W}_d^* is the filtered desired signal, Z_{id}(k, m) = (\mathbf{Z}'^T_d(k, m) + \mathbf{V}'^T_d(k, m)) \mathbf{W}_d^* is the residual interference signal, and V_{rd}(k, m) = V_1(k, m) \gamma_{V_d}^H \mathbf{W}_d^* is the residual noise signal for the first N_e frames. Here, the superscript H denotes the conjugate transpose operation. Z_{fl}(k, m), Z_{il}(k, m) and V_{rl}(k, m) are defined similarly to Z_{fd}(k, m), Z_{id}(k, m) and V_{rd}(k, m) respectively. The estimated spectrum at the m-th frame, \hat{Z}_1(k, m) (Equation 5.41), and at the (m-N_e)-th frame, \hat{Z}_1(k, m-N_e) (Equation 5.42), are each the sum of three mutually uncorrelated terms. The objective of the single channel LCMV filter is to recover the filtered desired signals Z_{fd}(k, m) and Z_{fl}(k, m) while removing all the undesired terms (the last two terms of Equations 5.41 and 5.42 respectively). Thus, putting the first set of constraints (for the first N_e frames) in matrix form, we obtain

\mathbf{P}_d^T \mathbf{W}_d = \mathbf{I}_d   (5.43)

where \mathbf{P}_d = [\gamma_{Z_d} \; \gamma_{V_d}] and \mathbf{I}_d = [1 \; 0]^T. The constraint matrix \mathbf{P}_d has dimension N_e x 2 and \mathbf{I}_d has dimension 2 x 1. The second set of constraints, for the remaining frames, can be enforced in matrix form as

\mathbf{P}_l^T \mathbf{W}_l = \mathbf{I}_l   (5.44)

where \mathbf{P}_l = [\gamma_{Z_l} \; \gamma_{V_l}] and \mathbf{I}_l = [\alpha \; 0]^T. The constraint matrix \mathbf{P}_l has dimension (N_T - N_e) x 2 and \mathbf{I}_l has dimension 2 x 1. Here 0 < \alpha < 1; \alpha = 0.5 (observed empirically) removes the early reflections, which carry most of the reverberation power, together with a few late reflections. The two sets of constraints are handled separately in this work to obtain the weights \mathbf{W}_d and \mathbf{W}_l of the LCMV filter. The optimal filter is then derived by minimizing the mean square error of the residual interference plus noise at the filter output under the first set of constraints. Mathematically, it is

given by [165]

\min_{\mathbf{W}_d} \; \mathbf{W}_d^H \rho_d \mathbf{W}_d \quad \text{subject to} \quad \mathbf{P}_d^T \mathbf{W}_d = \mathbf{I}_d   (5.45)

The solution to Equation 5.45 can now be obtained as [165]

\hat{\mathbf{W}}_d = \rho_d^{-1} \mathbf{P}_d [\mathbf{P}_d^T \rho_d^{-1} \mathbf{P}_d]^{-1} \mathbf{I}_d   (5.46)

where \rho_d = E[\mathbf{Y}_d(k, m) \mathbf{Y}_d^H(k, m)] is the correlation matrix of \mathbf{Y}_d(k, m), with \mathbf{Y}_d(k, m) = [Y_1(k, m), ..., Y_1(k, m-N_e+1)]^T the first N_e frames of Y_1(k, m). Similarly, the weights required to minimize the energy of the reverberation and the residual interference plus noise at the filter output under the second set of constraints can be computed as

\hat{\mathbf{W}}_l = \rho_l^{-1} \mathbf{P}_l [\mathbf{P}_l^T \rho_l^{-1} \mathbf{P}_l]^{-1} \mathbf{I}_l   (5.47)

where \rho_l is defined similarly to \rho_d and \mathbf{Y}_l(k, m) = [Y_1(k, m-N_e), ..., Y_1(k, m-N_T+1)]^T. In general, \gamma_{Z_d} in Equation 5.37 is expressed in terms of the inter frame correlation vectors of \mathbf{Y}_d(k, m) and \mathbf{V}_d(k, m), as obtained in [16]. Here, the noise matrix \mathbf{V}_d(k, m) is obtained from \mathbf{V}(k, m) for the first N_e frames over all frequency bands. Similarly, \gamma_{Z_l} can be obtained in terms of \mathbf{Y}_l(k, m) and \mathbf{V}_l(k, m), where \mathbf{V}_l(k, m) is obtained from \mathbf{V}(k, m) for the frame range [N_e, N_T-1]. The statistics of the noise signal are computed during silences, as in other noise reduction algorithms. The enhanced signal at the m-th and (m-N_e)-th frames is thus given by

\hat{Z}_1(k, m) = \mathbf{Y}_d^T(k, m) \hat{\mathbf{W}}_d^*(k, m)   (5.48)

\hat{Z}_1(k, m-N_e) = \mathbf{Y}_l^T(k, m) \hat{\mathbf{W}}_l^*(k, m)   (5.49)

Once \hat{Z}_1(k, m) is obtained, \hat{Z}_1(k, m-1), ..., \hat{Z}_1(k, m-N_e) can be obtained in the same manner as \hat{Z}_1(k, m). On the other hand, \hat{Z}_1(k, m-N_e-1), ..., \hat{Z}_1(k, m-N_T+1) are obtained in the same fashion as \hat{Z}_1(k, m-N_e). The enhanced signal for the first microphone is denoted by \hat{\mathbf{Z}}_1(k, m) and obtained as \hat{\mathbf{Z}}_1(k, m) = [\hat{\mathbf{Z}}_d^T(k, m) \; \hat{\mathbf{Z}}_l^T(k, m)]^T.
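The closed form solutions of Equations 5.46 and 5.47 translate directly into a few lines of linear algebra. The sketch below estimates the correlation matrix from synthetic observation vectors and adds a small diagonal loading before inversion; the data, the correlation vectors and the loading term are assumptions made purely for illustration and are not taken from the thesis experiments.

```python
# Sketch of the constrained weight computation of Eqs. 5.43-5.46:
#   W_d = rho_d^{-1} P_d [P_d^T rho_d^{-1} P_d]^{-1} I_d,  P_d = [gamma_Zd  gamma_Vd].
# Correlation quantities are sample estimates from synthetic frames.
import numpy as np

def lcmv_weights(Yd_frames, gamma_z, gamma_v, i_vec, loading=1e-6):
    """Yd_frames: (num_observations x N_e) matrix whose rows are Y_d vectors."""
    N_e = Yd_frames.shape[1]
    rho = (Yd_frames.T @ Yd_frames.conj()) / Yd_frames.shape[0]   # E[Y_d Y_d^H]
    rho += loading * np.eye(N_e)
    P = np.column_stack([gamma_z, gamma_v])                        # N_e x 2
    rho_inv_P = np.linalg.solve(rho, P)                            # rho^{-1} P
    return rho_inv_P @ np.linalg.solve(P.T @ rho_inv_P, i_vec)     # Eq. 5.46

rng = np.random.default_rng(4)
N_e, n_obs = 4, 400
Yd = rng.standard_normal((n_obs, N_e)) + 1j * rng.standard_normal((n_obs, N_e))
gamma_z = np.array([1.0, 0.6, 0.3, 0.1], dtype=complex)
gamma_v = np.array([1.0, 0.05, 0.0, 0.0], dtype=complex)
W_d = lcmv_weights(Yd, gamma_z, gamma_v, i_vec=np.array([1.0, 0.0]))
P_d = np.column_stack([gamma_z, gamma_v])
print("constraint check P_d^T W_d:", np.round(P_d.T @ W_d, 3))   # approx [1, 0]
```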

Here, \hat{\mathbf{Z}}_d(k, m) and \hat{\mathbf{Z}}_l(k, m) are defined as

\hat{\mathbf{Z}}_d(k, m) = [\hat{Z}_1(k, m), \hat{Z}_1(k, m-1), ..., \hat{Z}_1(k, m-N_e+1)]^T   (5.50)

\hat{\mathbf{Z}}_l(k, m) = [\hat{Z}_1(k, m-N_e), \hat{Z}_1(k, m-N_e-1), ..., \hat{Z}_1(k, m-N_T+1)]^T   (5.51)

Along similar lines, the enhanced signals from the other microphones are obtained in the same way as \hat{\mathbf{Z}}_1(k, m). The enhanced spectrum still retains some late reverberation components, which are removed using the modified spectral subtraction explained in Chapter 4.2.

Figure 5.9: Spectrographic analysis of clean (top), reverberant at DRR = -3 dB (middle), and dereverberated (bottom) speech signals obtained from the proposed method.

Spectrographic Analysis

The performance of the proposed M-LCMV algorithm is analysed by considering a sentence from the TIMIT database uttered by a male speaker and sampled at 16 kHz. The AIR is simulated using the image method [138]. Figure 5.9 illustrates the results of dereverberation using the proposed algorithm. It can be observed from the dereverberated spectrogram (Figure 5.9) that a reasonable amount of noise and reverberation is removed.
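A Figure 5.9 style comparison can be reproduced in spirit with the short sketch below, which plots log magnitude spectrograms of whatever clean, degraded and processed signals are supplied; the signals used here are placeholders rather than TIMIT sentences, and the STFT settings are assumed values.

```python
# Sketch of the Figure 5.9 style comparison: log-magnitude spectrograms of a
# clean, a degraded, and a processed signal. Signals here are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import stft

def plot_spectrograms(signals, titles, fs=16000):
    fig, axes = plt.subplots(len(signals), 1, figsize=(8, 6), sharex=True)
    for ax, x, title in zip(axes, signals, titles):
        f, t, X = stft(x, fs=fs, nperseg=512)
        ax.pcolormesh(t, f, 20 * np.log10(np.abs(X) + 1e-10), shading="auto")
        ax.set_ylabel("Frequency (Hz)")
        ax.set_title(title)
    axes[-1].set_xlabel("Time (s)")
    fig.tight_layout()
    plt.show()

rng = np.random.default_rng(5)
clean = rng.standard_normal(16000)
degraded = clean + 0.5 * np.roll(clean, 800) + 0.1 * rng.standard_normal(16000)
plot_spectrograms([clean, degraded, degraded * 0.8],
                  ["Clean", "Reverberated plus noisy", "Dereverberated"])
```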

Performance Evaluation

The performance of the proposed method is evaluated by conducting experiments on joint noise cancellation and dereverberation. Experiments on distant speech recognition are also conducted. The experiments are performed on the TIMIT [160] database at various direct to reverberant ratios (DRR). The speech dereverberation experiments are evaluated using a subjective measure (mean opinion score, MOS) and objective measures (log spectral distortion, LSD, and signal to reverberation ratio, SRR). The performance of noise cancellation is evaluated using the segmental SNR (SSNR) measure.

Experimental Conditions

The TIMIT database is used in the experiments. An array of eight microphones is used to obtain spatialized TIMIT data. Sentences from the TIMIT database are reverberated at various DRR. Noise is added to the reverberated signal at an SNR of 15 dB. The room dimensions used in the simulation are 10.4 m x 10.4 m x 4.2 m. The subjective evaluation (MOS) was carried out by 25 listeners in the age group of 21 to 25 years on the dereverberated sentences.

Experimental Results on Noise Cancellation and Speech Dereverberation

Noise cancellation results are presented in terms of segmental SNR [172]. Segmental SNR is defined as the average SNR, computed over fixed-length frames, over all frames whose SNR lies between -10 dB and 30 dB.

Table 5.3: Experimental results on noise cancellation using segmental SNR as the measure on the TIMIT database (columns: M-LCMV, ESI, STA, SE; rows: DRR = -1 dB, -3 dB, -4 dB and -5 dB).

The proposed method has higher segmental SNR values at different DRR when compared to the excitation source information (ESI) method of speech enhancement [45], the spatio temporal averaging (STA) method [149] and spectral enhancement (SE) using joint MVDR and MMSE estimation [150].
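A direct sketch of the segmental SNR measure defined above follows; frames whose SNR falls outside the [-10, 30] dB range are excluded from the average, and the frame length is an assumed value since the one used in the experiments is not restated here.

```python
# Segmental SNR: per-frame SNR between clean and processed signals, with frames
# outside [-10, 30] dB excluded. The frame length below is an assumed value.
import numpy as np

def segmental_snr(clean, processed, frame_len=512, eps=1e-10):
    n_frames = min(len(clean), len(processed)) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = s - processed[i * frame_len:(i + 1) * frame_len]
        snr = 10 * np.log10((np.sum(s ** 2) + eps) / (np.sum(e ** 2) + eps))
        if -10.0 <= snr <= 30.0:
            snrs.append(snr)
    return float(np.mean(snrs)) if snrs else float("nan")

rng = np.random.default_rng(6)
clean = rng.standard_normal(16000)
noisy = clean + 0.1 * rng.standard_normal(16000)
print("SSNR (dB):", round(segmental_snr(clean, noisy), 2))
```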

The SSNR values decrease with decreasing DRR for all methods, and this decrease is smallest for the proposed method. The higher the SSNR value, the better the method in terms of speech intelligibility. Speech dereverberation results are listed as both objective and subjective measures in Table 5.4. The LSD is a speech distortion measure obtained as the root mean square (RMS) value of the difference between the log spectra of the clean speech signal and the dereverberated signal [143], [144]. The SRR [143] is a measure of reverberation which depends on the signal before and after processing.

Table 5.4: Experimental results on speech dereverberation using various methods on the TIMIT database (LSD, SRR and MOS for M-LCMV, ESI, STA and SE at DRR = -1 dB, -3 dB and -5 dB).

An increase in LSD values and a decrease in SRR and MOS values with decreasing DRR is noted for all the methods in Table 5.4. However, the proposed algorithm has lower LSD values and higher SRR and MOS values at the different DRR, indicating better reverberation suppression compared to the other methods used herein. In general, the lower the LSD values and the higher the SRR and MOS values, the better the method is at removing reverberation.

Experimental Results on Distant Speech Recognition

A spatialized version of the TIMIT (S-TIMIT) database is generated by acquiring TIMIT data over a microphone array and is used for the experiments on distant speech recognition. The recognition results are presented in terms of word error rate (WER). Triphone HMMs with 15 states and 3 mixtures, using 39-dimensional MFCC feature vectors with delta and acceleration coefficients, are used in the distant speech recognition experiments. In order to train the baseline triphone models of the recognition system, 1000 clean speech sentences from the database are used. Reverberant versions of 500 sentences from the S-TIMIT database are generated at different DRRs. Speech enhancement experiments are conducted using the proposed algorithm (M-LCMV) along with the other methods used in comparison.
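The 39 dimensional feature vectors mentioned above (MFCC with delta and acceleration coefficients) can be sketched as below; the use of librosa, the analysis parameters and the synthetic input are assumptions and do not reproduce the exact front end configuration of the thesis experiments.

```python
# Illustrative 39-dimensional MFCC feature extraction (13 static + delta +
# acceleration coefficients), as used for the HMM based recogniser.
import numpy as np
import librosa

def mfcc_39(y, sr=16000):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13 static coefficients
    delta = librosa.feature.delta(mfcc)                   # first derivatives
    accel = librosa.feature.delta(mfcc, order=2)          # second derivatives
    return np.vstack([mfcc, delta, accel])                # 39 x num_frames

y = 0.1 * np.random.default_rng(8).standard_normal(16000).astype(np.float32)
feats = mfcc_39(y)
print("feature matrix shape:", feats.shape)   # (39, num_frames)
```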

The reconstructed signals from all these methods at different DRR are used for testing the recognition system. In Figure 5.10, the WER for all the methods increases as the distance between the source and the microphone array increases, except for the close talking microphone (CTM).

Figure 5.10: Variation in WER for various methods (M-LCMV, ESI, STA, SE and CTM) with increasing distance between the source and the microphone array.

This is because the effect of reverberation grows as the distance between the source and the microphone increases. The WER for the CTM is constant and lowest because there is no reverberation effect during close talk. The proposed method has the WER closest to that of the CTM, indicating higher recognition accuracy and the least reverberation effect compared to the ESI, STA and SE methods. Table 5.5 illustrates the percentage increase in WER as the distance between the source and the microphone array increases. The percentage increase in WER, computed with respect to a source to microphone distance of one meter, is noted to be lowest for the proposed method compared to the other methods.

Table 5.5: Percentage increase in WER for various methods (M-LCMV, ESI, STA, SE) with varying distance from the source to the microphone, for distances starting at 2 meters.
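The quantity reported in Tables 5.2 and 5.5 is the relative increase in WER with respect to the one meter reference, which a two line helper makes explicit; the WER values below are placeholders, not experimental results.

```python
# Percentage increase in WER relative to the 1 m reference distance.
# The WER values below are placeholders, not experimental results.
def percent_increase(wer_at_distance, wer_at_1m):
    return 100.0 * (wer_at_distance - wer_at_1m) / wer_at_1m

print(round(percent_increase(wer_at_distance=32.0, wer_at_1m=25.0), 1))  # 28.0
```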

Summary

In this section, a multi channel speech enhancement algorithm is proposed by applying a single channel linearly constrained minimum variance (LCMV) filter at the output of each microphone of an array. The proposed method is based on an orthogonal decomposition for the extraction of the desired signal. The multi channel outputs are combined using the DSB followed by spectral subtraction, which further improves the quality of the reconstructed signal. The subjective and objective evaluation of the proposed method shows a reasonable improvement over the other methods compared herein. A lower word error rate is noted in the experiments on distant speech recognition when the proposed method is used. The significance of the proposed method is also illustrated using segmental SNR, where a higher SNR indicates higher intelligibility of the reconstructed signal.

5.3 Discussion

In this chapter, novel beamforming based spatial spectrum estimation methods for multi channel speech enhancement are discussed. The LP residual cepstrum (LPRC) based reverberant speech enhancement method is first described. However, the LPRC method has difficulty detecting glottal closure instants, or LP residual peaks, under noisy conditions, which results in spurious peaks being detected as GCI. Thereafter, the LCMV filtering based speech enhancement technique is discussed, which is robust to noise and reverberation and thus results in better noise and reverberation suppression. The LCMV filter utilizes the inter frame correlation to decompose the degraded signal into its coherent and incoherent components. These two methods (LPRC and LCMV) have been proposed in a fixed beamforming framework for speech enhancement. MOS scores indicate that the performance of both methods is reasonably better when compared to conventional methods in the literature. Experimental results indicate the significance of the LCMV filtering technique over the LPRC based speech enhancement. Thus, it can be noted that LCMV filtering is more robust to noise and reverberation than the LPRC method.

Chapter 6

Application of Speech Enhancement in the Development of Information Retrieval Systems

Information retrieval systems on a cell phone and in a teleconferencing environment are developed in order to demonstrate the effectiveness of the methods proposed in this thesis. A multi-media information retrieval system (MIRS) is developed in the context of single channel speaker separation. On the other hand, an audio retrieval system is developed in the context of multi channel speech enhancement. A discussion on the development of these systems follows in the ensuing sections.

6.1 Application of Single Channel Speaker Segregation Method in Multi-media Information Retrieval

In this section, the significance of the proposed group delay cross correlation based speaker segregation method (illustrated in Chapter 4.1) in the development of a multi-media information retrieval system is discussed. Automatic indexing and segmentation of large archives of raw audio and video [173] is gaining importance in contemporary multi-media systems. The separated speakers in a multi-speaker environment are indexed in the meeting archives

and a cell phone based intelligent meeting retrieval system is developed. The multi-media information retrieval system is used to retrieve keywords spoken by a speaker in a multi source environment with interference from other speakers. Additionally, keywords are also retrieved after separation of the individual speakers using the proposed algorithm. It may also be noted that the system provides the additional functionality of retrieving keywords that would otherwise be lost in the overlapped speech of more than one speaker. The following section describes the design of a meeting capture system and the methodology for archiving meeting audio in a multi-source environment.

Design of a Meeting Capture and Audio Archiving System

In order to develop an audio archive for indexing keywords in multi-source environments, a meeting room was set up for recording meeting audio. Four speakers around a meeting table were captured by two cameras. Each speaker read through a different transcript so that the keywords were very distinct for each speaker. This ensures that different keywords are added in a multi-source scenario. The transcripts included several topics such as social networking, cricket, movies, and politics. The main objective of the meeting archive development is to apply the separation algorithm in the overlapped regions where two speakers are speaking simultaneously. Subsequently, the overlapping content spoken by the multiple speakers is separated by the proposed algorithm and incorporated into the multi-media retrieval system.

Speaker Demography in the Audio Archives

The majority of the speaker population were students and employees of various departments and centers of the Indian Institute of Technology Kanpur. Speakers were selected based on parameters such as diversity of age, dialect and gender, as these parameters play a major role in developing the multi-pitch detection and separated speech recognition system. The distribution of male-male and female-female speaker pairs with closely spaced pitch contours was also considered in the selection of the speakers.

Figure 6.1: (a) Screen shot of the interface for the multi-media information retrieval system and (b) photograph of the MIRS working on a cell phone.

Design of a Cell Phone based Multi-media Information Retrieval System

The multi-media interface of the cell phone based information retrieval system is designed keeping ease of access in mind. The retrieval system consists of two components [174]. The first is an interface, which accepts queries from the user and provides the results back to the user. The second component is a retrieval engine, which evaluates the queries and interacts with the user interface. The results are presented as video clips in a mixed mode and a separated mode. In the mixed mode, the video clip of two speakers speaking simultaneously is presented to the user. In the separated mode, the video clip of the individual speaker obtained after separation is presented. The cell phone based multi-media retrieval system can be accessed via a web link. The system is accessed by first selecting the keyword. Once the keyword is selected, the camera which captured this particular keyword is activated, since multiple cameras have been used in the meeting capture system. Simultaneously, the user can select either the mixed mode or the individual separated speaker mode. Once these selections have been made, the system starts playing the sections of the multimedia file containing the keyword from the selected camera and in the selected mode. A screen shot of the system is shown in Figure 6.1.

6.2 Application of Multi Channel Speech Enhancement Method for Audio Retrieval in Teleconferencing Environment

In this section, an application of the multi channel LCMV filtering based speech enhancement method (described in Chapter 5.2) to audio retrieval in a teleconferencing environment is discussed. The setup for audio data collection and retrieval in a teleconferencing environment is first discussed. The procedures for audio data collection and audio data retrieval are then illustrated using flow diagrams. Experimental results for keyword retrieval using the proposed algorithm and other conventional methods are also presented herein.

Figure 6.2: Block diagram of data collection over T1 digital lines in a teleconferencing environment.

Experimental Setup for Audio Archiving in Teleconferencing Environment

The setup is designed for three clients present in a teleconferencing meeting. Each client's data is recorded using a microphone array (MA), as shown in Figure 6.2. At each client side, two microphone arrays are used, where each MA comprises four microphones. On each client side, both sets of MA, which together form an array of eight microphones, are activated for meeting data collection (MA-C). The data recorded from each client is generally noisy and contains reverberation. The proposed multi channel LCMV filtering algorithm discussed in Chapter 5.2 under the fixed beamforming framework is used to reconstruct the clean speech for each client, which is then stored on the server. The clean data from each client is also broadcast [175] to all the other clients on T1 digital lines [176] via an asterisk server. The data stored on the server is logged in a standard file format for each client. The complete procedure for data collection is shown as a flow diagram in Figure 6.3. The data spoken by the clients covers topics such as weather, the Kargil war, Olympics performance, politics in India, and Indian cinema.

Figure 6.3: Flow diagram for archiving meeting audio data over VOIP.

Experiments on Audio Retrieval in Teleconferencing Environment

In the audio retrieval system described herein, the client first sends a query to the audio server. The server acknowledges the request by recognizing the keyword and searching for it in the database. If the queried keyword exists in the database, the asterisk server sends the keyword to the querying client through the internet [175].
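A highly simplified sketch of the server side keyword lookup described above is given below; the archive is modelled as an in memory dictionary mapping keywords to stored clip paths, and the speech recognition, asterisk and networking components are deliberately omitted. All names and entries are hypothetical.

```python
# Hypothetical sketch of the keyword lookup step performed by the audio server:
# if the recognized keyword exists in the meeting archive, return the stored
# clip reference; otherwise report that no match was found. Entries are made up.
archive = {
    "olympics": "archive/client2_olympics_0012.wav",
    "weather": "archive/client1_weather_0003.wav",
}

def serve_query(keyword):
    key = keyword.strip().lower()
    if key in archive:
        return {"status": "found", "clip": archive[key]}
    return {"status": "not_found", "clip": None}

print(serve_query("Olympics"))
print(serve_query("cricket"))
```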

Figure 6.4: Block diagram of the audio retrieval system in a teleconferencing environment.

The block diagram for audio retrieval in the teleconferencing environment is illustrated in Figure 6.4.

Audio Data Retrieval in Active Meetings

During data retrieval in an active meeting, the last four primary microphones are activated for data retrieval (MA-R), while the first four primary microphones are used for data collection (MA-C). When the meeting is active, a keyword query from any client can be served by switching to the last four microphones at each client side. This in turn activates the second set of MA for data retrieval at the client side, as shown in Figure 6.4. The server containing the audio archives returns the queried keyword only when the keyword matches a keyword stored in the meeting database. The retrieved data is played back to the client through an asterisk server. The flow diagram for the data retrieval system during the meeting is shown in Figure 6.5.


Reducing comb filtering on different musical instruments using time delay estimation Reducing comb filtering on different musical instruments using time delay estimation Alice Clifford and Josh Reiss Queen Mary, University of London alice.clifford@eecs.qmul.ac.uk Abstract Comb filtering

More information

Adaptive Systems Homework Assignment 3

Adaptive Systems Homework Assignment 3 Signal Processing and Speech Communication Lab Graz University of Technology Adaptive Systems Homework Assignment 3 The analytical part of your homework (your calculation sheets) as well as the MATLAB

More information

Machine recognition of speech trained on data from New Jersey Labs

Machine recognition of speech trained on data from New Jersey Labs Machine recognition of speech trained on data from New Jersey Labs Frequency response (peak around 5 Hz) Impulse response (effective length around 200 ms) 41 RASTA filter 10 attenuation [db] 40 1 10 modulation

More information

Sub-band Envelope Approach to Obtain Instants of Significant Excitation in Speech

Sub-band Envelope Approach to Obtain Instants of Significant Excitation in Speech Sub-band Envelope Approach to Obtain Instants of Significant Excitation in Speech Vikram Ramesh Lakkavalli, K V Vijay Girish, A G Ramakrishnan Medical Intelligence and Language Engineering (MILE) Laboratory

More information

Preeti Rao 2 nd CompMusicWorkshop, Istanbul 2012

Preeti Rao 2 nd CompMusicWorkshop, Istanbul 2012 Preeti Rao 2 nd CompMusicWorkshop, Istanbul 2012 o Music signal characteristics o Perceptual attributes and acoustic properties o Signal representations for pitch detection o STFT o Sinusoidal model o

More information

Nonuniform multi level crossing for signal reconstruction

Nonuniform multi level crossing for signal reconstruction 6 Nonuniform multi level crossing for signal reconstruction 6.1 Introduction In recent years, there has been considerable interest in level crossing algorithms for sampling continuous time signals. Driven

More information

RASTA-PLP SPEECH ANALYSIS. Aruna Bayya. Phil Kohn y TR December 1991

RASTA-PLP SPEECH ANALYSIS. Aruna Bayya. Phil Kohn y TR December 1991 RASTA-PLP SPEECH ANALYSIS Hynek Hermansky Nelson Morgan y Aruna Bayya Phil Kohn y TR-91-069 December 1991 Abstract Most speech parameter estimation techniques are easily inuenced by the frequency response

More information

Non-stationary Analysis/Synthesis using Spectrum Peak Shape Distortion, Phase and Reassignment

Non-stationary Analysis/Synthesis using Spectrum Peak Shape Distortion, Phase and Reassignment Non-stationary Analysis/Synthesis using Spectrum Peak Shape Distortion, Phase Reassignment Geoffroy Peeters, Xavier Rodet Ircam - Centre Georges-Pompidou, Analysis/Synthesis Team, 1, pl. Igor Stravinsky,

More information

REAL-TIME BROADBAND NOISE REDUCTION

REAL-TIME BROADBAND NOISE REDUCTION REAL-TIME BROADBAND NOISE REDUCTION Robert Hoeldrich and Markus Lorber Institute of Electronic Music Graz Jakoministrasse 3-5, A-8010 Graz, Austria email: robert.hoeldrich@mhsg.ac.at Abstract A real-time

More information

Blind Speech Separation in Distant Speech Recognition Front-end Processing

Blind Speech Separation in Distant Speech Recognition Front-end Processing Blind Speech Separation in Distant Speech Recognition Front-end Processing A Thesis submitted to the department of - Natural Science and Technology II - in partial fulfillment of the requirements for the

More information

Cepstrum alanysis of speech signals

Cepstrum alanysis of speech signals Cepstrum alanysis of speech signals ELEC-E5520 Speech and language processing methods Spring 2016 Mikko Kurimo 1 /48 Contents Literature and other material Idea and history of cepstrum Cepstrum and LP

More information

Speech Enhancement: Reduction of Additive Noise in the Digital Processing of Speech

Speech Enhancement: Reduction of Additive Noise in the Digital Processing of Speech Speech Enhancement: Reduction of Additive Noise in the Digital Processing of Speech Project Proposal Avner Halevy Department of Mathematics University of Maryland, College Park ahalevy at math.umd.edu

More information

A Spatial Mean and Median Filter For Noise Removal in Digital Images

A Spatial Mean and Median Filter For Noise Removal in Digital Images A Spatial Mean and Median Filter For Noise Removal in Digital Images N.Rajesh Kumar 1, J.Uday Kumar 2 Associate Professor, Dept. of ECE, Jaya Prakash Narayan College of Engineering, Mahabubnagar, Telangana,

More information

Speech and Audio Processing Recognition and Audio Effects Part 3: Beamforming

Speech and Audio Processing Recognition and Audio Effects Part 3: Beamforming Speech and Audio Processing Recognition and Audio Effects Part 3: Beamforming Gerhard Schmidt Christian-Albrechts-Universität zu Kiel Faculty of Engineering Electrical Engineering and Information Engineering

More information