SPEECH ENHANCEMENT USING A ROBUST KALMAN FILTER POST-PROCESSOR IN THE MODULATION DOMAIN. Yu Wang and Mike Brookes

Department of Electrical and Electronic Engineering, Imperial College London, Exhibition Road, London, UK
Email: {yw09, mike.brookes}@imperial.ac.uk

ABSTRACT

We propose a speech enhancement algorithm that applies a Kalman filter in the modulation domain to the output of a conventional enhancer operating in the time-frequency domain. The speech model required by the Kalman filter is obtained by performing linear predictive analysis in each frequency bin of the modulation domain signal. We show, however, that the corresponding speech synthesis filter can have a very high gain at low frequencies and may approach instability. To improve the stability of the synthesis filter, we propose two alternative methods of limiting its low frequency gain. We evaluate the performance of the speech enhancement algorithm on the core TIMIT test set and demonstrate that it gives consistent performance improvements over the baseline enhancer.

Index Terms: speech enhancement, post-processing, Kalman filter, robust linear prediction, modulation domain

1. INTRODUCTION

The goal of a speech enhancement algorithm is to reduce or eliminate background noise without distorting the speech signal. Numerous speech enhancement algorithms have been proposed in the literature; among the most popular are those that apply a variable gain in the time-frequency domain, such as the minimum mean square error (MMSE) spectral amplitude [1] and log-spectral amplitude [2] enhancers. These enhancement algorithms give dramatic improvements in signal-to-noise ratio (SNR), but at the expense of introducing spurious tonal artefacts known as musical noise, together with speech distortion. A number of authors have suggested removing the musical noise by applying some form of post-processing to the output of the baseline enhancer or to the time-frequency gain function that it utilizes.
Smoothing the enhancer gain function is used in [3] to attenuate musical noise in time frames with low SNR, and in [4] the gain function of each frame is first transformed into the cepstral domain so that smoothing may be selectively applied to the high quefrency coefficients. In [5], median filtering is applied to time-frequency cells that are classified as having a low probability of containing speech energy, in order to eliminate the isolated peaks that characterise musical noise. Several authors have proposed speech enhancers that apply a Kalman filter (KF) to the time domain signal [6, 7, 8, 9] and, more recently, So and Paliwal have proposed applying the KF in the short-time modulation domain instead [10]. In this paper, we propose the use of a KF in the modulation domain as a post-processor for speech that has been enhanced by an MMSE spectral amplitude algorithm [1]. The KF incorporates an autoregressive model for the time-evolution of the spectral amplitude in each frequency bin; this is estimated using linear predictive (LPC) analysis applied to the time-frequency domain output of the MMSE enhancer. Because the spectral amplitudes include a strong DC component, the gain of the corresponding LPC synthesis filter can be very high at low frequencies, and we therefore propose two alternative ways of constraining the low frequency gain in order to improve the filter stability. The remainder of the paper is organized as follows: in Section 2 we describe the KF technique for speech enhancement in the modulation domain and, in Section 3, we derive the two robust linear prediction models. Finally, the evaluation of the new algorithms and the conclusions are given in Sections 4 and 5, respectively.
2. MODULATION DOMAIN KALMAN FILTERING

Representing the amplitude spectra of the noisy speech signal and the clean speech as Y(n, k) and S(n, k) respectively, we assume an additive model of the noisy speech

  Y(n, k) = S(n, k) + N(n, k)                                        (1)

where n denotes the acoustic frame and k denotes the acoustic frequency. To perform Kalman filtering in the modulation domain, each frequency bin is processed independently; for clarity, we omit the frequency index, k, in the description that follows. We assume that the temporal envelope, S(n), of the amplitude spectrum of the speech signal can be modeled by a linear predictor with coefficients a_i (1 ≤ i ≤ p) in each modulation frame:

978-1-4799-0356-6/13/$31.00 ©2013 IEEE, ICASSP 2013

  S(n) = -Σ_{i=1}^{p} a_i S(n-i) + P(n)                              (2)

where P(n) is a random Gaussian excitation signal with variance σ_P². The equations for Kalman filtering in the modulation domain are given in detail in [10] and we give only a brief overview here. In the modulation domain, time-domain noise has colored characteristics [10] and hence a KF designed for removing colored noise is used [6]. Within each frequency bin, we use autoregressive models of orders p and q for the speech and the noise respectively, and so the state vector in our KF has dimension p + q. The state space representation is given by

  [ S(n) ]   [ A(n)   0  ] [ S(n-1) ]   [ d_p   0  ] [ P(n) ]
  [ N(n) ] = [  0   B(n) ] [ N(n-1) ] + [  0   d_q ] [ Q(n) ]        (3)

  Y(n) = [ d_p^T  d_q^T ] [ S(n) ]
                          [ N(n) ]                                   (4)

where S(n) = [S(n) ... S(n-p+1)]^T is the speech state vector, d_p = [1 0 ... 0]^T is a p-dimensional vector and the speech transition matrix has the companion form

  A(n) = [ -a^T  ]
         [  I  0 ]

where a = [a_1 ... a_p]^T is the LPC coefficient vector, I is the identity matrix of dimension p-1 and 0 denotes an all-zero column vector of length p-1. The quantities d_q, N(n) and B(n) are defined similarly for the order-q noise model. The speech signal S(n) is thus generated in the modulation domain as the output of the LPC synthesis filter

  H(z) = 1 / (1 + Σ_{i=1}^{p} a_i z^{-i})                            (5)

driven by the excitation signal P(n). To determine the speech and noise model parameters, the time-frequency signal is segmented into overlapping modulation frames. For each frequency bin, a speech model {a, σ_P²} is estimated by applying autocorrelation LPC analysis to the modulation frame. A separate voice activity detector is applied to each frequency bin and a noise model, {b, σ_Q²}, is estimated during intervals where speech is absent. Full details are given in [10].

3. KALMAN FILTER POST-PROCESSING

The framework for our proposed speech enhancer is shown in Fig. 1 and differs from that in [10] in two respects which we have found to result in enhanced speech of improved quality.
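The per-bin autocorrelation LPC analysis described above can be sketched in a few lines. The following is a minimal pure-Python illustration, not the authors' implementation; the function names and the example envelope are invented for the sketch:

```python
def autocorr(s, lag):
    """Biased autocorrelation estimate R(lag) = (1/N) * sum s[n] s[n-lag]."""
    n = len(s)
    return sum(s[i] * s[i - lag] for i in range(lag, n)) / n

def solve(mat, rhs):
    """Solve mat @ x = rhs by Gaussian elimination with partial pivoting."""
    p = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(mat)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, p):
            f = m[r][c] / m[c][c]
            for k in range(c, p + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * p
    for r in range(p - 1, -1, -1):
        x[r] = (m[r][p] - sum(m[r][k] * x[k] for k in range(r + 1, p))) / m[r][r]
    return x

def lpc_model(s, p):
    """Order-p autocorrelation LPC of one modulation frame s.

    Returns (a, var) for the model S(n) = -sum_i a_i S(n-i) + P(n),
    where a solves the Yule-Walker equations a = -R^{-1} g and var is
    the residual (excitation) power R(0) + sum_i a_i g_i.
    """
    R = [[autocorr(s, abs(i - j)) for j in range(p)] for i in range(p)]
    g = [autocorr(s, i + 1) for i in range(p)]
    a = [-x for x in solve(R, g)]
    var = autocorr(s, 0) + sum(a[i] * g[i] for i in range(p))
    return a, var

def dc_gain(a):
    """DC gain of the synthesis filter H(z) = 1/(1 + sum a_i z^-i)."""
    return 1.0 / (1.0 + sum(a))

# example: a decaying modulation envelope is modelled well at order 1;
# the estimated coefficient a[0] comes out close to -0.9
env = [0.9 ** n for n in range(200)]
a, var = lpc_model(env, 1)
```

The `dc_gain` helper corresponds to the quantity whose growth under a DC bias is analysed in Sec. 3.1.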
First, we apply the KF not to the spectrum of the original noisy speech signal but rather to that of the output of an enhancer that implements the MMSE spectral amplitude algorithm from [1]. Second, motivated by [11] and [12], we apply the KF to the cube root of the short-time power spectrum rather than to the amplitude spectrum.

Fig. 1. Block diagram of the algorithm.

Referring to Fig. 1, a short-time Fourier transform (STFT) is applied to the enhanced speech and the cube root of the resulting power spectrum is taken. In our baseline system, the speech and noise models are estimated using the method of [10] and are used in the KF described in Sec. 2. The output from the KF is converted back to the amplitude domain, combined with the noisy phase spectrum and passed through an inverse STFT to create the output speech. Although we do not do so in our implementation, it would be possible to eliminate the initial STFT operation by taking the enhancer output directly in the time-frequency domain.

LPC is conventionally applied to a zero-mean time-domain signal, but in the modulation domain KF it is applied to a positive-valued sequence of transformed spectral amplitudes. As we will show, when LPC analysis is applied to a signal that includes a strong DC component, the resultant synthesis filter can have a very high gain at low frequencies and the filter may, as a consequence, be close to instability. We have found that this near-instability significantly degrades the quality of the output speech and thus in Secs. 3.2 and 3.3 we propose two alternative ways of preventing it.

3.1. Effect of DC bias on LPC analysis

In this section, we determine the effect of a strong DC component on the results of LPC analysis.
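The cube-root mapping applied before the KF and its inverse applied afterwards can be sketched as follows. This is an illustrative sketch with invented helper names; in particular, clipping negative KF outputs to zero before the inverse mapping is our assumption, not a detail stated in the paper:

```python
def to_cube_root_power(amplitude):
    """Map spectral amplitudes |Y| to the cube root of the power
    spectrum, (|Y|^2)^(1/3), the domain in which the KF operates."""
    return [(x * x) ** (1.0 / 3.0) for x in amplitude]

def to_amplitude(feature):
    """Invert the mapping after Kalman filtering: f -> f^(3/2).

    A KF output can be negative, so values are clipped to zero first
    (an assumption of this sketch)."""
    return [max(f, 0.0) ** 1.5 for f in feature]
```

Because ((x^2)^(1/3))^(3/2) = x for x ≥ 0, the two mappings round-trip exactly when the KF leaves the features unchanged.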
Suppose first that S(n) has zero mean and that the LPC coefficient vector, a, for a frame of length N is determined from the Yule-Walker equations

  a = -R^{-1} g                                                      (6)

where the elements of the autocorrelation matrix, R, are given by R_{i,j} = (1/N) Σ_n S(n-i) S(n-j) for 1 ≤ i, j ≤ p and the elements of g are g_i = R_{i,0}. The DC gain of the synthesis filter H(z) in equation (5) is given by G = 1/(1 + w^T a) where w = [1 1 ... 1]^T is a p-dimensional vector of ones.

If now a DC component, d, is added to each S(n), the effect is to add d² onto each R_{i,j} and g_i, and the new LPC coefficients, a′, are given by

  a′ = -(R + d² w w^T)^{-1} (g + d² w)
     = -(R^{-1} - (d² R^{-1} w w^T R^{-1}) / (1 + d² w^T R^{-1} w)) (g + d² w)

where the second line follows from the Matrix Inversion Lemma [13]. Writing r = d² w^T R^{-1} w, we can obtain

  w^T a′ = -(w^T R^{-1} g + r) / (1 + r) = (w^T a - r) / (1 + r).

Thus the DC gain of the new synthesis filter is

  1 / (1 + w^T a′) = (1 + r) / (1 + w^T a).                          (7)

From (7) we see that the DC gain of the synthesis filter has been multiplied by 1 + r, where r is proportional to the power ratio of the DC and AC components of S(n). If this ratio is large, the low frequency gain of the LPC synthesis filter can become very high, which results in near instability and poor prediction. Accordingly, in the following sections we propose two alternative methods of limiting the low frequency gain of the LPC synthesis filter.

3.2. Method 1: Bandwidth Expansion

The technique of bandwidth expansion is widely used in coding algorithms to reduce the peak gain and improve the stability of an LPC synthesis filter [14]. If a modified set of LPC coefficients is defined by ā_i = α^i a_i, for some constant α < 1, then the poles of the synthesis filter are all multiplied by α. This moves the poles away from the unit circle, thereby reducing the gain of the corresponding frequency domain peaks and improving the stability of the filter. In Sec. 4 we evaluate the effect of using this revised set of LPC coefficients, ā, in the KF of Fig. 1 (denoted the B algorithm) and find that it results in a consistent improvement in performance.

3.3. Method 2: Constrained DC gain

Although the bandwidth expansion approach is effective in limiting the low frequency gain of the synthesis filter, it also modifies the filter response at higher frequencies, thereby destroying its optimality. An alternative approach is to constrain the DC gain of the synthesis filter to a predetermined value and determine the optimum LPC coefficients subject to this constraint. As noted in Sec.
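For p = 1 the gain-inflation result above, and the taming effect of bandwidth expansion, can be checked with a few lines of scalar arithmetic. The frame statistics R(0) and g_1 below are invented illustrative numbers, not values from the paper:

```python
# Illustrative check: adding a DC offset d to a zero-mean frame replaces
# R(0) by R(0)+d^2 and g_1 by g_1+d^2, and multiplies the DC gain of the
# order-1 synthesis filter by 1 + r with r = d^2 / R(0).
r0, g1 = 1.0, 0.7              # R(0) and g_1 of the zero-mean frame (made up)
d = 3.0                        # DC component added to every sample

a1 = -g1 / r0                           # Yule-Walker solution, p = 1
a1_dc = -(g1 + d * d) / (r0 + d * d)    # LPC of the DC-shifted frame
r = d * d / r0                          # r = d^2 w^T R^{-1} w for p = 1

dc_gain = 1.0 / (1.0 + a1)              # H(1) without the offset
dc_gain_dc = 1.0 / (1.0 + a1_dc)        # H(1) with the offset
assert abs(dc_gain_dc - (1.0 + r) * dc_gain) < 1e-9  # multiplied by 1 + r

# Method 1: bandwidth expansion a_i <- alpha^i a_i shrinks every pole by
# alpha, which bounds the low-frequency gain.
alpha = 0.7
a1_bw = alpha * a1_dc
assert 1.0 / (1.0 + a1_bw) < dc_gain_dc  # expanded filter has lower DC gain
```

With these numbers the DC offset inflates the synthesis-filter DC gain by a factor of ten, and scaling the coefficient by α pulls the pole from 0.97 back to about 0.68.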
3.1, the DC gain of the LPC synthesis filter is G = 1/(1 + w^T a) and we can force G = G_0 by imposing the constraint

  w^T a = (1 - G_0) / G_0 ≜ β > -1.

The average prediction error energy in the analysis frame is given by

  E = (1/N) Σ_n (S(n) + Σ_{i=1}^{p} a_i S(n-i))²

and we would like to minimize E subject to the constraint w^T a = β. Using a Lagrange multiplier, λ, the solution, ã, to this constrained optimization problem is obtained by solving the p + 1 equations

  d/dã (E + λ w^T ã) = 0,   w^T ã = β

and the solution is

  [ 0.5λ ]   [ 0  w^T ]^{-1} [  β ]
  [  ã   ] = [ w   R  ]      [ -g ]                                  (8)

where R, g and w are as defined in Sec. 3.1. In Sec. 4 we evaluate the effect of using this revised set of LPC coefficients, ã, in the KF of Fig. 1 (denoted the C algorithm) and find that it results in a consistent improvement in performance both over the baseline algorithm, which uses the unconstrained filter coefficients, and also over the B algorithm, which uses the bandwidth expanded coefficients.

4. IMPLEMENTATION AND EVALUATION

4.1. Stimuli of experiments

In this section, we compare the performance of the baseline MMSE enhancer [15] with that of the three algorithms that incorporate a KF post-processor. The baseline KF algorithm uses an unconstrained speech model, the B algorithm incorporates the bandwidth expansion from Sec. 3.2, while the C algorithm uses the constrained filter from Sec. 3.3. In our experiments, we use the core test set from the TIMIT database [16], which contains 16 male and 8 female speakers each reading 8 distinct sentences (totalling 192 sentences), and the speech is corrupted by white and factory noise from the RSG-10 database [17] at -5, 0, 5, 10, 15, and 20 dB signal-to-noise ratio (SNR). The algorithm parameters were determined by optimizing performance on a subset of the TIMIT training set. We use an acoustic frame length of 32 ms with a 4 ms frame increment, which gives a sample rate of 250 Hz in the modulation domain. The speech model is determined from a modulation frame of 128 ms (32 acoustic frames) with a 16 ms frame increment.
For the baseline KF algorithm, the speech and noise models are of orders p = 2 and q = 4 respectively, while for the B and C algorithms they are p = 3 and q = 6, as these choices of p and q give the best performance for the corresponding enhancers. Additionally, we set α = 0.7 and β = 0.8, and use a Bartlett-Hanning window in the analysis-synthesis procedure and a Hamming window for the estimation of the speech model coefficients.
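The constrained solution of Sec. 3.3 can be verified numerically. The sketch below (pure Python; the frame statistics R and g are invented illustrative numbers) solves the (p+1)-dimensional block system and checks that the DC-gain constraint holds, and that choosing β equal to the unconstrained w^T a makes the multiplier vanish, recovering the Yule-Walker solution:

```python
def solve(mat, rhs):
    """Gaussian elimination with partial pivoting for small dense systems."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(mat)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

# illustrative second-order frame statistics (R symmetric positive definite)
R = [[1.0, 0.6], [0.6, 1.0]]
g = [0.7, 0.4]
p = len(g)
beta = -0.5                     # target w^T a = beta, i.e. DC gain 1/(1+beta)

# block system: [0 w^T; w R] [0.5*lam; a~] = [beta; -g], with w all ones
M = [[0.0] + [1.0] * p] + [[1.0] + R[i] for i in range(p)]
x = solve(M, [beta] + [-gi for gi in g])
lam, a_tilde = 2.0 * x[0], x[1:]
assert abs(sum(a_tilde) - beta) < 1e-9   # constraint w^T a~ = beta holds

# beta equal to the unconstrained w^T a gives lambda = 0, i.e. a~ = -R^{-1} g
a_unc = solve(R, [-gi for gi in g])
x0 = solve(M, [sum(a_unc)] + [-gi for gi in g])
assert abs(2.0 * x0[0]) < 1e-9
```

Note that the top-left element of the block matrix is zero, so a solver with pivoting (as above) is needed.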

Fig. 2. Average segSNR values comparing different algorithms, where speech signals are corrupted by white noise at different SNR levels.

4.2. Performance of new algorithms

Using the new LPC models, the performance of the speech enhancers is evaluated using both segmental SNR (segSNR) and the perceptual evaluation of speech quality (PESQ) measure defined in ITU-T P.862. In all cases the measures are averaged over the 192 sentences in the TIMIT core test set. Figures 2 and 3 show how the average segSNR varies with global SNR for white noise and factory noise for the unenhanced speech, the baseline MMSE enhancer and the three KF post-processing algorithms presented here. We see that at high SNRs, all the algorithms have very similar performance. However, at 0 dB SNR the baseline KF post-processor provides an improvement of approximately 1 dB in segSNR over MMSE enhancement, and the B and C algorithms give an additional 0.5 and 1.5 dB improvement respectively. The PESQ results shown in Figs. 4 and 5 broadly mirror the segSNR results, although the post-processing gives an improvement in PESQ even at high SNRs. For both noise types, the constrained KF post-processor (C) gives a PESQ improvement of more than 0.1 over a wide range of SNRs. In addition, informal listening tests indicate that the proposed post-processing methods, especially the B and C enhancers, are able to reduce the musical noise introduced by the MMSE enhancer.

5. CONCLUSION

We have proposed three alternative methods of post-processing the output of an MMSE spectral amplitude speech enhancer by using a KF in the modulation domain. The three methods differ in how they estimate the LPC speech model in each modulation frame. We have shown that all three methods give consistent improvements over the MMSE enhancer in both segSNR and PESQ, and that the best method, which performs LPC analysis with a constrained DC gain, improves PESQ scores by at least 0.1 over a wide range of SNRs.

Fig. 3.
Average segSNR values comparing different algorithms, where speech signals are corrupted by factory noise at different SNR levels.

Fig. 4. Average PESQ values comparing different algorithms, where speech signals are corrupted by white noise at different SNR levels.

Fig. 5. Average PESQ values comparing different algorithms, where speech signals are corrupted by factory noise at different SNR levels.

6. REFERENCES

[1] Y. Ephraim and D. Malah. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust., Speech, Signal Process., 32(6):1109-1121, December 1984.

[2] Y. Ephraim and D. Malah. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust., Speech, Signal Process., 33(2):443-445, 1985.

[3] T. Esch and P. Vary. Efficient musical noise suppression for speech enhancement systems. In Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 4409-4412, April 2009.

[4] C. Breithaupt, T. Gerkmann, and R. Martin. Cepstral smoothing of spectral filter gains for speech enhancement without musical noise. IEEE Signal Processing Letters, 14(12):1036-1039, December 2007.

[5] Zenton Goh, Kah-Chye Tan, and B. T. G. Tan. Postprocessing method for suppressing musical noise generated by spectral subtraction. IEEE Trans. Speech Audio Process., 6(3):287-292, May 1998.

[6] J. D. Gibson, B. Koo, and S. D. Gray. Filtering of colored noise for speech enhancement and coding. IEEE Trans. Signal Process., 39(8):1732-1742, August 1991.

[7] A. Yasmin, P. Fieguth, and Li Deng. Speech enhancement using voice source models. In Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), volume 1, pages 797-800, March 1999.

[8] Z. Goh, K.-C. Tan, and B. T. G. Tan. Kalman-filtering speech enhancement method based on a voiced-unvoiced speech model. IEEE Trans. Speech Audio Process., 7(5):510-524, September 1999.

[9] V. Grancharov, J. Samuelsson, and B. Kleijn. On causal algorithms for speech enhancement. IEEE Trans. Audio, Speech, Lang. Process., 14(3):764-773, May 2006.

[10] S. So and K. K. Paliwal. Modulation-domain Kalman filtering for single-channel speech enhancement. Speech Commun., 53(6):818-829, July 2011.

[11] H. Hermansky, E. A. Wan, and C. Avendano. Speech enhancement based on temporal processing. In Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), volume 1, pages 405-408, May 1995.

[12] J. G. Lyons and K. K. Paliwal. Effect of compressing the dynamic range of the power spectrum in modulation filtering based speech enhancement. In Proc. Interspeech Conf., pages 387-390, September 2008.

[13] Mike Brookes. The matrix reference manual. http://www.ee.imperial.ac.uk/hp/staff/dmb/matrix/intro.html, 1998-2012.

[14] P. Kabal. Ill-conditioning and bandwidth expansion in linear prediction of speech. In Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), volume 1, pages I-824-I-827, April 2003.

[15] D. M. Brookes. VOICEBOX: A speech processing toolbox for MATLAB. http://www.ee.imperial.ac.uk/hp/staff/dmb/voicebox/voicebox.html, 1998-2012.

[16] J. S. Garofolo. Getting started with the DARPA TIMIT CD-ROM: An acoustic phonetic continuous speech database. Technical report, National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, December 1988.

[17] H. J. M. Steeneken and F. W. M. Geurtsen. Description of the RSG.10 noise data-base. Technical Report IZF 1988-3, TNO Institute for Perception, 1988.