
R E S E A R C H   R E P O R T   I D I A P

Effective post-processing for single-channel frequency-domain speech enhancement

Weifeng Li (a)

IDIAP RR 7-7, January 8, submitted for publication

(a) IDIAP Research Institute, Martigny, Switzerland

IDIAP Research Institute, Av. des Prés Beudin, P.O. Box 59, Martigny, Switzerland - www.idiap.ch - Email: info@idiap.ch

IDIAP Research Report 7-7

Effective post-processing for single-channel frequency-domain speech enhancement

Weifeng Li

January 8 (submitted for publication)

Abstract. Conventional frequency-domain speech enhancement filters improve the signal-to-noise ratio (SNR), but they also introduce speech distortion. This paper describes a novel post-processing algorithm devised to improve the quality of speech processed by a conventional filter. In the proposed algorithm, the speech distortion is first compensated by adding back the original noisy speech, and the remaining noise is then reduced by a post-filter. Experimental results on speech quality show that the proposed algorithm yields lower speech distortion. Based on our isolated word recognition experiments conducted in 15 real car environments, a relative word error rate (WER) reduction of .5% is obtained compared to the conventional filter.

1 Introduction

Modern communication systems employ speech enhancement algorithms at the pre-processing stage, prior to further processing such as speech coding or automatic speech recognition (ASR). Over the past three decades, frequency-domain enhancement methods have received significant interest due to their relatively good performance and low computational cost. The first of these is the well-known spectral subtraction method [1]. Other methods have also been developed, e.g., the Wiener filter and short-time spectral amplitude (STSA) analysis with different estimation techniques, such as maximum likelihood (ML) [2], minimum mean square error (MMSE) [3], and maximum a posteriori (MAP). While most of the above speech estimators improve the signal-to-noise ratio (SNR), they also produce speech distortion, mainly due to inaccurate or erroneous noise or SNR estimation. In fact, as indicated in [4], generally no or hardly any improvement in speech intelligibility is found with single-microphone speech enhancement algorithms.

Perceptually motivated speech enhancement methods have been proposed to lower speech distortion by exploiting masking properties from psycho-acoustics. These methods, however, depend heavily on an accurate estimate of the masking threshold in noise. In low-SNR conditions, the estimated masking thresholds may deviate from the true ones, resulting in additional residual noise [5]. Moreover, trying to mask the distortions of the residual noise leads to a variable speech distortion [6].

In this paper, we propose a novel post-processing algorithm for reducing the speech distortion caused by conventional filters while maintaining their noise reduction ability. The proposed algorithm consists of two stages. In the first stage, the speech processed (or enhanced) by a conventional filter is compensated by adding the original noisy speech. The second stage incorporates a Wiener filter to remove the additional residual noise, using the cross-spectrum between the original noisy speech and the speech processed by the conventional filter. The proposed post-processing algorithm is universal and may be applied to different types of conventional speech enhancement filters to achieve better performance.

The organization of this paper is as follows: in Section 2, we formulate the proposed filter; in Section 3, we present the performance evaluation; Section 4 summarizes this paper.

2 Algorithms

2.1 Formulation of the proposed filter

Let the corrupted speech signal x(i) be represented as

x(i) = s(i) + n(i),                                        (1)

where s(i) is the clean speech signal and n(i) is the noise signal. By using the short-time Fourier transform (STFT), in the time-frequency domain we have

X(k,l) = S(k,l) + N(k,l),                                  (2)

where k and l denote the frequency index and the frame index, respectively. For compactness, we will drop both the frequency-bin index k and the frame index l in this section. Fig. 1 shows a diagram of the proposed filtering operation. After the noise estimation, we apply a conventional (original) filter with a multiplicative non-linear gain function G₁ to the amplitude of X, and by incorporating the phase of X we obtain

Ŝ = G₁ X                                                   (3)
  = S + Ñ,                                                 (4)

where we model Ñ as the short-time spectrum of the residual noise ñ in the processed speech.

Figure 1: Diagram of the proposed algorithm: noisy speech x(i) → STFT → noise estimation → original filter G₁ → combination of Ŝ and X with weight α → post-filter G₂ → ISTFT and overlap-add (OLA) → enhanced speech ŝ(i).

Then the speech processed by the conventional filter is compensated by adding the original noisy speech, i.e.,

Y = α X + (1 − α) Ŝ                                        (5)
  = α (S + N) + (1 − α)(S + Ñ)                             (6)
  = S + α N + (1 − α) Ñ                                    (7)
  = [α + (1 − α) G₁] X,                                    (8)

where α is the parameter that controls the amount of noisy speech that is added back (0 ≤ α ≤ 1). This kind of compensation is expected to reduce the speech distortion caused by the conventional filter G₁. In order to reduce the additive noise in the compensated speech Y, we propose a post-filter

G₂ = P_XŜ / P_YY                                           (9)
   = G₁ / [α + (1 − α) G₁]²,                               (10)

which utilizes the cross-spectrum between X and Ŝ and is applied to the new noisy speech Y. Here P_XŜ denotes the cross-power spectrum of X and Ŝ, and P_YY the power spectrum of Y; Eq. (10) follows from Eq. (9) by using Eqs. (3) and (8). As a whole, the proposed filter (gain function), i.e., the product of G₂ and the compensation gain in Eq. (8), can be formulated as

G = G₁ / [α + (1 − α) G₁].                                 (11)

Finally, the enhanced speech ŝ(i) is obtained through the inverse short-time Fourier transform (ISTFT) and overlap-add (OLA) synthesis.

2.2 Analysis of the proposed filter

Treating G as real-valued, we can formulate the error between the spectrum of the clean signal and the estimated one as

ε = E[|G X − S|²]
  = E[|G (S + N) − S|²]
  = (G − 1)² E[|S|²] + G² E[|N|²] + (G − 1) G E[S N* + S* N],   (12)

where E[·] denotes the expectation operator and * indicates the complex conjugate. If we assume that the speech and noise are uncorrelated, the third term in the above equation is negligible. The first term describes the speech distortion, while the second term indicates the noise distortion.
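As an illustration of Eq. (11), the following Python sketch computes the overall gain G from an original gain G₁ and a given α, and checks numerically the behaviour discussed below (G ≥ G₁ whenever G₁ ≤ 1, and G ≤ G₁ whenever G₁ ≥ 1). This is a minimal sketch in the notation of this section, not the author's implementation; the function name is ours.

    import numpy as np

    def proposed_gain(g1, alpha):
        """Overall gain of Eq. (11): G = G1 / (alpha + (1 - alpha) * G1).

        g1    : original (conventional) filter gain, per time-frequency bin
        alpha : trade-off parameter, 0 <= alpha <= 1
        """
        g1 = np.asarray(g1, dtype=float)
        return g1 / (alpha + (1.0 - alpha) * g1)

    if __name__ == "__main__":
        g1 = np.linspace(0.01, 1.5, 150)          # sweep of original gains
        for alpha in (0.1, 0.3, 0.5, 0.7, 0.9):
            g = proposed_gain(g1, alpha)
            # For G1 <= 1 the combined gain is never smaller than G1
            # (less attenuation, hence less speech distortion) ...
            assert np.all(g[g1 <= 1.0] >= g1[g1 <= 1.0] - 1e-12)
            # ... and for G1 >= 1 it is never larger than G1.
            assert np.all(g[g1 >= 1.0] <= g1[g1 >= 1.0] + 1e-12)
        print("gain curves behave as described in Section 2.2")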

As shown in [6], complete masking of both speech and noise distortions cannot be guaranteed, and we must settle for a trade-off between the two (for example, perceptually motivated methods try to mask the noise distortion by allowing a variable speech distortion [6]). When G₁ < 1, our method aims to reduce the speech distortion compared to the original filter, since G is always larger than G₁ (see Fig. 2). When G₁ > 1 (which may arise, e.g., in Ephraim-Malah algorithms), the presented post-filter reduces both speech and noise distortions compared to the original filter.

Figure 2: Parametric gain curves of the resulting filter G as a function of the original filter G₁ for several values of α.

The parameter α provides a soft transition between the original noisy speech (α = 0) and the speech processed with the original filter (α = 1), and thus controls the trade-off between noise reduction and speech distortion. Compared to two-stage Wiener filtering [7], our second stage uses the cross-spectrum and avoids estimating the noise or the SNR, which may introduce additional errors. Moreover, in [7] the Wiener filters are designed in the frequency domain but applied in the time domain using convolution operations. The proposed method implements the two filters consistently in the frequency domain, which avoids re-computing power spectra when switching between the time and frequency domains and improves computational efficiency.

3 Performance Evaluation

For evaluation purposes, utterances from the Aurora-2J database are used (Aurora-2J is the same as Aurora-2, but uttered in Japanese [8]). The speech signals are sampled at 8 kHz and degraded by three types of noise (subway, babble, car) at several SNR levels in 5 dB steps. The spectral analysis is implemented with Hamming windows and a frame shift of 6 ms. A minimum mean-square error log-spectral amplitude (MMSE-LSA) estimator [3] is used as the original filter, as shown in Fig. 1 (other estimators can also be applied). The improved minima controlled recursive averaging (IMCRA) method [9] was used to estimate the noise, and the a priori SNR was calculated using the decision-directed approach. The following three types of speech signals were evaluated:

1. noisy: degraded noisy speech (α = 0);
2. original filter: speech enhanced using the MMSE-LSA estimator (α = 1);
3. presented method: speech enhanced using the proposed algorithm, cascading the original MMSE-LSA estimator with the post-processing for several values of α between 0 and 1 (including 0.5, 0.7, and 0.9).

We compute two objective measures: the segmental SNR and the weighted cepstral distance (WCD).
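For reference, the sketch below shows one common way to compute the segmental SNR between a clean reference and an enhanced signal. The paper does not specify its frame length or clamping thresholds, so the values and the function name here are illustrative assumptions, not the settings used in the experiments.

    import numpy as np

    def segmental_snr(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0):
        """Average per-frame SNR in dB, with each frame clamped to [lo, hi] dB.

        clean, enhanced : 1-D arrays of equal length (time-aligned signals)
        frame_len       : frame length in samples (assumed value, not from the paper)
        """
        n_frames = len(clean) // frame_len
        snrs = []
        for m in range(n_frames):
            s = clean[m * frame_len:(m + 1) * frame_len]
            e = enhanced[m * frame_len:(m + 1) * frame_len]
            noise_energy = np.sum((s - e) ** 2) + 1e-12     # residual error energy
            snr = 10.0 * np.log10(np.sum(s ** 2) / noise_energy + 1e-12)
            snrs.append(np.clip(snr, lo, hi))               # standard clamping
        return float(np.mean(snrs))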

Fig. 3 summarizes the segmental SNR results for the various noise types (averaged over the tested SNR range for each type). As can be seen, the segmental SNRs are significantly improved for all three noise types compared to the noisy speech. The segmental SNR of the proposed algorithm depends on the parameter α: for sufficiently large α, the proposed algorithm performs as well as the original filter (α = 1).

Figure 3: Segmental SNR (dB) as a function of α for subway, babble, and car noise.

In informal listening, compared to the speech processed by the original filter, the speech signals reconstructed using the proposed method are judged to be crisper and to contain fewer musical artifacts, although a little of the original noise is re-introduced. Fig. 4 shows example spectrograms, demonstrating that spectral content missing from the speech processed by the original filter is partly recovered by the proposed post-processing algorithm.

Figure 4: Spectrograms of the digits 77 uttered in Japanese: a) clean speech; b) speech corrupted with car noise; c) enhanced speech obtained using the original filter (MMSE-LSA); d) enhanced speech obtained using the proposed method (α = 0.5).

We also evaluate the enhanced speech using the weighted cepstral distance (WCD) measure, which is defined as

WCD = (1/L) Σ_{l=1..L} Σ_{j=1..p} w_j [c(l,j) − ĉ(l,j)]²,        (13)

where c and ĉ are the cepstral coefficients of the clean signal and of the estimated signal, respectively, p is the order of the cepstral model, w_j is the weight for the j-th coefficient, and L is the number of frames in one utterance. As Fig. 5 shows, in the non-stationary subway and babble noise cases the original filter does not provide a significant improvement in the WCD measure, whereas the incorporation of the proposed post-processing provides a considerable improvement for a suitable value of α.

Figure 5: Weighted cepstral distance (WCD) as a function of α for subway, babble, and car noise.

The two figures above illustrate that, with a suitable value of α, the proposed algorithm can reduce speech distortion while maintaining the noise reduction ability of the original filter.
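As a concrete reading of Eq. (13), the following sketch computes the WCD from two matrices of cepstral coefficients. The weights w_j are not specified in the paper, so the index-proportional (lifter-like) weighting used here is an assumption for illustration only, as is the function name.

    import numpy as np

    def weighted_cepstral_distance(c_clean, c_est, weights=None):
        """WCD per Eq. (13): mean over frames of the weighted squared cepstral error.

        c_clean, c_est : arrays of shape (L, p) with cepstral coefficients of the
                         clean and estimated signals (frames x cepstral order)
        weights        : length-p weight vector w_j; if None, a simple
                         index-proportional lifter is assumed (illustrative only)
        """
        c_clean = np.asarray(c_clean, dtype=float)
        c_est = np.asarray(c_est, dtype=float)
        L, p = c_clean.shape
        if weights is None:
            weights = np.arange(1, p + 1, dtype=float)   # assumed lifter-like weights
        err = (c_clean - c_est) ** 2                     # squared error per frame/coefficient
        return float(np.mean(err @ weights))             # average over the L frames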

In order to further evaluate the proposed algorithm, we also performed speech recognition experiments using realistic data. The CIAIR in-car speech corpus [10] was used. The test data were based on 5 isolated word sets collected under 15 real driving conditions (listed in Table 1), using a microphone set at the visor position near the driver.

Table 1: 15 driving conditions (3 driving environments × 5 in-car states).
Driving environment: idling; city driving; expressway driving.
In-car state: normal; CD player on; air-conditioner (AC) on at low level; air-conditioner (AC) on at high level; window (near driver) open.

Triphone hidden Markov models (HMMs) with Gaussian mixtures per state were used for acoustic modelling. They were trained on phonetically balanced sentences collected in the idling-normal and city-normal conditions. The feature vector consisted of CMN-MFCC coefficients, their derivatives, and log energy. For comparison, we also performed recognition experiments using the ETSI advanced front-end [11]; the acoustic model used in those experiments was trained on the training data processed with the ETSI advanced front-end.

Fig. 6 shows the recognition performance averaged over the 15 driving conditions (two values of α, one of them 0.5, were used for the proposed method). We found that all the enhancement methods outperformed the original noisy speech. The ETSI advanced front-end marginally outperformed the original filter (MMSE-LSA), while the proposed method achieved a relative word error rate (WER) reduction of .5% compared to the ETSI advanced front-end.

Figure 6: Recognition performance (word accuracy, %) for the noisy speech, the original filter, the proposed method, and the ETSI advanced front-end.

4 Summary

In this paper, we have proposed a post-processing algorithm for improving the quality of speech processed by a conventional filter. Our experiments demonstrated that the proposed post-processing with a suitable value of α can reduce the speech distortion caused by the original filter. The proposed algorithm is universal and may be applied to different types of conventional speech enhancement filters. Since α should ideally vary across time and frequency, the adaptive optimization of α is worth exploring and will be the direction of our future work. In addition, the proposed method is not effective during speech absence, and incorporating speech presence uncertainty may achieve better performance.

Acknowledgements

This work was supported by the European Union 6th FWP IST Integrated Project AMIDA (Augmented Multi-party Interaction with Distant Access) and by the Swiss National Science Foundation through the National Center of Competence in Research (NCCR) on Interactive Multimodal Information Management (IM2).

References

[1] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-27, no. 2, pp. 113-120, 1979.

[2] R. J. McAulay and M. L. Malpass, "Speech enhancement using a soft-decision noise suppression filter," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-28, no. 2, pp. 137-145, 1980.

[3] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-33, no. 2, pp. 443-445, 1985.

[4] G. A. Studebaker and I. Hochberg (Eds.), Acoustical Factors Affecting Hearing Aid Performance, second edition, Boston: Allyn and Bacon, 1993.

[5] Y. Hu and P. C. Loizou, "A perceptually motivated approach for speech enhancement," IEEE Trans. Speech and Audio Processing, vol. 11, no. 5, pp. 457-465, 2003.

[6] S. Gustafsson, P. Jax, and P. Vary, "A novel psychoacoustically motivated audio enhancement algorithm preserving background noise characteristics," in Proc. ICASSP, pp. 397-400, 1998.

[7] A. Agarwal and Y. M. Cheng, "Two-stage mel-warped Wiener filter for robust speech recognition," in Proc. IEEE ASRU Workshop, pp. 67-70, 1999.

[8] S. Nakamura, K. Takeda, et al., "AURORA-2J: An evaluation framework for Japanese noisy speech recognition," IEICE Trans. Information and Systems, vol. E88-D, no. 3, pp. 535-544, 2005.

[9] I. Cohen, "Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging," IEEE Trans. Speech and Audio Processing, vol. 11, no. 5, pp. 466-475, 2003.

[10] N. Kawaguchi, S. Matsubara, H. Iwa, S. Kajita, K. Takeda, F. Itakura, and Y. Inagaki, "Construction of speech corpus in moving car environment," in Proc. ICSLP, 2000.

[11] "Speech processing, transmission and quality aspects (STQ); distributed speech recognition; advanced front-end feature extraction algorithm; compression algorithms," ETSI ES 202 050.