SPEECH ENHANCEMENT WITH KALMAN FILTERING THE SHORT-TIME DFT TRAJECTORIES OF NOISE AND SPEECH

14th European Signal Processing Conference (EUSIPCO 2006), Florence, Italy, September 4-8, 2006, copyright by EURASIP

Esfandiar Zavarehei, Saeed Vaseghi, and Qin Yan
School of Design and Engineering, Brunel University, Uxbridge, UB8 3PH, London, UK
phone: + (44) 01895 74000, fax: + (44) 01895 3806, email: esfandiar.zavarehei@brunel.ac.uk
web: http://dea.brunel.ac.uk/cmsp/home_esfandiar/home.htm

ABSTRACT
This paper presents a time-frequency estimator for enhancement of noisy speech in the DFT domain. The time-varying trajectories of the DFT of speech and noise in each channel are modelled by low order autoregressive processes incorporated in the state equation of Kalman filters. The parameters of the Kalman filters are estimated recursively from the signal and noise in DFT channels. The issue of convergence of the Kalman filters to noise statistics during the noise-dominated periods is addressed and a method is incorporated for restarting of Kalman filters after long periods of noise-dominated activity in each DFT channel. The performance of the proposed method is compared with cases where the noise trajectories are not explicitly modelled. The sensitivity of the method to voice activity detector is evaluated. Evaluations show that the proposed method results in substantial improvement in perceived quality of speech.

1. INTRODUCTION
Speech enhancement improves the quality and intelligibility of voice communication for a range of applications including mobile phones, teleconference systems, hearing aids, voice coders and automatic speech recognition. Among different solutions proposed for enhancement of noisy speech, restoration of short-time speech spectrum has been extensively studied [1][2]. This approach is normally based on estimation of the short time spectral amplitude (STSA) of the clean speech using an estimate of the signal-to-noise ratio (SNR) at each frequency. The effect of phase distortion is assumed to be inaudible.
An alternative to estimation of the STSA is the estimation of the real and imaginary components of the DFT of the clean speech. The MMSE estimation of the DFT components with Gaussian priors leads to the well-known Wiener filter solution [3], while the MMSE estimation of the STSA within the same set of Gaussian assumptions results in Ephraim's noise suppression method [1]. In recent years Martin has proposed the use of Gamma and Laplacian distributions for modelling the real and imaginary components of the DFT of speech [3]. Speech enhancement methods often assume that the spectral samples are independent identically distributed (IID) samples across frequency and time dimensions. However, there seems to be an apparent contradiction [4]: these same methods that start with the IID assumption often also use the assumption of the dependency of successive frames for the calculation and smoothing of some key speech parameters such as the SNRs [1][3][5][6].

The application of Kalman filter for speech enhancement has been extensively explored during the past few decades [7][8][9]. These methods are mostly concerned with the estimation of the speech signal in presence of noise using an AR model of speech for each frame. However, the interframe correlation of speech signals, which has been shown to be of great importance, is usually overlooked in most of these methods. The modelling and utilization of the time-varying trajectory of speech and noise spectrum is the main focus of this paper. In this paper, the temporal trajectory models of the DFT of speech and noise are used in a more rigorous mathematical framework for a more reliable estimation of speech spectra. The use of Gaussian priors lends itself to application of Kalman filter for modelling the temporal trajectories of the DFT of speech. A set of AR models are incorporated in Kalman filters for adaptive estimation and modelling of the temporal trajectories of the DFT of the speech and noise signals.

The rest of this paper is organized as follows.
Section 2 discusses the modelling of the samples of the temporal trajectories of DFT components. In Section 3 the Kalman estimator of DFT trajectories is introduced. In Section 4 the empirical issues and the parameter estimation of the new estimator are discussed. In Section 5 the evaluation results are compared with other methods of speech enhancement. Conclusions are drawn in Section 6.

2. MODELLING DFT TRAJECTORIES
In this section the temporal dependency and predictability of the trajectory of the DFT components are examined. The level of correlation between successive temporal samples of DFT components varies for different frequencies as well as different phonemes (i.e. along time and frequency). Moreover, the probability distributions of DFT components are strongly dependent on the frequency channel and the phoneme under study. Figure 1 illustrates the distribution of DFT components of channel 6 (1000 Hz) for phoneme /ah/. The data is obtained from 130 sentences spoken by a male speaker selected randomly from the Wall Street Journal (WSJ) database. It is evident from Figure 1 that the peak of the histogram is modelled better with a Gamma distribution while the sides tend to fit a Gaussian distribution.

Figure 1. Normalized histogram of ST-DFT components for channel 6 (1 kHz), phoneme /ah/. Legend: Histogram; Gaussian (SKLD=0.8); Laplacian (SKLD=0.4); Gamma (SKLD=0.36).

Table 1 shows the average symmetric Kullback-Leibler distance (SKLD) [10] between histograms and parametric distributions. These results show that, on average, Gamma distribution models the distribution of DFTs of speech better than Gaussian distribution. This is observed from the SKLD of speech with parametric distributions. It is also observed that most noise types have a reasonably low SKLD with the Gaussian distribution. However, as often, a compromise between the complexity and the mathematical tractability of the model suggests the use of Gaussian distribution and Kalman filters for modelling the temporal trajectories of DFT. The real part of the DFT of clean speech, $S_r(n)$, can be modelled using an AR process:

$$S_r(n) = \sum_{k=1}^{N} a_k(n)\, S_r(n-k) + e_r(n) \tag{1}$$

where $S_r(n)$ is the real part of the DFT of clean speech at frame $n$ of an arbitrary frequency channel, $a_k(n)$ is the kth AR coefficient at the nth frame of the same frequency channel, $e_r(n)$ is the corresponding estimation error and $N$ is the model order. Moreover, it is assumed that $S_r(n)$ is a stationary process within the prediction period. Assuming Gaussian distributions for DFT components, the MMSE linear predictor (LP) coefficients of Equation (1) can be obtained using Yule-Walker equation:

$$\mathbf{a}(n) = \left(\mathbf{R}_s(n)\right)^{-1} \mathbf{r}_s(n) \tag{2}$$

where $\mathbf{R}_s(n)$ and $\mathbf{r}_s(n)$ are the autocorrelation matrix and vector of the real part of speech DFT, $\mathbf{S}_r(n) = [S_r(n), \ldots, S_r(n-L+1)]$, respectively, $L$ is the number of samples used to obtain the autocorrelation and $\mathbf{a}(n)$ is the AR coefficient vector at frame $n$.
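As an illustration of Equation (2), the following is a minimal sketch of how AR coefficients might be estimated from a segment of one real-valued DFT trajectory. The function name `yule_walker_ar`, the use of biased autocorrelation estimates and the small diagonal loading term are assumptions made for this example; the paper does not specify these implementation details.

```python
import numpy as np

def yule_walker_ar(traj, order):
    """Estimate AR coefficients a_1..a_N of Eq. (1) by solving Eq. (2).

    traj  : 1-D array of past trajectory samples (e.g. the real part of
            one DFT channel over L frames); hypothetical input name.
    order : AR model order N.
    Returns (a, err_var): coefficient vector and implied error variance.
    """
    x = np.asarray(traj, dtype=float)
    L = x.size
    # Biased autocorrelation estimates r(0), ..., r(order).
    r = np.array([np.dot(x[:L - k], x[k:]) / L for k in range(order + 1)])
    # Toeplitz autocorrelation matrix with entries R[i, j] = r(|i - j|).
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    R = r[lags]
    # Tiny diagonal loading (an assumption, not from the paper) so that
    # the solve does not fail when the segment is almost all zeros.
    R = R + 1e-9 * max(r[0], 1e-12) * np.eye(order)
    a = np.linalg.solve(R, r[1:])
    # Prediction-error variance implied by the fit: r(0) - a^T r(1..N).
    err_var = r[0] - np.dot(a, r[1:])
    return a, err_var

if __name__ == "__main__":
    # Synthetic AR(2) trajectory used only to exercise the function.
    rng = np.random.default_rng(0)
    e = rng.standard_normal(20000)
    x = np.zeros(20000)
    for n in range(2, x.size):
        x[n] = 1.2 * x[n - 1] - 0.5 * x[n - 2] + e[n]
    print(yule_walker_ar(x, 2))
```

With the short windows discussed later in the paper (a handful of frames per channel), estimates of this kind are noisy, which is one reason a regularized solve is used in the sketch.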
A similar equation stands for the imaginary component of the DFT. The speech frame length, overlap size and the LP order should be carefully chosen to comply with the stationarity assumption of Equation (1), that is, between say 20-40 ms. Figure 2 illustrates the correlation coefficients between delayed samples of the DFT of noise and speech signals, averaged over all frequency channels. Note however, that while the correlation coefficient may be negative, it is the absolute value which shows the level of correlation. It is evident that although, due to the frame overlap, there is a correlation between successive samples of DFT of noise, this does not vary much with the noise type and is less than that of speech. The shift-size used in Figure 2 is 5ms and the frame size is 25ms which experimentally proved to result in good noise reduction.

Table 1. Average SKLD between the histograms and different parametric distributions for speech (averaged over all phonemes/frequency channels for 130 labeled sentences spoken by a male speaker) and different noise types

| Distribution | Gaussian | Laplacian | Gamma |
|---|---|---|---|
| Speech | 0.81 | 0.6 | 0.56 |
| Car noise | 0.04 | 0.10 | 0.85 |
| Train noise | 0.15 | 0.05 | 0.22 |
| Babble noise | 0.69 | 0.51 | 0.46 |
| Helicopter fly-by noise | 0.1 | 0.15 | 0.59 |
| White Gaussian | 0.01 | 0.22 | 0.83 |

Figure 2. Averaged absolute correlation in ST-DFT trajectories (averaged correlation coefficient versus time lag in units of 5ms; curves for car noise, train noise, white noise and speech).

3. KALMAN DFT TRAJECTORY RESTORATION
This section presents the formulation of Kalman filters for restoration of DFT trajectories. It is assumed that the clean speech signal $s(t)$ is contaminated by the additive background noise $d(t)$ uncorrelated with the speech signal. The noisy speech signal $x(t)$ is modelled as:

$$x(t) = s(t) + d(t) \tag{3}$$

where $t$ denotes time. For each frequency channel Equation (3) is rewritten in DFT domain as:

$$X_r(n) + jX_i(n) = \left(S_r(n) + D_r(n)\right) + j\left(S_i(n) + D_i(n)\right) \tag{4}$$

where the subscripts $r$ and $i$ represent the real and imaginary parts of DFT respectively and $n$ denotes frame index. It is assumed that the real and imaginary parts of the DFT are independent and have Gaussian distributions.
The independency assumption is verified from a study of the scatter plots of the real and imaginary parts of the DFT coefficients of clean speech [3][11]. The real part of the DFT of noise, $D_r(n)$, is modelled using an AR model as:

$$D_r(n) = \sum_{k=1}^{M} b_k(n)\, D_r(n-k) + g_r(n) \tag{5}$$

where $D_r(n)$ is the real part of the DFT of noise at frame $n$ of an arbitrary frequency channel, $b_k(n)$ is the kth AR coefficient at the nth frame of the same frequency channel, $g_r(n)$ is the corresponding estimation error which has a variance of $\sigma_g^2(n)$ and $M$ is the model order. Following straightforward algebraic manipulation, Equations (1), (4) and (5) for the real part may be represented in canonical form:

$$\mathbf{X}_r(n) = \mathbf{A}(n)\, \mathbf{X}_r(n-1) + \mathbf{G}_c \mathbf{E}(n) \tag{6}$$

$$X_r(n) = \mathbf{H}_c \mathbf{X}_r(n) \tag{7}$$

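where the state vector $\mathbf{X}_r(n)$ is defined as:

$$\mathbf{X}_r(n) = \begin{bmatrix} \mathbf{S}(n) \\ \mathbf{D}(n) \end{bmatrix} \tag{8}$$

$$\mathbf{S}(n) = \left[ S_r(n-N+1) \;\cdots\; S_r(n) \right]^T \tag{9}$$

$$\mathbf{D}(n) = \left[ D_r(n-M+1) \;\cdots\; D_r(n) \right]^T \tag{10}$$

where $\mathbf{S}(n)$ and $\mathbf{D}(n)$ are speech and noise state vectors respectively. The transition matrix $\mathbf{A}(n)$ is given by:

$$\mathbf{A}(n) = \begin{bmatrix} \mathbf{F}(n) & \mathbf{0} \\ \mathbf{0} & \mathbf{B}(n) \end{bmatrix} \tag{11}$$

$\mathbf{F}(n)$ and $\mathbf{B}(n)$ are speech and noise transition matrices respectively:

$$\mathbf{F}(n) = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ a_N(n) & a_{N-1}(n) & a_{N-2}(n) & \cdots & a_1(n) \end{bmatrix} \tag{12}$$

$$\mathbf{B}(n) = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ b_M(n) & b_{M-1}(n) & b_{M-2}(n) & \cdots & b_1(n) \end{bmatrix} \tag{13}$$

$\mathbf{E}(n)$ is the AR error vector of noise and speech and $\mathbf{H}_c$ and $\mathbf{G}_c$ are constant matrices defined below:

$$\mathbf{E}(n) = \left[ e_r(n) \;\; g_r(n) \right]^T \tag{14}$$

$$\mathbf{G}_c = \begin{bmatrix} \mathbf{U}(N) & \mathbf{0} \\ \mathbf{0} & \mathbf{U}(M) \end{bmatrix} \tag{15}$$

$$\mathbf{H}_c = \left[ \mathbf{U}(N)^T \;\; \mathbf{U}(M)^T \right] \tag{16}$$

and $\mathbf{U}(N)$ is an $N \times 1$ vector defined as:

$$\mathbf{U}(N) = [\,\underbrace{0 \;\cdots\; 0}_{N-1} \;\; 1\,]^T \tag{17}$$

A prediction of the state vector is obtained from the previous state vector using the transition matrix $\mathbf{A}(n)$ as:

$$\hat{\mathbf{X}}_r(n|n-1) = E\left\{ \mathbf{X}_r(n) \,\middle|\, \hat{\mathbf{X}}_r(n-1) \right\} = \mathbf{A}(n)\, \hat{\mathbf{X}}_r(n-1) \tag{18}$$

where $\hat{\mathbf{X}}_r(n-1)$ is the estimate of $\mathbf{X}_r(n-1)$. As $e_r(n)$ and $g_r(n)$ are orthogonal to $\hat{\mathbf{X}}_r(n-1)$ and each other, the prediction error covariance matrix is calculated as:

$$\mathbf{P}_c(n|n-1) = \mathbf{A}(n)\, \mathbf{P}_c(n-1)\, \mathbf{A}(n)^T + \mathbf{G}_c \mathbf{\Lambda}(n) \mathbf{G}_c^T \tag{19}$$

$\mathbf{\Lambda}(n)$ is a matrix defined as:

$$\mathbf{\Lambda}(n) = \begin{bmatrix} \sigma_e^2(n) & 0 \\ 0 & \sigma_g^2(n) \end{bmatrix} \tag{20}$$

$\mathbf{P}_c(n-1)$ is the state estimation error covariance matrix. Note that, since according to Equation (7) there is no noise added to $\mathbf{H}_c \mathbf{X}_r(n)$, the innovation here is the difference between the predicted noisy signal and the observed noisy signal. Incorporating the innovation in the current noisy observation, the optimum estimate of the state vector is calculated as:

$$\hat{\mathbf{X}}_r(n) = \hat{\mathbf{X}}_r(n|n-1) + \mathbf{K}_c(n) \left( X_r(n) - \mathbf{H}_c \hat{\mathbf{X}}_r(n|n-1) \right) \tag{21}$$

where $\mathbf{K}_c(n)$ is the Kalman gain vector:

$$\mathbf{K}_c(n) = \mathbf{P}_c(n|n-1)\, \mathbf{H}_c^T \left( \mathbf{H}_c \mathbf{P}_c(n|n-1) \mathbf{H}_c^T \right)^{-1} \tag{22}$$

where $\mathbf{H}_c \mathbf{P}_c(n|n-1) \mathbf{H}_c^T$ is a scalar value. The estimation error covariance of this estimate, $\mathbf{P}_c(n)$, is obtained as:

$$\mathbf{P}_c(n) = \left[ \mathbf{I} - \mathbf{K}_c(n) \mathbf{H}_c \right] \mathbf{P}_c(n|n-1) \tag{23}$$

The same set of equations holds for the imaginary component of all frequency channels with nonzero imaginary parts. The estimated clean speech DFT is the by-product of the estimated state vector $\hat{\mathbf{X}}_r(n)$ in Equation (21).

4. PARAMETER ESTIMATION
As the autocorrelation of the DFT trajectories of clean speech is not available for estimation of AR parameters in Equation (2), the autocorrelation vector obtained from the past restored samples is used. That is:

$$\hat{\mathbf{a}}(n) = \left( \hat{\mathbf{R}}_s(n-1) \right)^{-1} \hat{\mathbf{r}}_s(n-1) \tag{24}$$

The autocorrelation vector and matrix are calculated from the past L=8 samples (with a shift-size of 5ms this is equivalent to 40ms). An implementation issue arises from the feedback of restored speech for calculation of AR parameters using Equation (24). During long (typically >200ms) noise-only periods, where the variance of the noisy signal is equal to that of noise, the recursive solution given by Equations (19) and (22) results in convergence of the output of Equation (21) towards zero, which consequently decreases the variance of prediction error, $\sigma_e^2$, towards zero. In other words, the Kalman filter's speech output converges to zero during noise-only periods. At the beginning of the speech signal, just after a long noise-only period, due to the suppression of noise and the absence of speech the prediction of the DFT trajectories will be very small with a consequently small prediction error variance, $\sigma_e^2$, which results in a high weight for the prediction of the state vector (very small Kalman gain) and zeroing of the output speech signal. In order to prevent the consequent zeroing of speech following a long period of speech inactivity, the value of $\sigma_e^2$ needs to be revived from zero at the beginning of speech active periods. This is achieved by ensuring that values of $\sigma_e^2$ will not be less than a dynamic threshold which is a fraction of the noisy signal energy at each time-frequency bin. That is:

$$\hat{\sigma}_e^2(n) = \max\left( \sigma_e^2(n),\; \alpha \left| X(n) \right|^2 \right) \tag{25}$$

This limits the prediction error variance to a small portion of the instantaneous power spectrum of noisy speech. Equation (25) implies that the DFT trajectories can be only predicted with a limited precision, i.e. the prediction error variance cannot be smaller than a threshold proportional to the variance of the noisy speech.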
Very small values for α proved to be sufficient for reviving the converged trajectories of $\sigma_e^2$ and the signal at the beginning of speech activity (e.g. α = 0.07).
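The recursion of Section 3, together with the variance floor described above, can be sketched for a single real-valued trajectory as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the names `companion` and `kalman_step` are hypothetical, the AR coefficients and error variances are taken as given inputs, and when `obs_power` is not supplied the squared real observation stands in for the noisy bin power used in the floor.

```python
import numpy as np

def companion(coeffs):
    """Transition matrix in the companion form described for F(n) and B(n):
    shifted identity rows, with last row [c_p, ..., c_1]."""
    p = len(coeffs)
    F = np.zeros((p, p))
    F[:-1, 1:] = np.eye(p - 1)
    F[-1, :] = np.asarray(coeffs, dtype=float)[::-1]
    return F

def kalman_step(x_prev, P_prev, obs, a, b, var_e, var_g,
                alpha=0.07, obs_power=None):
    """One predict/update step for the joint speech+noise state.

    x_prev, P_prev : previous state estimate and its error covariance
    obs            : current noisy observation (real or imaginary part)
    a, b           : speech and noise AR coefficients [c_1, ..., c_p]
    var_e, var_g   : speech and noise prediction-error variances
    Returns (x_new, P_new, speech_estimate).
    """
    N, M = len(a), len(b)
    A = np.zeros((N + M, N + M))
    A[:N, :N] = companion(a)
    A[N:, N:] = companion(b)
    G = np.zeros((N + M, 2))            # routes e(n), g(n) into the state
    G[N - 1, 0] = 1.0
    G[N + M - 1, 1] = 1.0
    H = np.zeros(N + M)                 # observation = speech + noise
    H[N - 1] = 1.0
    H[N + M - 1] = 1.0

    # Variance floor: keep var_e above a fraction of the noisy power.
    if obs_power is None:
        obs_power = obs ** 2            # assumption for this sketch
    var_e = max(var_e, alpha * obs_power)

    x_pred = A @ x_prev                                   # state prediction
    P_pred = A @ P_prev @ A.T + G @ np.diag([var_e, var_g]) @ G.T
    s = H @ P_pred @ H                                    # scalar innovation variance
    if s <= 0.0:                                          # degenerate case: skip update
        return x_pred, P_pred, x_pred[N - 1]
    K = P_pred @ H / s                                    # Kalman gain
    x_new = x_pred + K * (obs - H @ x_pred)               # state update
    P_new = (np.eye(N + M) - np.outer(K, H)) @ P_pred     # covariance update
    return x_new, P_new, x_new[N - 1]
```

Because the observation model contains no additive measurement noise, the updated speech and noise components always sum to the observation. Starting from a fully converged state (zero estimate and zero covariance), a call with `alpha=0.0` returns a zero speech estimate, whereas the default floor gives the speech component a nonzero share of the observation, which is the restart behaviour the section describes.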

In order to obtain the AR models of the DFT trajectories of noise for each frequency channel, the autocorrelations of the DFT trajectories are obtained and smoothed during the noise-only periods. These autocorrelation vectors are obtained using L samples of the real and imaginary components separately and then averaged for each time step. That is, the same AR model is used for the real and imaginary components of each channel of noise.

5. EVALUATION RESULTS
The evaluation of the performance of DFT-Kalman filter with correlated noise model (DFTKCN) described in Section 3, for enhancement of speech signals corrupted by background noise, is carried out using subjective and objective measures. Various types and levels of noise are added to the speech signals selected from the WSJ speech database. The noisy signals are segmented using 25ms Hamming windows with a shift size of 5ms. The car noise signal is recorded by our colleagues in a 3-series BMW at 70 Mph in a rainy day and the train noise is recorded in a moving train. The parameters used in Kalman method are: Autocorrelation length L=8, LP orders N=4 and M=2 and α=0.07.

5.1. Mean Opinion Score (MOS)
A set of twenty sample sentences are drawn from WSJ database and contaminated by car noise and train noise at two different SNRs, 0dB and 10dB. The resulting noisy speech sentences are then de-noised using four different methods: (i) parametric spectral subtraction (PSS) [2], (ii) MMSE log-STSA [5], (iii) DFT-Kalman filter with uncorrelated noise model [12] (DFTKUN) and (iv) DFTKCN. Note that in the first two methods decision-directed method is used for tracking the a priori SNR [1]. Ten trained listeners were asked to score the quality of the resulting output signals from 1 to 5, based on the perceptual ease of understanding (intelligibility) and the comfort of listening (less annoying noise). The mean opinion score results are presented in Table 2. The results of Table 2 show that the Kalman filter outputs are preferred by the listeners.
As often, the extent of validity of these results is limited by the number of listeners and test sentences used.

Table 2. Mean opinion score results

| SNR | Noise | DFTKUN | DFTKCN | MMSE | PSS | Wiener |
|---|---|---|---|---|---|---|
| 0dB | Car | 3.7 | 3.8 | 3.5 | 3.4 | 3.2 |
| 0dB | Train | 2.7 | 2.9 | 2.0 | 2.0 | 2.1 |
| 10dB | Car | 4.5 | 4.7 | 4.6 | 4.4 | 4.2 |
| 10dB | Train | 3.7 | 3.9 | 3.7 | 3.3 | 3.5 |

5.2. Objective Evaluation
From a number of different speech quality and distortion measures applied to the restored sample speech sentences of Section 5.1, six are listed in Table 3. The correlation coefficient of each distortion measure with MOS was calculated and the three most correlated distortion measures were chosen for further objective evaluation of the performance of different methods. Table 3 summarizes the correlation coefficients between MOS and six of the most popular objective measures obtained from this experiment. Performance of the DFTKCN in presence of car and train noise is evaluated using Itakura-Saito distance (ISD), Log-Likelihood ratio (LLR) [13] and Perceptual Evaluation of Speech Quality (PESQ) scores. One hundred sentences spoken by 20 speakers (10 Females and 10 Males) are randomly selected from WSJ database and contaminated by train and car noise at different noise levels. These noisy signals are then de-noised using PSS, MMSE, DFTKUN and DFTKCN methods and their distortion measures are obtained. The averaged results of the distortion measures are summarized in Table 4.

Table 3. The correlation coefficient ρ of MOS and objective evaluation results

| | PESQ | LLR | ISD | Kullback | SegSNR | SNR |
|---|---|---|---|---|---|---|
| ρ | 0.86 | -0.69 | -0.61 | -0.45 | 0.4 | 0.07 |

5.3. Sensitivity to VAD
As mentioned in the previous section, the estimator is revived at the beginning of speech signal after long periods of noise only signal. Moreover the noise statistics are estimated and averaged during noise-only periods. Many sophisticated methods have been proposed in the literature for robust estimation of noise statistics which track/detect noise nonstationarity.
Although these methods provide better estimates of the noise statistics, in this work a simple voice-activity-detector (VAD) based method is used to keep the focus on the de-noising method used to estimate the speech signal. It is assumed that during the first 200ms, the signal contains no speech. This is consistent with the database used in the experiments. This part of the signal is used to derive a noise model including the averaged spectrum of the signal, its variance, and the AR models of the DFT progressions. After this initialization, the spectrum of each frame is compared to that of noise and if the energy of their difference is less than 3dB the frame is flagged as noise. After 16 successive noise frames the VAD starts updating the noise model until a non-noise frame is detected. This procedure for noise estimation has two types of error: (i) the frames might be misclassified and (ii) the method cannot detect/track the statistics of fast changing non-stationary noises. The sensitivity of the methods to these errors is evaluated and the results are shown in Table 5. The results of Table 5 show that, generally, there is not much difference between the PESQ scores of the enhanced speech signals when the system is provided with the speech activity labels (B). In train noise, which is more nonstationary, we can see that the performance of the DFTKUN degrades if the exact noise frames are specified while the performance of the DFTKCN is improved. We believe that since DFTKUN only uses the variance of the noise (and not the AR model of the DFT trajectories), it would perform better if the abrupt changes of the train noise, which are the most likely ones to be misclassified, are not used for estimation of the variances. On the other hand, if these frames are used in estimation of the AR models of the noise for DFTKCN, it would help the system to decompose the noise and speech better by tracking the noise trajectories using more accurate models. Furthermore, in car noise, which is a more stationary noise, the performance of the system is slightly improved by providing the speech-activity labels to the system.

Table 4. PESQ, LLR and ISD scores for various noise levels and types, obtained using different de-noising methods (columns give the SNR in dB for each noise type)

| Measure | Method | Car -5 | Car 0 | Car 5 | Car 10 | Train -5 | Train 0 | Train 5 | Train 10 |
|---|---|---|---|---|---|---|---|---|---|
| PESQ | DFTKUN | 2.41 | 2.80 | 3.13 | 3.43 | 1.81 | 2.22 | 2.6 | 2.98 |
| PESQ | DFTKCN | 2.51 | 2.90 | 3.0 | 3.49 | 1.90 | 2.30 | 2.69 | 3.05 |
| PESQ | MMSE | 2.39 | 2.75 | 3.10 | 3.38 | 1.78 | 2.0 | 2.58 | 2.89 |
| PESQ | PSS | 2.44 | 2.79 | 3.08 | 3.8 | 1.65 | 2.1 | 2.51 | 2.84 |
| LLR | DFTKUN | 1.59 | 1.3 | 0.95 | 0.75 | 2.22 | 1.74 | 1.35 | 1.03 |
| LLR | DFTKCN | 1.5 | 1.18 | 0.90 | 0.68 | 2.09 | 1.68 | 1.31 | 1.00 |
| LLR | MMSE | 1.60 | 1.6 | 1.01 | 0.91 | 2.53 | 2.07 | 1.61 | 1.19 |
| LLR | PSS | 1.59 | 1.5 | 1.01 | 0.87 | 2.64 | 2.17 | 1.67 | 1.3 |
| ISD | DFTKUN | 1.08 | 0.78 | 0.58 | 0.44 | 2.63 | 1.8 | 1.0 | 0.81 |
| ISD | DFTKCN | 1.15 | 0.85 | 0.64 | 0.49 | 2.56 | 1.75 | 1.17 | 0.80 |
| ISD | MMSE | 1.7 | 0.93 | 0.71 | 0.54 | 3.07 | 2.33 | 1.61 | 1.08 |
| ISD | PSS | 1.41 | 1.04 | 0.77 | 0.59 | 3.43 | 2.71 | 1.89 | 1.19 |

Table 5. PESQ scores of enhanced speech signals when (A) VAD is used to detect noise frames and (B) the correct label of each frame (noise/speech and noise) is provided to the system

| Noise | Method | Labels | -5 dB | 0 dB | 5 dB | 10 dB |
|---|---|---|---|---|---|---|
| Car | DFTKUN | A | 2.41 | 2.80 | 3.13 | 3.43 |
| Car | DFTKUN | B | 2.41 | 2.81 | 3.15 | 3.45 |
| Car | DFTKCN | A | 2.51 | 2.90 | 3.0 | 3.49 |
| Car | DFTKCN | B | 2.55 | 2.9 | 3.0 | 3.50 |
| Car | Misclassification % | | 7.07 | 5.1 | 3.80 | 2.76 |
| Train | DFTKUN | A | 1.81 | 2.22 | 2.6 | 2.98 |
| Train | DFTKUN | B | 1.79 | 2.0 | 2.61 | 2.98 |
| Train | DFTKCN | A | 1.90 | 2.30 | 2.69 | 3.05 |
| Train | DFTKCN | B | 1.98 | 2.33 | 2.7 | 3.06 |
| Train | Misclassification % | | 8.8 | 7.07 | 5.75 | 4.17 |

5.4. Discussion
Informal listening tests and comparisons of the quality of the output of the DFTKUN and DFTKCN methods with the MMSE log-STSA method reveal some major differences. The level of residual noise of DFT-Kalman methods is much less than that of MMSE. However, DFTKUN slightly distorts the low energy portions of speech signal spectra as a result of the convergence of signal to small values.
Due to this effect, at lower SNRs, the harmonics of the speech are well restored while the non-harmonic portions of the speech spectrum are relatively suppressed. This effect is mitigated in DFTKCN, while maintaining a similar or lower level of residual noise. Moreover, DFTKCN results in much less echo level than DFTKUN method, producing a more natural-sounding speech signal. While the nature of the residual noise in spectral subtraction is musical (short bursts of narrowband energy), the residual noise of DFT-Kalman methods seems to have the same perceptual characteristic of the original noise.

6. CONCLUSION
A method is proposed for the enhancement of speech signals corrupted with background noise. The overall performance of the proposed method is shown to outperform MMSE log-STSA estimator and parametric spectral subtraction. Listening tests show that the residual noise of DFT-Kalman methods is not composed of annoying narrowband noise bursts, musical tones. Informal experiments show that if the AR models of the DFT trajectories of clean speech are provided to the system (even in the case of using averaged models for the noise obtained from noise-only periods), the DFTKCN results in exceptionally superb quality of the de-noised speech. This suggests that the use of more sophisticated methods for estimation of the speech AR models is expected to result in further gain in the performance of the DFT-Kalman methods. The application of Expectation-Maximization (EM) methods for this purpose is being studied [8].

REFERENCES
[1] Ephraim, Y., Malah, D., "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-32, no. 6, pp. 1109-1121, Dec. 1984.
[2] Sim, B., Tong, Y., Chang, J., Tan, C., "A Parametric Formulation of the Generalized Spectral Subtraction Method," IEEE Trans. on Speech and Audio Processing, vol. 6, no. 4, pp. 328-337, July 1998.
[3] Martin, R., "Speech Enhancement Using MMSE Short Time Spectral Estimation with Gamma Distributed Speech Priors," IEEE ICASSP'02, Orlando, Florida, May 2002.
[4] Cohen, I., "On the Decision-Directed Estimation Approach of Ephraim and Malah," ICASSP 04, Montreal, Canada, 17-21 May 2004, pp. I-293-296.
[5] Ephraim, Y., Malah, D., "Speech enhancement using a minimum mean square error log-spectral amplitude estimator," IEEE Trans. on Acoust., Speech, Signal Processing, vol. ASSP-33, pp. 443-445, Apr. 1985.
[6] Cohen, I., "Relaxed Statistical Model for Speech Enhancement and a Priori SNR Estimation," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, Part 2, pp. 870-881, Sept. 2005.
[7] Paliwal, K.K., Basu, A., "A speech enhancement method based on Kalman filtering," in Proc. Int. Conf. Acoust., Speech, Signal Processing, 1987, pp. 177-180.
[8] Gannot, S., Burshtein, D., Weinstein, E., "Iterative and Sequential Kalman Filter-Based Speech Enhancement Algorithms," IEEE Trans. on Speech and Audio Proc., vol. 6, no. 4, pp. 373-385, Jul. 1998.
[9] Ma, N., Bouchard, M., Goubran, R.A., "Speech Enhancement Using a Masking Threshold Constrained Kalman Filter and Its Heuristic Implementations," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 1, pp. 19-32, Jan. 2006.
[10] Kullback, S., Leibler, R.A., "On information and sufficiency," Ann. Math. Stat., vol. 22, pp. 79-86, 1951.
[11] Brillinger, D.R., Time Series: Data Analysis and Theory, Holden-Day, 1981.
[12] Zavarehei, E., Vaseghi, S., "Speech Enhancement In Temporal DFT Trajectories Using Kalman Filters," Interspeech 2005, pp. 2077-2080.
[13] Hansen, J., Pellom, B., "An Effective Quality Evaluation Protocol for Speech Enhancement Algorithms," Proc. of ICSLP 1998, Sydney.