
IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. XX, NO. X, MMM YYYY

Evaluation and Comparison of Late Reverberation Power Spectral Density Estimators

Sebastian Braun, Student Member, IEEE, Adam Kuklasiński, Ofer Schwartz, Student Member, IEEE, Oliver Thiergart, Emanuël A. P. Habets, Senior Member, IEEE, Sharon Gannot, Senior Member, IEEE, Simon Doclo, Senior Member, IEEE, and Jesper Jensen

Abstract: Reduction of late reverberation can be achieved using spatio-spectral filters such as the multichannel Wiener filter (MWF). To compute this filter, an estimate of the late reverberation power spectral density (PSD) is required. In recent years, a multitude of late reverberation PSD estimators have been proposed. In this contribution, these estimators are categorized into several classes, their relations and differences are discussed, and a comprehensive experimental comparison is provided. To compare their performance, simulations in controlled as well as practical scenarios are conducted. It is shown that a common weakness of spatial coherence-based estimators is their performance in high direct-to-diffuse ratio (DDR) conditions. To mitigate this problem, a correction method is proposed and evaluated. It is shown that the proposed correction method can decrease the speech distortion without significantly affecting the reverberation reduction.

Index Terms: diffuse PSD, multichannel dereverberation, spatial filter

I. INTRODUCTION

Strong room reverberation and interfering noise can impair the intelligibility of speech in communication scenarios such as mobile phones, conferencing systems, smart TVs, and hearing aids, and can also decrease the performance of automatic speech recognition systems [1], [2]. Many methods for dereverberation exist, including blind channel identification [3]-[5] and inverse filtering [6]-[8], multichannel linear prediction [9]-[12], modification of the linear prediction residual [13], [14], spectral suppression [15]-[17], and spectro-spatial filtering [18].
S. Braun and E. A. P. Habets are with the International Audio Laboratories Erlangen (a joint institution between the University of Erlangen-Nuremberg and Fraunhofer IIS), Erlangen, Germany (sebastian.braun@audiolabs-erlangen.de, emanuel.habets@audiolabs-erlangen.de). A. Kuklasiński and J. Jensen are with Oticon A/S, 2765 Smørum, Denmark, and with Aalborg University, Department of Electronic Systems, Signal and Information Processing Section, 9220 Aalborg, Denmark (aku@oticon.com, jesj@oticon.com). O. Schwartz and S. Gannot are with the Faculty of Engineering, Bar-Ilan University, Ramat-Gan, Israel (ofer.shwartz@live.biu.ac.il; sharon.gannot@biu.ac.il). O. Thiergart is with the Fraunhofer IIS, Erlangen, Germany (oliver.thiergart@iis.fraunhofer.de). S. Doclo is with the University of Oldenburg, Department of Medical Physics and Acoustics, and the Cluster of Excellence Hearing4all, Oldenburg, Germany (simon.doclo@uni-oldenburg.de).

The multichannel Wiener filter (MWF) and related beamformer-postfilter systems, which reside in the class of spectro-spatial filtering techniques, have been widely used for joint reverberation and noise reduction [19]-[24]. Due to its low complexity, high robustness in practice, and direct integration into other speech enhancement systems, the MWF is very popular in practical systems. The MWF is typically derived in the short-time Fourier transform (STFT) domain assuming a narrowband signal model. The Wiener filter requires estimates of the second-order statistics of the desired and undesired signal components, and the accuracy of these estimates determines the performance of the MWF. When focusing on dereverberation, the late reverberation is often modeled in the STFT domain as an additive diffuse sound field with a time-varying power spectral density (PSD) and a time-invariant spatial coherence. In the following, we refer to this model as the spatial coherence model.
As the diffuse spatial coherence can be calculated analytically for known microphone array geometries, the remaining challenge is to obtain an accurate estimate of the late reverberation PSD, which directly affects the performance of the MWF and hence the quality of the dereverberated signal. As the late reverberation PSD is highly time-varying, it is challenging to obtain an accurate estimate.

To the best of our knowledge, the first multichannel methods to estimate the late reverberation PSD were proposed in [25], [26], whereas an explicit spatial coherence model was first used to estimate the coherent-to-diffuse ratio (CDR) in [27]-[29]. Although not in the context of dereverberation, methods to estimate the direct sound PSD in a diffuse noise field [30], or the diffuse sound PSD [31]-[34], have been proposed. In the past years, a multitude of estimators for the late reverberation or diffuse sound PSD have been developed assuming that the sound field can be described by a direct sound propagating as a plane wave in a time-varying diffuse field and additive stationary noise. Temporal reverberation models can be exploited as well.

Existing late reverberation PSD estimators can be divided into four classes, where the first three classes use the spatial coherence reverberation model. The first two classes are direct PSD estimators, whereas the third class comprises indirect PSD estimators, which require an additional step to obtain the reverberation PSD. In contrast to the first three classes, the fourth class is based on temporal reverberation models.

Estimators in the first class model the reverberation as a diffuse sound field with known spatial coherence and block the direct sound utilizing direction of arrival (DOA) information. This simplifies the estimation procedure, as the resulting signals (in some methods multiple signals) after the blocking operation contain only filtered diffuse sound and noise.
In [21], the error PSD matrix of the blocking output signals is minimized, whereas in [24] and [35], maximum likelihood (ML) estimators are derived given the blocked, or blocked and additionally filtered, signals. In [35] the solution is obtained using the Newton method, whereas in [24] the solution is obtained by a root-finding procedure. Thiergart et al. developed several spatial filters to extract the diffuse sound while blocking the direct sound. In [36] a spatial filter is derived that maximizes the diffuse-to-noise ratio (DNR) at its output, while in [37] a linearly constrained minimum variance (LCMV) beamformer is proposed with a novel constraint set to block the direct sound and to extract the diffuse sound.

The spatial coherence-based estimators in the second class use no blocking of the direct sound; therefore, the unknown PSDs of direct sound and reverberation have to be estimated jointly. In [38] a closed-form ML estimator for the direct sound and reverberation PSDs is presented without taking additive noise into consideration. The method presented in [39] obtains the ML estimator of the direct and reverberation PSDs using the Newton method. In [40], a batch expectation-maximization (EM) algorithm to estimate the direct and reverberation PSDs in the ML sense is presented; unlike all other methods considered in this paper, this method also estimates the spatial coherence matrix of the reverberation. In [41], the direct and reverberation PSDs are estimated jointly in the least-squares sense by minimizing the Frobenius norm of an error matrix.

Estimators in the third class are considered indirect PSD estimators based on the spatial coherence model, assuming that the reverberation is diffuse: rather than estimating the diffuse PSD directly, an estimate is obtained by first estimating the CDR and then deriving the diffuse PSD from it. To limit the number of algorithms under test, we constrain ourselves to the best performing CDR estimator reported in [42].

Estimators in the fourth class utilize temporal models to describe the reverberation, and make no assumption on the spatial coherence. In this class, the reverberation is described either using Polack's model [15], [16], or using a narrowband moving average model [43].
However, when the estimated late reverberation PSD is used in the MWF for dereverberation, an assumption on the spatial coherence of the late reverberation is required.

The reverberation PSD estimators in these four classes can be used equivalently in the MWF or similar beamformer-postfilter systems [22] for dereverberation. However, the properties and the performance of this large variety of PSD estimators are unclear and have never been compared in a unified framework. In this paper, we provide an overview and comparison of the current state-of-the-art reverberation PSD estimators. The obtained results provide a guideline for choosing an estimator for a specific use case, and reveal strengths and weaknesses of existing estimators that can drive further research and developments.

In Sec. II, we present the signal model assuming a single source per time-frequency bin and derive the MWF to estimate the desired signal, requiring an estimate of the reverberation PSD. Sec. III reviews coherence-based direct estimators with and without blocking, Sec. IV the coherence-based indirect PSD estimators, and Sec. V reviews temporal model-based PSD estimators. The relations and differences between the estimators are discussed in Sec. VI. A common weakness of the spatial coherence-based estimators is a systematic bias at high direct-to-diffuse ratios (DDRs). Therefore, we propose a bias compensation method depending on the DDR in Sec. VII. A comprehensive experimental evaluation using controlled and realistic simulations is presented in Sec. VIII, where we analyze the error of the estimated PSDs as well as the resulting performance of the spatial filter using these estimates. Finally, the paper is concluded in Sec. IX.

II. PROBLEM FORMULATION

A. Signal model

We assume that the sound field is captured by an array of $M$ omni-directional microphones with an arbitrary geometry.
The microphone signals given in the STFT domain, $Y_m(k,n)$, $m \in \{1,\ldots,M\}$, are stacked into the vector $\mathbf{y}(k,n) = [Y_1(k,n),\ldots,Y_M(k,n)]^T$, where $k$ and $n$ denote the frequency and time frame indices. We describe the sound field using a parametric signal model, where the microphone signal vector is given by

\[ \mathbf{y}(k,n) = \mathbf{a}(k)\,X(k,n) + \mathbf{d}(k,n) + \mathbf{v}(k,n), \tag{1} \]

where $X(k,n)$ denotes the desired signal component as received by a reference microphone, $\mathbf{a}(k) = [A_1(k),\ldots,A_M(k)]^T$ is a vector containing the acoustic relative transfer functions (RTFs) $A_m(k)$ of the desired signal from the reference microphone to all $M$ microphones, $\mathbf{d}(k,n)$ is the reverberation, and $\mathbf{v}(k,n)$ is the additive noise. Throughout this paper, we assume that the RTFs $\mathbf{a}(k)$ are time-invariant, but in general they can also be time-varying. Note that the desired signal component $X(k,n)$ is often modeled only as the direct sound, ignoring the early reflections arriving within the same STFT frame as the direct sound. The component $\mathbf{d}(k,n)$ models the late reverberation, which is assumed to be uncorrelated with the desired speech component $X(k,n)$. The component $\mathbf{v}(k,n)$ models stationary or slowly time-varying additive noise components such as sensor noise and ambient noise.

For typical STFT window lengths of 20 to 30 ms, the three additive components in (1) can be assumed to be mutually uncorrelated, and the PSD matrix of the microphone signals $\mathbf{y}(k,n)$ is given by

\[ \boldsymbol{\Phi}_y(k,n) = E\{\mathbf{y}(k,n)\,\mathbf{y}^H(k,n)\} = \phi_x(k,n)\,\mathbf{a}(k)\mathbf{a}^H(k) + \boldsymbol{\Phi}_d(k,n) + \boldsymbol{\Phi}_v(k,n), \tag{2} \]

where $E\{\cdot\}$ is the expectation operator, $\phi_x(k,n) = E\{|X(k,n)|^2\}$ is the PSD of the desired signal at the reference microphone, $\boldsymbol{\Phi}_d(k,n) = E\{\mathbf{d}(k,n)\,\mathbf{d}^H(k,n)\}$ denotes the late reverberation PSD matrix, and $\boldsymbol{\Phi}_v(k,n) = E\{\mathbf{v}(k,n)\,\mathbf{v}^H(k,n)\}$ denotes the noise PSD matrix. We assume that the late reverberation can be modeled as a spatially homogeneous and isotropic sound field with a time-varying power.
Therefore, the late reverberation PSD matrix $\boldsymbol{\Phi}_d(k,n)$ can be described by a time-invariant coherence matrix $\boldsymbol{\Gamma}_d(k)$, which is scaled by the time-varying late reverberation PSD $\phi_d(k,n)$ [36], i.e.,

\[ \boldsymbol{\Phi}_d(k,n) = \phi_d(k,n)\,\boldsymbol{\Gamma}_d(k). \tag{3} \]

The time-invariant coherence matrix $\boldsymbol{\Gamma}_d(k)$ can be determined in advance from the microphone array configuration. In the case of a free-field microphone array in a spherical or cylindrical diffuse field, there exist analytic expressions for the spatial coherence, i.e., the sinc or Bessel functions, respectively [44], whereas, e.g., for directional microphones [45] or in a hearing aid setup [46], the spatial coherence function is more complex to describe. A widely used model for the spatial coherence of late reverberation uses the spherical diffuse field assumption, where the $\{i,j\}$-th element of the spatial coherence matrix for omnidirectional microphones is given by [44]

\[ \Gamma_d^{(i,j)}(k) = \operatorname{sinc}\!\left( 2\pi \frac{k f_s}{N_{\mathrm{FFT}}\, c}\, \|\mathbf{r}_i - \mathbf{r}_j\|_2 \right), \tag{4} \]

where $\operatorname{sinc}(\cdot) = \sin(\cdot)/(\cdot)$, the vector $\mathbf{r}_m$ denotes the position of the $m$-th microphone, $f_s$ denotes the sampling frequency, $N_{\mathrm{FFT}}$ is the FFT length, and $c$ is the speed of sound. Although in most considered late reverberation PSD estimation methods the coherence matrix $\boldsymbol{\Gamma}_d(k)$ is assumed to be given by (4), the method in [40] also allows estimating this matrix from the observed signals, which could be advantageous when the reverberant sound field differs from a theoretical diffuse field, e.g., in rooms with strongly non-homogeneous or partially non-reflecting boundaries.

B. Desired signal estimation

To estimate the desired signal $X(k,n)$, we apply a complex-valued spatial filter $\mathbf{w}(k,n)$ to the microphone signals, i.e.,

\[ \hat{X}(k,n) = \mathbf{w}^H(k,n)\,\mathbf{y}(k,n). \tag{5} \]

By minimizing the mean-square error (MSE) cost function

\[ J_{\mathrm{MWF}}(\mathbf{w}) = E\{ |\mathbf{w}^H(k,n)\,\mathbf{y}(k,n) - X(k,n)|^2 \} \tag{6} \]

we obtain the well-known MWF, which is given by

\[ \mathbf{w}_{\mathrm{MWF}} = \big[ \phi_x\,\mathbf{a}\mathbf{a}^H + \underbrace{\phi_d\,\boldsymbol{\Gamma}_d + \boldsymbol{\Phi}_v}_{\boldsymbol{\Phi}_{d+v}} \big]^{-1} \mathbf{a}\,\phi_x, \tag{7} \]

where $\boldsymbol{\Phi}_{d+v}(k,n)$ denotes the interference PSD matrix. The frequency and time frame indices $k$ and $n$ are omitted here and in the following equations for brevity, wherever possible. The MWF can be split into a minimum variance distortionless response (MVDR) beamformer, denoted as $\mathbf{w}_{\mathrm{MVDR}}(k,n)$, and a single-channel Wiener post-filter, denoted as $W_{\mathrm{WF}}(k,n)$, i.e.,
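[47]

\[ \mathbf{w}_{\mathrm{MWF}} = \underbrace{\frac{\boldsymbol{\Phi}_{d+v}^{-1}\mathbf{a}}{\mathbf{a}^H \boldsymbol{\Phi}_{d+v}^{-1}\mathbf{a}}}_{\mathbf{w}_{\mathrm{MVDR}}}\; \underbrace{\frac{\xi}{\xi+1}}_{W_{\mathrm{WF}}}, \tag{8} \]

where $\xi = \phi_x / [\mathbf{a}^H \boldsymbol{\Phi}_{d+v}^{-1}\mathbf{a}]^{-1}$ is the a priori signal-to-interference ratio of the MVDR output signal, which can be estimated using the decision-directed approach [48].

The aim in this paper is to investigate different estimation methods for the PSD $\phi_d(k,n)$, which determines the diffuse PSD matrix using the model (3) together with $\boldsymbol{\Gamma}_d(k)$. We assume the RTF vector $\mathbf{a}(k)$ and the noise PSD matrix $\boldsymbol{\Phi}_v(k,n)$ to be known. In practice, both have to be estimated as well, which is beyond the scope of this paper. Popular noise PSD estimation methods are, for example, [49]-[52], and for DOA estimation the reader is referred to [53], [54].

III. COHERENCE-BASED DIRECT PSD ESTIMATORS

The coherence-based direct reverberation PSD estimators comprise blocking-based methods (Sec. III-A) and non-blocking-based methods (Sec. III-B). These estimators have in common that they are exclusively based on the spatial coherence model (3) with the signal model (1). Therefore, these estimators depend on the diffuse coherence matrix $\boldsymbol{\Gamma}_d(k)$ and on the RTF vector $\mathbf{a}(k)$.

A. Blocking-based methods

The blocking-based methods use a set of $J = M-1$ signals, which are generated by canceling the desired sound from the microphone signals using a blocking matrix. The $J$-dimensional signal vector $\mathbf{u}$ is obtained as

\[ \mathbf{u} = \mathbf{B}^H \mathbf{y}, \tag{9} \]

where the blocking matrix $\mathbf{B}$ of dimension $M \times J$ has to fulfill the constraint

\[ \mathbf{B}^H \mathbf{a} = \mathbf{0}_{J\times 1}. \tag{10} \]

Possible choices for the blocking matrix are discussed in [21], [34], [55]. In this work, we use the eigenspace blocking matrix given by [55]

\[ \mathbf{B} = \big[ \mathbf{I}_{M\times M} - \mathbf{a}(\mathbf{a}^H\mathbf{a})^{-1}\mathbf{a}^H \big]\, \mathbf{I}_{M\times J}, \tag{11} \]

where $\mathbf{I}_{M\times J}$ is a truncated identity matrix. As a consequence of using (1) and (3) with (9) and (10), it follows that the PSD matrix of the blocking output signals $\mathbf{u}(k,n)$ is given by

\[ \boldsymbol{\Phi}_u = \mathbf{B}^H \boldsymbol{\Phi}_y \mathbf{B} = \phi_x \underbrace{\mathbf{B}^H \mathbf{a}\mathbf{a}^H \mathbf{B}}_{\mathbf{0}_{J\times J}} + \phi_d \underbrace{\mathbf{B}^H \boldsymbol{\Gamma}_d \mathbf{B}}_{\tilde{\boldsymbol{\Gamma}}_d} + \underbrace{\mathbf{B}^H \boldsymbol{\Phi}_v \mathbf{B}}_{\tilde{\boldsymbol{\Phi}}_v}. \tag{12} \]

Note that $\tilde{\boldsymbol{\Gamma}}_d(k)$ and $\tilde{\boldsymbol{\Phi}}_v(k,n)$ denote the corresponding second-order statistics after applying the blocking matrix. In Secs. III-A1 and III-A2, ML methods are used to estimate the late reverberation PSD: in Sec. III-A1 the elements of an error matrix are assumed to be random variables, and in Sec. III-A2 the elements of the vector $\mathbf{u}(k,n)$ are assumed to be random variables.

1) PSD matrix-based least-squares method with blocking: In [21], the error matrix between the estimated PSD matrix $\hat{\boldsymbol{\Phi}}_u(k,n)$ and its model is defined as

\[ \boldsymbol{\Phi}_e = \hat{\boldsymbol{\Phi}}_u - \big[ \tilde{\boldsymbol{\Phi}}_v + \phi_d\, \tilde{\boldsymbol{\Gamma}}_d \big]. \tag{13} \]

The off-diagonal elements of the error matrix $\boldsymbol{\Phi}_e(k,n)$ are assumed to be drawn from independent zero-mean complex Gaussian distributions with equal variance [21]. The solution to this ML problem is obtained by solving the least-squares problem of minimizing the squared Frobenius norm of the error matrix,

\[ \hat{\phi}_d = \arg\min_{\phi_d} \|\boldsymbol{\Phi}_e\|_F^2, \tag{14} \]

where $\|\cdot\|_F$ denotes the Frobenius norm, and is given by

\[ \hat{\phi}_d = \frac{\operatorname{tr}\{ \tilde{\boldsymbol{\Gamma}}_d^H ( \hat{\boldsymbol{\Phi}}_u - \tilde{\boldsymbol{\Phi}}_v ) \}}{\operatorname{tr}\{ \tilde{\boldsymbol{\Gamma}}_d^H \tilde{\boldsymbol{\Gamma}}_d \}}, \tag{15} \]

where $\operatorname{tr}\{\cdot\}$ denotes the trace operator. An estimate of $\boldsymbol{\Phi}_u(k,n)$ can be obtained by recursive averaging, i.e., $\hat{\boldsymbol{\Phi}}_u(k,n) = \beta\, \hat{\boldsymbol{\Phi}}_u(k,n-1) + (1-\beta)\, \mathbf{u}(k,n)\mathbf{u}^H(k,n)$, where $\beta$ denotes the forgetting factor.

2) ML using blocking output signals: The methods presented in [35] and [24] both start from the assumption that the elements of the microphone signal vector $\mathbf{y}(k,n)$ are zero-mean complex Gaussian random variables,

\[ \mathbf{y} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Phi}_y). \tag{16} \]

In [35], the ML problem is solved by an iterative Newton method, whereas in [24] a filtered version of the signals $\mathbf{u}(k,n)$ is used and the ML problem is solved by a root-finding method.

a) Solution using root-finding method [24]: In the ML estimator using a root-finding method, the blocking output signals $\mathbf{u}(k,n)$ are filtered to diagonalize $\tilde{\boldsymbol{\Phi}}_v(k,n)$ in (12). Specifically, a whitening matrix $\mathbf{D}(k,n)$ of dimension $J \times J$ is defined as the Cholesky factor of the inverse of $\tilde{\boldsymbol{\Phi}}_v(k,n)$, i.e., $\tilde{\boldsymbol{\Phi}}_v^{-1}(k,n) = \mathbf{D}(k,n)\mathbf{D}^H(k,n)$, yielding the whitened signal vector

\[ \mathbf{z} = \mathbf{D}^H \mathbf{B}^H \mathbf{y}, \tag{17} \]

whose PSD matrix is given by

\[ \boldsymbol{\Phi}_z = \mathbf{D}^H \boldsymbol{\Phi}_u \mathbf{D} = \phi_d\, \boldsymbol{\Gamma}_z + \mathbf{I}, \tag{18} \]

with $\boldsymbol{\Gamma}_z = \mathbf{D}^H \tilde{\boldsymbol{\Gamma}}_d \mathbf{D} = \mathbf{D}^H \mathbf{B}^H \boldsymbol{\Gamma}_d \mathbf{B} \mathbf{D}$. As a result of the described whitening of $\tilde{\boldsymbol{\Phi}}_v(k,n)$, the matrices $\boldsymbol{\Phi}_z(k,n)$ and $\boldsymbol{\Gamma}_z(k,n)$ can be diagonalized using the same unitary matrix $\mathbf{C}(k,n)$, i.e.,

\[ \boldsymbol{\Phi}_z = \mathbf{C}\boldsymbol{\Lambda}_z\mathbf{C}^H, \qquad \boldsymbol{\Gamma}_z = \mathbf{C}\boldsymbol{\Lambda}_\Gamma\mathbf{C}^H, \tag{19} \]

where the orthonormal columns of $\mathbf{C}(k,n)$ are the eigenvectors, and where $\boldsymbol{\Lambda}_z(k,n)$ and $\boldsymbol{\Lambda}_\Gamma(k,n)$ are diagonal matrices containing the eigenvalues of $\boldsymbol{\Phi}_z(k,n)$ and $\boldsymbol{\Gamma}_z(k,n)$, respectively. Due to (18), these eigenvalues are related as $\lambda_{z,j} = \phi_d\, \lambda_{\Gamma,j} + 1$, where $\lambda_{z,j}(k,n)$ and $\lambda_{\Gamma,j}(k,n)$ denote the $j$-th eigenvalue of $\boldsymbol{\Phi}_z(k,n)$ and $\boldsymbol{\Gamma}_z(k,n)$, respectively. Given the filtered blocking output signals $\mathbf{z}(k,n)$ in (17), with $\mathbf{z}(k,n) \sim f_z(\mathbf{0}, \boldsymbol{\Phi}_z)$, the ML estimate of $\phi_d$ is given by

\[ \hat{\phi}_d = \arg\max_{\phi_d} \log f_z(\mathbf{0}, \boldsymbol{\Phi}_z), \tag{20} \]

where $f(\boldsymbol{\mu}, \boldsymbol{\Phi})$ denotes the complex Gaussian likelihood function with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Phi}$.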
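The blocking, whitening, and joint diagonalization steps in (11), (12), and (17)-(19) can be checked numerically. The following is a minimal NumPy sketch under illustrative assumptions that are not taken from the paper: a 4-microphone linear array with 5 cm spacing, a single frequency of 1 kHz, plane-wave RTFs, a correlated noise PSD matrix, and model-exact PSD matrices instead of recursive estimates. All variable names are hypothetical.

```python
import numpy as np

# Hypothetical toy setup: 4-microphone linear array, one frequency bin.
M = 4
J = M - 1
c, f = 343.0, 1000.0                              # speed of sound [m/s], frequency [Hz]
x = 0.05 * np.arange(M)                           # microphone positions on a line [m]
dist = np.abs(x[:, None] - x[None, :])
Gamma_d = np.sinc(2.0 * f * dist / c)             # eq. (4); np.sinc(t) = sin(pi t)/(pi t)
a = np.exp(-2j * np.pi * f * x * np.cos(np.pi / 3) / c)   # assumed plane-wave RTFs
Phi_v = 0.01 * (np.eye(M) + 0.3 * np.ones((M, M)))        # assumed correlated noise
phi_x, phi_d = 2.0, 0.5
Phi_y = phi_x * np.outer(a, a.conj()) + phi_d * Gamma_d + Phi_v   # eqs. (2)-(3)

# Blocking matrix, eq. (11): first J columns of the projector orthogonal to a.
P = np.eye(M) - np.outer(a, a.conj()) / np.vdot(a, a)
B = P[:, :J]
Phi_u = B.conj().T @ Phi_y @ B                    # eq. (12)
Gamma_u = B.conj().T @ Gamma_d @ B
Phi_v_u = B.conj().T @ Phi_v @ B

# Whitening matrix D with inv(Phi_v_u) = D D^H (lower Cholesky factor).
D = np.linalg.cholesky(np.linalg.inv(Phi_v_u))
Phi_z = D.conj().T @ Phi_u @ D                    # eq. (18)
Gamma_z = D.conj().T @ Gamma_u @ D

# Eigenvalues in ascending order; the map lambda -> phi_d * lambda + 1 keeps the order.
lam_z = np.linalg.eigvalsh(Phi_z)
lam_G = np.linalg.eigvalsh(Gamma_z)
relation_holds = np.allclose(lam_z, phi_d * lam_G + 1.0)
```

With exact model statistics, `relation_holds` confirms the relation between the eigenvalues of the whitened PSD matrix and the whitened coherence matrix; with recursively averaged estimates the relation would hold only approximately.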
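By setting the derivative of the log-likelihood function in (20) to zero and exploiting the diagonal structure of the involved matrices (for more details, see [24]), we obtain the polynomial

\[ p(\phi_d) = \sum_{j=1}^{J} p_j(\phi_d), \quad \text{where} \quad p_j(\phi_d) = \left( \phi_d - \frac{g_j - 1}{\lambda_{\Gamma,j}} \right) \prod_{l=1,\, l\neq j}^{J} \left( \phi_d + \frac{1}{\lambda_{\Gamma,l}} \right)^{2}, \tag{21} \]

where $g_j(k,n)$ denotes the $j$-th diagonal element of $\mathbf{C}^H \mathbf{D}^H \hat{\boldsymbol{\Phi}}_u \mathbf{D} \mathbf{C}$. It has been shown in [24] that the root of the polynomial $p(\phi_d)$ yielding the highest value of the likelihood (20) is the ML estimate $\hat{\phi}_d(k,n)$.

b) Solution using Newton's method: To solve the ML estimation problem [35]

\[ \hat{\phi}_d = \arg\max_{\phi_d} \log f_u(\mathbf{0}, \boldsymbol{\Phi}_u), \tag{22} \]

Newton's method is used to derive an iterative search [56]

\[ \phi_d^{(l+1)} = \phi_d^{(l)} - \frac{D\big(\phi_d^{(l)}\big)}{H\big(\phi_d^{(l)}\big)}, \tag{23} \]

where $l$ denotes the iteration index, and $D(\phi_d)$ and $H(\phi_d)$ are the gradient and the Hessian of the log-likelihood,

\[ D(\phi_d) \equiv \frac{\partial \log f_u(\mathbf{0}, \boldsymbol{\Phi}_u)}{\partial \phi_d}, \tag{24} \]

\[ H(\phi_d) \equiv \frac{\partial^2 \log f_u(\mathbf{0}, \boldsymbol{\Phi}_u)}{\partial \phi_d^2}. \tag{25} \]

As shown in [35], the gradient is equal to

\[ D(\phi_d) = J \operatorname{tr}\left\{ \big( \boldsymbol{\Phi}_u^{-1} \mathbf{u}\mathbf{u}^H - \mathbf{I} \big) \boldsymbol{\Phi}_u^{-1} \frac{\partial \boldsymbol{\Phi}_u}{\partial \phi_d} \right\}, \tag{26} \]

with $\partial \boldsymbol{\Phi}_u / \partial \phi_d = \mathbf{B}^H \boldsymbol{\Gamma}_d \mathbf{B}$, whereas the Hessian is equal to

\[ H(\phi_d) = -J \operatorname{tr}\left\{ \boldsymbol{\Phi}_u^{-1} \frac{\partial \boldsymbol{\Phi}_u}{\partial \phi_d} \boldsymbol{\Phi}_u^{-1} \mathbf{u}\mathbf{u}^H \boldsymbol{\Phi}_u^{-1} \frac{\partial \boldsymbol{\Phi}_u}{\partial \phi_d} + \big( \boldsymbol{\Phi}_u^{-1} \mathbf{u}\mathbf{u}^H - \mathbf{I} \big) \boldsymbol{\Phi}_u^{-1} \frac{\partial \boldsymbol{\Phi}_u}{\partial \phi_d} \boldsymbol{\Phi}_u^{-1} \frac{\partial \boldsymbol{\Phi}_u}{\partial \phi_d} \right\}. \tag{27} \]

As shown in [35], the Newton update (23) can be computed efficiently by re-arranging (26) and (27), using an eigenvalue decomposition of $\mathbf{B}^H \boldsymbol{\Gamma}_d \mathbf{B}$ and exploiting the resulting diagonal matrices. In practice, the matrix $\mathbf{u}(k,n)\mathbf{u}^H(k,n)$ is substituted by the smoothed version $\hat{\boldsymbol{\Phi}}_u(k,n)$. The Newton iterations are initialized with $\phi_d^{(0)}(k,n) = \epsilon\, \frac{1}{J} \operatorname{tr}\{ \hat{\boldsymbol{\Phi}}_u(k,n) \}$, where $\epsilon$ is a small positive value. The Newton algorithm is stopped if the estimate at iteration $l = l_{\mathrm{stop}}$ reaches a predefined lower or upper bound, or if a convergence threshold is reached, and the estimate is obtained by $\hat{\phi}_d(k,n) = \phi_d^{(l_{\mathrm{stop}})}(k,n)$.

3) Diffuse beamformers [36], [37]: Thiergart et al. developed several beamformers that aim at extracting the late reverberation, modeled as diffuse sound, while blocking the desired sound.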
As our preliminary experiments unveiled almost identical performance across those beamformers in terms of late reverberation PSD estimation, we present the most elegant one here: the beamformer proposed in [37] minimizes the noise under the linear constraints of blocking the desired sound and not distorting the average transfer function of the diffuse sound, i.e.,

\[ \mathbf{w}_d = \arg\min_{\mathbf{w}}\; \mathbf{w}^H \boldsymbol{\Phi}_v \mathbf{w} \tag{28a} \]

subject to

\[ \mathbf{w}^H \mathbf{a} = 0, \tag{28b} \]

\[ \mathbf{w}^H \boldsymbol{\gamma}_1 = 1, \tag{28c} \]

where $\boldsymbol{\gamma}_1$ is the first column of $\boldsymbol{\Gamma}_d$. The analytic solution to (28) is equal to an LCMV filter [37].
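As an illustration of (28), the sketch below uses the textbook LCMV solution $\mathbf{w} = \boldsymbol{\Phi}_v^{-1}\mathbf{C}\,(\mathbf{C}^H \boldsymbol{\Phi}_v^{-1} \mathbf{C})^{-1}\mathbf{g}$ with constraint matrix $\mathbf{C} = [\mathbf{a}, \boldsymbol{\gamma}_1]$ and response vector $\mathbf{g} = [0, 1]^T$; whether [37] writes the solution in exactly this form is an assumption. It also applies the PSD estimate that follows in (29). The function names, array geometry, frequency, and PSD values are hypothetical, and model-exact PSD matrices are used in place of recursive estimates.

```python
import numpy as np

def lcmv_diffuse_filter(a, gamma_1, Phi_v):
    """Solve (28): minimise w^H Phi_v w  s.t.  w^H a = 0 and w^H gamma_1 = 1."""
    C = np.column_stack([a, gamma_1])             # constraint matrix (M x 2)
    g = np.array([0.0, 1.0])                      # desired responses, C^H w = g
    PinvC = np.linalg.solve(Phi_v, C)             # Phi_v^{-1} C without explicit inverse
    return PinvC @ np.linalg.solve(C.conj().T @ PinvC, g)

def diffuse_psd(w, Phi_y_hat, Phi_v, Gamma_d):
    """PSD estimate (29): noise-compensated output power over filtered coherence."""
    num = np.real(np.vdot(w, (Phi_y_hat - Phi_v) @ w))   # w^H (Phi_y - Phi_v) w
    den = np.real(np.vdot(w, Gamma_d @ w))                # w^H Gamma_d w
    return max(num / den, 0.0)

# Hypothetical toy setup: 4-microphone linear array, one frequency bin.
M = 4
c, f = 343.0, 1000.0
x = 0.05 * np.arange(M)
Gamma_d = np.sinc(2.0 * f * np.abs(x[:, None] - x[None, :]) / c)   # eq. (4)
a = np.exp(-2j * np.pi * f * x * np.cos(np.pi / 3) / c)            # assumed RTFs
Phi_v = 0.01 * (np.eye(M) + 0.3 * np.ones((M, M)))                 # assumed noise
phi_x, phi_d = 2.0, 0.5
Phi_y = phi_x * np.outer(a, a.conj()) + phi_d * Gamma_d + Phi_v    # eqs. (2)-(3)

w_d = lcmv_diffuse_filter(a, Gamma_d[:, 0], Phi_v)
phi_d_hat = diffuse_psd(w_d, Phi_y, Phi_v, Gamma_d)
```

Because the filter nulls the desired sound, the estimate recovers the true diffuse PSD when the PSD matrices match the model exactly; with estimated statistics it is only an approximation.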

Given the filter $\mathbf{w}_d$ from (28), the late reverberation PSD can be estimated by subtracting the PSD of the filtered noise components from the PSD of the filtered input signals and normalizing by the filtered diffuse coherence [36], i.e.,

\[ \hat{\phi}_d = \max\left\{ \frac{\mathbf{w}_d^H \hat{\boldsymbol{\Phi}}_y \mathbf{w}_d - \mathbf{w}_d^H \boldsymbol{\Phi}_v \mathbf{w}_d}{\mathbf{w}_d^H \boldsymbol{\Gamma}_d \mathbf{w}_d},\; 0 \right\}, \tag{29} \]

where the $\max\{\cdot\}$ operation is introduced to avoid negative PSD estimates. The input PSD matrix is recursively estimated using $\hat{\boldsymbol{\Phi}}_y(k,n) = \beta\, \hat{\boldsymbol{\Phi}}_y(k,n-1) + (1-\beta)\, \mathbf{y}(k,n)\mathbf{y}^H(k,n)$.

B. Non-blocking-based methods

In contrast to the blocking-based methods from Sec. III-A, methods within the class discussed in this section do not rely on blocking the desired sound component. Instead, they jointly estimate the desired and late reverberation PSDs. Although the method presented in [38] also falls into this category, it is excluded here, as it does not consider additive noise. Nevertheless, it is worthwhile to note that if the noise component $\mathbf{v}(k,n)$ is zero, the solution from [38] provides a closed-form solution to the problem of jointly estimating $\phi_x(k,n)$ and $\phi_d(k,n)$ in the ML sense.

In Sec. III-B1, the Newton method is used to obtain the ML estimates of the desired and late reverberation PSDs by assuming the diffuse coherence $\boldsymbol{\Gamma}_d(k)$ to be known, whereas the method reviewed in Sec. III-B2 can also estimate $\boldsymbol{\Gamma}_d(k)$ from the data using an EM algorithm. The original EM method is described in Sec. III-B2a, whereas in Sec. III-B2b we assume that $\boldsymbol{\Gamma}_d(k)$ is known. In Sec. III-B3, the desired and diffuse PSDs are estimated jointly in the least-squares sense.

1) ML using Newton's method: In [39], an ML method to jointly estimate $\phi_x(k,n)$ and $\phi_d(k,n)$ is proposed under the assumption that the diffuse coherence matrix $\boldsymbol{\Gamma}_d(k)$ is known.
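By defining $\mathbf{p}(k,n) = [\phi_x(k,n),\, \phi_d(k,n)]^T$ as the unknown parameter set and assuming $\mathbf{y}(k,n) \sim f_y(\mathbf{0}, \boldsymbol{\Phi}_y)$, the ML estimate of $\mathbf{p}(k,n)$ given $\mathbf{y}(k,n)$ can be found by the Newton method [56] using

\[ \mathbf{p}^{(l+1)} = \mathbf{p}^{(l)} - \mathbf{H}^{-1}\big(\mathbf{p}^{(l)}\big)\, \boldsymbol{\delta}\big(\mathbf{p}^{(l)}\big), \tag{30} \]

where $\boldsymbol{\delta}(\mathbf{p})$ is the gradient of the log-likelihood, and $\mathbf{H}(\mathbf{p})$ is the corresponding Hessian matrix, i.e.,

\[ \boldsymbol{\delta}(\mathbf{p}) \equiv \frac{\partial \log f_y(\mathbf{0}, \boldsymbol{\Phi}_y)}{\partial \mathbf{p}}, \tag{31} \]

\[ \mathbf{H}(\mathbf{p}) \equiv \frac{\partial^2 \log f_y(\mathbf{0}, \boldsymbol{\Phi}_y)}{\partial \mathbf{p}\, \partial \mathbf{p}^T}, \tag{32} \]

where $f(\mathbf{y}; \mathbf{p})$ is the p.d.f. of the microphone signal vector. The gradient $\boldsymbol{\delta}(\mathbf{p}) \equiv [\delta_x(\mathbf{p}),\, \delta_d(\mathbf{p})]^T$ is a 2-dimensional vector with elements

\[ \delta_i(\mathbf{p}) = M \operatorname{tr}\left\{ \big( \boldsymbol{\Phi}_y^{-1} \mathbf{y}\mathbf{y}^H - \mathbf{I} \big) \boldsymbol{\Phi}_y^{-1} \frac{\partial \boldsymbol{\Phi}_y}{\partial \phi_i} \right\}, \tag{33} \]

where $i \in \{x, d\}$, $\partial \boldsymbol{\Phi}_y / \partial \phi_x = \mathbf{a}\mathbf{a}^H$, and $\partial \boldsymbol{\Phi}_y / \partial \phi_d = \boldsymbol{\Gamma}_d$. The Hessian is a symmetric $2 \times 2$ matrix,

\[ \mathbf{H}(\mathbf{p}) \equiv \begin{bmatrix} H_{xx}(\mathbf{p}) & H_{dx}(\mathbf{p}) \\ H_{xd}(\mathbf{p}) & H_{dd}(\mathbf{p}) \end{bmatrix}, \tag{34} \]

with the elements

\[ H_{ij}(\mathbf{p}) = -M \operatorname{tr}\left\{ \boldsymbol{\Phi}_y^{-1} \frac{\partial \boldsymbol{\Phi}_y}{\partial \phi_j} \boldsymbol{\Phi}_y^{-1} \mathbf{y}\mathbf{y}^H \boldsymbol{\Phi}_y^{-1} \frac{\partial \boldsymbol{\Phi}_y}{\partial \phi_i} + \big( \boldsymbol{\Phi}_y^{-1} \mathbf{y}\mathbf{y}^H - \mathbf{I} \big) \boldsymbol{\Phi}_y^{-1} \frac{\partial \boldsymbol{\Phi}_y}{\partial \phi_j} \boldsymbol{\Phi}_y^{-1} \frac{\partial \boldsymbol{\Phi}_y}{\partial \phi_i} \right\}, \tag{35} \]

where $i, j \in \{x, d\}$. In practice, the matrix $\mathbf{y}\mathbf{y}^H$ in (33) and (35) is replaced by the smoothed version $\hat{\boldsymbol{\Phi}}_y$. The algorithm is initialized with $\phi_x^{(0)} = \phi_d^{(0)} = \epsilon\, \frac{1}{M} \operatorname{tr}\{ \hat{\boldsymbol{\Phi}}_y \}$. The Newton algorithm is stopped if the estimates at iteration $l$ reach a predefined lower or upper bound, or if a convergence threshold is reached.

2) ML using the EM method: The EM algorithm proposed in [40] is a batch algorithm that provides estimates of the desired sound PSD $\phi_x(k,n)$, the RTF vector $\mathbf{a}(k)$, the late reverberation PSD $\phi_d(k,n)$, and the late reverberation coherence matrix $\boldsymbol{\Gamma}_d(k)$. For consistency with the other methods, we assume that $\mathbf{a}(k)$ is known; it is therefore not estimated within the EM. In Sec. III-B2a, we describe the method proposed in [40]. In Sec. III-B2b, the method is modified by assuming prior knowledge of the coherence matrix $\boldsymbol{\Gamma}_d(k)$ to investigate the effect of estimating $\boldsymbol{\Gamma}_d(k)$.

a) ML-EM with unknown reverberation coherence matrix: The desired and diffuse sound components are concatenated in the hidden data vector

\[ \mathbf{q}(k,n) \equiv \big[ X(k,n),\; \mathbf{d}^T(k,n) \big]^T. \tag{36} \]

Using this definition, equation (1) can be rewritten as

\[ \mathbf{y}(k,n) = \mathbf{H}(k,n)\,\mathbf{q}(k,n) + \mathbf{v}(k,n), \tag{37} \]

where the matrix $\mathbf{H}(k,n) \equiv [\mathbf{a}(k),\; \mathbf{I}_{M\times M}]$.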
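The desired parameter set is

\[ \boldsymbol{\theta}(k) = \big\{ \boldsymbol{\phi}_x(k),\; \boldsymbol{\phi}_d(k),\; \boldsymbol{\Gamma}_d(k) \big\}, \tag{38} \]

where $\boldsymbol{\phi}_x(k) = [\phi_x(k,1),\ldots,\phi_x(k,N)]^T$ and $\boldsymbol{\phi}_d(k) = [\phi_d(k,1),\ldots,\phi_d(k,N)]^T$, with $N$ being the number of frames. By concatenating the hidden data vectors of all time frames $1,\ldots,N$ to $\bar{\mathbf{q}}(k) = [\mathbf{q}^T(k,1),\ldots,\mathbf{q}^T(k,N)]^T$, and defining $\bar{\mathbf{y}}(k)$ similarly, the conditional expectation of the log-likelihood function can be deduced as

\[ Q\big(\boldsymbol{\theta}; \boldsymbol{\theta}^{(l)}\big) = E\big\{ \log f_{\bar{y}}(\mathbf{0}, \boldsymbol{\Phi}_{\bar{y}}) \,\big|\, \bar{\mathbf{y}}(k);\, \boldsymbol{\theta}^{(l)} \big\}, \tag{39} \]

where $\boldsymbol{\theta}^{(l)}$ is the parameter-set estimate at iteration $l$. For implementing the E-step, it is sufficient to estimate $\hat{\mathbf{q}}(k,n) \equiv E\{ \mathbf{q}(k,n) \,|\, \mathbf{y}(k,n); \boldsymbol{\theta}^{(l)} \}$ and $\hat{\boldsymbol{\Psi}}_q(k,n) \equiv E\{ \mathbf{q}(k,n)\mathbf{q}^H(k,n) \,|\, \mathbf{y}(k,n); \boldsymbol{\theta}^{(l)} \}$, being the first- and second-order statistics of the hidden data given the measurements, respectively. Assuming that $\mathbf{y}(k,n)$ and $\mathbf{q}(k,n)$ in (37) are Gaussian random vectors, $\mathbf{q}(k,n)$ can be estimated by the optimal linear estimator [40]

\[ \hat{\mathbf{q}} = E\{\mathbf{q}\mathbf{y}^H\} \big( E\{\mathbf{y}\mathbf{y}^H\} \big)^{-1} \mathbf{y} = \boldsymbol{\Phi}_q^{(l)} \mathbf{H}^H \big( \boldsymbol{\Phi}_y^{(l)} \big)^{-1} \mathbf{y}, \tag{40} \]

with

\[ \boldsymbol{\Phi}_q^{(l)}(k,n) = \begin{bmatrix} \phi_x^{(l)}(k,n) & \mathbf{0}_{1\times M} \\ \mathbf{0}_{M\times 1} & \phi_d^{(l)}(k,n)\, \boldsymbol{\Gamma}_d^{(l)}(k) \end{bmatrix}, \tag{41} \]

and $\boldsymbol{\Phi}_y^{(l)} = \mathbf{H} \boldsymbol{\Phi}_q^{(l)} \mathbf{H}^H + \boldsymbol{\Phi}_v$. The matrix

\[ \hat{\boldsymbol{\Psi}}_q(k,n) \equiv \begin{bmatrix} \widehat{|X(k,n)|^2} & \widehat{X(k,n)\,\mathbf{d}^H(k,n)} \\ \widehat{X^*(k,n)\,\mathbf{d}(k,n)} & \widehat{\mathbf{d}(k,n)\,\mathbf{d}^H(k,n)} \end{bmatrix} \tag{42} \]

can be obtained by [40]

\[ \hat{\boldsymbol{\Psi}}_q = \hat{\mathbf{q}}\hat{\mathbf{q}}^H + \boldsymbol{\Phi}_q^{(l)} - \boldsymbol{\Phi}_q^{(l)} \mathbf{H}^H \big( \boldsymbol{\Phi}_y^{(l)} \big)^{-1} \mathbf{H} \boldsymbol{\Phi}_q^{(l)}. \tag{43} \]

Maximizing $Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{(l)})$ with respect to the problem parameters constitutes the M-step, i.e., [40]

\[ \text{1.}\quad \phi_x^{(l+1)}(k,n) = \widehat{|X(k,n)|^2}, \tag{44} \]

\[ \text{2.}\quad \boldsymbol{\Gamma}_d^{(l+1)}(k) = \frac{1}{N} \sum_{n=1}^{N} \frac{\widehat{\mathbf{d}(k,n)\,\mathbf{d}^H(k,n)}}{\phi_d^{(l)}(k,n)}, \tag{45} \]

\[ \text{3.}\quad \phi_d^{(l+1)}(k,n) = \frac{1}{M} \operatorname{tr}\left\{ \widehat{\mathbf{d}(k,n)\,\mathbf{d}^H(k,n)}\, \big( \boldsymbol{\Gamma}_d^{(l+1)}(k) \big)^{-1} \right\}. \tag{46} \]

The EM iterations are initialized with $\phi_x^{(0)}(k,n) = \epsilon_x \frac{1}{M} \operatorname{tr}\{ \hat{\boldsymbol{\Phi}}_y(k,n) \}$ and $\phi_d^{(0)}(k,n) = \epsilon_d \frac{1}{M} \operatorname{tr}\{ \hat{\boldsymbol{\Phi}}_y(k,n) \}$, where $\epsilon_x > \epsilon_d$, and $\boldsymbol{\Gamma}_d^{(0)}(k)$ is initialized with (4).

b) ML-EM with known reverberation coherence matrix: By assuming that the model for $\boldsymbol{\Gamma}_d(k)$ given by (4) holds, the method described in Sec. III-B2a needs to be modified only by omitting (45) in the M-step and using the a priori known spatial coherence matrix instead.

3) PSD matrix-based least-squares method: By matching $\hat{\boldsymbol{\Phi}}_y(k,n)$, which can be estimated from the microphone signals, and its model given in (2), the problem at hand can be formulated as a system of $M^2$ equations in two unknown variables [41]. Since there are more equations than variables, the vector $\mathbf{p}(k,n) = [\phi_x(k,n),\, \phi_d(k,n)]^T$ that minimizes the total squared error can be found by minimizing the squared Frobenius norm as

\[ \hat{\mathbf{p}} = \arg\min_{\mathbf{p}} \big\| \underbrace{\hat{\boldsymbol{\Phi}}_y - \big( \phi_x\,\mathbf{a}\mathbf{a}^H + \phi_d\,\boldsymbol{\Gamma}_d + \boldsymbol{\Phi}_v \big)}_{\boldsymbol{\Phi}_{\mathrm{LS}}} \big\|_F^2, \tag{47} \]

where $\boldsymbol{\Phi}_{\mathrm{LS}}(k,n)$ is the error matrix. Following some algebraic steps, the cost function in (47) can be written as

\[ \|\boldsymbol{\Phi}_{\mathrm{LS}}\|_F^2 = \mathbf{p}^T \mathbf{A} \mathbf{p} - 2\,\mathbf{b}^T \mathbf{p} + C, \tag{48} \]

where $C(k,n)$ is independent of $\mathbf{p}(k,n)$, and $\mathbf{A}(k,n)$ and $\mathbf{b}(k,n)$ are defined as

\[ \mathbf{A} \equiv \begin{bmatrix} (\mathbf{a}^H\mathbf{a})^2 & \mathbf{a}^H \boldsymbol{\Gamma}_d \mathbf{a} \\ \mathbf{a}^H \boldsymbol{\Gamma}_d \mathbf{a} & \operatorname{tr}\{ \boldsymbol{\Gamma}_d^H \boldsymbol{\Gamma}_d \} \end{bmatrix}, \tag{49} \]

\[ \mathbf{b} \equiv \begin{bmatrix} \Re\big\{ \mathbf{a}^H ( \hat{\boldsymbol{\Phi}}_y - \boldsymbol{\Phi}_v ) \mathbf{a} \big\} \\ \Re\big\{ \operatorname{tr}\{ ( \hat{\boldsymbol{\Phi}}_y - \boldsymbol{\Phi}_v ) \boldsymbol{\Gamma}_d^H \} \big\} \end{bmatrix}. \tag{50} \]

Since the cost function $\|\boldsymbol{\Phi}_{\mathrm{LS}}(k,n)\|_F^2$ in (48) has a quadratic form, setting its gradient with respect to $\mathbf{p}(k,n)$ to zero yields

\[ \hat{\mathbf{p}}(k,n) = \mathbf{A}^{-1}(k,n)\, \mathbf{b}(k,n). \tag{51} \]

Note that this method is related to the method presented in Sec. III-A1.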
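The closed-form solution (49)-(51) is compact enough to sketch directly. The NumPy example below is illustrative only: the helper names, the 4-microphone linear array, the 1 kHz frequency, the plane-wave RTFs, and the PSD values are assumptions, and the input PSD matrix is built exactly from the model (2)-(3) rather than estimated by recursive averaging.

```python
import numpy as np

def diffuse_coherence(x, f, c=343.0):
    """Spherically isotropic coherence, eq. (4), for microphones on a line."""
    dist = np.abs(x[:, None] - x[None, :])
    # np.sinc(t) = sin(pi t)/(pi t), so t = 2 f dist / c gives sin(z)/z, z = 2 pi f dist / c.
    return np.sinc(2.0 * f * dist / c)

def psd_least_squares(Phi_y_hat, Phi_v, a, Gamma_d):
    """Joint estimate of (phi_x, phi_d) via p = A^{-1} b, eqs. (49)-(51)."""
    R = Phi_y_hat - Phi_v
    aHa = np.real(np.vdot(a, a))                  # a^H a
    aGa = np.real(np.vdot(a, Gamma_d @ a))        # a^H Gamma_d a
    A = np.array([[aHa ** 2, aGa],
                  [aGa, np.real(np.trace(Gamma_d.conj().T @ Gamma_d))]])
    b = np.array([np.real(np.vdot(a, R @ a)),
                  np.real(np.trace(R @ Gamma_d.conj().T))])
    phi_x, phi_d = np.linalg.solve(A, b)
    return phi_x, phi_d

# Hypothetical toy setup: 4-microphone linear array, one frequency bin.
M = 4
c, f = 343.0, 1000.0
x = 0.05 * np.arange(M)
Gamma_d = diffuse_coherence(x, f, c)
a = np.exp(-2j * np.pi * f * x * np.cos(np.pi / 3) / c)   # assumed plane-wave RTFs
Phi_v = 0.01 * np.eye(M)                                   # assumed sensor noise
Phi_y = 2.0 * np.outer(a, a.conj()) + 0.5 * Gamma_d + Phi_v

phi_x_hat, phi_d_hat = psd_least_squares(Phi_y, Phi_v, a, Gamma_d)
```

With model-exact input, the solver returns the PSD values used to build the input matrix; solving the 2 x 2 system with `np.linalg.solve` avoids forming the explicit inverse in (51).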
Both this method and the blocking-based method of Sec. III-A1 minimize the Frobenius norm of an error matrix; the desired sound is blocked in the latter, whereas the late reverberation and desired sound PSDs are estimated jointly in the former.

IV. COHERENCE-BASED INDIRECT PSD ESTIMATORS

While all methods in Sec. III directly estimate the late reverberation PSD, we consider indirect estimators in this section. Within this class, we focus on methods using an estimate of the CDR to estimate the PSD of the diffuse sound, i.e., the late reverberation. These estimators rely on the fact that the desired signal $X(k,n)$ is coherent across all microphones. The CDR as defined in [27], [42] for the microphone pair $i,j \in \{1,\ldots,M\}$ is given by

\[ \mathrm{CDR}_{i,j}(k,n) = \frac{\phi_x(k,n)\, |A_i(k)|\, |A_j(k)|}{\phi_d(k,n)}. \tag{52} \]

The CDR can be estimated using various methods, e.g., [42], [57]. To limit the number of estimators under test, we restrict ourselves to the CDR estimator labeled "proposed 2" in [42], which was reported to perform best across the considered CDR estimators. The CDR estimator requires knowledge of the RTFs $\mathbf{a}(k)$ and the diffuse coherence matrix $\boldsymbol{\Gamma}_d(k)$. Furthermore, we compensate for the additive noise as proposed in [45]. To take all microphones into account, we average the CDR estimate over all microphone pairs [30], [31],

\[ \widehat{\mathrm{CDR}}(k,n) = \frac{1}{|\mathcal{M}|} \sum_{i,j \in \mathcal{M}} \frac{\widehat{\mathrm{CDR}}_{i,j}(k,n)}{|A_i(k)|\, |A_j(k)|}, \tag{53} \]

where the set $\mathcal{M}$ contains all microphone pair combinations. Given an estimate of the CDR, and exploiting the diffuse homogeneity, the late reverberation PSD is obtained by [21]

\[ \hat{\phi}_d(k,n) = \frac{\frac{1}{M} \operatorname{tr}\big\{ \hat{\boldsymbol{\Phi}}_y(k,n) - \boldsymbol{\Phi}_v(k,n) \big\}}{\frac{1}{M}\, \mathbf{a}^H(k)\mathbf{a}(k)\, \widehat{\mathrm{CDR}}(k,n) + 1}. \tag{54} \]

If the desired sound $X(k,n)$ is modeled as a plane wave such that $|A_m(k)| = 1,\ \forall m$, then (52)-(54) can be simplified.

V. TEMPORAL MODEL-BASED PSD ESTIMATORS

Instead of modeling the late reverberation as a diffuse sound field, estimators within the fourth class exploit the temporal structure of reverberation, such that they can be applied to each individual microphone signal.

A. Statistical temporal model

In [15], [16] it was proposed to model the impulse response by an exponentially decaying random process per frequency band. Using this model, the late reverberation PSD of the $m$-th microphone $\phi_d^{(m)}$ can be estimated depending on two a priori required parameters, namely the frequency-dependent reverberation time $T_{60}(k)$ and the inverse direct-to-reverberation ratio (DRR) $\kappa(k)$, by [16]

\[ \hat{\phi}_d^{(m)}(k,n) = [1-\kappa(k)]\, e^{-2\alpha(k) R N_D}\, \hat{\phi}_d^{(m)}(k,n-N_D) + \kappa(k)\, e^{-2\alpha(k) R N_D} \big[ \hat{\phi}_y^{(m)}(k,n-N_D) - \phi_v^{(m)}(k,n-N_D) \big], \tag{55} \]

where $N_D$ corresponds to the number of frames between the direct sound and the start time of the late reverberation, $\alpha(k) = 3\ln(10)/(T_{60}(k)\, f_s)$ is the reverberation decay constant, $R$ is the hop-size, and $\hat{\phi}_y^{(m)}(k,n)$ and $\phi_v^{(m)}(k,n)$ are the diagonal entries of $\hat{\boldsymbol{\Phi}}_y(k,n)$ and $\boldsymbol{\Phi}_v(k,n)$, respectively. Following the spatial homogeneity assumption of the late reverberation, we can spatially average the PSD estimates across all microphones as

\[ \hat{\phi}_d(k,n) = \frac{1}{M} \sum_{m=1}^{M} \hat{\phi}_d^{(m)}(k,n). \tag{56} \]

TABLE I
CLASSIFICATION AND PROPERTIES OF REVERB PSD ESTIMATORS
(The check marks of the columns "exploits spatial coherence model", "requires knowledge of spatial coherence", "exploits temporal structure", and "estimates $\phi_x$" are not recoverable from this copy.)

Section   Method                        Processing / solution      Real-time factor
III-A1    Blocking PSD LS [21]          online / closed-form       0.8
III-A2a   Blocking ML root [24]         online / polyn. rooting    8.8
III-A2b   Blocking ML Newton [35]       online / iterative         3.7
III-A3    BF LCMV [37]                  online / closed-form       0.5
III-B1    ML Newton [39]                online / iterative         6.5
III-B2a   ML-EM est. coh. [40]          batch / iterative          15.1
III-B2b   ML-EM diff. coh. [40]         batch / iterative          14.9
III-B3    PSD LS [41]                   online / closed-form       0.6
IV        CDR [42]                      online / closed-form       0.7
V-A       LRSV [16]                     online / closed-form       0.4
V-B       CTF [43]                      online / recursive         8.5

B. Convolutive transfer function based methods

Using the convolutive transfer function (CTF) approximation [58] per frequency band, the $m$-th microphone signal can be described by

\[ Y_m(k,n) = \sum_{l=0}^{L} H_m(k,l)\, X_m(k,n-l) + V_m(k,n), \tag{57} \]

where $X_m(k,n)$ is the direct speech signal at the $m$-th microphone, $H_m(k,l)$ for $l \in \{0,\ldots,L\}$ are the CTFs, and $L$ is the required number of frames to model the reverberation. By using a relative CTF formulation [43], we can re-interpret $X_m(k,n)$ as the speech component in the $m$-th microphone containing some early reflections, and $H_m(k,l)$ as the relative CTFs such that $H_m(k,0) = 1$. By modeling the coefficients $H_m(k,l)$ as first-order Markov random variables, $H_m(k,l)$ can be estimated using a Kalman filter for $l \in \{1,\ldots,L\}$, and past frames of $X_m(k,n)$ can be estimated using an auxiliary Wiener filter.
Using the estimates $\hat H_m(k,l)$ and $\hat X_m(k,n)$, the late reverberation PSD at the m-th microphone can be estimated by

$$ \hat\phi_d^{(m)}(k,n) = E\left\{ \left| \sum_{l=N_D}^{L} \hat H_m(k,l)\, \hat X_m(k,n-l) \right|^2 \right\}, \tag{58} $$

where $N_D$ again denotes the start time frame of the late reverberation. The expectation in (58) can be approximated by a recursive average. As with the previous single-channel estimator in Sec. V-A, the microphone-specific PSDs $\hat\phi_d^{(m)}(k,n)$ can be spatially averaged using (56).

VI. DISCUSSION

An overview of the different classes of estimators along with their important properties is shown in Tab. I. The first two columns indicate the section numbers and short acronyms of the methods. The discriminative properties of the methods are whether they exploit a spatial coherence model, require prior knowledge of the spatial coherence of the late reverberation $\boldsymbol\Gamma_d(k)$, exploit a temporal structure model, additionally or inherently deliver an estimate of the desired sound PSD $\phi_x(k,n)$, and are online or batch processing methods, as well as the type of solution (closed-form, iterative, recursive, etc.) and the computational complexity in terms of the real-time factor. The real-time factor, i.e. the processing time per time frame, was measured running MATLAB R2016b on a 3.1 GHz Intel Core i5 processor. Although the implementations were not optimized for runtime, the real-time factors give a good indication of the computational complexity. Algorithms with a real-time factor < 1 have low complexity and are easy to implement even on less powerful devices, whereas algorithms with a real-time factor > 1 require more powerful processors and strong optimization to run in real time. It can be observed that mainly the methods without a closed-form solution are rather complex. For the CTF method, the complexity depends on the filter length $L$. The parameter settings for each algorithm are described in Sec. VIII-A.
While the coherence-based methods and the LRSV method deliver useful PSD estimates practically instantaneously, the CTF method requires a short initial convergence phase of 1-2 s before providing accurate estimates, as shown in [43]. An exception is the ML-EM, which is a batch method that requires a larger amount of data (in the range of several seconds) before providing a result, and is therefore not useful for online processing. It is interesting that the estimator pairs blocking PSD LS (Sec. III-A1) and PSD LS (Sec. III-B3), and blocking ML Newton (Sec. III-A2b) and ML Newton (Sec. III-B1), use the same mathematical solution methods, while the first ones use a blocking of the desired sound whereas the latter ones jointly estimate the late reverberation and desired sound PSDs. Note that all spatial coherence-based methods (Sec. III and IV) can also be used to estimate the PSD of non-reverberation

related sound fields, such as possibly non-stationary diffuse noise or ambient sounds like babble noise. On the other hand, this means that spatial coherence-based estimators cannot discriminate between reverberation originating from a speech signal and other diffuse sounds if the reverberation and the other diffuse components have the same spatial coherence. Methods exploiting temporal structures of reverberation, such as the ones imposing a model on the reverberant tail (Sec. V-A) or exploiting the CTF model (Sec. V-B), can discriminate between reverberation and other diffuse sound fields. Conversely, this means that these methods are not suitable for estimating the PSD of general diffuse sound fields. Furthermore, all spatial coherence-based methods require prior knowledge or estimates of the RTFs of the desired sound $\mathbf{a}(k)$, whereas the temporal model-based reverberation PSD estimators in Sec. V work independently per channel and require some temporal information such as the $T_{60}$ or the length $L$ of the relative CTF.

Fig. 1. Mean log PSD error and fitted exponential compensation function depending on the estimated DDR, obtained using simulated RIRs.

VII. BIAS COMPENSATION

As will be shown in Section VIII, the spatial coherence-based estimators under test are severely biased in high DDR conditions. Since overestimation of the late reverberation PSD is especially harmful to the audio quality, as it causes speech distortion when used for dereverberation, we propose a simple compensation method using the correction factor $c_d(k,n) = f_c(\hat\phi_d, \widehat{\mathrm{DDR}})$ as a function of the estimated DDR. A similar compensation method was proposed in [59] in the context of noise reduction.
As a proof of concept and without claiming optimality, we fit an exponential function to the mean logarithmic PSD estimation error of the three coherence-based estimators depending on the estimated DDR as shown in Fig. 1, where the logarithmic PSD estimation error is defined in (65). We approximate the error using the function

$$ c_d(\widehat{\mathrm{DDR}}) = a\, e^{\,b\, 10\log_{10} \widehat{\mathrm{DDR}}}, \tag{59} $$

where the bias function $c_d$ is obtained in dB and the DDR is estimated by $\widehat{\mathrm{DDR}} = (\phi_y - \hat\phi_d - \phi_v)/\hat\phi_d$. Using the MATLAB sfit() function, we fit the exponential function (59) to the average error of the three coherence-based estimators within the range $\widehat{\mathrm{DDR}} = [-20, 20]$ dB, as shown in Fig. 1 as an example. The values obtained in this way are a = and b = . Figure 1 shows the error data used and the fitted curve $c_d(\widehat{\mathrm{DDR}})$. The compensated PSD $\hat\phi_d^{\mathrm{comp}}$ is then obtained by multiplying the estimated PSD $\hat\phi_d$ with the inverse linearized bias function,

$$ \hat\phi_d^{\mathrm{comp}} = 10^{-c_d(\widehat{\mathrm{DDR}})/10}\, \hat\phi_d. \tag{60} $$

VIII. EVALUATION

In this section, we evaluate the performance of the estimators reviewed in Sections III to V for different acoustic setups. Sections VIII-A to VIII-C describe the simulation parameters, signal generation, and performance measures. In Sec. VIII-D we first use a controlled setup for only the spatial coherence-based estimators, using artificial white noise signals in a stationary diffuse noise field. The reverberant PSD estimators discussed in Sec. V are excluded from this first evaluation since they are not suitable for this scenario. The ML-EM with unknown coherence (Sec. III-B2a) is also omitted from this first evaluation, as in this case the assumed coherence model perfectly fits the data. Second, an evaluation using speech, measured room impulse responses (RIRs), and recorded noise is conducted in Sec. VIII-E. Finally, in Sec. VIII-F and VIII-G, measured RIRs are used to confirm the results in realistic environments.

A. Acoustic setup and simulation parameters

In all simulations and measurements, we used a uniform circular array with a radius of 10 cm and M = 8 omnidirectional microphones. In Sec.
VIII-D the desired sound component $X(k,n)$ was generated as a plane wave, while the late reverberation component $\mathbf{d}(k,n)$ was generated as a stationary diffuse field. In Sec. VIII-E, realistic signals were generated using measured RIRs and recorded noise from the REVERB challenge database [60] and speech data from [61]. The REVERB database provides 12 acoustic conditions in total: three different rooms with $T_{60} \in \{0.3, 0.6, 0.7\}$ s, where in each room two different source angles at two distances (0.5 m and 2 m) were measured. The speech data featured 3 female and 3 male speakers with a total length of over 2 minutes. The signals were sampled with a sampling frequency of $f_s = 16$ kHz and analyzed using an STFT with 50% overlapping square-root Hann windows of length 32 ms, and $N_{\mathrm{FFT}}$ = .

The stationary noise PSD matrix $\boldsymbol\Phi_v(k,n)$ and the RTF vector $\mathbf{a}(k)$ of the desired sound were assumed to be known in advance. The noise PSD matrix was computed during speech absence and the RTF vector was obtained from the direct sound peak of the RIRs. Therefore, by assuming the source in the far field, the RTF vector $\mathbf{a}(k,n)$ corresponds to simple delays and is referred to as the steering vector in the following. The recursive smoothing parameter for estimating the PSD matrices was set to $\beta = 0.73$, which corresponds to exponential smoothing with a time constant of 50 ms.

Algorithm-specific parameters were chosen as follows: $N_D = 1$ frame, $T_{60}(k)$ was set according to the fullband reverberation time for each room, the CTF length $L$ was chosen according to the fullband $T_{60}$ in each room, $\epsilon = 0.01$, $\epsilon_x = 0.5$, and $\epsilon_d = 0.1$. The iterative Newton and EM algorithms were halted after a maximum of 10 iterations, even if the convergence threshold of $|\hat\phi_d^{(\ell)} - \hat\phi_d^{(\ell-1)}| <$ was not reached.
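The relation between the smoothing parameter and the stated time constant can be checked with a short calculation. The sketch below assumes the common convention $\beta = e^{-R/(f_s \tau)}$, where the hop $R$ equals half of the 32 ms window (50% overlap); this mapping is an assumption, although it reproduces the value quoted above.

```python
import math

def smoothing_factor(hop_s, tau_s):
    """Recursive-averaging factor for a hop of hop_s seconds and a
    time constant of tau_s seconds (assumed mapping: exp(-hop/tau))."""
    return math.exp(-hop_s / tau_s)

fs = 16000                      # sampling frequency in Hz
frame_s = 0.032                 # window length in seconds
hop_s = 0.5 * frame_s           # 50% overlap -> 16 ms hop
frame_len = round(frame_s * fs) # 512 samples
hop_len = round(hop_s * fs)     # 256 samples

beta = smoothing_factor(hop_s, 0.050)
print(frame_len, hop_len, round(beta, 2))  # 512 256 0.73
```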

B. Signal generation

The definition of the ground truth, i.e. the oracle PSD $\phi_d(k,n)$, is very important as it significantly influences the results. In Sec. VIII-D, we used a highly controlled test scenario. All signal components were generated using stationary white noise in the time domain: the direct sound component at the reference microphone $X(k,n)$ with PSD $\phi_x(k,n)$; the diffuse sound component $\mathbf{d}(k,n)$ with PSD $\phi_d^{(m)}(k,n) = \phi_d(k,n)$, obtained by imposing the diffuse long-term coherence given by (4) on white noise signals using the method proposed in [62]; and the additive noise component $\mathbf{v}(k,n)$ with PSD $\phi_v^{(m)}(k,n) = \phi_v(k,n)$. To obtain the test signals, the different stationary signal components of 10 s length were summed up depending on the reverberant signal-to-noise ratio (RSNR) and the DDR,

$$ \mathrm{RSNR} = \frac{\sum_{k,n} \left( \phi_x(k,n) + \phi_d(k,n) \right)}{\sum_{k,n} \phi_v(k,n)}, \tag{61} $$

$$ \mathrm{DDR} = \frac{\sum_{k,n} \phi_x(k,n)}{\sum_{k,n} \phi_d(k,n)}. \tag{62} $$

Although it is often assumed that the transition between early reflections and late reverberation occurs around 50 ms after the direct sound, we chose an earlier transition to find a fair common ground truth for the coherence-based and temporal models: unlike the temporal model-based methods, the coherence-based model does not have a control parameter to define the start time of the late reverberation. Therefore, the only reasonable option is to define $N_D = 1$, i.e. the late reverberation starts one frame shift (in our case 16 ms) after the direct sound.

In Secs. VIII-E to VIII-G, the reverberant signals $\mathbf{a}(k) X(k,n) + \mathbf{d}(k,n)$ were generated by convolving non-reverberant speech signals with measured RIRs from the REVERB database. The time-domain representation of the oracle reverberation component $\mathbf{d}(k,n)$ for evaluation purposes was then obtained by convolving the non-reverberant test speech signal with windowed RIRs containing only the late part of the reverberation, starting 16 ms after the direct sound.
The time-domain representation of the additive noise $\mathbf{v}(k,n)$ was pink noise, in order to maintain an approximately constant RSNR per frequency band with respect to the speech. The oracle reverberation PSD used as a target for evaluation is the spatially averaged instantaneous late reverberation power, i.e.

$$ \phi_d(k,n) = \frac{1}{M}\, \mathbf{d}^H(k,n)\, \mathbf{d}(k,n). \tag{63} $$

For the speech enhancement evaluation in Secs. VIII-F and VIII-G, in addition to using the theoretical diffuse coherence given by (4), we also used the oracle coherence matrix of the late reverberation, where the (i, j)-th element was computed by

$$ \Gamma_d^{(i,j)}(k) = \frac{\sum_{n=1}^{N} D_i(k,n)\, D_j^*(k,n)}{\sqrt{\left( \sum_{n=1}^{N} |D_i(k,n)|^2 \right) \left( \sum_{n=1}^{N} |D_j(k,n)|^2 \right)}}, \tag{64} $$

where $D_m(k,n)$ is the m-th element of $\mathbf{d}(k,n)$.

The oracle desired signal for evaluation purposes required in Sec. VIII-F and VIII-G is defined as the direct sound at the reference microphone, and was obtained by convolving only the windowed direct path of the reference microphone RIR with the anechoic speech signal.

C. Performance measures

1) Logarithmic PSD estimation error: To evaluate the estimation accuracy of the various PSD estimators, we employ the bin-wise logarithmic error

$$ e(k,n) = 10\log_{10} \frac{\hat\phi_d(k,n)}{\phi_d(k,n)}, \tag{65} $$

which directly reflects over- and underestimation as positive and negative values in dB, respectively. The log error is analyzed statistically in terms of its mean $\mu_e$ and the lower and upper semi-variance [63]

$$ \sigma_{e,l}^2 = \frac{1}{|\mathcal{T}_l|} \sum_{\{k,n\} \in \mathcal{T}_l} \left( e(k,n) - \mu_e \right)^2, \quad \mathcal{T}_l : e(k,n) \le \mu_e, \tag{66a} $$

$$ \sigma_{e,u}^2 = \frac{1}{|\mathcal{T}_u|} \sum_{\{k,n\} \in \mathcal{T}_u} \left( e(k,n) - \mu_e \right)^2, \quad \mathcal{T}_u : e(k,n) > \mu_e, \tag{66b} $$

where the sets of time-frequency bins $\mathcal{T}_l$ and $\mathcal{T}_u$ contain all bins below or above the mean, respectively. Therefore, a log error with zero mean and small semi-standard deviations $\sigma_{e,l}$ and $\sigma_{e,u}$ is desired. In the following figures, the mean is represented by symbols (circle, square, etc.) and the semi-standard deviations are indicated by whisker bars.
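The error statistics in (65), (66a) and (66b) translate directly into a few array operations. The following sketch assumes the estimated and oracle PSDs are given as arrays of equal shape; the small constant `eps`, which guards against division by zero, is an addition that is not part of the definitions above.

```python
import numpy as np

def log_error_stats(phi_est, phi_true, eps=1e-12):
    """Mean and lower/upper semi-standard deviation of the log error.

    Returns (mu_e, sigma_e_l, sigma_e_u), all in dB.
    """
    phi_est = np.asarray(phi_est, dtype=float)
    phi_true = np.asarray(phi_true, dtype=float)
    e = 10.0 * np.log10((phi_est + eps) / (phi_true + eps))   # (65)
    mu = e.mean()
    lower = e[e <= mu]                                        # set T_l
    upper = e[e > mu]                                         # set T_u
    sigma_l = np.sqrt(np.mean((lower - mu) ** 2)) if lower.size else 0.0
    sigma_u = np.sqrt(np.mean((upper - mu) ** 2)) if upper.size else 0.0
    return float(mu), float(sigma_l), float(sigma_u)
```

Reporting the two semi-deviations separately keeps overestimation and underestimation visible, which a single symmetric standard deviation would hide.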
2) Speech enhancement measures: We also assess the influence of the various PSD estimates on the dereverberation performance of the MWF using (5), (8) and (3) by employing several speech enhancement measures. To assess the perceptual similarity between the MWF output signal $\hat X(k,n)$ and the oracle desired signal $X(k,n)$, we employ the Cepstral Distance (CD) [64] and the Perceptual Evaluation of Speech Quality (PESQ) measure [65]. The amount of perceived reverberation is quantified by the normalized Speech-to-Reverberation-Modulation Energy Ratio (SRMR) [66]. Furthermore, we compute the segmental interference reduction (IR) and speech distortion index (SDI) [67] as

$$ \mathrm{IR} = \frac{1}{|\mathcal{T}|} \sum_{n \in \mathcal{T}} 10\log_{10} \frac{\sum_{t=(n-1)T}^{nT} s_{v,\mathrm{in}}^2(t)}{\sum_{t=(n-1)T}^{nT} s_{v,\mathrm{mwf}}^2(t)}, \tag{67} $$

$$ \mathrm{SDI} = \frac{1}{|\mathcal{T}|} \sum_{n \in \mathcal{T}} \frac{\sum_{t=(n-1)T}^{nT} \left( s_{x,\mathrm{mwf}}(t) - s_{x,\mathrm{in}}(t) \right)^2}{\sum_{t=(n-1)T}^{nT} s_{x,\mathrm{in}}^2(t)}, \tag{68} $$

where $s_{v,\mathrm{in}}(t)$ and $s_{v,\mathrm{mwf}}(t)$ are the time-domain representations of late reverberation plus noise at the reference microphone and at the MWF output, respectively, $s_{x,\mathrm{in}}(t)$ and $s_{x,\mathrm{mwf}}(t)$ are the time-domain representations of the direct sound at the reference microphone and the MWF output, $t$ is the sample index, $T$ is the number of samples corresponding to a segment of 20 ms, and the set $\mathcal{T}$ contains only time segments where speech is active, determined by an ideal voice activity detector. For the perceptually motivated measures, we compute the improvement with respect to the unprocessed reference microphone signal, indicated by ΔCD, ΔPESQ and ΔSRMR.
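A minimal sketch of the segmental measures (67) and (68) is given below. It assumes non-overlapping 20 ms segments indexed from zero and a boolean speech-activity flag per segment supplied by the caller; the function name and the small regularizer `eps` are not taken from the paper.

```python
import numpy as np

def segmental_ir_sdi(s_v_in, s_v_mwf, s_x_in, s_x_mwf, fs, active,
                     seg_s=0.02, eps=1e-12):
    """Segmental interference reduction (dB) and speech distortion index.

    active : boolean sequence, one entry per segment (ideal VAD decision).
    Only segments flagged active contribute to the averages.
    """
    T = int(round(seg_s * fs))                 # samples per segment
    n_seg = min(len(s_v_in), len(s_x_in)) // T
    ir, sdi = [], []
    for n in range(n_seg):
        if not active[n]:
            continue
        sl = slice(n * T, (n + 1) * T)
        p_in = np.sum(s_v_in[sl] ** 2) + eps
        p_out = np.sum(s_v_mwf[sl] ** 2) + eps
        ir.append(10.0 * np.log10(p_in / p_out))                 # (67)
        err = np.sum((s_x_mwf[sl] - s_x_in[sl]) ** 2)
        sdi.append(err / (np.sum(s_x_in[sl] ** 2) + eps))        # (68)
    return float(np.mean(ir)), float(np.mean(sdi))
```

If no segment is flagged active, the averages are undefined and NumPy returns NaN, so callers may want to check the activity mask first.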

Fig. 2. Mean and standard deviation of log error for artificial stationary sound field with DDR = 10 dB.

Fig. 3. Mean and standard deviation of log error for artificial stationary sound field with RSNR = 15 dB.

D. Evaluation of spatial coherence-based PSD estimators for stationary diffuse noise

For a stationary diffuse field, the logarithmic PSD estimation error described in Sec. VIII-C is shown in Fig. 2 over varying RSNRs with the DDR fixed at 10 dB. It can be observed that at higher RSNRs, all estimators have a small variance and their error means are close to 0 dB, which means a small estimation error. Blocking ML Newton performs slightly worse at medium RSNRs compared to the other estimators due to a small increase in its variance. At low RSNRs the estimators Blocking ML root, Blocking ML Newton, and ML Newton are the most robust, as mainly their variance increases while the mean stays close to 0 dB. ML-EM and CDR are very robust in terms of their variance at low RSNRs, but their mean increases, so they become biased towards positive values.

Figure 3 shows the dependency of the log error on the DDR while keeping the RSNR fixed at 15 dB. At first it may seem surprising that all coherence-based estimators show a large error at high positive DDRs. However, this can be explained by the fact that if the direct sound component dominates the observed signal, it becomes difficult to accurately estimate the comparatively weak diffuse sound component.

Figures 4 and 5 show the influence of DOA estimation errors for two DDRs (25 dB and 0 dB, respectively) at an RSNR of 15 dB.
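To illustrate what a DOA steering offset means for the array of Sec. VIII-A, the following sketch computes a far-field steering vector for a uniform circular array with M = 8 microphones and 10 cm radius. The microphone angles, the sign convention of the phase, the speed of sound of 343 m/s, the source lying in the array plane, and the choice of microphone 0 as reference are all assumptions made for this example.

```python
import numpy as np

def uca_steering_vector(freq_hz, doa_deg, num_mics=8, radius_m=0.10, c=343.0):
    """Far-field steering vector (pure delays) of a uniform circular array,
    normalized to the first microphone."""
    mic_angles = 2.0 * np.pi * np.arange(num_mics) / num_mics
    doa = np.deg2rad(doa_deg)
    # Time advance of each microphone relative to the array center
    advance = radius_m * np.cos(doa - mic_angles) / c
    a = np.exp(1j * 2.0 * np.pi * freq_hz * advance)
    return a / a[0]

# Steering vector for the true DOA and for a 20 degree steering offset
a_true = uca_steering_vector(1000.0, 0.0)
a_offset = uca_steering_vector(1000.0, 20.0)
mismatch = np.abs(np.vdot(a_true, a_offset)) / len(a_true)  # 1.0 means no mismatch
```

The normalized inner product `mismatch` gives a quick impression of how strongly a given offset decorrelates the assumed and the actual steering vector at one frequency.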
In this simulation, the steering vector $\mathbf{a}(k)$ used to compute the MWF (8) was computed with an angular offset, while the actual source position was kept constant. From these results it can be observed that an offset in the steered DOA increases the variance and shifts the mean to positive values (overestimation). At high DDR, the impact of steering errors is very prominent (see Fig. 4), whereas at DDR = 0 dB and lower, the influence of DOA errors is minor (Fig. 5). Fortunately, at high DRRs the underlying assumptions of the typical signal models used for steering vector estimation are matched more accurately. This can mitigate the influence of steering vector estimation errors at high DRR in practice.

In this section we showed that the coherence-based PSD estimators work well at high RSNR, low DDR and low steering errors, but that they exhibit weaknesses at low RSNR, at high DDR and in the presence of steering errors. Note that a high DRR occurs when the $T_{60}$ is small or when the source-array distance is small.

Fig. 4. Mean and standard deviation of log error for artificial stationary diffuse field for DOA offset with RSNR = 15 dB and DDR = 25 dB.

Fig. 5. Mean and standard deviation of log error for artificial stationary diffuse field for DOA offset with RSNR = 15 dB and DDR = 0 dB.

E. Evaluation of PSD estimators for late reverberation

Figure 6 shows the log error obtained using measured RIRs in the room with $T_{60} = 0.7$ s for varying RSNR. In contrast to Sec. VIII-D, this experiment includes all PSD estimators. The trends and the relative behavior of the estimators confirm the results from the controlled stationary experiment in Fig. 2, but the variances are much larger due to model mismatches. However, the coherence-based methods show a positive bias of their mean. The temporal model-based estimators LRSV and CTF behave differently and generally show a lower mean log error than the spatial coherence-based estimators. Therefore, the temporal model-based estimators yield less overestimation, which is beneficial in terms of speech distortion. The CTF-based estimator is among the most robust estimators at low RSNRs. The errors of the ML-EM with diffuse coherence and the ML-EM with estimated coherence have the same variance, but surprisingly, the mean of the ML-EM with diffuse coherence is slightly closer to 0 dB.

Fig. 6. Mean and standard deviation of log error for $T_{60} = 0.7$ s.

Figure 7 shows the log error for varying $T_{60}$ and source distances at an RSNR of 15 dB. We can observe that the log error decreases towards higher reverberation times and larger source distances, i.e. for decreasing DRR. This confirms the trends from Fig. 3. In Fig. 8, the mean log error is grouped depending on the true bin-wise (local) DRR in steps of 5 dB using the data from all acoustic conditions shown in Fig. 7. It is interesting to note that all coherence-based PSD estimators show a similar behavior, in contrast to the temporal model-based PSD estimators LRSV and CTF. The latter two are more robust against overestimation at high DRRs, which we expect to result in less speech distortion.
The trends shown in the previous section for the coherence-based estimators can be confirmed when they are used to estimate the late reverberation PSD. The temporal model-based estimators yield more underestimation and less overestimation than the spatial coherence-based estimators. For the tested rooms, the coherence-based methods show a positive bias of the mean error compared to the temporal model-based methods.
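The grouping of the log error by the true bin-wise DRR in steps of 5 dB, as used for Fig. 8, can be sketched as follows. The analysis range of -20 dB to 20 dB and all names are assumptions for illustration; bins without data are returned as NaN.

```python
import numpy as np

def mean_log_error_by_drr(log_err_db, drr_db, lo=-20.0, hi=20.0, step=5.0):
    """Average the log error within DRR bins [edge_b, edge_b + step).

    log_err_db, drr_db : arrays of equal shape (one value per TF bin)
    Returns (bin_centers, mean_error_per_bin).
    """
    edges = np.arange(lo, hi + step, step)
    err = np.asarray(log_err_db, dtype=float).ravel()
    drr = np.asarray(drr_db, dtype=float).ravel()
    idx = np.digitize(drr, edges) - 1          # bin index of every TF bin
    centers = edges[:-1] + step / 2.0
    means = np.full(centers.shape, np.nan)
    for b in range(len(centers)):
        sel = idx == b                         # values outside [lo, hi) never match
        if np.any(sel):
            means[b] = err[sel].mean()
    return centers, means
```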

Fig. 7. Mean and standard deviation of log error in rooms with varying $T_{60}$ and source distances and RSNR = 15 dB.

Fig. 8. Mean log error depending on the bin-wise DRR using measured RIRs.

Fig. 9. Speech distortion vs. interference reduction for RSNR = 15 dB.

F. Performance of the spatial filter using the late reverberation PSD estimates

In this subsection we investigate the performance of the MWF using the various PSD estimates. As there are no significant differences between the spatial coherence-based estimators observable at higher RSNRs (cf. Fig. 6), we present results here only for RSNR = 15 dB. In this experiment, the oracle late reverberation PSD is used either with the diffuse coherence (4) or with the oracle late reverberation coherence (64) to investigate the mismatch effect of the diffuse field model.

Figure 9 shows the interference reduction (IR) vs. the speech distortion index (SDI) as computed by (67) and (68). The optimal point lies in the lower right corner. The best performance is obtained by using the oracle PSD with oracle reverberation coherence, which has a clear advantage over the oracle PSD with the theoretical diffuse coherence matrix. The estimators performing closest to the oracle PSD are the temporal model-based methods LRSV and CTF. Among the coherence-based methods, the PSD LS, ML-EM diff.
coh., ML Newton and Blocking ML root perform slightly better than Blocking PSD LS, BF LCMV and CDR. The Blocking ML Newton has a lower SDI at the expense of less IR, while the ML-EM est. coh. surprisingly performs worse, with a low IR.

Fig. 10. Improvement of perceptual measures for RSNR = 15 dB.

The improvement of CD, PESQ and SRMR compared to the unprocessed reference microphone is shown in Fig. 10 (higher values are better). While the oracle PSD with reverberation coherence clearly achieves the best performance, the results for most estimators are rather close, except for the ML-EM with estimated coherence, which clearly performs worse. This could be explained by the estimated coherence not being accurate enough, since the ML-EM using the theoretical coherence yields the best performance of all estimators in terms of CD, PESQ and SRMR. Note that the estimators with the best values in Fig. 10 are not necessarily the best sounding ones, as this judgement is highly subjective, and the speech distortion shown in Fig. 9 also plays a large role. Subjective listening to the processed signals confirms that some estimators produce very similar

results, while others can sound very different (see footnote 1), as represented in Fig. 10. Perceptual differences between the estimators are more prominent at lower RSNRs, while at RSNRs above 25 dB, the coherence-based estimators become almost indistinguishable perceptually (see Fig. 6). The tradeoff between speech distortion and interference reduction (see Fig. 9) is clearly audible, which can guide the choice of estimator depending on subjective preference and application. While Fig. 9 suggests that the temporal model estimators are superior, it has to be kept in mind that these rely on information about the reverberation time, which is challenging to estimate in practice [68], while the coherence-based estimators rely on information about the DOA, which is commonly easier to estimate in practice. The best performing coherence-based estimator with low complexity is the PSD LS.

Fig. 11. Speech distortion vs. interference reduction without and with bias compensation, RSNR = 15 dB.

G. Evaluation of bias compensation for coherence-based reverberation PSD estimators

In this section, we evaluate the bias compensation method for selected coherence-based estimators. The compensation function shown in Fig. 1 was trained using RIRs simulated by the image method [69], while for the following evaluation, the measured RIRs from the REVERB dataset and different speech and noise data were used. The results for some selected coherence-based estimators without and with the proposed bias compensation function are shown in Figs. 11 and 12. We can see in Fig. 11 that for those estimators the bias compensation method proposed in Sec. VII significantly reduces the speech distortion, while sacrificing only a small amount of interference reduction.
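The compensation step of Sec. VII, i.e. the DDR estimate, the exponential bias function (59) and the correction (60), can be sketched as below. The parameters `a` and `b` must come from a fit such as the one in Fig. 1; the values in the usage line are placeholders, not the fitted values, and the lower limit on the DDR estimate is an added safeguard that is not described in the text.

```python
import numpy as np

def compensate_bias(phi_d_hat, phi_y, phi_v, a, b, ddr_floor=1e-3):
    """Apply the exponential bias compensation to a late reverb PSD estimate.

    phi_d_hat, phi_y, phi_v : arrays of equal shape (per TF bin)
    a, b                    : parameters of the fitted bias function (59)
    """
    phi_d_hat = np.asarray(phi_d_hat, dtype=float)
    # DDR estimate: (phi_y - phi_d_hat - phi_v) / phi_d_hat
    ddr = (phi_y - phi_d_hat - phi_v) / np.maximum(phi_d_hat, 1e-12)
    ddr_db = 10.0 * np.log10(np.maximum(ddr, ddr_floor))  # floored (assumption)
    c_db = a * np.exp(b * ddr_db)                         # bias in dB, (59)
    return 10.0 ** (-c_db / 10.0) * phi_d_hat             # compensated PSD, (60)

# Placeholder parameters, for illustration only
phi_comp = compensate_bias(np.array([1.0]), np.array([11.0]), np.array([0.0]),
                           a=1.0, b=0.1)
```

For positive `a`, the compensated PSD is never larger than the uncompensated estimate, and the reduction grows with the estimated DDR when `b` is positive, which matches the intent of counteracting overestimation at high DDR.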
The perceptual measures in Fig. 12 show an improvement of CD and PESQ when using bias compensation, while the SRMR suffers slightly from bias compensation. Informal subjective listening confirms that the speech distortion can be reduced by the proposed bias compensation method.

Fig. 12. Improvement of perceptual measures without and with bias compensation, RSNR = 15 dB.

Footnote 1: Sound examples can be found online at resources/2017-compare-psd-estimators

IX. CONCLUSIONS

We reviewed and classified a variety of late reverberation PSD estimators that can be used for dereverberation. The majority of estimators are based on a spatial coherence model, but estimators exploiting temporal models have also been investigated. It was shown in extensive controlled and realistic experiments that differences between the spatial coherence-based estimators are rather small, where only a few estimators have limitations and achieve below-average results. We showed that all spatial coherence-based estimators under test suffer from the same issues, i.e. overestimation in high DDR and low RSNR conditions. Temporal model-based estimators are less biased at high DRR and mostly yield less speech distortion, but also less interference reduction. Furthermore, we proposed a method to compensate the systematic overestimation of the spatial coherence-based estimators in high DDR conditions, which greatly reduced the speech distortion. Using this bias compensation, results similar to those of the temporal model-based estimators can be achieved using spatial coherence-based estimators with low complexity and without information about the room acoustics. Future work could be to develop a PSD estimator that combines spatial and temporal models.

REFERENCES

[1] R. Beutelmann and T. Brand, "Prediction of speech intelligibility in spatial noise and reverberation for normal-hearing and hearing-impaired listeners," J. Acoust. Soc. Am., vol. 120, no.
1.
[2] T. Yoshioka, A. Sehr, M. Delcroix, K. Kinoshita, R. Maas, T. Nakatani, and W. Kellermann, "Making machines understand us in reverberant rooms: Robustness against reverberation for automatic speech recognition," IEEE Signal Processing Magazine, vol. 29, no. 6, Nov.
[3] G. Xu, H. Liu, L. Tong, and T. Kailath, "A least-squares approach to blind channel identification," IEEE Trans. Signal Process., vol. 43, no. 12, Dec.
[4] L. Tong and S. Perreau, "Multichannel blind identification: from subspace to maximum likelihood methods," Proc. IEEE, vol. 86, no. 10.
[5] Y. Huang and J. Benesty, "A class of frequency-domain adaptive approaches to blind multichannel identification," IEEE Trans. Signal Process., vol. 51, no. 1, Jan.
[6] M. Miyoshi and Y. Kaneda, "Inverse filtering of room acoustics," IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 2, Feb.
[7] F. Lim, W. Zhang, E. A. P. Habets, and P. A. Naylor, "Robust multichannel dereverberation using relaxed multichannel least squares," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 9, Sept.
[8] I. Kodrasi and S. Doclo, "Joint dereverberation and noise reduction based on acoustic multi-channel equalization," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 4, April.
[9] T. Yoshioka, T. Nakatani, and M. Miyoshi, "Integrated speech enhancement method using noise suppression and dereverberation," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 2, Feb. 2009.

[10] T. Yoshioka and T. Nakatani, "Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 10, Dec.
[11] M. Togami, Y. Kawaguchi, R. Takeda, Y. Obuchi, and N. Nukaga, "Optimized speech dereverberation from probabilistic perspective for time varying acoustic transfer function," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 7, Jul.
[12] A. Jukić, T. van Waterschoot, T. Gerkmann, and S. Doclo, "Multichannel linear prediction-based speech dereverberation with sparse priors," IEEE Trans. Audio, Speech, Lang. Process., vol. 23, no. 9, Sept.
[13] B. Yegnanarayana and P. S. Murthy, "Enhancement of reverberant speech using LP residual signal," IEEE Trans. Speech Audio Process., vol. 8, no. 3, May.
[14] N. D. Gaubitch and P. A. Naylor, "Spatiotemporal averaging method for enhancement of reverberant speech," in Proc. IEEE Intl. Conf. Digital Signal Processing (DSP), Cardiff, UK, Jul.
[15] K. Lebart, J. M. Boucher, and P. N. Denbigh, "A new method based on spectral subtraction for speech de-reverberation," Acta Acoustica, vol. 87.
[16] E. A. P. Habets, S. Gannot, and I. Cohen, "Late reverberant spectral variance estimation based on a statistical model," IEEE Signal Process. Lett., vol. 16, no. 9, Sep.
[17] X. Bao and J. Zhu, "An improved method for late-reverberant suppression based on statistical models," Speech Communication, vol. 55, no. 9, Oct.
[18] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing. Berlin, Germany: Springer-Verlag.
[19] C. Marro, Y. Mahieux, and K. Simmer, "Analysis of noise reduction and dereverberation techniques based on microphone arrays with postfiltering," IEEE Trans. Speech Audio Process., vol. 6, no. 3.
[20] E. A. P. Habets, "Multi-microphone spectral enhancement," in Speech Dereverberation, P. A. Naylor and N. D. Gaubitch, Eds. Springer.
[21] S.
Braun and E. A. P. Habets, "A multichannel diffuse power estimator for dereverberation in the presence of multiple sources," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, no. 1, pp. 1-14, Dec.
[22] B. Cauchi, I. Kodrasi, R. Rehr, S. Gerlach, A. Jukić, T. Gerkmann, S. Doclo, and S. Goetze, "Combination of MVDR beamforming and single-channel spectral processing for enhancing noisy and reverberant speech," EURASIP Journal on Advances in Signal Processing, vol. 2015, no. 1, p. 61.
[23] O. Schwartz, S. Gannot, and E. Habets, "Multi-microphone speech dereverberation and noise reduction using relative early transfer functions," IEEE Trans. Audio, Speech, Lang. Process., vol. 23, no. 2, Jan.
[24] A. Kuklasiński, S. Doclo, S. Jensen, and J. Jensen, "Maximum likelihood PSD estimation for speech enhancement in reverberation and noise," IEEE Trans. Audio, Speech, Lang. Process., vol. 24, no. 9.
[25] E. A. P. Habets and S. Gannot, "Dual-microphone speech dereverberation using a reference signal," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. IV, Honolulu, USA, Apr. 2007.
[26] S. Braun, D. P. Jarrett, J. Fischer, and E. A. P. Habets, "An informed spatial filter for dereverberation in the spherical harmonic domain," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, May.
[27] M. Jeub, C. Nelke, C. Beaugeant, and P. Vary, "Blind estimation of the coherent-to-diffuse energy ratio from noisy speech signals," in Proc. European Signal Processing Conf. (EUSIPCO), Barcelona, Spain, 2011.
[28] O. Thiergart, G. Del Galdo, and E. A. P. Habets, "Diffuseness estimation with high temporal resolution via spatial coherence between virtual first-order microphones," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, Oct. 2011.
[29] A. Schwarz, K. Reindl, and W.
Kellermann, A two-channel reverberation suppression scheme base on blin signal separation an Wiener filtering, in Proc. IEEE Intl. Conf. on Acoustics, Speech an Signal Processing (ICASSP), Dec. 2012, pp [30] I. McCowan an H. Bourlar, Microphone array post-filter base on noise fiel coherence, IEEE Trans. Speech Auio Process., vol. 11, no. 6, pp , Nov [31] S. Lefkimmiatis an P. Maragos, A generalize estimation approach for linear an nonlinear microphone array post-filters, Speech Communication, vol. 49, no. 7-8, pp , Jul [32] U. Kjems an J. Jensen, Maximum likelihoo base noise covariance matrix estimation for multi-microphone speech enhancement, in Proc. European Signal Processing Conf. (EUSIPCO), Bucharest, Romania, Aug. 2012, pp [33] K. Reinl, Y. Zheng, A. Schwarz, S. Meier, R. Maas, A. Sehr, an W. Kellermann, A stereophonic acoustic signal extraction scheme for noisy an reverberant environments, Computer Speech & Language, vol. 27, no. 3, pp , [34] L. Wang, T. Gerkmann, an S. Doclo, Noise power spectral ensity estimation using maxnsr blocking matrix, IEEE/ACM Trans. Auio, Speech, Lang. Process., vol. 23, no. 9, pp , Sep [35] O. Schwartz, S. Braun, S. Gannot, an E. Habets, Maximum likelihoo estimation of the late reverberant power spectral ensity in noisy environments, in Proc. IEEE Workshop on Applications of Signal Processing to Auio an Acoustics (WASPAA), New York, USA, Oct. 2015, pp [36] O. Thiergart, M. Taseska, an E. Habets, An informe parametric spatial filter base on instantaneous irection-of-arrival estimates, IEEE/ACM Trans. Auio, Speech, Lang. Process., vol. 22, no. 12, pp , Dec [37] O. Thiergart an E. Habets, Extracting reverberant soun using a linearly constraine minimum variance spatial filter, IEEE Signal Process. Lett., vol. 21, no. 5, pp , Mar [38] A. Kuklasinski, S. Doclo, S. Jensen, an J. Jensen, Maximum likelihoo base multi-channel isotropic reverberation reuction for hearing ais, in Proc. European Signal Processing Conf. 
(EUSIPCO), Lisbon, Portugal, Sep. 2014, pp [39] O. Schwartz, S. Gannot, an E. A. P. Habets, Joint maximum likelihoo estimation of late reverberant an speech power spectral ensity in noisy environments, in Proc. IEEE Intl. Conf. on Acoustics, Speech an Signal Processing (ICASSP), Mar. 2016, pp [40] O. Schwartz, S. Gannot, an E. Habets, An expectation-maximization algorithm for multi-microphone speech ereverberation an noise reuction with coherence matrix estimation, IEEE Trans. Auio, Speech, Lang. Process., vol. 24, no. 9, pp , [41] O. Schwartz, S. Gannot, an E. A. P. Habets, Joint estimation of late reverberant an speech power spectral ensities in noisy environments using Frobenius norm, in Proc. European Signal Processing Conf. (EUSIPCO), Aug. 2016, pp [42] A. Schwarz an W. Kellermann, Coherent-to-iffuse power ratio estimation for ereverberation, IEEE/ACM Trans. Auio, Speech, Lang. Process., vol. 23, no. 6, pp , June [43] S. Braun, B. Schwartz, S. Gannot, an E. A. P. Habets, Late reverberation PSD estimation for single-channel ereverberation using relative convolutive transfer functions, in Proc. Intl. Workshop Acoust. Signal Enhancement (IWAENC), Xi an, China, Sep [44] B. F. Cron an C. H. Sherman, Spatial-correlation functions for various noise moels, J. Acoust. Soc. Am., vol. 34, no. 11, pp , Nov [45] O. Thiergart, G. Del Galo, an E. A. P. Habets, On the spatial coherence in mixe soun fiels an its application to signal-to-iffuse ratio estimation, J. Acoust. Soc. Am., vol. 132, no. 4, pp , [46] M. Jeub, M. Dorbecker, an P. Vary, A semi-analytical moel for the binaural coherence of noise fiels, IEEE Signal Processing Letters, vol. 18, no. 3, pp , March [47] K. U. Simmer, J. Bitzer, an C. Marro, Post-filtering techniques, in Microphone Arrays: Signal Processing Techniques an Applications, M. S. Branstein an D. B. War, Es. Berlin, Germany: Springer- Verlag, 2001, ch. 3, pp [48] Y. Ephraim an D. 
Malah, Speech enhancement using a minimummean square error short-time spectral amplitue estimator, IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 6, pp , Dec [49] R. Martin, Noise power spectral ensity estimation base on optimal smoothing an minimum statistics, IEEE Trans. Speech Auio Process., vol. 9, pp , Jul [50] I. Cohen an B. Berugo, Noise estimation by minima controlle recursive averaging for robust speech enhancement, IEEE Signal Process. Lett., vol. 9, no. 1, pp , Jan [51] T. Gerkmann an R. C. Henriks, Unbiase MMSE-base noise power estimation with low complexity an low tracking elay, IEEE Trans. Auio, Speech, Lang. Process., vol. 20, no. 4, pp , May 2012.

[52] F. Heese and P. Vary, Noise PSD estimation by logarithmic baseline tracing, in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, Apr. 2015, pp
[53] Z. Chen, G. K. Gokeda, and Y. Yu, Introduction to Direction-of-Arrival Estimation. London, UK: Artech House, 2010.
[54] T. E. Tuncer and B. Friedlander, Eds., Classical and Modern Direction-of-Arrival Estimation. Burlington, USA: Academic Press, 2009.
[55] S. Markovich-Golan, S. Gannot, and I. Cohen, A sparse blocking matrix for multiple constraints GSC beamformer, in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP). Kyoto, Japan: IEEE, Mar. 2012, pp. 197-200.
[56] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press,
[57] O. Thiergart, G. Del Galdo, and E. A. P. Habets, Signal-to-reverberant ratio estimation based on the complex spatial coherence between omnidirectional microphones, in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2012.
[58] R. Talmon, I. Cohen, and S. Gannot, Relative transfer function identification using convolutive transfer function approximation, IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 4, pp , May
[59] J. Eaton, M. Brookes, and P. A. Naylor, A comparison of non-intrusive SNR estimation algorithms and the use of mapping functions, in Proc. European Signal Processing Conf. (EUSIPCO), Sep. 2013, pp
[60] K. Kinoshita, M. Delcroix, S. Gannot, E. A. P. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj, A. Sehr, and T. Yoshioka, A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research, EURASIP Journal on Advances in Signal Processing, vol. 2016, no. 1, p. 7, Jan
[61] E. B. Union. (1988) Sound quality assessment material recordings for subjective tests. [Online]. Available:
[62] E. A. P. Habets and S. Gannot, Generating sensor signals in isotropic noise fields, J. Acoust. Soc. Am., vol. 122, no. 6, pp , Dec
[63] A. Hald, Statistical Theory with Engineering Applications, 1st ed. John Wiley & Sons,
[64] N. Kitawaki, H. Nagabuchi, and K. Itoh, Objective quality evaluation for low bit-rate speech coding systems, IEEE J. Sel. Areas Commun., vol. 6, no. 2, pp ,
[65] ITU-T, Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs, International Telecommunications Union (ITU-T) Recommendation P.862, Feb
[66] J. F. Santos, M. Senoussaoui, and T. H. Falk, An updated objective intelligibility estimation metric for normal hearing listeners under noise and reverberation, in Proc. Intl. Workshop Acoust. Signal Enhancement (IWAENC), Antibes, France, Sep
[67] M. S. Brandstein and D. B. Ward, Eds., Microphone Arrays: Signal Processing Techniques and Applications. Berlin, Germany: Springer-Verlag,
[68] J. Eaton, N. D. Gaubitch, A. H. Moore, and P. A. Naylor, Estimation of room acoustic parameters: The ACE challenge, IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 10, pp , Oct
[69] J. B. Allen and D. A. Berkley, Image method for efficiently simulating small-room acoustics, J. Acoust. Soc. Am., vol. 65, no. 4, pp , Apr

Sebastian Braun received the M.Sc. degree in electrical engineering and sound engineering from the University of Music and Dramatic Arts Graz and the Technical University Graz in He then joined the International Audio Laboratories Erlangen (a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg and Fraunhofer IIS) as a Ph.D. candidate in the field of acoustic signal processing. His current research interests include spatial audio processing, spatial filtering, speech enhancement (dereverberation, noise reduction, echo cancellation, feedback cancellation, automatic gain control), adaptive filtering, and binaural processing techniques.

Adam Kuklasiński received the M.Sc. degree in acoustics from Adam Mickiewicz University, Poznań, Poland, and the Ph.D. degree in digital signal processing from Aalborg University, Aalborg, Denmark, in 2012 and 2016, respectively. During his Ph.D. study, he was a Marie Skłodowska-Curie fellow in the ITN-DREAMS project. His scientific interests include statistical signal processing, speech dereverberation, and binaural cue preservation in hearing aids.

Ofer Schwartz received his B.Sc. (cum laude) and M.Sc. degrees in Electrical Engineering from Bar-Ilan University, Israel, in 2010 and 2013, respectively. He is now pursuing his Ph.D. in Electrical Engineering at the Speech and Signal Processing laboratory of the Faculty of Engineering at Bar-Ilan University. His research interests include statistical signal processing and in particular noise reduction and dereverberation using microphone arrays and speaker localization and tracking. In 2017 he joined the audio department in CEVA-DSP, Herzelia, Israel, as a senior algorithm researcher and developer.

Oliver Thiergart is a researcher at the Fraunhofer Institute for Integrated Circuits (IIS). He received the Dipl.-Ing. (M.Sc.) degree in Mediatechnology from the Ilmenau University of Technology, Germany, in He then joined the Audio Department of the Fraunhofer IIS in Erlangen. In 2011, he became a member of the International Audio Laboratories Erlangen where he received his Ph.D. in the field of parametric spatial sound processing in Oliver's current research interests include parametric spatial sound processing, microphone arrays, spatial filtering, and parameter estimation.


Emanuël A. P. Habets (S'02-M'07-SM'11) is an Associate Professor at the International Audio Laboratories Erlangen (a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg and Fraunhofer IIS), and Head of the Spatial Audio Research Group at Fraunhofer IIS, Germany. He received the B.Sc. degree in electrical engineering from the Hogeschool Limburg, The Netherlands, in 1999, and the M.Sc. and Ph.D. degrees in electrical engineering from the Technische Universiteit Eindhoven, The Netherlands, in 2002 and 2007, respectively. From 2007 until 2009, he was a Postdoctoral Fellow at the Technion - Israel Institute of Technology and at the Bar-Ilan University, Israel. From 2009 until 2010, he was a Research Fellow in the Communication and Signal Processing Group at Imperial College London, U.K. His research activities center around audio and acoustic signal processing, and include spatial audio signal processing, spatial sound recording and reproduction, speech enhancement (dereverberation, noise reduction, echo reduction), and sound localization and tracking. Dr. Habets was a member of the organization committee of the 2005 International Workshop on Acoustic Echo and Noise Control (IWAENC) in Eindhoven, The Netherlands, a general co-chair of the 2013 International Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) in New Paltz, New York, and general co-chair of the 2014 International Conference on Spatial Audio (ICSA) in Erlangen, Germany. He was a member of the IEEE Signal Processing Society Standing Committee on Industry Digital Signal Processing Technology ( ), a Guest Editor for the IEEE Journal of Selected Topics in Signal Processing and the EURASIP Journal on Advances in Signal Processing, and an Associate Editor of the IEEE Signal Processing Letters ( ). He is the recipient, with S. Gannot and I. Cohen, of the 2014 IEEE Signal Processing Letters Best Paper Award. Currently, he is a member of the IEEE Signal Processing Society Technical Committee on Audio and Acoustic Signal Processing, vice-chair of the EURASIP Special Area Team on Acoustic, Sound and Music Signal Processing, and Editor in Chief of the EURASIP Journal on Audio, Speech, and Music Processing.

Simon Doclo (S'95-M'03-SM'13) received the M.Sc. degree in electrical engineering and the Ph.D. degree in applied sciences from the Katholieke Universiteit Leuven, Belgium, in 1997 and From 2003 to 2007 he was a Postdoctoral Fellow with the Research Foundation Flanders at the Electrical Engineering Department (Katholieke Universiteit Leuven) and the Cognitive Systems Laboratory (McMaster University, Canada). From 2007 to 2009 he was a Principal Scientist with NXP Semiconductors at the Sound and Acoustics Group in Leuven, Belgium. Since 2009 he has been a full professor at the University of Oldenburg, Germany, and scientific advisor for the project group Hearing, Speech and Audio Technology of the Fraunhofer Institute for Digital Media Technology. His research activities center around signal processing for acoustical and biomedical applications, more specifically microphone array processing, active noise control, acoustic sensor networks and hearing aid processing. Prof. Doclo received the Master Thesis Award of the Royal Flemish Society of Engineers in 1997 (with Erik De Clippel), the Best Student Paper Award at the International Workshop on Acoustic Echo and Noise Control in 2001, the EURASIP Signal Processing Best Paper Award in 2003 (with Marc Moonen) and the IEEE Signal Processing Society 2008 Best Paper Award (with Jingdong Chen, Jacob Benesty, Arden Huang). He is a member of the IEEE Signal Processing Society Technical Committee on Audio and Acoustic Signal Processing, the EURASIP Special Area Team on Acoustic, Speech and Music Signal Processing and the EAA Technical Committee on Audio Signal Processing. Prof. Doclo served as guest editor for several special issues (IEEE Signal Processing Magazine, Elsevier Signal Processing) and is associate editor for IEEE/ACM Transactions on Audio, Speech and Language Processing and EURASIP Journal on Advances in Signal Processing.

Sharon Gannot (S'92-M'01-SM'06) received the B.Sc. degree (summa cum laude) from the Technion - Israel Institute of Technology, Haifa, Israel, in 1986, and the M.Sc. (cum laude) and Ph.D. degrees from Tel-Aviv University, Tel Aviv, Israel, in 1995 and 2000, respectively, all in electrical engineering. In 2001, he held a Postdoctoral position at the Department of Electrical Engineering, KULeuven, Belgium. From 2002 to 2003, he held a research and teaching position at the Faculty of Electrical Engineering, Technion - Israel Institute of Technology, Haifa, Israel. Currently, he is a Full Professor at the Faculty of Engineering, Bar-Ilan University, Ramat Gan, Israel, where he is heading the Speech and Signal Processing laboratory and the Signal Processing Track. His research interests include multi-microphone speech processing and specifically distributed algorithms for ad hoc microphone arrays for noise reduction and speaker separation, dereverberation, single microphone speech enhancement, and speaker localization and tracking. Prof. Gannot has served as an Associate Editor of the EURASIP Journal of Advances in Signal Processing in , and as an Editor of several special issues on multi-microphone speech processing of the same journal. He has also served as a Guest Editor of the ELSEVIER Speech Communication and Signal Processing journals. He has served as an Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing in , and an area chair for the same journal Currently, he serves as a moderator for arXiv in the field of audio and speech processing. He also serves as a reviewer of many IEEE journals and conferences. He is a member of the Audio and Acoustic Signal Processing technical committee of the IEEE since January Since January 2017, he serves as the committee chair. He is also a member of the technical and steering committee of the International Workshop on Acoustic Signal Enhancement (IWAENC) since 2005 and was the General Co-chair of IWAENC held at Tel-Aviv, Israel in August He has served as the General Co-chair of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) in October He was selected (with colleagues) to present tutorial sessions in ICASSP 2012, EUSIPCO 2012, ICASSP 2013, and EUSIPCO 2013 and was a keynote speaker for IWAENC 2012 and LVA/ICA He received the Bar-Ilan University outstanding lecturer award for 2010 and Prof. Gannot is also a co-recipient of ten best paper awards.

Jesper Jensen is a Senior Scientist with Oticon A/S, Denmark, where he is responsible for scouting and development of signal processing concepts for hearing instruments. He is also a Professor in Dept. Electronic Systems, Aalborg University. He is also a co-head of the Centre for Acoustic Signal Processing Research (CASPR) at Aalborg University. His work on speech intelligibility prediction received the 2017 IEEE Signal Processing Society's best paper award. His main interests are in the area of acoustic signal processing, including signal retrieval from noisy observations, intelligibility enhancement of speech signals, signal processing for hearing aid applications, and perceptual aspects of signal processing.
