IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 24, NO. 7, JULY 2016


Spotforming: Spatial Filtering With Distributed Arrays for Position-Selective Sound Acquisition

Maja Taseska, Student Member, IEEE, and Emanuël A. P. Habets, Senior Member, IEEE

Abstract—Hands-free capture of speech often requires extraction of sources from a certain spot of interest (SOI), while reducing interferers and background noise. Although state-of-the-art spatial filters are fully data-dependent and computed using the power spectral density (PSD) matrices of the desired and the undesired signals, the existing solutions to extract sources from a SOI are only partially data-dependent. Estimating the time-varying PSD matrices from the data is a challenging problem, especially in dynamic and quickly time-varying acoustic scenes. Hence, the spot signal statistics are often pre-computed based on a near-field propagation model, resulting in suboptimal filters. In this work, we propose a fully data-dependent spatial filtering framework for the extraction of speech signals that originate from a SOI. To achieve position-based spatial selectivity, distributed arrays are used, which offer larger spatial diversity than arrays of closely spaced microphones. The PSD matrices of the desired and the undesired signals are updated at each time-frequency bin by using a minimum Bayes risk detector that is based on a probabilistic model of narrowband position estimates. The proposed framework is applicable in challenging multi-talk situations, without requiring any prior information except the geometry, location, and orientation of the arrays.

Index Terms—Source extraction, distributed arrays, spatial filtering, PSD matrix estimation, signal detection.

I. INTRODUCTION

IN HANDS-FREE applications involving human-to-human or human-to-machine interaction, the desired speech signal is often contaminated by background noise and interferers.
Therefore, to ensure high-quality speech acquisition, enhancement of the desired speech is necessary. The objectives of multi-microphone speech enhancement systems can be coarsely classified into one of the following categories: i) extraction of a subset of sources from a mixture [1], ii) source separation, where a separate filter is computed to extract each source [2]-[4], and iii) extraction of sources that originate from a user-defined SOI [5]-[10]. In this work, we focus on the last problem, hereafter referred to as acoustic spotforming, to emphasize that, in contrast to traditional beamforming, which extracts sources from desired directions [11], [12], spotforming extracts sources from a desired SOI. Directional signals originating from the SOI will be referred to as spot signals, while the background noise and directional signals from outside the SOI represent undesired signals. The term spotforming has been used earlier in ultra-wideband (UWB) array processing to emphasize that UWB waveforms can focus on spots, as opposed to directions in narrowband processing [13]. Although not referred to as acoustic spotformers, beamformers which can achieve this task are known as soft-constrained, space-constrained, or region-based beamformers [7], [8], [10]. Alternatively, spotforming can be achieved using ideas from robust adaptive beamforming.

Manuscript received August 28, 2015; revised January 17, 2016 and March 01, 2016; accepted March 01, 2016. Date of publication March 10, 2016; date of current version May 09, 2016. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Richard Christian Hendriks. The authors are with the International Audio Laboratories Erlangen, University of Erlangen-Nuremberg and Fraunhofer IIS, Erlangen 91058, Germany (e-mail: maja.taseska@audiolabs-erlangen.de). Color versions of one or more of the figures in this paper are available online. Digital Object Identifier /TASLP
The source location uncertainty region used in the design of robust beamformers can be interpreted as a SOI in the spotforming context. For instance, a spotformer can be realized by a linearly constrained minimum variance (LCMV) robust beamformer with eigenvector constraints that impose low distortion across the SOI [5], [9]. The approaches in [5], [7]-[10] are based on a near-field model, where the spot signal statistics are estimated by integrating the near-field steering vectors across the SOI. The statistics are then used to compute the maximum signal-to-noise ratio (max-SNR) filter [8], the minimum mean squared error (MMSE) (Wiener) filter [7], [10], or to design eigenvector constraints [5], [9]. However, as the statistics are data-independent, the resulting filters are sub-optimal. To include the room acoustics, the near-field propagation model can be substituted by measured (or estimated) acoustic transfer functions (ATFs), which have been shown to improve the speech quality and the spatial selectivity of the beamformers [14], [15]. Sound extraction from a volume using a filter-and-sum beamformer has been proposed in [15], where the filters are matched to the ATFs of multiple distributed microphones. This approach is data-independent (and hence sub-optimal), requires a large number of microphones to achieve good performance, and requires knowledge of the ATFs. A data-dependent approach based on estimated ATFs has been proposed in [14]: ATF characteristics in a given SOI are first extracted and then used during processing to identify and track the ATFs. This framework is applicable only for small-scale movements within the SOI, provided that a-priori knowledge of the desired signal subspace is available. None of the above-mentioned approaches [5], [7]-[10], [14], [15] considers scenarios with non-stationary and moving interferers, as often encountered in practice.
The estimation of time-varying statistics is often done by using full-band voice activity detectors (VADs) [10], [14] that identify periods when the desired signal is active. Such VADs work well if the undesired signal is relatively stationary compared to the desired signal, which is not the case for speech interferers.

© 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

A fully data-dependent spotformer applicable to highly dynamic scenarios was recently proposed by the present authors in [6]. In contrast to the state-of-the-art discussed previously, the data-dependent approach assumes neither a propagation model nor stationarity of the undesired signals. The PSD matrices of both the spot signal and the undesired signal are estimated online and used to compute the time-varying spotformer coefficients. The current paper is based on the ideas developed in [6], with the following additional contributions: (i) a more complete discussion of the optimal design and the computation of the spotformer using estimated PSD matrices, (ii) the formulation of a minimum Bayes risk signal detector used to estimate the PSD matrices, and (iii) an evaluation in different dynamic scenarios with measured and simulated data. Assuming relatively small spot sizes, a crucial observation underlying our work is that, due to the speech sparsity in the short-time Fourier transform (STFT) domain [16] and the online estimation of the PSD matrices, the spot signal PSD matrix at each time-frequency (TF) bin can be approximated by a rank-one matrix, even if there are multiple sources in the SOI. Therefore, the spot signal can be extracted by a minimum variance distortionless response (MVDR) filter with a time-varying constraint. In contrast to the state-of-the-art LCMV filters with multiple eigenvector constraints, the MVDR filter has a single constraint and offers more degrees of freedom to reduce undesired signals. While different formulations of the MVDR filter given the PSD matrices have been studied in the literature [17], the application of an MVDR filter with a time-varying constraint to solve the spotforming problem in dynamic scenarios is a novel contribution of this work. The time-varying PSD matrices need to be estimated from the microphone signals.
In recent research, spatial cues such as phase differences, direction of arrival (DOA) estimates, or position estimates have been used to detect the dominant source and update its PSD matrix [2]-[4], [18], [19]. We use this idea to estimate the PSD matrices in the context of spotforming [6]. At each TF bin, a position estimate is obtained by triangulation of the DOAs at the distributed arrays. By defining an appropriate model for the distribution of the position estimates, the probability that the spot signal is dominant, referred to as the spot probability, can be evaluated and used to classify each TF bin to the spot signal or the undesired signal. The classification can be done using a minimum Bayes risk rule, similarly to [20], where the goal was to detect a source using DOA estimates. While different ideas from our previous work are used to develop the spot signal detection system detailed in Section IV, the underlying probabilistic models are unique to the current work on spotforming. It is important to note that although distributed arrays are required to obtain narrowband position estimates for the spot signal detector, the spotformer can be computed using an arbitrary subset of arrays or microphones. We will experimentally show that, due to the spatial diversity of distributed arrays, multi-array spotforming improves the spatial selectivity compared to single-array spotforming, at the cost of a larger spot signal distortion. Determining the optimal subset of microphones is outside the scope of this paper. Furthermore, we assume that all signals are synchronized and available at a centralized processor. For details on signal synchronization, the reader is referred to [21] and references therein. The rest of the paper is organized as follows: in Section II, the signal model is described. In Section III, the state-of-the-art and the proposed spotforming methods are discussed.
In Section IV, the position-based minimum Bayes risk detector required for the PSD matrix estimation is proposed. A comprehensive performance evaluation is presented in Section V, and Section VI concludes the paper.

II. PROBLEM FORMULATION

Consider a setup of at least two distributed arrays, with at least two microphones each, where the total number of microphones is M. Let S denote a SOI, where signals originating from S are desired, while background noise and signals from outside S are undesired. Assuming point sources, the signal at the m-th microphone at time t is given by

y_m(t) = x_m(t) + u_m(t) + v_m(t) = \int_{r \in S} h_{r,m}(t) * s_r(t) \, dr + u_m(t) + v_m(t),   (1)

where * denotes the convolution operator, h_{r,m} is the room impulse response (RIR) between position r and the m-th microphone, s_r is the signal from a source at position r \in S, u_m is the sum of all signals from outside S, and v_m is the sum of background and sensor noise. If there is no source at position r, then s_r(t) = 0. For sufficiently long STFT frames, the multiplicative transfer function approximation [22] holds and y_m(t) is given in the STFT domain as follows

Y_m(n,k) = X_m(n,k) + U_m(n,k) + V_m(n,k) = \int_{r \in S} H_{r,m}(k) S_r(n,k) \, dr + U_m(n,k) + V_m(n,k),   (2)

where capital letters denote the TF-domain signals of the respective time-domain counterparts. In the rest of the paper, all processing is performed in the TF domain, and lower-case bold letters are used to denote vectors in the TF domain. Given the ATFs H_{r,m} and assuming H_{r,1} \neq 0, the relative transfer function (RTF) vector with respect to the first microphone, which describes the coupling between the microphones in response to a source located at r, is defined as

g_r(k) = [1, H_{r,2}(k)/H_{r,1}(k), ..., H_{r,M}(k)/H_{r,1}(k)]^T.   (3)

Note that the first microphone is chosen as a reference without loss of generality.
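As a concrete illustration of the RTF definition in (3), the following sketch normalizes a hypothetical ATF vector by its reference entry; the numbers are arbitrary stand-ins, not measured ATFs:

```python
import numpy as np

# Hypothetical ATF vector H_r for M = 4 microphones at one frequency bin
H_r = np.array([0.9 + 0.1j, 0.5 - 0.4j, -0.3 + 0.6j, 0.2 + 0.2j])

def rtf_vector(H):
    """RTF with respect to the first microphone, Eq. (3): g_r = H / H_1."""
    assert abs(H[0]) > 0, "the reference ATF must be non-zero"
    return H / H[0]

g_r = rtf_vector(H_r)  # first element is 1 by construction
```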
Stacking all microphone signals in a vector y, the signal model in vector notation is given by

y(n,k) = \int_{r \in S} g_r(k) X_{r,1}(n,k) \, dr + u(n,k) + v(n,k),   (4)

where X_{r,1} denotes the signal from a source located at position r, as captured at the first microphone. The PSD matrix of the microphone signals is given by \Phi_y(n,k) = E[y(n,k) y^H(n,k)]. If the different signals are modeled as mutually uncorrelated random processes, the following holds

\Phi_y(n,k) = \Phi_x(n,k) + \Phi_u(n,k) + \Phi_v(n,k),   (5)

where

\Phi_x(n,k) = \int_{r \in S} \phi_{r,1}(n,k) g_r(k) g_r^H(k) \, dr,   (6)

and \phi_{r,1} = E[|X_{r,1}|^2] is the PSD of X_{r,1}. In addition, we denote by \Phi_{u+v} and \Phi_{x+v} the PSD matrices of u + v and x + v, respectively. As the processes are mutually uncorrelated, it holds that \Phi_{u+v} = \Phi_u + \Phi_v and \Phi_{x+v} = \Phi_x + \Phi_v. The objective in this work is to compute an estimate of the signal X_1(n,k) = \int_{r \in S} X_{r,1}(n,k) \, dr, which represents the sum of all signals that originate from S, as captured at the first microphone. At each TF bin, X_1 is estimated by linearly combining the signals of all microphones, using a data-dependent, time-varying spotformer w as follows

\hat{X}_1(n,k) = w^H(n,k) y(n,k).   (7)

The spotformer should be able to extract the spot signal with low distortion while sufficiently reducing the undesired signals in non-stationary scenarios where the speech sources move, new sources appear, or existing sources disappear.

III. ACOUSTIC SPOTFORMING

In this part, we first review two state-of-the-art spotformers in Section III-A and discuss their limitations. The proposed data-dependent spotformer is described in Section III-B. In Section III-C, the recursive time-averaging approach for PSD matrix estimation is provided for completeness.

A. State-of-the-art Approaches to Spotforming

As mentioned in the introduction, spotforming can be realized using ideas from robust adaptive beamforming. Inspired by [5], [9], we consider the robust LCMV filter with eigenvector constraints. The low-distortion requirement across S is expressed by the M × S constraint matrix G_S (with S > M) as

G_S^H(k) w(n,k) = 1_{S \times 1},   (8)

where the columns of G_S are the near-field steering vectors for S sampled positions from the SOI.
If the ATF or RTF vectors are known for each position, they can substitute the near-field steering vectors to take the room acoustics into account. Note that although the near-field design has been used in [5], [9], we proceed with RTFs in order to avoid violations of the near-field model and to ensure a fair comparison to the proposed data-dependent spotformer. Eigenvector constraints are computed by substituting G_S in the overdetermined system (8) by its rank-r approximation, where r < M. The singular value decomposition (SVD) and the rank-r approximation of G_S are given by

G_S = U \Sigma V^H and G_{S,r} = U_r \Sigma_r V_r^H,   (9)

where U_r and V_r contain the first r columns of U and V, corresponding to the r largest singular values, and \Sigma_r is an r × r diagonal matrix containing these singular values. Using the rank-r matrix G_{S,r}, the new constraint is given by

V_r \Sigma_r U_r^H \hat{w} = 1_{S \times 1},   (10)

which can be rearranged into a similar form as (8), namely,

U_r^H \hat{w} = \Sigma_r^{-1} V_r^H 1_{S \times 1}.   (11)

Finally, the LCMV filter with eigenvector constraints is obtained by solving the following optimization problem

\arg\min_w w^H \Phi_{u+v} w, subject to (11).   (12)

Denoting the r × 1 constraint vector on the right-hand side of (11) by c, the LCMV filter is given by

w_{opt} = \Phi_{u+v}^{-1} U_r (U_r^H \Phi_{u+v}^{-1} U_r)^{-1} c.   (13)

The rank r required to ensure that the distortion across the SOI is lower than a given threshold can be determined from the eigenstructure of the matrix G_S [23]. An alternative way to realize a spotformer based on existing techniques is to use the ATF vector for the centroid of S as a constraint in an MVDR beamformer. This design was inspired by [15], where sounds from a SOI were extracted by a matched filter. The authors in [15] argue that due to the correlation between the ATFs of neighboring positions, a matched filter extracts sounds from a wider region. However, note that in this case, there is no explicit control of the size of S.
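The rank-r eigenvector-constraint construction of (9)-(13) can be sketched numerically as follows; the constraint matrix and the undesired-signal PSD matrix below are random synthetic stand-ins, not measured quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
M, S, r = 6, 10, 3  # microphones, sampled SOI positions, constraint rank (toy values)

# Synthetic stand-in for the M x S constraint matrix G_S of Eq. (8)
G_S = rng.standard_normal((M, S)) + 1j * rng.standard_normal((M, S))

# Synthetic Hermitian positive-definite undesired-signal PSD matrix
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_uv = A @ A.conj().T + np.eye(M)

# SVD and rank-r truncation, Eq. (9)
U, s, Vh = np.linalg.svd(G_S)
U_r = U[:, :r]                                        # first r left singular vectors
Sigma_r = np.diag(s[:r])                              # r x r singular values
c = np.linalg.solve(Sigma_r, Vh[:r, :] @ np.ones(S))  # constraint vector of Eq. (11)

# LCMV filter with eigenvector constraints, Eq. (13)
Pi = np.linalg.solve(Phi_uv, U_r)                     # Phi_{u+v}^{-1} U_r
w_opt = Pi @ np.linalg.solve(U_r.conj().T @ Pi, c)
```

By construction, the filter satisfies the reduced constraint U_r^H w = c of (11) exactly.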
To ensure a fair comparison to our proposed spotformer, we will use the RTF vector instead of the ATF vector in the implementation. The existing spotformers do not take the statistics of the spot signal into account. The constraints that ensure low distortion are computed using only prior information about the SOI and the propagation vectors. Furthermore, the undesired signal PSD matrix \Phi_{u+v} is usually estimated using full-band VADs, which is applicable if \Phi_{u+v} varies slowly compared to \Phi_x and if there are periods where the spot signal is absent.

B. Proposed Data-Dependent Approach to Spotforming

There are two key observations which underlie the data-dependent spotformer: (1) SOIs of relatively small sizes with only a few sources are common in most applications, and (2) the spot PSD matrix at each TF bin has a low rank due to the speech sparsity in the STFT domain [16]. Sparsity implies that at each TF bin the energy of only one source is dominant, and hence, \Phi_x(n,k) can be approximated by a rank-one matrix as

\Phi_x(n,k) \approx \phi_{x_1}(n,k) g(n,k) g^H(n,k),   (14)

where \phi_{x_1} is the spot signal PSD at the first microphone and g represents the RTF vector between the position of the dominant source and the microphones. Therefore, the spot signal at the first microphone can be extracted with an MVDR filter [24] obtained as the solution to

\arg\min_w w^H \Phi_{u+v}(n,k) w, subject to w^H g(n,k) = 1.   (15)

Solving (15) using Lagrange multipliers [17], we obtain

w_{opt}(n,k) = \Phi_{u+v}^{-1}(n,k) g(n,k) / (g^H(n,k) \Phi_{u+v}^{-1}(n,k) g(n,k)).   (16)

The realization of the data-dependent MVDR spotformer involves two main tasks: (1) finding a rank-one approximation of \Phi_x(n,k) that provides an optimal (in a sense that will be discussed next) constraint vector g(n,k) at each TF bin, and (2) estimating the PSD matrices \Phi_x(n,k) and \Phi_{u+v}(n,k). Under the single dominant source assumption, the first task is equivalent to RTF estimation [1], [25], [26], and state-of-the-art methods are described in the remainder of this section. The second task is extremely challenging in scenarios with non-stationary, moving, and appearing/disappearing sources and is detailed in Section IV.

1) MMSE-Based Rank-One Approximation: Under the rank-one approximation, for all realizations of the random process x(n,k) the following relation holds between the spot signal at the reference and the spot signal across all microphones

x(n,k) \approx g(n,k) X_1(n,k).   (17)

The optimal g(n,k) in the MMSE sense is obtained by solving

\arg\min_g E[(x - g X_1)^H (x - g X_1)].   (18)

If e_1 = [1, 0, ..., 0]^T is an M × 1 vector, the solution to (18) is

g(n,k) = \Phi_x(n,k) e_1 / (e_1^H \Phi_x(n,k) e_1).   (19)

Hence, the MMSE-optimal g(n,k) is given by the first column of \Phi_x, normalized by the PSD \phi_{x_1}. An estimate of \Phi_x required to evaluate (19) can be obtained as \Phi_x = \Phi_{x+v} - \Phi_v, where \Phi_{x+v} and \Phi_v are estimated by recursive temporal averaging (see Section III-C). In practice, due to estimation errors, \Phi_x might not be positive semi-definite at some TF bins. To avoid an erroneous look direction of the spotformer, the constraint g(n,k) is not updated at such TF bins and g(n-1,k) from the previous time frame is used.
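A minimal sketch of the covariance-subtraction constraint (19) followed by the MVDR spotformer (16); the spot and undesired PSD matrices below are synthetic toy examples (an exactly rank-one Φ_x), not estimates from real signals:

```python
import numpy as np

M = 4
g_true = np.array([1.0, 0.8 - 0.2j, -0.5 + 0.3j, 0.4j])

# Toy PSD matrices: an exactly rank-one spot PSD as in Eq. (14), plus a
# synthetic Hermitian positive-definite undesired-signal PSD (assumed)
Phi_x = 2.0 * np.outer(g_true, g_true.conj())
rng = np.random.default_rng(1)
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_uv = B @ B.conj().T + np.eye(M)

# Covariance-subtraction RTF estimate, Eq. (19): first column of Phi_x / phi_x1
g = Phi_x[:, 0] / Phi_x[0, 0]

# MVDR spotformer, Eq. (16)
num = np.linalg.solve(Phi_uv, g)   # Phi_{u+v}^{-1} g
w_opt = num / (g.conj() @ num)     # normalize by g^H Phi_{u+v}^{-1} g
```

For a rank-one Φ_x the estimate recovers the RTF exactly, and the resulting filter satisfies the distortionless constraint w^H g = 1.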
Under the single dominant source assumption, the expression (19) is also known as the covariance subtraction-based RTF estimator [26]. Although based on a single-source model, it was experimentally shown [27] that multiple sources can be extracted with reasonably low distortion, which corroborates its applicability in the spotforming context, where multiple sources in the spot might be present.

2) Least-Squares-Based Rank-One Approximation: A different way to approximate \Phi_x by a rank-one matrix is by minimizing the Frobenius norm of the matrix difference

\arg\min_g \| \Phi_x - \phi_{x_1} g g^H \|_F.   (20)

According to the matrix approximation lemma [28], the optimal solution for g is the principal eigenvector of \Phi_x. Note that due to the presence of background noise in the microphone signals, only an estimate of \Phi_{x+v} rather than \Phi_x can be obtained (see Section III-C). Assuming that an estimate of \Phi_v is available, g is obtained by first computing the principal eigenvector v_{max} of the whitened PSD matrix \Phi_v^{-1} \Phi_{x+v}, and performing de-whitening, i.e., g \propto \Phi_v v_{max}. The scaling is determined by definition, as the first element of g is equal to 1. To avoid the explicit inversion in \Phi_v^{-1} \Phi_{x+v}, the vector v_{max} can be computed from the generalized eigenvalue decomposition (GEVD) of the matrix pencil (\Phi_{x+v}, \Phi_v) [28]. Under the single dominant source assumption, this approach of computing g is known as the covariance whitening-based RTF estimator [26], [29], and it has been shown to outperform covariance subtraction in low signal-to-noise ratio (SNR) conditions [26]. Note that the complexity of performing the GEVD at each TF bin can be reduced by employing an adaptive estimation of the principal eigenvector [30]. Similarly to the covariance subtraction-based estimator, covariance whitening can also be applied to extract multiple sources due to the speech sparsity in the STFT domain.
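The covariance-whitening estimator can be sketched with a Cholesky-based GEVD; the pencil below is built from a known RTF so that the de-whitened principal generalized eigenvector recovers it exactly (a noise-free toy construction, not estimated data):

```python
import numpy as np

M = 4
g_true = np.array([1.0, 0.6 + 0.3j, -0.2 - 0.5j, 0.7 + 0j])

rng = np.random.default_rng(2)
C = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_v = 0.1 * (C @ C.conj().T) + 0.5 * np.eye(M)        # toy noise PSD
Phi_xv = 3.0 * np.outer(g_true, g_true.conj()) + Phi_v  # rank-one Phi_x plus Phi_v

# GEVD of the pencil (Phi_xv, Phi_v) via Cholesky whitening:
# Phi_xv v = lambda Phi_v v  <=>  (L^{-1} Phi_xv L^{-H}) u = lambda u, u = L^H v
L = np.linalg.cholesky(Phi_v)
A = np.linalg.solve(L, Phi_xv)                 # L^{-1} Phi_xv
Mw = np.linalg.solve(L, A.conj().T).conj().T   # L^{-1} Phi_xv L^{-H}, Hermitian
lam, U = np.linalg.eigh(Mw)                    # eigenvalues in ascending order
u_max = U[:, -1]                               # whitened principal eigenvector

# De-whitening: Phi_v v_max = L u_max is proportional to the RTF vector;
# the scaling fixes the first element to one
g = L @ u_max
g = g / g[0]
```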
3) Projection-Based Rank-One Approximation: When there are multiple sources in the SOI and the rank of \Phi_x increases, distortion of the spot signal is unavoidable when using an MVDR spotformer. In order to improve the performance of the MVDR filter in such scenarios, we proposed an RTF estimator in [31] that does not explicitly use a rank-one model for \Phi_x. The RTF of the dominant source at each TF bin is computed using the instantaneous signal y and an estimate of the higher-dimensional signal subspace. This approach can also be applied to obtain g in the spotformer, and we briefly review it here for completeness. The performance of this method is compared to the explicit rank-one model-based ones in Experiment 6 of Section V-C. Consider the GEVD of the matrix pencil (\Phi_{x+v}, \Phi_v) for a two-source scenario (rank-two model)

(\phi_{r_1} g_{r_1} g_{r_1}^H + \phi_{r_2} g_{r_2} g_{r_2}^H + \Phi_v) v = \lambda \Phi_v v.   (21)

Equation (21) can be rearranged as follows

c_1 g_{r_1} + c_2 g_{r_2} = (\lambda - 1) \Phi_v v, where c_1 = \phi_{r_1} g_{r_1}^H v, c_2 = \phi_{r_2} g_{r_2}^H v.   (22)

The generalized eigenvectors v provide two linear combinations of the RTF vectors and hence a basis for the signal subspace. As discussed in [31], due to the speech sparsity, two eigenvectors per frequency bin suffice to approximate the signal subspace for up to four concurrent sources. A basis U_x for the subspace can be computed by orthonormalization of the two principal generalized eigenvectors of (\Phi_{x+v}, \Phi_v). Let us denote the projection matrix onto the signal subspace by P_x = U_x U_x^H. The key idea of the RTF estimator in [31] is to enforce the instantaneous RTF estimate

g_{inst}(n,k) = y(n,k) Y_1^*(n,k) / |Y_1(n,k)|^2,   (23)

to lie in the estimated signal subspace, by performing the following subspace projection at each TF bin

g_{proj}(n,k) = P_x(n,k) g_{inst}(n,k) / (e_1^H P_x(n,k) g_{inst}(n,k)),   (24)

where the denominator normalizes the first element to one. The vector g_{inst} captures the spatial information of the dominant source, whereas the projection denoises g_{inst} by confining it to the signal subspace. Denoting the output of a binary signal detector by H_x, the final RTF estimate to be used as a time-varying constraint in the spotformer (16) is obtained as

g(n,k) = H_x(n,k) g_{proj}(n,k) + [1 - H_x(n,k)] g(n-1,k).   (25)

Hence, when the spot signal is dominant (H_x = 1), g is obtained by (24), whereas when the spot signal is absent, g(n-1,k) from the previous frame is used. Note that although the general framework for the projection-based RTF estimator has been developed in [31], the computation of the detector H_x required to apply it in the spotforming context is specific to this work and is proposed in Section IV.

C. Estimating the PSD Matrices

In practice, the PSD matrices \Phi_{u+v} and \Phi_x need to be estimated from the microphone signals. This is commonly done by recursive averaging, where an estimate at the current time frame is obtained using the PSD matrix estimate from the previous frame and the current signal y(n,k) in the following manner (the frequency index is omitted for brevity)

\Phi_{u+v}(n) = \alpha_{uv}(n) \Phi_{u+v}(n-1) + [1 - \alpha_{uv}(n)] y(n) y^H(n)
\Phi_{x+v}(n) = \alpha_x(n) \Phi_{x+v}(n-1) + [1 - \alpha_x(n)] y(n) y^H(n)
\Phi_v(n) = \alpha_v(n) \Phi_v(n-1) + [1 - \alpha_v(n)] y(n) y^H(n).   (26)

Since the background noise is always present, the recursive averaging yields an estimate of \Phi_{x+v} rather than \Phi_x. The averaging parameters \alpha_{uv}, \alpha_x, and \alpha_v should allow for quick adaptation of the PSD matrices in case of changes in the acoustic scene such as moving or emerging sources.
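One recursion step of (26) for a single TF bin can be sketched as follows; the driving signal is synthetic complex white noise (an assumed toy input), so the estimate should settle near the true PSD matrix 2I:

```python
import numpy as np

def update_psd(Phi_prev, y, alpha):
    """One recursive-averaging step of Eq. (26) for a single TF bin."""
    return alpha * Phi_prev + (1.0 - alpha) * np.outer(y, y.conj())

rng = np.random.default_rng(4)
M = 3
Phi = np.zeros((M, M), dtype=complex)
for _ in range(500):
    # Synthetic complex white-noise frame; its true PSD matrix is 2*I
    y = rng.standard_normal(M) + 1j * rng.standard_normal(M)
    Phi = update_psd(Phi, y, alpha=0.95)
```

The estimate stays Hermitian by construction, and its diagonal fluctuates around the true per-channel power of 2.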
Moreover, to avoid leakage of the undesired signal into the spot signal's PSD matrix and vice versa, \alpha_{uv}, \alpha_x, and \alpha_v should ensure that only the PSD matrix of the dominant signal is updated, requiring a spot signal detection mechanism at each TF bin.

IV. BIN-WISE SIGNAL DETECTION

For the purpose of signal detection, we can relax the sparsity assumption and only assume that there exist TF bins where either sources from S, or sources outside S and/or background noise, are dominant. We define the following hypotheses:

H_v: speech is absent, i.e., y \approx v,   (27a)
H_x: speech from S is dominant, i.e., y \approx x + v,   (27b)
H_u: speech from outside S is dominant, i.e., y \approx u + v,   (27c)
H_{xu} = H_x \cup H_u: speech is present,   (27d)
H_{uv} = H_u \cup H_v: the undesired signal is dominant.   (27e)

As illustrated in Fig. 1, the probability of hypothesis H_{xu} represents a speech presence probability (SPP), denoted by p_{xu}. Given that speech is present, meaning that H_{xu} is true (upper branch of the figure), the probability that the speech originates from the SOI (hypothesis H_x) is p_x = p(H_x | H_{xu}) and will be referred to as the conditional spot probability.

Fig. 1. Graphical model illustrating the bin-wise hypotheses and the associated probabilities.

The framework for estimating the matrices \Phi_x, \Phi_{u+v}, and \Phi_v consists of building probabilistic models, computing the probabilities in Fig. 1, and using the probabilities to control the parameters \alpha_x, \alpha_{uv}, and \alpha_v in (26). The estimation of the SPP and of \Phi_v is well studied, and the state-of-the-art approach used in this work is reviewed in Section IV-A. The hierarchical model in Fig. 1 and the computation of the conditional spot probability, however, are proposed within our spotforming framework and are discussed in Section IV-B.

A. Speech Presence Probability and Estimation of \Phi_v

In SPP-based noise PSD matrix estimation (see [32] and references therein), given the a-posteriori SPP p_{xu} = p(H_{xu} | y), the parameter \tilde{\alpha}_v(n,k) used to recursively estimate \Phi_v(n,k) in (26) is computed as

\tilde{\alpha}_v(n,k) = 1 - [1 - p_{xu}(n,k)] (1 - \alpha_v), \alpha_v \in (0,1).   (28)

At TF bins with high SPP, p_{xu} \approx 1, it follows that \tilde{\alpha}_v \approx 1 and hence the current signals y(n,k) have little influence on the updated noise PSD matrix, i.e., \Phi_v(n,k) \approx \Phi_v(n-1,k). At TF bins with low SPP, the signals y(n,k) have a weight factor 1 - \tilde{\alpha}_v \approx 1 - \alpha_v in the recursion (26), hence updating the noise PSD matrix. Appropriate values of \alpha_v are given in Section V. To compute the SPP, a Gaussian signal model for the STFT coefficients is used [33], where the data likelihoods under speech absence and speech presence are given by

p(y | H_v) = (1 / (\pi^M \det[\Phi_v])) e^{-y^H \Phi_v^{-1} y},
p(y | H_{xu}) = (1 / (\pi^M \det[\Phi_y])) e^{-y^H \Phi_y^{-1} y}.   (29)

The SPP can be obtained by applying the Bayes theorem

p_{xu} = p(y | H_{xu}) q / [p(y | H_{xu}) q + p(y | H_v)(1 - q)],   (30)

where q denotes the a-priori SPP. Note that at TF bin (n,k), before the likelihoods in (29) and the SPP are computed, only an estimate of \Phi_v(n-1) from the previous frame is available. As the noise is assumed to be relatively stationary, the SPP and noise PSD matrix estimation loop can be implemented by using in (29) the PSD matrix estimate \Phi_v(n-1) from the previous time frame.
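The SPP loop of (28)-(30) can be sketched for one TF bin; the PSD matrices, the prior q = 0.5, and the value α_v = 0.9 below are illustrative assumptions, not the values used in the experiments:

```python
import numpy as np

def log_gaussian(y, Phi):
    """Log of the complex Gaussian likelihood in Eq. (29)."""
    M = len(y)
    _, logdet = np.linalg.slogdet(Phi)
    return -M * np.log(np.pi) - logdet - np.real(y.conj() @ np.linalg.solve(Phi, y))

def spp(y, Phi_v, Phi_y, q):
    """A-posteriori speech presence probability, Eq. (30)."""
    l_xu = np.exp(log_gaussian(y, Phi_y))
    l_v = np.exp(log_gaussian(y, Phi_v))
    return l_xu * q / (l_xu * q + l_v * (1.0 - q))

# Toy single-bin example (assumed matrices): noise PSD I, noisy-speech PSD 10*I
M = 3
Phi_v = np.eye(M, dtype=complex)
Phi_y = 10.0 * np.eye(M, dtype=complex)
y_loud = np.array([3.0 + 0j, 2.0, -2.5])    # energetic frame -> SPP close to 1
y_quiet = np.array([0.1 + 0j, 0.05, 0.02])  # weak frame -> SPP close to 0

p_loud = spp(y_loud, Phi_v, Phi_y, q=0.5)
p_quiet = spp(y_quiet, Phi_v, Phi_y, q=0.5)

# Eq. (28): the noise PSD recursion freezes when speech is likely (alpha_v = 0.9)
alpha_v = 0.9
a_tilde = 1.0 - (1.0 - p_loud) * (1.0 - alpha_v)
```

With a high SPP the effective smoothing factor approaches one, so the noise PSD matrix is left essentially unchanged during speech activity.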

Although the a-priori SPP q can be computed in advance [34], changes in the noise PSD tend to be erroneously detected as speech onsets, unless a signal-dependent q is used. To this end, we proposed a direct-to-diffuse ratio (DDR)-based a-priori SPP estimator in [32]. The DDR \Gamma(n,k) was estimated at each array based on the coherence between two microphones [35]; however, any DDR estimator suitable for the array geometry can be used. Based on the observation that low values of \hat{\Gamma}(n,k) indicate the absence of coherent speech, whereas high values of \hat{\Gamma}(n,k) indicate its presence, we proposed in [32] to compute the a-priori SPP as q = 1 - f[\hat{\Gamma}], where f[\hat{\Gamma}] is a sigmoid-like mapping defined as

f[\hat{\Gamma}(n,k)] = l_{min} + (l_{max} - l_{min}) \, 10^{c\rho/10} / (10^{c\rho/10} + \hat{\Gamma}^{\rho}(n,k)).   (31)

The parameters l_{min} and l_{max} determine the minimum and maximum values that the function can attain, c (in dB) controls the offset along the \Gamma axis, and \rho defines the steepness of the transition region between speech and non-speech detection. The choice of appropriate parameters is discussed in Section V.

B. Spot Probability and Estimation of \Phi_{x+v} and \Phi_{u+v}

When the undesired signals contain speech, an implementation of the spot probability and undesired signal PSD matrix estimation loop similar to the one used for SPP estimation does not provide reliable results, due to the speech non-stationarity. We have shown in [6] that a likelihood model for the bin-wise position estimates is suitable for computing the spot probability. In the current contribution, we provide a more detailed discussion on the choice of a likelihood model and its estimation from training data. We subsequently formulate a minimum Bayes risk spot signal detector and, based on the detector's decision, update either \Phi_{x+v} or \Phi_{u+v}.
1) Position-Based Minimum Bayes Risk Detector: Let \hat{r}(n,k) denote a position estimate obtained by triangulating narrowband DOA estimates from two distributed arrays. To estimate the narrowband DOAs, we used the method proposed in [36]. If multiple arrays are available, we choose for triangulation the two arrays with the largest instantaneous signal amplitude at the given TF bin. If the accuracy of the DOA estimates at the different arrays can be quantified, the arrays with the most accurate DOA estimates can be chosen instead. Let d_1, d_2 denote the locations of the two chosen arrays and e_{d_1}, e_{d_2} denote the corresponding DOA unit vectors. The position estimate \hat{r} is found as the intersection of the lines defined by the array centers and the DOA vectors, by first solving

d_1 + e_{d_1} \xi_1 = d_2 + e_{d_2} \xi_2   (32)

for \xi_1 and \xi_2, and substituting to find \hat{r} in either

\hat{r} = d_1 + e_{d_1} \xi_1, or \hat{r} = d_2 + e_{d_2} \xi_2.   (33)

Given a position estimate \hat{r}, an optimal minimum Bayes risk detector for a false alarm cost C_{du} > 0 (deciding that the desired signal is dominant when the undesired signal is dominant) and a miss cost C_{ud} > 0 (deciding that the undesired signal is dominant when the desired signal is dominant) is given by the following decision rule [37]:

decide H_x = 1, H_{uv} = 0, if p(H_x | \hat{r}) / [1 - p(H_x | \hat{r})] > C_{du} / C_{ud};
decide H_x = 0, H_{uv} = 1, otherwise.   (34)

The averaging parameters \tilde{\alpha}_x and \tilde{\alpha}_{uv} are then computed as

\tilde{\alpha}_x = 1 - H_x (1 - \alpha_x), \tilde{\alpha}_{uv} = 1 - H_{uv} (1 - \alpha_{uv}).   (35)

When the spot signal is absent, i.e., H_x = 0, then \tilde{\alpha}_x = 1 and \Phi_{x+v} is not updated. The values of the constants \alpha_x and \alpha_{uv} are given in Section V. To obtain the detector's decision, the spot probability p(H_x | \hat{r}) is required, which can be decomposed as

p[H_x | \hat{r}] = p[H_x, H_{xu} | \hat{r}] = p[H_x | H_{xu}, \hat{r}] \, p[H_{xu}].   (36)

Recalling the graphical model in Fig. 1, we recognize p[H_x | H_{xu}, \hat{r}] and p[H_{xu}] to be the conditional spot probability p_x and the SPP p_{xu}, respectively.
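The triangulation of (32)-(33) reduces to a 2×2 linear system, and the decision rule (34) to a posterior-odds test; the array geometry, the error-free DOA vectors, and the equal costs below are illustrative assumptions:

```python
import numpy as np

def triangulate(d1, e1, d2, e2):
    """Intersect the two DOA rays of Eqs. (32)-(33): d1 + e1*xi1 = d2 + e2*xi2."""
    A = np.stack([e1, -e2], axis=1)  # 2x2 system in (xi1, xi2)
    xi = np.linalg.solve(A, d2 - d1)
    return d1 + e1 * xi[0]           # Eq. (33)

def detect(p_x, C_du=1.0, C_ud=1.0):
    """Minimum Bayes risk rule of Eq. (34): decide H_x = 1 when the posterior
    odds exceed the cost ratio C_du / C_ud (equal costs assumed here)."""
    return 1 if p_x / (1.0 - p_x) > C_du / C_ud else 0

# Toy geometry (assumed): two arrays at (0, 0) and (4, 0), source at (2, 3)
d1, d2 = np.array([0.0, 0.0]), np.array([4.0, 0.0])
src = np.array([2.0, 3.0])
e1 = (src - d1) / np.linalg.norm(src - d1)  # error-free DOA unit vectors
e2 = (src - d2) / np.linalg.norm(src - d2)
r_hat = triangulate(d1, e1, d2, e2)
```

With noise-free DOAs the two rays intersect exactly at the source position; in practice the narrowband DOA errors make r̂ a random variable, which motivates the likelihood models of Section IV-B2.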
The SPP can be computed using the framework described in Section IV-A, and the remaining task is to evaluate p[H_x | H_xu, r̂].

2) Conditional Spot Probability p[H_x | H_xu, r̂]: If r denotes a two-dimensional random variable modeling the true source position, the conditional spot probability can be computed by evaluating the probability that r belongs to S. The latter is obtained by integrating the corresponding probability density function (PDF) over S as follows:

p[H_x | H_xu, r̂] = ∫_{r ∈ S} f(r | H_xu, r̂) dr. (37)

Let the room be uniformly sampled at N positions r_i with i ∈ I = {1, 2, ..., N}, and define a subset I_S ⊆ I with cardinality N_S such that if i ∈ I_S then r_i ∈ S. The integral (37) can then be numerically approximated as follows:

∫_{r ∈ S} f(r | H_xu, r̂) dr ≈ (1/N_S) Σ_{i ∈ I_S} f(r_i | H_xu, r̂). (38)

Next, each of the N_S terms in the sum needs to be evaluated. By applying Bayes' theorem, we obtain for each term

f(r_i | H_xu, r̂) = f(r̂ | H_xu, r_i) f(r_i) / f(r̂ | H_xu). (39)

The denominator f(r̂ | H_xu) is obtained by marginalization over the true source position r, namely

f(r̂ | H_xu) = ∫ f(r̂ | H_xu, r) f(r) dr, (40)

which can be numerically approximated as follows:

f(r̂ | H_xu) ≈ (1/N) Σ_{i ∈ I} f(r̂ | H_xu, r_i) f(r_i). (41)
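Combining the approximations above, the conditional spot probability reduces to a ratio of weighted likelihood sums over the grid; for a uniform grid, the constant cell-area factors of the numerical integration cancel. A minimal sketch (function and argument names are ours):

```python
import numpy as np

def spot_probability(likelihood, prior, in_spot):
    """Discretized conditional spot probability, cf. Eqs. (38)-(41).

    likelihood : f(r_hat | H_xu, r_i) evaluated at each of the N grid points r_i.
    prior      : f(r_i) at each grid point; uniform (1/N) if no prior
                 knowledge about the source locations is available.
    in_spot    : boolean mask, True where r_i lies inside the spot S.
    """
    w = likelihood * prior          # posterior weight of each grid point
    return w[in_spot].sum() / w.sum()
```

With a uniform likelihood the result is simply the fraction of grid points inside S, and with all likelihood mass concentrated inside S it approaches one.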

Finally, combining (38), (39), and (41), the conditional spot probability is computed as

p[H_x | H_xu, r̂] = [(1/N_S) Σ_{i ∈ I_S} f(r̂ | H_xu, r_i) f(r_i)] / [(1/N) Σ_{i ∈ I} f(r̂ | H_xu, r_i) f(r_i)]. (42)

Hence, to compute the spot probability we only need to know the PDFs f(r_i) and f(r̂ | H_xu, r_i) for all i ∈ I. The PDF f(r_i) represents the prior knowledge of where speech sources are located in the room. If no information about the scenario and the source locations is provided, f(r_i) is assumed to be uniform, i.e., f(r_i) = 1/N. In the next section, we discuss the computation of the likelihood PDFs f(r̂ | H_xu, r_i).

3) Likelihood Models f(r̂ | H_xu, r_i): For each position r_i in the room, i ∈ I, the PDF f(r̂ | H_xu, r_i) can be estimated in a training stage. To avoid training, in the initial publication [6] we modeled f(r̂ | H_xu, r_i) by a symmetric Gaussian distribution with mean r_i and a fixed covariance σ² I. In this work, we include a training stage for each position r_i. Training was performed only once, in a simulated shoebox room with 200 ms reverberation time and a low ambient noise level, where the test signal was 10 seconds of white noise. The white noise source was placed at r_i, and the mean and covariance matrix of the Gaussian distribution f(r̂ | H_xu, r_i) were estimated in the maximum-likelihood sense from the estimated positions r̂. The process was repeated for each i ∈ I. The estimated PDFs were applied in all experiments in Section V, which encompass rooms with different reverberation times, different noise levels, and different source constellations, and no significant performance loss directly caused by a mismatch between training and test conditions was observed. Hence, the experiments indicated that PDFs obtained by training in typical office conditions generalize well to different source constellations, reverberation, and noise levels, provided that the array geometry and the DOA estimators are fixed.
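The maximum-likelihood fit of the Gaussian likelihood f(r̂ | H_xu, r_i) from the training position estimates can be sketched as follows (our own minimal implementation; the paper does not prescribe one):

```python
import numpy as np

def fit_gaussian(r_hats):
    """ML estimate of the mean and covariance of a 2-D Gaussian likelihood
    from position estimates r_hats (shape (M, 2)) collected during training."""
    mu = r_hats.mean(axis=0)
    diff = r_hats - mu
    cov = diff.T @ diff / len(r_hats)   # ML (not bias-corrected) covariance
    return mu, cov

def gaussian_pdf(x, mu, cov):
    """Evaluate the fitted 2-D Gaussian density at position x."""
    d = x - mu
    inv = np.linalg.inv(cov)
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * d @ inv @ d)
```

In contrast to the fixed σ²I model of [6], the fitted covariance captures the orientation-dependent spread of the triangulated positions around each r_i.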
The good generalization can be intuitively explained as follows: reverberation and noise affect the variance of f(r̂ | H_xu, r_i), but not the directions of the principal axes, which depend mostly on the array geometry and orientation. As the increase in variance occurs for all r_i, i ∈ I, the detector is only affected by a minor shift in the false alarm and miss rates (which, if necessary, can be adjusted by modifying the Bayes costs).

C. Discussion

To illustrate the detector operation, a scenario with one source in the spot and one interferer is shown in Figure 2. The dots represent position estimates observed during 5 seconds of multi-talk. The lightest shade denotes positions from TF bins with H_x = 0, whereas the darker shades denote positions with H_x = 1. Each shade illustrates results with different Bayes costs: C_ud was fixed to 1, and C_du was varied (2, 4, and 8, lightest to darkest shade). As indicated in Figure 2, increasing the false alarm cost reduces the region of spot signal detections. The difference between the detector without training and the one with training is visible from the shape of the shaded regions.

Fig. 2. Visualization of the detector for different Bayes costs without training (left) and with training (right).

Fig. 3. A block diagram of the proposed data-dependent spotforming framework. The shaded blocks are required for the signal detector described in Section IV, whereas the white blocks use the detector to estimate the PSD matrices and the distortionless constraint, as described in Section III.

The training takes into account the true variance of the position estimates associated with the source and reduces the false alarm rate compared to the case where no training is performed, especially when an interferer is located near the spot.
False alarms trigger updates of the spot signal PSD matrix while the undesired signal is present, resulting in the spotformer not focusing on the SOI and introducing audible distortion of the spot signal. The false alarm rate can, to some extent, be controlled by appropriately adjusting the Bayes costs. Nevertheless, in extremely adverse acoustic conditions where the spot signal-to-undesired signal ratio is low (< 0 dB), the state-of-the-art fixed spotformers are likely to provide preferable speech quality, even though their ability to reduce the undesired signal is limited. Finally, a detailed diagram of the proposed spotformer summarizing all described processing blocks is given in Figure 3.

V. PERFORMANCE EVALUATION

The spotformer performance was evaluated in different acoustic conditions with both measured and simulated data. In Section V-A, the experimental setups are described; in Section V-B, the performance measures are overviewed; and in Section V-C, the experimental results are discussed. Related audio examples are available online.

A. Experimental Setup and Overview

Measurements were carried out in a room with T_60 of ms and dimensions m³. Three circular arrays with

B. Objective Performance Measures

The performance measures are computed for non-overlapping segments of length T = 30 ms. The final values, shown in the tables and the plots in Section V-C, are computed by averaging the segment-wise values. For a segment i, the input SNR and signal-to-interference ratio (SIR) at the m-th microphone are computed as given in (43) below.

Fig. 4. Experimental setups for the measurement (left), simulation for moving sources (middle), and simulation with multiple sources in the spot (right).

diameter 2.9 cm and three DPA miniature microphones per array were arranged as shown in Figure 4 (left). The RIRs between positions 1-5 and the microphones were measured, where five GENELEC loudspeakers were used as sources. To generate diffuse sound, the RIRs from ten loudspeakers facing the walls to the microphones were measured. The sampling rate was set to 16 kHz. Note that as spatial aliasing in the DOA estimates occurs around 7 kHz for this setup, the signals were band-limited to 7 kHz before processing. The speech signals at the microphones were obtained by convolving clean speech with the measured RIRs from positions 1-5. The clean speech samples consisted of male and female speech in English, German, and French, recorded by a close-talking microphone. Babble noise signals were convolved with the RIRs of the ten loudspeakers facing the walls, which, added together, result in approximately diffuse sound. Finally, the microphone signals were obtained by adding the speech signals, the diffuse signal, and a measured sensor noise signal. The SNR with respect to the sensor noise was 35 dB, and only the diffuse noise level was varied in the experiments to test different input SNRs (given in Section V-C). The SNR was measured as the ratio between the spot signal power and the noise power at the reference microphone.
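The segment-wise levels just introduced can be sketched as follows (hypothetical helper names; assumes time-domain signals at a reference microphone and a segment length of n samples):

```python
import numpy as np

def seg_power(s, n):
    """Mean power of non-overlapping length-n segments of signal s."""
    n_seg = len(s) // n
    return np.mean(s[:n_seg * n].reshape(n_seg, n) ** 2, axis=1)

def segmental_snr_sir(x, v, u, n):
    """Segment-wise input SNR and SIR in dB, cf. Eq. (43):
    desired-to-noise and desired-to-interference power ratios per segment."""
    px, pv, pu = seg_power(x, n), seg_power(v, n), seg_power(u, n)
    return 10 * np.log10(px / pv), 10 * np.log10(px / pu)
```

For the 16 kHz signals and T = 30 ms used here, n = 480 samples per segment.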
The effect of reverberation on the performance was investigated using simulated RIRs. The simulated room geometry was the same as in the measurements, with the freedom to vary T_60 using an implementation of the image source model [38]. Diffuse sound was generated according to [39], and uncorrelated Gaussian noise with an SNR of 35 dB was added as sensor noise. In addition, simulations were used for moving source scenarios, as detailed in Section V-C. The remaining processing parameters are as follows: the STFT frame size was 64 ms with 50% overlap; the Bayes detector costs were C_du = 7 and C_ud = 1 (discussed in Section V-C, Experiment 3); and the averaging constants α_x, α_uv, and α_v were 0.75, 0.94, and 0.98, respectively, corresponding to time constants of 0.1, 0.5, and 1.6 seconds. The room was sampled at 10 positions per meter to obtain the positions r_i required in the spot probability computation. The a-priori SPP parameters from Section IV-A were ρ = 1.2, c = 5, l_min = 0.05, and l_max = 0.95. Note that the framework does not impose restrictions on the geometry of the arrays, except that the DOA estimators cover an angular range of 360 degrees so that triangulation can be performed. Nevertheless, in applications where the front-back ambiguity is not problematic (such as for arrays mounted on a wall), linear arrays can also be used.

isnr(i) = 10 log_10 ⟨x_m(t)²⟩ / ⟨v_m(t)²⟩,
isir(i) = 10 log_10 ⟨x_m(t)²⟩ / ⟨u_m(t)²⟩, t ∈ ((i−1)T, iT], (43)

where ⟨·⟩ denotes the average over t. The overall segmental isnr and isir are computed by averaging over all segments i. The output values osnr and osir are computed similarly, using the filtered versions of the signals. The performance measures used in the evaluation can be summarized as follows:

i) SNR improvement Δ_SNR and SIR improvement Δ_SIR (also known as the array gains with respect to the noise and the interference, respectively) are computed as

Δ_SNR = osnr − isnr, Δ_SIR = osir − isir
(44)

To compute the average array gains Δ̄_SNR and Δ̄_SIR over the whole signal, only segments i with isnr(i) and isir(i) in the range [−15, 30] dB were considered, so that both the desired and the undesired signals contribute significant energy.

ii) The speech distortion (SD) index ν_sd attains values in [0, 1]; values close to zero indicate low distortion. The reference signal is the desired source signal at the m-th microphone. For the i-th segment, the SD index is given by

ν_sd(i) = ⟨[x_m(t) − x̂_m(t)]²⟩ / ⟨x_m(t)²⟩, t ∈ ((i−1)T, iT], (45)

where the hat indicates filtering by the spotformer.

iii) Interference reduction Δ_IR and noise reduction Δ_NR. The IR for the i-th segment is computed as

Δ_IR(i) = 10 log_10 ⟨u(t)²⟩ / ⟨û(t)²⟩, t ∈ ((i−1)T, iT]. (46)

Similarly, Δ_NR is computed from v(t) and v̂(t).

iv) To evaluate the detector, we consider the false alarm rate (FR) and the miss rate (MR), given by

FR = Σ_{n,k} [H_x = 1 ∧ H_ideal = 0] / Σ_{n,k} [H_ideal = 0],
MR = Σ_{n,k} [H_x = 0 ∧ H_ideal = 1] / Σ_{n,k} [H_ideal = 1], (47)

where Σ_{n,k} [·] denotes summation over all TF bins of the value of the logical expression inside the brackets. A ground-truth detector H_ideal is created by considering the instantaneous desired-to-undesired signal power ratio at each TF bin: H_ideal is set to 1 if the desired signal is dominant, and to 0 otherwise. The FR and MR can be controlled by the Bayes costs, as discussed in Experiment 3.

C. Results and Discussion

In this section, we discuss six experiments related to the following aspects: In Experiment 1, we present the spotformer

TABLE I EXPERIMENT 1, OBJECTIVE PERFORMANCE EVALUATION. THE RESULTS ARE AVERAGED OVER ALL SCENARIOS WITH ONE INTERFERER

performance in scenarios with different numbers of interferers, different numbers of arrays, and different noise levels. In Experiment 2, we focus on the comparison between the proposed and a state-of-the-art spotforming approach. Experiment 3 investigates the influence of the detector, Experiment 4 the effect of reverberation, Experiment 5 the performance in scenarios with moving sources, and Experiment 6 a scenario with multiple sources inside the SOI.

Experiment 1: This experiment assumes a single source in the SOI and investigates the undesired signal reduction. As there is only one desired source, we evaluate the spotformer with the rank-one model-based RTF estimators described in Section III-B1 (denoted by MMSE in the results) and Section III-B2 (denoted by LS in the results). Various aspects are evaluated using the measured data: i) different numbers of interferers, ii) extraction with one, two, and three arrays, and iii) different SNRs. As a state-of-the-art (SoA) baseline, we use an MVDR spotformer with a fixed constraint computed from the RTF vector at the spot center, where the desired source was located. The undesired signal PSD matrix Φ_u+v for the SoA spotformer is estimated from a segment where the spot signal is inactive. In practice, such a segment has to be detected, and in some cases might not exist, which poses a limitation to approaches that estimate Φ_u+v in this manner. However, as oracle information on when to estimate Φ_u+v is used in this case, this experiment provides ideal settings for the fixed spotformer. Moreover, the steering vector of the SoA spotformer is perfectly matched to the true RTF of the source.
Under these conditions, the SoA spotformer is expected to have superior performance, and the goal is to evaluate the degradation when using our proposed framework, which estimates all quantities blindly from the data. The spot was a circle with radius 0.4 m. For a given number of sources, the results were averaged over all source combinations from Figure 4 (left). The sources in each scenario were active simultaneously for 20 seconds, and the input SIR for one, two, and three interferers was 2 dB, -2 dB, and -3.5 dB, respectively. Each scenario was evaluated for a moderate and a low SNR (16 dB and 6 dB), and the signal detector with training was used. The SNR was computed with respect to the sum of the diffuse and sensor noise, and the diffuse-to-sensor noise ratio was 40 dB and 30 dB, respectively. The results are given in Table I for the case of one interferer, and in Table II for two and three interferers. The conclusions are summarized as follows:

a) The oracle spotformer is almost distortionless in all scenarios, as the constraint is based on the true RTFs at the source location. The SD index ν_sd for the proposed data-dependent spotformer reaches up to 0.15 when using one array and up to 0.25 when using three arrays. Using multiple arrays increases the SD index, as the PSD matrices are more sensitive to detection errors and the RTFs are longer filters, which are more difficult to estimate accurately. Note, however, that the reference signal for computing the SD index was the spot signal as received at the microphone. Hence, the increase in ν_sd is partially attributable to dereverberation. This finding is further discussed in Experiment 4.

TABLE II EXPERIMENT 1, OBJECTIVE PERFORMANCE EVALUATION FOR THE SCENARIOS WITH TWO AND THREE INTERFERERS. RESULTS WITH THREE-ARRAY SPOTFORMER AND ISNR = 6 dB

b) The fact that detection errors are particularly critical when using multiple arrays is reflected in the interference reduction Δ_IR as well.
The multi-array oracle spotformer outperforms the single-array oracle spotformer by 10 dB, whereas in the data-dependent case, only a gain of 3 dB is obtained. Hence, the spatial selectivity of multiple arrays is not fully utilized, due to the detection errors. Our experiments indicated that the advantage of spatial selectivity is mainly manifested when increasing from one to two arrays, whereas the improvement when adding further arrays is less significant.

c) There was no significant performance loss when the input SNR was decreased from 16 dB to 6 dB. The degradation in interference reduction is less than 1 dB, whereas the SD index is improved in the noisier case. This can be explained as follows: for high SNR, the position estimates are accurate and concentrated around the true source locations, which leads to false alarms in the detector when an interferer is near the SOI. For low SNR, however, the density of position estimates around the true source locations decreases, which in turn reduces the false alarms arising due to the nearby interferer.

d) In Table II, the results for two and three interferers are shown only for the three-array spotformer. The remaining results follow a similar trend as discussed above. Note that a larger number of interferers does not worsen the distortion, and the loss in interference reduction is less than 1 dB.

e) The undesired signal reduction with the MMSE rank-one approximation (i.e., covariance subtraction-based RTF estimation) deteriorates at lower input SIR and SNR, compared to the LS approach (i.e., covariance whitening-based RTF estimation). A similar finding was presented in [26], where the

Fig. 5. Experiment 2, interference power at the input and at the output of a three-array spotformer.

TABLE III EXPERIMENT 2, AVERAGE INTERFERENCE AND NOISE REDUCTION OF THE STATE-OF-THE-ART AND THE PROPOSED SPOTFORMERS

covariance subtraction was shown to be less accurate than covariance whitening in low SNR conditions. Therefore, in the following, unless stated otherwise, we only discuss the spotformer with the LS-based rank-one approximation.

Experiment 2: The goal of this experiment is to compare the proposed data-dependent spotformer with the state-of-the-art spotformer discussed in Experiment 1, when applied in dynamic scenarios. Such a scenario is obtained from the measured data as follows: a circular spot with radius 0.4 m is centered at position 1, shown in Figure 4 (left). During the first ten seconds, the desired source and an interferer at position 3 are active (average input SIR 0 dB), while during the next ten seconds, the desired source and two interferers at positions 4 and 5 are active (input SIR -2 dB). The experiment was repeated for 12 different combinations of speakers in various languages, and the average input SNR was 6 dB. The interference power at the input and at the spotformer outputs is plotted across time in Figure 5 for one set of signals. In the first 10 seconds, the SoA spotformer offers better interference reduction, as Φ_u+v is estimated using the signal of the currently active interferer. Clearly, when the new interferers become active, such a framework cannot track the change in Φ_u+v, resulting in inferior interference reduction compared to the proposed spotformer, which is adapted using the data. The averaged results across the 12 experiments are given in Table III, whereby the results for each individual experiment followed a similar trend as shown in Figure 5.
Experiment 3: In this experiment, the effect of the detector on the spotformer performance is examined, and the advantage of incorporating training is demonstrated. Measured data was used, with the desired source and the spot center at position 2 in Figure 4 (left). Two multi-talk scenarios were considered: in the first, an interferer at position 3 was active (relatively near the spot) with an input SIR of 0 dB, and in the second, an interferer at position 4 was active (relatively far from the spot) with an input SIR of 1.5 dB. The input SNR was 6 dB. The training is particularly advantageous when the interferer is near the spot (< 1 m). This is visible when comparing the array gains Δ_SIR in Figure 6(a) (interferer at position 4) and in Figure 6(b) (interferer at position 3). When the interferer is far, both detectors lead to similar Δ_SIR, constant for all radii, whereas when the interferer is near, the training improves Δ_SIR by up to 6 dB for moderately large radii. As the spot radius increases, Δ_SIR for both detectors deteriorates due to the increasing false alarm rate. The SNR improvement Δ_SNR is constant regardless of the interferer location. When there is no interferer [Figure 6(f)], Δ_SNR increases, as all degrees of freedom are used for noise reduction. The previous discussion can be corroborated by analyzing the performance of the detectors in terms of FR and MR, plotted in Figure 7. The FR remains low when the interferer is far from the spot, while rapidly increasing for larger spot radii when the interferer is near. The advantage of using training is particularly noticeable in terms of the FR when the interferer is near the spot borders. The MR is, however, not significantly influenced by the training. Hence, the Bayes costs C_du and C_ud need to prioritize a low FR.
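The FR and MR used throughout this discussion follow (47); a minimal sketch, assuming boolean detection masks over all TF bins:

```python
import numpy as np

def detector_rates(h_det, h_ideal):
    """False alarm rate and miss rate over all TF bins, cf. Eq. (47).

    h_det   : boolean array with the detector decisions H_x.
    h_ideal : boolean array with the oracle decisions H_ideal.
    """
    fr = np.sum(h_det & ~h_ideal) / np.sum(~h_ideal)   # false alarms
    mr = np.sum(~h_det & h_ideal) / np.sum(h_ideal)    # misses
    return fr, mr
```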
With this in mind, the costs were determined a-priori to achieve FR < 0.1%, while maintaining an MR no larger than 0.9%, so that Φ_x+v updates sufficiently frequently in the case of moving sources (in our implementation, C_ud = 1 and C_du = 7). Note that the costs might need to be revised if different parameter estimators (SPP, DOA, DDR) are used. However, for a given implementation of the estimators, chosen costs that satisfy the FR/MR trade-off generalize well to a very wide range of acoustic conditions.

Experiment 4: Due to multi-path propagation in reverberant environments, the accuracy of the position estimates decreases, resulting in a larger FR of the detectors. To examine the effect of reverberation on the signal quality at the spotformer output, we simulated shoebox rooms with reverberation times T_60 from 200 to 700 ms. The SNR was fixed to 9 dB, which represents a significant level of babble noise. Scenarios where an interferer is far from the spot (> 2 m) and where an interferer is near the spot (0.5-1 m) were simulated. For both cases, the results are averaged over 10 random source locations. The findings from this experiment are summarized as follows:

a) Due to the increased FR, the SD index increases, as visible in Figure 8. Nevertheless, note that the high SD is partially attributable to dereverberation. To confirm this, we computed the signal-to-reverberation-modulation ratio (SRMR) [40] of the desired signal at the reference microphone and of the desired signal after spotforming. The difference of the two SRMR values, shown in Figure 9, indicates the amount of dereverberation (larger values indicate more dereverberation).

b) Reverberation does not severely affect the noise and interference reduction. The noise reduction Δ_NR is independent of T_60 and equal to 7 dB for one array, 9.2 dB for two arrays, and 10.2 dB for three arrays. The interference reduction Δ_IR is illustrated in Figure 8 (right).

Fig. 6. Experiment 3, comparison of the detector with and without training for varying spot size, in terms of objective performance measures at the output of the proposed spotformer with LS-based rank-one approximation (Section III-B2).

Fig. 7. Experiment 3: Detection performance in terms of FR and MR.

Fig. 9. Experiment 4: SRMR improvement after spotforming.

Fig. 8. Experiment 4, effect of reverberation on the spotformer performance (LS rank-one approximation).

TABLE IV EXPERIMENT 5: OBJECTIVE PERFORMANCE RESULTS FOR A SCENARIO WITH MOVING SOURCES

c) Similarly to what was observed in Experiment 3, when the interferer is far from the spot, the performance of the detector with and without training is identical, whereas when the interferer is near the spot, training improves the interference reduction, as shown in Figure 8 (right).

Experiment 5: To examine the spotformer performance with moving sources, we simulated a moderately reverberant room (T_60 = 300 ms) with the setup shown in Figure 4 (middle). The desired source traverses the trajectory A-B-A-B-A (solid line), whereas the interferer traverses A-B-A (dotted line), during 20 seconds of double-talk. The average input SNR and SIR were 6 dB and 0 dB, respectively. The experiment was repeated for 12 different sets of signals, and the averaged results are summarized in Table IV. In terms of the objective measures, the spotformer achieves similar performance as in a static scenario with comparable acoustic conditions (see Figure 8 for T_60 = 300 ms), with less than 0.5 dB difference in Δ_IR and less than 0.03 difference in ν_sd. Although not reflected in the objective measures, it should be noted that in the case of moving sources, the perceptual quality of the multi-array spotformer was degraded.

Fig. 10. Experiment 6: Average objective performance results in a scenario with two sources inside the spot.

However, when using a single-array spotformer, the signal quality was comparable to that in static scenarios, confirming the potential of the spotformer in highly dynamic scenarios. A comparable stand-alone framework to track the time-varying PSD matrices is not known to the authors at this point; hence, a comparison to a state-of-the-art method is not provided.

Experiment 6: In applications requiring a larger spot size, multiple sources in S might be active. In the spotformer framework, the goal is to extract the sources using a single constraint, without having to estimate each source's RTFs. The constraint described in Section III-B3 (denoted as Proj) is suitable in this scenario and is compared to the LS constraint from Section III-B2. Additionally, we compared two fixed constraints from the state-of-the-art: 1) the RTF at the spot centroid (denoted as Fix_centre), and 2) the eigenvector constraint corresponding to the principal eigenvector of G_S described in Section III-A (denoted as Fix_eig). To make a fair comparison with the proposed MVDR spotformers in terms of undesired signal reduction, we did not compute an LCMV filter with multiple constraints, but rather an MVDR filter with a single eigenvector constraint. Furthermore, to focus only on the effect of the constraint, both the proposed and the state-of-the-art designs used the PSD matrix estimate Φ_u+v obtained by the framework proposed in this work. The scenario shown in Figure 4 (right) was simulated, where two sources inside S and an interferer are simultaneously active. The experiment was repeated for 10 different combinations of speakers in various languages. The results are shown in Figure 10 for a single-array and a three-array spotformer, for SNRs of 17 dB and 9 dB, and T_60 = 200 ms.
It can be observed that the projection method improves Δ_NR by 1.5 dB and the array gain Δ_SIR by 1-2 dB compared to the LS method, while also slightly reducing the distortion ν_sd of the spot signal, which in this case is the sum of the two source signals. The data-independent constraints result in notably worse performance, both in terms of spot signal distortion and undesired signal reduction. Finally, to illustrate an example of the spatial pattern, the spotformer coefficients from all frequencies at a given frame were applied to source signals located at different positions on a square grid with 10 positions per meter. For each position, the ratio of the source power at the output to the source power at the input of the spotformer is coded in the color in Figure 11. The largest attenuation is visible at the locations of the interferers, showing the spotformer's ability to blindly create spatial notches to reduce the interferers. The images also illustrate that while multiple arrays increase the spatial selectivity, they also lead to increased spot signal distortion.

Fig. 11. Spatial selectivity pattern of the spotformer. The plus signs denote the arrays and the squares denote the sources. The colormap is [-11, -1] dB (darkest to brightest).

VI. CONCLUSIONS

A fully data-dependent acoustic spotformer was proposed to extract signals originating from a spot of interest (SOI), while reducing noise and interference. It relies on a low-rank approximation of the spot signal PSD matrix, which is valid due to the sparsity of speech in the TF domain. An important contribution that enables PSD matrix estimation in practice is the underlying probabilistic model, which exploits spatial information and a minimum Bayes risk decision rule to determine the TF bins where the spot signal is dominant.
The main advantage of the proposed framework over existing approaches is that the spot PSD matrix computed from the data tends to be low-rank, allowing for an MVDR spotformer design which uses the maximum degrees of freedom for undesired signal reduction. Different methods to design the MVDR spotformer constraint based on state-of-the-art RTF estimators were discussed and evaluated. Thanks to the proposed signal detection framework for PSD matrix estimation, the spotformer adapts almost instantaneously to changing acoustic conditions and to appearing or disappearing sources.

REFERENCES

[1] S. Markovich, S. Gannot, and I. Cohen, "Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 6, Aug.
[2] M. Taseska and E. A. P. Habets, "Informed spatial filtering with distributed arrays," IEEE Trans. Audio, Speech, Lang. Process., vol. 22, no. 7, Jul.
[3] D. H. Tran Vu and R. Haeb-Umbach, "An EM approach to integrated multichannel speech separation and noise suppression," in Proc. Int. Workshop Acoust. Signal Enhancement (IWAENC), 2010, pp. 1-4.

[4] M. Souden, S. Araki, K. Kinoshita, T. Nakatani, and H. Sawada, "A multichannel MMSE-based framework for speech source separation and noise reduction," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 9, Sep.
[5] Y. Grenier, "A microphone array for car environments," Speech Commun., vol. 12, Mar.
[6] M. Taseska and E. A. P. Habets, "Spotforming using distributed microphone arrays," in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), New Paltz, NY, USA, Oct. 2013.
[7] N. Grbic and S. Nordholm, "Soft constrained subband beamforming for hands-free speech enhancement," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), May 2002, vol. 1, pp. I-885-I-888.
[8] J. Martinez, N. Gaubitch, and W. B. Kleijn, "A robust region-based near-field beamformer," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Apr. 2015.
[9] Y. Zheng, R. Goubran, and M. El-Tanany, "Robust near-field adaptive beamforming with distance discrimination," IEEE Trans. Speech Audio Process., vol. 12, no. 5, Sep.
[10] A. Davis, S. Y. Low, S. Nordholm, and N. Grbic, "A subband space constrained beamformer incorporating voice activity detection," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Mar. 2005, pp. iii/65-iii/68.
[11] O. Thiergart, M. Taseska, and E. A. P. Habets, "An informed parametric spatial filter based on instantaneous direction-of-arrival estimates," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, Dec.
[12] C. A. Anderson, P. D. Teal, and M. A. Poletti, "Spatially robust far-field beamforming using the von Mises(-Fisher) distribution," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 12, Dec.
[13] F. Dowla and A. Spiridon, "Spotforming with an array of ultra-wideband radio transmitters," in Proc. IEEE Conf. Ultra Wideband Syst. Technol., Nov. 2003.
[14] S. Affès and Y.
Grenier, "A signal subspace tracking algorithm for microphone array processing of speech," IEEE Trans. Speech Audio Process., vol. 5, no. 5, Sep.
[15] E. E. Jan and J. Flanagan, "Sound capture from spatial volumes: Matched-filter processing of microphone arrays having randomly-distributed sensors," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Atlanta, GA, USA, May 1996.
[16] Ö. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Trans. Signal Process., vol. 52, no. 7, Jul.
[17] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing. Berlin, Germany: Springer-Verlag.
[18] S. Araki, H. Sawada, and S. Makino, "Blind speech separation in a meeting situation with maximum SNR beamformers," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2007, pp. I-41-I-44.
[19] D. P. Jarrett, M. Taseska, E. A. P. Habets, and P. Naylor, "Noise reduction in the spherical harmonic domain using a tradeoff beamformer and narrowband DOA estimates," IEEE Trans. Audio, Speech, Lang. Process., vol. 22, no. 5, May.
[20] M. Taseska and E. A. P. Habets, "Minimum Bayes risk signal detection for speech enhancement based on a narrowband DOA model," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Brisbane, QLD, Australia, Apr. 2015.
[21] D. Cherkassky and S. Gannot, "Blind synchronization in wireless sensor networks with application to speech enhancement," in Proc. Int. Workshop Acoust. Signal Enhancement (IWAENC), Juan-les-Pins, France, Sep. 2014.
[22] Y. Avargel and I. Cohen, "On multiplicative transfer function approximation in the short-time Fourier transform domain," IEEE Signal Process. Lett., vol. 14, no. 5, May.
[23] K. M. Buckley, "Spatial/spectral filtering with linearly constrained minimum variance beamformers," IEEE Trans. Acoust., Speech, Signal Process., vol. 35, no. 3, Mar.
[24] J. Capon, "High resolution frequency-wavenumber spectrum analysis," Proc. IEEE, vol.
57, no. 8, pp , Aug [25] S. Gannot, D. Burshtein, and E. Weinstein, Signal enhancement using beamforming and nonstationarity with applications to speech, IEEE Trans. Signal Process., vol. 49, no. 8, pp , Aug [26] S. Markovich-Golan and S. Gannot, Performance analysis of the covariance subtraction method for relative transfer function estimation and comparison to the covariance whitening method, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2015, pp [27] B. Cornelis, S. Doclo, T. Van dan Bogaert, M. Moonen, and J. Wouters, Performance analysis of multichannel Wiener filter-based noise reduction in hearing aids under second order statistics estimation errors, IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 5, pp , Jul [28] G. H. Golub and C. F. van Loan, Matrix Computations, 3rd ed. Balimore, MD, USA: The Johns Hopkins Univ. Press, [29] A. Krueger, E. Warsitz, and R. Haeb-Umbach, Speech enhancement with a GSC-like structure employing eigenvector-based transfer function ratios estimation, IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 1, pp , Jan [30] E. Warsitz and R. Haeb-Umbach, Blind acoustic beamforming based on generalized eigenvalue decomposition, IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 5, pp , Jul [31] M. Taseska and E. A. P. Habets, Relative transfer function estimation exploiting instantaneous signals and the signal subspace, in Proc. Eur. Signal Process. Conf. (EUSIPCO), Nice, France, Sep. 2015, pp [32] M. Taseska and E. A. P. Habets, MMSE-based blind source extraction in diffuse noise fields using a complex coherence-based a priori SAP estimator, in Proc. Int. Workshop Acoust. Signal Enhancement (IWAENC), Sep. 2012, pp [33] M. Souden, J. Chen, J. Benesty, and S. Affès, Gaussian model-based multichannel speech presence probability, IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 5, pp , Jul [34] T. Gerkmann, C. Breithaupt, and R. 
Maja Taseska (S'13) was born in Ohrid, Macedonia. She received the B.Sc. degree in electrical engineering from Jacobs University, Bremen, Germany, in 2010, and the M.Sc. degree from the Friedrich-Alexander-University, Erlangen, Germany. She is currently pursuing the Ph.D. degree in informed spatial filtering at the International Audio Laboratories Erlangen, Erlangen, Germany. Her research interests include informed spatial filtering, source localization and tracking, blind source separation, and noise reduction.

Emanuël A. P. Habets (S'02-M'07-SM'11) received the B.Sc. degree in electrical engineering from the Hogeschool Limburg, Heerlen, The Netherlands, in 1999, and the M.Sc. and Ph.D. degrees in electrical engineering from the Technische Universiteit Eindhoven, Eindhoven, The Netherlands, in 2002 and 2007, respectively. He is an Associate Professor with the International Audio Laboratories Erlangen (a joint institution of the Friedrich-Alexander-University Erlangen-Nürnberg and Fraunhofer IIS) and the Head of the Spatial Audio Research Group at Fraunhofer IIS, Germany. From 2007 to 2009, he was a Postdoctoral Fellow with the Technion-Israel Institute of Technology, Haifa, Israel, and Bar-Ilan University, Ramat Gan, Israel. From 2009 to 2010, he was a Research Fellow with the Communication and Signal Processing Group, Imperial College London, London, U.K. His research interests include audio and acoustic signal processing, spatial audio signal processing, spatial sound recording and reproduction, speech enhancement (dereverberation, noise reduction, echo reduction), and sound localization and tracking. Dr. Habets was a member of the organization committee of the 2005 International Workshop on Acoustic Echo and Noise Control (IWAENC), Eindhoven, The Netherlands, a General Co-Chair of the 2013 International Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, and a General Co-Chair of the 2014 International Conference on Spatial Audio (ICSA), Erlangen, Germany. He was a member of the IEEE Signal Processing Society Standing Committee on Industry Digital Signal Processing Technology, and a Guest Editor for the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING. He is currently an Associate Editor of the IEEE SIGNAL PROCESSING LETTERS and the Editor-in-Chief of the EURASIP Journal on Audio, Speech, and Music Processing.


More information

SUB-BAND INDEPENDENT SUBSPACE ANALYSIS FOR DRUM TRANSCRIPTION. Derry FitzGerald, Eugene Coyle

SUB-BAND INDEPENDENT SUBSPACE ANALYSIS FOR DRUM TRANSCRIPTION. Derry FitzGerald, Eugene Coyle SUB-BAND INDEPENDEN SUBSPACE ANALYSIS FOR DRUM RANSCRIPION Derry FitzGerald, Eugene Coyle D.I.., Rathmines Rd, Dublin, Ireland derryfitzgerald@dit.ie eugene.coyle@dit.ie Bob Lawlor Department of Electronic

More information

Joint dereverberation and residual echo suppression of speech signals in noisy environments Habets, E.A.P.; Gannot, S.; Cohen, I.; Sommen, P.C.W.

Joint dereverberation and residual echo suppression of speech signals in noisy environments Habets, E.A.P.; Gannot, S.; Cohen, I.; Sommen, P.C.W. Joint dereverberation and residual echo suppression of speech signals in noisy environments Habets, E.A.P.; Gannot, S.; Cohen, I.; Sommen, P.C.W. Published in: IEEE Transactions on Audio, Speech, and Language

More information

A Closed Form for False Location Injection under Time Difference of Arrival

A Closed Form for False Location Injection under Time Difference of Arrival A Closed Form for False Location Injection under Time Difference of Arrival Lauren M. Huie Mark L. Fowler lauren.huie@rl.af.mil mfowler@binghamton.edu Air Force Research Laboratory, Rome, N Department

More information

A BROADBAND BEAMFORMER USING CONTROLLABLE CONSTRAINTS AND MINIMUM VARIANCE

A BROADBAND BEAMFORMER USING CONTROLLABLE CONSTRAINTS AND MINIMUM VARIANCE A BROADBAND BEAMFORMER USING CONTROLLABLE CONSTRAINTS AND MINIMUM VARIANCE Sam Karimian-Azari, Jacob Benesty,, Jesper Rindom Jensen, and Mads Græsbøll Christensen Audio Analysis Lab, AD:MT, Aalborg University,

More information

Approaches for Angle of Arrival Estimation. Wenguang Mao

Approaches for Angle of Arrival Estimation. Wenguang Mao Approaches for Angle of Arrival Estimation Wenguang Mao Angle of Arrival (AoA) Definition: the elevation and azimuth angle of incoming signals Also called direction of arrival (DoA) AoA Estimation Applications:

More information

Image analysis. CS/CME/BIOPHYS/BMI 279 Fall 2015 Ron Dror

Image analysis. CS/CME/BIOPHYS/BMI 279 Fall 2015 Ron Dror Image analysis CS/CME/BIOPHYS/BMI 279 Fall 2015 Ron Dror A two- dimensional image can be described as a function of two variables f(x,y). For a grayscale image, the value of f(x,y) specifies the brightness

More information

WIND SPEED ESTIMATION AND WIND-INDUCED NOISE REDUCTION USING A 2-CHANNEL SMALL MICROPHONE ARRAY

WIND SPEED ESTIMATION AND WIND-INDUCED NOISE REDUCTION USING A 2-CHANNEL SMALL MICROPHONE ARRAY INTER-NOISE 216 WIND SPEED ESTIMATION AND WIND-INDUCED NOISE REDUCTION USING A 2-CHANNEL SMALL MICROPHONE ARRAY Shumpei SAKAI 1 ; Tetsuro MURAKAMI 2 ; Naoto SAKATA 3 ; Hirohumi NAKAJIMA 4 ; Kazuhiro NAKADAI

More information

Acentral problem in the design of wireless networks is how

Acentral problem in the design of wireless networks is how 1968 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 45, NO. 6, SEPTEMBER 1999 Optimal Sequences, Power Control, and User Capacity of Synchronous CDMA Systems with Linear MMSE Multiuser Receivers Pramod

More information

TIIVISTELMÄRAPORTTI (SUMMARY REPORT)

TIIVISTELMÄRAPORTTI (SUMMARY REPORT) 2014/2500M-0015 ISSN 1797-3457 (verkkojulkaisu) ISBN (PDF) 978-951-25-2640-6 TIIVISTELMÄRAPORTTI (SUMMARY REPORT) Modern Signal Processing Methods in Passive Acoustic Surveillance Jaakko Astola*, Bogdan

More information

Indoor Localization based on Multipath Fingerprinting. Presented by: Evgeny Kupershtein Instructed by: Assoc. Prof. Israel Cohen and Dr.

Indoor Localization based on Multipath Fingerprinting. Presented by: Evgeny Kupershtein Instructed by: Assoc. Prof. Israel Cohen and Dr. Indoor Localization based on Multipath Fingerprinting Presented by: Evgeny Kupershtein Instructed by: Assoc. Prof. Israel Cohen and Dr. Mati Wax Research Background This research is based on the work that

More information

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm A.T. Rajamanickam, N.P.Subiramaniyam, A.Balamurugan*,

More information

Joint recognition and direction-of-arrival estimation of simultaneous meetingroom acoustic events

Joint recognition and direction-of-arrival estimation of simultaneous meetingroom acoustic events INTERSPEECH 2013 Joint recognition and direction-of-arrival estimation of simultaneous meetingroom acoustic events Rupayan Chakraborty and Climent Nadeu TALP Research Centre, Department of Signal Theory

More information

1.Discuss the frequency domain techniques of image enhancement in detail.

1.Discuss the frequency domain techniques of image enhancement in detail. 1.Discuss the frequency domain techniques of image enhancement in detail. Enhancement In Frequency Domain: The frequency domain methods of image enhancement are based on convolution theorem. This is represented

More information

A Novel Technique or Blind Bandwidth Estimation of the Radio Communication Signal

A Novel Technique or Blind Bandwidth Estimation of the Radio Communication Signal International Journal of ISSN 0974-2107 Systems and Technologies IJST Vol.3, No.1, pp 11-16 KLEF 2010 A Novel Technique or Blind Bandwidth Estimation of the Radio Communication Signal Gaurav Lohiya 1,

More information

Dual Transfer Function GSC and Application to Joint Noise Reduction and Acoustic Echo Cancellation

Dual Transfer Function GSC and Application to Joint Noise Reduction and Acoustic Echo Cancellation Dual Transfer Function GSC and Application to Joint Noise Reduction and Acoustic Echo Cancellation Gal Reuven Under supervision of Sharon Gannot 1 and Israel Cohen 2 1 School of Engineering, Bar-Ilan University,

More information

A New Subspace Identification Algorithm for High-Resolution DOA Estimation

A New Subspace Identification Algorithm for High-Resolution DOA Estimation 1382 IEEE TRANSACTIONS ON ANTENNAS AND PROPAGATION, VOL. 50, NO. 10, OCTOBER 2002 A New Subspace Identification Algorithm for High-Resolution DOA Estimation Michael L. McCloud, Member, IEEE, and Louis

More information

Performance Analysis of Maximum Likelihood Detection in a MIMO Antenna System

Performance Analysis of Maximum Likelihood Detection in a MIMO Antenna System IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. 50, NO. 2, FEBRUARY 2002 187 Performance Analysis of Maximum Likelihood Detection in a MIMO Antenna System Xu Zhu Ross D. Murch, Senior Member, IEEE Abstract In

More information

A Novel Approach for the Characterization of FSK Low Probability of Intercept Radar Signals Via Application of the Reassignment Method

A Novel Approach for the Characterization of FSK Low Probability of Intercept Radar Signals Via Application of the Reassignment Method A Novel Approach for the Characterization of FSK Low Probability of Intercept Radar Signals Via Application of the Reassignment Method Daniel Stevens, Member, IEEE Sensor Data Exploitation Branch Air Force

More information

Communications Overhead as the Cost of Constraints

Communications Overhead as the Cost of Constraints Communications Overhead as the Cost of Constraints J. Nicholas Laneman and Brian. Dunn Department of Electrical Engineering University of Notre Dame Email: {jnl,bdunn}@nd.edu Abstract This paper speculates

More information

Robust Near-Field Adaptive Beamforming with Distance Discrimination

Robust Near-Field Adaptive Beamforming with Distance Discrimination Missouri University of Science and Technology Scholars' Mine Electrical and Computer Engineering Faculty Research & Creative Works Electrical and Computer Engineering 1-1-2004 Robust Near-Field Adaptive

More information

HIGH ORDER MODULATION SHAPED TO WORK WITH RADIO IMPERFECTIONS

HIGH ORDER MODULATION SHAPED TO WORK WITH RADIO IMPERFECTIONS HIGH ORDER MODULATION SHAPED TO WORK WITH RADIO IMPERFECTIONS Karl Martin Gjertsen 1 Nera Networks AS, P.O. Box 79 N-52 Bergen, Norway ABSTRACT A novel layout of constellations has been conceived, promising

More information

STAP approach for DOA estimation using microphone arrays

STAP approach for DOA estimation using microphone arrays STAP approach for DOA estimation using microphone arrays Vera Behar a, Christo Kabakchiev b, Vladimir Kyovtorov c a Institute for Parallel Processing (IPP) Bulgarian Academy of Sciences (BAS), behar@bas.bg;

More information

Analysis of LMS and NLMS Adaptive Beamforming Algorithms

Analysis of LMS and NLMS Adaptive Beamforming Algorithms Analysis of LMS and NLMS Adaptive Beamforming Algorithms PG Student.Minal. A. Nemade Dept. of Electronics Engg. Asst. Professor D. G. Ganage Dept. of E&TC Engg. Professor & Head M. B. Mali Dept. of E&TC

More information