
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 5, MAY 2013

A Two-Stage Beamforming Approach for Noise Reduction and Dereverberation

Emanuël A. P. Habets, Senior Member, IEEE, and Jacob Benesty

Abstract: In general, the signal-to-noise ratio as well as the signal-to-reverberation ratio of speech received by a microphone decrease when the distance between the talker and the microphone increases. Dereverberation and noise reduction algorithms are essential for many applications such as videoconferencing, hearing aids, and automatic speech recognition to improve the quality and intelligibility of the received desired speech that is corrupted by reverberation and noise. In the last decade, researchers have aimed at estimating the reverberant desired speech signal as received by one of the microphones. Although this approach has led to practical noise reduction algorithms, the spatial diversity of the received desired signal is not exploited to dereverberate the speech signal. In this paper, a two-stage beamforming approach is presented for dereverberation and noise reduction. In the first stage, a signal-independent beamformer is used to generate a reference signal which contains a dereverberated version of the desired speech signal as received at the microphones and residual noise. In the second stage, the filtered microphone signals and the noisy reference signal are used to obtain an estimate of the dereverberated desired speech signal. In this stage, different signal-dependent beamformers can be used depending on the desired operating point in terms of noise reduction and speech distortion. The presented performance evaluation demonstrates the effectiveness of the proposed two-stage approach.

Index Terms: Beamforming, dereverberation, microphone arrays, noise reduction, speech enhancement.

I. INTRODUCTION

Distant or hands-free audio acquisition is required in many applications such as audio-bridging and teleconferencing.
A microphone array, which consists of multiple microphones that are arranged in a specific pattern, can be used for the acquisition. The received microphone signals usually consist of a desired speech signal and interference. The received signals are processed in order to extract the desired speech or, in other words, to reduce the interference. In the last four decades many algorithms have been proposed to process the received signals (cf. [1], [2] and the references therein). The time-domain linearly constrained minimum variance (LCMV) beamformer, proposed by Frost [3], aims at minimizing the output power under linear constraints on the response of the array towards the desired speech signal. Frost proposed an adaptive scheme, which is based on a constrained least-mean-square (LMS) type adaptation. To avoid this constrained adaptation, Griffiths and Jim [4] proposed the generalized sidelobe canceller (GSC) structure, which separates the output power minimization and the application of the constraint. Initially, the GSC structure was based on the assumption that the different sensors receive a delayed version of the desired signal.

Manuscript received July 10, 2012; revised September 30, 2012; accepted December 05, 2012. Date of publication January 11, 2013; date of current version February 01, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Søren Holdt Jensen.
E. A. P. Habets is with the International Audio Laboratories Erlangen (a joint institution of the University of Erlangen-Nuremberg and Fraunhofer IIS), 91058 Erlangen, Germany (e-mail: emanuel.habets@audiolabs-erlangen.de).
J. Benesty is with the INRS-EMT, University of Quebec, Montreal, QC H3C 3P8, Canada (e-mail: benesty@emt.inrs.ca).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASL.2013.2239292
The GSC structure was re-derived in the frequency domain, and extended to deal with general acoustic transfer functions (ATFs), by Affes and Grenier [5] and later by Gannot et al. [6]. In the frequency domain, only one constraint towards the desired speech signal is required per frequency such that the LCMV reduces to a minimum variance distortionless response (MVDR) beamformer. The GSC comprises three blocks: i) a filter-and-sum beamformer (FSB), which defines the desired signal component, ii) a blocking matrix (BM), which blocks the desired speech components resulting in reference noise signals, and iii) a multichannel adaptive noise canceller (ANC), which eliminates noise components that leak through the sidelobes of the FSB.

Without taking into account that the received signals are reverberant, the GSC implementation of the MVDR beamformer suffers from two basic problems [2]. The first problem is caused by the non-ideal FSB, which can lead to a non-coherent filter-and-sum operation that results in a distortion of the desired speech. The second problem is caused by the non-perfect BM and is known as the leakage problem. If signals that are correlated with the desired signal leak into the noise reference signals, the noise canceller filters will subtract speech components from the FSB output, causing self-cancellation of the desired speech, and hence a severe distortion of the desired speech. Even when the ANC filters are estimated or adapted during noise-only periods, the distortion is unavoidable. Because the MVDR beamformer and the GSC are mathematically equivalent, the MVDR beamformer also suffers from these two basic problems. The techniques to limit the distortion resulting from this leakage can be divided into two classes.
Techniques in the first class reduce the leakage components in the noise references i) by using a more robust fixed blocking matrix design [7]–[10], ii) by using an adaptive blocking matrix [11]–[13], or iii) by constructing a blocking matrix based on estimating the ratios of the acoustic transfer functions from the speech source to the microphone array [6], [14], [15]. Techniques in the second class limit the distorting effect of the leakage components i) by controlling the step-size of the multichannel adaptive algorithm, such that the multichannel ANC is only updated during periods (and for frequencies) where the signal-to-noise ratio is low [6], [7], [11]–[13], [16]–[18], ii) by constraining the update formula for the multichannel adaptive filter [9], [12], [19], [20], or iii) by taking speech distortion due to speech leakage into account using a speech distortion weighted multichannel Wiener filter [21], [22]. In practice, several of these techniques are usually combined.

The beamformers proposed in [1], [6], [15] aim at estimating the reverberant signal received by one of the microphones and thereby alleviate the leakage problem. Because there is a tradeoff between noise reduction and (single-channel) dereverberation as shown in [23], these beamformers also maximize the amount of noise reduction. An obvious disadvantage is that the signal-to-reverberation ratio is not improved. In [24], we presented a two-stage approach that allows both multichannel dereverberation and multichannel noise reduction. The first stage consisted of a delay-and-sum beamformer and the second stage consisted of an MVDR beamformer.

In this paper, we generalize and analyze the previously proposed approach. In the first stage, a signal-independent beamformer is used to generate a reference signal which contains a dereverberated version of the desired speech signal as received at the microphones and residual interference. In the second stage, a signal-dependent beamformer is applied to the filtered microphone signals from the first stage: the filtered microphone signals and the noisy reference signal are used to obtain an estimate of the dereverberated desired speech signal. Here, different signal-dependent beamformers can be used depending on the desired operating point in terms of noise reduction and speech distortion.

The paper is organized as follows. In Section II the proposed two-stage beamforming approach is presented. In Section III, performance measures are defined.

1558-7916/$31.00 © 2013 IEEE
Different beamformers that can be used in the first stage and the second stage are discussed in Section IV and Section V, respectively. The performance of the proposed two-stage approach is subsequently evaluated in Section VI. Finally, conclusions are provided in Section VII.

II. TWO-STAGE APPROACH

We consider a signal model in which an $N$-element microphone array captures a convolved source signal in some noise field. In the short-time Fourier transform (STFT) domain we can express the spectral coefficients of the received signals at time frame $m$ and frequency bin $k$ as (see footnote 1)

$$Y_n(k,m) = G_n(k)\,S(k,m) + V_n(k,m), \quad n = 1, 2, \ldots, N, \tag{1}$$

where $G_n(k)$ is the acoustic transfer function from the unknown speech source $S(k,m)$ to the $n$th microphone, which is assumed to be time-invariant, and $V_n(k,m)$ is the additive noise at microphone $n$. We assume that the spectral coefficients $S(k,m)$ and $V_n(k,m)$ are uncorrelated zero-mean complex random variables.

1 In this work we assume that the analysis window is sufficiently long such that the multiplicative transfer function approximation [25] holds.

A. First Stage: Dereverberation

The first stage consists of a signal-independent beamformer that generates a reference signal that contains a dereverberated version of the desired speech as received by the first microphone (see footnote 2) and residual noise. In the STFT domain, the filtered microphone signals are given by

$$Z_n(k,m) = H_n^*(k)\,Y_n(k,m) = X_n(k,m) + U_n(k,m), \tag{2}$$

where the superscript $^*$ denotes complex conjugation, $H_n^*(k)$ denotes the complex gain that is applied to the $k$th frequency bin of the $n$th microphone signal, and $X_n(k,m) = H_n^*(k) G_n(k) S(k,m)$ and $U_n(k,m) = H_n^*(k) V_n(k,m)$ are the filtered speech and noise components. The output of the signal-independent beamformer is now given by

$$Z(k,m) = \sum_{n=1}^{N} Z_n(k,m) = X(k,m) + U(k,m),$$

where $X(k,m) = \sum_{n} X_n(k,m)$ is the desired signal and $U(k,m) = \sum_{n} U_n(k,m)$ is the reference noise signal. The signal $Z(k,m)$ is considered as the reference signal since it contains the desired signal, $X(k,m)$, that we will try to extract, in the second stage, from the noisy observations, $Z_n(k,m)$. We see that the variance of $Z(k,m)$ is

$$\phi_Z(k) = E\left[|Z(k,m)|^2\right] = \phi_X(k) + \phi_U(k), \tag{3}$$

where $E[\cdot]$ denotes mathematical expectation, and $\phi_X(k)$ and $\phi_U(k)$ are the variances of $X(k,m)$ and $U(k,m)$, respectively. With a proper choice of the beamformer, which is discussed in Section IV, the inverse STFT of $X(k,m)$ is, in principle, less reverberant than the desired signal received at any of the microphones.
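The first-stage filter-and-sum operation described above can be illustrated with a minimal sketch for a single STFT bin. The two-microphone setup, the pure-delay transfer functions, and the delay-and-sum gains are assumptions made for illustration only; the function name `first_stage` is hypothetical and not taken from the paper.

```python
import cmath

def first_stage(y, h):
    """Filter-and-sum for one STFT bin: Z = sum_n conj(h[n]) * y[n]."""
    return sum(hn.conjugate() * yn for hn, yn in zip(h, y))

# Hypothetical noise-free example at one time-frequency bin.
s = 1.0 + 0.5j                                  # source coefficient S(k, m)
g = [cmath.exp(-0.3j), cmath.exp(-0.9j)]        # assumed pure-delay transfer functions G_n(k)
y = [gn * s for gn in g]                        # microphone coefficients Y_n(k, m) without noise

# Delay-and-sum gains (an assumption): conj(h_n) undoes the phase of G_n and averages.
h = [gn / len(g) for gn in g]

z = first_stage(y, h)                           # reference signal Z(k, m)
```

In this idealized case the beamformer output coincides with the source coefficient, since each channel is phase-aligned before averaging. With reverberant transfer functions and noise, the output would instead contain the desired signal plus a reference noise term.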
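This, basically, concludes the first stage, which consists of dereverberating the reverberant speech signal at the microphones with a signal-independent beamformer.

2 In this work we choose the first microphone as the reference microphone.

B. Second Stage: Noise Reduction

The second stage, which consists of multichannel noise reduction, is also performed in the STFT domain. First, it is very important to write (2) as a function of the desired signal $X(k,m)$ (see footnote 3). Since $X_n(k,m)$ and $X(k,m)$ are coherent (i.e., they come from the same source) and the signal-independent beamformer in the first stage is linear, it can be shown that

$$X_n(k,m) = \gamma_n(k)\,X(k,m), \tag{4}$$

where

$$\gamma_n(k) = \frac{E\left[X_n(k,m)\,X^*(k,m)\right]}{\phi_X(k)} \tag{5}$$

is the partially normalized [with respect to $X(k,m)$] coherence function between $X_n(k,m)$ and $X(k,m)$.

3 In principle we can also write (1) as a function of the desired signal. In this case the signal-dependent beamformer would be applied directly to the microphone signals.

1) Property 2.1: For the partially normalized coherence functions the following property holds: $\sum_{n=1}^{N} \gamma_n(k) = 1$.

Proof:

$$\sum_{n=1}^{N} \gamma_n(k) = \frac{E\left[\sum_{n=1}^{N} X_n(k,m)\,X^*(k,m)\right]}{\phi_X(k)} = \frac{\phi_X(k)}{\phi_X(k)} = 1. \tag{6}$$

Using (4), we can express (2) as

$$Z_n(k,m) = \gamma_n(k)\,X(k,m) + U_n(k,m). \tag{7}$$

It is more convenient to write the filtered microphone signals in a vector notation as

$$\mathbf{z}(k,m) = \boldsymbol{\gamma}(k)\,X(k,m) + \mathbf{u}(k,m), \tag{8}$$

where

$$\mathbf{z}(k,m) = \left[Z_1(k,m)\;\; Z_2(k,m)\;\; \cdots\;\; Z_N(k,m)\right]^T, \tag{9}$$

the superscript $^T$ denotes the transpose of a vector or a matrix, $\mathbf{u}(k,m)$ and $\boldsymbol{\gamma}(k)$ are defined similarly to $\mathbf{z}(k,m)$, and $\boldsymbol{\gamma}(k)$ is the partially normalized [with respect to $X(k,m)$] coherence vector (of length $N$) between the filtered speech components and $X(k,m)$, which can be seen as the steering vector. From (8), we easily deduce the covariance matrix of $\mathbf{z}(k,m)$:

$$\boldsymbol{\Phi}_{\mathbf{z}}(k) = \phi_X(k)\,\boldsymbol{\gamma}(k)\boldsymbol{\gamma}^H(k) + \boldsymbol{\Phi}_{\mathbf{u}}(k), \tag{10}$$

where the superscript $^H$ is the transpose-conjugate operator and $\boldsymbol{\Phi}_{\mathbf{z}}(k)$ and $\boldsymbol{\Phi}_{\mathbf{u}}(k)$ are the covariance matrices of $\mathbf{z}(k,m)$ and $\mathbf{u}(k,m)$, respectively. The matrix $\boldsymbol{\Phi}_{\mathbf{z}}(k)$ is the sum of two other matrices: one is of rank equal to 1 and the other one (the covariance matrix of the noise) is assumed to be full rank.

Now, multichannel noise reduction is performed by applying complex weights to $\mathbf{z}(k,m)$ and summing across the array, i.e.,

$$\hat{X}(k,m) = \sum_{n=1}^{N} W_n^*(k)\,Z_n(k,m) = \mathbf{w}^H(k)\,\mathbf{z}(k,m) = X_{\mathrm{fd}}(k,m) + U_{\mathrm{rn}}(k,m), \tag{11}$$

where $\hat{X}(k,m)$ is an estimate of $X(k,m)$, $\mathbf{w}(k) = [W_1(k)\; W_2(k)\; \cdots\; W_N(k)]^T$ is a filter of length $N$ containing all the complex gains, $X_{\mathrm{fd}}(k,m) = \mathbf{w}^H(k)\boldsymbol{\gamma}(k)\,X(k,m)$ is the filtered desired signal, and $U_{\mathrm{rn}}(k,m) = \mathbf{w}^H(k)\,\mathbf{u}(k,m)$ is the residual noise. We then deduce the variance of $\hat{X}(k,m)$:

$$\phi_{\hat{X}}(k) = \phi_X(k)\left|\mathbf{w}^H(k)\boldsymbol{\gamma}(k)\right|^2 + \mathbf{w}^H(k)\,\boldsymbol{\Phi}_{\mathbf{u}}(k)\,\mathbf{w}(k). \tag{12}$$

Fig. 1 summarizes the two-stage scheme. Since the entire system is linear, we can easily change the order of the first and second stage. In Section V, we show how to derive different signal-dependent beamformers, $\mathbf{w}(k)$.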
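Property 2.1 can be checked numerically with a short sketch. It assumes the coherence function reduces to the ratio of each filtered transfer function to their sum, which follows when all channels observe the same source; the transfer functions and first-stage gains below are arbitrary illustrative values, and the helper names are hypothetical.

```python
def coherence_vector(g, h):
    """Partially normalized coherence between each filtered channel and their sum.

    For a single source, gamma_n = conj(h_n) * g_n / sum_j conj(h_j) * g_j.
    """
    a = [hn.conjugate() * gn for hn, gn in zip(h, g)]  # filtered transfer functions
    total = sum(a)
    return [an / total for an in a]

def second_stage(w, z):
    """Apply complex weights and sum across the array: X_hat = w^H z."""
    return sum(wn.conjugate() * zn for wn, zn in zip(w, z))

# Arbitrary (hypothetical) transfer functions and first-stage gains for N = 3.
g = [1.0 + 0.2j, 0.6 - 0.4j, -0.3 + 0.8j]
h = [0.5 + 0.1j, 0.2 - 0.3j, 0.4 + 0.4j]

gamma = coherence_vector(g, h)
gamma_sum = sum(gamma)                    # Property 2.1: equals one

# Noise-free filtered microphone signals z_n = gamma_n * X; the all-ones
# second-stage filter then returns the first-stage output X unchanged.
x = 0.7 - 0.2j
z = [gn * x for gn in gamma]
x_hat = second_stage([1 + 0j] * 3, z)
```

The all-ones filter is a useful reference point: it reproduces the output of the signal-independent beamformer, so any other second-stage filter can be compared against it.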

Fig. 1. Block diagram of the two-stage beamforming approach.

III. PERFORMANCE MEASURES

In this section, we derive some very useful performance measures that should fit well with the two-stage beamforming approach as explained in the previous section and depicted in Fig. 1. We can divide these performance measures into three categories. Performance measures in the first category evaluate the dereverberation performance of the signal-independent beamformer. The second category quantifies the noise reduction, while the third one quantifies the speech distortion. We also discuss the very convenient mean-square error (MSE).

A. Dereverberation

We define the subband input signal-to-reverberation ratio (SRR) as in (13), in terms of the direct-path signal as received by the first microphone, the sample frequency, the propagation time of the desired sound from the position of the source to the position of the first microphone, and the number of frequency bands. The fullband input SRR is given by (14). The subband output SRR is given by (15), and the fullband output SRR is then given by (16).

To accurately compute the output SRR it is important that the direct sound that is part of the processed signals is in phase with the direct sound signal as received by the first microphone. In addition to the above intrusive objective measure, we also use a recently developed non-intrusive measure called the speech to reverberation modulation energy ratio (SRMR) [26].

B. Noise Reduction

We define the subband input signal-to-noise ratio (SNR) as the ratio of the spectral variance of the desired signal over the spectral variance of the noise received at the first microphone, i.e.,

$$\mathrm{iSNR}(k) = \frac{\phi_X(k)}{\phi_{V_1}(k)}, \tag{17}$$

and the fullband input SNR as

$$\mathrm{iSNR} = \frac{\sum_k \phi_X(k)}{\sum_k \phi_{V_1}(k)}, \tag{18}$$

where $\phi_X(k)$ is the variance of the desired signal at the output of the first stage and $\phi_{V_1}(k)$ is the variance of the noise at the first microphone. In general, we can expect that the variance of the desired signal does not exceed the variance of the reverberant speech at the first microphone, because the power of the dereverberated signal is smaller than the power of the reverberant signal as received by the first microphone. These input SNRs are therefore expected to be smaller compared to the input SNRs defined in, for example, [1].

To quantify the level of noise remaining at the output of the second stage, we define the output SNR as the ratio of the spectral variance of the filtered desired signal over the spectral variance of the residual noise. We have the subband output SNR

$$\mathrm{oSNR}[\mathbf{w}(k)] = \frac{\phi_X(k)\left|\mathbf{w}^H(k)\boldsymbol{\gamma}(k)\right|^2}{\mathbf{w}^H(k)\,\boldsymbol{\Phi}_{\mathbf{u}}(k)\,\mathbf{w}(k)} \tag{19}$$

and the fullband output SNR

$$\mathrm{oSNR}(\mathbf{w}) = \frac{\sum_k \phi_X(k)\left|\mathbf{w}^H(k)\boldsymbol{\gamma}(k)\right|^2}{\sum_k \mathbf{w}^H(k)\,\boldsymbol{\Phi}_{\mathbf{u}}(k)\,\mathbf{w}(k)}, \tag{20}$$

where $\boldsymbol{\gamma}(k)$ is the partially normalized coherence vector and $\boldsymbol{\Phi}_{\mathbf{u}}(k)$ is the covariance matrix of the filtered noise signals. For the particular filter $\mathbf{w}(k) = \mathbf{1} = [1\; 1\; \cdots\; 1]^T$ of length $N$, which simply returns the output of the first stage, the subband output SNR is given by

$$\mathrm{oSNR}[\mathbf{1}] = \frac{\phi_X(k)}{\mathbf{1}^T \boldsymbol{\Phi}_{\mathbf{u}}(k)\,\mathbf{1}} = \frac{\phi_X(k)}{\phi_U(k)} = \mathcal{A}(k)\,\mathrm{iSNR}(k), \tag{21}$$

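where

$$\mathcal{A}(k) = \frac{\phi_{V_1}(k)}{\phi_U(k)}, \tag{22}$$

the ratio of the noise variance at the first microphone to the variance of the reference noise, denotes the subband array gain of the signal-independent beamformer in the first stage measured with respect to the first microphone. The fullband output SNR for this filter is given by

$$\mathrm{oSNR}(\mathbf{1}) = \frac{\sum_k \phi_X(k)}{\sum_k \phi_U(k)}. \tag{23}$$

The noise reduction factor measures the amount of noise being rejected by the beamformer. This quantity is defined as the ratio of the power of the reference noise over the power of the noise remaining at the beamformer output. We define the subband and fullband noise reduction factors as

$$\xi_{\mathrm{nr}}[\mathbf{w}(k)] = \frac{\phi_U(k)}{\mathbf{w}^H(k)\,\boldsymbol{\Phi}_{\mathbf{u}}(k)\,\mathbf{w}(k)}, \tag{24}$$

$$\xi_{\mathrm{nr}}(\mathbf{w}) = \frac{\sum_k \phi_U(k)}{\sum_k \mathbf{w}^H(k)\,\boldsymbol{\Phi}_{\mathbf{u}}(k)\,\mathbf{w}(k)}. \tag{25}$$

The noise reduction factors are expected to be lower bounded by 1; otherwise, the signal-dependent beamformer amplifies the residual noise at the output of the signal-independent beamformer. The higher the value of the noise reduction factor, the more the noise is rejected.

C. Speech Distortion

Since the noise is reduced by the filtering operation, so is, in general, the desired speech. This speech reduction (or cancellation) implies, in general, speech distortion. The speech reduction factor, which is somewhat similar to the noise reduction factor, is defined as the ratio of the variance of the desired signal over the variance of the filtered desired signal. The subband and fullband speech reduction factors are defined as

$$\xi_{\mathrm{sr}}[\mathbf{w}(k)] = \frac{1}{\left|\mathbf{w}^H(k)\boldsymbol{\gamma}(k)\right|^2}, \tag{26}$$

$$\xi_{\mathrm{sr}}(\mathbf{w}) = \frac{\sum_k \phi_X(k)}{\sum_k \phi_X(k)\left|\mathbf{w}^H(k)\boldsymbol{\gamma}(k)\right|^2}. \tag{27}$$

A key observation is that the design of beamformers that do not cancel the desired signal requires the constraint

$$\mathbf{w}^H(k)\,\boldsymbol{\gamma}(k) = 1. \tag{28}$$

Thus, the speech reduction factor is equal to 1 if there is no cancellation and expected to be greater than 1 when cancellation happens.

Another way to measure the distortion of the desired speech signal due to the filtering operation is the speech distortion index, which is defined as the mean-square error between the desired signal and its estimate, normalized by the variance of the desired signal. We have the subband and fullband speech distortion indices:

$$\upsilon_{\mathrm{sd}}[\mathbf{w}(k)] = \frac{E\left[\left|X_{\mathrm{fd}}(k,m) - X(k,m)\right|^2\right]}{\phi_X(k)} = \left|\mathbf{w}^H(k)\boldsymbol{\gamma}(k) - 1\right|^2, \tag{29}$$

$$\upsilon_{\mathrm{sd}}(\mathbf{w}) = \frac{\sum_k \phi_X(k)\left|\mathbf{w}^H(k)\boldsymbol{\gamma}(k) - 1\right|^2}{\sum_k \phi_X(k)}. \tag{30}$$

We also see from this measure that the design of beamformers that do not distort the desired signal requires the constraint

$$\upsilon_{\mathrm{sd}}[\mathbf{w}(k)] = 0. \tag{31}$$

Therefore, the speech distortion index is equal to 0 if there is no distortion and expected to be greater than 0 when distortion occurs.

It can easily be verified that we have the following fundamental relation in both the subband and fullband cases:

$$\frac{\mathrm{oSNR}(\mathbf{w})}{\mathrm{iSNR}} = \mathcal{A}\,\frac{\xi_{\mathrm{nr}}(\mathbf{w})}{\xi_{\mathrm{sr}}(\mathbf{w})}. \tag{32}$$

D. Mean-Square Error (MSE) Criterion

We define the error signal between the estimated and desired signals at frequency bin $k$ as

$$\mathcal{E}(k,m) = \hat{X}(k,m) - X(k,m) = \mathbf{w}^H(k)\,\mathbf{z}(k,m) - X(k,m), \tag{33}$$

which can also be expressed as the sum of two non-coherent error signals:

$$\mathcal{E}(k,m) = \mathcal{E}_{\mathrm{d}}(k,m) + \mathcal{E}_{\mathrm{r}}(k,m), \tag{34}$$

where

$$\mathcal{E}_{\mathrm{d}}(k,m) = \left[\mathbf{w}^H(k)\boldsymbol{\gamma}(k) - 1\right] X(k,m) \tag{35}$$

is the speech distortion due to the complex filter and

$$\mathcal{E}_{\mathrm{r}}(k,m) = \mathbf{w}^H(k)\,\mathbf{u}(k,m) \tag{36}$$

represents the residual noise.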
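The noise reduction factor, speech reduction factor, and speech distortion index of this section can be evaluated for one frequency bin with the following sketch. It assumes the rank-one speech model with coherence vector `gamma` and a Hermitian noise covariance `phi_u`; the numerical values and the names `measures` and `isnr_stage2` (the SNR at the first-stage output) are illustrative assumptions.

```python
def inner(a, b):
    """Return a^H b for two equal-length complex vectors."""
    return sum(x.conjugate() * y for x, y in zip(a, b))

def quad(w, phi):
    """Return the (real) quadratic form w^H Phi w for a Hermitian matrix Phi."""
    n = len(w)
    return sum(w[i].conjugate() * phi[i][j] * w[j]
               for i in range(n) for j in range(n)).real

def measures(w, gamma, phi_u, phi_x):
    """Subband performance measures of a second-stage filter w."""
    ones = [1 + 0j] * len(w)
    phi_ref = quad(ones, phi_u)          # variance of the reference noise
    gain = abs(inner(w, gamma)) ** 2     # |w^H gamma|^2
    noise_out = quad(w, phi_u)           # variance of the residual noise
    return {
        "osnr": phi_x * gain / noise_out,
        "isnr_stage2": phi_x / phi_ref,  # SNR at the output of the first stage
        "xi_nr": phi_ref / noise_out,    # noise reduction factor
        "xi_sr": 1.0 / gain,             # speech reduction factor
        "v_sd": abs(inner(w, gamma) - 1) ** 2,  # speech distortion index
    }

# Hypothetical two-channel example; gamma sums to one (Property 2.1).
gamma = [0.6 + 0.1j, 0.4 - 0.1j]
phi_u = [[1.0, 0.2 + 0.1j], [0.2 - 0.1j, 0.8]]
m = measures([0.5 + 0j, 0.3 + 0.2j], gamma, phi_u, phi_x=2.0)
m_ones = measures([1 + 0j, 1 + 0j], gamma, phi_u, phi_x=2.0)
```

For any filter, the SNR gain relative to the first-stage output equals the ratio of the noise reduction factor to the speech reduction factor, which is one way to read the fundamental relation. The all-ones filter gives a noise reduction factor and speech reduction factor of one and a speech distortion index of zero.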

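The STFT-domain (or subband) mean-square error (MSE) is then

$$J[\mathbf{w}(k)] = E\left[|\mathcal{E}(k,m)|^2\right] = \phi_X(k) + \mathbf{w}^H(k)\,\boldsymbol{\Phi}_{\mathbf{z}}(k)\,\mathbf{w}(k) - 2\,\phi_X(k)\,\Re\left\{\mathbf{w}^H(k)\boldsymbol{\gamma}(k)\right\}, \tag{37}$$

where $\Re\{\cdot\}$ is the real part of a complex number. The subband MSE can also be expressed as

$$J[\mathbf{w}(k)] = J_{\mathrm{d}}[\mathbf{w}(k)] + J_{\mathrm{r}}[\mathbf{w}(k)], \tag{38}$$

where $J_{\mathrm{d}}[\mathbf{w}(k)] = \phi_X(k)\left|\mathbf{w}^H(k)\boldsymbol{\gamma}(k) - 1\right|^2$ is the MSE of the speech distortion and $J_{\mathrm{r}}[\mathbf{w}(k)] = \mathbf{w}^H(k)\,\boldsymbol{\Phi}_{\mathbf{u}}(k)\,\mathbf{w}(k)$ is the MSE of the residual noise. The objective of STFT-domain beamforming with the linear array model is to find optimal filtering vectors that would either minimize $J[\mathbf{w}(k)]$, or minimize $J_{\mathrm{d}}[\mathbf{w}(k)]$ or $J_{\mathrm{r}}[\mathbf{w}(k)]$ subject to some constraint.

IV. FIRST STAGE: SIGNAL-INDEPENDENT BEAMFORMER

In this section, we focus on the signal-independent beamformer used to dereverberate the desired signal as received by the microphone array. In principle, we can also use a multichannel equalizer in the first stage. The equalizer can be computed directly from the observed data or indirectly by using an estimate of the acoustic transfer functions. In this work, we focus on a beamformer that maximizes the directivity index [27] in the direction of the desired source, also known as a superdirective beamformer.

In the following we assume the knowledge of the geometry of the array as well as the position of the source. Let $\tau_n$ be a function that relates the source position to the relative delay between microphones 1 and $n$ (with $\tau_1 = 0$) such that the corresponding free-field steering vector is given by

$$\mathbf{d}(k) = \left[1\;\; e^{-j\omega_k \tau_2}\;\; \cdots\;\; e^{-j\omega_k \tau_N}\right]^T, \tag{39}$$

where $\omega_k$ is the angular frequency corresponding to frequency bin $k$. The filter vector $\mathbf{h}(k) = [H_1(k)\; H_2(k)\; \cdots\; H_N(k)]^T$ of the superdirective beamformer is given by

$$\mathbf{h}_{\mathrm{SD}}(k) = \frac{\boldsymbol{\Gamma}^{-1}(k)\,\mathbf{d}(k)}{\mathbf{d}^H(k)\,\boldsymbol{\Gamma}^{-1}(k)\,\mathbf{d}(k)}, \tag{40}$$

where $\boldsymbol{\Gamma}(k)$ denotes the normalized coherence matrix observed in an ideal three-dimensional diffuse sound field. The value at the $i$th row and $j$th column of $\boldsymbol{\Gamma}(k)$ for omnidirectional microphones is given by

$$\left[\boldsymbol{\Gamma}(k)\right]_{ij} = \frac{\sin\left(\omega_k\,\delta_{ij}/c\right)}{\omega_k\,\delta_{ij}/c}, \tag{41}$$

where $c$ is the speed of sound in air and $\delta_{ij}$ is the distance between the $i$th and $j$th microphones.

It is well known that superdirective beamformers are sensitive to additive noise. Especially at low frequencies, the noise is usually amplified by the beamformer. As shown in [28], superdirective beamformers are also sensitive to deviations from the assumed microphone characteristics (gain, phase, and position). A commonly used technique to limit the noise amplification, which inherently also increases the robustness against microphone mismatches, is to impose a white noise gain (WNG) constraint, i.e.,

$$\frac{\left|\mathbf{h}^H(k)\,\mathbf{d}(k)\right|^2}{\mathbf{h}^H(k)\,\mathbf{h}(k)} \geq \beta, \tag{42}$$

where $\beta$ represents the lower bound on the WNG. The filter vector corresponding to this robust design is given by the solution of the following convex optimization problem:

$$\min_{\mathbf{h}(k)}\; \mathbf{h}^H(k)\,\boldsymbol{\Gamma}(k)\,\mathbf{h}(k) \quad \text{subject to} \quad \mathbf{h}^H(k)\,\mathbf{d}(k) = 1 \;\;\text{and}\;\; \mathbf{h}^H(k)\,\mathbf{h}(k) \leq \beta^{-1}, \tag{43}$$

which can be obtained via second-order cone programming [29].

V. SECOND STAGE: SIGNAL-DEPENDENT BEAMFORMERS

In this section, we derive and study four fundamental signal-dependent beamformers that can be used in the second stage: the maximum SNR beamformer, the multichannel Wiener filter, the minimum variance distortionless response beamformer, and the tradeoff beamformer. The last three are derived from MSE criteria while the first one is derived from the subband output SNR.

A. Maximum SNR Beamformer

In the subband output SNR, as defined in (19), we recognize the generalized Rayleigh quotient. It is well known that this quotient is maximized with the maximum eigenvector of the matrix $\phi_X(k)\,\boldsymbol{\Phi}_{\mathbf{u}}^{-1}(k)\,\boldsymbol{\gamma}(k)\boldsymbol{\gamma}^H(k)$. Let us denote by $\lambda_{\max}(k)$ the maximum eigenvalue corresponding to this maximum eigenvector, $\mathbf{w}_{\max}(k)$. Since the rank of the mentioned matrix is equal to 1, we have

$$\lambda_{\max}(k) = \mathrm{tr}\left[\phi_X(k)\,\boldsymbol{\Phi}_{\mathbf{u}}^{-1}(k)\,\boldsymbol{\gamma}(k)\boldsymbol{\gamma}^H(k)\right] = \phi_X(k)\,\boldsymbol{\gamma}^H(k)\,\boldsymbol{\Phi}_{\mathbf{u}}^{-1}(k)\,\boldsymbol{\gamma}(k), \tag{44}$$

where $\mathrm{tr}[\cdot]$ denotes the trace of a square matrix. As a result,

$$\mathrm{oSNR}[\mathbf{w}_{\max}(k)] = \lambda_{\max}(k), \tag{45}$$

which corresponds to the maximum possible subband SNR. Obviously, we also have

$$\mathbf{w}_{\max}(k) = \varsigma(k)\,\boldsymbol{\Phi}_{\mathbf{u}}^{-1}(k)\,\boldsymbol{\gamma}(k), \tag{46}$$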

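where $\varsigma(k)$ is an arbitrary scaling factor different from zero. While this factor has no effect on the subband output SNR, it does affect the fullband output SNR and the speech distortion. In fact, all beamformers derived in the rest of this section are equivalent up to this scaling factor. These filters also try to find the respective scaling factors depending on what we optimize.

B. Multichannel Wiener Filter

The multichannel Wiener filter (MWF) is found by minimizing the subband MSE, $J[\mathbf{w}(k)]$. We get

$$\mathbf{w}_{\mathrm{W}}(k) = \phi_X(k)\,\boldsymbol{\Phi}_{\mathbf{z}}^{-1}(k)\,\boldsymbol{\gamma}(k). \tag{47}$$

Another way to write the MWF is

$$\mathbf{w}_{\mathrm{W}}(k) = \left[\mathbf{I}_N - \boldsymbol{\Phi}_{\mathbf{z}}^{-1}(k)\,\boldsymbol{\Phi}_{\mathbf{u}}(k)\right]\mathbf{1}, \tag{48}$$

where $\mathbf{I}_N$ is the identity matrix of size $N \times N$ and $\mathbf{1}$ is the all-ones vector of length $N$. We observe that, with this formulation, the MWF relies on the second-order statistics of the observation and noise signals.

We now propose to write the MWF in a way that makes it easier to compare to other beamformers. Determining the inverse of $\boldsymbol{\Phi}_{\mathbf{z}}(k)$ from (10) with Woodbury's identity, we find that

$$\boldsymbol{\Phi}_{\mathbf{z}}^{-1}(k) = \boldsymbol{\Phi}_{\mathbf{u}}^{-1}(k) - \frac{\boldsymbol{\Phi}_{\mathbf{u}}^{-1}(k)\,\boldsymbol{\gamma}(k)\boldsymbol{\gamma}^H(k)\,\boldsymbol{\Phi}_{\mathbf{u}}^{-1}(k)}{\phi_X^{-1}(k) + \boldsymbol{\gamma}^H(k)\,\boldsymbol{\Phi}_{\mathbf{u}}^{-1}(k)\,\boldsymbol{\gamma}(k)}. \tag{49}$$

Substituting (49) into (47) gives

$$\mathbf{w}_{\mathrm{W}}(k) = \frac{\phi_X(k)\,\boldsymbol{\Phi}_{\mathbf{u}}^{-1}(k)\,\boldsymbol{\gamma}(k)}{1 + \phi_X(k)\,\boldsymbol{\gamma}^H(k)\,\boldsymbol{\Phi}_{\mathbf{u}}^{-1}(k)\,\boldsymbol{\gamma}(k)}. \tag{50}$$

Using a further identity, we can rewrite (50) as (51). From (50), we deduce that the subband output SNR is

$$\mathrm{oSNR}[\mathbf{w}_{\mathrm{W}}(k)] = \lambda_{\max}(k), \tag{52}$$

together with the relations (53) and (54), which hold since the MWF maximizes the subband output SNR at the second stage. The speech distortion indices are

$$\upsilon_{\mathrm{sd}}[\mathbf{w}_{\mathrm{W}}(k)] = \frac{1}{\left[1 + \lambda_{\max}(k)\right]^2}, \tag{55}$$

$$\upsilon_{\mathrm{sd}}(\mathbf{w}_{\mathrm{W}}) = \frac{\sum_k \phi_X(k)\left[1 + \lambda_{\max}(k)\right]^{-2}}{\sum_k \phi_X(k)}. \tag{56}$$

The higher the value of $\mathrm{oSNR}[\mathbf{w}_{\mathrm{W}}(k)]$ (and/or the number of microphones), the less the desired signal is distorted. The fullband output SNR of the MWF is

$$\mathrm{oSNR}(\mathbf{w}_{\mathrm{W}}) = \frac{\sum_k \phi_X(k)\,\lambda_{\max}^2(k)\left[1 + \lambda_{\max}(k)\right]^{-2}}{\sum_k \phi_X(k)\,\lambda_{\max}(k)\left[1 + \lambda_{\max}(k)\right]^{-2}}. \tag{57}$$

1) Property 5.1: With the STFT-domain MWF given in (47), the fullband output SNR is always greater than or equal to the fullband input SNR at the second stage.

Proof: See Section V-D.

C. MVDR Beamformer

The well-known MVDR beamformer proposed by Capon [30], [31] is easily derived by minimizing the subband MSE of the residual noise, $J_{\mathrm{r}}[\mathbf{w}(k)]$, with the constraint that the desired signal is not distorted. Mathematically, this is equivalent to

$$\min_{\mathbf{w}(k)}\; \mathbf{w}^H(k)\,\boldsymbol{\Phi}_{\mathbf{u}}(k)\,\mathbf{w}(k) \quad \text{subject to} \quad \mathbf{w}^H(k)\,\boldsymbol{\gamma}(k) = 1, \tag{58}$$

for which the solution is

$$\mathbf{w}_{\mathrm{MVDR}}(k) = \frac{\boldsymbol{\Phi}_{\mathbf{u}}^{-1}(k)\,\boldsymbol{\gamma}(k)}{\boldsymbol{\gamma}^H(k)\,\boldsymbol{\Phi}_{\mathbf{u}}^{-1}(k)\,\boldsymbol{\gamma}(k)}. \tag{59}$$

Using (51), the explicit dependence of the above filter on the steering vector is eliminated to obtain the forms in (60). Alternatively, we can write the MVDR as in (61).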
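The relationship between the MWF and the MVDR beamformer can be checked numerically with a small sketch. It assumes the rank-one-plus-noise covariance model of Section II; the three-channel coherence vector and noise covariance are made-up values, and `solve` is a generic Gauss-Jordan helper written here only to keep the example free of external libraries.

```python
def solve(a, b):
    """Solve A x = b for a small complex matrix (Gauss-Jordan, partial pivoting)."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def inner(a, b):
    """Return a^H b."""
    return sum(x.conjugate() * y for x, y in zip(a, b))

# Hypothetical statistics for one frequency bin (gamma sums to one).
gamma = [0.6 + 0.1j, 0.3 - 0.2j, 0.1 + 0.1j]
phi_u = [[1.0, 0.2 + 0.1j, 0.0],
         [0.2 - 0.1j, 0.8, 0.1j],
         [0.0, -0.1j, 1.2]]            # Hermitian, positive definite noise covariance
phi_x = 2.0

# MVDR: Phi_u^{-1} gamma / (gamma^H Phi_u^{-1} gamma).
b = solve(phi_u, gamma)
norm = inner(gamma, b)
w_mvdr = [bi / norm for bi in b]
lam = phi_x * norm.real                # maximum subband output SNR

# MWF: phi_x * Phi_z^{-1} gamma with Phi_z = phi_x gamma gamma^H + Phi_u.
n = len(gamma)
phi_z = [[phi_x * gamma[i] * gamma[j].conjugate() + phi_u[i][j] for j in range(n)]
         for i in range(n)]
w_mwf = solve(phi_z, [phi_x * g for g in gamma])

scale = lam / (1.0 + lam)              # MWF = scale * MVDR by the Woodbury identity
```

Under these assumptions the two filters differ only by a real scaling factor smaller than one, which is why they share the same subband output SNR while the MWF attenuates the desired signal slightly.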

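It can be verified that the relations (62)–(65) always hold. The MVDR beamformer rejects the maximum level of noise allowable without distorting the desired signal at each frequency bin. While the subband output SNRs of the MWF and MVDR are strictly equal, their fullband output SNRs are not. The fullband output SNR of the MVDR is given in (66) and (67).

1) Property 5.2: With the STFT-domain MVDR beamformer given in (59), the fullband output SNR is always greater than or equal to the fullband input SNR at the second stage (68).

Proof: See Section V-D.

D. Tradeoff Beamformer

As we have learned from the two previous subsections, not much flexibility is associated with the MWF and MVDR beamformer in the sense that we do not know in advance by how much the output SNR will be improved. Moreover, in many practical situations, we wish to control the compromise between noise reduction and speech distortion, and the best way to do this is via the so-called tradeoff beamformer.

In the tradeoff approach, we minimize the subband speech distortion index with the constraint that the subband noise reduction factor is equal to a positive value that is greater than 1. Mathematically, this is equivalent to

$$\min_{\mathbf{w}(k)}\; \upsilon_{\mathrm{sd}}[\mathbf{w}(k)] \quad \text{subject to} \quad \xi_{\mathrm{nr}}[\mathbf{w}(k)] = \frac{1}{\beta}, \tag{69}$$

where $0 < \beta < 1$ to ensure that we get some noise reduction. By using a Lagrange multiplier, $\mu \geq 0$, to adjoin the constraint to the cost function, we easily deduce the tradeoff beamformer:

$$\mathbf{w}_{\mathrm{T},\mu}(k) = \phi_X(k)\left[\phi_X(k)\,\boldsymbol{\gamma}(k)\boldsymbol{\gamma}^H(k) + \mu\,\boldsymbol{\Phi}_{\mathbf{u}}(k)\right]^{-1}\boldsymbol{\gamma}(k) \tag{70a}$$

$$\phantom{\mathbf{w}_{\mathrm{T},\mu}(k)} = \frac{\phi_X(k)\,\boldsymbol{\Phi}_{\mathbf{u}}^{-1}(k)\,\boldsymbol{\gamma}(k)}{\mu + \phi_X(k)\,\boldsymbol{\gamma}^H(k)\,\boldsymbol{\Phi}_{\mathbf{u}}^{-1}(k)\,\boldsymbol{\gamma}(k)}, \tag{70b}$$

where the Lagrange multiplier, $\mu$, satisfies (71). However, in practice it is not easy to determine the optimal $\mu$. Therefore, when this parameter is chosen in an ad-hoc way, we can see that: for $\mu = 1$, $\mathbf{w}_{\mathrm{T},1}(k) = \mathbf{w}_{\mathrm{W}}(k)$, which is the MWF; for $\mu = 0$, $\mathbf{w}_{\mathrm{T},0}(k) = \mathbf{w}_{\mathrm{MVDR}}(k)$, which is the MVDR beamformer; $\mu > 1$ results in a filter with low residual noise at the expense of high speech distortion; and $\mu < 1$ results in a filter with high residual noise and low speech distortion. Note that the MVDR beamformer cannot be derived from (70a) since, by taking $\mu = 0$, we have to invert a matrix that is not full rank.

We can see that the tradeoff, Wiener, and maximum SNR beamformers are equivalent up to a scaling factor. As a result, the subband output SNR of the tradeoff beamformer is independent of $\mu$ and is identical to the subband output SNR of the MWF, i.e.,

$$\mathrm{oSNR}[\mathbf{w}_{\mathrm{T},\mu}(k)] = \mathrm{oSNR}[\mathbf{w}_{\mathrm{W}}(k)], \quad \forall \mu \geq 0. \tag{72}$$

We also have the relations (73)–(75).

The tradeoff beamformer is interesting from several perspectives since it encompasses both the MWF and MVDR beamformer. It is then useful to study the fullband output SNR and the fullband speech distortion index. We can verify that the fullband output SNR of the tradeoff beamformer is

$$\mathrm{oSNR}(\mathbf{w}_{\mathrm{T},\mu}) = \frac{\sum_k \phi_X(k)\,\lambda_{\max}^2(k)\left[\mu + \lambda_{\max}(k)\right]^{-2}}{\sum_k \phi_X(k)\,\lambda_{\max}(k)\left[\mu + \lambda_{\max}(k)\right]^{-2}}, \tag{76}$$

where $\lambda_{\max}(k) = \phi_X(k)\,\boldsymbol{\gamma}^H(k)\,\boldsymbol{\Phi}_{\mathbf{u}}^{-1}(k)\,\boldsymbol{\gamma}(k)$.

1) Property 5.3: The fullband output SNR of the tradeoff beamformer is an increasing function of the parameter $\mu$.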
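The effect of the tradeoff parameter in this subsection can be explored with a minimal sketch of the closed-form filter for a single frequency bin. For brevity it assumes a diagonal noise covariance (spatially uncorrelated filtered noise), which is a simplification not made in the paper; the numerical values and function names are illustrative.

```python
def tradeoff(mu, gamma, noise_var, phi_x):
    """Closed-form tradeoff filter for a diagonal noise covariance (assumption).

    Returns the filter and lam = phi_x * gamma^H Phi_u^{-1} gamma.
    """
    b = [g / s for g, s in zip(gamma, noise_var)]             # Phi_u^{-1} gamma
    lam = phi_x * sum(abs(g) ** 2 / s for g, s in zip(gamma, noise_var))
    return [phi_x * bi / (mu + lam) for bi in b], lam

def distortion_index(w, gamma):
    """Subband speech distortion index |w^H gamma - 1|^2."""
    return abs(sum(wi.conjugate() * gi for wi, gi in zip(w, gamma)) - 1) ** 2

def output_snr(w, gamma, noise_var, phi_x):
    """Subband output SNR for a diagonal noise covariance."""
    sig = phi_x * abs(sum(wi.conjugate() * gi for wi, gi in zip(w, gamma))) ** 2
    noise = sum(abs(wi) ** 2 * s for wi, s in zip(w, noise_var))
    return sig / noise

gamma = [0.6 + 0.1j, 0.3 - 0.2j, 0.1 + 0.1j]   # hypothetical coherence vector
noise_var = [1.0, 0.8, 1.2]                     # hypothetical noise variances
phi_x = 2.0

results = {}
for mu in (0.0, 1.0, 5.0):                      # MVDR, MWF, and a more aggressive filter
    w, lam = tradeoff(mu, gamma, noise_var, phi_x)
    results[mu] = (distortion_index(w, gamma),
                   output_snr(w, gamma, noise_var, phi_x))
```

In this sketch the subband output SNR stays at the same value for every choice of the parameter, while the subband speech distortion index grows as the parameter increases from zero. The fullband behavior then follows from how the per-bin scaling factors weight the different frequency bins.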

Proof: We need to show that

$$\frac{d\,\mathrm{oSNR}(\mathbf{w}_{\mathrm{T},\mu})}{d\mu} \geq 0. \tag{77}$$

The proof showing (77) is identical to the one given in [32].

From Property 5.3, we deduce that the MVDR beamformer gives the smallest fullband output SNR. While the fullband output SNR is upper bounded, it is easy to see that the fullband noise reduction factor and the fullband speech reduction factor are not. So when $\mu$ goes to infinity, so do the fullband noise reduction and speech reduction factors.

The fullband speech distortion index is

$$\upsilon_{\mathrm{sd}}(\mathbf{w}_{\mathrm{T},\mu}) = \frac{\sum_k \phi_X(k)\,\mu^2\left[\mu + \lambda_{\max}(k)\right]^{-2}}{\sum_k \phi_X(k)}. \tag{78}$$

2) Property 5.4: The fullband speech distortion index of the tradeoff beamformer is an increasing function of the parameter $\mu$.

Proof: It is straightforward to verify that

$$\frac{d\,\upsilon_{\mathrm{sd}}(\mathbf{w}_{\mathrm{T},\mu})}{d\mu} \geq 0, \tag{79}$$

which ends the proof.

It is clear that (80) holds. Therefore, as $\mu$ increases, the fullband output SNR increases at the price of more distortion to the desired signal.

3) Property 5.5: With the STFT-domain tradeoff beamformer given in (70b), the fullband output SNR is always greater than or equal to the input SNR.

Proof: The proof starts from a known inequality and its implications, combines them with Property 5.3 in (81), and, as a result, arrives at (82), which ends the proof.

From the previous results, we deduce the bounds (83) and (84), each of which holds for a specific range of $\mu$.

Fig. 2. Setup used for the performance evaluation.

VI. PERFORMANCE EVALUATION

A. Experimental Setup

We now evaluate the performance of the two-stage beamformer in two simulated reverberant environments of size 7.5 × 9 × 6 m (length × width × height) with a reverberation time of 300 and 600 ms. The acoustic impulse responses (AIRs) were generated using SMIRgen [33], [34], a simulator that was originally developed to compute the AIR from an omnidirectional source to a receiver positioned on an open or rigid sphere. Due to the reciprocity principle, we can exchange the role of the source and receiver and compute the response from a source positioned on a sphere (simulating a mouth) to an omnidirectional receiver.
Because the scattering of the sphere is taken into account, we simulate the directional characteristic of a human speaker and hence obtain more realistic AIRs. For the generation of the AIRs, the radius of the sphere was set to 8.5 cm, the number of spherical harmonics was set to 20, and the length of the AIRs was set to 8192 samples. In this study, a uniform linear array with a length of 4.5 cm was used with $N$ microphones. The source was positioned in the end-fire direction of the array (as depicted in Fig. 2). The source signal consisted of male and female speech with a total length of 5 minutes. All signals were processed at a sampling frequency of 16 kHz. For the STFT, we used a Hamming window for the synthesis and a corresponding bi-orthogonal window for the analysis such that we obtain almost perfect reconstruction in case no processing is performed. The STFT frame size was 512 samples, the length of the fast Fourier transform was set to 1024 samples, and the number of samples between two successive frames was equal to 256, which yields a 50% overlap between frames. We have put aside the influence of a voice activity detector and estimated the covariance matrices of the observed signals and the noise signals recursively. The forgetting factor was set to 0.8 for the observed signals and 0.9 for the noise signals.

TABLE I. Performance of the first stage in terms of the SNR improvement, SRR improvement, and SRMR improvement when using different WNG constraints and reverberation times. The source-array distance was 2 m and the input SNR was 35 dB.

Fig. 3. White noise gain (a) and directivity index (b) for the superdirective beamformers with […] dB.

B. Dereverberation

In Fig. 3, the WNG and the directivity index (DI) [35] of three signal-independent beamformers that can be used in the first stage are shown. As expected, we can control the minimum WNG by the lower bound $\beta$. In particular at low frequencies, we observe that a larger value of $\beta$ reduces the DI. The desired WNG depends on the input SNR and the desired output SNR. Although the reference noise signal at the output of the superdirective beamformer can be reduced by using a single-channel noise reduction filter, there is always a tradeoff between noise reduction and speech distortion. Due to this tradeoff, the amount of residual noise that can be reduced without significantly distorting the desired speech signal is limited. We will later see that by using the proposed two-stage approach, we are able to reduce some of the residual noise while keeping the distortion of the desired speech signal low.

We evaluated the noise reduction and dereverberation performance on 5 minutes of reverberant noisy speech data by analyzing the SNR improvement, SRR improvement, and SRMR improvement for different values of reverberation times.
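The recursive covariance estimation mentioned in the experimental setup can be sketched as follows. The exact update rule is not given in the text above, so this sketch assumes a standard first-order recursive average with forgetting factor `alpha`; the function name `update_covariance` is hypothetical.

```python
def update_covariance(phi_prev, y, alpha):
    """One recursive update: Phi(m) = alpha * Phi(m-1) + (1 - alpha) * y y^H.

    This first-order form is an assumption; the paper only states that the
    covariance matrices were estimated recursively with a forgetting factor.
    """
    n = len(y)
    return [[alpha * phi_prev[i][j] + (1 - alpha) * y[i] * y[j].conjugate()
             for j in range(n)] for i in range(n)]

# Feed the same (hypothetical) observation vector repeatedly: the estimate
# should converge to the outer product y y^H.
y = [1.0 + 1.0j, 0.5 - 0.2j]
phi = [[0j, 0j], [0j, 0j]]
for _ in range(100):
    phi = update_covariance(phi, y, alpha=0.8)   # 0.8 as used for the observed signals
```

A larger forgetting factor (such as the 0.9 used for the noise signals) averages over more frames, giving a smoother but slower-adapting estimate.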
The input SNR was set to 35 dB by adding spatially and temporally white Gaussian noise. The SNRs and SRRs were computed by averaging the fullband SNR and SRR in dB over different time frames. The improvement was computed by taking the difference between the average output and average input SNRs and SRRs in dB. The SRMR improvement was obtained by computing the difference of the SRMR of the reverberant source signal received at the first microphone and the SRMR of the processed desired signal, both obtained via the inverse STFT. The results are shown in Table I. As expected, the SNR is degraded by the superdirective beamformer and the largest SRR improvement is obtained for […] dB. In case we desire no amplification of the sensor noise (i.e., a WNG lower bound of 0 dB), the reverberation reduction measured in terms of both the SRMR and SRR is very limited.

Fig. 4. SRMR as a function of the distance between the source and the reference microphone for the superdirective beamformers with […] dB ([…] s).

Finally, we studied the SRMR as a function of the distance between the source and the reference microphone. In Fig. 4 the SRMRs of the desired speech signal at the reference microphone and the processed desired speech signal are shown for […] dB. The SRMR was computed over 5 minutes of speech data, and the reverberation time was set to 0.6 s. For […] dB, we obtain a significant increase in the SRMR. The SRMR is closely related to the perceived distance of the signal under test. Comparing the input and output SRMRs at different distances, we observe that using this microphone array the perceived distance can be reduced by approximately 2.5 m by the superdirective beamformer with […] dB. It was confirmed through informal listening tests that the input speech signal with the source at […] m sounds similar to the processed speech signal with […] dB and the source at […] m. Clearly, the first stage is able to significantly reduce the perceived distance.
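The interplay between white noise gain and directivity discussed in this subsection can be reproduced qualitatively with a short sketch. It assumes a four-microphone end-fire array with 1.5 cm spacing (the number of microphones is an assumption) and uses diagonal loading of the diffuse coherence matrix as a simple stand-in for the WNG-constrained design; it is not the second-order cone program used in the paper. All function names are hypothetical.

```python
import cmath
import math

def solve(a, b):
    """Solve A x = b for a small complex matrix (Gauss-Jordan, partial pivoting)."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def superdirective(freq, loading, n_mics=4, spacing=0.015, c=343.0):
    """Return (WNG in dB, DI in dB) of a diagonally loaded superdirective beamformer."""
    omega = 2 * math.pi * freq
    # Free-field end-fire steering vector and diffuse-field coherence matrix.
    d = [cmath.exp(-1j * omega * i * spacing / c) for i in range(n_mics)]
    sinc = lambda x: 1.0 if x == 0 else math.sin(x) / x
    gam = [[sinc(omega * abs(i - j) * spacing / c) for j in range(n_mics)]
           for i in range(n_mics)]
    loaded = [[gam[i][j] + (loading if i == j else 0.0) for j in range(n_mics)]
              for i in range(n_mics)]
    b = solve(loaded, d)
    norm = sum(di.conjugate() * bi for di, bi in zip(d, b))
    h = [bi / norm for bi in b]                      # distortionless: h^H d = 1
    wng = 1.0 / sum(abs(hi) ** 2 for hi in h)
    diffuse = sum(h[i].conjugate() * gam[i][j] * h[j]
                  for i in range(n_mics) for j in range(n_mics)).real
    return 10 * math.log10(wng), 10 * math.log10(1.0 / diffuse)

wng_sd, di_sd = superdirective(1000.0, loading=1e-3)   # close to superdirective
wng_ds, di_ds = superdirective(1000.0, loading=1e6)    # close to delay-and-sum
```

With heavy loading the design approaches a delay-and-sum beamformer, whose WNG equals the number of microphones. With light loading the directivity index rises at the cost of a lower WNG, mirroring the tradeoff controlled by the WNG lower bound.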
Unfortunately, the amplification of noise at low frequencies limits the applicability of the beamformer. In the next subsection, we analyze the noise reduction performance of the second stage.

Fig. 5. Performance measures as a function of the input SCNR (input dB, s, m, dB). (a) Speech distortion index. (b) SRMR (solid circles) and SRR (open circles) improvements. (c) SNR improvement. (d) SINR (solid circles) and SCNR (open circles) improvements.

Fig. 6. Performance measures as a function of the tradeoff parameter (input dB, input dB, s, m, dB). (a) Speech distortion index. (b) SRMR (solid circles) and SRR (open circles) improvements. (c) SNR improvement.

C. Joint Dereverberation and Noise Reduction

The noise reduction and dereverberation performance of the two-stage beamformer was studied for ms and m. We added a coherent noise source and incoherent noise (i.e., spatially white noise) to the received signals. The coherent noise source was located at m. The incoherent noise consisted of spatially and temporally white zero-mean Gaussian noise. The input source to coherent noise ratio (SCNR) was varied between  and 40 dB, and the input signal to incoherent noise ratio (SINR) was set to 35 dB (both were measured with respect to the first microphone). For the first stage, we used the superdirective beamformer obtained using dB.

First, the speech distortion index, SRMR improvement ( ), SRR improvement ( ), and SNR, SCNR, and SINR improvements (denoted by , , and , respectively) are analyzed as a function of the input SCNR for different noise reduction filters, viz., , , and . Note that for , we simply obtain the output of the signal-independent beamformer. The results depicted in Fig. 5(a) show that the MVDR beamformer
introduces more speech distortion when the input SCNR is high (i.e., when the incoherent noise is dominant). As expected, the MWF introduces slightly more speech distortion compared to the MVDR beamformer.

Fig. 7. Spectrograms and waveforms for one particular example (input dB, input dB, s, m, dB). (a) First microphone signal. (b) Output signal with . (c) Output signal of the tradeoff beamformer with . (d) Output signal of the tradeoff beamformer with . (e) Output signal of the tradeoff beamformer with .

The improvement in SRMR (solid circles) and SRR (open circles) is depicted in Fig. 5(b). In general, we can conclude that the signal-dependent beamformer has almost no influence on the dereverberation performance. Only at high input SCNRs, the SRR improvement is slightly reduced by the MWF and MVDR beamformer. In case of the MWF and low input SCNRs, the SRMR improvement is slightly larger. The increase in SRMR can be attributed to the increase in modulation of the residual noise signal.

The results for the SNR improvement are depicted in Fig. 5(c). For the signal-independent beamformer, we observe that at high SCNR the SNR improvement approaches dB as indicated in Table I. For a SCNR of dB and the coherent noise source located at m, the signal-independent beamformer improves the SNR by approximately 6 dB. Clearly, the MWF and MVDR beamformer significantly improve the SNR for all input SCNRs. The MWF yields a slightly larger SNR improvement than the MVDR beamformer. In Fig. 5(d), the SCNR and SINR improvements are depicted separately. Obviously, the noise reduction performance of the signal-independent beamformer is independent of the input SCNR. The MWF achieves larger SINR and SCNR improvements compared to the MVDR beamformer. As the input SCNR decreases, the output SCNR increases and the output SINR decreases.
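The MVDR beamformer and the MWF compared above can be viewed as two operating points of one parameterized filter. The following sketch illustrates this for a single frequency bin. It assumes a generic rank-1 desired-signal model and the common tradeoff parameterization in which `mu = 0` yields the MVDR beamformer and `mu = 1` yields the MWF. It is not the exact second-stage formulation of this paper, which uses the output of the first stage as a reference. All names (`tradeoff_filter`, `evaluate`, `d`, `Phi_v`, `phi_s`) and all numerical values are hypothetical.

```python
import numpy as np


def tradeoff_filter(d, phi_s, Phi_v, mu):
    """Parameterized filter for a rank-1 desired-signal model.

    d     : propagation vector of the desired signal, d[0] = 1 (assumed)
    phi_s : desired-signal variance at the reference (assumed known)
    Phi_v : noise covariance matrix (assumed known)
    mu    : tradeoff parameter (0: MVDR, 1: MWF in this parameterization)
    """
    g = np.linalg.solve(Phi_v, d)        # Phi_v^{-1} d
    lam = phi_s * np.real(d.conj() @ g)  # phi_s d^H Phi_v^{-1} d
    return phi_s * g / (mu + lam)


def evaluate(d, phi_s, Phi_v, mus):
    """Narrowband speech distortion and residual noise power per mu."""
    distortion, noise = [], []
    for mu in mus:
        h = tradeoff_filter(d, phi_s, Phi_v, mu)
        # Deviation of the desired-signal response h^H d from unity.
        distortion.append(float(np.abs(1.0 - h.conj() @ d) ** 2))
        # Residual noise power h^H Phi_v h at the filter output.
        noise.append(float(np.real(h.conj() @ Phi_v @ h)))
    return distortion, noise


# Toy scenario (all values assumed): one coherent interferer plus
# spatially white noise, observed with four microphones.
rng = np.random.default_rng(0)
M = 4
d = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, M))
d[0] = 1.0
a = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, M))
Phi_v = np.outer(a, a.conj()) + 0.01 * np.eye(M)
phi_s = 1.0

mus = [0.0, 0.5, 1.0, 2.0, 5.0]
distortion, noise = evaluate(d, phi_s, Phi_v, mus)
for mu, sd, nv in zip(mus, distortion, noise):
    print(f"mu = {mu:3.1f}: distortion = {sd:.3e}, residual noise = {nv:.3e}")
```

In this model the desired-signal response equals lam / (mu + lam), so the distortion is zero for `mu = 0` and grows with `mu`, while the residual noise power phi_s lam / (mu + lam)^2 shrinks. This mirrors the qualitative tradeoff between noise reduction and speech distortion.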
Secondly, we analyzed the speech distortion index (compared to the output obtained using ), SRMR improvement ( ), SRR improvement ( ), and SNR improvement ( ) obtained using the tradeoff beamformer with . The input SCNR was equal to 10 dB and the input SINR was equal to 35 dB. The performance measures are shown in Fig. 6. As expected, we see that the speech distortion index increases monotonically with . According to both dereverberation measures, the tradeoff beamformer (with ) reduces slightly more reverberation compared to the superdirective beamformer alone. While the SRR (which is an intrusive objective measure) is independent of , the SRMR (which is a non-intrusive objective measure) slightly increases
with . By increasing , we increase the amount of noise reduction and speech distortion. Consequently, we also distort the late reverberant part of the desired speech (for which the input SNR is low) and thereby reduce some of the reverberation.

In Fig. 7, some sample spectrograms and waveforms are depicted. The reverberation time was set to 0.6 s and m. In Fig. 7(a), the first microphone signal is depicted, which has an input SCNR of 5 dB and an input SINR of 35 dB. In Fig. 7(b), the output of the proposed system is depicted using only the dereverberation stage (i.e., ). Compared to the reference microphone signal, the temporal smearing due to reverberation has clearly been reduced by the superdirective beamformer. Especially at frequencies below 4 kHz, we notice that the noise was significantly amplified. In Fig. 7(c)-(e), the output of the proposed system with noise reduction is shown. In this case, the tradeoff beamformer was used for i) (which corresponds to an MVDR beamformer), ii) (which corresponds to an MWF), and iii) . Especially at low frequencies (cf. around 2.3 s and 3.2 s), we can see that by increasing we can increase the amount of noise reduction. When comparing Fig. 7(c)-(e) around 2.75 s and 0.8 kHz, we notice that the distortion of the desired signal increases for increasing values of . These results are in agreement with the obtained theoretical results and the performance evaluation results shown in Fig. 6.

VII. CONCLUSIONS

Recently, various beamformers were designed to estimate a speech signal as received by one of the microphones. While these beamformers can provide significant noise reduction, they do not provide the possibility to dereverberate the received speech signal. In this work, a two-stage beamformer was proposed that provides dereverberation and noise reduction.
The first stage consists of a signal-independent beamformer that is used to create a reference signal that contains a noisy but dereverberated version of the received speech signal. The signal-dependent beamformer in the second stage is used to estimate the dereverberated speech signal present in the reference signal. The presented results demonstrated that both dereverberation and noise reduction can be achieved. While the obtained reverberation reduction depends on the signal-independent beamformer in the first stage, the obtained noise reduction and speech distortion depend on the signal-dependent beamformer used in the second stage.

REFERENCES

[1] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing. Berlin, Germany: Springer-Verlag, 2008.
[2] S. Gannot and I. Cohen, "Adaptive beamforming and postfiltering," in Springer Handbook of Speech Processing, J. Benesty, M. M. Sondhi, and Y. Huang, Eds. Berlin, Germany: Springer-Verlag, 2008, ch. 47.
[3] O. L. Frost, III, "An algorithm for linearly constrained adaptive array processing," Proc. IEEE, vol. 60, no. 8, pp. 926-935, Aug. 1972.
[4] L. J. Griffiths and C. W. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Trans. Antennas Propag., vol. AP-30, no. 1, pp. 27-34, Jan. 1982.
[5] S. Affès and Y. Grenier, "A signal subspace tracking algorithm for microphone array processing of speech," IEEE Trans. Speech Audio Process., vol. 5, no. 5, pp. 425-437, Sep. 1997.
[6] S. Gannot, D. Burshtein, and E. Weinstein, "Signal enhancement using beamforming and nonstationarity with applications to speech," IEEE Trans. Signal Process., vol. 49, no. 8, pp. 1614-1626, Aug. 2001.
[7] S. Nordholm, I. Claesson, and B. Bengtsson, "Adaptive array noise suppression of handsfree speaker input in cars," IEEE Trans. Veh. Technol., vol. 42, no. 4, pp. 514-518, Nov. 1993.
[8] S. Doclo and M. Moonen, "Design of far-field and near-field broadband beamformers using eigenfilters," Signal Process., vol. 83, no. 12, pp. 2641-2673, Dec. 2003.
[9] I. Claesson and S. Nordholm, "A spatial filtering approach to robust adaptive beamforming," IEEE Trans. Antennas Propag., vol. 40, no. 1, pp. 1093-1096, Sep. 1992.
[10] S. Nordebo, I. Claesson, and S. Nordholm, "Weighted Chebyshev approximation for the design of broadband beamformers using quadratic programming," IEEE Signal Process. Lett., vol. 1, no. 7, pp. 103-105, Jul. 1994.
[11] D. van Compernolle, "Switching adaptive filters for enhancing noisy and reverberant speech from microphone array recordings," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1990, vol. 2, pp. 833-836.
[12] O. Hoshuyama, A. Sugiyama, and A. Hirano, "A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters," IEEE Trans. Signal Process., vol. 47, no. 10, pp. 2677-2684, Oct. 1999.
[13] W. Herbordt and W. Kellermann, "Adaptive beamforming for audio signal acquisition," in Adaptive Signal Processing: Applications to Real-World Problems, J. Benesty and Y. Huang, Eds. Berlin, Germany: Springer-Verlag, 2003, Signals and Communication Technology, ch. 6, pp. 155-194.
[14] R. Talmon, I. Cohen, and S. Gannot, "Convolutive transfer function generalized sidelobe canceler," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 7, pp. 1420-1434, Sep. 2009.
[15] A. Krueger, E. Warsitz, and R. Haeb-Umbach, "Speech enhancement with a GSC-like structure employing eigenvector-based transfer function ratios estimation," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 1, pp. 206-219, Jan. 2011.
[16] O. Hoshuyama and A. Sugiyama, "Robust adaptive beamforming," in Microphone Arrays, M. Brandstein and D. Ward, Eds. Berlin, Germany: Springer, 2001, pp. 87-109.
[17] J. E. Greenberg and P. M. Zurek, "Evaluation of an adaptive beamforming method for hearing aids," J. Acoust. Soc. Amer., vol. 91, pp. 1662-1676, 1992.
[18] J. Vanden Berghe and J. Wouters, "An adaptive noise canceller for hearing aids using two nearby microphones," J. Acoust. Soc. Amer., vol. 103, no. 6, pp. 3621-3626, 1998.
[19] H. Cox, R. M. Zeskind, and M. M. Owen, "Robust adaptive beamforming," IEEE Trans. Acoust., Speech, Signal Process., vol. 35, no. 10, pp. 1365-1376, Oct. 1987.
[20] N. Jablon, "Adaptive beamforming with the generalized sidelobe canceller in the presence of array imperfections," IEEE Trans. Antennas Propag., vol. 34, no. 8, pp. 996-1012, Aug. 1986.
[21] A. Spriet, M. Moonen, and J. Wouters, "Spatially pre-processed speech distortion weighted multi-channel Wiener filtering for noise reduction," Signal Process., vol. 84, no. 12, pp. 2367-2387, Dec. 2004.
[22] S. Doclo, A. Spriet, J. Wouters, and M. Moonen, "Frequency-domain criterion for the speech distortion weighted multichannel Wiener filter for robust noise reduction," Speech Commun., vol. 49, no. 7-8, pp. 636-656, Aug. 2007.
[23] E. Habets, J. Benesty, I. Cohen, S. Gannot, and J. Dmochowski, "New insights into the MVDR beamformer in room acoustics," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 1, pp. 158-170, Jan. 2010.
[24] E. Habets and J. Benesty, "Joint dereverberation and noise reduction using a two-stage beamforming approach," in Proc. Hands-Free Speech Commun. and Microphone Arrays (HSCMA), 2011, pp. 191-195.
[25] Y. Avargel and I. Cohen, "On multiplicative transfer function approximation in the short-time Fourier transform domain," IEEE Signal Process. Lett., vol. 14, no. 5, pp. 337-340, May 2007.
[26] T. H. Falk, C. Zheng, and W.-Y. Chan, "A non-intrusive quality and intelligibility measure of reverberant and dereverberated speech," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 7, Sep. 2010.
[27] G. W. Elko, "Superdirectional microphone arrays," in Acoustic Signal Processing for Telecommunication, S. L. Gay and J. Benesty, Eds. Hingham, MA, USA: Kluwer, 2000, ch. 10, pp. 181-237.
[28] S. Doclo and M. Moonen, "Superdirective beamforming robust against microphone mismatch," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 2, pp. 617-631, Feb. 2007.
[29] S. Yan and Y. Ma, "Robust supergain beamforming for circular array via second-order cone programming," Appl. Acoust., vol. 66, no. 9, pp. 1018-1032, Sep. 2005.

[30] J. Capon, "High resolution frequency-wavenumber spectrum analysis," Proc. IEEE, vol. 57, pp. 1408-1418, Aug. 1969.
[31] R. T. Lacoss, "Data adaptive spectral analysis methods," Geophys., vol. 36, pp. 661-675, 1971.
[32] M. Souden, J. Benesty, and S. Affes, "On the global output SNR of the parameterized frequency-domain multichannel noise reduction Wiener filter," IEEE Signal Process. Lett., pp. 425-428, May 2010.
[33] D. P. Jarrett, E. A. P. Habets, M. R. P. Thomas, and P. A. Naylor, "Simulating room impulse responses for spherical microphone arrays," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Prague, Czech Republic, May 2011.
[34] D. P. Jarrett, Spherical Microphone Array Impulse Response (SMIR) Generator [Online]. Available: http://www.ee.ic.ac.uk/sap/smirgen/
[35] J. Benesty, J. Chen, and E. Habets, Speech Enhancement in the STFT Domain. Berlin, Germany: Springer-Verlag, 2011, SpringerBriefs in Electrical and Computer Engineering.

Emanuël A. P. Habets (S'02-M'07-SM'11) received his B.Sc. degree in electrical engineering from the Hogeschool Limburg, The Netherlands, in 1999, and his M.Sc. and Ph.D. degrees in electrical engineering from the Technische Universiteit Eindhoven, The Netherlands, in 2002 and 2007, respectively. From March 2007 until February 2009, he was a Postdoctoral Fellow at the Technion - Israel Institute of Technology and at the Bar-Ilan University in Ramat-Gan, Israel. From February 2009 until November 2010, he was a Research Fellow in the Communication and Signal Processing group at Imperial College London, United Kingdom. Since November 2010, he has been an Associate Professor at the International Audio Laboratories Erlangen (a joint institution of the University of Erlangen and Fraunhofer IIS) and a Chief Scientist for Spatial Audio Processing at Fraunhofer IIS, Germany.
His research interests are in signal processing for acoustical applications, more specifically microphone array processing for noise reduction, dereverberation, and source localization. Dr. Habets was a member of the organization committee of the 9th International Workshop on Acoustic Echo and Noise Control (IWAENC) in Eindhoven, The Netherlands, 2005. He is a member of the IEEE Signal Processing Society Technical Committee on Audio and Acoustic Signal Processing (2011-2013).

Jacob Benesty was born in 1963. He received a Master degree in microwaves from Pierre & Marie Curie University, France, in 1987, and a Ph.D. degree in control and signal processing from Orsay University, France, in April 1991. During his Ph.D. (from Nov. 1989 to Apr. 1991), he worked on adaptive filters and fast algorithms at the Centre National d'Etudes des Telecommunications (CNET), Paris, France. From January 1994 to July 1995, he worked at Telecom Paris University on multichannel adaptive filters and acoustic echo cancellation. From October 1995 to May 2003, he was first a Consultant and then a Member of the Technical Staff at Bell Laboratories, Murray Hill, NJ, USA. In May 2003, he joined the University of Quebec, INRS-EMT, in Montreal, Quebec, Canada, as a Professor. His research interests are in signal processing, acoustic signal processing, and multimedia communications. He is the inventor of many important technologies. In particular, he was the lead researcher at Bell Labs who conceived and designed the world-first real-time hands-free full-duplex stereophonic teleconferencing system. Also, he conceived and designed the world-first PC-based multi-party hands-free full-duplex stereo conferencing system over IP networks. He was the co-chair of the 1999 International Workshop on Acoustic Echo and Noise Control and the general co-chair of the 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. He is the recipient, with Morgan and Sondhi, of the IEEE Signal Processing Society 2001 Best Paper Award.
He is the recipient, with Chen, Huang, and Doclo, of the IEEE Signal Processing Society 2008 Best Paper Award. He is also the co-author of a paper for which Huang received the IEEE Signal Processing Society 2002 Young Author Best Paper Award. In 2010, he received the Gheorghe Cartianu Award from the Romanian Academy. In 2011, he received the Best Paper Award from the IEEE WASPAA for a paper that he co-authored with Chen.