The Role of High Frequencies in Convolutive Blind Source Separation of Speech Signals


Maria G. Jafari and Mark D. Plumbley
Centre for Digital Music, Queen Mary University of London, UK
maria.jafari@elec.qmul.ac.uk, http://www.elec.qmul.ac.uk

Abstract. In this paper, we investigate the importance of the high frequencies in the problem of convolutive blind source separation (BSS) of speech signals. In particular, we focus on frequency domain blind source separation (FD-BSS), and show that when separation is performed in the low frequency bins only, the recovered signals are similar in quality to those extracted when all frequencies are taken into account. The methods are compared through informal listening tests, as well as using an objective measure.

1 Introduction

Convolutive blind source separation is often addressed in the frequency domain, through the short-time Fourier transform (STFT); source separation is then performed separately at each frequency bin, reducing the problem to several instantaneous BSS problems. Although approximating convolutions by multiplications results in reduced computational complexity, frequency domain BSS (FD-BSS) remains computationally expensive, because source separation has to be carried out on a large number of bins (a typical STFT length is 2048 points), each containing sufficient data samples for the independence assumption to hold. In addition, transforming the problem into several independent instantaneous problems has the unwelcome side effect of introducing the problem of frequency permutations, whose solution is often quite computationally expensive [1], as it involves clustering the frequency components of the recovered sources, using methods such as beamforming approaches, e.g. [3, 4]. These methods exploit the phase information contained in the de-mixing filters identified by the source separation algorithm.
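The convolution-to-multiplication approximation that FD-BSS relies on can be checked numerically: with enough zero-padding, time-domain convolution by a mixing filter is exactly bin-wise multiplication in the DFT domain. A minimal NumPy sketch (the signal and filter here are random stand-ins for a speech frame and a room response, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(4096)        # stand-in for a speech frame
h = rng.standard_normal(32)          # stand-in for a short mixing filter

# Zero-pad both to the full output length so circular convolution via the
# DFT coincides with linear convolution: conv(h, s) <-> H(f) * S(f).
N = len(s) + len(h) - 1
x_time = np.convolve(h, s)
x_freq = np.fft.irfft(np.fft.rfft(h, N) * np.fft.rfft(s, N), N)

print(np.allclose(x_time, x_freq))   # the two routes agree
```

With a finite STFT frame the equality is only approximate, which is why a long frame (2048 points) relative to the room filter length is needed.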
Generally, the characteristics of speech signals are such that little information is contained in the frequencies above 4kHz [9], suggesting a possible approach to BSS for speech mixtures that focuses on the lower frequencies. Motivated by this, and in order to reduce the computational load of FD-BSS algorithms, we consider here the role of high frequencies in source separation of speech signals. We show that high frequencies are not as important as low frequencies,

(This work was funded by EPSRC grant GR/S85900/01.)

and that intelligibility is preserved even when the high frequency subbands are left unmixed, and simply added back onto the separated signal. Other possible approaches would exploit existing methods that assume that high frequencies are not available, such as bandwidth extension. The structure of this paper is as follows: the basic convolutive BSS problem is described in section 2; an overview of FD-ICA is given in section 3, while the role of high frequencies is discussed in section 4. Simulation results are presented in section 5, and conclusions are drawn in section 6.

2 Problem Formulation

The simplest convolutive BSS problem arises when 2 microphones record mixtures x(n) of 2 sampled real-valued signals, s(n), which in this paper are considered to be speech signals. The aim of blind source separation is then to recover the sources from only the 2 convolutive mixtures available. Formally, the signal recorded at the q-th microphone, x_q(n), is

x_q(n) = \sum_{p=1}^{2} \sum_{l=1}^{L} a_{qp}(l) s_p(n-l),  q = 1, 2   (1)

where s_p(n) is the p-th source signal, a_{qp}(l) denotes the impulse response from source p to sensor q, and L is the maximum length of all impulse responses [1]. The source signals are then reconstructed according to

y_p(n) = \sum_{q=1}^{2} \sum_{l=1}^{L} w_{qp}(l) x_q(n-l),  p = 1, 2   (2)

where y_p(n) is the p-th recovered source, and w_{qp}(l) are the unmixing filters which must be estimated.

3 Frequency Domain Blind Source Separation

Convolutive audio source separation is often addressed in the frequency domain. It entails the evaluation of the N-point short-time Fourier transform of the observed signals, followed by the use of instantaneous BSS, independently on each of the resulting N subbands.
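The time-domain mixing model (1) is an ordinary sum of convolutions, as a small NumPy sketch makes explicit (the sources and impulse responses are synthetic placeholders, not measured room responses):

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, L = 16000, 64        # 1 s at 16kHz; 64-tap toy impulse responses
s = rng.standard_normal((2, n_samples))                            # sources s_p(n)
a = rng.standard_normal((2, 2, L)) * np.exp(-np.arange(L) / 16.0)  # a_qp(l), decaying

# x_q(n) = sum_p sum_l a_qp(l) s_p(n - l)   (Eq. 1)
x = np.zeros((2, n_samples + L - 1))
for q in range(2):
    for p in range(2):
        x[q] += np.convolve(a[q, p], s[p])
```

Estimating the unmixing filters w_{qp}(l) of (2) directly in the time domain would require adapting all L taps jointly, which motivates the frequency-domain route taken next.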
Thus, the mixing and separating models in (1) and (2) become, respectively,

X(f, t) = A(f) S(f, t)   (3)
Y(f, t) = W(f) X(f, t)   (4)

where S(f, t) and X(f, t) are the STFT representations of the source and mixture vectors respectively, A(f) and W(f) are the mixing and separating matrices at frequency bin f, Y(f, t) is the frequency domain representation of the recovered sources, and t denotes the STFT block index.
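In the STFT domain, (3) and (4) reduce to a 2x2 matrix multiply in every bin. The sketch below uses an instantaneous mixture and the oracle unmixing matrix W = A^{-1} in place of the blind per-bin ICA estimate, purely to show the mechanics of per-bin separation and resynthesis:

```python
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(2)
fs, n = 16000, 16000
s = rng.standard_normal((2, n))                  # two sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])           # hypothetical instantaneous mixing
x = A @ s                                        # mixtures

_, _, X = stft(x, fs=fs, nperseg=2048)           # X[q, f, t]

# Y(f, t) = W(f) X(f, t) per bin (Eq. 4); here W(f) = A^{-1} in every bin.
W = np.linalg.inv(A)
Y = np.einsum('pq,qft->pft', W, X)

_, y = istft(Y, fs=fs, nperseg=2048)             # back to the time domain
```

A real FD-BSS algorithm estimates a different W(f) blindly in each bin, which is exactly what introduces the permutation problem discussed below.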

FD-BSS has the drawback of introducing the problem of frequency permutations, which is typically solved by clustering the frequency components of the recovered sources, often using beamforming techniques, such as in [1, 3-5], where the directions of arrival (DOAs) of the sources are evaluated from the beamformer directivity patterns

F_p(f, \theta) = \sum_{q=1}^{2} W^{ICA}_{qp}(f) e^{j 2\pi f d \sin\theta_p / c},  p = 1, 2   (5)

where W^{ICA}_{qp}(f) is the ICA de-mixing filter from the q-th sensor to the p-th output, d is the spacing between the two sensors, \theta_p is the angle of arrival of the p-th source signal, and c \approx 340 m/s is the speed of sound in air. The frequency permutations are then determined by ensuring that the directivity pattern for each beamformer is approximately aligned along the frequency axis.

The BSS algorithm considered in this paper is given in [6]. It updates the unmixing filters according to

W(f) \leftarrow D [ \mathrm{diag}(-\alpha_i) + E\{ \phi(y(f, t)) y^H(f, t) \} ] W(f)
W(f) \leftarrow W(f) (W(f)^H W(f))^{-0.5}   (6)

where y^H is the conjugate transpose of y, \alpha_i = E\{ y_i(f, t) \phi(y_i(f, t)) \}, D = \mathrm{diag}(1 / (\alpha_i - E\{ \phi'(y_i(f, t)) \})), and the activation function \phi(y(f, t)) is given by

\phi(y(f, t)) = y(f, t) / |y(f, t)|,  y(f, t) \neq 0   (7)

and its derivative can be approximated by \phi'(y(f, t)) \approx 1/|y(f, t)| - y(f, t)^2 / |y(f, t)|^3 [6]. Moreover, the algorithm (6) requires that the mixtures x(f, t) be pre-whitened; we refer to it as MD2003.

4 The Role of High Frequencies

In this paper, we aim to investigate the role of the high frequencies in convolutive blind source separation of speech signals, whose characteristics are such that little information is contained in the frequencies above a certain cut-off frequency [9], which we denote f_c.
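The directivity pattern (5) can be evaluated numerically. The sketch below builds an oracle unmixing matrix from two assumed DOAs (rather than an ICA estimate) so that each output places a spatial null on the interfering source's direction, which is the property the permutation-alignment methods exploit; the 4 cm spacing and 1 kHz bin are illustrative assumptions:

```python
import numpy as np

d, c, f = 0.04, 340.0, 1000.0        # assumed: 4 cm spacing, 1 kHz bin

def directivity(W_f, thetas):
    """F_p(f, theta) = sum_q W_qp(f) exp(j 2 pi f d_q sin(theta) / c)  (Eq. 5),
    with sensor positions d_q = 0 and d for a two-sensor array."""
    pos = np.array([0.0, d])
    steer = np.exp(1j * 2 * np.pi * f * pos * np.sin(thetas)[:, None] / c)
    return steer @ W_f.T             # shape (n_theta, 2): one column per output p

# Oracle W from two known DOAs; each output should null the other source.
doas = np.radians([-30.0, 40.0])
A = np.exp(1j * 2 * np.pi * f * np.array([0.0, d])[:, None]
           * np.sin(doas)[None, :] / c)
W = np.linalg.inv(A)

thetas = np.radians(np.linspace(-90, 90, 721))
F = np.abs(directivity(W, thetas))
null_of_out0 = np.degrees(thetas[np.argmin(F[:, 0])])   # near the other DOA, 40 deg
```

Tracking where each output's null sits across frequency bins is what lets the alignment methods decide which permutation each bin should take.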
Here, we consider the following decomposition of the observed signal

X(f, t) = X_{LF}(f, t) + X_{HF}(f, t)   (8)

where X_{LF}(f, t) is the STFT representation of the mixtures with the subbands corresponding to the high frequencies (f > f_c) set to zero, and similarly X_{HF}(f, t) has the low frequency subbands (f <= f_c) set to zero. Defining the recovered signal as Y(f, t) = Y_{LF}(f, t) + Y_{HF}(f, t), the following four scenarios are considered, in which source separation is performed using MD2003:

1. on all frequency bins (MD2003): Y(f, t) = Y_{LF}(f, t) + Y_{HF}(f, t)
2. on the low frequency bins only; the high frequencies are set to zero (LF): Y(f, t) = Y_{LF}(f, t)
3. on the low frequency bins; the high frequency components are extracted using a beamformer W_{BF}(f) based on the DOAs estimated from the low frequency components (LF-BF): Y(f, t) = Y_{LF}(f, t) + W_{BF}(f) X_{HF}(f, t)
4. on the low frequency bins; the high frequency components are left mixed, and added back to the separated low frequencies prior to applying the inverse STFT (LF-HF): Y(f, t) = Y_{LF}(f, t) + X_{HF}(f, t)

Figure 1 illustrates the four methods described above.

5 Simulation Results

In this section, we consider the separation of two speech signals, from two male speakers, sampled at 16kHz. The sources were mixed using simulated room impulse responses, determined by the image method [2] using McGovern's RIR Matlab function,¹ with a room reverberation time of 160 ms. The STFT frame length was set to 2048 in all cases. The performance of the FD-BSS method in [6] (MD2003) was compared for the four methods described in section 4, and permutations were aligned as in [3]. We set f_c = 4.7kHz, so that the low frequency bands lie between 0 and 4.7kHz, while the high frequencies are above 4.7kHz. This value was obtained empirically by inspecting the frequency content of the mixtures, with the aim of ensuring that as much information as possible is preserved in the low frequencies.

Method       SDR (dB)   SIR (dB)   SAR (dB)   Listening Tests
MD2003 [6]   5.37       19.17      6.08       +++
LF           5.37       19.66      5.59       +
LF-BF        5.15       17.33      5.52       ++
LF-HF        5.04       13.16      6.14       ++++

Table 1.
Signal-to-distortion (SDR), signal-to-interference (SIR), and signal-to-artifact (SAR) ratios for the four methods separating the source signals: at all frequencies (MD2003); at low frequencies only (LF); at low frequencies, with beamforming applied at high frequencies (LF-BF); at low frequencies, with the high frequencies added back still mixed (LF-HF), for a cut-off of 4.7kHz.

The performance of each method was evaluated using the objective criteria of

¹ Available from: http://2pi.us/code/rir.m
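The decomposition (8) and the LF-HF recombination amount to masking STFT bins above f_c. A sketch with the paper's parameters (fs = 16kHz, f_c = 4.7kHz, 2048-point frames); the identity "separator" standing in for the per-bin ICA and the random stand-in mixtures are assumptions for illustration only:

```python
import numpy as np
from scipy.signal import stft, istft

fs, fc, nperseg = 16000, 4700.0, 2048
rng = np.random.default_rng(3)
x = rng.standard_normal((2, fs))                 # stand-in mixtures, 1 s

f, _, X = stft(x, fs=fs, nperseg=nperseg)
lf = f <= fc                                     # low-frequency bins, f <= f_c

X_LF = np.where(lf[None, :, None], X, 0)         # Eq. 8: X = X_LF + X_HF
X_HF = np.where(lf[None, :, None], 0, X)

Y_LF = X_LF          # placeholder: per-bin ICA would be applied to X_LF here
Y = Y_LF + X_HF      # LF-HF scenario: untouched HF bins are added back
_, y = istft(Y, fs=fs, nperseg=nperseg)
```

Since only the bins below f_c are handed to the separation algorithm, roughly 40% of the 1025 one-sided bins need per-bin ICA, which is where the computational saving comes from.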

Fig. 1. Illustration of the four methods compared: (a) separation of all frequency bins (MD2003): Y(f, t) = Y_{LF}(f, t) + Y_{HF}(f, t); (b) separation of the low frequency bins only (LF): Y(f, t) = Y_{LF}(f, t); (c) separation of the low frequency bins, with beamforming in the high frequencies (LF-BF): Y(f, t) = Y_{LF}(f, t) + W_{BF}(f) X_{HF}(f, t); (d) separation of the low frequency bins, with the high frequencies added back without separation (LF-HF): Y(f, t) = Y_{LF}(f, t) + X_{HF}(f, t).

Signal-to-Distortion Ratio (SDR), Signal-to-Interference Ratio (SIR) and Signal-to-Artifacts Ratio (SAR), as defined in [7]. SDR, SIR and SAR measure, respectively, the level of the total distortion in the estimated source with respect to the target source, the distortion due to interfering sources, and other remaining artifacts. The evaluation criteria allow the recovered sources to be modified by a permitted distortion; we considered a time-invariant filter of length 512 samples when calculating the performance measures. This length was chosen so that the filter would cover the reverberation time. We obtained SDR, SIR and SAR figures for the four methods, and for all sources and microphones. The results are shown in Table 1, where each figure was produced by averaging the criteria across all microphones and all sources.

The SDRs in Table 1 show that the total distortion for all methods is essentially the same. Distortion increases for LF-HF, because the high frequencies are not separated and therefore re-introduce some level of distortion. This is supported by the corresponding SIR figure for the same method, which shows that a higher level of interference from the other source is present. The SAR values indicate that most artifacts are introduced when separation is performed on the low frequency components only (LF), and when the high frequency components are extracted using beamforming (LF-BF). This is hardly surprising, since both methods can have quite severe effects on the data.

The most interesting result is observed from the SIR figures. They show that separating only the low frequency components, and truncating the high frequency ones, removes more interference from the undesired source signal than working with all frequencies, while not introducing any additional distortion (SDR is unchanged), although the level of artifacts present increases.
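The three ratios can be sketched with a simplified, filter-free version of the BSS_EVAL decomposition of [7]: the estimate is split into a target part, an interference part, and a residual artifact part by projection. This is an assumption-laden sketch (the real toolbox allows a 512-tap distortion filter; here the permitted distortion is only a gain), not the toolbox itself:

```python
import numpy as np

def sdr_sir_sar(estimate, target, interferer):
    # Target part: projection of the estimate onto the true source.
    s_target = (target @ estimate) / (target @ target) * target
    # Interference part: the rest of the projection onto the span of all sources.
    basis = np.stack([target, interferer]).T               # (n, 2)
    coef, *_ = np.linalg.lstsq(basis, estimate, rcond=None)
    e_interf = basis @ coef - s_target
    # Artifact part: whatever lies outside the source subspace.
    e_artif = estimate - s_target - e_interf
    db = lambda num, den: 10 * np.log10(np.sum(num**2) / np.sum(den**2))
    return (db(s_target, e_interf + e_artif),              # SDR
            db(s_target, e_interf),                        # SIR
            db(s_target + e_interf, e_artif))              # SAR

# A target contaminated by a little interference scores a high SIR.
rng = np.random.default_rng(4)
t, i = rng.standard_normal(16000), rng.standard_normal(16000)
sdr, sir, sar = sdr_sir_sar(t + 0.1 * i, t, i)             # SIR near 20 dB
```

Because the contaminated estimate lies entirely in the span of the two sources, its artifact energy is negligible and SAR is very large, while SDR and SIR both sit near 20 dB.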
This result is rather counterintuitive, as it suggests that there is little to be gained from performing separation in the high frequencies. It might be explained by the fact that source separation methods perform worse on high frequency components, which are generally lower in amplitude; using beamforming methods to deal with the permutation problem also yields poor results there, due to phase ambiguity in the high frequencies [8].

Informal listening tests were performed to corroborate the outcome of the objective criteria. They indicated that the ratios are a good guide to the audible performance. The outputs of LF were found to sound the least natural among all the recovered signals, due to the high frequencies not being present, while the sources separated with LF-HF were found to sound somewhat better than the outputs of MD2003. However, the crucial point is that the outputs of all methods sounded similar in quality, suggesting that they all have similar performance. The last column in Table 1 shows a classification of the recovered sources, with the number of + signs indicating how good the quality of the separated signal is. In general, LF-HF gave the best results, and LF is the worst only because it is not as natural as the others. Nonetheless, the output of LF is equally as intelligible as the others. We can conclude from these results that performing separation in all subbands

is not always the best approach. Especially for speech signals, it might be more advantageous to apply BSS only in the low frequencies, hence reducing, or even halving, the computational burden of some frequency domain algorithms.

6 Conclusions

In this paper, we discussed the role of the high frequencies in frequency domain blind source separation of speech signals. We found that when the high frequencies are ignored, the separated sources remain quite clear, albeit they do not always sound very natural. Our findings were supported by objective criteria and by informal listening tests, which suggest that it might be a good strategy to separate the mixtures in the low frequencies only, and then add on the high frequency components without performing any processing on them. This approach may bring significant advantages in terms of reduced computational complexity.

References

1. H. Sawada, R. Mukai, S. Araki, and S. Makino, A robust and precise method for solving the permutation problem of frequency-domain blind source separation, IEEE Trans. on Speech and Audio Processing, vol. 12, pp. 530-538, 2004.
2. S. McGovern, A model for room acoustics, available at: http://2pi.us/rir.html, 2003.
3. N. Mitianoudis and M. Davies, Permutation alignment for frequency domain ICA using subspace beamforming methods, in Proc. ICA, 2004, pp. 669-676.
4. H. Saruwatari, S. Kurita, and K. Takeda, Blind source separation combining frequency-domain ICA and beamforming, in Proc. ICASSP, 2001, vol. 5, pp. 2733-2736.
5. M. Ikram and D. Morgan, A beamforming approach to permutation alignment for multichannel frequency-domain blind speech separation, in Proc. ICASSP, 2002, vol. 1, pp. 881-884.
6. N. Mitianoudis and M. Davies, Audio source separation of convolutive mixtures, IEEE Trans. on Speech and Audio Processing, vol. 11, pp. 489-497, 2003.
7. C. Févotte, R. Gribonval, and E. Vincent, BSS_EVAL Toolbox User Guide, IRISA Technical Report 1706, April 2005. http://www.irisa.fr/metiss/bss_eval/.
8. M. G. Jafari, S. A. Abdallah, M. D. Plumbley, and M. E. Davies, Sparse coding for convolutive blind audio source separation, in Proc. ICA, 2006, pp. 132-139.
9. D. Balcan and J. Rosca, Independent component analysis for speech enhancement with missing TF content, in Proc. ICA, 2006, pp. 552-560.