Speaker Isolation in a Cocktail-Party Setting


M.K. Alisdairi
Columbia University, M.S. Candidate, Electrical Engineering
Spring

Abstract — The human auditory system is capable of performing many interesting tasks, several of which could find useful applications in engineering settings. One such capability is the ability to perceptually separate sound sources, allowing a listener to focus on a single speaker in a noisy environment. This effect is often referred to as the cocktail-party effect (in reference to a cocktail-party environment where several simultaneous conversations are taking place in the background) or as Auditory Scene Analysis. This paper introduces two methodologies for isolating a desired speaker's audio stream from a binaural recording of multiple speakers in conversation. An implementation of a system for speaker isolation based on one of these methods is also presented. Note that some of the graphics presented in this document are best viewed in color; for the electronic version please visit www.columbia.edu/~mka3/speech.html.

INTRODUCTION

Systems capable of performing Auditory Scene Analysis (ASA) [10] could find numerous useful applications. The most evident application is as a front-end for speech recognition systems: the development of systems capable of ASA could provide improvements in speech recognition in unconstrained auditory environments [7] [9].

[Figure 1: Speaker Extraction System as a Front-End to Voice Recognition. Sound input from a noisy environment feeds the speaker extraction system, whose output feeds the voice recognition system.]

Another possible use for an ASA-capable system could be in theatrical/movie settings as a substitute for wireless microphones. In such instances sound engineers could have a versatile means of controlling audio quality without the physical imposition of hardware on the speaker's person.

This paper discusses two methodologies for speech extraction. The first method is based on the Interaural Intensity Difference (IID), and the second on the Time Difference of Arrival (TDOA). After the preliminary discussion, an implementation of the TDOA-based method is presented. Analysis and implementation are carried out on sound recordings from the ShATR corpus.

The ShATR Corpus

The sound files used were taken from the ShATR corpus of dummy-head recordings. The recordings are of five speakers (Guy, Martin, Phil, Inge Marie, and Malcolm) oriented around the dummy head. Two files from the corpus are of primary interest in this document: one contains a recording of each of the five speakers introducing themselves,

while the other is a recording of the five carrying on in conversation.

THEORETICAL BACKGROUND

THE HRTF

The theoretical undergirding of speaker isolation is the fact that the signals arriving at the left and right ears traverse different paths, resulting in different filtering for each channel. The following figure and equations illustrate the concept.

[Figure 2: Depiction of the Path-Dependent Filtering Effect. A sound source reaches the left and right ear channels of the dummy head along different paths.]

$Y_L(\omega, \phi) = H_L(\omega, \phi)\, X(\omega)$   (Equation 1)
$Y_R(\omega, \phi) = H_R(\omega, \phi)\, X(\omega)$   (Equation 2)

where $Y_L$ and $Y_R$ are the signals received by the left and right ears, $H_L$ and $H_R$ are the impulse responses of the paths to the left and right ears, and $X(\omega)$ is the original speech signal.

The two transfer functions $H_L$ and $H_R$ are called the Head-Related Transfer Functions (HRTFs) and are functions of position as well as frequency (more precisely, the HRTF is a function of frequency ω, azimuth φ, and elevation θ) [5].

It is averred that by decomposing a multi-speaker signal into several time/frequency (TF) cells and weighting each cell appropriately, a desired speaker's speech data may be extracted from an aggregate of speakers [3]. The localization cues that are implanted in the received signals by the HRTF may be used to determine the appropriate weighting for each of the TF cells.

[Figure 3: The TF Cell Concept. A sample spectrogram divided into time/frequency cells.]

There are two methods which may be implemented towards the goal of categorizing and weighting the TF cells appropriately: the interaural intensity difference (IID) method and the time delay of arrival (TDOA) method.

The Inter-aural Intensity Difference

The IID method of cell-weight estimation is based on intensity differentials as a function of frequency [6]. For example, a sound originating from a source in the first quadrant of the figure below will be detected by the left ear as a signal that has undergone a low-pass filtering effect. The low-pass effect is a result of the shadowing of high-frequency components by the head; low frequencies, on the other hand, are able to wrap around the head with little attenuation.

[Figure 4: Sound Source Plane. The space around the head is divided into quadrants.]

The interaural intensity difference may be obtained by taking the ratio of the left- and right-channel magnitudes in the frequency domain (or, correspondingly, taking the difference between the frequency-domain log magnitudes in dB). Because the source spectrum $X(\omega)$ cancels in the ratio, the IID depends only on the HRTFs:

$\mathrm{iid}(\omega, \phi) = \log\frac{|Y_L(\omega,\phi)|}{|Y_R(\omega,\phi)|} = \log\frac{|H_L(\omega,\phi)|}{|H_R(\omega,\phi)|} = \log|Y_L(\omega,\phi)| - \log|Y_R(\omega,\phi)|$   (Equation 3)

By comparing incoming speech data to predetermined categorical information, the previously mentioned TF cells may be classified appropriately [4].
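To make Equation 3 concrete, the following is a minimal MATLAB sketch of a per-TF-cell IID computation. It is illustrative only (the implementation in this paper uses the TDOA method instead); the stereo filename is hypothetical, and the STFT parameters simply mirror the N = 2^9 used in Appendix A.

[x, fs] = audioread('binaural.wav');                     % hypothetical stereo recording
N = 2^9;                                                 % FFT length, as in Appendix A
[YL, F, T] = spectrogram(x(:,1), hann(N), N/2, N, fs);   % left-ear STFT
[YR, ~, ~] = spectrogram(x(:,2), hann(N), N/2, N, fs);   % right-ear STFT
iid = 20*log10(abs(YL) + eps) - 20*log10(abs(YR) + eps); % IID in dB for each TF cell
imagesc(T, F, iid); axis xy;                             % visualize the IID map
xlabel('Time (s)'); ylabel('Frequency (Hz)'); colorbar;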

The Time Delay of Arrival

The distance differential between the propagation paths to the right and left ears causes a phase difference between the two channels. This phase difference may be used as a localization cue and, as with the IID method, may be used to derive the necessary weighting information for each TF cell.

Theoretically, extraction of the desired time-delay information may be accomplished through direct analysis of the phase components, via the cross-spectrum of the two channels:

$\mathrm{itd}(\omega, \phi) = \arg\{ Y_L(\omega,\phi)\, Y_R^{*}(\omega,\phi) \}$   (Equation 4)

Despite the plausibility of the above approach, it is avoided here due to difficulties resulting from the nonlinearity of the phase functions. An alternative means to the same end is the use of the cross-correlation function: the TDOA is obtained by retaining the lag index of maximum cross-correlation between the left and right channels.

[Figure 5: Determining the TDOA by Cross-Correlation. The cross-correlation of the left and right channels peaks at the lag corresponding to the time delay of arrival.]

IMPLEMENTATION

This paper concentrates on the TDOA method of speaker isolation. The first step is the analysis of the introduction sound samples from the ShATR corpus to develop a source model that describes each speaker's localization cues. The second step uses the localization cues captured by the previously attained models to determine the appropriate weighting for each TF cell.

Analysis — Source Models

The first step in building the speech isolation system is to study the behavior of the TDOA as a speaker- (i.e., position-) dependent feature. This analysis was conducted on the speech samples of each speaker's introduction, such as Inge Marie's, depicted below.

[Figure 6: Inge Marie's Introduction Sample (Inge Marie on microphone four).]

Each sound sample is first passed through a Bark-scaled filter bank, and each band-limited output is broken into short time windows. The left- and right-channel windows are then cross-correlated, and the lag index of the maximum cross-correlation is retained for each time window. The filter bank is a four-channel filter bank; this selection is based on the results given in [3], which show a maximization of the SNR for four frequency bands. Initially, the time window length was based solely on calculations of the maximum possible lag index; however, [3] also shows a maximization of the SNR for a window length of 256 samples, so a window length of 256 samples was used.

The following figure depicts histograms of the lag indexes that result from the process described above:

[Figure 7: Histogram of Lag Index of Maximum Correlation.]

As the histograms show, the lag indexes of maximum cross-correlation appear to be normally distributed. The appropriate mean and standard deviation were then manually extracted for each speaker and channel. The results for Guy and Inge Marie follow:

Speaker: Guy

    Channel   Mean (µ)   Std. Dev. (σ)
    One       -8         —
    Two       -6.5       —
    Three     -7.5       0.5
    Four      —          5

Speaker: Inge Marie

    Channel   Mean (µ)   Std. Dev. (σ)
    One       6.5        4
    Two       7.5        3
    Three     7.5        —
    Four      6.5        0.5
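The per-window measurement just described amounts to the following condensed MATLAB sketch (the full implementation, including the filter bank, appears in Appendix A; the filename is hypothetical, and the channels are used unfiltered here for brevity):

[x, fs] = audioread('binaural.wav');   % hypothetical stereo recording
bl = x(:,1);  br = x(:,2);             % left/right; Appendix A band-limits these first
win  = 256;                            % window length from [3]
nwin = floor(length(bl)/win);          % number of analysis windows
lag  = zeros(1, nwin);
for k = 1:nwin
    seg = (k-1)*win + (1:win);         % sample indexes of the k-th window
    c = xcorr(bl(seg), br(seg));       % cross-correlation at lags -(win-1)..(win-1)
    [~, imax] = max(c);
    lag(k) = imax - win;               % lag index of maximum correlation
end
histogram(lag)                         % per-speaker lag distribution (cf. Figure 7)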

The weight of a particular TF cell, as a function of its lag index $i$, is then

$w_{ch}(i) = e^{-(i - \mu_{ch})^2 / (2\sigma_{ch}^2)}$   (Equation 5)

Speech Isolation System

Once the distributions of the lag indexes for each speaker have been determined, the sound file containing simultaneous speech may be analyzed and a desired speaker extracted. The simultaneous speech sample is first passed through the previously mentioned filter bank. Each of the four band-limited signals is then broken into time windows of length 256 samples. The left and right channels are cross-correlated for each time frame, and the lag index of maximum correlation is retained. The sequence of lag indexes is then compared to the desired speaker's lag-index distribution model, and a weight corresponding to the likelihood of each TF cell belonging to the desired speaker is used as that cell's weight.

[Figure 8: Block Diagram of the Speech Isolation System. Speech data passes through the filter bank and TDOA analysis; weighting based on the desired speaker's TDOA model produces the isolated speech.]

RESULTS

The above methodology of speech extraction was successful in isolating the desired speaker's signal. The following is a spectrogram of a sample of cocktail-party speech and the resulting extraction of Guy's stream.

[Figure 9: Extracting Guy's Speech Signal.]

Although the system performed the desired task of extracting a single speech track from the mixture, the weighting process introduced some undesirable noise. It was postulated that the source of the noise was the discontinuity of the weighting matrix over time, and that the problem could be ameliorated by filtering the weight matrix. The following figure illustrates short sections of the four weight sequences before and after filtering.

[Figure 10: Illustration of the Weight Matrix Filtering Process. Original and filtered weight sequences for the four channels.]

Reconstruction of the desired track using the newly filtered weighting matrix resulted in the desired improvement in quality. The following figures illustrate the reconstructed signals for Guy with and without weight-matrix filtering.
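As a sketch of how Equation 5 and the weight filtering combine, continuing from the TDOA sketch above: the mean µ = -8 is Guy's channel-one value from the table, but the standard deviation here is illustrative only (the corresponding table entry is not recoverable), and since the paper does not specify the smoothing filter, a simple moving average stands in for it.

mu = -8;  sigma = 2;                    % mu from the table; sigma illustrative only
w = exp(-(lag - mu).^2 / (2*sigma^2));  % Equation 5: one weight per time window
wSmooth = filter(ones(1,5)/5, 1, w);    % assumed smoothing: 5-point moving average
wSamp = repelem(wSmooth(:), win);       % expand window weights to per-sample weights
yIso = bl(1:length(wSamp)) .* wSamp;    % weighted (isolated) band signal, one channel
soundsc(yIso, fs)                       % audition the result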

[Figure 11: Reconstructed Signals for Guy, Before and After Filtering the Weighting Function.]

CONCLUDING REMARKS

The design presented in this paper illustrates a simple implementation of a speech extraction system and establishes the feasibility of such a system. This design was successful in achieving the objective of extracting a single speaker's track from a group recording. Included in the appendices are the code for the implementation as well as larger spectrogram depictions of the results.

Despite the successes presented in this paper, there exists room for enhancement in future work. For example, the source model in this implementation was obtained manually. Automating the source-model extraction would allow the system to behave in a more versatile manner, possibly allowing the relaxation of the a priori assumption that the speakers' positions remain constant. A second potential improvement could be the incorporation of a broader feature set, possibly including the IID in addition to the TDOA.

REFERENCES

[1] S. Mitra, Digital Signal Processing: A Computer-Based Approach, McGraw-Hill Irwin.
[2] R. Ziemer and W. Tranter, Signals and Systems: Continuous and Discrete, Prentice Hall, 1998.
[3] E. Tessier and F. Berthommier, "Speech Enhancement and Segregation Based on the Localisation Cue for Cocktail-Party Processing."
[4] W. Chau and R. Duda, "Combined Monaural and Binaural Localization of Sound Sources," IEEE Proceedings of ASILOMAR, 1996.
[5] R. Duda, "Modeling Head Related Transfer Functions," IEEE Proceedings of ASILOMAR, 1993.
[6] K. Martin, "Estimating Azimuth and Elevation from Interaural Differences."
[7] S. Choi, H. Glotin, F. Berthommier, and E. Tessier, "A CASA Front-End Using the Localization Cue for Segregation and then Cocktail-Party Speech Recognition."
[8] F. Berthommier and S. Choi, "Evaluation of CASA and BSS Models for Subband Cocktail-Party Speech Separation."
[9] D. Wang and G. Brown, "Separation of Speech from Interfering Sounds Based on Oscillatory Correlation," IEEE Transactions on Neural Networks, Vol. 10, No. 3, May 1999.
[10] A. Bregman, Auditory Scene Analysis, Cambridge, MA: MIT Press, 1990.

APPENDIX A: MATLAB CODE

function banddat = bands(dd, nb)
% M.K. Alisdairi, Spring
% b = bands(sound, number_bands)
% Given original stereo data 'dd', bands() returns an
% nb-by-length(dd)-by-2 matrix containing versions of 'dd' band-limited
% according to the Bark scale. nb is the number of bands used.
% Note: nb may be {8, 4, 2, 1}.

if (nb~=1 & nb~=2 & nb~=4 & nb~=8)    % make sure it fits the criteria
    clc
    disp(sprintf('Error: nb must be 8, 4, 2, or 1'))
    banddat = [];
    return
end

home
disp(sprintf('Working...'))

N = 2^9;                              % number of frequency points for the STFT

% Separate into left and right channels
d_left  = dd(:,1)';
d_right = dd(:,2)';

% Calculate the STFTs
DL = stft(d_left,  N, N, N/2);
DR = stft(d_right, N, N, N/2);

% Define the Bark-scaled windows in Hz, then convert to FFT index (w).
% Note: eight channels max. (Band-edge digits are partially unreadable in
% the source; the values below are a reconstruction.)
F = [0 330 690 1060 1880 3000 4600 9300 15000];
w = floor(N*F./48000) + 1;

banddat = zeros(nb, (size(DL,2)+1)*N/2, 2);   % initializing is faster

% Go through and produce the proper band-limited signals
inc = 8/nb;
for i = 1:inc:8
    FtempL = zeros(size(DL));
    FtempL(w(i):(w(i+inc)-1), :) = DL(w(i):(w(i+inc)-1), :);
    FtempR = zeros(size(DR));
    FtempR(w(i):(w(i+inc)-1), :) = DR(w(i):(w(i+inc)-1), :);
    banddat(ceil(i/inc), :, 1) = istft(FtempL, N, N, N/2);   % forward the data
    banddat(ceil(i/inc), :, 2) = istft(FtempR, N, N, N/2);
end


function [ii, yy, c] = tdoa(b, win, ch)
% M.K. Alisdairi, Spring
% [i, y, c] = tdoa(band_data, window_length, channels)
% Accepts matrix 'b' (produced by bands()), which contains nb channels of
% band-limited stereo audio data. The function conducts a cross-correlation
% of left and right time windows of length 'win'.

bl = b(:,:,1);                   % break data into left and right channels
br = b(:,:,2);
nb   = size(b,1);                % number of bands
stop = floor(size(b,2)/win);     % number of windows

c = NaN*ones(2*win-1, stop);     % xcorr matrix

for k = 1:length(ch)
    j = ch(k);
    for i = 1:win:(stop*win)
        c(:, ceil(i/win)) = xcorr(bl(j, i:(i+win-1)), br(j, i:(i+win-1)))';
    end
    home
    disp(size(c))
    [y, i] = max(c);             % determine the TDOA for each frame
    yy(k,:) = y;                 % forward actual xcorr maxima
    ii(k,:) = i - win;           % forward lag indexes
end


function [ys, wgt] = extract(b, ii, person, win)
% M.K. Alisdairi, Spring
% [ys, wgt] = extract(band_data, lag_indexes, desired_person, window_size)
% Extracts the desired speaker's voice from the sound data in b.
% If person == 1, extract Guy; if person == 2, extract Inge Marie.

u = [ -8   -6.5  -7.5   NaN ;    % means for Guy (channel-four value unreadable in the source)
       6.5  7.5   7.5   6.5 ];   % means for Inge Marie
s = [  NaN  NaN   0.5   5   ;    % std. devs for Guy (channels one and two unreadable in the source)
       4    3     NaN   0.5 ];   % std. devs for Inge Marie (channel three unreadable in the source)

u = u(person, :);                % take the correct speaker's data
s = s(person, :);

w   = zeros(1, size(ii,2)*win);       % initialize short weight vector
len = min(size(b,2), length(w));
ys  = zeros(2, len);                  % initialize place for the extracted data
wgt = zeros(size(ii,1), length(w));   % initialize actual weight matrix

for ch = 1:size(ii,1)                 % look at all channels
    % Calculate the appropriate weight for each window from its lag index (Equation 5)
    wf = exp(-(ii(ch,:) - u(ch)).^2 / (2*s(ch)^2));
    w(:) = wf(ceil((1:length(w))/win));    % convert to a per-sample vector
    wgt(ch,:) = w;                         % forward the info
    ys(1,:) = ys(1,:) + b(ch, 1:len, 1).*w(1:len);   % calculate the extracted voice
    ys(2,:) = ys(2,:) + b(ch, 1:len, 2).*w(1:len);   % calculate the other channel
end
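A sketch of how the three functions above chain together to reproduce the extraction described in the body of the paper; the ShATR filename is hypothetical, and person = 1 selects Guy per extract()'s convention.

[dd, fs] = audioread('shatr_conversation.wav');   % hypothetical name for the conversation file
b = bands(dd, 4);                                 % four Bark-scaled band-limited stereo signals
[ii, yy, c] = tdoa(b, 256, 1:4);                  % per-window lag of maximum cross-correlation
[ys, wgt] = extract(b, ii, 1, 256);               % weight TF cells by Guy's lag model
soundsc(ys', fs)                                  % audition the isolated track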

APPENDIX B: ENLARGED SPECTROGRAMS

[Enlarged spectrogram depictions of the extraction results.]
