ELEN E896 MUSIC SIGNAL PROCESSING
Lecture 1: Source Separation
1. Sources, Mixtures, & Perception
2. Spatial Filtering
3. Time-Frequency Masking
4. Model-Based Separation
Dan Ellis, Dept. Electrical Engineering, Columbia University
dpwe@ee.columbia.edu  http://www.ee.columbia.edu/~dpwe/e896/
E896 Music Signal Processing (Dan Ellis) 13--9-1 /19

1. Sources, Mixtures, & Perception
- Sound is a linear process (superposition)
  - no opacity (unlike vision)
  - sources combine into "auditory scenes" (polyphony)
- Humans perceive discrete sources
  - ...a subjective construct
[Figure: spectrogram of example _m+s-15-evil-goodvoice-fade (freq / Hz vs. time / s, level / dB), with an analysis into labeled sources: Voice (evil), Rumble, Stab, Voice (pleasant), Strings, Choir]

Spatial Hearing
- People perceive sources based on cues
  - spatial (binaural): ITD, ILD (Blauert 96)
[Figure: source to one side of a head with ears L and R; the path length difference gives the time cue, the head shadow (high freq) gives the level cue. Below: Left and Right waveforms of example shatr78m3 vs. time / s]

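The two binaural cues named above, interaural time difference (ITD) and interaural level difference (ILD), can be estimated from a pair of ear signals. The following is a minimal sketch, not a model of human hearing: it assumes a single source, takes the ITD from the peak of the full-band cross-correlation, and takes the ILD from the overall energy ratio. The function name `estimate_itd_ild` and the synthetic test signal are illustrative assumptions.

```python
import numpy as np

def estimate_itd_ild(left, right, sr):
    """Return (ITD in seconds, ILD in dB) for one pair of ear signals."""
    n = len(right)
    # Full cross-correlation; output index (n - 1) corresponds to zero lag.
    xcorr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(xcorr)) - (n - 1)  # > 0 means left lags right
    itd = lag / sr
    ild = 10.0 * np.log10(np.sum(left**2) / np.sum(right**2))
    return itd, ild

# Hypothetical test case: a source on the right, so the left ear receives a
# delayed (5 samples) and attenuated (x0.5) copy of the right-ear signal.
sr = 16000
rng = np.random.default_rng(0)
right = rng.standard_normal(2000)
left = 0.5 * np.concatenate([np.zeros(5), right[:-5]])
itd, ild = estimate_itd_ild(left, right, sr)
print(f"ITD = {itd * 1e3:.3f} ms, ILD = {ild:.2f} dB")
```

With this sign convention a positive ITD and a negative ILD both point to a source on the right. Real signals would need per-band analysis, since the head shadow mainly affects high frequencies.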
Auditory Scene Analysis
- Spatial cues may not be enough/available
  - single channel signal
- Brain uses signal-intrinsic cues to form sources
  - onset, harmonicity (Bregman 90)
[Figure: spectrogram of the Reynolds-McAdams Oboe example, time / sec, level / dB]

Auditory Scene Analysis
"Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?" (after Bregman 90)
- Quite a challenge!

Audio Mixing
- Studio recording combines separate tracks into, e.g., 2 channels (stereo)
  - different levels
  - panning
  - other effects
- Stereo Intensity Panning
  - manipulating ILD only
  - constant power
  - more channels: use just nearest pair?
[Figure: L and R channel gains for intensity panning]

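Constant-power intensity panning can be sketched with a sine/cosine gain law. This is one common choice and an assumption here, since the slide does not say which law it uses. The pan parameter convention (0 = hard left, 1 = hard right) and the function names are also illustrative.

```python
import numpy as np

def constant_power_gains(pan):
    """Left/right gains for pan in [0, 1]: 0 = hard left, 1 = hard right."""
    theta = pan * np.pi / 2
    return np.cos(theta), np.sin(theta)

def pan_ild_db(pan):
    """ILD (left re right, in dB) produced by the pan position alone."""
    gl, gr = constant_power_gains(pan)
    return 20.0 * np.log10(gl / gr)

pans = np.linspace(0.1, 0.9, 9)
gl, gr = constant_power_gains(pans)
total_power = gl**2 + gr**2  # stays at 1 for every pan position
for p in pans:
    print(f"pan={p:.1f}  ILD={pan_ild_db(p):+6.2f} dB")
```

Because both channels carry the same waveform with no delay, the only binaural cue this creates is the ILD.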
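2. Spatial Filtering
- N sources detected by M sensors
  - degrees of freedom (else need other constraints)
- Consider 2 x 2 case: directional mics
  - mixing matrix:
\[ \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \end{bmatrix} \]
  - source estimates from the inverse of the estimated mixing matrix:
\[ \begin{bmatrix} \hat{s}_1 \\ \hat{s}_2 \end{bmatrix} = \hat{A}^{-1} \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} \]
[Figure: sources s_1, s_2 reaching mics m_1, m_2 through gains a_11, a_12, a_21, a_22]
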
Source Cancelation
- Simple 2 x 2 case example:
\[ \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} = \begin{bmatrix} 1 & 0.5 \\ 0.8 & 1 \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \end{bmatrix} \]
\[ m_1(t) = s_1(t) + 0.5\, s_2(t) \]
\[ m_2(t) = 0.8\, s_1(t) + s_2(t) \]
\[ m_1(t) - 0.5\, m_2(t) = 0.6\, s_1(t) \]
- if no delay and linearly-independent sums, can cancel one source per combination

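The 2 x 2 example can be checked numerically. This sketch uses the slide's mixing gains (1, 0.5, 0.8, 1) with hypothetical white-noise sources, and assumes the mixing matrix is known exactly. In practice it would have to be estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 1000))   # rows: s1(t), s2(t) (hypothetical sources)
A = np.array([[1.0, 0.5],
              [0.8, 1.0]])           # mixing matrix from the example
m = A @ s                            # rows: m1(t), m2(t)

# One weighted difference of the channels cancels s2 and leaves 0.6 * s1.
cancel_s2 = m[0] - 0.5 * m[1]

# Inverting the (here known) mixing matrix recovers both sources at once.
s_hat = np.linalg.solve(A, m)

print(np.allclose(cancel_s2, 0.6 * s[0]), np.allclose(s_hat, s))
```

Each row of the inverse matrix is one such canceling combination, which is why an instantaneous mix with independent rows allows one source to be nulled per output.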
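Independent Component Analysis
- Can separate "blind" combinations by maximizing independence of outputs (Bell & Sejnowski 95)
  - adapt the unmixing weights a_ij along the gradient \( \partial\,\mathrm{MutInfo} / \partial a \)
- kurtosis for independence?
\[ \mathrm{kurt}(y) = E\left[ \left( \frac{y - \mu}{\sigma} \right)^4 \right] - 3 \]
[Figure: block diagram of mixtures m_1, m_2 passing through weights a_11, a_12, a_21, a_22 to outputs, adapted to reduce mutual information. Plots: "Mixture Scatter" (mix 1 vs. mix 2, with the s1 and s2 directions marked) and "Kurtosis vs. angle" (with s1 and s2 marked)]
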
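The idea behind the kurtosis plot can be illustrated by scanning projection directions of a two-channel mixture and measuring the excess kurtosis of each projection. This is a rough sketch, not the Bell & Sejnowski infomax algorithm. It assumes two hypothetical Laplacian (heavy-tailed) sources and a mixing matrix that is a pure 30 degree rotation, so no whitening step is needed.

```python
import numpy as np

def excess_kurtosis(y):
    """E[((y - mu) / sigma)^4] - 3, which is zero for a Gaussian."""
    z = (y - y.mean()) / y.std()
    return np.mean(z**4) - 3.0

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 50000))            # two independent sources
phi = np.deg2rad(30.0)                      # assumed mixing rotation
A = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])
m = A @ s

# Project the mixture onto every direction from 0 to 179 degrees.
angles = np.deg2rad(np.arange(180))
kurt = np.array([excess_kurtosis(np.cos(a) * m[0] + np.sin(a) * m[1])
                 for a in angles])
best_deg = float(np.rad2deg(angles[np.argmax(kurt)]))
print(f"kurtosis peaks near {best_deg:.0f} degrees")
```

Projections aligned with a single source are the least Gaussian, so the kurtosis peaks near the source directions (here around 30 and 120 degrees) and dips in between, where the projection is a blend of both sources.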
Microphone Arrays
- If interference is diffuse, can simply boost energy from target direction
  - e.g. shotgun mic - delay-and-sum (Benesty, Chen, Huang 08)
- off-axis spectral coloration
- many variants - filter & sum, sidelobe cancelation...
[Figure: delay-and-sum array with element spacing D and per-element delays; directional responses for several wavelengths λ relative to D]

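The wavelength dependence of a delay-and-sum array can be sketched by computing its far-field magnitude response. The assumptions here are a uniform line array steered to broadside (all steering delays zero), plane-wave arrivals, and a hypothetical spacing `D`. The two wavelengths are chosen for illustration and are not taken from the figure.

```python
import numpy as np

def delay_and_sum_response(n_mics, spacing, wavelength, angles_rad):
    """Magnitude response of a broadside-steered uniform line array."""
    m = np.arange(n_mics)[:, None]
    # Extra path length to mic m for arrival angle theta: m * spacing * sin(theta).
    phase = 2 * np.pi * m * spacing * np.sin(angles_rad)[None, :] / wavelength
    return np.abs(np.exp(1j * phase).mean(axis=0))

angles = np.deg2rad(np.linspace(-90, 90, 181))   # index 90 is on-axis
D = 0.1                                          # hypothetical spacing in meters
resp_long = delay_and_sum_response(4, D, 8 * D, angles)
resp_short = delay_and_sum_response(4, D, 2 * D, angles)
print(f"gain at 90 degrees: {resp_long[-1]:.3f} (lambda = 8D), "
      f"{resp_short[-1]:.3f} (lambda = 2D)")
```

On-axis gain is 1 at every wavelength, but the off-axis gain changes with λ/D. This is the off-axis spectral coloration noted above: sounds from the side are filtered differently at different frequencies.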
3. Time-Frequency Masking
- What if there is only one channel?
  - cannot have fixed cancellation
  - but could have fast time-varying filtering (Brown & Cooke 94; Roweis 01)
- The trick is finding the right mask...
[Figure: spectrograms illustrating masking, frequency vs. time / s]

Time-Frequency Masking
- Works well for overlapping voices
- time-frequency resolution?
[Figure: spectrogram panels "Original Mix + Oracle Labels" (Male, Female) and "Oracle-based Resynth"; level / dB, time / sec. Audio examples: cooke-v3n7.wav, cooke-v3msk-ideal.wav, cooke-n7msk-ideal.wav]

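An oracle (ideal binary) mask keeps each time-frequency cell in which the target is stronger than the interference, which requires access to the separate sources. The sketch below is a minimal NumPy version under stated assumptions: a 512-point sqrt-Hann STFT with 50% overlap, and two hypothetical steady tones standing in for the two voices. The helper names `stft`, `istft`, and `snr_db` are local to this example.

```python
import numpy as np

N_FFT, HOP = 512, 256
# Periodic sqrt-Hann: analysis x synthesis windows overlap-add to 1 at 50% hop.
WIN = np.sqrt(np.hanning(N_FFT + 1)[:-1])

def stft(x):
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * WIN for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T          # shape (freq, time)

def istft(X):
    frames = np.fft.irfft(X.T, n=N_FFT, axis=1) * WIN
    y = np.zeros(HOP * (len(frames) - 1) + N_FFT)
    for i, frame in enumerate(frames):            # overlap-add
        y[i * HOP:i * HOP + N_FFT] += frame
    return y

def snr_db(ref, est):
    return 10 * np.log10(np.sum(ref**2) / np.sum((ref - est)**2))

sr = 8000
t = np.arange(sr) / sr
s1 = np.sin(2 * np.pi * 440 * t)                  # hypothetical target
s2 = np.sin(2 * np.pi * 2000 * t)                 # hypothetical interference
mix = s1 + s2

S1, S2, X = stft(s1), stft(s2), stft(mix)
mask = np.abs(S1) > np.abs(S2)                    # oracle labels per cell
s1_hat = istft(mask * X)                          # masked resynthesis

inner = slice(N_FFT, len(s1_hat) - N_FFT)         # skip partially covered edges
snr_mix = snr_db(s1[inner], mix[inner])
snr_masked = snr_db(s1[inner], s1_hat[inner])
print(f"SNR before: {snr_mix:.1f} dB, after oracle mask: {snr_masked:.1f} dB")
```

The window length sets the time-frequency resolution trade-off raised above: longer windows resolve closely spaced harmonics but smear onsets. Steady tones are an easy case, and real voices overlap far more in frequency.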
Pan-Based Filtering
- Can use time-frequency masking even for stereo
  - e.g. calculate "panning index" as ILD
  - mask cells matching that ILD (Avendano 03)
[Figure: spectrogram (level / dB), per-cell ILD (ILD / dB), and the resulting ILD mask; time / s]

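Selecting cells by ILD can be sketched directly on a pair of stereo STFT arrays. This is an idealized illustration rather than Avendano's exact panning index. It assumes each time-frequency cell is dominated by one source, and the synthetic spectra, pan gains, and 1 dB tolerance are hypothetical.

```python
import numpy as np

def ild_mask(L, R, target_ild_db, tol_db=1.0, eps=1e-12):
    """True where the per-cell ILD is within tol_db of the target ILD."""
    ild = 20 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    return np.abs(ild - target_ild_db) <= tol_db

rng = np.random.default_rng(0)
shape = (64, 40)                                  # (freq bins, frames)

def random_spec():
    mag = rng.random(shape) + 0.1
    return mag * np.exp(2j * np.pi * rng.random(shape))

owner = rng.random(shape) < 0.5                   # cells belonging to source 1
S1 = random_spec() * owner
S2 = random_spec() * ~owner                       # disjoint support (assumption)

g1 = (0.951, 0.309)                               # source 1 panned left
g2 = (0.454, 0.891)                               # source 2 panned right
L = g1[0] * S1 + g2[0] * S2
R = g1[1] * S1 + g2[1] * S2

target = 20 * np.log10(g1[0] / g1[1])             # ILD of source 1, about 9.8 dB
mask = ild_mask(L, R, target)
S1_hat = mask * L / g1[0]
print("mask matches source-1 cells:", np.array_equal(mask, owner))
```

With overlapping sources the per-cell ILD falls between the two pan values, so the tolerance trades leakage against lost target energy.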
Harmonic-based Masking
- Time-frequency masking can be used to pick out harmonics
  - given pitch track, know where to expect harmonics (Denbigh & Zhao 1992)
[Figure: spectrograms with harmonic mask, frequency vs. time / s]

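Given a pitch track, a harmonic mask simply marks the STFT bins near each multiple of the fundamental in each frame. A minimal sketch follows. The fixed bandwidth `width_hz`, the convention that `f0 <= 0` means an unvoiced frame, and the function name are assumptions for illustration.

```python
import numpy as np

def harmonic_mask(f0_track, sr, n_fft, n_harm, width_hz=40.0):
    """Boolean (freq, time) mask covering bins near k * f0 for k = 1..n_harm."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    mask = np.zeros((len(freqs), len(f0_track)), dtype=bool)
    for t, f0 in enumerate(f0_track):
        if f0 <= 0:                      # unvoiced frame: select nothing
            continue
        for k in range(1, n_harm + 1):
            mask[:, t] |= np.abs(freqs - k * f0) <= width_hz / 2
    return mask

# Hypothetical two-frame pitch track: unvoiced, then 200 Hz.
mask = harmonic_mask(np.array([0.0, 200.0]), sr=8000, n_fft=512, n_harm=5)
print(mask.shape, int(mask[:, 1].sum()), "bins selected in the voiced frame")
```

The mask is applied by multiplying it with the mixture STFT before resynthesis, exactly as with any other time-frequency mask.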
Harmonic Filtering
- Given pitch track, could use time-varying comb filter to get harmonics
- or: isolate each harmonic by heterodyning (Avery Wang 1995):
\[ \hat{x}(t) = \sum_k \hat{a}_k(t) \cos\big(k \hat{\omega}(t)\, t\big) \]
\[ \hat{a}_k(t) = \mathrm{LPF}\left\{ x(t)\, e^{-jk\hat{\omega}(t) t} \right\} \]
[Figure: spectrograms of the extracted harmonics, frequency vs. time / s]

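Heterodyning shifts harmonic k down to 0 Hz by multiplying with a complex exponential at k times the pitch, low-pass filters to keep only that component, and then shifts it back up. The sketch below is an illustration under stated assumptions, not Wang's implementation. It uses the running phase (cumulative sum of the pitch track) in place of ω̂(t)·t, keeps the complex amplitude so each harmonic's phase is preserved, and uses a Hann-shaped FIR as a crude low-pass filter. The test signal with vibrato and a 330 Hz interfering tone is hypothetical.

```python
import numpy as np

def heterodyne_harmonics(x, f0, sr, n_harm, lp_len=401):
    """Resynthesize the first n_harm harmonics of x given a per-sample f0 (Hz)."""
    phase = 2 * np.pi * np.cumsum(f0) / sr        # running phase of fundamental
    lp = np.hanning(lp_len)
    lp /= lp.sum()                                # unit-DC-gain low-pass FIR
    y = np.zeros(len(x))
    for k in range(1, n_harm + 1):
        carrier = np.exp(1j * k * phase)
        baseband = x * np.conj(carrier)           # move harmonic k to 0 Hz
        a_k = np.convolve(baseband, lp, mode="same")   # slowly varying amplitude
        y += 2 * np.real(a_k * carrier)           # move it back up
    return y

def snr_db(ref, est):
    return 10 * np.log10(np.sum(ref**2) / np.sum((ref - est)**2))

sr = 8000
t = np.arange(2 * sr) / sr
f0 = 200 + 5 * np.sin(2 * np.pi * 3 * t)          # pitch track with vibrato
phase = 2 * np.pi * np.cumsum(f0) / sr
target = sum(a * np.cos(k * phase + 0.3 * k)
             for k, a in [(1, 1.0), (2, 0.5), (3, 0.25)])
interferer = np.sin(2 * np.pi * 330 * t)

est = heterodyne_harmonics(target + interferer, f0, sr, n_harm=3)
inner = slice(401, -401)                          # ignore filter edge effects
snr_in = snr_db(target[inner], (target + interferer)[inner])
snr_out = snr_db(target[inner], est[inner])
print(f"SNR in: {snr_in:.1f} dB, out: {snr_out:.1f} dB")
```

The low-pass bandwidth sets how quickly each harmonic's amplitude may change and how close an interfering partial can be before it leaks through. The example also uses the true pitch track, so errors from pitch estimation are not represented.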
Nonnegative Matrix Factorization
- Decomposition of spectrograms into templates + activation
\[ X = W \cdot H \]
  (Lee & Seung 99; Abdallah & Plumbley 04; Smaragdis & Brown 03; Virtanen 07)
- fast & forgiving gradient descent algorithm
- fits neatly with time-frequency masking
[Figure (Smaragdis): bases from all W_t (Frequency (DFT index)) and rows of H vs. Time (DFT slices). Sound examples labeled "Virtanen 3 sounds"]

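The factorization X ≈ W·H can be sketched with Lee & Seung style multiplicative updates. This version minimizes squared (Euclidean) error, which is an assumption made for brevity, and the cited papers use a range of cost functions. The synthetic "spectrogram" built from two hypothetical templates is also illustrative.

```python
import numpy as np

def nmf(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor nonnegative X (freq x time) as W @ H by multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], rank)) + eps      # spectral templates
    H = rng.random((rank, X.shape[1])) + eps      # activations over time
    errs = []
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errs.append(np.linalg.norm(X - W @ H))
    return W, H, np.array(errs)

# Synthetic magnitude spectrogram: two templates with different activations.
rng = np.random.default_rng(1)
W_true = np.full((20, 2), 0.05)
W_true[:10, 0] += rng.random(10)                  # template 1: low bins
W_true[10:, 1] += rng.random(10)                  # template 2: high bins
H_true = rng.random((2, 30))
X = W_true @ H_true

W, H, errs = nmf(X, rank=2)

# Each component yields a soft time-frequency mask, and the masks sum to ~1.
V = W @ H + 1e-9
masks = [np.outer(W[:, k], H[k]) / V for k in range(2)]
print(f"error: first iteration {errs[0]:.3f}, last iteration {errs[-1]:.3f}")
```

Multiplying each soft mask by the complex mixture STFT and resynthesizing gives one separated signal per component. This is the sense in which the decomposition fits neatly with time-frequency masking.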
4. Model-Based Separation
- When N (sources) > M (sensors), need additional constraints to solve problem
  - e.g. assumption of single dominant pitch
- Can assemble into a model \( \mathcal{M} \) of source \( s_i \)
  - defines set of "possible" waveforms
  - ...probabilistically: \( \Pr(s_i \mid \mathcal{M}) \)
- Source separation from mixture as inference:
\[ \hat{s} = \{\hat{s}_i\} = \arg\max_{s} \Pr(x \mid s, A)\, P(A) \prod_i \Pr(s_i \mid \mathcal{M}) \]
  where
\[ \Pr(x \mid s, A) = \mathcal{N}(x \mid As, \Sigma) \]

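The inference view can be made concrete in the simplest case where everything is Gaussian. The sketch below assumes zero-mean Gaussian source models with known variances (standing in for Pr(s_i | M)), a known mixing matrix A (so the P(A) term is constant and dropped), and white observation noise. Under those assumptions the arg max has a closed form. The numbers are hypothetical, with one sensor and two sources.

```python
import numpy as np

def map_sources(x, A, prior_var, noise_var):
    """MAP estimate of s for x = A s + n, s ~ N(0, diag(prior_var)), n ~ N(0, noise_var I)."""
    S = np.diag(prior_var)
    C = A @ S @ A.T + noise_var * np.eye(A.shape[0])   # covariance of x
    return S @ A.T @ np.linalg.solve(C, x)

def log_posterior(s, x, A, prior_var, noise_var):
    """log Pr(x | s, A) + sum_i log Pr(s_i | model), up to an additive constant."""
    resid = x - A @ s
    return -0.5 * resid @ resid / noise_var - 0.5 * np.sum(s**2 / prior_var)

A = np.array([[1.0, 1.0]])            # M = 1 sensor, N = 2 sources
prior_var = np.array([4.0, 1.0])      # source 1 is expected to be stronger
noise_var = 0.01
x = np.array([1.0])                   # one observed mixture sample

s_hat = map_sources(x, A, prior_var, noise_var)
print("MAP sources:", s_hat)
```

With a single sensor the observation alone cannot split x between the two sources. The split comes entirely from the source models, which here assign energy in proportion to the prior variances, as in a Wiener filter. Richer models such as pitch or spectral templates play the same role.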
Source Models
- Can constrain:
  - source spectra (e.g. harmonic, noisy, smooth)
  - temporal evolution (piecewise-continuous)
  - spatial arrangements (point-source, diffuse)
- Factored decomposition (Ozerov, Vincent & Bimbot 10)
  http://bass-db.gforge.inria.fr/fasst/
[Figure: spectrograms (Frequency vs. Time) of a stereo instantaneous mix and of the separated sources. Music: Shannon Hurley / Mix: Michel Desnoues & Alexey Ozerov / Separations: Alexey Ozerov]

Summary
- Acoustic Source Mixtures
  - The normal situation in real-world sounds
- Spatial filtering
  - Canceling sources by subtracting channels
- Time-Frequency Masking
  - Selecting spectrogram cells
- Model-Based Separation
  - Exploiting regularities in source signals

References
S. Abdallah & M. Plumbley, "Polyphonic transcription by non-negative sparse coding of power spectra," Proc. Int. Symp. Music Info. Retrieval, 2004.
C. Avendano, "Frequency-domain source identification and manipulation in stereo mixes for enhancement, suppression and re-panning applications," IEEE WASPAA, Mohonk, pp. 55-58, 2003.
A. Bell & T. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution," Neural Computation, vol. 7 no. 6, pp. 1129-1159, 1995.
J. Benesty, J. Chen, & Y. Huang, Microphone Array Signal Processing, Springer, 2008.
J. Blauert, Spatial Hearing, MIT Press, 1996.
A. Bregman, Auditory Scene Analysis, MIT Press, 1990.
G. Brown & M. Cooke, "Computational auditory scene analysis," Computer Speech and Language, vol. 8 no. 4, pp. 297-336, 1994.
P. Denbigh & J. Zhao, "Pitch extraction and separation of overlapping speech," Speech Communication, vol. 11 no. 2-3, pp. 119-125, 1992.
D. Lee & S. Seung, "Learning the Parts of Objects by Non-negative Matrix Factorization," Nature 401, 788, 1999.
A. Ozerov, E. Vincent, & F. Bimbot, "A general flexible framework for the handling of prior information in audio source separation," INRIA Tech. Rep. 7453, Nov. 2010.
S. Roweis, "One microphone source separation," Adv. Neural Info. Proc. Sys., pp. 793-799, 2001.
P. Smaragdis & J. Brown, "Non-negative Matrix Factorization for Polyphonic Music Transcription," Proc. IEEE WASPAA, pp. 177-180, October 2003.
T. Virtanen, "Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria," IEEE Tr. Audio, Speech, & Lang. Proc. 15(3), 1066-1074, 2007.
Avery Wang, "Instantaneous and frequency-warped signal processing techniques for auditory source separation," Ph.D. dissertation, Stanford CCRMA, 1995.