VQ Source Models: Perceptual & Phase Issues

Dan Ellis & Ron Weiss
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
{dpwe,ronw}@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Source Models for Separation
2. VQ with Perceptual Weighting
3. Phase and Resynthesis
4. Conclusions

VQ Source Models - Ellis & Weiss - 2006-05-16 - 1/12
Single-Channel Scene Analysis

How to separate overlapping sounds?
[Figure: spectrograms (freq / kHz vs. time / s, level / dB) of two sources and their mixture]

- underconstrained: infinitely many decompositions
- time-frequency overlaps cause obliteration
- no obvious segmentation of sources (?)
Scene Analysis as Inference

Ideal separation is rarely possible
- i.e. no projection can guarantee to remove overlaps

Overlaps -> Ambiguity
- scene analysis = find the most reasonable explanation

Ambiguity can be expressed probabilistically
- i.e. posteriors of sources {S_i} given observations X:
      P({S_i} | X)  ∝  P(X | {S_i}) · P({S_i})
                       (combination    (source
                        physics)        models)

Better source models -> better inference
- learn from examples?
Vector-Quantized (VQ) Source Models

The constraint of a source can be captured explicitly in a codebook (dictionary):
      x(t) ≈ c_i(t),  where i(t) ∈ {1 ... N}
- defines the subspace occupied by the source
- codebook minimizes distortion (MSE) via k-means clustering

[Figure: "VQ8 Codebook - Linear distortion measure (greedy ordering)"; freq / mel band vs. codebook index, level / dB]
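The k-means codebook fit described above can be sketched in a few lines of numpy. This is a toy illustration, not the authors' code; the farthest-point initialization and the synthetic cluster data are added assumptions to keep the demo deterministic:

```python
import numpy as np

def train_vq_codebook(frames, n_codewords, n_iter=50):
    """Fit a VQ codebook to spectral frames (n_frames x n_bins) by
    k-means, minimizing the MSE distortion described on the slide."""
    # farthest-point initialization (an assumption, chosen here to keep
    # the toy demo deterministic; the slide does not specify an init)
    codebook = frames[:1].copy()
    for _ in range(n_codewords - 1):
        d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1).min(1)
        codebook = np.vstack([codebook, frames[d2.argmax()]])
    for _ in range(n_iter):
        # assign each frame to its nearest codeword (MSE distance)
        d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = d2.argmin(1)
        # move each codeword to the mean of its assigned frames
        for k in range(n_codewords):
            members = frames[idx == k]
            if len(members):
                codebook[k] = members.mean(0)
    return codebook, idx

# toy demo: three well-separated synthetic "spectra" clusters
rng = np.random.default_rng(1)
frames = np.concatenate([rng.normal(m, 0.1, size=(100, 8))
                         for m in (-3.0, 0.0, 3.0)])
codebook, idx = train_vq_codebook(frames, 3)
```

The greedy codeword ordering shown in the codebook figure is a display device; the clustering itself is order-free.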
Simple Source Separation

Given models for the sources, find the best (most likely) states for each spectrum:
      p(x | i1, i2) = N(x; c_i1 + c_i2, Σ)                  (combination model)
      {i1(t), i2(t)} = argmax_{i1,i2} p(x(t) | i1, i2)      (inference of source state)
- can include sequential constraints ...
- different domains for combining the c's and defining Σ

E.g. stationary noise:
[Figure: mel spectrograms of original speech; speech in speech-shaped noise (mel magSNR = 2.41 dB); VQ inferred states (mel magSNR = 3.6 dB)]
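Under this combination model with a fixed Σ, the most likely codeword pair is simply the one whose summed codewords are closest to the observation in MSE, so per-frame inference reduces to an exhaustive pair search. A minimal sketch (toy codebooks and sizes are assumptions, not values from the talk); the N1 × N2 cost per frame is what motivates the sequential and factorized variants mentioned later:

```python
import numpy as np

def infer_states(x, cb1, cb2):
    """Exhaustive search over codeword pairs: with a fixed covariance,
    maximizing N(x; c_i1 + c_i2, Σ) is minimizing squared error to the
    summed codewords."""
    combo = cb1[:, None, :] + cb2[None, :, :]         # all N1 x N2 sums
    err = ((x[None, None, :] - combo) ** 2).sum(-1)   # squared error per pair
    i1, i2 = np.unravel_index(err.argmin(), err.shape)
    return int(i1), int(i2)

# toy check: a mixture built from known codewords is recovered
rng = np.random.default_rng(0)
cb1 = rng.normal(size=(8, 6))   # hypothetical source-1 codebook
cb2 = rng.normal(size=(4, 6))   # hypothetical source-2 codebook
x = cb1[5] + cb2[2] + rng.normal(scale=0.01, size=6)
best = infer_states(x, cb1, cb2)
```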
Codebook Size

Two (main) variables:
o number of codewords
o amount of training data

Measure average accuracy (distortion):
[Figure: VQ magnitude fidelity (dB SNR) vs. codebook size, one curve per training-set size]

o main effect of codebook size
o larger codebooks need/allow more data
o (large absolute distortion values)
Distortion Metric

Standard MSE gives equal weight to every channel
- excessive emphasis on high frequencies

Try e.g. Mel spectrum
- approximately log spacing of bins
[Figure: linear-frequency vs. mel-frequency spectrograms of the same utterance (freq / kHz and freq / Mel band vs. time / s)]

Little effect (?):
[Figure: "VQ8 Codebook - Linear distortion measure" vs. "VQ8 Codebook - Mel/cube root distance"; freq / Mel band vs. codebook index, level / dB]
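The mel/cube-root weighting can be sketched as a triangular filterbank on a mel axis followed by cube-root compression. The filterbank construction below is a generic textbook recipe, an assumption; the talk does not specify the exact mapping used:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters on an approximately log-spaced (mel) axis,
    so low frequencies get finer resolution than a flat MSE gives."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0),
                                  n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        lo, ctr, hi = bins[m], bins[m + 1], bins[m + 2]
        for k in range(lo, ctr):
            fb[m, k] = (k - lo) / max(ctr - lo, 1)   # rising slope
        for k in range(ctr, hi):
            fb[m, k] = (hi - k) / max(hi - ctr, 1)   # falling slope
    return fb

def perceptual_features(power_spectrum, fb):
    # cube-root compression roughly follows perceived loudness growth
    return (fb @ power_spectrum) ** (1.0 / 3.0)
```

Clustering in this domain changes which spectral differences the MSE "sees", which is the perceptual weighting the slide is testing.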
Resynthesis Phase

Codewords quantize the spectrum magnitude
- phase has an arbitrary offset due to the STFT grid

Resynthesis (ISTFT) requires phase information
- use the mixture phase? no good for filling-in

Spectral peaks indicate a common instantaneous frequency (∂φ/∂t)
- can quantize it and cumulate in resynthesis, like the phase vocoder

[Figure: "Code 924: Magnitudes (21 trn frms)" (level / dB vs. freq / Hz) and "Code 924: Instantaneous Frequency" (Inst Freq / Hz vs. freq / Hz)]
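The phase-vocoder-style cumulation can be sketched directly: integrate the (quantized) per-bin instantaneous frequency across frames, since the codeword stores only ∂φ/∂t and the absolute offset is arbitrary. A minimal sketch; the hop size and sample rate are illustrative assumptions:

```python
import numpy as np

def accumulate_phase(inst_freq, hop, sr, phase0=None):
    """Integrate per-bin instantaneous frequency (Hz) across frames to
    build a resynthesis phase track.
    inst_freq: (n_frames, n_bins) array."""
    n_frames, n_bins = inst_freq.shape
    phase = np.zeros((n_frames, n_bins))
    if phase0 is not None:
        phase[0] = phase0          # arbitrary starting offset
    for t in range(1, n_frames):
        # phase advance per frame: 2*pi * f_inst * (hop / sr) seconds
        phase[t] = phase[t - 1] + 2 * np.pi * inst_freq[t] * hop / sr
    return np.angle(np.exp(1j * phase))  # wrap to (-pi, pi]
```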
Resynthesis Phase (2)

Can also improve the phase iteratively:
      repeat:
          X^(n)(t,f)   = |X̂(t,f)| · exp{j φ^(n)(t,f)}     (impose quantized magnitude)
          x^(n)(t)     = ISTFT{X^(n)(t,f)}                 (invert to a waveform)
          φ^(n+1)(t,f) = ∠(STFT{x^(n)(t)})                 (keep only the re-analyzed phase)
      goal: |STFT{x^(n)(t)}| → |X̂(t,f)|

Visible benefit:
[Figure: SNR of reconstruction (dB) vs. iteration number; iterative phase estimation starting from quantized phase and from random phase, vs. magnitude quantization only; curves reach 7.6 dB, 4.7 dB, and 3.4 dB]
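This iteration can be sketched with scipy's STFT pair. A minimal sketch, not the authors' code: window, FFT size, and the sinusoidal test target are assumptions, and the frame-count guard just papers over round-trip off-by-ones:

```python
import numpy as np
from scipy.signal import stft, istft

def iterative_phase(target_mag, n_fft=256, n_iter=30, seed=0):
    """Impose the quantized magnitude, invert, re-analyze, keep only
    the new phase; returns the final signal and the per-iteration
    magnitude-consistency error."""
    rng = np.random.default_rng(seed)
    phase = rng.uniform(-np.pi, np.pi, size=target_mag.shape)  # random start
    errs = []
    for _ in range(n_iter):
        _, x = istft(target_mag * np.exp(1j * phase), nperseg=n_fft)
        _, _, X = stft(x, nperseg=n_fft)
        # guard against off-by-one frame counts in the round trip
        X = X[: target_mag.shape[0], : target_mag.shape[1]]
        if X.shape != target_mag.shape:
            X = np.pad(X, [(0, target_mag.shape[0] - X.shape[0]),
                           (0, target_mag.shape[1] - X.shape[1])])
        errs.append(float(np.linalg.norm(np.abs(X) - target_mag)))
        phase = np.angle(X)
    return x, errs

# target magnitude from a real sinusoid, so a consistent phase exists
t = np.arange(16384) / 16000.0
_, _, S = stft(np.sin(2 * np.pi * 440 * t), nperseg=256)
x_rec, errs = iterative_phase(np.abs(S))
```

As on the slide, starting from the quantized (cumulated) phase rather than random phase gives the iteration a better basin to converge in.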
Evaluating Model Quality

Low distortion is not really the goal; the models are there to constrain source separation
- fit source spectra but reject non-source signals

Include sequential constraints
- e.g. a transition matrix over codewords
- .. or a smaller HMM with distributions over the codebook

Best way to evaluate is via a task
- e.g. separating speech from noise
[Figure: spectrograms of speech in 0 dB SNR speech-shaped noise (magSNR = 1.8 dB), direct VQ (magSNR = -0.6 dB), HMM-smoothed VQ (magSNR = -2.1 dB); freq / bins vs. time / s, level / dB]
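The transition-matrix constraint amounts to Viterbi decoding of the codeword sequence instead of an independent per-frame argmax. A minimal sketch (the two-state toy numbers are assumptions for illustration); with sticky self-transitions, a one-frame blip that the greedy argmax would follow gets smoothed away:

```python
import numpy as np

def viterbi_smooth(frame_loglik, log_trans):
    """Best codeword path under a transition matrix (Viterbi).
    frame_loglik: (T, N) per-frame codeword log-likelihoods
    log_trans:    (N, N) log transition probabilities."""
    T, N = frame_loglik.shape
    delta = frame_loglik[0].copy()
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # scores[from, to]
        back[t] = scores.argmax(0)            # best predecessor per state
        delta = scores.max(0) + frame_loglik[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):            # backtrace
        path[t] = back[t + 1, path[t + 1]]
    return path
```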
Future Directions

Factorized codebooks
- codebooks get too large due to combinatorics
- separate codebooks for type, formants, excitation?
[Figure: spectrum decomposed into low, mid, and high components (level / dB vs. freq / Hz)]

Model adaptation
- many speaker-dependent models, or ...
- a single speaker-adapted model, fit to each speaker

Using uncertainty
- enhancing noisy speech for listeners: use special tokens to preserve uncertainty
Summary

Source models permit separation of underconstrained mixtures
- or at least inference of the source state

Explicit codebooks need to be large
- .. and chosen to optimize perceptual quality

Resynthesis phase can be quantized
- .. using the phase-vocoder derivative
- .. iterative re-estimation helps more
Extra Slides
Other Uses for Source Models

Projecting into the model's space:
- Restoration / Extension: inferring missing parts
- Generation / Copying: adaptation + fitting
Example 2: Mixed Speech Recognition

Cooke & Lee's Speech Separation Challenge
- short, grammatically-constrained utterances:
      <command:4><color:4><preposition:4><letter:25><number:1><adverb:4>
  e.g. "bin white at M 5 soon"

IBM's "superhuman" recognizer (Kristjansson et al., Interspeech '06):
o model individual speakers (512-mixture GMM)
o infer speakers and gain
o reconstruct speech
o recognize as normal ...

[Figure: WER / % for (b) same-gender and (c) different-gender mixtures at 6, 3, 0, -3, -6, -9 dB; SDL recognizer with no dynamics, acoustic dynamics, and grammar dynamics, vs. human listeners; example file t5_bwam5s_m5_bbilzp_6p1.wav]

Grammar constraints are a big help
Model-Based Separation

Central idea: employ strong learned constraints to disambiguate possible sources
(Varga & Moore '90; Roweis '03)
      {S_i} = argmax_{S_i} P(X | {S_i})

e.g. fit a speech-trained vector quantizer to the mixed spectrum, then separate via a T-F mask (from Roweis '03)
[Figure: spectrogram sequence showing the mixed spectrum, the fitted codewords, and the denoised result]

Training: 3 sec of isolated speech (TIMIT) to fit 512 codewords, and 1 sec of isolated noise (NOISEX) to fit 32 codewords; testing on new signals at 0 dB SNR
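The T-F mask refiltering step can be sketched as: fit each source's codebook to the mixture magnitude per frame, then assign each time-frequency cell to whichever fitted codeword is larger, reusing the mixture phase. A minimal sketch under those assumptions (the tiny codebooks in the test are illustrative, not the TIMIT/NOISEX models):

```python
import numpy as np

def nearest_codewords(mag, codebook):
    """Per-frame nearest codeword (mag: T x F, codebook: N x F)."""
    d2 = ((mag[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(1)

def mask_separate(mix_stft, speech_cb, noise_cb):
    """Binary T-F mask refiltering: each mixture cell goes to whichever
    fitted codeword (speech or noise) is larger there; the mixture
    phase is reused for resynthesis."""
    mag = np.abs(mix_stft)
    s_fit = speech_cb[nearest_codewords(mag, speech_cb)]  # T x F speech fit
    n_fit = noise_cb[nearest_codewords(mag, noise_cb)]    # T x F noise fit
    mask = s_fit > n_fit
    return mix_stft * mask, mix_stft * ~mask
```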
Separation or Description?

Are isolated waveforms required?
- clearly sufficient, but may not be necessary
- not part of perceptual source separation!

Integrate separation with the application? e.g. speech recognition:
- separation route: mix -> identify target energy -> t-f masking + resynthesis -> ASR (speech models) -> words
- description route: mix -> identify speech models -> find best words model (source knowledge) -> words
- output = abstract description of the signal
Evaluation

How to measure separation performance?
- depends what you are trying to do

SNR?
- energy (and distortions) are not created equal
- different nonlinear components [Vincent et al. '06]

Intelligibility?
- rare for nonlinear processing to improve intelligibility
- listening tests are expensive

ASR performance?
- optimum separate-then-recognize is too simplistic; ASR needs to accommodate separation

[Figure: net effect vs. aggressiveness of processing: reduced interference traded against increased artefacts and transmission errors]