VQ Source Models: Perceptual & Phase Issues


VQ Source Models: Perceptual & Phase Issues Dan Ellis & Ron Weiss Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA {dpwe,ronw}@ee.

1 VQ Source Models: Perceptual & Phase Issues
Dan Ellis & Ron Weiss
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
  1. Source Models for Separation
  2. VQ with Perceptual Weighting
  3. Phase and Resynthesis
  4. Conclusions
VQ Source Models - Ellis & Weiss /12


2 Single-Channel Scene Analysis
How to separate overlapping sounds?
  o underconstrained: infinitely many decompositions
  o time-frequency overlaps cause obliteration
  o no obvious segmentation of sources (?)
[Figure: spectrograms, freq / kHz vs. time / s, level / dB]

3 Scene Analysis as Inference
Ideal separation is rarely possible
  o i.e. no projection can guarantee to remove overlaps
Overlaps -> Ambiguity
  o scene analysis = find most reasonable explanation
Ambiguity can be expressed probabilistically
  o i.e. posteriors of sources {S_i} given observations X:
    $P(\{S_i\} \mid X) \propto P(X \mid \{S_i\}) \, P(\{S_i\})$
    where $P(X \mid \{S_i\})$ is the combination physics and $P(\{S_i\})$ the source models
Better source models -> better inference
  o .. learn from examples?


4 Vector-Quantized (VQ) Source Models
Constraint of source can be captured explicitly in a codebook (dictionary):
  o $x(t) \approx c_{i(t)}$ where $i(t) \in \{1, \ldots, N\}$
  o defines the subspace occupied by source
Codebook minimizes distortion (MSE) by k-means clustering
[Figure: VQ8 Codebook - Linear distortion measure (greedy ordering); freq / mel band vs. codebook index, level / dB]
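As a rough illustration of the k-means codebook training mentioned on this slide, here is a minimal NumPy sketch. It assumes each training frame is a row of a magnitude-spectrum matrix; the function names `quantize` and `train_codebook`, the random initialisation, and the fixed iteration count are hypothetical choices, not details from the slides.

```python
import numpy as np

def quantize(frames, codebook):
    """Return i(t): index of the nearest codeword (squared error) per frame."""
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def train_codebook(frames, n_codewords, n_iter=20, seed=0):
    """Fit a VQ codebook by plain k-means, i.e. minimising MSE distortion."""
    rng = np.random.default_rng(seed)
    # Assumption: initialise codewords from randomly chosen training frames.
    start = rng.choice(len(frames), n_codewords, replace=False)
    codebook = frames[start].copy()
    for _ in range(n_iter):
        idx = quantize(frames, codebook)
        for k in range(n_codewords):
            members = frames[idx == k]
            if len(members):  # leave empty clusters unchanged
                codebook[k] = members.mean(axis=0)
    return codebook

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    demo = np.abs(rng.normal(size=(500, 20)))   # stand-in for spectral frames
    cb = train_codebook(demo, n_codewords=8)
    approx = cb[quantize(demo, cb)]             # x(t) ~ c_{i(t)}
    print("mean squared distortion:", np.mean((demo - approx) ** 2))
```

Each frame is then represented by a single index, which is what makes the codebook an explicit constraint on the source.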

5 Simple Source Separation
Given models for sources, find best (most likely) states for spectra:
  o combination model: $p(x \mid i_1, i_2) = \mathcal{N}(x;\, c_{i_1} + c_{i_2},\, \sigma)$
  o inference of source state: $\{i_1(t), i_2(t)\} = \arg\max_{i_1, i_2} p(x(t) \mid i_1, i_2)$
  o can include sequential constraints...
  o different domains for combining c and defining σ
E.g. stationary noise:
[Figure: three panels, freq / mel bin vs. time / s: Original speech; In speech-shaped noise (mel magsnr = 2.41 dB); VQ inferred states (mel magsnr = 3.6 dB)]
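The state inference on this slide can be sketched as an exhaustive search over codeword pairs. This is only an illustration under two assumptions that the slide leaves open: the codewords combine by addition in whatever domain they are stored in, and σ is the same for every dimension, so maximising the Gaussian likelihood reduces to minimising squared error. The name `infer_states` is hypothetical.

```python
import numpy as np

def infer_states(x, cb1, cb2):
    """For each frame x(t), find (i1, i2) maximising N(x; c_i1 + c_i2, sigma).

    x:   (n_frames, dim) observed spectra
    cb1: (n1, dim) codebook of source 1
    cb2: (n2, dim) codebook of source 2
    With a shared isotropic sigma, the most likely pair is the one whose
    summed codewords have the smallest squared error to x(t).
    """
    combos = cb1[:, None, :] + cb2[None, :, :]      # combos[i1, i2] = c_i1 + c_i2
    n1, n2, dim = combos.shape
    flat = combos.reshape(n1 * n2, dim)
    err = ((x[:, None, :] - flat[None, :, :]) ** 2).sum(axis=2)
    best = err.argmin(axis=1)
    return np.unravel_index(best, (n1, n2))         # (i1 array, i2 array)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech_cb = rng.uniform(0, 1, (5, 8))
    noise_cb = rng.uniform(0, 1, (3, 8))
    mix = speech_cb[[0, 4, 2]] + noise_cb[[1, 0, 2]]
    print(infer_states(mix, speech_cb, noise_cb))
```

The search cost grows with the product of the two codebook sizes, which is one practical reason large codebooks become expensive in this setting.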

6 Codebook Size
Two (main) variables:
  o number of codewords
  o amount of training data
Measure average accuracy (distortion):
  o main effect of codebook size
  o larger codebooks need/allow more data
  o (large absolute distortion values)
[Figure: VQ magnitude fidelity (dB SNR) vs. codebook size, one curve per amount of training data (number of training frames)]
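The slides do not define how magnitude fidelity in dB SNR is computed. One plausible reading, shown here purely as an assumption, is the ratio of reference magnitude energy to the energy of the quantization error; `mag_snr_db` is a hypothetical name.

```python
import numpy as np

def mag_snr_db(ref_mag, est_mag):
    """SNR in dB between a reference magnitude spectrogram and an estimate.

    Assumed definition: 10 * log10( sum(ref^2) / sum((ref - est)^2) ).
    """
    err = ref_mag - est_mag
    return 10.0 * np.log10(np.sum(ref_mag ** 2) / np.sum(err ** 2))

if __name__ == "__main__":
    ref = np.ones((10, 4))
    est = 0.9 * ref            # every cell is off by 10% of its magnitude
    print(mag_snr_db(ref, est))
```

Sweeping this measure over codebook sizes and training-set sizes would give curves of the kind the slide describes.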


7 Distortion Metric
Standard MSE gives equal weight by channel
  o excessive emphasis on high frequencies
Try e.g. Mel spectrum
  o approx. log spacing of bins
Little effect (?):
[Figure: spectrograms on Linear Frequency (freq / kHz) and Mel Frequency (freq / mel band) axes vs. time / s; codebook panels "VQ8 Codebook - Linear distortion measure" and "VQ8 Codebook - Mel/cube root distance", freq / Mel band vs. codebook index, level / dB]
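To make the idea of a Mel-domain distortion concrete, the sketch below pools linear-frequency magnitudes into triangular Mel bands and compares cube-root-compressed band values. The filter shape, the common 2595 * log10(1 + f/700) Mel formula, and the exact way the cube root enters are assumptions; the slide only names a "Mel/cube root distance".

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_bins, sr, n_mels):
    """Triangular filters, shape (n_mels, n_bins), centres equally spaced in Mel."""
    freqs = np.linspace(0.0, sr / 2.0, n_bins)
    top = float(hz_to_mel(sr / 2.0))
    edges = mel_to_hz(np.linspace(0.0, top, n_mels + 2))
    fb = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mel_cuberoot_distance(mag_a, mag_b, fb):
    """Squared error between cube-root-compressed Mel band magnitudes."""
    a = np.cbrt(fb @ mag_a)
    b = np.cbrt(fb @ mag_b)
    return float(np.sum((a - b) ** 2))

if __name__ == "__main__":
    fb = mel_filterbank(n_bins=257, sr=16000, n_mels=40)
    flat = np.ones(257)
    bumped = flat.copy()
    bumped[10] += 1.0
    print(mel_cuberoot_distance(flat, bumped, fb))
```

Because the high-frequency bands pool many linear bins, an error in a single high bin counts for less here than under a plain per-bin MSE.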

8 Resynthesis Phase
Codewords quantize spectrum magnitude
  o phase has arbitrary offset due to STFT grid
Resynthesis (ISTFT) requires phase info
  o use mixture phase? no good for filling-in
Spectral peaks indicate common instantaneous frequency ($\partial\phi / \partial t$)
  o can quantize and cumulate in resynthesis
  o .. like the phase vocoder
[Figure: "Code 924: Magnitudes (21 trn frms)", level / dB vs. freq / Hz; "Code 924: Instantaneous Frequency", Inst Freq / Hz vs. freq / Hz]
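The phase-vocoder style accumulation mentioned here can be sketched as follows: if each codeword stores an instantaneous frequency per bin, the resynthesis phase is built by advancing the previous frame's phase by that frequency over one hop. The function name, the array layout, and the convention of using the previous frame's frequency for the advance are assumptions.

```python
import numpy as np

def accumulate_phase(inst_freq_hz, hop, sr, phase0=None):
    """Build STFT phase by cumulating instantaneous frequency across frames.

    inst_freq_hz: (n_frames, n_bins) instantaneous frequency in Hz per cell
    hop:          hop size in samples
    sr:           sample rate in Hz
    Frame t gets the phase of frame t-1 plus 2*pi*f*hop/sr, with f taken
    from frame t-1. Phases are left unwrapped.
    """
    n_frames, n_bins = inst_freq_hz.shape
    phase = np.zeros((n_frames, n_bins))
    if phase0 is not None:
        phase[0] = phase0
    advance = 2.0 * np.pi * inst_freq_hz * hop / sr
    phase[1:] = phase[0] + np.cumsum(advance[:-1], axis=0)
    return phase

if __name__ == "__main__":
    inst_freq = np.full((5, 3), 440.0)   # every cell tracks a 440 Hz component
    print(accumulate_phase(inst_freq, hop=64, sr=8000)[:, 0])
```

Only the phase differences between frames matter for this scheme, which is why the arbitrary offset from the STFT grid stops being a problem.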

9 Resynthesis Phase (2)
Can also improve phase iteratively
  o repeat:
    $X^{(1)}(t,f) = |\hat{X}(t,f)| \exp\{ j \phi^{(1)}(t,f) \}$
    $x^{(1)}(t) = \mathrm{istft}\{ X^{(1)}(t,f) \}$
    $\phi^{(2)}(t,f) = \angle\left( \mathrm{stft}\{ x^{(1)}(t) \} \right)$
  o goal: $|X^{(n)}(t,f)| = |\hat{X}(t,f)|$
Visible benefit:
[Figure: Iterative phase estimation; SNR of reconstruction (dB) vs. iteration number; curves: Magnitude quantization only, Starting from quantized phase, Starting from random phase; annotated levels include 4.7 dB and 3.4 dB]
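The iteration on this slide can be sketched with a small self-contained STFT/ISTFT pair. The frame length, hop, Hann window, and the squared-window normalisation in the overlap-add are assumptions made so the example runs on its own; they are not taken from the slides.

```python
import numpy as np

N_FFT, HOP = 256, 64

def stft(x):
    """Hann-windowed STFT, shape (n_frames, N_FFT // 2 + 1)."""
    win = np.hanning(N_FFT)
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X):
    """Windowed overlap-add inverse, normalised by the summed squared window."""
    win = np.hanning(N_FFT)
    frames = np.fft.irfft(X, n=N_FFT, axis=1)
    length = N_FFT + HOP * (len(X) - 1)
    x, norm = np.zeros(length), np.zeros(length)
    for i, frame in enumerate(frames):
        x[i * HOP:i * HOP + N_FFT] += frame * win
        norm[i * HOP:i * HOP + N_FFT] += win ** 2
    return x / np.maximum(norm, 1e-8)

def inconsistency(mag, phase):
    """Mean squared gap between target magnitude and re-analysed magnitude."""
    resynth = istft(mag * np.exp(1j * phase))
    return np.mean((np.abs(stft(resynth)) - mag) ** 2)

def iterate_phase(mag, phase, n_iter=30):
    """Repeat: impose target magnitude, resynthesise, re-analyse, keep new phase."""
    for _ in range(n_iter):
        x = istft(mag * np.exp(1j * phase))
        phase = np.angle(stft(x))
    return phase

if __name__ == "__main__":
    t = np.arange(4096) / 8000.0
    sig = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1250 * t)
    mag = np.abs(stft(sig))
    phase0 = np.random.default_rng(0).uniform(-np.pi, np.pi, mag.shape)
    print("before:", inconsistency(mag, phase0))
    print("after: ", inconsistency(mag, iterate_phase(mag, phase0)))
```

Starting `iterate_phase` from an accumulated (quantized) phase rather than a random one corresponds to the "starting from quantized phase" condition in the slide's plot.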

10 Evaluating Model Quality
Low distortion is not really the goal; models are to constrain source separation
  o fit source spectra but reject non-source signals
Include sequential constraints
  o e.g. transition matrix for codewords
  o .. or smaller HMM with distributions over codebook
Best way to evaluate is via a task
  o e.g. separating speech from noise
[Figure: three panels, freq / bins vs. time / s, level / dB: Speech-Shaped Noise (magsnr = 1.8 dB); Direct VQ (magsnr = -.6 dB); HMM-smoothed VQ (magsnr = -2.1 dB)]
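A transition matrix over codewords turns framewise quantization into a sequence decoding problem. The sketch below is a generic Viterbi decoder over per-frame codeword log-likelihoods; it illustrates the kind of sequential constraint the slide refers to and is not a reconstruction of the authors' HMM. All names are hypothetical.

```python
import numpy as np

def viterbi(log_obs, log_trans, log_prior):
    """Most likely codeword sequence under a first-order transition model.

    log_obs:   (n_frames, n_states) log-likelihood of each frame per codeword
    log_trans: (n_states, n_states), entry [i, j] = log P(j at t+1 | i at t)
    log_prior: (n_states,) log-probability of the initial codeword
    """
    n_frames, n_states = log_obs.shape
    score = log_prior + log_obs[0]
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_trans        # cand[i, j]: arrive in j from i
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_obs[t]
    path = np.zeros(n_frames, dtype=int)
    path[-1] = score.argmax()
    for t in range(n_frames - 1, 0, -1):         # trace the best path backwards
        path[t - 1] = back[t, path[t]]
    return path

if __name__ == "__main__":
    obs = np.log(np.array([[0.7, 0.3], [0.7, 0.3], [0.4, 0.6], [0.7, 0.3]]))
    trans = np.log(np.array([[0.95, 0.05], [0.05, 0.95]]))
    print("framewise:", obs.argmax(axis=1))
    print("viterbi:  ", viterbi(obs, trans, np.log(np.array([0.5, 0.5]))))
```

With "sticky" transitions, a single frame that weakly prefers another codeword is overruled by its neighbours, which is the smoothing effect a transition model adds on top of direct VQ.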

11 Future Directions
Factorized codebooks
  o codebooks too large due to combinatorics
  o separate codebooks for type, formants, excitation?
Model adaptation
  o many speaker-dependent models, or...
  o single speaker-adapted model, fit to each speaker
Using uncertainty
  o enhancing noisy speech for listeners: use special tokens to preserve uncertainty
[Figure: Spectrum decomposed into low, mid, high components; level / dB vs. freq / Hz]

12 Summary
Source models permit separation of underconstrained mixtures
  o or at least inference of source state
Explicit codebooks need to be large
  o .. and chosen to optimize perceptual quality
Resynthesis phase can be quantized
  o .. using phase vocoder derivative
  o .. iterative re-estimation helps more

13 Extra Slides

14 Other Uses for Source Models
Projecting into the model's space:
  o Restoration / Extension: inferring missing parts
  o Generation / Copying: adaptation + fitting

15 Example 2: Mixed Speech Recog.
Cooke & Lee's Speech Separation Challenge
  o short, grammatically-constrained utterances:
    <command:4><color:4><preposition:4><letter:25><number:10><adverb:4>
  o e.g. "bin white at M 5 soon" (t5_bwam5s_m5_bbilzp_6p1.wav)
IBM's superhuman recognizer (Kristjansson et al., Interspeech '06):
  o Model individual speakers (512 mix GMM)
  o Infer speakers and gain
  o Reconstruct speech
  o Recognize as normal...
Grammar constraints a big help
[Figure: wer / % at 6 dB, 3 dB, 0 dB, -3 dB, -6 dB, -9 dB; panels b) Same Gender and c) Different Gender; systems: SDL Recognizer, No dynamics, Acoustic dyn., Grammar dyn., Human]

16 Model-Based Separation
Central idea: Employ strong learned constraints to disambiguate possible sources
  o Varga & Moore '90, Roweis '03...
  o $\{S_i\} = \arg\max_{\{S_i\}} P(X \mid \{S_i\})$
e.g. fit speech-trained Vector-Quantizer to mixed spectrum:
  o separate via T-F mask (again)
Results: Denoising (from Roweis '03)
  o Training: 3sec of isolated speech (TIMIT) to fit 512 codewords, and 1sec of isolated noise (NOISEX) to fit 32 codewords; testing on new signals at dB SNR
[Figure: spectrogram panels vs. time, including a MAXVQ result]
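Once a codeword has been inferred for each source in every frame, a time-frequency mask can be derived by comparing the two model spectra cell by cell. The slide does not say which mask is used, so both variants below (a hard binary mask and a ratio-style soft mask) are assumptions, as are the function names.

```python
import numpy as np

def binary_mask(speech_est, noise_est):
    """1.0 where the speech model dominates the cell, 0.0 elsewhere."""
    return (speech_est > noise_est).astype(float)

def soft_mask(speech_est, noise_est, eps=1e-12):
    """Fraction of each cell attributed to speech (ratio of model magnitudes)."""
    return speech_est / (speech_est + noise_est + eps)

def apply_mask(mix_mag, mask):
    """Scale the mixture magnitude cell by cell; mixture phase is kept as-is."""
    return mix_mag * mask

if __name__ == "__main__":
    speech = np.array([[4.0, 1.0], [0.5, 2.0]])   # inferred speech codewords per frame
    noise = np.array([[1.0, 3.0], [0.5, 0.5]])    # inferred noise codewords per frame
    mix = speech + noise
    print(apply_mask(mix, binary_mask(speech, noise)))
    print(apply_mask(mix, soft_mask(speech, noise)))
```

Masking reuses the mixture's own magnitudes and phase, so it cannot fill in cells where the target was obliterated; that is the limitation the resynthesis-phase slides address.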


17 Separation or Description?
Are isolated waveforms required?
  o clearly sufficient, but may not be necessary
  o not part of perceptual source separation!
Integrate separation with application?
  o e.g. speech recognition
  o mix -> separation (identify target energy, t-f masking + resynthesis) -> ASR (speech models) -> words
  o vs. mix -> identify (speech models) -> find best words model (source knowledge) -> words
  o words output = abstract description of signal

18 Evaluation
How to measure separation performance?
  o depends what you are trying to do
SNR?
  o energy (and distortions) are not created equal
  o different nonlinear components [Vincent et al. '06]
Intelligibility?
  o rare for nonlinear processing to improve intelligibility
  o listening tests expensive
ASR performance?
  o separate-then-recognize too simplistic; ASR needs to accommodate separation
[Figure: Transmission errors vs. Aggressiveness of processing; curves: Increased artefacts, Reduced interference, Net effect; optimum marked]
