VQ Source Models: Perceptual & Phase Issues


VQ Source Models: Perceptual & Phase Issues

Dan Ellis & Ron Weiss
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
{dpwe,ronw}@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Source Models for Separation
2. VQ with Perceptual Weighting
3. Phase and Resynthesis
4. Conclusions

VQ Source Models - Ellis & Weiss - 2006-05-16 - 1/12

Single-Channel Scene Analysis

How to separate overlapping sounds?
[Figure: spectrograms (freq / kHz vs. time / s) of a mixture and candidate decompositions]
- underconstrained: infinitely many decompositions
- time-frequency overlaps cause obliteration
- no obvious segmentation of sources (?)

Scene Analysis as Inference

Ideal separation is rarely possible
- i.e. no projection can guarantee to remove overlaps

Overlaps -> ambiguity
- scene analysis = find the most reasonable explanation

Ambiguity can be expressed probabilistically
- i.e. posteriors of sources {S_i} given observations X:
  P({S_i} | X) ∝ P(X | {S_i}) · P({S_i})
  (combination physics × source models)

Better source models -> better inference
- learn from examples?

Vector-Quantized (VQ) Source Models

Constraint of the source can be captured explicitly in a codebook (dictionary):
  x(t) ≈ c_i(t),  where i(t) ∈ 1...N
- defines the subspace occupied by the source
- codebook minimizes distortion (MSE) by k-means clustering
[Figure: VQ8 codebook, linear distortion measure (greedy ordering); level / dB vs. codebook index]
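The codebook training described here (k-means under an MSE distortion) can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the authors' code; the function name and the toy "spectra" are invented for the example.

```python
import numpy as np

def train_vq_codebook(frames, n_codewords=8, n_iter=50, seed=0):
    """Fit a VQ codebook to spectral frames (one frame per row) with
    plain k-means, minimizing mean-squared error."""
    rng = np.random.default_rng(seed)
    # initialize codewords from randomly chosen training frames
    codebook = frames[rng.choice(len(frames), n_codewords, replace=False)].copy()
    idx = np.zeros(len(frames), dtype=int)
    for _ in range(n_iter):
        # assign each frame to its nearest codeword (MSE distortion)
        dist = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        idx = dist.argmin(axis=1)
        # move each codeword to the mean of its assigned frames
        for k in range(n_codewords):
            if np.any(idx == k):
                codebook[k] = frames[idx == k].mean(axis=0)
    return codebook, idx

# toy "log-magnitude spectra": two well-separated clusters of 40-bin frames
rng = np.random.default_rng(1)
frames = np.vstack([rng.normal(0.0, 0.1, (50, 40)),
                    rng.normal(5.0, 0.1, (50, 40))])
codebook, idx = train_vq_codebook(frames, n_codewords=2)
```

In practice the frames would be (mel) log-magnitude STFT columns rather than random vectors, but the fitting loop is the same.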

Simple Source Separation

Given models for the sources, find the best (most likely) states for the spectra:
  p(x | i1, i2) = N(x; c_i1 + c_i2, Σ)   (combination model)
  {i1(t), i2(t)} = argmax_{i1,i2} p(x(t) | i1, i2)   (inference of source state)
- can include sequential constraints...
- different domains for combining c and defining Σ

E.g. stationary noise:
[Figure: mel spectrograms of the original speech, speech in speech-shaped noise (mel magSNR = 2.41 dB), and the VQ inferred states (mel magSNR = 3.6 dB)]
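The per-frame inference above reduces to an exhaustive search over codeword pairs (i1, i2) scored under the Gaussian combination model. A minimal sketch, with invented names and tiny toy codebooks:

```python
import numpy as np

def best_state_pair(x, cb1, cb2, sigma=1.0):
    """Exhaustive search over codeword pairs: the mixed frame x is
    modelled as c_i1 + c_i2 plus spherical Gaussian noise."""
    # sums of every codeword pair: shape (N1, N2, n_bins)
    combos = cb1[:, None, :] + cb2[None, :, :]
    # log-likelihood of x under each pair (up to an additive constant)
    loglik = -((x - combos) ** 2).sum(axis=2) / (2.0 * sigma ** 2)
    i1, i2 = np.unravel_index(loglik.argmax(), loglik.shape)
    return int(i1), int(i2)

# toy codebooks; the frame is exactly codeword 1 of source A plus
# codeword 0 of source B, so the search should recover (1, 0)
cbA = np.array([[0.0, 0.0], [3.0, 1.0], [1.0, 4.0]])
cbB = np.array([[2.0, 2.0], [0.0, 5.0]])
x = cbA[1] + cbB[0]
i1, i2 = best_state_pair(x, cbA, cbB)
```

The search is O(N1·N2) per frame, which is what makes large codebooks expensive in this factorial setting.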

VQ Magnitude Fidelity (dB SNR)

Two (main) variables:
- number of codewords (codebook size)
- amount of training data

Measure average accuracy (distortion):
[Figure: reconstruction SNR (about 5.5-10 dB) vs. codebook size, one curve per training-set size]
- main effect is of codebook size
- larger codebooks need/allow more data
- (large absolute distortion values)

Distortion Metric

Standard MSE gives equal weight to every channel
- excessive emphasis on high frequencies

Try e.g. the Mel spectrum
- approx. log spacing of bins
[Figure: the same utterance on linear-frequency (freq / kHz) and Mel-frequency (freq / Mel band) axes]

Little effect (?):
[Figure: VQ8 codebooks under the linear distortion measure vs. the Mel/cube-root distance; level / dB vs. codebook index]
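A Mel/cube-root distance of the kind compared here can be sketched by pooling linear-frequency bins into equal-width mel bands and compressing amplitudes before differencing. This is a simplification (crude rectangular band pooling, no triangular filters), and all names are invented:

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy's formula for the mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_cuberoot_distance(mag1, mag2, sr=16000, n_mels=20):
    """Distortion between two magnitude spectra after pooling the
    linear-frequency bins into equal-width mel bands and applying
    cube-root amplitude compression."""
    freqs = np.linspace(0.0, sr / 2.0, len(mag1))
    band = np.floor(hz_to_mel(freqs) / hz_to_mel(sr / 2.0) * n_mels)
    band = np.minimum(band, n_mels - 1).astype(int)

    def pool(mag):
        # mean magnitude per mel band, then cube-root compression
        return np.array([mag[band == b].mean() for b in range(n_mels)]) ** (1.0 / 3.0)

    diff = pool(mag1) - pool(mag2)
    return float((diff ** 2).mean())

# the same-sized error costs less at high frequencies, because many
# linear bins share the few top mel bands
mag = np.ones(257)
high_err = mag.copy(); high_err[200:] *= 1.5   # perturb high-frequency bins
low_err = mag.copy();  low_err[:57] *= 1.5     # perturb as many low bins
```

Under plain per-bin MSE both perturbations would score identically; the mel pooling is what de-emphasizes the high-frequency error.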

Resynthesis Phase

Codewords quantize spectrum magnitude
- phase has an arbitrary offset due to the STFT grid

Resynthesis (ISTFT) requires phase info
- use the mixture phase? no good for filling-in

Spectral peaks indicate a common instantaneous frequency (∂φ/∂t)
- can quantize it and accumulate it in resynthesis
- .. like the phase vocoder
[Figure: magnitudes and instantaneous frequency (Hz) vs. freq / Hz for codeword 924]
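The "quantize and accumulate" step works like phase-vocoder synthesis: given a (quantized) per-bin instantaneous frequency, integrate the corresponding phase increment across frames. A minimal sketch with invented names:

```python
import numpy as np

def accumulate_phase(inst_freq_hz, n_frames, hop, sr):
    """Phase-vocoder-style resynthesis phase: integrate a per-bin
    instantaneous frequency (Hz) across STFT frames."""
    # phase advance per hop for each bin: 2*pi*f*hop/sr
    dphi = 2.0 * np.pi * inst_freq_hz * hop / sr
    # cumulative phase, one row per frame, starting from zero offset
    return np.outer(np.arange(n_frames), dphi)

# a bin carrying a 125 Hz component at sr=16 kHz, hop=64 samples,
# advances by exactly pi radians per frame
phases = accumulate_phase(np.array([125.0]), 3, hop=64, sr=16000)
```

The arbitrary starting offset is exactly the quantity the codeword cannot store, which is why only the derivative ∂φ/∂t is quantized.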

Resynthesis Phase (2)

Can also improve the phase iteratively:
  repeat:
    X^(1)(t,f) = X̂(t,f) · exp{ j φ^(1)(t,f) }
    x^(1)(t) = ISTFT{ X^(1)(t,f) }
    φ^(2)(t,f) = ∠( STFT{ x^(1)(t) } )
  goal: |X^(n)(t,f)| = X̂(t,f)

Visible benefit:
[Figure: reconstruction SNR (dB) vs. iteration number for iterative phase estimation starting from quantized and from random phase, against a magnitude-quantization-only baseline; marked levels: 7.6 dB, 4.7 dB, 3.4 dB]
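This iteration is the classic magnitude-consistency loop in the style of Griffin & Lim: impose the target magnitude, resynthesize, and retake the phase of the STFT of the result. A self-contained NumPy sketch (the STFT parameters and function names are illustrative, not the authors' settings):

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(X, n_fft=256, hop=64):
    win = np.hanning(n_fft)
    n = hop * (len(X) - 1) + n_fft
    x, norm = np.zeros(n), np.zeros(n)
    for t, spec in enumerate(X):
        # weighted overlap-add of each inverse-transformed frame
        x[t * hop:t * hop + n_fft] += win * np.fft.irfft(spec, n_fft)
        norm[t * hop:t * hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-8)

def iterative_phase(mag, n_iter=50, n_fft=256, hop=64, seed=0):
    """Impose magnitude, resynthesize, retake the STFT phase; repeat."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        x = istft(mag * phase, n_fft, hop)                   # x = ISTFT{ |X| e^{j phi} }
        phase = np.exp(1j * np.angle(stft(x, n_fft, hop)))   # phi <- angle of STFT{x}
    return istft(mag * phase, n_fft, hop)

# the magnitude of a pure tone is exactly consistent, so the loop
# should drive the reconstruction toward matching it
t = np.arange(4096)
x_ref = np.sin(2 * np.pi * 0.05 * t)
mag = np.abs(stft(x_ref))
x_hat = iterative_phase(mag)
```

Each iteration provably does not increase the distance between the target magnitude and the magnitude of the resynthesized signal, which is the monotone improvement the SNR-vs-iteration figure shows.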

Evaluating Model Quality

Low distortion is not really the goal; the models are there to constrain source separation
- fit source spectra, but reject non-source signals

Include sequential constraints
- e.g. a transition matrix over codewords
- .. or a smaller HMM with distributions over the codebook

The best way to evaluate is via a task
- e.g. separating speech from noise
[Figure: spectrograms of 0 dB SNR speech in speech-shaped noise (magSNR = 1.8 dB), direct VQ (magSNR = -0.6 dB), and HMM-smoothed VQ (magSNR = -2.1 dB)]
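The "HMM-smoothed VQ" variant replaces frame-independent argmax with Viterbi decoding under a codeword transition matrix. A minimal sketch; the toy likelihoods and sticky transition matrix are invented to show the effect:

```python
import numpy as np

def viterbi_codewords(loglik, logtrans):
    """Viterbi decoding of codeword indices: frame log-likelihoods
    (T x N) smoothed by a codeword transition matrix (N x N)."""
    T, N = loglik.shape
    delta = loglik[0].copy()               # best score ending in each state
    back = np.zeros((T, N), dtype=int)     # backpointers
    for t in range(1, T):
        scores = delta[:, None] + logtrans # scores[i, j]: from state i to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + loglik[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# frame-wise argmax flips to state 1 for one glitchy frame, but sticky
# self-transitions keep the smoothed path in state 0 throughout
loglik = np.log(np.array([[0.90, 0.10],
                          [0.90, 0.10],
                          [0.45, 0.55],   # the glitch
                          [0.90, 0.10]]))
logtrans = np.log(np.array([[0.95, 0.05],
                            [0.05, 0.95]]))
path = viterbi_codewords(loglik, logtrans)
framewise = loglik.argmax(axis=1).tolist()
```

Whether such smoothing helps the magnitude-SNR numbers is exactly the kind of question the task-based evaluation on this slide addresses.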

Future Directions

Factorized codebooks
- codebooks become too large due to combinatorics
- separate codebooks for type, formants, excitation?
[Figure: spectrum decomposed into low, mid, and high components; level / dB vs. freq / Hz]

Model adaptation
- many speaker-dependent models, or...
- a single speaker-adapted model, fit to each speaker

Using uncertainty
- enhancing noisy speech for listeners: use special tokens to preserve uncertainty

Summary

Source models permit separation of underconstrained mixtures
- or at least inference of the source state

Explicit codebooks need to be large
- .. and chosen to optimize perceptual quality

Resynthesis phase can be quantized
- .. using the phase-vocoder derivative
- .. iterative re-estimation helps more

Extra Slides

Other Uses for Source Models

Projecting into the model's space:
- Restoration / Extension: inferring missing parts
- Generation / Copying: adaptation + fitting

Example 2: Mixed Speech Recognition

Cooke & Lee's Speech Separation Challenge
- short, grammatically-constrained utterances:
  <command:4><color:4><preposition:4><letter:25><number:10><adverb:4>
  e.g. "bin white at M 5 soon" (t5_bwam5s_m5_bbilzp_6p1.wav)

IBM's "superhuman" recognizer (Kristjansson et al., Interspeech '06):
- model individual speakers (512-mixture GMM)
- infer speakers and gain
- reconstruct speech
- recognize as normal...
- grammar constraints are a big help
[Figure: WER (%) for (b) same-gender and (c) different-gender mixtures, from 6 dB down to -9 dB SNR, comparing the SDL recognizer, no dynamics, acoustic dynamics, grammar dynamics, and human listeners]

Model-Based Separation

Central idea: employ strong learned constraints to disambiguate possible sources:
  {S_i} = argmax_{S_i} P(X | {S_i})
(Varga & Moore '90; Roweis '03)

e.g. fitmaxvq, a speech-trained vector quantizer
- training: isolated speech (TIMIT) to fit 512 codewords, and separate isolated noise (NOISEX) to fit 32 codewords; testing on new signals at 0 dB SNR
- results: denoising of the mixed spectrum, separated via a T-F mask (again)
[Figure: spectrograms (vs. time) of the mixture, inferred states, and masked reconstruction, from Roweis '03]

Separation or Description?

Are isolated waveforms required?
- clearly sufficient, but may not be necessary
- not part of perceptual source separation!

Integrate separation with the application?
- e.g. speech recognition
[Diagram: a "separation" path (mix -> identify target energy -> t-f masking + resynthesis -> ASR with speech models -> words) vs. a "description" path (mix -> identify speech models -> find best words model, using source knowledge -> words); output = an abstract description of the signal]

Evaluation

How to measure separation performance?
- depends what you are trying to do

SNR?
- energy (and distortions) are not created equal
- different nonlinear components [Vincent et al. '06]

Intelligibility?
- rare for nonlinear processing to improve intelligibility
- listening tests are expensive

ASR performance?
- separate-then-recognize is too simplistic; ASR needs to accommodate separation
[Figure: net effect vs. aggressiveness of processing; reduced interference trades against increased artefacts, with an optimum in between]