Signal Analysis Using Autoregressive Models of Amplitude Modulation. Sriram Ganapathy


Signal Analysis Using Autoregressive Models of Amplitude Modulation Sriram Ganapathy Advisor - Hynek Hermansky Johns Hopkins University 11-18-2011

Overview: Introduction; AR Model of Hilbert Envelopes; FDLP and its Properties; Applications; Summary.

Introduction
Sub-band speech and audio signals can be viewed as the product of a smooth modulation (AM) with a fine carrier (FM):
$x(t) = m(t)\cos\{\omega_o t + \phi(t)\}$
This AM-FM decomposition is non-unique.

Desired Properties of AM
Linearity: $\alpha x(t) \rightarrow \alpha m(t)$
Continuity: $x(t) + \delta x(t) \rightarrow m(t) + \delta m(t)$
Harmonicity: $\cos(\omega_o t) \rightarrow 1$

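Desired Properties of AM
These properties are uniquely satisfied by the analytic signal $x_a(t) = x(t) + j\mathcal{H}\{x(t)\}$, with $m(t) = |x_a(t)|$ and $\omega(t) = \frac{d}{dt}\arg x_a(t) = \omega_o + \dot{\phi}(t)$.
$\mathcal{H}$: Hilbert transform; $x_a(t)$: analytic signal; $|x_a(t)|^2$: Hilbert envelope.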

Desired Properties of AM
However, the Hilbert transform filter is infinitely long and can cause artifacts for finite-length signals:
$\mathcal{H}\{x(t)\} = \frac{1}{\pi}\int_{-\infty}^{\infty}\frac{x(\tau)}{t-\tau}\,d\tau$
Hence the need to model the Hilbert envelope without explicitly computing the Hilbert transform.

AR Model of Hilbert Envelopes
Signal $x[n]$, $n = 0, \ldots, N-1$, with zero mean in the time and frequency domains.
Discrete-time analytic spectrum: $X_a[k] = 2X[k]$ for $k < N/2$, and $X_a[k] = 0$ for $k \ge N/2$.
(Figure: $X[k]$ and $X_a[k]$.)
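
As an illustration of the discrete-time analytic spectrum described above, the following minimal sketch (not code from the talk) builds the analytic signal by zeroing the upper half of the DFT and takes its squared magnitude as the Hilbert envelope. The function names and the test signal are assumptions made for the example. The DC and Nyquist bins are kept unscaled, a common convention that goes slightly beyond the two-case definition on the slide.

```python
import numpy as np

def analytic_signal(x):
    """Discrete-time analytic signal: double positive-frequency bins, zero the rest."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    X = np.fft.fft(x)
    w = np.zeros(N)
    w[0] = 1.0                      # DC kept once (zero anyway for a zero-mean signal)
    if N % 2 == 0:
        w[1:N // 2] = 2.0           # positive frequencies doubled
        w[N // 2] = 1.0             # Nyquist bin kept once (assumed convention)
    else:
        w[1:(N + 1) // 2] = 2.0
    return np.fft.ifft(X * w)

def hilbert_envelope(x):
    """Squared magnitude of the analytic signal."""
    return np.abs(analytic_signal(x)) ** 2

# Hypothetical AM test signal: smooth modulation m[n] times a fine carrier.
N = 1024
n = np.arange(N)
m = 1.0 + 0.5 * np.cos(2 * np.pi * 4 * n / N)
x = m * np.cos(2 * np.pi * 128 * n / N)
env = hilbert_envelope(x)
```

For this bin-aligned test signal the envelope recovers the squared modulation `m**2`, which is the behaviour the AR model is later asked to approximate.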

AR Model of Hilbert Envelopes
Let $q[n]$ be the even-symmetrized version of $x[n]$: $q[n] = x[n]$ for $n < N$ and $q[n] = x[M-n]$ for $n \ge N$, with $M = 2N-1$.
Spectrum: $Q[k] = 2\,\mathrm{Re}\{X[k]\}$.
Discrete-time analytic spectrum: $Q_a[k] = 2Q[k]$ for $k < N$, and $Q_a[k] = 0$ for $k \ge N$.
DCT zero-padded with $N$ zeros: $y[k] = 4\,\mathrm{Re}\{X[k]\}$ for $k < N$, and $y[k] = 0$ for $k \ge N$.
Hence $Q_a[k] = \mathcal{F}\{q_a[n]\} = y[k]$.

AR Model of Hilbert Envelopes
We have shown that $Q_a[k] = \mathcal{F}\{q_a[n]\} = y[k]$: the analytic spectrum of the even-symmetrized signal equals the zero-padded DCT sequence.
A signal and its spectrum form a Fourier pair, and so do the autocorrelation and the power spectrum. By the same relation,
$\mathcal{F}\{|q_a[n]|^2\} = r_y[\tau]$
so the Hilbert envelope of the even-symmetrized signal and the autocorrelation of the DCT sequence form a Fourier pair.
Linear prediction (LP) on the autocorrelation of the DCT therefore yields an AR model of the Hilbert envelope.

LP in Time and Frequency
Duality: LP on the time signal models the power spectrum; LP on the DCT models the Hilbert envelope.
(Figure: Time / Power Spec. and DCT / Hilb. Env. pairs.)

FDLP
Linear prediction on the cosine transform of the signal: Speech → DCT → LP → FDLP Env.
(Figure: speech signal, Hilbert envelope and FDLP envelope.)
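
The DCT → LP chain above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the author's implementation: it works on the full-band signal (no sub-band windowing), uses an explicit unnormalized DCT-II matrix, solves the normal equations directly, and maps the AR model's angle range $[0, \pi)$ onto the $N$ time samples. The names `dct2`, `lp_autocorr` and `fdlp_envelope` and the test signal are hypothetical.

```python
import numpy as np

def dct2(x):
    """Unnormalized DCT-II via an explicit cosine matrix (O(N^2); sketch only)."""
    N = len(x)
    n = np.arange(N)
    k = n[:, None]
    return np.cos(np.pi * (n + 0.5) * k / N) @ x

def lp_autocorr(y, order):
    """Autocorrelation-method linear prediction: returns coefficients a and gain G."""
    r = np.array([y[:len(y) - i] @ y[i:] for i in range(order + 1)])
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[idx], r[1:])   # Toeplitz normal equations
    G = r[0] - a @ r[1:]                 # minimum residual sum of squares
    return a, G

def fdlp_envelope(x, order):
    """AR model of the temporal envelope, evaluated at N points along time."""
    N = len(x)
    a, G = lp_autocorr(dct2(x), order)
    d = np.concatenate(([1.0], -a))      # prediction-error filter [1, -a_1, ..., -a_p]
    D = np.fft.rfft(d, 2 * N)[:N]        # angles pi*n/N, n = 0..N-1 (assumed mapping)
    return G / np.abs(D) ** 2

# Hypothetical test signal: a noise burst centred at sample 300 over a weak noise floor.
rng = np.random.default_rng(0)
N = 1024
n = np.arange(N)
x = (np.exp(-((n - 300) / 20.0) ** 2) + 0.05) * rng.standard_normal(N)
env = fdlp_envelope(x, order=10)
peak = int(np.argmax(env))
```

The model order plays the role described on the slides: a small `order` gives a smooth envelope that follows only the high-energy region around the burst.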

FDLP for Speech Representation
The DCT of the signal is processed in sub-bands (DCT → LP), giving the FDLP spectrogram.
(Figure: FDLP spectrogram compared with conventional approaches; axes: time, frequency.)

FDLP versus Mel Spectrogram
(Figure: FDLP and mel spectrograms.)
Sriram Ganapathy, Samuel Thomas and H. Hermansky, "Comparison of Modulation Frequency Features for Speech Recognition", ICASSP, 2010.

Resolution of FDLP Analysis
(Figure: test signals with their FDLP and mel envelopes.)
Res. = (Critical Width)$^{-1}$

Properties of FDLP Analysis
Summarizing the gross temporal variation with a few parameters: the model order of FDLP controls the degree of smoothness, and the AR model captures the perceptually important high-energy regions of the signal.
Suppressing reverberation artifacts: reverberation is a long-term convolutive distortion, which is handled by analysis in long-term windows and narrow sub-bands.

Reverberation
When speech is corrupted with a convolutive distortion such as room reverberation (clean speech * room response = reverberant speech),
$r[n] = x[n] * h[n]$
In the long-term DFT domain, this translates to a multiplication:
$R[k] = X[k]\,H[k]$
In the $m$-th sub-band, $R_m[k] = X_m[k]\,H_m[k]$.
In narrow bands, $H_m[k]$ is approximately constant, so $R_m[k] \approx X_m[k]\,H_m$.
(Figure: room response $H[k]$ and its sub-band segment $H_m$.)
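
The convolution-to-multiplication step can be checked numerically. The sketch below uses a hypothetical exponentially decaying sequence as a stand-in for a room response (an assumption for illustration, not a measured response) and zero-pads both DFTs to the length of the reverberated signal so that linear convolution is represented exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(256)        # stand-in for clean speech
h = 0.6 ** np.arange(32)            # hypothetical decaying room response
r = np.convolve(x, h)               # r[n] = x[n] * h[n]

L = len(r)                          # long-term DFT length covering the whole signal
R = np.fft.fft(r)
X = np.fft.fft(x, L)                # zero-padded to length L
H = np.fft.fft(h, L)
max_err = np.max(np.abs(R - X * H)) # R[k] = X[k] H[k] up to rounding error
```

The identity holds only when the DFT is long enough to cover the full response, which is one reason the analysis uses long-term windows.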

Gain Normalization in FDLP
The FDLP envelope of the $m$-th band, using all-pole parameters $\{a_1, \ldots, a_p\}$, is given by
$E_m[n] = \frac{G}{\left|1 - \sum_{k=1}^{p} a_k e^{-j2\pi kn/N}\right|^2}$
When the sub-band signal is multiplied by $H_m$, the gain $G$ is modified. Normalization against convolutive distortions is achieved by reconstructing the FDLP envelope with $G = 1$.
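
A small numerical sketch of why setting $G = 1$ removes a constant sub-band gain: scaling the DCT sequence by a constant leaves the predictor coefficients unchanged and scales only $G$. The helper names, the white-noise stand-in for a sub-band DCT sequence, and the value of the gain are assumptions for illustration; the angle grid used to evaluate the envelope is also an assumed convention.

```python
import numpy as np

def lp_autocorr(y, order):
    """Autocorrelation-method linear prediction: returns coefficients a and gain G."""
    r = np.array([y[:len(y) - i] @ y[i:] for i in range(order + 1)])
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[idx], r[1:])
    return a, r[0] - a @ r[1:]

def envelope(a, G, n_points):
    """All-pole envelope G / |1 - sum_k a_k e^{-j theta k}|^2 on n_points angles in [0, pi)."""
    d = np.concatenate(([1.0], -a))
    D = np.fft.rfft(d, 2 * n_points)[:n_points]
    return G / np.abs(D) ** 2

rng = np.random.default_rng(1)
y = rng.standard_normal(512)        # stand-in for a sub-band DCT sequence
c = 0.3                             # hypothetical constant sub-band response H_m

a_clean, G_clean = lp_autocorr(y, 8)
a_rev, G_rev = lp_autocorr(c * y, 8)

env_clean = envelope(a_clean, 1.0, 256)   # gain-normalized envelopes (G = 1)
env_rev = envelope(a_rev, 1.0, 256)
```

Here `G_rev` equals `c**2 * G_clean`, while the two gain-normalized envelopes coincide.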

Gain Normalization in FDLP
(Figure: FDLP envelopes without and with gain normalization.)
S. Thomas, S. Ganapathy and H. Hermansky, "Recognition of Reverberant Speech Using FDLP", IEEE Signal Proc. Letters, 2008.

Outline of Applications
(Block diagram: Input Signal → Sub-band Decomposition → FDLP → AM and FM components, followed by Gain Norm. or Quant.)
Applications: short-term features for speaker and speech recognition; modulation features for phoneme recognition; wide-band speech and audio coding.
S. Ganapathy, S. Thomas, P. Motlicek and H. Hermansky, "Applications of Signal Analysis Using Autoregressive Models of Amplitude Modulation", IEEE WASPAA, Oct. 2009.

Short-term Features
Input → DCT → Sub-band Window → FDLP → Gain Norm. → Energy Int. → Log + DCT → Feat.
Envelopes in each band are integrated along time (25 ms windows with a shift of 10 ms), and integrated along the frequency axis to convert to the mel scale.
Sub-band energies are converted to cepstral coefficients by applying a log and a DCT along the frequency axis.
Delta and acceleration coefficients are appended to obtain 39-dimensional features, similar to conventional MFCC features.
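
The energy integration and the log + DCT step can be sketched as follows. This is a minimal illustration, assuming the envelopes are sampled at the signal rate of 8 kHz; the number of bands (20) and of cepstral coefficients (13) are assumptions, and the mel-scale integration across frequency and the delta features are omitted.

```python
import numpy as np

def integrate_envelope(env, fs, win_ms=25.0, hop_ms=10.0):
    """Sum a sub-band envelope over 25 ms windows with a 10 ms shift."""
    win = int(round(fs * win_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    n_frames = 1 + (len(env) - win) // hop
    return np.array([env[i * hop:i * hop + win].sum() for i in range(n_frames)])

def cepstra(band_energies, n_ceps=13):
    """Log followed by an unnormalized DCT-II along the frequency (band) axis."""
    B = band_energies.shape[0]
    b = np.arange(B)
    k = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * (b + 0.5) * k / B)
    return basis @ np.log(band_energies)

fs = 8000
envs = np.ones((20, fs))            # 20 hypothetical flat sub-band envelopes, 1 s each
energies = np.array([integrate_envelope(e, fs) for e in envs])
feats = cepstra(energies)           # shape: (n_ceps, n_frames)
```

With flat envelopes every frame has the same energy in every band, so only the zeroth cepstral coefficient is non-zero, which makes the example easy to verify by hand.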

Speech Recognition
TIDIGITS database (8 kHz): clean training data; test data can be clean or naturally reverberated.
HMM-GMM system: whole-word HMM models trained on clean speech; performance in terms of word error rate (WER).
Features: PLP features with cepstral mean subtraction (CMS); long-term log spectral subtraction (LTLSS) [Avendano], [Gelbart]; FDLP short-term (FDLP-S) features, 39 dim.

Speech Recognition
(Figure: bar chart comparing PLP-CMS, LTLSS and FDLP-S features on clean and reverberant test data.)
S. Thomas, S. Ganapathy and H. Hermansky, "Recognition of Reverberant Speech Using FDLP", IEEE Signal Proc. Letters, 2008.

Speaker Verification
NIST 2008 speaker recognition evaluation (SRE): has telephone speech and far-field speech.
GMM-UBM system: trained on a large set of development speakers; adapted on the enrollment data from the target speaker; nuisance attribute projection (NAP) on supervectors.
Detection cost function: $\mathrm{DCF} = 0.99\,P_{fa} + 0.1\,P_{miss}$
Features with warping [Pelecanos, 2001]: mel frequency cepstral coefficients (MFCCs); FDLP short-term (FDLP-S) features.
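
As a worked example of the detection cost function above, the sketch below computes the false-alarm and miss rates at a fixed threshold and combines them with the weights from the slide. The function name, the scores and the threshold are made up for illustration.

```python
import numpy as np

def detection_cost(target_scores, nontarget_scores, threshold):
    """DCF = 0.99 * P_fa + 0.1 * P_miss at a given decision threshold."""
    p_miss = np.mean(np.asarray(target_scores) < threshold)      # targets rejected
    p_fa = np.mean(np.asarray(nontarget_scores) >= threshold)    # impostors accepted
    return 0.99 * p_fa + 0.1 * p_miss

# Hypothetical scores: one of four targets is missed, one of four impostors is accepted.
targets = [2.0, 3.0, 0.5, 4.0]
nontargets = [-1.0, 0.0, 1.5, -2.0]
dcf = detection_cost(targets, nontargets, threshold=1.0)
```

With both error rates at 0.25, the cost is 0.99 × 0.25 + 0.1 × 0.25 = 0.2725, which shows how heavily the false-alarm term dominates.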

Speaker Verification
(Figure: bar chart comparing MFCC and FDLP-S features in the Tel., Mic. and Cross domain conditions.)
S. Ganapathy, J. Pelecanos and M. Omar, "Feature Normalization for Speaker Verification in Room Reverberation", ICASSP, 2011.

Modulation Feature Extraction
Input → DCT → Critical-band Window → FDLP → Static and Dynamic compression → DCT (200 ms) → Sub-band Feat.
Static compression is a logarithm, which reduces the huge dynamic range of the sub-band envelope.
Dynamic compression is implemented by compression loops consisting of dividers and low-pass filters [Kollmeier, 1999].
Compressed sub-band envelopes are DCT transformed to obtain modulation frequency components. 14 static and 14 dynamic modulation spectral components (0-35 Hz) in each of 17 sub-bands give a feature of 476 dimensions (2 × 14 × 17).
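
The static branch and the dimension count can be sketched as below. This is an illustration only: the envelope sampling rate (400 Hz, giving 80 samples per 200 ms segment) and the random envelopes are assumptions, and the dynamic compression loops are not implemented. The comment on modulation frequencies follows from the DCT-II basis, where coefficient $k$ of a segment of duration $T$ corresponds to $k/(2T)$ Hz.

```python
import numpy as np

def modulation_dct(segment, n_coef=14):
    """First n_coef unnormalized DCT-II coefficients of a compressed envelope segment."""
    N = len(segment)
    n = np.arange(N)
    k = np.arange(n_coef)[:, None]
    return np.cos(np.pi * (n + 0.5) * k / N) @ segment

fs_env = 400                        # hypothetical envelope sampling rate (Hz)
seg_len = 80                        # 200 ms at 400 Hz
n_bands, n_coef = 17, 14

rng = np.random.default_rng(0)
envs = rng.random((n_bands, seg_len)) + 0.1     # made-up positive sub-band envelopes

# Static stream: logarithmic compression followed by a DCT in each sub-band.
static = np.array([modulation_dct(np.log(e), n_coef) for e in envs])

# Modulation frequency of DCT coefficient k: k * fs_env / (2 * seg_len) Hz.
mod_freqs = np.arange(n_coef) * fs_env / (2 * seg_len)

# A dynamic stream of the same size would double the dimension.
total_dim = 2 * static.size
```

Under these assumptions the 14 coefficients span 0 to 32.5 Hz in steps of 2.5 Hz, consistent with the 0-35 Hz range quoted on the slide.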

Phoneme Recognition
TIMIT database (8 kHz): clean training data; test data can be clean, corrupted with additive noise, reverberated, or passed through a telephone channel.
Multi-layer perceptron (MLP) based system: MLPs estimate phoneme posteriors; hidden Markov model (HMM) - MLP hybrid model; performance in phoneme error rate (PER).
Features: perceptual linear prediction (PLP), 9 frame context; advanced ETSI standard [ETSI, 2002], 9 frame context; FDLP modulation (FDLP-M) features, 476 dim.

Phoneme Recognition
(Figure: bar chart comparing PLP-9, ETSI-9 and FDLP-M features in the Clean, Add. Noise, Reverb and Tel. conditions.)
S. Ganapathy, S. Thomas and H. Hermansky, "Temporal Envelope Compensation for Robust Phoneme Recognition Using Modulation Spectrum", JASA, 2010.

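Audio Coding
(Block diagram: Input → QMF Analysis (bands 1 to 32) → FDLP → envelope (Env.) and carrier (Carr.); the envelope passes through Q and Q$^{-1}$, the carrier through MDCT → Q → Q$^{-1}$ → IMDCT; the two are multiplied (Mul.) → QMF Synthesis → Output.)
Sriram Ganapathy, Petr Motlicek and H. Hermansky, "Autoregressive Modeling of Hilbert Envelopes for Wide-band Audio Coding", AES 124th Convention, Audio Engineering Society, May 2008.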

Subjective Evaluations
(Figure: listening-test scores on a 0-100 scale for Hidden Ref., LPF7k, MP3, FDLP and AAC.)
S. Ganapathy, P. Motlicek, and H. Hermansky, "AR Models of Amplitude Modulation in Audio Compression", IEEE Transactions on Audio, Speech and Language Proc., 2010.

Summary Employing AR modeling for estimating amplitude modulations. Long-term temporal analysis of signals forms an efficient alternative to conventional short-term spectrum. Provides AM-FM decomposition in sub-bands and acts as unified model for speech and audio signals.

Our Contributions Simple mathematical analysis for AR model of Hilbert envelopes. Investigating the resolution properties of FDLP. Gain normalization of FDLP Envelopes

Our Contributions Short-term feature extraction using FDLP Improvements in reverb speech recog. Modulation feature extraction Phoneme recognition in noisy speech. Speech and audio codec development using AM-FM signals from FDLP.

Publications
Journals
S. Ganapathy, S. Thomas and H. Hermansky, "Temporal envelope compensation for robust phoneme recognition using modulation spectrum", Journal of Acoustical Society of America, Dec. 2010.
S. Ganapathy, P. Motlicek and H. Hermansky, "Autoregressive Models Of Amplitude Modulations In Audio Compression", IEEE Transactions on Audio, Speech and Language Processing, Aug. 2010.
P. Motlicek, S. Ganapathy, H. Hermansky and H. Garudadri, "Wide-Band Audio Coding based on Frequency Domain Linear Prediction", EURASIP Journal on Audio, Speech, and Music Processing, 2010.
S. Ganapathy, S. Thomas and H. Hermansky, "Modulation Frequency Features For Phoneme Recognition In Noisy Speech", Journal of Acoustical Society of America - Express Letters, Jan 2009.
S. Thomas, S. Ganapathy and H. Hermansky, "Recognition Of Reverberant Speech Using Frequency Domain Linear Prediction", IEEE Signal Processing Letters, Dec 2008.
Patents
"Temporal Masking in Audio Coding Based on Spectral Dynamics in Frequency Subbands"
"Spectral Noise Shaping in Audio Coding Based on Spectral Dynamics in Frequency Sub-bands"

Publications
Selected Conferences
S. Ganapathy, P. Rajan and H. Hermansky, "Multi-layer Perceptron Based Speech Activity Detection for Speaker Verification", IEEE WASPAA, Oct. 2011.
S. Ganapathy, J. Pelecanos and M. Omar, "Feature Normalization for Speaker Verification in Room Reverberation", ICASSP, May 2011.
S. Ganapathy, S. Thomas and H. Hermansky, "Robust Spectro-Temporal Features Based on Autoregressive Models of Hilbert Envelopes", ICASSP, March 2010.
S. Ganapathy, S. Thomas and H. Hermansky, "Comparison of Modulation Features For Phoneme Recognition", ICASSP, March 2010.
S. Ganapathy, S. Thomas, and H. Hermansky, "Temporal Envelope Subtraction for Robust Speech Recognition Using Modulation Spectrum", IEEE ASRU, 2009.
S. Ganapathy, S. Thomas, P. Motlicek and H. Hermansky, "Applications of Signal Analysis Using Autoregressive Models for Amplitude Modulation", IEEE WASPAA 2009.
S. Ganapathy, S. Thomas and H. Hermansky, "Static and Dynamic Modulation Spectrum for Speech Recognition", Proc. of INTERSPEECH, Brighton, UK, Sept. 2009.
S. Ganapathy, P. Motlicek, H. Hermansky and H. Garudadri, "Autoregressive Modelling of Hilbert Envelopes for Wide-band Audio Coding", AES 124th Convention, AES.
S. Ganapathy, P. Motlicek, H. Hermansky and H. Garudadri, "Temporal Masking for Bitrate Reduction in Audio Codec Based on Frequency Domain Linear Prediction", ICASSP, April 2008.

Acknowledgements Lab Buddies Samuel Thomas, Sivaram Garimella, Padmanbhan Rajan, Harish Mallidi, Vijay Peddinti, Thomas Janu, Aren Jansen. Idiap personnel Petr Motlicek, Joel Pinto, Mathew Doss. IBM personnel Jason Pelecanos, Mohamed Omar Others Xinhui Zhou, Daniel Romero, Marios Athineos, David Gelbart, Harinath Garudadri.

Thank You

Noise Compensation in FDLP
Signal + Noise → DCT → Critical-band Window → IDFT → $|\cdot|^2$ → DFT → Linear Pred. → FDLP Env.
When speech is corrupted with additive noise, $y[n] = x[n] + s[n]$.
The noise component is additive in the non-parametric Hilbert envelope domain (assuming the signal and noise are uncorrelated).

Noise Compensation in FDLP
Input → DCT → Critical-band Window → IDFT → $|\cdot|^2$ → Wiener Filtering (with VAD) → DFT
A voice activity detector (VAD) identifies the non-speech regions, which are used to estimate the temporal envelope of the noise.
Noise subtraction then subtracts the estimated noise envelope from the noisy speech envelope.
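
The envelope subtraction idea can be illustrated with a deliberately simplified stand-in: the noise envelope is taken as the mean envelope over the non-speech samples flagged by a VAD mask, subtracted, and floored to stay positive. This is not the Wiener filtering used in the talk; the function name, the flooring constant and the toy envelope are all assumptions.

```python
import numpy as np

def subtract_noise_envelope(env, speech_mask, floor=1e-3):
    """Subtract a noise envelope estimated from non-speech samples, with a floor."""
    env = np.asarray(env, dtype=float)
    speech_mask = np.asarray(speech_mask, dtype=bool)
    noise_level = env[~speech_mask].mean()          # estimate from non-speech regions
    return np.maximum(env - noise_level, floor * env)

# Toy example: a clean envelope of 4.0 in the speech region plus a constant noise envelope of 0.5.
clean = np.array([0.0, 0.0, 4.0, 4.0, 0.0, 0.0])
noisy_env = clean + 0.5
vad = np.array([False, False, True, True, False, False])   # hypothetical VAD decisions
compensated = subtract_noise_envelope(noisy_env, vad)
```

In the speech region the compensated envelope returns to the clean value, while the non-speech samples are pushed down to the floor.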

Noise Compensation in FDLP
S. Ganapathy, S. Thomas, and H. Hermansky, "Temporal Envelope Subtraction for Robust Speech Recognition using Modulation Spectrum", IEEE ASRU, 2009.

Dealing with Convolutive Distortions
Cepstral mean subtraction (CMS), long-term log spectral subtraction (LTLSS) and gain normalization:
CMS assumes the distortion in neighboring frames to be similar and suppresses short-term artifacts.
Long-term subtraction deals with reverberation by assuming the same response over a window of long-term frames [Gelbart, 2002].
Gain normalization deals with short- and long-term distortions within a single long-term frame.

Feature Comparison

Evidence
Physiological evidence: spectro-temporal receptive fields [Shamma et al., 2001].
Psycho-physical evidence: perceptual importance of modulation frequencies [Drullman et al., 1994]; syllable recognition from temporal modulations with minimal spectral cues [Shannon et al., 1995].

Applications
Modulation spectra have been used in the past for:
Speech intelligibility [Houtgast et al., 1980].
RASTA processing [Hermansky et al., 1994].
Speech recognition [Kingsbury et al., 1998].
AM-FM decomposition [Kumaresan et al., 1999].
Sound texture modeling [Athineos et al., 2003].
Sound source separation [King et al., 2010].

Linear Prediction
Time domain: the current sample is expressed as a linear combination of past samples,
$x[n] = \sum_{k=1}^{p} a_k x[n-k] + e[n], \quad n = 0, \ldots, N-1$
Model parameters are solved by minimizing the residual sum of squares,
$E_p = \sum_{n=0}^{N-1} |e[n]|^2$
(Figure: samples $n-3$, $n-2$, $n-1$ weighted by $a_3$, $a_2$, $a_1$ to predict sample $n$.)
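
A minimal sketch of time-domain linear prediction by the autocorrelation method, assuming a synthetic second-order AR process as input. The function name, the process coefficients (1.5, -0.7) and the signal length are choices made for this example.

```python
import numpy as np

def lp_autocorr(x, order):
    """Solve the normal equations R a = r for the predictor x[n] ~ sum_k a_k x[n-k]."""
    r = np.array([x[:len(x) - i] @ x[i:] for i in range(order + 1)])
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[idx], r[1:])
    E_p = r[0] - a @ r[1:]              # minimum residual sum of squares
    return a, E_p

# Synthetic AR(2) process: x[n] = 1.5 x[n-1] - 0.7 x[n-2] + e[n], unit-variance e[n].
rng = np.random.default_rng(0)
N = 20000
e = rng.standard_normal(N)
x = np.zeros(N)
for n in range(2, N):
    x[n] = 1.5 * x[n - 1] - 0.7 * x[n - 2] + e[n]

a, E_p = lp_autocorr(x, order=2)        # a should be close to [1.5, -0.7]
```

The estimated coefficients approach the generating values as the signal gets longer, and `E_p / N` approaches the variance of the driving noise.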

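AR model of Power Spectrum
Filter interpretation [Makhoul, 1975]:
$e[n] = x[n] - \sum_{i=1}^{p} a_i x[n-i] = x[n] * d[n], \quad d = [1, -a_1, -a_2, \ldots, -a_p]$
$E(\omega) = \sum_{n=0}^{N-1} e[n]\, e^{-j\omega n} = X(\omega)\, D(\omega)$
From Parseval's theorem,
$E_p = \sum_{n=0}^{N-1} |e[n]|^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi} |E(\omega)|^2\, d\omega = \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^2\, |D(\omega)|^2\, d\omega$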

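AR model of Power Spectrum
By definition, $|D(\omega)|^2 = \left|1 - \sum_{i=1}^{p} a_i e^{-ji\omega}\right|^2$.
Let $P_x(\omega) = |X(\omega)|^2$ and $H(\omega) = \frac{1}{D(\omega)}$.
Thus, the parameters $\{a_i\}$ are solved by minimizing
$E_p = \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^2\, |D(\omega)|^2\, d\omega = \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{P_x(\omega)}{|H(\omega)|^2}\, d\omega$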

AR model of Power Spectrum
The solution of linear prediction yields an all-pole model of the power spectrum,
$\hat{P}_x(\omega) = E_p\, |H(\omega)|^2 = \frac{G}{\left|1 - \sum_{i=1}^{p} a_i e^{-ji\omega}\right|^2}$
The numerator $G$ denotes the gain of the AR model (equal to the minimum residual sum of squares).
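
The all-pole expression can be evaluated on a DFT grid in a few lines. The sketch below assumes the coefficients and gain are already known (here the hypothetical AR(2) values 1.5 and -0.7 with $G = 1$) and checks the result against the closed-form variance of that AR(2) process, since the average of the model spectrum over frequency equals the zero-lag autocorrelation.

```python
import numpy as np

def ar_power_spectrum(a, G, n_fft=4096):
    """Evaluate G / |1 - sum_i a_i e^{-j i w}|^2 on n_fft uniformly spaced frequencies."""
    d = np.concatenate(([1.0], -np.asarray(a, dtype=float)))   # [1, -a_1, ..., -a_p]
    D = np.fft.fft(d, n_fft)
    return G / np.abs(D) ** 2

a1, a2, G = 1.5, -0.7, 1.0              # hypothetical AR(2) model
P = ar_power_spectrum([a1, a2], G)

# Closed-form variance of an AR(2) process driven by noise of variance G.
r0 = G * (1 - a2) / ((1 + a2) * ((1 - a2) ** 2 - a1 ** 2))
```

Here `P.mean()` approximates $\frac{1}{2\pi}\int \hat{P}_x(\omega)\,d\omega$ and matches `r0` (about 8.85), a quick sanity check on the sign convention of the denominator.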

AR model of power spectrum

Hilbert Envelope - Definition
The analytic signal is the sum of the signal and its quadrature component, $x_a[n] = x[n] + j\mathcal{H}(x[n])$, where $\mathcal{H}$ denotes the Hilbert transform.
The Hilbert envelope is the squared magnitude of the analytic signal.

Duality LP FDLP

LP in Time and Frequency

AM-FM Decomposition
(Figure panels: a. Signal; b. Hilb. Env.; c. FDLP Env.; d. AM comp.; e. FM comp.)

Spectrogram Comparison
(Figure: PLP and FDLP spectrograms.)
Sriram Ganapathy, Samuel Thomas and H. Hermansky, "Comparison of Modulation Frequency Features for Speech Recognition", ICASSP, 2010.

Modulation Features
(Figure panels: a. Signal; b. Hilb. Env.; c. FDLP Env.; d. Log comp.; e. Dyn. comp.)
Sriram Ganapathy, Samuel Thomas and H. Hermansky, "Modulation Frequency Features for Phoneme Recognition in Noisy Speech", JASA, Express Letters, 2009.

Introduction
Conventional signal analysis starts with the estimation of the short-term spectrum (10-40 ms).
The spectrum is sampled at a preset rate before further modeling/processing stages.
Contextual information is typically processed with time-series models such as HMMs.
(Figure axes: time, frequency.)