Learning to Unlearn and Relearn Speech Signal Processing using Neural Networks: current and future perspectives

Learning to Unlearn and Relearn Speech Signal Processing using Neural Networks: current and future perspectives
Mathew Magimai Doss
Collaborators: Vinayak Abrol, Selen Hande Kabil, Hannah Muckenhirn, Dimitri Palaz, D. S. Pavan Kumar
Idiap Research Institute, Martigny, Switzerland
July 17, 2018

Conventional speech processing approach

Conventional cepstral feature extraction:
speech signal -> FFT -> critical-band filtering -> non-linear operation (log(.) for MFCC, cube root for PLP) -> DCT / AR modeling -> MFCC / PLP -> derivatives -> x -> NN classifier -> P(i|x)

Recent trend using Convolutional Neural Networks (CNNs):
speech signal -> FFT -> critical-band filtering -> derivatives -> x -> CNN -> NN classifier -> P(i|x)

The hand-crafted front-end encodes prior knowledge:
1. Quasi-stationarity (windowing, time-frequency resolution), motivated by speech coding analysis-synthesis studies
2. Speech production knowledge
3. Speech perception knowledge
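To make the conventional front-end concrete, here is a minimal numpy sketch of the FFT, critical-band filtering, log compression and DCT chain for a single frame. The mel filter-bank construction is simplified, and the constants (24 filters, 13 coefficients, 16 kHz sampling) are illustrative assumptions, not a reference MFCC implementation.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_like(frame, sr=16000, n_filt=24, n_ceps=13):
    """MFCC-style features for one short-term frame (minimal sketch)."""
    n = len(frame)
    # 1. FFT: power spectrum of a windowed frame (quasi-stationarity)
    spec = np.abs(np.fft.rfft(frame * np.hamming(n))) ** 2
    # 2. Critical-band filtering: triangular filters spaced on the mel scale
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filt + 2))
    bins = np.floor((n + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filt, len(spec)))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 3. Non-linear operation: log compression (PLP would use a cube root)
    log_e = np.log(fbank @ spec + 1e-10)
    # 4. DCT: decorrelate filter-bank energies, keep low-order cepstra
    return dct(log_e, type=2, norm='ortho')[:n_ceps]
```

For a whole utterance this would be applied frame by frame (e.g. 25 ms windows with a 10 ms shift), with derivatives appended, as in the diagram above.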

In this talk

speech signal x -> CNN -> NN classifier -> P(i|x), with all stages trained jointly.

- Can help in overcoming limitations of conventional short-term speech processing
- Can help in better understanding speech signal characteristics in a task-specific manner

CNN-based system using raw speech as input: Overview

Filter stage (feature learning): raw speech input x -> N stages of convolution, max pooling and tanh(.)
Classification stage (acoustic modeling): MLP -> p(i|x)

- Minimal prior knowledge
- Short-term processing: feature extraction can be seen as a filtering operation
- Relevant information can be spread across time; it is determined in a data-driven manner
- All stages are trained jointly using back-propagation with a cost function based on cross entropy (a minimal model sketch follows below)
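A minimal PyTorch sketch of such a system, assuming one learned "filterbank" convolution followed by a second conv/pool stage and an MLP. The filter count, kernel widths and hidden size are hypothetical choices for illustration, not the exact configurations of the cited papers.

```python
import torch
import torch.nn as nn

class RawSpeechCNN(nn.Module):
    """Filter stage (conv + max pool + tanh) followed by an MLP classifier."""
    def __init__(self, n_classes, n_filters=80, kw=30, dw=10):
        super().__init__()
        self.filter_stage = nn.Sequential(
            nn.Conv1d(1, n_filters, kernel_size=kw, stride=dw),  # learned "filterbank"
            nn.MaxPool1d(2), nn.Tanh(),
            nn.Conv1d(n_filters, n_filters, kernel_size=5),
            nn.MaxPool1d(2), nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(512), nn.Tanh(), nn.LazyLinear(n_classes)
        )

    def forward(self, x):  # x: (batch, 1, n_samples) raw waveform
        return self.classifier(self.filter_stage(x))

# Joint training with cross entropy, as on the slide:
model = RawSpeechCNN(n_classes=40)
logits = model(torch.randn(8, 1, 4000))          # 250 ms at 16 kHz
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 40, (8,)))
loss.backward()  # gradients flow through classifier AND filter stage
```

Because the loss gradient flows from the classifier back into the first convolution, the "feature extraction" filters are learned jointly with the acoustic model rather than fixed by design.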

CNN-based system using raw speech as input: Illustration of the first convolutional layer

The input sequence w_seq is convolved with n_f filters of width kw, shifted by dw:
- w_seq: input speech signal with temporal context
- kw: window size; sub-segmental (< 1 pitch period) or segmental (1-3 pitch periods)
- dw: window shift (< 1 pitch period)
- n_f: number of filters

Speech processing applications

Application                      w_seq        kw            # conv. layers  # hidden layers
Speech reco. [1,2]               250-310 ms   sub-seg       3-5             1-3
Speaker reco. [3,4]              500 ms       seg, sub-seg  2-3             1
Presentation attack det. [5]     300 ms       seg           2               1 or none
Gender reco. [6]                 250-310 ms   seg, sub-seg  1-3             1
Paralinguistic [7]               250-500 ms   seg, sub-seg  3-4             1

[1] Dimitri Palaz, Ronan Collobert, and Mathew Magimai-Doss, "Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks," in Proc. of Interspeech, 2013.
[2] Dimitri Palaz, Mathew Magimai-Doss, and Ronan Collobert, "End-to-end acoustic modeling using convolutional neural networks for automatic speech recognition," Idiap-RR-18-2016, Idiap, 2016.
[3] Hannah Muckenhirn, Mathew Magimai-Doss, and Sébastien Marcel, "Towards directly modeling raw speech signal for speaker verification using CNNs," in Proc. of ICASSP, 2018.
[4] Hannah Muckenhirn, Mathew Magimai-Doss, and Sébastien Marcel, "On learning vocal tract system related speaker discriminative information from raw signal using CNNs," in Proc. of Interspeech, 2018.
[5] Hannah Muckenhirn, Mathew Magimai-Doss, and Sébastien Marcel, "End-to-end convolutional neural network-based voice presentation attack detection," in Proc. of the IEEE International Joint Conference on Biometrics (IJCB), 2017.
[6] Selen Hande Kabil, Hannah Muckenhirn, and Mathew Magimai-Doss, "On learning to identify genders from raw speech signal using CNNs," in Proc. of Interspeech, 2018.
[7] Bogdan Vlasenko, Jilt Sebastian, Pavan Kumar D. S., and Mathew Magimai-Doss, "Implementing fusion techniques for the classification of paralinguistic information," in Proc. of Interspeech, 2018.

In this talk

speech signal x -> CNN -> NN classifier -> P(i|x), trained jointly.

What information do such systems learn?
- Filter level analysis
- Whole network level analysis

Filter level analysis: First convolution layer

Cumulative frequency response of the filters:

F_cum = Σ_{m=1}^{M} F_m / ||F_m||_2,   (1)

where F_m is the DFT of filter f_m and M is the number of filters.

Response of the filters to input speech, interpreting the learned filters collectively as a spectral dictionary:

X̂ = Σ_{m=1}^{M} ⟨x, f_m⟩ · DFT[f_m],   (2)

where x̂_m = ⟨x, f_m⟩ is the output of filter f_m and X̂ is the spectral information modeled. If {f_m} were the Fourier sine and cosine bases, then X̂ would be the DFT of x.
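Both quantities are straightforward to compute from a trained first layer. A minimal numpy sketch, assuming `filters` is the (M, kw) array of learned first-layer kernels (e.g. the Conv1d weights from the model sketch above):

```python
import numpy as np

def cumulative_response(filters, n_fft=512):
    """Eq. (1): F_cum, the sum of L2-normalized filter frequency responses."""
    F = np.fft.rfft(filters, n=n_fft, axis=1)                  # F_m = DFT[f_m]
    F = np.abs(F) / np.linalg.norm(F, axis=1, keepdims=True)   # F_m / ||F_m||_2
    return F.sum(axis=0)

def spectral_response(x, filters, n_fft=512):
    """Eq. (2): X_hat, the filters used collectively as a spectral dictionary.

    `x` is one frame of raw speech with the same length as the kernels."""
    coeffs = filters @ x                                       # x_hat_m = <x, f_m>
    return np.abs(coeffs @ np.fft.rfft(filters, n=n_fft, axis=1))
```

Note that eq. (2) only reduces to a DFT when the filters form the Fourier basis; for learned filters, X̂ shows which spectral components the layer actually captures.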

Filter level analysis: Speech recognition, cumulative response

[Figure: cumulative frequency response (normalized magnitude vs. frequency, 0-8000 Hz) of the first-layer filters of a CNN trained on the WSJ corpus.]

- The filters model sub-segmental speech.
- For comparison, a standard filterbank (constant-Q filters) would yield a flat cumulative response.

Filter level analysis: Speech recognition, spectral response X̂ for a frame of speech

[Figure: gain-normalized magnitude spectrum X̂ of /iy/ from the American English Vowel dataset for speakers m1, w1, b1 and g1, with spectral peaks marked.]

Speaker   F1 range (Hz)   F2 range (Hz)   Obs. 1st peak (Hz)   Obs. 2nd peak (Hz)
m1        328-357         2418-2458       375                  2625
w1        439-441         2767-2822       437                  2812
b1        468-554         2981-3240       500                  3000
g1        382-392         3340-3780       375                  -

Filter level analysis: Speaker recognition, cumulative response

[Figure: cumulative frequency responses of the first-layer filters for segmental modeling and for sub-segmental modeling.]

Filter level analysis: Speaker recognition, spectral response X̂ (segmental modeling)

[Figure: F0 contour estimated on the Keele pitch database using the CNN-based speaker classifier trained on Voxforge.]

Filter level analysis: Speaker recognition, spectral response X̂ of a frame of speech (sub-segmental modeling)

[Figure: X̂ compared with the LP spectrum of the same frame.]

In this talk

speech signal x -> CNN -> NN classifier -> P(i|x), trained jointly.

What information do such systems learn?
- Filter level analysis
- Whole network level analysis

Whole network analysis: Gradient-based visualization [8]

In computer vision research, given an input image-output class pair and the trained system, guided backpropagation finds the contribution of each pixel in the image to the output score.

Given an input speech-output class pair and the trained system, what is the contribution of each sample to the output score?

[Figure: original signal, relevance signal (per-sample contributions), and autocorrelation of the relevance signal.]

[8] H. Muckenhirn et al., "Gradient-based spectral visualization of CNNs using raw waveforms," Idiap Research Report Idiap-RR-11-2018, 2018 (submitted to SLT 2018).
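A minimal sketch of the idea in PyTorch. For brevity this uses plain input gradients; the cited report uses guided backpropagation, which additionally masks negative intermediate gradients on the backward pass.

```python
import torch

def relevance_signal(model, x, class_idx):
    """Per-sample contribution of the waveform to one output score."""
    x = x.clone().requires_grad_(True)   # x: (1, 1, n_samples) waveform
    score = model(x)[0, class_idx]       # output score of the target class
    score.backward()                     # d(score)/d(sample) for every sample
    return x.grad[0, 0]

# The relevance signal can then be analyzed with conventional DSP, e.g.:
# r = relevance_signal(model, waveform, phone_id)
# spectrum = 20 * torch.log10(torch.abs(torch.fft.rfft(r)))
```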

Whole network analysis: Case study on speech recognition (1)

[Figure: original log spectrum of /iy/ (top) and log spectrum of the corresponding relevance signal (bottom), 0-8000 Hz.]

Whole network analysis: Case study on speech recognition (2)

Analysis of a CNN trained on the TIMIT phone recognition task, applied to the American English Vowel (AEV) dataset. F0, F1 and F2 were estimated automatically from the relevance signal in the steady-state regions and compared to the values reported in the original study.

Table: Average accuracy (%) of fundamental frequency (F0) and formant frequency (F1, F2) estimates from the relevance signal of the AEV dataset, for vowels produced by 45 male (M) and 48 female (F) speakers.

          /ah/   /eh/   /iy/   /oa/   /uw/
F0   F    93     91     91     94     92
     M    92     90     89     93     90
F1   F    90     92     93     91     93
     M    88     92     92     89     93
F2   F    94     94     94     95     94
     M    94     93     94     94     93

Whole network analysis: Case study on speaker recognition (1)

[Figure: original signal together with the relevance signals obtained under segmental modeling and sub-segmental modeling.]

Whole network analysis: Case study on speaker recognition (2)

[Figure: utterance-level average log spectrum of the relevance signal for segmental modeling (top) and sub-segmental modeling (bottom), 0-8000 Hz.]

Whole network analysis: Listening to the relevance signal

- Relevance signal obtained from the speaker recognition CNN (segmental modeling)
- Relevance signal obtained from the speech recognition CNN
- Original signal

Summary

speech signal x -> CNN -> NN classifier -> P(i|x), trained jointly.

- Can help in overcoming limitations of conventional short-term speech processing; allows both segmental modeling and sub-segmental modeling
- Can help in better understanding speech signal characteristics in a task-specific manner; the relevance signal can be analyzed using conventional signal processing techniques to gain insight
- Work in progress to understand how the neural network models the relevant information, which could potentially provide new algorithms for speech signal processing

The End Thank you for your attention! Questions?

CNN-based system using raw speech as input: Detailed view for one example

Input: raw samples for a 10 ms target frame together with its surrounding temporal context.

- Conv 1: kw = 30, dw = 10 (spans 1.8 ms)
- MP 1: kw = 2, dw = 2 (2.5 ms)
- Conv 2: kw = 5, dw = 1 (12.5 ms)
- MP 2: kw = 2, dw = 2 (15 ms)
- Conv 3: kw = 5, dw = 1 (75 ms)
- MP 3: kw = 2, dw = 2 (90 ms)
- ANN -> p(i|x)
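Per-stage time spans like those above can be checked with standard receptive-field arithmetic: each (kernel kw, stride dw) stage grows the receptive field by (kw - 1) times the accumulated stride. The sketch below shows the computation; the layer list is a hypothetical reading of this slide's configuration (only Conv 1, 30 samples at 16 kHz = 1.875 ms, is firmly supported by the printed numbers), so its output need not match the slide's spans exactly.

```python
# Assumed (kernel width kw, stride dw) per stage: conv/pool pairs as above.
layers = [(30, 10), (2, 2), (5, 1), (2, 2), (5, 1), (2, 2)]

span, stride = 1, 1
for kw, dw in layers:
    span += (kw - 1) * stride   # receptive field grows by (kw-1)*stride
    stride *= dw                # output hop measured in input samples
    print(f"kw={kw:2d} dw={dw:2d} -> one output spans {span / 16:.2f} ms of input")
```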

Whole network analysis: Speech recognition versus speaker recognition

[Figure: spectrograms (0-8 kHz, power/frequency in dB/Hz) of the original signal, the phone CNN relevance signal, and the speaker CNN relevance signal.]