Learning to Unlearn and Relearn Speech Signal Processing using Neural Networks: current and future perspectives

Mathew Magimai Doss
Collaborators: Vinayak Abrol, Selen Hande Kabil, Hannah Muckenhirn, Dimitri Palaz, D. S. Pavan Kumar
Idiap Research Institute, Martigny, Switzerland
July 17, 2018
Conventional speech processing approach

Conventional cepstral feature extraction:

speech signal → FFT → critical-band filtering → non-linear operation (log(·) for MFCC, (·)^(1/3) for PLP) → DCT / AR modeling → MFCC / PLP → + derivatives → x → NN classifier → P(i|x)

Recent trend using Convolutional Neural Networks (CNNs):

speech signal → FFT → critical-band filtering → + derivatives → x → CNN → NN classifier → P(i|x)

Prior knowledge built into this pipeline (a code sketch of the cepstral branch follows below):
1. Quasi-stationarity (windowing, time-frequency resolution), motivated by speech coding analysis-synthesis studies
2. Speech production knowledge
3. Speech perception knowledge
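For concreteness, here is a minimal NumPy sketch of the MFCC branch of this pipeline (FFT, critical-band filtering, log compression, DCT). The filterbank construction and all constants are illustrative textbook defaults, not the exact configuration behind these slides:

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular critical-band filters, equally spaced on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv(np.linspace(0.0, mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_frame(frame, fb, n_ceps=13):
    """One frame of MFCCs: |FFT|^2 -> critical-band filtering -> log -> DCT."""
    n_fft = (fb.shape[1] - 1) * 2
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=n_fft)) ** 2
    log_energies = np.log(np.maximum(fb @ power, 1e-10))   # non-linear operation
    return dct(log_energies, type=2, norm='ortho')[:n_ceps]

# Usage, e.g. one 25 ms frame at 16 kHz:
# c = mfcc_frame(signal[:400], mel_filterbank(24, 512, 16000))
```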
In this talk

speech signal x → CNN → NN classifier → P(i|x)   (joint training)

- Can help in overcoming limitations of conventional short-term speech processing
- Can help in better understanding speech signal characteristics in a task-specific manner
CNN-based system using raw speech as input
Overview

raw speech input x → [convolution → max pooling → tanh(·)] × N → MLP → p(i|x)
(filter stage: feature learning; classification stage: acoustic modeling)

- Minimal prior knowledge
- Short-term processing: feature extraction can be seen as a filtering operation
- Relevant information can be spread across time
- Determined in a data-driven manner: all stages are trained jointly using back-propagation with a cross-entropy cost function

A concrete code sketch of one such pipeline accompanies the detailed view at the end of the deck.
CNN-based system using raw speech as input
Illustration of the first convolutional layer

The input sequence w_seq is convolved with n_f filters of width kw, shifted by dw samples (sketched in code below):

- w_seq: input speech signal with temporal context
- kw: window size; sub-segmental (< 1 pitch period) or segmental (1-3 pitch periods)
- dw: window shift (< 1 pitch period)
- n_f: number of filters
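In PyTorch terms, this first layer is simply a strided 1-D convolution; the sketch below assumes a 16 kHz sampling rate, and the kw, dw, and n_f values are illustrative:

```python
import torch
import torch.nn as nn

fs = 16000        # assumed sampling rate
kw = 30           # window size in samples: ~1.9 ms, i.e. sub-segmental
dw = 10           # window shift in samples (< 1 pitch period)
n_f = 80          # number of learned filters (illustrative)

# kw -> kernel_size, dw -> stride, n_f -> out_channels
first_layer = nn.Conv1d(in_channels=1, out_channels=n_f,
                        kernel_size=kw, stride=dw)

w_seq = torch.randn(1, 1, int(0.25 * fs))   # 250 ms of raw speech (batch, chan, time)
out = first_layer(w_seq)                    # (1, n_f, n_frames): one column per shift
```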
Speech processing applications

Application                          w_seq        kw             # conv. layers   # hidden layers
Speech reco. [1,2]                   250-310 ms   sub-seg        3-5              1-3
Speaker reco. [3,4]                  500 ms       seg, sub-seg   2-3              1
Presentation attack detection [5]    300 ms       seg            2                1 or none
Gender reco. [6]                     250-310 ms   seg, sub-seg   1-3              1
Paralinguistic [7]                   250-500 ms   seg, sub-seg   3-4              1

[1] Dimitri Palaz, Ronan Collobert, and Mathew Magimai.-Doss, "Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks," in Proc. of Interspeech, 2013.
[2] Dimitri Palaz, Mathew Magimai.-Doss, and Ronan Collobert, "End-to-end acoustic modeling using convolutional neural networks for automatic speech recognition," Idiap-RR-18-2016, Idiap, June 2016.
[3] Hannah Muckenhirn, Mathew Magimai.-Doss, and Sébastien Marcel, "Towards directly modeling raw speech signal for speaker verification using CNNs," in Proc. of ICASSP, 2018.
[4] Hannah Muckenhirn, Mathew Magimai.-Doss, and Sébastien Marcel, "On learning vocal tract system related speaker discriminative information from raw signal using CNNs," in Proc. of Interspeech, 2018.
[5] Hannah Muckenhirn, Mathew Magimai.-Doss, and Sébastien Marcel, "End-to-end convolutional neural network-based voice presentation attack detection," in Proc. of the IEEE International Joint Conference on Biometrics (IJCB), 2017.
[6] Selen Hande Kabil, Hannah Muckenhirn, and Mathew Magimai.-Doss, "On learning to identify genders from raw speech signal using CNNs," in Proc. of Interspeech, 2018.
[7] Bogdan Vlasenko, Jilt Sebastian, Pavan Kumar D. S., and Mathew Magimai.-Doss, "Implementing fusion techniques for the classification of paralinguistic information," in Proc. of Interspeech, 2018.
In this talk

speech signal x → CNN → NN classifier → P(i|x)   (joint training)

What information do such systems learn?
- Filter level analysis
- Whole network level analysis
Filter level analysis
First convolution layer

Cumulative frequency response of the filters:

F_{cum} = \sum_{m=1}^{M} \frac{F_m}{\|F_m\|_2},   (1)

where F_m is the DFT of filter f_m and M is the number of filters.

Response of the filters to input speech, obtained by interpreting the learned filters collectively as a spectral dictionary:

\hat{X} = \sum_{m=1}^{M} \langle x, f_m \rangle \, \mathrm{DFT}[f_m],   (2)

where \hat{x}_m = \langle x, f_m \rangle is the output of filter f_m and \hat{X} is the spectral information modeled. If {f_m} were Fourier sine and cosine bases, then \hat{X} would be the DFT of x. (Both quantities are computed in the sketch below.)
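Equations (1) and (2) translate directly into a few lines of NumPy. The sketch assumes `filters` is an (M, kw) array holding one learned first-layer filter per row, and `x` is a speech frame of length kw:

```python
import numpy as np

def cumulative_response(filters, n_fft=512):
    """Eq. (1): sum of the L2-normalised magnitude responses of the filters."""
    F = np.fft.rfft(filters, n=n_fft, axis=1)   # DFT of each filter f_m
    return np.sum(np.abs(F) / np.linalg.norm(F, axis=1, keepdims=True), axis=0)

def spectral_response(x, filters, n_fft=512):
    """Eq. (2): treat the filters as a spectral dictionary and project frame x onto it."""
    coeffs = filters @ x                         # <x, f_m> for every filter
    F = np.fft.rfft(filters, n=n_fft, axis=1)
    return np.abs(coeffs @ F)                    # |sum_m <x, f_m> DFT[f_m]|
```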
Filter level analysis
Speech recognition: cumulative response

[Plot: cumulative frequency response (normalized magnitude vs. frequency, 0-8,000 Hz) of the first-layer filters of a CNN trained on the WSJ corpus]

- Filters model sub-segmental speech
- For comparison, a standard constant-Q filterbank has a flat cumulative response
Filter level analysis
Speech recognition: spectral response X̂ for a frame of speech

[Plot: gain-normalized magnitude spectral response X̂ of /iy/ for speakers m1, w1, b1, and g1 from the American English Vowel dataset]

Speaker   F1 range (Hz)   F2 range (Hz)   Obs. 1st peak (Hz)   Obs. 2nd peak (Hz)
m1        328-357         2418-2458       375                  2625
w1        439-441         2767-2822       437                  2812
b1        468-554         2981-3024       500                  3000
g1        382-392         3034-3078       375                  -
Filter level analysis
Speaker recognition: cumulative response

[Plots: cumulative frequency responses of the first-layer filters under segmental modeling and sub-segmental modeling]
Filter level analysis
Speaker recognition: spectral response X̂ (segmental modeling)

[Plot: F0 contour estimated on the Keele pitch database using the CNN-based speaker classifier trained on Voxforge; one way to extract such a contour is sketched below]
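The slide does not spell out the F0 estimator; one plausible reading, sketched here, is to compute the spectral response X̂ of eq. (2) frame by frame and pick the dominant peak within a plausible pitch range. All parameter values are illustrative assumptions:

```python
import numpy as np

def f0_contour(frames, filters, fs=16000, n_fft=1024, fmin=50.0, fmax=400.0):
    """Dominant peak of each frame's spectral response within the pitch range."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)
    F = np.fft.rfft(filters, n=n_fft, axis=1)
    contour = []
    for x in frames:                        # each frame has the filter length kw
        X = np.abs((filters @ x) @ F)       # eq. (2) spectral response
        contour.append(freqs[band][np.argmax(X[band])])
    return np.array(contour)
```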
Filter level analysis
Speaker recognition: spectral response X̂ of a frame of speech (sub-segmental modeling)

[Plot: spectral response X̂ compared with the LP spectrum]
In this talk

speech signal x → CNN → NN classifier → P(i|x)   (joint training)

What information do such systems learn?
- Filter level analysis
- Whole network level analysis
Whole network analysis
Gradient-based visualization

In computer vision research, given an input image-output class pair and a trained system, the contribution of each pixel to the output score is computed via guided backpropagation. [8]

Here: given an input speech-output class pair and the trained system, what is the contribution of each sample to the output score? (A sketch of the computation follows below.)

[Plots: original signal and relevance signal (amplitude vs. time in ms); autocorrelation of the relevance signal vs. lag]

[8] H. Muckenhirn et al., "Gradient-based spectral visualization of CNNs using raw waveforms," Idiap Research Report Idiap-RR-11-2018, 2018 (submitted to SLT 2018).
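A minimal PyTorch sketch of the idea, using plain input gradients (vanilla saliency) rather than full guided backpropagation, which would additionally zero out negative gradients at every ReLU. `model` is assumed to map a raw-waveform tensor to per-class scores:

```python
import torch

def relevance_signal(model, waveform, class_idx):
    """Gradient of one output score w.r.t. every input sample (vanilla saliency).

    `model` maps a (1, 1, T) raw-waveform tensor to a (1, n_classes) score tensor.
    """
    x = waveform.clone().detach().requires_grad_(True)   # (1, 1, T)
    score = model(x)[0, class_idx]                       # score of the target class
    score.backward()                                     # d(score)/d(x_t) per sample
    return x.grad.squeeze()                              # the "relevance signal"
```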
Whole network analysis
Case study on speech recognition (1)

[Plots: original log spectrum of /iy/ and relevance signal log spectrum of /iy/, 0-8000 Hz]
Whole network analysis
Case study on speech recognition (2)

Analysis of a CNN trained on the TIMIT phone recognition task, applied to the American English Vowel (AEV) dataset. F0, F1, and F2 are estimated automatically from the relevance signal over the steady-state regions and compared to the values reported in the original study (one standard way to estimate them is sketched below).

Table: Average accuracy (%) of fundamental and formant frequencies of vowels produced by 45 male and 48 female speakers, estimated from the relevance signal of the AEV dataset.

          /ah/   /eh/   /iy/   /oa/   /uw/
F0   F    93     91     91     94     92
     M    92     90     89     93     90
F1   F    90     92     93     91     93
     M    88     92     92     89     93
F2   F    94     94     94     95     94
     M    94     93     94     94     93
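The slide does not give the exact estimators; a standard way to obtain such values from a steady-state stretch of the relevance signal is autocorrelation peak picking for F0 and LPC root finding for the formants, as in the following NumPy sketch (model order and search ranges are illustrative):

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method LPC coefficients via the Levinson-Durbin recursion."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[1:i][::-1]) / err
        a[1:i] = a[1:i] + k * a[1:i][::-1]   # symmetric coefficient update
        a[i] = k
        err *= 1.0 - k * k
    return a

def formants(x, fs=16000, order=12):
    """F1, F2, ... candidates: angles of the complex LPC poles (upper half-plane)."""
    roots = np.roots(lpc(x * np.hamming(len(x)), order))
    roots = roots[np.imag(roots) > 0.0]
    return np.sort(np.angle(roots) * fs / (2.0 * np.pi))

def f0_autocorr(x, fs=16000, fmin=50.0, fmax=400.0):
    """F0 from the strongest autocorrelation peak within the pitch lag range."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    return fs / (lo + np.argmax(r[lo:hi]))
```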
Whole network analysis
Case study on speaker recognition (1)

[Plots: original signal, relevance signal under segmental modeling, and relevance signal under sub-segmental modeling]
Whole network analysis
Case study on speaker recognition (2)

Utterance-level average spectrum:

[Plots: utterance-level average log spectrum of the relevance signal under segmental modeling and under sub-segmental modeling, 0-8000 Hz]
Whole network analysis
Listening to the relevance signal

- Relevance signal obtained from the speaker recognition CNN (segmental modeling)
- Relevance signal obtained from the speech recognition CNN
- Original signal
Summary

speech signal x → CNN → NN classifier → P(i|x)   (joint training)

- Can help in overcoming limitations of conventional short-term speech processing
  - Allows both segmental modeling and sub-segmental modeling
- Can help in better understanding speech signal characteristics in a task-specific manner
  - The relevance signal can be analyzed with conventional signal processing techniques to gain insight
- Work in progress to understand how the neural network models the relevant information; this could potentially yield new algorithms for speech signal processing
The End Thank you for your attention! Questions?
CNN-based system using raw speech as input
Detailed view for one example

Input: 250 ms of raw speech centered on a 10 ms target frame, processed by three convolution + max-pooling stages followed by an ANN (sketched in code below):

- Conv 1: kw = 30 samples (~1.9 ms), dw = 10
- MP 1: kw = 2, dw = 2
- Conv 2: kw = 5, dw = 1
- MP 2: kw = 2, dw = 2
- Conv 3: kw = 5, dw = 1
- MP 3: kw = 2, dw = 2
- ANN → p(i|x)

The effective temporal span of each unit grows from sub-segmental at Conv 1 to most of the input context at the ANN.
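A PyTorch sketch of this pipeline under the reconstruction above; kernel sizes and strides follow the slide, while channel counts, the hidden layer size, and the number of output classes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Channel counts, hidden size, and class count are assumptions, not the slide's values.
model = nn.Sequential(
    nn.Conv1d(1, 80, kernel_size=30, stride=10), nn.Tanh(),   # Conv 1: ~1.9 ms kernels
    nn.MaxPool1d(kernel_size=2, stride=2),                    # MP 1
    nn.Conv1d(80, 60, kernel_size=5, stride=1), nn.Tanh(),    # Conv 2
    nn.MaxPool1d(kernel_size=2, stride=2),                    # MP 2
    nn.Conv1d(60, 60, kernel_size=5, stride=1), nn.Tanh(),    # Conv 3
    nn.MaxPool1d(kernel_size=2, stride=2),                    # MP 3
    nn.Flatten(),
    nn.LazyLinear(500), nn.Tanh(),                            # ANN hidden layer
    nn.LazyLinear(40),                                        # class scores
)

x = torch.randn(1, 1, 4000)                     # 250 ms of raw speech at 16 kHz
log_p = torch.log_softmax(model(x), dim=-1)     # log p(i | x); train with cross entropy
```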
Whole network analysis
Speech recognition versus speaker recognition

[Spectrograms (0-8 kHz, power/frequency in dB/Hz, time in ms): original signal, relevance signal of the phone CNN, and relevance signal of the speaker CNN]