Speech Production. Automatic Speech Recognition handout (1) Jan - Mar 2009 Revision : 1.1. Speech Communication. Spectrogram. Waveform.

Speech Signal Processing and Feature Extraction
Hiroshi Shimodaira (h.shimodaira@ed.ac.uk)

Speech Production

[Figure: vocal organs and vocal tract, labelled lips, teeth, nasal cavity, oral cavity, tongue, pharynx, larynx, vocal folds, lung]

[Figure: source-filter view of speech production. Vocal folds: source v(t) with spectrum V(Ω), falling at -12 dB/oct. Vocal tract (larynx + pharynx, mouth cavity, nasal cavity): H(Ω) with formants F1, F2, F3. Lips: radiation, +6 dB/oct. Speech s(t): spectrum S(Ω) with formants F1, F2, F3 and an overall tilt of -6 dB/oct.]

time domain: $s(t) = h(t) * v(t) = \int_{-\infty}^{\infty} h(\tau)\, v(t-\tau)\, d\tau$
frequency domain (Fourier transform): $S(\Omega) = H(\Omega)\, V(\Omega)$

Speech Communication

Speaker: Intention -> Language -> Motion control -> Articulatory organ (vocal tract) and signal source (vocal cords) -> speech sound
Listener: speech sound -> Auditory organs -> Auditory processing -> Language -> Understanding

Waveform and Spectrogram

[Figure: waveform, spectrogram, and a cross-section of the spectrogram]

Automatic Speech Recognition

[Figure]

Signal Analysis for ASR

Front-end analysis: convert the acoustic signal into a sequence of feature vectors.

Feature parameters for ASR

Features should
- contain sufficient information to distinguish phonemes / phones
  - good time resolution [e.g. 10 ms]
  - good frequency resolution [e.g. 20 channels on the Bark scale]
- not contain (or be separated from) F0 and its harmonics
- be robust against speaker variation
- be robust against noise / channel distortions
- have good characteristics in terms of pattern recognition
  - the number of features is as small as possible
  - features are independent of each other

A large number of features have been proposed.

Converting analogue signals to machine-readable form

Discretisation (digitising): $x_c(t) \rightarrow x[n]$
- continuous time -> discrete time
- continuous amplitude -> discrete amplitude

Sampling of continuous-time signals

Continuous-time signal: $x_c(t)$

Signal modulated by a periodic impulse train:
$x_s(t) = x_c(t) \sum_{n=-\infty}^{\infty} \delta(t - nT_s) = \sum_{n=-\infty}^{\infty} x_c(nT_s)\, \delta(t - nT_s)$

Sampled (discrete-time) signal: $x[n] = x_c(nT_s)$

$T_s$: sampling interval

Sampling of continuous-time signals (cont. 2)

Q: Is the C/D conversion invertible?
$x_c(t)$ -> C/D -> $x[n]$ -> D/C -> $x_c(t)$ ?

Sampling of continuous-time signals (cont. 3)

A: No in general, but yes under a special condition.

Nyquist sampling theorem: if $x_c(t)$ is band-limited (i.e. it has no frequency components above $F_s/2$), then $x_c(t)$ can be fully reconstructed from $x[n]$:

$x_c(t) = h_{T_s}(t) * \sum_{k=-\infty}^{\infty} x[k]\, \delta(t - kT_s) = \sum_{k=-\infty}^{\infty} x[k]\, h_{T_s}(t - kT_s)$

$h_{T_s}(t) = \mathrm{sinc}(t/T_s) = \frac{\sin(\pi t/T_s)}{\pi t/T_s}$

$F_s = 1/T_s$: sampling frequency
$F_s/2$: Nyquist frequency

Sampling of continuous-time signals (cont. 4)

Interpretation in the frequency domain:

$X_s(\Omega) = \frac{1}{T_s} \sum_{k=-\infty}^{\infty} X_c(\Omega - k\Omega_s)$, where $\Omega_s = 2\pi/T_s$

$X_s(\Omega)$: spectrum of $x_s(t)$; $X_c(\Omega)$: spectrum of $x_c(t)$. The spectrum of the sampled signal consists of copies of $X_c(\Omega)$ repeated every $\Omega_s$, so the copies do not overlap only if $x_c(t)$ is band-limited to $F_s/2$.
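The sinc reconstruction formula can be checked numerically. The following is a minimal Python sketch, not part of the original handout: the 8 kHz sampling frequency, the 440 Hz test tone and the helper names `sinc` and `reconstruct` are assumptions chosen for illustration. Only a finite number of samples is summed, so the result approximates the ideal infinite sum.

```python
import math

def sinc(u):
    """Normalised sinc: sin(pi*u) / (pi*u), with sinc(0) = 1."""
    return 1.0 if u == 0.0 else math.sin(math.pi * u) / (math.pi * u)

def reconstruct(samples, Ts, t):
    """Approximate x_c(t) = sum_k x[k] * sinc((t - k*Ts) / Ts) from finitely many samples."""
    return sum(xk * sinc((t - k * Ts) / Ts) for k, xk in enumerate(samples))

Fs = 8000.0   # assumed sampling frequency [Hz]
Ts = 1.0 / Fs
f0 = 440.0    # assumed test tone, well below the Nyquist frequency Fs/2

# x[n] = x_c(n * Ts) for a sine wave
x = [math.sin(2 * math.pi * f0 * n * Ts) for n in range(2000)]

# Evaluate the reconstruction halfway between two samples, far from the edges
t_mid = 1000.5 * Ts
approx = reconstruct(x, Ts, t_mid)
exact = math.sin(2 * math.pi * f0 * t_mid)
print(approx, exact)
```

The two printed values should agree closely; the small residual comes from truncating the sum to 2000 samples.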

Sampling of continuous-time signals (cont. 5)

Questions
1. What sampling frequencies ($F_s$) are used for ASR?
   microphone voice: 12 kHz to 20 kHz
   telephone voice: about 8 kHz
2. What are the advantages / disadvantages of using a higher $F_s$?
3. Why is pre-emphasis (+6 dB/oct.) employed?
   $x[n] = x_0[n] - a\, x_0[n-1]$, $a = 0.95$ to $0.97$

Spectral analysis: Fourier Transform

FT for continuous-time signals (and continuous frequency)
$X_c(\Omega) = \int_{-\infty}^{\infty} x_c(t)\, e^{-j\Omega t}\, dt$   (time domain -> freq. domain)
$x_c(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X_c(\Omega)\, e^{j\Omega t}\, d\Omega$   (freq. domain -> time domain)

FT for discrete-time signals (and continuous frequency)
$X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n}$
$x[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\omega})\, e^{j\omega n}\, d\omega$

$|X(e^{j\omega})|^2$: power spectrum
$\log |X(e^{j\omega})|^2$: log power spectrum

where $\omega = 2\pi f$, $f = 1/T$, $\omega = T_s \Omega$, $e^{j\omega n} = \cos(\omega n) + j \sin(\omega n)$, and $j$ is the imaginary unit.

An interpretation of FT

Inner product between two vectors (linear algebra)

2-dimensional case: $a = (a_1, a_2)^t$, $b = (b_1, b_2)^t$
$a \cdot b = a^t b = a_1 b_1 + a_2 b_2 = \|a\| \|b\| \cos\theta$, which equals $\|a\| \cos\theta$ if $\|b\| = 1$

[Figure: vectors a and b with angle θ; the projection of a onto b has length ‖a‖ cos θ]

Infinite-dimensional case: $x = \{x[n]\}$, $e_\omega = \{e^{j\omega n}\} = \{\cos(\omega n) + j \sin(\omega n)\} = \cos_\omega + j \sin_\omega$
$X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n} = x \cdot \cos_\omega - j\, x \cdot \sin_\omega$
(the inner product of $x$ with $e_\omega$, taking the complex conjugate of $e_\omega$)

$x \cdot \cos_\omega$: how much of the $\cos_\omega$ component is contained in $x$

Short-time Spectrum Analysis

Problem with FT
- It assumes that signals are stationary, i.e. that signal properties do not change over time.
- If signals are non-stationary, it loses information on time-varying features.

Short-time Fourier transform (STFT) (time-dependent Fourier transform)
- Divide the signal into short-time segments (frames) and apply the FT to each frame.
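Two of the formulas in this part, pre-emphasis and the discrete-time FT, are easy to try out. The sketch below is an illustration only and assumes a = 0.97, a zero sample before the start of the signal, and hypothetical helper names `pre_emphasis` and `dtft`. It evaluates the gain of the pre-emphasis filter at a low and at the highest frequency, and checks the inner-product reading of the FT on a short arbitrary sequence.

```python
import cmath
import math

def pre_emphasis(x0, a=0.97):
    """x[n] = x0[n] - a * x0[n-1], assuming x0[-1] = 0."""
    return [x0[n] - a * (x0[n - 1] if n > 0 else 0.0) for n in range(len(x0))]

def dtft(x, omega):
    """X(e^{j omega}) = sum_n x[n] e^{-j omega n} for a finite-length sequence."""
    return sum(xn * cmath.exp(-1j * omega * n) for n, xn in enumerate(x))

# Impulse response of the pre-emphasis filter: feed it a unit impulse.
h = pre_emphasis([1.0, 0.0])

# Its DTFT is the frequency response: small gain at low frequency, large at omega = pi.
gain_low = abs(dtft(h, 0.01 * math.pi))
gain_high = abs(dtft(h, math.pi))
print(h, gain_low, gain_high)

# Inner-product view: real part = x . cos_omega, imaginary part = -(x . sin_omega).
x = [0.5, -1.0, 2.0, 0.25]
omega = 0.3
X = dtft(x, omega)
cos_part = sum(xn * math.cos(omega * n) for n, xn in enumerate(x))
sin_part = sum(xn * math.sin(omega * n) for n, xn in enumerate(x))
```

The filter attenuates low frequencies and boosts high ones, which compensates for the downward tilt of the speech spectrum.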

Short-time Spectrum Analysis (cont. 2)

[Figure]

Short-time Spectrum Analysis (cont. 3)

Trade-off problem of short-time spectrum analysis: frequency resolution vs. time resolution
- short window: good time resolution, poor frequency resolution
- long window: good frequency resolution, poor time resolution

A compromise:
- window width (frame width): 20 to 30 ms
- window shift (frame shift): 5 to 15 ms

The Effect of Windowing in STFT

Time domain:
$y_k[n] = w_k[n]\, x[n]$, where $w_k[n]$ is the time window for the k-th frame

Simply cutting out a short segment (frame) from $x[n]$ amounts to applying a rectangular window to $x[n]$, which causes discontinuities at the edges of the segment. Instead, a tapered window is usually used, e.g. a Hamming ($\alpha = 0.46164$) or Hanning ($\alpha = 0.5$) window:

$w[l] = (1 - \alpha) - \alpha \cos\left(\frac{2\pi l}{N - 1}\right)$, where $N$ is the window width

[Figure: comparison of the rectangle, Hamming, Hanning, Blackman and Bartlett windows]

The Effect of Windowing in STFT (cont. 2)

Frequency domain:
$Y_k(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} W_k(e^{j\theta})\, X(e^{j(\omega - \theta)})\, d\theta$   (periodic convolution)

The spectrum of the frame is given as a periodic convolution between the spectra of $x[n]$ and $w_k[n]$.

If we want $Y_k(e^{j\omega}) = X(e^{j\omega})$, the necessary and sufficient condition is $W_k(e^{j\omega}) = 2\pi\, \delta(\omega)$ for $|\omega| \le \pi$, i.e. $w_k[n] = 1$ for all $n$, which means the length of $w_k[n]$ is infinite. Hence there is no window function of finite length that causes no distortion.
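Framing and windowing can be sketched in a few lines. This example is an illustration under assumed settings (16 kHz sampling, 25 ms frames, 10 ms shift, a one-second test signal); the function names `cosine_window`, `frames` and `windowed_frames` are hypothetical. The window follows the tapered-window formula, with alpha selecting Hamming or Hanning.

```python
import math

def cosine_window(N, alpha):
    """w[l] = (1 - alpha) - alpha * cos(2*pi*l / (N - 1)), l = 0..N-1."""
    return [(1 - alpha) - alpha * math.cos(2 * math.pi * l / (N - 1)) for l in range(N)]

def frames(x, width, shift):
    """Cut x into overlapping frames of `width` samples, advancing by `shift` samples."""
    return [x[s:s + width] for s in range(0, len(x) - width + 1, shift)]

def windowed_frames(x, width, shift, alpha=0.46164):
    """y_k[n] = w[n] * x[n] for every frame k (Hamming window by default)."""
    w = cosine_window(width, alpha)
    return [[wl * xl for wl, xl in zip(w, frame)] for frame in frames(x, width, shift)]

Fs = 16000                      # assumed sampling frequency [Hz]
width = Fs * 25 // 1000         # 25 ms frame width -> 400 samples
shift = Fs * 10 // 1000         # 10 ms frame shift -> 160 samples
x = [math.sin(2 * math.pi * 200.0 * n / Fs) for n in range(Fs)]  # 1 s test tone

framed = windowed_frames(x, width, shift)
hamming = cosine_window(401, 0.46164)
hanning = cosine_window(401, 0.5)
print(len(framed), len(framed[0]), hamming[0], hanning[0])
```

The Hanning window reaches zero at its edges while the Hamming window stops at a small positive value; both remove the abrupt edge discontinuity of the rectangular window.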

The Effect of Windowing in STFT (cont. 3)

[Figure: spectral analysis of two sine signals of close frequencies]

Problems with STFT

- The estimated power spectrum contains harmonics of F0, which makes it difficult to estimate the envelope of the spectrum.
- Frequency bins of the STFT are highly correlated with each other, i.e. the power spectrum representation is highly redundant.

[Figure: log spectrum, Log |X(ω)|]

Cepstrum Analysis

Idea: split (deconvolve) the power spectrum into the spectrum envelope and the F0 harmonics.

1. Log-spectrum [freq. domain]
2. Inverse Fourier transform -> cepstrum [time domain] (quefrency)
3. Liftering to get the low / high part (lifter: a filter used in the cepstral domain)
4. Fourier transform -> smoothed spectrum [freq. domain] (from the low part of the cepstrum), or the log-spectrum of the high part of the cepstrum

[Figure: four panels: Log |X(ω)|, Cepstrum, Envelope, Residue]

Cepstrum Analysis (cont. 2)

$x[n] = h[n] * v[n]$, where $h[n]$: vocal tract, $v[n]$: glottal sounds

Fourier transform: $X(e^{j\omega}) = H(e^{j\omega})\, V(e^{j\omega})$

Log spectrum: $\log |X(e^{j\omega})| = \log |H(e^{j\omega})| + \log |V(e^{j\omega})|$
(first term: spectral envelope; second term: spectral fine structure)

Cepstrum: $c(\tau) = F^{-1}\{\log |X(e^{j\omega})|\} = F^{-1}\{\log |H(e^{j\omega})|\} + F^{-1}\{\log |V(e^{j\omega})|\}$
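The cepstrum steps (log spectrum, inverse transform, liftering, forward transform) can be sketched as follows. This is a toy illustration, not the handout's implementation: it uses a naive O(N^2) DFT, a synthetic "voiced" frame built from a decaying resonance repeated every 20 samples, a lifter cutoff of 15 and a small floor inside the logarithm, all of which are assumptions.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (adequate for short frames)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def real_cepstrum(x, floor=1e-10):
    """c = IDFT(log|DFT(x)|); the floor avoids log(0)."""
    log_mag = [math.log(abs(Xk) + floor) for Xk in dft(x)]
    return [c.real for c in idft(log_mag)], log_mag

def lifter(cep, cutoff):
    """Split the cepstrum into low- and high-quefrency parts (symmetric in quefrency)."""
    N = len(cep)
    low = [cep[t] if (t < cutoff or t > N - cutoff) else 0.0 for t in range(N)]
    high = [c - l for c, l in zip(cep, low)]
    return low, high

# Toy frame: a decaying resonance h[n] excited by pulses every `period` samples
N, period = 128, 20
h = [0.9 ** n * math.cos(0.3 * math.pi * n) for n in range(40)]
x = [sum(h[n - p] for p in range(0, N, period) if 0 <= n - p < len(h)) for n in range(N)]

cep, log_mag = real_cepstrum(x)
low, high = lifter(cep, cutoff=15)
envelope = [v.real for v in dft(low)]   # smoothed log spectrum (spectral envelope)
fine = [v.real for v in dft(high)]      # log spectrum of the fine structure
```

Because liftering only splits the cepstrum into two parts, `envelope` and `fine` add up to the original log spectrum, mirroring the additive decomposition of log |X| into log |H| and log |V|.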

LPC Analysis

Linear Predictive Coding (LPC): a model-based / parametric spectrum estimation

Assume a linear system for human speech production:
sound source $x[n]$ -> vocal tract $h[n]$ -> speech $y[n]$

$y[n] = h[n] * x[n] = \sum_{k=0}^{\infty} h[k]\, x[n-k]$, where $h[n]$ is the impulse response

Using a model enables us to
- estimate a spectrum of the vocal tract from a small amount of observations
- represent the spectrum with a small number of parameters
- synthesise speech with the parameters

LPC analysis in detail

Predict $y[n]$ from $y[n-1], y[n-2], \ldots$:
$\hat{y}[n] = \sum_{k=1}^{N} a_k\, y[n-k]$
$e[n] = y[n] - \hat{y}[n] = y[n] - \sum_{k=1}^{N} a_k\, y[n-k]$   (prediction error)

Optimisation problem: find $\{a_k\}$ that minimises the mean square (MS) error
$P_e = E\{e^2[n]\} = E\left\{ \left( y[n] - \sum_{k=1}^{N} a_k\, y[n-k] \right)^2 \right\}$

$\{a_k\}$: LPC coefficients

Spectrum estimated by FT & LPC

[Figure]

LPC summary

- A spectrum can be modelled / coded with around 14 LPCs.
- LPC family
  - PARCOR (Partial Auto-Correlation Coefficient)
  - LSP (Line Spectral Pairs) / LSF (Line Spectrum Frequencies)
  - CSM (Composite Sinusoidal Model)
- LPC can be used to predict log-area ratio coefficients (lossless tube model).
- LPC-(Mel)Cepstrum: LPC-based cepstrum.
- Drawbacks:
  - LPC assumes an AR model, which is not suited to modelling nasal sounds that have zeros in the spectrum.
  - It is difficult to determine the prediction order N.
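One common way to solve the optimisation problem above is the autocorrelation method with the Levinson-Durbin recursion; the handout does not say which solver it has in mind, so this choice is an assumption. The sketch below uses hypothetical function names and a synthetic second-order test signal, and recovers the coefficients that generated the signal.

```python
def autocorr(y, max_lag):
    """r[m] = sum_n y[n] * y[n-m] for m = 0..max_lag."""
    return [sum(y[n] * y[n - m] for n in range(m, len(y))) for m in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve sum_k a_k r[|m-k|] = r[m], m = 1..order, for the predictor
    y_hat[n] = sum_k a_k y[n-k]. Returns (a_1..a_order, residual error)."""
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient for order i
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

# Synthetic test signal: impulse response of y[n] = 1.3 y[n-1] - 0.6 y[n-2] + x[n]
true_a = [1.3, -0.6]
y = []
for n in range(300):
    excitation = 1.0 if n == 0 else 0.0
    past = sum(true_a[k] * y[n - 1 - k] for k in range(2) if n - 1 - k >= 0)
    y.append(excitation + past)

a, err = levinson_durbin(autocorr(y, 2), 2)
print(a, err)
```

For this idealised signal the estimated coefficients match the generating ones and the residual energy equals the energy of the single excitation impulse. Real speech frames would first be windowed, and the match would only be approximate.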

Taking Perceptual Attributes into Account

Physical quality                         Perceptual quality
Intensity                                Loudness
Fundamental frequency                    Pitch
Spectral shape                           Timbre
Onset/offset time                        Timing
Phase difference in binaural hearing     Location

Technical terms
- equal-loudness contours
- masking
- auditory filters (critical-band filters)
- critical bandwidth

Taking Perceptual Attributes into Account (cont. 2)

[Figure]

Taking Perceptual Attributes into Account (cont. 3)

Non-linear frequency scale

Bark scale: $b(f) = 13 \arctan(0.00076 f) + 3.5 \arctan((f/7500)^2)$
Mel scale: $B(f) = 1125 \ln(1 + f/700)$

[Figure: Bark frequency [Bark] against linear frequency [Hz]]
[Figure: warped normalised frequency against linear frequency [Hz] for the Bark, Mel and ln scales]

Filter Bank Analysis

Speech $x[n]$ -> Bandpass filter 1, ..., Bandpass filter K -> $x_1[n], \ldots, x_K[n]$

$x_i[n] = h_i[n] * x[n] = \sum_{k=0}^{M_i - 1} h_i[k]\, x[n-k]$

$h_i[n]$: impulse response of bandpass filter $i$

[Figure: filter centre frequencies $\omega_1, \omega_2, \omega_3, \ldots, \omega_K$ spaced on a perceptual scale]
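The Bark and mel formulas translate directly into code. In this short sketch the function names are hypothetical, and the inverse `mel_to_hz` is not given in the handout; it is simply the algebraic inverse of the mel formula.

```python
import math

def hz_to_bark(f):
    """Bark scale: b(f) = 13 arctan(0.00076 f) + 3.5 arctan((f / 7500)^2)."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

def hz_to_mel(f):
    """Mel scale: B(f) = 1125 ln(1 + f / 700)."""
    return 1125.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel (derived here, not taken from the handout)."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)

for f in (100.0, 1000.0, 4000.0, 8000.0):
    print(f, round(hz_to_bark(f), 2), round(hz_to_mel(f), 1))
```

Both scales are close to linear at low frequencies and compress high frequencies, which is why perceptually spaced filter banks place more filters at the low end.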

Filter Bank Analysis (cont. 2)

For each channel $i = 1, \ldots, K$:
Speech $x[n]$ -> Bandpass filter $i$ -> $x_i[n]$ -> Nonlinearity -> $v_i[n]$ -> Lowpass filter -> $y_i[n]$ -> Down sampling

Trade-off problem: frequency resolution (number of filters, length of filter) vs. time resolution

Filter Bank Analysis (cont. 3)

Another implementation: apply a mel-scale filter bank to the STFT power spectrum to obtain a mel-scale power spectrum.

DFT (STFT) power spectrum -> Triangular band-pass filters -> Mel-scale power spectrum

[Figure: triangular band-pass filters over the frequency bins]

MFCC

MFCC: Mel-frequency Cepstrum Coefficients

$x[n]$ -> DFT -> $X[k]$ -> $|X[k]|^2$ -> Mel-frequency filterbank -> $S[m]$ -> log -> $\log S[m]$ -> DCT -> $c[n]$

DCT: $c[n] = \sqrt{\frac{2}{N}} \sum_{i=1}^{N} s[i] \cos\left(\frac{\pi n (i - 0.5)}{N}\right)$, where $s[i] = \log S[i]$

- MFCCs are widely used in HMM-based ASR systems.
- The first 12 MFCCs ($c[1]$ to $c[12]$) are generally used.

MFCC (cont. 2)

- MFCCs are less correlated with each other than the DFT or filter-bank based spectrum.
- Good compression rate.

Feature          Dimensionality / frame
Speech wave      400
DFT spectrum     64 to 256
Filter-bank      20
MFCC             12

where $F_s$ = 16 kHz, frame width = 25 ms and frame shift = 10 ms are assumed.

- MFCCs show better ASR performance than filter-bank features, but MFCCs are not robust against noise.
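The last three stages of the MFCC pipeline (mel filterbank, log, DCT) can be sketched as below. Everything specific here is an assumption for illustration: 20 triangular filters equally spaced on the mel scale, a 257-bin power spectrum covering 0 to 8 kHz, unnormalised filter heights, a small floor inside the logarithm, and the hypothetical function names. The input power spectrum is an arbitrary made-up vector, not real speech.

```python
import math

def hz_to_mel(f):
    return 1125.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (math.exp(m / 1125.0) - 1.0)

def mel_filterbank(num_filters, num_bins, Fs):
    """Triangular filters whose edges are equally spaced on the mel scale.
    The num_bins power-spectrum bins are assumed to cover 0 .. Fs/2."""
    mel_max = hz_to_mel(Fs / 2.0)
    edges = [mel_to_hz(mel_max * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bin_hz = [k * (Fs / 2.0) / (num_bins - 1) for k in range(num_bins)]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))    # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))    # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def log_filterbank_energies(power, bank, floor=1e-10):
    """s[i] = log S[i], with S[i] the filter-weighted sum of the power spectrum."""
    return [math.log(sum(w * p for w, p in zip(row, power)) + floor) for row in bank]

def dct_cepstrum(s, num_ceps=12):
    """c[n] = sqrt(2/N) * sum_{i=1..N} s[i] cos(pi n (i - 0.5) / N), n = 1..num_ceps."""
    N = len(s)
    return [math.sqrt(2.0 / N) * sum(s[i - 1] * math.cos(math.pi * n * (i - 0.5) / N)
                                      for i in range(1, N + 1))
            for n in range(1, num_ceps + 1)]

bank = mel_filterbank(num_filters=20, num_bins=257, Fs=16000.0)
power = [1.0 / (1.0 + k) for k in range(257)]        # made-up power spectrum
mfcc = dct_cepstrum(log_filterbank_energies(power, bank))
print(len(bank), len(mfcc))
```

With these settings each frame is reduced from 257 spectral bins to 20 filter-bank energies and then to 12 cepstral coefficients, in line with the dimensionality table.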

Perceptually-based Linear Prediction (PLP)

[Hermansky, 1985, 1990]

PLP has been shown experimentally to be
- more noise robust
- more speaker independent
than MFCCs.

Other features with low dimensionality

Formants (F1, F2, F3, ...)

They are not used in modern ASR systems, but why?

Using temporal features: dynamic features

In SP lab-sessions on speech recognition using HTK:
- MFCCs and energy
- ΔMFCCs, Δenergy
- Δ²MFCCs, Δ²energy

Δ, Δ²: delta features (dynamic features / time derivatives) [Furui, 1986]

continuous time -> discrete time
$c(t)$ -> $c[n]$
$c'(t) = \frac{dc(t)}{dt}$ -> $\Delta c[n] = \sum_{i=-M}^{M} w_i\, c[n+i]$
$c''(t) = \frac{d^2 c(t)}{dt^2}$ -> $\Delta^2 c[n] = \sum_{i=-M}^{M} w_i\, \Delta c[n+i]$

Using temporal features: dynamic features (cont. 2)

[Figure: trajectory c(t) against time, with its slope c'(t0) at time t0]
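The delta formula leaves the weights $w_i$ open. A common choice, assumed here and not stated in the handout, is the linear-regression weighting $w_i = i / (2 \sum_{j=1}^{M} j^2)$, with the first and last frames repeated at the edges. The sketch below uses a hypothetical function name `delta` and a toy two-dimensional feature sequence.

```python
def delta(c, M=2):
    """Delta features of a sequence of feature vectors c[n] (list of lists):
    delta_c[n] = sum_{i=-M..M} w_i * c[n+i], with regression weights
    w_i = i / (2 * sum_{j=1..M} j^2) (an assumed, commonly used choice)."""
    denom = 2.0 * sum(j * j for j in range(1, M + 1))
    T, dim = len(c), len(c[0])
    out = []
    for n in range(T):
        row = [0.0] * dim
        for i in range(-M, M + 1):
            src = c[min(max(n + i, 0), T - 1)]   # repeat edge frames
            for d in range(dim):
                row[d] += (i / denom) * src[d]
        out.append(row)
    return out

# Toy static features: two coefficients changing linearly over 20 frames
c = [[2.0 * n, -1.0 * n] for n in range(20)]
d = delta(c)            # first-order dynamic features
dd = delta(d)           # second-order dynamic features (delta of delta)

# Augmented feature vector per frame: static + delta + delta-delta
augmented = [c[n] + d[n] + dd[n] for n in range(len(c))]
print(d[10], dd[10], len(augmented[10]))
```

Away from the edges, the delta of a linear trajectory equals its slope and the delta-delta is zero, as expected of first and second time derivatives. Applied to 12 MFCCs plus energy, the same augmentation gives 39 features per frame.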

Using temporal features: dynamic features (cont. 3)

- An acoustic feature vector, e.g. MFCCs, representing part of a speech signal is highly correlated with its neighbours.
- HMM-based acoustic models assume there is no dependency between the observations.
- Those correlations can be captured to some extent by augmenting the original set of static acoustic features, e.g. MFCCs, with dynamic features.

SUMMARY

- Nyquist sampling theorem
- Short-time spectrum analysis
  - Non-parametric methods: short-time Fourier transform, cepstrum, MFCC, filter bank
  - Parametric methods: LPC, PLP
- Windowing effect: trade-off between time and frequency resolutions
- Dynamic features (delta features)
- There is no single best feature that can be used for every purpose, but MFCC is widely used for ASR and TTS.

SUMMARY (cont. 2)

- Front-end analysis has a great influence on ASR performance.
- For robust ASR in real environments, various techniques for front-end processing have been proposed, e.g. spectral subtraction (SS) and cepstral mean normalisation (CMN).
- Do not take at face value what you get from spectral analysis: you are not seeing the true spectrum. You are looking at speech signals / features through a pinhole (sampled and windowed).
