Automatic Speech Recognition handout (1)

Speech Signal Processing and Feature Extraction
Jan - Mar 2012, Revision: 1.1
Hiroshi Shimodaira (h.shimodaira@ed.ac.uk)

Speech Communication
[Figure: the speech chain. Speaker: intention → language → motion control → articulatory organs (vocal tract) and signal source (vocal cords) → speech sound. Listener: auditory organs → auditory processing → language → understanding.]
ASR (H. Shimodaira) I : 1

Spectrogram
[Figure: waveform; spectrogram (frequency [kHz] against time [s]); cross-section of the spectrogram (intensity [dB] against frequency [kHz]).]

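Speech Production Model
[Figure: vocal organs and vocal tract (lungs, larynx, vocal folds, pharynx, oral cavity, tongue, teeth, lips, nasal cavity) next to the corresponding block diagram: vocal folds → larynx + pharynx → mouth cavity and nasal cavity (formants F1, F2, F3) → lips.]
Vocal folds: source signal v(t) with period $T_0$ and fundamental frequency $F_0 = 1/T_0$; spectrum $V(\Omega)$ with a slope of -12 dB/oct.
Vocal tract: impulse response h(t), transfer function $H(\Omega)$ with formants F1, F2, F3.
Lips: radiation characteristic of +6 dB/oct.
Speech: x(t), spectrum $X(\Omega)$ with an overall slope of -6 dB/oct.
Time domain: $x(t) = h(t) * v(t) = \int h(\tau)\, v(t - \tau)\, d\tau$
Frequency domain (Fourier transform): $X(\Omega) = H(\Omega)\, V(\Omega)$
$\Omega$: angular frequency ($= 2\pi F$), F: frequency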

Automatic Speech Recognition
Find the word sequence W that attains
$\max_W P(W \mid X) = \max_W \frac{P(X \mid W)\, P(W)}{P(X)}$
where X is the sequence of acoustic feature vectors extracted from the speech signal.

Signal Analysis for ASR
Front-end analysis: convert the acoustic signal into a sequence of feature vectors, e.g. MFCCs or PLP cepstral coefficients.
Processing chain: $x_c(t)$ → LPF (low-pass filter) → A/D conversion (sampling frequency $F_s$) → pre-emphasis → x[n] → spectral analysis (analysis window, frame shift) → feature extraction → $c_m[k]$
m: frame number, k: feature index

Feature parameters for ASR
Features should
- contain sufficient information to distinguish phonemes / phones
  - good time resolution [e.g. 10 ms]
  - good frequency resolution [e.g. 20 channels / Bark scale]
- not contain (or be separated from) $F_0$ and its harmonics
- be robust against speaker variation
- be robust against noise / channel distortions
- have good characteristics in terms of pattern recognition
  - the number of features is as small as possible
  - features are independent of each other

Converting analogue signals to machine-readable form
Discretisation (sampling): $x_c(t)$ → x[n]
- continuous time → discrete time
- continuous amplitude → discrete amplitude
Problem: information can be lost by sampling.

Sampling of continuous-time signals
Continuous-time signal: $x_c(t)$
Signal modulated by a periodic impulse train:
$x_s(t) = x_c(t) \sum_{n=-\infty}^{\infty} \delta(t - nT_s) = \sum_{n=-\infty}^{\infty} x_c(nT_s)\, \delta(t - nT_s)$
Sampled signal (discrete-time signal): $x[n] = x_c(nT_s)$, i.e. the weight of the n-th impulse of $x_s(t)$
$T_s$: sampling interval

Sampling of continuous-time signals (cont. 3)
Q: Is the C/D conversion invertible? $x_c(t)$ → C/D → x[n] → D/C → $x_c(t)$?
A: No in general, but yes under a special condition.
Nyquist sampling theorem: if $x_c(t)$ is band-limited (i.e. it has no frequency components > $F_s/2$), then $x_c(t)$ can be fully reconstructed from x[n]:
$x_c(t) = h_{T_s}(t) * \sum_{k=-\infty}^{\infty} x[k]\, \delta(t - kT_s) = \sum_{k=-\infty}^{\infty} x[k]\, h_{T_s}(t - kT_s)$
$h_{T_s}(t) = \mathrm{sinc}(t/T_s) = \frac{\sin(\pi t / T_s)}{\pi t / T_s}$
$F_s/2$: Nyquist frequency, $F_s = 1/T_s$: sampling frequency
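The reconstruction formula can be tried numerically. The following is a minimal sketch, not part of the original handout: the sampling frequency, the 440 Hz test tone and the helper names are assumptions chosen for illustration, and the infinite sum is truncated to a finite block of samples, so the result is only approximate.

```python
import math

def sinc(u):
    """Normalised sinc: sin(pi*u) / (pi*u), with sinc(0) = 1."""
    return 1.0 if u == 0 else math.sin(math.pi * u) / (math.pi * u)

def reconstruct(samples, ts, t):
    """Approximate x_c(t) from x[k] = x_c(k*ts) by a truncated sinc sum."""
    return sum(xk * sinc((t - k * ts) / ts) for k, xk in enumerate(samples))

fs = 8000.0          # sampling frequency in Hz (illustrative assumption)
ts = 1.0 / fs        # sampling interval T_s
f0 = 440.0           # test tone, well below the Nyquist frequency fs/2
samples = [math.sin(2 * math.pi * f0 * k * ts) for k in range(2000)]

t = 1000.5 * ts      # halfway between two samples, far from the block edges
estimate = reconstruct(samples, ts, t)
exact = math.sin(2 * math.pi * f0 * t)
error = abs(estimate - exact)

# At a sampling instant the sum collapses to the sample itself.
at_sample = reconstruct(samples, ts, 5 * ts)
```

The residual `error` comes only from truncating the sum to 2000 samples; it shrinks as more neighbouring samples are included.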

Sampling of continuous-time signals (cont. 4)
Interpretation in the frequency domain:
$X_s(\Omega) = \frac{1}{T_s} \sum_{k=-\infty}^{\infty} X_c(\Omega - k\Omega_s)$
$X_s(\Omega)$: spectrum of $x_s(t)$, $X_c(\Omega)$: spectrum of $x_c(t)$, $\Omega_s = 2\pi F_s$: angular sampling frequency
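Because the sampled spectrum is a sum of copies of the original spectrum shifted by multiples of the sampling frequency, a component above the Nyquist frequency lands on top of a lower one (aliasing). A small sketch with illustrative values (the 8 kHz sampling frequency and the 3 kHz / 5 kHz tones are assumptions, not taken from the slides):

```python
import math

fs = 8000.0                      # sampling frequency in Hz (illustrative)

def sample_tone(freq_hz, n_samples):
    """Sample sin(2*pi*freq*t) at t = n / fs."""
    return [math.sin(2 * math.pi * freq_hz * n / fs) for n in range(n_samples)]

below_nyquist = sample_tone(3000.0, 16)   # 3 kHz < fs/2 = 4 kHz
above_nyquist = sample_tone(5000.0, 16)   # 5 kHz > fs/2, aliases to fs - 5 kHz = 3 kHz

# The two sample sequences differ only in sign, so after sampling the
# 5 kHz tone cannot be told apart from a phase-inverted 3 kHz tone.
max_diff = max(abs(a + b) for a, b in zip(below_nyquist, above_nyquist))
```

This is why the low-pass filter is placed before the A/D conversion in the front-end.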

Sampling of continuous-time signals (cont. 5)
[Figure: the front-end processing chain, $x_c(t)$ → LPF (low-pass filter) → A/D conversion (sampling frequency $F_s$) → pre-emphasis → x[n] → spectral analysis (analysis window, frame shift) → feature extraction → $c_m[k]$; m: frame number, k: feature index]
Questions
1. What sampling frequencies ($F_s$) are used for ASR?
   microphone voice: 12 kHz - 20 kHz
   telephone voice: 8 kHz
2. What are the advantages / disadvantages of using a higher $F_s$?
3. Why is pre-emphasis (+6 dB/oct.) employed?
   $x[n] = x_0[n] - a\, x_0[n-1]$, a =
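The pre-emphasis filter is a first-order difference. A minimal sketch follows; the coefficient value 0.97 is an assumption (a commonly used choice), since the value on the slide is not reproduced here, and the function name is hypothetical.

```python
def pre_emphasis(x0, a=0.97):
    """Return x[n] = x0[n] - a * x0[n-1], taking x0[-1] = 0.

    The default a = 0.97 is an assumed, commonly used value.
    """
    out = []
    prev = 0.0
    for s in x0:
        out.append(s - a * prev)
        prev = s
    return out

constant = pre_emphasis([1.0] * 5)           # lowest frequency (DC)
alternating = pre_emphasis([1.0, -1.0] * 3)  # highest frequency (fs/2)
```

After the first sample, the constant input is reduced to 1 - a while the alternating input is raised to 1 + a in magnitude, which is the high-frequency boost that offsets the falling slope of the speech spectrum.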

Spectral analysis: Fourier Transform
FT for continuous-time signals (& continuous frequency)
$X_c(\Omega) = \int_{-\infty}^{\infty} x_c(t)\, e^{-j\Omega t}\, dt$ (time domain → freq. domain)
$x_c(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X_c(\Omega)\, e^{j\Omega t}\, d\Omega$ (freq. domain → time domain)
FT for discrete-time signals (& continuous frequency)
$X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n}$
$x[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\omega})\, e^{j\omega n}\, d\omega$
$|X(e^{j\omega})|^2$: power spectrum
$\log |X(e^{j\omega})|^2$: log power spectrum
where $\omega = T_s \Omega = 2\pi f$, $e^{j\omega n} = \cos(\omega n) + j \sin(\omega n)$, j: the imaginary unit

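An interpretation of FT
Inner product between two vectors (linear algebra)
2-dimensional case: $a = (a_1, a_2)^t$, $b = (b_1, b_2)^t$
$a \cdot b = a^t b = a_1 b_1 + a_2 b_2 = \|a\| \|b\| \cos\theta$, which equals $\|a\| \cos\theta$ if $\|b\| = 1$
[Figure: vectors a and b with angle θ; $\|a\| \cos\theta$ is the length of the projection of a onto b]
Infinite-dimensional case:
$x \equiv \{x[n]\}$, $e_\omega \equiv \{e^{j\omega n}\} = \{\cos(\omega n) + j \sin(\omega n)\} \equiv \cos_\omega + j \sin_\omega$
$X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n} = x \cdot e_\omega = x \cdot \cos_\omega - j\, x \cdot \sin_\omega$
(the inner product of complex sequences conjugates its second argument, hence the minus sign)
$x \cdot \cos_\omega$: proportion of how much of the $\cos_\omega$ component is contained in x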

Short-time Spectrum Analysis
Problem with FT
- It assumes that signals are stationary: signal properties do not change over time.
- If signals are non-stationary, information on time-varying features is lost.
Short-time Fourier transform (STFT) (time-dependent Fourier transform)
- Divide the signal x[n] into short-time segments (frames) $x_k[m]$ and apply FT to each segment.
  x[n] → $x_1[m], x_2[m], \ldots, x_k[m], \ldots$
  X(ω) → $X_1(\omega), X_2(\omega), \ldots, X_k(\omega), \ldots$

Short-time Spectrum Analysis (cont. 2)
[Figure: windowing of the signal into frames with a frame shift; a discrete Fourier transform of each frame gives the short-time power spectrum (intensity against frequency), shown over time (frame) and frequency.]

Short-time Spectrum Analysis (cont. 3)
Trade-off problem of short-time spectrum analysis: frequency resolution against time resolution.
- A short window gives good time resolution but poor frequency resolution; a long window gives the reverse.
A compromise for ASR:
- window width (frame width): ms
- window shift (frame shift): 5 - 15 ms
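As a rough sketch of how a signal is cut into overlapping frames, the snippet below uses the settings quoted later in the handout for the MFCC dimensionality comparison ($F_s$ = 16 kHz, frame width 25 ms, frame shift 10 ms). The function name and the choice to drop a trailing partial frame are assumptions made for illustration.

```python
def split_into_frames(x, frame_len, frame_shift):
    """Cut x[n] into overlapping frames; a trailing partial frame is dropped."""
    if len(x) < frame_len:
        return []
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    return [x[k * frame_shift : k * frame_shift + frame_len] for k in range(n_frames)]

fs = 16000                         # sampling frequency in Hz
frame_len = fs * 25 // 1000        # 25 ms -> 400 samples
frame_shift = fs * 10 // 1000      # 10 ms -> 160 samples
signal = [0.0] * fs                # one second of placeholder samples
frames = split_into_frames(signal, frame_len, frame_shift)
```

With these settings one second of speech yields 98 frames of 400 samples each, and consecutive frames overlap by 240 samples.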

The Effect of Windowing in STFT
Time domain: $y_k[n] = w_k[n]\, x[n]$, $w_k[n]$: time window for the k-th frame
- Simply cutting out a short segment (frame) from x[n] amounts to applying a rectangular window to x[n], which causes discontinuities at the edges of the segment.
- Instead, a tapered window is usually used, e.g. a Hamming (α ≈ 0.46) or Hanning (α = 0.5) window:
  $w[l] = (1 - \alpha) - \alpha \cos\left(\frac{2\pi l}{N - 1}\right)$, N: window width
[Figure: shapes of the rectangular, Hamming, Hanning, Blackman and Bartlett windows]
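The window family above can be generated directly from the formula. In this sketch the function name is hypothetical, the window width of 400 samples is an illustrative choice, and α = 0.46 is the standard Hamming value in this parameterisation (stated here as an assumption about what the slide intends).

```python
import math

def cosine_window(n, alpha):
    """w[l] = (1 - alpha) - alpha * cos(2*pi*l / (n - 1)) for l = 0..n-1."""
    return [(1 - alpha) - alpha * math.cos(2 * math.pi * l / (n - 1))
            for l in range(n)]

rectangular = cosine_window(400, 0.0)   # alpha = 0: no tapering at all
hanning = cosine_window(400, 0.5)       # tapers to exactly 0 at both edges
hamming = cosine_window(400, 0.46)      # tapers to about 0.08 at both edges
```

Multiplying a frame sample by sample with `hamming` or `hanning` removes the abrupt edges that the rectangular window leaves in place.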

The Effect of Windowing in STFT (cont. 2)
Frequency domain:
$Y_k(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} W_k(e^{j\theta})\, X(e^{j(\omega - \theta)})\, d\theta$ (periodic convolution)
- The spectrum of the frame is given as a periodic convolution between the spectra of x[n] and $w_k[n]$.
- If we want $Y_k(e^{j\omega}) = X(e^{j\omega})$, the necessary and sufficient condition is $W_k(e^{j\omega}) = 2\pi\delta(\omega)$, i.e. $w_k[n] = \mathcal{F}^{-1}\{2\pi\delta(\omega)\} = 1$, which means the length of $w_k[n]$ is infinite.
  → There is no window function of finite length that causes no distortion.
NB: hereafter x[n] will also be used to denote a segmented signal, for simplicity.

The Effect of Windowing in STFT (cont. 3)
[Figure: spectral analysis of two sine signals of close frequencies]

Problems with STFT
- The estimated power spectrum contains harmonics of $F_0$, which makes it difficult to estimate the envelope of the spectrum.
- Frequency bins of the STFT are highly correlated with each other, i.e. the power spectrum representation is highly redundant.
[Figure: log spectrum $\log |X(\omega)|$]

Cepstrum Analysis
Idea: split (deconvolve) the power spectrum into the spectral envelope and the $F_0$ harmonics.
[Figure: the steps of cepstrum analysis]
1. Log-spectrum $\log |X(\omega)|$ [freq. domain]
2. Inverse Fourier transform → cepstrum [time domain] (quefrency)
3. Liftering to get the low / high part (lifter: filter used in the cepstral domain)
4. Fourier transform of each part:
   - low part of the cepstrum → smoothed spectrum [freq. domain] (envelope, lag = 30)
   - high part of the cepstrum → log-spectrum of the high part (residue)

Cepstrum Analysis (cont. 2)
h[n]: vocal tract, v[n]: glottal sounds
$x[n] = h[n] * v[n]$
  ↓ $\mathcal{F}$ (Fourier transform)
$X(e^{j\omega}) = H(e^{j\omega})\, V(e^{j\omega})$
  ↓ log
Log spectrum: $\log |X(e^{j\omega})| = \log |H(e^{j\omega})| + \log |V(e^{j\omega})|$
  ($\log |H|$: spectral envelope, $\log |V|$: spectral fine structure)
  ↓ $\mathcal{F}^{-1}$
Cepstrum: $c(\tau) = \mathcal{F}^{-1}\{\log |X(e^{j\omega})|\} = \mathcal{F}^{-1}\{\log |H(e^{j\omega})|\} + \mathcal{F}^{-1}\{\log |V(e^{j\omega})|\}$
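The additivity of the cepstrum can be checked on a toy example. The sketch below is an illustration under stated assumptions, not the handout's own procedure: it uses a naive DFT, the log power spectrum $\log|X|^2$ (which only scales the cepstrum by 2 relative to $\log|X|$), a decaying exponential as a stand-in for the vocal-tract response, and an excitation made of one pulse plus a weaker repetition at lag P. All names and values are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_cepstrum(x):
    """c[q] = inverse DFT of log |X[k]|^2 (real-valued for real x)."""
    n = len(x)
    log_power = [math.log(abs(v) ** 2) for v in dft(x)]
    return [sum(log_power[k] * cmath.exp(2j * math.pi * k * q / n)
                for k in range(n)).real / n
            for q in range(n)]

def circular_convolve(h, v):
    """Circular convolution, which the DFT turns into a product."""
    n = len(h)
    return [sum(h[k] * v[(t - k) % n] for k in range(n)) for t in range(n)]

N, P = 128, 32                          # frame length and toy "pitch" lag
h = [0.9 ** i for i in range(N)]        # smooth, decaying impulse response
v = [0.0] * N
v[0], v[P] = 1.0, 0.5                   # pulse plus one repetition at lag P
x = circular_convolve(h, v)

c_x, c_h, c_v = real_cepstrum(x), real_cepstrum(h), real_cepstrum(v)

# Additivity: the cepstrum of h * v is the sum of the two cepstra.
additivity_error = max(abs(c_x[q] - (c_h[q] + c_v[q])) for q in range(N))

# The excitation shows up as a peak at quefrency P in the high part.
peak_quefrency = max(range(16, N // 2 + 1), key=lambda q: c_x[q])

# Low-time liftering keeps only the slowly varying (envelope) part.
cutoff = 16
low_part = [c if (q < cutoff or q > N - cutoff) else 0.0
            for q, c in enumerate(c_x)]
```

Transforming `low_part` back to the frequency domain would give a smoothed log spectrum, while the peak found in the high part sits at the repetition lag of the excitation.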

LPC Analysis
Linear Predictive Coding (LPC): a model-based / parametric spectrum estimation
Assume a linear system for human speech production:
sound source v[n] → vocal tract h[n] → speech x[n]
h[n]: impulse response
$x[n] = h[n] * v[n] = \sum_{k=0}^{\infty} h[k]\, v[n-k]$
Using a model enables us to
- estimate the spectrum of the vocal tract from a small amount of observations
- represent the spectrum with a small number of parameters
- synthesise speech with the parameters

LPC Analysis (cont. 2)
Predict x[n] from x[n-1], x[n-2], ...:
$\hat{x}[n] = \sum_{k=1}^{N} a_k\, x[n-k]$
$e[n] = x[n] - \hat{x}[n] = x[n] - \sum_{k=1}^{N} a_k\, x[n-k]$ (prediction error)
Optimisation problem: find $\{a_k\}$ that minimises the mean square (MS) error
$P_e = E\{e^2[n]\} = E\left\{\left(x[n] - \sum_{k=1}^{N} a_k\, x[n-k]\right)^2\right\}$
$\{a_k\}$: LPC coefficients
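One common way to solve this minimisation is the autocorrelation method with the Levinson-Durbin recursion; the slides do not say which solver is intended, so the sketch below is one possible choice. The test signal (impulse response of a known second-order all-pole system) and all names are assumptions for illustration.

```python
def autocorrelation(x, max_lag):
    """r[k] = sum_n x[n] * x[n-k] for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for a_1..a_order; also return the error energy."""
    a = [0.0] * (order + 1)          # a[1..order] hold the predictor coefficients
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection (PARCOR) coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

# Toy signal: x[n] = 1.3 x[n-1] - 0.4 x[n-2] + delta[n]
true_a = [1.3, -0.4]
x = []
for n in range(200):
    excitation = 1.0 if n == 0 else 0.0
    past1 = x[n - 1] if n >= 1 else 0.0
    past2 = x[n - 2] if n >= 2 else 0.0
    x.append(true_a[0] * past1 + true_a[1] * past2 + excitation)

r = autocorrelation(x, 2)
a, residual_energy = levinson_durbin(r, 2)
```

For this all-pole toy signal the estimated `a` matches `true_a` up to rounding, and the reflection coefficients computed along the way are the PARCOR coefficients mentioned in the LPC summary.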

Spectra estimated by FT & LPC
[Figure: spectra estimated by FT and by LPC]

LPC summary
- A spectrum can be modelled / coded with around 14 LPCs.
- LPC family
  - PARCOR (Partial Auto-Correlation Coefficient)
  - LSP (Line Spectral Pairs) / LSF (Line Spectrum Frequencies)
  - CSM (Composite Sinusoidal Model)
- LPC can be used to predict log-area ratio coefficients (lossless tube model).
- LPC-(Mel)Cepstrum: LPC-based cepstrum.
- Drawbacks:
  - LPC assumes an AR model, which is not suited to modelling nasal sounds that have zeros in the spectrum.
  - It is difficult to determine the prediction order N.

Taking Perceptual Attributes into Account
Physical quality                        Perceptual quality
Intensity                               Loudness
Fundamental frequency                   Pitch
Spectral shape                          Timbre
Onset/offset time                       Timing
Phase difference in binaural hearing    Location
Technical terms
- equal-loudness contours
- masking
- auditory filters (critical-band filters)
- critical bandwidth

Taking Perceptual Attributes into Account (cont. 2)

Taking Perceptual Attributes into Account (cont. 3)
Non-linear frequency scale
Bark scale: $b(f) = 13 \arctan(0.00076 f) + 3.5 \arctan\left((f/7500)^2\right)$ [Bark]
Mel scale: $B(f) = 1127 \ln(1 + f/700)$
[Figure: Bark frequency [Bark] against linear frequency [Hz]; warped normalised frequency against linear frequency [Hz] for the ln, Bark and mel scales]
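The two warping functions are easy to evaluate. In this sketch the mel formula is the one given on the slide; the Bark constants 0.00076 and 3.5 are the standard values for this form of the formula and are stated here as an assumption, and the function names are hypothetical.

```python
import math

def hz_to_mel(f):
    """Mel scale: B(f) = 1127 * ln(1 + f / 700)."""
    return 1127.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (math.exp(m / 1127.0) - 1.0)

def hz_to_bark(f):
    """Bark scale with the standard constants 0.00076 and 3.5 (assumed)."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

mel_1k = hz_to_mel(1000.0)     # close to 1000 mel by construction of the scale
bark_1k = hz_to_bark(1000.0)   # between 8 and 9 Bark
# Equal steps in Hz shrink on the mel scale as frequency increases.
low_step = hz_to_mel(600.0) - hz_to_mel(500.0)
high_step = hz_to_mel(6100.0) - hz_to_mel(6000.0)
```

Both scales are roughly linear at low frequencies and compress high frequencies, which is why filter banks built on them use wider filters towards the top of the band.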

Filter Bank Analysis
[Figure: speech x[n] is passed through K band-pass filters (filter 1, ..., filter K) giving $x_1[n], \ldots, x_K[n]$; the pass bands $\omega_1, \ldots, \omega_K$ are placed along a perceptual frequency scale]
$x_i[n] = h_i[n] * x[n] = \sum_{k=0}^{M_i - 1} h_i[k]\, x[n-k]$
$h_i[n]$: impulse response of band-pass filter i

Filter Bank Analysis (cont. 2)
[Figure: for each channel i = 1, ..., K: speech x[n] → band-pass filter i → $x_i[n]$ → nonlinearity → $v_i[n]$ → low-pass filter → $y_i[n]$ → down-sampling]
Trade-off problem: frequency resolution (number of filters, length of filter) against time resolution

Filter Bank Analysis (cont. 3)
Another implementation of filter banks: apply a mel-scale filter bank to the STFT power spectrum to obtain a mel-scale power spectrum.
[Figure: DFT (STFT) power spectrum over frequency bins → triangular band-pass filters → mel-scale power spectrum]
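A triangular mel filter bank applied to a power spectrum might be sketched as follows. The number of filters (20), the DFT size (512), the sampling frequency (16 kHz) and the unnormalised triangle heights are assumptions for illustration; practical toolkits differ in such details.

```python
import math

def hz_to_mel(f):
    return 1127.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (math.exp(m / 1127.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters over DFT bins 0..n_fft//2, edges equally spaced in mel."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(fs / 2.0)
    # n_filters + 2 edge frequencies: each filter spans three consecutive edges
    edges_hz = [mel_to_hz(mel_max * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    bin_hz = [k * fs / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, centre, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        weights = []
        for f in bin_hz:
            rising = (f - lo) / (centre - lo)
            falling = (hi - f) / (hi - centre)
            weights.append(max(0.0, min(rising, falling)))
        bank.append(weights)
    return bank

def apply_filterbank(bank, power_spectrum):
    """Mel-scale power spectrum S[m]: weighted sum of the DFT power bins."""
    return [sum(w * p for w, p in zip(weights, power_spectrum)) for weights in bank]

bank = mel_filterbank(20, 512, 16000.0)
flat_spectrum = [1.0] * 257
mel_spectrum = apply_filterbank(bank, flat_spectrum)
first_width = sum(1 for w in bank[0] if w > 0.0)
last_width = sum(1 for w in bank[-1] if w > 0.0)
```

The last filter covers many more DFT bins than the first, which reflects the coarser frequency resolution of the mel scale at high frequencies.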

MFCC
MFCC: Mel-frequency Cepstral Coefficients
x[n] → DFT → X[k] → $|X[k]|^2$ → mel-frequency filterbank → S[m] → log → log S[m] → DCT → c[n]
DCT: $c[n] = \sqrt{\frac{2}{N}} \sum_{i=1}^{N} s[i] \cos\left(\frac{\pi n (i - 0.5)}{N}\right)$, where $s[i] = \log S[i]$
DFT: discrete Fourier transform, DCT: discrete cosine transform
- MFCCs are widely used in HMM-based ASR systems.
- The first 12 MFCCs (c[1] - c[12]) are generally used.
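The final DCT step can be written out directly from the formula. This sketch assumes the scaling factor is $\sqrt{2/N}$ with N the number of filter-bank channels, and uses synthetic log energies rather than real speech; the function name is hypothetical.

```python
import math

def dct_cepstrum(log_energies, n_ceps):
    """c[n] = sqrt(2/N) * sum_{i=1..N} s[i] * cos(pi * n * (i - 0.5) / N), n = 1..n_ceps."""
    N = len(log_energies)
    scale = math.sqrt(2.0 / N)
    return [scale * sum(s * math.cos(math.pi * n * (i - 0.5) / N)
                        for i, s in enumerate(log_energies, start=1))
            for n in range(1, n_ceps + 1)]

N = 20                                   # number of filter-bank channels (assumed)
flat = dct_cepstrum([3.0] * N, 12)       # flat log spectrum: c[1..12] all vanish
ripple = [math.cos(math.pi * 3 * (i - 0.5) / N) for i in range(1, N + 1)]
single = dct_cepstrum(ripple, 12)        # one cosine ripple: only c[3] is non-zero
```

A flat log spectrum contributes nothing to c[1] to c[12], and a single cosine-shaped ripple is captured by a single coefficient, which illustrates how the DCT decorrelates the filter-bank outputs.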

MFCC (cont. 2)
- MFCCs are less correlated with each other than the DCT / filter-bank based spectrum.
- Good compression rate.
Feature dimensionality / frame
  Speech wave     400
  DCT spectrum
  Filter-bank
  MFCC            12
where $F_s$ = 16 kHz, frame width = 25 ms and frame shift = 10 ms are assumed.
- MFCCs show better ASR performance than filter-bank features, but MFCCs are not robust against noise.

Perceptually-based Linear Prediction (PLP) [Hermansky, 1985, 1990]
- PLP has been shown experimentally to be
  - more noise robust
  - more speaker independent
  than MFCCs.

Other features with low dimensionality
- Formants ($F_1, F_2, F_3, \ldots$)
  They are not used in modern ASR systems, but why?

Using temporal features: dynamic features
In SP lab sessions on speech recognition using HTK:
- MFCCs and energy
- ΔMFCCs, Δenergy
- Δ²MFCCs, Δ²energy
Δ, Δ²: delta features (dynamic features / time derivatives) [Furui, 1986]
continuous time → discrete time
c(t) → c[n]
$c'(t) = \frac{dc(t)}{dt}$ → $\Delta c[n] \approx \sum_{i=-M}^{M} w_i\, c[n+i]$, e.g. $\Delta c[n] = \frac{c[n+1] - c[n-1]}{2}$
$c''(t) = \frac{d^2 c(t)}{dt^2}$ → $\Delta^2 c[n] \approx \sum_{i=-M}^{M} w_i\, \Delta c[n+i]$
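A delta feature is a weighted sum over neighbouring frames. The sketch below uses the central-difference weights from the example above; the function name, the edge handling (repeating the first and last frames) and the toy trajectory are assumptions for illustration.

```python
def delta(seq, weights):
    """Delta[n] = sum_{i=-M..M} w_i * seq[n+i], repeating the edge frames."""
    M = len(weights) // 2
    last = len(seq) - 1
    out = []
    for n in range(len(seq)):
        acc = 0.0
        for i in range(-M, M + 1):
            idx = min(max(n + i, 0), last)   # clamp the index at the sequence edges
            acc += weights[i + M] * seq[idx]
        out.append(acc)
    return out

central_difference = [-0.5, 0.0, 0.5]    # (c[n+1] - c[n-1]) / 2, i.e. M = 1
c = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]     # toy trajectory c[n] = n^2
d = delta(c, central_difference)         # first-order delta, 2n away from the edges
dd = delta(d, central_difference)        # second-order delta computed from the deltas
```

For the quadratic toy trajectory the interior delta values follow the derivative 2n and the interior second-order values equal 2, matching the continuous-time derivatives.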

Using temporal features: dynamic features (cont. 2)
[Figure: a feature trajectory c(t) against time, with the slope $c'(t_0)$ at time $t_0$]

Using temporal features: dynamic features (cont. 3)
- An acoustic feature vector, e.g. MFCCs, representing part of a speech signal is highly correlated with its neighbours.
- HMM-based acoustic models assume there is no dependency between the observations.
- Those correlations can be captured to some extent by augmenting the original set of static acoustic features, e.g. MFCCs, with dynamic features.

General Feature Transformation
- Orthogonal transformation (orthogonal bases)
  - DCT (discrete cosine transform)
  - PCA (principal component analysis)
- Transformation based on bases that maximise the separability between classes
  - LDA (linear discriminant analysis) / Fisher's linear discriminant
  - HLDA (heteroscedastic linear discriminant analysis)

A comparison of speech features
I. Mporas, et al., Comparison of Speech Features on the Speech Recognition Task, Journal of Computer Science, Vol. 3, pp ,
[Table: WER (%) and SER (%) for each feature. Features listed: SBC (16), WPSR125 (16), OWPF (16), LFCC-FB, HFCC-FB, HFCC-FB, PLP-FB, MFCC-FB]
NB:
  SBC      Subband-based cepstral coefficients
  WPF      Wavelet packet features
  OWPF     Overlapping wavelet packet features
  WPSR     Wavelet packet-based speech features
  LFCC-FB  Linear-spaced filter-bank based cepstral coefficients
  HFCC-FB  Human factor cepstral coefficients
The above result was obtained for the TIMIT speech corpus. Results might change a lot under different conditions (e.g. noise, tasks, ASR systems).

Further topics on feature extraction
- Feature normalisation / enhancement in terms of
  - noise / environments
  - speakers / speaking styles
  - speech recognition
- Pitch ($F_0$) adapted feature extraction

SUMMARY
- Nyquist sampling theorem
- Short-time spectrum analysis
  - Non-parametric methods
    - Short-time Fourier transform
    - Cepstrum, MFCC
    - Filter bank
  - Parametric methods
    - LPC, PLP
- Windowing effect: trade-off between time and frequency resolutions
- Dynamic features (delta features)
- There is no best feature that can be used for every purpose, but MFCC is widely used for ASR and TTS.

SUMMARY (cont. 2)
- Front-end analysis has a great influence on ASR performance.
- For robust ASR in real environments, various techniques for front-end processing have been proposed, e.g. spectral subtraction (SS) and cepstral mean normalisation (CMN).
- Spectrum analysis and feature extraction involve information loss and non-linear distortions.
- There is always a trade-off between accuracy and efficiency (e.g. spatial resolution vs. temporal resolution).
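Cepstral mean normalisation follows from the cepstrum's additivity: a fixed channel is a convolution in the time domain, hence a constant added to every cepstral vector, and subtracting the per-utterance mean removes it. A minimal sketch with made-up two-dimensional feature vectors (the function name, the values and the channel offset are hypothetical):

```python
def cepstral_mean_normalise(frames):
    """Subtract the per-coefficient mean over all frames of an utterance."""
    n_frames = len(frames)
    dim = len(frames[0])
    mean = [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]
    return [[frame[d] - mean[d] for d in range(dim)] for frame in frames]

clean = [[1.0, 2.0], [3.0, -1.0], [2.0, 5.0]]
channel_offset = [0.7, -0.3]             # hypothetical constant channel term
distorted = [[c + o for c, o in zip(frame, channel_offset)] for frame in clean]

normalised_clean = cepstral_mean_normalise(clean)
normalised_distorted = cepstral_mean_normalise(distorted)
max_difference = max(abs(a - b)
                     for fa, fb in zip(normalised_clean, normalised_distorted)
                     for a, b in zip(fa, fb))
```

The normalised features are the same with and without the constant offset, up to rounding.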
