Speech Production. Automatic Speech Recognition handout (1) Jan - Mar 2009 Revision : 1.1. Speech Communication. Spectrogram. Waveform.

Automatic Speech Recognition handout (1), Jan - Mar 2009, Revision: 1.1
Speech Signal Processing and Feature Extraction
Hiroshi Shimodaira (h.shimodaira@ed.ac.uk)

Speech Communication
[Figure: the speech chain. Speaker: intention, language, motion control, articulatory organs (vocal tract) and signal source (vocal cords) produce the speech sound. Listener: auditory organs, auditory processing, language, understanding.]

Speech Production
[Figure: vocal organs and vocal tract (lips, teeth, nasal cavity, oral cavity, tongue, pharynx, larynx, vocal folds, lungs), with the corresponding spectra over frequency Ω: glottal source V(Ω) from the vocal folds (-12 dB/oct.), vocal tract H(Ω) with formants F1, F2, F3 (larynx + pharynx, mouth cavity, nasal cavity), lip radiation (+6 dB/oct.), and the speech spectrum S(Ω) (-6 dB/oct.).]
- Time domain: $s(t) = h(t) * v(t) = \int h(\tau)\, v(t-\tau)\, d\tau$
- Frequency domain (Fourier transform): $S(\Omega) = H(\Omega)\, V(\Omega)$

Spectrogram
[Figure: waveform, spectrogram, and a cross-section of the spectrogram.]

Automatic Speech Recognition
[Figure only.]

Signal Analysis for ASR
- Front-end analysis: convert the acoustic signal into a sequence of feature vectors.

Feature parameters for ASR
Features should
- contain sufficient information to distinguish phonemes / phones:
  - good time resolution [e.g. 10 ms]
  - good frequency resolution [e.g. 20 channels/Bark-scale]
- not contain (or be separated from) F0 and its harmonics
- be robust against speaker variation
- be robust against noise / channel distortions
- have good characteristics in terms of pattern recognition:
  - the number of features is as small as possible
  - features are independent of each other
A large number of features have been proposed.

Converting analogue signals to machine readable form
- Discretisation (digitising): $x_c(t) \rightarrow x[n]$
  - continuous time -> discrete time
  - continuous amplitude -> discrete amplitude

Sampling of continuous-time signals
- Continuous-time signal: $x_c(t)$
- Signal modulated by a periodic impulse train: $x_s(t) = x_c(t) \sum_{n=-\infty}^{\infty} \delta(t - nT_s) = \sum_{n=-\infty}^{\infty} x_c(nT_s)\, \delta(t - nT_s)$
- Sampled (discrete-time) signal: $x[n] = x_c(nT_s)$
- $T_s$: sampling interval

Sampling of continuous-time signals (cont. 2)
- Q: Is the C/D conversion invertible? $x_c(t) \rightarrow$ C/D $\rightarrow x[n] \rightarrow$ D/C $\rightarrow x_c(t)$?
[Figure only.]

Sampling of continuous-time signals (cont. 3)
- Q: Is the C/D conversion invertible?
- A: No in general, but yes under a special condition.
- Nyquist sampling theorem: if $x_c(t)$ is band-limited (i.e. it has no frequency components above $F_s/2$), then $x_c(t)$ can be fully reconstructed from $x[n]$:
  $x_c(t) = h_{T_s}(t) * \sum_{k=-\infty}^{\infty} x[k]\, \delta(t - kT_s) = \sum_{k=-\infty}^{\infty} x[k]\, h_{T_s}(t - kT_s)$
  $h_{T_s}(t) = \mathrm{sinc}(t/T_s) = \frac{\sin(\pi t/T_s)}{\pi t/T_s}$
- $F_s = 1/T_s$: sampling frequency; $F_s/2$: Nyquist frequency

Sampling of continuous-time signals (cont. 4)
- Interpretation in the frequency domain: $X_s(\Omega) = \frac{1}{T_s} \sum_{k=-\infty}^{\infty} X_c(\Omega - k\Omega_s)$, where $X_s(\Omega)$ is the spectrum of $x_s(t)$, $X_c(\Omega)$ is the spectrum of $x_c(t)$, and $\Omega_s = 2\pi/T_s$.
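The sinc reconstruction formula above can be tried numerically. The following is a minimal sketch, assuming a finite record of samples (so the infinite sum is truncated) and a hypothetical 100 Hz cosine sampled at 1 kHz; the function names are illustrative and not from the handout.

```python
import math

def sinc(u):
    """Normalised sinc: sin(pi*u) / (pi*u), with sinc(0) = 1."""
    return 1.0 if u == 0 else math.sin(math.pi * u) / (math.pi * u)

def reconstruct(samples, ts, t):
    """Evaluate x_c(t) = sum_k x[k] * sinc((t - k*ts) / ts) over the available samples."""
    return sum(xk * sinc((t - k * ts) / ts) for k, xk in enumerate(samples))

# Hypothetical test signal: a 100 Hz cosine sampled at Fs = 1000 Hz (well below Fs/2).
fs = 1000.0
ts = 1.0 / fs
f0 = 100.0
samples = [math.cos(2 * math.pi * f0 * k * ts) for k in range(2000)]

# Reconstruct halfway between two samples near the middle of the record,
# where truncating the infinite sum matters least.
t = 1000.5 * ts
approx = reconstruct(samples, ts, t)
exact = math.cos(2 * math.pi * f0 * t)
print("reconstruction error:", abs(approx - exact))
```

The residual error comes only from truncating the sum to 2000 samples; raising the tone above Fs/2 would break the band-limit condition and the reconstruction would return an aliased tone instead.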

Sampling of continuous-time signals (cont. 5)
Questions
1. What sampling frequencies ($F_s$) are used for ASR?
   - microphone voice: 12 kHz to 20 kHz
   - telephone voice: 8 kHz
2. What are the advantages / disadvantages of using a higher $F_s$?
3. Why is pre-emphasis (+6 dB/oct.) employed?
   $x[n] = x_0[n] - a\, x_0[n-1]$, $a = 0.95$ to $0.97$

Spectral analysis: Fourier Transform
- FT for continuous-time signals (and continuous frequency):
  - $X_c(\Omega) = \int_{-\infty}^{\infty} x_c(t)\, e^{-j\Omega t}\, dt$ (time domain -> frequency domain)
  - $x_c(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X_c(\Omega)\, e^{j\Omega t}\, d\Omega$ (frequency domain -> time domain)
- FT for discrete-time signals (and continuous frequency):
  - $X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n}$
  - $x[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\omega})\, e^{j\omega n}\, d\omega$
- $|X(e^{j\omega})|^2$: power spectrum; $\log |X(e^{j\omega})|^2$: log power spectrum
- where $\Omega = 2\pi f$, $f = 1/T$, $\omega = T_s \Omega$, $e^{j\omega n} = \cos(\omega n) + j \sin(\omega n)$, and $j$ is the imaginary unit.

An interpretation of FT
- Inner product between two vectors (linear algebra), 2-dimensional case: $\mathbf{a} = (a_1, a_2)^t$, $\mathbf{b} = (b_1, b_2)^t$,
  $\mathbf{a} \cdot \mathbf{b} = \mathbf{a}^t \mathbf{b} = a_1 b_1 + a_2 b_2 = |\mathbf{a}|\,|\mathbf{b}| \cos\theta = |\mathbf{a}| \cos\theta$ if $|\mathbf{b}| = 1$
  [Figure: vectors a and b with angle θ; |a| cos θ is the projection of a onto b.]
- Infinite-dimensional case: $\mathbf{x} = \{x[n]\}$, $\mathbf{e}_\omega = \{e^{j\omega n}\} = \{\cos(\omega n) + j \sin(\omega n)\} = \mathbf{cos}_\omega + j\, \mathbf{sin}_\omega$
  $X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n} = \mathbf{x} \cdot \mathbf{e}_\omega^{*} = \mathbf{x} \cdot \mathbf{cos}_\omega - j\, \mathbf{x} \cdot \mathbf{sin}_\omega$
- $\mathbf{x} \cdot \mathbf{cos}_\omega$: how much of the $\mathbf{cos}_\omega$ component is contained in $\mathbf{x}$

Short-time Spectrum Analysis
- Problem with FT: it assumes signals are stationary, i.e. signal properties do not change over time. If signals are non-stationary, the FT loses information on time-varying features.
- Short-time Fourier transform (STFT) (time-dependent Fourier transform): divide the signal into short-time segments (frames) and apply the FT to each frame.
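As a small illustration of the pre-emphasis filter and of the inner-product reading of the discrete-time Fourier transform, here is a sketch for finite-length signals. The boundary choice x0[-1] = 0, the test tone, and all function names are assumptions made for the example.

```python
import cmath
import math

def pre_emphasis(x0, a=0.97):
    """x[n] = x0[n] - a * x0[n-1], taking x0[-1] = 0 (an assumption for the first sample)."""
    return [x0[n] - (a * x0[n - 1] if n > 0 else 0.0) for n in range(len(x0))]

def dtft(x, omega):
    """X(e^{j omega}) = sum_n x[n] e^{-j omega n} for a finite-length x."""
    return sum(xn * cmath.exp(-1j * omega * n) for n, xn in enumerate(x))

def dtft_by_projection(x, omega):
    """Same quantity as two real inner products: x.cos_omega - j * x.sin_omega."""
    cos_part = sum(xn * math.cos(omega * n) for n, xn in enumerate(x))
    sin_part = sum(xn * math.sin(omega * n) for n, xn in enumerate(x))
    return complex(cos_part, -sin_part)

# Hypothetical test signal: 64 samples of a cosine at omega0 = 2*pi*8/64 rad/sample.
N = 64
omega0 = 2 * math.pi * 8 / N
x = [math.cos(omega0 * n) for n in range(N)]

on_peak = abs(dtft(x, omega0)) ** 2                 # power at the signal frequency
off_peak = abs(dtft(x, 2 * math.pi * 20 / N)) ** 2  # power at an unrelated frequency
print("power on peak:", on_peak, "off peak:", off_peak)

# Pre-emphasis attenuates low frequencies relative to high ones.
y = pre_emphasis(x)
```

The projection onto the cosine of matching frequency is large, while the projection onto a different frequency is close to zero, which is the "how much of this component is contained in x" interpretation.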

Short-time Spectrum Analysis (cont. 2)
[Figure only.]

Short-time Spectrum Analysis (cont. 3)
- Trade-off problem of short-time spectrum analysis: frequency resolution versus time resolution. A short window gives good time resolution but poor frequency resolution; a long window gives the opposite.
- A compromise:
  - window width (frame width): 20 to 30 ms
  - window shift (frame shift): 5 to 15 ms

The Effect of Windowing in STFT
- Time domain: $y_k[n] = w_k[n]\, x[n]$, where $w_k[n]$ is the time window for the k-th frame.
- Simply cutting out a short segment (frame) from $x[n]$ amounts to applying a rectangular window to $x[n]$, which causes discontinuities at the edges of the segment.
- Instead, a tapered window is usually used, e.g. the Hamming ($\alpha = 0.46164$) or Hanning ($\alpha = 0.5$) window:
  $w[l] = (1 - \alpha) - \alpha \cos\left(\frac{2\pi l}{N - 1}\right)$, $N$: window width
[Figures: time-domain shapes of the rectangular, Hamming, Hanning, Blackman and Bartlett windows, and a comparison plot of the five windows.]

The Effect of Windowing in STFT (cont. 2)
- Frequency domain: $Y_k(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} W_k(e^{j\theta})\, X(e^{j(\omega - \theta)})\, d\theta$ (periodic convolution)
- The spectrum of the frame is given as a periodic convolution between the spectra of $x[n]$ and $w_k[n]$.
- If we want $Y_k(e^{j\omega}) = X(e^{j\omega})$, the necessary and sufficient condition is $W_k(e^{j\omega}) = 2\pi\, \delta(\omega)$ on $-\pi \le \omega \le \pi$, i.e. $w_k[n] = \mathcal{F}^{-1}\{2\pi\, \delta(\omega)\} = 1$ for all $n$, which means the length of $w_k[n]$ is infinite.
- Hence there is no window function of finite length that causes no distortion.
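The framing and windowing steps can be sketched as follows. This is an illustrative implementation only: it uses a naive O(N^2) DFT to stay dependency-free (a real front end would use an FFT), and the 25 ms / 10 ms frame settings and the 1 kHz test tone are assumptions chosen for the example.

```python
import cmath
import math

def tapered_window(n_len, alpha=0.46164):
    """w[l] = (1 - alpha) - alpha * cos(2*pi*l / (N - 1)); alpha = 0.5 gives Hanning."""
    return [(1 - alpha) - alpha * math.cos(2 * math.pi * l / (n_len - 1))
            for l in range(n_len)]

def frames(x, width, shift):
    """Cut x into overlapping frames of `width` samples, advancing by `shift` samples."""
    return [x[start:start + width]
            for start in range(0, len(x) - width + 1, shift)]

def power_spectrum(frame):
    """|DFT|^2 for bins 0..N/2 of one frame, computed with a naive DFT."""
    n_len = len(frame)
    spec = []
    for k in range(n_len // 2 + 1):
        acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_len)
                  for n in range(n_len))
        spec.append(abs(acc) ** 2)
    return spec

def stft_power(x, fs, width_ms=25.0, shift_ms=10.0):
    """Short-time power spectra: window each frame, then transform it."""
    width = int(round(fs * width_ms / 1000.0))
    shift = int(round(fs * shift_ms / 1000.0))
    w = tapered_window(width)
    return [power_spectrum([wi * xi for wi, xi in zip(w, f)])
            for f in frames(x, width, shift)]

# Hypothetical input: 0.1 s of a 1 kHz tone at Fs = 8 kHz.
fs = 8000
x = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(800)]
spectra = stft_power(x, fs)
bin_hz = fs / (2.0 * (len(spectra[0]) - 1))
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
print(len(spectra), "frames; strongest bin at", peak_bin * bin_hz, "Hz")
```

Changing `width_ms` shows the trade-off directly: a wider frame narrows the spectral peak (finer `bin_hz`) but yields fewer, longer frames.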

The Effect of Windowing in STFT (cont. 3)
- Spectral analysis of two sine signals of close frequencies
[Figure only.]

Problems with STFT
- The estimated power spectrum contains harmonics of F0, which makes it difficult to estimate the envelope of the spectrum.
- Frequency bins of the STFT are highly correlated with each other, i.e. the power spectrum representation is highly redundant.
[Figure: log |X(ω)|.]

Cepstrum Analysis
- Idea: split (deconvolve) the power spectrum into the spectrum envelope and the F0 harmonics.
- Procedure:
  1. Log-spectrum [frequency domain]
  2. Inverse Fourier transform -> cepstrum [time domain] (quefrency)
  3. Liftering to get the low / high part (lifter: filter used in the cepstral domain)
  4. Fourier transform -> smoothed spectrum [frequency domain] (from the low part of the cepstrum), or log-spectrum of the high part of the cepstrum
[Figure panels: log |X(ω)|, cepstrum, envelope, residue.]

Cepstrum Analysis (cont. 2)
- $h[n]$: vocal tract; $v[n]$: glottal sounds; $x[n] = h[n] * v[n]$
- Fourier transform: $X(e^{j\omega}) = H(e^{j\omega})\, V(e^{j\omega})$
- Log spectrum: $\log |X(e^{j\omega})| = \log |H(e^{j\omega})| + \log |V(e^{j\omega})|$, where $\log |H(e^{j\omega})|$ is the spectral envelope and $\log |V(e^{j\omega})|$ is the spectral fine structure.
- Cepstrum: $c(\tau) = \mathcal{F}^{-1}\{\log |X(e^{j\omega})|\} = \mathcal{F}^{-1}\{\log |H(e^{j\omega})|\} + \mathcal{F}^{-1}\{\log |V(e^{j\omega})|\}$
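The cepstrum procedure (log spectrum, inverse transform, liftering, forward transform) can be sketched with a naive DFT. The spectral floor, the lifter cutoff, and the synthetic "resonance excited by an impulse train" signal are assumptions for illustration, as are the function names.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive DFT / inverse DFT (O(N^2)); adequate for short illustrative frames."""
    n_len = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[n] * cmath.exp(sign * 2j * math.pi * k * n / n_len)
               for n in range(n_len))
           for k in range(n_len)]
    return [v / n_len for v in out] if inverse else out

def log_spectrum(x, floor=1e-10):
    """log |X[k]|, with a small floor (an assumption) to avoid log(0)."""
    return [math.log(max(abs(v), floor)) for v in dft(x)]

def real_cepstrum(x):
    """c[n] = inverse DFT of log |X[k]|; real because log |X| is real and even."""
    return [v.real for v in dft(log_spectrum(x), inverse=True)]

def lifter_low(cep, cutoff):
    """Keep quefrencies below `cutoff` at both ends (the cepstrum is symmetric)."""
    n_len = len(cep)
    return [c if (n < cutoff or n > n_len - cutoff) else 0.0
            for n, c in enumerate(cep)]

def smoothed_log_spectrum(x, cutoff):
    """Spectral envelope estimate: DFT of the low-quefrency part of the cepstrum."""
    return [v.real for v in dft(lifter_low(real_cepstrum(x), cutoff))]

# Hypothetical signal: a decaying resonance excited by an impulse train (period 16).
n_len = 128
x = [0.0] * n_len
for start in range(0, n_len, 16):
    for n in range(start, n_len):
        x[n] += (0.9 ** (n - start)) * math.cos(0.3 * math.pi * (n - start))

cep = real_cepstrum(x)
envelope = smoothed_log_spectrum(x, cutoff=10)
print(len(cep), len(envelope))
```

Keeping only the low quefrencies retains the slowly varying envelope term; replacing `lifter_low` with its complement would instead keep the fine structure contributed by the excitation.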

LPC Analysis
- Linear Predictive Coding (LPC): a model-based / parametric spectrum estimation
- Assume a linear system for human speech production: sound source $x[n]$ -> vocal tract $h[n]$ -> speech $y[n]$
  $y[n] = h[n] * x[n] = \sum_{k=0}^{\infty} h[k]\, x[n-k]$, $h[n]$: impulse response
- Using a model enables us to
  - estimate the spectrum of the vocal tract from a small amount of observations
  - represent the spectrum with a small number of parameters
  - synthesise speech with the parameters

LPC analysis in detail
- Predict $y[n]$ from $y[n-1], y[n-2], \ldots$:
  $\hat{y}[n] = \sum_{k=1}^{N} a_k\, y[n-k]$
- Prediction error: $e[n] = y[n] - \hat{y}[n] = y[n] - \sum_{k=1}^{N} a_k\, y[n-k]$
- $\{a_k\}$: LPC coefficients
- Optimisation problem: find $\{a_k\}$ that minimises the mean square (MS) error
  $P_e = E\{e^2[n]\} = E\left\{\left(y[n] - \sum_{k=1}^{N} a_k\, y[n-k]\right)^2\right\}$

Spectrum estimated by FT & LPC
[Figure only.]

LPC summary
- The spectrum can be modelled / coded with around 14 LPCs.
- LPC family:
  - PARCOR (Partial Auto-Correlation Coefficient)
  - LSP (Line Spectral Pairs) / LSF (Line Spectrum Frequencies)
  - CSM (Composite Sinusoidal Model)
- LPC can be used to predict log-area ratio coefficients (lossless tube model).
- LPC-(Mel)Cepstrum: LPC-based cepstrum.
- Drawbacks:
  - LPC assumes an AR model, which is not well suited to modelling nasal sounds that have zeros in the spectrum.
  - It is difficult to determine the prediction order N.
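One common way to solve the minimisation above is the autocorrelation method with the Levinson-Durbin recursion; the handout does not say which solver it has in mind, so treat this as one possible sketch. The test signal is a hypothetical second-order all-pole system driven by a single impulse, and the function names are illustrative.

```python
def autocorr(y, max_lag):
    """r[lag] = sum_n y[n] * y[n - lag], treating y as zero outside the record."""
    return [sum(y[n] * y[n - lag] for n in range(lag, len(y)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve sum_k a_k r[|i-k|] = r[i], i = 1..order, for the predictor a_1..a_order.

    Returns (coefficients, residual prediction error energy)."""
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        k_i = acc / err              # reflection (PARCOR) coefficient
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= (1.0 - k_i * k_i)
    return a[1:], err

# Hypothetical data: impulse response of y[n] = 1.3 y[n-1] - 0.4 y[n-2] + x[n].
true_a = [1.3, -0.4]
y = []
for n in range(300):
    excitation = 1.0 if n == 0 else 0.0
    past1 = y[n - 1] if n >= 1 else 0.0
    past2 = y[n - 2] if n >= 2 else 0.0
    y.append(true_a[0] * past1 + true_a[1] * past2 + excitation)

a, err = levinson_durbin(autocorr(y, 2), 2)
print("estimated a_k:", a, "residual energy:", err)
```

The intermediate `k_i` values are the PARCOR coefficients listed in the LPC family, so the same recursion yields both representations.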

Taking Perceptual Attributes into Account
Physical quality -> Perceptual quality
- Intensity -> Loudness
- Fundamental frequency -> Pitch
- Spectral shape -> Timbre
- Onset/offset time -> Timing
- Phase difference in binaural hearing -> Location
Technical terms
- equal-loudness contours
- masking
- auditory filters (critical-band filters)
- critical bandwidth

Taking Perceptual Attributes into Account (cont. 2)
[Figure only.]

Taking Perceptual Attributes into Account (cont. 3)
Non-linear frequency scale
- Bark scale: $b(f) = 13 \arctan(0.00076 f) + 3.5 \arctan\left((f/7500)^2\right)$ [Bark]
- Mel scale: $B(f) = 1125 \ln(1 + f/700)$
[Figures: Bark frequency [Bark] against linear frequency [Hz]; warped normalised frequency against linear frequency [Hz] for the Bark, mel and ln scales.]

Filter Bank Analysis
- Speech $x[n]$ is passed through bandpass filters $1, \ldots, K$ to give $x_1[n], \ldots, x_K[n]$:
  $x_i[n] = h_i[n] * x[n] = \sum_{k=0}^{M_i} h_i[k]\, x[n-k]$
- $h_i[n]$: impulse response of bandpass filter $i$
- The centre frequencies $\omega_1, \omega_2, \omega_3, \ldots, \omega_K$ are placed on a perceptual scale.
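The two frequency-warping formulas are easy to evaluate directly. The sketch below also inverts the mel formula to place centre frequencies at equal mel spacing, which is one plausible way to position the bandpass filters on a perceptual scale; the helper names and the 0 to 8000 Hz range are assumptions.

```python
import math

def hz_to_bark(f):
    """b(f) = 13 arctan(0.00076 f) + 3.5 arctan((f / 7500)^2)."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

def hz_to_mel(f):
    """B(f) = 1125 ln(1 + f / 700)."""
    return 1125.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)

def mel_spaced_frequencies(f_low, f_high, count):
    """`count` frequencies in Hz, equally spaced on the mel scale, endpoints included."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (count - 1)
    return [mel_to_hz(m_low + i * step) for i in range(count)]

centres = mel_spaced_frequencies(0.0, 8000.0, 22)
for f in (100.0, 1000.0, 4000.0):
    print(f, "Hz ->", round(hz_to_bark(f), 2), "Bark,", round(hz_to_mel(f), 1), "mel")
```

Equal steps in mel translate into steps in Hz that widen with frequency, which is why perceptually spaced filters are narrow at low frequencies and broad at high ones.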

Filter Bank Analysis (cont. 2)
- For each channel $i = 1, \ldots, K$: speech $x[n]$ -> bandpass filter $i$ -> $x_i[n]$ -> nonlinearity -> $v_i[n]$ -> lowpass filter -> $y_i[n]$ -> down-sampling
- Trade-off problem: frequency resolution (number of filters, length of each filter) versus time resolution.

Filter Bank Analysis (cont. 3)
- Another implementation: apply a mel-scale filter bank to the STFT power spectrum to obtain a mel-scale power spectrum.
[Figure: DFT (STFT) power spectrum over frequency bins -> triangular band-pass filters -> mel-scale power spectrum.]

MFCC
- MFCC: Mel-frequency Cepstrum Coefficients
- $x[n]$ -> DFT -> $X[k]$ -> $|X[k]|^2$ -> mel-frequency filterbank -> $S[m]$ -> log -> $\log S[m]$ -> DCT -> $c[n]$
- DCT: $c[n] = \sqrt{\frac{2}{N}} \sum_{i=1}^{N} s[i] \cos\left(\frac{\pi n (i - 0.5)}{N}\right)$, where $s[i] = \log S[i]$
- MFCCs are widely used in HMM-based ASR systems.
- The first 12 MFCCs ($c[1]$ to $c[12]$) are generally used.

MFCC (cont. 2)
- MFCCs are less correlated with each other than the DCT / filter-bank based spectrum.
- Good compression rate. Feature dimensionality per frame:
  - Speech wave: 400
  - DCT spectrum: 64 to 256
  - Filter-bank: 20
  - MFCC: 12
  where $F_s$ = 16 kHz, frame width = 25 ms and frame shift = 10 ms are assumed.
- MFCCs show better ASR performance than filter-bank features, but MFCCs are not robust against noise.
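The last stages of the MFCC pipeline (mel filterbank on a power spectrum, log, DCT) can be sketched as below. The triangular filter layout is one common construction and is an assumption here, as are the floor value, the 129-bin toy power spectrum, and all function names; the DCT follows the formula given above.

```python
import math

def hz_to_mel(f):
    return 1125.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (math.exp(m / 1125.0) - 1.0)

def mel_filterbank(num_filters, num_bins, fs):
    """Triangular filters over DFT bins 0..num_bins-1 (0..fs/2), centres equally spaced in mel."""
    top = hz_to_mel(fs / 2.0)
    # num_filters + 2 edge frequencies, expressed as fractional bin positions
    edges = [mel_to_hz(top * i / (num_filters + 1)) / (fs / 2.0) * (num_bins - 1)
             for i in range(num_filters + 2)]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        weights = []
        for k in range(num_bins):
            if lo < k <= mid:
                weights.append((k - lo) / (mid - lo))   # rising slope
            elif mid < k < hi:
                weights.append((hi - k) / (hi - mid))   # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def dct_cepstrum(s, num_ceps):
    """c[n] = sqrt(2/N) * sum_{i=1}^{N} s[i] cos(pi*n*(i-0.5)/N), n = 1..num_ceps."""
    n_filt = len(s)
    scale = math.sqrt(2.0 / n_filt)
    return [scale * sum(s[i - 1] * math.cos(math.pi * n * (i - 0.5) / n_filt)
                        for i in range(1, n_filt + 1))
            for n in range(1, num_ceps + 1)]

def mfcc_from_power(power, bank, num_ceps=12, floor=1e-10):
    """Filterbank energies S[m] -> s[m] = log S[m] -> DCT."""
    s = [math.log(max(sum(w * p for w, p in zip(filt, power)), floor))
         for filt in bank]
    return dct_cepstrum(s, num_ceps)

# Hypothetical power spectrum of one frame: 129 bins (256-point DFT), Fs = 16 kHz.
power = [1.0 / (1.0 + k) for k in range(129)]
bank = mel_filterbank(20, 129, 16000.0)
mfcc = mfcc_from_power(power, bank)
print([round(c, 3) for c in mfcc])
```

With 20 filters and 12 cepstral coefficients this reproduces the dimensionalities quoted for the filter-bank and MFCC rows, starting from a 129-bin spectrum.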

Perceptually-based Linear Prediction (PLP) [Hermansky, 1985, 1990]
- PLP has been shown experimentally to be
  - more noise robust
  - more speaker independent
  than MFCCs.

Other features with low dimensionality
- Formants (F1, F2, F3, ...)
- They are not used in modern ASR systems, but why?

Using temporal features: dynamic features
- In SP lab-sessions on speech recognition using HTK:
  - MFCCs, and energy
  - ΔMFCCs, Δenergy
  - Δ²MFCCs, Δ²energy
- Δ, Δ²: delta features (dynamic features / time derivatives) [Furui, 1986]
- From continuous time to discrete time:
  - $c(t) \rightarrow c[n]$
  - $c'(t) = \frac{dc(t)}{dt} \rightarrow \Delta c[n] = \sum_{i=-M}^{M} w_i\, c[n+i]$
  - $c''(t) = \frac{d^2 c(t)}{dt^2} \rightarrow \Delta^2 c[n] = \sum_{i=-M}^{M} w_i\, \Delta c[n+i]$

Using temporal features: dynamic features (cont. 2)
[Figure: a trajectory c(t) over time and its slope c'(t0) at time t0.]
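The delta formula leaves the weights w_i unspecified. A common choice, used here as an assumption, is the least-squares regression slope w_i = i / sum_j j^2 over a window of ±M frames, with the first and last frames repeated at the edges. The sketch works on a single coefficient track; a full feature vector would apply it to each dimension.

```python
def delta(track, m=2):
    """Delta c[n] = sum_{i=-M}^{M} w_i c[n+i], with w_i = i / sum_j j^2 (regression slope)."""
    norm = float(sum(i * i for i in range(-m, m + 1)))
    last = len(track) - 1
    out = []
    for n in range(len(track)):
        acc = 0.0
        for i in range(-m, m + 1):
            idx = min(max(n + i, 0), last)   # repeat edge frames (an assumption)
            acc += i * track[idx]
        out.append(acc / norm)
    return out

def with_dynamic_features(track, m=2):
    """Return (static, delta, delta-delta) triples for each frame."""
    d1 = delta(track, m)
    d2 = delta(d1, m)                        # delta of the delta track
    return list(zip(track, d1, d2))

# Hypothetical coefficient track: a linear ramp, whose true slope is 3 per frame.
ramp = [3.0 * n + 1.0 for n in range(10)]
features = with_dynamic_features(ramp)
print(features[5])
```

Away from the edges the ramp yields a delta equal to its slope and a delta-delta of zero, which is the discrete counterpart of the first and second time derivatives.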

Using temporal features: dynamic features (cont. 3)
- An acoustic feature vector, e.g. MFCCs, representing part of a speech signal is highly correlated with its neighbours.
- HMM-based acoustic models assume there is no dependency between the observations.
- Those correlations can be captured to some extent by augmenting the original set of static acoustic features, e.g. MFCCs, with dynamic features.

SUMMARY
- Nyquist sampling theorem
- Short-time spectrum analysis
  - Non-parametric methods: Short-time Fourier Transform; Cepstrum, MFCC; Filter bank
  - Parametric methods: LPC, PLP
- Windowing effect: trade-off between time and frequency resolutions
- Dynamic features (delta features)
- There is no single best feature that can be used for every purpose, but MFCC is widely used for ASR and TTS.

SUMMARY (cont. 2)
- Front-end analysis has a great influence on ASR performance.
- For robust ASR in real environments, various techniques for front-end processing have been proposed, e.g. spectral subtraction (SS) and cepstral mean normalisation (CMN).
- Do not believe what you've got in spectral analysis. You are not seeing the true one. You are looking at speech signals / features through a pin hole (sampled, windowed).

References
- John N. Holmes, Wendy J. Holmes, Speech Synthesis and Recognition, 2nd edition, Taylor and Francis (2001) (chapter 2, 4, )
- http://mi.eng.cam.ac.uk/comp.speech/
- http://mi.eng.cam.ac.uk/~ajr/speechanalysis/
- http://cslu.cse.ogi.edu/hltsurvey/
- B. Gold, N. Morgan, Speech and Audio Signal Processing: Processing and Perception of Speech and Music, John Wiley and Sons (1999).
- Xuedong Huang, Alex Acero and Hsiao-Wuen Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall (2001). ISBN: 0130226165

References (cont. 2)
- J-C Junqua and J-P Haton, Robustness in Automatic Speech Recognition, Kluwer Academic Publications (1996). ISBN: 0-7923-9646-4
- Sahar Bou-Ghazale and John H.L. Hansen, A Comparative Study of Traditional and Newly Proposed Features for Recognition of Speech Under Stress, IEEE Trans SAP, vol. 8, no. 4, pp. 429-442, July 2000.