University of Washington Department of Electrical Engineering Computer Speech Processing EE516 Winter 2005




Lecture 5 Slides, Jan 26th, 2005

Outline of Today's Lecture
Announcements.
Filter-bank analysis of speech for spectral representations (linear/log scale).
LPC representations of speech (how to compute LPC coefficients), reflections on an acoustic tube.

Books & sources
Rabiner & Juang, Fundamentals of Speech Recognition.
Deller et al., Discrete-Time Processing of Speech Signals.
Beranek, Acoustics.
Flanagan, Speech Analysis, Synthesis and Perception.
Clark & Yallop, An Introduction to Phonetics and Phonology.
Ladefoged, A Course in Phonetics.
Lieberman & Blumstein, Speech Physiology, Speech Perception, and Acoustic Phonetics.
K. Stevens, Acoustic Phonetics.
Malmberg, Manual of Phonetics.
Rossing, The Science of Sound.
Linguistics 001, University of Pennsylvania.

Filter Bank Analysis
Goal: produce a representation of speech that contains the message of what was said, stripping out what was irrelevant.
Vocal tract: while the carrier signal is such that F0 is from Hz, and its harmonics go up to at least 5 kHz, this is only the carrier.
Speech information is contained in the relatively slow-moving, time-varying vocal tract response function. Recall Dudley's quote from several lectures ago.
So, goal: represent the VT response (compared to the spectral response).


Vocal Tract Production
The vocal tract is a complicated physical system. Recall Homer Dudley's Voder system.

Filter Bank Approach
[Figure: filter-bank block diagram; each channel contains a rectifier.]
The filter bank produces a useful compressed representation of speech. Each BPF can be seen as a convolution.

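Filter Bank Approach
The output of the i-th band-pass filter is the convolution $s_i(n) = s(n) * h_i(n) = \sum_m h_i(m)\, s(n-m)$, where $h_i(n)$ is the impulse response of the i-th BPF.
The BPF separates speech into separate spectral components. Alternate strategy: windowed FFT, which we will talk about later.
First, why rectification? There are various forms of rectification:
full-wave rectification, essentially absolute value: $v(n) = |s(n)|$
half-wave rectification: $v(n) = \max(s(n), 0)$
square rectification: $v(n) = s(n)^2$
Our analysis will be just the full-wave version, but the others are similar.

Full-wave analysis
Write the full-wave rectifier as a product, $v(n) = s(n)\, w(n)$, with $w(n) = +1$ where $s(n) \ge 0$ and $w(n) = -1$ otherwise.
Note: if $s(n) = \alpha \sin(\omega n)$ (i.e., a very narrowband signal), then $w(n)$ is a square wave (odd harmonics only).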
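As a small illustration of the three rectifiers named above, here is a hedged Python sketch. The function names, the amplitude 0.8 and the frequency 2π/16 are illustrative assumptions, not taken from the slides. It also checks that full-wave rectification of a sinusoid equals the sinusoid times a ±1 square wave.

```python
import math

def full_wave(x):
    """Full-wave rectification: absolute value of each sample."""
    return [abs(v) for v in x]

def half_wave(x):
    """Half-wave rectification: keep positive samples, zero the rest."""
    return [v if v > 0 else 0.0 for v in x]

def square_law(x):
    """Square rectification: square each sample."""
    return [v * v for v in x]

def sign_signal(x):
    """Square wave: +1 where the sample is >= 0 and -1 elsewhere."""
    return [1.0 if v >= 0 else -1.0 for v in x]

# Narrowband test signal s(n) = alpha * sin(omega * n); values are illustrative.
alpha, omega = 0.8, 2 * math.pi / 16
s = [alpha * math.sin(omega * n) for n in range(64)]
w = sign_signal(s)          # square wave at the same fundamental as s(n)
rectified = full_wave(s)    # equals s(n) * w(n) sample by sample
```

Because multiplication in time is convolution in frequency, the spectrum of the rectified signal is the narrowband spectrum convolved with the odd-harmonic spectrum of the square wave.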

Full-wave Analysis
[Figure: time-domain panels (sine wave, square wave, rectified sine) and frequency-domain panels (spectrum of sine, spectrum of square (rectifier signal), convolved).]

Full-wave analysis
[Figure: spectra at each stage: narrow-band original spectrum; after band-pass filtering; rectifier signal; after rectification; after low-pass filter; after downsampler, recover full bandwidth.]

Filter spacing
Filter spacing could be uniform, constant Q, or perceptually inspired.
[Figure: ideal vs. real filter responses.]
Alternative: constant-Q-like spacing. Common values are α = 2 (octaves) or α = 4/3 (1/3 octave) filters.
Can also use critical-band center frequencies (as measured on humans) as b_i.
Examples:
octave-spaced, non-overlapping, Hz (Fs = 6.67 kHz), C = 200. This is constant Q = cf/bw.
12-band, 1/3-octave, Hz, C = 50.
7-band critical-band-like filter.
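The constant-Q idea can be sketched as a short recursion in which each bandwidth is α times the previous one and the bands sit edge to edge. This is only a sketch under assumptions: the function name, the starting edge of 200 Hz and the choice of four bands are hypothetical, and the slides' own recursion was lost in extraction.

```python
def constant_q_bands(first_low_hz, first_bw_hz, alpha, num_bands):
    """Return (low, center, high, bandwidth) tuples for contiguous bands.

    Each bandwidth is alpha times the previous one. Q = center / bandwidth is
    exactly constant only when first_low_hz == first_bw_hz / (alpha - 1).
    """
    bands = []
    low, bw = first_low_hz, first_bw_hz
    for _ in range(num_bands):
        high = low + bw
        bands.append((low, (low + high) / 2.0, high, bw))
        low, bw = high, bw * alpha   # next band starts where this one ends
    return bands

# Hypothetical octave-spaced design: C = 200 Hz first bandwidth, alpha = 2.
bands = constant_q_bands(first_low_hz=200.0, first_bw_hz=200.0,
                         alpha=2.0, num_bands=4)
q_values = [center / bw for (_, center, _, bw) in bands]
for low, center, high, bw in bands:
    print(f"{low:7.1f} to {high:7.1f} Hz, center {center:7.1f}, Q = {center / bw:.2f}")
```

With these assumed values every band has the same ratio of center frequency to bandwidth, which is what "constant Q = cf/bw" refers to.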

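Alternative view of filtering
Each band-pass filter is a frequency-modulated version of a standalone low-frequency filter: $h_i(n) = w(n)\, e^{j\omega_i n}$, where $w(n)$ is a low-pass filter of length L, evenly centered at zero.
[Figure: example shape of the window w(n).]
The filter output is then
$s_i(n) = s(n) * h_i(n) = \sum_m s(m)\, w(n-m)\, e^{j\omega_i (n-m)} = e^{j\omega_i n} \sum_m s(m)\, w(n-m)\, e^{-j\omega_i m}$
So we're finding the scaled (modulated) DTFT of the speech signal under the window centered at time location n.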
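The equivalence between band-pass filtering and a modulated short-time transform can be checked numerically. The sketch below assumes a band-pass filter formed by modulating a zero-centered low-pass window to a center frequency; the Hann-like window, the test signal and all function names are illustrative choices, not from the slides.

```python
import cmath
import math

def hann_centered(half_len):
    """Symmetric low-pass window defined for n = -half_len..half_len."""
    length = 2 * half_len + 1
    return {n: 0.5 + 0.5 * math.cos(2 * math.pi * n / (length + 1))
            for n in range(-half_len, half_len + 1)}

def bandpass_output(s, w, omega_i, n):
    """Convolve s with h_i(m) = w(m) * exp(j*omega_i*m), evaluated at time n."""
    total = 0j
    for m, w_m in w.items():                  # m indexes the filter tap
        if 0 <= n - m < len(s):
            total += w_m * cmath.exp(1j * omega_i * m) * s[n - m]
    return total

def modulated_stft(s, w, omega_i, n):
    """exp(j*omega_i*n) times the DTFT of s(m) * w(n - m) at omega_i."""
    total = 0j
    for m in range(len(s)):
        if (n - m) in w:
            total += s[m] * w[n - m] * cmath.exp(-1j * omega_i * m)
    return cmath.exp(1j * omega_i * n) * total

signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(40)]
window = hann_centered(5)
max_diff = max(abs(bandpass_output(signal, window, 0.3, n)
                   - modulated_stft(signal, window, 0.3, n))
               for n in range(len(signal)))
print("largest difference between the two views:", max_diff)
```

The two functions compute the same quantity by different routes, so the printed difference should be at the level of floating-point rounding.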

Alternative view of filtering
The filter is modulated to center frequency ω_i; the output carries the slower-frequency amplitude/phase at time n for spectral region ω_i.
We can use FFTs to implement the filter bank. The FFT gives us outputs for linearly spaced frequency bins.
L, the length of the window w(n), determines the spectral/temporal resolution tradeoff.
L long => good spectral resolution (can resolve pitch harmonics), bad temporal resolution; can't see the overall spectral envelope or the desired vocal tract (VT) response.
L short => good temporal resolution, bad spectral resolution; good overall spectral envelope shape (VT response), but can't see pitch harmonics.
LIVE Matlab demo using specgram(s,512,fs,[],512-32)

Mel/Bark scale warping
Mel scale (as a function of normal frequency).
Mel/Bark scale generalization. Approximations: Mel (α = 0.42), Bark (α = 0.55) at 16 kHz.
[Figure: Bark scale and Mel scale curves.]
Ultimate goal: better approximate the peripheral auditory system and cochlear filter bank. What it thinks is important is probably also what is important for speech recognition.
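The warping formulas themselves did not survive in this transcript, so the following Python sketch rests on two assumptions: the widely used mel formula 2595·log10(1 + f/700), and a first-order all-pass (bilinear) frequency warp, which is the usual setting in which α = 0.42 and α = 0.55 are quoted for 16 kHz audio. Neither is confirmed as the slide's exact definition.

```python
import math

def hz_to_mel(f_hz):
    """Common mel formula (an assumption; several variants exist)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def allpass_warp(omega, alpha):
    """Warped frequency in rad/sample for 0 <= omega <= pi and |alpha| < 1."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))

fs = 16000.0
for f in (250.0, 1000.0, 4000.0):
    omega = 2.0 * math.pi * f / fs
    print(f, round(hz_to_mel(f), 1),
          round(allpass_warp(omega, 0.42), 3),   # mel-like warp
          round(allpass_warp(omega, 0.55), 3))   # Bark-like warp
```

A positive α stretches the low-frequency region and compresses the high-frequency region while keeping the endpoints 0 and π fixed, which is the qualitative behaviour of both perceptual scales.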

Constant-Q or perceptual output from FFT
Simple: just bin the outputs of the FFT accordingly for constant-Q or critical-band frequencies.
Example: mel-scale spectral filters (sum the weighted FFT outputs accordingly).
[Figure: mel processing.]
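One common way to "sum the weighted FFT outputs" is a bank of triangular filters spaced evenly on the mel scale. The sketch below is a minimal version under assumptions: triangular shapes, the 2595·log10(1 + f/700) mel formula, and the sizes (10 filters, 512-point FFT, 16 kHz) are all illustrative and not taken from the slides.

```python
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters, n_fft, fs):
    """Triangular weights over FFT bins 0..n_fft//2, evenly spaced in mel."""
    num_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(fs / 2.0)
    # num_filters + 2 edges: filter j spans edges j-1 (low), j (peak), j+1 (high).
    edges_hz = [mel_to_hz(mel_max * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bin_hz = [k * fs / n_fft for k in range(num_bins)]
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = edges_hz[j - 1], edges_hz[j], edges_hz[j + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))    # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))    # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def apply_filterbank(bank, power_spectrum):
    """Weighted sum of FFT power values for each filter."""
    return [sum(wt * p for wt, p in zip(row, power_spectrum)) for row in bank]

bank = mel_filterbank(num_filters=10, n_fft=512, fs=16000.0)
flat_energies = apply_filterbank(bank, [1.0] * 257)   # flat test spectrum
```

On a flat spectrum the higher filters collect more energy simply because they cover more FFT bins, which makes the widening of the bands with frequency visible.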


Windowed speech
This gives us filter-bank processing in a window of speech: the window slides over the speech and we get a VT estimate for each window (in the form of a feature vector).

Human Vocal-Tract Model
Recall the acoustic tube model:
$$H(z) = \frac{U_L(z)}{U_G(z)} = \frac{\dfrac{1 + r_G}{2}\; z^{-N/2} \prod_{i=1}^{N} (1 + r_i)}{1 - \sum_{i=1}^{N} a_i z^{-i}}$$
$$y[n] = c + \sum_{i=1}^{N} \alpha_i\, y[n-i] + x[n]$$
This is a delayed all-pole model.
Question: how do we go directly from the speech signal to the optimal parameters of the all-pole model? This is a parameter estimation problem.

Basics of statistical parameter estimation
Training data: pairs (x, y), where x is a p-dimensional column vector and y is a scalar.
Goal: find f: x → y such that f has minimum error. The expected squared error decomposes over x, so the only part that matters is the conditional error given x, for all x. Taking derivatives and setting to zero, we get the best solution: the conditional mean, $f(x) = E[y \mid x]$.
Note: f() might be linear or non-linear; we don't know. But we can still find the best solution under a linear model, where f(x) = Ax and A is a 1×p vector of regression coefficients (equivalently, $f(x) = A^T x$ with A written as a column vector).

Basics of statistical parameter estimation
In general, assume f() is parameterized by some parameters A; take the derivative of the error E w.r.t. A and set it to zero.
Under the linear assumption $f(x) = A^T x$, this gives a set of linear equations in A. So, when Y is being approximated by AX, the approximation lies in the column space of the matrix X (it is a linear combination of the columns of X). These are called the normal equations.
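A two-parameter example makes the linear case concrete. The sketch below assumes the usual least-squares setup in which the rows of X are the training vectors, so the normal equations read (X^T X) A = X^T y; the data values and the function name are made up for illustration. It also checks that the residual is orthogonal to each column of X, which is the property that gives the normal equations their name.

```python
def solve_normal_equations_2d(xs, ys):
    """Least-squares (a1, a2) for f(x) = a1*x[0] + a2*x[1]."""
    # Entries of X^T X and X^T y, where the rows of X are the training vectors.
    s11 = sum(x[0] * x[0] for x in xs)
    s12 = sum(x[0] * x[1] for x in xs)
    s22 = sum(x[1] * x[1] for x in xs)
    t1 = sum(x[0] * y for x, y in zip(xs, ys))
    t2 = sum(x[1] * y for x, y in zip(xs, ys))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("X^T X is singular; the columns of X are dependent")
    # Cramer's rule for the 2x2 system.
    a1 = (t1 * s22 - s12 * t2) / det
    a2 = (s11 * t2 - s12 * t1) / det
    return a1, a2

# Illustrative data: a constant column plus one regressor.
xs = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
ys = [1.1, 2.9, 5.2, 6.8]
a1, a2 = solve_normal_equations_2d(xs, ys)
residuals = [y - (a1 * x[0] + a2 * x[1]) for x, y in zip(xs, ys)]
# Inner products of the residual with each column of X; both should be ~0.
ortho = [sum(x[j] * e for x, e in zip(xs, residuals)) for j in range(2)]
print("coefficients:", a1, a2, "orthogonality check:", ortho)
```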

Basics of statistical parameter estimation
Normal equations: they are called normal equations because X is orthogonal (normal) to the error E.
We can apply this general setting to all-pole parameter estimation, but in this case, because of the special nature of the problem, there are some computational simplifications that can be made.
[Figure: u[n] (glottal excitation) drives the time-varying vocal tract system function (an all-pole model), producing s[n] (speech signal, a complex waveform).]
Assume that over a short enough time window the system is essentially time-invariant (piecewise TI).

All-pole parameter estimation
All-pole model: s(n) is predictable from a weighted sum of its past samples plus the glottal pulse at that time:
$s(n) = \sum_{i=1}^{p} a_i\, s(n-i) + u(n)$
Approximation: $\hat{s}(n) = \sum_{i=1}^{p} a_i\, s(n-i)$, with error $e(n) = s(n) - \hat{s}(n)$.
The transfer function from S to the error, $A(z) = E(z)/S(z) = 1 - \sum_{i=1}^{p} a_i z^{-i}$, is the inverse filter of H above.
Goal: find the a_i coefficients that make e(n) reflect the glottal pulse. This is hard.
Instead, let's assume a white Gaussian, zero-mean, unit-variance noise source u[n]. Then the estimation problem is just error minimization (this is the maximum likelihood estimate).

All-pole parameter estimation
There are two methods:
1) the least-squares auto-correlation method, where we window the time signal;
2) the least-squares auto-covariance method, where we window the error function itself.
We do the auto-correlation method first.
Define the error cost function: we window the speech signal s[m] with a window function w[m], but first shift the speech signal to the left by n. With the shifted signal $s_n(m) = s(m+n)$ and the windowed segment $w(m)\, s_n(m)$, the cost for the n-th window is the sum over m of the squared prediction error $e_n(m)^2$.
Take derivatives to get the vector a_1, a_2, ..., a_p for the n-th time window. For now, drop the n subscripts and do it for one window, since each window is processed independently of the others.
Other assumptions: ergodicity. The window length is long enough to get a good estimate but short enough that time properties can estimate ensemble properties. In other words, time averages over the window stand in for expectations.

All-pole parameter estimation
$R(k) = \sum_m s(m)\, s(m+k)$ is the autocorrelation of the speech signal (random interpretation).
Normal equations (again), but this time for auto-correlation:
$\sum_{k=1}^{p} a_k\, R(|i-k|) = R(i), \quad 1 \le i \le p$
Features of the solution:
1) the cost is quadratic, so there is only one solution (unimodal);
2) there are fast methods to get the solution.

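All-pole parameter estimation
In the auto-correlation method, s(m) is the windowed signal, $s(m) = w(m)\, s_n(m)$, and we assume it is stationary and ergodic, so that the matrix of the normal equations has entries $R(|i-k|)$.
This is a symmetric Toeplitz p×p matrix (same elements along each diagonal), positive semidefinite (might be singular).

All-pole parameter estimation
s(m) is the windowed signal, $s(m) = w(m)\, s_n(m)$. Why is this valid?
w(m) = 0 outside of 0 ≤ m ≤ N-1, and the same holds for s(m).
e(m) is 0 outside of 0 ≤ m ≤ N+p-1.
To estimate R, we do:
$R(i,k) = \sum_{m=0}^{N+p-1} s(m-i)\, s(m-k), \quad 1 \le i, k \le p$
Note: s(m) = 0 for m < 0 and m > N-1, so the terms above are non-zero only for $\max(i,k) \le m \le N-1+\min(i,k)$.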
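The autocorrelation estimate for one windowed frame, and the Toeplitz matrix built from it, can be sketched in a few lines of Python. The Hamming window, the frame length of 200 samples, the test sinusoid and the order p = 4 are illustrative assumptions; the slides do not specify them.

```python
import math

def hamming(length):
    """Hamming window w(m), m = 0..length-1 (zero outside that range)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (length - 1))
            for m in range(length)]

def autocorrelation(frame, p):
    """R(k) = sum_{m=0}^{N-1-k} s(m) s(m+k), k = 0..p; s is zero outside the frame."""
    n = len(frame)
    return [sum(frame[m] * frame[m + k] for m in range(n - k))
            for k in range(p + 1)]

def toeplitz_matrix(r, p):
    """p x p matrix whose (i, k) entry is R(|i - k|)."""
    return [[r[abs(i - k)] for k in range(p)] for i in range(p)]

p = 4
window = hamming(200)
frame = [w * math.sin(0.2 * m) for m, w in enumerate(window)]  # windowed segment
R = autocorrelation(frame, p)
T = toeplitz_matrix(R, p)
```

Because the frame is zero outside 0..N-1, the sum depends only on the lag, which is why the matrix has the same value along each diagonal.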

All-pole parameter estimation
The result is a function only of i-k, i.e. R(i-k), so when we truncate the speech signal using the window method, the auto-correlation method for stationary signals is indeed appropriate.
Next, an outline of the auto-covariance method, which is better for non-stationary signals, where R(i,k) ≠ R(i-k). It uses a windowed error function. The resulting matrix is not Toeplitz, but it is symmetric.

All-pole parameter estimation
The auto-correlation method is more common because there are fast algorithms to compute the solution (they exploit the regular Toeplitz structure of the matrix).
Levinson's recursion: goes from the O(p^3) required for general matrix inversion to O(p^2).
Durbin's algorithm: reduces by another constant factor of 2 over Levinson's algorithm.
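For reference, here is a minimal Python sketch of the recursion as it is commonly presented in textbooks under the name Levinson-Durbin. It assumes the sign convention in which s(n) is predicted as the sum of a_k s(n-k), so the system solved is sum_k a_k R(|i-k|) = R(i) for i = 1..p. The function name and the test autocorrelation values are illustrative.

```python
def levinson_durbin(r, p):
    """Solve sum_k a_k R(|i-k|) = R(i), i = 1..p, in O(p^2) operations.

    Returns (a, ks, err): predictor coefficients a_1..a_p, the reflection
    coefficients from each stage, and the final prediction error energy.
    """
    a = [0.0] * (p + 1)          # a[0] is unused padding
    err = r[0]
    ks = []
    for i in range(1, p + 1):
        if err <= 0:
            raise ValueError("non-positive prediction error; R is singular")
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]   # order-update of earlier coefficients
        a = new_a
        err *= (1.0 - k * k)
        ks.append(k)
    return a[1:], ks, err

# Illustrative autocorrelation values (not measured from speech).
r_demo = [2.0, 1.2, 0.5, 0.1]
coeffs, reflections, final_err = levinson_durbin(r_demo, 3)
# Residual of each normal equation; all should be ~0.
check = [sum(coeffs[k - 1] * r_demo[abs(i - k)] for k in range(1, 4)) - r_demo[i]
         for i in range(1, 4)]
print(coeffs, reflections, final_err, check)
```

Each stage reuses the solution of the previous order, which is where the saving over general matrix inversion comes from.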

All-pole parameter estimation
So, we now have a set of coefficients a_1, ..., a_p for each window.
Typical: 25 ms analysis window with a 10 ms skip, i.e. the coefficient vectors are produced at a 100 Hz sampling rate (probably oversampling).
The coefficients for each window represent peaks of the spectrum, attempting to estimate the vocal-tract transfer function within that window.
The Gaussian noise assumption used to minimize the error is not always correct (e.g., during voiced speech).
Ideally, we want the error function to equal the glottal pulse (but this is normally impossible to get).
Methods other than LP analysis are now being used: the minimum variance distortionless response (MVDR) spectrum method is better in some cases (particularly high-pitched speakers, children/females).

MVDR
Basic idea: filter-bank approach; filter the speech s(n) with a band-pass p-th order filter h_l(n) (Murthi & Rao, 2000).
The filter is designed to minimize the output power subject to the constraint that the distortion at the given center frequency ω_l is zero (meaning the filter response at that frequency is unity).
The solution is not too different from the auto-correlation method: the matrix involved is the square (p+1)×(p+1) Toeplitz autocorrelation matrix from before.
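The framing arithmetic (25 ms window, 10 ms skip) can be sketched as follows. The 16 kHz sampling rate, the function name and the one-second dummy signal are assumptions for illustration only.

```python
def frame_signal(signal, fs, win_ms=25.0, hop_ms=10.0):
    """Slice a signal into overlapping analysis frames.

    Only frames that fit entirely inside the signal are returned; padding the
    tail is a separate design choice that is not made here.
    """
    win = int(round(fs * win_ms / 1000.0))   # samples per analysis window
    hop = int(round(fs * hop_ms / 1000.0))   # samples between window starts
    frames = []
    start = 0
    while start + win <= len(signal):
        frames.append(signal[start:start + win])
        start += hop
    return frames

fs = 16000                        # assumed sampling rate in Hz
one_second = [0.0] * fs           # placeholder signal, one second long
frames = frame_signal(one_second, fs)
print(len(frames), "frames of", len(frames[0]), "samples each")
```

Each frame would then be windowed and passed through the autocorrelation and recursion steps, giving one coefficient vector per 10 ms.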
