DECOMPOSITION OF SPEECH INTO VOICED AND UNVOICED COMPONENTS BASED ON A KALMAN FILTERBANK
Mark Thomson, Simon Boland †, Michael Smithers 3, Mike Wu & Julien Epps
Motorola Labs, Botany, NSW 2019
† Avaya R & D, North Ryde, NSW 2113
3 Dolby Laboratories, San Francisco, USA

Melbourne, December 2 to 5, 2002. Australian Speech Science & Technology Association Inc. Accepted after full review.

ABSTRACT: We present a novel method for decomposing speech into signals representing the voiced and unvoiced components of speech. The method involves first demodulating the variations in spectral envelope, energy and pitch, and then applying a bank of Kalman filters to separate the harmonic and non-harmonic components of the signal. The use of Kalman filters relies on a state-space representation of the composite signal, and provides a way to accurately estimate the harmonic component without the large delay required by a linear phase comb filter. However, it also requires a priori knowledge of the variance of the unvoiced component and of the state transition parameters. We present a novel method to accurately determine these parameters based on a variant of the Expectation-Maximisation algorithm.

INTRODUCTION

The distinction between voiced and unvoiced sounds is important in many areas of speech technology. In speech coding, for example, different mechanisms are often used to encode the voiced and unvoiced parts of speech (Kleijn and Haagen, 1994). In some methods of speech enhancement, the quasiperiodic nature of voiced speech is used to design an optimal filter that separates speech from additive noise (Goh et al, 1999). In speech recognition, knowledge of the temporal structure of the cycles of voiced speech can be used to process the speech in such a way that the impact of additive noise on feature extraction is reduced (Macho and Cheng, 2001). Knowledge of the pitch of voiced speech is also useful in speech recognition for tonal languages, such as Mandarin (Zhang et al, 2002).
In some applications, it is sufficient to assume that any particular segment of speech is either purely voiced or purely unvoiced, and to classify segments into one of these two categories. This was true, for example, in early low bit rate vocoders (Campbell and Tremain, 1986). In reality, however, many segments of speech contain both quasiperiodic and noise-like energy, and many processing methods are designed to exploit this. In some cases, what is required is simply a determination of the degree of voicing. In mixed excitation linear prediction (MELP) coding (McCree and Barnwell, 1995) and multiband excitation (MBE) coding (Griffin and Lim, 1988), for example, a frequency-dependent measure of the strength of voicing is used to control the relative amount of periodic and non-periodic energy in the excitation of a linear prediction filter. In other cases, an attempt is made to explicitly separate the voiced and unvoiced components. In codebook-excited linear predictive (CELP) speech coders, this is achieved through an analysis-by-synthesis procedure (Gerson and Jasiuk, 1992). Speech is generated by exciting a short-term linear prediction filter with a combination of signals from both an adaptive codebook, representing voiced energy, and a fixed codebook, representing unvoiced energy. Minimisation of the perceptually weighted difference between the synthesised and input speech is used to estimate the two components. Several alternative approaches are possible. One is to use a linear comb filter to isolate the voiced component based on its harmonic structure. This is similar to the practice of using a low pass filter to separate slowly evolving and rapidly evolving components of the pitch cycle in interpolation-based coding (Kleijn and Haagen, 1994). One limitation of this approach, however, is that its effectiveness depends on having a filter with a sharp roll-off, which requires a long impulse response.
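As a concrete illustration of this limitation (not taken from the paper itself), the sketch below implements a simple linear-phase comb filter that estimates the harmonic component by averaging K pitch periods. The test signal, pitch period and tap count are invented for the example; the point is that the filter's group delay is (K−1)T/2 samples, so sharper harmonic selectivity (larger K) directly costs more delay.

```python
import numpy as np

def comb_filter_voiced(y, T, K=5):
    """Linear-phase FIR comb filter: estimate the harmonic component by
    averaging K pitch periods (taps at lags 0, T, ..., (K-1)*T)."""
    h = np.zeros((K - 1) * T + 1)
    h[::T] = 1.0 / K                        # K equal taps, one period apart
    return np.convolve(y, h, mode="same")   # 'same' centres the (K-1)*T/2 delay

# Toy signal: two harmonics with period T plus white noise.
rng = np.random.default_rng(0)
T = 80
n = np.arange(8 * T)
voiced = np.cos(2 * np.pi * n / T) + 0.5 * np.cos(4 * np.pi * n / T)
y = voiced + 0.3 * rng.standard_normal(n.size)

v_hat = comb_filter_voiced(y, T, K=5)
```

With K = 5 and T = 80 the filter already needs a look-ahead of 160 samples (20 ms at 8 kHz), which is why a comb filter sharp enough for good separation implies a large decomposition delay.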
The implication of this is that the decomposition process requires a relatively large delay, which is undesirable in some applications, such as speech coding, and also creates difficulties in dealing with rapid transitions. Achieving good decomposition without a large delay requires the use of more a priori knowledge about signal behaviour. One approach is to impose a deterministic parametric model on the evolution of the harmonic coefficients (Stylianou, 1996). However, the signal model is then highly non-linear, and parameter estimation becomes very complex. Stochastic models of signal evolution have been suggested by both Gruber and Tödtli (1994) and Stachurski (1997); however, both also involve very complex estimation processes. In this paper we present a new method of decomposition that is also based on a stochastic model, but which is much simpler to implement, and which also permits more control over the behaviour of the decomposition. The approach involves using a bank of Kalman filters, each corresponding to one sample in a normalised pitch period.

SIGNAL MODELING AND ESTIMATION

In keeping with usual practice, we represent speech as the response of an autoregressive (AR) system, representing the vocal tract filter, to an input signal representing the acoustic energy generated by both vocal fold vibration and turbulent airflow:

    z_k = Σ_{i=1}^{M} a_i z_{k−i} + g y_k        (1)

and

    y_k = x_k + v_k        (2)

where g is a gain factor, x_k is a quasiperiodic signal, and v_k is an uncorrelated Gaussian random variable with variance σ_v². The responses of the vocal tract filter to the two components, x_k and v_k, constitute the voiced and unvoiced components of speech respectively. Fundamental to our method of decomposing z_k into these components is the way that x_k is modelled. The component is assumed to evolve according to

    x_k = α x_{k−T} + w_k        (3)

where T is the period of x_k, w_k is an uncorrelated Gaussian random variable with variance σ_w², and α is a gain value. Based on this model, the overall decomposition process is depicted in Figure 1.
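The model of (2) and (3) can be exercised with a short simulation. This is a minimal sketch rather than the authors' code: the parameter values, the cosine-shaped first cycle and the function name `synthesize` are all illustrative assumptions.

```python
import numpy as np

def synthesize(alpha, sigma_w, sigma_v, T, n_periods, rng):
    """Draw a signal from the model y_k = x_k + v_k, where the quasiperiodic
    state evolves one period at a time: x_k = alpha * x_{k-T} + w_k."""
    x = np.zeros(T * n_periods)
    x[:T] = np.cos(2 * np.pi * np.arange(T) / T)   # arbitrary first cycle
    for k in range(T, x.size):
        x[k] = alpha * x[k - T] + sigma_w * rng.standard_normal()
    v = sigma_v * rng.standard_normal(x.size)      # unvoiced (observation) noise
    return x + v, x

rng = np.random.default_rng(1)
y, x = synthesize(alpha=0.98, sigma_w=0.02, sigma_v=0.3, T=80, n_periods=10, rng=rng)
```

With α close to 1 and small σ_w, consecutive cycles of x_k stay highly correlated while their amplitude drifts slowly, which is the behaviour the decomposition exploits.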
In our implementation, processing is carried out on a frame-by-frame basis with frames of 20 ms duration. We begin by demodulating the variation in the energy, spectral envelope, and pitch of the signal. Energy is estimated on a subframe basis (4 subframes/frame). Demodulation of the spectral envelope variation is achieved by applying an inverse filter estimated once per frame by linear prediction. The linear prediction residual is used to estimate the pitch period, and the period is used to time-warp the signal to a fixed period. The demodulated signal is an approximation, ŷ_k, of y_k. Equations (2) and (3) together constitute a state-space representation of this signal, with w_k representing the process noise and v_k the observation noise. Based on this, a Kalman filter can be used to estimate the state variable x_k by means of the following recursion:

    x̂_{k|k} = α x̂_{k−T|k−T} + K (ŷ_k − α x̂_{k−T|k−T})        (4)
[Figure 1: Decomposition System. The input z_k is demodulated (energy, period and LPC parameters are extracted and removed) to produce ŷ_k; a parameter estimation block supplies α and σ_v² to the Kalman filterbank, whose output and residual are remodulated to give the voiced and unvoiced components.]

where

    K = Σ_{k|k−T} (Σ_{k|k−T} + σ_v²)^{−1}        (5)

is the Kalman gain,

    Σ_{k|k−T} = α² Σ_{k−T|k−T} + σ_w²        (6)

is the variance of the error in the predicted state estimate, x̂_{k|k−T}, and

    Σ_{k|k} = (1 − K) Σ_{k|k−T}        (7)

is the variance of the error in the filtered state estimate, x̂_{k|k}. σ_w² may be chosen to control the rate at which the estimated quasiperiodic component evolves. However, α and σ_v² must be estimated from the input data; we describe a new method to do this in the next section. Since the state variable is different for each sample in the period, estimation of the entire period essentially constitutes a bank of multiple scalar Kalman filters. The smoothing form of the Kalman filter may also be used to take advantage of future pitch cycles in estimating each current sample. The observation noise is estimated as v̂_k = ŷ_k − x̂_{k|k}. The estimated quasiperiodic and noisy components can then be remodulated using the estimated period, LPC filter and energy to produce the voiced and unvoiced components of the speech.

For the decomposition to work effectively, it is essential that when quasiperiodic energy is present in the signal, its period be known accurately. This requires not only that the resolution of the period estimate be sufficiently high, but also that the estimation method be able to track variations in period over sufficiently short time intervals. In order to ensure that x_k and T are always optimally aligned, it needs to be possible to track variations in period within a pitch cycle. To achieve this we have used a dynamic programming approach, with a path metric composed of an accumulated average magnitude difference function with an additional term to penalise inappropriate variations in period. The model on which our method is based has some similarity to those in Gruber and Tödtli (1994).
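The recursion (4)–(7) amounts to a few lines per sample. The following is one possible scalar implementation, a sketch rather than the authors' code: it assumes α, σ_w² and σ_v² are given, that the signal has already been demodulated to a constant period T, and the function name and interface are invented for the example.

```python
import numpy as np

def kalman_filterbank(y, T, alpha, var_w, var_v):
    """One scalar Kalman filter per sample position in the normalised
    period, i.e. the recursion (4)-(7) applied at lag T."""
    x_hat = np.zeros_like(y, dtype=float)
    x_hat[:T] = y[:T]            # initialise the first period from the data
    P = np.full(T, var_v)        # filtered error variance, one per position
    for k in range(T, y.size):
        j = k % T                                 # position within the period
        P_pred = alpha**2 * P[j] + var_w          # eq (6): predicted variance
        K = P_pred / (P_pred + var_v)             # eq (5): Kalman gain
        x_pred = alpha * x_hat[k - T]             # prediction from one period ago
        x_hat[k] = x_pred + K * (y[k] - x_pred)   # eq (4): filtered state
        P[j] = (1.0 - K) * P_pred                 # eq (7): filtered variance
    return x_hat, y - x_hat      # quasiperiodic estimate and residual
```

Because the state for each position in the period is updated only once per cycle, the whole bank costs a handful of scalar operations per sample, in contrast to vector Kalman formulations.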
However, the use here of a time-domain state representation makes it possible to use only scalar Kalman filter estimators, resulting in significantly lower complexity. In addition, the methods in Gruber and Tödtli (1994) explicitly assumed that σ_v² is known in advance, and made no allowance for an explicit state transition gain α. The latter point is particularly important in decomposing speech, because the overall amplitude of consecutive cycles can change more rapidly than their shape. The model developed in Stachurski (1997) is almost identical to that described by (2) and (3), but again there was no allowance for a variable transition gain, and also no provision for explicitly controlling σ_w². In addition, because σ_v² was not known or determined prior to decomposition, it was not possible to use a Kalman filter for signal estimation. Instead a much more complex algorithm was proposed, based on singular value decomposition.

ESTIMATION OF DYNAMICAL SYSTEM PARAMETERS

Good estimates of α and σ_v² are critically important in order for the decomposition described above to be effective. An iterative method for determining the parameters, θ, of a general linear dynamic system from observations of its output was developed by Digalakis et al (1993), based on the Expectation-Maximisation (EM) algorithm. Each iteration involves maximising the expected joint log likelihood of the observed data sequence and the unknown state sequence, conditioned on the observed data and the previous estimate of θ. In our application, we only require estimates of α and σ_v². Using the procedure described by Digalakis et al, the values that maximise the expected joint log likelihood are:

    α = [ Σ_{k=1}^{N} E{x_k x_{k−T}} ] / [ Σ_{k=1}^{N} E{x²_{k−T}} ]        (8)

    σ_v² = (1/N) Σ_{k=1}^{N} (ŷ_k − x̂_k)²        (9)

where N represents a fixed interval over which α and σ_v² are assumed constant, x̂_k is the smoothed state estimate, and E{x_k x_{k−T}} = x̂_k x̂_{k−T} + Σ_{k,k−T}, where Σ_{k,k−T} is the covariance of (x_k, x_{k−T}). The expectations in (8) should be understood to be conditioned on both the observed data up to N and the initial estimates of α and σ_v².
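Given state estimates from a previous filtering or smoothing pass, one pass of (8) and (9) can be sketched as follows. This is an illustrative simplification rather than the paper's code: the cross-covariance term in the numerator of (8) is neglected here, and the function name and interface are hypothetical.

```python
import numpy as np

def update_parameters(y, x_hat, Sigma, T):
    """One re-estimation of alpha and var_v in the spirit of (8)-(9).
    x_hat holds state estimates and Sigma their error variances over the
    interval; the cross-covariance term of (8) is neglected here."""
    num = np.sum(x_hat[T:] * x_hat[:-T])          # approx. sum of E{x_k x_{k-T}}
    den = np.sum(x_hat[:-T] ** 2 + Sigma[:-T])    # sum of E{x_{k-T}^2}
    alpha = num / den                             # eq (8)
    var_v = np.mean((y[T:] - x_hat[T:]) ** 2)     # eq (9)
    return alpha, var_v
```

In an EM loop these updated values would be fed back into the Kalman recursion and the two steps iterated, which is the procedure whose sensitivity to initial values is discussed next.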
The effectiveness of the recursion defined by (8) and (9) depends significantly on the accuracy of the initial estimates of α and σ_v². Inaccurate starting values will lead to slow convergence, and may cause the algorithm to converge to a local optimum. Although no method of obtaining initial estimates was suggested by Digalakis et al (1993), this was not a significant problem there, since the application of interest was the training of acoustic models for speech recognition, where estimation occurs off-line. However, in the current application α and σ_v² vary throughout the speech waveform, and must be estimated in real time. We present here a method to obtain these values using only the observed data and past values of the estimated state sequence. The method is derived from the recursion equations above, and relies on the assumption that the interval over which α and σ_v² are constant is no more than one period. Although, in principle, these estimates may be used as initial values for subsequent EM iterations, in our experience they are generally sufficiently accurate in themselves, without resorting to further recursion.

To estimate α, we first note that since v_k is uncorrelated with x_{k−T}, the expectation in the numerator of (8) can be written as E{x_k x_{k−T}} = E{(y_k − v_k) x_{k−T}} = E{y_k x_{k−T}}. The smoothed state estimate x̂_{k−T}, which also appears in the denominator, is not known a priori. However, in the mean over the estimation interval, x̂_{k−T} is well approximated by the filtered estimate, x̂_{k−T|k−T}. In addition, the error variance, Σ_{k−T|k−T}, can be expected to be small compared with x̂_{k−T} ŷ_k. Thus α can be approximated by

    α ≈ [ Σ_k ŷ_k x̂_{k−T|k−T} ] / [ Σ_k x̂²_{k−T|k−T} ]        (10)

Provided the summation interval in (10) is no more than one period, all terms on the right hand side are known.

Using the α computed from (10), σ_v² can be found as follows. Again assuming that the interval is no more than one period, x̂_k in (9) is equivalent to the filtered estimate x̂_{k|k}. Using (4) to compute this value results in

    σ_v² = (1/N) Σ_k [ σ_v² / (Σ_{k|k−T} + σ_v²) ]² (ŷ_k − α x̂_{k−T|k−T})²        (11)

where Σ_{k|k−T} is determined from (6). (11) can be manipulated to produce a quadratic in σ_v. Assuming the signal is not noise-free (σ_v ≠ 0), the value of σ_v that satisfies this is the larger root,

    σ_v = ( d + [ d² − 4 Σ_{k|k−T} ]^{1/2} ) / 2,   d = [ (1/N) Σ_k (ŷ_k − α x̂_{k−T|k−T})² ]^{1/2}        (12)

RESULTS AND DISCUSSION

Figure 2 illustrates the application of our algorithm to a segment of speech consisting of a dominant unvoiced component followed by a dominant voiced component. The smoothing form of the Kalman filter was used, with a two-period look-ahead. The results show that the algorithm successfully decomposes the speech, with strong attenuation of noisy energy in the voiced component and no visible harmonic energy in the unvoiced component. The presence of unvoiced signal energy during segments that would generally be classified as voiced is significant. Listening tests indicate that the unvoiced component retains the intelligibility of the original speech, but with a whispered quality.

[Figure 2: (top to bottom) speech waveform, estimated voiced component, estimated unvoiced component.]

CONCLUSIONS

We have presented a novel method for decomposing speech into voiced and unvoiced components in the time domain. The algorithm is distinctive in its use of a Kalman filterbank, based on dynamical system parameters estimated on-line using a form of the Expectation-Maximisation algorithm.

ACKNOWLEDGEMENTS

This work was performed while the authors were all with Motorola.
Simon Boland is now with Avaya Research and Development, and Michael Smithers is with Dolby Laboratories.

REFERENCES

Campbell, J. P. Jr & Tremain, T. E. (1986), Voiced/unvoiced classification of speech with applications to the U.S. Government LPC-10E algorithm, Proceedings of the International Conference on Acoustics, Speech and Signal Processing.
Digalakis, V., Rohlicek, J. R. & Ostendorf, M. (1993), ML estimation of a stochastic linear system with the EM algorithm and its application to speech recognition, IEEE Transactions on Speech and Audio Processing, Vol. 1, No. 4.

Gerson, I. A. & Jasiuk, M. A. (1992), Techniques for improving the performance of CELP-type speech coders, IEEE Journal on Selected Areas in Communications, Vol. 10, No. 5.

Goh, Z., Tan, K.-C. & Tan, B. T. G. (1999), Kalman filtering speech enhancement method based on a voiced-unvoiced speech model, IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 5.

Griffin, D. W. & Lim, J. S. (1988), Multiband excitation vocoder, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 36, No. 8.

Gruber, P. & Tödtli, J. (1994), Estimation of quasiperiodic signal parameters by means of dynamic signal models, IEEE Transactions on Signal Processing, Vol. 42, No. 3.

Kleijn, W. B. & Haagen, J. (1994), Transformation and decomposition of the speech signal for coding, IEEE Signal Processing Letters, Vol. 1, No. 9.

Macho, D. & Cheng, Y.-M. (2001), SNR-dependent waveform processing for improving the robustness of ASR front-end, Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Vol. 1.

McCree, A. V. & Barnwell, T. P. III (1995), A mixed excitation LPC vocoder model for low bit rate speech coding, IEEE Transactions on Speech and Audio Processing, Vol. 3, No. 4.

Stachurski, J. (1997), A Pitch Pulse Evolution Model for Linear Predictive Coding of Speech, Ph.D. Thesis, McGill University, Montreal, Canada.

Stylianou, Y. (1996), Efficient decomposition of speech signals into a deterministic and a stochastic part, Proceedings of the International Symposium on Signal Processing and its Applications.

Zhang, Y., Madievski, A., Lawrence, J. & Song, J. (2002), A study of tone statistics in Chinese names, Speech Communication, Vol. 36.
More informationAutomatic Transcription of Monophonic Audio to MIDI
Automatic Transcription of Monophonic Audio to MIDI Jiří Vass 1 and Hadas Ofir 2 1 Czech Technical University in Prague, Faculty of Electrical Engineering Department of Measurement vassj@fel.cvut.cz 2
More informationEnhancement of Speech Signal Based on Improved Minima Controlled Recursive Averaging and Independent Component Analysis
Enhancement of Speech Signal Based on Improved Minima Controlled Recursive Averaging and Independent Component Analysis Mohini Avatade & S.L. Sahare Electronics & Telecommunication Department, Cummins
More informationSOUND SOURCE RECOGNITION AND MODELING
SOUND SOURCE RECOGNITION AND MODELING CASA seminar, summer 2000 Antti Eronen antti.eronen@tut.fi Contents: Basics of human sound source recognition Timbre Voice recognition Recognition of environmental
More informationOn a Classification of Voiced/Unvoiced by using SNR for Speech Recognition
International Conference on Advanced Computer Science and Electronics Information (ICACSEI 03) On a Classification of Voiced/Unvoiced by using SNR for Speech Recognition Jongkuk Kim, Hernsoo Hahn Department
More informationSynchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech
INTERSPEECH 5 Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech M. A. Tuğtekin Turan and Engin Erzin Multimedia, Vision and Graphics Laboratory,
More informationEffects of Reverberation on Pitch, Onset/Offset, and Binaural Cues
Effects of Reverberation on Pitch, Onset/Offset, and Binaural Cues DeLiang Wang Perception & Neurodynamics Lab The Ohio State University Outline of presentation Introduction Human performance Reverberation
More informationThe Channel Vocoder (analyzer):
Vocoders 1 The Channel Vocoder (analyzer): The channel vocoder employs a bank of bandpass filters, Each having a bandwidth between 100 Hz and 300 Hz. Typically, 16-20 linear phase FIR filter are used.
More informationDifferent Approaches of Spectral Subtraction Method for Speech Enhancement
ISSN 2249 5460 Available online at www.internationalejournals.com International ejournals International Journal of Mathematical Sciences, Technology and Humanities 95 (2013 1056 1062 Different Approaches
More informationSinusoidal Modelling in Speech Synthesis, A Survey.
Sinusoidal Modelling in Speech Synthesis, A Survey. A.S. Visagie, J.A. du Preez Dept. of Electrical and Electronic Engineering University of Stellenbosch, 7600, Stellenbosch avisagie@dsp.sun.ac.za, dupreez@dsp.sun.ac.za
More informationRobust Linear Prediction Analysis for Low Bit-Rate Speech Coding
Robust Linear Prediction Analysis for Low Bit-Rate Speech Coding Nanda Prasetiyo Koestoer B. Eng (Hon) (1998) School of Microelectronic Engineering Faculty of Engineering and Information Technology Griffith
More informationNCCF ACF. cepstrum coef. error signal > samples
ESTIMATION OF FUNDAMENTAL FREQUENCY IN SPEECH Petr Motl»cek 1 Abstract This paper presents an application of one method for improving fundamental frequency detection from the speech. The method is based
More informationRobust Algorithms For Speech Reconstruction On Mobile Devices
Robust Algorithms For Speech Reconstruction On Mobile Devices XU SHAO A Thesis presented for the degree of Doctor of Philosophy Speech Group School of Computing Sciences University of East Anglia England
More informationA METHOD OF SPEECH PERIODICITY ENHANCEMENT BASED ON TRANSFORM-DOMAIN SIGNAL DECOMPOSITION
8th European Signal Processing Conference (EUSIPCO-2) Aalborg, Denmark, August 23-27, 2 A METHOD OF SPEECH PERIODICITY ENHANCEMENT BASED ON TRANSFORM-DOMAIN SIGNAL DECOMPOSITION Feng Huang, Tan Lee and
More informationMonophony/Polyphony Classification System using Fourier of Fourier Transform
International Journal of Electronics Engineering, 2 (2), 2010, pp. 299 303 Monophony/Polyphony Classification System using Fourier of Fourier Transform Kalyani Akant 1, Rajesh Pande 2, and S.S. Limaye
More informationDrum Transcription Based on Independent Subspace Analysis
Report for EE 391 Special Studies and Reports for Electrical Engineering Drum Transcription Based on Independent Subspace Analysis Yinyi Guo Center for Computer Research in Music and Acoustics, Stanford,
More informationCepstrum alanysis of speech signals
Cepstrum alanysis of speech signals ELEC-E5520 Speech and language processing methods Spring 2016 Mikko Kurimo 1 /48 Contents Literature and other material Idea and history of cepstrum Cepstrum and LP
More informationDefense Technical Information Center Compilation Part Notice
UNCLASSIFIED Defense Technical Information Center Compilation Part Notice ADP010883 TITLE: The Turkish Narrow Band Voice Coding and Noise Pre-Processing NATO Candidate DISTRIBUTION: Approved for public
More informationHUMAN speech is frequently encountered in several
1948 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 7, SEPTEMBER 2012 Enhancement of Single-Channel Periodic Signals in the Time-Domain Jesper Rindom Jensen, Student Member,
More informationVocoder (LPC) Analysis by Variation of Input Parameters and Signals
ISCA Journal of Engineering Sciences ISCA J. Engineering Sci. Vocoder (LPC) Analysis by Variation of Input Parameters and Signals Abstract Gupta Rajani, Mehta Alok K. and Tiwari Vebhav Truba College of
More informationConverting Speaking Voice into Singing Voice
Converting Speaking Voice into Singing Voice 1 st place of the Synthesis of Singing Challenge 2007: Vocal Conversion from Speaking to Singing Voice using STRAIGHT by Takeshi Saitou et al. 1 STRAIGHT Speech
More informationMikko Myllymäki and Tuomas Virtanen
NON-STATIONARY NOISE MODEL COMPENSATION IN VOICE ACTIVITY DETECTION Mikko Myllymäki and Tuomas Virtanen Department of Signal Processing, Tampere University of Technology Korkeakoulunkatu 1, 3370, Tampere,
More informationDigital Signal Representation of Speech Signal
Digital Signal Representation of Speech Signal Mrs. Smita Chopde 1, Mrs. Pushpa U S 2 1,2. EXTC Department, Mumbai University Abstract Delta modulation is a waveform coding techniques which the data rate
More informationADAPTIVE IDENTIFICATION OF TIME-VARYING IMPULSE RESPONSE OF UNDERWATER ACOUSTIC COMMUNICATION CHANNEL IWONA KOCHAŃSKA
ADAPTIVE IDENTIFICATION OF TIME-VARYING IMPULSE RESPONSE OF UNDERWATER ACOUSTIC COMMUNICATION CHANNEL IWONA KOCHAŃSKA Gdańsk University of Technology Faculty of Electronics, Telecommuniations and Informatics
More informationINSTANTANEOUS FREQUENCY ESTIMATION FOR A SINUSOIDAL SIGNAL COMBINING DESA-2 AND NOTCH FILTER. Yosuke SUGIURA, Keisuke USUKURA, Naoyuki AIKAWA
INSTANTANEOUS FREQUENCY ESTIMATION FOR A SINUSOIDAL SIGNAL COMBINING AND NOTCH FILTER Yosuke SUGIURA, Keisuke USUKURA, Naoyuki AIKAWA Tokyo University of Science Faculty of Science and Technology ABSTRACT
More informationVoice Activity Detection
Voice Activity Detection Speech Processing Tom Bäckström Aalto University October 2015 Introduction Voice activity detection (VAD) (or speech activity detection, or speech detection) refers to a class
More informationQuantisation mechanisms in multi-protoype waveform coding
University of Wollongong Research Online University of Wollongong Thesis Collection 1954-2016 University of Wollongong Thesis Collections 1996 Quantisation mechanisms in multi-protoype waveform coding
More informationSpeech Coding in the Frequency Domain
Speech Coding in the Frequency Domain Speech Processing Advanced Topics Tom Bäckström Aalto University October 215 Introduction The speech production model can be used to efficiently encode speech signals.
More informationSignal segmentation and waveform characterization. Biosignal processing, S Autumn 2012
Signal segmentation and waveform characterization Biosignal processing, 5173S Autumn 01 Short-time analysis of signals Signal statistics may vary in time: nonstationary how to compute signal characterizations?
More informationSignal Analysis. Peak Detection. Envelope Follower (Amplitude detection) Music 270a: Signal Analysis
Signal Analysis Music 27a: Signal Analysis Tamara Smyth, trsmyth@ucsd.edu Department of Music, University of California, San Diego (UCSD November 23, 215 Some tools we may want to use to automate analysis
More informationROBUST echo cancellation requires a method for adjusting
1030 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 3, MARCH 2007 On Adjusting the Learning Rate in Frequency Domain Echo Cancellation With Double-Talk Jean-Marc Valin, Member,
More informationRecent Advances in Acoustic Signal Extraction and Dereverberation
Recent Advances in Acoustic Signal Extraction and Dereverberation Emanuël Habets Erlangen Colloquium 2016 Scenario Spatial Filtering Estimated Desired Signal Undesired sound components: Sensor noise Competing
More informationADAPTIVE NOISE LEVEL ESTIMATION
Proc. of the 9 th Int. Conference on Digital Audio Effects (DAFx-6), Montreal, Canada, September 18-2, 26 ADAPTIVE NOISE LEVEL ESTIMATION Chunghsin Yeh Analysis/Synthesis team IRCAM/CNRS-STMS, Paris, France
More informationSpeech Processing. Undergraduate course code: LASC10061 Postgraduate course code: LASC11065
Speech Processing Undergraduate course code: LASC10061 Postgraduate course code: LASC11065 All course materials and handouts are the same for both versions. Differences: credits (20 for UG, 10 for PG);
More informationVQ Source Models: Perceptual & Phase Issues
VQ Source Models: Perceptual & Phase Issues Dan Ellis & Ron Weiss Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA {dpwe,ronw}@ee.columbia.edu
More informationREAL TIME DIGITAL SIGNAL PROCESSING
REAL TIME DIGITAL SIGNAL PROCESSING UTN-FRBA 2010 Adaptive Filters Stochastic Processes The term stochastic process is broadly used to describe a random process that generates sequential signals such as
More informationLearning New Articulator Trajectories for a Speech Production Model using Artificial Neural Networks
Learning New Articulator Trajectories for a Speech Production Model using Artificial Neural Networks C. S. Blackburn and S. J. Young Cambridge University Engineering Department (CUED), England email: csb@eng.cam.ac.uk
More informationVoice Conversion of Non-aligned Data using Unit Selection
June 19 21, 2006 Barcelona, Spain TC-STAR Workshop on Speech-to-Speech Translation Voice Conversion of Non-aligned Data using Unit Selection Helenca Duxans, Daniel Erro, Javier Pérez, Ferran Diego, Antonio
More informationEpoch Extraction From Speech Signals K. Sri Rama Murty and B. Yegnanarayana, Senior Member, IEEE
1602 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 8, NOVEMBER 2008 Epoch Extraction From Speech Signals K. Sri Rama Murty and B. Yegnanarayana, Senior Member, IEEE Abstract
More informationAdaptive Filters Wiener Filter
Adaptive Filters Wiener Filter Gerhard Schmidt Christian-Albrechts-Universität zu Kiel Faculty of Engineering Institute of Electrical and Information Engineering Digital Signal Processing and System Theory
More informationDESIGN AND IMPLEMENTATION OF AN ALGORITHM FOR MODULATION IDENTIFICATION OF ANALOG AND DIGITAL SIGNALS
DESIGN AND IMPLEMENTATION OF AN ALGORITHM FOR MODULATION IDENTIFICATION OF ANALOG AND DIGITAL SIGNALS John Yong Jia Chen (Department of Electrical Engineering, San José State University, San José, California,
More informationSpeech Coding Technique And Analysis Of Speech Codec Using CS-ACELP
Speech Coding Technique And Analysis Of Speech Codec Using CS-ACELP Monika S.Yadav Vidarbha Institute of Technology Rashtrasant Tukdoji Maharaj Nagpur University, Nagpur, India monika.yadav@rediffmail.com
More information