ACCURATE SPEECH DECOMPOSITION INTO PERIODIC AND APERIODIC COMPONENTS BASED ON DISCRETE HARMONIC TRANSFORM

15th European Signal Processing Conference (EUSIPCO 2007), Poznan, Poland, September 3-7, 2007, copyright by EURASIP

ACCURATE SPEECH DECOMPOSITION INTO PERIODIC AND APERIODIC COMPONENTS BASED ON DISCRETE HARMONIC TRANSFORM

Piotr Zubrycki and Alexander Petrovsky
Department of Real-Time Systems, Bialystok Technical University
Wiejska 45A street, 5-35 Bialystok, Poland
phone: (48 85), fax: (48 85), e-mail: palex@it.org.by

ABSTRACT

This paper presents a new method for decomposing the speech signal into periodic and aperiodic components. The proposed method is based on the Discrete Harmonic Transform (DHT), a transformation that analyses the signal spectrum in the harmonic domain and is able to synchronize its kernel with a time-varying pitch frequency. The system works without a priori knowledge of the pitch track. Unlike most applications, the proposed method estimates the change of the fundamental frequency within a frame before estimating the fundamental frequency itself. The periodic component is modelled as a sum of harmonically related sinusoids, whose amplitudes and initial phases are estimated accurately with the DHT. The aperiodic component is defined as the difference between the original speech and the estimated periodic component.

1. INTRODUCTION

The speech signal is generally regarded as a composition of two major components: a periodic (harmonic) one and an aperiodic (noise) one. Decomposing speech into these two basic components is a major challenge in many speech processing systems. The task lies in estimating the periodic and aperiodic components accurately enough that they can be analysed separately, which plays an important role in speech applications such as synthesis and coding. The periodic component is generated by the vibration of the vocal folds, while the aperiodic component is generated by the modulation of the air flow.
The modulated air flow is responsible for generating fricative and plosive sounds, but it is present in voiced sounds as well. The basic speech production model assumes that speech is either voiced or unvoiced. In this basic model the unvoiced part of speech is generated by passing white Gaussian noise through a linear filter representing the vocal tract characteristics, while voiced parts are modelled as a time-varying impulse train modulated by the vocal tract filter. The model thus assumes that no noise is present in the voiced parts of speech. In fact, real voiced speech contains some noise, and the speech signal can be viewed as a mixed-source signal with both periodic and aperiodic excitation. In the sinusoidal and noise speech models this mixed-source signal is generally modelled as [1]:

s(n) = \sum_{k=1}^{K} A_k(n) \cos \varphi_k(n) + r(n),    (1)

where A_k(n) is the instantaneous amplitude of the k-th harmonic, K is the number of harmonics present in the speech signal, r(n) is the noise component, and \varphi_k(n) is the instantaneous phase of the k-th harmonic, defined as:

\varphi_k(n) = \varphi_k(0) + \sum_{i=0}^{n} \frac{2\pi f_k(i)}{F_s},    (2)

where f_k is the instantaneous frequency of the k-th harmonic, F_s is the sampling frequency and \varphi_k(0) is the initial phase of the k-th harmonic. Sinusoidal speech modelling treats the speech signal as a sum of periodic and aperiodic components, where the periodic signal is defined as a sum of sinusoids with time-varying amplitudes and frequencies. If the f_k obey f_k = k f_0, where f_0 is the fundamental frequency, the sinusoids in the model are harmonically related and the model is called Harmonic+Noise. There are several variations of sinusoidal speech modelling [1, 2]. The sinusoidal speech model presented by McAulay and Quatieri [3] and further developed by George and Smith [4] treats voiced speech as a sum of harmonically related sinusoids with amplitudes and phases obtained directly from the Short-Time Fourier Transform (STFT) spectrum.
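The mixed-source model of Eq. (1) is easy to make concrete. Below is a minimal NumPy sketch (all parameter values are illustrative, not taken from the paper) that synthesises one frame as a sum of harmonically related sinusoids plus a noise residual:

```python
import numpy as np

# Sketch of the harmonic + noise model of Eq. (1): a sum of K harmonically
# related sinusoids plus a noise residual r(n). Parameter values are
# illustrative assumptions, not taken from the paper.
Fs = 8000          # sampling frequency [Hz]
N = 256            # frame length [samples]
f0 = 100.0         # fundamental frequency [Hz], constant within the frame
K = 6              # number of harmonics
n = np.arange(N)

rng = np.random.default_rng(0)
harmonic = np.zeros(N)
for k in range(1, K + 1):
    A_k = 1.0 / k                           # example amplitude envelope
    phi_k0 = rng.uniform(-np.pi, np.pi)     # initial phase phi_k(0)
    inc = 2 * np.pi * k * f0 / Fs           # per-sample phase increment
    phase = phi_k0 + np.cumsum(np.full(N, inc))  # accumulated phase, Eq. (2)
    harmonic += A_k * np.cos(phase)

noise = 0.05 * rng.standard_normal(N)       # aperiodic component r(n)
s = harmonic + noise                        # mixed-source frame, Eq. (1)
```

For a time-varying pitch the per-sample increment simply becomes a function of n, which is exactly the case the Harmonic Transform below is designed to handle.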
Unvoiced speech is modelled as a sum of randomly distributed sinusoids with random initial phases. Stylianou presented a more accurate approach to voiced speech modelling based on the harmonic+noise model [5]. In this approach the maximum voicing frequency is determined from an analysis of the speech spectrum, and the speech band is divided by this frequency into a lower, voiced band and a higher, unvoiced band. In the Multiband Excitation (MBE) vocoder presented by Griffin and Lim [6] the speech spectrum is divided into a set of bands with respect to the pitch frequency. Each band is analysed and a binary voiced/unvoiced decision is taken; voiced bands are modelled as sinusoids and unvoiced bands as band-limited noise. Periodic-aperiodic speech decomposition in the methods discussed above involves a binary voiced/unvoiced

decision, which is not valid from the speech production point of view. Yegnanarayana et al. [7] proposed a speech decomposition method which considers the voiced and noise components to be present in the whole speech band. The idea of that work is to use an iterative algorithm based on Discrete Fourier Transform (DFT)/Inverse Discrete Fourier Transform (IDFT) pairs to estimate the noise component. Another decomposition method, using the Pitch-Scaled Harmonic Filter (PSHF), is presented by Jackson and Shadle [8]. The speech signal is windowed with a window length chosen from knowledge of the pitch frequency, so that the analysed segment contains an integer number of pitch cycles. The pitch-scaled frame length aligns the pitch harmonics with the frequency bins of the STFT and thus minimises leakage, but it complicates the windowing process. The PSHF algorithm performs the decomposition in the frequency domain by selecting only those STFT bins which are aligned with the pitch harmonics. The assumption most often made about the speech signal is local stationarity, i.e. that the parameters of the pitch harmonics are slowly varying and that locally these variations can be neglected. For real speech signals these variations can degrade the quality of the separation of the speech components, especially if the STFT is used as the spectral analysis tool. The accuracy of the decomposition can be improved if the nonstationarity of the speech signal is taken into account. In this paper we propose a new periodic-aperiodic decomposition method which assumes the periodic and aperiodic components to be present in the whole speech band, similar to the approach presented in [9].
The motivation of our approach to the speech separation problem was to develop a system able to separate the speech components accurately, taking the nonstationary nature of speech into account and requiring no a priori knowledge of the pitch frequency track. Our system uses the speech model defined by (1). The basic concept of our method lies in analysing the speech spectrum in the harmonic domain rather than the frequency domain, in order to estimate the model parameters accurately. For this purpose we have adopted the Harmonic Transform (HT) proposed by Zhang et al. [10]. The HT is a spectral analysis tool able to analyse a harmonic signal with time-varying frequency and produce a pulse-train spectrum in the harmonic domain. The first step of the designed system is the estimation, on a frame-by-frame basis, of the optimal change of the speech fundamental frequency using the HT. Once the optimal change of the pitch track is found, the fundamental frequency is estimated by analysing the harmonic-domain spectrum. On this basis the periodic component is estimated by selecting the HT local maxima corresponding to the pitch harmonics, and the aperiodic component is defined as the difference between the input speech and the estimated periodic component. The paper is organized as follows. In Section 2 we discuss the Harmonic Transform and define the speech model used in our system. In Section 3 the optimal pitch track estimation method is presented. In Section 4 we present the decomposition scheme, and in Section 5 some experimental results are given.

2. DISCRETE HARMONIC TRANSFORM

Most speech analysis applications based on sinusoidal modelling use the STFT spectrum to estimate the harmonic parameters under the assumption of local stationarity, i.e. that the fundamental frequency is constant within the analysis frame. For real speech signals this is often a coarse assumption.
In fact the fundamental frequency varies in time, and thus only the first several harmonics are distinguishable in the DFT spectrum (Fig. 1).

Figure 1: Harmonic Transform: a harmonic signal with 6 harmonics and a fundamental frequency changing from 00Hz to 0Hz (top), and the DFT (middle) and DHT (bottom) of this signal.

This fact degrades the performance of the STFT in the harmonic parameter estimation process. The basic concept of harmonic-domain spectral analysis is to carry out the analysis along the instantaneous harmonic frequencies rather than along fixed frequencies as in the STFT. Two main strategies are possible. The first is to time-warp the input signal, converting the time-varying frequency into a constant one, and then use the STFT. The second is to use a spectral analysis tool which transforms the input signal directly into the harmonic domain. Zhang et al. [10] proposed the Harmonic Transform (HT), a transformation with a built-in time-warping function. The HT of a signal s(t) is defined as:

S_{\varphi_u}(\omega) = \int s(t)\, \varphi'_u(t)\, e^{-j\omega\varphi_u(t)}\, dt,    (3)

where \varphi_u(t) is the unit phase function, i.e. the phase of the fundamental divided by its instantaneous frequency [10], and \varphi'_u(t) is the first-order derivative of \varphi_u(t). The Inverse Harmonic Transform is defined as:

s(t) = \frac{1}{2\pi} \int S_{\varphi_u}(\omega)\, e^{j\omega\varphi_u(t)}\, d\omega.    (4)

In real speech the fundamental frequency is slowly time-varying, i.e. it cannot change rapidly over a short time period. On this basis our approach assumes a linear frequency change within a given speech segment. The instantaneous phase \varphi(t) of a sinusoid with a linear frequency change is given by the known formula (the initial phase is omitted for simplicity):

\varphi(t) = 2\pi \left( f_0 t + \frac{\varepsilon t^2}{2} \right),    (5)

where f_0 is the initial frequency and \varepsilon = \Delta f_0 / T is the fundamental frequency change divided by the length of the segment (i.e. the time over which the frequency change occurs). For discrete-time signals and a segment length of N samples (T = N/F_s) this formula can be written as:

\varphi(n) = 2\pi \left( \frac{f_0 n}{F_s} + \frac{\Delta f_0 n^2}{2 N F_s} \right).    (6)

The initial fundamental frequency within a given segment can be written as:

f_0 = f_c - \frac{a f_c}{2}, \qquad a = \frac{\Delta f_0}{f_c},    (7)

where f_c is the central fundamental frequency within a given segment of length N. Substituting f_0 and \Delta f_0 in (6) using (7) we get:

\varphi(n) = \frac{2\pi f_c}{F_s} \alpha_a(n), \qquad \alpha_a(n) = \left( 1 - \frac{a}{2} + \frac{a n}{2N} \right) n.    (8)

Now, let us consider the Discrete Harmonic Transform for signals with a linearly changing fundamental frequency. The frequencies of the spectral lines of the Discrete Fourier Transform are defined as:

f_c = \frac{k F_s}{N}.    (9)

In the HT the central frequencies of the spectral lines are aligned with the frequencies of the DFT spectral lines. Using (9) in (8) we get:

\varphi(n) = \frac{2\pi k}{N} \alpha_a(n).    (10)

Finally we can define the Short-Time Discrete Harmonic Transform (STHT) for signals with a linear frequency change:

S_\alpha(k) = \sum_{n=0}^{N-1} s(n)\, \alpha'(n)\, e^{-j \frac{2\pi k}{N} \alpha(n)},    (11)

where \alpha'(n) is defined as:

\alpha'(n) = 1 - \frac{a}{2} + \frac{a n}{N}.    (12)

The inverse STHT is defined as:

s(n) = \frac{1}{N} \sum_{k=0}^{N-1} S_\alpha(k)\, e^{j \frac{2\pi k}{N} \alpha(n)}.    (13)

An example of the STFT spectrum and the STHT spectrum of a test signal is shown in Fig. 1. The input harmonic signal consists of 6 harmonics; the fundamental frequency changes linearly from 00Hz to 0Hz within a segment of 56 samples (F_s = 8000 Hz).
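To make the transform concrete, here is a minimal NumPy sketch of the STHT for a linear pitch change, assuming the warped axis alpha_a(n) = n(1 - a/2 + an/(2N)) with weight alpha'(n) = 1 - a/2 + an/N, consistent with the linear-chirp phase model; all signal parameters are illustrative:

```python
import numpy as np

def stht(s, a):
    """Short-Time Discrete Harmonic Transform for a linear fundamental
    frequency change. `a` is the relative frequency change within the
    frame; a = 0 reduces the transform to the plain DFT."""
    N = len(s)
    n = np.arange(N)
    alpha = n * (1 - a / 2 + a * n / (2 * N))   # warped time axis
    dalpha = 1 - a / 2 + a * n / N              # its discrete derivative
    k = np.arange(N)[:, None]
    return (s * dalpha * np.exp(-2j * np.pi * k * alpha / N)).sum(axis=1)

# Test signal in the spirit of the paper's example: 6 harmonics whose
# fundamental rises linearly across a 256-sample frame. The central pitch
# is placed on a DFT bin (125 Hz = 4 * Fs/N) so the effect is easy to see.
Fs, N = 8000, 256
fc, df = 125.0, 25.0                   # central pitch and total change [Hz]
f0 = fc - df / 2                       # initial frequency of the chirp
n = np.arange(N)
phase = 2 * np.pi * (f0 * n + df * n**2 / (2 * N)) / Fs
sig = sum(np.cos(k * phase) for k in range(1, 7))

S_matched = stht(sig, df / fc)         # kernel synchronised with the chirp
S_plain = stht(sig, 0.0)               # ordinary DFT for comparison
```

With the matched warping each harmonic collapses onto a single spectral line, while the plain DFT smears the higher harmonics over several bins, exactly the effect shown in Fig. 1.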
Note that only the first few harmonics can be distinguished in the STFT spectrum, while in the STHT spectrum all of the harmonics are visible. The second example is a comparison of spectrograms of a speech signal processed by the STFT and by the STHT, shown in Fig. 2.

Figure 2: Example spectrograms of the speech signal using the STFT (top) and the STHT (bottom).

3. PITCH TRACK ESTIMATION

The pair of transforms given by (11) and (13) allows harmonic signals to be analysed in the harmonic domain when the fundamental frequency track is known. In the case of speech, both the central fundamental frequency and its change are unknown. A block diagram of the pitch detection algorithm is shown in Fig. 3. The proposed algorithm starts by searching for the fundamental frequency change, examining the STHT spectrum for different unit phase functions (12), i.e. unit phase functions with different values of the parameter a. The optimal value of a is defined as the one which minimises the Spectral Flatness Measure:

a_{opt} = \arg\min_a \mathrm{SFM}(a), \qquad \mathrm{SFM}(a) = \frac{\left( \prod_{k=0}^{N-1} \left| STHT(a,k) \right| \right)^{1/N}}{\frac{1}{N} \sum_{k=0}^{N-1} \left| STHT(a,k) \right|},    (14)

where STHT(a,k) is the harmonic spectrum of a given speech segment for a given a, and |.| denotes absolute value. The minimal spectral flatness indicates the highest spectral concentration, which in our algorithm means an optimal fit between the signal and the STHT kernel; this in turn means that the optimal change of the speech fundamental frequency has been found for the given segment. Once this is done, the pitch frequency is estimated. The first step of this algorithm is the determination of pitch harmonic candidates f_i by peak picking of the STHT spectrum, based on the algorithm proposed in [11]. Pitch harmonic candidates with central frequencies located between

and 450 Hz are considered as pitch candidates. For each pitch candidate the algorithm tries to find its harmonics; if three of the first four harmonics cannot be found, the candidate is discarded. In order to prevent pitch doubling or halving, the following factor is computed for each candidate:

r = \frac{1}{n_{h\max}} \left( \sum_{n=1}^{n_{h\max}} a_n^2 \right)^2,

where a_n is the amplitude of the n-th harmonic of the pitch candidate and n_{h\max} is the number of all possible harmonics for the candidate. This formula can be viewed as the mean energy of the harmonic signal per single harmonic, multiplied by the energy carried by the signal. It prevents pitch halving, since the mean energy per harmonic is smaller for halved pitch candidates, and at the same time prevents pitch doubling, since the energy of the harmonic signal is higher for lower pitch candidates. The candidate with the greatest factor r is selected as the pitch for the given frame. Finally, the pitch value is refined using the formula:

f_r = \frac{1}{n_{h\max}} \sum_{n=1}^{n_{h\max}} \frac{f_n}{n},

where f_n is the frequency of the n-th harmonic candidate.

Figure 3: Pitch detection algorithm.

The described procedure estimates the central pitch frequency for a single frame. Further prevention of pitch halving or doubling is provided by a tracking buffer which stores the fundamental frequency estimates from several consecutive frames. The final pitch estimate is made for the frame in the middle of the tracking buffer, so the resulting estimate is delayed by several frames. In our system we used a buffer length of 5. As the tracking algorithm we use median filtering, which we found simple and robust against gross pitch errors.

4. PERIODIC-APERIODIC DECOMPOSITION

Speech decomposition in our system is performed in the time domain.
First the periodic component is estimated, and the aperiodic component is defined as the difference between the input speech signal and the estimated periodic component.

Figure 4: Example of the speech decomposition: original speech (top), estimated periodic (middle) and aperiodic (bottom) components.

On the basis of the speech model discussed in Section 2, the periodic component is defined as:

h(n) = \sum_{k=1}^{K} A_k \cos \left( k \varphi(n) + \varphi_k(0) \right),    (15)

where A_k is the amplitude of the k-th harmonic, \varphi(n) is the instantaneous phase defined in (8) with the central frequency f_c given by the pitch frequency, and \varphi_k(0) is the initial phase of the k-th harmonic. Unfortunately the pitch harmonics are not aligned with the spectral lines and thus cannot be estimated directly from the STHT spectrum. One possible solution to this problem is interpolation of adjacent STHT coefficients. In our system we propose a more accurate way to find the harmonic amplitudes and phases. In order to carry out the spectral analysis exactly at the frequencies aligned with the pitch harmonics, we use the same formula (8) as in (15). In this way we obtain a special case of the HT, which we have used in our previous work [12]. The DHT variant aligned with the pitch is defined as:

H(k) = \sum_{n=0}^{N-1} s(n)\, \alpha'(n)\, e^{-j \frac{2\pi k f_r}{F_s} \alpha(n)},

where f_r is the refined pitch frequency and k = 1, ..., K, with K the number of pitch harmonics. The amplitudes and phases of the harmonics can be computed directly from the H(k) coefficients:

A_k = \sqrt{ \mathrm{Re}^2\{H(k)\} + \mathrm{Im}^2\{H(k)\} }, \qquad \varphi_k(0) = \arctan \frac{\mathrm{Im}\{H(k)\}}{\mathrm{Re}\{H(k)\}},

where \mathrm{Re}\{\cdot\} and \mathrm{Im}\{\cdot\} stand for the real and imaginary parts respectively. The periodic component is generated using (15), and the aperiodic component is defined as:

r(n) = s(n) - h(n).

An example of the speech decomposition is given in Fig. 4.
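The whole decomposition step can be sketched as follows. This is a minimal illustration, not the authors' code: the 2/N amplitude normalisation and all signal parameters are assumptions of this sketch.

```python
import numpy as np

def decompose(s, fr, a, K, Fs):
    """Estimate the periodic component h(n) of frame s via the
    pitch-aligned DHT and return (h, r) with r(n) = s(n) - h(n)."""
    N = len(s)
    n = np.arange(N)
    alpha = n * (1 - a / 2 + a * n / (2 * N))   # warped time axis
    dalpha = 1 - a / 2 + a * n / N
    h = np.zeros(N)
    for k in range(1, K + 1):
        # pitch-aligned DHT coefficient of the k-th harmonic
        Hk = np.sum(s * dalpha * np.exp(-2j * np.pi * k * fr * alpha / Fs))
        A_k = 2.0 * np.abs(Hk) / N              # harmonic amplitude
        phi_k0 = np.angle(Hk)                   # initial phase
        h += A_k * np.cos(2 * np.pi * k * fr * alpha / Fs + phi_k0)
    return h, s - h                             # periodic, aperiodic parts

# Synthetic mixed-source frame (amplitudes, phases and noise level are
# illustrative): three harmonics of a chirped 125 Hz pitch plus weak noise.
Fs, N, fr, a = 8000, 256, 125.0, 0.2
n = np.arange(N)
alpha = n * (1 - a / 2 + a * n / (2 * N))
rng = np.random.default_rng(1)
amps = [1.0, 0.5, 0.25]
clean = sum(A * np.cos(2 * np.pi * (k + 1) * fr * alpha / Fs + 0.3 * k)
            for k, A in enumerate(amps))
noise = 0.02 * rng.standard_normal(N)
h, r = decompose(clean + noise, fr, a, K=3, Fs=Fs)
```

Because the analysis frequencies are multiples of the refined pitch rather than DFT bins, no interpolation between spectral lines is needed, and the residual is dominated by the added noise.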

5. EXPERIMENTAL RESULTS

In order to verify the proposed decomposition algorithm we performed a set of experiments on synthetic speech-like signals. The testing procedure was as follows: two sets of synthetic speech were prepared, one male (central frequency 0Hz) and one female (central frequency 00Hz). In order to verify the performance of the Short-Time Harmonic Transform, different fundamental frequency changes were used in both sets; the change parameters were chosen randomly within boundaries set so as not to exceed 30% of the central fundamental frequency within a test frame. We tested the algorithm at several Harmonic-to-Noise Ratios (HNR) by adding noise of different energies to the input signal. The results of the experiment are shown in Table 1.

Table 1: Results of the experiments: for each central pitch frequency and input HNR [dB], the measured HNR [dB] after decomposition and the SNR [dB] of the estimated periodic component.

In the table, the HNR column gives the original HNR of the input signal. After estimation of the periodic and aperiodic components the HNR was measured again; the mean of this measure is shown in the column Measured HNR. Finally, the quality of the estimated periodic component was assessed by its SNR, defined as the ratio of the estimated periodic component energy to the error signal energy, where the error signal is the difference between the original and estimated periodic components.

6. CONCLUSIONS

In this paper we proposed a new speech decomposition scheme based on the Harmonic Transform. For our purposes we developed two variants of the Short-Time Discrete Harmonic Transform for the case of a linear frequency change within the analysis frame.
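The two figures of merit can be computed with a pair of hypothetical helpers (the names are ours, not from the paper):

```python
import numpy as np

# HNR is the harmonic-to-noise energy ratio of a frame; the SNR of the
# estimated periodic component is its energy over the energy of the error
# against the true periodic part, as defined in the text above.

def hnr_db(harmonic, noise):
    """Harmonic-to-Noise Ratio in dB."""
    return 10 * np.log10(np.sum(harmonic**2) / np.sum(noise**2))

def snr_db(estimated, reference):
    """Energy of the estimated periodic component over the energy of the
    error signal (reference minus estimate), in dB."""
    err = reference - estimated
    return 10 * np.log10(np.sum(estimated**2) / np.sum(err**2))
```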
The first variant performs spectral analysis in the harmonic domain and is able to synchronize its kernel with the input signal. The second variant allows accurate estimation of the amplitudes and phases of the pitch harmonics, because its spectral lines are aligned with the pitch frequency. There are two main advantages of the STHT over conventional spectral analysis with the STFT. The first is the ability to estimate the fundamental frequency change without knowledge of the fundamental frequency itself. The second is the prevention of spectrum smearing, especially for higher-order harmonics, which is important when a spectral-domain fundamental frequency estimation algorithm is used. This makes the algorithm more robust for highly intonated as well as transient speech segments. The experiments demonstrate the robustness of the proposed approach.

7. ACKNOWLEDGEMENTS

This work was supported by Bialystok Technical University under the grant W/WI//05.

REFERENCES

[1] A.M. Kondoz, Digital Speech: Coding for Low Bit Rate Communication Systems, New York: John Wiley & Sons, 1996.
[2] A.S. Spanias, "Speech coding: a tutorial review," Proc. IEEE, vol. 82, no. 10, 1994.
[3] R.J. McAulay and T.F. Quatieri, "Sinusoidal coding," in Speech Coding and Synthesis (W. Kleijn and K. Paliwal, eds.), Amsterdam: Elsevier Science Publishers, 1995.
[4] E.B. George and M.J.T. Smith, "Speech analysis/synthesis and modification using an analysis-by-synthesis/overlap-add sinusoidal model," IEEE Trans. on Speech and Audio Processing, vol. 5, no. 5, 1997.
[5] Y. Stylianou, "Applying the harmonic plus noise model in concatenative speech synthesis," IEEE Trans. on Speech and Audio Processing, vol. 9, no. 1, 2001.
[6] D.W. Griffin and J.S. Lim, "Multiband excitation vocoder," IEEE Trans. on Acoust., Speech and Signal Processing, vol. ASSP-36, 1988.
[7] B. Yegnanarayana, C. d'Alessandro and V.
Darsinos, "An iterative algorithm for decomposition of speech signals into voiced and noise components," IEEE Trans. on Speech and Audio Processing, vol. 6, no. 1, 1998.
[8] P.J.B. Jackson and C.H. Shadle, "Pitch-scaled estimation of simultaneous voiced and turbulence-noise components in speech," IEEE Trans. on Speech and Audio Processing, vol. 9, no. 7, Oct. 2001.
[9] X. Serra, "Musical sound modeling with sinusoids plus noise," in Musical Signal Processing (C. Roads, S. Pope, A. Picialli and G. De Poli, eds.), Swets & Zeitlinger Publishers, 1997.
[10] F. Zhang, G. Bi and Y.Q. Chen, "Harmonic transform," IEE Proc. - Vision, Image and Signal Processing, vol. 151, no. 4, Aug. 2004.
[11] V. Sercov and A. Petrovsky, "The method of pitch frequency detection on the base of tuning to its harmonics," in Proc. 9th European Signal Processing Conference (EUSIPCO'98), vol. II, Rhodes, Greece, Sep. 8-11, 1998.
[12] V. Sercov and A. Petrovsky, "An improved speech model with allowance for time-varying pitch harmonic amplitudes and frequencies in low bit-rate MBE coders," in Proc. 6th European Conf. on Speech Communication and Technology (EUROSPEECH'99), Budapest, Hungary, 1999.


More information

FFT analysis in practice

FFT analysis in practice FFT analysis in practice Perception & Multimedia Computing Lecture 13 Rebecca Fiebrink Lecturer, Department of Computing Goldsmiths, University of London 1 Last Week Review of complex numbers: rectangular

More information

A Parametric Model for Spectral Sound Synthesis of Musical Sounds

A Parametric Model for Spectral Sound Synthesis of Musical Sounds A Parametric Model for Spectral Sound Synthesis of Musical Sounds Cornelia Kreutzer University of Limerick ECE Department Limerick, Ireland cornelia.kreutzer@ul.ie Jacqueline Walker University of Limerick

More information

HIGH ACCURACY FRAME-BY-FRAME NON-STATIONARY SINUSOIDAL MODELLING

HIGH ACCURACY FRAME-BY-FRAME NON-STATIONARY SINUSOIDAL MODELLING HIGH ACCURACY FRAME-BY-FRAME NON-STATIONARY SINUSOIDAL MODELLING Jeremy J. Wells, Damian T. Murphy Audio Lab, Intelligent Systems Group, Department of Electronics University of York, YO10 5DD, UK {jjw100

More information

IMPROVED CODING OF TONAL COMPONENTS IN MPEG-4 AAC WITH SBR

IMPROVED CODING OF TONAL COMPONENTS IN MPEG-4 AAC WITH SBR IMPROVED CODING OF TONAL COMPONENTS IN MPEG-4 AAC WITH SBR Tomasz Żernici, Mare Domańsi, Poznań University of Technology, Chair of Multimedia Telecommunications and Microelectronics, Polana 3, 6-965, Poznań,

More information

Reading: Johnson Ch , Ch.5.5 (today); Liljencrants & Lindblom; Stevens (Tues) reminder: no class on Thursday.

Reading: Johnson Ch , Ch.5.5 (today); Liljencrants & Lindblom; Stevens (Tues) reminder: no class on Thursday. L105/205 Phonetics Scarborough Handout 7 10/18/05 Reading: Johnson Ch.2.3.3-2.3.6, Ch.5.5 (today); Liljencrants & Lindblom; Stevens (Tues) reminder: no class on Thursday Spectral Analysis 1. There are

More information

Audio Engineering Society Convention Paper Presented at the 110th Convention 2001 May Amsterdam, The Netherlands

Audio Engineering Society Convention Paper Presented at the 110th Convention 2001 May Amsterdam, The Netherlands Audio Engineering Society Convention Paper Presented at the th Convention May 5 Amsterdam, The Netherlands This convention paper has been reproduced from the author's advance manuscript, without editing,

More information

Single Channel Speaker Segregation using Sinusoidal Residual Modeling

Single Channel Speaker Segregation using Sinusoidal Residual Modeling NCC 2009, January 16-18, IIT Guwahati 294 Single Channel Speaker Segregation using Sinusoidal Residual Modeling Rajesh M Hegde and A. Srinivas Dept. of Electrical Engineering Indian Institute of Technology

More information

Phase estimation in speech enhancement unimportant, important, or impossible?

Phase estimation in speech enhancement unimportant, important, or impossible? IEEE 7-th Convention of Electrical and Electronics Engineers in Israel Phase estimation in speech enhancement unimportant, important, or impossible? Timo Gerkmann, Martin Krawczyk, and Robert Rehr Speech

More information

YOUR WAVELET BASED PITCH DETECTION AND VOICED/UNVOICED DECISION

YOUR WAVELET BASED PITCH DETECTION AND VOICED/UNVOICED DECISION American Journal of Engineering and Technology Research Vol. 3, No., 03 YOUR WAVELET BASED PITCH DETECTION AND VOICED/UNVOICED DECISION Yinan Kong Department of Electronic Engineering, Macquarie University

More information

Pitch-Scaled Estimation of Simultaneous Voiced and Turbulence-Noise Components in Speech

Pitch-Scaled Estimation of Simultaneous Voiced and Turbulence-Noise Components in Speech IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 7, OCTOBER 2001 713 Pitch-Scaled Estimation of Simultaneous Voiced and Turbulence-Noise Components in Speech Philip J. B. Jackson, Member,

More information

Impact Noise Suppression Using Spectral Phase Estimation

Impact Noise Suppression Using Spectral Phase Estimation Proceedings of APSIPA Annual Summit and Conference 2015 16-19 December 2015 Impact oise Suppression Using Spectral Phase Estimation Kohei FUJIKURA, Arata KAWAMURA, and Youji IIGUI Graduate School of Engineering

More information

ON THE RELATIONSHIP BETWEEN INSTANTANEOUS FREQUENCY AND PITCH IN. 1 Introduction. Zied Mnasri 1, Hamid Amiri 1

ON THE RELATIONSHIP BETWEEN INSTANTANEOUS FREQUENCY AND PITCH IN. 1 Introduction. Zied Mnasri 1, Hamid Amiri 1 ON THE RELATIONSHIP BETWEEN INSTANTANEOUS FREQUENCY AND PITCH IN SPEECH SIGNALS Zied Mnasri 1, Hamid Amiri 1 1 Electrical engineering dept, National School of Engineering in Tunis, University Tunis El

More information

Application of velvet noise and its variants for synthetic speech and singing (Revised and extended version with appendices)

Application of velvet noise and its variants for synthetic speech and singing (Revised and extended version with appendices) Application of velvet noise and its variants for synthetic speech and singing (Revised and extended version with appendices) (Compiled: 1:3 A.M., February, 18) Hideki Kawahara 1,a) Abstract: The Velvet

More information

Biomedical Signals. Signals and Images in Medicine Dr Nabeel Anwar

Biomedical Signals. Signals and Images in Medicine Dr Nabeel Anwar Biomedical Signals Signals and Images in Medicine Dr Nabeel Anwar Noise Removal: Time Domain Techniques 1. Synchronized Averaging (covered in lecture 1) 2. Moving Average Filters (today s topic) 3. Derivative

More information

HIGH ACCURACY AND OCTAVE ERROR IMMUNE PITCH DETECTION ALGORITHMS

HIGH ACCURACY AND OCTAVE ERROR IMMUNE PITCH DETECTION ALGORITHMS ARCHIVES OF ACOUSTICS 29, 1, 1 21 (2004) HIGH ACCURACY AND OCTAVE ERROR IMMUNE PITCH DETECTION ALGORITHMS M. DZIUBIŃSKI and B. KOSTEK Multimedia Systems Department Gdańsk University of Technology Narutowicza

More information

ADDITIVE synthesis [1] is the original spectrum modeling

ADDITIVE synthesis [1] is the original spectrum modeling IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 3, MARCH 2007 851 Perceptual Long-Term Variable-Rate Sinusoidal Modeling of Speech Laurent Girin, Member, IEEE, Mohammad Firouzmand,

More information

FOURIER analysis is a well-known method for nonparametric

FOURIER analysis is a well-known method for nonparametric 386 IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 54, NO. 1, FEBRUARY 2005 Resonator-Based Nonparametric Identification of Linear Systems László Sujbert, Member, IEEE, Gábor Péceli, Fellow,

More information

Advanced audio analysis. Martin Gasser

Advanced audio analysis. Martin Gasser Advanced audio analysis Martin Gasser Motivation Which methods are common in MIR research? How can we parameterize audio signals? Interesting dimensions of audio: Spectral/ time/melody structure, high

More information

Quantification of glottal and voiced speech harmonicsto-noise ratios using cepstral-based estimation

Quantification of glottal and voiced speech harmonicsto-noise ratios using cepstral-based estimation Quantification of glottal and voiced speech harmonicsto-noise ratios using cepstral-based estimation Peter J. Murphy and Olatunji O. Akande, Department of Electronic and Computer Engineering University

More information

Final Exam Practice Questions for Music 421, with Solutions

Final Exam Practice Questions for Music 421, with Solutions Final Exam Practice Questions for Music 4, with Solutions Elementary Fourier Relationships. For the window w = [/,,/ ], what is (a) the dc magnitude of the window transform? + (b) the magnitude at half

More information

Project 0: Part 2 A second hands-on lab on Speech Processing Frequency-domain processing

Project 0: Part 2 A second hands-on lab on Speech Processing Frequency-domain processing Project : Part 2 A second hands-on lab on Speech Processing Frequency-domain processing February 24, 217 During this lab, you will have a first contact on frequency domain analysis of speech signals. You

More information

Real-time fundamental frequency estimation by least-square fitting. IEEE Transactions on Speech and Audio Processing, 1997, v. 5 n. 2, p.

Real-time fundamental frequency estimation by least-square fitting. IEEE Transactions on Speech and Audio Processing, 1997, v. 5 n. 2, p. Title Real-time fundamental frequency estimation by least-square fitting Author(s) Choi, AKO Citation IEEE Transactions on Speech and Audio Processing, 1997, v. 5 n. 2, p. 201-205 Issued Date 1997 URL

More information

Introduction of Audio and Music

Introduction of Audio and Music 1 Introduction of Audio and Music Wei-Ta Chu 2009/12/3 Outline 2 Introduction of Audio Signals Introduction of Music 3 Introduction of Audio Signals Wei-Ta Chu 2009/12/3 Li and Drew, Fundamentals of Multimedia,

More information

Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech

Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech INTERSPEECH 5 Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech M. A. Tuğtekin Turan and Engin Erzin Multimedia, Vision and Graphics Laboratory,

More information

T a large number of applications, and as a result has

T a large number of applications, and as a result has IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. 36, NO. 8, AUGUST 1988 1223 Multiband Excitation Vocoder DANIEL W. GRIFFIN AND JAE S. LIM, FELLOW, IEEE AbstractIn this paper, we present

More information

Sub-band Envelope Approach to Obtain Instants of Significant Excitation in Speech

Sub-band Envelope Approach to Obtain Instants of Significant Excitation in Speech Sub-band Envelope Approach to Obtain Instants of Significant Excitation in Speech Vikram Ramesh Lakkavalli, K V Vijay Girish, A G Ramakrishnan Medical Intelligence and Language Engineering (MILE) Laboratory

More information

A NEW APPROACH TO TRANSIENT PROCESSING IN THE PHASE VOCODER. Axel Röbel. IRCAM, Analysis-Synthesis Team, France

A NEW APPROACH TO TRANSIENT PROCESSING IN THE PHASE VOCODER. Axel Röbel. IRCAM, Analysis-Synthesis Team, France A NEW APPROACH TO TRANSIENT PROCESSING IN THE PHASE VOCODER Axel Röbel IRCAM, Analysis-Synthesis Team, France Axel.Roebel@ircam.fr ABSTRACT In this paper we propose a new method to reduce phase vocoder

More information

Chapter 4 SPEECH ENHANCEMENT

Chapter 4 SPEECH ENHANCEMENT 44 Chapter 4 SPEECH ENHANCEMENT 4.1 INTRODUCTION: Enhancement is defined as improvement in the value or Quality of something. Speech enhancement is defined as the improvement in intelligibility and/or

More information

SPEECH AND SPECTRAL ANALYSIS

SPEECH AND SPECTRAL ANALYSIS SPEECH AND SPECTRAL ANALYSIS 1 Sound waves: production in general: acoustic interference vibration (carried by some propagation medium) variations in air pressure speech: actions of the articulatory organs

More information

Carrier Frequency Offset Estimation in WCDMA Systems Using a Modified FFT-Based Algorithm

Carrier Frequency Offset Estimation in WCDMA Systems Using a Modified FFT-Based Algorithm Carrier Frequency Offset Estimation in WCDMA Systems Using a Modified FFT-Based Algorithm Seare H. Rezenom and Anthony D. Broadhurst, Member, IEEE Abstract-- Wideband Code Division Multiple Access (WCDMA)

More information

AN ANALYSIS OF ITERATIVE ALGORITHM FOR ESTIMATION OF HARMONICS-TO-NOISE RATIO IN SPEECH

AN ANALYSIS OF ITERATIVE ALGORITHM FOR ESTIMATION OF HARMONICS-TO-NOISE RATIO IN SPEECH AN ANALYSIS OF ITERATIVE ALGORITHM FOR ESTIMATION OF HARMONICS-TO-NOISE RATIO IN SPEECH A. Stráník, R. Čmejla Department of Circuit Theory, Faculty of Electrical Engineering, CTU in Prague Abstract Acoustic

More information

HIGH-RESOLUTION SINUSOIDAL MODELING OF UNVOICED SPEECH. George P. Kafentzis and Yannis Stylianou

HIGH-RESOLUTION SINUSOIDAL MODELING OF UNVOICED SPEECH. George P. Kafentzis and Yannis Stylianou HIGH-RESOLUTION SINUSOIDAL MODELING OF UNVOICED SPEECH George P. Kafentzis and Yannis Stylianou Multimedia Informatics Lab Department of Computer Science University of Crete, Greece ABSTRACT In this paper,

More information

COMP 546, Winter 2017 lecture 20 - sound 2

COMP 546, Winter 2017 lecture 20 - sound 2 Today we will examine two types of sounds that are of great interest: music and speech. We will see how a frequency domain analysis is fundamental to both. Musical sounds Let s begin by briefly considering

More information

Measurement of RMS values of non-coherently sampled signals. Martin Novotny 1, Milos Sedlacek 2

Measurement of RMS values of non-coherently sampled signals. Martin Novotny 1, Milos Sedlacek 2 Measurement of values of non-coherently sampled signals Martin ovotny, Milos Sedlacek, Czech Technical University in Prague, Faculty of Electrical Engineering, Dept. of Measurement Technicka, CZ-667 Prague,

More information

INTRODUCTION TO ACOUSTIC PHONETICS 2 Hilary Term, week 6 22 February 2006

INTRODUCTION TO ACOUSTIC PHONETICS 2 Hilary Term, week 6 22 February 2006 1. Resonators and Filters INTRODUCTION TO ACOUSTIC PHONETICS 2 Hilary Term, week 6 22 February 2006 Different vibrating objects are tuned to specific frequencies; these frequencies at which a particular

More information

Formant Synthesis of Haegeum: A Sound Analysis/Synthesis System using Cpestral Envelope

Formant Synthesis of Haegeum: A Sound Analysis/Synthesis System using Cpestral Envelope Formant Synthesis of Haegeum: A Sound Analysis/Synthesis System using Cpestral Envelope Myeongsu Kang School of Computer Engineering and Information Technology Ulsan, South Korea ilmareboy@ulsan.ac.kr

More information

VOICE QUALITY SYNTHESIS WITH THE BANDWIDTH ENHANCED SINUSOIDAL MODEL

VOICE QUALITY SYNTHESIS WITH THE BANDWIDTH ENHANCED SINUSOIDAL MODEL VOICE QUALITY SYNTHESIS WITH THE BANDWIDTH ENHANCED SINUSOIDAL MODEL Narsimh Kamath Vishweshwara Rao Preeti Rao NIT Karnataka EE Dept, IIT-Bombay EE Dept, IIT-Bombay narsimh@gmail.com vishu@ee.iitb.ac.in

More information

NOISE ESTIMATION IN A SINGLE CHANNEL

NOISE ESTIMATION IN A SINGLE CHANNEL SPEECH ENHANCEMENT FOR CROSS-TALK INTERFERENCE by Levent M. Arslan and John H.L. Hansen Robust Speech Processing Laboratory Department of Electrical Engineering Box 99 Duke University Durham, North Carolina

More information

ADAPTIVE NOISE LEVEL ESTIMATION

ADAPTIVE NOISE LEVEL ESTIMATION Proc. of the 9 th Int. Conference on Digital Audio Effects (DAFx-6), Montreal, Canada, September 18-2, 26 ADAPTIVE NOISE LEVEL ESTIMATION Chunghsin Yeh Analysis/Synthesis team IRCAM/CNRS-STMS, Paris, France

More information

ON BEDROSIAN CONDITION IN APPLICATION TO CHIRP SOUNDS

ON BEDROSIAN CONDITION IN APPLICATION TO CHIRP SOUNDS 15th European Signal Processing Conference (EUSIPCO 7), Poznan, Poland, September 3-7, 7, copyright by EURASIP ON BEDROSIAN CONDIION IN APPLICAION O CHIRP SOUNDS E. HERMANOWICZ 1 ) ) and M. ROJEWSKI Faculty

More information

Discrete Fourier Transform (DFT)

Discrete Fourier Transform (DFT) Amplitude Amplitude Discrete Fourier Transform (DFT) DFT transforms the time domain signal samples to the frequency domain components. DFT Signal Spectrum Time Frequency DFT is often used to do frequency

More information

Introduction to Wavelet Transform. Chapter 7 Instructor: Hossein Pourghassem

Introduction to Wavelet Transform. Chapter 7 Instructor: Hossein Pourghassem Introduction to Wavelet Transform Chapter 7 Instructor: Hossein Pourghassem Introduction Most of the signals in practice, are TIME-DOMAIN signals in their raw format. It means that measured signal is a

More information

Preeti Rao 2 nd CompMusicWorkshop, Istanbul 2012

Preeti Rao 2 nd CompMusicWorkshop, Istanbul 2012 Preeti Rao 2 nd CompMusicWorkshop, Istanbul 2012 o Music signal characteristics o Perceptual attributes and acoustic properties o Signal representations for pitch detection o STFT o Sinusoidal model o

More information

Voiced/nonvoiced detection based on robustness of voiced epochs

Voiced/nonvoiced detection based on robustness of voiced epochs Voiced/nonvoiced detection based on robustness of voiced epochs by N. Dhananjaya, B.Yegnanarayana in IEEE Signal Processing Letters, 17, 3 : 273-276 Report No: IIIT/TR/2010/50 Centre for Language Technologies

More information

SPEECH TO SINGING SYNTHESIS SYSTEM. Mingqing Yun, Yoon mo Yang, Yufei Zhang. Department of Electrical and Computer Engineering University of Rochester

SPEECH TO SINGING SYNTHESIS SYSTEM. Mingqing Yun, Yoon mo Yang, Yufei Zhang. Department of Electrical and Computer Engineering University of Rochester SPEECH TO SINGING SYNTHESIS SYSTEM Mingqing Yun, Yoon mo Yang, Yufei Zhang Department of Electrical and Computer Engineering University of Rochester ABSTRACT This paper describes a speech-to-singing synthesis

More information

Different Approaches of Spectral Subtraction Method for Speech Enhancement

Different Approaches of Spectral Subtraction Method for Speech Enhancement ISSN 2249 5460 Available online at www.internationalejournals.com International ejournals International Journal of Mathematical Sciences, Technology and Humanities 95 (2013 1056 1062 Different Approaches

More information

Speech Enhancement Using Spectral Flatness Measure Based Spectral Subtraction

Speech Enhancement Using Spectral Flatness Measure Based Spectral Subtraction IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 7, Issue, Ver. I (Mar. - Apr. 7), PP 4-46 e-issn: 9 4, p-issn No. : 9 497 www.iosrjournals.org Speech Enhancement Using Spectral Flatness Measure

More information

Speech Compression Using Voice Excited Linear Predictive Coding

Speech Compression Using Voice Excited Linear Predictive Coding Speech Compression Using Voice Excited Linear Predictive Coding Ms.Tosha Sen, Ms.Kruti Jay Pancholi PG Student, Asst. Professor, L J I E T, Ahmedabad Abstract : The aim of the thesis is design good quality

More information

Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter

Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter Ching-Ta Lu, Kun-Fu Tseng 2, Chih-Tsung Chen 2 Department of Information Communication, Asia University, Taichung, Taiwan, ROC

More information

speech signal S(n). This involves a transformation of S(n) into another signal or a set of signals

speech signal S(n). This involves a transformation of S(n) into another signal or a set of signals 16 3. SPEECH ANALYSIS 3.1 INTRODUCTION TO SPEECH ANALYSIS Many speech processing [22] applications exploits speech production and perception to accomplish speech analysis. By speech analysis we extract

More information

AM-FM demodulation using zero crossings and local peaks

AM-FM demodulation using zero crossings and local peaks AM-FM demodulation using zero crossings and local peaks K.V.S. Narayana and T.V. Sreenivas Department of Electrical Communication Engineering Indian Institute of Science, Bangalore, India 52 Phone: +9

More information

Wavelet Speech Enhancement based on the Teager Energy Operator

Wavelet Speech Enhancement based on the Teager Energy Operator Wavelet Speech Enhancement based on the Teager Energy Operator Mohammed Bahoura and Jean Rouat ERMETIS, DSA, Université du Québec à Chicoutimi, Chicoutimi, Québec, G7H 2B1, Canada. Abstract We propose

More information

Frequency Domain Representation of Signals

Frequency Domain Representation of Signals Frequency Domain Representation of Signals The Discrete Fourier Transform (DFT) of a sampled time domain waveform x n x 0, x 1,..., x 1 is a set of Fourier Coefficients whose samples are 1 n0 X k X0, X

More information

Linear Frequency Modulation (FM) Chirp Signal. Chirp Signal cont. CMPT 468: Lecture 7 Frequency Modulation (FM) Synthesis

Linear Frequency Modulation (FM) Chirp Signal. Chirp Signal cont. CMPT 468: Lecture 7 Frequency Modulation (FM) Synthesis Linear Frequency Modulation (FM) CMPT 468: Lecture 7 Frequency Modulation (FM) Synthesis Tamara Smyth, tamaras@cs.sfu.ca School of Computing Science, Simon Fraser University January 26, 29 Till now we

More information

SAMPLING THEORY. Representing continuous signals with discrete numbers

SAMPLING THEORY. Representing continuous signals with discrete numbers SAMPLING THEORY Representing continuous signals with discrete numbers Roger B. Dannenberg Professor of Computer Science, Art, and Music Carnegie Mellon University ICM Week 3 Copyright 2002-2013 by Roger

More information

METHODS FOR SEPARATION OF AMPLITUDE AND FREQUENCY MODULATION IN FOURIER TRANSFORMED SIGNALS

METHODS FOR SEPARATION OF AMPLITUDE AND FREQUENCY MODULATION IN FOURIER TRANSFORMED SIGNALS METHODS FOR SEPARATION OF AMPLITUDE AND FREQUENCY MODULATION IN FOURIER TRANSFORMED SIGNALS Jeremy J. Wells Audio Lab, Department of Electronics, University of York, YO10 5DD York, UK jjw100@ohm.york.ac.uk

More information

Spectrum. Additive Synthesis. Additive Synthesis Caveat. Music 270a: Modulation

Spectrum. Additive Synthesis. Additive Synthesis Caveat. Music 270a: Modulation Spectrum Music 7a: Modulation Tamara Smyth, trsmyth@ucsd.edu Department of Music, University of California, San Diego (UCSD) October 3, 7 When sinusoids of different frequencies are added together, the

More information

Estimation of Sinusoidally Modulated Signal Parameters Based on the Inverse Radon Transform

Estimation of Sinusoidally Modulated Signal Parameters Based on the Inverse Radon Transform Estimation of Sinusoidally Modulated Signal Parameters Based on the Inverse Radon Transform Miloš Daković, Ljubiša Stanković Faculty of Electrical Engineering, University of Montenegro, Podgorica, Montenegro

More information

ROBUST PITCH TRACKING USING LINEAR REGRESSION OF THE PHASE

ROBUST PITCH TRACKING USING LINEAR REGRESSION OF THE PHASE - @ Ramon E Prieto et al Robust Pitch Tracking ROUST PITCH TRACKIN USIN LINEAR RERESSION OF THE PHASE Ramon E Prieto, Sora Kim 2 Electrical Engineering Department, Stanford University, rprieto@stanfordedu

More information

Instantaneous Higher Order Phase Derivatives

Instantaneous Higher Order Phase Derivatives Digital Signal Processing 12, 416 428 (2002) doi:10.1006/dspr.2002.0456 Instantaneous Higher Order Phase Derivatives Douglas J. Nelson National Security Agency, Fort George G. Meade, Maryland 20755 E-mail:

More information

The Channel Vocoder (analyzer):

The Channel Vocoder (analyzer): Vocoders 1 The Channel Vocoder (analyzer): The channel vocoder employs a bank of bandpass filters, Each having a bandwidth between 100 Hz and 300 Hz. Typically, 16-20 linear phase FIR filter are used.

More information

Chapter 7. Frequency-Domain Representations 语音信号的频域表征

Chapter 7. Frequency-Domain Representations 语音信号的频域表征 Chapter 7 Frequency-Domain Representations 语音信号的频域表征 1 General Discrete-Time Model of Speech Production Voiced Speech: A V P(z)G(z)V(z)R(z) Unvoiced Speech: A N N(z)V(z)R(z) 2 DTFT and DFT of Speech The

More information

TIME FREQUENCY ANALYSIS OF TRANSIENT NVH PHENOMENA IN VEHICLES

TIME FREQUENCY ANALYSIS OF TRANSIENT NVH PHENOMENA IN VEHICLES TIME FREQUENCY ANALYSIS OF TRANSIENT NVH PHENOMENA IN VEHICLES K Becker 1, S J Walsh 2, J Niermann 3 1 Institute of Automotive Engineering, University of Applied Sciences Cologne, Germany 2 Dept. of Aeronautical

More information

Topic 2. Signal Processing Review. (Some slides are adapted from Bryan Pardo s course slides on Machine Perception of Music)

Topic 2. Signal Processing Review. (Some slides are adapted from Bryan Pardo s course slides on Machine Perception of Music) Topic 2 Signal Processing Review (Some slides are adapted from Bryan Pardo s course slides on Machine Perception of Music) Recording Sound Mechanical Vibration Pressure Waves Motion->Voltage Transducer

More information

Correspondence. Cepstrum-Based Pitch Detection Using a New Statistical V/UV Classification Algorithm

Correspondence. Cepstrum-Based Pitch Detection Using a New Statistical V/UV Classification Algorithm IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 7, NO. 3, MAY 1999 333 Correspondence Cepstrum-Based Pitch Detection Using a New Statistical V/UV Classification Algorithm Sassan Ahmadi and Andreas

More information

Synthesis Algorithms and Validation

Synthesis Algorithms and Validation Chapter 5 Synthesis Algorithms and Validation An essential step in the study of pathological voices is re-synthesis; clear and immediate evidence of the success and accuracy of modeling efforts is provided

More information

THE BEATING EQUALIZER AND ITS APPLICATION TO THE SYNTHESIS AND MODIFICATION OF PIANO TONES

THE BEATING EQUALIZER AND ITS APPLICATION TO THE SYNTHESIS AND MODIFICATION OF PIANO TONES J. Rauhala, The beating equalizer and its application to the synthesis and modification of piano tones, in Proceedings of the 1th International Conference on Digital Audio Effects, Bordeaux, France, 27,

More information

I-Hao Hsiao, Chun-Tang Chao*, and Chi-Jo Wang (2016). A HHT-Based Music Synthesizer. Intelligent Technologies and Engineering Systems, Lecture Notes

I-Hao Hsiao, Chun-Tang Chao*, and Chi-Jo Wang (2016). A HHT-Based Music Synthesizer. Intelligent Technologies and Engineering Systems, Lecture Notes I-Hao Hsiao, Chun-Tang Chao*, and Chi-Jo Wang (2016). A HHT-Based Music Synthesizer. Intelligent Technologies and Engineering Systems, Lecture Notes in Electrical Engineering (LNEE), Vol.345, pp.523-528.

More information

Structure of Speech. Physical acoustics Time-domain representation Frequency domain representation Sound shaping

Structure of Speech. Physical acoustics Time-domain representation Frequency domain representation Sound shaping Structure of Speech Physical acoustics Time-domain representation Frequency domain representation Sound shaping Speech acoustics Source-Filter Theory Speech Source characteristics Speech Filter characteristics

More information