ROBUST PITCH TRACKING USING LINEAR REGRESSION OF THE PHASE


Ramon E. Prieto (1), Sora Kim (2)
(1) Electrical Engineering Department, Stanford University, rprieto@stanford.edu
(2) Electrical Engineering Department, Stanford University, kimsora@stanford.edu

ABSTRACT: A new robust algorithm for estimating the pitch period of a speech signal is described. The algorithm puts emphasis on the frequency components where noise and spectral leakage have less impact on the signal. At the same time, it uses smaller analysis windows to improve time resolution and to avoid jitter and pitch doubling effects. As a result, experiments show lower fine pitch errors as well as better voiced/unvoiced segment detection.

INTRODUCTION

The robust estimation of the pitch period plays an important role in speech processing applications, and many methods to extract the pitch have been proposed. Among the most widely accepted are the cepstrum method (1) and the SIFT method. In the first, jitter and pitch doubling are known to be a problem; the accuracy of the second depends on how stationary the speech signal is within the analysis interval. Other methods, such as the autocorrelation function method and the average magnitude difference function method, keep the formant structure of the signal, which makes the pitch estimation hard when there are high-energy, high-frequency harmonics in the signal.

The objective of this paper is to develop a pitch tracking algorithm that overcomes the problems of the algorithms described above: on the one hand, avoid the effect of high-energy, high-frequency harmonics; on the other hand, avoid long analysis windows so as to reduce pitch doubling and jitter effects. These objectives translate directly into lower gross error counts and lower fine pitch errors (2).

To this end, we use a time delay estimation technique, described in section 2, for which we provide a new theoretical framework that explains why the method works when spectral leakage is an issue. In sections 3 and 4 we add robustness to the method by improving the phase unwrapping process and by weighting the contribution of the phase of the different frequency bins to the pitch estimate. In section 5 we evaluate the performance of the method on both clean and noisy speech, and compare it with the cepstrum method (1) and the autocorrelation method.

LINEAR REGRESSION OF THE PHASE

Let us assume two different frames of a sampled voiced speech signal,

    x_1(n) = s(n T_s),          n = 0, ..., N-1,    (1)
    x_2(n) = s(n T_s + D),      n = 0, ..., N-1,    (2)

where s is periodic with Fourier coefficients c_k, T is the pitch period in units of time (assumed to be a multiple of the sampling period T_s), D is the time delay between the two frames, and δ represents, in units of time, how far the beginning of frame x_2 lies from the start of the pitch period nearest to the beginning of x_1; δ is assumed to range from 0 to T. A very similar problem is stated in (6), where the aim is to find the time delay D between x_1 and x_2. The objective of this work is to find T, given that we know D, the time delay between x_1 and x_2.

Pitch Synchronous Frames

Assuming that both frames are pitch synchronous (i.e., the frame length N T_s is a multiple of the pitch period T), then

    X_2(q) = X_1(q) e^{j 2π q δ / (N T_s)},    (3)

where X_1(q) and X_2(q) are the DFT coefficients of x_1 and x_2 respectively. The unwrapped phase of formula (3) is bilinear in q and δ, so if we know x_1 and x_2 and their time delay D, we can perform a linear regression of that unwrapped phase against q; the slope gives us δ and, since D is known, the value of T. Although this is an unrealistic case (T has to be known before formula (3) can be applied), it illustrates the basics of our pitch tracking method.
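To make the pitch-synchronous case concrete, here is a minimal numerical sketch (our own, not the paper's implementation): it builds two frames of a synthetic periodic signal whose length is a whole number of periods, regresses the unwrapped cross-spectrum phase against the bin index, and reads the delay off the slope. The sampling rate, harmonic amplitudes, energy threshold and helper names are our assumptions.

```python
import numpy as np

fs = 8000.0        # sampling rate [Hz], chosen for this example
T  = 0.005         # true pitch period: 5 ms (f0 = 200 Hz)
N  = 320           # frame length: N/fs = 40 ms = 8 pitch periods -> pitch synchronous
D  = 0.0015        # true delay between the two frames [s]; here D < T, so delta = D

def frame(start):
    """One N-sample frame of a synthetic periodic signal with three harmonics."""
    t = start + np.arange(N) / fs
    f0 = 1.0 / T
    return sum(a * np.cos(2 * np.pi * k * f0 * t + p)
               for k, a, p in [(1, 1.0, 0.3), (2, 0.6, 1.1), (3, 0.3, -0.7)])

x1, x2 = frame(0.0), frame(D)
X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)

# The cross-spectrum phase angle(X2 * conj(X1)) grows linearly with the bin index q,
# with slope 2*pi*D*fs/N rad per bin.  Only bins that carry harmonic energy have a
# meaningful phase, so the regression is restricted to them.
mag = np.abs(X1) * np.abs(X2)
q = np.flatnonzero(mag > 0.01 * mag.max())
phi = np.unwrap(np.angle(X2[q] * np.conj(X1[q])))

slope = np.polyfit(q, phi, 1)[0]
D_hat = slope * N / (2 * np.pi * fs)
print(f"true delay {D*1e3:.3f} ms, estimated {D_hat*1e3:.3f} ms")
# For larger delays, naive unwrapping between widely spaced harmonic bins breaks down,
# which is one reason the more robust unwrapping of section 3 is needed.
```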

Non Pitch Synchronous Frames

In our real problem we do not know T; it is precisely what we wish to calculate. In this section we show that applying a linear regression of the phase when the frames are not pitch synchronous still gives us the period, or a good estimate of it, and we point out the modifications that have to be applied to the basic method to make it more reliable. We wish to calculate T by applying a linear regression to the unwrapped phase of

    X_2(q) X_1*(q),    q = 0, ..., N-1,    (4)

where q is the frequency bin index and * denotes complex conjugation. Formula (3) does not apply in this case; instead, the DFTs of x_1 and x_2 are affected by spectral leakage. Given formulas (1) and (2), the DFTs of x_1 and x_2 can be written as

    X_1(q) = Σ_k c_k W(q − N k T_s / T) e^{j φ_{1,k}},    (5)
    X_2(q) = Σ_k c_k W(q − N k T_s / T) e^{j φ_{2,k}},    (6)

where W(·) is the DFT of the analysis window and φ_{1,k}, φ_{2,k} collect the phase of harmonic k at the start of each frame, with φ_{2,k} − φ_{1,k} = 2π k D / T.

Figure 1a shows the magnitude of the contribution of harmonic k, c_k W(q − N k T_s / T), to X(q). The location of the frequency bins q relative to N k T_s / T depends on the values of N, T_s and T. The ideal case of no interference of harmonic k in the frequency bins far away from N k T_s / T happens when N k T_s / T is an integer (the pitch synchronous case). As long as the distance between the bin and N k T_s / T is very small, W(q − N k T_s / T) will be close to N for the frequency bin closest to N k T_s / T and close to zero for the others. Figure 1b shows the phase of W(q − N k T_s / T). By the same analysis as in figure 1a, the phase added to the contribution of c_k in X(q) is close to zero for the frequency bins close to N k T_s / T when that distance is very small, but it can be large (even higher than π/2) when the distance is around half a bin.

Figure 1: (a) Magnitude of W(q − N k T_s/T), centered at N k T_s/T; the circles show the magnitude at the frequency bins. (b) Phase of W(q − N k T_s/T) at the frequency bins, for the same case as (a). (c) Magnitude of two neighboring harmonics, k (dotted) and k+1 (dashed), and the DFT of their sum (solid, upscaled by 2), as a function of the frequency bin q.

If the phase distortion of a harmonic in our non-pitch-synchronous analysis can be that large, what guarantees that the phase regression will work for non-pitch-synchronous frames? The answer is given by phase interpolation and by the subtraction of phases, discussed in the next subsection.
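The leakage phase distortion just described can be reproduced with a short numeric experiment of our own (rectangular window, a single real harmonic of known phase); the bin positions and window length are arbitrary choices, not values from the paper.

```python
import numpy as np

N = 256
n = np.arange(N)

for k_pos in (20.0, 20.5):                 # on-bin (pitch-synchronous-like) vs. half a bin off
    x = np.cos(2 * np.pi * k_pos * n / N)  # one harmonic with phase 0 at n = 0
    X = np.fft.rfft(x)
    print(f"harmonic at bin {k_pos}:")
    for q in (20, 21):
        print(f"  bin {q}: |X| = {abs(X[q]):7.2f}, phase = {np.angle(X[q]):+5.2f} rad")

# With k_pos = 20.0 all of the energy stays in bin 20 and its phase equals the harmonic's
# own phase (0 here); bin 21 is numerically zero, so its printed phase is meaningless.
# With k_pos = 20.5 the energy is shared between bins 20 and 21, and each picks up an
# extra phase of roughly +/- pi/2 from the leakage kernel -- the distortion that the
# phase interpolation argument below, and later the weighting, have to cope with.
```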

Phase Interpolation and Subtraction of Phases

Phase Interpolation

Imagine we have two harmonics, k and k+1, and imagine that those two harmonics are conflicting in a frequency bin q that lies between N k T_s/T and N (k+1) T_s/T. Given the characteristics of the side lobes of W (figure 1a), the contribution of any other harmonic is assumed to be small. As such, from formulas (5) and (6) we have

    X(q) ≈ c_k W(q − N k T_s/T) + c_{k+1} W(q − N (k+1) T_s/T),    (7)

where the two contributions can be written in polar form as A_k e^{j θ_k} and A_{k+1} e^{j θ_{k+1}} (formulas (8) and (9)). We can see from formula (7) that, if θ_k and θ_{k+1} are properly unwrapped, the phase of X(q) is an interpolation of the phases θ_k and θ_{k+1}, with interpolation weights driven by the magnitudes A_k = |c_k W(q − N k T_s/T)| and A_{k+1} = |c_{k+1} W(q − N (k+1) T_s/T)|.

If the weights are the same for both phases (i.e., A_k = A_{k+1}), the interpolated phase eliminates the phase distortion and the resulting phase of X_1(q) is the average of the two harmonic phases; the same happens to X_2(q). This is a perfect situation, since the frequency bin q then adds zero error to the linear regression: the average of the phases of harmonics k and k+1 is exactly the value the regression line predicts at a bin halfway between them. As long as the difference between A_k and A_{k+1} is small, the phase distortion remains small. However, if the difference is large enough that the resulting phase of X(q) tends to be closer to the phase of one harmonic than to the interpolated value (or the opposite), the solution to that problem is given by the subtraction of phases.

Subtraction of Phases

This time let us assume that, in formula (7), the difference between A_k and A_{k+1} makes the resulting phase of the bin close to θ_k rather than to the interpolated value (the opposite case works just the same). Then the resulting phases of X_1(q) and X_2(q) are both approximately the phase of harmonic k in the corresponding frame plus the same leakage phase from W(q − N k T_s/T) (formulas (10) and (11)). When the two phases are subtracted, the common leakage term cancels and the phase difference is approximately 2π k D / T. This allows our regression method to work with some regression error, since the frequency bin q next to harmonics k and k+1 shows a phase difference that is proportional to k rather than to the bin index q itself.

The analysis above uses a rectangular window. For the rest of this work we use Hamming windows, since the amplitude of the worst-case side lobe is lower; the analysis done in this section generalizes to Hamming windows. Even though phase interpolation and the subtraction of phases solve the phase distortion problem for the frequency bins closest to a harmonic, there is still the problem of high-energy harmonics contributing to the phase of frequency bins far away from the harmonic itself. This problem can seriously modify the result of the regression, and it also tells us that the frequency bins with more energy have a more reliable phase. The solution to this problem is a weighted linear regression.
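As a sketch of what such a weighted regression can look like, the following routine weights each bin by the product of the two magnitude spectra raised to a power, and it also discards near-zero bins before unwrapping; the exponent, the energy floor, the synthetic test signal and the function name are our assumptions (the paper's actual weighting scheme is given in section 4 below).

```python
import numpy as np

def weighted_phase_slope(X1, X2, p=2.0, floor=0.01):
    """Weighted least-squares slope (through the origin) of the unwrapped cross-spectrum
    phase versus the bin index q, using only bins whose energy product exceeds
    `floor` times its maximum."""
    mag = np.abs(X1) * np.abs(X2)
    q = np.flatnonzero(mag > floor * mag.max())
    q = q[q > 0]                                      # skip the DC bin
    phi = np.unwrap(np.angle(X2[q] * np.conj(X1[q])))
    w = mag[q] ** p                                   # high-energy bins dominate the fit
    slope = np.sum(w * q * phi) / np.sum(w * q * q)
    residual = phi - slope * q
    return slope, np.sqrt(np.average(residual ** 2, weights=w))

# Toy usage: two Hamming-windowed frames of a synthetic voiced-like signal, 1.5 ms apart.
fs, N, f0, D = 8000.0, 256, 200.0, 0.0015
t = np.arange(N) / fs
harmonics = lambda t: sum((0.8 ** k) * np.cos(2 * np.pi * k * f0 * t) for k in range(1, 8))
win = np.hamming(N)
X1 = np.fft.rfft(harmonics(t) * win)
X2 = np.fft.rfft(harmonics(t + D) * win)
slope, err = weighted_phase_slope(X1, X2)
print(f"estimated delay {slope * N / (2 * np.pi * fs) * 1e3:.2f} ms, "
      f"weighted regression error {err:.3f} rad")
```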

PHASE UNWRAPPING

Considerable work has been done in the field of phase unwrapping, which has been used, for example, to calculate the complex cepstrum. Several methods have been proposed to unwrap the phase of one-dimensional signals, among which we chose to compare the following ones.

Basic Unwrapping (BU)

If we consider the phase response as a continuous function of frequency, then unwrapping is meant to make the phase more continuous. As such, our Basic Unwrapping method (BU) adds 2π or −2π to the phase of all the frequency bins greater than or equal to q whenever the difference between the phases at bins q and q−1 is lower than −π or greater than π, respectively.

Slope Forced Unwrapping (SFU)

Given that the phase of the frequency bins closest to the first harmonic will not wrap unless the delay between the frames becomes large, we can consider those phases as good information for calculating an initial slope. Then, at frequency bin q, we calculate the slope of the line that goes from frequency bin zero to frequency bin q−1; an estimate of the phase at q is computed from that slope, and the actual phase at frequency bin q is unwrapped around that estimate. Since we want only reliable frequency bins to modify the estimated slope, the slope is recalculated only at the frequency bins whose magnitude is greater than or equal to a fixed fraction of the maximum magnitude in the spectrum.

Linear Regression Slope Forced Unwrapping (LRSFU)

The most widely used method for phase unwrapping is (4); a less general version of it was implemented in (5). Here, for the intermediate estimate at frequency bin q, the preceding frequency bins are used to perform a linear regression; the calculated slope is used to predict an estimate of the phase at frequency bin q, and the actual phase is unwrapped around that estimate. We call this method Linear Regression Slope Forced Unwrapping (LRSFU). The same magnitude threshold is used as in SFU.

A comparison of the unwrapping methods is shown in figure 2. As an example, we show the results for two frames separated by exactly one pitch period, which should therefore give a slope equal to zero. Parts (b) and (c) show that the BU and SFU methods are too sensitive to spectral leakage. For this specific example LRSFU is the most robust method, since it is the only one that did not add an incorrect multiple of 2π at any bin. Figure 2 also shows how dramatic the change in the slope would be if the method were not robust enough. We can also see, for all the unwrapping methods in figure 2, that there is no incorrect unwrap at the initial frequency bins, which have high magnitude in the DFT (speech usually has high energy up to 3-4 kHz). If, when doing the linear regression, we put more weight on the frequencies with high amplitude, we reduce the effect of the spectral leakage that the unwrapping method does not avoid.

Figure 2: Comparison of the three unwrapping methods on the phase of formula (4): (a) no unwrapping; (b) BU; (c) and (e) SFU; (d) and (f) LRSFU, for two parameter settings.
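Below is a minimal sketch of an LRSFU-style unwrapper, as we read the description above; the regression window length L, the magnitude threshold eta and the fallback used when too few reliable bins are available are our own choices, not the paper's.

```python
import numpy as np

def lrsfu_unwrap(phase, mag, L=8, eta=0.1):
    """Unwrap `phase` (one value per DFT bin, in radians) by predicting each bin's phase
    from a linear regression over up to L previously unwrapped bins, then folding the
    measured phase onto the 2*pi branch nearest that prediction.  Only bins whose
    magnitude is at least eta * max(mag) are allowed to drive the regression."""
    out = np.asarray(phase, dtype=float).copy()
    reliable = np.asarray(mag) >= eta * np.max(mag)
    for q in range(1, len(out)):
        idx = np.arange(max(0, q - L), q)
        use = idx[reliable[idx]]
        if len(use) < 2:                  # not enough reliable bins yet: fall back to all
            use = idx
        if len(use) >= 2:
            slope, intercept = np.polyfit(use, out[use], 1)
            predicted = slope * q + intercept
        else:
            predicted = out[q - 1]
        out[q] = out[q] + 2 * np.pi * np.round((predicted - out[q]) / (2 * np.pi))
    return out
```

Such a routine would be applied to the wrapped phase of formula (4), with the magnitude spectrum supplying the reliability mask, before the weighted regression of the next section.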

WEIGHTED LINEAR REGRESSION

We want to apply a linear regression to the unwrapped phase of formula (4). The problem and its solution are stated as

    Φ = Q β + ε,        β = (Qᵀ W Q)⁻¹ Qᵀ W Φ,    (12)

where W is an N×N diagonal matrix with the weights as its diagonal elements, Q is a vector containing the frequency bin indexes 1 to N, Φ is the vector containing the unwrapped phase of each of the frequency bins of formula (4), and ε is the regression error.

The work in (6) uses the magnitude squared coherence function, as defined in (7), to define a weighting scheme. However, since x_1 and x_2 come from the same microphone, the magnitude squared coherence function would give a strong correlation of the noise. In (8), the signal is prefiltered to emphasize the frequencies where the signal-to-noise ratio is high. Following this reasoning, we propose the weighting scheme

    w(q) = (|X_1(q)| |X_2(q)|)^p,    (13)

where p is a real number greater than one, used to emphasize the frequencies with high amplitude over those with low amplitude.

In figure 3, several plots of the estimated pitch period and of the regression error versus T+δ are shown, for the different unwrapping methods with and without weighting. The actual pitch period of the signal is 7 ms. When the weighting is used, the estimate becomes reliable over a bigger region of T+δ, and the regression error becomes a discriminant between a good and a bad estimate of the pitch period.

Figure 3: Estimated pitch period and regression error versus T+δ for the different weighting schemes and unwrapping methods: (a)-(c) estimated period with BU, SFU and LRSFU and no weighting; (d)-(f) the corresponding regression errors; (g)-(i) estimated period with BU, SFU and LRSFU using the proposed weighting; (j)-(l) the corresponding regression errors.

RESULTS

Parts (i) and (l) of figure 3 show that we can perform several iterations of our method, fixing the position of frame x_1 and shifting the time delay to the last estimate of the pitch period, until the estimate and the regression error go below certain thresholds. We call this method Iterative Linear Regressions of the Phase (ILRP). To approximate the ideal pitch synchronous case, we implement a variation of ILRP in which the frame length of x_1 and x_2 and their time delay are set equal to the last pitch period found at each iteration. This method is called Adaptive Frame Length Iterative Linear Regression of the Phase (AFLILRP), and it is applied only after the first pitch period has been successfully found by ILRP. This variation avoids jitter and pitch doubling effects and allows the use of a lower threshold value. A frame is labeled as voiced if the regression error goes below the thresholds before a maximum number of iterations is reached; otherwise, the frame is labeled as unvoiced.

For the results in this section we used 64 seconds of speech from male speakers and 96 seconds of speech from 2 female speakers. Table 1 shows the performance measure in each row for the different phase unwrapping methods in each column. Number 1 stands for SFU and 2 stands for LRSFU; for example, method 2-1 means LRSFU in ILRP and SFU in AFLILRP. We also compared the performance of our method with the cepstrum pitch detection method (1) and the autocorrelation method. The same threshold value was used for both SFU and LRSFU. The performance measures used are the gross pitch error (GPE), the voiced-to-unvoiced error rate (V-UV), the unvoiced-to-voiced error rate (UV-V), the gross error count (GEC), the fine pitch error average (FPEAV) and the fine pitch error standard deviation (FPESD), as defined in (2).
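For reference, here is a hedged sketch of how a subset of these measures can be computed from a reference and an estimated pitch track; the 1 ms gross-error threshold, the handling of unvoiced frames and all names are our assumptions about the definitions in (2), not values taken from this paper.

```python
import numpy as np

def pitch_errors(ref_T, est_T, gross_ms=1.0):
    """ref_T, est_T: per-frame pitch periods in ms, with 0 meaning 'unvoiced'."""
    ref_v, est_v = ref_T > 0, est_T > 0
    v_uv = np.mean(ref_v & ~est_v) * 100              # voiced frames called unvoiced [%]
    uv_v = np.mean(~ref_v & est_v) * 100              # unvoiced frames called voiced [%]
    both = ref_v & est_v
    err = est_T[both] - ref_T[both]
    gross = np.abs(err) > gross_ms
    gec = np.mean(gross) * 100 if both.any() else 0.0  # gross error count [%]
    fine = err[~gross]                                  # fine errors on the remaining frames
    return {"V-UV%": v_uv, "UV-V%": uv_v, "GEC%": gec,
            "FPEAV_ms": float(np.mean(fine)) if fine.size else 0.0,
            "FPESD_ms": float(np.std(fine)) if fine.size else 0.0}

# toy usage with made-up tracks
ref = np.array([0, 0, 7.0, 7.1, 7.0, 6.9, 0, 7.2])
est = np.array([0, 7.0, 7.1, 7.0, 3.5, 6.8, 0, 0])
print(pitch_errors(ref, est))
```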

Table 1: Performance of the pitch estimation for different phase unwrapping methods in ILRP and AFLILRP.

Male data, clean speech (columns: 1-1, 1-2, 2-1, 2-2, ceps, ac)
  GPE (%):     6 7 4 68 994 9
  V-UV (%):    66 2982 967 627 984
  UV-V (%):    24 46 26 77 98 62
  GEC (%):     27 27 276 2 6 79
  FPEAV (ms):  96 7 49 9 47
  FPESD (ms):  29 2 4

Female data, clean speech (columns: 1-1, 1-2, 2-1, 2-2, ceps, ac)
  GPE (%):     68 7 2 899 487 6
  V-UV (%):    94 8 696 668 7 4
  UV-V (%):    29 66 67 686 64 24
  GEC (%):     84 244 277 9 29 6
  FPEAV (ms):  -4 8 2 2 4 7
  FPESD (ms):  44 8 9

Male data, SNR = 0 dB (columns: 2-2, cepstrum, autocorrelation)
  V-UV (%):    84 7
  UV-V (%):    642 4 64
  GEC (%):     47 82
  FPEAV (ms):  27 2
  FPESD (ms):  4

Female data, SNR = 0 dB (columns: 2-2, cepstrum, autocorrelation)
  V-UV (%):    66 8 48
  UV-V (%):    7 8
  GEC (%):     722 87 86
  FPEAV (ms):  -4 8 9
  FPESD (ms):  2 2 2

For clean speech, in terms of V-UV and UV-V, method 2-2 performs best for the male data, while it performs almost the same as 1-2 and the cepstrum for the female data. In terms of GEC, FPEAV and FPESD, methods 1-2 and 2-2 are the best and perform almost the same; however, 2-2 is faster and more efficient at deciding whether a segment is voiced or unvoiced. For noisy data at 0 dB SNR, 2-2 performs clearly better than the cepstrum, and considerably better than the autocorrelation method in the UV-V, GEC and FPESD measures. The high UV-V rate of the autocorrelation method makes it hard to draw a comparison regarding V-UV and GPE.

CONCLUSIONS

We have described a method that uses a time delay estimation technique and phase information to detect the pitch frequency of a speech signal. We have given a new theoretical explanation of why this method should work, and we have described different approaches to phase unwrapping that lead to a robust and fast determination of the pitch. We have also proposed to eliminate the contribution of unreliable, low-energy phase components by performing a weighted linear regression of the phase. As a result, compared to the cepstrum and autocorrelation methods, method 2-2 always performs better in terms of GEC and FPESD, and it performs similarly or better on the remaining measures, depending on whether the data comes from male or female speakers, for both clean and 0 dB SNR speech.

REFERENCES

[1] A. M. Noll, "Cepstrum Pitch Determination", J. Acoust. Soc. America, vol. 41, pp. 293-309, 1967.
[2] L. R. Rabiner, M. J. Cheng, A. E. Rosenberg, C. A. McGonegal, "A comparative performance study of several pitch detection algorithms", IEEE Trans. Acoustics, Speech, and Signal Processing, 1976.
[3] B. G. Secrest, G. R. Doddington, "An integrated pitch tracking algorithm for speech systems", Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 1983, pp. 1352-1355.
[4] J. M. Tribolet, "A new phase unwrapping algorithm", IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-25, pp. 170-177, 1977.
[5] M. S. Brandstein, J. E. Adcock, H. F. Silverman, "A Practical Time-Delay Estimator for Localizing Speech Sources with a Microphone Array", Computer Speech and Language, April 1995, pp. 153-169.
[6] Y. Chan, R. Hattin, J. Plant, "The Least Squares Estimation of Time Delay and Its Use in Signal Detection", IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-24, pp. 217-222, June 1978.
[7] G. C. Carter, C. H. Knapp, A. H. Nuttall, "Estimation of the Magnitude-Squared Coherence Function Via Overlapped Fast Fourier Transform Processing", IEEE Trans. Audio and Electroacoustics, pp. 337-344, Aug. 1973.
[8] C. H. Knapp, G. C. Carter, "The Generalized Correlation Method for Estimation of Time Delay", IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-24, pp. 320-327, Aug. 1976.