Novel Temporal and Spectral Features Derived from TEO for Classification of Normal and Dysphonic Voices
Hemant A. Patil 1, Pallavi N. Baljekar 2, T. K. Basu 3

1 Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar, Gujarat, India.
2 Manipal Institute of Technology (MIT), Manipal, Karnataka, India.
3 Institute of Technology and Marine Engineering (ITME), Amira, West Bengal, India.

hemant_patil@daiict.ac.in, pallavi.baljekar@learner.manipal.edu, basutk0@yahoo.co.in

Abstract. In this paper, various temporal features (i.e., zero-crossing rate and short-time energy) and spectral features (spectral flux and spectral centroid) are derived from the Teager energy operator (TEO) profile of the speech waveform. The efficacy of these features for the classification of normal and dysphonic voices is analyzed by comparing their performance with that of the same features derived from the linear prediction (LP) residual and from the speech waveform itself. In addition, the effectiveness of fusing these features with the state-of-the-art Mel frequency cepstral coefficients (MFCC) feature set is investigated, to determine whether these features provide complementary information. The classifier used is a 2nd order polynomial classifier, with experiments carried out on a subset of the Massachusetts Eye and Ear Infirmary (MEEI) database.

Keywords: Dysphonia, TEO, LP residual, zero-crossing rate, short-time energy, spectral flux, spectral centroid, polynomial classifier.

1 Introduction

The main motivation for investigating features for dysphonia detection is to build a robust and reliable system for non-intrusive evaluation of a patient's voice, to detect pathologies in the larynx and the vocal tract. Pathologies such as vocal nodules, cysts and polyps are nodular masses present either on the glottis or along the walls of the vocal tract.
As a result, they change the airflow properties through the glottis: the increased mass of the vocal folds alters the periodicity of their vibration, and masses on the edge of the folds prevent complete glottal closure. On the other hand, pathologies such as paralysis, caused by damage to the recurrent and/or superior laryngeal nerve, affect the motor function of the larynx and thus cause asymmetric vibration of the vocal folds, which may produce transient or permanent diplophonia. The net effect of these pathologies is therefore to modify the airflow properties, especially at the source. Consequently, many of the parameters developed for voice pathology detection have been derived from the linear prediction (LP) residual [1] or the Electroglottograph (EGG) [1], both of which are considered representative of the airflow properties at the glottis. These features characterize the variability at the source, either in amplitude (shimmer [1]) or in fundamental frequency, i.e., pitch (jitter [1]). In pathological voices, the incomplete closure of the vocal folds allows air to escape, which increases the turbulence perceived in the voice. Thus, apart from these perturbation measures (i.e., shimmer and jitter), various noise measures [2] have also been derived from the speech signal to exploit this perceived turbulence.

In this paper, an attempt is made to derive features from the Teager energy operator (TEO) profile of the speech signal, which captures the glottal airflow properties more effectively by also accounting for the nonlinear sources of voice production, namely the vortices. In this study, four features (two temporal and two spectral) are used. These features are also extracted from the speech signal and the LP residual to compare against the performance of the TEO profile.

The organization of the paper is as follows. In Section 2, the robustness of the TEO in capturing source-related information is discussed. In Section 3, the computational details of the features used are briefly described. Section 4 gives details of the experimental setup and describes the experiments conducted and the results obtained. Finally, Section 5 summarizes our findings and discusses future research directions.

2 TEO as Source Information

The TEO was first proposed by the Teagers in [3].
The Teagers showed that the airflow is not laminar, as assumed by the linear source-filter theory, but separates into various paths, generating vortices which provide the excitation to the vocal tract during the closed phase. The TEO is an operator which captures the energy of these vortices. It is proportional to the square of both amplitude and frequency, and is defined in both the discrete and continuous domains. For the discrete case it is defined as:

ψ{x(n)} = x^2(n) − x(n−1)·x(n+1) ≈ A^2 ω^2,  (1)

for small values of ω, i.e., sin(ω) ≈ ω.

Fig. 1(a) and (b) depict the speech signal and the corresponding differenced EGG taken from the CMU-ARCTIC database [4]. Fig. 1(c) shows the corresponding TEO profile. It is interesting to see in this figure that the peaks in the TEO profile lie in close proximity to the peaks of the differenced EGG waveform, which correspond to the glottal closure instants (GCI), indicating that the TEO successfully captures the airflow properties at the glottis (in particular, glottal activity). Moreover, the height of the peaks in the TEO profile is correlated with the peaks in the differenced EGG waveform, showing that the TEO profile of speech is a robust indicator of the airflow properties at the source, i.e., the glottis. Fig. 2 depicts the TEO profile for a normal speaker and for a pathological speaker suffering from vocal nodules, taken from the Massachusetts Eye and Ear Infirmary (MEEI)
database. As can be seen, for a normal speaker, complete glottal closure means there is not much turbulence at the source, which is reflected in the regularity of the TEO profile peaks. In the case of the pathological voice, on the other hand, incomplete closure produces increased turbulence at the source, which is reflected in the irregular structure of the running estimate of signal energy via the TEO profile. This evidence again reiterates that the TEO is very good at capturing the airflow properties at the glottal source.

Fig. 1. TEO as a source feature: (a) speech signal (b) differenced EGG pulses corresponding to GCI (c) corresponding TEO profile of speech.

Fig. 2. TEO profile of speech for (a) normal phonation and (b) a person suffering from vocal nodules.

Two prominent differences can be observed in the plots in Fig. 2. First, the peaks for normal phonation are almost all the same height, whereas the peaks in the TEO profile for the pathological case are more non-uniform in height, showing greater variability in the energy at the GCI and thus greater amplitude variability. Second, the zero-crossing rate (ZCR) at the GCI is uniform for normal speech, which has hardly any zero-crossings between the GCI, whereas the pathological voice shows an increased number of zero-crossings, especially between the GCI, due to the escape of air caused by the incomplete closure of the vocal folds and the resulting increase in perceived turbulence. Thus, we need to extract suitable temporal and spectral features which exploit these characteristics and give high classification accuracy.
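The discrete operator of Eq. (1) is straightforward to compute. The following NumPy sketch is our own illustration (not the authors' code); it applies the operator sample by sample and verifies the A^2·sin^2(ω) identity on a pure tone, which reduces to A^2·ω^2 for small ω:

```python
import numpy as np

def teo(x):
    """Discrete Teager energy operator: psi[x(n)] = x(n)^2 - x(n-1) * x(n+1)."""
    x = np.asarray(x, dtype=float)
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]   # replicate edge values (one common convention)
    return psi

# For a pure tone x(n) = A*cos(omega*n), the TEO equals A^2 * sin(omega)^2 exactly.
fs = 25000.0
omega = 2 * np.pi * 100.0 / fs          # 100 Hz tone at a 25 kHz sampling rate
x = 0.5 * np.cos(omega * np.arange(1000))
print(np.allclose(teo(x)[1:-1], 0.25 * np.sin(omega) ** 2))
```

Applied to a speech frame rather than a tone, the same function yields the TEO profile whose peak regularity is compared in Fig. 2.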
3 Features Used

The main requirements for selecting the features were as follows. First, a feature should exploit the characteristics of a pathological voice: increased noise components, breakdown of periodic structure, greater high-frequency content and greater amplitude variations. Second, it should be simple to compute and reproduce. Lastly, it should be independent of any determination of the pitch period. Thus, for this application we used four conventional features: two temporal features, the zero-crossing rate (ZCR) and short-time energy (STE), and two spectral features, spectral flux (SF) and spectral centroid (SC). The performance of these features was also compared with Mel frequency cepstral coefficients (MFCC) derived from the speech signal, since this is the state-of-the-art feature set, which has given very good classification accuracy [5]. This section describes the computational details of these temporal and spectral features.

Short-time Average Zero-Crossing Rate (ZCR). As defined in [6], a zero-crossing is said to occur if there is an algebraic sign change between two consecutive samples of a speech signal. The rate at which zero-crossings occur is thus an indication of the frequency content of the signal. It is defined as:

Z_n = (1/2N) Σ_m |sgn[x(m)] − sgn[x(m−1)]| · w(n−m),  (2)

where Z_n is the ZCR for the n-th frame, N is the frame length and w(n) is the Hamming window. ZCR has been used for voiced/unvoiced detection [6], since unvoiced speech is known to have higher frequency content. As various previous studies have shown that pathological voices have much higher frequency content than normal speech signals [1], we considered this feature a suitable indicator of the frequency content of a signal and of the presence of noise components, since higher frequency components are mainly attributable to noise.

Short-Time Energy (STE).
Pathological speech is known to have greater amplitude variations than normal voices, which was the major motivation for the shimmer parameter used in earlier studies [1]. Short-time energy is another parameter that has been used to capture amplitude variations, especially to distinguish voiced from unvoiced speech [6]. In this paper, we investigate whether it can capture the difference in amplitude variation between pathological and normal speech. The short-time energy is defined as:

E_n = Σ_m [x(m)·w(n−m)]^2,  (3)

where E_n is the energy of the n-th frame, x(m) is the signal and w(n) is the Hamming window.

Spectral Centroid (SC). The spectral centroid measures the brightness of a sound, i.e., it measures where most of the power in a speech segment is located. It has been
previously used for the detection of clinical depression [7] and for speech recognition, where spectral centroids computed in various sub-bands were found to be similar to formant frequencies, to provide information complementary to cepstral features, and to be robust to noise [8]. The spectral centroid is the weighted average frequency of the spectrum and thus indicates the frequency range in which most of the power of the spectrum lies. We wanted to analyze whether the spectral centroid shifts towards higher frequencies for the pathological voice. It is defined as:

SC = Σ_{k=0}^{N−1} X(k)·F(k) / Σ_{k=0}^{N−1} X(k),  (4)

where X(k) represents the magnitude of bin number k, and F(k) represents the center frequency of that bin.

Spectral Flux (SF). The spectral flux is defined as the difference between the power spectra of two consecutive speech frames; it thus measures the frame-to-frame variability in spectral shape. It has previously been used for detecting depression [7] and for speaker recognition [9], among other applications. Since the pathological voice is known to be less periodic than normal signals, it is expected to show higher frequency variations, which motivated the use of the jitter parameter. We expected this parameter to capture the breakdown of periodic structure and the variation in frequency content of the pathological voice. The spectral flux is defined as:

SF(n) = Σ_{k=−N/2}^{N/2−1} H(|X(n,k)| − |X(n−1,k)|)^2,  (5)

where X(n,k) is the k-th frequency bin of the n-th frame and H(x) = (x + |x|)/2 is the half-wave rectifier function.

4 Experimental Results

The corpus used for the experiments is the commercially available MEEI database [10]. For this work, a subset of 173 pathological and 53 normal speakers was used, following [2].
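The four features of Section 3 can be sketched in NumPy as follows. This is our own illustration of Eqs. (2)-(5) on Hamming-windowed frames, not the authors' implementation, and the function names and the normalization details are ours:

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=256):
    """Block the signal into non-overlapping Hamming-windowed frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

def zcr(frames):
    """Eq. (2): fraction of sign changes per frame (Hamming window already applied)."""
    s = np.sign(frames)
    return 0.5 * np.mean(np.abs(np.diff(s, axis=1)), axis=1)

def ste(frames):
    """Eq. (3): short-time energy of each windowed frame."""
    return np.sum(frames ** 2, axis=1)

def spectral_centroid(frames, fs):
    """Eq. (4): magnitude-weighted mean frequency of each frame."""
    mag = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    return np.sum(mag * freqs, axis=1) / (np.sum(mag, axis=1) + 1e-12)

def spectral_flux(frames):
    """Eq. (5): half-wave-rectified frame-to-frame change in magnitude spectra."""
    mag = np.abs(np.fft.rfft(frames, axis=1))
    d = np.diff(mag, axis=0)
    h = 0.5 * (d + np.abs(d))           # half-wave rectifier H(x) = (x + |x|)/2
    return np.sum(h ** 2, axis=1)

fs = 25000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t)         # 1 s synthetic "phonation" stand-in
fr = frame_signal(x)
print(zcr(fr).shape, ste(fr).shape, spectral_centroid(fr, fs).shape, spectral_flux(fr).shape)
```

In the experiments these functions would be applied three times per recording: to the speech waveform, to its LP residual, and to its TEO profile.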
The MEEI database consists of samples at either 50 kHz or 25 kHz, hence all samples were downsampled to a 25 kHz sampling frequency. Since the number of pathological samples is approximately 3 times the number of normal samples, 1 s of pathological data per patient and 3 s of normal data of sustained phonation /ah/ per control speaker were used for training and testing. The signals were blocked into frames of 256 samples, corresponding to 10.24 ms. The features extracted per frame were given to a 2nd order polynomial classifier to generate the true and false scores [11]. A 4-fold cross-validation scheme repeated 12 times, giving a total of 48 trials, was carried out, using 75% of the samples for training and 25% for testing, with the training
and testing subsets kept independent of each other. The classification accuracy (Ac) was calculated as an average over all these 48 trials (i.e., 688 genuine and 688 impostor trials).

Fig. 3. DET plots for comparison of features derived from the LP residual, the TEO profile of speech and the speech waveform for the following features: (a) ZCR (b) STE (c) SC (d) SF.

Table 1. A comparison of the temporal and spectral features (EER (%) and Ac (%)) derived from the speech waveform, the LP residual (LP Res) and the TEO profile of speech. Rows: ZCR-Speech, ZCR-LP Res, ZCR-TEO, STE-Speech, STE-LP Res, STE-TEO, SC-Speech, SC-LP Res, SC-TEO, SF-Speech, SF-LP Res, SF-TEO.

Table 2. Comparison of results (EER (%) and Ac (%)) of fusing the features derived from the TEO profile with MFCC. Rows: MFCC, MFCC+ZCR, MFCC+STE, MFCC+SF, MFCC+SC, MFCC+ZCR+STE+SC+SF.
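The equal error rate reported alongside the accuracies is the operating point at which the false-acceptance and false-rejection rates of the true/false scores coincide. A minimal sketch of its computation (our own illustration, not the authors' evaluation code):

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Sweep thresholds over the scores; return the point where FAR and FRR meet."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    far = np.array([np.mean(impostor >= t) for t in thresholds])  # false accepts
    frr = np.array([np.mean(genuine < t) for t in thresholds])    # false rejects
    i = np.argmin(np.abs(far - frr))
    return 0.5 * (far[i] + frr[i])

# Perfectly separated genuine and impostor scores give an EER of 0.
print(equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))   # -> 0.0
```

Plotting FRR against FAR over all thresholds, on normal-deviate axes, gives the DET curve of [12].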
Fig. 4. DET curves for feature-level fusion of MFCC with (a) ZCR, (b) SC and (c) ZCR+STE+SF+SC.

The detection error tradeoff (DET) plots [12] comparing the performance of the speech waveform, its TEO profile and the LP residual are shown for each of the 4 features in Fig. 3, and the corresponding equal error rate (EER) values and accuracies are listed in Table 1. As can be seen, in 3 of the 4 cases the TEO profile performs best in comparison with the LP residual and the speech waveform; only for the spectral flux feature does the LP residual perform better. Among the 4 features, ZCR performs best.

Apart from this, we also investigated whether these features provide information complementary to the MFCC features, computed according to [13]; the values are listed in Table 2. From Table 2 we see that, when each feature is fused separately with MFCC, ZCR and SC give the lowest EER. However, in the DET curves shown in Fig. 4(a) and (b), certain points for the SC parameter come very close to the DET curve for MFCC, while for ZCR the two DET curves are separated at all points. Thus, as expected from the results of the individual features, ZCR performs best when fused with MFCC, followed by SC and STE, while fusion with SF actually reduces the classification accuracy, showing that it provides no complementary information. We also fused MFCC with ZCR, STE, SF and SC together; Fig. 4(c) depicts the DET plot of MFCC+ZCR+STE+SC+SF. It can be observed that there is not much improvement in EER in Fig. 4(c) compared with Fig. 4(a): although this fusion did increase the classification accuracy, the increase was not much greater than that of ZCR+MFCC.
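The 2nd order polynomial classifier of [11] scores each frame with a linear model applied to the degree-2 monomial expansion of its feature vector. A minimal sketch of that expansion (our own illustration; the function name is ours):

```python
import numpy as np
from itertools import combinations_with_replacement

def poly2_expand(X):
    """Map each row [x1..xd] to all monomials up to degree 2: [1, x_i, x_i*x_j]."""
    n, d = X.shape
    cols = [np.ones(n)] + [X[:, i] for i in range(d)]
    cols += [X[:, i] * X[:, j] for i, j in combinations_with_replacement(range(d), 2)]
    return np.stack(cols, axis=1)

# A 3-dimensional feature vector expands to 1 + 3 + 6 = 10 terms.
P = poly2_expand(np.array([[1.0, 2.0, 3.0]]))
print(P.shape, P[0].tolist())
```

Training then reduces to a least-squares fit of weights w on the expanded frames, with targets of 1 for the in-class frames and 0 otherwise; averaging the per-frame scores P·w over an utterance yields its true or false score.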
So the increase in accuracy does not outweigh the increase in computation, and the best feature from our analysis is the ZCR, which does well both individually and when fused with MFCC.

The good performance of ZCR can be understood intuitively from the TEO profile plots shown in Fig. 2 for the sustained phonation of a normal person and of a person suffering from vocal nodules. As can be observed from the plots, the ZCR for the normal person is almost periodic, with zero-crossings occurring approximately at each GCI, whereas for the pathological voice, due to increased
turbulence at the source, there are many spurious zero-crossings between the peaks as well. For this reason, pathological voices show a significantly higher ZCR, and this feature therefore performs best in this dichotomy.

5 Summary and Conclusion

In this paper, it is observed that most of the spectral and temporal features derived from the TEO profile of the speech signal perform better than the same features derived from either the LP residual or the speech waveform. Consequently, we believe that the TEO is a more effective operator for capturing the glottal airflow properties than the LP residual. The advantage of the proposed method is that it uses simple features, which can be implemented and computed easily and are independent of ad hoc pitch estimation methods. Moreover, we have shown that most of the proposed features do provide some complementary information to the MFCC features, increasing the classification accuracy by almost 1% in most cases and decreasing the EER by 1% as well.

From this work, we infer that the ZCR gave the best relative performance, and hence we would like to investigate this feature further by analyzing the ZCR in different frequency bands and by varying the analysis window length and the degree of the polynomial classifier to obtain the best performance. Moreover, it is evident that the characteristics of the TEO profile in the vicinity of the GCI play a significant role, and in the future we would like to develop speech features which capture this information for voice pathology classification.

Acknowledgements. The authors would like to thank the authorities of DA-IICT Gandhinagar for their kind support in carrying out this work. They would also like to thank the ECE Dept., Manipal Institute of Technology, India for providing the MEEI database, without which this work would not have been possible.

References

1. Davis, S.B.: Acoustic Characteristics of Normal and Pathological Voices.
Haskins Laboratories: Status Report on Speech Research, vol. 54 (1978)
2. Parsa, V., Jamieson, D.G.: Identification of Pathological Voices Using Glottal Noise Measures. J. Speech, Language, Hearing Res., vol. 43, no. 2 (2000)
3. Teager, H.M., Teager, S.M.: Evidence for Nonlinear Sound Production Mechanisms in the Vocal Tract. In: Speech Production and Speech Modelling, W.J. Hardcastle and A. Marchal (eds.). Kluwer, Netherlands (1990)
4. CMU-ARCTIC speech synthesis databases.
5. Markaki, M., Stylianou, Y., Arias-Londoño, J.D., Godino-Llorente, J.I.: Dysphonia Detection Based on Modulation Spectral Features and Cepstral Coefficients. In: IEEE Proc. Int. Conf. Acoust., Speech, Signal Processing (ICASSP) (2010)
6. Rabiner, L.R., Schafer, R.W.: Digital Processing of Speech Signals. Prentice-Hall, Englewood Cliffs, NJ (1978)
7. Low, L.S.A., Maddage, N.C., Lech, M., Sheeber, L., Allen, N.: Influence of Acoustic Low-Level Descriptors in the Detection of Clinical Depression in Adolescents. In: IEEE Proc. Int. Conf. Acoust., Speech, Signal Processing (ICASSP) (2010)
8. Paliwal, K.K.: Spectral Subband Centroid Features for Speech Recognition. In: IEEE Proc. Int. Conf. Acoust., Speech, Signal Processing (ICASSP) (1998)
9. Hossienzadeh, D., Krishnan, S.: Combining Vocal Source and MFCC Features for Enhanced Speaker Recognition Performance Using GMMs. In: Proc. of IEEE 9th Workshop on Multimedia Signal Processing (2007)
10. Kay Elemetrics Corp.: Disordered Voice Database Model 4337, Version 1.03, Massachusetts Eye and Ear Infirmary Voice and Speech Lab (2002)
11. Campbell, W.M., Assaleh, K.T., Broun, C.C.: Speaker Recognition with Polynomial Classifiers. IEEE Transactions on Speech and Audio Processing, vol. 10, no. 4 (2002)
12. Martin, A.F., Doddington, G., Kamm, T., Ordowski, M., Przybocki, M.: The DET Curve in Assessment of Detection Task Performance. In: Proc. Eurospeech '97, vol. 4, Rhodes, Greece (1997)
13. Davis, S.B., Mermelstein, P.: Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-28, no. 4 (1980)
More informationProject 0: Part 2 A second hands-on lab on Speech Processing Frequency-domain processing
Project : Part 2 A second hands-on lab on Speech Processing Frequency-domain processing February 24, 217 During this lab, you will have a first contact on frequency domain analysis of speech signals. You
More informationScienceDirect. Accuracy of Jitter and Shimmer Measurements
Available online at www.sciencedirect.com ScienceDirect Procedia Technology 16 (2014 ) 1190 1199 CENTERIS 2014 - Conference on ENTERprise Information Systems / ProjMAN 2014 - International Conference on
More informationSpeech Signal Analysis
Speech Signal Analysis Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 2&3 14,18 January 216 ASR Lectures 2&3 Speech Signal Analysis 1 Overview Speech Signal Analysis for
More informationQuantification of glottal and voiced speech harmonicsto-noise ratios using cepstral-based estimation
Quantification of glottal and voiced speech harmonicsto-noise ratios using cepstral-based estimation Peter J. Murphy and Olatunji O. Akande, Department of Electronic and Computer Engineering University
More informationENHANCED ROBUSTNESS TO UNVOICED SPEECH AND NOISE IN THE DYPSA ALGORITHM FOR IDENTIFICATION OF GLOTTAL CLOSURE INSTANTS
ENHANCED ROBUSTNESS TO UNVOICED SPEECH AND NOISE IN THE DYPSA ALGORITHM FOR IDENTIFICATION OF GLOTTAL CLOSURE INSTANTS Hania Maqsood 1, Jon Gudnason 2, Patrick A. Naylor 2 1 Bahria Institue of Management
More informationDimension Reduction of the Modulation Spectrogram for Speaker Verification
Dimension Reduction of the Modulation Spectrogram for Speaker Verification Tomi Kinnunen Speech and Image Processing Unit Department of Computer Science University of Joensuu, Finland tkinnu@cs.joensuu.fi
More informationSinging Voice Detection. Applications of Music Processing. Singing Voice Detection. Singing Voice Detection. Singing Voice Detection
Detection Lecture usic Processing Applications of usic Processing Christian Dittmar International Audio Laboratories Erlangen christian.dittmar@audiolabs-erlangen.de Important pre-requisite for: usic segmentation
More informationEVALUATION OF PITCH ESTIMATION IN NOISY SPEECH FOR APPLICATION IN NON-INTRUSIVE SPEECH QUALITY ASSESSMENT
EVALUATION OF PITCH ESTIMATION IN NOISY SPEECH FOR APPLICATION IN NON-INTRUSIVE SPEECH QUALITY ASSESSMENT Dushyant Sharma, Patrick. A. Naylor Department of Electrical and Electronic Engineering, Imperial
More informationConverting Speaking Voice into Singing Voice
Converting Speaking Voice into Singing Voice 1 st place of the Synthesis of Singing Challenge 2007: Vocal Conversion from Speaking to Singing Voice using STRAIGHT by Takeshi Saitou et al. 1 STRAIGHT Speech
More informationSYNTHETIC SPEECH DETECTION USING TEMPORAL MODULATION FEATURE
SYNTHETIC SPEECH DETECTION USING TEMPORAL MODULATION FEATURE Zhizheng Wu 1,2, Xiong Xiao 2, Eng Siong Chng 1,2, Haizhou Li 1,2,3 1 School of Computer Engineering, Nanyang Technological University (NTU),
More informationSynthesis Algorithms and Validation
Chapter 5 Synthesis Algorithms and Validation An essential step in the study of pathological voices is re-synthesis; clear and immediate evidence of the success and accuracy of modeling efforts is provided
More informationSpeech Synthesis; Pitch Detection and Vocoders
Speech Synthesis; Pitch Detection and Vocoders Tai-Shih Chi ( 冀泰石 ) Department of Communication Engineering National Chiao Tung University May. 29, 2008 Speech Synthesis Basic components of the text-to-speech
More informationNOISE ESTIMATION IN A SINGLE CHANNEL
SPEECH ENHANCEMENT FOR CROSS-TALK INTERFERENCE by Levent M. Arslan and John H.L. Hansen Robust Speech Processing Laboratory Department of Electrical Engineering Box 99 Duke University Durham, North Carolina
More informationLinguistic Phonetics. Spectral Analysis
24.963 Linguistic Phonetics Spectral Analysis 4 4 Frequency (Hz) 1 Reading for next week: Liljencrants & Lindblom 1972. Assignment: Lip-rounding assignment, due 1/15. 2 Spectral analysis techniques There
More informationSpeech Enhancement in Presence of Noise using Spectral Subtraction and Wiener Filter
Speech Enhancement in Presence of Noise using Spectral Subtraction and Wiener Filter 1 Gupteswar Sahu, 2 D. Arun Kumar, 3 M. Bala Krishna and 4 Jami Venkata Suman Assistant Professor, Department of ECE,
More informationPreeti Rao 2 nd CompMusicWorkshop, Istanbul 2012
Preeti Rao 2 nd CompMusicWorkshop, Istanbul 2012 o Music signal characteristics o Perceptual attributes and acoustic properties o Signal representations for pitch detection o STFT o Sinusoidal model o
More information651 Analysis of LSF frame selection in voice conversion
651 Analysis of LSF frame selection in voice conversion Elina Helander 1, Jani Nurminen 2, Moncef Gabbouj 1 1 Institute of Signal Processing, Tampere University of Technology, Finland 2 Noia Technology
More informationElectronic disguised voice identification based on Mel- Frequency Cepstral Coefficient analysis
International Journal of Scientific and Research Publications, Volume 5, Issue 11, November 2015 412 Electronic disguised voice identification based on Mel- Frequency Cepstral Coefficient analysis Shalate
More informationAudio Fingerprinting using Fractional Fourier Transform
Audio Fingerprinting using Fractional Fourier Transform Swati V. Sutar 1, D. G. Bhalke 2 1 (Department of Electronics & Telecommunication, JSPM s RSCOE college of Engineering Pune, India) 2 (Department,
More informationResearch Article Jitter Estimation Algorithms for Detection of Pathological Voices
Hindawi Publishing Corporation EURASIP Journal on Advances in Signal Processing Volume 29, Article ID 567875, 9 pages doi:1.1155/29/567875 Research Article Jitter Estimation Algorithms for Detection of
More informationINTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET)
INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET) Proceedings of the 2 nd International Conference on Current Trends in Engineering and Management ICCTEM -214 ISSN
More informationChapter IV THEORY OF CELP CODING
Chapter IV THEORY OF CELP CODING CHAPTER IV THEORY OF CELP CODING 4.1 Introduction Wavefonn coders fail to produce high quality speech at bit rate lower than 16 kbps. Source coders, such as LPC vocoders,
More informationGlottal source model selection for stationary singing-voice by low-band envelope matching
Glottal source model selection for stationary singing-voice by low-band envelope matching Fernando Villavicencio Yamaha Corporation, Corporate Research & Development Center, 3 Matsunokijima, Iwata, Shizuoka,
More informationLearning to Unlearn and Relearn Speech Signal Processing using Neural Networks: current and future perspectives
Learning to Unlearn and Relearn Speech Signal Processing using Neural Networks: current and future perspectives Mathew Magimai Doss Collaborators: Vinayak Abrol, Selen Hande Kabil, Hannah Muckenhirn, Dimitri
More informationSpeech Coding using Linear Prediction
Speech Coding using Linear Prediction Jesper Kjær Nielsen Aalborg University and Bang & Olufsen jkn@es.aau.dk September 10, 2015 1 Background Speech is generated when air is pushed from the lungs through
More informationNOTICE WARNING CONCERNING COPYRIGHT RESTRICTIONS: The copyright law of the United States (title 17, U.S. Code) governs the making of photocopies or
NOTICE WARNING CONCERNING COPYRIGHT RESTRICTIONS: The copyright law of the United States (title 17, U.S. Code) governs the making of photocopies or other reproductions of copyrighted material. Any copying
More informationNCCF ACF. cepstrum coef. error signal > samples
ESTIMATION OF FUNDAMENTAL FREQUENCY IN SPEECH Petr Motl»cek 1 Abstract This paper presents an application of one method for improving fundamental frequency detection from the speech. The method is based
More informationWavelet Speech Enhancement based on the Teager Energy Operator
Wavelet Speech Enhancement based on the Teager Energy Operator Mohammed Bahoura and Jean Rouat ERMETIS, DSA, Université du Québec à Chicoutimi, Chicoutimi, Québec, G7H 2B1, Canada. Abstract We propose
More informationMODIFIED DCT BASED SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENTS
MODIFIED DCT BASED SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENTS 1 S.PRASANNA VENKATESH, 2 NITIN NARAYAN, 3 K.SAILESH BHARATHWAAJ, 4 M.P.ACTLIN JEEVA, 5 P.VIJAYALAKSHMI 1,2,3,4,5 SSN College of Engineering,
More informationAN ANALYSIS OF ITERATIVE ALGORITHM FOR ESTIMATION OF HARMONICS-TO-NOISE RATIO IN SPEECH
AN ANALYSIS OF ITERATIVE ALGORITHM FOR ESTIMATION OF HARMONICS-TO-NOISE RATIO IN SPEECH A. Stráník, R. Čmejla Department of Circuit Theory, Faculty of Electrical Engineering, CTU in Prague Abstract Acoustic
More informationQUANTILE BASED NOISE ESTIMATION FOR SPECTRAL SUBTRACTION OF SELF LEAKAGE NOISE IN ELECTROLARYNGEAL SPEECH
International Conference on Systemics, Cybernetics and Informatics, February 12 15, 2004 QUANTILE BASED NOISE ESTIMATION FOR SPECTRAL SUBTRACTION OF SELF LEAKAGE NOISE IN ELECTROLARYNGEAL SPEECH Santosh
More informationL19: Prosodic modification of speech
L19: Prosodic modification of speech Time-domain pitch synchronous overlap add (TD-PSOLA) Linear-prediction PSOLA Frequency-domain PSOLA Sinusoidal models Harmonic + noise models STRAIGHT This lecture
More informationROBUST PITCH TRACKING USING LINEAR REGRESSION OF THE PHASE
- @ Ramon E Prieto et al Robust Pitch Tracking ROUST PITCH TRACKIN USIN LINEAR RERESSION OF THE PHASE Ramon E Prieto, Sora Kim 2 Electrical Engineering Department, Stanford University, rprieto@stanfordedu
More informationEC 6501 DIGITAL COMMUNICATION UNIT - II PART A
EC 6501 DIGITAL COMMUNICATION 1.What is the need of prediction filtering? UNIT - II PART A [N/D-16] Prediction filtering is used mostly in audio signal processing and speech processing for representing
More informationPerformance analysis of voice activity detection algorithm for robust speech recognition system under different noisy environment
BABU et al: VOICE ACTIVITY DETECTION ALGORITHM FOR ROBUST SPEECH RECOGNITION SYSTEM Journal of Scientific & Industrial Research Vol. 69, July 2010, pp. 515-522 515 Performance analysis of voice activity
More informationCHAPTER 3. ACOUSTIC MEASURES OF GLOTTAL CHARACTERISTICS 39 and from periodic glottal sources (Shadle, 1985; Stevens, 1993). The ratio of the amplitude of the harmonics at 3 khz to the noise amplitude in
More informationChange Point Determination in Audio Data Using Auditory Features
INTL JOURNAL OF ELECTRONICS AND TELECOMMUNICATIONS, 0, VOL., NO., PP. 8 90 Manuscript received April, 0; revised June, 0. DOI: /eletel-0-00 Change Point Determination in Audio Data Using Auditory Features
More informationASPIRATION NOISE DURING PHONATION: SYNTHESIS, ANALYSIS, AND PITCH-SCALE MODIFICATION DARYUSH MEHTA
ASPIRATION NOISE DURING PHONATION: SYNTHESIS, ANALYSIS, AND PITCH-SCALE MODIFICATION by DARYUSH MEHTA B.S., Electrical Engineering (23) University of Florida SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING
More informationSource-Filter Theory 1
Source-Filter Theory 1 Vocal tract as sound production device Sound production by the vocal tract can be understood by analogy to a wind or brass instrument. sound generation sound shaping (or filtering)
More informationEnhanced Waveform Interpolative Coding at 4 kbps
Enhanced Waveform Interpolative Coding at 4 kbps Oded Gottesman, and Allen Gersho Signal Compression Lab. University of California, Santa Barbara E-mail: [oded, gersho]@scl.ece.ucsb.edu Signal Compression
More informationAdvanced audio analysis. Martin Gasser
Advanced audio analysis Martin Gasser Motivation Which methods are common in MIR research? How can we parameterize audio signals? Interesting dimensions of audio: Spectral/ time/melody structure, high
More informationAnalysis/synthesis coding
TSBK06 speech coding p.1/32 Analysis/synthesis coding Many speech coders are based on a principle called analysis/synthesis coding. Instead of coding a waveform, as is normally done in general audio coders
More informationSpeech Processing. Undergraduate course code: LASC10061 Postgraduate course code: LASC11065
Speech Processing Undergraduate course code: LASC10061 Postgraduate course code: LASC11065 All course materials and handouts are the same for both versions. Differences: credits (20 for UG, 10 for PG);
More informationSignal Processing for Speech Applications - Part 2-1. Signal Processing For Speech Applications - Part 2
Signal Processing for Speech Applications - Part 2-1 Signal Processing For Speech Applications - Part 2 May 14, 2013 Signal Processing for Speech Applications - Part 2-2 References Huang et al., Chapter
More informationReduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter
Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter Ching-Ta Lu, Kun-Fu Tseng 2, Chih-Tsung Chen 2 Department of Information Communication, Asia University, Taichung, Taiwan, ROC
More informationFeature Selection and Extraction of Audio Signal
Feature Selection and Extraction of Audio Signal Jasleen 1, Dawood Dilber 2 P.G. Student, Department of Electronics and Communication Engineering, Amity University, Noida, U.P, India 1 P.G. Student, Department
More informationGLOTTAL EXCITATION EXTRACTION OF VOICED SPEECH - JOINTLY PARAMETRIC AND NONPARAMETRIC APPROACHES
Clemson University TigerPrints All Dissertations Dissertations 5-2012 GLOTTAL EXCITATION EXTRACTION OF VOICED SPEECH - JOINTLY PARAMETRIC AND NONPARAMETRIC APPROACHES Yiqiao Chen Clemson University, rls_lms@yahoo.com
More informationEE482: Digital Signal Processing Applications
Professor Brendan Morris, SEB 3216, brendan.morris@unlv.edu EE482: Digital Signal Processing Applications Spring 2014 TTh 14:30-15:45 CBC C222 Lecture 12 Speech Signal Processing 14/03/25 http://www.ee.unlv.edu/~b1morris/ee482/
More informationTemporally Weighted Linear Prediction Features for Speaker Verification in Additive Noise
Temporally Weighted Linear Prediction Features for Speaker Verification in Additive Noise Rahim Saeidi 1, Jouni Pohjalainen 2, Tomi Kinnunen 1 and Paavo Alku 2 1 School of Computing, University of Eastern
More informationCHARACTERIZATION OF PATHOLOGICAL VOICE SIGNALS BASED ON CLASSICAL ACOUSTIC ANALYSIS
CHARACTERIZATION OF PATHOLOGICAL VOICE SIGNALS BASED ON CLASSICAL ACOUSTIC ANALYSIS Robert Rice Brandt 1, Benedito Guimarães Aguiar Neto 2, Raimundo Carlos Silvério Freire 3, Joseana Macedo Fechine 4,
More informationYOUR WAVELET BASED PITCH DETECTION AND VOICED/UNVOICED DECISION
American Journal of Engineering and Technology Research Vol. 3, No., 03 YOUR WAVELET BASED PITCH DETECTION AND VOICED/UNVOICED DECISION Yinan Kong Department of Electronic Engineering, Macquarie University
More informationHungarian Speech Synthesis Using a Phase Exact HNM Approach
Hungarian Speech Synthesis Using a Phase Exact HNM Approach Kornél Kovács 1, András Kocsor 2, and László Tóth 3 Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University
More informationVocoder (LPC) Analysis by Variation of Input Parameters and Signals
ISCA Journal of Engineering Sciences ISCA J. Engineering Sci. Vocoder (LPC) Analysis by Variation of Input Parameters and Signals Abstract Gupta Rajani, Mehta Alok K. and Tiwari Vebhav Truba College of
More informationSpeech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm
International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm A.T. Rajamanickam, N.P.Subiramaniyam, A.Balamurugan*,
More information