Exploring Modulation Spectrum Features for Speech-Based Depression Level Classification
INTERSPEECH 2014

Exploring Modulation Spectrum Features for Speech-Based Depression Level Classification

Elif Bozkurt 1, Orith Toledo-Ronen 2, Alexander Sorin 2, Ron Hoory 2
1 Multimedia, Vision and Graphics Laboratory, Koç University, Istanbul, Turkey
2 IBM Research Haifa, Haifa University Mount Carmel, Haifa, Israel
ebozkurt@ku.edu.tr, {oritht, sorin, hoory}@il.ibm.com

Abstract

In this paper, we propose a Modulation Spectrum-based manageable feature set for the detection of depressed speech. The Modulation Spectrum (MS) is obtained from the conventional speech spectrogram by spectral analysis along the temporal trajectories of the acoustic frequency bins. While the MS representation of speech provides rich, high-dimensional joint frequency information, extracting discriminative features from it remains an open question. We propose a lower-dimensional representation, which first employs a Mel-frequency filterbank in the acoustic frequency domain and a Discrete Cosine Transform in the modulation frequency domain, and then applies feature selection in both domains. We compare and fuse the proposed feature set with other complementary prosodic and spectral features at the feature and decision levels. In our experiments, we use Support Vector Machines to discriminate depressed speech in a speaker-independent fashion. Feature-level fusion of the proposed MS-based features with other prosodic and spectral features after dimension reduction provides up to ~9% improvement over the baseline results and also correlates the most with clinical ratings of patients' depression level.

Index Terms: depression assessment, modulation spectrum, prosody, feature fusion, decision fusion

1. Introduction

Characterization of the emotional expression of speech and its relation to the overall state of the speaker is a challenging task, yet one that would provide new avenues for health care technologies.
While emotions are a part of everyday communication, emotional or mood disorders such as clinical depression remain a critical public health concern [1]. A large body of research suggests that the analysis of voice patterns can lead to objective analysis tools for the characterization of depression in speech [2, 3]. One of the goals of this research is to find objectively measurable speech features that can distinguish the speaking patterns of individuals with a diagnosis of clinical depression on a speaker-independent basis. We particularly focus on Modulation Spectrum (MS) features, which capture long-term dynamic characteristics of the speech signal [4, 5, 6]. In its original definition, the MS is a high-dimensional representation. We employ a Mel filterbank and a Discrete Cosine Transformation (DCT) to obtain lower-dimensional representations in the acoustic and modulation frequency domains, respectively. We hypothesize that energy modulations in particular frequency ranges may be more discriminative for depressive speech recognition, and we experimentally select a joint subset of Mel and DCT bins for better performance. As a secondary goal, we wish to explore how the MS-based feature set compares with commonly used prosodic and spectral features, and whether these features have fusion potential at the feature and decision levels. Experiments for the two-class depression classification problem (depressed vs. non-depressed) are performed using Support Vector Machine (SVM) classifiers in speaker-independent configurations on the free speech recordings of the Mundt dataset referred to in [3].

1.1. Related Work

The perceptual qualities of depression in voice have been most commonly studied with regard to prosodic and vocal tract perturbations [7, 8, 9, 10]. Studies have shown that the second formant location is most affected by depressive speech: patients with major depressive disorder had decreased second formant (F2) measurements [7].
Energy variability has been shown to decrease with increasing levels of depression [10]. Speech rate as combined phone-duration measures [12], statistics of pitch and energy features [13], and voice quality measures [14] have also been useful for detecting depression symptoms. Spectral features such as Mel frequency Cepstral Coefficients (MFCCs), power spectral density, and spectral tilt also potentially carry useful information for the classification of depression [8, 15, 16, 17]. In addition, glottal measures have been analyzed [17, 18]. More recently, Cummins et al. [11] investigated the effects of depression on speech by analyzing MS features on the trisyllabic sequence PATAKA recordings of the Mundt dataset [3] used in our study. The authors apply log mean subtraction along each acoustic frequency during MS feature extraction and report 66.9% weighted accuracy using 10-fold CV for the two-class depression recognition problem. In a later study, the authors investigate the covariance structure of a Gaussian Mixture Model (GMM) to capture depression-related information [19] on Grandfather read-speech passages of the same dataset. The best classification result for the two-class depression recognition problem is 68.6%, obtained when the variance and weight parameters of the Gaussians are updated during adaptation. Sturim et al. also investigate the free speech recordings of the same dataset in a leave-one-recording-out fashion. They focus on depression severity as the class distinction and apply joint factor analysis with Wiener filtering to model speaker and channel variation [20]. The authors test their system using MFCCs and shifted delta cepstral features modeled with GMMs. The proposed system yields a 20-30% equal-error-rate gain for the two-class depression classification task. The rest of the paper is organized as follows. In Section 2, we summarize the MS feature extraction steps. In Section 3, we set up the depression classification problem.
In Section 4, we describe our results with the baseline and proposed features. Finally, in Section 5, we provide conclusions and directions for future work.

Copyright 2014 ISCA, September 2014, Singapore
2. Modulation Spectrum

Modulation spectral analysis aims to capture long-term dynamics within an acoustic signal, typically as a two-dimensional joint acoustic frequency and modulation frequency representation [4, 5, 6]. Acoustic frequency is the frequency variable of the conventional spectrogram derived from the short-term Fourier transform (STFT), whereas modulation frequency captures time-varying information through temporal modulation of the signal. The computation of the joint acoustic-modulation frequency spectrum is carried out in two phases. First, the speech spectrogram is computed using an N_A-point FFT (Fast Fourier Transform) on each pre-emphasized, Hamming-windowed overlapping frame. Let S[n, k] denote the STFT of the speech signal as a function of frame index n and acoustic frequency index k (0 <= k <= N_A/2). The modulation spectrum is derived from the analysis of the magnitude spectrogram, |S[n, k]|, within longer-duration windows (of length M frames) with some overlap. Each window corresponds to a two-dimensional time-frequency context; e.g., the context starting at frame n_0 and having a length of M frames consists of all frequency bands within the time interval [n_0, n_0+M-1]. The temporal trajectory of the k-th frequency band within a time-frequency context is denoted as

T(n_0, M, k) = ( |S(n_0, k)|, |S(n_0+1, k)|, ..., |S(n_0+M-1, k)| )    (1)

A second, N_M-point FFT is then applied to the mean-normalized, Mel-filtered, and Hamming-windowed T(n_0, M, k) to produce the modulation spectrum MS(n_0, M, k, q), where q is the modulation frequency index (0 <= q <= N_M/2). In our setup, a standard N-component Mel filterbank is used to effectively reduce both the dimensionality of the acoustic frequency domain and the correlations between the frequency sub-bands. Additionally, a Discrete Cosine Transformation (DCT) is applied to each modulation spectrum MS(n_0, M, k, q) to reduce the modulation frequency domain dimensionality, yielding an (N_M/2+1)-dimensional vector of DCT coefficients for each acoustic bin.
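As a concrete illustration, the two-stage extraction (STFT, Mel filtering, mean normalization, second FFT along each temporal trajectory, DCT truncation) might be sketched as below. The parameter defaults follow the values reported later in the paper (N_A = 256, N_M = 128, M = 25, D = 10, N = 26, 32 ms windows with 17 ms shifts); the 8 kHz sampling rate, the 0.97 pre-emphasis coefficient, the context hop, and the filterbank construction are assumptions not stated in the text.

```python
import numpy as np
from scipy.fftpack import dct
from scipy.signal import stft, get_window

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular Mel filterbank (an assumed standard construction)."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel2hz(np.linspace(0.0, hz2mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft / 2 + 1) * pts / (sr / 2.0)).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def ms_features(x, sr, n_a=256, n_m=128, m=25, d=10, n_mel=26, hop=5):
    """Frame-level MS features: one (n_mel x d) vector per time-frequency context."""
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])          # pre-emphasis (coefficient assumed)
    win = int(0.032 * sr)                               # 32 ms analysis window
    step = int(0.017 * sr)                              # 17 ms frame shift
    _, _, S = stft(x, fs=sr, window='hamming', nperseg=win,
                   noverlap=win - step, nfft=n_a, boundary=None)
    S = np.abs(S)                                       # magnitude spectrogram |S[n, k]|
    E = mel_filterbank(n_mel, n_a, sr) @ S              # Mel filtering: (n_mel, n_frames)
    ham = get_window('hamming', m)
    feats = []
    for n0 in range(0, E.shape[1] - m + 1, hop):        # contexts of M frames
        T = E[:, n0:n0 + m]
        T = T - T.mean(axis=1, keepdims=True)           # mean normalization (DC removal)
        MS = np.abs(np.fft.rfft(T * ham, n=n_m, axis=1))  # second FFT: (n_mel, n_m//2+1)
        C = dct(MS, type=2, norm='ortho', axis=1)[:, :d]  # keep lowest D DCT coefficients
        feats.append(C.ravel())
    return np.asarray(feats)                            # (n_contexts, n_mel * d)
```

With these defaults each context yields a 26 x 10 = 260-dimensional frame-level vector, matching the dimensionality used in Section 4.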
We retain the lowest D coefficients, including the DC coefficient, which preserves the most significant signal energy. The frame-level MS features thus have N-by-D dimensionality.

3. Experimental Setup

We use the in-clinic speech recordings of the database originally collected by Mundt et al. for a depression severity study [3]. The database contains voice samples from 35 patients (20 female / 15 male, ages 20 to 68 years) undergoing depression treatment over a six-week period. The depression severity level of the participants was observed during clinical interviews at one-week intervals and evaluated using the Hamilton Rating Scale for Depression (HAMD) over the course of treatment [21]. The HAMD assessment has 17 symptom sub-topics, each with its own score. We use the total HAMD score of the individual ratings as the ground truth when defining classes in our study. Recordings with a total HAMD score greater than or equal to 17 are assigned to the depressed category, and the rest of the recordings are assigned to the non-depressed category. In this study, 257 samples of free speech recordings are labeled as non-depressed and the remaining 2 are labeled as depressed.

Speech features in our study may be considered in two main categories: prosodic and spectral. While these categories are not all-inclusive of measurable speech features, they form the basis of the feature extraction described in this work. We tested all features using their sentence-level statistics ("functionals"), consisting of maximum, minimum, variance, standard deviation, skewness, kurtosis, quartiles 1, 2 & 3, and percentiles 1.0 &

3.1. Baseline features

We use the openSMILE [22] and Praat [23] toolkits for baseline feature extraction. All the baseline features are extracted on a frame basis within windows of 25 ms with 10 ms frame shifts. Then, statistical functionals are calculated per recording from the frame-level features.
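The per-recording functionals listed above amount to eleven statistics per frame-level dimension (consistent with the 2860 = 260 x 11 functionals-level size quoted in Section 4). A sketch of their computation follows; the upper percentile value (99.0) is an assumption, since the list in the text is truncated after "percentiles 1.0 &":

```python
import numpy as np
from scipy.stats import kurtosis, skew

def functionals(frames):
    """Recording-level statistics over frame-level features.

    frames: (n_frames, n_dims) array; returns a vector of 11 * n_dims values.
    """
    q1, q2, q3 = np.percentile(frames, [25, 50, 75], axis=0)  # quartiles 1, 2, 3
    return np.concatenate([
        frames.max(axis=0),
        frames.min(axis=0),
        frames.var(axis=0),
        frames.std(axis=0),
        skew(frames, axis=0),
        kurtosis(frames, axis=0),
        q1, q2, q3,
        np.percentile(frames, 1.0, axis=0),    # lower percentile (1.0, from the text)
        np.percentile(frames, 99.0, axis=0),   # upper percentile (assumed 99.0)
    ])
```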
The prosodic-category features for this study are pitch (F0) and intensity (I), both extracted using Praat. The vocal tract is commonly quantified by the formant frequencies, which are the primary resonances determined by the vocal tract shape during speech production. In this study, the vocal tract spectral structure was quantified by the first (F1), second (F2), and third (F3) formant center frequencies and their bandwidths (BW1, BW2, BW3), extracted in Praat. The formant center frequencies and bandwidths each represent a unique feature sub-category for analysis. In addition, we extract Mel Frequency Cepstral Coefficients [0-14] (MFCCs) and Line Spectral Pairs [0-7] (LSPs) using the emobase2010 configuration of the openSMILE toolkit.

3.2. Classification setup

Speaker-independent experiments were performed in a leave-one-speaker-out cross-validation (LOSO CV) manner, using the data from each of the 35 speakers as the test set in turn and the data from the other 34 speakers as the training set. The class accuracies are computed on the overall dataset. The classification performance is then evaluated by the unweighted average recall (UAR), which is the arithmetic average of the individual class accuracies. In addition to the UAR, we also provide the recall rate on the two classes (depressed, non-depressed) for more insight. We use the LibSVM [24] implementation of Support Vector Machines and employ the linear kernel in all experiments, with feature scale normalization and class weights of [0.45, 0.55] for the non-depressed and depressed categories, respectively.

4. Experimental Results

4.1. Baseline results

We first present baseline results with well-known speech acoustic features of two categories: prosody and spectral features. In Table 1, we present a comparison of several standard feature sets: the upper part of the table shows the classification performance of the individual prosody features, and the lower part that of the spectral features.
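The evaluation protocol just described (leave-one-speaker-out CV, a linear SVM with class weights [0.45, 0.55], UAR scoring) can be sketched as below. This is a scikit-learn approximation of the LibSVM setup; the use of a standard scaler for "feature scale normalization" is an assumption.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def loso_uar(X, y, speakers):
    """Leave-one-speaker-out CV; UAR = unweighted average of per-class recalls.

    y: 0 = non-depressed, 1 = depressed; class weights follow the paper.
    """
    preds = np.empty_like(y)
    for train, test in LeaveOneGroupOut().split(X, y, groups=speakers):
        clf = make_pipeline(
            StandardScaler(),
            SVC(kernel='linear', class_weight={0: 0.45, 1: 0.55}))
        clf.fit(X[train], y[train])
        preds[test] = clf.predict(X[test])
    # Pool all held-out predictions, then average the per-class recalls.
    return recall_score(y, preds, average='macro')
```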
Among the prosody features, intensity (I) is the most discriminative, whereas F0 has a lower classification rate due to its speaker dependency. Among the spectral features, the MFCCs are discriminative, but are also very characteristic of the speaker. We can see that MS performs the best, but is very close to the other standard spectral feature sets. Additionally, the random classification accuracy is calculated as %.
Table 1. Baseline classification rates with prosodic and spectral feature sets (F0, I, F1, F2, F3, BW1, BW2, BW3, MFCC, LSP, MS).

4.2. Modulation spectrum features: parameter setting

Modulation spectrum (MS) features are a joint acoustic and modulation frequency representation of speech signals, obtained by simultaneous spectral analysis of all frequency bins. Thus, the frame shift and the time-frequency context length (M) are two crucial parameters: the frame shift determines the sampling rate of the modulation frequency domain, while M controls the resolution of the MS. We extract the STFT of the speech signals within windows of length 32 ms with frame shifts of 17 ms. We tested several values for M (10, 20, 25, and 30) and selected M = 25 (corresponding to an analysis window of length 425 ms), which gave the best performance, to create a valid baseline for the MS features. Additionally, we apply mean normalization of the frequency bins (DC removal) prior to Mel filtering. However, variance normalization following mean normalization does not improve results. Moreover, log compression of the STFT outputs or the MS components does not increase recognition rates. We apply an N_A = 256 point FFT for the calculation of the STFT components and an N_M = 128 point FFT for the calculation of the MS components. Thus, the original feature vector size for frame-level MS features is (N_A/2+1) x (N_M/2+1) = 129 x 65 = 8385. For feature dimension reduction, we apply a Mel filterbank with N = 26 components in the acoustic frequency domain and a DCT in the modulation frequency domain. We retain the first D = 10 components of the DCTs, which results in a feature vector size of 2860 at the functionals level.

4.3. Modulation spectrum features: bin selection

Our manageable feature set is a subset of the Mel and DCT bins of the Modulation Spectrum representation of the speech signal. In Figure 1, we see the classification performance of several selections of the Mel bins as a function of the number of DCT coefficients (always starting from coefficient 1).
As we can see, the best result is achieved by taking the Mel bins in the middle range [6-19], corresponding to a frequency range of 668 to 2000 Hz, with an increasing gain as the number of DCT coefficients is reduced down to 1. In Table 3, we summarize the best result of the MS feature selection, using only the first DCT coefficient and Mel bins in the range 6-19, in comparison with the original MS features without selection. We denote this selected set of features as MS_sel. We can see a dramatic improvement in the accuracy of the depressed class, with some degradation on the non-depressed class and an overall UAR improvement.

Figure 1: UAR classification performance of the MS features with Mel band selection for a varying number of DCT components.

Table 3. Classification performance of the MS selected features (Mel bins 6-19 and 1st DCT) in comparison with the original MS features with no selection (MS vs. MS_sel).

4.4. Feature fusion results

We present the performance of several combinations of formant and prosody features in Table 4. We start by fusing the three formant frequencies (F123) and the three bandwidths (BW123), with moderate performance. Next, we add the intensity (I) features to F123 and get a ~4% improvement. Adding BW123 gives only a marginal gain, and adding F0 degrades the performance. Finally, the combination of the three top-performing individual features from Table 1 (I, F2, and BW3) gives the best fusion performance, at % UAR.

Table 4. Classification performance for feature-level fusion of prosody and formant features (F123, BW123, I+F123, I+F123+BW123, F0+I+F123+BW123, I+F2+BW3).

Next, in the upper part of Table 5, we show the results of fusing the MS_sel features with other features. We can see that none of the feature combinations gives any gain beyond the performance of the MS_sel features, so our next step was to apply PCA for dimension reduction.
Our experimentation with PCA on the MS_sel feature set did not yield any performance improvement, but for other features (e.g., MFCC) some gain was achieved, probably due to redundancies in the feature representation. In the lower part of Table 5, we show the results of
combining the complete MS_sel feature set with a second feature set reduced by PCA.

Table 5. Classification performance of fusing the MS_sel features with other feature sets (MFCC, LSP, I, I+F2+BW3, I+F2+BW3+MFCC, I+F2+BW3+LSP), before and after applying PCA on the 2nd feature set.

To better understand the behavior of the PCA dimension reduction on the second feature set in fusion with the MS_sel features, we show in Figure 2 the performance of MS_sel fusion with four other feature sets (MS+MFCC, MS+LSP, MS+I, MS+(I+F2+BW3+MFCC)) as a function of the number of PCA components of the second feature set. The horizontal dotted line is the MS_sel baseline performance. We can see that by selecting a few principal components from the second feature set and fusing them with the MS_sel features, we are able to improve the classification performance, especially with the feature set I+F2+BW3+MFCC.

Figure 2. Classification performance of PCA on the second feature set in fusion with the MS_sel features.

4.5. Correlation results

All the classification experiments in the previous sections were performed in a commonly used two-class setup of depressed vs. non-depressed classification, based on setting a threshold on the total clinical HAMD score. In such a setup, the classes are very broad. To avoid this sensitivity of the results, we measured the correlation between the classification result and the clinical total HAMD score. Since the HAMD score is measured on an ordinal scale and its relationship to the classification result is monotonic, we used the Spearman rank correlation. The correlation coefficients, along with their two-tailed p-values, are shown in Table 6 for several feature sets, for feature fusion, and for some decision fusion experiments. In feature fusion, the features are combined and one classification experiment is performed, on which the correlation is measured.
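The two fusion schemes used here, and the Spearman check against the clinical scores, can be sketched as below. Function and variable names are illustrative; the decision-averaging step follows the decision-fusion description in this section.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.decomposition import PCA

def fuse_with_pca(ms_train, other_train, ms_test, other_test, n_components):
    """Feature-level fusion: keep the MS_sel set intact, reduce the second
    feature set with PCA (fit on training data only), then concatenate."""
    pca = PCA(n_components=n_components).fit(other_train)
    X_train = np.hstack([ms_train, pca.transform(other_train)])
    X_test = np.hstack([ms_test, pca.transform(other_test)])
    return X_train, X_test

def decision_fusion_corr(decision_lists, hamd_scores):
    """Decision fusion: average the per-recording decisions of separately
    trained classifiers, then correlate the fused decision with the total
    HAMD score (Spearman, since HAMD is ordinal)."""
    fused = np.mean(decision_lists, axis=0)
    rho, p = spearmanr(fused, hamd_scores)  # two-sided p-value
    return rho, p
```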
In decision fusion, two or more classification experiments are performed, each with a different feature set; the classification results are averaged, and the correlation is measured on the fused decision. As we can see, fusion at the feature level is more powerful than fusion at the decision level.

Table 6. Spearman correlation between the classification result and the clinical total HAMD score for several feature sets, feature fusion, and decision fusion. Feature sets: MFCC; LSP; I+F2+BW3; MS_sel. Feature fusion: MS_sel+I+F2+BW3; MS_sel+(I+F2+BW3_PCA3); MS_sel+(I+F2+BW3+MFCC); MS_sel+(I+F2+BW3+MFCC_PCA5). Decision fusion: MS_sel+MFCC; MS_sel+LSP; MS_sel+LSP+MFCC; MS_sel+(I+F2+BW3); MS_sel+(I+F2+BW3_PCA3); MS_sel+(I+F2+BW3)+MFCC; MS_sel+(I+F2+BW3)+LSP.

5. Conclusions

Our results clearly suggest that the proposed modulation spectrum-based manageable feature set improves the overall discrimination of depressed speech from non-depressed speech. The selected joint subset of Mel and DCT bins in the MS brings a ~7% UAR improvement over the conventional MS feature set performance. Feature fusion of this feature set with formant, intensity, and MFCC features further advances recognition rates, up to % UAR, when PCA dimension reduction is applied to the second feature set. Correlation results also indicate that our feature fusion classification results are more correlated with clinical rating scores than decision fusion of the same feature sets. Future research will involve analysis of other datasets and improvements to the feature selection strategy, so that an objective analysis tool may be designed for clinical practice.

6. Acknowledgements

The authors would like to thank Dr. James C. Mundt for providing the dataset, which was collected under U.S. National Institute of Mental Health Grant R43MH. This work is supported by the Dem@Care FP7 project, partially funded by the EC under contract number
7. References

[1] Greenberg, P. E., Stiglin, L. E., Finkelstein, S. D. and Berndt, E. R., "Depression: A neglected major illness", Journal of Clinical Psychiatry, vol. 54.
[2] Darby, J. K., "Speech and voice parameters of depression: A pilot study", J. Commun. Disord., vol. 17.
[3] Mundt, J. C., Snyder, P. J., Cannizzaro, M. S., Chappie, K. and Geralts, D. S., "Voice acoustic measures of depression severity and treatment response collected via interactive voice response (IVR) technology", Journal of Neurolinguistics, vol. 20.
[4] Ivanov, A. and Chen, X., "Modulation Spectrum Analysis for Speaker Personality Trait Recognition", INTERSPEECH, ISCA.
[5] Markaki, M., Stylianou, Y., Arias-Londoño, J. D. and Godino-Llorente, J. I., "Dysphonia detection based on modulation spectral features and cepstral coefficients", in Proc. ICASSP 2010, pp. 5162-5165.
[6] Wu, S., Falk, T. H. and Chan, W. Y., "Automatic speech emotion recognition using modulation spectral features", Speech Communication, vol. 53 (5).
[7] Flint, A. J., et al., "Abnormal speech articulation, psychomotor retardation, and subcortical dysfunction in major depression", Journal of Psychiatric Research, vol. 27 (3).
[8] France, D. J., Shiavi, R. G., Silverman, S., Silverman, M. and Wilkes, M., "Acoustical properties of speech as indicators of depression and suicidal risk", IEEE Transactions on Biomedical Engineering, vol. 47.
[9] Ozdas, A., Shiavi, R. G., Silverman, S. E., Silverman, M. K. and Wilkes, D. M., "Investigation of vocal jitter and glottal flow spectrum as possible cues for depression and near-term suicidal risk", IEEE Transactions on Biomedical Engineering, vol. 51.
[10] Quatieri, T. F. and Malyska, N., "Vocal-Source Biomarkers for Depression: A Link to Psychomotor Activity", in INTERSPEECH 2012, Portland, USA.
[11] Cummins, N., Epps, J. and Ambikairajah, E., "Spectro-Temporal Analysis of Speech Affected by Depression and Psychomotor Retardation", in Proc. ICASSP 2013.
[12] Trevino, A., Quatieri, T. and Malyska, N., "Phonologically-based biomarkers for major depressive disorder", EURASIP Journal on Advances in Signal Processing, pp. 1-18.
[13] Sanchez, M. H., Vergyri, D., Ferrer, L., Richey, C., Garcia, P., Knoth, B. and Jarrold, W., "Using Prosodic and Spectral Features in Detecting Depression in Elderly Males", INTERSPEECH.
[14] Scherer, S., Stratou, G., Gratch, J. and Morency, L.-P., "Investigating voice quality as a speaker-independent indicator of depression and PTSD", INTERSPEECH 2013, ISCA.
[15] Low, L. S. A., Maddage, N. C., et al., "Mel frequency cepstral feature and Gaussian Mixtures for modeling clinical depression in adolescents", in Proc. IEEE Int. Conf. on Cognitive Informatics, 2009.
[16] Yingthawornsuk, T., Keskinpala, H. K., Wilkes, D. M., Shiavi, R. G. and Salomon, R. M., "Direct Acoustic Feature Using Iterative EM Algorithm and Spectral Energy for Classifying Suicidal Risk", INTERSPEECH 2007, Antwerp, Belgium.
[17] Moore, E., Clements, M. A., Peifer, J. W. and Weisser, L., "Critical Analysis of the Impact of Glottal Features in the Classification of Clinical Depression in Speech", IEEE Trans. Biomed. Engineering, vol. 55 (1).
[18] Moore, E., Clements, M., Peifer, J. and Weisser, L., "Comparing objective feature statistics of speech for classifying clinical depression", IEEE 26th Annual International Conference of the Engineering in Medicine and Biology Society, pp. 17-20, 2004.
[19] Cummins, N., Epps, J., Sethu, V., Breakspear, M. and Goecke, R., "Modeling spectral variability for the classification of depressed speech", INTERSPEECH, ISCA.
[20] Sturim, D., Torres-Carrasquillo, P. A., Quatieri, T. F., Malyska, N. and McCree, A., "Automatic Detection of Depression in Speech Using Gaussian Mixture Modeling with Factor Analysis", INTERSPEECH.
[21] Hamilton, M., "A rating scale for depression", J. Neurol. Neurosurg. Psychiat., vol. 23.
[22] Eyben, F., Wöllmer, M. and Schuller, B., "openSMILE - The Munich Versatile and Fast Open-Source Audio Feature Extractor", in Proc. ACM Multimedia (MM), ACM, Firenze, Italy.
[23] Boersma, P. and Weenink, D. (2013), "Praat: doing phonetics by computer" [Computer program].
[24] Chang, C.-C. and Lin, C.-J., "LIBSVM: a library for support vector machines" [Software].
More informationAN ANALYSIS OF SPEECH RECOGNITION PERFORMANCE BASED UPON NETWORK LAYERS AND TRANSFER FUNCTIONS
AN ANALYSIS OF SPEECH RECOGNITION PERFORMANCE BASED UPON NETWORK LAYERS AND TRANSFER FUNCTIONS Kuldeep Kumar 1, R. K. Aggarwal 1 and Ankita Jain 2 1 Department of Computer Engineering, National Institute
More informationGammatone Cepstral Coefficient for Speaker Identification
Gammatone Cepstral Coefficient for Speaker Identification Rahana Fathima 1, Raseena P E 2 M. Tech Student, Ilahia college of Engineering and Technology, Muvattupuzha, Kerala, India 1 Asst. Professor, Ilahia
More informationMFCC AND GMM BASED TAMIL LANGUAGE SPEAKER IDENTIFICATION SYSTEM
www.advancejournals.org Open Access Scientific Publisher MFCC AND GMM BASED TAMIL LANGUAGE SPEAKER IDENTIFICATION SYSTEM ABSTRACT- P. Santhiya 1, T. Jayasankar 1 1 AUT (BIT campus), Tiruchirappalli, India
More informationA multi-class method for detecting audio events in news broadcasts
A multi-class method for detecting audio events in news broadcasts Sergios Petridis, Theodoros Giannakopoulos, and Stavros Perantonis Computational Intelligence Laboratory, Institute of Informatics and
More informationCampus Location Recognition using Audio Signals
1 Campus Location Recognition using Audio Signals James Sun,Reid Westwood SUNetID:jsun2015,rwestwoo Email: jsun2015@stanford.edu, rwestwoo@stanford.edu I. INTRODUCTION People use sound both consciously
More informationSynchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech
INTERSPEECH 5 Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech M. A. Tuğtekin Turan and Engin Erzin Multimedia, Vision and Graphics Laboratory,
More informationAdvanced audio analysis. Martin Gasser
Advanced audio analysis Martin Gasser Motivation Which methods are common in MIR research? How can we parameterize audio signals? Interesting dimensions of audio: Spectral/ time/melody structure, high
More informationREVERBERATION-BASED FEATURE EXTRACTION FOR ACOUSTIC SCENE CLASSIFICATION. Miloš Marković, Jürgen Geiger
REVERBERATION-BASED FEATURE EXTRACTION FOR ACOUSTIC SCENE CLASSIFICATION Miloš Marković, Jürgen Geiger Huawei Technologies Düsseldorf GmbH, European Research Center, Munich, Germany ABSTRACT 1 We present
More informationUsing RASTA in task independent TANDEM feature extraction
R E S E A R C H R E P O R T I D I A P Using RASTA in task independent TANDEM feature extraction Guillermo Aradilla a John Dines a Sunil Sivadas a b IDIAP RR 04-22 April 2004 D a l l e M o l l e I n s t
More informationCP-JKU SUBMISSIONS FOR DCASE-2016: A HYBRID APPROACH USING BINAURAL I-VECTORS AND DEEP CONVOLUTIONAL NEURAL NETWORKS
CP-JKU SUBMISSIONS FOR DCASE-2016: A HYBRID APPROACH USING BINAURAL I-VECTORS AND DEEP CONVOLUTIONAL NEURAL NETWORKS Hamid Eghbal-Zadeh Bernhard Lehner Matthias Dorfer Gerhard Widmer Department of Computational
More informationAspiration Noise during Phonation: Synthesis, Analysis, and Pitch-Scale Modification. Daryush Mehta
Aspiration Noise during Phonation: Synthesis, Analysis, and Pitch-Scale Modification Daryush Mehta SHBT 03 Research Advisor: Thomas F. Quatieri Speech and Hearing Biosciences and Technology 1 Summary Studied
More informationA CONSTRUCTION OF COMPACT MFCC-TYPE FEATURES USING SHORT-TIME STATISTICS FOR APPLICATIONS IN AUDIO SEGMENTATION
17th European Signal Processing Conference (EUSIPCO 2009) Glasgow, Scotland, August 24-28, 2009 A CONSTRUCTION OF COMPACT MFCC-TYPE FEATURES USING SHORT-TIME STATISTICS FOR APPLICATIONS IN AUDIO SEGMENTATION
More informationDigital Speech Processing and Coding
ENEE408G Spring 2006 Lecture-2 Digital Speech Processing and Coding Spring 06 Instructor: Shihab Shamma Electrical & Computer Engineering University of Maryland, College Park http://www.ece.umd.edu/class/enee408g/
More informationEpoch Extraction From Emotional Speech
Epoch Extraction From al Speech D Govind and S R M Prasanna Department of Electronics and Electrical Engineering Indian Institute of Technology Guwahati Email:{dgovind,prasanna}@iitg.ernet.in Abstract
More informationSpeech and Music Discrimination based on Signal Modulation Spectrum.
Speech and Music Discrimination based on Signal Modulation Spectrum. Pavel Balabko June 24, 1999 1 Introduction. This work is devoted to the problem of automatic speech and music discrimination. As we
More informationSeparating Voiced Segments from Music File using MFCC, ZCR and GMM
Separating Voiced Segments from Music File using MFCC, ZCR and GMM Mr. Prashant P. Zirmite 1, Mr. Mahesh K. Patil 2, Mr. Santosh P. Salgar 3,Mr. Veeresh M. Metigoudar 4 1,2,3,4Assistant Professor, Dept.
More informationNoise Reduction on the Raw Signal of Emotiv EEG Neuroheadset
Noise Reduction on the Raw Signal of Emotiv EEG Neuroheadset Raimond-Hendrik Tunnel Institute of Computer Science, University of Tartu Liivi 2 Tartu, Estonia jee7@ut.ee ABSTRACT In this paper, we describe
More informationSpeech Enhancement using Wiener filtering
Speech Enhancement using Wiener filtering S. Chirtmay and M. Tahernezhadi Department of Electrical Engineering Northern Illinois University DeKalb, IL 60115 ABSTRACT The problem of reducing the disturbing
More informationSpeech Synthesis; Pitch Detection and Vocoders
Speech Synthesis; Pitch Detection and Vocoders Tai-Shih Chi ( 冀泰石 ) Department of Communication Engineering National Chiao Tung University May. 29, 2008 Speech Synthesis Basic components of the text-to-speech
More informationAuditory Based Feature Vectors for Speech Recognition Systems
Auditory Based Feature Vectors for Speech Recognition Systems Dr. Waleed H. Abdulla Electrical & Computer Engineering Department The University of Auckland, New Zealand [w.abdulla@auckland.ac.nz] 1 Outlines
More informationROBUST PITCH TRACKING USING LINEAR REGRESSION OF THE PHASE
- @ Ramon E Prieto et al Robust Pitch Tracking ROUST PITCH TRACKIN USIN LINEAR RERESSION OF THE PHASE Ramon E Prieto, Sora Kim 2 Electrical Engineering Department, Stanford University, rprieto@stanfordedu
More informationIdentification of disguised voices using feature extraction and classification
Identification of disguised voices using feature extraction and classification Lini T Lal, Avani Nath N.J, Dept. of Electronics and Communication, TKMIT, Kollam, Kerala, India linithyvila23@gmail.com,
More informationSOUND SOURCE RECOGNITION AND MODELING
SOUND SOURCE RECOGNITION AND MODELING CASA seminar, summer 2000 Antti Eronen antti.eronen@tut.fi Contents: Basics of human sound source recognition Timbre Voice recognition Recognition of environmental
More informationProject 0: Part 2 A second hands-on lab on Speech Processing Frequency-domain processing
Project : Part 2 A second hands-on lab on Speech Processing Frequency-domain processing February 24, 217 During this lab, you will have a first contact on frequency domain analysis of speech signals. You
More informationSignal Processing for Speech Applications - Part 2-1. Signal Processing For Speech Applications - Part 2
Signal Processing for Speech Applications - Part 2-1 Signal Processing For Speech Applications - Part 2 May 14, 2013 Signal Processing for Speech Applications - Part 2-2 References Huang et al., Chapter
More informationAugmenting Short-term Cepstral Features with Long-term Discriminative Features for Speaker Verification of Telephone Data
INTERSPEECH 2013 Augmenting Short-term Cepstral Features with Long-term Discriminative Features for Speaker Verification of Telephone Data Cong-Thanh Do 1, Claude Barras 1, Viet-Bac Le 2, Achintya K. Sarkar
More informationInternational Journal of Modern Trends in Engineering and Research e-issn No.: , Date: 2-4 July, 2015
International Journal of Modern Trends in Engineering and Research www.ijmter.com e-issn No.:2349-9745, Date: 2-4 July, 2015 Analysis of Speech Signal Using Graphic User Interface Solly Joy 1, Savitha
More informationSpeech Enhancement Using Spectral Flatness Measure Based Spectral Subtraction
IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 7, Issue, Ver. I (Mar. - Apr. 7), PP 4-46 e-issn: 9 4, p-issn No. : 9 497 www.iosrjournals.org Speech Enhancement Using Spectral Flatness Measure
More informationReading: Johnson Ch , Ch.5.5 (today); Liljencrants & Lindblom; Stevens (Tues) reminder: no class on Thursday.
L105/205 Phonetics Scarborough Handout 7 10/18/05 Reading: Johnson Ch.2.3.3-2.3.6, Ch.5.5 (today); Liljencrants & Lindblom; Stevens (Tues) reminder: no class on Thursday Spectral Analysis 1. There are
More informationSpeech Synthesis using Mel-Cepstral Coefficient Feature
Speech Synthesis using Mel-Cepstral Coefficient Feature By Lu Wang Senior Thesis in Electrical Engineering University of Illinois at Urbana-Champaign Advisor: Professor Mark Hasegawa-Johnson May 2018 Abstract
More informationThe SRI AVEC-2014 Evaluation System
The SRI AVEC-2014 Evaluation System Vikramjit Mitra vikramjit.mitra@sri.com Andreas Kathol andreas.kathol@sri.com Elizabeth Shriberg elizabeth.shriberg@sri.com Colleen Richey colleen.richey@sri.com Martin
More informationMeasuring the complexity of sound
PRAMANA c Indian Academy of Sciences Vol. 77, No. 5 journal of November 2011 physics pp. 811 816 Measuring the complexity of sound NANDINI CHATTERJEE SINGH National Brain Research Centre, NH-8, Nainwal
More informationModulation Domain Spectral Subtraction for Speech Enhancement
Modulation Domain Spectral Subtraction for Speech Enhancement Author Paliwal, Kuldip, Schwerin, Belinda, Wojcicki, Kamil Published 9 Conference Title Proceedings of Interspeech 9 Copyright Statement 9
More informationRobust Speaker Identification for Meetings: UPC CLEAR 07 Meeting Room Evaluation System
Robust Speaker Identification for Meetings: UPC CLEAR 07 Meeting Room Evaluation System Jordi Luque and Javier Hernando Technical University of Catalonia (UPC) Jordi Girona, 1-3 D5, 08034 Barcelona, Spain
More informationA Method for Voiced/Unvoiced Classification of Noisy Speech by Analyzing Time-Domain Features of Spectrogram Image
Science Journal of Circuits, Systems and Signal Processing 2017; 6(2): 11-17 http://www.sciencepublishinggroup.com/j/cssp doi: 10.11648/j.cssp.20170602.12 ISSN: 2326-9065 (Print); ISSN: 2326-9073 (Online)
More informationEffective post-processing for single-channel frequency-domain speech enhancement Weifeng Li a
R E S E A R C H R E P O R T I D I A P Effective post-processing for single-channel frequency-domain speech enhancement Weifeng Li a IDIAP RR 7-7 January 8 submitted for publication a IDIAP Research Institute,
More informationAutomatic Text-Independent. Speaker. Recognition Approaches Using Binaural Inputs
Automatic Text-Independent Speaker Recognition Approaches Using Binaural Inputs Karim Youssef, Sylvain Argentieri and Jean-Luc Zarader 1 Outline Automatic speaker recognition: introduction Designed systems
More informationChapter 4 SPEECH ENHANCEMENT
44 Chapter 4 SPEECH ENHANCEMENT 4.1 INTRODUCTION: Enhancement is defined as improvement in the value or Quality of something. Speech enhancement is defined as the improvement in intelligibility and/or
More informationAn Optimization of Audio Classification and Segmentation using GASOM Algorithm
An Optimization of Audio Classification and Segmentation using GASOM Algorithm Dabbabi Karim, Cherif Adnen Research Unity of Processing and Analysis of Electrical and Energetic Systems Faculty of Sciences
More informationFeasibility of Vocal Emotion Conversion on Modulation Spectrogram for Simulated Cochlear Implants
Feasibility of Vocal Emotion Conversion on Modulation Spectrogram for Simulated Cochlear Implants Zhi Zhu, Ryota Miyauchi, Yukiko Araki, and Masashi Unoki School of Information Science, Japan Advanced
More informationDetermination of instants of significant excitation in speech using Hilbert envelope and group delay function
Determination of instants of significant excitation in speech using Hilbert envelope and group delay function by K. Sreenivasa Rao, S. R. M. Prasanna, B.Yegnanarayana in IEEE Signal Processing Letters,
More informationSPEECH ENHANCEMENT USING PITCH DETECTION APPROACH FOR NOISY ENVIRONMENT
SPEECH ENHANCEMENT USING PITCH DETECTION APPROACH FOR NOISY ENVIRONMENT RASHMI MAKHIJANI Department of CSE, G. H. R.C.E., Near CRPF Campus,Hingna Road, Nagpur, Maharashtra, India rashmi.makhijani2002@gmail.com
More informationCan binary masks improve intelligibility?
Can binary masks improve intelligibility? Mike Brookes (Imperial College London) & Mark Huckvale (University College London) Apparently so... 2 How does it work? 3 Time-frequency grid of local SNR + +
More informationSinging Voice Detection. Applications of Music Processing. Singing Voice Detection. Singing Voice Detection. Singing Voice Detection
Detection Lecture usic Processing Applications of usic Processing Christian Dittmar International Audio Laboratories Erlangen christian.dittmar@audiolabs-erlangen.de Important pre-requisite for: usic segmentation
More informationAutomatic Morse Code Recognition Under Low SNR
2nd International Conference on Mechanical, Electronic, Control and Automation Engineering (MECAE 2018) Automatic Morse Code Recognition Under Low SNR Xianyu Wanga, Qi Zhaob, Cheng Mac, * and Jianping
More informationQuantification of glottal and voiced speech harmonicsto-noise ratios using cepstral-based estimation
Quantification of glottal and voiced speech harmonicsto-noise ratios using cepstral-based estimation Peter J. Murphy and Olatunji O. Akande, Department of Electronic and Computer Engineering University
More informationA New Scheme for No Reference Image Quality Assessment
Author manuscript, published in "3rd International Conference on Image Processing Theory, Tools and Applications, Istanbul : Turkey (2012)" A New Scheme for No Reference Image Quality Assessment Aladine
More informationLong Range Acoustic Classification
Approved for public release; distribution is unlimited. Long Range Acoustic Classification Authors: Ned B. Thammakhoune, Stephen W. Lang Sanders a Lockheed Martin Company P. O. Box 868 Nashua, New Hampshire
More informationAdvances in Speech Signal Processing for Voice Quality Assessment
Processing for Part II University of Crete, Computer Science Dept., Multimedia Informatics Lab yannis@csd.uoc.gr Bilbao, 2011 September 1 Multi-linear Algebra Features selection 2 Introduction Application:
More informationEC 6501 DIGITAL COMMUNICATION UNIT - II PART A
EC 6501 DIGITAL COMMUNICATION 1.What is the need of prediction filtering? UNIT - II PART A [N/D-16] Prediction filtering is used mostly in audio signal processing and speech processing for representing
More informationChange Point Determination in Audio Data Using Auditory Features
INTL JOURNAL OF ELECTRONICS AND TELECOMMUNICATIONS, 0, VOL., NO., PP. 8 90 Manuscript received April, 0; revised June, 0. DOI: /eletel-0-00 Change Point Determination in Audio Data Using Auditory Features
More informationInternational Journal of Engineering and Techniques - Volume 1 Issue 6, Nov Dec 2015
RESEARCH ARTICLE OPEN ACCESS A Comparative Study on Feature Extraction Technique for Isolated Word Speech Recognition Easwari.N 1, Ponmuthuramalingam.P 2 1,2 (PG & Research Department of Computer Science,
More informationIsolated Word Recognition Based on Combination of Multiple Noise-Robust Techniques
Isolated Word Recognition Based on Combination of Multiple Noise-Robust Techniques 81 Isolated Word Recognition Based on Combination of Multiple Noise-Robust Techniques Noboru Hayasaka 1, Non-member ABSTRACT
More informationSONG RETRIEVAL SYSTEM USING HIDDEN MARKOV MODELS
SONG RETRIEVAL SYSTEM USING HIDDEN MARKOV MODELS AKSHAY CHANDRASHEKARAN ANOOP RAMAKRISHNA akshayc@cmu.edu anoopr@andrew.cmu.edu ABHISHEK JAIN GE YANG ajain2@andrew.cmu.edu younger@cmu.edu NIDHI KOHLI R
More informationThe Effects of Noise on Acoustic Parameters
The Effects of Noise on Acoustic Parameters * 1 Turgut Özseven and 2 Muharrem Düğenci 1 Turhal Vocational School, Gaziosmanpaşa University, Turkey * 2 Faculty of Engineering, Department of Industrial Engineering
More informationAudio Fingerprinting using Fractional Fourier Transform
Audio Fingerprinting using Fractional Fourier Transform Swati V. Sutar 1, D. G. Bhalke 2 1 (Department of Electronics & Telecommunication, JSPM s RSCOE college of Engineering Pune, India) 2 (Department,
More informationSound Recognition. ~ CSE 352 Team 3 ~ Jason Park Evan Glover. Kevin Lui Aman Rawat. Prof. Anita Wasilewska
Sound Recognition ~ CSE 352 Team 3 ~ Jason Park Evan Glover Kevin Lui Aman Rawat Prof. Anita Wasilewska What is Sound? Sound is a vibration that propagates as a typically audible mechanical wave of pressure
More informationAn Efficient Extraction of Vocal Portion from Music Accompaniment Using Trend Estimation
An Efficient Extraction of Vocal Portion from Music Accompaniment Using Trend Estimation Aisvarya V 1, Suganthy M 2 PG Student [Comm. Systems], Dept. of ECE, Sree Sastha Institute of Engg. & Tech., Chennai,
More informationMonitoring Infant s Emotional Cry in Domestic Environments using the Capsule Network Architecture
Interspeech 2018 2-6 September 2018, Hyderabad Monitoring Infant s Emotional Cry in Domestic Environments using the Capsule Network Architecture M. A. Tuğtekin Turan and Engin Erzin Multimedia, Vision
More informationRobust Voice Activity Detection Based on Discrete Wavelet. Transform
Robust Voice Activity Detection Based on Discrete Wavelet Transform Kun-Ching Wang Department of Information Technology & Communication Shin Chien University kunching@mail.kh.usc.edu.tw Abstract This paper
More informationCombining Voice Activity Detection Algorithms by Decision Fusion
Combining Voice Activity Detection Algorithms by Decision Fusion Evgeny Karpov, Zaur Nasibov, Tomi Kinnunen, Pasi Fränti Speech and Image Processing Unit, University of Eastern Finland, Joensuu, Finland
More informationReduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter
Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter Ching-Ta Lu, Kun-Fu Tseng 2, Chih-Tsung Chen 2 Department of Information Communication, Asia University, Taichung, Taiwan, ROC
More informationspeech signal S(n). This involves a transformation of S(n) into another signal or a set of signals
16 3. SPEECH ANALYSIS 3.1 INTRODUCTION TO SPEECH ANALYSIS Many speech processing [22] applications exploits speech production and perception to accomplish speech analysis. By speech analysis we extract
More informationADAPTIVE NOISE LEVEL ESTIMATION
Proc. of the 9 th Int. Conference on Digital Audio Effects (DAFx-6), Montreal, Canada, September 18-2, 26 ADAPTIVE NOISE LEVEL ESTIMATION Chunghsin Yeh Analysis/Synthesis team IRCAM/CNRS-STMS, Paris, France
More informationA Correlation-Maximization Denoising Filter Used as An Enhancement Frontend for Noise Robust Bird Call Classification
A Correlation-Maximization Denoising Filter Used as An Enhancement Frontend for Noise Robust Bird Call Classification Wei Chu and Abeer Alwan Speech Processing and Auditory Perception Laboratory Department
More informationAn Improved Voice Activity Detection Based on Deep Belief Networks
e-issn 2455 1392 Volume 2 Issue 4, April 2016 pp. 676-683 Scientific Journal Impact Factor : 3.468 http://www.ijcter.com An Improved Voice Activity Detection Based on Deep Belief Networks Shabeeba T. K.
More informationModulation Spectrum Power-law Expansion for Robust Speech Recognition
Modulation Spectrum Power-law Expansion for Robust Speech Recognition Hao-Teng Fan, Zi-Hao Ye and Jeih-weih Hung Department of Electrical Engineering, National Chi Nan University, Nantou, Taiwan E-mail:
More informationNCCF ACF. cepstrum coef. error signal > samples
ESTIMATION OF FUNDAMENTAL FREQUENCY IN SPEECH Petr Motl»cek 1 Abstract This paper presents an application of one method for improving fundamental frequency detection from the speech. The method is based
More informationSINOLA: A New Analysis/Synthesis Method using Spectrum Peak Shape Distortion, Phase and Reassigned Spectrum
SINOLA: A New Analysis/Synthesis Method using Spectrum Peak Shape Distortion, Phase Reassigned Spectrum Geoffroy Peeters, Xavier Rodet Ircam - Centre Georges-Pompidou Analysis/Synthesis Team, 1, pl. Igor
More information