Advances in Speech Signal Processing for Voice Quality Assessment

1 Advances in Speech Signal Processing for Voice Quality Assessment, Part II
University of Crete, Computer Science Dept., Multimedia Informatics Lab
Bilbao, September 2011

2 1 Multi-linear Algebra Features selection 2 Introduction Application: Vocal Fatigue 3 4

3 Multi-linear Algebra Features selection

4 In equations. First step: STFT

$$X_m(k) = \sum_{n=-\infty}^{\infty} h(mM - n)\, x(n)\, W_{I_1}^{kn}, \qquad k = 0, \ldots, I_1 - 1,$$

where $I_1$ denotes the number of frequency bins along the acoustic frequency axis, $W_{I_1} = \exp(-j\pi / I_1)$, $M$ is the shift parameter (or hop size) in the computation of the STFT, and $h(n)$ is the acoustic frequency analysis window.
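A minimal NumPy sketch of this first step, assuming a Hamming window for h(n) and illustrative values for the frame length, the hop size M, and the number of FFT bins (none of these come from the slides). It computes a framewise FFT, whose magnitudes match the definition above; the phases differ by a frame-dependent linear term, which does not matter once envelopes are taken.

```python
import numpy as np

def stft(x, frame_len=256, hop=64, n_bins=256):
    """Framewise FFT of x with a Hamming analysis window h(n).

    frame_len, hop (the shift M) and n_bins are illustrative values only.
    """
    h = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[m * hop : m * hop + frame_len] * h for m in range(n_frames)])
    # One-sided FFT: rows are frames m, columns are acoustic frequency bins k.
    return np.fft.rfft(frames, n=n_bins, axis=1)

fs = 8000                                 # assumed sampling rate for the toy signal
t = np.arange(fs) / fs
x = np.cos(2 * np.pi * 1000 * t)          # 1 kHz test tone
X = stft(x)
peak_bin = int(np.argmax(np.abs(X).mean(axis=0)))
peak_hz = peak_bin * fs / 256             # bin index converted back to Hz
```

The test tone is only there to check that energy lands in the expected acoustic frequency bin.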

5 In equations. Second step: modulation frequencies estimation of the subband envelopes

$$X_l(k, i) = \sum_{m=-\infty}^{\infty} g(lL - m)\, |X_m(k)|\, W_{I_2}^{im}, \qquad i = 0, \ldots, I_2 - 1,$$

where $I_2$ is the number of frequency bins along the modulation frequency axis, $W_{I_2} = \exp\left(-j (f_M / F_s)\, \pi / I_2\right)$, $f_M$ and $F_s$ denote the maximum modulation frequency we search for and the sampling frequency, respectively, $L$ is the shift parameter of the second STFT, and $g(m)$ is the modulation frequency analysis window.
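The second step can be sketched as another windowed FFT, this time along the frame index of each subband envelope. The Hann window, segment length, hop, and frame rate below are assumptions of this sketch, and it analyses modulation frequencies up to half the frame rate rather than restricting the axis to a chosen maximum f_M.

```python
import numpy as np

def modulation_spectrum(env, frame_len=64, hop=16, n_mod_bins=64):
    """Second STFT along time for every acoustic band.

    env: array (n_frames, n_bands) holding the subband envelopes |X_m(k)|.
    Returns an array (n_segments, n_bands, n_mod_bins // 2 + 1) of magnitudes.
    """
    g = np.hanning(frame_len)             # modulation analysis window (assumed shape)
    n_seg = 1 + (env.shape[0] - frame_len) // hop
    out = []
    for l in range(n_seg):
        seg = env[l * hop : l * hop + frame_len, :]
        seg = seg - seg.mean(axis=0)      # drop each envelope's DC (a choice of this sketch)
        out.append(np.fft.rfft(seg * g[:, None], n=n_mod_bins, axis=0).T)
    return np.abs(np.stack(out))

frame_rate = 100.0                        # envelope samples per second (assumed)
m = np.arange(256) / frame_rate
env = np.ones((256, 4))                   # four toy acoustic bands
env[:, 2] += 0.5 * np.cos(2 * np.pi * 12.5 * m)   # band 2 carries a 12.5 Hz modulation
ms = modulation_spectrum(env)
band, mod_bin = np.unravel_index(np.argmax(ms.mean(axis=0)), ms.shape[1:])
mod_hz = mod_bin * frame_rate / 64
```

In the toy input only one band is modulated, so the peak of the averaged modulation spectrum identifies both the band and the modulation frequency.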

6 Example I: one speaker (left), mean of speakers (right)
[Figure. Axes: Frequency (kHz), modulation frequency (Hz), Energy. Side panels: Pitch energy; # Speakers vs. Pitch (Hz).]

7 Example II: polyps (left), spasmodic dysphonia (right)
[Figure. Axes: Frequency (kHz), modulation frequency (Hz), Energy. Side panels: Pitch energy.]

8 Example III: keratosis (left), nodules (right)
[Figure. Axes: Frequency (kHz), modulation frequency (Hz), Energy. Side panels: Pitch energy.]

9 Dimensionality Reduction, HO-SVD
1. Create the data tensor: $\mathcal{D} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$.
2. Decompose $\mathcal{D}$ into its n-mode singular vectors: $\mathcal{D} = \mathcal{S} \times_1 U_{af} \times_2 U_{mf} \times_3 U_{samples}$, where $\mathcal{S}$ is the core tensor, each $U$ is a unitary matrix, and $\times_n$ denotes the n-mode product.
3. Rank the n-mode singular values.
4. Near-optimal projections (PCs): truncate the singular matrices so that a% of the energy of $\mathcal{D}$ is kept.

10 Dimensionality Reduction, HO-SVD
n-mode singular vectors: consider the tensor $\mathcal{D} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ and unfold $\mathcal{D}$ to $D_{(n)}$:
1. $I_1 \times I_2 I_3$ matrix $D_{(1)}$
2. $I_2 \times I_3 I_1$ matrix $D_{(2)}$
3. $I_3 \times I_1 I_2$ matrix $D_{(3)}$
The n-mode singular values and vectors are given by the SVD of $D_{(n)}$.
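The unfolding can be sketched in NumPy as below. The column ordering produced here differs from the cyclic ordering written on the slide, which is harmless for this purpose because permuting columns changes neither the singular values nor the left singular vectors. The tensor sizes are toy values.

```python
import numpy as np

def unfold(D, mode):
    """Mode-n unfolding: rows index mode `mode` (0-based), columns the remaining modes."""
    return np.moveaxis(D, mode, 0).reshape(D.shape[mode], -1)

D = np.arange(60, dtype=float).reshape(3, 4, 5)    # toy tensor with I1=3, I2=4, I3=5
shapes = [unfold(D, n).shape for n in range(3)]
# n-mode singular values: singular values of each unfolding
n_mode_sv = [np.linalg.svd(unfold(D, n), compute_uv=False) for n in range(3)]
```

For every mode, the squared n-mode singular values sum to the squared Frobenius norm of the tensor, which is a convenient sanity check.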

11 Dimensionality Reduction, HO-SVD
Definition (Unitary matrix). An $(I_n \times I_n)$ unitary matrix $U^{(n)}$, $n = 1, 2, 3$, contains the n-mode singular vectors (SVs):

$$U^{(n)} = \left[\, U^{(n)}_1 \;\; U^{(n)}_2 \;\; \cdots \;\; U^{(n)}_{I_n} \,\right]. \qquad (1)$$

Each matrix $U^{(n)}$ can be obtained directly as the matrix of left singular vectors of the matrix unfolding $D_{(n)}$ of $\mathcal{D}$ along the corresponding mode.

12 Dimensionality Reduction, HO-SVD

$$\mathcal{D} = \mathcal{S} \times_1 U_{af} \times_2 U_{mf} \times_3 U_{samples}$$

$\mathcal{S}$ is referred to as the core tensor (same dimensions as $\mathcal{D}$).
$U_{af} \in \mathbb{R}^{I_1 \times I_1}$ is the unitary matrix of the acoustic frequency subspace.
$U_{mf} \in \mathbb{R}^{I_2 \times I_2}$ is the unitary matrix of the modulation frequency subspace.
$U_{samples} \in \mathbb{R}^{I_3 \times I_3}$ is the samples subspace matrix.
$\times_n$ denotes the n-mode product.
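A small NumPy sketch of this decomposition for a 3-way tensor, assuming random toy data. The function name `hosvd3` is hypothetical; the core tensor is obtained by multiplying the data tensor with the transposed singular matrices along each mode, and the last line rebuilds the tensor from the core.

```python
import numpy as np

def hosvd3(D):
    """HO-SVD of a 3-way tensor: returns the core tensor S and the list [U1, U2, U3]."""
    U = []
    for n in range(3):
        unfolding = np.moveaxis(D, n, 0).reshape(D.shape[n], -1)    # D_(n)
        U.append(np.linalg.svd(unfolding, full_matrices=True)[0])   # left singular vectors
    # Core tensor: S = D x_1 U1^T x_2 U2^T x_3 U3^T
    S = np.einsum("abc,ai,bj,ck->ijk", D, U[0], U[1], U[2])
    return S, U

rng = np.random.default_rng(0)
D = rng.standard_normal((4, 5, 6))        # toy sizes for (acoustic, modulation, samples)
S, (U_af, U_mf, U_samples) = hosvd3(D)
# Reconstruction: D = S x_1 U_af x_2 U_mf x_3 U_samples
D_rec = np.einsum("ijk,ai,bj,ck->abc", S, U_af, U_mf, U_samples)
```

Without truncation the reconstruction is exact up to rounding, and the core tensor has the same dimensions as the data tensor.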

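13 Dimensionality Reduction, HO-SVD
Defining the n-mode product $\mathcal{S} \times_n U^{(n)}$, with $\mathcal{S} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ and $U^{(n)} \in \mathbb{R}^{I_n \times I_n}$. Example: for $n = 2$ this is an $(I_1 \times I_2 \times I_3)$ tensor given by

$$\left(\mathcal{S} \times_2 U^{(2)}\right)_{i_1 i_2' i_3} \overset{\mathrm{def}}{=} \sum_{i_2 = 1}^{I_2} s_{i_1 i_2 i_3}\, u_{i_2' i_2}.$$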
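As an illustration, the n-mode product can be written with `np.tensordot`. The helper name `mode_n_product` and the toy sizes are assumptions of this sketch; modes are 0-based here, so the slide's n = 2 corresponds to `mode=1`. The matrix is deliberately non-square to show that the contracted mode takes the matrix's row dimension.

```python
import numpy as np

def mode_n_product(S, U, mode):
    """Multiply tensor S by matrix U along `mode` (0-based)."""
    # Contract U's column index with S's `mode` index, then move the new axis back in place.
    return np.moveaxis(np.tensordot(U, S, axes=(1, mode)), 0, mode)

rng = np.random.default_rng(1)
S = rng.standard_normal((2, 3, 4))
U = rng.standard_normal((5, 3))           # acts on the second mode (size 3)
P = mode_n_product(S, U, mode=1)
# Same operation written index by index: P[i1, j, i3] = sum_i2 S[i1, i2, i3] * U[j, i2]
P_ref = np.einsum("abc,jb->ajc", S, U)
```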

14 Dimensionality Reduction, HO-SVD
1. Create the data tensor $\mathcal{D} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ and decompose it into its n-mode singular vectors: $\mathcal{D} = \mathcal{S} \times_1 U_{af} \times_2 U_{mf} \times_3 U_{samples}$.
2. Rank the n-mode singular values.
3. Near-optimal projections (PCs): truncate the singular matrices so that a% of the energy of $\mathcal{D}$ is kept.

15 Dimensionality Reduction, HO-SVD
Contribution of the j-th n-mode singular vector $U^{(n)}_j$:

$$\alpha_{n,j} = \frac{\lambda_{n,j}}{\sum_{j'=1}^{I_n} \lambda_{n,j'}},$$

where $\lambda_{n,j}$ is the corresponding singular value.
Put a threshold on $\alpha_{n,j}$ and retain the $R_n$ ($n = 1, 2$) singular vectors.
Truncate the matrices: $\hat{U}^{(1)} \equiv \hat{U}_{af} \in \mathbb{R}^{I_1 \times R_1}$ and $\hat{U}^{(2)} \equiv \hat{U}_{mf} \in \mathbb{R}^{I_2 \times R_2}$.
Project new MS data onto the truncated matrices: $Z = \hat{U}_{af}^{T}\, B\, \hat{U}_{mf}$, where $B \equiv |X_l(k, i)| \in \mathbb{R}^{I_1 \times I_2}$ and $Z \in \mathbb{R}^{R_1 \times R_2}$.
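The ranking, truncation, and projection steps could look as follows in NumPy. The tensor sizes, the random data, and the threshold value 0.15 are placeholders chosen for this sketch, and `truncate_by_contribution` is a hypothetical helper name.

```python
import numpy as np

def truncate_by_contribution(U, singular_values, threshold):
    """Keep the singular vectors whose contribution alpha_j exceeds `threshold`."""
    alpha = singular_values / singular_values.sum()
    return U[:, alpha > threshold], alpha

rng = np.random.default_rng(0)
D = rng.standard_normal((6, 5, 20))       # (acoustic freq, modulation freq, samples), toy sizes

# Left singular vectors and singular values of the mode-1 and mode-2 unfoldings
U_af, s_af, _ = np.linalg.svd(D.reshape(6, -1), full_matrices=False)
U_mf, s_mf, _ = np.linalg.svd(np.moveaxis(D, 1, 0).reshape(5, -1), full_matrices=False)

U_af_hat, alpha_af = truncate_by_contribution(U_af, s_af, threshold=0.15)
U_mf_hat, alpha_mf = truncate_by_contribution(U_mf, s_mf, threshold=0.15)

B = rng.standard_normal((6, 5))           # one new modulation spectrum (toy data)
Z = U_af_hat.T @ B @ U_mf_hat             # reduced representation, shape (R1, R2)
```

The shape of `Z` depends on how many contributions pass the threshold, so the threshold directly controls the size of the feature set handed to the later selection stage.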

16 Redundancy Reduction with HOSVD
[Figure: P.D.F. of MI values against extrapolated MI, comparing the redundancy of packed features with the redundancy of original features.]

17 Mutual Information
The mutual information between two random variables $x_i$ and $x_j$ is defined as:

$$I(x_i; x_j) = \iint P_{ij}(x_i, x_j) \log_2\!\left[\frac{P_{ij}(x_i, x_j)}{P_i(x_i)\, P_j(x_j)}\right] dx_i\, dx_j,$$

where $P_{ij}(x_i, x_j)$ denotes the joint probability density function (pdf), and $P_i(x_i)$ and $P_j(x_j)$ denote the marginal pdfs.
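In practice the densities are unknown and have to be estimated. The sketch below uses a plain 2-D histogram as a stand-in for the pdfs; the slides do not say which estimator was used, so the histogram and the bin count are assumptions, and histogram estimates carry a small positive bias for finite samples.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of I(x; y) in bits."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()                    # joint probabilities per cell
    p_x = p_xy.sum(axis=1, keepdims=True)         # marginal of x
    p_y = p_xy.sum(axis=0, keepdims=True)         # marginal of y
    nz = p_xy > 0                                 # skip empty cells (0 * log 0 = 0)
    return float(np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])))

rng = np.random.default_rng(0)
x = rng.uniform(size=5000)
y = rng.uniform(size=5000)                        # independent of x
mi_same = mutual_information(x, x)                # fully dependent pair
mi_indep = mutual_information(x, y)               # should be close to zero
```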

18 Maximal Relevance Criterion
Select the features most relevant to the target class c:
1. Compute the mutual information $I(x_j; c)$ between feature $x_j$ and class c.
2. Rank all the computed $I(x_j; c)$.
3. Select the top m features.
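The three steps above can be sketched as follows, again with a histogram estimate of the mutual information (an assumption of this sketch) and synthetic data in which only one feature depends on the class label. The function names are hypothetical.

```python
import numpy as np

def mi_with_class(feature, labels, bins=8):
    """Histogram estimate of I(feature; class) in bits."""
    edges = np.histogram_bin_edges(feature, bins=bins)
    binned = np.digitize(feature, edges[1:-1])    # bin index in 0 .. bins-1
    classes = np.unique(labels)
    joint = np.zeros((bins, len(classes)))
    for j, c in enumerate(classes):
        joint[:, j] = np.bincount(binned[labels == c], minlength=bins)
    p = joint / joint.sum()
    p_x = p.sum(axis=1, keepdims=True)
    p_c = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (p_x @ p_c)[nz])))

def max_relevance(X, labels, m):
    """Rank features by I(x_j; c) and return the indices of the top m."""
    scores = np.array([mi_with_class(X[:, j], labels) for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1]
    return order[:m], scores

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 200)
X = rng.standard_normal((400, 3))
X[:, 1] += 3.0 * labels                   # only feature 1 depends on the class
selected, scores = max_relevance(X, labels, m=1)
```

Note that this criterion scores each feature on its own, so it does not penalise redundancy between the selected features.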

19 Database & Conditions
- Sustained vowel /AH/ from MEEI
- Subset of the database (53 normophonic, 173 dysphonic speakers)
- Signals sampled at 25 kHz
- Classifier: SVM with a radial basis function (RBF) kernel
- 4-fold stratified cross-validation, repeated 400 times
- Training/Testing: 75% / 25%
- Decision per segment
- Evaluation: Detection Error Trade-off (DET) curves
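To make the stratified 4-fold protocol concrete, here is a small sketch that assigns samples to folds while preserving the class proportions; each fold then serves once as the 25% test portion. It splits at the speaker level using the class sizes from the slide, whereas the slide reports decisions per segment, so the granularity is an assumption, as are the function name and the seed.

```python
import numpy as np

def stratified_folds(labels, n_folds=4, seed=0):
    """Assign each sample to a fold so every fold keeps the class proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    fold_of = np.empty(len(labels), dtype=int)
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))   # shuffle within the class
        fold_of[idx] = np.arange(len(idx)) % n_folds         # deal out round-robin
    return fold_of

labels = np.array([0] * 53 + [1] * 173)   # 53 normophonic, 173 dysphonic
fold_of = stratified_folds(labels)
test_mask = fold_of == 0                  # one fold (about 25%) for testing
train_mask = ~test_mask                   # remaining folds (about 75%) for training
```

Repeating the procedure with different seeds gives the repeated cross-validation runs mentioned on the slide.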

20 Feature extraction
Data tensor $\mathcal{D}$, truncated matrices $\hat{U}_{af}$ and $\hat{U}_{mf}$, and projected features $Z \in \mathbb{R}^{34 \times 34}$.

21 Results: Detection
Normophonic/Dysphonic: optimal detection accuracy (DCF_opt) of 94.08% (±0.86) using the top m = 25 features (AUC = 97.75% in terms of ROC).

22 Results: Classification
Classify: vocal fold polyp (Pol), adductor spasmodic dysphonia (Add), keratosis leukoplakia (Ker), and vocal nodules (Nod).
Table columns: MSMR (DCF_opt (%), AUC (%), m) and FD-GA (DR (%)). Rows: Pol/Add, Pol/Ker, Pol/Nod.
FD-GA stands for Fisher distance and Genetic Algorithms (Hosseini et al. 2008).

23 MEEI: comparison
[Figure: DET curves, miss probability (in %) vs. false alarm probability (in %). Legend: MFCC SVM maxrel maxcontrib.]

24 MEEI: fusion
[Figure: DET curves, miss probability (in %) vs. false alarm probability (in %). Legend: MFCC, mrms, Fusion.]

25 PdA: fusion
[Figure: DET curves, miss probability (in %) vs. false alarm probability (in %). Legend: MFCC, mrms, Fusion.]

26 Cross-database experiment
Train on PdA, test on MEEI.
[Figure: DET curves, miss probability (in %) vs. false alarm probability (in %). Legend: MFCC, mrms, Fusion.]

27 Cross-database experiment
Columns: MFCC, MRMS, Fusion.
Rows: MEEI (125) 3.63; PdA (125); PdA-MEEI (125); MEEI-PdA (450) 21.86.

28 for the work on Modulation Spectra
1. Maria Markaki and Yannis Stylianou: Voice Pathology Detection and Discrimination based on Modulation Spectral Features. IEEE Trans. on Audio, Speech and Language Processing (TASL), Jan.
2. J.D. Arias-Londono, J.I. Godino-Llorente, M. Markaki, and Y. Stylianou: On combining information from Modulation Spectra and Mel-Frequency Cepstral coefficients for Automatic Detection of Pathological Voices. Logopedics, Phoniatrics, Vocology (LPV), Nov.
3. Maria Markaki and Yannis Stylianou: Discrimination of Speech from Nonspeech in Broadcast News Based on Modulation Frequency Features. Speech Communication, Special Issue on Speech Communication on Perceptual and Statistical Audition.

29 Define Vocal Tremor
Vocal tremor: involuntary modulations of frequency and/or amplitude in sustained phonation.
Pathological and physiological vocal tremor:
- Pathological: arises from diseases such as Parkinson's disease, essential tremor, etc. Strong motor synchronization.
- Physiological: natural stochastic modulations in the interval [2, 15] Hz with low amplitude.
Acoustic vocal tremor attributes:
- Modulation frequency: how fast the modulations are.
- Modulation level: how strong the modulations are.

36 Vocal Tremor
- Uses an AM-FM decomposition algorithm based on the adaptive time-varying quasi-harmonic model for speech.
- High resolution in the time-frequency plane.
- Vocal tremor can be obtained for any sinusoidal component of speech.
- Time-dependent vocal tremor estimates.

37 AM-FM Decomposition using aQHM
Speech is modeled as a sum of AM-FM sinusoids:

$$s(t) = \sum_{k=1}^{K} a_k(t) \cos(\phi_k(t)),$$

where $K$ is the number of components, $a_k(t)$ is the instantaneous amplitude of the k-th sinusoid, $\phi_k(t)$ is the instantaneous phase of the k-th sinusoid, and $f_k(t) = \frac{1}{2\pi} \frac{d\phi_k(t)}{dt}$ is the instantaneous frequency of the k-th sinusoid.
The AM-FM decomposition algorithm tries to estimate these instantaneous components.
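To illustrate the signal model only (not the aQHM estimator itself), the sketch below synthesises a signal with K = 3 harmonically related AM-FM components and checks that differentiating the phase returns the instantaneous frequency. The sampling rate, the 200 Hz fundamental, the 5 Hz modulation, and the harmonic relation between components are all assumptions made for the example.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs                                # 1 second of signal

K = 3
f0 = 200 + 4 * np.sin(2 * np.pi * 5 * t)              # fundamental with a 5 Hz frequency modulation
phi0 = 2 * np.pi * np.cumsum(f0) / fs                 # phase = 2*pi * integral of frequency

components = []
for k in range(1, K + 1):
    a_k = (1.0 / k) * (1 + 0.1 * np.cos(2 * np.pi * 5 * t))   # instantaneous amplitude
    phi_k = k * phi0                                           # instantaneous phase
    components.append(a_k * np.cos(phi_k))
s = np.sum(components, axis=0)                        # s(t) = sum_k a_k(t) cos(phi_k(t))

# Instantaneous frequency of the first component: (1 / 2*pi) * d(phi)/dt
f1 = np.gradient(phi0, 1 / fs) / (2 * np.pi)
```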

38 Example of AM-FM decomposition on Speech
[Figure: frequency (Hz) vs. time (s).]

39 Preprocessing of Inst. Component
- Downsample the instantaneous component to f_s = 1000 Hz.
- Remove the very slow (< 2 Hz) modulations of the instantaneous component. This is performed with a Savitzky-Golay (S-G) smoothing filter.
- The S-G smoothing filter performs a local polynomial regression.
- S-G filter parameters: 4th-order polynomial and 1 s frame size.
- Advantage: preserves features of the time series such as relative maxima, minima, and width.
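One way to read this step is: smooth the instantaneous component with the S-G filter to obtain the slow trend, then subtract that trend. The sketch below implements the filter directly in NumPy from its least-squares definition (SciPy's `scipy.signal.savgol_filter` is an existing library alternative). The subtraction reading, the zero-padded edges of `np.convolve`, and the toy signal are assumptions of this sketch.

```python
import numpy as np

def savgol_coeffs(window, order):
    """Weights that evaluate a local least-squares polynomial fit at the window centre."""
    pos = np.linspace(-1.0, 1.0, window)              # scaled positions keep the fit well conditioned
    A = np.vander(pos, order + 1, increasing=True)    # columns: 1, pos, pos^2, ...
    return np.linalg.pinv(A)[0]                       # row 0 gives the fitted value at pos = 0

def remove_slow_modulations(x, fs=1000, frame_sec=1.0, order=4):
    window = int(frame_sec * fs) | 1                  # force an odd window length
    trend = np.convolve(x, savgol_coeffs(window, order), mode="same")
    return x - trend, trend                           # edges are unreliable (zero padding)

fs = 1000
t = np.arange(4 * fs) / fs
drift = 5.0 * np.sin(2 * np.pi * 0.5 * t)             # slow 0.5 Hz modulation to be removed
tremor = np.cos(2 * np.pi * 5.0 * t)                  # 5 Hz modulation to be kept
x = 200.0 + drift + tremor                            # toy instantaneous frequency track
residual, trend = remove_slow_modulations(x, fs)
interior = slice(600, -600)                           # ignore half a window at each edge
corr = np.corrcoef(residual[interior], tremor[interior])[0, 1]
```

Away from the edges, the residual follows the 5 Hz component closely while the constant offset and the slow drift end up in `trend`.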

40 S-G Filter Output
[Figure: (a) frequency (Hz) vs. time (s); (b) magnitude vs. frequency (Hz).]

41 Compute Modulation Frequency & Level
Assume that the processed instantaneous component has a single but time-varying modulation frequency and modulation level:

$$x(t) = m(t) \cos(\psi(t)).$$

Apply the AM-FM decomposition algorithm a second time, now to the processed instantaneous component. Thus:
- the modulation frequency, $\frac{1}{2\pi} \frac{d\psi(t)}{dt}$, is estimated from the FM component of the AM-FM decomposition algorithm;
- the modulation level, $m(t)$, is estimated from the respective AM component.
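As a rough stand-in for the second AM-FM decomposition, the sketch below estimates m(t) and the modulation frequency from the analytic signal (an FFT-based Hilbert transform). This is a swapped-in, simpler technique and not the aQHM-based algorithm used on the slides; the function names and the 5 Hz, 0.8-level test signal are illustrative.

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal (discrete Hilbert transform)."""
    n = len(x)
    spectrum = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0                                     # keep DC
    if n % 2 == 0:
        gain[n // 2] = 1.0                            # keep Nyquist
        gain[1 : n // 2] = 2.0                        # double positive frequencies
    else:
        gain[1 : (n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)               # negative frequencies are zeroed

def modulation_frequency_and_level(x, fs):
    z = analytic_signal(x)
    level = np.abs(z)                                 # m(t)
    psi = np.unwrap(np.angle(z))                      # psi(t)
    freq = np.gradient(psi) * fs / (2 * np.pi)        # (1 / 2*pi) * d(psi)/dt in Hz
    return freq, level

fs = 1000
t = np.arange(2 * fs) / fs
x = 0.8 * np.cos(2 * np.pi * 5 * t)                   # toy component: 5 Hz modulation, level 0.8
freq, level = modulation_frequency_and_level(x, fs)
```

For real, noisy components the analytic-signal estimate is less robust than a model-based decomposition, which is one reason to prefer the latter.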

42 Compute Modulation Frequency & Level
[Figure: (a) frequency (Hz) vs. time (s); (b) magnitude vs. frequency (Hz).]

43 Compute Modulation Frequency & Level
[Figure: (a) modulation frequency (Hz) vs. time (s); (b) modulation level (%) vs. time (s).]

44 Voice Fatigue and Acoustic Features of Vocal Loading
Voice fatigue:
- Strain of the laryngeal tissues.
- Relation between occupational voice fatigue and voice pathologies.
Acoustic features:
- Rise in fundamental frequency.
- Rise in sound pressure.
- Rise in vocal tremor attributes (Boucher et al. 2008): strain of the laryngeal muscles may affect the speaker's ability to sustain constant tension of the vocal folds.

45 Examining the Relationship between Vocal Loading and Tremor Attributes
- Estimating vocal tremor attributes: extract the instantaneous frequency and the instantaneous amplitude.
- Comparing vocal tremor attributes before and after vocal loading: compare the modulation frequencies and the modulation levels of two voiced signals of the same speaker, recorded before and after vocal loading.

46 Definitions
Vocal Loading Amplitude Indicator (VLAI) = mean modulation level after loading - mean modulation level before loading.
Vocal Loading Frequency Indicator (VLFI) = mean modulation frequency after loading - mean modulation frequency before loading.
- Positive value: increase of vocal tremor attributes, possible degradation of voice.
- Negative value: decrease of vocal tremor attributes, possible enhancement of voice.
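Both indicators are simple differences of means, as the sketch below shows. The function name and the numbers are made up for illustration and do not come from the study.

```python
from statistics import mean

def vocal_loading_indicators(level_before, level_after, freq_before, freq_after):
    """Return (VLAI, VLFI) from modulation level and frequency tracks."""
    vlai = mean(level_after) - mean(level_before)     # amplitude indicator
    vlfi = mean(freq_after) - mean(freq_before)       # frequency indicator
    return vlai, vlfi

# Hypothetical tracks: modulation level in %, modulation frequency in Hz
vlai, vlfi = vocal_loading_indicators(
    level_before=[1.0, 1.2, 1.1],
    level_after=[1.5, 1.7, 1.6],
    freq_before=[5.0, 5.5, 6.0],
    freq_after=[4.5, 5.0, 5.5],
)
```

In this made-up case VLAI is positive (stronger modulations after loading) while VLFI is negative (slower modulations after loading).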

47 DB1: Comparing VLAI and VLFI to Subjective Evaluations (SE)
Table columns: Speaker id (female and male speakers), VLAI, VLFI, Student SE, Trainer SE (Pre:Post).
Reminder: Student SE ranges from 0 (not tired) to -3 (very tired). Trainer SE ranges from -3 (very poor voice) to +3 (excellent).

48 Summary for this task
- There seems to be no relation between vocal loading and voice tremor.
- There is a correlation between objective and subjective evaluations for voice quality assessment.

49 for the work on Vocal Tremor
1. Pantazis, Maria Koutsoyannaki and Yannis Stylianou: A novel method for the extraction of vocal tremor, MAVEBA-2009, Florence, Italy, Dec.
2. Maria Koutsoyannaki, Pantazis, Yannis Stylianou, and Philippe Dejonckere: in speakers with spasmodic dysphonia, MAVEBA-2011, Florence, Italy, Aug 2011.

50 My students: Maria Markaki, Maria Koutsoyannaki.
My ex-student: Pantazis.
Prof. Juan Ignacio Godino-Llorente and J.D. Arias-Londono (PhD) (UPM, Spain).
Prof. Anne-Maria Laukkanen (Univ. of Tampere, Finland), for providing the database with vocal fatigue examples.

51 THANK YOU for your attention
