Vowel Enhancement in Early Stage Spanish Esophageal Speech Using Natural Glottal Flow Pulse and Vocal Tract Frequency Warping

Size: px

Start display at page:

Download "Vowel Enhancement in Early Stage Spanish Esophageal Speech Using Natural Glottal Flow Pulse and Vocal Tract Frequency Warping"

Gervase Harrington
5 years ago
Views:

1 Vowel Enhancement in Early Stage Spanish Esophageal Speech Using Natural Glottal Flow Pulse and Vocal Tract Frequency Warping Rizwan Ishaq 1, Dhananjaya Gowda 2, Paavo Alku 2, Begoña García Zapirain 1 1 Deustotech-LIFE, University of Deusto, Bilbao, Spain 2 Aalto University, Dept. of Signal Processing and Acoustics, Finland rizwanishaq@deusto.es, dhananjaya.gowda@aalto.fi, paavo.alku@aalto.fi, mbgarciazapi@deusto.es Abstract This paper presents an enhancement system for early stage Spanish Esophageal Speech () vowels. The system decomposes the input into neoglottal waveform and vocal tract filter components using Iterative Adaptive Inverse Filtering (IAIF). The neoglottal waveform is further decomposed into fundamental frequency F, Harmonic to Noise Ratio (HNR), and neoglottal source spectrum. The enhanced neoglottal source signal is constructed using a natural glottal flow pulse computed from real speech. The F and HNR are replaced with natural speech F and HNR. The vocal tract formant frequencies (spectral peaks) and bandwidths are smoothed, the formants are shifted downward using second order frequency warping polynomial and the bandwidth is increased to make it close to the natural speech. The system is evaluated using subjective listening tests on the Spanish vowels /a/, /e/, /i/, /o/, /u/. The Mean Opinion Score (MOS) shows significant improvement in the overall quality (naturalness and intelligibility) of the vowels. Index Terms: speech enhancement, glottal flow, analysis synthesis vocal tract, spectral sharpening, warping 1. Introduction The removal of the larynx after a Total Laryngectomy (TL), changes the speech production mechanism. The trachea which connects the larynx and lungs for air source is now connected to a stoma (hole on neck) for breathing. The vocal folds which resided in larynx are no more available. After TL, there is no voicing and air source for speech production. Therefore alternative voicing and air source are needed for speech restoration. Three methods are available for this purpose, i) Esophageal Speech (), ii) Tracheo-Esophageal Speech (T), and iii) Electrolarynx (EL). and T both use a common voicing source, the Phyarngo-Esophageal (PE) segment, but with a different air source, while EL uses external devices for voicing source with no air source. The is preferred over other methods, because it does not require surgery (T) or external devices (EL). involves, however, a low pressure air source, and an irregular PE segment vibration which results in low quality and low intelligible speech. Compared to the production of normal speech according to the source-filter model [1], the voicing source in is severely altered and does not have any fundamental frequency or harmonic components. The vocal tract filter is also shortened in. The can be enhanced by transforming the source and filter components to those of normal speech using signal processing algorithms. In previous studies is typically decomposed into its source and filter components using Linear Predication (LP) based analysis-synthesis techniques. Based on this assumption the authors in [2, 3] replaced the voicing source with the Liljencrants- Fant (LF) voicing source, and reported significant enhancements. Fundamental frequency smoothing and correction with the synthetic LF source model were used for quality enhancement also in [4]. enhancement based on formant synthesis has also shown significant improvement in intelligibility [, 6]. In [7] the source and filter components were modified by replacing the source with the LF model and increasing the bandwidth of filter formants for better quality speech. Statistical conversion from to normal speech has also improved intelligibility, but requires more data [8]. Some other not so common approaches are based on Kalman filtering [9, 1, 11, 12], and modulation filtering enhancement [13, 14]. Almost all methods available in the literature assume that the fundamental frequency of can be estimated accurately. The voicing source signal is then modified with the synthetic LF model voicing source. The vocal tract formants are typically considered to be the same as in normal speech signals. In reality, however, the fundamental frequency of is highly irregular and the voicing source resembles whispered speech. Moreover, formants center frequencies are affected by the shortening of vocal tract length due to surgery. In order to deal with these deficiencies, this paper proposes an enhancement method based on the GlottHMM single pulse synthesis [1, 16, 17]. The system decomposes into neoglottal waveform and vocal tract filter components using Iterative Adaptive Inverse Filtering (IAIF) [18]. Natural glottal pulse extracted from real speech is used to construct the glottal waveform by borrowing F curve and HNR from normal speech. The vocal tract filter is also modified by smoothing the spectral peaks and their bandwidths. The spectral peaks of the vocal tract filter are also moved to lower frequencies in order to compensate the rising of formant in. The formant bandwidths are also increased for better quality speech. The system is validated with Spanish Esophageal Vowels subjectively using the Mean Opinion Score (MOS). The paper in next section describes the system in detail. The subsequent sections contain results, discussion and finally conclusions. 2. System Description The proposed system, shown in Figure 1, is divided into three main components, i) analysis, ii) transformation, and iii) synthesis. The analysis part decomposes the voiced speech frame into its source and filter components. The transformation provides the modified source and filter components. Finally the modified components are combined in the synthesis part to generate enhanced.

2 1 HNR (db) Natural Band Number Figure 2: HNR of and natural speech. Figure 1: enhancement system GlottHMM based analysis The goal of the analysis part of the system is to decompose the signal into a neoglottal source signal and a vocal tract spectrum. The input speech signal s[n] is first passed through highpass filter h hp [n] with a cutoff frequency of 7 Hz. s h [n] = s[n] h hp [n] (1) where s h [n] and are the highpass filtered speech signal and a convolution operator, respectively. The highpass filtered signal s h [n] is then windowed using a rectangular window of size 4- ms, with -ms frame shift. x[n] = s h [n]w[n] (2) where w[n] is the rectangular window. Firstly the log energy G of frame is extracted using, N 1 G = log( x 2 [n]) (3) n= where N is the number of samples in the frame. Glottal Inverse Filtering (GIF) is then used to separate the frame into a neoglottal source signal and a vocal tract spectrum. The automatic inverse filtering, IAIF is used [18]. IAIF estimates vocal tract and lip radiation using all-pole modeling and then iteratively cancel these components. In simplified form, the neoglottal source signal: U(z) = X(z) (4) V (z)r(z) where U(z), X(z), V (z) and R(z) are the z-transforms of neoglottal source signal u[n], speech signal x[n], vocal tract impulse response v[n], and lip radiation response r[n] respectively. The estimated neoglottal source signal u[n] is parametrized into fundamental frequency F, Harmonic to Noise Ratio (HNR) and neoglottal source spectrum U(z). The autocorrelation of the neoglottal source signal u[n] is used for F estimation. The HNR is estimated using the upper and lower smoothed spectral envelopes ratio to determine the voicing degree in the neoglottal voicing source signal u[n] for five frequency bands [1]. In short the analysis part of the system provides for each frame the following, i) Frame energy G, ii) vocal tract spectrum V (z) (LP order 3), iii) F, iv) HNR and v) neoglottal source spectrum U(z) (LP order 1) to normal speech transformation The parameters obtained from the analysis are transformed into natural speech parameters. The neoglottal signal and vocal tract are modified independently Neoglottal source signal enhancement The neoglottal source signal u[n] is the most effected speech component in. Therefore the parameters of this signal are replaced with any arbitrary natural speech signal for a better glottal source signal. The natural glottal pulse which is extracted from normal speech is first interpolated using the cubic spline interpolation by replacing the frame original F with natural speech F N. The interpolated glottal pulse voicing source is then multiplied with the smooth gain G and the natural speech HNR is then used to add noise in the frequency domain for naturalness according to the following steps: Taking FFT of the neoglottal waveform, Adding random components (white Gaussian noise) to real and imaginary part of FFT according to HNR, Taking IFFT of noise added neoglottal waveform U syn(z) = 1 G G(z) + Q(z) () where U syn(z) is the synthetic glottal source, G(z) is the natural glottal pulses source, and Q(z) is HNR based noise component. Figure 2 shows the mean value of HNR for all voiced frames along with standard deviation. The figure indicates that HNR of is greatly different from that of normal speech. Therefore, it is justified to replace the HNR of with the HNR of normal speech in the vowel enhancement system. In order to adjust the spectrum of neoglottal waveform to the spectrum of the target waveform, the former is filtered with following IIR filter: H m(z) = U(z) (6) U syn(z) where U(z) and U syn(z) are the LP spectra of the original and synthetic neoglottal waveform, respectively. The lip radiation is applied to the spectrally matched neoglottal waveform û[n]: û[n] = û[n] αû[n 1],.96 < α < 1 (7)

3 Natural Natural Formants (\a,e,i,o,u\) formants (\a,e,i,o,u\) Frequency warping curve Natural curve ˆf (Hz) α 1 f 2 +α 2 f +c Amplitude (db) Time (s) - (a) Natural Frequency (Hz) (b) Figure 3: Glottal excitations (computed from the vowel /a/) in the time domain (a) and in the frequency domain (b). Amplitude (db) f (Hz) Figure 4: Frequency Warping Function (FWF) curve Original Spectrum Frequency Wapred Spectrum Frequency (Hz) Figure : Frequency warped spectra. where û[n](û(z)) and α(.98) are the modified neoglottal waveform and lip radiation constant, respectively. Figure 3(a) shows time-domain examples of glottal excitations of natural speech and together with a waveform computed with the proposed enhancement system. It can be seen that the proposed system is capable of producing a glottal excitation that is highly similar to that of natural speech. As shown in Figure 3(b), the spectral slope of the excitation waveform generated by the proposed method is also close to that of natural speech, especially at low frequencies, but the generated spectrum also retains the spectral slope of at higher frequencies Vocal tract modification by nonlinear frequency warping The vocal tract spectrum of has the following characteristics, i) higher frequencies are emphasized more compared to lower frequencies, ii) spectral resonances (formants) are moved to higher frequencies, and iii) resonance bandwidths are reduced in comparison to normal speech vowels. To cope with the higher frequency emphasis, a de-emphasis filter is applied to the vocal tract spectrum. The resulting vocal tract transfer function is then expressed as: 1 + αz 1 H enh (z) = 1 +,.9 < α < 1 (8) P p=1 apz p where P is the order of the all-pole vocal tract filter and α is the de-emphsis constant. Because formants of are moved upward in frequency, a procedure is needed to adjust them to coincide more closely with the formant values of normal speech. For such a procedure, we used a second order Frequency Warping Function (FWF) ζ(f) defined as: ζ(f) = α 1f 2 + α 2f + c (9) where α 1 = , α 2 =.3, and c = ˆf = βζ(f), β = 1, f = fs 2 (1) where ˆf and f, are warped and original frequencies, and β is a constant. Figure 4 demonstrates FWF using first four formants of vowels (/a/, /e/, /i/, /o/, /u/) extracted from normal speech (x-axis) and (y-axis). The obtained frequency warping, applicable for a general formant mapping between normal speech and, is shown in Figure. In order to expand the formant bandwidths, exponential windowing is used for the vocal tract filter coefficients as follows [19]: H s(z) = 1 + P p=1 γp a pz p 1 +,.9 < γ, η < 1 (11) P p=1 ηp a pz p where γ and η are constants controlling the spectral bandwidth. If γ > η bandwidth of formants increase, otherwise it decreases (i.e. formants are sharpened). For the purpose of the present study, η(.97) is always smaller than γ(.99) in order to increase formant bandwidths Synthesis of enhanced speech The synthesis part involves convolving the modified neoglottal waveform and the impulse response of the vocal tract filter yielding the enhanced version of ˆx[n]; ˆx[n] = ˆv[n] û[n] (12) where û[n] and ˆv[n] are the modified neoglottal waveform and vocal tract impulse response, respectively. 3. System Evaluation The system was evaluated with vowels of Spanish (/a/, /e/, /i/, /o/, /u/) recorded in speech rehabilitation center. The data

Original Reference Mean MOS 4 3 2 1 a e i o u (a) Figure 7: Results of the MOS test for all the vowels. 1 8 % 6 4 (b) original proposed reference Figure 8: Results of the preference test.

five early stage male talkers by asking them to utter each vowel four times. Due to lack of female patients in the rehabilitation center, only male speakers were involved in the study.

In this figure, and also later in Figures 7 and 8, the proposed system is compared with a reference system based on using the LF source and formant modification with a bandwidth extension system [7].

1. Subjective listening evaluation Two subjective listening tests were conducted.

In this test, the listeners heard original vowels and the corresponding enhanced ones, processed by both the proposed and the reference method, in a random order and they were asked to grade the

The second listening test was a preference test where the listeners heard vowels corresponding to the same three processing types and they were asked to select which one they prefer to listen.

for all the vowels, which can be considered a good quality score for samples. Figure 8 shows the data of the preference tests by combining all the vowels.

4 Original Reference Mean MOS a e i o u (a) Figure 7: Results of the MOS test for all the vowels. 1 8 % 6 4 (b) original proposed reference Figure 8: Results of the preference test. (c) Figure 6: Spectrograms of the vowel /a/ for different processing types: unprocessed (a), processed with the proposed system (b), processed with the reference system (c) [7] was collected from five early stage male talkers by asking them to utter each vowel four times. Due to lack of female patients in the rehabilitation center, only male speakers were involved in the study. The speech sounds were sampled with 44.1 khz from which the data was down-sampled to 16 khz for computational efficiency. The system performance is visually demonstrated with spectrograms in Figure 6. In this figure, and also later in Figures 7 and 8, the proposed system is compared with a reference system based on using the LF source and formant modification with a bandwidth extension system [7]. It can be seen from Figure 6 that the spectrogram computed from the enhanced vowels by the proposed system shows a clearer formant and harmonics structure in comparison to and the reference system Subjective listening evaluation Two subjective listening tests were conducted. The first one was a quality evaluation based on the Mean Opinion Score (MOS) which is a widely used perceptual quality test of speech based on a scale from 1 (worst) to (best). In this test, the listeners heard original vowels and the corresponding enhanced ones, processed by both the proposed and the reference method, in a random order and they were asked to grade the quality of the sounds on the MOS scale. The second listening test was a preference test where the listeners heard vowels corresponding to the same three processing types and they were asked to select which one they prefer to listen. A total of 1 listeners participated in the listening tests. Figure 7 shows the results of the MOS test. The data indicates that the proposed system has a mean MOS higher than 2. for all the vowels, which can be considered a good quality score for samples. Figure 8 shows the data of the preference tests by combining all the vowels. Also these data indicate that the proposed method has succeeded in enhancing the quality of the vowels. 4. Conclusion An enhancement system for vowels was proposed based on using a natural glottal pulse combined with second order polynomial Frequency Warping Function. A preliminary evaluation of the system was carried out on early stage Spanish vowels by comparing the system performance with a known reference method. Results obtained with a MOS evaluation show clear improvements in speech quality both in comparison to the original vowels and to sounds enhanced with the reference method. The good performance was corroborated with a preference test indicating that in the vast majority of the cases, listeners preferred to listen to the sounds enhanced by the proposed method. Future work is needed to study the system together with advanced stage speakers.. Acknowledgements Special thanks to all my colleagues at Aalto University for their valuable support and time.

5 6. References [1] G. Fant, Acoustic theory of speech production. Mouton, The Hauge, 196. [2] Q. Yingyong, W. Bernd, and B. Ning, Enhancement of female esophageal and tracheoesophageal speech, Acoustical Society of America, vol. 98(, Pt1), pp , 199. [3] Y. Qi, Replacing tracheoesophageal voicing source using lpc synthesis, Acoustical Society of America, vol., pp , 199. [4] R. Sirichokswad, P. Boonpramuk, N. Kasemkosin, P. Chanyagorn, W. Charoensuk, and H. H. Szu, Improvement of esophageal speech using lpc and lf model, Internation Conf. on Biomedical and Pharamaceutical Engineering 6, pp. 4 48, 6. [] M. Kenji, H. Noriyo, K. Noriko, and H. Hajime, Enhancement of esophageal speech using formant synthesis, Acoustic. Sci. and Tech., pp , 2. [6] M. Kenji and H. Noriyo, Enhancement of esophageal speech using formant synthesis, Acoustics, Speech and Signal Processing, International conf., pp. 81 8, [7] R. H. Ali and S. B. Jebara, Esophageal speech enhancement using excitation source synthesis and formant structure modification, SITIS, pp , 6. [8] K. Doi, H.and Nakamura, T. Toda, H. Saruwatari, and K. Shikano, Statistical approach to enhancing esophageal speech based on gaussian mixture models, Acoustics Speech and Signal Processing(ICASSP), 1 IEEE International Conference, pp , 1. [9] O. Ibon, B. Garcia, and Z. M. Amaia, New approach for oesophageal speech enhancement, 1th International conference, ISSPA, vol., pp , 1. [1] B. Garcia and A. Mendez, Oesophageal speech enhancement using poles stablization and kalman filtering, ICASSP, pp , 8. [11] B. Garcia, I. Ruiz, A. Mendez, and M. Mendezona, Oesophageal voice acoustic parameterization by means of optimum shimmer calculation, WSEAS Trasactions on Systems, pp , 8. [12] R. Ishaq and B. G. Zapirain, Optimal subband kalman filter for normal and oesophageal speech enhancement, Bio-Medical Materials and Engineering, vol. 24, pp , 14. [13] R. Ishaq, B. G. Zapirain, M. Shahid, and B. Lovstrom, Subband modulator kalman filtering for signla channel speech enhancement, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 13. [14] R. Ishaq and B. G. Zapirain, Adaptive gain equalizer for improvement of esophageal speech, in IEEE International Symposium on Signal Processing and Information Technology, 12. [1] A. Suni, T. Raitio,, M. Vainio, and P. Alku, The glottalhmm entery for blizzard challenge 11: Utilizing source unit selection in hmm-based speech synthesis for improved excitation generation, in in Blizzard Challenge 11, Workshop, Florence, Italy, 11. [16] T. Raitio, A. Suni, J. Yamagishi, H. Pulakka, J. Nurminen, M. Vainio, and P. Alku, HMM-based speech synthesis utilizing glottal inverse filtering, IEEE Transactions on Audio, Speech and Language Processing, vol. 19, pp , 11. [17] T. Raitio, A. Suni, H. Pulakka, M. Vainio, and P. Alku, Utilizing glottal source pulse library for generating improved excitation signal for hmm-based speech synthesis, in IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP), 11. [18] P. Alku, Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering, in Speech communication, vol. 11, no. 2, 1992, pp [19] J. H. Chen and A. Gersho, Adaptive postfiltering for quality enhancement of coded speech, Speech and Audio Processing, IEEE Transactions on, vol. 3, pp. 9 71, 199.

Quantification of glottal and voiced speech harmonicsto-noise ratios using cepstral-based estimation

Quantification of glottal and voiced speech harmonicsto-noise ratios using cepstral-based estimation Peter J. Murphy and Olatunji O. Akande, Department of Electronic and Computer Engineering University