Aspiration Noise during Phonation: Synthesis, Analysis, and Pitch-Scale Modification. Daryush Mehta


1 Aspiration Noise during Phonation: Synthesis, Analysis, and Pitch-Scale Modification
Daryush Mehta, SHBT 03
Research Advisor: Thomas F. Quatieri
Speech and Hearing Biosciences and Technology

2 Summary
Studied the problem of high-quality speech synthesis and pitch-scale modification for vocalic speech sounds that contain an aspiration noise component.
Developed a vowel production model and a corresponding vowel synthesizer in which the periodic and noise sources are coupled: the noise is temporally modulated by the periodic gating of turbulent airflow by the vibrating vocal folds.
Investigated a periodic/noise signal separation algorithm, which revealed aspiration noise characteristics in normal and pathological speech.
Designed an alternative approach to pitch-scale modification, inspired by the natural modulations observed in the aspiration noise source. The modified speech is perceived as natural-sounding and generally has fewer of the artifacts typically heard with current modification techniques.

3 Motivation
Many applications would benefit from a better understanding of the inherent characteristics of the aspiration component.
Clinical assessment of vocal pathology: Noninvasive diagnostic procedures are desired to efficiently assess the quality of a patient's vocal apparatus during therapy or post-surgery.
Speech modification: Current modification methods have difficulty when a significant aspiration noise component is present.
Text-to-speech synthesis: High quality and naturalness are desired, which requires an accurate representation of the aperiodic speech component.
Speaker identification: Distinct traits of different speakers are used, and the noise characteristics of speech may be unique to each speaker.

4 Definitions
The phonation source is due to mechanisms that cause the vocal folds to vibrate. The closed phase refers to vocal fold closure, contrasting with the open phase.
The aspiration noise source refers to turbulent airflow that is generated in the vicinity of the vocal folds, while frication sources are noises generated farther downstream in other cavities.
Relying on the lungs for primary air supply, these sources propagate through the pharynx, the nasal cavity, and (most importantly) the oral cavity to produce the acoustic speech signal that we can hear and record using a microphone.
[Figure, adapted from (Quatieri): speech production schematic. Lungs (power supply); vibrating vocal folds (modulator) producing periodic puffs, the phonation source, with open and closed phases marked over time t; frication and aspiration noise sources; pharynx, nasal cavity, and oral cavity.]

5 Additive Modulation Model Based on Voice Production
A two-source model for vocalic speech sounds in which the aspiration noise source is modulated by the phonation source before acoustically exciting the resonant vocal tract cavities.
[Figure: block diagram. Periodic component: phonation source, vocal tract filter, radiation characteristic. Noise component: aspiration noise source, modulation, vocal tract filter, radiation characteristic. The two components are summed to give the voiced speech sound.]
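The two-branch structure of the model can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the slides: the half-wave rectified sinusoid standing in for the phonation source, the two formant resonators, the first-difference radiation characteristic, and all names and parameter values.

```python
import numpy as np

def resonator(x, fc, bw, fs):
    """Two-pole resonator (one formant); fc and bw in Hz are illustrative."""
    r = np.exp(-np.pi * bw / fs)
    a1 = -2.0 * r * np.cos(2.0 * np.pi * fc / fs)
    a2 = r * r
    y = np.zeros(len(x))
    for i in range(len(x)):
        y[i] = x[i]
        if i >= 1:
            y[i] -= a1 * y[i - 1]
        if i >= 2:
            y[i] -= a2 * y[i - 2]
    return y

def vocal_tract_and_radiation(x, formants, fs):
    """Cascade of formant resonators followed by a first-difference radiation term."""
    for fc, bw in formants:
        x = resonator(x, fc, bw, fs)
    return x - np.concatenate(([0.0], x[:-1]))

def synthesize_vowel(f0=100.0, fs=8000, duration=0.5, noise_gain=0.3,
                     formants=((700.0, 100.0), (1200.0, 120.0)), seed=0):
    rng = np.random.default_rng(seed)
    n = int(fs * duration)
    t = np.arange(n) / fs
    # Assumed phonation source: positive half-cycles act as the open phase,
    # zeros act as the closed phase.
    phonation = np.maximum(np.sin(2.0 * np.pi * f0 * t), 0.0)
    # Aspiration source: white noise gated (modulated) by the phonation source.
    aspiration = noise_gain * phonation * rng.standard_normal(n)
    periodic = vocal_tract_and_radiation(phonation, formants, fs)
    noise = vocal_tract_and_radiation(aspiration, formants, fs)
    return {"speech": periodic + noise, "periodic": periodic, "noise": noise,
            "phonation": phonation, "aspiration": aspiration}
```

Because the noise is multiplied by the phonation waveform, the aspiration source in this sketch is silent during the closed phase and strongest during the open phase, which is the coupling the model describes.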

6 Speech Analysis: Harmonic/Noise Separation
The pitch-scaled harmonic filter technique (Jackson and Shadle) for separating harmonic and noise contributions aims to preserve the temporal and spectral characteristics inherent in the noise component.
[Figure: log magnitude versus frequency for the spoken vowel "ah": original spectrum, harmonic estimate, and noise estimate.]
The separation algorithm is limited by spectral leakage and by its sensitivity to shimmer and jitter in speech, but it has proved useful as an initial tool.
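The pitch-scaling idea behind this kind of separation can be sketched as follows. This is a simplified illustration, not Jackson and Shadle's full algorithm, and the published method includes refinements that are not reproduced here. The sketch assumes the frame spans exactly a whole number of pitch periods (four here, an assumption of this example), so that the harmonics of a periodic signal fall exactly on every fourth DFT bin; the function name is hypothetical.

```python
import numpy as np

def pitch_scaled_split(frame, periods=4):
    """Split a frame that spans exactly `periods` pitch periods.

    Bins that are multiples of `periods` are kept as the harmonic estimate;
    the remainder of the frame is taken as the noise estimate.
    """
    frame = np.asarray(frame, dtype=float)
    spectrum = np.fft.rfft(frame)
    bins = np.arange(spectrum.size)
    harmonic_spectrum = np.where(bins % periods == 0, spectrum, 0.0)
    harmonic = np.fft.irfft(harmonic_spectrum, n=frame.size)
    noise = frame - harmonic
    return harmonic, noise
```

For example, at a sampling rate of 8000 Hz and a pitch of 100 Hz, a four-period frame is 320 samples long and the k-th harmonic lands on bin 4k. The sketch also shows why pitch errors, jitter, and shimmer matter: if the frame does not span an exact number of periods, harmonic energy leaks into the bins assigned to the noise estimate.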

7 Speech Analysis: Harmonic/Noise Separation
Maintaining the time-domain patterns of the source waveforms.
To estimate the aspiration source, linear prediction analysis was used to remove the vocal tract resonances from the signal. The source estimate of the separated aspiration component contains modulations with peaks during the open phase of the known phonation source waveform.
The processing was performed on the vowel "ah" synthesized using our additive modulation model, which allows direct comparison between the known and separated source waveforms.
[Figure: amplitude versus time (s): aspiration noise source estimate and known phonation source waveform.]
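Inverse filtering by linear prediction can be sketched in a few lines. This is a minimal illustration using the autocorrelation method over a whole signal; the prediction order, the absence of framing, windowing, and pre-emphasis, and the function names are assumptions of this example rather than details given in the slides.

```python
import numpy as np

def lpc_coefficients(x, order):
    """Autocorrelation-method linear prediction.

    Returns A = [1, -a_1, ..., -a_p], the coefficients of the inverse
    filter A(z) = 1 - sum_k a_k z^-k.
    """
    x = np.asarray(x, dtype=float)
    # Autocorrelation at lags 0..order.
    r = np.correlate(x, x, mode="full")[x.size - 1 : x.size + order]
    # Toeplitz normal equations R a = r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1 : order + 1])
    return np.concatenate(([1.0], -a))

def inverse_filter(x, order=10):
    """Remove the estimated resonances, leaving a source (residual) estimate."""
    A = lpc_coefficients(x, order)
    residual = np.convolve(x, A)[: len(x)]
    return residual, A
```

Applied to the separated noise component, the residual plays the role of the aspiration source estimate, and its short-time energy can then be compared against the phonation source waveform.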

8 Speech Analysis: Non-pathological Speaker
In this example, peaks in the aspiration noise source estimate generally occur only at instants of vocal fold closure.
[Figure: three panels: harmonic estimate, aspiration estimate, and aspiration source estimate (annotated "peaks during closure"). All dashed lines indicate instants of vocal fold closure.]

9 Speech Analysis: Speaker with Vocal Pathology
Cysts can obstruct the airflow from the lungs and act as a source of air turbulence in addition to the normal aspiration noise sources.
Peaks in the aspiration source estimate are observed to occur regularly at the instants of excitation as well as at other phases within the phonation source.
[Figure: three panels: harmonic estimate, aspiration estimate, and aspiration source estimate (annotated "peak during open phase" and "peak during closure"). The dashed line indicates the open phase; the dotted line indicates the vocal fold closure instant.]

10 Speech Modification: Algorithm Based on Physiology
The design strategy takes into account the observed patterns of modulation and the temporal synchrony between the harmonic and noise components in voiced speech.
The stages of our algorithm reverse-engineer the additive modulation model to modify the modulation rate of the aspiration noise source.
[Figure: block diagram. Speech is separated into a harmonic component and a noise component. Harmonic component: pitch modification, f1 to f2. Noise component (pitch modification of the aspiration noise component): source estimation, source modification (f1 to f2), vocal tract filtering. The two modified components are summed to give the modified speech.]
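The slides do not specify how the source modification stage changes the modulation rate, so the following Python sketch is only one hypothetical way to do it: flatten the existing modulation by dividing by a short-time RMS envelope, then apply a new periodic gate at the target pitch f2. The raised-cosine gate, the quarter-period smoothing window, the envelope floor, and all function names are assumptions made for illustration.

```python
import numpy as np

def periodic_gate(n, f0, fs):
    """Assumed gate shape: raised cosine in [0, 1] repeating at f0 Hz."""
    t = np.arange(n) / fs
    return 0.5 * (1.0 - np.cos(2.0 * np.pi * f0 * t))

def short_time_rms(x, win):
    """Moving RMS over `win` samples (same length as the input)."""
    kernel = np.ones(win) / win
    return np.sqrt(np.convolve(x ** 2, kernel, mode="same"))

def remodulate(noise_source, f1, f2, fs):
    """Replace modulation at rate f1 with modulation at rate f2 (hypothetical)."""
    noise_source = np.asarray(noise_source, dtype=float)
    # A window much shorter than one period at f1 tracks the old modulation.
    win = max(1, int(fs / (4.0 * f1)))
    envelope = short_time_rms(noise_source, win)
    # Floor the envelope to avoid dividing by values near zero.
    flat = noise_source / np.maximum(envelope, 1e-3 * envelope.max())
    return flat * periodic_gate(noise_source.size, f2, fs)
```

In the pipeline described above, the input would be the inverse-filtered noise estimate, and the output would be passed back through the vocal tract filter before being added to the pitch-modified harmonic component. Keeping the new gate in phase with the modified harmonic component is not handled by this sketch.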

11 Speech Modification: Time-Frequency Analysis
[Figure: time-frequency panels for the original utterance "As time goes by": 80% pitch decrease by the sinewave method (annotated "unwanted harmonicity" and "unwanted noise reduction"); noise component estimate; 80% pitch decrease by our algorithm.]
Issue: overestimation of noise.

12 Speech Modification: Preliminary Perceptual Results
Sinewave transformation system (Quatieri and McAulay): The noise component is perceived as somewhat tonal and as perceptually more separate from the periodic component. Harmonicity above 2500 Hz is overestimated by the STS algorithm, and the perceived speech has reduced aspiration noise.
Our modification algorithm (Mehta and Quatieri): The output is perceived to have a breathier quality, consistent with the quality of the original waveform. Specifically, the signal tends to preserve the fullband aspiration noise features of the original signal. The modified signal has fewer of the artifacts and discontinuities that may appear with standard modification techniques.
Caveat: In preliminary analysis of running speech, the separated noise estimate sometimes contains harmonic leakage. The time-varying nature of natural vowels, together with the effects of jitter and shimmer, may degrade the performance of the separation technique by causing inaccurate pitch estimation. More advanced signal processing methods could help create perturbation-free waveforms and better estimate the pitch contour of continuous speech.

13 Future Work
Further refinements: harmonic/noise separation; estimation and modification of the noise envelope.
Perceptual rating of pitch modification: a formal rating task among listeners, to provide a statistically significant way of evaluating our algorithm against baseline algorithms.
Speech science: temporal characteristics of both aspiration and periodic sources; analysis of the dynamics of vocal fold vibration.
Clinical evaluation of pathological voice quality: add to the features obtained from traditional acoustic measures such as: harmonic and noise energy levels; aerodynamic measures; electroglottography and electromyography data.
[Figure: inverse-filtered noise estimate (blue) and estimated envelope (dashed red).]

14 References
Hermes, D. J. (1991). "Synthesis of breathy vowels - Some research methods." Speech Communication 10(5-6).
Jackson, P. J. B. and C. H. Shadle (2001). "Pitch-scaled estimation of simultaneous voiced and turbulence-noise components in speech." IEEE Transactions on Speech and Audio Processing 9(7).
Mehta, D. and T. F. Quatieri (2005). "Synthesis, analysis, and pitch modification of the breathy vowel." Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY.
Quatieri, T. F. and R. J. McAulay (1992). "Shape invariant time-scale and pitch modification of speech." IEEE Transactions on Signal Processing 40(3).
Quatieri, T. F. (2002). Discrete-Time Speech Signal Processing: Principles and Practice. Upper Saddle River, NJ, Prentice Hall PTR.
Stevens, K. N. (1998). Acoustic Phonetics. Cambridge, MA, MIT Press.

15 Speech Synthesis: Source Synchrony
[Figure: two panels, amplitude versus time (s): in-phase sources and out-of-phase sources.]
Speech synthesized with in-phase sources is perceptually fused and has a more natural quality than speech synthesized with out-of-phase sources, consistent with (Hermes).


ASPIRATION NOISE DURING PHONATION: SYNTHESIS, ANALYSIS, AND PITCH-SCALE MODIFICATION by DARYUSH MEHTA B.S., Electrical Engineering (23) University of Florida SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING