Epoch Extraction From Emotional Speech


D Govind and S R M Prasanna
Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati
Email: {dgovind,prasanna}@iitg.ernet.in

Abstract: This work proposes a modified zero frequency filtering (ZFF) method for epoch extraction from emotional speech. Epochs refer to the instants of maximum excitation of the vocal tract. In the conventional ZFF method, the epochs are estimated by trend removing the output of the zero frequency resonator (ZFR) using a window length equal to the average pitch period of the utterance. Using this fixed window length for epoch estimation causes spurious or missed estimates from speech signals with rapid pitch variations, such as emotional speech. This work therefore proposes a refined ZFF method that trend removes the output of the ZFR using variable windows, obtained by finding the average pitch period for every fixed block of speech, and then low pass filters the resulting trend removed signal segments with a cutoff frequency derived from the estimated pitch. The epoch estimation performance is evaluated for five different emotions of the German emotional speech corpus, which has simultaneous electroglottograph (EGG) recordings. The improved epoch estimation performance indicates the robustness of the proposed method against rapid pitch variations in emotional speech signals. The effectiveness of the proposed method is also confirmed by the improved epoch estimation performance on a Hindi emotional speech database.

I. INTRODUCTION

Epochs are events like glottal closure in voiced speech and the onset of frication or burst in unvoiced speech [1]. Due to the interaction of the vocal tract resonances, estimating the epoch locations from the speech signal is a challenging task [1], [2]. The knowledge of epochs can be used in many speech processing applications [3]-[7]. Because of this importance, different methods have been proposed for estimating epoch locations from speech signals [1], [8], [9]. Among these, the zero frequency filtering (ZFF) method provides a simple and accurate way of extracting epoch locations from speech signals [1]. In the ZFF method, the speech signal is passed through a cascade of two resonators located at zero frequency; this zero frequency resonator (ZFR) removes the effect of the vocal tract resonances, which are predominantly present in the higher frequency regions of the speech signal. The trend in the ZFR output is removed by local mean subtraction using a window length equal to the average pitch period of the speech utterance. The trend removed signal is known as the zero frequency filtered signal (ZFFS). The positive zero crossings of the ZFFS give the epoch locations, and the differences between successive epoch locations provide the instantaneous pitch periods. Accurate pitch estimation is important for the emotion analysis stage of an emotion conversion system [6]. However, the rapid pitch variations in emotional speech cause spurious or missed zero crossings in the ZFFS when the trend is removed using a single window length. Hence the motivation for the present work. There is existing work on improving ZFF epoch estimation for signals with rapid pitch variations, such as laughter signals [10]. There, the window length used for trend removal is estimated for each glottal activity region, and the trend in the corresponding segment of the ZFR output is removed using these varied window lengths.
As described later, in the present work the trend in the ZFR output is removed using window lengths estimated from the average pitch period of every fixed, non-overlapping frame of speech. The rest of the paper is organized as follows: Section II describes the conventional ZFF algorithm. Section III explains the modified ZFF method for estimating epochs from emotional speech. A comparison of the instantaneous F0 estimated using the conventional and modified ZFF methods is given in Section IV, and Section V summarizes the work along with the scope for future work.

II. CONVENTIONAL ZFF METHOD FOR EPOCH ESTIMATION

The algorithmic steps to estimate the epochs in speech by ZFF, hereafter termed conventional ZFF, are as follows [1]:

1) Difference the input speech signal z(n):

   x(n) = z(n) - z(n-1)    (1)

2) Compute the output of a cascade of two ideal digital resonators at 0 Hz:

   y(n) = \sum_{k=1}^{4} a_k y(n-k) + x(n)    (2)

   where a_1 = 4, a_2 = -6, a_3 = 4, a_4 = -1.

3) Remove the trend, i.e.,

   \hat{y}(n) = y(n) - \bar{y}(n)    (3)

   where \bar{y}(n) = \frac{1}{2N+1} \sum_{m=-N}^{N} y(n+m) and 2N+1 corresponds to the average pitch period computed over a longer segment of speech.

The trend removed signal \hat{y}(n) is termed the zero frequency filtered signal (ZFFS). The positive zero crossings of the filtered signal give the locations of the epochs. These epochs are periodically located in voiced speech, representing the glottal closure instants, and are randomly located in unvoiced speech, representing the onsets of bursts or frication [1].
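The following is a minimal sketch of these steps in Python with NumPy/SciPy. The function name, the single-pass trend removal and the assumption that the average pitch period is supplied as an argument are illustrative choices of this sketch, not details fixed by the conventional ZFF description above.

```python
import numpy as np
from scipy.signal import lfilter

def conventional_zff(speech, fs, avg_pitch_period_s):
    """Sketch of conventional ZFF epoch extraction.
    avg_pitch_period_s: average pitch period of the utterance in seconds (assumed known)."""
    # Step 1: difference the speech signal to remove any slowly varying bias.
    x = np.diff(speech, prepend=speech[0])
    # Step 2: cascade of two ideal resonators at 0 Hz, i.e.
    # y(n) = 4 y(n-1) - 6 y(n-2) + 4 y(n-3) - y(n-4) + x(n).
    y = lfilter([1.0], [1.0, -4.0, 6.0, -4.0, 1.0], x)
    # Step 3: remove the trend by local mean subtraction with a window of
    # 2N+1 samples, equal to the average pitch period.
    N = max(1, int(round(avg_pitch_period_s * fs / 2)))
    win = 2 * N + 1
    trend = np.convolve(y, np.ones(win) / win, mode='same')
    zffs = y - trend  # zero frequency filtered signal (ZFFS)
    # Epochs: negative-to-positive zero crossings of the ZFFS.
    epochs = np.where((zffs[:-1] < 0) & (zffs[1:] >= 0))[0] + 1
    return zffs, epochs
```

For example, on 8 kHz speech with an assumed average pitch period of 5 ms, conventional_zff(s, 8000, 0.005) would return the ZFFS and the hypothesized epoch locations in samples.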

Figure 1 shows the epochs estimated from a voiced segment and an unvoiced segment of speech. It can be noted that the ZFFS gives periodic zero crossings in the voiced segment and random zero crossings in the unvoiced segment.

Fig. 1: Epochs from voiced and unvoiced segments of speech. (a)-(c) A voiced speech segment, its ZFFS and epochs. (d)-(f) An unvoiced speech segment, its ZFFS and epochs.

The epoch estimation performance is evaluated for five different emotions (Neutral, Angry, Happy, Boredom and Fear) of the German emotional speech corpus, which has simultaneous EGG recordings [11]. Speech files from the 10 speakers and 10 texts of the corpus were used for each emotion in the performance evaluation. The following measures are used to evaluate the epochs estimated from the speech [8]; a sketch of how they can be computed is given after the discussion of Table I.

Larynx cycle: the range of samples (1/2)(l_{r-1} + l_r) < n < (1/2)(l_{r+1} + l_r), where l_r, l_{r-1} and l_{r+1} are the current, preceding and succeeding reference epoch locations, respectively.
Identification Rate (IDR): the percentage of larynx cycles for which exactly one epoch is detected.
Miss Rate (MR): the percentage of larynx cycles for which no epoch is detected.
False Alarm Rate (FAR): the percentage of larynx cycles for which more than one epoch is detected.
Identification Error (ζ): the timing error between the reference and detected instants of significant excitation in larynx cycles for which exactly one epoch was detected.
Identification Accuracy (IDA) (σ): the standard deviation of the identification error ζ. Small values of σ indicate high identification accuracy.

TABLE I: Epoch estimation performance of the conventional ZFF and DYPSA algorithms for different emotional speech signals taken from the German database [11].

Method  Emotion   IDR (%)  MR (%)  FAR (%)  IDA (ms)
ZFF     Neutral   99.2     .8      .79      .394
ZFF     Angry     87.93    .4      .66      .45
ZFF     Happy     9.66     .33     9.2      .3858
ZFF     Boredom   98.75    .4      .2       .3495
ZFF     Fear      94.9     .3      4.97     .2774
DYPSA   Neutral   96.25    .84     2.92     .3727
DYPSA   Angry     88.43    5.      6.46     .3824
DYPSA   Happy     87.87    4.68    7.45     .3828
DYPSA   Boredom   96.66    .63     2.7      .457
DYPSA   Fear      88.62    4.38    7.       .4297

The performance of the conventional ZFF method is tabulated in Table I. As expected, it gives better epoch estimates for neutral speech, and the performance degrades for the other emotions, except boredom. The reason is the rapid pitch variation in emotional speech compared to neutral speech: with the fixed window size used for local mean subtraction, some of the epochs are either missed or spuriously hypothesized. For this reason an appropriate window size should be selected for optimum epoch estimation from speech signals with larger pitch variations. As the pitch variation of the boredom emotion is similar to that of neutral speech, the epoch detection performance for the two emotions is nearly the same. The epoch estimation performance for the various emotions is also compared with the DYPSA algorithm, another popular method for estimating epochs [8]. The DYPSA algorithm used in this work is the implementation provided in the Voicebox speech processing toolbox. The epochs estimated using DYPSA show the same trend, confirming that the degradation is due to the large pitch variations in emotional speech.
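To make the above definitions concrete, the following is a minimal sketch of how the measures can be computed from a reference epoch sequence (e.g., derived from the EGG) and a detected epoch sequence; the function and argument names, and the reporting of IDA in milliseconds, are assumptions of this sketch.

```python
import numpy as np

def epoch_metrics(ref_epochs, det_epochs, fs):
    """Sketch of IDR/MR/FAR/IDA computation over larynx cycles.
    ref_epochs, det_epochs: epoch locations in samples; fs: sampling rate in Hz."""
    ref = np.asarray(ref_epochs, dtype=float)
    det = np.asarray(det_epochs, dtype=float)
    hits, misses, false_alarms = 0, 0, 0
    errors = []
    # One larynx cycle per reference epoch l_r, bounded by the midpoints to the
    # preceding and succeeding reference epochs.
    for r in range(1, len(ref) - 1):
        lo = 0.5 * (ref[r - 1] + ref[r])
        hi = 0.5 * (ref[r] + ref[r + 1])
        inside = det[(det > lo) & (det < hi)]
        if len(inside) == 1:
            hits += 1
            errors.append((inside[0] - ref[r]) / fs * 1000.0)  # timing error in ms
        elif len(inside) == 0:
            misses += 1
        else:
            false_alarms += 1
    cycles = max(hits + misses + false_alarms, 1)
    idr = 100.0 * hits / cycles           # identification rate (%)
    mr = 100.0 * misses / cycles          # miss rate (%)
    far = 100.0 * false_alarms / cycles   # false alarm rate (%)
    ida = float(np.std(errors)) if errors else 0.0  # std of timing error (ms)
    return idr, mr, far, ida
```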
A similar trend in the epoch estimation performance can be observed on a Hindi emotional speech database collected from four speakers (2 male and 2 female) in four different emotions (Neutral, Angry, Happy and Boredom), also with simultaneous EGG recordings. Ten randomly selected sentences from the Hindi broadcast news database were used for the emotional speech recording. As the speech was recorded in three sessions, there are 120 files (3 x 4 x 10) available for each emotion. Table II shows the epoch estimation performance obtained for the Hindi emotional speech database. Degradation in the epoch estimation performance can again be observed for the angry and happy emotions. Even though the level of degradation is different, it follows a similar trend as in the German emotional speech corpus.

TABLE II: Epoch estimation performance of the conventional ZFF method on the Hindi emotional speech database.

Emotion   IDR (%)  MR (%)  FAR (%)  IDA (ms)
Neutral   99.82    .32     .4       .3
Angry     96.7     .62     2.68     .346
Happy     92.7     .2      7.62     .342
Boredom   99.78    .2      .2       .2984

TABLE III: Epoch estimation performance of the modified ZFF method, with the window length updated for every 25 ms segment of speech, for the different emotions of the German database.

Emotion   IDR (%)  MR (%)  FAR (%)  IDA (ms)
Neutral   99.56    .4      .29      .2493
Angry     94.47    .4      5.2      .3746
Happy     94.36    .48     5.6      .3622
Boredom   99.57    .3      .4       .2682
Fear      96.95    .26     3.5      .2792

TABLE IV: Epoch estimation performance of the refined ZFF method (with low pass filtering) on the various emotions of the German database.

Emotion   IDR (%)  MR (%)  FAR (%)  IDA (ms)
Neutral   99.6     .9      .29      .2422
Angry     96.2     .37     3.43     .3569
Happy     95.27    .43     4.3      .3544
Boredom   99.55    .6      .39      .2688
Fear      96.95    .25     2.8      .272

TABLE V: Epoch estimation performance of the refined ZFF method on the various emotions of the Hindi emotional speech database.

Emotion   IDR (%)  MR (%)  FAR (%)  IDA (ms)
Neutral   99.56    .64     .37      .2549
Angry     98.33    .68     .99      .2668
Happy     99.27    .8      .65      .233
Boredom   99.54    .5      .4       .2799

III. MODIFIED ZFF METHOD FOR EPOCH ESTIMATION IN EMOTIONAL SPEECH

To reliably estimate the epochs from emotional speech, a refinement of the conventional ZFF algorithm is developed here. Instead of using a fixed average pitch period as the window length for trend removal of the ZFR output, the window length is updated for every short time segment of length 20-30 ms. A robust method for finding the F0 values from the ZFFS of acoustically degraded speech is described in [12]. In this method, F0 is computed as the frequency corresponding to the highest magnitude in the short time Fourier transform (STFT) of the ZFFS segment obtained from the conventional ZFF method, and the window length is computed as the reciprocal of this F0 value. This window length is then used for trend removal of the ZFR output to obtain the ZFFS for that particular speech segment (a sketch of this per-segment estimation is given below). In the present work this approach is applied to emotional speech, which can be treated as degraded speech due to the change in the psychological state of the speaker. The performance of this method for epoch estimation in emotional speech is given in Table III. The performance improves significantly compared to the fixed window case, demonstrating the significance of a variable window length for trend removal in the case of emotional speech.

Figure 2 compares the epoch estimation using the conventional ZFF method and the proposed modified ZFF method. Figure 2(a) shows the ZFFS of a segment of angry speech, and the corresponding zero crossings, termed epochs, are plotted in Figure 2(b). The fixed window length results in spurious zero crossings. Figure 2(d) shows the ZFFS for the same segment obtained using the method given in [12]. Even though the number of spurious zero crossings is reduced, some spurious zero crossings remain. To analyze this, the STFT magnitudes of the ZFFS for both methods are plotted in Figures 2(c) and 2(f): the magnitudes of the harmonics beyond the fundamental frequency are comparatively strong. To alleviate this problem, each trend removed ZFFS segment is passed through a low pass filter with a cutoff frequency equal to 1.5 F0, which suppresses the pitch harmonics beyond F0, as shown in Figure 2(i). This results in a further reduction of spurious zero crossings, as shown in Figure 2(h). All the ZFFS segments obtained in this way are concatenated to form the modified ZFFS, and the modified epochs are estimated by finding the positive zero crossings of the modified ZFFS. The performance after low pass filtering of the ZFFS is given in Table IV; it improves further with low pass filtering.

Fig. 2: Comparison of the conventional ZFF and the modified ZFF approaches. (a)-(c) The ZFFS obtained from a voiced segment of angry speech showing the spurious zero crossings, its epochs and its STFT magnitude spectrum. (d)-(f) The modified ZFFS obtained by updating the window length, its epochs and its STFT magnitude spectrum. (g) The ZFFS obtained by low pass filtering the modified ZFFS segments, (h) the estimated epochs and (i) the STFT magnitude spectrum showing no frequency components beyond F0.
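A minimal sketch of the per-segment F0 and window-length estimation follows; the FFT length, the Hann window, and the 50-600 Hz search band for the spectral peak are assumptions introduced only for the example and are not specified in the text above.

```python
import numpy as np

def segment_f0_from_zffs(zffs_segment, fs, nfft=8192, fmin=50.0, fmax=600.0):
    """Estimate F0 of a ZFFS segment as the frequency of the strongest STFT peak;
    restricting the search to a plausible pitch band is an assumption of this sketch."""
    spectrum = np.abs(np.fft.rfft(zffs_segment * np.hanning(len(zffs_segment)), nfft))
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)
    f0 = freqs[band][np.argmax(spectrum[band])]
    window_len = int(round(fs / f0))  # trend-removal window = one pitch period in samples
    return f0, window_len
```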
Thus the proposed modified ZFF method employs both a variable window length and low pass filtering for epoch estimation. The steps in the modified ZFF method can be summarized as follows (a sketch of the resulting pipeline is given below):

1) Compute the ZFFS using the conventional ZFF method.
2) Compute F0 as the frequency with the highest magnitude in the STFT of each 25 ms non-overlapping ZFFS frame.
3) Derive the window length for each frame as the reciprocal of its F0.
4) Remove the trend in the corresponding segment of the resonator output y(n) (Equation (2)) using a moving average filter of length equal to the window length of that segment, as given in Equation (3).
5) Low pass filter each trend removed ZFFS segment with a cutoff frequency equal to 1.5 F0.
6) Concatenate all the modified ZFFS segments to obtain the modified ZFFS signal.
7) Hypothesize the negative to positive zero crossings of the modified ZFFS as the estimated epoch locations.

Table V shows the improved epoch estimation performance for the Hindi emotional speech database. Even though the modified ZFF method yields a significant improvement in epoch estimation for emotional speech, the performance for the angry, happy and fear emotions is still not comparable with that of neutral or boredom.
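The sketch below strings these steps together; it reuses conventional_zff() and segment_f0_from_zffs() from the earlier sketches, and the frame length, the fourth-order Butterworth low pass filter and the initial pitch-period guess are choices made only for illustration, not details prescribed by the method description.

```python
import numpy as np
from scipy.signal import lfilter, butter, filtfilt

def modified_zff(speech, fs, frame_s=0.025, init_pitch_s=0.005):
    """Sketch of the modified ZFF pipeline (steps 1-7 above)."""
    # Step 1: ZFFS from the conventional method, with an assumed initial pitch period.
    zffs, _ = conventional_zff(speech, fs, init_pitch_s)
    # Keep the raw 0 Hz resonator output so each segment can be re-trended (step 4).
    x = np.diff(speech, prepend=speech[0])
    y = lfilter([1.0], [1.0, -4.0, 6.0, -4.0, 1.0], x)

    frame = int(round(frame_s * fs))
    modified = np.zeros_like(zffs)
    for start in range(0, len(zffs) - frame + 1, frame):
        seg = slice(start, start + frame)
        # Steps 2-3: F0 from the STFT peak of this ZFFS frame; window = 1/F0.
        f0, win = segment_f0_from_zffs(zffs[seg], fs)
        # Step 4: trend removal of y(n) in this segment with the frame-specific window.
        trended = y[seg] - np.convolve(y[seg], np.ones(win) / win, mode='same')
        # Step 5: low pass filter at 1.5 * F0 to suppress harmonics above F0.
        b, a = butter(4, min(0.99, 1.5 * f0 / (fs / 2.0)), btype='low')
        modified[seg] = filtfilt(b, a, trended)
    # Steps 6-7: the segments are already concatenated in 'modified'; epochs are its
    # negative-to-positive zero crossings.
    epochs = np.where((modified[:-1] < 0) & (modified[1:] >= 0))[0] + 1
    return modified, epochs
```

Processing the segments independently, as done here for brevity, can introduce edge effects at frame boundaries; a practical implementation would typically overlap or smooth across segment joins.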

To study why the angry, happy and fear emotions still lag behind neutral and boredom, the glottal waves of these five emotions are analyzed. Figure 3 shows the speech waveform, the glottal wave and the difference of the glottal wave for the five emotions, for the same speaker, text and syllable. The prominent features in the difference glottal wave are impulse like discontinuities, whose amplitudes indicate the intensity with which the closing of the vocal folds occurs. Hence the difference glottal wave may be treated as representative of the strength of excitation. The strength of excitation for emotions like angry, happy and fear is low compared to the neutral and boredom emotions. This is due to the rapid pitch variations and/or the difference in the nature of the vocal fold activity for these emotions. The rapid pitch variations cause the vocal folds to vibrate with lower suction pressure, and hence with reduced strength of excitation. Depending on the psychological state of a particular emotion, the tension of the vocal folds and the associated muscle structure may also differ. Because of these factors, the impulse strength may not be as prominent as in the case of the neutral and boredom emotions, as can be observed by comparing the difference glottal waves of the different emotions. This in turn may reduce the energy around the zero frequency region, leading to spurious epoch detections, and may explain the reduced epoch estimation performance for the angry, happy and fear emotions. Further exploration and understanding is required in this direction.

Fig. 3: Speech waveforms, glottal waveforms and difference glottal waves of neutral ((a)-(c)), angry ((d)-(f)), happy ((g)-(i)), boredom ((j)-(l)) and fear ((m)-(o)), respectively.

IV. F0 ESTIMATION USING THE CONVENTIONAL AND MODIFIED ZFF METHODS

The instantaneous pitch period, or epoch interval, is given by the successive differences between the estimated epoch locations. Taking the reciprocal of each epoch interval and multiplying by the sampling frequency gives the fundamental frequency (F0) [2]; a small sketch of this computation is given at the end of this section. Figure 4 shows the F0 contours derived from the epochs estimated using the conventional and modified ZFF methods for neutral and angry emotional speech signals. It can be observed from Figures 4(a)-(c) that the F0 contours obtained using the conventional and modified ZFF methods remain nearly the same for the neutral emotion. For the angry emotion, Figures 4(d)-(f) indicate that the F0 values obtained using the modified ZFF method are more continuous than those obtained using the conventional ZFF method. Hence the merit of the modified ZFF method. Therefore the estimated epochs and ZFFS obtained using the modified ZFF method are the ones used for prosody modification.
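As a small illustration of this computation, the sketch below derives an instantaneous F0 contour from a sequence of epoch locations; the function and variable names are assumptions.

```python
import numpy as np

def f0_from_epochs(epochs, fs):
    """Instantaneous F0 from epoch locations (in samples):
    each epoch interval is one pitch period, so F0 = fs / interval."""
    epochs = np.asarray(epochs, dtype=float)
    intervals = np.diff(epochs)   # epoch intervals in samples
    f0 = fs / intervals           # one F0 value per interval (Hz)
    times = epochs[1:] / fs       # time stamp of each F0 value (s)
    return times, f0
```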
V. CONCLUSION AND FUTURE WORK

The present work identifies the unreliable epoch estimates obtained from emotional speech by popular epoch extraction methods such as ZFF and DYPSA. As the ZFF method provides more reliable and accurate epoch estimates from neutral speech signals than other methods, the present work proposed a refinement of the conventional ZFF method for reliable epoch estimation from emotional speech signals.

The modified ZFF method for epoch estimation uses both a variable window length for trend removing the ZFR output and low pass filtering of the higher harmonics in the trend removed ZFFS. The performance evaluation indicates the robustness of the modified ZFF method in reliably estimating epochs from emotional speech compared to the conventional ZFF method. The modified ZFF method of epoch estimation can be used as a tool in the emotion analysis stage of neutral to emotion conversion systems for comparing the estimated source features of various emotions. Further exploration is needed to bring the epoch extraction performance for emotional speech signals up to that for neutral speech signals.

VI. ACKNOWLEDGEMENT

This work is part of an ongoing UK-India Education and Research Initiative (UKIERI) project (27-2) on the Study of Source Features for Speech Synthesis and Speaker Recognition between IIT Guwahati, IIIT Hyderabad and CSTR, University of Edinburgh, UK. We would also like to thank Prof. Felix Burkhardt for providing the EGG recordings of the German emotional speech database.

Fig. 4: Comparison of the F0 contours obtained using the conventional and refined ZFF methods. (a)-(c) The F0 contour obtained from a neutral speech signal and (d)-(f) from an angry speech signal, using the conventional and refined ZFF methods.

REFERENCES

[1] K. S. R. Murty and B. Yegnanarayana, "Epoch extraction from speech signals," IEEE Trans. Audio, Speech, and Language Processing, vol. 16, no. 8, pp. 1602-1613, Nov. 2008.
[2] B. Yegnanarayana and K. S. R. Murty, "Event-based instantaneous fundamental frequency estimation from speech signals," IEEE Trans. Audio, Speech, and Language Processing, vol. 17, no. 4, pp. 614-625, May 2009.
[3] B. Yegnanarayana and R. N. J. Veldhuis, "Extraction of vocal-tract system characteristics from speech signals," IEEE Trans. Speech and Audio Processing, vol. 6, no. 4, pp. 313-327, July 1998.
[4] K. S. Rao and B. Yegnanarayana, "Prosody modification using instants of significant excitation," IEEE Trans. Audio, Speech, and Language Processing, vol. 14, pp. 972-980, May 2006.
[5] E. A. P. Habets, N. D. Gaubitch, and P. A. Naylor, "Temporal selective dereverberation of noisy speech using one microphone," in Proc. ICASSP, 2008, pp. 4577-4580.
[6] D. Govind, S. R. M. Prasanna, and B. Yegnanarayana, "Neutral to target emotion conversion using source and suprasegmental information," in Proc. INTERSPEECH, Aug. 2011.
[7] S. R. M. Prasanna and D. Govind, "Analysis of excitation source information in emotional speech," in Proc. INTERSPEECH, Sep. 2010, pp. 781-784.
[8] P. A. Naylor, A. Kounoudes, J. Gudnason, and M. Brookes, "Estimation of glottal closure instants in voiced speech using the DYPSA algorithm," IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 1, pp. 34-43, 2007.
[9] R. Smits and B. Yegnanarayana, "Determination of instants of significant excitation in speech using group delay function," IEEE Trans. Speech and Audio Processing, vol. 3, no. 5, pp. 325-333, Sep. 1995.
[10] K. S. Kumar, M. S. H. Reddy, K. S. R. Murty, and B. Yegnanarayana, "Analysis of laugh signals for detecting in continuous speech," in Proc. INTERSPEECH, 2009.
[11] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, "A database of German emotional speech," in Proc. INTERSPEECH, 2005, pp. 1517-1520.
[12] B. Yegnanarayana, S. R. M. Prasanna, and G. Seshadri, "Study of robustness of zero frequency resonator method for extraction of fundamental frequency," in Proc. ICASSP, May 2011.