Sub-band Envelope Approach to Obtain Instants of Significant Excitation in Speech

Vikram Ramesh Lakkavalli, K V Vijay Girish, A G Ramakrishnan
Medical Intelligence and Language Engineering (MILE) Laboratory
Department of Electrical Engineering
Indian Institute of Science, Bangalore, 560012, INDIA
Email: vikram.ckm@gmail.com, kv@ee.iisc.ernet.in, ramkiag@ee.iisc.ernet.in

Abstract - In this paper, we propose a new sub-band approach to estimate glottal activity. The method is based on the spectral harmonicity and the sub-band temporal properties of voiced speech. We propose a method to represent the glottal excitation signal using sub-band temporal envelopes. Instants of maximum glottal excitation, or glottal closure instants (GCIs), are extracted from the estimated glottal excitation pattern, and the result is compared with a standard GCI computation method, DYPSA [1]. The performance of the algorithm is also evaluated on noisy signals, and it is shown that the proposed method's GCI estimates vary less under noisy conditions than those of DYPSA. The algorithm is evaluated on the CMU-ARCTIC database.

Index Terms - glottal closure instant, epoch, GCI, DYPSA, CMU-ARCTIC.

I. INTRODUCTION

Estimating the excitation pattern of the vocal tract helps us understand the interaction between the vocal tract and the source in speech production. One such representation of the source signal is the electro-glottograph (EGG) signal, which indicates the area of contact between the vibrating vocal folds; it is thus a representation of the variation of air pressure below the glottis. Vocal tract excitation is maximum when the glottis closes abruptly, and this excitation is represented by one of the peaks in the speech signal. The instant of maximum excitation is used in many applications, including speech coding, speech modification, synthesis, and duration modification. To extract the instants of maximum excitation in the speech signal, properties of the glottal closure instant (GCI) have been exploited, such as its singularity property [3] and the phase slope of the linear prediction residual [1]. In our approach, the estimated excitation pattern is used to locate the GCIs.

The human speech production mechanism is shown in Fig. 1.

Fig. 1. Simplistic view of the speech production model (lungs, pharynx, nasal cavity, and oral cavity producing the speech output).

Production of speech may be viewed from different perspectives. The source-filter model proposed by G. Fant [10] is one such model, which assumes that the speech signal is generated by a source signal exciting a linear filter, where the source signal is the glottal excitation signal and the filter models the vocal tract. It is known that the linear prediction (LP) parameters of the speech signal give an approximation to the vocal tract shape involved in the production of speech. Speech production may also be viewed through an AM-FM model, proposed by Maragos et al. [8], where the speech signal is viewed as a combination of modulated signals.

In the source-filter model of speech production, two factors are involved in speech production, namely, the excitation signal (source) and the vocal tract transfer function (filter). Hence, extracting one of them essentially needs a reliable assumption about the other. The earliest work on estimating the GCI based on the LP residual technique is by Ananthapadmanabha et al. [2]. In that approach, it is shown that the LPC residual may provide only sub-optimal GCI information.
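To make the LP-residual idea above concrete, the following is a minimal sketch (not the method of [2] itself) of computing the prediction error via the autocorrelation method; in voiced speech, strong peaks of the residual tend to cluster near GCIs. The prediction order and the assumption of a single pre-framed signal x are illustrative.

```python
# Minimal LP-residual sketch (illustrative baseline, not the paper's method).
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_residual(x, order=16):
    """Linear-prediction residual of a (windowed) speech segment x."""
    # autocorrelation method: solve the Toeplitz normal equations R a = r
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    # inverse filter A(z) = 1 - sum_k a_k z^{-k}; its output is the residual
    return lfilter(np.concatenate(([1.0], -a)), [1.0], x)

# peaks of np.abs(lp_residual(x)) in voiced segments lie near the GCIs
```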
Another method, based on the phase slope information of the LP residual, is discussed by Smits et al. [4], where the positive zero-crossings of the phase indicate the glottal closure instants. This is further investigated by Kounoudes et al. [1] to propose the DYPSA

algorithm, where dynamic programming is employed to correct the baseline phase-slope-based pitch mark algorithm by minimizing the pitch deviation cost and the phase slope cost.

Wavelet analysis has also been employed for the detection of GCIs, based on its singularity detection property, as GCIs are associated with singularities. The method in [3] does not yield good results for soft glottal closures, such as at voice onsets and offsets. In this method, the lines of maximum amplitude in each wavelet band are tracked dynamically to arrive at the GCIs. This method also makes the fundamental assumption that the speech signal has predominantly negative peaks, which is equivalent to assuming the polarity of the pitch marks. Sub-band analysis of speech to find the pitch frequency (F0) is discussed in [5] and [6], both using auditory models of speech perception.

In this paper, we derive a representation of the excitation pattern of the vocal tract using sub-band motivated processing. To validate our claim, GCIs are extracted from the estimated excitation pattern, and the result is compared with the baseline GCIs obtained from the EGG signal and with the DYPSA algorithm. In order to test the robustness of the algorithm, DYPSA and the proposed method are also tested on noisy data. All the experiments are carried out on the CMU-ARCTIC database.

II. PROPOSED METHOD

Fig. 2. GCI detection based on sub-band information (speech signal s(t) -> sub-band decomposition into S_1(t), S_2(t), S_3(t), ..., S_N(t) -> dynamic weighted sum -> combined sub-band envelope -> local peak picking -> refinement -> excitation pattern and glottal closure instants).

First, we show that the peaks of the sub-band envelope (SBE) represent the instants of maximum excitation. Let v(t) represent the vocal tract transfer function, and e(t) the excitation signal. The speech signal s(t) may be written as s(t) = e(t) * v(t). Let s_k(t) be the filtered speech signal around a centre frequency \omega_k, which may be written as

s_k(t) = e(t) * v(t) * h_k(t)    (1)

where h_k(t) is the impulse response of the filter selecting the speech signal around the frequency \omega_k, and * denotes the convolution operation. Since e(t) is considered to be a sequence of impulses placed at the excitation instants, the speech signal is harmonic in \omega_0 = 2\pi/T. Considering the speech signal in the k-th band, we write (1) as

s_k(t) = e(t) * v_k(t),  where  v_k(t) = v(t) * h_k(t)    (2)

In the frequency domain, we may write

S_k(\omega) = E(\omega) V_k(\omega)    (3)

Since e(t) is assumed to be a sequence of impulses, that is, e(t) = \sum_r \delta(t - rT), we have

S_k(\omega) = \{\sum_r \delta(\omega - r\omega_0)\} V_k(\omega)    (4)

Here, the excitation pulses are assumed to be placed at a regular interval of T for ease of analysis. Now, considering only the harmonics of the excitation signal in the k-th band (assuming 2K+1 harmonics, and \omega_k \approx m\omega_0), we have

e_k(t) = \exp(-j(m-K)\omega_0 t) + ... + \exp(-j(m-1)\omega_0 t) + \exp(-jm\omega_0 t) + \exp(-j(m+1)\omega_0 t) + ... + \exp(-j(m+K)\omega_0 t)    (5)

e_k(t) = \exp(-jm\omega_0 t) [1 + 2(\cos(\omega_0 t) + \cos(2\omega_0 t) + ... + \cos(K\omega_0 t))]    (6)

The envelope is defined by the term 1 + 2(\cos(\omega_0 t) + \cos(2\omega_0 t) + ... + \cos(K\omega_0 t)), and it is easy to notice that the excitation has local maxima at t = rT. Now consider the weighting introduced by the vocal tract on the envelope. The envelope may be approximated by

C_k(t) \approx a_0 + 2(a_1 \cos(\omega_0 t) + a_2 \cos(2\omega_0 t) + ... + a_K \cos(K\omega_0 t)),   a_i \geq 0    (7)

Extracting the envelope information from each band of the signal, we obtain a representation of the excitation signal in each band.
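The claim in (5)-(6) is easy to verify numerically: the magnitude envelope of a band of harmonics peaks exactly at the excitation instants t = rT. The values of f0, fs, m, and K below are illustrative assumptions, not taken from the paper.

```python
# Numerical check of Eqs. (5)-(6): the envelope of a harmonic band
# attains its maxima at multiples of the pitch period T = 1/f0.
import numpy as np

f0, fs = 100.0, 16000            # fundamental (Hz) and sampling rate (Hz)
m, K = 10, 3                     # band centred on the 10th harmonic, +/- K
t = np.arange(0, 0.03, 1.0 / fs)
w0 = 2 * np.pi * f0

# e_k(t): sum of complex harmonics (m-K)w0 ... (m+K)w0, Eq. (5)
e_k = sum(np.exp(-1j * (m + q) * w0 * t) for q in range(-K, K + 1))

# its magnitude equals |1 + 2*sum_k cos(k w0 t)|, Eq. (6)
env = np.abs(1 + 2 * sum(np.cos(k * w0 * t) for k in range(1, K + 1)))
assert np.allclose(np.abs(e_k), env)

peaks = t[np.isclose(env, env.max())]
print(np.round(peaks * f0, 3))   # ~ integers r, i.e. maxima at t = r*T
```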

The source excitation pattern of speech is computed as the sum of the individual excitation patterns obtained from each sub-band:

C(t) = \sum_{k=1}^{N} C_k(t)    (8)

The algorithm is explained through the block diagram shown in Fig. 2. Speech is decomposed into sub-bands, and the envelope information in each band is obtained. The sub-band envelope is extracted by considering the peak values between successive zero-crossings in the sub-band speech signal. These points are interpolated using cubic spline interpolation to obtain a smoothed sub-band temporal envelope. Extraction of the sub-band temporal envelope is shown as a block diagram in Fig. 3.

Fig. 3. Extracting the envelope from each sub-band (sub-band signal -> full wave rectification -> zero-crossing points -> peak picking between zero-crossings -> interpolating the peaks -> sub-band envelope).

III. IMPLEMENTATION

Before starting the process, we first identify the voiced and unvoiced parts of the speech signal, and take the voiced portions for detecting pitch marks or GCIs. A linear-phase FIR filter bank is then designed, and the speech signal is filtered with only the lower frequency bands, since the other bands are found not to contribute much to the robustness of the GCI estimate. The envelope of local maxima of each filtered signal is then taken, and the unvoiced regions are set to zero to prevent detection of pitch in unvoiced regions. The signal is then considered frame by frame for further analysis. Transitions in each sub-band signal are then estimated, and only those bands having a higher transition rate are considered to find the GCIs; this corresponds to the dynamic weighting indicated in Fig. 2. The processed dynamically weighted signal is the estimated excitation pattern.

On the processed dynamically weighted signal, the local maxima are found; these are the contenders for the pitch marks. These contenders include many extra detections besides the potential pitch marks. The refinement of the contenders for pitch marks is carried out by exploiting the local periodicity and the relative amplitudes of the successive local maxima. The local pitch period is found by considering the average time-difference between consecutive maxima (which lie within the range of minimum and maximum possible pitch periods) around the point of consideration.

Fig. 4. Extraction of GCI from clean speech (the black curve is the processed dynamically weighted signal; the blue curves are the signals selected for addition; red peaks are the estimated GCIs; the green curve is the EGG signal; cyan peaks are the GCIs detected from the EGG signal).

Fig. 5. Extraction of GCI from a noisy signal with SNR = 0 dB. Colour conventions are the same as in Fig. 4.
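The per-band envelope extraction of Fig. 3 can be sketched as follows, assuming scipy is available. The band-pass design via firwin, the tap count, and the band edges are illustrative assumptions; the paper's actual filter-bank parameters are not reproduced here.

```python
# Sketch of Fig. 3: rectify a band-passed signal, pick the peak between
# successive zero-crossings, and cubic-spline interpolate those peaks.
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import firwin, lfilter

def subband_envelope(x, fs, band, numtaps=101):
    """Smoothed temporal envelope of x in one frequency band (Hz pair)."""
    h = firwin(numtaps, band, pass_zero=False, fs=fs)  # linear-phase band-pass
    xb = lfilter(h, [1.0], x)
    r = np.abs(xb)                                     # full-wave rectification
    # indices where the band signal changes sign (zero-crossings)
    zc = np.where(np.signbit(xb[:-1]) != np.signbit(xb[1:]))[0]
    # peak of |xb| between each pair of successive zero-crossings
    pk = np.array([a + np.argmax(r[a:b]) for a, b in zip(zc[:-1], zc[1:])])
    return CubicSpline(pk, r[pk])(np.arange(len(x)))   # smoothed envelope

# Combined excitation pattern, Eq. (8):
# C = sum(subband_envelope(x, fs, band) for band in bands)
```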

TABLE I
COMPARISON OF GCI DETECTION ACCURACY AND EXTRA DETECTIONS ON THE CMU-ARCTIC DATABASE WITHOUT NOISE

  Method     Detection accuracy (%)    Extra detections (%)
  Proposed           9.                       1.73
  DYPSA              9.7                      .1

Fig. 6. Extraction of instants of minimum excitation energy from a clean speech signal (the black curve is the speech signal; the magenta curve is the processed dynamically weighted signal; blue peaks are the estimated GCIs; the green curve is the EGG signal; cyan peaks are the GCIs detected from the EGG signal; red peaks are the minimum excitation points).

IV. FINDING INSTANTS OF MINIMUM EXCITATION ENERGY IN VOICED SPEECH

The instants of minimum excitation energy in voiced speech are important, as they represent the time instants at which the glottis is completely open and the excitation energy is minimum. These instants are used in unit concatenation for the MILE-TTS synthesis system. The minimum excitation energy is useful because any concatenation in a higher excitation energy region of voiced speech is prone to degrading the naturalness of the output speech, whereas concatenation at the minimum excitation instants does not pose such challenges. Concatenation based on the instants of minimum excitation energy is implemented in MILE-TTS [11]. A minimum excitation instant is estimated from the excitation pattern as the instant before the estimated GCI where the derivative of the envelope is minimum; alternatively, it can be taken as the instant of zero-crossing in the speech signal occurring before the estimated GCI. The instants of minimum excitation energy and their detections are shown in Fig. 6.

V. EVALUATION OF GCI ACCURACY

The GCI is detected from the estimate of the excitation signal obtained using the proposed analysis of the speech signal. From Fig. 4, we may see that the peaks of the estimated excitation pattern correspond to GCIs. Evaluation of the accuracy of GCI detection is carried out on the CMU-ARCTIC database. The recordings consist of the EGG signal along with the corresponding speech signal, sampled at the rate of 32 kHz. First, the ground truth for the glottal closure instants is obtained from the recorded EGG signal. The accuracy is then reported based on the deviation of the estimated GCI position with respect to the reference obtained from the EGG signal. A deviation of up to 1 millisecond is generally taken as the margin within which a detection is considered accurate. Extra detections indicate the number of GCIs detected over and above those detected using the EGG signal.

VI. RESULTS

Table I compares the detection accuracy (deviation within 1 ms w.r.t. the GCI from the EGG signal) and the percentage of extra detections using our SBE method and the DYPSA algorithm on the clean database. It is observed from Table I that the SBE method has accuracy comparable to that of DYPSA on the clean speech database. Fig. 7 compares the accuracy and extra detections of the SBE and DYPSA algorithms for various values of signal-to-noise ratio. It is observed that our method outperforms the DYPSA algorithm as the SNR decreases. Fig. 8 shows the histogram of the number of estimated GCIs for the CMU-ARCTIC database, with four bins for deviations within 1 ms, between 1-2 ms, between 2-3 ms, and above 3 ms. It is seen from Fig. 8 that when noise is added, most of the GCIs remain within a 2 ms deviation using our proposed method, whereas many GCIs have a deviation greater than 2 ms using the DYPSA algorithm.
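The accuracy measure of Sec. V can be sketched as below: an estimated GCI counts as correct if it lies within 1 ms of a reference (EGG-derived) GCI, and unmatched estimates count as extra detections. The array contents are hypothetical sample indices, and the percentage normalisation is an assumption, since the paper does not spell it out.

```python
# Sketch of the GCI evaluation of Sec. V (1 ms tolerance w.r.t. EGG).
import numpy as np

def gci_accuracy(est, ref, fs, tol_ms=1.0):
    tol = tol_ms * 1e-3 * fs                         # tolerance in samples
    d = np.abs(est[:, None] - ref[None, :])          # |est_i - ref_j|
    accuracy = 100.0 * (d.min(axis=0) <= tol).mean() # refs that were found
    extras = 100.0 * (d.min(axis=1) > tol).mean()    # estimates with no ref
    return accuracy, extras

est = np.array([102, 420, 742, 1068, 1190])          # hypothetical estimates
ref = np.array([100, 421, 740, 1060])                # hypothetical EGG GCIs
print(gci_accuracy(est, ref, fs=32000))              # -> (100.0, 20.0)
```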
VII. DISCUSSION

The proposed SBE method makes few assumptions in estimating reliable epoch information. First, it does not depend upon explicit pitch information; however, pitch information is estimated from the excitation pattern to prune spurious GCIs. Second, the algorithm is simple and cost-effective for real-time implementation, involving only a few filtering operations and interpolation. The proposed algorithm is compared with DYPSA on both noisy and clean speech, and the results show that the SBE algorithm outperforms DYPSA on noisy speech. This shows that the algorithm is robust and may be employed in a real-time scenario.
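The noisy-speech comparison of Fig. 7 amounts to scaling white noise to a target SNR before running the detectors. A minimal sketch is given below; detect_gci() is a hypothetical stand-in for either the SBE method or DYPSA, and the SNR range is illustrative.

```python
# Sketch of the noisy-speech test setup behind Fig. 7.
import numpy as np

def add_noise(x, snr_db, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    n = rng.standard_normal(len(x))
    # scale the noise so that 10*log10(P_signal / P_noise) == snr_db
    scale = np.sqrt(np.mean(x**2) / (np.mean(n**2) * 10 ** (snr_db / 10.0)))
    return x + scale * n

# for snr in (0, 5, 10, 15, 20, 25, 30, 35):
#     est = detect_gci(add_noise(x, snr), fs)     # hypothetical detector
#     print(snr, gci_accuracy(est, ref, fs))
```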

Also, the SBE algorithm gives us the flexibility to estimate the instants of minimum excitation energy, which is not discussed further here. The algorithm is employed for pitch-synchronous unit concatenation [11] in MILE-TTS.

Fig. 7. Accuracy and percentage of extra detections (within 1 ms) as a function of SNR in dB, for the proposed method and DYPSA.

Fig. 8. Histograms showing the number of detected GCIs vs. the deviation from those detected from the EGG signal: (a) results on clean speech; (b) results on noisy speech with SNR = 0 dB. 32 samples are equivalent to 1 ms.

VIII. CONCLUSION

We have proposed a new method to estimate the glottal closure instants. The method estimates the glottal excitation pattern to arrive at the glottal closure instants. The excitation pattern obtained also gives a handle to estimate the instants of minimum excitation, which find application in speech unit concatenation. The results of the proposed method are promising, and the GCI estimation is robust to noise.

REFERENCES

[1] A. Kounoudes, P. A. Naylor, and M. Brookes, "The DYPSA algorithm for estimation of glottal closure instants in voiced speech," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2002, pp. I-349-I-352.
[2] T. V. Ananthapadmanabha and B. Yegnanarayana, "Epoch extraction from linear prediction residual for identification of closed glottis interval," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 27, no. 4, 1979, pp. 309-319.
[3] N. Sturmel, C. d'Alessandro, and F. Rigaud, "Glottal closure instant detection using lines of maximum amplitudes of the wavelet transform," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2009, pp. 4517-4520.
[4] R. Smits and B. Yegnanarayana, "Determination of instants of significant excitation in speech using group delay function," IEEE Trans. on Speech and Audio Processing, vol. 3, 1995, pp. 325-333.
[5] K. Gopalan, "Pitch estimation using a modulation model of speech," Proc. Int. Conf. on Signal Processing (ICSP), 2000, pp. 786-791.
[6] S. C. Sekhar, S. Pilli, L. C., and T. V. Sreenivas, "Novel auditory motivated subband temporal envelope based fundamental frequency estimation algorithm," 14th European Signal Processing Conference (EUSIPCO 2006), Florence, Italy, September 4-8, 2006.
[7] M. D. Plumpe, T. F. Quatieri, and D. A. Reynolds, "Modeling of the glottal flow derivative waveform with application to speaker identification," IEEE Trans. on Speech and Audio Processing, vol. 7, 1999, pp. 569-586.
[8] A. Potamianos and P. Maragos, "Speech analysis and synthesis using an AM-FM modulation model," Speech Communication, vol. 28, July 1999, pp. 195-209.
[9] D. G. Childers and C. K. Lee, "Vocal quality factors: analysis, synthesis, and perception," The Journal of the Acoustical Society of America, vol. 90, Nov. 1991, pp. 2394-2410.
[10] G. Fant, Acoustic Theory of Speech Production. The Hague, The Netherlands: Mouton, 1960.
[11] V. R. Lakkavalli, P. Arulmozhi, and A. G. Ramakrishnan, "Continuity metric for unit selection based text-to-speech synthesis," IEEE Int. Conf. on Signal Processing and Communications (SPCOM), 2010.