Determination of Pitch Range Based on Onset and Offset Analysis in Modulation Frequency Domain

A. Mahmoodzadeh
Speech Proc. Research Lab, ECE Dept., Yazd University, Yazd, Iran

H. R. Abutalebi
Speech Proc. Research Lab, ECE Dept., Yazd University, Yazd, Iran & Idiap Research Institute (footnote 1), Martigny, Switzerland

H. Soltanian-Zadeh
Control and Intelligent Processing Center of Excellence, University of Tehran, Tehran, Iran & Image Analysis Lab., Henry Ford Health System, Detroit, USA

H. Sheikhzadeh
EE Dept., Amirkabir University of Technology, Tehran, Iran

Abstract: An auditory scene in a natural environment contains multiple sources. Auditory scene analysis (ASA) is the process by which the auditory system segregates a scene into streams corresponding to different sources. Determining the range of the pitch frequency is necessary for segmentation. We propose a system that determines the range of the pitch frequency by analyzing onsets and offsets in the modulation frequency domain. In the proposed system, the modulation spectrum of speech is first calculated, and onsets and offsets are then detected in each subband. Thereafter, segments are generated by matching corresponding onset and offset fronts. Finally, by choosing the desired segments, the range of the pitch frequency is determined. Systematic evaluation shows that the range of the pitch frequency is estimated with good accuracy.

Keywords: pitch frequency; onset/offset algorithm; modulation frequency domain

I. INTRODUCTION

Pitch is defined as the fundamental frequency of quasi-periodic or voiced sounds [1]. In speech signals, pitch is produced by vibrations of the vocal cords. Pitch determination is a fundamental problem that attracts much attention in speech analysis. A robust pitch detection algorithm (PDA) is needed for many applications, including computational auditory scene analysis (CASA), prosody analysis, speech enhancement/separation [2], speech recognition, and speaker identification [3].

Various methods have been proposed for determining the pitch frequency. These methods are generally classified into three categories: time-domain, frequency-domain, and time-frequency domain algorithms. Time-domain PDAs directly examine the temporal structure of a signal waveform and estimate the period of the quasi-periodic signal. They use the autocorrelation function [4], [5], other physical [6], [7] and geometric [8] criteria, least-squares fitting [9], pattern recognition [10], or neural networks [11]. Frequency-domain PDAs use the harmonic structure in the short-term spectrum to distinguish the fundamental frequency. These methods include the harmonic product spectrum, cepstral analysis, and maximum likelihood. Time-frequency domain algorithms perform time-domain analysis on band-filtered signals obtained via a multichannel front end.

In these methods, the pitch frequency is estimated by framing the speech signal and estimating the pitch frequency in each frame. The range of the pitch frequency is then determined from the resulting pitch contour. If the pitch frequency changes within a frame, it will not be estimated correctly. Moreover, these methods face pitch halving and pitch doubling problems. Also, the pitch contour contains values that are assigned to unvoiced regions by interpolation. In addition, interference between noise and the speech signal degrades the performance of such techniques. In recent years, multipitch methods have been presented, but they have their own complexities and problems.

In this paper, we propose a system to determine the range of the pitch frequency by analyzing onsets and offsets in the modulation frequency domain. The proposed method determines the range of the pitch frequency of voiced speech without any windowing. First, the modulation spectrum of speech is calculated using the modulation transform. Then, using the onset and offset algorithm, the onset and offset fronts are detected.
Thereafter, segments are generated by matching the corresponding onset and offset fronts. Finally, by choosing the desired segments, the range of the pitch frequency is determined. By extending this system, we can determine the pitch frequency ranges of two speakers (multi-pitch). This, in turn, can be used for single-channel speech separation.

The fundamental frequency of speech varies from 40 Hz for low-pitched male voices to 350 Hz for children or high-pitched female voices. A speaker's pitch frequency is not constant over time; however, it is bounded within a range. Knowing this range may help in single-channel speech separation.

This paper is organized as follows. In Sections II and III, we give working definitions of modulation frequency analysis and of the onset and offset algorithm. In Section IV, we first give a brief description of our system and then present the details of each stage. The results of the system on the determination of the range of the pitch frequency are reported in Section V. The paper concludes with a discussion in Section VI.

Footnote 1: H. R. Abutalebi was on sabbatical at Idiap Research Institute from Fall 2010 to Summer 2011.

II. MODULATION FREQUENCY ANALYSIS

The general modulation frequency analysis framework consists of a filterbank (possibly decimated), followed by subband envelope detection and frequency analysis of the subband envelopes [12]. In its most straightforward form, the filterbank is implemented using the short-time Fourier transform (STFT), envelope detection is defined as the magnitude or magnitude squared of the subband, and subband envelope frequency analysis is performed with the Fourier transform. For a discrete signal x(n), the STFT can be expressed as

$$X_k(m) = \sum_{n=-\infty}^{\infty} h(mM - n)\, x(n)\, W_K^{-kn}, \qquad k = 0, \ldots, K-1, \tag{1}$$

and the envelope detection and modulation frequency analysis as

$$X_l(k, i) = \sum_{m=-\infty}^{\infty} g(lL - m)\, |X_k(m)|\, W_I^{-im}, \qquad i = 0, \ldots, I-1, \tag{2}$$

where $W_K = e^{j(2\pi/K)}$, and h(n) and g(m) are the acoustic and modulation frequency analysis windows, respectively. Throughout the paper, we use the shorthand notations

$$T\{x(n)\} = X_l(k, i) \tag{3}$$

and

$$T^{-1}\{X_l(k, i)\} = x(n) \tag{4}$$

to refer to modulation frequency analysis and synthesis, respectively.

The magnitude of the subband envelope spectra, |X_l(k, i)|, is typically displayed in a modulation spectrogram representation. The vertical axis of this representation is regular acoustic frequency (k), and its horizontal axis is modulation frequency (i). Gray- or color-scale intensity in the joint acoustic/modulation plane represents modulation spectral energy. The modulation analysis framework is illustrated in Figure 1, and an example of a modulation spectrogram is shown in Figure 2.

III. ONSET AND OFFSET

Onsets and offsets correspond to sudden amplitude increases and decreases [13].
A standard way to identify such intensity changes is to take the first-order derivative of the intensity with respect to modulation frequency and then find the peaks and valleys of the derivative. Because of intrinsic intensity fluctuations, many peaks and valleys of the derivative do not correspond to actual onsets and offsets. To reduce such fluctuations, we smooth the intensity over modulation frequency, as is commonly done in edge detection for image analysis. Smoothing can be performed through either a diffusion process or low-pass filtering. Onsets correspond to the peaks of the derivative above a certain threshold, and offsets to the valleys below a certain threshold. The purpose of thresholding is to remove peaks and valleys corresponding to insignificant intensity fluctuations. The above procedure is similar to the standard Canny edge detector in image processing [14]. An example of the above procedure is shown in Figure 3.

Figure 1. Modulation analysis framework and the modulation spectrogram [12].

Figure 2. Modulation spectrogram of the male speaker (vertical axis: acoustic frequency in Hz; horizontal axis: modulation frequency in Hz).
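
As a rough illustration of the smoothing, differentiation, and thresholding steps described in this section, the following Python sketch processes a one-dimensional intensity profile sampled over modulation frequency. The function names, the moving-average filter (standing in for the low-pass filter), and the threshold value are assumptions made for illustration; they are not the implementation used in this work.

```python
def moving_average(values, width):
    """Smooth a sequence with a centered moving average (a simple FIR low-pass filter)."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def detect_onsets_offsets(intensity, width=3, threshold=0.5):
    """Return (onsets, offsets) as indices into the first-order difference.

    Index i refers to the step between bins i and i + 1 of the intensity.
    Both `width` and `threshold` are illustrative values, not tuned settings.
    """
    smoothed = moving_average(intensity, width)
    # First-order difference approximates the derivative over modulation frequency.
    deriv = [smoothed[i + 1] - smoothed[i] for i in range(len(smoothed) - 1)]
    onsets, offsets = [], []
    for i in range(1, len(deriv) - 1):
        is_peak = deriv[i] >= deriv[i - 1] and deriv[i] > deriv[i + 1]
        is_valley = deriv[i] <= deriv[i - 1] and deriv[i] < deriv[i + 1]
        if is_peak and deriv[i] > threshold:
            onsets.append(i)       # sharp intensity increase
        elif is_valley and deriv[i] < -threshold:
            offsets.append(i)      # sharp intensity decrease
    return onsets, offsets


profile = [0, 0, 0, 6, 18, 30, 30, 30, 18, 6, 0, 0, 0]
onsets, offsets = detect_onsets_offsets(profile, width=3, threshold=5.0)
```

For this toy profile, a single onset is marked on the rising edge and a single offset on the falling edge; raising the threshold above the steepest slope removes both, which is the role thresholding plays in the text.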

Figure 3. The upper panel shows the response intensity, and the lower panel shows the results of onset and offset detection using a low-pass filter.

IV. SYSTEM DESCRIPTION

This work estimates the range of the pitch frequency. The proposed system estimates this range for one speaker by analyzing the signal in the modulation frequency domain with the onset and offset detection algorithm. Although the proposed system determines the pitch frequency range of a single speaker, it can be extended into a system that determines the multipitch frequencies of several speakers in one channel. The location of pitch energy in the modulation spectrogram is an important feature for determining the range of the pitch frequency of speech. Figure 4 shows a block diagram of our proposed system. In the first stage, the modulation spectrum of the speech signal is calculated. Segments are then generated in the modulation domain using the onset and offset algorithm, and finally, in the decision-making stage, the range of the pitch frequency is determined. A detailed description of the stages follows.

A. Cochlear filtering and modulation transform

First, the spectrum of the speech signal is calculated using cochlear filtering and the STFT. Then, using the modulation transform, the modulation spectrum of the speech signal is calculated. The vertical axis of the modulation spectrum is acoustic frequency, and its horizontal axis is modulation frequency. Colour intensity in the joint acoustic/modulation plane represents modulation spectral energy.

B. Smoothing

Smoothing corresponds to low-pass filtering. Our system smooths the intensity over modulation frequency with a low-pass filter. Let v(c, ω, 0, 0) denote the initial intensity at modulation frequency ω in filter channel c. We have

$$v(c, \omega, 0, s) = v(c, \omega, 0, 0) * h(s), \tag{5}$$

where h(s) is a low-pass filter in the modulation frequency domain with passband [0, s] in Hertz, and * denotes convolution. The parameter s indicates the degree of smoothing. The smaller s is, the smoother v(c, ω, 0, s) is.

C. Maximum detection

By detecting the onsets and offsets and forming the onset and offset fronts, the modulation spectrum of the speech signal is segmented. A few of these segments contain information about the range of the pitch frequency. The speakers' pitch ranges must lie within [60, 350] Hz (for men, women, and children). Therefore, in every subband (for each acoustic frequency), the maxima that lie in [60, 350] Hz and exceed a certain threshold are found.

Onset/offset detection and matching: Onset and offset candidates are detected by marking peaks and valleys of the modulation frequency derivative of the smoothed intensity,

$$\frac{d}{d\omega} v(c, \omega, 0, s) = v(c, \omega, 0, 0) * \frac{d}{d\omega} h(s). \tag{6}$$

In each subband, we select the onsets and offsets that occur around each detected maximum. After the onsets and offsets of each subband have been determined, the onset and offset fronts are found, and finally the bounds of the segments are identified.

Figure 4. The block diagram of the proposed system: cochlear filtering, modulation transform, smoothing, maximum detection, onset/offset detection and matching, and decision, which outputs the range of the pitch frequency.
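
The maximum detection and matching steps for a single subband can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names are hypothetical, the admissible band and threshold are passed in by the caller, and "around a maximum" is interpreted here as the nearest onset at or below the maximum and the nearest offset at or above it, which the text does not specify in this detail.

```python
def find_band_maxima(intensity, freqs, band, threshold):
    """Indices of local maxima whose modulation frequency lies in `band`
    (a (low, high) pair in Hz) and whose value exceeds `threshold`."""
    low, high = band
    maxima = []
    for i in range(1, len(intensity) - 1):
        in_band = low <= freqs[i] <= high
        is_local_max = intensity[i] > intensity[i - 1] and intensity[i] >= intensity[i + 1]
        if in_band and is_local_max and intensity[i] > threshold:
            maxima.append(i)
    return maxima


def match_around_maxima(maxima, onsets, offsets):
    """Pair each maximum with the closest onset at or below it and the
    closest offset at or above it (an assumed matching rule).
    Maxima without both an onset and an offset are skipped."""
    segments = []
    for m in maxima:
        below = [o for o in onsets if o <= m]
        above = [o for o in offsets if o >= m]
        if below and above:
            segments.append((max(below), min(above)))
    return segments


# Toy subband: modulation frequency axis in Hz and smoothed intensity per bin.
freqs = [0, 50, 100, 150, 200, 250, 300, 350, 400]
intensity = [1, 2, 9, 3, 2, 6, 1, 0, 8]
maxima = find_band_maxima(intensity, freqs, band=(60.0, 350.0), threshold=4.0)
segments = match_around_maxima(maxima, onsets=[1, 4], offsets=[3, 6, 7])
```

Each resulting pair gives the lower and upper bin of one candidate segment in that subband; grouping such pairs across neighbouring subbands into fronts is left out of this sketch.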

D. Decision-making

In the frequency domain of a speech signal, there are peaks at the pitch frequency and its harmonics. Based on this, in the decision stage we select the segments whose modulation frequency ranges are harmonics of each other. From the onset and offset fronts of the desired segment, and by calculating the mean of the onsets and the mean of the offsets along these fronts, we find the beginning and the end of the range of the pitch frequency.

V. RESULTS

We now present experimental results demonstrating the robustness and accuracy of our method compared to the least-squares harmonic (LSH) model [15] at different SNRs. In [15], an algorithm for harmonic decomposition of time-variant signals is derived from a least-squares harmonic technique. The estimates of harmonic amplitudes and phases are formulated as the solution of a set of linear equations that minimize the mean square error. The experimental results in [15] demonstrate the robustness and accuracy of the LSH method relative to standard algorithms such as RAPT [16] and the maximum a posteriori (MAP) estimator of [17]. The signal frequency is modeled by a linear or quadratic polynomial and obtained via a local search over polynomial coefficients. An initial estimate of the signal frequency is necessary to reduce computation time.

To evaluate the accuracy of the proposed algorithm in pitch range estimation, we chose signals from a corpus of 1 speech signal (male and female speech) that is commonly used for CASA research [18] and from the TIMIT database. First, we selected clean speech x(n) from the TIMIT database (utterance: "one two three four five"). The speech was sampled at 16 kHz. The algorithm parameters were set to M = 16, K = 512, L = 38, and I = 512, and h(n) and g(m) were a 48-point and a 78-point Hanning window, respectively. The additive noise is white noise with zero mean and unit variance. Figures 2 and 5(a) show the modulation spectrum and the signal, respectively, of the initial 0.8 s of the speech ("one").
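
The text does not state how the white noise is scaled to reach a given SNR. One common approach, assumed here purely for illustration, is to draw zero-mean, unit-variance Gaussian noise and rescale it so that the ratio of signal power to noise power matches the target. The function names and the sinusoidal test signal below are hypothetical.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return `signal` plus white Gaussian noise scaled to the requested SNR (dB)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]   # zero mean, unit variance
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Choose the gain so that 10*log10(signal_power / (gain^2 * noise_power)) == snr_db.
    gain = math.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB computed from the clean signal and its noisy version."""
    signal_power = sum(s * s for s in clean)
    noise_power = sum((y - s) ** 2 for s, y in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)


# Stand-in for a speech segment: a 100 Hz tone sampled at 16 kHz.
clean = [math.sin(2.0 * math.pi * 100.0 * n / 16000.0) for n in range(1600)]
noisy_versions = {snr: add_white_noise(clean, snr) for snr in (0, 5, 10, 15)}
```

Scaling by the measured noise power, rather than the nominal unit variance, makes the realized SNR of each finite-length segment match the target value.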
The filter applied for smoothing each subband is an FIR low-pass filter. By selecting a proper threshold, only the maxima above a certain value in each subband are chosen. In this way, we avoid producing undesired segments. The pitch contours of the speech obtained using the LSH model are shown in Figure 5(b) for different SNRs. The exact value and the range of the pitch frequency obtained using the proposed system and the LSH model for clean speech are reported in Table I. The performance of the proposed system and the LSH model in terms of the range of the pitch frequency at different SNRs is reported in Table II. Table III shows the mean error percentage of pitch range estimation for 1 speech signals obtained with the proposed system and the LSH model at different SNRs. Figure 5(b) shows that the LSH method cannot accurately estimate the pitch frequency in the transition region between voiced and unvoiced speech. From these results, and by comparison with the exact range of the pitch frequency, one may deduce that at low SNRs the LSH model becomes erroneous and its accuracy drops, whereas the proposed system still estimates the range of the pitch frequency accurately.

VI. CONCLUSION AND DISCUSSION

We demonstrated that the modulation frequency localization of pitch energy in the modulation spectrogram is an important feature for determining the range of the pitch frequency of speech, and that it can be exploited for single-channel speaker separation. We presented a new approach for determining the range of the pitch frequency based on modulation frequency analysis and the onset/offset algorithm. The proposed method is purposely simple and accurate. The results also show that the accuracy of the proposed algorithm is acceptable in noisy conditions and at different SNRs. By extending the proposed system, we can obtain a new system that uses pitch energy to determine multipitch frequencies in the modulation frequency domain.

TABLE I. THE EXACT AND THE OBTAINED RANGE OF PITCH FREQUENCY USING THE PROPOSED SYSTEM AND LSH MODEL FOR CLEAN SPEECH

                True value   Proposed system   LSH model
Clean speech    [91,125]     [92,126]          [93,126]

TABLE II. OBTAINED RANGE OF PITCH FREQUENCY USING PROPOSED SYSTEM AND LSH MODEL FOR DIFFERENT SNRS

SNR (dB)   Proposed system   LSH model
0          [85,117]          [8,135]
5          [92,123]          [79,126]
10         [92,124]          [93,126]
15         [92,125]          [93,127]

Figure 5. (a) Speaker's signal ("one"); axes: time (s) and amplitude. (b) Pitch contours of the speech signal obtained using the least-squares harmonic model for SNR = 0, 5, 10, and 15 dB; axes: time (s) and frequency (Hz).
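
The paper does not spell out how the error percentage of a pitch range estimate is computed. One plausible definition, assumed here only for illustration, is the mean relative error of the lower and upper bounds. The Python sketch below applies that assumed definition to the clean-speech values of Table I; the function name is hypothetical.

```python
def range_error_percent(true_range, estimated_range):
    """Mean relative error, in percent, of the lower and upper bounds of a range.

    This definition is an assumption; the source does not state its metric.
    """
    errors = [abs(est - ref) / ref for ref, est in zip(true_range, estimated_range)]
    return 100.0 * sum(errors) / len(errors)


true_range = (91.0, 125.0)   # true value for clean speech, Table I
proposed = (92.0, 126.0)     # proposed system, Table I
lsh = (93.0, 126.0)          # LSH model, Table I

proposed_error = range_error_percent(true_range, proposed)
lsh_error = range_error_percent(true_range, lsh)
```

Under this assumed metric, the clean-speech estimate of the proposed system deviates by roughly 0.95% and that of the LSH model by roughly 1.5%.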

TABLE III. THE OBTAINED MEAN ERROR PERCENTAGE OF PITCH RANGE ESTIMATION FOR 1 SPEECH SIGNAL

SNR   Proposed system   LSH model

REFERENCES

[1] P. Vary and R. Martin, Digital Speech Transmission: Enhancement, Coding and Error Concealment, Wiley, 2006.
[2] C. Breithaupt, T. Gerkmann, and R. Martin, "A novel a priori SNR estimation approach based on selective cepstro-temporal smoothing," Proceedings of ICASSP, April 2008.
[3] M. G. Christensen and A. Jakobsson, Multi-Pitch Estimation, Synthesis Lectures on Speech and Audio Processing, Morgan & Claypool Publishers, 2009.
[4] L. R. Rabiner, "On the use of autocorrelation analysis for pitch detection," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 25, no. 1, February 1977.
[5] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," J. Acoust. Soc. Am., vol. 111, no. 4, Apr. 2002.
[6] J. C. Brown and M. S. Puckette, "A high resolution fundamental frequency determination based on phase changes of the Fourier transform," Journal of the Acoustical Society of America, vol. 94, no. 2, 1993.
[7] J. E. Lane, "Pitch detection using a tunable IIR filter," Computer Music Journal, vol. 14, no. 3, pp. 46-59, Fall 1990.
[8] D. Cooper and K. C. Ng, "A monophonic pitch-tracking algorithm based on waveform periodicity determinations using landmark points," Computer Music Journal, vol. 20, no. 3, pp. 70-78, Fall 1996.
[9] A. Choi, "Real-time fundamental frequency estimation by least-square fitting," IEEE Transactions on Speech and Audio Processing, vol. 5, no. 2, March 1997.
[10] J. C. Brown, "Musical fundamental frequency tracking using a pattern recognition method," Journal of the Acoustical Society of America, vol. 92, no. 3, September 1992.
[11] H. Sano and B. K. Jenkins, "A neural network model for pitch perception," Computer Music Journal, vol. 13, no. 3, Fall 1989.
[12] S. Schimmel, L. Atlas, and K. Nie, "Feasibility of single channel speaker separation based on modulation frequency analysis," Proceedings of ICASSP, 2007.
[13] G. Hu and D. L. Wang, "Auditory segmentation based on onset and offset analysis," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 2, February 2007.
[14] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 8, 1986.
[15] Q. Li and L. Atlas, "Time-variant least-squares harmonic modelling," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, Hong Kong, April 2003.
[16] D. Talkin, "A robust algorithm for pitch tracking (RAPT)," in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds. New York, NY: Elsevier, 1995.
[17] J. Tabrikian, S. Dubnov, and Y. Dickalov, "Maximum a posteriori probability pitch tracking in noisy environments using harmonic model," IEEE Trans. Speech Audio Process., vol. 12, 2004.
[18] G. J. Brown and M. P. Cooke, "Computational auditory scene analysis," Comput. Speech Lang., vol. 8, 1994.
