Single channel speech separation in the modulation frequency domain based on a novel pitch range estimation method
RESEARCH Open Access

Single channel speech separation in the modulation frequency domain based on a novel pitch range estimation method

Azar Mahmoodzadeh 1, Hamid Reza Abutalebi 1*, Hamid Soltanian-Zadeh 2,3 and Hamid Sheikhzadeh 4

Abstract

Computational Auditory Scene Analysis (CASA) has been the focus of recent literature on speech separation from monaural mixtures. The performance of current CASA systems on voiced speech separation depends strictly on the robustness of the algorithm used for pitch frequency estimation. We propose a new system that estimates the pitch (frequency) range of a target utterance and separates the voiced portions of the target speech. The algorithm first estimates the pitch range of the target speech in each frame of data in the modulation frequency domain, and then uses the estimated pitch range for segregating the target speech. The pitch range estimation method is based on an onset and offset algorithm. Speech separation is performed by filtering the mixture signal with a mask extracted from the modulation spectrogram. A systematic evaluation shows that the proposed system extracts the majority of the target speech signal with minimal interference and outperforms previous systems in both pitch extraction and voiced speech separation.

Keywords: acoustic frequency, modulation frequency, onset and offset algorithm, pitch range estimation, speech separation

1. Introduction

Speech separation, as a solution to the cocktail party problem, is a well-known challenge with important applications. Consider, for example, telecommunication systems or Automatic Speech Recognition systems that lose performance in the presence of interfering sounds [1,2]. An effective system that segregates speech from interference in monaural (single-microphone) situations can be rewarding in such problems. Many methods have been proposed for monaural speech enhancement; for example, see [3-7].
* Correspondence: habutalebi@yazduni.ac.ir. 1 Speech Processing Research Lab (SPRL), Electrical and Computer Engineering Department, Yazd University, Yazd, Iran. Full list of author information is available at the end of the article. © 2012 Mahmoodzadeh et al; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

These methods usually assume certain statistical properties for the interference and tend to lack the capacity of dealing with a variety of interferences. While machine monaural speech separation works awkwardly, the human auditory system performs proficiently. The underlying perceptual process is known as Auditory Scene Analysis (ASA) [5]. Psychoacoustic research in ASA has inspired considerable work in developing Computational Auditory Scene Analysis (CASA) systems for speech separation (see [6,7] for a comprehensive review).

According to Bregman [5], the ASA procedure can be separated into two theoretical stages: segmentation and grouping. At the first stage, speech is transformed into a higher-dimensional space (such as a time-frequency two-dimensional representation) and then similar time-frequency (T-F) units are segmented in order to compose different regions [6]. In the second stage, these regions are combined into different streams based on the relevant acoustic information. The major computational goal of CASA is to separate the target speech signal from the interference for different purposes, via generating a binary or a soft T-F mask; see, e.g., [8-10]. Grouping itself consists of simultaneous and sequential organizations, which involve grouping of segments across frequency and time. The task of sequential grouping is to group the T-F regions belonging to the same sound source across time. Figure 1 illustrates this issue: the upper panel shows T-F regions grouped into one single stream, as they are close enough in both the time and frequency directions, while the lower panel illustrates the case of two streams of speech, grouped separately because the T-F regions are sufficiently far from each other in the frequency direction. Temporal continuity is an effective cue for grouping T-F regions neighboring in time. However, it cannot handle T-F regions that do not overlap in time due to silence or interference segments. Therefore, sequential grouping of such T-F regions is a very challenging problem (see [11,12] for more details).

Natural speech includes both voiced and unvoiced portions. Voiced portions of speech are characterized by periodicity (or harmonicity), which has been used as an important feature in many CASA systems for segregating voiced speech (see, e.g., [13,14]). Despite considerable advances in voiced speech separation, the performance of current CASA systems is still limited by pitch frequency (F0) estimation errors and residual noise. Various methods have been proposed for robust pitch frequency estimation, see, e.g., [15,16]; however, robust pitch frequency estimation in low signal-to-noise ratio (SNR) situations still poses a significant challenge.

While mixed speech signals may have a great deal of overlap in the time domain, modulation frequency analysis provides an additional dimension that can present a greater degree of separation among sources. In other words, the original T-F representation obtained from transformations like the Short-Time Fourier Transform (STFT) can be augmented with a third dimension that represents modulation frequency. In [17], by assuming that the pitch frequency range is known and constant in each filter channel, modulation spectral analysis is used as a tool for producing the separation mask in this higher-dimensional space. Based on the above observations, we propose a new system for single channel separation of voiced speech based on modulation filtering.
The idea is that, first, the target pitch (frequency) range is estimated in the modulation frequency domain, and then this range is used for producing the proper mask for speech separation. For the following reasons, provided in [18], modulation analysis and filtering are applied to the target speech separation problem. First, there is a general belief that the human ASA system processes sounds in the modulation frequency domain. Second, the energy from two co-channel talkers is largely non-overlapping in the modulation frequency domain. The method of modulation analysis and filtering has been studied extensively by many researchers in the field of single channel speech separation; reference [19] provides a general discussion of this subject.

Figure 1 Segmentation and grouping of speech projected into T-F cells in a 2D representation [6].

At first, the proposed system performs a multipitch range estimation of target and interference speech based on segmentation of the modulation spectrogram. The segmentation is done using an onset and offset algorithm similar to that proposed by Hu and Wang [20]. In the proposed method, the noisy signal is divided into 20-ms time frames and then the proposed speech separation algorithm is applied to each individual frame. The pitch range estimation method works in three stages. The first stage computes the modulation spectrogram. The second stage decomposes the modulation spectrogram into segments using an onset and offset algorithm: first, the peaks and valleys of the derivative of the smoothed intensity of the modulation spectrogram are detected and marked as onset and offset candidates; any onset bigger than a certain threshold is accepted, and for it the smallest offset before the next onset is selected; then, onset and offset fronts are produced by connecting common onsets and offsets; finally, segments are formed by matching the onset and offset fronts. The third stage determines the range of pitch frequency by selecting and grouping the desired segments.

The separation part of the proposed system aims at obtaining a soft mask in the modulation spectrogram domain. By extending the soft mask suggested in [17], a soft mask is proposed whose value depends on the estimated pitch range in each filter channel. To determine the soft mask in each filter channel, we first find and mutually compare the modulation spectrogram energies of target and interference within their pitch ranges estimated in the previous stage. Then, we transform the soft mask to the time domain and filter the mixture signal in order to obtain the separated target signal. Thus, a strategy is suggested that estimates the target pitch range and subsequently segregates the target signal from the interference. Finally, the separated target signal is obtained by arranging the separated signals from each frame in time order.

This article is organized as follows. Section 2 describes the modulation frequency analysis. In Section 3, first a brief description of the proposed system is given and then the details of each stage are presented. In Section 4, a quantitative measure is proposed for evaluating the performance of speech separation and it is used for systematic
evaluation of pitch range estimation and speech separation. This article concludes with a discussion in Section 5.

2. Modulation frequency analysis

Decomposing a narrowband signal into a carrier and a modulator signal is an important problem in modulation analysis and filtering [18]. The modulator is a low-frequency signal that describes the amplitude modulation of the original signal, and the carrier is a narrowband signal describing the frequency modulation of the signal. Consider a wideband discrete-time signal x(n), for which n represents the discrete-time independent variable. The T-F transform of x(n), denoted by X(m, k), is obtained using the Discrete STFT (DSTFT). X(m, k) is a T-F transformed narrowband signal (with time index m) coming out of the k-th channel:

X(m, k) = \mathrm{DSTFT}\{x(n)\} = \sum_n x(n)\, w(mM - n)\, e^{-j 2\pi n k / K}, \quad k = 0, \ldots, K-1,  (1)

where K is the DSTFT length (equal to the number of filter bank channels), w(·) is the acoustic frequency analysis window with length L, and M is the decimation factor. The product model of the modulator signal M(m, k) and the carrier signal C(m, k) of the signal X(m, k) in the T-F domain is defined as

X(m, k) = M(m, k)\, C(m, k).  (2)

The modulator of the signal X(m, k) is found by applying an envelope detector to this signal, as

M(m, k) \triangleq D\{X(m, k)\},  (3)

where D is the envelope detector operator. With respect to Equation (2), the signal's carrier is described as

C(m, k) = X(m, k) / M(m, k).  (4)

A good choice for the envelope detector is the incoherent detector, since it creates a modulation spectrum that covers a large area of the modulation frequency domain. For the speech signal at hand, this property may be used to find the pitch frequency in the modulation frequency domain. The incoherent envelope detector is based on the Hilbert envelope (for real-valued subbands) or the magnitude operator (for complex-valued subbands) [21]. Therefore, the modulator of the complex signal X(m, k) is defined as

M(m, k) = |X(m, k)|.  (5)

The theory of modulation frequency analysis and filtering is best explained through the definition of modulation transforms, which are signal transformations defined based on the Fourier transform (FT) and the STFT. The discrete short-time modulation transform of the signal x(n) is defined as

X(k, i) = \mathrm{DFT}\{D\{\mathrm{DSTFT}\{x(n)\}\}\} = \sum_{m=0}^{I-1} M(m, k)\, e^{-j 2\pi m i / I}, \quad i = 0, \ldots, I-1,  (6)

where I is the DFT length and i is the modulation frequency index. The modulation transform thus consists of a filter-bank that uses the DSTFT, followed by a subband envelope detector and then a frequency analyzer of the subband envelopes (the DFT) [18].

The modulation spectrogram intensity, defined as |X(k, i)|, is generally sketched in a diagram in which the vertical axis displays the regular acoustic frequency index k and the horizontal axis the modulation frequency index i. The modulation analysis framework is described in Figure 2. A typical example of the modulation transform is illustrated in Figure 3: Figure 3a shows the mixture of a target and an interfering male speaker, and Figure 3b, c, respectively, depict the corresponding T-F representation and the modulation spectrogram, at an overall SNR of 0 dB.

Figure 2 The modulation analysis framework and the modulation spectrogram [19].

3. System description

The main goal of the current system is to produce a soft mask for single channel speech separation in the modulation spectrogram domain. In the proposed system, determining the pitch ranges of target and interference speech is necessary for producing the separation mask. The value of this mask in each subband depends on the obtained pitch ranges of target and interference in that subband.
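As a concrete sketch of the modulation transform pipeline of Section 2 (DSTFT, incoherent envelope detection, and a DFT across each subband envelope, Eqs. (1), (5), and (6)), the following minimal NumPy implementation may help; the parameter values and the truncation of the envelope to I frames are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def modulation_spectrogram(x, K=128, L=64, M=16, I=64):
    """Sketch of Eqs. (1), (5), (6): DSTFT with a length-L Hanning window,
    hop (decimation factor) M and K-point FFT; incoherent (magnitude)
    envelope; then an I-point DFT over the time index m of each subband."""
    w = np.hanning(L)
    n_frames = (len(x) - L) // M + 1
    frames = np.stack([x[m * M : m * M + L] * w for m in range(n_frames)])
    X = np.fft.fft(frames, n=K, axis=1).T        # DSTFT, Eq. (1): shape (K, n_frames)
    env = np.abs(X)                              # incoherent envelope detector, Eq. (5)
    X_ki = np.fft.fft(env[:, :I], n=I, axis=1)   # DFT over m, Eq. (6)
    return np.abs(X_ki)                          # modulation spectrogram intensity
```

The returned array is indexed by acoustic frequency channel k (rows) and modulation frequency index i (columns), matching the modulation spectrogram layout of Figure 2.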
When the modulation spectrogram of the speech signal has been computed, the pitch ranges of the target and interference speakers are determined, and then a proper mask is calculated for the speech separation. The overall stages of our system are shown in Figure 4. To determine the mentioned pitch ranges, our proposed method uses an onset and offset detection algorithm [20] to find the distribution of modulation spectrogram energy in the modulation frequency domain, which is an important feature for determining the pitch range. When the modulation spectrogram energy is found, the modulation spectrogram is segmented, as described in Section 3.2. Then, the resulting segments are grouped in order to estimate the pitch range of each speaker. A detailed description of the stages is as follows.

Figure 3 Sound mixture and its modulation spectrogram. (a) Mixture of speech signals. (b) T-F energy plot for a mixture of two utterances of a male speaker; the utterances are "eight" and "dos". For better display, energy is plotted as the square of the FT. (c) Modulation spectrogram of the mixture signal.

Figure 4 Block diagram of the proposed system: T-F decomposition, modulation transform, smoothing, onset/offset detection and matching, decision making, pitch range estimation, and speech segregation.

3.1. T-F decomposition and modulation transform

At the T-F stage, the STFT (as a uniform filter-bank) is used for decomposing the broadband signal into narrowband subband signals. The output of the T-F stage enters the modulation transform stage in order to calculate the modulation spectrogram.

3.2. Pitch range estimation in the modulation frequency domain

The pitch frequencies of the target and interference speakers are both time-varying.
Occasionally, the pitch frequencies of the target and interference speakers are too close to each other, a fact that causes undesired errors in multipitch tracking algorithms and decreases the accuracy of speech separation methods. The algorithm of this article estimates the pitch ranges of the target and interference speakers of noisy speech in the modulation frequency domain. Estimating the pitch range in small time intervals (for example, 20 ms) decreases the error of the pitch range estimation method. In the pitch range estimation approach, at first the intensity of the modulation spectrogram is smoothed over the modulation frequency using a low-pass filter. Then, the partial derivative of the smoothed intensity over the modulation frequency is computed. By marking the peaks and valleys of the resulting signal, the onset and offset candidates are detected and the onset and offset fronts are formed. By matching the onset and offset fronts, the modulation spectrogram of the speech signal is segmented. The detailed description of the stages of the pitch range estimation is as follows.

3.2.1. Smoothing

Smoothing corresponds to low-pass filtering. The proposed system uses a low-pass filter to smooth the modulation spectrogram intensity over the modulation frequency. Considering frequency channel k, the smoothed intensity for |X(k, i)| is found as follows:

X_s(k, i) = |X(k, i)| * g_s(i),  (7)

where g_s(i) is a low-pass FIR filter with a small number of coefficients and pass-band [0, s] in Hz; here, * denotes the convolution operator (over the modulation frequency). The parameter s determines the degree of smoothing: the smaller s, the smoother X_s(k, i) will be. As an example, Figure 5 shows the original (Figure 5a) and the smoothed (Figure 5b-d) intensities of the modulation spectrum for the mixture input signal shown in Figure 3a, at three typical scales. To display more details, Figure 5e-h depicts the original and the smoothed intensities at these three scales in a single frequency channel centered at 56 Hz. As Figure 5 confirms, the intensity fluctuation is reduced by smoothing. Although the local details of the onsets and offsets become blurred, the major intensity changes of the onsets and offsets are still preserved.
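The smoothing of Eq. (7) can be sketched as below; the normalized Hanning kernel standing in for g_s is an assumption for illustration, since the paper only requires a short low-pass FIR filter:

```python
import numpy as np

def smooth_intensity(X_mag, num_taps=9):
    """Smooth the modulation-spectrogram intensity over the modulation-
    frequency axis (Eq. 7), channel by channel. A normalized Hanning
    window serves as a simple stand-in for the low-pass FIR g_s."""
    g = np.hanning(num_taps)
    g /= g.sum()  # unit DC gain, so flat regions keep their level
    return np.apply_along_axis(
        lambda row: np.convolve(row, g, mode='same'), 1, X_mag)
```

Fewer taps (a wider pass-band) preserve more detail; more taps smooth more strongly, mirroring the role of the scale parameter s.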
3.2.2. Onset/offset detection and matching

Onsets and offsets correspond to sudden intensity changes. The partial derivative of the smoothed modulation spectrogram intensity over the modulation frequency is obtained as

\frac{\partial}{\partial i} X_s(k, i) = \frac{\partial}{\partial i} \left[ |X(k, i)| * g_s(i) \right].  (8)

Peaks and valleys of the resulting signal of Equation (8) are marked as onset and offset candidates, respectively. Figure 6 illustrates this procedure, in which the onset candidates with peaks bigger than a threshold θ_on are accepted. The peaks corresponding to the true onsets are usually significantly higher than the other peaks. For this reason, θ_on = μ + σ is selected as the threshold, in which μ and σ are the mean and standard deviation of all the onset candidates (peaks of Equation 8), respectively [20]. Hu and Wang [20] claim that the performance of the method with such a threshold choice is satisfactory. In every filter channel k, to determine the offset corresponding to each onset candidate, let f_on[k, l] represent the modulation frequency of the l-th onset candidate in filter channel k. The corresponding offset, denoted by f_off[k, l], is located between f_on[k, l] and f_on[k, l+1]. If there are multiple offset candidates in this interval, the one with the largest intensity decrease (i.e., the smallest ∂X_s(k, i)/∂i) is chosen.

Figure 5 Smoothed intensity values at different scales. (a) Initial intensity for all channels. (b) Smoothed intensity at scale 14. (c) Smoothed intensity at scale 1. (d) Smoothed intensity at scale 4. (e) Initial intensity in a channel centered at 56 Hz. (f) Smoothed intensity in that channel at scale 14. (g) At scale 1. (h) At scale 4. The input is shown in Figure 3a.
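A minimal sketch of the candidate detection in one filter channel, assuming Eq. (8) is approximated by a first difference and using the θ_on = μ + σ threshold described above:

```python
import numpy as np

def onset_offset_candidates(xs_row):
    """Onset/offset candidates in one channel: peaks and valleys of the
    derivative of the smoothed intensity (Eq. 8). Onsets are kept only if
    their peak exceeds theta_on = mean + std of all peak heights."""
    d = np.diff(xs_row)  # discrete stand-in for the partial derivative over i
    peaks = [i for i in range(1, len(d) - 1)
             if d[i - 1] < d[i] >= d[i + 1] and d[i] > 0]
    valleys = [i for i in range(1, len(d) - 1)
               if d[i - 1] > d[i] <= d[i + 1] and d[i] < 0]
    if peaks:
        heights = d[peaks]
        theta_on = heights.mean() + heights.std()
        onsets = [i for i in peaks if d[i] >= theta_on]
    else:
        onsets = []
    return onsets, valleys
```

For each accepted onset, the offset would then be chosen among the valleys that lie before the next onset, picking the one with the most negative derivative, as the text specifies.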
After finding the onsets and offsets, those with close modulation frequencies are connected into onset and offset fronts, because frequency components of onsets and offsets with close modulation frequencies probably correspond to the same source. Onset and offset fronts are vertical contours across acoustic frequency in the modulation spectrogram domain. The proposed system connects an onset candidate from a filter channel to an onset candidate in the adjacent filter channel above it, provided that their distance in modulation frequency is less than a certain threshold relative to the latter filter channel. In each filter channel, this threshold is defined as the mean of the modulation-frequency distances between pairs of adjacent onsets. This definition of the threshold comes from experiments and was validated as a good choice on the data. The same applies to the offset candidates. Notice that a too-small threshold may prevent onsets or offsets from the same event from joining, while a too-large threshold may cause onsets from different events to connect together [20].

The next step is to form segments by matching individual onset and offset fronts. Consider (f_on[k, l_k], f_on[k+1, l_{k+1}], ..., f_on[k+r-1, l_{k+r-1}]) as an onset front spanning r consecutive filter channels, in which l_k denotes the index of the onset selected as a front member in filter channel k; and consider (f_off[k, l_k], f_off[k+1, l_{k+1}], ..., f_off[k+r-1, l_{k+r-1}]) as the corresponding offset modulation frequencies. For each offset modulation frequency, first we find all the offset fronts that cross this offset; then the offset front with the most crosses (with the offset modulation frequencies) is chosen as the matching offset front. Now, all the filter channels from k to k+r-1 occupied by the matching offset front (and their corresponding offset modulation frequencies on this matching offset front) are labeled as matched. If all the channels from k to k+r-1 are labeled as matched, the matching procedure finishes; otherwise, the matched channels are put aside and the procedure is repeated for the remaining unmatched channels. At last, in order to form the offset front relative to each onset front, we replace the offset modulation frequencies corresponding to the onset front with those of the matched offset fronts. The region between an onset front and its offset front yields a 2D segment in the acoustic-modulation frequency space; see Figure 7 for a schematic representation of the matching procedure.

Figure 6 Onset and offset detection. The upper panel shows the response intensity and the lower panel shows the results of onset and offset detection using a low-pass filter. The threshold for onset detection is 0.5 and for offset detection is -0.5, indicated by the dashed lines. Detected onsets are marked by downward arrows, and offsets by upward arrows.

3.2.3. Segment selection and decision-making

By detecting the onsets and offsets and forming the onset and offset fronts, the modulation spectrogram of the speech signal is segmented. Since the speaker pitch range is [60, 350] Hz (covering men, women, and children), only the segments with modulation frequencies in this range are accepted. Now we describe the grouping procedure for the segments. First, the modulation spectrogram energy of each selected segment is computed. The two almost disjoint segments with the most energy, i.e., those with the largest modulation spectrogram energies and the least horizontal overlap in the modulation spectrogram (for simplicity called segments A and B), are selected (the case of speech interfered by a non-speaker noise has only one such segment). For any other segment (call it segment C), if its modulation frequency range overlaps at least 80% with that of segment A or segment B, segment C is grouped with that overlapping segment; otherwise, segment C is omitted from the grouping procedure. Figure 8 presents a typical example of the grouping procedure.
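The selection-and-grouping rule just described can be sketched as follows; the segment representation (a range of modulation-frequency indices plus an energy) and the exact way of picking the "almost disjoint" pair are simplifying assumptions:

```python
def overlap_frac(r, ref):
    """Fraction of range r = (lo, hi) that falls inside ref."""
    lo, hi = max(r[0], ref[0]), min(r[1], ref[1])
    width = r[1] - r[0]
    return max(0.0, hi - lo) / width if width > 0 else 0.0

def group_segments(segments):
    """Grouping sketch: take the most energetic segment as A, the most
    energetic segment nearly disjoint from A as B, then attach any other
    segment whose modulation range overlaps A's or B's by at least 80%.
    Each segment is a dict with 'range' = (i_lo, i_hi) and 'energy'."""
    ordered = sorted(segments, key=lambda s: -s['energy'])
    A = ordered[0]
    B = next((s for s in ordered[1:]
              if overlap_frac(s['range'], A['range']) < 0.8), None)
    groups = {'A': [A], 'B': [B] if B is not None else []}
    for s in segments:
        if s is A or s is B:
            continue
        if overlap_frac(s['range'], A['range']) >= 0.8:
            groups['A'].append(s)
        elif B is not None and overlap_frac(s['range'], B['range']) >= 0.8:
            groups['B'].append(s)
    return groups
```

With a non-speech interference, only group A would survive, matching the single-segment case noted in the text.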
As shown, in each filter channel the onset and offset fronts of the resulting group determine the corresponding range of pitch frequency in that filter channel.

Figure 7 Schematic representation of the matching procedure: (a) the offsets corresponding to the onset front are determined; (b) the matching offset front members are found.

Figure 8 A graphical illustration of the method of grouping segments in the modulation spectrogram domain: (a) for one speaker; (b) for two speakers.

3.3. Speech separation

In [17], a mask is presented for speech separation in the modulation spectrogram domain, assuming that the pitch ranges of the target and interference are known and that these ranges are the same in each subband. Our system extends this idea by allowing the value of the mask in each filter channel to depend on the estimated pitch range of that filter channel. Consider a given signal x(n) that is the sum of a target signal x_ts(n) and an interference signal x_is(n), sampled at f_s Hz, i.e., x(n) = x_ts(n) + x_is(n). A proper mask should be estimated for segregating the target signal from the interference signal. In each filter channel k, the pitch ranges of the target and interfering speakers (obtained from the previous stage) are denoted by PF_{ts}^k := [p_{ts,low}^k, p_{ts,high}^k] and PF_{is}^k := [p_{is,low}^k, p_{is,high}^k], respectively. Also,

Q^k := \{ i \in \{0, \ldots, I-1\} \;:\; i f_s / (I M) \in PF^k \}

is defined as the set of modulation frequency indices of PF^k, i.e., of a pitch range in filter channel k. To produce a frequency mask in each filter channel k, define the mean of the modulation spectral energy relative to a pitch range as the energy normalized by the width of that pitch range:

E^k = \sum_{i \in Q^k} |X(k, i)|^2 \,/\, (p_{high}^k - p_{low}^k).  (9)
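The whole separation stage, i.e., the per-channel energy comparison of Eq. (9), the soft mask it yields (Eq. (10)), and the time-domain filtering of the modulator (Eqs. (11)-(12)), can be sketched as below; the index-set representation of the pitch ranges and the particular linear-phase choice (a delay of I/2 samples) are illustrative assumptions:

```python
import numpy as np

def soft_mask(X_mag, Q_ts, Q_is, w_ts, w_is):
    """Eqs. (9)-(10): per-channel mean modulation energy of target and
    interference over their pitch-range index sets Q, normalized by the
    pitch-range widths w, compared as F_k = E_ts / (E_ts + E_is)."""
    K = X_mag.shape[0]
    F = np.zeros(K)
    for k in range(K):
        E_ts = np.sum(X_mag[k, Q_ts[k]] ** 2) / w_ts[k]
        E_is = np.sum(X_mag[k, Q_is[k]] ** 2) / w_is[k]
        F[k] = E_ts / (E_ts + E_is) if (E_ts + E_is) > 0 else 0.0
    return F

def apply_mask(Mod, Car, F, I):
    """Eqs. (11)-(12): per channel, build a linear-phase filter f_k(m)
    whose DFT magnitude is F_k (assumed delay: I//2 samples), convolve it
    with the subband modulator, and re-attach the carrier."""
    K, n_frames = Mod.shape
    X_hat = np.zeros((K, n_frames), dtype=complex)
    for k in range(K):
        phase = np.exp(1j * 2 * np.pi * np.arange(I) * (I // 2) / I)
        f_k = np.real(np.fft.ifft(F[k] * phase))      # Eq. (11)
        env = np.convolve(Mod[k], f_k, mode='same')   # filter the modulator
        X_hat[k] = env * Car[k]                       # Eq. (12)
    return X_hat
```

The separated time-domain signal would then be recovered by an inverse STFT of X_hat, as the text explains.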
The frequency mask is calculated by comparing the means of the modulation spectral energy of the target and interference speakers in the following sense:

F^k = \frac{E_{ts}^k}{E_{ts}^k + E_{is}^k}.  (10)

Since there are artifacts associated with applying masks in the modulation frequency domain (see [22]), that domain is not preferable for modulation filtering to mask out the interference and reconstruct a time-domain signal. Instead, the frequency mask is transformed to the time domain. To this end, a filter with linear phase is constructed whose magnitude is F^k and whose phase φ_k(i) is linear in i. Then the inverse DFT is taken:

f_k(m) = \frac{1}{I} \sum_{i=0}^{I-1} F^k e^{j\varphi_k(i)} e^{j 2\pi m i / I}.  (11)

The separated target signal is estimated by convolving (over the variable m) the obtained filter f_k(m) with the modulator signal of the mixture x(n) and then multiplying by the carrier signal of the mixture:

\hat{X}(m, k) = \left[ M(m, k) * f_k(m) \right] C(m, k).  (12)

Finally, the separated target signal in the time domain is obtained by taking the inverse STFT of \hat{X}(m, k).

4. Evaluation

As mentioned earlier, our system estimates the pitch range and uses this range for the speech separation. In this section, we evaluate the proposed system on the processes of pitch range estimation and speech separation.

4.1. Pitch range estimation

First, the proposed system is evaluated on the pitch range estimation process with utterances chosen from Lee's database [23] and a corpus of 100 mixtures of speech and interference [24], commonly used for CASA research; see, e.g., [13,25,26]. The corpus contains utterances from both male and female speakers. These utterances are mixed with a set of intrusions at different SNR levels. The intrusions are N0: 1 kHz pure tone; N1: white noise; N2: noise bursts; N3: cocktail party noise; N4: rock music; N5: siren; N6: trill telephone; N7: female speech; N8: male speech; and N9: female speech. These intrusions have considerable variety; for example, N3 is noise-like, while N5 contains strong harmonic sounds. They form a realistic corpus for evaluating the capacity of a CASA system dealing with various types of interference.

The signal X(k, i) is the modulation spectrogram of an input signal digitized at a 16-kHz sampling rate. The parameters of the proposed system are set to M = 16 and K = 128; w(n) is a Hanning window with length L = 64 (refer to Section 2).
The STFT filter-bank has 128 filter channels, for which the center frequency of the k-th filter channel is ω_k = 2πk/K, k = 0, ..., K-1. Figure 9 shows the modulation spectrogram and the segments obtained for a typical speech frame when the proposed system is applied. The speech signal is a mixture of target and interference at an overall SNR of 0 dB. We select a male speech, a white noise, and a trill telephone as the interference. The results show that although the powers of the speech and interference signals are equal, the proposed method is still able to estimate the pitch range of the target speaker with reasonable accuracy.

Figure 9 Modulation spectrogram and segments obtained for a mixture of a male speaker with (a) male speech, (b) white noise, and (c) trill telephone. The input is shown in Figure 3a.

Figure 10 shows the average error percentage of the pitch range estimation by the proposed system on the above mixtures at different SNR levels. To determine the error percentage, we assign a two-element vector to the margins of each pitch range and find the root-mean-square error distance between the vectors corresponding to the true and estimated pitch ranges. As shown in Figure 10, the proposed system is able to estimate 79.9% of the target pitch range even at -5 dB SNR. The estimation rate increases to about 96.1% as the SNR increases to 15 dB.

Figure 10 Percentage of pitch range estimation error at different SNR levels, for the LSH model, RAPT, MAP, and the proposed algorithm.
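The error measure just described, an RMS distance between the two-element vectors bounding the true and estimated pitch ranges, can be sketched as follows (the [low, high] vector layout is assumed):

```python
import numpy as np

def pitch_range_error(true_range, est_range):
    """RMS distance between the two-element vectors [p_low, p_high] of the
    true and estimated pitch ranges, as used for the error percentage."""
    t = np.asarray(true_range, float)
    e = np.asarray(est_range, float)
    return float(np.sqrt(np.mean((t - e) ** 2)))
```

The reference vector itself comes from framing the clean speech and computing the pitch frequency per frame, as described in the comparison below.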
A reliable evaluation of the proposed system requires a reference range of the true pitch. However, such a reference is probably impossible to obtain from noisy speech. We therefore find the reference pitch range by framing the clean speech signal and calculating the pitch frequency in each frame. The performance of the proposed method is compared with that of the Least Square Harmonic (LSH) technique [27], the Robust Algorithm for Pitch Tracking (RAPT) [28], and the Maximum A Posteriori (MAP) estimator [29]. RAPT and MAP are two standard pitch estimation algorithms. The LSH algorithm, derived in [27] for harmonic decomposition of a time-varying signal, estimates the harmonic amplitudes and phases by solving a set of linear equations that minimizes the mean square error. The RAPT algorithm estimates the pitch frequency by searching for local maxima in the autocorrelation function of the windowed speech signal and then using a dynamic programming technique (see [28] for more details). The MAP approach [29] considers a harmonic model for the voiced speech, so that each windowed signal is expressed with a generalized linear model whose basis functions depend on the fundamental frequency and the number of harmonic partials.

Figure 10 also provides a comparison between the results of pitch estimation using the four mentioned methods, in which the proposed system performs consistently better than the three standard methods at all SNR levels. Although the performance of the LSH model (the best performing among the mentioned standard algorithms) is good at SNR levels above 10 dB, it drops quickly as the SNR decreases, which shows that the proposed system is more robust to interference than the LSH model. As mentioned in [29], MAP performs slightly better at low SNRs than at high SNRs. In addition, RAPT fails to estimate the desired pitch period at low SNRs, because it mistakenly chooses sub-harmonic and harmonic partials instead of the true pitch period. The current scheme performs almost consistently at both high and low SNRs.

4.2. Voiced speech separation

A corpus of 100 mixtures composed of 10 target utterances mixed with 10 intrusions is used for assessing the performance of the system on voiced speech separation; these data are described in Section 4.1. For comparison, the Hu and Wang system [14] and the spectral subtraction method [3] are employed. Performance of the voiced speech separation is evaluated using two measures commonly used for this purpose [14]: the percentage of energy loss, P_EL, which measures the amount of target speech excluded from the segregated speech, and the percentage of residual noise, P_NR, which measures the amount of intrusion included in the segregated speech. P_EL and P_NR are error measures of a separation system and are complementary indices for assessing system performance. In addition, the SNR of the segregated voiced target (in dB) provides a good comparison between waveforms [14]:

\mathrm{SNR} = 10 \log_{10} \frac{\sum_n s^2(n)}{\sum_n \left[ s(n) - \hat{x}(n) \right]^2},  (13)

where \hat{x}(n) is the estimated signal and s(n) is the target signal before being mixed with the intrusion.

The results of our system are shown in Figure 11. Each point in the figures represents the average value over the 100 mixtures of the complete test corpus at a particular SNR level. Figure 11a, b shows the percentage of energy loss and the noise residue. Since the goal here is to segregate the voiced target, the P_EL values are only defined for the target energy in the voiced frames of the target. As shown in Figure 11, the proposed system segregates 78.9% of the voiced target energy at -5 dB SNR and 99% at 15 dB SNR. At the same time, at -5 dB, 15.9% of the segregated energy belongs to the intrusion; this number drops to 0.7% at 15 dB SNR. Figure 11c shows the SNR of the segregated target. The system obtains an average 7.5 dB gain in SNR when the mixture SNR is -5 dB; this gain increases to 14.3 dB when the mixture SNR is 15 dB.
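Eq. (13) translates directly into code:

```python
import numpy as np

def separation_snr(s, x_hat):
    """Eq. (13): SNR (in dB) of the segregated target x_hat against the
    clean target s taken before mixing with the intrusion."""
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum((s - x_hat) ** 2))
```

Because the denominator is the energy of the residual s(n) - x̂(n), both lost target energy (P_EL) and residual intrusion (P_NR) lower this SNR, which is why it complements the two percentage measures.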
As shown in the figure, the segregated target loses more target energy (Figure 11a) but also contains less interference (Figure 11b).

Figure 11 Results of voiced speech separation. (a) Percentage of energy loss on voiced target. (b) Percentage of noise residue. (c) SNR of segregated voiced target. (Each panel compares Hu and Wang (2004) with the proposed algorithm as a function of mixture SNR.)

Figure 11 also shows the performance of the system proposed by Hu and Wang for voiced speech separation
[14], which is representative of CASA systems. As shown in the figure, Hu and Wang's system yields a lower percentage of noise residue (Figure 11b), but a much higher percentage of target energy loss (Figure 11a, c). Nevertheless, it should be noted that our system significantly improves P_EL (in Figure 11a, e.g., by around 11 and 1% at 0 and 15 dB, respectively), which leads to much less signal distortion. The price paid for this is a slight increase in P_NR, as depicted in Figure 11b (e.g., by around 6 and 0.5% at 0 and 15 dB, respectively). The average SNR for each intrusion is shown for the proposed system in Figure 12, in comparison with that of the original mixtures, Hu and Wang's system, and a spectral subtraction method, which is a standard method for speech enhancement [3] (see also [14]). The proposed system performs consistently better than Hu and Wang's system and spectral subtraction. On average, the proposed system obtains an SNR gain about 1.92 dB higher than that of Hu and Wang's system and 8.4 dB higher than that of spectral subtraction. To help the reader recognize the real difference in performance, a file has been prepared that includes sample audio mixture signals (target speech signal + interference signal) and the results of separation using spectral subtraction, Hu and Wang's system, and the proposed system. The file is available at AM-SampleWaves.ppt.

Figure 12 SNR results for segregated speech and original mixtures for a corpus of voiced speech and various intrusions (intrusion types N0-N9).

5. Discussions and conclusions

One of the major challenges in speech enhancement is the separation of a target speech signal from an interference signal of the same type. The accuracy of CASA methods in single channel speech separation depends on the correctness of the pitch frequency estimation of
two simultaneous speakers, because the proper mask in the T-F domain for speech separation is produced in association with the estimated pitch frequency. In this article, a single channel speech separation system is proposed that estimates the pitch range of one or two speakers and segregates the target speech from the interference. The pitch range is estimated using the onset and offset algorithm, considering the distribution of speaker energy in the modulation spectrogram domain. When the target and interference speakers are of the same gender (both male or both female), pitch frequency estimation methods encounter large errors because of the closeness of the pitch frequency values. Therefore, CASA methods that employ the pitch frequency as their main feature for speech separation face difficulties. In contrast, a main novelty of the present algorithm is the estimation of the pitch range based on short time-frames of the mixture signal. The constructed mask for speech separation depends on the pitch range estimated independently in each subband. As shown by the evaluation results, major portions of the voiced target speech are separated from the interfering speech using this mask. In addition, the proposed system can separate those unvoiced portions that are quasi-periodic because of their proximity to voiced portions. The proposed algorithm is robust to interference and produces good estimates of both the pitch range and the voiced speech, even in the presence of strong interference. Systematic evaluation shows that the proposed algorithm performs significantly better than the mentioned CASA and speech enhancement systems. Natural speech utterances usually include silent gaps and other interference-masked intervals. In practice, the utterance across such time-intervals should be grouped. This is a sequential grouping problem [5,6], whose segments or masks can be obtained using speech recognition in a top-down manner (limited, however, to non-speech interference) [11] or speaker recognition trained with speaker models [31].
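The idea of a mask driven by a per-subband pitch range can be reduced to a very small sketch: over a (subband × modulation-frequency) grid, keep the modulation bins that fall inside the pitch range estimated for that subband. This is an illustrative simplification under assumed inputs, not the paper's actual mask construction, which also involves the onset/offset segmentation.

```python
import numpy as np

def pitch_range_mask(mod_freqs, ranges):
    """Binary mask over a (subband x modulation-frequency) grid.
    mod_freqs: modulation-frequency axis in Hz.
    ranges: one (lo, hi) estimated target pitch range per subband.
    Returns a 0/1 array keeping bins inside each subband's range."""
    mod_freqs = np.asarray(mod_freqs, float)
    mask = np.zeros((len(ranges), mod_freqs.size))
    for k, (lo, hi) in enumerate(ranges):
        # keep modulation bins consistent with this subband's pitch range
        mask[k] = (mod_freqs >= lo) & (mod_freqs <= hi)
    return mask
```

Because each row depends only on its own (lo, hi) pair, the mask naturally reflects the text's point that the pitch range is estimated independently in each subband.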
However, the proposed algorithm does not encounter this sequential grouping problem because it operates in the modulation spectrogram domain. In terms of computational complexity, the main cost of the proposed algorithm arises from determining segments in the modulation spectrogram for pitch range estimation. The estimation of the mask and the convolution for speech separation consume a small fraction of the overall cost. Both tasks (pitch range estimation and speech separation) are implemented in the frequency domain, so the computational complexity is O(N log N), where N is the number of samples in the input signal. These operations must be performed separately for each subband. On the other hand, since feature extraction takes place independently in different subbands, substantial speedup can be achieved through parallel computing.
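The O(N log N) path described above amounts to one forward transform, a bin-wise mask multiply, and one inverse transform per subband; since subbands are independent, each iteration of the loop below could be dispatched to a separate worker. This is a generic frequency-domain masking sketch, not the paper's exact filtering chain.

```python
import numpy as np

def apply_mask_fft(subband_signals, masks):
    """Frequency-domain masking per subband: rfft, bin-wise multiply,
    irfft - the O(N log N) operations the complexity argument refers to.
    Each (signal, mask) pair is processed independently."""
    out = []
    for sig, m in zip(subband_signals, masks):
        spec = np.fft.rfft(sig)            # O(N log N)
        out.append(np.fft.irfft(spec * m, n=len(sig)))
    return out
```

An all-ones mask returns the subband unchanged and an all-zeros mask silences it, which makes the per-subband independence (and hence the parallelization opportunity) easy to verify.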
For future work, the proposed algorithm can be improved by iterative estimation of the pitch range and speech separation. The algorithm can include a specific method to jump-start the iterative process, giving an initial estimate of both the pitch range and the mask with reasonable quality. In general, the performance of the algorithm depends on the initial estimate of the pitch range; better initial estimates lead to better performance. Even with a poor estimate of the pitch range, which is unavoidable in very low SNR conditions, the proposed algorithm improves the initial estimate during the iterative process.

Author details
1 Speech Processing Research Lab (SPRL), Electrical and Computer Engineering Department, Yazd University, Yazd, Iran. 2 Control and Intelligent Processing Center of Excellence (CIPCE), School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran. 3 Image Analysis Laboratory, Department of Radiology, Henry Ford Health System, Detroit, MI, USA. 4 Electrical Engineering Department, Amirkabir University of Technology, Tehran, Iran.

Competing interests
The authors declare that they have no competing interests.

Received: 7 May 2011 Accepted: 17 March 2012 Published: 17 March 2012

References
1. RP Lippmann, Speech recognition by machines and humans. Speech Commun. 22, 1-16 (1997)
2. JJ Sroka, LD Braida, Human and machine consonant recognition. Speech Commun. 45 (2005)
3. A de Cheveigne, in Computational Auditory Scene Analysis: Principles, Algorithms, and Applications, ed. by GJ Brown, DL Wang (Wiley & IEEE, Hoboken, NJ, 2006)
4. S Dubnov, J Tabrikian, M Arnon-Targan, Speech source separation in convolutive environments using space-time-frequency analysis. EURASIP J Appl Signal Process 2006 (2006)
5. AS Bregman, Auditory Scene Analysis (MIT Press, Cambridge, MA, 1990)
6. GJ Brown, DL Wang (eds.), Computational Auditory Scene Analysis: Principles, Algorithms, and Applications (Wiley & IEEE, Hoboken, NJ, 2006)
7.
M Buchler, S Allegro, S Launer, N Dillier, Sound classification in hearing aids inspired by auditory scene analysis. EURASIP J Appl Signal Process. 18 (2005)
8. G Hu, D Wang, A tandem algorithm for pitch estimation and voiced speech segregation. IEEE Trans Audio Speech Lang Process. 18(8) (2010)
9. Y Shao, S Srinivasan, Z Jin, D Wang, A computational auditory scene analysis system for speech segregation and robust speech recognition. Comput Speech Lang. 24 (2010)
10. MH Radfar, RM Dansereau, A Sayadiyan, A maximum likelihood estimation of vocal-tract-related filter characteristics for single channel speech separation. EURASIP J Audio Speech Music Process 2007, Article ID 84186 (2007)
11. J Barker, M Cooke, D Ellis, Decoding speech in the presence of other sources. Speech Commun. 45, 5-25 (2005)
12. Y Shao, DL Wang, Model-based sequential organization in cochannel speech. IEEE Trans Acoust Speech Signal Process. 14 (2005)
13. GJ Brown, M Cooke, Computational auditory scene analysis. Comput Speech Lang. 8 (1994)
14. G Hu, DL Wang, Monaural speech separation based on pitch tracking and amplitude modulation. IEEE Trans Neural Netw. 15 (2004)
15. M Wu, DL Wang, GJ Brown, A multipitch tracking algorithm for noisy speech. IEEE Trans Speech Audio Process. 11 (2003)
16. J Le Roux, H Kameoka, N Ono, A de Cheveigne, S Sagayama, Single and multiple F0 contour estimation through parametric spectrogram modeling of speech in noisy environments. IEEE Trans Audio Speech Lang Process. 15 (2007)
17. SM Schimmel, LE Atlas, K Nie, Feasibility of single channel speaker separation based on modulation frequency analysis, in Proc IEEE International Conference on Acoustics, Speech and Signal Processing, Hawaii, USA. 4 (2007)
18. SM Schimmel, Dissertation, University of Washington (2007)
19. L Atlas, SA Shamma, Joint acoustic and modulation frequency. EURASIP J Appl Signal Process. 2003(7) (2003)
20. G Hu, DL Wang, Auditory segmentation based on onset and offset analysis. IEEE Trans Audio Speech Lang Process. 15(2) (2007)
21. R Drullman, JM Festen, R Plomp, Effect of temporal envelope smearing on speech reception. J Acoust Soc Am. 95 (1994)
22. SM Schimmel, LE Atlas, Coherent envelope detection for modulation filtering of speech, in Proc IEEE International Conference on Acoustics, Speech and Signal Processing, Pennsylvania, USA (2005)
23. TW Lee, Blind source separation: audio examples (1998). edu/~tewon/blind/blind_audio.html. Accessed 4 May
24. MP Cooke, Modeling Auditory Processing and Organization (Cambridge University Press, Cambridge, 1993)
25. LA Drake, Dissertation, Northwestern University (2001)
26. DL Wang, GJ Brown, Separation of speech from interfering sounds based on oscillatory correlation. IEEE Trans Neural Netw. 10 (1999)
27. Q Li, L Atlas, Time-variant least-squares harmonic modeling, in Proc IEEE International Conference on Acoustics, Speech and Signal Processing, Hong Kong. 2 (2003)
28. D Talkin, A robust algorithm for pitch tracking (RAPT), in Speech Coding and Synthesis, ed. by KK Paliwal, WB Kleijn (Elsevier, New York, NY, 1995)
29. J Tabrikian, S Dubnov, Y Dickalov, Maximum a posteriori probability pitch tracking in noisy environments using harmonic model. IEEE Trans Speech Audio Process. 12 (2004)
30. X Huang, A Acero, HW Hon, Spoken Language Processing: A Guide to Theory, Algorithms, and System Development (Prentice Hall PTR, Upper Saddle River, NJ, 2001)
31. Y Shao, Dissertation, The Ohio State University (2007)

Cite this article as: Mahmoodzadeh et al.: Single channel speech separation in modulation frequency domain based on a novel pitch range estimation method. EURASIP Journal on Advances in Signal Processing 2012:67.
Frequency Domain Analysis for Noise Suppression Using Spectral Processing Methods for Degraded Speech Signal in Speech Enhancement 1 Zeeshan Hashmi Khateeb, 2 Gopalaiah 1,2 Department of Instrumentation
More informationComparison of Spectral Analysis Methods for Automatic Speech Recognition
INTERSPEECH 2013 Comparison of Spectral Analysis Methods for Automatic Speech Recognition Venkata Neelima Parinam, Chandra Vootkuri, Stephen A. Zahorian Department of Electrical and Computer Engineering
More informationTwo-channel Separation of Speech Using Direction-of-arrival Estimation And Sinusoids Plus Transients Modeling
Two-channel Separation of Speech Using Direction-of-arrival Estimation And Sinusoids Plus Transients Modeling Mikko Parviainen 1 and Tuomas Virtanen 2 Institute of Signal Processing Tampere University
More informationECE 5655/4655 Laboratory Problems
Assignment #4 ECE 5655/4655 Laboratory Problems Make Note o the Following: Due Monday April 15, 2019 I possible write your lab report in Jupyter notebook I you choose to use the spectrum/network analyzer
More informationChange Point Determination in Audio Data Using Auditory Features
INTL JOURNAL OF ELECTRONICS AND TELECOMMUNICATIONS, 0, VOL., NO., PP. 8 90 Manuscript received April, 0; revised June, 0. DOI: /eletel-0-00 Change Point Determination in Audio Data Using Auditory Features
More informationDifferent Approaches of Spectral Subtraction Method for Speech Enhancement
ISSN 2249 5460 Available online at www.internationalejournals.com International ejournals International Journal of Mathematical Sciences, Technology and Humanities 95 (2013 1056 1062 Different Approaches
More informationImplementation of an Intelligent Target Classifier with Bicoherence Feature Set
ISSN: 39-8753 International Journal o Innovative Research in Science, (An ISO 397: 007 Certiied Organization Vol. 3, Issue, November 04 Implementation o an Intelligent Target Classiier with Bicoherence
More informationEE482: Digital Signal Processing Applications
Professor Brendan Morris, SEB 3216, brendan.morris@unlv.edu EE482: Digital Signal Processing Applications Spring 2014 TTh 14:30-15:45 CBC C222 Lecture 12 Speech Signal Processing 14/03/25 http://www.ee.unlv.edu/~b1morris/ee482/
More informationFrequency Hopped Spread Spectrum
FH- 5. Frequency Hopped pread pectrum ntroduction n the next ew lessons we will be examining spread spectrum communications. This idea was originally developed or military communication systems. However,
More informationA new zoom algorithm and its use in frequency estimation
Waves Wavelets Fractals Adv. Anal. 5; :7 Research Article Open Access Manuel D. Ortigueira, António S. Serralheiro, and J. A. Tenreiro Machado A new zoom algorithm and its use in requency estimation DOI.55/wwaa-5-
More informationAspiration Noise during Phonation: Synthesis, Analysis, and Pitch-Scale Modification. Daryush Mehta
Aspiration Noise during Phonation: Synthesis, Analysis, and Pitch-Scale Modification Daryush Mehta SHBT 03 Research Advisor: Thomas F. Quatieri Speech and Hearing Biosciences and Technology 1 Summary Studied
More informationHigh Speed Communication Circuits and Systems Lecture 10 Mixers
High Speed Communication Circuits and Systems Lecture Mixers Michael H. Perrott March 5, 24 Copyright 24 by Michael H. Perrott All rights reserved. Mixer Design or Wireless Systems From Antenna and Bandpass
More informationfor Single-Tone Frequency Tracking H. C. So Department of Computer Engineering & Information Technology, City University of Hong Kong,
A Comparative Study of Three Recursive Least Squares Algorithms for Single-Tone Frequency Tracking H. C. So Department of Computer Engineering & Information Technology, City University of Hong Kong, Tat
More informationCHAPTER 3 SPEECH ENHANCEMENT ALGORITHMS
46 CHAPTER 3 SPEECH ENHANCEMENT ALGORITHMS 3.1 INTRODUCTION Personal communication of today is impaired by nearly ubiquitous noise. Speech communication becomes difficult under these conditions; speech
More informationNCCF ACF. cepstrum coef. error signal > samples
ESTIMATION OF FUNDAMENTAL FREQUENCY IN SPEECH Petr Motl»cek 1 Abstract This paper presents an application of one method for improving fundamental frequency detection from the speech. The method is based
More informationSignal segmentation and waveform characterization. Biosignal processing, S Autumn 2012
Signal segmentation and waveform characterization Biosignal processing, 5173S Autumn 01 Short-time analysis of signals Signal statistics may vary in time: nonstationary how to compute signal characterizations?
More informationBinaural Hearing. Reading: Yost Ch. 12
Binaural Hearing Reading: Yost Ch. 12 Binaural Advantages Sounds in our environment are usually complex, and occur either simultaneously or close together in time. Studies have shown that the ability to
More informationStudy Of Sound Source Localization Using Music Method In Real Acoustic Environment
International Journal of Electronics Engineering Research. ISSN 975-645 Volume 9, Number 4 (27) pp. 545-556 Research India Publications http://www.ripublication.com Study Of Sound Source Localization Using
More informationMultiple Sound Sources Localization Using Energetic Analysis Method
VOL.3, NO.4, DECEMBER 1 Multiple Sound Sources Localization Using Energetic Analysis Method Hasan Khaddour, Jiří Schimmel Department of Telecommunications FEEC, Brno University of Technology Purkyňova
More informationWavelet Speech Enhancement based on the Teager Energy Operator
Wavelet Speech Enhancement based on the Teager Energy Operator Mohammed Bahoura and Jean Rouat ERMETIS, DSA, Université du Québec à Chicoutimi, Chicoutimi, Québec, G7H 2B1, Canada. Abstract We propose
More informationROBUST PITCH TRACKING USING LINEAR REGRESSION OF THE PHASE
- @ Ramon E Prieto et al Robust Pitch Tracking ROUST PITCH TRACKIN USIN LINEAR RERESSION OF THE PHASE Ramon E Prieto, Sora Kim 2 Electrical Engineering Department, Stanford University, rprieto@stanfordedu
More informationPLL AND NUMBER OF SAMPLE SYNCHRONISATION TECHNIQUES FOR ELECTRICAL POWER QUALITY MEASURMENTS
XX IMEKO World Congress Metrology or Green Growth September 9 14, 2012, Busan, Republic o Korea PLL AND NUMBER OF SAMPLE SYNCHRONISATION TECHNIQUES FOR ELECTRICAL POWER QUALITY MEASURMENTS Richárd Bátori
More informationSynchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech
INTERSPEECH 5 Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech M. A. Tuğtekin Turan and Engin Erzin Multimedia, Vision and Graphics Laboratory,
More informationHCS 7367 Speech Perception
HCS 7367 Speech Perception Dr. Peter Assmann Fall 212 Power spectrum model of masking Assumptions: Only frequencies within the passband of the auditory filter contribute to masking. Detection is based
More information