Single channel speech separation in the modulation frequency domain based on a novel pitch range estimation method
RESEARCH Open Access

Single channel speech separation in the modulation frequency domain based on a novel pitch range estimation method

Azar Mahmoodzadeh 1, Hamid Reza Abutalebi 1*, Hamid Soltanian-Zadeh 2,3 and Hamid Sheikhzadeh 4

Abstract

Computational Auditory Scene Analysis (CASA) has been the focus of recent literature on speech separation from monaural mixtures. The performance of current CASA systems on voiced speech separation depends strictly on the robustness of the algorithm used for pitch frequency estimation. We propose a new system that estimates the pitch (frequency) range of a target utterance and separates the voiced portions of the target speech. The algorithm first estimates the pitch range of the target speech in each frame of data in the modulation frequency domain, and then uses the estimated pitch range for segregating the target speech. The pitch range estimation method is based on an onset and offset algorithm. Speech separation is performed by filtering the mixture signal with a mask extracted from the modulation spectrogram. A systematic evaluation shows that the proposed system extracts the majority of the target speech signal with minimal interference and outperforms previous systems in both pitch extraction and voiced speech separation.

Keywords: acoustic frequency, modulation frequency, onset and offset algorithm, pitch range estimation, speech separation

1. Introduction

Speech separation, as a solution to the cocktail party problem, is a well-known challenge with important applications. Consider, for example, telecommunication systems or Automatic Speech Recognition systems that lose performance in the presence of interfering sounds [1,2]. An effective system that segregates speech from interference in monaural (single-microphone) situations can be rewarding in such problems. Many methods have been proposed for monaural speech enhancement; for example, see [3-7].
* Correspondence: habutalebi@yazduni.ac.ir. 1 Speech Processing Research Lab (SPRL), Electrical and Computer Engineering Department, Yazd University, Yazd, Iran. Full list of author information is available at the end of the article. © 2012 Mahmoodzadeh et al; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

These methods usually assume certain statistical properties for the interference and tend to lack the capacity of dealing with a variety of interferences. While machine monaural speech separation works awkwardly, the human auditory system performs proficiently. The underlying perceptual process is known as Auditory Scene Analysis (ASA) [5]. Psychoacoustic research in ASA has inspired considerable work in developing Computational Auditory Scene Analysis (CASA) systems for speech separation (see [6,7] for a comprehensive review).

According to Bregman [5], the ASA procedure can be separated into two theoretical stages: segmentation and grouping. At the first stage, speech is transformed into a higher-dimensional space (such as a time-frequency two-dimensional representation) and then similar time-frequency (T-F) units are segmented in order to compose different regions [6]. In the second stage, these regions are combined into different streams based on the relevant acoustic information. The major computational goal of CASA is to separate the target speech signal from the interference for different purposes, via generating a binary or a soft T-F mask; see, e.g., [8-10]. Grouping itself consists of simultaneous and sequential organizations, which involve grouping of segments across frequency and time. The task of sequential grouping is to group the T-F regions belonging to the same sound source across time. Figure 1 illustrates this issue: the upper panel shows T-F regions grouped into one single stream, as they are close enough in both the time and frequency directions, while the lower panel illustrates the case of two streams of speech, grouped separately because the T-F regions are sufficiently far from each other in the frequency direction. Temporal continuity is an effective cue for grouping T-F regions neighboring in time. However, it cannot handle T-F regions that do not overlap in time due to silence or interference segments. Therefore, sequential grouping of such T-F regions is a very challenging problem (see [11,12] for more details).

Natural speech includes both voiced and unvoiced portions. Voiced portions of speech are characterized by periodicity (or harmonicity), which has been used as an important feature in many CASA systems for segregating voiced speech (see, e.g., [13,14]). Despite considerable advances in voiced speech separation, the performance of current CASA systems is still limited by pitch frequency (F0) estimation errors and residual noise. Various methods have been proposed for robust pitch frequency estimation, see, e.g., [15,16]; however, robust pitch frequency estimation in low signal-to-noise ratio (SNR) situations still poses a significant challenge.

While mixed speech signals may have a great deal of overlap in the time domain, modulation frequency analysis provides an additional dimension that can present a greater degree of separation among sources. In other words, the original T-F representation obtained from transformations like the Short-Time Fourier Transform (STFT) can be augmented with a third dimension that represents modulation frequency. In [17], by assuming that the pitch frequency range is known and constant in each filter channel, modulation spectral analysis is used as a tool for producing the separation mask in this higher-dimensional space. Based on the above observations, we propose a new system for single channel separation of voiced speech based on modulation filtering.
The idea is that, first, the target pitch (frequency) range is estimated in the modulation frequency domain, and then this range is used for producing the proper mask for speech separation. For the following reasons, provided in [18], modulation analysis and filtering are applied to the target speech separation problem. First, there is a general belief that the human ASA system processes sounds in the modulation frequency domain. Second, the energy from two co-channel talkers is largely non-overlapping in the modulation frequency domain. The method of modulation analysis and filtering has been studied extensively by many researchers in the field of single channel speech separation; reference [19] provides a general discussion of this subject.

Figure 1 Segmentation and grouping of speech projected into T-F cells in a 2D representation [6].

At first, the proposed system performs a multipitch range estimation of target and interference speech based on segmentation of the modulation spectrogram. The segmentation is done using an onset and offset algorithm similar to that proposed by Hu and Wang [20]. In the proposed method, the noisy signal is divided into 20-ms time frames and then the proposed speech separation algorithm is applied to each individual frame. The pitch range estimation method works in three stages. The first stage computes the modulation spectrogram. The second stage decomposes the modulation spectrogram into segments using an onset and offset algorithm: first, the peaks and valleys of the derivative of the smoothed intensity of the modulation spectrogram are detected and marked as onset and offset candidates; any onset bigger than a certain threshold is accepted, and for it the smallest offset before the next onset is selected; then, onset and offset fronts are produced by connecting common onsets and offsets; finally, segments are formed by matching the onset and offset fronts. The third stage determines the range of pitch frequency by selecting and grouping the desired segments.

The separation part of the proposed system aims at obtaining a soft mask in the modulation spectrogram domain. By extending the soft mask suggested in [17], a soft mask is proposed whose value depends on the estimated pitch range in each filter channel. To determine the soft mask in each filter channel, we first find and mutually compare the modulation spectrogram energies of target and interference within their pitch ranges estimated in the previous stage. Then, we transform the soft mask to the time domain and filter the mixture signal in order to obtain the separated target signal. Thus, a strategy is suggested that estimates the target pitch range and subsequently segregates the target signal from the interference. Finally, the separated target signal is obtained by arranging the separated signals from each frame in time order.

This article is organized as follows. Section 2 describes the modulation frequency analysis. In Section 3, first a brief description of the proposed system is given and then the details of each stage are presented. In Section 4, a quantitative measure is proposed for evaluating the performance of speech separation and it is used for systematic
evaluation of pitch range estimation and speech separation. This article concludes with a discussion in Section 5.

2. Modulation frequency analysis

Decomposing a narrowband signal into a carrier and a modulator signal is an important problem in modulation analysis and filtering [18]. The modulator is a low-frequency signal that describes the amplitude modulation of the original signal, and the carrier is a narrowband signal describing the frequency modulation of the signal. Consider a wideband discrete-time signal x(n), for which n represents the discrete-time independent variable. The T-F transform of x(n), denoted by X(m, k), is obtained using the Discrete STFT (DSTFT). X(m, k) is a T-F transformed narrowband signal (with time index m) coming out of the k-th channel:

X(m, k) = \mathrm{DSTFT}\{x(n)\} = \sum_n x(n)\, w(mM - n)\, e^{-j 2\pi n k / K}, \quad k = 0, \ldots, K-1,  (1)

where K is the DSTFT length (equal to the number of filter bank channels), w(·) is the acoustic frequency analysis window with length L, and M is the decimation factor. The product model of the modulator signal M(m, k) and the carrier signal C(m, k) of the signal X(m, k) in the T-F domain is defined as

X(m, k) = M(m, k)\, C(m, k).  (2)

The modulator of the signal X(m, k) is found by applying an envelope detector to this signal, as

M(m, k) \triangleq D\{X(m, k)\},  (3)

where D is the envelope detector operator. With respect to Equation (2), the signal's carrier is described as

C(m, k) = X(m, k) / M(m, k).  (4)

A good choice for the envelope detector is the incoherent detector, since it creates a modulation spectrum that covers a large area of the modulation frequency domain. For the speech signal at hand, this property may be used to find the pitch frequency in the modulation frequency domain. The incoherent envelope detector is based on the Hilbert envelope (for real-valued subbands) or the magnitude operator (for complex-valued subbands) [21]. Therefore, the modulator of the complex signal X(m, k) is defined as

M(m, k) = |X(m, k)|.  (5)

The theory of modulation frequency analysis and filtering is best explained through the definition of modulation transforms, which are signal transformations defined based on the Fourier transform (FT) and the STFT. The discrete short-time modulation transform of the signal x(n) is defined as

X(k, i) = \mathrm{DFT}\{D\{\mathrm{DSTFT}\{x(n)\}\}\} = \sum_{m=0}^{I-1} M(m, k)\, e^{-j 2\pi m i / I}, \quad i = 0, \ldots, I-1,  (6)

where I is the DFT length and i is the modulation frequency index. The modulation transform thus consists of a filter-bank that uses the DSTFT, followed by a subband envelope detector and then a frequency analyzer of the subband envelopes (the DFT) [18].

The modulation spectrogram intensity, defined as |X(k, i)|, is generally sketched in a diagram in which the vertical axis displays the regular acoustic frequency index k and the horizontal axis the modulation frequency index i. The modulation analysis framework is described in Figure 2. A typical example of the modulation transform is illustrated in Figure 3: Figure 3a shows the mixture of a target and an interfering male speaker, and Figure 3b, c, respectively, depict the corresponding T-F representation and the modulation spectrogram, at an overall SNR of 0 dB.

Figure 2 The modulation analysis framework and the modulation spectrogram [19].

3. System description

The main goal of the current system is to produce a soft mask for single channel speech separation in the modulation spectrogram domain. In the proposed system, determining the pitch ranges of target and interference speech is necessary for producing the separation mask. The value of this mask in each subband depends on the obtained pitch ranges of target and interference in that subband.
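As a concrete sketch of the modulation transform pipeline of Section 2 (DSTFT, incoherent envelope detection, and a DFT across each subband envelope, Eqs. (1), (5), and (6)), the following minimal NumPy implementation may help; the parameter values and the truncation of the envelope to I frames are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def modulation_spectrogram(x, K=128, L=64, M=16, I=64):
    """Sketch of Eqs. (1), (5), (6): DSTFT with a length-L Hanning window,
    hop (decimation factor) M and K-point FFT; incoherent (magnitude)
    envelope; then an I-point DFT over the time index m of each subband."""
    w = np.hanning(L)
    n_frames = (len(x) - L) // M + 1
    frames = np.stack([x[m * M : m * M + L] * w for m in range(n_frames)])
    X = np.fft.fft(frames, n=K, axis=1).T        # DSTFT, Eq. (1): shape (K, n_frames)
    env = np.abs(X)                              # incoherent envelope detector, Eq. (5)
    X_ki = np.fft.fft(env[:, :I], n=I, axis=1)   # DFT over m, Eq. (6)
    return np.abs(X_ki)                          # modulation spectrogram intensity
```

The returned array is indexed by acoustic frequency channel k (rows) and modulation frequency index i (columns), matching the modulation spectrogram layout of Figure 2.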
When the modulation spectrogram of the speech signal has been computed, the pitch ranges of the target and interference speakers are determined, and then a proper mask is calculated for the speech separation. The overall stages of our system are shown in Figure 4. To determine the mentioned pitch ranges, our proposed method uses an onset and offset detection algorithm [20] to find the distribution of modulation spectrogram energy in the modulation frequency domain, which is an important feature for determining the pitch range. When the modulation spectrogram energy is found, the modulation spectrogram is segmented, as described in Section 3.2. Then, the resulting segments are grouped in order to estimate the pitch range of each speaker. A detailed description of the stages is as follows.

Figure 3 Sound mixture and its modulation spectrogram. (a) Mixture of speech signals. (b) T-F energy plot for a mixture of two utterances of a male speaker; the utterances are "eight" and "dos". For better display, energy is plotted as the square of the FT. (c) Modulation spectrogram of the mixture signal.

Figure 4 Block diagram of the proposed system: T-F decomposition, modulation transform, smoothing, onset/offset detection and matching, decision making, pitch range estimation, and speech segregation.

3.1. T-F decomposition and modulation transform

At the T-F stage, the STFT (as a uniform filter-bank) is used for decomposing the broadband signal into narrowband subband signals. The output of the T-F stage enters the modulation transform stage in order to calculate the modulation spectrogram.

3.2. Pitch range estimation in the modulation frequency domain

The pitch frequencies of the target and interference speakers are both time-varying.
Occasionally, the pitch frequencies of the target and interference speakers are too close to each other, a fact that causes undesired errors in multipitch tracking algorithms and decreases the accuracy of speech separation methods. The algorithm of this article estimates the pitch ranges of the target and interference speakers of noisy speech in the modulation frequency domain. Estimating the pitch range in small time intervals (for example, 20 ms) decreases the error of the pitch range estimation method. In the pitch range estimation approach, at first the intensity of the modulation spectrogram is smoothed over the modulation frequency using a low-pass filter. Then, the partial derivative of the smoothed intensity over the modulation frequency is computed. By marking the peaks and valleys of the resulting signal, the onset and offset candidates are detected and the onset and offset fronts are formed. By matching the onset and offset fronts, the modulation spectrogram of the speech signal is segmented. The detailed description of the stages of the pitch range estimation is as follows.

3.2.1. Smoothing

Smoothing corresponds to low-pass filtering. The proposed system uses a low-pass filter to smooth the modulation spectrogram intensity over the modulation frequency. Considering frequency channel k, the smoothed intensity for |X(k, i)| is found as follows:

X_s(k, i) = |X(k, i)| * g_s(i),  (7)

where g_s(i) is a low-pass FIR filter with a small number of coefficients and pass-band [0, s] in Hz; here, * denotes the convolution operator (over the modulation frequency). The parameter s determines the degree of smoothing: the smaller s, the smoother X_s(k, i) will be. As an example, Figure 5 shows the original (Figure 5a) and the smoothed (Figure 5b-d) intensities of the modulation spectrum for the mixture input signal shown in Figure 3a, at three typical scales. To display more details, Figure 5e-h depicts the original and the smoothed intensities at these three scales in a single frequency channel centered at 56 Hz. As Figure 5 confirms, the intensity fluctuation is reduced by smoothing. Although the local details of the onsets and offsets become blurred, the major intensity changes of the onsets and offsets are still preserved.
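The smoothing of Eq. (7) can be sketched as below; the normalized Hanning kernel standing in for g_s is an assumption for illustration, since the paper only requires a short low-pass FIR filter:

```python
import numpy as np

def smooth_intensity(X_mag, num_taps=9):
    """Smooth the modulation-spectrogram intensity over the modulation-
    frequency axis (Eq. 7), channel by channel. A normalized Hanning
    window serves as a simple stand-in for the low-pass FIR g_s."""
    g = np.hanning(num_taps)
    g /= g.sum()  # unit DC gain, so flat regions keep their level
    return np.apply_along_axis(
        lambda row: np.convolve(row, g, mode='same'), 1, X_mag)
```

Fewer taps (a wider pass-band) preserve more detail; more taps smooth more strongly, mirroring the role of the scale parameter s.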
3.2.2. Onset/offset detection and matching

Onsets and offsets correspond to sudden intensity changes. The partial derivative of the smoothed modulation spectrogram intensity over the modulation frequency is obtained as

\frac{\partial}{\partial i} X_s(k, i) = \frac{\partial}{\partial i} \left[ |X(k, i)| * g_s(i) \right].  (8)

Peaks and valleys of the resulting signal of Equation (8) are marked as onset and offset candidates, respectively. Figure 6 illustrates this procedure, in which the onset candidates with peaks bigger than a threshold θ_on are accepted. The peaks corresponding to the true onsets are usually significantly higher than the other peaks. For this reason, θ_on = μ + σ is selected as the threshold, in which μ and σ are the mean and standard deviation of all the onset candidates (peaks of Equation 8), respectively [20]. Hu and Wang [20] claim that the performance of the method with such a threshold choice is satisfactory. In every filter channel k, to determine the offset corresponding to each onset candidate, let f_on[k, l] represent the modulation frequency of the l-th onset candidate in filter channel k. The corresponding offset, denoted by f_off[k, l], is located between f_on[k, l] and f_on[k, l+1]. If there are multiple offset candidates in this interval, the one with the largest intensity decrease (i.e., the smallest ∂X_s(k, i)/∂i) is chosen.

Figure 5 Smoothed intensity values at different scales. (a) Initial intensity for all channels. (b) Smoothed intensity at scale 14. (c) Smoothed intensity at scale 1. (d) Smoothed intensity at scale 4. (e) Initial intensity in a channel centered at 56 Hz. (f) Smoothed intensity in that channel at scale 14. (g) At scale 1. (h) At scale 4. The input is shown in Figure 3a.
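A minimal sketch of the candidate detection in one filter channel, assuming Eq. (8) is approximated by a first difference and using the θ_on = μ + σ threshold described above:

```python
import numpy as np

def onset_offset_candidates(xs_row):
    """Onset/offset candidates in one channel: peaks and valleys of the
    derivative of the smoothed intensity (Eq. 8). Onsets are kept only if
    their peak exceeds theta_on = mean + std of all peak heights."""
    d = np.diff(xs_row)  # discrete stand-in for the partial derivative over i
    peaks = [i for i in range(1, len(d) - 1)
             if d[i - 1] < d[i] >= d[i + 1] and d[i] > 0]
    valleys = [i for i in range(1, len(d) - 1)
               if d[i - 1] > d[i] <= d[i + 1] and d[i] < 0]
    if peaks:
        heights = d[peaks]
        theta_on = heights.mean() + heights.std()
        onsets = [i for i in peaks if d[i] >= theta_on]
    else:
        onsets = []
    return onsets, valleys
```

For each accepted onset, the offset would then be chosen among the valleys that lie before the next onset, picking the one with the most negative derivative, as the text specifies.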
After finding the onsets and offsets, those with close modulation frequencies are connected into onset and offset fronts, because frequency components of onsets and offsets with close modulation frequencies probably correspond to the same source. Onset and offset fronts are vertical contours across acoustic frequency in the modulation spectrogram domain. The proposed system connects an onset candidate from a filter channel to an onset candidate in the adjacent filter channel above it, provided that their distance in modulation frequency is less than a certain threshold relative to the latter filter channel. In each filter channel, this threshold is defined as the mean of the modulation-frequency distances between pairs of adjacent onsets. This definition of the threshold comes from experiments and was validated as a good choice on the data. The same applies to the offset candidates. Notice that a too-small threshold may prevent onsets or offsets from the same event from joining, while a too-large threshold may cause onsets from different events to connect together [20].

The next step is to form segments by matching individual onset and offset fronts. Consider (f_on[k, l_k], f_on[k+1, l_{k+1}], ..., f_on[k+r-1, l_{k+r-1}]) as an onset front spanning r consecutive filter channels, in which l_k denotes the index of the onset selected as a front member in filter channel k; and consider (f_off[k, l_k], f_off[k+1, l_{k+1}], ..., f_off[k+r-1, l_{k+r-1}]) as the corresponding offset modulation frequencies. For each offset modulation frequency, first we find all the offset fronts that cross this offset; then the offset front with the most crosses (with the offset modulation frequencies) is chosen as the matching offset front. Now, all the filter channels from k to k+r-1 occupied by the matching offset front (and their corresponding offset modulation frequencies on this matching offset front) are labeled as matched. If all the channels from k to k+r-1 are labeled as matched, the matching procedure finishes; otherwise, the matched channels are put aside and the procedure is repeated for the remaining unmatched channels. At last, in order to form the offset front relative to each onset front, we replace the offset modulation frequencies corresponding to the onset front with those of the matched offset fronts. The region between an onset front and its offset front yields a 2D segment in the acoustic-modulation frequency space; see Figure 7 for a schematic representation of the matching procedure.

Figure 6 Onset and offset detection. The upper panel shows the response intensity and the lower panel shows the results of onset and offset detection using a low-pass filter. The threshold for onset detection is 0.5 and for offset detection is -0.5, indicated by the dashed lines. Detected onsets are marked by downward arrows, and offsets by upward arrows.

3.2.3. Segment selection and decision-making

By detecting the onsets and offsets and forming the onset and offset fronts, the modulation spectrogram of the speech signal is segmented. Since the speaker pitch range is [60, 350] Hz (covering men, women, and children), only the segments with modulation frequencies in this range are accepted. Now we describe the grouping procedure for the segments. First, the modulation spectrogram energy of each selected segment is computed. The two almost disjoint segments with the most energy, i.e., those with the largest modulation spectrogram energies and the least horizontal overlap in the modulation spectrogram (for simplicity called segments A and B), are selected (the case of speech interfered by a non-speaker noise has only one such segment). For any other segment (call it segment C), if its modulation frequency range overlaps at least 80% with that of segment A or segment B, segment C is grouped with that overlapping segment; otherwise, segment C is omitted from the grouping procedure. Figure 8 presents a typical example of the grouping procedure.
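The selection-and-grouping rule just described can be sketched as follows; the segment representation (a range of modulation-frequency indices plus an energy) and the exact way of picking the "almost disjoint" pair are simplifying assumptions:

```python
def overlap_frac(r, ref):
    """Fraction of range r = (lo, hi) that falls inside ref."""
    lo, hi = max(r[0], ref[0]), min(r[1], ref[1])
    width = r[1] - r[0]
    return max(0.0, hi - lo) / width if width > 0 else 0.0

def group_segments(segments):
    """Grouping sketch: take the most energetic segment as A, the most
    energetic segment nearly disjoint from A as B, then attach any other
    segment whose modulation range overlaps A's or B's by at least 80%.
    Each segment is a dict with 'range' = (i_lo, i_hi) and 'energy'."""
    ordered = sorted(segments, key=lambda s: -s['energy'])
    A = ordered[0]
    B = next((s for s in ordered[1:]
              if overlap_frac(s['range'], A['range']) < 0.8), None)
    groups = {'A': [A], 'B': [B] if B is not None else []}
    for s in segments:
        if s is A or s is B:
            continue
        if overlap_frac(s['range'], A['range']) >= 0.8:
            groups['A'].append(s)
        elif B is not None and overlap_frac(s['range'], B['range']) >= 0.8:
            groups['B'].append(s)
    return groups
```

With a non-speech interference, only group A would survive, matching the single-segment case noted in the text.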
As shown, in each filter channel the onset and offset fronts of the resulting group determine the corresponding range of pitch frequency in that filter channel.

Figure 7 Schematic representation of the matching procedure: (a) the offsets corresponding to the onset front are determined; (b) the matching offset front members are found.

Figure 8 A graphical illustration of the method of grouping segments in the modulation spectrogram domain: (a) for one speaker; (b) for two speakers.

3.3. Speech separation

In [17], a mask is presented for speech separation in the modulation spectrogram domain, assuming that the pitch ranges of the target and interference are known and that these ranges are the same in each subband. Our system extends this idea by allowing the value of the mask in each filter channel to depend on the estimated pitch range of that filter channel. Consider a given signal x(n) that is the sum of a target signal x_ts(n) and an interference signal x_is(n), sampled at f_s Hz, i.e., x(n) = x_ts(n) + x_is(n). A proper mask should be estimated for segregating the target signal from the interference signal. In each filter channel k, the pitch ranges of the target and interfering speakers (obtained from the previous stage) are denoted by PF_{ts}^k := [p_{ts,low}^k, p_{ts,high}^k] and PF_{is}^k := [p_{is,low}^k, p_{is,high}^k], respectively. Also,

Q^k := \{ i \in \{0, \ldots, I-1\} \;:\; i f_s / (I M) \in PF^k \}

is defined as the set of modulation frequency indices of PF^k, i.e., of a pitch range in filter channel k. To produce a frequency mask in each filter channel k, define the mean of the modulation spectral energy relative to a pitch range as the energy normalized by the width of that pitch range:

E^k = \sum_{i \in Q^k} |X(k, i)|^2 \,/\, (p_{high}^k - p_{low}^k).  (9)
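The whole separation stage, i.e., the per-channel energy comparison of Eq. (9), the soft mask it yields (Eq. (10)), and the time-domain filtering of the modulator (Eqs. (11)-(12)), can be sketched as below; the index-set representation of the pitch ranges and the particular linear-phase choice (a delay of I/2 samples) are illustrative assumptions:

```python
import numpy as np

def soft_mask(X_mag, Q_ts, Q_is, w_ts, w_is):
    """Eqs. (9)-(10): per-channel mean modulation energy of target and
    interference over their pitch-range index sets Q, normalized by the
    pitch-range widths w, compared as F_k = E_ts / (E_ts + E_is)."""
    K = X_mag.shape[0]
    F = np.zeros(K)
    for k in range(K):
        E_ts = np.sum(X_mag[k, Q_ts[k]] ** 2) / w_ts[k]
        E_is = np.sum(X_mag[k, Q_is[k]] ** 2) / w_is[k]
        F[k] = E_ts / (E_ts + E_is) if (E_ts + E_is) > 0 else 0.0
    return F

def apply_mask(Mod, Car, F, I):
    """Eqs. (11)-(12): per channel, build a linear-phase filter f_k(m)
    whose DFT magnitude is F_k (assumed delay: I//2 samples), convolve it
    with the subband modulator, and re-attach the carrier."""
    K, n_frames = Mod.shape
    X_hat = np.zeros((K, n_frames), dtype=complex)
    for k in range(K):
        phase = np.exp(1j * 2 * np.pi * np.arange(I) * (I // 2) / I)
        f_k = np.real(np.fft.ifft(F[k] * phase))      # Eq. (11)
        env = np.convolve(Mod[k], f_k, mode='same')   # filter the modulator
        X_hat[k] = env * Car[k]                       # Eq. (12)
    return X_hat
```

The separated time-domain signal would then be recovered by an inverse STFT of X_hat, as the text explains.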
The frequency mask is calculated by comparing the means of the modulation spectral energy of the target and interference speakers in the following sense:

F^k = \frac{E_{ts}^k}{E_{ts}^k + E_{is}^k}.  (10)

Since there are artifacts associated with applying masks in the modulation frequency domain (see [22]), that domain is not preferable for modulation filtering to mask out the interference and reconstruct a time-domain signal. Instead, the frequency mask is transformed to the time domain. To this end, a filter with linear phase is constructed whose magnitude is F^k and whose phase φ_k(i) is linear in i. Then the inverse DFT is taken:

f_k(m) = \frac{1}{I} \sum_{i=0}^{I-1} F^k e^{j\varphi_k(i)} e^{j 2\pi m i / I}.  (11)

The separated target signal is estimated by convolving (over the variable m) the obtained filter f_k(m) with the modulator signal of the mixture x(n) and then multiplying by the carrier signal of the mixture:

\hat{X}(m, k) = \left[ M(m, k) * f_k(m) \right] C(m, k).  (12)

Finally, the separated target signal in the time domain is obtained by taking the inverse STFT of \hat{X}(m, k).

4. Evaluation

As mentioned earlier, our system estimates the pitch range and uses this range for the speech separation. In this section, we evaluate the proposed system on the processes of pitch range estimation and speech separation.

4.1. Pitch range estimation

First, the proposed system is evaluated on the pitch range estimation process with utterances chosen from Lee's database [23] and a corpus of 100 mixtures of speech and interference [24], commonly used for CASA research; see, e.g., [13,25,26]. The corpus contains utterances from both male and female speakers. These utterances are mixed with a set of intrusions at different SNR levels. The intrusions are N0: 1 kHz pure tone; N1: white noise; N2: noise bursts; N3: cocktail party noise; N4: rock music; N5: siren; N6: trill telephone; N7: female speech; N8: male speech; and N9: female speech. These intrusions have considerable variety; for example, N3 is noise-like, while N5 contains strong harmonic sounds. They form a realistic corpus for evaluating the capacity of a CASA system dealing with various types of interference.

The signal X(k, i) is the modulation spectrogram of an input signal digitized at a 16-kHz sampling rate. The parameters of the proposed system are set to M = 16 and K = 128; w(n) is a Hanning window with length L = 64 (refer to Section 2).
The STFT filter-bank has 128 filter channels, for which the center frequency of the k-th filter channel is ω_k = 2πk/K, k = 0, ..., K-1. Figure 9 shows the modulation spectrogram and the segments obtained for a typical speech frame when the proposed system is applied. The speech signal is a mixture of target and interference at an overall SNR of 0 dB. We select a male speech, a white noise, and a trill telephone as the interference. The results show that although the powers of the speech and interference signals are equal, the proposed method is still able to estimate the pitch range of the target speaker with reasonable accuracy.

Figure 9 Modulation spectrogram and segments obtained for a mixture of a male speaker with (a) male speech, (b) white noise, and (c) trill telephone. The input is shown in Figure 3a.

Figure 10 shows the average error percentage of the pitch range estimation by the proposed system on the above mixtures at different SNR levels. To determine the error percentage, we assign a two-element vector to the margins of each pitch range and find the root-mean-square error distance between the vectors corresponding to the true and estimated pitch ranges. As shown in Figure 10, the proposed system is able to estimate 79.9% of the target pitch range even at -5 dB SNR. The estimation rate increases to about 96.1% as the SNR increases to 15 dB.

Figure 10 Percentage of pitch range estimation error at different SNR levels, for the LSH model, RAPT, MAP, and the proposed algorithm.
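The error measure just described, an RMS distance between the two-element vectors bounding the true and estimated pitch ranges, can be sketched as follows (the [low, high] vector layout is assumed):

```python
import numpy as np

def pitch_range_error(true_range, est_range):
    """RMS distance between the two-element vectors [p_low, p_high] of the
    true and estimated pitch ranges, as used for the error percentage."""
    t = np.asarray(true_range, float)
    e = np.asarray(est_range, float)
    return float(np.sqrt(np.mean((t - e) ** 2)))
```

The reference vector itself comes from framing the clean speech and computing the pitch frequency per frame, as described in the comparison below.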
A reliable evaluation of the proposed system requires a reference range of the true pitch. However, such a reference is probably impossible to obtain from noisy speech. We therefore find the reference pitch range by framing the clean speech signal and calculating the pitch frequency in each frame. The performance of the proposed method is compared with that of the Least Square Harmonic (LSH) technique [27], the Robust Algorithm for Pitch Tracking (RAPT) [28], and the Maximum A Posteriori (MAP) estimator [29]. RAPT and MAP are two standard pitch estimation algorithms. The LSH algorithm, derived in [27] for harmonic decomposition of a time-varying signal, estimates the harmonic amplitudes and phases by solving a set of linear equations that minimizes the mean square error. The RAPT algorithm estimates the pitch frequency by searching for local maxima in the autocorrelation function of the windowed speech signal and then using a dynamic programming technique (see [28] for more details). The MAP approach [29] considers a harmonic model for the voiced speech, so that each windowed signal is expressed with a generalized linear model whose basis functions depend on the fundamental frequency and the number of harmonic partials.

Figure 10 also provides a comparison between the results of pitch estimation using the four mentioned methods, in which the proposed system performs consistently better than the three standard methods at all SNR levels. Although the performance of the LSH model (the best performing among the mentioned standard algorithms) is good at SNR levels above 10 dB, it drops quickly as the SNR decreases, which shows that the proposed system is more robust to interference than the LSH model. As mentioned in [29], MAP performs slightly better at low SNRs than at high SNRs. In addition, RAPT fails to estimate the desired pitch period at low SNRs, because it mistakenly chooses sub-harmonic and harmonic partials instead of the true pitch period. The current scheme performs almost consistently at both high and low SNRs.

4.2. Voiced speech separation

A corpus of 100 mixtures composed of 10 target utterances mixed with 10 intrusions is used for assessing the performance of the system on voiced speech separation; these data are described in Section 4.1. For comparison, the Hu and Wang system [14] and the spectral subtraction method [3] are employed. Performance of the voiced speech separation is evaluated using two measures commonly used for this purpose [14]: the percentage of energy loss, P_EL, which measures the amount of target speech excluded from the segregated speech, and the percentage of residual noise, P_NR, which measures the amount of intrusion included in the segregated speech. P_EL and P_NR are error measures of a separation system and are complementary indices for assessing system performance. In addition, the SNR of the segregated voiced target (in dB) provides a good comparison between waveforms [14]:

\mathrm{SNR} = 10 \log_{10} \frac{\sum_n s^2(n)}{\sum_n \left[ s(n) - \hat{x}(n) \right]^2},  (13)

where \hat{x}(n) is the estimated signal and s(n) is the target signal before being mixed with the intrusion.

The results of our system are shown in Figure 11. Each point in the figures represents the average value over the 100 mixtures of the complete test corpus at a particular SNR level. Figure 11a, b shows the percentage of energy loss and the noise residue. Since the goal here is to segregate the voiced target, the P_EL values are only defined for the target energy in the voiced frames of the target. As shown in Figure 11, the proposed system segregates 78.9% of the voiced target energy at -5 dB SNR and 99% at 15 dB SNR. At the same time, at -5 dB, 15.9% of the segregated energy belongs to the intrusion; this number drops to 0.7% at 15 dB SNR. Figure 11c shows the SNR of the segregated target. The system obtains an average 7.5 dB gain in SNR when the mixture SNR is -5 dB; this gain increases to 14.3 dB when the mixture SNR is 15 dB.
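Eq. (13) translates directly into code:

```python
import numpy as np

def separation_snr(s, x_hat):
    """Eq. (13): SNR (in dB) of the segregated target x_hat against the
    clean target s taken before mixing with the intrusion."""
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum((s - x_hat) ** 2))
```

Because the denominator is the energy of the residual s(n) - x̂(n), both lost target energy (P_EL) and residual intrusion (P_NR) lower this SNR, which is why it complements the two percentage measures.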
As shown in the figure, the segregated target loses more target energy (Figure 11a) but also contains less interference (Figure 11b).

Figure 11 Results of voiced speech separation. (a) Percentage of energy loss on voiced target. (b) Percentage of noise residue. (c) SNR of segregated voiced target. (Each panel compares Hu and Wang (2004) with the proposed algorithm as a function of mixture SNR.)

Figure 11 also shows the performance of the system proposed by Hu and Wang for voiced speech separation
[14], which is representative of CASA systems. As shown in the figure, Hu and Wang's system yields a lower percentage of noise residue (Figure 11b), but a much higher percentage of target energy loss (Figure 11a, c). Nevertheless, it should be noted that our system significantly improves P_EL (in Figure 11a, e.g., by around 11 and 1% at 0 and 15 dB, respectively), which leads to much less signal distortion. The price paid for this is a slight increase in P_NR, as depicted in Figure 11b (e.g., by around 6 and 0.5% at 0 and 15 dB, respectively). The average SNR for each intrusion is shown for the proposed system in Figure 12, in comparison with that of the original mixtures, Hu and Wang's system, and a spectral subtraction method, which is a standard method for speech enhancement [3] (see also [14]). The proposed system performs consistently better than Hu and Wang's system and spectral subtraction. On average, the proposed system obtains an SNR gain about 1.92 dB higher than that of Hu and Wang's system and 8.4 dB higher than that of spectral subtraction. To help the reader recognize the real difference in performance, a file has been prepared that includes sample audio mixture signals (target speech signal + interference signal) and the results of separation using spectral subtraction, Hu and Wang's system, and the proposed system. The file is available at AM-SampleWaves.ppt.

Figure 12 SNR results for segregated speech and original mixtures for a corpus of voiced speech and various intrusions (intrusion types N0-N9).

5. Discussions and conclusions

One of the major challenges in speech enhancement is the separation of a target speech signal from an interference signal of the same type. The accuracy of CASA methods in single channel speech separation depends on the correctness of the pitch frequency estimation of
two simultaneous speakers, because the proper mask in the T-F domain for speech separation is produced in association with the estimated pitch frequency. In this article, a single channel speech separation system is proposed that estimates the pitch range of one or two speakers and segregates the target speech from the interference. The pitch range is estimated using the onset and offset algorithm, considering the distribution of speaker energy in the modulation spectrogram domain. When the target and interference speakers are of the same gender (both male or both female), pitch frequency estimation methods encounter large errors because of the closeness of the pitch frequency values. Therefore, CASA methods that employ the pitch frequency as their main feature for speech separation face difficulties. In contrast, a main novelty of the present algorithm is the estimation of the pitch range based on short time-frames of the mixture signal. The constructed mask for speech separation depends on the pitch range estimated independently in each subband. As shown by the evaluation results, major portions of the voiced target speech are separated from the interfering speech using this mask. In addition, the proposed system can separate those unvoiced portions that are quasi-periodic because of their proximity to voiced portions. The proposed algorithm is robust to interference and produces good estimates of both the pitch range and the voiced speech, even in the presence of strong interference. Systematic evaluation shows that the proposed algorithm performs significantly better than the mentioned CASA and speech enhancement systems. Natural speech utterances usually include silent gaps and other interference-masked intervals. In practice, the utterance across such time-intervals should be grouped. This is a sequential grouping problem [5,6], whose segments or masks can be obtained using speech recognition in a top-down manner (limited, however, to non-speech interference) [11] or speaker recognition trained with speaker models [31].
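The idea of a mask driven by a per-subband pitch range can be reduced to a very small sketch: over a (subband × modulation-frequency) grid, keep the modulation bins that fall inside the pitch range estimated for that subband. This is an illustrative simplification under assumed inputs, not the paper's actual mask construction, which also involves the onset/offset segmentation.

```python
import numpy as np

def pitch_range_mask(mod_freqs, ranges):
    """Binary mask over a (subband x modulation-frequency) grid.
    mod_freqs: modulation-frequency axis in Hz.
    ranges: one (lo, hi) estimated target pitch range per subband.
    Returns a 0/1 array keeping bins inside each subband's range."""
    mod_freqs = np.asarray(mod_freqs, float)
    mask = np.zeros((len(ranges), mod_freqs.size))
    for k, (lo, hi) in enumerate(ranges):
        # keep modulation bins consistent with this subband's pitch range
        mask[k] = (mod_freqs >= lo) & (mod_freqs <= hi)
    return mask
```

Because each row depends only on its own (lo, hi) pair, the mask naturally reflects the text's point that the pitch range is estimated independently in each subband.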
However, the proposed algorithm does not encounter this sequential grouping problem because it operates in the modulation spectrogram domain. In terms of computational complexity, the main cost of the proposed algorithm arises from determining segments in the modulation spectrogram for pitch range estimation. The estimation of the mask and the convolution for speech separation consume a small fraction of the overall cost. Both tasks (pitch range estimation and speech separation) are implemented in the frequency domain, so the computational complexity is O(N log N), where N is the number of samples in the input signal. These operations must be performed separately for each subband. On the other hand, since feature extraction takes place independently in different subbands, substantial speedup can be achieved through parallel computing.
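The O(N log N) path described above amounts to one forward transform, a bin-wise mask multiply, and one inverse transform per subband; since subbands are independent, each iteration of the loop below could be dispatched to a separate worker. This is a generic frequency-domain masking sketch, not the paper's exact filtering chain.

```python
import numpy as np

def apply_mask_fft(subband_signals, masks):
    """Frequency-domain masking per subband: rfft, bin-wise multiply,
    irfft - the O(N log N) operations the complexity argument refers to.
    Each (signal, mask) pair is processed independently."""
    out = []
    for sig, m in zip(subband_signals, masks):
        spec = np.fft.rfft(sig)            # O(N log N)
        out.append(np.fft.irfft(spec * m, n=len(sig)))
    return out
```

An all-ones mask returns the subband unchanged and an all-zeros mask silences it, which makes the per-subband independence (and hence the parallelization opportunity) easy to verify.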
For future work, the proposed algorithm can be improved by iterative estimation of the pitch range and speech separation. The algorithm can include a specific method to jump-start the iterative process, giving an initial estimate of both the pitch range and the mask with reasonable quality. In general, the performance of the algorithm depends on the initial estimate of the pitch range; better initial estimates lead to better performance. Even with a poor estimate of the pitch range, which is unavoidable in very low SNR conditions, the proposed algorithm improves the initial estimate during the iterative process.

Author details
1 Speech Processing Research Lab (SPRL), Electrical and Computer Engineering Department, Yazd University, Yazd, Iran. 2 Control and Intelligent Processing Center of Excellence (CIPCE), School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran. 3 Image Analysis Laboratory, Department of Radiology, Henry Ford Health System, Detroit, MI, USA. 4 Electrical Engineering Department, Amirkabir University of Technology, Tehran, Iran.

Competing interests
The authors declare that they have no competing interests.

Received: 7 May 2011 Accepted: 17 March 2012 Published: 17 March 2012

References
1. RP Lippmann, Speech recognition by machines and humans. Speech Commun. 22, 1-16 (1997)
2. JJ Sroka, LD Braida, Human and machine consonant recognition. Speech Commun. 45 (2005)
3. A de Cheveigne, in Computational Auditory Scene Analysis: Principles, Algorithms, and Applications, ed. by GJ Brown, DL Wang (Wiley & IEEE, Hoboken, NJ, 2006)
4. S Dubnov, J Tabrikian, M Arnon-Targan, Speech source separation in convolutive environments using space-time-frequency analysis. EURASIP J Appl Signal Process 2006 (2006)
5. AS Bregman, Auditory Scene Analysis (MIT Press, Cambridge, MA, 1990)
6. GJ Brown, DL Wang (eds.), Computational Auditory Scene Analysis: Principles, Algorithms, and Applications (Wiley & IEEE, Hoboken, NJ, 2006)
7.
M Buchler, S Allegro, S Launer, N Dillier, Sound classification in hearing aids inspired by auditory scene analysis. EURASIP J Appl Signal Process. 18 (2005)
8. G Hu, D Wang, A tandem algorithm for pitch estimation and voiced speech segregation. IEEE Trans Audio Speech Lang Process. 18(8) (2010)
9. Y Shao, S Srinivasan, Z Jin, D Wang, A computational auditory scene analysis system for speech segregation and robust speech recognition. Comput Speech Lang. 24 (2010)
10. MH Radfar, RM Dansereau, A Sayadiyan, A maximum likelihood estimation of vocal-tract-related filter characteristics for single channel speech separation. EURASIP J Audio Speech Music Process 2007, Article ID 84186 (2007)
11. J Barker, M Cooke, D Ellis, Decoding speech in the presence of other sources. Speech Commun. 45, 5-25 (2005)
12. Y Shao, DL Wang, Model-based sequential organization in cochannel speech. IEEE Trans Acoust Speech Signal Process. 14 (2005)
13. GJ Brown, M Cooke, Computational auditory scene analysis. Comput Speech Lang. 8 (1994)
14. G Hu, DL Wang, Monaural speech separation based on pitch tracking and amplitude modulation. IEEE Trans Neural Netw. 15 (2004)
15. M Wu, DL Wang, GJ Brown, A multipitch tracking algorithm for noisy speech. IEEE Trans Speech Audio Process. 11 (2003)
16. J Le Roux, H Kameoka, N Ono, A de Cheveigne, S Sagayama, Single and multiple F0 contour estimation through parametric spectrogram modeling of speech in noisy environments. IEEE Trans Audio Speech Lang Process. 15 (2007)
17. SM Schimmel, LE Atlas, K Nie, Feasibility of single channel speaker separation based on modulation frequency analysis, in Proc IEEE International Conference on Acoustics, Speech and Signal Processing, Hawaii, USA. 4 (2007)
18. SM Schimmel, Dissertation, University of Washington (2007)
19. L Atlas, SA Shamma, Joint acoustic and modulation frequency. EURASIP J Appl Signal Process. 2003(7) (2003)
20. G Hu, DL Wang, Auditory segmentation based on onset and offset analysis. IEEE Trans Audio Speech Lang Process. 15(2) (2007)
21. R Drullman, JM Festen, R Plomp, Effect of temporal envelope smearing on speech reception. J Acoust Soc Am. 95 (1994)
22. SM Schimmel, LE Atlas, Coherent envelope detection for modulation filtering of speech, in Proc IEEE International Conference on Acoustics, Speech and Signal Processing, Pennsylvania, USA (2005)
23. TW Lee, Blind source separation: audio examples (1998). edu/~tewon/blind/blind_audio.html. Accessed 4 May
24. MP Cooke, Modeling Auditory Processing and Organization (Cambridge University Press, Cambridge, 1993)
25. LA Drake, Dissertation, Northwestern University (2001)
26. DL Wang, GJ Brown, Separation of speech from interfering sounds based on oscillatory correlation. IEEE Trans Neural Netw. 10 (1999)
27. Q Li, L Atlas, Time-variant least-squares harmonic modeling, in Proc IEEE International Conference on Acoustics, Speech and Signal Processing, Hong Kong. 2 (2003)
28. D Talkin, A robust algorithm for pitch tracking (RAPT), in Speech Coding and Synthesis, ed. by KK Paliwal, WB Kleijn (Elsevier, New York, NY, 1995)
29. J Tabrikian, S Dubnov, Y Dickalov, Maximum a posteriori probability pitch tracking in noisy environments using harmonic model. IEEE Trans Speech Audio Process. 12 (2004)
30. X Huang, A Acero, HW Hon, Spoken Language Processing: A Guide to Theory, Algorithms, and System Development (Prentice Hall PTR, Upper Saddle River, NJ, 2001)
31. Y Shao, Dissertation, The Ohio State University (2007)

Cite this article as: Mahmoodzadeh et al.: Single channel speech separation in modulation frequency domain based on a novel pitch range estimation method. EURASIP Journal on Advances in Signal Processing 2012:67.
Frequency Domain Analysis for Noise Suppression Using Spectral Processing Methods for Degraded Speech Signal in Speech Enhancement 1 Zeeshan Hashmi Khateeb, 2 Gopalaiah 1,2 Department of Instrumentation
More informationComparison of Spectral Analysis Methods for Automatic Speech Recognition
INTERSPEECH 2013 Comparison of Spectral Analysis Methods for Automatic Speech Recognition Venkata Neelima Parinam, Chandra Vootkuri, Stephen A. Zahorian Department of Electrical and Computer Engineering
More informationTwo-channel Separation of Speech Using Direction-of-arrival Estimation And Sinusoids Plus Transients Modeling
Two-channel Separation of Speech Using Direction-of-arrival Estimation And Sinusoids Plus Transients Modeling Mikko Parviainen 1 and Tuomas Virtanen 2 Institute of Signal Processing Tampere University
More informationECE 5655/4655 Laboratory Problems
Assignment #4 ECE 5655/4655 Laboratory Problems Make Note o the Following: Due Monday April 15, 2019 I possible write your lab report in Jupyter notebook I you choose to use the spectrum/network analyzer
More informationChange Point Determination in Audio Data Using Auditory Features
INTL JOURNAL OF ELECTRONICS AND TELECOMMUNICATIONS, 0, VOL., NO., PP. 8 90 Manuscript received April, 0; revised June, 0. DOI: /eletel-0-00 Change Point Determination in Audio Data Using Auditory Features
More informationDifferent Approaches of Spectral Subtraction Method for Speech Enhancement
ISSN 2249 5460 Available online at www.internationalejournals.com International ejournals International Journal of Mathematical Sciences, Technology and Humanities 95 (2013 1056 1062 Different Approaches
More informationImplementation of an Intelligent Target Classifier with Bicoherence Feature Set
ISSN: 39-8753 International Journal o Innovative Research in Science, (An ISO 397: 007 Certiied Organization Vol. 3, Issue, November 04 Implementation o an Intelligent Target Classiier with Bicoherence
More informationEE482: Digital Signal Processing Applications
Professor Brendan Morris, SEB 3216, brendan.morris@unlv.edu EE482: Digital Signal Processing Applications Spring 2014 TTh 14:30-15:45 CBC C222 Lecture 12 Speech Signal Processing 14/03/25 http://www.ee.unlv.edu/~b1morris/ee482/
More informationFrequency Hopped Spread Spectrum
FH- 5. Frequency Hopped pread pectrum ntroduction n the next ew lessons we will be examining spread spectrum communications. This idea was originally developed or military communication systems. However,
More informationA new zoom algorithm and its use in frequency estimation
Waves Wavelets Fractals Adv. Anal. 5; :7 Research Article Open Access Manuel D. Ortigueira, António S. Serralheiro, and J. A. Tenreiro Machado A new zoom algorithm and its use in requency estimation DOI.55/wwaa-5-
More informationAspiration Noise during Phonation: Synthesis, Analysis, and Pitch-Scale Modification. Daryush Mehta
Aspiration Noise during Phonation: Synthesis, Analysis, and Pitch-Scale Modification Daryush Mehta SHBT 03 Research Advisor: Thomas F. Quatieri Speech and Hearing Biosciences and Technology 1 Summary Studied
More informationHigh Speed Communication Circuits and Systems Lecture 10 Mixers
High Speed Communication Circuits and Systems Lecture Mixers Michael H. Perrott March 5, 24 Copyright 24 by Michael H. Perrott All rights reserved. Mixer Design or Wireless Systems From Antenna and Bandpass
More informationfor Single-Tone Frequency Tracking H. C. So Department of Computer Engineering & Information Technology, City University of Hong Kong,
A Comparative Study of Three Recursive Least Squares Algorithms for Single-Tone Frequency Tracking H. C. So Department of Computer Engineering & Information Technology, City University of Hong Kong, Tat
More informationCHAPTER 3 SPEECH ENHANCEMENT ALGORITHMS
46 CHAPTER 3 SPEECH ENHANCEMENT ALGORITHMS 3.1 INTRODUCTION Personal communication of today is impaired by nearly ubiquitous noise. Speech communication becomes difficult under these conditions; speech
More informationNCCF ACF. cepstrum coef. error signal > samples
ESTIMATION OF FUNDAMENTAL FREQUENCY IN SPEECH Petr Motl»cek 1 Abstract This paper presents an application of one method for improving fundamental frequency detection from the speech. The method is based
More informationSignal segmentation and waveform characterization. Biosignal processing, S Autumn 2012
Signal segmentation and waveform characterization Biosignal processing, 5173S Autumn 01 Short-time analysis of signals Signal statistics may vary in time: nonstationary how to compute signal characterizations?
More informationBinaural Hearing. Reading: Yost Ch. 12
Binaural Hearing Reading: Yost Ch. 12 Binaural Advantages Sounds in our environment are usually complex, and occur either simultaneously or close together in time. Studies have shown that the ability to
More informationStudy Of Sound Source Localization Using Music Method In Real Acoustic Environment
International Journal of Electronics Engineering Research. ISSN 975-645 Volume 9, Number 4 (27) pp. 545-556 Research India Publications http://www.ripublication.com Study Of Sound Source Localization Using
More informationMultiple Sound Sources Localization Using Energetic Analysis Method
VOL.3, NO.4, DECEMBER 1 Multiple Sound Sources Localization Using Energetic Analysis Method Hasan Khaddour, Jiří Schimmel Department of Telecommunications FEEC, Brno University of Technology Purkyňova
More informationWavelet Speech Enhancement based on the Teager Energy Operator
Wavelet Speech Enhancement based on the Teager Energy Operator Mohammed Bahoura and Jean Rouat ERMETIS, DSA, Université du Québec à Chicoutimi, Chicoutimi, Québec, G7H 2B1, Canada. Abstract We propose
More informationROBUST PITCH TRACKING USING LINEAR REGRESSION OF THE PHASE
- @ Ramon E Prieto et al Robust Pitch Tracking ROUST PITCH TRACKIN USIN LINEAR RERESSION OF THE PHASE Ramon E Prieto, Sora Kim 2 Electrical Engineering Department, Stanford University, rprieto@stanfordedu
More informationPLL AND NUMBER OF SAMPLE SYNCHRONISATION TECHNIQUES FOR ELECTRICAL POWER QUALITY MEASURMENTS
XX IMEKO World Congress Metrology or Green Growth September 9 14, 2012, Busan, Republic o Korea PLL AND NUMBER OF SAMPLE SYNCHRONISATION TECHNIQUES FOR ELECTRICAL POWER QUALITY MEASURMENTS Richárd Bátori
More informationSynchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech
INTERSPEECH 5 Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech M. A. Tuğtekin Turan and Engin Erzin Multimedia, Vision and Graphics Laboratory,
More informationHCS 7367 Speech Perception
HCS 7367 Speech Perception Dr. Peter Assmann Fall 212 Power spectrum model of masking Assumptions: Only frequencies within the passband of the auditory filter contribute to masking. Detection is based
More information