Hierarchical and parallel processing of auditory and modulation frequencies for automatic speech recognition


Speech Communication 52 (2010)

Fabio Valente
IDIAP Research Institute, CH-1920 Martigny, Switzerland
E-mail address: fabio.valente@idiap.ch

Received 13 October 2009; received in revised form 25 May 2010; accepted 26 May 2010

Abstract

This paper investigates, from an automatic speech recognition perspective, the most effective way of combining Multi Layer Perceptron (MLP) classifiers trained on different ranges of auditory and modulation frequencies. Two different combination schemes based on MLPs are considered. The first operates in parallel fashion and is invariant to the order in which feature streams are introduced. The second operates in hierarchical fashion and is sensitive to the order in which feature streams are introduced. The study is carried out on a Large Vocabulary Continuous Speech Recognition system for transcription of meeting data using the TANDEM approach. Results reveal that (1) the combination of MLPs trained on different ranges of auditory frequencies is more effective if performed in parallel fashion; (2) the combination of MLPs trained on different ranges of modulation frequencies is more effective if performed in hierarchical fashion moving from high to low modulations; (3) the improvement obtained from separate processing of two modulation frequency ranges (12% relative WER reduction w.r.t. the single classifier approach) is considerably larger than the improvement obtained from separate processing of two auditory frequency ranges (4% relative WER reduction w.r.t. the single classifier approach). Similar results are also verified on other LVCSR systems and on other languages. Furthermore, the paper extends the discussion to the combination of classifiers trained on separate auditory-modulation frequency channels, showing that the previous conclusions hold also in this scenario.
© 2010 Elsevier B.V. All rights reserved.

Keywords: Automatic speech recognition (ASR); TANDEM features; Multi Layer Perceptron (MLP); Auditory and modulation frequencies

1. Introduction

Typical automatic speech recognition (ASR) features are obtained through the short-term spectrum of 30 ms segments of the speech signal. This representation extracts instantaneous frequency components of the signal. The power spectrum is then integrated using a bank of filters equally spaced on an auditory scale (e.g. the Bark scale), thus obtaining the auditory spectrum. Studies on recognition of nonsense syllables (Fletcher, 1953) have shown that humans process speech separately in different auditory frequency channels (known as articulatory bands) and classify a speech sound by merging estimates from the different bands. Later, Allen (1994, 2005) interpreted these findings as showing that the recognition of speech in each articulatory band is done independently and that a correct decision is obtained if the sound is correctly recognized in at least one of the bands. The size of each articulatory band spans approximately two critical bands (Allen, 2005). Those observations have inspired automatic speech recognition approaches referred to as multi-band ASR. Multi-band ASR (Hermansky et al., 1996; Bourlard and Dupont, 1996) uses a set of independent classifiers (e.g. Multi Layer Perceptrons (MLP) or Hidden Markov Models (HMM)) trained on different parts of the auditory spectrum in order to discriminate between phonetic targets. The classifier outputs are then combined, obtaining a final decision on the phonetic targets.
Typical combination frameworks include both merger classifiers (another MLP; Hermansky, 2003; Chen et al., 2003) and rule-based combinations, e.g. inverse entropy (Misra et al., 2003) or Dempster-Shafer combination (Valente and Hermansky).

Multi-band speech recognition was originally introduced for dealing with noise, the rationale being that if noise affects a particular auditory band, correct phonetic recognition can still be obtained using information coming from the uncorrupted bands. Later, the multi-band paradigm was generalized into the multi-stream paradigm, where independent classifiers are trained on different representations of the speech signal, including conventional spectral features (PLP) (Hermansky et al., 2000), long-time critical band energy trajectories (Hermansky and Sharma, 1999; Morgan et al., 2004) and spectro-temporal modulations (Kleinschmidt, 2002; Hermansky and Fousek, 2005; Zhao et al., 2009). Several multi-stream systems make use of features based on long time windows of the speech signal (e.g. Hermansky and Sharma, 1999; Hermansky and Fousek, 2005; Kleinschmidt, 2002; Zhao et al., 2009; Morgan et al., 2004; Hermansky, 2003). Conventional Short Term Fourier Transform features do not provide information on the speech dynamics. Those are generally introduced using temporal differentials of the spectral trajectory (also known as delta features) or by processing long segments of spectral energy trajectories, i.e. the modulation spectrum (Hermansky, 1998). Several studies have been carried out to evaluate the importance of the different parts of the modulation spectrum for ASR applications (Hermansky et al., 1997), and robustness techniques like RASTA filtering are based on emphasizing the modulation spectrum frequencies that are most important for speech recognition (Hermansky and Morgan, 1994). This study is motivated by two main arguments:

- Current multi-band/multi-stream approaches operate in two separate steps: in the first step a set of independent classifiers (e.g. MLPs) is trained to discriminate between phonetic targets, i.e. to estimate phoneme posterior probabilities; in a second step all the individual estimates are combined into a single phoneme posterior estimate. The combination happens in parallel fashion, i.e. it is invariant to the order in which the different features are introduced. Alternative methods for combining information based on hierarchies of classifiers have been proposed in the literature (Sivadas and Hermansky, 2002; Valente et al., 2007; Valente and Hermansky, 2008) and have shown results competitive with the parallel scheme. In contrast to the parallel scheme, hierarchical combinations are sequential, i.e. they assume an ordering in the processing.
- Although proven effective in several small and large vocabulary ASR tasks, the parallel combination scheme is motivated by observations made on the auditory spectrum of the speech signal. Speech temporal modulations represent the dynamics of the signal and they are extracted using different time scales. No specific studies have been carried out on the optimal way of combining classifiers trained on this type of information.

This paper aims at investigating the combination of classifiers trained on different ranges of auditory frequencies (as in conventional multi-band approaches) and modulation frequencies. In particular, we study from an ASR perspective whether the combination of information obtained from auditory and modulation frequency channels is more effective in parallel (as in conventional multi-band) or hierarchical (sequential) fashion.
In contrast to related work, this study is carried out on a Large Vocabulary Continuous Speech Recognition task using the TANDEM approach. The paper is organized as follows: Section 2 describes the pre-processing techniques that extract different ranges of auditory and modulation frequencies and the joint auditory-modulation channels. We limit the investigation to two auditory channels and two modulation channels, thus four joint auditory-modulation channels, to simplify the setup. Section 3 describes two different combination schemes (parallel and hierarchical) based on Multi Layer Perceptron (MLP) classifiers, and Section 4 presents the experimental framework based on an LVCSR system for transcription of meeting data. The combination of classifiers trained on different ranges of auditory frequencies is investigated in Section 5, and the combination of classifiers trained on different ranges of modulation frequencies is investigated in Section 6. Joint auditory-modulation frequency processing is then presented in Section 7, and Section 8 describes results on Single Distant Microphone (SDM) data. Section 9 describes the application of those features in other LVCSR systems, and finally Section 10 concludes the paper, discussing results and presenting future directions.

2. Time-frequency processing

This section presents the processing used for extracting evidence from different auditory-modulation frequency sub-bands. Feature extraction is composed of the following parts: the critical band auditory spectrum is extracted from the Short Time Fourier Transform of the signal every 10 ms. In the following study, the power spectrum is integrated using a bank of filters equally spaced on the Bark scale; 15 critical bands are used. This step is common to several conventional feature extraction methods used in ASR, e.g. PLP features.
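As a minimal illustration (not the author's exact front end), the critical-band analysis just described could be sketched as follows. The 10 ms hop and the 15 critical bands follow the text; the 30 ms window, the Hamming window, the triangular filter shape and the particular Bark formula are assumptions of the sketch.

```python
# Sketch of a critical-band (Bark) auditory spectrum: short-term power
# spectrum every 10 ms, integrated by 15 filters equally spaced on a Bark scale.
import numpy as np

def hz_to_bark(f):
    # One common Bark approximation (an assumption; the paper does not say which).
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_filterbank(n_fft, sr, n_bands=15):
    freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    bark = hz_to_bark(freqs)
    edges = np.linspace(bark.min(), bark.max(), n_bands + 2)
    fb = np.zeros((n_bands, len(freqs)))
    for b in range(n_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        # Triangular filter on the Bark axis (assumed shape).
        fb[b] = np.clip(np.minimum((bark - lo) / (mid - lo),
                                   (hi - bark) / (hi - mid)), 0.0, None)
    return fb

def auditory_spectrum(signal, sr=16000, win=0.030, hop=0.010, n_bands=15):
    n_win, n_hop = int(sr * win), int(sr * hop)
    n_fft = int(2 ** np.ceil(np.log2(n_win)))
    fb = bark_filterbank(n_fft, sr, n_bands)
    window = np.hamming(n_win)
    frames = []
    for start in range(0, len(signal) - n_win, n_hop):
        spec = np.fft.rfft(signal[start:start + n_win] * window, n_fft)
        frames.append(fb @ np.abs(spec) ** 2)   # critical-band energies
    return np.array(frames)                      # (number of 10 ms frames, 15)

if __name__ == "__main__":
    x = np.random.randn(16000)                   # 1 s of noise as a stand-in signal
    print(auditory_spectrum(x).shape)
```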

Fig. 1. Set of temporal filters obtained by first (G1, left) and second (G2, right) order derivation of a Gaussian function. G1 and G2 are successively split into two filter banks, (G1-Low and G2-Low, dashed lines) and (G1-High and G2-High, continuous lines), that filter low and high modulation frequencies, respectively.

Fig. 2. Normalized frequency response of G1 (left) and G2 (right). G1 and G2 are successively split into two filter banks: G1-Low and G2-Low (dashed lines) emphasize low modulation frequencies while G1-High and G2-High emphasize high modulation frequencies.

Different ranges of modulation frequencies are extracted using MRASTA filtering (see Hermansky and Fousek, 2005, for details). MRASTA is an extension of RASTA filtering and extracts different modulation frequencies using a set of multiple-resolution filters. A one-second-long temporal trajectory in each critical band is filtered with a bank of band-pass filters. Those filters represent first derivatives G1 = [g1_{\sigma_i}] (Eq. (1)) and second derivatives G2 = [g2_{\sigma_i}] (Eq. (2)) of Gaussian functions with variance \sigma_i varying in the range 8-60 ms (see Fig. 1). In effect, the MRASTA filters are multi-resolution band-pass filters on modulation frequency, dividing the available modulation frequency range into individual sub-bands. (Unlike in Hermansky and Fousek (2005), the filter-banks G1 and G2 are composed of six filters rather than eight, leaving out the two filters with the longest impulse responses.)

g1_{\sigma_i}(x) \propto \frac{x}{\sigma_i^2} \exp\left(-\frac{x^2}{2\sigma_i^2}\right)    (1)

g2_{\sigma_i}(x) \propto \left(\frac{x^2}{\sigma_i^4} - \frac{1}{\sigma_i^2}\right) \exp\left(-\frac{x^2}{2\sigma_i^2}\right)    (2)

with \sigma_i \in \{0.8, 1.2, 1.8, 2.7, 4, 6\}.

In the modulation frequency domain, they correspond to a filter-bank with equally spaced filters on a logarithmic scale (see Fig. 2). Identical filters are used for all critical bands. Thus, they provide a multiple-resolution representation of the time-frequency plane. MRASTA filtering is consistent with studies in (Houtgast, 1989; Dau et al., 1997) where human perception of modulation frequencies is modeled using a bank of filters equally spaced on a logarithmic scale. This bank of filters subdivides the available modulation frequency range into separate channels with decreasing frequency resolution moving from slow to fast modulations. After MRASTA filtering the total number of features is 15 x 12 = 180. Then frequency derivatives across three consecutive critical bands are introduced (Hermansky and Fousek, 2005). The representation considers all the possible auditory and modulation frequency ranges of the speech signal.

Let us now divide this available information into different auditory and modulation frequency channels. In order to obtain different auditory frequency sub-bands, the available 15 critical bands are split into two ranges of seven and eight critical bands, referred to as F-Low and F-High, respectively. The investigation is limited to two parts to simplify the setup. Filter-banks G1 and G2 cover the whole range of modulation frequencies. We are interested in processing different parts of the modulation spectrum separately and, again, we limit the investigation to two parts to simplify the setup. Similarly to what is done with the auditory filter-banks, the filter-banks G1 and G2 (six filters each) are split into separate filter banks G1-Low, G2-Low and G1-High, G2-High that filter slow and fast modulation frequencies, respectively. We define G-High and G-Low as follows:

G-High = [G1-High, G2-High] = [g1_{\sigma_i}, g2_{\sigma_i}]  with  \sigma_i \in \{0.8, 1.2, 1.8\}    (3)

G-Low = [G1-Low, G2-Low] = [g1_{\sigma_i}, g2_{\sigma_i}]  with  \sigma_i \in \{2.7, 4, 6\}    (4)

Filters G1-High and G2-High are short filters (Fig. 1, continuous lines) and they process high modulation frequencies (Fig. 2, continuous lines). Filters G1-Low and G2-Low are long filters (Fig. 1, dashed lines) and they process low modulation frequencies (Fig. 2, dashed lines).
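The filters of Eqs. (1)-(2) and the split of Eqs. (3)-(4) can be sketched as below. The filter shapes follow the equations and the split follows the sigma groupings; the filter length (one second, i.e. 101 frames at the 10 ms frame rate) matches the one-second trajectories mentioned in the text, while the normalization is an assumption of the sketch.

```python
# Sketch of the MRASTA temporal filter banks and their G-High / G-Low split.
import numpy as np

SIGMAS = np.array([0.8, 1.2, 1.8, 2.7, 4.0, 6.0])   # in 10 ms frames (8-60 ms)

def mrasta_filters(sigmas=SIGMAS, half_len=50):
    t = np.arange(-half_len, half_len + 1, dtype=float)     # 101-tap filters (~1 s)
    g1, g2 = [], []
    for s in sigmas:
        env = np.exp(-t ** 2 / (2 * s ** 2))
        h1 = (t / s ** 2) * env                       # Eq. (1): first derivative
        h2 = (t ** 2 / s ** 4 - 1.0 / s ** 2) * env   # Eq. (2): second derivative
        g1.append(h1 / np.abs(h1).sum())              # crude normalization (assumed)
        g2.append(h2 / np.abs(h2).sum())
    return np.array(g1), np.array(g2)

def split_high_low(g1, g2):
    # G-High: three shortest filters of each bank (Eq. (3));
    # G-Low: three longest filters of each bank (Eq. (4)).
    return np.vstack([g1[:3], g2[:3]]), np.vstack([g1[3:], g2[3:]])

def filter_trajectories(aud_spec, bank):
    # aud_spec: (n_frames, n_bands). Each critical-band trajectory is convolved
    # with every filter of the bank; the full 12-filter bank over 15 bands
    # gives the 15 x 12 = 180 features mentioned in the text.
    out = [np.apply_along_axis(lambda x: np.convolve(x, h, mode="same"), 0, aud_spec)
           for h in bank]
    return np.concatenate(out, axis=1)

if __name__ == "__main__":
    g1, g2 = mrasta_filters()
    g_high, g_low = split_high_low(g1, g2)
    aud = np.random.randn(200, 15)                    # stand-in auditory spectrum
    print(filter_trajectories(aud, g_high).shape)     # (200, 15 * 6) = (200, 90)
```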

The effect of this filtering is depicted in Fig. 3: the left picture plots the auditory spectrum of a speech signal, while the center and right pictures plot the auditory spectrum after filtering with g1_{1.2} and g1_{6}, respectively.

Fig. 3. Auditory spectrum of a speech signal (left) and its filtered versions with filter g1_{1.2}, i.e. extraction of high modulation frequencies (center), and g1_{6}, i.e. extraction of low modulation frequencies (right).

We can notice that high modulations represent the original auditory spectrum with more detail, while low modulations give a coarse representation of the spectrum. The cutoff frequency for both filter-banks G-High and G-Low is approximately 10 Hz. Separate ranges of auditory and modulation frequency channels can be obtained by dividing the initial spectrogram into four channels, (F-Low, G-Low), (F-Low, G-High), (F-High, G-Low) and (F-High, G-High), that represent the combinations of low/high auditory and modulation frequencies. This processing is depicted in Fig. 4.

Fig. 4. Auditory-modulation frequency channel extraction: the auditory spectrum is filtered with a set of Gaussian filters that extract high and low modulation frequencies (G-High and G-Low). After that, auditory frequencies are divided into two channels (F-High and F-Low). This produces four different auditory-modulation channels: (F-Low, G-Low), (F-Low, G-High), (F-High, G-Low), (F-High, G-High).

In the remainder, the investigation focuses on the most effective way of combining classifiers trained on:

1. Separate ranges of auditory frequencies (F-High) and (F-Low) (Section 5).
2. Separate ranges of modulation frequencies (G-High) and (G-Low) (Section 6).
3. Separate ranges of auditory-modulation frequencies (F-Low, G-Low), (F-Low, G-High), (F-High, G-Low), (F-High, G-High) (Section 7).

3. Combination of classifiers

The classifier used for this study is the Multi Layer Perceptron (MLP) described in (Bourlard and Morgan, 1994). Training is done using back-propagation (Rumelhart et al., 1986), minimizing the cross entropy between MLP outputs and phonetic targets (Bourlard and Morgan, 1994). The MLP output can thus be considered an estimate of the phoneme posterior probabilities conditioned on the acoustic observation vector. Phoneme posterior probabilities are then used as conventional features in HMM/GMM systems through the TANDEM approach (Hermansky et al., 2000). Two different combination schemes based on Multi Layer Perceptrons (MLP) are studied: the first combines two feature streams in parallel fashion, while the second combines features in hierarchical (sequential) fashion.

3.1. Parallel combination

Given two different feature streams, a separate MLP for estimating phoneme posterior probabilities is trained independently on each of them. Phoneme posterior probabilities from the individual MLPs are then concatenated together, forming the input to a third MLP which produces a single phoneme posterior estimate. The process is depicted in Fig. 5. This architecture is generally used for combining classifiers trained on different auditory frequency bands (Hermansky et al., 1996) and has been used as well in many multi-stream systems that use speech temporal modulations (e.g. Kleinschmidt, 2002; Morgan et al., 2004; Zhao et al., 2009).
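A toy sketch of the parallel scheme of Fig. 5 is given below, with scikit-learn MLPs standing in for the phoneme-posterior estimators of the paper; the feature streams, dimensions and training data are invented for the example and do not reproduce the actual system.

```python
# Parallel combination (Fig. 5): two stream-specific MLPs, one merger MLP.
import numpy as np
from sklearn.neural_network import MLPClassifier

n, n_phones = 2000, 42
rng = np.random.default_rng(0)
stream_a = rng.normal(size=(n, 90))      # e.g. features from one frequency range
stream_b = rng.normal(size=(n, 90))      # e.g. features from the other range
labels = np.arange(n) % n_phones         # toy frame-level phonetic targets

# Step 1: one MLP per stream, each estimating phoneme posteriors.
mlp_a = MLPClassifier(hidden_layer_sizes=(200,), max_iter=50).fit(stream_a, labels)
mlp_b = MLPClassifier(hidden_layer_sizes=(200,), max_iter=50).fit(stream_b, labels)
post_a = mlp_a.predict_proba(stream_a)   # (n, 42)
post_b = mlp_b.predict_proba(stream_b)   # (n, 42)

# Step 2: concatenate the posterior streams and train a merger MLP.
# Swapping stream_a and stream_b changes nothing: the scheme is order-invariant.
merger_in = np.hstack([post_a, post_b])  # (n, 84)
merger = MLPClassifier(hidden_layer_sizes=(200,), max_iter=50).fit(merger_in, labels)
print(merger.predict_proba(merger_in).shape)   # (n, 42) combined posteriors
```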

Fig. 5. Parallel processing of two feature sets. This combination scheme is invariant to the order in which features are introduced.

Fig. 6. Hierarchical processing of two feature sets. This combination scheme is sensitive to the order in which features are introduced.

3.2. Hierarchical processing

An MLP is trained on a first feature stream in order to obtain phoneme posteriors. These posteriors are then concatenated with a second feature stream, thus forming the input to a second phoneme-posterior-estimating MLP. In such a way, phoneme estimates from the first MLP are modified by a second net using evidence from a different feature stream. This process is depicted in Fig. 6. In contrast to parallel processing, the order in which features are presented does make a difference. Hierarchical processing integrates the information contained in the different frequency channels in sequential fashion, progressively modifying the phoneme posteriors obtained from the first MLP using a different signal representation. In the rest of the paper, the total number of parameters in the various MLP structures is kept constant by modifying the size of the hidden layer. This allows a fair comparison between the different experiments without biasing results towards structures that contain more parameters.
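The hierarchical (sequential) scheme of Fig. 6 can be sketched in the same toy setting as above; again the scikit-learn MLPs, dimensions and data are stand-ins, and the hidden-layer sizes are arbitrary rather than matched in parameter count as in the experiments.

```python
# Hierarchical combination (Fig. 6): the second net revises the first net's
# posteriors in the light of a new feature stream, so the order matters.
import numpy as np
from sklearn.neural_network import MLPClassifier

n, n_phones = 2000, 42
rng = np.random.default_rng(0)
stream_first = rng.normal(size=(n, 90))    # stream presented first
stream_second = rng.normal(size=(n, 90))   # stream presented second
labels = np.arange(n) % n_phones

# First MLP: posteriors estimated from the first stream only.
mlp1 = MLPClassifier(hidden_layer_sizes=(200,), max_iter=50).fit(stream_first, labels)
post1 = mlp1.predict_proba(stream_first)                 # (n, 42)

# Second MLP: first-stage posteriors concatenated with the second stream.
mlp2_in = np.hstack([post1, stream_second])              # (n, 42 + 90)
mlp2 = MLPClassifier(hidden_layer_sizes=(200,), max_iter=50).fit(mlp2_in, labels)
print(mlp2.predict_proba(mlp2_in).shape)                 # (n, 42) final posteriors
```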

4. Experimental setting

In contrast to previous related work on considerably smaller amounts of data, we pursue the investigation on a Large Vocabulary Continuous Speech Recognition (LVCSR) task. The system is a simplified version of the first pass of the AMI LVCSR system for meeting transcription described in (Hain et al., 2005) and uses a pruned trigram language model. (The first-pass RT05 system does not include VTLN, HLDA or speaker adaptation, and decoding is done using a pruned trigram language model; only the first pass is used, as the paper does not focus on benchmarking the LVCSR system but on comparing the different feature sets.) The training data for this system comprises individual headset microphone (IHM) data from four meeting corpora: NIST (13 h), ISL (10 h), ICSI (73 h) and a preliminary part of the AMI corpus (16 h). Acoustic models are phonetically state-tied triphone models trained using standard HTK maximum likelihood training procedures. The recognition experiments are conducted on the NIST RT05s evaluation data (tests/rt/rt2005/spring/), Individual Headset Microphone (IHM) part, which is composed of speech recorded in five different meeting rooms (AMI, CMU, ICSI, NIST, VT). The pronunciation dictionary is the same as the one used in the AMI NIST RT05s system (Hain et al., 2005). The Juicer large vocabulary decoder (Moore et al., 2006) is used for recognition with a pruned trigram language model.

In order to use phoneme posteriors in a conventional HMM/GMM system, the TANDEM approach is used (Hermansky et al., 2000). The different time-frequency representations are used as input to an MLP which estimates phoneme posterior probabilities. The phonetic targets consist of 42 phonemes. Phoneme posteriors are then modified according to a Log/KLT transform and used as conventional features for HMM/GMM systems. After the KLT, only the first 25 components are retained, accounting for 95% of the variability. Table 1 reports results for the baseline system (PLP plus dynamic features) and the MRASTA TANDEM system, where all the available ranges of modulation and auditory frequencies are processed using a single MLP. The MRASTA TANDEM features perform 3.4% worse than the PLP baseline system.

Table 1. RT05 WER for the baseline PLP system and the MRASTA TANDEM features (columns: PLP, MRASTA).
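The Log/KLT post-processing described above (log of the phoneme posteriors, followed by a KLT estimated on training data, keeping the first 25 components) could be sketched as follows; the epsilon flooring and the synthetic posteriors are assumptions of the sketch.

```python
# Sketch of TANDEM Log/KLT post-processing of phoneme posteriors.
import numpy as np

def log_klt(train_post, test_post, n_keep=25, eps=1e-10):
    logp_train = np.log(train_post + eps)
    logp_test = np.log(test_post + eps)
    mean = logp_train.mean(axis=0)
    cov = np.cov(logp_train - mean, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)           # eigenvalues in ascending order
    basis = eigvec[:, ::-1][:, :n_keep]            # top n_keep KLT directions
    kept_var = eigval[::-1][:n_keep].sum() / eigval.sum()
    return (logp_test - mean) @ basis, kept_var    # decorrelated features

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    post = rng.dirichlet(np.ones(42), size=5000)   # stand-in posteriors (rows sum to 1)
    feats, var = log_klt(post, post)
    print(feats.shape, round(float(var), 3))       # (5000, 25) and retained variance
```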
5. Combination of auditory frequency channels

In this section, we investigate from an ASR perspective whether classifiers trained on separate auditory frequency ranges should be combined in parallel or hierarchical fashion. The auditory spectrum is split into two frequency ranges composed of 7 and 8 critical bands. MRASTA filtering is then applied, resulting in two feature sets of 168 and 192 components each. We refer to those two feature sets as F-Low and F-High; they contain all the available modulation frequencies extracted at low and high auditory frequencies, respectively. Table 2 reports the WER for the two MLP features obtained by training on the separate auditory frequency ranges.

Table 2. RT05 WER for TANDEM features obtained by training MLPs on low and high auditory frequencies (columns: F-Low, F-High).

Classifiers trained on F-Low and F-High are then combined according to the schemes described in Section 3, i.e. in parallel fashion or in hierarchical (sequential) fashion. In contrast to the parallel combination, the hierarchical combination is sensitive to the order in which features are introduced; thus we consider the cases in which the processing moves from F-Low to F-High and vice-versa. Results are reported in Table 3.

Table 3. RT05 WER for TANDEM features obtained by combining MLPs trained on separate ranges of auditory frequencies, both in parallel and in hierarchical fashion (columns: F-Low to F-High, F-High to F-Low, Parallel).

The parallel processing outperforms the hierarchical processing. Using two separate frequency channels reduces the WER by about 2% absolute (i.e. from 45.8% to 43.9%) w.r.t. the single classifier approach. The variation in WER between the three combination schemes is approximately 1%. Those results are consistent with findings on human speech recognition (Fletcher, 1953; Allen, 1994) and confirm that the parallel scheme (as in conventional multi-band systems) outperforms other combinations in the case of classifiers trained on different auditory frequency channels.

6. Combination of modulation frequency channels

In this section we investigate from an ASR perspective whether classifiers trained on separate modulation frequency ranges should be combined in parallel or hierarchical fashion. As before, we limit the splitting to only two modulation frequency ranges. Filter-banks G1 and G2 (six filters each) are divided into two separate filter banks as described in Section 2. We refer to those as high modulations (G-High) and low modulations (G-Low); they contain the entire available range of auditory frequencies extracted at high and low modulations, respectively. Performances of MLP features trained on G-High and G-Low are reported in Table 4.

Table 4. RT05 WER for TANDEM features obtained by training MLPs on high and low modulation frequencies (columns: G-High, G-Low).

Classifiers trained on G-Low and G-High are then combined in parallel fashion or in hierarchical (sequential) fashion, moving either from G-Low to G-High or from G-High to G-Low. Results are reported in Table 5.

Table 5. RT05 WER for TANDEM features obtained by combining MLPs trained on separate ranges of modulation frequencies, both in parallel and in hierarchical fashion (columns: G-Low to G-High, G-High to G-Low, Parallel).

Moving in the hierarchy from low to high modulation frequencies yields similar performance to the single MLP approach. On the other hand, moving from high to low modulation frequencies produces a significant reduction of 5.8% absolute in the final WER w.r.t. the single classifier approach.

The parallel combination performs 1.4% absolute worse than the sequential combination. In contrast to auditory frequencies, the hierarchical combination outperforms the parallel combination when the processing moves from high to low modulation frequencies. Furthermore, this approach outperforms the PLP baseline by 2.5% absolute. The variation in WER between the three combination schemes is considerably larger than the variation obtained by splitting the auditory frequencies. Those findings are consistent with physiological experiments in (Miller et al., 2002) that show how different levels of speech processing may attend to different rates of the modulation spectrum, the higher levels emphasizing lower modulation frequency rates.

To verify that the improvements in the previous experiment are produced by the sequential processing of modulation frequencies and not by the hierarchy of MLP classifiers itself, an additional experiment is proposed. Posterior features from the single MRASTA MLP (i.e. all auditory and modulation frequencies processed simultaneously) are presented as input to a second MLP. The second MLP does not use additional input but only re-processes a block of several concatenated posterior features. Table 6 reports the WER on the RT05 data set. Results show an improvement w.r.t. the single MRASTA classifier of 1.6% absolute, thus significantly worse than the sequential modulation processing, which produces a WER reduction of 5.8%. The experiment reveals that the improvements actually come from the sequential processing of modulation frequencies rather than from the hierarchy of MLPs.

Table 6. RT05 WER for TANDEM features obtained by hierarchically re-processing the output of an MLP trained on all the available auditory and modulation frequencies (Hier posterior).

7. Combination of auditory modulation frequencies

This section aims at investigating whether the conclusions of Sections 5 and 6 are also valid when combining four joint auditory-modulation frequency channels. First, four auditory-modulation frequency ranges are extracted as described in Section 2. Then four separate MLP classifiers are trained, one on each of them. The WER for each of those individual TANDEM features is reported in Table 7. All four streams have a similarly high WER compared to the full MRASTA filter bank.

Table 7. RT05 WER for TANDEM features obtained by training MLPs on separate auditory-modulation frequency ranges (columns: (G-Low, F-Low), (G-High, F-Low), (G-Low, F-High), (G-High, F-High)).

Posteriors are then combined into a single feature stream by training another MLP that operates as a merger classifier (see Fig. 7). This approach combines in parallel all four information channels (auditory and modulation). Results are reported in Table 8.

Fig. 7. Parallel combination of four separate auditory-modulation channels. A separate MLP is trained on each of the frequency channels. Posterior estimates are then combined together using another MLP classifier.

Table 8. RT05 WER for the parallel combination of evidence from the four auditory-modulation frequency channels (see Fig. 7).

Separate processing of auditory-modulation frequency channels reduces the WER from 45.8% (single classifier approach) to 40.7%, i.e. approximately 5% absolute. In order to verify whether the findings of Sections 5 and 6 also hold when different ranges of joint auditory-modulation frequencies are considered, the combination scheme of Fig. 8 is investigated. This scheme aims at processing auditory frequencies in parallel fashion and modulation frequencies in hierarchical fashion. The proposed scheme combines in parallel MLPs trained on the high and low auditory frequencies extracted at high modulation frequencies. Then the high and low auditory frequencies extracted at low modulation frequencies are combined in hierarchical fashion. Results are reported in Table 9.
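One plausible reading of the Fig. 8 scheme is sketched below: the two high-modulation channels are merged in parallel, and the result is then refined hierarchically with the low-modulation evidence. How exactly the low-modulation channels enter the second stage is not fully specified here, so that part (appending the low-modulation features directly) is an assumption, as are the toy dimensions and the scikit-learn stand-ins.

```python
# Sketch of the combined scheme of Fig. 8: parallel over auditory frequencies
# within the high-modulation channels, then a hierarchical step towards the
# low-modulation channels.
import numpy as np
from sklearn.neural_network import MLPClassifier

n, n_phones, dim = 2000, 42, 45
rng = np.random.default_rng(0)
labels = np.arange(n) % n_phones
flow_ghigh, fhigh_ghigh, flow_glow, fhigh_glow = (
    rng.normal(size=(n, dim)) for _ in range(4))   # four toy channel streams

def mlp_posteriors(x, y):
    return MLPClassifier(hidden_layer_sizes=(100,), max_iter=50).fit(x, y).predict_proba(x)

# Stage 1: parallel combination of the two high-modulation channels.
p1 = mlp_posteriors(flow_ghigh, labels)
p2 = mlp_posteriors(fhigh_ghigh, labels)
merged_high = mlp_posteriors(np.hstack([p1, p2]), labels)

# Stage 2: hierarchical step, revising the merged posteriors with the
# low-modulation evidence (assumed here to be the raw low-modulation features).
stage2 = mlp_posteriors(np.hstack([merged_high, flow_glow, fhigh_glow]), labels)
print(stage2.shape)   # (n, 42) final posterior estimates
```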
The proposed combination scheme produces an improvement of 1.1% in terms of WER with respect to the parallel combination of the four channels. Furthermore, the WER is reduced by 0.4% with respect to the hierarchical modulation spectrum approach. Those results suggest that the conclusions of Sections 5 and 6 are also verified when joint auditory-modulation channels are combined.

8. Distant microphone results

In order to measure the performance of the different architectures in case of low SNR and increased reverberation, the features are also evaluated on audio acquired in Single Distant Microphone (SDM) conditions.

Fig. 8. Proposed combination of separate auditory-modulation channels. This scheme processes classifiers trained on auditory frequencies in parallel fashion and classifiers trained on modulation frequencies in hierarchical fashion: MLPs trained on high and low auditory frequencies extracted at high modulation frequencies are combined in parallel; then high and low auditory frequencies extracted at low modulation frequencies are combined in hierarchical fashion.

Table 9. RT05 WER for TANDEM features obtained using the MLP architecture depicted in Fig. 8.

The system training is the same as in Section 4. Acoustic features for evaluation are extracted from a single distant microphone. Results are reported in Table 10 and include the parallel and hierarchical combinations of auditory and modulation frequency channels as well as the joint auditory-modulation processing.

Table 10. RT05 WER on Single Distant Microphone audio for the different feature sets (PLP; TANDEM MRASTA; auditory: Parallel, F-Low to F-High, F-High to F-Low; modulation: Parallel, G-Low to G-High, G-High to G-Low; four channels in parallel; four channels combined as in Fig. 8).

On SDM audio, the gap in performance between the PLP baseline and the TANDEM MRASTA features is only 0.4% absolute. The trend for features generated using the parallel and hierarchical architectures is similar to what is reported on IHM data, i.e. the combination of auditory frequencies is more effective in parallel fashion while the combination of modulation frequencies is more effective in hierarchical fashion. Furthermore, hierarchical features largely outperform the PLP baseline.

9. Application in other LVCSR systems

The hierarchical combination of classifiers trained on separate ranges of modulation frequencies (also referred to as the hierarchical modulation spectrum) has also been tested on other languages and integrated in other LVCSR systems that make use of TANDEM-based feature extraction. In (Valente et al., 2009), experiments on an LVCSR system for Mandarin Broadcast speech recognition are presented. The HMM/GMM and MLP training is done using approximately 100 h of manually transcribed Broadcast data. Those data present cleaner acoustic conditions compared to meeting recordings. Results are reported on the DARPA GALE evaluation 06 data. Acoustic models are phonetically state-tied triphone models trained using maximum likelihood training procedures. The Mandarin phonetic set is composed of 71 tonemes. Phoneme posteriors are transformed according to a Log/KLT and only the first 35 components are kept, accounting for 95% of the variability. As before, hierarchical and parallel combinations of modulation frequencies are studied; the number of parameters in the different MLP architectures is kept constant to provide a fair comparison.

Results are evaluated in terms of Character Error Rate (CER) and reported in Table 11.

Table 11. CER for the DARPA GALE eval06 data set (MRASTA; parallel combination of G-High and G-Low; hierarchical combination from G-High to G-Low).

Results reveal conclusions similar to those previously presented for the meeting recognition task, i.e. hierarchical processing of modulation frequencies outperforms both the single classifier approach and the parallel processing. Other large-scale experiments with TANDEM features based on the hierarchical processing of the modulation spectrum are also reported in (Plahl et al., 2008). The authors experiment with HMM/GMM and MLP systems trained on very large amounts of data (1600 h) and integrated in the GALE Mandarin LVCSR system. Results show that the proposed approach outperforms the single classifier approach by 20% on several evaluation and development datasets from the GALE project.

10. Conclusion and discussions

In this paper we discussed the most effective way of combining classifiers trained on separate ranges of auditory and modulation frequencies. Two different schemes are considered: the parallel and the hierarchical (sequential) combination. The parallel combination of classifiers trained on separate ranges of auditory frequencies is a well known practice in the multi-band framework. Table 12 summarizes the results obtained by dividing the available auditory frequencies into two separate ranges; in brackets, relative improvements w.r.t. the single classifier approach are reported. Similarly, Table 13 summarizes the results obtained by dividing the available modulation frequencies into two separate ranges.

Table 12. Summary of WER when dividing the available range of auditory frequencies into two separate channels. In brackets, relative improvements w.r.t. the MRASTA baseline. MRASTA: 45.8%; Parallel: 43.9% (+4%); Low to High: 45.0% (+1%); High to Low: 44.3% (+3%).

Table 13. Summary of WER when dividing the available range of modulation frequencies into two separate channels. In brackets, relative improvements w.r.t. the MRASTA baseline. MRASTA: 45.8%; Parallel: 41.4% (+9%); Low to High: 45.8% (+0%); High to Low: 40.0% (+12%).
Those findings are effective also in case of combination of joint auditory frequency channels. In fact the proposed combination scheme of Fig. 8 outperforms by 1% the conventional parallel combination of MLPs trained on separate ranges of auditory/modulation frequencies. The reason of this effect is due to what type of information auditory and modulation frequencies carry and on the time scales at which they are extracted. Auditory frequencies are extracted from a fixed short-term (30 ms) window and represents the information contained in the instantaneous frequency of the signal. Thus high and low auditory frequencies correspond to the same temporal context. Auditory frequencies do not carry information on the dynamics of the signal. The extraction of modulation frequencies, i.e. the dynamics of the signal, involves analysis at different time scales done by the multiple-resolution (MRASTA) filters with varying time spans. While short Table 14 Summary of WER dividing the available range of auditory and modulation frequencies. In brackets relative improvements w.r.t. the MRASTA baseline are reported. Features MRASTA Parallel Low to High High to Low Two separate auditory channels WER (+3%) 56.8% (+1%) 56.0% (+2%) Two separate modulation channels WER (+6%) 57.1% (+0.3%) 51.2 (+10%)

While short filters (fast modulations) provide a fine representation of the speech dynamics, long filters (slow modulations) provide a coarse representation of the speech dynamics. MLPs trained on different ranges of auditory frequencies represent phoneme posterior estimates at the same time scale (30 ms). MLPs trained on different ranges of modulation frequencies represent phoneme posterior estimates at different time scales, with fine and coarse representations of the dynamics extracted using short and long temporal filters, respectively. The difference between the parallel and hierarchical combinations lies in the fact that the parallel combination assumes that there is no ordering in the process, while the hierarchical combination is a sequential scheme where the order in which features are introduced matters.

Experiments on human speech recognition suggest that there is no ordering in processing auditory frequencies, i.e. recognition is carried out independently in each band and then a decision is taken by merging results from the different bands. This is verified in the experimental section, as the parallel architecture (which does not assume any ordering) performs better than the hierarchical sequential architectures. This is furthermore supported by the small difference in performance when the processing moves from low to high or from high to low auditory frequencies (in the order of 2% relative).

MLPs trained on different ranges of modulation frequencies produce phoneme posterior estimates at different time scales, i.e., they can be ordered according to their time scales. Combining those estimates assuming that there is no ordering, i.e. in parallel, may be suboptimal. Studies like (Miller et al., 2002) suggest that the processing of modulation frequencies is sequential, i.e. different levels attend to different rates of the modulation spectrum. Hierarchical combination is a possible way of implementing sequential processing: it assumes an ordering and can operate from fine-to-coarse (i.e. fast to slow) or coarse-to-fine (i.e. slow to fast) time scales. Experiments reveal that integrating information from fast modulations (i.e. short temporal windows) to slow modulations (i.e. long temporal windows) is the most effective processing, consistently with (Miller et al., 2002). The hypothesis of sequential processing is furthermore supported by the large difference in performance between the MLP architectures moving from fast to slow and from slow to fast modulation frequencies (in the order of 10% relative). In other words, the combination of MLPs trained on high and low modulation frequencies involves the combination of different time contexts. Moving from short to long time-spans is similar to progressively increasing the temporal context, as done in a number of other posterior-based ASR systems (Bourlard et al., 2004). Such an effect could not be obtained using the parallel scheme. On the other hand, the MLPs trained on high and low auditory frequencies have a fixed input temporal context, they do not provide a fine/coarse estimation of the input signal, and they do not have a particular ordering; thus the sequential combination is not as effective as the parallel one.

We limited the investigation here to two auditory and two modulation frequency channels, obtained by splitting the tonotopic scales into two equally sized bands, to simplify the experimental setup.
In the future we plan to consider a larger number of bands and to experiment with a considerably larger number of frequency channels, as in (Kleinschmidt, 2002; Zhao et al., 2009).

Acknowledgments

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR C-0023 and by the European Union under the integrated project AMIDA. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). The author thanks colleagues from the AMIDA and GALE projects for their help with the different LVCSR systems and the reviewers for their comments.

References

Allen, J.B., 1994. How do humans process and recognize speech? IEEE Trans. Speech Audio Process. 2 (October).
Allen, J.B., 2005. Articulation and Intelligibility. Morgan and Claypool.
Bourlard, H., Dupont, S., 1996. A new ASR approach based on independent processing and re-combination of partial frequency bands. In: Proc. ICSLP 96.
Bourlard, H., Morgan, N., 1994. Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers.
Bourlard, H., et al., 2004. Towards using hierarchical posteriors for flexible automatic speech recognition systems. In: Proceedings of the DARPA EARS (Effective, Affordable, Reusable, Speech-to-text) Rich Transcription (RT 04) Workshop.
Chen, B., Chang, S., Sivadas, S., 2003. Learning discriminative temporal patterns in speech: development of novel TRAPS-like classifiers. In: Proc. Eurospeech.
Dau, T., Kollmeier, B., Kohlrausch, A., 1997. Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers. J. Acoust. Soc. Am. 102.
Fletcher, H., 1953. Speech and Hearing in Communication. Krieger, New York.
Hain, T., et al., 2005. The 2005 AMI system for the transcription of speech in meetings. In: NIST RT05 Workshop, Edinburgh, UK.
Hermansky, H., 1998. Should recognizers have ears? Speech Commun. 25.
Hermansky, H., 2003. TRAP-TANDEM: data-driven extraction of temporal features from speech. In: Proc. ASRU.
Hermansky, H., Fousek, P., 2005. Multi-resolution RASTA filtering for TANDEM-based ASR. In: Proc. Interspeech.
Hermansky, H., Morgan, N., 1994. RASTA processing of speech. IEEE Trans. Speech Audio Process. 2.
Hermansky, H., Sharma, S., 1999. Temporal patterns (TRAPS) in ASR of noisy speech. In: Proc. ICASSP 99.
Hermansky, H., Tibrewala, S., Pavel, M., 1996. Towards ASR on partially corrupted speech. In: Proc. ICSLP 96.
Hermansky, H., Kanedera, H., Arai, T., Pavel, M., 1997. On the importance of various modulation frequencies for speech recognition. In: Proc. Eurospeech 97.

Hermansky, H., Ellis, D., Sharma, S., 2000. TANDEM connectionist feature extraction for conventional HMM systems. In: Proc. ICASSP.
Houtgast, T., 1989. Frequency selectivity in amplitude modulation detection. J. Acoust. Soc. Am. 88.
Kleinschmidt, M., 2002. Methods for capturing spectro-temporal modulations in automatic speech recognition. Acta Acustica united with Acustica 88 (3).
Miller et al., 2002. Spectro-temporal receptive fields in the lemniscal auditory thalamus and cortex. J. Neurophysiol. 87 (1).
Misra, H., Bourlard, H., Tyagi, V., 2003. Entropy-based multi-stream combination. In: Proc. ICASSP.
Moore, D., et al., 2006. Juicer: a weighted finite state transducer speech decoder. In: Proc. MLMI 2006, Washington, DC.
Morgan, N., Chen, B., Zhu, Q., Stolcke, A., 2004. Trapping conversational speech: extending TRAP/TANDEM approaches to conversational telephone speech recognition. In: Proc. ICASSP.
Plahl, C., Hoffmeister, B., Heigold, G., Lööf, J., Schlüter, R., Ney, H. Development of the GALE 2008 Mandarin LVCSR system. In: Proc. Ann. Conf. of the Internat. Speech Communication Association (Interspeech).
Rumelhart, D., Hinton, G., Williams, R., 1986. Learning representations by back-propagating errors. Nature 323.
Sivadas, S., Hermansky, H., 2002. Hierarchical tandem feature extraction. In: Proc. ICASSP.
Valente, F., Hermansky, H. Combination of acoustic classifiers based on Dempster-Shafer theory of evidence. In: Proc. ICASSP.
Valente, F., Hermansky, H., 2008. Hierarchical and parallel processing of modulation spectrum for ASR applications. In: Proc. ICASSP.
Valente, F., Vepa, J., Plahl, C., Gollan, C., Hermansky, H., Schlüter, R., 2007. Hierarchical neural networks feature extraction for LVCSR system. In: Proc. Interspeech.
Valente, F., Magimai-Doss, M., Plahl, C., Ravuri, S., 2009. Hierarchical processing of the modulation spectrum for GALE Mandarin LVCSR system. In: Proc. 10th Ann. Conf. of the Internat. Speech Communication Association (Interspeech).
Zhao, S., Ravuri, S., Morgan, N., 2009. Multi-stream to many-stream: using spectro-temporal features for ASR. In: Proc. Interspeech.


More information

PDF hosted at the Radboud Repository of the Radboud University Nijmegen

PDF hosted at the Radboud Repository of the Radboud University Nijmegen PDF hosted at the Radboud Repository of the Radboud University Nijmegen The following full text is an author's version which may differ from the publisher's version. For additional information about this

More information

Joint recognition and direction-of-arrival estimation of simultaneous meetingroom acoustic events

Joint recognition and direction-of-arrival estimation of simultaneous meetingroom acoustic events INTERSPEECH 2013 Joint recognition and direction-of-arrival estimation of simultaneous meetingroom acoustic events Rupayan Chakraborty and Climent Nadeu TALP Research Centre, Department of Signal Theory

More information

Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech

Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech INTERSPEECH 5 Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech M. A. Tuğtekin Turan and Engin Erzin Multimedia, Vision and Graphics Laboratory,

More information

A CONSTRUCTION OF COMPACT MFCC-TYPE FEATURES USING SHORT-TIME STATISTICS FOR APPLICATIONS IN AUDIO SEGMENTATION

A CONSTRUCTION OF COMPACT MFCC-TYPE FEATURES USING SHORT-TIME STATISTICS FOR APPLICATIONS IN AUDIO SEGMENTATION 17th European Signal Processing Conference (EUSIPCO 2009) Glasgow, Scotland, August 24-28, 2009 A CONSTRUCTION OF COMPACT MFCC-TYPE FEATURES USING SHORT-TIME STATISTICS FOR APPLICATIONS IN AUDIO SEGMENTATION

More information

POSSIBLY the most noticeable difference when performing

POSSIBLY the most noticeable difference when performing IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 7, SEPTEMBER 2007 2011 Acoustic Beamforming for Speaker Diarization of Meetings Xavier Anguera, Associate Member, IEEE, Chuck Wooters,

More information

Learning the Speech Front-end With Raw Waveform CLDNNs

Learning the Speech Front-end With Raw Waveform CLDNNs INTERSPEECH 2015 Learning the Speech Front-end With Raw Waveform CLDNNs Tara N. Sainath, Ron J. Weiss, Andrew Senior, Kevin W. Wilson, Oriol Vinyals Google, Inc. New York, NY, U.S.A {tsainath, ronw, andrewsenior,

More information

The 2010 CMU GALE Speech-to-Text System

The 2010 CMU GALE Speech-to-Text System Research Showcase @ CMU Language Technologies Institute School of Computer Science 9-2 The 2 CMU GALE Speech-to-Text System Florian Metze, fmetze@andrew.cmu.edu Roger Hsiao Qin Jin Udhyakumar Nallasamy

More information

Damped Oscillator Cepstral Coefficients for Robust Speech Recognition

Damped Oscillator Cepstral Coefficients for Robust Speech Recognition Damped Oscillator Cepstral Coefficients for Robust Speech Recognition Vikramjit Mitra, Horacio Franco, Martin Graciarena Speech Technology and Research Laboratory, SRI International, Menlo Park, CA, USA.

More information

Auditory Based Feature Vectors for Speech Recognition Systems

Auditory Based Feature Vectors for Speech Recognition Systems Auditory Based Feature Vectors for Speech Recognition Systems Dr. Waleed H. Abdulla Electrical & Computer Engineering Department The University of Auckland, New Zealand [w.abdulla@auckland.ac.nz] 1 Outlines

More information

Time-Frequency Distributions for Automatic Speech Recognition

Time-Frequency Distributions for Automatic Speech Recognition 196 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 3, MARCH 2001 Time-Frequency Distributions for Automatic Speech Recognition Alexandros Potamianos, Member, IEEE, and Petros Maragos, Fellow,

More information

Feature Extraction Using 2-D Autoregressive Models For Speaker Recognition

Feature Extraction Using 2-D Autoregressive Models For Speaker Recognition Feature Extraction Using 2-D Autoregressive Models For Speaker Recognition Sriram Ganapathy 1, Samuel Thomas 1 and Hynek Hermansky 1,2 1 Dept. of ECE, Johns Hopkins University, USA 2 Human Language Technology

More information

Classification of ships using autocorrelation technique for feature extraction of the underwater acoustic noise

Classification of ships using autocorrelation technique for feature extraction of the underwater acoustic noise Classification of ships using autocorrelation technique for feature extraction of the underwater acoustic noise Noha KORANY 1 Alexandria University, Egypt ABSTRACT The paper applies spectral analysis to

More information

MEDIUM-DURATION MODULATION CEPSTRAL FEATURE FOR ROBUST SPEECH RECOGNITION. Vikramjit Mitra, Horacio Franco, Martin Graciarena, Dimitra Vergyri

MEDIUM-DURATION MODULATION CEPSTRAL FEATURE FOR ROBUST SPEECH RECOGNITION. Vikramjit Mitra, Horacio Franco, Martin Graciarena, Dimitra Vergyri 2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) MEDIUM-DURATION MODULATION CEPSTRAL FEATURE FOR ROBUST SPEECH RECOGNITION Vikramjit Mitra, Horacio Franco, Martin Graciarena,

More information

Audio Augmentation for Speech Recognition

Audio Augmentation for Speech Recognition Audio Augmentation for Speech Recognition Tom Ko 1, Vijayaditya Peddinti 2, Daniel Povey 2,3, Sanjeev Khudanpur 2,3 1 Huawei Noah s Ark Research Lab, Hong Kong, China 2 Center for Language and Speech Processing

More information

Speech and Music Discrimination based on Signal Modulation Spectrum.

Speech and Music Discrimination based on Signal Modulation Spectrum. Speech and Music Discrimination based on Signal Modulation Spectrum. Pavel Balabko June 24, 1999 1 Introduction. This work is devoted to the problem of automatic speech and music discrimination. As we

More information

ANALYSIS-BY-SYNTHESIS FEATURE ESTIMATION FOR ROBUST AUTOMATIC SPEECH RECOGNITION USING SPECTRAL MASKS. Michael I Mandel and Arun Narayanan

ANALYSIS-BY-SYNTHESIS FEATURE ESTIMATION FOR ROBUST AUTOMATIC SPEECH RECOGNITION USING SPECTRAL MASKS. Michael I Mandel and Arun Narayanan ANALYSIS-BY-SYNTHESIS FEATURE ESTIMATION FOR ROBUST AUTOMATIC SPEECH RECOGNITION USING SPECTRAL MASKS Michael I Mandel and Arun Narayanan The Ohio State University, Computer Science and Engineering {mandelm,narayaar}@cse.osu.edu

More information

TECHNIQUES FOR HANDLING CONVOLUTIONAL DISTORTION WITH MISSING DATA AUTOMATIC SPEECH RECOGNITION

TECHNIQUES FOR HANDLING CONVOLUTIONAL DISTORTION WITH MISSING DATA AUTOMATIC SPEECH RECOGNITION TECHNIQUES FOR HANDLING CONVOLUTIONAL DISTORTION WITH MISSING DATA AUTOMATIC SPEECH RECOGNITION Kalle J. Palomäki 1,2, Guy J. Brown 2 and Jon Barker 2 1 Helsinki University of Technology, Laboratory of

More information

Auditory motivated front-end for noisy speech using spectro-temporal modulation filtering

Auditory motivated front-end for noisy speech using spectro-temporal modulation filtering Auditory motivated front-end for noisy speech using spectro-temporal modulation filtering Sriram Ganapathy a) and Mohamed Omar IBM T.J. Watson Research Center, Yorktown Heights, New York 10562 ganapath@us.ibm.com,

More information

FEATURE FUSION FOR HIGH-ACCURACY KEYWORD SPOTTING

FEATURE FUSION FOR HIGH-ACCURACY KEYWORD SPOTTING FEATURE FUSION FOR HIGH-ACCURACY KEYWORD SPOTTING Vikramjit Mitra, Julien van Hout, Horacio Franco, Dimitra Vergyri, Yun Lei, Martin Graciarena, Yik-Cheung Tam, Jing Zheng 1 Speech Technology and Research

More information

Isolated Word Recognition Based on Combination of Multiple Noise-Robust Techniques

Isolated Word Recognition Based on Combination of Multiple Noise-Robust Techniques Isolated Word Recognition Based on Combination of Multiple Noise-Robust Techniques 81 Isolated Word Recognition Based on Combination of Multiple Noise-Robust Techniques Noboru Hayasaka 1, Non-member ABSTRACT

More information

Robustness (cont.); End-to-end systems

Robustness (cont.); End-to-end systems Robustness (cont.); End-to-end systems Steve Renals Automatic Speech Recognition ASR Lecture 18 27 March 2017 ASR Lecture 18 Robustness (cont.); End-to-end systems 1 Robust Speech Recognition ASR Lecture

More information

Audio Imputation Using the Non-negative Hidden Markov Model

Audio Imputation Using the Non-negative Hidden Markov Model Audio Imputation Using the Non-negative Hidden Markov Model Jinyu Han 1,, Gautham J. Mysore 2, and Bryan Pardo 1 1 EECS Department, Northwestern University 2 Advanced Technology Labs, Adobe Systems Inc.

More information

Recurrent neural networks Modelling sequential data. MLP Lecture 9 / 13 November 2018 Recurrent Neural Networks 1: Modelling sequential data 1

Recurrent neural networks Modelling sequential data. MLP Lecture 9 / 13 November 2018 Recurrent Neural Networks 1: Modelling sequential data 1 Recurrent neural networks Modelling sequential data MLP Lecture 9 / 13 November 2018 Recurrent Neural Networks 1: Modelling sequential data 1 Recurrent Neural Networks 1: Modelling sequential data Steve

More information

Robust Low-Resource Sound Localization in Correlated Noise

Robust Low-Resource Sound Localization in Correlated Noise INTERSPEECH 2014 Robust Low-Resource Sound Localization in Correlated Noise Lorin Netsch, Jacek Stachurski Texas Instruments, Inc. netsch@ti.com, jacek@ti.com Abstract In this paper we address the problem

More information

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm A.T. Rajamanickam, N.P.Subiramaniyam, A.Balamurugan*,

More information

The Munich 2011 CHiME Challenge Contribution: BLSTM-NMF Speech Enhancement and Recognition for Reverberated Multisource Environments

The Munich 2011 CHiME Challenge Contribution: BLSTM-NMF Speech Enhancement and Recognition for Reverberated Multisource Environments The Munich 2011 CHiME Challenge Contribution: BLSTM-NMF Speech Enhancement and Recognition for Reverberated Multisource Environments Felix Weninger, Jürgen Geiger, Martin Wöllmer, Björn Schuller, Gerhard

More information

Wavelet Speech Enhancement based on the Teager Energy Operator

Wavelet Speech Enhancement based on the Teager Energy Operator Wavelet Speech Enhancement based on the Teager Energy Operator Mohammed Bahoura and Jean Rouat ERMETIS, DSA, Université du Québec à Chicoutimi, Chicoutimi, Québec, G7H 2B1, Canada. Abstract We propose

More information

A New Framework for Supervised Speech Enhancement in the Time Domain

A New Framework for Supervised Speech Enhancement in the Time Domain Interspeech 2018 2-6 September 2018, Hyderabad A New Framework for Supervised Speech Enhancement in the Time Domain Ashutosh Pandey 1 and Deliang Wang 1,2 1 Department of Computer Science and Engineering,

More information

Enhancement of Speech Signal by Adaptation of Scales and Thresholds of Bionic Wavelet Transform Coefficients

Enhancement of Speech Signal by Adaptation of Scales and Thresholds of Bionic Wavelet Transform Coefficients ISSN (Print) : 232 3765 An ISO 3297: 27 Certified Organization Vol. 3, Special Issue 3, April 214 Paiyanoor-63 14, Tamil Nadu, India Enhancement of Speech Signal by Adaptation of Scales and Thresholds

More information

TIME-FREQUENCY CONVOLUTIONAL NETWORKS FOR ROBUST SPEECH RECOGNITION. Vikramjit Mitra, Horacio Franco

TIME-FREQUENCY CONVOLUTIONAL NETWORKS FOR ROBUST SPEECH RECOGNITION. Vikramjit Mitra, Horacio Franco TIME-FREQUENCY CONVOLUTIONAL NETWORKS FOR ROBUST SPEECH RECOGNITION Vikramjit Mitra, Horacio Franco Speech Technology and Research Laboratory, SRI International, Menlo Park, CA {vikramjit.mitra, horacio.franco}@sri.com

More information

DISTANT speech recognition (DSR) [1] is a challenging

DISTANT speech recognition (DSR) [1] is a challenging 1 Convolutional Neural Networks for Distant Speech Recognition Pawel Swietojanski, Student Member, IEEE, Arnab Ghoshal, Member, IEEE, and Steve Renals, Fellow, IEEE Abstract We investigate convolutional

More information

Recurrent neural networks Modelling sequential data. MLP Lecture 9 Recurrent Neural Networks 1: Modelling sequential data 1

Recurrent neural networks Modelling sequential data. MLP Lecture 9 Recurrent Neural Networks 1: Modelling sequential data 1 Recurrent neural networks Modelling sequential data MLP Lecture 9 Recurrent Neural Networks 1: Modelling sequential data 1 Recurrent Neural Networks 1: Modelling sequential data Steve Renals Machine Learning

More information

Speech Signal Analysis

Speech Signal Analysis Speech Signal Analysis Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 2&3 14,18 January 216 ASR Lectures 2&3 Speech Signal Analysis 1 Overview Speech Signal Analysis for

More information

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 MODELING SPECTRAL AND TEMPORAL MASKING IN THE HUMAN AUDITORY SYSTEM PACS: 43.66.Ba, 43.66.Dc Dau, Torsten; Jepsen, Morten L.; Ewert,

More information

ACOUSTIC cepstral features, extracted from short-term

ACOUSTIC cepstral features, extracted from short-term 1 Combination of Cepstral and Phonetically Discriminative Features for Speaker Verification Achintya K. Sarkar, Cong-Thanh Do, Viet-Bac Le and Claude Barras, Member, IEEE Abstract Most speaker recognition

More information

SYNTHETIC SPEECH DETECTION USING TEMPORAL MODULATION FEATURE

SYNTHETIC SPEECH DETECTION USING TEMPORAL MODULATION FEATURE SYNTHETIC SPEECH DETECTION USING TEMPORAL MODULATION FEATURE Zhizheng Wu 1,2, Xiong Xiao 2, Eng Siong Chng 1,2, Haizhou Li 1,2,3 1 School of Computer Engineering, Nanyang Technological University (NTU),

More information

Dominant Voiced Speech Segregation Using Onset Offset Detection and IBM Based Segmentation

Dominant Voiced Speech Segregation Using Onset Offset Detection and IBM Based Segmentation Dominant Voiced Speech Segregation Using Onset Offset Detection and IBM Based Segmentation Shibani.H 1, Lekshmi M S 2 M. Tech Student, Ilahia college of Engineering and Technology, Muvattupuzha, Kerala,

More information

Perceptually Motivated Linear Prediction Cepstral Features for Network Speech Recognition

Perceptually Motivated Linear Prediction Cepstral Features for Network Speech Recognition Perceptually Motivated Linear Prediction Cepstral Features for Network Speech Recognition Aadel Alatwi, Stephen So, Kuldip K. Paliwal Signal Processing Laboratory Griffith University, Brisbane, QLD, 4111,

More information

Robust Speech Recognition. based on Spectro-Temporal Features

Robust Speech Recognition. based on Spectro-Temporal Features Carl von Ossietzky Universität Oldenburg Studiengang Diplom-Physik DIPLOMARBEIT Titel: Robust Speech Recognition based on Spectro-Temporal Features vorgelegt von: Bernd Meyer Betreuender Gutachter: Prof.

More information

Sparse coding of the modulation spectrum for noise-robust automatic speech recognition

Sparse coding of the modulation spectrum for noise-robust automatic speech recognition Ahmadi et al. EURASIP Journal on Audio, Speech, and Music Processing 24, 24:36 http://asmp.eurasipjournals.com/content/24//36 RESEARCH Open Access Sparse coding of the modulation spectrum for noise-robust

More information

Speech Enhancement in Presence of Noise using Spectral Subtraction and Wiener Filter

Speech Enhancement in Presence of Noise using Spectral Subtraction and Wiener Filter Speech Enhancement in Presence of Noise using Spectral Subtraction and Wiener Filter 1 Gupteswar Sahu, 2 D. Arun Kumar, 3 M. Bala Krishna and 4 Jami Venkata Suman Assistant Professor, Department of ECE,

More information

Power Function-Based Power Distribution Normalization Algorithm for Robust Speech Recognition

Power Function-Based Power Distribution Normalization Algorithm for Robust Speech Recognition Power Function-Based Power Distribution Normalization Algorithm for Robust Speech Recognition Chanwoo Kim 1 and Richard M. Stern Department of Electrical and Computer Engineering and Language Technologies

More information

PDF hosted at the Radboud Repository of the Radboud University Nijmegen

PDF hosted at the Radboud Repository of the Radboud University Nijmegen PDF hosted at the Radboud Repository of the Radboud University Nijmegen The following full text is an author's version which may differ from the publisher's version. For additional information about this

More information

Fei Chen and Philipos C. Loizou a) Department of Electrical Engineering, University of Texas at Dallas, Richardson, Texas 75083

Fei Chen and Philipos C. Loizou a) Department of Electrical Engineering, University of Texas at Dallas, Richardson, Texas 75083 Analysis of a simplified normalized covariance measure based on binary weighting functions for predicting the intelligibility of noise-suppressed speech Fei Chen and Philipos C. Loizou a) Department of

More information

Epoch Extraction From Emotional Speech

Epoch Extraction From Emotional Speech Epoch Extraction From al Speech D Govind and S R M Prasanna Department of Electronics and Electrical Engineering Indian Institute of Technology Guwahati Email:{dgovind,prasanna}@iitg.ernet.in Abstract

More information

Robust Speech Recognition Group Carnegie Mellon University. Telephone: Fax:

Robust Speech Recognition Group Carnegie Mellon University. Telephone: Fax: Robust Automatic Speech Recognition In the 21 st Century Richard Stern (with Alex Acero, Yu-Hsiang Chiu, Evandro Gouvêa, Chanwoo Kim, Kshitiz Kumar, Amir Moghimi, Pedro Moreno, Hyung-Min Park, Bhiksha

More information

Performance study of Text-independent Speaker identification system using MFCC & IMFCC for Telephone and Microphone Speeches

Performance study of Text-independent Speaker identification system using MFCC & IMFCC for Telephone and Microphone Speeches Performance study of Text-independent Speaker identification system using & I for Telephone and Microphone Speeches Ruchi Chaudhary, National Technical Research Organization Abstract: A state-of-the-art

More information

Robust Speaker Segmentation for Meetings: The ICSI-SRI Spring 2005 Diarization System

Robust Speaker Segmentation for Meetings: The ICSI-SRI Spring 2005 Diarization System Robust Speaker Segmentation for Meetings: The ICSI-SRI Spring 2005 Diarization System Xavier Anguera 1,2, Chuck Wooters 1, Barbara Peskin 1, and Mateu Aguiló 2,1 1 International Computer Science Institute,

More information

Evaluating robust features on Deep Neural Networks for speech recognition in noisy and channel mismatched conditions

Evaluating robust features on Deep Neural Networks for speech recognition in noisy and channel mismatched conditions INTERSPEECH 2014 Evaluating robust on Deep Neural Networks for speech recognition in noisy and channel mismatched conditions Vikramjit Mitra, Wen Wang, Horacio Franco, Yun Lei, Chris Bartels, Martin Graciarena

More information

A ROBUST FRONTEND FOR ASR: COMBINING DENOISING, NOISE MASKING AND FEATURE NORMALIZATION. Maarten Van Segbroeck and Shrikanth S.

A ROBUST FRONTEND FOR ASR: COMBINING DENOISING, NOISE MASKING AND FEATURE NORMALIZATION. Maarten Van Segbroeck and Shrikanth S. A ROBUST FRONTEND FOR ASR: COMBINING DENOISING, NOISE MASKING AND FEATURE NORMALIZATION Maarten Van Segbroeck and Shrikanth S. Narayanan Signal Analysis and Interpretation Lab, University of Southern California,

More information

Investigating Modulation Spectrogram Features for Deep Neural Network-based Automatic Speech Recognition

Investigating Modulation Spectrogram Features for Deep Neural Network-based Automatic Speech Recognition Investigating Modulation Spectrogram Features for Deep Neural Network-based Automatic Speech Recognition DeepakBabyand HugoVanhamme Department ESAT, KU Leuven, Belgium {Deepak.Baby, Hugo.Vanhamme}@esat.kuleuven.be

More information

Convolutional Neural Networks for Small-footprint Keyword Spotting

Convolutional Neural Networks for Small-footprint Keyword Spotting INTERSPEECH 2015 Convolutional Neural Networks for Small-footprint Keyword Spotting Tara N. Sainath, Carolina Parada Google, Inc. New York, NY, U.S.A {tsainath, carolinap}@google.com Abstract We explore

More information

Speech Synthesis using Mel-Cepstral Coefficient Feature

Speech Synthesis using Mel-Cepstral Coefficient Feature Speech Synthesis using Mel-Cepstral Coefficient Feature By Lu Wang Senior Thesis in Electrical Engineering University of Illinois at Urbana-Champaign Advisor: Professor Mark Hasegawa-Johnson May 2018 Abstract

More information

Channel Selection in the Short-time Modulation Domain for Distant Speech Recognition

Channel Selection in the Short-time Modulation Domain for Distant Speech Recognition Channel Selection in the Short-time Modulation Domain for Distant Speech Recognition Ivan Himawan 1, Petr Motlicek 1, Sridha Sridharan 2, David Dean 2, Dian Tjondronegoro 2 1 Idiap Research Institute,

More information

High-speed Noise Cancellation with Microphone Array

High-speed Noise Cancellation with Microphone Array Noise Cancellation a Posteriori Probability, Maximum Criteria Independent Component Analysis High-speed Noise Cancellation with Microphone Array We propose the use of a microphone array based on independent

More information

DWT and LPC based feature extraction methods for isolated word recognition

DWT and LPC based feature extraction methods for isolated word recognition RESEARCH Open Access DWT and LPC based feature extraction methods for isolated word recognition Navnath S Nehe 1* and Raghunath S Holambe 2 Abstract In this article, new feature extraction methods, which

More information

A Correlation-Maximization Denoising Filter Used as An Enhancement Frontend for Noise Robust Bird Call Classification

A Correlation-Maximization Denoising Filter Used as An Enhancement Frontend for Noise Robust Bird Call Classification A Correlation-Maximization Denoising Filter Used as An Enhancement Frontend for Noise Robust Bird Call Classification Wei Chu and Abeer Alwan Speech Processing and Auditory Perception Laboratory Department

More information

Non-intrusive intelligibility prediction for Mandarin speech in noise. Creative Commons: Attribution 3.0 Hong Kong License

Non-intrusive intelligibility prediction for Mandarin speech in noise. Creative Commons: Attribution 3.0 Hong Kong License Title Non-intrusive intelligibility prediction for Mandarin speech in noise Author(s) Chen, F; Guan, T Citation The 213 IEEE Region 1 Conference (TENCON 213), Xi'an, China, 22-25 October 213. In Conference

More information