Hierarchical and parallel processing of auditory and modulation frequencies for automatic speech recognition


Speech Communication 52 (2010)

Fabio Valente
IDIAP Research Institute, CH-1920 Martigny, Switzerland
E-mail address: fabio.valente@idiap.ch

Received 13 October 2009; received in revised form 25 May 2010; accepted 26 May 2010

Abstract

This paper investigates, from an automatic speech recognition perspective, the most effective way of combining Multi Layer Perceptron (MLP) classifiers trained on different ranges of auditory and modulation frequencies. Two different combination schemes based on MLPs are considered. The first operates in parallel fashion and is invariant to the order in which feature streams are introduced. The second operates in hierarchical fashion and is sensitive to the order in which feature streams are introduced. The study is carried out on a Large Vocabulary Continuous Speech Recognition system for transcription of meeting data using the TANDEM approach. Results reveal that (1) the combination of MLPs trained on different ranges of auditory frequencies is more effective if performed in parallel fashion; (2) the combination of MLPs trained on different ranges of modulation frequencies is more effective if performed in hierarchical fashion moving from high to low modulations; (3) the improvement obtained from separate processing of two modulation frequency ranges (12% relative WER reduction w.r.t. the single classifier approach) is considerably larger than the improvement obtained from separate processing of two auditory frequency ranges (4% relative WER reduction w.r.t. the single classifier approach). Similar results are also verified on other LVCSR systems and on other languages. Furthermore, the paper extends the discussion to the combination of classifiers trained on separate auditory-modulation frequency channels, showing that the previous conclusions hold also in this scenario.
© 2010 Elsevier B.V. All rights reserved.

Keywords: Automatic speech recognition (ASR); TANDEM features; Multi Layer Perceptron (MLP); Auditory and modulation frequencies

1. Introduction

Typical automatic speech recognition (ASR) features are obtained through the short-term spectrum of 30 ms segments of the speech signal. This representation extracts instantaneous frequency components of the signal. The power spectrum is then integrated using a bank of filters equally spaced on an auditory scale (e.g. the Bark scale), thus obtaining the auditory spectrum. Studies on recognition of nonsense syllables (Fletcher, 1953) have shown that humans process speech separately in different auditory frequency channels (known as articulatory bands) and classify a speech sound by merging estimates from the different bands. Later, Allen (1994, 2005) interpreted these findings as showing that the recognition of speech in each articulatory band is done independently and that a correct decision is obtained if the sound is correctly recognized in at least one of the bands. The size of each articulatory band spans approximately two critical bands (Allen, 2005). Those observations have inspired automatic speech recognition approaches referred to as multi-band ASR. Multi-band ASR (Hermansky et al., 1996; Bourlard and Dupont, 1996) uses a set of independent classifiers (e.g. Multi Layer Perceptrons (MLP) or Hidden Markov Models (HMM)) trained on different parts of the auditory spectrum in order to discriminate between phonetic targets. The classifier outputs are then combined, obtaining a final decision on the phonetic targets.
Typical combination frameworks include both merger classifiers (another MLP; Hermansky, 2003; Chen et al., 2003) and rule-based combinations, e.g. inverse entropy (Misra et al., 2003) or Dempster-Shafer combination (Valente and Hermansky).

Multi-band speech recognition was originally introduced for dealing with noise, the rationale being that if noise affects a particular auditory band, correct phonetic recognition can still be obtained using information coming from the uncorrupted bands. Later, the multi-band paradigm was generalized into the multi-stream paradigm, where independent classifiers are trained on different representations of the speech signal, including conventional spectral features (PLP) (Hermansky et al., 2000), long-time critical band energy trajectories (Hermansky and Sharma, 1999; Morgan et al., 2004) and spectro-temporal modulations (Kleinschmidt, 2002; Hermansky and Fousek, 2005; Zhao et al., 2009). Several multi-stream systems make use of features based on long time windows of the speech signal (e.g. Hermansky and Sharma, 1999; Hermansky and Fousek, 2005; Kleinschmidt, 2002; Zhao et al., 2009; Morgan et al., 2004; Hermansky, 2003). Conventional Short Term Fourier Transform features do not provide information on the speech dynamics. Those are generally introduced using temporal differentials of the spectral trajectory (also known as delta features) or by processing long segments of spectral energy trajectories, i.e. the modulation spectrum (Hermansky, 1998). Several studies have been carried out to evaluate the importance of the different parts of the modulation spectrum for ASR applications (Hermansky et al., 1997), and robustness techniques like RASTA filtering are based on emphasizing the modulation spectrum frequencies that are most important for speech recognition (Hermansky and Morgan, 1994). This study is motivated by two main arguments:

- Current multi-band/multi-stream approaches operate in two separate steps: in the first step a set of independent classifiers (e.g. MLPs) is trained to discriminate between phonetic targets, i.e. to estimate phoneme posterior probabilities; in a second step all the individual estimates are combined into a single phoneme posterior estimate. The combination happens in parallel fashion, i.e. it is invariant to the order in which the different features are introduced. Alternative methods for combining information based on hierarchies of classifiers have been proposed in the literature (Sivadas and Hermansky, 2002; Valente et al., 2007; Valente and Hermansky, 2008) and have shown results competitive with the parallel scheme. In contrast to the parallel scheme, hierarchical combinations are sequential, i.e. they assume an ordering in the processing.
- Although proven effective in several small and large vocabulary ASR tasks, the parallel combination scheme is motivated by observations made on the auditory spectrum of the speech signal. Speech temporal modulations represent the dynamics of the signal and they are extracted using different time scales. No specific studies have been carried out on the optimal way of combining classifiers trained on this type of information.

This paper aims at investigating the combination of classifiers trained on different ranges of auditory frequencies (as in conventional multi-band approaches) and modulation frequencies. In particular, we study from an ASR perspective whether the combination of information obtained from auditory and modulation frequency channels is more effective in parallel (as in conventional multi-band) or hierarchical (sequential) fashion.
In contrast to related work, this study is carried out on a Large Vocabulary Continuous Speech Recognition task using the TANDEM approach. The paper is organized as follows: Section 2 describes the pre-processing techniques that extract different ranges of auditory and modulation frequencies and the joint auditory-modulation channels. We limit the investigation to two auditory channels and two modulation channels, thus four joint auditory-modulation channels, to simplify the setup. Section 3 describes two different combination schemes (parallel and hierarchical) based on Multi Layer Perceptron (MLP) classifiers, and Section 4 presents the experimental framework based on an LVCSR system for transcription of meeting data. The combination of classifiers trained on different ranges of auditory frequencies is investigated in Section 5, and the combination of classifiers trained on different ranges of modulation frequencies is investigated in Section 6. Joint auditory-modulation frequency processing is then presented in Section 7, and Section 8 describes results on Single Distant Microphone (SDM) data. Section 9 describes the application of those features in other LVCSR systems, and finally Section 10 concludes the paper, discussing results and presenting future directions.

2. Time-frequency processing

This section presents the processing used for extracting evidence from different auditory-modulation frequency sub-bands. Feature extraction is composed of the following parts: the critical band auditory spectrum is extracted from the Short Time Fourier Transform of the signal every 10 ms. In the following study, the power spectrum is integrated using a bank of filters equally spaced on the Bark scale; 15 critical bands are used. This step is common to several conventional feature extraction methods used in ASR, e.g. PLP features.
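As a minimal illustration (not the author's exact front end), the critical-band analysis just described could be sketched as follows. The 10 ms hop and the 15 critical bands follow the text; the 30 ms window, the Hamming window, the triangular filter shape and the particular Bark formula are assumptions of the sketch.

```python
# Sketch of a critical-band (Bark) auditory spectrum: short-term power
# spectrum every 10 ms, integrated by 15 filters equally spaced on a Bark scale.
import numpy as np

def hz_to_bark(f):
    # One common Bark approximation (an assumption; the paper does not say which).
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_filterbank(n_fft, sr, n_bands=15):
    freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    bark = hz_to_bark(freqs)
    edges = np.linspace(bark.min(), bark.max(), n_bands + 2)
    fb = np.zeros((n_bands, len(freqs)))
    for b in range(n_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        # Triangular filter on the Bark axis (assumed shape).
        fb[b] = np.clip(np.minimum((bark - lo) / (mid - lo),
                                   (hi - bark) / (hi - mid)), 0.0, None)
    return fb

def auditory_spectrum(signal, sr=16000, win=0.030, hop=0.010, n_bands=15):
    n_win, n_hop = int(sr * win), int(sr * hop)
    n_fft = int(2 ** np.ceil(np.log2(n_win)))
    fb = bark_filterbank(n_fft, sr, n_bands)
    window = np.hamming(n_win)
    frames = []
    for start in range(0, len(signal) - n_win, n_hop):
        spec = np.fft.rfft(signal[start:start + n_win] * window, n_fft)
        frames.append(fb @ np.abs(spec) ** 2)   # critical-band energies
    return np.array(frames)                      # (number of 10 ms frames, 15)

if __name__ == "__main__":
    x = np.random.randn(16000)                   # 1 s of noise as a stand-in signal
    print(auditory_spectrum(x).shape)
```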

Fig. 1. Set of temporal filters obtained by first (G1, left) and second (G2, right) order derivation of a Gaussian function. G1 and G2 are successively split into two filter banks, (G1-Low and G2-Low, dashed lines) and (G1-High and G2-High, continuous lines), that filter low and high modulation frequencies, respectively.

Fig. 2. Normalized frequency response of G1 (left) and G2 (right). G1 and G2 are successively split into two filter banks: G1-Low and G2-Low (dashed lines) emphasize low modulation frequencies while G1-High and G2-High emphasize high modulation frequencies.

Different ranges of modulation frequencies are extracted using MRASTA filtering (see Hermansky and Fousek, 2005, for details). MRASTA is an extension of RASTA filtering and extracts different modulation frequencies using a set of multiple-resolution filters. A one-second-long temporal trajectory in each critical band is filtered with a bank of band-pass filters. Those filters represent first derivatives G1 = [g1_{\sigma_i}] (Eq. (1)) and second derivatives G2 = [g2_{\sigma_i}] (Eq. (2)) of Gaussian functions with variance \sigma_i varying in the range 8-60 ms (see Fig. 1). In effect, the MRASTA filters are multi-resolution band-pass filters on modulation frequency, dividing the available modulation frequency range into individual sub-bands. (Unlike in Hermansky and Fousek (2005), the filter-banks G1 and G2 are composed of six filters rather than eight, leaving out the two filters with the longest impulse responses.)

g1_{\sigma_i}(x) \propto \frac{x}{\sigma_i^2} \exp\left(-\frac{x^2}{2\sigma_i^2}\right)    (1)

g2_{\sigma_i}(x) \propto \left(\frac{x^2}{\sigma_i^4} - \frac{1}{\sigma_i^2}\right) \exp\left(-\frac{x^2}{2\sigma_i^2}\right)    (2)

with \sigma_i \in \{0.8, 1.2, 1.8, 2.7, 4, 6\}.

In the modulation frequency domain, they correspond to a filter-bank with equally spaced filters on a logarithmic scale (see Fig. 2). Identical filters are used for all critical bands. Thus, they provide a multiple-resolution representation of the time-frequency plane. MRASTA filtering is consistent with studies in (Houtgast, 1989; Dau et al., 1997) where human perception of modulation frequencies is modeled using a bank of filters equally spaced on a logarithmic scale. This bank of filters subdivides the available modulation frequency range into separate channels with decreasing frequency resolution moving from slow to fast modulations. After MRASTA filtering the total number of features is 15 x 12 = 180. Then frequency derivatives across three consecutive critical bands are introduced (Hermansky and Fousek, 2005). The representation considers all the possible auditory and modulation frequency ranges of the speech signal.

Let us now divide this available information into different auditory and modulation frequency channels. In order to obtain different auditory frequency sub-bands, the available 15 critical bands are split into two ranges of seven and eight critical bands, referred to as F-Low and F-High, respectively. The investigation is limited to two parts to simplify the setup. Filter-banks G1 and G2 cover the whole range of modulation frequencies. We are interested in processing different parts of the modulation spectrum separately and, again, we limit the investigation to two parts to simplify the setup. Similarly to what is done with the auditory filter-banks, the filter-banks G1 and G2 (six filters each) are split into separate filter banks G1-Low, G2-Low and G1-High, G2-High that filter slow and fast modulation frequencies, respectively. We define G-High and G-Low as follows:

G-High = [G1-High, G2-High] = [g1_{\sigma_i}, g2_{\sigma_i}]  with  \sigma_i \in \{0.8, 1.2, 1.8\}    (3)

G-Low = [G1-Low, G2-Low] = [g1_{\sigma_i}, g2_{\sigma_i}]  with  \sigma_i \in \{2.7, 4, 6\}    (4)

Filters G1-High and G2-High are short filters (Fig. 1, continuous lines) and they process high modulation frequencies (Fig. 2, continuous lines). Filters G1-Low and G2-Low are long filters (Fig. 1, dashed lines) and they process low modulation frequencies (Fig. 2, dashed lines).
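The filters of Eqs. (1)-(2) and the split of Eqs. (3)-(4) can be sketched as below. The filter shapes follow the equations and the split follows the sigma groupings; the filter length (one second, i.e. 101 frames at the 10 ms frame rate) matches the one-second trajectories mentioned in the text, while the normalization is an assumption of the sketch.

```python
# Sketch of the MRASTA temporal filter banks and their G-High / G-Low split.
import numpy as np

SIGMAS = np.array([0.8, 1.2, 1.8, 2.7, 4.0, 6.0])   # in 10 ms frames (8-60 ms)

def mrasta_filters(sigmas=SIGMAS, half_len=50):
    t = np.arange(-half_len, half_len + 1, dtype=float)     # 101-tap filters (~1 s)
    g1, g2 = [], []
    for s in sigmas:
        env = np.exp(-t ** 2 / (2 * s ** 2))
        h1 = (t / s ** 2) * env                       # Eq. (1): first derivative
        h2 = (t ** 2 / s ** 4 - 1.0 / s ** 2) * env   # Eq. (2): second derivative
        g1.append(h1 / np.abs(h1).sum())              # crude normalization (assumed)
        g2.append(h2 / np.abs(h2).sum())
    return np.array(g1), np.array(g2)

def split_high_low(g1, g2):
    # G-High: three shortest filters of each bank (Eq. (3));
    # G-Low: three longest filters of each bank (Eq. (4)).
    return np.vstack([g1[:3], g2[:3]]), np.vstack([g1[3:], g2[3:]])

def filter_trajectories(aud_spec, bank):
    # aud_spec: (n_frames, n_bands). Each critical-band trajectory is convolved
    # with every filter of the bank; the full 12-filter bank over 15 bands
    # gives the 15 x 12 = 180 features mentioned in the text.
    out = [np.apply_along_axis(lambda x: np.convolve(x, h, mode="same"), 0, aud_spec)
           for h in bank]
    return np.concatenate(out, axis=1)

if __name__ == "__main__":
    g1, g2 = mrasta_filters()
    g_high, g_low = split_high_low(g1, g2)
    aud = np.random.randn(200, 15)                    # stand-in auditory spectrum
    print(filter_trajectories(aud, g_high).shape)     # (200, 15 * 6) = (200, 90)
```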

The effect of this filtering is depicted in Fig. 3: the left picture plots the auditory spectrum of a speech signal, while the center and right pictures plot the auditory spectrum after filtering with g1_{1.2} and g1_{6}, respectively.

Fig. 3. Auditory spectrum of a speech signal (left) and its filtered versions with filter g1_{1.2}, i.e. extraction of high modulation frequencies (center), and g1_{6}, i.e. extraction of low modulation frequencies (right).

We can notice that high modulations represent the original auditory spectrum with more detail, while low modulations give a coarse representation of the spectrum. The cutoff frequency for both filter-banks G-High and G-Low is approximately 10 Hz. Separate ranges of auditory and modulation frequency channels can be obtained by dividing the initial spectrogram into four channels, (F-Low, G-Low), (F-Low, G-High), (F-High, G-Low) and (F-High, G-High), that represent the combinations of low/high auditory and modulation frequencies. This processing is depicted in Fig. 4.

Fig. 4. Auditory-modulation frequency channel extraction: the auditory spectrum is filtered with a set of Gaussian filters that extract high and low modulation frequencies (G-High and G-Low). After that, auditory frequencies are divided into two channels (F-High and F-Low). This produces four different auditory-modulation channels: (F-Low, G-Low), (F-Low, G-High), (F-High, G-Low), (F-High, G-High).

In the remainder, the investigation focuses on the most effective way of combining classifiers trained on:

1. Separate ranges of auditory frequencies (F-High) and (F-Low) (Section 5).
2. Separate ranges of modulation frequencies (G-High) and (G-Low) (Section 6).
3. Separate ranges of auditory-modulation frequencies (F-Low, G-Low), (F-Low, G-High), (F-High, G-Low), (F-High, G-High) (Section 7).

3. Combination of classifiers

The classifier used for this study is the Multi Layer Perceptron (MLP) described in (Bourlard and Morgan, 1994). Training is done using back-propagation (Rumelhart et al., 1986), minimizing the cross entropy between MLP outputs and phonetic targets (Bourlard and Morgan, 1994). The MLP output can thus be considered an estimate of the phoneme posterior probabilities conditioned on the acoustic observation vector. Phoneme posterior probabilities are then used as conventional features in HMM/GMM systems through the TANDEM approach (Hermansky et al., 2000). Two different combination schemes based on Multi Layer Perceptrons (MLP) are studied: the first combines two feature streams in parallel fashion, while the second combines features in hierarchical (sequential) fashion.

3.1. Parallel combination

Given two different feature streams, a separate MLP for estimating phoneme posterior probabilities is trained independently on each of them. Phoneme posterior probabilities from the individual MLPs are then concatenated together, forming the input to a third MLP which produces a single phoneme posterior estimate. The process is depicted in Fig. 5. This architecture is generally used for combining classifiers trained on different auditory frequency bands (Hermansky et al., 1996) and has been used as well in many multi-stream systems that use speech temporal modulations (e.g. Kleinschmidt, 2002; Morgan et al., 2004; Zhao et al., 2009).
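A toy sketch of the parallel scheme of Fig. 5 is given below, with scikit-learn MLPs standing in for the phoneme-posterior estimators of the paper; the feature streams, dimensions and training data are invented for the example and do not reproduce the actual system.

```python
# Parallel combination (Fig. 5): two stream-specific MLPs, one merger MLP.
import numpy as np
from sklearn.neural_network import MLPClassifier

n, n_phones = 2000, 42
rng = np.random.default_rng(0)
stream_a = rng.normal(size=(n, 90))      # e.g. features from one frequency range
stream_b = rng.normal(size=(n, 90))      # e.g. features from the other range
labels = np.arange(n) % n_phones         # toy frame-level phonetic targets

# Step 1: one MLP per stream, each estimating phoneme posteriors.
mlp_a = MLPClassifier(hidden_layer_sizes=(200,), max_iter=50).fit(stream_a, labels)
mlp_b = MLPClassifier(hidden_layer_sizes=(200,), max_iter=50).fit(stream_b, labels)
post_a = mlp_a.predict_proba(stream_a)   # (n, 42)
post_b = mlp_b.predict_proba(stream_b)   # (n, 42)

# Step 2: concatenate the posterior streams and train a merger MLP.
# Swapping stream_a and stream_b changes nothing: the scheme is order-invariant.
merger_in = np.hstack([post_a, post_b])  # (n, 84)
merger = MLPClassifier(hidden_layer_sizes=(200,), max_iter=50).fit(merger_in, labels)
print(merger.predict_proba(merger_in).shape)   # (n, 42) combined posteriors
```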

Fig. 5. Parallel processing of two feature sets. This combination scheme is invariant to the order in which features are introduced.

Fig. 6. Hierarchical processing of two feature sets. This combination scheme is sensitive to the order in which features are introduced.

3.2. Hierarchical processing

An MLP is trained on a first feature stream in order to obtain phoneme posteriors. These posteriors are then concatenated with a second feature stream, thus forming the input to a second phoneme-posterior-estimating MLP. In such a way, phoneme estimates from the first MLP are modified by a second net using evidence from a different feature stream. This process is depicted in Fig. 6. In contrast to parallel processing, the order in which features are presented does make a difference. Hierarchical processing integrates the information contained in the different frequency channels in sequential fashion, progressively modifying the phoneme posteriors obtained from the first MLP using a different signal representation. In the rest of the paper, the total number of parameters in the various MLP structures is kept constant by modifying the size of the hidden layer. This allows a fair comparison between the different experiments without biasing results towards structures that contain more parameters.
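The hierarchical (sequential) scheme of Fig. 6 can be sketched in the same toy setting as above; again the scikit-learn MLPs, dimensions and data are stand-ins, and the hidden-layer sizes are arbitrary rather than matched in parameter count as in the experiments.

```python
# Hierarchical combination (Fig. 6): the second net revises the first net's
# posteriors in the light of a new feature stream, so the order matters.
import numpy as np
from sklearn.neural_network import MLPClassifier

n, n_phones = 2000, 42
rng = np.random.default_rng(0)
stream_first = rng.normal(size=(n, 90))    # stream presented first
stream_second = rng.normal(size=(n, 90))   # stream presented second
labels = np.arange(n) % n_phones

# First MLP: posteriors estimated from the first stream only.
mlp1 = MLPClassifier(hidden_layer_sizes=(200,), max_iter=50).fit(stream_first, labels)
post1 = mlp1.predict_proba(stream_first)                 # (n, 42)

# Second MLP: first-stage posteriors concatenated with the second stream.
mlp2_in = np.hstack([post1, stream_second])              # (n, 42 + 90)
mlp2 = MLPClassifier(hidden_layer_sizes=(200,), max_iter=50).fit(mlp2_in, labels)
print(mlp2.predict_proba(mlp2_in).shape)                 # (n, 42) final posteriors
```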

4. Experimental setting

In contrast to previous related work on considerably smaller amounts of data, we pursue the investigation on a Large Vocabulary Continuous Speech Recognition (LVCSR) task. The system is a simplified version of the first pass of the AMI LVCSR system for meeting transcription described in (Hain et al., 2005) and uses a pruned trigram language model. (The first-pass RT05 system does not include VTLN, HLDA or speaker adaptation, and decoding is done using a pruned trigram language model; only the first pass is used, as the paper does not focus on benchmarking the LVCSR system but on comparing the different feature sets.) The training data for this system comprises individual headset microphone (IHM) data from four meeting corpora: NIST (13 h), ISL (10 h), ICSI (73 h) and a preliminary part of the AMI corpus (16 h). Acoustic models are phonetically state-tied triphone models trained using standard HTK maximum likelihood training procedures. The recognition experiments are conducted on the NIST RT05s evaluation data (tests/rt/rt2005/spring/), Individual Headset Microphone (IHM) part, which is composed of speech recorded in five different meeting rooms (AMI, CMU, ICSI, NIST, VT). The pronunciation dictionary is the same as the one used in the AMI NIST RT05s system (Hain et al., 2005). The Juicer large vocabulary decoder (Moore et al., 2006) is used for recognition with a pruned trigram language model.

In order to use phoneme posteriors in a conventional HMM/GMM system, the TANDEM approach is used (Hermansky et al., 2000). The different time-frequency representations are used as input to an MLP which estimates phoneme posterior probabilities. The phonetic targets consist of 42 phonemes. Phoneme posteriors are then modified according to a Log/KLT transform and used as conventional features for HMM/GMM systems. After the KLT, only the first 25 components are retained, accounting for 95% of the variability. Table 1 reports results for the baseline system (PLP plus dynamic features) and the MRASTA TANDEM system, where all the available ranges of modulation and auditory frequencies are processed using a single MLP. The MRASTA TANDEM features perform 3.4% worse than the PLP baseline system.

Table 1. RT05 WER for the baseline PLP system and the MRASTA TANDEM features (columns: PLP, MRASTA).
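The Log/KLT post-processing described above (log of the phoneme posteriors, followed by a KLT estimated on training data, keeping the first 25 components) could be sketched as follows; the epsilon flooring and the synthetic posteriors are assumptions of the sketch.

```python
# Sketch of TANDEM Log/KLT post-processing of phoneme posteriors.
import numpy as np

def log_klt(train_post, test_post, n_keep=25, eps=1e-10):
    logp_train = np.log(train_post + eps)
    logp_test = np.log(test_post + eps)
    mean = logp_train.mean(axis=0)
    cov = np.cov(logp_train - mean, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)           # eigenvalues in ascending order
    basis = eigvec[:, ::-1][:, :n_keep]            # top n_keep KLT directions
    kept_var = eigval[::-1][:n_keep].sum() / eigval.sum()
    return (logp_test - mean) @ basis, kept_var    # decorrelated features

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    post = rng.dirichlet(np.ones(42), size=5000)   # stand-in posteriors (rows sum to 1)
    feats, var = log_klt(post, post)
    print(feats.shape, round(float(var), 3))       # (5000, 25) and retained variance
```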
5. Combination of auditory frequency channels

In this section, we investigate from an ASR perspective whether classifiers trained on separate auditory frequency ranges should be combined in parallel or hierarchical fashion. The auditory spectrum is split into two frequency ranges composed of 7 and 8 critical bands. MRASTA filtering is then applied, resulting in two feature sets of 168 and 192 components each. We refer to those two feature sets as F-Low and F-High; they contain all the available modulation frequencies extracted at low and high auditory frequencies, respectively. Table 2 reports the WER for the two MLP features obtained by training on the separate auditory frequency ranges.

Table 2. RT05 WER for TANDEM features obtained by training MLPs on low and high auditory frequencies (columns: F-Low, F-High).

Classifiers trained on F-Low and F-High are then combined according to the schemes described in Section 3, i.e. in parallel fashion or in hierarchical (sequential) fashion. In contrast to the parallel combination, the hierarchical combination is sensitive to the order in which features are introduced; thus we consider the cases in which the processing moves from F-Low to F-High and vice-versa. Results are reported in Table 3.

Table 3. RT05 WER for TANDEM features obtained by combining MLPs trained on separate ranges of auditory frequencies, both in parallel and in hierarchical fashion (columns: F-Low to F-High, F-High to F-Low, Parallel).

The parallel processing outperforms the hierarchical processing. Using two separate frequency channels reduces the WER by about 2% absolute (i.e. from 45.8% to 43.9%) w.r.t. the single classifier approach. The variation in WER between the three combination schemes is approximately 1%. Those results are consistent with findings on human speech recognition (Fletcher, 1953; Allen, 1994) and confirm that the parallel scheme (as in conventional multi-band systems) outperforms other combinations in the case of classifiers trained on different auditory frequency channels.

6. Combination of modulation frequency channels

In this section we investigate from an ASR perspective whether classifiers trained on separate modulation frequency ranges should be combined in parallel or hierarchical fashion. As before, we limit the splitting to only two modulation frequency ranges. Filter-banks G1 and G2 (six filters each) are divided into two separate filter banks as described in Section 2. We refer to those as high modulations (G-High) and low modulations (G-Low); they contain the entire available range of auditory frequencies extracted at high and low modulations, respectively. Performances of MLP features trained on G-High and G-Low are reported in Table 4.

Table 4. RT05 WER for TANDEM features obtained by training MLPs on high and low modulation frequencies (columns: G-High, G-Low).

Classifiers trained on G-Low and G-High are then combined in parallel fashion or in hierarchical (sequential) fashion, moving either from G-Low to G-High or from G-High to G-Low. Results are reported in Table 5.

Table 5. RT05 WER for TANDEM features obtained by combining MLPs trained on separate ranges of modulation frequencies, both in parallel and in hierarchical fashion (columns: G-Low to G-High, G-High to G-Low, Parallel).

Moving in the hierarchy from low to high modulation frequencies yields similar performance to the single MLP approach. On the other hand, moving from high to low modulation frequencies produces a significant reduction of 5.8% absolute in the final WER w.r.t. the single classifier approach.

The parallel combination performs 1.4% absolute worse than the sequential combination. In contrast to auditory frequencies, the hierarchical combination outperforms the parallel combination when the processing moves from high to low modulation frequencies. Furthermore, this approach outperforms the PLP baseline by 2.5% absolute. The variation in WER between the three combination schemes is considerably larger than the variation obtained by splitting the auditory frequencies. Those findings are consistent with physiological experiments in (Miller et al., 2002) that show how different levels of speech processing may attend to different rates of the modulation spectrum, the higher levels emphasizing lower modulation frequency rates.

To verify that the improvements in the previous experiment are produced by the sequential processing of modulation frequencies and not by the hierarchy of MLP classifiers itself, an additional experiment is proposed. Posterior features from the single MRASTA MLP (i.e. all auditory and modulation frequencies processed simultaneously) are presented as input to a second MLP. The second MLP does not use additional input but only re-processes a block of several concatenated posterior features. Table 6 reports the WER on the RT05 data set. Results show an improvement w.r.t. the single MRASTA classifier of 1.6% absolute, thus significantly worse than the sequential modulation processing, which produces a WER reduction of 5.8%. The experiment reveals that the improvements actually come from the sequential processing of modulation frequencies rather than from the hierarchy of MLPs.

Table 6. RT05 WER for TANDEM features obtained by hierarchically re-processing the output of an MLP trained on all the available auditory and modulation frequencies (Hier posterior).

7. Combination of auditory modulation frequencies

This section aims at investigating whether the conclusions of Sections 5 and 6 are also valid when combining four joint auditory-modulation frequency channels. First, four auditory-modulation frequency ranges are extracted as described in Section 2. Then four separate MLP classifiers are trained, one on each of them. The WER for each of those individual TANDEM features is reported in Table 7. All four streams have a similarly high WER compared to the full MRASTA filter bank.

Table 7. RT05 WER for TANDEM features obtained by training MLPs on separate auditory-modulation frequency ranges (columns: (G-Low, F-Low), (G-High, F-Low), (G-Low, F-High), (G-High, F-High)).

Posteriors are then combined into a single feature stream by training another MLP that operates as a merger classifier (see Fig. 7). This approach combines in parallel all four information channels (auditory and modulation). Results are reported in Table 8.

Fig. 7. Parallel combination of four separate auditory-modulation channels. A separate MLP is trained on each of the frequency channels. Posterior estimates are then combined together using another MLP classifier.

Table 8. RT05 WER for the parallel combination of evidence from the four auditory-modulation frequency channels (see Fig. 7).

Separate processing of auditory-modulation frequency channels reduces the WER from 45.8% (single classifier approach) to 40.7%, i.e. approximately 5% absolute. In order to verify whether the findings of Sections 5 and 6 also hold when different ranges of joint auditory-modulation frequencies are considered, the combination scheme of Fig. 8 is investigated. This scheme aims at processing auditory frequencies in parallel fashion and modulation frequencies in hierarchical fashion. The proposed scheme combines in parallel MLPs trained on the high and low auditory frequencies extracted at high modulation frequencies. Then the high and low auditory frequencies extracted at low modulation frequencies are combined in hierarchical fashion. Results are reported in Table 9.
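One plausible reading of the Fig. 8 scheme is sketched below: the two high-modulation channels are merged in parallel, and the result is then refined hierarchically with the low-modulation evidence. How exactly the low-modulation channels enter the second stage is not fully specified here, so that part (appending the low-modulation features directly) is an assumption, as are the toy dimensions and the scikit-learn stand-ins.

```python
# Sketch of the combined scheme of Fig. 8: parallel over auditory frequencies
# within the high-modulation channels, then a hierarchical step towards the
# low-modulation channels.
import numpy as np
from sklearn.neural_network import MLPClassifier

n, n_phones, dim = 2000, 42, 45
rng = np.random.default_rng(0)
labels = np.arange(n) % n_phones
flow_ghigh, fhigh_ghigh, flow_glow, fhigh_glow = (
    rng.normal(size=(n, dim)) for _ in range(4))   # four toy channel streams

def mlp_posteriors(x, y):
    return MLPClassifier(hidden_layer_sizes=(100,), max_iter=50).fit(x, y).predict_proba(x)

# Stage 1: parallel combination of the two high-modulation channels.
p1 = mlp_posteriors(flow_ghigh, labels)
p2 = mlp_posteriors(fhigh_ghigh, labels)
merged_high = mlp_posteriors(np.hstack([p1, p2]), labels)

# Stage 2: hierarchical step, revising the merged posteriors with the
# low-modulation evidence (assumed here to be the raw low-modulation features).
stage2 = mlp_posteriors(np.hstack([merged_high, flow_glow, fhigh_glow]), labels)
print(stage2.shape)   # (n, 42) final posterior estimates
```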
The proposed combination scheme produces an improvement of 1.1% in terms of WER with respect to the parallel combination of the four channels. Furthermore, the WER is reduced by 0.4% with respect to the hierarchical modulation spectrum approach. Those results suggest that the conclusions of Sections 5 and 6 are also verified when joint auditory-modulation channels are combined.

8. Distant microphone results

In order to measure the performance of the different architectures in case of low SNR and increased reverberation, the features are also evaluated on audio acquired in Single Distant Microphone (SDM) conditions.

Fig. 8. Proposed combination of separate auditory-modulation channels. This scheme processes classifiers trained on auditory frequencies in parallel fashion and classifiers trained on modulation frequencies in hierarchical fashion: MLPs trained on high and low auditory frequencies extracted at high modulation frequencies are combined in parallel; then high and low auditory frequencies extracted at low modulation frequencies are combined in hierarchical fashion.

Table 9. RT05 WER for TANDEM features obtained using the MLP architecture depicted in Fig. 8.

The system training is the same as in Section 4. Acoustic features for evaluation are extracted from a single distant microphone. Results are reported in Table 10 and include the parallel and hierarchical combinations of auditory and modulation frequency channels as well as the joint auditory-modulation processing.

Table 10. RT05 WER on Single Distant Microphone audio for the different feature sets (PLP; TANDEM MRASTA; auditory: Parallel, F-Low to F-High, F-High to F-Low; modulation: Parallel, G-Low to G-High, G-High to G-Low; four channels in parallel; four channels combined as in Fig. 8).

On SDM audio, the gap in performance between the PLP baseline and the TANDEM MRASTA features is only 0.4% absolute. The trend for features generated using the parallel and hierarchical architectures is similar to what is reported on IHM data, i.e. the combination of auditory frequencies is more effective in parallel fashion while the combination of modulation frequencies is more effective in hierarchical fashion. Furthermore, hierarchical features largely outperform the PLP baseline.

9. Application in other LVCSR systems

The hierarchical combination of classifiers trained on separate ranges of modulation frequencies (also referred to as the hierarchical modulation spectrum) has also been tested on other languages and integrated in other LVCSR systems that make use of TANDEM-based feature extraction. In (Valente et al., 2009), experiments on an LVCSR system for Mandarin Broadcast speech recognition are presented. The HMM/GMM and MLP training is done using approximately 100 h of manually transcribed Broadcast data. Those data present cleaner acoustic conditions compared to meeting recordings. Results are reported on the DARPA GALE evaluation 06 data. Acoustic models are phonetically state-tied triphone models trained using maximum likelihood training procedures. The Mandarin phonetic set is composed of 71 tonemes. Phoneme posteriors are transformed according to a Log/KLT and only the first 35 components are kept, accounting for 95% of the variability. As before, hierarchical and parallel combinations of modulation frequencies are studied; the number of parameters in the different MLP architectures is kept constant to provide a fair comparison.

Results are evaluated in terms of Character Error Rate (CER) and reported in Table 11.

Table 11. CER for the DARPA GALE eval06 data set (MRASTA; parallel combination of G-High and G-Low; hierarchical combination from G-High to G-Low).

Results reveal conclusions similar to those previously presented for the meeting recognition task, i.e. hierarchical processing of modulation frequencies outperforms both the single classifier approach and the parallel processing. Other large-scale experiments with TANDEM features based on the hierarchical processing of the modulation spectrum are also reported in (Plahl et al., 2008). The authors experiment with HMM/GMM and MLP systems trained on very large amounts of data (1600 h) and integrated in the GALE Mandarin LVCSR system. Results show that the proposed approach outperforms the single classifier approach by 20% on several evaluation and development datasets from the GALE project.

10. Conclusion and discussions

In this paper we discussed the most effective way of combining classifiers trained on separate ranges of auditory and modulation frequencies. Two different schemes are considered: the parallel and the hierarchical (sequential) combination. The parallel combination of classifiers trained on separate ranges of auditory frequencies is a well known practice in the multi-band framework. Table 12 summarizes the results obtained by dividing the available auditory frequencies into two separate ranges; in brackets, relative improvements w.r.t. the single classifier approach are reported. Similarly, Table 13 summarizes the results obtained by dividing the available modulation frequencies into two separate ranges.

Table 12. Summary of WER when dividing the available range of auditory frequencies into two separate channels. In brackets, relative improvements w.r.t. the MRASTA baseline. MRASTA: 45.8%; Parallel: 43.9% (+4%); Low to High: 45.0% (+1%); High to Low: 44.3% (+3%).

Table 13. Summary of WER when dividing the available range of modulation frequencies into two separate channels. In brackets, relative improvements w.r.t. the MRASTA baseline. MRASTA: 45.8%; Parallel: 41.4% (+9%); Low to High: 45.8% (+0%); High to Low: 40.0% (+12%).
Those findings are effective also in case of combination of joint auditory frequency channels. In fact the proposed combination scheme of Fig. 8 outperforms by 1% the conventional parallel combination of MLPs trained on separate ranges of auditory/modulation frequencies. The reason of this effect is due to what type of information auditory and modulation frequencies carry and on the time scales at which they are extracted. Auditory frequencies are extracted from a fixed short-term (30 ms) window and represents the information contained in the instantaneous frequency of the signal. Thus high and low auditory frequencies correspond to the same temporal context. Auditory frequencies do not carry information on the dynamics of the signal. The extraction of modulation frequencies, i.e. the dynamics of the signal, involves analysis at different time scales done by the multiple-resolution (MRASTA) filters with varying time spans. While short Table 14 Summary of WER dividing the available range of auditory and modulation frequencies. In brackets relative improvements w.r.t. the MRASTA baseline are reported. Features MRASTA Parallel Low to High High to Low Two separate auditory channels WER (+3%) 56.8% (+1%) 56.0% (+2%) Two separate modulation channels WER (+6%) 57.1% (+0.3%) 51.2 (+10%)

While short filters (fast modulations) provide a fine representation of the speech dynamics, long filters (slow modulations) provide a coarse representation of the speech dynamics. MLPs trained on different ranges of auditory frequencies represent phoneme posterior estimates at the same time scale (30 ms). MLPs trained on different ranges of modulation frequencies represent phoneme posterior estimates at different time scales, with fine and coarse representations of the dynamics extracted using short and long temporal filters, respectively. The difference between the parallel and hierarchical combinations lies in the fact that the parallel combination assumes that there is no ordering in the process, while the hierarchical combination is a sequential scheme where the order in which features are introduced matters.

Experiments on human speech recognition suggest that there is no ordering in processing auditory frequencies, i.e. recognition is carried out independently in each band and then a decision is taken by merging results from the different bands. This is verified in the experimental section, as the parallel architecture (which does not assume any ordering) performs better than the hierarchical sequential architectures. This is furthermore supported by the small difference in performance when the processing moves from low to high or from high to low auditory frequencies (in the order of 2% relative).

MLPs trained on different ranges of modulation frequencies produce phoneme posterior estimates at different time scales, i.e., they can be ordered according to their time scales. Combining those estimates assuming that there is no ordering, i.e. in parallel, may be suboptimal. Studies like (Miller et al., 2002) suggest that the processing of modulation frequencies is sequential, i.e. different levels attend to different rates of the modulation spectrum. Hierarchical combination is a possible way of implementing sequential processing: it assumes an ordering and can operate from fine-to-coarse (i.e. fast to slow) or coarse-to-fine (i.e. slow to fast) time scales. Experiments reveal that integrating information from fast modulations (i.e. short temporal windows) to slow modulations (i.e. long temporal windows) is the most effective processing, consistently with (Miller et al., 2002). The hypothesis of sequential processing is furthermore supported by the large difference in performance between the MLP architectures moving from fast to slow and from slow to fast modulation frequencies (in the order of 10% relative). In other words, the combination of MLPs trained on high and low modulation frequencies involves the combination of different time contexts. Moving from short to long time-spans is similar to progressively increasing the temporal context, as done in a number of other posterior-based ASR systems (Bourlard et al., 2004). Such an effect could not be obtained using the parallel scheme. On the other hand, the MLPs trained on high and low auditory frequencies have a fixed input temporal context, they do not provide a fine/coarse estimation of the input signal, and they do not have a particular ordering; thus the sequential combination is not as effective as the parallel one.

We limited the investigation here to two auditory and two modulation frequency channels, obtained by splitting the tonotopic scales into two equally sized bands, to simplify the experimental setup.
In the future we plan to consider a larger number of bands and to experiment with a considerably larger number of frequency channels, as in (Kleinschmidt, 2002; Zhao et al., 2009).

Acknowledgments

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR C-0023 and by the European Union under the integrated project AMIDA. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). The author thanks colleagues from the AMIDA and GALE projects for their help with the different LVCSR systems and the reviewers for their comments.

References

Allen, J.B., 1994. How do humans process and recognize speech? IEEE Trans. Speech Audio Process. 2 (October).
Allen, J.B., 2005. Articulation and Intelligibility. Morgan and Claypool.
Bourlard, H., Dupont, S., 1996. A new ASR approach based on independent processing and re-combination of partial frequency bands. In: Proc. ICSLP 96.
Bourlard, H., Morgan, N., 1994. Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers.
Bourlard, H., et al., 2004. Towards using hierarchical posteriors for flexible automatic speech recognition systems. In: Proceedings of the DARPA EARS (Effective, Affordable, Reusable, Speech-to-text) Rich Transcription (RT 04) Workshop.
Chen, B., Chang, S., Sivadas, S., 2003. Learning discriminative temporal patterns in speech: development of novel TRAPS-like classifiers. In: Proc. Eurospeech.
Dau, T., Kollmeier, B., Kohlrausch, A., 1997. Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers. J. Acoust. Soc. Am. 102.
Fletcher, H., 1953. Speech and Hearing in Communication. Krieger, New York.
Hain, T., et al., 2005. The 2005 AMI system for the transcription of speech in meetings. In: NIST RT05 Workshop, Edinburgh, UK.
Hermansky, H., 1998. Should recognizers have ears? Speech Commun. 25.
Hermansky, H., 2003. TRAP-TANDEM: data-driven extraction of temporal features from speech. In: Proc. ASRU.
Hermansky, H., Fousek, P., 2005. Multi-resolution RASTA filtering for TANDEM-based ASR. In: Proc. Interspeech.
Hermansky, H., Morgan, N., 1994. RASTA processing of speech. IEEE Trans. Speech Audio Process. 2.
Hermansky, H., Sharma, S., 1999. Temporal patterns (TRAPS) in ASR of noisy speech. In: Proc. ICASSP 99.
Hermansky, H., Tibrewala, S., Pavel, M., 1996. Towards ASR on partially corrupted speech. In: Proc. ICSLP 96.
Hermansky, H., Kanedera, H., Arai, T., Pavel, M., 1997. On the importance of various modulation frequencies for speech recognition. In: Proc. Eurospeech 97.

Hermansky, H., Ellis, D., Sharma, S., 2000. TANDEM connectionist feature extraction for conventional HMM systems. In: Proc. ICASSP.
Houtgast, T., 1989. Frequency selectivity in amplitude modulation detection. J. Acoust. Soc. Am. 88.
Kleinschmidt, M., 2002. Methods for capturing spectro-temporal modulations in automatic speech recognition. Acta Acustica united with Acustica 88 (3).
Miller et al., 2002. Spectro-temporal receptive fields in the lemniscal auditory thalamus and cortex. J. Neurophysiol. 87 (1).
Misra, H., Bourlard, H., Tyagi, V., 2003. Entropy-based multi-stream combination. In: Proc. ICASSP.
Moore, D., et al., 2006. Juicer: a weighted finite state transducer speech decoder. In: Proc. MLMI 2006, Washington, DC.
Morgan, N., Chen, B., Zhu, Q., Stolcke, A., 2004. Trapping conversational speech: extending TRAP/TANDEM approaches to conversational telephone speech recognition. In: Proc. ICASSP.
Plahl, C., Hoffmeister, B., Heigold, G., Lööf, J., Schlüter, R., Ney, H. Development of the GALE 2008 Mandarin LVCSR system. In: Proc. Ann. Conf. of the Internat. Speech Communication Association (Interspeech).
Rumelhart, D., Hinton, G., Williams, R., 1986. Learning representations by back-propagating errors. Nature 323.
Sivadas, S., Hermansky, H., 2002. Hierarchical tandem feature extraction. In: Proc. ICASSP.
Valente, F., Hermansky, H. Combination of acoustic classifiers based on Dempster-Shafer theory of evidence. In: Proc. ICASSP.
Valente, F., Hermansky, H., 2008. Hierarchical and parallel processing of modulation spectrum for ASR applications. In: Proc. ICASSP.
Valente, F., Vepa, J., Plahl, C., Gollan, C., Hermansky, H., Schlüter, R., 2007. Hierarchical neural networks feature extraction for LVCSR system. In: Proc. Interspeech.
Valente, F., Magimai-Doss, M., Plahl, C., Ravuri, S., 2009. Hierarchical processing of the modulation spectrum for GALE Mandarin LVCSR system. In: Proc. 10th Ann. Conf. of the Internat. Speech Communication Association (Interspeech).
Zhao, S., Ravuri, S., Morgan, N., 2009. Multi-stream to many-stream: using spectro-temporal features for ASR. In: Proc. Interspeech.


More information

PDF hosted at the Radboud Repository of the Radboud University Nijmegen

PDF hosted at the Radboud Repository of the Radboud University Nijmegen PDF hosted at the Radboud Repository of the Radboud University Nijmegen The following full text is an author's version which may differ from the publisher's version. For additional information about this

More information

Joint recognition and direction-of-arrival estimation of simultaneous meetingroom acoustic events

Joint recognition and direction-of-arrival estimation of simultaneous meetingroom acoustic events INTERSPEECH 2013 Joint recognition and direction-of-arrival estimation of simultaneous meetingroom acoustic events Rupayan Chakraborty and Climent Nadeu TALP Research Centre, Department of Signal Theory

More information

Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech

Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech INTERSPEECH 5 Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech M. A. Tuğtekin Turan and Engin Erzin Multimedia, Vision and Graphics Laboratory,

More information

A CONSTRUCTION OF COMPACT MFCC-TYPE FEATURES USING SHORT-TIME STATISTICS FOR APPLICATIONS IN AUDIO SEGMENTATION

A CONSTRUCTION OF COMPACT MFCC-TYPE FEATURES USING SHORT-TIME STATISTICS FOR APPLICATIONS IN AUDIO SEGMENTATION 17th European Signal Processing Conference (EUSIPCO 2009) Glasgow, Scotland, August 24-28, 2009 A CONSTRUCTION OF COMPACT MFCC-TYPE FEATURES USING SHORT-TIME STATISTICS FOR APPLICATIONS IN AUDIO SEGMENTATION

More information

POSSIBLY the most noticeable difference when performing

POSSIBLY the most noticeable difference when performing IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 7, SEPTEMBER 2007 2011 Acoustic Beamforming for Speaker Diarization of Meetings Xavier Anguera, Associate Member, IEEE, Chuck Wooters,

More information

Learning the Speech Front-end With Raw Waveform CLDNNs

Learning the Speech Front-end With Raw Waveform CLDNNs INTERSPEECH 2015 Learning the Speech Front-end With Raw Waveform CLDNNs Tara N. Sainath, Ron J. Weiss, Andrew Senior, Kevin W. Wilson, Oriol Vinyals Google, Inc. New York, NY, U.S.A {tsainath, ronw, andrewsenior,

More information

The 2010 CMU GALE Speech-to-Text System

The 2010 CMU GALE Speech-to-Text System Research Showcase @ CMU Language Technologies Institute School of Computer Science 9-2 The 2 CMU GALE Speech-to-Text System Florian Metze, fmetze@andrew.cmu.edu Roger Hsiao Qin Jin Udhyakumar Nallasamy

More information

Damped Oscillator Cepstral Coefficients for Robust Speech Recognition

Damped Oscillator Cepstral Coefficients for Robust Speech Recognition Damped Oscillator Cepstral Coefficients for Robust Speech Recognition Vikramjit Mitra, Horacio Franco, Martin Graciarena Speech Technology and Research Laboratory, SRI International, Menlo Park, CA, USA.

More information

Auditory Based Feature Vectors for Speech Recognition Systems

Auditory Based Feature Vectors for Speech Recognition Systems Auditory Based Feature Vectors for Speech Recognition Systems Dr. Waleed H. Abdulla Electrical & Computer Engineering Department The University of Auckland, New Zealand [w.abdulla@auckland.ac.nz] 1 Outlines

More information

Time-Frequency Distributions for Automatic Speech Recognition

Time-Frequency Distributions for Automatic Speech Recognition 196 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 3, MARCH 2001 Time-Frequency Distributions for Automatic Speech Recognition Alexandros Potamianos, Member, IEEE, and Petros Maragos, Fellow,

More information

Feature Extraction Using 2-D Autoregressive Models For Speaker Recognition

Feature Extraction Using 2-D Autoregressive Models For Speaker Recognition Feature Extraction Using 2-D Autoregressive Models For Speaker Recognition Sriram Ganapathy 1, Samuel Thomas 1 and Hynek Hermansky 1,2 1 Dept. of ECE, Johns Hopkins University, USA 2 Human Language Technology

More information

Classification of ships using autocorrelation technique for feature extraction of the underwater acoustic noise

Classification of ships using autocorrelation technique for feature extraction of the underwater acoustic noise Classification of ships using autocorrelation technique for feature extraction of the underwater acoustic noise Noha KORANY 1 Alexandria University, Egypt ABSTRACT The paper applies spectral analysis to

More information

MEDIUM-DURATION MODULATION CEPSTRAL FEATURE FOR ROBUST SPEECH RECOGNITION. Vikramjit Mitra, Horacio Franco, Martin Graciarena, Dimitra Vergyri

MEDIUM-DURATION MODULATION CEPSTRAL FEATURE FOR ROBUST SPEECH RECOGNITION. Vikramjit Mitra, Horacio Franco, Martin Graciarena, Dimitra Vergyri 2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) MEDIUM-DURATION MODULATION CEPSTRAL FEATURE FOR ROBUST SPEECH RECOGNITION Vikramjit Mitra, Horacio Franco, Martin Graciarena,

More information

Audio Augmentation for Speech Recognition

Audio Augmentation for Speech Recognition Audio Augmentation for Speech Recognition Tom Ko 1, Vijayaditya Peddinti 2, Daniel Povey 2,3, Sanjeev Khudanpur 2,3 1 Huawei Noah s Ark Research Lab, Hong Kong, China 2 Center for Language and Speech Processing

More information

Speech and Music Discrimination based on Signal Modulation Spectrum.

Speech and Music Discrimination based on Signal Modulation Spectrum. Speech and Music Discrimination based on Signal Modulation Spectrum. Pavel Balabko June 24, 1999 1 Introduction. This work is devoted to the problem of automatic speech and music discrimination. As we

More information

ANALYSIS-BY-SYNTHESIS FEATURE ESTIMATION FOR ROBUST AUTOMATIC SPEECH RECOGNITION USING SPECTRAL MASKS. Michael I Mandel and Arun Narayanan

ANALYSIS-BY-SYNTHESIS FEATURE ESTIMATION FOR ROBUST AUTOMATIC SPEECH RECOGNITION USING SPECTRAL MASKS. Michael I Mandel and Arun Narayanan ANALYSIS-BY-SYNTHESIS FEATURE ESTIMATION FOR ROBUST AUTOMATIC SPEECH RECOGNITION USING SPECTRAL MASKS Michael I Mandel and Arun Narayanan The Ohio State University, Computer Science and Engineering {mandelm,narayaar}@cse.osu.edu

More information

TECHNIQUES FOR HANDLING CONVOLUTIONAL DISTORTION WITH MISSING DATA AUTOMATIC SPEECH RECOGNITION

TECHNIQUES FOR HANDLING CONVOLUTIONAL DISTORTION WITH MISSING DATA AUTOMATIC SPEECH RECOGNITION TECHNIQUES FOR HANDLING CONVOLUTIONAL DISTORTION WITH MISSING DATA AUTOMATIC SPEECH RECOGNITION Kalle J. Palomäki 1,2, Guy J. Brown 2 and Jon Barker 2 1 Helsinki University of Technology, Laboratory of

More information

Auditory motivated front-end for noisy speech using spectro-temporal modulation filtering

Auditory motivated front-end for noisy speech using spectro-temporal modulation filtering Auditory motivated front-end for noisy speech using spectro-temporal modulation filtering Sriram Ganapathy a) and Mohamed Omar IBM T.J. Watson Research Center, Yorktown Heights, New York 10562 ganapath@us.ibm.com,

More information

FEATURE FUSION FOR HIGH-ACCURACY KEYWORD SPOTTING

FEATURE FUSION FOR HIGH-ACCURACY KEYWORD SPOTTING FEATURE FUSION FOR HIGH-ACCURACY KEYWORD SPOTTING Vikramjit Mitra, Julien van Hout, Horacio Franco, Dimitra Vergyri, Yun Lei, Martin Graciarena, Yik-Cheung Tam, Jing Zheng 1 Speech Technology and Research

More information

Isolated Word Recognition Based on Combination of Multiple Noise-Robust Techniques

Isolated Word Recognition Based on Combination of Multiple Noise-Robust Techniques Isolated Word Recognition Based on Combination of Multiple Noise-Robust Techniques 81 Isolated Word Recognition Based on Combination of Multiple Noise-Robust Techniques Noboru Hayasaka 1, Non-member ABSTRACT

More information

Robustness (cont.); End-to-end systems

Robustness (cont.); End-to-end systems Robustness (cont.); End-to-end systems Steve Renals Automatic Speech Recognition ASR Lecture 18 27 March 2017 ASR Lecture 18 Robustness (cont.); End-to-end systems 1 Robust Speech Recognition ASR Lecture

More information

Audio Imputation Using the Non-negative Hidden Markov Model

Audio Imputation Using the Non-negative Hidden Markov Model Audio Imputation Using the Non-negative Hidden Markov Model Jinyu Han 1,, Gautham J. Mysore 2, and Bryan Pardo 1 1 EECS Department, Northwestern University 2 Advanced Technology Labs, Adobe Systems Inc.

More information

Recurrent neural networks Modelling sequential data. MLP Lecture 9 / 13 November 2018 Recurrent Neural Networks 1: Modelling sequential data 1

Recurrent neural networks Modelling sequential data. MLP Lecture 9 / 13 November 2018 Recurrent Neural Networks 1: Modelling sequential data 1 Recurrent neural networks Modelling sequential data MLP Lecture 9 / 13 November 2018 Recurrent Neural Networks 1: Modelling sequential data 1 Recurrent Neural Networks 1: Modelling sequential data Steve

More information

Robust Low-Resource Sound Localization in Correlated Noise

Robust Low-Resource Sound Localization in Correlated Noise INTERSPEECH 2014 Robust Low-Resource Sound Localization in Correlated Noise Lorin Netsch, Jacek Stachurski Texas Instruments, Inc. netsch@ti.com, jacek@ti.com Abstract In this paper we address the problem

More information

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm A.T. Rajamanickam, N.P.Subiramaniyam, A.Balamurugan*,

More information

The Munich 2011 CHiME Challenge Contribution: BLSTM-NMF Speech Enhancement and Recognition for Reverberated Multisource Environments

The Munich 2011 CHiME Challenge Contribution: BLSTM-NMF Speech Enhancement and Recognition for Reverberated Multisource Environments The Munich 2011 CHiME Challenge Contribution: BLSTM-NMF Speech Enhancement and Recognition for Reverberated Multisource Environments Felix Weninger, Jürgen Geiger, Martin Wöllmer, Björn Schuller, Gerhard

More information

Wavelet Speech Enhancement based on the Teager Energy Operator

Wavelet Speech Enhancement based on the Teager Energy Operator Wavelet Speech Enhancement based on the Teager Energy Operator Mohammed Bahoura and Jean Rouat ERMETIS, DSA, Université du Québec à Chicoutimi, Chicoutimi, Québec, G7H 2B1, Canada. Abstract We propose

More information

A New Framework for Supervised Speech Enhancement in the Time Domain

A New Framework for Supervised Speech Enhancement in the Time Domain Interspeech 2018 2-6 September 2018, Hyderabad A New Framework for Supervised Speech Enhancement in the Time Domain Ashutosh Pandey 1 and Deliang Wang 1,2 1 Department of Computer Science and Engineering,

More information

Enhancement of Speech Signal by Adaptation of Scales and Thresholds of Bionic Wavelet Transform Coefficients

Enhancement of Speech Signal by Adaptation of Scales and Thresholds of Bionic Wavelet Transform Coefficients ISSN (Print) : 232 3765 An ISO 3297: 27 Certified Organization Vol. 3, Special Issue 3, April 214 Paiyanoor-63 14, Tamil Nadu, India Enhancement of Speech Signal by Adaptation of Scales and Thresholds

More information

TIME-FREQUENCY CONVOLUTIONAL NETWORKS FOR ROBUST SPEECH RECOGNITION. Vikramjit Mitra, Horacio Franco

TIME-FREQUENCY CONVOLUTIONAL NETWORKS FOR ROBUST SPEECH RECOGNITION. Vikramjit Mitra, Horacio Franco TIME-FREQUENCY CONVOLUTIONAL NETWORKS FOR ROBUST SPEECH RECOGNITION Vikramjit Mitra, Horacio Franco Speech Technology and Research Laboratory, SRI International, Menlo Park, CA {vikramjit.mitra, horacio.franco}@sri.com

More information

DISTANT speech recognition (DSR) [1] is a challenging

DISTANT speech recognition (DSR) [1] is a challenging 1 Convolutional Neural Networks for Distant Speech Recognition Pawel Swietojanski, Student Member, IEEE, Arnab Ghoshal, Member, IEEE, and Steve Renals, Fellow, IEEE Abstract We investigate convolutional

More information

Recurrent neural networks Modelling sequential data. MLP Lecture 9 Recurrent Neural Networks 1: Modelling sequential data 1

Recurrent neural networks Modelling sequential data. MLP Lecture 9 Recurrent Neural Networks 1: Modelling sequential data 1 Recurrent neural networks Modelling sequential data MLP Lecture 9 Recurrent Neural Networks 1: Modelling sequential data 1 Recurrent Neural Networks 1: Modelling sequential data Steve Renals Machine Learning

More information

Speech Signal Analysis

Speech Signal Analysis Speech Signal Analysis Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 2&3 14,18 January 216 ASR Lectures 2&3 Speech Signal Analysis 1 Overview Speech Signal Analysis for

More information

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 MODELING SPECTRAL AND TEMPORAL MASKING IN THE HUMAN AUDITORY SYSTEM PACS: 43.66.Ba, 43.66.Dc Dau, Torsten; Jepsen, Morten L.; Ewert,

More information

ACOUSTIC cepstral features, extracted from short-term

ACOUSTIC cepstral features, extracted from short-term 1 Combination of Cepstral and Phonetically Discriminative Features for Speaker Verification Achintya K. Sarkar, Cong-Thanh Do, Viet-Bac Le and Claude Barras, Member, IEEE Abstract Most speaker recognition

More information

SYNTHETIC SPEECH DETECTION USING TEMPORAL MODULATION FEATURE

SYNTHETIC SPEECH DETECTION USING TEMPORAL MODULATION FEATURE SYNTHETIC SPEECH DETECTION USING TEMPORAL MODULATION FEATURE Zhizheng Wu 1,2, Xiong Xiao 2, Eng Siong Chng 1,2, Haizhou Li 1,2,3 1 School of Computer Engineering, Nanyang Technological University (NTU),

More information

Dominant Voiced Speech Segregation Using Onset Offset Detection and IBM Based Segmentation

Dominant Voiced Speech Segregation Using Onset Offset Detection and IBM Based Segmentation Dominant Voiced Speech Segregation Using Onset Offset Detection and IBM Based Segmentation Shibani.H 1, Lekshmi M S 2 M. Tech Student, Ilahia college of Engineering and Technology, Muvattupuzha, Kerala,

More information

Perceptually Motivated Linear Prediction Cepstral Features for Network Speech Recognition

Perceptually Motivated Linear Prediction Cepstral Features for Network Speech Recognition Perceptually Motivated Linear Prediction Cepstral Features for Network Speech Recognition Aadel Alatwi, Stephen So, Kuldip K. Paliwal Signal Processing Laboratory Griffith University, Brisbane, QLD, 4111,

More information

Robust Speech Recognition. based on Spectro-Temporal Features

Robust Speech Recognition. based on Spectro-Temporal Features Carl von Ossietzky Universität Oldenburg Studiengang Diplom-Physik DIPLOMARBEIT Titel: Robust Speech Recognition based on Spectro-Temporal Features vorgelegt von: Bernd Meyer Betreuender Gutachter: Prof.

More information

Sparse coding of the modulation spectrum for noise-robust automatic speech recognition

Sparse coding of the modulation spectrum for noise-robust automatic speech recognition Ahmadi et al. EURASIP Journal on Audio, Speech, and Music Processing 24, 24:36 http://asmp.eurasipjournals.com/content/24//36 RESEARCH Open Access Sparse coding of the modulation spectrum for noise-robust

More information

Speech Enhancement in Presence of Noise using Spectral Subtraction and Wiener Filter

Speech Enhancement in Presence of Noise using Spectral Subtraction and Wiener Filter Speech Enhancement in Presence of Noise using Spectral Subtraction and Wiener Filter 1 Gupteswar Sahu, 2 D. Arun Kumar, 3 M. Bala Krishna and 4 Jami Venkata Suman Assistant Professor, Department of ECE,

More information

Power Function-Based Power Distribution Normalization Algorithm for Robust Speech Recognition

Power Function-Based Power Distribution Normalization Algorithm for Robust Speech Recognition Power Function-Based Power Distribution Normalization Algorithm for Robust Speech Recognition Chanwoo Kim 1 and Richard M. Stern Department of Electrical and Computer Engineering and Language Technologies

More information

PDF hosted at the Radboud Repository of the Radboud University Nijmegen

PDF hosted at the Radboud Repository of the Radboud University Nijmegen PDF hosted at the Radboud Repository of the Radboud University Nijmegen The following full text is an author's version which may differ from the publisher's version. For additional information about this

More information

Fei Chen and Philipos C. Loizou a) Department of Electrical Engineering, University of Texas at Dallas, Richardson, Texas 75083

Fei Chen and Philipos C. Loizou a) Department of Electrical Engineering, University of Texas at Dallas, Richardson, Texas 75083 Analysis of a simplified normalized covariance measure based on binary weighting functions for predicting the intelligibility of noise-suppressed speech Fei Chen and Philipos C. Loizou a) Department of

More information

Epoch Extraction From Emotional Speech

Epoch Extraction From Emotional Speech Epoch Extraction From al Speech D Govind and S R M Prasanna Department of Electronics and Electrical Engineering Indian Institute of Technology Guwahati Email:{dgovind,prasanna}@iitg.ernet.in Abstract

More information

Robust Speech Recognition Group Carnegie Mellon University. Telephone: Fax:

Robust Speech Recognition Group Carnegie Mellon University. Telephone: Fax: Robust Automatic Speech Recognition In the 21 st Century Richard Stern (with Alex Acero, Yu-Hsiang Chiu, Evandro Gouvêa, Chanwoo Kim, Kshitiz Kumar, Amir Moghimi, Pedro Moreno, Hyung-Min Park, Bhiksha

More information

Performance study of Text-independent Speaker identification system using MFCC & IMFCC for Telephone and Microphone Speeches

Performance study of Text-independent Speaker identification system using MFCC & IMFCC for Telephone and Microphone Speeches Performance study of Text-independent Speaker identification system using & I for Telephone and Microphone Speeches Ruchi Chaudhary, National Technical Research Organization Abstract: A state-of-the-art

More information

Robust Speaker Segmentation for Meetings: The ICSI-SRI Spring 2005 Diarization System

Robust Speaker Segmentation for Meetings: The ICSI-SRI Spring 2005 Diarization System Robust Speaker Segmentation for Meetings: The ICSI-SRI Spring 2005 Diarization System Xavier Anguera 1,2, Chuck Wooters 1, Barbara Peskin 1, and Mateu Aguiló 2,1 1 International Computer Science Institute,

More information

Evaluating robust features on Deep Neural Networks for speech recognition in noisy and channel mismatched conditions

Evaluating robust features on Deep Neural Networks for speech recognition in noisy and channel mismatched conditions INTERSPEECH 2014 Evaluating robust on Deep Neural Networks for speech recognition in noisy and channel mismatched conditions Vikramjit Mitra, Wen Wang, Horacio Franco, Yun Lei, Chris Bartels, Martin Graciarena

More information

A ROBUST FRONTEND FOR ASR: COMBINING DENOISING, NOISE MASKING AND FEATURE NORMALIZATION. Maarten Van Segbroeck and Shrikanth S.

A ROBUST FRONTEND FOR ASR: COMBINING DENOISING, NOISE MASKING AND FEATURE NORMALIZATION. Maarten Van Segbroeck and Shrikanth S. A ROBUST FRONTEND FOR ASR: COMBINING DENOISING, NOISE MASKING AND FEATURE NORMALIZATION Maarten Van Segbroeck and Shrikanth S. Narayanan Signal Analysis and Interpretation Lab, University of Southern California,

More information

Investigating Modulation Spectrogram Features for Deep Neural Network-based Automatic Speech Recognition

Investigating Modulation Spectrogram Features for Deep Neural Network-based Automatic Speech Recognition Investigating Modulation Spectrogram Features for Deep Neural Network-based Automatic Speech Recognition DeepakBabyand HugoVanhamme Department ESAT, KU Leuven, Belgium {Deepak.Baby, Hugo.Vanhamme}@esat.kuleuven.be

More information

Convolutional Neural Networks for Small-footprint Keyword Spotting

Convolutional Neural Networks for Small-footprint Keyword Spotting INTERSPEECH 2015 Convolutional Neural Networks for Small-footprint Keyword Spotting Tara N. Sainath, Carolina Parada Google, Inc. New York, NY, U.S.A {tsainath, carolinap}@google.com Abstract We explore

More information

Speech Synthesis using Mel-Cepstral Coefficient Feature

Speech Synthesis using Mel-Cepstral Coefficient Feature Speech Synthesis using Mel-Cepstral Coefficient Feature By Lu Wang Senior Thesis in Electrical Engineering University of Illinois at Urbana-Champaign Advisor: Professor Mark Hasegawa-Johnson May 2018 Abstract

More information

Channel Selection in the Short-time Modulation Domain for Distant Speech Recognition

Channel Selection in the Short-time Modulation Domain for Distant Speech Recognition Channel Selection in the Short-time Modulation Domain for Distant Speech Recognition Ivan Himawan 1, Petr Motlicek 1, Sridha Sridharan 2, David Dean 2, Dian Tjondronegoro 2 1 Idiap Research Institute,

More information

High-speed Noise Cancellation with Microphone Array

High-speed Noise Cancellation with Microphone Array Noise Cancellation a Posteriori Probability, Maximum Criteria Independent Component Analysis High-speed Noise Cancellation with Microphone Array We propose the use of a microphone array based on independent

More information

DWT and LPC based feature extraction methods for isolated word recognition

DWT and LPC based feature extraction methods for isolated word recognition RESEARCH Open Access DWT and LPC based feature extraction methods for isolated word recognition Navnath S Nehe 1* and Raghunath S Holambe 2 Abstract In this article, new feature extraction methods, which

More information

A Correlation-Maximization Denoising Filter Used as An Enhancement Frontend for Noise Robust Bird Call Classification

A Correlation-Maximization Denoising Filter Used as An Enhancement Frontend for Noise Robust Bird Call Classification A Correlation-Maximization Denoising Filter Used as An Enhancement Frontend for Noise Robust Bird Call Classification Wei Chu and Abeer Alwan Speech Processing and Auditory Perception Laboratory Department

More information

Non-intrusive intelligibility prediction for Mandarin speech in noise. Creative Commons: Attribution 3.0 Hong Kong License

Non-intrusive intelligibility prediction for Mandarin speech in noise. Creative Commons: Attribution 3.0 Hong Kong License Title Non-intrusive intelligibility prediction for Mandarin speech in noise Author(s) Chen, F; Guan, T Citation The 213 IEEE Region 1 Conference (TENCON 213), Xi'an, China, 22-25 October 213. In Conference

More information