IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 8, NOVEMBER 2011

Transcribing Mandarin Broadcast Speech Using Multi-Layer Perceptron Acoustic Features

Fabio Valente, Member, IEEE, Mathew Magimai Doss, Member, IEEE, Christian Plahl, Suman Ravuri, and Wen Wang, Member, IEEE

Abstract: Recently, several multi-layer perceptron (MLP)-based front-ends have been developed and used for Mandarin speech recognition, often showing significant complementary properties to conventional spectral features. Although widely used in multiple Mandarin systems, no systematic comparison of all the different approaches, or of their scalability, has been proposed. The novelty of this correspondence is mainly experimental. In this work, all the MLP front-ends recently developed at multiple sites are described and compared in a systematic manner on a 100-hour setup. The study covers the two main directions along which the MLP features have evolved: the use of different input representations to the MLP and the use of more complex MLP architectures beyond the three-layer perceptron. The results are analyzed in terms of confusion matrices, and the paper discusses a number of novel findings that the comparison reveals. Furthermore, the two best front-ends used in the GALE 2008 evaluation, referred to as MLP1 and MLP2, are studied in a more complex LVCSR system in order to investigate their scalability in terms of the amount of training data (from 100 hours to 1600 hours) and the parametric system complexity (maximum likelihood versus discriminative training, speaker adaptive training, lattice-level combination). Results on 5 hours of evaluation data from the GALE project reveal that the MLP features consistently produce improvements in the range of 15%–23% relative at the different steps of a multipass system when compared to mel-frequency cepstral coefficient (MFCC) and PLP features, suggesting that the improvements scale with the amount of data and with the complexity of the system. The integration of those features into the GALE 2008 evaluation system provides very competitive performance compared to other Mandarin systems.

Index Terms: Automatic speech recognition (ASR), broadcast data, GALE project, multi-layer perceptron (MLP), multi-stream, TANDEM features.

Manuscript received July 16, 2010; revised December 20, 2010; accepted March 20, 2011. Date of publication April 21, 2011; date of current version September 16, 2011. This work was supported by the Defense Advanced Research Projects Agency (DARPA) under Contract HR C. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Dimitra Vergyri. F. Valente and M. M. Doss are with the Idiap Research Institute, 1920 Martigny, Switzerland (e-mail: fabio.valente@idiap.ch; mathew@idiap.ch). C. Plahl is with the Computer Science Department, RWTH Aachen University, Aachen, Germany (e-mail: plahl@i6.informatik.rwth-aachen.de). S. Ravuri is with the International Computer Science Institute, Berkeley, CA, USA (e-mail: ravuri@icsi.berkeley.edu). W. Wang is with the Speech Technology and Research Laboratory, SRI International, Menlo Park, CA, USA (e-mail: wwang@speech.sri.com). Color versions of one or more of the figures in this paper are available online. Digital Object Identifier /TASL
I. INTRODUCTION

Recently, a growing number of large-vocabulary continuous speech recognition (LVCSR) systems make use of multi-layer perceptron (MLP) features. MLP features were originally introduced by Hermansky and his colleagues in [1], where the output of an MLP classifier is used as an acoustic front-end for conventional speech recognition systems based on hidden Markov models/Gaussian mixture models (HMMs/GMMs). A large number of studies have proposed different types of MLP-based front-ends (see [2]–[5]) and investigated their use for transcribing English (see [6] and [7]). The most common application is in concatenation with mel-frequency cepstral coefficient (MFCC) or perceptual linear predictive (PLP) features, where MLP features show considerable complementary properties. In recent years, in the framework of the GALE program, MLP features have been extensively used in ASR systems for the Mandarin and Arabic languages (see [5] and [8]–[11]). Since the original work [1], MLP front-ends have progressed along two main directions: 1) the use of different input representations to the MLP; 2) the use of complex MLP architectures beyond the conventional three-layer perceptron. The first category includes speech representations that aim at using long time spans of the speech signal, which could capture long-term phenomena (such as co-articulation) and are complementary to MFCC or PLP features [7]. Because of the large dimension of the signal time spans, a number of techniques for efficiently encoding this information have been proposed, such as MRASTA [4], DCT-TRAPS [12], and wLP-TRAPS [13]. The second category includes a heterogeneous set of techniques that aim at overcoming the pitfalls of the single MLP classifier. They are based on the probabilistic combination of MLP outputs obtained using different input representations. Those combinations can happen in a parallel fashion, as in the multi-stream approach [2], [14], or in a hierarchical fashion [15]. Furthermore, the probabilistic features generated by three-layer MLPs have also recently been replaced by the bottleneck features extracted by four-layer and five-layer MLPs [16]. While previous works, e.g., [9], have discussed the development of the Mandarin LVCSR systems that use those features, no exhaustive comparison and analysis of the different front-ends has been presented in the literature. Without such a side-by-side comparison, it is not possible to assess which one

of the recent advances actually produced improvements in the final system. This correspondence focuses on those recent advances in training, scaling, and integrating MLP front-ends for Mandarin transcription. The novelty of this work is mainly experimental and the correspondence provides two contributions. First, the various MLP-based front-ends recently developed at multiple sites are described and compared on a common experimental setup in a systematic way. The comparison covers all the MLP features used in GALE and is done using the same phoneme set, the same speech-silence segmentation, the same amount of training data, and the same number of free parameters. The study is done using a simplified version of the system described in [9] trained on 100 hours of Mandarin broadcast news and conversation recordings. The investigation covers MLP acoustic front-ends as stand-alone features and in concatenation with conventional MFCC features. To the best of our knowledge, this is the most exhaustive comparison of MLP front-ends for Mandarin speech recognition. The comparison reveals a number of novel facts about the different features and their use in LVCSR systems. The second contribution is a study of how the performance scales with the amount of training data (from 100 hours to 1600 hours of broadcast audio) and with the parametric model complexity of the system (including speaker adaptive training, lattice-level combination, and discriminative training). As before, the contrastive experiments are run with and without the MLP features to assess the maximum relative improvement that can be obtained. The remainder of the paper is organized as follows. Section II describes features obtained using three-layer MLPs with various input representations and Section III describes features obtained using modifications to the three-layer architecture. Section IV presents experiments with those features in a system trained on 100 hours and analyzes and discusses the results of the comparison. Section V presents experiments with a large-scale multipass evaluation system, and finally the paper is concluded in Section VI.

II. INPUT REPRESENTATION FOR THREE-LAYER MLP FEATURES

The simplest MLP feature extraction is based on the following steps. At first, a three-layer MLP classifier is trained in order to minimize the cross-entropy between its output and a set of phonetic labels. Such a classifier produces phoneme posterior probabilities conditioned on the input representation at a given time instant [17]. In order to exploit this representation in HMM/GMM models, phoneme posterior probabilities are first Gaussianized by applying a logarithm and then decorrelated using a principal component analysis (PCA) transform. After PCA, a dimensionality reduction accounting for 95% of the total variability is applied. The resulting feature vectors are used as conventional acoustic features in ASR systems. This framework is also known as TANDEM [1].
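The log/PCA post-processing just described reduces to a few array operations. Below is a minimal sketch, assuming the frame-level posteriors are already available as a NumPy array and using scikit-learn's PCA for the decorrelation and the 95% variance truncation; in practice the transform is estimated on the training data and then applied unchanged at test time. All array and variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def tandem_features(posteriors, var_kept=0.95, eps=1e-10):
    """Turn frame-level MLP phoneme posteriors into TANDEM features.

    posteriors: (n_frames, n_phones) array of posterior probabilities.
    Returns the decorrelated log-posteriors, keeping the PCA components
    that account for `var_kept` of the total variance.
    """
    # Gaussianize: the posteriors are heavily skewed, the logarithm makes
    # them better suited to diagonal-covariance GMM modeling.
    log_post = np.log(posteriors + eps)

    # Decorrelate with PCA and truncate so that 95% of the variance is kept
    # (a float n_components is interpreted as a variance fraction).
    pca = PCA(n_components=var_kept, svd_solver="full")
    feats = pca.fit_transform(log_post)
    return feats, pca

# Example: 1000 frames of 72-class posteriors (random stand-in data).
post = np.random.dirichlet(np.ones(72), size=1000)
feats, pca = tandem_features(post)
print(feats.shape)  # (1000, d), with d set by the 95% variance criterion
```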
The input to the MLP classifier can be conventional short-term features like PLP/MFCC or long-term features which aim at capturing the dynamic characteristics of the speech signal over large time spans. Let us briefly describe four different MLP inputs proposed and used for the transcription of Mandarin broadcasts.

A. TANDEM-PLP

In TANDEM-PLP features, the input to the MLP is represented by nine consecutive frames of PLP cepstral features. Mandarin is a tonal language; thus, the PLP vector is augmented with the smoothed log-pitch estimate plus its first- and second-order temporal derivatives as described in [18]. PLP features undergo vocal tract length normalization and speaker-level mean and variance normalization. The final dimension of this vector is 42; thus, the input to the MLP is a vector of size 9 × 42 = 378. TANDEM-PLP was the first MLP-based feature to be proposed and aims at using a few consecutive frames of short-term spectral features. Alternatively, the input to the MLP can be represented by critical-band temporal trajectories (up to half a second long) aiming at modeling long-time patterns of the speech signal (also known as Temporal Patterns or TRAPS [19]). The dimensionality of TRAPS is quite large: considering, for instance, 500-ms trajectories (51 frames) in a 19 critical-band spectrogram would produce a vector of dimension 19 × 51 = 969. Several methods have been considered for efficiently encoding this information while reducing the dimension; they are briefly reviewed in the following.

B. Multiple RASTA

Multiple RASTA (MRASTA) filtering [4] is an extension of RASTA filtering and aims at using long signal time spans at the input of the MLP. The model is consistent with studies on human perception of modulation frequencies, modeled using a bank of filters equally spaced on a logarithmic scale [20]. This bank of filters subdivides the available modulation frequency range into separate channels with a decreasing resolution moving from slow to fast modulations. Feature extraction is composed of the following parts. A 19-band critical-band auditory spectrum is extracted from the short-time Fourier transform of the signal every 10 ms. The 600-ms long temporal trajectory in each critical band is then filtered with a bank of bandpass filters. Those filters are first derivatives (1) and second derivatives (2) of Gaussian functions, with the width $\sigma$ varying in the range 8–60 ms:

$$g^{1}_{\sigma}(t) = -\frac{t}{\sigma^{2}} \exp\left(-\frac{t^{2}}{2\sigma^{2}}\right) \qquad (1)$$

$$g^{2}_{\sigma}(t) = \left(\frac{t^{2}}{\sigma^{4}} - \frac{1}{\sigma^{2}}\right) \exp\left(-\frac{t^{2}}{2\sigma^{2}}\right), \quad \text{with } \sigma \in [8, 60]\ \text{ms}. \qquad (2)$$

In effect, the MRASTA filters are multi-resolution bandpass filters on modulation frequency, dividing the available modulation frequency range into its individual sub-bands. In the modulation frequency domain, they correspond to a filter-bank with equally spaced filters on a logarithmic scale. Identical filters are used for all critical bands. Thus, they provide a multiple-resolution representation of the time-frequency plane. After MRASTA filtering, frequency derivatives across three consecutive critical bands are introduced. The filter outputs and their frequency derivatives are concatenated to form the input of a three-layer MLP. Unlike in [4], filter-banks G1 and G2 are composed of six filters rather than eight, leaving out the two filters with the longest impulse responses.
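As an illustration of (1) and (2), the sketch below builds the two MRASTA filter-banks as first and second derivatives of Gaussians and applies them to a single critical-band trajectory. The 10-ms frame step and the 8–60 ms range of σ follow the text above; the logarithmic spacing of the six σ values and the exact window length are assumptions made for the example and may differ from the filters used in [4].

```python
import numpy as np

FRAME_MS = 10                                # frame step of the critical-band spectrogram
SIGMAS_MS = np.geomspace(8.0, 60.0, num=6)   # six widths spanning 8-60 ms (assumed log-spaced)

def mrasta_filters(sigmas_ms=SIGMAS_MS, half_len=30):
    """Return (G1, G2): first/second derivative-of-Gaussian filter banks.

    Each row is one FIR filter over a 2*half_len+1 frame window
    (roughly 600 ms at a 10-ms step), zero-mean and unit-energy.
    """
    t = np.arange(-half_len, half_len + 1) * FRAME_MS   # time axis in ms
    g1, g2 = [], []
    for s in sigmas_ms:
        gauss = np.exp(-0.5 * (t / s) ** 2)
        d1 = -t / s ** 2 * gauss                         # first derivative of the Gaussian
        d2 = (t ** 2 / s ** 4 - 1.0 / s ** 2) * gauss    # second derivative
        for f, bank in ((d1, g1), (d2, g2)):
            f = f - f.mean()                             # enforce zero mean (band-pass behavior)
            bank.append(f / np.linalg.norm(f))
    return np.array(g1), np.array(g2)

G1, G2 = mrasta_filters()
trajectory = np.random.randn(61)                         # stand-in 600-ms log-energy trajectory
responses = np.concatenate([G1 @ trajectory, G2 @ trajectory])
print(responses.shape)   # (12,) filter outputs for this critical band at the current frame
```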

TABLE I: DIFFERENCES BETWEEN THE THREE INPUT REPRESENTATIONS THAT USE LONG TEMPORAL TIME SPANS

C. DCT-TRAPS

The DCT-TRAPS aim at reducing the dimension of the trajectories using a discrete cosine transform (DCT). As described in [12], the results obtained using the DCT basis are very similar to those obtained using a principal component analysis. The critical-band auditory spectrum is extracted from the short-time Fourier transform of the signal every 10 ms. Then, 500-ms long energy trajectories are extracted for each of the 19 critical bands that compose the spectrogram. Those are projected onto the first 16 coefficients of a DCT, resulting in a vector of size 19 × 16 = 304 used as input to the MLP. In contrast to MRASTA, they do not emulate any sensitivity of the hearing properties to the different modulation frequencies.

D. wLP-TRAPS

A third alternative for extracting information from long signal time spans is represented by the wLP-TRAPS [13]. In contrast to the previous front-ends, the process does not use the short-term spectrum and thus potentially provides more complementarity to the MFCC features. Those features are obtained by warping the temporal axis after the LP-TRAP feature calculation [21]. The feature extraction is composed of the following steps: at first, linear prediction is used to model the Hilbert envelopes of prewarped 500-ms long energy trajectories in auditory-like frequency sub-bands. The warping ensures that more emphasis is given to the center of the trajectories compared to the borders [13], thus again emulating human perception. The 25 LPC coefficients in each of the 19 frequency bands are then used as input to the MLP, producing a feature vector of dimension 25 × 19 = 475.

All three representations described in Sections II-B to II-D aim at using long temporal time spans; however, they differ from each other in a number of implementation choices, such as the use of the short-time power spectrum, the use of zero-mean filters, and the warping of the time axis. Those differences are summarized in Table I. As Mandarin is a tonal language, those representations can be augmented with the smoothed log-pitch estimate obtained as described in [18] and with the values of the critical-band energies (19 features per frame). In the following, we will refer to them as Augmented features.
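As a concrete illustration of the DCT-TRAPS encoding of Section II-C, the sketch below projects the 51-frame (500-ms) trajectory of each of the 19 critical bands onto its first 16 DCT coefficients, giving the 19 × 16 = 304-dimensional MLP input. Array names and the use of SciPy's DCT are illustrative, not the original implementation.

```python
import numpy as np
from scipy.fft import dct

N_BANDS, CONTEXT, N_DCT = 19, 51, 16    # 19 bands, 500 ms at a 10-ms step, 16 coefficients

def dct_traps(log_spec, center, n_dct=N_DCT):
    """Encode the trajectory around frame `center` of each critical band
    with its first `n_dct` DCT coefficients.

    log_spec: (n_frames, N_BANDS) log critical-band spectrogram.
    Returns a vector of size N_BANDS * n_dct (304 here).
    """
    half = CONTEXT // 2
    window = log_spec[center - half:center + half + 1]           # (51, 19)
    # DCT along the time axis of every band, keep the low-order terms.
    coeffs = dct(window, type=2, norm="ortho", axis=0)[:n_dct]   # (16, 19)
    return coeffs.T.reshape(-1)                                  # (304,)

spec = np.random.randn(300, N_BANDS)     # stand-in spectrogram (300 frames)
vec = dct_traps(spec, center=150)
print(vec.shape)                         # (304,)
```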
III. MLP ARCHITECTURES

The second direction along which the front-ends have evolved is the use of more complex architectures that overcome limitations of the three-layer MLP in different ways. Most of them are based on the combination of several MLP outputs trained using different input representations. This combination can happen in a parallel or in a hierarchical fashion. Again, no side-by-side comparison of these architectures has been presented in the literature. The following paragraphs briefly describe these front-ends as used in LVCSR systems.

A. Hidden Activation TRAPS (HATS)

HATS feature extraction is based on observations on human speech recognition [22], which conjecture that humans recognize speech independently in each critical band and that a final decision is obtained by recombining those estimates. HATS aims at using information extracted from long time spans of critical-band energies, which are fed into a set of independent classifiers instead of a single MLP classifier. At first, a 19-band critical-band auditory spectrum is extracted from the short-time Fourier transform of the signal every 10 ms. After that, HATS [2] feature extraction is composed of two steps.
1) In the first stage, an independent MLP for each of the 19 critical bands is trained to classify phonemes. The input to each of the MLPs is a 500-ms-long log critical-band energy trajectory (i.e., a 51-dimensional input). The input undergoes utterance-level mean and variance normalization.
2) In the second stage, a merger MLP is trained using the hidden activations obtained from the 19 MLPs of the first stage. The merger classifier aims at obtaining a single phoneme posterior estimate out of the independent estimates coming from each critical band.
Phoneme posteriors obtained from the merger MLP are then transformed and used as features. The rationale behind this architecture is that corruptions in particular critical bands should affect the final recognition results less.

B. Multi-Stream

The outputs of MLPs are posterior probabilities of phonetic targets that can be combined into a single estimate using probabilistic rules. This approach is typically referred to as multi-stream and was introduced in [14]. The rationale behind it is that MLPs trained using different input representations will perform differently in different conditions. To take advantage of both representations, the combination rule should be able to dynamically select the best posterior stream. Typical combination rules weight the posterior probabilities using a function of the output entropy (see [23] and [24]). Posteriors obtained from TANDEM-PLP (short signal time spans) and HATS (long signal time spans) are combined using the Dempster-Shafer method [24] and used as features after a log/PCA transform. Multi-stream combination comes at the obvious cost of doubling the total number of parameters in the system.
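The combination actually used here is the Dempster-Shafer rule of [24]; as a simpler stand-in, the sketch below implements the inverse-entropy weighting of [23], one of the entropy-based rules mentioned above, in which the stream with the sharper (lower-entropy) posterior distribution receives the larger weight at each frame. All names are illustrative.

```python
import numpy as np

def entropy(p, eps=1e-10):
    """Per-frame entropy of a (n_frames, n_classes) posterior matrix."""
    return -np.sum(p * np.log(p + eps), axis=1)

def inverse_entropy_combination(post_a, post_b, eps=1e-10):
    """Frame-wise weighted average of two posterior streams.

    Each stream is weighted by the inverse of its output entropy, so the
    more confident stream dominates that frame.
    """
    inv_a = 1.0 / (entropy(post_a) + eps)
    inv_b = 1.0 / (entropy(post_b) + eps)
    w_a = inv_a / (inv_a + inv_b)                          # (n_frames,)
    combined = w_a[:, None] * post_a + (1.0 - w_a)[:, None] * post_b
    return combined / combined.sum(axis=1, keepdims=True)

# Two stand-in streams: TANDEM-PLP-like and HATS-like posteriors over 72 tonemes.
p_plp = np.random.dirichlet(np.ones(72), size=500)
p_hats = np.random.dirichlet(np.ones(72), size=500)
p_comb = inverse_entropy_combination(p_plp, p_hats)        # then log/PCA as for any TANDEM feature
```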

C. Hierarchical Processing

While multi-stream approaches combine MLP outputs in parallel, studies on English and Mandarin data [15], [25] showed that the most effective way of combining classifiers trained on separate ranges of modulation frequencies, i.e., on different temporal spans, is based on hierarchical (sequential) processing. The hierarchical processing is based on the following steps. The MRASTA filters cover the whole range of modulation frequencies. The filter-banks G1 and G2 (six filters each) are split into two separate filter-banks, G1-High/G2-High and G1-Low/G2-Low, which filter fast and slow modulation frequencies, respectively. G-High and G-Low are defined as

$$\text{G-High} = \{\text{G1-High}, \text{G2-High}\} \qquad (3)$$

$$\text{G-Low} = \{\text{G1-Low}, \text{G2-Low}\}. \qquad (4)$$

Filters G1-High and G2-High are short filters and they process high modulation frequencies. Filters G1-Low and G2-Low are long filters and they process low modulation frequencies. The cutoff frequency for both filter-banks G-High and G-Low is approximately 10 Hz. The output of the MRASTA filtering is processed according to a hierarchy of MLPs progressively moving from high to low modulation frequencies (i.e., from short to long temporal contexts). The rationale behind this processing is that the errors produced by the first MLP can be corrected by a second one using the estimates from the first MLP together with the evidence from another range of modulation frequencies. The first MLP is trained on the first feature stream, represented by the output of filter-bank G-High, which extracts high modulation frequencies. This MLP estimates the first set of phoneme posterior probabilities. These posteriors are transformed by a log/PCA transform and then concatenated with the second feature stream, thus forming the input to the second phoneme posterior-estimating MLP. In this way, phoneme estimates from the first MLP are modified by the second net using evidence from a different feature stream. This process is depicted in Fig. 1.

Fig. 1. Proposed scheme for the MLP-based feature extraction as used in the GALE 2008 evaluation. The auditory spectrum is filtered with a set of multiple-resolution filters that extract fast modulation frequencies. The resulting vector is concatenated with short-term critical-band energy and pitch estimates and is used as input to the first MLP, which estimates phoneme posterior distributions. The output of the first MLP is then concatenated with features obtained using slow modulation frequencies, short-term critical-band energy, and pitch estimates, and is used as input to the second MLP.

D. Bottleneck Features

Bottleneck features are recently introduced non-probabilistic MLP features [16]. The conventional three-layer MLP is replaced with a four- or five-layer MLP where the first layer is the input features and the last layer is the phonetic targets. As discussed in [26], the five-layer architecture provides slightly better performance than the four-layer one. The size of the second layer is large in order to provide enough modeling power, the size of the third one is small, typically equal to the desired feature dimension, while the size of the fourth one is approximately half that of the second layer [26]. Instead of using the output of the MLP, features are obtained from the linear activations of the third layer. Bottleneck features do not require a dimensionality reduction, as the desired dimension can be obtained by fixing the size of the bottleneck layer. Furthermore, the linear activations are already Gaussian distributed and thus do not require any log transform. The most common inputs to the non-probabilistic bottleneck features are long-term features such as the DCT-TRAPS and wLP-TRAPS described in Sections II-C and II-D.
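A minimal sketch of the five-layer bottleneck topology just described, written with PyTorch purely for illustration (the systems in this paper were trained with ICSI QuickNet); the hidden-layer size of 2000 is an assumption, while the bottleneck size of 35 and the 72 toneme targets correspond to the experiments reported below. Cross-entropy training against the phonetic targets is omitted.

```python
import torch
import torch.nn as nn

class BottleneckMLP(nn.Module):
    """Five-layer bottleneck MLP; features are read at the third, linear layer."""

    def __init__(self, in_dim, n_targets=72, hidden=2000, bottleneck=35):
        super().__init__()
        self.pre = nn.Sequential(                         # input -> large hidden layer
            nn.Linear(in_dim, hidden), nn.Sigmoid())
        self.bottleneck = nn.Linear(hidden, bottleneck)   # small linear layer, no squashing
        self.post = nn.Sequential(                        # bottleneck -> half-size hidden -> targets
            nn.Sigmoid(),
            nn.Linear(bottleneck, hidden // 2), nn.Sigmoid(),
            nn.Linear(hidden // 2, n_targets))

    def forward(self, x):
        z = self.bottleneck(self.pre(x))                  # linear bottleneck activations = the features
        return self.post(z), z                            # (logits for training, bottleneck features)

net = BottleneckMLP(in_dim=475)                           # e.g. a wLP-TRAPS input
x = torch.randn(8, 475)
logits, feats = net(x)
print(feats.shape)                                        # torch.Size([8, 35])
```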
IV. SMALL SCALE EXPERIMENTS

The following preliminary experiments are based on the large-vocabulary ASR system for transcribing Mandarin broadcast speech described in [9], developed by SRI/UW/ICSI for the GALE project. The recognition is performed using the SRI Decipher recognizer and results are reported in terms of character error rate (CER). The training is done using approximately 100 hours of manually transcribed broadcast news and conversation data, including speaker labels. Results are reported on the GALE 2006 evaluation data, simply referred to as eval06 in the following. The baseline system uses 13 standard MFCCs plus first- and second-order temporal derivatives. Vocal tract length normalization (VTLN) and speaker-level mean-variance normalization are applied. Mandarin is a tonal language; thus, the MFCC vector is augmented with the smoothed log-pitch estimate plus its first- and second-order temporal derivatives as described in [18], resulting in a feature vector of dimension 42. In the following, we will refer to this system simply as the MFCC baseline. The training is based on conventional maximum-likelihood estimation. The acoustic models are composed of within-word triphone HMMs, and a 32-component diagonal-covariance GMM is used for modeling the acoustic emission probabilities. Parameters are shared across different triphones according to a phonetic decision tree. Recognition networks are compiled from trigram language models trained on over one billion words, with a 60 K vocabulary lexicon [9]. The decoding phase consists of two decoding passes, a speaker independent (SI) decoding followed by a speaker adapted (SA) decoding. Speaker adaptation is

5 VALENTE et al.: TRANSCRIBING MANDARIN BROADCAST SPEECH USING MLP ACOUSTIC FEATURES 2443 TABLE II BASELINE SYSTEM PERFORMANCE ON THE eval06 DATA TABLE III TANDEM-9FRAMESPLP PERFORMANCE ON THE eval06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES ALONE AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES TABLE IV MLP FEATURES MAKING USE OF LONG TIME SPANS OF THE SIGNAL AS INPUT. PERFORMANCE IS REPORTED ON THE eval06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES AS STAND ALONE FEATURES AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES done using a one-class constrained maximum-likelihood linear regression (CMLLR) followed by three-class MLLR. Performance of this baseline system on the eval06 data is reported in Table II for both speaker independent (SI) and speaker adapted (SA) models. In this set of experiments, three-layer MLPs are trained on all the available 100-hour acoustic model training data. The Mandarin toneme set is composed of 72 elements. The training is done using the ICSI Quicknet Software. 3 A. MLP Features This section discusses experiments with features obtained using three-layer MLP architectures with different input representations. Unless it is explicitly mentioned otherwise, the total number of parameters in the different MLP architectures is equalized to approximately one million parameters in order to assure a fair comparison between the different approaches. The size of the input layer equals to the feature dimension, the size of the output layer equals to the number of phonetic targets (72) and the size of the hidden layer is modified so that the total number of parameters equals to one million. After PCA, a dimensionality reduction accounting for 95% of the total variability is applied. The resulting feature vectors has dimension 35 for all the different MLP features. The investigation was carried out with MLP features as stand-alone front-end and in concatenation with spectral features, i.e., MFCC. Results are reported in terms of character error rate (CER) on the eval06 data as described in the next section. Let us first consider the TANDEM-PLP features described in Section II-A. Performances of those features are reported in Table III as well as the relative improvements with respect to the MFCC baseline with and without speaker adaptation. When used as stand-alone features, TANDEM-PLP does not outperform the baseline, whereas a relative improvement of is obtained when they are used in concatenation with MFCC. After speaker adaptation, the relative improvement drops slightly by 2%, still a 14% relative improvement over the MFCC baseline. Let us now consider the use of MLP features obtained using long time spans of the speech signal as described in Sections II-B II-D. Table IV shows that these features perform quite poorly as stand alone features, whereas they can provide improvements around 10% relative in concatenation with the MFCC features. As a stand-alone front-end, the wlp-traps 3 TABLE V MLP FEATURES MAKING USE OF LONG TIME SPANS OF THE SIGNAL AS INPUT AUGMENTED WITH CRITICAL BAND ENERGY AND LOG-PITCH. PERFORMANCE IS REPORTED ON THE eval06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES AS STAND ALONE FEATURES AND IN CONCATENATION WITH MFCC. 
THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES outperforms the other two; whereas, in concatenation with spectral features and after adaptation, the three representations are comparable. Their performances are however inferior to the conventional TANDEM 9frames PLP. The performances of these features augmented with the values of the critical band energy (19 features per frame) and the smoothed log-pitch estimates are reported in Table V. Augmenting the long term features produces consistent improvements in all the cases and brings the performances of these front-ends to the same level of the TANDEM-PLP when tested in concatenation with MFCC. As before, the relative improvements are always reduced after speaker adaptation. In concatenation with spectral features, the three input representations have similar performances. In summary, MLP front-ends obtained using a three-layer MLP with different input representations do not outperform the conventional MFCC as stand alone features. On the other hand, they produce relative improvements in the range of 10% 14% when used in concatenation with spectral features. TANDEM-PLP front-end outperforms the other long term features. The various coding schemes, MRASTA, DCT-TRAPS, and wlp-traps, give similarly poor results as stand-alone features and similar improvements (approximately 11%) when used in concatenation with spectral features. Augmenting the long term input with a vector of short term energy and pitch brings the performances close to those of the TANDEM-PLP features. The relative improvements after speaker adaptation are generally reduced by 2% with respect to the speaker independent systems. This is consistent with what has already been verified on English ASR experiments [27].
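The parameter equalization used throughout this section (input and output layer sizes fixed by the feature dimension and the 72 phonetic targets, hidden layer sized so that the network holds roughly one million weights) reduces to a one-line computation; a small sketch, with the bias terms included and the TANDEM-PLP input dimension used as an example:

```python
def hidden_size(in_dim, out_dim=72, budget=1_000_000):
    """Hidden-layer size of a three-layer MLP with about `budget` parameters.

    Parameter count: in_dim*h + h*out_dim weights plus h + out_dim biases.
    """
    return round((budget - out_dim) / (in_dim + out_dim + 1))

print(hidden_size(378))   # TANDEM-PLP input (9 x 42): about 2200 hidden units
```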

6 2444 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 8, NOVEMBER 2011 TABLE VI HATS PERFORMANCE ON THE eval06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES ALONE AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES TABLE VIII HIERARCHICAL FEATURE PERFORMANCE ON THE eval06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES ALONE AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES TABLE VII MULTI-STREAM MLP FEATURE PERFORMANCE ON THE eval06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES ALONE AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES TABLE IX BOTTLENECK FEATURES PERFORMANCE ON THE eval06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES ALONE. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES B. MLP Architectures This section discusses experiments with the different MLP architectures; the input signal representations are similar to those used in the previous section while the information is exploited differently when changing the MLP architectures. The results obtained using these methods are compared with their counterparts based on the three-layer MLPs. 1) Hidden Activation TRAPS: HATS aims at using information extracted from long time spans of critical band energies, but the recognition is done independently in each critical band using 19 independent MLPs. The final posterior estimates are obtained by merging all these estimates (see subsection Section II-A). Results with HATS features are reported in Table VI. As stand-alone features, HATS performs significantly worse than MFCC; whereas a relative improvement is obtained when used in concatenation with MFCC. Comparing Tables IV and VI, it is noticeable that this approach is marginally better than those that use long-term features into a single MLP. 2) Multi-Stream MLP Features: Table VII reports the performance of the multi-stream front-end that combines information from TANDEM-PLP (short time spans of signal) and HATS (long time spans of signal). These features outperform the MFCC by 10% relative when used stand-alone and by 16% relative in concatenation with MFCC. Those numbers must be compared to the performances of the individual streams of TANDEM-PLP (Table III) and HATS (Table VI). The combination provides a large improvement in case of stand-alone features (TANDEM-PLP 25.5%, HATS 29.1%, Multistream 23.1%); however, the improvements are smaller when used in concatenation with MFCC (TANDEM-PLP 22.1%, HATS 22.7%, Multistream 21.7%). This can be easily explained considering the fact that when used in concatenation with the MFCC, the feature vector contains twice the spectral information, through the MFCC and trough the TANDEM features. 3) Hierarchical Processing: Next, we discuss experiments with the hierarchical processing described in Section III-C. Results are reported in Table VIII in cases of both MRASTA and Augmented MRASTA inputs (processing is depicted in Fig. 1). TABLE X AUGMENTED BOTTLENECK FEATURE PERFORMANCE ON THE eval06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES ALONE. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES Comparing Table VIII with Tables IV and V, it is noticeable that the hierarchical approach produces considerable improvements with respect to the single classifier approach both with and without MFCC features. 
It is important to notice that the total number of parameters is kept constant; thus, the improvements are produced from the sequential architecture where short signal time spans are used first and then integrated with the longer ones. 4) Bottleneck Features: Tables IX and X report the performances of the bottleneck features obtained using different long term inputs (MRASTA, DCT-TRAPS, and wlp-traps) and their augmented versions. The dimension of the bottleneck is fixed to 35 in order to compare with other probabilistic MLP features. Results reveal that bottleneck features always outperform their probabilistic counterparts obtained using the three-layer MLP. This is verified on all the different input features and their augmented versions. For comparison purposes, Table XI also reports the performance of Bottleneck features when the input to the MLP is

7 VALENTE et al.: TRANSCRIBING MANDARIN BROADCAST SPEECH USING MLP ACOUSTIC FEATURES 2445 Fig. 2. RWTH evaluation system composed of two subsystems trained on MFCC and PLP features. The two subsystems consist of ML training followed by SAT/CMLLR training. The lattice outputs from the subsystems are combined in the end. TABLE XI BOTTLENECK FEATURES PERFORMANCE ON eval06 DATA WHEN 9frames PLP AND PITCH INPUT IS USED. RESULTS ARE REPORTED WITH MLP FEATURES ALONE. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES 9frames PLP features augmented with pitch: In summary, replacing the three-layer MLP with a more complex MLP structure (while keeping constant the number of total parameters) produces a reduction in the error both with and without concatenation of spectral features. The multi-stream approach that combines in parallel MLPs trained on long and short speech temporal features produce the lowest CER as stand-alone front-end (16% relative CER reduction compared to the MFCC). On the other hand, hierarchical and bottleneck structures that go beyond the three-layer appear to produce the highest complementarity to MFCC, producing an improvement of 17% 18% relative when used in concatenation. The reasons of these effects are investigated in the next section where the front-ends are compared in terms of phonetic confusions. C. Analysis of Results In order to understand the differences between the various MLP front-ends, let us now analyze the errors they produced in terms of phonetic targets. Table XII reports the phonetic set composed of 72 tonemes used for training the MLP. The set is sub-divided into six broad phonetic classes for analysis purposes. The numbers beside the vowels represent the tonal accents. The frame-level accuracy of a three-layer MLP trained using 9frames-PLP features in classifying the phonetic targets is 69.8%. Fig. 3 plots the per-class accuracy. Let us now consider the accuracies of the three-layer MLPs trained using long-term input representations, i.e., the MRASTA, DCT-TRAPS and wlp-traps. They are respectively 64%, 62.9%, and 65.2%, which are worse than the accuracy from the 9frame-PLP. The HATS features that are based on long-term critical band trajectories have a similar frame-level accuracy, i.e., 65.7%. While the overall performance of MLP trained on spectral features is superior to MLP trained on long time spans of speech signals, the latter appears to perform better on some phonetic classes. Fig. 3 plots the accuracy of recognizing each of the phonetic classes for HATS. It is noticeable that in spite of an overall inferior performance, the HATS outperforms the TANDEM-PLP on almost all the stop consonants p, t, k. b, d, and the affricative ch. Stop consonants are short Fig. 3. Phonetic-class accuracy obtained by the TANDEM-9framesPLP and HATS. The former outperforms the latter on most of the classes apart from stops and affricatives. TABLE XII PHONETIC SET USED TO TRAIN THE DIFFERENT MLPS DIVIDED INTO BROAD PHONETIC CLASSES. AS MANDARIN IS A TONAL LANGUAGE, THE NUMBER BESIDE THE VOWELS DESIGNATES THE ACCENT OF THE DIFFERENT TONEMES sounds characterized by burst of acoustic energy following a short period of silence and are known to be prone to strong co-articulation from the following vowel. 
Studies like [28] have shown that stop consonant recognition can be largely improved considering information from the following vowel; this explains why using longer speech time spans produces higher recognition performance compared to conventional short term spectral features. Also, the affricative ch (composed of a plosive and a fricative) is confused with the fricatives zh and s by the short term features while this confusion is significantly reduced by the other long term features. Vowels and other consonants are still better recognized from the short term features. Those facts are verified on all the MLP front-ends that use long temporal inputs (MRASTA,DCT-TRAPS and wlp-traps) as well as the HATS. In summary, training MLPs using short-term spectral input outperforms training using long term temporal input on most of the phonetic classes apart a few of them including the plosives and affricatives. Let us now consider the multi-stream approach which dynamically weights the posterior estimates from the 9frames-PLP and

8 2446 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 8, NOVEMBER 2011 HATS according to the confidence of the MLP. The frame accuracy becomes 73% and the phoneme-level confusion shows that performances are never inferior to the best of the two streams that compose the combination. In other words, the combined distribution appears to perform as the HATS on the stop consonants and affricatives and as the 9frame-PLP on the remaining phonemes. Those results translate into a significant reduction on the CER, never worse than those obtained using the individual MLP features (see experiments in Section IV-B2). The hierarchical approach described in Section III-C is based on a completely different idea. This method uses an initial MLP trained using energy trajectories filtered with short temporal filters (G1-High and G2-High). This provides an initial posterior estimation then fed into the second MLP concatenated with energy trajectories filtered with long temporal filters (G1- Low, G2-Low). The second MLP re-estimates the phonetic posteriors obtained from the first MLP using information from a longer temporal context. The hierarchical framework achieves a frame accuracy of 72%. Interestingly this is done without using any spectral feature (MFCC or PLP) and keeping constant the number of parameters; only the architecture of the MLP is changed where the temporal context of the input features is increased sequentially. In other words, the first MLP trained on short temporal context is effective on most of the phonetic classes apart stops and affricatives. Those estimates are then corrected from the second MLP using the information from longer temporal context. Fig. 4 plots the phonetic class accuracy obtained by the three-layer MLP trained using the MRASTA input and the hierarchical approach. It is noticeable that the hierarchical approach outperforms training using the MRASTA on all the targets. Recognition results show that the hierarchical approach (where the processing moves from short to long temporal contexts) reduces the CER with respect to the single MLP features (where the different time spans are processed using the same MLP). Augmenting the input with pitch estimates and energy further reduces the CER. Another interesting finding is the fact that as stand-alone features, the multi-stream approach has the lowest CER, while in concatenation with MFCC, the augmented hierarchical approach produces the largest CER reduction (compare Tables VIII and VII). This effect can be explained by the fact that the multi-stream approach makes use of spectral information (through the 9frame PLP). This information produces a frame accuracy of 73% but does not appear complementary to the MFCC features as they both represent spectral information. On the other hand, the hierarchical approach achieves a frame error rate of 72% without the use of any spectral features and appears more complementary when used in concatenation with the MFCC. Results from the bottleneck features cannot be analyzed in a similar way, as these are non-probabilistic features without any explicit mapping to a phonetic target. However, recognition results in Tables IX and X show that replacing the three-layer MLP with the bottleneck architectures reduces the CER for all the different input representations (MRASTA,DCT-TRAPS,wLP- TRAPS). Bottleneck and hierarchical approaches produce similar improvements in concatenation with MFCC features. Fig. 4. 
Phonetic-class accuracy obtained by the MRASTA and the Hierarchical MRASTA. The latter improves the performance on all the phonetic targets without the use of any spectral information. TABLE XIII ACOUSTIC DATA FOR TRAINING AND TESTING TABLE XIV PERFORMANCES OF BASELINE SYSTEMS USING MFCC OR PLP FEATURES V. LARGE SCALE EXPERIMENTS Contrastive experiments in literature are typically reported with small setups like the one presented so far. However, the GALE evaluation systems are trained on a much larger amount of data, make uses of multi-pass training and are composed of a number of individual sub-systems. In order to study how the previous results generalize on more complex LVCSR systems and a large amount of training data, the experiments are extended using a highly accurate automatic speech recognizer for continuous Mandarin speech trained on 1600 hours of data collected by LDC (GALE releases P1R1-4, P2R1-2, P3R1-2, P4R1). The training transcripts were preprocessed and the audio data were segmented into waveforms based on sentence boundaries defined in the manual transcripts. Both were provided by UW-SRI as described in [9]. This comparison will cover the Multi-stream approach and the hierarchical MRASTA front-ends, which will be simply referred as MLP1 and MLP2 in the remainder of this paper. These two features have been used in the GALE 2008 Mandarin evaluation. The 1600 hours data are used for training the HMM/GMM systems as well as the MLP front-ends. The evaluation is done on the GALE 2007 development test set (dev07) which is used for tuning hyper-parameters, the GALE 2008 development test set (dev08) and the sequestered data of the GALE 2007 evaluation (eval07-seq), for a total amount of 5 hours of data. Statistics of the different test sets are summarized in Table XIII. The number of parameters in the MLP architectures is increased to

9 VALENTE et al.: TRANSCRIBING MANDARIN BROADCAST SPEECH USING MLP ACOUSTIC FEATURES 2447 TABLE XV SUMMARY OF FEATURE PERFORMANCES ON GALE dev07/dev08/seq-eval07 TEST SETS. RESULTS ARE REPORTED WITH MLP FEATURES ALONE AND IN CONCATENATION WITH MFCC OR PLP. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE MFCC AND PLP BASELINES IS REPORTED IN PARENTHESES five millions parameters for the large scale setup. The training of MLP1 and MLP2 networks took approximately five weeks on an eight-core machine (AMD Opteron(tm) Dual Core 2192 MHz 2 4-core CPUs). MLP1 networks have been trained at ICSI and MLP2 networks have been trained at IDIAP. On the other hand, the generation of the features is quite fast, approximately 0.09xRT on a single CPU. The RWTH evaluation system is composed of two subsystems which only differ for their acoustic front-ends. The acoustic front-ends of the subsystems consist of conventional MFCCs and PLPs augmented with the log-pitch estimates [18]. The filter banks underlying the MFCC and PLP feature extraction undergo VTLN. After that, features are mean and variance normalized and they are fed into a sliding window of length nine. All feature vectors within the sliding window are concatenated and projected to a 45-dimensional feature space using a linear discriminative analysis (LDA). The system uses a word-based pronunciation dictionary described in [9] that maps words to phoneme sequences, while the phoneme carries the tone information, which is usually referred to as a toneme. The acoustic models for all systems are based on triphones with cross-word context, modelled by a three-state left-to-right HMM. A decision tree based state tying is applied, resulting in a total of 4500 generalized triphone states. The acoustic models consist of Gaussian mixture distributions with a globally pooled diagonal covariance matrix. The first pass consists of maximum-likelihood training. We will refer to this system as an SI system. The second pass consists of speaker adaptive training (SAT). Furthermore, during decoding, maximum likelihood linear regression is applied to means for performing speaker adaptation. We will refer to this system as an SA system. Finally, the outputs of the different subsystems are combined at the lattice level using the min.fwer combination method described in [29]. The min.fwer method has been shown to outperform other lattice combination methods as ROVER or Confusion Network Combination (CNC) [29]. Fig. 2 schematically depict the RWTH evaluation system. The language model (LM) used in this work is kindly provided by SRI and UW. The vocabulary size is 60 K. Experimental results with the full LM are reported only in the TABLE XVI SYSTEM COMBINATION OF MFCC AND PLP SUBSYSTEMS DESIGNATED WITH 8. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE MFCC 8 PLP BASELINE IS REPORTED IN PARENTHESES system combination, while a pruned version is applied in all other recognition steps. Table XIV reports the CER for the speaker independent and the speaker adapted subsystems trained using MFCC and PLP features only. The error rate is in the range of 12.5% 14.5% for the different test sets. Let us now consider the integration of the MLP1 and MLP2 front-ends. Table XV report the performance of the subsystems when they are trained using MLP1 and MLP2 features only and when MFCC and PLP are concatenated with MLP1 and MLP2. The results show similar trends as in the 100-hour system. In other words, the MLP feature performance scales with the amount of training data. 
In particular, the MLP1 and MLP2 front-ends outperform the spectral features and produce a relative improvement in the range of 15% 25% when used in concatenation with MFCC or PLP, reducing the CER to the range 10.1% 12.2% for the different datasets. The improvements are verified on all three test sets. The relative improvements after SAT are generally reduced with respect to the speaker-independent system. After SAT, the MLP2 features (based on a hierarchical approach) yield the best performance in concatenation with both MFCC and PLP. The lattice combination results of MFCC and PLP sub-systems are reported in Table XVI (first row). For investigation purposes, corresponding sub-systems trained using MLP1 and MLP2 front-ends are combined in the same way and their performance is reported in Table XVI (second row). Their performance is superior to the MFCC/PLP system by 9% 14% relative, showing that the improvements hold after the lattice level combination. In order to increase the complementarity of the sub-systems, features MLP1 and MLP2 were then concatenated with PLP and MFCC, respectively. The performance of the lattice level

10 2448 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 8, NOVEMBER 2011 TABLE XVII EFFECT OF DISCRIMINATIVE TRAINING ON DIFFERENT SUBSYSTEMS AND THEIR COMBINATION (DESIGNATED WITH 8) combination of those two sub-systems is reported in Table XVI (third row). The results show that using the two MLP front-ends in concatenation with MFCC/PLP features produces an additional relative improvement, resulting in the range of 18% 23% after system combination. For the GALE 2008 evaluation, discriminative training was further applied to the two subsystems before the lattice level combination. Discriminative training is based on a modified minimum phone error (MPE) criterion described in [30]. Table XVII reports CER obtained after discriminative training. Results are reported for the PLP+MLP1 system, the MFCC+MLP2 system and their lattice level combination. In all the three cases, discriminative training reduced the CER in the range 6% 13% relative, showing that it is also effective when used together with different MLP front-ends. For computational reasons, fully contrastive results with and without discriminative training are not available on the 1600 hours system. This system including the two most recent MLP-based frontends showed to be very competitive to current Mandarin LVCSR systems evaluated on the same test sets [31], [32]. VI. DISCUSSION AND CONCLUSION During GALE evaluation campaigns, several MLP based front-ends have been used in different LVCSR systems although no exhaustive and systematic study of their performances has been reported in literature. Without such a comparison, it is not possible to verify which of the modification to the original MLP features produced improvements in the final system. This correspondence describes and compares in a systematic manner all the MLP front-ends developed recently at multiple sites and used during the GALE project for Mandarin transcription. The initial investigation is carried on a small-scale experimental setup (100 hours) and investigates the two directions along which the MLP features have recently evolved: the use of different inputs to the conventional three-layer MLP and the use of complex MLP architectures. The experimentation is done both using MLP front-ends as stand-alone features and in concatenation with MFCC. Three-layer MLPs are trained using conventional spectral features (9frames-PLP) and features extracted from long time spans of the signal (MRASTA, DCT-TRAPS, wlp-traps and their augmented versions). Results reveal that as stand-alone features, none of them outperforms the conventional MFCC features. The performances of the MLPs trained on long time spans of the speech signal (MRASTA, DCT-TRAPS, wlp-traps) are quite poor compared to those obtained from training on short-term spectral features (9frames-PLP). The latter one is superior on most of the phonetic targets apart from a few of phonetic classes like plosives and affricatives. Features based on the three-layer MLP produce relative improvements in the range of 10% 14% when used in concatenation with the MFCC. Even when their performances are poor as stand-alone front-ends, they always appear to provide complementary information to the MFCC. After concatenation with MFCC, the various representations (MRASTA, DCT-TRAPS, wlp-traps) produce comparable performances. Over time, several alternative architectures have been proposed to replace the three-layer MLP with different motivations. 
This work experiments with Multi-stream, Hierarchical and Bottleneck approaches. Results using those architectures reveal the following novel findings. The Multi-stream framework that combines MLPs trained on long and short time spans outperforms the MFCC by approximately 10% relative as stand-alone feature. Furthermore, it reduces the CER by 16% relative in concatenation with MFCC. The hierarchical approach that sequentially increases the time context through a hierarchy of MLPs outperforms the MFCC by approximately 6% relative as stand-alone feature and reduces the CER by 18% relative in concatenation with MFCC. Results obtained using the bottleneck approach (five-layer MLP) show a similar trend. The MLP front-end that provides the lowest CER as standalone feature is different from the front-end that provides the highest complementarity to spectral features. This effect is discussed in Section IV-C. MLPs trained using long-time spans of the signal at the input become effective only when coupled with architectures that go beyond the three-layer structure, i.e., hierarchies or bottleneck. In summary, the most recent improvements are obtained by the use of architectures that go beyond the three-layer MLP rather than the various input representations. These results have been obtained by training the HMM/GMM and MLPs on 100 hours of speech data and tested in a simple LVCSR system. Evaluation systems are typically trained on a much larger amount of data, make uses of multipass training and are composed of a number of individual sub-systems that are combined together to provide the final recognition output. In this paper, MLP features are investigated with a large amount of training data as well as on a state-of-the-art multipass system. The improvements from the small scale study hold for the large amount of training data on speaker-independent, speaker-adapted systems and after the lattice level combination. This is verified both in concatenation with MFCC and PLP features. When MLP features are used together with spectral features, the gain after lattice combination is in the range of 19% 23% relative for the 5-hour evaluation data sets. The comprehensive contrastive experiment on a multipass evaluation system shows that the improvements obtained on a small setup scale with the amount of training data and the parametric complexity of the system. To our best knowledge, this is the most extensive study on MLP features for Mandarin LVCSR covering all the front-ends including the most recent ones used in the 2008 GALE evaluation systems. The final evaluation system showed to be very competitive to current Mandarin LVCSR systems evaluated on the same test sets [31], [32].

11 VALENTE et al.: TRANSCRIBING MANDARIN BROADCAST SPEECH USING MLP ACOUSTIC FEATURES 2449 ACKNOWLEDGMENT The authors would like to thank colleagues involved in the GALE project and Dr. P. Fousek for their help. REFERENCES [1] H. Hermansky et al., Connectionist feature extraction for conventional HMM systems, in Proc. ICASSP, 2000, pp [2] B. Chen et al., Learning discriminative temporal patterns in speech: Development of novel TRAPS-like classifiers, in Proc. Eurospeech, 2003, pp [3] N. Morgan et al., TRAPping conversational speech: Extending TRAP/ Tandem approaches to conversational telephone speech recognition, in Proc. ICASSP, 2004, pp [4] H. Hermansky and P. Fousek, Multi-resolution rasta filtering for tandem-based ASR, in Proc. Interspeech 05, 2005, pp [5] P. Fousek, L. Lamel, and J.-L. Gauvain, Transcribing broadcast data using MLP features, in Proc. Interspeech, 2008, pp [6] D. Ellis et al., Tandem acoustic modeling in large-vocabulary recognition, in Proc. ICASSP, 2001, pp [7] N. Morgan et al., Pushing the envelope aside, IEEE Signal Process. Mag., vol. 22, no. 5, pp , Sep [8] D. Vergyri et al., Development of the SRI/Nightingale arabic ASR system, in Proc. Interspeech, 2008, pp [9] M.-Y. Hwang et al., Building a highly accurate mandarin speech recognizer with language-independent technologies and language-dependent modules, IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 7, pp , Sep [10] C. Plahl et al., Development of the GALE 2008 Mandarin LVCSR system, in Proc. Interspeech, 2009, pp [11] J. Park et al., Efficient generation and use of MLP features for arabic speech recognition, in Proc. Interspeech, Brighton, U.K., Sep. 2009, pp [12] P. Schwarz, P. Matejka, and J. Cernocky, Extraction of features for automatic recognition of speech based on spectral dynamics, in Proc. TSD 04, Brno, Czech Republic, Sep. 2004, pp [13] P. Fousek, Extraction of features for automatic recognition of speech based on spectral dynamics, Ph.D. dissertation, Faculty of Elect. Eng., Czech Technical Univ., Prague, Czech Republic, [14] H. Hermansky et al., Towards ASR on partially corrupted speech, in Proc. ICSLP, 1996, pp [15] F. Valente and H. Hermansky, Hierarchical and parallel processing of modulation spectrum for ASR applications, in Proc. ICASSP, 2008, pp [16] F. Grezl et al., Probabilistic and bottle-neck features for LVCSR of meetings, in Proc. ICASSP 07, Hononulu, HI, 2007, pp [17] H. Bourlard and N. Morgan, Connectionist Speech Recognition A Hybrid Approach. Norwell, MA: Kluwer, [18] X. Lei et al., Improved tone modeling for Mandarin broadcast news speech recognition, in Proc. Interspeech, 2006, pp [19] H. Hermansky and S. Sharma, Temporal Patterns (TRAPS) in ASR of Noisy Speech, in Proc. ICASSP 99, Phoenix, AZ, 1999, pp [20] T. Dau et al., Modeling auditory processing of amplitude modulation.i detection and masking with narrow-band carriers, J. Acoust. Soc. Amer., no. 102, pp , [21] M. Athineos, H. Hermansky, and D. P. W. Ellis, Lp-trap: Linear predictive temporal patterns, in Proc. ICSLP, 2004, pp [22] J. B. Allen, Articulation and Intelligibility. San Rafael, CA: Morgan & Claypool, [23] H. Misra, H. Bourlard, and V. Tyagi, Entropy-based multi-stream combination, in Proc. ICASSP, 2003, pp [24] F. Valente and H. Hermansky, Combination of acoustic classifiers based on Dempster-Shafer theory of evidence, in Proc. ICASSP, 2007, pp [25] F. Valente et al., Hierarchical Modulation spectrum for the GALE project, in Proc. Interpseech, 2009, pp [26] F. Grezl and P. 
Fousek, Optimizing bottleneck features for LVCSR, in Proc. ICASSP 08, Las Vegas, NV, 2008, pp [27] Q. Zhu et al., On using MLP features in LVCSR, in Proc. ICSLP, 2004, pp [28] A. Suchato, Classification of Stop Place of Articulation, Ph.D. dissertation, Mass. Inst. of Technol., Cambridge, [29] B. Hoffmeister et al., Frame based system combination and a comparison with weighted ROVER and CNC, in Proc. Interspeech, Pittsburgh, PA, Sep. 2006, pp [30] G. Heigold et al., Margin-based discriminative training for string recognition, J. Sel. Topics Signal Process, vol. 4, no. 6, pp , Dec [31] S. M. Chu et al., Recent advances in the GALE mandarin transcription system, in Proc ICASSP, Las Vegas, NV, Apr. 2008, pp [32] T. Ng et al., Progress in the BBN mandarin speech to text system, in Proc. ICASSP, Las Vegas, NV, Apr. 2008, pp Fabio Valente (M 05) received the M.Sc. degree (summa cum laude) in communication systems from Politecnico di Torino, Turin, Italy, in 2001 and the M.Sc. degree in image processing and the Ph.D. degree in signal processing from the University of Nice, Sophia Antipolis, France, in 2002 and 2005, respectively. His Ph.D. work was on variational Bayesian methods for speaker diarization done at the Institut Eurecom, France. In 2001, he worked for the Motorola HIL (Human Interface Lab), Palo Alto, CA. Since 2006, he has been with the Idiap Research Institute, Martigy, Switzerland, involved in several E.U. and U.S. projects on speech and audio processing. His main interests are in machine learning and speech recognition. He is an author/coauthor of several papers in international conferences and journals with contributions in feature extraction and selection for speech recognition, multi-stream ASR, and Bayesian statistics for speaker diarization. Mathew Magimai Doss (S 03 M 05) received the B.E. degree in instrumentation and control engineering from the University of Madras, Chennai, India, in 1996, the M.S. degree in research in computer science and engineering from the Indian Institute of Technology, Madras, India, in 1999, and the PreDoctoral diploma and the Docteurès Sciences (Ph.D.) degree from the École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland, in 2000 and 2005, respectively. From April 2006 to March 2007, he was a Postdoctoral Fellow at the International Computer Science Institute, Berkeley, CA. Since April 2007, he has been working as a Research Scientist at the Idiap Research Institute, Martigny, Switzerland. His research interests include speech processing, automatic speech and speaker recognition, statistical pattern recognition, and artificial neural networks. Christian Plahl received the diploma degree in computer science from the University of Bielefeld, Bielefeld, Germany, in He is currently pursuing the Ph.D. degree in the Computer Science Department, RWTH Aachen University, Aachen, Germany. His research interests cover speech recognition, discriminative training, and signal analysis. Suman Ravuri is currently pursuing the Ph.D. degree in the Electrical Engineering and Computer Sciences Department, University of California, Berkeley. He is with the International Computer Science Institute (ICSI), Berkeley, CA, working on automatic speech recognition.

Wen Wang (M'98) received the B.S. degree in electrical engineering and the M.S. degree in computer engineering from Shanghai Jiao Tong University, Shanghai, China, in 1996 and 1998, respectively, and the Ph.D. degree in computer engineering from Purdue University, West Lafayette, IN. She is currently a Research Engineer in the Speech Technology and Research Laboratory, SRI International, Menlo Park, CA. Her research interests are in statistical language modeling, speech recognition, machine translation, natural language processing and understanding, and machine learning. She has authored or coauthored over 50 research papers and has served as a reviewer for over 10 journals and conferences. She is a member of the Association for Computational Linguistics.
