IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 8, NOVEMBER 2011

Transcribing Mandarin Broadcast Speech Using Multi-Layer Perceptron Acoustic Features

Fabio Valente, Member, IEEE, Mathew Magimai Doss, Member, IEEE, Christian Plahl, Suman Ravuri, and Wen Wang, Member, IEEE

Abstract: Recently, several multi-layer perceptron (MLP)-based front-ends have been developed and used for Mandarin speech recognition, often showing significant complementarity to conventional spectral features. Although widely used in multiple Mandarin systems, no systematic comparison of all the different approaches and of their scalability has been proposed. The novelty of this correspondence is mainly experimental. In this work, all the MLP front-ends recently developed at multiple sites are described and compared in a systematic manner on a 100-hour setup. The study covers the two main directions along which MLP features have evolved: the use of different input representations to the MLP and the use of more complex MLP architectures beyond the three-layer perceptron. The results are analyzed in terms of confusion matrices, and the paper discusses a number of novel findings that the comparison reveals. Furthermore, the two best front-ends used in the GALE 2008 evaluation, referred to as MLP1 and MLP2, are studied in a more complex LVCSR system in order to investigate their scalability in terms of the amount of training data (from 100 hours to 1600 hours) and the parametric system complexity (maximum-likelihood versus discriminative training, speaker adaptive training, lattice-level combination). Results on 5 hours of evaluation data from the GALE project reveal that the MLP features consistently produce improvements in the range of 15%–23% relative at the different steps of a multipass system when compared to mel-frequency cepstral coefficient (MFCC) and PLP features, suggesting that the improvements scale with the amount of data and with the complexity of the system. The integration of those features into the GALE 2008 evaluation system provides very competitive performance compared to other Mandarin systems.

Index Terms: Automatic speech recognition (ASR), broadcast data, GALE project, multi-layer perceptron (MLP), multi-stream, TANDEM features.

Manuscript received July 16, 2010; revised December 20, 2010; accepted March 20, 2011. Date of publication April 21, 2011; date of current version September 16, 2011. This work was supported by the Defense Advanced Research Projects Agency (DARPA) under Contract HR0011-06-C-0023. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Dimitra Vergyri.

F. Valente and M. M. Doss are with the Idiap Research Institute, 1920 Martigny, Switzerland (e-mail: fabio.valente@idiap.ch; mathew@idiap.ch). C. Plahl is with the Computer Science Department, RWTH Aachen University, 52056 Aachen, Germany (e-mail: plahl@i6.informatik.rwth-aachen.de). S. Ravuri is with the International Computer Science Institute, Berkeley, CA 94704 USA (e-mail: ravuri@icsi.berkeley.edu). W. Wang is with the Speech Technology and Research Laboratory, SRI International, Menlo Park, CA 94025 USA (e-mail: wwang@speech.sri.com).
Digital Object Identifier 10.1109/TASL.2011.2139206

I. INTRODUCTION

Recently, a growing number of large-vocabulary continuous speech recognition (LVCSR) systems make use of multi-layer perceptron (MLP) features. MLP features were originally introduced by Hermansky and his colleagues in [1], where the output of an MLP classifier is used as the acoustic front-end for conventional speech recognition systems based on hidden Markov models/Gaussian mixture models (HMMs/GMMs). A large number of studies have proposed different types of MLP-based front-ends (see [2]–[5]) and investigated their use for transcribing English (see [6] and [7]). The most common application is in concatenation with mel-frequency cepstral coefficient (MFCC) or perceptual linear predictive (PLP) features, to which MLP features show considerable complementarity. In recent years, in the framework of the GALE¹ program, MLP features have been extensively used in ASR systems for the Mandarin and Arabic languages (see [5] and [8]–[11]).

Since the original work [1], MLP front-ends have progressed along two main directions: 1) the use of different input representations to the MLP; 2) the use of complex MLP architectures beyond the conventional three-layer perceptron. The first category includes speech representations that aim at using long time spans of the speech signal, which could capture long-term phenomena (such as co-articulation) and are complementary to MFCC or PLP features [7]. Because of the large dimension of these signal time spans, a number of techniques for efficiently encoding this information have been proposed, such as MRASTA [4], DCT-TRAPS [12], and wLP-TRAPS [13]. The second category includes a heterogeneous set of techniques that aim at overcoming the pitfalls of the single MLP classifier. They are based on the probabilistic combination of MLP outputs obtained using different input representations. Those combinations can happen in a parallel fashion, as in the multi-stream approach [2], [14], or in a hierarchical fashion [15]. Furthermore, the probabilistic features generated by three-layer MLPs have recently been replaced by bottleneck features extracted from four-layer and five-layer MLPs [16].

While previous works, e.g., [9], have discussed the development of the Mandarin LVCSR systems that use those features, no exhaustive comparison and analysis of the different front-ends has been presented in the literature. Without such a side-by-side comparison, it is not possible to assess which of the recent advances actually produced improvements in the final system.

¹ http://www.darpa.mil/ipto/programs/gale/gale.asp

This correspondence focuses on those recent advances in training, scaling, and integrating MLP front-ends for Mandarin transcription. The novelty of this work is mainly experimental, and the correspondence provides two contributions.

First, the various MLP-based front-ends recently developed at multiple sites are described and compared on a common experimental setup in a systematic way. The comparison covers all the MLP features used in GALE and is done using the same phoneme set, the same speech/silence segmentation, the same amount of training data, and the same number of free parameters. The study uses a simplified version of the system described in [9], trained on 100 hours of Mandarin broadcast news and conversation recordings. The investigation covers MLP acoustic front-ends both as stand-alone features and in concatenation with conventional MFCC features. To the best of our knowledge, this is the most exhaustive comparison of MLP front-ends for Mandarin speech recognition. The comparison reveals a number of novel facts about the different features and their use in LVCSR systems.

The second contribution is a study of how the performance scales with the amount of training data (from 100 hours to 1600 hours of broadcast audio) and with the parametric model complexity of the system (including speaker adaptive training, lattice-level combination, and discriminative training). As before, the contrastive experiments are run with and without the MLP features to assess the maximum relative improvement that can be obtained.

The remainder of the paper is organized as follows. Section II describes features obtained using three-layer MLPs with various input representations, and Section III describes features obtained using modifications to the three-layer architecture. Section IV experiments with those features in a system trained on 100 hours and analyzes and discusses the results of the comparison. Section V experiments with a large-scale multi-pass evaluation system, and the paper is concluded in Section VI.

II. INPUT REPRESENTATION FOR THREE-LAYER MLP FEATURES

The simplest MLP feature extraction is based on the following steps. At first, a three-layer MLP classifier is trained to minimize the cross-entropy between its output and a set of phonetic labels. Such a classifier produces phoneme posterior probabilities conditioned on the input representation at a given time instant [17]. In order to exploit this representation in HMM/GMM models, the phoneme posterior probabilities are first Gaussianized by applying a logarithm and then decorrelated using a principal component analysis (PCA) transform. After PCA, a dimensionality reduction accounting for 95% of the total variability is applied. The resulting feature vectors are used as conventional acoustic features in ASR systems. This framework is also known as TANDEM [1].
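The log/PCA post-processing lends itself to a compact sketch. The following minimal numpy example (the function name and the toy posteriors are our own illustration, not part of the toolchain used in the paper) Gaussianizes a matrix of frame-level phoneme posteriors with a logarithm, decorrelates it with PCA, and keeps the leading components accounting for 95% of the total variability:

```python
import numpy as np

def tandem_postprocess(posteriors, variance_kept=0.95, eps=1e-10):
    """Log -> PCA -> truncation, as in the TANDEM recipe described above.
    `posteriors` has shape (frames, phones); returns (frames, k)."""
    logp = np.log(posteriors + eps)                    # Gaussianize
    centered = logp - logp.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    order = np.argsort(eigvals)[::-1]                  # sort by variance, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cum = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cum, variance_kept)) + 1   # components for 95% variance
    return centered @ eigvecs[:, :k]

# Toy usage: 100 frames of posteriors over the 72 Mandarin tonemes.
features = tandem_postprocess(np.random.dirichlet(np.ones(72), size=100))
```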
The input to the MLP classifier can be conventional short-term features like PLP/MFCC or long-term features that aim at capturing the dynamic characteristics of the speech signal over large time spans. Let us briefly describe four different MLP inputs proposed and used for the transcription of Mandarin broadcast.

A. TANDEM-PLP

In TANDEM-PLP features, the input to the MLP is represented by nine consecutive frames of PLP cepstral features. Mandarin is a tonal language; thus, the PLP vector is augmented with the smoothed log-pitch estimate plus its first- and second-order temporal derivatives, as described in [18]. The PLP features undergo vocal tract length normalization and speaker-level mean and variance normalization. The final dimension of this vector is 42; thus, the input to the MLP is a vector of size 9 × 42 = 378. TANDEM-PLP was the first MLP-based feature to be proposed and aims at using a few consecutive frames of short-term spectral features.

On the other hand, the input to the MLP can also be represented by critical-band temporal trajectories (up to half a second long) aiming at modeling long-time patterns of the speech signal (also known as Temporal Patterns, or TRAPS [19]). The dimensionality of TRAPS is quite large: for instance, 500-ms trajectories (51 frames) in a 19-band critical-band spectrogram produce a vector of dimension 969. Several methods have been considered for efficiently encoding this information while reducing the dimension; they are briefly reviewed in the following.

B. Multiple RASTA

Multiple RASTA (MRASTA) filtering [4] is an extension of RASTA filtering and aims at using long signal time spans at the input of the MLP. The model is consistent with studies on human perception of modulation frequencies, modeled using a bank of filters equally spaced on a logarithmic scale [20]. This bank of filters subdivides the available modulation frequency range into separate channels, with a resolution that decreases moving from slow to fast modulations. Feature extraction is composed of the following parts. A 19-band critical-band auditory spectrum is extracted from the short-time Fourier transform of the signal every 10 ms. A 600-ms-long temporal trajectory in each critical band is then filtered with a bank of bandpass filters. Those filters are first derivatives (1) and second derivatives (2) of Gaussian functions with variance varying in the range 8–60 ms:

$G1_{\sigma}(t) \propto \frac{d}{dt}\, e^{-t^2/(2\sigma^2)} = -\frac{t}{\sigma^2}\, e^{-t^2/(2\sigma^2)}$   (1)

$G2_{\sigma}(t) \propto \frac{d^2}{dt^2}\, e^{-t^2/(2\sigma^2)} = \left(\frac{t^2}{\sigma^4} - \frac{1}{\sigma^2}\right) e^{-t^2/(2\sigma^2)}$   (2)

with $\sigma \in \{8, \ldots, 60\}$ ms. In effect, the MRASTA filters are multi-resolution bandpass filters on modulation frequency, dividing the available modulation frequency range into its individual sub-bands.² In the modulation frequency domain, they correspond to a filter-bank with filters equally spaced on a logarithmic scale. Identical filters are used for all critical bands; thus, they provide a multiple-resolution representation of the time-frequency plane. After MRASTA filtering, frequency derivatives across three consecutive critical bands are introduced. The total number of features used as input to the three-layer MLP is 432.

² Unlike in [4], filter-banks G1 and G2 are composed of six filters rather than eight, leaving out the two filters with the longest impulse responses.
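A sketch of the filter bank in (1)–(2) is given below; the six σ values (log-spaced between 8 and 60 ms, matching the six filters per bank of footnote 2) and the helper names are illustrative assumptions:

```python
import numpy as np

def mrasta_filters(sigmas_ms=(8, 12, 19, 28, 43, 60), frame_ms=10, span_ms=600):
    """Build the G1 (first-derivative) and G2 (second-derivative) Gaussian
    filters of (1)-(2), sampled every `frame_ms` over a `span_ms` window."""
    t = np.arange(-span_ms // 2, span_ms // 2 + frame_ms, frame_ms).astype(float)
    g1, g2 = [], []
    for s in sigmas_ms:
        g = np.exp(-t**2 / (2.0 * s**2))
        d1 = -(t / s**2) * g                      # first derivative of a Gaussian
        d2 = (t**2 / s**4 - 1.0 / s**2) * g       # second derivative
        g1.append(d1 - d1.mean())                 # zero mean -> bandpass behavior
        g2.append(d2 - d2.mean())
    return np.asarray(g1), np.asarray(g2)

def filter_band(trajectory, filters):
    """Apply every filter to one critical band's energy trajectory."""
    return np.stack([np.convolve(trajectory, f, mode='same') for f in filters])

# 19 bands x (6 + 6) filters = 228 outputs per frame; the frequency
# derivatives across neighboring bands bring the total to the 432 inputs
# quoted above.
```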

[TABLE I: Differences between the three input representations that use long temporal time spans.]

C. DCT-TRAPS

The DCT-TRAPS aims at reducing the dimension of the trajectories using a discrete cosine transform (DCT). As described in [12], the results obtained using the DCT basis are very similar to those obtained using a principal component analysis. The critical-band auditory spectrum is extracted from the short-time Fourier transform of the signal every 10 ms. Then, 500-ms-long energy trajectories are extracted for each of the 19 critical bands that compose the spectrogram. Those are projected onto the first 16 coefficients of a DCT transform, resulting in a vector of size 19 × 16 = 304 used as input to the MLP. In contrast to MRASTA, this representation does not emulate the sensitivity of human hearing to the different modulation frequencies.

D. wLP-TRAPS

A third alternative for extracting information from long signal time spans is represented by the wLP-TRAPS [13]. In contrast to the previous front-ends, the process does not use the short-term spectrum and thus potentially provides more complementarity to the MFCC features. Those features are obtained by warping the temporal axis in the LP-TRAP feature calculation [21]. The feature extraction is composed of the following steps: at first, linear prediction is used to model the Hilbert envelopes of prewarped 500-ms-long energy trajectories in auditory-like frequency sub-bands. The warping ensures that more emphasis is given to the center of the trajectories compared to the borders [13], again emulating human perception. The 25 LPC coefficients in each of the 19 frequency bands are then used as input to the MLP, producing a feature vector of dimension 25 × 19 = 475.

All three representations described in Sections II-B–II-D aim at using long temporal time spans; however, they differ from each other in a number of implementation issues, such as the use of the short-time power spectrum, the use of zero-mean filters, and the warping of the time axis. Those differences are summarized in Table I. As Mandarin is a tonal language, those representations can be augmented with the smoothed log-pitch estimate obtained as described in [18] and with the values of the critical-band energies (19 features per frame). In the following, we will refer to them as Augmented features.
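As an illustration of the DCT-TRAPS encoding of Section II-C, the sketch below projects each band's 51-frame (500-ms) trajectory onto the first 16 DCT-II basis vectors, yielding the 19 × 16 = 304-dimensional MLP input; the function and variable names are our own:

```python
import numpy as np

def dct_traps(spectrogram, traj_len=51, n_coef=16):
    """Project each critical band's `traj_len`-frame energy trajectory,
    centered on the middle frame, onto the first `n_coef` DCT-II bases.
    `spectrogram` has shape (bands, frames); returns a flat vector."""
    n = traj_len
    k = np.arange(n_coef)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))  # (16, 51)
    mid = spectrogram.shape[1] // 2
    traj = spectrogram[:, mid - n // 2 : mid + n // 2 + 1]        # (19, 51)
    return (traj @ basis.T).ravel()                               # (19 * 16,)

feats = dct_traps(np.random.randn(19, 101))   # toy 19-band log spectrogram
assert feats.shape == (304,)
```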
III. MLP ARCHITECTURES

The second direction along which the front-ends have evolved is the use of more complex architectures that overcome limitations of the three-layer MLP in different ways. Most of them are based on the combination of several MLP outputs trained using different input representations. This combination can happen in a parallel or hierarchical fashion. Again, no side-by-side comparison of these architectures has been presented in the literature. The following paragraphs briefly describe these front-ends as used in LVCSR systems.

A. Hidden Activation TRAPS (HATS)

HATS feature extraction is based on observations on human speech recognition [22], which conjecture that humans recognize speech independently in each critical band and that a final decision is obtained by recombining those estimates. HATS aims at using information extracted from long time spans of critical-band energies, which are fed into a set of independent classifiers instead of a single MLP classifier. At first, a 19-band critical-band auditory spectrum is extracted from the short-time Fourier transform of the signal every 10 ms. After that, HATS [2] feature extraction is composed of two steps.

1) In the first stage, an independent MLP is trained for each of the 19 critical bands to classify phonemes. The input to each MLP is a 500-ms-long log critical-band energy trajectory (i.e., a 51-dimensional input). The input undergoes utterance-level mean and variance normalization.

2) In the second stage, a merger MLP is trained using the hidden activations obtained from the 19 MLPs of the first stage. The merger classifier aims at obtaining a single phoneme posterior estimate out of the independent estimates coming from each critical band.

Phoneme posteriors obtained from the merger MLP are then transformed and used as features. The rationale behind this architecture is that corruptions of particular critical bands should have less effect on the final recognition results.

B. Multi-Stream

The outputs of MLPs are posterior probabilities of phonetic targets that can be combined into a single estimate using probabilistic rules. This approach is typically referred to as multi-stream and was introduced in [14]. The rationale behind it is that MLPs trained using different input representations will perform differently in different conditions. To take advantage of both representations, the combination rule should be able to dynamically select the best posterior stream. Typical combination rules weight the posterior probabilities using a function of the output entropy (see [23] and [24]). Posteriors obtained from TANDEM-PLP (short signal time spans) and HATS (long signal time spans) are combined using the Dempster-Shafer method [24] and used as features after a log/PCA transform. Multi-stream comes at the obvious cost of doubling the total number of parameters in the system.
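The paper combines the two streams with the Dempster-Shafer method of [24]; as a simpler illustration of the entropy-based family of rules it cites [23], the sketch below weights each stream frame by frame with the inverse of its output entropy. This is a stand-in for, not a reimplementation of, the combination actually used:

```python
import numpy as np

def inverse_entropy_combine(stream_a, stream_b, eps=1e-10):
    """Combine two (frames, phones) posterior streams, giving more weight
    to the stream with lower entropy (higher confidence) at each frame."""
    def entropy(p):
        return -(p * np.log(p + eps)).sum(axis=1)
    wa = 1.0 / (entropy(stream_a) + eps)
    wb = 1.0 / (entropy(stream_b) + eps)
    z = wa + wb
    combined = (wa / z)[:, None] * stream_a + (wb / z)[:, None] * stream_b
    return combined / combined.sum(axis=1, keepdims=True)

# Toy usage with two random posterior streams over 72 tonemes.
pa = np.random.dirichlet(np.ones(72), size=10)
pb = np.random.dirichlet(np.ones(72), size=10)
combined = inverse_entropy_combine(pa, pb)
```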

C. Hierarchical Processing

While multi-stream approaches combine MLP outputs in parallel, studies on English and Mandarin data [15], [25] showed that the most effective way of combining classifiers trained on separate ranges of modulation frequencies, i.e., on different temporal spans, is hierarchical (sequential) processing. The hierarchical processing is based on the following steps. The MRASTA filters cover the whole range of modulation frequencies. The filter-banks G1 and G2 (six filters each) are split into two separate filter-banks, G1-High/G2-High and G1-Low/G2-Low, which filter fast and slow modulation frequencies, respectively:

G-High = {G1-High, G2-High},   (3)
G-Low = {G1-Low, G2-Low},   (4)

where the High banks contain the filters with the smallest variances σ (short impulse responses) and the Low banks those with the largest variances (long impulse responses). Filters G1-High and G2-High are short filters, and they process high modulation frequencies; filters G1-Low and G2-Low are long filters, and they process low modulation frequencies. The cutoff frequency for both filter-banks G-High and G-Low is approximately 10 Hz.

The output of the MRASTA filtering is processed by a hierarchy of MLPs progressively moving from high to low modulation frequencies (i.e., from short to long temporal contexts). The rationale behind this processing is that the errors produced by the first MLP can be corrected by a second one using the estimates from the first MLP together with evidence from another range of modulation frequencies. The first MLP is trained on the first feature stream, represented by the output of the filter-bank G-High that extracts high modulation frequencies. This MLP estimates a first set of phoneme posterior probabilities. These posteriors are modified according to a log/PCA transform and then concatenated with the second feature stream, forming the input to a second phoneme-posterior-estimating MLP. In this way, phoneme estimates from the first MLP are modified by the second net using evidence from a different feature stream. This process is depicted in Fig. 1.

[Fig. 1: Proposed scheme for the MLP-based feature extraction as used in the GALE 2008 evaluation. The auditory spectrum is filtered with a set of multiple-resolution filters that extract fast modulation frequencies. The resulting vector is concatenated with short-term critical-band energy and pitch estimates and is used as input to the first MLP, which estimates phoneme posterior distributions. The output of the first MLP is then concatenated with features obtained using slow modulation frequencies, short-term critical-band energy, and pitch estimates, and is used as input to the second MLP.]

D. Bottleneck Features

Bottleneck features are recently introduced non-probabilistic MLP features [16]. The conventional three-layer MLP is replaced with a four- or five-layer MLP, where the first layer holds the input features and the last layer the phonetic targets. As discussed in [26], the five-layer architecture provides slightly better performance than the four-layer one. The size of the second layer is large, to provide enough modeling power; the size of the third layer is small, typically equal to the desired feature dimension; and the size of the fourth layer is approximately half that of the second [26]. Instead of using the output of the MLP, features are obtained from the linear activations of the third layer. Bottleneck features do not require a dimensionality reduction, as the desired dimension can be obtained by fixing the size of the bottleneck layer. Furthermore, the linear activations are already Gaussian distributed, so they do not require any log transform. The most common inputs to the non-probabilistic bottleneck features are long-term features such as the DCT-TRAPS and wLP-TRAPS described in Sections II-C and II-D.
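A minimal PyTorch sketch of such a five-layer bottleneck network is shown below; the hidden sizes are illustrative assumptions that follow the sizing rules above (large second layer, 35-unit bottleneck, fourth layer roughly half the second, 72 toneme targets):

```python
import torch
import torch.nn as nn

class BottleneckMLP(nn.Module):
    """Five-layer bottleneck MLP in the spirit of [16], [26]: features are
    the linear activations of the small third layer, not the output layer."""
    def __init__(self, n_in=475, n_hidden=2000, n_bottleneck=35, n_targets=72):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        self.bottleneck = nn.Linear(n_hidden, n_bottleneck)  # linear, no squashing
        self.post = nn.Sequential(nn.Sigmoid(),
                                  nn.Linear(n_bottleneck, n_hidden // 2),
                                  nn.Sigmoid(),
                                  nn.Linear(n_hidden // 2, n_targets))

    def forward(self, x):                 # toneme logits, used only for training
        return self.post(self.bottleneck(self.pre(x)))

    def features(self, x):                # 35-dim bottleneck features for the HMM/GMM
        with torch.no_grad():
            return self.bottleneck(self.pre(x))

feats = BottleneckMLP().features(torch.randn(8, 475))   # (8, 35)
```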
IV. SMALL SCALE EXPERIMENTS

The following preliminary experiments are based on the large-vocabulary ASR system for transcribing Mandarin broadcast described in [9], developed by SRI/UW/ICSI for the GALE project. The recognition is performed using the SRI Decipher recognizer, and results are reported in terms of character error rate (CER). The training uses approximately 100 hours of manually transcribed broadcast news and conversation data, including speaker labels. Results are reported on the GALE 2006 evaluation data, referred to simply as eval06 in the following.

The baseline system uses 13 standard MFCCs plus their first- and second-order temporal derivatives. Vocal tract length normalization (VTLN) and speaker-level mean-variance normalization are applied. Mandarin is a tonal language; thus, the MFCC vector is augmented with the smoothed log-pitch estimate plus its first- and second-order temporal derivatives as described in [18], resulting in a feature vector of dimension 42. In the following, we will refer to this system simply as the MFCC baseline. The training is based on conventional maximum-likelihood estimation. The acoustic models are composed of within-word triphone HMMs, and a 32-component diagonal-covariance GMM is used for modeling the acoustic emission probabilities. Parameters are shared across different triphones according to a phonetic decision tree. Recognition networks are compiled from trigram language models trained on over one billion words, with a 60 K vocabulary lexicon [9]. The decoding phase consists of two passes, a speaker-independent (SI) decoding followed by a speaker-adapted (SA) decoding. Speaker adaptation is done using one-class constrained maximum-likelihood linear regression (CMLLR) followed by three-class MLLR.

Performance of this baseline system on the eval06 data is reported in Table II for both speaker-independent (SI) and speaker-adapted (SA) models.

[TABLE II: Baseline system performance on the eval06 data.]

In this set of experiments, three-layer MLPs are trained on all the available 100 hours of acoustic model training data. The Mandarin toneme set is composed of 72 elements. The training is done using the ICSI QuickNet software.³

³ http://www.icsi.berkeley.edu/speech/qn.html

A. MLP Features

This section discusses experiments with features obtained using three-layer MLP architectures with different input representations. Unless explicitly mentioned otherwise, the total number of parameters in the different MLP architectures is equalized to approximately one million in order to ensure a fair comparison between the different approaches. The size of the input layer equals the feature dimension, the size of the output layer equals the number of phonetic targets (72), and the size of the hidden layer is set so that the total number of parameters equals one million. After PCA, a dimensionality reduction accounting for 95% of the total variability is applied; the resulting feature vectors have dimension 35 for all the different MLP features.
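The parameter equalization above amounts to solving a small equation for the hidden layer size; a sketch (the input dimensions are taken from Section II, the helper name is ours):

```python
def hidden_size_for_budget(n_in, n_out=72, budget=1_000_000):
    """For a three-layer MLP, weights plus biases total roughly
    h * (n_in + n_out + 1) + n_out; solve for the hidden size h."""
    return budget // (n_in + n_out + 1)

# 9 frames x 42-dim PLP input vs. the 432-dim MRASTA input:
for n_in in (378, 432):
    print(n_in, hidden_size_for_budget(n_in))   # -> 2217 and 1980 hidden units
```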
The investigation was carried out with MLP features as a stand-alone front-end and in concatenation with spectral features, i.e., MFCC. Results are reported in terms of character error rate (CER) on the eval06 data.

Let us first consider the TANDEM-PLP features described in Section II-A. The performance of those features is reported in Table III, together with the relative improvements with respect to the MFCC baseline with and without speaker adaptation. When used as stand-alone features, TANDEM-PLP does not outperform the baseline, whereas a relative improvement of 16% is obtained when they are used in concatenation with MFCC. After speaker adaptation, the relative improvement drops slightly, by 2%, but is still a 14% relative improvement over the MFCC baseline.

[TABLE III: TANDEM-9framesPLP performance on the eval06 data. Results are reported with MLP features alone and in concatenation with MFCC. The relative improvement with respect to the baseline is reported in parentheses.]

Let us now consider the MLP features obtained using long time spans of the speech signal, as described in Sections II-B–II-D. Table IV shows that these features perform quite poorly as stand-alone features, whereas they provide improvements of around 10% relative in concatenation with the MFCC features. As a stand-alone front-end, the wLP-TRAPS outperforms the other two, whereas in concatenation with spectral features and after adaptation the three representations are comparable. Their performance is, however, inferior to that of the conventional TANDEM-9framesPLP.

[TABLE IV: MLP features making use of long time spans of the signal as input. Performance is reported on the eval06 data, with MLP features as stand-alone features and in concatenation with MFCC. The relative improvement with respect to the baseline is reported in parentheses.]

The performance of these features augmented with the values of the critical-band energies (19 features per frame) and the smoothed log-pitch estimates is reported in Table V. Augmenting the long-term features produces consistent improvements in all cases and brings the performance of these front-ends to the same level as the TANDEM-PLP when tested in concatenation with MFCC. As before, the relative improvements are always reduced after speaker adaptation. In concatenation with spectral features, the three input representations have similar performance.

[TABLE V: MLP features making use of long time spans of the signal as input, augmented with critical-band energy and log-pitch. Performance is reported on the eval06 data, with MLP features as stand-alone features and in concatenation with MFCC. The relative improvement with respect to the baseline is reported in parentheses.]

In summary, MLP front-ends obtained using a three-layer MLP with different input representations do not outperform the conventional MFCC as stand-alone features. On the other hand, they produce relative improvements in the range of 10%–14% when used in concatenation with spectral features. The TANDEM-PLP front-end outperforms the long-term front-ends. The various coding schemes, MRASTA, DCT-TRAPS, and wLP-TRAPS, give similarly poor results as stand-alone features and similar improvements (approximately 11%) when used in concatenation with spectral features. Augmenting the long-term input with a vector of short-term energy and pitch brings the performance close to that of the TANDEM-PLP features. The relative improvements after speaker adaptation are generally reduced by 2% with respect to the speaker-independent systems. This is consistent with what has already been observed in English ASR experiments [27].

B. MLP Architectures

This section discusses experiments with the different MLP architectures; the input signal representations are similar to those used in the previous section, but the information is exploited differently as the MLP architecture changes. The results obtained using these methods are compared with their counterparts based on three-layer MLPs.

1) Hidden Activation TRAPS: HATS aims at using information extracted from long time spans of critical-band energies, but the recognition is done independently in each critical band using 19 independent MLPs; the final posterior estimates are obtained by merging all these estimates (see Section III-A). Results with HATS features are reported in Table VI. As stand-alone features, HATS performs significantly worse than MFCC, whereas an improvement is obtained when it is used in concatenation with MFCC. Comparing Tables IV and VI, it is noticeable that this approach is marginally better than those that feed long-term features into a single MLP.

[TABLE VI: HATS performance on the eval06 data. Results are reported with MLP features alone and in concatenation with MFCC. The relative improvement with respect to the baseline is reported in parentheses.]

2) Multi-Stream MLP Features: Table VII reports the performance of the multi-stream front-end that combines information from TANDEM-PLP (short time spans of the signal) and HATS (long time spans of the signal). These features outperform the MFCC by 10% relative when used stand-alone and by 16% relative in concatenation with MFCC. Those numbers must be compared to the performance of the individual streams, TANDEM-PLP (Table III) and HATS (Table VI). The combination provides a large improvement in the case of stand-alone features (TANDEM-PLP 25.5%, HATS 29.1%, multi-stream 23.1%); however, the improvements are smaller in concatenation with MFCC (TANDEM-PLP 22.1%, HATS 22.7%, multi-stream 21.7%). This is easily explained by the fact that, when used in concatenation with the MFCC, the feature vector contains the spectral information twice, through the MFCC and through the TANDEM features.

[TABLE VII: Multi-stream MLP feature performance on the eval06 data. Results are reported with MLP features alone and in concatenation with MFCC. The relative improvement with respect to the baseline is reported in parentheses.]

3) Hierarchical Processing: Next, we discuss experiments with the hierarchical processing described in Section III-C. Results are reported in Table VIII for both MRASTA and Augmented MRASTA inputs (the processing is depicted in Fig. 1). Comparing Table VIII with Tables IV and V, it is noticeable that the hierarchical approach produces considerable improvements with respect to the single-classifier approach, both with and without MFCC features.

[TABLE VIII: Hierarchical feature performance on the eval06 data. Results are reported with MLP features alone and in concatenation with MFCC. The relative improvement with respect to the baseline is reported in parentheses.]
It is important to notice that the total number of parameters is kept constant; thus, the improvements are produced by the sequential architecture, where short signal time spans are used first and then integrated with the longer ones.

4) Bottleneck Features: Tables IX and X report the performance of the bottleneck features obtained using the different long-term inputs (MRASTA, DCT-TRAPS, and wLP-TRAPS) and their augmented versions. The dimension of the bottleneck is fixed to 35 in order to allow comparison with the other probabilistic MLP features. Results reveal that bottleneck features always outperform their probabilistic counterparts obtained using the three-layer MLP. This is verified for all the different input features and their augmented versions.

[TABLE IX: Bottleneck feature performance on the eval06 data. Results are reported with MLP features alone. The relative improvement with respect to the baseline is reported in parentheses.]

[TABLE X: Augmented bottleneck feature performance on the eval06 data. Results are reported with MLP features alone. The relative improvement with respect to the baseline is reported in parentheses.]

For comparison purposes, Table XI also reports the performance of bottleneck features when the input to the MLP is the 9framesPLP features augmented with pitch.

[TABLE XI: Bottleneck feature performance on the eval06 data when the 9framesPLP-plus-pitch input is used. Results are reported with MLP features alone. The relative improvement with respect to the baseline is reported in parentheses.]

In summary, replacing the three-layer MLP with a more complex MLP structure (while keeping the total number of parameters constant) reduces the error both with and without concatenation of spectral features. The multi-stream approach, which combines in parallel MLPs trained on long- and short-term speech features, produces the lowest CER as a stand-alone front-end (16% relative CER reduction compared to the MFCC). On the other hand, hierarchical and bottleneck structures that go beyond the three layers appear to provide the highest complementarity to MFCC, producing an improvement of 17%–18% relative when used in concatenation. The reasons for these effects are investigated in the next section, where the front-ends are compared in terms of phonetic confusions.

C. Analysis of Results

In order to understand the differences between the various MLP front-ends, let us now analyze the errors they produce in terms of phonetic targets. Table XII reports the phonetic set, composed of 72 tonemes, used for training the MLPs. The set is sub-divided into six broad phonetic classes for analysis purposes; the numbers beside the vowels represent the tonal accents.

[TABLE XII: Phonetic set used to train the different MLPs, divided into broad phonetic classes. As Mandarin is a tonal language, the number beside each vowel designates the accent of the different tonemes.]

The frame-level accuracy of a three-layer MLP trained using 9frames-PLP features in classifying the phonetic targets is 69.8%; Fig. 3 plots the per-class accuracy. Let us now consider the accuracies of the three-layer MLPs trained using the long-term input representations, i.e., MRASTA, DCT-TRAPS, and wLP-TRAPS. They are, respectively, 64%, 62.9%, and 65.2%, all worse than the accuracy of the 9frames-PLP. The HATS features, which are based on long-term critical-band trajectories, have a similar frame-level accuracy, i.e., 65.7%.

While the overall performance of the MLP trained on spectral features is superior to that of the MLPs trained on long time spans of the speech signal, the latter appear to perform better on some phonetic classes. Fig. 3 plots the accuracy of recognizing each of the phonetic classes for HATS. It is noticeable that, in spite of an overall inferior performance, HATS outperforms the TANDEM-PLP on almost all the stop consonants p, t, k, b, d and on the affricate ch.

[Fig. 3: Phonetic-class accuracy obtained by the TANDEM-9framesPLP and HATS. The former outperforms the latter on most of the classes apart from stops and affricates.]

Stop consonants are short sounds characterized by a burst of acoustic energy following a short period of silence and are known to be prone to strong co-articulation with the following vowel.
Studies like [28] have shown that stop consonant recognition can be largely improved by considering information from the following vowel; this explains why using longer speech time spans produces higher recognition performance than conventional short-term spectral features. Also, the affricate ch (composed of a plosive and a fricative) is confused with the fricatives zh and s by the short-term features, while this confusion is significantly reduced by the long-term features. Vowels and the other consonants are still better recognized by the short-term features. Those facts are verified for all the MLP front-ends that use long temporal inputs (MRASTA, DCT-TRAPS, and wLP-TRAPS) as well as for HATS. In summary, training MLPs using short-term spectral input outperforms training using long-term temporal input on most of the phonetic classes, apart from a few classes including the plosives and affricates.
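The per-class analysis behind Figs. 3 and 4 reduces to normalizing a frame-level confusion matrix and averaging within the broad classes of Table XII; a sketch with a hypothetical class grouping (the toneme indices below are made up for illustration):

```python
import numpy as np

def per_phoneme_accuracy(confusion):
    """Row-normalize a (ref x hyp) frame-level confusion matrix so the
    diagonal gives per-toneme accuracy."""
    return np.diag(confusion) / confusion.sum(axis=1).clip(min=1)

def broad_class_accuracy(acc, classes):
    """Average per-toneme accuracy within each broad phonetic class."""
    return {name: float(acc[list(idx)].mean()) for name, idx in classes.items()}

conf = np.random.randint(0, 50, size=(72, 72))        # stand-in confusion counts
acc = per_phoneme_accuracy(conf)
print(broad_class_accuracy(acc, {"stops": [3, 7, 11], "affricates": [20, 21]}))
```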

Let us now consider the multi-stream approach, which dynamically weights the posterior estimates from the 9frames-PLP and HATS according to the confidence of the MLP. The frame accuracy becomes 73%, and the phoneme-level confusion shows that performance is never inferior to the best of the two streams that compose the combination. In other words, the combined distribution appears to perform like HATS on the stop consonants and affricates and like the 9frames-PLP on the remaining phonemes. Those results translate into a significant reduction in the CER, never worse than those obtained using the individual MLP features (see the experiments in Section IV-B2).

The hierarchical approach described in Section III-C is based on a completely different idea. This method uses an initial MLP trained on energy trajectories filtered with short temporal filters (G1-High and G2-High). This provides an initial posterior estimate, which is then fed into the second MLP concatenated with energy trajectories filtered with long temporal filters (G1-Low, G2-Low). The second MLP re-estimates the phonetic posteriors obtained from the first MLP using information from a longer temporal context. The hierarchical framework achieves a frame accuracy of 72%. Interestingly, this is done without using any spectral features (MFCC or PLP) and keeping the number of parameters constant; only the architecture of the MLP is changed, with the temporal context of the input features increased sequentially. In other words, the first MLP, trained on a short temporal context, is effective on most of the phonetic classes apart from stops and affricates; those estimates are then corrected by the second MLP using information from a longer temporal context. Fig. 4 plots the phonetic-class accuracy obtained by the three-layer MLP trained using the MRASTA input and by the hierarchical approach. It is noticeable that the hierarchical approach outperforms MRASTA training on all the targets. Recognition results show that the hierarchical approach (where the processing moves from short to long temporal contexts) reduces the CER with respect to the single-MLP features (where the different time spans are processed by the same MLP). Augmenting the input with pitch estimates and energy further reduces the CER.

Another interesting finding is that, as stand-alone features, the multi-stream approach has the lowest CER, while in concatenation with MFCC the augmented hierarchical approach produces the largest CER reduction (compare Tables VIII and VII). This effect can be explained by the fact that the multi-stream approach makes use of spectral information (through the 9frames-PLP). This information produces a frame accuracy of 73% but does not appear complementary to the MFCC features, as they both represent spectral information. On the other hand, the hierarchical approach achieves a frame accuracy of 72% without the use of any spectral features and appears more complementary when used in concatenation with the MFCC.

Results for the bottleneck features cannot be analyzed in a similar way, as these are non-probabilistic features without any explicit mapping to a phonetic target. However, the recognition results in Tables IX and X show that replacing the three-layer MLP with the bottleneck architectures reduces the CER for all the different input representations (MRASTA, DCT-TRAPS, wLP-TRAPS). Bottleneck and hierarchical approaches produce similar improvements in concatenation with MFCC features.
[Fig. 4: Phonetic-class accuracy obtained by the MRASTA and the hierarchical MRASTA. The latter improves the performance on all the phonetic targets without the use of any spectral information.]

V. LARGE SCALE EXPERIMENTS

Contrastive experiments in the literature are typically reported on small setups like the one presented so far. However, the GALE evaluation systems are trained on a much larger amount of data, make use of multi-pass training, and are composed of a number of individual sub-systems. In order to study how the previous results generalize to more complex LVCSR systems and to a large amount of training data, the experiments are extended using a highly accurate automatic speech recognizer for continuous Mandarin speech trained on 1600 hours of data collected by the LDC (GALE releases P1R1-4, P2R1-2, P3R1-2, P4R1). The training transcripts were preprocessed, and the audio data were segmented into waveforms based on sentence boundaries defined in the manual transcripts; both were provided by UW-SRI as described in [9]. This comparison covers the multi-stream and the hierarchical MRASTA front-ends, which will be referred to simply as MLP1 and MLP2 in the remainder of this paper. These two features were used in the GALE 2008 Mandarin evaluation. The 1600 hours of data are used for training the HMM/GMM systems as well as the MLP front-ends. The evaluation is done on the GALE 2007 development test set (dev07), which is used for tuning hyper-parameters, the GALE 2008 development test set (dev08), and the sequestered data of the GALE 2007 evaluation (eval07-seq), for a total of 5 hours of data. Statistics of the different test sets are summarized in Table XIII.

[TABLE XIII: Acoustic data for training and testing.]

The number of parameters in the MLP architectures is increased to five million for the large-scale setup.

The training of the MLP1 and MLP2 networks took approximately five weeks on an eight-core machine (AMD Opteron, 2192 MHz, 2 × 4-core CPUs). The MLP1 networks were trained at ICSI and the MLP2 networks at IDIAP. On the other hand, the generation of the features is quite fast, approximately 0.09×RT on a single CPU.

The RWTH evaluation system is composed of two subsystems that differ only in their acoustic front-ends. The acoustic front-ends of the subsystems consist of conventional MFCCs and PLPs augmented with the log-pitch estimates [18]. The filter banks underlying the MFCC and PLP feature extraction undergo VTLN. After that, the features are mean and variance normalized and fed into a sliding window of length nine. All feature vectors within the sliding window are concatenated and projected to a 45-dimensional feature space using linear discriminant analysis (LDA). The system uses a word-based pronunciation dictionary described in [9] that maps words to phoneme sequences, where each phoneme carries the tone information and is usually referred to as a toneme. The acoustic models for all systems are based on triphones with cross-word context, modeled by three-state left-to-right HMMs. Decision-tree-based state tying is applied, resulting in a total of 4500 generalized triphone states. The acoustic models consist of Gaussian mixture distributions with a globally pooled diagonal covariance matrix. The first pass consists of maximum-likelihood training; we will refer to this as the SI system. The second pass consists of speaker adaptive training (SAT); furthermore, during decoding, maximum-likelihood linear regression is applied to the means for speaker adaptation. We will refer to this as the SA system. Finally, the outputs of the different subsystems are combined at the lattice level using the min.fwer combination method described in [29], which has been shown to outperform other lattice combination methods such as ROVER or confusion network combination (CNC) [29]. Fig. 2 schematically depicts the RWTH evaluation system.

[Fig. 2: RWTH evaluation system composed of two subsystems trained on MFCC and PLP features. The two subsystems consist of ML training followed by SAT/CMLLR training. The lattice outputs from the subsystems are combined in the end.]

The language model (LM) used in this work was kindly provided by SRI and UW; the vocabulary size is 60 K. Experimental results with the full LM are reported only for the system combination, while a pruned version is applied in all other recognition steps.
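The windowing-plus-LDA front-end just described can be sketched with scikit-learn as a stand-in for the RWTH implementation; the 42-dimensional frames and the class labels below are toy assumptions (the real system derives labels for the 4500 generalized triphone states from forced alignments):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def stack_frames(feats, context=4):
    """Concatenate each frame with its +/- `context` neighbors (a 9-frame
    sliding window), repeating the edge frames as padding."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])

frames = np.random.randn(1000, 42)                 # toy MFCC+pitch frames
labels = np.random.randint(0, 100, size=1000)      # toy state labels
stacked = stack_frames(frames)                     # (1000, 9 * 42 = 378)
lda = LinearDiscriminantAnalysis(n_components=45).fit(stacked, labels)
projected = lda.transform(stacked)                 # (1000, 45)
```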
Table XIV reports the CER for the speaker-independent and speaker-adapted subsystems trained using MFCC and PLP features only. The error rate is in the range of 12.5%–14.5% for the different test sets.

[TABLE XIV: Performance of baseline systems using MFCC or PLP features.]

Let us now consider the integration of the MLP1 and MLP2 front-ends. Table XV reports the performance of the subsystems when they are trained using MLP1 and MLP2 features only and when MFCC and PLP are concatenated with MLP1 and MLP2. The results show trends similar to those of the 100-hour system; in other words, the MLP feature performance scales with the amount of training data. In particular, the MLP1 and MLP2 front-ends outperform the spectral features and produce a relative improvement in the range of 15%–25% when used in concatenation with MFCC or PLP, reducing the CER to the range of 10.1%–12.2% for the different datasets. The improvements are verified on all three test sets. The relative improvements after SAT are generally reduced with respect to the speaker-independent system. After SAT, the MLP2 features (based on the hierarchical approach) yield the best performance in concatenation with both MFCC and PLP.

[TABLE XV: Summary of feature performance on the GALE dev07/dev08/seq-eval07 test sets. Results are reported with MLP features alone and in concatenation with MFCC or PLP. The relative improvement with respect to the MFCC and PLP baselines is reported in parentheses.]

The lattice combination results of the MFCC and PLP sub-systems are reported in Table XVI (first row). For investigation purposes, the corresponding sub-systems trained using the MLP1 and MLP2 front-ends are combined in the same way, and their performance is reported in Table XVI (second row). Their performance is superior to that of the MFCC/PLP system by 9%–14% relative, showing that the improvements hold after the lattice-level combination.

[TABLE XVI: System combination of the MFCC and PLP subsystems, designated with ⊕. The relative improvement with respect to the MFCC ⊕ PLP baseline is reported in parentheses.]

In order to increase the complementarity of the sub-systems, the MLP1 and MLP2 features were then concatenated with PLP and MFCC, respectively.

The performance of the lattice-level combination of those two sub-systems is reported in Table XVI (third row). The results show that using the two MLP front-ends in concatenation with the MFCC/PLP features produces an additional relative improvement, in the range of 18%–23% after system combination.

For the GALE 2008 evaluation, discriminative training was further applied to the two subsystems before the lattice-level combination. Discriminative training is based on a modified minimum phone error (MPE) criterion described in [30]. Table XVII reports the CER obtained after discriminative training; results are reported for the PLP+MLP1 system, the MFCC+MLP2 system, and their lattice-level combination. In all three cases, discriminative training reduced the CER by 6%–13% relative, showing that it is also effective when used together with the different MLP front-ends. For computational reasons, fully contrastive results with and without discriminative training are not available for the 1600-hour system. The system including the two most recent MLP-based front-ends proved very competitive with current Mandarin LVCSR systems evaluated on the same test sets [31], [32].

[TABLE XVII: Effect of discriminative training on the different subsystems and their combination (designated with ⊕).]

VI. DISCUSSION AND CONCLUSION

During the GALE evaluation campaigns, several MLP-based front-ends have been used in different LVCSR systems, although no exhaustive and systematic study of their performance has been reported in the literature. Without such a comparison, it is not possible to verify which of the modifications to the original MLP features produced improvements in the final system. This correspondence describes and compares in a systematic manner all the MLP front-ends recently developed at multiple sites and used during the GALE project for Mandarin transcription.

The initial investigation is carried out on a small-scale experimental setup (100 hours) and covers the two directions along which the MLP features have recently evolved: the use of different inputs to the conventional three-layer MLP and the use of complex MLP architectures. The experimentation is done both using MLP front-ends as stand-alone features and in concatenation with MFCC. Three-layer MLPs are trained using conventional spectral features (9frames-PLP) and features extracted from long time spans of the signal (MRASTA, DCT-TRAPS, wLP-TRAPS, and their augmented versions). Results reveal that, as stand-alone features, none of them outperforms the conventional MFCC features. The performance of the MLPs trained on long time spans of the speech signal (MRASTA, DCT-TRAPS, wLP-TRAPS) is quite poor compared to that obtained by training on short-term spectral features (9frames-PLP); the latter is superior on most of the phonetic targets, apart from a few phonetic classes such as plosives and affricates. Features based on the three-layer MLP produce relative improvements in the range of 10%–14% when used in concatenation with the MFCC. Even when their performance is poor as stand-alone front-ends, they always appear to provide complementary information to the MFCC. After concatenation with MFCC, the various representations (MRASTA, DCT-TRAPS, wLP-TRAPS) produce comparable performance. Over time, several alternative architectures have been proposed to replace the three-layer MLP, with different motivations.
This work experiments with the multi-stream, hierarchical, and bottleneck approaches. Results using those architectures reveal the following novel findings. The multi-stream framework, which combines MLPs trained on long and short time spans, outperforms the MFCC by approximately 10% relative as a stand-alone feature; furthermore, it reduces the CER by 16% relative in concatenation with MFCC. The hierarchical approach, which sequentially increases the time context through a hierarchy of MLPs, outperforms the MFCC by approximately 6% relative as a stand-alone feature and reduces the CER by 18% relative in concatenation with MFCC. Results obtained using the bottleneck approach (a five-layer MLP; a sketch is given below) show a similar trend. The MLP front-end that provides the lowest CER as a stand-alone feature is not the same front-end that provides the highest complementarity to the spectral features; this effect is discussed in Section IV-C. MLPs trained on long time spans of the signal become effective only when coupled with architectures that go beyond the three-layer structure, i.e., hierarchies or bottlenecks. In summary, the most recent improvements are obtained through architectures that go beyond the three-layer MLP rather than through the various input representations.

These results were obtained by training the HMM/GMM and the MLPs on 100 hours of speech data and testing in a simple LVCSR system. Evaluation systems are typically trained on a much larger amount of data, make use of multipass training, and are composed of a number of individual sub-systems that are combined to provide the final recognition output. In this paper, MLP features are investigated with a large amount of training data as well as in a state-of-the-art multipass system. The improvements from the small-scale study hold with the larger amount of training data, on speaker-independent and speaker-adapted systems, and after lattice-level combination. This is verified in concatenation with both MFCC and PLP features. When MLP features are used together with spectral features, the gain after lattice combination is in the range of 19%–23% relative on the 5-hour evaluation data sets. The comprehensive contrastive experiment on a multipass evaluation system shows that the improvements obtained on a small setup scale with the amount of training data and the parametric complexity of the system. To the best of our knowledge, this is the most extensive study of MLP features for Mandarin LVCSR, covering all the front-ends, including the most recent ones used in the 2008 GALE evaluation systems. The final evaluation system proved to be highly competitive with current Mandarin LVCSR systems evaluated on the same test sets [31], [32].
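To illustrate the bottleneck idea under assumed layer sizes (117-dimensional input, 1500-unit hidden layers, a 35-dimensional bottleneck, and 71 phone targets, all illustrative rather than the values used here), the five-layer MLP sketched below is trained on phone targets, while the linear activations of its narrow middle layer, rather than the output posteriors, are extracted as features for the HMM/GMM system.

```python
# Five-layer bottleneck MLP: the narrow linear middle layer provides the features.
import torch
import torch.nn as nn

class BottleneckMLP(nn.Module):
    def __init__(self, n_in=117, n_hidden=1500, n_bn=35, n_out=71):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        self.bottleneck = nn.Linear(n_hidden, n_bn)          # narrow linear layer
        self.post = nn.Sequential(nn.Sigmoid(),
                                  nn.Linear(n_bn, n_hidden), nn.Sigmoid(),
                                  nn.Linear(n_hidden, n_out))

    def forward(self, x):
        bn = self.bottleneck(self.pre(x))  # bottleneck activations = features
        return self.post(bn), bn           # logits for training, features for the HMM/GMM

model = BottleneckMLP()
logits, bn_feat = model(torch.randn(32, 117))
print(bn_feat.shape)                       # (32, 35): per-frame bottleneck features
```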

ACKNOWLEDGMENT

The authors would like to thank the colleagues involved in the GALE project and Dr. P. Fousek for their help.

REFERENCES

[1] H. Hermansky et al., "Connectionist feature extraction for conventional HMM systems," in Proc. ICASSP, 2000, pp. 1635–1638.
[2] B. Chen et al., "Learning discriminative temporal patterns in speech: Development of novel TRAPS-like classifiers," in Proc. Eurospeech, 2003, pp. 853–856.
[3] N. Morgan et al., "TRAPping conversational speech: Extending TRAP/Tandem approaches to conversational telephone speech recognition," in Proc. ICASSP, 2004, pp. 537–540.
[4] H. Hermansky and P. Fousek, "Multi-resolution RASTA filtering for TANDEM-based ASR," in Proc. Interspeech, 2005, pp. 361–364.
[5] P. Fousek, L. Lamel, and J.-L. Gauvain, "Transcribing broadcast data using MLP features," in Proc. Interspeech, 2008, pp. 1433–1436.
[6] D. Ellis et al., "Tandem acoustic modeling in large-vocabulary recognition," in Proc. ICASSP, 2001, pp. 517–520.
[7] N. Morgan et al., "Pushing the envelope aside," IEEE Signal Process. Mag., vol. 22, no. 5, pp. 81–88, Sep. 2005.
[8] D. Vergyri et al., "Development of the SRI/Nightingale Arabic ASR system," in Proc. Interspeech, 2008, pp. 1437–1440.
[9] M.-Y. Hwang et al., "Building a highly accurate Mandarin speech recognizer with language-independent technologies and language-dependent modules," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 7, pp. 1253–1262, Sep. 2009.
[10] C. Plahl et al., "Development of the GALE 2008 Mandarin LVCSR system," in Proc. Interspeech, 2009, pp. 2107–2110.
[11] J. Park et al., "Efficient generation and use of MLP features for Arabic speech recognition," in Proc. Interspeech, Brighton, U.K., Sep. 2009, pp. 236–239.
[12] P. Schwarz, P. Matejka, and J. Cernocky, "Extraction of features for automatic recognition of speech based on spectral dynamics," in Proc. TSD, Brno, Czech Republic, Sep. 2004, pp. 465–472.
[13] P. Fousek, "Extraction of features for automatic recognition of speech based on spectral dynamics," Ph.D. dissertation, Faculty of Elect. Eng., Czech Technical Univ., Prague, Czech Republic, 2007.
[14] H. Hermansky et al., "Towards ASR on partially corrupted speech," in Proc. ICSLP, 1996, pp. 462–465.
[15] F. Valente and H. Hermansky, "Hierarchical and parallel processing of modulation spectrum for ASR applications," in Proc. ICASSP, 2008, pp. 4165–4168.
[16] F. Grezl et al., "Probabilistic and bottle-neck features for LVCSR of meetings," in Proc. ICASSP, Honolulu, HI, 2007, pp. 757–760.
[17] H. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach. Norwell, MA: Kluwer, 1994.
[18] X. Lei et al., "Improved tone modeling for Mandarin broadcast news speech recognition," in Proc. Interspeech, 2006, pp. 1237–1240.
[19] H. Hermansky and S. Sharma, "Temporal patterns (TRAPS) in ASR of noisy speech," in Proc. ICASSP, Phoenix, AZ, 1999, pp. 289–292.
[20] T. Dau et al., "Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers," J. Acoust. Soc. Amer., vol. 102, pp. 2892–2905, 1997.
[21] M. Athineos, H. Hermansky, and D. P. W. Ellis, "LP-TRAP: Linear predictive temporal patterns," in Proc. ICSLP, 2004, pp. 1154–1157.
[22] J. B. Allen, Articulation and Intelligibility. San Rafael, CA: Morgan & Claypool, 2005.
[23] H. Misra, H. Bourlard, and V. Tyagi, "Entropy-based multi-stream combination," in Proc. ICASSP, 2003, pp. 741–744.
[24] F. Valente and H. Hermansky, "Combination of acoustic classifiers based on Dempster-Shafer theory of evidence," in Proc. ICASSP, 2007, pp. 1129–1132.
[25] F. Valente et al., "Hierarchical modulation spectrum for the GALE project," in Proc. Interspeech, 2009, pp. 2963–2967.
[26] F. Grezl and P. Fousek, "Optimizing bottleneck features for LVCSR," in Proc. ICASSP, Las Vegas, NV, 2008, pp. 4729–4732.
[27] Q. Zhu et al., "On using MLP features in LVCSR," in Proc. ICSLP, 2004, pp. 921–924.
[28] A. Suchato, "Classification of stop place of articulation," Ph.D. dissertation, Mass. Inst. of Technol., Cambridge, MA, 2004.
[29] B. Hoffmeister et al., "Frame based system combination and a comparison with weighted ROVER and CNC," in Proc. Interspeech, Pittsburgh, PA, Sep. 2006, pp. 537–540.
[30] G. Heigold et al., "Margin-based discriminative training for string recognition," IEEE J. Sel. Topics Signal Process., vol. 4, no. 6, pp. 917–925, Dec. 2010.
[31] S. M. Chu et al., "Recent advances in the GALE Mandarin transcription system," in Proc. ICASSP, Las Vegas, NV, Apr. 2008, pp. 4329–4333.
[32] T. Ng et al., "Progress in the BBN Mandarin speech to text system," in Proc. ICASSP, Las Vegas, NV, Apr. 2008, pp. 1537–1540.

Fabio Valente (M'05) received the M.Sc. degree (summa cum laude) in communication systems from Politecnico di Torino, Turin, Italy, in 2001, and the M.Sc. degree in image processing and the Ph.D. degree in signal processing from the University of Nice, Sophia Antipolis, France, in 2002 and 2005, respectively. His Ph.D. work, carried out at Institut Eurecom, France, was on variational Bayesian methods for speaker diarization. In 2001, he worked for the Motorola Human Interface Lab (HIL), Palo Alto, CA. Since 2006, he has been with the Idiap Research Institute, Martigny, Switzerland, involved in several E.U. and U.S. projects on speech and audio processing. His main interests are in machine learning and speech recognition. He is an author/coauthor of several papers in international conferences and journals, with contributions in feature extraction and selection for speech recognition, multi-stream ASR, and Bayesian statistics for speaker diarization.

Mathew Magimai Doss (S'03–M'05) received the B.E. degree in instrumentation and control engineering from the University of Madras, Chennai, India, in 1996, the M.S. (by research) degree in computer science and engineering from the Indian Institute of Technology, Madras, India, in 1999, and the PreDoctoral diploma and the Docteur ès Sciences (Ph.D.) degree from the École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland, in 2000 and 2005, respectively. From April 2006 to March 2007, he was a Postdoctoral Fellow at the International Computer Science Institute, Berkeley, CA. Since April 2007, he has been a Research Scientist at the Idiap Research Institute, Martigny, Switzerland. His research interests include speech processing, automatic speech and speaker recognition, statistical pattern recognition, and artificial neural networks.

Christian Plahl received the Diploma degree in computer science from the University of Bielefeld, Bielefeld, Germany, in 2005. He is currently pursuing the Ph.D. degree in the Computer Science Department, RWTH Aachen University, Aachen, Germany. His research interests cover speech recognition, discriminative training, and signal analysis.

Suman Ravuri is currently pursuing the Ph.D. degree in the Electrical Engineering and Computer Sciences Department, University of California, Berkeley.
He is with the International Computer Science Institute (ICSI), Berkeley, CA, working on automatic speech recognition.

Wen Wang (M'98) received the B.S. degree in electrical engineering and the M.S. degree in computer engineering from Shanghai Jiao Tong University, Shanghai, China, in 1996 and 1998, respectively, and the Ph.D. degree in computer engineering from Purdue University, West Lafayette, IN, in 2003. She is currently a Research Engineer in the Speech Technology and Research Laboratory, SRI International, Menlo Park, CA. Her research interests are in statistical language modeling, speech recognition, machine translation, natural language processing and understanding, and machine learning. She has authored or coauthored over 50 research papers and has served as a reviewer for over 10 journals and conferences. She is a member of the Association for Computational Linguistics.