Using RASTA in task independent TANDEM feature extraction


IDIAP Research Report IDIAP-RR 04-22, April 2004

Using RASTA in task independent TANDEM feature extraction

Guillermo Aradilla (a), John Dines (a), Sunil Sivadas (a,b)

(a) IDIAP, Dalle Molle Institute for Perceptual Artificial Intelligence, P.O. Box 592, Martigny, Valais, Switzerland
    phone +41 27 721 77 11, fax +41 27 721 77 12, e-mail secretariat@idiap.ch, internet http://www.idiap.ch
(b) OGI School of Science and Engineering

Abstract. In this work, we investigate the use of the RASTA filter in the TANDEM feature extraction method when the TANDEM probability estimator is trained on task-independent data. The RASTA filter removes the linear distortion introduced by the communication channel, which is demonstrated by an 18% relative improvement on the Numbers 95 task. Combining TANDEM features with conventional PLP features yields a further gain, for a 35% relative improvement over the basic PLP features.

1 Introduction

Automatic Speech Recognition (ASR) systems are basically formed by two main subsystems: feature extraction and pattern classification. Feature extraction obtains a representation of speech (feature vectors) which must carry enough information for the pattern classification subsystem to be able to differentiate the sub-word units, e.g. phones. The classifier uses a Hidden Markov Model (HMM) where the mapping from the hidden states to the acoustic observations, i.e. feature vectors, is typically modeled by a Gaussian Mixture Model (GMM) [1] or a Multi Layer Perceptron (MLP) [2].

Traditionally, feature vectors have been based on the short-term spectrum, but other kinds of feature vectors that try to emphasize temporal properties of speech have also been investigated [3]. One of these newer feature extraction methods is the TANDEM approach [4], where an MLP is used to estimate context-independent phone posterior probabilities. TANDEM features are, in principle, capable of extracting and using speech-specific but task-independent knowledge from the development data: the development data for training the TANDEM probability estimator does not have to be directly related to the recognition task on which TANDEM is to be applied. However, since the TANDEM module is trained on separate development data, it acquires (as any other classifier would) peculiarities of this data, so the more similar the development data and the target application data are, the better the TANDEM approach performs. A general, i.e. task-independent, version of the neural-net stage would nevertheless be desirable because of the effort required to build it in terms of time and hardware resources.
In order to obtain a TANDEM feature extractor that is as general as possible, we can either use development data that contains all of the anticipated sources of nonlinguistic variability, or we can pre-process the development data in order to at least alleviate the harmful mismatch between the development data and the target application. In this work, we investigate the second option by applying the RASTA filter [5] to the basic features. This approach is particularly effective when the development data and the task-specific data are recorded over different communication channels, since the RASTA filter removes the linear distortion produced by the recording environment. As others have shown [6], merging multiple feature vectors extracted from different context lengths can be beneficial. In this work, we also combine TANDEM features obtained from our MLP trained on task-independent data with conventional PLP features, obtaining improved accuracy over the TANDEM system alone.

We begin in Section 2 with a description of the RASTA filter, followed by a brief overview of the TANDEM approach in Section 3. In Section 4 we discuss the combination of acoustic features. Section 5 then describes the experimental setup, Section 6 presents the results obtained, and finally, some conclusions are given in Section 7.

2 RASTA filter

PLP [7], like all conventional feature extraction methods, is based on the short-term spectrum of speech. Such features are highly vulnerable to modification of the spectrum by the frequency response of the communication channel. The RASTA filter [5] replaces the common short-term spectrum by a spectral estimate in which each frequency channel is band-pass filtered across time by a filter with a sharp spectral zero at zero frequency. In this way, it is possible to remove the linear distortion that is usually introduced by the recording environment. RASTA filtering therefore seems to be a good strategy for normalizing databases recorded in different environments.
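As a concrete sketch of this band-pass filtering, the standard RASTA transfer function H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1) (with the pole at z = 0.98 used in Section 5) can be applied along the time axis of each frequency channel of the log spectrum. This is our own illustration, implemented causally without the usual delay compensation, not code from the report:

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_spectrum, pole=0.98):
    """Band-pass filter each frequency channel of a log-spectral
    trajectory across time (frames along axis 0).

    The numerator is a sliding slope whose coefficients sum to zero,
    giving a zero at z = 1 (DC); the pole controls the low-frequency
    roll-off of the band-pass response.
    """
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # zero at DC
    a = np.array([1.0, -pole])
    return lfilter(b, a, log_spectrum, axis=0)

# A time-invariant channel adds a constant offset in the log-spectral
# domain; since the filter has zero gain at DC, that offset decays away.
frames = np.ones((400, 15))      # 400 frames, 15 frequency channels
filtered = rasta_filter(frames)
```

Because convolutional channel distortion becomes additive in the log-spectral domain, the zero at DC is exactly what discards a fixed channel response.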
In this work, we use the TIMIT corpus as task-independent training data, which has been recorded over a microphone channel, and the Numbers corpus as task-specific data, which has been recorded over a telephone channel. With RASTA filtering, the knowledge obtained by the MLP from the TIMIT corpus becomes more compatible with the Numbers corpus.

3 TANDEM approach

In the TANDEM approach [4], basic features are provided as input to an MLP, and its processed output is used as input to a GMM/HMM based classifier; in this way, two kinds of acoustic models are used in sequence (MLP and GMM). The MLP is trained to estimate posterior probabilities using the Maximum A Posteriori (MAP) criterion, hence providing more discriminative features to the GMM/HMM system, which is trained with the Maximum Likelihood (ML) criterion. The MLP can also extract more information about the temporal properties of speech, both because it takes a larger context of frames as input and because of the non-linear transformation it performs, which is more general than the linear weighting used to compute conventional dynamic features, e.g. delta features [8]. It is necessary to take the logarithm of the MLP output, followed by a PCA decorrelation; this ensures compatibility of the feature vectors with the GMM/HMM classifier, which assumes decorrelated and Gaussian-like features.

The goal of this work is to build a TANDEM feature extractor that is independent of the task of the system. The MLP is trained with a database that is not specific to any task but contains the variability that is encountered in the test condition. We have chosen the TIMIT database [9] for this purpose, which has the added advantage of accurate phonetic transcriptions for the training of the MLP. This database is used to train the MLP, and the Numbers corpus is used to train and test the GMM/HMM system. By using a different corpus to train the MLP than that used to train the GMM classifier, we are adding more information to the system. In order to minimize the differences between the TIMIT and Numbers corpora, the RASTA filter is applied in the implementation of the PLP feature extraction. Figure 1 shows the block diagram of our TANDEM feature extraction scheme.
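Before turning to the block diagram, the log-and-PCA step just described can be sketched as follows. The function is our own generic illustration: it estimates the PCA basis from the data it is given, whereas in the report the PCA is trained on the Numbers corpus (and, per Section 5, without dimensionality reduction):

```python
import numpy as np

def tandem_features(posteriors, eps=1e-10):
    """Convert MLP phone-posterior frames (N x K, rows summing to 1)
    into TANDEM features: take the log to Gaussianize the highly
    skewed posteriors, then apply PCA to decorrelate them for a
    diagonal-covariance GMM/HMM back end."""
    logp = np.log(posteriors + eps)               # eps guards against log(0)
    centered = logp - logp.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues
    basis = eigvecs[:, np.argsort(eigvals)[::-1]] # largest variance first
    return centered @ basis

# Softmax posteriors are correlated across phones; after log + PCA
# the feature covariance is (numerically) diagonal.
rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 41))                  # 41 phone classes, as in Section 5
post = np.exp(raw) / np.exp(raw).sum(axis=1, keepdims=True)
feats = tandem_features(post)
```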
Figure 1: Block diagram of the feature extraction scheme used in this work: speech -> RASTA-PLP -> MLP (trained with TIMIT) -> log posterior probabilities -> PCA -> TANDEM features -> GMM classifier (trained with Numbers) -> sequence of words.

4 Feature Combination

As the first stage in any speech recognition system, features are critical to overall system performance. The ideal features reflect the relevant information in the speech signal, in our case the phonetic variation, while minimizing or eliminating irrelevant information, such as speaker identity or background conditions. A wide variety of features has been proposed and employed, each with different strengths and weaknesses. There are three basic approaches to combination in speech recognition systems: feature combination (e.g. [10]), posterior combination (e.g. [11]) and hypothesis combination (e.g. [12]). In this work, we apply the first approach by combining basic PLP features and log posterior probabilities. These two feature extraction methods use the same speech signal, but they extract information in different manners, so they may be well suited to combination: the former features are based on the short-term spectrum, while the latter use a larger context and perform a non-linear transformation to obtain posterior probabilities. Stream combination is a technique which attempts to capitalize upon the differences in information carried by feature streams. The basic argument is that if the recognition errors of systems using the individual streams occur at different points, there is at least a chance that the combined system will be able to correct some of these errors by reference to the other streams.
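The feature-level combination used here (concatenate the two streams, then decorrelate the joint vector with a single PCA) can be sketched as below. The helper is hypothetical, and frame alignment between the two streams is assumed to be handled upstream:

```python
import numpy as np

def combine_streams(log_posteriors, plp, n_keep=None):
    """Concatenate per-frame log posterior vectors with PLP(+delta)
    vectors, then apply one PCA to the joint vector, so the two
    streams are decorrelated jointly rather than separately."""
    joint = np.hstack([log_posteriors, plp])   # N x (41 + 26) = N x 67
    centered = joint - joint.mean(axis=0)
    # PCA via SVD of the centered data matrix (right singular vectors)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt.T if n_keep is None else vt[:n_keep].T
    return centered @ basis

rng = np.random.default_rng(1)
log_post = rng.normal(size=(200, 41))          # stand-in MLP log posteriors
plp = rng.normal(size=(200, 26))               # stand-in PLP + delta features
combined = combine_streams(log_post, plp)      # 200 x 67 decorrelated features
```

Applying the PCA after concatenation, rather than to the log posteriors alone, also removes any linear redundancy between the two streams.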

Also, the MLP used for the TANDEM features has been trained with a different and more general corpus, so the TANDEM features incorporate information that is not contained in the basic features. The method of combination applied in this work is the concatenation of features: instead of applying the PCA transform to the log posterior probabilities alone, it is applied to the concatenated feature vector, as can be seen in Figure 2.

Figure 2: Block diagram of the feature extraction for the combined TANDEM system: speech -> RASTA-PLP -> MLP (trained with TIMIT) -> log posterior probabilities -> concatenation with PLP features -> PCA -> GMM classifier (trained with Numbers) -> sequence of words. Note that the MLP is trained with TIMIT but the PCA and the classifier are trained with the Numbers corpus. We use the RASTA filter with PLP because of the different corpora used with this system.

5 Experiment Description

PLP and RASTA-PLP feature vectors with 13 dimensions are extracted using the algorithms presented in the original papers [7] [5] (the RASTA filter has been applied with a pole at z = 0.98). Their delta features are concatenated to form a 26-dimension vector. We train the MLP on the TIMIT database using the cross-entropy error criterion on 41 context-independent phones. We use 3696 training files, a tenth of which is used as validation set. The MLP has one hidden layer with 500 units. The input consists of 6 frames of left and right context (13 frames x 26 dimensions = 338 input units). The output is a 41-dimension vector (each output unit corresponds to a context-independent phone). For compatibility with the Numbers corpus, the TIMIT recordings are downsampled to 8 kHz. The PCA and the GMM/HMM classifier are trained on the Numbers corpus. We use 6049 files for training and 2061 files for testing. PCA is computed without any dimensionality reduction. The Numbers corpus contains 31 different words.
The GMM/HMM classifier has been implemented with HTK [13], using an HMM for each context-dependent phone with 3 emitting states and 12 mixtures per state. The following experiments have been carried out:

System 1: PLP feature vectors with 26 dimensions are directly used as inputs for the GMM/HMM classifier.

System 2: The MLP is fed with 13 frames of PLP feature vectors and its output is used as input for the GMM/HMM classifier.

System 3: Similar to the previous experiment, except that the RASTA-PLP implementation is used instead of PLP.

6 Results

We conducted the experiments described in Section 5. The results are presented in Table 1.

Experiments  Dimension  WER
System 1     26         6.8%
System 2     41         6.6%
System 3     41         5.4%

Table 1: WER of the experimental systems, showing the effect of RASTA channel normalization. The column Dimension indicates the number of elements contained in each feature vector.

We can see in Table 1 that the RASTA filter is effective, displaying a relative improvement of 18% in WER of System 3 over System 2. There is also an improvement over System 1, which uses conventional PLP features. We have also investigated different combinations of TANDEM features and PLP features, varying the option of applying the RASTA filter. The following experiments were carried out:

System 4: Concatenation of the log posterior probabilities derived from the output of the MLP with PLP as input features, and PLP features.

System 5: Concatenation of the log posterior probabilities derived from the output of the MLP with PLP as input features, and RASTA-PLP features.

System 6: Concatenation of the log posterior probabilities derived from the output of the MLP with RASTA-PLP as input features, and RASTA-PLP features.

System 7: Concatenation of the log posterior probabilities derived from the output of the MLP with RASTA-PLP as input features, and PLP features (Figure 2).

Features   Dimension  WER
System 4   67         6.0%
System 5   67         6.0%
System 6   67         4.9%
System 7   67         4.4%

Table 2: WER of the experimental systems testing the different combination strategies. Again, the column Dimension indicates the number of parameters contained in each feature vector; in this case all have 67 dimensions (41 + 26).

As Table 2 shows, TANDEM features can work well when concatenated with conventional short-term based PLP features. The use of RASTA seems to be beneficial only in those cases where channel normalization is necessary; thus, it does not improve accuracy when used with task-specific training data.
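The relative improvements quoted in this section follow directly from the WER columns of Tables 1 and 2; as a quick check (our own arithmetic, not code from the report):

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction, as a percentage of the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# System 3 vs System 2: effect of RASTA on the MLP input features
rasta_gain = relative_improvement(6.6, 5.4)   # about 18%
# System 7 vs System 1: combined features vs the plain PLP baseline
combo_gain = relative_improvement(6.8, 4.4)   # about 35%
```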
Consequently, the best combination is TANDEM features computed from RASTA-PLP, concatenated with PLP features, showing a 35% relative improvement over the basic system formed with PLP features (System 1).

7 Conclusions

In this paper we have presented a method for normalizing different databases in order to use them to obtain a task-independent TANDEM feature vector. The RASTA filter proves very successful for channel normalization of the features before input to the MLP, obtaining an 18% relative improvement in WER. We have also shown that the combination of TANDEM features and PLP features results in a further increase in accuracy, obtaining a 35% relative improvement. Though RASTA works well when used with

TANDEM because of its capability for channel normalization, it does not seem to achieve good performance in those cases where channel normalization is not necessary, i.e. when there is no interaction between different databases. Future work should focus on the relationship between the TANDEM feature extractor and the features with which it is trained, and on the use of task-dependent versus task-independent training data.

References

[1] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Prentice Hall, 1993.
[2] N. Morgan and H. Bourlard, "An introduction to hybrid HMM/connectionist continuous speech recognition," IEEE Signal Processing Magazine, vol. 12(3), pp. 25-42, May 1995.
[3] H. Hermansky and S. Sharma, "Temporal Patterns (TRAPS) in ASR of Noisy Speech," Proceedings of ICASSP, 1999.
[4] H. Hermansky, D. Ellis, and S. Sharma, "Tandem connectionist feature extraction for conventional HMM systems," Proceedings of ICASSP, June 2000.
[5] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Transactions on Speech and Audio Processing, 1994.
[6] S. Wu, B. Kingsbury, N. Morgan, and S. Greenberg, "Performance improvements through combining phone- and syllable-length information in automatic speech recognition," Proceedings of ICSLP, pp. 854-857, 1998.
[7] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," Journal of the Acoustical Society of America, 1990.
[8] S. Furui, "Speaker-Independent Isolated Word Recognition Using Dynamic Features of Speech Spectrum," 1986.
[9] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, and N. Dahlgren, "DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM," National Institute of Standards and Technology, 1990.
[10] A. Morris, A. Hagen, H. Glotin, and H. Bourlard, "Multi-stream adaptive evidence combination for noise robust ASR," Speech Communication, 2001.
[11] A. Janin, D. Ellis, and N. Morgan, "Multi-stream speech recognition: Ready for prime time?," Proceedings of Eurospeech, 1999.
[12] G. Evermann and P. Woodland, "Posterior Probability Decoding, Confidence Estimation and System Combination," Proceedings of the NIST Speech Transcription Workshop, 2000.
[13] S. Young, "The HTK Hidden Markov Model Toolkit: Design and Philosophy," tech. rep., Cambridge University, 1993.