PDF hosted at the Radboud Repository of the Radboud University Nijmegen The following full text is an author's version which may differ from the publisher's version. For additional information about this publication click this link. http://hdl.handle.net/2066/74977 Please be advised that this information was generated on 2018-11-19 and may be subject to change.

In: Proc. ICSLP-96, pp. 2332-2335, 1996

COMPARISON OF CHANNEL NORMALISATION TECHNIQUES FOR AUTOMATIC SPEECH RECOGNITION OVER THE PHONE

Johan de Veth (1) & Louis Boves (1,2)
(1) Department of Language and Speech, University of Nijmegen, P.O. Box 9103, 6500 HD Nijmegen, THE NETHERLANDS
(2) KPN Research, P.O. Box 421, 2260 AK Leidschendam, THE NETHERLANDS

ABSTRACT

We compared three different channel normalisation (CN) methods in the context of a connected digit recognition task over the phone: cepstrum mean subtraction (CMS), RASTA filtering and the Gaussian dynamic cepstrum representation (GDCR). Using a small set of context-independent (CI) continuous Gaussian mixture hidden Markov models (HMMs) we found that CMS and RASTA outperformed the GDCR technique. We show that the main cause for the superiority of CMS compared to RASTA is the phase distortion introduced by the RASTA filter. Recognition results for a phase-corrected RASTA technique are identical to those of CMS. Our results indicate that an ideal cepstrum-based CN method should (1) effectively remove the DC component, (2) at least preserve modulation frequencies in the range 2-16 Hz and (3) introduce no phase distortion in case CI HMMs are used for recognition.

1. INTRODUCTION

For automatic speech recognition over telephone lines it is well known that recognition performance can be seriously degraded by the transfer characteristics of the communication channel. In order to reduce the influence of the linear filtering effect of the telephone handset and telephone line, different channel normalisation (CN) techniques have been proposed [for example 1,2,3,4]. Several studies addressed the question of the relative effectiveness of different CN approaches [for example 5,6]. These studies were often limited to the extent that it was only established which CN technique was to be preferred. In this paper, we focus on the question why one approach is preferred over another.
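The premise behind all three CN techniques, namely that a fixed linear channel becomes an additive, utterance-independent offset in the log power spectrum (and hence in the cepstrum), can be checked with a small numerical sketch. The channel impulse response below is a toy example, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_power_spectrum(x):
    """Log power spectrum; a small floor avoids log(0)."""
    return np.log(np.abs(np.fft.rfft(x)) ** 2 + 1e-12)

def through_channel(x, h):
    """Send an 'utterance' x through a fixed linear channel h.
    Circular convolution is used so that Y(w) = X(w) * H(w) holds exactly."""
    H = np.fft.rfft(h, len(x))
    return np.fft.irfft(np.fft.rfft(x) * H, len(x))

h = np.array([1.0, -0.5, 0.25, 0.1])   # toy channel impulse response

# Two different utterances through the same channel: the channel shows
# up as the SAME additive offset log|H(w)|^2 in both log spectra.
x1, x2 = rng.standard_normal(512), rng.standard_normal(512)
d1 = log_power_spectrum(through_channel(x1, h)) - log_power_spectrum(x1)
d2 = log_power_spectrum(through_channel(x2, h)) - log_power_spectrum(x2)
```

Because the offset is the same for every utterance, it can be removed either by subtracting a long-term average (the idea behind CMS) or by high-pass filtering the trajectories of spectral parameters over time (the idea behind RASTA).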
We studied three different CN techniques in the context of a connected digit recognition task: cepstrum mean subtraction (CMS) [1], RASTA filtering [2,3], and the Gaussian dynamic cepstrum representation (GDCR) [4]. For this task, we used hidden Markov models (HMMs) with Gaussian mixture densities describing the output probability density function of each state. Because we focused attention on the question of what makes a CN technique a successful one, we did not investigate the use of different types of acoustic parameter representations. Rather, we restricted ourselves to mel-frequency cepstral coefficients, log energy and their first time-derivatives.

This paper is further organised as follows. In section 2 we describe our feature extraction method. Next, in section 3, the telephone database that we used for our experiments is discussed. The topology of the HMMs, the way we performed training with cross-validation and the recognition syntax during testing are described in section 4. The recognition experiments are discussed in section 5. We will focus on the phase distortion introduced by the RASTA technique, as this is the key difference between RASTA and CMS. We will show that removal of the phase distortion of the RASTA filter leads to a significant increase of recognition performance when using CI HMMs. Finally, in section 6 we sum up the main conclusions.

2. SIGNAL PROCESSING

Speech signals were digitized at 8 kHz and stored in A-law format. After conversion to a linear scale, pre-emphasis with factor 0.98 was applied. A 25 ms Hamming analysis window, shifted in 10 ms steps, was used to calculate 24 filterband energy values for each frame. The 24 triangular filters were uniformly distributed on a mel-frequency scale. Finally, 12 mel-frequency cepstral coefficients (MFCCs) were derived. We did not apply liftering, because we were using continuous Gaussian mixture density HMMs with diagonal covariance matrices [7].
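Feature extraction in the paper was done with HTK v1.4 (see below); purely as an illustration, a minimal NumPy front-end with the same nominal settings (0.98 pre-emphasis, 25 ms Hamming window, 10 ms shift, 24 mel filters, 12 cepstral coefficients at 8 kHz) might look as follows. The mel formula and DCT convention are generic textbook choices, not necessarily HTK's exact ones:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=24, n_fft=256, fs=8000):
    """24 triangular filters spaced uniformly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fb[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fb[i - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mfcc(signal, fs=8000, n_ceps=12):
    """Pre-emphasis (0.98), 25 ms Hamming frames every 10 ms, 24 mel
    filterband energies, then cepstral coefficients 1..12 via a DCT-II
    of the log energies (c0 is left out, as logE is used separately)."""
    emph = np.append(signal[0], signal[1:] - 0.98 * signal[:-1])
    win, hop, n_fft = int(0.025 * fs), int(0.010 * fs), 256
    fb = mel_filterbank(24, n_fft, fs)
    frames = []
    for start in range(0, len(emph) - win + 1, hop):
        frame = emph[start:start + win] * np.hamming(win)
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        logE = np.log(fb @ power + 1e-10)
        n = np.arange(24)
        ceps = np.array([np.sum(logE * np.cos(np.pi * q * (2 * n + 1) / 48.0))
                         for q in range(1, n_ceps + 1)])
        frames.append(ceps)
    return np.array(frames)
```

Liftering is omitted here as well, matching the paper's choice for diagonal-covariance Gaussian mixture HMMs.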
In addition to the twelve MFCCs we also used their first time-derivatives (delta-MFCCs), log energy (logE) and its first time-derivative (delta-logE). In this manner we obtained 26-dimensional feature vectors. Feature extraction was done using HTK v1.4 [8]. In this paper, we applied the three CN techniques to the twelve MFCC coordinates of the feature vector: either RASTA with integration factor 0.98 [2,3], the GDCR approach [4], or CMS [1]. We kept the original values of the delta-MFCCs, logE and delta-logE.

3. DATABASE

The speech material for this experiment was taken from the Dutch POLYPHONE corpus [9]. Speakers were recorded over the public switched telephone network in the Netherlands. Handset and channel characteristics are not known; especially handset characteristics are known to vary widely. Among other things, the speakers were asked to read a connected digit string containing six digits. We divided this set of digit strings into two parts. For training we reserved a set of 960 strings, i.e. 80 speakers (40 females and 40 males) from each of the 12 provinces in the Netherlands (denoted trn960 in short). An independent set of 911 utterances (tst911; 461 females, 450 males) was set apart for testing. (In principle we again wanted to have 40 female and 40 male speakers from each of the 12 provinces, but the very sparsely populated province of Flevoland provided only 21 female and 10 male test speakers.) For proper initialisation of the models, we manually corrected automatically generated begin- and endpoints of each utterance in the trn960 data set. We did not always use all training and testing material. Most of the time, we used only half the amount of training data (i.e. 480 utterances, trn480; 240 females, 240 males). For cross-validation during training we used a subset of 240 utterances taken from the test set (tst240; 120 females, 120 males). For evaluation of the models when training was completed we always used the full test set tst911. The number of available realisations of each digit in each of our data sets is listed in columns 3 to 6 of Table 1.

Table 1: Phonemic transcriptions (column 2) and the number of realisations (columns 3 to 6) of each digit.

digit   transcription  trn960  trn480  tst911  tst240
nul     n Y l             590     294     548     136
een     e n               590     286     562     165
twee    t w e             591     296     597     181
drie    d r i             597     299     574     155
vier    v i r             569     284     523     135
vijf    v Ei f            573     273     526     124
zes     z E s             578     301     536     136
zeven   z e v Q n         582     270     510     130
acht    a x t             554     297     525     151
negen   n e x Q n         534     281     556     121

4. MODELS

4.1. Model topology

The digit set of the Dutch language was described using 18 context-independent (CI) phone models (see the second column of Table 1). Furthermore, we used four models to describe silence, very soft background noise, other background noise and out-of-vocabulary speech, respectively. Each CI model consists of a three-state, left-to-right HMM, where only self-loops and transitions to the next state are allowed. The emission probability density functions are described as a continuous mixture of 26-dimensional Gaussian probability density functions (diagonal covariance matrices). In order to be able to study the recognition performance as a function of acoustic resolution, we used mixtures containing 1, 2, 4, 8, 16 and 32 Gaussians for the emission probability density function of each state.

4.2. Training and recognition

The CI phone models were initialised starting from a linear segmentation within the boundaries taken from the hand-validated word segmentations. After this initialisation, embedded Baum-Welch re-estimation was used to further train the models. Starting with a single Gaussian emission probability density function for each state, 20 Baum-Welch iterations were conducted; the models resulting from each iteration cycle were stored. Next, the optimal number of iterations was determined using the tst240 data set. For the set of models with the best recognition rate, the number of Gaussians was doubled and again 20 embedded Baum-Welch re-estimation iterations were performed. This process of training with cross-validation was repeated until models with 32 Gaussians per state were obtained.

During cross-validation as well as during recognition with data set tst911, the recognition syntax allowed for zero or more occurrences of either silence, very soft background noise, other background noise or out-of-vocabulary speech in between each pair of digits. At the beginning and at the end of the digit string, one or more occurrences of either silence, very soft background noise, other background noise or out-of-vocabulary speech were allowed.

5. EXPERIMENTS

5.1. Comparison of three CN methods

We trained models with up to 32 Gaussians per state using data set trn480. Four different sets of feature vectors were used to assess the effectiveness of CN: no CN, RASTA CN, GDCR CN and CMS CN. The best performing model sets according to the cross-validation data set tst240 were evaluated using test set tst911. The proportion of digits correct (i.e. the number of digits correctly recognized divided by the total number of digits in the test set) is shown as a function of the number of Gaussians per state in Figure 1. For the amount of test digits that we used, the confidence interval is at a proportion of digits correctly recognized of respectively.

Figure 1: Recognition performance for four CN approaches: x = CMS, O = RASTA, + = GDCR, * = no CN.

Figure 1 clearly indicates that CN improves the recognition performance for each acoustic resolution that we tested. The improvements relative to the system without CN are significant at the 95% confidence level in the case of RASTA and CMS, but they are not for GDCR. Notice further that the recognition performance increases monotonically as a function of the acoustic resolution in all four cases. Note, however, that in all cases the improvements are not significant for 16 and 32 Gaussians per state. In other words, 8 Gaussians per state appears to be sufficient for our connected digit recognition task. As a consequence, two different regions may be discerned on the acoustic resolution scale according to Figure 1. In the region up to 8 Gaussians per state, recognition performance may be increased by either increasing acoustic resolution or applying a CN technique like RASTA or CMS. Above 8 Gaussians per state, however, increasing acoustic resolution does not result in any significant performance increase, whereas CN is still effective.

Using the RASTA filtered acoustic feature vectors, we conducted an experiment to verify that we used enough training data. To this aim, models were trained with the trn960 data set. We did not observe a significant change in recognition performance. Therefore, we concluded that data set trn480 was indeed large enough.

5.2. RASTA vs. CMS in the time domain

According to the results in Figure 1, the CN techniques that we studied can be ordered as follows: CMS > RASTA > GDCR > no CN, where the symbol > indicates better CN effectiveness. The question is now of course: how can we understand this ordering? In [7], we argued that the RASTA filter frequency response preserves modulation frequencies in the maximally sensitive region of human auditory perception (2-16 Hz, [10]) much better than GDCR, especially in the region below 5 Hz. This preservation of modulation frequencies may very well explain the superiority of CMS and RASTA over GDCR. In order to see what causes the difference in recognition performance between RASTA and CMS (which is significant at the 95% level for systems with 2 and 4 Gaussians per state), we will take a detailed look at the effects of both techniques in the time domain. We consider the signal shown in the upper panel of Figure 2 (we took a synthetic signal instead of a real MFCC coordinate time series for didactic purposes).
The signal is a sequence of seven stationary segments ("speech states") preceded and followed by a rest state ("silence"). Notice that the signal contains a constant overall DC component (representing the effect of the communication channel). The RASTA filtered version of this signal is shown in the middle panel of Figure 2. Two important observations can be made. First, the DC component has been effectively removed (at least for times larger than, say, 70 frames). Second, the shape of the signal has been altered.

With regard to the shape distortion we remark the following. First, the seven speech states of the signal that had a constant amplitude are no longer stationary. Instead, the amplitude for each state shows a tendency to drift towards zero. Thus, RASTA filtering steadily decreases the value of cepstral coefficients in stationary parts of the speech signal, while the values immediately after an abrupt change are preserved. This explains the observation that the dynamic parts in the spectrogram of a speech signal are enhanced by RASTA filtering the cepstral coefficients [3,6]. As a consequence of this drift, however, a description of the signal in terms of stationary states with well-located means and small variances becomes less accurate. Second, the mean amplitude of each state has become a function of the state itself as well as of the amplitudes of the states immediately preceding it. This is the well-known left-context dependency introduced by the RASTA filter [3,11]. Because the absolute ordering of signal amplitudes is lost, states can no longer be straightforwardly identified by their mean amplitude (compare speech states two, four and seven before and after RASTA filtering in the upper and middle panels of Figure 2). For this reason, RASTA is less well suited when using CI models [cf. the remarks in 11]. Finally, we mention a third aspect of the shape distortion for completeness (which we feel is much less important though).
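The first two time-domain effects, removal of the DC component and drift towards zero in stationary segments, can be reproduced by implementing the RASTA filter as a difference equation. This sketch assumes the commonly cited RASTA transfer function H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1) with the paper's integration factor 0.98, and uses a single speech state instead of seven:

```python
import numpy as np

def rasta_filter(x, pole=0.98):
    """RASTA band-pass filter applied to one cepstral trajectory,
    implemented directly as a difference equation:
        y[n] = pole*y[n-1] + 0.1*(2x[n] + x[n-1] - x[n-3] - 2x[n-4])."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(5) if n - k >= 0)
        if n > 0:
            acc += pole * y[n - 1]
        y[n] = acc
    return y

# Synthetic trajectory: "silence" at level 1.0 (the DC offset of the
# channel), one stationary "speech state" at level 3.0, then silence.
x = np.concatenate([np.full(50, 1.0), np.full(100, 3.0), np.full(50, 1.0)])
y = rasta_filter(x)
# Immediately after the jump at frame 50 the change passes through
# (scaled by b[0]); deep inside the stationary segment the output has
# drifted towards zero with the pole factor 0.98 per frame.
```

Because the numerator coefficients sum to zero, the DC gain is exactly zero, which is why the channel offset disappears while abrupt changes are preserved.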
Due to the small attenuation of high-frequency components, abrupt amplitude changes are smoothed.

Figure 2: Synthetic signal representing one of the cepstral coefficients in the feature vector. Upper panel: original signal containing a time-invariant DC offset. Middle panel: RASTA filtered signal. Lower panel: phase-corrected RASTA filtered signal.

CMS has only one effect in the time domain: the DC component is removed while the signal shape is exactly preserved (the signal is simply shifted as a whole). So the significant difference in performance between CMS and RASTA might be explained by the preservation of shape in the time domain when CMS is used.

5.3. Phase correction for RASTA

In order to test this, we conducted a recognition experiment with an extended version of the RASTA filtering technique. We used the method described in [12] to apply a phase correction to each MFCC coefficient after the RASTA filter was applied. We chose the phase correction such that the frequency-dependent phase shift of the RASTA filter was exactly compensated, while at the same time the original magnitude response of the RASTA filter was preserved by using an all-pass filter. The effect of the phase correction is shown in the lowest panel of Figure 2. As can be seen, the shape of the phase-corrected RASTA filtered signal resembles the shape of the original signal much better than the RASTA filtered signal does. The phase correction (1) removes the amplitude drift towards zero in stationary parts of the signal and (2) removes the left-context dependency. In other words, phase-corrected RASTA (1) does not feature enhanced spectral dynamics and (2) is probably better suited for CI modeling. We replaced the twelve MFCCs by twelve phase-corrected RASTA filtered MFCCs and trained new models using the same data sets trn480 and tst240 for training and cross-validation as before. Finally, we established the recognition performance using test set tst911. The results are shown in Figure 3, together with our previous results for CMS and RASTA. Figure 3 clearly shows that the performance of the phase-corrected RASTA features is identical to the CMS performance. Therefore, we conclude that our hypothesis was correct: the most important difference between CMS and RASTA is the phase distortion introduced by the RASTA filter, which is reflected in the time domain as a shape distortion of the signal. If the RASTA filter is adapted such that its phase distortion is exactly compensated while at the same time the original magnitude response is preserved, the recognition performance becomes identical to the performance for CMS.

Figure 3: Recognition results for CMS (x), RASTA (O) and phase-corrected RASTA.

We also conclude the following. It has often been suggested [3,6] that RASTA techniques provide better recognition performance because the spectral dynamics are enhanced. Our analysis shows that this enhancement is caused by the phase distortion of the RASTA filter. When we removed the phase distortion, we removed the enhancement of spectral dynamics. However, the recognition performance did not go down in our experiments (on the contrary). Therefore, the argument that the success of RASTA filtering techniques should be attributed to the enhancement of spectral dynamics should be reconsidered. Our experiments suggest that removal of the DC component is the most important feature of RASTA.
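Hunt's method [12], used above, applies an analogue all-pass correction after the RASTA filter. As a rough numerical stand-in (an assumption of this sketch, not the paper's implementation), one can impose the same two requirements directly in the frequency domain: keep the RASTA magnitude response |H(w)| but apply it with zero phase:

```python
import numpy as np

def zero_phase_rasta(x, pole=0.98):
    """Apply the RASTA magnitude response with zero phase shift.

    NOTE: this is a sketch, not the all-pass compensator of [12]; it
    realises the same stated goal (original magnitude response, no
    phase distortion) by multiplying with |H(w)| in the FFT domain.
    """
    n = len(x)
    n_fft = 1 << (2 * n - 1).bit_length()   # pad against circular wrap-around
    w = np.exp(-2j * np.pi * np.arange(n_fft) / n_fft)
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    num = sum(bk * w ** k for k, bk in enumerate(b))
    H = num / (1.0 - pole * w)              # RASTA frequency response
    X = np.fft.fft(x, n_fft)
    return np.fft.ifft(X * np.abs(H)).real[:n]
```

The effective impulse response of such a filter is symmetric in time, so it has no asymmetric drift and no left-context dependency, consistent with the behaviour shown for phase-corrected RASTA in the lower panel of Figure 2.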
Finally, taking our findings for CMS, RASTA and GDCR together, we can formulate three constraints that an ideal cepstrum-based CN technique should satisfy: (1) the DC component should be effectively removed, (2) the magnitude response should be preserved in the range 2-16 Hz, which is the maximally sensitive region of human auditory perception, and (3) the technique should not introduce any phase distortion when combined with CI modeling.

6. CONCLUSIONS

We compared three different CN methods in the context of a connected digit recognition task over the phone. Using a small set of CI continuous Gaussian mixture HMMs, we showed that CMS and RASTA outperform the GDCR technique. Furthermore, we showed that the main cause for the superiority of CMS compared to RASTA is the phase distortion introduced by the RASTA filter. The recognition results for a phase-corrected RASTA technique were identical to those of CMS. Our results suggest that the ability of RASTA to effectively remove the DC component is more important than the enhancement of spectral dynamics.

Acknowledgement

This work was funded by the Netherlands Organisation for Scientific Research (NWO) as part of the NWO Priority Programme Language and Speech Technology.

References

[1] S. Furui, "Cepstral analysis technique for automatic speaker verification", IEEE Trans. Acoust. Speech Signal Process., ASSP-29, pp. 254-272, 1981.
[2] H. Hermansky, N. Morgan, A. Bayya & P. Kohn, "Compensation for the effect of the communication channel in auditory-like analysis of speech", in Proc. Eurospeech-91, Genova, Sept. 1991.
[3] H. Hermansky & N. Morgan, "RASTA processing of speech", IEEE Trans. Speech Audio Processing, 2(4), pp. 578-589, 1994.
[4] K. Aikawa, H. Singer, H. Kawahara & Y. Tohkura, "A dynamic cepstrum incorporating time-frequency masking and its application to continuous speech recognition", in Proc. ICASSP-93, pp. 668-671, 1993.
[5] J-C. Junqua, D. Fohr, J-F. Mari, T.H. Applebaum & B.A. Hanson, "Time derivatives, cepstral normalisation and spectral parameter filtering for continuously spelled names over the telephone", in Proc. Eurospeech-95, pp. 1385-1388, 1995.
[6] H. Singer, K.K. Paliwal, T. Beppu & Y. Sagisaka, "Effect of RASTA-type processing for speech recognition with speaking-rate mismatches", in Proc. Eurospeech-95, pp. 487-490, 1995.
[7] P. Boda, J. de Veth & L. Boves, "Channel normalisation by using RASTA filtering and the dynamic cepstrum for automatic speech recognition over the phone", to appear in Proc. ESCA Workshop on the Auditory Basis of Speech Perception, Keele, July 1996.
[8] S. Young & P. Woodland, HTK v1.4 User Manual, Speech Group, Cambridge University Engineering Department, UK, 1992.
[9] E.A. den Os, T.I. Boogaart, L. Boves & E. Klabbers, "The Dutch Polyphone corpus", in Proc. Eurospeech-95, pp. 825-828, 1995.
[10] R. Drullman, J.M. Festen & R. Plomp, "Effect of temporal envelope smearing on speech reception", J. Acoust. Soc. Am., vol. 95, pp. 1053-1064, 1994.
[11] J. Cohen, "Final report of the chairman", Frontiers of Speech Processing - Robust Speech Recognition 93, 1993.
[12] M.J. Hunt, "Automatic correction of low-frequency phase distortion in analogue magnetic recordings", Acoustic Letters, vol. 32, pp. 6-10, 1978.