RASTA-PLP SPEECH ANALYSIS. Aruna Bayya. Phil Kohn y TR December 1991


RASTA-PLP SPEECH ANALYSIS

Hynek Hermansky*, Nelson Morgan†, Aruna Bayya*, Phil Kohn†

TR-91-069, December 1991

Abstract

Most speech parameter estimation techniques are easily influenced by the frequency response of the communication channel. We have developed a technique that is more robust to such steady-state spectral factors in speech. The approach is conceptually simple and computationally efficient. The new method is described, and experimental results are reported, showing a significant advantage for the proposed method.

* US West Advanced Technologies, 4001 Discovery Drive, Boulder, CO 80303
† International Computer Science Institute, 1947 Center Street, Berkeley, CA 94704

1 INTRODUCTION

The Perceptual Linear Predictive (PLP) speech analysis technique [1] is based on the short-term spectrum of speech. Even though the short-term spectrum is subsequently modified by several psychophysically based spectral transformations, the PLP technique (like most other techniques based on the short-term spectrum) is vulnerable when the short-term spectral values are modified by the frequency response of the communication channel. Human speech perception seems to be less sensitive to such steady-state spectral factors [2]. We have developed the RelAtive SpecTrAl (RASTA) methodology [3][4], which makes PLP (and possibly also some other techniques based on the short-term spectrum) more robust to linear spectral distortions. Experimental results using telephone-quality isolated digits and high-quality continuous speech show significant improvements in error rate.

2 APPROACH

We have replaced the common short-term absolute spectrum by a spectral estimate in which each frequency channel is band-pass filtered by a filter with a sharp spectral zero at zero frequency. Since any constant or slowly varying component in each frequency channel is suppressed by this operation, the new spectral estimate is less sensitive to slow variations in the short-term spectrum. When the filtering is done in the logarithmic spectral domain, the suppressed constant spectral component reflects the effect of the convolutive factors in the input speech signal, introduced by the frequency characteristics of the communication media.

The steps of RASTA-PLP are as follows (see [1] for comparison with the conventional PLP method). For each analysis frame:

1) Compute the critical-band spectrum (as in PLP) and take its logarithm.
2) Estimate the temporal derivative of the log critical-band spectrum using a regression line through five consecutive spectral values [5].
3) Nonlinear processing (such as thresholding or median filtering) can be done in this domain. Currently, we do nothing here.
4) Re-integrate the log critical-band temporal derivative using a first-order IIR system. The pole position of this system can be adjusted to set the effective window size. Currently, we set this value to 0.98, providing an exponential integration window with a 3-dB point after 34 frames.
5) In accord with conventional PLP, add the equal loudness curve and multiply by 0.33 to simulate the power law of hearing.
6) Take the inverse logarithm (exponential function) of this relative log spectrum, yielding a relative auditory spectrum.
7) Compute an all-pole model of this spectrum, following the conventional PLP technique.

It can be shown that if the derivative of step 2 is estimated by a simple first difference, and if full integration is done in step 4 (pole at z = 1.0), then all intermediate terms cancel and the technique is equivalent to subtracting the log spectrum of the first analysis frame from each new frame. In this special case, the RASTA technique resembles the spectral subtraction or blind deconvolution techniques. In the general case presented here, however, the whole derivative-reintegration process is equivalent to band-pass filtering each frequency channel through an IIR filter with the transfer function

$$ H(z) = 0.1 \, \frac{2 + z^{-1} - z^{-3} - 2z^{-4}}{z^{-4} \, (1 - 0.98 z^{-1})} \qquad (1) $$

The low cut-off frequency of the filter determines the fastest spectral change of the log spectrum that is ignored in the output, while the high cut-off frequency determines the fastest spectral change that is preserved.
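Steps 2 and 4 and the filter of Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `rasta_filter` and `first_frame_subtraction` are hypothetical, the input is assumed to be an array of log critical-band energies with one row per frame, and the pure time-shift term z^{-4} in the denominator of Eq. (1) is dropped so that the filter is causal (this only shifts the output in time).

```python
import numpy as np
from scipy.signal import lfilter

# Numerator of Eq. (1): slope of a regression line through five
# consecutive values, scaled by 0.1 (step 2).
RASTA_B = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])


def rasta_filter(log_spec, pole=0.98):
    """Band-pass filter every critical-band trajectory along time.

    log_spec: array of shape (n_frames, n_bands) with log critical-band
    energies (assumed layout). The denominator implements the first-order
    re-integration of step 4 with the given pole.
    """
    a = np.array([1.0, -pole])
    return lfilter(RASTA_B, a, log_spec, axis=0)


def first_frame_subtraction(log_spec):
    """Special case: first difference followed by full integration (pole at 1)."""
    # The derivative of the first frame is taken to be zero.
    diff = np.diff(log_spec, axis=0, prepend=log_spec[:1])
    # Full integration: all intermediate terms cancel, leaving x[n] - x[0].
    return np.cumsum(diff, axis=0)
```

Because the numerator coefficients sum to zero, a constant input produces only a start-up transient that then decays at the rate set by the pole, which is the suppression of steady-state components described above.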

Figure 1: RASTA-PLP method. Processing chain: speech -> discrete Fourier transform -> logarithm -> filtering -> equal loudness curve -> power law of hearing -> inverse logarithm -> inverse discrete Fourier transform -> solving of set of linear equations (Durbin) -> cepstral recursion -> cepstral coefficients of RASTA-PLP model.

Linear distortions, as caused for example by the telecommunication channel or by using a different microphone, appear as an additive constant in the log spectrum. The high-pass portion of the equivalent band-pass filter is expected to alleviate the effect of the convolutional noise introduced in the channel. The low-pass filtering is expected to help smooth out some of the fast frame-to-frame spectral changes present in the short-term spectral estimate due to analysis artifacts. In Eq. (1), the low cut-off frequency is 0.26 Hz. The filter slope declines 6 dB/oct from 12.8 Hz, with sharp zeros at 28.9 Hz and at 50 Hz.

There is no special reason (except historical) for using the particular filter of Eq. (1). Also, the same filter need not be used for all frequency channels. Further, the filtering does not have to be band-pass or even linear.

The result is generally dependent on the starting point of the analysis. In our applications we always start the analysis well within the silent part that precedes speech. The whole RASTA-PLP process is illustrated in Fig. 1.

3 EXPERIMENTS WITH SMALL VOCABULARY ISOLATED TELEPHONE QUALITY SPEECH

This series of experiments was designed to evaluate the effect of a varying telephone network environment. The training data were recorded at the Bellcore facility in Morristown, NJ, and represented channel conditions in the New Jersey area. An isolated-utterance continuous-density HMM recognizer was used in the experiment. A database was formed by manually segmenting digits from connected utterances recorded over dialed-up telephone lines. 155 male and female speakers were used for the training of the recognizer. 5th order autoregressive models were adopted for both the PLP and RASTA-PLP techniques in this experiment. Additional details of the experiment are given in [4].

Three experiments were carried out. In all experiments, the system was trained on the Bellcore training database.

In the first experiment, the test set was a subset of the Bellcore database. Thus, we assume that both the test set and the training set were recorded under similar channel conditions. Data from an additional 56 male and female speakers, recorded at Bellcore, formed the test set. The first column of Table I shows the percentage error rates on this test data. RASTA-PLP performs about as well as the standard PLP technique.

In the second experiment, the Bellcore test data set was corrupted by simulated convolutional noise (pre-emphasis by first-order differentiation of the signal). The recognizer had been trained on the uncorrupted Bellcore training data. The results are tabulated in the second column of Table I. The standard PLP technique yielded an error rate almost an order of magnitude higher than on the uncorrupted Bellcore data. The new approach can be seen to be far more robust to such simulated channel variation.

To extend the result to an experiment with realistic changes in channel conditions, digit strings spoken by four speakers (2 male and 2 female) were recorded over the local telephone lines in the U S WEST speech laboratory. The recognition results on this set are shown in the third column of Table I. As in the previous experiment, the conventional PLP technique yields a very high error rate. A similar test showed that a standard LPC-based system degraded even further, to a 60.7% error rate. The performance of RASTA-PLP degrades only slightly.
Table I: ISOLATED DIGIT ERROR RATES

Analysis     Original Speech   Modified Speech   Different Environment
PLP          4.08%             31.35%            31.30%
RASTA-PLP    3.81%             5.00%             7.64%

4 EXPERIMENTS WITH LARGE VOCABULARY CONTINUOUS HIGH QUALITY SPEECH

We were curious whether our positive results with HMM-based ASR of telephone speech extend to a completely different ASR system and task. The standard large vocabulary continuous speech DARPA Resource Management database was chosen for this test. The recognizer used in the new series of experiments was a hybrid recognizer with a neural network trained on 4000 sentences to predict monophones for each frame, and then used in recognition to estimate likelihoods for a simple context-independent HMM system. 300 development test sentences from the October 1989 Resource Management speaker-independent continuous speech recognition corpus were used as the test data. Since the DARPA database has 8 kHz bandwidth (twice the telephone speech bandwidth of the previous experiment), the order of the autoregressive model in both PLP and RASTA-PLP analysis was increased from 5 to 8.

To simulate the effect of muffled speech that we had observed with a small obstacle between the microphone and the talker's mouth, a low-pass filter (a single complex pole pair, with a 3 dB point at 2 kHz and a 20 dB loss at 8 kHz) was applied to degrade the test data. The word error results, shown in Table II, indicate that the low-pass filtering significantly degrades the performance of the PLP-based recognizer. The RASTA processing in PLP had almost no effect on performance for the clean data, and kept the recognizer performance insensitive even to the severe low-pass filtering.
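The insensitivity to a fixed linear filter follows from the additive-constant argument and can be checked numerically. The sketch below is an idealized illustration, not a reproduction of the experiment: the critical-band power trajectories are random placeholder data, the channel is modeled as one constant gain per band (a hypothetical low-pass tilt), and the filter is the causal form of the Eq. (1) band-pass filter.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
n_frames, n_bands = 400, 15

# Placeholder critical-band power trajectories (assumed data, not speech).
clean_power = rng.uniform(0.1, 10.0, size=(n_frames, n_bands))

# Hypothetical fixed channel: constant gain per band, falling by a factor
# of 100 from the lowest to the highest band.
channel_gain = np.logspace(0, -2, n_bands)
filtered_power = clean_power * channel_gain

# Causal form of the band-pass filter of Eq. (1).
b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
a = np.array([1.0, -0.98])


def rasta(log_spec):
    """Filter each band trajectory along the time axis."""
    return lfilter(b, a, log_spec, axis=0)


clean_out = rasta(np.log(clean_power))
filtered_out = rasta(np.log(filtered_power))

# Largest per-frame difference between the two outputs. By linearity this
# is the filter response to the constant log gain: a start-up transient
# that decays at the rate set by the pole.
gap = np.abs(clean_out - filtered_out).max(axis=1)
```

In this toy setting the two outputs differ noticeably during the first frames and agree closely after a few hundred frames, which is consistent with the remark that the result depends on the starting point of the analysis.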

Informally, we have observed that RASTA-PLP gives a substantial advantage in our live recognition experiments: while the conventional front-end based on the short-term spectrum is very sensitive to the choice of microphone, or even to the microphone position relative to the mouth, RASTA-PLP makes the recognizer much more robust to such factors. Further, even the harmful effect of a constant additive noise background, often present in our live recordings, appears to be reduced.

Table II: CONTINUOUS SPEECH WORD ERROR RATES

Analysis     Original Speech   Modified Speech
PLP          17.9%             64.7%
RASTA-PLP    18.6%             19.2%

5 DISCUSSION

A major current research concern is the significant degradation of high-performance laboratory systems when they are used in the real world. We believe that one of the reasons for such degradation is the highly variable frequency characteristics of realistic recording and communication environments. Previous techniques for dealing with the convolutional noise introduced by such variable environments (see e.g. [6],[7]) appear to be useful for recognition applications that permit the explicit computation of a communication channel transfer function. Such applications typically require a separate channel estimation phase. It appears that our simple RASTA-PLP technique is quite efficient in dealing with the convolutional noise. In addition, RASTA-PLP computes all estimates on-line. That may prove advantageous for applications where the channel conditions are not known a priori or where the conditions might change unpredictably during the use of the recognizer.

Because we have been primarily concerned with convolutional noise in the communication channel, we conducted our corrections in the log spectral domain. The RASTA technique could also be used in the magnitude or power spectral domains for additive noise reduction. However, care must be taken to ensure positivity of the enhanced power spectrum, as is also the case for traditional spectral subtraction techniques.
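The positivity concern can be made concrete with a small sketch. The report does not say how positivity would be enforced, so the flooring used here is purely an assumption borrowed from common spectral subtraction practice; the function name `rasta_power_domain`, the floor value, and the synthetic single-band signal are all hypothetical.

```python
import numpy as np
from scipy.signal import lfilter


def rasta_power_domain(power_spec, floor=1e-3, pole=0.98):
    """Band-pass filter power trajectories, then clamp to a positive floor.

    Returns both the raw filter output and the floored version.
    The floor is an illustrative assumption, not part of the report.
    """
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    a = np.array([1.0, -pole])
    filtered = lfilter(b, a, power_spec, axis=0)
    return filtered, np.maximum(filtered, floor)


# Synthetic single-band example: constant additive noise power plus a
# short burst standing in for a speech event.
n_frames = 600
power = np.full((n_frames, 1), 2.0)
power[300:330, 0] += 5.0

raw, floored = rasta_power_domain(power)
```

In this example the constant noise level is largely removed before the burst starts, but the raw filter output swings negative when the burst ends, so it cannot be treated as a power spectrum without a step such as the flooring shown.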
The study reported here made no use of other potential capabilities of RASTA processing, particularly the ability to apply signal modifiers in the spectral temporal derivative domain. For instance, a threshold imposed on small temporal derivatives could provide further nonlinear smoothing of the spectral estimates, and nonlinear amplitude modifications could enhance or suppress speech transitions. Our current band-pass filter may not be optimal. Further, there is no fundamental reason to use the same filter for all spectral channels. These issues are topics of our current research.

We also note that a German group of researchers, using a high-pass filtering approach, primarily in the power spectral domain, has achieved encouraging results in suppressing additive noise on a different set of speech recognition problems [8]. Their experience appears to confirm the effectiveness of the RASTA class of techniques.

6 SUMMARY

A new technique for estimating a robust time-varying spectrum, RASTA-PLP, based on the filtering of time trajectories of outputs from critical-band filters, has been described. A large test was conducted on a speaker-independent telephone digit recognition task using speech that had been corrupted with convolutional noise. Results from this test show an order-of-magnitude improvement in error rate over conventional spectral estimation techniques such as LPC or PLP. Results from similar tests with large vocabulary continuous speech recognition show that the improvement is consistent across different databases and different recognition techniques.

7 ACKNOWLEDGEMENT

Thanks to Chuck Wooters and Steve Renals for assistance with the large vocabulary experiments.

References

[1] H. Hermansky: "Perceptual linear predictive (PLP) analysis for speech," J. Acoust. Soc. Am., pp. 1738-1752, 1990.
[2] Q. Summerfield and P. Assmann: "Auditory enhancement and the perception of concurrent vowels," Perception & Psychophysics, 45 (6), pp. 529-536, 1989.
[3] H. Hermansky: "Auditory model for parametrization of speech in real-life environment based on re-integration of temporal derivative of auditory spectrum," U S WEST Advanced Technologies Research Report, File Folder ST 04-01, October 1990.
[4] H. Hermansky, N. Morgan, A. Bayya, P. Kohn: "Compensation for the effect of the communication channel in auditory-like analysis of speech (RASTA-PLP)," Proc. of Eurospeech '91, pp. 1367-1371, Genova, Italy, 1991.
[5] S. Furui: "Speaker-Independent Isolated Word Recognition Based on Emphasized Spectral Dynamics," Proc. IEEE Intl. Conf. on Acoustics, Speech & Signal Processing, pp. 1991-1994, Tokyo, Japan, 1986.
[6] A. Acero and R. M. Stern: "Towards Environment-Independent Spoken Language Systems," Proc. Speech and Natural Language Workshop, DARPA, pp. 157-162, June 1990.
[7] E. Errel and M. Weintraub: "Recognition of Noisy Speech: Using Minimum-Mean Log-Spectral Distance Estimation," Proc. Speech and Natural Language Workshop, DARPA, pp. 341-345, June 1990.
[8] H. Hirsch, P. Meyer, and H. Ruehl: "Improved speech recognition using high-pass filtering of subband envelopes," Proc. of Eurospeech '91, pp. 413-416, Genova, Italy, 1991.