Robust Speech Feature Extraction using RSF/DRA and Burst Noise Skipping

ECTI TRANSACTIONS ON ELECTRICAL ENG., ELECTRONICS, AND COMMUNICATIONS VOL.3, NO.2 AUGUST 2005

Naoya Wada, Shingo Yoshizawa, Noboru Hayasaka, and Yoshikazu Miyanaga, Non-members

ABSTRACT

This paper explores the extraction of speech features that are robust to noise for speech recognition. Real noise sources include not only continuous noise but also burst noise and mixtures of the two. We present noise-robust techniques against both continuous noise and burst noise. The algorithm for compensating for continuous noise is based on RSF (Running Spectrum Filtering) and DRA (Dynamic Range Adjustment). RSF focuses on the modulation spectrum and extracts speech components with FIR filtering. DRA normalizes the maximum amplitudes of the cepstrum. Compensation for burst noise is performed by a burst noise skipping algorithm, which skips burst noise periods and links the speech before and after those periods. Burst noise periods are identified using speech estimation with an Auto-Regressive Moving-Average (ARMA) model. Experimental results on isolated word speech recognition show the effectiveness and high noise robustness of the proposed methods.

Keywords: Noise Robustness, Speech Recognition, Burst Noise, Modulation Spectrum.

1. INTRODUCTION

Speech recognition systems have been widely explored as an important human interface. They are now implemented in various applications such as car navigation systems, mobile terminals, and robots. Because these systems are used in a variety of real environments, noise robustness is a strong requirement. Noise can be classified into two types. One is continuous noise, which is added to the whole utterance and does not change radically. The other is burst noise, such as a door being shut or a channel impulse, which is characterized by large, occasional bursts of energy.
04PSI09: Manuscript received on December 31, 2004; revised on August 26, 2005. The authors are with the Department of Graduate School of Engineering, Hokkaido University, Chuo-ku Kita 14 Jyo Nishi 9 Chome, Sapporo-shi, Hokkaido 060-0814, Japan. E-mail: [wada, yosizawa, hayasaka]@csm.ist.hokudai.ac.jp, miya@ist.hokudai.ac.jp

Continuous noise is easier to estimate than burst noise, which has led to various noise-robust methods such as noise-robust LPC analysis [1],[2], Hidden Markov Model (HMM) decomposition and composition [3],[4], and the extraction of dynamic cepstrum [5]. Besides these, spectral subtraction (SS) [6] is the best-known method and is widely used to improve noise robustness. In real environments, however, burst noise can be combined with continuous noise and added to the input speech. Although burst noise seriously degrades recognition accuracy, robust analysis against it remains insufficient because it is difficult to predict.

In this paper, we explore robust speech feature extraction for recognizing speech that contains both continuous noise and burst noise, and propose new speech recognition techniques. The noise-robust techniques for continuous noise are based on our previously proposed speech feature extraction using RSF and DRA [7],[8]. RSF focuses on the modulation spectrum obtained from the time trajectory of the spectrum and extracts speech components by band-pass filtering. Following Kanadera et al. [9], we employ FIR filtering for stability and accuracy. Furthermore, RSF applies filtering twice, before and after the log process, and eliminates both multiplicative noise and additive noise. DRA normalizes the maximum amplitudes of the feature parameters and corrects the difference in dynamic range between the training data and the observed speech data.
It has been reported [10] that normalizing the cepstral dynamic range combines better with RSF than other normalization methods such as Cepstral Mean Normalization and Cepstral Variance Normalization.

The noise-robust technique for burst noise skips the burst noise and links the input speech before and after it. Admittedly, some speech components are lost when burst noise falls within an utterance. However, burst noise is so difficult to predict that attempting to extract speech components from burst noise periods would leave a considerable amount of noise when the burst is large, which would deteriorate recognition accuracy. Skipping the burst noise therefore has less impact than such extraction, because burst noise periods are much shorter than speech periods. We use speech estimation with ARMA models to identify burst noise periods.

The former part of this paper presents robust speech feature extraction using RSF/DRA and burst noise skipping. The latter part evaluates the noise robustness of each method through isolated word speech recognition experiments using HMMs.

2. ROBUST ANALYSIS FOR STATIONARY NOISES

2.1 Running Spectrum Filtering (RSF)

RSF focuses on the modulation spectrum, which characterizes the time trajectory of the spectrum across frames. The modulation spectrum is obtained as follows, as illustrated in Fig. 1. Short-time speech characteristics in the frequency domain are obtained by applying windowing and the Fourier transform to the time-domain speech waveform. The time trajectory at a specific frequency is then obtained by tracing its values over time. This time trajectory of values in the frequency domain is the running spectrum, and its frequency analysis yields the modulation spectrum. It has been reported [11] that speech components in the modulation frequency domain are dominant around 4 Hz, and that components outside the range from 1 Hz to 12 Hz can be regarded as noise and unnecessary components.

Although RASTA (RelAtive SpecTrA) is a well-known method focusing on the modulation spectrum, the original RASTA employs IIR filtering, which may cause problems such as phase distortion and filter instability. RSF applies FIR band-pass filtering to the modulation spectrum to avoid these difficulties and remove noise components. However, RSF needs high-order FIR filters (240-tap filters in this paper) to realize a sharp modulation-frequency cutoff. Such a high filter order requires many delay elements and a long delay before an output can be calculated. Therefore, to apply FIR filtering to the whole utterance, non-speech frames long enough to cover the filter order would have to be included before and after the speech frames. However, their duration would be about 2800 ms at a sampling frequency of 11025 Hz, which is not practical.
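The filter length and delay figures above can be checked with a short sketch. It designs a band-pass FIR filter for a running-spectrum trajectory and evaluates its gain at a few modulation frequencies. This is not the authors' filter design: the Hamming-windowed sinc method and the function names are assumptions made for illustration, and the trajectory sampling interval is taken to be the 128-point frame shift at 11025 Hz listed in the experimental conditions.

```python
import math

FRAME_PERIOD_S = 128 / 11025            # 128-point frame shift, about 11.6 ms
FRAME_RATE_HZ = 1.0 / FRAME_PERIOD_S    # sampling rate of each running-spectrum trajectory


def bandpass_fir(num_taps, low_hz, high_hz, rate_hz):
    """Hamming-windowed sinc band-pass filter (difference of two ideal low-pass filters)."""
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2.0 * (high_hz - low_hz) / rate_hz
        else:
            ideal = (math.sin(2 * math.pi * high_hz * x / rate_hz)
                     - math.sin(2 * math.pi * low_hz * x / rate_hz)) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps


def gain_at(taps, freq_hz, rate_hz):
    """Magnitude of the filter's frequency response at one modulation frequency."""
    re = sum(h * math.cos(2 * math.pi * freq_hz * n / rate_hz) for n, h in enumerate(taps))
    im = sum(h * math.sin(2 * math.pi * freq_hz * n / rate_hz) for n, h in enumerate(taps))
    return math.hypot(re, im)


def fir_filter(taps, trajectory):
    """Causal FIR filtering of one frequency bin's time trajectory (zero initial state)."""
    out = []
    for t in range(len(trajectory)):
        acc = 0.0
        for n, h in enumerate(taps):
            if t - n >= 0:
                acc += h * trajectory[t - n]
        out.append(acc)
    return out


# 240 taps and the 1-12 Hz pass band are the values quoted in the text.
taps = bandpass_fir(240, 1.0, 12.0, FRAME_RATE_HZ)
for freq in (0.0, 4.0, 30.0):
    print(f"gain at {freq:4.1f} Hz: {gain_at(taps, freq, FRAME_RATE_HZ):.3f}")
print("time spanned by 240 frames (ms):", round(240 * FRAME_PERIOD_S * 1000))
```

The last line multiplies the 240 taps by the frame period, which is where a figure of roughly 2800 ms comes from: each tap of a filter running along the time trajectory corresponds to one analysis frame.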
In RSF, several non-speech frames of a fixed length are placed before and after the speech frames so that a sufficient filter order is obtained. RSF thus realizes effective feature extraction and can be applied in a practical speech recognition system.

The RSF process is as follows. The noisy speech signal $y(t)$ is converted to the frequency domain by FFT as

$$y(t) = h(t) * (x(t) + a(t)) \tag{1}$$

$$Y(f) = H(f)X(f) + H(f)A(f) \tag{2}$$

where $x(t)$ denotes the signal component, $h(t)$ denotes the system noise, $a(t)$ denotes the environmental noise, and $*$ denotes convolution. In (2), $H(f)A(f)$ is the additive noise component, and the time trajectory of its spectrum is slower than that of the speech component. Therefore, it can be removed with low-pass filtering in the time-spectrum domain. The logarithmic power spectrum without the additive noise component is then written as

$$\log Y(f) = \log H(f)X(f) = \log X(f) + \log H(f), \tag{3}$$

and the system noise component $H(f)$ can be removed by applying band-pass filtering to the time trajectory of the logarithmic power spectrum.

Fig. 1: The process for obtaining the modulation spectrum. (An FFT of the waveform gives the running spectrum over frequency; an FFT on each frequency gives the modulation spectrum over modulation frequency.)

2.2 Dynamic Range Adjustment (DRA) on Cepstrum

Another cause of noise corruption is the difference in the dynamic ranges of the cepstrum. The dynamic range of the cepstrum is the difference between the maximum and minimum cepstral values in each order. The peaks of the cepstrum, both maxima and minima, carry important characteristics of speech. However, as shown in Fig. 2(a), the cepstral peak amplitudes of noisy speech are reduced compared with those of noise-free speech, and these characteristics are degraded. Since speech recognition is a kind of pattern matching, these differences can be compensated by normalizing the amplitudes of both clean and noisy speech. DRA adjusts these differing dynamic ranges by normalizing the amplitude of the speech features.
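As a minimal sketch of this kind of amplitude normalization, the function below scales each cepstral coefficient trajectory by its peak absolute value. It assumes that the maximum is taken over the frames of one utterance, separately for each cepstral order, which follows the per-order definition of dynamic range given above; that reading, the function name, and the toy data are assumptions rather than details confirmed by the paper.

```python
def dynamic_range_adjust(features):
    """Scale each coefficient trajectory by its maximum absolute value.

    `features` is a list of frames, each a list of m cepstral coefficients.
    Assumption: the maximum is taken over all frames of the utterance,
    separately for each coefficient order.
    """
    num_coeffs = len(features[0])
    peaks = [max(abs(frame[i]) for frame in features) for i in range(num_coeffs)]
    return [
        [frame[i] / peaks[i] if peaks[i] > 0 else 0.0 for i in range(num_coeffs)]
        for frame in features
    ]


# Toy trajectories of two cepstral orders over three frames. The "noisy" copy
# has the same shape but a compressed dynamic range (hypothetical values).
clean = [[4.0, -1.0], [-8.0, 2.0], [2.0, -0.5]]
noisy = [[0.5 * c for c in frame] for frame in clean]

print(dynamic_range_adjust(clean))
print(dynamic_range_adjust(noisy))
```

In this idealized case the noisy trajectory is an exact scaled copy of the clean one, so both normalize to the same values within -1 to 1. Real noise also changes the shape of the trajectory, which normalization alone cannot undo.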
In DRA, each coefficient of a speech feature vector is adjusted in proportion to its maximum amplitude as

$$\tilde{f}_i(t) = \frac{f_i(t)}{\max_{j=1,\dots,m} |f_j(t)|} \quad (i = 1,\dots,m), \tag{4}$$

where $f_i(t)$ denotes an element of the feature vector, $m$ denotes the dimension, and $t$ denotes the frame number. Using (4), all coefficients are adjusted into the range from -1 to 1.

Fig. 2: A comparison of trajectories of the 1st-order cepstra for baseline MFCC and MFCC after RSF. The solid lines show the cepstrum of clean speech and the dashed lines show that of noisy speech (running-car noise, 0 dB SNR). The sample speech is "Kitami". The analysis methods are: (a) baseline MFCC; (b) MFCC after DRA; (c) MFCC after RSF; (d) MFCC after RSF and DRA. (Each panel plots cepstral value; the columns are baseline MFCC and with RSF, the rows are without DRA and with DRA.)

With RSF, the influence of differences in the spectral fine structure is eliminated, as shown in Fig. 2(c). This process removes parts of the speech that are unnecessary for recognition, such as speaker characteristics and noise influences. Then, with DRA, the difference in cepstral dynamic range is adjusted as shown in Fig. 2(d), and the cepstrum of noisy speech is brought close to that of clean speech.

3. BURST-NOISE SKIPPING ALGORITHM

One approach to removing burst noise is to skip the speech frames that contain it and link the frames before and after it. This removes some speech components. However, the influence of burst noise is greater than that of the missing speech frames. Furthermore, HMMs are flexible with respect to temporal variation of speech, so the loss of some speech frames can be compensated.

Determining a criterion is one of the most important factors in identifying burst noise. One idea is to use parameters of the observed speech, such as variances over short periods, as the criterion. However, this depends on the level of the input speech, and the estimation becomes difficult when the levels of the input speech and the burst noise do not differ much.
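The level dependence of a variance-based criterion can be illustrated with a small sketch. It computes the short-time variance of a waveform and flags frames whose variance stands out. The sine wave standing in for speech, the burst levels, and the threshold of twice the median variance are all hypothetical choices for illustration; the paper does not state how its threshold was set.

```python
import math
import random
import statistics

FRAME_LEN = 256   # 23.2 ms window at 11025 Hz
HOP = 128         # 11.6 ms frame period


def short_time_variance(signal, frame_len, hop):
    """Variance of the samples in each analysis frame."""
    variances = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        mean = sum(frame) / frame_len
        variances.append(sum((s - mean) ** 2 for s in frame) / frame_len)
    return variances


def flag_bursts(variances, factor=2.0):
    """Flag frames whose variance exceeds `factor` times the median variance."""
    threshold = factor * statistics.median(variances)
    return [v > threshold for v in variances]


def make_signal(burst_std, seed=0):
    """Sine wave standing in for speech, with white noise added on samples 2048-2559."""
    rng = random.Random(seed)
    signal = [math.sin(2 * math.pi * 200 * n / 11025) for n in range(4096)]
    for n in range(2048, 2560):
        signal[n] += rng.gauss(0.0, burst_std)
    return signal


loud = flag_bursts(short_time_variance(make_signal(3.0), FRAME_LEN, HOP))
quiet = flag_bursts(short_time_variance(make_signal(0.4), FRAME_LEN, HOP))
print("loud burst, flagged frames :", [i for i, f in enumerate(loud) if f])
print("quiet burst, flagged frames:", [i for i, f in enumerate(quiet) if f])
```

A burst well above the signal level is flagged, while a burst at a comparable level leaves the frame variance close to that of the surrounding signal and is missed.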
We employ adaptive estimation with ARMA and use the short-time variance of the estimation error as the criterion for identifying burst noise. Since the ARMA process assumes that speech is stationary over a certain short period, ARMA estimation becomes difficult when non-stationary (burst) noise is added. That is, speech estimation works well and the estimation error is small during periods without burst noise, whereas the estimation error of the ARMA process increases when non-stationary noise is present. This makes it possible to identify burst noise periods.

The algorithm for estimating the input speech is as follows. We assume that the observed speech signal $y_k$ can be written using a speech generation process following the ARMA model as

$$y_k = \sum_{i=1}^{n} a_i(k)\, y_{k-i} + \sum_{j=1}^{m} b_j(k)\, u_{k-j} + u_k + n_k, \qquad u_k = u^{p}_k + u^{w}_k, \tag{5}$$

where $k$ is a time index, $a_i$ $(i = 1,\dots,n)$ are the AR parameters, and $b_j$ $(j = 1,\dots,m)$ are the MA parameters. The input signal $u_k$ is a mixture of a periodic pulse signal $u^{p}_k$, which represents voiced sound, and zero-mean white noise $u^{w}_k$ with variance $\sigma^{2}_{u,k}$, which represents unvoiced sound. Using (5), the estimated signal $\hat{y}_k$ and the predicted signal $\hat{y}_{k|k-1}$ are defined as

$$\hat{y}_k = \sum_{i=1}^{n} \hat{a}_i(k)\, y_{k-i} + \sum_{j=1}^{m} \hat{b}_j(k)\, \hat{u}_{k-j} + \hat{u}_k, \qquad \hat{y}_{k|k-1} = \sum_{i=1}^{n} \hat{a}_i(k-1)\, y_{k-i} + \sum_{j=1}^{m} \hat{b}_j(k-1)\, \hat{u}_{k-j}. \tag{6}$$

Since it is assumed that $\hat{u}_k$ cannot be observed at $k-1$, $\hat{u}_k$ cannot be used in the prediction of the signal. We define the estimated parameter vector $\hat{p}(k)$ and the vector $\hat{h}(k)$ as follows:

$$\hat{p}(k) = \begin{pmatrix} \hat{a}_1(k) \\ \vdots \\ \hat{a}_n(k) \\ \hat{b}_1(k) \\ \vdots \\ \hat{b}_m(k) \end{pmatrix} \tag{7}$$

$$\hat{h}(k) = \begin{pmatrix} \hat{y}_{k-1} \\ \vdots \\ \hat{y}_{k-n} \\ \hat{u}_{k-1} \\ \vdots \\ \hat{u}_{k-m} \end{pmatrix} \tag{8}$$

Fig. 3: Sample speech waveforms before and after noise skipping, and the criteria. Panels: (a) original speech waveform; (b) waveform with burst noise; (c) variance of input speech; (d) criterion using the ARMA model; (e) waveform after noise skipping with (d). The sample speech is "Hachinohe". Masked areas denote periods of burst noise added to the original speech. The added burst noise is temporal white noise at 0 dB SNR (calculated from the total energy of the temporal noise).

Fig. 4: Analysis method with DRA/RSF and the noise skipping algorithm. Processing steps in order: Speech Signal → Noise Skipping Algorithm → Fourier Transform → ABS → Low-pass Filtering (First RSF Process) → Mel Filterbank Analysis → Log → Band-pass Filtering (Second RSF Process) → Inverse Fourier Transform → Delta Cepstrum → Dynamic Range Adjustment (DRA) → Speech Feature Vector.

Using (7) and (8), $\hat{y}_k$ and $\hat{y}_{k|k-1}$ are rewritten as

$$\hat{y}_k = \hat{h}^{T}(k)\,\hat{p}(k) + \hat{u}_k, \qquad \hat{y}_{k|k-1} = \hat{h}^{T}(k)\,\hat{p}(k-1), \tag{9}$$

where $^{T}$ represents the transpose. We now introduce the least squares criterion in order to estimate the ARMA parameters and the estimation error:

$$V_k = \sum_{i=1}^{k} \rho(i,k)\,(y_i - \hat{y}_i)^2 + \rho(1,k)\,\hat{p}^{T}(k)\,F_1^{-1}\,\hat{p}(k), \tag{10}$$

where $F_1$ is an arbitrary real symmetric positive definite matrix and $\rho(i,k)$ are weighting coefficients given by

$$\rho(i,k) = \begin{cases} \prod_{j=i}^{k-1} \lambda_j & (i = 1, 2, \dots, k-1) \\ 1 & (i = k, k+1, \dots). \end{cases} \tag{11}$$

The second term in (10) is used to initialize $V_k$. Generally, $F_1^{-1}$ is a symmetric matrix whose components are quite small. As $k$ increases, the first term becomes larger and $\rho(1,k)$ becomes smaller, so that the second term becomes negligible. When $0 < \lambda_j < 1$, this criterion progressively decreases the weight of previous estimation errors.
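A weighted least squares criterion of the form (10), taken with a constant forgetting factor, is what standard recursive least squares (RLS) minimizes. The sketch below is a simplified illustration rather than the authors' estimator: it fits an AR-only model (the MA part, which needs the unobserved input, is omitted), and the model order, forgetting factor, initial gain matrix, and synthetic test signal are all assumed values.

```python
import math
import random


def rls_prediction_errors(signal, order=4, forgetting=0.98, init_gain=100.0):
    """Recursive least squares for an AR model; returns the one-step prediction errors."""
    p = [0.0] * order                                    # parameter estimates
    F = [[init_gain if r == c else 0.0 for c in range(order)] for r in range(order)]
    errors = []
    for k in range(len(signal)):
        # Regressor of past samples (zeros before the start of the signal).
        h = [signal[k - i] if k - i >= 0 else 0.0 for i in range(1, order + 1)]
        err = signal[k] - sum(hi * pi for hi, pi in zip(h, p))   # prediction error
        Fh = [sum(F[r][c] * h[c] for c in range(order)) for r in range(order)]
        denom = forgetting + sum(h[r] * Fh[r] for r in range(order))
        p = [p[r] + Fh[r] * err / denom for r in range(order)]
        # F is symmetric, so F h h^T F equals the outer product of Fh with itself.
        F = [[(F[r][c] - Fh[r] * Fh[c] / denom) / forgetting for c in range(order)]
             for r in range(order)]
        errors.append(err)
    return errors


def frame_variances(values, frame_len):
    """Variance of each non-overlapping frame."""
    out = []
    for start in range(0, len(values) - frame_len + 1, frame_len):
        frame = values[start:start + frame_len]
        mean = sum(frame) / frame_len
        out.append(sum((v - mean) ** 2 for v in frame) / frame_len)
    return out


# Synthetic stand-in for speech: two sinusoids plus weak noise (hypothetical values),
# with a white-noise burst added on samples 600-699.
rng = random.Random(1)
signal = [math.sin(2 * math.pi * 0.03 * k) + 0.5 * math.sin(2 * math.pi * 0.11 * k)
          + rng.gauss(0.0, 0.01) for k in range(1200)]
for k in range(600, 700):
    signal[k] += rng.gauss(0.0, 1.0)

variances = frame_variances(rls_prediction_errors(signal), 100)
for index, value in enumerate(variances):
    print(f"frame {index:2d}: error variance {value:.4f}")
```

Where the signal is predictable, the error variance stays small; inside the burst it rises sharply because the added noise cannot be predicted from past samples. The first frame also shows a start-up transient while the parameters converge.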
Using (5)-(9), the estimated parameter vector that minimizes the criterion can be obtained with the following equations:

$$\hat{p}(k) = \hat{p}(k-1) + F_k \hat{h}(k) \{\lambda_{k-1} + \hat{h}^{T}(k) F_k \hat{h}(k)\}^{-1} \nu(k) \tag{12}$$

$$F_{k+1} = \lambda_{k-1}^{-1} \{F_k - F_k \hat{h}(k) [\lambda_{k-1} + \hat{h}^{T}(k) F_k \hat{h}(k)]^{-1} \hat{h}^{T}(k) F_k\} \tag{13}$$

$$\nu(k) = y_k - \hat{y}_{k|k-1} = y_k - \hat{h}^{T}(k)\,\hat{p}(k-1), \tag{14}$$

where $\nu(k)$ is the estimation error. The noise skipping algorithm is performed using the short-time variance of $\nu(k)$ as the criterion. To suppress the influence of continuous noise on the estimation error, the mean of the first few variances is subtracted from the variance at each time index, in the same manner as spectral subtraction. Fig. 3 shows sample speech waveforms before and after noise skipping, together with the criteria. When the peak of the power spectrum of the burst noise is not large, the short-time variances during burst noise periods are almost the same as those during speech periods, and it is difficult to identify burst noise periods using them. However, the values of the proposed criterion during speech periods are suppressed by ARMA estimation, which makes the identification of burst noise periods easier.

With the proposed methods, speech analysis is refined as shown in Fig. 4. First, speech without burst noise is obtained with the noise skipping algorithm. Then FIR filtering is applied twice to the obtained speech as the RSF processes, and jitter influences are removed. Finally, the obtained cepstrum is normalized by DRA and the robust speech feature vector is obtained.

4. EVALUATION EXPERIMENTS

Table 1: The conditions of the speech recognition experiments.

| Item | Condition |
|---|---|
| Recognition Task | Isolated 100-word vocabulary |
| Speech Data | 100 Japanese place names from JEIDA |
| Sampling | 11.025 kHz, 16-bit |
| Window Length | 23.2 ms (256 points) |
| Frame Period | 11.6 ms (128 points) |
| Window Function | Hanning window |
| Pre-emphasis | $1 - 0.97z^{-1}$ |
| Baseline Speech Feature Vector | 38th order, based on MFCC (12-dimensional MFCC, 12-dimensional delta MFCC, 12-dimensional delta-delta MFCC, delta log-energy, delta-delta log-energy) |
| Acoustic Model | 32-state continuous word HMMs |
| Training Set | 40 male speakers, 3 utterances each |
| Tested Set | Speaker-independent, 5 male speakers, 2 utterances each |

To evaluate the noise robustness of the proposed techniques, isolated word speech recognition has been examined. First, in speech recognition experiments against continuous noise, we compare the performance of three speech feature extraction methods: ordinary MFCC, MFCC after spectral subtraction, and MFCC after RSF and DRA. The task is the recognition of 100 Japanese words with additive noise. White noise, speech babble noise, or high-frequency radio channel noise is added to the test speech at an SNR of 10 dB.
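Adding noise at a prescribed SNR amounts to scaling the noise so that the ratio of speech power to noise power matches the target. The sketch below shows one common way to do this; it assumes the SNR is defined over the average power of the whole utterance, which the paper does not spell out for continuous noise, and the function names and toy signals are hypothetical.

```python
import math
import random


def power(signal):
    """Average power of a sampled signal."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power matches `snr_db`, then add it."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals: a sine wave standing in for speech and white noise of arbitrary level.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 200 * n / 11025) for n in range(4096)]
noise = [rng.gauss(0.0, 5.0) for _ in range(4096)]

noisy = mix_at_snr(speech, noise, snr_db=10.0)
added = [y - s for y, s in zip(noisy, speech)]
print("achieved SNR (dB):", round(10 * math.log10(power(speech) / power(added)), 3))
```

The same scaling applies to burst noise if the noise power is computed only over the samples where the burst is present, which is one possible reading of the "total energy" convention used later for burst noise.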
The continuous noise database is obtained from NoiseX (http://spib.rice.edu/spib/select noise.html). The conventional recognition system consists of ordinary MFCC feature extraction and HMMs. The training database consists of three utterances of 100 isolated words spoken by 40 male speakers, and the test data consist of two utterances of the same 100 words spoken by five male speakers under speaker-independent conditions. The recognition part is implemented in MATLAB. The whole database is the Japanese common voice data "Chimei" (meaning place names) provided by the Japan Electric Industry Development Association. Other conditions are described in Table 1.

Recognition results are shown in Table 2. The results against continuous noise confirm that the combined RSF and DRA method is more robust across various noises. Although spectral subtraction performs better for white noise, it degrades rather than improves recognition accuracy in speech babble noise. This result is thought to derive from fluctuation in the continuous noise: noise in real environments fluctuates more than artificial white noise, which causes musical noise or spectral distortion.

Second, we evaluate the performance of the proposed criterion. We prepare two criteria, the short-time variance of the speech waveform and the proposed short-time variance of the estimation error in speech estimation with the ARMA model, and estimate the burst noise periods using each criterion. The estimation accuracies of the burst noise periods are then compared. An obtained period is regarded as correct when the difference between the obtained and known start points and the difference between the obtained and known end points are both below 10 msec. The added noise is either burst noise only or a combination of burst noise and continuous noise (one of the three noises above at 10 dB SNR).
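The correctness rule for an estimated period can be written down directly. The sketch below applies the 10 msec tolerance to both the start and end points; how estimated periods are paired with known ones and how the overall rate is normalized are not stated in the paper, so the matching rule and the function names here are assumptions.

```python
def period_is_correct(estimated, reference, tolerance_ms=10.0):
    """True when both the start and end points differ by less than the tolerance."""
    est_start, est_end = estimated
    ref_start, ref_end = reference
    return (abs(est_start - ref_start) < tolerance_ms
            and abs(est_end - ref_end) < tolerance_ms)


def accuracy_rate(estimated_periods, reference_periods, tolerance_ms=10.0):
    """Percentage of known periods matched by at least one estimated period."""
    if not reference_periods:
        return 0.0
    hits = sum(
        any(period_is_correct(est, ref, tolerance_ms) for est in estimated_periods)
        for ref in reference_periods
    )
    return 100.0 * hits / len(reference_periods)


# Periods as (start, end) in msec; values are made up for illustration.
known = [(100.0, 170.0), (400.0, 480.0)]
found = [(104.0, 165.0), (395.0, 495.0)]   # second end point is 15 msec off
print(accuracy_rate(found, known))
```

Under this rule the first period counts as correct and the second does not, because a single boundary outside the tolerance is enough to reject a period.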
We model the burst noise as occasional large bursts of white noise occupying 20% of the frames of the original speech. The length of each burst is drawn from a Gaussian distribution with a mean of 70 msec. The SNR against burst noise is 0 dB or -10 dB (calculated from the total energy of the burst noise), and the SNR against continuous noise is 10 dB. For the same original speech, the burst noises at 0 dB and -10 dB are added over the same periods. Other analysis conditions are the same as in the previous experiment.

The results are shown in Tables 3 and 4. Even with the short-time variance of the speech waveform as the criterion, correct periods are obtained almost perfectly when the level of burst noise is higher. However, no correct periods are obtained when the level is lower. With the proposed criterion, on the other hand, the estimation accuracy improves, and an accuracy rate of 89.8% is obtained when the SNR against burst noise is 0 dB. When continuous noise is combined, the accuracy rate degrades, but the amount of degradation depends on the noise type: the accuracy rate is only 56.6% when the continuous noise is white noise, but 88.9% when it is speech babble. It is thought that babble noise is regarded as a speech component in the ARMA estimation process and so does not affect the estimation of burst noise periods. This result shows that correct periods can be obtained even when the peak energy level of the burst noise is almost the same as that of the speech, as shown in Fig. 3, even though the criterion is not fully practical for use on its own in that case.

Table 2: Recognition rates (%) versus several types of continuous noise. SNR is 10 dB.

| Speech Feature | White Noise | Speech Babble | HF-channel | Noise Free |
|---|---|---|---|---|
| Conventional | 28.20 | 61.90 | 28.40 | 99.30 |
| SS | 80.70 | 55.10 | 64.70 | 98.60 |
| RSF/DRA | 77.50 | 74.10 | 78.00 | 99.10 |

Table 3: Accuracy rates (%) of estimation of burst noise periods versus several types of continuous noise. SNR versus burst noise is 0 dB.

| Criterion | Burst Noise Only | Burst + White Noise | Burst + Speech Babble | Burst + HF-channel |
|---|---|---|---|---|
| Short-time variance | 0.0 | 0.0 | 0.0 | 0.0 |
| ARMA Estimation Error | 89.8 | 56.6 | 88.9 | 67.9 |

Table 4: Accuracy rates (%) of estimation of burst noise periods versus several types of continuous noise. SNR versus burst noise is -10 dB.

| Criterion | Burst Noise Only | Burst + White Noise | Burst + Speech Babble | Burst + HF-channel |
|---|---|---|---|---|
| Short-time variance | 98.6 | 99.7 | 99.5 | 99.5 |
| ARMA Estimation Error | 99.9 | 100.0 | 99.9 | 99.9 |

Next, the performance of the noise skipping algorithm is evaluated with speech recognition experiments against burst noise. We prepared the following seven methods against burst noise and evaluate the performance of the proposed method using noise skipping and RSF/DRA:
(a) No skipping, no specific processing.
(b) Masking burst noise periods with blanks of speech. Burst noise periods are estimated using short-time variances of the observed speech.
(c) Masking burst noise periods with blanks of speech. Burst noise periods are estimated using short-time variances of estimation errors in ARMA estimation.
(d) Masking known burst noise periods with blanks of speech (periods are given in advance manually).
(e) Skipping of burst noise periods estimated using short-time variances of the observed speech.
(f) Skipping of burst noise periods estimated using short-time variances of estimation errors in speech estimation with the ARMA model.
(g) Skipping of known burst noise periods (periods are given in advance manually).

Either burst noise only or a combination of burst noise and continuous noise (one of the three noises above at 10 dB SNR) is applied to the test speech. Recognition results are shown in Tables 5 and 6. The results against burst noise show the following. First, burst noise seriously degrades recognition accuracy. When only burst noise is applied, the recognition rates obtained with ordinary MFCC are only 8.4% (-10 dB) and 38.0% (0 dB), while about 99% accuracy is obtained in the noise-free environment. Even when RSF/DRA is applied, the recognition rates are still 36.3% (-10 dB) and 57.4% (0 dB), and they are lower still for combinations of burst noise and continuous noise. Second, the presented noise skipping considerably improves recognition accuracy, while the improvement from masking noise with blanks is insufficient. Furthermore, especially when the SNR against burst noise is 0 dB, better performance is obtained by adopting the variance of the estimation error in the ARMA process as the criterion rather than the variance of the observed speech. Third, when burst noise skipping is applied, the recognition accuracy at -10 dB becomes better than that at 0 dB. This is because the estimation of burst noise periods is difficult when the energy level of the applied burst noise is low, as shown in the previous experiment. Even in that case, however, the proposed burst noise skipping technique benefits speech recognition accuracy. Lastly, comparing noise skipping using the proposed method with noise skipping using ideal periods, the difference in recognition accuracy is small when only burst noise is applied.
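The difference between the masking methods (b)-(d) and the skipping methods (e)-(g) can be made concrete with a small sketch. It assumes that "blanks" means zero-valued samples and that periods are given as sample index ranges; both are assumptions, as are the function names and the toy values.

```python
def mask_periods(signal, periods):
    """Replace the samples in each (start, end) period with zeros; length is unchanged."""
    out = list(signal)
    for start, end in periods:
        for i in range(max(start, 0), min(end, len(out))):
            out[i] = 0.0
    return out


def skip_periods(signal, periods):
    """Delete the samples in each (start, end) period and join what remains."""
    drop = set()
    for start, end in periods:
        drop.update(range(max(start, 0), min(end, len(signal))))
    return [s for i, s in enumerate(signal) if i not in drop]


# Toy waveform: the values 99 and 98 stand in for a burst at sample indices 3 and 4.
waveform = [1.0, 2.0, 3.0, 99.0, 98.0, 4.0, 5.0]
burst = [(3, 5)]   # end index is exclusive

print(mask_periods(waveform, burst))   # burst replaced by silence
print(skip_periods(waveform, burst))   # burst removed, neighbours linked
```

Masking keeps the utterance length but inserts a stretch of silence that the acoustic model never saw in that position, whereas skipping shortens the utterance and relies on the HMM's tolerance to temporal variation.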
However, the difference increases when continuous noise is also applied, even though the accuracy rate in the estimation of burst noise periods was almost 100% in the previous experiment. This means that the remaining estimation errors, together with minute but fatal estimation errors of less than 10 msec, may be caused by the included continuous noise and degrade recognition accuracy. Some improvements may be required to suppress the influence of continuous noise entirely.

5. CONCLUSION

In this paper, the suppression of combined continuous and burst noise has been explored and new speech feature extraction techniques have been proposed. To suppress the influence of continuous noise, a combined method of RSF and DRA is presented. RSF emphasizes the modulation frequency bands of speech by applying FIR filtering. DRA normalizes the maximum amplitudes of the cepstrum. In order to sup-

106 ECTI TRANSACTIONS ON ELECTRICAL ENG., ELECTRONICS, AND COMMUNICATIONS VOL.3, NO.2 AUGUST 2005 Table 5: Recognition rates versus mixtures of several types of continuous noise and burst noise. SNR versus burst noise is at 0dB. Noise Varieties Burst Noise White Noise Speech Babble HF Channel Noise Skipping Speech Feature Only + Burst Noise + Burst Noise + Burst Noise Ordinary MFCC 38.0 7.2 10.0 6.3 (a): No Skipping SS 29.7 41.9 6.2 9.3 RSF/DRA 57.4 31.6 6.6 6.5 (b): Masking with Ordinary MFCC 37.8 7.3 10.0 6.2 Blanks Using SS 29.4 41.7 6.2 9.3 Short-time Variances RSF/DRA 57.6 32.2 6.6 6.3 (c): Masking with Ordinary MFCC 29.5 1.9 3.6 2.1 Blanks Using SS 24.1 4.5 2.7 1.6 ARMA Estimation RSF/DRA 71.5 28.5 16.5 13.9 (d): Masking with Ordinary MFCC 30.8 1.9 3.7 2.4 Blanks Using SS 25.8 7.1 2.9 2.0 Ideal Periods RSF/DRA 74.0 41.2 18.5 17.3 (e): Burst Noise Ordinary MFCC 37.8 7.2 9.9 6.2 Skipping Using SS 29.5 41.9 6.2 9.2 Short-time Variances RSF/DRA 57.7 31.3 6.4 6.4 (f): Burst Noise Ordinary MFCC 94.9 16.2 49.3 22.7 Skipping Using SS 92.3 45.7 38.1 42.8 ARMA Estimation RSF/DRA 95.1 39.5 52.1 53.2 (g): Burst Noise Ordinary MFCC 95.7 28.7 53.9 27.3 Skipping Using SS 94.0 70.5 42.0 51.6 Ideal Periods RSF/DRA 96.3 62.6 58.3 63.8 Table 6: Recognition rates versus mixtures of several types of continuous noise and burst noise. SNR versus burst noise is at 50dB. 
Noise Varieties Burst Noise White Noise Speech Babble HF Channel Noise Skipping Speech Feature Only + Burst Noise + Burst Noise + Burst Noise Ordinary MFCC 8.4 3.0 2.1 1.3 (a): No Skipping SS 6.0 15.1 2.0 2.1 RSF/DRA 36.3 13.9 4.2 4.0 (b): Masking with Ordinary MFCC 34.8 2.7 4.0 2.7 Blanks Using SS 29.2 9.6 3.5 2.9 Short-time Variances RSF/DRA 66.3 31.3 17.0 14.2 (c): Masking with Ordinary MFCC 33.6 2.8 3.8 2.6 Blanks Using SS 27.9 9.0 3.3 2.8 ARMA Estimation RSF/DRA 64.4 30.0 16.1 14.5 (d): Masking with Ordinary MFCC 30.8 1.9 3.7 2.4 Blanks Using SS 25.8 7.1 2.9 2.0 Ideal Periods RSF/DRA 74.0 41.2 18.5 17.3 (e): Burst Noise Ordinary MFCC 94.5 29.5 52.0 27.2 Skipping Using SS 92.0 69.1 39.7 49.1 Short-time Variances RSF/DRA 95.8 61.2 54.3 60.4 (f): Burst Noise Ordinary MFCC 95.0 27.8 51.1 25.9 Skipping Using SS 93.2 69.3 40.1 48.9 ARMA Estimation RSF/DRA 95.7 60.8 55.6 59.8 (g): Burst Noise Ordinary MFCC 95.7 28.7 53.9 27.3 Skipping Using SS 94.0 70.5 42.0 51.6 Ideal Periods RSF/DRA 96.3 62.6 58.3 63.8

Robust Speech Feature Extraction using RSF/DRA and Burst Noise Skipping 107 press the influence of burst noise, noise skipping algorithm using ARMA analysis is presented. The effectiveness is estimated in speech recognition experiments and the application of both techniques shows the best performance. This result indicates that the combined method of them has the best performance and flexibility for various environments, even in a combined noise of continuous noise and burst noise. References [1] Tierney J., A study of LPC analysis of speech in additive noise, IEEE Trans. on Acoust., Speech, and Signal Process., vol. ASSP-28, no.4 p.p. 389-397, Aug. 1980. [2] Kay S.M., Noise compensation for autoregressive spectral estimation, IEEE Trans. on Acoust., Speech, and Signal Process., vol. ASSP-28, no.3 p.p. 292-303, March 1980. [3] Varga A. and Moore R., Hidden Markov Model Decomposition of Speech and Noise, Proc. IEEE ICASSP p.p. 845-848, 1990. [4] Gales M.J.F. and Young S.J., Cepstral parameter compensation for HMM recognition in noise, Speech Communication, vol.12, no.3, p.p. 231-239, 1993. [5] Aikawa K. and Saito T., Noise robustness evaluation on speech recognition using a dynamic cepstrum, IEICE Technical Report, SP94-14, p.p. 1-8, June 1994. [6] Boll S., Suppression of acoustic noise in speech using spectral subtraction, IEEE Trans. ASSP, vol. ASSP-27, no.2, p.p. 113-120, 1979. [7] Hayasaka N., Miyanaga Y. and Wada N., Running spectrum filtering in speech recognition, SCIS Signal Processing and Communications with Soft Computing, Oct 2002. [8] Yoshizawa, S., Wada, N., Hayasaka, N. and Miyanaga, Y. Noise Robust Speech Recognition Focusing on Time Variation and Dynamic Range of Speech Feature Parameters Proc. IEEE IS- PACS p.p. 484-487, 2003. [9] Kanedera N., Arai T., H. Hermansky and M. Pavel, On the importance of various modulation frequencies for speech recognition, Proc. Eurospeech, p.p. 1079-1082, 1997. [10] Yoshizawa, S., Hayasaka, N., Wada, N. 
and Miyanaga Y., Cepstral amplitude range normalization for noise robust speech recognition, IEICE Trans. on Information and Systems, vol. E87-D, no. 8, pp. 2130-2137, Aug. 2004.
[11] Hermansky H. and Morgan N., RASTA processing of speech, IEEE Trans. Speech and Audio Process., vol. 2, pp. 578-579, Oct. 1994.
[12] Furui S., Speaker-independent isolated word recognition using dynamic features of speech spectrum, IEEE Trans. on Acoust., Speech, and Signal Process., vol. ASSP-34, no. 1, pp. 52-59, Feb. 1986.

Naoya Wada received the B.E. and M.E. degrees in Electrical Engineering from Hokkaido University, Japan, in 2001 and 2003, respectively. He is currently studying at the Graduate School of Information Science and Technology, Hokkaido University. His research interests are digital signal processing, speech analysis, and speech recognition.

Shingo Yoshizawa received the B.E. and M.E. degrees in Electrical Engineering from Hokkaido University, Japan, in 2001 and 2003, respectively. He is currently studying at the Graduate School of Information Science and Technology, Hokkaido University. His research interests are speech processing, wireless communication systems, and VLSI architecture.

Noboru Hayasaka received the B.E. and M.E. degrees in Electrical Engineering from Hokkaido University, Japan, in 2002 and 2004, respectively. He is currently studying at the Graduate School of Information Science and Technology, Hokkaido University. His research interests are digital signal processing, speech analysis, and speech recognition.

Yoshikazu Miyanaga received the B.S., M.S., and Dr.Eng. degrees from Hokkaido University, Japan, in 1979, 1981, and 1986, respectively. Since 1983, he has been with Hokkaido University, where he is a Professor at the Graduate School of Information Science and Technology. His research interests are adaptive signal processing, non-linear signal processing, and parallel-pipelined VLSI systems. He is a member of the IEICE, the Information Processing Society of Japan, and the Acoustical Society of Japan.