Spectral estimation using higher-lag autocorrelation coefficients with applications to speech recognition

Spectral estimation using higher-lag autocorrelation coefficients with applications to speech recognition
Author: Shannon, Ben; Paliwal, Kuldip
Published: 2005
Conference Title: The 8th International Symposium on Signal Processing and Its Applications (ISSPA 2005)
DOI: https://doi.org/.9/isspa.25.589
Copyright Statement: © 2005 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Downloaded from http://hdl.handle.net/72/2576
Griffith Research Online https://research-repository.griffith.edu.au

SPECTRAL ESTIMATION USING HIGHER-LAG AUTOCORRELATION COEFFICIENTS WITH APPLICATIONS TO SPEECH RECOGNITION

Benjamin J. Shannon and Kuldip K. Paliwal
School of Microelectronic Engineering, Griffith University, Brisbane, QLD 4111, Australia
Ben.Shannon@student.griffith.edu.au, K.Paliwal@griffith.edu.au

ABSTRACT

In this paper, we introduce a noise robust spectral estimation technique for speech signals that is derived from a windowed one-sided higher-lag autocorrelation sequence. We also introduce a new high dynamic range window design method, and utilise both techniques in a modified Mel Frequency Cepstral Coefficient (MFCC) algorithm to produce noise robust speech recognition features. We call the new features Autocorrelation Mel Frequency Cepstral Coefficients (AMFCCs). We compare the recognition performance of AMFCCs to MFCCs for a range of stationary and non-stationary noises on the Aurora II database. We show that the AMFCC features perform as well as MFCCs in clean conditions and have higher noise robustness in noisy conditions.

1. INTRODUCTION

The potential for computing noise robust speech recognition features from the autocorrelation domain has attracted a lot of attention, and a number of feature extraction techniques based on autocorrelation domain processing have been proposed in the literature. The first technique proposed in this area was based on the High-Order Yule-Walker Equations [1], where the autocorrelation coefficients involved in the equation set exclude the zero-lag coefficient. Other similar methods either avoid the zero-lag coefficient [1] [2] [3] or reduce the contribution from the first few coefficients [4] [5]. All of these methods are based on linear prediction (LP) processing and provide some robustness to noise, but their recognition performance for clean speech is much worse than the unmodified, conventional LP approach [5].
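The intuition behind excluding the zero-lag coefficient can be checked numerically: for zero-mean white noise, the expected autocorrelation is a delta at lag zero, so essentially all of the noise energy sits in r[0] while the higher-lag coefficients stay near zero. A minimal sketch (the sample count and number of lags are arbitrary illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def autocorr(x, max_lag):
    """Biased autocorrelation estimate r[k] = (1/N) * sum_n x[n] x[n+k]."""
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) / n for k in range(max_lag + 1)])

# Long unit-variance white-noise realisation: its autocorrelation is
# (in expectation) a delta at lag zero.
noise = rng.standard_normal(100000)
r = autocorr(noise, 50)

# The zero-lag coefficient is close to the noise variance (1.0) ...
assert abs(r[0] - 1.0) < 0.05
# ... while every higher-lag coefficient is close to zero.
assert np.max(np.abs(r[1:])) < 0.05
```

This is why, when uncorrelated broadband noise is added to speech, the low-lag autocorrelation coefficients absorb most of the corruption.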
A potential source of error in using LP methods to estimate the power spectrum of a varying-SNR signal was highlighted by Kay [6], who showed that the appropriate model order depends not only on the AR process, but also on the prevailing SNR condition. Therefore, in this paper, we do not use an LP-based method to process the autocorrelation sequence. Instead, we compute the magnitude spectrum of the one-sided higher-lag autocorrelation sequence using the Fourier transform, process it through a Mel filter bank and parameterise it in terms of MFCCs. Since the proposed method combines autocorrelation domain processing with Mel filter bank analysis, we call the resulting features Autocorrelation Mel Frequency Cepstral Coefficients (AMFCCs).

Speech recognition feature extraction algorithms are typically designed assuming stationary broadband (usually white) noise. In this work, we consider stationary noises as well as non-stationary noises, such as emergency vehicle sirens and chirp signals, and show that higher-lag autocorrelation processing is robust against these types of noise disturbances.

The paper is organised as follows. In section 2 we discuss some properties of the autocorrelation sequence in relation to speech and noise signals, with examples. In section 3 we describe the newly proposed higher-lag autocorrelation spectral estimation technique, and in section 4 we test its effectiveness for noise robust speech feature extraction using the Aurora II database. Conclusions follow in section 5.

2. PROPERTIES OF AUTOCORRELATION SEQUENCES

In this section, we demonstrate briefly how the smooth spectral envelope information of a voiced speech signal is distributed within its short-time autocorrelation sequence. We then discuss the autocorrelation distribution for noise signals, giving an example of a non-stationary noise.

2.1. Speech Signals

In automatic speech recognition, we model the human speech production system using a simple source-system model. The model consists of a variable response filter excited by either a white noise source or a periodic pulse train source: unvoiced speech is modelled as the output of the filter excited by the white noise source, and voiced speech as its output when excited by the periodic pulse train. For speech recognition, we are typically interested in extracting the magnitude response of the variable response filter over time, on the assumption that this carries the speech information sufficiently for accurate recognition.

Most of the popular speech recognition features, such as LPCCs and MFCCs, are derived from an estimate of the smooth power spectrum of the speech signal. In both cases, the smooth power spectrum can be considered as being computed from the autocorrelation sequence: in the LPCC algorithm, the smooth spectral estimate is computed from the first few autocorrelation coefficients, and in the MFCC algorithm, it is computed using the whole autocorrelation sequence.

A depiction of how the smooth spectral envelope information is distributed in the autocorrelation sequence is shown in Fig. 1. The logarithmic power spectrum of an /r/ sound is shown in Fig. 1(a); it shows the harmonic structure typical of voiced speech, along with the information-bearing envelope. Plot (b) shows the autocorrelation sequence associated with the spectrum in (a). Using cepstral processing, we decomposed the spectrum in (a) into the smooth spectrum in (c) and the excitation spectrum in (e); the corresponding autocorrelation sequences are shown in (d) and (f), respectively. Figure 1(d) shows that the smooth power spectrum information is contained in a small number of autocorrelation coefficients. The full autocorrelation sequence shown in (b) can be considered as the convolution of the autocorrelation sequences in (d) and (f). This demonstrates that the smooth power spectrum envelope information is spread throughout the whole autocorrelation sequence of the original speech signal frame.

Fig. 1. Decomposition of a 32 ms voiced speech frame containing an /r/ sound. (a) The original logarithmic power spectrum. (b) The autocorrelation sequence associated with the spectrum in (a). (c) The smooth logarithmic spectral envelope computed by retaining the first 20 cepstral coefficients. (d) The autocorrelation sequence associated with the spectrum shown in (c). (e) The logarithmic excitation spectrum. (f) The autocorrelation sequence associated with the logarithmic spectrum shown in (e).
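The decomposition shown in Fig. 1 can be reproduced in outline with standard cepstral processing. The sketch below builds a crude synthetic voiced frame (the pulse rate, resonance frequencies, decay constants and the 20-coefficient lifter are illustrative stand-ins, not values taken from the paper's data) and recovers a smooth envelope by liftering the cepstrum:

```python
import numpy as np

fs = 8000
n = 256                                   # 32 ms frame at 8 kHz
t = np.arange(n)

# Crude voiced-speech surrogate: a 100 Hz pulse train through a damped
# two-resonance impulse response (stand-ins for formants).
pulses = np.zeros(n)
pulses[::fs // 100] = 1.0
h = (np.exp(-t / 40.0) * np.cos(2 * np.pi * 700 * t / fs)
     + 0.5 * np.exp(-t / 60.0) * np.cos(2 * np.pi * 1800 * t / fs))
frame = np.convolve(pulses, h)[:n] * np.hamming(n)

# Log power spectrum and its cepstrum (inverse FFT of the log spectrum).
spec = np.abs(np.fft.fft(frame)) ** 2 + 1e-12
cep = np.fft.ifft(np.log(spec)).real

# Liftering: keep only the first 20 cepstral coefficients (and their
# symmetric counterparts) to recover the smooth spectral envelope,
# as in Fig. 1(c).
lifter = np.zeros(n)
lifter[:20] = 1.0
lifter[-19:] = 1.0
envelope = np.fft.fft(cep * lifter).real   # smooth log power spectrum

# The envelope is a low-order fit: it tracks the log spectrum on average
# but discards the pitch-harmonic ripple, so its variance is smaller.
assert envelope.shape == (n,)
assert np.std(envelope) < np.std(np.log(spec))
```

The excitation spectrum of Fig. 1(e) corresponds to the discarded cepstral coefficients, `cep * (1 - lifter)`, and the two components add back to the original log spectrum.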
Therefore, we should be free to estimate the smooth spectral envelope using any region of the autocorrelation sequence.

2.2. Noise Signals

The autocorrelation sequences of noise signals vary much more than those of speech signals. This variation can be attributed to the larger range of production mechanisms for noise signals compared with the simple production model applicable to speech. Some general observations can be made. All autocorrelation sequences have their largest absolute value at the zero-lag location; this coefficient represents the energy of the signal. The shape of the autocorrelation envelope moving away from the zero-lag location is directly related to the noise source. Generally, the envelope decays with increasing lag. Some of the decay can be attributed to the biased autocorrelation estimation algorithm, but the decay is usually faster than the rate the algorithm imposes.

As an example of non-stationary noise, an emergency vehicle siren and its analysis are shown in Fig. 2. Plot (a) shows the spectrogram for a two second segment of the noise. Plots (b), (c) and (d) show the logarithmic power spectrum at times 0.5, 1.0 and 1.5 seconds respectively, and plots (e), (f) and (g) show the corresponding autocorrelation sequences.

When uncorrelated noise is added to a speech signal, the combination in the autocorrelation domain can be described as follows. The zero-lag coefficient is corrupted, and the lower-lag coefficients are generally more corrupted than the higher-lag coefficients. If the spectral envelope information is sufficiently contained in the higher-lag autocorrelation coefficients, a more noise robust spectral estimate should result if the more corrupt lower-lag coefficients are de-emphasised during spectral estimation. The lower-lag coefficients can be significantly attenuated by using a tapered window

function. This also has the added effect of attenuating the very high-lag coefficients, which have high estimation variance.

Fig. 2. Analysis of a siren noise signal using 32 ms frames. (a) Spectrogram of a 2 second sample of siren noise. (b)(c)(d) The logarithmic power spectrum of frames taken at 0.5, 1.0 and 1.5 seconds respectively. (e)(f)(g) The autocorrelation sequences corresponding to the spectra in (b)(c)(d) respectively.

3. SPECTRAL ESTIMATION FROM HIGHER-LAG AUTOCORRELATION

Based on the motivation discussed above, we compute a spectral estimate as the magnitude spectrum of the windowed one-sided autocorrelation sequence. A new speech recognition feature is then computed by substituting this spectral estimate for the power spectrum in the MFCC algorithm.

To compute the new spectral estimate from the one-sided autocorrelation sequence, we first designed a suitable high dynamic range window function. Since the dynamic range of the magnitude spectrum of the autocorrelation sequence is the same as the dynamic range of the power spectrum of the time domain signal, the window function applied to the autocorrelation sequence needs twice the dynamic range of the window function normally used on the time domain signal. We devised a novel window design method for this application as an alternative to more complex general design methods such as Kaiser or Dolph-Chebyshev: a window function with twice the dynamic range of a seed window can be computed as the autocorrelation of the seed window. This technique also gives the new window a side-lobe profile that matches the side-lobe profile of the seed window. In the following experiments, the window function used on the autocorrelation sequence was computed as the autocorrelation of a Hamming window.

4. RECOGNITION EXPERIMENTS

In these experiments, we compared the noise robustness of the new speech recognition feature with MFCCs. For the evaluation, we used the Aurora II database, recognition scripts and the HTK software. We used a range of stationary and non-stationary noise samples, which included Gaussian white noise, car noise, siren noise (as featured in Fig. 2), and an artificial chirp noise, which repeatedly swept from 0 to 4 kHz in 32 ms.

Recognition accuracy curves for the four noise cases are shown in Fig. 3. These results show that the AMFCC features performed as well as the MFCC features in clean conditions, and that the AMFCC features are more noise robust than the MFCC features in all the tested cases. The extent of the robustness improvement shown by the AMFCCs appears to depend on the type of noise: the least improvement was displayed in the car noise case, and the most in the artificial chirp noise case. The artificial chirp noise produces large magnitude lower-lag autocorrelation coefficients and very low magnitude higher-lag coefficients over a short analysis window, which explains the dramatic improvement for AMFCCs for this type of noise.

5. CONCLUSIONS

In this paper, we have introduced a new noise robust spectral estimation technique for speech signals, computed as the magnitude spectrum of the windowed one-sided higher-lag autocorrelation sequence. We also introduced a new high dynamic range window function design approach, specifically suited to designing windows for the autocorrelation domain, in which the window is computed as the autocorrelation of a seed window function used in the time domain.
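The window-design claim is easy to check numerically: since the spectrum of a window's autocorrelation is the squared magnitude of the seed window's spectrum, the side-lobe attenuation in dB roughly doubles while the side-lobe profile keeps its shape. A brief sketch (the window length and FFT size are illustrative choices):

```python
import numpy as np

def peak_sidelobe_db(w, nfft=8192):
    """Peak side-lobe level of a window, in dB relative to the main lobe."""
    spec = np.abs(np.fft.rfft(w, nfft))
    spec /= spec[0]                 # normalise the main-lobe peak (at DC)
    # Walk down the main lobe to its first local minimum (the first null).
    i = 1
    while spec[i + 1] < spec[i]:
        i += 1
    return 20 * np.log10(np.max(spec[i:]))

n = 64
seed = np.hamming(n)
# High dynamic range window: the autocorrelation of the Hamming seed.
hdr = np.correlate(seed, seed, mode='full')    # two-sided, length 2n - 1

side_seed = peak_sidelobe_db(seed)
side_hdr = peak_sidelobe_db(hdr)

# Squaring the magnitude spectrum doubles the attenuation in dB, so the
# autocorrelated window's side lobes sit roughly twice as deep.
assert side_seed < -35              # Hamming side lobes are deep already
assert side_hdr < 1.9 * side_seed   # about twice as deep (values are negative)
```

The same construction applies to any seed window, which is what makes it a simple alternative to Kaiser or Dolph-Chebyshev design for this autocorrelation-domain use.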
The new spectral estimate was used in the MFCC algorithm to produce speech recognition features called AMFCCs. On the Aurora II database, the AMFCC features gave higher recognition accuracy scores than MFCCs over a range of SNRs for both stationary and non-stationary noises.

6. REFERENCES

[1] Y. T. Chan and R. P. Langford, "Spectral estimation via the high-order Yule-Walker equations," IEEE Trans. on ASSP, vol. ASSP-30, no. 5, pp. 689-698, Oct. 1982.

[2] K. K. Paliwal, "A noise-compensated long correlation matching method for AR spectral estimation of noisy signals," in Proc. ICASSP, 1986, pp. 369-372.

Fig. 3. Recognition accuracy results from the Aurora II database for MFCC and AMFCC features. (a) White Gaussian noise. (b) Emergency vehicle siren noise. (c) Car noise. (d) Artificially generated chirp noise.

[3] J. A. Cadzow, "Spectral estimation: An overdetermined rational model equation approach," in Proc. IEEE, Sep. 1982, vol. 70, pp. 907-939.

[4] D. Mansour and B. H. Juang, "The short-time modified coherence representation and noisy speech recognition," IEEE Transactions on ASSP, vol. 37, no. 6, pp. 795-804, Jun. 1989.

[5] J. Hernando and C. Nadeu, "Linear prediction of the one-sided autocorrelation sequence for noisy speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 5, no. 1, pp. 80-84, Jan. 1997.

[6] S. M. Kay, "The effects of noise on the autoregressive spectral estimator," IEEE Transactions on ASSP, vol. ASSP-27, no. 5, pp. 478-485, Oct. 1979.