SPEECH PARAMETERIZATION FOR AUTOMATIC SPEECH RECOGNITION IN NOISY CONDITIONS

Bojana Gajić
Department of Telecommunications, Norwegian University of Science and Technology
7491 Trondheim, Norway
gajic@tele.ntnu.no

Kuldip K. Paliwal
School of Microelectronic Engineering, Griffith University
Brisbane, QLD 4111, Australia
K.Paliwal@me.gu.edu.au

ABSTRACT

This paper is concerned with increasing the robustness of automatic speech recognition (ASR) systems against additive background noise, by finding speech parameters that are less influenced by changes in acoustic environments than the conventional ones. Inspired by the good robustness of auditory based speech parameterization methods, we compare the steps involved with those in the conventional methods from the signal processing point of view. The use of dominant spectral frequencies is believed to be an important reason for the superior robustness of the auditory based methods. A new speech parameterization method is described that is conceptually similar to auditory based methods, while retaining the low computational cost of the conventional methods. Evaluation on an ASR task has shown that the new method outperformed the conventional methods in the presence of various background noises.

1. INTRODUCTION

State-of-the-art automatic speech recognition (ASR) systems are capable of achieving very high recognition accuracy when tested in laboratory conditions. However, they usually experience a dramatic decrease in performance when used in real-world applications. One of the main reasons for this behavior is the presence of background noise in the testing environment that was not observed during system training. This problem becomes especially important for ASR on mobile devices, as the acoustic environment is constantly changing and cannot be accounted for during system training. One way to overcome this problem is to find a speech parameterization that is invariant to changing acoustic environments.
The most commonly used speech parameters are based on the energy information derived from the short-term speech spectrum. However, the dominant spectral frequencies are less influenced by additive noise than the energy information. Thus, it is expected that the robustness of ASR systems could be improved if the dominant spectral frequencies are efficiently incorporated into speech parameter vectors.

The paper is organized as follows. It starts with an overview of ASR systems in Section 2, and describes the robustness problem with possible solutions in Section 3. Section 4 summarizes the main processing steps involved in conventional and auditory based speech parameterization methods and describes a new method that combines the advantages of both classes of methods. An experimental study performed to compare the performance of the different parameterization methods on an ASR task in various acoustic environments is described in Section 5. Finally, the major conclusions are summarized in Section 6.

2. THE ASR SYSTEM

The aim of automatic speech recognition (ASR) is to transform a given spoken utterance into the corresponding transcription. A block diagram of an ASR system is shown in Figure 1. Before the system can be used, it has to learn the characteristic speech patterns from a large speech database with accompanying transcriptions. A set of stochastic models (hidden Markov models) is trained, each corresponding to one speech unit (for example, a phoneme). In addition, a lexicon is prepared to describe how the words are built up from the basic speech units, as well as a language model describing the relationship between words. The models, lexicon and language model are then used to determine the most likely transcription of an incoming spoken utterance.

The speech parameterization block is used to extract from the speech waveform the information relevant for discriminating between different speech sounds. The information is presented as a sequence of parameter vectors.
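As a compact reminder of what "most likely transcription" means in Section 2, the recognition step is commonly written as the maximization below. The symbols are notation introduced here and do not come from the paper: O is the sequence of parameter vectors and W is a candidate word sequence.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid O) \;=\; \arg\max_{W} \; p(O \mid W)\, P(W)
```

Here p(O | W) is supplied by the hidden Markov models together with the lexicon, and P(W) by the language model. The parameterization only determines O, which is why a noise-sensitive parameterization degrades p(O | W) when the models were trained on clean speech.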
This paper describes several different approaches to speech parameterization, and compares their performance on an ASR task in various noisy conditions.

[Figure 1: Block diagram of an ASR system]

3. THE ROBUSTNESS PROBLEM

Robustness of an ASR system is the system's ability to successfully deal with different aspects of variability in the speech signal. Some of the common variabilities that occur in speech signals are listed below:

- Pronunciation variations between speakers, depending on the speaker's voice characteristics, dialect, social class, etc.
- Pronunciation variations for a given speaker, depending on mood, emotions, context, etc.
- Variations in the acoustic environment.
- Variations in the transmission channel.

A number of techniques have been proposed to increase the robustness of ASR systems. Nevertheless, the lack of robustness remains a major obstacle to reliable use of ASR technology in many real-world applications.

As mobile hand-held terminals become more common, robustness against variations in the acoustic environment becomes increasingly important. State-of-the-art ASR systems experience a dramatic performance degradation when the acoustic environment differs from the one observed in training. In the following, we list the major classes of approaches for overcoming this problem.

- Multiconditional training: The idea is to train a separate set of models for each background environment likely to occur during system use. For a given acoustic environment, the most likely set of models is then found and used during the recognition process.
- Noise reduction: This approach is concerned with reducing the presence of noise in the speech signal before it is sent to the recognizer. When the models are trained in noise-free environments, this reduces the mismatch between the input speech signal and the models. The most common approach is to apply noise spectral subtraction.
- Model compensation and adaptation: Instead of modifying the speech signal to better comply with the models, in this approach the models are changed according to the statistical characteristics of the noise to better comply with the noisy speech.
- Robust speech parameterization: The aim is to find a speech representation that is invariant to changes of the acoustic environment. Note that this approach differs from the other approaches in that it does not require knowledge of the particular acoustic environment during the use of the system.

In the rest of this paper, we focus on robust speech parameterization.

4. SPEECH PARAMETERIZATION

This section starts with a summary of the major processing steps involved in conventional methods for speech parameterization. It proceeds by explaining the idea behind auditory based methods, which have been shown to outperform the conventional methods in noisy conditions. The major differences between the two classes of methods are then explained from the signal processing point of view. At the end, a new parameterization method is described that combines the advantages of both conventional and auditory based methods.

4.1. Conventional Methods

Conventional methods for speech parameterization are based on extracting information from the short-term power spectrum of speech. The speech signal is divided into overlapping speech frames of 20-30 ms length, as the speech signal can be regarded as stationary over such short intervals. The short-term power spectrum is estimated for each frame using the discrete Fourier transform (DFT, usually computed with the fast Fourier transform, FFT), filter bank analysis or linear prediction analysis. The resulting spectral representation is usually modified by applying some auditory motivated processing. At the end, it is usual to perform a decorrelation transformation, as this simplifies the recognition process.

Mel-frequency cepstrum coefficients (MFCC) are the most widely used speech parameters for ASR. Figure 2 illustrates the major processing steps involved in their computation.

[Figure 2: Illustration of MFCC computation]

The short-term speech spectrum is estimated using FFT. It is passed through a filter bank consisting of overlapping triangular bandpass filters uniformly distributed along the perceptually based mel-frequency scale. The choice of the filter bank is motivated by knowledge of human hearing. A vector of subband log-energies is then computed and sent to a discrete cosine transform (DCT) for decorrelation purposes. The resulting DCT coefficients, referred to as MFCC, serve as the final representation of the given speech frame.

In the case of noisy speech, the subband energies are affected by noise, and the resulting speech representation differs from the one for clean speech. Thus, if an ASR system is trained on clean speech and used in noisy conditions, the mismatch can cause a large performance degradation.

4.2. Auditory Based Methods

Humans have a fascinating ability to recognize speech in noisy acoustic environments. Thus, there is a belief that the robustness of ASR systems could be considerably improved by simulating the processes in the human auditory system. However, not all the processes in human speech recognition are well understood, and auditory based methods for speech parameterization have to rely on some heuristics.

Probably the best known auditory based parameters for ASR are the so-called Ensemble Interval Histograms (EIH) [1]. In this paper, we present a slight modification of these parameters, referred to as Zero Crossings with Peak Amplitudes (ZCPA) [2].
The ZCPA parameters have been shown to outperform both the EIH and the conventional parameterization methods in the presence of additive noise. An illustration of the ZCPA method is shown in Figure 3.

[Figure 3: Illustration of ZCPA computation]

A frame of the given speech signal is passed through a filter bank of bandpass filters. The filtering is done in the time domain. The resulting subband signals are sent to zero-crossing detectors. The interval between each pair of successive zero crossings is measured, together with the signal peak amplitude between the zero crossings. Then, the inverse intervals between successive zero crossings over all the subband signals are recorded in a histogram. Each histogram entry is weighted by the logarithm of the corresponding peak amplitude. Finally, the DCT is performed for decorrelation purposes.

Note that the ZCPA computation represents an alternative way of performing spectral analysis. The inverse intervals between successive zero crossings represent the instantaneous dominant frequencies of the subband signal. The peak amplitudes, on the other hand, represent a measure of the instantaneous energy of the subband signal. The histogram bins containing the dominant frequencies are increased by the corresponding energy measures. Thus the resulting histogram is an alternative representation of the signal spectrum.

While the MFCC is based only on the subband energy computation, ZCPA efficiently combines the energy and dominant frequency information. We believe that this difference can be a part of the explanation for the ZCPA's superior performance in noisy conditions. The dominant speech frequencies are much less affected by the presence of additive noise than the subband energy measures. Thus, incorporation of the dominant frequencies in the speech parameter vector can lead to increased robustness against additive noise.

However, the ZCPA computation is prohibitively expensive for use in practical ASR systems. This is due to the time-domain processing and the need for heavy interpolation of the higher frequency subband signals in order to obtain precise zero-crossing locations.

4.3. Subband Spectral Centroid Histograms

Motivated by the good noise robustness of the ZCPA parameters and the computational efficiency of the MFCC parameters, we searched for a new parameterization method that would be more robust than MFCC but have an acceptable computational cost. We believed that this task could be achieved by finding a more computationally efficient method for incorporating the dominant frequency information.

In [3] it has been shown that Subband Spectral Centroids (SSC) are closely related to the dominant speech frequencies. Using SSC as additional features to MFCC has been shown to increase the robustness of ASR systems against additive noise [3, 4, 5, 6, 7]. We proposed a new framework for combining the SSC and subband energies through the construction of Subband Spectral Centroid Histograms (SSCH) [8, 9].

An illustration of the processing steps involved in the SSCH computation is shown in Figure 4. The speech power spectrum is estimated using FFT, and filtering is performed in the frequency domain to produce a number of subband signals.
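The zero-crossing and peak-amplitude histogram of Section 4.2 can be sketched as follows. It is a simplified sketch, not the ZCPA implementation from [2]: it assumes the bandpass filtering has already been done, counts only upward zero crossings at sample resolution (no interpolation), and uses log(1 + peak) as the weight so that weights stay non-negative. The function name and the bin edges are hypothetical.

```python
import math


def zcpa_histogram(subband_signals, sample_rate, bin_edges_hz):
    """Accumulate a frequency histogram over all subband signals.

    Each interval between successive upward zero crossings votes for the
    bin containing its inverse length (in Hz), weighted by the log of the
    peak amplitude found inside that interval.
    """
    hist = [0.0] * (len(bin_edges_hz) - 1)
    for x in subband_signals:
        # Sample indices where the signal goes from negative to non-negative.
        ups = [i for i in range(1, len(x)) if x[i - 1] < 0.0 <= x[i]]
        for start, end in zip(ups, ups[1:]):
            freq = sample_rate / (end - start)   # inverse interval length
            peak = max(x[start:end])             # peak amplitude in interval
            for b in range(len(hist)):
                if bin_edges_hz[b] <= freq < bin_edges_hz[b + 1]:
                    hist[b] += math.log(1.0 + peak)
                    break
    return hist
```

For a single 500 Hz tone sampled at 8 kHz, every interval is 16 samples long, so all the weight falls into the bin that contains 500 Hz. The sample-resolution crossing detection is exactly what becomes too coarse at high frequencies and forces the interpolation cost mentioned above. SSCH keeps this histogram step but takes its frequency and energy estimates from FFT-based subband spectra.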
The spectrum estimation and frequency-domain filtering are analogous to the MFCC method. The dominant frequency of each subband signal is estimated by the subband centroid. In addition, a subband energy measure is computed in a similar way as for the MFCC method. The dominant frequency and energy information over all the subbands are combined in a single histogram in the same way as for the ZCPA method. Finally, the DCT is performed for decorrelation purposes.

[Figure 4: Illustration of SSCH computation]

This method uses the same conceptual information as the ZCPA method. However, note that the dominant frequencies are now estimated from the short-term power spectrum. This is a disadvantage in noisy conditions, as the spectrum itself is corrupted by noise. On the other hand, the fact that the processing is done in the spectral domain dramatically reduces the computational cost compared to ZCPA. It is now of the same order as for the MFCC computation.

5. EXPERIMENTAL STUDY

This section describes an experimental study performed to compare the performance of the described methods on an ASR task in various background conditions.

5.1. Task and Database

The methods were evaluated on the ISOLET Spoken Letter Database [10] down-sampled to 8 kHz. The database consists of English letters spoken in isolation, recorded in a quiet room. Two repetitions of each word were recorded for each speaker. Utterances from 90 speakers were used for training, while utterances from 30 speakers were used for evaluation. Although the vocabulary consisting of 26 English letters is rather small, this is not a simple recognition task, since the vocabulary words are very short and highly confusable.

Noisy speech was artificially created by adding to the original test set four different noise types at four different signal-to-noise ratios (SNR). The noise types were white Gaussian noise, factory noise, car noise and background speech. The last three noise types were taken from the NOISEX database, where they are referred to as factory1, volvo and babble noise respectively. A segment of the noise file equal in length to the speech file was randomly extracted and added to the speech file at the required SNR. SNR was computed as the ratio between the maximal frame energy of the speech file and the average energy of the noise segment. This way of computing makes the SNR independent of the duration of the surrounding silence in the speech files.

Model training and recognition were performed using the speech recognition toolkit HTK [11]. One hidden Markov model (HMM) with five states and five Gaussian mixture components per state was trained for each vocabulary word.

5.2. Choice of Free Parameters

In the following we summarize the most important parameters involved in the MFCC, ZCPA and SSCH computation.

- MFCC: Frame length was set to 25 ms. The filter bank consisted of 24 overlapping triangular filters uniformly spaced along the mel-frequency scale. 12 DCT coefficients were used. This is the standard parameter setting for the MFCC computation. It was not optimized for this particular task.
- ZCPA: The filter bank consisted of 20 bandpass FIR filters linearly spaced on the Bark-frequency scale (a perceptually based frequency scale similar to the mel-frequency scale), with bandwidths equal to 2 Bark. The filters had order 61 and were designed using the windowing method. Frequency dependent frame lengths equal to 20/f_c were used, where f_c is the center frequency of the corresponding bandpass filter. The number of histogram bins was 26. The number of DCT coefficients was 12.
- SSCH: Frame length was set to 25 ms. The filter bank consisted of 65 rectangular filters. In the low frequency range, the filter bandwidth was 300 Hz and the filters were linearly spaced along the frequency scale. In the high frequency region, the filter bandwidth was 2 Bark and the filters were linearly spaced along the Bark-frequency scale.
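The SSCH processing chain of Section 4.3 (power spectrum, rectangular subbands, centroid and energy per subband, histogram, DCT) can be sketched for one frame as follows. This is a hedged illustration rather than the authors' implementation from [8, 9]: the log(1 + energy) weight, the absence of a window, the naive DFT, the omission of the zeroth DCT coefficient, and the function names, subband layout and bin edges are all assumptions made here.

```python
import cmath
import math


def power_spectrum(frame):
    """Power spectrum for bins 0..N/2 (naive DFT, no window, for clarity)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]


def ssch_histogram(frame, sample_rate, subbands_hz, bin_edges_hz):
    """Histogram of subband centroids, weighted by log subband energy."""
    spec = power_spectrum(frame)
    bin_hz = sample_rate / len(frame)
    hist = [0.0] * (len(bin_edges_hz) - 1)
    for lo, hi in subbands_hz:
        # Rectangular filter: keep the spectral bins inside [lo, hi).
        ks = [k for k in range(len(spec)) if lo <= k * bin_hz < hi]
        energy = sum(spec[k] for k in ks)
        if energy <= 0.0:
            continue
        # Centroid: power-weighted mean frequency of the subband.
        centroid = sum(k * bin_hz * spec[k] for k in ks) / energy
        for b in range(len(hist)):
            if bin_edges_hz[b] <= centroid < bin_edges_hz[b + 1]:
                hist[b] += math.log(1.0 + energy)
                break
    return hist


def ssch(frame, sample_rate, subbands_hz, bin_edges_hz, num_coeffs=12):
    """Unnormalized DCT-II coefficients 1..num_coeffs of the histogram."""
    hist = ssch_histogram(frame, sample_rate, subbands_hz, bin_edges_hz)
    n = len(hist)
    return [sum(h * math.cos(math.pi * c * (i + 0.5) / n)
                for i, h in enumerate(hist))
            for c in range(1, num_coeffs + 1)]
```

With overlapping subbands, every subband that contains a strong spectral peak places its centroid near that peak, so several subbands vote for the same histogram bin. For example, a 1000 Hz tone analyzed with 600 Hz wide subbands spaced 300 Hz apart receives votes from the two subbands that cover 1000 Hz.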
For SSCH, 12 DCT coefficients were computed from 26 histogram bins.

Delta and delta-delta parameters were computed in addition to the static parameters for all of the methods, resulting in 36-dimensional parameter vectors.

5.3. Experimental Results

Table 1 shows the results of the evaluation of the MFCC, SSCH and ZCPA parameterization methods on both clean and noisy versions of the ISOLET database. Model training was performed using clean speech. The recognition performance was measured in terms of word accuracy.

[Table 1: Word accuracy for different parameterization methods in various acoustic environments. Panels: a) White Gaussian noise, b) Car noise, c) Factory noise, d) Background speech. Rows: MFCC, SSCH, ZCPA. Columns: clean speech and the tested SNR levels.]

Looking at the results in Table 1, we see that MFCC performs best on clean speech. However, even in the presence of only a small amount of noise, the situation changes completely, and MFCC becomes the worst of the three methods. This confirms the lack of robustness of the MFCC parameters.

SSCH is significantly more robust than MFCC for all the noise types. The improvement is largest for car noise, and smallest in the presence of background speech. The relatively poor performance in the presence of background speech is probably due to the existence of speech-like spectral peaks in the background signal.

SSCH even outperforms ZCPA in the case of car noise, while ZCPA is more robust in the presence of the other noise types. However, it is important to note that ZCPA cannot be used in place of SSCH in practical applications, due to its prohibitive computational cost.

6. CONCLUSIONS

In this paper, we addressed the robustness problem of ASR systems against additive background noise. One way of overcoming this problem is to find a speech parameterization that is less influenced by additive noise than the conventional parameters. We compared the steps involved in conventional and auditory based methods, and concluded that the superior performance of the auditory methods can be explained by the incorporation of the dominant spectral frequencies into parameter vectors. A new speech parameterization method was described that computes the dominant spectral frequencies in a more efficient way, from the short-term spectrum of speech. This method also outperformed the conventional methods in noisy conditions, confirming the importance of utilizing the dominant spectral frequencies for increasing the robustness of ASR systems.

REFERENCES

[1] O. Ghitza, "Auditory models and human performance in tasks related to speech coding and speech recognition," IEEE Trans. on Speech and Audio Processing, vol. 2, January.
[2] D.-S. Kim, S.-Y. Lee, and R. M. Kil, "Auditory processing of speech signals for robust speech recognition in real-world noisy environments," IEEE Trans. on Speech and Audio Processing, vol. 7, January.
[3] K. K. Paliwal, "Spectral subband centroid features for speech recognition," in Proc. ICASSP, vol. 2, May.
[4] S. Tsuge, T. Fukada, and H. Singer, "Speaker normalized spectral subband parameters for noise robust speech recognition," in Proc. ICASSP, May.
[5] D. Albesano, R. D. Mori, R. Gemello, and F. Mana, "A study of the effect of adding new dimensions to trajectories in the acoustic space," in Proc. EUROSPEECH, vol. 4, September.
[6] R. D. Mori, D. Albesano, R. Gemello, and F. Mana, "Ear-model derived features for automatic speech recognition," in Proc. ICASSP.
[7] E. Gjelsvik, "Modification of front-end processing for robust speech recognition," Diploma thesis, Norwegian University of Science and Technology, June.
[8] B. Gajić and K. K. Paliwal, "Robust feature extraction using subband spectral centroid histograms," in Proc. ICASSP, May 2001.
[9] B. Gajić and K. K. Paliwal, "Robust parameters for speech recognition based on subband spectral centroid histograms," in Proc. EUROSPEECH, September.
[10] R. A. Cole, Y. K. Muthusamy, and M. Fanty, "The ISOLET spoken letter database," Technical report CSE, Oregon Graduate Institute of Science and Technology, Beaverton, OR, USA, March.
[11] S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book. Entropic.
