Robust telephone speech recognition based on channel compensation


Pattern Recognition 32 (1999) 1061-1067

Robust telephone speech recognition based on channel compensation

Jiqing Han*, Wen Gao

Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin, People's Republic of China

Received 16 October 1997; received in revised form 16 July 1998

* Corresponding author.

Abstract

Channel compensation has been proved to be an effective approach for robust speech recognition. In this paper, we compare the performance of our proposed method, RMFCC, with that of earlier channel compensation methods (CMS, two-level CMS and RASTA) for robust telephone speech recognition. For all experiments, a Korean isolated 84-word database consisting of 80 speakers, collected over local telephone lines, is adopted. Using RMFCC, a 39.8% reduction in word error rate is obtained relative to the conventional HMM system. The experiments show that RMFCC, compared with RASTA, reduces the computational complexity without losing accuracy, and that it also outperforms CMS and two-level CMS. After discussion, we verify that suppressing very low modulation frequencies by filtering is an effective approach for robust telephone speech recognition. 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Channel compensation; Speech recognition; Robustness; Modulation frequencies; Signal-to-noise ratio

1. Introduction

The speech signal carries not only the linguistic message but is also influenced by other sources of information. One of the more harmful sources of non-linguistic variability is the communication environment, which typically includes the recording room, the microphone and, most importantly, the communication channel such as a telephone line. The performance of an automatic speech recognizer (ASR) can degrade dramatically when the recognizer is applied in an environment different from the one in which it was trained. Although the degradation may often be attributed to non-linear effects of the environment and to additive noise in the signal, the frequency characteristic of the communication channel alone can strongly influence the short-time spectrum of the speech. Since most similarity measures applied in ASRs are directly or indirectly based on the short-time spectrum of the speech, ASR performance can also be significantly influenced by the frequency characteristic of the communication channel. It has been reported [1] that the error rate of a speech recognizer can increase from 1.3 to 44.6% when the testing data are filtered by a pole/zero filter modeling a long-distance telephone line. Finding robust channel compensation methods is therefore crucial for the practical application of telephone speech recognition.

The robustness of speech recognition has been widely discussed, and a variety of channel compensation approaches have been proposed [2-5]. Cepstral mean subtraction (CMS) [2], which first calculates the cepstral mean of an utterance and then subtracts it from the cepstral coefficients of each frame, is one of the more effective algorithms considering its simplicity. However, the effectiveness of CMS is severely limited when the channel cannot be adequately modeled by a linear one.

In order to process a non-linear channel, two-level CMS has been proposed: it first classifies the input speech signal into two parts and calculates a cepstral mean for each part, and the two cepstral means are then subtracted from the cepstral coefficients of the corresponding parts. This method has been used effectively for channel compensation in connected-digit recognition for mobile applications [3] and in speaker recognition [4], and it outperforms CMS. However, it requires signal classification, and its performance depends on the classification result. RelAtive SpecTrAl (RASTA) processing [5], which uses a band-pass filter with a very low cut-off frequency, is an effective channel compensation method: it can suppress slowly varying channel distortions and achieves good performance. Conventional RASTA processing is applied to the perceptual linear predictive (PLP) [6] log spectrum. However, PLP requires complex computation. In our previous work [7], we used mel spectral analysis [8] instead of the PLP approach to reduce the computation. Based on the linear relationship between the mel-frequency log spectrum and the mel-frequency cepstral coefficients (MFCCs), we extend RASTA processing from the mel-frequency log spectrum to the MFCCs, and a RASTA-like band-pass filter is proposed for robust speech recognition. We then select the pole parameter of the filter by experiments and discuss the initial-value selection of the integrator.

In this paper, we compare the performance of our proposed method, RMFCC, with that of the earlier channel compensation methods CMS, two-level CMS and RASTA for robust telephone speech recognition. For all experiments, a Korean isolated 84-word database consisting of 80 speakers, collected over local telephone lines, is adopted. Using RMFCC, a 39.8% reduction in word error rate is obtained relative to the conventional HMM system. The experiments show that the proposed method, compared with RASTA, reduces the computational complexity without losing accuracy, and that it also outperforms CMS and two-level CMS. After discussion, we verify that suppressing very low modulation frequencies by filtering is an effective approach for robust telephone speech recognition.

This paper is organized as follows. In Section 2, we introduce the earlier channel compensation methods and our proposed method RMFCC. In Section 3, the telephone speech database and the signal processing used in our experiments are described. The results of our recognition experiments are discussed in Section 4. Finally, in Section 5 we summarize the main conclusions.

2. Channel compensation for robust telephone speech recognition

2.1. Cepstral mean subtraction (CMS)

Cepstral mean subtraction relies on the assumption that the ensemble average of the input speech feature stream is zero, and it modifies the cepstral coefficients to minimize the mismatch between training and testing data caused by channel distortions. CMS is often regarded as a standard channel compensation method, in which the mean of the cepstral vectors of an utterance is subtracted from the cepstral coefficients of every frame:

    \bar{C}_t = c_t - \frac{1}{T} \sum_{\tau=0}^{T-1} c_\tau, \qquad t = 0, 1, \ldots, T-1,    (1)

where c_t and C̄_t are the cepstral vectors at frame t before and after CMS processing, respectively, and T is the total number of frames in the utterance.
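For illustration, Eq. (1) amounts to a few lines of NumPy; the (T x K) array layout and the function name are assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np

def cms(cepstra):
    """Cepstral mean subtraction, Eq. (1): subtract the per-utterance
    mean vector from the cepstral vector of every frame.

    cepstra : array of shape (T, K), one K-dimensional cepstral vector per frame.
    Returns an array of the same shape.
    """
    mean = cepstra.mean(axis=0)   # (1/T) * sum over all frames
    return cepstra - mean         # C_t = c_t - mean, for t = 0 .. T-1
```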
2.2. Two-level CMS

Generally speaking, the speech signal is corrupted not only by channel distortions but also by additive noise before entering the ASR (as shown in Fig. 1); in the power-spectral domain the noisy speech Y(ω) is

    Y(\omega) = [X(\omega) + N_1(\omega)]\, H(\omega) + N_2(\omega),    (2)

where X(ω) is the input speech component, N_1(ω) and N_2(ω) are the environmental noises, and H(ω) represents the channel distortions.

Fig. 1. The diagram of speech signal distortions.

For clean training and testing conditions (N_1(ω) and N_2(ω) negligible), the distortions become additive in the cepstral domain, and CMS therefore removes the time-invariant part of the channel distortions. When a high level of additive noise is present (N_1(ω) and N_2(ω) non-negligible), the noise component can be ignored only in speech segments with a high signal-to-noise ratio (SNR), and CMS can be applied in those segments. In the same manner, the speech component can be ignored in segments of very low SNR, i.e. where very low-level speech or no speech is present. Thus, recognition performance can be improved further by using a two-level CMS, in which separate channel compensation is performed for the segments classified as speech and for the segments classified as background.

Given a frame sequence of cepstral observations C = c_0, c_1, ..., c_{T-1}, each frame is classified as either a background or a speech frame. Let E = E_0, E_1, ..., E_{T-1} be the log energy sequence of the observation; then the background function is

    \mathrm{bck}(t) = \begin{cases} 1, & E_t < \alpha E_{\max} \\ 0, & \text{otherwise} \end{cases} \qquad t = 0, 1, \ldots, T-1,    (3)

where E_max is the maximum log energy of the observation and the parameter α is an empirically selected constant. For two-level CMS, the compensated cepstral vectors C̄_t are computed according to

    \bar{C}_t = \begin{cases} c_t - \bar{c}_b, & \mathrm{bck}(t) = 1 \\ c_t - \bar{c}_s, & \text{otherwise} \end{cases} \qquad t = 0, 1, \ldots, T-1,    (4)

where c̄_b and c̄_s are the cepstral mean vectors of the background and speech frames, respectively.
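A minimal sketch of Eqs. (3) and (4), assuming the cepstra are stored as a (T x K) array and the log energies as a length-T vector; the threshold form E_t < αE_max follows the reconstruction of Eq. (3) above.

```python
import numpy as np

def two_level_cms(cepstra, log_energy, alpha=0.1):
    """Two-level CMS, Eqs. (3)-(4): classify each frame as background or
    speech from its log energy, then subtract the class-specific cepstral mean.

    cepstra    : (T, K) cepstral vectors.
    log_energy : (T,) log energy per frame.
    alpha      : classification constant (0.1 was found optimal in Section 4.1).
    Assumes the utterance contains at least one frame of each class.
    """
    bck = log_energy < alpha * log_energy.max()   # Eq. (3): True for background frames
    comp = np.empty_like(cepstra)
    # Eq. (4): subtract the background mean from background frames,
    # and the speech mean from speech frames.
    comp[bck] = cepstra[bck] - cepstra[bck].mean(axis=0)
    comp[~bck] = cepstra[~bck] - cepstra[~bck].mean(axis=0)
    return comp
```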

2.3. RASTA-PLP technique

Perceptual experiments suggest that human speech perception may be able to suppress a stationary non-linguistic background and enhance the variable linguistic message [9]. Thus, it is useful to adopt features based on human hearing for robust speech recognition. In the RASTA-PLP technique, several well-known properties of hearing are simulated by practical engineering approximations, and the following band-pass filter is then applied to a log spectral representation of the speech:

    H(z) = 0.1\, z^{4}\, \frac{2 + z^{-1} - z^{-3} - 2 z^{-4}}{1 - 0.94\, z^{-1}}.    (5)

The numerator of this filter represents a linear regression estimate of the temporal derivative, while the denominator represents a simple leaky integrator. RASTA processing can effectively suppress slowly varying channel distortions.

2.4. RMFCC method

In the RASTA-PLP technique, only the RASTA processing is used to suppress slowly varying channel distortions, while PLP is used to simulate the properties of human hearing. Mel spectral analysis is another approach that simulates the properties of human hearing, and it is simpler than PLP analysis. Specifically, mel spectral analysis does not need to calculate the complex equal-loudness pre-emphasis and the intensity-loudness power law, nor to conduct a second spectral analysis [6]. If H(z) denotes the RASTA band-pass filter and Y_i and Ȳ_i denote the ith mel-frequency log spectrum in the z-transform domain before and after RASTA processing, then

    \bar{Y}_i = H(z)\, Y_i.    (6)

The MFCCs, which are used as the features in most current speech recognizers, are calculated by applying a discrete cosine transform (DCT) to the mel-frequency log spectrum as follows [8]:

    \bar{C}(k) = \sum_{i=1}^{B} \cos\!\left(\frac{\pi k (i - 0.5)}{B}\right) \bar{Y}_i
               = H(z) \sum_{i=1}^{B} \cos\!\left(\frac{\pi k (i - 0.5)}{B}\right) Y_i
               = H(z)\, c(k), \qquad k = 1, 2, \ldots, K,    (7)

where C̄(k) and c(k) are the kth MFCCs in the z-transform domain with and without RASTA processing, respectively, B is the number of mel-frequency bands, and K is the dimension of the MFCCs. From Eq. (7), it is reasonable to extend RASTA processing from the log spectrum to the MFCCs (i.e. to calculate the MFCCs first and then process them with a band-pass filter). Generally, B is larger than K (e.g. we used B = 40 and K = 12), so this kind of relative MFCC (RMFCC) processing reduces the computational complexity. The main part of RASTA processing is the following IIR filter:

    H(z) = G\, z^{4}\, \frac{2 + z^{-1} - z^{-3} - 2 z^{-4}}{1 - \rho\, z^{-1}}.    (8)

We also use this kind of filter, and its parameters must be selected for our RMFCC processing. When an input signal X[t] passes through the H(z) of Eq. (8), the output Y[t] is

    Y[t] = G \sum_{n=0}^{4} (n - 2)\, X[t + n] + \rho\, Y[t - 1],    (9)

where t = 0, 1, ..., T-1 are the frame indices and the initial value Y[-1] must be selected.
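The recursion of Eq. (9) is easy to apply to MFCC trajectories. The sketch below assumes a (T x K) MFCC array and repeats the final frame to supply the four future frames the regression needs at the end of an utterance; the paper does not state how that boundary is handled, so the padding is an assumption.

```python
import numpy as np

def rmfcc(mfcc, G=0.1, rho=0.92, y_init=0.0):
    """Apply the RMFCC band-pass filter of Eq. (9) to every MFCC trajectory.

    mfcc   : (T, K) array of MFCCs, one K-dimensional vector per frame.
    y_init : the integrator's initial value Y[-1]; zero gave the best
             results in Table 3.
    The regression part uses the four future frames X[t+1..t+4]; here the
    last frame is repeated at the end of the utterance.
    """
    T, K = mfcc.shape
    x = np.vstack([mfcc, np.repeat(mfcc[-1:], 4, axis=0)])   # pad 4 future frames
    y = np.empty((T, K))
    prev = np.full(K, y_init)                                # Y[-1]
    for t in range(T):
        # G * sum_{n=0}^{4} (n-2)*X[t+n]  ==  G*(-2x[t] - x[t+1] + x[t+3] + 2x[t+4])
        reg = G * sum((n - 2) * x[t + n] for n in range(5))
        prev = reg + rho * prev                              # leaky integration
        y[t] = prev
    return y
```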
3. Database and baseline system

The database was collected over the local telephone network in Seoul and Taejon, South Korea, using many kinds of different handsets. Since the system is speaker-independent, several speakers were selected for the experiments. The training database contains utterances from 40 speakers (22 male and 18 female), and the testing database contains utterances from 40 different speakers (22 male and 18 female). The male-to-female ratio in the database reflects that of the general South Korean population. Every speaker read 93 sentences several times, and 84 Korean isolated words were then manually segmented and labeled. Some utterances were discarded due to bad recordings; in total, the testing database contains 8036 utterances.

We use the SNR as an objective measure for evaluating the database. Many SNR measures have been proposed in the literature (see, e.g., Ref. [10]). Since we have no a priori knowledge about telephone speech, several different SNR measures were computed for the training and testing databases; the results are listed in Table 1. Table 1 shows that the measurements are very similar for the training and testing databases, which may be because the database includes a relatively sufficient variety of environmental conditions (speakers, channels, noises) and therefore behaves as statistical theory predicts.

Table 1. Different SNR measurements for the database

Measurement    Training (dB)    Testing (dB)
SNR                             13.95
SEGSNR                          13.85
MAXSNR                          19.00

In the experiments, the speech signal is first digitized at a sampling rate of 8 kHz, a pre-emphasis filter H(z) = 1 - 0.95 z^{-1} is applied to the speech samples, and a Hamming window of 240 samples (30 ms) is applied every 15 ms. Next, the power spectrum of the windowed signal in each frame is computed using a 256-point DFT, and 40 mel-frequency spectral coefficients are derived using mel-frequency band-pass filters. Then, 12 MFCCs are computed using the DCT. Finally, an isolated-word, conventional continuous-density HMM recognizer is implemented as the baseline system for the performance comparison.
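A sketch of this front end under the stated settings (8 kHz sampling, pre-emphasis 1 - 0.95 z^{-1}, a 30 ms Hamming window every 15 ms, a 256-point power spectrum, 40 mel bands, and 12 cepstra from the DCT of Eq. (7)). The triangular shape of the mel filters and their frequency range are assumptions; the paper only gives the number of bands. The input is assumed to be a 1-D NumPy array of samples at least one frame long.

```python
import numpy as np

def mel(f):      # Hz -> mel
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):  # mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=256, fs=8000):
    """Triangular mel filters over the positive-frequency DFT bins (assumed design)."""
    edges = mel_inv(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc_frontend(signal, fs=8000, frame_len=240, hop=120, n_fft=256,
                  n_filters=40, n_ceps=12):
    """MFCC front end as described in Section 3."""
    x = np.append(signal[0], signal[1:] - 0.95 * signal[:-1])   # H(z) = 1 - 0.95 z^-1
    win = np.hamming(frame_len)
    fb = mel_filterbank(n_filters, n_fft, fs)
    n_frames = 1 + (len(x) - frame_len) // hop
    ceps = np.zeros((n_frames, n_ceps))
    k = np.arange(1, n_ceps + 1)[:, None]
    i = np.arange(1, n_filters + 1)[None, :]
    dct = np.cos(np.pi * k * (i - 0.5) / n_filters)             # DCT matrix of Eq. (7)
    for t in range(n_frames):
        frame = x[t * hop: t * hop + frame_len] * win
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2          # power spectrum
        logmel = np.log(fb @ power + 1e-10)                     # 40 mel log energies
        ceps[t] = dct @ logmel                                  # 12 MFCCs
    return ceps
```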

4. Experiments and discussion

A series of experiments was designed, using the training and testing databases, to evaluate the channel compensation methods for robust telephone speech recognition. We implemented the two-level CMS method and selected its classification parameter α by experiment, and we also used experiments to determine the parameters of RMFCC. The performances of all the channel compensation methods were then compared.

4.1. Experiments of two-level CMS

In two-level CMS, as shown in Eq. (3), the parameter α must be determined. We ran comparative experiments with different values of α; the results are illustrated in Fig. 2. We find that α = 0.1 gives an optimum, and this value is adopted as the classification parameter in our two-level CMS experiments.

Fig. 2. Performances of two-level CMS with different classification parameter α.

We implemented a system using CMS as the channel compensation method and compared its performance with that of two-level CMS. Table 2 shows the results of the baseline system and of the systems using CMS and two-level CMS, respectively, together with the reductions in word error rate of the two channel compensation methods relative to the baseline system. Although the linear time-invariant channel assumption is almost never satisfied in practice, CMS still achieves a significant improvement over the baseline system, with a 33.9% reduction in word error rate on the testing database. Two-level CMS performs better than CMS, giving a further 7.7% reduction in word error rate relative to CMS on the testing database.

Table 2. Word error rates and reductions in word error rate using CMS and two-level CMS, compared with the baseline system

Method            Baseline    CMS      Two-level CMS
Training           6.5%       2.7%     2.5%
Error reduction     -         58.5%    61.5%
Testing           11.8%       7.8%     7.2%
Error reduction     -         33.9%    38.9%

From the experimental results, we find that two-level CMS can further suppress the channel distortions in telephone speech recognition.

4.2. Parameter selection in RMFCC

In the RMFCC method there are several adjustable parameters, namely the gain G and the pole ρ in Eq. (8) and the initial value Y[-1] in Eq. (9); all of them affect the recognition accuracy. In previous RASTA processing, a constant 0.94 was used as the pole ρ. For RMFCC processing, we select ρ by comparative experiments. Using the gain G = 0.1, consistent with RASTA, we compare the system performance for different values of ρ; the results are shown in Fig. 3. We observe that ρ = 0.92 gives an optimum, and it is chosen as the pole of the RMFCC filter in the following experiments.

Fig. 3. Performances of different RMFCC filter pole positions.

We also use comparative experiments to select the initial value Y[-1] in Eq. (9) so as to obtain better recognition accuracy. Previous RASTA work did not report how to select the initial value Y[-1]. Those methods all keep the silent part before the speech, so the results depend on how the silence is determined. In a noisy environment it is not easy to determine the silence parts, and unfortunately the silence is often mixed with serious noise. When the integrator is started from the silence, the noise may be introduced again; moreover, extra silence processing is needed. We attempt to find a special Y[-1] that gives good performance without this extra silence processing. Using G = 0.1 and ρ = 0.92, three kinds of initial values (zero, the cepstral mean and the silent part) are compared, and the results are listed in Table 3. The zero initial value gives the best performance; it seems that the zero value normalizes the cepstral coefficients for all utterances. The zero initial value also keeps the computation very simple.

Table 3. Performance comparison for different initial values of the RMFCC integrator

Initial value   Zero     Mean     Silence
Training        97.7%    97.6%    97.7%
Testing         92.9%    91.9%    92.6%

4.3. Comparing experiments

The proposed RMFCC method and the earlier channel compensation methods were evaluated for robust telephone speech recognition; the experimental results are summarized in Table 4. As discussed below, delta-MFCC is related to RMFCC. From this point of view, we implemented another conventional HMM system using MFCC and delta-MFCC as features, where the delta-MFCC features are obtained by a linear regression estimate of the MFCC trajectories (a sketch of this computation follows at the end of this subsection). We also implemented a system using RASTA as the channel compensation method. To try to obtain better performance, we attempted to combine RMFCC with CMS and with two-level CMS, respectively, but the results, as shown in Table 4, do not improve on using RMFCC alone.

Table 4. Word error rates using various types of channel compensation methods

Method                  Training    Testing
Baseline                 6.5%       11.8%
Delta-MFCC               3.4%        9.9%
CMS                      2.7%        7.8%
Two-level CMS            2.5%        7.2%
RASTA                    2.1%        7.1%
RMFCC                    2.3%        7.1%
CMS + RMFCC              2.3%        7.1%
Two-level CMS + RMFCC    2.3%        7.1%

The experiments show that the performance of RMFCC is significantly superior to that of the baseline system, with a 39.8% reduction in word error rate on the testing database. Compared with delta-MFCC, RMFCC produces a 28.3% reduction in word error rate with only a slight increase in computational complexity. It is also better than CMS and two-level CMS. In comparison with two-level CMS, although RMFCC does not obviously improve the performance, it is a straightforward method, whereas both CMS and two-level CMS are post-processing methods that need to calculate the cepstral mean of an utterance to estimate the long-term characteristics of the channel and then subtract that mean from the cepstral coefficients of every frame. Compared with RASTA, RMFCC achieves nearly the same performance but with lower computational complexity. With respect to both performance and computational complexity, RMFCC is the best method.
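As a point of reference for the delta-MFCC system mentioned above, a two-sided linear-regression delta computation might look as follows. The +/-2-frame window is an assumption chosen to match the five-point regression numerator of the RMFCC filter; the paper does not state the window actually used.

```python
import numpy as np

def delta_mfcc(mfcc, N=2):
    """Delta-MFCC by linear regression over +/-N neighbouring frames.

    mfcc : (T, K) float array of MFCCs; returns an array of the same shape.
    Edge frames are handled by repeating the first and last frames.
    """
    padded = np.vstack([np.repeat(mfcc[:1], N, axis=0),
                        mfcc,
                        np.repeat(mfcc[-1:], N, axis=0)])
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    delta = np.zeros_like(mfcc)
    for n in range(1, N + 1):
        # n * (c_{t+n} - c_{t-n}) accumulated over the regression window
        delta += n * (padded[N + n: N + n + len(mfcc)] -
                      padded[N - n: N - n + len(mfcc)])
    return delta / denom
```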

4.4. Discussion

RMFCC has been shown to yield good performance in Section 4.3, and the main part of RMFCC processing is a band-pass filter. Analyzing the RMFCC filter gives the frequency response curve shown as the solid line in Fig. 4. It can be seen that the filter attenuates very low modulation frequencies.

Using delta-MFCC, the performance is better than that of the baseline system, which uses only MFCCs as features; this is consistent with earlier work [2]. We note that there is a relationship between delta-MFCC and RMFCC: when the denominator of the RMFCC filter in Eq. (8) is ignored, RMFCC processing is equivalent to the calculation of delta-MFCC. Therefore, delta-MFCC can also be regarded as a kind of RMFCC processing. The frequency response curve of the filter used in delta-MFCC is the dotted line in Fig. 4; it too suppresses low modulation frequencies, which is why delta-MFCC performs better than MFCC alone. CMS processing can be regarded as a kind of high-pass filtering, which also suppresses low modulation frequencies. Two-level CMS, in which different modulation components are considered and removed by different high-pass filtering, is better than CMS. From this discussion, we see that many channel compensation methods are based on filtering, and we verify that suppressing low modulation frequencies by filtering is an effective approach for robust telephone speech recognition.

Fig. 4. Frequency responses of the RMFCC and delta-MFCC filters.
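The comparison in this discussion can be reproduced numerically. Ignoring the pure-advance factor z^4 (which only affects phase), the magnitude responses of the RMFCC filter of Eq. (8) and of the numerator-only, delta-like filter can be evaluated with SciPy; the frame rate of about 66.7 Hz is inferred from the 15 ms frame shift of Section 3, so the modulation-frequency axis here is an assumption.

```python
import numpy as np
from scipy.signal import freqz

# RMFCC filter of Eq. (8): numerator G*(2 + z^-1 - z^-3 - 2 z^-4),
# denominator 1 - rho*z^-1. Dropping the denominator leaves the
# delta-style regression filter discussed in Section 4.4.
G, rho = 0.1, 0.92
b = G * np.array([2.0, 1.0, 0.0, -1.0, -2.0])

frame_rate = 1.0 / 0.015                     # frames are taken every 15 ms
w, h_rmfcc = freqz(b, [1.0, -rho], worN=512, fs=frame_rate)
_, h_delta = freqz(b, [1.0], worN=512, fs=frame_rate)

# Both responses vanish at 0 Hz, i.e. both filters suppress the very low
# modulation frequencies associated with a time-invariant channel.
print("gain at DC: RMFCC %.4f, delta %.4f" % (abs(h_rmfcc[0]), abs(h_delta[0])))
peak = w[np.argmax(np.abs(h_rmfcc))]
print("RMFCC response peaks near %.1f Hz modulation frequency" % peak)
```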
5. Conclusion

We have extended RASTA processing from the log spectrum to the MFCCs and proposed the RMFCC processing method, and we have compared the performance of RMFCC with that of earlier channel compensation methods. For all experiments, a Korean isolated 84-word database consisting of 80 speakers, collected over local telephone lines, was adopted. Using RMFCC, a 39.8% reduction in word error rate is obtained relative to the conventional HMM system. The experiments show that the proposed method reduces the computational complexity without compromising performance in comparison with RASTA, and that, unlike CMS and two-level CMS, it does not have to estimate the long-term characteristics of the communication channel. From the discussion, we find that many channel compensation methods are based on filtering, and we verify that suppressing very low modulation frequencies by filtering is an effective approach for robust telephone speech recognition.

Acknowledgements

The authors would like to thank Mr Munsung Han, Mr Gyu-Bong Park and Mr Jeongue Park for their support.

References

[1] S. Lerner, B. Mazor, Telephone channel normalization for automatic speech recognition, Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, ICASSP-92 (1992) I261-I264.
[2] S. Furui, Cepstral analysis technique for automatic speaker verification, IEEE Trans. Acoust. Speech Signal Process. 29 (1981) 254-272.
[3] S. Gupta, F. Soong, R. Haimi-Cohen, High accuracy connected digit recognition for mobile applications, Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, ICASSP-96 (1996) 57-60.
[4] D. Reynolds, The effects of handset variability on speaker recognition performance: experiments on the switchboard corpus, Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, ICASSP-96 (1996) 113-116.
[5] H. Hermansky, N. Morgan, RASTA processing of speech, IEEE Trans. Speech Audio Process. 2 (1994) 578-589.
[6] H. Hermansky, Perceptual linear predictive (PLP) analysis of speech, J. Acoust. Soc. Am. 87 (1990) 1738-1752.
[7] J. Han, M. Han, G. Park, J. Park, W. Gao, Relative mel-frequency cepstral coefficients compensation for robust telephone speech recognition, Proc. Europ. Conf. Speech Commun. Technol., Eurospeech-97 (1997) 1531-1534.
[8] J.W. Picone, Signal modeling techniques in speech recognition, Proc. IEEE 81 (1993) 1215-1247.
[9] Q. Summerfield, A. Sidwell, T. Nelson, Auditory enhancement of changes in spectral amplitude, J. Acoust. Soc. Am. 81 (1987) 700-706.
[10] N. Jayant, P. Noll, Digital Coding of Waveforms, Prentice-Hall, Englewood Cliffs, NJ, 1984.

About the Author - JIQING HAN received the B.S. and M.S. degrees in Electrical Engineering from Harbin Institute of Technology (HIT), Harbin, P.R. China, in 1987 and 1990, respectively. From 1990 to 1996, he was an assistant lecturer in the Department of Computer Science and Engineering, HIT. He is currently working toward his Ph.D. degree in Computer Science and Engineering at HIT. Since June 1996, he has been a Visiting Scientist at the Systems Engineering Research Institute, Korea Institute of Science and Technology, South Korea. His research interests include robust speech recognition, signal processing and pattern recognition.

About the Author - WEN GAO received his first Ph.D. degree in Computer Science and Engineering from Harbin Institute of Technology (HIT), China, in 1988, and his second Ph.D. degree in Electrical Engineering from the University of Tokyo, Japan. In 1993 he worked at Carnegie Mellon University (CMU), and in 1995 he was a Visiting Professor at the Artificial Intelligence Laboratory at MIT. He is a professor and the director of the Motorola-NCIN Joint R and D Laboratory, China. Prof. Gao has published many papers, and his current research interests include image processing, computer vision, pattern recognition and multimodal human interfaces.
