LEVERAGING JOINTLY SPATIAL, TEMPORAL AND MODULATION ENHANCEMENT IN CREATING NOISE-ROBUST FEATURES FOR SPEECH RECOGNITION

Size: px

Start display at page:

Download "LEVERAGING JOINTLY SPATIAL, TEMPORAL AND MODULATION ENHANCEMENT IN CREATING NOISE-ROBUST FEATURES FOR SPEECH RECOGNITION"

Eunice Golden
5 years ago
Views:

1 LEVERAGING JOINTLY SPATIAL, TEMPORAL AND MODULATION ENHANCEMENT IN CREATING NOISE-ROBUST FEATURES FOR SPEECH RECOGNITION 1 HSIN-JU HSIEH, 2 HAO-TENG FAN, 3 JEIH-WEIH HUNG 1,2,3 Dept of Electrical Engineering, National Chi Nan University, Taiwan, Republic of China 1 s @mail1.ncnu.edu.tw, 2 s @ncnu.edu.tw, 3 jwhung@ncnu.edu.tw Abstract- This paper presents to adopt various fusion types of spatial, temporal and modulation domain speech feature enhancement techniques in order to achieve superior speech recognition performance under noise-corrupted environments. With the mel-frequency cepstral coefficients (MFCC) as the standard speech feature representation, the spatial-domain techniques involve the short-time intra-frame feature enhancement, while the temporal-domain techniques compensate for the noise distortion that exists in the long-term inter-frame MFCC time stream. Furthermore, the modulation- domain techniques are conducted on the Fourier transform of a MFCC time stream. The evaluation experiments conducted on the connected-digit Aurora-2 database reveal that each of the spatial/temporal enhancement techniques adopted here performs better than the unprocessed MFCC baseline, and the integration of the respectively for spatial-, temporal-and modulation-domain features can result in even better recognition accuracy than the individual component method under a wide range of noise-corrupted environments. These results clearly demonstrate that the in the three domains treat noise in different aspects and therefore they are complementary to each other. Keywords- Noise Robustness, Speech Recognition, Spatial Processing, Temporal Processing, Modulation Domain. I. INTRODUCTION Most of the state-of-the-art automatic speech recognition (ASR) systems can perform well in a controlled laboratory environment. However, their performance usually degrades dramatically when they are applied outside the laboratory and in real-word applications. The performance degradation is often caused by interfering sources and distortions which are usually termed the environment variability. This variability and the resulting environmental mismatch between the developing and application situations may be caused by additive noise, channel distortion, different speaker characteristics, etc. In order to alleviate this mismatch, a great number of robustness algorithms have been proposed and thereby the application field of speech recognition can be broadened. These robustness algorithms can be roughly classifies into three schools: signal enhancement, feature compensation and model adaptation. First of all, as for signal enhancement, the aim is to improve the quality and intelligibility of speech signals. The corresponding techniques include spectral subtraction (SS) [1]-[3], short-time spectral amplitude estimation based on minimum meansquared error criteria (MMSE-STSA) [4], MMSEbased log-spectral amplitude estimation (MMSE log- STSA) [5], Wiener filtering [6, 7], Kalman filtering [8], modulation spectral subtraction (ModSpecSub) [9] and minimum mean-square error short-time spectral modulation magnitude estimator (MME) [10], just to name a few. Next, the general purpose of feature compensation is to build a speech feature representation that is robust to noise, and most of these are focused on refining the conventional speech features, such as linear predictive coefficients (LPC) [11], melfrequency cepstral coefficients (MFCC) [12] and perceptual linear prediction (PLP) [13], which behave well in clean noise-free situation, but are vulnerable to noise/interference. One primary direction of this category of is to compensate the statistics of temporal feature streams, and several popular feature statistics compensation include cepstral mean subtraction (CMS) [14], mean and variance normalization (MVN) [15], cepstral histogram equalization (CHN) [16], higher order cepstral moment normalization (HOCMN) [17] and cepstral shape normalization (CSN) [18]. Finally, the last school of, including parallel model combination (PMC) [19], speech and noise decomposition (SND) [20], vector Taylor series (VTS) [21], maximum a posteriori (MAP) [22], maximum likelihood linear regression (MLLR) [23], statistical re-estimation (STAR) and maximum mutual information (MMI) [24, 25], etc., is focused on tuning the acoustic models in the recognizer with respect to noise conditions in application. These take into account the noise characteristics within recognition procedures rather than eliminate the noise effect in the input signals/features. In recent years, our research group has been focused on developing noise-robustness techniques which primarily fall in the category of feature compensation as mentioned earlier. In particular, these developed techniques are to enhance the widely used MFCC speech features on different perspectives, which refer to temporal, spatial and modulation domains. Therefore, in this paper we focus on exploring the effectiveness as for the pairing of any two developed techniques that dwell in different domains and investigating whether such a paring result in better 15

2 performance than each individual component technique. According to the recognition experiments conducted on the well-known Aurora2 database and task [26], it is found that in most cases the noise robustness algorithms in different domains benefit one another and can produce further noise-robust speech features accordingly. These results further show that our presented algorithms in different domains deal with different traits of noise effect, and thereby using them together can further alleviate the degradation on MFCC speech features caused by noise. Figure 1. Three domains of MFCC features and the respective robustness algorithms The remainder of the paper is organized as follows: Section II briefly reviews the various noise robustness algorithms in three different domains which we have presented. Experimental setup is provided in Section IV, and Section V gives the detailed experimental results for the various integrations of any two algorithms, together with the corresponding discussions. Finally, Section VI contains a concluding remark and future works. II. REVIEW OF THE ROBUSTNESS ALGORITHMS 2.1 Spatial-domain The spatial-domain mentioned here take into consideration the mutual correlation among the intraframe MFCC features, and the associated algorithms we have developed in [27] are the weighted spatial- MVN and spatial-heq, abbreviated by WS-MVN and WS-HEQ, respectively. Briefly speaking, WS- MVN and WS-HEQ adoptthe idea of S-HEQ [28] and divide the MFCC features within every individual frame into low- and high-subbands. Then the two sub-bands are weighted according to their relative influence in recognition accuracy, and finally the weighted subbands of MFCC time sequences are enhanced by either MVN or HEQ. We have shown in [27] that WS-MVN and WS-HEQ behave better than S-HEQ and provide promising results in promoting robustness of MFCC features, and they reveal good additive property when applied together with some well-known robustness algorithms like MVN plus ARMA filtering (MVA) [29] and temporal structure normalization (TSN) [30]. 2.2 Temporal-domain As pointed out in the introduction section, the wellknown CMS, MVN and CHN process the temporal MFCC sequences by compensating the associated statistics. Here we would like to introduce three other tempoal processing algorithms that we developed in [31, 32], which are cepstral wavelet denoising (WD), sub-band temporal MVN (SB-TMVN) and sub-band temporalheq (SB-THEQ). All of these three algorithms take advantage of the discrete wavelet transform (DWT) to split each cepstral temporal sequence into several sub-bands (approximation and detail parts). Then WD applied a thresholding scheme to remove the relatively small-valued components in each individual sub-band, while SB-TMVN and SB- THEQ adopt a statistics compensation procedure to normalize the mean, variance or histogram of each sub-band temporal sequence. It has been shown that the SB-TMVN and SB-THEQ outperform their fullband counterparts, viz. MVN [15] and CHN [16], and WD behaves better than the conventional wavelet threshold denoising algorithm that operates on the speech waveform directly. 2.3 Modulation-domain Via applying the Fourier transform to the MFCC temporal sequence, the corresponding modulation spectrum can be obtained. The noise effect can be clearly observed in the cepstral modulation spectrum, and thus the respective modulation-domain robustness are developed to compensate the modulation spectrum directly. Our recent research have come up with a series of robustness algorithms 16

in modulation domain, and some of them are subband modulation spectral MVN (SB-MSMVN) [33], sub-band modulation spectral HEQ (SB-MSHEQ) [33] and modulation spectrum power law expansion (MSPLE) [34].

Besides, MSPLE applies a power law operation to the entire magnitude modulation spectrum in order to highlight the lower frequency components that are commonly viewed to be more beneficial for the

SB-SMVN and SB-SHEQ have been shown to outperform their full-band counterparts, and we have demonstrated that a simple power-law operation as in MSPLE can improve the recognition accuracy

3 in modulation domain, and some of them are subband modulation spectral MVN (SB-MSMVN) [33], sub-band modulation spectral HEQ (SB-MSHEQ) [33] and modulation spectrum power law expansion (MSPLE) [34]. Briefly speaking, SB-MSMVN and SB-MSHEQ split the magnitude component of cepstral modulation spectrum into several segments (i.e., sub-bands) first, and then employ MVN and HEQ to compensate the statistics for each segments. Besides, MSPLE applies a power law operation to the entire magnitude modulation spectrum in order to highlight the lower frequency components that are commonly viewed to be more beneficial for the speech recognition than the higher frequency components. SB-SMVN and SB-SHEQ have been shown to outperform their full-band counterparts, and we have demonstrated that a simple power-law operation as in MSPLE can improve the recognition accuracy significantly. III. EXPERIMENTAL SETUP modulation domain, SB-MSMVN, SB- MSHE and MSPLE. The resulting accuracy rates are summarized in Figure 2. From this figure, several findings can be made: 1. T-MVN brings significant accuracy improvement over the baseline. However, it behaves worse than SB-MSMVN (77.96% in accuracy) and SB-SHE (83.85%). 2. With the T-MVN as the pre-processing method, the performance of SB-MSMVN and MSPLE are further promoted, while SB-SHE drops lightly possibly due to over normalization. 3. Adding MSPLE to T-MVN benefits T-MVN a lot by providing an absolute accuracy improvement of 7.16%. This result implies that the method T-MVN plus MSPLE is well suited in application due to its computational efficiency together with good performance. The efficacy of the series of the integrations for different robustness mentioned in Section II was evaluated on the noisy Aurora-2 [26] database. Briefly speaking, Aurora-2 is a subset of the TI- DIGITS, which consists speech signals uttered by US adults. The task associated with Aurora-2 is to recognize connected digit utterances interfered with various noise sources at different signal-to-noise ratios (SNRs). In the mode of clean-condition training plus multi-condition testing, the acoustic models are trained via 8,440 clean noise-free utterances, and the testing data is further divided into three Sets: Test Sets A and B contain the utterances corrupted by additive noise, and Test Set C is composed of the utterances with additive noise and channel distortion. There are eight noise types in total, and two channel characteristics. Furthermore, the acoustic model for each digit in the Aurora-2 task is set to a left-to-right continuous density HMM with 16 states, each of which is a3- mixture GMM. In regard to speech feature extraction, each utterance of the training and testing sets was represented by a series of 13 static features augmented with their first- and second-order delta coefficients, resulting in a 39-dimensional MFCC feature vector. The training and recognition tests used the HTK recognition toolkit [35], which followed the setup originally defined for the ETSI evaluations [26]. IV. EXPERIMENTAL RESTLTS AND DISCUSSIONS 4.1 The paring of temporal- and modulationdomain At the outset, we evaluate the mode that the original MFCC features are first enhanced by the well-known temporal domain method, temporal MVN (T-MVN), and then further processed by the presented Figure 2. The averaged recognition accuracy rates for one temporal-domain method, T-MVN and three modulationdomain, MSPLE, SB-MSMVN and SB-MSHEQ, together with some possible types of integration. Figure 3. The averaged recognition accuracy rates for one spatial-domain method, WS-HEQ and three temporal-domain, WD, SB-TMVN and SB-THEQ, together with some possible types of integration. 4.2 The paring of spatial- and temporal-domain Next, we present the integration of spatial-domain method, WS-HEQ, and either of three temporal domain, WD, SB-TMVN and SB-THEQ. The corresponding evaluation results are shown in Figure 3. From this figure, we have several observations: 1. Any of the various integrations gives rise to an additive effect and show superior accuracy rates in comparison with any individual component 17

4 method. For example, "WS-HEQ plus SB- TMVN" (85.80% in accuracy) outperforms WS- HEQ (84.99%) and SB-TMVN (80.62%).Therefore, it is evident that the temporal can further improve the discriminability of speech features and reduce the noise distortion left by WS-HEQ. 2. Among the three temporal domain, SB- TMVN is the most effective in pairing WS-HEQ to provide the optimal accuracy. Despite the fact that SB-THEQ outperforms SB-TMVN in isolated operation, combining SB-THEQ with WS-HEQ possibly results in over compensation and less accuracy rates. This result also indicates that a simpler SB-TMVN (relative to SB-THEQ) can behave better when integrated with WS-HEQ. 4.3 The paring of spatial- and modulation-domain Finally, the performance of the fusion of WS-HEQ and either of three modulation domain, SB- MSMVN, SB-MSHEQ and MSPLE, is explored. The respective recognition accuracy rates averaged over all noise types and levels in three test sets are shown in Figure 4. This figure reveals that: 1. Most of the combinative procedures produce better results than the individual component method. For example, the integration of WS- HEQ and SB-MSMVN (85.72% in accuracy) behave better than the single WS-HEQ (84.99%) and SB-MSMVN (77.96%). 2. When the feature sequences are pre-processed by WS-HEQ, the three modulation domain used here behave very close to each other. This implies that MSPLE is a better choice among the three for integrating with WS-HEQ, since it has the simplest computation and achieves nearly optimal recognition accuracy in the integration. 3. Compared with the two sub-band modulationdomain, MSPLE only processes the lower band and still can promote the recognition performance as well. In particular, further compensating the WS-HEQ processed features with MSPLE can result in better performance, again evidencing the robustness capability of MSPLE. CONCLUSIONS AND FUTURE WORKS In this paper, we have demonstrated the preferable recognition performance of the integration of the robustness with respect to different domains for MFCC features. Most types of the integration are simple in implementation and very applicable in realworld scenarios. In the near future, we are going to adopt the speech features enhanced by the presented architectures in the model of state-of-the-art deeplearning neural network (DNN) to evaluate the respective performance. Another direction is to alter 18 the implementation order of the component in the integration to see the corresponding effect since these are mostly non-linear operations to the original features. Figure 4. The averaged recognition accuracy rates for one spatial-domain method, WS-HEQ and three modulationdomain, MSPLE, SB-MSMVN and SB-MSHEQ, together with some possible types of integration REFERENCES [1] S. Boll, Suppression of acoustic noise in speech using spectral subtraction, IEEE Transactions on Acoustics, Speech and Signal Processing, 27(2), pp , [2] M. Berouti, R. Schwartz, J. Makhoul, Enhancement of speech corrupted by acoustic noise, in Proceedings of the Signal Processing, pp , [3] S. Kamath and P. Loizou, A multi-band spectral subtraction method for enhancing speech corrupted by colored noise, in Proceedings of the IEEE InternationalConference on Acoustics, Speech and Signal Processing, pp. IV-4164, 2002 [4] Y. Ephraim and D. Malah, Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator, IEEE Transactions on Audio, Speech, and Language Processing, 32(6), pp , [5] Y. Ephraim and D. Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator, IEEE Transactions on Acoustics, Speech and Signal Processing, 33(2), pp , [6] C. Plapous, C. Marro, and P. Scalart, Improved signal-tonoise ratio estimation for speech enhancement, IEEE Transactions on Audio, Speech and Language Processing, 14(6), pp , [7] P. Scalart and J. V. Filho, Speech enhancement based on a priori signal to noise estimation, in Proceedings of the Signal Processing, pp , [8] V. Grancharov and J. S. B. Kleijn, On causal algorithms for speech enhancement, IEEE Transactions on Audio, Speech, and Language Processing, 14(3), pp , [9] K. Paliwal, K. Wojcicki and B. Schwerin, Single-channel speech enhancement using spectral subtraction in the short-time modulation domain, Speech Communication, 9552(5), pp , [10] K. Paliwal, B. Schwerin and K. Wojcicki, Speech enhancement using minimum meansquare error short-time spectral modulation magnitude estimator, Speech Communication, 54(2), pp , [11] B. S. Atal, The history of linear prediction, IEEE Signal Processing Magazine, 23(2), pp , [12] S. B. Davis and P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuous spoken sentences, IEEE Transaction onacoustics, Speech and Signal Processing, 28(4), pp , [13] H. Hermansky, Perceptual linear predictive (PLP) analysis of speech, Journal of the Acoustical Society of America, 87(4), pp , 1990.

5 [14] S. Furui, Cepstral analysis technique for automatic speaker verification, IEEE Transactions on Acoustics, Speech and Signal Processing, 29(2), pp , [15] S. Tibrewala, H. Hermansky, Multiband and adaptation approaches to robust speech recognition, in Proceedings of the Eurospeech Conference on Speech Communications and Technology, pp , [16] F. Hilger and H. Ney, Quantile based histogram equalization for noise robust large vocabulary speech recognition, IEEE Transaction on Audio, Speech, and Language Processing, 14(3), pp , [17] C.-W. Hsu and L.-S. Lee, Higher Order Cepstral Moment Normalization for Improved Robust Speech Recognition, IEEE Transactions on Audio, Speech, and Language Processing, 17(2), pp , [18] J. Du and R.-H. Wang, Cepstral shape normalization (CSN) for robust speech recognition, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp , [19] J.-W. Hung, J.-L. Shen and L.-S. Lee, New approaches for domain transformation and parameter combination for improved accuracy in parallel model combination (PMC) techniques, IEEE Transactions on Speech and Audio Processing, 9(8), pp , [20] J. H. Holmes and N. C. Sedgwick, Noise compensation for speech recognition using probabilistic models, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp , [21] A. Acero, L. Deng, T. Jristjansson, and J. Zhang, HMM adaptation using vector taylor series for noisy speech recognition, in Proceedings of the International Conference on Spoken Language Processing, pp , [22] J.-L. Gauiain and C.-H. Lee, Maximun a posteriori estimation for multivariate Gaussian mixture observations of markov chains, IEEE Transactions on Speech and Audio Processing, 2(2), pp , [23] C. J. Leggester and P. C. Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density HMMs, Computer Speech and Language, 9(2), pp , [24] M. J. F. Gales and S. J. Young, Cepstral parameter compensation for HMM recognition in noise, Speech Communication, 12(3), pp [25] L. Bahl, P. Brown, P. de Souza and R. Mercer, Maximum mutual information estimation of hidden Markov model parameters for speech recognition, in Proceedings of the Signal Processing, pp , [26] H. G. Hirsch and D. Pearce, The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions, in Proceedings of the 2000 Automatic Speech Recognition: Challenges for the new Millenium, pp , [27] J.-W. Hung and H.-T. Fan, Intra-frame cepstral sub-band weighting and histogram equalization for noise-robust speech recognition, EURASIP Journal on Audio, Speech, and Music Processing 2013:29, Dec [28] V. Joshi, R. Bilgi, S. Umesh, L. Garcia, M. C. Benitez, Sub-band level histogram equalization for robust speech recognition, in Proceedings of International Conference on Spoken Language Processing, pp , 2011 [29] C-P. Chen and J. Bilmes, MVA processing of speech features, IEEE Transactions on Audio, Speech, and Language Processing, 15(1), pp , 2007 [30] X. Xiao, E. S. Chng and H. Li, Normalization of the speech modulation spectra for robust speech recognition, IEEE Transactions on Audio, Speech, and Language Processing, 16(8), pp , [31] H.-T. Fan, J.-Y. Lee, J.-W. Hung and I-C. Lu, Leveraging wavelet de-noising in temporal sequences of speech features for noise-robust speech recognition, in Proceeding of International Conference on Intelligent Information Processing (ICIIP) [32] J.-W. Hung and H.-T. Fan, Subband feature statistics normalization techniques based on a discrete wavelet transform for robust speech recognition, IEEE Signal Processing Letters, June 2009 [33] W.-H.Tu, S.-Y. Huang and J.-W. Hung, Sub-band Modulation Spectrum Compensation for Robust Speech Recognition, 2009 Automatic Speech Recognition and Understanding Workshop, Dec 2009 [34] H.-T. Fan, Z.-H. Ye and J.-W. Hung, Modulation spectrum power-law expansion for robust speech recognition, in Proceeding of Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Oct [35] 19

Modulation Spectrum Power-law Expansion for Robust Speech Recognition

Modulation Spectrum Power-law Expansion for Robust Speech Recognition Hao-Teng Fan, Zi-Hao Ye and Jeih-weih Hung Department of Electrical Engineering, National Chi Nan University, Nantou, Taiwan E-mail: