Advances in Applied and Pure Mathematics



Enhancement of speech signal based on application of the Maximum a Posterior Estimator of Magnitude-Squared Spectrum in Stationary Bionic Wavelet Domain

MOURAD TALBI (1), ANIS BEN AICHA (2)
(1) mouradtalbi196@yahoo.fr, (2) ben.aicha.anis@gmail.com
(1) High Institute of Applied Mathematics and Informatics of Kairouan, University of Kairouan, Tunisia
(2) Laboratory of COSIM (SUP'COM), High School of Communications, Tunis, Tunisia

Abstract

In this paper we propose a new speech enhancement technique based on applying the Maximum a Posteriori Estimator of Magnitude-Squared Spectrum (MSS-MAP) in the Stationary Bionic Wavelet Domain. The technique first applies the Stationary Bionic Wavelet Transform (SBWT) to the noisy speech signal and then applies the MSS-MAP estimator to each stationary bionic wavelet subband in order to enhance it. The enhanced speech signal is obtained by applying the inverse transform, SBWT^-1, to the enhanced stationary bionic wavelet coefficients. To evaluate the proposed technique, we compare it with previous approaches such as the MSS-MAP based denoising technique. The evaluation was performed on a number of Arabic speech sentences corrupted by different types of noise: Gaussian white, car, tank, F16 and pink noise. The simulation results show that the proposed technique outperforms the other techniques used in our evaluation.

Keywords: Stationary Bionic Wavelet Transform, Maximum a Posteriori Estimator of Magnitude-Squared Spectrum, Speech enhancement.

1. Introduction

The enhancement of speech corrupted by uncorrelated additive noise is an important problem that has received much attention in the last two decades. This is a result of the growing use of speech processing systems in diverse real environments, where the presence of noise degrades their performance. Such systems include speech recognition, mobile phones, hearing aids and voice coders.
The aim of speech enhancement is to improve the intelligibility and perceptual quality of speech by minimizing the effect of noise. Existing techniques for this task include Wiener filtering [1-5], spectral subtraction [6, 7] and the wavelet transform (WT) [8-14, 35]. An emerging trend in speech enhancement is to employ a filter bank based on a psychoacoustic model of the human auditory system (critical bands). The underlying principle is that embedding such a model in the filter bank can improve the intelligibility and perceptual quality of speech. Furthermore, it is well known that the human auditory system can be approximately described as a nonuniform bandpass filter bank, and that humans are able to detect the desired speech in noisy environments without prior knowledge of the noise [15]. Different frequency scales (ERB, Bark, Mel and so on) have been proposed to account for the perceptual aspects of hearing. It is worth mentioning that the majority of perceptual speech enhancement techniques are based on the wavelet packet transform [10, 11, 13, 15-18]. Furthermore, the wavelet packet transform has been effectively combined with other denoising techniques in order to improve speech enhancement performance. These include Wiener filtering [19], adaptive filtering [20], spectral subtraction [21-23], the Ephraim and Malah approach [15] and the coherence function [24].

The rest of the paper is organized as follows. Section 2 describes the proposed speech enhancement technique by giving a detailed overview of the Bionic Wavelet Transform (BWT) and the Stationary Bionic Wavelet Transform (SBWT). Section 3 presents the objective quality measurement techniques. Experimental results are presented and discussed in Section 4. Finally, the conclusion is given in Section 5.

2. The proposed technique

In this paper we propose a new speech enhancement technique based on applying the Maximum a Posteriori Estimator of Magnitude-Squared Spectrum (MSS-MAP) [25] in the Stationary Bionic Wavelet Domain. The block diagram of the proposed technique is given in Figure 1.

Figure 1. The block diagram of the proposed technique.

As shown in Figure 1, the proposed technique first applies the SBWT to the noisy speech signal in order to obtain eight noisy stationary bionic wavelet subbands. The MSS-MAP estimator is then applied to each subband in order to obtain eight enhanced stationary bionic wavelet subbands. Finally, the enhanced speech signal is obtained by applying the inverse transform, SBWT^-1, to the enhanced stationary bionic wavelet subbands.

2.1. The Bionic Wavelet Transform

Referring to the perceptual model, Yao and Zhang [14] proposed the Bionic Wavelet Transform (BWT) as a new time-frequency method. The term "bionic" means that the BWT is guided by an active biological mechanism [18]. Furthermore, the BWT decomposition is both perceptually scaled and adaptive [16]. The initial perceptual aspect of the transform comes from the logarithmic spacing of the baseline scale variables, which are designed to match the basilar membrane spacing [16]. Then, two adaptation factors control the time-support employed at each scale, based on a non-linear perceptual model of the auditory system [16].
The basis of this transform is the Giguere-Woodland non-linear transmission line model of the auditory system [19, 20], an active-feedback electro-acoustic model incorporating the auditory canal, middle ear and cochlea [16]. The model yields estimates of the time-varying acoustic compliance and resistance along the displaced basilar membrane as a function of the physiological acoustic mass, the cochlear frequency-position mapping, and feedback factors representing the active mechanisms of the outer hair cells. The net result can be seen as a method for estimating the time-varying quality factor of the cochlear filter banks as a function of the input sound waveform [16]. Giguere and Woodland [20] and Yao and Zhang [14] give the complete details of the elements of this model.

The adaptive nature of the BWT is ensured by a time-varying linear factor $T(a,\tau)$ representing the scaling of the cochlear filter bank quality factor at each scale over time [16]. For each scale and time, the BWT adaptation factor is calculated with the update equation [16]:

$$T(a,\tau+\Delta\tau) = \left(1 - \tilde{G}_1 \frac{C_s}{C_s + |X_{BWT}(a,\tau)|}\right)^{-1} \left(1 + \tilde{G}_2 \left|\frac{\partial}{\partial t} X_{BWT}(a,\tau)\right|\right)^{-1} \quad (1)$$

where $C_s$ is a constant (typically $C_s = 0.8$) that represents non-linear saturation effects in the cochlear model [14, 16]. The quantities $\tilde{G}_1$ and $\tilde{G}_2$ are, respectively, the active gain factor representing the outer hair cell active resistance function and the active gain factor representing the time-varying compliance of the basilar membrane [16]. In practice, the partial derivative in equation (1) can be approximated by the first difference of the previous points of the BWT at that scale [16]. $X_{BWT}(a,\tau)$ represents the BWT of the signal $x(t)$. It is given by:

$$X_{BWT}(a,\tau) = \frac{1}{T\sqrt{a}} \int x(t)\, \tilde{\varphi}^{*}\!\left(\frac{t-\tau}{aT}\right) \exp\!\left(-j\omega_0 \frac{t-\tau}{a}\right) dt \quad (2)$$

where $a$ denotes the scale parameter, $\tau$ the time-shift parameter, and $\tilde{\varphi}$ the envelope of the mother wavelet $\varphi$, defined by [18]:

$$\varphi(t) = \tilde{\varphi}(t) \exp(j\omega_0 t) \quad (3)$$

where $\omega_0$ is the base fundamental frequency of the unscaled mother wavelet. In practice, $\omega_0$ is equal to 15165.4 Hz for the human auditory system [14]. The scale is discretized with a predetermined logarithmic spacing across the desired frequency range, so that the center frequency at each scale is [16]:

$$\omega_m = \frac{\omega_0}{(1.1623)^m}, \quad m = 0, 1, 2, \ldots \quad (4)$$

For the implementation performed in [16], based on the original work on cochlear implant coding (Yao and Zhang, 2002), coefficients at 22 scales, $m = 7, \ldots, 28$, are computed by numerical integration of the Continuous Wavelet Transform (CWT) [16]. These 22 scales correspond to center frequencies logarithmically spaced from 225 Hz to 5300 Hz [16]. In formula (2), the role of the first factor $T$, which multiplies $\sqrt{a}$, is to ensure that the energy remains unchanged for each mother wavelet. The role of the second factor $T$ is to adjust the envelope $\tilde{\varphi}$ without changing the central frequency of the mother wavelet [18].
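As a rough illustration of the scale discretization and the adaptation-factor update described above, the following Python sketch computes the 22 center frequencies and performs one update step. It assumes the forms usually quoted in the BWT literature: a base frequency of 15165.4 Hz, a spacing ratio of 1.1623 with scale indices m = 7 to 28, a saturation constant of 0.8, and an update of the form T = 1 / ((1 - G1*Cs/(Cs + |X|)) * (1 + G2*|dX/dt|)). The gain values `g1` and `g2` and all function names are illustrative assumptions, not values taken from this paper.

```python
# Assumed constants (commonly quoted in the BWT literature, not from this paper).
OMEGA_0 = 15165.4   # base frequency of the unscaled mother wavelet
RATIO = 1.1623      # logarithmic spacing between adjacent scales
C_S = 0.8           # saturation constant of the cochlear model


def center_frequencies(first_m=7, n_scales=22):
    """Center frequency of each scale: omega_m = OMEGA_0 / RATIO**m."""
    return [OMEGA_0 / RATIO ** m for m in range(first_m, first_m + n_scales)]


def update_adaptation_factor(x_now, x_prev, dt, g1=0.87, g2=45.0, c_s=C_S):
    """One update step of the adaptation factor T at a single scale.

    x_now and x_prev are consecutive BWT coefficients at that scale; the time
    derivative is approximated by their first difference. g1 and g2 are
    hypothetical gain values chosen only for illustration.
    """
    derivative = abs(x_now - x_prev) / dt
    resistance_term = 1.0 - g1 * c_s / (c_s + abs(x_now))
    compliance_term = 1.0 + g2 * derivative
    return 1.0 / (resistance_term * compliance_term)


freqs = center_frequencies()
print(round(freqs[0], 1), round(freqs[-1], 1))  # highest and lowest center frequency
```

With these assumed constants the highest and lowest center frequencies land close to the 5300 Hz and 225 Hz limits quoted in the text, which is a useful sanity check on the spacing ratio.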
Consequently, the major difference between the CWT and the BWT is that the time-frequency resolution achieved by the BWT is adjusted adaptively, not only according to the frequency content of the signal but also according to its instantaneous amplitude. It is the mother wavelet that makes the CWT adaptive, while the adaptive characteristic of the BWT comes from the active control mechanism in the human auditory model, which adjusts the mother wavelet of the BWT according to the analyzed signal. The basic idea of the BWT is to make the envelope of the mother wavelet vary in time according to the characteristics of the signal. The mother wavelet employed in [18] is the Morlet wavelet, whose envelope is given by [16]:

$$\tilde{\varphi}(t) = \exp\!\left(-\left(\frac{t}{T_0}\right)^2\right) \quad (5)$$

where $T_0$ denotes the initial time-support. It can be shown [15, 18] that the BWT coefficients are obtained from the following formula [16]:

$$X_{BWT}(a,\tau) = K \, X_{WT}(a,\tau) \quad (6)$$

where $X_{WT}(a,\tau)$ denotes the wavelet transform coefficients and the factor $K$ is given by:

$$K = \frac{\sqrt{\pi}}{C} \, \frac{T_0}{\sqrt{1 + T^2}} \quad (7)$$

where $T$ is the BWT adaptation factor and $C$ is a normalizing constant calculated from the integral of the squared mother wavelet. This representation yields an efficient way of computing the BWT coefficients directly from those of the wavelet transform, without using the BWT definition given by equation (2). There are some key differences between the discretized CWT employing the Morlet wavelet, as used for the BWT, and a filter-bank-based wavelet packet transform (WPT) employing an orthonormal wavelet. One of them is that the WPT provides perfect reconstruction, while the discretized CWT is an approximation whose exactness depends on the number and placement of the selected frequency bands [16].

2.2. Stationary Bionic Wavelet Transform (SBWT)

As previously mentioned, in this paper we apply a new wavelet transform which we call the Stationary Bionic Wavelet Transform (SBWT). This new transform is obtained by replacing the Continuous Wavelet Transform (CWT) used in the computation of the Bionic Wavelet Transform with the Stationary Wavelet Transform (SWT). Figure 2 shows the difference between the SBWT and the BWT. Part (a) of Figure 2 shows the steps of the BWT and its inverse, BWT^-1. The bionic wavelet coefficients are obtained by multiplying the continuous wavelet coefficients by the factor K. Those continuous wavelet coefficients are obtained by applying the Continuous Wavelet Transform (CWT) to the signal. To obtain the reconstructed signal, the bionic wavelet coefficients are multiplied by the inverse of the factor K, 1/K. Then the inverse of the Continuous Wavelet Transform, CWT^-1, is applied to the resulting coefficients. Part (b) of Figure 2 shows the steps of the SBWT and its inverse, SBWT^-1. The stationary bionic wavelet coefficients are obtained by multiplying the stationary wavelet coefficients by the factor K.
Those stationary wavelet coefficients are obtained by applying the Stationary Wavelet Transform (SWT) to the signal. The reconstructed signal is obtained by first multiplying the stationary bionic wavelet coefficients by 1/K and then applying the inverse of the SWT, SWT^-1.
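The SBWT steps described above (SWT, multiplication by K, and the reverse path) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: it uses an undecimated Haar decomposition with circular extension and a simple additive inverse as a stand-in for the SWT, it applies one constant K per subband, and the form of K, the parameters `t0` and `c_norm`, and all function names are hypothetical choices rather than the authors' implementation. Seven levels are used only because they give eight subbands.

```python
import math


def haar_swt(x, levels):
    """Undecimated (a trous) Haar decomposition with circular extension.

    Returns [detail_1, ..., detail_levels, approx]; every subband keeps the
    length of x because no downsampling is performed.
    """
    n = len(x)
    approx = list(x)
    subbands = []
    for j in range(levels):
        step = 2 ** j  # filter taps are spread by 2**j at level j
        shifted = [approx[(i + step) % n] for i in range(n)]
        subbands.append([(a - s) / 2.0 for a, s in zip(approx, shifted)])
        approx = [(a + s) / 2.0 for a, s in zip(approx, shifted)]
    subbands.append(approx)
    return subbands


def haar_iswt(subbands):
    """Invert haar_swt: at each level, previous approx = approx + detail."""
    rec = list(subbands[-1])
    for detail in reversed(subbands[:-1]):
        rec = [a + d for a, d in zip(rec, detail)]
    return rec


def k_factor(t_adapt, t0=1.0, c_norm=1.0):
    """Assumed form K = sqrt(pi)/C * T0 / sqrt(1 + T**2)."""
    return math.sqrt(math.pi) / c_norm * t0 / math.sqrt(1.0 + t_adapt ** 2)


def sbwt(x, levels, k):
    """Forward transform: SWT coefficients multiplied by K (one K per subband)."""
    return [[k_j * c for c in band] for band, k_j in zip(haar_swt(x, levels), k)]


def isbwt(coeffs, k):
    """Inverse transform: multiply by 1/K, then apply the inverse SWT."""
    return haar_iswt([[c / k_j for c in band] for band, k_j in zip(coeffs, k)])


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


signal = [math.sin(0.1 * i) for i in range(256)]
k = [k_factor(1.0 + 0.1 * j) for j in range(8)]  # hypothetical adaptation values
coeffs = sbwt(signal, 7, k)                      # 7 details + 1 approximation
reconstructed = isbwt(coeffs, k)
reconstruction_mse = mse(signal, reconstructed)
```

Because the K factors are undone exactly before the inverse decomposition, the reconstruction error in this sketch is at the level of floating-point rounding; a per-sample, time-varying K would follow the same pattern with an array in place of each scalar.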

Figure 2. (a) The Bionic Wavelet Transform (BWT) and its inverse (BWT^-1), (b) The Stationary Bionic Wavelet Transform (SBWT) and its inverse (SBWT^-1).

Tables 1 and 2 report the Mean Squared Error (MSE) between the reconstructed and the original speech signals, computed by applying the Bionic Wavelet Transform and its inverse and the Stationary Bionic Wavelet Transform and its inverse. They clearly show that the best results are obtained with the SBWT using ten scales. Consequently, the Stationary Bionic Wavelet Transform makes it possible to obtain a perfect reconstruction of speech signals.

Table 1. Case of Female Voice. MSE between the original and reconstructed speech signals (columns: speech signal, SBWT, BWT [26], scale number).

Table 2. Case of Male Voice. MSE between the original and reconstructed speech signals (columns: speech signal, SBWT, BWT [26], scale number).

3. Performance evaluation

In this part of the paper, we present a number of objective tests used to evaluate speech enhancement techniques.

3.1. Signal-to-noise ratio

The signal-to-noise ratio (SNR) of the enhanced speech signal is defined by:

$$SNR = 10 \log_{10} \frac{\sum_{n=0}^{N-1} x^2(n)}{\sum_{n=0}^{N-1} \left(x(n) - \hat{x}(n)\right)^2} \quad (8)$$

where $x(n)$ and $\hat{x}(n)$ represent the original and enhanced speech signals, respectively, and $N$ is the number of samples per signal.

3.2. Segmental signal-to-noise ratio

The segmental signal-to-noise ratio (segSNR) is calculated by averaging the frame-based SNRs over the signal:

$$segSNR = \frac{1}{M} \sum_{m=0}^{M-1} 10 \log_{10} \frac{\sum_{n=N_m}^{N_m+N-1} x^2(n)}{\sum_{n=N_m}^{N_m+N-1} \left(x(n) - \hat{x}(n)\right)^2} \quad (9)$$

where $M$ is the number of frames, $N$ is the frame size, and $N_m$ is the beginning of the m-th frame. As the SNR can become negative and very small during silence periods, the segSNR values are limited to the range [-10 dB, 35 dB].

3.3. Itakura-Saito distance

The Itakura-Saito distance (ISd) measures spectral changes and can be computed from the linear prediction coefficients (LPC) according to the following equation:

$$ISd(a, b) = \frac{(a - b)^T R \,(a - b)}{a^T R \, a} \quad (10)$$

where $a$ represents the LPC vector of the original speech signal, $R$ is the autocorrelation matrix and $b$ is the LPC vector of the enhanced speech signal. In this paper, a 10th-order LPC-based measure is employed.

3.4. Perceptual evaluation of speech quality

The perceptual evaluation of speech quality (PESQ) algorithm is an objective quality measure approved as ITU-T recommendation P.862. It is an objective measurement tool designed to predict the results of a subjective Mean Opinion Score (MOS) test. PESQ has been shown to be more reliable and to correlate better with MOS than traditional objective speech measures.

4. Results and evaluation

Tables 3, 4, 5 and 6 report the results of the SNR, SSNR, ISd and PESQ computations. These results are obtained by applying the proposed speech enhancement technique, the technique of Loizou [27] based on the Maximum a Posteriori Estimator of Magnitude-Squared Spectrum (MSS-MAP), and Wiener filtering to a number of noisy speech signals. These noisy speech signals are sampled at 16 kHz and recorded from two voices, male and female. They are obtained by corrupting the original signals with different types of noise (car, white, tank, pink and F16) at different SNR values (-5 to 15 dB).

Table 3. SNR measures (dB) obtained for the noisy and enhanced speech signals. Rows, for each noise type (car, white, tank, pink, F16): noisy signal, the proposed technique, MSS-MAP [25], Wiener filter [26].
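The evaluation setup and two of the objective measures of Section 3 can be sketched as follows. This is a minimal illustration, assuming the usual definitions: the noise is scaled so that the mixture has a chosen input SNR, the SNR is the ratio of clean-signal energy to error energy in dB, and the segmental SNR averages frame-wise SNRs clipped to [-10 dB, 35 dB]. The frame length of 256 samples, the synthetic test signals and all function names are assumptions made for the example.

```python
import math
import random


def add_noise_at_snr(clean, noise, target_snr_db):
    """Scale the noise so that clean + noise has the requested SNR in dB."""
    p_clean = sum(s * s for s in clean)
    p_noise = sum(v * v for v in noise)
    gain = math.sqrt(p_clean / (p_noise * 10.0 ** (target_snr_db / 10.0)))
    return [s + gain * v for s, v in zip(clean, noise)]


def snr_db(clean, enhanced):
    """Global SNR: clean energy over error energy, in dB."""
    num = sum(s * s for s in clean)
    den = sum((s - e) ** 2 for s, e in zip(clean, enhanced))
    return 10.0 * math.log10(num / den)


def seg_snr_db(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0):
    """Segmental SNR: mean of frame-wise SNRs, each clipped to [lo, hi] dB."""
    values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        num = sum(s * s for s in c)
        den = sum((s - t) ** 2 for s, t in zip(c, e))
        if num == 0.0:
            value = lo          # silent frame: use the lower bound
        elif den == 0.0:
            value = hi          # error-free frame: use the upper bound
        else:
            value = 10.0 * math.log10(num / den)
        values.append(min(max(value, lo), hi))
    return sum(values) / len(values)


rng = random.Random(0)
clean = [math.sin(0.05 * i) for i in range(1024)]     # stand-in for speech
noise = [rng.gauss(0.0, 1.0) for _ in range(1024)]    # stand-in for white noise
noisy = add_noise_at_snr(clean, noise, 5.0)
print(round(snr_db(clean, noisy), 3), round(seg_snr_db(clean, noisy), 3))
```

Passing the output of an enhancement method in place of `noisy` gives the kind of figures reported in Tables 3 and 4; the clipping bounds keep silent or nearly error-free frames from dominating the average.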

Table 4. SSNR measures (dB) obtained for the noisy and enhanced speech signals. Rows, for each noise type (car, white, tank, pink, F16): noisy signal, the proposed technique, MSS-MAP [25], Wiener filter [26].

Table 5. ISd measures obtained for the noisy and enhanced speech signals. Rows, for each noise type (car, white, tank, pink, F16): noisy signal, the proposed technique, MSS-MAP [25], Wiener filter [26].

Table 6. PESQ measures obtained for the noisy and enhanced speech signals. Rows, for each noise type (car, white, tank, pink, F16): noisy signal, the proposed technique, MSS-MAP [25], Wiener filter [26].

Noise   Technique            PESQ
Car     Wiener filter [26]   4.004   3.70    3.23    2.84    2.41
White   Noisy                2.2374  1.8546  1.4516  1.0718  0.80
White   The proposed         2.9181  2.6264  2.3317  2.0637  1.4718
White   MSS-MAP [25]         2.8543  2.6076  2.2954  2.0224  1.6253
White   Wiener filter [26]   2.8820  2.5881  2.2521  1.8824  1.5094
Pink    Noisy                2.3123  1.9230  1.4617  1.0150  0.8336
Pink    The proposed         3.0010  2.6412  2.2504  1.8479  1.5393
Pink    MSS-MAP [25]         2.9198  2.5412  2.2560  1.8428  1.

The obtained results show that the proposed technique outperforms the other techniques used in our evaluation. Figure 3 illustrates an example of speech enhancement using the proposed technique.

Figure 3. An example of denoising a speech signal corrupted by car noise: (a) clean speech, (b) noisy speech (SNR=10dB), (c) denoised speech signal using the proposed technique.

This figure clearly shows that the proposed technique efficiently reduces the noise while preserving the quality of the original speech signal.

Figure 4. (a) The spectrogram of the clean speech signal, (b) the spectrogram of the noisy speech signal (speech signal corrupted by car noise with SNR=dB), (c) the spectrogram of the enhanced speech signal. Axes: Time (sec), Freq (kHz).

5. Conclusion

In this paper, we propose a new speech enhancement technique based on applying the Maximum a Posteriori Estimator of Magnitude-Squared Spectrum (MSS-MAP) in the Stationary Bionic Wavelet Domain. The proposed technique is evaluated by comparing it with the speech enhancement technique based on MSS-MAP and the technique based on Wiener filtering. This evaluation uses a number of objective criteria: the SNR, SSNR, ISd and PESQ. It also uses a number of speech signals (ten sentences pronounced in Arabic by a male voice and ten others pronounced by a female voice) and different types of noise: car, white, F16, tank and pink noise. The results obtained by applying the proposed technique, the technique based on MSS-MAP and the technique based on Wiener filtering to the noisy speech signals show that the proposed technique outperforms the two other techniques.

References

[1] J. S. Lim and A. V. Oppenheim. Enhancement and bandwidth compression of noisy speech. Proceedings of the IEEE, 67(12).
[2] Y. Ephraim and D. Malah. Speech enhancement using a minimum mean square error short time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Processing, 32.
[3] Y. Ephraim and D. Malah. Speech enhancement using a minimum mean square error log spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Processing, 33.
[4] D. Malah, R. V. Cox, and A. J. Accardi. Tracking speech-presence uncertainty to improve speech enhancement in non-stationary noise environments.
In ICASSP, volume 2.
[5] P. Scalart and J. Filho. Speech enhancement based on a priori signal to noise estimation. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1996.
[6] M. Berouti, R. Schwartz, and J. Makhoul. Enhancement of speech corrupted by acoustic noise. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, volume 4.
[7] S. Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Signal Processing, 27(2).
[8] J. W. Seok and K. S. Bae. Speech enhancement with reduction of noise components in the wavelet domain. In ICASSP 97, Munich, Germany, April.

[9] M. Bahoura and J. Rouat. Wavelet speech enhancement using the Teager energy operator. IEEE Signal Processing Letters, 8:10-12.
[10] M. Bahoura and J. Rouat. Wavelet speech enhancement based on time-scale adaptation. Speech Communication, 48(12).
[11] I. Cohen. Enhancement of speech using bark-scaled wavelet packet decomposition. In Eurospeech 2001, Aalborg, Denmark.
[12] C. T. Lu and H. C. Wang. Enhancement of single channel speech based on masking property and wavelet transform. Speech Communication, 41(2-3).
[13] S. H. Chen and J. F. Wang. Speech enhancement using perceptual wavelet packet decomposition and Teager energy operator. J. VLSI Signal Process. Syst., 36(2-3).
[14] Y. Hu and P. C. Loizou. Speech enhancement based on wavelet thresholding the multitaper spectrum. IEEE Transactions on Speech and Audio Processing, 12(1):59-67.
[15] H. Taşmaz and E. Erçelebi. Speech enhancement based on undecimated wavelet packet-perceptual filterbanks and MMSE-STSA estimation in various noise environments. Digital Signal Processing, 18(5).
[16] I. Pintér. Perceptual wavelet representation of speech signals and its application to speech enhancement. Computer Speech and Language, 10(1):1-22.
[17] M. T. Johnson, X. Yuan, and Y. Ren. Speech signal enhancement through adaptive wavelet thresholding. Speech Communication, 49(2).
[18] C. T. Lu and H. C. Wang. Speech enhancement using hybrid gain factor in critical-band-wavelet-packet transform. Digital Signal Processing, 17(1).
[19] D. Mahmoudi. A microphone array for speech enhancement using multiresolution wavelet transform. In Proc. of Eurospeech'97, Rhodes, Greece, September.
[20] C. H. Yang, J. C. Wang, J. F. Wang, H. P. Lee, C. H. Wu, and K. H. Chang. Multiband subspace tracking speech enhancement for in-car human computer speech interaction. Journal of Information Science and Engineering, 22(5).
[21] T. Gulzow, A. Engelsberg, and U. Heute. Comparison of a discrete wavelet transformation and nonuniform polyphase filterbank applied to spectral-subtraction speech enhancement. Signal Processing, 64:5-19.
[22] R. Nishimura, F. Asano, Y. Suzuki, and T. Sone. Speech enhancement using spectral subtraction with wavelet transform. Electronics and Communications in Japan, Part III: Fundamental Electronic Science (English translation of Denshi Tsushin Gakkai Ronbunshi), 81(1):24-31.
[23] Y. Shao and C. H. Chang. A generalized time-frequency subtraction method for robust speech enhancement based on wavelet filter banks modeling of human auditory system. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 37(4).
[24] J. Sika and V. Davidek. Multi-channel noise reduction using wavelet filter bank. In EuroSpeech'97, Rhodes, Greece, September.
[25] Y. Lu and P. C. Loizou. Estimators of the magnitude-squared spectrum and methods for incorporating SNR uncertainty. IEEE Transactions on Audio, Speech, and Language Processing, 19(5), July.
[26] M. Talbi, L. Salhi, S. Abid, and A. Cherif. Recurrent neural network and bionic wavelet transform for speech enhancement. Int. J. Signal and Imaging Systems Engineering, 3(2).
[27] P. C. Loizou. Speech Enhancement: Theory and Practice. Taylor & Francis.
