FORWARD MASKING THRESHOLD ESTIMATION USING NEURAL NETWORKS AND ITS APPLICATION TO PARALLEL SPEECH ENHANCEMENT

T. S. GUNAWAN 1, O. O. KHALIFA 1, E. AMBIKAIRAJAH 2

1 Electrical and Computer Engineering Department, International Islamic University Malaysia, P.O. Box 10, Kuala Lumpur, 50728, MALAYSIA
2 School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, NSW 2052, AUSTRALIA
E-mail: tsgunawan@iiu.edu.my

ABSTRACT: Forward masking models have been used successfully in speech enhancement and audio coding. Presently, forward masking thresholds are estimated using simplified masking models, which have been used for audio coding and speech enhancement applications. In this paper, an accurate approximation of the forward masking threshold using neural networks is proposed. A performance comparison with other existing masking models in a speech enhancement application is presented. Objective measures using PESQ demonstrate that the proposed forward masking model provides significant improvements (5-15%) over four existing models when tested with speech signals corrupted by various noises at very low signal-to-noise ratios. Moreover, a parallel implementation of the speech enhancement algorithm was developed using the Matlab Parallel Computing Toolbox.

KEYWORDS: Human auditory system, forward masking, speech enhancement, PESQ, parallel algorithm.

1. INTRODUCTION

Forward masking is a time domain phenomenon in which a masker precedes the signal in time. Forward masking psychoacoustic data depend on four dimensions: frequency, masker level, time difference between masker and maskee, and masker signal duration [1]. Current forward masking models do not fully take into account all four dimensions of forward masking data. Functional models of the forward masking effect of the human auditory system have recently been used with success in speech and audio coding to provide more efficient signal compression [2, 3].
Furthermore, forward masking has been used for speech enhancement [4] using the speech boosting technique [5]. Instead of focusing on suppressing the noise, the speech boosting technique increases the relative power of the speech, thus acting as a speech booster. It is only active when speech is present, and remains idle when noise is present.

Jesteadt's forward masking model [6] provides a reasonable approximation to the forward masking effect. Strope et al. [7] extended the Jesteadt experiment to 120 ms. In Jesteadt's and Najafzadeh's forward masking models [6, 8], only masker level and delay are taken into account, while in [9], Gunawan and Ambikairajah refined the model to reflect forward masking data more accurately by averaging several parameters across frequencies. Currently, the majority of these works focus on formulating mathematical models of forward masking. Such models are often too general, and further refinement requires software that can curve-fit multi-dimensional data. For this purpose, we utilise a neural network to better approximate the forward masking threshold.

To evaluate the performance of our forward masking model, five speech enhancement algorithms were implemented: spectral subtraction [10], spectral subtraction with minimum statistics [11], speech boosting [5], speech boosting using forward masking model 1 [4] and forward masking model 2 [9]. The Perceptual Evaluation of Speech Quality (PESQ, ITU-T P.862) measure was used to benchmark the various methods.

A speech enhancement algorithm exploiting the temporal masking properties of the human auditory system has a very high computational requirement, especially when the noisy speech signal is long or the number of subbands is high. Recent advances in multi-core systems make them a natural and viable option for meeting the high computational requirements of the speech enhancement algorithm. Therefore, the objective of this paper is twofold: to evaluate the performance of our forward masking model in terms of enhanced speech quality, and to implement and evaluate a parallel speech enhancement algorithm on a multi-core system.

The rest of the paper is organized as follows. Section 2 discusses the development of forward masking models using neural networks. Section 3 describes the sequential speech enhancement algorithm, while Section 4 discusses its parallel implementation. Experimental results and analyses for the sequential and parallel algorithms are discussed in Section 5.
Finally, Section 6 concludes this paper.

2. FORWARD MASKING MODELS USING NEURAL NETWORKS

Neural networks have been applied in various applications within the following broad categories: function approximation (or regression analysis), classification, and data processing (filtering, clustering, blind source separation, etc.). Brown et al. [12] applied non-recurrent neural networks to simultaneous masking modelling. In this paper, a neural network is employed to approximate the forward masking threshold from three input parameters: frequency, masker level, and delay. By taking into account the threshold in quiet (TIQ), the absolute threshold of forward masking (FM) can be calculated using the equation we have developed below:

$$FM(f, L, t, T_s) = M(f, L, t) + TIQ(f, T_s) \tag{1}$$

As stated in [13], the threshold in quiet is a function of frequency and signal duration. By curve-fitting a set of 120 data points compiled from [13], we approximated the threshold in quiet as follows. $TIQ(f, T_s)$ for a signal with long duration ($T_s \ge 500$ ms) can be approximated as ($f$ in kHz):

$$TIQ(f, T_s \ge 500) = 3.64 f^{-0.8} - 6.5 e^{-0.6 (f - 3.3)^2} + 0.001 f^4 \tag{2}$$

$TIQ(f, T_s)$ for a signal with duration $T_s < 500$ ms can be approximated as

$$TIQ(f, T_s) = TIQ(f, T_s \ge 500) + \left(7.53 - 6.5 \times 10^{-13} f^3\right) \log_{10}\left(\frac{500}{T_s}\right) \tag{3}$$

The amount of forward masking, $M(f, L, t)$, can be approximated using a feed-forward neural network as shown in Fig. 1.

Fig. 1: Amount of forward masking modelled by neural networks.

A network configuration as shown in Fig. 1, with one hidden layer, can in principle approximate any continuous function to arbitrary accuracy given enough hidden units [14]. To avoid over-fitting to the training data, Bayesian regularization as proposed in [15] was used. Figure 2 shows the amount of forward masking against $L$ and $t$ at a frequency of 500 Hz, as estimated by the neural network. Similar plots can be obtained for other frequencies, thus providing a more accurate estimation of forward masking data.
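As a rough numerical illustration of Eqs. (1) and (2), the sketch below evaluates the long-duration threshold in quiet and adds it to an amount of masking. It is a minimal Python sketch under stated assumptions, not the authors' Matlab implementation: the function names are hypothetical, Eq. (1) is read as a sum of the two terms, Eq. (2) is read as the usual three-term threshold-in-quiet fit with $f$ in kHz, and `masking_amount_db` merely stands in for the output of the trained network.

```python
import math


def tiq_long_db(f_khz):
    """Threshold in quiet in dB for long signals (Ts >= 500 ms), Eq. (2).

    f_khz is the frequency in kHz, as stated for Eq. (2).
    """
    if f_khz <= 0:
        raise ValueError("frequency must be positive")
    return (3.64 * f_khz ** -0.8
            - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 0.001 * f_khz ** 4)


def forward_masking_threshold_db(masking_amount_db, f_khz):
    """Absolute forward masking threshold, Eq. (1), for a long signal.

    masking_amount_db is a placeholder for M(f, L, t) from the network.
    """
    return masking_amount_db + tiq_long_db(f_khz)


if __name__ == "__main__":
    for f in (0.5, 1.0, 4.0):
        print(f"{f:.1f} kHz: TIQ = {tiq_long_db(f):.2f} dB")
```

The short-duration correction of Eq. (3) is deliberately left out of the sketch; it would add a duration-dependent term on top of `tiq_long_db`.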

Fig. 2: Amount of forward masking estimation at 500 Hz.

3. SPEECH ENHANCEMENT

This section presents the incorporation of our model into the speech enhancement algorithm developed in [4]. Speech that has been contaminated by noise can be expressed as

$$x(n) = s(n) + v(n) \tag{4}$$

where $x(n)$ is the noisy speech, $s(n)$ is the clean speech signal and $v(n)$ is the additive noise, all of which are in the discrete time domain. The objective of speech enhancement is to suppress the noise, resulting in an output signal $y(n)$ that has a higher signal-to-noise ratio (SNR).

The speech enhancement algorithm that incorporates forward masking [4] is shown in Fig. 3. By filtering the input signal $x(n)$ using a bank of $M$ analysis filters, the signal is divided into $M$ subbands, each denoted by $x_m(n)$, where $m$ is the subband index. This filtering operation can be described in the time domain as

$$x_m(n) = x(n) * h_m(n)$$

where $m = 1, \ldots, M$ and $h_m(n)$ is the impulse response of the $m$-th filter. The global forward masking threshold (GFM) and the forward masking threshold in each subband ($FM_m$) are calculated from the noisy speech signal $x(n)$ and the subband signal $x_m(n)$, respectively. The GFM and $FM_m$ are used to calculate the gain ($\Gamma_m$) in each subband. The gain, $\Gamma_m$, is a weighting function that amplifies the signal in band $m$ during speech activity.
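The subband decomposition described above amounts to one convolution of the input with each analysis filter. The following is a minimal Python sketch under stated assumptions: the two toy FIR impulse responses are hypothetical placeholders (the filter bank of [4] is not reproduced here), the function names are illustrative, and each output is truncated to the input length.

```python
def convolve(x, h):
    """Direct-form linear convolution of x with h, truncated to len(x) samples."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        acc = 0.0
        # Only taps that overlap already-available input samples contribute.
        for k in range(min(n + 1, len(h))):
            acc += h[k] * x[n - k]
        y[n] = acc
    return y


def analysis_filter_bank(x, filters):
    """Return the subband signals x_m(n) = x(n) * h_m(n), one per filter."""
    return [convolve(x, h) for h in filters]


# Hypothetical 2-band example: a two-tap average (low-pass) and
# a two-tap difference (high-pass).
toy_filters = [[0.5, 0.5], [0.5, -0.5]]
subbands = analysis_filter_bank([1.0, 2.0, 3.0, 4.0], toy_filters)
```

For this toy pair the two subband signals add back up to the input, which makes it a convenient sanity check; a practical filter bank would use many more bands and longer impulse responses.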

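Fig. 3: Speech enhancement using forward masking. (Block diagram: the noisy input $x(n) = s(n) + v(n)$ passes through a DC rejection filter and the analysis filters $H_1(z), \ldots, H_M(z)$; the subband thresholds $FM_1, \ldots, FM_M$ and the global threshold GFM set the subband gains; the weighted subband signals $y_1(n), \ldots, y_M(n)$ pass through the synthesis filters $G_1(z), \ldots, G_M(z)$ and delay compensation to give $y(n)$, which the PESQ measure compares with $s(n)$.)

The enhanced speech, $y(n)$, is then obtained by applying the synthesis filters, $g_m(n)$, and compensating the delay ($\Delta_m$) in each subband as follows:

$$y(n) = \sum_{m=1}^{M} y_m(n) * g_m(n - \Delta_m) = \sum_{m=1}^{M} \left[\Gamma_m(n)\, x_m(n)\right] * g_m(n - \Delta_m) \tag{5}$$

Our objective is now to find a gain function, $\Gamma_m(n)$, that weights the input signal in subbands, $x_m(n)$, based on the forward masking threshold to noise ratio (MNR). The MNR in each subband can be calculated as the ratio of a short-term average forward masking threshold, $P_m(n)$, to an estimate of the noise floor level, $Q_m(n)$, as given in Eqn (8). The short-term average temporal masking threshold in subband $m$ is calculated as

$$P_m(n) = (1 - \alpha)\, P_m(n-1) + \alpha\, FM_m(n) \tag{6}$$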

where $\alpha$ is a small positive constant (i.e. $\alpha = 0.0042$) controlling the sensitivity of the algorithm to changes in the forward masking threshold, and acts as a smoothing factor. The slowly varying noise floor estimate for the $m$-th subband, $Q_m(n)$, is calculated as

$$Q_m(n) = \begin{cases} (1 + \beta)\, Q_m(n-1), & Q_m(n-1) \le P_m(n) \\ P_m(n), & Q_m(n-1) > P_m(n) \end{cases} \tag{7}$$

where $\beta$ is a small positive constant (i.e. $\beta = 0.05$) controlling how fast the noise floor level estimate in the $m$-th subband adapts to changes in the noise environment. The variables $P_m(n)$, $Q_m(n)$, $FM_m(n)$ and $GFM(n)$ are combined in a novel manner in order to calculate the gain function $\Gamma_m(n)$ as follows:

$$\Gamma_m(n) = \gamma\, \frac{FM_m(n)}{GFM(n)} + (1 - \gamma)\, \frac{P_m(n)}{Q_m(n)} \tag{8}$$

where $0 \le \gamma \le 1$ (i.e. $\gamma = 0.9$) is a positive constant controlling the contribution of the forward masking threshold ratio and the short-term MNR. Since the calculation of $\Gamma_m(n)$ involves a division, care must be taken to ensure that the quotient does not become excessively large due to a small $Q_m(n)$. In a situation with a very high MNR, $\Gamma_m(n)$ will become very large if no limit is imposed on this function. Therefore, a limiter can be applied to $\Gamma_m(n)$ as follows:

$$\Gamma_m(n) = \begin{cases} \Gamma_m(n), & \Gamma_m(n) \le C \\ C, & \Gamma_m(n) > C \end{cases} \tag{9}$$

where $C$ (0.3529 2 dB) provides a suitable limiter for the gain function.

4. PARALLEL SPEECH ENHANCEMENT ALGORITHM

The design of an efficient parallel speech enhancement algorithm can be a challenging task. The first step in the parallelization of any sequential code is to identify which part of the code takes the longest execution time. Using the Matlab profiling tool, it was identified that the calculation of the forward masking threshold and the gain calculation for each subband (see Eqns 5 to 9) took the longest execution time. In this paper, we utilize the Matlab Parallel Computing Toolbox. The hardware used was an AMD quad-core 2.5 GHz system with 2 GBytes of memory. The master-slave paradigm is used in the parallelization. To achieve a scalable parallel implementation of the speech enhancement algorithm, we used the data-parallel or single program multiple data (SPMD) programming model.
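The per-subband gain computation singled out by profiling is a short recursion over time, which is why each subband can be handled independently of the others. The sketch below is a hedged Python illustration of Eqs. (6) to (9), not the authors' Matlab code. Several details are assumptions made for the example: the function name, the initial values of $P_m$ and $Q_m$, the guard `eps` against division by zero, the linear limiter value `limit`, and the reading of Eq. (8) as a weighted sum with $\gamma$ on the threshold ratio and $(1 - \gamma)$ on the MNR term.

```python
def subband_gain(fm, gfm, alpha=0.0042, beta=0.05, gamma=0.9,
                 limit=4.0, eps=1e-12):
    """Gain track for one subband, following Eqs. (6)-(9).

    fm    : sequence of forward masking thresholds FM_m(n) for this subband
    gfm   : sequence of global forward masking thresholds GFM(n)
    limit : illustrative limiter value (assumption, not taken from the paper)
    eps   : small guard against division by zero (assumption)
    """
    if not fm:
        return []
    p = fm[0]                # assumed initial short-term average
    q = max(fm[0], eps)      # assumed initial noise floor estimate
    gains = []
    for fm_n, gfm_n in zip(fm, gfm):
        # Eq. (6): exponentially smoothed masking threshold.
        p = (1.0 - alpha) * p + alpha * fm_n
        # Eq. (7): noise floor rises slowly, drops immediately to P.
        q = (1.0 + beta) * q if q <= p else p
        q = max(q, eps)
        # Eq. (8): weighted sum of threshold ratio and short-term MNR.
        g = gamma * fm_n / max(gfm_n, eps) + (1.0 - gamma) * p / q
        # Eq. (9): limiter.
        gains.append(min(g, limit))
    return gains
```

Because the loop carries `p` and `q` from one step to the next, it cannot be split along the time axis without extra bookkeeping, whereas separate subbands share no state.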
A single program was written for both master and slave processes, which execute asynchronously on each node. In particular, all processes work on different pieces of data. There are two data partition schemes available in the speech enhancement algorithm: time partition and frequency (subband) partition. In time partition, a long noisy speech file is partitioned into smaller time segments that are processed individually, while in frequency or subband partition, the total number of subbands is divided and distributed among a number of slaves. As the calculation of temporal masking requires information from previous frames, subband partition is more appropriate for parallelization. Hence, it is used in our implementation.

Fig. 4: Flowchart of parallel speech enhancement algorithm.

Figure 4 shows the flowchart of the parallel speech enhancement algorithm, which can be implemented on a multi-core system and/or a cluster system. Initially, the parallel program starts with initialization at every node. Of the two communication schemes available in the Matlab Parallel Computing Toolbox, i.e. distributed arrays and message passing, we use the message passing scheme as it is more flexible. Then, a noisy speech signal is partitioned (using subband partition) and distributed to N cores configured in master-slave fashion. After that, each slave filters the noisy speech signal accordingly, calculates the forward masking threshold, determines the gain for each subband, and applies the gain to each subband signal. After obtaining the denoised speech signal for each subband, each slave sends the results to the master node. Finally, speech reconstruction and PESQ evaluation are performed at the master node.

5. PERFORMANCE EVALUATION

In this section, the performance of the sequential code, in terms of subjective and objective quality of the enhanced speech, is evaluated. Furthermore, the performance of the parallel code, in terms of speedup for various numbers of cores, is presented.

5.1 Subjective and Objective Quality

In order to assess the performance of the new forward masking model in enhancing speech signals, a large number of simulations were performed. Six speech files were taken from the EBU SQAM data set, including English female and male speakers, French female and male speakers, and German female and male speakers. The length of the files was between 17 and 20 seconds. The sampling frequency was 8 kHz, and the frame size was 256 samples (32 ms). Several algorithms were implemented and compared: spectral subtraction, SS [10]; spectral subtraction with minimum statistics, SSMS [11]; speech boosting, SB [5]; speech boosting using forward masking model 1, SBFM1 [4]; speech boosting using forward masking model 2, SBFM2 [9]; and speech boosting using the developed neural network forward masking model, SBFM3. Different types of background noise from the NOISEX-92 and AURORA databases were used, including car, white, pink, F16, factory, babble, airport, exhibition, restaurant, street, subway and train noise.
The variance of the noise was adjusted to obtain SNRs of -5 dB, 0 dB, 5 dB, and 10 dB. The PESQ (Perceptual Evaluation of Speech Quality, ITU-T P.862) measure [16] was utilised for the objective evaluation. Note that PESQ has a 93.5% correlation with subjective tests [16].

To evaluate the performance of the speech enhancement algorithms, we developed a new measure to assess the improvement achieved. Suppose that we have $PESQ_{ref}$, which is the PESQ score for the reference clean speech, $s(n)$, and the corrupted speech, $x(n)$. The PESQ score of the enhanced speech, $y(n)$, was also measured and denoted as $PESQ_{proc}$. Therefore, we can derive a new value, $\eta$, which measures the PESQ improvement achieved by the algorithm as follows:

$$\eta = \frac{PESQ_{proc} - PESQ_{ref}}{PESQ_{ref}} \times 100\% \tag{10}$$
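Eq. (10) is a plain relative change, so it is easy to check by hand. A minimal Python sketch follows; the function name is hypothetical and the two scores are made-up values for illustration, not results from the paper.

```python
def pesq_improvement(pesq_proc, pesq_ref):
    """Relative PESQ improvement in percent, Eq. (10).

    pesq_ref  : PESQ score of the noisy speech against the clean reference
    pesq_proc : PESQ score of the enhanced speech against the clean reference
    """
    if pesq_ref == 0:
        raise ValueError("reference PESQ score must be non-zero")
    return (pesq_proc - pesq_ref) / pesq_ref * 100.0


# Made-up scores for illustration only: noisy speech scores 2.0,
# enhanced speech scores 2.3, i.e. an improvement of about 15 %.
example = pesq_improvement(2.3, 2.0)
```

A negative value would indicate that the processing lowered the PESQ score relative to the unprocessed noisy speech.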

A total of 288 data sets (six speech files, twelve noises, and four SNRs) were simulated for each method. The average quality improvement, $\eta$, achieved by the various speech enhancement methods is shown in Figure 5. Note that the results for the various speech files and noises were averaged for the -5, 0, 5, and 10 dB SNRs. From these results, the speech boosting technique incorporating the neural network forward masking model outperforms the other methods for all SNRs.

Fig. 5: Average $\eta$ (%) for various algorithms.

In order to analyse the performance of our proposed method in more detail, the average quality improvement at -5, 0, 5, and 10 dB SNRs for various noises is shown in Table 1. The best result for each type of noise condition is shown in bold, from which it can be seen that our method using the neural network forward masking model provides a better PESQ improvement than the five other methods tested in all but two noise conditions (white and street noise). Table 2 shows the average quality improvement at -5, 0, 5 and 10 dB SNRs for the various speech files. The best result for each individual speech file is shown in bold. The table shows that more accurate forward masking threshold calculation leads to better enhanced speech quality. Furthermore, informal listening tests confirm that the speech processed with the proposed algorithm sounds more pleasant to a human listener than that obtained by the other algorithms.

Table 1: Average PESQ improvement (%) for various noise types.

Noise Type    SS      SSMS    SB      SBFM1   SBFM2   SBFM3
Car           19.88   18.16   11.80   21.04   22.09   23.30
White         17.50   28.33   15.81   21.58   34.25   33.67
Pink          22.73   28.90   16.93   27.41   37.28   37.32
F16           16.48   18.81   13.62   23.59   29.92   30.43
Factory       18.28   12.47   13.79   25.65   31.75   32.07
Babble         2.61    1.65    7.14   13.76   18.12   19.76
Airport        6.16    3.73    7.83   12.77   16.59   17.67
Exhibition    11.64    5.54   11.79   18.30   30.10   30.72
Restaurant     5.02    2.06    4.34   10.54   17.78   17.96
Street         8.59    9.45   12.82   18.63   15.86   17.06
Subway         4.29    7.49   11.57   20.18   34.42   34.51
Train         14.92   15.57   13.20   19.88   20.74   21.99

Table 2: Average PESQ improvement (%) for different speech files.

Speech File       SS      SSMS    SB      SBFM1   SBFM2   SBFM3
English male       6.24    4.32    5.57   12.37   24.24   24.74
English female     8.79    9.08    9.61   15.65   26.00   26.23
French male       15.17   15.67   11.67   21.94   28.38   29.11
French female     10.83   11.46    9.36   14.69   19.31   19.73
German male       21.89   27.35   21.27   36.03   34.75   36.33
German female     11.13    8.20   12.84   16.00   21.76   22.08

5.2 Parallel Performance

The computing environment used in this research was an AMD Phenom Quad-Core Processor 2.5 GHz system with 2 GBytes of memory. This section analyses the parallel performance of the speech enhancement algorithm in terms of parallel execution time and speedup.
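For reference, speedup is conventionally defined as the single-processor execution time divided by the execution time on p processors, and parallel efficiency as the speedup divided by p. The Python sketch below applies these standard definitions to the execution times listed in Table 3; the function names are illustrative, and the last digit of a computed speedup may differ slightly from the tabulated one, presumably because the listed times are rounded to whole seconds.

```python
def speedup(t_serial, t_parallel):
    """Speedup S_p = T_1 / T_p."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processors):
    """Parallel efficiency E_p = S_p / p (1.0 means ideal linear scaling)."""
    return speedup(t_serial, t_parallel) / processors


# Execution times in seconds for 1 to 4 processors, as listed in Table 3.
times = {1: 430.0, 2: 217.0, 3: 145.0, 4: 109.0}

if __name__ == "__main__":
    for p, t in times.items():
        s = speedup(times[1], t)
        e = efficiency(times[1], t, p)
        print(f"p = {p}: speedup = {s:.2f}, efficiency = {e:.1%}")
```

An efficiency close to 1.0 for every processor count is what "almost linear speedup" means in practice.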

Table 3: Parallel execution time and speedup for various numbers of processors.

Number of Processors   Parallel Execution Time   Speedup
1                      430 seconds               1
2                      217 seconds               1.98
3                      145 seconds               2.97
4                      109 seconds               3.95

Table 3 shows the performance of the parallel speech enhancement algorithm for 1, 2, 3, and 4 processors. For evaluation purposes, we used a female speech signal with various noises and various SNRs and took the average parallel execution time and speedup. The parallel speech enhancement algorithm achieves almost linear speedup, indicating highly efficient parallelization. This could be due to the fast communication between processors, which did not affect the parallel performance. When the number of nodes is high, the communication time will affect the speedup, especially in a cluster system. Therefore, it would be interesting to evaluate our parallel speech enhancement algorithm on a cluster system with a higher number of nodes.

6. CONCLUSIONS

In this paper, a new forward masking model using neural networks has been proposed and incorporated into a speech enhancement algorithm. The performance of our speech enhancement algorithm employing the new forward masking model was compared with five other speech enhancement methods (including two other functional models of forward masking) over twelve different noise types and four SNRs. PESQ results reveal that the proposed algorithm outperforms the other algorithms by 5-15% depending on the SNR. Hence, it appears that the proposed forward masking model has good potential for speech enhancement applications across many types and intensities of environmental noise. On a quad-core system, the parallel speech enhancement algorithm was very efficient, achieving almost linear speedup.

REFERENCES

[1] J. M. Buchholz, A Computational Model of Auditory Masking Based on Signal-Dependent Compression, PhD Thesis, Universitat Bochum, 2002.
[2] T. S. Gunawan, E. Ambikairajah, and D. Sen, "Comparison of Temporal Masking Models for Speech and Audio Coding Applications," in International Symposium on Digital Signal Processing and Communication Systems, pp. 99-103, 2003.
[3] F. Sinaga, T. S. Gunawan, and E. Ambikairajah, "Wavelet Packet Based Audio Coding Using Temporal Masking," in Int. Conf. on Information, Communications and Signal Processing, Singapore, pp. 1380-1383, 2003.
[4] T. S. Gunawan and E. Ambikairajah, "Speech enhancement using temporal masking and fractional bark gammatone filters," in 10th International Conference on Speech Science & Technology, Sydney, pp. 420-425, 2004.
[5] N. Westerlund, Applied Speech Enhancement for Personal Communication, Thesis, Blekinge Institute of Technology, 2003.
[6] W. Jesteadt, S. P. Bacon, and J. R. Lehman, "Forward masking as a function of frequency, masker level, and signal delay," Journal of Acoustic Society of America, vol. 71, pp. 950-962, 1982.
[7] B. Strope and A. Alwan, "A model of dynamic auditory perception and its application to robust word recognition," IEEE Transactions on Speech and Audio Processing, vol. 5, pp. 451-464, 1997.
[8] H. Najafzadeh, H. Lahdidli, M. Lavoie, and L. Thibault, "Use of auditory temporal masking in the MPEG psychoacoustics model 2," in the 114th Convention, Audio Engineering Society, pp., 2003.
[9] T. S. Gunawan and E. Ambikairajah, "A new forward masking model and its application to speech enhancement," in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 149-152, 2006.
[10] S. F. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, pp. 113-120, 1979.
[11] R. Martin, "Spectral Subtraction Based on Minimum Statistics," in Europe Signal Processing Conference, Edinburgh, Scotland, pp. 1182-1185, 1994.
[12] E. Brown, D. Ros, C. Bruscianelli, and B. Mila de la Roca, "Non-recurrent neural networks for auditory perceptual modelling," in IEEE International Conference on Devices, Circuits and Systems, pp. 139-143, 1995.
[13] M. Florentine, H. Fastl, and S. Buus, "Temporal integration in normal hearing, cochlear impairment, and impairment simulated by masking," Journal of Acoustic Society of America, vol. 84, pp. 195-203, 1988.
[14] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall PTR, 1994.
[15] F. D. Foresee and M. T. Hagan, "Gauss-Newton approximation to Bayesian regularization," in International Joint Conference on Neural Networks, pp. 1930-1935, 1997.
[16] ITU, "ITU-T P.862, Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," International Telecommunication Union, Geneva, 2001.