Speech Enhancement using Temporal Masking and Fractional Bark Gammatone Filters

Teddy Surya Gunawan, Eliathamby Ambikairajah
School of Electrical Engineering and Telecommunications
The University of New South Wales, NSW 2052, Australia
tsgunawan@ee.unsw.edu.au; abi@ee.unsw.edu.au

Macquarie University, Sydney, December 8 to 10, 2004. Copyright, Australian Speech Science & Technology Association Inc.

Abstract

A speech enhancement technique based on the temporal masking properties of the human auditory system is presented. The noisy signal is divided into a number of sub-bands with fractional Bark accuracy, and the sub-band signals are individually and adaptively weighted in the time domain according to a short-term temporal masking threshold to noise ratio estimate in each sub-band. Objective measures and informal listening tests demonstrate significant improvements over three well-known existing methods when tested with speech signals corrupted by various noises at signal-to-noise ratios of 0, 10, and 20 dB.

1. Introduction

The purpose of speech enhancement is to improve the performance of speech communication systems in noisy environments. Speech enhancement can be used in many applications, such as mobile communication systems, speech recognition, and hearing aids. The additive noise source may be wideband noise, in the form of white or colored noise, or a periodic signal, such as hum noise or room reverberations. Single channel speech enhancement is a more difficult task than multiple channel enhancement, since there is no independent source of information with which to help separate the speech and noise signals.

The spectral subtraction algorithm is a well-known approach to speech enhancement (Boll 1979; Gustafsson, Nordholm, and Claesson 2001; Martin 1994; Tsoukalas, Mourjopoulos, and Kokkinakis 1997), in which the noise is usually estimated during speech pauses. Spectral subtraction is widely known to suffer from perceptible artifacts, namely the musical residual noise that the method introduces into the enhanced speech. In order to reduce the musical noise, various algorithms have been developed (Gustafsson et al. 2001; Tsoukalas et al. 1997; Virag 1999). In (Virag 1999) and (Tsoukalas et al. 1997), human auditory masking properties, i.e. simultaneous masking, were used to reduce the musical noise.

Recently, a new speech enhancement method known as speech boosting has been reported (Westerlund 2003). Instead of focusing on suppressing the noise, the method increases the relative power of the speech, thus acting as a speech booster. It is only active when speech is present, and remains idle when only noise is present. As stated in (Westerlund 2003), the algorithm has proven to be robust, flexible, and versatile.

Functional models of the temporal masking effect of the human auditory system have recently been used with success in speech and audio coding to provide more efficient signal compression (Gunawan, Ambikairajah, and Sen 2003; Sinaga, Gunawan, and Ambikairajah 2003). Furthermore, a fractional Bark filterbank resolution, i.e. 0.25 and 0.5 Bark (Basic and Advanced Version), has been reported in (ITU 1998) to provide more accurate objective measurement of perceived audio quality (PEAQ). Therefore, it is expected that the use of fractional Bark accuracy will provide a more accurate temporal masking calculation in speech enhancement.

In this paper, we propose a novel speech enhancement method that employs a functional model of temporal masking and a fractional Bark gammatone filterbank, based upon modifications to the speech boosting technique (Westerlund 2003). To evaluate the performance of our algorithm, three other algorithms were implemented: spectral subtraction (Boll 1979), spectral subtraction with minimum statistics (Martin 1994), and speech boosting (Westerlund 2003). The PESQ (Perceptual Evaluation of Speech Quality, ITU-T P.862) measure was used to benchmark the various methods.

2. Proposed Speech Enhancement Algorithm

Speech that has been contaminated by noise can be expressed as

$$x(n) = s(n) + v(n) \tag{1}$$

where $x(n)$ is the noisy speech, $s(n)$ is the clean speech signal, and $v(n)$ is the additive noise source, all in the discrete time domain. As mentioned in Section 1, the objective in speech enhancement is to suppress the noise, resulting in an output signal $y(n)$ with a higher signal-to-noise ratio (SNR).

We propose a new speech enhancement algorithm that incorporates temporal masking, as shown in Figure 1. By filtering the input signal $x(n)$ using a bank of $M$ analysis filters, $h_m(n)$, the signal is divided into $M$ sub-bands, each denoted by $x_m(n)$, where $m$ is the sub-band index.

Figure 1: Speech enhancement using temporal masking

This filtering operation can be described in the time domain as

$$x_m(n) = x(n) * h_m(n) \tag{2}$$

where $m = 1, \ldots, M$ and $*$ denotes convolution. The global temporal masking threshold, $GTM(n)$, and the temporal masking threshold in each sub-band, $TM_m(n)$, are calculated from the noisy speech signal $x(n)$ and the sub-band signal $x_m(n)$, respectively. $GTM(n)$ and $TM_m(n)$ are used to calculate the gain $\Gamma_m(n)$ in each sub-band. The gain, $\Gamma_m(n)$, is a weighting function that amplifies the signal in band $m$ during speech activity. The enhanced speech, $y(n)$, is then obtained by applying the synthesis filters, $g_m(n)$, and compensating the delay in each sub-band as follows

$$y(n) = \sum_{m=1}^{M} \left[\Gamma_m(n)\, x_m(n)\right] * g_m(n) \tag{3}$$

Our objective is now to find a gain function, $\Gamma_m(n)$, that weights the input signal sub-bands, $x_m(n)$, based on the temporal masking threshold to noise ratio (MNR). The MNR in each sub-band is the ratio of a short-term average temporal masking threshold, $P_m(n)$, to an estimate of the noise floor level, $Q_m(n)$, as used in equation (6). The short-term average temporal masking threshold in sub-band $m$ is calculated as

$$P_m(n) = (1 - \alpha_m)\, P_m(n-1) + \alpha_m\, TM_m(n) \tag{4}$$

where $\alpha_m$ is a small positive constant (here $\alpha_m = 0.0042$ for all $m$) that controls the sensitivity of the algorithm to changes in the temporal masking threshold and acts as a smoothing factor. The slowly varying noise floor estimate for the $m$th sub-band, $Q_m(n)$, is calculated as

$$Q_m(n) = \begin{cases} (1 + \beta_m)\, Q_m(n-1), & Q_m(n-1) \le P_m(n) \\ P_m(n), & Q_m(n-1) > P_m(n) \end{cases} \tag{5}$$

where $\beta_m$ is a small positive constant (here $\beta_m = 0.05$ for all $m$) controlling how fast the noise floor level estimate in sub-band $m$ adapts to changes in the noise environment. The variables $P_m(n)$, $Q_m(n)$, $TM_m(n)$, and $GTM(n)$ are used to calculate the gain function $\Gamma_m(n)$ as follows,

$$\Gamma_m(n) = \gamma\, \frac{TM_m(n)}{GTM(n)} + (1 - \gamma)\, \frac{P_m(n)}{Q_m(n)} \tag{6}$$

where $0 \le \gamma \le 1$ is a constant controlling the relative contribution of the temporal masking threshold ratio and the short-term MNR. Hence, the proposed algorithm still acts as a speech booster, but the gain calculation differs from (Westerlund 2003), which calculates the gain function from the short-term SNR.

In order to find the optimum $\gamma$, we evaluated the average quality improvement (see the $\delta$ calculation in equation (18)) for a speech file (female English speaker) contaminated with car noise at 0, 10, and 20 dB SNRs for various values of $\gamma$. From the results of this experiment, shown in Figure 2, we found the optimum value to be $\gamma = 0.8$.
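
The recursions in equations (4) to (6) can be sketched for a single sub-band as follows. This is a minimal illustration, not the authors' implementation: the function name `boost_gains`, the initialisation of $P_m$ and $Q_m$ to the first threshold value, and the small `eps` guard against division by zero are all assumptions, and the $(1 - \gamma)$ weighting follows equation (6) as written above.

```python
def boost_gains(tm, gtm, alpha=0.0042, beta=0.05, gamma=0.8, eps=1e-12):
    """Per-sample gain for one sub-band, following equations (4)-(6).

    tm  : sequence of sub-band temporal masking thresholds TM_m(n)
    gtm : sequence of global temporal masking thresholds GTM(n)
    """
    p = tm[0]  # P_m(-1): assumed initial value, not specified in the text
    q = tm[0]  # Q_m(-1): assumed initial value, not specified in the text
    gains = []
    for tm_n, gtm_n in zip(tm, gtm):
        # Equation (4): short-term average of the masking threshold.
        p = (1.0 - alpha) * p + alpha * tm_n
        # Equation (5): noise floor creeps upwards, but never exceeds P_m(n).
        q = (1.0 + beta) * q if q <= p else p
        q = max(q, eps)  # assumed guard, keeps the division well defined
        # Equation (6): blend of threshold ratio and short-term MNR.
        gains.append(gamma * tm_n / max(gtm_n, eps) + (1.0 - gamma) * p / q)
    return gains


if __name__ == "__main__":
    # Toy input: the sub-band threshold jumps from 1 to 10 halfway through.
    tm = [1.0] * 50 + [10.0] * 50
    gtm = [10.0] * 100
    g = boost_gains(tm, gtm)
    print(round(g[10], 3), round(g[-1], 3))
```

In this toy input the gain rises once the sub-band threshold approaches the global threshold, which is the boosting behaviour described above.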

Since the calculation of $\Gamma_m(n)$ involves a division, care must be taken to ensure that the quotient does not become excessively large due to a small $Q_m(n)$. In a situation with a very high MNR, $\Gamma_m(n)$ will become very large if no limit is imposed on this function.

Figure 2: Quality improvement for various γ

Therefore, a limiter can be applied to $\Gamma_m(n)$ as follows:

$$\Gamma_m(n) = \begin{cases} \Gamma_m(n), & \Gamma_m(n) \le C \\ C, & \Gamma_m(n) > C \end{cases} \tag{7}$$

where $C$ is some positive constant. Using the same experiment as for the optimum $\gamma$, we found that setting $C = 8$ dB $\approx 2.5$ provides a suitable limiter for the gain function.

3. Fractional Bark Gammatone Filterbank

In this paper, a fractional Bark gammatone filterbank was employed to filter the signal $x(n)$ into its sub-band signals $x_m(n)$. A DC rejection filter was applied to remove the subsonic components of the input signals. In addition, the optimum number of filter coefficients required was evaluated and the delay compensation for each sub-band was calculated.

3.1. DC Rejection Filter

We designed a fourth-order Butterworth high-pass filter with a cut-off frequency of 20 Hz to remove the subsonic components of the input signals. The filter was implemented as a cascade of two second-order IIR filters,

$$H_{DC}(z) = \frac{1 - 2z^{-1} + z^{-2}}{1 + a z^{-1} + b z^{-2}} \cdot \frac{1 - 2z^{-1} + z^{-2}}{1 + c z^{-1} + d z^{-2}} \tag{8}$$

where $a = -1.9878047$, $b = 0.98804997$, $c = -1.9711486$, and $d = 0.9713918$, for $f_s = 8000$ Hz.

3.2. Gammatone Filters

For the analysis filters, we used gammatone filters as they resemble the shape of human auditory filters (Kubin and Kleijn 1999). These were implemented as FIR filters. To achieve perfect reconstruction, the synthesis filters, $g_m(n)$, are the time reverse of the analysis filters, $h_m(n)$. The analysis filter for each sub-band is obtained using the following expression,

$$h_m(n) = a_m\, (nT)^{N-1}\, e^{-\pi b\, BW_m\, nT} \cos\left(2\pi f_{c,m}\, nT + \varphi\right) \tag{9}$$

where $f_{c,m}$ is the centre frequency of sub-band $m$, $T$ is the sampling period, and $N$ is the gammatone filter order ($N = 4$). For $f_s = 8000$ Hz, the total number of sub-bands, $M$, depends on the Bark resolution, $d$. The parameter $n$ is the discrete time sample index, with $n = 0, \ldots, Nf_m$, where $Nf_m$ is the length of each filter within the filterbank.
$BW_m$ is the critical bandwidth at a particular centre frequency, $b = 1.65$, and the $a_m$ were selected for each filter so as to normalize the filter gain to 0 dB.

3.3. Spacing of the Filters

The gammatone filters were spaced linearly on the Bark scale, or critical-band rate scale. The critical band number $z$ (in Bark) is related to the linear frequency $f$ (in Hz) as follows (Schroeder, Atal, and Hall 1979):

$$z(f) = 7\, \operatorname{asinh}\left(\frac{f}{650}\right), \qquad f(z) = 650\, \sinh\left(\frac{z}{7}\right) \tag{10}$$

The frequency borders of the filters range from $f_L = 80$ Hz to $f_U = 4000$ Hz. The widths and spacing of the filter bands correspond to a resolution of $d$ Bark. The number of sub-bands $M$ is then calculated as follows,

$$M = \left\lceil \frac{z(f_U) - z(f_L)}{d} \right\rceil \tag{11}$$

Since $z(f_U) - z(f_L) \approx 16.76$ Bark, a spacing of $d = 0.5$ Bark required 34 filters, while a spacing of $d = 0.25$ Bark required 68 filters in order to cover the frequency range of 0 to 4 kHz. The lower, upper, and centre frequencies of each sub-band on the Bark scale can be calculated as follows,

$$z_{l,m} = z(f_L) + (m-1)\, d, \qquad z_{u,m} = \min\left(z(f_L) + m\, d,\; z(f_U)\right), \qquad z_{c,m} = \frac{z_{l,m} + z_{u,m}}{2} \tag{12}$$

where $m = 1, \ldots, M$. Subsequently, the centre frequency and the bandwidth in Hz can be determined as follows,

$$f_{c,m} = f(z_{c,m}), \qquad BW_m = f(z_{u,m}) - f(z_{l,m}) \tag{13}$$

In order to find the optimum value of $d$ for our speech enhancement method, we evaluated the average quality improvement and processing time for various values of $d$ at 0, 10, and 20 dB SNRs, as seen in Figure 3. From Figure 3, we found that $d = 0.25$ provides the best trade-off between speech quality and processing time. Hence, $d = 0.25$ was used throughout our experiments. The frequency responses of the gammatone filters for $d = 0.25$ are shown in Figure 4.

Figure 3: Fractional Bark spacing versus quality and processing time

Figure 4: ¼ Bark spacing (68 filters)

3.4. Optimum Number of Filter Coefficients (Nf)

The number of coefficients required to implement the analysis/synthesis filterbank depends on the impulse response of the gammatone filters. The low frequency filters need more coefficients than the high frequency filters. The length of each filter within the filterbank, $Nf_m$, can be optimised by evaluating the non-zero part of the gammatone filter response in each sub-band. The optimum filter length in samples for each sub-band is given by

$$Nf_m = \min\left(Nf_{\max},\; \operatorname{round}\left(\frac{f_s}{f_{c,m}}\right) \times 25\right) \tag{14}$$

where $f_{c,m}$ is the centre frequency of the filter in Hz and $Nf_{\max} = 1024$ is the maximum filter length.

3.5. Delay Compensation

Because the optimum filter length, $Nf_m$, differs between sub-bands, the amount of filter delay accumulated by each sub-band is also different. Without compensation for this delay, the reconstruction of the sub-band signal components would lead to an incoherent output signal. The total amount of delay compensation necessary for sub-band $m$ is simply equal to $Nf_m$, the optimum filter length calculated in equation (14).

4. Temporal Masking

Temporal masking is a time domain phenomenon that arises when two stimuli occur within a small interval of time, and it plays an important role in human auditory perception. Forward temporal masking occurs when a masker precedes the signal in time, while backward masking occurs when the signal precedes the masker in time. Forward masking is the more important effect since the duration of the masking effect can be much longer, depending on the duration of the masker. The forward masking model used in this paper is based on (Jesteadt, Bacon, and Lehman 1982), and has been used and optimised in our previous papers for speech and audio coding (Gunawan et al. 2003; Sinaga et al. 2003).
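
As a numerical check on the filter layout of Sections 3.3 and 3.4, the band edges, centre frequencies, bandwidths, and filter lengths can be sketched as below. This is a minimal sketch under stated assumptions: the function names are hypothetical, the ceiling in the sub-band count is assumed (it is the rounding that reproduces the 34 and 68 filters quoted in the text), and the filter length rule is assumed to be 25 times the rounded ratio $f_s / f_c$, capped at 1024.

```python
import math


def hz_to_bark(f):
    """Critical band number z(f) = 7 asinh(f / 650)."""
    return 7.0 * math.asinh(f / 650.0)


def bark_to_hz(z):
    """Inverse mapping f(z) = 650 sinh(z / 7)."""
    return 650.0 * math.sinh(z / 7.0)


def bark_bands(d, f_low=80.0, f_high=4000.0, fs=8000.0, nf_max=1024):
    """Return a list of (centre frequency, bandwidth, filter length) per band."""
    z_low, z_high = hz_to_bark(f_low), hz_to_bark(f_high)
    m_total = math.ceil((z_high - z_low) / d)  # assumed ceiling
    bands = []
    for m in range(1, m_total + 1):
        z_l = z_low + (m - 1) * d              # lower edge in Bark
        z_u = min(z_low + m * d, z_high)       # upper edge, clipped at f_high
        z_c = 0.5 * (z_l + z_u)                # centre in Bark
        f_c = bark_to_hz(z_c)
        bw = bark_to_hz(z_u) - bark_to_hz(z_l)
        nf = min(nf_max, round(fs / f_c) * 25)  # assumed length rule
        bands.append((f_c, bw, nf))
    return bands


if __name__ == "__main__":
    for d in (0.5, 0.25):
        print(d, len(bark_bands(d)))
```

With these assumptions, the lowest bands hit the 1024-coefficient cap while the highest band needs only a few tens of coefficients, which matches the observation that low frequency filters need more coefficients.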
Based on the forward masking experiments carried out by (Jesteadt et al. 1982), the forward masking level $FM$ can be well fitted to psychoacoustic data using the following equation:

$$FM = a\, (b - \log_{10} \Delta t)(L - c) \tag{15}$$

where $FM$ is the amount of forward masking in dB, $\Delta t$ is the time difference between the masker and the maskee in milliseconds, $L$ is the masker level in dB, and $a$, $b$, and $c$ are parameters that can be derived from psychoacoustic data. To simplify the masking calculation, $a$, $b$, and $c$ were set to 0.7, 2.3, and 20, respectively. Note that these parameters can be further optimised.

To evaluate the amount of forward masking, the current frame of 32 ms was subdivided into four sub-frames as shown in Figure 5. The forward masking level $FM_j$ was calculated for the $j$th sub-frame using the energy, $L_j$, accumulated over the previous frame and all sub-frames up to the current sub-frame.

Figure 5: Calculation of forward masking

The temporal masking threshold $TM$ is then chosen as follows

$$TM = 10^{\max\{FM_1,\, FM_2,\, FM_3,\, FM_4\}/10} \tag{16}$$

Note that the calculation of a temporal masking threshold every 8 ms was considered adequate since this provides a good approximation to the decay effect, which lasts around 200 ms. The temporal masking thresholds for the sub-bands, $TM_1, \ldots, TM_M$, are calculated from the sub-band signals $x_m(n)$, and $GTM$ is calculated from $x(n)$.

5. Performance Evaluation

In order to assess the performance of the proposed algorithm in enhancing noisy signals, a large number of simulations were performed. Six speech files were taken from the EBU SQAM data set, comprising English female and male speakers, French female and male speakers, and German female and male speakers. The length of the files is between 7 and 20 seconds. The sampling frequency was 8 kHz, and the frame size was 256 samples (32 ms). Several algorithms were implemented and compared: spectral subtraction, SS (Boll 1979), spectral subtraction with minimum statistics, SSMS (Martin 1994), speech boosting, SB (Westerlund 2003), and the proposed method, speech boosting exploiting temporal masking, SBTM.

5.1. Addition of Noise to Test Data

Different types of background noise from the NOISEX-92 database were used: car noise, white noise, pink noise, F16 noise, factory noise, and babble noise. The variance of the noise was adjusted to obtain SNRs in the recorded signals ranging from 0 dB to 20 dB, as follows:

$$x(n) = s(n) + \sqrt{\frac{\mathrm{Var}(s)}{\mathrm{Var}(v)\, 10^{SNR/10}}}\; v(n) \tag{17}$$

5.2. Objective Measures

The PESQ (Perceptual Evaluation of Speech Quality) measure (ITU 2001), which was recently adopted as an ITU-T recommendation (P.862), was utilised for the objective evaluation. Other objective measures such as Itakura-Saito distortion, Articulation Index, Segmental SNR, and SNR have been correlated to subjective tests at 59%, 67%, 77%, and 24%, respectively (Quackenbush, Barnwell, and Clements 1988), while PESQ has a 93.5% correlation with subjective tests (ITU 2001), although these figures were obtained using different data sets and subjective experiments. To evaluate the performance of the speech enhancement algorithms, we developed a new measure to assess the improvement achieved.
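
The noise scaling of Section 5.1 can be sketched as follows. This is an illustrative sketch rather than the authors' test harness: the helper names are hypothetical, the signals are synthetic stand-ins for the speech and noise files, and the square root in the scale factor is the form that yields the requested SNR when the SNR is defined as a ratio of variances.

```python
import math
import random


def variance(x):
    """Population variance of a sequence of samples."""
    mean = sum(x) / len(x)
    return sum((xi - mean) ** 2 for xi in x) / len(x)


def add_noise_at_snr(s, v, snr_db):
    """Scale noise v so that s + scale * v has the requested SNR in dB."""
    scale = math.sqrt(variance(s) / (variance(v) * 10.0 ** (snr_db / 10.0)))
    return [si + scale * vi for si, vi in zip(s, v)]


def achieved_snr_db(s, x):
    """SNR in dB of noisy signal x relative to clean signal s."""
    noise = [xi - si for si, xi in zip(s, x)]
    return 10.0 * math.log10(variance(s) / variance(noise))


if __name__ == "__main__":
    random.seed(0)
    s = [math.sin(0.05 * n) for n in range(2000)]       # stand-in for speech
    v = [random.gauss(0.0, 1.0) for _ in range(2000)]   # stand-in for noise
    for snr in (0, 10, 20):
        x = add_noise_at_snr(s, v, snr)
        print(snr, round(achieved_snr_db(s, x), 6))
```

Measuring the SNR of the mixture against the clean signal recovers the target value, which is a quick way to confirm that the scale factor is applied to the noise amplitude rather than to its power.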
Suppose that $PESQ_{ref}$ is the PESQ score obtained for the reference clean speech, $s$, and the corrupted speech, $x$. The PESQ score of the enhanced speech, $y$, was also measured and is denoted $PESQ_{proc}$. We can then derive a new value, $\delta$, which measures the PESQ improvement achieved by the algorithm as follows

$$\delta = \frac{PESQ_{proc} - PESQ_{ref}}{PESQ_{ref}} \times 100\% \tag{18}$$

A total of 108 data sets, from six speech files, six noises, and three SNRs, were simulated for each method. The average quality improvement, $\delta$, achieved by the various speech enhancement methods is shown in Figure 6. Note that the $\delta$ results for the various speech files and noises were averaged at each of the 0, 10, and 20 dB SNRs. From these results, the proposed temporal masking-based speech boosting method appears to outperform the other methods at all SNRs.

Figure 6: Average δ (%) for various algorithms

In order to analyze the performance of our proposed method in more detail, the quality improvement averaged over the 0, 10, and 20 dB SNRs is shown for each noise type in Table 1.

Table 1: Average PESQ improvement δ (%) for various noise types using spectral subtraction (SS), spectral subtraction with minimum statistics (SSMS), speech boosting (SB), and speech boosting with temporal masking (SBTM).

Noise            SS      SSMS    SB      SBTM
Car noise        3.27    5.26    0.49    7.56
White noise      6.22    24.8    6.39    29.76
Pink noise       6.43    22.28   5.40    26.60
F16 noise        .2      6.23    2.8     22.5
Factory noise    2.70    .84     2.65    20.20
Babble noise     2.5     4.20    7.44    9.2

The best δ result for each type of noise condition is obtained with SBTM, from which it can be seen that our proposed method provides a better PESQ improvement than the three other methods. The best improvement is achieved for white noise, while the least improvement is achieved for babble noise. Babble noise is speech conversation in the background, so our algorithm may misclassify the babble as speech and boost it as well.

Table 2: Average PESQ improvement δ (%) for different speech files using spectral subtraction (SS), spectral subtraction with minimum statistics (SSMS), speech boosting (SB), and speech boosting with temporal masking (SBTM).

Speech           SS      SSMS    SB      SBTM
English male     8.66    2.69    9.78    20.70
English female   .7      5.6     .55     8.58
French male      3.82    7.3     .7      9.8
French female    0.09    3.42    9.35    4.42
German male      8.3     25.85   9.65    34.0
German female    9.93    9.73    3.4     8.5

Table 2 shows the quality improvement averaged over the 0, 10, and 20 dB SNRs for the various speech files. While the table shows that our proposed algorithm outperforms the other algorithms, it also reveals that our algorithm improves male speech more than female speech.

6. Conclusion

We have presented a fractional Bark gammatone filterbank for speech enhancement based on a short-term temporal masking threshold to noise ratio (MNR). The performance of our proposed algorithm was compared with three other standard speech enhancement methods over six different noise types and three SNRs. PESQ results reveal that the proposed algorithm outperforms the other algorithms by 7-5% depending on the SNR. In the particularly demanding 0 dB SNR condition, the new technique achieves at least a 40% relative improvement in δ PESQ over any of the existing methods compared. Hence, it appears that the temporal masking threshold based algorithm with fractional Bark accuracy has good potential for speech enhancement applications across many types and intensities of environmental noise. Further research is required to fine-tune the parameters for different speech and/or noise characteristics.

7. References

Beerends, J. G., Hekstra, A. P., Rix, A. W., & Hollier, M. P. (2002). Perceptual Evaluation of Speech Quality.
Boll, S. F. (1979). Suppression of Acoustic Noise in Speech Using Spectral Subtraction. IEEE Transactions on Acoustics, Speech and Signal Processing, 27(2), pp. 113-120.
Gunawan, T. S., Ambikairajah, E., & Sen, D. (2003, December). Comparison of Temporal Masking Models for Speech and Audio Coding Applications. Paper presented at the International Symposium on Digital Signal Processing and Communication Systems, pp. 99-03.
Gustafsson, H., Nordholm, S. E., & Claesson, I. (2001). Spectral Subtraction Using Reduced Delay Convolution and Adaptive Averaging. IEEE Transactions on Speech and Audio Processing, 9(8), pp. 799-807.
ITU. (1998). ITU-R BS.1387, Method for Objective Measurements of Perceived Audio Quality. Geneva: International Telecommunication Union.
ITU. (2001). ITU-T P.862, Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. Geneva: International Telecommunication Union.
Jesteadt, W., Bacon, S. P., & Lehman, J. R. (1982). Forward masking as a function of frequency, masker level, and signal delay. Journal of the Acoustical Society of America, 71(4), pp. 950-962.
Kubin, G., & Kleijn, W. B. (1999). On speech coding in a perceptual domain. Paper presented at the International Conference on Acoustics, Speech, and Signal Processing, pp. 205-208.
Martin, R. (1994). Spectral Subtraction Based on Minimum Statistics. Paper presented at the European Signal Processing Conference, Edinburgh, Scotland, pp. 1182-1185.
Quackenbush, S. R., Barnwell, T. P., & Clements, M. A. (1988). Objective Measures of Speech Quality. Englewood Cliffs: Prentice Hall.
Schroeder, M. R., Atal, B. S., & Hall, J. L. (1979). Optimizing digital speech coders by exploiting masking properties of the human ear. Journal of the Acoustical Society of America, 66, pp. 1647-1652.
Sinaga, F., Gunawan, T. S., & Ambikairajah, E. (2003). Wavelet Packet Based Audio Coding Using Temporal Masking. Paper presented at the Int. Conf. on Information, Communications and Signal Processing, Singapore.
Tsoukalas, D. E., Mourjopoulos, J. N., & Kokkinakis, G. (1997). Speech Enhancement Based on Audible Noise Suppression. IEEE Transactions on Speech and Audio Processing, 5(6), pp. 497-514.
Virag, N. (1999). Single Channel Speech Enhancement Based on Masking Properties of the Human Auditory System. IEEE Transactions on Speech and Audio Processing, 7(2), pp. 126-137.
Westerlund, N. (2003). Applied Speech Enhancement for Personal Communication. PhD Thesis, Blekinge Institute of Technology.