TRANSIENT NOISE REDUCTION BASED ON SPEECH RECONSTRUCTION

Jian Li 1,2, Shiwei Wang 1,2, Renhua Peng 1,2, Chengshi Zheng 1,2, Xiaodong Li 1,2

1. Communication Acoustics Laboratory, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
2. Acoustics and Information Technology Laboratory, Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai 201210, China

e-mail: cszheng@mail.ioa.ac.cn

21st International Congress on Sound and Vibration (ICSV21), Beijing, China, 13-17 July 2014

This paper proposes a novel transient noise reduction (TNR) algorithm based on speech reconstruction. The proposed algorithm has two stages. First, transient noise is detected using the linear prediction residual, in a scheme referred to as the linear prediction residual (LPR)-based method. Second, the frames that contain transient noise are replaced with speech reconstructed by packet loss concealment techniques, which suppresses the transient noise robustly while limiting speech distortion. Compared with traditional TNR algorithms, the proposed algorithm is computationally efficient. Moreover, it can eliminate transient noise completely, even when voiced speech and transient noise are present simultaneously. Experimental results show that the proposed algorithm reduces transient noise effectively, by up to 3 dB, without introducing audible speech distortion.

1. Introduction

Transient noise, a type of non-stationary signal with a short duration of typically less than 50 ms, often appears as an interference in speech communication systems such as mobile phones, hearing aids and teleconference devices [1]. Since transient noise can seriously degrade speech quality in practice, it must be suppressed efficiently. In recent years, transient noise reduction (TNR) has become an active research topic, and researchers have made considerable efforts to suppress this kind of noise.
In [1] and [2], Talmon and Cohen proposed algorithms that suppress transient noise efficiently using diffusion maps. However, these algorithms are computationally complex and non-causal. In [3]-[5], transient noise is suppressed in the time domain, the wavelet domain and the frequency domain, respectively. These algorithms suppress transient noise with low delay, but they may cause serious speech distortion when speech is erroneously detected as transient noise. Moreover, experimental studies show that the existing algorithms cannot eliminate transient noise completely in a perceptual sense.

In this paper, we propose a novel TNR algorithm based on speech reconstruction. The proposed method consists of two steps. In the first step, the linear prediction residual (LPR)-based method is used to detect as much of the transient noise as possible. In the second step, the transient
noise-corrupted frames are removed and packet loss concealment techniques are used to reconstruct the speech so that it remains continuous.

The remainder of this paper is organized as follows. In Section 2, we formulate the problem. The LPR-based method is presented in Section 3. A packet loss concealment technique for speech reconstruction is described in Section 4. Section 5 gives experimental results that show the validity of the proposed algorithm, and conclusions are presented in Section 6.

2. Problem formulation

Let s(n) denote a clean speech signal, and let d_st(n) and d_tr(n) denote the additive stationary and transient noise signals, respectively. The signal received by a microphone is composed of these three components:

x(n) = s(n) + d_st(n) + d_tr(n)    (1)

Since the additive stationary noise can be removed by traditional single-channel speech enhancement algorithms [6], [7], we ignore the impact of d_st(n) in this paper. Therefore, (1) can be rewritten as:

x(n) = s(n) + d_tr(n)    (2)

The microphone signal x(n) is divided into short-time frames, and the transient noise detection problem in the lth frame can be regarded as a binary hypothesis test:

H0(l): x_l(n) = s(Ml + n)
H1(l): x_l(n) = s(Ml + n) + d_tr(Ml + n)  or  x_l(n) = d_tr(Ml + n)    (3)

where M is the frame shift and n = 0, 1, ..., N-1, with N the frame length. In this paper, we choose M = 256 and N = 512 with a sampling frequency of 16 kHz. To eliminate the transients completely, it is better to detect as much of the transient noise as possible, even when both speech and transient noise are present in the lth frame, since the resulting loss of speech can be repaired by speech reconstruction techniques.

3. Transient noise detection with the LPR-based method

In the following three parts, we introduce the LPR-based method.
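As a concrete starting point, the framing behind the hypothesis test of Eq. (3) can be sketched as follows. This is a minimal sketch; `frame_signal` is a hypothetical helper name, with M = 256 and N = 512 taken from Section 2.

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    """Split x(n) into overlapping frames x_l(n) = x(M*l + n).

    frame_len is N and hop is M in the paper's notation; any tail
    samples that do not fill a whole frame are dropped.
    """
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[l * hop: l * hop + frame_len] for l in range(n_frames)])
```

Each row of the returned array is one analysis frame x_l(n), on which the detector of Section 3 operates.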
In the first part, we analyze the properties of the LPR for different types of signals. In the second part, a specific procedure is given to distinguish transient noise from speech. In the final part, experimental results are presented to show the validity of this method.

3.1 The properties of the LPR

In this paper, we assume that the energy of transient noise is mostly concentrated in a short time interval and that its temporal energy is significantly larger than that of the speech components. In traditional methods, spectral coherence can be applied to distinguish unvoiced speech from transient noise [8], while the harmonic structure of voiced speech is useful for differentiating voiced speech from transient noise [5]. However, when voiced speech and transient noise are present at the same time, the voiced speech strongly affects the observed characteristics of the transient noise and makes it difficult to detect. To solve this problem, we propose the LPR-based method.

To enhance the difference between the characteristics of transient noise and speech, we whiten the noisy signal x(n) in each frame. Let x̃_l(n) be the LPR in the lth frame, which can be written as:

x̃_l(n) = x_l(n) - Σ_{p=1}^{P} a_p^l x_l(n - p)    (4)
where {a_p^l}, p = 1, ..., P, are the AR coefficients in the lth frame. In practice, the common Levinson-Durbin algorithm can be applied to estimate the AR coefficients.

Different types of signals within a short time frame are shown in Figure 1, where (a) is voiced speech, (b) is transient noise, (c) is voiced speech corrupted by transients and (d) is unvoiced speech. Each type of signal is whitened using linear prediction and the results are shown in Figure 2. It can be observed that the LPR of voiced speech reduces to an impulse train, in which the impulses appear periodically, as shown in Figure 2(a). In contrast, Figure 2(b) indicates that the transient noise concentrates its energy in a small window of time both before and after linear prediction, owing to its short duration and flat spectrum. Comparing Figure 1(c) and Figure 2(c), it can be seen that the transient noise becomes more prominent after linear prediction, since the voiced speech is suppressed by the linear prediction while the whitened transient component retains most of its energy. The energy of unvoiced speech is approximately uniformly distributed over time, as can be seen in Figure 1(d) and Figure 2(d).

Figure 1: Waveforms of the original signals.

Figure 2: Waveforms of the whitened signals.

3.2 A signal centroid-based method to detect transient noise

Based on the different distributions of transient noise and the other signals in the residual domain, we propose a signal centroid-based method to detect transient noise.
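Before turning to the detector itself, the whitening step of Eq. (4) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `levinson_durbin` and `lp_residual` are hypothetical helper names, and the prediction order P = 12 is an assumption (the paper does not state P).

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for AR coefficients a_1..a_P
    from the autocorrelation sequence r[0..P]."""
    a = np.zeros(order)
    err = r[0]
    for i in range(order):
        # reflection coefficient for order i+1
        acc = r[i + 1] - np.dot(a[:i], r[i:0:-1])
        k = acc / err
        a_new = a.copy()
        a_new[i] = k
        a_new[:i] = a[:i] - k * a[i - 1::-1]
        a = a_new
        err *= (1.0 - k * k)
    return a

def lp_residual(frame, order=12):
    """Whiten one frame: e(n) = x(n) - sum_p a_p x(n-p), as in Eq. (4)."""
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    a = levinson_durbin(r, order)
    pred = np.zeros_like(frame)
    for p in range(1, order + 1):
        pred[p:] += a[p - 1] * frame[:-p]
    return frame - pred
```

For a strongly predictable signal (e.g. a decaying exponential, an AR(1) process), the residual energy collapses to almost nothing, which is exactly the behaviour Figures 1-2 illustrate for voiced speech.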
The centroid of the LPR in the lth frame can be written as:

C(l) = Σ_{n=0}^{N-1} n |x̃_l(n)| / Σ_{n=0}^{N-1} |x̃_l(n)|    (5)

Centred on the centroid C(l), the minimum time length B(l) that contains E% of the total energy is given by:

B(l) = min { v : Σ_{n=C(l)-v}^{C(l)+v} |x̃_l(n)| / Σ_{n=0}^{N-1} |x̃_l(n)| ≥ E% }    (6)

where E is recommended to lie between 75 and 95; E = 90 is chosen in this paper. Our studies indicate that B(l) is small under H1(l), which follows from the fact that the energy of transient noise is concentrated in a small range. To improve the detection probability, we introduce a weighting window function w(n), and (5) can be rewritten as:
C(l) = Σ_{n=0}^{N-1} n w(n) |x̃_l(n)| / Σ_{n=0}^{N-1} w(n) |x̃_l(n)|    (7)

where w(n) is a Hanning window in practice. Our studies indicate that an appropriate choice of w(n) and overlap length concentrates the energy, which helps to detect the transient noise. However, we find that speech phoneme onsets, which are characterized by sudden bursts, also concentrate their energy in a small range. To solve this problem, we propose adding some stationary noise to the original signal, which masks the phoneme-onset components but does not mask the voiced speech or the transient noise. Note that even if speech is erroneously detected as transient noise, the packet loss concealment techniques introduced in the following section can be used to reconstruct it. With these protective measures, the detection criterion is:

B(l) ≥ C_th : accept H0(l)
B(l) < C_th : accept H1(l)    (8)

where C_th is a threshold that depends on the frame length and the type of transient noise. Extensive experiments show that C_th = 15 is a good choice when the frame length is 512.

3.3 LPR-based method simulation

In this part, we show the validity of the LPR-based method. A speech signal corrupted by transient noise is used for simulation and the results are shown in Figure 3, where the dashed line represents the threshold C_th. The results indicate that the LPR-based method can detect transient noise effectively, even when speech and transients are present simultaneously.

Figure 3: Simulation of the transient noise detection.

4. TNR based on speech reconstruction

Traditional TNR algorithms cannot eliminate transient noise completely in practice. Unfortunately, the human auditory system is sensitive to the residual transient noise.
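Pulling Section 3 together before moving on, the centroid-based test of Eqs. (5)-(8) can be sketched as follows. `detect_transient` is a hypothetical helper; it uses squared magnitude as "energy" in the window search, which Eq. (6) leaves implicit, and the default threshold is the paper's C_th = 15 under these assumptions.

```python
import numpy as np

def detect_transient(residual, energy_frac=0.90, c_th=15):
    """Centroid-based transient test on one LP-residual frame (Eqs. 5-8).

    Returns (is_transient, B), where B is the width of the smallest
    window, centred on the weighted centroid C(l) of Eq. (7), that
    holds `energy_frac` of the frame energy (E = 90 in the paper).
    """
    n = len(residual)
    w = np.hanning(n)                      # weighting window of Eq. (7)
    mag = w * np.abs(residual)
    c = int(round(float(np.sum(np.arange(n) * mag) / np.sum(mag))))
    total = np.sum(residual ** 2)
    for v in range(n):                     # grow the window around C(l)
        lo, hi = max(0, c - v), min(n, c + v + 1)
        if np.sum(residual[lo:hi] ** 2) >= energy_frac * total:
            b = 2 * v + 1
            return b < c_th, b             # decision rule of Eq. (8)
    return False, n
```

An isolated impulse (a whitened transient) yields B of a few samples, while a residual whose energy is spread across the frame (unvoiced speech, or the impulse train of voiced speech) yields B comparable to the frame length, matching the separation shown in Figure 3.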
Vaseghi and Rayner proposed a method for removing impulsive noise and reconstructing speech with an interpolation algorithm [3]. In this paper, we instead replace the frames that contain transient noise with speech reconstructed by packet loss concealment techniques. Since the duration of transient noise is usually less than 50 ms, once a frame is detected to contain transient noise, this frame and its two successive frames are discarded to ensure that the transient noise is totally eliminated. Various packet loss concealment techniques can be used to generate approximations of the discarded frames, such as waveform substitution [9] and the waveform similarity overlap-add (WSOLA) algorithm [10]. In this paper, we apply the two-side pitch waveform replication (PWR) technique [11] to reconstruct the speech of the discarded frames.
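The frame-discard rule above can be sketched as follows; `frames_to_discard` is a hypothetical helper operating on the per-frame detection flags produced by the detector of Section 3.

```python
def frames_to_discard(flags, extra=2):
    """Mark each flagged frame plus its `extra` successors for removal,
    so that a transient straddling frame boundaries is fully discarded
    before reconstruction."""
    discard = set()
    for l, hit in enumerate(flags):
        if hit:
            discard.update(range(l, min(len(flags), l + 1 + extra)))
    return sorted(discard)
```

The returned indices are then filled in by the packet loss concealment of Section 4.2.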
4.1 Pitch detection

The pitch period of each frame can be estimated by computing the normalized autocorrelation of the signal and searching for the lag that maximizes it [12], i.e.,

C_nac(τ) = Σ_{n=0}^{L-1} x(n) x(n+τ) / sqrt( Σ_{n=0}^{L-1} x(n)^2 · Σ_{n=0}^{L-1} x(n+τ)^2 ),  τ = τ_min, ..., τ_max    (9)

L = τ,      τ_min < τ ≤ N/2
L = N - τ,  N/2 < τ ≤ τ_max    (10)

where L is the correlation size and τ_min and τ_max are the minimum and maximum pitch periods, respectively. The estimated pitch period is then used to reconstruct the speech. More accurate pitch estimation methods are beyond the scope of this paper.

4.2 Speech reconstruction

Depending on whether the forward (previous) frame and the backward (next) frame are voiced, we consider four conditions [11]: both voiced (BV), only the previous frame voiced (PV), only the next frame voiced (NV) and both unvoiced (BU). The reconstruction methods for the four conditions are detailed below.

4.2.1 Both voiced condition

For the BV condition, an algorithm based on phase synchronization and pitch adjustment is used to reconstruct the discarded frames [13]. Let the pitch period of the forward frame be P_f and that of the backward frame be P_b. In the forward frame, the P_f samples nearest to the discarded frames are chosen as the previous pitch waveform (PPW). In the backward frame, the P_b samples nearest to the discarded frames are chosen as the next pitch waveform (NPW). Assuming that there are r samples to be reconstructed, the number of reconstructed pitch waveforms (RPWs) is N_p, given by:

N_p = round( (round(r/P_f) + round(r/P_b)) / 2 )    (11)

In general, P_f is not equal to P_b, so the length of each RPW is different.
For instance, if P_f < P_b, the length of the ith RPW is given by:

P_i = P_f + round( (P_b - P_f) / (N_p + 1) · i ),  i = 1, 2, ..., N_p    (12)

If r ≠ P_1 + P_2 + ... + P_{N_p}, the P_i should be slightly adjusted so that r = P_1 + P_2 + ... + P_{N_p}. To obtain the ith RPW with length P_i, an interpolation method is applied to modify the PPW into a modified PPW with P_i samples, referred to as PPW_m^i. Likewise, the NPW is modified into NPW_m^i with P_i samples. The ith RPW can then be written as:

RPW_i(k) = w_f^i(k) PPW_m^i(k) + w_b^i(k) NPW_m^i(k),  k = 1, 2, ..., P_i    (13)

w_f^i(k) = (r - g)/r,  w_b^i(k) = g/r,  g = P_1 + P_2 + ... + P_{i-1} + k    (14)

where w_f^i(k) and w_b^i(k) are gain patterns used to adjust the contributions of the forward and backward components. Concatenating all the RPWs yields the reconstructed speech.
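A minimal sketch of Eqs. (11)-(14) follows, assuming linear interpolation as the waveform-modification method (the paper does not specify which interpolation it uses); `two_side_pwr` and `_resample` are hypothetical names.

```python
import numpy as np

def _resample(wave, length):
    """Stretch/compress one pitch waveform to a new length (linear interp)."""
    pos = np.linspace(0.0, len(wave) - 1.0, length)
    return np.interp(pos, np.arange(len(wave)), wave)

def two_side_pwr(ppw, npw, r):
    """Fill r samples between a voiced forward frame (pitch waveform ppw,
    length P_f) and a voiced backward frame (npw, length P_b), Eqs. 11-14."""
    pf, pb = len(ppw), len(npw)
    n_p = max(1, int(round((round(r / pf) + round(r / pb)) / 2.0)))  # Eq. (11)
    # RPW lengths move from P_f toward P_b (Eq. 12), trimmed to sum to r.
    lens = [int(round(pf + (pb - pf) * i / (n_p + 1.0)))
            for i in range(1, n_p + 1)]
    lens[-1] += r - sum(lens)
    out, g = [], 0
    for li in lens:
        head = _resample(ppw, li)            # PPW_m^i
        tail = _resample(npw, li)            # NPW_m^i
        k = np.arange(1, li + 1)
        wb = (g + k) / float(r)              # w_b grows across the gap, Eq. (14)
        out.append((1.0 - wb) * head + wb * tail)   # crossfade, Eq. (13)
        g += li
    return np.concatenate(out)
```

The linear gains make the fill start as a copy of the forward pitch waveform and end as a copy of the backward one, so the reconstruction joins both frames smoothly.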
4.2.2 Other conditions

For the PV, NV and BU conditions, a simple recovery approach [11] is used to reconstruct the discarded frames. For the PV condition, the last pitch segment of the forward frame is repeated to fill the region of the discarded frames, and gain patterns are used to adjust the amplitude. A similar method is used for the NV condition. In the BU condition, the rear half of the forward frame and the first half of the backward frame are respectively extended across the discarded frames, again with amplitude adjustment.

4.3 Simulation

In this part, different types of speech signals are used for simulation. The waveforms of the original signals are shown in Figure 4(a)-(c) and the waveforms of the reconstructed signals are shown in Figure 4(d)-(f). Note that in Figure 4(d)-(f) the dashed lines represent the forward and backward frames while the solid lines represent the reconstructed frames. The results show that the two-side PWR algorithm can reconstruct speech effectively without significant distortion.

Figure 4: Waveforms of the original and the reconstructed signals.

5. Experiments

In this section, experimental results are given to show the validity of the proposed algorithm. In the first part, speech corrupted by mouse clicks is used for simulation to illustrate the validity of the proposed algorithm. In the second part, two objective measures are applied to compare the proposed algorithm with the traditional ENV-TNR algorithm [14].

5.1 Validity of the proposed algorithm

In this part, a transient-noise-corrupted speech signal sampled at 16 kHz is used to show the validity of the proposed algorithm, and the results are shown in Figure 5.
Our experiments show that the proposed algorithm can detect the transient noise accurately and suppress it effectively without introducing audible speech distortion.

5.2 Quantitative results

In this part, quantitative results for the perceptual evaluation of speech quality (PESQ) and the amount of noise reduction are given to show the validity of the proposed algorithm. Both keyboard typing noise and mouse clicking noise are used to compare the proposed algorithm with
the traditional ENV-TNR algorithm. The comparison results are presented in Table 1. The table demonstrates that the proposed algorithm can reduce the transient noise and improve the PESQ simultaneously. This is because the proposed algorithm eliminates the transient noise completely and reconstructs the speech effectively without significant speech distortion.

Figure 5: Waveforms of (a) clean speech, (b) noisy speech, (c) enhanced speech, and speech spectrograms of (d) clean speech, (e) noisy speech, (f) enhanced speech.

Table 1: Comparison results of the amount of noise reduction and the PESQ.

Noise Type        Noise Reduction [dB]        PESQ
                  ENV-TNR    Proposed         Noisy    ENV-TNR    Proposed
Keyboard Typing   18.1       3.64             1.21     1.6        2.19
Mouse Clicking    6.32       4.12             1.76     1.59       2.65

6. Conclusion

This paper proposes a new LPR-based transient noise detection method and a new transient noise reduction algorithm based on speech reconstruction. Compared with traditional TNR algorithms, the proposed algorithm can eliminate transient noise completely without introducing audible speech distortion, even when voiced speech and transient noise are present simultaneously. Experimental results verify the validity of the proposed algorithm in reducing transient noise and improving speech quality. Future work will concentrate on improving the transient noise detection method and reconstructing the speech more accurately to further avoid audible speech distortion.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (NSFC) under Grants No. 612143 and No. 6132126. This work was also supported in part by the tri-networks integration project under No.
KGZD-EW-13-5(3).

REFERENCES

1 Talmon, R., Cohen, I. and Gannot, S. Single-Channel Transient Interference Suppression With Diffusion Maps, IEEE Transactions on Audio, Speech, and Language Processing, 21(1), 132-144, January, (2013).
2 Talmon, R., Cohen, I. and Gannot, S. Transient Noise Reduction Using Nonlocal Diffusion Filters, IEEE Transactions on Audio, Speech, and Language Processing, 19(6), 1584-1599, August, (2011).

3 Vaseghi, S. V. and Rayner, P. J. W. Detection and Suppression of Impulsive Noise in Speech Communication Systems, IEE Proceedings I (Communications, Speech and Vision), 137(1), 38-46, February, (1990).

4 Nongpiur, R. C. Impulse Noise Removal in Speech Using Wavelets, Proceedings of the 2008 IEEE International Conference on Acoustics, Speech, and Signal Processing, Las Vegas, USA, 1593-1596, April, (2008).

5 Zheng, C. S., Chen, X. L., Wang, S. W., Peng, R. H. and Li, X. D. Delayless Method to Suppress Transient Noise Using Speech Properties and Spectral Coherence, Proceedings of the 135th Audio Engineering Society Convention, New York, USA, 17-20 October, (2013).

6 Hu, X. H., Wang, S. W., Zheng, C. S. and Li, X. D. A Cepstrum-based Preprocessing and Postprocessing for Speech Enhancement in Adverse Environments, Applied Acoustics, 74(12), 1458-1462, December, (2013).

7 Wang, J., Liu, H., Zheng, C. S. and Li, X. D. Spectral Subtraction Based on Two-Stage Spectral Estimation and Modified Cepstrum Thresholding, Applied Acoustics, 74(3), 450-458, March, (2013).

8 Zheng, C. S., Yang, H. F. and Li, X. D. On Generalized Auto-Spectral Coherence Function and Its Applications to Signal Detection, IEEE Signal Processing Letters, 21(5), 559-563, May, (2014).

9 Goodman, D. J., Lockhart, G. B., Wasem, O. J. and Wong, W. C. Waveform Substitution Techniques for Recovering Missing Speech Segments in Packet Voice Communications, IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-34(6), 1440-1448, December, (1986).

10 Verhelst, W. and Roelands, M.
An Overlap-Add Technique Based on Waveform Similarity (WSOLA) for High Quality Time-Scale Modification of Speech, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Minneapolis, MN, USA, 2, 554-557, April, (1993).

11 Liao, W. T., Chen, J. C. and Chen, M. S. Adaptive Recovery Techniques for Real-Time Audio Streams, Proceedings of the 20th Annual Joint Conference of the IEEE Computer and Communications Societies, Anchorage, AK, 2, 815-823, April, (2001).

12 Medan, Y., Yair, E. and Chazan, D. Super Resolution Pitch Determination of Speech Signals, IEEE Transactions on Signal Processing, 39(1), 40-48, January, (1991).

13 Li, Z. B., Zhao, S. H., Wang, J. and Kuang, J. M. A Side Information Based Packet Loss Recovery Algorithm in VoIP, Congress on Image and Signal Processing 2008, Sanya, China, 5, 139-144, May, (2008).

14 Manohar, K. and Rao, P. Speech Enhancement in Nonstationary Noise Environments Using Noise Properties, Speech Communication, 48(1), 96-109, January, (2006).