Speech Enhancement Using Wiener Filtering

S. Chirtmay and M. Tahernezhadi
Department of Electrical Engineering
Northern Illinois University
DeKalb, IL 60115

ABSTRACT

The problem of reducing the disturbing effects of additive white noise on a speech signal is considered when a noise reference is not available. Wiener filtering with all-pole modeling built upon line spectral pair (LSP) frequencies is considered. The filter parameters have been optimized to achieve the highest reduction of noise. The noise is filtered using an iterative LSP-based estimation of the LPC parameters. The speech model filter uses an accurate, updated estimate of the current noise power spectral density with the aid of a voice activity detector.

I. INTRODUCTION

The problem examined here is the enhancement of speech disturbed by additive noise. The basic assumption is that the enhancement system does not have access to any signal other than the corrupted speech itself. That is, no noise-reference signal is available that would allow one to employ classical adaptive noise canceling [1]. The objective of obtaining higher quality and/or intelligibility of the noisy speech may have a fundamental impact on applications such as speech compression, speech recognition, and speaker verification, by improving the performance of the relevant digital voice processor. The technique considered in this paper is based on the all-pole model of the vocal tract and uses the estimated coefficients to process the noisy speech with a Wiener filter. It is a new and improved iterative speech enhancement technique based on spectral constraints. The iterative technique, originally formulated by Lim and Oppenheim [2], attempts to solve for the maximum likelihood estimate of a speech waveform in additive white noise using linear predictive coding (LPC). The LPC parameters are thus estimated from the output of the Wiener filter.
The LPC model poles produced by this estimation are complex numbers in the z-plane, and complex pole locations cannot be directly associated across frames for interframe smoothing, so the LPC poles are represented as line spectral pair (LSP) frequencies. Interframe spectral constraints are applied to the LSP parameters across time on a fixed-frame basis. These constraints ensure that the vocal tract characteristics do not vary wildly from frame to frame when speech is present. This method allows constraints to be efficiently applied to speech model pole movements across time so that formants lie along smooth tracks. The pole positions of an Nth-order LPC model are equivalently represented by a set of N/2 LSP position roots and N/2 difference roots. The position roots (P) and the difference roots (Q) represent lossless models of the vocal tract with the glottis closed and open, respectively; they lie on the unit circle in the complex z-plane. The likely formant locations in the signal's LPC model spectrum are highly correlated with the LSP position roots, and the bandwidths of the LPC spectrum at these formants are highly correlated with the LSP difference roots. For a stable LPC model there is a root at z = -1 for P and at z = +1 for Q, and the P and Q roots alternate around the unit circle. At each iteration, each LPC pole of the speech estimate is smoothed across frames using the equivalent LSP roots. A lower bound on the minimum distance of a difference root to the adjacent position root is applied to restrain the sharpness of any formant of the LPC model to be speech-like. Here, one future frame and one past frame are used for smoothing in a particular iteration. The smoothed LSP roots are then converted back to smoothed LPC parameters, and the smoothed LPC model power spectrum together with the current noise power estimate is used to form the next iteration of the Wiener filter.
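The LPC-to-LSP conversion described above can be sketched as follows. This is an illustrative implementation, not the authors' code: the function names and the three-frame smoothing weights are assumptions, but the P/Q construction (sum and difference polynomials with trivial roots at z = -1 and z = +1) is the standard LSP definition used in the text.

```python
import numpy as np

def lpc_to_lsp(a):
    """Convert LPC predictor coefficients a_1..a_N of
    A(z) = 1 - sum_i a_i z^-i into LSP frequencies.

    Returns the angles in (0, pi) of the P (position) and Q (difference)
    roots, excluding the trivial roots at z = -1 (P) and z = +1 (Q).
    """
    A = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    Ap = np.concatenate((A, [0.0]))   # raise degree to N+1
    P = Ap + Ap[::-1]                 # symmetric polynomial, root at z = -1
    Q = Ap - Ap[::-1]                 # antisymmetric polynomial, root at z = +1
    # Divide out the trivial roots, then keep the upper-half-plane angles.
    p = np.angle(np.roots(np.polydiv(P, [1.0, 1.0])[0]))
    q = np.angle(np.roots(np.polydiv(Q, [1.0, -1.0])[0]))
    p = np.sort(p[(p > 0) & (p < np.pi)])
    q = np.sort(q[(q > 0) & (q < np.pi)])
    return p, q

def smooth_lsp(prev, cur, nxt, w=(0.25, 0.5, 0.25)):
    """Three-frame LSP smoothing (one past, current, one future frame).

    The weights are hypothetical; the paper only states that one past and
    one future frame are used.
    """
    return w[0] * np.asarray(prev) + w[1] * np.asarray(cur) + w[2] * np.asarray(nxt)
```

For a stable second-order predictor such as `a = [1.2, -0.5]`, the single P angle falls below the single Q angle, illustrating the root alternation around the unit circle that the constraints rely on.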
The output from the previous Wiener filter iteration is used along with the original input data to obtain a less muffled-sounding speech estimate, with a tradeoff of slightly increased residual noise in the output. When the noisy signal is initially presented to the Wiener filter input, the fast Fourier transform (FFT) of the signal is also presented to the voice activity detector (VAD).

II. ALGORITHM

Over a given frame of speech, the signal is assumed to obey an all-pole model characterized by its LP coefficients. The method of Lim and Oppenheim is based on maximum a posteriori (MAP) estimation of the LP coefficients, gain, and noise-free speech. The method is an iterative one in which the LP parameters and speech frame are repeatedly reestimated. It is assumed that all unknown
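The reuse of the original input alongside the previous iteration's output can be sketched as below. The paper does not give the exact combination rule, so the linear mixing form and the value of `alpha` are assumptions made purely for illustration of the stated tradeoff.

```python
import numpy as np

def reinject_input(y, s_prev, alpha=0.3):
    """Mix the original noisy frame y back into the previous Wiener
    output s_prev before the next iteration.

    A larger alpha reduces muffling but leaves more residual noise;
    this mixing rule and alpha = 0.3 are illustrative assumptions only.
    """
    return alpha * np.asarray(y, dtype=float) + (1.0 - alpha) * np.asarray(s_prev, dtype=float)
```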
parameters are random with a priori Gaussian pdfs. The resulting MAP estimator, which maximizes the conditional pdf of the parameters given the observations, corresponds to the solution of a set of nonlinear equations for the additive white Gaussian noise (AWGN) case. In the noisy case, the estimator requires that a_k, g_k, and s_k be chosen to maximize the pdf p(a_k, g_k, s_k | y). Essentially, we wish to perform joint MAP estimation of the LP speech modeling parameters and the noise-free speech by maximizing the conditional density p(a_k, s_k | y, g_k, sigma_n^2), where the gain g_k and the noise variance sigma_n^2 are assumed to be known (or estimated). Lim and Oppenheim consider a suboptimum solution using sequential MAP estimation of s_k, followed by MAP estimation of a_k and g_k given s_k. The sequential estimation procedure is linear at each iteration and continues until some criterion is satisfied. With further simplifying assumptions, it can be shown that MAP estimation of s_k is equivalent to noncausal Wiener filtering of the noisy speech y. Lim and Oppenheim showed that this technique, under certain conditions, increases the joint likelihood of a_k and s_k with each iteration, and it can also be shown to be the optimal solution in the mean-square-error (MSE) sense for the Gaussian case. The input noisy signal is digitized at a rate of 8 kHz, and the time series is processed in frames of 256 samples (32 ms). The noise spectral density, or the noise variance for the white Gaussian case, must be estimated during non-speech activity. Each iteration proceeds in two steps:

Step 1. Estimate a_k from s_k using either (a) the first M values as the initial condition vector, or (b) a zero initial condition s_k = 0.

Step 2. Estimate s_k given the present estimate a_k.
Given the estimated a_k, the speech spectrum is obtained from the LPC model, and the resulting equation for estimating the noise-free speech is simply the optimum Wiener filter

    H_k(w) = P_s,k(w) / [P_s,k(w) + P_n(w)],    P_s,k(w) = g_k^2 / |1 - sum_{i=1}^{M} a_{i,k} e^{-jwi}|^2,

where the index k indicates the iteration number, P_s,k(w) is the LPC model power spectrum of the current speech estimate, and P_n(w) is the noise power spectrum. If the Gaussian assumption holds, this is the optimum processor in the MSE sense; if it does not, this filter is still the best linear processor for obtaining the next speech estimate. Sequential MAP estimation of the LP parameters and the speech frame thus alternates between the two steps above: the first step is performed via LP parameter estimation and the second through adaptive Wiener filtering. The final implementation of the algorithm is presented below.
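One iteration of the procedure can be sketched as follows. This is a minimal illustration of the two-step loop, not the authors' implementation: the function names are made up, the LP fit uses the standard autocorrelation method, and the noise is assumed white so that P_n(w) reduces to a constant variance supplied by the VAD.

```python
import numpy as np

def lpc_autocorr(s, order):
    """LP coefficients a_1..a_M and squared gain via the autocorrelation method."""
    n = len(s)
    r = np.correlate(s, s, mode="full")[n - 1:n + order]   # lags 0..order
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    g2 = r[0] - a @ r[1:order + 1]
    return a, g2

def wiener_iteration(y, s_est, noise_var, order=10):
    """One noncausal Wiener filtering iteration: fit an LPC model to the
    current speech estimate, form the model spectrum P_s(w), and apply the
    zero-phase gain P_s / (P_s + noise_var) to the noisy frame y."""
    a, g2 = lpc_autocorr(s_est, order)
    n = len(y)
    w = 2 * np.pi * np.arange(n) / n
    # A(e^jw) = 1 - sum_i a_i e^{-jwi};  P_s(w) = g^2 / |A|^2
    A = 1.0 - sum(a[i] * np.exp(-1j * w * (i + 1)) for i in range(order))
    Ps = g2 / np.abs(A) ** 2
    H = Ps / (Ps + noise_var)              # Wiener gain, 0 < H < 1
    return np.real(np.fft.ifft(H * np.fft.fft(y)))
```

Because the gain is strictly below one at every frequency, each pass can only remove energy from the frame; in practice the attenuation concentrates where the LPC model spectrum is weak, i.e. away from the formants.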
Voice Activity Detector

The most critical component of the system is the VAD. A VAD operating in a mobile environment must be able to detect speech in the presence of a wide range of very diverse types of acoustic background noise, and the biggest difficulty is detecting speech at very low signal-to-noise ratio (SNR). Thus, a VAD based on the spectral characteristics of the input signal is used in this paper; its block diagram is shown in Fig. 1. It incorporates an inverse filter, the coefficients of which are derived during noise-only periods. When speech is present, the noise is attenuated by the filter, leaving only speech. The energy of the inverse-filtered signal is compared to a threshold which is updated only during noise-only periods. This threshold rides above the energy of the noise signal after it has been filtered; if the energy exceeds the threshold, speech is detected. Some variables need to be updated only when noise is present, but it is obviously dangerous to use the output of the VAD to decide when to update them, because this output is itself a function of these variables. For this purpose a secondary VAD is used. The secondary VAD detects only noise periods and does not endpoint the speech. It makes its decision based on the observation that if the frames have a similar spectral shape for a long period of time, the input is either sustained speech or noise. Therefore, if the distortion between frames is below a fixed threshold for a sufficiently long period of time, it is assumed that noise has been detected, unless a steady pitch component has been detected, in which case the input was probably a vowel sound. Another criterion is used as a fail-safe whenever it is not possible to distinguish unvoiced speech from noise: if the VAD detects a noise frame in between six speech frames, or immediately before three speech frames, or immediately after three speech frames, that frame is considered to be unvoiced and is finally classified as a speech frame.

III. SIMULATION RESULTS

The purpose of the computer simulation is to test the performance of the above technique. The first step is to test the performance of the VAD, which must distinguish whether the current frame is pure noise or noisy speech; when the SNR is very low, it is hard to tell noise frames and unvoiced frames apart. We use the sentence "Don't ask me to carry an oily rag like that" with added noise to test the simulation programs, and we also compare the results at different SNRs. For larger SNR, the processed speech signal has better quality. Figure 3 shows the output for input noisy speech at an SNR of 10 dB. We have also performed speech enhancement at an SNR of about 5 dB, but as the SNR becomes considerably low it is impossible to distinguish between unvoiced speech and noise, so this algorithm does not work at very low SNR. Moreover, an overall SNR of 5 dB or lower implies a very low SNR for the unvoiced portions of speech; this aspect also needs to be considered in future study. The table below shows the input SNR and output SNR for voiced, unvoiced, and noise frames; the enhancement is about 7-8 dB.
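The primary VAD decision and the fail-safe relabeling rule described above can be sketched as follows. The function names and the exact hangover logic are assumptions; in particular, the relabeling here approximates the stated rule by reclassifying a noise-labeled frame as speech when it is immediately preceded or followed by three speech frames.

```python
import numpy as np

def primary_vad(frame, inv_coeffs, threshold):
    """Primary VAD decision: inverse-filter the frame with coefficients
    derived during noise-only periods, then compare the residual energy
    to a threshold that is updated only during noise."""
    residual = np.convolve(np.asarray(frame, dtype=float), inv_coeffs, mode="same")
    return float(np.sum(residual ** 2)) > threshold

def failsafe_relabel(labels):
    """Fail-safe rule (approximate sketch): a frame labeled as noise that is
    directly preceded or followed by three speech frames is assumed to be
    unvoiced speech and relabeled as speech. `labels` is a list of booleans,
    True meaning speech."""
    out = list(labels)
    for i, is_speech in enumerate(labels):
        if not is_speech:
            prev3 = labels[max(0, i - 3):i]
            next3 = labels[i + 1:i + 4]
            if (len(prev3) == 3 and all(prev3)) or (len(next3) == 3 and all(next3)):
                out[i] = True
    return out
```

With this rule an isolated noise decision inside a run of speech frames is absorbed into the speech region, which is the fail-safe behavior the text asks for when unvoiced speech cannot be separated from noise.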
Fig. 2 - Table of speech enhancement results.

Fig. 3 - Original speech and the enhanced speech.

REFERENCES

1) J. H. L. Hansen and M. A. Clements, "Constrained Iterative Speech Enhancement with Application to Automatic Speech Recognition," IEEE, 1988.
2) J. S. Lim and A. V. Oppenheim, "Enhancement and Bandwidth Compression of Noisy Speech," Proc. IEEE (invited paper), 1979.
3) J. H. L. Hansen and L. M. Arslan, "Robust Feature-Estimation and Objective Quality Assessment for Noisy Speech Recognition Using the Credit Card Corpus," IEEE Trans. Speech and Audio Processing, vol. 3, no. 3, May 1995.
4) W. Wynn, "Transmitted Noise Reduction in Communications Systems," Patent Cooperation Treaty.