

IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 12, NO. 1, JANUARY 2004

Signal Modification for Robust Speech Coding

Nam Soo Kim, Member, IEEE, and Joon-Hyuk Chang, Member, IEEE

Abstract: The performance of a low-bit-rate speech coder usually degrades seriously in the presence of interfering signals such as background noise, acoustic echo, co-talkers' speech, and other unwanted signals. This degradation comes from the mismatch between the input signal and the assumed speech production model on which the design of the given speech coder is based. In this paper, we present an approach to modify the input signal such that it can be coded more effectively within the generalized analysis-by-synthesis framework. Signal modification in the presented approach is performed according to a criterion that makes a compromise between the modification and coder quantization errors. The coder-decoder (CODEC) characteristic is described in terms of a transfer matrix, and an on-line method using the recursive least squares (RLS) technique is proposed to estimate it. Since each part of the speech signal is affected differently by the modification, we also devise an adaptive method based on the signal-to-quantization noise ratio (SQNR). In contrast to conventional modification techniques, our approach can be implemented as a simple front-end for any analysis-by-synthesis type coder.

Index Terms: Low-bit-rate speech coding, signal modification.

I. INTRODUCTION

IN RECENT YEARS, there has been a great deal of interest in low-bit-rate speech coding techniques [1]. For an efficient use of the limited bandwidth resources, it is indispensable to describe the speech signal with minimal bits. Low-bit-rate speech coding has become available due to a simplified speech production modeling in which the vocal tract and the excitation are treated separately. Each coding technique is classified according to how it characterizes the excitation signal, while linear prediction analysis is usually
applied to express the vocal tract. In code excited linear prediction (CELP) speech coders, the excitation is selected from a collection of codeword vectors [2]-[4]. Other successful approaches are based on a parametric representation of the excitation signal such as the sinusoidal model and waveform interpolation [5]-[8]. In the sinusoidal model, the excitation signal is described in terms of a number of sine waves where the frequency, amplitude, and phase of each sine wave are estimated or predicted from the input speech [5]. In contrast, the excitation signal in the waveform interpolation approach is reconstructed by interpolating a set of pitch cycle waveforms, referred to as the characteristic waveforms, based on the assumption that their general shape evolves slowly [8].

Manuscript received January 21, 2002; revised August 8, 2003. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Peter Vary. The authors are with the School of Electrical Engineering and INMC, Seoul National University, Seoul, Korea (e-mail: nkim@snu.ac.kr). Digital Object Identifier /TSA

Since the generation of speech signals should be represented in terms of a simple parametric model in order to achieve high coding gain with a limited bit budget, a robust way to estimate the relevant parameters is required. The criterion usually adopted for parameter estimation aims at minimizing the waveform matching error. Due to the difficulty in obtaining an optimal solution analytically, we apply the closed loop analysis technique, in which all the possible parameter values are tried to reconstruct the original signal and the one that minimizes the matching error is selected as the optimal solution. This closed loop analysis technique is referred to as the analysis-by-synthesis approach, and almost all the low-bit-rate speech coders in use today are based on it [1]. In [6], speech is represented as a sum of harmonic sine waves and the fundamental frequency
is searched so as to minimize the mean squared error between the original and synthetic spectra. In CELP-based coders, on the other hand, all the fixed-codebook entries are used to form the excitation signal, and the one that results in the smallest mean squared difference between the input and synthesized signals is selected [2]-[4]. Furthermore, it has recently been reported that the analysis-by-synthesis vector quantization of the rapidly evolving waveform (REW) parameters enhances the performance of a waveform interpolation coder [9]. When measuring the waveform matching error, a perceptual weighting filter is usually applied to make the error better matched to the human auditory characteristic known as the masking effect [10]. Due to the incorporation of the perceptual weighting filter, the quantization noise is shaped such that it becomes minimally audible.

In general, the performance of a low-bit-rate speech coder degrades seriously under adverse environments. Undesired distortions are frequently observed in the reconstructed signal in the presence of background noise, acoustic echo, music, or interfering speakers' speech. This can be considered from two aspects. First, the interfering signals cannot be coded effectively by a coder that is conventionally designed on the basis of a simplified human speech production model. Second, the presence of interfering signals makes it difficult to obtain an exact estimate of the parameters and is likely to lead to a bad solution. The distortion can be somewhat mitigated under the analysis-by-synthesis framework due to its waveform matching property. However, a number of codebooks used in the coder are trained on a large amount of speech data, and the ranges for parameter search are specified to fit pure speech signals. Therefore, deviation of the input signal from the assumed speech production model is still a major cause of performance degradation even in analysis-by-synthesis coders.

A
straightforward way to reduce the unwanted distortions is to employ a speech enhancement technique such as spectral subtraction, Kalman filtering, or a model-based enhancement technique [11]-[14]. Speech enhancement algorithms increase the signal-to-noise ratio (SNR) and can be used as a signal pre-processor for the given

speech coder. Even though these enhancement techniques have been found effective in the presence of stationary background noise, they are not capable of handling interfering signals such as acoustic echoes, music, or co-talkers' speech, and they are usually developed irrespective of the speech coder characteristic.

An alternative approach is the generalized analysis-by-synthesis technique, where the original input speech signal is modified such that it can be coded more effectively [1]. In a generalized analysis-by-synthesis approach, it is important that the modification should not bring about any audible distortions. In [15], the input speech signal is modified to have a smooth time-delay trajectory, which saves the bits required for quantizing the pitch periods in each frame and incurs errors of no perceptual importance. Another pre-processing algorithm for CELP coders is proposed in [16], where the input signal is perturbed such that the perturbed signal is subjectively indistinguishable from the original but the prediction gain can be maximized with the quantized linear prediction coefficients.

In this paper, we present an approach to modify the signal applied to a certain speech coder as an input. A criterion which compromises between two types of distortions, i.e., the modification error and the quantization noise, is introduced. Optimization of the given problem becomes possible through an approximation of the input-output characteristic provided by a coder-decoder (CODEC) pair with a simple transfer matrix. Estimation of the transfer matrix is treated as a system identification problem, and we employ the well-known least squares technique, which can be implemented in either the batch or sequential mode. This idea was originally proposed in [17], and this paper gives an in-depth description as well as a thorough analysis of the approach. Moreover, an adaptive way of
controlling the two kinds of errors is proposed based on extensive experiments on performance evaluation. In contrast to conventional generalized analysis-by-synthesis techniques, in which the input modification is tightly coupled to the inherent CODEC characteristic, our approach can be implemented as a simple front-end for any analysis-by-synthesis type coder.

The organization of this paper is as follows. Following this section, a detailed description of the signal modification approach is given in Section II. The least squares algorithm for the transfer matrix estimation and an on-line implementation of the presented approach are given in Sections III and IV, respectively. In Section V, a number of experiments are conducted to evaluate the performance, and finally, in Section VI, some concluding remarks are drawn.

II. SIGNAL MODIFICATION AS A CODEC FRONT-END

Fig. 1. Overall structure of the generalized analysis-by-synthesis speech coder.

Fig. 1 shows the overall structure of a speech coder built on the basis of the generalized analysis-by-synthesis paradigm. Analysis of the input signal and quantization of the relevant parameters are carried out in the speech coder, and the quantized parameters are transformed into a bit stream and then transmitted to the receiver. As in most analysis-by-synthesis speech coding techniques, these quantized parameters are also passed through the decoder in order to reconstruct the original signal. This CODEC structure forms a system for which the ideal transfer function should be the identity mapping, implying that the original input signal can be perfectly reconstructed without any quantization error. Under the generalized analysis-by-synthesis paradigm, the input signal is modified before being fed to the coder so that it can be reconstructed at the receiver side with minimal distortion. In this approach, it is crucial to make the modified signal nearly the same as the original input speech in terms of human auditory perception. For instance, it is well known that the human auditory system is insensitive to a degree of time-delay difference.

In this section, we present an approach to modify the input signal. For this, we treat the CODEC as if it were a black box and take advantage of its input-output characteristics. Let the input to the speech coder be a column vector that represents a frame of speech samples. Even though a time-domain approach is possible, it is found beneficial to modify the signal in the transform domain. In this paper, we use both the discrete Fourier transform (DFT) and the discrete cosine transform (DCT) for the signal representation in the transform domain. The input frame is transformed into a vector of transform coefficients. With the DFT, the coefficients are given by (1), in which the frame length is assumed to be an even number. Because the frame is real-valued, the effective number of DFT coefficients is half the frame length plus one, and each coefficient is a complex number except for the first and the last, which are real. The inverse DFT (IDFT) is given by (2). On the other hand, if we apply the DCT, the coefficients are given by (3),

where all the coefficients are real numbers and their number equals the frame length. The inverse DCT (IDCT) is given by (4).

In the presented approach, prior to applying the input frame to the encoder, we modify it such that the modified vector can better fit the speech coder. Consider the signal samples obtained by modifying the input frame, and the output vector that is produced when the modified frame is applied to the coder and then re-synthesized in the decoder, together with their transform domain representations. Without loss of generality, we assume the input-output relation (5), in which an augmented input vector is passed through a transfer function that models the input-output characteristic of the CODEC. In (5), the augmented input vector consists of the input data in the current frame, the previous data, and the future input samples, which are usually referred to as the look-ahead data. In low-bit-rate speech coding techniques, the coding parameters such as the pitch, line spectrum frequencies (LSFs), and the excitation signal extracted from a frame depend not only on the samples in that frame but also on the past and look-ahead data. The transfer function is generally highly nonlinear. For simplicity, we approximate (5) in the transform domain by (6), a linear relation between the transform domain vectors of the CODEC input and output. Equation (6) shows that the CODEC is approximated by a linear system model. In practice, the input-output relationship of a speech CODEC is too complicated to be expressed in a form as simple as (6). However, if we focus on a short interval of speech, this linear system model can be considered a good approximation to the real CODEC transfer function. Moreover, since the CODEC output is mostly affected by the input in the corresponding frame, the effects of the past and future input data can be ignored without introducing a large modeling error.

Modification of the input vector is achieved according to the criterion (7), whose minimizer is the desired modified vector. Here, the objective function is given by (8), which involves a distance measure between two vectors and a positive constant. From (8), it is noted that the objective function is expressed in terms of two distortions: one is caused by the modification and the other is the quantization error which comes from the speech coder. The positive constant compromises these two types of distortions, and it should be carefully determined. Clearly, the optimal solution depends on how we choose the distance measure. For developing an effective signal modification method, it is important to make the distance measure more closely related to human auditory perception. In the following, we present three distance measures which are usually applied to compute the distance between two spectra, and the solution for signal modification is given for each case.

Case 1. Linear Spectral Distance: The linear spectral distance is given by (9), which is based on the Euclidean norm of a vector. In this case, the objective function is written as (10), in terms of an auxiliary matrix defined using the identity matrix. Since the objective function is a quadratic function of the modified vector, differentiating it with respect to the modified vector and then equating to zero leads us to (11), in which the superscript denotes the Hermitian operation. From (11), it is easy to show that (12) holds. Since the transfer matrix is a diagonal matrix, (12) can be described component-wise as follows: (13)
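The component-wise rule (13) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the linear-distance objective takes the form J = ||X - Xm||^2 + lam * ||Xm - H Xm||^2 with a diagonal transfer matrix H, for which setting the derivative to zero gives Xm(k) = X(k) / (1 + lam * |1 - H(k)|^2). The function name `modify_frame`, the symbols `h_diag` and `lam`, the example values, and the use of NumPy's real FFT as the DFT are all choices made for this example.

```python
import numpy as np

def modify_frame(x, h_diag, lam):
    """Apply the assumed component-wise rule to one frame.

    x      : real-valued frame of N samples (N even)
    h_diag : diagonal of the estimated transfer matrix, length N/2 + 1
    lam    : non-negative weight on the quantization-error term
    """
    X = np.fft.rfft(x)  # N/2 + 1 effective DFT coefficients of a real frame
    # Bins the CODEC reproduces well (h close to 1) are left almost untouched;
    # bins it reproduces poorly are attenuated, more strongly for larger lam.
    gain = 1.0 / (1.0 + lam * np.abs(1.0 - h_diag) ** 2)
    return np.fft.irfft(gain * X, n=len(x))

# Example: an 80-sample frame (10 ms at 8 kHz) and a hypothetical estimate
# in which the CODEC halves every coefficient.
frame = np.sin(2 * np.pi * 5 * np.arange(80) / 80)
h_est = np.full(41, 0.5)
modified = modify_frame(frame, h_est, lam=4.0)
```

With `lam = 0`, or with an estimate equal to the identity mapping, the gain is one in every bin and the frame passes through unchanged, which matches the intuition that no modification is needed when the coder already reproduces its input.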

Case 2. Weighted Linear Spectral Distance: Given a positive-definite weighting matrix, the weighted linear spectral distance is defined by (14). Consequently, (15) and (16) follow. As in the linear spectral distance case, the objective function becomes a quadratic function, and from (16) it can be shown that (17) holds. Since the weighting matrix is positive-definite, it is possible to make the inverse matrix on the right-hand side of (17) exist with an appropriate choice of the positive constant. Usually, the weighting matrix represents a perceptual weighting filter used for quantization noise shaping in linear prediction based analysis-by-synthesis coding techniques. It is noted that if the weighting matrix is diagonal, (17) becomes the same as (13), which indicates that a diagonal weighting matrix does not affect signal modification.

Case 3. Log Spectral Distance: The definition of the log spectral distance is given by (18). With this definition, the objective function is written as (19). From (19), it is not difficult to see (20), i.e., the optimal solution is the input signal itself and no modification is necessary.

III. TRANSFER MATRIX ESTIMATION

Estimation of the transfer matrix can be treated as a system identification problem. Suppose that we apply a set of input vectors to the speech CODEC and obtain the corresponding output vectors. Our approach is based on the method of least squares, which is described as follows: (21), in which the minimizer is the least squares estimate of the transfer matrix and a weighting factor is assigned to each input-output pair. As shown in (21), the error function is decomposed into separate functions, (22). Differentiating each of them with respect to the corresponding element of the transfer matrix, we get (23), and equating each differential to zero results in (24).

The least squares estimation technique can also be implemented in a sequential manner, which we call the recursive least squares (RLS) method [18]. Let (25) be the least squares estimate of the transfer matrix derived from a given number of input-output pairs. Then, (26)

where (27) and (28) hold. Moreover, if we incorporate the exponential forgetting scheme, (27) and (28) are modified as follows: (29), with the forgetting factor lying in the range (0, 1). The RLS technique with an appropriate forgetting scheme enables us to track the time-varying transfer function of a given speech CODEC.

Fig. 2. CODEC performance in noisy environments.

IV. ON-LINE IMPLEMENTATION

From the previous sections, it is shown that the transfer matrix should be estimated prior to input modification. Since, however, the CODEC input-output characteristic is usually time-varying depending on the given speech frame, identification of the transfer matrix as well as input modification should be performed in an on-line fashion. A simple solution is to apply the input speech frame twice to the CODEC, once for estimating the transfer matrix by means of the RLS technique and once more for the actual speech quantization. However, this scheme requires not only a large amount of computation but also some modification to the CODEC operation. Since our purpose is to achieve a proper input modification while maintaining the original CODEC operation, we predict the transfer matrix at a given time using the previous data. This is based on the assumption that the transfer characteristic of a CODEC evolves slowly. Input modification and transfer matrix estimation are then carried out simultaneously as follows. For each frame:
i) Modify the input speech frame according to (13) with the transfer matrix estimate obtained from the previous data.
ii) Obtain the CODEC output by applying the modified input vector.
iii) Update the estimate of the transfer matrix based on the RLS approach by treating the modified input and the CODEC output as a new input-output pair.

V. EXPERIMENTS AND RESULTS

In this section, we perform a number of experiments on the presented signal modification technique. As the target speech coder, we employed the G.729 CS-ACELP, which is a toll quality 8 kb/s speech coder [4]. A total of 96 sentences spoken by four male and four female speakers were used for the
evaluation data. Each sentence was sampled at 8 kHz and the frame size was set to 10 ms.

A. Coder Performance in Noisy Environments

For the purpose of analyzing the CODEC transfer characteristics, we first measured the signal-to-quantization noise ratio (SQNR) produced by the speech coder under various background noise conditions. Given the speech samples of each input frame and the real CODEC transfer function, the average SQNR is computed as in (30), where the average is taken over the total number of frames. The average SQNR of the target coder for clean speech was found to be 12.85 dB. For the purpose of simulating noisy environments, speech samples were corrupted by three kinds of noise sources: white, babble, and pink noises extracted from the NOISEX-92 database [19]. These noises were added to the clean speech waveforms at various SNRs. The results are shown in Fig. 2, from which it is evident that the performance deteriorates rapidly as the SNR decreases in all the noise conditions. Among the three types of noise, the white noise affected the coder performance the most. In contrast to the white noise case, only a mild degradation in performance was observed with the babble noise.

B. System Identification of CODEC Transfer Function

In the presented signal modification approach, the real CODEC transfer function is approximated by a locally linear

transfer matrix as given by (6). For that reason, it is of great importance to estimate the transfer matrix so that it provides a realistic description of the CODEC transfer characteristic. In the present work, the transfer matrix is estimated by means of the RLS technique, which is found suitable for on-line implementation. Several experiments were conducted to verify how closely the estimated transfer matrix could approximate the real transfer function. The performance was described in terms of the approximation error, i.e., the difference between the true CODEC output and the signal predicted based on the estimated transfer matrix. By varying the forgetting factor, we applied the RLS algorithm given by (26) to estimate the diagonal transfer matrix. Consider the spectrum of each input signal frame and the corresponding CODEC output spectrum. Here, these spectra are represented by either the DFT or DCT coefficients. Since, in our on-line implementation, the transfer matrix estimated from the previous data is used to approximate the CODEC output, a relative measure for the approximation error, which we call the signal-to-approximation noise ratio (SANR), is given by (31).

Fig. 3. Approximation error of CODEC transfer function with clean speech.

Fig. 4. Approximation error of CODEC transfer function with speech corrupted by babble noise at SNR = 10 dB.

SANR was computed over both the clean and noisy speech databases. The noisy speech samples were generated by adding the babble noise to the clean speech waveforms at an SNR of 10 dB. Figs. 3 and 4 show the results for the clean and noisy speech, respectively. From the results, we can see that the DFT was more efficient than the DCT in approximating the true CODEC transfer function, but the difference in performance became smaller as the forgetting factor got closer to 1. It is noted that the performance achieved with the DFT coefficients degraded more seriously in the presence of background noise than that achieved with the DCT. Moreover, it is also worth mentioning that SANR increased as the forgetting factor approached 1. This phenomenon implies that the given CODEC transfer characteristic can be considered to evolve slowly.

C. Experiments on Signal Modification

Performance of the signal modification approach presented in Section II was evaluated in both the clean and noisy environments. Since the present approach makes a trade-off between the modification error and the coder quantization error, the performance is described in terms of two distortion measures. One is the signal-to-modification noise ratio (SMNR) defined by (32), which compares the original input signal spectrum with the signal spectrum obtained by modifying it. The other distortion measure is the conventional SQNR given by (33), which is computed from the modified spectrum and the output obtained when the modified spectrum is applied to the CODEC as an input. Signal modification was done according to (13), derived from the linear spectral distance, by varying the positive constant. The transfer matrix was estimated based on the RLS method with a forgetting factor. Fig. 5 gives the results obtained from the clean speech data. As expected, SMNR was high when the constant was small and vice versa. As for the SQNR, even though it increased as the constant grew, its slope was not as steep as that of the SMNR. SQNRs obtained from both

the DFT and DCT spectral representations were found to be almost the same for all values of the constant. As for the modification error, the DFT produced a higher SMNR than the DCT when the constant was small, while the DCT performed better than the DFT with a large constant. In order to examine how much the signal modification scheme affects the overall CODEC performance, we also evaluated the overall SQNR (OSQNR) defined by (34). OSQNR takes into account both the modification and quantization errors. In Fig. 6, we show the overall CODEC performance when the clean speech signals were applied. OSQNR was found to be lower with signal modification than with the original CODEC.

Fig. 5. SMNR and SQNR of the modification approach with clean speech.

Fig. 6. Overall SQNR of the modification approach with clean speech.

Fig. 7. SMNR and SQNR of the modification approach with the noisy speech corrupted by the white noise at SNR = 10 dB.

Fig. 8. Overall SQNR of the modification approach with the noisy speech corrupted by the white noise at SNR = 10 dB.

Fig. 9. SMNR and SQNR of the modification approach with the noisy speech corrupted by the babble noise at SNR = 10 dB.

Fig. 10. Overall SQNR of the modification approach with the noisy speech corrupted by the babble noise at SNR = 10 dB.

Next, several experiments on noisy speech data were carried out. For these experiments, white and babble noises were added to the clean speech waveforms while keeping the SNR at 10 dB. The results are shown in Figs. 7-10, where it is observed that as the constant grew, the SQNR increased more rapidly than that obtained with the clean speech data.

D. Adaptive Signal Modification

In the presented signal modification technique, the amount of modification and quantization errors is controlled by the constant. If the constant is large, more emphasis is placed on the quantization error and a larger modification of the input signal is allowed. In some parts of the given speech signal, even a slight modification produces an audible distortion which degrades the perceptual quality. In contrast, other parts remain almost perceptually indistinguishable from the original signal or cause no degradation in speech quality attributes such as intelligibility and naturalness. Therefore, it is desirable to apply an adaptive approach to modification depending on the given signal. One possible way may be to classify each frame into speech and nonspeech periods, and then apply a different value of the constant depending on the classification result. Even though a voice activity detection (VAD) algorithm [20] can be used for such a hard decision task, it is useful only in stationary noise environments. Moreover, as in many speech enhancement techniques, a soft decision method will be more helpful to avoid an abrupt discontinuity in spectral components [14].

Here, we propose an adaptive approach to determine the constant based on the CODEC characteristic. To devise the approach, we performed an experiment where a speech signal corrupted by the babble noise at an SNR of 10 dB was passed through the CODEC and a smoothed SQNR was computed. Smoothing of the SQNR over time was done as follows: (35), in which the input signal in each frame and the corresponding CODEC output are used, and the smoothing coefficient was set to 0.98 in the experiment. Fig. 11 plots the smoothed SQNR curve in conjunction with the original input speech samples. From the results, it is evident that the smoothed SQNR is high for the speech periods while it is low during the nonspeech periods. Since modification of the input signal during the speech period is considered more likely to produce undesired distortions, the constant should be selected adaptively

Fig. 11. Smoothed SQNR of the noisy speech corrupted by the babble noise at SNR = 10 dB. (a) Input signal waveform. (b) Smoothed SQNR in each frame.
according to whether the given input signal frame is classified into the speech or nonspeech period. Based on the observation given by Fig. 11, we propose a novel approach to determine the constant by means of the computed SQNR. The estimate of the constant in each frame is given by (36) with (37), which involve a slope parameter, an offset, and the maximum possible value of the constant; all these parameters are found experimentally. It is noted that the sigmoid type function of (37) makes the constant decrease as the SQNR increases while limiting its value to the interval between zero and the maximum value.
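The adaptive rule can be combined with the on-line procedure of Section IV as sketched below. Everything concrete here is an assumption made for illustration: the smoothing is taken to be a first-order recursion, the sigmoid is taken to be lam_max / (1 + exp(alpha * (s - beta))), the RLS step with exponential forgetting is written in its accumulated-correlation form for a diagonal transfer matrix, and the names `alpha`, `beta`, `lam_max`, `mu`, and `zeta`, together with their values other than the smoothing coefficient 0.98, are hypothetical. The `codec` argument is any callable mapping a frame to its decoded version; a simple scaling stands in for a real speech coder in the usage lines.

```python
import numpy as np

def frame_sqnr_db(x, y, eps=1e-12):
    """SQNR of one frame in dB: input energy over coding-error energy."""
    return 10.0 * np.log10((np.sum(x ** 2) + eps) / (np.sum((x - y) ** 2) + eps))

def smooth_sqnr(prev, current, zeta=0.98):
    """Assumed first-order recursive smoothing of the frame SQNR."""
    return zeta * prev + (1.0 - zeta) * current

def adaptive_lambda(sqnr_db, alpha, beta, lam_max):
    """Assumed decreasing sigmoid of the smoothed SQNR, bounded to (0, lam_max)."""
    return lam_max / (1.0 + np.exp(alpha * (sqnr_db - beta)))

def online_modification(frames, codec, alpha, beta, lam_max, mu=0.95, zeta=0.98):
    """Modify each frame, code it, then update the diagonal transfer estimate."""
    n = len(frames[0])
    H = np.ones(n // 2 + 1, dtype=complex)  # start from the identity mapping
    R = np.full(n // 2 + 1, 1e-6)           # weighted input energy per bin
    P = R * H                               # weighted input-output correlation
    s_bar = 0.0                             # smoothed SQNR in dB
    outputs = []
    for x in frames:
        lam = adaptive_lambda(s_bar, alpha, beta, lam_max)
        X = np.fft.rfft(x)
        X_mod = X / (1.0 + lam * np.abs(1.0 - H) ** 2)  # step i: modify
        x_mod = np.fft.irfft(X_mod, n=n)
        y = codec(x_mod)                                # step ii: code and decode
        Y = np.fft.rfft(y)
        R = mu * R + np.abs(X_mod) ** 2                 # step iii: RLS update
        P = mu * P + np.conj(X_mod) * Y
        H = P / R
        s_bar = smooth_sqnr(s_bar, frame_sqnr_db(x_mod, y), zeta)
        outputs.append(y)
    return np.array(outputs), H

# Usage with a toy stand-in "codec" that scales every sample by 0.8.
rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 80))
decoded, H_est = online_modification(frames, lambda v: 0.8 * v,
                                     alpha=0.5, beta=5.0, lam_max=2.0)
```

Because the transfer estimate used in step i comes from earlier frames only, the codec is run once per frame, which is the point of predicting the transfer matrix from previous data rather than coding each frame twice.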

E. Subjective Listening Tests

For the purpose of evaluating the subjective quality of the presented signal modification algorithm, we carried out a set of informal listening tests. Eight test sentences, one for each of eight speakers, were selected and then used for quality measurement. Subjective opinion scores were given by a group of ten listeners and then averaged to yield the mean opinion score (MOS) results. In order to make the input data deviate from pure speech, we added the white, babble, and high frequency channel (HFC) noises from the NOISEX-92 database to the clean speech signals while varying the SNR. The HFC noise in the NOISEX-92 database was collected by recording the noise in a high frequency radio channel after demodulation. In addition, two types of interfering signals, a co-talker's speech and background music, were also applied to degrade the input speech quality. The MOS results are shown in Table I, where SM DFT and SM DCT represent the signal modification algorithm with the DFT and DCT, respectively. Signal modification was performed adaptively according to the sigmoid type function given by (37).

TABLE I. RESULTS FOR SUBJECTIVE LISTENING TEST; MOS

From Table I, it is evident that the input signal modification algorithm is effective in reducing the distortion or listener fatigue caused by the background noise. Performance improvement was found to be greater for the white and HFC noise environments than for the other cases. It is interesting to see that the performance of the SM DCT was better than that of the SM DFT even though the former was found inferior to the latter in terms of the modeling capability, as can be seen from our previous experiments, e.g., Figs. 3 and 4. It is also noted that without any interfering signals, the results obtained with signal modification were even slightly better than those of the original CODEC.

VI. CONCLUSIONS

We have presented an approach to input signal modification to be used as the front-end for a low-bit-rate speech coder. A simplified system modeling of the given CODEC transfer function makes it possible to convert the signal modification issue into a mathematically tractable optimization problem. The objective function of this optimization task is described as a compromise between the modification and quantization errors, and we have derived the solutions associated with various distortion measures. Based on a quasi-stationarity assumption for the transfer function, the system model parameters are identified by means of the RLS approach. Since each part of the speech signal is affected differently by the modification, we have also devised an adaptive method based on the smoothed estimate of the SQNR. From a number of experiments, the presented modification approach has been found to improve the perceived speech quality, especially when the input signal is corrupted by interfering signals such as background noise, music, and speech from other speakers.

REFERENCES

[1] W. B. Kleijn and K. K. Paliwal, Speech Coding and Synthesis. New York: Elsevier, 1995.
[2] M. Schroeder and B. Atal, "Code-excited linear prediction (CELP): High quality speech at very low bit rates," in IEEE Int. Conf. Acoust., Speech, Signal Processing, 1985, pp.
[3] P. Kroon, E. F. Deprettere, and R. J. Sluyter, "Regular-pulse excitation: A novel approach to effective and efficient multipulse coding of speech," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-34, no. 5, pp. , May 1986.
[4] R. Salami et al., "Design and description of CS-ACELP: A toll quality 8 kb/s speech coder," IEEE Trans. Speech Audio Processing, vol. 6, pp. , Mar. 1998.
[5] R. J. McAulay and T. F. Quatieri, "Speech analysis-synthesis based on a sinusoidal representation," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-34, pp. , Apr. 1986.
[6] D. Griffin and J. S. Lim, "Multiband excitation vocoder," IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, pp. , Aug. 1988.
[7] W. B. Kleijn, "Encoding speech using prototype waveforms," IEEE Trans. Speech Audio Processing, vol. 1, pp. , July 1993.
[8] W. B. Kleijn and J. Haagen, "Transformation and decomposition of the speech signal for coding," IEEE Signal Processing Lett., vol. 1, pp. , Sept. 1994.
[9] O. Gottesman and A. Gersho, "Enhancing waveform interpolative coding with weighted REW parametric quantization," in Proc. IEEE Speech Coding Workshop, Sept. 2000, pp.
[10] B. Atal and M. Schroeder, "Predictive coding of speech signals and subjective error criteria," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, pp. , Mar. 1979.
[11] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Processing, vol. 32, pp. , Dec. 1984.
[12] J. Gibson, B. Koo, and S. Gray, "Filtering of colored noise for speech enhancement and coding," IEEE Trans. Signal Processing, vol. 39, pp. , Aug. 1991.
[13] Y. Ephraim, "Statistical-model-based speech enhancement systems," Proc. IEEE, vol. 80, pp. , 1992.
[14] N. S. Kim and J.-H. Chang, "Spectral enhancement based on global soft decision," IEEE Signal Processing Lett., vol. 7, pp. , May 2000.
[15] W. B. Kleijn, R. P. Ramachandran, and P. Kroon, "Interpolation of the pitch-predictor parameters in analysis-by-synthesis coders," IEEE Trans. Speech Audio Processing, vol. 2, pp. 42-54, Jan. 1994.
[16] J. Jensen, S. H. Jensen, and E. Hansen, "A perturbation-based pre-processing algorithm for CELP-coders," in Proc. IEEE Speech Coding Workshop, June 1999, pp.
[17] N. S. Kim and J.-H. Chang, "A preprocessor for low-bit-rate speech coding," IEEE Signal Processing Lett., vol. 9, pp. , Oct. 2002.
[18] S. Haykin, Adaptive Filter Theory. Englewood Cliffs, NJ: Prentice-Hall, 1991.
[19] A. P. Varga, H. J. M. Steeneken, T. Tomlinson, and D. Jones, "The NOISEX-92 Study on the Effect of Additive Noise on Automatic Speech Recognition," DRA Speech Res. Unit, 1992.
[20] J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Processing Lett., vol. 6, pp. 1-2, Jan. 1999.

IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL 12, NO 1, JANUARY 2004

Nam Soo Kim (M'88) received the B.S. degree in electronics engineering from Seoul National University (SNU), Seoul, Korea, in 1988, and the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST) in 1990 and 1994, respectively. From 1994 to 1998, he was with the Samsung Advanced Institute of Technology (SAIT) as a Senior Member of Technical Staff. Since 1998, he has been with the School of Electrical Engineering, SNU, where he is currently an Associate Professor. His research interests include speech signal processing, speech recognition, speech/audio coding, speech synthesis, adaptive signal processing, machine learning, and mobile communication.

Joon-Hyuk Chang (M'02) received the B.S. degree in electronics engineering from Kyungpook National University, Korea, in 1998, and the M.S. degree in electrical engineering from Seoul National University, Seoul, Korea, in 2000. He is currently pursuing the Ph.D. degree in electrical engineering at Seoul National University. Since 2000, he has been with Netdus Corp, Seoul, as an Associate Engineer. Since January 2003, he has served as CTO at Netdus. His research interests include speech signal processing, speech/image coding, speech enhancement, and adaptive filtering.
