Non-Uniform Speech/Audio Coding Exploiting Predictability of Temporal Evolution of Spectral Envelopes

Petr Motlicek (1,2), Hynek Hermansky (1,2,3), Sriram Ganapathy (1,3), and Harinath Garudadri (4)

1 IDIAP Research Institute, Rue du Simplon 4, CH-1920, Martigny, Switzerland, {motlicek,hynek,ganapathy}@idiap.ch
2 Faculty of Information Technology, Brno University of Technology, Božetěchova 2, Brno, 612 66, Czech Republic
3 École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
4 Qualcomm Inc., San Diego, California, USA, hgarudad@qualcomm.com

Abstract. We describe a novel speech/audio coding technique designed to operate at medium bit-rates. Unlike classical state-of-the-art coders, which are based on short-term spectra, our approach uses relatively long temporal segments of the audio signal in critical-band-sized sub-bands. We apply an auto-regressive model to approximate Hilbert envelopes in frequency sub-bands. Residual signals (Hilbert carriers) are demodulated, and thresholding functions are applied in the spectral domain. The Hilbert envelopes and carriers are quantized and transmitted to the decoder. Our experiments focused on designing a speech/audio coder that provides broadcast radio-like audio quality at around 15-25 kbps. Objective quality measures, obtained on standard speech recordings, were compared to those of the state-of-the-art 3GPP-AMR speech coding system.

Key words: Audio coding, audio signal processing, linear predictive coding, modulation coding, lossy compression

This work was partially supported by grants from ICSI Berkeley, USA; the Swiss National Center of Competence in Research (NCCR) on Interactive Multi-modal Information Management (IM)2, managed by the IDIAP Research Institute on behalf of the Swiss Federal Authorities; and by the European Commission 6th Framework DIRAC Integrated Project.

1 Introduction

State-of-the-art speech coding techniques that generate toll quality speech typically exploit the short-term predictability of the speech signal in the 20-30 ms range [1]. This short-term analysis is based on the assumption that the speech signal is stationary over these segment durations. Techniques such as Linear Prediction (LP), which efficiently approximates short-term power spectra with an Auto-Regressive (AR) model [2], are applied. However, the speech signal is only quasi-stationary and carries information in its dynamics. Such information is not adequately captured by short-term based approaches. Some considerations that motivated us to explore novel architectures are listed below:

- When the signal dynamics are described by a sequence of short-term vectors, many issues arise, such as windowing, proper sampling of the short-term representation, and time-frequency resolution compromises.
- There are situations where LP provides a sub-optimal filter estimate. In particular, when modeling voiced speech, LP methods can be adversely affected by spectral fine structure.
- LP based approaches do not respect many important perceptual properties of hearing (e.g., the non-uniform critical-band representation).
- Conventional LP techniques are based on a linear model of speech production and thus have difficulties encoding non-speech signals (e.g., music, speech in background).

Over the past decade, research in speech/audio coding has focused on high quality/low latency compression of wide-band audio signals. However, new services such as Internet broadcasting, consumer multimedia, and narrow-band digital AM broadcasting are emerging. Such applications raise new challenges, such as resiliency to errors and gaps in delivery. Furthermore, many of these services do not impose strict latency constraints, i.e., the coding delay is less important than the bit-rate and quality requirements.

This paper describes a new coding technique that employs AR modeling to approximate the instantaneous energy (squared Hilbert envelope (HE)) of relatively long-term critical-band-sized sub-band signals.
Our earlier work showed that, by approximating the envelopes in sub-bands, we can design a very low bit-rate speech coder that gives intelligible output of synthetic quality [3]. In this work, we focus on efficient coding of the residual information (Hilbert carriers (HCs)) to achieve higher quality of the re-synthesized signal. Objective quality scores based on the Itakura-Saito (I-S) distance measure [4] and Perceptual Evaluation of Speech Quality (PESQ) [5] are used to evaluate the performance of the proposed coder on challenging speech files sampled at 8 kHz.

The paper is organized as follows: Section 2 gives a basic description of the proposed encoder. Section 3 describes the decoding side. Section 4 describes the experiments we conducted to validate the approach using objective quality measurements.

2 Encoding

New techniques utilizing LP to model temporal envelopes of the input signal have been proposed [6, 7]. More precisely, the HE (squared magnitude of an analytic signal), which yields a good estimate of the instantaneous energy of the signal, can be parameterized by Frequency Domain Linear Prediction (FDLP) [7]. FDLP is the frequency-domain analogue of the traditional Time Domain Linear Prediction (TDLP) technique, in which the power spectrum of each short-term frame is approximated by an AR model.

Fig. 1. Simplified structure of the proposed encoder. (Blocks: DCT of the 1 s, 8 kHz speech input; critical band filters producing x_k(t) and c_k(t); Hilbert envelope by FDLP (FFT, power, compress, IFFT, Levinson) producing a_k(t), sent as LSFs per 1 s; Hilbert carrier processing (demodulate, decimate, amplitude normalize, FFT, adaptive threshold) producing d_k(t), sent as spectral components per 200 ms, with every 200 ms processed separately.)

The FDLP technique can be summarized as follows: To get an all-pole approximation of the squared HE, first the Discrete Cosine Transform (DCT) is applied to a given audio segment. Next, the autocorrelation LP technique is applied to the DCT transformed signal. The Fourier transform of the impulse response of the resulting all-pole model approximates the squared HE of the signal. Just as TDLP fits an all-pole model to the power spectrum of the input signal, FDLP fits an all-pole model to the squared HE of the signal. As discussed later, this approach can be exploited to approximate the temporal envelope of the signal in individual frequency sub-bands. This gives an alternate representation of the signal in the 2-dimensional time-frequency plane that can be used for audio coding.

2.1 Parameterizing temporal envelopes in critical sub-bands

The graphical scheme of the whole encoder is depicted in Fig. 1. First, the signal is divided into 1000 ms long temporal segments, which are transformed by DCT into the frequency domain and later processed independently. To avoid possible artifacts at segment boundaries, an overlap of 10 ms is used.
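The FDLP procedure summarized in Section 2 (DCT, autocorrelation LP, all-pole envelope) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the toy test signal, the model order of 8 and the direct O(N^2) DCT are all assumptions chosen to keep the example short and self-contained.

```python
import numpy as np

def dct_ii(x):
    """Orthonormal DCT-II computed directly from its definition (O(N^2))."""
    n = len(x)
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    scale = np.full(n, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return scale * (np.cos(np.pi * (t + 0.5) * k / n) @ x)

def levinson(r, order):
    """Levinson-Durbin recursion: autocorrelation r[0..order] -> (a, error)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def fdlp_envelope(x, order, n_points=None):
    """All-pole approximation of the squared Hilbert envelope of x."""
    n_points = len(x) if n_points is None else n_points
    y = dct_ii(np.asarray(x, dtype=float))            # step 1: DCT
    r = np.array([np.dot(y[:len(y) - m], y[m:])       # step 2: autocorrelation
                  for m in range(order + 1)])         #         of the DCT sequence
    a, err = levinson(r, order)                       # step 3: AR model
    # Evaluate err / |A(e^{jw})|^2 on w in [0, pi); w maps to time in the segment.
    A = np.fft.rfft(a, 2 * n_points)[:n_points]
    return err / np.abs(A) ** 2

# Toy input (hypothetical): a burst centred on sample 280 over a weak floor.
n = 400
t = np.arange(n)
x = (0.05 + np.exp(-0.5 * ((t - 280) / 15.0) ** 2)) * np.cos(0.3 * np.pi * t)
env = fdlp_envelope(x, order=8)
peak = int(np.argmax(env))   # expected to lie close to the burst position
```

The envelope is sampled on the same time grid as the input, so the position of its maximum can be compared directly with the position of the burst.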
To emulate the auditory-like frequency selectivity of human hearing, we apply N_BANDs Gaussian functions (N_BANDs denotes the number of critical sub-bands), equally spaced on the Bark scale with standard deviation σ = 1 bark and center frequency F_k, to derive sub-segments of the DCT transformed signal. The FDLP technique is performed on every sub-segment of the DCT transformed signal (its time-domain equivalent obtained by inverse DCT is denoted x_k(t), where k denotes the frequency sub-band). The resulting approximations of the HEs in sub-bands are denoted a_k(t).

2.2 Excitation of FDLP in frequency sub-bands

To reconstruct the signal in each critical-band-sized sub-band, an additional component, the Hilbert carrier (HC) c_k(t), is required (the residual of the LP analysis represented in the time domain). Modulating c_k(t) with the approximated temporal envelope a_k(t) in each critical sub-band yields the original x_k(t) (refer to [8] for a mathematical explanation). Clearly, c_k(t) is analogous to the excitation signal in TDLP. Utilizing c_k(t) leads to perfect reconstruction of x_k(t) in sub-band k and, after the sub-bands are combined, to perfect reconstruction of the overall input signal.

Processing Hilbert carriers (HCs): For convenience in processing and encoding, we need the sub-band carrier signals to be low-pass. This can be achieved by demodulating c_k(t) (shifting the Fourier spectrum of c_k(t) from F_k to 0 Hz). Since the modulation frequency F_k of each sub-band is known, we employ a standard procedure to demodulate c_k(t) through the concept of the analytic signal z_k(t), a complex signal that has a zero-valued spectrum for negative frequencies. To demodulate c_k(t), we perform the scalar multiplication z_k(t) · c_k(t). The demodulated carrier in each sub-band is low-pass filtered and down-sampled. The frequency width of the low-pass filter, as well as the down-sampling ratio, is determined from the frequency width of the Gaussian window for the particular critical sub-band (the cutoff frequencies correspond to a 40 dB decay in magnitude with respect to F_k). The resulting time-domain signal, denoted d_k(t), represents the demodulated and down-sampled HC c_k(t). d_k(t) is a complex sequence, because its Fourier spectrum is not conjugate symmetric. Perfect reconstruction of c_k(t) from d_k(t) can be done by reversing all the pre-processing steps.
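The carrier demodulation described above can be illustrated with a small sketch. It is one plausible reading of the text rather than the authors' code: the analytic signal is formed by zeroing negative frequencies, the spectrum is shifted from the band centre to 0 Hz with a complex exponential, and an ideal FFT-domain low-pass followed by decimation stands in for the filter derived from the Gaussian window. The function names, the 350 Hz band centre, the 370 Hz toy carrier and the decimation factor of 8 are hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """Complex signal with the same positive-frequency content as x
    and a zero-valued spectrum at negative frequencies."""
    n = len(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * gain)

def demodulate(c, f_center, fs, decim):
    """Shift the carrier spectrum from f_center to 0 Hz, low-pass, decimate."""
    t = np.arange(len(c)) / fs
    baseband = analytic_signal(c) * np.exp(-2j * np.pi * f_center * t)
    # Ideal low-pass (assumption): keep only |f| below the new Nyquist frequency.
    freqs = np.fft.fftfreq(len(c), d=1.0 / fs)
    mask = np.abs(freqs) < fs / (2.0 * decim)
    lowpassed = np.fft.ifft(np.fft.fft(baseband) * mask)
    return lowpassed[::decim]

# Toy carrier: 200 ms at 8 kHz, 20 Hz above a hypothetical band centre of 350 Hz.
fs = 8000.0
n = 1600
c = np.cos(2 * np.pi * 370.0 * np.arange(n) / fs)
d = demodulate(c, f_center=350.0, fs=fs, decim=8)
```

In this toy case d is a complex exponential at 20 Hz sampled at 1 kHz, which shows why d_k(t) is complex in general: its spectrum is one-sided around 0 Hz and therefore not conjugate symmetric.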
Since the HCs c_k(t) are quite non-stationary, they are split into 200 ms long sub-segments (with a 10 ms overlap for smooth transitions) and processed independently.

Encoding of demodulated HCs: The temporal envelopes a_k(t) and the complex-valued demodulated HCs d_k(t) carry the information necessary to reconstruct x_k(t). If the original HE is used to derive d_k(t), then |d_k(t)| = 1, and only the phase information from d_k(t) would be required for perfect reconstruction. However, since FDLP yields only an approximation of the original HEs, d_k(t) in general will not be perfectly flat, and both components of the complex sequence are required.

The coder implemented is an adaptive threshold coder applied to the Fourier spectrum of d_k(t), independently in each sub-band, where only the spectral components with magnitudes above the threshold are transmitted. The threshold is dynamically adapted to meet a required number of transmitted spectral components (described later in Section 4.1). The quantized values of magnitude and phase for each selected spectral component are transmitted.
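The adaptive thresholding step can be sketched as follows, under the assumption that "adapting the threshold to meet a required number of components" amounts to keeping the n_keep largest-magnitude Fourier components of d_k(t). Quantization is omitted, and the function names and the toy signal are hypothetical.

```python
import numpy as np

def select_components(d, n_keep):
    """Keep the n_keep largest-magnitude Fourier components of d.

    Returns the kept bin indices, their complex values, and the implied
    threshold (the magnitude of the weakest kept component)."""
    spec = np.fft.fft(d)
    mags = np.abs(spec)
    ranked = np.argsort(mags)[::-1]          # bins sorted by decreasing magnitude
    kept = np.sort(ranked[:n_keep])
    threshold = mags[ranked[n_keep - 1]]
    return kept, spec[kept], threshold

def reconstruct(kept, values, n):
    """Decoder side: zero-fill the dropped bins and invert the FFT."""
    spec = np.zeros(n, dtype=complex)
    spec[kept] = values
    return np.fft.ifft(spec)

# Toy demodulated carrier: two strong components and one weak one.
m = np.arange(64)
d = (np.exp(2j * np.pi * 3 * m / 64)
     + 0.5 * np.exp(2j * np.pi * 10 * m / 64)
     + 0.01 * np.exp(-2j * np.pi * 5 * m / 64))
kept, values, threshold = select_components(d, n_keep=2)
d_hat = reconstruct(kept, values, len(d))
```

Only the bin indices plus the magnitude and phase of the kept values would need to be transmitted; the reconstruction error here is exactly the dropped weak component.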

Fig. 2. Simplified structure of the proposed decoder. (Blocks: restoring the Hilbert carrier c_k(t) from the spectral components per 200 ms by inverse FFT, interpolation, modulation and spectrum symmetry, with every 200 ms processed separately; restoring the Hilbert envelope a_k(t) from the LSFs per 1 s via the frequency response of the filter; multiplication of a_k(t) and c_k(t) to give x_k(t); summation in the DCT domain; IDCT producing 1 s of 8 kHz speech.)

3 Decoding

To reconstruct the input signal, the carrier c_k(t) in each critical sub-band needs to be re-generated and then modulated by the temporal envelope a_k(t) obtained using FDLP. The decoder, shown in Fig. 2, is relatively simple: it inverts the steps performed at the encoder. The decoding operation is also applied to each (1000 ms long) input segment independently. The decoding steps are:

(a) The signal d_k(t) is reconstructed using the inverse Fourier transform of the transmitted complex spectral components. d_k(t) is then up-sampled to the original rate and modulated on a sinusoid at F_k (i.e., its Fourier spectrum is frequency-shifted and post-processed to be conjugate symmetric). This results in the reconstructed HC c_k(t).
(b) The temporal envelope a_k(t) is reconstructed from the transmitted AR model coefficients. The temporal trajectory x_k(t) is obtained by modulating c_k(t) with a_k(t).

The above steps are performed in all frequency sub-bands. Finally:

(a) The temporal trajectories x_k(t) in each critical sub-band are projected to the frequency domain by DCT and summed.
(b) A de-weighting window is applied to compensate for the effect of the Gaussian windowing of the DCT sequence at the encoder.
(c) The inverse DCT is performed to reconstruct the 1000 ms long output signal (segment).

Fig. 3 shows time-frequency characteristics of the proposed coder for a randomly selected speech sample.

4 Experiments

All experiments were performed with speech signals sampled at F_s = 8 kHz.
We used a decomposition into N_BANDs = 13 critical sub-bands, which roughly corresponds to one sub-band per bark.
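The sub-band decomposition can be illustrated by computing Gaussian weights over the DCT bins of one 1000 ms segment. The source does not state which Hz-to-Bark formula or centre placement is used, so both are assumptions here: the warping is the one commonly used in PLP analysis, z = 6 asinh(f / 600), and the 13 centres are placed at equal Bark steps strictly between 0 Hz and F_s / 2. With these choices the third centre falls at about 351 Hz, which agrees with the F_3 = 351 Hz quoted in the caption of Fig. 3, but that agreement does not confirm the authors' exact design.

```python
import numpy as np

def hz_to_bark(f):
    """Hz-to-Bark warping used in PLP analysis (an assumption here)."""
    return 6.0 * np.arcsinh(np.asarray(f, dtype=float) / 600.0)

def bark_to_hz(z):
    """Inverse of hz_to_bark."""
    return 600.0 * np.sinh(np.asarray(z, dtype=float) / 6.0)

def gaussian_bark_windows(n_dct, fs=8000.0, n_bands=13, sigma=1.0):
    """Gaussian weights over DCT bins, equally spaced on the Bark scale.

    Returns (windows, centers_hz), where windows has shape (n_bands, n_dct)."""
    freqs = np.arange(n_dct) * fs / (2.0 * n_dct)    # frequency of each DCT bin
    z = hz_to_bark(freqs)
    # Hypothetical placement: equal Bark steps strictly inside (0, fs / 2).
    z_centers = hz_to_bark(fs / 2.0) * np.arange(1, n_bands + 1) / (n_bands + 1)
    windows = np.exp(-0.5 * ((z[None, :] - z_centers[:, None]) / sigma) ** 2)
    return windows, bark_to_hz(z_centers)

# One 1000 ms segment at 8 kHz gives 8000 DCT coefficients.
windows, centers = gaussian_bark_windows(n_dct=8000)
# Sub-band k of a DCT sequence y would then be windows[k - 1] * y.
```

Because the Gaussians overlap, their sum over bands is not flat, which is why the decoder needs the de-weighting window mentioned in Section 3.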

Fig. 3. Time-frequency characteristics generated from a randomly selected speech sample (panels (a), (b), (c) and (e) are plotted against time, 0 to 200 ms; panels (d) and (f) against frequency, -400 to 400 Hz): (a) 200 ms segment of the input signal. (b) x_3(t) sequence (frequency sub-band k = 3, center frequency F_3 = 351 Hz); the thin upper line represents the original HE, the solid upper line its FDLP approximation. (c) Original HC c_3(t). (d) Magnitude Fourier spectral components of the demodulated HC d_3(t); the solid line represents the selected threshold. (e) Reconstructed HC c_3(t) in the decoder. (f) Magnitude Fourier spectral components of d_3(t) post-processed by the adaptive threshold.

The FDLP model approximating the HE in each frequency sub-band, a_k(t), is represented by Line Spectral Frequencies (LSFs). Previous informal subjective listening tests, aimed at finding sufficient approximations a_k(t) of the temporal envelopes, showed that for coding the 1000 ms long audio segments, the optimal AR model is of order N_LSFs = 20 [3].

4.1 Objective quality tests on HC

We used the Itakura-Saito (I-S) distance measure [4], as a simple method, together with the ITU-T P.862 PESQ objective quality tool [5] to adjust the threshold values on the Fourier spectrum of d_k(t) for reconstructing the HCs c_k(t) at the decoder. These measures were used to evaluate performance as a function of the number of Fourier spectral components used for the reconstruction of c_k(t) (this number is the same in all sub-bands) while all other parameters were fixed. The performance was tested on a sub-set of the TIMIT speech database [9], containing 380 speech sentences sampled at F_s = 8 kHz. A total of about 20 minutes of speech was used for the experiments. The I-S measure was computed on short-term frames (30 ms frame length, 7.5 ms frame skip).
Encoded sentences were compared to the original sentences by measuring the I-S distance between them. Lower values of the I-S measure indicate a smaller distance and better speech quality. As suggested in [10], to exclude unrealistically high spectral distance values, the 5% of frames with the highest I-S distances were discarded from the final evaluation. This ensures a reasonable measure of overall performance. PESQ scores were also computed for the reconstructed signal. The quality estimated by PESQ corresponds to the average user perception of the speech sample under assessment, expressed as a PESQ MOS (Mean Opinion Score).

Fig. 4 shows the mean I-S distance as well as the mean PESQ score, computed over the whole TIMIT sub-set, as a function of the number of Fourier spectral components used to reconstruct the spectrum of the demodulated HC d_k(t) in each critical sub-band. Both objective quality measures show marked improvement when the number of spectral components is increased from 30 to 80.

Fig. 4. Global mean I-S distance measure of the proposed coder as a function of the number of Fourier spectral components used to reconstruct d_k(t) in each critical sub-band. + marks the performance of the 3GPP-AMR speech codec at 12.2 kbps. (Horizontal axis: number of components, 30 to 80; left axis: I-S distance; right axis: P.862 prediction, PESQ MOS.)

We repeated the above objective tests with the 3GPP Adaptive Multi Rate (AMR) speech codec at 12.2 kbps [11] on the same database; the results are also shown in Fig. 4. They indicate that if d_k(t) is reconstructed from 65 Fourier spectral components (in each critical sub-band, per 200 ms), the proposed coder achieves performance similar to that of the AMR codec with respect to the chosen objective measures. Informal subjective results showed that the speech quality was comparable to that of AMR at 12.2 kbps, while the quality for music signals was noticeably better.

In this paper, we do not discuss the quantization block and the entropy coder. However, in additional informal experiments, the LSFs describing the temporal envelopes a_k(t), as well as the selected spectral components of d_k(t), were quantized (split VQ technique).
These preliminary experiments show the promise of the coder for encoding speech and music signals at an average bit-rate of 15-25 kbps.
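The frame-level I-S evaluation of Section 4.1 can be sketched as below. The source does not spell out how the short-term spectra are estimated, so this sketch assumes Hann-windowed periodograms and the standard Itakura-Saito form, the mean over frequency of P/Q - ln(P/Q) - 1, where P is the reference and Q the coded power spectrum. The frame length of 240 samples and the skip of 60 samples correspond to 30 ms and 7.5 ms at 8 kHz; the function names and the small floor eps are hypothetical.

```python
import numpy as np

def frame_power_spectra(x, frame_len=240, frame_skip=60):
    """Power spectra of Hann-windowed frames (30 ms / 7.5 ms at 8 kHz)."""
    window = np.hanning(frame_len)
    starts = range(0, len(x) - frame_len + 1, frame_skip)
    frames = np.array([x[s:s + frame_len] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def trimmed_is_distance(ref, coded, discard=0.05, eps=1e-10):
    """Mean I-S distance after discarding the worst `discard` share of frames."""
    p = frame_power_spectra(ref) + eps      # eps avoids division by zero
    q = frame_power_spectra(coded) + eps
    ratio = p / q
    per_frame = np.mean(ratio - np.log(ratio) - 1.0, axis=1)
    n_keep = max(1, int(np.floor(len(per_frame) * (1.0 - discard))))
    return float(np.mean(np.sort(per_frame)[:n_keep]))

# Toy check with a seeded noise signal standing in for one second of speech.
rng = np.random.default_rng(0)
ref = rng.standard_normal(8000)
same = trimmed_is_distance(ref, ref)            # identical signals
scaled = trimmed_is_distance(ref, 2.0 * ref)    # gain error only
```

The measure is zero for identical signals and is not symmetric in its arguments; a pure gain of 2 on the coded signal gives P/Q = 0.25 in every bin, hence a distance of 0.25 - ln(0.25) - 1.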

5 Conclusions

A novel variable bit-rate speech/audio coding technique, based on processing relatively long temporal segments of the audio signal in critical-band-sized sub-bands, has been proposed and evaluated. The coder architecture makes it easy to control the quality of the reconstructed sound and the final bit-rate, which makes it suitable for variable bandwidth channels. A coding technique that represents the input signal in frequency sub-bands is inherently more robust to losing chunks of information, i.e., less sensitive to dropouts. This can be of high importance for any Internet protocol service.

We described experiments focused on efficient representation of the excitation signal for the proposed FDLP coder. The parameter settings used are certainly not optimal (e.g., we use uniform spectral parameterization of the Hilbert carriers in all sub-bands, uniform quantization of LSFs, simple Gaussian decomposition, etc.). Improving these aspects is the direction of future research on the proposed coder. To convert the proposed speech/audio coding technique into a real application, formal subjective tests need to be carried out on both speech and music recordings.

References

1. Spanias A. S., Speech Coding: A Tutorial Review, in Proc. of IEEE, Vol. 82, No. 10, October 1994.
2. Makhoul J., Linear Prediction: A Tutorial Review, in Proc. of IEEE, Vol. 63, No. 4, April 1975.
3. Motlicek P., Hermansky H., Garudadri H., Srinivasamurthy N., Speech Coding Based on Spectral Dynamics, in Lecture Notes in Computer Science, Vol. 4188/2006, Springer Berlin/Heidelberg, DE, September 2006.
4. Quackenbush S. R., Barnwell T. P., Clements M. A., Objective Measures of Speech Quality, Prentice-Hall, Advanced Reference Series, Englewood Cliffs, NJ, 1988.
5. ITU-T Rec. P.862, Perceptual Evaluation of Speech Quality (PESQ), an Objective Method for End-to-end Speech Quality Assessment of Narrowband Telephone Networks and Speech Codecs, ITU, Geneva, Switzerland, 2001.
6. Herre J., Johnston J. H., Enhancing the performance of perceptual audio coders by using temporal noise shaping (TNS), in 101st Conv. Aud. Eng. Soc., 1996.
7. Athineos M., Hermansky H., Ellis D. P. W., LP-TRAP: Linear predictive temporal patterns, in Proc. of ICSLP, pp. 1154-1157, Jeju, S. Korea, October 2004.
8. Schimmel S., Atlas L., Coherent Envelope Detector for Modulation Filtering of Speech, in Proc. of ICASSP, Vol. 1, pp. 221-224, Philadelphia, USA, May 2005.
9. Fisher W. M., et al., The DARPA speech recognition research database: specifications and status, in Proc. DARPA Workshop on Speech Recognition, pp. 93-99, February 1986.
10. Hansen J. H. L., Pellom B., An Effective Quality Evaluation Protocol for Speech Enhancement Algorithms, in Proc. of ICSLP, Vol. 7, pp. 2819-2822, Sydney, Australia, December 1998.
11. 3GPP TS 26.071, AMR speech CODEC, General description, <http://www.3gpp.org/ftp/specs/html-info/26071.htm>.