Spanning the 4 kbps divide using pulse modeled residual


University of Wollongong Research Online
Faculty of Informatics - Papers (Archive), Faculty of Engineering and Information Sciences, 2002

Spanning the 4 kbps divide using pulse modeled residual

J. Lukasiak, University of Wollongong, jl01@ouw.edu.au
I. Burnett, University of Wollongong, ianb@uow.edu.au

Publication Details
This article was published as: Lukasiak, J & Burnett, I, Spanning the 4 kbps divide using pulse modeled residual, IEEE Workshop Proceedings on Speech Coding, 6-9 October 2002, 20-22. Copyright IEEE 2002.

Research Online is the open access institutional repository for the University of Wollongong. For further information contact the UOW Library: research-pubs@uow.edu.au

Spanning the 4 kbps divide using pulse modeled residual

Abstract
This paper reports a scalable method for coding the LP residual. The scalable method is capable of increasing the accuracy of the reconstructed speech from a parametric representation at low rates to a more accurate waveform matched representation at higher rates. The method entails pitch length segmentation, decomposition into pulsed and noise components and modeling of the pulsed components using a fixed shape pulse model in a closed-loop, analysis by synthesis system.

Disciplines
Physical Sciences and Mathematics

Publication Details
This article was published as: Lukasiak, J & Burnett, I, Spanning the 4 kbps divide using pulse modeled residual, IEEE Workshop Proceedings on Speech Coding, 6-9 October 2002, 20-22. Copyright IEEE 2002.

This conference paper is available at Research Online: http://ro.uow.edu.au/infopapers/118

SPANNING THE 4 KBPS DIVIDE USING PULSE MODELED RESIDUAL

J. Lukasiak, I.S. Burnett
Whisper Laboratories, TITR
University of Wollongong
Wollongong, NSW, Australia, 2522

1. ABSTRACT

This paper reports a scalable method for coding the LP residual. The scalable method is capable of increasing the accuracy of the reconstructed speech from a parametric representation at low rates to a more accurate waveform matched representation at higher rates. The method entails pitch length segmentation, decomposition into pulsed and noise components and modeling of the pulsed components using a fixed shape pulse model in a closed-loop, Analysis by Synthesis system.

2. INTRODUCTION

Current speech coders exhibit a bit-rate barrier at approximately 4 kbps. Below the barrier parametric coders dominate, while above it waveform coders give preferable results. To increase throughput over variable bit-rate transmission infrastructures such as shared medium networks, it is desirable to design a scalable coder spanning this barrier. As standardised speech compression algorithms are predominantly based on Linear Prediction (LP), developing scalable compression algorithms within this paradigm has been a research focus. Some examples of this research are hybrid parametric/waveform coders that switch at predetermined rates [1] and perfect reconstruction parametric coders that attempt to code the LP residual very accurately [2][6]. The first of these techniques, dynamic switching between waveform and parametric coders, has some serious drawbacks: firstly, oscillatory switching can cause artifacts in the speech and, secondly, both extra complexity and storage are required to run two separate algorithms. The second set of techniques require complex mechanisms to modify or warp the pitch track. They have proven to lack robustness and scalability to higher bit rates (particularly within delay constraints).

At high rates, linear predictive coders using waveform matching produce higher quality speech than parametric coders which directly model (open-loop) the LP residual. The waveform matching is achieved by minimising the error in the speech domain using an Analysis by Synthesis (AbyS) structure such as that used in [3]. At low rates, this exact waveform approach fails to exploit the perceptual redundancy utilised by open-loop parametric coders. In particular, low-rate parametric coders will tend to smooth, and reduce the detail of, the coded residual. There are thus two contradictory approaches on either side of the artificial bit-rate boundary: precise matching at higher rates versus perceptually acceptable parameterization at low rates. In this paper we propose a solution to the non-scalable characteristics of waveform-matching coders so as to breach the divide. Our scalable method of LP residual coding is detailed in the following section, with practical results presented in Section 4.

3. METHOD

The key point in our approach is the assumption that we must exploit AbyS modeling at high bit rates, and thus it is the scalability of that technique to lower rates that needs to be addressed. However, at low bit rates the quality of speech produced by AbyS based speech coders tends to deteriorate rapidly due to the coder wasting bits modelling perceptually unimportant information [4]. Thus we focus here on a mechanism that avoids this bit wastage by identifying the key elements required in residual representation at low rates.
For unvoiced speech, [5] suggests that the signal can be represented in a perceptually transparent manner by replacing the unvoiced LP residual with gain shaped Gaussian noise. Our own results and that work suggest that the low-rate perceptual scalability of speech signals is to be found in the representation of the voiced speech sections. Thus, for high quality low-rate reconstruction of speech signals, we concentrate on the problem of restricting the allocation of AbyS bits such that pitch pulses (and their surrounding details) are adequately represented in synthesised speech.

To ensure that the AbyS modeling at low rates is concerned only with reproducing the pitch pulse, the proposed method firstly critically samples fixed length frames of LP residual (25 ms) into pitch length sub-frames. This segmentation can be achieved in real time using the critical sampling method detailed in [6] or any alternate method that generates non-overlapped pitch length subframes. The non-overlapping/critically sampled nature of the subframes is important as it provides for the use of AbyS modeling. This contrasts with early WI coders that use overlapped (and over-sampled) pitch length subframes. The extracted pitch length subframes are then decomposed into pulsed and noise components. The decomposition process is analogous to the SEW/REW decomposition performed in WI [7]; however, due to the variable number of subframes per frame, fixed length linear filtering (as used in WI) of the subframe evolution requires interpolation of the subframes to produce a fixed number of subframes per frame. An alternative is to use the decomposition method proposed in [8], which achieves a scalable decomposition of the subframes into pulsed and noise components using an SVD based approach.
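The following sketch illustrates the two front-end steps just described: segmentation of a residual frame into non-overlapping pitch-length sub-frames, and a pulsed/noise split of the sub-frame evolution. It is only a minimal illustration under our own assumptions: the function names, the zero-padding of sub-frames to a common length, and the rank-1 SVD split are simplifications for exposition, not the actual critical sampling method of [6] or the scalable decomposition of [8].

import numpy as np

def segment_into_pitch_subframes(residual, pitch_marks):
    # Split a fixed-length residual frame (e.g. 25 ms) into non-overlapping,
    # critically sampled pitch-length sub-frames. `pitch_marks` are the
    # sample indices of the sub-frame boundaries (assumed given by a pitch tracker).
    bounds = [0] + list(pitch_marks) + [len(residual)]
    return [residual[b0:b1] for b0, b1 in zip(bounds[:-1], bounds[1:]) if b1 > b0]

def pulsed_noise_decomposition(subframes):
    # Toy split of the sub-frame evolution into a "pulsed" part (dominant,
    # slowly evolving, here a rank-1 SVD approximation) and a "noise" remainder.
    # Sub-frames are zero-padded to a common length purely for this sketch.
    L = max(len(s) for s in subframes)
    M = np.array([np.pad(s, (0, L - len(s))) for s in subframes])   # rows = sub-frames
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    pulsed_full = np.outer(U[:, 0] * S[0], Vt[0])
    noise_full = M - pulsed_full
    pulsed = [pulsed_full[i, :len(s)] for i, s in enumerate(subframes)]
    noise = [noise_full[i, :len(s)] for i, s in enumerate(subframes)]
    return pulsed, noise

# Example usage on synthetic data (one 25 ms frame at 8 kHz, pitch period ~50 samples):
frame = np.random.randn(200) * 0.05
frame[::50] += 1.0                       # crude pitch pulses every 50 samples
subframes = segment_into_pitch_subframes(frame, pitch_marks=[50, 100, 150])
pulsed, noise = pulsed_noise_decomposition(subframes)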

Figure 1: Comparison of residual domain MER (MER versus model order).

The net result of these operations is that the residual signal is reduced to a parametric representation (i.e. pulse and noise). However, in contrast to traditional parametric coding algorithms where time asynchrony is introduced (such as WI and MELP), the critical sampling of the residual signal maintains time synchrony with the input signal and thus preserves the possibility of using AbyS to model the parameters. If AbyS is now used to model the pulsed component, at low bit rates this operation is concerned only with reproducing a pulse. Further, if a pulse model that naturally represents the shape of the residual pulse (such as a zinc pulse [9]) is used in the AbyS operation, a scalable representation of the residual can be achieved. AbyS coding using a zinc model is detailed in [9], but the basis used in our work involves representing each pitch length pulsed component by minimising:

$e(n) = X(n) - Z(n) = X(n) - \sum_{i=1}^{P} z_i(n) * h(n)$    (1)

where $h(n)$ is the impulse response of the LP synthesis filter, $X(n)$ is the input pulsed component in the speech domain, $Z(n)$ is the representation of the pulsed component in the speech domain, $z_i(n)$ is a zinc pulse and $P$ is the order of the zinc model (number of pulses).

4. PRACTICAL RESULTS

This section concentrates on the scalable representation of the pulsed component of the pitch length sub-frames, and depends on the technique proposed in [5] for representation of the noise component as gain shaped Gaussian noise. Our reference point is residual synthesized from a limited direct PCM coding of each residual pulsed sub-frame (using a limited set of samples centred on the residual domain pulse); we refer to this approach as 'Direct Modeling' as it simulates direct representation of the residual domain signal with varying degrees of accuracy. We then compare the error of such an approach with AbyS modelling of the pulsed sub-frames using both impulse and zinc [9] pulse models. We performed the comparisons on a cross-section of sentences from the TIMIT database. For each of the pulse models used in AbyS, the analysis order was varied, and in the Direct Modeling, for comparison, the number of adjacent positions transmitted was altered. For each modeling approach the Mean Error Ratio (MER), defined as the ratio of MSE to mean input energy for each pitch length sub-frame, was calculated according to:

$\mathrm{MER} = \dfrac{\frac{1}{N}\sum_{n=1}^{N}\left(X(n)-\hat{X}(n)\right)^{2}}{\frac{1}{N}\sum_{n=1}^{N}X(n)^{2}}$

where $N$ is the number of samples in the sub-frame and $\hat{X}(n)$ is the modelled waveform. The MER was computed for both the residual and speech waveforms and the resultant MERs for each model averaged over all sentences.

Figure 2: Comparison of speech domain MER (MER versus model order).

Figures 1 and 2 show residual and speech domain MER results respectively. The model orders in Figures 1 and 2 represent the number of pulses per sub-frame for the zinc and impulse methods and, for Direct Modeling of the residual, the number of transmitted samples centred around the residual pulse according to the following key:

Model order:          1   2   3   4   5
Samples transmitted:  7   9   11  13  15

These sample numbers were chosen such that an order of 1 indicates three samples on each side of the pulse, order 2 four samples, etc. They provide a comparable waveform-matching reference point for the pulsed models. Comparing Figures 1 and 2 it is evident that, for pulsed models (as with waveform matching), minimizing the MSE in the residual domain is not analogous to minimizing the MSE in the speech domain.
In fact, the pulse models consistently reduce the speech domain error as the order of the model is increased, whilst the residual domain error for the same pulse models remains almost constant. For direct modelling of the residual the opposite is true. The residual domain error (which is quite small even for the lowest model order, indicating that the method is capturing the majority of the residual domain pulse) is consistently reduced as the model order is increased; however, a corresponding reduction in the speech domain error is not achieved. Moreover, for some individual sentences, increasing the order of the direct residual modelling achieved a reduction in the residual domain MER but resulted in a worsening of the speech domain error. This never occurred in our test set for the pulse models minimized in the speech domain; increasing the model order always reduced the overall speech domain error. Comparing the error values for the different methods in Figure 2 shows that zinc and impulse models using 2 and 3 pulses per sub-frame respectively achieved a lower error value than the highest order of direct modelling, which uses 15 adjacent samples.
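As a concrete illustration of Eq. (1) and the MER measure used above, the sketch below fits P zinc pulses to one pitch-length pulsed sub-frame by minimising the speech-domain error through the LP synthesis filter, using a greedy position search with least-squares amplitudes. It is a hedged reconstruction, not the authors' implementation: the zinc pulse form z(n) = A·sinc(n - n0) + B·cosc(n - n0) follows the definition commonly attributed to [9], and the function names, the exhaustive position search and the use of scipy.signal.lfilter are our own assumptions.

import numpy as np
from scipy.signal import lfilter

def zinc_pulse(n, n0, A, B):
    # Assumed zinc basis pulse: A*sinc(n - n0) + B*cosc(n - n0), cf. [9].
    t = np.asarray(n - n0, dtype=float)
    sinc = np.sinc(t)                                   # sin(pi t) / (pi t)
    safe_t = np.where(t == 0.0, 1.0, t)
    cosc = np.where(t == 0.0, 0.0, (1.0 - np.cos(np.pi * safe_t)) / (np.pi * safe_t))
    return A * sinc + B * cosc

def abys_zinc_fit(X, lpc_a, P):
    # Greedy Analysis-by-Synthesis fit of P zinc pulses to one sub-frame.
    #   X     : target pulsed component in the speech domain (X(n) in Eq. (1))
    #   lpc_a : LP synthesis filter denominator coefficients [1, a1, ..., ap]
    #   P     : model order (number of zinc pulses)
    # Returns the pulse parameters and the speech-domain error e(n) of Eq. (1).
    L = len(X)
    n = np.arange(L)
    e = np.asarray(X, dtype=float).copy()
    params = []
    for _ in range(P):
        best = None
        for n0 in range(L):                             # exhaustive position search
            # Speech-domain images of the two basis functions (z * h in Eq. (1)).
            s_img = lfilter([1.0], lpc_a, zinc_pulse(n, n0, 1.0, 0.0))
            c_img = lfilter([1.0], lpc_a, zinc_pulse(n, n0, 0.0, 1.0))
            H = np.stack([s_img, c_img], axis=1)
            coeffs, *_ = np.linalg.lstsq(H, e, rcond=None)   # optimal A, B at this position
            contrib = H @ coeffs
            err = float(np.sum((e - contrib) ** 2))
            if best is None or err < best[0]:
                best = (err, n0, coeffs, contrib)
        _, n0, coeffs, contrib = best
        params.append((n0, coeffs[0], coeffs[1]))
        e = e - contrib                                 # remove this pulse's contribution
    return params, e

def mean_error_ratio(X, X_hat):
    # MER: ratio of MSE to mean input energy for one pitch-length sub-frame.
    return np.mean((np.asarray(X) - np.asarray(X_hat)) ** 2) / np.mean(np.asarray(X) ** 2)

In this kind of fit, increasing P simply adds pulses to the model, which is the sense in which the representation scales with model order in Figures 1 and 2.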

Figure 3: Residual domain pulse comparison (residual, zinc pulse and direct residual estimates; time in samples).

Figure 4: Speech domain pulse comparison (speech and estimates; time in samples).

Figure 2 also indicates that the zinc pulse model using only a single pulse per sub-frame almost matched the error achieved using 7 adjacent samples for direct modelling. The results in Figure 2 show a clear scalability with order, in terms of error minimisation, for the pulse models calculated in the speech domain. However, at low rates it is the parametric representation of the pulse shape (and hence its smoothness) that is perceptually important. Figures 3 and 4 compare residual and speech domain waveform modelling using both a single zinc pulse and direct residual modelling of 7 adjacent samples. Figure 4 indicates that a better representation of the speech pulse shape is achieved by the zinc pulse model. This is in spite of there being only a single pulse used in the model. Further, this suggests that, even in a parametric sense (where the MSE is less relevant), pulse modelling of the pitch length segments by minimising the error in the speech domain produces a very good reproduction of the pulse shape. To investigate this further, a single zinc parameter per 25 ms frame was quantised using 10 bits and interpolated for each pitch length sub-frame. The position of the pulse in each sub-frame was fixed. This amounts to a 400 bps representation of the voiced speech. Informal listening tests indicated that the synthesized speech sounded clear and natural.

Figure 3 gives a useful insight into the fact that minimising the error in the speech domain using fixed order pulse models does not necessarily minimise the residual domain energy. The zinc pulse in Figure 3 is positioned before the main residual pulse and thus has a large MSE in that domain. In contrast, the zinc speech domain pulse in Figure 4 is a good approximation of the original waveform. The results indicate that using pitch length sub-frames and pulse models with parameters calculated in a closed-loop AbyS system generates a scalable method for reproducing voiced speech. This contrasts with attempting to achieve scalability through increasing the accuracy of residual domain modeling; a process that may, in practice, offer very little improvement in the speech representation.

5. CONCLUSION

The results indicate that employing parametric pulse models in an AbyS structure, which is restricted to modeling pulsed, pitch length subframes, does provide scalability across the artificial bit-rate divide between parametric and waveform coders. The scalability of the representation is achieved by varying the order of the pulse model used (the number of pulses per subframe) in synthesizing the pulsed subframes. We suggest that these results call into question the approach, advocated in [2] and [6], of deriving scalability from pushing parametric coding techniques (such as WI) to higher rates. Instead, we propose that adapting higher-rate AbyS algorithms to the use of pulse model parameter optimization, and then scaling the quantization of those models, is more appropriate. However, while the modeling approaches may differ, it is worth noting that the pitch-synchronous, critical sampling approach of techniques intended to span the bit-rate divide is a common factor.

6. REFERENCES

[1] J. Stachurski and A. McCree, "A 4 kb/s hybrid MELP/CELP coder with alignment phase encoding and zero-phase equalization", Proc. of ICASSP 2000, Vol. 3, pp. 1379-1382, 2000.
[2] T. Eriksson and W.B. Kleijn, "On waveform-interpolation coding with asymptotically perfect reconstruction", Proc. of IEEE Workshop on Speech Coding, pp. 93-95, 1999.
[3] B.S. Atal, "Predictive coding of speech at low bit rates", IEEE Trans. on Comm., Vol. COM-30, pp. 600-614, April 1982.
[4] J. Thyssen, G. Yang, et al., "A candidate for the ITU-T 4 kbit/s speech coding standard", Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, pp. 681-684, 2001.
[5] G. Kubin, B.S. Atal and W.B. Kleijn, "Performance of noise excitation for unvoiced speech", Proc. of IEEE Workshop on Speech Coding for Telecommunications, pp. 35-36, 1993.
[6] N.R. Chong-White, Novel Analysis, Decomposition and Reconstruction Techniques for Waveform Interpolation Speech Coding, PhD Thesis, University of Wollongong, 2000.
[7] W.B. Kleijn and J. Haagen, "A speech coder based on decomposition of characteristic waveforms", Proc. of IEEE Conf. on Acoustics, Speech and Signal Processing, Vol. 1, pp. 508-511, 1995.
[8] J. Lukasiak and I.S. Burnett, "Low delay scalable decomposition of speech waveforms", Proc. of the 6th International Symposium on Digital Signal Processing for Communication Systems, DSPCS 2002, pp. 12-15, January 2002.
[9] R.A. Sukkar, J.L. LoCicero and J.W. Picone, "Decomposition of the LPC excitation using the zinc basis functions", IEEE Trans. on Signal Processing, Vol. 37, pp. 1329-1341, Sept. 1989.