Techniques for low-rate scalable compression of speech signals


University of Wollongong Research Online, University of Wollongong Thesis Collection, 2002.

Recommended citation: Lukasiak, Jason, "Techniques for low-rate scalable compression of speech signals", Doctor of Philosophy thesis, Faculty of Informatics, University of Wollongong, 2002.


Techniques for Low-rate Scalable Compression of Speech Signals

A thesis submitted in fulfillment of the requirements for the award of the degree Doctor of Philosophy from the University of Wollongong

by Jason Lukasiak, Bachelor of Engineering (Honours I)

School of Electrical, Computer and Telecommunications Engineering
February 2002

Abstract

With the digitisation of most communication channels and an ever-increasing demand for mobile communication services, the amount of traffic generated by coded speech signals continues to grow rapidly. To accommodate this increased traffic load in the finite bandwidth available for speech communication, it is necessary to develop speech compression algorithms that can dynamically scale to traffic and user demands. These scalable compression algorithms should be capable of dynamically altering the bit rate required for transmission, whilst smoothly and gradually varying the subjective quality of the synthesized speech with the changes in bit rate. To further increase the throughput of the communication channel, the scalable algorithm should operate in the lower range of bit rates currently used for speech compression (i.e. 2 to 8 kbps).

This thesis proposes a number of scalable speech coding techniques that lead to the development of a single coding algorithm that is capable of scalable operation. Firstly, the characteristics of existing speech compression algorithms that limit scalable operation between bit rates of 2 and 8 kbps are identified. The major limiting characteristics are identified as: 1) the existence of a distinct barrier at 4 kbps, below which parametric coders dominate and above which waveform coders dominate; 2) the large delay requirements of current low rate coding algorithms.

A method that exploits the simultaneous masking property of the human ear in a linear predictive filter is proposed. The proposed method modifies the linear predictive filter to remove more of the perceptually important information from the input signal than a standard linear predictive filter. This characteristic is shown to improve the subjective speech quality of low-rate linear prediction based speech coders.

To enable the pitch cycle redundancies of the speech signal to be exploited in the coding algorithm, without introducing excessive algorithmic delay, a novel low delay method for segmenting the speech into non-overlapped pitch length subframes is proposed. This method requires only a single frame of speech and locates the pitch pulses by selecting the pulse locations in a closed loop function. The proposed segmentation is shown to produce a much more accurate pitch track in transient sections of the speech signal than the pitch track produced by traditional autocorrelation based pitch detectors.

A number of low delay decomposition techniques are proposed which decompose the speech into perceptually different components and allow scalable reconstruction of the speech signal. The preferred technique performs the decomposition in a closed loop function, allowing quantisation errors to be accounted for in the decomposition process.

The proposed scalable techniques are combined to produce a scalable algorithm that operates at a range of bit rates from 2 to 8 kbps. The synthesized speech quality produced by the scalable algorithm varies smoothly as the operating bit rate is varied. A key feature of the proposed algorithm is the ability to migrate from a time asynchronous parametric coder at low rates, to a time synchronous waveform coder at higher bit rates. The coder also requires only a single frame of algorithmic delay (30 ms) for operation. Subjective results presented indicate that the scalable coder produces subjective speech quality that is comparable with that achieved by fixed rate standardized coders at each of the tested bit rates.

Statement of Originality

This is to certify that the work described in this thesis is entirely my own, except where due reference is made in the text. This document has not been submitted for qualifications at any other academic institution.

Signed, Jason Lukasiak, 16 February 2002

Acknowledgements

Firstly, I would like to thank my supervisor, Dr. Ian Burnett, for his guidance and support throughout my research. To my parents, I would like to wholeheartedly thank you for your support. You were the pillars upon which I could lean when it seemed there was no end in sight. To my long-suffering wife Janine, thank you for your unwavering love and support, both emotional and financial. Without your support and patience I would not have had the possibility to chase my ambitions. To my baby girl Brooke, I hope I make you proud. And finally, to my grandparents, who endured untold hardships and sacrifices to give me the opportunities I now have: I dedicate this work to you all.

Contents

Abstract
Statement of Originality
Acknowledgements
Contents
List of Figures
List of Tables
List of Abbreviations

Chapter 1 Introduction
    1.1 Overview
    1.2 Thesis Outline
    1.3 Contributions
    1.4 Publications
        Journal Publications
        Conference Publications
        Patents

Chapter 2 Literature Review
    2.1 Introduction
    2.2 Speech Perception and Production
        Perception of hearing
        Speech production
    2.3 Linear prediction
        Speech specific Linear Prediction
        Minimum variance distortionless response (MVDR)
    2.4 Waveform Speech coders
        Pulse Code Modulation (PCM)
        Linear Prediction Analysis by Synthesis (LPAS) speech coders
        Multi Pulse Excitation (MPE)
        Code Excited Linear Predictive (CELP)
    2.5 Parametric coders
        Linear Predictive Parametric Coders
        Waveform Interpolation Coders
        Sinusoidal Speech Coders
        Multi-band and Mixed excitation Coders
    2.6 Hybrid and Scalable coders
    2.7 Summary

Chapter 3 Source Enhanced Linear Prediction of Speech Incorporating Simultaneously Masked Spectral Weighting
    3.1 Introduction
    3.2 Simultaneously Masked Spectral Weighting Linear Prediction Coefficients (SMWLPC)
        Motivation
        Method
        The masking threshold function
    Mathematical Analysis of SMWLPC
    Computational Complexity
    Data Windowing Requirements
    Experimental Results
        Objective Results
        LPC Spectral Estimate
        Quantisation Properties
        Subjective Listening Tests
    Summary

Chapter 4 Scalable Representation of the Linear Prediction Residual
    4.1 Introduction
    4.2 Low Delay, Real Time pitch synchronous sampling of the speech waveform
        Introduction
        The Zinc Pulse Method
        Practical Results
        Complexity
        Summary for Low Delay Pitch Synchronous sampling of the Speech Waveform
    4.3 Scalable Coding of the Linear Prediction Residual for Speech
        Method for scalable coding of the LP speech residual
        Practical Results
        Summary of scalable coding of the LP speech residual
    Summary

Chapter 5 Scalable Decomposition of Speech Waveforms
    5.1 Introduction
    5.2 Linear Filtering
    5.3 Long Term Prediction
    5.4 Analytic Decomposition of Speech Signals
        Introduction
        Decomposition Characteristics
        Sensitivity of decomposed parameters to Quantisation
        Summary for Analytic Decomposition of Speech
    5.5 Singular Value Decomposition of Speech Signals
        Introduction
        Method
        Decomposition Characteristics
        Distribution of Singular Values
        Decomposition Measures
        Summary for SVD of Speech
    5.6 Pitch-Synchronous, Zinc Basis function Decomposition of Speech Signals
        Introduction
        Method for Pitch Synchronous zinc decomposition of Speech
        Summary for Pitch Synchronous zinc basis decomposition of Speech
    Summary

Chapter 6 A Scalable Coding Architecture
    6.1 Introduction
    6.2 Structure of the Scalable Coder
        Scalable Analysis structure
        Scalable Synthesis Structure
        Selection of the CW boundaries
    6.3 Scalable Pitch track Quantisation
        Introduction
        Properties of the Pitch Track Parameter
        Low bit rate Pitch track quantisation
        High bit Rate Pitch Track
    Scalable quantisation of the Pulsed CWs (PCWs)
        Characteristics of the PCW parameters
        Low bit rate Quantisation and Synthesis of the PCWs
            Description of the Low bit rate Quantiser
            Bit allocation and Quantiser Performance
        Mid bit rate Quantisation and Synthesis of the PCWs
            Description of the Mid bit rate Quantiser
            Bit allocation and Quantiser Performance
        High bit rate Quantisation and Synthesis of the PCWs
            Description of the High bit rate Quantiser
            Bit allocation and Quantiser Performance
        Constraining the CW evolution
    Scalable quantisation of the Noise CWs (NCWs)
        Characteristics of the NCW
        Parameterisation of the NCW vectors
        Gain Parameter Quantisation Characteristics
        Low Rate NCW Quantisation
        High Rate NCW quantisation
        Synthesis of the NCWs
    Summary

Chapter 7 Practical Results for the Scalable Coder
    7.1 Introduction
    Results for the Scalable Coder at 2.4 kbps
        Bit Allocation and MOS test Results at 2.4 kbps
        Discussion of the scalable coder performance at 2.4 kbps
    Results for the Scalable Coder at 4 kbps
        Bit Allocation and MOS test Results at 4 kbps
        Discussion of the scalable coder performance at 4 kbps
    Results for the Scalable Coder at 6 kbps
        Bit Allocation and MOS test Results at 6 kbps
        Discussion of the scalable coder performance at 6 kbps
    Dynamic switching of the Scalable Coder bit rates
    Summary

Chapter 8 Conclusions and Future Research
    Overview
    Increased exploitation of Human perceptual hearing properties in low rate speech coding
    Scalable analysis and synthesis of the Linear Predictive residual
    Low delay scalable decomposition techniques
    Low rate Scalable Speech Coder
        Scalable Analysis, Synthesis and Quantisation
        Performance of the Scalable Coder
    Future Work

References

Appendix A
Appendix B

List of Figures

2.1 Comparison of Voiced speech and Magnitude spectrum
2.2 Comparison of Unvoiced speech and Magnitude spectrum
2.3 Source-system model of speech production
2.4 Comparison of speech and LP residual
2.5 Comparison of speech spectrum and LP filter spectrum
2.6 Block diagram of LPAS coder
2.7 Block diagram of a CELP coder
2.8 Source filter model for speech synthesis
2.9 Block diagram of the WI encoder
2.10 Examples of CW surface, SEW and REW
2.11 Block diagram of WI decoder
3.1 Functional Block diagram of SMWLPC
3.2 Block diagram of the masking threshold function
3.3 Inter band spreading function
3.4 Example of masking threshold function
3.5 Window comparison for window size of 200 samples
3.6 Window comparison for window size of 240 samples
3.7 Comparison of SMWLPC and Standard LPC spectral estimates
3.8 Difference in Weighted Residual energy
4.1 Comparison of residual pulse to zinc pulse
4.2 Comparison of Error Functions
4.3 Candidate pulse location for a section of voiced speech
4.4 Candidate pulse location for a section of unvoiced speech
4.5 Example Rectangular Pulse Error Function
4.6 Example of normalized cross correlation function
4.7 Comparison of proposed and traditional pitch tracks for male speech
4.8 Comparison of proposed and traditional pitch tracks for female speech
4.9 Transitional pitch tracks for male speech
4.10 Transitional pitch tracks for female speech
4.11 Comparison of Residual Domain MER
4.12 Comparison of Speech Domain MER
4.13 Residual domain pulse Comparison
4.14 Speech domain pulse Comparison
5.1 Comparison of CW surface to SEW
5.2 Analytic representation of speech
5.3 Evolution of Analytic speech parameters
5.4 Inter-frame distribution of Singular values
5.5 Comparison of speech surfaces
5.6 Er for Male Voiced Input
5.7 Er for Male Voiced Input
5.8 Er for Male Voiced Input
5.9 Er for Female Voiced Input
5.10 Er for Female Voiced Input
5.11 Er for Female Voiced Input
5.12 Er for Long pitched Voiced Input
5.13 Er for Unvoiced Input
5.14 Pitch versus optimal model order for zinc decomposition
5.15 Er for the input files using the linear model order estimate
5.16 Pitch Synchronous zinc decomposition performance in voiced speech
5.17 Pitch Synchronous zinc decomposition performance in unvoiced speech
5.18 Comparison of transitional speech and the pulsed component from the zinc decomposition and linear filtering
6.1 Block Diagram of the Scalable Analysis Structure
6.2 Block Diagram of Scalable Synthesis Architecture
6.3 Comparison of interpolated and input pitch tracks
6.4 Comparison of interpolated and input pitch tracks
6.5 Flow diagram of iterative pitch track estimation
6.6 Comparison of high rate and original pitch tracks
6.7 Magnitude distribution of the Zinc A parameter
6.8 Magnitude distribution of the Zinc B parameter
6.9 Comparison of synthesized speech for the proposed low rate quantisation schemes
6.10 Comparison of synthesized speech for the proposed mid rate quantisation schemes
6.11 Comparison of synthesized speech for the proposed high rate quantisation schemes
6.12 Comparison of synthesized speech for the constrained and unconstrained high rate Pulsed CW quantisation
6.13 Comparison of original speech and synthesized speech produced using only unquantised noise CWs
6.14 Comparison of mean distribution of the Noise CW and Gaussian CW
6.15 Distribution of Noise CW Magnitudes
6.16 Example of OLA method used for extracting the Noise CWs
7.1 Comparison of synthesized speech for 2.4 kbps coders
7.2 Comparison of synthesized speech for 4 kbps MOS test coders
7.3 Comparison of synthesized speech for 6 kbps MOS test coders
7.4 Effects of Dynamic bit rate switching

List of Tables

3.1 Percentage greater WRE removed by SMWLPC
3.2 Average spectral distortion for the predicted LSF vector
3.3 A/B Comparison Results for the FS1016 CELP Coder
3.4 A/B Comparison Results for the WI Coder
3.5 Majority preferred sentences for SMWLPC
5.1 Ratios of REW to SEW energy
5.2 Sensitivity of Analytic parameters to quantisation
5.3 Evolutionary bandwidth of Analytic parameters
5.4 Ratio of noise energy to periodic energy component as a percentage
5.5 Average pitch versus optimal model order
5.6 Linear order estimate versus actual model order
5.7 Comparison of SNR for zinc modelling and linear filtering
6.1 Relationship between pitch length and the number of CW/frame
6.2 Bit requirements for high rate pitch track
6.3 Inter frame correlation of the zinc magnitude parameters
6.4 Intra frame correlation of the zinc magnitude parameters
6.5 Quantised pulses for each pitch range
6.6 Bit allocation for low rate pulsed CW quantisation
6.7 Average SNR for low rate quantisation schemes
6.8 Bit allocation for mid rate pulsed CW magnitude quantisation
6.9 Average SNR for mid rate quantisation
6.10 Bit allocation for high rate Pulsed CW magnitude quantisation
6.11 Bit allocation for the High rate Pulse CW quantization scheme
6.12 Average SNR for high rate quantisation
6.13 Inter frame correlation of Noise CW gains
6.14 Bit allocation for Noise CW gain parameters
6.15 Bit allocation for high rate Noise CW Gain quantisation
7.1 2.4 kbps scalable coder bit allocation
7.2 2.4 kbps MOS test results
7.3 Bit allocation for 4 kbps scalable coder
7.4 4 kbps MOS test results
7.5 Bit allocation per frame for 6 kbps scalable coder
7.6 6 kbps MOS test results
A1 Male sentence 1 Speech domain MER
A2 Male sentence 2 Residual domain MER
A3 Male sentence 3 Speech domain MER
A4 Male sentence 4 Residual domain MER
A5 Female sentence 1 Speech domain MER
A6 Female sentence 2 Residual domain MER
A7 Female sentence 3 Speech domain MER
A8 Female sentence 4 Residual domain MER

List of Abbreviations

AbyS: Analysis by Synthesis
ACELP: Algebraic Code Excited Linear Prediction
ADPCM: Adaptive Differential Pulse Code Modulation
CELP: Code Excited Linear Prediction
CS-ACELP: Conjugate Symmetry Algebraic Code Excited Linear Prediction
CW: Characteristic Waveform
DVSI: Digital Voice Systems Incorporated
ETSI: European Telecommunications Standards Institute
FFT: Fast Fourier Transform
GSM: Global System for Mobile Communication
IIR: Infinite Impulse Response
IMBE: Improved Multi-band Excitation Coder
ITU-T: International Telecommunications Union, Telecommunication standardisation sector
kbps: kilobits per second
LP: Linear Prediction
LPAS: Linear Prediction Analysis-by-Synthesis
LPC: Linear Prediction Coefficients
LPPC: Linear Predictive Parametric Coder
LSF: Line Spectral Frequency
LTP: Long Term Predictor
MBE: Multi-band Excitation
MELP: Mixed Excitation Linear Prediction
MPE: Multi-Pulse Excitation
MPEG: Motion Pictures Expert Group
MSE: Mean Squared Error
NCW: Noise Characteristic Waveforms
PCM: Pulse Code Modulation
PCW: Pulsed Characteristic Waveforms
PWI: Prototype Waveform Interpolation
RCELP: Relaxed Code-Excited Linear Prediction
REW: Rapidly Evolving Waveform
RPE: Regular Pulse Excitation
SEW: Slowly Evolving Waveform
SMWLPC: Simultaneously Masked Spectral Weighting Linear Prediction Coefficients
SNR: Signal to Noise Ratio
VDVQ: Variable Dimension Vector Quantisation
VQ: Vector Quantisation
WI: Waveform Interpolation

Chapter 1 Introduction

1.1 Overview

The past decade has seen an explosion in the quantity of coded speech signals being transmitted across various media. This dramatic increase in digitized speech traffic can be attributed to both the transition to digital communication systems and the dramatic increase in popularity of mobile communication devices, such as mobile telephones. Also contributing to the increased presence of coded speech has been the use of shared medium networks (previously the sole domain of data traffic), such as corporate Local Area Networks (LANs), for the transmission of speech signals. The increased popularity of digital speech transmission has placed increased demands on network operators, both in terms of the bandwidth available for voice traffic and in terms of user demands for quality of service. To increase throughput and thus better utilize the available bandwidth, virtually all current speech communication devices use compression algorithms that attempt to maintain the quality of the synthesized speech whilst reducing the bit rate required for transmission of the signal. However, using compression algorithms that operate at a constant bit rate does not allow the network to adapt to congestion or individual user requirements. A better solution for optimal

utilization of the finite bandwidth available for speech transmission is to develop speech compression algorithms that are capable of scaling their transmission bit rate. Scalable algorithms will allow increased connectivity by reducing the bit rate for all users in times of high congestion (demand). Alternatively, a user could elect to pay a premium and have a guarantee of transmission quality and bit rate. In order for a scalable algorithm to gain widespread practical acceptance, it is essential that the speech quality produced by the algorithm changes in a smooth and gradual fashion as the bit rate available for transmission is varied. The delay required by the scalable algorithm must also be small enough for operation in real time speech communications.

Development and analysis of the concepts required to produce a speech compression algorithm capable of operating across a range of bit rates from 2 to 8 kbps is the focus of this thesis. This range of bit rates presents a significant impediment to scalable speech coding in that within the range (at approximately 4 kbps) there is a barrier below which low rate parametric coders dominate, and above which high rate waveform coders dominate. It is thus essential that either a single algorithm operating across this range of bit rates merge these two conflicting coding methodologies, or alternatively that an entirely new paradigm be developed.

Within this thesis, several methods for achieving low delay scalable speech compression algorithms are examined. These can be grouped into:

i) Perceptual techniques, whereby the psychoacoustic properties of the human ear are better exploited in low rate speech coders.

ii) Analysis techniques, allowing the speech waveform to be represented as a parameter set capable of reproducing the speech signal in a scalable

fashion. These techniques must limit the algorithmic delay that they introduce.

iii) Low delay decomposition techniques, whereby the signal is separated into a number of components, each of which exhibits a distinctly different perceptual characteristic. The separate components allow the speech to be reconstructed in a scalable manner.

iv) Scalable quantisation techniques, providing a mechanism for scalable representation of the individual parameters and thus allowing efficient scalable transmission of the parameters.

v) Scalable synthesis techniques, whereby a single synthesis loop is suitable for producing synthesized speech using quantised parameters that are received at a scalable range of rates.

1.2 Thesis Outline

This thesis is organized as follows. Chapter 2 presents a background and critical review of speech coding techniques, with particular regard placed on the ability of current algorithms to operate at a range of bit rates scalable between 2 and 8 kbps. Initially, a review of speech production and the perceptual properties of the human ear is given. Linear prediction, with regard to speech coding, is then analysed. This analysis exposes an opportunity to exploit the perceptual characteristics of hearing in speech coding through modification of the linear predictive coefficients. The current literature on speech compression algorithms is then critically reviewed. This review reveals that current speech compression algorithms can be broadly grouped into either parametric or waveform coders, and that a barrier exists at approximately 4 kbps, below which parametric coders dominate and above which waveform coders dominate. Some hybrid coders,

which switch between parametric and waveform coders, have been proposed, but no single compression algorithm that can operate across the barrier between parametric and waveform coders has been found.

In Chapter 3, a new technique that incorporates increased exploitation of the perceptual properties of the human ear into linear prediction is proposed. The proposed technique uses a psychoacoustic model to determine the simultaneously masked frequencies and uses this information to weight the calculation of the linear prediction coefficients. Thorough subjective test results presented indicate that the proposed modification removes more perceptually important information from the speech signal than conventional linear prediction. The net result is shown to be that, if the linear predictor in a low rate speech coder is replaced with the proposed linear predictor, the subjective quality of the synthesised speech produced by the coder is increased.

Chapter 4 details properties that restrict current speech coders from operating at a scalable range of bit rates, and subsequently proposes methods which can be used to improve the scalability of speech compression algorithms. The first of these methods is a low delay mechanism for critically segmenting the speech waveform into pitch length subframes. This segmentation allows the pitch cycle redundancies of the speech signal to be exploited, whilst maintaining the ability to use closed loop analysis-by-synthesis (AbyS) modelling of the speech signal. The low delay of the proposed mechanism allows the method to be employed in real time speech coding algorithms. The second proposed method uses AbyS modelling of critically sampled pitch length sections of the speech waveform to produce a scalable representation of the speech. The method uses fixed shape pulse models in the AbyS modelling, and objective results presented indicate that the proposed method allows the synthesised speech to be produced in a

manner that is scalable from a very low bit rate parametric representation to a higher bit rate waveform representation.

Three new methods for decomposing the speech waveform into pulsed and noise components are proposed in Chapter 5. Each of the proposed techniques is restricted to operate at a relatively low delay, requiring only a single frame of speech. The decomposition performance and scalability of the resultant waveforms, for the proposed decomposition methods and two well known decomposition methods (linear filtering and long term prediction), are analysed and compared. These results indicate that the proposed techniques present a more flexible set of decomposed waveforms than the well-known techniques, whilst maintaining similar decomposition performance. In addition, the preferred new decomposition method offers the distinct advantage of producing perfect reconstruction of the speech signal when operated in unquantised conditions. Also proposed in Chapter 5 is a low rate coding method that exploits the decomposition characteristics of linear filtering. This coding method is capable of maintaining the subjective quality of a Waveform Interpolation coder whilst reducing the total transmission bit rate.

A speech compression architecture capable of operating at a scalable range of bit rates from approximately 2 to 8 kbps is proposed in Chapter 6. The structure proposed in this chapter allows the synthesized speech to be produced in a perceptually meaningful manner that is scalable with the transmission bit rate. A key feature of the proposed architecture is the ability to migrate from a time asynchronous parametric coder at low rates, to a time synchronous waveform coder at higher bit rates. Significant emphasis is placed on the scalable quantization of the algorithm parameters.

Bit allocations for the architecture of Chapter 6 and subjective test results are presented in Chapter 7. Bit allocations for 2.4 kbps, 4 kbps and 6 kbps variants of the scalable

structure are presented. Thorough subjective test results are presented; these compare the subjective performance of the scalable architecture to standardised speech coders at each of the prescribed bit rates. The performance of the scalable architecture at each of the prescribed rates is then analysed and discussed.

Chapter 8 concludes the thesis with a summary of the major findings and identifies areas of future research.

1.3 Contributions

The major contributions of this thesis are summarised below. The contributions are presented according to the order in which they appear in the thesis. The section where the work is first discussed and the associated publications are shown in parentheses.

- A new method of incorporating simultaneous masking properties into the calculation of linear prediction coefficients. This method allows the linear predictive filter to better exploit the perceptual characteristics of the speech signal than a standard linear predictor. (Chapter 3) [Luka99] [Luka00a] [Luka00c] [Luka01a] [Luka01c]

- A new mechanism for segmenting the speech signal into critically sampled pitch length segments. The proposed mechanism requires a relatively low delay of one frame of input speech and performs the segmentation using an analysis function determined by minimising the pulse position error in the speech domain. (Chapter 4.2)

- A method of representing the linear predictive speech residual in a bit rate scalable fashion. The proposed method allows the synthesised speech to be produced in a scalable fashion from a low rate parametric representation to a high rate waveform representation. (Chapter 4.3) [Lukasub2]

- A method for representing the SEW in a WI coder using an implicit quantisation scheme. The method maintains the perceptual quality of the WI coder whilst reducing the bit rate by 12 percent. (Chapter 5.2) [Luka01a] [Luka01d]

- A method for the scalable decomposition of the speech signal, based on an Analytic transform. The method produces a parameter set that allows scalable reproduction of the signal. (Chapter 5.4) [Luka00b] [Luka00d]

- A decomposition of the speech signal based on Singular Value Decomposition. The method decomposes the speech residual signal into a scalable parameter set. (Chapter 5.5) [Lukasub1] [Luka02]

- A low delay scalable decomposition of the speech waveform, developed and evaluated. The proposed method is based on a fixed shape zinc model and performs the decomposition by minimising the resultant error in the speech domain. Using speech domain decomposition produces a scalable parameter set optimised with reference to the speech signal directly, and also allows quantisation errors and perceptual weighting to be incorporated into the decomposition. (Chapter 5.6)

- A single speech compression algorithm capable of operating at a range of bit rates scalable from 2 to 8 kbps. The proposed algorithm migrates from a time asynchronous parametric coder at low rates, to a time synchronous waveform coder at higher rates. (Chapter 6.2)

- Quantisation schemes capable of quantising the scalable coder parameters in a bit rate scalable manner. (Chapter 6.3)

- Implementation of the scalable coder at 2.4, 4 and 6 kbps, and a thorough comparative analysis of the coder's subjective speech quality. The comparative

analysis involved comparing the Mean Opinion Score of the coder to standardised speech coders at each of the selected bit rates. (Chapter 7)

- A very low rate speech coder based on pulse modelling and temporal decomposition. The proposed coder produced good speech quality at an average bit rate of approximately 1 kbps. [Ritz00]

1.4 Publications

The work of this thesis has resulted in the following publications and patents.

Journal Publications

- J. Lukasiak and I.S. Burnett, "Source Enhanced Linear Prediction of Speech Incorporating Simultaneously Masked Spectral Weighting", Journal of Telecommunications and Information Technology, Special edition on Communications, Vol. 2, December.

- J. Lukasiak and I.S. Burnett, "Low rate WI SEW representation using a REW-implicit pulse model", IEEE Signal Processing Letters, Vol. 8, No. 8, Aug.

- J. Lukasiak and I.S. Burnett, "Scalable Decomposition of Speech Waveforms", submitted to IEEE Signal Processing Letters, 20/8/01.

- J. Lukasiak and I.S. Burnett, "Scalable Coding of the LP residual for Speech", submitted to IEE Electronics Letters, 3/9/.

Conference Publications

- J. Lukasiak, I.S. Burnett, J.P. Chicharo and M.M. Thomson, "Linear prediction incorporating simultaneous masking", Proc. of ICASSP 2000, Vol. 3, 2000.

- J. Lukasiak and I.S. Burnett, "Exploiting simultaneously masked linear prediction in a WI speech coder", Proc. of IEEE Workshop on Speech Coding.

- J. Lukasiak and I.S. Burnett, "Exploring the characteristics of analytic decomposition of speech signals", Proc. of IEEE Workshop on Speech Coding.

- J. Lukasiak and I.S. Burnett, "SEW Representation for low rate WI coding", Proc. of ICASSP 2001, Vol. 2, May 2001.

- J. Lukasiak and I.S. Burnett, "Low rate speech coding incorporating Simultaneously Masked Spectrally Weighted Linear Prediction", Proc. of Eurospeech 2001, CD edition, Scandinavia, September 2001.

- J. Lukasiak and I.S. Burnett, "Low Delay Scalable Decomposition of speech waveforms", accepted for publication in the 6th International Symposium on Digital Signal Processing for Communications, DSPDC 2002, January 2002.

- C.H. Ritz, I.S. Burnett and J. Lukasiak, "Very Low Rate Speech Coding using Temporal Decomposition and Waveform Interpolation", Proc. of IEEE Workshop on Speech Coding.

Patents

- J. Lukasiak and I.S. Burnett, "Method and Apparatus for determining parameters of a model of a Power Spectrum of a digitized waveform", Australian Patent, filed Nov. 1999 (CR1058AC).

- J. Lukasiak and I.S. Burnett, "Method and Apparatus for Decomposing and Compressing a Digitised Waveform", Australian Provisional Patent, filed May 2000.

Chapter 2 Literature Review

2.1 Introduction

Due to the increased adoption of digital communication systems over the past decade, representing the speech signal as a digital signal has become increasingly important. The bandwidth required by a digitised speech signal is proportional to the bit rate used to represent the signal [Klei95]; for example, speech sampled at 8 kHz using 8 bits per sample (Pulse Code Modulation [Jaya84]) requires 64 kbps of bandwidth. This is a large proportion of the available bandwidth for many communications media and thus there is an apparent need for bit rate reduction. This is the realm of speech coding.

Speech coding reduces the bit rate required to transmit speech by exploiting redundancy in the speech signal. Such exploitable redundancy can be due to both the characteristics of the speech signal itself and the perceptual characteristics of human hearing. Common characteristics that are used to remove redundancy from the speech signal are:

- Long-term correlations in the speech, due to the underlying pitch (periodic) component in sections such as vowels.
- Short-term correlations in the speech, due to the effect of the vocal tract on speech production.

- Decomposition of the speech signal into a number of perceptually distinct parameters.
- Simultaneous masking, where a louder sound causes a quieter sound at an adjacent frequency to become inaudible.
- The non-linear frequency response of the human ear, where not all frequencies are perceived with equal loudness.

Together with using the above characteristics to reduce the transmission bit rate, recent developments in digital communication have led to the need to produce speech compression algorithms capable of operating across a scalable range of bit rates. These coders are required to dynamically alter the bit rate used to represent the signal according to the bandwidth that is available on the transmission medium. This scalable characteristic is applicable to increasing the throughput in shared media networks such as computer LANs. The perceptual characteristics of the scalable coders must be such that they exhibit a smooth, gradual change in the perceptual quality of the synthesized speech according to the current operating bit rate.

This chapter is structured as follows. In Section 2.2 an overview of speech production and perception is detailed. Section 2.3 details linear prediction, with particular emphasis placed on speech specific linear prediction. Speech coding algorithms are reviewed in Sections 2.4 to 2.6, with particular regard placed on the ability of the coding algorithms to operate across a scalable range of bit rates. In Section 2.4 waveform coding algorithms are reviewed, whilst Sections 2.5 and 2.6 review parametric and hybrid coding algorithms respectively. Separating the review of speech coding algorithms into these distinct classes does not follow a chronological evolution of speech coding, but rather groups the coding algorithms according to their underlying principles of operation. Finally, a summary of the literature review is given in Section 2.7.

2.2 Speech Perception and Production

2.2.1 Perception of hearing

The human auditory system is a highly complex system. The way humans perceive the sounds present at the ear is governed by a number of non-linear operations, such as:

- Non-linear frequency sensitivity
- Non-linear frequency resolution
- Logarithmic amplitude response
- Masking, causing sounds to become inaudible

Humans can hear sounds in the range of frequencies from approximately 50 Hz to 16 kHz; however, these frequencies are not all perceived with equal sensitivity [Moor97]. This phenomenon leads to a set of equal loudness curves that indicate the perceived loudness of a fixed amplitude tone as the frequency of the tone is varied [Moor97]. Directly related to the equal loudness curves is the threshold of hearing curve. This curve represents the minimum amplitude that is audible at a given frequency.

The frequency resolution of the human ear is not a linear function, but instead follows an almost logarithmic relationship where the frequency resolution decreases as the frequency increases. This logarithmic relationship is due to the fact that the human ear acts as a set of overlapping bandpass filters. These filters are commonly called critical band filters [Hand89] and all frequencies within a critical band are perceived with equal sensitivity [Moor97]. The critical bands are not of equal width but increase in bandwidth as their centre frequency increases. This results in better frequency resolution at lower frequencies. Scharf [Scha70] proposed that the critical band filters could be adequately represented by a set of non-overlapped rectangular filters. This greatly reduces the complexity of critical band analysis.
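The near-logarithmic critical band structure described here is commonly approximated with the Bark scale. The sketch below uses the well known Zwicker and Terhardt analytic approximations rather than anything specific to this thesis, and the function names are assumptions made for illustration:

```python
import numpy as np

def hz_to_bark(f_hz):
    # Zwicker & Terhardt analytic approximation of the Bark (critical band) scale.
    return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

def critical_bandwidth_hz(f_hz):
    # Approximate critical bandwidth at centre frequency f_hz; the bandwidth
    # grows with frequency, so frequency resolution is best at low frequencies.
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69

for f in (100.0, 500.0, 1000.0, 4000.0):
    print(f"{f:6.0f} Hz -> {hz_to_bark(f):5.2f} Bark, CB approx {critical_bandwidth_hz(f):5.0f} Hz")
```

Running this shows critical bandwidths of roughly 100 Hz at low frequencies widening to several hundred hertz by 4 kHz, which is the behaviour the prose above describes.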

The amplitude response of the human ear is not linear but rather follows a logarithmic contour [Moor97]. This results in small amplitude variations in loud sounds being inaudible, whilst similar amplitude variations in quiet sounds are easily perceived. This characteristic is particularly useful for speech coding, as quantisers can be designed to match this logarithmic amplitude response.

Masking occurs when a loud sound causes other softer sounds to become inaudible. Two types of masking can occur: simultaneous and temporal masking. Simultaneous masking occurs when a low intensity but audible sound is made inaudible by a higher intensity sound at an adjacent frequency occurring at the same moment in time. Temporal masking occurs when a loud tone causes softer tones occurring before and after the tone to become inaudible [Moor97].

The above phenomena have been researched extensively and have led to the development of psychoacoustic models that allow the auditory system to be accurately modelled. Descriptions of these models can be found in many well-recognised texts such as [Moor97]. Incorporating the perceptual characteristics of the auditory system into audio and speech coding allows coders to operate more efficiently at reduced bit rates by distributing the available bits to the most perceptually important information. This principle has been exploited extensively in audio coders; a prime example is the MPEG2 AAC audio standard [MPEG2]. Perceptual modelling provides the crux of this coder and allows it to produce high quality audio signals at a relatively low bit rate.

Speech coders in general have not utilised the perceptual characteristics of the ear to the same extent as audio coders. A prime example of this is the Federal Standard 1016 CELP 4.8 kbps coder [Nati]. This is a recognised standard for speech coding; however, it employs perception in only a very limited manner. In an attempt to improve speech quality

by suppressing noise, more recent research such as [Sing98][Virag95] has incorporated critical band analysis and simultaneous masking into speech compression algorithms. Other authors, such as Kohata [Koha98], have reported good results by incorporating equal loudness compensation into their coders. The more extensive use of perceptual techniques in speech coding presents opportunities for improvement in coding efficiency. These concepts have been further explored by authors such as Skoglund et al. [Skog98], who report improvement in the speech quality of a harmonic coder by incorporating temporal masking into the algorithm.

2.2.2 Speech production

This section gives a brief overview of the speech production mechanism. A detailed description can be found in [Flan83, Dene73, Rabi78]. Speech can be considered the output of a time varying system [Rabi78]. The system is primarily made up of an energy source (the lungs), an oscillator (the vocal cords), a frequency shaping mechanism (the vocal tract) and a radiation mechanism (the lips). The interactions of these components combine to produce two broad classes of speech: voiced and unvoiced.

In voiced speech the air from the lungs causes the vocal cords to vibrate, creating a quasi-periodic train of air pulses. This pulse train is then shaped by the vocal tract and finally emitted from the lips. Sounds that fall into this category are all speech that exhibits an underlying periodic component, such as vowels (e.g. /a/, /e/).

Unvoiced sounds have no contribution from the oscillatory nature of the vocal cords, but use turbulent airflow to excite the vocal tract. The unvoiced sounds can be grouped into two broad categories: fricatives and plosives [Flan83, Dene73, Rabi78]. Fricatives use the broad-spectrum noise produced by the turbulent airflow to excite the vocal tract and consist of sounds such as "sh" [Dene73]. Plosives are produced by totally

closing off the vocal tract. Pressure is built up and abruptly released. Examples of plosives are /t/ and /p/ [Flan83, Dene73, Rabi78]. Sounds that are a combination of unvoiced and voiced excitation also exist; examples are voiced fricatives (/v/) [Rabi78] and semi-vowel sounds (/y/) [Dene73].

The shaping mechanism of the vocal tract may be considered equivalent to an acoustic tube of varying cross-section [Flan83, Dene73, Rabi78]. This varying cross-section tube can be modelled as a concatenation of fixed diameter lossless acoustic tubes [Flan83, Rabi78] (this configuration can also be directly related to digital filters [Rabi78]). The acoustic coupling between the respective components of the time varying system that produces speech can be considered small in most instances [Dene73]. This has led to models that separate the speech production components into individual uncorrelated units [Rabi78]. Exploiting these models in speech compression algorithms allows dramatic coding gains over directly representing the speech signal. An example of exploiting speech production models in a compression algorithm is to synthesise speech by exciting an appropriate digital filter with the correct excitation signal (periodic or noise) [Rabi78].

A useful insight into the speech signal can be gained from analyzing the frequency magnitude spectrum of a section of speech. A section of voiced speech and its magnitude spectrum are shown in Figure 2.1, whilst a section of unvoiced speech and its spectrum are shown in Figure 2.2. The shaping effects of the vocal tract produce the underlying shape of the magnitude spectrum in both Figures 2.1 and 2.2. Peaks in the magnitude spectrum shape (such as at 3250 Hz in Figure 2.1) represent vocal tract formants. The fine detail in the magnitude spectrum is representative of the excitation signal responsible for exciting the vocal tract.
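A toy sketch of this source-filter idea follows. The resonance, gains and pitch period below are invented purely for illustration and are not taken from the thesis; the same all-pole "vocal tract" filter is driven by a quasi-periodic pulse train for a vowel-like output, or by noise for a fricative-like one:

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                                     # 8 kHz sampling, typical for telephone speech
r, fc = 0.95, 500.0                           # a single made-up resonance near 500 Hz
theta = 2.0 * np.pi * fc / fs
a = [1.0, -2.0 * r * np.cos(theta), r * r]    # denominator of the all-pole filter G/A(z)

n = np.arange(800)                            # 100 ms of excitation
voiced_exc = (n % 80 == 0).astype(float)      # quasi-periodic pulses: 100 Hz pitch
unvoiced_exc = 0.1 * np.random.randn(len(n))  # turbulent (noise) excitation

voiced = lfilter([1.0], a, voiced_exc)        # vowel-like: harmonic fine structure
unvoiced = lfilter([1.0], a, unvoiced_exc)    # fricative-like: noise fine structure
```

The spectra of the two outputs share the same formant envelope (from the filter) but differ in fine structure (from the excitation), which is exactly the distinction drawn between Figures 2.1 and 2.2 below.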

Figure 2.1: Comparison of Voiced speech (top) and Magnitude spectrum (bottom)

Contrasting the fine detail of the magnitude spectra for Figures 2.1 and 2.2 reveals that Figure 2.1 has a distinct harmonic structure (evidenced by the equally spaced peaks), whilst Figure 2.2 has a non-uniform, noise-like fine detail structure. The harmonic structure of Figure 2.1 is a direct result of the underlying pitch period evident in the speech signal (produced by the vocal cords), whereas the speech of Figure 2.2 has no clear pitch period. Examining Figure 2.1 indicates that the harmonic and formant frequencies are not necessarily related and it should thus be possible to separate these two components. This is the basis of linear predictive coding, which is discussed in Section 2.3.

Figure 2.2: Comparison of Unvoiced speech (top) and Magnitude spectrum (bottom)

2.3 Linear prediction

It has been shown that the combined effects of the vocal tract detailed in Section 2.2.2 can be successfully modelled using a time varying linear filter [Makh75]. Whilst a complete model of the vocal tract has both zeros and poles, and would thus require an Auto Regressive Moving Average (ARMA) filter, the most common and efficient way of generating a linear filter for speech coding is to use a linear prediction filter, which is only an Auto Regressive (AR) filter [Makh75, Makh72, Rabi78]. The popularity of the AR model stems from the facts that, if a sufficient order is used, the filter can accurately model the speech production system, and that the filter parameters can be calculated in a straightforward and efficient manner [Rabi78].

Figure 2.3: Source-system model of speech production (an excitation generator E(z) driving a time varying linear system to give the speech output S(z))

The use of a linear predictor in speech coding relies upon the fact that speech can be modelled as the output of a time varying linear system [Rabi78]. The development of this model is linked to the use of lossless acoustic tubes to represent the speech production process and is detailed in [Rabi78]. Figure 2.3 represents a simplified representation of this model. The transfer function representing the linear system in Figure 2.3 can be described using an all pole (autoregressive (AR)) system as:

$$H(z) = \frac{S(z)}{E(z)} = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}} \qquad (2.1)$$

where p is the order of the filter, G is the gain of the system and the a_k are the predictor coefficients.

To permit linear prediction, the input waveform must be stationary. However, as speech is a random process, some modification is required. It has been determined that a speech signal is stationary over a period of approximately 20-30 ms [Kond95]. Thus, to utilise linear prediction in speech coding, it is necessary to divide the input speech into frames of approximately 20 ms length and update the linear prediction coefficients for each of these frames. A number of methods that solve for the predictor coefficients to achieve a minimum mean square error for a given frame have been developed [Rabi78]. The most popular of these is the autocorrelation method.

The Mean Square Error (MSE) solution for the standard LPC (a_k) can be reduced, using the autocorrelation method [Mark72], to:

$$\sum_{k=1}^{p} a_k R(|i-k|) = R(i), \qquad i = 1, \ldots, p \qquad (2.2)$$

where R(.) is the autocorrelation function of the input speech frame. The recursive Levinson-Durbin algorithm [Makh75] is then used to solve for the filter coefficients a_k.

The frequency domain representation of the mean squared prediction error for the autocorrelation method is calculated as [Rabi78]:

$$E = \frac{G^2}{2\pi} \int_{-\pi}^{\pi} \frac{\left| S(e^{j\omega}) \right|^2}{\left| H(e^{j\omega}) \right|^2} \, d\omega \qquad (2.3)$$

where G is the filter gain, S(e^{j\omega}) is the input speech in the frequency domain and H(e^{j\omega}) is the frequency response of the filter. Examining (2.3) reveals that the predictor coefficients are calculated by minimising the ratio of squared error between the speech and filter spectra.

Also, as linear prediction estimates the current sample from a linear combination of the previous samples, filtering a speech signal using the linear predictive filter calculated in (2.2) removes the short-term correlations from the speech signal. This has the effect of making the resultant residual signal more noise-like than the input speech, and thus improves the efficiency of coding this signal. As these short-term correlations are a result of the vocal tract shaping of the speech excitation signal (as discussed in Section 2.2.2), the frequency response of the AR linear predictive filter is a good estimate of the vocal tract response. This characteristic allows linear prediction to be a successful means of separating the vocal tract influence from the speech excitation signal. Separation of the vocal tract and excitation signal is advantageous for coding, as each component can then be coded according to its individual characteristics.
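Returning to the autocorrelation method: a minimal, self-contained sketch of the Levinson-Durbin solution of (2.2) is given below. It follows the textbook recursion; the windowing choice and function name are assumptions, and the returned polynomial uses the error-filter convention A(z) = 1 + A[1]z^-1 + ... + A[p]z^-p, so the predictor coefficients a_k of (2.2) are the negatives of A[1..p].

```python
import numpy as np

def lpc_levinson_durbin(frame, p=10):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion."""
    w = frame * np.hamming(len(frame))                      # taper the analysis frame
    R = np.correlate(w, w, mode="full")[len(w) - 1:len(w) + p]   # R(0) ... R(p)
    A = np.zeros(p + 1)
    A[0], err = 1.0, R[0]
    for i in range(1, p + 1):
        k = -(R[i] + np.dot(A[1:i], R[i - 1:0:-1])) / err   # reflection coefficient
        A[1:i] = A[1:i] + k * A[i - 1:0:-1]                 # update lower-order terms
        A[i] = k
        err *= 1.0 - k * k                                  # prediction error shrinks
    return A, err

# e.g. A, err = lpc_levinson_durbin(speech_frame, p=10) for a 160-sample frame
```

Filtering the frame through A(z) would then yield the LP residual discussed next.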

Figure 2.4: Comparison of speech (top) and LP residual (bottom)

An example section of voiced speech and the respective residual signal, from 10th order linear prediction, are shown in Figure 2.4. Examining Figure 2.4 reveals that the residual signal is indeed more noise-like between the pitch pulses than the speech signal. It can be seen that, as linear prediction removes only the short-term correlations from the speech signal, it has little effect on the pitch pulse. In fact, the pitch pulses are more pronounced in the residual signal than in the speech signal. This results in the residual signal being more suitable for pitch prediction procedures, such as those discussed in Section 4.2 [Hess83]. The magnitude spectrum for the voiced speech section of Figure 2.4 and the magnitude spectrum of a 10th order linear predictive filter calculated from this speech are shown in Figure 2.5.
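The comparison shown in Figure 2.5 is straightforward to reproduce. The sketch below is self-contained and purely illustrative: it builds a toy voiced frame (invented filter and pitch), solves the normal equations (2.2) directly with a Toeplitz solver, and evaluates the two magnitude spectra.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz, lfilter

fs, p = 8000, 10
n = np.arange(160)                                 # one 20 ms frame at 8 kHz
exc = (n % 80 == 0).astype(float)                  # 100 Hz pulse-train excitation
s = lfilter([1.0], [1.0, -1.2, 0.8], exc)          # toy resonant "voiced" frame
s = s * np.hamming(len(s))

full = np.correlate(s, s, mode="full")
R = full[len(s) - 1:len(s) + p]                    # R(0) ... R(p)
a = solve_toeplitz(R[:p], R[1:p + 1])              # predictor coefficients a_k of (2.2)

w, H = freqz([1.0], np.concatenate(([1.0], -a)), worN=512, fs=fs)  # LP filter, G = 1
S = np.fft.rfft(s, 1024)                           # speech spectrum of the frame
f = np.fft.rfftfreq(1024, 1.0 / fs)
lp_db = 20.0 * np.log10(np.abs(H) + 1e-12)         # smooth envelope (formant estimate)
sp_db = 20.0 * np.log10(np.abs(S) + 1e-12)         # fine structure (pitch harmonics)
```

Plotting lp_db over sp_db reproduces the behaviour of Figure 2.5: the LP response traces the smooth formant envelope while the harmonic fine structure remains with the excitation.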

Figure 2.5: Comparison of speech spectrum (solid) and linear prediction filter spectrum (dotted)

It can be seen in Figure 2.5 that the LP magnitude spectrum is a good estimate of the frequency response of the speech. The underlying formant structure of the speech spectrum is particularly well represented. In fact, as discussed in Section 2.2.2, an estimate of the underlying formant structure is actually an estimate of the frequency response of the vocal tract.

2.3.1 Speech specific Linear Prediction

Whilst linear prediction is widely used in speech coding, it was not originally developed specifically for speech coding but rather for the more general field of signal processing. The result of this is that the linear predictor used for speech coding does not exploit the perceptual properties of hearing detailed in Section 2.2.1. Authors such as Strube [Strub80], Nakatoh et al. [Naka98] and Hermansky [Herm90] have attempted to incorporate some perceptual properties into the calculation of the linear predictor coefficients.

These authors have reported improved results by warping the frequency axis of the input speech to simulate the response of the ear, prior to calculating the predictor coefficients. Hermansky [Herm90] also included equal loudness perception in the calculation of the filter parameters. This method has reported improved results for speech recognition but is untried in speech coding.

In [Elja87], El Jaroudi et al. propose Discrete all-pole modelling (DAP), which modifies the calculation of the LPC to reduce the effects of spectral aliasing for discrete spectral representations. This method operates only on the spectral peaks, through a peak picking mechanism. Peak picking of the spectrum also allows non-uniform spacing between the discrete spectral samples; this characteristic is useful for representing the non-linear frequency response of the ear discussed in Section 2.2.1. A modified version of DAP was reported to improve the quality of a CELP speech coder [Wei01]. However, selecting only the spectral peaks in the modelling process leads to a poor representation of the lower amplitude, noise-like sections of the speech spectrum. A solution to the poor performance of DAP for noise-like spectra is proposed by Molyneux et al. [Moly00]. This solution proposes using a voicing frequency, below which DAP is used and above which standard LPC is used. All of the DAP methods use an iterative approach that starts with the standard LPC and modifies these to remove the aliasing effect on each iteration. This leads to a large increase in the computational complexity (over standard LPC) required to calculate the coefficients. Also, the method of determining the voicing frequency in [Moly00] is heuristic and could suffer from mis-classification.

In a separate attempt to better model the characteristics of the speech waveform, Makhoul [Mark73] proposed selective linear prediction. This involves separating the frequency spectrum into a low and a high band and modelling the bands with separate

linear predictors of different orders. This allows the accuracy of the LPC spectrum to better match the non-linear frequency response of the human ear; however, the complexity required to calculate two separate sets of LPC is significantly increased over using only a single set. Also, problems may occur due to mis-matching of the respective spectral estimates at the frequency boundary.

Whilst the literature described in this section has attempted to better match linear prediction to speech coding, it appears that no attempt to incorporate the effects of masking into linear prediction has been made. This presents distinct opportunities for better utilizing linear prediction in speech coding, and a method to incorporate masking into the calculation of the LPC is proposed in Chapter 3.

2.3.2 Minimum variance distortionless response (MVDR)

MVDR has been proposed as an alternative to linear prediction for representing the spectrum of a speech waveform. This method is capable of exactly representing the spectrum of the N harmonics of a waveform using N+1 parameters [Murt99a]. The requirement for N+1 parameters is too great for low rate speech coders. To reduce the bit rate requirements, Murthi et al. [Murt97, Murt99b] propose methods to construct lower order MVDR models from higher order models. The first method involves running the Levinson [Levi47] algorithm in reverse, whilst the second proposes resampling a high order MVDR spectrum to obtain the autocorrelation function and then calculating the coefficients with the Levinson algorithm. Whilst the MVDR spectrum does represent the harmonic frequencies exactly, it does not provide a good estimate of the remainder of the spectrum. This characteristic makes the technique suitable for representing the spectrum in harmonic coders, but unsuitable for use as a linear filter in other speech coding algorithms such as CELP [Nati].
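For illustration, the first order-reduction strategy mentioned above relies on what is commonly called the step-down (reverse Levinson) recursion, which recovers the order p-1 polynomial and the last reflection coefficient from the order p polynomial. A sketch under that textbook convention (names and conventions are mine, not taken from [Murt97, Murt99b]):

```python
import numpy as np

def levinson_step_down(A):
    """One step of the reverse (step-down) Levinson recursion.

    Given the order-p error-filter polynomial A = [1, a1, ..., ap], return the
    order-(p-1) polynomial and the discarded reflection coefficient k_p.
    """
    A = np.asarray(A, dtype=float)
    p = len(A) - 1
    k = A[p]
    if abs(k) >= 1.0:
        raise ValueError("unstable polynomial: |k| >= 1")
    B = (A[1:p] - k * A[p - 1:0:-1]) / (1.0 - k * k)
    return np.concatenate(([1.0], B)), k

# Repeated application reduces a high order model to a low order one, which is
# the essence of the first reduction strategy described above.
```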

2.4 Waveform Speech coders

Waveform coders attempt to preserve the waveform shape of the input signal. This is generally achieved by minimizing an objective error measure, such as the Mean Squared Error (MSE), between the input and synthesized speech. As the wave-shape is maintained, the performance of these coders can be measured using objective measures such as SNR (though SNR fails to recognize perceptual traits and, as a minimum, segmental SNR is a better choice). These coders operate at medium to high bit rates and produce high quality synthesized speech. Virtually all coders that operate at bit rates above 4 kbps are waveform coders [Klei95, Gold00]. However, at bit rates below 4 kbps the quality of speech produced by waveform coders diminishes dramatically. This diminished quality is a result of waveform coders wasting bits on perceptually unimportant information at these low bit rates [Shlo98, Thys01].

There are many methods for maintaining the waveform shape of the input speech whilst achieving compression of the signal. The different types of waveform coders are detailed and compared in the remainder of this section. Particular attention is placed on the ability of the respective waveform coders to operate over a scalable range of transmission bit rates.

2.4.1 Pulse Code Modulation (PCM)

PCM is the simplest form of waveform coder. This method simply samples the input speech at a given rate (usually 8 kHz) and quantises each sample using a fixed number of bits (usually 8) [Jaya84]. No attempt is made to compress the speech signal, and thus high quality speech is achieved. This high quality is at the cost of requiring a large bandwidth. This method also requires no algorithmic delay.

To exploit some of the perceptual properties of hearing in PCM, the linear quantiser used in PCM has been adapted to match the logarithmic amplitude response of the human ear. Two derivatives of this logarithmic quantiser have resulted: A-law and µ-law [Jaya84]. These have been standardized at 64 kbps by the ITU (G.711 [ITU88a]) and are said to produce Toll quality speech. This quality is the standard against which other speech compression algorithms are judged.

To better exploit the correlation between adjacent speech samples, Differential PCM (DPCM) and Adaptive DPCM (ADPCM) have been developed. These coders use prediction to estimate the current speech sample from previous samples, and then quantise the error. ADPCM uses both an adaptive quantiser and an adaptive predictor. This method produces Toll quality at 32 kbps and has been standardized at this rate as G.721 [ITU88b]. A number of variants of ADPCM have been standardized at different rates, for example G.723 [ITU88c] at 24 kbps; however, at rates below 32 kbps Toll quality is not achieved [Klei95].

2.4.2 Linear Prediction Analysis by Synthesis (LPAS) speech coders

LPAS coders exploit more of the perceptual characteristics of speech than PCM coders. This allows LPAS coders to achieve a much higher level of compression than PCM coders, and Toll quality is achieved at rates as low as 8 kbps [Sala98]. More recently, adaptations of this method have attempted to produce Toll quality speech at 4 kbps [Thys01]. A block diagram of the generalized LPAS speech coder is shown in Figure 2.6 [Kond95].
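As a brief aside on the logarithmic quantisers standardised in G.711 above: the sketch below uses the textbook continuous µ-law companding curve with µ = 255. Standardised G.711 actually uses an 8-segment piecewise-linear approximation of this curve, so this is an illustration of the principle rather than a bit-exact codec, and all names are assumptions.

```python
import numpy as np

MU = 255.0  # mu value associated with 64 kbps mu-law PCM

def mulaw_compress(x):
    # Logarithmic compression: low amplitudes are expanded and high amplitudes
    # squeezed, matching the ear's logarithmic amplitude response before a
    # uniform quantiser is applied.
    x = np.clip(x, -1.0, 1.0)
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mulaw_expand(y):
    # Exact inverse of the continuous compression curve.
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

x = np.array([0.001, 0.01, 0.1, 1.0])
codes = np.round(mulaw_compress(x) * 127)     # 8-bit style uniform quantisation
decoded = mulaw_expand(codes / 127.0)         # relative error is similar at all levels
```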

Figure 2.6: Block diagram of LPAS coder [Kond95]

Figure 2.6 shows a closed loop structure where the error between the original and synthesized speech is minimized. The structure generally operates on fixed size segments of the input speech, with the appropriate parameters (excitation, long term predictor and short term predictor) selected to minimize the error between input and synthesized speech for each segment. No attempt is made to directly code the residual signal resulting from the error between the original and predicted signal (as in ADPCM); rather, an excitation signal is selected that minimizes the error between the input and synthesized speech, hence the Analysis-by-Synthesis (AbyS) name. The method of selecting the excitation signal and the type of excitation signal used provide the major means of classification for coders operating within the LPAS paradigm.

The synthesis filter is formed from two distinct parts: a long term predictor and a short term predictor. The long term predictor [Sing84] is used to remove the long term correlations from the input speech, which are due to the underlying pitch of voiced speech. A first order long-term predictor synthesis filter is represented as [Rama95]:

P(z) = \frac{1}{1 - b z^{-M}}    (2.4)

where M is the delay in samples and b is the long term prediction coefficient. Higher order pitch predictors [Sing84] and fractional sample predictors [Kroo90] have also been proposed. An alternative representation of the LTP is as an Adaptive Codebook (ACB) [Klei88]. This involves forming a codebook of vectors of previous excitation at different delays. The ACB vectors are updated for each new speech segment and the ACB entries are searched to select the one that minimizes the error criterion.

The short term predictor is used to remove the short term correlations, due to the vocal tract shaping, from the input signal. Generally forward linear prediction, where the filter coefficients are calculated from the current input speech frame, is used as the short term synthesis filter (see Section 2.1.3). As the filter coefficients are calculated from the input speech samples, forward prediction requires the predictor coefficients to be transmitted to the decoder. This is opposed to backward linear prediction, where the coefficients are calculated from the previously quantised speech samples, and thus the predictor coefficients need not be transmitted. Using forward linear prediction provides greater coding gain than backward prediction [Klei95a]; however, this is at the expense of increased delay equal to the frame size used to calculate the forward predictor coefficients. When forward prediction is used in an LPAS coder, the LPC are calculated once for each speech frame. The excitation quantisation then splits each frame into a fixed number of sub-frames (usually of fixed length) and calculates the parameters that optimally represent the excitation for each sub-frame separately.

Generally, the error minimization block shown in Figure 2.6 selects the excitation that minimizes the Mean Squared Error (MSE) between the input and synthesized speech. To incorporate some of the perceptual characteristics of speech into the excitation selection, a weighting filter is often used to weight the error signal prior to error

minimisation. The weighting filter de-emphasises the frequency regions corresponding to the formants of the input speech. This de-emphasis exploits the masking characteristic of the ear, in that larger errors are less perceptible in high energy spectral sections (formant regions) of the input speech than in lower energy (anti-formant) sections. This weighting filter is usually generated using the short term predictor coefficients [Schr95] as:

W(z) = \frac{A(z)}{A(z/\lambda)}    (2.5)

where A(z) is the short term synthesis filter and λ is a constant that controls the amount of de-emphasis.

2.4.3 Multi Pulse Excitation (MPE)

Instead of representing each sample of the excitation in an LPAS structure, MPE represents only the most perceptually important samples and sets the remainder to zero [Atal82a]. To determine the locations to be represented and their respective amplitudes, an exhaustive search of all possible combinations is conducted, with those parameters that minimize the weighted MSE selected. The signal to noise ratio between the input and synthesized speech is directly related to the number of pulses represented per segment. Originally the MPE method of [Atal82a] did not use long term prediction; however, [Sing84] found that improved performance could be achieved by incorporating LTP into the method. Due to the exhaustive search of every possible location required for MPE, the computational complexity required to implement the method is prohibitive. To reduce the computational load it was proposed to restrict the pulse locations to regularly spaced intervals (Regular Pulse Excitation (RPE)) [Kroo86]. This required only the

position of the first pulse and the amplitude of each pulse to be transmitted. A version of RPE with LTP was adopted for the ETSI GSM standard at 13 kbps [Gold00] [Euro95]. Sukkar et al. [Sukk89] proposed replacing the impulses used in MPE with zinc pulses. It was reported that an equal SNR could be achieved using fewer zinc pulses per segment than impulses. However, each zinc pulse requires two amplitude values to be transmitted, and as such, the overall bit rate required to represent the speech was approximately equal to that required by MPE. An advantage of the zinc pulse method is that, due to the orthogonality of the zinc pulses, only every second pulse position needs to be searched. This reduces the computational complexity required for the search by a factor of two.

Whilst MPE and RPE produced good quality speech, the inefficiency of coding each pulse separately limited the ability of the methods to operate at rates below approximately 13 kbps. This inability of MPE to scale to lower rates led to the development of the most common type of LPAS coder, the Code Excited Linear Predictive (CELP) coder [Schr85].

2.4.4 Code Excited Linear Predictive (CELP)

In place of representing the exact residual signal, the CELP paradigm uses a codebook (CB) of Gaussian vectors to represent the excitation [Schr85]. The Gaussian vectors form a CB, generally called the Fixed Codebook (FCB). Also, CELP generally uses an ACB in place of the LTP shown in Figure 2.6. The use of the ACB allows the long term correlations to be modelled inside the closed loop system. A block diagram of the CELP coder is shown in Figure 2.7.

Figure 2.7: Block diagram of a CELP coder [Kond95]

The computational load in Figure 2.7 can be reduced considerably by recasting the weighting filter into both the synthesis arm (cascaded with the STP) and the input speech arm, and removing it from the error minimization arm [Kond95]. An exhaustive search is conducted to select the optimal excitation for a given speech sub-frame. This search involves selecting an excitation vector, calculating the optimal gain value and exciting the short term synthesis filter with the gain scaled excitation. This is repeated for each code vector in the codebook, and the vector and gain value that minimize the weighted perceptual error are selected as the excitation. The process is repeated for both the ACB and FCB, with the ACB generally searched first. This allows the selected ACB excitation to be added to the FCB excitation during the FCB search. The result is that the FCB can account for errors in the ACB excitation and hence the total error is reduced for a given speech sub-frame. Replacing this sequential CB optimisation with a joint ACB/FCB optimisation loop can further minimize the total error. This joint optimisation, however, significantly increases the search complexity.
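The sequential search described above can be sketched as follows for the fixed codebook stage; the target is assumed to be the perceptually weighted input with any previously selected ACB contribution already subtracted, zero filter memory is assumed for brevity, and all names are illustrative.

```python
import numpy as np
from scipy.signal import lfilter

def fcb_search(target, codebook, lpc):
    """Gain-optimal fixed codebook search for one sub-frame.

    target   : weighted input, with the chosen ACB contribution removed
    codebook : candidate excitation vectors, shape (K, subframe_len)
    lpc      : short term predictor coefficients a_1..a_p
    For each codevector c, synthesise y through 1/A(z); the MSE-optimal
    gain is g = <target, y> / <y, y>, and the best codevector is the one
    maximising <target, y>^2 / <y, y>.
    """
    a = np.concatenate(([1.0], -np.asarray(lpc)))      # denominator of 1/A(z)
    best_idx, best_gain, best_score = 0, 0.0, -np.inf
    for k, c in enumerate(codebook):
        y = lfilter([1.0], a, c)                       # zero-state synthesis
        corr, energy = np.dot(target, y), np.dot(y, y)
        if energy <= 0.0:
            continue
        score = corr * corr / energy                   # error reduction for this vector
        if score > best_score:
            best_idx, best_gain, best_score = k, corr / energy, score
    return best_idx, best_gain
```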

The use of codebooks to represent the excitation allows Vector Quantisation (VQ) [Gers92] to be used. This improves the coding gain over direct scalar quantisation, and allows the bit rate required to synthesise the speech to be reduced in comparison with MPE. The CELP algorithm was accepted as the FS1016 standard coder at 4.8 kbps [National]; at this rate it does not achieve toll quality. Extensive research has been conducted with the aim of improving the CELP architecture. Initially this research was interested in reducing the high computational complexity required by CELP [Davi86, Lin87]. Subsequent research has primarily involved improving the speech quality at low bit rates.

One method for improving CELP quality at low rates was Relaxed CELP (RCELP) [Klei93b]. This method exploits the fact that small gradual time shifts in voiced speech are imperceptible to the ear [Klei92]. The residual signal for voiced speech frames is time warped so that the resultant pitch track (pitch evolution) follows a linear trajectory. The resultant linear pitch track allows only a single LTP delay to be used per frame of speech, as opposed to a separate delay for each sub-frame in CELP [Schr85]. The delays for the individual sub-frames are calculated through linear interpolation of the past and present delays. This method produced good quality speech at 5.85 kbps [Klei93b] and toll quality speech at 8 kbps [Nahu95]. The RCELP method does, however, require a frame to be classified as voiced or unvoiced. This could lead to artifacts in the speech if mis-classification were to occur. Also, transitional frames may be difficult to determine and warp effectively.

A significant improvement in the quality of the CELP paradigm at rates around 8 kbps is achieved using Conjugate Structure Algebraic CELP (CS-ACELP) [Sala98, Lan98]. This coder was standardized by the ITU-T as G.729, operating at 8 kbps [ITU96]. This method operates on 10 ms frames with 5 ms sub-frames. The fixed codebook excitation

(FCBE) is the main differentiating factor between CS-ACELP and CELP. In CS-ACELP the FCBE is generated using an algebraic structure. This is basically a marriage between MPE and CELP. CS-ACELP reduces the search complexity of MPE by forcing the pulse locations to occur on fixed tracks and the pulse amplitudes to be ±1. The search generates the optimal signs and locations of the pulses, and the use of a codebook structure allows VQ to be used. The storage requirements for the FCB are also reduced when compared to CELP. This is because the code vector is generated from the transmitted indices rather than retrieved [Sala98]. The standardized coder produces toll quality speech at 8 kbps. However, at rates below this, the quality begins to fall rapidly. This rapid reduction in quality is a result of having insufficient bits available to represent a sufficient number of pulses per sub-frame.

Ozawa et al. [Ozaw97, Ozaw98] propose a variant of CS-ACELP that is scaled to operate at 6.4 kbps. This method uses only 3 pulses per sub-frame and vector quantises the sign parameters. A joint optimisation of the position and sign codebooks is used. The method reports Mean Opinion Scores (MOS) approximately 0.25 below those achieved for G.729. The structure in [Ozaw97] can also operate at 12 kbps; at this rate an improvement in quality is claimed. Attempts to operate the CS-ACELP method as a multirate coder have been reported [Rugg01, Bert99]. These methods use phonetic classifiers to select the quantisation mode used for a given frame. Whilst both methods report classifiers that attempt to avoid mis-classification, any mis-classification would create noticeable artifacts in the synthesized speech.

To reduce the bit rate required for CELP to below 8 kbps, Miki et al. [Miki94] proposed modifying the CELP FCB excitation to exhibit a periodic nature (PSI-CELP). This is achieved by modifying the FCB vectors to be repeated according to the underlying pitch

period. The method reports bit rates from 2 to 4 kbps; however, no subjective quality results are given. Also, to operate at these low bit rates the coder increases the frame size to 40 ms. This is a large increase in delay when compared to G.729 [Sala98]. The PSI-CELP scheme has been modified to operate at 4 kbps [Mano97]. This format uses 20 ms frames and 10 ms sub-frames. Subjective results similar to those of 32 kbps G.726 are reported for Japanese speech. This is approximately toll quality, but the method must switch the quantisation according to the frame nature; this is susceptible to mis-classification.

In very recent research, the focus has been on developing 4 kbps coders for participation in the current ITU-T standardization process [Thys01, Yang01, Erzi00, Tasa00, Cupe00]. Thyssen et al. [Thys01] use a CS-ACELP type of coder where frames are classified into one of 6 categories. If the frame is strongly voiced, an analysis and quantisation scheme is employed where a single ACB delay per frame is used in place of one per sub-frame, and the excess bits are spent on the FCB. For all other cases a separate ACB value is used for each sub-frame and fewer bits are spent on the FCB. The method uses 20 ms frames with 10 ms sub-frames. This method also modifies the residual (similarly to RCELP) if the frame is strongly voiced. Subjective results are reported that place the coder quality close to G.729. The coder quality may again suffer if a frame is mis-classified.

The method proposed by Yang et al. [Yang01] is very similar to that proposed by Thyssen [Thys01]. The major difference is the use of "soft decisions" [Yang01] to choose the optimal excitation. This "soft decision" selection of excitation uses both open and closed loop mechanisms to reduce complexity, and chooses the fixed excitation that best represents the current residual. This method also adds extra periodicity to transitional frames through harmonic smoothing. Good subjective results

are reported; however, the harmonic smearing and pitch warping may limit the scalability of the method to higher rates. Erzin [Erzi00] proposes a method that disperses the pulses of a Gaussian FCB using a dispersion filter based on the LPCs. This allows the FCB excitation to better represent the spectrum of the residual. The paper reports MOS test scores of 0.75 below that of PCM. Tasunaga et al. [Tasu00] also propose dispersing the pulses of the FCB, but use trained dispersion vectors in place of Erzin's LPC based dispersion filter [Erzi00]. Also, the FCB used is an algebraic codebook with 3 pulses per sub-frame for voiced speech and 9 for unvoiced. Tasunaga et al. report subjective results using Japanese speakers that place the performance equivalent to G.726 at 32 kbps [Tasu00], which is approximately toll quality [Klei95]. Mis-classification of the frame would, however, result in artifacts in the synthesized speech.

Cuperman et al. [Cupe00] propose a method that modifies the CELP structure. Firstly the residual is warped in a manner similar to RCELP. A frame classifier is then used to select Voiced/Unvoiced/Silence, with each classification coded differently; this produces a variable bit rate for the coder. The coder also operates in one of three modes with a different average bit rate for each mode (3.4 kbps is claimed for the highest rate). The most novel part of the structure is the FCB excitation in voiced speech, which involves using binary pulses that are localized around the pitch pulses with a window calculated from the frame energy contour and the pitch. Good MOS scores are reported for the highest rate mode; these exceed the score achieved by the TIA/IS-733 variable rate coder operating at an average of 7.6 kbps [Cupe00]. The variable rate of the coder makes it impractical for speech applications where the channel bandwidth is fixed. Also, the classification mechanism may cause artifacts in the speech if mis-classification occurs.

2.5 Parametric coders

Parametric coders are not concerned with maintaining the waveform shape of the input speech, but rather attempt to produce perceptually unaltered speech. Often in speech coding, perceptually unaltered speech is unobtainable at low rates, so a second criterion of minimal perceptual distortion is often the objective for parametric coders. Parametric coders do not attempt to directly transmit the speech samples but instead transmit the parameters of models that replicate the speech production process [Jaya84]. Parametric coders are generally grouped according to the type of speech model used. In the remainder of this section the most common classifications of parametric coders are detailed and current research into each classification reviewed. Particular attention is placed on the ability of the respective parametric coders to scale to higher transmission bit rates.

2.5.1 Linear Predictive Parametric Coders

Linear Predictive Parametric Coders (LPPC) predate the LPAS coders of Section 2.4.2 and share a common principle of operation with the LPAS coders, in that they attempt to represent the linear prediction residual signal separately from the linear predictive shaping filter. In the parametric implementation, however, only the perceptually important characteristics of the residual are estimated in an open loop manner; no regard is paid to the waveform shape of the synthesized speech signal, and thus these coders generally operate at low bit rates. A common representation of the residual signal is generated based on the simplified model for speech synthesis shown in Figure 2.8 [Rabi78]. The speech synthesis method shown in Figure 2.8 classifies the frame of speech as Voiced or Unvoiced. For voiced frames, the pulse excitation is selected and for

unvoiced frames, noise excitation is used.

Figure 2.8: Simplified model for speech synthesis [Rabi78]

The distance between pulses in the pulsed excitation is set equal to the pitch of the speech, while unvoiced speech employs white noise excitation. The excitation is gain scaled and shaped by an LP filter which represents the vocal tract resonances. This model is the basis for the FS1015 LPC10 coder, which operates at 2.4 kbps [Fede84]. Numerous configurations have been proposed for representing the excitation signal in the LPPC paradigm; for example, Atkinson [Atki95] proposes using the time domain envelope to represent the residual signal. Atkinson reports natural sounding speech at rates of 2.4 kbps and below.

An interesting coder which falls between the LPPC and LPAS paradigms is the Single Pulse Excited CELP coder (SPE-CELP) [Gran91]. This method uses a single impulse per frame to represent voiced speech frames. This impulse is interpolated to give the required number of pitch pulses per frame. For unvoiced speech a Gaussian CB is used as per the CELP method. SPE-CELP is reported to produce good results at 4 kbps and below and provides a bridge between the parametric LPPC and the waveform LPAS coders.
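Returning to the Figure 2.8 model, a minimal sketch of its two-state excitation generator and all-pole vocal tract filter is given below, assuming a per-frame voiced/unvoiced flag and pitch value are already available; function names and defaults are illustrative.

```python
import numpy as np
from scipy.signal import lfilter

def two_state_excitation(voiced, pitch_lag, n, gain, rng=np.random.default_rng(0)):
    """Generate one frame of excitation for the Figure 2.8 model.

    voiced    : binary frame classification
    pitch_lag : pitch period in samples (used only when voiced)
    n         : frame length in samples
    """
    if voiced:
        e = np.zeros(n)
        e[::pitch_lag] = 1.0              # impulse train spaced at the pitch period
    else:
        e = rng.standard_normal(n)        # white noise for unvoiced frames
    return gain * e

def synthesise_frame(e, lpc):
    """Shape the excitation with the vocal tract filter 1/A(z)."""
    return lfilter([1.0], np.concatenate(([1.0], -np.asarray(lpc))), e)
```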

Coders within the LPPC paradigm produce good quality speech at low rates, but quantise the residual signal parameters with no regard to the effects in the speech domain. This limits the ability of the paradigm to scale to higher subjective quality at increased bit rates, as the effects of quantisation and model errors on the synthesized speech cannot be controlled.

2.5.2 Waveform Interpolation Coders

Kleijn [Klei91] originally proposed the Prototype Waveform Interpolation (PWI) paradigm. PWI initially operated only on voiced speech, and as such required Voiced/Unvoiced classification of the speech frames. Subsequently, Kleijn et al. [Klei94] generalised PWI to operate on all speech, and the generalised implementation is referred to as Waveform Interpolation (WI) coding. In the WI paradigm, a frame of input speech (usually 25 ms) is firstly passed through a linear predictive (LP) analysis filter. The residual signal is then separated into a fixed number (usually 10 for a 25 ms frame) of overlapped pitch cycles (known as characteristic waveforms (CW) [Klei94]) and these are used to form a two dimensional waveform which evolves in a pitch synchronous nature (the CW surface). The pitch track used for extraction of the CWs is formed by linearly interpolating the pitch values from the past, present and future frames. To maximize the smoothness of the CW surface, the individual pitch length segments are aligned to achieve maximum correlation prior to constructing the surface. The CW surface is then decomposed into Slowly Evolving and Rapidly Evolving Waveforms (SEW/REW) using a fixed linear filtering operation. A simplified block diagram of the WI encoder is shown in Figure 2.9, and an example of the CW surface, SEW and REW are shown in Figure 2.10.

Figure 2.9: Block diagram of the WI encoder

Figure 2.10: Examples of CW surface, SEW and REW

The CW is usually transformed to the frequency domain prior to separation into SEW and REW. Separating the CW surface into SEW and REW allows each of these waveforms to be quantised and transmitted individually, according to their perceptual characteristics. The SEW is down sampled to 1 SEW per frame and the REW to 4 per frame. In the decoder, interpolation is used to generate the required number of SEW and REW. Only the magnitude spectra of the SEW and REW are quantised; the phase is not transmitted. At the decoder a phase model is used for the SEW and random phase is used for the REW. A simplified block diagram of the WI decoder is shown in Figure 2.11.

Figure 2.11: Block diagram of WI decoder

The WI method has been shown to produce good quality speech at 2.4 kbps [Klei95b]. This quality was equivalent to the FS1016 [National] CELP coder at 4.8 kbps. The delay required by the WI coder is approximately 2.5 times the frame length employed [Klei95a], and this is much greater than that required for CELP. This increased delay is due to both the look ahead for pitch prediction and the delay of the linear filtering procedure. The computational complexity required for the WI coder is also large; however, a low complexity version has been proposed [Klei96]. Due to the good performance of the WI coder at low rates, much subsequent research has been conducted on scaling the coder quality to operate at higher rates [Burn95, Burn97, Thys97, Gott99, Gott00, Kang99a, Li94]. Initially this research focused on better quantisation of the SEW and REW parameters.

A holistic approach for improved SEW/REW quantisation was proposed by Burnett et al. [Burn95, Burn97]. This approach jointly quantised the SEW and REW parameters in a closed loop AbyS structure [Burn95], with further enhancements later included to incorporate the SEW and REW phases into the AbyS structure [Burn97]. This closed loop quantisation resulted in improved subjective quality when compared to the initial WI coder [Klei94].
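The fixed linear filtering that separates the CW surface can be sketched as below; a simple moving average along the evolution axis stands in for the actual filter used in [Klei94], and the surface dimensions are illustrative.

```python
import numpy as np

def sew_rew_split(cw_surface, span=5):
    """Split an aligned CW surface into SEW and REW components.

    cw_surface : array of shape (n_cw, cw_len); each row is one extracted,
                 aligned characteristic waveform (e.g. 10 rows per 25 ms frame).
    The SEW is obtained by lowpass filtering each phase position along the
    evolution (row) axis; here a moving average stands in for the fixed
    linear filter, and the REW is simply the remainder.
    """
    kernel = np.ones(span) / span
    sew = np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, cw_surface)
    rew = cw_surface - sew
    return sew, rew
```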

Subsequently, AbyS of the SEW and REW prototypes has been further extended by Gottesman [Gott99, Gott00]. These latter papers extend the bit rate to 2.8 kbps and 4 kbps respectively, and report subjective results exceeding those achieved for G.723.1 operating at 6.3 kbps. Less holistic approaches to improving SEW/REW quantisation than those proposed by Burnett et al. and Gottesman et al. have also been proposed. Kang et al. [Kang99a] claim improved results over the initial WI coder [Klei94] by adjusting the synthesised phase of the signal via modification of the quantised REW spectrum according to the pitch. Thyssen et al. [Thys97] warp the frequency spectrum to match the non-linear response of the ear for parameter quantisation and report a preference for this method when compared to the initial WI coder [Klei94] in an A/B comparison test. Li et al. [Li94] replace the linear interpolation of the prototypes (used by Kleijn [Klei94]) with non-linear interpolation and report improved representation of transitional speech segments; this is, however, at the expense of increased complexity and bit rate [Li94].

A WI coder that is scalable from 2 to 4.8 kbps was proposed by Kang et al. [Kang99b]. This method uses a dual-frame structure where the transmission rate of the SEW and

REW is increased at higher bit rates. The reported results claim similar subjective quality to fixed rate WI coders at each of the scalable bit rates. Whilst the improved SEW and REW quantisation methods proposed in [Burn95, Burn97, Thys97, Gott99, Gott00, Kang99a, Kang99b, Li94] report improved subjective quality over the original technique, the actual quality improvements have been relatively small. This suggests that deficiencies in the WI model make the speech quality non-scalable. These deficiencies include mis-matches between the actual and linear pitch track, smearing of transitional sections by the linear filtering operation, and reconstructing speech by interpolating between adjacent CWs even though the exact samples were originally contained in the CWs.

These shortcomings in the WI model have led to the development of Perfect Reconstruction WI (PRWI) coders [Klei98, Erik99, Chon00, Tamm00]. Whilst the specific methods employed in PRWI [Klei98, Erik99, Chon00, Tamm00] differ slightly, they exhibit a number of underlying similarities. These similarities include generating a very accurate pitch track, critically sampling the speech signal pitch synchronously and warping the critically sampled CWs to a constant length. The only report of subjective quality for a PRWI coder is by Chong-White [Chon00], where a PRWI coder operating at 4 kbps reports MOS scores approximately 0.8 below the G.729 coder and 0.3 above the FS1016 coder operating at 4.8 kbps. These results appear to be lower than those discussed in Section 2.4.4, where waveform coders were operating at 4 kbps. However, due to analyzing all frames with the same structure, the synthesized speech produced by WI is not susceptible to mis-classification errors. This makes the WI paradigm more robust than other methods, which classify the speech prior to coding.
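The constant-length warping shared by the PRWI schemes can be illustrated with a simple resampling sketch; PRWI implementations use higher quality interpolation, so this is structural only, and the target length is an arbitrary choice.

```python
import numpy as np

def warp_cw_to_constant_length(cw, target_len=128):
    """Resample one pitch-length characteristic waveform to a fixed length.

    Linear interpolation over a normalised (0..1) phase axis; the waveform
    is treated as one period of a cyclic signal.
    """
    src_phase = np.linspace(0.0, 1.0, num=len(cw), endpoint=False)
    dst_phase = np.linspace(0.0, 1.0, num=target_len, endpoint=False)
    return np.interp(dst_phase, src_phase, cw, period=1.0)
```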

Whilst the PRWI coders strive to perfectly reproduce the speech when unquantised, they all quantise the LP residual in an open loop manner. This means that the effects of quantisation errors on the synthesized speech are unknown and cannot be controlled. This characteristic may limit the ability of PRWI to scale up to toll quality in practical applications. The PRWI coders also require significantly more delay than the LPAS coders discussed in Section 2.4.2. This delay is required due to the pitch detection needed to form an accurate pitch track, and is generally equal to several times the frame size [Chon00].

2.5.3 Sinusoidal Speech Coders

Sinusoidal speech coders exploit the concept that frames of speech can be modelled as a weighted sum of sinusoids. However, allowing the individual sinusoid frequencies to be unrelated leads to a high bit rate being required to represent the sinusoids [Mcau86]. To produce a compact representation of the speech signal, the frequencies must be harmonically related. This harmonic relationship is reasonable for voiced speech, where the frequencies are harmonics of the fundamental frequency (pitch). The speech synthesis is then represented as [Mcau92a]:

s(n) = \sum_{l=1}^{L} A_l \cos(l \omega_0 n + \phi_l)    (2.6)

where L is the number of harmonics, A_l is the amplitude, \phi_l the phase and \omega_0 the fundamental frequency. The amplitude values are coded simultaneously by fitting a spectral envelope to the measured amplitudes, and a voicing parameter is also transmitted that determines the frequency below which the spectrum is harmonic (voiced). Phase information is not transmitted, but is estimated from the transmitted pitch, amplitude and voicing parameters [Mcau91].
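A direct implementation of the synthesis equation (2.6) is straightforward; the sketch below assumes the amplitudes, phases and fundamental frequency are already available for the frame.

```python
import numpy as np

def harmonic_synthesis(amps, phases, w0, n):
    """Synthesise one frame with the sinusoidal model of (2.6).

    amps, phases : A_l and phi_l for harmonics l = 1..L
    w0           : fundamental frequency in radians per sample
    n            : frame length in samples
    """
    t = np.arange(n)
    s = np.zeros(n)
    for l, (A, phi) in enumerate(zip(amps, phases), start=1):
        s += A * np.cos(l * w0 * t + phi)
    return s

# e.g. a 100 Hz fundamental at 8 kHz sampling: w0 = 2 * np.pi * 100 / 8000
```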

For harmonic sections the extracted phase is generated as a sum of linear and minimum phase components, while in unvoiced sections (above the voicing frequency) the phase is made random. This method of sinusoidal coding has been reported to produce good quality speech at 2.4 kbps [Mcau92b].

From the results in [Mcau86] it can be seen that the sinusoidal method can be scaled to operate at higher rates than used in [Mcau92b]. To achieve this, the phase parameters must be quantised and transmitted. Cheetham et al. [Chee97] proposed separating the phase into linear, minimum and residual components, before quantising the residual component using an all-pass filter. Results reported for this phase quantisation method indicate a modest improvement in subjective quality when compared to using only the linear/minimum phase estimate [Sun97]. This improvement is at the expense of both the higher complexity required to estimate the all-pass parameters and the increased bit rate for transmission of the all-pass parameters. Ahmadi [Ahma99] proposes the use of upfront LPC and coding of the resultant residual signal using a sinusoidal model. This method assumes a flat magnitude spectrum for the residual and thus the amplitudes are not transmitted, while the all-pass method of phase quantisation is used to quantise the harmonic phases for voiced speech [Chee97]. Non-harmonic spacing of the sinusoids is also introduced for unvoiced frames. The proposed quantisation scheme is variable rate, with an average bit rate of 1.75 kbps [Ahma99]. Ahmadi et al. extended the use of upfront LPC to operate at a fixed rate of 2.4 kbps, with the coder achieving a MOS of 3.1 [Ahma98]. This MOS score is lower than those achieved by the WI coder [Klei94] operating at 2.4 kbps.

All of the preceding sinusoidal coders require a large delay to accurately calculate the pitch. Aguilar et al. [Agui00] proposed an embedded sinusoidal coder with a bit stream in the range of 3.2 to 9.6 kbps; their method uses backward pitch prediction and thus limits the delay.

Narrow band (0-4 kHz) speech is coded at 3.2 kbps by quantising the amplitude, pitch and voicing parameters. Aguilar [Agui00] reports that for improved speech quality the measured phases must be coded at least once per 12.5 ms with 5 bits per phase. This characteristic is used to scale the 3.2 kbps coder to 6.4 kbps by quantising the sine wave phases. At 9.6 kbps the extra bits available are used to extend the scheme to code wideband speech (0-8 kHz). MOS scores are presented which indicate that the 3.2 kbps coder exceeds the quality of G.723.1 at 5.3 kbps and the 6.4 kbps coder matches the quality of G.723.1 operating at 6.3 kbps.

From the literature reviewed it is apparent that the sinusoidal model presents a paradigm whereby the subjective quality may be scaled with the bit rate. However, due to the need to classify frames, the method is susceptible to artifacts in the synthesized speech if mis-classification occurs. Also, the phase component requires a large number of bits to be accurately coded and, as such, the scalability of the bit rate is not fine grained. The quality of sinewave coders also suffers from difficulties in modelling noise-like (unvoiced) and transitional speech sections. This is a result of representing all speech as a sum of harmonic sinewaves. This problem has been considered in papers such as that by Verma et al. [Verm98]; however, an increased amount of complexity is required to model such sections effectively.

2.5.4 Multi-band and Mixed excitation Coders

The speech quality produced by the LPPC discussed in Section 2.5.1 is often buzzy in nature. This buzzy sound is generally attributed to the binary selection of either voiced or unvoiced excitation for a given frame. Using a binary voicing decision is insufficient for modelling transitional sections, and any mis-classification of the signal leads to artifacts in the synthesized speech. Alternatives to using a binary voicing decision are to use either a multi-band [Grif88] or mixed [Makh78] approach to producing the excitation signal.

The mixed approach proposed by Makhoul et al. [Makh78] uses only a single voicing frequency per frame. Below this threshold the signal is considered voiced (pulsed excitation) and above it, the signal is unvoiced (noise excitation). The multi-band approach is a frequency domain approach whereby the input signal is separated into a number of sub-bands [Grif88]. Each sub-band is then declared voiced or unvoiced according to the match of the original sub-band spectrum to a synthetic spectrum that has harmonics spaced according to the pitch of the input speech signal. The excitation is then made up of a combination of harmonic or random spectra according to the voiced/unvoiced band decisions. An improved version of the Multi Band Excitation coder (IMBE), which dynamically allocates the bits according to the pitch [Hard88], was standardized for the INMARSAT-M system operating at 6.4 kbps [Gold00] [DVSI91].

A coding structure that combines mixed and multi-band excitation is the Mixed Excitation Linear Prediction (MELP) coder [Mcce92, Mcce95]. This method uses 5 sub-bands and classifies each band as Voiced or Unvoiced. A pulsed excitation is used for the voiced bands and noise excitation for those that are unvoiced. Transitional frames are also considered through the use of an aperiodic flag. If the aperiodic flag is set, introducing jitter into the pulsed excitation produces an aperiodic pulsed excitation. The coder also uses pulse dispersion and adaptive post filtering to improve the speech quality. The MELP coder was accepted as the new Federal Standard at 2.4 kbps [Supp97], and at this rate the coder achieves equal or better performance to the FS1016 CELP coder operating at 4.8 kbps [Supp97]. The MELP method has also been modified to operate at bit rates both above and below 2.4 kbps. McCree et al. [Mcce98] proposed a version of the MELP coder at 1.7 kbps by modifying the original 2.4 kbps coder to decrease the frame size from 22.5 ms to 20 ms

and altering the LPC quantisation. The quality reported in subjective tests exceeds that of the Federal Standard [Supp97] at 2.4 kbps. Stachurski et al. [Stac99] proposed a 4 kbps MELP coder, which increases the quantisation accuracy and transmission rate of the pitch, LSFs and voicing strengths when compared to the original 2.4 kbps MELP coder. Quality in excess of that achieved for the GSM full rate coder at 13 kbps, but below that of G.729 at 8 kbps, is reported for this configuration.

The mixed and multi-band excitation paradigms offer good speech quality at low rates. However, a good estimate of the pitch is required and this results in a large delay being necessary. Also, for the MBE method to achieve high subjective quality the phase must be quantised, and this is difficult to achieve at low rates (as discussed for sinusoidal coders in Section 2.5.3). The MELP paradigm operates with high subjective quality at low rates, but inaccuracies in the model (such as heuristic voicing decisions and open loop analysis) limit the scalability of the quality to higher bit rates.

2.6 Hybrid and Scalable coders

In an attempt to increase the subjective quality for a given bit rate, hybrid coders switch between parametric and waveform coding according to the classification of the input speech frame. Examples of hybrid coders operating at fixed bit rates are combined WI/CELP coders [Burn93, Klei93] and Harmonic/waveform coders [Shlo98, Shlo01]. The core of these coders is the mechanism used to classify the input speech frame and the techniques used for the voiced and unvoiced excitation. The WI/CELP coders use WI for strongly voiced frames and CELP for transitional and unvoiced frames. These coders operated at a constant bit rate of less than 3 kbps and report quality in excess of parametric coders operating at the same rate.

The Harmonic/waveform coders use a parametric structure for strongly voiced and strongly unvoiced frames and an Algebraic CELP (ACELP) waveform coder for transitional frames. Operating at 4 kbps, this coder reports subjective quality equal to that of G.723.1 operating at 5.3 kbps. Whilst hybrid coders report good subjective speech quality when compared to non-hybrid coders, the requirement for classification of speech frames can result in artifacts in the synthesized speech. This tends to be a result of heuristic classifiers that are susceptible to mis-classification of the speech.

The hybrid coding paradigm has also been used to generate variable rate speech coders. These coders operate at a different rate according to the characteristics of the input speech frame [Klei95a]. As these coders are not suitable for fixed bandwidth channels they are not discussed here in detail, but a full analysis of variable rate coders can be found in [Klei95a].

The ACELP paradigm presents the most prevalent group of coders that can operate at a scalable range of bit rates. These coders achieve scalability by varying the number of pulses quantised per sub-frame [Ekud99, Ashl00, Chom01]. The coder proposed by Ekudden et al. [Ekud99] was standardised as the European Telecommunications Standards Institute (ETSI) adaptive multirate coder [Gold00][Euro99]. The scalable ACELP speech coders operate at bit rates from 4 to 16 kbps. At rates below 4 kbps there are too few pulses available to satisfactorily reproduce the speech.

2.7 Summary

Speech coders can be grouped into two broad categories: Waveform Coders and Parametric coders.

Waveform coders attempt to maintain the waveform shape of the input speech signal and produce high quality speech at bit rates of 4 kbps and above. At rates below 4 kbps waveform coders tend to have too few bits available to accurately reproduce the speech signal, and the speech quality diminishes rapidly. The most prevalent coder in the waveform coding paradigm is the CELP coder. This coder uses linear prediction to remove the short-term correlations from the input signal and then codes the residual signal in a closed loop AbyS structure. Using this AbyS structure ensures that quantisation errors have minimal perceptual effect on the synthesized speech.

Within the CELP paradigm, ACELP is widely used in a scalable structure. ACELP coders use a dynamic algebraic codebook structure to code a fixed number of discrete pulses per sub-frame. To achieve scalability, ACELP coders vary the number of pulses represented per sub-frame, and this achieves scalable bit rates in the range of 4 to 16 kbps. The subjective quality of scalable ACELP coders increases to toll quality at the higher rates. Recent developments in ACELP coding have improved the subjective quality at low rates (4 kbps) by better exploiting the perceptual qualities of the human ear in the coding process. These measures include dispersing the coded impulses to better replicate the shape of a glottal pulse, and classifying the input frames into distinct classes before coding each classification with a different configuration. Another development in increasing the subjective quality of ACELP coders at around 4 kbps has been to use both open and closed loop structures to adaptively quantise the parameters. The algorithmic delay required for CELP waveform coders is also generally only equal to a single frame of speech (up to 30 ms).

Parametric coders make no attempt to maintain the waveform shape of the input speech, but rather attempt to synthesise the speech using only the most perceptually important

parameters. Parametric coders quantise the parameters of models that represent the speech production process. At rates below 4 kbps parametric coders produce good quality synthesised speech; however, this speech is of lower quality than that achieved by waveform coders at higher rates. As bit rates are increased beyond 4 kbps, the quality of parametric coders plateaus. This saturation of quality may be attributed to the fact that the models used by the parametric coders are not perfect, and thus the quality produced from such a model will reach an upper limit regardless of the accuracy with which the model parameters are transmitted.

Two major methods in parametric coding that may scale to higher rates are sinusoidal and waveform interpolative (WI) coders. Sinusoidal coders attempt to represent a frame of speech as a harmonic series of sine waves. These coders transmit the sinewave frequencies and amplitudes and use these parameters to reconstruct the speech. To scale the quality of sinusoidal coders to higher rates, it becomes necessary both to transmit the phase of the sine waves and to model transitions. Neither of these components is suited to high levels of compression and, as such, the scalability of sinusoidal coder quality requires large increases in bit rate.

WI coders extract pitch length segments of the signal at regular intervals and decompose these segments into a slowly evolving signal and a rapidly evolving signal. These signals can then be transmitted at different rates and accuracy, according to their perceptual characteristics. The speech is reconstructed at the receiver by interpolating and recombining the parameters representing the rapidly and slowly evolving signals. WI coders achieve a higher quality than the Federal Standard 1016 CELP (4.8 kbps) at a bit rate of 2.4 kbps. The quality of WI coders is, however, capped by the limitations of the model used. In an attempt to achieve scalable bit rate operation, recent research into WI coding has modified the WI model to achieve perfect reconstruction of the speech

waveform when operated in an unquantised state. However, as WI coders quantise the residual signal in an open loop structure, no account is taken of the effect of quantisation errors on the synthesized speech. This characteristic, combined with the large delay required to generate an accurate pitch track (>60 ms), limits the ability of perfect reconstruction WI to scale to higher subjective quality with increased bit rate.

One method that somewhat links the characteristics of waveform and parametric coders is Single Pulse Excitation CELP (SPE-CELP). This method uses the closed loop AbyS structure of CELP to generate unvoiced speech, and a parametric representation of the pulse structure required to excite the LP filter in voiced speech. SPE-CELP uses a single structure to synthesise both the voiced and unvoiced speech; however, each frame must be classified as either voiced or unvoiced in the analysis stage. The requirement to classify the input speech limits the subjective quality of this method. This limitation results because speech classification is a heuristic task and as such incorrect classification is always possible. The result of incorrect classification is a reduction in the subjective quality of the synthesized speech.

The preceding review of current speech coding literature reveals that a gap exists for a single, scalable speech coding algorithm. This algorithm should be able to scale in terms of both bit rate and subjective quality from low rate parametric coders (~2 kbps) to medium rate (~6 to 8 kbps) waveform coders. For this algorithm to be of real use, it should have an algorithmic delay approximately equal to that of medium rate waveform coders (20 to 30 ms). This delay will allow the algorithm to be used in real time two-way speech communication.

Chapter 3

Source Enhanced Linear Prediction of Speech Incorporating Simultaneously Masked Spectral Weighting

3.1 Introduction

Linear prediction (LP) forms an integral part of almost all modern day speech coding or speech compression algorithms. The primary reason for this popularity is that linear prediction provides a relatively simple and well founded technique for removing redundancy from a speech signal. It is thus useful in aiding compression or bit rate reduction. Linear prediction determines and removes redundancy by removing the short term correlations from the input signal.

The review of speech specific linear prediction presented in Section 2.3 exposed an opportunity to better exploit the perceptual properties of hearing in linear prediction by incorporating the effects of perceptual masking into the linear predictive filter. This chapter proposes a method for modifying the calculation of the LP coefficients (LPC) to better model the characteristics of the source. This is achieved by incorporating a weighting function based on the simultaneous masking property of the ear into the calculation of the LPC. Simultaneous masking occurs when an audible sound at a particular frequency is made inaudible by a higher intensity sound at an adjacent frequency occurring at the same moment in time. The proposed modification to the LPC calculation fits the linear predictive spectrum only to the unmasked samples of the input spectrum. The motivation for this technique is to ensure that no complexity is wasted modelling the masked regions, allowing the unmasked regions to be better represented. This allows the filter to remove more perceptually important information from the signal than the standard LPC technique, with the resultant residual signal consisting of less perceptually important information. This characteristic allows the subjective quality of the synthesized speech to be improved for a given residual quantisation scheme. This chapter presents results confirming this characteristic using objective measures and subjective listening tests.

The chapter is organized as follows. The new linear prediction method is given in detail in Section 3.2. Objective and subjective results are provided in Section 3.3. Finally, the major points are summarized in Section 3.4. This chapter uses the review of linear prediction presented in Section 2.3, and the review of hearing perception also presented in Chapter 2, as a basis upon which many of the proposed methods are built.

3.2 Simultaneously Masked Spectral Weighting Linear Prediction Coefficients (SMWLPC)

3.2.1 Motivation

The motivation for this technique is to allow a simple and computationally efficient means of better exploiting the perceptual characteristics of human hearing in low rate LP based speech codecs. Traditionally, perceptual distortion is only exploited in low rate LP speech coding by using a noise shaping weighting filter [Schr79] when coding the residual signal. This filter is used to weight the error signal when searching for the optimal excitation signal to represent the LP residual. The weighting filter de-emphasises the frequency regions corresponding to the formants of the input speech. This de-emphasis exploits the masking characteristic of the ear in that larger errors are imperceptible in louder sections of the input speech than in quieter sections. Whilst the use of this weighting filter has produced good results [Kroo88] and been widely accepted, it uses only a basic model of the perceptual characteristics of the ear. Authors such as Sen [Sen93] and Burnett [Burn92] have reported improved performance by employing more sophisticated perceptual models when searching for the excitation signal that minimizes the perceptually weighted MSE to the LP residual. The improved results reported by [Sen93] and [Burn92] have been at the expense of a large increase in computational complexity. This increase in complexity is due to the fact that each excitation signal tested must be transformed to the frequency domain and the error signal then multiplied with the respective perceptual model.

SMWLPC exploits more sophisticated perceptual models than the filter proposed in [Schr79]. Incorporating these models into the LP filter allows SMWLPC to remove more perceptually important information from the input signal than standard LPC,

resulting in a residual signal that contains less perceptually important information. By exploiting the perceptual models up front in the LP filter, computational complexity is dramatically reduced when compared to methods that use complex perceptual models to quantise the LP residual [Sen93]. This reduction is due to the fact that only a single transform and weighting multiplication per frame of input speech is required. Also, by incorporating the complex perceptual models into the LPC, SMWLPC can be easily adapted to any existing LP based speech coding algorithm by simply replacing the standard LP filter. The method selected limits the increase in computational complexity over standard LP filtering by maintaining the use of traditional recursive solutions in the calculation of the LPCs. This also maintains a stable Auto Regressive (AR) structure that can be directly employed in any LP based speech codec. Standard linear prediction minimises the error equally across the entire frequency spectrum of the input speech. SMWLPC modifies this technique to firstly determine, and subsequently ignore, frequencies that are simultaneously masked. This allows the error to be minimized only in the sections of the input spectrum that are unmasked.

3.2.2 Method

A block diagram of the SMWLPC method is shown in Figure 3.1. Initially the power spectrum (frequency domain) of the input speech is calculated via a Fast Fourier Transform (FFT). A masking threshold function is then calculated for each discrete frequency. The calculation of this function is detailed in Section 3.2.3. The masked input frequencies are then determined by comparing the power spectrum at each discrete frequency to the masking threshold for that frequency. If the power spectrum at a discrete frequency is less than the masking threshold or the threshold of hearing, the frequency is deemed masked. A modified power spectrum is then produced

by taking those frequencies deemed masked and zeroing their values. This method is equivalent to generating a spectral weighting function whose values are unity for unmasked frequencies and zero for masked frequencies (or frequencies whose power is below the threshold of hearing) and then multiplying the input spectrum by this weighting function. The result is a power spectrum that contains only unmasked information. Recognising that the autocorrelation of a discrete stochastic signal is the inverse Discrete Fourier Transform (IDFT) of the power spectrum, the perceptually altered power spectrum is transformed to the autocorrelation function of the unmasked speech. A perceptually altered linear predictor can then be easily calculated using the well known Levinson-Durbin recursion [Makh72].

Figure 3.1: Functional block diagram of SMWLPC
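The complete path of Figure 3.1 can be sketched as below; the masking threshold is assumed to be supplied per frequency bin by the model of the next section, the Hamming window and small floor constant are illustrative choices, and the Levinson-Durbin routine follows the textbook recursion rather than any particular library.

```python
import numpy as np

def levinson_durbin(r, order):
    """Textbook Levinson-Durbin recursion for the normal equations."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        a[1:i + 1] += k * a[i - 1::-1]                      # order update of A(z)
        err *= 1.0 - k * k
    return -a[1:]                     # predictor coefficients a_1..a_p

def smwlpc(frame, threshold_db, order=10, n_fft=512):
    """Sketch of the Figure 3.1 pipeline.

    threshold_db : masking threshold per rfft bin (length n_fft//2 + 1)
    Masked bins of the power spectrum are zeroed, the modified spectrum is
    inverse transformed to an autocorrelation sequence, and Levinson-Durbin
    then yields the perceptually altered predictor.
    """
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)), n_fft)
    power = np.abs(spectrum) ** 2
    power_db = 10.0 * np.log10(power + 1e-12)   # small floor avoids log(0)
    power[power_db < threshold_db] = 0.0        # zero the masked bins
    r = np.fft.irfft(power)[:order + 1]         # IDFT of power = autocorrelation
    return levinson_durbin(r, order)
```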

3.2.3 The masking threshold function

The psychoacoustic model used to calculate the masking threshold function is based on that proposed in [John88], with the parameters modified to optimize the performance of SMWLPC. A block diagram of the method is shown in Figure 3.2.

Figure 3.2: Block diagram of the masking threshold calculation

Figure 3.3: Inter band spreading function (slopes of -25 dB/CB and -10 dB/CB)

The input power spectrum is grouped into non-overlapped bands (Critical Bands (CB)). The critical bands are of variable bandwidth and are designed to simulate the frequency response of the human ear [Scha70]. The power spectrum lines within each band are summed together to give an estimate of the energy value for each band, and a band energy waveform is generated. To simulate the masking effect between bands, the band energy waveform is provided as the input to an inter-band masking calculator. The inter-band masking calculator convolves the band energy waveform with a spreading function (shown in Figure 3.3) to produce a spread band energy waveform. The spread band energy waveform is then used to determine an Initial Masking Threshold Function (IMTF) according to the following formula:

IMTF(i) = Energy(i) - O(i)    (3.1)

where Energy(i) represents the total energy of the ith band of the spread band energy waveform, measured in decibels, and O(i) is given by:

O(i) = \begin{cases} 50, & \alpha < 0.2 \\ \alpha(\beta + i) + (1 - \alpha)\gamma, & \alpha \geq 0.2 \end{cases}    (3.2)

where:

\alpha = \min(SFM / SFM_{max},\; 1)    (3.3)

SFM = 10 \log_{10}(G_m / A_m)    (3.4)

SFM_max is an empirically determined value; G_m is the geometric mean of the power spectrum; A_m is the arithmetic mean of the power spectrum; β and γ are empirically determined constant values that represent the tone masking noise and noise masking tone thresholds respectively. The value of SFM_max suggested in [John88] was -60 dB; however, upon testing with a pure sine wave at 1 kHz, the required SFM_max to give an alpha value of 1 was determined to be -40 dB. In (3.2) β and γ are set to 14.5 and 7 respectively. α is a measure of the flatness of the power spectrum; a value of 1 indicates a purely tonal signal and 0 represents pure noise. Equation (3.2) utilizes α to ensure that the correct mix of noise and tone thresholds is selected. The value for O(i) in (3.2) differs greatly from that suggested in [John88], where the definition is given as:

O(i) = \alpha(\beta + i) + (1 - \alpha)\gamma \quad \text{for all } \alpha    (3.5)

Setting (3.2) to a very large constant value for a very noise-like signal (α < 0.2) ensures that for this type of input the IMTF is made small and designates virtually the entire spectrum unmasked. This overcomes the situation where (3.5) designates virtually

the entire spectrum masked for such a signal, thus leaving too few samples to successfully generate the filter coefficients. This characteristic was reported to cause distortions in the reconstructed speech in [Luka00a], and the modification overcomes this problem. Also, in (3.5) γ = 5.5, contrasting with the increased γ = 7 in (3.2). This modification enhances the performance of SMWLPC and was determined empirically through informal listening tests.
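A sketch of the offset calculation of (3.2)-(3.4) is given below, using the modified values stated above (SFM_max = -40 dB, β = 14.5, γ = 7); the function name and the handling of zero-power bins are illustrative.

```python
import numpy as np

def tonality_offset(power_spectrum, band_index, sfm_max=-40.0, beta=14.5, gamma=7.0):
    """Compute the offset O(i) of (3.2) for critical band i.

    power_spectrum : linear power spectrum of the frame
    band_index     : critical band number i
    """
    p = power_spectrum[power_spectrum > 0]            # ignore empty bins
    geo_mean = np.exp(np.mean(np.log(p)))             # G_m
    arith_mean = np.mean(p)                           # A_m
    sfm_db = 10.0 * np.log10(geo_mean / arith_mean)   # (3.4), always <= 0 dB
    alpha = min(sfm_db / sfm_max, 1.0)                # (3.3): 1 = tonal, 0 = noise
    if alpha < 0.2:                                   # noise-like: leave spectrum unmasked
        return 50.0
    return alpha * (beta + band_index) + (1.0 - alpha) * gamma
```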

The IMTF is then adjusted by the threshold normaliser to account for mis-estimation of the Energy(i) values resulting from the shape of the spreading function. This results in a masking threshold function, an example of which (with the corresponding power spectrum) is shown in Figure 3.4.

Figure 3.4: Example of the masking threshold function

3.2.4 Mathematical Analysis of SMWLPC

SMWLPC is now analysed mathematically to explore and contrast the differences between this approach and standard LPC. The MSE solution for the standard linear predictive coefficients using the autocorrelation method was given in Section 2.3, Equation (2.2). As the input frame of speech is assumed to be a stationary random process, the autocorrelation values R(n) can be computed via an inverse discrete Fourier transform of the power spectral density P(k) [Proa96]:

R(n) = \frac{1}{N} \sum_{k=0}^{N-1} P(k) e^{j 2\pi k n / N}, \quad n = 0, \ldots, N-1    (3.6)

If the calculation of R(n) in (3.6) is modified to only operate on the perceptually important (unmasked) values of k, the autocorrelation becomes:

R(n) = \frac{1}{L} \sum_{l \in \mathrm{unmasked}} P(l) e^{j 2\pi l n / N}, \quad n = 0, \ldots, N-1    (3.7)

where L represents the number of unmasked frequency bands of N.

Substituting the autocorrelation sequence (3.7) into (2.2) gives:

\frac{1}{L} \sum_{\text{unmasked } l} P(l) e^{j\omega l n / N} = \sum_{k=1}^{p} a_k \left( \frac{1}{L} \sum_{\text{unmasked } l} P(l) e^{j\omega l (n-k) / N} \right), \quad n = 1, \ldots, p \quad (3.8)

It is clear that the above equations solve the mean square solution for the coefficients a_k using only the unmasked values of k. Also, as 1/L is a common factor it can be removed from the equations. This results in each summation term being equal to only the sum of the unmasked values of P(k) multiplied by the respective harmonic component, which is identical in value to the sum over all k with the masked values of P(k) set to zero. The above analysis demonstrates that SMWLPC fits only to unmasked regions and simply ignores the masked regions in its calculation of the LP coefficients. The fact that only the unmasked regions are modelled allows SMWLPC to achieve a better fit to these regions, as complexity is not wasted attempting to model masked regions.

An alternate approach to examining the effect of SMWLPC is to view the predictor error in the frequency domain. This can be expressed as [Rabi78]:

E = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{S(e^{j\omega})}{\left| H(e^{j\omega}) \right|^2} \, d\omega \quad (3.9)

Equation (3.9) shows that minimizing E is equivalent to minimizing the ratio of the input energy spectrum S(e^{jω}) to the squared magnitude of the frequency response H(e^{jω}) of the model. It can be seen that zeroing the power spectrum (the numerator) at any particular frequency causes the difference between the model and the spectrum at that frequency to contribute nothing to the integral of the ratio over the entire spectrum. The result is that the zeroed (masked) regions have no effect in calculating the linear predictive coefficients.

Computational Complexity

The computational complexity of SMWLPC is increased when compared to standard LPC. However, this includes calculation of the psychoacoustic model parameters, which remain available for other coding tasks such as quantisation. In standard LPC, calculation of the autocorrelation requires (p + 1)N_w real operations [Rabi78], where p is the filter order and N_w is the window size. SMWLPC uses an FFT and requires approximately 2N_f \log_2 N_f real multiplications plus N_f/2 comparisons to calculate the autocorrelation function, where N_f is the FFT length used. SMWLPC also requires approximately 2N_f operations in the calculation of the psychoacoustic parameters. Both methods require approximately p^2 operations to solve the matrix equations. The configuration used here is N_w = 240, p = 10 and N_f = 512, for which standard LPC requires (p + 1)N_w + p^2 = 11 × 240 + 100 = 2740 operations, while SMWLPC requires several times more. The computational demand of SMWLPC can be reduced to 5308 operations by decreasing the FFT length N_f from 512 to 256. This size of transform has little effect on the performance of SMWLPC for 4 kHz band-limited speech.

Data Windowing Requirements

Initial subjective A/B comparison testing, where the LPC in an FS1016 [Nati] CELP coder was replaced with the SMWLPC, revealed that SMWLPC provided a greater improvement for male speech than for female speech [Luka00c]. The cause of this discrepancy in performance was found to be spectral leakage in the Fourier transform. This leakage acts as an initial spreading function for low pitched speech and thus causes the masking threshold generated for this speech to become distorted.

The psychoacoustic model that is the basis for the SMWLPC psychoacoustic model has been shown to produce a consistent masking threshold for audio signals sampled at 32 kHz or greater [John88]. However, because the transform window size has to be dramatically reduced for narrow band speech signals, the spectral leakage between frequency bins can cause the masking threshold to become inconsistent. For audio signals the transform window length used in the psychoacoustic model is set to 2048 samples. This window length produces frequency bins that are separated by less than 16 Hz and, as the lowest pitch frequency in the audio range is approximately 50 Hz, any harmonic components are separated by at least 3 frequency bins. In this case spectral leakage between frequency bins [Proa96] has little effect, even if a rectangular window were used. However, due to both a reduction in the sampling rate from 32 kHz to 8 kHz and a requirement for speech compression to operate in real time, the transform window for narrow band speech signals has to be dramatically reduced from 2048 samples to approximately 200 to 240 samples. This short window length causes the frequency separation between adjacent frequency bins in the Fourier transform to become almost equal to the lowest pitch frequency for voiced speech of approximately 50 Hz. This characteristic means pitch harmonics may occur in adjacent frequency bins, and spectral leakage between the frequency bins tends to have a large effect.

To determine a window technique that ensures a consistent masking threshold for SMWLPC, a comparison of the frequency responses for various window functions was conducted. This comparison included rectangular, Hamming [Proa96], Hanning [Proa96] and AC3 [United] windows. Mathematical representations of the rectangular, Hamming, Hanning and AC3 windows are shown in Appendix A. Figures 3.5 and 3.6 show the frequency response of the rectangular, Hamming and AC3 windows for window sizes of 200 and 240 samples respectively.

The Hanning window is not included in Figures 3.5 and 3.6 as it exhibits the same main lobe width and similar main lobe attenuation to the Hamming window. Examining Figure 3.5 reveals that for the rectangular window, the peak of the first side lobe attenuation is only -13 dB and occurs at a frequency of approximately 50 Hz. This attenuation is not sufficient to suppress leakage between harmonics for low pitched male speech, which has a fundamental frequency in the range of approximately 50 to 80 Hz. Similarly, due to the wide main lobe width, the attenuation for this range of frequencies using the Hamming window is quite small. The AC3 window has a smaller main lobe width than the Hamming window and greater side lobe attenuation than the rectangular window. The combination of these characteristics allows the AC3 window to offer a minimum attenuation of -23 dB for frequencies greater than 55 Hz. This offers the best leakage suppression of any of the windows in the range of male speech harmonics. Examining Figure 3.6 indicates that when the window size is increased to 240 samples, the Hamming window offers the greatest attenuation for frequencies above 57 Hz. This corresponds to the fundamental frequency for a pitch length of 140 samples and thus covers most male speech.

Figure 3.5: Window comparison for a window size of 200 samples. Attenuation in dB against frequency in Hz for the rectangular, AC3 and Hamming windows.

Figure 3.6: Window comparison for a window size of 240 samples. Attenuation in dB against frequency in Hz for the rectangular, AC3 and Hamming windows.
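The comparison in Figures 3.5 and 3.6 can be reproduced along the following lines. This is only a sketch: the AC3 window of [United] is omitted, as just the rectangular and Hamming responses are available directly in NumPy:

    import numpy as np

    def window_response_db(window, nfft=8192, fs=8000.0):
        """Zero-padded magnitude response of a window, normalised to 0 dB."""
        spec = np.abs(np.fft.rfft(window, nfft))
        resp_db = 20.0 * np.log10(spec / spec.max() + 1e-12)
        freqs = np.arange(len(spec)) * fs / nfft
        return freqs, resp_db

    for size in (200, 240):
        for name, w in (("rectangular", np.ones(size)),
                        ("hamming", np.hamming(size))):
            freqs, resp = window_response_db(w)
            band = (freqs >= 50.0) & (freqs <= 4000.0)
            print("%s, %d samples: worst attenuation above 50 Hz = %.1f dB"
                  % (name, size, resp[band].max()))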

The preceding analysis of the window frequency responses shown in Figures 3.5 and 3.6 indicates that if a window size of 200 samples is required, then the AC3 window must be used, as this window offers the best leakage suppression for speech signals. However, if the window size is increased to 240 samples, the Hamming window offers the best performance for speech signals. A window size of 240 samples corresponds to a frame size of 25 ms with a 5 ms overlap (i.e. 30 ms of speech at 8 kHz). This is a common window size for use in low rate speech coders such as WI [Klei95b], and thus the format chosen for SMWLPC in the following analysis is Hamming windowed speech of 240 samples in length. Results generated using this configuration have shown that the masking threshold is consistent across the range of pitch values typical in speech. If a shorter window length is required, the AC3 window must be used to minimise the spectral leakage and thus produce a consistent masking threshold function.

3.3 Experimental Results

3.3.1 Objective Results

LPC Spectral Estimate

The spectrum of the linear predictive filter provides a good estimate of the spectrum of the input speech. This relationship is clearly evident when examining Equation (3.9), which shows that the predictor coefficients are calculated by minimising the ratio of squared error between the speech and filter spectra. This property of linear predictive filtering is widely exploited in harmonic coders to provide a bit rate effective means of transmitting the spectral envelope. To examine the effect of SMWLPC on the accuracy of the spectral estimate, 10th order LPC and SMWLPC analyses were performed for a number of voiced and unvoiced speech segments. The spectra produced by both methods were then compared to the

actual speech spectrum. A typical example of the spectra produced is shown in Figure 3.7 (the masked frequencies are indicated by shading). It is clearly evident in Figure 3.7 that the SMWLPC spectrum is a more accurate representation of the input speech spectrum in unmasked formant regions. As can be seen at around 800 Hz in Figure 3.7, the increased accuracy often results in the SMWLPC modelling two distinct formant peaks of the input spectrum whilst the standard LPC produces only a single peak between the two peaks of the input spectrum. This better modelling of the perceptually important formant regions allows SMWLPC to remove more of this perceptually important information from the input speech than a standard LP filter.

Figure 3.7: Comparison of SMWLPC and standard LPC spectral estimates (actual PSD, standard LPC estimate and SMWLPC estimate in dB against frequency in kHz; masked frequencies shaded).

To obtain an objective measure of the amount of extra perceptually important information that is removed by SMWLPC, the average Weighted Unmasked Residual Energy (WURE) was calculated using:

WURE = \frac{1}{L} \sum_{\text{unmasked } l} w(l) P_r(l) \quad (3.10)

where P_r(l) is the PSD of the residual signal, w(l) is a weighting function equal to a 40th order LP spectrum of the input speech normalised to have unity maximum, and L is the total number of unmasked spectral lines. The use of the weighting function w(l) in (3.10) places greater emphasis on the perceptually important formant regions of the input spectrum.

Tenth order LP analysis using both SMWLPC and standard LPC was performed on 10 input sentences (5 male and 5 female). Frames of length 200 samples were used in the analysis, with a linear predictive Hamming window of 240 samples having an overlap of 20 samples between frames. The WURE of each frame with an α (from (3.3)) greater than 0.2 was calculated via (3.10), and the values were averaged across the entire sentence. An α greater than 0.2 was used as these are the frames for which SMWLPC differs from standard LPC, as explained above. The results of the analysis are shown in Table 3.1. The results in Table 3.1 show the average percentage reduction in WURE for SMWLPC compared to standard LP for each input sentence. The results demonstrate that more perceptually important information was removed by SMWLPC for each of the input files. The average improvement for all sentences was 5.84%; this represents a significant improvement and indicates that SMWLPC removes significantly more perceptually important information from the input signal than standard LPC.
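A minimal sketch of the WURE measure (3.10), assuming the residual PSD, the 40th order weighting spectrum and the boolean mask all share the same frequency grid:

    import numpy as np

    def wure(residual_psd, weight, masked):
        """Weighted Unmasked Residual Energy, eq. (3.10)."""
        unmasked = ~masked
        L = unmasked.sum()             # number of unmasked spectral lines
        return (weight[unmasked] * residual_psd[unmasked]).sum() / L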

Table 3.1: Percentage greater WURE removed by SMWLPC, listed per sentence (sentence number, speaker gender and SMWLPC percentage improvement for each of the ten test sentences, sentences 1 to 5 male and 6 to 10 female).

A typical example of the difference between the weighted residual power spectra for a standard LP filter and the SMWLPC filter over a typical speech segment is shown in Figure 3.8. A positive value indicates that the SMWLPC residual has greater power and a negative value indicates that the standard LPC residual is of higher power. The masked frequencies are shaded. Figure 3.8 shows that in ranges of frequency that are largely free of masking or exhibit regularly spaced masking (strongly voiced), such as between 0 Hz and 1500 Hz, the SMWLPC residual has lower power than the standard LPC residual. Also, in regions that are heavily masked, such as between 2700 Hz and 3500 Hz, the SMWLPC residual has greater magnitude than the standard LPC residual. These results reinforce the claims that SMWLPC removes more of the perceptually important unmasked information from the signal than a standard LPC.

Figure 3.8: Difference in weighted residual energy (in dB against frequency, with the masked lines shaded).

Quantisation Properties

The direct form LP coefficients a_k shown in the calculations of Section 2.1 are susceptible to quantisation noise [Rabi78]. Due to this characteristic they are rarely used in speech coding [Kang85]. The most popular representation of the LP coefficients is Line Spectral Frequencies (LSFs). The LSFs are calculated from the direct form coefficients, and their characteristics make them suitable for quantisation. These characteristics include monotonically increasing order, strong intra and inter frame correlation and clustering together at formant frequencies [Kond95]. To examine the effect that SMWLPC has on the correlation properties of the LSFs, the inter and intra frame correlations for both standard and SMWLPC LSF parameters were compared. This comparison showed no significant differences in the correlation values between the two methods. To verify this finding in a practical situation, a vector linear predictor as proposed in [Yong88] was calculated for both the standard LPC and SMWLPC LSFs respectively. The predictor produced is a square matrix that uses the

LSF vector from the previous frame to estimate the LSF vector for the current frame by exploiting both intra and inter frame correlation. The spectral distortion between the predicted vector and the actual vector was calculated for each frame of a test sequence of 1000 frames and was then averaged across all frames. The spectral distortion was calculated via (3.11) and the results are shown in Table 3.2.

sd = \left( \frac{1}{N} \sum_{k=0}^{N-1} \left[ 10 \log_{10} \frac{|S(k)|^2}{|H(k)|^2} \right]^2 \right)^{1/2} \quad (3.11)

where N is the FFT length, S(k) is the actual LPC spectrum and H(k) is the predicted LPC spectrum.

                                SMWLPC     STD LPC
    Average spectral distortion 2.38 dB    2.31 dB

Table 3.2: Average spectral distortion for the predicted LSF vector

The results shown in Table 3.2 indicate that the spectral distortion was virtually identical for both LP methods. The difference of 0.07 dB is statistically insignificant, as the resultant error would then be vector quantised, and a final spectral distortion of less than 1 dB is known to produce transparent results for speech coding [Pali93]. Achieving a virtually identical spectral distortion indicates that in practical situations SMWLPC maintains the high inter and intra frame correlation of standard LPC LSFs, and thus the SMWLPC LSFs are suitable for high compression quantisation schemes such as vector linear prediction.
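A minimal sketch of the spectral distortion measure (3.11), evaluating both LP spectra on the same N-point grid from their direct form coefficients:

    import numpy as np

    def lp_power_spectrum(a, nfft=256):
        """Magnitude-squared LP spectrum 1/|A(e^jw)|^2 on an nfft grid."""
        return 1.0 / (np.abs(np.fft.rfft(a, nfft)) ** 2 + 1e-12)

    def spectral_distortion_db(a_actual, a_predicted, nfft=256):
        """Eq. (3.11): RMS of the dB difference between the two spectra."""
        s = lp_power_spectrum(a_actual, nfft)
        h = lp_power_spectrum(a_predicted, nfft)
        diff_db = 10.0 * np.log10(s / h)
        return np.sqrt(np.mean(diff_db ** 2))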

Subjective Listening Tests

To test the performance of the SMWLPC in existing speech codecs, a version of the 4.8 kbps FS1016 CELP coder [Nati] and a WI coder [Klei95b] operating at 2 kbps [Luka00a] were modified to use the SMWLPC in place of the standard LPC. The motivation for selecting the CELP and WI coders was to test the performance of SMWLPC in structures that code the LP residual signal in a closed loop and an open loop manner respectively. As the WI coder uses vector quantisation of the LSF parameters, the coder was set to operate with non-quantised LSFs. This removed the need to retrain the LSF codebook for the SMWLPC and also ensured an unbiased evaluation of SMWLPC's effect on the perceptual content of the residual signal, with no effect from quantisation errors of the LPC parameters. This modification was not necessary for the CELP coder, as it uses scalar quantisation of the LPC parameters, which was found to suit both the standard LPC and SMWLPC. All other parameters, including codebooks, were left unaltered. Each of the coders was used to generate synthesized speech for 10 input speech sentences (5 male, 5 female) from the TIMIT database, using both the standard LPC and SMWLPC. Subjective forced A/B comparison testing comprising 20 untrained listeners was conducted. To avoid statistical bias in the results, each sentence pair was played twice in each test with the order of the sentences being reversed. Thus the total test comprised the comparison of some 800 sentence pairs. The results are shown in Tables 3.3 and 3.4.

    Speaker Gender    SMWLPC    STD LPC
    Female            59.5%     40.5%
    Male              53.5%     46.5%
    Total             56.5%     43.5%

Table 3.3: A/B comparison results for the FS1016 CELP coder

    Speaker Gender    SMWLPC    STD LPC
    Female            54.5%     45.5%
    Male              57.5%     42.5%
    Total             56%       44%

Table 3.4: A/B comparison results for the WI coder

An alternative view of the results is to look at the majority listener preference for the sentences. These results are shown in Table 3.5.

              SMWLPC    STD LPC    No Preference
    CELP      70%       30%        0%
    WI        60%       20%        20%
    Total     65%       25%        10%

Table 3.5: Majority preferred sentences for SMWLPC

The results clearly indicate a preference for the SMWLPC coded speech in all instances and for both coders. This clear preference is despite the fact that the coding structures for both coders were left unaltered. Modifying the quantisation procedures for the residual signal to suit the SMWLPC characteristics by, for example, retraining codebooks and introducing search weighting functions that suit the SMWLPC characteristics (such as that proposed in [Sen93]) could be expected to show further substantial improvements in the performance of the coders when using SMWLPC.

It is interesting to note that for the CELP coder the preference for female speakers using SMWLPC was higher than for males, while for the WI coder this was reversed. It is a well known property that CELP coders sound better for male speakers due to the retention of phase (temporal) information but poor modelling of the harmonic structure in the coding process [Skog00]. Conversely, harmonic type speech coders such as WI coders are better suited to female speakers due to the retention of the harmonic structure but loss of the phase information [Skog00]. It appears that by removing more of the perceptually important information from the input speech before the residual is coded, SMWLPC is able to overcome some of the shortcomings of a particular low rate coding algorithm. The results presented extend and support those reported in [Luka00a] and [Luka00c], where a significant preference for SMWLPC coded sentences was reported. In [Luka00a] Mean Opinion Score (MOS) testing was conducted using a 2 kbps WI coder, and the results showed an improvement in MOS score from 3.31 to 3.45 when the standard LPC was replaced by SMWLPC.

3.4 Summary

A new technique that modifies the calculation of the LPC so as to better model the source for low rate speech coding has been developed. The technique involves the use of a psychoacoustic model to determine the simultaneously masked frequencies and also the frequencies whose power falls below the threshold of hearing. This information is then used to weight the power spectrum of the input speech, producing a modified power spectrum that contains only unmasked information. A modified autocorrelation function is then generated via a DFT operation, and standard recursive algorithms are used to solve for the LPC. Retaining the use of the standard recursive algorithms limits

any increase in computational complexity and also ensures that a stable all-pole filter is produced. Experimental results have shown that the technique better models the spectrum in the unmasked formant regions and thus removes more of the perceptually important information from the input speech signal than a standard LP filter. Subjective listening tests using both CELP and WI coders have confirmed that this property improves the perceptual quality of the synthesized speech for a given residual coding method.

Chapter 4

Scalable Representation of the Linear Prediction Residual

4.1 Introduction

Currently a bit rate barrier (at approximately 4 kbps) exists for speech coders. Below the barrier parametric coders dominate, while above the barrier waveform coders give preferable results. As is apparent in the literature review of Chapter 2, this barrier is particularly evident for speech coders that take advantage of the coding gain offered by linear prediction (LP), many of which are standardised coders [Supp97, Nati, ITU96a, ITU96b]. LP based waveform coders select the LP excitation signal via an Analysis-by-Synthesis (AbyS) technique that minimizes the error between the input and synthesized speech, whilst LP parametric coders attempt to model and code the LP residual signal directly. As discussed in Chapter 2, the ability of both of these methods to operate across a range of bit rates that spans 4 kbps is limited by the characteristics of the respective methods.

To exploit the coding gain of LP in a scalable coder that operates across a range of bit rates spanning 4 kbps, it is necessary to consider scalable methods of representing the LP residual. In particular, the representation should be a parameter set that allows the perceptual quality of the synthesized speech to be scaled according to the bit rate available for parameter transmission. This chapter proposes and examines the performance of tools that allow the LP residual to be represented by a scalable set of parameters. Hence, the techniques provide an alternative to using a specific parametric or waveform coder. In Section 4.2 a low delay method of segmenting the speech signal into pitch length sub-frames is proposed. This method allows the pitch evolution redundancies of the speech signal to be exploited in a low delay coding structure. Section 4.3 discusses a method for representing the LP residual with a set of parameters that allow scalable synthesis of the speech signal. The proposed method uses the low delay pitch segmentation from Section 4.2 to achieve the scalable representation.

4.2 Low Delay, Real Time Pitch Synchronous Sampling of the Speech Waveform

Introduction

Traditional LP based speech coders such as CELP [Schr85] use fixed size sub-frames to code the residual signal. The disadvantage of using fixed size sub-frames is that the underlying redundancy of the residual due to the pitch of the speech is not fully exploited. Speech coders such as WI [Klei94] and PSI-CELP [Mano97] have demonstrated that modelling the residual signal in a pitch synchronous manner allows either increased compression or improved perceptual quality. Further, as is shown in Section 4.3, using pitch length sub-frames in an LPAS coding structure leads to a scalable representation of voiced speech frames.

The WI paradigm extracts a constant number of pitch length segments per frame of speech, using frame based pitch detection and linear interpolation to calculate the respective pitch lengths (this results in representing the actual pitch track with a linear approximation). A consequence of extracting a fixed number of pitch length segments per frame, regardless of the actual pitch length, is that the extracted segments overlap for all but the shortest permissible pitch lengths. This overlap is variable, according to the relationship between the segment extraction rate and the current pitch, and complexity is added to subsequent processing in order to accommodate samples being contained multiple times in the extracted pitch length segments. Also, depending on the lengths of the individual pitch length segments, using a linear approximation to the pitch track can result in the extracted pitch segments not containing a whole pitch period. This extraction method also requires a large algorithmic delay to accommodate the pitch detection. These characteristics constrain the adaptability of the WI pitch segment extraction mechanism for scalable coding. More recently it has been shown that WI based coders can be modified to achieve perfect reconstruction by using critically sampled pitch length sub-frames [Klei98, Chon00]. In [Chon00], Chong-White proposed a method for extracting critically sampled pitch length sub-frames in real time. This method uses a composite correlation function [Haag95] to calculate an initial estimate of the pitch. The composite correlation function is calculated as:

R_{comp}(d) = R_{curr}(d) + \max_{-l(d) \le i \le l(d)} \left[ w(i) R_{past}(d+i) \right] + \max_{-l(d) \le i \le l(d)} \left[ w(i) R_{future}(d+i) \right] \quad (4.1)

where:

R(d) = \sum_{k=0}^{K} s(k) s(k-d)

where s(k) is a segment of signal with K samples, R_curr is the correlation centred around the current signal segment, R_past is the correlation centred around a past signal segment, R_future is the correlation centred around a future signal segment and w(i) is a window whose length l(d) is determined from the candidate pitch period d. The initial pitch estimate is generated by selecting the peak correlation value in R_comp. This initial pitch estimate is then refined to give individual pitch period lengths. The pitch pulse location in each of these pitch periods is then estimated from the energy contour of the individual pitch periods. The method requires a significant delay (approximately 60 ms) due to the use of autocorrelation pitch detection. The selection of the pitch pulse locations also relies on heuristic mechanisms and as such may be susceptible to errors in some situations.

This section proposes a low delay method that critically samples the speech waveform pitch synchronously (i.e. segments the waveform into non-overlapped pitch length sections). The technique exploits both the excellent residual modelling properties of the zinc pulse and the perceptual properties of the speech signal to critically sample the speech into pitch length sub-frames. This method also requires only a single frame of delay (20 to 30 ms).

The Zinc Pulse

The zinc pulse is represented in the time domain as [Sukk89]:

z(n - \lambda) = A \, sinc(n - \lambda) + B \, cosc(n - \lambda) =
\begin{cases}
A & n - \lambda = 0 \\
\dfrac{2B}{(n - \lambda)\pi} & n - \lambda \text{ odd} \\
0 & n - \lambda \text{ even}, \; n - \lambda \neq 0
\end{cases} \quad (4.2)

where λ is the position of the pulse.
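A minimal sketch of the discrete time zinc pulse of (4.2):

    import numpy as np

    def zinc_pulse(A, B, lam, n_samples):
        """Zinc pulse of (4.2): amplitude A at the pulse position lam,
        2B/((n - lam)*pi) at odd offsets and zero at even non-zero offsets."""
        offset = np.arange(n_samples) - lam
        z = np.zeros(n_samples)
        z[offset == 0] = A
        odd = (offset % 2) != 0
        z[odd] = 2.0 * B / (offset[odd] * np.pi)
        return z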

Figure 4.1: Comparison of residual pulse to zinc pulse (amplitude against time in samples for a residual domain pitch pulse and a zinc pulse).

The zinc model has been found to be superior to impulses when modelling voiced LP residuals [Sukk89]. This superior modelling is due to the shape of the zinc pulse closely matching the shape of a residual domain pitch pulse. This similarity is shown in Figure 4.1. When used as a basis function it has been shown that the zinc pulse provides an orthogonal basis that spans all band limited signals, provided the adjacent zinc pulses are separated by an even number of samples (i.e. every second sample) [Sukk89]. Using the zinc pulse as an orthogonal basis allows the LP speech residual to be modelled either directly or in a closed loop manner. Direct residual modelling is achieved by minimizing the MSE between the zinc pulse estimate and the residual signal, whilst closed loop modelling involves converting the zinc estimate into the speech domain and minimizing the MSE between the speech domain estimate and the input speech.

A frame of residual is modelled directly by locating the zinc pulse at all orthogonal positions within the frame, and calculating the A and B parameters that minimize the Mean Squared Error (MSE) for each position. The P zinc pulses that minimize the MSE are then used to model the residual, where P is the order of the model used (i.e. the number of pulses per frame). The MSE E_n for pulse location λ_n is represented as:

E_n = \sum_{n=0}^{N-1} \left[ r(n) - z(n - \lambda_n) \right]^2 \quad (4.3)

where r(n) is the residual signal and N is the number of samples in a frame. Minimising this error can be reduced to maximizing [Sukk89]:

L_n = A_n^2 + B_n^2 \quad (4.4)

where:

A_n = \sum_{n=0}^{N-1} r(n) \, sinc(n - \lambda_n), \qquad B_n = \sum_{n=0}^{N-1} r(n) \, cosc(n - \lambda_n)

For closed loop zinc modelling of a speech frame the error is represented as:

e(n) = S(n) - z_k(n) * h(n) \quad (4.5)

where S(n) is the input speech, z_k is a zinc pulse as shown in (4.2) and h(n) is the impulse response of the Linear Predictive synthesis filter. Minimising e(n) in a mean squared manner for a given location λ_k leads to the optimal parameters:

A_k = \frac{\sum_{n=0}^{N-1} \left( sinc(n - \lambda_k) * h(n) \right) S(n)}{\sum_{n=0}^{N-1} \left( sinc(n - \lambda_k) * h(n) \right)^2}, \qquad
B_k = \frac{\sum_{n=0}^{N-1} \left( cosc(n - \lambda_k) * h(n) \right) S(n)}{\sum_{n=0}^{N-1} \left( cosc(n - \lambda_k) * h(n) \right)^2} \quad (4.6)
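Before moving to the closed loop search, note that for the direct (open loop) case, (4.4) reduces to a particularly simple search: at integer sample offsets the sinc term is an impulse, so A_n is just r(λ_n). A minimal sketch, assuming the residual is a NumPy array:

    import numpy as np

    def cosc(offset):
        """Cosc sequence of (4.2) with B = 1."""
        out = np.zeros(len(offset), dtype=float)
        odd = (offset % 2) != 0
        out[odd] = 2.0 / (offset[odd] * np.pi)
        return out

    def best_zinc_location(residual):
        """Return the orthogonal (even-spaced) location maximising
        L_n = A_n^2 + B_n^2 of (4.4)."""
        n = np.arange(len(residual))
        best_L, best_lam = -1.0, 0
        for lam in range(0, len(residual), 2):
            A = residual[lam]                       # sinc term is an impulse
            B = float(np.dot(residual, cosc(n - lam)))
            L = A * A + B * B
            if L > best_L:
                best_L, best_lam = L, lam
        return best_lam, best_L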

The location λ_n that minimizes the speech domain MSE for a given frame is found by determining the orthogonal location that maximizes:

M = \frac{\left[ \sum_{n=0}^{N-1} \left( sinc(n - \lambda_n) * h(n) \right) S(n) \right]^2}{\sum_{n=0}^{N-1} \left( sinc(n - \lambda_n) * h(n) \right)^2} + \frac{\left[ \sum_{n=0}^{N-1} \left( cosc(n - \lambda_n) * h(n) \right) S(n) \right]^2}{\sum_{n=0}^{N-1} \left( cosc(n - \lambda_n) * h(n) \right)^2} \quad (4.7)

A perceptual weighting filter such as that used in CELP [Schr85] can also be incorporated into the closed loop process by convolving both the input speech and the synthesis filter in (4.7) with the perceptual weighting filter impulse response prior to minimizing the MSE. This results in (4.7) being modified to:

M_{CL} = \frac{\left[ \sum_{n=0}^{N-1} \left( sinc(n - \lambda_n) * f(n) \right) \left( S(n) * w(n) \right) \right]^2}{\sum_{n=0}^{N-1} \left( sinc(n - \lambda_n) * f(n) \right)^2} + \frac{\left[ \sum_{n=0}^{N-1} \left( cosc(n - \lambda_n) * f(n) \right) \left( S(n) * w(n) \right) \right]^2}{\sum_{n=0}^{N-1} \left( cosc(n - \lambda_n) * f(n) \right)^2} \quad (4.8)

where w(n) is the impulse response of the perceptual weighting filter and f(n) is equal to w(n) * h(n).

Method

Due to the high correlation between the shape of the zinc pulse and residual domain pitch pulses, the locations within a frame of speech that minimize the MSE between the input signal and zinc pulse are strong candidates to be the locations of the pitch pulses within the frame. This is true for both direct modelling of the residual and closed loop modelling where the error is minimised in the speech domain. Figures 4.2(a) to 4.2(d) show a frame of voiced speech, the closed loop error function M from (4.7), the perceptually weighted closed loop error function M_CL from (4.8) and the direct residual modelling error function L from (4.4), respectively.

Figure 4.2: Comparison of error functions. a) Voiced speech. b) Closed loop error function. c) Weighted closed loop error function. d) Open loop error function. (Each error function is shown normalised against time in samples.)

The error functions in Figures 4.2(b) to 4.2(d) have each been normalised to have a maximum of unity. Examining Figure 4.2 it is evident that the zinc pulses that minimize the MSE (maximize the error function) do indeed cluster around the locations of the pitch pulses in voiced speech. This is true for all of the error functions shown; however, comparing Figures 4.2(b) to 4.2(d) indicates a distinct difference in the shape of the error function for each method. The open loop error function shown in Figure 4.2(d) is much noisier in appearance than the closed loop error functions of Figures 4.2(b) and 4.2(c). If peak picking of the error signal were used as a mechanism for selecting the pitch pulse locations, the noisy characteristic of the open loop error function would result in less consistent selection of the actual pulse locations than for the closed loop error function. Also, comparing the weighted error function of Figure 4.2(c) to the other two error functions reveals that using perceptual weighting suppresses the error value for the secondary pulses that occur between the pitch pulses. This characteristic, combined with the smooth contour of the weighted closed loop error function, produces a more reliable mechanism for selecting the pitch pulse locations via peak picking of this error signal than is produced using either of the other two error functions.

Using the perceptually weighted error function as a mechanism to select the pitch pulse locations within a frame results in an error function that contains the same number of samples as the input frame of speech (as shown in Figure 4.2(c)). Given a frame size of 200 samples (25 ms) and a minimum pitch length of 16 samples, a maximum of twelve pitch pulses can occur per frame. This apparent mismatch between the maximum number of possible pitch pulses and the size of the error function indicates that using the entire 200 samples of error function to select the pitch pulse locations is excessive, and thus adds unwarranted complexity to the pulse location procedure. It is more appropriate to clip the error function so that only the locations that sufficiently minimize

the MSE (maximize the error function M_CL) are used in the pitch pulse selection procedure. As the zinc pulses that minimize the error tend to cluster around pitch pulses, it is necessary to clip the error function in such a way that all pitch pulses within a frame are represented by at least a single zinc pulse. It was determined empirically that clipping the error function M_CL to contain the set of only those 40 locations and magnitudes that achieve the lowest MSE provides a good basis for selecting the pitch pulses within a frame. This clipped error function is referred to as M_clipped in the remainder of this section.

Due to the clustering of the locations in M_clipped around the pitch pulse locations, it is necessary to firstly group together the values of M_clipped that represent single pitch pulses, prior to selecting the candidate pitch pulse locations for the current frame. If direct peak picking were applied to the magnitudes of M_clipped without firstly grouping, a single pitch pulse location could dominate the peak picking process. The grouping operation is a real time iterative process that involves grouping any values in M_clipped whose locations fall within ±8 of the location whose magnitude maximizes M_clipped for that group. Grouping all locations within ±8 of the group maximum ensures that the minimum distance between two adjacent candidate pitch pulse locations is 16 samples (which is equal to the minimum permissible pitch period). The steps involved in the grouping process are defined as follows (a sketch of this procedure is given after Figures 4.3 and 4.4):

i) Set the iteration number n equal to 1.

ii) Set Max equal to the value of M_clipped whose magnitude maximizes the magnitude values of M_clipped.

iii) Set Group equal to the values of M_clipped with locations that fall within ±8 of the location of Max.

iv) Set Pitch_Pos(n) equal to Max.

v) Remove Group and Max from the set M_clipped.

vi) Increment n by one and repeat steps (ii)-(vi) until M_clipped is empty.

The result of the grouping procedure is that the values contained in Pitch_Pos represent an initial candidate pulse waveform, the locations of which represent an initial estimate of the pitch pulse locations within the frame. Examples of the input speech, the weighted closed loop error function M_CL and the initial candidate pulse waveform Pitch_Pos, calculated for a section of voiced and unvoiced speech, are shown in Figures 4.3 and 4.4 respectively.

Figure 4.3: Candidate pulse locations for a section of voiced speech (input speech, weighted closed loop error function and initial candidate pulses against time in samples).

Figure 4.4: Candidate pulse locations for a section of unvoiced speech (input speech, weighted closed loop error function and initial candidate pulses against time in samples).
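A minimal sketch of the grouping steps above, assuming m_clipped is a list of (location, magnitude) pairs for the 40 lowest-MSE zinc positions:

    def group_candidates(m_clipped, radius=8):
        """Steps (i)-(vi): repeatedly take the largest remaining magnitude
        as a group centre and absorb all entries within +/-8 samples of it,
        which enforces the 16-sample minimum pitch spacing."""
        remaining = list(m_clipped)
        pitch_pos = []
        while remaining:
            max_loc, max_mag = max(remaining, key=lambda lm: lm[1])
            pitch_pos.append((max_loc, max_mag))
            remaining = [(loc, mag) for (loc, mag) in remaining
                         if abs(loc - max_loc) > radius]
        return pitch_pos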

Figures 4.3 and 4.4 indicate the relationship between the input speech pitch pulse locations, the error function peaks and the initial estimate of the candidate pulse waveform. For the voiced speech shown in Figure 4.3 it is evident that a number of the candidate pulses correspond to the locations of the actual pitch pulses, and these pulses exhibit a definite periodic repetition. For the unvoiced speech of Figure 4.4 there are no pitch pulses and consequently there is no periodic repetition evident in the candidate pulses. It is evident from Figures 4.3 and 4.4 that the periodic relationship between the candidate pulses representing actual pitch pulses must be exploited in order to select only the candidate pulses representing the pitch pulses from the entire set of candidate pulses.

To determine if any periodic relationship exists between the candidate pitch pulses it is sufficient to ascertain if there is an underlying distinct correlation between the candidate pulse locations. To determine the correlation between the candidate pulses, it is possible to simply calculate the autocorrelation of the candidate pulse waveform Pitch_Pos. Using Pitch_Pos to determine the correlation between the candidate pulse locations is advantageous in that the candidate pulse locations are perceptually weighted according to their suitability in minimizing the perceptually weighted MSE in the speech domain. However, using Pitch_Pos directly in an autocorrelation calculation is extremely susceptible to pitch jitter. To overcome this jitter problem, a new Rectangular Pulse Error Function (RPEF) is generated by convolving Pitch_Pos with a unit amplitude rectangular pulse three samples wide (R_p). This calculation is defined as:

RPEF(n) = \sum_{k=0}^{n} Pitch\_Pos(k) \, R_p(n-k), \quad 0 \le n \le L-1 \quad (4.9)

where R_p is a unit amplitude rectangular pulse spanning three samples

and where L is the length of the frame, with the index of R_p taken relative to its zero position. To add reliability to the correlation calculation, particularly for large pitch periods, the RPEF used in the correlation calculation covers both the past and current frames (as opposed to only the current frame in equation (4.9)) and is designated RPEF_T. Finally, the consistency of the correlation function for all possible delays is improved by replacing the autocorrelation function with a normalized cross correlation function. The normalized cross correlation function for RPEF_T is then calculated as:

C_c(d) = \frac{\sum_{n=L}^{L+N-1} RPEF_T(n) \, RPEF_T(n-d)}{\sqrt{\sum_{n=L}^{L+N-1} RPEF_T(n)^2 \; \sum_{n=L}^{L+N-1} RPEF_T(n-d)^2}}, \quad P_{min} < d < P_{max} \quad (4.10)

where L is the sample in RPEF_T that designates the start of the current frame (201 for a frame size of 200 samples), N is the number of samples in a frame, P_min is the minimum pitch period candidate and P_max is the maximum pitch period candidate. The Rectangular Pulse Error Function RPEF_T and the normalized cross correlation C_c(d) for the voiced speech shown in Figure 4.3 are shown in Figures 4.5 and 4.6 respectively. As RPEF_T covers both the previous and current frames, the samples that represent the speech of Figure 4.3 occur from samples 201 to 400 in Figure 4.5.
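A minimal sketch of (4.9) and (4.10). Here pitch_pos_wave is assumed to be the candidate pulse waveform (magnitudes at candidate locations, zeros elsewhere) concatenated over the previous and current frames; the sub-harmonic refinement described in the following text is omitted:

    import numpy as np

    def rpef(pitch_pos_wave):
        """Eq. (4.9): smear the candidate pulses with a unit amplitude
        rectangular pulse three samples wide."""
        return np.convolve(pitch_pos_wave, np.ones(3), mode="same")

    def initial_pitch(rpef_t, start, N=200, p_min=16, p_max=140):
        """Eq. (4.10): normalised cross correlation over the candidate
        pitch range; start is the first sample of the current frame."""
        cur = rpef_t[start:start + N]
        best_d, best_c = p_min, -1.0
        for d in range(p_min + 1, p_max):
            past = rpef_t[start - d:start - d + N]
            denom = np.sqrt(np.dot(cur, cur) * np.dot(past, past)) + 1e-12
            c = float(np.dot(cur, past)) / denom
            if c > best_c:
                best_d, best_c = d, c
        return best_d, best_c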

Figure 4.5: Example Rectangular Pulse Error Function (amplitude against time in samples, covering the previous and current frames).

Figure 4.6: Example of the normalized cross correlation function (correlation value against time in samples).

Examining Figure 4.6 it is evident that there are distinct peaks in the normalized cross correlation function at 136 and 72. To ensure that the selected correlation peak is not a harmonic of the actual correlation peak, the correlation function C_c(d) is searched for peaks at sub-harmonics of the initial peak value. For Figure 4.6 this results in the peak at 72 being selected as the correlation value for the frame. This value becomes the initial pitch estimate for the frame of speech. In unvoiced speech there are no pitch pulses and thus the normalized cross correlation function has no distinct peaks. In this case the selection of the pitch length is arbitrary and defaults to a small value, due to the sub-harmonic searching.

Once the pitch estimate for the frame has been calculated, it is possible to select which pulse positions from the candidate pulse positions (Pitch_Pos) are actual pitch pulses. This involves an iterative process that exploits the perceptual weighting of the Pitch_Pos values to select the pitch pulses. The pitch pulse selection process is described as follows:

i) Initialise Actual_PP to be empty.

ii) Set the iteration number n equal to 1.

iii) Set Actual_PP(n) equal to the location in Pitch_Pos with the largest associated magnitude value.

iv) Search Pitch_Pos for a location that falls within the range of Actual_PP(n) plus the pitch value. The range searched is proportional to the pitch estimate and is limited to the boundaries of the current frame.

v) Increment n.

vi) If a value is located in Pitch_Pos within the search range, this is a pitch pulse location; set Actual_PP(n) equal to the location found.

vii) If no candidate pulse is found, and if Actual_PP(n-1) plus the pitch value is within the frame boundaries, set Actual_PP(n) equal to this value.

viii) Repeat steps (iii)-(viii) until the end of the frame is reached.

ix) Set the current pulse position equal to Actual_PP(1) and the pitch equal to -1 × pitch. Repeat steps (iii)-(ix) until the start of the frame is reached. At this point all of the pitch pulse locations are contained in Actual_PP and the process ceases.

To allow for pitch jitter in the pitch pulse selection process, the range searched at each iteration is dynamically adjusted according to the pitch value and the distance between the last two pitch pulses selected. That is, if the distance between the last two pulses is less than the pitch value, the upper value of the range is extended for the next search, and vice versa.

Due to the fixed frame size (200 samples) used, it is possible that a pitch pulse may fall across a frame boundary. As no look ahead is used in the pitch pulse selection process, a pulse that straddles the frame boundary will often be designated as a pitch pulse located at the last sample of the current frame. However, when processing the next frame, this same pulse may also be designated as a pitch pulse located at the first sample of the current frame. It is thus important to check that any pulse located near the start of the current frame has not already been identified in the previous frame. If the pulse has already been identified in the previous frame, it must be ignored in the current frame. Once the locations of the actual pitch pulses are known, the frame can be easily divided into pitch length sub-frames. A more accurate underlying pitch value than that determined from peak picking (4.10) can also be calculated by averaging the individual pitch lengths (distances between pitch pulses) for the frame.
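A minimal sketch of the forward pass of steps (i)-(viii), assuming candidates is a dict mapping Pitch_Pos locations to magnitudes; the backward pass of step (ix) and the dynamic adjustment of the search range are omitted for brevity, and the relative range width is an illustrative assumption:

    def select_pitch_pulses(candidates, pitch, frame_len, rel_range=0.2):
        """Anchor on the largest-magnitude candidate, then step forward by
        the pitch estimate, accepting the nearest candidate inside a range
        proportional to the pitch; when no candidate is found, a pulse is
        inserted at the predicted position (step vii)."""
        anchor = max(candidates, key=candidates.get)
        pulses = [anchor]
        pos = anchor
        while pos + pitch < frame_len:
            target = pos + pitch
            window = pitch * rel_range
            near = [loc for loc in candidates
                    if abs(loc - target) <= window and loc not in pulses]
            pos = min(near, key=lambda loc: abs(loc - target)) if near else target
            pulses.append(int(round(pos)))
        return sorted(pulses)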

For the speech shown in Figure 4.3 the located pitch pulses are 23, 87 and 159; these values correspond closely to the locations of the pitch pulses evident in the input speech.

Practical Results

The pitch tracks obtained using the proposed method and those using a correlation based method (Haagen's method based on equation (4.1)) were generated for male and female speech sentences. A frame size of 25 ms (200 samples) was used for the proposed method, and K in equation (4.1) was set to 400. A look ahead of 1 frame was used for the correlation method, with the future correlation window centred half a frame ahead of the current correlation window and the past correlation window centred half a frame before the current correlation window. Examples of a male and a female sentence together with the calculated pitch tracks are shown in Figures 4.7 and 4.8 respectively. Both of these figures indicate that in sections of strongly voiced speech, the pitch tracks for both methods are virtually identical. For unvoiced speech and transitional sections the proposed pitch track varies from the correlation-based pitch track. The pitch track in unvoiced speech is an arbitrary value and will not be discussed; however, the pitch track for transitional sections of speech can have a distinct effect on the quality of synthesized speech. If the pitch track adds extra periodicity to the transient sections, this will give a buzzy quality to the synthesized speech. A transitional section of Figure 4.7 (inside the dotted circle) is shown in Figure 4.9, and the transitional section of Figure 4.8 (inside the dotted circle) is shown in Figure 4.10.

Figure 4.7: Comparison of proposed (solid) and correlation-based (dotted) pitch tracks for male speech (input speech and pitch in samples against time in samples).

Figure 4.8: Comparison of proposed (solid) and correlation-based (dotted) pitch tracks for female speech (input speech and pitch in samples against time in samples).

Figure 4.9: Transitional pitch tracks for male speech. Proposed (solid), correlation-based (dotted).

Figure 4.10: Transitional pitch tracks for female speech. Proposed (solid), correlation-based (dotted).

Examining Figure 4.9 indicates that the correlation-based pitch track designates the pitch period from samples 250 to 350 as approximately 50 samples, and then from sample 350 to the end of the frame as a constant value of 70 samples. Examining the input speech indicates that this pitch track is clearly incorrect. There is actually a single pitch period of approximately 100 samples between samples 250 and 350. This is followed by a pitch period of approximately 150 samples between samples 350 and 500, before the pitch period becomes an almost constant value of 70 samples for the remainder of the frame. The pitch track for the proposed pitch detector closely follows that of the original speech throughout this transitional period. Examining Figure 4.10 shows that the traditional pitch detector finds an almost constant pitch period value of 40 samples across the entire section. Examining the actual speech shows this is also clearly incorrect. There is a transitional section between samples 100 and 350. During this transition the pitch increases to approximately 80 samples for one period and then to 150 samples for a second period, before reverting back to a value of approximately 40 samples. The pitch track for the proposed pitch detector follows the transitional pitch track of the speech very closely.

The results shown in Figures 4.7 to 4.10 indicate that the proposed pitch detector operates very well. The proposed method produces an accurate pitch track that closely models the pitch track of both voiced speech and transitional sections. In contrast, a correlation-based pitch detector tends to add extra periodicity to the transitional sections. This extra periodicity is due to the requirement of looking forward and backward in the correlation calculation. The proposed pitch detector was implemented in a variant of the WI coder. The coder was operated with unquantised parameters. Informal listening tests using clean speech from the TIMIT database indicated that the speech produced using the proposed pitch

detector sounded much less buzzy around transitional sections than speech produced with a traditional pitch detector.

Complexity

The complexity of using the preferred, perceptually weighted speech domain error minimization method is high. The majority of this complexity is due to the convolution operations required to convert the zinc estimate to the weighted speech domain and also to perceptually weight the input speech, in (4.8). Also, as the error is calculated across the entire frame, equation (4.8) requires a further N multiplications of the convolved signals for each location tested. The total number of multiplications for calculating the error function can be estimated as:

No. Mult = 2N^3 (for the zinc convolution) + N^2 (for weighting the speech) + 2N^2 (for the error calculation) = 2N^3 + 3N^2 \quad (4.11)

where N is the length of the input frame in samples. Introducing some small approximations into the calculation can dramatically reduce this complexity. Firstly, if the error minimization calculation is reduced from being calculated across the entire frame to only ±5 samples of the position under test, significant complexity reductions result. This limit in the error range is acceptable as the impulse response of the weighted synthesis filter falls away quickly. Limiting the error calculation to this range was also found to produce a less biased error function than that generated using the entire frame. This is because all pulses have equal impact on their respective error range if the restricted range is used, whereas, if the entire frame is used in the error calculation, the pulses at the start of the frame have more influence on the error calculation due to the causal nature of the Sinc pulse (part of the zinc pulse, see (4.2)). Restricting the error

calculation to ±5 of the position under test requires that 5 samples of look ahead be added to the method. However, as approximately 20 samples of look ahead are already required for calculating the LPCs, no additional delay is required. Once the error range is restricted, the number of multiplications required to perform the convolution operations can be reduced from 2N^3 (N multiplications for convolving both the Cosc and Sinc functions at each of the N positions in the frame). Firstly, as the error range is now only ±5 from the sample under test, the convolved signal required for the error calculation at each sample need only be 11 samples in length (±5 samples from the current location) as opposed to N samples previously. If the Sinc and Cosc portions of the zinc function are now treated separately for the convolution operations, significant reductions in the complexity required for convolving the functions at each sample can be achieved. As the Sinc portion of the zinc function (A at n - λ = 0 in (4.2)) is only an impulse, the convolution operation for this function at each sample can be reduced to only 6 multiplications (0 to 5 samples) with no loss in accuracy. However, the Cosc portion of the discrete time zinc function (2B/((n - λ)π) when n - λ is odd) has an infinite number of values occurring before the zero position under test, and thus restricting the length of the convolution operation at each sample introduces some inaccuracies due to the history of the IIR synthesis filter not being correctly updated. However, as the amplitude of the discrete time Cosc function falls away quickly (≈ 0.037B at n - λ = 17), it was found that restricting the convolution operation for the Cosc function to -15 to +5 produced no noticeable difference to the resultant error value. By now exploiting the time invariant property of the synthesis filter (given fixed initial conditions), the convolution operation can be further reduced from requiring a separate convolution at each sample (for the Sinc and Cosc functions) to requiring only

a single convolution for the Sinc function and a single convolution for the Cosc function per frame. This is possible due to the fact that for a time invariant system the convolution at delay d is equal to the convolution at the frame start time, shifted by d samples. Using these approximations the complexity required to calculate the error function can be reduced to:

No. Mult = 2X^2 (for the Sinc convolution) + 2Y^2 (for the Cosc convolution) + N^2 (for weighting the speech) + 2L^2 (for the error calculation) \quad (4.12)

where X is the number of samples in the range of the Sinc pulse (0 to +5), Y is the number of samples in the range of the Cosc pulse (-15 to +5), N is the number of samples in the frame and L is the number of samples in the error range (-5 to +5). Using the above complexity reduction and given N = 200, the complexity of the error function calculation is reduced from approximately 16.1 million multiplications (via (4.11)) to a few tens of thousands, with no degradation in the performance of the method. To this complexity the multiplications required for the normalized correlation function in (4.10) must be added. This requires:

Mult for Cross Corr = 3DN \quad (4.13)

where D is the span of the range of pitch delays d. To compare the complexity of the proposed method to the correlation based pitch detector based on (4.1), the complexity of the correlation-based detector can be determined using:

Corr-based No. Mult = L^2 (to calculate the residual) + 3DK (for R_{curr}) + 6DI (for R_{past} and R_{future}) = 3D(K + 2I) + L^2 \quad (4.14)

where I is the length of the window function used; its value depends on the candidate pitch value. Given K = L = N = 200, D = 140 and an average I of 10, the proposed pitch detector requires approximately 125 000 multiplications and the traditional pitch detector requires approximately 132 000 multiplications. Thus the proposed method is very close in complexity to the correlation-based method calculated via (4.1).

Summary for Low Delay Pitch Synchronous Sampling of the Speech Waveform

The proposed method for critical sampling of the speech into pitch length sub-frames initially performs closed loop modelling of the residual signal at each possible pitch pulse position. It is then determined whether there is a periodic correlation in the resultant error function. This allows an accurate pitch track to be generated and subsequently allows the speech waveform to be critically sampled into pitch length sub-frames. The pitch track generated using the proposed method is far more accurate in transitional sections of speech than that of traditional pitch detectors. This is because the optimal pitch pulse locations are located in a closed loop function, where the speech domain error is calculated for each candidate pulse location. This is in contrast to the use of correlation functions of the LP residual in traditional pitch detectors; these tend to find the underlying pitch of the surrounding voiced speech in transitional regions rather than the actual transition pitch periods. The proposed method also requires only a single frame of speech, as opposed to at least 1 frame of look ahead in correlation-based methods.
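As a check on the operation counts above (with the term sizes X = 6, Y = 21 and an 11-sample error range assumed from the stated ranges), the figures for (4.11)-(4.14) can be reproduced directly:

    # Worked check of (4.11)-(4.14) for N = K = L = 200, D = 140, average I = 10.
    N, K, L, D, I = 200, 200, 200, 140, 10
    X, Y, E = 6, 21, 11                               # sinc, cosc and error ranges

    full = 2 * N**3 + 3 * N**2                        # (4.11): 16 120 000
    reduced = 2 * X**2 + 2 * Y**2 + N**2 + 2 * E**2   # (4.12): 41 196
    proposed = reduced + 3 * D * N                    # plus (4.13): 125 196
    corr_based = 3 * D * (K + 2 * I) + L**2           # (4.14): 132 400
    print(full, reduced, proposed, corr_based)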

4.3 Scalable Coding of the Linear Prediction Residual for Speech

Within the LP speech coding paradigm, the dominance of parametric coders below 4 kbps and waveform coders above 4 kbps indicates that the LP residual is inherently non-scalable across a range of bit rates that spans 4 kbps. In this section a scalable algorithm is proposed which presents a solution to this non-scalable characteristic. The performance of this solution is then analysed and compared to the performance of accurately modelling the residual signal directly.

4.3.1 Method for scalable coding of the LP speech residual

Waveform-matching LP based speech coders employ closed loop AbyS modelling of the LP residual, and produce very high quality synthesized speech when sufficient bit rate is available for transmission [Atal82b]. Conversely, due to both the limitations of the models used to represent the residual signal and the inability to account for the effects of quantization errors and model anomalies on the synthesized speech, the quality of LP based parametric coders tends to saturate as the bit rate for transmission is increased. In view of this apparent shortcoming of parametric coding, it appears essential that any algorithm which attempts to operate over a scalable range of bit rates must converge to a waveform matching AbyS structure at high rates. Waveform matching is achieved in LP coders by minimizing the speech domain error, represented as [Atal82b]:

e(n) = s(n) − y(n) * h(n)    (4.15)

where s(n) is the input speech, y(n) is the residual excitation being calculated and h(n) is the impulse response of the Linear Predictive synthesis filter. The optimal

solution for y(n) is calculated by selecting the representation for y(n) which minimises:

E = Σ_{n=0}^{N−1} [s(n) − y(n) * h(n)]²    (4.16)

where N is the number of samples in the current frame.

When operated at low bit rates, the quality of speech produced by AbyS based speech coders tends to deteriorate rapidly. This reduction in quality is due to having insufficient bits available to accurately represent the residual excitation. At these low rates the AbyS coders tend to waste bits on perceptually unimportant information [Thys01]. To modify the operation of AbyS modelling to provide successful operation at low bit rates, it is necessary to modify the operating environment in such a way that bits are only assigned to perceptually important sections. For scalable reconstruction of voiced speech signals, this problem of restricting the allocation of bits reduces to ensuring that pitch pulses are adequately represented in the synthesised speech. To assist in making AbyS modelling scalable to low bit rates, the proposed method critically samples the input residual into pitch length sub-frames, using the method detailed in Section 4.2. The pitch length sub-frames are then decomposed into voiced (pulse like) and noise components using a method such as linear filtering [Klei94]. The net result of these operations is that the residual signal is reduced to a parametric representation (i.e. pulse and noise). However, in contrast to traditional parametric coding algorithms where time asynchrony is introduced (such as WI and MELP), the critical sampling of the residual signal maintains time synchrony with the input signal and thus preserves the possibility of using AbyS to model the parameters. If AbyS is used to model the pulsed component, at low bit rates this operation is concerned only with reproducing a pulse. If a pulse model that accurately represents the shape of the

residual pulse (such as a zinc pulse) is now used in the AbyS operation, the characteristic of low rate AbyS coding whereby perceptually unimportant sections are modelled is reduced. The combination of pitch length segmentation and AbyS coding with a pulse model thus produces a scalable method for representing voiced components of the LP residual. This result is confirmed through the experimental results in the following section.

4.3.2 Practical Results

To determine the scalability of coding pitch length segments of voiced LP residual, direct modelling of the residual was compared to AbyS modelling with various pulse models. The pulse models consisted of a simple impulse model (see equation (4.17)) and a zinc pulse model [Sukk89] (see (4.18)):

I(n) = Σ_{i=1}^{P} G_i δ(n − m_i),  where δ(n − m) = 0 unless n = m    (4.17)

where G_i are the heights of the impulses, m_i are the impulse positions and P represents the model order (number of pulses);

Z(n) = Σ_{i=1}^{P} A_i Sinc(n − λ_i) + B_i Cosc(n − λ_i)    (4.18)

where A_i and B_i are the magnitudes of the zinc pulse components, λ_i represents the positions of the zinc pulses in the frame and P is the model order. The AbyS modelling is performed by substituting the appropriate pulse model (either I(n) from (4.17) or Z(n) from (4.18)) for y(n) in (4.15) and solving (4.16) to obtain the optimal model parameters. The comparison of directly modelling the residual signal and using AbyS modelling was conducted using the voiced sections of four sentences from the TIMIT database.
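As an illustration of the two models, the sketch below constructs a discrete time zinc pulse and evaluates the AbyS error of (4.16) for a candidate excitation. It is a minimal sketch: the helper names (cosc, zinc_pulse, abys_error) and the scipy-based synthesis filtering are chosen here for illustration rather than taken from the thesis.

    import numpy as np
    from scipy.signal import lfilter

    def cosc(n):
        """Discrete-time Cosc: 2/(n*pi) for odd n, zero for even n."""
        n = np.asarray(n, dtype=float)
        out = np.zeros_like(n)
        odd = np.mod(n, 2) != 0
        out[odd] = 2.0 / (np.pi * n[odd])
        return out

    def zinc_pulse(N, A, B, lam):
        """Single zinc pulse z(n) = A*Sinc(n - lam) + B*Cosc(n - lam), 0 <= lam < N."""
        n = np.arange(N) - lam
        z = B * cosc(n)
        z[int(lam)] += A   # Sinc part: an impulse of height A at n = lam
        return z

    def abys_error(s, a, excitation):
        """Speech-domain squared error of (4.16) for a candidate excitation."""
        synth = lfilter([1.0], a, excitation)   # y(n) * h(n)
        return np.sum((s - synth) ** 2)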

Each pitch length sub-frame of speech was initially synthesised using a section of the exact residual signal centred around the residual domain pulse (direct modelling). The speech was then synthesized using both impulse and zinc pulse models calculated in an AbyS loop. The number of adjacent positions for the direct modelling and the order of the pulse models (P in (4.17) and (4.18)) were varied. The Mean Error Ratio (MER), defined as the ratio of MSE to mean input energy for each pitch length sub-frame, was calculated for the various configurations as:

MER = [ (1/N) Σ_{x=0}^{N−1} (Input(x) − Estimate(x))² ] / [ (1/N) Σ_{x=0}^{N−1} Input(x)² ]    (4.19)

where N is the number of samples in the sub-frame. It should be noted that the MER is similar to the inverse of a mean SNR. The MER was calculated in both the residual and speech domains for each sentence. The results for the individual sentences are shown in Appendix B. The results were then averaged over all sentences; the averaged residual and speech domain MER results are shown in Figures 4.11 and 4.12 respectively. The model orders in Figures 4.11 and 4.12 represent the number of pulses per sub-frame (P from (4.17) and (4.18)) for the zinc and impulse methods. For direct residual modelling, the model order indicates the number of adjacent locations, centred around the residual pulse, used to represent the residual signal (order 1 = 7 samples, order 2 = 9, order 3 = 11, order 4 = 13 and order 5 = 15). This relationship between order and samples is clearly evident in Figure 4.13, where an order of one was used in the direct residual modelling. The direct residual representation (dashed line) in Figure 4.13 clearly uses 7 samples, centred around the pitch pulse location, to represent the residual signal.
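For reference, (4.19) transcribes directly into a few lines of numpy; this is a minimal sketch assuming inp and est hold the original and reconstructed sub-frame.

    import numpy as np

    def mer(inp, est):
        """Mean Error Ratio of (4.19): mean squared error over mean input energy."""
        inp = np.asarray(inp, dtype=float)
        est = np.asarray(est, dtype=float)
        return np.mean((inp - est) ** 2) / np.mean(inp ** 2)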

Figure 4.11: Comparison of Residual Domain MER (MER versus model order for the Zinc, Impulse and Direct Residual methods)

Figure 4.12: Comparison of Speech Domain MER (MER versus model order for the Zinc, Impulse and Direct Residual methods)

Comparing Figures 4.11 and 4.12, it is evident that minimizing the MSE in the residual domain is not analogous to minimizing the MSE in the speech domain. In fact, the pulse models consistently reduce the speech domain error as the order of the model is increased, whilst the residual domain error for the same pulse models remains almost constant. For direct modelling of the residual, the opposite is true. The residual domain error for direct residual modelling is consistently reduced as the model order is increased; however, a corresponding reduction in the speech domain error is not achieved. Moreover, whilst Figures 4.11 and 4.12 represent the average behaviour across all the input sentences, for some individual sentences increasing the order of the direct residual modelling achieved a reduction in the residual domain MER but resulted in a worsening of the speech domain error. This never occurred in the test set for the pulse models minimized in the speech domain; increasing the model order always reduced the speech domain error. Comparing the error values for the different methods in Figure 4.12 shows that zinc and impulse models using two and three pulses per sub-frame respectively achieved a lower error value than the highest order of direct modelling, which uses fifteen adjacent samples. Figure 4.12 also indicates that the zinc pulse model, using only a single pulse per sub-frame, almost matched the error achieved using seven adjacent samples for direct modelling. Comparing the performance of the zinc and impulse models in Figure 4.12 indicates that for a given model order the zinc pulse model clearly outperforms the impulse model. In fact, the zinc model exceeds the performance of the impulse model even when its model order is half that used for the impulse model. This result demonstrates the advantage of using a pulse model that accurately represents the shape of the residual pulse.

The results in Figure 4.12 show, for the pulse models calculated in the speech domain, a clear scalability with order in terms of error minimisation. However, at low rates it is the parametric representation of the pulse shape that is perceptually important. Figures 4.13 and 4.14 show the residual and speech domain results for modelling a pitch length sub-frame using both a single zinc pulse and direct residual modelling of seven adjacent samples centred around the residual pulse. Figure 4.13 gives a useful insight into the fact that minimising the error in the speech domain using fixed order pulse models does not necessarily minimise the residual domain error. The zinc pulse in Figure 4.13 is positioned before the main residual pulse and thus has a large MSE in that domain. In contrast, the zinc speech domain pulse in Figure 4.14 is a good approximation of the original waveform. Figure 4.14 illustrates the better representation of the speech pulse shape achieved by the zinc pulse model when compared to the direct modelling reproduction, despite only a single pulse being used in the model. Further, this indicates that, even in a parametric sense (where the MSE is less relevant), modelling the pitch length segments by minimising the error in the speech domain produces a very good reproduction of the pulse shape. To support this claim, a single zinc parameter per 25 ms frame was quantised using ten bits and interpolated for each pitch length sub-frame. The position of the pulse in each sub-frame was fixed, and this amounts to a 400 bps representation of the voiced speech. Informal listening tests indicated that the synthesized speech sounded clear and natural.

Figure 4.13: Residual domain pulse comparison (residual estimate: Original, Zinc pulse and Direct Residual; amplitude versus time in samples)

Figure 4.14: Speech domain pulse comparison (speech estimate: Original, Zinc Pulse and Direct Residual; amplitude versus time in samples)

4.3.3 Summary of scalable coding of the LP speech residual

The results indicate that using pitch length sub-frames and fixed shape pulse models, with parameters calculated in a closed loop AbyS system, generates a scalable method for reproducing voiced speech. Using a pulse model that closely matches the shape of the residual pulse in the AbyS process further increases this scalability. The scalability of AbyS modelling contrasts with attempting to achieve scalability by increasing the accuracy of residual domain modelling; that process may in practice offer

very little improvement in the accuracy of the speech model produced. This discrepancy between the performance of AbyS and direct residual domain modelling is primarily due to the fact that, for the latter, there is no control over the consequences of modelling and quantisation errors with regard to the synthesised speech. Using an AbyS method also has the advantage of allowing perceptual weighting when minimising the error, such as that used in [Atal82b].

4.4 Summary

This section proposed a new set of tools that allow the LP residual to be represented by a set of parameters that permit the reconstructed speech pulses to migrate from a parametric representation to a waveform matching representation as the bit rate available for transmission is increased. The first of these tools, detailed in Section 4.2, is a low delay method to critically sample the input speech into pitch length sub-frames. This method allows the pitch evolutionary redundancies of the speech signal to be exploited in a low delay speech coder. The proposed method also produces a much more accurate pitch profile in transitional sections of the input speech than correlation based pitch detectors. This increased accuracy is directly attributed to the proposed technique locating the optimal pitch pulse locations in a closed loop function, where the speech domain error is calculated for each candidate pulse location. In contrast, for reliable operation, correlation-based techniques require additional future and past frames of the LP residual to be incorporated into the correlation calculation, and thus tend to smear the pitch track in transitional sections due to their characteristic of designating the underlying pitch of surrounding voiced speech sections to transitional regions.

The second proposed tool, detailed in Section 4.3, exploits the performance of the low delay pitch length segmentation of Section 4.2 to produce a scalable representation of the LP residual. The scalable representation is achieved by modelling the pitch length sub-frames with a fixed shape pulse model, the parameters of which are calculated in a closed loop AbyS structure. It was shown that using a pulse model that closely matches the shape of the residual pulse in the AbyS process further increases the scalability of the representation.

Chapter 5

Scalable Decomposition of Speech Waveforms

5.1 Introduction

The speech signal can be modelled as a linear filter whose excitation is a combination of periodic impulses and white noise [Rabi78]. Speech coding schemes exploit knowledge of this model to varying degrees, so as to achieve an overall coding gain. An efficient means of exploiting the characteristics of this model involves separating or decomposing the speech signal into a component representing the underlying periodic pulse-like characteristics and another component representing the noise-like characteristics. The advantage of separating the speech into pulse-like and noise-like components is that the two components can then be individually quantised and transmitted according to their respective perceptual characteristics. This ability to separate the quantisation and transmission frequency of the two components presents a scalable mechanism for transmission and synthesis of the speech. A well documented example of this scalability is when the pulsed component is decimated and the update

rate for this parameter (the transmission frequency) is reduced; in this case the perceptual quality of the resultant speech degrades quite slowly as the transmission frequency is reduced [Klei93b, Klei94, Gran91]. The most noticeable audible effect of reducing the transmission frequency is that transitional sections of the speech sound buzzier. To produce a parameter set that can accurately reproduce the pulse-like (harmonic) structure of the signal in practical coding situations, the decomposition method must accurately capture all of the pulsed components in the parameters representing the pulse structure of the signal. This is a relatively straightforward operation in stable voiced sections of speech; however, in quickly changing sections of the speech signal (such as voiced onsets) capturing the pulsed information becomes more difficult. If the pulse information in these quickly changing sections falls through to the noise-like parameters, these pulses will most likely be lost. This occurs because the noise component quantisation procedures are optimized for representing noisy signals and are thus unable to represent a pulse component. Loss of these transitional pulse parameters can, and generally will, produce audible distortions (loss of naturalness) in the synthesized speech. These distortions may be acceptable at low bit rates; however, if the coding structure is scaled to operate at higher rates, the distortions will produce a noticeable degradation in quality when compared to high rate coders that maintain the transient structure of the speech. In addition to achieving an accurate separation of the pulsed and noise components, a decomposition scheme suitable for scalable coding should produce a parameter set (particularly for the pulsed component) that allows the decomposed signals to be reconstructed in a bit rate scalable manner, where the perceptual quality of the reconstructed signal smoothly improves with increased accuracy of the transmitted parameters (increased bit rate). This scalability of perceptual quality with transmission

accuracy is particularly important for the pulsed component, as this component controls the harmonic structure of the reconstructed speech and also has the most influence over the perceptual quality of the synthesized speech [Kubi93, Klei95b]. In contrast, the noise component contains no harmonic structure, and it has been shown that replacing the noise-like component with gain shaped random Gaussian noise has little effect on the resultant perceptual quality of the synthesized speech [Kubi93, Klei95b]. This chapter details and analyses the performance of a widely exploited decomposition scheme based on linear filtering of a two-dimensional surface (constructed from overlapped pitch length sections of the LP residual) [Klei94]. The performance of the linear filtering decomposition is used as a benchmark for analyzing three proposed decomposition schemes. The comparison focuses on both the performance of the decomposition schemes in separating the pulsed and noise components and on the scalability of the resultant parameters representing the decomposed signals. The three proposed schemes each attempt to produce a scalable parameter set whilst limiting the delay required to a single frame of speech (~25 ms). For completeness, the suitability for scalable operation of another well known decomposition scheme (long term prediction [Schr85]) is also examined.

5.2 Linear Filtering

Linear filtering is the decomposition method used in the WI paradigm of speech coders [Klei94]. The linear filter used is a fixed lowpass filter, which decomposes the evolution of the CW surface (the construction of the CW surface is detailed earlier in this thesis) into a Slowly Evolving Waveform (SEW) surface and a Rapidly Evolving Waveform (REW) surface. The SEW is the output of the low-pass filter and the REW is calculated as the difference between the CW surface and the SEW. The low-pass filter is generally a 20th order, 20

Hz cutoff design [Klei94]. The look ahead delay required for the 21 tap filter is equal to 10 CWs, which amounts to an entire frame of speech. To this delay, the delay required for pitch detection and LP filtering of the future frame of speech must be added. This results in a total look ahead delay of at least 1.5 frames (37.5 ms) being required by the linear filtering method. The fixed configuration of the lowpass filter results in the linear filtering method being unable to adapt to quickly changing sections of the input signal. This characteristic results in pulsed information falling through to the REW in transitional sections of speech [Chon00]. The consequence is that the representation of the pulsed component contained in the SEW tends to exhibit smearing of the transitional sections. An example of this effect is shown in Figure 5.1, where it can be seen that the SEW surface clearly smears the transitions in the input residual. To determine the decomposition qualities of the linear filtering method for stable sections of speech, four speech files of distinctly different content (i.e. speaker gender, sentence content, etc.) were separated into voiced and unvoiced sections. The voiced and unvoiced sections for all of the files were decomposed into SEW and REW components. The ratio of REW energy to SEW energy was calculated for each input section, with the resultant ratios then grouped and averaged to give an average SEW/REW energy ratio for both the voiced and unvoiced inputs. These ratios are shown in the second column of Table 5.1. The SNR of the SEW with respect to the input is shown in column 3 of Table 5.1.

Table 5.1: Ratios of REW to SEW energy (rows: voiced and unvoiced inputs; columns: REW/SEW energy ratio and SEW SNR)
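The energy-ratio measurement just described can be sketched as follows. This is a simplified, causal stand-in for the symmetric 21-tap filter of [Klei94] (which uses look ahead); cw is assumed to be a surface whose rows are successive aligned pitch-length CWs extracted at 400 Hz, and the names and scipy-based filtering are illustrative.

    import numpy as np
    from scipy.signal import firwin, lfilter

    def sew_rew_split(cw, fs_evo=400.0, cutoff=20.0, taps=21):
        """Split a CW surface (rows = successive CWs) into SEW and REW parts."""
        lp = firwin(taps, cutoff / (fs_evo / 2.0))     # 20 Hz lowpass, 21 taps
        sew = lfilter(lp, [1.0], cw, axis=0)           # filter along the evolution axis
        rew = cw - sew                                 # REW = CW surface - SEW
        ratio = np.sum(rew ** 2) / np.sum(sew ** 2)    # REW-to-SEW energy ratio
        return sew, rew, ratio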

Figure 5.1: Comparison of CW surface to SEW (input CW surface and SEW surface, plotted over the pitch cycle and evolution axes)

For steady state voiced speech, an ideal decomposition method should capture virtually the entire signal in the component representing the pulsed parameters; conversely, for steady state unvoiced speech the decomposition method should capture the majority of the signal in the noise component. Examining the results in Table 5.1 indicates that the linear filtering method does in fact achieve close to this ideal decomposition. This is

particularly so for voiced speech, where the linear filtering method captures ninety percent of the signal in the SEW component. It is only in rapidly changing sections of speech that the linear filtering decomposition fails. Also, the need to rotate the CWs when generating the input CW surface causes the reconstructed speech to be time asynchronous with the input speech. This time asynchrony has little effect in low rate applications but prevents objective error minimisation criteria from being used at higher bit rates. As transient components of speech require a large number of bits for accurate representation, low rate speech coders generally sacrifice accurate representation of transitional sections for the sake of bit rate savings. The result is a slight buzziness in the synthesised speech; however, the overall speech is of good perceptual quality. This characteristic has allowed linear filtering to be employed in WI coders and in turn allowed these coders to produce good speech quality at low bit rates of approximately 2.4 kbps [Klei94]. Utilising the excellent steady state decomposition characteristics of the linear filtering method allowed a reduced rate WI coder to be developed [Luka01b, Luka01d]. This coder does not transmit the SEW parameters at all. Instead, a fixed shape pulse model is used to reconstruct the SEW. The parameters for the pulse model are implied in the transmitted REW parameters. The pulse model used is a causal, discrete time representation of the zinc pulse, which is represented as:

z(n) = A sinc(n) + B cosc(n) = { A for n = 0;  2B/(nπ) for n odd;  0 for n even, n ≠ 0 }    (5.1)

The pulse model parameters are extracted from the reconstructed REW using:

A = B = 1 − (1/N) Σ_{i=0}^{N−1} |REW(i)|    (5.2)

where REW(i) are the Fourier transform coefficients representing the unquantised REW and N is the length of the CW. Equation (5.2) uses the assumptions that the LP residual spectrum is flat and that the spectrum has been normalized to unity earlier in the coding process by the removal of the gain term. Whilst the assumption of a flat residual spectrum may not necessarily be true for a 10th order LP filter, this assumption is already used in SEW quantization for frequencies above 800 Hz in Kleijn's 2.4 kbps WI coder [Klei94]. This, combined with the fact that only a parametric representation of the pulse is sought (due to the low rate), renders the assumption acceptable. Using these assumptions and Parseval's theorem [Proa96], (5.2) sets the zinc pulse parameters (A and B) equal to the height of a time domain impulse with the same energy as the extracted SEW (1 − REW) spectrum. Whilst the zinc pulse consists of a large initial pulse followed by further impulses whose amplitudes decrease rapidly with time, informal perceptual testing has shown that setting the height of the initial pulse equal to the height of a single impulse as calculated by (5.2) produces good results. Subjective Mean Opinion Score testing using 8 input sentences and 24 listeners revealed that the proposed SEW reconstruction method maintains the perceptual quality of a traditional WI coder that directly quantises the SEW. The net result is a WI based speech coder which operates at 2 kbps.

5.3 Long Term Prediction

Long Term Prediction (LTP) uses a previously coded section of the signal to predict the current signal. Due to the underlying pulsed nature of the speech in voiced sections, the LTP provides a good means of modelling and removing the contribution of the pitch

pulse from the signal [Atal82b]. The generalised formula for an LTP is represented as [Kond95]:

P(z) = 1 − Σ_{j=−I}^{I} β_j z^{−(d+j)}    (5.3)

where β_j are the predictor coefficients, I is the predictor order and d is the delay. The predictor coefficients and delay are calculated to minimise the MSE between the predicted and input signals [Atal82b, Kond95]. The simplest and most common form of LTP uses only a single coefficient and is referred to as a single tap LTP. The MSE for a single tap LTP is calculated as [Kond95]:

e(n) = r(n) − βr(n − d)
E_mse = Σ_{n=0}^{L−1} e²(n) = Σ_{n=0}^{L−1} [r(n) − βr(n − d)]²    (5.4)

where r(n) is the input signal. The optimal solution for β in (5.4) is calculated as:

β = [ Σ_{n=0}^{L−1} r(n)r(n − d) ] / [ Σ_{n=0}^{L−1} r²(n − d) ]    (5.5)

The optimal value for d is then calculated by substituting (5.5) into (5.4) and selecting the value of d that minimises E_mse. Segmental SNR (SegSNR) results for various LTP configurations are reported in [Kond95]. These results indicate that a SegSNR of between 3.4 dB and 9 dB is achievable for LTP, depending on the configuration (order, closed or open loop structure, etc.) of the LTP used. LTPs are generally calculated on fixed length segments (typically 5 ms) of the input speech residual. This results in an individual set of parameters (β, d) for each segment, and this

has the advantage of allowing fast adaptation to input signal transitions. However, it requires a relatively high transmission rate for each segment's parameters. The subjective quality of the synthesised speech degrades quite rapidly as the bit rate used for transmission of the LTP parameters is reduced [Rama95]. This characteristic makes the method impractical for low rate applications, and thus it is also impractical for low rate scalable applications.

5.4 Analytic Decomposition of Speech Signals

5.4.1 Introduction

This section details a method of exploiting the characteristics of an analytic representation of the speech signal so as to achieve a decomposition of the speech signal into a number of separate components. Each of the decomposed components exhibits distinctly different perceptual and quantisation characteristics. An analytic signal is a complex signal that contains only positive frequencies. It is associated with a real signal by the removal of the real signal's negative frequencies and the doubling of its positive frequencies. The analytic signal directly links a signal with its envelope and phase (instantaneous frequency), as shown in (5.6) [Oppe89]:

x(n) = s(n) + j ŝ(n) = E(n) e^{jΦ(n)}    (5.6)

where x(n) is the analytic signal, s(n) is the input speech, ŝ(n) is the Hilbert transform [Oppe89] of s(n), E(n) is the analytic envelope and Φ(n) is the analytic phase.

Equation (5.6) requires the Hilbert transform of the input signal to be calculated. For a discrete time signal the Hilbert transform can be implemented either in the time domain (using a filtering mechanism which approximates the Hilbert transform) or in the frequency domain, by removing the negative frequencies and doubling the value of the positive frequencies [Oppe89]. The frequency domain approach introduces less delay and has been adopted here. The frequency domain method for calculating the analytic transform is defined, for a sequence of even length N, as [Marp98]:

X(n) = S(0)     for n = 0
     = 2S(n)    for 1 ≤ n < N/2
     = S(N/2)   for n = N/2
     = 0        for N/2 + 1 ≤ n ≤ N − 1    (5.7)

where S(n) is the N point DFT of the input speech. The time domain analytic signal is then calculated by performing an N point inverse DFT of X(n). The above procedure results in a frequency spectrum that contains only positive frequencies. Equation (5.6) also indicates that the real part of the analytic signal is equal to the input speech signal. The transformation provides a straightforward means of decomposing a signal into two separate signals (envelope and phase) that exhibit distinct characteristics. The removal of the negative frequencies also allows the sampling rate of the analytic signal to be reduced to half that of the input signal without causing aliasing. Whilst analytic signals are widely used in such areas as time-frequency signal analysis, spectral analysis and many others (see the references of [Marp98]), their use in the coding of speech has not been widely reported.
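The construction of (5.6) and (5.7) maps directly onto a few lines of numpy; the sketch below assumes an even-length real frame s and returns the analytic signal together with its envelope and (unwrapped) phase. The function name is illustrative.

    import numpy as np

    def analytic_signal(s):
        """Analytic signal of an even-length real frame, per (5.7)."""
        N = len(s)
        S = np.fft.fft(s)
        X = np.zeros(N, dtype=complex)
        X[0] = S[0]                      # DC term unchanged
        X[1:N // 2] = 2.0 * S[1:N // 2]  # double the positive frequencies
        X[N // 2] = S[N // 2]            # Nyquist term unchanged
        x = np.fft.ifft(X)               # negative-frequency bins stay zero
        env = np.abs(x)                  # E(n)
        phase = np.unwrap(np.angle(x))   # Phi(n)
        return x, env, phase

As a quick sanity check, the real part of x reproduces the input, i.e. np.allclose(x.real, s) holds.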

5.4.2 Decomposition Characteristics

An input speech file was transformed to its envelope and phase functions using the analytic decomposition detailed in (5.6) and (5.7). The envelope and phase were then plotted against the input speech, as shown in Figure 5.2. Figure 5.2 indicates that for a section of voiced speech the envelope and phase waveforms both evolve with the pitch of the input waveform. This characteristic indicates a level of redundancy in both the envelope and phase waveforms. To examine the extent of this redundancy, the speech was processed in a WI coding structure. The WI structure was selected as it extracts pitch length segments of speech (characteristic waveforms) and uses these to exploit redundancy of a pitch evolutionary nature. Due to problems with WI extracting characteristic waveforms (CWs) in the speech domain (as detailed in [Klei95b]), the extraction was performed on the LP residual. The residual CWs were then decomposed into their envelope and phase using (5.6) and (5.7). The evolution of the CW envelope and phase for a section of female speech is shown in Figure 5.3 (note that the orientation of Figure 5.3(b) has been altered from that of Figure 5.3(a) to emphasise the evolution of the phase component). Figure 5.3 indicates that the pitch evolutionary characteristics shown in Figure 5.2 continue to apply in the residual domain, although the phase is more noise-like due to the removal of the minimum phase component by the LP filter. Figure 5.3(a) shows that the envelope evolves very slowly, particularly for voiced speech, suggesting a large level of redundancy. The phase evolution shown in Figure 5.3(b) exhibits an underlying slow evolution with a more rapidly evolving component superimposed. This characteristic indicates that further decomposition of the phase would separate these components and thus allow them to be quantised individually.

Figure 5.2: Analytic representation of speech (input speech, envelope and phase)

Two methods were used to achieve a further decomposition of the phase waveform, these being lowpass filtering in the evolution direction (see Section 5.2) and a second analytic decomposition into an envelope and phase function. Both of these methods achieved a good degree of separation of the phase into slowly evolving and rapidly evolving phase components. For the linear filtering method this resulted in highpass and lowpass phase components, and for the second analytic decomposition the result was a phase envelope and a phase of phase component.

Figure 5.3: Evolution of Analytic speech parameters, a) envelope, b) phase (plotted over pitch cycle time in samples and the evolution axis)

Decomposition ratios (such as those calculated in Table 5.1) are not relevant measures for determining the effectiveness of the analytic decomposition in separating the pulsed and noise components in steady regions of speech. This is because the decomposition does not directly separate the speech into different energy components (as per linear filtering and LTP); rather, the speech is separated into two different transform domains. To determine the effectiveness of the analytic decomposition, the perceptual sensitivity of the decomposed parameters to quantisation and down sampling was determined, and the results are detailed in Section 5.4.3.

5.4.3 Sensitivity of decomposed parameters to Quantisation

Section 5.4.2 details three distinct levels of decomposition using the analytic transform, these being:

1) Envelope and phase.
2) Envelope, lowpass phase and highpass phase.
3) Envelope, phase envelope and phase of phase.

The maximum tolerable quantisation noise for each component in both the time and frequency domains was calculated by determining the maximum amplitude of white noise that could be added to each component before the effects became audible. The noise amplitude results are expressed as a percentage of the maximum value of each parameter in Table 5.2. The results in Table 5.2 indicate that the parameters exhibit a varying immunity to quantisation noise. The envelope, which is required for all of the decompositions, requires high accuracy in the time and DFT magnitude domains, whilst the DFT phase of the envelope could be coded very coarsely. These characteristics indicate that high levels of compression of the envelope through operations such as variable dimension vector quantisation (VDVQ) [Das96] will not be possible.

Table 5.2: Sensitivity of Analytic parameters to quantisation noise (tolerable quantisation noise as a percentage of maximum, in the time, DFT magnitude and DFT phase domains, for the envelope, phase, lowpass phase, highpass phase, phase envelope and phase of phase parameters)

Both the lowpass and highpass parameters resulting from filtering the phase offer improved robustness to quantisation when compared to the original phase, whilst decomposing the phase with a second analytic transform generates parameters that are slightly more sensitive to quantisation noise than the original phase. The results indicate that all of the phase parameters are quite sensitive to quantisation noise, with the lowpass phase being the least susceptible and the phase of phase being the most susceptible. To determine the perceptual effects of reducing the transmission rate of the parameters, ten CWs were extracted per 25 ms frame, at a rate of 400 Hz. The CWs were decomposed using each of the methods detailed in Section 5.4.2. The evolutionary bandwidth of each parameter was determined by lowpass filtering the parameter in the evolutionary direction before reconstructing the signal. The minimum lowpass filter bandwidth for each component that produced perceptually unaltered speech is shown in Table 5.3.

                  Normalised Bandwidth
Envelope          0.1
Phase             1
Phase Envelope    <0.1
Phase of Phase    1

Table 5.3: Evolutionary bandwidth of Analytic parameters

The results in Table 5.3 indicate that the envelopes and phase envelopes per frame could be down sampled by a factor of 10 before transmission, and then reconstructed via interpolation in the decoder, without causing audible distortion. In contrast, the results in Table 5.3 indicate that the phase and phase of phase require their full evolutionary bandwidth (and thus every vector to be transmitted) for undistorted reconstruction. The transmission rate of the linear filtering phase parameters is directly related to the normalised cutoff frequency of the lowpass filter used. Perceptual testing of the synthesized speech shows that the envelope contains most of the intelligibility of the speech. Transmitting only one envelope per frame with the phase set to zero produces highly intelligible speech; the speech, however, sounds quite robotic. Comparing the perceptual character of the synthesized speech for the analytic and filtering methods of decomposing the phase showed that the analytic method provides a means of increasing the naturalness of the speech as the transmission rate of the parameters is increased. Transmitting only one envelope and phase envelope per frame, with the phase of phase set to zero, produced clear speech of better quality than if only the envelope was sent. The naturalness of the speech then increased as the rate of transmission of the phase of phase was increased. Conversely, whilst the filtering method does increase naturalness as the transmission rate of the phase parameter is increased, the speech was scratchy unless the phase was transmitted without down sampling. Further, the linear filtering method introduces extra delay to the decomposition and is thus less desirable.
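The factor-of-10 experiment can be sketched as follows: keep one envelope in every ten and linearly interpolate the missing envelopes at the decoder. Here env_track is assumed to be an array holding one envelope per extracted CW (rows in evolution order); the names are illustrative.

    import numpy as np

    def downsample_interpolate(env_track, factor=10):
        """Transmit every `factor`-th envelope; rebuild the rest by interpolation."""
        num_cw, N = env_track.shape
        kept = np.arange(0, num_cw, factor)      # indices of transmitted envelopes
        rebuilt = np.empty_like(env_track)
        for n in range(N):                       # interpolate along the evolution axis
            rebuilt[:, n] = np.interp(np.arange(num_cw), kept, env_track[kept, n])
        return rebuilt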

5.4.4 Summary for Analytic Decomposition of Speech

The analytic decomposition of speech waveforms developed in the preceding sections does offer a degree of scalability when reconstructing the speech. This scalability is achieved by varying the transmission rate of the decomposition components. However, due to the relatively high sensitivity of the parameters to quantisation noise, the transmitted parameters must be an accurate representation of the original parameters. This characteristic means that the scalability achieved is not fine grained, but rather occurs in distinct stages. Also, as the decomposed components (envelope and phase) are not orthogonal, it is difficult to ascertain whether a clear separation of the pulsed and noise components of the speech signal is achieved via the proposed analytic decomposition.

5.5 Singular Value Decomposition of Speech Signals

5.5.1 Introduction

This section proposes and examines a method of decomposing the speech signal into periodic and noise-like components such that transitions in the input signal are inherently identified and a set of scalable output parameters is produced from a single frame (20 to 25 ms) of the input speech. The scheme exploits the evolution of adjacent pitch length segments as in WI [Klei94], but uses the decomposition characteristics of Singular Value Decomposition (SVD) in place of linear filtering. The singular value decomposition of any n by m matrix X is defined as [Golu83]:

X = U S Vᵀ    (5.8)

where U is an n by n left singular matrix with columns forming an orthonormal basis for the columns of the input matrix, V is an m by m right singular matrix with columns forming an orthonormal basis for the space spanned by the rows of the input matrix, and

S is an n by m diagonal matrix of singular values. The singular values λ₁, ..., λ_min(m,n) occur in descending order; the number of non-zero singular values represents the rank of the input matrix [Golu83]. Exploiting this ordering of the singular values, by generating an estimate commencing with the largest singular value and adding subsequent singular values in descending order, rapidly produces an estimate of the underlying signal matrix X. This is shown by the expression:

E = Σ_{i=1}^{p} λ_i u_i v_iᵀ,  where p is the model order and p ≤ rank(X)    (5.9)

where E is an estimate of the original matrix X generated from a sum of outer products weighted by the singular values, and u_i and v_i are the i-th columns of U and V. Detail is added to the estimate matrix E as the value of p in (5.9) is increased towards the rank of X. If a clear distinction in the magnitude of the singular values is apparent (i.e. λ_p >> λ_{p+1}), an obvious decomposition of the input matrix X into an underlying matrix E and a detail matrix is possible by setting the value of p in (5.9) equal to the point of distinction in the singular values. The detail matrix D is then calculated as the difference between the input matrix X and the underlying matrix E. Forcing the input matrix X to become ill conditioned, or as close to ill conditioned as possible, causes the singular values to be maximally spread (i.e. the difference between the largest and smallest singular values is maximised) [Hill96]. This maximizes the likelihood that there will be a clear distinction between the singular values representing the underlying matrix and those representing the detail. Ill conditioning occurs if there is a strong correlation between columns or rows of the input matrix, or if the columns or rows are close to linearly dependent.

5.5.2 Method

To exploit SVD in low delay speech decomposition, a method of forcing the input matrix representing a section of the speech signal to be as close to ill-conditioned as possible is required. This is achieved by segmenting the input speech into 25 ms frames and filtering each frame with a standard linear predictive (LP) filter. For each input frame of LP residual, a fixed number of pitch length segments (ten) are extracted, aligned for maximum correlation and zero padded to a fixed length N. If no pitch track is present the segments are set to a predetermined length. Due to the possible range of pitch lengths, the fixed number of pitch length segments can be extracted either by allowing the pitch segments to overlap or by critically sampling the segments and then using interpolation to produce the required number. This process results in a two-dimensional CW surface similar to that used in WI [Klei94]. The resulting surface is equivalent to an N by 10 matrix (the signal matrix X) in which each column represents a zero-padded pitch length segment of the input speech residual. The process of aligning the columns for maximum correlation forces the input matrix to be close to ill conditioned; this is particularly true for constant pitch, voiced sections of speech. The distinction in the singular values for highly correlated segments of voiced speech is illustrated by the case where the pitch length segments within the matrix X differ only in magnitude. In this case there is only one non-zero singular value, and this, combined with the corresponding left and right singular vectors, perfectly reconstructs the input matrix. Using the theoretically clear distinction in the singular values for voiced speech, a good representation of the underlying pitch waveforms for the matrix can be produced by setting p in (5.9) equal to the point of distinction in the singular values. The noise or detail for the frame is generated by subtracting the underlying waveform matrix E

(from (5.9)) from the signal matrix X. For unvoiced sections of speech there will be no clear distinction in the singular values, and the value of p becomes arbitrary.

5.5.3 Decomposition Characteristics

Distribution of Singular Values

To determine the distribution of singular values within a given frame of speech, the method described in Section 5.5.2 was used on four speech files of distinctly different content (i.e. speaker gender, sentence content, etc.). The singular values for each frame were normalized to unity maximum, and the singular values representing voiced and unvoiced frames were grouped separately. Figure 5.4 shows the mean distribution of the normalized singular values for the voiced frames of each speaker and the unvoiced distribution for the entire set. It is evident that the inter-frame distribution of the singular values for all voiced speech was highly consistent, and that the distribution in unvoiced frames was distinctly different from that common voiced distribution. Figure 5.4 also demonstrates a distinct change in the voiced speech singular values between the second and third singular values (indicated by the knee of the curves). In contrast, the unvoiced singular values have no clear distinction and exhibit a gradual, almost linear roll off in magnitude. These characteristics suggest that this technique will be a successful approach to decomposing the speech frame into its underlying voiced and noise components.
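The core of the method fits in a few lines of numpy, sketched below for a CW matrix X of shape (N, 10); the function name and the normalisation step mirror the experiment above rather than any thesis code.

    import numpy as np

    def svd_decompose(X, p=2):
        """Split X into an underlying estimate E (p singular values) and detail D."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        E = (U[:, :p] * s[:p]) @ Vt[:p, :]   # rank-p estimate, per (5.9)
        D = X - E                            # detail matrix
        s_norm = s / s[0]                    # singular values normalised to unity max
        return E, D, s_norm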

Figure 5.4: Inter-frame distribution of Singular values (normalised singular value magnitude versus singular value index, for the Male 1, Male 2, Female 1, Female 2 and Unvoiced inputs)

Decomposition Measures

To determine the decomposition performance of the proposed SVD method in steady speech sections, the files used in the preceding experiment were separated into voiced and unvoiced sections. The voiced sections for each file and the combined unvoiced sections were decomposed into noise and periodic components using both the SVD method and linear filtering (see Section 5.2). Each point of distinction p in (5.9) was tested for the SVD method, and the ratio of noise energy as a percentage of periodic signal energy was calculated. The results are shown in Table 5.4.

Table 5.4: Ratio of noise energy to periodic energy component as a percentage (rows: Male 1 Voiced, Male 2 Voiced, Female 1 Voiced, Female 2 Voiced, Unvoiced and Average for Voiced; columns: number of SVD coefficients in the signal estimate, and Linear Filtering)

The results in Table 5.4 indicate that if the first and second singular values are used to represent the underlying waveform, the decomposition level is very similar to that achieved by the linear filtering method. Using the first and second singular values to calculate E in (5.9) also corresponds to the knee of the curve for the voiced speech sections shown in Figure 5.4. In contrast to the linear filtering method, the proposed decomposition delivers a scalable method of reconstructing the underlying waveform. This scalability results from the separation of the underlying waveform into perceptually different components. The singular values themselves are similar to gain terms, whilst the left singular matrix U describes the shape of the pulse and the right singular matrix V describes the relationship between the individual pulses. Varying the combination and accuracy of the parameters used for reconstructing the underlying signal allows the fit of the reconstructed waveform to the original underlying waveform to be controlled. Figure 5.5 shows a comparison of the original speech CW surface and the respective estimates of the underlying waveform surfaces. Figure 5.5(c) shows the proposed SVD estimate of the underlying surface using the first two singular values with their respective left and right vectors. Figure 5.5(d) is the SVD estimate using only the first singular value, its left singular vector and the mean of the first right singular vector

interpolated across the frames.

Figure 5.5: Comparison of speech surfaces. a) Surface of input speech residual, b) Linear filtered version of underlying waveform, c) Proposed method estimate of underlying waveform, d) Low rate reconstruction of underlying waveform using the proposed method.

The results demonstrate that the proposed full SVD estimate in Figure 5.5(c) gives a significantly improved representation of the transitional changes in the input waveform when compared with the linear filtering method in Figure 5.5(b); the latter tends to smear these transitions. The scalability of the SVD method is also clearly evident when comparing Figures 5.5(c) and 5.5(d). Figure 5.5(d) still produces a good estimate of the underlying waveform; it simply has less detail than Figure 5.5(c), which requires the transmission of extra parameters. Also, the sharp transition in the input speech is better reproduced by the SVD method (Figure 5.5(d))

than by the linear filtering method (Figure 5.5(b)); this is despite using only a single parameter per frame to represent the evolution of the underlying waveforms.

5.5.4 Summary for SVD of Speech

The proposed SVD based technique produces a decomposition of the speech signal that is inherently scalable. For low bit rates this scalability allows an approximate estimate of the underlying periodic signal to be generated. Detail such as transitional information and pitch cycle evolution can easily be added to the estimate in a fine grained manner when higher bit rates are available. The method is low delay in that it requires only the current frame of speech. This contrasts with other methods, such as linear filtering, which requires approximately one and a half frames of look-ahead. The low delay makes the SVD based decomposition more appropriate than linear filtering for higher rate coders (such as 8 kbps). The major drawback of the method is the high complexity required. This complexity is due both to requiring a fixed number of CWs per frame to be generated and to calculating the SVD. Also, as for linear filtering, the need to rotate the pitch length segments when generating the input matrix causes the reconstructed speech to be time asynchronous with the input speech.

5.6 Pitch-Synchronous, Zinc Basis Function Decomposition of Speech Signals

5.6.1 Introduction

This section proposes decomposing the speech waveform into pulsed and noise components using a zinc basis function. This technique allows the decomposition to be modelled in a closed loop structure, such that the pulsed component is calculated to minimise the speech domain error between the original pulse and the synthesised pulsed component. This is in contrast to the direct residual domain decomposition used in the preceding decomposition methods (Sections 5.2 to 5.5). Decomposing using a closed loop structure allows the effects of quantisation on the pulsed parameters to be minimised in the speech domain, and thus the reconstructed pulsed signal is ensured to be a good approximation of the original pulsed waveform. Also, in Section 4.3, it was shown that pitch synchronous pulse modelling of the LP residual produced a parameter set that allows fine grain scalability of the reconstructed pulsed waveform.

5.6.2 Method for Pitch Synchronous Zinc Decomposition of Speech

The decomposition procedure uses pitch synchronous closed loop zinc modelling to separate the speech waveform into pulse-like and noise-like components. The zinc pulse described in Section 4.2 (Equation (4.2)) is used as a basis function for achieving the separation. The method employs the low delay, pitch synchronous critical sampling method described in Section 4.2 to segment the frame of input speech into pitch length segments. Each of the pitch length segments in the frame is then decomposed into

pulse-like and noise-like components using a multidimensional zinc model. The multidimensional zinc model is represented as [Sukk89]:

Z(n) = Σ_{i=1}^{P} z_i(n) * h(n)    (5.10)

where:

z_i(n) = A_i Sinc(n − λ_i) + B_i Cosc(n − λ_i)    (5.10a)

The error signal (after removing the contribution of the zinc model) is represented as:

e(n) = S(n) − Z(n) = S(n) − Σ_{i=1}^{P} z_i(n) * h(n)    (5.11)

where h(n) is the impulse response of the LP synthesis filter, S(n) is the input speech and P is the order of the zinc model. The method for determining the values of the A_i, B_i and λ_i parameters involves an iterative process using (4.5), (4.6) and (4.8) (these equations are detailed in Chapter 4, Section 4.2.2). The process commences with λ₁ set to the location of the pitch pulse marker, and A₁ and B₁ are solved for using (4.6). The contribution of z₁ (calculated by substituting A₁, B₁ and λ₁ into (5.10a)) is then removed from the input signal using (4.5). The resultant error signal e(n) is then used as the input to the next iteration of the process; that is, S(n) in (4.8) and (4.6) is replaced with e(n) from the previous iteration. The optimal position λ_i for the current iteration is then selected as the orthogonal position (relative to the pitch pulse location) that maximises (4.8). A_i and B_i are then solved for at λ_i using (4.6), and the process is repeated until P iterations have been completed. At the completion of this modelling process the speech domain pulsed signal is represented by the zinc model Z(n) from (5.10), and the speech domain noise signal is equal to the resultant error signal e(n) from (5.11). If residual domain decomposition components

are required for excitation of an LP filter in synthesis (as is the case for this thesis), these are calculated as:

R_pulsed(n) = Σ_{i=1}^{P} z_i(n)
R_noise(n) = r(n) − R_pulsed(n)    (5.12)

where r(n) is the residual domain signal for the pitch length segment, z_i(n) is defined in (5.10a), R_pulsed is the residual domain pulsed component for the pitch length segment and R_noise is the residual domain noise component. To determine the decomposition characteristics of the pitch synchronous zinc decomposition method in steady speech sections, the method was employed on 6 sentences from the TIMIT database. The sentences were of varying content (speaker gender, sentence composition, etc.) and, before modelling, the sentences were separated into voiced and unvoiced components. Each of the input sections was modelled with the value of P (in (5.10)) varied from 1 to 10, with the results for each model order recorded separately. The ratio of average error energy to average speech energy was calculated for each input section and each model order using:

Er = [ (1/LM) Σ_{m=0}^{M−1} Σ_{n=0}^{L−1} e²(n) ] / [ (1/LM) Σ_{m=0}^{M−1} Σ_{n=0}^{L−1} s²(n) ] × 100    (5.13)

where L is the pitch length, M is the number of pitch cycles in the section, s(n) is the input speech and e(n) is the error function from (5.11). The results of (5.13) for the voiced speech sections were averaged for each input file, and the results for the unvoiced sections were averaged across all sections. A separate result was generated by averaging the result of (5.13) for all voiced sections with a pitch period of 120 samples or greater. The averaged results are plotted against the model order P in Figures 5.6 to 5.13.
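As a rough illustration of the iterative procedure, the sketch below reuses the hypothetical zinc_pulse() helper from the Section 4.3.2 sketch and substitutes a generic least-squares fit for the closed-form solutions of (4.6) and (4.8), which are not reproduced in this chapter; the structure of the loop, not the helper details, is the point.

    import numpy as np
    from scipy.signal import lfilter
    # zinc_pulse() as defined in the earlier sketch (Section 4.3.2)

    def fit_zinc_at(target, a, lam):
        """Least-squares A, B for a zinc pulse at position lam (stand-in for (4.6))."""
        L = len(target)
        g_sinc = lfilter([1.0], a, zinc_pulse(L, 1.0, 0.0, lam))  # speech-domain bases
        g_cosc = lfilter([1.0], a, zinc_pulse(L, 0.0, 1.0, lam))
        G = np.stack([g_sinc, g_cosc], axis=1)
        coef, *_ = np.linalg.lstsq(G, target, rcond=None)
        err = target - G @ coef
        return coef[0], coef[1], err

    def zinc_decompose(seg, r, a, pulse_pos, P):
        """Split one pitch-length segment into pulsed and noise parts, per (5.10)-(5.12)."""
        err = np.asarray(seg, dtype=float).copy()
        r_pulsed = np.zeros(len(seg))
        for i in range(P):
            if i == 0:
                lam = pulse_pos                    # start at the pitch pulse marker
            else:                                  # position minimising the remaining
                lam = min(range(len(seg)),         # error, in place of maximising (4.8)
                          key=lambda k: np.sum(fit_zinc_at(err, a, k)[2] ** 2))
            A, B, err = fit_zinc_at(err, a, lam)
            r_pulsed += zinc_pulse(len(seg), A, B, lam)   # residual-domain pulse (5.12)
        return r_pulsed, r - r_pulsed, err   # pulsed, noise, speech-domain error (5.11)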

Figure 5.6: Er for Male Voiced Input 1
Figure 5.7: Er for Male Voiced Input 2
Figure 5.8: Er for Male Voiced Input 3
Figure 5.9: Er for Female Voiced Input 1
Figure 5.10: Er for Female Voiced Input 2
Figure 5.11: Er for Female Voiced Input 3
(each figure plots Er against the zinc model order)

Figure 5.12: Er for long pitched Voiced Input
Figure 5.13: Er for Unvoiced Input
(both plotted against the zinc model order)

Examining Figures 5.6 to 5.12 reveals a common shape for all of the curves representing voiced speech. This common shape is present despite the actual values of Er being different in each of the figures. Comparing the relatively common shape of Figures 5.6 to 5.12 with that of Figure 5.13 (which represents unvoiced speech) indicates that Figure 5.13 has a much larger value of Er for the smaller zinc order values (less than 4) than the other figures. For larger zinc orders the value of Er in Figure 5.13 falls away quickly to a small value (similar to Figures 5.6 to 5.12). This difference in performance for voiced and unvoiced speech indicates that the high correlation between the shape of the zinc pulse and the residual domain pitch pulse (see Section 4.2.2) allows the pitch synchronous decomposition to represent a significant portion of the pulsed speech component in the zinc model parameters, even when the zinc model order is very low. The shape of all of Figures 5.6 to 5.13 indicates that increasing the zinc model order from one initially produces a significant reduction in Er. However, after this initial reduction, further increasing the model order produces little reduction in Er. This indicates that the zinc model produces a good model of any underlying waveform quite quickly, and subsequent increases in the model order only model fine detail in the signal. Whilst this performance seems incorrect for

The key to exploiting the pitch synchronous zinc modelling as a decomposition mechanism lies in selecting the correct model order for a given segment. Setting the model order too low in voiced speech will cause the method to insufficiently model the pulsed component, and, conversely, setting the order too high will cause too much of the noise component to be modelled as part of the pulsed component. To determine the optimal model order, the model order at the knee of the curve for Figures 5.6 to 5.12 was estimated. This model order was then compared to the average pitch of the input file. The results are shown in Table 5.5 and the pitch is plotted against the order in Figure 5.14.

Table 5.5: Average pitch versus optimal model order

[Figure 5.14: Pitch versus optimal model order for zinc decomposition.]

Examining Figure 5.14 reveals an almost linear relationship between the pitch and the model order. This relationship indicates that the pitch value is an appropriate mechanism for dynamically estimating the optimal model order. The linear relationship that relates the pitch and the optimal model order was calculated from the input data and produced the following relationship:

Est_order = Round(0.080745 * pitch)    (5.14)

where Round is a function that rounds the value to the nearest integer. The performance of the linear estimate in (5.14) is compared to the actual model order in Table 5.6.
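As an illustration, (5.14) reduces to a one-line function; the clamp to a minimum order of one is an added assumption for very short pitch periods:

```python
def estimate_zinc_order(pitch):
    """Linear model order estimate of (5.14); pitch is in samples."""
    # int(x + 0.5) rounds to the nearest integer for the positive values here
    return max(1, int(0.080745 * pitch + 0.5))
```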

Table 5.6: Linear order estimate versus actual model order (columns: Pitch, Optimal Order, Linear Estimate)

The results in Table 5.6 indicate that the linear estimate based on the pitch produces a good approximation of the required model order.

To determine the steady state decomposition properties of using pitch synchronous zinc modelling with the proposed linear model order estimate (from (5.14)), Er from (5.13) was calculated using the proposed method for each of the test files used in Figures 5.6 to 5.13. The results are shown in Figure 5.15. The results shown in Figure 5.15 indicate that using (5.14) to calculate the model order allows pitch synchronous zinc modelling to operate at around the knee of the curves shown in Figures 5.6 to 5.12 and well before the knee of the curve in Figure 5.13 (unvoiced speech). This characteristic ensures that for voiced speech the underlying pulse component is captured in the zinc model parameters, with the detail being left to the noise component. For unvoiced speech, operating before the knee of the curve in Figure 5.13 ensures that a significant amount of the signal component appears in the noise-like parameters.

To allow a thorough analysis of the decomposition characteristics of the pitch synchronous zinc decomposition, typical examples of the method's performance for voiced and unvoiced speech are shown in Figures 5.16 and 5.17 respectively.

[Figure 5.15: Er for the input files using the linear model order estimate. Input files: Female 1 to 3, Male 1 to 3, Long, Unvoiced.]

Examining Figure 5.16 indicates that for steady sections of voiced speech, the proposed decomposition captures virtually the entire input signal in the pulsed parameters. This results in the pulsed component being a very accurate representation of voiced speech, with only a small amount of fine detail falling through to the noise parameters. Capturing virtually all of the voiced speech in the pulsed component supports the finding that using the model order estimate of (5.14) to dynamically select the appropriate model order allows the pitch synchronous zinc decomposition to operate at the knee of the curves in Figures 5.6 to 5.12.

[Figure 5.16: Pitch Synchronous zinc decomposition performance in voiced speech. Panels: Input Speech; Pulsed component for Pitch sync zinc decomposition; Noise component for Pitch sync zinc decomposition. Horizontal axis: Time in samples.]

[Figure 5.17: Pitch Synchronous zinc decomposition performance in unvoiced speech. Panels: Input Speech; Pulsed component for Pitch sync zinc decomposition; Noise component for Pitch sync zinc decomposition. Horizontal axis: Time in Samples.]

The results shown in Figure 5.17 indicate that, in contrast to the performance for voiced speech, a large portion of the unvoiced speech input falls through to the noise component. Also, it can be seen that the pulsed component has simply captured a number of randomly dispersed large amplitude impulses. Despite having the large amplitude impulses removed from its content, it is evident that the noise component still represents the noise-like structure of the input unvoiced speech. This result reinforces the finding that dynamically calculating the model order according to the pitch value (via (5.14)) allows the decomposition method to operate well before the knee of the curve in Figure 5.13.

The results for the pitch synchronous zinc decomposition in steady state conditions indicate that, as desired, voiced speech input is accurately captured in the pulsed component and a good representation of unvoiced speech is captured in the noise parameters. The decomposition also produces an inherently scalable parameter set for representing the pulsed component. The scalability of this parameter set was previously presented in Section 4.3.

To compare the steady state performance of the proposed pitch synchronous zinc decomposition to that of a well known decomposition method, the proposed method was used to decompose the test files used in testing the linear filtering decomposition of Section 5.2. The Signal to Noise Ratio (SNR) between the error signal (from (5.11)) and the input speech was calculated for the test sentences. The SNR of the speech domain REW to the speech was also calculated for the linear filtering method. The results are shown in Table 5.7.

            Linear Filtering    Zinc Model
Voiced          13.3 dB           14.1 dB
Unvoiced         2.9 dB             7 dB

Table 5.7: Comparison of SNR for zinc modelling and linear filtering

The results in Table 5.7 indicate that the pitch synchronous zinc decomposition achieves a greater SNR in steady state voiced speech sections than the linear filtering method. As the linear filtering decomposition is well documented for capturing the pulsed component in steady state speech [Klei95b], the greater SNR achieved for the proposed method confirms that the pulsed component is accurately captured by the zinc model parameters. When the performance in unvoiced speech in Table 5.7 is compared, the higher SNR for the proposed method indicates that the proposed method has captured more of the unvoiced signal in the pulsed parameters than linear filtering. Whilst this is not an ideal result, the results presented previously for the proposed method's performance in steady state unvoiced speech indicate that the proposed method still captures the noise-like structure of unvoiced speech in the noise component.

To examine the performance of the proposed method in transitional sections, Figure 5.18 shows a comparison of the pulsed component from the pitch synchronous zinc decomposition and the linear filtering SEW component, for a transitional section of speech.
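The SNR figure used for Table 5.7 can be computed as below; a hedged sketch with hypothetical argument names:

```python
import numpy as np

def snr_db(speech, error):
    """SNR in dB between the input speech and a decomposition error signal."""
    return 10.0 * np.log10(np.sum(np.square(speech)) / np.sum(np.square(error)))
```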

[Figure 5.18: Comparison of transitional speech and the pulsed component from the zinc decomposition and linear filtering. Panels: Input Speech; Pulsed component for Pitch sync zinc decomposition; Linear filtering SEW component. Horizontal axis: Time in Samples.]

Examining the representation of the speech onset transient at sample 150 in Figure 5.18 indicates that the pulsed component of the pitch synchronous zinc decomposition is an accurate representation of the onset. Conversely, the linear filtering SEW component has smeared the transient by introducing extra periodic pulses before the transient onset. Figure 5.18 also indicates that the pulsed component of the pitch synchronous zinc decomposition accurately maintains the envelope of the speech whilst the linear filtering SEW component has distorted this envelope. The results shown in Figure 5.18 indicate that the pitch synchronous zinc decomposition produces a more accurate representation of speech transients than linear filtering. This characteristic is a result of modelling each pitch length segment independently in the decomposition process.

Summary for Pitch Synchronous zinc basis decomposition of speech

The proposed pitch synchronous zinc decomposition accurately captures the pulsed component of the input signal into the pulsed parameters. This is true even in quickly changing transient sections of the input signal. The method employs a closed loop modelling mechanism, which calculates the pulsed component to minimize the speech domain error between the original pulse and the synthesised pulsed component. This characteristic has the advantage of allowing quantisation errors to be accounted for in the decomposition process and thus ensures that the reproduced pulsed signals are accurate approximations of the original signals. The method also reduces the underlying waveform (pulse structure) to a finite set of pulse parameters that are suitable for high compression and scalable synthesis. The computational complexity of the method in its raw form is quite high; however, if the complexity reduction methods detailed earlier are employed, the complexity is reduced considerably.

The pitch synchronous zinc decomposition does not require the generation of a 2D surface before performing the decomposition. Instead, it decomposes each pitch length segment independently of all other segments. This characteristic does not require alignment or time scaling in the decomposition process and, as such, allows the zinc decomposition to achieve perfect reconstruction of the input signal at high bit rates.

5.7 Summary

This chapter has proposed three new methods of decomposing the speech waveform into pulsed and noise components, these being:

1. Analytic Decomposition
2. SVD of speech
3. Pitch synchronous zinc modelling

The proposed techniques all decompose the speech into parameters that allow a degree of scalability to be achieved when reconstructing the speech. Each of the proposed techniques also requires relatively low delay, in that they require only a single frame of speech (25 ms). The decomposition performance and scalability of the proposed decomposition techniques were compared to the performance of the well known decomposition method of linear filtering a 2D surface of pitch length sections. The results indicate that whilst linear filtering does provide a good level of decomposition, it is too inflexible to operate in an environment that requires scalable bit rate operation. In contrast, each of the proposed techniques was shown to support scalable bit rate operation. In both the SVD and zinc techniques this scalability was shown to be fine grain. However, the zinc technique appears to offer advantages over the SVD technique.

These advantages include calculating the pulsed component in a closed loop structure that minimises the speech domain error, and not requiring the generation of a 2D surface for decomposition. Calculating the pulsed component in a closed loop structure provides a mechanism that optimises each pitch length section individually and also allows quantisation errors to be modelled and minimised. The closed loop structure also allows well known techniques for perceptual weighting to be utilised in the decomposition. The removal of the necessity for 2D surface generation allows the zinc decomposition to achieve perfect reconstruction of the input signal at high bit rates. This perfect reconstruction is possible as the pitch length segments do not require alignment or time scaling in the decomposition process.

Also, this chapter proposed a low rate coding method that exploits the decomposition characteristics of linear filtering. The proposed coding method is capable of maintaining the subjective quality of a WI coder whilst reducing the total bit rate from 2.28 kbps to 2 kbps.

Chapter 6

A Scalable Coding Architecture

6.1 Introduction

Speech coding algorithms that operate at low rates (<4 kbps) employ lossy analysis models that limit their performance when scaled to higher bit rates. Conversely, higher rate algorithms (>4 kbps) utilize an accurate analysis stage, which produces a parameter set that is too rigid to produce high quality speech when scaled to lower bit rates. To operate at a scalable range of bit rates that spans the 4 kbps barrier, it appears essential that a coding structure merge these two conflicting paradigms into a new scalable algorithm. This scalable algorithm must employ an accurate analysis stage that produces a parameter set flexible enough to produce good speech quality at low rates and scale to higher perceptual quality as the bit rate is increased. Such a structure would allow only the most perceptually important parameters to be represented at low rates whilst allowing quantisation and modelling errors to be accounted for at higher rates.

This chapter proposes a scalable coding structure that is capable of low rate scalable operation. The proposed encoder employs a new analysis stage, which produces a scalable parameter set. This scalable parameter set allows the synthesis procedure to migrate from a time asynchronous parametric process at low rates to perfect reconstruction when operated with unquantised parameters.

The proposed scalable algorithm also requires an algorithmic delay equal to only a single frame of input speech. This delay is low enough to be comparable with fixed rate algorithms at higher bit rates and also allows the algorithm to be suitable for real time speech communication at all bit rates.

The basic structure of the scalable architecture is described in Section 6.2. This structure generates three scalable parameters: the Pitch track, the Pulsed CWs and the Noise CWs. The scalable quantisation procedures for each of these parameters are detailed in Sections 6.3, 6.4 and 6.5 respectively. A summary of the scalable architecture is contained in Section 6.6.

6.2 Structure of the Scalable Coder

6.2.1 Scalable Analysis structure

The scalable analysis structure is based on three elements discussed previously:

1. The SMWLPC technique introduced in Section 3.3 is used to remove short-term redundancies from the input speech.
2. The method proposed in Section 4.2 is used to segment the input speech frame into non-overlapped (critically sampled) pitch length segments (Characteristic Waveforms (CWs) [Klei95b]).
3. The pitch synchronous zinc based decomposition detailed in Section 5.6 is used to decompose the speech residual CWs into pulsed and noise components.

The combination of the three listed elements produces a structure that decomposes the input speech frames into a parameter set that can be used in varying combinations to synthesise the speech. The combination of the parameters used in synthesis, the quantisation accuracy and the transmission rate of the parameters are determined according to the bit rate available for transmission.

A block diagram of the proposed scalable analysis structure is shown in Figure 6.1. To facilitate high compression at low bit rates and still maintain a relatively low algorithmic delay, an input frame size of 25 ms (200 samples) was selected. This frame size requires a transmission rate of only 40 Hz for the parameters representing the frame of speech, whilst not exceeding the pseudo stationarity of the speech signal. The LP analysis shown in Figure 6.1 requires Hamming windowed speech segments of 30 ms to generate the LPC.

One of the important consequences of critically sampling the CWs is that edge effects in the Analysis-by-Synthesis operation of the pitch synchronous zinc decomposition can become a significant source of audible distortion. These edge effects can be reduced via pitch synchronous linear interpolation of the LPC across the frame. This interpolation ensures a smooth transition of the LPC values at each CW boundary. If the windowed speech used in the LPC calculation is centred on the current speech frame, the interpolation procedure requires a full window of look ahead (27.5 ms). This look ahead delay can be reduced to zero by replacing the Hamming window with a 30 ms asymmetric window biased toward the frame end (such as the window function used in the G.729 speech coding standard [Sala98]). This asymmetric window function allows the LPC to be interpolated from the last CW in the previous frame to the last CW in the current frame. The LPC calculated for the frame are initially converted to Line Spectral Frequencies (LSFs) [Soon84] and then pitch synchronously interpolated using:

LSF_int = LSF_prev (1 - \alpha) + LSF_curr \alpha,  where \alpha = mid_pos / 200    (6.1)

where LSF_int is the vector of interpolated LSFs for the current CW, LSF_prev is the LSF vector calculated for the last frame, LSF_curr is the vector of LSFs calculated for the current frame and mid_pos is the location at the centre of the current CW.

[Figure 6.1: Block Diagram of the Scalable Analysis Structure. A frame of input speech is LP analysed, critically sampled into CWs, and the CWs are decomposed into Pulsed CWs, Noise CWs and a Pitch Track.]

The structure of Figure 6.1 only decomposes and transmits whole CWs. This is important as the critically sampled pitch synchronous extraction of the CWs generally results in a non-integer number of CWs per frame. This characteristic presents two options for handling any partial CW: either add look ahead to the structure and extract the whole CW, or leave the samples not contained in a whole CW uncoded. To limit any additional look ahead to zero, the partial CW is left uncoded and carried over to the next frame. When coding a frame, if the centre of the partial CW (mid_pos) carried over from the previous frame falls in the previous frame, (6.1) must be modified for this CW to use the correct LSF parameters via:

LSF_int = LSF_prevprev (1 - \alpha) + LSF_prev \alpha,  where \alpha = mid_pos / 200    (6.2)

with mid_pos now measured within the previous frame, and where LSF_prevprev is the LSF vector from two frames previous.

The critical sampling and decomposition functions shown in Figure 6.1 both employ a closed loop AbyS structure in their operation. The use of an AbyS structure in each of these operations ensures that the parameters produced by the analysis structure minimise a perceptually weighted error in the speech domain.
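A sketch of the pitch synchronous LSF interpolation of (6.1) and (6.2). The normalisation alpha = mid_pos / 200 (the CW centre position divided by the 200-sample frame length) is an assumption of this sketch, recovered from a garbled equation, as is the handling of the carried-over CW:

```python
import numpy as np

FRAME_LEN = 200  # 25 ms frame at 8 kHz

def interp_lsf(lsf_a, lsf_b, mid_pos):
    """Interpolate between two LSF vectors at a CW centre position.

    For a CW inside the current frame pass (lsf_prev, lsf_curr, mid_pos) as
    in (6.1); for the carried-over CW whose centre lies in the previous frame
    pass (lsf_prevprev, lsf_prev, mid_pos measured within the previous frame)
    as in (6.2).
    """
    alpha = mid_pos / FRAME_LEN          # assumed normalisation
    return (1.0 - alpha) * lsf_a + alpha * lsf_b
```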

The use of AbyS also allows both the critical sampling and decomposition functions to account for the effects of modelling and quantisation errors.

The scalable analysis structure shown in Figure 6.1 separates the speech signal into a set of parameters where each parameter exhibits distinctly different perceptual properties. This separation of the perceptual content of the signal, combined with the scalability of the decomposition parameters (see Sections 4.3 and 5.6), is shown in the subsequent sections of this chapter and in Chapter 7 to produce a mechanism for bit rate scalability of the synthesized speech signal.

6.2.2 Scalable Synthesis Structure

The scalable analysis architecture shown in Figure 6.1 produces a parameter set that is suitable for bit rate scaling. The parameter set produced per frame is made up of LSFs, Pitch track, Pulsed CWs (PCWs) and Noise CWs (NCWs). A simplified block diagram of the scalable synthesis structure, which is the inverse of the scalable analysis stage, is shown in Figure 6.2. The combination of Figures 6.1 and 6.2 is a lossless architecture that will achieve perfect reconstruction of the speech signal if operated with unquantised parameters. However, bit rate scalability is possible by varying the accuracy of the quantisation and the transmission rate of the individual parameters. It was previously shown in Section 4.3 that the scalability of the PCWs allows the pulsed component of the synthesized speech to migrate from a parametric representation to a waveform-matched representation, depending on the accuracy of the pulsed parameters used in synthesis. It is demonstrated in the subsequent sections of this chapter that also varying the quantisation accuracy and transmission rate of the NCWs and pitch track parameters allows the scalable architecture to migrate from a truly time asynchronous parametric algorithm at low bit rates to a time synchronous waveform matching algorithm at higher bit rates.
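The per-frame parameter set lends itself to a simple container; the sketch below uses hypothetical field names for the four parameter groups named above:

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class FrameParameters:
    """Parameter set produced per 25 ms frame by the analysis of Figure 6.1."""
    lsfs: np.ndarray              # LSF vector for the frame
    pitch_track: List[int]        # individual CW lengths (samples)
    pulsed_cws: List[np.ndarray]  # zinc parameters (A, B, position) per CW
    noise_cws: List[np.ndarray]   # residual-domain noise component per CW
```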

[Figure 6.2: Block Diagram of Scalable Synthesis Architecture. The Pitch Track, Pulsed CW parameters and Noise CW parameters drive CW Reconstruction and CW to Residual conversion; the resulting residual excites an LP Synthesis Filter (controlled by the LSFs) to produce the synthesised speech.]

This gradual transformation in synthesis operation causes the operation of the CW reconstruction and CW to residual conversion blocks shown in Figure 6.2 to vary according to the selected bit rate. The operation of these blocks, together with the scalable quantisation of the pitch track, PCWs and NCWs, is detailed in Sections 6.3, 6.4 and 6.5 respectively.

The scalable structure formed by combining Figures 6.1 and 6.2 also has a relatively low algorithmic delay of approximately 30 ms. This delay is comparable to medium rate CELP coders such as FS1016 (37.5 ms) [Nati] and exceptionally low when compared to low rate parametric coders such as WI [Klei94] (60 to 80 ms).

6.2.3 Selection of the CW boundaries

As the pitch pulse positions are identified in the critical sampling process, it is possible (via the critical sampling operation) to select the CW boundaries such that the pitch pulse falls at any location within the CW. However, if the selection of the CW boundaries allows the pitch pulses to fall at different positions within adjacent CWs, maintaining an accurate pitch track structure in the synthesized speech requires that the actual pitch pulse location within each CW be transmitted, in addition to the individual CW lengths.

Transmitting the pitch pulse position within each CW is clearly a waste of available bits, and selecting the CW boundaries such that the pitch pulse falls at an identical position in each CW (regardless of the CW length) removes the need to transmit this position.

Selecting the best fixed position for the pitch pulse within the CW is not a trivial matter. This is due to a combination of the pitch pulse location being used as the reference location for the pitch synchronous zinc decomposition of Section 5.6, and the variable length of the CWs (16 to 160 samples). To determine the best pitch pulse location (regardless of CW length), the distribution of the zinc pulse positions within each CW, generated by the decomposition mechanism, was calculated. Calculation of this distribution involved decomposing the voiced speech of 4 speech sentences (of different speakers) from the TIMIT database and recording the positions of the zinc pulses (contained in the pulsed parameters) relative to the pitch pulse location for each CW. To allow the pulse positions between CWs of different lengths to be grouped, the positions have been normalised by the CW length. This results in a position of zero being the pitch pulse marker, a position of -1/2 representing half a CW length before the pitch pulse marker and a position of 1/2 being half a CW length after the pitch pulse marker. The distribution of the relative locations is shown in Figure 6.3.

[Figure 6.3: Distribution of zinc pulse positions within a CW. Horizontal axis: inter-CW position with respect to the pitch pulse, normalised to the CW length; vertical axis: number of occurrences.]

Figure 6.3 reveals that the majority of the zinc pulses occur in the range from -0.05 to 0.05, with another small cluster at approximately 0.1. The strong clustering of the zinc pulses close to zero (-0.05 to 0.1) indicates that these pulses are modelling the actual pitch pulse, with the remainder of the pulses modelling secondary pulses and other small detail. The relatively even spread of pulse locations outside this central cluster indicates that an optimal position for the pitch pulse marker within a CW is at the centre location. However, as no look ahead is available in the analysis structure, the position of the first pitch pulse in the next frame is unknown, and thus it is impossible to determine the centre location for the last CW in a given frame (as a transient change in the CW length may occur at the start of the next frame). To operate with no look ahead, it is best to set the pitch pulse location a fixed distance from the CW end boundary, as opposed to setting the pulse a fixed distance from the CW start boundary. This results because, if the pitch pulse locations were set a fixed distance from the CW start, then as the end boundary of the last CW in the current frame must correspond to the start boundary of the first CW in the next frame (which is unknown), it would be necessary to always carry the last pitch pulse in the current frame over to the next frame.

To reduce edge effects in the decomposition process it is best to locate all of the pulse positions that are modelling the large amplitude main pitch pulse within a single CW. These positions occupy the cluster of pulses covering the range from -0.05 to 0.1 in Figure 6.3. For the largest possible CW length of 160 samples, grouping this range covers samples in the range of -8 to 16 from the pitch pulse marker. This would indicate that selecting the CW boundaries to place the pitch pulse marker 16 samples from the CW end would allow the pulse locations required to model the actual pitch pulse to be located within a single CW. However, for small CW lengths (such as 16 samples), setting the pitch pulse 16 samples from the CW end results in the pitch pulse being too close to the CW start to accommodate the pulse locations required before the pitch pulse marker. A good compromise is to fix the pitch pulse position 10 samples from the CW end boundary. In this case every CW has at least six samples before the main pitch pulse location and, for all but the longest CWs (>100 samples), the range from -0.05 to 0.1 in Figure 6.3 is contained within the CW boundaries. In fact, the main cluster in Figure 6.3 (from -0.05 to 0.05) is covered for all possible CW lengths.

Another source of both edge effects and a requirement for look ahead is the infinite time span (albeit with reducing amplitude) of the zinc pulse. If the zinc pulses are allowed to propagate beyond the CW boundaries, glitches (edge effects) can occur in the synthesised speech unless the LPC for the next CW are incorporated into the calculation of the pulse parameters (these edge effects result even though the LPC are interpolated pitch synchronously across the CWs). To avoid both the look ahead required to calculate the next frame LPC and to eliminate additional edge effects, the zinc pulse parameters are not carried beyond the CW boundaries in the decomposition or subsequent synthesis operations. This has no effect on the perfect reconstruction properties of the coder as the pulse is truncated in both the analysis and synthesis stages.
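A small sketch of the two boundary rules just described: the pitch pulse marker fixed 10 samples from the CW end, and zinc pulse energy truncated at the CW boundaries. Names are illustrative:

```python
import numpy as np

PULSE_OFFSET_FROM_END = 10  # pitch pulse marker position, per the text

def pitch_pulse_marker(cw_length):
    """Index of the pitch pulse marker within a CW of the given length."""
    return cw_length - PULSE_OFFSET_FROM_END

def truncate_to_cw(pulse, cw_length):
    """Keep only the pulse samples that fall inside the CW boundaries."""
    out = np.zeros(cw_length)
    n = min(len(pulse), cw_length)
    out[:n] = pulse[:n]
    return out
```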

6.3 Scalable Pitch track Quantisation

6.3.1 Introduction

The unquantised pitch track parameter for the scalable coding structure of Section 6.2 is a vector containing the individual lengths of the CWs contained in the current frame. For the scalable synthesis structure shown in Figure 6.2, the pitch track parameter is the most crucial. If the pitch parameter is unquantised, the scalable synthesis coder produces synthesized speech that is time synchronous with the input speech and is thus capable of operating as a waveform coder. However, due to the large number of bits required to transmit the unquantised pitch track, the unquantised pitch track may be replaced by a simple linear approximation that maintains only the average behaviour of the unquantised pitch track (such as that used in parametric coders like WI [Klei94] and MELP [Supp97]). Depending on the relationship between the unquantised and linear pitch tracks, both the number of CWs in the frame and the total number of samples may vary between the original and synthesized speech. Thus, when using an approximate pitch track, the coder produces speech that is time asynchronous with the original speech. In this state, the original NCWs and PCWs must be replaced by approximations that represent the correct number of CWs, and individual CW lengths, required by the approximate pitch track. It is clear that when the unquantised pitch track is replaced by a linear approximation, the scalable synthesis structure migrates from a waveform-coding paradigm to a parametric coding paradigm.

To allow the scalable coder to migrate from a waveform coding paradigm to a parametric coding paradigm, this section details two distinct quantization methods for the pitch track parameter. The first of these methods is a low rate approximation (detailed in Section 6.3.3) that maintains only the average behaviour of the pitch track and does not constrain the number of CWs and samples to be maintained between the original and synthesized speech. The second method is a high rate approximation (detailed in Section 6.3.4) that ensures that the number of CWs and the number of samples in the synthesized and original speech are equal.

6.3.2 Properties of the Pitch Track Parameter

Due to the frame length being fixed at 25 ms and the CW lengths each being equal to the instantaneous pitch period at that point, there are a variable number of CWs and instantaneous pitch values per frame. If lossless scalar quantization of the pitch track were implemented, a variable number of bits/frame (dependent on the number of CWs in the frame) would be required. For the pitch range of 16 to 160 samples (generated by Section 4.2), 8 bits per CW length would be required for direct scalar quantisation. As the number of CWs per frame can vary from 1 to 13 dependent on the pitch, direct scalar quantisation of the pitch track would require 8 to 104 bits per frame. For the 25 ms frame size this amounts to 320 to 4160 bits per second (bps). Thus direct scalar quantisation is impractical due to both the range of bits required and the total number of bits required per frame. Direct scalar quantisation would also require the number of CWs per frame to be transmitted using 4 bits/frame.

Differential quantization of the CW lengths may reduce the overall bit rate required for quantizing the pitch track in regions of steady pitch. However, due to the transient nature of the pitch track (see Chapter 4, Figures 4.7 and 4.8 for examples), differential quantization would still require a significant number of bits to adequately represent transient sections.

Assuming the CW lengths (pitch track) within the frame vary slowly, the relationship between the CW length and the number of CWs/frame is shown in Table 6.1.

Table 6.1: Relationship between pitch length and the number of CWs/frame (columns: No. CW, Minimum length, Maximum length, Bits req per CW)

Table 6.1 assumes that one partial CW is carried over from the previous frame and that the amount of this partial CW that falls in the current frame is unknown. Examining Table 6.1 reveals that optimising a pitch track quantization scheme to exploit the relationship between the number of CWs in a frame and the bits required to represent the individual CW lengths should lead to a reduction in the total number of bits required to represent the pitch track. The only anomaly to the relationship shown in Table 6.1 is the length of the CW carried over from the previous frame, which may be unrelated to the current pitch. This characteristic requires that the pitch length of any partial CW be transmitted using the maximum number of bits required per pitch period from the previous frame. The high rate quantization scheme detailed in Section 6.3.4 uses an optimized quantization scheme to quantize the pitch track parameter.

6.3.3 Low bit rate Pitch track quantisation

To operate at low bit rates of around 2 kbps the coding structure must spend few bits on quantising the pitch. At these rates a pitch track that allows good perceptual speech quality to be produced is more important than an exact pitch track. In traditional low rate parametric speech coders that quantise the pitch, such as WI [Klei94], as few as 7 bits per frame are used for pitch quantisation. These coders transmit the average pitch value once per frame and use interpolation to produce the individual pitch periods. This section examines a number of possible methods for low rate quantization of the pitch track. The preferred option requires only 7 bits per frame and generates a pitch track that produces few distortions in the synthesized speech.

A number of methods were tested with a view to accurately representing the pitch track with a minimal number of bits. These included using Linear Interpolation (as detailed for WI), taking a Discrete Cosine Transform (DCT) and fitting a polynomial curve to the pitch track. The DCT and polynomial curve fitting methods were included as they present opportunities for high compression of the pitch track due to the usually slowly evolving nature of the pitch within a frame. Using a DCT with a slowly evolving function places most of the detail in the lowest few coefficients, and the higher coefficients can simply be set to zero with little effect on the estimate accuracy, whilst a polynomial curve represented with only a few coefficients is capable of matching the shape of a slowly changing function. In comparison, both the DCT and polynomial curve produced smaller modelling errors than linear interpolation; however, there was no control over where the errors occurred in a given frame using either the DCT or the polynomial curve method. This caused jitter in the pitch track, which in turn causes audible distortion in the synthesized speech. Using a monotonically increasing or decreasing function to interpolate the pitch (such as linear interpolation) produces a less accurate pitch track; however, the pitch track is smooth and in turn produces good sounding speech.

Using linear interpolation requires the pitch of the intermediate CWs to be generated by linearly interpolating between the pitch value transmitted for the past frame and the pitch value transmitted for the current frame. To eliminate the need to transmit the number of CWs in the current frame, the number of CWs in the current frame is estimated from the interpolated pitch using:

No_CW = floor( (200 + No_Unc_prev) / ((curr_pitch + prev_pitch) / 2) )    (6.3)

where floor rounds down to the nearest integer, 200 represents the number of samples in a frame, curr_pitch is the transmitted pitch for the current frame, prev_pitch is the transmitted pitch for the past frame and No_Unc_prev is the number of uncoded samples from the previous frame. As the number of samples left uncoded from the previous frame is known in the decoder, this value does not require transmission. If the pitch range is restricted accordingly, the linearly interpolated pitch track requires only 7 bits per frame for transmission of the average pitch for the current frame.

The linearly interpolated pitch track works well in most cases; however, due to the input pitch track being very accurate around transitional sections (see Section 4.2), it is possible for the average pitch value between adjacent frames in transitional sections to vary dramatically. In such cases, linearly interpolating between the past and present pitch tends to smear the transitional pitch track. To avoid this situation the past and present pitch values are used to determine if a transition has occurred. If the variation between the past and present pitch is greater than 25% then a transition is indicated. The value of 25% was chosen based on results published by Sundberg [Sund79], which state that the pitch track in voiced speech varies by less than 1% per millisecond. If a transition is detected, the length of the first CW is set equal to the average of the past and current average pitch values and the remaining CWs have their lengths set equal to the current pitch.

Also, in transitions, (6.3) is replaced with:

No_CW = floor( (200 + No_Unc_prev - (curr_pitch + prev_pitch) / 2) / curr_pitch ) + 1    (6.4)

A comparison of the interpolated pitch track and the unquantised pitch track for a sentence of male speech is shown in Figure 6.4. The average pitch error as a percentage of the original pitch track, across the entire sentence shown in Figure 6.4, was 8.8%. Examining Figure 6.4 indicates that the quantized pitch track accurately represents the original pitch track in sections of stable pitch, such as that contained in the dashed box in Figure 6.4. However, in transitional sections (such as the large glitch immediately to the left of the dashed box in Figure 6.4) the quantised pitch track tends to smooth the transients in the original pitch track and the error between the quantized and unquantised pitch tracks is quite large. These observations indicate that the majority of the pitch error occurs in transitional sections. This result is expected as the quantized pitch track is only attempting to maintain the average characteristics of the pitch track.

The described method for quantising the pitch track produced good quality synthesised speech (see Section 7.2) but does not ensure equal numbers of samples in the synthesised and unquantised CWs. Further, the quantised pitch track does not guarantee the same number of pitch cycles per frame in the synthesised and original speech. For these reasons the coder operating with the described low rate pitch quantisation is time asynchronous with the input speech, and methods for approximating the unquantised CWs to produce both the required number and the required lengths are necessary. These operations are detailed for the PCWs in Section 6.4.2 and for the NCWs in Section 6.5.
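A hedged sketch of the decoder-side CW count estimate; the exact forms of (6.3) and (6.4) above are reconstructions from a garbled source, so treat the arithmetic as an assumption:

```python
import math

FRAME_LEN = 200  # samples per 25 ms frame

def estimate_num_cws(prev_pitch, curr_pitch, uncoded_prev):
    """Estimate the number of CWs in the frame from the transmitted pitches."""
    if abs(curr_pitch - prev_pitch) / prev_pitch <= 0.25:   # no transition
        avg = 0.5 * (curr_pitch + prev_pitch)
        return math.floor((FRAME_LEN + uncoded_prev) / avg)              # (6.3)
    first_cw = 0.5 * (curr_pitch + prev_pitch)   # first CW set to the average
    remaining = FRAME_LEN + uncoded_prev - first_cw
    return 1 + math.floor(remaining / curr_pitch)                        # (6.4)
```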

[Figure 6.4: Comparison of interpolated and input pitch tracks (pitch in samples versus time in samples; original and quantised pitch tracks overlaid).]

6.3.4 High bit Rate Pitch Track

To allow the scalable coding structure to operate as a waveform coder, the number of samples and the number of whole pitch cycles represented by the original and quantized pitch tracks must be equal. This section details a pitch track quantization scheme that maintains this equality while requiring between 14 and 20 bits per frame for transmission. The number of bits per frame required for transmission varies according to the number of whole pitch cycles contained in the frame. To ensure that the number of samples represented by the quantized pitch track is equal to the number of samples in the original pitch track, the equality in (6.5) must be maintained:

\sum_{i=1}^{N} x(i) = b    (6.5)

where x(i) is the length of the ith synthesised CW in the frame, N is the number of CWs in the frame and b is the total number of samples in the original speech section (original pitch track).

In addition to maintaining the equality in (6.5), the pitch track quantization must also ensure that any quantization errors are spread in such a way that the resultant pitch track exhibits a smooth evolution. If all the quantization error occurred in a single pitch cycle, then the abrupt change between the length of this pitch cycle and the adjacent pitch cycles may introduce an audible distortion in the synthesized speech. When only the past and current average pitch values are known (as is the case for the low rate quantization in Section 6.3.3), calculating a quantization scheme that maintains the equality in (6.5) and also ensures an equal distribution of the quantization error across all pitch cycles is not a trivial task. This amounts to solving for an interpolation function that varies smoothly between the previous and present average pitch and also ensures that the number of samples in the frame is correct. This is not a uniquely solvable problem, as a definition that quantifies the optimally smooth evolution is required. A better solution is to use an iterative approach that attempts to spread any quantization errors evenly among all of the pitch cycles contained in the pitch track. This is the approach chosen for this thesis.

The proposed iterative method requires the exact length of the last CW in the current frame to be transmitted in place of the mean pitch value used for the low bit rate scheme (see Section 6.3.3). A pitch track generated by linearly interpolating between the transmitted CW length for the last frame and that for the current frame then ensures that the last CW length in the linear pitch track always equals the last CW length in the unquantised pitch track.

This linear pitch track is the starting point for the iterative process, which then adjusts the linear pitch track to ensure the equality of (6.5) is met, whilst still ensuring that the pitch track is smooth and does not suffer from jitter. A flow diagram of the iterative process is shown in Figure 6.5.

The iterative process shown in Figure 6.5 initially determines if a transition has occurred between the previous and present frames by calculating the percentage change in the transmitted CW lengths. If this change is less than 20% then no transition is detected and the iterative process commences with a linear pitch track. The iterative process then alters individual CW lengths of the linear pitch track one sample at a time, until the number of samples in the quantised pitch track is equal to the number of samples in the original speech. The process used to alter the CW lengths evenly disperses the error between the original and linear pitch tracks across all of the CWs and thus maintains a smooth pitch track that avoids pitch jitter.

If a transition is detected, the iterative process uses linear regression to predict the length of the first CW in the current frame from the pitch track of the previous frames. The first CW is not initially set to this predicted value but is instead set equal to a length that is at the limit allowed for the current frame. This limit is equal to 0.9 times the last CW length for a short to long transition and 1.1 times the last CW length for a long to short transition. The values of 0.9 and 1.1 coincide with the maximum range searched from the initial pitch position estimate in the critical sampling process detailed in Section 4.2. Once the first CW length is set to its initial value, a linear pitch track is then generated between this first CW length and the length transmitted for the last CW in the frame. This initial linear pitch track is adjusted a sample at a time until the required number of samples is contained in the pitch track according to the equality in (6.5). The single sample adjustment commences with the first CW and adjusts the length of this CW until it equals its predicted length or until the difference between the quantised pitch track length and the length of the original speech is less than the number of CWs.

At this point the individual CW lengths are adjusted, as per the case when no transition has occurred, until the equality in (6.5) is satisfied.

A comparison of the quantised and original pitch tracks for the same speech sentence as used in Figure 6.4 is shown in Figure 6.6. The average pitch error as a percentage of the original pitch track, across the entire sentence, was 4.1%. This is less than half the pitch error of 8.8% achieved for the low rate method shown in Figure 6.4. Examining Figure 6.6 indicates that the quantized pitch track is an accurate representation of the unquantised pitch track, even in most of the sharp transient regions. Whilst a few transients (such as that marked on Figure 6.6) are still smoothed, these are far fewer in number than those present in the low bit rate pitch track.
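A minimal sketch of the non-transition branch of Figure 6.5: starting from the linearly interpolated CW lengths, one sample is added or removed at a time, cycling over the CWs, until the track sums to the original sample count. Keeping the transmitted last CW length fixed is an assumption of this sketch:

```python
def match_sample_count(cw_lengths, target_samples):
    """Spread the sample-count error evenly across the CWs (equality (6.5))."""
    lengths = list(cw_lengths)
    step = 1 if sum(lengths) < target_samples else -1
    adjustable = max(1, len(lengths) - 1)  # leave the transmitted last CW fixed
    i = 0
    while sum(lengths) != target_samples:
        lengths[i] += step                 # single-sample adjustment
        i = (i + 1) % adjustable           # cycle so the error is dispersed
    return lengths
```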

[Figure 6.5: Flow diagram of iterative pitch track estimation. Legend: L = length of the last CW in the current frame; P = length of the last CW in the previous frame; Nb = the total number of bits in the pitch track; Pred = the predicted value of the 1st CW from the last frame; N = the number of CWs.]

[Figure 6.6: Comparison of high rate and original pitch tracks (pitch in samples versus time in samples).]

The iterative process shown in Figure 6.5 requires the number of CWs in the frame (N), the length of the last CW (L) and the number of uncoded samples (U) (i.e. the incomplete CW) to be transmitted. Using the values in Table 6.1, the bit requirement for the proposed quantisation method is shown in Table 6.2. Also included in Table 6.2 is the number of bits required to achieve lossless quantization of the pitch track. The bits required for the lossless quantisation assume that 8 bits are required for the first CW (as this is carried over from the previous frame), 4 bits are required for the number of CWs and the bits for the remainder of the CWs are extracted from Table 6.1.

The overall accuracy of the proposed high bit rate pitch track quantization scheme is sufficient for the scalable coder to operate in a waveform-coding paradigm. However, whilst the total number of samples and pitch cycles represented by the pitch track are constrained to equal those represented by the unquantised pitch track, the proposed scheme is a lossy quantization scheme if the number of CWs is greater than two.

Table 6.2: Bit requirements for high rate pitch track (columns: No. CW, Maximum CW length, Bits req for last CW, Bits for No. of uncoded samples, Bits for No. of CW, Total Bits for high rate Quant, Total Bits for lossless quant)

The result of this lossy characteristic is that some of the individual CW lengths may vary slightly between the quantized and unquantised pitch tracks. Due to this slight variation in CW lengths, using the proposed high rate pitch track in a waveform-coding paradigm may still require minor warping of the Pulsed and Noise CWs to match the quantized CW lengths. However, comparing columns 6 and 7 of Table 6.2 indicates a substantial bit rate saving for the proposed pitch track quantization over the lossless representation. This bit rate saving more than offsets the slight complexity increase required for the warping operation. It should be noted that the twenty bits required by the proposed quantization scheme for one CW include the 8 bits for the number of uncoded samples. Whilst this value is not required, it has been included to give consistency to the structure of the bit stream.

The proposed high bit rate pitch track quantization requires the number of CWs to be decoded before the remainder of the pitch track can be decoded. This method may be susceptible to bit errors in noisy transmission environments. However, a range of the possible number of CWs is also implicitly contained in the number of uncoded samples and the length of the last CW.

It is known from Section 4.2 that the length of the last CW must be within 0.9 to 1.1 times the pitch; thus the range of possible values for the number of CWs can be calculated from:

B / (1.1 L) < No_CW < B / (0.9 L)

where B is the number of samples coded in the current frame and L is the length of the last CW in the frame. If the decoded value for the number of CWs is outside this range, an error has occurred and the parameters from the last frame should be reused.

The high bit rate pitch track gives a good estimate of the pitch pulse locations. This is the location for a single pulse within the CW. When the scalable coder transmits more than one pulse per CW at higher rates, the positions of the extra pulses within the CW must be quantised separately and in addition to the pitch track.

6.4 Scalable quantisation of the Pulsed CWs (PCWs)

The PCWs are generated by the pitch synchronous zinc decomposition detailed in Section 5.6. A multidimensional zinc pulse model, shown in (5.10), is used to represent each PCW. For ease of readability (5.10) is repeated here:

Z(n) = \sum_{i=1}^{P} z_i(n) * h(n)    (5.10)

where: z_i(n) = A_i sinc(n - \lambda_i) + B_i cosc(n - \lambda_i)    (5.10a)

and where P represents the order of the zinc model (number of individual zinc pulses). The value of P is dependent upon the length of the CW (instantaneous pitch value). As can be seen in (5.10), each individual zinc pulse z_i within a multidimensional model consists of two amplitude values (A and B) and a single position value (\lambda). The position value is relative to the designated pitch pulse position within the CW (10 samples from the CW end boundary, see Section 6.2.3).
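For illustration, a discrete zinc basis pulse consistent with the reconstruction of (5.10a) above; the sinc/cosc pair is the usual zinc basis, but since (5.10a) was recovered from a garbled equation the exact form should be treated as an assumption:

```python
import numpy as np

def zinc_pulse(a, b, lam, length):
    """Residual-domain zinc pulse z_i(n) over one CW of the given length."""
    n = np.arange(length) - lam
    s = np.sinc(n)  # numpy's sinc is sin(pi*x)/(pi*x)
    # cosc(x) = (1 - cos(pi*x)) / (pi*x), with cosc(0) defined as 0
    with np.errstate(divide="ignore", invalid="ignore"):
        c = np.where(n == 0, 0.0, (1.0 - np.cos(np.pi * n)) / (np.pi * n))
    return a * s + b * c
```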

It was detailed in Section 4.3 that the multidimensional zinc model presents a scalable mechanism for reproducing pitch length pulsed components of a speech signal. The scalability is achieved by varying the model order P in the synthesis operation. This section details quantisation and synthesis techniques that allow this scalable reconstruction of the PCWs to be achieved in a practical speech coder. Section 6.4.1 analyses the quantisation characteristics of the zinc pulse parameters that form the PCW parameters. These quantisation characteristics are then exploited in the design of low-rate, mid-rate and high-rate quantisation schemes for the PCW parameters in Sections 6.4.2, 6.4.3 and 6.4.4 respectively. Also contained in each of these sections are the associated synthesis techniques required for scalable reconstruction of the PCWs. Finally, Section 6.4.5 details modifications to the high-rate synthesis operation that produce significantly improved perceptual results.

6.4.1 Characteristics of the PCW parameters

The individual zinc pulse amplitude parameters (A and B from (5.10)) each consist of a sign and a magnitude value. To determine the optimal quantisation method for the A and B magnitude values, the characteristics of these parameters were analysed. This analysis included determining the distribution of the magnitude values and the inter- and intra-frame correlation properties. To generate this data, 10 sentences, each of a different speaker, from the TIMIT database were used. Histograms of the A and B magnitude distributions for the 10 sentences are shown in Figures 6.7 and 6.8 respectively. These histograms indicate that the most common magnitude values for both the A and B parameters are clustered toward the lower range of the magnitude values, whilst the probability of a large magnitude occurring decreases as the magnitude increases. This characteristic, combined with the fact that the A and B parameters represent amplitude values and the human ear has a logarithmic response to amplitude, indicates that quantising the A and B parameters using a logarithmic rather than a linear scale is optimal.

The correlation properties of the A and B magnitudes were calculated by concatenating the 10 test sentences and the results are shown in Tables 6.3 and 6.4. Table 6.3 represents the inter-CW correlation of the A and B magnitudes when only a single zinc pulse per CW is calculated. The inter-CW correlation is calculated using:

R(d) = [ \sum_{k=0}^{K-1} A(k) A(k-d) ] / [ \sum_{k=0}^{K-1} A(k)^2 ]    (6.6)

where K represents the number of CWs in the test set and A(k) is the vector of A parameter magnitudes calculated from the entire input speech group. Table 6.4 shows the intra-frame correlation of both the A and B magnitude CW vectors when a variable number of zinc pulses is generated per CW. The intra-frame correlation for a given frame of CWs is calculated as:

R_tr(n, m) = [ \sum_{k=0}^{K-1} A_n(k) A_m(k) ] / sqrt( \sum_{k=0}^{K-1} A_n(k)^2 \sum_{k=0}^{K-1} A_m(k)^2 ),  n = 1, 2, ..., p,  m = 1, 2, ..., p    (6.7)

where K represents the number of CWs in the test set, A_N(k) is the matrix of pulse magnitudes for the frame and p is the order of the zinc model (number of pulses/CW) used for the frame. Each row of A_N(k) represents a CW, with the individual pulses in the CW being placed in the columns. The data shown in Table 6.4 is the mean value of R_tr for each model order across the entire test set of sentences. Table 6.4 also includes intra-frame correlation results for vectors of Gaussian distributed random values.
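The two correlation measures, as reconstructed in (6.6) and (6.7), translate directly to NumPy; a sketch with hypothetical argument layouts:

```python
import numpy as np

def inter_cw_correlation(a, d):
    """R(d) of (6.6) for a 1-D magnitude sequence a at lag d (over the overlap)."""
    if d == 0:
        return 1.0
    return np.sum(a[d:] * a[:-d]) / np.sum(a ** 2)

def intra_frame_correlation(mags, n, m):
    """R_tr(n, m) of (6.7); mags is a (num_CWs x model_order) magnitude matrix."""
    num = np.sum(mags[:, n] * mags[:, m])
    den = np.sqrt(np.sum(mags[:, n] ** 2) * np.sum(mags[:, m] ** 2))
    return num / den
```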

[Figure 6.7: Magnitude distribution of the Zinc A parameter (number of occurrences versus A parameter magnitude value).]

[Figure 6.8: Magnitude distribution of the Zinc B parameter (number of occurrences versus B parameter magnitude value).]

Table 6.3: Inter frame correlation of the zinc magnitude parameters (columns: Lag, A parameter, B parameter)

Table 6.4: Intra frame correlation of the zinc parameters (columns: Model Order, A parameter, B parameter, Random)

The results in Table 6.3 indicate that both the A and B magnitudes exhibit strong correlation across adjacent CWs. This correlation should make the parameters suitable for compression through means such as decimation/interpolation and Vector Quantisation (VQ) [Gers92]. Examining the results in Table 6.4 indicates that in all cases except for a model order of 2, the zinc magnitudes exhibit a stronger correlation than the random vectors. Gersho [Gers92] states that strong correlation in the input vectors allows VQ to obtain a significant quantisation advantage over scalar quantisation, and thus the results in Table 6.4 indicate that for model orders greater than two the zinc magnitudes within a CW are suitable for compression using VQ.

To determine the interdependence between the A and B magnitudes of the zinc pulse, the correlation between the A and B magnitudes for a single pulse per CW was calculated using (6.7) with n and m set to one. The result of this calculation was a correlation of 0.8 between the A and B magnitudes. This high correlation between the magnitudes indicates that a compression gain over separately coding each magnitude may be possible by utilising differential quantisation for one of the magnitudes.

6.4.2 Low bit rate Quantisation and Synthesis of the PCWs

This section proposes a low bit rate quantisation scheme for the pulsed parameters that requires only thirteen bits per frame for transmission. This quantisation scheme produces a parametric representation of the PCWs that is suitable for use in the parametric operation of the scalable speech coder.

Description of the Low bit rate Quantiser

In addition to minimizing the bit rate required for transmission, the greatest constraint on the design of a low bit rate quantization scheme for the PCWs is that it must operate in conjunction with the low bit rate pitch track quantization scheme of Section 6.3.3. This reconstructed pitch track does not ensure that the total number of samples or the number of CWs represented by the pitch track equals those represented by the original pitch track. This ambiguity requires that the associated CW quantisation procedures maintain the perceptually important properties of the original CWs whilst allowing both the number of CWs and the length of the individual CWs to be modified to match the reconstructed pitch track.

Whilst the unquantised PCWs consist of multiple zinc pulses per CW, the scalable synthesis of the LP residual detailed in Section 4.3 indicates that a good parametric representation of a PCW is possible by using only a single zinc pulse per CW in the synthesis operation. This characteristic was supported in Section 5.2, where it was shown that a WI speech coder produced good quality speech when the SEW was represented by a single zinc pulse (with positive and equal A and B parameters) per frame. Hiotakakos et al. [Hiot94] and Brooks et al. [Broo98] also use a single zinc pulse per frame and linear interpolation of the A and B parameters to generate the synthesised excitation for voiced speech. These results indicate that the zinc pulse parameters representing pitch pulses evolve quite slowly in voiced speech and thus are suitable for decimation before transmission and interpolation in synthesis.

representing pitch pulses evolve quite slowly in voiced speech and thus are suitable for decimation before transmission and interpolation in synthesis. Transmitting only a single zinc pulse per frame and linearly interpolating presents a flexible quantisation scheme where the number of CWs and the lengths of the individual CWs can easily be modified to match the reconstructed pitch track. However, transmitting only a single pulse per frame and linearly interpolating can cause distortions in transitional sections, such as buzziness at voiced speech onsets. One solution to the transition performance is to transmit multiple pulses per frame; however, this presents two new problems. Firstly, the bit rate required for transmission is increased, and secondly, maintaining the flexibility of modifying the number of CWs per frame via interpolation requires that the number of pulses transmitted always be less than or equal to the number required by the reconstructed pitch track. To both maintain a low bit rate and transmit multiple pulses per frame, the amplitudes of both the A and B parameters for each pulse can be set to a positive value equal to the average of the original A and B magnitudes for that pulse (as per the low rate WI coder proposed in Section 5.2). This results in only a single magnitude value being required per pulse. The high correlation between the A and B parameters reported in Section 6.4.1 indicates that setting the A and B magnitudes to be equal need not dramatically reduce the accuracy of the representation. The sign values associated with each zinc amplitude value were found to be too sensitive for compression, hence the positive-only value for the single pulse magnitude discussed above. The sensitivity of the sign values is due to the fact that a single incorrect sign value in a section of strongly voiced speech causes the pulse amplitude to be inverted. This inversion in amplitude causes distinct clicks in the synthesised speech.
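A minimal sketch of the single-pulse parametric synthesis just described, assuming the zinc_pulse helper sketched earlier, a pitch pulse marker 10 samples from the CW end boundary (as stated later in this chapter), and linear interpolation of a single positive magnitude across the frame:

```python
import numpy as np

def synth_pcw_frame(mag_prev, mag_curr, cw_lengths, marker_offset=10):
    """One zinc pulse per CW with A = B = a positive magnitude linearly
    interpolated from the previous frame's value to the current one."""
    cws = []
    n_cw = len(cw_lengths)
    for i, length in enumerate(cw_lengths):
        w = (i + 1) / float(n_cw)                 # interpolation weight
        mag = (1.0 - w) * mag_prev + w * mag_curr
        pos = length - marker_offset              # assumed pitch pulse marker
        cws.append(zinc_pulse(mag, mag, pos, length))
    return cws
```

Because each CW is regenerated from its parameters, the number of CWs and their individual lengths can be taken directly from the reconstructed pitch track, which is exactly the flexibility the text requires.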

Ensuring that the number of transmitted PCWs per frame is less than or equal to the number of CWs required by the reconstructed pitch track involves basing the number of transmitted pulses on the average pitch for the frame. To provide a good compromise between transitional performance and quantisation complexity, the possible pitch values were grouped into 4 ranges. For each pitch range the number of pulses to be transmitted is fixed according to the minimum number of original CWs possible for the pitch range. The selected pitch ranges and the associated number of pulses to be quantised are shown in Table 6.5.

Table 6.5: Quantised pulses for each pitch range (pitch ranges <30, <50, <80 and >=80 samples)

As the number of input pulse values (the number of CWs) varies from frame to frame (even for the same pitch range, see Table 6.1), the input zinc magnitude values within a frame are adaptively grouped and filtered to produce a single vector of magnitudes whose length corresponds to that required by Table 6.5. Quantisation of this vector could then be achieved with a single parameter using VQ. However, to provide consistency in the VQ operation, the vector of magnitude values is firstly separated into a gain term and a normalised vector. The gain term is then scalar quantised and the normalised vector quantised using VQ. This is a form of gain/shape quantisation for the pulse magnitudes, where the gain term represents the maximum magnitude and the normalised vector describes the evolution of the pulse magnitudes within the frame.
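A sketch of the adaptive length mapping and gain/shape split described above; linear interpolation is used here as a stand-in for the thesis's adaptive grouping and filtering:

```python
import numpy as np

def gain_shape(mags, target_len):
    """Map a variable number of per-CW pulse magnitudes to the fixed length
    required by Table 6.5, then split into a gain term (the maximum
    magnitude) and a normalised shape vector."""
    mags = np.asarray(mags, dtype=float)
    src = np.linspace(0.0, 1.0, num=len(mags))
    dst = np.linspace(0.0, 1.0, num=target_len)
    vec = np.interp(dst, src, mags)       # fixed-length magnitude vector
    gain = float(np.max(vec))             # gain: maximum magnitude
    shape = vec / gain if gain > 0 else vec
    return gain, shape
```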

Bit allocation and Quantiser Performance

The bit allocation, per frame, for quantising the PCWs at low rates is shown in Table 6.6.

Pitch             <30    <50    <80    >=80
Bits for Gain      5      5      5      5
Bits for Shape     8      8      8      8

Table 6.6: Bit allocation for low rate pulsed CW quantisation

The bit allocation in Table 6.6 uses 5 bit logarithmic quantisation for the pulse gain (magnitude). Perceptual testing indicated that 5 bit logarithmic quantisation provided an acceptable trade-off between resolution and bit rate. To reduce the storage requirements for the multiple shape CBs (one for each pitch range in Table 6.6), multistage VQ with 5 bits in the first stage and 3 bits in the second stage is used for the shape quantisation. Using this VQ configuration requires only 40 vectors of storage for each pitch range, as opposed to the 256 vectors per range required if multistage VQ is not used. Using 13 bits/frame requires an overall bit rate of 520 bps for transmitting the PCWs. It was also possible to reduce the bit rate to 400 bps by using only 5 bits/frame for the shape quantisation. This method produced a reasonable representation; however, more artefacts were present in the synthesised speech than when 13 bits/frame was used. Figure 6.9 shows a comparison of synthesised speech produced using unquantised PCWs with only 1 pulse per CW, 13 bits/frame PCW quantisation, 10 bits/frame PCW quantisation and 5 bits/frame (no shape quantisation, only interpolation) PCW quantisation. The comparison in Figure 6.9 uses only the PCWs to generate the synthesised speech. Figure 6.9 indicates that using both 13 and 10 bits/frame to quantise the PCWs produces

an accurate representation of the unquantised pulsed speech component. The voiced speech produced by both of these quantisation schemes maintains the overall envelope shape of the speech shown in Figure 6.9. The most noticeable distortion produced by either of these quantisation schemes is a slight increase in periodicity around transient sections (such as around samples 500 and 1000). In contrast, when only a single magnitude is used in synthesis, with no evolution information, the speech is clearly distorted. In this case the envelope shape is destroyed and significantly more periodicity is added to the transient sections. Whilst the synthesised speech for the 10 and 13 bits/frame quantisation schemes shown in Figure 6.9 is almost identical, informal perceptual testing of the synthesised speech for the two methods indicated that the 13 bits/frame quantiser produced fewer audible distortions in transitional sections of speech.
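A sketch of the quantisers just described: 5 bit logarithmic scalar quantisation of the gain, and a two-stage (5 + 3 bit) MSVQ search for the shape vector. The gain range is illustrative rather than a value from the thesis, and the codebooks are assumed to be trained offline:

```python
import numpy as np

def log_quantise_gain(gain, bits=5, g_min=1.0, g_max=4096.0):
    """Logarithmic scalar quantisation of the pulse gain; returns the index
    and the reconstructed value."""
    levels = 2 ** bits
    g = float(np.clip(gain, g_min, g_max))
    step = (np.log(g_max) - np.log(g_min)) / (levels - 1)
    idx = int(round((np.log(g) - np.log(g_min)) / step))
    return idx, float(np.exp(np.log(g_min) + idx * step))

def msvq_encode(shape, cb1, cb2):
    """Two-stage MSVQ: cb1 (32 vectors) coarsely matches the shape, cb2
    (8 vectors) quantises the residual, giving 40 stored vectors in total."""
    i1 = int(np.argmin(np.sum((cb1 - shape) ** 2, axis=1)))
    resid = shape - cb1[i1]
    i2 = int(np.argmin(np.sum((cb2 - resid) ** 2, axis=1)))
    return i1, i2, cb1[i1] + cb2[i2]
```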

Figure 6.9: Comparison of synthesised speech for the proposed low rate quantisation schemes (panels: synthesised speech using unquantised first order pulsed CWs; speech with pulsed CWs at 13 bits/frame; speech with pulsed CWs at 10 bits/frame; speech with pulsed CWs generated via interpolation; horizontal axis: time in samples)

The Signal to Noise Ratio (SNR) between the quantised pulses and the original pulses for a given CW can be calculated using:

$$\mathrm{SNR} = 10\log_{10}\left(\frac{\sum_{n=0}^{N-1} r(n)^{2}}{\sum_{n=0}^{N-1}\left(r(n)-\hat{r}(n)\right)^{2}}\right) \qquad (6.8)$$

where $r(n)$ is the residual domain unquantised PCW and $\hat{r}(n)$ is the residual domain quantised PCW. The average CW SNR for 10 input sentences was calculated between the quantised first order zinc pulse, using 10 and 13 bits per frame, and the unquantised first order zinc pulse, using (6.8). The average SNR results are shown in Table 6.7.

                                   SNR in dB
10 bits/frame quantisation           3.40
13 bits/frame quantisation           3.42

Table 6.7: Average SNR for low rate quantisation

Whilst the SNR values shown in Table 6.7 are quite low when compared to the SNR values for the unquantised PCWs (14.1 dB in voiced speech, from Table 5.7 in Chapter 5), it should be pointed out that the low rate quantisation schemes are parametric schemes and a high SNR is not the primary objective. The results in Table 6.7 show a slight improvement in SNR of 0.02 dB for the 13 bits/frame quantisation when compared to the 10 bits/frame quantisation. However, as stated previously, even though the SNR improvement is small, the synthesised speech produced using the 13 bits/frame quantisation has fewer distortions in transitional regions than the synthesised speech produced using the 10 bits/frame quantisation scheme. Initial perceptual testing of the 13 bits/frame quantisation scheme indicated good quality synthesised speech in voiced sections; however, unvoiced sections were slightly buzzy. To reduce this buzziness the signs of the pulses in unvoiced speech were made random. The unvoiced speech is detected by comparing the NCW energy to the PCW energy. If the ratio of noise to pulsed energy exceeds a predefined threshold, the unvoiced flag is set. As unvoiced CWs are generally short (see Section 4.2), allowing the V/UV flag to be set only when the CW length is thirty or less reduces the likelihood of incorrect setting. Also, to prevent distortions at transients, the first unvoiced frame

detected immediately after a sustained voiced section has the pulse sign randomised only for the CWs occurring after the frame mid-point. Allowing the pulse signs to be randomised only under the described conditions produced a high degree of reliability in the decision. Also, as the same analysis and synthesis structure is always used (with only a slight modification to the quantised parameters), incorrect setting of the unvoiced flag has a relatively low impact when compared to algorithms that use a completely different coding structure for voiced and unvoiced speech, such as [Thys01, Yang01, Burn93, Shlo98]. An incorrect decision results in only very minor audible distortions (or none), whereas incorrect classification and subsequent incorrect selection of the total coding structure can cause significant audible distortions.

6.4.3 Mid bit rate Quantisation and Synthesis of the PCWs

This section proposes a quantisation scheme for the PCW parameters that is designed to operate with the high rate pitch quantisation scheme detailed in Section 6.3. The quantisation scheme requires between 14 and 44 bits per frame for transmission (dependent on the pitch) and is designed for use in the scalable coder operating at approximately 4 kbps. At this rate the quantisation scheme is designed to operate somewhere between true waveform and parametric coding.

Description of the Mid bit rate Quantiser

In common with the low rate PCW synthesis scheme detailed in Section 6.4.2, the mid rate quantisation scheme quantises only a single zinc pulse per CW. However, in contrast to the low rate scheme, the mid rate scheme quantises both the zinc A and B parameters separately. As the mid rate scheme is used in conjunction with the high rate pitch quantiser, the number of CWs and samples in the synthesised speech are constrained to equal the number of CWs and samples in the original speech. This

equality allows the quantisation error of each pulse to be minimised directly. The simplest form of quantisation that minimises the quantisation error for each pulse is direct logarithmic scalar quantisation of the pulse A and B magnitudes. Alternatively, if the bit rate required for scalar quantisation exceeds that available for transmission, extra compression can be achieved by using gain/shape quantisation of the pulse magnitudes in a frame, for both the A and B parameters. This involves forming a vector of magnitudes for each of the A and B parameters; the lengths of these vectors vary according to the number of CWs in the frame. These magnitude vectors are then separated into gain and normalised vector terms, where the gain terms represent the maximum amplitude of each parameter and the normalised vectors represent the magnitude evolution (shape) of the parameters. The gain terms are quantised using logarithmic scalar quantisation and the normalised vectors are quantised using Variable Dimension Vector Quantisation (VDVQ) [Das96]. Using VDVQ allows only a single codebook to be used for the quantisation of the pulse magnitude evolution. However, to increase the performance of the VDVQ scheme, the numbers of CWs in a frame were grouped according to the number of bits required to quantise the high rate pitch track. Segmenting the data into smaller groups allows VDVQ to operate better, as the correlation between the grouped vectors is greater than the correlation of the entire data set [Gers92]. For only 1 and 2 CWs per frame no advantage is achieved by using VDVQ of the magnitudes, and simple logarithmic scalar quantisation of the magnitudes is employed.

Bit allocation and Quantiser Performance

The grouping of the different numbers of CWs in a frame, and the bits required for quantising either the A or B magnitudes, are shown in Table 6.8.

Table 6.8: Bit allocation for mid rate pulsed CW magnitude quantisation (rows: 8 to 13, 5 to 7, 3 or 4, 2 and 1 CWs/frame; columns: pitch quantisation bits/frame, magnitude gain bits/frame, magnitude evolution bits/frame)

As Table 6.8 indicates the number of bits required for quantising either the A or B magnitude, the total bit rate required is twice that shown in Table 6.8. In addition to the bits shown in Table 6.8, the signs of the pulse parameters are transmitted at this bit rate. This requires two bits for each pulse. To limit the maximum number of bits used for transmitting the signs, the signs of pulses in excess of 10 pulses per frame are set equal to the sign of the tenth pulse. In a test set of 10 sentences, more than 10 pulses per frame was found to occur on less than 0.05% of occasions. This rarity of occurrence results in no audible artefacts due to repeating the sign of the 10th pulse. To determine the objective performance of the mid rate quantisation scheme, this scheme and the low rate scheme from Section 6.4.2 were used to quantise the PCWs for 10 test sentences. To allow objective comparison of the two schemes, the low rate quantisation scheme also used the high rate pitch track. This ensured that the number of reconstructed magnitudes per frame was the same for both methods. The average CW SNR for each of the methods was calculated with respect to the unquantised first order PCWs using (6.8). The average CW SNR for the mid rate scheme with respect to the full order unquantised PCWs was also calculated. The average SNR results are shown in Table 6.9.

                                   SNR in dB
Mid rate w.r.t. 1st order            15.1
Low rate w.r.t. 1st order            3.42
Mid rate w.r.t. full order           5.9

Table 6.9: Average SNR for mid rate quantisation

Comparing the first two rows of Table 6.9 shows an improvement in SNR of over 11.5 dB for the mid rate quantisation scheme when compared to the low rate quantisation scheme. This is a dramatic improvement in SNR and is directly due to ensuring that the numbers of CWs in the encoder and decoder are equal in the pitch quantisation. This allows the MSE across the entire frame of pulses to be minimised in the VDVQ operation, and shows the value of using VDVQ as a compression tool. The third row of Table 6.9, the average SNR with respect to the full order unquantised PCWs, shows an SNR of 5.9 dB. This SNR indicates that the energy in the component of the full order unquantised PCW modelled by the mid rate scheme is approximately four times the energy of the component not modelled. This result indicates that most of the PCW energy is captured in the first pulse of the zinc decomposition. Figure 6.10 shows a comparison of synthesised speech produced using unquantised PCWs with only 1 pulse per CW, the mid rate PCW quantisation, and the low rate PCW quantisation scheme from Section 6.4.2.

Figure 6.10: Comparison of synthesised speech for the proposed mid rate quantisation schemes (panels: synthesised speech using unquantised first order pulsed CWs; speech with mid rate pulsed CW quantisation; speech with low rate pulsed CW quantisation; horizontal axis: time in samples)

Comparison of the speech in Figure 6.10 indicates that the proposed mid rate quantisation scheme for the PCWs produces a much closer match to the original speech than the speech generated using the low rate scheme. The mid rate scheme has accurately reproduced both the transient structure (at around samples 500 and 1000) and the envelope of the unquantised waveform. In contrast, the low rate scheme has added some periodicity to the transients and smoothed the envelope.
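For concreteness, the average CW SNR figures quoted in Tables 6.7 and 6.9 correspond to evaluating equation (6.8) per CW and averaging over the test set; a minimal sketch:

```python
import numpy as np

def cw_snr_db(r, r_hat):
    """Equation (6.8): SNR in dB between an unquantised residual-domain CW
    r(n) and its quantised counterpart r_hat(n)."""
    r = np.asarray(r, dtype=float)
    err = r - np.asarray(r_hat, dtype=float)
    return 10.0 * np.log10(np.sum(r ** 2) / np.sum(err ** 2))

def average_cw_snr(cw_pairs):
    """Average of the per-CW SNRs over a test set of (r, r_hat) pairs."""
    return float(np.mean([cw_snr_db(r, rh) for r, rh in cw_pairs]))
```

Whether the thesis averages the per-CW dB values or the underlying energy ratios is not stated in this excerpt; the sketch assumes the former.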

6.4.4 High bit rate Quantisation and Synthesis of the PCWs

This section proposes a quantisation scheme for the PCW parameters that requires between 89 and 132 bits/frame for transmission, dependent on the number of CWs/frame. At this bit rate the PCW quantisation scheme is operating in a waveform coding paradigm and achieves an average SNR of 9.1 dB.

Description of the High bit rate Quantiser

Based on the results in Section 4.3, which indicate that increasing the number of pulses per CW used in synthesis reduces the error between the synthesised and original speech, the high rate quantisation scheme transmits multiple pulses per CW. However, as the number of pulses/CW present in the unquantised data varies dynamically according to the individual pitch lengths in the frame (see equation (5.13)), quantising all of the PCW parameters available in the unquantised data would require separately transmitting the number of pulses quantised for each CW in the current frame. To remove the need to transmit the number of pulses per CW for each CW in the frame, the number of pulses quantised (used in synthesis) is set to the minimum number of pulses/CW possible for the number of CWs per frame (calculated using the minimum possible pitch lengths in Table 6.1 and (5.13)). Informal perceptual testing, using unquantised parameters, indicated that this limitation produced speech that was perceptually indistinguishable from that generated using all of the available pulse parameters. The number of quantised pulses/CW for each number of CWs/frame is shown in column 2 of Table 6.10.

Table 6.10: Bit allocation for high rate PCW magnitude quantisation (for each number of CWs/frame the table lists the number of pulses/CW, the first order pulse magnitude quantisation bits, the high order pulse magnitude quantisation bits and the total bits for the pulse magnitudes)

To produce an embedded bit stream, the mid rate quantisation scheme is used as the basis upon which the high rate quantisation scheme is built, with the additional pulses/CW quantised separately. For cases where the number of CWs/frame is 5 or less, the additional pulses/CW are firstly normalised by the pulse already quantised for the CW. This produces a vector of normalised magnitudes for each CW and, to exploit the high intra-frame correlation shown in Table 6.4, each vector is then quantised using VDVQ. Normalising the high order (additional pulses/CW) pulses by the quantised initial pulse magnitude from the mid rate scheme ensures that any quantisation error in the mid rate scheme is accounted for when quantising the high order pulses. In the cases where the number of CWs/frame is greater than 5, Table 6.10 indicates that only 2 pulses per CW are quantised. In these cases the vector size per CW is only 2 and using VQ offers no advantage. To achieve good compression for these occurrences, the second pulse of each CW is quantised separately from the first, using the same method as is used for the first pulses (the mid rate method).
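A sketch of the VDVQ search in the spirit of [Das96], and of the embedded normalisation described above. The linear index sampling used to reduce the codevector dimension is an assumption; [Das96] defines the sampling for its own application:

```python
import numpy as np

def vdvq_encode(vec, codebook):
    """Variable Dimension VQ: codevectors are stored at a maximal dimension
    and sampled down to the input dimension before computing the MSE."""
    vec = np.asarray(vec, dtype=float)
    d_max = codebook.shape[1]
    idx = np.round(np.linspace(0, d_max - 1, num=len(vec))).astype(int)
    sub = codebook[:, idx]            # each codevector at the input dimension
    best = int(np.argmin(np.sum((sub - vec) ** 2, axis=1)))
    return best, sub[best]

def quantise_high_order(first_pulse_q, extra_mags, codebook):
    """Embedded refinement: additional pulse magnitudes are normalised by the
    already-quantised first pulse (the mid rate layer), so the mid rate
    quantisation error is absorbed when the decoder re-scales."""
    norm = np.asarray(extra_mags, dtype=float) / first_pulse_q
    index, norm_q = vdvq_encode(norm, codebook)
    return index, first_pulse_q * norm_q   # decoder-side reconstruction
```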

Bit allocation and Quantiser Performance

The pulse magnitude quantisation configuration for the high rate scheme is shown in Table 6.10. The bit allocation shown in Table 6.10 is for a single magnitude value per zinc pulse. When both magnitudes are quantised the bit allocation shown in Table 6.10 is used for each magnitude value, thus the number of bits required is doubled. In addition to quantising the magnitudes, the intra-CW positions of the pulses must also be transmitted when using multiple pulses per CW in synthesis. If it is assumed that the first pulse will always occur at the pitch pulse marker (10 samples from the CW end boundary), the positions of the high order pulses with respect to the pitch pulse marker must be quantised. Since the pulse positions are constrained to occur at orthogonal positions in the decomposition operation, only even locations from the pitch marker are possible. Using this fact, and the maximum CW lengths from Table 6.1, the bits required to transmit the pulse positions are shown in column 3 of Table 6.11.

Table 6.11: Bit allocation for the high rate PCW quantisation scheme (for each number of CWs/frame the table lists the number of pulses/CW, the bits for the pulse positions, the bits for the pulse signs and the total bits for the high rate quantisation)

Table 6.11 also includes the bits required to transmit the sign parameters for the zinc pulses and the total number of bits/frame required for the high rate quantisation scheme. As discussed in Section 6.4.3, the occurrence of frames containing more than

10 CWs is very rare. To simplify the quantisation scheme, the bit allocation shown in Table 6.11 does not specify frames that contain more than 10 CWs/frame; rather, the pulse magnitudes and positions of the 10th CW are repeated if more than 10 CWs occur in the frame. The average SNR for the high rate quantisation method with respect to the full order unquantised PCWs, for 10 test sentences, was calculated using (6.8). The result is shown in Table 6.12.

                                   SNR in dB
High rate w.r.t. full order          9.1

Table 6.12: Average SNR for high rate quantisation

The average SNR value of 9.1 dB shown in Table 6.12 indicates that the high rate quantisation scheme produces an accurate representation of the PCWs. Figure 6.11 shows a comparison of synthesised speech produced using unquantised PCWs (using all the pulses/CW), the high rate PCW quantisation and the mid rate PCW quantisation scheme from Section 6.4.3. Examining Figure 6.11 indicates that the high rate PCW quantisation produces a much more accurate representation of the speech produced using unquantised PCWs than the mid rate pulsed quantisation. The mid rate quantisation tends to smooth the evolution of the pulse magnitudes across the frame, whilst the high rate quantisation preserves both the transient structure (around sample 1000) and the pulse evolution. The high rate quantisation also better reproduces the fine structure of the individual pulses, whilst the mid rate quantisation tends to produce a smoothed pulse.
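The position and sign budget of Table 6.11 can be reproduced from the constraints stated above; a sketch per frame, with the maximum CW length taken from Table 6.1 (not shown in this excerpt). The carry-over of the 10-pulse sign cap from the mid rate description is an assumption:

```python
import math

def position_and_sign_bits(n_cw, pulses_per_cw, max_cw_len, sign_cap=10):
    """High rate side information: the first pulse sits at the pitch pulse
    marker, the remaining pulses at even offsets only, and two sign bits
    (A and B) are sent per pulse, capped at sign_cap pulses per frame."""
    candidates = max(1, max_cw_len // 2)          # even offsets only
    bits_per_pos = math.ceil(math.log2(candidates)) if candidates > 1 else 0
    pos_bits = n_cw * (pulses_per_cw - 1) * bits_per_pos
    sign_bits = 2 * min(n_cw * pulses_per_cw, sign_cap)
    return pos_bits, sign_bits
```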

Figure 6.11: Comparison of synthesised speech for the proposed high rate quantisation schemes (panels: synthesised speech using unquantised pulsed CWs; speech with high rate pulsed CW quantisation; speech with mid rate pulsed CW quantisation; horizontal axis: time in samples)

Constraining the CW evolution

This section details constraints on the synthesis operation that allow the perceptual quality of the synthesised speech produced using the high rate PCW quantisation to be dramatically increased. These constraints are required despite the fact that the PCW parameters are calculated by minimising a perceptually weighted speech domain error, and the fact that the high rate PCW quantisation scheme achieves an SNR of 9.1 dB (see Section 6.4.4), which is equal to the SNR for an unquantised 3 tap long term predictor

reported in [Kond95]. This SNR indicates that the quantised PCWs achieve an accurate representation of the pulse structure of the speech waveform. Despite this SNR, initial Mean Opinion Score (MOS) testing indicated that the synthesised speech produced using the high rate PCW quantisation (with other parameters unquantised) scored approximately 0.2 MOS points below the synthesised speech produced using the low rate quantisation method of Section 6.4.2. Subsequent testing and polling of the individual listeners revealed that whilst the high rate synthesised speech sounded natural and undistorted, it also sounded slightly noisy and harsh. The low rate synthesised speech, on the other hand, had more distortions but sounded smooth and full. The cause of the noisy characteristic in the high rate synthesised speech was attributed to the change between adjacent pitch pulse shapes being unconstrained. It was deduced that even though the PCWs are calculated to minimise a perceptually weighted speech domain error (and the high rate quantiser has a high SNR), modelling the pitch pulses in a pitch synchronous fashion allows unconstrained changes in shape between adjacent pitch pulses, which produce an overall noisy effect in the synthesised speech. This result is in direct conflict with conventional multi-pulse CELP waveform modelling techniques [Atal82b, Sukk89], which use fixed size sub-frames. In these coders, increasing the number of pulses used per sub-frame, and hence increasing the SNR, increases the subjective quality of the synthesised speech. Kleijn [Klei93a] reported the problem of constraining the pitch pulse evolution in a parametric WI coder (which makes no attempt to minimise the perceptually weighted speech domain error), where the accuracy of the reconstructed speech was sacrificed in order to constrain the rate of change of the pitch pulses. This had the effect of improving subjective quality.

pulse marker) within a CW must be constrained between adjacent CWs. This constraint already holds for the low rate quantisation scheme (which produced the preferred synthesised speech), which uses only a single pulse, placed at the pitch pulse marker, per CW. Allowing the zinc pulses to occur at unconstrained locations in each CW causes the shape of adjacent CWs to vary in an unconstrained manner. In addition to constraining the individual pulse positions between adjacent PCWs, the rate of change between the synthesised PCWs can be further reduced by also constraining the change between the individual zinc pulse magnitudes. However, constraining the rate of change of the zinc pulse magnitudes also limits the ability of the coder to adapt to quickly changing transient sections of the speech, and for this reason the rate of change of the pulse heights is not constrained in this coder. Informal subjective listening tests were used to determine the effect of constraining the zinc pulse positions in adjacent CWs. It was found that fixing the number of pulses/CW used in synthesis to two (regardless of the CW length) could produce very good quality speech. The positions of these pulses were also fixed, so that the first occurred at the pitch pulse marker and the second was located at the next orthogonal position (2 samples) after the first pulse. The zinc pulses used in synthesis were also modified to be causal (as in the low rate WI coder detailed in Section 5.2). These modifications to the PCW synthesis operation had the effect of reducing the average SNR from the value of 9.1 dB shown in Table 6.12 to a value of only 1 dB. However, the average MOS scores across a range of 10 test files increased by over 0.6 points. This is a dramatic improvement, and particularly notable since fewer bits are required for transmission, due to the fact that fewer pulses are transmitted per CW. A comparison of speech produced using the unconstrained high rate quantisation scheme and that produced using the constrained high rate quantisation scheme is shown in Figure 6.12.

Figure 6.12: Comparison of synthesised speech for the constrained and unconstrained high rate pulsed CW quantisation (panels: speech with constrained high rate pulsed CW quantisation; speech with unconstrained high rate pulsed CW quantisation; horizontal axis: time in samples)

The comparison shown in Figure 6.12 indicates that the evolution of the pitch pulse shapes is much smoother in the speech produced using the constrained high rate quantisation. It can also be seen that, despite smoothing the pulse evolution in the synthesised speech, the constrained speech still accurately models speech transitions and maintains the overall envelope structure of the speech. To ensure convergence between the PCWs and NCWs in synthesis, the decomposition method of Section 5.6 was modified to ensure that the constrained positions were always used in the decomposition. This constraint ensures that at higher rates, when the NCW shape is quantised (see Section 6.5.3), the temporal positions of the NCWs and PCWs will converge in synthesis. Although the constrained high rate PCW quantisation is less accurate than the unconstrained high rate quantisation, the method still converges to high perceptual

quality synthesised speech. This occurs because the analysis loop remains unchanged (and very accurate) and captures the perceptually important parameters of quickly changing sections of the input speech in the PCW parameters. Having this very accurate parameterisation available allows the coder to produce high perceptual quality speech, even in quickly changing sections. This contrasts with purely parametric coding structures such as WI, which smear the quickly changing transitional sections in the analysis stage; as such, these sections cannot be reproduced in synthesis regardless of the bit rate available for transmission.

6.5 Scalable quantisation of the Noise CWs (NCWs)

This section (and its sub-sections) develops scalable mechanisms for quantising and synthesising the NCWs. The NCW parameters, which are generated by the scalable analysis structure of Figure 6.1, are vectors, with each vector representing a separate NCW. The lengths of the individual vectors are equal to the respective CW lengths. To permit scalable quantisation, the NCW vectors are first parameterised. To determine suitable quantisation procedures for these NCW parameters, the characteristics of the parameters are then analysed. This analysis is used to develop a low bit rate quantisation scheme and a high bit rate quantisation scheme in turn, and finally the synthesis technique required to produce good quality speech using the proposed quantisation schemes is detailed.

6.5.1 Characteristics of the NCWs

Parameterisation of the NCW vectors

As discussed in Section 6.3, the pitch track quantisation schemes are not lossless, and thus the lengths of the synthesised CWs may differ from the lengths of the original CWs. In fact, for the low rate pitch quantisation scheme, the number of CWs in the synthesised speech may not equal the number of CWs in the original speech. For the quantisation of the PCWs in Section 6.4 this change in length was not a major factor, because the unquantised PCWs are represented by parameters (zinc pulses) that permit relatively straightforward variation of the CW length whilst maintaining the characteristics of the original PCW. However, the unquantised NCWs are not parameterised; rather, each NCW vector is made up of samples representing the difference between the original CW and the PCW (see Section 5.6 for the decomposition method). These NCW vectors do not lend themselves to simple length manipulation; complex warping operations would be required. To permit simple length manipulation and a scalable bit rate representation of the NCWs, a parametric representation of the NCWs is required. This parametric representation should lend itself to both simple length modification of the CW and increased perceptual quality of the synthesised NCW as the bit rate for transmission is increased. To determine a suitable parametric representation for the NCWs, a thorough analysis of the NCW characteristics was conducted. This analysis used sentences from the TIMIT database and a summary is presented in the remainder of this section.
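As the analysis below concludes that the NCWs behave like gain scaled Gaussian noise, the kind of parametric representation sought here can be as simple as a per-CW gain; a minimal sketch under that assumption (the marker offset and the +/-3 sample guard follow figures quoted elsewhere in this chapter):

```python
import numpy as np

def ncw_gain(ncw, marker_offset=10, guard=3):
    """Per-CW gain for an NCW: RMS after zeroing the samples within
    +/-guard of the pitch pulse marker, removing residual pulse energy."""
    v = np.array(ncw, dtype=float)
    marker = len(v) - marker_offset
    lo, hi = max(0, marker - guard), min(len(v), marker + guard + 1)
    v[lo:hi] = 0.0
    return float(np.sqrt(np.mean(v ** 2)))

def ncw_regenerate(gain, length, rng=None):
    """Regenerate an NCW of any required length as gain scaled Gaussian
    noise, so the length can follow the reconstructed pitch track."""
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(length)
    return gain * noise / np.sqrt(np.mean(noise ** 2))
```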

Figure 6.13: Comparison of original speech and synthesised speech produced using only unquantised noise CWs (panels: speech; synthesised speech using noise CWs; horizontal axis: time in samples)

To determine the time domain structure of the NCWs, the unquantised NCWs and the original speech were compared across a number of sentences. Figure 6.13 shows a comparison of original speech and synthesised speech generated using only the NCWs. Examining Figure 6.13 indicates that the NCWs do indeed represent the noise-like structure of the speech signal. Whilst a very small amount of the pulsed information falls through to the NCWs, it was found that zeroing the residual domain NCWs in the vicinity of the pitch pulse (thus removing the residual pulse information) produced no perceptual change in the synthesised speech. To examine the noise-like structure of the NCWs, each NCW for a section of 10 input sentences was normalised by its respective gain value. Prior to normalisation, any pulsed component was removed from the NCW by zeroing those samples that fell

within ±3 of the pitch pulse marker. The mean distribution of these gain normalised NCWs is compared to the mean distribution of gain normalised CWs generated from Gaussian random numbers in Figure 6.14.

Figure 6.14: Comparison of the mean distributions of the NCWs and Gaussian CWs (panels: distribution of gain normalised noise CWs; distribution of gain normalised random Gaussian CWs)

Examining Figure 6.14 indicates that the distribution of the NCWs is very similar to the distribution of the Gaussian CWs. This characteristic, combined with the obvious noise-like structure of the NCWs shown in Figure 6.13, indicates that replacing the NCW data with gain scaled Gaussian noise should be acceptable. Initial perceptual testing of speech generated by replacing the NCWs with gain scaled Gaussian noise indicated that whilst the speech quality was slightly reduced (slightly more harsh) relative to the original speech, the speech remained clear and distortion free. From these results it was decided


More information

Synthesis Algorithms and Validation

Synthesis Algorithms and Validation Chapter 5 Synthesis Algorithms and Validation An essential step in the study of pathological voices is re-synthesis; clear and immediate evidence of the success and accuracy of modeling efforts is provided

More information

Signals, Sound, and Sensation

Signals, Sound, and Sensation Signals, Sound, and Sensation William M. Hartmann Department of Physics and Astronomy Michigan State University East Lansing, Michigan Л1Р Contents Preface xv Chapter 1: Pure Tones 1 Mathematics of the

More information

Advanced audio analysis. Martin Gasser

Advanced audio analysis. Martin Gasser Advanced audio analysis Martin Gasser Motivation Which methods are common in MIR research? How can we parameterize audio signals? Interesting dimensions of audio: Spectral/ time/melody structure, high

More information

Speech Coding in the Frequency Domain

Speech Coding in the Frequency Domain Speech Coding in the Frequency Domain Speech Processing Advanced Topics Tom Bäckström Aalto University October 215 Introduction The speech production model can be used to efficiently encode speech signals.

More information

Perception of pitch. Importance of pitch: 2. mother hemp horse. scold. Definitions. Why is pitch important? AUDL4007: 11 Feb A. Faulkner.

Perception of pitch. Importance of pitch: 2. mother hemp horse. scold. Definitions. Why is pitch important? AUDL4007: 11 Feb A. Faulkner. Perception of pitch AUDL4007: 11 Feb 2010. A. Faulkner. See Moore, BCJ Introduction to the Psychology of Hearing, Chapter 5. Or Plack CJ The Sense of Hearing Lawrence Erlbaum, 2005 Chapter 7 1 Definitions

More information

Wireless Communications

Wireless Communications Wireless Communications Lecture 5: Coding / Decoding and Modulation / Demodulation Module Representive: Prof. Dr.-Ing. Hans D. Schotten schotten@eit.uni-kl.de Lecturer: Dr.-Ing. Bin Han binhan@eit.uni-kl.de

More information

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm A.T. Rajamanickam, N.P.Subiramaniyam, A.Balamurugan*,

More information

Harmonic impact of photovoltaic inverter systems on low and medium voltage distribution systems

Harmonic impact of photovoltaic inverter systems on low and medium voltage distribution systems University of Wollongong Research Online University of Wollongong Thesis Collection 1954-2016 University of Wollongong Thesis Collections 2006 Harmonic impact of photovoltaic inverter systems on low and

More information

Quantisation mechanisms in multi-protoype waveform coding

Quantisation mechanisms in multi-protoype waveform coding University of Wollongong Research Online University of Wollongong Thesis Collection 1954-2016 University of Wollongong Thesis Collections 1996 Quantisation mechanisms in multi-protoype waveform coding

More information

IMPROVING QUALITY OF SPEECH SYNTHESIS IN INDIAN LANGUAGES. P. K. Lehana and P. C. Pandey

IMPROVING QUALITY OF SPEECH SYNTHESIS IN INDIAN LANGUAGES. P. K. Lehana and P. C. Pandey Workshop on Spoken Language Processing - 2003, TIFR, Mumbai, India, January 9-11, 2003 149 IMPROVING QUALITY OF SPEECH SYNTHESIS IN INDIAN LANGUAGES P. K. Lehana and P. C. Pandey Department of Electrical

More information

Robust Algorithms For Speech Reconstruction On Mobile Devices

Robust Algorithms For Speech Reconstruction On Mobile Devices Robust Algorithms For Speech Reconstruction On Mobile Devices XU SHAO A Thesis presented for the degree of Doctor of Philosophy Speech Group School of Computing Sciences University of East Anglia England

More information

HCS 7367 Speech Perception

HCS 7367 Speech Perception HCS 7367 Speech Perception Dr. Peter Assmann Fall 212 Power spectrum model of masking Assumptions: Only frequencies within the passband of the auditory filter contribute to masking. Detection is based

More information

Signal Processing for Speech Applications - Part 2-1. Signal Processing For Speech Applications - Part 2

Signal Processing for Speech Applications - Part 2-1. Signal Processing For Speech Applications - Part 2 Signal Processing for Speech Applications - Part 2-1 Signal Processing For Speech Applications - Part 2 May 14, 2013 Signal Processing for Speech Applications - Part 2-2 References Huang et al., Chapter

More information

Digital Signal Processing

Digital Signal Processing COMP ENG 4TL4: Digital Signal Processing Notes for Lecture #27 Tuesday, November 11, 23 6. SPECTRAL ANALYSIS AND ESTIMATION 6.1 Introduction to Spectral Analysis and Estimation The discrete-time Fourier

More information

INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET)

INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET) INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET) Proceedings of the 2 nd International Conference on Current Trends in Engineering and Management ICCTEM -214 ISSN

More information

Speech Processing. Undergraduate course code: LASC10061 Postgraduate course code: LASC11065

Speech Processing. Undergraduate course code: LASC10061 Postgraduate course code: LASC11065 Speech Processing Undergraduate course code: LASC10061 Postgraduate course code: LASC11065 All course materials and handouts are the same for both versions. Differences: credits (20 for UG, 10 for PG);

More information

Converting Speaking Voice into Singing Voice

Converting Speaking Voice into Singing Voice Converting Speaking Voice into Singing Voice 1 st place of the Synthesis of Singing Challenge 2007: Vocal Conversion from Speaking to Singing Voice using STRAIGHT by Takeshi Saitou et al. 1 STRAIGHT Speech

More information

Pitch Period of Speech Signals Preface, Determination and Transformation

Pitch Period of Speech Signals Preface, Determination and Transformation Pitch Period of Speech Signals Preface, Determination and Transformation Mohammad Hossein Saeidinezhad 1, Bahareh Karamsichani 2, Ehsan Movahedi 3 1 Islamic Azad university, Najafabad Branch, Saidinezhad@yahoo.com

More information

ECE 556 BASICS OF DIGITAL SPEECH PROCESSING. Assıst.Prof.Dr. Selma ÖZAYDIN Spring Term-2017 Lecture 2

ECE 556 BASICS OF DIGITAL SPEECH PROCESSING. Assıst.Prof.Dr. Selma ÖZAYDIN Spring Term-2017 Lecture 2 ECE 556 BASICS OF DIGITAL SPEECH PROCESSING Assıst.Prof.Dr. Selma ÖZAYDIN Spring Term-2017 Lecture 2 Analog Sound to Digital Sound Characteristics of Sound Amplitude Wavelength (w) Frequency ( ) Timbre

More information

University of Washington Department of Electrical Engineering Computer Speech Processing EE516 Winter 2005

University of Washington Department of Electrical Engineering Computer Speech Processing EE516 Winter 2005 University of Washington Department of Electrical Engineering Computer Speech Processing EE516 Winter 2005 Lecture 5 Slides Jan 26 th, 2005 Outline of Today s Lecture Announcements Filter-bank analysis

More information

Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications

Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications Brochure More information from http://www.researchandmarkets.com/reports/569388/ Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications Description: Multimedia Signal

More information

Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter

Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter Ching-Ta Lu, Kun-Fu Tseng 2, Chih-Tsung Chen 2 Department of Information Communication, Asia University, Taichung, Taiwan, ROC

More information

INTRODUCTION TO ACOUSTIC PHONETICS 2 Hilary Term, week 6 22 February 2006

INTRODUCTION TO ACOUSTIC PHONETICS 2 Hilary Term, week 6 22 February 2006 1. Resonators and Filters INTRODUCTION TO ACOUSTIC PHONETICS 2 Hilary Term, week 6 22 February 2006 Different vibrating objects are tuned to specific frequencies; these frequencies at which a particular

More information

Non-Uniform Speech/Audio Coding Exploiting Predictability of Temporal Evolution of Spectral Envelopes

Non-Uniform Speech/Audio Coding Exploiting Predictability of Temporal Evolution of Spectral Envelopes Non-Uniform Speech/Audio Coding Exploiting Predictability of Temporal Evolution of Spectral Envelopes Petr Motlicek 12, Hynek Hermansky 123, Sriram Ganapathy 13, and Harinath Garudadri 4 1 IDIAP Research

More information

Ninad Bhatt Yogeshwar Kosta

Ninad Bhatt Yogeshwar Kosta DOI 10.1007/s10772-012-9178-9 Implementation of variable bitrate data hiding techniques on standard and proposed GSM 06.10 full rate coder and its overall comparative evaluation of performance Ninad Bhatt

More information