Evaluation of MELP Quality and Principles

Marcus Ek, Lars Pääjärvi, Martin Sehlstedt

Luleå University of Technology, in cooperation with Ericsson Erisoft AB, T/RV

30th May 2000


Abstract

This report presents an investigation and evaluation of a Mixed Excitation Linear Prediction (MELP) speech codec operating at 2400 bps. An introduction to speech coding methodologies, with emphasis on LPC-based algorithms, is presented. Modifications have been made to the codec, the Texas Instruments MELP 2400, in order to test different parts of the algorithm and thereby find their impact on the quality of the synthesised speech. Subjective listening tests show that the quantization of the Fourier magnitudes, the gain, and the LPC coefficients reduces the quality of the speech, with the quantization of the Fourier magnitudes having the largest impact on the output speech. Adding more bandpass filters, to enhance the frequency resolution in the mixed excitation, does not improve the speech quality, while decreasing the number of filters makes the quality significantly worse.

Preface

This report is the result of a project that was conducted at the division T/RV at Ericsson Erisoft in Luleå during the spring of 2000. The project is a compulsory course at Luleå University of Technology for the Master of Science degree in Signal Processing, at the Department of Computer Science and Electrical Engineering. The aim was to evaluate the U.S. federal standard 2400 bps MELP speech coder and document key elements in the codec.

The first part (chapters 2 and 3) is an introduction to speech coding and is suitable reading for people without any previous knowledge of the subject. The second part (chapters 4 and 5) describes the evaluation and investigation process, and the third part (chapters 6-8) presents results, methods and conclusions. Parts of the report are written with consideration for people who may want to follow up the work, and therefore flags, commands etc. have been included.

We would like to thank our supervisors Nicklas Sandgren and Jonas Svedberg for providing help and for the time they have taken from their own work to assist us. We also thank Stefan Håkansson for letting us use facilities and technology at the division, and of course our supervisor Johan Carlson at the University.

Contents

1 Introduction
2 Fundamentals of Speech
  2.1 Speech Production
  2.2 Speech Perception
3 Speech Coding
  3.1 Speech Coders
    3.1.1 Waveform Coders
    3.1.2 Parametric Coders
    3.1.3 Hybrid Coders
  3.2 Algorithmic Methods
    3.2.1 Quantization
    3.2.2 Linear Prediction
    3.2.3 Line Spectrum Frequencies
    3.2.4 Pitch Prediction
  3.3 Performance
    3.3.1 Objective Evaluation
    3.3.2 Subjective Evaluation
4 The MELP Speech Codec
  4.1 LPC-model
    4.1.1 Linear prediction
    4.1.2 Encoding
    4.1.3 Decoding
  4.2 Bandpass Filters and Mixed Excitation
  4.3 Aperiodic Flag
  4.4 Adaptive Spectral Enhancement
  4.5 Pulse Dispersion Filter
  4.6 Fourier Magnitude
  4.7 The MELP Codec Flowchart
    4.7.1 The Encoder
    4.7.2 The Decoder
5 Investigation of the MELP Speech Codec
  5.1 Highpass Filter
  5.2 Adaptive Spectral Enhancement
  5.3 Pulse Dispersion Filter
  5.4 Bandpass Filters
  5.5 Synthesis Pulse Train
  5.6 Generation of MATLAB Data
  5.7 Quantization
    5.7.1 Bandpass Voicing Strength
    5.7.2 Gain
    5.7.3 LPC Coefficients
    5.7.4 Pitch
    5.7.5 Fourier Magnitudes
    5.7.6 Jitter
6 Test Methods
  6.1 Objective Testing
    6.1.1 Segmental SNR
  6.2 Subjective Testing
7 Tests and Results
  7.1 Quantization
  7.2 Bandpass Filters
  7.3 Synthesis Pulse Train
  7.4 Jitter
  7.5 Comparison of Performance
8 Conclusions and Comments
9 Further Investigation
A Appendix A
  A.1 Useful UNIX commands
    A.1.1 rlogin
    A.1.2 projlogin
    A.1.3 setenv
    A.1.4 echo <string>
  A.2 Useful commands in trstud
    A.2.1 DATPLAY
    A.2.2 DATPLAYBOY
    A.2.3 SCREAM
    A.2.4 make
    A.2.5 TRANSF
    A.2.6 CORR
    A.2.7 SNR
    A.2.8 SNRSEG
    A.2.9 PSQM
B Flowcharts
C Lowpass and Highpass Filters
D Original and Matlab Bandpass Filters
E Bandpass Filter Sets (Analysis-Synthesis)
F Speech Signal Samples

1 Introduction

In wireless communication systems it is desirable to compress the speech signal before transmission, that is, to reduce the amount of data that has to be sent over the channel. The phrase 'speech coding' most often refers to the technique that uses coding to compress a speech signal so that less bandwidth is needed for transmission, while the quality of the speech signal is preserved. This is an essential part of, for example, cellular systems, which aim to accommodate as many users as possible within a limited bandwidth. The coding and decoding of the speech signal is performed by a so-called "speech codec". It is important that the speech signal can be reconstructed with as high fidelity as possible at the receiver.

One goal for this project was to document key elements in the band excitation mixing/decisions and propose improvements to the band mixing. A second goal was to evaluate the mixed excitation encoding principle by comparing the unquantized MELP to other standards.

The first task was to get acquainted with the field of study, that is, to learn about speech coding in general and the MELP codec in particular. This was done by reading several technical articles and reports. The article written by McCree and Barnwell III [1] was presented to the other project groups. The next stage of the project was to search the WWW for available source code. The American Department of Defense (DoD) fortunately had a version of the MELP codec on its homepage; otherwise the codec would have had to be implemented. The task at hand was then to examine the algorithm from the American DoD to identify key elements for later modification.

2 Fundamentals of Speech

2.1 Speech Production

Speech sounds are formed when air from the lungs passes through the vocal tract, see Fig. 1. The vocal tract can be modelled as a spectral shaping filter whose frequency response depends on its size and shape, as described by Abrahamsson in [8]. The shape can be altered using the tongue and the mouth.

Figure 1: The vocal tract [8].

Speech signals can be partitioned into two main groups, voiced and unvoiced speech. Voiced speech is produced when the vocal cords are closed. Air from the lungs builds up a pressure behind the cords, at the beginning of the vocal tract, see Fig. 1. When the pressure becomes high enough the cords are forced open, and as a consequence the vocal cords begin to vibrate harmonically. The fundamental period of the vibrations is called the pitch period and it depends on the tension and length of the vocal cords. The pitch is therefore different for male, female and child speakers. In general, women and children have shorter cords than men and consequently have a higher pitch frequency (higher voice). The pitch frequency is typically between 50 and 400 Hz. The effect of the opening and closing of the vocal cords is that the air passing into the rest of the vocal tract has the characteristics of a quasi-periodic pulse train, see Fig. 2a.

During unvoiced speech the vocal cords are open and the air passes undisturbed into the rest of the vocal tract. Because of this there is no periodicity in unvoiced speech segments and they are very much noise-like, see Fig. 3a. Unvoiced speech can therefore be approximated by white random noise.

After the vocal cords the speech signal consists of either a series of pulses or random noise, depending on whether the speech is voiced or unvoiced. The signal is passed into the rest of the vocal tract, which imposes its own frequency response on the signal. This can be seen as the envelope of the power spectrum, see Fig. 2b and 3b. The peaks in the envelope of the spectrum are called formants and correspond to the resonance frequencies of the vocal tract.

Figure 2: Voiced speech, a) time domain, b) frequency domain.

Figure 3: Unvoiced speech, a) time domain, b) frequency domain.

2.2 Speech Perception

There is a limit to the sensitivity of the ear. An audible signal can become inaudible if a stronger signal is present at a frequency near the original signal. This phenomenon is called masking and is often used in speech coding to avoid spending coding effort on signal components that the ear cannot hear anyway. Masking is often incorporated as a part of the quantization process.

There will always be a difference between the input and output signal of the codec, since information about the speech signal is lost in the coding process. This difference is called the coding noise and is to a great extent due to quantization.

A larger coding noise power can be tolerated near the formant frequencies, where the masking is more efficient, and less noise power can be tolerated in spectral regions where there is less energy in the speech signal, e.g. between the formants. One therefore often tries to correlate the spectrum of the coding noise with the spectrum of the original speech signal, that is, to shape the coding noise spectrum so that it is increased in the formant regions and decreased otherwise. This operation is called noise shaping.

There are also other features of the codec that make use of the limitations of the human ear. The main idea is that there is no reason to code and transfer components of the signal that the human ear will filter out anyway.

3 Speech Coding

The purpose of speech coding is to reduce the bit rate of digital speech signals and to reconstruct the speech signal in the decoder with as high quality as possible. One way of reducing the bit rate is to remove the redundancy in the speech signal by employing the knowledge about speech production and perception mentioned earlier. This leads to models which only parameterise perceptually relevant information.

A speech coder always consists of a COder and a DECoder (a so-called codec). The encoder takes the original speech signal and produces a bit stream. The bit stream passes over some sort of channel on its way to the decoder. The decoder constructs an approximation of the original signal, often referred to as the reconstructed signal or the synthesised speech. Most speech coders use a bandwidth of 3.2 kHz (200 Hz to 3.4 kHz) and a sampling frequency of 8 kHz.

3.1 Speech Coders

There are three main types of speech coders: waveform coders, parametric coders and hybrid coders. There is no fixed boundary between the first two types. Some algorithms use properties from both waveform and parametric coders; these are called hybrid coders and can be considered a third type. The three types of coders differ in bit rate and in the quality of the reconstructed speech. Waveform coders are mainly used for high bit rates (more than 16 kbit/s), parametric coders are used for low bit rates (less than 4 kbps), and hybrid coders (4-16 kbps) are used at intermediate bit rates, see Fig. 4.

Figure 4: Quality vs. Bit Rate.

3.1.1 Waveform Coders

Waveform coders are a class of coders that attempt to reproduce the input signal's waveform, that is, to make the reconstructed signal resemble the original

signal. In these coders the objective is to minimize a criterion which measures the difference between the input and the output signal; a criterion often used is the mean square difference. Waveform coders provide very high quality speech at high bit rates but cannot be used to code speech at very low bit rates.

3.1.2 Parametric Coders

In parametric coding, the signal is represented by a set of model parameters. These parameters are quantized and transferred without consideration of the original signal. Periodic update of the model parameters requires fewer bits than a direct representation of the speech signal, and consequently parametric coders can operate at low bit rates. However, the resulting speech tends to sound synthetic. When coding speech signals one tries to base the model parameters on the physiological structure of the human speech-production system. This means building structures that try to imitate the vocal tract and the vocal cords, [8].

An example of a parametric coder is the vocoder (VOice CODER). The speech production model used by the vocoder is shown in Fig. 5. If the speech is voiced the model excites a linear system with a series of periodic pulses. If the sound is unvoiced the model produces noise. Parametric coders attempt to produce a signal that sounds like the original speech, whether or not the time waveform resembles the original. The speech is analysed at the transmitter to determine the model parameters. This information is then quantized and transmitted to the receiver, where the speech is reconstructed.

Figure 5: Vocoder Speech Production Model [8].

3.1.3 Hybrid Coders

The idea of a hybrid coder is to combine these two techniques to achieve a high quality speech coder operating at intermediate bit rates. The most successful and commonly used hybrid coders are time-domain Analysis-by-Synthesis coders. As the name implies, the encoder analyses the input speech by synthesising many different approximations to it. These coders split the input speech signal into frames (typically 20 ms long) and for each frame, parameters for

a synthesis filter are derived. The parameters are chosen by finding the excitation signal to this filter that minimizes a weighted error between the input speech and the reconstructed speech. The synthesis filter parameters and the excitation are transmitted to the decoder, which passes the excitation signal through the filter to give the reconstructed speech. The hybrid coding technique is used in the Codebook Excited Linear Prediction (CELP) coder. There are, for example, CELP coders operating at 4.75, 6.7 and 12.2 kbps.

3.2 Algorithmic Methods

3.2.1 Quantization

Quantization is the conversion of an analogue signal to a digital representation of the signal. Information is lost during quantization, since the actual amplitude values cannot be retained. The error introduced by the quantizer is called the quantization noise. Amplitude quantization is important in communication systems since it determines to a great extent the overall distortion and the bit rate needed. Several quantization techniques are commonly used, such as uniform quantization, logarithmic quantization, non-uniform quantization and vector quantization. In a vector quantizer the data is quantized N samples at a time: each block of samples is treated as an N-dimensional vector and is quantized to predetermined points in the N-dimensional space. The predetermined points are usually stored in a table called a code book. Vector quantization is more sensitive to transmission errors and usually involves a greater computational complexity than scalar quantization, but generally gives better performance.

3.2.2 Linear Prediction

Linear prediction analysis of speech is historically one of the most important speech analysis techniques. In digital speech signals adjacent samples are often highly correlated. The purpose of linear prediction is to remove this redundant information from the speech signal and thereby reduce the number of bits needed to represent the signal. This is performed with a Linear Prediction (LP) filter that predicts the current speech sample from N earlier samples, where N is typically around ten. The linear prediction filter can be viewed as a model for the tongue and vocal tract in the human speech organs; it is then usually called the synthesis filter. The lungs and vocal cords are simulated with a periodic pulse train for voiced speech, while noise is used for unvoiced speech. The speech is fed through the inverse prediction filter and the result is called the residual. This is the optimal input to the synthesis filter to get the original speech signal as output. In Appendix F, Fig. 24 and 25 show the speech signal, the residual, the synthesised residual and the synthesised speech for voiced and unvoiced speech.
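As an illustration of the analysis described in Section 3.2.2, the following MATLAB sketch computes a 10th-order LP filter for one frame of speech and derives the residual by inverse filtering. The variable frame, the Hamming window and the toolbox functions lpc and filter are assumptions made for this example; the MELP code uses its own routines.

    % LP analysis of one frame (assumed column vector, 8 kHz) and its residual.
    p = 10;                                        % prediction order
    a = lpc(frame .* hamming(length(frame)), p);   % a = [1, -a_1, ..., -a_p]
    residual  = filter(a, 1, frame);               % inverse (analysis) filter A(z)
    recovered = filter(1, a, residual);            % synthesis filter 1/A(z)

Feeding the residual back through the synthesis filter 1/A(z) reproduces the frame, which is why the residual is called the optimal input above.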

3.2.3 Line Spectrum Frequencies

Quantization of the LP coefficients is problematic since small changes in the coefficient values may result in large changes in the spectrum and in unstable LP filters. Quantization is therefore usually performed on transformations of the LP coefficients. The most commonly used representation is the Line Spectrum Frequencies (LSF). To obtain the LSFs, two polynomials are defined:

    F_1(z) = A(z) + z^{-(N+1)} A(z^{-1})    (1)

    F_2(z) = A(z) - z^{-(N+1)} A(z^{-1})    (2)

where N is the number of LPC coefficients. When these two polynomials are added together the result is 2A(z), so A(z) can be recovered from them. The roots of the polynomials are situated on the unit circle and the LSF coefficients are obtained as the angle of each root. The LSF representation is a frequency-domain representation, and it has a number of properties, such as bounded range and sequential ordering, which make it desirable for quantization of LP coefficients.

3.2.4 Pitch Prediction

The speech signal is periodic during voiced speech segments due to the vibrations of the vocal cords. The short-term linear prediction is usually based on correlations over intervals less than 2 ms. The pitch period, however, has a typical range from 2 to 20 ms. Short-term linear prediction can therefore not account for the pitch. To be able to model the long-term periodicity of the speech signal a separate pitch prediction has to be made. It can be applied either to the input speech signal, before the linear prediction, or to the residual signal after the LP filter.
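A small MATLAB sketch of the LSF construction in Section 3.2.3, under the assumption that a is the prediction-error polynomial [1, -a_1, ..., -a_N] returned by an LP analysis; the tolerance used to discard the trivial roots at z = 1 and z = -1 is an arbitrary choice for this example.

    % LSFs as the angles of the roots of F1(z) and F2(z), kept in (0, pi).
    F1 = [a 0] + [0 fliplr(a)];     % A(z) + z^-(N+1) A(z^-1)
    F2 = [a 0] - [0 fliplr(a)];     % A(z) - z^-(N+1) A(z^-1)
    ang = angle([roots(F1); roots(F2)]);
    lsf = sort(ang(ang > 1e-6 & ang < pi - 1e-6));   % ascending LSFs in radians
    % With the Signal Processing Toolbox, poly2lsf(a) returns the same frequencies.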

3.3 Performance

To be able to evaluate and compare different speech coders, some sort of measurement must be adopted. First we need to know what properties speech coders are evaluated against, and then we need a method that can provide a value for how well the algorithm codes the speech. Speech coders are often developed with a particular application in mind, and the properties can therefore be optimised for that application. Speech coding algorithms are often evaluated based on the properties listed below.

- Bit rate. The bit rate specifies the number of bits needed to represent one second of speech. The primary goal of speech coding is to reduce the bit rate. Coders intended for use in the general telephone networks today have bit rates of 16 kbit/s or higher. The bit rate can be either fixed or variable. In variable bit rate speech coders a different number of bits is used depending on the characteristics of the speech signal; for example, a signal segment which contains silence does not require as many bits as a signal segment containing speech. Higher bit rates are suitable for environments where high speech quality is requested. Low bit rate vocoders are usually used by the military and in satellite links, where intelligibility and low power are more important than quality. The relationship high bit rate → high quality is not necessarily true, but higher bit rate vocoders usually produce higher speech quality.

- Quality of reconstructed speech. A good objective criterion for speech quality does not exist for low bit rate codecs, so the decision has to be made subjectively. When testing whether the coder meets the requirements on speech quality, extensive subjective tests with human listeners have to be made. The quality of reconstructed speech can be measured with simple methods like SNR and SEGSNR, or with more complex methods that try to simulate the human ear. The most important measure is still listening tests performed on humans.

- Complexity of the algorithm. Like all computer-based algorithms, a high complexity normally introduces implementation errors and delays. The hardware and software tend to be expensive when time and memory demands are high, and the algorithms tend to be unstable.

- Robustness to channel errors and acoustic interference. If the coding algorithm has some sort of error protection, speech can be transferred over "bad" channels. The error protection introduces extra bits for every speech frame transferred. It is therefore important not to exaggerate the protection.

- Time delay introduced. The delay is of importance mainly for two-way communications; it should not affect the dynamics of the conversation. Codec delay can be divided into four parts: algorithmic delay, computational delay, multiplexing delay, and transmission delay. A rough estimate of the overall one-way delay introduced by the speech coder can be made based only on the frame size. The algorithmic delay is about one frame, and the computational delay is also about one frame. One can often assume that the multiplexing and transmission delay add up to one frame length if the channel is not shared. Consequently, without look-ahead, most speech coders give at least three frames of one-way delay. A frame is typically between 10 and 20 ms, which gives a total delay between 30 and 60 ms.

One cannot make any general ranking of the above properties. The area of use is very important and must be mapped onto the properties in order to make a fair evaluation.

3.3.1 Objective Evaluation

Objective evaluation methods are often sensitive to both gain and delay variations. The methods are well defined and based on statistics and mathematics, and it is easy to compare two different solutions using a calculated quality measure. Common methods used in the evaluation of speech coders are SNR, SEGSNR, the articulation index, the log spectral distance and the Euclidean distance.

The signal-to-noise ratio (SNR) is one of the most common measures for objective evaluation of the performance of compression algorithms. The SNR is calculated according to the equation:

    SNR = 10 \log_{10} \left[ \frac{\sum_{n=0}^{M-1} s^2(n)}{\sum_{n=0}^{M-1} \left( s(n) - \tilde{s}(n) \right)^2} \right]    (3)

The SNR tends to hide temporal reconstruction noise for low-level signals due to its long-term measuring. To avoid this a segmental SNR can be used. The segmental SNR (SEGSNR) is constructed to expose weak-signal performance. This evaluation method uses the common SNR function evaluated for each N-point segment of the speech. The SEGSNR is calculated according to the equation:

    SEGSNR = \frac{1}{L} \sum_{i=0}^{L-1} 10 \log_{10} \left[ \frac{\sum_{n=0}^{N-1} s^2(iN+n)}{\sum_{n=0}^{N-1} \left( s(iN+n) - \tilde{s}(iN+n) \right)^2} \right]    (4)

This ensures that the SEGSNR penalizes coders whose performance varies over time. The disadvantage with this kind of objective quality measurement is the lack of knowledge about the quality of the reproduced speech from a human perspective. The human ear does not have to agree with the objective analysis, and seldom does. For this purpose a second evaluation method is defined, the subjective evaluation.
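A minimal MATLAB sketch of equation (4), assuming s is the original and s_hat the reconstructed signal (equal-length column vectors); the 160-sample segment length is only an example value.

    % Segmental SNR over consecutive N-sample segments.
    N = 160;                                  % e.g. 20 ms at 8 kHz
    L = floor(length(s) / N);
    segsnr = 0;
    for i = 0:L-1
        seg = s(i*N+1 : (i+1)*N);
        err = seg - s_hat(i*N+1 : (i+1)*N);
        segsnr = segsnr + 10*log10(sum(seg.^2) / sum(err.^2));
    end
    segsnr = segsnr / L;                      % average segment SNR in dB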

3.3.2 Subjective Evaluation

Three subjective measures often used are the diagnostic rhyme test (DRT), the diagnostic acceptability measure (DAM), and the mean opinion score (MOS), all based on human rating. The DRT measures intelligibility, the DAM provides a characterisation of the reconstructed speech in terms of a broad range of distortions, and the MOS attempts to combine all aspects of quality in a single number. The MOS is the most commonly used measure for subjective quality of reconstructed speech. For the purpose of making a quality test which accounts for the perceptual properties of the ear, subjective evaluation methods are used. All subjective tests are based on ratings from a number of test persons and are therefore, in general, more true than any objective test result. The major drawback with subjective evaluation methods is that they take a lot of time to set up and to carry out.

The simplest form of listening test is the AB test, which evaluates the ranking of the tested segments. For a set of N segments, N(N-1) comparisons between two segments have to be made. Two segments (A and B) are played to a listener who selects which segment is the better. The main point of the AB test is that the listener does not know the order of the A and B segments, since all the tests are made in random order. As the test includes all combinations of two segments, the listener will vote on both the A-B and the B-A order of each combination. The drawback is that just a few segments cause a large number of combinations that have to be listened to, e.g. five segments require 20 (5 × 4) comparisons.

4 The MELP Speech Codec

The MELP speech codec operates at 2.4 kbit/s and is therefore classified as a very low bit rate speech codec. It is an extension of the ordinary Linear Prediction Coder (LPC) model. Both of these coders are fully parametric models of the human speech organs (lungs, vocal cords and vocal tract), which is what makes the low bit rate possible. The extensions to the LPC are used to compensate for some of the problems in the ordinary LPC. The two largest problems are the buzzy speech quality and the presence of short isolated tones in the synthesised speech. McCree and Barnwell III [1] give a very detailed description of how the MELP coder works. The following subsections are a short summary of the information in that report. It starts by describing the LPC model, and then the extensions used in the MELP codec are described one by one.

4.1 LPC-model

4.1.1 Linear prediction

The basis is the source-filter model where the filter is constrained to be an all-pole linear filter. This amounts to performing a linear prediction of the next sample as a weighted sum of past samples:

    \tilde{s}_n = \sum_{i=1}^{p} a_i s_{n-i}, \qquad H(z) = \frac{1}{1 - \sum_{i=1}^{p} a_i z^{-i}}    (5)

Given N samples of speech, we would like to compute estimates of a_i that result in the best fit. One reasonable way to define "best fit" is in terms of the mean squared error. The resulting parameters can also be regarded as the "most probable" parameters if it is assumed that the distribution of errors is Gaussian and that a priori there are no restrictions on the values of a_i. The error e(n) is defined as:

    e(n) = s(n) - \tilde{s}_n    (6)

The summed squared error, E, over a finite window of length N is:

    E = \sum_{n=0}^{N-1} e(n)^2    (7)

The minimum of E occurs when the derivative with respect to each of the parameters a_i is zero. As can be seen from equation (7), E is quadratic in a_i and therefore there is a single solution. Very large positive or negative values of e(n) lead to poor prediction, and hence the solution to equation (7) is a minimum.
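Setting these derivatives to zero yields the normal equations, a step not written out above. With the autocorrelation method the result is a Toeplitz system, which is what the Levinson-Durbin recursion used in the MELP encoder (Section 4.7.1) solves:

    \frac{\partial E}{\partial a_k} = -2 \sum_{n} \Big( s(n) - \sum_{i=1}^{p} a_i s(n-i) \Big) s(n-k) = 0
    \quad \Longrightarrow \quad
    \sum_{i=1}^{p} a_i R(|i-k|) = R(k), \qquad k = 1, \dots, p,

    \text{where } R(k) = \sum_{n} s(n)\, s(n-k) \text{ is the autocorrelation of the windowed frame.}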

4.1.2 Encoding

When coding speech with LPC, the signal is first divided into frames that contain between 20 and 30 ms of sampled speech data. The compromise in this case is that a smaller frame size would be preferred, because the chance is higher that the speech signal can be assumed to be stationary; on the other hand, a smaller frame size would result in a higher bit rate.

The frames are then coded one by one by first searching for the strongest correlation for frequencies between 50 and 200 Hz. This frequency is called the pitch. It is used to determine whether the frame is voiced or unvoiced: if the frame is voiced, the normalised correlation between x(k) and x(k - pitch) will be close to 1.0; if this is not the case the frame is unvoiced. The samples in the frame can then be used to estimate the filter coefficients for the synthesis filter. Finally the energy level for the frame is calculated, to be able to reconstruct the speech with the correct signal level.

The information on voicing (voiced/unvoiced), synthesis filter coefficients, pitch and energy is then transmitted to the decoder. Before transmission these values have to be quantized, and this imposes limits on the achievable speech quality. An example of the bit allocation for an LPC encoder can be seen in Table 1, from [6].

Table 1: Bit allocation for LPC-10.

    Parameter       Voiced   Unvoiced
    Pitch/Voicing   7        7
    Gain            5        5
    Sync            1        1
    K(1)            5        5
    K(2)            5        5
    K(3)            5        5
    K(4)            5        5
    K(5)            4        -
    K(6)            4        -
    K(7)            4        -
    K(8)            4        -
    K(9)            3        -
    K(10)           2        -
    Total           54       33

4.1.3 Decoding

In the decoder the synthesised speech is built one pitch period at a time. This allows the parameters for the synthesis filter, the pitch period and the gain to be linearly interpolated through the frame, which is done to avoid the spikes that would otherwise arise when the filter parameters are changed. If the frame is unvoiced the decoder uses white noise as input to the synthesis filter; otherwise it uses a periodic pulse train with the same frequency as the pitch. The output from the filter is then amplified to the energy level specified by the encoder. The synthesised speech is now finished.
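The voiced/unvoiced decision in Section 4.1.2 can be sketched in MATLAB as below. The frame vector, the lag range of 40-160 samples (50-200 Hz at 8 kHz) and the 0.6 threshold are assumptions made for this illustration rather than values taken from the standard.

    % Normalized autocorrelation pitch search and a simple voicing decision.
    best_r = -1; pitch = 0;
    for lag = 40:160
        x1 = frame(lag+1:end);
        x2 = frame(1:end-lag);
        r  = (x1' * x2) / sqrt((x1' * x1) * (x2' * x2) + eps);
        if r > best_r
            best_r = r;
            pitch  = lag;
        end
    end
    voiced = best_r > 0.6;      % close to 1.0 for strongly periodic (voiced) frames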

4.2 Bandpass Filters and Mixed Excitation

The synthesised speech from LPC has a strong buzzy quality. The reason for this is that the residual is replaced with a pulse train that contains higher frequencies than the original residual; this causes the synthesised speech to contain overtones that were not present in the original speech. To avoid this, each frame of speech is first bandpass filtered into a number of frequency bands (4-10) and a voicing strength is calculated for each band.

The voicing strength can be calculated in two ways. In the first, the correlation of the bandpass filtered signal is calculated around the pitch lag. At high frequencies this method is sensitive to variations in the pitch period. The second method calculates the correlation around the pitch lag for the envelope of the bandpass filtered signal. This makes it insensitive to pitch variations, since the peaks in the envelope are much wider. The voicing strength is selected as the larger of the two calculated values. The value can then be quantized to one bit using a threshold; this bit signals voiced or unvoiced speech. Compared to the original LPC, one extra bit is needed for each bandpass filter.

In the decoder, the same number of bandpass filters are fed with a periodic pulse train or white noise, depending on whether the bands were voiced or unvoiced. The sum of the signals after the bandpass filters is then fed to the synthesis filter. In the examined MELP coder the analysis bandpass filters were built using 6th-order pole/zero Butterworth filters. In the decoder the bandpass filters were implemented as 32nd-order FIR filters. This makes it possible to use only two filters for the calculation of the synthesised residual: one for the pulse train and one for the noise. Each of these two filters is constructed as a weighted sum of the bandpass filters for the frequency bands, where the weighting factor is the voicing strength.

The speech signal is also filtered with the inverse synthesis filter, thereby obtaining the residual, which would be the optimal pulse train for the decoder. The peakiness of the residual is then calculated and, depending on the value, the voicing strengths of some of the lowest frequency bands can be forced to 1.0.
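A minimal MATLAB sketch of the per-band voicing strength just described: the larger of the normalized correlation at the pitch lag for the band-filtered signal and for its envelope. The band and pitch variables, the 8-tap smoother used for the envelope and the 0.5 threshold are assumptions for this example.

    % Voicing strength of one frequency band at a given (integer) pitch lag.
    normcorr = @(x, lag) (x(lag+1:end)' * x(1:end-lag)) / ...
        sqrt((x(lag+1:end)' * x(lag+1:end)) * (x(1:end-lag)' * x(1:end-lag)) + eps);
    env = filter(ones(8,1)/8, 1, abs(band));   % crude envelope: rectify and smooth
    vs  = max(normcorr(band, pitch), normcorr(env, pitch));
    voiced_bit = vs > 0.5;                     % one voiced/unvoiced bit per band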

4.3 Aperiodic Flag

The synthesised speech from an LPC coder can sometimes contain short isolated tones; this is especially true for female speakers. There is no known reason for this behaviour, but the current theory is that the synthesis filter is almost unstable. To cure this, a third voicing state, jittery voiced, is introduced in the encoder, and a flag called aperiodic pulses is used to transfer this information. If this flag is set, the decoder destroys the periodicity in the periodic pulse train by varying each pitch period with a uniformly distributed position jitter. This also relaxes the demands on the voiced versus unvoiced decision, since it is possible to call a frame voiced and aperiodic if it is somewhere between voiced and unvoiced. In the case of the MELP from the US DoD, the position had a rectangular distribution of +/- 25% of the current pitch period.

4.4 Adaptive Spectral Enhancement

Two problems exist with the construction of the synthesis filter. First, the formant resonances are sometimes hard to create without making the filter unstable. Second, the formant bandwidths might vary within the frame, making it hard to create a matching filter. One solution would be to sharpen the filter, that is, to move the poles closer to the unit circle, but this can make the filter unstable. A better solution is to insert an adaptive pole/zero filter and a simple FIR filter, where the poles are a scaled version of the poles in the synthesis filter. The zeros and the FIR filter are used to compensate for the lowpass effect of the pole filter. The combined effect of these filters in the time domain is the same as if the synthesis filter had been sharpened for the first half of the frame and weakened during the second half. This limits the effect of the almost unstable filter in the first half of the frame.

4.5 Pulse Dispersion Filter

The problem here is that the frequencies decay too fast between the formants, especially in the high frequency region. The solution is a pulse dispersion filter that is created from the frequency response of a fixed triangle pulse but with its lowpass characteristics removed. To create the FIR filter coefficients, the discrete Fourier transform (DFT) is used to calculate the frequency response of the triangle pulse. The magnitude is then set to unity and the inverse DFT is used to get the final coefficients.

4.6 Fourier Magnitude

The residual is the optimal pulse train, so what one would really like to do is to transmit the residual to the decoder. This would take too many bits, so a compromise is to send the ten lowest Fourier magnitudes calculated on the residual. This makes it possible to use a pulse train with nonuniform magnitudes. To calculate these Fourier magnitudes, one pitch period is zero-padded to 512 samples and the FFT is calculated. The Fourier magnitudes are then quantized and transmitted to the decoder. In the decoder, when the pulse train is created, each pitch period is built from the inverse discrete Fourier transform calculated from the interpolated Fourier magnitudes. Since only the magnitudes are transmitted there is no phase information.
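A minimal MATLAB sketch of the pulse dispersion filter construction in Section 4.5: the DFT of a triangular pulse has its magnitude set to unity while the phase is kept, and the inverse DFT gives the FIR coefficients. The 65-tap length matches the decoder description later in the report, but the exact triangle shape used here is an assumption.

    % Pulse dispersion filter from a spectrally flattened triangular pulse.
    L   = 65;
    tri = [linspace(0, 1, 22), linspace(1, 0, L - 22)]';   % asymmetric triangle pulse
    T   = fft(tri);
    h   = real(ifft(exp(1j * angle(T))));                  % unity magnitude, original phase
    % h spreads each excitation pulse in time without changing its magnitude spectrum.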

4.7 The MELP Codec Flowchart

Based on the report "MELP: The New Federal Standard at 2400 bps" [3], a flowchart (Fig. 7 and Fig. 8 in Appendix B) was drawn for a MELP vocoder. The main purpose of this flowchart is to give a graphical representation of the vocoder structure. Both the encoder and the decoder (often named analysis and synthesis) are covered in the text below.

4.7.1 The Encoder

In the encoder the natural speech is analysed to obtain a parametric model of it. To achieve a low bit rate, much of the original information must be eliminated. It is therefore very important that redundancy is removed and that the model parameters contain as much information per bit as possible. The analysis stage can be viewed as a number of phases; some must be performed in order, others need not.

1. Eliminate low frequency noise. The first step in the process is to filter out low frequency noise with a 4th-order Chebyshev highpass filter with a 60 Hz cutoff frequency and a stopband rejection of 30 dB, see Fig. 9 in Appendix C.

2. Pitch estimate. The output of the Chebyshev filter (referred to as the input speech) is filtered using a 6th-order Butterworth filter with cutoff frequency 1 kHz, see Fig. 10 in Appendix C. The result of this operation is used to perform an initial pitch search for a pitch estimate. The pitch estimate is based on the pitch lag between 40 and 160 samples that maximizes the normalized autocorrelation function.

3. Bandpass voicing analysis. The input speech is filtered by five 6th-order Butterworth bandpass filters with passbands 0-500, 500-1000, 1000-2000, 2000-3000 and 3000-4000 Hz in order to split the signal into five frequency bands. The output from the lowest band (0-500 Hz) is used to make a fractional pitch analysis. A bandpass voicing strength analysis is made on the lowest band, based on the normalised correlation corresponding to the fractional pitch. If this bandpass voicing strength is less than 0.5, an aperiodic flag is set to one, otherwise to zero. A bandpass voicing strength analysis is then made for the higher frequency bands. The analysis is based on a normalised correlation corresponding to the previously calculated fractional pitch, computed for the bandpass signal and for the time envelope of the latter. In order to compensate for bias error, the value obtained from the time envelope is decremented by 0.1 before any further calculations are done.

4. Linear prediction analysis. A 10th-order linear prediction analysis (Levinson-Durbin recursion) is made on the input speech using a Hamming window (200 points, 25 ms) centered over the last sample of the current frame. The resulting linear prediction

coefficients a_i are multiplied by the bandwidth expansion coefficient 0.994^i (15.3 Hz).

5. LPC residual signal. By filtering the input speech with a linear prediction filter with the coefficients a_i, a residual signal is obtained.

6. Peakiness value. A peakiness value is calculated over 160 samples of the residual. If the peakiness exceeds 1.34, the lowest band voicing strength is forced to the value 1.0. If the value exceeds 1.6, the bandpass voicing strengths of the three lowest bands (recall: 0-500, 500-1000 and 1000-2000 Hz) are forced to the value 1.0. (A small sketch of steps 4-6 is given after this list.)

7. Final pitch estimate. A final pitch estimate is calculated using the residual signal. An integer pitch search is performed over lags 10 samples wider than the fractional pitch, and a fractional pitch refinement is then made around the best integer pitch lag. When the resulting fractional pitch correlation exceeds, or is equal to, 0.6, a pitch doubling procedure is performed.

8. Gain estimation. The next step is to estimate the gain. The input gain is measured twice per frame using a pitch-adaptive window length. The estimated gain is the RMS value of the input signal over the window.

9. Quantization of parameters. The LPC coefficients, pitch, gain and bandpass voicing are quantized. The Fourier magnitudes of the residual signal are also calculated and then quantized; a spectral peak-picking algorithm is used to find the harmonics in each frame. If a frame is unvoiced the bits are protected using Hamming codes, otherwise the bits are simply packed into the bitstream. The bit allocation can be seen in Table 2.

Table 2: Bit allocation for MELP.

    Parameter           Voiced   Unvoiced
    LSF                 25       25
    Fourier magnitudes  8        -
    Gain                8        8
    Pitch/Voicing       7        7
    Bandpass voicing    4        -
    Aperiodic flag      1        -
    Error correction    -        13
    Sync bit            1        1
    Total               54       54
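The sketch referred to in step 6, showing steps 4-6 in MATLAB. Here a is assumed to be the prediction-error polynomial [1, -a_1, ..., -a_10] as a row vector, res a vector of 160 residual samples and vs the vector of bandpass voicing strengths; the peakiness is computed as the RMS-to-mean-absolute-value ratio, which is an assumption about the exact definition used by the codec.

    % Bandwidth expansion of the LP coefficients and the peakiness check.
    gamma = 0.994;
    a_bw  = a .* gamma.^(0:length(a)-1);              % multiply a_i by 0.994^i
    peakiness = sqrt(mean(res.^2)) / mean(abs(res));  % "peaky" residuals give large values
    if peakiness > 1.34, vs(1)   = 1.0; end           % force lowest band to voiced
    if peakiness > 1.6,  vs(1:3) = 1.0; end           % force three lowest bands to voiced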

4.7.2 The Decoder

The decoder is supposed to take the bitstream, unpack it and produce high quality synthesised speech. The pitch is decoded first, since it contains the mode (voiced or unvoiced) information.

1. Pitch decoding. The decoded pitch contains the information about the mode of the current frame. If the pitch code is all zeros, or has only one bit not equal to zero, the mode is forced to unvoiced and an error correction is made. If exactly two bits in the pitch code are not equal to zero, a frame erasure is indicated and a frame repeat function is invoked. Any other occurrence of bits equal to one indicates that the mode is voiced and the parameters can be decoded directly.

2. Noise and gain estimation. Next the decoder updates the noise estimator and an attenuation is applied. However, noise estimation and gain attenuation are disabled for repeated frames.

3. Interpolation of synthesis parameters. The LSFs, the logarithmic speech gain, the pitch, the jitter, the Fourier magnitudes, the pulse and noise coefficients for the mixed excitation, and the spectral tilt coefficient (used for the adaptive spectral enhancement filter) are interpolated linearly against their values for the previous frame (see the sketch after this list). If there is a difference greater than 6 dB in gain, or if there is an onset with a high pitch frequency between two frames, the interpolation is not made.

4. Generate the mixed excitation. The excitation is generated as the sum of the filtered pulse and noise excitations. The noise is generated by a uniform random number generator and then normalised. The pulse and noise excitations are then added together.

5. Adaptive spectral enhancement filter. The 10th-order pole-zero filter (with an additional 1st-order spectral tilt compensation) is applied to the mixed excitation. Its coefficients are calculated by bandwidth expansion of the interpolated LPC filter coefficients and adapted based on the SNR.

6. LPC and gain synthesis. The LPC filter coefficients are based on the interpolated LSF coefficients. A gain scaling factor is applied after the filter, based on linearly interpolated values.

7. Pulse dispersion filter. A pulse dispersion filter is applied after the gain scaling. The filter is a 65th-order FIR filter derived from a spectrally flattened triangular pulse.

8. Buffering. Since the synthesiser produces a full pitch period of synthesised speech at a time, some buffering must finally be made.
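A minimal MATLAB sketch of the linear interpolation in step 3 (referred to above). The parameter names and the interpolation position t0 of the current pitch period within the frame (0 at the frame start, 1 at the end) are assumptions for this example.

    % Pitch-synchronous linear interpolation between previous- and current-frame parameters.
    interp_param = @(prev, curr, t0) (1 - t0) * prev + t0 * curr;
    lsf_now  = interp_param(lsf_prev,  lsf_curr,  t0);   % works for vectors (LSFs) ...
    gain_now = interp_param(gain_prev, gain_curr, t0);   % ... and scalars (gain, pitch, jitter)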

5 Investigation of the MELP Speech Codec

First the code was examined to find where and how different functions were implemented. Some initial tests were made by turning off different functions in the coder. This was also a way to get familiar with the working environment at Erisoft. It became clear that it was easy to make errors and that this was a very slow way to work, since the source code had to be changed and recompiled for each test. This is a major drawback if one wants to repeat the same tests. The solution is to use command line switches to turn different functions off. To be able to use command line switches, the C source code was changed to compile under C++. The original command line parsing could then be replaced by the Argum class (a class that handles command line parsing and switch extraction) from BC-Lib (a code library used at Erisoft for development of speech and channel coders, built in C++ with extensions for floating-point matrices and vectors), together with flags in the code. After this change the different tests can be executed by running the coder with different command line switches. At the same time the possibility to generate only the coded bitstream was removed: the modified program always runs analysis followed by synthesis and requires both an input and an output file. These files are both in the RAW format.

The following subsections each describe the changes made to the specific function and the reason for making them.

5.1 Highpass Filter

The MELP codec uses a 4th-order highpass filter, see Fig. 9 in Appendix C, to remove any DC component in the input signal. The filter is a Chebyshev Type II filter with a cutoff frequency of 60 Hz and a stopband attenuation of 30 dB. In changing the code, the boolean variable uhp was introduced to make it possible to turn the highpass filter off. This is done by specifying the command line switch -nohp.

5.2 Adaptive Spectral Enhancement

The MELP codec specifies the use of Adaptive Spectral Enhancement, which can be turned off by using the command line switch -noase. This sets the boolean variable uase to false.

5.3 Pulse Dispersion Filter

The MELP codec specifies the use of a pulse dispersion filter, which can be turned off by using the command line switch -nopdf. This sets the boolean variable updf to false.

5.4 Bandpass Filters

In the original MELP codec there are 5 bandpass filters. The cutting frequencies are 0-1/8, 1/8-1/4, 1/4-1/2, 1/2-3/4 and 3/4-1 in normalised frequency.

5 INVESTIGATION OF THE MELP SPEECH CODEC 26 1 Frequency response for original Encoder BandPass Filter 1 H(w) db 2 3 4 bp1 5 bp2 bp3 bp4 bp5 6 3 2 1 1 2 3 w Figure 6: Original bandpass filters. The analysis filtes are 6:th order Butterworth pole/zero filters and the synthesis filters are 32:nd order FIR filters. Fig. 6 shows the frequency response of the original bandpass filters. The first task was to design the same filters in MATLAB to see if it was possible. For the analysis filters the MATLAB function butter generated almost the same pole/zero location as the original. The synthesis filter were a little more tricky. The description specifies a 32:nd order FIR filter windowed with a Hamming window. This did not work, but by trial and error it was found that the MATLAB function fir1 windowed with the square root of the MATLAB window function hamming gave an almost perfect match. The MATLAB function filterset was constructed to create the nessesary filter parameters for both analysis and synthesis for an input vector with cutting frequencies. In Appendix D the result from MELPcomp.m can be viewed. The figures shows the comparison between the original filters and the filters constructed with MAT- LAB. They are almost identical, so it was possible to create the nessesary filters in matlab. To test how the bandpass filters affected the quality a set of different bandpass filters were created with the MATLAB function firs. This MATLAB routine generates the c-source files filter.c and filter.h that contain all the filter parameters for the different filter sets. The MELP source code was then changed to include the integer nbpf that specifies what set of filters should be used. This interger can be set by the command line switch -newbpf=n where n is a number in the range [...7]. If the switch is not present nbpf is set to -1 and the original filters are used. The implemented filters are presented in the table 3. 5.5 Synthesis Pulse Train In the original MELP codec the pulse train is generated by the inverse DFT of the fourier magnitudes. One idea was to make the perfect pulse train by the use of

5.5 Synthesis Pulse Train

In the original MELP codec the pulse train is generated by the inverse DFT of the Fourier magnitudes. One idea was to make the perfect pulse train by using the original residual. After some tests the integer tpt was introduced to make it possible to select between several different types of pulse trains. If the command line switch -fpt=n is not present, the original MELP pulse train is used. By including -fpt=0 on the command line the program will use a simple form of pulse train, consisting only of a 1 at the beginning of each pitch period. By including -fpt=1 the residual from the coder is used as the pulse train in the decoder. This type of codec does not use the mixed excitation.

5.6 Generation of MATLAB Data

The MELP coder was also modified to be able to generate MATLAB data for plots and analysis. The command line switches in Table 4 were defined.

Table 4: Command line switches for the MATLAB data generator.

    Switch          Action
    -P              Turn on all printing
    -sp             Print speech
    -rp             Print residual
    -pp             Print pulse train
    -ep             Print LPC excitation
    -sp             Print synthesised speech
    -nfp=n          Specify number of frames to print (default 45)
    -mfile=<name>   Specify filename for MATLAB data

The MATLAB data is printed frame by frame to the data file in binary format. To read the data into MATLAB the following type of program has to be used:

    FRAME = 180;                                % samples per MELP frame (22.5 ms at 8 kHz)
    fid = fopen(<data file>, 'rb');
    clear X;
    [X, c] = fread(fid, [FRAME, inf], 'float');
    fid = fclose(fid);

    col = c / FRAME;
    speech   = reshape(X(:, 1:5:col), col/5*FRAME, 1);
    residual = reshape(X(:, 2:5:col), col/5*FRAME, 1);
    pulse    = reshape(X(:, 3:5:col), col/5*FRAME, 1);
    exc      = reshape(X(:, 4:5:col), col/5*FRAME, 1);
    syn      = reshape(X(:, 5:5:col), col/5*FRAME, 1);

MATLAB now contains the vector variables speech, residual, pulse, exc and syn. These can then be used as usual in MATLAB.

5.7 Quantization

Initial tests showed a large difference between the original speech and the synthesised speech, while the difference between two versions of the synthesised speech (i.e. with two different quantizations turned off) was very small. One idea was that the quantization imposed a limit on the performance of the coder. To make an optimal MELP coder, all quantizations were therefore removed. The results from this optimal coder could then be used as a reference when deciding the importance of the other functions. The parameters that are quantized are: bandpass voicing strength, jitter, gain, LPC coefficients, pitch, and Fourier magnitudes. The following subsections describe the quantization of these parameters.

5.7.1 Bandpass Voicing Strength

In the original MELP codec the calculated bandpass voicing strength for each passband is quantized to one of two levels, 0.0 or 1.0, that is, unvoiced or voiced. To make it possible to disable this quantization, the boolean flag bq was introduced in the code. This flag is normally set to true, but passing the command line switch -nobq sets it to false. The latter disables the quantization of the bandpass voicing strength coefficients.

Tests showed that the voicing strength could take values in the range -0.3 to 1.2, and it was therefore not possible to calculate the noise strength directly from it. Since the total energy should be 1, it seemed a good guess that the noise strength should be calculated as

    Ns = 1 - Vs    (8)

This is only possible if the voicing strength coefficient is limited to the range 0.0 to 1.0. The pulse train excitation filter then has to be calculated as the sum of the weighted bandpass filters,

    pbpF(k) = \sum_{i=1}^{N} Vs_i \cdot BPF_i(k)    (9)

and the noise excitation filter is calculated as

    nbpF(k) = \sum_{i=1}^{N} Ns_i \cdot BPF_i(k)    (10)
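A minimal MATLAB sketch of equations (8)-(10) and of how the resulting filters shape the mixed excitation. Here BPF is assumed to be a matrix whose i-th row holds the FIR taps of synthesis bandpass filter i, Vs the vector of (unquantized) voicing strengths, and pulse_train and noise the two excitation signals; all of these names are assumptions for the example.

    % Pulse and noise shaping filters as voicing-strength-weighted sums of the bandpass filters.
    Vs = min(max(Vs, 0), 1);                 % limit voicing strengths to [0, 1]
    Ns = 1 - Vs;                             % noise strength per band, eq. (8)
    pulse_filt = (Vs' * BPF)';               % eq. (9): sum_i Vs_i * BPF_i
    noise_filt = (Ns' * BPF)';               % eq. (10): sum_i Ns_i * BPF_i
    excitation = filter(pulse_filt, 1, pulse_train) + filter(noise_filt, 1, noise);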