STATE OF THE ART AND TRENDS IN SPEECH CODING


Philips J. Res. 49 (1995)

by R.J. SLUIJTER, F. WUPPERMANN, R. TAORI and E. KATHMANN
Philips Research Laboratories, Prof. Holstlaan, AA Eindhoven, The Netherlands

Abstract

An introductory review of some basic speech coding techniques covers the most important properties of speech production and hearing, the ubiquitous techniques of quantization and linear prediction, and a recital of the most important measures of coding performance. In the survey that follows, several standardized speech coding systems reflecting the state of the art in speech coding are discussed in terms of coding method, bit rate, performance, complexity and typical application areas. Major future trends are indicated on the basis of expected future standards. The paper, which primarily deals with narrowband speech coding systems, is concluded by a review of the state of affairs and an outline of the future trends in the area of wideband speech coding.

Keywords: speech; source coding; state of the art; future trends; standards; narrowband; wideband.

1. Introduction

Speech coding is the conversion of an analog speech signal into a digital signal. This signal is transmitted to a remote decoder or stored in a memory for later decoding. The decoder reproduces the original analog signal as well as possible. The purpose of digitization is to enhance the fidelity of transmission or to allow the use of digital memory for storage purposes. Sometimes, the signal is digitized just to allow the signal to be processed in a digital way, which can be more accurate and reliable than analog processing. Speech coding has already been used in professional transmission equipment for the public switched telephone network (PSTN) and business communication networks for some decades. More recently, however, there has been remarkable growth in the use of speech coding systems.
For example, speech coding is applied in public mobile telephone systems, private mobile radio, conference-hall systems, videophone systems and cordless telephone products. Today,

we also find speech coders in temporary storage applications, such as voice mail systems, digital telephone answering machines, dictation systems and pocket memos, and even some personal computers provide the possibility to store speech. Another application area, which is growing with the availability of high capacity, low cost digital read-only memory (ROM), is that of voice response ('canned speech') systems. Voice response systems are used in car navigation systems, public-address equipment, portable guidance products for use in museums and big exhibitions, and toys, amongst others. It is evident that in all the aforementioned application areas, the transmission or storage media involved should be used as efficiently as possible. So, the speech coder should yield a bit rate as low as possible. Over the years, speech coding systems have been proposed for various applications. The most important speech coding systems will be surveyed in Sections 3 and 4. Before we commence this discussion, some important basics of speech coding are reviewed.

2. Basics of speech coding

The physiology of the speech organ and the psycho-acoustics of hearing are important foundations of many speech coders. Although precise modelling of speech production and hearing is still in a state of research, gross characterizations suffice to serve the purpose of designing effective speech coders.

2.1. Speech production and hearing

The human speech production mechanism can be characterized rather simply by the famous source-filter model [1, 2], as shown in Fig. 1. Here, the source is either modelled as a quasi-periodic pulse source for voiced sounds (V), or a white noise source for unvoiced sounds (U).

Fig. 1. The source-filter model of speech production.

A gain factor (g) controls

the intensity of the produced sound. The source excites a filter (F), which represents the vocal tract, consisting of the throat, mouth and nasal cavities. During the generation of voiced sounds, the air expelled by the lungs causes the vocal cords to vibrate with a certain periodicity. This periodicity (pitch) varies with time and is represented by T in Fig. 1. In speech, T may vary between 2 and 20 ms, although the variations do not usually exceed two octaves for a single speaker. During the generation of unvoiced sounds the vocal cords do not vibrate and hence there is no periodicity associated with the source. In the cavities of the vocal tract, three to five resonances known as formants (see also Fig. 9) may originate. Depending on the movements of the articulators (lips, jaws, tongue and velum), these resonances vary with time. The rate of change of the articulators, including the vocal cords, is limited by the musculature that operates them, and the associated time constant is in the order of 20 ms. If the speech signal is considered stationary over this time duration, it can be represented fairly accurately on the basis of just a handful of parameters describing the model. Using these parameters, it is possible to reconstruct a perceptually similar copy of the original speech signal with a very low bit rate. In perception, some features of the speech signal are quite irrelevant.

Fig. 2. Stylized visualization of the time-frequency dependent phenomenon of auditory masking, for a periodic impulse sequence.

Phase relations between the signal components and minor variations in pitch are but

two examples of irrelevancy. Also, it is possible to replace unvoiced sounds by free-running artificial noise, if only the original shape of the energy density spectrum is retained. Yet another phenomenon of perception is masking [3]. It conceals weaker signal components in the neighbourhood of relatively stronger signal components, both in time and in frequency. One of the most instructive visualizations of this time-frequency dependent phenomenon is given in Fig. 2. It shows stylized masking boundaries for a time signal consisting of periodic pulses with a period T, having frequency harmonics at multiples of 1/T. All additional sound components with amplitudes 'below the roofs' are inaudible [4]. In some speech coders this phenomenon is exploited by controlling quantization noise and coding distortions in such a way that their audibility is reduced, or suppressed completely. A recent treatise of this subject can be found in Ref. [5].

2.2. Quantization

First of all, the speech signal is assumed to be properly sampled using anti-aliasing filtering. The sampling rate is 8 kHz in narrowband coding and 16 kHz in wideband coding. In reproducing the analog speech signal, an appropriate reconstruction filter is required. Various ways of quantizing speech samples are described next, although the principles are applicable to any other type of variables or speech parameters [2, 6]. A uniform quantizer has equally spaced quantization levels. A possible input-output characteristic of such a quantizer is shown in Fig. 3a. If a sampled signal is quantized, quantization noise is introduced and a signal to noise ratio (SNR) can be determined by means of the ratio of the signal energy and the quantization noise energy, both measured over the same time. If the quantizer is used for a signal that occupies the full scale, a certain SNR is

Fig. 3.
Quantization characteristics showing signal level s vs quantized level s_q, of (a) a uniform quantizer, (b) a non-uniform quantizer and (c) an adaptive quantizer, for a certain signal level l and a larger signal level.

obtained. For lower signal levels, the SNR decreases. Figure 4 shows the SNR as a function of the signal level for a sinusoid, using 256 quantization levels, which can be represented by an 8-bit code. For speech signals, in which the signal level varies over a large dynamic range of 30 dB or more (see Fig. 10), a 12-bit (4096 levels) quantizer is needed to obtain satisfactory performance, as in telephony, for instance.

In a non-uniform quantizer the spacing between the quantization levels is not equal, as shown in Fig. 3b. For example, the 8-bit logarithmic quantization scheme used in PCM (see Sec. 3.1) renders a much smaller SNR-dependence on the signal level. In Fig. 4 the SNR vs the signal level of a sinusoid is sketched. The basic idea behind logarithmic quantization is that if the input s and its compressed version c are related by c = ln(1 + s), then the difference quotient is given by Δc = Δs/(1 + s). If c is uniformly quantized and if in addition s >> 1, then Δc = constant, and hence Δs/s ≈ constant. This gives an approximately constant relative quantization error and, consequently, a constant SNR. By proper expansion of the quantized signal a total input-output characteristic, as shown in Fig. 3b, is obtained. In general, the quantizer can be optimized by adapting the distribution of the quantization levels to the signal statistics. This kind of quantizer is known as a Max-Lloyd quantizer [7].

Fig. 4. SNR vs relative signal level for the three different scalar quantizers.
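The log-companding idea described above can be illustrated with a small sketch. The μ-law characteristic below is a continuous idealization (the standardized A-law and μ-law of PCM use segmented approximations of such a curve), so this is a toy model of the principle, not the standardized quantizer; the constant MU = 255 is the value conventionally paired with 8-bit coding.

```python
import math

MU = 255.0  # companding constant conventionally used with 8-bit PCM

def mu_compress(s):
    """Map s in [-1, 1] to c in [-1, 1]; quasi-logarithmic for large |s|."""
    return math.copysign(math.log1p(MU * abs(s)) / math.log1p(MU), s)

def mu_expand(c):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(c) * math.log1p(MU)) / MU, c)

def quantize_uniform(c, bits=8):
    """Uniform quantizer on [-1, 1] with 2**bits - 1 steps."""
    step = 2.0 / (2 ** bits - 1)
    return round(c / step) * step

def log_pcm(s, bits=8):
    """Compress, quantize uniformly, expand: a non-uniform quantizer."""
    return mu_expand(quantize_uniform(mu_compress(s), bits))
```

Because the compression stage makes the effective step size roughly proportional to the signal magnitude, the relative quantization error, and hence the SNR, stays roughly constant over a wide range of input levels, which is exactly the flattening of the SNR curve sketched in Fig. 4.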

In an adaptive quantizer a small dependence of the SNR on the signal level is obtained in an alternative way. In this case the quantization step is adapted to the signal level in such a way that full load of the quantization characteristic is pursued, see Fig. 3c.

Fig. 5. Vector quantization system.

The adaptation can be achieved in a forward or backward

way. In forward adaptation, the input signal level is measured and the quantization characteristic is controlled accordingly, requiring separate transmission of the level parameter for decoding purposes. In backward adaptation the signal level is estimated from the quantized signal, and since the same quantized signal will be available to the decoder, no side information needs to be transmitted. For the sake of comparison, the SNR of an 8-bit backward adaptive quantizer, again for a sinusoidal input signal, is also shown in Fig. 4. A parameter to be chosen in both forward and backward adaptation is the speed of adaptation, which should be tuned to the rate of change of the envelope of the signal. For speech, the corresponding time constant is in the order of 10 ms, preferably with a faster 'attack' time and a slower 'decay' time. Sometimes, the level of the signal is estimated on the basis of a fixed number of samples, known as 'block' adaptation.

In a vector quantizer, a block of N samples s[n], which can be interpreted as an N-dimensional vector, is quantized as a whole [8]. For this purpose, a codebook containing a set of vectors of the same dimensions is used. These vectors are approximations to the expected set of possible input vectors, as shown in Fig. 5. The current input vector is compared to all vectors v_l[n], l = 1, 2, ..., L in the codebook and the error sequences, e_l[n] = s[n] - v_l[n], are evaluated to find the best matching vector. This vector can be represented by a log2(L)-bit index, and in a remote decoder, which contains the same codebook, the chosen vector

Fig. 6. Graphical representation of vector quantization.

can be retrieved using this index. A workable matching criterion is the mean square error (MSE). Such matching criteria weight the individual quantization errors, |e_l[n]|^2, equally. Sometimes, however, it may be better to give them unequal weights in order to make certain contributions to the matching criterion less important than others. The optimum codebook contents in a certain application can be obtained by training. For this purpose, many different vectors are applied to the system and they are clustered to form cells in the N-dimensional space, as shown in Fig. 6 for the simple 2-dimensional case. The centroid of each cell is actually stored as a vector in the codebook. A popular training algorithm is the LBG algorithm [9]. Alternatively, it is also possible to construct the contents of the codebook, if the statistics of the signal to be quantized are sufficiently known. In actual operation, the vector quantizer will allocate an input vector s to the centroid of the cell in which it is located. A major difficulty in vector quantization is managing the size of the codebook. This arises from the fact that the codebook size required for acceptable performance gives rise to unmanageable computational complexity. In Fig. 5, for instance, the codebook contains segments of speech. It is evident that the codebook size will be huge if it has to contain all possible sounds, even with slightly different pitches and levels. Therefore, vector quantization is almost exclusively applied to decorrelated and normalized signals. An important decorrelation technique is linear prediction.

2.3. Linear prediction analysis

Linear prediction (LP), or linear predictive coding (LPC), is the prediction of the current speech sample s[n] on the basis of a linear combination of

Fig. 7. Linear prediction: inverse filter.

previous speech samples s[n - i], i = 1, 2, ..., M. The network providing the linear combination of previous samples is called the predictor (Fig. 7), where T stands for a sampling period delay. The prediction error, or prediction residual, e[n] can thus be represented by:

e[n] = s[n] - Σ a_i s[n - i],  summed over i = 1, ..., M,   (1)

in which the a_i are the prediction coefficients, or a-parameters, and M is the order of the predictor. Minimization of the total energy, E_t, of the prediction error over a certain interval {n_0, n_1}:

E_t = Σ e^2[n],  summed over n = n_0, ..., n_1,   (2)

with respect to the coefficients a_i, results in very attractive properties of the associated system A(z) = E(z)/S(z). First of all, the total squared error criterion produces a set of linear equations which can readily be solved. We see that E_t depends quadratically on the a-parameters. Setting the partial derivatives of E_t with respect to each a_i to zero yields a set of M equations. Solving these M equations according to this method, which is known as the covariance method, yields the optimum a-parameters [2]. The minimization interval {n_0, n_1} is chosen in such a way that in this interval the a-parameters may be assumed to be stationary. A common choice is 20 ms, again. The performance of prediction is expressed in terms of the 'prediction gain', defined as the ratio of the signal energy in the minimization interval and E_t. By extending the minimization interval {n_0, n_1} to {-∞, ∞} and applying a finite-duration window of, say, 20 ms, to s[n], the widely used autocorrelation method is obtained. The equations become:

P a = p,   (3)

in which the elements of the M x M matrix P are autocorrelation coefficients:

p_ij = Σ x[n] x[n + |i - j|],  summed over n = 1, ..., N_w - |i - j|,   (4)

where N_w is the length of the window, x[n] are the windowed speech samples, and a and p are M x 1 vectors having the elements a_i and p_i0, respectively. This system of equations can be solved efficiently by the well known Levinson-Durbin recursion [2].
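The autocorrelation method and the Levinson-Durbin recursion can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions (a rectangular window, and function names of our own choosing), not code from the paper:

```python
def autocorr(x, max_lag):
    """Autocorrelation of a windowed segment x for lags 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations P a = p for the
    a-parameters via the Levinson-Durbin recursion."""
    a = [0.0] * (order + 1)   # a[1..order] are the prediction coefficients
    err = r[0]                # residual energy, updated each iteration
    for m in range(1, order + 1):
        # Reflection coefficient for order m
        k = (r[m] - sum(a[i] * r[m - i] for i in range(1, m))) / err
        new_a = a[:]
        new_a[m] = k
        for i in range(1, m):
            new_a[i] = a[i] - k * a[m - i]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err         # coefficients a_1..a_M and final error energy
```

For a first-order exponentially decaying input, the recursion recovers the decay factor as a_1 and leaves the higher coefficients near zero, illustrating that the predictor whitens the input.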
Yet another approach makes use of the fact that the partial derivative of the

total square prediction error with respect to a_i can be written as:

∂E_t/∂a_i = -2 Σ e[n] s[n - i],  summed over n = n_0, ..., n_1.   (5)

This can be interpreted as a cross-correlation between the input and output sequences of A(z). So, we see that if this partial derivative is set to zero as before, A(z) works as a decorrelator for the speech signal. Some speech coding systems (e.g. ADPCM, see Sec. 3.2.1) are based on predictors in which the prediction coefficients are controlled in such a way that this cross-correlation is adaptively driven to zero, for each i, 1 <= i <= M. Secondly, the total square error criterion provides maximum spectral flatness of the prediction residual e[n], at least for the autocorrelation method [10]. This means that the transfer function of A(z) is approximately inverse to the spectral envelope of its input signal, if it has enough coefficients, and that the spectral shape of the input signal is represented by the prediction coefficients. The system A(z) is referred to as the inverse filter. Consequently, on the basis of the total square error criterion, LP provides a useful synthesis structure. The function 1/A(z) represents the spectral envelope of the speech segment under consideration. The direct-form structure of the network with transfer function 1/A(z) is shown in Fig. 8. It is referred to as the synthesis filter. On the basis of the foregoing, we are able to make a good estimate of the prediction order M. Since a second-order function is required to create a single formant, and since there are three to five formants in speech, six to ten coefficients are needed to realize the required formant structure. The predictors in modern narrowband speech coders are mostly equipped with M = 10 coefficients. In

Fig. 8. Linear prediction: synthesis filter.

actual operation, not all coefficients are devoted to formants, but some of them may represent global spectral inclination. Figure 9 shows an example of the amplitude spectrum of a 20 ms voiced speech segment and the transfer function of an associated 10th-order synthesis filter.

Fig. 9. Linear prediction: (I) amplitude spectrum of a 20 ms voiced speech segment (note the pitch harmonics); (II) transfer function of the associated synthesis filter with four formants and a spectral decay.

Quantizing the a-parameters directly is not very efficient. Usually, the a-parameters are first converted into another form, namely log-area-ratios (LARs) [2] or line spectral pairs (LSPs) [11], and then quantized. The quantization schemes obtained in this way are not unique, but depend somewhat on the design. Using LARs usually results in a total of about 40 bits to obtain a perceptual equivalent of the unquantized synthesis filter. In the case of LSPs, the same can be obtained using about 34 bits. A lower bit rate can be obtained if one opts for vector quantization, at the cost of increased computational complexity, in which case about 24 bits are sufficient [12]. Applying linear prediction in the way discussed above, also referred to as short-term prediction, effectively describes the formant structure of a speech segment, but leaves pitch-related long-term correlation in its residual. This is shown in the example of Fig. 10. The upper trace in this figure shows about 200 ms of a transition from a voiced to an unvoiced portion of speech. The second trace shows the prediction residual of a 10th-order inverse filter, updated every 20 ms, which clearly demonstrates the presence of the long-term

correlation in the form of periodic pitch pulses. This periodicity can be removed by a long-term predictor (LTP), or pitch predictor (Fig. 11).

Fig. 10. Linear prediction: upper trace: a portion of speech of about 200 ms (note the quick level drop of about 30 dB in the voiced to unvoiced transition); middle trace: the short-term prediction error e; lower trace: the long-term prediction error E.

The short-term prediction residual e[n] is delayed, multiplied by a pitch prediction coefficient a_p and subtracted from e[n], resulting in the long-term prediction

residual E[n].

Fig. 11. Long-term prediction: analysis and synthesis filters.

The time lag, usually constrained to the range of pitch in speech, and a_p are optimized on the basis of minimizing the energy of E[n], in the same way as in short-term prediction. The transfer function of such a system will be referred to as P(z). The third trace in Fig. 10 shows the resulting E[n], in which a significant reduction in the dynamic range, and hence an increase in prediction gain, is observed. It is on this decorrelated signal E[n] that vector quantization is normally performed. The network realizing the inverse function 1/P(z), which restores e'[n] from E'[n], is also shown in Fig. 11.

2.4. Measures of coder performance

Speech quality is difficult to define since subjective issues like naturalness, intelligibility, noise, etc., are involved [13]. One objective measure is the SNR, but it only correlates well with subjective quality if it concerns relatively low-level noise and distortions. Better correlation is obtained on the basis of the segmental SNR, where the SNR is measured over short stationary segments of typically 20 ms, and averaged. Intervals of silence have to be excluded then, because they can render bad SNRs which are perceptually not relevant. More sophisticated objective measures, such as spectral distance measures, are being investigated and some of them even include masking models [14]. By and large, the performance of objective measures is improving and they will play an important role in the future. A method for the subjective assessment of speech quality, which has already been in use for decades, is the Mean Opinion Score (MOS) test [15]. A large number of listeners are asked to assess the quality of randomly sequenced utterances, using a 1-5 scale in terms of: bad, poor, fair, good and excellent, respectively. After statistical processing of the results, a MOS number is obtained.
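The segmental SNR described above can be sketched as follows. The 20 ms segment length (160 samples) assumes 8 kHz sampling, and the -40 dB silence threshold is an arbitrary choice of this sketch, used to exclude the perceptually irrelevant silent intervals:

```python
import math

def segmental_snr(ref, deg, seg_len=160, silence_db=-40.0):
    """Average per-segment SNR in dB between a reference signal and a
    degraded (coded) version; near-silent segments are excluded."""
    peak = max(abs(x) for x in ref) or 1.0
    snrs = []
    for i in range(0, len(ref) - seg_len + 1, seg_len):
        s = ref[i:i + seg_len]
        es = sum(x * x for x in s)                    # segment signal energy
        # Skip segments more than `silence_db` below full-scale energy
        if es <= (peak ** 2) * seg_len * 10 ** (silence_db / 10.0):
            continue
        en = sum((x - y) ** 2 for x, y in zip(s, deg[i:i + seg_len]))
        if en > 0:
            snrs.append(10.0 * math.log10(es / en))
    return sum(snrs) / len(snrs) if snrs else float("inf")
```

Averaging the per-segment values in dB, rather than pooling all energies globally, is what keeps loud segments from dominating the measure.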
Some of the speech material used may deliberately be contaminated by background noise or transmission errors, for example, so that these issues are also included in the MOS. In order to normalize the results, speech corrupted by what is called the Modulated Noise Reference Unit (MNRU)

can be included in the tests [16]. Several other application-dependent subjective measures exist, for instance, the Diagnostic Acceptability Measure (DAM) [17] and the Diagnostic Rhyme Test (DRT) [18]. While the former has a more elaborate scale than the MOS, the latter aims at measuring intelligibility alone.

The complexity of a speech coding system is of the order of magnitude of the computing capacity of a modern digital signal processor (DSP). Systems with a high complexity will require more DSPs, and systems with a low complexity can be realized using only a part of the computing capacity of a DSP. In general, a lower bit rate or higher speech quality will require a higher complexity. Sometimes, the distribution of the complexity over the coder and the decoder plays a role, such as in voice response systems. In this case, it is important to keep the decoder as simple as possible, while this is not a prime requirement for the encoder.

The delay from the input of the encoder to the output of the decoder is an issue in full-duplex communications, such as in telephony, because it may cause disturbing echoes. Sometimes, it is even necessary to employ expensive echo-cancellers, in which case minimization of the delay still helps to reduce their costs. The delay requirements imposed on a speech coding system, often specified in terms of intrinsic 'algorithmic delay' and hardware-dependent 'implementation delay', depend on the specific application, and vary from five to some tens of milliseconds. In half-duplex communications, in which the communication channel is used in one direction at a time, the delay is not so much of an issue.
In the assessment of the robustness of a speech coding system, the sensitivity to background noise (such as car noise picked up by a car telephone), the effect of tandeming coding-decoding systems in a network, the transparency of the system for non-speech signals (such as signalling tones, data signals, fax signals, or music signals), and even the sensitivity to the absence of low frequencies in the input speech (as in telephone speech) may play a role. However, the most important robustness measure is often the sensitivity to transmission errors. Transmission errors cause erroneous decoding. The decoder itself must be designed such that errors are perceptually minimized. Usual techniques for this purpose are minimization of error propagation in the decoding process, and minimization of the perceptual difference in the case of single (isolated) bit errors with the help of Gray-coding techniques [6], and the like. If error detection can be applied, erroneous segments can be muted, or better, be substituted on the basis of interpolation or extrapolation. On the PSTN, error rates in the order of 10^-3 are to be expected and the above approaches can, in general, handle these error rates. On mobile networks, however, error rates

of several percent can be expected, in which case error correction techniques are required. Also in storage applications, if cheap error-prone memories are used, the error behaviour has to be taken into account.

3. Narrowband coding systems

Figure 12 depicts the state of the art in narrowband speech coding and the expected future trend (dashed line). The state of the art is indicated by the estimated MOS scores of 9 representative coding standards at various bit rates. They are often classified as I: simple waveform coders, which are basically quantizers; II: advanced waveform coders, characterized by the application of adaptive prediction; and III: vocoders, characterized solely by parameter coding, and consequently, by the absence of any waveform matching. Table I summarizes the main characteristics of these systems. In the following subsections the standard systems and their performances will be considered in more detail, and this section will be concluded with a review of the expected future trend.

Fig. 12. Speech quality in MOS vs bit rate of nine standardized narrowband speech coders, representative of the state of the art and the (future) trend, indicated by the dashed line.

3.1. PCM coders

The first large scale application of PCM was, and still is, in telephony. In 1972 PCM was standardized in two forms, namely the European A-law and the American μ-law [19]. These coders, which are essentially non-uniform

TABLE I

System      Bit rate (kbit/s)  Standard   Year  Application
1 PCM       64                 CCITT      1972  PSTN quality
2 ADPCM     32                 CCITT      1984  PSTN quality
3 LD-CELP   16                 CCITT      1992  PSTN quality
4 RPE-LTP   13                 ETSI       1988  mobile & storage
5 VSELP     8                  CTIA       1989  mobile & storage
6 IMBE      4.15               INMARSAT   1990  mobile (satellite)
7 CELP      4.8                US-DoD     1989  secure voice (military)
8 CVSD      16                 US-DoD     1973  secure voice (military)
9 LPC-10E   2.4                US-DoD     1975  secure voice (military)

PCM: Pulse Code Modulation; ADPCM: Adaptive Differential PCM; LD-CELP: Low Delay-Code Excited Linear Prediction; RPE-LTP: Regular Pulse Excitation-Long Term Prediction; VSELP: Vector Sum Excited Linear Prediction; IMBE: Improved Multi-Band Excitation; CVSD: Continuous Variable Slope Delta modulation; LPC: Linear Predictive Coding; CCITT: International Telegraph and Telephone Consultative Committee (now ITU-T); ITU-T: International Telecommunications Union-Telecommunications section; ETSI: European Telecommunications Standardization Institute; CTIA: Cellular Telecommunications Industries Association; INMARSAT: International Maritime Satellite organisation; US-DoD: United States Department of Defense; PSTN: Public Switched Telephone Network; MOS: Mean Opinion Score.

quantizers, can each be considered as an analog, basically logarithmic, compression characteristic followed by an 8-bit uniform quantizer. The performances of the quantizers are very similar. The SNR of both quantizers amounts to approximately 38 dB over a dynamic range of about 30 dB (recall Fig. 4). PCM is quite insensitive to different statistical properties

of the input signal, so it is very transparent, and it is robust in all other respects. The bit rate, 64 kbit/s, is quite high, but PCM is simple, and it has no intrinsic delay. The MOS is a little over 4, as indicated in Fig. 12. The fact that it is not higher than 4 is mainly due to the narrow bandwidth of telephone speech.

3.2. Differential coders

A differential coder is characterized by the fact that the difference between the original speech sample and its predicted value is quantized, rather than the original sample itself. Since this essentially means the quantization of the prediction error, an improvement in SNR over PCM is obtained. This improvement is approximately equal to the prediction gain. If e[n] in Fig. 7 is quantized by a B-bit quantizer Q, a differential coding system is obtained which is sometimes referred to as D*PCM (D for differential) [6]. The appropriate decoder would then take the form of Fig. 8. Observing the output signal s'[n] reveals that it can take many more than 2^B possible values, since it consists of the predicted value of s'[n] plus the quantized e'[n], and it was e'[n] that was quantized using 2^B levels. Here, a disadvantage of D*PCM is encountered. Because the predicted value of s'[n] itself contains quantization noise and the quantized e'[n] is added to it, an accumulation of quantization errors occurs. Generally speaking, the quantization noise is spectrally shaped by the decoder. Especially low-order predictors will have an integrating character due to the decaying spectrum of speech signals, and the low-frequency content of the quantization noise will be emphasized. This gives rise to hoarse-sounding quantization noise. An important improvement is obtained if the coder is rearranged according to Fig. 13 (DPCM). In the coder, the predictor works on the locally decoded speech samples s_q[n] instead of the original speech samples s[n].
Modelling the quantizer as an additive noise source, simple analysis shows that the noise at the output of the decoder has not undergone any spectral shaping. This kind of quantization noise sounds more pleasant. The quantization noise of a DPCM coder can be classified into two categories. The first category concerns fine quantization noise, often referred to as granular noise. The second category concerns gross quantization errors caused by what is called slope overload. Slope overload may occur when a steep slope in the input signal cannot be predicted from previous samples, either because the predictor is too simple (fixed, low order), or because an unpredictable innovation in the speech signal takes place. The performance of a DPCM coder can be improved further by making the predictor as well

as the quantizer adaptive (dashed lines in Fig. 13).

Fig. 13. DPCM coder and decoder; the dashed arrows indicate adaptation (ADPCM).

These measures help to reduce both granular noise and slope overload. This variant is called adaptive DPCM (ADPCM).

3.2.1. ADPCM

The ADPCM system according to the CCITT standard G.726 [20], system 2 in Fig. 12, incorporates a 4-bit non-uniform backward adaptive quantizer and a backward adaptive predictor, so that no side information needs to be transmitted. The backward adaptive predictor is controlled by the quantized prediction error e_q[n] and the quantized speech signal s_q[n], in such a way that the cross-correlation between these two signals is adaptively driven to zero, as explained in Sec. 2.3. Both signals are also available in the decoder. The system has a bit rate of 32 kbit/s and its MOS is similar to that of 64 kbit/s PCM. It is used on the PSTN and in DECT (Digital European Cordless Telephone, another ETSI standard), without additional error protection bits. It is more complex than PCM, but single-chip realizations are readily available on the market. It has no intrinsic delay and it is very robust in all other respects. For non-speech signals, such as data signals, however, special provisions are incorporated to detect them and to control the settings of the system accordingly.

3.2.2. Delta modulation

In delta modulation (DM), a one-bit quantizer is used and the sampling frequency is increased [6]. The feedback loop in the coder consists of a simple, fixed, integrating network. Many variants have been proposed, most of them differing in the way the quantizer is made adaptive. One of them is the

19 State of the art and trends in speech coding backward adaptive 'continuous variable slope' DM (CVSD) [6, 21].1t has also been included in the survey of Fig. 12 (system 8). DM has only limited application, mainly military, and in NASA's space shuttle [22].The main features of DM are that it is very simple and extremely robust against transmission errors Analysis-by-synthesis coders The class of analysis-by-synthesis coders under consideration is based on LPC synthesis. Figure 14 shows the generic structure of such coding systems Generic strue ture The speech signal is split up into segments of typically 20 ms, and on each segment LP analysis is performed. A local decoder, consisting of an adaptive LP synthesis filter I/A(z), is excited by an excitation generator to obtain an estimate, s'[n], of the speech signal srn]. The excitation generator can generate only a limited number ofapproximations, xf[n], 1= 1,2,... L, to the prediction residual, so that log2l bits are needed to inform the decoder which particular excitation sequence to use. The error sequence, s[n]- san], is evaluated over an interval of typically 5ms on a mean-square basis. The excitation signal xf[n] is chosen such that, given the L degrees of freedom of the excitation signal, a minimum mean square error (MMSE) is obtained. In speech signals, the excitation candidate that delivers MMSE is not necessarily the candidate that delivers the best perceptual result.. In order to make the error criterion perceptually relevant, 'noise shaping' is introduced. One can conclude from the auditory masking model that more distortion s ; ~ LPC parameters , I I I I I Excitation ~ 1 I I Generator A(z) I I I I 1 ~--!:Q.c.ru.I>-gç-o-li-g!:.---J il A(z) A(zly) _ e Fig. 14. Generic structure of LPC-based analysis-by-synthesis coding systems. Philip. Journalof Research Vol. 49 No

20 R.J. Sluijter et al. can be tolerated in the formant regions. Accordingly a filter is designed which provides this weighting and is generally referred to as the weighting filter. This filter has the form A(z)/ A (zh) in which 0 < 'Y < 1with a typical value ofo.8. The effect of introducing such a parameter is to increase the bandwidth of the formants with respect to those of 1/ A(z). In this way, the formants are partially suppressed so that they have less weight in el[n], which results in the toleration of relatively larger distortion in the formant regions. The MMSE procedure works with an invariable part and an innovation part. The invariable part eo[n] is that part of el[n] which is not influenced by the excitation signal in the subsegment under consideration, so it does not depend on I. It consists of the hangovers of the synthesis filter and the weighting filter of previous subsegments and the contribution of srn] in the current segment. The innovation part ul[n] consists of the convolution of xl[n] with the impulse response h[n] of the cascade of the synthesis and weighting filters, so that the mean square error El is given by: 1 N-I 2 1 N-I 2 EI = N L el [n] = N L (eo[n]- u![n]), n=o n=o where N is the length of the sub segment. The minimum error El, I = 1,2, 3...,L, indicates the best excitation sequence in the weighted MMSE sense. There are three main variants of analysis-by-synthesis coders which basically differ only in the type of excitation function: code excited linear predictive (CELP) coders, multi-pulse excited (MPE) coders and regular-pulse excited (RPE) coders, which will be considered in more detail in the following. (6) Code excitation A CELP coder [23] is basically a vector quantizer operating on the decorrelated and normalized speech signal, with a weighted matching criterion. The codebook used contains rms-normalized approximations to the LTP residual E[n]. Accordingly, the excitation generator in Fig. 
14 consists of a codebook, a gain factor, and an LTP synthesis filter. Figure 15shows the structure ofsuch a code-excitation generator in which the LTP synthesizer has been modified to what is now called an adaptive codebook. Ifthe associated lag exceedsthe subsegment duration, it operates as a usual LTP synthesis filter 1/ P(z). Otherwise, the number of samples in the delay line spanned by the lag is repeated until the subsegment is completed. This has the advantage that the computation of the gain factor gp (pitch prediction coefficient) of the adaptive codebook is straightforward [24].The range of the lag is normally ms, comprising 474 Philip. Journal of Research Vol.49 No

21 State of the art and trends in speech coding Lag Optional repeat x Fixed Codebook Fig. IS. Architecture of the CELP excitation generator. 128 integer values. Enhanced performance is obtained if the lag is allowed to have subsample resolution [25]. As a rule, not more than 256 lag values are used, distributed non-uniformly; for smalllags this distribution is more dense, never exceeding a virtual oversampling factor of 8, and for large lags only integer values are used. The main difficulty in the design of CELP coders is to keep the complexity manageable. In order to avoid the complexity of a joint search procedure for the best match, the adaptive codebook and the fixed codebook are searched sequentially. This means that in the first search the contribution of the fixed codebook is zero. Alllags are assessed, and for each lag an optimum gain gp,l is computed. The optimum gain is obtained by substituting ul[n] of eq. (6) by gp,iuan] and setting the partial derivative of El, with respect to gp,l, to zero. The search procedure selects that lag I, for which El is minimum. Next, the effect of the selected vector is incorporated into a new invariable part and the same procedure is now repeated for the fixed codebook. The loss in performance due to this approach is negligible, despite its suboptimality. Further reductions in complexity are necessary for one-dsp realizations. Many CELP variants, all aiming at reduced complexity of the adaptive-codebook and fixed-codebook search procedures, have been proposed in order to meet this goal. The required degree of sophistication of the DSP varies, however. The robustness of CELP coders shows some relation with the bit rate. The lower the bit rate, the more speech specific the Philips Journal of Research Vol.49 No

22 R.J. Sluijter et al. system, and the more information is carried per bit, causing increased error sensitivity. In the DoD-CELP coder (system 7 in Fig. 12) a ternary valued, sparsely populated random codebook (only 25% of the vector elements are nonzero) is used, resulting in a reduced complexity of the fixed codebook search [26]. The adaptive codebook search is split up in a hierarchical way, by first searching integer valued lags followed by searching only neighbouring noninteger values. The relatively low bit rate is obtained mainly by the use of long segments and subsegments (30ms and 7.5ms, respectively) and the low excitation rate* of 0.3 bit/sample, which are also the reasons for the relatively low quality (MOS ~ 3). The long segments also cause the relatively high intrinsic delay of 45 ms. The bit rate of the system is 4.8 kbit/s, including 133 bit/s for error correction. In the VSELP coder (system 5 in Fig. 12), two fixed codebooks are used which are searched in a sequential way and each codebook is constructed by the sum of 7 basisveetors [27]. The excitation code of each codebook consists of 7 bits, being the signs of the basisvectors, so that 128 combinations can be generated. This approach saves the convolution with h[n] for each codebook vector, because it can be precalculated once per basisvector, and the allocations of signs and summations can be done afterwards. The adaptive codebook has no subsample resolution yet. The segments are 20 ms with 5 ms subsegments. Despite the shorter segments (compared to the DoD-CELP), the intrinsic delay is still about 40 ms. This is due to the particular segmentation arrangement. The excitation rate is bit/sample and the estimated MOS is about 3.5. The bit rate of the coder is 7.95 kbit/s and it has been standardized (International Standard IS-54) by the North American CTIA for digital mobile telephony. The standard prescribes a gross channel rate of 13 kbit/s, the difference being devoted to error protection. 
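The weighted MMSE selection of eq. (6), together with the closed-form optimum gain obtained by setting the partial derivative of E_l to zero, can be sketched as below. The sketch assumes the filtered candidates u_l[n] (each candidate already convolved with h[n]) are given; all names and signal values are illustrative and not taken from any of the standardized coders above.

```python
# For each candidate u_l, the gain g minimizing E_l follows from dE_l/dg = 0:
#   g = sum(e0[n] * u_l[n]) / sum(u_l[n]^2).
# The search keeps the candidate (e.g. the lag of the adaptive codebook)
# whose gain-scaled version leaves the smallest mean-square error.

def search_excitation(e0, filtered_candidates):
    best = None                                   # (index, gain, error)
    for l, u in enumerate(filtered_candidates):
        energy = sum(v * v for v in u)
        if energy == 0.0:
            continue                              # all-zero candidate cannot match
        g = sum(a * b for a, b in zip(e0, u)) / energy
        err = sum((a - g * b) ** 2 for a, b in zip(e0, u)) / len(e0)
        if best is None or err < best[2]:
            best = (l, g, err)
    return best

# A candidate proportional to e0 is matched exactly once scaled by its gain:
best = search_excitation([2.0, 4.0, 6.0], [[1.0, 0.0, 0.0], [1.0, 2.0, 3.0]])
```

In the sequential procedure described above, this search would run once over the adaptive codebook (fixed-codebook contribution zero), after which the chosen, gain-scaled vector is folded into a new invariable part e0 and the same routine runs again over the fixed codebook.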
At the time of standardization, the VSELP coder was the best existing CELP variant which could be realized in a single DSP.

The LD-CELP coder has an architecture as shown in Fig. 16 [28, 29]. Its low intrinsic delay is made possible by the use of backward-adaptive LPC analysis and gain control, and the short duration, only five sampling periods, of the subsegments. The coder does not incorporate LTP, but an extremely high order of 50 is used in the LPC. This makes the coder less speech-specific and, consequently, it is very transparent, even for music signals.

*) The excitation rate is another indicator related to speech quality. It is defined here as the ratio of the number of bits allocated to the excitation code, apart from absolute gain factors, and the number of samples, both in a subsegment.
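Applying the footnote's definition to the DoD-CELP figures quoted earlier reproduces the 0.3 bit/sample value. The 8 kHz sampling rate and the 18 excitation-code bits per subsegment used here are assumptions chosen to be consistent with that figure, not numbers taken from the standard's actual bit allocation.

```python
# Excitation rate per the footnote: excitation-code bits per subsegment
# (absolute gain bits excluded) divided by samples per subsegment.
# A 7.5 ms subsegment at an assumed 8 kHz sampling rate holds 60 samples.

def excitation_rate(excitation_bits, subsegment_ms, fs_hz=8000):
    samples = fs_hz * subsegment_ms / 1000.0
    return excitation_bits / samples

rate = excitation_rate(18, 7.5)   # assumed 18 bits over 60 samples
```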

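The backward gain control mentioned above derives the excitation gain from previously quantized excitation vectors, which the decoder has as well, so no gain information is transmitted. The simple log-domain recursion below is purely illustrative of that idea; it is not G.728's actual gain predictor, and the leakage and floor constants are assumptions.

```python
import math

# Backward gain adaptation sketch: the gain estimate (in dB) leaks toward the
# level of the most recent *quantized* excitation vector. Encoder and decoder
# run the same recursion on the same data and so stay in sync.

def backward_gain_db(prev_gain_db, last_quantized_vector, leak=0.9, floor_db=-30.0):
    energy = sum(v * v for v in last_quantized_vector) / len(last_quantized_vector)
    level_db = 10.0 * math.log10(max(energy, 1e-10))
    return max(floor_db, leak * prev_gain_db + (1.0 - leak) * level_db)
```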
Fig. 16. Architecture of the low-delay CELP.

Despite the backward adaptation, the system can handle error rates up to 10^-2, which matches perfectly the intended application on the PSTN. The trained codebook contains a 7-bit 'shape' codebook (128 excitation vectors) and a 3-bit 'gain' codebook, including a sign bit, to control the backward adaptive gain control process. Its excitation rate is 1.6 bit/sample while the total bit rate is 16 kbit/s. The speech quality is the same as the quality of system 2. The coder is very complex, though implementations on a single (sophisticated) DSP already exist.

Pulse excitation

In an MPE coder the excitation signal x[n] of Fig. 14 consists of a few pulses per subsegment, as depicted in Fig. 17a. If the locations of these pulses are known, the amplitudes can be calculated with the help of eq. (6), again by setting the partial derivatives of E_l with respect to these pulse amplitudes to zero, and solving the resulting set of equations. For each l, the pulses have different locations. If, for example, 10 pulses are to be located in a subsegment of 5 ms, there are about 10^9 possible excitation vectors. This means that 10^9 sets of equations have to be solved and just as many error measures have to be computed in order to select the minimum E_l. This is far too complex to be handled even by several DSPs. The usual approach is, therefore, a suboptimal sequential search, pulse by pulse [30].

Fig. 17. Examples of the excitation signal x for (a) multi-pulse excitation and (b) regular-pulse excitation.

Although MPE can yield good speech quality, it cannot compete with CELP because a relatively high number of bits is needed for coding the pulse positions. It has not been standardized for any application, but it has been the basis of RPE.

In an RPE coder the excitation pulses are placed regularly according to a downsampling scheme, as shown in Fig. 17b [31]. If the downsampling factor is 3, for instance, there are only 3 pulse-position grids according to the 3 possible phases of downsampling. Now, only 3 sets of equations have to be solved and 3 values of E_l determined, unveiling the grid position with the lowest E_l. Such a system has been the basis of the ETSI standard for the GSM (Global* System for Mobile) digital cellular telephone network. Because RPE in its original form was still too complex for a commercially attractive implementation at the time the standard was being developed, a simplified version has been standardized [32, 33].

*) Although the ETSI is a European organization, the GSM system is being increasingly adopted on the global scale.

In the RPE-LTP coder, as the GSM system is technically called (system 4 in Fig. 12), the speech signal is first fed through an inverse filter, using 20 ms segments, and subsequently a pitch prediction residual e[n] is computed by an adaptive LTP on the basis of 5 ms subsegments (Fig. 18). For each subsegment of e[n], the RPE coder generates candidate excitation sequences on the basis of a downsampling factor of three, and selects the one with the lowest E_l. Prior to transmission, the RPE pulses are quantized using forward block-adaptive, 3-bit uniform quantization, with a single amplitude parameter per subsegment.

Fig. 18. Architecture of the RPE-LTP coder (full rate GSM).

The bit rate of the coder is 13 kbit/s, and its intrinsic delay is 20 ms. The gross bit rate on the radio channel is 22.8 kbit/s, so 9.8 kbit/s is used for error protection. This generous protection makes the system very reliable on the adverse radio channel. The excitation rate is 1.2 bit/sample, but the quality cannot be compared to that of the LD-CELP. The average MOS test result, which includes speech utterances exposed to background noise, tandeming and channel error rates up to 30%, is about 3.6. It is the straightforward structure of the system that makes it quite transparent, even for signalling tones. Data signals, however, have to be transmitted separately in the GSM system. The complexity is low, enabling the complete digital baseband processing of a coder and decoder, including voice activity detection, channel coding and decoding and additional channel control, to fit in a single DSP of medium sophistication.

Vocoders

At least one vocoder already existed in 1936 [34], even before PCM was invented by Reeves in 1938. A vocoder is based on the source-filter model of speech production (recall Fig. 1). In the encoder, the model parameters of a speech segment, i.e. the pitch period, the voiced/unvoiced parameter, the gain and the filter parameters, are analysed from segments of typically 20 ms duration, and encoded. The decoder consists of a replica of the model that is controlled by the decoded parameters. As argued in Sec. 2.1, this approach results in very low bit rates. The success of this approach depends on the accuracy and the perceptual relevance of the underlying model. The LPC-10E vocoder of system 9 in Fig. 12 is characterized by the use of an LPC synthesis filter with 10 coefficients [35]. Its bit rate is 2.4 kbit/s.
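The source-filter decoder just described can be sketched as a pitch-period impulse train (voiced) or a white-noise source (unvoiced), scaled by the gain and driving an all-pole LPC synthesis filter. This is a minimal illustration of the model, not the LPC-10E algorithm; the sign convention A(z) = 1 - sum a_k z^-k and all parameter values are assumptions.

```python
import random

def synthesize(n, lpc, gain, voiced, pitch_period):
    """Generate n samples from the source-filter model; lpc = [a1..ap]."""
    mem = [0.0] * len(lpc)                        # filter memory (past outputs)
    out = []
    for i in range(n):
        if voiced:
            e = gain if i % pitch_period == 0 else 0.0   # impulse train source
        else:
            e = gain * random.gauss(0.0, 1.0)            # noise source
        s = e + sum(a * m for a, m in zip(lpc, mem))     # all-pole filter 1/A(z)
        mem = [s] + mem[:-1]
        out.append(s)
    return out

# Voiced case with one first-order coefficient: an exponentially decaying
# response is repeated every pitch period.
frame = synthesize(4, [0.5], 1.0, True, 4)
```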
This kind of vocoder is known for its poor, synthetic speech quality, mainly caused by the incompleteness and oversimplification of the model used, especially in the excitation part. However, interesting new developments are going on in this field, as announced by system 6 in Fig. 12.

The IMBE coder uses spectral analysis based on the Fourier transform of 20 ms segments [36]. The spectrum is divided into a number of bands, and for each band a voiced/unvoiced parameter is determined. In voiced bands, only the amplitudes of pitch harmonics are retained. For unvoiced bands only one amplitude parameter per band is used. The pitch is represented by a single parameter per segment and heavy tracking is applied over several segments, being the main cause of an intrinsic delay of almost 80 ms. This makes the system practically suitable for half-duplex communication only. The bit rate of the INMARSAT-M system is 4.15 kbit/s and the gross bit rate on the satellite link is 6.4 kbit/s. Although the system is very speech-specific and consequently not at all transparent, it can cope with background noise quite well. Its quality exceeds that of the DoD-CELP, which is quite an achievement because this seems to be the first vocoder that outperforms a CELP at that bit rate, while it has a relatively modest complexity.

Future trends

One standard underway is the half-rate GSM standard, expected to be adopted by the ETSI in 1995 [37]. The half-rate GSM channel has a bit rate of 11.4 kbit/s and the net bit rate of the speech coder is 5.6 kbit/s, so 5.8 kbit/s is used for error protection. It incorporates a VSELP-based system with 20 ms segments and 5 ms subsegments and a subsample-resolution adaptive codebook. The excitation rate is 0.35 bit/sample. The performance of this system with respect to speech quality and robustness approaches that of the full-rate system. The conclusion which can be drawn is that at these bit rates the quality indicated by the dashed line in Fig. 12 (the trend) cannot yet be reached.

Another standard is being considered by an ETSI Study Group which is currently looking at the possibility of enhanced-quality full-rate speech coding. Meanwhile, an interesting development is going on in the ITU-T standardization process of an 8 kbit/s coder for the Future Public Land Mobile Telecommunication System (FPLMTS).
This standard is expected to be launched in the near future. Two candidates are involved, one with a 'conjugate structured' fixed codebook (CS-CELP) and the other with an 'algebraic' fixed codebook (ACELP) [38, 39]. Both coders have an intrinsic delay of about 15 ms. Vector quantization is applied to the short-term prediction parameters, requiring less than 20 bits for their representation. This enables a 10 ms update rate without increasing the bit rate as compared to 20 ms segments and 40-bit representation. The speech quality of these coders outperforms that of the reference (system 2 of Fig. 12), even in the case of transmission errors up to 1%. The complexity is in the order of magnitude of one DSP again. The final system is expected to be a combination of the proposed candidates. This system will make all other existing speech coders at bit rates


Vocoder (LPC) Analysis by Variation of Input Parameters and Signals

Vocoder (LPC) Analysis by Variation of Input Parameters and Signals ISCA Journal of Engineering Sciences ISCA J. Engineering Sci. Vocoder (LPC) Analysis by Variation of Input Parameters and Signals Abstract Gupta Rajani, Mehta Alok K. and Tiwari Vebhav Truba College of

More information

Voice mail and office automation

Voice mail and office automation Voice mail and office automation by DOUGLAS L. HOGAN SPARTA, Incorporated McLean, Virginia ABSTRACT Contrary to expectations of a few years ago, voice mail or voice messaging technology has rapidly outpaced

More information

CHAPTER 5. Digitized Audio Telemetry Standard. Table of Contents

CHAPTER 5. Digitized Audio Telemetry Standard. Table of Contents CHAPTER 5 Digitized Audio Telemetry Standard Table of Contents Chapter 5. Digitized Audio Telemetry Standard... 5-1 5.1 General... 5-1 5.2 Definitions... 5-1 5.3 Signal Source... 5-1 5.4 Encoding/Decoding

More information

Adaptive Filters Application of Linear Prediction

Adaptive Filters Application of Linear Prediction Adaptive Filters Application of Linear Prediction Gerhard Schmidt Christian-Albrechts-Universität zu Kiel Faculty of Engineering Electrical Engineering and Information Technology Digital Signal Processing

More information

Pattern Recognition. Part 6: Bandwidth Extension. Gerhard Schmidt

Pattern Recognition. Part 6: Bandwidth Extension. Gerhard Schmidt Pattern Recognition Part 6: Gerhard Schmidt Christian-Albrechts-Universität zu Kiel Faculty of Engineering Institute of Electrical and Information Engineering Digital Signal Processing and System Theory

More information

Speech Enhancement Based On Noise Reduction

Speech Enhancement Based On Noise Reduction Speech Enhancement Based On Noise Reduction Kundan Kumar Singh Electrical Engineering Department University Of Rochester ksingh11@z.rochester.edu ABSTRACT This paper addresses the problem of signal distortion

More information

Digital Communication (650533) CH 3 Pulse Modulation

Digital Communication (650533) CH 3 Pulse Modulation Philadelphia University/Faculty of Engineering Communication and Electronics Engineering Digital Communication (650533) CH 3 Pulse Modulation Instructor: Eng. Nada Khatib Website: http://www.philadelphia.edu.jo/academics/nkhatib/

More information

3GPP TS V5.0.0 ( )

3GPP TS V5.0.0 ( ) TS 26.171 V5.0.0 (2001-03) Technical Specification 3rd Generation Partnership Project; Technical Specification Group Services and System Aspects; Speech Codec speech processing functions; AMR Wideband

More information

NOVEL PITCH DETECTION ALGORITHM WITH APPLICATION TO SPEECH CODING

NOVEL PITCH DETECTION ALGORITHM WITH APPLICATION TO SPEECH CODING NOVEL PITCH DETECTION ALGORITHM WITH APPLICATION TO SPEECH CODING A Thesis Submitted to the Graduate Faculty of the University of New Orleans in partial fulfillment of the requirements for the degree of

More information

CHAPTER 3 Syllabus (2006 scheme syllabus) Differential pulse code modulation DPCM transmitter

CHAPTER 3 Syllabus (2006 scheme syllabus) Differential pulse code modulation DPCM transmitter CHAPTER 3 Syllabus 1) DPCM 2) DM 3) Base band shaping for data tranmission 4) Discrete PAM signals 5) Power spectra of discrete PAM signal. 6) Applications (2006 scheme syllabus) Differential pulse code

More information

ON-LINE LABORATORIES FOR SPEECH AND IMAGE PROCESSING AND FOR COMMUNICATION SYSTEMS USING J-DSP

ON-LINE LABORATORIES FOR SPEECH AND IMAGE PROCESSING AND FOR COMMUNICATION SYSTEMS USING J-DSP ON-LINE LABORATORIES FOR SPEECH AND IMAGE PROCESSING AND FOR COMMUNICATION SYSTEMS USING J-DSP A. Spanias, V. Atti, Y. Ko, T. Thrasyvoulou, M.Yasin, M. Zaman, T. Duman, L. Karam, A. Papandreou, K. Tsakalis

More information

Pulse Code Modulation

Pulse Code Modulation Pulse Code Modulation EE 44 Spring Semester Lecture 9 Analog signal Pulse Amplitude Modulation Pulse Width Modulation Pulse Position Modulation Pulse Code Modulation (3-bit coding) 1 Advantages of Digital

More information

Page 0 of 23. MELP Vocoder

Page 0 of 23. MELP Vocoder Page 0 of 23 MELP Vocoder Outline Introduction MELP Vocoder Features Algorithm Description Parameters & Comparison Page 1 of 23 Introduction Traditional pitched-excited LPC vocoders use either a periodic

More information

1.Explain the principle and characteristics of a matched filter. Hence derive the expression for its frequency response function.

1.Explain the principle and characteristics of a matched filter. Hence derive the expression for its frequency response function. 1.Explain the principle and characteristics of a matched filter. Hence derive the expression for its frequency response function. Matched-Filter Receiver: A network whose frequency-response function maximizes

More information

Fundamental Frequency Detection

Fundamental Frequency Detection Fundamental Frequency Detection Jan Černocký, Valentina Hubeika {cernocky ihubeika}@fit.vutbr.cz DCGM FIT BUT Brno Fundamental Frequency Detection Jan Černocký, Valentina Hubeika, DCGM FIT BUT Brno 1/37

More information

Wideband Speech Coding & Its Application

Wideband Speech Coding & Its Application Wideband Speech Coding & Its Application Apeksha B. landge. M.E. [student] Aditya Engineering College Beed Prof. Amir Lodhi. Guide & HOD, Aditya Engineering College Beed ABSTRACT: Increasing the bandwidth

More information

REAL-TIME IMPLEMENTATION OF A VARIABLE RATE CELP SPEECH CODEC

REAL-TIME IMPLEMENTATION OF A VARIABLE RATE CELP SPEECH CODEC REAL-TIME IMPLEMENTATION OF A VARIABLE RATE CELP SPEECH CODEC Robert Zopf B.A.Sc. Simon Fraser University, 1993 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF

More information

International Journal of Advanced Engineering Technology E-ISSN

International Journal of Advanced Engineering Technology E-ISSN Research Article ARCHITECTURAL STUDY, IMPLEMENTATION AND OBJECTIVE EVALUATION OF CODE EXCITED LINEAR PREDICTION BASED GSM AMR 06.90 SPEECH CODER USING MATLAB Bhatt Ninad S. 1 *, Kosta Yogesh P. 2 Address

More information

ECE 556 BASICS OF DIGITAL SPEECH PROCESSING. Assıst.Prof.Dr. Selma ÖZAYDIN Spring Term-2017 Lecture 2

ECE 556 BASICS OF DIGITAL SPEECH PROCESSING. Assıst.Prof.Dr. Selma ÖZAYDIN Spring Term-2017 Lecture 2 ECE 556 BASICS OF DIGITAL SPEECH PROCESSING Assıst.Prof.Dr. Selma ÖZAYDIN Spring Term-2017 Lecture 2 Analog Sound to Digital Sound Characteristics of Sound Amplitude Wavelength (w) Frequency ( ) Timbre

More information

Waveform Coding Algorithms: An Overview

Waveform Coding Algorithms: An Overview August 24, 2012 Waveform Coding Algorithms: An Overview RWTH Aachen University Compression Algorithms Seminar Report Summer Semester 2012 Adel Zaalouk - 300374 Aachen, Germany Contents 1 An Introduction

More information

COMP 546, Winter 2017 lecture 20 - sound 2

COMP 546, Winter 2017 lecture 20 - sound 2 Today we will examine two types of sounds that are of great interest: music and speech. We will see how a frequency domain analysis is fundamental to both. Musical sounds Let s begin by briefly considering

More information

DEPARTMENT OF DEFENSE TELECOMMUNICATIONS SYSTEMS STANDARD

DEPARTMENT OF DEFENSE TELECOMMUNICATIONS SYSTEMS STANDARD NOT MEASUREMENT SENSITIVE 20 December 1999 DEPARTMENT OF DEFENSE TELECOMMUNICATIONS SYSTEMS STANDARD ANALOG-TO-DIGITAL CONVERSION OF VOICE BY 2,400 BIT/SECOND MIXED EXCITATION LINEAR PREDICTION (MELP)

More information

SOUND SOURCE RECOGNITION AND MODELING

SOUND SOURCE RECOGNITION AND MODELING SOUND SOURCE RECOGNITION AND MODELING CASA seminar, summer 2000 Antti Eronen antti.eronen@tut.fi Contents: Basics of human sound source recognition Timbre Voice recognition Recognition of environmental

More information

Downloaded from 1

Downloaded from  1 VII SEMESTER FINAL EXAMINATION-2004 Attempt ALL questions. Q. [1] How does Digital communication System differ from Analog systems? Draw functional block diagram of DCS and explain the significance of

More information

Pitch Period of Speech Signals Preface, Determination and Transformation

Pitch Period of Speech Signals Preface, Determination and Transformation Pitch Period of Speech Signals Preface, Determination and Transformation Mohammad Hossein Saeidinezhad 1, Bahareh Karamsichani 2, Ehsan Movahedi 3 1 Islamic Azad university, Najafabad Branch, Saidinezhad@yahoo.com

More information

A CLOSER LOOK AT THE REPRESENTATION OF INTERAURAL DIFFERENCES IN A BINAURAL MODEL

A CLOSER LOOK AT THE REPRESENTATION OF INTERAURAL DIFFERENCES IN A BINAURAL MODEL 9th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, -7 SEPTEMBER 7 A CLOSER LOOK AT THE REPRESENTATION OF INTERAURAL DIFFERENCES IN A BINAURAL MODEL PACS: PACS:. Pn Nicolas Le Goff ; Armin Kohlrausch ; Jeroen

More information

(i) Understanding the basic concepts of signal modeling, correlation, maximum likelihood estimation, least squares and iterative numerical methods

(i) Understanding the basic concepts of signal modeling, correlation, maximum likelihood estimation, least squares and iterative numerical methods Tools and Applications Chapter Intended Learning Outcomes: (i) Understanding the basic concepts of signal modeling, correlation, maximum likelihood estimation, least squares and iterative numerical methods

More information

Speech Synthesis using Mel-Cepstral Coefficient Feature

Speech Synthesis using Mel-Cepstral Coefficient Feature Speech Synthesis using Mel-Cepstral Coefficient Feature By Lu Wang Senior Thesis in Electrical Engineering University of Illinois at Urbana-Champaign Advisor: Professor Mark Hasegawa-Johnson May 2018 Abstract

More information

Digital Audio. Lecture-6

Digital Audio. Lecture-6 Digital Audio Lecture-6 Topics today Digitization of sound PCM Lossless predictive coding 2 Sound Sound is a pressure wave, taking continuous values Increase / decrease in pressure can be measured in amplitude,

More information

Adaptive Filters Linear Prediction

Adaptive Filters Linear Prediction Adaptive Filters Gerhard Schmidt Christian-Albrechts-Universität zu Kiel Faculty of Engineering Institute of Electrical and Information Engineering Digital Signal Processing and System Theory Slide 1 Contents

More information

Accurate Delay Measurement of Coded Speech Signals with Subsample Resolution

Accurate Delay Measurement of Coded Speech Signals with Subsample Resolution PAGE 433 Accurate Delay Measurement of Coded Speech Signals with Subsample Resolution Wenliang Lu, D. Sen, and Shuai Wang School of Electrical Engineering & Telecommunications University of New South Wales,

More information

Chapter 2: Digitization of Sound

Chapter 2: Digitization of Sound Chapter 2: Digitization of Sound Acoustics pressure waves are converted to electrical signals by use of a microphone. The output signal from the microphone is an analog signal, i.e., a continuous-valued

More information

CODING TECHNIQUES FOR ANALOG SOURCES

CODING TECHNIQUES FOR ANALOG SOURCES CODING TECHNIQUES FOR ANALOG SOURCES Prof.Pratik Tawde Lecturer, Electronics and Telecommunication Department, Vidyalankar Polytechnic, Wadala (India) ABSTRACT Image Compression is a process of removing

More information

UNIVERSITY OF SURREY LIBRARY

UNIVERSITY OF SURREY LIBRARY 7385001 UNIVERSITY OF SURREY LIBRARY All rights reserved I N F O R M A T I O N T O A L L U S E R S T h e q u a l i t y o f t h i s r e p r o d u c t i o n is d e p e n d e n t u p o n t h e q u a l i t

More information

Signal Processing for Speech Applications - Part 2-1. Signal Processing For Speech Applications - Part 2

Signal Processing for Speech Applications - Part 2-1. Signal Processing For Speech Applications - Part 2 Signal Processing for Speech Applications - Part 2-1 Signal Processing For Speech Applications - Part 2 May 14, 2013 Signal Processing for Speech Applications - Part 2-2 References Huang et al., Chapter

More information

General outline of HF digital radiotelephone systems

General outline of HF digital radiotelephone systems Rec. ITU-R F.111-1 1 RECOMMENDATION ITU-R F.111-1* DIGITIZED SPEECH TRANSMISSIONS FOR SYSTEMS OPERATING BELOW ABOUT 30 MHz (Question ITU-R 164/9) Rec. ITU-R F.111-1 (1994-1995) The ITU Radiocommunication

More information

Chapter 4 SPEECH ENHANCEMENT

Chapter 4 SPEECH ENHANCEMENT 44 Chapter 4 SPEECH ENHANCEMENT 4.1 INTRODUCTION: Enhancement is defined as improvement in the value or Quality of something. Speech enhancement is defined as the improvement in intelligibility and/or

More information

Linguistic Phonetics. Spectral Analysis

Linguistic Phonetics. Spectral Analysis 24.963 Linguistic Phonetics Spectral Analysis 4 4 Frequency (Hz) 1 Reading for next week: Liljencrants & Lindblom 1972. Assignment: Lip-rounding assignment, due 1/15. 2 Spectral analysis techniques There

More information

COMPRESSIVE SAMPLING OF SPEECH SIGNALS. Mona Hussein Ramadan. BS, Sebha University, Submitted to the Graduate Faculty of

COMPRESSIVE SAMPLING OF SPEECH SIGNALS. Mona Hussein Ramadan. BS, Sebha University, Submitted to the Graduate Faculty of COMPRESSIVE SAMPLING OF SPEECH SIGNALS by Mona Hussein Ramadan BS, Sebha University, 25 Submitted to the Graduate Faculty of Swanson School of Engineering in partial fulfillment of the requirements for

More information

Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech

Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech INTERSPEECH 5 Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech M. A. Tuğtekin Turan and Engin Erzin Multimedia, Vision and Graphics Laboratory,

More information

Project 0: Part 2 A second hands-on lab on Speech Processing Frequency-domain processing

Project 0: Part 2 A second hands-on lab on Speech Processing Frequency-domain processing Project : Part 2 A second hands-on lab on Speech Processing Frequency-domain processing February 24, 217 During this lab, you will have a first contact on frequency domain analysis of speech signals. You

More information

Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications

Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications Brochure More information from http://www.researchandmarkets.com/reports/569388/ Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications Description: Multimedia Signal

More information

A 600 BPS MELP VOCODER FOR USE ON HF CHANNELS

A 600 BPS MELP VOCODER FOR USE ON HF CHANNELS A 600 BPS MELP VOCODER FOR USE ON HF CHANNELS Mark W. Chamberlain Harris Corporation, RF Communications Division 1680 University Avenue Rochester, New York 14610 ABSTRACT The U.S. government has developed

More information

New Features of IEEE Std Digitizing Waveform Recorders

New Features of IEEE Std Digitizing Waveform Recorders New Features of IEEE Std 1057-2007 Digitizing Waveform Recorders William B. Boyer 1, Thomas E. Linnenbrink 2, Jerome Blair 3, 1 Chair, Subcommittee on Digital Waveform Recorders Sandia National Laboratories

More information

(Refer Slide Time: 3:11)

(Refer Slide Time: 3:11) Digital Communication. Professor Surendra Prasad. Department of Electrical Engineering. Indian Institute of Technology, Delhi. Lecture-2. Digital Representation of Analog Signals: Delta Modulation. Professor:

More information

Finite Word Length Effects on Two Integer Discrete Wavelet Transform Algorithms. Armein Z. R. Langi

Finite Word Length Effects on Two Integer Discrete Wavelet Transform Algorithms. Armein Z. R. Langi International Journal on Electrical Engineering and Informatics - Volume 3, Number 2, 211 Finite Word Length Effects on Two Integer Discrete Wavelet Transform Algorithms Armein Z. R. Langi ITB Research

More information

IMPLEMENTATION OF G.726 ITU-T VOCODER ON A SINGLE CHIP USING VHDL

IMPLEMENTATION OF G.726 ITU-T VOCODER ON A SINGLE CHIP USING VHDL IMPLEMENTATION OF G.726 ITU-T VOCODER ON A SINGLE CHIP USING VHDL G.Murugesan N. Ramadass Dr.J.Raja paul Perinbum School of ECE Anna University Chennai-600 025 Gm1gm@rediffmail.com ramadassn@yahoo.com

More information

Introduction to Speech Coding. Nimrod Peleg Update: Oct. 2009

Introduction to Speech Coding. Nimrod Peleg Update: Oct. 2009 Introduction to Speech Coding Nimrod Peleg Update: Oct. 2009 Goals and Tradeoffs Reduce bitrate while preserving needed quality Tradeoffs: Quality (Broadcast, Toll, Communication, Synthetic) Bit Rate Complexity

More information

Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay

Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 03 Quantization, PCM and Delta Modulation Hello everyone, today we will

More information

Voice Transmission --Basic Concepts--

Voice Transmission --Basic Concepts-- Voice Transmission --Basic Concepts-- Voice---is analog in character and moves in the form of waves. 3-important wave-characteristics: Amplitude Frequency Phase Telephone Handset (has 2-parts) 2 1. Transmitter

More information

Proceedings of Meetings on Acoustics

Proceedings of Meetings on Acoustics Proceedings of Meetings on Acoustics Volume 19, 213 http://acousticalsociety.org/ ICA 213 Montreal Montreal, Canada 2-7 June 213 Signal Processing in Acoustics Session 2pSP: Acoustic Signal Processing

More information

Voice Codec for Floating Point Processor. Hans Engström & Johan Ross

Voice Codec for Floating Point Processor. Hans Engström & Johan Ross Voice Codec for Floating Point Processor Hans Engström & Johan Ross LiTH-ISY-EX--08/3782--SE Linköping 2008 Voice Codec for Floating Point Processor Master Thesis In Electronics Design, Dept. Of Electrical

More information

International Journal of Modern Trends in Engineering and Research e-issn No.: , Date: 2-4 July, 2015

International Journal of Modern Trends in Engineering and Research   e-issn No.: , Date: 2-4 July, 2015 International Journal of Modern Trends in Engineering and Research www.ijmter.com e-issn No.:2349-9745, Date: 2-4 July, 2015 Analysis of Speech Signal Using Graphic User Interface Solly Joy 1, Savitha

More information