Chapter IV THEORY OF CELP CODING

4.1 Introduction

Waveform coders fail to produce high quality speech at bit rates lower than 16 kbps. Source coders, such as LPC vocoders, operate at rates as low as 2 kbps but fail to provide the speech quality appropriate for commercial telephone applications over wireline as well as wireless media. Hybrid coders based on Analysis by Synthesis (AbS) speech coding produce toll quality speech at rates below 10 kbps. The codec chosen for this study is a time domain hybrid coder based on the Algebraic Code Excited Linear Prediction (ACELP) architecture. The basic structure is that of CELP, with the secondary excitation searched from a memoryless algebraic codebook.

In this chapter the basic theory of the standard CELP based speech coding algorithm is described. Commencing with generalized AbS coding, the analytical tools are briefly discussed. Various modifications are possible in the implementation of the different sections of the CELP based speech coding algorithm, in order to maintain good quality in the reconstructed speech while keeping the computational complexity low and the coding spectrally efficient.

4.2 Generalized AbS-LPC Speech Coding

In AbS-LPC coding schemes [6] [46], the locally synthesized signal is compared with the original speech signal, and the coder parameters are selected so as to produce the minimum mean square error between the original speech signal and the reconstructed speech signal. The performance of this scheme is improved by the closed loop optimization procedure used for parameter estimation. The basic structure of the AbS-LPC scheme [47] is illustrated in Figure 4.1. Functionally, the entire AbS-LPC scheme can be divided into three sections:

(i) Time varying filter
(ii) Excitation signal generation
(iii) Error minimization procedure

Figure No. 4.1: Generalized AbS-LPC scheme (coder blocks: original speech, excitation signal generator, time varying filter, error minimization; decoder blocks: excitation signal generator, time varying filter, synthesized speech)

4.2.1 Time Varying Filters

The time varying filter in the model is a combination of two linear predictors, namely the short term predictor (STP, or LPC filter) and the long term predictor (LTP, or pitch filter). The short term prediction filter exploits the correlation between adjacent speech samples. The long term predictor removes the correlation between distant samples, normally one pitch period or a multiple of the pitch period apart. The STP filter is generally implemented as a time varying linear prediction filter, whereas the LTP is generally implemented as an adaptive codebook in the excitation synthesis and search.

4.2.2 Excitation Signal Generator

The excitation signal is the input to the time varying filter (normally the LTP) and is the most important part of the AbS scheme. The various AbS-LPC schemes are distinguished by the way in which the excitation signal vector is represented. In CELP based speech coding algorithms, the excitation signal is chosen from a predefined codebook. The excitation signal vector can be selected from a single codebook, or it may be the sum of two sub-vectors from two different excitation sources. The majority of the reported AbS-LPC schemes use two codebooks, one fixed and one adaptive. The spectrally efficient speech coding algorithm developed in this study uses a fixed algebraic codebook for the fixed excitation and a pitch adaptive codebook to generate the excitation for the LP synthesis.

4.2.3 Error Minimization Procedure

Several error criteria can be minimized, such as the absolute error, the maximum error or the mean square error; the most commonly used criterion is the minimum mean square error (MMSE). The mean square error between the original signal s(n) and the synthesized signal \hat{s}(n) over N samples is

$$\mathrm{MSE} = \frac{1}{N} \sum_{n=0}^{N-1} \left( s(n) - \hat{s}(n) \right)^2 \tag{4.1}$$

A perceptually weighted mean square error can also be used as the criterion in the error analysis.
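The mean square error of Eq. (4.1) and its use as a closed loop selection criterion can be illustrated with a minimal sketch. The function names `mse` and `select_best` are hypothetical and chosen only for this illustration; the candidates stand for locally synthesized signals produced with different coder parameters.

```python
def mse(original, synthesized):
    """Mean square error of Eq. (4.1) between two equal-length signals."""
    if len(original) != len(synthesized):
        raise ValueError("signals must have the same length")
    n = len(original)
    return sum((s - s_hat) ** 2 for s, s_hat in zip(original, synthesized)) / n


def select_best(original, candidates):
    """Return (index, error) of the candidate signal with the smallest MSE.

    Each candidate is assumed to be a locally synthesized version of
    `original`; this mirrors the analysis-by-synthesis idea only in outline.
    """
    errors = [mse(original, c) for c in candidates]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors[best]
```

A perceptually weighted variant would apply a weighting filter to the error before squaring; that step is omitted here.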

4.2.4 Types Of AbS Speech Coders

There are various implementations of speech coding algorithms based on analysis by synthesis speech coding [48], as listed below:

SELP - Self excited linear predictor
MPLPC - Multi pulse excited linear predictor
RPE-LPC - Regular pulse excited linear predictor
CELP - Code excited linear predictor

These differ mainly in the type of excitation used in the AbS scheme. Within each group, however, there can be internal variations in the design and implementation of the coding algorithm. The theoretical details of the CELP based speech coding algorithm are discussed with specific emphasis on Algebraic CELP, as it is the technique used in the present study.

4.3 CELP: Theoretical Aspects

The theoretical aspects of the CELP based implementation of the AbS speech coding scheme are presented as follows.

4.3.1 Basic Principle

The redundancies in the speech signal are almost entirely removed by the short term and long term prediction, and the residual has very little correlation left in it. An excitation that best synthesizes the speech is then searched, and the corresponding codebook index and gain are selected from the fixed codebook. The optimum codebook index is selected by the MMSE criterion between the locally synthesized speech and the original speech signal. Atal and Schroeder first proposed CELP in 1984 [6], but only recently has it received attention as an algorithm for spectrally efficient speech coding. The standard model of CELP is illustrated in Figure 4.2.

Figure No. 4.2: Block diagram of standard CELP coding algorithm (blocks: input speech, windowing and LP analysis, zero excitation, LTP, LP synthesis, codebook, select index and gain)

4.3.2 Operation Of The CELP Algorithm

The illustrated CELP coder operates as follows:

1. The original speech signal is partitioned into frames of 10 ms to 20 ms and LP analysis is performed. The LP model parameters are estimated using one of the various LP analysis methods. The memory of the STP is flushed out before further processing.

2. The LTP analysis is then performed on the target signal, which depends on the method used. The target signal is generally the LP residual obtained by LP inverse filtering in the open loop method (OLM) or modified open loop method (MOLM), whereas the original speech is used as the target signal in the closed loop method (CLM) [22]. The pitch delay and pitch gain are the two LTP model parameters estimated in the LTP analysis.

3. The new target for the fixed codebook is then obtained by removing the STP and LTP contributions from the original speech signal. The secondary excitation is then determined by an exhaustive search of the fixed codebook, the selection criterion being the MMSE. The codebook index and codebook gain are the selected parameters of the fixed codebook.

4. The decoding algorithm for CELP is depicted in Figure 4.3. At the decoder the excitation is constructed from the LTP parameters and the codebook parameters. The synthesized excitation is then fed to the LP synthesis filter. The excitation is usually updated on sub-multiples of the LP analysis frame.

Figure No. 4.3: Block diagram of standard CELP decoding algorithm (blocks: codebook parameters, STP parameters, LP synthesis, synthesized speech)
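Steps 3 and 4 above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in this study: all function names are hypothetical, the LP filter is taken as A(z) = 1 + sum of a_k z^-k (a sign convention assumed here), the pitch delay is an integer, and no perceptual weighting is applied. The search relies on the fact that, for a given code vector with its optimal gain, minimizing the squared error is equivalent to maximizing the squared correlation between the target and the filtered code vector divided by the energy of the filtered code vector.

```python
def filter_codevector(codevector, impulse_response):
    """Zero-state response of the LP synthesis filter to a code vector."""
    n = len(codevector)
    out = [0.0] * n
    for j, c in enumerate(codevector):
        if c == 0:
            continue  # zero samples contribute nothing
        for i in range(j, min(n, j + len(impulse_response))):
            out[i] += c * impulse_response[i - j]
    return out


def search_codebook(target, codebook, impulse_response):
    """Exhaustive fixed codebook search (step 3): return (index, gain)."""
    best_index, best_gain, best_score = -1, 0.0, -1.0
    for k, codevector in enumerate(codebook):
        y = filter_codevector(codevector, impulse_response)
        corr = sum(t * v for t, v in zip(target, y))
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        score = corr * corr / energy  # larger score means smaller error
        if score > best_score:
            best_index, best_gain, best_score = k, corr / energy, score
    return best_index, best_gain


def adaptive_contribution(past_excitation, pitch_delay, length):
    """LTP part: repeat the excitation found `pitch_delay` samples back.

    Assumes 1 <= pitch_delay <= len(past_excitation).
    """
    buf = list(past_excitation)
    start = len(buf)
    for _ in range(length):
        buf.append(buf[-pitch_delay])
    return buf[start:]


def lp_synthesis(excitation, lpc, memory):
    """All-pole filter s(n) = e(n) - sum_k a_k s(n-k); needs len(memory) >= len(lpc)."""
    state = list(memory)  # most recent output sample last
    out = []
    for e in excitation:
        acc = e
        for k, a_k in enumerate(lpc, start=1):
            acc -= a_k * state[-k]
        out.append(acc)
        state.append(acc)
    return out, state[len(state) - len(lpc):]


def decode_subframe(past_excitation, pitch_delay, pitch_gain,
                    codevector, code_gain, lpc, memory):
    """Decoder (step 4): build the excitation, then run LP synthesis."""
    adaptive = adaptive_contribution(past_excitation, pitch_delay, len(codevector))
    excitation = [pitch_gain * a + code_gain * c
                  for a, c in zip(adaptive, codevector)]
    speech, memory = lp_synthesis(excitation, lpc, memory)
    return speech, excitation, memory
```

In use, the caller would append the returned excitation to `past_excitation` and pass the returned `memory` to the next subframe.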

4.3.3 Secondary Excitation Codebook

The vectors contained in the codebook are a very important part of CELP based speech coding algorithms. They are used to generate the excitation for the time varying filter, which synthesizes the speech at the decoder end. The contribution of the secondary excitation is most useful during the unvoiced or inactive portions of the speech, since for the voiced portions the LTP provides the larger contribution. The population of the codebook with excitation vectors and the search procedure for the excitation vectors are the two most important issues in the secondary excitation of CELP based speech coding. The requirements of quality, lower search complexity and reduced memory for the storage of the codebook excitation vectors have resulted in different types of secondary excitation codebooks. A lot of research has been focused on reducing the complexity of the speech coding algorithm by using different codebook architectures and efficient search procedures. As a consequence, a variety of codebook structures has been developed. Some of the codebooks used are:

1. Sparse codebook
2. Ternary codebook
3. Overlapping codebook
4. Binary pulse excited codebook
5. Algebraic codebook

Secondary excitation codebooks can be searched faster if they are structured; consequently, CELP based coding uses structured codebooks. There are three types of structured codebooks:

1. Sparse codebook
2. Ternary codebook
3. Algebraic codebook

4.3.3.1 Sparse codebook

A zero mean, unit variance Gaussian random process is used to populate this type of codebook. Samples are set to zero whenever their absolute values are less than some predefined threshold. Code vectors of this type are able to produce natural sounding reconstructed speech. The design of the codebook, the larger search complexity and the large storage needed are the limitations in the use of this codebook.

4.3.3.2 Ternary codebook

A ternary excitation codebook vector is a sparse excitation codebook vector in which the nonzero values are replaced by either -1 (value < 0) or +1 (value > 0). This results in a code vector that consists of only three possible values. The computational complexity is reduced because multiplications reduce to additions, as the magnitude of the code vector elements is either zero or one.

4.3.3.3 Algebraic codebook

This codebook uses algebraic codes: the excitation vectors are derived from interlaced permutation codes (IPC). Earlier algebraic codebook schemes used binary codes to populate the codebook vectors. In the IPC the vectors contain a few nonzero pulses with predefined sets of positions, and the pulses are allowed to take a fixed amplitude, either +1 or -1. Each pulse has a set of possible positions, distinct from the positions of the other pulses. The excitation code vector is determined by the positions and amplitudes of the nonzero pulses. This codebook structure has several merits. Firstly, it does not require storage at the encoder or the decoder, as the codebook index defines the code vector completely. Secondly, it provides inherent robustness against channel errors. Finally, and most importantly, the algebraic codebook offers better search efficiency.
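The three structured codebooks of Sections 4.3.3.1 to 4.3.3.3 can be contrasted with a short sketch. The function names, the threshold, and the algebraic layout (a 32-sample vector with 4 interleaved tracks of 8 positions, each pulse coded with 3 position bits and 1 sign bit) are assumptions made only for illustration and are not the layout used in this study or in any particular standard.

```python
import random


def sparse_codevector(length, threshold, rng):
    """Center-clipped Gaussian vector: small samples are set to zero."""
    samples = [rng.gauss(0.0, 1.0) for _ in range(length)]
    return [x if abs(x) >= threshold else 0.0 for x in samples]


def ternary_codevector(sparse):
    """Replace each nonzero sample by its sign, giving values in {-1, 0, +1}."""
    return [0 if x == 0 else (1 if x > 0 else -1) for x in sparse]


def algebraic_codevector(index, length=32, tracks=4):
    """Build a code vector directly from its index, with no stored table.

    Hypothetical layout: pulse t may only sit at positions t, t + tracks,
    t + 2 * tracks, ... so the position sets of the pulses are disjoint.
    """
    positions_per_track = length // tracks
    if length % tracks or positions_per_track & (positions_per_track - 1):
        raise ValueError("positions per track must be a power of two")
    bits = (positions_per_track - 1).bit_length()  # position bits per pulse
    field_mask = (1 << (bits + 1)) - 1             # position bits + sign bit
    vec = [0] * length
    for track in range(tracks):
        field = (index >> (track * (bits + 1))) & field_mask
        sign = -1 if field & 1 else 1
        slot = field >> 1
        vec[track + tracks * slot] = sign
    return vec
```

Because `algebraic_codevector` derives the pulses from the index alone, nothing needs to be stored at either end, which is the first merit noted above; in this hypothetical layout a bit error in the index affects only the one pulse whose field it falls in.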

4.3.4 Codebook Search

Most of the computational complexity results from the exhaustive search of the codebook vectors. To find the optimum code vector in the entire codebook, an exhaustive search of the codebook is performed. The search criterion is the minimum mean square error between the synthesized and the original speech. The mean square error can be minimized by maximizing the term T_k, given by

$$T_k = \frac{\left( X^{T} H c_k \right)^2}{\varepsilon_k} \tag{4.2}$$

where c_k is the code vector and \varepsilon_k is the energy of the filtered code vector. This can be expressed in an alternative form as

$$T_k = \frac{\left( X^{T} H c_k \right)^2}{c_k^{T} H^{T} H c_k} \tag{4.3}$$

where X is the target vector and H is the lower triangular matrix of the impulse response of the STP synthesis filter.

4.4 CELP Implementation Issues

From the above discussion it is clear that the computation can be broken down into three blocks:

(i) LPC analysis or STP
(ii) Pitch analysis or LTP
(iii) Codebook search

The LPC analysis and the LTP analysis have already been explained in Chapter II and need no repetition here. The first issue in CELP is the complexity resulting from the exhaustive search of the fixed codebook (time complexity) and the storage of the code vectors (space complexity). A lot of research has been focused on reducing the complexity of the speech coding algorithm by using different codebook architectures and efficient search procedures. As a consequence, a variety of codebook structures has been developed.

The second major issue in CELP based speech coding is the efficient and transparent quantization of the LSF parameters, so as to encode the LP parameters into as few bits as possible while keeping the computational complexity low. The quantization of the LSF parameters has been discussed in Chapter III.

The last issue is the search complexity of the pitch analysis, or adaptive codebook search. An exhaustive search over the entire range of the pitch delay has to be carried out in order to estimate the pitch delay. Much attention is given to efficient pitch search algorithms, as the search is computationally very complex. A pre-selection based pitch lag search technique has been implemented in this work.

4.5 Performance Evaluation Of The Speech Coders

A speech coding algorithm is evaluated on the basis of its bit rate, the quality of the reconstructed speech, the complexity of the algorithm, the algorithmic delay and the robustness to channel errors. In general, high quality speech coding at a low bit rate is achieved by algorithms of high complexity and hence longer algorithmic delay. The quality of the reconstructed speech has to be evaluated in clean speech conditions as well as with speech corrupted by background noise. Moreover, in some applications the performance of the speech coding algorithm has to be checked for non-speech signals such as DTMF (dual tone multifrequency) and for codecs operating in tandem [17].

For digital communication of speech, the quality is classified into four general categories:

Broadcast
Network or toll
Communication
Synthetic

Broadcast quality refers to high quality "commentary" speech, generally achieved at rates above 64 kbps. Toll quality refers to quality comparable to that of classical analog speech communication (200 Hz to 3300 Hz). Toll quality can be achieved at the mid range of data rates. Communication quality implies highly intelligible speech that may be slightly degraded in quality but remains natural and allows the speaker to be recognized. Communication quality can be achieved at rates above 4.8 kbps. Synthetic quality is intelligible but can be unnatural, without speaker recognizability. Speech coders operating below 4.8 kbps can generate synthetic quality. The quality of the reconstructed speech, in terms of these four classes, can be quantified by either objective or subjective measures.

4.5.1 Objective measure

The signal to noise ratio (SNR) is one of the most popular and common objective measures for evaluating the quality of a compression algorithm. It is a long term measure of the accuracy of the reconstructed speech. SNR is the ratio of the power of the speech signal over N samples to the power of the reconstruction error. It can be expressed as

$$\mathrm{SNR} = \frac{\sum_{n=0}^{N-1} s^2(n)}{\sum_{n=0}^{N-1} \left( s(n) - \hat{s}(n) \right)^2} \tag{4.4}$$

where \hat{s}(n) is the reconstructed speech and s(n) is the original speech. Temporal variation in the quality of the reconstructed speech can be better evaluated by the segmental SNR (SEGSNR), which is given by

$$\mathrm{SEGSNR} = \frac{10}{L} \sum_{i=0}^{L-1} \log_{10} \frac{\sum_{n=0}^{N-1} s^2(iN+n)}{\sum_{n=0}^{N-1} \left( s(iN+n) - \hat{s}(iN+n) \right)^2} \tag{4.5}$$

where N is the length of a segment and L is the number of segments. As the averaging operation occurs after the logarithm, SEGSNR penalizes more heavily a speech coding algorithm whose performance is variable.

4.5.2 Subjective Measure

The objective measures discussed above are often sensitive to both gain and delay variations and do not account for the perceptual properties of the ear. The selection of most low and medium bit rate coders is determined by perceptual criteria; therefore subjective evaluation [49] [50] is required. There are a number of ways to evaluate the performance subjectively, but the most popular method is the mean opinion score (MOS). The MOS measure is widely used to quantify the subjective quality of the speech reconstructed by the coding algorithm, as well as that of the original speech. The MOS test usually involves a number of listeners, who are instructed to rate the quality of the speech on a five level scale, as given in Table 4.1.

MOS scale    Speech quality
1            Bad
2            Poor
3            Fair
4            Good
5            Excellent

Table No. 4.1: Mean opinion scores

The MOS rating is obtained by averaging the individual scores. The MOS range relates to speech quality as follows:

MOS from 4.0 to 4.5 implies network quality.
MOS from 3.5 to 4.0 implies communication quality.
MOS from 2.5 to 3.5 implies synthetic quality.
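The objective measures of Section 4.5.1 can be sketched in a few lines. The function names `snr` and `segsnr` are hypothetical. `snr` returns the plain ratio of Eq. (4.4), while `segsnr` follows Eq. (4.5) and therefore returns a value in dB. The sketch assumes that every segment has nonzero signal energy and nonzero error; a practical implementation would need to guard against silent or perfectly reconstructed segments.

```python
import math


def snr(original, reconstructed):
    """Ratio of signal energy to reconstruction error energy, Eq. (4.4)."""
    signal = sum(s * s for s in original)
    noise = sum((s - r) ** 2 for s, r in zip(original, reconstructed))
    return signal / noise


def segsnr(original, reconstructed, segment_length):
    """Segmental SNR in dB, Eq. (4.5): average of per-segment log ratios.

    Trailing samples that do not fill a whole segment are ignored.
    """
    num_segments = len(original) // segment_length
    total = 0.0
    for i in range(num_segments):
        start = i * segment_length
        stop = start + segment_length
        total += math.log10(snr(original[start:stop], reconstructed[start:stop]))
    return 10.0 * total / num_segments
```

Because the logarithm is taken per segment before averaging, a segment with poor SNR lowers the result more than it would lower the long term SNR.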