Speech Coding using Linear Prediction

Jesper Kjær Nielsen
Aalborg University and Bang & Olufsen
jkn@es.aau.dk
September 10, 2015

1 Background

Speech is generated when air is pushed from the lungs through the vocal tract. The produced speech is typically divided into voiced and unvoiced sounds. If the vocal cords in the vocal tract vibrate rapidly when air is pushed through them, the sound is voiced. Examples of voiced sounds are vowels such as a, o, and i. On the other hand, if the vocal cords are constantly open, the sound is unvoiced. Examples of unvoiced sounds are p, s, and q. Much more about human speech production can be found in [1].

The two kinds of speech sounds can be used as sources in the mathematical model of speech production in Fig. 1. In the model, either an impulse train or a white noise process is filtered through an all-pole filter with system response H^{-1}(z). The impulse train is the source signal for the voiced speech sounds, and the white noise is the source signal for the unvoiced speech sounds. The all-pole filter models how the cavities in the throat, mouth, and nose as well as the position of the tongue change the spectral envelope of the pulse train and the white noise process. For example, if a person has a cold, the cavities in the nose change, and this changes the spectral envelope of the produced speech. A method called linear prediction [2-4] can be used to estimate the filter coefficients of this all-pole filter from a recorded segment of speech, and this is exploited in many practical applications such as speech compression in digital cellular technology and voice over IP.

To understand why the mathematical model in Fig. 1 is so useful for speech compression, consider a simple telephone call between Peter and Karen, say. When Peter is speaking, the microphone in the telephone converts the pressure variations (Peter's voice) into voltage variations.
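The source-filter model of Fig. 1 can be sketched in a few lines of code. The sketch below is in Python for illustration (the project itself suggests MATLAB), and the filter coefficients, pitch period, and segment length are illustrative assumptions, not values from the project:

```python
import numpy as np

def synthesize(excitation, a):
    """Filter an excitation through the all-pole filter 1 / (1 - sum_i a_i z^-i),
    where `a` holds the filter coefficients a_1, ..., a_p of the speech model."""
    p = len(a)
    x = np.zeros(len(excitation))
    for n in range(len(excitation)):
        # Current output = excitation plus a weighted sum of past outputs.
        past = [x[n - i] if n - i >= 0 else 0.0 for i in range(1, p + 1)]
        x[n] = excitation[n] + np.dot(a, past)
    return x

rng = np.random.default_rng(0)
fs = 8000                    # sampling frequency in Hz (as in the text)
N = fs // 50                 # one 20 ms segment = 160 samples
a = np.array([0.5, -0.3])    # illustrative (stable) filter coefficients

# Voiced source: an impulse train with a pitch period of 80 samples (100 Hz).
voiced_source = np.zeros(N)
voiced_source[::80] = 1.0

# Unvoiced source: a white Gaussian noise process.
unvoiced_source = rng.standard_normal(N)

voiced = synthesize(voiced_source, a)
unvoiced = synthesize(unvoiced_source, a)
```

Switching the excitation between the impulse train and the noise while keeping the same filter is exactly the voiced/unvoiced distinction made by the model.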
Both the pressures and the voltages are continuous in time and in amplitude. In the telephone, an analogue-to-digital converter (ADC) measures the value of the voltage at a uniform rate called the sampling frequency. One such value is typically called a sample. For telephone applications, this sampling frequency is typically 8000 Hz. Moreover, the measured voltage is also rounded to the nearest value on a grid so that the measured voltages can be represented with a fixed number of bits. This rounding is called quantisation. If we assume that 8 bits are used to represent the value
of each measured value of the voltage, the total bitrate b is

    b = 8 kHz x 8 bits = 64 kbit/s.    (1)

Figure 1: A popular speech model. The source, either a pulse train (voiced speech) or white noise (unvoiced speech), is filtered through the all-pole filter H^{-1}(z) to produce the speech x(n).

Thus, Peter's telephone has so far converted the pressure variations generated by Peter's voice into a sampled and quantised signal which we denote as x(n), where n is the time index or sample number. To simplify things, the speech production is often modelled so that it generates x(n) directly. This is also done in Fig. 1. In principle, Peter's phone could now transmit the sampled and quantised speech waveform to Karen's phone at a bitrate of 64 kbit/s. However, the model in Fig. 1 can be used to compress the speech data so that the bitrate is reduced by a significant amount. The compression can be performed in the following way.

1. Divide the speech signal into small segments of 20 ms, say.
2. For each segment, estimate the filter coefficients of the all-pole filter from the segment of speech. Typically, approximately 12 filter coefficients are used in the filter.
3. Determine whether the speech segment is voiced or unvoiced.
4. If the segment is unvoiced, estimate the variance of the white noise process responsible for generating the speech signal. If the signal is voiced, estimate the amplitude and the distance between the pulses in the pulse train.
5. Transmit the estimated filter coefficients, the speech type, and the source signal parameters.
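The bit budget implied by the five steps above can be checked with a short calculation (a Python sketch; the assumption that two 8-bit source-signal parameters are sent per segment is made here to match the 112-bit figure quoted below):

```python
fs = 8000            # sampling frequency in Hz
bits_per_sample = 8  # quantisation resolution
segment_ms = 20      # segment length in milliseconds
p = 12               # number of all-pole filter coefficients

samples_per_segment = fs * segment_ms // 1000              # 160 samples
uncompressed_bits = samples_per_segment * bits_per_sample  # 1280 bits

# Compressed: 12 coefficients + 2 source parameters, each quantised to 8 bits.
num_source_params = 2
compressed_bits = (p + num_source_params) * bits_per_sample  # 112 bits

print(uncompressed_bits, compressed_bits, uncompressed_bits / compressed_bits)
```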
At a sampling frequency of 8 kHz, a speech segment of 20 ms corresponds to 160 samples. For a resolution of 8 bits/sample, 1280 bits must therefore be used to represent this speech segment if no compression is used. If we instead compress the 12 filter coefficients and the source signal parameters (also with 8 bits), only 112 bits are required for the speech segment. Thus, we achieve a compression factor of more than 10. The compression scheme described above is illustrated in Fig. 2.

Figure 2: Block diagram of a generic speech coding application.

In practical telephone applications, the source signals are encoded in a more complex way than described above. However, state-of-the-art speech coders such as code excited linear prediction (CELP) coders are based on the same principles and the same model as above and achieve a bit rate of approximately 4.8 kbit/s.

2 Problem

In the project, we will focus on estimating the filter coefficients of the all-pole filter from the speech signal so that the residual energy e^2(n) averaged over the whole segment of data is as small as possible. This problem is the so-called linear prediction problem, and we can formulate it mathematically in the following way. Assume that a speech segment consists of N data points which we model as

    x(n) = a_1 x(n-1) + a_2 x(n-2) + ... + a_p x(n-p) + e(n)    (2)
         = sum_{i=1}^{p} a_i x(n-i) + e(n)                      (3)
         = h^T(n) a + e(n)                                      (4)
for n = 0, 1, ..., N-1, where (.)^T denotes matrix transposition and

    a = [ a_1 a_2 ... a_p ]^T                 (5)
    h(n) = [ x(n-1) x(n-2) ... x(n-p) ]^T.    (6)

When the residual e(n) is a white and wide-sense stationary (WSS) process, the speech signal is modelled as a so-called autoregressive random process of order p. This means that if the speech signal were indeed an autoregressive process, the output of the filter with the system response H(z) would be white and WSS, and this is the main idea behind speech compression based on linear prediction. As illustrated in Fig. 2 and described above, the speech signal is not transmitted directly. Instead, the filter coefficients a and the residuals are transmitted, as this can be performed at a much lower bit rate than by directly transmitting the speech signal.

From a mathematical perspective, the autoregressive model can be formulated as a linear normal model. In matrix notation, the linear model can be written as

    x = Ha + e    (7)

where

    x = [ x(0) x(1) ... x(N-1) ]^T     (8)
    e = [ e(0) e(1) ... e(N-1) ]^T     (9)
    H = [ h(0) h(1) ... h(N-1) ]^T.    (10)

Under some technical conditions on the matrix H, minimising the two-norm of e w.r.t. a leads to the so-called least squares estimate of a given by

    a^ = (H^T H)^{-1} H^T x.    (11)

Moreover, this estimate can be shown to be the so-called conditional maximum likelihood estimate. From an engineering perspective, the least squares estimate above might be problematic since the estimated filter coefficients in a^ are not guaranteed to produce a stable all-pole filter. That is, some of the poles of the system response H^{-1}(z) might be outside the unit circle. This can be compensated for in various ways, e.g., by reflecting the problematic poles around the unit circle or by defining the matrix H so that H^T H has a so-called Toeplitz structure.
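Eq. (11) can be evaluated directly with NumPy. The sketch below builds H from lagged samples of a synthetic AR(2) signal and recovers its coefficients; the signal, its order, and the coefficient values are illustrative assumptions, and np.linalg.lstsq is used instead of forming (H^T H)^{-1} explicitly, for numerical robustness:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 2
a_true = np.array([1.5, -0.8])   # a stable AR(2) process
N = 4000

# Generate x(n) = a_1 x(n-1) + a_2 x(n-2) + e(n) with white Gaussian e(n).
x = np.zeros(N)
e = rng.standard_normal(N)
for n in range(N):
    for i in range(1, p + 1):
        if n - i >= 0:
            x[n] += a_true[i - 1] * x[n - i]
    x[n] += e[n]

# Build H row by row: row n is h(n) = [x(n-1) ... x(n-p)]^T, with x(n) = 0 for n < 0.
H = np.zeros((N, p))
for i in range(1, p + 1):
    H[i:, i - 1] = x[:-i]

# Least squares estimate of eq. (11): a^ = (H^T H)^{-1} H^T x.
a_hat, *_ = np.linalg.lstsq(H, x, rcond=None)
print(a_hat)   # close to [1.5, -0.8]
```

Zero-padding the first p rows of H is one of several conventions for handling the segment boundary; the choice affects whether H^T H has the Toeplitz structure mentioned above.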
2.1 Packet-loss concealment

Returning to the example of Peter and Karen's telephone conversation, the estimated filter coefficients and source signal parameters from each segment are packaged in a speech packet and transmitted from one phone to the other. In Fig. 2, this is illustrated by the two antennas in the transmission part of the figure. Unfortunately, speech packets might be lost, corrupted, or delayed
in the transmission, and this will produce an audible click when the receiving telephone plays back the received speech. However, since correlation exists between adjacent speech segments, the content of a missing speech packet can to some extent be predicted from the adjacent speech packets. This prediction problem is called packet-loss concealment and can also be a part of the project.

3 Data Set

The data set is based on a secret track of male speech and consists of three files, all sampled at 8000 Hz.

- unvoicedsegment.wav: 120 ms of unvoiced speech. Specifically, the segment is the s-sound from the word score.
- voicedsegment.wav: 260 ms of voiced speech. Specifically, the segment is the ore-sound from the word score.
- sentences.wav: A sentence of approximately 5 seconds of speech containing a mixture of voiced and unvoiced speech. The unvoiced and voiced segments are taken from the word score in this file.

If you listen carefully to the last file, you will hear two audible clicks. These clicks are caused by two missing audio segments of 20 ms. These segments are located directly after the voiced and unvoiced segments, respectively, and can be approximated using packet-loss concealment. In the file sentences.wav, the missing packets are from sample number 21921 to 22080 and from sample number 25121 to 25280. In MATLAB, an audio file can be loaded via

    [data, samplingfreq] = audioread('filename.wav');

4 Tools from Courses

During the project, the following tools, which you will learn about in the courses of the semester, are needed.

- Solving 2-norm optimisation problems.
- Understanding the differences between and similarities of the maximum likelihood estimator and the least squares estimator.
- Modelling signals as autoregressive processes.
- Analysing linear normal models.
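The packet-loss-concealment idea from Section 2.1 can be sketched as follows: fit the linear predictor of eq. (11) to the samples just before a lost packet, then run the predictor forward with zero excitation to extrapolate into the gap. The sketch below is Python rather than MATLAB, and the test signal, the low model order, and the gap length are illustrative assumptions, not the project data:

```python
import numpy as np

def lp_coefficients(x, p):
    """Least squares estimate a^ = (H^T H)^{-1} H^T x of the LP coefficients."""
    N = len(x)
    H = np.zeros((N, p))
    for i in range(1, p + 1):
        H[i:, i - 1] = x[:-i]
    a_hat, *_ = np.linalg.lstsq(H, x, rcond=None)
    return a_hat

def conceal(history, p, gap_len):
    """Extrapolate gap_len samples beyond `history` with an order-p predictor."""
    a = lp_coefficients(history, p)
    buf = list(history)
    for _ in range(gap_len):
        # Predicted sample = weighted sum of the p most recent samples.
        buf.append(sum(a[i] * buf[-1 - i] for i in range(p)))
    return np.array(buf[len(history):])

# Toy example: a 100 Hz tone at fs = 8000 Hz with a 160-sample (20 ms) gap.
# A pure tone obeys an exact order-2 recursion, so p = 2 suffices here;
# real speech would need a higher order, such as the 12 used in the project.
fs, gap = 8000, 160
n = np.arange(800)
signal = np.sin(2 * np.pi * 100 * n / fs)
concealed = conceal(signal, p=2, gap_len=gap)
```

For the project data, the same idea would be applied to the samples preceding the two missing packets in sentences.wav.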
References

[1] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, 1st ed. Prentice-Hall, 1978.
[2] J. E. Markel and A. H. Gray, Linear Prediction of Speech. Springer-Verlag, 1982.
[3] S. Haykin, Adaptive Filter Theory, 4th ed. Prentice Hall, 2001.
[4] J. R. Deller, J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals. IEEE Press, 2000.
[5] D. Giacobello, "Sparsity in linear predictive coding of speech," Ph.D. dissertation, Aalborg University, 2010.