MASTER'S THESIS. Speech Compression and Tone Detection in a Real-Time System. Kristina Berglund. MSc Programmes in Engineering

2004:003 CIV MASTER'S THESIS Speech Compression and Tone Detection in a Real-Time System Kristina Berglund MSc Programmes in Engineering Department of Computer Science and Electrical Engineering Division of Signal Processing 2004:003 CIV - ISSN: 1402-1617 - ISRN: LTU-EX--04/003--SE


To Andreas

Abstract During the night, encrypted spoken newspapers are broadcast over the terrestrial FM radio network. These papers can be subscribed to by persons with reading disabilities. The subscribers presently have a special receiver that decrypts and stores the newspaper on a cassette tape. A project aiming to design a new receiver, using digital technology, was started during 2002. This report describes the parts of the project involving speech compression and tone detection. An overview of different compression techniques, with emphasis on waveform coding, is given, along with a detailed description of Adaptive Differential Pulse Code Modulation (ADPCM), the compression technique chosen for implementation on a Digital Signal Processor (DSP). ADPCM was first implemented on the ADSP-2181 DSP, with good results. Since the final version of the digital receiver will use the ADSP-2191 DSP, the code was converted to fit this DSP. Due to some problems, this implementation could not be completed within the time frame of this thesis. The final part of this thesis consists of finding a method for detecting a tone inserted between articles in the spoken newspaper. The tone detection is composed of two parts. The first part reduces the amplitude of the speech while maintaining the amplitude of the tone; for this, a digital resonator was chosen and implemented both in Matlab and on the ADSP-2191 DSP. The second part decides whether the tone is present or not, and was implemented only in Matlab.

Preface This Master's thesis is the final work for my Master of Science degree. The work for this thesis was performed at the Division of Signal Processing at Luleå University of Technology, as part of a project aiming to design a digital receiver for a radio-transmitted encrypted spoken newspaper system. The main purposes of this thesis are to investigate various compression algorithms and to find a detection technique that can be used to detect tones inserted between articles in the spoken newspaper. I would like to thank my examiner James P. LeBlanc for giving me the opportunity to work in this project and for all the support and input along the way. I would also like to thank my project leader Per Johansson and my fellow colleagues Anders Larsson and Patrik Pääjärvi for their invaluable help and encouraging words. Kristina Berglund Luleå, January 2004

Contents
1 Introduction
  1.1 Project Overview
2 Basics
  2.1 Sampling
  2.2 Quantization
    2.2.1 Scalar Quantization
    2.2.2 Vector Quantization
3 Speech Compression
  3.1 Waveform Coding
    3.1.1 Pulse Code Modulation
    3.1.2 Differential Pulse Code Modulation
    3.1.3 Adaptive Differential Pulse Code Modulation
    3.1.4 Subband Coding
    3.1.5 Transform Coding
  3.2 Vocoders
    3.2.1 Linear Predictive Coding
  3.3 Hybrid Coding
    3.3.1 Code Excited Linear Prediction
  3.4 Discussion
4 Adaptive Differential Pulse Code Modulation
  4.1 The Encoder
  4.2 The Decoder
  4.3 Implementation and Results
    4.3.1 Used Tools
5 Tone Detection
  5.1 Finding the Tone
    5.1.1 Matched Filter
    5.1.2 Digital Resonator
  5.2 Implementation
    5.2.1 Results
6 Conclusions
Bibliography

Chapter 1 Introduction This Master's thesis is part of a project involving myself, the M.Sc. students Anders Larsson and Patrik Pääjärvi and our project leader Per Johansson at Luleå University of Technology. The aim of the project is to design a digital receiver for a radio-transmitted encrypted newspaper system. 1.1 Project Overview The radio-transmitted papers are spoken versions of daily newspapers, which reading-disabled persons can subscribe to. The newspapers are distributed during the night, between 2 a.m. and 5 a.m., on the radio channel SR P1. Different papers are broadcast in different regions of the country; today about 90 different daily newspapers have a spoken version broadcast each night [1]. Since every newspaper is limited to 90 minutes and one paper is sent in mono on each channel, a maximum of four newspapers can be broadcast in one region during the night. To prevent non-subscribers from listening to the paper, it is encrypted. To receive the newspaper, the subscribers have a receiver set to record one of the transmitted papers. The receiver first decrypts the newspaper and then stores it on a cassette tape. In order to listen to the paper, the subscribers must insert the tape in a regular cassette player. Between the articles in the paper a tone is inserted. When the listeners fast-forward the tape this tone is heard, indicating the start of a new article.

The project of designing a digital receiver was initiated in 2002 by Taltidningsnämnden, a government authority whose purpose is to improve the availability of spoken newspapers to the reading disabled. The reasons for designing a new receiver are mainly questions of cost and adaptability. The receiver of today uses analog technology and has high costs of maintenance and repair. The subscribers to spoken newspapers are often elderly, with difficulties in handling a cassette tape. Since the new receiver has a built-in speaker, nothing needs to be moved in order to listen to the paper. An advantage of the digital receiver is the ability to skip between articles by pressing a button; for this reason, the tones inserted between articles must be detected. Another advantage is that additional features, for example improvements of the sound quality, can easily be added to the digital receiver by a change of software. Below, a brief description of the digital receiver is given. As seen in Fig. 1.1, the transmission is received by an FM receiver and analog-to-digital converted, before being passed along to the digital signal processor (DSP). In the DSP, the encrypted newspaper is decrypted, searched for tones between the articles, compressed and then stored on a memory chip connected to the DSP. During playback, the process is reversed: the newspaper is read from the memory, decompressed, digital-to-analog converted and sent out through a speaker. In order to decrypt the newspaper, the incoming transmission is sampled at 19 kHz. Each sample is represented in the DSP with an accuracy of 16 bits. Since one newspaper is 90 minutes long, about 205 MB of memory is required for storing the paper if no compression technique is used. If this receiver is to be commercialized, the costs must be kept small. Since larger memories cost more, a small memory is desired.
To fit the newspaper on a small memory, it must be compressed. The parts of the project described in this thesis are compression/decompression and tone detection.
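The 205 MB figure quoted above can be checked with a few lines of arithmetic (an illustrative sketch; the sample rate and word length are those stated in the text):

```python
# Storage needed for one uncompressed 90-minute newspaper,
# sampled at 19 kHz with 16 bits (2 bytes) per sample.
sample_rate_hz = 19_000
bytes_per_sample = 2
duration_s = 90 * 60

total_bytes = sample_rate_hz * bytes_per_sample * duration_s
total_mb = total_bytes / 1e6   # 1 MB taken as 10^6 bytes
print(f"{total_mb:.1f} MB")    # 205.2 MB
```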

Figure 1.1: Simple block diagram of the digital receiver. The data flows during recording and playback are indicated by the black and white arrows, respectively.


Chapter 2 Basics In this chapter, two basic concepts needed for the understanding of digital compression of speech waveforms are given. 2.1 Sampling Only discrete-time signals can be digitally processed; time-continuous signals therefore have to be sampled. Sampling is merely taking the value of the signal at discrete points, equally spaced in time, see Fig. 2.1. The discrete-time version of a time-continuous signal s_c(t) is

s_d[n] = s_c(nT), (2.1)

where n is an integer and T is the sampling period, the time between two samples. The sampling frequency, f_s, describes how often the signal is sampled; the relationship between the sampling period and the sampling frequency is

f_s = 1/T. (2.2)

A bandlimited signal s(t), i.e. a signal with no frequency components above some known limit, is uniquely determined by its samples s[n] if the sampling frequency is high enough. The sampling theorem states that the sampling frequency must be more than twice the maximum frequency component in the signal [2]. That is,

f_s > 2 f_max, (2.3)

Figure 2.1: Sampling of a time-continuous signal.

where f_max is the maximum frequency, for s(t) to be uniquely determined and reconstructible. The frequency 2 f_max is called the Nyquist rate. If the sampling frequency is below the Nyquist rate, aliasing may occur. Aliasing means that a frequency is misrepresented; the reconstructed signal will have a lower frequency than the original signal, as illustrated in Fig. 2.2. Aliasing can be avoided by bandlimiting the signal before it is sampled. This is done by applying a lowpass filter to the signal prior to sampling. The lowpass filter must remove the frequency components higher than f_s/2. 2.2 Quantization Every sampled signal is quantized, to be encodable with a finite number of bits. Quantization can be described as rounding off the value of the signal to the nearest value from a discrete set, see Fig. 2.3. The quantization can be performed on raw speech samples as well as on residuals, for example the difference between consecutive samples. If the quantization of one sample depends on previous samples, the quantizer is said to have memory. Quantization can be performed on one sample at a time, known as scalar quantization, or on several samples at a time, known as vector quantization.
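Equations (2.1)-(2.3) can be illustrated with a short sketch (the function name and signal are invented for this example):

```python
import math

def sample(signal, f_s, n_samples):
    """Uniform sampling, Eq. (2.1): s_d[n] = s_c(n * T) with T = 1 / f_s."""
    T = 1.0 / f_s
    return [signal(n * T) for n in range(n_samples)]

# A 1 kHz sinusoid sampled at 8 kHz, comfortably above its Nyquist rate.
f0, fs = 1000.0, 8000.0
s_d = sample(lambda t: math.sin(2 * math.pi * f0 * t), fs, 8)

# Sampling theorem check, Eq. (2.3): f_s must exceed 2 * f_max.
assert fs > 2 * f0
```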

Figure 2.2: Original (dotted line) and reconstructed (solid line) signal, sampled below the Nyquist rate.

Figure 2.3: Input (dotted line) and output (solid line) of a 3-bit (eight-level) uniform quantizer.

2.2.1 Scalar Quantization All quantization methods introduce some distortion. For scalar quantization this distortion is called quantization noise, measured as the difference between the input and the output of the quantizer. How well a quantizer works depends on the step size, the distance between two adjacent quantization levels. Uniform quantizers have the same step size between all levels, unlike log quantizers (see Section 3.1.1), whose step size varies between quantization levels. For uniform as well as log quantizers, the step size must be kept small to minimize quantization noise. There is a tradeoff between how small the step size is and how many quantization levels are required. A way of keeping the quantization noise small is to allow the step size to vary over time, depending on the amplitude of the input signal. This is called adaptive quantization; the step size is increased in regions with high variance in amplitude and decreased in regions with low variance. The step size adaption can be based on future samples, called forward adaptive quantization, or on previous samples, called backward adaptive quantization. Adaptive quantization adds delay and additional storage requirements, but enhances the quality of the reconstructed signal. Forward adaptive quantization is shown in Fig. 2.4. Samples, including the current sample, are stored in a buffer and a statistical analysis is performed on them. Based on this analysis the step size is adjusted and the quantization is carried out.

Figure 2.4: Forward adaptive quantization (based on a figure from [3]).

Since the analysis is performed on samples not available at the receiver, the statistical parameters from the analysis must be sent as side information. There is a tradeoff between how sensitive the quantizer is to local variations in the signal and how often side information must be sent. If the buffer size is small, the adaption to local changes will be effective, but the side information must be sent very often. If many samples are stored in the buffer, the side information does not have to be sent as often, but the adaption might miss local variations. The more samples the buffer stores, the bigger the delay and the storage requirements. No side information has to be sent using backward adaptive quantization.
The buffering and the statistical analysis are carried out by both the transmitter and the receiver, see Fig. 2.5. Since the analysis is performed on the output of the quantizer, which contains quantization noise, the adaption to the variation of the signal is not as fine as with forward adaptive quantization. Some well-known methods that use scalar quantization are Pulse Code Modulation, Delta Modulation and Adaptive Differential Pulse Code Modulation; they are all described in Section 3.1.
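A uniform scalar quantizer like the one in Fig. 2.3 can be sketched in a few lines (a toy mid-rise quantizer, not code from this project):

```python
def uniform_quantize(x, n_bits, x_max):
    """Round x to the nearest output level of a mid-rise uniform
    quantizer with 2**n_bits levels covering [-x_max, x_max)."""
    n_levels = 2 ** n_bits
    step = 2.0 * x_max / n_levels              # constant step size
    index = int((x + x_max) / step)            # which cell x falls into
    index = min(n_levels - 1, max(0, index))   # clamp to the outer levels
    return -x_max + (index + 0.5) * step       # return the cell centre

# A 3-bit quantizer (eight levels) on [-1, 1) has step size 0.25.
y = uniform_quantize(0.3, 3, 1.0)              # rounds to 0.375
noise = 0.3 - y                                # quantization noise
```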

Figure 2.5: Backward adaptive quantization (based on a figure from [3]).

2.2.2 Vector Quantization In scalar quantization the input to the quantizer is just one sample. The input to a vector quantizer is a vector of several consecutive samples. The samples can be pure speech samples as well as prediction residuals or coding parameters. The general idea of vector quantization is that a group of samples can be encoded more efficiently together than one by one. Due to the inherent complexity of vector quantization, significant results in this area were not reported until the late 1970s [4]. In vector quantization, depicted in Fig. 2.6, the quantization consists of finding a codeword that resembles the input vector. Each incoming vector, s_i, is compared to the codewords in a codebook. The closest word, often measured in mean squared error, is selected for transmission.

Figure 2.6: Block diagram of a vector quantizer (based on a figure from [4]).

For a codebook of L different codewords, the M-dimensional vector space is divided into L non-overlapping cells. Each cell has a corresponding centroid; these centroids are the L codewords. This is shown for a two-dimensional case (M = 2) in Fig. 2.7. The incoming vector falls into a cell, C_n, and the centroid, ŝ_n, of the cell is the codeword chosen for transmission. Actually, what is transmitted is the index, u_n, of the chosen word, not the codeword itself. The decoding process is the encoding reversed. The codebook, identical to the encoder's, is searched to find the codeword matching the index; the decoded word is the centroid corresponding to the index in question. Vector quantization is often used in hybrid coders, described in Section 3.3.

Figure 2.7: Cells for two-dimensional vector quantization (based on a figure from [5]).
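The codebook search described above can be sketched as follows (a toy example with an invented two-dimensional codebook; real coders use trained codebooks):

```python
def quantize_vector(x, codebook):
    """Return the index of the codeword closest to x in squared error;
    only this index is transmitted, as described above."""
    def sq_err(c):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return min(range(len(codebook)), key=lambda i: sq_err(codebook[i]))

# Toy codebook: L = 4 centroids in an M = 2 dimensional space.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0), (0.0, -1.0)]

index = quantize_vector((0.9, 1.2), codebook)   # nearest centroid is (1, 1)
reconstructed = codebook[index]                  # decoder: a table lookup
```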

Chapter 3 Speech Compression Speech compression, or speech coding, is needed for many applications. One application is telecommunication, where a low bit rate is often the aim. Another is storage, where speech needs to be compressed to fit within a specific memory size. The goal of speech coding is to encode the signal using as few bits as possible while retaining sufficient quality in the reconstructed speech. Sufficient quality can mean a variety of things, from speech with no artifacts and no difficulty in recognizing the speaker, to nearly unintelligible, unnatural speech, all depending on the application. Another important issue is delay; in real-time systems the delay introduced by compression/decompression must be kept small. In this project the aim has been to find a low-complexity compression algorithm that reconstructs speech with high quality. Low complexity means that few instructions are needed to encode and decode the speech. The field of speech coding can be divided into three major classes: waveform coding, vocoding and hybrid coding. Hybrid coding is not really a separate class, but rather a mixture of vocoding and waveform coding. In the following sections an overview of some basic compression techniques is given, and in Section 3.4 examples of the quality and complexity of algorithms from each class are given and discussed.

3.1 Waveform Coding Waveform coders compress the speech signal without any consideration of how the waveform is generated. Waveform coding can be carried out both in the time domain and in the frequency domain. Pulse Code Modulation, Delta Modulation and Adaptive Differential Pulse Code Modulation are all examples of time-domain waveform coding, while Subband Coding and Adaptive Transform Coding operate in the frequency domain. The aim is to construct a signal with a waveform that resembles the original signal. This means that waveform coders also work fairly well for non-speech signals. Waveform coders generally produce speech of high quality, unfortunately at quite a high bit rate as well. For more information on waveform coders than given in this thesis, the reader is referred to [3] and [6]. 3.1.1 Pulse Code Modulation Pulse Code Modulation (PCM) is the simplest form of scalar quantization. PCM uses uniform or logarithmic quantization: the samples are rounded off to the nearest value from a discrete set. For uniform PCM the set consists of a number of equally spaced discrete values; the step size between quantization levels is constant. The waveform is approximated by quantizing the input speech samples before transmission. PCM produces speech of high quality, but it also requires a very high bit rate. PCM can be made more efficient using a non-uniform step size; this type of quantizer is called a log quantizer. The difference between a uniform quantizer and a log quantizer is shown in Fig. 3.1.

Figure 3.1: (a) Uniform quantizer; (b) Log quantizer.

A-law and µ-law are both examples of log

quantizers. These two techniques are widely used in many speech applications; µ-law is used in telephone networks in North America and Japan, and A-law is used in Europe [7]. The performance of a 12-bit uniform quantizer can be achieved by a 7-bit log quantizer [4]. 3.1.2 Differential Pulse Code Modulation Since successive speech samples are highly correlated, the difference between adjacent samples generally has a smaller variance than the original signal. This means that the signal can be encoded using fewer bits when encoding the difference than when encoding the original signal, while still achieving the same performance. This is the objective of Differential Pulse Code Modulation (DPCM). DPCM quantizes the difference between adjacent samples, instead of the original signal itself. Many DPCM schemes also use a short-time predictor to further decrease the bit rate; an example of this is Delta Modulation. Delta Modulation Fig. 3.2 shows a delta modulator. Based on previous samples, the current sample s[k] is estimated as s_e[k]. The prediction residual, e[k], the difference between the input sample and its estimate, is then quantized and transmitted. The transmitter and the receiver use the same predictor P. At the receiver the quantized prediction residual is added to s_e[k] to reconstruct the speech sample ŝ[k].

Figure 3.2: Block diagram of a delta modulation encoder (left) and decoder (right) (based on figures from [8]).

The simplest form of the procedure described above is first-order, one-bit linear delta modulation. The estimate in this method is based on one previous quantized

sample, and only one bit is used to encode the difference. This means that the following sample can only be smaller or bigger than the previously encoded sample; adjacent decoded samples always differ by the step size. If adjacent input samples have the same value, they will still be encoded as different. This error is known as granular noise. When using a fixed step size, as this method does, another error, called slope overload, can occur. This happens when the magnitude of the sample-to-sample change of the signal is greater than the step size. Both granular noise and slope overload are depicted in Fig. 3.3. This method can be used for highly oversampled speech, where the correlation is strong.

Figure 3.3: Two types of quantization errors: slope overload and granular noise (based on a figure from [5]).

3.1.3 Adaptive Differential Pulse Code Modulation Adaptive Differential Pulse Code Modulation (ADPCM) is an extension of delta modulation. ADPCM has both an adaptive step size and adaptive prediction. A detailed description of the ITU-T standard G.726, 40, 32, 24, 16 kbit/s adaptive differential pulse code modulation (ADPCM), is found in Chapter 4. 3.1.4 Subband Coding A subband coder divides the input signal into several frequency subbands, which are then individually encoded, as depicted in Fig. 3.4. The encoding process works as follows. Using a bank of bandpass filters, subband coders divide the frequency

band of the signal into subbands. Each subband is then downsampled to its Nyquist rate (see Section 2.1) and encoded, using for example one of the techniques described above. The downsampling is possible because each subband has a narrower bandwidth than the input signal. Before transmission, the signals are multiplexed, i.e. combined so that they can be sent over one single channel. At the receiving end, the signals are demultiplexed, decoded and modulated back to their original spectral bands before being added together to reproduce the speech. To reduce the number of bits needed to encode the signals, different numbers of bits are used for different subbands. For speech, most of the energy lies between 120 and 2000 Hz; hence, more bits are often allotted to lower frequency bands than to higher ones.

Figure 3.4: Block diagram of a general subband coder (based on a figure from [6]).

3.1.5 Transform Coding In Fig. 3.5, a simple block diagram of a transform coder is shown. A block of speech samples is unitarily transformed to the frequency domain by the encoder. The transform coefficients are quantized and encoded before transmission. The receiver decodes and inverse-transforms the coefficients to reproduce the block of speech samples. The purpose of transforming the speech is that the transformed signal is less correlated than the original signal. This means that most of the signal energy is contained in a small subset of transform coefficients.
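The split-downsample-recombine idea can be shown with the simplest possible two-band filter pair, pairwise averages and differences (a Haar-style toy, far simpler than the bandpass banks of Fig. 3.4):

```python
def analyze(x):
    """Split x (even length) into a low band (pairwise averages) and a
    high band (pairwise differences), each downsampled by two."""
    low = [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / 2.0 for i in range(0, len(x), 2)]
    return low, high

def synthesize(low, high):
    """Invert the split; reconstruction is exact when the bands
    are not quantized."""
    x = []
    for lo, hi in zip(low, high):
        x += [lo + hi, lo - hi]
    return x

x = [1.0, 3.0, 2.0, 0.0, -1.0, 1.0]
low, high = analyze(x)
assert synthesize(low, high) == x   # lossless round trip
```

In a real subband coder each band would be quantized with its own bit allocation before transmission; here the round trip is left lossless to show that the filter pair itself introduces no distortion.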
Karhunen-Loève Transform (KLT), Discrete Cosine Transform (DCT) and Discrete Fourier Transform (DFT) are three examples of transforms used by transform coders. The KLT is the transform that gives the least correlated coefficients [9]. However, the DCT is more

popular, since it is nearly optimal and easy to compute using the Fast Fourier Transform. As for subband coders, bit allocation is used to remove redundancies. Adaptive Transform Coding (ATC) works in a similar way, except that it uses adaptive quantization and adaptive bit allocation. This leads to a higher quality of the reconstructed speech.

Figure 3.5: Block diagram of a transform coder (based on a figure from [4]).

3.2 Vocoders Unlike waveform coders, vocoders (parametric coders) do not try to recreate the speech waveform. Vocoders use a mathematical model of how human speech is generated to create synthesized speech instead of a reconstruction of the input. By an analysis of the input speech, parameters corresponding to the vocal cords and the vocal tract are estimated and quantized before transmission. At the receiver, these parameters are used to tune the model in order to construct synthesized speech. Because vocoders operate by synthesizing speech, they perform poorly on non-speech signals. Vocoders use very few bits to encode the speech, giving very synthetic-sounding speech in which the speaker is hard to identify. For more information on vocoding techniques than given below, the reader is referred to [4] and [5]. 3.2.1 Linear Predictive Coding A well-known example of Linear Predictive Coding (LPC) is the 2.4 kbit/s LPC-10. Due to the very low bit rate, the reconstructed speech sounds very unnatural. The main use for algorithms like this has been secure transmissions, where a large part of the bandwidth is needed for encrypting the speech. An LPC encoder divides the input signal into frames, typically 10-20 ms long. Each frame is then analyzed to estimate the parameters needed. The linear prediction parameters, the decision whether the

Chapter 3. Speech Compression 17 input is voiced or unvoiced speech, the pitch period and the gain are the parameters that are estimated and quantized. The quantized parameters are transmitted and a synthesis model, like the one shown in Fig. 3.6, is used to reconstruct the speech. For voiced speech a pulse generator is used for modelling the excitation signal, for unvoiced speech white noise is used. The excitation signal is used to represent the flow of air from the lungs through the glottis and the LPC parameters are used in the filter representing the oral cavity. The signal is multiplied by a gain to achieve the correct amplitude of the synthesized speech. The parameters are updated for each frame, i.e. every 10-20 ms. Between the updates, the speech is assumed to be stationary. Pitch Pulse generator Noise generator Voiced/ unvoiced Switch LPC parameters Filter Gain Synthesized speech Figure 3.6: LPC synthesis model (from [5]). 3.3 Hybrid Coding Hybrid coders contain features from both waveform coders and vocoders. They try to combine the low bit rate of vocoders with the high speech quality of waveform coders to produce speech with good quality at a medium or low bit rate. Many hybrid coders can be classified as analysis-by-synthesis linear predictive coders, depicted in Fig. 3.7. Using this system an excitation signal is chosen or formed. The excitation signal is then filtered by a long-term linear predictive (LP) filter A L (z) corresponding to the pitch structure of the speech. Then the short-term LP-filter A(z), representing the vocal tract, is applied to the signal. This is the vocoding part of hybrid coders. The waveform part of these coders is the attempt to match the synthesized speech to the original speech. This is done by the perceptual weighting filter, W (z). W (z) is used to shape the error between the input speech s[n] and the synthesized speech ŝ[n], in order to minimize the Mean Squared Error (MSE). 
For more information on hybrid coding than given below, the reader is referred to [4] and [5].
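The analysis-by-synthesis idea can be illustrated with a deliberately reduced sketch: each candidate excitation is passed through the synthesis filters, and the candidate giving the smallest squared error against the target frame is kept. The one-tap long-term and short-term filters, the frame length of 40 samples and the random codebook below are all made-up example choices, not any standardized coder, and the perceptual weighting by W(z) is omitted for brevity.

```python
import numpy as np

def synthesize(excitation, gain, pitch_coef, pitch_lag, lp_coef):
    """Toy synthesis chain: gain, then a one-tap long-term (pitch)
    filter standing in for 1/A_L(z), then a one-tap short-term filter
    standing in for 1/A(z)."""
    x = gain * excitation
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] + (pitch_coef * y[n - pitch_lag] if n >= pitch_lag else 0.0)
    s = np.zeros(len(y))
    for n in range(len(y)):
        s[n] = y[n] + (lp_coef * s[n - 1] if n >= 1 else 0.0)
    return s

def best_excitation(target, codebook, gain, pitch_coef, pitch_lag, lp_coef):
    """Analysis-by-synthesis search: synthesize each codebook entry and
    pick the one minimizing the mean squared error against the target."""
    errors = [np.mean((target - synthesize(c, gain, pitch_coef, pitch_lag, lp_coef)) ** 2)
              for c in codebook]
    return int(np.argmin(errors))

rng = np.random.default_rng(4)
codebook = [rng.standard_normal(40) for _ in range(8)]
# Build a target frame from entry 5; the search should recover it.
target = synthesize(codebook[5], 1.0, 0.5, 10, 0.8)
index = best_excitation(target, codebook, 1.0, 0.5, 10, 0.8)
```

Only the codebook index (plus the filter parameters) would be transmitted, which is where the bit-rate saving of CELP-style coders comes from.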

Figure 3.7: Analysis-by-synthesis linear predictive coder (from [4]).

3.3.1 Code Excited Linear Prediction

Code Excited Linear Prediction, known as CELP, uses a codebook containing different excitation signals. The perceptual weighting filter is applied both to the original signal and to the synthesized signal, a difference is calculated, and from this the excitation yielding the minimal error is chosen. As in the general analysis-by-synthesis coder described above, both a long-term and a short-term LP filter are applied to the signal when synthesizing the speech. An example of a standard using CELP is the Federal Standard FS 1016 CELP, a 4.8 kbit/s hybrid coder previously used in secure communications. Another example is ITU-T standard G.728 Coding of speech at 16 kbit/s using low-delay code excited linear prediction, known as LD-CELP. This standard gives much higher quality synthesized speech than FS 1016, but the bit rate is higher as well.

3.4 Discussion

As mentioned in Section 1.1, it is desired to fit the newspaper on a small memory. The 90-minute paper is sampled at 19 kHz using 16 bits to represent each sample; thus, the uncompressed newspaper requires 205.2 MB of memory. In order for the subscribers to accept a new receiver, the quality of the speech must be comparable to that of the present receiver, so ratings from subjective tests are a good basis for comparison. Mean Opinion Score, or simply MOS, is a well-known subjective test used to determine the quality of reconstructed or synthesized speech. The MOS scale ranges from 1 to 5 as follows:

1 Unsatisfactory (Bad)
2 Poor
3 Fair
4 Good
5 Excellent

In the second column of Table 3.1 the results of a MOS test are given for five different compression techniques. The waveform coders PCM and ADPCM both have high MOS scores, as does the hybrid coder LD-CELP. Many hybrid coders produce synthesized speech of high quality, but they tend to have a very high complexity and they require a codebook. For these reasons a hybrid coder is not a good choice for the compression of the spoken newspaper. Since vocoders produce unnatural speech, they are not considered as candidates for use in the digital receiver. The class of speech coders that remains is the waveform coders. From Table 3.1 it is clear that if ADPCM is used to compress the spoken newspaper, it is possible to store the paper on a 64 MB memory chip. For this reason, and because ADPCM offers high-quality reconstructed speech at a low complexity, this algorithm was chosen for the project. A detailed description of the algorithm is given in the next chapter.

Table 3.1: Coding algorithms and their performances (based on information found in [4]).

Algorithm      MOS   CR(a)   MIPS(b)   Memory requirement(c) (in MB)
log PCM        4.3    2       0.01     102.6
ADPCM          4.1    4       2         51.3
LPC-10         2.3   53.33    7          3.85
FS 1016 CELP   3.2   26.67   16          7.7
LD-CELP        4.0    8      19         25.65

(a) CR = Compression Ratio, the ratio between the number of bits needed to represent one uncompressed sample and the number of bits needed to represent one compressed sample.
(b) MIPS = Million Instructions Per Second. These figures are only approximate; the complexity depends on the implementation.
(c) The amount of memory required to store a 90-minute encoded spoken newspaper.
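The memory figures in the last column of Table 3.1 follow directly from the sampling parameters and the compression ratio. A small sketch of the arithmetic (with 1 MB taken as 10^6 bytes, matching the table):

```python
def encoded_size_mb(duration_s, fs_hz, bits_per_sample, compression_ratio):
    """Memory needed for an encoded recording, in MB (1 MB = 10**6 bytes)."""
    raw_bits = duration_s * fs_hz * bits_per_sample
    return raw_bits / compression_ratio / 8 / 1e6

duration = 90 * 60  # the 90-minute newspaper, in seconds
print(encoded_size_mb(duration, 19_000, 16, 1))  # 205.2 MB uncompressed
print(encoded_size_mb(duration, 19_000, 16, 4))  # 51.3 MB with ADPCM (CR = 4)
```

The same formula reproduces the remaining rows of the table when the corresponding CR values are inserted.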


Chapter 4
Adaptive Differential Pulse Code Modulation

The compression technique chosen for this project is Adaptive Differential Pulse Code Modulation (ADPCM). The reasons for choosing this particular algorithm are that the quality of the reconstructed speech is high and the complexity is low: about 600 instructions are needed to encode or decode one sample of the speech. Compressed with ADPCM, one newspaper requires approximately 51.3 MB of memory, as seen from Table 3.1. Using a 64 MB memory chip, space will be available for storing overhead information. An additional advantage is that ADPCM is free of charge [10].

ADPCM was approved as the international standard CCITT Recommendation G.721 32 kbit/s Adaptive Differential Pulse Code Modulation (ADPCM) in October 1984, see [11]. The standard is now part of ITU-T Recommendation G.726 40, 32, 24, 16 kbit/s Adaptive Differential Pulse Code Modulation (ADPCM) [12]. The version used in this project is the 32 kbit/s ADPCM. This bit rate is valid if the input speech is sampled at 8 kHz with 16-bit accuracy, but the sampling frequency used in the digital receiver is 19 kHz. With a CR of 4 (see Section 3.4) the output bit rate becomes (19000 × 16)/4 = 76 kbit/s. This has no effect on the calculations other than that they are performed at a higher rate.

ADPCM makes use of redundancies in speech signals, such as the correlation between contiguous samples, to minimize the number of bits needed to represent the signal. The difference between the incoming sample and an estimate of that sample is calculated, and this difference is encoded instead of the actual input signal itself. This reduces the variance of the signal, and therefore fewer bits are required to encode

it. A simplified block diagram of the ADPCM encoder/decoder is shown in Fig. 4.1. The estimate of the input signal, s_e[k], is subtracted from the actual input signal s[k], and the resulting difference, d[k], is quantized and encoded to I[k] before transmission. The reconstruction of the speech starts by recreating the received word as d_q[k]. The difference is then added to the estimate calculated by A(z) and B(z), to produce the reconstructed speech r[k].

Figure 4.1: Simplified block diagram of ADPCM encoder/decoder (based on a figure from [11]).

G.726 is backward adaptive, meaning that the prediction of the input signal is made from previous samples. Hence, the prediction can be made at the receiver without any side information having to be sent. As shown in Fig. 4.1, the decoder can be regarded as a subset of the encoder.

4.1 The Encoder

The encoding process starts by converting the input signal from logarithmic PCM to uniform PCM, as seen in Fig. 4.2. Next, the difference between the input signal and its estimate is calculated, and this value is assigned four bits by the adaptive quantizer. This four-bit word, I[k], is the output of the encoder. I[k] is sent both to the decoder and to the inverse quantizer. The output of the inverse quantizer

is the quantized difference signal, which is added to the estimate of the signal to reconstruct the input. This reconstructed signal, as well as the quantized difference signal, is fed to the adaptive predictor for an estimation of the next input sample. Below follow descriptions of what each block in the encoder does.

Figure 4.2: Block diagram of ADPCM encoder (from [12]).

Input PCM Format Conversion

The input to the ADPCM encoder is an A-law or µ-law pulse code modulated signal, with an accuracy of 8 bits. This log-quantized signal is converted to uniform PCM before compression.

Difference Signal Computation

The second step in the encoding process is to calculate the difference signal by subtracting the estimate of the input from the actual input signal:

    d[k] = s_l[k] - s_e[k].    (4.1)
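The backward-adaptive loop just described can be illustrated with a deliberately stripped-down DPCM sketch. The fixed uniform quantizer and the one-tap predictor below are toy stand-ins for the adaptive blocks of G.726; the point is the structure: the encoder predicts from its own reconstructed output, so the decoder can maintain identical state from the transmitted codes alone.

```python
def quantize(d, step=0.1):
    # Toy uniform quantizer standing in for the adaptive one in G.726.
    return round(d / step)

def dequantize(code, step=0.1):
    return code * step

def dpcm_encode(samples):
    codes, s_e = [], 0.0              # s_e: estimate built from reconstructed past
    for s in samples:
        code = quantize(s - s_e)      # eq. (4.1): quantize the difference
        codes.append(code)
        s_e = s_e + dequantize(code)  # reconstruct, exactly as the decoder will
    return codes

def dpcm_decode(codes):
    out, s_e = [], 0.0
    for code in codes:
        s_e = s_e + dequantize(code)  # identical state update as the encoder
        out.append(s_e)
    return out

codes = dpcm_encode([0.0, 0.25, 0.5, 0.4])
recon = dpcm_decode(codes)
```

Because the encoder bases its prediction on reconstructed values rather than on the true input, quantization errors do not accumulate, and no side information needs to be sent.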

Adaptive Quantizer

ADPCM uses backward adaptive quantization, described in Section 2.2.1. The quantizer used is a 15-level non-uniform adaptive quantizer. Before quantization, the difference signal is converted to a base-2 logarithmic representation and scaled by the scale factor y[k]. The normalized input to the quantizer is then

    log_2 |d[k]| - y[k],    (4.2)

which is quantized to create the encoder output I[k].

Inverse Adaptive Quantizer

The inverse adaptive quantizer constructs the quantized difference signal, d_q[k], by first decoding I[k] and then adding the scale factor y[k]. Finally, the result is transformed back from the logarithmic domain.

Quantizer Scale Factor Adaptation

The scale factor y[k], used in the quantizer and in the inverse quantizer, is composed of two parts: one for fast varying signals and one for slowly varying signals. The fast, unlocked, scale factor is calculated recursively, using the resulting scale factor y[k]:

    y_u[k] = (1 - 2^-5) y[k] + 2^-5 W(I[k]).    (4.3)

W is a number from a discrete set, depending on which one of the fifteen quantization levels is used for the current sample, see [12]. The second part of the scale factor is the slow, or locked, factor y_l[k], and it is calculated as follows:

    y_l[k] = (1 - 2^-6) y_l[k-1] + 2^-6 y_u[k].    (4.4)

The resultant scale factor y[k] is a combination of the fast and the slow scale factors:

    y[k] = a_l[k] y_u[k-1] + (1 - a_l[k]) y_l[k-1].    (4.5)
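The interplay of (4.3)-(4.5) can be sketched as below. The W(I[k]) table and the computation of a_l[k] are defined in the standard and are not reproduced here; the value W_I = 4.0 is a placeholder chosen only to show the fast/slow behaviour.

```python
def scale_factor_step(y_u_prev, y_l_prev, a_l, W_I):
    """One update of the two-part scale factor, eqs. (4.3)-(4.5).
    W_I stands for W(I[k]); the real discrete set is given in G.726."""
    y = a_l * y_u_prev + (1 - a_l) * y_l_prev   # eq. (4.5): combine fast and slow
    y_u = (1 - 2**-5) * y + 2**-5 * W_I         # eq. (4.3): fast (unlocked) part
    y_l = (1 - 2**-6) * y_l_prev + 2**-6 * y_u  # eq. (4.4): slow (locked) part
    return y_u, y_l

# With a_l = 1 (speech) the scale factor tracks W quickly; with a_l = 0
# (tones) it moves only through the heavily smoothed locked part y_l.
y_u = y_l = 0.0
for _ in range(500):
    y_u, y_l = scale_factor_step(y_u, y_l, a_l=1.0, W_I=4.0)
```

After the loop, y_u has settled near the constant multiplier value, while the locked factor y_l lags behind it, which is exactly the fast/slow split the standard intends.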

The controlling parameter a_l[k] is described below.

Adaptation Speed Control

a_l[k] is a controlling parameter that ranges between 0 and 1. For speech signals this parameter approaches 1, forcing the quantizer towards the fast mode. For slowly varying signals like tones, the parameter tends towards 0, driving the quantizer into the slow mode. For detailed information on how a_l[k] is constructed, please see [12].

Adaptive Predictor and Reconstructed Signal Calculator

The estimate, s_e[k], of the input signal is calculated using two previously reconstructed samples s_r and six previous difference signals d_q:

    s_e[k] = sum_{i=1..2} a_i[k-1] s_r[k-i] + s_ez[k],    (4.6)

where

    s_ez[k] = sum_{i=1..6} b_i[k-1] d_q[k-i].    (4.7)

The reconstructed signal s_r is calculated by adding the quantized difference to the estimate of the signal:

    s_r[k-i] = s_e[k-i] + d_q[k-i].    (4.8)

The coefficients a_i and b_i are found in [12].

Tone and Transition Detector

When the quantizer is in the slow, locked, mode and a stationary signal changes to another stationary signal, problems can occur. An example of this type of signal is tones from a frequency shift keying modem. The problems that might occur are prevented by the tone and transition detector. When a transition between different

stationary signals is detected, the quantizer is forced into the fast, unlocked, mode by setting all the predictor coefficients equal to zero.

4.2 The Decoder

As can be seen in Fig. 4.3, the structure of the decoder is for the most part the same as that of the encoder. I[k], the four-bit word representing the difference between the input signal and its estimate, is fed to the inverse quantizer. The quantized difference signal is added to the estimate of the input and then converted from uniform PCM to either A-law or µ-law PCM.

Figure 4.3: Block diagram of an ADPCM decoder (from [12]).

The blocks common to both the encoder and the decoder were described in Section 4.1; the blocks unique to the decoder are described below.

Output PCM Format Conversion

After the signal has been reconstructed, this block converts it back to A-law or µ-law PCM format from uniform PCM.

Synchronous Coding Adjustment

This block has been added to the ADPCM decoder to reduce the cumulative distortion that can appear from successive synchronous tandem codings, i.e., ADPCM to PCM to ADPCM to PCM to ADPCM, etc. With this feature, any number of synchronous tandem codings is equivalent to a single coding, provided the channel is ideal (no transmission errors). Details about the synchronous coding adjustment are found in [12].

4.3 Implementation and Results

ADPCM was first implemented on the ADSP-2181 DSP, described in Section 4.3.1 below, using code obtained from Analog Devices. The speech compressed using this program had a high quality, as expected. Since the sampling frequency had to be set to exactly 19 kHz in order to decrypt the newspaper, the ADSP-2181 DSP was replaced by the ADSP-2191 DSP. Both DSPs are programmed using assembly language, and they were supposed to be compatible. However, some instructions were defined differently in the new DSP, so the code had to be converted. Unfortunately, due to some problems, the implementation on the ADSP-2191 could not be completed within the time frame of this thesis.

4.3.1 Used Tools

Both the ADSP-2181 DSP and the ADSP-2191 DSP were used during the implementation of ADPCM. First the evaluation kit ADSP-21XX EZ-KIT Lite was used; some of its attributes are as follows:

- 16-bit fixed-point ADSP-2181 DSP
- Ability to perform 33 MIPS (Million Instructions Per Second)
- 80 kB of on-chip RAM
- AD1847 stereo codec
- Serial port connection

Due to the need to set the sampling frequency in steps of 1 Hz, the ADSP-2181 was replaced by the ADSP-2191. The evaluation board for this DSP is called ADSP-2191 EZ-KIT Lite and has the following attributes:

- 16-bit fixed-point ADSP-2191 DSP
- Ability to perform 160 MIPS
- 160 kB of on-chip RAM
- AD1885 48 kHz AC'97 SoundMAX codec
- USB version 1.1 connection
- Ability to set the sampling frequency in steps of 1 Hz

The attributes for both evaluation kits are taken from [13].

Chapter 5
Tone Detection

Between two articles in the spoken newspaper a 50 Hz tone is inserted, indicating the end of one article and the start of the next. Today the subscribers of spoken newspapers use an ordinary cassette player when listening to their recorded paper. When listening to the newspaper at normal speed the tone is not audible, partly because of its low frequency and partly because the amplitude of the tone is smaller than the amplitude of the speech, see Fig. 5.1. However, when the listeners fast-forward the tape, the frequency of the tone is increased and it can be heard as a beep. In the new digital receiver it is desired to skip between articles just by pressing a button, and for this reason the 50 Hz tone must be detected. The idea is to add a bookmark at the location of the tone. This way, pressing the forward button on the new receiver will cause a jump to the next article, that is, the place where the next bookmark is located.

5.1 Finding the Tone

The tone is added to the newspaper by the person reading it out, which causes the duration of the tone to vary. A number of tones have been examined, and their durations were between approximately 0.5 and 2.5 seconds. Due to the human factor, the tone is not always located between articles; sometimes it is found at the end of an article, hidden under the speech. Thus, the problem can be described as detecting a known signal of unknown duration in non-white, non-stationary noise (i.e. the speech).

Figure 5.1: Speech and 50 Hz tones.

Requirements for the tone detection:

- Due to limited memory on the DSP, the solution to the tone detection problem should not require too much memory.
- The tone detection should be fast, i.e. not require many instructions.
- The detection must be robust, meaning both that all tones are detected and that the probability of false alarm is low. If the detector fails to detect one tone and the next one is detected, skipping forward means missing an article. A false alarm occurs when something other than a tone is taken for the tone searched for.

The first step in detecting the tone is to reduce the amplitude of the speech while the amplitude of the tone is maintained or possibly increased. The second step is to decide whether the tone searched for is present in the signal or not.
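The second step, the detection decision, can be sketched as a short-time power measurement compared against a threshold. The window length and threshold below are arbitrary placeholders for illustration; in practice they would have to be tuned against recorded newspapers.

```python
import numpy as np

def tone_present(filtered, fs, window_s=0.25, threshold=1e-3):
    """Per non-overlapping window, decide whether the (already
    speech-suppressed) signal has enough power to indicate the tone."""
    n = int(window_s * fs)
    decisions = []
    for start in range(0, len(filtered) - n + 1, n):
        power = np.mean(filtered[start:start + n] ** 2)
        decisions.append(power > threshold)
    return decisions

fs = 19_000
t = np.arange(fs) / fs
quiet = 0.001 * np.random.default_rng(3).standard_normal(fs)  # low-level residue
tone = 0.1 * np.sin(2 * np.pi * 50 * t)                       # a surviving tone
print(tone_present(tone, fs))   # every window flags the tone
print(tone_present(quiet, fs))  # no window crosses the threshold
```

This assumes the first step has already suppressed the speech; the decision itself is then cheap, one multiply-accumulate per sample plus a comparison per window.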

5.1.1 Matched Filter

Since matched filters can be used to search for known signals in noise, a matched filter is a natural candidate for the first step in the process of detecting the 50 Hz tone. A matched filter is designed to maximize the output Signal-to-Noise Ratio (SNR) [14]. Thus, it might be possible to use a matched filter to remove the speech. The aim of a matched filter is not to keep the waveform of the sought signal unchanged, but to maximize the power of the known signal with respect to the power of the noise. The received signal can be described by

    r(t) = s(t) + n(t),    (5.1)

where s(t) is the known signal searched for (in this case the 50 Hz tone) and n(t) is the additive colored noise (the speech) corrupting the tone. If S*(f) is the complex conjugate of the Fourier transform of the known signal s(t), and the Power Spectral Density (PSD) [14] of the colored input noise is P_n(f), the transfer function of the matched filter is given by

    H(f) = K (S*(f) / P_n(f)) e^{-j2πft_0},    (5.2)

where K is an arbitrary real nonzero constant and t_0 is the sampling time. A proof of this is given in [14]. Usage of the filter described above requires knowledge of the PSD of the speech, so a simplification is needed. If the noise is assumed to be white, P_n(f) equals N_0/2. This reduces (5.2) to

    H(f) = (2K/N_0) S*(f) e^{-j2πft_0}.    (5.3)

Hence, if the noise is white, the impulse response of the matched filter becomes

    h(t) = C s(t_0 - t),    (5.4)

where C is an arbitrary real positive constant. The proof of this equation is found in [14]. From (5.4) it is clear that the matched filter is a scaled, time-reversed version of the known signal itself, delayed by the interval t_0, see Fig. 5.2. For the matched filter to be realizable it must be causal, i.e.,

    h(t) = 0, if t < 0,    (5.5)

and for that reason t_0 must be equal to the length of the known signal.
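In discrete time, (5.4) says that the white-noise matched filter is simply the time-reversed tone template, so filtering amounts to correlating the input with the template. A sketch follows; the template length of two periods and the white-noise stand-in for speech are arbitrary choices made for the example.

```python
import numpy as np

fs = 19_000                        # sampling frequency used in the receiver
t = np.arange(2 * 380) / fs        # two periods of the 50 Hz tone (380 samples each)
template = np.sin(2 * np.pi * 50 * t)

# Eq. (5.4): the matched filter is the time-reversed template (C = 1).
h = template[::-1]

rng = np.random.default_rng(2)
signal = np.concatenate([rng.standard_normal(1000) * 0.1,   # "speech" before
                         template,                          # tone at sample 1000
                         rng.standard_normal(1000) * 0.1])  # "speech" after
# Convolving with h flips it back, i.e. correlates signal with the template.
out = np.convolve(signal, h, mode="valid")
peak = int(np.argmax(np.abs(out)))
print(peak)  # peaks near sample 1000, where the tone begins
```

The peak location marks the tone onset, but note that the filter itself is 760 taps long, which is precisely the storage cost discussed next.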
The sampling frequency used in the DSP is 19 kHz; one period of the 50 Hz tone is therefore represented by 380 samples. For the matched filter to be optimal,

the duration of the tone must be known. As stated before, the duration of the tone varies, and hence an assumption about the duration must be made. The length of the matched filter thus becomes the number of samples needed to represent one period of the tone, multiplied by the number of periods chosen. However, this procedure requires a large amount of storage, since as many samples as the length of the filter must be stored.

Figure 5.2: (a) s(t) is the known signal; (b) h(t) is the matched filter.

A method for reducing the length of the filter is to downsample the input signal prior to the detection process, thereby reducing the number of samples needed to represent one period of the tone. Downsampling cannot be done unless a lowpass filter is first applied to the input signal, or else aliasing (described in Section 2.1) might occur. For example, if it is desired to use a sampling frequency of 400 Hz, the lowpass filter must remove all frequencies above 200 Hz. If the filter is not steep enough, frequencies above 200 Hz will be represented by a lower frequency, possibly as 50 Hz. A steep lowpass filter is of high order, i.e. many samples must be stored, and this diminishes the gain of the downsampling. Therefore the matched filter does not seem to fulfil the storage requirements of the tone detection. Moreover, the assumption that the noise is white is not completely correct; this error, together with the large storage requirement, leads to the conclusion that a solution other than a matched filter might be more suitable.

5.1.2 Digital Resonator

Another possible solution to the tone detection problem is to remove all frequencies in the input signal except those around 50 Hz. This can be done by applying a bandpass filter to the incoming signal. By testing whether the power of the output signal exceeds some threshold value, the presence of the tone can be detected.
Since most of the energy in human speech is limited to frequencies between 120 and 2000

Hz, the filter used to remove the speech must drop off sharply. Thus, the magnitude response of the filter should be that of a very steep, narrow bandpass filter, like the ideal filter shown in Fig. 5.3.

Figure 5.3: Magnitude response of an ideal bandpass filter with the passband [49.5, 50.5] Hz.

A filter known as a digital resonator [15] is a second-order digital filter with complex conjugated poles at a·e^{±jω_0}, where a is close to 1. ω_0 is the digital resonance frequency, that is, the frequency of interest. In this case the resonance frequency is π/190, the digital frequency corresponding to 50 Hz. The digital frequency ranges between 0 and π rad/sample, with π corresponding to half the sampling frequency. A pole-zero plot of a general digital resonator is shown in Fig. 5.4.

Figure 5.4: Pole-zero plot of a general digital resonator. The poles and zeros are indicated by crosses and circles respectively.
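A sketch of such a resonator follows. The pole radius a = 0.999 and the gain normalization b0 = 1 - a are ad hoc choices for the example (a practical DSP implementation would use fixed-point coefficients), but the pole placement at a·e^{±jω_0} with ω_0 = π/190 is exactly the structure described above.

```python
import numpy as np

fs = 19_000
w0 = 2 * np.pi * 50 / fs      # = pi/190, the digital frequency of 50 Hz
a = 0.999                     # pole radius, close to the unit circle
b0 = 1 - a                    # rough gain normalization (example choice)
a1 = -2 * a * np.cos(w0)      # denominator of H(z): 1 + a1*z^-1 + a2*z^-2
a2 = a * a

def resonate(x):
    """y[n] = b0*x[n] - a1*y[n-1] - a2*y[n-2]; poles at a*e^{+/- j w0}."""
    y1 = y2 = 0.0
    out = np.empty(len(x))
    for n, xn in enumerate(x):
        yn = b0 * xn - a1 * y1 - a2 * y2
        out[n] = yn
        y1, y2 = yn, y1
    return out

t = np.arange(fs) / fs                            # one second of signal
tone = 0.05 * np.sin(2 * np.pi * 50 * t)          # the sought tone
speech_like = 0.2 * np.sin(2 * np.pi * 300 * t)   # energy above the passband
p_tone = np.mean(resonate(tone) ** 2)
p_speech = np.mean(resonate(speech_like) ** 2)
print(p_tone > 10 * p_speech)  # True: 50 Hz passes, 300 Hz is strongly attenuated
```

With only two multiplications and two additions of state per sample, the resonator is far cheaper in both memory and instructions than the 760-tap matched filter, which is what makes it attractive for the DSP.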