Computing For Nation Development, February 26-27, 2009
Bharati Vidyapeeth's Institute of Computer Applications and Management, New Delhi

An Approach to Very Low Bit Rate Speech Coding

Hari Kumar Singh
Department of Electronics & Communication, Institute of Engineering and Technology, M.J.P. Rohilkhand University, Bareilly
E-Mail: harimtech2000@rediffmail.com

Sanjeev Sharma and Yash Vir Singh
Department of Electronics & Communication, College of Engineering and Rural Technology, Meerut
E-Mail: Sanjeev_vats1@yahoo.co.in

ABSTRACT
Speech coding is the process of encoding speech signals for efficient transmission. The problem of reducing the bit rate of a speech representation, while preserving the quality of the speech reconstructed from it, has received continuous attention over the past five decades. Speech coded at 64 kilobits per second (kbits/s) using logarithmic PCM is considered "non-compressed" and is often used as a reference for comparisons. The term medium-rate is used for coding in the range of 8-16 kbits/s, low-rate for systems working below 8 kbits/s and down to 2.4 kbits/s, and very-low-rate for coders operating below 2.4 kbits/s.

KEYWORDS
Speech, Quantization, Codebook, Clusters, Signals, Compression

1. INTRODUCTION: TRADITIONAL SPEECH CODING
Natural speech waveforms are continuous in time and amplitude. Periodically sampling an analog waveform at the Nyquist rate (twice the highest frequency) converts it to a discrete-time signal. The signal amplitude at each time instant is quantized to one of a set of L amplitude values, where B = log2(L) is the number of bits used to digitally code each value. Digital communication of an analog amplitude X consists of A/D conversion, transmission of binary information over a digital channel, and D/A conversion to reconstruct the analog value X. If the channel is noise-free, the output value differs from the input by an amount known as the quantization noise.
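As a concrete instance of these numbers, the 64 kbits/s logarithmic PCM reference mentioned in the abstract follows directly from the sampling and quantization parameters of telephone-band speech (a sketch; the 8 kHz / 256-level figures are the standard telephony values, which the text does not state explicitly):

```python
import math

# Bit rate = sampling rate (samples/s) * bits per sample, with B = log2(L).
# Telephone-band speech: ~4 kHz bandwidth, so Nyquist sampling at 8 kHz,
# quantized to L = 256 amplitude levels (8 bits) per sample.
F_S = 8000                     # samples/second
L_LEVELS = 256                 # quantizer levels
B = int(math.log2(L_LEVELS))   # bits per sample

bit_rate = F_S * B
print(B, bit_rate)  # 8 64000
```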
The bit rate for a signal is the product of its sampling rate Fs (in samples/second) and the number of bits B used to code each sample. The extraction of properties or features important for communication from a speech signal is called speech analysis. This involves a transformation of the speech signal into another signal, a set of signals, or a set of features, with the objective of simplification and data reduction. The standard model of speech production (a source exciting a vocal tract filter) is implicit in many analysis methods, including LPC. Most of the methods operate in the frequency domain, as it offers the most useful parameters for speech processing. Human hearing appears to pay much more attention to spectral aspects of speech (e.g., the amplitude distribution in frequency) than to phase or timing aspects.

2. TRADITIONAL APPROACH

2.1 LPC-BASED CODING
LPC is one of the most common techniques for low-bit-rate speech coding. Its popularity derives from its compact yet precise representation of the speech spectral magnitude, as well as its relatively simple computation. LPC analysis produces a set, or vector, of real-valued features, which represents an estimate of the spectrum of the windowed speech signal. The LPC vector for a signal frame typically consists of about 8-12 spectral coefficients at 5-6 bits/coefficient. The gain level and pitch are coded with 2-4 bits each. In addition, a binary voiced/unvoiced decision is transmitted. Thus, a 2400 bits/s vocoder might send 60 bits/frame every 25 ms.

2.2 VECTOR QUANTIZATION
Most speech coders transmit time or frequency samples as independent (scalar) parameters, but coding efficiency can be enhanced by eliminating redundant information within blocks of parameters and transmitting a single index code to represent an entire block. This is vector quantization (VQ). During the coding phase, basic analysis yields a vector v of k scalar parameters (features) for each frame.
Then a particular k-dimensional vector, from a set of M vectors stored in a codebook, is chosen as the one corresponding most closely to the vector v. A log2(M)-bit code (the index of the vector chosen from the codebook) is sent in place of the k scalar parameters. The system's decoder must include a codebook identical to that of the coder. To synthesize the output speech, the decoder uses the parameters stored in the codebook under the index corresponding to the received code. The key issues in VQ are the design and the search of the codebook. In coders with scalar quantization, coding distortion comes from the finite precision used to represent each parameter. VQ distortion comes instead from synthesizing with parameters from a codebook entry, which differ from the parameters determined by the speech analyzer.
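The coding cycle just described can be sketched in a few lines (a toy illustration with made-up 2-dimensional feature vectors and a 4-entry codebook; real coders use k around 10-14 dimensions and far larger M):

```python
import math

# Toy codebook of M = 4 two-dimensional feature vectors, so each
# frame is coded with log2(M) = 2 bits instead of k scalar values.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

def dist(u, v):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def vq_encode(v):
    # Full search: return the index of the nearest codebook entry.
    return min(range(len(codebook)), key=lambda i: dist(v, codebook[i]))

def vq_decode(index):
    # The decoder holds an identical codebook and simply looks up the entry.
    return codebook[index]

index = vq_encode((0.9, 0.2))   # nearest entry is (1.0, 0.0)
print(index, vq_decode(index))  # 1 (1.0, 0.0)
```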
The size of the codebook, M, should be large enough that each possible input vector corresponds to a codebook entry whose substitution for v yields output speech close to the original. However, efficient search procedures and storage considerations limit M to smaller values. The greater the degree of correlation among the vector elements, the more the bit rate can be lowered. LPC typically sends 50 bits/frame (10 coefficients, 5 bits each) with scalar quantization, but VQ succeeds with about 10 bits: a well-chosen set of 1024 spectra (2^10 for 10-bit VQ) can adequately represent most possible speech sounds.

2.3 SPEECH CODING BASED UPON VECTOR QUANTIZATION
One of the first experimental comparisons between optimized scalar quantization and vector quantization is presented in [2]. In this work, the gain parameter is treated separately from the rest of the information. The signal is segmented into frames of N samples, for each of which one gain parameter is sent along with the N-sample vector. This approach is called gain separation. In this method, the gain and spectral codebooks are separate, and each entry is decoded as a scalar gain times a waveform vector. Since this method allows separate codebook searches, only 2^L + 2^M entries must be examined instead of the 2^L · 2^M of the original method, with 2^M spectral codebook entries and 2^L gain possibilities. Furthermore, the gain codebook search is much simpler, since it involves scalars rather than the k-dimensional vectors of the spectral codebook. Another sub-optimal technique, binary tree search, is used for searching the codebook. In a full codebook search, the vector for each frame is compared with each of the M codebook entries, thus requiring M distance calculations. A binary tree search replaces these M comparisons with only 2 log2(M) comparisons. The M codebook entries form the lowest nodes of the tree, with each higher node represented by the centroid of all entries below it in the tree.
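A minimal sketch of such a tree search, assuming the codebook size is a power of two and the codebook is ordered so that similar entries are adjacent (a real design builds the tree during codebook training and stores the node centroids precomputed, which is where the doubled memory cost comes from):

```python
import math

def dist(u, v):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(vectors):
    # Component-wise mean of a group of vectors.
    n = len(vectors)
    return tuple(sum(x) / n for x in zip(*vectors))

def tree_search(v, entries):
    # At each level, compare v against the centroids of the two halves
    # and descend into the nearer one: 2*log2(M) distance computations
    # in total, instead of M for a full search.
    lo, hi = 0, len(entries)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left_c = centroid(entries[lo:mid])
        right_c = centroid(entries[mid:hi])
        if dist(v, left_c) <= dist(v, right_c):
            hi = mid
        else:
            lo = mid
    return lo  # index of the chosen codebook entry

codebook = [(0.0,), (0.2,), (0.6,), (1.0,)]
print(tree_search((0.15,), codebook))  # 1
```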
Note that this approach doubles the memory requirement for the codebook. The benefits of VQ are made quite apparent in the results of this work [2]. For a 10-bits/frame full-search vector quantizer, the measured distortion is approximately 1.8 dB. The equivalent distortion for a scalar quantizer occurs at approximately 37 bits/frame, a difference of 27 bits/frame, or a 73 percent reduction in bit rate for the same distortion. With binary tree search, the distortion was slightly greater and the improvement obtained was about 66 percent.

In [3], VQ is applied to modify a 2400 bits/s LPC vocoder to operate at 800 bits/s while retaining acceptable intelligibility and naturalness. Nothing in the LPC design is changed other than the quantization algorithms for the LPC parameters. One of the modifications here is the separation of pitch and voicing in addition to gain. Pitch and gain are quantized as scalars, with one value per 3 frames. Voiced and unvoiced speech spectra are in most cases very different, hence separate codebooks are employed for the two. The tree search has also been modified so as to reduce the distortion of the binary tree approach: a 32-branches/node tree search has been found to be a good compromise, requiring only 1/16 the computation of the full-search procedure but achieving an average distortion very close to that of the full-search codebook. Some techniques for further reducing the bit rate are:

Frame-Repeat: Every other frame is not transmitted; instead, a 1-bit code is sent which specifies whether the missing frame should be interpreted as the same as the preceding or the following spectrum. The determination is based on whichever of the two spectra is closer to the omitted spectrum.

Variable-Frame-Rate (VFR) Transmission: To economize and avoid transients, the LP vectors are often smoothed (parameters interpolated in time) before use in the synthesizer stage.
When the speech signal changes rapidly, LP vectors might be sent every 10 ms, while during steady vowels (or silence) a much lower rate suffices. Data is therefore buffered during rapid changes, for later transmission during periods of less speech dynamics. VFR vocoders can often reduce the bit rate significantly without loss of speech quality, but at the cost of extra complexity and delay. The decision of when to transmit a frame of data normally depends on a distance measure comparing the last frame sent with the current analysis frame; when the distance exceeds a threshold (indicating a large enough speech change), 1-2 frames are sent.

Gain coding: In a typical LPC vocoder, a 5-bit code is used to quantize the gain. In this method, the average gain for each spectral template in the codebook is noted and stored. Then, rather than coding the absolute gain level, only the difference between the input gain and the average gain for the codebook entry is transmitted.

2.4 ACHIEVING VERY LOW RATES
A new method in which input speech is modeled as a sequence of variable-length segments is introduced in [5] and further optimized in [6]. An automatic segmentation algorithm is used to obtain segments with an average duration comparable to that of a phoneme, and each such segment is quantized as a single block. For segmentation, speech is considered a succession of steady states separated by transitions. Spectral time-derivatives are thresholded to determine the middles of transitions, with the threshold chosen so that approximately 11 segments/s are obtained (equal to the expected phoneme rate). The distance between two segments is calculated using an "equi-spaced" sampled representation of the segment (spatial sampling in 14 LPC spectral dimensions): the Euclidean distances between corresponding equi-spaced points on the two segments are summed to arrive at a distance value. Each segment, in this approach, is thus a 140-dimensional vector (14 spectral values × 10 spatial samples).
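The segment distance just described can be sketched as follows (a simplified illustration: toy 2-dimensional spectra instead of 14 LPC dimensions, and nearest-neighbour resampling in time standing in for the paper's equi-spaced spatial sampling):

```python
import math

def resample(segment, n_points):
    # Pick n_points equi-spaced vectors along the segment
    # (nearest-neighbour in time; a simplification of the
    # "spatial" sampling used in the paper).
    m = len(segment)
    return [segment[round(i * (m - 1) / (n_points - 1))]
            for i in range(n_points)]

def segment_distance(seg_a, seg_b, n_points=10):
    # Sum of Euclidean distances between corresponding
    # equi-spaced points on the two segments.
    a = resample(seg_a, n_points)
    b = resample(seg_b, n_points)
    return sum(math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
               for u, v in zip(a, b))

# Two toy "segments" of different lengths tracing the same trajectory.
seg1 = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
seg2 = [(0.0, 0.0), (0.25, 0.25), (0.5, 0.5), (0.75, 0.75), (1.0, 1.0)]
print(segment_distance(seg1, seg2))
```

Because both segments are resampled to the same number of points, segments of different durations can be compared directly, which is what makes block quantization of variable-length segments possible.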
Usually, a clustering algorithm is used to obtain an optimal set of segment templates for the codebook. At the large dimensionality of the segment vocoder, however, the expected quantization error of a properly chosen random quantizer is nearly equal to the distortion-rate bound. Therefore, a computationally intensive clustering algorithm was not used; a random set of segments was obtained and used as the codebook. Approximately 8000 segment templates (13 bits) were used for coding, and a further 8 bits were used for gain, pitch and timing information. The resulting bit rate was 231 bits/s at 11 segments/s. In [6], a further decrease in bit rate was achieved by using a segment network that restricts the number of segment templates that can follow a given template. The segment network is used in the following manner: if the current input segment is quantized to a given template, then only those segment templates that follow this template in the network can be used to quantize the following input segment. The network is implemented as follows. Suppose that the current input segment is quantized to a given template. The last spectrum of this best segment template is used to determine the subset of templates allowed in quantizing the following input segment: the templates allowed are those whose first spectrum is nearest (in Euclidean distance) to the last spectrum of the template used in quantizing the current input segment. Comparison was done with an unconstrained case using a total of 1024 segment templates (10 bits). With the segment network, the choice of segment templates is restricted to 256, and thus only 8 bits/segment are needed to code the segment. Almost no difference in quantization error was found between the two approaches.

2.5 PHONETIC VOCODER
Coding speech at very low data rates of about 100 bits/s implies approaching the true information rate of the communication (an inventory of about 60 phones at a rate of 15-20 phones/second). One approach is to exploit detailed linguistic knowledge of the information embedded in the speech signal.
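The successor restriction of the segment network can be sketched as follows (a simplified illustration: templates are reduced to hypothetical (first spectrum, last spectrum) pairs with toy 1-dimensional spectra, and the allowed subset is the set of templates whose first spectrum lies nearest to the last spectrum of the template just used):

```python
import math

def dist(u, v):
    # Euclidean distance between two spectra.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def allowed_successors(templates, current_index, subset_size):
    # Rank all templates by how close their first spectrum is to the
    # last spectrum of the template just used, and keep the nearest
    # subset_size of them (256 out of 1024 in the paper, cutting the
    # per-segment cost from 10 bits to 8).
    last = templates[current_index][1]
    ranked = sorted(range(len(templates)),
                    key=lambda j: dist(templates[j][0], last))
    return ranked[:subset_size]

# Toy templates: (first_spectrum, last_spectrum) pairs.
templates = [((0.0,), (0.1,)), ((0.1,), (0.9,)),
             ((0.9,), (0.2,)), ((0.5,), (0.5,))]
print(allowed_successors(templates, current_index=1, subset_size=2))  # [2, 3]
```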
In [7], a coding technique based on phonetic speech recognition is explored. The motivation for the work stems from the observation that Hidden Markov Model (HMM) based speech recognition systems (working on LPC features) tend to do an excellent job of modeling the acoustic data. A basic premise of the paper is that the quality of the acoustic match will be good even if the phone recognition accuracy is poor. In other words, a phone may be recognized as another phone by the HMM system, but this will not significantly degrade the speech quality, as the two phones will be very close acoustically. A large acoustic-phonetic database is used, which has been phonetically transcribed and labeled with a set of 60 phones. A basic phone recognition system is implemented with a null grammar (any phone can follow any phone). A contextually rich sample of exemplars of each of the 60 phones is clustered. The major way in which the phonetic vocoder distinguishes itself from a vector quantization system is the manner in which spectral information is transmitted: rather than transmitting indices into a VQ codebook representing each spectral vector, a phone index is transmitted along with auxiliary information describing the path through the model. Good overall synthesized speech quality was achieved with 8 clusters per phone (i.e., 480 clusters). Thus, a simple inventory of phones is shown to be sufficient for capturing the bulk of the acoustic information.

3. CONCATENATIVE SYNTHESIS OF WAVEFORMS
Speech coding at medium rates and below is achieved using an analysis-synthesis process. In the analysis stage, speech is represented by a compact set of parameters, which are encoded efficiently. In the synthesis stage, these parameters are decoded and used in conjunction with a reconstruction mechanism to form speech. In this section, a method is discussed in which the original waveform corresponding to the nearest template segment (codebook entry) is used for synthesis.
The primary difference from conventional coders is that no speech generation framework such as the source-filter model is used. Instead, it is assumed that any speech signal can be reconstructed by concatenating short segment waveforms suitably chosen from a database. This approach is reported to give speech with better perceptual quality than LPC-synthesized speech using pulse/noise excitation.

3.1 WAVEFORM SEGMENT VOCODER
The first foray into waveform-based synthesis was made in [8]. Here, the decoding stage works with the waveforms of the nearest templates, not their spectral representations. The pitch, energy and duration of each template are independently modified to match those of the input segment, and the modified segments are then concatenated to produce the output waveform. The paper describes the time-scale and pitch-scale modification algorithms applied to the template waveforms so as to match the time and pitch characteristics of the input segment. Several sentences from a single male speaker were vocoded using his own templates (a speaker-dependent paradigm). The speech has a choppy quality, presumably due to segment-boundary discontinuities. The authors view this work as a modification of the original LPC vocoder developed by them [5]; hence, phoneme-like segments were the basic elements in their waveform approach. These segments were then subjected to considerable modification, which may in fact reduce the naturalness of the waveform.

3.2 CONTEMPORARY TECHNIQUES
The waveform concatenation approach has been investigated in some detail in [9]. Here, frames are used instead of segments as the units for selection. One advantage is that since frame selection does not require a time-warping process, the synthesis of the speech signal is done without time-scale modification (as was needed in [8]).
On the downside, the bit rate of the frame-based approach is higher, because the longer units of segmental coders are what give them their high compression ratios. Mel-frequency cepstrum coefficients (MFCCs) are used as the feature parameters for unit selection. The database comprises about 76 minutes of speech, corresponding to some 460,000 frames of 10 ms each. Two databases are maintained: the first contains the MFCCs of these roughly 460,000 frames, obtained by the feature extraction
process; the second contains the speech waveforms used in generating the output waveform. The raw speech from which the MFCCs of the first database are computed is the same as the speech in the second database. In addition to the unit indices, pitch (F0) and gain parameters are also transmitted. The unit index represents the position in the database at which the selected unit is located.

Figure 3.1: Block diagram of the coder

3.2.1 UNIT SELECTION
For unit selection, a novel approach proposed in [10] is used. Units in the synthesis database are considered as a state transition network in which the state occupancy cost is the distance between a database unit and a target, and the transition cost is an estimate of the quality of concatenation of two consecutive units. This framework has many similarities to HMM-based speech recognition. A pruned Viterbi search is used to select the best units for synthesis from the database.

3.2.2 CODING THE SELECTED UNIT SEQUENCE
Consecutive input frames are often quantized to consecutive frames in the database. To take advantage of this property, a run-length coding technique is employed to compress the unit sequence: a series of consecutive frames is represented by the start frame and the number of consecutive frames that follow, so that a whole run of frames is encoded in only two variables.

3.2.3 CODING PITCH
Accurate coding of the pitch (F0) contour plays an important role in a very low rate coder, since a correct pitch contour increases naturalness. Piecewise linear approximation is used to implement contour-wise F0 coding. This method offers high compression, as only a small number of sampled points need to be transmitted instead of all individual samples. Of course, the intervals between the sampled points must also be transmitted for proper interpolation. Piecewise linear approximation presumes some degree of smoothness in the function being approximated; therefore, the F0 contour is smoothed before compression.
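The decoder side of this piecewise-linear scheme can be sketched as follows (a simplified illustration; the (interval, value) pairing is one plausible encoding of the transmitted sample points, and the actual method for locating the points is the one specified in [9]):

```python
def decode_f0(points):
    # points: list of (interval, f0) pairs, where `interval` is the
    # number of frames since the previous sampled point (0 for the
    # first point). Returns the reconstructed frame-by-frame contour
    # by linear interpolation between the transmitted points.
    contour = []
    prev_f0 = None
    for interval, f0 in points:
        if prev_f0 is None:
            contour.append(f0)
        else:
            for i in range(1, interval + 1):
                contour.append(prev_f0 + (f0 - prev_f0) * i / interval)
        prev_f0 = f0
    return contour

# 9 frames of F0 contour reconstructed from just 3 transmitted points.
points = [(0, 100.0), (4, 120.0), (4, 110.0)]
print(decode_f0(points))
# [100.0, 105.0, 110.0, 115.0, 120.0, 117.5, 115.0, 112.5, 110.0]
```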
The methods used for finding the locations of the F0 points are discussed in [9].

4. CONCLUSION AND FUTURE SCOPE
Recent advances in computer technology allow a wide variety of applications for speech coding. Transmission can either be real-time, as in normal telephone conversations, or off-line, as in storing speech for electronic-mail forwarding of voice messages. In either case, the transmission bit rate (or storage requirement) is crucial in evaluating the practicality of different coding schemes. Low bit rate concatenative coders can be very useful where large amounts of pre-recorded speech must be stored. A talking book, the spoken equivalent of its printed version, requires huge space for storing speech waveforms unless a high-compression coding scheme is applied. Similarly, a wide variety of multimedia applications, such as language-learning assistance, electronic dictionaries and encyclopedias, are potential applications of very low bit rate speech coders. Interest in exchanging voice messages across the Internet is also increasing, and such coders could play a large role in saving bandwidth: for scenarios where two persons (or a small set of persons) frequently exchange voice messages, concatenative synthesis could be employed.

REFERENCES
1. Douglas O'Shaughnessy. Speech Communications: Human and Machine. Universities Press, 2001.
2. A. Buzo, A. H. Gray, Jr., R. M. Gray, and J. D. Markel. Speech coding based upon vector quantization. IEEE International Conference on Acoustics, Speech and Signal Processing, 1980.
3. D. Y. Wong, B. H. Juang, and A. H. Gray, Jr. Recent developments in vector quantization for speech processing. IEEE International Conference on Acoustics, Speech and Signal Processing, 1981.
4. Richard M. Schwartz and Salim E. Roucos. A comparison of methods for 300-400 b/s vocoders. IEEE International Conference on Acoustics, Speech and Signal Processing, 1983.
5. S. Roucos, R. Schwartz, and J. Makhoul.
Segment quantization for very-low-rate speech coding. IEEE International Conference on Acoustics, Speech and Signal Processing, 1982.
6. S. Roucos, R. M. Schwartz, and J. Makhoul. A segment vocoder at 150 b/s. IEEE International Conference on Acoustics, Speech and Signal Processing, 1983.
7. Joseph Picone and George R. Doddington. A phonetic vocoder. IEEE International Conference on Acoustics, Speech and Signal Processing, 1989.
8. Salim Roucos and Alexander M. Wilgus. The waveform segment vocoder: A new approach for very-low-rate speech coding. IEEE International Conference on Acoustics, Speech and Signal Processing, 1985.
9. Ki-Seung Lee and Richard V. Cox. A very low bit rate speech coder based on a recognition/synthesis paradigm. IEEE Transactions on Speech and Audio Processing, 2001.
10. Andrew J. Hunt and Alan W. Black. Unit selection in a concatenative speech synthesis system using a large speech database. IEEE International Conference on Acoustics, Speech and Signal Processing, 1996.
11. Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, Vol. 77, No. 2, 1989.