Speech Coding in the Frequency Domain


Speech Processing, Advanced Topics. Tom Bäckström, Aalto University, October 2015.

Introduction

The speech production model can be used to efficiently encode speech signals. However, real-life signals often also contain other sounds than single-source speech: background noises in real-life environments, multiple speakers, music and mixed speech/music content in entertainment broadcasts, and singing voices. We therefore need a generic coding mode for non-speech signals. Audio codecs are based on frequency-domain coding, which is a good choice for the generic coding mode.

Introduction: TCX

An early approach (in AMR-WB+) to combining CELP with frequency-domain coding was called transform coded excitation (TCX): model the spectral envelope with linear prediction as in CELP, then take the discrete Fourier transform of the predictor residual and quantize it. From a statistical point of view this is a valid approach: the excitation is an uncorrelated signal, whereby we can transform it directly and do not need overlap between frames in the same sense as classical audio codecs. The original residual and the transform-domain signal contain the same amount of information: critical sampling and perfect reconstruction.

Introduction: TCX

The main issue with the original approach is that applying a proper objective function is difficult. Recall that the objective function is of the form ||H(x - x̂)||^2. If we now apply a transform F on x, such that y = Fx, the objective function is transformed to ||H F^(-1) (y - ŷ)||^2. The matrix H F^(-1) is non-trivial, whereby all samples of y are multiplied with each other. Direct quantization is therefore inefficient (it does not give the best SNR), and optimal quantization would require an exhaustive search.

Introduction: Modern TCX variants

To solve this problem, modern codecs (USAC and EVS) use the MDCT as the time-frequency transform. The MDCT is a lapped transform: it is based on overlap-add, yet retains critical sampling. The envelope model is still used to model the shape of the spectrum, and the spectral coefficients are encoded with entropy coding. A perceptual model can be applied in the frequency domain, whereby the problems with the objective function do not arise.

Introduction: Outlook

The rest of this lecture is a brief introduction to frequency-domain coding. Our objective is to obtain a method with the following properties:
- Transitions between windows are perceptually smooth.
- Critical sampling (the frequency-domain representation has as many samples as the input signal).
- Physically well-motivated, which allows efficient processing (low leakage between frequency components).

Introduction: Time-frequency domain

A time-frequency representation represents the signal as frequency bands which evolve over time.

Windowing and overlap-add

A basic component of most speech and audio methods is segmentation of the signal into windows. Processing in fixed-length blocks allows computationally efficient implementations, and when the signal can be assumed stationary within a segment, it can be modelled with a single stationary model, whereby even simple methods give high output quality. Audio processing methods generally use overlapping windows to obtain a fade-in/fade-out functionality. Non-overlapping windows would suffer from discontinuities, which are perceptually bad, whereas overlapping windows which go smoothly to zero ensure that the signal is continuous also after processing, which is perceptually good. Overlapping windowing is a perceptual tool!

[Figure: (a) overlapping half-sine windows and (c) overlapping Kaiser-Bessel derived windows; consecutive windows x_k and x_k+1 shown as magnitude over time.]

Windowing and overlap-add

For example, half-sine windows are defined as

    ω_k = sin((k + 0.5) π / (2N)),  for 0 ≤ k < 2N,
    ω_k = 0,                        otherwise.

A window of the input signal σ_k is then ξ_k = ω_k σ_k. Subsequent windows are obtained by shifting ω_k in time.
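
The half-sine window and its overlap property can be checked numerically. A minimal sketch in Python (the function name is ours):

```python
import math

def half_sine_window(N):
    """Half-sine window of length 2N: w_k = sin((k + 0.5) * pi / (2 * N))."""
    return [math.sin((k + 0.5) * math.pi / (2 * N)) for k in range(2 * N)]

N = 8
w = half_sine_window(N)

# With a hop of N samples, the right half of one window overlaps the left
# half of the next window; the squared window values at each overlapping
# position sum to one, which is what makes overlap-add reconstruction exact.
overlap_sums = [w[k] ** 2 + w[k + N] ** 2 for k in range(N)]
print(all(abs(s - 1.0) < 1e-12 for s in overlap_sums))  # True
```

This sum-to-one property follows from sin^2 + cos^2 = 1, since shifting the half-sine by N samples turns the sine into a cosine.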

[Figure: window shapes as magnitude over time, with the corresponding magnitude spectra (dB) over frequency bins.]

Windowing and overlap-add: Reconstruction

After processing, the windowing function is applied once more, ω_k ξ̂_k, on the processed signal ξ̂_k. This ensures that the signal goes to zero at the window borders also after processing. The windows of the signal are then added together. If σ_L,k and σ_R,k are the left part of the current window and the right part of the previous window, then for perfect reconstruction we should have

    σ_k = σ_L,k + σ_R,k = ω_L,k ξ_L,k + ω_R,k ξ_R,k = ω_L,k^2 σ_k + ω_R,k^2 σ_k = (ω_L,k^2 + ω_R,k^2) σ_k.

If ω_L,k^2 + ω_R,k^2 = 1, then the reconstructed signal is equal to the original signal.
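
This analysis-window / synthesis-window / overlap-add chain can be sketched as a toy example (half-sine window, no processing in between; all names are ours):

```python
import math

N = 4
w = [math.sin((k + 0.5) * math.pi / (2 * N)) for k in range(2 * N)]  # half-sine

x = [0.3, -1.0, 0.7, 0.2, 0.5, -0.4, 0.9, -0.6, 0.1, 0.8, -0.2, 0.4]

# Analysis: window two overlapping frames (hop N, frame length 2N).
frames = [[w[k] * x[start + k] for k in range(2 * N)] for start in (0, N)]

# Synthesis: window once more, then overlap-add.
y = [0.0] * len(x)
for frame, start in zip(frames, (0, N)):
    for k in range(2 * N):
        y[start + k] += w[k] * frame[k]

# In the fully overlapped region (samples N .. 2N-1) the squared window
# halves sum to one, so the signal is reconstructed exactly there.
print([round(v, 6) for v in y[N:2 * N]])  # [0.5, -0.4, 0.9, -0.6]
```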

[Figure: (b) overlapping squared half-sine windows and (d) overlapping squared Kaiser-Bessel derived windows; squared magnitude of consecutive windows x_k and x_k+1 over time.]

[Figure: (a) a signal x; (b) the signal weighted by the squared right window half, w_R^2 x; (c) the signal weighted by the squared left window half, w_L^2 x.]

Projections and time-domain aliasing cancellation

If we have full overlap between windows, then for every window we get N new samples of data, but every window contains 2N samples. The objective of coding is to reduce redundancy, but we have just doubled the amount of data! Overlapping windowing does not provide critical sampling, so we need some method to return to critical sampling. The MDCT is based on a projection known as time-domain aliasing cancellation (TDAC), which gives a representation with critical sampling.

Projections and time-domain aliasing cancellation

The projection used in the MDCT is based on splitting the signal into a symmetric and an antisymmetric part. With J the flip (time-reversal) matrix, define

    x_R,k = P_R x_k,  where P_R = (1/√2) [I  J],
    x_L,k = P_L x_k,  where P_L = (1/√2) [I  -J].

Then

    P_L^T x_L,k + P_R^T x_R,k = (P_L^T P_L + P_R^T P_R) x_k = x_k,

since P_L^T P_L + P_R^T P_R = I. Moreover, P_R^T P_R x_k is symmetric and P_L^T P_L x_k is antisymmetric.
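
A small numerical check of these projections (pure-Python matrices; the 1/√2 scaling makes the reconstruction identity exact):

```python
import math

N = 4            # length of x_k (even)
h = N // 2
s = 1 / math.sqrt(2)

# P_R = s * [I J] extracts the symmetric part, P_L = s * [I -J] the antisymmetric.
P_R = [[s if (j == i or j == N - 1 - i) else 0.0 for j in range(N)] for i in range(h)]
P_L = [[s if j == i else (-s if j == N - 1 - i else 0.0) for j in range(N)] for i in range(h)]

def apply(P, x):
    return [sum(P[i][j] * x[j] for j in range(len(x))) for i in range(len(P))]

def apply_T(P, y):
    return [sum(P[i][j] * y[i] for i in range(len(P))) for j in range(N)]

x = [0.9, -0.3, 0.4, 0.7]
sym = apply_T(P_R, apply(P_R, x))    # symmetric part of x
anti = apply_T(P_L, apply(P_L, x))   # antisymmetric part of x

print([round(a + b, 6) for a, b in zip(sym, anti)])  # [0.9, -0.3, 0.4, 0.7]
```

The projections halve the number of samples (length N down to N/2), and adding the symmetric and antisymmetric parts back together recovers the original signal.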

[Figure: (a) a signal x; (b) its symmetric part P_R^T P_R x; (c) its antisymmetric part P_L^T P_L x.]

Projections and time-domain aliasing cancellation

If x_k is of length N, then the projected signals x_R,k = P_R x_k and x_L,k = P_L x_k are of length N/2. Each part holds exactly half of the signal: critical sampling! We can use these projections to remove the redundancy at the overlap.

Combination of projection and windowing

Perfect reconstruction works as long as the Princen-Bradley condition holds: P_L^T P_L + P_R^T P_R = I. If we combine the projections onto symmetric and antisymmetric parts with windowing as P̄_L = P_L W_L and P̄_R = P_R W_R, where W_L and W_R are the windowing functions, then the Princen-Bradley condition still holds. This is a special case: windowing cannot be combined with all projections, but for the symmetric and antisymmetric parts it does work. We can thus use P̄_L and P̄_R as a projection into two parts, with smooth transitions between windows and critical sampling.
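
The windowed perfect-reconstruction property can be verified numerically. In this sketch (names ours) we fold with unnormalized matrices [I J] and [I -J], letting the half-sine window supply the missing normalization, since its overlapping squares sum to one:

```python
import math

N = 4
h = N // 2
w = [math.sin((k + 0.5) * math.pi / (2 * N)) for k in range(2 * N)]
wL, wR = w[:N], w[N:]   # left half of the next window, right half of the current

def fold(v, sign):
    # unnormalized [I  sign*J] applied to a length-N vector
    return [v[i] + sign * v[N - 1 - i] for i in range(h)]

def unfold(y, sign):
    out = [0.0] * N
    for i in range(h):
        out[i] += y[i]
        out[N - 1 - i] += sign * y[i]
    return out

z = [0.6, -0.2, 0.8, 0.3]   # signal samples in the overlap region

# current window: window with wR, fold symmetrically, unfold, window again
uR = unfold(fold([wR[k] * z[k] for k in range(N)], +1), +1)
rec_R = [wR[k] * uR[k] for k in range(N)]
# next window: window with wL, fold antisymmetrically, unfold, window again
uL = unfold(fold([wL[k] * z[k] for k in range(N)], -1), -1)
rec_L = [wL[k] * uL[k] for k in range(N)]

rec = [a + b for a, b in zip(rec_R, rec_L)]
print([round(v, 6) for v in rec])  # [0.6, -0.2, 0.8, 0.3]
```

The symmetric aliasing from the current window and the antisymmetric aliasing from the next window cancel exactly; only the original samples survive the overlap-add.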

[Figure: (a) a signal x; (b) W_R^T P_R^T P_R W_R x; (c) W_L^T P_L^T P_L W_L x.]

Time-frequency transforms

We have above obtained a critically sampled representation x of the signal such that transitions between windows are smooth. Next we want to transform the representation to the frequency domain, y = Dx, with a transform D. For a real-valued spectral representation we could use the ordinary discrete cosine transform of type II, that is, DCT-II. A benefit of real-valued transforms is that they are simpler to process than complex values. The signal is then represented as a weighted sum of basis functions y_k. The reconstruction of an individual basis function is

    x_k = [P_L; P_R]^T D^(-1) y_k,

where [P_L; P_R] stacks the two projections.

Time-frequency transforms: DCT-II

[Figure: (a) basis functions of the DCT-II; (b) their reconstructions.]

Time-frequency transforms: DCT-II

The reconstructed basis functions have strange shapes: some have discontinuities and some have odd corners. Clearly the combination of DCT-II and TDAC does not work too well, as the extensions/reconstructions of the basis functions do not correspond to physical frequency elements. Let's try the DCT-IV instead!

Time-frequency transforms: DCT-IV

[Figure: (a) basis functions of the DCT-IV; (b), (c) their (windowed) reconstructions.]

Time-frequency transforms: MDCT

The extended/reconstructed basis functions have the following properties: the left and right parts are symmetric and antisymmetric, and each extended basis function corresponds perfectly to a sinusoid. The extension is in fact a discrete cosine transform, which we will call the modified discrete cosine transform (MDCT):

    X_k = Σ_{n=0}^{2N-1} x_n cos[ (π/N) (n + 1/2 + N/2) (k + 1/2) ].

Note! The MDCT takes 2N samples as input and gives N frequencies as output, but since each window has N samples in common with the previous window, we get N frequencies for every N new samples: critical sampling.
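
The MDCT formula can be implemented directly and the aliasing cancellation verified numerically. In this sketch (names ours) the inverse uses a 1/N scaling, chosen so that plain overlap-add of two unwindowed frames reconstructs the overlap region:

```python
import math

def mdct(x, N):
    """X_k = sum_{n=0}^{2N-1} x_n cos[(pi/N)(n + 1/2 + N/2)(k + 1/2)]."""
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(X, N):
    # transpose of the forward transform with a 1/N factor; each output frame
    # still contains time-domain aliasing, which cancels in overlap-add
    return [sum(X[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for k in range(N)) / N
            for n in range(2 * N)]

N = 4
x = [0.3, -1.0, 0.7, 0.2, 0.5, -0.4, 0.9, -0.6, 0.1, 0.8, -0.2, 0.4]

y0 = imdct(mdct(x[0:2 * N], N), N)   # frame at offset 0
y1 = imdct(mdct(x[N:3 * N], N), N)   # frame at offset N

# overlap-add on samples N .. 2N-1: the aliasing terms cancel (TDAC)
rec = [y0[N + k] + y1[k] for k in range(N)]
print([round(v, 6) for v in rec])  # [0.5, -0.4, 0.9, -0.6]
```

Note also the dimensions: each call to mdct maps 2N time samples to N frequency coefficients, which is the critical-sampling property stated above.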

[Figure: an MDCT basis function over time, with magnitude spectra (dB) over frequency bins.]

Time-frequency transforms: MDCT

The MDCT thus has all the desired properties: smooth transitions, critical sampling and well-defined frequency components. It is the most commonly used time-frequency transform in audio coding, used in AAC, USAC, EVS etc. Prof. Edler was a central developer of the MDCT. The only notable drawback of the MDCT is that it is a real-valued transform: if a signal perfectly aligns with a basis function in one frame, it can be perfectly orthogonal (= 0) to the basis function in the next frame. Amplitudes of MDCT components therefore have much larger variance than the original signal, and the physical interpretation of the amplitudes is difficult/inefficient.

Time-frequency transforms

We now have a frequency representation of a window of the signal. The next step is to quantize and code the signal. To obtain perceptually uniform quantization noise, we scale the spectral components X_k with the perceptual envelope W_k, that is, we quantize X_k / W_k. The quantized signal is then multiplied by W_k to return to the original domain.
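
As a sketch of this weighting (the values and the step-size parameter are made up for illustration):

```python
def quantize_with_envelope(X, W, step=1.0):
    """Quantize X_k / W_k uniformly, then return to the signal domain."""
    return [W_k * step * round(X_k / (W_k * step)) for X_k, W_k in zip(X, W)]

X = [12.3, -4.1, 0.7, 2.9]   # spectral components (made up)
W = [8.0, 4.0, 1.0, 2.0]     # perceptual envelope (made up)
Xq = quantize_with_envelope(X, W)

# the error in each bin is at most W_k * step / 2, so the quantization
# noise follows the shape of the perceptual envelope
print(Xq)  # [16.0, -4.0, 1.0, 2.0]
```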

[Figure: (a) spectrum X_k and perceptual envelope W_k; (b) weighted spectrum X_k/W_k and its quantization X'_k/W_k; (c) original X_k and quantized X'_k; magnitudes in dB over frequency bins.]

Basics of entropy coding

We now have a quantized spectrum, and our objective is to transmit that spectrum with the lowest number of bits. This is known as lossless coding: compression such that the original (quantized) spectrum can be exactly recovered. Each quantization level can be interpreted as a symbol, so we have an alphabet of symbols, and we need unique identifiers for each symbol in terms of bit-strings.

Basics of entropy coding

Consider a three-symbol alphabet with symbols a, b and c. We can assign them the unique binary strings 00, 01 and 11, and the symbols can then be transmitted with 2 bits/symbol. However, this is already inefficient, because in theory three symbols need only log2 3 ≈ 1.58 bits/symbol on average. When constrained to fixed-length bit-strings we cannot do better.

Basics of entropy coding

If we know beforehand the occurrence probabilities of each symbol, we can do better. Consider the following probabilities and bit-strings:

    Symbol   Probability   Code   Bits
    a        0.5           0      1
    b        0.25          10     2
    c        0.25          11     2

Clearly each symbol has a unique bit-string, and from a bit-string 10110 we can easily decode bca. The average bit-rate is 0.5 * 1 + 0.25 * 2 + 0.25 * 2 = 1.5, that is, on average we use 1.5 bits/symbol. This is known as Huffman coding, and it works optimally when the probabilities are powers of 0.5.
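
Prefix-free decoding of such a code can be sketched in a few lines (using the code table 0, 10, 11 for a, b, c):

```python
code = {"a": "0", "b": "10", "c": "11"}
decode_table = {bits: sym for sym, bits in code.items()}

def decode(bitstring):
    """Walk the bits left to right; emit a symbol when the buffer matches a codeword."""
    out, buf = [], ""
    for bit in bitstring:
        buf += bit
        if buf in decode_table:      # prefix-free: the first match is unambiguous
            out.append(decode_table[buf])
            buf = ""
    return "".join(out)

probs = {"a": 0.5, "b": 0.25, "c": 0.25}
avg_bits = sum(p * len(code[s]) for s, p in probs.items())
print(decode("10110"), avg_bits)  # bca 1.5
```

Because no codeword is a prefix of another, the decoder never needs to look ahead.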

Arithmetic coding

In the general case, when the probabilities are arbitrary numbers, we can use arithmetic coding. Consider the following alphabet and the corresponding probabilities:

    Symbol   Probability   Interval
    a        0.41          0.00 ... 0.41
    b        0.22          0.41 ... 0.63
    c        0.15          0.63 ... 0.78
    d        0.12          0.78 ... 0.90
    e        0.10          0.90 ... 1.00

Here each symbol is mapped to a unique interval in [0, 1].

Arithmetic coding.41.63.78.9 1 a b c d e.1.2.3.4.5.6.7.8.9 1

Arithmetic coding

Suppose we then want to encode a sequence of symbols, for example adc. The first symbol lands in the interval 0 ... 0.41, which we call the remaining range. The intervals for the second symbol are then mapped into the remaining range, and the second symbol lands in the interval 0.3198 ... 0.369. The intervals for the third symbol are again mapped into the remaining range, and the last symbol lands in the interval 0.3508 ... 0.3582. Transmitting the sequence of symbols adc is then equivalent to transmitting a binary code which uniquely identifies the interval 0.3508 ... 0.3582.
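
The interval narrowing for the sequence adc can be reproduced with a few lines (probabilities as in the table above; names ours):

```python
probs = [("a", 0.41), ("b", 0.22), ("c", 0.15), ("d", 0.12), ("e", 0.10)]

def encode_interval(sequence):
    """Narrow [0, 1) symbol by symbol; the final interval identifies the sequence."""
    lo, hi = 0.0, 1.0
    for target in sequence:
        width = hi - lo
        cum = 0.0
        for sym, p in probs:
            if sym == target:
                hi = lo + (cum + p) * width   # update hi first, it uses the old lo
                lo = lo + cum * width
                break
            cum += p
    return lo, hi

lo, hi = encode_interval("adc")
print(round(lo, 6), round(hi, 6))  # 0.350796 0.358176
```

The final interval (0.3508 ... 0.3582, to four decimals) matches the worked example above.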

[Figure: successive subdivision of the intervals. First 0 ... 1 is split into a ... e at 0.41, 0.63, 0.78, 0.9; then the remaining range 0 ... 0.41 is split at 0.1681, 0.2583, 0.3198, 0.369; finally the remaining range 0.3198 ... 0.369 is split at 0.34, 0.3508, 0.3582, 0.3641.]

Arithmetic coding

The average bit-consumption is then

    b = - Σ_s p(s) log2 p(s),

where the sum runs over all symbols s. In the above example the bit-consumption is 2.12 bits/symbol.
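
The formula is easy to check for the example alphabet:

```python
import math

probs = [0.41, 0.22, 0.15, 0.12, 0.10]
bits_per_symbol = -sum(p * math.log2(p) for p in probs)
print(round(bits_per_symbol, 2))  # 2.12
```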

Arithmetic coding

To encode an interval such as 0.66 ... 0.71, we use a binary representation of decimal numbers. Let 0 correspond to the interval 0 ... 0.5 and 1 to 0.5 ... 1. The string 10 then corresponds to the interval 0.5 ... 0.75 and 11 to 0.75 ... 1. The string 100 corresponds to the interval 0.5 ... 0.625 and 101 to 0.625 ... 0.75, etc. When we reach a bit-string whose range is inside the desired range, we are finished.
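
Finding a bit-string whose dyadic interval fits inside the target range can be sketched as a naive search over string lengths (fine for illustration, not for production coders):

```python
def bits_to_interval(bits):
    """Interval [b, b + 2^-len) represented by a bit-string."""
    lo = sum(int(bit) * 2.0 ** -(i + 1) for i, bit in enumerate(bits))
    return lo, lo + 2.0 ** -len(bits)

def shortest_fitting_bits(lo, hi):
    """Shortest bit-string whose interval lies inside [lo, hi]."""
    length = 1
    while True:
        for b in range(2 ** length):
            bits = format(b, "0{}b".format(length))
            blo, bhi = bits_to_interval(bits)
            if lo <= blo and bhi <= hi:
                return bits
        length += 1

print(shortest_fitting_bits(0.66, 0.71))  # 101011
```

Several strings of the minimal length can fit; for example 101100 (interval 0.6875 ... 0.703125) also lies inside 0.66 ... 0.71.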

Arithmetic coding

Encoding of the range 0.66 ... 0.71 proceeds by halving the interval step by step (1, 10, 101, ...) until a bit-string is found whose interval lies inside the range. The final bit-string is 101100, which corresponds to the interval 0.6875 ... 0.703125. The decoder then decodes the intervals from the bit-string and maps them back to symbols.

Arithmetic coding

Arithmetic coding is thus a form of entropy coding which takes an alphabet (quantization levels) and their occurrence probabilities, and encodes a sequence of symbols with an optimally low number of bits. In a practical system we do not want to use a fixed alphabet-probability combination, but instead model the signal. We can then either use preceding coefficients (known as the context) to predict the probability model of the current coefficient (USAC and EVS), or model the probabilities with, say, a Laplacian distribution and deduce the variance of the samples from the spectral envelope shape (EVS).

Integration with CELP

Above we have presented the principles of frequency-domain coding for speech codecs. For integration into a practical codec we need methods for switching between time- and frequency-domain coding. The windowing paradigm based on the MDCT fits poorly with the filter-based windows of CELP, so we must use engineering solutions, a.k.a. hacks, to switch the windowing concept. Moreover, the characteristic distortions of time- and frequency-domain codecs are very different: switching in the middle of a phoneme can easily become audible, because the character of the artifacts changes even if the absolute perceptual quality remains constant. One solution is to allow switching only at phoneme borders, which requires advanced signal analysis.

Summary of frequency domain coding

Frequency-domain coding is effective for stationary signals such as music, background noises, and stationary segments of speech. Transform coded excitation (TCX) is a family of frequency-domain coding methods which use linear prediction as a model of the spectral envelope. Modern frequency-domain codecs are based on the MDCT, which provides smooth transitions between windows, critical sampling and a physically well-defined transform. The spectrum is weighted with a perceptual model to limit the perceptual effect of quantization noise, and the frequency components are encoded with an entropy codec to reduce the bit-rate.