Speech Coding in the Frequency Domain


Speech Processing, Advanced Topics. Tom Bäckström, Aalto University, October 2015.

Introduction

The speech production model can be used to efficiently encode speech signals. However, real-life signals often also contain other sounds than single-source speech: background noises in real-life environments, multiple speakers, music and mixed speech/music content in entertainment broadcasts, and singing voices. We therefore need a generic coding mode for non-speech signals. Audio codecs are based on frequency-domain coding, which is a good choice for the generic coding mode.

Introduction: TCX

An early approach (in AMR-WB+) to combining CELP with frequency-domain coding was called transform coded excitation (TCX): model the spectral envelope with linear prediction as in CELP, then take the discrete Fourier transform of the predictor residual and quantize it. From a statistical point of view this is a valid approach: the excitation is an uncorrelated signal, whereby we can transform it directly and do not need overlap between frames in the same sense as classical audio codecs. The original residual and the transform-domain signal contain the same amount of information: critical sampling and perfect reconstruction.

Introduction: TCX

The main issue with the original approach is that applying a proper objective function is difficult. Recall that the objective function is of the form ||H(x - x̂)||^2. If we now apply a transform F on x, such that y = Fx, the objective function is transformed to ||H F^(-1) (y - ŷ)||^2. The matrix H F^(-1) is non-trivial, whereby all samples of y are multiplied with each other. Direct quantization is therefore inefficient (it does not give the best SNR), and optimal quantization would require an exhaustive search.

Introduction: Modern TCX variants

To solve this problem, modern codecs (USAC and EVS) use the MDCT as the time-frequency transform. The MDCT is a lapped transform: it is based on overlap-add, yet retains critical sampling. The envelope model is still used to model the shape of the spectrum, and the spectral coefficients are encoded with entropy coding. A perceptual model can be applied in the frequency domain, whereby the problems with the objective function do not arise.

Introduction: Outlook

The rest of this lecture is a brief introduction to frequency-domain coding. Our objective is to obtain a method with the following properties:
- Transitions between windows are perceptually smooth.
- Critical sampling (the frequency-domain representation has as many samples as the input signal).
- Physically well-motivated, which allows efficient processing (low leakage between frequency components).

Introduction: Time-frequency domain

A time-frequency representation represents the signal as frequency bands which evolve over time.

Windowing and overlap-add

A basic component of most speech and audio methods is segmentation of the signal into windows. Processing in fixed-length blocks allows computationally efficient implementations, and when the signal can be assumed stationary within a segment, it can be modelled with a single stationary model, whereby even simple methods give high output quality. Audio processing methods generally use overlapping windows to obtain a fade-in/fade-out functionality. Non-overlapping windows would suffer from discontinuities, which are perceptually bad, whereas overlapping windows which go smoothly to zero ensure that the signal is continuous also after processing, which is perceptually good. Overlapping windowing is a perceptual tool!

[Figure: (a) overlapping half-sine windows and (c) overlapping Kaiser-Bessel derived windows; consecutive windows x_k and x_k+1 shown as magnitude over time.]

Windowing and overlap-add

For example, half-sine windows are defined as

    ω_k = sin((k + 0.5) π / (2N)),  for 0 ≤ k < 2N,
    ω_k = 0,                        otherwise.

A window of the input signal σ_k is then ξ_k = ω_k σ_k. Subsequent windows are obtained by shifting ω_k in time.
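
The half-sine window and its overlap property can be checked numerically. A minimal sketch in Python (the function name is ours):

```python
import math

def half_sine_window(N):
    """Half-sine window of length 2N: w_k = sin((k + 0.5) * pi / (2 * N))."""
    return [math.sin((k + 0.5) * math.pi / (2 * N)) for k in range(2 * N)]

N = 8
w = half_sine_window(N)

# With a hop of N samples, the right half of one window overlaps the left
# half of the next window; the squared window values at each overlapping
# position sum to one, which is what makes overlap-add reconstruction exact.
overlap_sums = [w[k] ** 2 + w[k + N] ** 2 for k in range(N)]
print(all(abs(s - 1.0) < 1e-12 for s in overlap_sums))  # True
```

This sum-to-one property follows from sin^2 + cos^2 = 1, since shifting the half-sine by N samples turns the sine into a cosine.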

[Figure: window shapes as magnitude over time, with the corresponding magnitude spectra (dB) over frequency bins.]

Windowing and overlap-add: Reconstruction

After processing, the windowing function is applied once more, ω_k ξ̂_k, on the processed signal ξ̂_k. This ensures that the signal goes to zero at the window borders also after processing. The windows of the signal are then added together. If σ_L,k and σ_R,k are the left part of the current window and the right part of the previous window, then for perfect reconstruction we should have

    σ_k = σ_L,k + σ_R,k = ω_L,k ξ_L,k + ω_R,k ξ_R,k = ω_L,k^2 σ_k + ω_R,k^2 σ_k = (ω_L,k^2 + ω_R,k^2) σ_k.

If ω_L,k^2 + ω_R,k^2 = 1, then the reconstructed signal is equal to the original signal.
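
This analysis-window / synthesis-window / overlap-add chain can be sketched as a toy example (half-sine window, no processing in between; all names are ours):

```python
import math

N = 4
w = [math.sin((k + 0.5) * math.pi / (2 * N)) for k in range(2 * N)]  # half-sine

x = [0.3, -1.0, 0.7, 0.2, 0.5, -0.4, 0.9, -0.6, 0.1, 0.8, -0.2, 0.4]

# Analysis: window two overlapping frames (hop N, frame length 2N).
frames = [[w[k] * x[start + k] for k in range(2 * N)] for start in (0, N)]

# Synthesis: window once more, then overlap-add.
y = [0.0] * len(x)
for frame, start in zip(frames, (0, N)):
    for k in range(2 * N):
        y[start + k] += w[k] * frame[k]

# In the fully overlapped region (samples N .. 2N-1) the squared window
# halves sum to one, so the signal is reconstructed exactly there.
print([round(v, 6) for v in y[N:2 * N]])  # [0.5, -0.4, 0.9, -0.6]
```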

[Figure: (b) overlapping squared half-sine windows and (d) overlapping squared Kaiser-Bessel derived windows; squared magnitude of consecutive windows x_k and x_k+1 over time.]

[Figure: (a) a signal x; (b) the signal weighted by the squared right window half, w_R^2 x; (c) the signal weighted by the squared left window half, w_L^2 x.]

Projections and time-domain aliasing cancellation

If we have full overlap between windows, then for every window we get N new samples of data, but every window contains 2N samples. The objective of coding is to reduce redundancy, but we have just doubled the amount of data! Overlapping windowing does not provide critical sampling, so we need some method to return to critical sampling. The MDCT is based on a projection known as time-domain aliasing cancellation (TDAC), which gives a representation with critical sampling.

Projections and time-domain aliasing cancellation

The projection used in the MDCT is based on splitting the signal into a symmetric and an antisymmetric part. With J the flip (time-reversal) matrix, define

    x_R,k = P_R x_k,  where P_R = (1/√2) [I  J],
    x_L,k = P_L x_k,  where P_L = (1/√2) [I  -J].

Then

    P_L^T x_L,k + P_R^T x_R,k = (P_L^T P_L + P_R^T P_R) x_k = x_k,

since P_L^T P_L + P_R^T P_R = I. Moreover, P_R^T P_R x_k is symmetric and P_L^T P_L x_k is antisymmetric.
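
A small numerical check of these projections (pure-Python matrices; the 1/√2 scaling makes the reconstruction identity exact):

```python
import math

N = 4            # length of x_k (even)
h = N // 2
s = 1 / math.sqrt(2)

# P_R = s * [I J] extracts the symmetric part, P_L = s * [I -J] the antisymmetric.
P_R = [[s if (j == i or j == N - 1 - i) else 0.0 for j in range(N)] for i in range(h)]
P_L = [[s if j == i else (-s if j == N - 1 - i else 0.0) for j in range(N)] for i in range(h)]

def apply(P, x):
    return [sum(P[i][j] * x[j] for j in range(len(x))) for i in range(len(P))]

def apply_T(P, y):
    return [sum(P[i][j] * y[i] for i in range(len(P))) for j in range(N)]

x = [0.9, -0.3, 0.4, 0.7]
sym = apply_T(P_R, apply(P_R, x))    # symmetric part of x
anti = apply_T(P_L, apply(P_L, x))   # antisymmetric part of x

print([round(a + b, 6) for a, b in zip(sym, anti)])  # [0.9, -0.3, 0.4, 0.7]
```

The projections halve the number of samples (length N down to N/2), and adding the symmetric and antisymmetric parts back together recovers the original signal.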

[Figure: (a) a signal x; (b) its symmetric part P_R^T P_R x; (c) its antisymmetric part P_L^T P_L x.]

Projections and time-domain aliasing cancellation

If x_k is of length N, then the projected signals x_R,k = P_R x_k and x_L,k = P_L x_k are of length N/2. Each part holds exactly half of the signal: critical sampling! We can use these projections to remove the redundancy at the overlap.

Combination of projection and windowing

Perfect reconstruction works as long as the Princen-Bradley condition holds: P_L^T P_L + P_R^T P_R = I. If we combine the projections onto symmetric and antisymmetric parts with windowing as P̄_L = P_L W_L and P̄_R = P_R W_R, where W_L and W_R are the windowing functions, then the Princen-Bradley condition still holds. This is a special case: windowing cannot be combined with all projections, but for the symmetric and antisymmetric parts it does work. We can thus use P̄_L and P̄_R as a projection into two parts, with smooth transitions between windows and critical sampling.
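
The windowed perfect-reconstruction property can be verified numerically. In this sketch (names ours) we fold with unnormalized matrices [I J] and [I -J], letting the half-sine window supply the missing normalization, since its overlapping squares sum to one:

```python
import math

N = 4
h = N // 2
w = [math.sin((k + 0.5) * math.pi / (2 * N)) for k in range(2 * N)]
wL, wR = w[:N], w[N:]   # left half of the next window, right half of the current

def fold(v, sign):
    # unnormalized [I  sign*J] applied to a length-N vector
    return [v[i] + sign * v[N - 1 - i] for i in range(h)]

def unfold(y, sign):
    out = [0.0] * N
    for i in range(h):
        out[i] += y[i]
        out[N - 1 - i] += sign * y[i]
    return out

z = [0.6, -0.2, 0.8, 0.3]   # signal samples in the overlap region

# current window: window with wR, fold symmetrically, unfold, window again
uR = unfold(fold([wR[k] * z[k] for k in range(N)], +1), +1)
rec_R = [wR[k] * uR[k] for k in range(N)]
# next window: window with wL, fold antisymmetrically, unfold, window again
uL = unfold(fold([wL[k] * z[k] for k in range(N)], -1), -1)
rec_L = [wL[k] * uL[k] for k in range(N)]

rec = [a + b for a, b in zip(rec_R, rec_L)]
print([round(v, 6) for v in rec])  # [0.6, -0.2, 0.8, 0.3]
```

The symmetric aliasing from the current window and the antisymmetric aliasing from the next window cancel exactly; only the original samples survive the overlap-add.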

[Figure: (a) a signal x; (b) W_R^T P_R^T P_R W_R x; (c) W_L^T P_L^T P_L W_L x.]

Time-frequency transforms

We have above obtained a critically sampled representation x of the signal such that transitions between windows are smooth. Next we want to transform the representation to the frequency domain, y = Dx, with a transform D. For a real-valued spectral representation we could use the ordinary discrete cosine transform of type II, that is, DCT-II. A benefit of real-valued transforms is that they are simpler to process than complex values. The signal is then represented as a weighted sum of basis functions y_k. The reconstruction of an individual basis function is

    x_k = [P_L; P_R]^T D^(-1) y_k,

where [P_L; P_R] stacks the two projections.

Time-frequency transforms: DCT-II

[Figure: (a) basis functions of the DCT-II; (b) their reconstructions.]

Time-frequency transforms: DCT-II

The reconstructed basis functions have strange shapes: some have discontinuities and some have odd corners. Clearly the combination of DCT-II and TDAC does not work too well, as the extensions/reconstructions of the basis functions do not correspond to physical frequency elements. Let's try the DCT-IV instead!

Time-frequency transforms: DCT-IV

[Figure: (a) basis functions of the DCT-IV; (b), (c) their (windowed) reconstructions.]

Time-frequency transforms: MDCT

The extended/reconstructed basis functions have the following properties: the left and right parts are symmetric and antisymmetric, and each extended basis function corresponds perfectly to a sinusoid. The extension is in fact a discrete cosine transform, which we will call the modified discrete cosine transform (MDCT):

    X_k = Σ_{n=0}^{2N-1} x_n cos[ (π/N) (n + 1/2 + N/2) (k + 1/2) ].

Note! The MDCT takes 2N samples as input and gives N frequencies as output, but since each window has N samples in common with the previous window, we get N frequencies for every N new samples: critical sampling.
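
The MDCT formula can be implemented directly and the aliasing cancellation verified numerically. In this sketch (names ours) the inverse uses a 1/N scaling, chosen so that plain overlap-add of two unwindowed frames reconstructs the overlap region:

```python
import math

def mdct(x, N):
    """X_k = sum_{n=0}^{2N-1} x_n cos[(pi/N)(n + 1/2 + N/2)(k + 1/2)]."""
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(X, N):
    # transpose of the forward transform with a 1/N factor; each output frame
    # still contains time-domain aliasing, which cancels in overlap-add
    return [sum(X[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for k in range(N)) / N
            for n in range(2 * N)]

N = 4
x = [0.3, -1.0, 0.7, 0.2, 0.5, -0.4, 0.9, -0.6, 0.1, 0.8, -0.2, 0.4]

y0 = imdct(mdct(x[0:2 * N], N), N)   # frame at offset 0
y1 = imdct(mdct(x[N:3 * N], N), N)   # frame at offset N

# overlap-add on samples N .. 2N-1: the aliasing terms cancel (TDAC)
rec = [y0[N + k] + y1[k] for k in range(N)]
print([round(v, 6) for v in rec])  # [0.5, -0.4, 0.9, -0.6]
```

Note also the dimensions: each call to mdct maps 2N time samples to N frequency coefficients, which is the critical-sampling property stated above.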

[Figure: an MDCT basis function over time, with magnitude spectra (dB) over frequency bins.]

Time-frequency transforms: MDCT

The MDCT thus has all the desired properties: smooth transitions, critical sampling and well-defined frequency components. It is the most commonly used time-frequency transform in audio coding, used in AAC, USAC, EVS etc. Prof. Edler was a central developer of the MDCT. The only notable drawback of the MDCT is that it is a real-valued transform: if a signal perfectly aligns with a basis function in one frame, it can be perfectly orthogonal (= 0) to the basis function in the next frame. Amplitudes of MDCT components therefore have much larger variance than the original signal, and the physical interpretation of the amplitudes is difficult/inefficient.

Time-frequency transforms

We now have a frequency representation of a window of the signal. The next step is to quantize and code the signal. To obtain perceptually uniform quantization noise, we scale the spectral components X_k with the perceptual envelope W_k, that is, we quantize X_k / W_k. The quantized signal is then multiplied by W_k to return to the original domain.
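
As a sketch of this weighting (the values and the step-size parameter are made up for illustration):

```python
def quantize_with_envelope(X, W, step=1.0):
    """Quantize X_k / W_k uniformly, then return to the signal domain."""
    return [W_k * step * round(X_k / (W_k * step)) for X_k, W_k in zip(X, W)]

X = [12.3, -4.1, 0.7, 2.9]   # spectral components (made up)
W = [8.0, 4.0, 1.0, 2.0]     # perceptual envelope (made up)
Xq = quantize_with_envelope(X, W)

# the error in each bin is at most W_k * step / 2, so the quantization
# noise follows the shape of the perceptual envelope
print(Xq)  # [16.0, -4.0, 1.0, 2.0]
```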

[Figure: (a) spectrum X_k and perceptual envelope W_k; (b) weighted spectrum X_k/W_k and its quantization X'_k/W_k; (c) original X_k and quantized X'_k; magnitudes in dB over frequency bins.]

Basics of entropy coding

We now have a quantized spectrum, and our objective is to transmit that spectrum with the lowest number of bits. This is known as lossless coding: compression such that the original (quantized) spectrum can be exactly recovered. Each quantization level can be interpreted as a symbol, so we have an alphabet of symbols, and we need unique identifiers for each symbol in terms of bit-strings.

Basics of entropy coding

Consider a three-symbol alphabet with symbols a, b and c. We can assign them the unique binary strings 00, 01 and 11, and the symbols can then be transmitted with 2 bits/symbol. However, this is already inefficient, because in theory three symbols need only log2 3 ≈ 1.58 bits/symbol on average. When constrained to fixed-length bit-strings we cannot do better.

Basics of entropy coding

If we know beforehand the occurrence probabilities of each symbol, we can do better. Consider the following probabilities and bit-strings:

    Symbol   Probability   Code   Bits
    a        0.5           0      1
    b        0.25          10     2
    c        0.25          11     2

Clearly each symbol has a unique bit-string, and from a bit-string 10110 we can easily decode bca. The average bit-rate is 0.5 * 1 + 0.25 * 2 + 0.25 * 2 = 1.5, that is, on average we use 1.5 bits/symbol. This is known as Huffman coding, and it works optimally when the probabilities are powers of 0.5.
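
Prefix-free decoding of such a code can be sketched in a few lines (using the code table 0, 10, 11 for a, b, c):

```python
code = {"a": "0", "b": "10", "c": "11"}
decode_table = {bits: sym for sym, bits in code.items()}

def decode(bitstring):
    """Walk the bits left to right; emit a symbol when the buffer matches a codeword."""
    out, buf = [], ""
    for bit in bitstring:
        buf += bit
        if buf in decode_table:      # prefix-free: the first match is unambiguous
            out.append(decode_table[buf])
            buf = ""
    return "".join(out)

probs = {"a": 0.5, "b": 0.25, "c": 0.25}
avg_bits = sum(p * len(code[s]) for s, p in probs.items())
print(decode("10110"), avg_bits)  # bca 1.5
```

Because no codeword is a prefix of another, the decoder never needs to look ahead.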

Arithmetic coding

In the general case, when the probabilities are arbitrary numbers, we can use arithmetic coding. Consider the following alphabet and the corresponding probabilities:

    Symbol   Probability   Interval
    a        0.41          0.00 ... 0.41
    b        0.22          0.41 ... 0.63
    c        0.15          0.63 ... 0.78
    d        0.12          0.78 ... 0.90
    e        0.10          0.90 ... 1.00

Here each symbol is mapped to a unique interval in [0, 1].

Arithmetic coding.41.63.78.9 1 a b c d e.1.2.3.4.5.6.7.8.9 1

Arithmetic coding

Suppose we then want to encode a sequence of symbols, for example adc. The first symbol lands in the interval 0 ... 0.41, which we call the remaining range. The intervals for the second symbol are then mapped into the remaining range, and the second symbol lands in the interval 0.3198 ... 0.369. The intervals for the third symbol are again mapped into the remaining range, and the last symbol lands in the interval 0.3508 ... 0.3582. Transmitting the sequence of symbols adc is then equivalent to transmitting a binary code which uniquely identifies the interval 0.3508 ... 0.3582.
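
The interval narrowing for the sequence adc can be reproduced with a few lines (probabilities as in the table above; names ours):

```python
probs = [("a", 0.41), ("b", 0.22), ("c", 0.15), ("d", 0.12), ("e", 0.10)]

def encode_interval(sequence):
    """Narrow [0, 1) symbol by symbol; the final interval identifies the sequence."""
    lo, hi = 0.0, 1.0
    for target in sequence:
        width = hi - lo
        cum = 0.0
        for sym, p in probs:
            if sym == target:
                hi = lo + (cum + p) * width   # update hi first, it uses the old lo
                lo = lo + cum * width
                break
            cum += p
    return lo, hi

lo, hi = encode_interval("adc")
print(round(lo, 6), round(hi, 6))  # 0.350796 0.358176
```

The final interval (0.3508 ... 0.3582, to four decimals) matches the worked example above.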

[Figure: successive subdivision of the intervals. First 0 ... 1 is split into a ... e at 0.41, 0.63, 0.78, 0.9; then the remaining range 0 ... 0.41 is split at 0.1681, 0.2583, 0.3198, 0.369; finally the remaining range 0.3198 ... 0.369 is split at 0.34, 0.3508, 0.3582, 0.3641.]

Arithmetic coding

The average bit-consumption is then

    b = - Σ_s p(s) log2 p(s),

where the sum runs over all symbols s. In the above example the bit-consumption is 2.12 bits/symbol.
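
The formula is easy to check for the example alphabet:

```python
import math

probs = [0.41, 0.22, 0.15, 0.12, 0.10]
bits_per_symbol = -sum(p * math.log2(p) for p in probs)
print(round(bits_per_symbol, 2))  # 2.12
```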

Arithmetic coding

To encode an interval such as 0.66 ... 0.71, we use a binary representation of decimal numbers. Let 0 correspond to the interval 0 ... 0.5 and 1 to 0.5 ... 1. The string 10 then corresponds to the interval 0.5 ... 0.75 and 11 to 0.75 ... 1. The string 100 corresponds to the interval 0.5 ... 0.625 and 101 to 0.625 ... 0.75, etc. When we reach a bit-string whose range is inside the desired range, we are finished.
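
Finding a bit-string whose dyadic interval fits inside the target range can be sketched as a naive search over string lengths (fine for illustration, not for production coders):

```python
def bits_to_interval(bits):
    """Interval [b, b + 2^-len) represented by a bit-string."""
    lo = sum(int(bit) * 2.0 ** -(i + 1) for i, bit in enumerate(bits))
    return lo, lo + 2.0 ** -len(bits)

def shortest_fitting_bits(lo, hi):
    """Shortest bit-string whose interval lies inside [lo, hi]."""
    length = 1
    while True:
        for b in range(2 ** length):
            bits = format(b, "0{}b".format(length))
            blo, bhi = bits_to_interval(bits)
            if lo <= blo and bhi <= hi:
                return bits
        length += 1

print(shortest_fitting_bits(0.66, 0.71))  # 101011
```

Several strings of the minimal length can fit; for example 101100 (interval 0.6875 ... 0.703125) also lies inside 0.66 ... 0.71.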

Arithmetic coding

Encoding of the range 0.66 ... 0.71 proceeds by halving the interval step by step (1, 10, 101, ...) until a bit-string is found whose interval lies inside the range. The final bit-string is 101100, which corresponds to the interval 0.6875 ... 0.703125. The decoder then decodes the intervals from the bit-string and maps them back to symbols.

Arithmetic coding

Arithmetic coding is thus a form of entropy coding which takes an alphabet (quantization levels) and their occurrence probabilities, and encodes a sequence of symbols with an optimally low number of bits. In a practical system we do not want to use a fixed alphabet-probability combination, but instead model the signal. We can then either use preceding coefficients (known as the context) to predict the probability model of the current coefficient (USAC and EVS), or model the probabilities with, say, a Laplacian distribution and deduce the variance of the samples from the spectral envelope shape (EVS).

Integration with CELP

Above we have presented the principles of frequency-domain coding for speech codecs. For integration into a practical codec we need methods for switching between time- and frequency-domain coding. The windowing paradigm based on the MDCT fits poorly with the filter-based windows of CELP, so we must use engineering solutions, a.k.a. hacks, to switch the windowing concept. Moreover, the characteristic distortions of time- and frequency-domain codecs are very different: switching in the middle of a phoneme can easily become audible, because the character of the artifacts changes even if the absolute perceptual quality remains constant. One solution is to allow switching only at phoneme borders, which requires advanced signal analysis.

Summary of frequency domain coding

Frequency-domain coding is effective for stationary signals such as music, background noises, and stationary segments of speech. Transform coded excitation (TCX) is a family of frequency-domain coding methods which use linear prediction as a model of the spectral envelope. Modern frequency-domain codecs are based on the MDCT, which provides smooth transitions between windows, critical sampling and a physically well-defined transform. The spectrum is weighted with a perceptual model to limit the perceptual effect of quantization noise, and the frequency components are encoded with an entropy codec to reduce the bit-rate.