Speech Coding using Linear Prediction


Jesper Kjær Nielsen
Aalborg University and Bang & Olufsen
jkn@es.aau.dk
September 10, 2015

1 Background

Speech is generated when air is pushed from the lungs through the vocal tract. The produced speech is typically divided into voiced and unvoiced sounds. If the vocal cords in the vocal tract vibrate rapidly when air is pushed through them, the sound is voiced. Examples of voiced sounds are vowels such as a, o, and i. On the other hand, if the vocal cords are constantly open, the sound is unvoiced. Examples of unvoiced sounds are p, s, and q. Much more about human speech production can be found in [1].

The two kinds of speech sounds can be used as sources in the mathematical model of speech production in Fig. 1. In the model, either an impulse train or a white noise process is filtered through an all-pole filter with system response H^{-1}(z). The impulse train is the source signal for the voiced speech sounds, and the white noise is the source signal for the unvoiced speech sounds. The all-pole filter models how the cavities in the throat, mouth, and nose, as well as the position of the tongue, change the spectral envelope of the pulse train or the white noise process. For example, when a person has a cold, the cavities in the nose change, and this changes the spectral envelope of the produced speech. A method called linear prediction [2-4] can be used to estimate the filter coefficients of this all-pole filter from a recorded segment of speech, and this is exploited in many practical applications such as speech compression in digital cellular technology and voice over IP.

To understand why the mathematical model in Fig. 1 is so useful for speech compression, consider a simple telephone call between Peter and Karen. When Peter is speaking, the microphone in the telephone converts the pressure variations (Peter's voice) into voltage variations.
Both the pressures and the voltages are continuous in time and in amplitude. In the telephone, an analogue-to-digital converter (ADC) measures the value of the voltage at a uniform rate called the sampling frequency. One such value is typically called a sample. For telephone applications, the sampling frequency is typically 8000 Hz. Moreover, each measured voltage is also rounded to the nearest value on a grid so that the measured voltages can be represented with a fixed number of bits. This rounding is called quantisation. If we assume that 8 bits are used to represent the value

of each measured voltage, the total bitrate b is

    b = 8 kHz × 8 bits = 64 kbit/s.    (1)

Figure 1: A popular speech model. Either a pulse train (voiced speech) or white noise (unvoiced speech) is filtered through the all-pole filter H^{-1}(z) to produce the speech x(n).

Thus, Peter's telephone has so far converted the pressure variations generated by Peter's voice into a sampled and quantised signal, which we denote by x(n), where n is the time index or sample number. To simplify things, speech production is often modelled so that it generates x(n) directly. This is also done in Fig. 1. In principle, Peter's phone could now transmit the sampled and quantised speech waveform to Karen's phone at a bitrate of 64 kbit/s. However, the model in Fig. 1 can be used to compress the speech data so that the bitrate is reduced by a significant amount. The compression can be performed in the following way.

1. Divide the speech signal into small segments of, say, 20 ms.

2. For each segment, estimate the filter coefficients of the all-pole filter from the segment of speech. Typically, approximately 12 filter coefficients are used in the filter.

3. Determine whether the speech segment is voiced or unvoiced.

4. If the segment is unvoiced, estimate the variance of the white noise process responsible for generating the speech signal. If the segment is voiced, estimate the amplitude of and the distance between the pulses in the pulse train.

5. Transmit the estimated filter coefficients, the speech type, and the source signal parameters.
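As a rough illustration, the analysis steps above for a single segment can be sketched in a few lines. The sketch below is in Python with NumPy for concreteness (the project itself uses MATLAB); the function name, the 0.3 voicing threshold, and the use of the residual standard deviation as the pulse amplitude are illustrative simplifications, not the method of any standardised coder.

```python
import numpy as np

def analyse_segment(x, p=12):
    """Toy LPC analysis of one speech segment x (a NumPy array).

    Returns what would be transmitted instead of the raw waveform:
    p filter coefficients, a voiced/unvoiced flag, and the source
    signal parameters (step 5 of the scheme above).
    """
    # Step 2: least squares estimate of the all-pole filter coefficients
    # (this is the linear prediction problem treated in Section 2).
    H = np.column_stack([np.roll(x, i)[p:] for i in range(1, p + 1)])
    a, *_ = np.linalg.lstsq(H, x[p:], rcond=None)

    # Residual e(n) = x(n) - prediction; its structure drives steps 3-4.
    e = x[p:] - H @ a

    # Step 3 (crude): a voiced segment has a strongly periodic residual,
    # detected here via the peak of its autocorrelation in the lag range
    # corresponding to a 50-400 Hz pitch at a sampling frequency of 8 kHz.
    r = np.correlate(e, e, mode="full")[len(e) - 1:]
    lag_lo, lag_hi = 20, 160
    peak_lag = lag_lo + int(np.argmax(r[lag_lo:lag_hi]))
    voiced = r[peak_lag] > 0.3 * r[0]

    # Step 4: source parameters for the model in Fig. 1.
    if voiced:
        params = {"pitch_period": peak_lag, "amplitude": float(np.std(e))}
    else:
        params = {"noise_var": float(np.var(e))}
    return a, bool(voiced), params
```

A real coder would additionally quantise the coefficients and parameters before transmission, which is what actually produces the bit savings worked out below.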

At a sampling frequency of 8 kHz, a speech segment of 20 ms corresponds to 160 samples. For a resolution of 8 bits/sample, 1280 bits must therefore be used to represent this speech segment if no compression is used. If we instead transmit the 12 filter coefficients and the source signal parameters (also with 8 bits each), only 112 bits are required for the speech segment. Thus, we achieve a compression factor of more than 10. The compression scheme described above is illustrated in Fig. 2.

Figure 2: Block diagram of a generic speech coding application.

In practical telephone applications, the source signals are encoded in a more complex way than described above. However, state-of-the-art speech coders such as code excited linear prediction (CELP) coders are based on the same principles and the same model as above and achieve a bitrate of approximately 4.8 kbit/s.

2 Problem

In the project, we will focus on estimating the filter coefficients of the all-pole filter from the speech signal so that the residual energy e^2(n), averaged over the whole segment of data, is as small as possible. This is the so-called linear prediction problem, and we can formulate it mathematically in the following way. Assume that a speech segment consists of N data points which we model as

    x(n) = a_1 x(n-1) + a_2 x(n-2) + ... + a_p x(n-p) + e(n)    (2)
         = sum_{i=1}^{p} a_i x(n-i) + e(n)                      (3)
         = h^T(n) a + e(n)                                      (4)

for n = 0, 1, ..., N-1, where (·)^T denotes matrix transposition and

    a = [ a_1 a_2 ... a_p ]^T                  (5)
    h(n) = [ x(n-1) x(n-2) ... x(n-p) ]^T.     (6)

When the residual e(n) is a white and wide-sense stationary (WSS) process, the speech signal is modelled as a so-called autoregressive random process of order p. This means that if the speech signal were indeed an autoregressive process, the output of the filter with the system response H(z) would be white and WSS, and this is the main idea behind speech compression based on linear prediction. As illustrated in Fig. 2 and described above, the speech signal is not transmitted directly. Instead, the filter coefficients a and the residuals are transmitted, as this can be done at a much lower bitrate than directly transmitting the speech signal.

From a mathematical perspective, the autoregressive model can be formulated as a linear normal model. In matrix notation, the linear model can be written as

    x = Ha + e    (7)

where

    x = [ x(0) x(1) ... x(N-1) ]^T    (8)
    e = [ e(0) e(1) ... e(N-1) ]^T    (9)
    H = [ h(0) h(1) ... h(N-1) ]^T.   (10)

Under some technical conditions on the matrix H, minimising the two-norm of e w.r.t. a leads to the so-called least squares estimate of a given by

    â = (H^T H)^{-1} H^T x.    (11)

Moreover, this estimate can be shown to be the so-called conditional maximum likelihood estimate. From an engineering perspective, the least squares estimate above might be problematic since the estimated filter coefficients in â are not guaranteed to produce a stable all-pole filter. That is, some of the poles of the system response H^{-1}(z) might lie outside the unit circle. This can be compensated for in various ways, e.g., by reflecting the problematic poles in the unit circle or by defining the matrix H so that H^T H has a so-called Toeplitz structure.
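As a quick sanity check on the estimator in (11), one can synthesise an autoregressive process with known coefficients, build H with rows h(n), and verify that least squares recovers the coefficients. A minimal sketch in Python with NumPy (the estimator itself is language agnostic):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 10_000, 2
a_true = np.array([1.5, -0.8])   # complex pole pair at radius sqrt(0.8): stable

# Generate x(n) = a1*x(n-1) + a2*x(n-2) + e(n) with white Gaussian e(n).
x = np.zeros(N)
e = rng.standard_normal(N)
for n in range(p, N):
    x[n] = a_true @ x[n - p:n][::-1] + e[n]

# Rows of H are h(n) = [x(n-1) ... x(n-p)], as in (6) and (10).
H = np.column_stack([x[p - i:N - i] for i in range(1, p + 1)])

# Least squares estimate (11); np.linalg.solve avoids forming the inverse.
a_hat = np.linalg.solve(H.T @ H, H.T @ x[p:])
print(a_hat)   # close to a_true for large N
```

Note that solving the normal equations directly, rather than computing (H^T H)^{-1} explicitly, is the numerically preferred way to evaluate (11).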
2.1 Packet-loss concealment

Returning to the example of Peter and Karen's telephone conversation, the estimated filter coefficients and source signal parameters from each segment are packaged in a speech packet and transmitted from one phone to the other. In Fig. 2, this is illustrated by the two antennas in the transmission part of the figure. Unfortunately, speech packets might be lost, corrupted, or delayed

in the transmission, and this will produce an audible click when the receiving telephone plays back the received speech. However, since correlation exists between adjacent speech segments, the content of a missing speech packet can to some extent be predicted from the adjacent speech packets. This prediction problem is called packet-loss concealment and can also be a part of the project.

3 Data Set

The data set is based on a secret track of male speech and consists of three files, all sampled at 8000 Hz.

- unvoicedsegment.wav: 120 ms of unvoiced speech. Specifically, the segment is the s-sound from the word "score".

- voicedsegment.wav: 260 ms of voiced speech. Specifically, the segment is the ore-sound from the word "score".

- sentences.wav: A sentence of approximately 5 seconds of speech containing a mixture of voiced and unvoiced speech. The unvoiced and voiced segments are taken from the word "score" in this file.

If you listen carefully to the last file, you will hear two audible clicks. These clicks are caused by two missing audio segments of 20 ms. These segments are located directly after the voiced and unvoiced segments, respectively, and can be approximated using packet-loss concealment. In the file sentences.wav, the missing packets are from sample number 21921 to 22080 and from sample number 25121 to 25280.

In MATLAB, an audio file can be loaded via

    [data, samplingfreq] = audioread('filename.wav');

4 Tools from Courses

During the project, the following tools, which you will learn about in the courses on the semester, are needed.

- Solving 2-norm optimisation problems.
- Understanding the differences between and similarities of the maximum likelihood estimator and the least squares estimator.
- Modelling signals as autoregressive processes.
- Analysing linear normal models.
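The packet-loss concealment problem from Section 2.1 can be attacked with the same least squares machinery as in Section 2: fit a predictor to the samples just before the gap and run the recursion forward with zero excitation. The sketch below is in Python with NumPy rather than MATLAB; the function conceal and its zero-excitation rule are illustrative simplifications (a real concealer would also taper the prediction and cross-fade with the samples after the gap).

```python
import numpy as np

def conceal(past, n_missing, p=12):
    """Predict n_missing samples following `past` by LP extrapolation.

    An order-p predictor is fitted by least squares to the samples
    before the gap, and the recursion x(n) = sum_i a_i x(n-i) is run
    with zero excitation, so the extrapolation continues the local
    signal structure instead of leaving an audible click.
    """
    N = len(past)
    # Least squares fit of the predictor, same estimator as in (11).
    H = np.column_stack([past[p - i:N - i] for i in range(1, p + 1)])
    a, *_ = np.linalg.lstsq(H, past[p:], rcond=None)

    # Free-running synthesis: feed each prediction back into the state.
    state = list(past[-p:])          # [x(N-p), ..., x(N-1)]
    out = []
    for _ in range(n_missing):
        x_next = float(a @ np.array(state[::-1]))  # a_i pairs with x(n-i)
        out.append(x_next)
        state = state[1:] + [x_next]
    return np.array(out)

# Hypothetical usage on a synthetic "previous segment": a noisy 100 Hz
# tone sampled at 8 kHz, concealing a missing 20 ms (160-sample) packet.
fs = 8000
t = np.arange(400) / fs
past = np.sin(2 * np.pi * 100 * t) \
    + 0.01 * np.random.default_rng(2).standard_normal(400)
gap = conceal(past, n_missing=160)
```

For the project data, `past` would be the samples of sentences.wav immediately before sample 21921 or 25121.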

References

[1] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, 1st ed. Prentice-Hall Inc, 1978.

[2] J. E. Markel and A. H. Gray, Linear Prediction of Speech. Springer-Verlag New York, Inc., 1982.

[3] S. Haykin, Adaptive Filter Theory, 4th ed. Prentice Hall, Sep. 2001.

[4] J. R. Deller, J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals. IEEE, New York, NY, USA, 2000.

[5] D. Giacobello, "Sparsity in linear predictive coding of speech," Ph.D. dissertation, Aalborg University, 2010.