Communications Theory and Engineering


Alexia Gregory
5 years ago

1 Communications Theory and Engineering Master's Degree in Electronic Engineering Sapienza University of Rome A.A

2 Speech and telephone speech

3 Parametric representation of speech signals Based on a voice production model. [Figure: in the vocal apparatus, the glottal signal produced by the vocal folds is shaped by the vocal tract into the speech signal; in the matching model, an excitation signal drives a filter H(f).]

4 Parametric representation The idea: the signal can be considered as the output of a system excited by an appropriate excitation signal. The questions are: How should the excitation signal be defined? How should the system be characterized? The answer is in the structure of the speech signal.

5 Parametric representation The speech signal is quasi-stationary: it can be considered stationary over short time intervals, typically 10~20 ms. This implies that the model parameters must be updated at intervals of this order. A segment of this duration is referred to as a FRAME, and this analysis is referred to as short-term analysis.
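As a minimal sketch of short-term analysis, the snippet below splits a sampled signal into consecutive, non-overlapping frames. The 10 kHz sampling rate and the 20 ms frame length are assumptions chosen for illustration (within the 10~20 ms range quoted above), and `split_into_frames` is a hypothetical helper name.

```python
def split_into_frames(signal, sample_rate_hz, frame_ms):
    """Split a list of samples into consecutive, non-overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# Assumed values for illustration: 10 kHz sampling, 20 ms frames.
SAMPLE_RATE_HZ = 10_000
FRAME_MS = 20
one_second = [0.0] * SAMPLE_RATE_HZ
frames = split_into_frames(one_second, SAMPLE_RATE_HZ, FRAME_MS)
```

Practical analyzers often use overlapping, tapered windows; the non-overlapping split is only the simplest case.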

6 Parametric representation Source model (excitation signal). Two main categories of sounds are identified. Voiced sounds: the vocal folds are tight and vibrate as air arrives from the lungs; the corresponding excitation signal is a pulse train with repetition period T, the pitch interval. Voiceless sounds: the vocal folds are open, while the vocal tract narrows at a specific point, causing the air coming from the lungs to create turbulence at the constriction; the corresponding excitation signal is noise.

7 Parametric representation Reminder: voiced sounds vs. voiceless sounds. The waveform of a voiced sound is almost periodic, with period T, the pitch interval. The waveform of a voiceless sound is noise-like.

8 Source model Voiced sounds: pulse train generator, with period T. Voiceless sounds: noise generator. A voiced/voiceless detector is thus required in order to select which excitation signal shall be used.
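The two generators and the switch between them can be sketched as follows. This is an illustration only: the function names, the Gaussian noise choice, and the default pitch period of 80 samples (8 ms at an assumed 10 kHz sampling rate) are assumptions, not values from the slides.

```python
import random


def pulse_train(n_samples, pitch_period):
    """Voiced excitation: unit pulses every `pitch_period` samples."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]


def white_noise(n_samples, seed=0):
    """Voiceless excitation: zero-mean Gaussian noise (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def excitation(voiced, n_samples, pitch_period=80):
    """Select the excitation according to the voiced/voiceless decision."""
    if voiced:
        return pulse_train(n_samples, pitch_period)
    return white_noise(n_samples)
```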

9 Vocal tract model The filter H(f) must have a transfer function that mimics the action of the vocal tract on the excitation signal. A LINEAR PREDICTION filter is adopted. The parameters of the filter must be updated at every frame. But what is linear prediction?

10 Vocal tract model The analysis relies on the idea that a sample $s(n)$ of the signal can be predicted from previous signal samples. The prediction of $s(n)$ is a LINEAR combination of the previous samples $s(n-i)$, $i = 1, \dots, p$, that is:

$$\tilde{s}(n) = \sum_{k=1}^{p} \alpha_k\, s(n-k)$$

$\tilde{s}(n)$ is the linear prediction of $s(n)$, and $p$ is referred to as the PREDICTION ORDER.

11 Linear prediction The adopted approach is to determine the coefficients $\alpha_k$ that minimize the difference between the sample $s(n)$ and its prediction $\tilde{s}(n)$, i.e. minimize the PREDICTION ERROR

$$e(n) = s(n) - \tilde{s}(n), \qquad \text{where} \qquad \tilde{s}(n) = \sum_{k=1}^{p} \alpha_k\, s(n-k)$$

But remember that the analysis is short-term.
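A small sketch of the prediction and of the prediction error for a given set of coefficients. The helper names are hypothetical, and samples before the start of the signal are assumed to be zero.

```python
def predict(signal, alphas, n):
    """Linear prediction of signal[n] from the p previous samples.

    Samples before the start of the signal are taken as zero.
    """
    return sum(a * signal[n - k]
               for k, a in enumerate(alphas, start=1) if n - k >= 0)


def prediction_error(signal, alphas):
    """e(n) = s(n) - predicted s(n), for every n in the signal."""
    return [signal[n] - predict(signal, alphas, n) for n in range(len(signal))]


# Toy signal s(n) = 0.9**n: one coefficient alpha_1 = 0.9 predicts it exactly
# for n >= 1 (the first sample has no past, so its error equals s(0)).
s = [0.9 ** n for n in range(20)]
e = prediction_error(s, [0.9])
```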

12 Linear prediction In particular, the coefficients $\alpha_k$ can be determined by minimizing the SHORT-TERM quadratic error

$$E_n = \sum_{m=1}^{N} e_n^2(m)$$

for each analysis window $n$, where $N$ is the number of samples per window. $N$ follows from the window duration and the sampling frequency (for example 10 kHz).

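13 Linear prediction

$$E_n = \sum_{m=1}^{N} e_n^2(m) = \sum_{m=1}^{N} \bigl( s_n(m) - \tilde{s}_n(m) \bigr)^2 = \sum_{m=1}^{N} \Bigl( s_n(m) - \sum_{k=1}^{p} \alpha_k\, s_n(m-k) \Bigr)^2$$

FIND THE MINIMUM: we are searching for a set of $\alpha_k$ such that

$$\frac{\partial E_n}{\partial \alpha_i} = 0 \quad \text{for } i = 1, \dots, p$$

where $p$ is the order of prediction.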
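The step from the minimization condition to a linear system can be spelled out as follows. This is a sketch in which $s_n(m)$ denotes the $m$-th sample of window $n$, and it assumes (as in the autocorrelation method) that the windowed signal is zero outside the window, so that the sums over $m$ depend only on the lag between the two samples.

```latex
\frac{\partial E_n}{\partial \alpha_i}
  = -2 \sum_{m} s_n(m-i)\Bigl( s_n(m) - \sum_{k=1}^{p} \alpha_k\, s_n(m-k) \Bigr) = 0
\quad\Longrightarrow\quad
\sum_{k=1}^{p} \alpha_k \sum_{m} s_n(m-i)\, s_n(m-k) = \sum_{m} s_n(m)\, s_n(m-i),
\qquad i = 1, \dots, p
```

Writing $R_n(i) = \sum_m s_n(m)\, s_n(m+i)$, the inner sums become $R_n(|i-k|)$ and the right-hand side becomes $R_n(i)$, which gives $p$ linear equations in the $p$ unknowns $\alpha_k$.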

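14 Linear prediction Leading to the following set of equations, the YULE-WALKER EQUATIONS:

$$\begin{bmatrix} R_n(0) & R_n(1) & \cdots & R_n(p-1) \\ R_n(1) & R_n(0) & \cdots & R_n(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ R_n(p-1) & R_n(p-2) & \cdots & R_n(0) \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_p \end{bmatrix} = \begin{bmatrix} R_n(1) \\ R_n(2) \\ \vdots \\ R_n(p) \end{bmatrix}$$

where the matrix on the left is denoted $\mathbf{R}_n$ and

$$R_n(i) = \sum_{m} s_n(m)\, s_n(m+i)$$

is the SHORT-TERM AUTOCORRELATION FUNCTION.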
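A minimal sketch of how the system could be built and solved for one frame. The names `autocorr`, `solve_linear` and `lpc_coefficients` are hypothetical, the frame is assumed to be zero outside its own samples, and plain Gaussian elimination is used here only for transparency, not because the slides prescribe it.

```python
def autocorr(frame, lag):
    """Short-term autocorrelation R(lag) = sum_m frame[m] * frame[m + lag]."""
    return sum(frame[m] * frame[m + lag] for m in range(len(frame) - lag))


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def lpc_coefficients(frame, order):
    """Build the system from R(0..order) and solve for alpha_1..alpha_p."""
    r = [autocorr(frame, i) for i in range(order + 1)]
    matrix = [[r[abs(i - k)] for k in range(order)] for i in range(order)]
    rhs = [r[i] for i in range(1, order + 1)]
    return solve_linear(matrix, rhs)


# Toy frame: a decaying exponential s(m) = 0.9**m, long enough to die out.
frame = [0.9 ** m for m in range(200)]
alphas = lpc_coefficients(frame, 2)
```

For this toy frame the first coefficient comes out close to 0.9 and the second close to zero, since one past sample already predicts the exponential.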

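15 Linear prediction It can be observed that the matrix

$$\mathbf{R}_n = \begin{bmatrix} R_n(0) & R_n(1) & R_n(2) & \cdots & R_n(p-1) \\ R_n(1) & R_n(0) & R_n(1) & \cdots & R_n(p-2) \\ R_n(2) & R_n(1) & R_n(0) & \cdots & R_n(p-3) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ R_n(p-1) & R_n(p-2) & R_n(p-3) & \cdots & R_n(0) \end{bmatrix}$$

is a symmetric Toeplitz matrix: It is symmetric. All the elements on any given diagonal have the same value ($R_n(0)$ on the main diagonal).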
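The symmetric Toeplitz structure is what makes fast solvers possible. The slides do not name one; a common choice is the Levinson-Durbin recursion, sketched below under that assumption. The function name and return convention are hypothetical.

```python
def levinson_durbin(r, order):
    """Solve the symmetric Toeplitz system given r = [R(0), ..., R(order)].

    Returns (alphas, error), where alphas = [alpha_1, ..., alpha_p] and error
    is the residual prediction-error energy after the last step.
    """
    alphas = []
    error = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for step i.
        acc = r[i] - sum(a * r[i - k] for k, a in enumerate(alphas, start=1))
        k_i = acc / error
        # Update the i-1 previous coefficients, then append the new one.
        alphas = [a - k_i * alphas[i - 2 - j] for j, a in enumerate(alphas)] + [k_i]
        error *= 1.0 - k_i * k_i
    return alphas, error


# Autocorrelation of an ideal first-order process: R(i) proportional to 0.9**i.
coeffs, residual_energy = levinson_durbin([1.0, 0.9, 0.81], 2)
```

The recursion needs on the order of p squared operations instead of the p cubed of general elimination, which matters when the coefficients are recomputed at every frame.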

16 Parametric representation Example: [a] VOWEL for several values of prediction order p. [Figure: input signal (amplitude vs. time), short-term spectrum, and LPC spectra for p = 4, 8, 12, 16, 20 (dB vs. frequency).]

17 Parametric representation Summary. VOCAL TRACT PARAMETERS: coefficients $\alpha_k$, gain G. SOURCE PARAMETERS: PITCH, VOICED/VOICELESS decision.

18 Parametric representation The complete model is thus as follows. [Figure: the source parameters (PITCH + VOICED/VOICELESS decision) control a pulse train generator and a random noise generator; a voiced/voiceless switch selects the excitation $u(n)$, which is multiplied by the gain G and fed to a time-varying filter with coefficients $\alpha_k$, producing $s(n)$.] Transmission rate up to bit/s
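The synthesis side of this model can be sketched as an all-pole recursion driven by the scaled excitation. The difference equation $s(n) = G\,u(n) + \sum_k \alpha_k\, s(n-k)$ is the usual form consistent with the predictor defined earlier; the function name and the toy values are assumptions for illustration.

```python
def lpc_synthesize(excitation, alphas, gain):
    """All-pole synthesis: s(n) = gain * u(n) + sum_k alpha_k * s(n - k)."""
    out = []
    for n, u in enumerate(excitation):
        past = sum(a * out[n - k]
                   for k, a in enumerate(alphas, start=1) if n - k >= 0)
        out.append(gain * u + past)
    return out


# A single unit pulse through a one-coefficient filter gives the filter's
# scaled impulse response, gain * alpha_1**n.
impulse = [1.0] + [0.0] * 9
s = lpc_synthesize(impulse, alphas=[0.5], gain=2.0)
```

In a real coder the coefficients and the gain would change from frame to frame, with the filter state carried across frame boundaries.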

19 Parametric representation The predictor order p is typically about 12~14. [Figure: energy (dB) vs. frequency; red: original signal, blue: LPC.]

20 Parametric representation VOCODER scheme. [Figure: TRANSMITTER: $s(n)$ feeds an LPC analysis filter and a PITCH detector, followed by a coder; CHANNEL; RECEIVER: a decoder followed by an LPC synthesis filter, producing $\hat{s}(n)$.]

21 Mixed systems Mixed systems are only in part based on speech production models. Best example: Multipulse. The multipulse method achieves excellent quality for transmission rates around bit/s. In this method the vocal tract is represented by an LPC filter, but the source is determined without relying on specific properties of the speech signal. [Figure: $u(n)$ enters the LPC filter, which outputs $\tilde{s}(n)$.] Given a signal $s(n)$, one searches for the system input $u(n)$ that makes $\tilde{s}(n)$ as close as possible to $s(n)$. The input $u(n)$ and the coefficients $\alpha_k$ are then sent to the receiver.

22 Multipulse systems What is $u(n)$ like? Let us assume a signal window of length 100 samples: it is obvious that, if $u(n)$ had length 100 samples, the synthesis would be perfect. The maximum number of available samples depends however on the TRANSMISSION RATE. Depending on the transmission rate, $u(n)$ will thus include the right number of pulses (samples).

23 Multipulse systems Optimal positions and amplitudes of the pulses forming the input sequence must then be determined. Example: for a bit rate of about 16 kbit/s, one can transmit ~30 pulses for a signal frame of 128 samples, so $u(n)$ will contain ~30 samples of NON-ZERO amplitude. Finding the optimal positions would require analyzing ALL possible positions, with an unacceptable computational cost. Sub-optimal solutions are typically adopted.
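The size of the exhaustive search can be made concrete with a short calculation. It counts the ways of choosing 30 pulse positions out of 128 and compares that with a rough upper bound for a sequential search, taken here as one example of a sub-optimal strategy; the variable names are illustrative.

```python
import math

FRAME_LEN = 128   # samples per frame (from the example above)
N_PULSES = 30     # non-zero pulses per frame (from the example above)

# Exhaustive search: every way of choosing 30 positions out of 128.
exhaustive = math.comb(FRAME_LEN, N_PULSES)

# Sequential search: place one pulse at a time, testing at most every
# position for each pulse.
sequential_upper_bound = FRAME_LEN * N_PULSES

print(f"exhaustive position sets: {exhaustive:.3e}")
print(f"sequential evaluations (upper bound): {sequential_upper_bound}")
```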

24 Multipulse systems In the search for pulse positions, positions are explored ONE AT A TIME. In the search for pulse amplitudes, a system of linear equations can be defined.
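A one-at-a-time search can be sketched as a greedy loop: at each step, the pulse position that most reduces the remaining error is chosen and its contribution is subtracted. This is a simplified illustration, not the exact procedure of any coder: each amplitude is obtained in closed form for a single pulse (the one-unknown case of the linear system), amplitudes are not jointly re-optimized, and no perceptual weighting is applied. All names are hypothetical.

```python
def filter_response(alphas, length):
    """Impulse response h(n) of the all-pole LPC synthesis filter."""
    h = []
    for n in range(length):
        past = sum(a * h[n - k]
                   for k, a in enumerate(alphas, start=1) if n - k >= 0)
        h.append((1.0 if n == 0 else 0.0) + past)
    return h


def multipulse_search(target, alphas, n_pulses):
    """Greedy search: add one pulse at a time at the best position.

    Returns (pulses, residual): a dict position -> amplitude, and what is
    left of the target after subtracting the filtered pulses.
    """
    n = len(target)
    h = filter_response(alphas, n)
    residual = list(target)
    pulses = {}
    for _ in range(n_pulses):
        best = None
        for pos in range(n):
            # A unit pulse at `pos` contributes h shifted by `pos`.
            corr = sum(residual[m] * h[m - pos] for m in range(pos, n))
            energy = sum(h[m - pos] ** 2 for m in range(pos, n))
            reduction = corr * corr / energy  # error removed by this pulse
            if best is None or reduction > best[0]:
                best = (reduction, pos, corr / energy)
        _, pos, amp = best
        pulses[pos] = pulses.get(pos, 0.0) + amp
        for m in range(pos, n):
            residual[m] -= amp * h[m - pos]
    return pulses, residual
```

Because earlier pulses are fixed when later ones are placed, the result is generally sub-optimal, which is the price paid for the much smaller search.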


25 Multipulse systems Which information is transmitted? Coefficients; positions; amplitudes; amplitude quantization step. NOTE THAT, unlike the VOCODER, here there is no information on the structure of the source signal (neither voiced/voiceless decision, nor pitch extraction).

26 Mixed methods GSM system This method was standardized for early digital RADIO-MOBILE voice transmissions. STRUCTURE: similar to the one described for the multipulse system, but the search for optimal positions of $u(n)$ is carried out with a resolution of three samples. The computational cost is much lower than in the multipulse system. 13 kb/s standard.
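One way to read "a resolution of three samples" is that pulses are constrained to a regular grid with a spacing of three samples, so that only the grid offset has to be searched. The sketch below illustrates that reading by picking the offset whose samples carry the most energy. It is an assumption made for illustration, not the procedure of the GSM standard, and `best_regular_grid` is a hypothetical name.

```python
def best_regular_grid(residual, spacing=3):
    """Pick the offset whose regularly spaced samples carry the most energy.

    Returns (offset, pulses), where pulses are the samples kept at
    offset, offset + spacing, offset + 2 * spacing, ...
    """
    best_offset = max(
        range(spacing),
        key=lambda off: sum(x * x for x in residual[off::spacing]),
    )
    return best_offset, residual[best_offset::spacing]


# Toy residual in which the samples at offset 1 dominate.
offset, pulses = best_regular_grid([0, 5, 0, 0, 4, 0, 1, 3, 0])
```

With a fixed grid, the search reduces to comparing a handful of candidate offsets per frame, which is consistent with the much lower computational cost noted above.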

27 MPEG1 Audio Compression [Figure: input → critical bands filtering (sub-band filtering) → bit allocation (quantization) → bitstream formatting → output, with an estimation of masking effects (psychoacoustic model) feeding the bit allocation.] MPEG1 audio compression works in the frequency domain. It takes advantage of limitations in the human auditory system in order to reduce the bit rate without significant effect on perceived audio quality. MPEG1 audio compression evolved in 3 different layers: Layer 1, Layer 2 and Layer 3 (MPEG1 Layer 3 is known as MP3).

28 Frequency masking: the Bark scale The perception of a sound at a given frequency reduces the capability of the ear to perceive other sounds at nearby frequencies. The higher the intensity of the sound, the stronger the masking effect. The frequency interval affected by a sound is referred to as a critical band. The width of critical bands grows with frequency.
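The growth of critical bandwidth with frequency can be illustrated with two widely quoted closed-form approximations attributed to Zwicker and Terhardt. They are not given in the slides and are used here only as an assumed, approximate model; the function names are hypothetical.

```python
import math


def hz_to_bark(f_hz):
    """Approximate critical-band rate (Bark) for a frequency in Hz."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def critical_bandwidth_hz(f_hz):
    """Approximate critical bandwidth (Hz) around a centre frequency in Hz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69


for f in (100, 1000, 4000):
    print(f, round(hz_to_bark(f), 2), round(critical_bandwidth_hz(f), 1))
```

On this model the bandwidth is roughly constant at low frequencies and widens steadily above about 500 Hz.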

29 Time masking A tone at high intensity affects the capability of the human ear to perceive another tone at nearby frequencies even after the perception of the first tone ends.

30 Overall masking effect The combination of time masking and frequency masking leads to specific frequency intervals that are not audible for specific time intervals.

31 MP3 coding MP3 coding uses information on time and frequency masking effects to achieve efficient bit allocation for quantization. Bands affected by masking effects are coded with a low number of bits (higher quantization noise). [Figure: PCM bitstream → filter bank (32 sub-bands) → Modified Discrete Cosine Transform → non-uniform quantization → Huffman coding → bitstream creation (with additional control signalling) → MP3 coded bitstream; in parallel, FFT (1024 points) → psychoacoustic model, which drives the quantization; this module is proprietary.]
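The idea of giving fewer bits to masked bands can be sketched with the rule of thumb that each bit of a uniform quantizer lowers quantization noise by about 6 dB. This is not the MP3 allocation procedure, whose psychoacoustic model is not described here; the signal-to-mask ratios, the `max_bits` cap and the function name are all assumptions for illustration.

```python
import math

DB_PER_BIT = 6.02  # each extra bit of a uniform quantizer lowers noise by ~6 dB


def allocate_bits(signal_to_mask_db, max_bits=15):
    """Bits per sub-band so that quantization noise falls below the mask.

    Bands whose signal is already below the masking threshold (ratio <= 0 dB)
    get no bits at all.
    """
    bits = []
    for smr in signal_to_mask_db:
        needed = math.ceil(smr / DB_PER_BIT) if smr > 0 else 0
        bits.append(min(needed, max_bits))
    return bits


# Hypothetical signal-to-mask ratios (dB) for four sub-bands.
allocation = allocate_bits([-5.0, 3.0, 12.5, 200.0])
```

A strongly masked band (negative ratio) costs nothing, while a band that stands well above its masking threshold receives more bits, up to the cap.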

Analysis/synthesis coding

TSBK06 speech coding p.1/32 Analysis/synthesis coding Many speech coders are based on a principle called analysis/synthesis coding. Instead of coding a waveform, as is normally done in general audio coders