Communications Theory and Engineering Master's Degree in Electronic Engineering Sapienza University of Rome A.A. 2018-2019
Speech and telephone speech
Parametric representation of speech signals, based on a voice production model. The vocal folds produce a glottal (excitation) signal, which is then shaped by the vocal tract. In the model, the excitation signal drives a filter H(f) representing the vocal tract, and the filter output is the speech signal.
Parametric representation
The idea: the signal can be considered as the output of a system excited by an appropriate excitation signal. Two questions arise: How should the excitation signal be defined? How should the system be characterized? THE ANSWER IS IN THE STRUCTURE OF THE SPEECH SIGNAL.
Parametric representation
The speech signal is quasi-stationary: it can be considered stationary over short time intervals, typically 10-20 ms. This implies that the model parameters must be updated every 10-20 ms. A segment of duration 10-20 ms is referred to as a FRAME, and this kind of analysis is referred to as short-term analysis.
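Short-term analysis starts by slicing the signal into frames. A minimal sketch in Python/NumPy (the function name `frame_signal` is an illustrative choice, not from the slides):

```python
import numpy as np

def frame_signal(s, fs, frame_ms=20.0):
    """Split signal s (1-D array) into non-overlapping frames of
    frame_ms milliseconds; trailing samples that do not fill a
    whole frame are dropped."""
    n = int(fs * frame_ms / 1000.0)        # samples per frame
    n_frames = len(s) // n
    return s[:n_frames * n].reshape(n_frames, n)

# 1 second of audio at 8 kHz -> 50 frames of 160 samples (20 ms each)
s = np.random.randn(8000)
frames = frame_signal(s, fs=8000, frame_ms=20.0)
```

Each row of `frames` is then analyzed independently, and a new set of model parameters is produced per row.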
Parametric representation
Source model (excitation signal). Two main categories of sounds are identified:
Voiced sounds: the vocal folds are tight and vibrate as air flows from the lungs; the corresponding excitation signal is a pulse train with repetition period T, the pitch interval.
Voiceless sounds: the vocal folds are open, while the vocal tract narrows at a specific point, causing the air coming from the lungs to create turbulence at the constriction; the corresponding excitation signal is noise.
Parametric representation
Reminder: voiced sounds vs. voiceless sounds. The waveform of a voiced sound is almost periodic, with period T (the pitch interval). The waveform of a voiceless sound is noise-like.
Source model
Voiced sounds: a pulse train generator with period T.
Voiceless sounds: a noise generator.
A voiced/voiceless detector is thus required in order to select which excitation signal shall be used.
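The two excitation generators can be sketched as follows (a toy model; the function name and the unit pulse amplitude are illustrative choices):

```python
import numpy as np

def excitation(n_samples, voiced, pitch_period=80, rng=None):
    """Generate the source signal for one frame:
    a unit pulse train with the given period (voiced sound),
    or white Gaussian noise (voiceless sound)."""
    if voiced:
        u = np.zeros(n_samples)
        u[::pitch_period] = 1.0          # one pulse every T samples
        return u
    if rng is None:
        rng = np.random.default_rng(0)
    return rng.standard_normal(n_samples)

# 160-sample frame at pitch period T = 80 samples -> 2 pulses
u_voiced = excitation(160, voiced=True, pitch_period=80)
u_noise  = excitation(160, voiced=False)
```

A real coder would choose between the two outputs per frame, based on the voiced/voiceless detector.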
Vocal tract model The filter H(f) must be characterized by a transfer function that mimics the action performed by the vocal tract on the excitation signal A LINEAR PREDICTION filter is adopted The parameters of the filter must be updated every 10-20 ms But what is linear prediction?
Vocal tract model
The analysis relies on the idea that a sample $s(n)$ of the signal can be predicted by previous signal samples. The prediction of $s(n)$ is a LINEAR combination of the previous samples $s(n-i)$, $i = 1, \dots, p$, that is:

$$\hat{s}(n) = \sum_{k=1}^{p} \alpha_k\, s(n-k)$$

This is the linear prediction of $s(n)$; $p$ is referred to as the PREDICTION ORDER.
Linear prediction
The adopted approach is to determine the coefficients $\alpha_k$ that minimize the difference between the sample $s(n)$ and its prediction $\hat{s}(n)$, i.e. minimize the PREDICTION ERROR:

$$e(n) = s(n) - \hat{s}(n), \qquad \text{where} \quad \hat{s}(n) = \sum_{k=1}^{p} \alpha_k\, s(n-k)$$

BUT REMEMBER THAT the analysis is short-term.
Linear prediction
In particular, the coefficients $\alpha_k$ can be determined by minimizing the SHORT-TERM quadratic error

$$E_n = \sum_{m=1}^{N} e^2(n-m)$$

for each analysis window $n$, where $N$ is the number of samples per window. Since a window has a typical duration of 10-20 ms, at a 10 kHz sampling frequency the corresponding number of samples is 100-200.
Linear prediction

$$E_n = \sum_{m=1}^{N} e^2(n-m) = \sum_{m=1}^{N} \big( s(n-m) - \hat{s}(n-m) \big)^2 = \sum_{m=1}^{N} \Big( s(n-m) - \sum_{k=1}^{p} \alpha_k\, s(n-m-k) \Big)^2$$

FIND THE MINIMUM: we search for a set of $\alpha_k$ such that

$$\frac{\partial E_n}{\partial \alpha_i} = 0 \qquad \text{for } i = 1, \dots, p \ \text{(the prediction order)}$$
Linear prediction
This leads to the following set of equations, the YULE-WALKER EQUATIONS:

$$\begin{pmatrix} R_n(0) & R_n(1) & \cdots & R_n(p-1) \\ R_n(1) & R_n(0) & \cdots & R_n(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ R_n(p-1) & R_n(p-2) & \cdots & R_n(0) \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_p \end{pmatrix} = \begin{pmatrix} R_n(1) \\ R_n(2) \\ \vdots \\ R_n(p) \end{pmatrix}$$

where

$$R_n(i) = \sum_{m} s(n-m)\, s(n-m+i)$$

is the SHORT-TERM AUTOCORRELATION FUNCTION.
Linear prediction
It can be observed that the matrix $R_n$ is a symmetric Toeplitz matrix: all the elements on each diagonal have the same value, and since the autocorrelation satisfies $R_n(i) = R_n(-i)$ the matrix is also symmetric. This structure can be exploited to solve the system efficiently.
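Setting up and solving the Yule-Walker system from the short-term autocorrelation can be sketched as follows (a plain linear solve is used here for clarity, rather than the more efficient Levinson-Durbin recursion that the Toeplitz structure allows):

```python
import numpy as np

def lpc_coefficients(frame, p):
    """Estimate LPC coefficients alpha_1..alpha_p by solving the
    Yule-Walker equations built from the short-term autocorrelation."""
    # r[i] = sum_m frame[m] * frame[m+i]  (autocorrelation method)
    r = np.array([np.dot(frame[:len(frame) - i], frame[i:])
                  for i in range(p + 1)])
    # Symmetric Toeplitz matrix R and right-hand side r[1..p]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:p + 1])

# Sanity check: an AR(2) process should be recovered approximately
rng = np.random.default_rng(1)
s = np.zeros(4000)
for n in range(2, 4000):
    s[n] = 0.75 * s[n - 1] - 0.5 * s[n - 2] + rng.standard_normal()
alpha = lpc_coefficients(s, p=2)   # expect roughly [0.75, -0.5]
```

The estimated coefficients approach the true AR parameters as the window grows; over a real 10-20 ms frame the estimate is necessarily noisier.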
Parametric representation
Example: LPC spectra of the vowel [a] for several values of the prediction order p.
[Figure: input signal waveform (amplitude vs. time) and short-term spectrum (dB vs. frequency), with LPC spectral envelopes for p = 4, 8, 12, 16, 20]
Parametric representation
Summary
VOCAL TRACT PARAMETERS: coefficients α_k, gain G
SOURCE PARAMETERS: PITCH, VOICED/VOICELESS decision
Parametric representation
The complete model is thus as follows: a pulse train generator (driven by the pitch) and a random noise generator feed a voiced/voiceless switch; the selected excitation u(n), scaled by the gain G, drives a time-varying filter with coefficients α_k, whose output is the synthesized speech s(n). The source parameters are the pitch and the voiced/voiceless decision. Transmission rate: up to 10000 bit/s.
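The synthesis side of the model can be sketched as a direct loop implementation of the all-pole filter equation s(n) = G·u(n) + Σ_k α_k s(n−k) (function and parameter names are illustrative):

```python
import numpy as np

def lpc_synthesize(u, alpha, gain=1.0):
    """All-pole synthesis filter:
    s(n) = gain * u(n) + sum_{k=1}^{p} alpha[k-1] * s(n-k)."""
    p = len(alpha)
    s = np.zeros(len(u))
    for n in range(len(u)):
        s[n] = gain * u[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                s[n] += alpha[k - 1] * s[n - k]
    return s

# Voiced frame: pulse train (period 80) through a 1-pole filter
u = np.zeros(160)
u[::80] = 1.0
s = lpc_synthesize(u, alpha=[0.5], gain=2.0)
```

With alpha = [0.5] and gain 2.0, each pulse produces a decaying exponential tail (2.0, 1.0, 0.5, ...), the typical impulse response of a one-pole filter.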
Parametric representation
The predictor order p is typically about 12-14.
[Figure: energy (dB) vs. frequency; red: original signal spectrum, blue: LPC spectral envelope]
Parametric representation
VOCODER scheme: at the transmitter, the signal s(n) feeds an LPC analysis filter and a pitch detector; the coder sends the resulting parameters over the channel; at the receiver, the decoder drives an LPC synthesis filter that produces the reconstructed signal ŝ(n).
Mixed systems
Mixed systems are only in part based on speech production models. The best example is the multipulse method, which achieves excellent quality for transmission rates around 10000 bit/s. In this method the vocal tract is represented by an LPC filter, but the source is determined without relying on specific properties of the speech signal: given a signal s(n), one searches for the input u(n) to the LPC filter that makes the synthesized output ŝ(n) as close as possible to s(n). The input u(n) and the coefficients α_k are then sent to the receiver.
Multipulse systems
What is u(n) like? Let us assume a signal window of length 100 samples: it is obvious that, if u(n) had length 100 samples, the synthesis would be perfect. The maximum number of available samples depends however on the TRANSMISSION RATE: u(n) will thus include only as many pulses (samples) as the transmission rate allows.
Multipulse systems
Optimal positions and amplitudes of the pulses forming the input sequence must then be determined. Example: for a bit rate of about 16 kbit/s, one can transmit ~30 pulses per signal frame of 128 samples, so u(n) will contain ~30 samples of NON-ZERO amplitude. Finding the optimal positions would require analyzing ALL possible combinations of positions, with an unacceptable computational cost; sub-optimal solutions are typically adopted.
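A common sub-optimal strategy, placing the pulses one at a time, can be sketched as follows. Here the LPC synthesis filter is represented by its truncated impulse response h; the names and the greedy criterion (maximum cross-correlation between residual and h) are an illustrative sketch, not the exact standardized procedure:

```python
import numpy as np

def multipulse_greedy(target, h, n_pulses):
    """Greedy multipulse search: at each step, place one pulse where
    the cross-correlation between the current residual and the filter
    impulse response h is largest, with the amplitude that minimizes
    the squared error for that single pulse."""
    residual = target.astype(float).copy()
    positions, amplitudes = [], []
    energy_h = np.dot(h, h)
    for _ in range(n_pulses):
        corr = np.array([np.dot(residual[q:q + len(h)], h)
                         for q in range(len(target) - len(h) + 1)])
        q = int(np.argmax(np.abs(corr)))
        a = corr[q] / energy_h            # optimal amplitude at position q
        residual[q:q + len(h)] -= a * h   # subtract this pulse's contribution
        positions.append(q)
        amplitudes.append(a)
    return positions, amplitudes

# Toy check: a target built from one known pulse is recovered exactly
h = np.array([1.0, 0.5])
target = np.zeros(12)
target[3:5] += 2.0 * h                    # pulse of amplitude 2 at position 3
positions, amplitudes = multipulse_greedy(target, h, n_pulses=1)
```

Jointly re-optimizing all amplitudes once the positions are fixed (as the slides note, a linear system) improves on this purely sequential amplitude choice.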
Multipulse systems
In the search for the pulse positions of u(n), positions are explored ONE AT A TIME. Once the positions are fixed, the pulse amplitudes can be found by solving a system of linear equations.
Multipulse systems
Which information is transmitted? Coefficients, pulse positions, pulse amplitudes, and the amplitude quantization step. NOTE THAT, unlike the VOCODER, here no information on the structure of the source signal is transmitted (neither a voiced/voiceless decision nor pitch extraction).
Mixed methods
GSM system. This method was standardized for early digital RADIO-MOBILE voice transmissions. STRUCTURE: similar to the one described for the multipulse system, but the search for the optimal positions of u(n) is carried out with a resolution of three samples, so the computational cost is much lower than in the multipulse system. The standard rate is 13 kbit/s.
MPEG1 Audio Compression
Coder structure: the input undergoes critical-band (sub-band) filtering; in parallel, a psychoacoustic model estimates the masking effects and drives the bit allocation (quantization); the quantized sub-band samples are then formatted into the output bitstream.
MPEG1 audio compression works in the frequency domain. It takes advantage of limitations of the human auditory system in order to reduce the bit rate without significant effect on perceived audio quality. MPEG1 audio compression evolved in 3 different layers: Layer 1, Layer 2 and Layer 3 (MPEG1 Layer 3, known as MP3).
Frequency masking: the Bark scale
The perception of a sound at a given frequency reduces the ear's ability to perceive other sounds at nearby frequencies. The higher the intensity of the sound, the stronger the masking effect. The frequency interval affected by a sound is referred to as a critical band; the width of the critical bands grows with frequency.
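The growth of the critical bandwidth with frequency is often quantified with Zwicker's empirical approximation from the psychoacoustics literature (an approximation, not part of the MPEG standard itself):

```python
def critical_bandwidth_hz(f_hz):
    """Zwicker's empirical approximation of the critical bandwidth (Hz)
    around centre frequency f_hz: BW = 25 + 75 * (1 + 1.4 (f/1000)^2)^0.69."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69

# Bandwidth grows with frequency:
bw_low  = critical_bandwidth_hz(100.0)    # roughly 100 Hz wide
bw_high = critical_bandwidth_hz(8000.0)   # well over 1 kHz wide
```

Below ~500 Hz the critical bands are nearly constant (~100 Hz); above that they widen roughly proportionally to frequency, which is exactly why a perceptual coder uses non-uniform sub-bands.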
Time masking A tone at high intensity affects the capability of human ear to perceive another tone at nearby frequencies even after the perception of the first tone ends.
Overall masking effect The combination of time masking and frequency masking leads to specific frequency intervals that are not audible for specific time intervals
MP3 coding
MP3 coding uses information on time and frequency masking effects to achieve efficient bit allocation for quantization: bands affected by masking effects are coded with a low number of bits (accepting higher quantization noise). Coder chain: the PCM bitstream feeds a 32-sub-band filter bank followed by a Modified Discrete Cosine Transform; in parallel, a 1024-point FFT feeds the psychoacoustic model (a proprietary module), which controls the non-uniform quantization; the quantized values are Huffman-coded and assembled, together with additional control signalling, into the MP3 coded bitstream.
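The masking-driven bit allocation can be illustrated with a toy greedy scheme (this is NOT the actual proprietary MP3 module, just a sketch of the principle): bands whose energy lies below the masking threshold get no bits, and each extra bit is assumed to buy roughly 6 dB of quantization-noise reduction:

```python
import numpy as np

def allocate_bits(band_energy_db, mask_threshold_db, total_bits):
    """Toy greedy perceptual bit allocation: repeatedly give one bit
    to the band with the largest signal-to-mask ratio (SMR), reducing
    its effective SMR by ~6 dB per bit (each bit halves the
    quantization step)."""
    smr = np.array(band_energy_db, float) - np.array(mask_threshold_db, float)
    bits = np.zeros(len(smr), dtype=int)
    for _ in range(total_bits):
        i = int(np.argmax(smr))
        if smr[i] <= 0:      # every band already below its masking threshold
            break
        bits[i] += 1
        smr[i] -= 6.0        # one extra bit ~ 6 dB less quantization noise
    return bits

# Three bands: loud and exposed, moderate, and fully masked
bits = allocate_bits([60, 40, 10], [20, 30, 30], total_bits=10)
```

The fully masked band (energy 10 dB, threshold 30 dB) receives zero bits, which is the mechanism by which masking reduces the overall bit rate.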