SILK Speech Codec. TDP 10/11. Xavier Anguera and Ciro Gracia

SILK Codec: Audio codec developed by Skype (February 2009). Skype previously used the SVOPC codec (Sinusoidal Voice Over Packet Coder), which is based on LPC analysis and quasi-harmonic modelling of the linear prediction (LPC) residual. In SVOPC, both the sinusoidal amplitudes and phases are explicitly encoded using methods based on Gaussian mixture models.

Requirements (Internet Wideband Audio Codec): Optimized for real-time operation. Flexible, real-time adaptation of parameters according to the conditions of the network, the hardware and the audio signal.

Parameters (Internet Wideband Audio Codec): Bitrate: quality vs. bitrate trade-off. Low: < 10 kbps (speech in any language). High: excellent quality for any music signal. Sampling rate: narrowband (8 kHz), wideband (24 kHz or more). Complexity: 50 MHz x86 core, wideband mode (16 kHz sampling rate). Packet loss resilience: minimize error propagation. Delay: < 30 ms. Discontinuous Transmission (DTX): low rate when there is only background noise.

Encoder: Sampling rate: 8, 12, 16, 24 kHz. Bitrate: 6-40 kbps (1 bit/sample is good, 1.5 bits/sample is transparent). Packet rate: 20 ms frames, 1-5 frames per packet; this trades bitrate against latency and sensitivity. Packet loss resilience: use of inter-frame dependencies to detect errors. Complexity: optimizations.
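The figures on this slide can be related with some simple arithmetic. The following is a minimal sketch, assuming the bits-per-sample figures apply at every supported sampling rate; the function names are hypothetical and are not part of any SILK API.

```python
def bitrate_kbps(sample_rate_hz, bits_per_sample):
    """Average bitrate in kbps for a sampling rate and a bits-per-sample budget."""
    return sample_rate_hz * bits_per_sample / 1000.0


def packet_payload_bytes(rate_kbps, frames_per_packet, frame_ms=20):
    """Average codec payload of one packet in bytes (network headers ignored)."""
    bits = rate_kbps * 1000.0 * (frame_ms / 1000.0) * frames_per_packet
    return bits / 8.0


def packet_duration_ms(frames_per_packet, frame_ms=20):
    """Audio duration carried by one packet; more frames means more latency."""
    return frames_per_packet * frame_ms


for rate in (8000, 12000, 16000, 24000):
    print(rate, "Hz:", bitrate_kbps(rate, 1.0), "kbps (good),",
          bitrate_kbps(rate, 1.5), "kbps (transparent)")
```

Packing more frames into a packet spreads the per-packet header overhead over more audio, at the cost of a longer packet duration and a larger loss when a packet is dropped.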

Encoder scalability

Subjective quality evaluation: MOS (Mean Opinion Score)

Encoder (block diagram). Blocks: High-Pass Filter, Voice Activity Detector, Pitch Analysis, Noise Shaping Analysis, Prediction Analysis, LSF Quantizer, LTP Scaling Control, Gains Processor, Prefilter, Noise Shaping Quantization, Range Encoder.

Decoder (block diagram). Blocks: Range Decoder, Decode Parameters, Generate Excitation, LTP Synthesis, LPC Synthesis. Signals: 1) Range-encoded bitstream. 2) Coded parameters. 3) Pulses and gains. 4) Pitch lags and LTP coefficients. 5) LPC coefficients. 6) Decoded signal.
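The last three decoder blocks can be sketched as a chain of simple filters. This is a minimal illustration of textbook excitation scaling, long-term (pitch) synthesis and all-pole LPC synthesis, not the actual SILK decoder; all function names and the tap layout are assumptions made for the example.

```python
def generate_excitation(pulses, gain):
    """Scale the decoded pulses by the decoded gain."""
    return [gain * p for p in pulses]


def ltp_synthesis(excitation, lag, taps):
    """Add a long-term prediction built from past output samples.

    taps is an odd-length list centred on the pitch lag (assumed layout).
    Only samples that have already been produced are used.
    """
    half = len(taps) // 2
    out = []
    for n, e in enumerate(excitation):
        pred = 0.0
        for i, b in enumerate(taps):
            idx = n - lag + half - i
            if 0 <= idx < len(out):
                pred += b * out[idx]
        out.append(e + pred)
    return out


def lpc_synthesis(residual, lpc):
    """All-pole filter: s[n] = r[n] + sum_k lpc[k] * s[n-1-k]."""
    out = []
    for n, r in enumerate(residual):
        pred = 0.0
        for k, a in enumerate(lpc):
            if n - 1 - k >= 0:
                pred += a * out[n - 1 - k]
        out.append(r + pred)
    return out


def decode_frame(pulses, gain, lag, ltp_taps, lpc):
    """Chain the three stages in the order shown in the block diagram."""
    excitation = generate_excitation(pulses, gain)
    return lpc_synthesis(ltp_synthesis(excitation, lag, ltp_taps), lpc)
```

For an unvoiced frame the long-term stage can be skipped by passing all-zero taps, which leaves the excitation unchanged.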

Pitch analysis: Returns a pitch value every 5 ms, together with the voiced/unvoiced decision. LPC analysis is done with order 16, 12 or 8. Three levels (stages) of correlation analysis are used to reduce complexity.
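A correlation-based pitch estimator can be sketched as follows. This is a single-stage, brute-force search over normalized autocorrelation, shown only to illustrate the idea; it does not reproduce the three-stage search mentioned above, and the function name and the voicing threshold of 0.5 are assumptions.

```python
import math


def estimate_pitch_lag(frame, min_lag, max_lag, voicing_threshold=0.5):
    """Return (best_lag, is_voiced) from the normalized autocorrelation."""
    best_lag, best_corr = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        idx = range(lag, len(frame))
        num = sum(frame[n] * frame[n - lag] for n in idx)
        e_now = sum(frame[n] ** 2 for n in idx)
        e_past = sum(frame[n - lag] ** 2 for n in idx)
        denom = math.sqrt(e_now * e_past)
        corr = num / denom if denom > 0 else 0.0
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag, best_corr >= voicing_threshold


# Usage: a 200 Hz tone sampled at 8 kHz has a period of 40 samples.
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(320)]
lag, voiced = estimate_pitch_lag(tone, 20, 70)
print(lag, voiced)
```

A perfectly periodic signal correlates equally well at every multiple of its period, so a practical estimator also needs a rule for preferring the shortest plausible lag; the usage example avoids the issue by limiting the search range.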

Noise shaping analysis: Optimizes a set of parameters to reduce the perceived effect of quantization noise. Balances quantization noise against bitrate. Spectral shaping of the quantization noise makes it follow the signal spectrum. De-emphasizes spectral valleys (where noise would be more noticeable). Matches the levels of the decoded speech formants to the original ones. The resulting parameters are applied to the signal in the PREFILTER module.
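One textbook way to obtain a noise-shaping filter that follows the signal spectrum with smoothed peaks is bandwidth expansion of the LPC coefficients, where coefficient k is scaled by gamma to the power k. The sketch below illustrates that general technique only; it is an assumption that it resembles what this module does, and the function names are hypothetical.

```python
import cmath
import math


def bandwidth_expand(lpc, gamma):
    """Scale predictor coefficient k (1-based) by gamma**k, with 0 < gamma <= 1."""
    return [a * gamma ** (k + 1) for k, a in enumerate(lpc)]


def allpole_gain_db(lpc, freq_hz, sample_rate_hz):
    """Gain in dB of 1 / A(z) at one frequency, A(z) = 1 - sum_k lpc[k] z^-(k+1)."""
    w = 2.0 * math.pi * freq_hz / sample_rate_hz
    a_of_z = 1.0 - sum(a * cmath.exp(-1j * w * (k + 1)) for k, a in enumerate(lpc))
    return -20.0 * math.log10(abs(a_of_z))


lpc = [0.9]                              # one-pole example with a peak at 0 Hz
shaped = bandwidth_expand(lpc, 0.8)
print(allpole_gain_db(lpc, 0.0, 16000))      # peak of the original envelope
print(allpole_gain_db(shaped, 0.0, 16000))   # lower, flatter peak after expansion
```

A gamma closer to 1 keeps the shaped noise close to the signal envelope, while a smaller gamma flattens it towards white noise.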

Prediction analysis: It is done differently depending on whether the signal is voiced or unvoiced. Voiced: First, a 5-coefficient long-term prediction (LTP) analysis is performed on 20 ms of signal. The residual is then input to an LPC analysis. The LPC coefficients are converted to Line Spectral Frequencies (LSF), which are less sensitive to quantization noise, and quantized.
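The LPC analysis step can be illustrated with the classic autocorrelation method and the Levinson-Durbin recursion. This is a minimal sketch of that standard algorithm, not necessarily the estimator used in SILK; it omits windowing, the preceding LTP stage and the conversion to LSF, and the function names are assumptions.

```python
def autocorrelation(signal, max_lag):
    """r[k] = sum_n signal[n] * signal[n-k] for k = 0..max_lag."""
    return [sum(signal[n] * signal[n - k] for n in range(k, len(signal)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r[0..order].

    Returns (coeffs, error) where the predictor is
    s_hat[n] = sum_k coeffs[k] * s[n-1-k] and error is the residual energy.
    """
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        if err <= 0.0:
            break  # degenerate input, keep the coefficients found so far
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= (1.0 - k * k)
    return a, err


# Usage: the autocorrelation of a first-order process with coefficient 0.5.
coeffs, error = levinson_durbin([1.0, 0.5, 0.25], 2)
print(coeffs, error)
```

In the usage example a second-order fit recovers the single coefficient 0.5 and leaves the second coefficient at zero, since the input is exactly first order.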

Prediction analysis: It is done differently depending on whether the signal is voiced or unvoiced. Unvoiced: There is no need for LTP analysis. LPC analysis is performed, and the result is transformed to an LSF vector and quantized.

LSF quantization: A codebook method with a non-uniform (variable) quantization rate is used. Rarely occurring values are quantized with low distortion but a high number of bits. Commonly occurring values are modeled with low error and a low number of bits. The codebook is trained a priori on a large training set.
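The link between how often a codebook entry occurs and how many bits it costs can be sketched with a nearest-neighbour search plus an ideal entropy-code length. The codebook, the probabilities and the function names below are invented for illustration; they are not SILK's trained tables.

```python
import math


def nearest_entry(vector, codebook):
    """Index of the codebook entry with the smallest squared error."""
    def squared_error(entry):
        return sum((x - y) ** 2 for x, y in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: squared_error(codebook[i]))


def index_bits(index, probabilities):
    """Ideal entropy-code length in bits for a codebook index."""
    return -math.log2(probabilities[index])


# Toy 2-dimensional codebook; entry 0 is assumed to be the most common one.
codebook = [[0.1, 0.2], [0.3, 0.5], [0.6, 0.9]]
probabilities = [0.5, 0.25, 0.25]

index = nearest_entry([0.32, 0.48], codebook)
print(index, index_bits(index, probabilities))
```

With these made-up probabilities the most common entry costs 1 bit and each of the rarer entries costs 2 bits.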

LTP quantization: It also uses a vector codebook, chosen from 3 possible codebooks (containing 10, 20 and 40 vectors, respectively). For each frame, the best codebook is chosen by minimizing a rate-distortion function.
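Choosing among several codebooks by rate-distortion minimization can be sketched as a search for the lowest cost D + lambda * R. The sketch assumes squared-error distortion and a fixed-length index of log2(codebook size) bits; the real codec may measure both terms differently, and the names and the toy codebooks are hypothetical.

```python
import math


def select_codebook(target, codebooks, lam):
    """Return (codebook_index, vector_index, cost) minimizing D + lam * R."""
    best = None
    for cb_index, codebook in enumerate(codebooks):
        rate_bits = math.log2(len(codebook))   # assumption: fixed-length index
        for vec_index, vector in enumerate(codebook):
            distortion = sum((t - v) ** 2 for t, v in zip(target, vector))
            cost = distortion + lam * rate_bits
            if best is None or cost < best[2]:
                best = (cb_index, vec_index, cost)
    return best


# Toy 1-dimensional codebooks: a coarse one (1 bit) and a finer one (2 bits).
codebooks = [
    [[0.0], [1.0]],
    [[0.0], [0.25], [0.5], [0.75]],
]
print(select_codebook([0.3], codebooks, 0.01)[:2])  # bits are cheap
print(select_codebook([0.3], codebooks, 0.5)[:2])   # bits are expensive
```

A small lambda favours the larger codebook because accuracy dominates the cost, while a large lambda favours the smaller codebook because each extra bit is penalized more heavily.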

Noise shaping quantization: This module combines the outputs of all the other modules to generate the overall residual, which is quantized and sent.

Range encoder: A data compression method proposed in 1979 (now patent free) and based on arithmetic encoding. It uses the probability of occurrence of each pattern to encode the patterns that occur more often with fewer bits. It encodes the following: the voiced/unvoiced decision, the LTP and LPC quantization indexes, the residual signal and several intermediate gains.
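The interval-narrowing idea behind arithmetic and range coding can be shown with exact fractions. This is a didactic sketch only: a practical range coder works on fixed-width integers with renormalization and emits bytes as it goes. The symbol probabilities and the function names are invented for the example.

```python
from fractions import Fraction


def build_intervals(probabilities):
    """Map each symbol to its [low, high) slice of the unit interval."""
    intervals, low = {}, Fraction(0)
    for symbol, p in probabilities.items():
        intervals[symbol] = (low, low + p)
        low += p
    return intervals


def encode(symbols, probabilities):
    """Narrow [0, 1) once per symbol; return (low, width) of the final interval."""
    intervals = build_intervals(probabilities)
    low, width = Fraction(0), Fraction(1)
    for s in symbols:
        s_low, s_high = intervals[s]
        low += width * s_low
        width *= (s_high - s_low)
    return low, width


def decode(value, count, probabilities):
    """Recover count symbols from any value inside the final interval."""
    intervals = build_intervals(probabilities)
    out = []
    for _ in range(count):
        for symbol, (s_low, s_high) in intervals.items():
            if s_low <= value < s_high:
                out.append(symbol)
                value = (value - s_low) / (s_high - s_low)
                break
    return out


# "V" (voiced) is assumed to be three times as likely as "U" (unvoiced).
probs = {"V": Fraction(3, 4), "U": Fraction(1, 4)}
low, width = encode(list("VVUV"), probs)
print(low, width, decode(low, 4, probs))
```

The final interval width is the product of the symbol probabilities, so a likely sequence leaves a wide interval that can be identified with few bits, and an unlikely sequence leaves a narrow one that needs more.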