BANDWIDTH EXTENSION OF NARROWBAND SPEECH BASED ON BLIND MODEL ADAPTATION
15th European Signal Processing Conference (EUSIPCO 2007), Poznan, Poland, September 3-7, 2007, copyright by EURASIP

BANDWIDTH EXTENSION OF NARROWBAND SPEECH BASED ON BLIND MODEL ADAPTATION

Sheng Yao and Cheung-Fat Chan
Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong

ABSTRACT

The traditional telephone transmission network limits the speech band to frequencies below about 3.4 kHz. Narrowband telephone speech (300-3400 Hz) sounds muffled compared with the original wideband speech (0-8 kHz). Artificial bandwidth extension is an economical way of enhancing the quality of narrowband speech without modifying the infrastructure of the network. Existing bandwidth extension methods usually comprise an off-line learning phase and an on-line enhancing phase, and the performance of these systems depends largely on the consistency between the wideband training data and the actual narrowband input data. In real situations the input speech usually mismatches the off-line training speech, leading to serious model errors. To avoid this data mismatch, we propose a method based on blind adaptation of a linear dynamic model. The benefit of our method is that it needs no off-line training phase, and experimental results show that our system is comparable with data-oriented systems in terms of high-band spectral distortion. When data mismatch occurs, our system outperforms those systems.

1. INTRODUCTION

With the gradual deployment of wideband voice terminals such as the adaptive multi-rate wideband codec (AMR-WB) and the variable-rate multimode wideband codec (VMR-WB), the current speech transmission network is a mixture of traditional narrowband terminals and new wideband terminals. During this transition period, bandwidth extension (BWE) systems help to enhance the perceived quality of narrowband speech without the cost of replacing the old narrowband infrastructure.
The authors of [5] indicate that existing BWE systems perform reasonably well not because they accurately retrieve the original missing high-band information, but because they extend the high band in such a way that the signal sounds perceptually pleasant. Reported BWE systems can basically be classified into two categories: memoryless systems and memory systems. Memoryless methods represent the earlier development of BWE, with members such as VQ codebook mapping [1], linear mapping [3] and Gaussian mixture model (GMM) conversion [2]. These methods are usually criticized for disregarding inter-frame correlation, which causes relatively strong hissing artefacts. Recently, more attention has been paid to memory systems; candidates are the hidden Markov model (HMM) method [5], HMM with state mapping [7] and the linear dynamic model [8]. These systems are characterized by their capability of estimating the missing high-band information based on previous estimates: they focus more on retrieving the trajectory of the spectral evolution, so hissing artefacts are greatly reduced. However, all the systems mentioned above are data-oriented. They perform well when the input narrowband speech is consistent with the training database, i.e. the same speaker or a similar recording environment, which is not the case in real applications, where data mismatch often occurs. In this paper we propose a memory system based on a linear dynamic model whose parameters adapt to the input narrowband speech in a blind manner. Off-line model training is not required except for the initial model. Experimental results show that the proposed method is superior to memoryless systems and comparable with memory systems in terms of high-band spectral distortion. (This research is supported by a Strategic Research Grant of City University of Hong Kong.)

The rest of the paper is organized as follows. Section 2 presents the use of the linear dynamic model in bandwidth extension systems.
Section 3 explains how the proposed system makes itself adaptive to the input narrowband speech in a blind sense. In Section 4 the objective performance is compared, and the last section concludes the paper.

2. LINEAR DYNAMIC MODEL

The model is also termed the state space model. In the linear state space model, the hidden speech state vector $x(k) \in \mathbb{R}^p$ is assumed to evolve linearly according to equation (1):

$$x(k+1) = A\,x(k) + u + w(k) \qquad (1)$$
Here $k$ is the time index (speech frame index). The transformation matrix $A$ and the deterministic control vector $u$ are pre-trained model parameters, and $w(k)$ is an uncorrelated zero-mean Gaussian noise vector with covariance $E[w(k)\,w(l)^T] = Q\,\delta_{kl}$.

[Figure 1(a): original wideband female speech from speaker clh]

The observation vector $o(k) \in \mathbb{R}^m$ is a noisy linear transformation of the state vector $x(k)$ according to equation (2):

$$o(k) = C\,x(k) + v(k) \qquad (2)$$

This equation is static in nature, since the vectors $o$ and $x$ share the same time index. $v(k)$ is also uncorrelated zero-mean Gaussian noise, with covariance $E[v(k)\,v(l)^T] = R\,\delta_{kl}$.

By taking $o(k)$ as the input narrowband speech feature vector and $x(k)$ as the unknown wideband feature vector, the linear state space model can be employed in a speech bandwidth extension system. Such an assumption is reasonable: the human speech process is a non-stationary random process whose hidden state is never static, and state equation (1) is one possible representation of the state evolution that is convenient for mathematical treatment. The linear relationship between $o(k)$ and $x(k)$ assumed in observation equation (2) has already been applied in memoryless linear mapping systems, whose performance is satisfactory apart from the limitation of their memoryless nature.

Due to the presence of noise and the possible non-invertibility of the matrix $C$ in equation (2), the state vector $x(k)$ cannot be uniquely estimated given the observation vector $o(k)$; this reflects the one-to-many mapping between narrowband and wideband speech features [6]. We extract 10th-order line spectral frequencies (LSF) as the narrowband feature $o$. The target wideband feature $x$ is defined as 18th-order LSF.
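As a small illustrative sketch (not the authors' implementation; all dimensions and parameter values here are arbitrary assumptions rather than trained speech models), the generative model of equations (1) and (2) can be simulated as:

```python
import numpy as np

# Sketch of the linear state space model of equations (1)-(2).
# Dimensions and parameter values are arbitrary illustrative assumptions.
rng = np.random.default_rng(0)
p, m, T = 4, 2, 200              # state dim, observation dim, number of frames
A = 0.9 * np.eye(p)              # transformation matrix (stable dynamics)
u = 0.1 * np.ones(p)             # deterministic control vector
C = rng.standard_normal((m, p))  # observation matrix; m < p, so C is not invertible
Q = 0.01 * np.eye(p)             # process-noise covariance
R = 0.01 * np.eye(m)             # observation-noise covariance

x = np.zeros(p)
states, observations = [], []
for k in range(T):
    x = A @ x + u + rng.multivariate_normal(np.zeros(p), Q)  # equation (1)
    o = C @ x + rng.multivariate_normal(np.zeros(m), R)      # equation (2)
    states.append(x)
    observations.append(o)
states, observations = np.array(states), np.array(observations)
```

Because $m < p$ here, many different states produce the same observation, which mirrors the one-to-many narrowband-to-wideband ambiguity discussed above.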
3. BLIND MODEL ADAPTATION

Given a sequence of narrowband feature vectors $o$ and a trained linear state space model $\theta = \{A, u, C, Q, R\}$, if we assume that $\theta$ is stationary and well trained for the sequence, a good estimate of the $x$ sequence can be obtained via the Kalman filter algorithm. For $k = 1, 2, \dots, L$:

Kalman prediction:
$$\hat{x}^{-}(k) = A\,\hat{x}(k-1) + u, \qquad P^{-}(k) = A\,P(k-1)\,A^{T} + Q$$

Kalman gain:
$$K(k) = P^{-}(k)\,C^{T}\big(C\,P^{-}(k)\,C^{T} + R\big)^{-1}$$

Kalman correction:
$$\hat{x}(k) = \hat{x}^{-}(k) + K(k)\big(o(k) - C\,\hat{x}^{-}(k)\big), \qquad P(k) = \big(I - K(k)\,C\big)\,P^{-}(k)$$

usually initialized by
$$\hat{x}(0) = E[x(0)] = \mu(0), \qquad P(0) = E\big[(x(0)-\mu(0))(x(0)-\mu(0))^{T}\big].$$

[Figure 1(b): estimation of wideband speech with the mismatched model of male speaker bjm]
[Figure 1: illustration of model mismatch]

The formulation of the Kalman gain matrix $K$ aims at minimizing the trace of the state error covariance matrix; the Kalman filter is therefore an MMSE estimator of the hidden wideband speech vector $x(k)$. Note that the Kalman filter algorithm is sequential. However, $\theta = \{A, u, C, Q, R\}$ should not be stationary. The method described in [8] provides the state space model with several different modes: block by block, the system chooses the best-fitting mode for the input narrowband sequence via clustering techniques. Although such a treatment gives the state space model a certain degree of dynamics, and the subjective and objective performance of the system is satisfactory, like other data-oriented methods it requires a relatively large amount of training data. Moreover, when the input narrowband speech differs considerably from the training database (e.g. a different speaker or recording environment), severe model errors occur, leading to an unacceptable level of hissing. For example, a false formant trajectory may appear in the high-band spectrogram when a male model is applied to a female input (see Figure 1), and vice versa. We propose a model-updating mechanism that does not require off-line training. The basic assumption is that the system is confident about previously estimated wideband features and, by utilizing those results, allows the model parameters to be updated.
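A minimal generic sketch of the Kalman recursion above (my own variable names; not the paper's code):

```python
import numpy as np

def kalman_filter(obs, A, u, C, Q, R, x0, P0):
    """Sequential MMSE estimation of the hidden state x(k) for the model
    x(k+1) = A x(k) + u + w(k),  o(k) = C x(k) + v(k)."""
    x, P = x0, P0
    estimates = []
    for o in obs:
        # Kalman prediction
        x_pred = A @ x + u
        P_pred = A @ P @ A.T + Q
        # Kalman gain
        K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)
        # Kalman correction
        x = x_pred + K @ (o - C @ x_pred)
        P = (np.eye(len(x)) - K @ C) @ P_pred
        estimates.append(x)
    return np.array(estimates)
```

With a near-noiseless observation model (R much smaller than Q), the corrected estimates essentially track the observations; with a larger R, the filter leans more on the state dynamics.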
The concept is illustrated in Figure 2. For an arbitrary input narrowband vector sequence of length N, $o(k), o(k+1), \dots, o(k+N)$, the linear state space model is assumed stationary: $\theta(k) = \theta$.

[Figure 2: the sequence length is fixed to N frames and the updating block moves forward frame by frame]

The corresponding wideband estimates $\hat{x}(k), \hat{x}(k+1), \dots, \hat{x}(k+N)$ are obtained from the previous estimation. Now consider the next narrowband input vector $o(k+N+1)$. First, the sequential Kalman filter continues to estimate $\hat{x}(k+N+1)$ with model $\theta(k)$. Then, given the narrowband observation sequence $o(k+1), \dots, o(k+N+1)$ and the wideband state estimates $\hat{x}(k+1), \dots, \hat{x}(k+N+1)$, the linear state space model $\theta(k+1)$ for the frames $k+1$ to $k+N+1$ is updated in the maximum-likelihood sense. With all sums taken over the frames of the current window, the updated parameters are the standard least-squares solutions:

$$[\hat{A}\ \ \hat{u}] = \left[\sum_i \hat{x}(i{+}1)\hat{x}(i)^T \quad \sum_i \hat{x}(i{+}1)\right] \begin{bmatrix} \sum_i \hat{x}(i)\hat{x}(i)^T & \sum_i \hat{x}(i) \\ \sum_i \hat{x}(i)^T & N \end{bmatrix}^{-1}$$

$$\hat{Q} = \frac{1}{N}\sum_i \big(\hat{x}(i{+}1)-\hat{A}\hat{x}(i)-\hat{u}\big)\big(\hat{x}(i{+}1)-\hat{A}\hat{x}(i)-\hat{u}\big)^T$$

$$\hat{C} = \Big(\sum_i o(i)\,\hat{x}(i)^T\Big)\Big(\sum_i \hat{x}(i)\,\hat{x}(i)^T\Big)^{-1}, \qquad \hat{R} = \frac{1}{N{+}1}\sum_i \big(o(i)-\hat{C}\hat{x}(i)\big)\big(o(i)-\hat{C}\hat{x}(i)\big)^T$$

The eight accumulated sums involved, denoted $\Phi_1(k), \dots, \Phi_8(k)$, are the only quantities that must be maintained.

[Figure 3: system capability of tracking wideband features. (a) with model parameter updating; (b) without model parameter updating; (c) original wideband LSF trajectory]

With $\theta(k+1)$ and the next narrowband input $o(k+N+2)$, $\hat{x}(k+N+2)$ can be estimated via the Kalman filter algorithm, and $\theta(k+2)$ is then updated from $o(k+2), \dots, o(k+N+2)$ and $\hat{x}(k+2), \dots, \hat{x}(k+N+2)$. The procedure continues until the end of the input is reached, which in effect conducts a timely on-line training of the linear state space model. The computation is not as burdensome as the model training in [8], since $\Phi_1$ to $\Phi_8$ require full calculation only once, for the very first update; in each subsequent update, one addition and one subtraction per sum suffice. The value of N is set to 60 frames for our codec configuration. If N is too small (say, less than 50), the matrices to be inverted may become singular. The initial linear state space model is trained off-line with environmental signals collected when speakers are not talking. The method is named blind because the updating of the
state space model is localized within N consecutive speech frames; the model parameters are therefore optimized merely for these N frames. Besides, the wideband training data are the previously estimated data rather than the true data. One may wonder whether the updating is correct, given that there are estimation errors. We performed the following experiment and found that such blind adaptation is trustworthy. In the first part of the experiment, the narrowband speech is enhanced with the initial model and no adaptation; in this case the model error is quite large. The other part is normal operation, with blind adaptation allowed. As depicted in Figure 3, under normal operation the proposed system retrieves the general shape of the high-band feature trajectories. The average distortion of the line spectral frequencies is listed in Table 1; for reference, the third column gives the result of the well-trained, source-matched memory system presented in [8]. Note that perceived quality is more closely related to small distortions of the high-order LSFs.

[Figure 4: spectral envelope comparison. Dotted curve: proposed method (case (a)); dashed curve: memoryless VQ method (case (b)); solid curve: original]

[Table 1: LSF distortions along orders; columns: with adaptation, without adaptation, LDS reference [8]]

Conceptually, in the beginning the proposed system enhances the silent narrowband input, producing spectrally flat, noise-like high-band signals, just as an ordinary linear state space system does. When speech content comes in, the high-band spectral distortion would be large if the initial model did not change accordingly. With adaptation, the model parameters are promptly and locally optimized for the current voiced narrowband input, driving the underlying model towards a voiced model, frame by frame.
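The sliding-window sums used for the on-line model update of Section 3 can be maintained with one addition and one subtraction per frame. A minimal sketch (feature shapes and function names are my own illustrative assumptions):

```python
import numpy as np

def init_stats(X):
    """Full accumulation over the first window of frames (done only once)."""
    return {"sum_x": X.sum(axis=0),  # sum of feature vectors
            "sum_xx": X.T @ X}       # sum of outer products x x^T

def slide(stats, x_new, x_old):
    """Advance the window one frame: add the newest frame, drop the oldest."""
    stats["sum_x"] += x_new - x_old
    stats["sum_xx"] += np.outer(x_new, x_new) - np.outer(x_old, x_old)
    return stats
```

The maximum-likelihood parameter estimates are then recomputed from the updated sums, so each frame costs one rank-one addition and one rank-one subtraction per statistic instead of a full re-accumulation over the window.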
Since the required wideband features are previous estimations, which are spectrally flat, the new model parameters are actually optimized for wideband output speech that has a narrowband part similar to the input and a spectrally flat high band. Recall that the speech sounds that suffer most under bandwidth limitation are fricatives and plosives; these sounds have a relatively flat high band and little voicing content in the high-band portion.

[Figure 5: performance illustration of blind model adaptation. (a) original wideband female speech; (b) estimated wideband speech by the linear state space model with the correct speaker model; (c) estimated wideband speech by blind model adaptation]

The objective of bandwidth extension is to artificially extend the bandwidth so that the speech becomes perceptually better; see the illustration in Figure 4. In case (a), the speech quality is still enhanced even though the high-band formant structure is not recovered. But if case (b) occurs (due to model error or the limitation of a memoryless design), the human ear is quite sensitive to the resulting noise. Finally, Figure 5 shows the performance difference between the proposed method and the conventional LDS system [8].

4. PERFORMANCE EVALUATION

The objective measurement is the high-band spectral distortion, defined as follows:
$$D = \sqrt{\frac{2}{\pi}\int_{\pi/2}^{\pi}\big(10\log_{10}S_{org}(\omega) - 10\log_{10}S_{ext}(\omega)\big)^{2}\,d\omega}\ \ \text{(dB)} \qquad (3)$$

where $S_{org}$ and $S_{ext}$ are the power spectra of the original and the extended wideband speech, respectively.

The BWE systems of [1][2][3][5][7][8] were implemented and trained in a speaker-dependent manner. The training data come from the phonetically balanced IViE corpus: 8 minutes of speaker-dependent paragraph-reading speech (about 100,000 frames according to our speech analyzer) were picked out to train all six systems. The silence segments were collected for training the initial model of the proposed system, which is the only off-line training the system requires. The performance is listed in Tables 2 and 3; the mismatched test data are collected from another speaker of a different gender.

[Table 2: Test A (test data matches training data). High-band spectral distortion D (dB) and outlier rates (>5 dB and >7.5 dB) for VQ [1], linear mapping [3], GMM [2], HMM [5], HMM state mapping [7], the linear state space model [8], and the proposed method]

[Table 3: Test B (test data mismatches training data). Same systems and measures as Table 2]

As Tables 2 and 3 show, the proposed method performs similarly under the two circumstances. When the test data are consistent with the training database, its performance is better than that of the memoryless systems and comparable with that of the memory systems. When model mismatch occurs, it outperforms all the data-oriented methods.

5. CONCLUSION

In this paper we have presented a bandwidth extension system based on blind adaptation of a linear state space model. Measured by high-band spectral distortion, the proposed system is comparable with data-oriented memory systems and better than memoryless systems. When data mismatch occurs, its performance is better than that of all the data-oriented systems, provided that the background environment does not change dramatically.
Moreover, off-line training is not required, and the efficient computation of the on-line model adaptation keeps the system delay small.

REFERENCES

[1] N. Enbom and W.B. Kleijn, "Bandwidth Expansion of Speech Based on Vector Quantization of the Mel Frequency Cepstral Coefficients," Proc. IEEE Workshop on Speech Coding, pp. 171-173, 1999.
[2] K.Y. Park and H.S. Kim, "Narrowband to Wideband Conversion of Speech Using GMM Based Transformation," Proc. ICASSP, 2000.
[3] Y. Nakatoh, M. Tsushima and T. Norimatsu, "Generation of Broadband Speech from Narrowband Speech Based on Linear Mapping," Electronics and Communications in Japan, Part 2, Vol. 85, No. 8, pp. 44-53, 2002.
[4] M. Nilsson, H. Gustafsson, S.V. Andersen and W.B. Kleijn, "Gaussian Mixture Model Based Mutual Information Estimation between Frequency Bands in Speech," Proc. ICASSP, 2002.
[5] P. Jax and P. Vary, "On Artificial Bandwidth Extension of Telephone Speech," Signal Processing, Vol. 83, pp. 1707-1719, 2003.
[6] Y. Agiomyrgiannakis and Y. Stylianou, "Combined Estimation/Coding of Highband Spectral Envelopes for Speech Spectrum Expansion," Proc. ICASSP, pp. 469-472, 2004.
[7] S. Yao and C.F. Chan, "Block-based Bandwidth Extension of Narrowband Speech Signal by using CDHMM," Proc. ICASSP, pp. I-793 - I-796, 2005.
[8] S. Yao and C.F. Chan, "Speech Bandwidth Enhancement using State Space Speech Dynamics," Proc. ICASSP, pp. I-489 - I-492, 2006.
More informationADAPTIVE NOISE LEVEL ESTIMATION
Proc. of the 9 th Int. Conference on Digital Audio Effects (DAFx-6), Montreal, Canada, September 18-2, 26 ADAPTIVE NOISE LEVEL ESTIMATION Chunghsin Yeh Analysis/Synthesis team IRCAM/CNRS-STMS, Paris, France
More informationSGN Audio and Speech Processing
Introduction 1 Course goals Introduction 2 SGN 14006 Audio and Speech Processing Lectures, Fall 2014 Anssi Klapuri Tampere University of Technology! Learn basics of audio signal processing Basic operations
More informationON THE POTENTIAL FOR ARTIFICIAL BANDWIDTH EXTENSION OF BONE AND TISSUE CONDUCTED SPEECH: A MUTUAL INFORMATION STUDY
Authors' accepted manuscript of the article published in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) http://dx.doi.org/10.1109/icassp.2015.7178944 ON THE POTENTIAL
More informationVoice Activity Detection
Voice Activity Detection Speech Processing Tom Bäckström Aalto University October 2015 Introduction Voice activity detection (VAD) (or speech activity detection, or speech detection) refers to a class
More informationMikko Myllymäki and Tuomas Virtanen
NON-STATIONARY NOISE MODEL COMPENSATION IN VOICE ACTIVITY DETECTION Mikko Myllymäki and Tuomas Virtanen Department of Signal Processing, Tampere University of Technology Korkeakoulunkatu 1, 3370, Tampere,
More informationEE482: Digital Signal Processing Applications
Professor Brendan Morris, SEB 3216, brendan.morris@unlv.edu EE482: Digital Signal Processing Applications Spring 2014 TTh 14:30-15:45 CBC C222 Lecture 12 Speech Signal Processing 14/03/25 http://www.ee.unlv.edu/~b1morris/ee482/
More informationJoint recognition and direction-of-arrival estimation of simultaneous meetingroom acoustic events
INTERSPEECH 2013 Joint recognition and direction-of-arrival estimation of simultaneous meetingroom acoustic events Rupayan Chakraborty and Climent Nadeu TALP Research Centre, Department of Signal Theory
More informationSpeech and Music Discrimination based on Signal Modulation Spectrum.
Speech and Music Discrimination based on Signal Modulation Spectrum. Pavel Balabko June 24, 1999 1 Introduction. This work is devoted to the problem of automatic speech and music discrimination. As we
More informationFlexible and Scalable Transform-Domain Codebook for High Bit Rate CELP Coders
Flexible and Scalable Transform-Domain Codebook for High Bit Rate CELP Coders Václav Eksler, Bruno Bessette, Milan Jelínek, Tommy Vaillancourt University of Sherbrooke, VoiceAge Corporation Montreal, QC,
More informationStudents: Avihay Barazany Royi Levy Supervisor: Kuti Avargel In Association with: Zoran, Haifa
Students: Avihay Barazany Royi Levy Supervisor: Kuti Avargel In Association with: Zoran, Haifa Spring 2008 Introduction Problem Formulation Possible Solutions Proposed Algorithm Experimental Results Conclusions
More informationANALOGUE TRANSMISSION OVER FADING CHANNELS
J.P. Linnartz EECS 290i handouts Spring 1993 ANALOGUE TRANSMISSION OVER FADING CHANNELS Amplitude modulation Various methods exist to transmit a baseband message m(t) using an RF carrier signal c(t) =
More informationEnhancement of Speech Signal by Adaptation of Scales and Thresholds of Bionic Wavelet Transform Coefficients
ISSN (Print) : 232 3765 An ISO 3297: 27 Certified Organization Vol. 3, Special Issue 3, April 214 Paiyanoor-63 14, Tamil Nadu, India Enhancement of Speech Signal by Adaptation of Scales and Thresholds
More informationON THE PERFORMANCE OF WTIMIT FOR WIDE BAND TELEPHONY
ON THE PERFORMANCE OF WTIMIT FOR WIDE BAND TELEPHONY D. Nagajyothi 1 and P. Siddaiah 2 1 Department of Electronics and Communication Engineering, Vardhaman College of Engineering, Shamshabad, Telangana,
More informationHUMAN speech is frequently encountered in several
1948 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 7, SEPTEMBER 2012 Enhancement of Single-Channel Periodic Signals in the Time-Domain Jesper Rindom Jensen, Student Member,
More informationNOISE ESTIMATION IN A SINGLE CHANNEL
SPEECH ENHANCEMENT FOR CROSS-TALK INTERFERENCE by Levent M. Arslan and John H.L. Hansen Robust Speech Processing Laboratory Department of Electrical Engineering Box 99 Duke University Durham, North Carolina
More informationA Closed-loop Multimode Variable Bit Rate Characteristic Waveform Interpolation Coder
A Closed-loop Multimode Variable Bit Rate Characteristic Waveform Interpolation Coder Jing Wang, Jingg Kuang, and Shenghui Zhao Research Center of Digital Communication Technology,Department of Electronic
More informationTHERE is a constant need for speech codecs with decreased
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 2, FEBRUARY 2007 377 Conditional Vector Quantization for Speech Coding Yannis Agiomyrgiannakis and Yannis Stylianou Abstract In
More informationAccurate Delay Measurement of Coded Speech Signals with Subsample Resolution
PAGE 433 Accurate Delay Measurement of Coded Speech Signals with Subsample Resolution Wenliang Lu, D. Sen, and Shuai Wang School of Electrical Engineering & Telecommunications University of New South Wales,
More informationA Comparative Study of Formant Frequencies Estimation Techniques
A Comparative Study of Formant Frequencies Estimation Techniques DORRA GARGOURI, Med ALI KAMMOUN and AHMED BEN HAMIDA Unité de traitement de l information et électronique médicale, ENIS University of Sfax
More informationBEAT DETECTION BY DYNAMIC PROGRAMMING. Racquel Ivy Awuor
BEAT DETECTION BY DYNAMIC PROGRAMMING Racquel Ivy Awuor University of Rochester Department of Electrical and Computer Engineering Rochester, NY 14627 rawuor@ur.rochester.edu ABSTRACT A beat is a salient
More informationA CONSTRUCTION OF COMPACT MFCC-TYPE FEATURES USING SHORT-TIME STATISTICS FOR APPLICATIONS IN AUDIO SEGMENTATION
17th European Signal Processing Conference (EUSIPCO 2009) Glasgow, Scotland, August 24-28, 2009 A CONSTRUCTION OF COMPACT MFCC-TYPE FEATURES USING SHORT-TIME STATISTICS FOR APPLICATIONS IN AUDIO SEGMENTATION
More informationCan binary masks improve intelligibility?
Can binary masks improve intelligibility? Mike Brookes (Imperial College London) & Mark Huckvale (University College London) Apparently so... 2 How does it work? 3 Time-frequency grid of local SNR + +
More informationtechniques are means of reducing the bandwidth needed to represent the human voice. In mobile
8 2. LITERATURE SURVEY The available radio spectrum for the wireless radio communication is very limited hence to accommodate maximum number of users the speech is compressed. The speech compression techniques
More informationVocoder (LPC) Analysis by Variation of Input Parameters and Signals
ISCA Journal of Engineering Sciences ISCA J. Engineering Sci. Vocoder (LPC) Analysis by Variation of Input Parameters and Signals Abstract Gupta Rajani, Mehta Alok K. and Tiwari Vebhav Truba College of
More informationSpeech Signal Enhancement Techniques
Speech Signal Enhancement Techniques Chouki Zegar 1, Abdelhakim Dahimene 2 1,2 Institute of Electrical and Electronic Engineering, University of Boumerdes, Algeria inelectr@yahoo.fr, dahimenehakim@yahoo.fr
More informationRASTA-PLP SPEECH ANALYSIS. Aruna Bayya. Phil Kohn y TR December 1991
RASTA-PLP SPEECH ANALYSIS Hynek Hermansky Nelson Morgan y Aruna Bayya Phil Kohn y TR-91-069 December 1991 Abstract Most speech parameter estimation techniques are easily inuenced by the frequency response
More informationSignal Processing for Speech Applications - Part 2-1. Signal Processing For Speech Applications - Part 2
Signal Processing for Speech Applications - Part 2-1 Signal Processing For Speech Applications - Part 2 May 14, 2013 Signal Processing for Speech Applications - Part 2-2 References Huang et al., Chapter
More informationChapter 2 Direct-Sequence Systems
Chapter 2 Direct-Sequence Systems A spread-spectrum signal is one with an extra modulation that expands the signal bandwidth greatly beyond what is required by the underlying coded-data modulation. Spread-spectrum
More informationMultimedia Signal Processing: Theory and Applications in Speech, Music and Communications
Brochure More information from http://www.researchandmarkets.com/reports/569388/ Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications Description: Multimedia Signal
More informationThe Partly Preserved Natural Phases in the Concatenative Speech Synthesis Based on the Harmonic/Noise Approach
The Partly Preserved Natural Phases in the Concatenative Speech Synthesis Based on the Harmonic/Noise Approach ZBYNĚ K TYCHTL Department of Cybernetics University of West Bohemia Univerzitní 8, 306 14
More informationEnhancement of Speech in Noisy Conditions
Enhancement of Speech in Noisy Conditions Anuprita P Pawar 1, Asst.Prof.Kirtimalini.B.Choudhari 2 PG Student, Dept. of Electronics and Telecommunication, AISSMS C.O.E., Pune University, India 1 Assistant
More informationPDF hosted at the Radboud Repository of the Radboud University Nijmegen
PDF hosted at the Radboud Repository of the Radboud University Nijmegen The following full text is an author's version which may differ from the publisher's version. For additional information about this
More informationChapter 2 Channel Equalization
Chapter 2 Channel Equalization 2.1 Introduction In wireless communication systems signal experiences distortion due to fading [17]. As signal propagates, it follows multiple paths between transmitter and
More informationSPEECH ENHANCEMENT USING PITCH DETECTION APPROACH FOR NOISY ENVIRONMENT
SPEECH ENHANCEMENT USING PITCH DETECTION APPROACH FOR NOISY ENVIRONMENT RASHMI MAKHIJANI Department of CSE, G. H. R.C.E., Near CRPF Campus,Hingna Road, Nagpur, Maharashtra, India rashmi.makhijani2002@gmail.com
More informationSeparating Voiced Segments from Music File using MFCC, ZCR and GMM
Separating Voiced Segments from Music File using MFCC, ZCR and GMM Mr. Prashant P. Zirmite 1, Mr. Mahesh K. Patil 2, Mr. Santosh P. Salgar 3,Mr. Veeresh M. Metigoudar 4 1,2,3,4Assistant Professor, Dept.
More informationCOM 12 C 288 E October 2011 English only Original: English
Question(s): 9/12 Source: Title: INTERNATIONAL TELECOMMUNICATION UNION TELECOMMUNICATION STANDARDIZATION SECTOR STUDY PERIOD 2009-2012 Audience STUDY GROUP 12 CONTRIBUTION 288 P.ONRA Contribution Additional
More informationAn Efficient Extraction of Vocal Portion from Music Accompaniment Using Trend Estimation
An Efficient Extraction of Vocal Portion from Music Accompaniment Using Trend Estimation Aisvarya V 1, Suganthy M 2 PG Student [Comm. Systems], Dept. of ECE, Sree Sastha Institute of Engg. & Tech., Chennai,
More informationMachine recognition of speech trained on data from New Jersey Labs
Machine recognition of speech trained on data from New Jersey Labs Frequency response (peak around 5 Hz) Impulse response (effective length around 200 ms) 41 RASTA filter 10 attenuation [db] 40 1 10 modulation
More informationSubjective Voice Quality Evaluation of Artificial Bandwidth Extension: Comparing Different Audio Bandwidths and Speech Codecs
INTERSPEECH 01 Subjective Voice Quality Evaluation of Artificial Bandwidth Extension: Comparing Different Audio Bandwidths and Speech Codecs Hannu Pulakka 1, Anssi Rämö, Ville Myllylä 1, Henri Toukomaa,
More informationAutomatic Text-Independent. Speaker. Recognition Approaches Using Binaural Inputs
Automatic Text-Independent Speaker Recognition Approaches Using Binaural Inputs Karim Youssef, Sylvain Argentieri and Jean-Luc Zarader 1 Outline Automatic speaker recognition: introduction Designed systems
More informationHigh-speed Noise Cancellation with Microphone Array
Noise Cancellation a Posteriori Probability, Maximum Criteria Independent Component Analysis High-speed Noise Cancellation with Microphone Array We propose the use of a microphone array based on independent
More informationA Hybrid Synchronization Technique for the Frequency Offset Correction in OFDM
A Hybrid Synchronization Technique for the Frequency Offset Correction in OFDM Sameer S. M Department of Electronics and Electrical Communication Engineering Indian Institute of Technology Kharagpur West
More information