Analysis of LSF frame selection in voice conversion


Elina Helander¹, Jani Nurminen², Moncef Gabbouj¹
¹ Institute of Signal Processing, Tampere University of Technology, Finland
² Nokia Technology Platforms, Tampere, Finland

Abstract

In practical applications of voice conversion, it is necessary to be able to cope with small amounts of speaker-specific training data. Consequently, most of the proposed voice conversion algorithms are based on probabilistic conversion functions. Recently, however, there has been increased interest in unit selection based approaches for voice conversion. It is evident that typical training sets are too small for enabling meaningful selection of large units such as diphones. But would it be possible to use smaller segments like frames for high quality results provided that the selection is handled very well? In this paper, we analyze the performance of the frame selection approach in ideal conditions. In the experiments, line spectral frequencies of test sentences are replaced with the best matches from different training sets. The results show that perceptually transparent quality cannot be achieved with realistic database sizes.

1. Introduction

In unit selection speech synthesis [1], speech is produced by selecting segments from a recorded database and by concatenating them together. The database is large, typically consisting of several hours of speech, sometimes even tens of hours for providing an optimal unit sequence. The most popular unit sizes used in the selection are diphones and triphones.

Voice conversion (VC) provides means for generating new text-to-speech (TTS) voices in a fast and easy manner using only small training sets. Voice conversion (or voice morphing) has inspired many researchers during the last two decades. The aim in VC is to convert speech from one speaker (source speaker) to sound like the speech from another particular speaker (target speaker).
Most voice conversion systems proposed in the literature are based on applying a conversion scheme directly to the source speech or its parametric representation. Typical examples of conversion schemes include Gaussian mixture model (GMM) based conversion [2] and the use of codebooks [3, 4]. Another approach for voice conversion is the parametric adaptation in a hidden Markov model (HMM) based speech synthesis framework [5]. All of the approaches share the same fundamental requirement: they have to be able to cope with small amounts of speaker-specific training data. Due to this requirement, the unit selection idea cannot be directly used with conventional unit sizes in voice conversion because there simply is not enough data to select from.

Speaker identity can be partially characterized using formant positions and bandwidths. Since it is very hard to estimate formants in a reliable and robust manner, the features most often used for conversion in VC systems are the line spectral frequencies (LSFs). LSFs are features derived through linear prediction (LP), where speech is modeled using a filter given by the LP coefficients and a residual. In most VC studies, the residuals are left unconverted, but there are strong arguments for converting residuals, and some techniques have been proposed for this task, for example in [6]. LSFs have also been used widely in speech coding, where typically large amounts of data from various speakers and languages are used for the training of LSF quantizers to obtain a good representation of the LSF space of all speakers. In speaker identification, high order LSFs have been reported to perform well as speaker identification features [7] and they have been used in many related studies (e.g. [8]).

Although LSFs seem to carry a lot of speaker identity information, only a few personalized speech coding approaches have been proposed ([9, 10]).

The ultimate goal in voice conversion is to convert the speaker identity as accurately as possible while maintaining high speech quality. However, these requirements have been found to be somewhat contradictory in practice; better identity conversion usually requires more signal modifications that may cause more distortions. The main problem of the current VC techniques is that they are not very successful in changing the identity. Good results are mainly obtained because of forced ABX tests; the speech sample may sound more like target speech than source speech, but this does not mean that it would ultimately sound like speech of the target speaker. All of the current techniques, including the GMM based conversion and the use of codebooks, have inherent drawbacks from this point of view.

Recently, Dutoit et al. [11] proposed to first use a conventional GMM based approach to convert source LSFs to target LSFs and then search from the target speech database for the closest match to the converted LSFs in order to obtain more realistic target LSFs. The idea is attractive, but can it help in achieving high quality conversion? In this study, we analyze whether it is possible to select LSFs from a target database of a realistic size in such a manner that the quality of the converted speech would be very high or even indistinguishable from the target speech. The results of our experiments reveal how accurately LSFs could be chosen provided that the conversion is successful. Multiple speakers, different test sentence sets and different sizes of target databases are examined, and the results are presented in the light of quality criteria used widely in speech coding.

This paper is organized as follows. In Chapter 2, the basic properties of LSFs and the related distance metrics and quality criteria are discussed.
The experiments and results demonstrating the idealized frame selection performance are described in Chapter 3. Chapter 4 provides a short discussion on the results and Chapter 5 concludes the study.

2. Linear prediction and line spectral frequencies

Linear prediction is one of the basic techniques used in speech processing. This source-filter model can be used for separating a speech signal into linear prediction coefficients that model the vocal tract contribution and into an excitation signal. More precisely, the excitation signal, also referred to as the residual signal, can be obtained through LP analysis filtering,

$$ r(t) = x(t) - \sum_{k=1}^{m} a_k x(t-k), \tag{1} $$

where $x(t)$ is the input speech signal and $m$ is the order of the analysis filter $A(z)$. The linear prediction coefficients $\{a_k\}$ are usually estimated in a frame-wise manner using either the autocorrelation or covariance methods. The autocorrelation method is widely used because it always ensures that the resulting filters are stable.

For further processing, the linear prediction coefficients are often converted into the line spectral frequency representation. The fully reversible conversion can be carried out by first calculating the roots of the polynomials

$$ P(z) = A(z) + z^{-(m+1)} A(z^{-1}), \qquad Q(z) = A(z) - z^{-(m+1)} A(z^{-1}). \tag{2} $$

Then, the LSF representation is formed simply by the angular positions $\{\omega_k\}$ of the complex roots in ascending order. The LSF representation is favored in different areas of speech processing for many reasons. For example, this representation offers advantageous properties from the viewpoint of quantization, interpolation and other processing, and it can guarantee filter stability.
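As an illustration of Eqs. (1) and (2), the following is a minimal NumPy sketch, not the analysis code used in the paper. It assumes the sign convention A(z) = 1 - sum_k a_k z^-k implied by Eq. (1), treats samples before the start of the signal as zero, and discards the trivial roots of P(z) and Q(z) at z = 1 and z = -1. The function names `lp_residual` and `lp_to_lsf` and the tolerance `tol` are hypothetical.

```python
import numpy as np


def lp_residual(x, a):
    """Eq. (1): r(t) = x(t) - sum_k a_k x(t-k).

    Assumption: samples before t = 0 are treated as zero.
    """
    x = np.asarray(x, dtype=float)
    a = np.asarray(a, dtype=float)
    analysis = np.concatenate(([1.0], -a))  # coefficients of A(z)
    return np.convolve(x, analysis)[: len(x)]


def lp_to_lsf(a, tol=1e-6):
    """Eq. (2): LSFs (radians, ascending) from LP coefficients a_1..a_m.

    Assumption: A(z) is minimum phase, so the roots of P and Q lie on the
    unit circle. Only angles strictly inside (0, pi) are kept, which drops
    the trivial roots at z = 1 and z = -1 and the conjugate duplicates.
    """
    a = np.asarray(a, dtype=float)
    A = np.concatenate(([1.0], -a))
    # z^-(m+1) A(z^-1) has the coefficients of A reversed and shifted by one.
    p = np.concatenate((A, [0.0])) + np.concatenate(([0.0], A[::-1]))
    q = np.concatenate((A, [0.0])) - np.concatenate(([0.0], A[::-1]))
    angles = np.angle(np.concatenate((np.roots(p), np.roots(q))))
    return np.sort(angles[(angles > tol) & (angles < np.pi - tol)])


# Usage: a stable second-order predictor yields two LSFs in (0, pi).
lsf = lp_to_lsf([1.2, -0.5])
residual = lp_residual([1.0, 2.0, 3.0], [0.5])
```

Root finding with `np.roots` is the simplest way to follow the text literally; production coders usually evaluate P and Q on a frequency grid instead, which is a different technique from the one sketched here.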

The LSF representation has also been widely used in voice conversion.

In selection based voice conversion, some distance measure is needed. The distance between two LSF vectors can be computed e.g. using weighted squared error with a diagonal weighting matrix,

$$ d(\omega, \hat{\omega}) = (\omega - \hat{\omega})^T W (\omega - \hat{\omega}) = \sum_{k=1}^{m} w_k (\omega_k - \hat{\omega}_k)^2. \tag{3} $$

The weights can be used for approximating the properties of human hearing. We use the weights given in [13] defined by

$$ w_k = c_k \left| H\left(e^{j 2\pi f_k / f_s}\right) \right|^{0.6}, \tag{4} $$

where $f_k$ denotes the frequency of the $k$th LSF element, $f_s$ is the sampling frequency, and $H(z)$ denotes the synthesis filter $H(z) = 1/A(z)$. Furthermore, when dealing with 10-dimensional LSF vectors at a sampling frequency of 8 kHz, $c_k$ is set to one for all $k$ except for $c_9 = 0.64$ and $c_{10} = 0.16$, as proposed in [13].

In addition to the weighted squared error distance, another useful and popular metric for measuring the distance between two LP spectra is spectral distortion (SD). It is defined in dB as

$$ SD = \sqrt{ \frac{1}{f_u - f_l} \int_{f_l}^{f_u} \left[ 20 \log_{10} \frac{\left| H\left(e^{j 2\pi f / f_s}\right) \right|}{\left| \hat{H}\left(e^{j 2\pi f / f_s}\right) \right|} \right]^2 df }, \tag{5} $$

where $f_l$ and $f_u$ denote the lower and upper frequency limits of the integration. A convenient property of this measure is the fact that there are generally accepted SD based criteria for perceptual spectral transparency, i.e. criteria that guarantee that two spectra are indistinguishable through listening. In [13], it was concluded that transparency is achieved if the following three criteria are met: 1) average SD is less than 1 dB, 2) there are no outlier frames having SD above 4 dB, and 3) less than 2% of frames have SD in the range from 2 to 4 dB.

3. Experimental results

To study the performance level achievable in voice conversion using the frame-based selection approach, we carried out experiments in idealized conditions. The main idea in these experiments was to focus only on the frame selection by making the assumption that other parts of voice conversion would perform perfectly.
In practice, we achieved this perfect conversion using recorded sentences from the target speaker as "converted test sentences". Frame-based selection was then applied on these recorded test sentences by replacing the LSF vectors in the test sentences with the best matches found in a selection database. The selection database was formed using uncompressed LSF vectors estimated from the speech of the target speaker. We experimented with various selection database sizes, but different sentences were always used in testing and training, making the experiment realistic apart from the above-mentioned assumption of idealized conditions. Thus, the results achieved in these experiments demonstrate the upper bound for the performance of frame-based selection in voice conversion.

The experiments were carried out using the publicly available CMU Arctic database [12], a database of 1132 utterances spoken by 7 different speakers: 2 female and 2 male American English speakers, 1 Canadian English male, 1 Scottish English male, and 1 male speaker with an Indian accent. The waveforms in the databases were downsampled to 8 kHz and 10th order LP analysis was performed at 10-ms intervals with overlapping 25-ms analysis frames, using the analysis module of the voice conversion system presented in [14]. Each analysis frame was windowed using a Hamming window and the LP coefficients were computed using the autocorrelation method.
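The analysis settings described above (8 kHz sampling, 25-ms Hamming-windowed frames every 10 ms, 10th order autocorrelation method) can be sketched roughly as follows. This is an illustrative NumPy version and not the analysis module of [14]; the names `levinson_durbin` and `lp_analysis`, the handling of all-zero frames, and the convention A(z) = 1 - sum_k a_k z^-k are assumptions, and the signal is assumed to be already downsampled.

```python
import numpy as np


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for a_1..a_order.

    r holds autocorrelation lags 0..order. The returned coefficients follow
    the assumed convention A(z) = 1 - sum_k a_k z^-k.
    """
    a = np.zeros(order)
    err = r[0]
    for i in range(order):
        # Reflection coefficient for the step from order i to order i + 1.
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err
        prev = a[:i].copy()
        a[:i] = prev - k * prev[::-1]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lp_analysis(signal, fs=8000, order=10, frame_ms=25, hop_ms=10):
    """Frame-wise LP coefficients, one row per analysis frame."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    window = np.hamming(frame_len)
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        full = np.correlate(frame, frame, mode="full")
        r = full[frame_len - 1: frame_len + order]  # lags 0..order
        if r[0] <= 0.0:
            # Assumption: an all-zero frame gets all-zero coefficients.
            rows.append(np.zeros(order))
        else:
            rows.append(levinson_durbin(r, order))
    return np.array(rows)


# Usage: one second of synthetic noise standing in for speech.
rng = np.random.default_rng(0)
coeffs = lp_analysis(rng.standard_normal(8000))
```

Because the autocorrelation lags come from a windowed frame, each resulting filter 1/A(z) is stable, which is the property the text attributes to the autocorrelation method.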

Each speaker served as a reference speaker (speaker in test sentences) and as a selection database speaker for him/herself. In addition, each speaker was also used as a database speaker for the other speakers for comparison purposes. The number of sentences in the selection databases was varied (5, 10, 20, 50 and 100) by including new sentences in such a way that larger sets always also contained the sentences included in the smaller sets. All 80 reference sentences and the database sentence sets were selected randomly but they were kept the same for all speaker combinations.

The new LSFs replacing the LSFs in the original reference sentences were selected from the selection database using the weighted squared error distortion in Eq. (3) together with the weighting in Eq. (4). This scheme was used to obtain a reasonable computational complexity. The final results were evaluated using the spectral distortion formula in Eq. (5) since it provides the best comparison capabilities. Frames classified as silence were not included in the results. The average spectral distortion, measured in the range from 0 to 3.2 kHz, and the percentage of 2 and 4 dB outliers were calculated for two different categories: i) the reference speaker is the same as the database speaker (7 cases) and ii) the reference speaker is different from the database speaker (42 cases).

The mean SD averaged over all speakers is shown in Figure 1 for categories i) (solid line) and ii) (dash-dotted line). The best and worst results in category i) are also shown (dashed lines). The dotted line represents the mean values of each reference speaker's best results when selecting from another speaker's database, i.e. the result with another speaker's database that gave the lowest average spectral distortion values for the reference speaker. The mean percentage of 2 dB and 4 dB outliers is shown in Figure 2 and Figure 3, respectively.
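The selection and evaluation steps described above can be sketched as follows. This is a hedged NumPy illustration rather than the authors' implementation: it reads Eq. (4) as w_k = c_k |H|^0.6 and Eq. (5) as a root-mean-square log-spectral difference, assumes the weights are computed from the reference frame's LP spectrum, assumes A(z) = 1 - sum_k a_k z^-k, and approximates the integral by a mean over a uniform frequency grid. All function names and the grid size `n` are hypothetical.

```python
import numpy as np


def lp_spectrum_mag(a, freqs_hz, fs=8000.0):
    """|H(e^{j 2 pi f / fs})| for H(z) = 1/A(z), A(z) = 1 - sum_k a_k z^-k."""
    a = np.asarray(a, dtype=float)
    k = np.arange(1, len(a) + 1)
    w = 2.0 * np.pi * np.atleast_1d(freqs_hz) / fs
    A = 1.0 - np.exp(-1j * np.outer(w, k)) @ a
    return 1.0 / np.abs(A)


def lsf_weights(lsf_rad, a, fs=8000.0):
    """Eq. (4) as read here: w_k = c_k |H(e^{j 2 pi f_k / fs})|^0.6."""
    lsf_rad = np.asarray(lsf_rad, dtype=float)
    c = np.ones(len(lsf_rad))
    if len(lsf_rad) == 10:
        c[8], c[9] = 0.64, 0.16  # c_9 and c_10 from the text
    f_hz = lsf_rad * fs / (2.0 * np.pi)
    return c * lp_spectrum_mag(a, f_hz, fs) ** 0.6


def select_frame(target_lsf, weights, database_lsfs):
    """Index of the database row that minimises the distance in Eq. (3)."""
    diff = np.asarray(database_lsfs, dtype=float) - np.asarray(target_lsf, dtype=float)
    return int(np.argmin((weights * diff ** 2).sum(axis=1)))


def spectral_distortion(a_ref, a_sel, fs=8000.0, f_lo=0.0, f_hi=3200.0, n=512):
    """Eq. (5) in dB, with the integral approximated on a uniform grid."""
    f = np.linspace(f_lo, f_hi, n)
    ratio = lp_spectrum_mag(a_ref, f, fs) / lp_spectrum_mag(a_sel, f, fs)
    return float(np.sqrt(np.mean((20.0 * np.log10(ratio)) ** 2)))


# Usage with toy two-dimensional LSF vectors (values are made up).
database = np.array([[0.3, 1.0], [0.5, 1.2], [0.9, 2.0]])
best = select_frame([0.5, 1.2], np.ones(2), database)
```

An exhaustive search like `select_frame` costs one weighted distance per database frame, which is why a cheap measure such as Eq. (3) is attractive for selection while the more expensive Eq. (5) is reserved for evaluation.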
As can be seen from Figure 1, the best matching LSFs were on average far away from ideal transparent quality. There are large differences between the speakers but even the best results were not very good. The low number of 4 dB outliers with larger databases is encouraging, but the requirement of having less than 2% of 2 dB outliers is far from being fulfilled. As expected, using another speaker's database was not as successful as using the speaker's own database, indicating that there are strong speaker-dependencies in LSFs. An interesting observation not directly visible in the figures was that the best results with another speaker's LSFs were always achieved when the LSFs were selected from a speaker with a matching gender. This is in line with the fact that the formant frequencies of female speakers are generally higher than the formant frequencies of male speakers due to the shorter vocal tract.

We also examined whether the quality would be much better if the number of sentences in the database was significantly increased. A set of 250 sentences resulted in an average SD of 1.3 dB for category i) with 7% and 0.1% of 2 dB and 4 dB outliers, respectively. The best result among the speakers was 1.15 dB. In addition, we tested whether using the whole Arctic database (1132 sentences minus one reference sentence) as the selection database could result in low spectral distortion. The mean of averaged SD was 1.09 dB for all speakers and the best speaker obtained an average SD of 0.97 dB, measured using 20 different reference sentences. There were 2.2% of 2 dB outliers and no 4 dB outliers. Using the whole database offers almost transparent quality. For the best other speaker, the average SD was 1.45 dB and the percentage of 2 dB outliers about 14%. Nevertheless, a database of this size would not be suitable for practical voice conversion.

4. Discussion

LSF selection from a single frame does not seem to provide very high spectral quality if the size of the database is realistic from the viewpoint of practical applications. In [11], the authors do mention that there is a relatively large non-parallel database available, which in their case means over 12 minutes of data. It can be considered a very large database for voice conversion. This corresponds to almost 250 sentences if a sentence is on average 3 seconds long. The results presented in this paper show that transparent quality cannot be achieved even with this kind of relatively large database in idealized conditions.
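To make the comparison with the transparency criteria concrete, the short sketch below applies the three criteria from [13] to the averages reported in Section 3 and reproduces the sentence-count estimate. It assumes that the reported "2 dB outliers" percentage is the share of frames with SD between 2 and 4 dB; the function name `is_transparent` is hypothetical.

```python
def is_transparent(mean_sd_db, pct_2_to_4_db, pct_above_4_db):
    """The three criteria quoted from [13]: mean SD below 1 dB, fewer than
    2% of frames between 2 and 4 dB, and no frames above 4 dB."""
    return mean_sd_db < 1.0 and pct_2_to_4_db < 2.0 and pct_above_4_db == 0.0


# Own-speaker results reported in Section 3:
# (mean SD in dB, % of 2 dB outliers, % of 4 dB outliers).
reported = {
    "250 sentences": (1.3, 7.0, 0.1),
    "whole Arctic database": (1.09, 2.2, 0.0),
}
results = {name: is_transparent(*values) for name, values in reported.items()}
# results -> {"250 sentences": False, "whole Arctic database": False}

# Size of a 12-minute database, assuming 3-second sentences as in the text.
sentences_in_12_min = 12 * 60 // 3  # 240, i.e. "almost 250"
```

Even the whole-database case misses two of the three criteria on average (mean SD of 1.09 dB and 2.2% of 2 dB outliers), which is why the text describes it as only almost transparent.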

[Figure 1. Spectral distortions of LSF databases gathered from the same speaker or from other speakers. Axes: average SD versus number of sentences. Curves: speaker means with own data; best and worst speakers with own data; speaker means with other's data; best results with other person's data.]

[Figure 2. The mean percentage of 2 dB outliers for all speakers. Axes: percentage versus number of sentences. Curves: own data; other's data.]

[Figure 3. The mean percentage of 4 dB outliers for all speakers. Axes: percentage versus number of sentences. Curves: own data; other's data.]

Even if there were enough sentences to fulfill the requirement of transparent or otherwise very high quality, there is no target signal available during the conversion and thus the selections must be based on the source speaker's sentence. This moves the realistically achievable quality even further away from the transparent level.

Moreover, we have only considered LSFs in this study. In reality, there is also a need for transforming the residual. Residual selection techniques have been proposed that are based on the LSF vector and its corresponding residual. In [6], residual selection was analyzed and it was found that selecting an optimal LSF sequence, similarly as in unit selection, can be preferable to direct selection without considering neighboring frames. Nevertheless, the residual selection was ultimately based on the converted LSF vector, and it is reasonable to assume that residual selection will be even more challenging than LSF selection.

5. Conclusions

In this paper, we analyzed whether it is possible to select LSF vectors from a small database with very high quality in the scope of voice conversion. The CMU Arctic database with 7 speakers was used to test whether a small set of sentences could act as an effective selection database in voice conversion. We found that small database sizes commonly used in voice conversion are not adequate for representing the LSF space of a speaker and the achievable quality is far from transparent quality even in ideal conditions.

References

1. A. Black, A. Hunt. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. of ICASSP.
2. Y. Stylianou, O. Cappe, E. Moulines. Continuous probabilistic transform for voice conversion. IEEE Trans. on Speech and Audio Processing, vol. 6(2), March.
3. M. Abe, S. Nakamura, K. Shikano, H. Kuwabara. Voice conversion through vector quantization. In Proc. of ICASSP.
4. O. Turk, L. M. Arslan. Robust processing techniques for voice conversion. Computer Speech and Language, vol. 4(20), October.
5. J. Yamagishi, K. Ogata, Y. Nakano, J. Isogai, T. Kobayashi. HSMM-based model adaptation algorithms for average voice-based speech synthesis. In Proc. of ICASSP, vol. I.
6. D. Sündermann, H. Höge, A. Bonafonte, H. Ney, A. Black. Residual prediction based on unit selection. In Proc. of ASRU.
7. D. Reynolds. Experimental evaluation of features for robust speaker identification. IEEE Trans. on Speech and Audio Processing, vol. 2, no. 4, October.
8. T. Kinnunen, E. Karpov, P. Fränti. Real-time speaker identification and verification. IEEE Trans. on Audio, Speech and Language Processing, vol. 14, no. 1, January.
9. W. Jia, W.-Y. Chan. An experimental assessment of personal speech coding. Speech Communication, vol. 30, no. 1, pp. 1-8.
10. C.-H. Lee, S.-K. Jung, H.-G. Kang. Applying a speaker-dependent speech compression technique to concatenative TTS synthesizers. IEEE Trans. on Audio, Speech and Language Processing, vol. 15, no. 2, February.
11. T. Dutoit, A. Holzapfel, M. Jottrand, A. Moinet, J. Pérez, Y. Stylianou. Towards a voice conversion system based on frame selection. In Proc. of ICASSP, vol. 4.
12. J. Kominek, A. Black. CMU Arctic databases for speech synthesis. Technical report, Carnegie Mellon University.
13. K. Paliwal, B. Atal. Efficient vector quantization of LPC parameters at 24 bits/frame. IEEE Trans. on Speech and Audio Processing, vol. 1, no. 1, pp. 3-14, January.
14. J. Nurminen, V. Popa, J. Tian, Y. Tang, I. Kiss. A parametric approach for voice conversion. In Proc. Workshop on Speech-To-Speech Translation, 2006.
