The NU-NAIST voice conversion system for the Voice Conversion Challenge 2016



INTERSPEECH 2016, September 8-12, 2016, San Francisco, USA

Kazuhiro Kobayashi 1, Shinnosuke Takamichi 1, Satoshi Nakamura 1, Tomoki Toda 2
1 Nara Institute of Science and Technology (NAIST), Japan
2 Information Technology Center, Nagoya University, Japan
1 {kazuhiro-k, shinnosuke-t, s-nakamura}@is.naist.jp, 2 tomoki@icts.nagoya-u.ac.jp

Abstract

This paper presents the NU-NAIST voice conversion (VC) system for the Voice Conversion Challenge 2016 (VCC 2016), developed by a joint team of Nagoya University and Nara Institute of Science and Technology. Statistical VC based on a Gaussian mixture model makes it possible to convert the speaker identity of a source speaker's voice into that of a target speaker by converting several speech parameters. However, various factors such as parameterization errors and over-smoothing effects usually degrade the speech quality of the converted voice. To address this issue, we have proposed a direct waveform modification technique based on spectral differential filtering and have successfully applied it to singing voice conversion, where excitation features do not necessarily need to be converted. In this paper, we propose a method to apply this technique to a standard voice conversion task, where excitation feature conversion is needed. The results of VCC 2016 demonstrate that the NU-NAIST VC system developed with the proposed method yields the best conversion accuracy for speaker identity (a correct rate of more than 70%) and quite a high naturalness score (a mean opinion score of more than 3). This paper presents detailed descriptions of the NU-NAIST VC system and additional results of its performance evaluation.

Index Terms: voice conversion challenge 2016, speaker identity, segmental feature, Gaussian mixture model, STRAIGHT analysis

1. Introduction

Varieties of voice characteristics, such as voice timbre and fundamental frequency (F0) patterns, produced by individual speakers are always restricted by their own physical constraints due to the speech production mechanism.
This constraint is helpful in making it possible to produce a speech signal that simultaneously conveys not only linguistic information but also non-linguistic information such as speaker identity. However, it also causes various barriers in speech communication; e.g., severe vocal disorders are easily caused even if the speech organs are only partially damaged, and we hesitate to talk about something private on a cell phone when we are surrounded by others. If individual speakers could freely produce various voice characteristics beyond their own physical constraints, it would break down these barriers and open up an entirely new style of speech communication.

Voice conversion (VC) is a potential technique for making it possible to produce speech sounds beyond our own physical constraints [1]. VC research originally started with speaker conversion, which transforms the voice identity of a source speaker into that of a target speaker while preserving the linguistic content [2]. The mainstream of VC is a statistical approach that develops a conversion function using a parallel data set consisting of utterances of the source and target speakers. As one of the most popular statistical VC methods, a regression method using a Gaussian mixture model (GMM) was proposed [3]. To improve the performance of the GMM-based VC method, various VC methods have been proposed by implementing more sophisticated techniques, such as Gaussian process regression [4, 5], deep neural networks [6, 7], non-negative matrix factorization [8, 9], and so on. We have also significantly improved the performance of the standard GMM-based VC method by incorporating a trajectory-based conversion algorithm that makes it possible to consider temporal correlation in conversion [10], by modeling additional features to alleviate the over-smoothing effect of the converted speech parameters, such as global variance (GV) [10] and modulation spectrum (MS) [11], and by implementing STRAIGHT [12] with mixed excitation [13].
Furthermore, a real-time conversion process has also been successfully implemented for state-of-the-art GMM-based VC [14].

However, the speech quality of the converted voices is still obviously degraded compared to that of natural voices. One of the biggest factors causing this quality degradation is the waveform generation process using a vocoder [15], and the degradation is still observed even when high-quality vocoder systems are used [12, 16, 17]. In singing VC (SVC), to avoid the quality degradation caused by the vocoding process [15], we have proposed an intra-gender SVC method with direct waveform modification based on the spectrum differential (DIFFSVC) [18] considering global variance (GV) [19], focusing on the fact that F0 transformation is not necessary in intra-gender SVC. The DIFFSVC framework avoids using the vocoder by directly filtering an input singing voice waveform with a time sequence of spectral parameter differentials estimated by a differential GMM (DIFFGMM), which is analytically derived from the conventional GMM used in the standard method. Moreover, to apply the DIFFSVC framework to cross-gender SVC as well, we have proposed an F0 transformation technique with direct residual signal modification [20] based on time-scaling with waveform similarity-based overlap-add [21] and resampling.

In this paper, we develop a new VC system for speaker conversion based on the direct waveform modification technique, which was submitted to the Voice Conversion Challenge 2016 (VCC 2016) [22] from our joint team of Nagoya University and Nara Institute of Science and Technology (NAIST) as the NU-NAIST VC system (called the "new NAIST VC system"). The following techniques are newly implemented for our GMM-based VC system: 1) voice conversion with direct waveform modification with spectral differential (DIFFVC), 2) speech parameter trajectory smoothing in the GMM training, 3) a post-filtering process based on MS for DIFFVC, and 4) excitation conversion (EC) using STRAIGHT as preprocessing of spectral conversion. The results of VCC 2016 have demonstrated that the NU-NAIST VC system (system "J") achieved the best conversion accuracy on speaker identity and high naturalness (more than 3 on the mean opinion score scale). In this paper, we also conduct subjective evaluations, demonstrating that the NU-NAIST VC system achieves high speech quality and conversion accuracy comparable to our conventional GMM-based VC system.

Copyright 2016 ISCA

2. VC based on GMM

In conventional VC, acoustic features of a source speaker, such as spectral features and aperiodic components, are converted into those of a target speaker based on previously trained GMMs. F0 is transformed by frame-by-frame linear conversion to compensate for the difference in pitch between the source and target speakers. Finally, the converted voice is generated by synthesizing these converted acoustic features using a vocoder.

2.1. Acoustic feature mapping based on GMM

Acoustic feature mapping based on GMM consists of a training process and a conversion process. In the training process, a joint probability density function of the acoustic features of the source and target speakers' voices is modeled with a GMM using a parallel data set. As the acoustic features of the source and target speakers, we employ 2D-dimensional joint static and dynamic feature vectors $X_t = [x_t^\top, \Delta x_t^\top]^\top$ of the source and $Y_t = [y_t^\top, \Delta y_t^\top]^\top$ of the target, consisting of D-dimensional static feature vectors $x_t$ and $y_t$ and their dynamic feature vectors $\Delta x_t$ and $\Delta y_t$ at frame $t$, respectively, where $\top$ denotes transposition of the vector. Their joint probability density modeled by the GMM is given by

$$
P(X_t, Y_t \mid \lambda) = \sum_{m=1}^{M} \alpha_m \, \mathcal{N}\!\left( \begin{bmatrix} X_t \\ Y_t \end{bmatrix}; \begin{bmatrix} \mu_m^{(X)} \\ \mu_m^{(Y)} \end{bmatrix}, \begin{bmatrix} \Sigma_m^{(XX)} & \Sigma_m^{(XY)} \\ \Sigma_m^{(YX)} & \Sigma_m^{(YY)} \end{bmatrix} \right), \tag{1}
$$

where $\mathcal{N}(\cdot\,; \mu, \Sigma)$ denotes the normal distribution with a mean vector $\mu$ and a covariance matrix $\Sigma$. The mixture component index is $m$. The total number of mixture components is $M$.
$\lambda$ is a GMM parameter set consisting of the mixture-component weight $\alpha_m$, the mean vector $\mu_m$, and the covariance matrix $\Sigma_m$ of the $m$-th mixture component. The GMM is trained using joint vectors of $X_t$ and $Y_t$ in the parallel data set, which are automatically aligned to each other by dynamic time warping. In the conversion process, the acoustic features of the source speaker are converted into those of the target speaker using maximum likelihood estimation (MLE) of speech parameter trajectories using the GMM and GV [10].

2.2. F0 transformation

In both intra- and cross-gender conversions, F0 is transformed frame by frame in order to compensate for pitch differences between the source and target speakers:

$$
\hat{y}_t = \frac{\sigma^{(y)}}{\sigma^{(x)}} \left( x_t - \mu^{(x)} \right) + \mu^{(y)}, \tag{2}
$$

where $x_t$ and $\hat{y}_t$ are the log-scaled F0 of the source speaker and the converted one at frame $t$. $\mu^{(x)}$ and $\sigma^{(x)}$ are the mean and standard deviation of the log-scaled F0 of the source speaker, and $\mu^{(y)}$ and $\sigma^{(y)}$ are those of the target speaker.

3. The NU-NAIST VC system for VCC 2016

In this paper, we propose the following techniques: 1) DIFFVC, 2) GMM training with smoothed speech parameter trajectories, 3) a post-filtering process based on modulation spectrum (MS) for DIFFVC, and 4) excitation conversion with F0 and aperiodic component transformations using a vocoder. Figure 1 shows the conversion flow of the NU-NAIST VC system for VCC 2016. The NU-NAIST VC system performs excitation conversion and spectral conversion. During excitation conversion, F0 values and aperiodic components extracted from a source voice are transformed within an analysis/synthesis framework using a vocoder. During spectral conversion, spectral features of the source voice are converted into spectral feature differentials based on the DIFFGMM. Next, MS-based post-filtering is applied to the spectral feature differentials.
Finally, the converted speech waveform is generated by directly filtering the analysis-synthesized speech waveform generated during the excitation conversion step using the post-filtered spectral feature differentials.

3.1. DIFFVC based on DIFFGMM

As part of the modeling step, the DIFFGMM is analytically derived from the traditional GMM in Eq. (1). Let $D_t = [d_t^\top, \Delta d_t^\top]^\top$ denote the static and dynamic differential feature vector, where $d_t = y_t - x_t$. The DIFFGMM is derived by transforming the model parameters in the same manner as DIFFSVC [18] as follows:

$$
P(X_t, D_t \mid \lambda) = \sum_{m=1}^{M} \alpha_m \, \mathcal{N}\!\left( \begin{bmatrix} X_t \\ D_t \end{bmatrix}; \begin{bmatrix} \mu_m^{(X)} \\ \mu_m^{(D)} \end{bmatrix}, \begin{bmatrix} \Sigma_m^{(XX)} & \Sigma_m^{(XD)} \\ \Sigma_m^{(DX)} & \Sigma_m^{(DD)} \end{bmatrix} \right). \tag{3}
$$

During the conversion step, a time sequence of the D-dimensional converted spectral feature differentials, $\hat{d}$, is determined using MLE of the speech parameter trajectory using the DIFFGMM [18]. Then, the converted speech waveform is generated by directly filtering an input speech waveform with a time-variant synthesis filter designed from the spectral feature differential sequence. This filtering process modifies the spectral envelope sequence while basically preserving the natural excitation signals of the input speech waveform.

3.2. Speech parameter trajectory smoothing

The modulation spectrum (MS) [11] is defined as the log-scaled power spectrum of a parameter sequence; i.e., the temporal fluctuation of the parameter sequence is decomposed into individual modulation frequency components, and their power values are represented as the MS. The MS, $s(y)$, of the parameter sequence $y$ is defined as

$$
s(y) = \left[ s_1(y)^\top, \ldots, s_d(y)^\top, \ldots, s_D(y)^\top \right]^\top, \tag{4}
$$

$$
s_d(y) = \left[ s_{d,0}(y), \ldots, s_{d,f}(y), \ldots, s_{d,D_s-1}(y) \right]^\top, \tag{5}
$$

where $2D_s$ is the length of the discrete Fourier transform, and $s_{d,f}(y)$ is the $f$-th MS of the $d$-th dimension of the parameter sequence $[y_1(d), \ldots, y_T(d)]^\top$; $f$ is the modulation frequency index. As reported in [23, 24], the higher modulation frequency components (more fluctuating components of a temporal sequence) of spectral parameter sequences are negligible for speech quality.
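The MS definition in Eqs. (4) and (5) can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the function name `modulation_spectrum`, the zero-padding of each trajectory to the DFT length 2·D_s, and the small floor added before the logarithm are choices made here, not details given in the paper.

```python
import numpy as np


def modulation_spectrum(y, d_s, floor=1e-12):
    """Log-scaled power spectrum of each parameter trajectory.

    y     : array of shape (T, D), a parameter sequence (T frames, D dims)
    d_s   : D_s, so that the DFT length is 2 * D_s
    floor : small constant avoiding log(0) (an assumption of this sketch)
    Returns an array of shape (D, D_s) whose entry [d, f] is s_{d,f}(y).
    """
    y = np.asarray(y, dtype=float)
    n_fft = 2 * d_s
    if y.ndim != 2 or y.shape[0] > n_fft:
        raise ValueError("expected (T, D) input with T <= 2 * d_s")
    spectrum = np.fft.fft(y, n=n_fft, axis=0)   # zero-pads each trajectory
    power = np.abs(spectrum[:d_s]) ** 2         # keep f = 0, ..., D_s - 1
    return np.log(power + floor).T


# Toy trajectory: with a 5 ms frame shift the frame rate is 200 Hz, so
# 200 frames and d_s = 100 give a modulation-frequency resolution of 1 Hz.
frames = np.arange(200)
toy = np.cos(2 * np.pi * 10 * frames / 200)[:, None]   # 10 Hz fluctuation
ms = modulation_spectrum(toy, d_s=100)
peak_bin = int(np.argmax(ms[0]))   # index of the strongest component
```

With these toy settings, modulation frequency index f corresponds to f Hz, so f = D_s/2 corresponds to 50 Hz.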
By applying a low-pass filter (LPF) that removes the higher modulation frequency components (e.g., more than 50 Hz ($f > D_s/2$)), we can improve the training accuracy of acoustic models, as done for hidden Markov model-based speech synthesis [25]. Here, the source and target speakers' speech parameter sequences, $x$ and $y$, are low-pass filtered, and then the filtered sequences, $x^{(\mathrm{LPF})}$ and $y^{(\mathrm{LPF})}$, are used to train the GMM. In conversion, $x^{(\mathrm{LPF})}$ is used to generate the spectral differentials.

Figure 1: Conversion process of the NU-NAIST VC system for VCC 2016. (Block diagram; recoverable block labels: source voice, STRAIGHT analysis, F0, aperiodicity, band aperiodicity, linear transformation, GMM for aperiodic components, aperiodicity modification, converted band aperiodicity, transformed F0, excitation generation, synthesis filter, F0-transformed source voice, spectrum envelope, mel-cepstrum, DIFFGMM for mel-cepstrum, converted mel-cepstrum differential, MSPF, synthesis filter, converted voice (DIFFVC (EC)).)

3.3. MS-based post-filter for VC with spectral differentials

Statistical modeling tends to deteriorate the MSs of the converted speech parameters, and keeping natural MSs is strongly effective for improving the quality of the converted speech. An MS-based post-filter (MSPF) [11], which is applied after speech parameter conversion in conventional GMM-based VC, modifies a converted speech parameter sequence so that the sequence has the target speaker's natural MS. Here, we propose an MS-based post-filtering process that modifies the spectral differentials, $\hat{d}$, such that the finally synthesized speech has the target speaker's natural MS. In training, we calculate MS statistics of the target speaker's natural and converted speech parameters from the training data, $y$ and $\tilde{y} = \hat{d} + x^{(\mathrm{LPF})}$. Here, let $\mu_{d,f}^{(y)}$ and $\mu_{d,f}^{(\tilde{y})}$ be the means of $s_{d,f}(y)$ and $s_{d,f}(\tilde{y})$, and let $\sigma_{d,f}^{(y)}$ and $\sigma_{d,f}^{(\tilde{y})}$ be their variances. The $\hat{d}$ is generated by converting $x^{(\mathrm{LPF})}$. In conversion, $x^{(\mathrm{LPF})}$ is first added to the generated $\hat{d}$. Then, the MS $s_{d,f}(\tilde{y})$ is converted as follows:

$$
s'_{d,f}(\tilde{y}) = \frac{\sigma_{d,f}^{(y)}}{\sigma_{d,f}^{(\tilde{y})}} \left( s_{d,f}(\tilde{y}) - \mu_{d,f}^{(\tilde{y})} \right) + \mu_{d,f}^{(y)}. \tag{6}
$$

The converted $\tilde{y}$ is determined using the converted MS and the original phase components. The post-filtered spectral differentials, $\hat{d}^{(\mathrm{MSPF})}$, can be determined by subtracting $x^{(\mathrm{LPF})}$ from the converted $\tilde{y}$ (see footnote 1).
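The post-filtering steps around Eq. (6) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it processes a whole sequence at once rather than segment by segment, skips mean normalization of the MS, keeps the Nyquist bin returned by `numpy.fft.rfft`, and truncates the inverse transform back to the original number of frames. The function name, argument names, and the floor `EPS` are hypothetical, and the `sigma_*` arguments simply play the role of the σ statistics in Eq. (6).

```python
import numpy as np

EPS = 1e-12  # floor before the logarithm (an assumption of this sketch)


def mspf_on_differential(d_hat, x_lpf, mu_nat, sigma_nat,
                         mu_cnv, sigma_cnv, n_fft):
    """Post-filter spectral differentials following Eq. (6).

    d_hat, x_lpf  : (T, D) converted differentials and smoothed source
    mu_*, sigma_* : (n_fft // 2 + 1, D) MS statistics of the natural
                    target (nat) and converted (cnv) parameters
    Returns the post-filtered differentials, shape (T, D).
    """
    x = np.asarray(x_lpf, dtype=float)
    y_tilde = np.asarray(d_hat, dtype=float) + x     # add x^(LPF) first
    n_frames = y_tilde.shape[0]
    if n_frames > n_fft:
        raise ValueError("sequence is longer than the DFT length")
    spec = np.fft.rfft(y_tilde, n=n_fft, axis=0)
    log_power = np.log(np.abs(spec) ** 2 + EPS)      # MS of y_tilde
    phase = np.angle(spec)                           # phase is kept as is
    # Eq. (6): match the MS statistics of the natural target parameters.
    log_power_new = sigma_nat / sigma_cnv * (log_power - mu_cnv) + mu_nat
    amplitude = np.exp(0.5 * log_power_new)
    y_new = np.fft.irfft(amplitude * np.exp(1j * phase), n=n_fft, axis=0)
    return y_new[:n_frames] - x                      # subtract x^(LPF)


rng = np.random.default_rng(0)
T, D, N_FFT = 50, 3, 64
d_hat = rng.normal(size=(T, D))
x_lpf = rng.normal(size=(T, D))
mu = np.zeros((N_FFT // 2 + 1, D))
sigma = np.ones_like(mu)

# Identical statistics leave the differentials (almost) unchanged.
same = mspf_on_differential(d_hat, x_lpf, mu, sigma, mu, sigma, N_FFT)
# A target mean that is log(4) higher doubles the amplitude of y_tilde.
boosted = mspf_on_differential(d_hat, x_lpf, mu + np.log(4.0), sigma,
                               mu, sigma, N_FFT)
```

The second call shows why the post-filtered differential is not a scaled copy of the input differential: the scaling acts on the sum of the differential and the smoothed source features, and the source features are subtracted afterwards.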
Note that, in this paper, we use mean-normalized MSs and adopt a segment-level post-filtering process [11].

Footnote 1: Note that, because the MSPF process is non-linear with respect to the speech parameter sequence, the sequence obtained by subtracting $x^{(\mathrm{LPF})}$ from the converted $\tilde{y}$ is not equal to $\hat{d}$.

3.4. Excitation conversion based on F0 and aperiodicity transformations using a vocoder

Although we initially tried implementing the F0 transformation technique with direct residual signal modification [20] developed for singer conversion, we found that this technique was not effective for speaker conversion. In speaker conversion, we need to handle larger acoustic differences in excitation signals between the source and target speakers compared to singing voice conversion. To address this issue, we implemented excitation conversion using STRAIGHT [26] as a high-quality vocoder. For the F0 transformation, we perform the global linear transformation described in Sect. 2.2. For the aperiodic components, band-averaged aperiodic components are extracted and converted with the GMM as in the conventional method [13]. Then, the original aperiodic components at all frequency bins are shifted using the aperiodic differentials between the extracted and converted band-averaged aperiodic components. Finally, analysis-synthesized speech is generated from these transformed excitation parameters using STRAIGHT. Note that the full STRAIGHT spectral representation is directly used in synthesis. This excitation conversion method actually causes significant quality degradation because the original phase information is discarded. Nevertheless, we have found that this method yields better speech quality as well as better conversion accuracy than the direct residual signal modification [20].

4. Experimental evaluation

In this section, we show results of VCC 2016 to demonstrate the performance of the NU-NAIST VC system.
Moreover, we compare the following three systems:

DIFFVC (EC): The NU-NAIST VC system submitted to VCC 2016,
VC: Our conventional VC system [13],
DIFFVC: The NU-NAIST VC system without excitation conversion.

4.1. Experimental conditions

We evaluated the speech quality and speaker identity of the converted voices to compare the performance of the different VC systems in both intra-gender and cross-gender conversion tasks. We used the English speech database used in VCC 2016. The number of source speakers was 5, including 3 females and 2 males, and that of the target speakers was 5, including 2 females and 3 males, who were different from the source female and male speakers. The number of sentences uttered by each speaker was 216. The sampling frequency was set to 16 kHz. STRAIGHT [12] was used to extract spectral envelopes, which were parameterized into the 1st-24th mel-cepstral coefficients as the spectral feature. The frame shift was 5 ms. The mel log spectrum approximation (MLSA) filter [27] was used as the synthesis filter. As the source excitation features, we used F0 and aperiodic components extracted with STRAIGHT [26]. The aperiodic components were averaged over five frequency bands, i.e., 0-1, 1-2, 2-4, 4-6, and 6-8 kHz, to be modeled with the GMM. We used 162 sentences for training, and the remaining 54 sentences were used for evaluation. The speaker-dependent GMMs were separately trained for all combinations of source and target speaker pairs. The number of mixture components was 128 for the mel-cepstral coefficients and 64 for the aperiodic components.

Two preference tests were conducted. In the first test, the speech quality of the converted voices was evaluated. The converted voice samples generated by two different VC systems for the same sentences were presented to subjects in random order. The subjects selected which sample had better speech quality.
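Preference tests of this kind are usually summarized as a preference score with a 95% confidence interval, as in the figures of this evaluation. The paper does not state how its intervals were computed; the sketch below assumes the common normal approximation to a binomial proportion, and the function name and the counts in the example are made up for illustration.

```python
import math


def preference_score(wins, trials, z=1.96):
    """Preference score and confidence half-width, both in percent.

    wins   : number of pairs in which the system was preferred
    trials : total number of evaluated pairs
    z      : normal quantile (1.96 for a 95% interval); the normal
             approximation itself is an assumption of this sketch
    """
    if trials <= 0 or not 0 <= wins <= trials:
        raise ValueError("need 0 <= wins <= trials and trials > 0")
    p = wins / trials
    half_width = z * math.sqrt(p * (1.0 - p) / trials)
    return 100.0 * p, 100.0 * half_width


# Hypothetical counts: the system is preferred in 60 of 100 pairs.
score, half_width = preference_score(60, 100)
```

For these made-up counts the score is 60% with a half-width of roughly 9.6 points, so the interval would still include 50%, i.e. no clear preference.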

Figure 2: Sound quality and conversion accuracy on speaker identity in VCC 2016. (Scatter plot of the preference score for accuracy on speaker identity [%] against the MOS for speech quality; plotted items: the NU-NAIST VC system, target, source, and the submitted systems.)

Figure 3: AB preference test for speech quality: (a) intra-gender, (b) cross-gender. (Preference score [%] with 95% confidence intervals for DIFFVC (EC) w/ MSPF, VC w/ GV, and DIFFVC w/ MSPF.)

Figure 4: XAB test for conversion accuracy on speaker identity: (a) intra-gender, (b) cross-gender. (Preference score [%] with 95% confidence intervals for DIFFVC (EC) w/ MSPF, VC w/ GV, and DIFFVC w/ MSPF.)

In the second test, conversion accuracy on speaker identity was evaluated. A natural voice sample of the target speaker was presented to the subjects first as a reference. Then, the converted voice samples generated by two different VC systems for the same sentences were presented in random order. The subjects selected which sample was more similar to the reference natural voice in terms of speaker identity. The number of subjects was 1 and each listener evaluated 54 sample pairs in each evaluation. They were allowed to replay each sample pair as many times as necessary.

4.2. Results of the VCC 2016

Figure 2 shows the overall result of VCC 2016. The NU-NAIST VC system achieved quite high speech quality (over 3.0 in MOS) and the best conversion accuracy (about 74%) among all submitted VC systems. In terms of conversion accuracy, our system achieved successful performance even though only very simple prosodic conversion was performed. However, there is still a large gap between the converted voices and the natural target voices. It is expected that further improvements will be yielded by implementing a conversion method for prosodic patterns or by asking the source speakers to mimic the target prosodic patterns, which would be possible in several practical applications. In terms of speech quality, the NU-NAIST VC system causes serious quality degradation compared to natural voices, i.e., from 4.6 to 3.0 in MOS.
This quality degradation is mainly caused by using a vocoder to perform the excitation conversion, as shown in the next section. Therefore, it is expected that the converted speech quality will be significantly improved by developing a better analysis/synthesis technique than STRAIGHT.

4.3. Results of subjective evaluation

Figures 3 (a) and (b) show the results of the preference test for speech quality. DIFFVC (EC) achieves speech quality equivalent to VC in both intra-gender and cross-gender conversions. On the other hand, DIFFVC achieves significantly higher speech quality than the other two methods in intra-gender conversion. This is because DIFFVC avoids using vocoding to generate converted speech waveforms, making the conversion process free from various errors, such as F0 extraction errors and unvoiced/voiced decision errors. Note that DIFFVC in the cross-gender condition does not yield any significant quality improvement, as it suffers from mismatches between the spectral envelope and F0 in cross-gender conversion.

Figures 4 (a) and (b) show the results of the preference test for speaker identity. Although DIFFVC (EC) has conversion accuracy equivalent to VC in intra-gender conversion, it tends to be degraded in cross-gender conversion. It is expected that the residual spectral envelope preserved in the direct waveform modification process still includes speaker-dependent or gender-dependent features, and that this adversely affects conversion accuracy.

These results suggest that 1) the NU-NAIST VC system, which demonstrated the best conversion accuracy and high speech quality in VCC 2016, has almost equivalent performance to the conventional VC system in both intra-gender and cross-gender conversions, and 2) the direct waveform modification technique achieves significantly higher converted speech quality than the conventional VC system if excitation conversion is not necessary, as in intra-gender conversion, and therefore there is still large room to improve the converted speech quality of the NU-NAIST VC system.

5. Conclusions

This paper describes the details of the NU-NAIST voice conversion (VC) system for the Voice Conversion Challenge 2016 (VCC 2016), developed by a joint team of Nagoya University and Nara Institute of Science and Technology. In order to improve the quality of statistical VC based on the Gaussian mixture model (GMM), we applied the following techniques: 1) voice conversion with direct waveform modification with spectral differential (DIFFVC), 2) speech parameter trajectory smoothing, 3) post-filtering based on modulation spectrum for DIFFVC, and 4) preprocessing for excitation conversion with F0 and aperiodic component transformations using high-quality vocoding. The experimental results demonstrated that the NU-NAIST VC system was highly ranked in VCC 2016, that its performance was comparable to our conventional VC system, and that the DIFFVC technique showed large potential to significantly improve the converted speech quality of the NU-NAIST VC system. In future work, we plan to implement high-quality F0 and aperiodicity transformation for the DIFFVC technique.

Acknowledgements

This work was supported in part by JSPS KAKENHI Grant Number 266 and Grant-in-Aid for JSPS Research Fellow Number 16J

6. References

[1] T. Toda, "Augmented speech production based on real-time statistical voice conversion," Proc. GlobalSIP, Dec.
[2] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," J. Acoust. Soc. Jpn. (E), vol. 11, no. 2.
[3] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. SAP, vol. 6, no. 2, Mar.
[4] N. Pilkington, H. Zen, and M. Gales, "Gaussian process experts for voice conversion," Proc. INTERSPEECH, Aug.
[5] N. Xu, Y. Tang, J. Bao, A. Jiang, X. Liu, and Z. Yang, "Voice conversion based on Gaussian processes by coherent and asymmetric training with limited training data," Speech Communication, vol. 58, Mar.
[6] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, "Voice conversion using deep neural networks with layer-wise generative training," IEEE/ACM Trans. ASLP, vol. 22, no. 12, Dec.
[7] L. Sun, S. Kang, K. Li, and H. Meng, "Voice conversion using deep bidirectional long short-term memory based recurrent neural networks," Proc. ICASSP, Apr.
[8] R. Takashima, T. Takiguchi, and Y. Ariki, "Exemplar-based voice conversion using sparse representation in noisy environments," IEICE Trans. on Inf. and Syst., vol. E96-A, no. 1, Oct.
[9] Z. Wu, T. Virtanen, E. Chng, and H. Li, "Exemplar-based sparse representation with residual compensation for voice conversion," IEEE/ACM Trans. ASLP, vol. 22, no. 1, June 2014.
[10] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum likelihood estimation of spectral parameter trajectory," IEEE Trans. ASLP, vol. 15, no. 8, Nov.
[11] S. Takamichi, T. Toda, A. W. Black, G. Neubig, S. Sakti, and S. Nakamura, "Postfilters to modify the modulation spectrum for statistical parametric speech synthesis," IEEE Trans. ASLP, vol. 24, no. 4, Jan.
[12] H. Kawahara, I. Masuda-Katsuse, and A. Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3-4, Apr.
[13] Y. Ohtani, T. Toda, H. Saruwatari, and K. Shikano, "Maximum likelihood voice conversion based on GMM with STRAIGHT mixed excitation," Proc. INTERSPEECH, Sept.
[14] T. Toda, T. Muramatsu, and H. Banno, "Implementation of computationally efficient real-time voice conversion," Proc. INTERSPEECH, Sept.
[15] H. Dudley, "Remaking speech," JASA, vol. 11, no. 2.
[16] Y. Stylianou, "Applying the harmonic plus noise model in concatenative speech synthesis," IEEE Trans. SAP, vol. 9, no. 1, 2001.
[17] D. Erro, I. Sainz, E. Navas, and I. Hernaez, "Harmonics plus noise model based vocoder for statistical parametric speech synthesis," IEEE J-STSP, vol. 8, no. 2.
[18] K. Kobayashi, T. Toda, G. Neubig, S. Sakti, and S. Nakamura, "Statistical singing voice conversion with direct waveform modification based on the spectrum differential," Proc. INTERSPEECH, Sept.
[19] K. Kobayashi, T. Toda, G. Neubig, S. Sakti, and S. Nakamura, "Statistical singing voice conversion based on direct waveform modification with global variance," Proc. INTERSPEECH, Sept. 2015.
[20] K. Kobayashi, T. Toda, and S. Nakamura, "Implementation of F0 transformation for statistical singing voice conversion based on direct waveform modification," Proc. ICASSP, Mar.
[21] W. Verhelst and M. Roelands, "An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech," Proc. ICASSP, vol. 2, Apr.
[22] T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, and J. Yamagishi, "The Voice Conversion Challenge 2016," Proc. INTERSPEECH, Sept.
[23] S. Takamichi, T. Toda, A. W. Black, and S. Nakamura, "Parameter generation algorithm considering modulation spectrum for HMM-based speech synthesis," Proc. ICASSP, Apr.
[24] S. Takamichi, T. Toda, A. W. Black, and S. Nakamura, "Modulation spectrum-constrained trajectory training algorithm for GMM-based voice conversion," Proc. ICASSP, Apr.
[25] S. Takamichi, K. Kobayashi, K. Tanaka, T. Toda, and S. Nakamura, "The NAIST text-to-speech system for the Blizzard Challenge 2015," Proc. Blizzard Challenge workshop, Sept.
[26] H. Kawahara, J. Estill, and O. Fujimura, "Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT," Proc. MAVEBA, Sept.
[27] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai, "Mel-generalized cepstral analysis - a unified approach to speech spectral estimation," Proc. ICSLP.
