Statistical Singing Voice Conversion with Direct Waveform Modification based on the Spectrum Differential


INTERSPEECH 2014, September 14-18, 2014, Singapore

Kazuhiro Kobayashi, Tomoki Toda, Graham Neubig, Sakriani Sakti, Satoshi Nakamura
Graduate School of Information Science, Nara Institute of Science and Technology (NAIST), Japan
{kazuhiro-k, tomoki, neubig, ssakti, s-nakamura}@is.naist.jp

Abstract

This paper presents a novel statistical singing voice conversion (SVC) technique with direct waveform modification based on the spectrum differential that can convert the voice timbre of a source singer into that of a target singer without using a vocoder to generate the converted singing voice waveform. SVC makes it possible to convert the singing voice characteristics of an arbitrary source singer into those of an arbitrary target singer. However, the speech quality of the converted singing voice is significantly degraded compared to that of a natural singing voice due to various factors, such as analysis and modeling errors in the vocoder-based framework. To alleviate this degradation, we propose a statistical conversion process that directly modifies the signal in the waveform domain by estimating the difference between the spectra of the source and target singers' singing voices. The differential spectral feature is directly estimated using a differential Gaussian mixture model (GMM) that is analytically derived from the traditional GMM used as a conversion model in conventional SVC. The experimental results demonstrate that the proposed method significantly improves the speech quality of the converted singing voice while preserving the conversion accuracy of singer identity compared to conventional SVC.

Index Terms: singing voice, statistical voice conversion, vocoder, Gaussian mixture model, differential spectral compensation

1. Introduction

The singing voice is one of the most expressive components in music. In addition to pitch, dynamics, and rhythm, the linguistic information of the lyrics allows singers to express a wider variety of expression than other musical instruments. Although singers can also expressively control their voice timbre to some degree, they usually have difficulty changing it widely (e.g., changing their own voice timbre into that of another singer) owing to physical constraints in speech production. If singers could freely control their voice timbre beyond these physical constraints, it would open up entirely new forms of expression for them.

Singing synthesis [1, 2, 3] has attracted growing interest in computer-based music technology. By entering notes and lyrics into a singing synthesis engine, users (e.g., composers and singers) can easily produce a synthesized singing voice that has a specific singer's voice characteristics, different from those of the users. To flexibly control the synthesized singing voice as the users want, techniques have also been proposed that automatically adjust the parameters of the singing voice synthesis engine so that the variation of power and pitch in the synthesized singing voice is similar to that of a given user's natural singing voice [4, 5]. Although these technologies using singing voice synthesis engines are very effective for producing the singing voices desired by the users, it is essentially difficult to directly convert singers' singing voices in real time. Several singing voice conversion methods have been proposed to make it possible for a singer to sing a song with the desired voice timbre beyond their own physical constraints.
One of the typical methods is singing voice morphing between singing voices of different singers or different singing styles [6] using the speech analysis/synthesis framework [7], which can only be applied to singing voice samples of the same song. To convert a singer's voice timbre in any song, statistical voice conversion (VC) techniques [8, 9] have been successfully applied to singing voice conversion. This singing VC (SVC) method makes it possible to convert a source singer's singing voice into another target singer's singing voice [10, 11]. A conversion model is trained in advance using acoustic features extracted from a parallel data set of song pairs sung by the source and target singers. The trained conversion model makes it possible to convert the acoustic features of the source singer's singing voice into those of the target singer's singing voice in any song while keeping the linguistic information of the lyrics unchanged. Recently, eigenvoice conversion (EVC) techniques [12, 13] have also been successfully applied to SVC [14] to develop a more flexible SVC system capable of achieving conversion between arbitrary source and target singers even if a parallel data set is not available.

Although SVC has great potential to bring new singing styles to singers, several problems remain to be solved. One of the biggest problems is that the speech quality of the converted singing voice is significantly degraded compared to that of the natural singing voice. Conventional SVC uses a vocoder to generate a waveform of the converted singing voice from the converted acoustic features. Consequently, the speech quality of the converted singing voice suffers from various errors, such as F0 extraction errors, modeling errors in spectral parameterization, and oversmoothing effects often observed in the converted acoustic features. It is essential to address these issues to allow for practical use of SVC.

In this paper, we propose an SVC method that performs SVC without the vocoder-based waveform generation process. In conventional SVC, the spectral envelope, F0, and aperiodic components are extracted from the source singer's singing voice and converted to those of the target singer's singing voice. However, in intra-gender SVC it is not always necessary to convert the F0 values of the source singer to those of the target, because both singers often sing on key. Moreover, the conversion of the aperiodic components usually has only a small impact on the converted singing voice. Therefore, it is expected that spectral conversion alone is sufficient to achieve acceptable quality in intra-gender SVC.

Based on this idea, the proposed SVC method focuses only on converting the spectral envelope. The waveform of the source singer is directly modified with a digital filter that applies the time-varying difference in the spectral envelope between the source and target singers' singing voices. This spectrum differential is statistically estimated from the spectral envelope of the source singer. Results of a subjective experimental evaluation show that the proposed SVC method significantly improves the speech quality of the converted singing voice compared to the conventional SVC method.

2. Statistical singing voice conversion (SVC)

SVC consists of a training process and a conversion process. In the training process, a joint probability density function of acoustic features of the source and target singers' singing voices is modeled with a Gaussian mixture model (GMM) using a parallel data set, in the same manner as in statistical VC for normal voices [11]. As the acoustic features of the source and target singers, we employ 2D-dimensional joint static and dynamic feature vectors $X_t = [x_t^\top, \Delta x_t^\top]^\top$ of the source and $Y_t = [y_t^\top, \Delta y_t^\top]^\top$ of the target, consisting of the D-dimensional static feature vectors $x_t$ and $y_t$ and their dynamic feature vectors $\Delta x_t$ and $\Delta y_t$ at frame $t$, where $\top$ denotes transposition. Their joint probability density modeled by the GMM is given by

$$
P(X_t, Y_t \mid \lambda) = \sum_{m=1}^{M} \alpha_m \,
\mathcal{N}\!\left(
\begin{bmatrix} X_t \\ Y_t \end{bmatrix};
\begin{bmatrix} \mu_m^{(X)} \\ \mu_m^{(Y)} \end{bmatrix},
\begin{bmatrix} \Sigma_m^{(XX)} & \Sigma_m^{(XY)} \\ \Sigma_m^{(YX)} & \Sigma_m^{(YY)} \end{bmatrix}
\right), \qquad (1)
$$

where $\mathcal{N}(\cdot\,; \mu, \Sigma)$ denotes the normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$, $m$ is the mixture component index, and $M$ is the total number of mixture components. $\lambda$ is the GMM parameter set consisting of the mixture-component weight $\alpha_m$, the mean vector $\mu_m$, and the covariance matrix $\Sigma_m$ of the $m$-th mixture component. The GMM is trained on joint vectors of $X_t$ and $Y_t$ from the parallel data set, which are automatically aligned to each other by dynamic time warping.

In the conversion process, the source singer's singing voice is converted into the target singer's singing voice using maximum likelihood estimation of the speech parameter trajectory with the GMM [9]. Time sequence vectors of the source and target features are denoted as $X = [X_1^\top, \ldots, X_T^\top]^\top$ and $Y = [Y_1^\top, \ldots, Y_T^\top]^\top$, where $T$ is the number of frames in the time sequence of the given source feature vectors. A time sequence vector of the converted static features $\hat{y} = [\hat{y}_1^\top, \ldots, \hat{y}_T^\top]^\top$ is determined as follows:

$$
\hat{y} = \mathop{\mathrm{argmax}}_{y} \; P(Y \mid X, \lambda)
\quad \text{subject to} \quad Y = W y, \qquad (2)
$$

where $W$ is a transformation matrix that expands the static feature vector sequence into the joint static and dynamic feature vector sequence [15]. The conditional probability density function $P(Y \mid X, \lambda)$ is analytically derived from the GMM of the joint probability density given by Eq. (1). To alleviate the oversmoothing effects that usually make the converted singing voice sound muffled, the global variance (GV) [9] is also considered to compensate the variation of the converted feature vector sequence.

Figure 1: Conversion processes of the conventional SVC (Section 2) and the proposed SVC (Section 3). In the conventional process, F0, mel-cepstrum, and aperiodic components are extracted from the input singing voice, converted with their respective GMMs, and fed to a synthesis filter to generate the converted singing voice. In the proposed process, only the mel-cepstrum is analyzed, its differential is estimated with a differential GMM, and the input singing voice itself is passed through the synthesis filter.

3. SVC based on differential spectral compensation

Figure 1 shows the conversion processes of the conventional and proposed SVC methods.
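Both pipelines in Figure 1 operate on joint static and dynamic feature vectors, and the conventional training step of Section 2 fits a full-covariance GMM to DTW-aligned source-target frames. The following is a minimal, non-authoritative Python sketch of that step: the delta window (-0.5, 0, 0.5), the use of scikit-learn's GaussianMixture, and all function and variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def append_delta(static, win=(-0.5, 0.0, 0.5)):
    """Append dynamic (delta) features to a (T, D) static feature sequence.

    The window (-0.5, 0, 0.5) is a common choice; the paper does not specify one.
    """
    T = static.shape[0]
    padded = np.vstack([static[:1], static, static[-1:]])        # repeat edge frames
    delta = sum(w * padded[i:i + T] for i, w in enumerate(win))  # weighted sum over the window
    return np.hstack([static, delta])                            # (T, 2D): [x_t, dx_t]

def train_joint_gmm(src_mcep, tgt_mcep, n_mix=128, seed=0):
    """Fit the joint GMM of Eq. (1) on DTW-aligned source/target mel-cepstra.

    src_mcep, tgt_mcep: (T, D) arrays of static mel-cepstra, frame-aligned in advance.
    """
    X = append_delta(src_mcep)     # X_t = [x_t, dx_t]
    Y = append_delta(tgt_mcep)     # Y_t = [y_t, dy_t]
    joint = np.hstack([X, Y])      # joint vectors [X_t, Y_t] of dimension 4D
    gmm = GaussianMixture(n_components=n_mix, covariance_type="full",
                          max_iter=100, random_state=seed)
    gmm.fit(joint)
    return gmm
```

After fitting, the per-component parameters (gmm.weights_, gmm.means_, gmm.covariances_) can be partitioned into the X and Y blocks appearing in Eq. (1), which is what the DIFFGMM derivation below operates on.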
In the proposed method, the difference between the spectral features of the source and target singers is estimated from the source singer's spectral features using a differential GMM (DIFFGMM), which models the joint probability density of the source singer's spectral features and the difference in the spectral features. The voice timbre of the source singer is converted into that of the target singer by directly filtering an input natural singing voice of the source singer with the converted spectral feature differential. The proposed SVC method does not need to generate excitation signals, which are required in vocoder-based waveform generation. Therefore, the converted singing voice is free from various errors usually observed in traditional SVC, such as F0 extraction errors, unvoiced/voiced decision errors, and spectral parameterization errors caused by liftering of the mel-cepstrum. On the other hand, the excitation parameters cannot be converted in the proposed SVC method.

The DIFFGMM is analytically derived from the traditional GMM (Eq. (1)) used in the conventional SVC. Let $D_t = [d_t^\top, \Delta d_t^\top]^\top$ denote the joint static and dynamic differential feature vector, where $d_t = y_t - x_t$. The 2D-dimensional joint static and dynamic feature vector of the source and differential features is given by

$$
\begin{bmatrix} X_t \\ D_t \end{bmatrix} = A \begin{bmatrix} X_t \\ Y_t \end{bmatrix}, \qquad (3)
$$

$$
A = \begin{bmatrix} I & 0 \\ -I & I \end{bmatrix}, \qquad (4)
$$

where $A$ is a transformation matrix that transforms the joint feature vector of the source and target features into that of the source and differential features, and $I$ denotes the identity matrix. Applying this transformation matrix to the traditional GMM in Eq. (1), the DIFFGMM is derived as follows:

$$
P(X_t, D_t \mid \lambda) = \sum_{m=1}^{M} \alpha_m \,
\mathcal{N}\!\left(
\begin{bmatrix} X_t \\ D_t \end{bmatrix};
\begin{bmatrix} \mu_m^{(X)} \\ \mu_m^{(D)} \end{bmatrix},
\begin{bmatrix} \Sigma_m^{(XX)} & \Sigma_m^{(XD)} \\ \Sigma_m^{(DX)} & \Sigma_m^{(DD)} \end{bmatrix}
\right), \qquad (5)
$$

$$
\mu_m^{(D)} = \mu_m^{(Y)} - \mu_m^{(X)}, \qquad (6)
$$

$$
\Sigma_m^{(XD)} = \Sigma_m^{(DX)\top} = \Sigma_m^{(XY)} - \Sigma_m^{(XX)}, \qquad (7)
$$

$$
\Sigma_m^{(DD)} = \Sigma_m^{(XX)} + \Sigma_m^{(YY)} - \Sigma_m^{(XY)} - \Sigma_m^{(YX)}. \qquad (8)
$$

The converted differential feature vector is determined in the same manner as described in Section 2. In this paper, the GV is not considered in the proposed SVC method based on the spectrum differential.

4. Experimental evaluation

4.1. Experimental conditions

We evaluated the speech quality and singer identity of the converted singing voices to compare the conventional SVC and the proposed SVC. We used singing voices of 21 Japanese traditional songs, divided into 152 phrases, where the duration of each phrase was approximately 8 seconds. Three male and three female singers sang these phrases. The sampling frequency was set to 16 kHz. STRAIGHT [16] was used to extract spectral envelopes, which were parameterized as the 1st-24th, 1st-32nd, and 1st-40th mel-cepstral coefficients to form the spectral features. As the source excitation features for the conventional SVC, we used F0 and aperiodic components in five frequency bands (0-1, 1-2, 2-4, 4-6, and 6-8 kHz), which were also extracted by STRAIGHT [17]. The frame shift was 5 ms. The mel log spectrum approximation (MLSA) filter [18] was used as the synthesis filter in both the conventional and proposed methods. We used 80 phrases for GMM training, and the remaining 72 phrases were used for evaluation. Speaker-dependent GMMs were trained separately for individual singer pairs determined in a round-robin fashion within each gender. The number of mixture components was 128 for the mel-cepstral coefficients and 64 for the aperiodic components.

Two preference tests were conducted. The speech quality of the converted singing voices was evaluated in the first preference test: the converted singing voice samples of the conventional SVC and the proposed SVC for the same phrase were presented to listeners in random order, and the listeners selected the sample with better sound quality. The conversion accuracy of singer identity was evaluated in the second preference test: a natural singing voice sample of the target singer was first presented to the listeners as a reference, then the converted singing voice samples of the conventional SVC and the proposed SVC for the same phrase were presented in random order, and the listeners selected the sample more similar to the reference natural singing voice in terms of singer identity. There were eight listeners, and each listener evaluated 24 sample pairs in each order setting of the mel-cepstral coefficients. None of the listeners specialized in audio, and they were allowed to replay each sample pair as many times as necessary.

4.2. Experimental results

Figure 2: Evaluation of speech quality (preference score in % with 95% confidence intervals for the 1st-24th, 1st-32nd, and 1st-40th mel-cepstral orders).

Figure 3: Evaluation of singer identity (preference score in % with 95% confidence intervals for the 1st-24th, 1st-32nd, and 1st-40th mel-cepstral orders).

Figure 2 shows the results of the preference test on speech quality. The proposed SVC generates converted speech with better speech quality than the conventional SVC in every order setting of the mel-cepstral coefficients. This is presumably because the proposed SVC is free from the various errors caused by vocoder-based waveform generation, such as F0 extraction errors and spectral modeling errors caused by liftering.
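Before turning to the remaining results, the parameter mapping of Eqs. (6)-(8) can be made concrete with a small NumPy sketch that converts one component of a trained joint GMM into the corresponding DIFFGMM component. The block layout of Eq. (1) is assumed, and the function and variable names are illustrative rather than taken from the authors' implementation.

```python
import numpy as np

def to_diff_component(mu, sigma, dim):
    """Map one joint-GMM component N([X; Y]; mu, sigma) to the DIFFGMM
    component N([X; D]; mu_out, sigma_out) with D = Y - X, following Eqs. (6)-(8).

    mu:    (4D,) mean of the joint static+dynamic vector [X_t, Y_t]
    sigma: (4D, 4D) full covariance of the same vector
    dim:   2D, i.e. the length of the X (or Y) block
    """
    mu_x, mu_y = mu[:dim], mu[dim:]
    s_xx, s_xy = sigma[:dim, :dim], sigma[:dim, dim:]
    s_yx, s_yy = sigma[dim:, :dim], sigma[dim:, dim:]

    mu_d = mu_y - mu_x                     # Eq. (6)
    s_xd = s_xy - s_xx                     # Eq. (7)
    s_dd = s_xx + s_yy - s_xy - s_yx       # Eq. (8)

    mu_out = np.concatenate([mu_x, mu_d])
    sigma_out = np.block([[s_xx, s_xd],
                          [s_xd.T, s_dd]])
    return mu_out, sigma_out
```

Equivalently, this is the linear map of Eqs. (3)-(4) applied per component, mu -> A mu and Sigma -> A Sigma A^T, while the mixture weights alpha_m are left unchanged.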
Figure 3 shows the results of the preference test on singer identity. The conversion accuracy of singer identity of the proposed SVC is not statistically significantly different from that of the conventional SVC in any order setting of the mel-cepstral coefficients. This result suggests that the aperiodic components have little effect on singer identity in singing voices: even though the proposed SVC cannot convert the excitation features, its conversion accuracy of singer identity remains equivalent to that of the conventional SVC. These results demonstrate that the proposed SVC is capable of converting voice timbre with higher speech quality while causing no degradation in the conversion accuracy of singer identity compared to the conventional SVC. Note that the GV is considered in the conventional SVC but not in the proposed SVC.

4.3. Comparison of the converted spectral features

To analyze more deeply what yields the naturalness improvements in the proposed SVC, we examine in detail the following spectral feature trajectories of singing voices:

Source: mel-cepstral coefficients extracted from the source singer's natural singing voice
Target: mel-cepstral coefficients extracted from the target singer's natural singing voice
DIFFSVC (diff feature): differences of mel-cepstral coefficients estimated with the differential GMM in the proposed SVC
DIFFSVC (filtered): mel-cepstral coefficients extracted from the singing voice converted by the proposed SVC
SVC (w/ GV): mel-cepstral coefficients estimated with the conventional GMM considering the GV
SVC (w/o GV): mel-cepstral coefficients estimated with the conventional GMM without considering the GV

Figure 4: Example trajectories of the spectral feature sequences (1st, 4th, and 16th mel-cepstral coefficients over about 2 s) for Source, Target, DIFFSVC (diff feature), DIFFSVC (filtered), SVC (w/o GV), and SVC (w/ GV). Note that the duration of the Target trajectories is different from that of the other trajectories.

Figure 5: GVs of several mel-cepstral coefficient sequences (Target, DIFFSVC (diff feature), DIFFSVC (filtered), SVC (w/ GV), and SVC (w/o GV)) plotted against the order of the mel-cepstrum.

Figure 4 shows the trajectories of the mel-cepstral coefficients for each sample. It can be observed from Source and Target that higher-order mel-cepstral coefficients tend to exhibit rapidly varying fluctuations; in other words, high modulation-frequency components tend to grow larger as the order of the mel-cepstral coefficient increases. Such rapidly varying fluctuations are not observed in the trajectories of the higher-order mel-cepstral coefficients of SVC (w/o GV). They are still not observed when the GV is considered in SVC (w/ GV), although the GVs of the higher-order mel-cepstral coefficients are then recovered well. Therefore, these fluctuations are not modeled well by SVC based on the conventional GMM. On the other hand, these fluctuations are still observed in DIFFSVC (filtered). Note that they do not appear in the estimated trajectories of the differences of mel-cepstral coefficients, DIFFSVC (diff feature), which are estimated with the differential GMM in the proposed SVC. However, in the proposed SVC the source singing voices are directly filtered to generate the converted singing voices; therefore, the fluctuations present in the source singing voices are retained in the singing voices converted by the proposed SVC, DIFFSVC (filtered). It is possible that the quality improvement of the proposed SVC arises because it generates converted trajectories with fluctuations similar to those of natural singing voices.

Figure 5 shows the GVs calculated from the trajectories of the mel-cepstral coefficients. As reported in previous work [9], the GVs of the converted mel-cepstral coefficients tend to be smaller in SVC (w/o GV), and this tendency is especially clear for the higher-order mel-cepstral coefficients; the GVs are recovered by SVC (w/ GV), becoming almost equivalent to those of the target, Target. On the other hand, the GVs of the mel-cepstral coefficients in the proposed method, DIFFSVC (filtered), tend to be smaller than those of the target. This tendency can also be observed in Figure 4. Note that the GV is not considered in the proposed method in this paper. It is expected that the naturalness of the singing voices converted by the proposed SVC can be further improved by considering the GV so that the GVs of the filtered mel-cepstral coefficients become close to those of the target.
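For completeness, the direct filtering step that produces DIFFSVC (filtered) can be sketched with an MLSA synthesis filter whose input is the source waveform itself, as in the right-hand pipeline of Figure 1. This is a non-authoritative sketch using pysptk; the all-pass constant alpha = 0.42 (a common value for 16 kHz analysis), the 80-sample hop corresponding to the 5 ms frame shift, the zeroing of the 0th (power) coefficient, and the helper name are assumptions rather than details taken from the paper.

```python
import numpy as np
import pysptk
from pysptk.synthesis import MLSADF, Synthesizer

def diff_filter_waveform(x, mc_diff, alpha=0.42, hop_length=80):
    """Filter the source waveform x directly with the estimated mel-cepstral
    differential mc_diff (shape: frames x (order + 1)).

    alpha = 0.42 and hop_length = 80 (5 ms at 16 kHz) are assumed values,
    not taken from the paper's implementation.
    """
    order = mc_diff.shape[1] - 1
    mc = mc_diff.astype(np.float64).copy()
    mc[:, 0] = 0.0                       # leave overall power unmodified (a choice, not from the paper)
    b = pysptk.mc2b(mc, alpha=alpha)     # mel-cepstrum -> MLSA filter coefficients
    synth = Synthesizer(MLSADF(order=order, alpha=alpha), hop_length)
    # The source waveform itself drives the filter, so no vocoder excitation
    # (and hence no F0 extraction) is needed at synthesis time.
    return synth.synthesis(x, b)
```

Because the source waveform serves as the filter input, no excitation generation is involved at synthesis time, which is precisely the property the proposed method exploits.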
5. Conclusions

To improve the quality of singing voice conversion (SVC), we proposed SVC with direct waveform modification based on the spectrum differential. The experimental results demonstrated that the proposed SVC makes it possible to convert the voice timbre of a source singer into that of a target singer with higher speech quality than conventional SVC. In future work, we plan to implement a conversion algorithm that considers the global variance for the proposed method to further improve the quality of the converted singing voice.

6. Acknowledgements

Part of this work was supported by JSPS KAKENHI Grant Number 26280060 and by the JST OngaCREST project.

7. References

[1] H. Kenmochi and H. Ohshita, "VOCALOID - Commercial singing synthesizer based on sample concatenation," Proc. INTERSPEECH, pp. 4011-4012, Aug. 2007.
[2] K. Saino, M. Tachibana, and H. Kenmochi, "A singing style modeling system for singing voice synthesizers," Proc. INTERSPEECH, pp. 2894-2897, Sept. 2010.
[3] K. Oura, A. Mase, T. Yamada, S. Muto, Y. Nankaku, and K. Tokuda, "Recent development of the HMM-based singing voice synthesis system - Sinsy," Proc. SSW7, pp. 211-216, Sept. 2010.
[4] T. Nakano and M. Goto, "VocaListener: A singing-to-singing synthesis system based on iterative parameter estimation," Proc. SMC, pp. 343-348, July 2009.
[5] T. Nakano and M. Goto, "VocaListener2: A singing synthesis system able to mimic a user's singing in terms of voice timbre changes as well as pitch and dynamics," Proc. ICASSP, pp. 453-456, May 2011.
[6] M. Morise, M. Onishi, H. Kawahara, and H. Katayose, "v.morish'09: A morphing-based singing design interface for vocal melodies," Proc. ICEC, pp. 185-190, Sept. 2009.
[7] H. Ye and S. Young, "High quality voice morphing," Proc. ICASSP, vol. 1, pp. I-9-12, May 2004.
[8] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech and Audio Processing, vol. 6, no. 2, pp. 131-142, Mar. 1998.
[9] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum likelihood estimation of spectral parameter trajectory," IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222-2235, Nov. 2007.
[10] F. Villavicencio and J. Bonada, "Applying voice conversion to concatenative singing-voice synthesis," Proc. INTERSPEECH, pp. 2162-2165, Sept. 2010.
[11] Y. Kawakami, H. Banno, and F. Itakura, "GMM voice conversion of singing voice using vocal tract area function," IEICE Technical Report, Speech (in Japanese), vol. 110, no. 297, pp. 71-76, Nov. 2010.
[12] T. Toda, Y. Ohtani, and K. Shikano, "One-to-many and many-to-one voice conversion based on eigenvoices," Proc. ICASSP, pp. 1249-1252, Apr. 2007.
[13] Y. Ohtani, T. Toda, H. Saruwatari, and K. Shikano, "Many-to-many eigenvoice conversion with reference voice," Proc. INTERSPEECH, pp. 1623-1626, Sept. 2009.
[14] H. Doi, T. Toda, T. Nakano, M. Goto, and S. Nakamura, "Singing voice conversion method based on many-to-many eigenvoice conversion and training data generation using a singing-to-singing synthesis system," Proc. APSIPA ASC, Nov. 2012.
[15] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," Proc. ICASSP, pp. 1315-1318, June 2000.
[16] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3-4, pp. 187-207, Apr. 1999.
[17] H. Kawahara, J. Estill, and O. Fujimura, "Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT," Proc. MAVEBA, Sept. 2001.
[18] S. Imai, K. Sumita, and C. Furuichi, "Mel log spectrum approximation (MLSA) filter for speech synthesis," Electronics and Communications in Japan (Part I: Communications), vol. 66, no. 2, pp. 10-18, 1983.