The NU-NAIST voice conversion system for the Voice Conversion Challenge 2016

INTERSPEECH 2016: September 8-12, 2016, San Francisco, USA

The NU-NAIST voice conversion system for the Voice Conversion Challenge 2016

Kazuhiro Kobayashi 1, Shinnosuke Takamichi 1, Satoshi Nakamura 1, Tomoki Toda 2
1 Nara Institute of Science and Technology (NAIST), Japan
2 Information Technology Center, Nagoya University, Japan
1 {kazuhiro-k, shinnosuke-t, s-nakamura}@is.naist.jp, 2 tomoki@icts.nagoya-u.ac.jp

Abstract

This paper presents the NU-NAIST voice conversion (VC) system for the Voice Conversion Challenge 2016 (VCC 2016) developed by a joint team of Nagoya University and Nara Institute of Science and Technology. Statistical VC based on a Gaussian mixture model makes it possible to convert the speaker identity of a source speaker's voice into that of a target speaker by converting several speech parameters. However, various factors such as parameterization errors and over-smoothing effects usually cause speech quality degradation of the converted voice. To address this issue, we have proposed a direct waveform modification technique based on spectral differential filtering and have successfully applied it to singing voice conversion, where excitation features are not necessarily converted. In this paper, we propose a method to apply this technique to a standard voice conversion task where excitation feature conversion is needed. The result of the VCC 2016 demonstrates that the NU-NAIST VC system developed with the proposed method yields the best accuracy for speaker identity (a correct rate of more than 70%) and a quite high naturalness score (more than 3.0 on the mean opinion score scale). This paper presents detailed descriptions of the NU-NAIST VC system and additional results of its performance evaluation.

Index Terms: voice conversion challenge 2016, speaker identity, segmental feature, Gaussian mixture model, STRAIGHT analysis

1. Introduction

Varieties of voice characteristics, such as voice timbre and fundamental frequency (F0) patterns, produced by individual speakers are always restricted by their own physical constraints due to the speech production mechanism. This constraint is helpful for making it possible to produce a speech signal capable of simultaneously conveying not only linguistic information but also non-linguistic information such as speaker identity. However, it also causes various barriers in speech communication; e.g., severe vocal disorders are easily caused even if speech organs are only partially damaged, and we hesitate to talk about something private on a cell phone if we are surrounded by others. If individual speakers could freely produce various voice characteristics beyond their own physical constraints, these barriers would break down and an entirely new style of speech communication would open up.

Voice conversion (VC) is a potential technique for producing speech sounds beyond our own physical constraints [1]. VC research originally started with speaker conversion, which makes it possible to transform the voice identity of a source speaker into that of a target speaker while preserving the linguistic content [2]. A mainstream of VC is the statistical approach, in which a conversion function is developed using a parallel data set consisting of utterances of the source and target speakers. As one of the most popular statistical VC methods, a regression method using a Gaussian mixture model (GMM) was proposed [3]. To improve the performance of the GMM-based VC method, various VC methods have been proposed that implement more sophisticated techniques, such as Gaussian process regression [4, 5], deep neural networks [6, 7], non-negative matrix factorization [8, 9], and so on.
We have also significantly improved the performance of the standard GMM-based VC method by incorporating a trajectory-based conversion algorithm that makes it possible to consider temporal correlation in conversion [10], by modeling additional features to alleviate the over-smoothing effect of the converted speech parameters, such as the global variance (GV) [10] and the modulation spectrum (MS) [11], and by implementing STRAIGHT [12] with mixed excitation [13]. Furthermore, a real-time conversion process has also been successfully implemented for state-of-the-art GMM-based VC [14]. However, the speech quality of the converted voices is still obviously degraded compared to that of natural voices. One of the biggest factors causing this quality degradation is the waveform generation process using a vocoder [15], which is still observed even when using high-quality vocoder systems [12, 16, 17].

In singing VC (SVC), to avoid the quality degradation caused by the vocoding process [15], we have proposed an intra-gender SVC method with direct waveform modification based on the spectrum differential (DIFFSVC) [18] considering the global variance (GV) [19], exploiting the fact that F0 transformation is not necessary in intra-gender SVC. The DIFFSVC framework avoids using the vocoder by directly filtering an input singing voice waveform with a time sequence of spectral parameter differentials estimated by a differential GMM (DIFFGMM) analytically derived from the conventional GMM used in the standard method. Moreover, to apply this framework to cross-gender DIFFSVC as well, we have proposed an F0 transformation technique with direct residual signal modification [20] based on time-scaling with waveform similarity-based overlap-add [21] and resampling.

In this paper, we develop a new VC system for speaker conversion based on the direct waveform modification technique, which was submitted to the Voice Conversion Challenge 2016 (VCC 2016) [22] by our joint team of Nagoya University and Nara Institute of Science and Technology (NAIST) as the NU-NAIST VC system (called the "new NAIST VC system"). The following techniques are newly implemented for our GMM-based VC system: 1) voice conversion with direct waveform modification with spectral differential (DIFFVC), 2) speech parameter trajectory smoothing in the GMM training, 3) a post-filtering process based on the MS for DIFFVC, and 4) excitation conversion (EC) using STRAIGHT as preprocessing of the spectral conversion.

The results of the VCC 2016 have demonstrated that the NU-NAIST VC system (system "J") achieved the best accuracy on speaker identity and high naturalness (more than 3.0 on the mean opinion score scale). In this paper, we also conduct subjective evaluations demonstrating that the NU-NAIST VC system achieves speech quality and conversion accuracy comparable to our conventional GMM-based VC system.

2. VC based on GMM

In conventional VC, acoustic features such as spectral features and aperiodic components of a source speaker are converted into those of a target speaker based on previously trained GMMs. F0 is transformed to compensate for the difference in pitch between the source and target speakers based on a frame-by-frame linear conversion. Finally, the converted voice is generated by synthesizing these converted acoustic features using a vocoder.

2.1. Acoustic feature mapping based on GMM

Acoustic feature mapping based on a GMM consists of a training process and a conversion process. In the training process, the joint probability density function of acoustic features of the source and target speaker voices is modeled with a GMM using a parallel data set. As the acoustic features of the source and target speakers, we employ 2D-dimensional joint static and dynamic feature vectors X_t = [x_t^\top, \Delta x_t^\top]^\top of the source and Y_t = [y_t^\top, \Delta y_t^\top]^\top of the target, consisting of D-dimensional static feature vectors x_t and y_t and their dynamic feature vectors \Delta x_t and \Delta y_t at frame t, respectively, where ^\top denotes transposition of a vector. Their joint probability density modeled by the GMM is given by

    P(X_t, Y_t | \lambda) = \sum_{m=1}^{M} \alpha_m \, \mathcal{N}\!\left( \begin{bmatrix} X_t \\ Y_t \end{bmatrix}; \begin{bmatrix} \mu_m^{(X)} \\ \mu_m^{(Y)} \end{bmatrix}, \begin{bmatrix} \Sigma_m^{(XX)} & \Sigma_m^{(XY)} \\ \Sigma_m^{(YX)} & \Sigma_m^{(YY)} \end{bmatrix} \right),    (1)

where \mathcal{N}(\cdot; \mu, \Sigma) denotes the normal distribution with mean vector \mu and covariance matrix \Sigma. The mixture component index is m, and the total number of mixture components is M. \lambda is the GMM parameter set consisting of the mixture-component weight \alpha_m, the mean vector \mu_m, and the covariance matrix \Sigma_m of the m-th mixture component. The GMM is trained using joint vectors of X_t and Y_t in the parallel data set, which are automatically aligned to each other by dynamic time warping. In the conversion process, the acoustic features of the source speaker are converted into those of the target speaker using maximum likelihood estimation (MLE) of speech parameter trajectories with the GMM and GV [10].

2.2. F0 transformation

In both intra- and cross-gender conversion, F0 is transformed frame by frame in order to line up the pitch differences between the source and target speakers:

    \hat{y}_t = \frac{\sigma^{(y)}}{\sigma^{(x)}} \left( x_t - \mu^{(x)} \right) + \mu^{(y)},    (2)

where x_t and \hat{y}_t are the log-scaled F0 of the source speaker and the converted one at frame t, \mu^{(x)} and \sigma^{(x)} are the mean and standard deviation of the log-scaled F0 of the source speaker, and \mu^{(y)} and \sigma^{(y)} are those of the target speaker.

3. The NU-NAIST VC system for VCC 2016

In this paper, we propose the following techniques: 1) DIFFVC, 2) GMM training with smoothed speech parameter trajectories, 3) a post-filtering process based on the modulation spectrum (MS) for DIFFVC, and 4) excitation conversion with F0 and aperiodic component transformations using a vocoder. Figure 1 shows the flow of the NU-NAIST VC system for the VCC 2016. The NU-NAIST VC system performs excitation conversion and spectral conversion. During excitation conversion, F0 values and aperiodic components extracted from a source voice are transformed within an analysis/synthesis framework using a vocoder. During spectral conversion, spectral features of the source voice are converted into spectral feature differentials based on the DIFFGMM. Next, MS-based post-filtering is applied to the spectral feature differentials. Finally, the converted speech waveform is generated by directly filtering the analysis-synthesized speech waveform generated during the excitation conversion step with the post-filtered spectral feature differentials.
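To make the conventional building blocks of Sect. 2 concrete, the following is a minimal Python sketch (an illustration under stated assumptions, not the authors' implementation): fitting the joint GMM of Eq. (1) on time-aligned static+delta feature vectors with scikit-learn, and the frame-wise log-scaled F0 transformation of Eq. (2). The function names and the (mean, standard deviation) containers for the F0 statistics are assumptions; feature extraction, DTW alignment, and the ML trajectory conversion with GV are assumed to be handled elsewhere.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_mix=128, seed=0):
    """Fit a joint GMM on Z_t = [X_t, Y_t] (Eq. (1)).

    X, Y: (T, 2D) arrays of DTW-aligned static+delta features of the
    source and target speakers."""
    Z = np.hstack([X, Y])                  # joint vectors [X_t^T, Y_t^T]^T
    gmm = GaussianMixture(n_components=n_mix, covariance_type="full",
                          random_state=seed)
    gmm.fit(Z)                             # models P(X_t, Y_t | lambda)
    return gmm

def convert_f0(f0_src, f0_stats_src, f0_stats_tgt):
    """Frame-wise linear transformation of log-scaled F0 (Eq. (2)).

    f0_stats_*: (mean, std) of log F0 over voiced frames of each speaker."""
    mu_x, sigma_x = f0_stats_src
    mu_y, sigma_y = f0_stats_tgt
    f0_src = np.asarray(f0_src, dtype=float)
    f0_conv = np.zeros_like(f0_src)        # unvoiced frames (F0 = 0) stay 0
    voiced = f0_src > 0
    lf0 = np.log(f0_src[voiced])
    f0_conv[voiced] = np.exp(sigma_y / sigma_x * (lf0 - mu_x) + mu_y)
    return f0_conv
```

The mixture means and full covariances of the fitted model (`gmm.means_`, `gmm.covariances_`) carry the block structure of Eq. (1) and are reused in the DIFFGMM derivation of Sect. 3.1.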
3.1. DIFFVC based on DIFFGMM

In the modeling step, the DIFFGMM is analytically derived from the traditional GMM of Eq. (1). Let D_t = [d_t^\top, \Delta d_t^\top]^\top denote the static and dynamic differential feature vector, where d_t = y_t - x_t. The DIFFGMM is obtained by transforming the model parameters in the same manner as in DIFFSVC [18] as follows:

    P(X_t, D_t | \lambda) = \sum_{m=1}^{M} \alpha_m \, \mathcal{N}\!\left( \begin{bmatrix} X_t \\ D_t \end{bmatrix}; \begin{bmatrix} \mu_m^{(X)} \\ \mu_m^{(D)} \end{bmatrix}, \begin{bmatrix} \Sigma_m^{(XX)} & \Sigma_m^{(XD)} \\ \Sigma_m^{(DX)} & \Sigma_m^{(DD)} \end{bmatrix} \right).    (3)

In the conversion step, a time sequence of the D-dimensional converted spectral feature differentials, \hat{d}, is determined using MLE of the speech parameter trajectory with the DIFFGMM [18]. Then, the converted speech waveform is generated by directly filtering the input speech waveform with a time-variant synthesis filter designed from the spectral feature differential sequence. This filtering process modifies the spectral envelope sequence while essentially preserving the natural excitation signals of the input speech waveform.

3.2. Speech parameter trajectory smoothing

The modulation spectrum (MS) [11] is defined as the log-scaled power spectrum of a parameter sequence; i.e., the temporal fluctuation of the parameter sequence is decomposed into individual modulation frequency components, and their power values are represented as the MS. The MS, s(y), of the parameter sequence y is defined as

    s(y) = \left[ s_1(y)^\top, \ldots, s_d(y)^\top, \ldots, s_D(y)^\top \right]^\top,    (4)
    s_d(y) = \left[ s_{d,0}(y), \ldots, s_{d,f}(y), \ldots, s_{d,D_s-1}(y) \right]^\top,    (5)

where 2 D_s is the length of the discrete Fourier transform and s_{d,f}(y) is the f-th MS component of the d-th dimension of the parameter sequence [y_1(d), \ldots, y_T(d)]; f is the modulation frequency index. As reported in [23, 24], the higher modulation frequency components (the more rapidly fluctuating components of a temporal sequence) of spectral parameter sequences are negligible for speech quality. By applying a low-pass filter (LPF) that removes the higher modulation frequency components (e.g., those above 50 Hz, f > D_s/2), we can improve the training accuracy of the acoustic models, as done for hidden Markov model-based speech synthesis [25]. Here, the source and target speakers' speech parameter sequences, x and y, are LPFed, and the LPFed sequences, x^{(LPF)} and y^{(LPF)}, are then used to train the GMM. In conversion, x^{(LPF)} is used to generate the spectral differentials.
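As a concrete illustration of the analytic derivation behind Eq. (3), here is a small numpy sketch (an assumption about the storage format, not the authors' code). Since D_t = Y_t - X_t, the joint vector [X_t; D_t] is a linear map T = [[I, 0], [-I, I]] of [X_t; Y_t], so each mixture component's mean and covariance of the joint GMM transform as mu' = T mu and Sigma' = T Sigma T^T, while the mixture weights are unchanged.

```python
import numpy as np

def derive_diffgmm(joint_means, joint_covs, joint_dim):
    """Derive DIFFGMM parameters (Eq. (3)) from a joint GMM (Eq. (1)).

    joint_means: (M, 4D) component means of the GMM over [X_t; Y_t],
    joint_covs:  (M, 4D, 4D) full covariances,
    joint_dim:   2D, the length of X_t (static+delta)."""
    I = np.eye(joint_dim)
    Z = np.zeros((joint_dim, joint_dim))
    T = np.block([[I, Z], [-I, I]])   # maps [X; Y] -> [X; Y - X] = [X; D]
    diff_means = joint_means @ T.T    # mu_m^{(X,D)} = T mu_m^{(X,Y)}
    diff_covs = T @ joint_covs @ T.T  # Sigma_m^{(X,D)} = T Sigma_m^{(X,Y)} T^T
    return diff_means, diff_covs      # mixture weights alpha_m are reused as-is
```

With the scikit-learn model from the previous sketch, `joint_means` and `joint_covs` correspond to `gmm.means_` and `gmm.covariances_`; the resulting blocks satisfy, for example, mu^(D) = mu^(Y) - mu^(X) and Sigma^(DD) = Sigma^(YY) - Sigma^(YX) - Sigma^(XY) + Sigma^(XX).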

[Figure 1: Conversion process of the NU-NAIST VC system for the VCC 2016. The source voice is analyzed with STRAIGHT into F0, aperiodicity, and spectrum envelope; excitation conversion applies a linear F0 transformation and aperiodicity modification with the GMM for aperiodic components, followed by excitation generation and a synthesis filter; spectral conversion maps the mel-cepstrum with the DIFFGMM for mel-cepstrum to a converted mel-cepstrum differential, applies the MSPF, and drives the synthesis filter to produce the converted voice (DIFFVC (EC)).]

3.3. MS-based post-filter for VC with spectral differentials

Statistical modeling tends to deteriorate the MSs of the converted speech parameters, and keeping natural MSs is strongly effective for improving the quality of the converted speech. The MS-based post-filter (MSPF) [11], which is applied after speech parameter conversion in conventional GMM-based VC, modifies a converted speech parameter sequence so that the sequence has the target speaker's natural MS. Here, we propose an MS-based post-filtering process that modifies the spectral differentials, \hat{d}, such that the finally synthesized speech has the target speaker's natural MS. In training, we calculate MS statistics of the target speaker's natural and converted speech parameters from the training data, y and \tilde{y} = \hat{d} + x^{(LPF)}: let \mu^{(y)}_{d,f} and \mu^{(\tilde{y})}_{d,f} be the means of s_{d,f}(y) and s_{d,f}(\tilde{y}), and let \sigma^{(y)}_{d,f} and \sigma^{(\tilde{y})}_{d,f} be their variances, where \hat{d} is generated by converting x^{(LPF)}. In conversion, x^{(LPF)} is first added to the generated \hat{d}. Then, the MS, s_{d,f}(\tilde{y}), is converted as follows:

    \hat{s}_{d,f}(\tilde{y}) = \frac{\sigma^{(y)}_{d,f}}{\sigma^{(\tilde{y})}_{d,f}} \left( s_{d,f}(\tilde{y}) - \mu^{(\tilde{y})}_{d,f} \right) + \mu^{(y)}_{d,f}.    (6)

The converted \tilde{y} is determined using the converted MS and the original phase components. The MSPFed spectral differentials, \hat{d}^{(MSPF)}, can then be determined by subtracting x^{(LPF)} from the converted \tilde{y}.^1 Note that, in this paper, we use mean-normalized MSs and adopt a segment-level post-filtering process [11].

3.4. Excitation conversion based on F0 and aperiodicity transformations using a vocoder

Although we initially tried implementing the F0 transformation technique with direct residual signal modification [20] for singer conversion, we found that this technique was not effective for speaker conversion. In speaker conversion, we need to handle larger acoustic differences in the excitation signals between the source and target speakers compared to singing voice conversion. To address this issue, we implemented excitation conversion using STRAIGHT [26] as a high-quality vocoder. For the F0 transformation, we perform the global linear transformation described in Sect. 2.2. For the aperiodic components, band-averaged aperiodic components are extracted and converted with a GMM as in the conventional method [13]. Then, the original aperiodic components at all frequency bins are shifted using the aperiodicity differentials between the extracted and converted band-averaged aperiodic components. Finally, analysis-synthesized speech is generated from these transformed excitation parameters using STRAIGHT. Note that the full STRAIGHT spectral representation is directly used in this synthesis. This excitation conversion method actually causes significant quality degradation because the original phase information is discarded. Nevertheless, we have found that it yields better speech quality as well as better conversion accuracy than the direct residual signal modification [20].

^1 Note that, because the MSPF process is non-linear with respect to the speech parameter sequence, the sequence obtained by subtracting x^{(LPF)} from the converted \tilde{y} is not equal to \hat{d}.
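The two modulation-spectrum operations above (the trajectory smoothing of Sect. 3.2 and the post-filter of Eq. (6)) can be sketched as follows. This is a simplified, utterance-level illustration and an assumption rather than the authors' implementation, which uses mean-normalized MSs and segment-level processing; the function names, the n_fft choice, and the (mean, scale) statistics containers are hypothetical.

```python
import numpy as np

def modulation_spectrum(seq, n_fft):
    """Eqs. (4)-(5): per-dimension log power spectrum of a (T, D) parameter
    sequence along the time axis; the phase is returned for reconstruction."""
    spec = np.fft.rfft(seq, n=n_fft, axis=0)
    return np.log(np.abs(spec) ** 2 + 1e-12), np.angle(spec)

def smooth_trajectory(seq, frame_shift=0.005, cutoff_hz=50.0):
    """Sect. 3.2: remove modulation-frequency components above cutoff_hz
    (e.g. 50 Hz at a 5 ms frame shift) before GMM training."""
    T = seq.shape[0]
    spec = np.fft.rfft(seq, axis=0)
    mod_freqs = np.fft.rfftfreq(T, d=frame_shift)
    spec[mod_freqs > cutoff_hz] = 0.0      # low-pass filter in the MS domain
    return np.fft.irfft(spec, n=T, axis=0)

def ms_postfilter(seq, ms_stats_conv, ms_stats_tgt, n_fft):
    """Eq. (6): scale the MS of `seq` (e.g. y_tilde = d_hat + x_lpf) from
    converted-speech statistics toward the target speaker's statistics and
    rebuild the sequence with the original phase. n_fft (2*Ds in the paper)
    must be >= T and match the length used to compute the MS statistics."""
    mu_conv, scale_conv = ms_stats_conv    # each of shape (n_fft//2 + 1, D)
    mu_tgt, scale_tgt = ms_stats_tgt
    T = seq.shape[0]
    log_ms, phase = modulation_spectrum(seq, n_fft)
    log_ms_pf = scale_tgt / scale_conv * (log_ms - mu_conv) + mu_tgt
    mag = np.sqrt(np.exp(log_ms_pf))       # log power -> magnitude
    return np.fft.irfft(mag * np.exp(1j * phase), n=n_fft, axis=0)[:T]
```

In the setting of Sect. 3.3, the post-filtered differentials would then be obtained as `ms_postfilter(d_hat + x_lpf, ...) - x_lpf`, which, as noted in the footnote, differs from d_hat because the post-filter is non-linear.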
4. Experimental evaluation

In this section, we show the results of the VCC 2016 to demonstrate the performance of the NU-NAIST VC system. Moreover, we compare the following three systems:

- DIFFVC (EC): the NU-NAIST VC system submitted to the VCC 2016,
- VC: our conventional VC system [13],
- DIFFVC: the NU-NAIST VC system without excitation conversion.

4.1. Experimental conditions

We evaluated the speech quality and speaker identity of the converted voices to compare the performance of the different VC systems in both intra-gender and cross-gender conversion tasks. We used the English speech database of the VCC 2016. The number of source speakers was 5, including 3 females and 2 males, and the number of target speakers was 5, including 2 females and 3 males, who were different from the source female and male speakers. The number of sentences uttered by each speaker was 216. The sampling frequency was set to 16 kHz. STRAIGHT [12] was used to extract spectral envelopes, which were parameterized into the 1st-24th mel-cepstral coefficients as the spectral feature. The frame shift was 5 ms. The mel log spectrum approximation (MLSA) filter [27] was used as the synthesis filter. As the source excitation features, we used F0 and aperiodic components extracted with STRAIGHT [26]. The aperiodic components were averaged over five frequency bands, i.e., 0-1, 1-2, 2-4, 4-6, and 6-8 kHz, to be modeled with the GMM. We used 162 sentences for training, and the remaining 54 sentences were used for evaluation. Speaker-dependent GMMs were separately trained for all combinations of source and target speaker pairs. The number of mixture components was 128 for the mel-cepstral coefficients and 64 for the aperiodic components.
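As a small side illustration of the band-averaged aperiodic components used above, here is a numpy sketch (the averaging domain, bin layout, and function name are assumptions, not taken from the paper): a full-band aperiodicity envelope is reduced to the five band averages (0-1, 1-2, 2-4, 4-6, and 6-8 kHz) that are modeled with the GMM at a 16 kHz sampling rate.

```python
import numpy as np

# Averaging bands in Hz for a 16 kHz sampling rate, as listed above.
BANDS_HZ = [(0, 1000), (1000, 2000), (2000, 4000), (4000, 6000), (6000, 8000)]

def band_average_aperiodicity(ap, fs=16000):
    """ap: (T, n_bins) frame-wise aperiodicity with bins assumed to be
    linearly spaced from 0 Hz to fs/2. Returns a (T, 5) array."""
    freqs = np.linspace(0, fs / 2, ap.shape[1])
    bap = []
    for i, (lo, hi) in enumerate(BANDS_HZ):
        upper = freqs <= hi if i == len(BANDS_HZ) - 1 else freqs < hi
        bap.append(ap[:, (freqs >= lo) & upper].mean(axis=1))
    return np.stack(bap, axis=1)
```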

Two preference tests were conducted. In the first test, the speech quality of the converted voices was evaluated. Converted voice samples generated by two different VC systems for the same sentences were presented to subjects in random order, and the subjects selected which sample had better speech quality. In the second test, the conversion accuracy in terms of speaker identity was evaluated. A natural voice sample of the target speaker was first presented to the subjects as a reference; then, the converted voice samples generated by two different VC systems for the same sentences were presented in random order, and the subjects selected which sample was more similar to the reference natural voice in terms of speaker identity. The number of subjects was 10, and each listener evaluated 54 sample pairs in each evaluation. They were allowed to replay each sample pair as many times as necessary.

[Figure 2: Sound quality and accuracy on speaker identity in the VCC 2016 (MOS score for speech quality on the horizontal axis, preference score for accuracy on speaker identity [%] on the vertical axis; the NU-NAIST VC system, the other submitted systems, and the source and target references are shown).]

[Figure 3: AB preference test for speech quality, (a) intra-gender and (b) cross-gender: preference scores [%] with 95% confidence intervals for DIFFVC (EC) w/ MSPF, VC w/ GV, and DIFFVC w/ MSPF.]

[Figure 4: XAB test for accuracy on speaker identity, (a) intra-gender and (b) cross-gender, for the same three systems.]

4.2. Results of the VCC 2016

Figure 2 shows the overall result of the VCC 2016. The NU-NAIST VC system achieved quite high speech quality, over 3.0 MOS, and the best conversion accuracy (about 74%) among all submitted VC systems. In terms of conversion accuracy, our system achieved successful performance even though only a very simple prosodic conversion was performed. However, there is still a large gap between the converted voices and the natural target voices. Further improvements are expected from implementing a conversion method for prosodic patterns or from asking the source speakers to mimic the target prosodic patterns, which would be possible in several practical applications. In terms of speech quality, the NU-NAIST VC system causes serious quality degradation compared to natural voices, i.e., from 4.6 to 3.0 MOS. This quality degradation is mainly caused by using a vocoder to perform the excitation conversion, as shown in the next section. Therefore, the converted speech quality is expected to improve significantly with the development of a better analysis/synthesis technique than STRAIGHT.

4.3. Results of the subjective evaluation

Figures 3 (a) and (b) show the results of the preference test for speech quality. DIFFVC (EC) achieves speech quality equivalent to VC in both intra- and cross-gender conversion. On the other hand, DIFFVC achieves significantly higher speech quality than the other two methods in intra-gender conversion. This is because DIFFVC avoids using vocoding to generate the converted speech waveforms, making the conversion process free from various errors, such as F0 extraction errors and unvoiced/voiced decision errors. Note that DIFFVC in the cross-gender condition does not yield any significant quality improvement, as it suffers from mismatches between the spectral envelope and F0 in cross-gender conversion.

Figures 4 (a) and (b) show the results of the preference test for speaker identity. Although DIFFVC (EC) has conversion accuracy equivalent to VC in intra-gender conversion, it tends to be degraded in cross-gender conversion. It is likely that the residual spectral envelope preserved in the direct waveform modification process still includes speaker-dependent or gender-dependent features, and that this adversely affects conversion accuracy.
These results suggest that 1) the NU-NAIST VC system, which demonstrated the best conversion accuracy and high speech quality in the VCC 2016, performs almost equivalently to the conventional VC system in both intra-gender and cross-gender conversion, and 2) the direct waveform modification technique achieves significantly higher converted speech quality than the conventional VC system when excitation conversion is not necessary, as in intra-gender conversion; therefore, there is still large room to improve the converted speech quality of the NU-NAIST VC system.

5. Conclusions

This paper describes the details of the NU-NAIST voice conversion (VC) system for the Voice Conversion Challenge 2016 (VCC 2016), developed by a joint team of Nagoya University and Nara Institute of Science and Technology. In order to improve the quality of statistical VC based on a Gaussian mixture model (GMM), we applied the following techniques: 1) voice conversion with direct waveform modification with spectral differential (DIFFVC), 2) speech parameter trajectory smoothing, 3) post-filtering based on the modulation spectrum for DIFFVC, and 4) preprocessing for excitation conversion with F0 and aperiodic component transformations using high-quality vocoding. The experimental results demonstrated that the NU-NAIST VC system was highly ranked in the VCC 2016, that its performance was comparable to our conventional VC system, and that the DIFFVC technique has large potential to significantly improve the converted speech quality of the NU-NAIST VC system. In future work, we plan to implement high-quality F0 and aperiodicity transformation for the DIFFVC technique.

Acknowledgements

This work was supported in part by JSPS KAKENHI Grant Number 266 and Grant-in-Aid for JSPS Research Fellow Number 16J1726.

6. References

[1] T. Toda, "Augmented speech production based on real-time statistical voice conversion," Proc. GlobalSIP, pp. 755-759, Dec. 2014.
[2] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," J. Acoust. Soc. Jpn (E), vol. 11, no. 2, pp. 71-76, 1990.
[3] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. SAP, vol. 6, no. 2, pp. 131-142, Mar. 1998.
[4] N. Pilkington, H. Zen, and M. Gales, "Gaussian process experts for voice conversion," Proc. INTERSPEECH, pp. 2761-2764, Aug. 2011.
[5] N. Xu, Y. Tang, J. Bao, A. Jiang, X. Liu, and Z. Yang, "Voice conversion based on Gaussian processes by coherent and asymmetric training with limited training data," Speech Communication, vol. 58, pp. 124-138, Mar. 2014.
[6] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, "Voice conversion using deep neural networks with layer-wise generative training," IEEE/ACM Trans. ASLP, vol. 22, no. 12, pp. 1859-1872, Dec. 2014.
[7] L. Sun, S. Kang, K. Li, and H. Meng, "Voice conversion using deep bidirectional long short-term memory based recurrent neural networks," Proc. ICASSP, pp. 4869-4873, Apr. 2015.
[8] R. Takashima, T. Takiguchi, and Y. Ariki, "Exemplar-based voice conversion using sparse representation in noisy environments," IEICE Trans. on Inf. and Syst., vol. E96-A, no. 10, pp. 1946-1953, Oct. 2013.
[9] Z. Wu, T. Virtanen, E. Chng, and H. Li, "Exemplar-based sparse representation with residual compensation for voice conversion," IEEE/ACM Trans. ASLP, vol. 22, no. 10, pp. 1506-1521, June 2014.
[10] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum likelihood estimation of spectral parameter trajectory," IEEE Trans. ASLP, vol. 15, no. 8, pp. 2222-2235, Nov. 2007.
[11] S. Takamichi, T. Toda, A. W. Black, G. Neubig, S. Sakti, and S. Nakamura, "Postfilters to modify the modulation spectrum for statistical parametric speech synthesis," IEEE Trans. ASLP, vol. 24, no. 4, pp. 755-767, Jan. 2016.
[12] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3-4, pp. 187-207, Apr. 1999.
[13] Y. Ohtani, T. Toda, H. Saruwatari, and K. Shikano, "Maximum likelihood voice conversion based on GMM with STRAIGHT mixed excitation," Proc. INTERSPEECH, pp. 2266-2269, Sept. 2006.
[14] T. Toda, T. Muramatsu, and H. Banno, "Implementation of computationally efficient real-time voice conversion," Proc. INTERSPEECH, Sept. 2012.
[15] H. Dudley, "Remaking speech," JASA, vol. 11, no. 2, pp. 169-177, 1939.
[16] Y. Stylianou, "Applying the harmonic plus noise model in concatenative speech synthesis," IEEE Trans. SAP, vol. 9, no. 1, pp. 21-29, 2001.
[17] D. Erro, I. Sainz, E. Navas, and I. Hernaez, "Harmonics plus noise model based vocoder for statistical parametric speech synthesis," IEEE J-STSP, vol. 8, no. 2, pp. 184-194, 2014.
[18] K. Kobayashi, T. Toda, G. Neubig, S. Sakti, and S. Nakamura, "Statistical singing voice conversion with direct waveform modification based on the spectrum differential," Proc. INTERSPEECH, pp. 2514-2518, Sept. 2014.
[19] K. Kobayashi, T. Toda, G. Neubig, S. Sakti, and S. Nakamura, "Statistical singing voice conversion based on direct waveform modification with global variance," Proc. INTERSPEECH, pp. 2754-2758, Sept. 2015.
[20] K. Kobayashi, T. Toda, and S. Nakamura, "Implementation of F0 transformation for statistical singing voice conversion based on direct waveform modification," Proc. ICASSP, pp. 5670-5674, Mar. 2016.
[21] W. Verhelst and M. Roelands, "An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech," Proc. ICASSP, vol. 2, pp. 554-557, Apr. 1993.
[22] T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, and J. Yamagishi, "The Voice Conversion Challenge 2016," Proc. INTERSPEECH, Sept. 2016.
[23] S. Takamichi, T. Toda, A. W. Black, and S. Nakamura, "Parameter generation algorithm considering modulation spectrum for HMM-based speech synthesis," Proc. ICASSP, Apr. 2015.
[24] S. Takamichi, T. Toda, A. W. Black, and S. Nakamura, "Modulation spectrum-constrained trajectory training algorithm for GMM-based voice conversion," Proc. ICASSP, Apr. 2015.
[25] S. Takamichi, K. Kobayashi, K. Tanaka, T. Toda, and S. Nakamura, "The NAIST text-to-speech system for the Blizzard Challenge 2015," Proc. Blizzard Challenge Workshop, Sept. 2015.
[26] H. Kawahara, J. Estill, and O. Fujimura, "Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT," Proc. MAVEBA, Sept. 2001.
[27] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai, "Mel-generalized cepstral analysis - a unified approach to speech spectral estimation," Proc. ICSLP, pp. 1043-1045, 1994.