Hungarian Speech Synthesis Using a Phase Exact HNM Approach


Kornél Kovács, András Kocsor, and László Tóth

Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University of Szeged,
H-6720 Szeged, Aradi vértanúk tere 1., Hungary
{kkornel, kocsor, tothl}@inf.u-szeged.hu
http://www.inf.u-szeged.hu/speech

Abstract. Unnaturally sounding speech prevents listeners from grasping the message of the signal. In this paper we demonstrate how a precise initial phase approximation can improve the naturalness of artificially generated speech. Using the Harmonic plus Noise Model introduced by Stylianou as the framework for a Hungarian speech synthesizer, the exact initial phase extension of the system can be performed easily. The proposed method turns out to be more effective in preserving the sound characteristics and quality than the original one.

1 Introduction

The idea of artificially generating a high quality speech signal has been present in science for a long time ([1], [5], [9]). We do not intend to review all the relevant literature here, but there are some general features which help us categorize the existing approaches into the following types: the articulatory model, the formant tracking mechanism ([5]), and the concatenation method, which uses pre-recorded and analyzed natural speech signals to obtain the desired sound ([2], [3], [4], [8]).

The Harmonic plus Noise Model (HNM) is a well-known representative of concatenative speech synthesis ([7], [10]). The synthesis part of HNM can generate a prosodically modified speech signal using the parameters obtained in the analysis step. The model provided by Stylianou [11] regards a speech signal as the sum of a voiced part and an unvoiced noise part occupying distinct frequency bands, where the lower, voiced part can be expressed as a sum of harmonically related sinusoids. The analysis step determines the uppermost voiced frequency via a peak-picking algorithm that is based on the estimation of the pitch period. Because the noise part can also be modelled as a sum of harmonically related sinusoids [11], the analysis part ends with the computation of the sinusoid parameters at pitch-synchronous time instants. Moreover, in the synthesis step prosodic modifications can easily be carried out using this sinusoidal representation.

Using the zero-phase parameter estimation technique proposed by Stylianou we get convincing results. However, based on human listening tests we found that the initial phases of the sinusoids have a great influence on the naturalness of the speech.

Taking the initial phase into account within the HNM framework improves the naturalness of the speech signal quite significantly: the resulting artificial speech sounds more natural than the speech produced by the basic implementation of the Stylianou system.

2 Harmonic approximation

Firstly, let us assume that the parameters of the harmonics and the pitch period are nearly constant over a small time interval. This part of the model approximates the signal by a sum of harmonic sinusoids over such an interval. The signal is known at N time instants t = (t_1, ..., t_N)^T, where the signal values are s = (s_1, ..., s_N)^T. The approximation procedure optimizes the amplitudes and phases of the following equation:

    h(t) = a_0 + \sum_{k=1}^{L} a_k \cos(k \omega t + \psi_k),    (1)

where the vectors a and \psi contain the amplitudes and phases of the harmonic sinusoids. The number of harmonics L can be derived from the fundamental frequency and the maximal voiced frequency of the desired time instant. The optimal parameters are those values which minimize the squared error between the original signal and the approximated one:

    \epsilon = \sum_{t=t_1}^{t_N} W_{tt}^2 \bigl(s_t - h(t)\bigr)^2,    (2)

where W is a diagonal matrix with properly chosen weights. Stylianou makes use of equation (1) supposing that \psi_k = 0, which requires solving a set of linear equations when minimizing the error \epsilon. To obtain this set of equations we use the vector form of (1) without initial phases:

    h(t) = b^T(t) a,    (3)

where

    b^T(t) = \bigl(1, \cos(1 \omega t), ..., \cos(L \omega t)\bigr).

With this type of harmonic approximation we can rewrite equation (2) as

    \epsilon = \sum_{t=t_1}^{t_N} W_{tt}^2 \bigl(s_t - h(t)\bigr)^2 = \| W (s - B a) \|_2^2,    (4)

where the matrix B is

    B^T = \bigl(b(t_1), ..., b(t_N)\bigr).
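To make the zero-phase variant above concrete, here is a minimal numpy sketch (the function and variable names are my own, not from the paper) that derives L from the fundamental and maximal voiced frequencies, builds the cosine design matrix B of equation (3), and evaluates the weighted error of equation (4); it assumes the angular frequency omega is already known from a pitch estimate.

```python
import numpy as np

def num_harmonics(f0, max_voiced_freq):
    """Number of harmonics L below the maximal voiced frequency (assumed known from the HNM analysis)."""
    return int(max_voiced_freq // f0)

def cosine_basis(t, omega, L):
    """Matrix B whose rows are b(t)^T = (1, cos(1*omega*t), ..., cos(L*omega*t)) from eq. (3)."""
    k = np.arange(1, L + 1)
    return np.hstack([np.ones((len(t), 1)), np.cos(np.outer(t, k) * omega)])

def weighted_error(s, t, a, omega, W):
    """Quadratic error of eq. (4): || W (s - B a) ||_2^2 for a given amplitude vector a."""
    B = cosine_basis(t, omega, L=len(a) - 1)
    r = W @ (s - B @ a)
    return float(r @ r)
```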

The error function is expressed by the quadratic form (4), whose minimum defines the amplitudes of the harmonic sinusoids with no initial phase:

    B^T W^T W B a = B^T W^T W s.    (5)

Our approach does not place any restrictions on the form of equation (1) as Stylianou did. Although the approximation with non-harmonic sinusoids was solved by Kocsor et al. [6] only in a locally optimal way, our approach can determine the parameters of the harmonic sinusoid approximation in a globally optimal way by exploiting the known angular frequency. Applying the trigonometric identity

    \cos(\alpha + \beta) = \cos\alpha \cos\beta - \sin\alpha \sin\beta,

one can show that equation (1) can be re-expressed in vector form as

    h(t) = g^T(t) f,

where

    g^T(t) = \bigl(1, \cos(1 \omega t), ..., \cos(L \omega t), -\sin(1 \omega t), ..., -\sin(L \omega t)\bigr),
    f^T = \bigl(a_0, a_1 \cos\psi_1, ..., a_L \cos\psi_L, a_1 \sin\psi_1, ..., a_L \sin\psi_L\bigr).

Using this notation

    \epsilon = \| W (s - G f) \|_2^2,    (6)

where the matrix G is

    G^T = \bigl(g(t_1), ..., g(t_N)\bigr).

The above equation shows that the error of the initial-phase-exact harmonic approximation (1) can be expressed in quadratic form with a unique minimum:

    f = (G^T W^T W G)^{+} (G^T W^T W s),    (7)

where ^+ denotes the Moore-Penrose pseudo-inverse. After obtaining f, the amplitude and phase of each component can be computed by making use of the simple relations

    \psi_k = \arctan\frac{f_{1+L+k}}{f_{1+k}}, \qquad a_k = \frac{f_{1+k}}{\cos\psi_k}.

For the purpose of pitch scaling we need to interpolate the spectrum defined by the vector a with a parametric curve, such as a cepstrum with real-valued parameters. The phase envelope of \psi must be estimated as well when the phases have a monotonic character. The cepstral interpolation with real-valued parameters presumes that the interpolated values are non-negative, which can be achieved by using the identity

    -A \cos(\omega + \psi) = A \cos\bigl(\omega + \psi + (2k+1)\pi\bigr), \qquad k \in \mathbb{Z}.
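A minimal sketch, assuming numpy and the cosine_basis helper from the previous sketch (all other names here are hypothetical), of how equations (5) and (7) could be solved with the Moore-Penrose pseudo-inverse and how the amplitudes and phases are then recovered from f. Instead of the paper's arctan and 1/cos expressions it uses arctan2 and hypot, which gives the same decomposition while keeping the amplitudes non-negative, so the (2k+1)π sign correction is not needed in this sketch.

```python
import numpy as np

def fit_zero_phase(s, t, omega, L, W):
    """Zero-phase amplitudes a from eq. (5), i.e. assuming psi_k = 0."""
    B = cosine_basis(t, omega, L)                   # helper from the previous sketch
    WB, Ws = W @ B, W @ s
    return np.linalg.pinv(WB.T @ WB) @ (WB.T @ Ws)

def fit_phase_exact(s, t, omega, L, W):
    """Phase-exact parameters from eq. (7): returns a_0, amplitudes a_k and phases psi_k."""
    k = np.arange(1, L + 1)
    arg = np.outer(t, k) * omega
    # Rows of G are g(t)^T = (1, cos(k*omega*t)..., -sin(k*omega*t)...)
    G = np.hstack([np.ones((len(t), 1)), np.cos(arg), -np.sin(arg)])
    WG, Ws = W @ G, W @ s
    f = np.linalg.pinv(WG.T @ WG) @ (WG.T @ Ws)     # eq. (7) via the pseudo-inverse
    a0, c, d = f[0], f[1:L + 1], f[L + 1:]          # c_k = a_k cos(psi_k), d_k = a_k sin(psi_k)
    psi = np.arctan2(d, c)                          # four-quadrant form of the psi_k relation
    a = np.hypot(c, d)                              # a_k >= 0, any sign absorbed into the phase
    return a0, a, psi
```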

Fig. 1. Short-time signals (solid line) and their approximations (dashed line). Panels (a) and (b) display the same artificial harmonic signal, while the same part of the Hungarian vowel "a" is displayed in (c) and (d). Panels (a) and (c) show the approximation with precise initial phases, while (b) and (d) show the corresponding zero-phase estimation.

3 Experiments

Before dealing with the quality of the synthesized speech, we examine the solvability of the equations which provide the parameters of the different approaches. The short-time signals are two pitch periods long, so the number of time instants included in the approximation depends on the sampling rate and the pitch period. Experience shows that the sets of linear equations (5) and (7) become singular when the short-time signal length is less than about 4 times the pitch period. To avoid using the ordinary inverse, and to ensure that we find the best-fitting harmonic approximation, we employ the Moore-Penrose pseudo-inverse in both (5) and (7). It can be used in both cases because the parameters are computed via a set of linear equations in each case. The pseudo-inverse can be computed with the help of the Singular Value Decomposition (SVD), which ensures that its computational cost is proportional to the rank of the matrix. This means that the zero-phase and the precise initial phase approaches can generate the amplitudes and phases at about the same computational cost, because the ranks of the coefficient matrices are nearly the same in both cases.

In the artificial signal domain a comparison of the original and the synthetic signal was performed. The same short-time frame of an artificial harmonic signal can be seen in Figs. 1(a) and (b). It is readily apparent that the approximation with precise initial phases describes the original signal much more accurately than the zero-phase version does.
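As a usage example, the sketch below (assuming the helper functions from the previous sketches; the sampling rate, pitch, and harmonic parameters are made-up values, not taken from the paper) reproduces the kind of comparison illustrated in Figs. 1(a) and (b): an artificial harmonic signal with random initial phases, a window of two pitch periods, and the reconstruction errors of the zero-phase and phase-exact fits.

```python
import numpy as np

# Hypothetical test settings, roughly mirroring the Fig. 1(a)/(b) comparison
fs, f0, L = 16000.0, 100.0, 5
omega = 2 * np.pi * f0
t = np.arange(int(2 * fs / f0)) / fs                 # window of two pitch periods
rng = np.random.default_rng(0)
a_true = rng.uniform(0.2, 1.0, L)
psi_true = rng.uniform(-np.pi, np.pi, L)
s = sum(a * np.cos(k * omega * t + p)
        for k, (a, p) in enumerate(zip(a_true, psi_true), start=1))

W = np.eye(len(t))                                   # uniform weights for simplicity
a_zero = fit_zero_phase(s, t, omega, L, W)
a0, a_hat, psi_hat = fit_phase_exact(s, t, omega, L, W)

h_zero = cosine_basis(t, omega, L) @ a_zero
h_exact = a0 + sum(a * np.cos(k * omega * t + p)
                   for k, (a, p) in enumerate(zip(a_hat, psi_hat), start=1))
print(f"zero-phase error:  {np.linalg.norm(s - h_zero):.3e}")
print(f"phase-exact error: {np.linalg.norm(s - h_exact):.3e}")
```

On such a purely harmonic frame with non-zero initial phases, the phase-exact fit reconstructs the signal essentially to machine precision, while the zero-phase fit leaves a visible residual, which matches the behaviour illustrated in Fig. 1.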

In the human speech domain the quality of the various synthesis models was judged by informal listening. The series of tests carried out clearly showed that the model with initial phases preserves much more of the detail of the original speech, which means a more natural and cleaner artificial signal. This difference appears most strikingly in the case of prosodic modification, where the less accurate approximation of the zero-phase method leads to a metallic-sounding signal. Figs. 1(c) and (d) show an example of the Hungarian vowel "a" with the precise-phase and the zero-phase approximation, respectively. The implemented models were tested on a segmented Hungarian speech database, which makes it possible to build a text-to-speech system.

In conclusion, it is clear that the use of exact initial phase approximations is beneficial for a speech synthesis system, as the model is more realistic and it allows for the possibility of modifying prosodic information.

References

1. Allen, J.: Overview of Text-to-Speech Systems. In: Furui, S., Sondhi, M. (eds.): Advances in Speech Signal Processing, pp. 741-790, 1991.
2. Dutoit, T.: High quality text-to-speech synthesis: A comparison of four candidate algorithms. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 565-568, 1994.
3. Dutoit, T., Leich, H.: Text-to-speech synthesis based on a MBE re-synthesis of the segments database. Speech Communication, 13:435-440, 1993.
4. Gimenez de los Galanes, F. M., Savoji, M. H., Pardo, J. M.: New algorithm for spectral smoothing and envelope modification for LP-PSOLA synthesis. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 573-576, 1994.
5. Klatt, D. H.: Review of text-to-speech conversion for English. J. Acoust. Soc. Am., 82(3):737-793, September 1987.
6. Kocsor, A., Tóth, L., Bálint, I.: On the Optimal Parameters of a Sinusoidal Representation of Signals. Acta Cybernetica, 14:315-330, 1999.
7. McAulay, R. J., Quatieri, T. F.: Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-34(4):744-754, August 1986.
8. Moulines, E., Charpentier, F.: Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication, 9(5/6):453-467, December 1990.
9. Rabiner, L. R.: Applications of voice processing to telecommunications. Proc. IEEE, 82(2):199-228, February 1994.
10. Serra, X.: A System for Sound Analysis/Transformation/Synthesis Based on a Deterministic Plus Stochastic Decomposition. PhD thesis, Stanford University, Stanford, CA, 1989.
11. Stylianou, Y.: Harmonic plus Noise Models for Speech, combined with Statistical Methods, for Speech and Speaker Modification. PhD thesis, 1996.