Yoshiyuki Ito, 1 Koji Iwano 2 and Sadaoki Furui 1


A study on prosody control for spontaneous speech synthesis

Yoshiyuki Ito, Koji Iwano and Sadaoki Furui

This paper investigates several topics related to high-quality prosody estimation in HMM-based spontaneous speech synthesis. In our prosody control method, phoneme duration and fundamental frequency (F0) are modeled by Quantification Theory (Type I). We first analyzed how the number and kinds of prosody control factors affect the spontaneity of synthesized speech. The analysis showed that pause length between prosodic phrases was one of the important duration control factors for spontaneous speech, although it is not particularly useful for duration control in reading-style speech synthesis. Next, we investigated why the current prosody control method cannot sufficiently model the F0 features of spontaneous speech. The analysis confirmed that the distribution of estimated phrase-averaged F0 values was narrower than the original (correct) distribution, and that this narrowing caused the low spontaneity of the synthesized speech. It also confirmed that spontaneity could be significantly improved if the correct phrase-averaged F0 values were assigned to the phrases whose original F0 values were located in a high-frequency region.

1 Department of Computer Science, Tokyo Institute of Technology
2 Faculty of Environmental and Information Studies, Tokyo City University

c 9 Information Processing Society of Japan
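As a rough illustration of the modeling approach named in the abstract, Quantification Theory (Type I) can be viewed as least-squares regression of a numeric target on categorical factors, where each category of each factor receives an additive score. The sketch below is not the authors' implementation: the function names, the two example factors, and the duration values are all hypothetical, and it uses reference coding (the first category of each factor scores 0) rather than the mean-centered scores that are often reported.

```python
def fit_type1(samples, targets):
    """Fit y ~ bias + sum of per-factor category scores by least squares.

    samples: list of tuples of categorical values, one entry per factor.
    targets: list of floats (for example phoneme durations in ms).
    Returns (bias, scores) with scores[j][category] as the category score.
    """
    n_factors = len(samples[0])
    # Categories observed for each factor; the first is the reference.
    cats = [sorted({s[j] for s in samples}) for j in range(n_factors)]
    columns = [(j, c) for j in range(n_factors) for c in cats[j][1:]]

    def encode(sample):
        # Intercept followed by one dummy variable per non-reference category.
        return [1.0] + [1.0 if sample[j] == c else 0.0 for j, c in columns]

    X = [encode(s) for s in samples]
    p = len(columns) + 1
    # Augmented normal equations: (X^T X | X^T y).
    A = [[sum(x[a] * x[b] for x in X) for b in range(p)]
         + [sum(x[a] * y for x, y in zip(X, targets))]
         for a in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular design: factors are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        d = A[col][col]
        A[col] = [v / d for v in A[col]]
        for r in range(p):
            if r != col:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    weights = [A[i][p] for i in range(p)]
    scores = [{c: 0.0 for c in cats[j]} for j in range(n_factors)]
    for (j, c), w in zip(columns, weights[1:]):
        scores[j][c] = w
    return weights[0], scores


def predict(bias, scores, sample):
    """Predicted value: bias plus the score of each factor's category."""
    return bias + sum(scores[j][c] for j, c in enumerate(sample))


# Made-up data: (phoneme class, following pause length) -> duration in ms.
samples = [("consonant", "short"), ("vowel", "short"),
           ("consonant", "long"), ("vowel", "long")]
targets = [80.0, 100.0, 95.0, 115.0]
bias, scores = fit_type1(samples, targets)
print(predict(bias, scores, ("vowel", "long")))
```

Adding or removing a control factor in this formulation only changes the tuple length of each sample, which is one way the effect of the number of factors could be examined.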

[Figure: HMM-based TTS system. Labeled blocks and signals: Japanese text; text analysis; language information (phoneme sequence, accent, etc.); phoneme HMM (cepstral feature model); phoneme duration model (Quantification Theory, Type I); phoneme duration; cepstrum parameter generation; mel-cepstrum sequence; fundamental frequency information model (Quantification Theory, Type I); F0 contour; voice source generation (pulses: voiced sound; noises: unvoiced sound); voice source waveform; MLSA filter; synthesized speech.]

Phoneme categories:
vowel: /a/, /i/, /u/, /e/, /o/
syllabic nasal: /N/
choked sound: /Q/
long vowel: /-/
voiced stop: /b/, /d/, /g/
unvoiced stop: /p/, /t/, /k/
voiced fricative: /z/, /j/
affricate: /ch/, /ts/
unvoiced fricative: /f/, /h/, /s/, /sh/
nasal consonant: /m/, /n/
liquid: /r/
semi vowel: /w/, /y/
palatalized consonant: /by/, /dy/, /gy/, /py/, /ky/, /hy/, /ry/, /my/, /ny/...
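The voice source stage labeled in the system diagram (pulses for voiced sound, noises for unvoiced sound) can be sketched as follows. This is a minimal illustration under assumptions: the function name, the per-frame F0 input convention (0 marks an unvoiced frame), the unit pulse amplitude, and the Gaussian noise are all choices made here, not details taken from the paper.

```python
import random


def generate_excitation(f0_per_frame, frame_len, sample_rate, seed=0):
    """Build an excitation signal from a frame-level F0 contour.

    f0_per_frame: F0 in Hz for each frame; 0 means the frame is unvoiced.
    frame_len:    number of samples per frame.
    Voiced frames produce a pulse train at the frame's F0;
    unvoiced frames produce white Gaussian noise.
    """
    rng = random.Random(seed)
    out = []
    phase = 0.0  # fraction of the current pitch period already elapsed
    for f0 in f0_per_frame:
        for _ in range(frame_len):
            if f0 > 0:
                phase += f0 / sample_rate
                if phase >= 1.0:
                    phase -= 1.0
                    out.append(1.0)  # one pulse per pitch period
                else:
                    out.append(0.0)
            else:
                phase = 0.0  # restart the pitch period after unvoiced audio
                out.append(rng.gauss(0.0, 1.0))
    return out


# One voiced frame at 125 Hz followed by one unvoiced frame, 16 kHz sampling.
excitation = generate_excitation([125.0, 0.0], frame_len=256, sample_rate=16000)
print(len(excitation))
```

Carrying the phase across voiced frames keeps the pulse spacing continuous when F0 changes from one frame to the next. In a full system this waveform would then be passed through the synthesis filter.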


F0 values are expressed in semitones: F0 [semitone] = 12 log2(F0 [Hz] / ...).

[Figure: RMS error of phoneme duration [ms] and preference score [%] versus number of factors.]

[Figure: RMS error of mora pitch [semitone] and preference score [%] versus number of factors.]
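The two quantities used in this evaluation, F0 on a semitone scale and RMS error, can be computed as in the sketch below. The reference frequency did not survive in the text, so `ref_hz` is left as a required parameter and the 110 Hz used in the example is purely an assumption; the function names are also illustrative.

```python
import math


def hz_to_semitone(f0_hz, ref_hz):
    """Convert F0 in Hz to semitones relative to ref_hz (12 per octave)."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def rms_error(estimated, correct):
    """Root-mean-square error between two equal-length sequences."""
    if len(estimated) != len(correct):
        raise ValueError("sequences must have the same length")
    squared = [(e - c) ** 2 for e, c in zip(estimated, correct)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up mora-level F0 values in Hz, converted with an assumed 110 Hz reference.
estimated = [hz_to_semitone(f, 110.0) for f in (200.0, 220.0, 180.0)]
correct = [hz_to_semitone(f, 110.0) for f in (210.0, 250.0, 170.0)]
print(rms_error(estimated, correct))
```

Because the semitone scale is logarithmic, the choice of reference only shifts all values by a constant and does not change the RMS error between two contours.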

[Figure: Distributions of estimated and correct F0 values; axis: F0 [semitone].]
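The abstract reports that estimated phrase-averaged F0 values are distributed more narrowly than the correct ones, and that assigning correct values to phrases in the high-frequency region improves spontaneity. The sketch below shows one way such an analysis could be set up. It is an assumption-laden illustration: the threshold of mean plus `k` standard deviations is inferred from the fragmentary symbols in the text, and the function names and data are hypothetical.

```python
import statistics


def spread_ratio(estimated, correct):
    """Ratio of standard deviations; below 1 means the estimates are narrower."""
    return statistics.pstdev(estimated) / statistics.pstdev(correct)


def substitute_high_region(estimated, correct, k=1.0):
    """Use the correct value for phrases whose correct F0 exceeds mean + k*std.

    Both inputs are phrase-averaged F0 values in semitones, one per phrase.
    All other phrases keep their estimated value.
    """
    mu = statistics.fmean(correct)
    sigma = statistics.pstdev(correct)
    threshold = mu + k * sigma
    return [c if c > threshold else e for e, c in zip(estimated, correct)]


# Made-up phrase-averaged F0 values (semitones).
correct = [0.0, 0.0, 0.0, 0.0, 10.0]
estimated = [1.0, 1.0, 1.0, 1.0, 3.0]
print(spread_ratio(estimated, correct))
print(substitute_high_region(estimated, correct))
```

In an actual experiment the substituted phrase averages would still have to be turned back into F0 contours before synthesis, which this sketch does not cover.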

[Tables: preference test results [%] for open and closed conditions, with significance markers (*, **, ***); the numeric values are not recoverable.]

References

S. Werner, M. Eichner, M. Wolff, and R. Hoffmann, "Toward spontaneous speech synthesis: Utilizing language model information in TTS," IEEE Transactions on Speech and Audio Processing.

K. Iwano, M. Yamada, T. Togawa, and S. Furui, "Prosody control for HMM-based Japanese TTS," in S. Narayanan and A. Alwan (Eds.), Text to Speech Synthesis: New Paradigms and Advances, Prentice Hall PTR, New Jersey.

H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigne, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication.
