
INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET)
Proceedings of the 2nd International Conference on Current Trends in Engineering and Management, ICCTEM-2014
Volume 5, Issue 8, August (2014)

MODIFIED SYNTHESIS STRATEGY FOR VOWELS AND SEMI-VOWELS (KLATT SYNTHESIZER)

Alfred Vivek D'Souza 1, Dr. D. J. Ravi 2
1 M.Tech, Signal Processing, Vidyavardhaka College of Engineering, Mysore, India
2 Professor and HOD, ECE, Vidyavardhaka College of Engineering, Mysore, India

ABSTRACT

Klatt synthesizers are among the most widely used formant synthesizers. They are usually implemented with either a fixed or a variable parameter update rate. This paper proposes a new method of storing control parameters, together with a new parameter update strategy, to improve the naturalness of synthesized vowel and semi-vowel sounds.

Keywords: Klatt Synthesizer, Kannada Vowels and Semi-vowels Synthesis.

1. INTRODUCTION

Speech synthesis is one of the most researched domains in speech processing. Many speech synthesis strategies are in use, of which the most important are concatenative synthesis, articulatory synthesis and formant synthesis. Formant synthesis is often preferred over the other two for its simplicity and the ease with which it can be implemented on general-purpose computers. The cascade/parallel synthesizer was first proposed by Dennis H. Klatt [1] in 1980, and the synthesis strategy was slightly modified in 1990 [2]; to this day it remains one of the most popular formant synthesizer configurations. The next major revision of this class of synthesizers was the KlattGrid synthesizer by David Weenink [3], which is incorporated in the Praat software tool.

1.1 Klatt Class Synthesizers

Klatt class synthesizers are based on the source-filter model of speech production, as shown in Fig. 1.

Fig. 1: Block diagram of Klatt synthesizer

The Klatt synthesizer can be divided into five parts.

1) Excitation Sources: There are two excitation sources in this model: the voicing source for voiced sounds and the frication source for unvoiced sounds.

2) Coupling: This part consists of nasal and tracheal pole-zero filters that model the state of the vocal tract [6] during nasal sound production.

3) Cascade Vocal Tract: This part consists of a series of band-pass filters, called resonators, labelled R1 to R5. F1 to F5 denote the resonance frequencies of these resonators and B1 to B5 denote their resonance bandwidths.

4) Parallel Vocal Tract: This is an alternative vocal tract model in which the resonators are arranged in parallel; in Fig. 1 they are denoted by Rp1 to Rp5. Each of these resonators has its own amplitude control parameter, denoted A2 to A6. AB is the bypass amplitude control.

5) Radiation Characteristics: This block models the lip radiation characteristics. Usually a first difference of the output samples suffices for this model.

The basic principle of speech production is that the waveform generated by the excitation source is modified by the vocal tract resonators so as to mimic the human vocal tract system. The Klatt synthesizer provides two options for the excitation source and two for the vocal tract model; the appropriate combination depends on the type of sound to be produced. For sound units that involve vibration of the vocal cords, the voicing source and the cascade vocal tract model are selected; examples of this class are vowels, semi-vowels and diphthongs. For sound units that do not involve vocal cord vibration, the frication source and the parallel vocal tract model are selected.

1.2 Excitation Sources

The voicing source generates the periodic pulses that represent the glottal pulses produced by the vibrating vocal cords, using the Rosenberg model [4]. The glottal pulses are generated at the fundamental frequency (F0), also known as the pitch. The amplitude of the pulse train is specified by the voicing amplitude (Av). The amplitude of aspiration (Ah) and the amplitude of breathiness (Ab) are used to simulate breathy sounds. The open quotient (OQ) is the ratio of the open phase of the vocal cords to the pitch period. Fig. 2 shows a sample glottal pulse waveform. The noise-like appearance during the open phase of the pulse produces the breathiness effect; if the breathiness amplitude is set to 0, the open phase of the pulse traces the red line shown in the figure.

Fig. 2: Sample glottal pulse
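The paper gives no code for the voicing source; as an illustration only, the sketch below generates one glottal cycle with the polynomial pulse shape of the Klatt and Klatt (1990) synthesizer (a variant of the Rosenberg model). The function name, the sampling rate default and the additive-noise breathiness model are our own assumptions, not the authors' implementation.

    import numpy as np

    def glottal_cycle(F0, Av, OQ, Ab=0.0, fs=10000):
        """One pitch period of a polynomial (KLGLOTT88-style) glottal pulse.

        F0 : fundamental frequency in Hz
        Av : voicing amplitude (peak flow)
        OQ : open quotient, open phase as a fraction of the pitch period
        Ab : breathiness amplitude; noise is added during the open phase only
        fs : sampling rate in Hz
        """
        T0 = 1.0 / F0                        # pitch period in seconds
        t = np.arange(int(round(fs * T0))) / fs
        # u(t) = a*t^2 - b*t^3 rises from 0, peaks at Av when
        # t = (2/3)*OQ*T0, and returns to 0 at t = OQ*T0.
        a = 27.0 * Av / (4.0 * OQ**2 * T0**2)
        b = 27.0 * Av / (4.0 * OQ**3 * T0**3)
        open_phase = t < OQ * T0
        u = np.where(open_phase, a * t**2 - b * t**3, 0.0)
        # Breathiness: noise superimposed on the open phase (cf. Fig. 2).
        u += np.where(open_phase, Ab * np.random.randn(len(t)), 0.0)
        return u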

Unvoiced sounds use the frication source as the excitation source. The parameter amplitude of frication (Af) controls the amplitude of the frication source.

1.3 Digital Resonators and Anti-Resonators

Resonators are the building blocks of the synthesizer. A resonance frequency F and a resonance bandwidth B characterize each resonator. The resonators used in the Klatt and KlattGrid synthesizers are all-pole filters. The output sample y(n) for a given input sample x(n) is calculated with the difference equation

    y(n) = A·x(n) + B·y(n-1) + C·y(n-2)    (1)

where y(n-1) and y(n-2) are the two previous output samples. If the sampling period is T, the coefficients A, B and C are calculated as

    C = -exp(-2·pi·B·T)
    B = 2·exp(-pi·B·T)·cos(2·pi·F·T)    (2)
    A = 1 - B - C

where the F and B on the right-hand sides denote the resonance frequency and bandwidth.

Anti-resonators are used for coupling and for the generation of nasal sounds. They are implemented as FIR filters with the difference equation

    y(n) = A'·x(n) + B'·x(n-1) + C'·x(n-2)    (3)

The coefficients A', B' and C' are derived from the resonator coefficients A, B and C by the transformations

    A' = 1/A,   B' = -B/A,   C' = -C/A    (4)
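To make equations (1) to (4) concrete, the following sketch implements one resonator and its matching anti-resonator. The function names and the use of scipy.signal.lfilter are our own illustrative choices; the coefficient formulas are those of equations (2) and (4).

    import numpy as np
    from scipy.signal import lfilter

    def resonator_coeffs(F, B, fs):
        """Coefficients of equation (2) for resonance frequency F (Hz),
        bandwidth B (Hz) and sampling rate fs (Hz); T = 1/fs."""
        T = 1.0 / fs
        C = -np.exp(-2.0 * np.pi * B * T)
        Bc = 2.0 * np.exp(-np.pi * B * T) * np.cos(2.0 * np.pi * F * T)
        A = 1.0 - Bc - C
        return A, Bc, C

    def resonate(x, F, B, fs):
        """All-pole resonator, equation (1):
        y(n) = A x(n) + B y(n-1) + C y(n-2)."""
        A, Bc, C = resonator_coeffs(F, B, fs)
        # lfilter implements y(n) = b0 x(n) - a1 y(n-1) - a2 y(n-2)
        # for a = [1, a1, a2], so the feedback signs are negated.
        return lfilter([A], [1.0, -Bc, -C], x)

    def antiresonate(x, F, B, fs):
        """FIR anti-resonator of equation (3), coefficients per equation (4)."""
        A, Bc, C = resonator_coeffs(F, B, fs)
        Ap, Bp, Cp = 1.0 / A, -Bc / A, -C / A
        return lfilter([Ap, Bp, Cp], [1.0], x)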

1.4 Database for Vowels and Semi-vowels

For synthesizing vowels and semi-vowels, the voicing source and the cascade vocal tract model are used. The parameters used for vowel and semi-vowel synthesis are the pitch (F0), the formant frequencies (F1 to F5) and their bandwidths (B1 to B5), the open quotient (OQ), the voicing amplitude (Av), the breathiness amplitude (Ab) and the aspiration amplitude (Ah). The other parameters mentioned in [1] and [5] can be kept constant.

All of the above parameters vary with time. As a result, for any given sound, each parameter is not a single value but a set of values at different times, known as a contour. The database should capture how these parameters change with time. Vowels and semi-vowels are continuants, which means that parameters such as the pitch, the formant frequencies and their bandwidths vary slowly with time. This property is exploited in creating the database. For the Klatt synthesizer, a sample speech utterance is recorded and partitioned into equal frames, usually of 5 ms duration each, and for each frame the representative values of the parameters under consideration are stored. Table 1 shows a sample five-frame database involving the pitch and the first formant frequency; the same procedure is applied to all other contours as well.

Table 1: Sample parameters for Klatt synthesizer for Kannada vowel /a/

    Frame #     F0 in Hz    F1 in Hz    B1 in Hz
    1 (5 ms)
    2 (5 ms)
    3 (5 ms)
    4 (5 ms)
    5 (5 ms)

Database generation for the KlattGrid synthesizer is similar, except that the KlattGrid synthesizer uses a variable frame size to capture the variation of the parameters more precisely. Table 2 shows a sample three-frame database involving the pitch and the first formant frequency; a sketch of how such tables are consumed during synthesis follows the table.

Table 2: Sample parameters for KlattGrid synthesizer for Kannada vowel /a/

    Row     Time in s   F0 in Hz    F1 in Hz    B1 in Hz
    1
    2
    3
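As a sketch of how such a frame database is consumed, the helper below returns the parameter row in force at a given time. It covers both the fixed-frame layout of Table 1 (update times at 5 ms multiples) and the variable-frame layout of Table 2 (explicit update times). The function name, the list-of-dicts layout and the numeric values in the example are illustrative assumptions, not values from the paper.

    import bisect

    def params_at(update_times, rows, t):
        """Piecewise-constant lookup: the parameter row in force at time t.

        update_times : sorted times (s) at which parameters change; for a
                       Table 1 layout this is [0.000, 0.005, 0.010, ...],
                       for a Table 2 layout the stored update times.
        rows         : parameter rows (dicts of F0, F1, B1, ...) matching
                       update_times one for one.
        """
        i = bisect.bisect_right(update_times, t) - 1
        return rows[max(i, 0)]

    # Example with a Table-2-style database (all values illustrative):
    times = [0.00, 0.02, 0.03]
    rows = [{"F0": 150, "F1": 720, "B1": 150},
            {"F0": 148, "F1": 760, "B1": 160},
            {"F0": 145, "F1": 780, "B1": 170}]
    print(params_at(times, rows, 0.025))   # row that took effect at t = 0.02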

1.5 Synthesis of Vowels and Semi-vowels

In the Klatt synthesizer, the synthesis strategy is quite straightforward [2]. The excitation waveform is first generated frame by frame by providing the respective pitch (F0), Av, Ah, Ab and OQ values to the voicing source block: for the example parameters of Table 1, the first 5 ms of the voicing waveform are generated with the pitch value of frame 1, the next 5 ms with the pitch value of frame 2, and so on. The excitation waveform is then filtered with resonator R1; for the first 5 ms, R1 takes the resonance frequency and bandwidth of frame 1, for the next 5 ms those of frame 2, and so on. The same principle is applied to the remaining resonators. In other words, the parameters of the voicing source block and of each resonator block are updated once every 5 ms.

The KlattGrid synthesizer works with a similar synthesis strategy [3], but the parameter update rate varies. For the example parameters shown in Table 2, the initial parameters are those of row 1 of the table; the first parameter update happens after 0.2 s with the parameters of row 2, and the second parameter update happens 0.1 s after the previous one. Fig. 3 shows the spectrograms of the recorded and the synthesized Kannada vowel /a/, the latter generated with the KlattGrid synthesizer available in the Praat software with a fixed frame size.

Fig. 3: Spectrogram of recorded sound (top) and synthesized sound (bottom)

2. PROPOSED METHOD

The spectrogram of the synthesized vowel shown in Fig. 3 preserves the overall properties of the sound, and the generated sound is intelligible, but it lacks the naturalness of the original recording. This happens because of improper frame duration selection. If the frame duration is excessively long compared with the pitch period, the parameters remain the same for more than one pitch period, then jump to new values at the next frame and are again held constant throughout that frame, giving the spectrogram a striped appearance. On the other hand, if the frame is very short and the frame duration is not an integral multiple of the pitch period of the frame, serious distortions can occur; this type of distortion in the time domain is shown in Fig. 4.

Fig. 4: Termination of frame before completion of pitch period

In Fig. 4, the vertical line indicates the termination of frame n-1 and the commencement of frame n. The pitch pattern, however, is not complete and is terminated abruptly.

2.1 Pitch Synchronous Parameter Update Method

To avoid these distortions and to make the synthesized sound more natural, the parameters should be updated once every pitch period. This ensures that the pitch pattern is completed before new parameters are applied to the voicing source and the resonators. It also implies sampling the parameters in synchrony with the pitch, storing those samples in the database, and using the KlattGrid strategy for synthesis. However, pitch-synchronous sampling of the parameters is a tedious job, and the number of samples to be stored is high compared with the fixed-time parameter update case. Hence the new method of database creation described below can be used.

2.2 Database Creation and Synthesis Strategy for the Pitch Synchronous Parameter Update Method

The pitch synchronous parameter update method requires the parameters to be sampled once every pitch period and stored in the database. However, if the pitch contours, the formant frequency contours and the corresponding bandwidth contours of Kannada vowels and semi-vowels are observed carefully, it can be noticed that all of these contours vary smoothly with time. This makes it possible to avoid sampling every parameter once per pitch period: instead, each contour can be fitted with a polynomial curve and the polynomial coefficients stored. The nth-degree polynomial curve is

    P_n(t) = a_0 + a_1·t + a_2·t^2 + ... + a_n·t^n    (5)

where a_0, a_1, ..., a_n are the coefficients and t is the time index. Before curve fitting is performed on any contour, the time axis is normalized to the range 0 to 1. Fig. 5 shows an example of curve fitting of the pitch contour for the Kannada vowel /aa/.

Fig. 5: Pitch contour curve fitting for Kannada vowel /aa/

The top panel of Fig. 5 shows the recorded vowel sound /aa/, which is 0.3363 s long. The middle panel shows the identified pitch values plotted against time. The bottom panel shows a fourth-order polynomial curve fitted to the pitch points after normalizing the time axis; the fit yields the five coefficients a_0 to a_4 (for this contour, a_1 = -24.9).
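A minimal sketch of the curve-fitting step just described: the time axis of a measured pitch contour is normalized to the range 0 to 1, and a fourth-degree polynomial is fitted. numpy.polyfit and the synthetic contour values stand in for the MATLAB curve-fitting tool and the measured data the authors used.

    import numpy as np

    # Measured pitch contour: times (s) and F0 values (Hz); values are
    # illustrative, not the paper's measurements.
    t = np.linspace(0.0, 0.3363, 25)             # utterance of 0.3363 s
    f0 = 140 + 30 * np.sin(np.pi * t / 0.3363)   # smooth, slowly varying

    # Normalize the time axis to [0, 1] before fitting (Section 2.2).
    t_norm = t / t.max()

    # Fit P(t) = a0 + a1*t + ... + a4*t^4, equation (5). np.polyfit
    # returns the highest degree first, so reverse to a0..a4 order.
    coeffs = np.polyfit(t_norm, f0, deg=4)[::-1]

    # The database stores the coefficients plus the actual duration.
    database_entry = {"duration": 0.3363, "F0_poly": coeffs.tolist()}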

In the database, for the vowel /aa/, the respective polynomial coefficients are stored instead of the actual parameter contours, together with one extra parameter, the actual duration; for the vowel /aa/ the actual duration is 0.3363 s. To recover the value of any parameter at a required time instant t_1, the polynomial P(t) representing its contour is evaluated at t = t_1/(actual duration).

2.3 Synthesis Strategy

Vowel and semi-vowel synthesis is carried out in two phases: phase 1 generates the voicing waveform, and phase 2 filters the generated voicing waveform through the series of resonators. The first phase is given by the following algorithm (a code sketch follows the two algorithms).

1) Create an empty buffer to hold the voicing waveform.
2) Fix the sampling rate Fs and specify the synthesis duration.
3) Initialize the next parameter update time: τ = 0.
4) Evaluate F0, Av, Ah and OQ from their respective polynomials at t = τ/(synthesis duration).
5) Generate the voicing waveform using the Rosenberg model for a duration of 1/F0 with the evaluated F0, Av, Ah and OQ.
6) Concatenate the generated voicing waveform with the buffer.
7) Set τ = τ + 1/F0.
8) If τ >= synthesis duration, stop; otherwise go to step 4.

The second phase filters the voicing waveform with the resonators one by one. The pitch-synchronous filtering approach is given below for one resonator.

1) From the database, read the frequency contour and the bandwidth contour corresponding to the resonator and the sound, and also read the F0 contour.
2) Initialize the next parameter update time: τ = 0.
3) Evaluate the resonance frequency F, the resonance bandwidth B and F0 at t = τ/(synthesis duration).
4) Design the filter with the calculated F and B using equation (2).
5) Filter the portion of the voicing waveform from τ to τ + 1/F0 with the designed filter.
6) Set τ = τ + 1/F0.
7) If τ >= synthesis duration, go to step 8; otherwise go to step 3.
8) Repeat all of the above steps for each resonator.
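A minimal sketch of the phase-1 algorithm, under the same illustrative names used earlier: glottal_cycle is the pulse generator sketched in Section 1.2, and poly_eval is the Horner evaluator sketched in Section 3 below. Aspiration and breathiness are omitted for brevity, and the layout of the db dictionary is our own assumption.

    import numpy as np

    def synthesize_voicing(db, synth_dur, fs=10000):
        """Phase 1: pitch-synchronous generation of the voicing waveform.
        db holds one coefficient list per contour plus the duration."""
        buf = []                                 # step 1: empty buffer
        tau = 0.0                                # step 3: next update time
        while tau < synth_dur:                   # step 8: stop condition
            tn = tau / synth_dur                 # step 4: normalized time
            F0 = poly_eval(db["F0_poly"], tn)    # evaluate stored contours
            Av = poly_eval(db["Av_poly"], tn)
            OQ = poly_eval(db["OQ_poly"], tn)
            # step 5: one full pitch period with the current parameters
            buf.append(glottal_cycle(F0, Av, OQ, fs=fs))
            tau += 1.0 / F0                      # step 7: advance one period
        return np.concatenate(buf)               # step 6: concatenated output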

3. IMPLEMENTATION

The database for Kannada vowels and semi-vowels was created by first obtaining the various contours using the Praat tool. Praat provides an implementation of the KlattGrid synthesizer with two parts, an analysis system and a synthesis system. Fig. 6 shows the pitch and formant frequency contours extracted with Praat's KlattGrid tool for the Kannada vowel /a/. The contours were fitted to polynomial curves using Matlab's curve fitting tool. It was observed that most of the contours required at most a 6th-degree polynomial, with the exception of a few contours that change rapidly.

Fig. 6: Pitch and formant contours of Kannada vowel /a/

The polynomial coefficients of all the required contours were stored in an XML file. Horner's method was employed to evaluate a polynomial at any desired time instant. Fig. 7 shows the pitch and formant contours calculated from the polynomials for the vowel /a/.

Fig. 7: Pitch and formant contours calculated from polynomials
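A minimal sketch of the Horner evaluation step, assuming the coefficient lists are stored in ascending order a_0 to a_n as in equation (5); the function name and the database_entry example from Section 2.2 are illustrative.

    def poly_eval(coeffs, t):
        """Horner's method: evaluate a0 + a1*t + ... + an*t^n at t."""
        acc = 0.0
        for a in reversed(coeffs):   # highest-degree coefficient first
            acc = acc * t + a
        return acc

    # Reconstructing a contour value at time t1 (in seconds):
    # f0 = poly_eval(database_entry["F0_poly"],
    #                t1 / database_entry["duration"])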

The Klatt synthesizer with the proposed parameter update method was implemented in Matlab. The algorithms of Section 2.3 were used to synthesize the vowel /a/. The synthesized vowel waveform, together with its spectrogram and the various contours, is shown in Fig. 8.

Fig. 8: Vowel /a/ generated with proposed changes

Any contour can be shifted to a new level simply by setting the a_0 coefficient of that contour to the required value. This property finds application in pitch level shifting and in pitch matching with adjacent sound units. In addition, because the time axis is normalized during curve fitting, the sound can easily be synthesized for any desired duration just by changing the synthesis duration parameter. Fig. 9 shows the waveform of the vowel /a/ synthesized for a duration of 0.2 s.

Fig. 9: Vowel /a/ generated for 0.2 s
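Under the illustrative names used in the earlier sketches, both manipulations reduce to one-line changes (the shift amount here is hypothetical):

    # Pitch level shift: raise the whole F0 contour by 20 Hz, touching
    # only the a_0 coefficient of the stored polynomial.
    db["F0_poly"][0] += 20.0

    # Duration change: synthesize the same vowel for 0.2 s; the
    # normalized time axis stretches the stored contours automatically.
    waveform = synthesize_voicing(db, synth_dur=0.2)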

4. EVALUATION OF PROPOSED METHOD

Kannada vowels and semi-vowels were recorded and the model parameters were extracted. One set of vowels and semi-vowels was re-synthesized with the existing KlattGrid synthesizer, and another set with the proposed changes. A group of Kannada speakers was asked to identify the sounds, which were played to them in random order from both sets. This survey was conducted to check whether the re-synthesized vowels and semi-vowels were intelligible; all of the generated sounds were correctly identified. A Mean Opinion Score (MOS) was also collected by playing the two sets one after the other: 95% of the survey participants said that the set generated with the proposed changes sounded more natural than the other set, and the remaining 5% said that there was no difference between the two sets.

5. CONCLUSION

The proposed changes significantly increase the quality of the synthesized vowels and semi-vowels. The proposed method of storing the parameters also reduces the size of the database. The price paid for the increased quality is the increased number of computations needed to obtain the parameters from the polynomials; with the speed of modern processors, however, this extra computational load does not pose a significant hindrance.

6. REFERENCES

[1] Klatt, Dennis H., "Software for a cascade/parallel formant synthesizer", The Journal of the Acoustical Society of America, 67(3), 1980.
[2] Klatt, Dennis H., and Laura C. Klatt, "Analysis, synthesis, and perception of voice quality variations among female and male talkers", The Journal of the Acoustical Society of America, 87(2), 1990.
[3] Weenink, David, "The KlattGrid speech synthesizer", in INTERSPEECH, 2009.
[4] Rosenberg, Aaron E., "Effect of glottal pulse shape on the quality of natural vowels", The Journal of the Acoustical Society of America, 49(2B), 1971.
[5] Jesus, Luis Miguel Teixeira de, Francisco Vaz, and José Carlos Principe, "An Implementation of the Klatt Speech Synthesiser", Electrónica e Telecomunicações (Revista do DETUA), 2(1), 1997.
[6] Mermelstein, Paul, "Articulatory model for the study of speech production", The Journal of the Acoustical Society of America, 53(4), 1973.
