Quarterly Progress and Status Report. Speech synthesizer control by smoothed step functions
Dept. for Speech, Music and Hearing
Quarterly Progress and Status Report

Speech synthesizer control by smoothed step functions
Liljencrants, J.
journal: STL-QPSR, volume: 10, number: 4, year: 1969
III. SPEECH SYNTHESIS

A. SPEECH SYNTHESIZER CONTROL BY SMOOTHED STEP FUNCTIONS

J. Liljencrants

It is an appealing notion that the movements of the speech organs may at some level in the control chain be initiated by step commands. Clearly, many muscular movements, not only in the speech apparatus, can be described as responses of an inertial system to a more or less complex set of step forces. The inertia is then not only mechanical but also due to neural propagation and other delays. The speech organ movements are eventually manifested in the appearance and movements of the formants in the speech wave. The experiment to be described here is a drastic shortcut across the whole set of nonlinear transformations from imaginary stepwise muscular excitations, over movements and area functions, to the speech wave. Thus the principle used here is to treat the formant parameters themselves as well-behaved step responses. Of course one may then not hope for more than a moderate approximation to natural speech, but the method is very well suited for a technical implementation of a synthesis-by-rule system.

The setup for the experiment consists of a CDC 1700 computer interfaced with the OVE II serial formant synthesizer and various equipment for operator control and monitoring. The initial work is to build up a library of typical formant frequency and excitation level values. For this purpose the operator works with a handle that can be moved over a plane surface. The handle has two sensors to convey its location to the computer, which plots a mark at the pertinent coordinates on a display oscilloscope. The plot on the oscilloscope shows, as a time-frequency diagram, the synthesis parameters in the stylized square wave form shown in Fig. III-A-1. A set of program control commands is displayed at the edge of the plot. By pointing at these with the handle the operator can insert, delete, or move data points, and select parameters to display.
The operator can now devise a pattern on the screen. If the special duration parameter 3R has been given two data points indicating the nominal beginning and end of the pattern, it may be stored as a library item in
Fig. III-A-1. Operator's plot showing vowel synthesis parameters in square wave form. The operator has just put the mark on the command word CHNP in order to initiate a change in a point. He has then selected a point on the F0 contour. A line goes from this point to the mark. Immediately above the mark the parameter name and the current location of the mark are displayed. Should the operator push the enable button for the handle, the point will be moved to the location of the mark.
the program. This is done from the computer keyboard, where it should also be given a name of one or two characters. Each of these library samples contains the following information:

a. The identifying name of the sample.
b. A pointer to the next sample in the library, also giving the length of the current sample. If this pointer is zero it indicates that the current sample is a dummy defining the end of the library.
c. A set of pointers, one for each parameter. Each of these pointers indicates the location within the sample of the value specifications for that parameter.
d. The nominal duration of the sample.
e. A table of variable size with the parameter values. Each entry here is one 16-bit machine word and corresponds to a data point in the plot. The first half of the word is the time relative to the beginning of the sample, and the second half is the frequency or level value.

The frequency codes stored are on a logarithmic basis with a 3% frequency increment to conform with the synthesizer hardware. The synthesis parameters and ranges used are given for reference in Table III-A-1. It should be noted that the smoothing operations discussed below are performed on this logarithmic frequency scale. The quantizing step of the time scale is 20 msec. A very important feature is that the time coordinate may define points both before and after the nominal time interval of the sample.

Giving commands from the computer keyboard, the operator can call down a library sample to the working area of the program, modify it using the handle, and reinsert it in the library, possibly with a different name. Within practical limits, set by core storage size and plot complexity, a library sample may contain an arbitrary number of data points. In many cases it is desirable to omit specification of certain parameters. If the operator after a special command types a sequence of library entry names, the corresponding information is assembled.
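The data-point word in item (e) can be sketched as follows. This is an illustrative reconstruction, not the original CDC 1700 machine code; the helper names and the even 8/8 split of the 16-bit word are assumptions based on the description above.

```python
def pack_point(time_units: int, value_code: int) -> int:
    """Pack one data point into a 16-bit word: time offset (in 20-msec
    quantizing steps) in the first half, frequency/level code in the second."""
    assert 0 <= time_units < 256 and 0 <= value_code < 256
    return (time_units << 8) | value_code

def unpack_point(word: int) -> tuple[int, int]:
    """Recover (time_units, value_code) from a packed 16-bit word."""
    return (word >> 8) & 0xFF, word & 0xFF
```

A round trip such as `unpack_point(pack_point(5, 42))` recovers `(5, 42)`, mirroring how one table word represents one data point in the plot.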
The relative time scale of the library patterns is then converted to absolute, making use of the duration values of the samples. In this process it often happens that data points in one sample come later than points in the following sample. When this is the case for a parameter, the square wave will make a twist backwards in time. The convention used in the following treatment of these cases is that the data points are used in sequence. When a time reversal occurs, the time value of the new sample is neglected, but its parameter value is taken. In case the folded-back portion contains more than one point, the intermediate points are skipped over.

The final step prior to controlling the synthesizer is to smooth the square wave pattern. This is done using second order lowpass filters, simulated in the computer using the z-transform technique. The filters have a small overshoot in their step response, with the complex frequency poles at (-0.781 ± j0.625)·f0, where f0 is the 3 dB cutoff frequency. The choice of second order filtering was arbitrary, but would apply to mechanical systems of the simple spring-mass-resistance type.

TABLE III-A-1. OVE II Control Parameters

Name        Range    Increment  Remarks
F0          Hz       3%         Fundamental
F1, F2, F3  Hz       3%         Vowel formants
A0          32 dB    0.5 dB     Vowel level
AC          28 dB    4 dB       Fricative level
AH          dB       4 dB       Aspirative level
AN          24 dB    8 dB       Nasal level
FN          Hz       12%        Nasal formant
FH          Hz       12%        F4 and part of KH
K0          Hz       3%         Fricative antiformant
K1, K2      Hz       3%         Fricative formants
B1, B2      100 Hz              Vowel formant bandwidths
B3, B4      200 Hz              Vowel formant bandwidths

For optional addenda to the circuits. From ref. (3).
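A minimal sketch of the smoothing filter described above: a second-order section with s-plane poles at (-0.781 ± j0.625)·2π·f0, mapped to the z-plane by z = exp(sT) and normalized to unit DC gain. The pole-mapping discretization and the 20-msec step are illustrative assumptions; the text does not give the original program's difference equation.

```python
import cmath

def make_smoother(f0_hz: float, dt: float = 0.02):
    """Second-order lowpass with s-plane poles (-0.781 ± j0.625)·2π·f0,
    discretized by pole mapping; DC gain normalized to 1."""
    p = cmath.exp((-0.781 + 0.625j) * 2 * cmath.pi * f0_hz * dt)
    a1, a2 = 2 * p.real, abs(p) ** 2
    g = 1.0 - a1 + a2                  # forces unit gain for a constant input
    y1 = y2 = 0.0
    def step(x: float) -> float:
        nonlocal y1, y2
        y = g * x + a1 * y1 - a2 * y2  # second-order difference equation
        y1, y2 = y, y1
        return y
    return step

# Step response: rises toward 1 with the small overshoot mentioned above.
smooth = make_smoother(f0_hz=5.0)
response = [smooth(1.0) for _ in range(200)]
```

With the damping ratio 0.781 implied by the pole positions, the step response overshoots by only about 2 percent, which matches the "small overshoot" the text describes.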
For each synthesis parameter the smoothing time constant is invariable, but it may differ substantially between the different parameters. Thus the excitation level parameters are given short time constants, the formant parameters longer, and the pitch parameter the longest. In the program these time constants may be arbitrarily adjusted in octave steps by the appropriate setting of a table of masks.

The output to the synthesizer is done at 10 msec intervals using an external interrupt clock for the timing of the program. As an alternative to this real time output, the control signals may be stored as a binary record on the disc storage. These data can then later be used to control a separate program that simulates the synthesizer and stores the computed speech output, also on the disc, from where it may ultimately be played back at the correct speed. An example of such a synthesis is shown in the spectrogram of Fig. III-A-3. When the operator initiates synthesis with a chain of characters indicating library items, only the character string is stored. The actual extraction of data and smoothing is done during the output using interlaced buffering. Thus relatively long coherent utterances may be synthesized without using an excessive amount of storage capacity. During these output operations the computer is busy between 30 and 40 percent of the time, while the remainder is spent idle waiting for interrupt signals.

A tentative library has been prepared for the speech sounds used in Swedish. Most of the data have been extracted by manual measurement and visual interpolation from a set of spectrograms. The speech material used was CV utterances with all Swedish consonants and the vowels [i:], [a:], [u:], all pronounced by a single, phonetically trained speaker. The library was set up on a phoneme basis. To minimize the number of characters in the input strings the phonetic codes have been restricted to one character when possible, otherwise two.
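The octave-step assignment of time constants might look like the following sketch. The base cutoff and the particular octave offsets are invented for illustration; the text gives only the ordering (excitation fastest, formants slower, pitch slowest), not the actual mask table.

```python
BASE_CUTOFF_HZ = 16.0   # illustrative base value, not from the original

# Octave steps below the base cutoff for each parameter (assumed values);
# a larger step means a lower cutoff, i.e. a longer time constant.
OCTAVE_STEP = {
    "A0": 0, "AC": 0, "AH": 0, "AN": 0,   # excitation levels: fastest
    "F1": 1, "F2": 1, "F3": 1,            # formants: slower
    "F0": 3,                              # pitch: slowest
}

def cutoff_hz(param: str) -> float:
    """Cutoff frequency after applying the parameter's octave step."""
    return BASE_CUTOFF_HZ / 2 ** OCTAVE_STEP[param]
```

Dividing by a power of two keeps all adjustments in exact octave steps, which is what a mask table naturally provides.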
The mixture of one and two character codes will necessitate a character delay in the input routine so that the correct choice can be made when the library names are searched. Obviously some care must be exercised in the selection of codes to avoid ambiguities. This does not seem to offer any special difficulties, and good mnemotechnic aid is given by common orthographic notations.
Fig. III-A-2. Top: Concatenated standard samples for a word. The vowel synthesis parameters and the fricative excitation level AC are shown. At some places, most clearly between S and Y, a formant specification of a sound temporarily overrules that of the next sound. This shows as a twist backwards in the square waves. Bottom: The same after smoothing, ready for output to the synthesizer. The smoothing time constants differ between parameters. The accent 2 pitch contour is derived from a single standard pattern called for with the character ".
Fig. III-A-3. Spectrogram of a short sentence. The synthesizer control parameters are smoothed step functions. Here the synthesizer was simulated on the computer.
The intonation contour is generated from special characters in the input. These characters denote library samples with zero nominal duration. The only other parameter specified here is F0, and the specification covers an interval of the order of 500 msec. Since none of the phoneme specifications contain any F0 information, the F0 pattern is superimposed without any interaction. At present it seems quite satisfactory to work with as few as two different word intonation patterns, one for each of accent 1 and accent 2 in Swedish. The sample corresponding to the first is a single rectangular pulse, and the other of two pulses of which the second is somewhat higher. Using these two alternatives the intonation patterns generated are fairly agreeable when the intonation markers are placed initially in the stressed words. However, when longer sentences are joined the result does become rather monotone. To relieve this a special sentence tone parameter was introduced. It is specified in only one library sample, where a single rectangular pulse is given. When used, this pulse is smoothed with a very long time constant, one second or more, and the result is added to the regular F0 contour.

In the synthesis work done with this principle so far, the input has been in the form of a conventional phonetic transcription, where the actual character set of course is different due to the computer typewriter limitation. Apart from the insertion of intonation stress markers, the following simple rules have been employed:

1. A stressed long vowel is made twice as long as normal by double typing.
2. A stressed short vowel is unchanged, but the consonant following it is made twice as long as normal. Should the consonant in question be a plosive, the first item in the double typing is substituted for a space, indicating a silent interval.
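The two typing rules can be sketched as a small function. The CV pairing, the plosive code set, and the function name are illustrative assumptions, not the original input convention in full.

```python
PLOSIVES = {"P", "T", "K", "B", "D", "G"}   # assumed code set for illustration

def type_stressed(vowel: str, consonant: str, long_vowel: bool) -> str:
    """Apply rules 1 and 2 above to a stressed vowel and following consonant."""
    if long_vowel:
        return vowel * 2 + consonant        # rule 1: double-type the vowel
    if consonant in PLOSIVES:
        return vowel + " " + consonant      # rule 2: space marks the silence
    return vowel + consonant * 2            # rule 2: double-type the consonant
```

For example, a stressed long vowel gives "AAT", while a stressed short vowel before a plosive gives "A T", the space indicating the silent interval before the plosion.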
Application of these rules might as an example give SN 'THEJ v TK'TOAO-AKGZHAVQ r/ 'DJUUBQSPQRSAyOONALLT for the words "synthetic talkers have a dubious personality". The phonetic codes used should be evident. The two character codes are underlined.

There are a number of obvious limitations with this scheme for mechanical synthesis. One may first come to think of the formant transition time constants, which in natural speech are rather far from invariant. Especially one might consider the labial plosives, where the transition rate is often very high. To some extent these matters can be taken care of by
the proper adjustment of the step timing. Another possibility to speed up transitions out from the smoothing filters is to feed them with high narrow pulses superimposed on the steps. It is however difficult to unite such operations with the demand for context-independent standard control patterns. Also, the present system does not allow for the complex nature of formant transitions due to compound mechanical movements frequently encountered in human speech (see Fant (1)). The /g/ and /k/ plosion loci are rather dependent on the following vowel. For this reason the experimental library originally contained front and back variants of these sounds. This however did not seem to give a significant improvement over the use of only the front variant. Perhaps the overall naturalness of the synthesis is too low to allow for this refinement.

Another undesirable consequence of the mechanical concatenation is that the syllable duration is unduly influenced by the number of phonemes per syllable. It has been proposed that, as an intermediate step between the concatenation and smoothing, all syllables should be stretched to an equal duration. The criterion of a syllable start might then be the onset of the voicing. After this operation the stressed syllables should be somewhat prolongated. This could conveniently be performed under control of the word intonation parameter. Initial experiments in this direction show promising results.

The synthesis control procedure outlined has a certain interest for computer vocal response applications. A system might be based on a synthesizer controlled by the outputs from hardware lowpass filters. Should the computer supply the information of the unsmoothed square waves, the data rate at this point may be estimated as 10 (phon./sec) × 6 (param. used) × (4 (param. select.) + 5 (data bits)) = 540 bits/sec. It also seems quite practicable to provide the library samples from a read-only memory with associated hardware.
In that case only the phonetic character string has to be supplied, and the data rate goes down to the order of 60 bits/sec. In both cases it is believed that even moderate computers could control several synthesizers in time sharing.

The control system described is of course not limited to the production of speechlike signals. It has also been successfully operated for the production of synthetic music.
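The two data-rate estimates work out as follows. The 6-bits-per-character breakdown for the second figure is an assumption that merely reproduces the order of magnitude given in the text.

```python
# Square-wave data supplied by the computer (first estimate above):
phones_per_sec = 10
params_used = 6
select_bits = 4          # parameter-select bits
data_bits = 5
rate = phones_per_sec * params_used * (select_bits + data_bits)
print(rate)              # 540 bits/sec

# Character-string input only (second estimate): roughly 6 bits per
# phonetic character at 10 phonemes/sec -- an assumed breakdown that
# reproduces the "order of 60 bits/sec" figure.
rate_rom = phones_per_sec * 6
print(rate_rom)          # 60 bits/sec
```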
Approximate program size (CDC 1700), written in machine code:

- Operator interface (keyboard, control handle, plots)
- Library service (modifications, listing)
- Communication with external programs (data transfers)
- Concatenation, smoothing, and synthesizer output
- Auxiliary tables
- Phoneme data library
- Buffer storage areas

Total: 7,850 words

References:

(1) Fant, G.: "Stops in CV-Syllables", STL-QPSR 4/1969 (this issue).
(2) Lindblom, B.: personal communication.
(3) Liljencrants, J.: "The OVE II Speech Synthesizer", IEEE Trans. on Audio and Electroacoustics, AU-16 (1968).
More informationA DEVICE FOR AUTOMATIC SPEECH RECOGNITION*
EVICE FOR UTOTIC SPEECH RECOGNITION* ats Blomberg and Kjell Elenius INTROUCTION In the following a device for automatic recognition of isolated words will be described. It was developed at The department
More informationVOICE QUALITY SYNTHESIS WITH THE BANDWIDTH ENHANCED SINUSOIDAL MODEL
VOICE QUALITY SYNTHESIS WITH THE BANDWIDTH ENHANCED SINUSOIDAL MODEL Narsimh Kamath Vishweshwara Rao Preeti Rao NIT Karnataka EE Dept, IIT-Bombay EE Dept, IIT-Bombay narsimh@gmail.com vishu@ee.iitb.ac.in
More informationNature of Noise source. soundsc (noise, 10000);
Noise Sources Voiceless aspiration can be produced with a noise source at the glottis. (also for voiceless sonorants, including vowels) Noise source that is filtered through VT cascade, so some resonance
More informationEpoch Extraction From Speech Signals K. Sri Rama Murty and B. Yegnanarayana, Senior Member, IEEE
1602 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 8, NOVEMBER 2008 Epoch Extraction From Speech Signals K. Sri Rama Murty and B. Yegnanarayana, Senior Member, IEEE Abstract
More informationSpeech Perception Speech Analysis Project. Record 3 tokens of each of the 15 vowels of American English in bvd or hvd context.
Speech Perception Map your vowel space. Record tokens of the 15 vowels of English. Using LPC and measurements on the waveform and spectrum, determine F0, F1, F2, F3, and F4 at 3 points in each token plus
More informationMultirate Signal Processing Lecture 7, Sampling Gerald Schuller, TU Ilmenau
Multirate Signal Processing Lecture 7, Sampling Gerald Schuller, TU Ilmenau (Also see: Lecture ADSP, Slides 06) In discrete, digital signal we use the normalized frequency, T = / f s =: it is without a
More informationENSEMBLE String Synthesizer
ENSEMBLE String Synthesizer by Max for Cats (+ Chorus Ensemble & Ensemble Phaser) Thank you for purchasing the Ensemble Max for Live String Synthesizer. Ensemble was inspired by the string machines from
More informationIIR Filter Design Chapter Intended Learning Outcomes: (i) Ability to design analog Butterworth filters
IIR Filter Design Chapter Intended Learning Outcomes: (i) Ability to design analog Butterworth filters (ii) Ability to design lowpass IIR filters according to predefined specifications based on analog
More informationLecture 5: Sinusoidal Modeling
ELEN E4896 MUSIC SIGNAL PROCESSING Lecture 5: Sinusoidal Modeling 1. Sinusoidal Modeling 2. Sinusoidal Analysis 3. Sinusoidal Synthesis & Modification 4. Noise Residual Dan Ellis Dept. Electrical Engineering,
More informationMel Spectrum Analysis of Speech Recognition using Single Microphone
International Journal of Engineering Research in Electronics and Communication Mel Spectrum Analysis of Speech Recognition using Single Microphone [1] Lakshmi S.A, [2] Cholavendan M [1] PG Scholar, Sree
More informationPsychology of Language
PSYCH 150 / LIN 155 UCI COGNITIVE SCIENCES syn lab Psychology of Language Prof. Jon Sprouse 01.10.13: The Mental Representation of Speech Sounds 1 A logical organization For clarity s sake, we ll organize
More information) #(2/./53 $!4! 42!.3-)33)/.!4! $!4! 3)'.!,,).' 2!4% ()'(%2 4(!. KBITS 53).' K(Z '2/50 "!.$ #)2#5)43
INTERNATIONAL TELECOMMUNICATION UNION )454 6 TELECOMMUNICATION STANDARDIZATION SECTOR OF ITU $!4! #/--5.)#!4)/. /6%2 4(% 4%,%(/.%.%47/2+ 39.#(2/./53 $!4! 42!.3-)33)/.!4! $!4! 3)'.!,,).' 2!4% ()'(%2 4(!.
More informationSource-filter Analysis of Consonants: Nasals and Laterals
L105/205 Phonetics Scarborough Handout 11 Nov. 3, 2005 reading: Johnson Ch. 9 (today); Pickett Ch. 5 (Tues.) Source-filter Analysis of Consonants: Nasals and Laterals 1. Both nasals and laterals have voicing
More informationSignal Processing for Speech Applications - Part 2-1. Signal Processing For Speech Applications - Part 2
Signal Processing for Speech Applications - Part 2-1 Signal Processing For Speech Applications - Part 2 May 14, 2013 Signal Processing for Speech Applications - Part 2-2 References Huang et al., Chapter
More informationEE 225D LECTURE ON SYNTHETIC AUDIO. University of California Berkeley
University of California Berkeley College of Engineering Department of Electrical Engineering and Computer Sciences Professors : N.Morgan / B.Gold EE225D Synthetic Audio Spring,1999 Lecture 2 N.MORGAN
More informationApplication of The Wavelet Transform In The Processing of Musical Signals
EE678 WAVELETS APPLICATION ASSIGNMENT 1 Application of The Wavelet Transform In The Processing of Musical Signals Group Members: Anshul Saxena anshuls@ee.iitb.ac.in 01d07027 Sanjay Kumar skumar@ee.iitb.ac.in
More informationChanging the pitch of the oscillator. high pitch as low as possible, until. What can we do with low pitches?
The basic premise is that everything is happening between the power supply and the speaker A Changing the pitch of the oscillator lowest pitch 60 sec! as high as possible, then stay there high pitch as
More informationDetermination of instants of significant excitation in speech using Hilbert envelope and group delay function
Determination of instants of significant excitation in speech using Hilbert envelope and group delay function by K. Sreenivasa Rao, S. R. M. Prasanna, B.Yegnanarayana in IEEE Signal Processing Letters,
More informationReal-Time Digital Hardware Pitch Detector
2 IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-24, NO. 1, FEBRUARY 1976 Real-Time Digital Hardware Pitch Detector JOHN J. DUBNOWSKI, RONALD W. SCHAFER, SENIOR MEMBER, IEEE,
More informationGeneral outline of HF digital radiotelephone systems
Rec. ITU-R F.111-1 1 RECOMMENDATION ITU-R F.111-1* DIGITIZED SPEECH TRANSMISSIONS FOR SYSTEMS OPERATING BELOW ABOUT 30 MHz (Question ITU-R 164/9) Rec. ITU-R F.111-1 (1994-1995) The ITU Radiocommunication
More informationMultiple Sound Sources Localization Using Energetic Analysis Method
VOL.3, NO.4, DECEMBER 1 Multiple Sound Sources Localization Using Energetic Analysis Method Hasan Khaddour, Jiří Schimmel Department of Telecommunications FEEC, Brno University of Technology Purkyňova
More informationLecture 3 Concepts for the Data Communications and Computer Interconnection
Lecture 3 Concepts for the Data Communications and Computer Interconnection Aim: overview of existing methods and techniques Terms used: -Data entities conveying meaning (of information) -Signals data
More informationTHREE-AXIS MORPHING WITH NONLINEAR WAVESHAPERS FREQUENCY +/- 8V SELECT FM/EXT IN AC 10VPP OSC A LINK FREQUENCY MODE SELECT OSC B CV +/- 8V MICRO SD
PISTON HONDA DUAL WAVETABLE OSCILLATOR THREE-AXIS MORPHING WITH NONLINEAR WAVESHAPERS FREQUENCY SYN C 0-5V MODE SELECT CV +/- 8V PRESET/EDIT 1V/OCT 0-8V CV +/- 8V FM/EXT IN AC 10VPP OSC A LINK FREQUENCY
More informationGLOTTAL EXCITATION EXTRACTION OF VOICED SPEECH - JOINTLY PARAMETRIC AND NONPARAMETRIC APPROACHES
Clemson University TigerPrints All Dissertations Dissertations 5-2012 GLOTTAL EXCITATION EXTRACTION OF VOICED SPEECH - JOINTLY PARAMETRIC AND NONPARAMETRIC APPROACHES Yiqiao Chen Clemson University, rls_lms@yahoo.com
More informationDELTA MODULATION. PREPARATION principle of operation slope overload and granularity...124
DELTA MODULATION PREPARATION...122 principle of operation...122 block diagram...122 step size calculation...124 slope overload and granularity...124 slope overload...124 granular noise...125 noise and
More informationModule 3: Physical Layer
Module 3: Physical Layer Dr. Associate Professor of Computer Science Jackson State University Jackson, MS 39217 Phone: 601-979-3661 E-mail: natarajan.meghanathan@jsums.edu 1 Topics 3.1 Signal Levels: Baud
More informationSubtractive Synthesis & Formant Synthesis
Subtractive Synthesis & Formant Synthesis Prof Eduardo R Miranda Varèse-Gastprofessor eduardo.miranda@btinternet.com Electronic Music Studio TU Berlin Institute of Communications Research http://www.kgw.tu-berlin.de/
More informationReference Manual SPECTRUM. Signal Processing for Experimental Chemistry Teaching and Research / University of Maryland
Reference Manual SPECTRUM Signal Processing for Experimental Chemistry Teaching and Research / University of Maryland Version 1.1, Dec, 1990. 1988, 1989 T. C. O Haver The File Menu New Generates synthetic
More informationData Communications & Computer Networks
Data Communications & Computer Networks Chapter 3 Data Transmission Fall 2008 Agenda Terminology and basic concepts Analog and Digital Data Transmission Transmission impairments Channel capacity Home Exercises
More informationLand and Coast Station Transmitters Operating in the Band khz
Issue 3 January 2016 Spectrum Management Radio Standards Specification Land and Coast Station Transmitters Operating in the Band 200-535 khz Aussi disponible en français CNR-117 Preface Radio Standards
More informationSpeech/Non-speech detection Rule-based method using log energy and zero crossing rate
Digital Speech Processing- Lecture 14A Algorithms for Speech Processing Speech Processing Algorithms Speech/Non-speech detection Rule-based method using log energy and zero crossing rate Single speech
More informationSynchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech
INTERSPEECH 5 Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech M. A. Tuğtekin Turan and Engin Erzin Multimedia, Vision and Graphics Laboratory,
More informationFundamentals of Digital Audio *
Digital Media The material in this handout is excerpted from Digital Media Curriculum Primer a work written by Dr. Yue-Ling Wong (ylwong@wfu.edu), Department of Computer Science and Department of Art,
More informationResults of Egan and Hake using a single sinusoidal masker [reprinted with permission from J. Acoust. Soc. Am. 22, 622 (1950)].
XVI. SIGNAL DETECTION BY HUMAN OBSERVERS Prof. J. A. Swets Prof. D. M. Green Linda E. Branneman P. D. Donahue Susan T. Sewall A. MASKING WITH TWO CONTINUOUS TONES One of the earliest studies in the modern
More informationOn the glottal flow derivative waveform and its properties
COMPUTER SCIENCE DEPARTMENT UNIVERSITY OF CRETE On the glottal flow derivative waveform and its properties A time/frequency study George P. Kafentzis Bachelor s Dissertation 29/2/2008 Supervisor: Yannis
More information