Speech Synthesis; Pitch Detection and Vocoders
Tai-Shih Chi ( 冀泰石 )
Department of Communication Engineering
National Chiao Tung University
May 29, 2008
Speech Synthesis
Basic components of a text-to-speech (TTS) system:
- Text preprocessing: translation with ambiguities resolved, e.g., "Dr." read as "Doctor" or "Drive".
- Text to phonetic-prosodic translation: the processed text is parsed into groups to determine semantic structures, which are then used to generate prosodic information (pitch, duration, amplitude, etc.). In some researchers' opinion, this component is the source of the most disturbing errors in current TTS systems.
- Speech synthesis, done with one of the following approaches:
  - Articulatory synthesis: physical models of the articulators and their movements (Coker and colleagues, 1968, 1976, 1992). This is not practical due to the difficulty of deriving the physical parameters and the huge computational load.
  - Source-filter synthesis (formant synthesis): a formant is usually represented by a second-order filter, and the formants are used to characterize the spectral shape. Sometimes the spectral characteristics are specified in terms of the short-term cepstrum or linear prediction coefficients.
  - Concatenative synthesis: direct time-waveform storage and parametric storage of speech segments have become more common in recent years as data storage gets cheaper.
Formant Synthesizers
OVE II synthesizer (Fant et al., 1962): formant resonators in cascade.
- Top branch: vowels, semivowels and whispered vowels.
- Middle branch: nasals.
- Bottom branch: fricatives and plosives.
Such a structure closely resembles the human vocal tract.
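As a concrete sketch of the formant-synthesis idea, the following example implements a cascade of second-order digital resonators driven by an impulse train. The formant frequencies, bandwidths and sampling rate are illustrative values, not those of the OVE II or any other particular synthesizer:

```python
import numpy as np

def resonator(x, F, B, fs):
    """Second-order digital resonator with center frequency F (Hz) and
    bandwidth B (Hz); the gain is normalized to unity at DC."""
    r = np.exp(-np.pi * B / fs)
    theta = 2 * np.pi * F / fs
    b0 = 1 - 2 * r * np.cos(theta) + r * r     # makes H(z=1) = 1
    a1, a2 = -2 * r * np.cos(theta), r * r
    y = np.zeros(len(x))
    for n in range(len(x)):
        # y[-1], y[-2] still hold zeros at the start, so no special casing
        y[n] = b0 * x[n] - a1 * y[n - 1] - a2 * y[n - 2]
    return y

fs = 8000
y = np.zeros(fs)
y[::80] = 1.0                                       # 100 Hz impulse-train source
for F, B in [(500, 80), (1500, 120), (2500, 160)]:  # illustrative formants (Hz)
    y = resonator(y, F, B, fs)
```

A cascade connection automatically gives relative formant amplitudes appropriate for vowels, which is the main argument for the OVE-style cascade structure.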
Formant Synthesizers (cont.)
Holmes synthesizer (Holmes, 1973): formant resonators in parallel.
- A +6 dB/octave high-pass filter represents the radiation characteristic of the mouth-to-air junction.
- A -12 dB/octave low-pass filter gives an approximation to the spectrum of the glottal waveform.
- All formant filters are individually controlled to produce all possible sounds.
The synthesizer developed by Dennis Klatt (1980) is a compromise between Fant's and Holmes's designs: it uses cascade formant resonators for voiced sounds and parallel formant resonators for fricative sounds.
Other source-filter synthesizer structures
Other configurations:
- All-pole synthesizers, derived from LPC analysis.
- All-zero synthesizers, derived from cepstral analysis.
- Fixed poles and variable zeros, derived from channel vocoder analysis: the channel vocoder consists of a relatively large number (14-30) of fixed bandpass filters (fixed poles); the positions of the zeros vary with the channel weights.
- Variable poles and variable zeros, derived from a parallel-formant synthesizer.
Concatenative methods
Concatenative systems: speech waveforms (or compressed representations) are stored and then concatenated during synthesis, e.g., the LPC parameters used in talking chips (Morgan, 1984). A technique that has been used in many speech synthesis systems is pitch-synchronous overlap-add (PSOLA), now often called TD-PSOLA (time-domain PSOLA). Variations: LP-PSOLA; CELP-PSOLA (codebook-excited LP PSOLA); RELP-PSOLA (residual-excited LP PSOLA).
Available synthesis systems:
http://www.cstr.ed.ac.uk/projects/festival/
http://tcts.fpms.ac.be/synthesis/mbrola.html
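A minimal TD-PSOLA sketch is shown below, under the simplifying assumption of a constant, known F0 so that pitch marks fall at regular intervals (real systems estimate one pitch mark per glottal cycle). It shifts the pitch while keeping the duration roughly constant:

```python
import numpy as np

def td_psola(x, fs, f0, factor):
    """Crude TD-PSOLA: extract two-period Hann-windowed grains at analysis
    pitch marks and overlap-add them at synthesis marks spaced period/factor."""
    period = int(round(fs / f0))
    new_period = int(round(period / factor))
    win = np.hanning(2 * period)
    y = np.zeros(len(x))
    t_out = period
    while t_out + period <= len(x):
        t_in = int(round(t_out / period)) * period    # nearest analysis mark
        t_in = min(max(t_in, period), len(x) - period)
        y[t_out - period:t_out + period] += x[t_in - period:t_in + period] * win
        t_out += new_period
    return y
```

Applied to a 200 Hz tone with factor = 1.5, the dominant output frequency moves to roughly fs / round(period / 1.5), about 296 Hz at fs = 8000: the windowed grains preserve the local (formant) spectrum while the new mark spacing sets the new pitch.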
Pitch detection, perception and articulation
Pitch perception refers to the listener's subjective percept: the frequency of a pure tone that is matched to a more complex signal. Pitch detection (F0 estimation) refers to an objective measurement of the fundamental frequency of a signal. The Dudley pitch detector was based on the articulatory premise that a voiced speech signal always includes the fundamental frequency component. However, many practical communication systems (e.g., telephones) are band-limited, and the fundamental component of the speech may be completely missing.
Difficulties in pitch detection
- Large dynamic range: the pitch of some male voices can be as low as 60 Hz, whereas the pitch of children's voices can be as high as 800 Hz.
- Pitch can fluctuate drastically in time.
- Rapid vocal tract changes (e.g., the sudden closure in a vowel-to-nasal transition) make the waveform change drastically; the pitch might not change much, but the waveform changes make pitch detection difficult.
- The voiced-unvoiced transition: a fast-acting time-domain detector would be necessary to detect the precise transition instant.
- Speech degradation caused by transmission (e.g., a telephone channel) or added acoustic noise makes pitch detection difficult.
Signal processing to improve pitch detection
Low-pass filtering: human pitch perception pays more attention to the lower frequencies. Interestingly, estimating the pitch period by eye is typically easier with low-passed waveforms than with full-band waveforms. Practice has shown that a pitch-detection device has less trouble finding the correct period from low-passed waveforms.
Signal processing to improve pitch detection (cont.)
Spectral flattening and temporal correlation (Sondhi, 1968): the original signal is first spectrally flattened (each frequency band is normalized by its own energy), and the new signal is then sent through an autocorrelator for pitch detection. The idea is based on the observation that the inverse Fourier transform of harmonics of equal amplitude and zero phase is a pulse-train-like signal, which is good for pitch detection.
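A sketch of this flatten-then-correlate idea follows; the band width, energy floor, and pitch search range are illustrative choices, not Sondhi's actual parameters:

```python
import numpy as np

def flatten_and_autocorrelate(frame, fs, band_hz=100.0, fmin=50.0, fmax=500.0):
    """Normalize each spectral band by its own energy (spectral flattening),
    then pick the autocorrelation peak in the plausible pitch-lag range."""
    X = np.fft.rfft(frame * np.hanning(len(frame)))
    bw = max(1, int(band_hz * len(frame) / fs))      # bins per band
    for k in range(0, len(X), bw):
        e = np.sqrt(np.sum(np.abs(X[k:k + bw]) ** 2))
        if e > 1e-8:
            X[k:k + bw] /= e                         # equalize band energies
    flat = np.fft.irfft(X, n=len(frame))             # pulse-train-like signal
    r = np.correlate(flat, flat, mode='full')[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(r[lo:hi]))
    return fs / lag
```

Because the autocorrelation depends only on the magnitude spectrum, equalizing the harmonic amplitudes sharpens the peak at the true period without requiring the phases to be zeroed.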
Signal processing to improve pitch detection (cont.)
Inverse filtering: the speech signal is assumed to be the convolution of an excitation and a vocal tract filter. LPC or cepstral analysis provides an estimate of the vocal tract filter; inverse filtering with that estimate then recovers an approximation of the excitation, from which the period can be measured.
Comb filtering (Ross, 1974): the speech signal is sent through a multitude of delays corresponding to all possible periods of the input. For example, with a 10 kHz sampling frequency and an F0 range of 50-500 Hz, the possible periods (in samples) range from 20 to 200.
Cepstral pitch detection: see the notes on cepstral analysis.
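For the cepstral approach mentioned above, a minimal sketch (the window, log floor, and search range are illustrative choices):

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=50.0, fmax=500.0):
    """The real cepstrum of voiced speech shows a peak at the quefrency
    equal to the pitch period; search for it in the plausible lag range."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    ceps = np.fft.irfft(np.log(spec + 1e-10), n=len(frame))
    lo, hi = int(fs / fmax), int(fs / fmin)
    q = lo + int(np.argmax(ceps[lo:hi]))
    return fs / q
```

The log compresses the spectral envelope and leaves the harmonic ripple (spacing F0) as the dominant periodic component of the log spectrum, so its inverse transform peaks at the pitch period in samples.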
Pattern-recognition methods for pitch detection
- A histogram based on high-resolution spectral analysis (Seneff, 1978).
- A statistical approach which exemplifies the ML approach of testing all reasonable hypotheses (Goldstein, 1973; Duifhuis et al., 1982).
Smoothing pitch estimates
Median filtering (Seneff, 1978) with additional constraints:
- If the low-pass signal energy is below a threshold, the frame is set to unvoiced.
- If the variance of three successive frames is too large, the median smoother output is set to zero (unvoiced).
Dynamic programming (Talkin, 1995).
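The two constraints can be sketched together with a 3-point median smoother; the thresholds below are illustrative, not Seneff's published values:

```python
import numpy as np

def smooth_pitch(f0_track, energies, energy_thresh, var_thresh):
    """3-point median smoothing of an F0 track; low-energy frames and
    frames whose 3-frame neighborhood varies too much are set to 0 (unvoiced)."""
    f0 = np.asarray(f0_track, dtype=float)
    out = f0.copy()
    for i in range(1, len(f0) - 1):
        window = f0[i - 1:i + 2]
        out[i] = np.median(window)           # removes isolated octave errors
        if np.var(window) > var_thresh:
            out[i] = 0.0                     # too erratic -> declare unvoiced
    out[np.asarray(energies, dtype=float) < energy_thresh] = 0.0
    return out
```

A single spurious doubling (e.g., 100, 100, 300, 100, 100 Hz) is replaced by the neighborhood median, while a silent frame is forced to unvoiced regardless of the raw estimate.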
Digital speech coding
Vocoder (voice coder): an analysis-synthesis system. The primary applications are source coding for efficient storage and for reducing the bandwidth required for transmission. The purpose of source coding research is to devise methods of lowering the required coding rates while maintaining the quality and robustness of the transmitted or stored speech. Standards for digital speech coding:
Design considerations in vocoders
Design issues around the bandpass filters:
- How many filters should the bandpass filter bank have?
- What should the filter bandwidth be (as a function of its center frequency)?
- Which design method works best for channel vocoders (e.g., one with constant group delay to avoid reverberation; see the previous class notes on filtering concepts)?
- How can the FFT algorithm be adapted to meet these criteria?
Design considerations in vocoders (cont.)
Number of bandpass filters: no parameter of the channel vocoder can be designed in isolation! (For example: with a channel capacity of 2400 bps, 400 bps for excitation parameters, 50 frames/sec and 4 bits/frame/channel, only 10 bandpass filters fit, which yields substandard quality.) Satisfactory vocoded speech for telephony may require 15-25 channels.
Filter bandwidth specifications: many early channel vocoders were designed with identical bandwidths for easy implementation (of the synthesizer filter bank). One possible design criterion is to follow the auditory bandwidths (100 Hz for center frequencies below 800 Hz, up to approximately 250 Hz at a 3000 Hz center frequency). For easy implementation, the critical bandwidth curve is approximated in a stepwise fashion; for example: 6 filters of 100 Hz width from 200 to 800 Hz; 6 filters of 150 Hz width centered at [950:150:1700] Hz; 5 filters of 200 Hz width centered at [1800:200:2600] Hz; and 3 filters of 300 Hz width centered at [2800:300:3400] Hz, for a total of 20 filters.
Another possible design criterion is to specify that each filter should encompass a single harmonic of the voiced speech. The data rate for transmitting a single harmonic in each filter would be lower (a lower frame rate, i.e., less frequent updating), so the number of bandpass filters could be higher for the same channel capacity. For example, 32 filters of 100 Hz width spanning 200-3400 Hz cover the telephone band.
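The stepwise critical-band filter bank and the bit-rate arithmetic above can be checked with a short script. The center frequencies of the first group are an assumption (chosen midway between the 100 Hz band edges given in the text); the 400 bps excitation allocation and 50 frames/sec are the figures from the example:

```python
# Stepwise approximation of the critical-band curve: (center_hz, width_hz)
filters = [(c, 100) for c in range(250, 800, 100)]    # 6 filters, 200-800 Hz
filters += [(c, 150) for c in range(950, 1701, 150)]  # 6 filters of 150 Hz
filters += [(c, 200) for c in range(1800, 2601, 200)] # 5 filters of 200 Hz
filters += [(c, 300) for c in range(2800, 3401, 300)] # 3 filters of 300 Hz

def channel_vocoder_rate(n_channels, frame_rate=50, bits_per_channel=4,
                         excitation_bps=400):
    """Total bit rate = spectral-envelope bits plus excitation bits."""
    return frame_rate * bits_per_channel * n_channels + excitation_bps
```

With the 2400 bps budget of the example, only (2400 - 400) / (50 x 4) = 10 channels fit, while the 20-filter critical-band bank would need 4400 bps at the same frame rate and quantization.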
Envelope extraction in a channel vocoder
- The magnitude box is a half-wave (or full-wave) rectifier.
- With one harmonic through each bandpass filter, the envelope varies at 5-15 Hz.
- With two harmonics through a bandpass filter (e.g., the 12th and 13th harmonics of an 80 Hz fundamental), the design of the lowpass filter becomes critical.
Bit saving in channel vocoders: efficient quantization
µ-law quantizer: the human ear and brain judge relative sound intensities more or less logarithmically, so it makes sense to quantize the channel energy in a non-uniform manner. µ = 255 has been adopted as a standard for speech waveform encoding in the US and Canada.

y = sgn(x) · X · log(1 + µ|x|/X) / log(1 + µ),

where X is the maximum value of |x| and µ is a parameter.
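A sketch of the µ-law compression curve and its inverse, using µ = 255 and unit overload level X = 1:

```python
import numpy as np

def mu_law_compress(x, X=1.0, mu=255.0):
    """y = sgn(x) * X * log(1 + mu*|x|/X) / log(1 + mu)."""
    return np.sign(x) * X * np.log1p(mu * np.abs(x) / X) / np.log1p(mu)

def mu_law_expand(y, X=1.0, mu=255.0):
    """Inverse companding: recovers x from y."""
    return np.sign(y) * (X / mu) * np.expm1(np.abs(y) * np.log1p(mu) / X)
```

Small amplitudes are stretched before uniform quantization (e.g., |x| = 0.01 maps to about 0.23 of full scale), so low-level channel energies keep fine resolution while loud ones are coarsened, matching the roughly logarithmic loudness judgment.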
Design of the excitation parameters for a channel vocoder
The original channel vocoder employed a pulse generator, a noise generator and a voiced-unvoiced (buzz-hiss) switch. What about voiced fricative sounds, whose excitation is really a combination of buzz and hiss? In addition, transient bursts are often too short (5-15 ms) to be adequately encoded at a low-bit-rate frame interval (20-40 ms per frame). Verdict: to date, 2400 bps vocoders are not able to synthesize speech that is indistinguishable from the original.
Spectral flattening is needed when generating the excitation signal at the synthesizer, since the overall spectral shape has already been encoded in the envelopes.
LPC vocoders
The major difference between LPC vocoders and channel vocoders is the presence of the error signal (residual) in LPC. Using the unmodified error signal as excitation yields synthetic speech that is a replica of the original. Simply transmitting the full error signal gives no bit saving; however, if we assume that the LPC spectral analysis has captured much of the spectral information, the error signal will be primarily a function of the excitation parameters and ought to be codable at a lower rate, as in RELP (residual-excited LP) coders.
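The analysis side can be sketched as follows: LPC coefficients from the autocorrelation method via the Levinson-Durbin recursion, plus the residual obtained by inverse filtering. The AR(2) test signal in the usage note is a synthetic stand-in for speech, not real data:

```python
import numpy as np

def lpc_coeffs(x, order):
    """LPC by the autocorrelation method with the Levinson-Durbin recursion;
    returns the prediction-error filter A(z) = [1, a1, ..., a_order]."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -np.dot(a[:i], r[i:0:-1]) / err   # reflection coefficient
        a[:i + 1] += k * a[i::-1]             # order-update of the filter
        err *= 1.0 - k * k                    # prediction-error energy
    return a

def lpc_residual(x, a):
    """Inverse-filter x with A(z) to obtain the prediction residual."""
    return np.convolve(x, a)[:len(x)]
```

On a stable AR(2) signal generated with coefficients [1, -1.5, 0.7], lpc_coeffs recovers approximately those values, and the residual variance returns to that of the driving noise; this is the sense in which the residual "is" the excitation once the spectral envelope has been captured.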
Cepstral vocoders
Design comparisons
Discreteness of analysis in the three systems:
- Number of filters in channel vocoder systems: N = 15-25 for a satisfactory telephony channel vocoder.
- Number of coefficients (= number of linear equations) in LPC vocoders: N = 10 has been considered a reasonable specification for a 2400 bps LPC vocoder.
- Window length of the lifter used to truncate the cepstrum: N = 32 in Oppenheim's cepstral vocoder (1969).
Bandwidth specification:
- Bandwidth specifications of channel vocoders are discussed above.
- No additional specifications are needed for LPC vocoders.
- The window size of the DFT in cepstral vocoders.
Perceptually oriented:
- Critical bandwidths adopted in channel vocoders.
- Perceptual LP (PLP) by Hermansky (1990).
- Mel-scale cepstral analysis (MFCC: Mel-frequency cepstral coefficients).
References
Gold, B. and Morgan, N. (2000). Speech and Audio Signal Processing: Processing and Perception of Speech and Music. John Wiley & Sons.