Speech Synthesis; Pitch Detection and Vocoders


Speech Synthesis; Pitch Detection and Vocoders Tai-Shih Chi (冀泰石) Department of Communication Engineering National Chiao Tung University May 29, 2008

Speech Synthesis
Basic components of text-to-speech (TTS) synthesis:
Text preprocessing: translation with ambiguities resolved, e.g., "Dr." as "Doctor" or "Drive".
Text-to-phonetic-prosodic translation: the processed text is parsed into groups to determine semantic structures, which are then used to generate prosodic information (pitch, duration, amplitude, etc.). In some researchers' opinion, this component is the source of the most disturbing errors in current TTS systems.
Speech synthesis, done with one of the following approaches:
Articulatory synthesis: physical models of the articulators and their movements (Coker and colleagues, 1968, 1976, 1992). This is not practical due to the difficulty of deriving the physical parameters and the huge computational load.
Source-filter synthesis (formant synthesis): a formant is usually represented by a second-order filter, and the formants are used to characterize the spectral shape (see the sketch below). Sometimes the spectral characteristics are specified in terms of the short-term cepstrum or linear prediction coefficients.
Concatenative synthesis: direct time-waveform storage and parametric storage of speech segments have become more common in recent years as data storage gets cheaper.
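
The second-order formant filter mentioned above can be made concrete. Below is a minimal sketch in Python (NumPy/SciPy); the function name and the unity-gain-at-DC normalization are illustrative choices, not specifics from the lecture.

```python
import numpy as np
from scipy.signal import lfilter

def formant_resonator(x, f_c, bw, fs):
    """Second-order digital resonator modeling one formant.

    A conjugate pole pair at radius r = exp(-pi*bw/fs) and angle
    theta = 2*pi*f_c/fs gives a resonance of center frequency f_c
    and (approximate) bandwidth bw; the numerator scales the filter
    to unity gain at DC.
    """
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * f_c / fs
    a = [1.0, -2 * r * np.cos(theta), r * r]   # denominator (pole pair)
    b = [sum(a)]                               # unity gain at z = 1 (DC)
    return lfilter(b, a, x)
```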

Formant Synthesizers
OVE II synthesizer (Fant et al., 1962): formant resonators in cascade.
Top branch: vowels, semivowels and whispered vowels.
Middle branch: nasals.
Bottom branch: fricatives, plosives.
Such a cascade structure more closely resembles the human vocal tract.

Formant Synthesizers (cont.)
Holmes synthesizer (Holmes, 1973): formant resonators in parallel.
A 6 dB/octave high-pass filter represents the radiation characteristic of the mouth-to-air junction.
A -12 dB/octave low-pass filter approximates the spectrum of the glottal waveform.
All formant filters are individually controlled, so all possible sounds can be produced.
The synthesizer developed by Dennis Klatt (1980) is a compromise between Fant and Holmes, using cascade formant resonators for voiced sounds and parallel formant resonators for fricative sounds. A sketch contrasting the two structures follows.
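
The sketch below contrasts the cascade (Fant/Klatt voiced branch) and parallel (Holmes) arrangements. The formant values are textbook /a/ formants, and all function names are illustrative, not from the lecture.

```python
import numpy as np
from scipy.signal import lfilter

def resonator_coeffs(f_c, bw, fs):
    # Pole pair of a second-order resonator, unity gain at DC.
    r = np.exp(-np.pi * bw / fs)
    a = np.array([1.0, -2 * r * np.cos(2 * np.pi * f_c / fs), r * r])
    return np.array([a.sum()]), a

def cascade_synth(source, formants, fs):
    # Cascade structure: resonators in series (Fant / Klatt voiced branch).
    y = source
    for f_c, bw in formants:
        b, a = resonator_coeffs(f_c, bw, fs)
        y = lfilter(b, a, y)
    return y

def parallel_synth(source, formants, gains, fs):
    # Parallel structure (Holmes): each resonator fed and weighted individually.
    y = np.zeros_like(source, dtype=float)
    for (f_c, bw), g in zip(formants, gains):
        b, a = resonator_coeffs(f_c, bw, fs)
        y += g * lfilter(b, a, source)
    return y

# Example: a crude /a/-like vowel from a 100 Hz impulse-train source.
fs = 10000
t = np.arange(int(0.3 * fs))
source = (t % (fs // 100) == 0).astype(float)
formants = [(730, 60), (1090, 90), (2440, 120)]   # textbook /a/ formants
vowel = cascade_synth(source, formants, fs)
```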

Other source-filter synthesizer structures
Other configurations:
All-pole synthesizers, derived from LPC analysis (see the sketch below).
All-zero synthesizers, derived from cepstral analysis.
Fixed poles and variable zeros, derived from channel vocoder analysis: the channel vocoder consists of a relatively large number (14-30) of fixed bandpass filters, i.e., fixed poles; the positions of the zeros vary with the channel weights.
Variable poles and variable zeros, derived from a parallel-formant synthesizer.
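
For the all-pole case, synthesis reduces to filtering an excitation with 1/A(z). A minimal sketch, assuming the predictor coefficients are already available from LPC analysis (the coefficient values below are placeholders, not analysis results):

```python
import numpy as np
from scipy.signal import lfilter

def allpole_synth(excitation, lpc_coeffs, gain=1.0):
    """All-pole (LPC) synthesis: drive 1/A(z) with the excitation.

    lpc_coeffs = [a1, ..., ap] are the predictor coefficients, so the
    synthesis filter denominator is A(z) = 1 - a1 z^-1 - ... - ap z^-p.
    """
    a = np.concatenate(([1.0], -np.asarray(lpc_coeffs, dtype=float)))
    return lfilter([gain], a, excitation)

# Voiced frames are excited with a pulse train, unvoiced with white noise.
fs = 8000
pulse_train = (np.arange(fs // 4) % (fs // 100) == 0).astype(float)  # 100 Hz
voiced = allpole_synth(pulse_train, [1.3, -0.7, 0.1])  # placeholder coeffs
```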

Concatenative methods
Concatenative systems: speech waveforms (or compressed representations) are stored and then concatenated during synthesis, e.g., the LPC parameters used in talking chips (Morgan, 1984).
A technique used in many speech synthesis systems is pitch-synchronous overlap-add (PSOLA), now often called TD-PSOLA (time-domain PSOLA); a sketch follows below.
Variations: LP-PSOLA; CELP-PSOLA (codebook-excited LP PSOLA); RELP-PSOLA (residual-excited LP PSOLA).
Available synthesis systems:
http://www.cstr.ed.ac.uk/projects/festival/
http://tcts.fpms.ac.be/synthesis/mbrola.html
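
As a rough illustration of TD-PSOLA, the sketch below re-spaces two-period Hann-windowed grains taken around given pitch marks. It is a simplification, not a production algorithm: the pitch marks are assumed supplied by a pitch detector, and grains are used one-to-one, so duration scales along with pitch; a full TD-PSOLA duplicates or drops grains to preserve duration.

```python
import numpy as np

def td_psola_shift(x, marks, factor):
    """Pitch-shift by re-spacing two-period Hann grains (factor > 1 raises F0).

    marks: sample indices of pitch marks, assumed given.
    """
    y = np.zeros(int(len(x) / factor) + len(x))   # generous output buffer
    pos = 0.0                                     # output write position
    for i in range(1, len(marks) - 1):
        period = (marks[i + 1] - marks[i - 1]) // 2   # local pitch period
        lo, hi = marks[i] - period, marks[i] + period
        if lo < 0 or hi > len(x):
            continue
        grain = x[lo:hi] * np.hanning(hi - lo)        # two-period grain
        j = int(pos)
        if j + len(grain) <= len(y):
            y[j:j + len(grain)] += grain              # overlap-add
        pos += period / factor                        # new grain spacing
    return y[:int(pos) + 1]
```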

Pitch detection, perception and articulation
Pitch perception refers to the listener's subjective percept: the frequency of a pure tone that is matched to a more complex signal. Pitch detection (F0 estimation) refers to an objective measurement of the fundamental frequency of a signal.
Dudley's pitch detector was based on the articulatory premise that the voiced speech signal always includes the fundamental frequency component. However, many practical communication systems (e.g., telephones) are band-limited, and the fundamental component of the speech may be completely missing.

Difficulties in pitch detection
Large dynamic range: the pitch of some male voices can be as low as 60 Hz, whereas the pitch of children's voices can be as high as 800 Hz.
Pitch can fluctuate drastically in time.
Rapid vocal tract changes (e.g., the sudden closure in a vowel-to-nasal transition) make the waveform change drastically; the pitch might not change much, but the waveform change makes pitch detection difficult.
The voiced-unvoiced transition: a fast-acting time-domain detector would be necessary to detect the precise transition instant.
Speech degradation caused by transmission (e.g., a telephone channel) or added acoustic noise makes pitch detection difficult.

Signal processing to improve pitch detection
Low-pass filtering: human pitch perception pays more attention to the lower frequencies. Interestingly, estimating the pitch period by eye is typically easier with low-passed waveforms than with full-band waveforms. It has been shown in practice that a pitch-detection device has less trouble finding the correct period from low-passed waveforms.
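
A minimal sketch of this idea: low-pass filter the frame, then pick the strongest autocorrelation peak in the plausible lag range. The 900 Hz cutoff and the filter order are illustrative choices, not values from the lecture.

```python
import numpy as np
from scipy.signal import butter, lfilter

def acf_pitch(frame, fs, f_lo=50.0, f_hi=500.0, cutoff=900.0):
    """Autocorrelation pitch estimate after low-pass filtering."""
    b, a = butter(4, cutoff / (fs / 2))           # 4th-order Butterworth LP
    xl = lfilter(b, a, frame)
    xl = xl - xl.mean()                           # remove DC
    r = np.correlate(xl, xl, mode='full')[len(xl) - 1:]
    lo, hi = int(fs / f_hi), int(fs / f_lo)       # candidate lags (samples)
    lag = lo + np.argmax(r[lo:hi])
    return fs / lag                               # F0 estimate in Hz
```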

Signal processing to improve pitch detection (cont.)
Spectral flattening and temporal correlation (Sondhi, 1968): the original signal is first spectrally flattened (each frequency band is normalized by its own energy), and the new signal is then sent through an autocorrelator for pitch detection. The idea is based on the observation that the inverse Fourier transform of harmonics of equal amplitude and zero phase is a pulse-train-like signal, which is good for pitch detection.
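
A hedged sketch of Sondhi-style flattening: normalize each frequency band of the spectrum by its own energy, then autocorrelate. The band count is an arbitrary illustrative choice.

```python
import numpy as np

def spectrally_flatten(frame, n_bands=16):
    """Normalize each band by its energy so all harmonics end up comparable."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    edges = np.linspace(0, len(spec), n_bands + 1, dtype=int)
    for lo, hi in zip(edges[:-1], edges[1:]):
        e = np.sqrt(np.mean(np.abs(spec[lo:hi]) ** 2)) + 1e-12
        spec[lo:hi] /= e                    # per-band energy normalization
    return np.fft.irfft(spec, n=len(frame))

def flattened_acf(frame):
    # The peak location of this autocorrelation gives the pitch period.
    flat = spectrally_flatten(frame)
    r = np.correlate(flat, flat, mode='full')
    return r[len(flat) - 1:]
```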

Signal processing to improve pitch detection (cont.)
Inverse filtering: the speech signal is assumed to be the convolution of an excitation and a vocal tract filter. LPC and cepstral analysis provide an estimate of the vocal tract filter; inverse filtering with that estimate recovers the excitation, whose periodicity is easier to measure.
Comb filtering (Ross, 1974): the speech signal is sent through a multitude of delays corresponding to all possible periods of the input. For example, with a 10 kHz sampling frequency and an F0 range of 50-500 Hz, the possible periods range from 20 to 200 samples.
Cepstral pitch detection: see the notes on cepstral analysis (a sketch follows below).
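
A minimal cepstral pitch detector, for reference: the real cepstrum of voiced speech shows a peak at the pitch period, so we search the plausible quefrency range. Names and default values are illustrative.

```python
import numpy as np

def cepstral_pitch(frame, fs, f_lo=50.0, f_hi=500.0):
    """Pick the strongest real-cepstrum peak in the pitch quefrency range."""
    w = frame * np.hanning(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(w)) + 1e-12)
    cep = np.fft.irfft(log_mag)                 # real cepstrum
    lo, hi = int(fs / f_hi), int(fs / f_lo)     # quefrency search range
    q = lo + np.argmax(cep[lo:hi])              # pitch period in samples
    return fs / q                               # F0 estimate in Hz
```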

Pattern-recognition methods for pitch detection
Histogram based on high-resolution spectral analysis (Seneff, 1978).
A statistical approach which exemplifies the maximum-likelihood (ML) approach of testing all reasonable hypotheses (Goldstein, 1973; Duifhuis et al., 1982).

Smoothing pitch estimates
Median filtering (Seneff, 1978) with additional constraints (see the sketch below):
If the low-pass signal energy is below a threshold, the frame is set to unvoiced.
If the variance of three successive frames is too large, the median smoother output is set to zero (unvoiced).
Dynamic programming (Talkin, 1995).
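
A sketch of the median smoothing with the two constraints just listed. The threshold values are placeholders; in practice they are tuned to the application.

```python
import numpy as np
from scipy.signal import medfilt

def smooth_pitch_track(f0, energy, e_thresh, var_thresh):
    """Median-smooth a per-frame F0 track, forcing unvoiced where needed.

    f0, energy: per-frame pitch estimates and low-pass signal energies.
    """
    f0 = np.asarray(f0, dtype=float)
    smoothed = medfilt(f0, kernel_size=3)           # 3-point median filter
    smoothed[np.asarray(energy) < e_thresh] = 0.0   # low energy -> unvoiced
    for i in range(1, len(smoothed) - 1):
        if np.var(f0[i - 1:i + 2]) > var_thresh:    # erratic triplet
            smoothed[i] = 0.0                       # -> unvoiced
    return smoothed
```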

Digital speech coding
Vocoder (voice coder): an analysis-synthesis system. The primary application is source coding, for efficient storage and to reduce the bandwidth required for transmission.
The purpose of source coding research is to devise methods of lowering the required coding rates while maintaining the quality and robustness of the transmitted or stored speech.
Standards for digital speech coding exist at a range of bit rates.

Design considerations in vocoders
Design issues around the bandpass filters:
How many filters in the bandpass filter bank?
What filter bandwidth (as a function of center frequency)?
Which design method works best for channel vocoders (e.g., one with constant group delay, to avoid reverberation; see the previous class notes on filtering concepts)?
How can the FFT algorithm be adapted to meet these criteria?

Design considerations in vocoders (cont.)
Number of bandpass filters: no parameter of the channel vocoder can be designed in isolation. For example, with a channel capacity of 2400 bps and 400 bps spent on excitation parameters, 2000 bps remain for the channel signals; at 50 frames/sec and 4 bits/frame/channel, that budget buys only 10 bandpass filters, which yields substandard quality. Satisfactory vocoded speech for telephony may require 15-25 channels.
Filter bandwidth specification: many early channel vocoders used the same bandwidth for every channel, for easy implementation of the synthesizer filter bank. One possible design criterion is to follow the auditory bandwidths (about 100 Hz for center frequencies below 800 Hz, rising to approximately 250 Hz at a 3000 Hz center frequency). For easy implementation, the critical-bandwidth curve is approximated in a stepwise fashion; for example: 6 filters of 100 Hz width from 200 to 800 Hz; 6 filters of 150 Hz width at [950:150:1700] Hz; 5 filters of 200 Hz width at [1800:200:2600] Hz; and 3 filters of 300 Hz width at [2800:300:3400] Hz, for a total of 20 filters (tabulated in the sketch below).
Another possible criterion is that each filter should encompass a single harmonic of the voiced speech. The data rate for transmitting a single harmonic per filter is lower (a lower frame rate, i.e., less frequent updating), so more bandpass filters fit in the same channel capacity; for example, a total of 32 filters of 100 Hz width covering 200-3400 Hz spans the telephone band.
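
The 20-filter stepwise approximation can be tabulated directly. In the sketch below, the centers for the 200-800 Hz group are an assumption (the slide gives band edges for that group but MATLAB-style center ranges for the others):

```python
import numpy as np

# Stepwise approximation of the critical-bandwidth curve from the slide.
centers = np.concatenate([
    np.arange(250, 800, 100),     # 6 filters, 100 Hz wide, spanning 200-800 Hz
    np.arange(950, 1701, 150),    # 6 filters, 150 Hz wide
    np.arange(1800, 2601, 200),   # 5 filters, 200 Hz wide
    np.arange(2800, 3401, 300),   # 3 filters, 300 Hz wide
])
bandwidths = np.repeat([100, 150, 200, 300], [6, 6, 5, 3])

assert len(centers) == 20         # matches the slide's total of 20 filters
```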

Envelope extraction in a channel vocoder
The magnitude box in the block diagram is a half-wave (or full-wave) rectifier. With one harmonic passing through each bandpass filter, the envelope varies at 5-15 Hz. With two harmonics through one bandpass filter (e.g., the 12th and 13th harmonics of an 80 Hz fundamental), they beat at the fundamental frequency, and the design of the lowpass filter becomes critical.
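
A minimal per-channel envelope extractor in the spirit of the block diagram: full-wave rectification followed by a smoothing low-pass filter. The 20 Hz cutoff is an illustrative choice that passes the 5-15 Hz envelope variation while rejecting an 80 Hz beat component.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def channel_envelope(band_signal, fs, lp_cutoff=20.0):
    """Envelope of one vocoder channel: rectify, then low-pass filter."""
    rectified = np.abs(band_signal)              # full-wave rectifier
    b, a = butter(2, lp_cutoff / (fs / 2))       # smoothing low-pass
    return filtfilt(b, a, rectified)             # zero-phase smoothing
```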

Bit saving in channel vocoders: efficient quantization
µ-law quantizer: the human ear and brain judge relative sound intensities more or less logarithmically, so it makes sense to quantize the channel energy in a non-uniform manner. µ = 255 has been adopted as a standard for speech waveform encoding in the US and Canada. The compression characteristic is
y = X · log(1 + µ|x|/X) / log(1 + µ) · sgn(x),
where X is the maximum value of |x| and µ is a parameter.
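
Below, the compression characteristic and a uniform quantizer applied to the compressed value, with X normalized to 1 (a standard µ-law formulation; the helper names are mine):

```python
import numpy as np

def mu_law_compress(x, mu=255.0):
    """y = sign(x) * log(1 + mu*|x|) / log(1 + mu), with X normalized to 1."""
    x = np.clip(x, -1.0, 1.0)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_quantize(x, mu=255.0, bits=8):
    """Uniform steps in y correspond to non-uniform (finer-near-zero) steps in x."""
    y = mu_law_compress(x, mu)
    levels = 2 ** bits
    return np.round((y + 1) / 2 * (levels - 1))  # integer codes 0..levels-1
```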

Design of the excitation parameters for a channel vocoder
The original channel vocoder employed a pulse generator, a noise generator and a voiced-unvoiced (buzz-hiss) switch. What about voiced fricative sounds, where the excitation is really a combination of buzz and hiss? In addition, transient bursts are often too short (5-15 ms) to be adequately encoded at low bit rates (20-40 ms per frame). Verdict: to date, 2400 bps vocoders are not able to synthesize speech that is indistinguishable from the original.
Spectral flattening is needed when generating the excitation signal at the synthesizer, since the overall spectral shape has already been encoded in the envelopes.

LPC vocoders
The major difference between LPC vocoders and channel vocoders is the presence of the error signal (residual) in LPC. Using the unmodified error signal as excitation results in synthetic speech that is a replica of the original. Simply transmitting the full error signal yields no bit saving; however, if LPC spectral analysis has captured much of the spectral information, the error signal is primarily a function of the excitation parameters and ought to be codable at a lower rate, as in RELP (residual-excited LP).
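
A sketch of residual computation: estimate order-10 predictor coefficients (as in a 2400 bps LPC vocoder) by the autocorrelation method, then inverse-filter the frame with A(z). Function names are illustrative.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order=10):
    """Autocorrelation-method LPC: solve the normal equations R a = r."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    return solve_toeplitz(r[:order], r[1:order + 1])  # coeffs a1..ap

def lpc_residual(frame, order=10):
    """Inverse-filter with A(z) = 1 - sum a_k z^-k to get the prediction error."""
    a = lpc(frame, order)
    inverse_filter = np.concatenate(([1.0], -a))
    return np.convolve(frame, inverse_filter)[:len(frame)]  # e[n] = A(z) x[n]
```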

Cepstral vocoders

Design comparisons
Discreteness of analysis in the three systems:
Number of filters in channel vocoders: N = 15-25 for a satisfactory telephony channel vocoder.
Number of coefficients (= number of linear equations) in LPC vocoders: N = 10 has been considered a reasonable specification for a 2400 bps LPC vocoder.
Window length of the lifter used to truncate the cepstrum: N = 32 in Oppenheim's cepstral vocoder (1969).
Bandwidth specification:
Bandwidth specifications for channel vocoders were discussed above.
No additional specification is needed for LPC vocoders.
In cepstral vocoders, the window size used to perform the DFT plays this role.
Perceptually oriented variants:
Critical bandwidths adopted in channel vocoders.
Perceptual LP (PLP) by Hermansky (1990).
Mel-scale cepstral analysis (MFCC: Mel-Frequency Cepstral Coefficients).

References
Gold, B. and Morgan, N. (2000). Speech and Audio Signal Processing: Processing and Perception of Speech and Music. John Wiley & Sons.