Chapter 3. Description of the Cascade/Parallel Formant Synthesizer


D. H. Klatt

The Klattalk system uses the KLSYN88 cascade-parallel formant synthesizer that was first described in Klatt and Klatt (1990). This speech synthesizer is based on an original design described in Klatt (1980). This chapter¹ will outline the acoustic theory upon which KLSYN88 is based and will motivate a number of design decisions and modifications that have been made since the publication of the original Klatt (1980) synthesizer. Specific synthesis equations for sound sources and vocal tract transfer functions of various types are presented in Section 3.2. The effect on the synthesis of each control parameter is described in detail in Section 3.3, and examples are given of synthesis values for several different types of speech sounds. A final section will discuss simplifications made to permit real-time hardware implementation of the synthesizer in Klattalk. Readers familiar with Klatt (1980) may wish to skim over the first two sections and concentrate on the control parameter definitions of Section 3.3.

¹ This version was printed November 6, 1990 from the file Vax [klatt.klattalkbook] ch3klsyn88.tex. For the most up-to-date version, copy to DEC20 using the CFTP program, and big-latex it. All figures are in the top left filing cabinet drawer of my office labeled "Manuscripts in preparation". Book title: KLATTALK. Subtitle: The Conversion of English Text to Speech.

3.1 Overview

The broadband spectrogram shown in the lower half of Figure 3.1 illustrates a method of analyzing speech to determine the frequencies at which energy is present as a function of time. Time is plotted along the horizontal axis, frequency along the vertical axis, and blackness at any point is monotonically related to the energy in a frequency band 300 Hz wide, as averaged over a time interval of 1 or 2 ms. The waveform shown at the top of Figure 3.1 shows a short sample of periodic voicing from the /I/ of "this" followed by aperiodic noise from the /s/.

[Figure 3.1: Sound pressure waveform (top) and broadband sound spectrogram (bottom) of a sample of speech. The utterance is "This is a speech waveform"; the horizontal axis is time in seconds.]

Voicing is generated by vibrations of the vocal folds of the larynx. It shows up on the spectrogram as a series of vertical striations, each corresponding roughly to the excitation caused by the sudden termination of airflow as the vocal folds come together during vibration. The horizontal dark bands seen during voicing are due to the resonances of the vocal tract, which modify the sound produced by the voicing source. These resonances, which are called formants, move about in time in response to the motions of articulators such as the tongue, jaw, lips, and velum. Formant frequencies are the main acoustic evidence of the articulation employed by the speaker during the production of many speech sounds.

During noise production, as in the /s/ of "this", a turbulence noise source is created at a constriction formed by the tongue tip. Only the higher-frequency natural resonant modes of the vocal tract, those associated with the "front cavity" (i.e., the portion of the vocal tract in front of the constriction), are excited, forming broad dark areas in the spectrogram. The lower formants, being back-cavity resonances, are usually not excited by the noise source.

In general, speech is produced by the sequential activation of one or more sound sources which then excite the resonances of the vocal tract, resulting in characteristic acoustic patterns of sound radiating from the lips and nose of the talker. Only two basic types of sound sources are activated to produce most of the sounds of the languages of the world: (1) quasi-periodic sources involving the vibration of some structure such as the vocal folds, tongue tip, lips or uvula, and (2) turbulence noise sources, either sustained, as in a fricative such as /s/ or an aspirated sound such as /h/, or a brief burst of noise, as at the release of a plosive such as /t/ and /d/. For some sounds, a transient source is generated by abruptly releasing a closure behind which a positive or negative pressure has been created.

[Figure 3.2: The output spectrum of a speech sound, P(f), can be represented in the frequency domain as a product of a source spectrum, S(f), a vocal tract transfer function, T(f), and a radiation characteristic, R(f). Block diagram: sound source (voicing, aspiration, frication) producing the source volume velocity S(f); vocal tract transfer function T(f) producing the lip volume velocity; radiation characteristic R(f) producing the radiated sound pressure.]

Duplication of the acoustic patterns that appear on spectrograms serves as the end goal of efforts to produce synthetic speech. The ultimate criteria are, of course, perceptual: "Is the speech intelligible? Does it sound natural?" However, Holmes (1961; 1973) has shown that a well-designed formant-based speech synthesizer which can duplicate the pattern seen on a broadband spectrogram, i.e. the smoothed magnitude spectrum as it changes in time, is capable of producing speech that is indistinguishable from the original recording. In this sense, it is reasonable to base synthesizer performance on objective spectral comparisons, rather than informal subjective listening that is known to be strongly influenced by experimenter bias and expectations.
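
As a rough illustration of the broadband analysis and the objective spectral comparison discussed above, the following Python sketch computes a short-time magnitude spectrum with a short tapered window and measures the average dB difference between two such spectrograms. Everything here is an assumption made for illustration: the function names `broadband_spectrogram` and `spectral_distance_db`, the Hann window, the 3 ms window length (a window of a few milliseconds gives an analysis bandwidth of a few hundred Hz), and the 1 ms hop. It is not the analysis procedure used to produce Figure 3.1, nor the comparison method of Holmes.

```python
import cmath
import math


def broadband_spectrogram(samples, fs=10000, window_s=0.003, hop_s=0.001):
    """Return (frames, freqs); each frame is a magnitude spectrum in dB.

    window_s and hop_s are illustrative assumptions: a short window gives
    coarse frequency resolution but fine time resolution, as in a
    broadband spectrogram.
    """
    win_len = max(2, round(window_s * fs))
    hop = max(1, round(hop_s * fs))
    # Hann taper (an assumed choice of window shape).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * k / (win_len - 1))
              for k in range(win_len)]
    n_bins = win_len // 2 + 1
    freqs = [b * fs / win_len for b in range(n_bins)]
    frames = []
    for start in range(0, len(samples) - win_len + 1, hop):
        seg = [samples[start + k] * window[k] for k in range(win_len)]
        spectrum = []
        for b in range(n_bins):
            # Direct DFT of the windowed segment at bin b.
            acc = sum(seg[k] * cmath.exp(-2j * math.pi * b * k / win_len)
                      for k in range(win_len))
            spectrum.append(20 * math.log10(abs(acc) + 1e-12))
        frames.append(spectrum)
    return frames, freqs


def spectral_distance_db(frames_a, frames_b):
    """Mean absolute dB difference over the frames both spectrograms share."""
    n = min(len(frames_a), len(frames_b))
    if n == 0:
        raise ValueError("need at least one frame in each spectrogram")
    total = 0.0
    count = 0
    for fa, fb in zip(frames_a[:n], frames_b[:n]):
        for a, b in zip(fa, fb):
            total += abs(a - b)
            count += 1
    return total / count


if __name__ == "__main__":
    # A 2000 Hz test tone sampled at 10 kHz (100 ms long).
    tone = [math.sin(2 * math.pi * 2000.0 * n / 10000) for n in range(1000)]
    frames, freqs = broadband_spectrogram(tone)
    middle = frames[len(frames) // 2]
    print("peak frequency (Hz):", freqs[middle.index(max(middle))])
```

In practice one would compare a spectrogram of natural speech with one of the synthetic imitation; the distance measure above is only one of many possible choices.
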

Historically, speech synthesizers fall into two broad categories: (1) articulatory synthesizers that attempt to model faithfully the mechanical motions of the articulators and the resulting distributions of volume velocity and sound pressure in the lungs, larynx, vocal tract and nasal tract, and (2) formant synthesizers which attempt to approximate directly the speech waveform and spectrum by a simpler model formulated in the acoustic domain. Klattalk employs a formant model of speech generation since current articulatory models are fairly primitive, and rules to control the muscles or shapes of the articulators in such a model are difficult to optimize.

The synthesizer design employed in Klattalk is based on an acoustic theory of speech production that was first presented in Fant (1960), and is summarized in Figure 3.2. According to this view, one or more sources of sound energy are activated by creation of a pressure drop across a constriction in the airway, usually by the build-up of lung pressure. Treating each sound source separately, we may characterize it in the frequency domain by a source spectrum, S(f), where f is frequency in Hz. Each sound source excites the vocal tract, which acts as a resonating system analogous to an organ pipe. Since the vocal tract is a linear system, it can be described in the frequency domain by a linear transfer function, T(f), which is a ratio of output lip-plus-nose volume velocity, Ul(f), to source input, S(f). Finally, sound is radiated from the lips and/or nose. The spectrum of the sound pressure that would be recorded some distance from the lips of the talker, Pr(f), is related to lip-plus-nose volume velocity, Ul(f), by a radiation characteristic, R(f), that includes the effects of directional sound propagation from the head.

Each of the above relations can also be recast in the time (waveform) domain. This is the domain in which a waveform is actually generated in the computer. A sampled version of Pr(t), denoted by Pr(nT), consists of samples of the synthetic output waveform that are usually taken 10,000 times/second, i.e. every T = 0.0001 seconds, where n is an integer.

The synthesizer to be described includes components to simulate the generation of several different kinds of sound sources, components to simulate the vocal-tract transfer function, and a component to simulate sound radiation from the head. A simplified block diagram of the synthesizer is shown in Figure 3.3.

[Figure 3.3: Simplified block diagram of the synthesizer. The voicing source and aspiration source feed the vocal tract transfer function for laryngeal sources (formant resonators in cascade); the frication source feeds the vocal tract transfer function for frication sources (formant resonators in parallel); both paths pass through the radiation characteristic to give the output speech.]

The laryngeal sound sources, voicing and aspiration noise (as in /h/), are combined into a glottal² volume velocity waveform Ug(t) that excites the vocal tract. The vocal-tract model consists of digital formant resonators connected in cascade (the output of one serving as the input to the next).
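
The cascade arrangement and the sampling interval T = 0.0001 s can be made concrete with a small Python sketch. It assumes the second-order digital resonator of Klatt (1980), y(nT) = A x(nT) + B y(nT - T) + C y(nT - 2T), which this overview has not yet introduced; the equations actually used by KLSYN88 are those of Section 3.2. The function names, the impulse-train stand-in for the voicing source, and the formant frequencies and bandwidths are all hypothetical values chosen for illustration.

```python
import cmath
import math

T = 0.0001  # sampling interval in seconds (10,000 samples per second)


def resonator_coefficients(freq_hz, bw_hz, t=T):
    """Coefficients A, B, C of the assumed Klatt (1980) digital resonator."""
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    return a, b, c


def resonate(x, freq_hz, bw_hz, t=T):
    """Filter the sample list x through one formant resonator."""
    a, b, c = resonator_coefficients(freq_hz, bw_hz, t)
    y1 = y2 = 0.0  # the two previous output samples
    out = []
    for sample in x:
        y = a * sample + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def cascade(x, formants, t=T):
    """Connect resonators in cascade: each output is the next one's input."""
    for freq_hz, bw_hz in formants:
        x = resonate(x, freq_hz, bw_hz, t)
    return x


def cascade_gain(f_hz, formants, t=T):
    """Magnitude of the cascade transfer function at frequency f_hz."""
    z_inv = cmath.exp(-2j * math.pi * f_hz * t)
    gain = 1.0
    for freq_hz, bw_hz in formants:
        a, b, c = resonator_coefficients(freq_hz, bw_hz, t)
        gain *= abs(a / (1.0 - b * z_inv - c * z_inv * z_inv))
    return gain


def impulse_train(f0_hz, duration_s, t=T):
    """Crude stand-in for a voicing source: one unit impulse per period."""
    period = round(1.0 / (f0_hz * t))
    n = round(duration_s / t)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


if __name__ == "__main__":
    # Illustrative (frequency, bandwidth) pairs in Hz, not values from this chapter.
    formants = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 150.0)]
    volume_velocity = cascade(impulse_train(100.0, 0.05), formants)
    print(len(volume_velocity), "samples")
    print("gain at 500 Hz:", cascade_gain(500.0, formants))
```

Because each resonator has unity gain at 0 Hz, the resonators can be chained without separate amplitude controls for each formant; the transfer function of the cascade is simply the product of the individual resonator responses.
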
The output of the vocal-tract model is a lip volume velocity waveform, Ul(t).³ Radiation of this sound about the head results in a sound pressure waveform Pr(t) that can be measured by a microphone placed at a fixed distance in front of the head.

² The glottis is the space between the vocal folds of the larynx.

³ When a nasal consonant is produced, sound propagates from the nasal tract, so that Ul(t) should be thought of as including any volume velocity from the nares.

There is a second model of the transfer function of the vocal tract when the sound source is not at the larynx, as for example in a fricative or plosive. In this latter case, a frication source generates a turbulence noise waveform that excites a set of digital formant