Lecture 6: Speech modeling and synthesis

EE E682: Speech & Audio Processing & Recognition Lecture 6: Speech modeling and synthesis 1 2 3 4 5 Modeling speech signals Spectral and cepstral models Linear Predictive models (LPC) Other signal models Speech synthesis Dan Ellis <dpwe@ee.columbia.edu> http://www.ee.columbia.edu/~dpwe/courses/e682-21-1/ E682 SAPR - Speech models - Dan Ellis 21-2-27-1

1. The speech signal

[Spectrogram of "has a watch thin as a dime", with aligned phone labels]

Elements of the speech signal:
- spectral resonances (formants, moving)
- periodic excitation (voicing, pitched) + pitch contour
- noise excitation (fricatives, unvoiced, no pitch)
- transients (stop-release bursts)
- amplitude modulation (nasals, approximants)
- timing!

The source-filter model

Notional separation of:
- source: excitation, fine time-frequency structure
- filter: resonance, broad spectral structure

[Block diagram: pitch and voiced/unvoiced controls select between a glottal pulse train and frication noise (the source); the sum drives the vocal tract resonances (formants) and radiation characteristic (the filter) to produce speech]

More a modeling approach than a model

Signal modeling

Signal models are a kind of representation:
- to make some aspect explicit
- for efficiency
- for flexibility

The nature of the model depends on the goal:
- classification: remove irrelevant details
- coding/transmission: remove perceptual irrelevance
- modification: isolate control parameters

But commonalities emerge:
- perceptually irrelevant detail (coding) will also be irrelevant for classification
- a modification domain will usually reflect independent perceptual attributes
→ getting at the abstract information in the signal

Different influences for signal models

Receiver:
- see how the signal is treated by listeners → cochlea-style filterbank models

Transmitter (source):
- the physical apparatus can generate only a limited range of signals → LPC models of vocal tract resonances

Making particular aspects explicit:
- compact, separable resonance correlates → cepstrum
- modeling prominent features of the narrowband spectrogram → sinusoid models
- addressing unnaturalness in synthesis → harmonics+noise (H+N) model

Applications of (speech) signal models

Classification / matching. Goal: highlight important information
- speech recognition (lexical content)
- speaker recognition (identity or class)
- other signal classification
- content-based retrieval

Coding / transmission / storage. Goal: represent just enough information
- real-time transmission, e.g. mobile phones
- archive storage, e.g. voicemail

Modification / synthesis. Goal: change certain parts independently
- speech synthesis / text-to-speech (change the words)
- speech transformation / disguise (change the speaker)

Outline

1. Modeling speech signals
2. Spectral and cepstral models
   - auditorily-inspired spectra
   - the cepstrum
   - feature correlation
3. Linear predictive models (LPC)
4. Other models
5. Speech synthesis

2. Spectral and cepstral models

The spectrogram seems like a good representation:
- long history
- satisfying in use
- experts can read the speech

What is the information?
- intensity in time-frequency cells; typically 5 ms × 20 Hz × 5 dB

Discarded information:
- phase
- fine-scale timing

The starting point for other representations

The filterbank interpretation of the short-time Fourier transform (STFT)

Can regard spectrogram rows as the outputs of separate bandpass filters. Mathematically:

$$X[k, n_0] = \sum_n x[n]\,w[n - n_0]\exp\left(-j\frac{2\pi k(n - n_0)}{N}\right) = \sum_n x[n]\,h_k[n_0 - n]$$

where

$$h_k[n] = w[-n]\exp\left(j\frac{2\pi k n}{N}\right)$$

so each row is the signal convolved with a modulated window, whose frequency response is the window response shifted to the bin center frequency:

$$H_k(e^{j\omega}) = W(e^{j(\omega - 2\pi k/N)})$$
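
The identity above is easy to check numerically. A minimal sketch (plain NumPy; the signal, bin index, and frame position are all illustrative) computes one STFT coefficient both as a windowed DFT and as the output of the corresponding bandpass filter h_k:

```python
import numpy as np

N = 256                        # window / DFT length
k = 10                         # DFT bin under examination
n0 = 1024                      # analysis-frame start
rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
w = np.hanning(N)
m = np.arange(N)

# (1) Windowed DFT of the frame starting at n0
X_dft = np.sum(x[n0:n0 + N] * w * np.exp(-2j * np.pi * k * m / N))

# (2) Bandpass filter h_k[n] = w[-n] exp(j 2*pi*k*n/N), support -(N-1)..0,
#     stored in ascending index order, convolved with the whole signal
h_k = w[::-1] * np.exp(2j * np.pi * k * np.arange(-(N - 1), 1) / N)
y = np.convolve(x, h_k)
X_filt = y[n0 + N - 1]         # convolution output aligned with the frame at n0

print(np.allclose(X_dft, X_filt))   # True: the same coefficient both ways
```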

Spectral models: which bandpass filters?

Constant bandwidth? (analog filterbanks / FFT)

But cochlea physiology & critical bandwidths suggest:
- use actual bandpass filters in ear models
- choose bandwidths by e.g. critical-band estimates

Auditory frequency scales:
- constant Q (center frequency / bandwidth), mel, Bark, ...

Gammatone filterbank

Given the bandwidths, which filter shapes?
- match the inferred temporal integration window
- match the inferred spectral shape (sharp high-frequency slope)
- keep it simple (since it's only approximate)

Gammatone filters:

$$h[n] = n^{N-1}\exp(-bn)\cos(\omega_i n)$$

- 2N poles, 2 zeros: low complexity
- a reasonable linear match to the cochlea

[Figure: gammatone impulse response, z-plane pole-zero plot, and magnitude response in dB on a log frequency axis]
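
As a concrete illustration, here is a short NumPy sketch of a sampled gammatone impulse response following the formula above. Setting the bandwidth parameter b from an ERB estimate (Glasberg & Moore) is an assumption beyond the slide, as is the roughly constant-Q spacing of the center frequencies:

```python
import numpy as np

def gammatone_ir(fc, fs, order=4, dur=0.05):
    """Sampled gammatone h[n] = n^(N-1) e^{-bn} cos(w_i n).

    Bandwidth b is derived from the ERB at fc (an assumption; the
    lecture only specifies the functional form)."""
    n = np.arange(int(dur * fs))
    erb = 24.7 * (4.37 * fc / 1000 + 1)          # ERB in Hz
    b = 2 * np.pi * 1.019 * erb / fs             # decay per sample
    h = n ** (order - 1) * np.exp(-b * n) * np.cos(2 * np.pi * fc / fs * n)
    return h / np.max(np.abs(np.fft.rfft(h)))    # normalize peak gain to 1

fs = 16000
for fc in [250, 500, 1000, 2000, 4000]:          # ~constant-Q spacing
    h = gammatone_ir(fc, fs)
    H = 20 * np.log10(np.abs(np.fft.rfft(h, 4096)) + 1e-12)
    print(f"fc={fc:5d} Hz  response peaks near {np.argmax(H) * fs / 4096:6.0f} Hz")
```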

Constant-BW vs. cochlea model

[Figure: frequency responses of the effective FFT filterbank (constant bandwidth, linear frequency axis) vs. a gammatone filterbank (constant Q ≈ 4 above a few hundred Hz); spectrograms of the same utterance from FFT-based wideband analysis (N = 128) and from a 4-pole, 2-zero cochlea model, magnitude smoothed over a 5-20 ms time window and downsampled]

Limitations of spectral models

Not much data thrown away:
- just fine phase/time structure (smoothing)
- little actual modeling
- still a large representation!

Little separation of features:
- e.g. formants and pitch

Highly correlated features:
- modifications affect multiple parameters

But quite easy to reconstruct:
- iterative reconstruction of the lost phase

The cepstrum

Original motivation: assume a source-filter model (excitation source → resonance filter).

Homomorphic deconvolution:
- source-filter convolution: $g[n] * h[n]$
- FT → product: $G(e^{j\omega})\,H(e^{j\omega})$
- log → sum: $\log G(e^{j\omega}) + \log H(e^{j\omega})$
- IFT → separated fine structure: $c_g[n] + c_h[n]$ = deconvolution

Definition: the real cepstrum

$$c_n = \mathrm{idft}\big(\log\lvert\mathrm{dft}(x[n])\rvert\big)$$

Stages in cepstral deconvolution

- The original waveform has excitation fine structure convolved with resonances.
- The DFT magnitude shows harmonics modulated by resonances.
- The log DFT is the sum of a harmonic comb and resonant bumps.
- The IDFT separates the resonant bumps (low quefrency) from the regular fine structure (the "pitch pulse").
- Selecting the low-n cepstrum keeps just the resonance information (deconvolution / "liftering").

[Figure: waveform and minimum-phase impulse response; |DFT| and liftered version; log|DFT| and liftered version (dB vs. frequency); real cepstrum and lifter, with the pitch pulse marked on the quefrency axis]
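
The whole chain is a few lines of NumPy. This sketch builds a synthetic "voiced" frame (a pulse train convolved with a single decaying resonance; all values are illustrative), takes the real cepstrum, lifters out the low-quefrency part, and reads the pitch period off the high-quefrency peak:

```python
import numpy as np

fs = 8000
n = np.arange(512)
pulses = np.zeros(512)
pulses[::80] = 1.0                                        # 100 Hz pitch pulses
h = np.exp(-n / 20) * np.cos(2 * np.pi * 700 / fs * n)    # one resonance
x = np.convolve(pulses, h)[:512] * np.hamming(512)        # "voiced" frame

# Real cepstrum: c[n] = IDFT(log|DFT(x)|)
C = np.fft.ifft(np.log(np.abs(np.fft.fft(x)) + 1e-10)).real

# Low-quefrency lifter keeps the smooth (resonance) part of log|X|
lifter = np.zeros(512)
lifter[:30] = 1
lifter[-29:] = 1                                          # symmetric low-n window
smooth_logmag = np.fft.fft(C * lifter).real               # liftered log spectrum
                                                          # (plot against log|X|)
# High-quefrency peak reveals the pitch period
pitch_q = np.argmax(C[40:256]) + 40
print("pitch period =", pitch_q, "samples")               # ~80
```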

Properties of the cepstrum

Separates source (fine structure) & filter (broad structure):
- smooth the log magnitude spectrum to get the resonances

Smoothing the spectrum is filtering along frequency:
- i.e. convolution applied in the Fourier domain → multiplication in the IFT domain ("liftering")

Periodicity in time → harmonics in the spectrum → pitch pulse in the high-n cepstrum

Low-n cepstral coefficients are the DCT of the broad filter / resonance shape:

$$c_n = \frac{1}{2\pi}\int \log\lvert X(e^{j\omega})\rvert\,(\cos n\omega + j\sin n\omega)\,d\omega$$

(the sine term integrates to zero because $\log\lvert X\rvert$ is even).

[Figure: low-order cepstral coefficients and the 5th-order cepstral reconstruction of the log spectrum]

Aside: correlation of elements

The cepstrum is popular in speech recognition because its feature vector elements are decorrelated:
- $c_0$ normalizes out the average log energy

[Figure: feature matrices, covariance matrices, and an example joint distribution of elements (1,15), for an auditory spectrum vs. cepstral coefficients; the cepstral covariance is nearly diagonal]

Decorrelated pdfs fit diagonal Gaussians:
- simple correlation is a waste of parameters
- DCT is close to PCA for spectra?

Outline

1. Modeling speech signals
2. Spectral and cepstral models
3. Linear predictive models (LPC)
   - the LPC model
   - interpretation & application
   - formant tracking
4. Other models
5. Speech synthesis

3. Linear predictive modeling (LPC)

LPC is a very successful speech model:
- it is mathematically efficient (IIR filters)
- it is remarkably successful for voice (fits the source-filter distinction)
- it has a satisfying physical interpretation (resonances)

Basic math: model the output as a linear function of previous outputs,

$$s[n] = \sum_{k=1}^{p} a_k\,s[n-k] + e[n]$$

hence "linear prediction" (p-th order); e[n] is the excitation (input), a.k.a. the prediction error. In the z-domain,

$$\frac{S(z)}{E(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{1}{A(z)}$$

i.e. all-pole modeling, an autoregressive (AR) model.
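
In code, the two sides of this relation are a pair of inverse filters: filtering the excitation with 1/A(z) synthesizes s[n], and filtering s[n] with A(z) recovers the excitation exactly. A minimal sketch with made-up second-order coefficients:

```python
import numpy as np
from scipy.signal import lfilter

a = np.array([1.2, -0.8])              # assumed 2nd-order predictor coeffs
A = np.concatenate(([1.0], -a))        # A(z) polynomial: [1, -a1, -a2]
e = np.random.default_rng(1).standard_normal(1000)   # excitation

s = lfilter([1.0], A, e)               # all-pole synthesis: s = e / A(z)
e_hat = lfilter(A, [1.0], s)           # prediction error:  e = A(z) * s
print(np.allclose(e, e_hat))           # True
```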

Vocal tract motivation for LPC

Direct expression of the source-filter model:

$$s[n] = \sum_{k=1}^{p} a_k\,s[n-k] + e[n]$$

[Diagram: pulse/noise excitation e[n] drives the vocal tract filter H(z) = 1/A(z) to give s[n]; pole-zero plot and resonant magnitude response]

- Acoustic tube models suggest an all-pole model for the vocal tract.
- It is relatively slowly changing: update A(z) every 10-20 ms.
- Not perfect: nasals introduce zeros.

Estimating LPC parameters

Minimize the short-time squared prediction error:

$$E = \sum_n e^2[n] = \sum_n \left(s[n] - \sum_{k=1}^{p} a_k\,s[n-k]\right)^2$$

Differentiate w.r.t. $a_k$ and set to zero:

$$\sum_n 2\left(s[n] - \sum_{j=1}^{p} a_j\,s[n-j]\right)\big(-s[n-k]\big) = 0
\quad\Rightarrow\quad
\phi(0,k) = \sum_{j=1}^{p} a_j\,\phi(j,k)$$

where $\phi(j,k) = \sum_n s[n-j]\,s[n-k]$ are correlation coefficients

→ p linear equations to solve for all the $a_j$s...

Evaluating parameters

Linear equations: $\phi(0,k) = \sum_{j=1}^{p} a_j\,\phi(j,k)$

If s[n] is assumed zero outside some window,

$$\phi(j,k) = \sum_n s[n-j]\,s[n-k] = r(|j-k|)$$

Hence the equations become:

$$\begin{bmatrix} r(0) & r(1) & \cdots & r(p-1) \\ r(1) & r(0) & \cdots & r(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ r(p-1) & r(p-2) & \cdots & r(0) \end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix}
= \begin{bmatrix} r(1) \\ r(2) \\ \vdots \\ r(p) \end{bmatrix}$$

A Toeplitz matrix (equal values along each diagonal) → can use the Durbin recursion to solve. (Solving with the full $\phi(j,k)$ requires e.g. Cholesky decomposition.)
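
A minimal autocorrelation-method estimator along these lines (scipy.linalg.solve_toeplitz applies the Levinson-Durbin recursion internally; the AR(2) test signal and its coefficients are illustrative):

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_autocorr(frame, p):
    """Autocorrelation-method LPC: window the frame (so s[n] = 0 outside),
    form r(0)..r(p), and solve the Toeplitz system for a_1..a_p."""
    s = frame * np.hamming(len(frame))
    r = np.correlate(s, s, mode='full')[len(s) - 1:]   # r(0), r(1), ...
    return solve_toeplitz(r[:p], r[1:p + 1])           # first column, rhs

# Usage: recover the coefficients of a known AR(2) process
rng = np.random.default_rng(0)
s = lfilter([1.0], [1.0, -1.2, 0.8], rng.standard_normal(8000))
print(lpc_autocorr(s, 2))    # close to [1.2, -0.8]
```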

LPC illustration

[Figure: windowed original waveform; original spectrum vs. smooth LPC spectrum (dB vs. frequency); LPC residual waveform and its flattened residual spectrum; the actual poles in the z-plane]

Interpreting LPC

Picking out resonances:
- if the signal really was a source driving all-pole resonances, LPC should find those resonances

Least-squares fit to the spectrum:
- minimizing $\sum_n e^2[n]$ in the time domain is the same as minimizing $\int \lvert E(e^{j\omega})\rvert^2\,d\omega$ (by Parseval)
→ close fit to spectral peaks; valleys don't matter

Removing smooth variation in the spectrum:
- 1/A(z) is a low-order approximation to S(z): $\frac{S(z)}{E(z)} = \frac{1}{A(z)}$
- hence the residual E(z) = A(z)S(z) is a flattened version of S

Signal whitening:
- white noise (independent x[n]s) has a flat spectrum
→ whitening removes temporal correlation

Alternative LPC representations

Many alternative p-dimensional representations:
- coefficients $\{a_i\}$
- roots $\{\lambda_i\}$: $\prod_i (1 - \lambda_i z^{-1}) = 1 - \sum_i a_i z^{-i}$
- line spectrum frequencies...
- reflection coefficients $\{k_i\}$ from the lattice form
- tube-model log area ratios: $g_i = \log\dfrac{1 - k_i}{1 + k_i}$

The choice depends on:
- mathematical convenience/complexity
- quantization sensitivity
- ease of guaranteeing stability
- what is made explicit
- distributions as statistics

LPC applications

Analysis-synthesis (coding, transmission):
- $S(z) = \frac{E(z)}{A(z)}$, hence can reconstruct by filtering e[n] with the $\{a_i\}$s
- the whitened, decorrelated, minimized e[n] is easy to quantize
- ... or model e[n], e.g. as a simple pulse train

Recognition/classification:
- the LPC fit responds to spectral peaks (formants)
- can use for recognition (convert to cepstra?)

Modification:
- separating source and filter supports cross-synthesis
- the pole/resonance model supports warping (e.g. male → female)
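
A sketch of the analysis-synthesis loop, reusing lpc_autocorr() from the previous sketch. The "speech" here is a synthetic pulse train through a fixed resonance, and all constants are illustrative: filtering s[n] with A(z) gives the whitened residual, and filtering any excitation with 1/A(z) resynthesizes.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000
src = np.zeros(fs)
src[::100] = 1.0                                  # 80 Hz pulse train
speech = lfilter([1.0], [1.0, -1.6, 0.95], src)   # stand-in "voiced" signal

p = 8
a = lpc_autocorr(speech[:400], p)                 # analyze one frame
A = np.concatenate(([1.0], -a))                   # A(z) = 1 - sum a_k z^-k

residual = lfilter(A, [1.0], speech)              # whitened excitation e[n]
resynth = lfilter([1.0], A, residual)             # filter e[n] with 1/A(z)
print(np.allclose(speech, resynth))               # True: exact reconstruction

pulses = np.zeros(fs)
pulses[::100] = 1.0                               # modeled excitation instead
synthetic = lfilter([1.0], A, pulses)             # pulse-excited LPC synthesis
```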

Aside: formant tracking

Formants carry (most?) linguistic information. Why not classify them directly for speech recognition?
- e.g. local maxima in the cepstrally-liftered spectrum, or pole frequencies in the LPC fit

But recognition needs to work in all circumstances:
- formants can be obscured or undefined

[Figure: spectrograms of the original utterance (mpgr1_sx419) and of a noise-excited LPC resynthesis with the pole frequencies overlaid]

→ Need more graceful, robust parameters

Outline

1. Modeling speech signals
2. Spectral and cepstral models
3. Linear predictive models (LPC)
4. Other models
   - sinewave modeling
   - harmonics+noise model (HNM)
5. Speech synthesis

4. Other models: sinusoid modeling

Early signal models required low complexity, e.g. LPC. Advances in hardware open new possibilities...

The narrowband spectrogram suggests a harmonics model:

[Figure: narrowband spectrogram, 0-4 kHz, showing harmonic tracks]

- the important information in the 2-D surface is the set of tracks?
- harmonic tracks have approximately smooth properties
- straightforward resynthesis

Sine wave models

Model the sound as a sum of AM/FM sinusoids:

$$s[n] = \sum_{k=1}^{N[n]} A_k[n] \cos\big(n\,\omega_k[n] + \phi_k[n]\big)$$

- $A_k$, $\omega_k$, $\phi_k$ piecewise linear or constant
- can enforce harmonicity: $\omega_k = k\,\omega_0$

Extract parameters directly from STFT frames:
- find local maxima of |S[k,n]| along frequency
- track birth/death & correspondence of peaks across time
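
A toy additive resynthesizer in this spirit. This is a simplified sketch: parameters are linearly interpolated per sample and phases are regenerated by integrating frequency rather than matched to analysis values, and the track data are invented:

```python
import numpy as np

def synth_sinusoids(tracks, n_samples, fs):
    """Additive resynthesis: each track is (frame_times, freqs_hz, amps),
    linearly interpolated per sample; phase is the running integral of
    instantaneous frequency."""
    t = np.arange(n_samples) / fs
    out = np.zeros(n_samples)
    for frame_t, freqs, amps in tracks:
        f = np.interp(t, frame_t, freqs)         # instantaneous frequency
        a = np.interp(t, frame_t, amps)          # instantaneous amplitude
        phase = 2 * np.pi * np.cumsum(f) / fs    # integrate frequency
        out += a * np.cos(phase)
    return out

# Usage: two harmonic tracks gliding 200->250 Hz and 400->500 Hz over 1 s
fs = 8000
frames = np.array([0.0, 0.5, 1.0])
tracks = [(frames, [200, 225, 250], [1.0, 0.8, 0.6]),
          (frames, [400, 450, 500], [0.5, 0.4, 0.3])]
y = synth_sinusoids(tracks, fs, fs)
```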

Finding sinusoid peaks

Look for local maxima along each DFT frame:
- i.e. $|S[k-1,n]| < |S[k,n]| > |S[k+1,n]|$

Want the exact frequency of the implied sinusoid:
- the DFT is normally quantized quite coarsely, e.g. 4000 Hz / 256 bins = 15.6 Hz
- interpolate at peaks via a quadratic fit to the three spectral samples around each maximum → interpolated frequency and magnitude
- may also need interpolated, unwrapped phase

Or use the differential of phase along time (phase vocoder):

$$\omega = \frac{a\dot{b} - b\dot{a}}{a^2 + b^2}, \quad \text{where } S[k,n] = a + jb$$
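
A sketch of the peak-picking with quadratic (parabolic) refinement; the fractional-bin formula below is the standard three-point parabola fit, applied here to the dB magnitudes of an invented test sinusoid:

```python
import numpy as np

def quad_interp_peaks(mag_db, bin_hz):
    """Find local maxima along frequency and refine each by fitting a
    parabola through the peak bin and its two neighbours."""
    peaks = []
    for k in range(1, len(mag_db) - 1):
        a, b, c = mag_db[k - 1], mag_db[k], mag_db[k + 1]
        if b > a and b > c:                       # local maximum
            p = 0.5 * (a - c) / (a - 2 * b + c)   # fractional offset, |p| <= 0.5
            peaks.append(((k + p) * bin_hz,       # interpolated frequency
                          b - 0.25 * (a - c) * p))  # interpolated magnitude
    return peaks

# Usage: a 1212.5 Hz sinusoid analyzed with a 256-point DFT at fs = 4000 Hz
fs, N = 4000, 256
n = np.arange(N)
x = np.cos(2 * np.pi * 1212.5 / fs * n) * np.hanning(N)
mag_db = 20 * np.log10(np.abs(np.fft.rfft(x)) + 1e-12)
peaks = quad_interp_peaks(mag_db, fs / N)
print(max(peaks, key=lambda fm: fm[1]))   # ~1212.5 Hz, despite 15.6 Hz bins
```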

Sinewave modeling applications

Modification (interpolation) & synthesis:
- connecting arbitrary $\omega$ & $\phi$ requires cubic phase interpolation (because $\omega = \dot{\phi}$)

Types of modification:
- time & frequency scale modification, with or without changing the formant envelope
- concatenation / smoothing of boundaries
- phase realignment (for crest reduction)

Non-harmonic signals? OK-ish.

[Figure: sinusoid tracks extracted from a non-harmonic signal, 0-4 kHz]

Harmonics + noise model

Motivations to modify the sinusoid model:
- problems with analysis of real (noisy) signals
- problems with synthesis quality (especially noise)
- perceptual suspicions

Model:

$$s[n] = \underbrace{\sum_{k=1}^{N[n]} A_k[n]\cos\big(n\,k\,\omega_0[n]\big)}_{\text{Harmonics}} \;+\; \underbrace{e[n]\cdot\big(h_n[n] * b[n]\big)}_{\text{Noise}}$$

- the sinusoids are forced to be harmonic
- the remainder is filtered & time-shaped noise

A break frequency $F_m[n]$ divides the spectrum between harmonics (below) and noise (above the "harmonicity limit").

HNM analysis and synthesis

Dynamically adjust $F_m[n]$ based on a per-frame "harmonic test".

[Figure: spectrogram with the time-varying break frequency F_m[n] overlaid]

The noise has envelopes in time, e[n], and in frequency, $H_n[k]$:
- reconstruct bursts / synchronize them to pitch pulses
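
A single synthetic HNM-style frame might be generated as below. This is a rough sketch with invented parameters: equal-amplitude harmonics below an assumed Fm, plus high-passed noise given a pitch-synchronous time envelope; in a real system the amplitudes, noise filter, and envelopes all come from analysis.

```python
import numpy as np
from scipy.signal import firwin, lfilter

fs, N = 16000, 1024
f0, Fm = 120.0, 3000.0                    # pitch and break frequency (assumed)
n = np.arange(N)

# Harmonic part: harmonics of f0 up to the harmonicity limit Fm
K = int(Fm // f0)
harm = sum(np.cos(2 * np.pi * k * f0 / fs * n) for k in range(1, K + 1)) / K

# Noise part: white noise high-passed above Fm (frequency envelope H_n),
# then shaped in time so energy clusters around pitch pulses (envelope e[n])
hp = firwin(129, Fm / (fs / 2), pass_zero=False)
noise = lfilter(hp, 1.0, np.random.default_rng(3).standard_normal(N))
env = 0.5 + 0.5 * np.cos(2 * np.pi * f0 / fs * n)
frame = harm + 0.3 * env * noise
```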

Outline

1. Modeling speech signals
2. Spectral and cepstral models
3. Linear predictive models (LPC)
4. Other models
5. Speech synthesis
   - phone concatenation
   - diphone synthesis

5. Speech synthesis

One thing you can do with models. Easier than recognition?
- listeners do the work...
- ... but listeners are very critical

Overview of synthesis:

text → Text normalization → Phoneme generation → Prosody generation → Synthesis algorithm → speech

- normalization disambiguates the text (abbreviations)
- phonetic realization comes from a pronouncing dictionary
- prosody is synthesized by rule (timing, pitch contour)
- ... all of which controls the waveform generation

Source-filter synthesis

The flexibility of the source-filter model is ideal for speech synthesis:

[Diagram: pitch info and voiced/unvoiced decisions drive a glottal pulse source and a noise source; their sum excites the vocal tract filter, controlled by phoneme info (e.g. "th ax k ae t"), to produce speech]

Excitation source issues:
- voiced / unvoiced / mixture ([th] etc.)
- pitch cycle of voiced segments
- glottal pulse shape → voice quality?

Vocal tract modeling

Simplest idea: store a single vocal-tract model for each phoneme and play them in sequence ("th ax k ae t")
- but the discontinuities at the joins are very unnatural

Improve by smoothing between templates:

[Figure: spectrogram sketches of "th ax k ae t": abutted templates vs. smoothed transitions]

- the trick is finding the right domain to smooth in

Cepstrum-based synthesis

The low-n cepstrum is a compact model of the target spectrum. It can be inverted to get an actual vocal-tract impulse-response waveform:

$$c_n = \mathrm{idft}\big(\log\lvert\mathrm{dft}(x[n])\rvert\big)
\quad\Rightarrow\quad
h[n] = \mathrm{idft}\big(\exp(\mathrm{dft}(c_n))\big)$$

The all-zero (FIR) vocal-tract response can be pre-convolved with glottal pulses:

[Diagram: a glottal pulse inventory (templates for ee, ae, ah, ...) placed at pitch pulse times derived from the pitch contour]

- cross-fading between templates is OK
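
Continuing the cepstral deconvolution sketch from earlier (reusing its C and lifter arrays, length 512), the inversion formula above is just two FFTs. This zero-phase variant mirrors the low-quefrency cepstrum, so h[n] comes out symmetric; a minimum-phase version would instead double the positive quefrencies:

```python
import numpy as np

c_lift = C * lifter                       # low-quefrency (resonance) cepstrum
H = np.exp(np.fft.fft(c_lift))            # exp(DFT(c_n)): smooth spectrum
h = np.fft.ifft(H).real                   # FIR vocal-tract impulse response

# Pre-convolve with glottal pulses at the desired pitch period (80 samples)
pulses = np.zeros(2048)
pulses[::80] = 1.0
y = np.convolve(pulses, h)[:2048]         # synthesized "phoneme" segment
```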

LPC-based synthesis

Very compact representation of target spectra:
- 3 or 4 pole pairs per template
- low-order IIR filter → very efficient synthesis

How to interpolate between templates?
- cannot just interpolate the $a_i$ in a running filter
- but the lattice filter (reflection coefficients $k_i$) has better-behaved interpolation

[Diagram: direct-form IIR synthesis filter vs. all-pole lattice filter, both driven by e[n]]

What to use for excitation?
- the residual from the original analysis
- a reconstructed periodic pulse train
- parameterized residual resynthesis
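
A sketch of the all-pole lattice synthesis filter (one common sign convention; the reflection coefficients are invented). The point is that the filter stays stable for any coefficient set with |k_i| < 1, so the k_i can be interpolated between frames where the a_i cannot:

```python
import numpy as np

def lattice_synth(e, k):
    """All-pole lattice synthesis: drive the top of the lattice with the
    excitation and work down the stages; b[] holds the delayed backward
    prediction errors from the previous sample."""
    p = len(k)
    f = np.zeros(p + 1)        # forward errors f_p .. f_0 (current sample)
    b = np.zeros(p + 1)        # backward errors (delayed by one sample)
    out = np.zeros(len(e))
    for n, en in enumerate(e):
        f[p] = en
        for i in range(p, 0, -1):                # stage i of the lattice
            f[i - 1] = f[i] + k[i - 1] * b[i - 1]
            b[i] = b[i - 1] - k[i - 1] * f[i - 1]
        b[0] = f[0]                              # output feeds the first delay
        out[n] = f[0]
    return out

# Usage: two stable reflection coefficients, white-noise excitation
e = np.random.default_rng(2).standard_normal(500)
y = lattice_synth(e, k=[0.9, -0.5])
```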

Diphone synthesis

Problems in phone-concatenation synthesis:
- phonemes are context-dependent
- coarticulation is complex
- transitions are critical to perception

→ Store the transitions instead of just the phonemes:

[Figure: "has a watch thin as a dime" with phone labels, segmented at phone centers into diphone units]

Diphone segments:
- ~40 phones → ~800 diphones
- or even more context if you have a larger database

How to splice diphones together?
- TD-PSOLA: align pitch pulses and cross-fade
- MBROLA: normalized, multiband

HNM synthesis

High-quality resynthesis of real diphone units, plus a parametric representation for modifications:
- pitch and timing modifications
- removal of discontinuities at unit boundaries

Synthesis procedure:
- linguistic processing gives phones, pitch, timing
- database search gives the best-matching units
- use HNM to fine-tune pitch & timing
- cross-fade the $A_k$ and $\omega_k$ parameters at boundaries

Careful preparation of the database is key:
- sine models allow phase alignment of all units
- a larger database improves unit match

Generating prosody

The real factor limiting speech synthesis?

Waveform synthesizers have inputs for:
- intensity (stress)
- duration (phrasing)
- fundamental frequency (pitch)

Control curves are produced by superposition of (many) inferred linguistic rules:
- phrase-final lengthening, unstressed shortening, ...

Or learn the rules from transcribed examples.

Summary

Range of models:
- spectral
- cepstral
- LPC
- sinusoid
- HNM

Range of applications:
- general spectral shape (filterbank) → ASR
- precise description (LPC + residual) → coding
- pitch & time modification (HNM) → synthesis

Issues:
- performance vs. computational complexity
- generality vs. accuracy
- representation size vs. quality