
Speech Processing. Simon King, University of Edinburgh. Additional lecture slides for 2018-19.

Assignment Q&A; writing exercise.

Roadmap
Modules 1-2: The basics
Modules 3-5: Speech synthesis
Modules 6-9: Speech recognition
Week 3: Module 3 - text processing
Week 4: Class trip; Module 4 - pronunciation & prosody
Week 5: Assignment Q&A; Module 5 - waveform generation
Week 6: Submission of first assignment

What you should already know, from the videos & readings:
Concatenation of waveform fragments - choosing units that capture contextual effects (i.e., co-articulation): diphone units.
Waveform manipulation - TD-PSOLA, which can only modify duration and F0.
The linear predictive model - which can also modify the filter / spectral envelope / vocal tract shape.

Speech synthesis - waveform generation. Extending diphone synthesis to unit selection; signal processing for waveform modification: the time-domain method (TD-PSOLA) and the source-filter model-domain method (linear predictive filtering).

Retrieve candidate units from the pre-recorded database. [diagram: for the target phone sequence "dh ax k ae t s ae t", several candidate units retrieved from the database for each position; some positions have many candidates, others fewer]

Which candidate sequence will sound best? [diagram: the same candidate lattice; one sequence of candidates must be chosen from it]

Similarity between the candidate sequence and the target sequence. The ideal candidate unit sequence might comprise units taken from linguistic contexts identical to those in the target unit sequence. Of course, this will not be possible in general, so we must use less-than-ideal units from non-identical (i.e., mismatched) contexts. We need to quantify how mismatched each candidate is, so we can choose amongst them. The mismatch distance, or cost, between a candidate unit and the ideal (i.e., target) unit is measured by the target cost function.
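To make this concrete, here is a minimal sketch of a target cost in Python. It is illustrative only, not Festival's implementation: the context features and weights are invented for the example.

```python
# A toy target cost: a weighted sum of linguistic-context mismatches
# between a candidate unit and the ideal (target) unit.
def target_cost(target, candidate, weights=None):
    if weights is None:
        # hypothetical features and weights, purely for illustration
        weights = {"left_phone": 1.0, "right_phone": 1.0,
                   "stress": 0.5, "phrase_position": 0.5}
    cost = 0.0
    for feature, w in weights.items():
        if target[feature] != candidate[feature]:  # binary sub-cost
            cost += w
    return cost

# A candidate from an identical context scores 0; each mismatch adds cost.
target    = {"left_phone": "dh", "right_phone": "k", "stress": 1, "phrase_position": "medial"}
candidate = {"left_phone": "dh", "right_phone": "k", "stress": 0, "phrase_position": "medial"}
print(target_cost(target, candidate))  # 0.5
```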

Join cost. The join cost measures the acoustic mismatch between two candidate units. A typical join cost quantifies the acoustic mismatch across the concatenation point, e.g., spectral characteristics (parameterised as MFCCs, perhaps), F0, and energy. Festival's multisyn uses a sum of normalised sub-costs (weights tuned by ear).
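A corresponding sketch of a join cost, in the spirit of the description above but not multisyn's exact formula: it compares acoustic features measured at the last frame of the left candidate and the first frame of the right candidate, with placeholder weights standing in for values tuned by ear.

```python
import numpy as np

def join_cost(left_frame, right_frame, w_mfcc=1.0, w_f0=1.0, w_energy=1.0):
    """Weighted sum of acoustic sub-costs across the concatenation point.
    In a real system each sub-cost would be normalised before weighting."""
    mfcc_dist   = np.linalg.norm(left_frame["mfcc"] - right_frame["mfcc"])
    f0_dist     = abs(left_frame["f0"] - right_frame["f0"])
    energy_dist = abs(left_frame["energy"] - right_frame["energy"])
    return w_mfcc * mfcc_dist + w_f0 * f0_dist + w_energy * energy_dist

# two hypothetical boundary frames
left  = {"mfcc": np.zeros(12), "f0": 120.0, "energy": 0.8}
right = {"mfcc": np.full(12, 0.1), "f0": 124.0, "energy": 0.7}
print(join_cost(left, right))
```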

Speech synthesis - waveform generation. Extending diphone synthesis to unit selection; signal processing for waveform modification: the time-domain method (TD-PSOLA) and the source-filter model-domain method (linear predictive filtering).

Why do we need to manipulate the recorded speech? Diphone synthesis: we only have a single recorded example of each diphone, so it won't have the correct F0 or duration; we want to impose the F0 and duration predicted by the front end. Unit selection (full details in the Speech Synthesis course): to disguise the joins by lightly smoothing F0 and the spectral envelope in the local region around each join; imposing the F0 and duration predicted by the front end is optional.

What does the front end produce as output? [diagram: text "the cat sat" → Front end → linguistic specification]

For diphone synthesis, we must predict acoustic properties.

Predicted acoustic properties. [diagram: the linguistic specification gives the phones s ay m ax n, each with a desired duration and a desired F0]

Retrieve recorded diphones from the database. [diagram: the diphones _s, s_ay, ay_m, m_ax, ax_n, n_ retrieved from the database, each with its recorded duration and F0]

Make a plan for manipulating F0 and duration. [diagram: for each recorded diphone _s, s_ay, ay_m, m_ax, ax_n, n_, the actual duration and F0 shown alongside the desired duration and F0]

Speech synthesis - waveform generation. Extending diphone synthesis to unit selection; signal processing for waveform modification: the time-domain method (TD-PSOLA) and the source-filter model-domain method (linear predictive filtering).

Step-by-step waveform generation: TD-PSOLA version. [diagram: the recorded diphones are concatenated and then manipulated with TD-PSOLA to produce the manipulated diphones]
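A heavily simplified sketch of the TD-PSOLA idea, assuming pitch marks (one per glottal pulse) are already known: window out two-period frames around the analysis marks, then overlap-add them at the new synthesis marks. Real implementations need robust pitch marking and unvoiced-region handling, both omitted here.

```python
import numpy as np

def td_psola(x, analysis_marks, synthesis_marks):
    """Overlap-add Hann-windowed two-period frames at new pitch marks."""
    y = np.zeros(synthesis_marks[-1] + 1024)
    for s in synthesis_marks:
        # use the analysis frame nearest to this synthesis instant
        i = int(np.argmin(np.abs(analysis_marks - s)))
        if i == 0 or i == len(analysis_marks) - 1:
            continue  # skip edges for simplicity
        period = int(analysis_marks[i + 1] - analysis_marks[i - 1]) // 2
        a = int(analysis_marks[i])
        lo, hi = a - period, a + period
        if lo < 0 or hi > len(x) or s - period < 0:
            continue
        y[s - period : s + period] += x[lo:hi] * np.hanning(hi - lo)
    return y[: synthesis_marks[-1]]

# Example: raise the F0 of a 100 Hz tone by ~20% by spacing the synthesis
# marks more closely than the analysis marks (duration otherwise unchanged).
fs = 16000
x = np.sin(2 * np.pi * 100 * np.arange(fs) / fs)
analysis_marks  = np.arange(160, len(x) - 160, 160)  # one mark per 10 ms period
synthesis_marks = np.arange(160, len(x) - 160, 133)  # one mark per ~8.3 ms
y = td_psola(x, analysis_marks, synthesis_marks)
```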

Speech synthesis - waveform generation. Extending diphone synthesis to unit selection; signal processing for waveform modification: the time-domain method (TD-PSOLA) and the source-filter model-domain method (linear predictive filtering).

Using a model of speech to perform manipulation. Convert the speech waveform into the parameters of a source-filter model, e.g., LPC: filter coefficients + F0 + voicing decision (V/UV). Discard the waveforms; store the model parameters. At synthesis time, retrieve the model parameters from the database, modify the parameters if required, then resynthesise.

Step-by-step waveform generation: LPC version. When building the voice: convert the recorded waveforms into source + filter; the source is F0 + a voicing decision, the filter is the LPC coefficients. When generating the waveform: manipulate the source to achieve the desired duration and F0; interpolate the filter coefficients to match; reconstruct the waveform from the manipulated source + filter.

LPC: convert speech into model parameters. [diagram: the recorded diphones _s, s_ay, ay_m, m_ax, ax_n, n_ to be converted]

LPC: convert speech into model parameters (e.g., for the diphone m_ax). For each frame: fit the filter to the signal (this captures the spectral envelope), i.e., solve some equations to find the filter coefficients; inverse filter the speech to obtain the residual; store the filter coefficients and the residual signal (which is itself a waveform).
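As a sketch of what "solve some equations" means here: the autocorrelation method sets up a small linear system per frame, whose solution gives the coefficients; inverse filtering then yields the residual. The frame length and filter order below are typical choices, not taken from the slides, and the sign convention matches the difference equation on the next slide.

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.signal import lfilter

def lpc_analyse(frame, order=16):
    """Return filter coefficients b[1..K] and the residual for one frame."""
    # autocorrelation of the (windowed) frame
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # normal equations of linear prediction: solve for predictor coefficients
    a = np.linalg.solve(toeplitz(r[:order]), r[1:order + 1])
    b = -a  # convention of the difference equation on the next slide
    # inverse (analysis) filter A(z) = 1 + b1 z^-1 + ... applied to the frame
    residual = lfilter(np.concatenate(([1.0], b)), [1.0], frame)
    return b, residual

frame = np.hanning(400) * np.random.randn(400)  # stand-in for one 25 ms frame
b, e = lpc_analyse(frame)
```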

LPC: convert speech into model parameters. [diagram: source → e[t] → filter → y[t] (output speech)] The filter is defined by the difference equation y[t] = e[t] - Σ_{k=1}^{K} b_k y[t-k].
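A direct, loop-based implementation of this difference equation (an all-pole, autoregressive filter), as a sketch:

```python
import numpy as np

def synthesis_filter(e, b):
    """y[t] = e[t] - sum_{k=1}^{K} b[k] * y[t-k]  (all-pole filter)"""
    K = len(b)
    y = np.zeros(len(e))
    for t in range(len(e)):
        acc = e[t]
        for k in range(1, min(K, t) + 1):  # only as much history as exists
            acc -= b[k - 1] * y[t - k]
        y[t] = acc
    return y

# equivalently, with scipy: lfilter([1.0], np.concatenate(([1.0], b)), e)
```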

Step-by-step waveform generation: LPC version. Retrieve the filter coefficients and residual signals from the database. Construct the residual signal for the utterance using concatenation; F0 & duration can be manipulated with the PSOLA method. Interpolate the filter coefficients to be pitch-synchronous. Pass the residual signal through the filter, updating the filter parameters once per pitch period.
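A sketch of that final step, assuming pitch-period boundaries and one interpolated coefficient vector per period are already available (both hypothetical inputs here): filter the residual period by period, carrying the filter state across boundaries.

```python
import numpy as np
from scipy.signal import lfilter

def resynthesise(residual, period_ends, coeffs_per_period):
    """Run the all-pole filter over the residual, swapping in new
    coefficients once per pitch period."""
    y = np.zeros(len(residual))
    zi = np.zeros(len(coeffs_per_period[0]))  # filter state, carried over
    start = 0
    for end, b in zip(period_ends, coeffs_per_period):
        a = np.concatenate(([1.0], b))  # denominator: 1 + b1 z^-1 + ...
        y[start:end], zi = lfilter([1.0], a, residual[start:end], zi=zi)
        start = end
    return y
```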

Speech synthesis - waveform generation: putting the whole pipeline together.

The classic two-stage pipeline of text-to-speech synthesis. [diagram: text → Front end → linguistic specification → Waveform generator → waveform; the linguistic specification for "Author of the..." shows POS tags, syllable structure with stress marks (1 0 0 0), and phones (ao th er ah f dh ax ...)]

The linguistic specification. [diagram: the linguistic specification for "Author of the...": POS tags, syllable structure with stress marks (1 0 0 0), and phones (ao th er ah f dh ax ...)]

Extracting features from text using the front end. [diagram: feature extraction - the text "Author of the" is passed through the Front end to produce the linguistic specification, as above]

Text processing pipeline. [diagram: inside the Front end, a pipeline of modules - tokenize → POS tag → LTS → phrase breaks → intonation - maps text to the linguistic specification; each module is individually learned from labelled data]

Front end - tokenize & normalize. Step 1: divide the input stream into tokens, which are potential words. For English and many other languages this is rule-based; whitespace and punctuation are good features. For some other languages, especially those that don't use whitespace, it may be more difficult, and other techniques are required (out of scope here).
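A minimal rule-based tokenizer of the kind described, using whitespace and punctuation as the features; real front ends add many more rules (abbreviations, numbers with internal punctuation, and so on).

```python
import re

def tokenize(text):
    """Split on whitespace, then peel punctuation off token edges."""
    tokens = []
    for chunk in text.split():
        tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

print(tokenize("In 2011, I spent £100 at IKEA on 100 DVD holders."))
# ['In', '2011', ',', 'I', 'spent', '£', '100', 'at', 'IKEA', ...]
```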

Front end - tokenize & normalize. Step 2: classify every token, finding Non-Standard Words (NSWs) that need further processing. For example, in "In 2011, I spent £100 at IKEA on 100 DVD holders.": 2011 → NYER, £100 → MONEY, IKEA → ASWD, 100 → NUM, DVD → LSEQ.

Front end - tokenize & normalize. Step 3: a set of specialised modules to process NSWs of each type:
2011 (NYER) → twenty eleven
£100 (MONEY) → one hundred pounds
IKEA (ASWD) → apply letter-to-sound
100 (NUM) → one hundred
DVD (LSEQ) → D. V. D. → dee vee dee
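A sketch of this dispatch-to-specialised-modules idea. The category labels follow the slide; the expansion rules are toy stubs covering only these examples, not a general normaliser.

```python
# one (toy) expansion module per NSW category
EXPANDERS = {
    "NYER":  lambda t: {"2011": "twenty eleven"}.get(t, t),
    "MONEY": lambda t: {"£100": "one hundred pounds"}.get(t, t),
    "ASWD":  lambda t: t,  # pass through to letter-to-sound
    "NUM":   lambda t: {"100": "one hundred"}.get(t, t),
    "LSEQ":  lambda t: " ".join(f"{c}." for c in t),  # DVD -> "D. V. D."
}

def normalise(tagged_tokens):
    """Dispatch each (token, category) pair to its expansion module."""
    return [EXPANDERS.get(cat, lambda t: t)(tok) for tok, cat in tagged_tokens]

print(normalise([("2011", "NYER"), ("£100", "MONEY"), ("IKEA", "ASWD"),
                 ("100", "NUM"), ("DVD", "LSEQ")]))
# ['twenty eleven', 'one hundred pounds', 'IKEA', 'one hundred', 'D. V. D.']
```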

Front end - POS tagging. Part-of-speech tagger: accuracy can be very high; trained on annotated text data; the categories are designed for text, not speech. Example: Director/NN of/IN the/DT McCormick/NP Public/NP Affairs/NPS Institute/NP at/IN U-Mass/NP Boston/NP, Doctor/NP Ed/NP Beard/NP, says/VBZ the/DT push/NN for/IN do/VBP it/PP yourself/PP

Front end - pronunciation / LTS. Pronunciation model: dictionary look-up, plus a letter-to-sound model. But: deep knowledge of the language is needed to design the phoneme set, and a human expert must write the dictionary. Excerpt:
ADVOCATING AE1 D V AH0 K EY2 T IH0 NG
ADVOCATION AE2 D V AH0 K EY1 SH AH0 N
ADWEEK AE1 D W IY0 K
ADWELL AH0 D W EH1 L
ADY EY1 D IY0
ADZ AE1 D Z
AE EY1
AEGEAN IH0 JH IY1 AH0 N
AEGIS IY1 JH AH0 S
AEGON EY1 G AA0 N
AELTUS AE1 L T AH0 S
AENEAS AE1 N IY0 AH0 S
AENEID AH0 N IY1 IH0 D
AEQUITRON EY1 K W IH0 T R AA0 N
AER EH1 R
AERIAL EH1 R IY0 AH0 L
AERIALS EH1 R IY0 AH0 L Z
AERIE EH1 R IY0
AERIEN EH1 R IY0 AH0 N
AERIENS EH1 R IY0 AH0 N Z
AERITALIA EH2 R IH0 T AE1 L Y AH0
AERO EH1 R OW0
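A sketch of how such a pronunciation module is typically structured: dictionary look-up first, falling back to a letter-to-sound model for out-of-dictionary words. The entries are copied from the excerpt above; lts_predict is a placeholder for a trained model, not a real one.

```python
# a few entries from the dictionary excerpt above
LEXICON = {
    "ADVOCATING": "AE1 D V AH0 K EY2 T IH0 NG",
    "ADWEEK":     "AE1 D W IY0 K",
    "AERIAL":     "EH1 R IY0 AH0 L",
}

def lts_predict(word):
    # stand-in: a real LTS model would be learned from the dictionary itself
    raise NotImplementedError("trained letter-to-sound model goes here")

def pronounce(word):
    """Dictionary look-up with letter-to-sound fallback."""
    word = word.upper()
    if word in LEXICON:
        return LEXICON[word].split()
    return lts_predict(word)

print(pronounce("aerial"))  # ['EH1', 'R', 'IY0', 'AH0', 'L']
```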

Key concepts we now understand:
Breaking a complex problem down into simpler steps
Combining many components into a single architecture
Representing information in data structures
The pros and cons of rules vs. learning from data
Generalising to previously-unseen words or sentences
Creating new utterances from fragments of pre-recorded speech
Manipulating the pitch and duration of speech