Robustness (cont.); End-to-end systems


Robustness (cont.); End-to-end systems
Steve Renals, Automatic Speech Recognition, ASR Lecture 18, 27 March 2017

Robust Speech Recognition

Additive Noise
- Multiple acoustic sources are the norm rather than the exception; from the point of view of trying to recognise a single stream of speech, the other sources are additive noise
- Stationary noise: frequency spectrum does not change over time (e.g. air conditioning, car noise at constant speed)
- Non-stationary noise: time-dependent frequency spectrum (e.g. breaking glass, workshop noise, music, speech)
- Measure the noise level as SNR (signal-to-noise ratio), in dB: 30 dB SNR sounds noise-free; 0 dB SNR has equal signal and noise energy
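The SNR definition above can be made concrete with a small numpy sketch that mixes a noise signal into clean speech at a chosen SNR; the random signals stand in for real waveforms:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals
    `snr_db` (P = mean power), then add it to the speech."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # stand-in for a speech waveform
noise = rng.standard_normal(16000)   # stand-in for recorded noise
noisy_0db = mix_at_snr(clean, noise, 0.0)    # equal signal and noise energy
noisy_30db = mix_at_snr(clean, noise, 30.0)  # sounds close to noise-free
```

This is exactly the artificial-mixing step used to build multi-condition training sets, described on the next slide.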

Approaches to Robust Speech Recognition
- Feature normalisation: transform the features to reduce mismatch between training and test (e.g. CMN/CVN)
- Feature compensation: estimate the noise spectrum and subtract it from the observed spectra, e.g. spectral subtraction
- Multi-condition training: train with speech data in a variety of noise conditions. It is possible to artificially mix recorded noise with clean speech at any desired SNR to create a multi-style training set
- Model adaptation: use an adaptation technique such as MLLR to adapt to the acoustic environment
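Feature normalisation is the simplest of these to illustrate; a minimal per-utterance CMN/CVN sketch over a toy MFCC matrix:

```python
import numpy as np

def cmvn(feats):
    """Cepstral mean and variance normalisation: make each coefficient
    zero-mean and unit-variance over the utterance, which removes
    constant channel/level offsets between training and test.
    feats: (num_frames, num_coeffs), e.g. MFCCs."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

rng = np.random.default_rng(1)
mfccs = 5.0 * rng.standard_normal((200, 13)) + 3.0  # toy MFCC matrix
normed = cmvn(mfccs)
```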

Current approaches to robust speech recognition
Decoupled preprocessing: acoustic processing is independent of downstream recognition
- Pro: simple
- Con: removes variability
- Example: beamforming for multi-microphone distant speech recognition [Swietojanski 2013]
(Slide from Mike Seltzer)
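As a sketch of what such decoupled preprocessing looks like, here is a toy delay-and-sum beamformer with integer-sample delays (real systems use fractional delays and filter-and-sum; the example signal is ours):

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Minimal delay-and-sum beamformer.
    channels: (num_mics, num_samples); delays[m] is the number of
    samples channel m lags the reference for the target direction.
    Aligning and averaging reinforces the target source while
    uncorrelated noise from other directions partially cancels."""
    aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)

# Toy example: the same periodic source arrives 3 samples later at mic 2.
t = np.arange(256)
source = np.sin(2 * np.pi * t / 64)
mics = np.stack([source, np.roll(source, 3)])
output = delay_and_sum(mics, [0, 3])
```

Because the beamformer runs before (and independently of) the recogniser, any variability it removes is gone for good, which is the "Con" above.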

Integrated processing: treat acoustic processing as the initial layers of the network and optimise its parameters with back-propagation
- Pro: should be optimal for the model
- Cons: computationally expensive, hard to move the needle
- Examples: direct waveform systems, mask estimation [Narayanan 2014], mel optimisation [Sainath 2013]
(Slide from Mike Seltzer)

Augmented information: add side information to the network (additional input nodes, a different objective function, ...)
- Pros: preserves variability, adds knowledge, maintains the representation
- Con: not a physical model
- Examples: noise-aware training, factorised noise codes (i-vectors)
(Slide from Mike Seltzer)

End-to-End Modelling

Limitations of HMMs
Sequence-trained HMM/NN systems have limitations:
- Markov assumption: the current state depends only on the previous state
- Conditional independence assumptions: dependence on previous acoustic observations is encapsulated in the current state
RNNs are powerful sequence models:
- a recurrent hidden state gives a much richer history representation than an HMM state
- they can learn representations
- they can directly model dependencies through time
But HMM/RNN systems only use RNNs to model time within a phone / HMM state...

End-to-end ("HMM-free") RNN speech recognition
Can RNNs replace the HMM sequence model? Yes: this is an active research topic. One approach is the RNN encoder-decoder model:
- The encoder maps the input sequence of vectors into a sequence of RNN hidden states
- The decoder maps the RNN hidden states into an output sequence
- Input and output sequences may have different lengths
- Input: a sequence of frames; output: a sequence of phones, letters, or words!
- Mapping directly to words results in a joint acoustic and language model

RNN Encoder-Decoder
The overall task is to compute the probability of an output sequence given an input sequence:

    P(y_1, ..., y_O | x_1, ..., x_T) = P(y_1^O | x_1^T)

Encoder: compute a context c_o for each output y_o
Decoder: compute

    P(y_1^O | x_1^T) = prod_o P(y_o | y_1^{o-1}, c_o)

where each factor is computed by an RNN:

    P(y_o | y_1^{o-1}, c_o) = softmax(y_{o-1}, s_o, c_o)
    s_o = f(y_{o-1}, s_{o-1}, c_o)

y_{o-1} is the previous output, s_o is the decoder state (recurrent hidden layer), c_o is the encoder context
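The decoder equations above can be sketched in numpy; the tanh recurrence and the weight matrices in `p` are illustrative choices of ours, not the lecture's exact parameterisation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(y_prev, s_prev, c_o, p):
    """One RNN decoder step:
      s_o = f(y_{o-1}, s_{o-1}, c_o)            (state update; f = tanh here)
      P(y_o | y_1^{o-1}, c_o) = softmax(...)    (next-symbol distribution)
    `p` holds illustrative weight matrices."""
    s_o = np.tanh(p["Wy"] @ y_prev + p["Ws"] @ s_prev + p["Wc"] @ c_o)
    probs = softmax(p["Wo"] @ np.concatenate([y_prev, s_o, c_o]))
    return s_o, probs

rng = np.random.default_rng(2)
d_y, d_s, d_c, vocab = 8, 16, 16, 30
p = {"Wy": 0.1 * rng.standard_normal((d_s, d_y)),
     "Ws": 0.1 * rng.standard_normal((d_s, d_s)),
     "Wc": 0.1 * rng.standard_normal((d_s, d_c)),
     "Wo": 0.1 * rng.standard_normal((vocab, d_y + d_s + d_c))}
s, probs = decoder_step(rng.standard_normal(d_y),
                        np.zeros(d_s), rng.standard_normal(d_c), p)
```

At each output step the previous symbol, the recurrent state, and the encoder context jointly determine a distribution over the output vocabulary.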

RNN decoder
[Diagram: decoder unrolled over output steps, with outputs y_{o-1}, y_o, y_{o+1}, states s_{o-1}, s_o, s_{o+1}, and contexts c_{o-1}, c_o, c_{o+1}]

RNN encoder
[Diagram: encoder hidden states h_{t-1}, h_t, h_{t+1} computed from inputs x_{t-1}, x_t, x_{t+1}, combined into contexts c_{o-1}, c_o]

    c_o = sum_t alpha_{ot} h_t
    alpha_{ot} = softmax(g(s_{o-1}, h_t))

where g is a small neural network scoring how relevant encoder state h_t is to decoder state s_{o-1}
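The context computation above is attention over the encoder states; a minimal numpy sketch, where the additive scoring network v^T tanh(Ws s + Wh h) is one illustrative choice for g:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(s_prev, H, Ws, Wh, v):
    """Compute c_o = sum_t alpha_ot * h_t with
    alpha_ot = softmax_t(g(s_{o-1}, h_t)).
    H: (T, d) matrix of encoder hidden states h_t."""
    scores = np.array([v @ np.tanh(Ws @ s_prev + Wh @ h) for h in H])
    alpha = softmax(scores)   # (T,) attention weights, sum to 1
    return alpha @ H, alpha   # context c_o and the weights alpha_ot

rng = np.random.default_rng(3)
T, d, d_s = 50, 16, 16
H = rng.standard_normal((T, d))
c_o, alpha = attention_context(rng.standard_normal(d_s), H,
                               0.1 * rng.standard_normal((d_s, d_s)),
                               0.1 * rng.standard_normal((d_s, d)),
                               0.1 * rng.standard_normal(d_s))
```

Because alpha is recomputed for every output step from the current decoder state, each context c_o can attend to a different region of the input.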

RNN encoder-decoder
- Train all the parameters to maximise log P(y_1^O | x_1^T) using backprop through time
- The encoder is a bidirectional RNN
- Training/testing on Switchboard, directly mapping MFCCs to words (no pronunciation model, no language model), gives 49% WER
- An improved training scheme with FBANK features gives 37% WER
- Potential improvements: multiple recurrent layers in the encoder; incorporating a language model in the decoder; using a character-based output sequence
(L Lu et al (2015), "A Study of the Recurrent Neural Network Encoder-Decoder for Large Vocabulary Speech Recognition", Interspeech 2015, http://homepages.inf.ed.ac.uk/llu/pdf/liang_is15a.pdf)