Robust Automatic Speech Recognition in the 21st Century
Richard Stern (with Alex Acero, Yu-Hsiang Chiu, Evandro Gouvêa, Chanwoo Kim, Kshitiz Kumar, Amir Moghimi, Pedro Moreno, Hyung-Min Park, Bhiksha Raj, Mike Seltzer, Rita Singh, and many others)
Robust Speech Recognition Group, Carnegie Mellon University
Telephone: +1 412 268-2535, Fax: +1 412 268-3890
rms@cs.cmu.edu, http://www.ece.cmu.edu/~rms
AFEKA Conference for Speech Processing, Tel Aviv, Israel, July 7, 2014

Robust speech recognition. As speech recognition is transferred from the laboratory to the marketplace, robust recognition is becoming increasingly important.
Robustness in 1985: recognition in a quiet room using desktop microphones.
Robustness in 2014: recognition
- over a cell phone
- in a car
- with the windows down
- and the radio playing
- at highway speeds
Slide 2

What I would like to do today:
- Review background and motivation for current work: sources of environmental degradation
- Discuss selected approaches and their performance: traditional statistical parameter estimation, missing-feature approaches, microphone arrays, and physiologically- and perceptually-motivated signal processing
- Comment on current progress on the hardest problems
Slide 3

Some of the hardest problems in speech recognition:
- Speech in high noise (Navy F-18 flight line)
- Speech in background music
- Speech in background speech
- Transient dropouts and noise
- Spontaneous speech
- Reverberated speech
- Vocoded speech
Slide 4

Challenges in robust recognition.
Classical problems: additive noise, linear filtering.
Modern problems: transient degradations, much lower SNR.
Difficult problems: highly spontaneous speech, reverberated speech, speech masked by other speech and/or music, speech subjected to nonlinear degradation.
Slide 5

Solutions to classical problems: joint statistical compensation for noise and filtering. Approach of Acero, Liu, Moreno, Raj, et al. (1990-1997).
[Block diagram: clean speech x[m] passes through a linear filter h[m] and is corrupted by additive noise n[m], producing degraded speech z[m] = x[m] * h[m] + n[m], where * denotes convolution.]
Compensation is achieved by estimating the parameters of the noise and filter and applying inverse operations. The interaction of the two degradations is nonlinear in the log-spectral domain.
Slide 6
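The nonlinear interaction can be made concrete with a short numerical sketch (my own illustration, not from the talk; the signal lengths and filter values are arbitrary):

```python
import numpy as np

# Degradation model from the slide: z[m] = x[m] * h[m] + n[m]
# (convolution with a channel filter plus additive noise).
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)             # stand-in for clean speech
h = np.array([1.0, 0.5, 0.25])             # stand-in for a short channel filter
n = 0.1 * rng.standard_normal(16000 + len(h) - 1)
z = np.convolve(x, h) + n                  # degraded speech

# In the power-spectral domain the model is approximately
#   |Z|^2 ~= |X|^2 |H|^2 + |N|^2,
# so in the log domain the two degradations no longer add:
#   log|Z|^2 = log|X|^2 + log|H|^2 + log(1 + |N|^2 / (|X|^2 |H|^2)).
X2 = np.abs(np.fft.rfft(x, 1024)) ** 2
H2 = np.abs(np.fft.rfft(h, 1024)) ** 2
N2 = np.abs(np.fft.rfft(n, 1024)) ** 2
log_Z2 = np.log(X2 * H2 + 1e-12) + np.log1p(N2 / (X2 * H2 + 1e-12))
```

The final log1p term is the nonlinear interaction that joint compensation schemes such as CDCN and VTS must estimate.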

Classical combined compensation improves accuracy in stationary environments.
[Figure: recognition accuracy (%) vs. SNR (dB), comparing complete retraining, VTS (1997), CDCN (1990), and CMN (baseline).]
The recognition threshold shifts by ~7 dB, but accuracy is still poor at low SNRs.
Slide 7

But model-based compensation does not improve accuracy (much) in transient noise.
[Figure: percent decrease in WER vs. SNR (dB) for white noise and Hub 4 music.]
Possible reasons: the nonstationarity of background music and its speechlike nature.
Slide 8

Summary: traditional methods. The combined effects of additive noise and linear filtering are nonlinear. Methods such as CDCN and VTS can be quite effective, but they require that the statistics of the received signal remain stationary over an interval of several hundred ms. Slide 9

Introduction: missing-feature recognition. Speech is quite intelligible even when presented only in fragments. Procedure:
- Determine which time-frequency components appear to be unaffected by noise, distortion, etc.
- Reconstruct the signal based on the good components.
[Audio demo: a monaural example using oracle knowledge, comparing mixed and separated signals.]
Slide 10

Missing-feature recognition. General approach:
- Determine which cells of a spectrogram-like display are unreliable (or "missing").
- Ignore the missing features, or make a best guess about their values based on the data that are present.
Comment: most groups (following the University of Sheffield) modify the recognizer's internal representations to compensate for missing features; we attempt to infer and replace the missing components of the input vector.
Slide 11

Example: an original speech spectrogram Slide 12

Spectrogram corrupted by noise at an SNR of 15 dB. Some regions are affected far more than others. Slide 13

Ignoring regions of the spectrogram that are corrupted by noise. All regions with local SNR less than 0 dB are deemed missing (dark blue); recognition is performed based on the remaining (colored) regions alone. Slide 14
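As a concrete illustration of this masking step, here is a minimal sketch (my own, not the authors' code) of constructing an oracle binary mask when the clean and noise signals are both available, using the 0 dB threshold from the slide:

```python
import numpy as np
from scipy.signal import stft

def oracle_mask(clean, noise, fs=16000, snr_floor_db=0.0):
    """Mark time-frequency cells whose local SNR falls below a
    threshold (0 dB on the slide) as missing."""
    _, _, X = stft(clean, fs=fs, nperseg=400, noverlap=240)
    _, _, N = stft(noise, fs=fs, nperseg=400, noverlap=240)
    local_snr_db = 10.0 * np.log10(
        (np.abs(X) ** 2) / (np.abs(N) ** 2 + 1e-12) + 1e-12)
    return local_snr_db >= snr_floor_db   # True = reliable cell

# Recognition (or reconstruction) then uses only the reliable cells;
# the False cells are the "dark blue" regions to be ignored or inferred.
```

In practice, of course, the clean and noise signals are not separately observable, which is exactly the mask-estimation problem discussed below.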

Recognition accuracy using compensated cepstra, speech in white noise (Raj, 1998).
[Figure: accuracy (%) vs. SNR (dB) for cluster-based reconstruction, temporal correlations, spectral subtraction, and baseline.]
Large improvements in recognition accuracy can be obtained by reconstructing the corrupted regions of noisy speech spectrograms; a priori knowledge of the locations of the missing features is needed.
Slide 15

Recognition accuracy using compensated cepstra, speech corrupted by music.
[Figure: accuracy (%) vs. SNR (dB) for cluster-based reconstruction, temporal correlations, spectral subtraction, and baseline.]
Recognition accuracy increases from 7% to 69% at 0 dB with cluster-based reconstruction.
Slide 16

Practical recognition error: white noise (Seltzer, 2000).
[Figure: recognition accuracy (%) vs. SNR (dB) for speech plus white noise, comparing oracle masks, Bayesian masks, energy-based masks, and baseline.]
Slide 17

Practical recognition error: background music.
[Figure: recognition accuracy (%) vs. SNR (dB) for speech plus music, comparing oracle masks, Bayesian masks, energy-based masks, and baseline.]
Slide 18

Summary: missing features. Missing-feature approaches can be valuable in dealing with the effects of transient distortion and other disruptions that are localized in the spectro-temporal display. The approach can be effective, but it is limited by the need to determine correctly which cells in the spectrogram are missing, which can be difficult in practice. Slide 19

The problem of reverberation. Comparison of single-channel and delay-and-sum beamforming (WSJ data passed through measured impulse responses):
[Figure: WER (%) vs. reverberation time (ms) for single-channel and delay-and-sum processing.]
Slide 20

Use of microphone arrays: motivation. Microphone arrays can provide a directional response, accepting speech from some directions while suppressing others. Slide 21

Another reason for microphone arrays: they can focus attention on the direct field in a reverberant environment. Slide 22

Options in the use of multiple microphones. There are many ways we can use multiple microphones to improve recognition accuracy:
- Fixed delay-and-sum beamforming
- Microphone selection techniques
- Traditional adaptive filtering based on minimizing waveform distortion
- Feature-driven adaptive filtering (LIMABEAM)
- Statistically-driven separation approaches (ICA/BSS)
- Binaural processing based on selective reconstruction (e.g., PDCW)
- Binaural processing for correlation-based emphasis (e.g., Polyaural)
- Binaural processing using precedence-based emphasis (peripheral or central, e.g., SSF)
Slide 23

Delay-and-sum beamforming.
[Diagram: sensor signals s_1k through s_Lk are delayed by τ_1 through τ_L and summed to form the output x_k.]
Simple processing based on equalizing the delays to the sensors and summing the responses. High directivity can be achieved with many sensors. This is the baseline algorithm for any multi-microphone experiment.
Slide 24
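A minimal sketch of the idea (my own illustration, assuming integer-sample steering delays):

```python
import numpy as np

def delay_and_sum(mics, delays):
    """mics: (L, N) array of sensor signals; delays: per-sensor integer
    sample delays that time-align each channel to the target source,
    so x_k = (1/L) * sum_l s_l[k + tau_l]."""
    L, N = mics.shape
    out = np.zeros(N)
    for sig, tau in zip(mics, delays):
        shifted = np.zeros(N)
        if tau >= 0:                       # channel lags: advance it
            shifted[:N - tau] = sig[tau:]
        else:                              # channel leads: delay it
            shifted[-tau:] = sig[:N + tau]
        out += shifted
    return out / L

# For a source at bearing theta and a uniform linear array with spacing d,
# tau_l = round(l * d * sin(theta) / c * fs) is a common steering choice.
```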

Adaptive array processing.
[Diagram: sensor signals s_1k through s_Lk pass through adaptive LSI filters and are summed to form the output x_k.]
MMSE-based methods (e.g., LMS, RLS) falsely assume independence of signal and noise, which does not hold in reverberation. This is less of an issue with modern methods using objective functions based on kurtosis or negative entropy. Note that these methods reduce signal distortion, not error rate.
Slide 25

Speech recognition using microphone arrays has traditionally been performed by combining two independent systems. This is not ideal: the systems have different objectives, and each fails to exploit information available to the other.
[Diagram: MIC 1 through MIC 4 feed an array-processing stage, followed by feature extraction and ASR.]
Slide 26

Feature-based optimal filtering (Seltzer, 2004). Consider array processing and speech recognition as parts of a single system that shares information, and develop array-processing algorithms specifically designed to improve speech recognition.
[Diagram: MIC 1 through MIC 4 feed array processing, feature extraction, and ASR, now treated as one integrated system.]
Slide 27

Multi-microphone compensation for speech recognition based on cepstral distortion.
[Diagram: speech in a room passes through array processing, the front end, and ASR; delay-and-sum output is compared with optimal compensation.]
Multi-microphone compensation is based on optimizing the speech features rather than minimizing signal distortion.
Slide 28
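In the spirit of that idea, here is a toy sketch of feature-domain filter optimization (my own simplification, not Seltzer's LIMABEAM implementation; in the real algorithm the target features come from the transcription via the recognizer's state sequence, and the optimization is done jointly over all microphone channels):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.signal import lfilter

def log_mel_feats(x, mel_fbank, frame=400, hop=160):
    """Very simplified log-mel feature extraction."""
    frames = np.lib.stride_tricks.sliding_window_view(x, frame)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(frame), axis=1)) ** 2
    return np.log(spec @ mel_fbank.T + 1e-10)

def optimize_filter(mic_signal, target_feats, mel_fbank, n_taps=50):
    """Choose FIR taps that minimize the distance between the features
    of the filtered signal and target features (e.g., from a
    close-talking recording of the calibration utterance)."""
    def loss(taps):
        y = lfilter(taps, [1.0], mic_signal)
        return np.mean((log_mel_feats(y, mel_fbank) - target_feats) ** 2)
    h0 = np.zeros(n_taps)
    h0[0] = 1.0                       # start from a pass-through filter
    return minimize(loss, h0, method="Powell").x
```

The key design choice mirrors the slide: the objective is feature distortion, not waveform distortion, so the filter is tuned to what the recognizer actually consumes.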

Sample results: WER vs. SNR for WSJ with added white noise. Fifty-point filters were constructed from a calibration utterance using the transcription only, then applied to all utterances.
[Figure: WER (%) vs. SNR (dB) for close-talking mic, optimized calibration, delay-and-sum, and a single mic.]
Slide 29

Nonlinear beamforming: reconstructing sound from fragments. Procedure:
- Determine which time-frequency components appear to be dominated by the desired signal.
- Recognize based on the subset of features that are good, OR reconstruct the signal based on the good components and recognize using traditional signal processing.
In binaural processing, the determination of good components is based on estimated interaural time difference (ITD).
Slide 30

Binaural processing for selective reconstruction (e.g., ZCAE, PDCW processing):
- Assume two sources with known azimuths.
- Extract ITDs in the time-frequency representation, using zero crossings, cross-correlation, or phase differences in the frequency domain (sketched below).
- Estimate signal amplitudes based on the observed ITD (in binary or continuous fashion).
- (Optionally) fill in the missing time-frequency segments after binary decisions.
Slide 31
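Here is a minimal sketch of the phase-difference variant (my own simplification of PDCW-style masking, not the published implementation; the 512-point window, the 200 μs ITD threshold, and the mask floor are assumed values, and the target is assumed to lie straight ahead so that its ITD is near zero):

```python
import numpy as np
from scipy.signal import stft, istft

def itd_binary_mask(left, right, fs=16000, max_itd_s=200e-6):
    """Estimate per-cell ITD from interaural phase differences and keep
    the time-frequency cells whose ITD is close to zero."""
    f, _, L = stft(left, fs=fs, nperseg=512)
    _, _, R = stft(right, fs=fs, nperseg=512)
    phase_diff = np.angle(L * np.conj(R))         # interaural phase difference
    w = 2 * np.pi * np.maximum(f, 1.0)[:, None]   # avoid divide-by-zero at DC
    itd = phase_diff / w                          # phase difference -> time delay
    mask = np.abs(itd) < max_itd_s                # True = dominated by target
    masked = np.where(mask, L, 0.05 * L)          # binary mask with a small floor
    _, y = istft(masked, fs=fs, nperseg=512)
    return y, mask
```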

Audio samples using selective reconstruction.
[Audio demo: samples at RT60 = 300 ms, comparing no processing, delay-and-sum, and PDCW.]
Slide 32

Selective reconstruction from two mics helps. Examples using the PDCW algorithm (Kim et al., Interspeech 2009) for speech in natural noise and for reverberated speech.
Comment: the use of two mics provides a substantial improvement that is typically independent of what is obtained using other methods.
Slide 33

Comparing linear and nonlinear beamforming (Moghimi, ICASSP 2014).
[Figure: nonlinear beampatterns at SNR = 0 dB and SNR = 20 dB alongside the linear beampattern.]
Comments: performance depends on SNR as well as source locations, and the nonlinear patterns are more consistent over frequency than linear beamforming.
Slide 34

Linear and nonlinear beamforming as a function of the number of sensors (Moghimi & Stern, Interspeech 2014) Slide 35

The binaural precedence effect. With the basic stimuli of the precedence effect, localization is typically dominated by the first-arriving pair. The precedence effect is believed by some (e.g., Blauert) to improve speech intelligibility. Generalizing, we might say that onset enhancement helps at any level. Slide 36

Performance of onset enhancement in the SSF algorithm (Kim and Stern, Interspeech 2010).
[Figures: results for background music and for reverberation.]
Comment: onset enhancement using SSF processing is especially effective in dealing with reverberation.
Slide 37
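The core of this kind of onset enhancement can be sketched as follows (a minimal sketch in the general form of SSF processing; the forgetting factor and floor coefficient are assumed values, not necessarily the published constants):

```python
import numpy as np

def ssf_onset_enhance(P, lam=0.4, c0=0.01):
    """Suppress slowly-varying components of channel power envelopes.

    P: (n_frames, n_channels) medium-time power.  A running lowpass
    estimate M of each channel's power is subtracted from P, which
    emphasizes onsets and suppresses steady-state (and reverberant)
    energy; a small floor keeps the output positive."""
    M = np.zeros_like(P)
    M[0] = P[0]
    for m in range(1, len(P)):
        M[m] = lam * M[m - 1] + (1 - lam) * P[m]   # slowly-varying average
    return np.maximum(P - M, c0 * P)               # onset-enhanced power
```

Because reverberant energy arrives late and varies slowly, subtracting the slow component acts much like the precedence effect: the first-arriving energy dominates.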

Combining onset enhancement with two-microphone processing (Park et al., Interspeech 2014).
[Figure: WER (%) for MFCC, DBSF, Bin II, SSF, SSF+DBSF, SSF+Bin I, SSF+Bin II, and SSF+Bin III, under RT60 = 1.0 s, RT60 = 0.5 s, and clean conditions.]
Comment: the use of both SSF onset enhancement and binaural comparison is especially helpful for improving WER for reverberated speech.
Slide 38

Summary: use of multiple mics.
- Microphone arrays provide a directional response, which can help suppress interfering sources and combat reverberation.
- Delay-and-sum beamforming is very simple and somewhat effective.
- Adaptive beamforming provides better performance, but not in reverberant environments when MMSE-based objective functions are used.
- Adaptive beamforming based on minimizing feature distortion can be very effective but is computationally costly.
- With only two mics, nonlinear beamforming based on selective reconstruction is best.
- Onset enhancement also helps a great deal in reverberation.
Slide 39

Auditory-based representations: what the speech recognizer sees.
[Figures: an original spectrogram and the spectrum recovered from MFCCs.]
Slide 40

Comments on the MFCC representation. It's very blurry compared to a wideband spectrogram!
Aspects of auditory processing represented:
- Frequency selectivity and spectral bandwidth (but using a constant analysis window duration! Wavelet schemes exploit time-frequency resolution better.)
- Nonlinear amplitude response (via the log transformation only)
Aspects of auditory processing NOT represented:
- Detailed timing structure
- Lateral suppression
- Enhancement of temporal contrast
- Other auditory nonlinearities
Slide 41
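The blurriness is easy to see numerically: the smoothed spectrum the recognizer effectively sees can be recovered by inverting the cepstral truncation (a minimal sketch, my own; it assumes 13 coefficients over 40 mel channels):

```python
import numpy as np
from scipy.fftpack import dct, idct

def mfcc_and_back(log_mel_energies, n_ceps=13):
    """Truncate the cepstrum (as MFCC extraction does) and invert the
    DCT; the recovered log-mel spectrum is a heavily smoothed version
    of the original, which is why it looks blurry."""
    c = dct(log_mel_energies, type=2, norm='ortho')[:n_ceps]  # keep 13 coeffs
    c_padded = np.zeros_like(log_mel_energies)
    c_padded[:n_ceps] = c
    return idct(c_padded, type=2, norm='ortho')               # smoothed spectrum

# Example: a 40-channel log-mel frame loses all of its fine detail.
frame = np.log(np.abs(np.random.randn(40)) + 1.0)
smoothed = mfcc_and_back(frame)
```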

Physiologically-motivated signal processing: the Zhang-Carney-Zilany model of the periphery. We used the synapse output as the basis for further processing. Slide 44

An early evaluation by Kim et al. (Interspeech 2006):
- The synchrony response is smeared across frequency to remove pitch effects.
- Higher frequencies are represented by the mean rate of firing.
- Synchrony and mean rate are combined additively.
- This is much more processing than MFCCs, but it will be simplified if the results are useful.
Slide 45

Comparing auditory processing with cepstral analysis: clean speech Slide 46

Comparing auditory processing with cepstral analysis: 20-dB SNR Slide 47

Comparing auditory processing with cepstral analysis: 10-dB SNR Slide 48

Comparing auditory processing with cepstral analysis: 0-dB SNR Slide 49

Auditory processing is more effective than MFCCs at low SNRs, especially in white noise.
[Figures: accuracy in background noise and in background music; results from Kim et al., Interspeech 2006.]
The curves are shifted by 10-15 dB (a greater improvement than obtained with VTS or CDCN).
Slide 50

But do auditory models really need to be so complex?
[Diagram: the model of Zhang et al. (2001) contrasted with a much simpler model in which the input P(t) passes through gammatone filters, nonlinear rectifiers, and lowpass filters to produce s(t).]
Slide 51
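A sketch of that simpler filter-rectify-lowpass chain (my own illustration; the gammatone implementation, the channel center frequencies, and the 100 Hz envelope cutoff are assumed choices):

```python
import numpy as np
from scipy.signal import butter, lfilter, fftconvolve

def gammatone_ir(fc, fs, dur=0.025, order=4):
    """Fourth-order gammatone impulse response at center frequency fc."""
    t = np.arange(int(dur * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000 + 1)          # Glasberg-Moore ERB
    b = 1.019 * erb
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def simple_auditory_model(p, fs, fcs=(250, 500, 1000, 2000, 4000), cutoff=100.0):
    """Gammatone filter -> halfwave rectifier -> lowpass, per channel."""
    blp, alp = butter(2, cutoff / (fs / 2))      # envelope lowpass filter
    channels = []
    for fc in fcs:
        y = fftconvolve(p, gammatone_ir(fc, fs), mode='same')  # gammatone filter
        y = np.maximum(y, 0.0)                   # nonlinear rectifier
        channels.append(lfilter(blp, alp, y))    # synapse-like smoothing
    return np.stack(channels)
```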

Comparing simple and complex auditory models: MFCC processing, a trivial (filter-rectify-compress) auditory model, and the full Carney-Zhang model. Slide 52

Aspects of auditory processing we have found to be important in improving WER in noise:
- The shape of the peripheral filters
- The shape of the auditory nonlinearity
- The use of medium-time analysis for noise and reverberation compensation
- The use of nonlinear filtering to obtain noise suppression and a general separation of speechlike from non-speechlike signals (a form of modulation filtering)
- The use of nonlinear approaches to effect onset enhancement
- Binaural processing for further enhancement of target signals
Slide 53

PNCC processing (Kim and Stern, 2010, 2014). A pragmatic implementation of a number of the principles described (a sketch of two ingredients follows below):
- Gammatone filterbanks
- A nonlinearity shaped to follow auditory processing
- Medium-time environmental compensation using nonlinear highpass filtering in each channel
- Enhancement of envelope onsets
- A computationally efficient implementation
Slide 54
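Two of these ingredients are easy to illustrate: the power-law nonlinearity that PNCC uses in place of the MFCC log, and medium-time (multi-frame) power averaging. A minimal sketch (my own; PNCC's published power-law exponent is 1/15, while the averaging window length here is an assumed value):

```python
import numpy as np

def medium_time_power(P, M=2):
    """Average short-time channel power over +/- M neighboring frames,
    the 'medium-time' analysis used for environmental compensation."""
    T = len(P)
    return np.stack([P[max(0, m - M):min(T, m + M + 1)].mean(axis=0)
                     for m in range(T)])

def pncc_nonlinearity(P):
    """PNCC replaces the log of MFCC with a power law, y = P**(1/15),
    which follows the auditory rate-intensity function and behaves far
    more gracefully than the log at very low power."""
    return np.power(np.maximum(P, 1e-20), 1.0 / 15.0)
```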

PNCC: an integrated front end based on auditory processing.
[Diagram: block-by-block comparison of MFCC, RASTA-PLP, and PNCC, organized into initial processing, environmental compensation, and final processing stages.]
Slide 55

Computational complexity of front ends.
[Figure: multiplies and divides per frame for MFCC, PLP, PNCC, and truncated PNCC.]
Slide 56

Performance of PNCC in white noise (RM) Slide 57

Performance of PNCC in white noise (WSJ) Slide 58

Performance of PNCC in background music Slide 59

Performance of PNCC in reverberation Slide 60

Contributions of PNCC components: white noise (WSJ).
[Figure: accuracy for baseline MFCC + CMN, and with medium-duration processing, noise suppression, and temporal masking successively added.]
Slide 61

Contributions of PNCC components: background music (WSJ).
[Figure: accuracy for baseline MFCC + CMN, and with medium-duration processing, noise suppression, and temporal masking successively added.]
Slide 62

Contributions of PNCC components: reverberation (WSJ).
[Figure: accuracy for baseline MFCC + CMN, and with medium-duration processing, noise suppression, and temporal masking successively added.]
Slide 63

PNCC and SSF @Google Slide 65

Summary: auditory processing. Knowledge of the auditory system will improve ASR accuracy. Important aspects include:
- Consideration of filter shapes
- Consideration of the rate-intensity function
- Onset enhancement
- Nonlinear modulation filtering
- Temporal suppression
Slide 66

General summary: robust recognition in the 21st century.
- Low SNRs, reverberated speech, speech maskers, and music maskers are difficult challenges for robust speech recognition.
- Robustness algorithms based on classical statistical estimation fail in the presence of transient degradation.
- Some more recent techniques that can be effective: missing-feature reconstruction, microphone array processing in several forms, and processing motivated by monaural and binaural physiology and perception.
More information about this work may be found at http://www.cs.cmu.edu/~robust
Slide 67