ROBUST SPEECH RECOGNITION Richard Stern, Robust Speech Recognition Group, Carnegie Mellon University Telephone: (412) 268-2535 Fax: (412) 268-3890 rms@cs.cmu.edu http://www.cs.cmu.edu/~rms Short Course at Universidad Carlos III, July 12-15, 2005
Outline of discussion Summary of the state-of-the-art in speech technology at Carnegie Mellon and elsewhere Review of speech production and cepstral analysis Introduction to robust speech recognition: classical techniques Robust speech recognition using missing-feature techniques Speech recognition using complementary feature sets Signal processing and signal separation based on human auditory perception Use of multiple microphones for improved recognition accuracy Slide 2 CMU Robust Speech Group
Introduction Conventional signal processing schemes for speech recognition are motivated more by knowledge of speech production than by knowledge of speech perception. Nevertheless, the auditory system does a lot of interesting things! In this talk I will: Talk a bit about some basic findings in auditory physiology and perception Talk a bit about how knowledge of perception is starting to influence how we design signal processing for speech recognition Talk about how we can apply auditory principles to separate signals that are presented simultaneously
Basic auditory anatomy Structures involved in auditory processing:
Excitation along the basilar membrane Some of von Békésy's (1960) measurements of motion along the basilar membrane: Comment: Different locations are most sensitive to different frequencies
Transient response of auditory-nerve fibers Histograms of response to tone bursts (Kiang et al., 1965): Comment: Onsets and offsets produce overshoot
Frequency response of auditory-nerve fibers: tuning curves Threshold level for auditory-nerve response to tones: Note dependence of bandwidth on center frequency and asymmetry of response
Typical response of auditory-nerve fibers as a function of stimulus level Typical response of auditory-nerve fibers to tones as a function of intensity: Comment: Saturation and limited dynamic range
Synchronized auditory-nerve response to low-frequency tones Comment: response remains synchronized over a wide range of intensities
Comments on synchronized auditory response Nerve fibers synchronize to fine structure at low frequencies, signal envelopes at high frequencies Synchrony clearly important for auditory localization Synchrony now believed important for monaural processing of complex signals as well
Lateral suppression in auditory processing Auditory-nerve response to pairs of tones: Comment: Lateral suppression enhances local contrast in frequency
Auditory masking patterns Masking produced by narrowband noise at 410 Hz: Comment: asymmetries in auditory-nerve patterns preserved
Auditory frequency selectivity: critical bands Measurements of psychophysical filter bandwidth by various methods: Comments: Bandwidth increases with center frequency Solid curve is Equivalent Rectangular Bandwidth (ERB)
Three perceptual auditory frequency scales
Bark scale:
Bark(f) = 0.01 f, for 0 <= f < 500
Bark(f) = 0.007 f + 1.5, for 500 <= f < 1220
Bark(f) = 6 ln(f) - 32.6, for 1220 <= f
Mel scale:
Mel(f) = 2595 log10(1 + f/700)
ERB scale:
ERB(f) = 24.7 (4.37 f/1000 + 1), with f in Hz
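The three scale formulas above can be evaluated directly in code (a minimal sketch; the function names are ours, and f is assumed to be in Hz throughout):

```python
import math

# Sketch of the three perceptual frequency scales from the slide, f in Hz.

def bark(f):
    """Piecewise Bark approximation as given on the slide."""
    if f < 500:
        return 0.01 * f
    elif f < 1220:
        return 0.007 * f + 1.5
    else:
        return 6.0 * math.log(f) - 32.6

def mel(f):
    """Mel scale: 2595 log10(1 + f/700); Mel(1000) is approximately 1000."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def erb(f):
    """Equivalent Rectangular Bandwidth in Hz: 24.7 (4.37 f/1000 + 1)."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)
```

Note that Bark and Mel map frequency to a perceptual position, while ERB gives a bandwidth at each frequency.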
Comparison of normalized perceptual frequency scales Bark scale (in blue), Mel scale (in red), and ERB scale (in green), plotted as normalized perceptual scale versus frequency from 0 to 5000 Hz
Forward and backward masking Masking can have an effect even if target and masker are not simultaneously presented Forward masking - masker precedes target Backward masking - target precedes masker Examples: Introduction Backward masking Forward masking
The loudness of sounds Equal loudness contours (Fletcher-Munson curves):
Summary of basic auditory physiology and perception Major physiological attributes: Frequency analysis in parallel channels Preservation of temporal fine structure Limited dynamic range in individual channels Enhancement of temporal contrast (at onsets and offsets) Enhancement of spectral contrast (at adjacent frequencies) Most major physiological attributes have psychophysical correlates Most physiological and psychophysical effects are not preserved in conventional representations for speech recognition
Conventional ASR signal processing: MFCCs Segment incoming waveform into frames Compute frequency response for each frame using DFTs Multiply magnitude of frequency response by triangular weighting functions to produce 25-40 channels Compute log of weighted magnitudes for each channel Take the inverse discrete cosine transform (DCT) of the log magnitudes across channels, producing ~14 cepstral coefficients for each frame Calculate delta and double-delta coefficients
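A minimal sketch of the steps above (our own simplified filterbank construction and parameter choices for illustration, not a production front end):

```python
import numpy as np

# Sketch of the MFCC pipeline: window a frame, take the DFT magnitude,
# apply triangular Mel weighting, take logs, then a DCT across channels.

def mel_scale(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular weighting functions with centers equally spaced in Mel
    mel_pts = np.linspace(0.0, mel_scale(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(frame, fs, n_filters=26, n_ceps=13):
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    fb = mel_filterbank(n_filters, len(frame), fs)
    logmel = np.log(fb @ spectrum + 1e-10)
    # DCT of the log Mel energies yields the cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    return dct @ logmel

fs = 16000
t = np.arange(400) / fs                  # one 25-ms frame at 16 kHz
frame = np.sin(2 * np.pi * 440 * t)
coeffs = mfcc(frame, fs)                 # 13 cepstral coefficients
```

In a real front end the frames overlap and delta/double-delta features are appended to each frame's coefficients.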
COMPARING SPECTRAL REPRESENTATIONS Original speech spectrogram (0-8000 Hz), Mel log magnitudes, and spectra recovered after cepstral smoothing, each over roughly 1.2 s of speech
Comments on the MFCC representation It's very blurry compared to a wideband spectrogram! Aspects of auditory processing represented: Frequency selectivity and spectral bandwidth (but using a constant analysis window duration!)» Wavelet schemes exploit time-frequency resolution better Nonlinear amplitude response Aspects of auditory processing NOT represented: Detailed timing structure Lateral suppression Enhancement of temporal contrast Other auditory nonlinearities
Speech representation using mean rate Representation of vowels by Young and Sachs using mean rate: Mean rate representation does not preserve spectral information
Speech representation using average localized synchrony measure Representation of vowels by Young and Sachs using ALSR:
The importance of timing information Re-analysis of Young-Sachs data by Searle: Temporal processing captures dominant formants in a spectral region
Paths to the realization of temporal fine structure in speech Correlograms (Slaney and Lyon) Computations based on interval processing Seneff's Generalized Synchrony Detector (GSD) model Ghitza's Ensemble Interval Histogram (EIH) model D.-S. Kim's Zero Crossing Peak Analysis (ZCPA) model
A typical auditory model of the 1980s: The Seneff model
Recognition accuracy using the Seneff model (Ohshima, 1994) Comment: CDCN performs just as well as the Seneff model
Computational complexity of Seneff model Number of multiplications per ms of speech: Comment: auditory computation is extremely expensive
Sequence of vowels
Vowels processed using energy only
Vowel sounds using autocorrelation expansion
Comments on peripheral timing information Use of timing enables us to develop a rich display of frequencies, even with a limited number of analysis channels Nevertheless, this really gives us no new information unless the nonlinearities do something interesting Processing based on timing information (zero crossings, etc.) is likely to give us a more radically different display of information
Summary - auditory physiology and perception Major physiological attributes: Frequency analysis in parallel channels Preservation of temporal fine structure Limited dynamic range in individual channels Enhancement of temporal contrast (at onsets and offsets) Enhancement of spectral contrast (at adjacent frequencies) Most major physiological attributes have psychophysical correlates We are trying to capture important attributes of the representation for recognition, and we believe that this may help performance in noise and competing signals.
Summary - exploiting peripheral auditory processing Traditional cepstral representations are motivated by frequency resolution in auditory processing, BUT there is much more information in the signal that is not included Timing seems to be important Physiologically-motivated representations are generally computationally expensive While physiologically-motivated representations have not improved speech recognition accuracy so far, we remain optimistic
Auditory scene analysis for automatic speech recognition Auditory scene analysis refers to the methods that humans use to perceive separately sounds that are presented simultaneously Al Bregman and his colleagues have identified many potential cues in auditory stream separation that can be applied directly to the speech recognition problem. Some examples: Separation by pitch/fundamental frequency Separation by source location Separation by micromodulation/common fate Computational auditory scene analysis (CASA) refers to efforts to mimic these processes computationally
Tracking speech sounds via fundamental frequency Given good pitch estimates: How well can we separate signals from noise? How much will this separation help in speech recognition? To what extent can pitch be used to separate speech signals from one another?
The CMU ARCTIC database Collected by John Kominek and Alan Black as a resource for speech synthesis Contains phonetically-balanced recordings with simultaneously-recorded EGG (laryngograph) measurements Available at http://www.festvox.org/cmu_arctic
The CMU ARCTIC database Original speech waveform (amplitude versus sample index) and the corresponding laryngograph recording:
Typical pitch estimates obtained from ARCTIC Estimated F0 (0-400 Hz) plotted against sample index Comment: not all outliers were successfully removed
Isolating speech by pitch Method 1: Estimate amplitudes of partials by synchronous heterodyne analysis Resynthesize as sums of sines or cosines Unvoiced segments are problematic Method 2: Pass speech through a comb filter that tracks harmonic frequencies Unvoiced segments are still problematic
An example of pitch-based approaches: synchronized heterodyne analysis Extract instantaneous pitch, extract amplitudes at harmonics, resynthesize Original speech samples: Reconstructed speech:
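The heterodyne approach can be sketched as follows, under the simplifying assumption of a constant, known pitch f0 (real speech requires frame-by-frame pitch tracking, and the moving-average lowpass here is a crude stand-in for a proper filter):

```python
import numpy as np

# Sketch of synchronous heterodyne analysis: multiply the signal by a
# complex exponential at each harmonic, lowpass to recover the slowly
# varying complex envelope, then resynthesize as a sum of modulated
# cosines at the harmonic frequencies.

def heterodyne_resynth(x, fs, f0, n_harmonics=10, win=400):
    t = np.arange(len(x)) / fs
    kernel = np.ones(win) / win          # crude moving-average lowpass
    y = np.zeros(len(x))
    for k in range(1, n_harmonics + 1):
        demod = x * np.exp(-2j * np.pi * k * f0 * t)
        env = np.convolve(demod, kernel, mode="same")   # complex envelope
        y += 2.0 * np.real(env * np.exp(2j * np.pi * k * f0 * t))
    return y

# Toy "voiced" signal: fundamental plus one harmonic at 200 Hz pitch
fs, f0 = 16000, 200.0
t = np.arange(fs) / fs
x = np.cos(2 * np.pi * f0 * t) + 0.5 * np.cos(2 * np.pi * 2 * f0 * t)
y = heterodyne_resynth(x, fs, f0, n_harmonics=2)
# Away from the edges, y closely tracks the harmonic content of x
```

Energy at non-harmonic frequencies is rejected, which is the basis for using this analysis to separate a voiced talker from noise.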
Recovering speech through comb filtering Pitch-tracking comb filter: H(z) = z^(-P) / (1 - g z^(-P)) Its frequency response: Original speech samples: Reconstructed speech:
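A sketch of the comb filter H(z) = z^(-P) / (1 - g z^(-P)) with a fixed pitch period P (a real system updates P frame by frame from a pitch tracker); the feedback gain g controls how sharply the harmonics are reinforced:

```python
import numpy as np

# Sketch of a pitch comb filter: the recursion y[n] = x[n-P] + g*y[n-P]
# reinforces energy at the pitch period P and its harmonics. For a
# periodic input the steady-state gain at the harmonics is 1/(1-g).

def comb_filter(x, P, g=0.7):
    y = np.zeros(len(x))
    for n in range(P, len(x)):
        y[n] = x[n - P] + g * y[n - P]
    return y

fs = 16000
P = fs // 200                          # 80-sample period -> 200 Hz pitch
t = np.arange(fs // 4) / fs
harmonic = np.sin(2 * np.pi * 200 * t)     # periodic: reinforced
rng = np.random.default_rng(0)
noisy = harmonic + 0.5 * rng.standard_normal(len(t))
out = comb_filter(noisy, P, g=0.8)     # harmonic component is emphasized
```

With g = 0.8 the harmonic component is amplified roughly fivefold relative to broadband noise components that do not repeat at period P.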
Separating speech signals by heterodyning and comb filtering Combined speech signals: Speech separated by heterodyne filters: Speech separated by comb filters: Comment: men mask women more because upper male harmonics are more likely to impinge on lower female harmonics
Speech recognition in noise based on pitch tracking (Vicente y Peña) Recognition accuracy versus SNR (dB) for SHA with oracle pitch, SHA with estimated pitch, and the baseline system Initial results could improve as techniques mature
Speech separation by source location Sources arriving from different azimuths produce interaural time delays (ITDs) and interaural intensity differences (IIDs) as they arrive at the two ears So far this information has been used for: Better masks for missing feature recognition and to combat reverberation (e.g. Brown, Wang et al.) Direct separation from interaural representation
The classical model of binaural processing (Colburn and Durlach 1978)
Array processing based on human binaural hearing
Correlation-based system motivated by binaural hearing
Vowel representations are improved by correlation processing Reconstructed features of vowel /a/ Two inputs, zero delay Two inputs, 120-µs delay Eight inputs, 120-µs delay Recognition results in 1993 showed some (small) improvement in WER at great computational cost
So what do things sound like on the cross-correlation display? Signals combined with ITDs of 0 and 0.5 ms Individual speech signals: Combined speech signals: Separated by delay-and-sum beamforming: Signals separated by cross-correlation display: Signals separated by additional correlations across frequency at a common ITD ("straightness" weighting):
Signal separation using micro-modulation Micromodulation of amplitude and frequency may be helpful in separating unvoiced segments of sound sources Physical cues supported by many psychoacoustical studies in recent years
John Chowning's demonstration of the effects of micro-modulation in frequency Spectrogram (frequency 0-8000 Hz versus time); reconstruction based on the description in Bregman's book
Separating by frequency modulation only Extract instantaneous frequencies of filterbank outputs Cross-correlate frequencies across channels (finds comodulated harmonics) Cluster correlated harmonics and resynthesize Our first example: Isolated speech: Combined speech: Speech separated by frequency modulation: Comment: Success will depend on ability to track frequency components across analysis bands
Summary: signal separation and recognition motivated by auditory processing Computational advances may now enable practical feature extraction based on auditory physiology and perception Computational auditory scene analysis shows promise for signal-separation problems, but the field is presently in its infancy
Outline of discussion Summary of the state-of-the-art in speech technology at Carnegie Mellon and elsewhere Review of speech production and cepstral analysis Introduction to robust speech recognition: classical techniques Robust speech recognition using missing-feature techniques Speech recognition using complementary feature sets Signal processing and signal separation based on human auditory perception Use of multiple microphones for improved recognition accuracy
Introduction The use of arrays of microphones can improve speech recognition accuracy in noise Outline of this talk Review classical approaches to microphone array processing» Delay-and-sum beamforming» Traditional adaptive filtering» Physiologically-motivated processing Describe and discuss selected recent results» Matched-filter array processing (Rutgers)» Array processing based on speech features (CMU)
OVERVIEW OF SPEECH RECOGNITION Speech features Phoneme hypotheses Major functional components: Signal processing to extract features from speech waveforms Comparison of features to pre-stored templates Important design choices: Choice of features Feature extraction Specific method of comparing features to stored templates Decision making procedure
Why use microphone arrays? Microphone arrays can provide directional response, accepting speech from some directions but suppressing others
Another reason for microphone arrays Microphone arrays can focus attention on the direct field in a reverberant environment
Three classical types of microphone arrays Delay-and-sum beamforming and its many variants Adaptive arrays based on mean-square suppression of noise Physiologically and perceptually motivated approaches to multiple microphones
Delay-and-sum beamforming Simple processing based on equalizing delays to sensors High directivity can be achieved with many sensors Diagram: sensors 1 through K feed delays z^(-n_1) ... z^(-n_K), whose outputs are summed to form the output
The physics of delay-and-sum beamforming A wave arriving from angle θ reaches adjacent sensors (spacing d) with a path-length difference of d sin(θ)
The physics of delay-and-sum beamforming If sensor outputs are added together, the look direction is θ = 0 Look direction can be steered to other directions by inserting electronic delays to compensate for physical ones For a look direction of θ = 0, the net array response for N sensors with spacing d is A(θ) = sin(Nωd sin(θ)/2c) / sin(ωd sin(θ)/2c) = sin(Nπd sin(θ)/λ) / sin(πd sin(θ)/λ)
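The array response above can be evaluated numerically (a sketch; the normalization by N so that the broadside response is 1, the speed of sound c = 343 m/s, and the angle grid are our assumptions):

```python
import numpy as np

# Sketch of the delay-and-sum beam pattern:
#   A(theta) = sin(N*pi*d*sin(theta)/lambda) / (N*sin(pi*d*sin(theta)/lambda)),
# normalized to 1 in the look direction. Points where sin(x) = 0 (the main
# lobe and any grating lobes) are set to 1, the correct limiting magnitude.

def beam_pattern(theta, d, N, f, c=343.0):
    lam = c / f
    x = np.pi * d * np.sin(np.atleast_1d(theta)) / lam
    with np.errstate(divide="ignore", invalid="ignore"):
        a = np.sin(N * x) / (N * np.sin(x))
    return np.where(np.isclose(np.sin(x), 0.0), 1.0, a)

theta = np.linspace(-np.pi / 2, np.pi / 2, 721)
A1000 = beam_pattern(theta, d=0.0862, N=9, f=1000.0)   # single main lobe
A6000 = beam_pattern(theta, d=0.0862, N=9, f=6000.0)   # grating lobes appear
```

Sweeping f with d = 8.62 cm and N = 9 reproduces the narrowing beams on the following slides, and at 6000 Hz a grating lobe (spatial aliasing) reenters the visible region.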
Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 500 Hz
Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 1000 Hz
Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 1500 Hz
Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 2000 Hz
Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 2500 Hz
Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 3000 Hz
Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 3500 Hz
Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 4000 Hz
Nested microphone arrays (Flanagan et al.) 5-element low frequency array
Nested microphone arrays 5-element mid frequency array
Nested microphone arrays 5-element high frequency array
Combined nested array (Flanagan et al.) Three-band quasi-constant beamwidth array Lowpass filter Bandpass filter Highpass filter
Another delay-and-sum issue: spatial aliasing d = 8.62 cm, N = 9, f = 4000 Hz
Another delay-and-sum issue: spatial aliasing d = 8.62 cm, N = 9, f = 5000 Hz
Another delay-and-sum issue: spatial aliasing d = 8.62 cm, N = 9, f = 6000 Hz
Preventing spatial aliasing Spatial aliasing occurs when adjacent sensors (spacing d, arrival angle θ, path-length difference d sin(θ)) receive input more than half a period apart The spatial Nyquist constraint depends on both frequency and arrival angle To prevent spatial aliasing we require that the maximum frequency satisfy ν < 2c / (d sin(θ))
Filter-and-sum beamforming Input filters can (in principle) place delays that vary with frequency to ameliorate frequency dependencies of beamforming Filters can also compensate for channel characteristics Diagram: sensors 1 through K each feed their own filter, and the filter outputs are summed
Compensated delay-and-sum beamforming Filter added to compensate for filtering effects of delay-and-sum beamforming Diagram: sensors 1 through K feed delays z^(-n_1) ... z^(-n_K); the delayed outputs are summed and passed through the compensating filter
Compensated delay-and-sum beamforming: some implementations Sullivan and Stern: CDCN compensation Rutgers group: compensation function derived using a neural network Silverman group Omologo group
Sample recognition results using compensated delay-and-sum The Flanagan array with CDCN does improve accuracy:
Traditional adaptive arrays
Traditional adaptive arrays Large established literature Use MMSE techniques to establish beams in the look direction and nulls toward additive noise sources Generally do not perform well in reverberant environments: Signal cancellation Effective impulse response longer than length of filter Techniques to circumvent signal cancellation: Switching the nulling mechanism off and on according to presence or absence of speech (Van Compernolle) Use of alternate adaptation algorithms
Array processing based on human binaural hearing Motivation: the human binaural system is known to have excellent immunity to additive noise and reverberation Binaural phenomena of interest: Cocktail party effect Precedence effect Problems with binaural models: Correlation produces signal distortion from rectification and squaring Precedence-effect processing defeats echoes but also suppresses desired signals Greatest challenge: decoding useful information from the cross-correlation display
Correlation-based system motivated by binaural hearing
Matched-filter beamforming (Rutgers) Goal: compensation for delay and dispersion introduced in reverberant environments 600-ms room response and its autocorrelation function:
Matched-filter beamforming procedure 1. Measure or estimate the impulse response from the source to each sensor 2. Convolve each sensor's input with the time-reversed impulse response (producing an autocorrelation function) 3. Sum the outputs of the channels together Rationale: main lobes of the autocorrelation functions should reinforce while the side lobes cancel
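The three steps above can be sketched numerically; the exponentially decaying random impulse responses below are toy stand-ins for measured room responses:

```python
import numpy as np

# Sketch of matched-filter beamforming: each sensor sees the source
# convolved with its room response h; convolving that signal with the
# time-reversed response h[::-1] makes the effective source-to-output
# channel the autocorrelation of h. Summing channels reinforces the
# autocorrelation main lobes (all at zero lag) while side lobes tend
# to cancel across channels.

rng = np.random.default_rng(1)
src = rng.standard_normal(2000)                 # toy source signal

out = np.zeros(2000 + 2 * 400 - 2)              # length 2798 after two convolutions
effective = []
for _ in range(8):                              # 8-sensor array
    h = rng.standard_normal(400) * np.exp(-np.arange(400) / 80.0)
    x = np.convolve(src, h)                     # step 1: signal at the sensor
    out += np.convolve(x, h[::-1])              # steps 2-3: matched filter, sum
    effective.append(np.convolve(h, h[::-1]))   # autocorrelation of h

combined = np.sum(effective, axis=0)            # overall effective channel
# The combined channel peaks sharply at zero lag (index 399)
```

The peak-to-sidelobe ratio of the combined channel improves as sensors are added, which is the sense in which the reverberant tails "cancel."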
Optimizing microphone arrays for speech recognition features The objective of typical microphone array algorithms has been signal enhancement rather than speech recognition Diagram: N microphones feed an array processor that produces an estimate ŝ[n] of the clean signal s[n]
Automatic Speech Recognition (ASR) Parameterize the speech signal and compare the parameter sequence to statistical models of speech sound units to hypothesize what a user said. Feature extraction converts s[n] into observations {O_1, ..., O_N}, scored by an acoustic model P(O|W) and a language model P(W): P(W|O) = P(O|W) P(W) / P(O) yielding hypothesized words {Ŵ_1, ..., Ŵ_M} The objective is accurate recognition, a statistical pattern classification problem.
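The decision rule above can be illustrated with toy numbers (both the hypotheses and the probabilities are invented for illustration; P(O) is constant across hypotheses and can be ignored):

```python
# Sketch of the Bayes decision rule: choose the word string W that
# maximizes P(O|W) * P(W). The numbers below are made up.
hyps = {
    "recognize speech": (1e-8, 1e-3),     # (acoustic P(O|W), language P(W))
    "wreck a nice beach": (2e-8, 1e-5),   # acoustically likelier, linguistically not
}
best = max(hyps, key=lambda w: hyps[w][0] * hyps[w][1])
```

Here the language model overrides the slightly better acoustic match, so `best` is "recognize speech"; this interplay is exactly why both models appear in the numerator.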
Speech recognition with microphone arrays Recognition with microphone arrays has traditionally been performed by gluing the two systems together. The systems have different objectives. Each system does not exploit information present in the other. Diagram: the microphones feed an array processor, then feature extraction, then the ASR system
Array processing based on speech features Develop an array processing scheme targeted at improved speech recognition performance without regard to conventional array processing objective criteria. Diagram: microphones → array processor → feature extraction → ASR
Multi-microphone compensation for speech recognition based on cepstral distortion Multi-mic compensation based on optimizing speech features rather than signal distortion Diagram: speech in room → array processor → front end → ASR, comparing delay-and-sum with optimal compensation
Choosing array weights based on speech features Want an objective function that uses parameters directly related to recognition Diagram: each microphone signal x_m is delayed by τ_m, filtered by h_m, and summed to give y; feature extraction of y is compared with the clean speech features of s to form an error ε, which we minimize
An objective function for mic arrays based on speech recognition Define Q as the sum of squared errors between the log Mel spectra of clean speech s and noisy speech y: Q = Σ_f Σ_l (M_y[f,l] - M_s[f,l])² where y is the output of a filter-and-sum microphone array and M[f,l] is the l-th log Mel spectral value in frame f. M_y[f,l] is a function of the signals captured by the array and the filter parameters associated with each microphone.
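The objective Q can be written directly in code (a sketch; random arrays stand in for real log Mel features, which would come from the front end):

```python
import numpy as np

# Sketch of the objective: Q sums, over frames f and Mel channels l, the
# squared difference between the log Mel spectra of the clean reference
# and of the array output.

def log_mel_sse(M_y, M_s):
    """Q = sum_f sum_l (M_y[f,l] - M_s[f,l])**2 for [frames x channels] arrays."""
    return float(np.sum((M_y - M_s) ** 2))

rng = np.random.default_rng(0)
M_s = rng.standard_normal((100, 40))                 # "clean" log Mel spectra
M_y = M_s + 0.1 * rng.standard_normal((100, 40))     # "array output" spectra
q = log_mel_sse(M_y, M_s)                            # small positive value
```

Because M_y depends on the filter coefficients through the whole front end, Q is non-linear in those coefficients, which is why the calibration on the next slide resorts to iterative gradient-based optimization.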
Calibration of microphone arrays for ASR Calibration of filter-and-sum microphone array: User speaks an utterance with known transcription» With or without close-talking microphone Derive optimal set of filters» Minimize the objective function with respect to the filter coefficients.» Since objective function is non-linear, use iterative gradient-based methods Apply to all future speech
Calibration using close-talking recording Given a close-talking mic recording for the calibration utterance, derive an optimal filter for each channel to improve recognition Diagram: the close-talking features M_s drive the optimization of the per-channel filters h_m(n), applied to the delayed microphone signals before summation and feature extraction of M_y
Multi-microphone data sets TMS Recorded in CMU auditory lab» Approx. 5 m x 5 m x 3 m» Noise from computer fans, blowers, etc. Isolated letters and digits, keywords 10 speakers * 14 utterances = 140 utterances Each utterance has close-talking mic control waveform (figure labels: 1 m, 7 cm)
Multi-microphone data sets (2) WSJ + off-axis noise source Room simulation created using the image method» 5 m x 4 m x 3 m» 200-ms reverberation time» WGN source @ 5 dB SNR WSJ test set» 5K word vocabulary» 10 speakers * 65 utterances = 650 utterances Original recordings used as close-talking control waveforms (figure labels: 2 m, 1 m, 25 cm, 15 cm)
Results TMS data set, WSJ0 + WGN point source simulation Constructed 50-point filters from a single calibration utterance Applied filters to all test utterances Bar charts show WER (%) for the TMS and WSJ sets under the CLSTK (close-talking), 1 MIC, D&S (delay-and-sum), and MEL OPT conditions
Calibration without Close-talking Microphone Obtain initial waveform estimate using conventional array processing technique (e.g. delay and sum). Use transcription and the recognizer to estimate the sequence of target clean log Mel spectra. Optimize filter parameters as before.
Calibration w/o Close-talking Microphone (2) Force-align the delay-and-sum waveform to the known transcription to generate an estimated HMM state sequence {q̂_1, q̂_2, ..., q̂_N}
Calibration w/o Close-talking Microphone (3) Extract the means from the single-Gaussian HMMs of the estimated state sequence: {q̂_1, q̂_2, ..., q̂_N} → {µ_1, µ_2, ..., µ_N} → IDCT → M̂_s Since the models have been trained from clean speech, use these means as the target clean speech feature vectors.
Calibration w/o Close-talking Microphone (4) Use the estimated clean speech feature vectors M̂_s to optimize the filters h_m(n) as before.
Results TMS data set, WSJ0 + WGN point source simulation Constructed 50-point filters from the calibration utterance Applied filters to all utterances Bar charts show WER (%) for TMS and WSJ under the CLSTK, 1 MIC, D&S, MEL OPT W/ CLSTK, and MEL OPT NO CLSTK conditions
Results (2) WER vs. SNR for WSJ + WGN Constructed 50-point filters from the calibration utterance using the transcription only Applied filters to all utterances Plot shows WER (%) versus SNR (0-25 dB) for the close-talking, optimized-calibration, delay-and-sum, and single-microphone conditions
The problem of reverberation High levels of reverberation are extremely detrimental to recognition accuracy In reverberant environments, the noise is actually reflected copies of the target speech signals Noise-canceling strategies (like the LMS algorithm) based on uncorrelated noise sources fail Frame-based compensation strategies also fail because effects of reverberation can be spread over several frames
Baseline in highly reverberant rooms Comparison of single-channel and delay-and-sum beamforming (WSJ data passed through measured impulse responses): WER plotted against reverberation time (0-1500 ms) for single-channel and delay-and-sum processing
Subband processing using optimized features (Seltzer) Subband processing can address some of the effects of reverberation. The subband signals have more desirable narrowband signal properties. 1. Divide into independent subbands 2. Downsample 3. Process subbands independently 4. Upsample 5. Resynthesize full signal from subbands Diagram: analysis filters H_l(z), downsampling, subband filters F_l(z), upsampling, and synthesis filters G_l(z), summed to give y[n]
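The five-step chain above can be illustrated with the simplest possible two-band (Haar) filter bank; this is only a sketch of the analysis/synthesis plumbing, and the actual subband system optimizes a filter within each subband rather than passing it through unchanged:

```python
import numpy as np

# Sketch of a two-band analysis/synthesis chain: split into low/high
# subbands, downsample by 2, (optionally process each subband), then
# upsample and resynthesize. The Haar pair gives exact reconstruction.

def haar_analysis(x):
    x = x[: len(x) // 2 * 2]                    # force even length
    low = (x[0::2] + x[1::2]) / np.sqrt(2.0)    # lowpass + downsample
    high = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # highpass + downsample
    return low, high

def haar_synthesis(low, high):
    x = np.empty(2 * len(low))
    x[0::2] = (low + high) / np.sqrt(2.0)       # upsample + synthesis filters
    x[1::2] = (low - high) / np.sqrt(2.0)
    return x

x = np.sin(2 * np.pi * 5 * np.arange(256) / 256)
low, high = haar_analysis(x)     # each subband runs at half the sample rate
y = haar_synthesis(low, high)    # perfect reconstruction of x
```

Because each subband runs at a reduced rate, a long reverberant impulse response can be attacked with much shorter per-subband filters than a single full-band filter would require.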
Subband results with reverberated WSJ task WER for all speakers, compared to delay-and-sum processing: bar chart of WER at 300-ms and 470-ms reverberation times for delay-and-sum and subband LIMABEAM
Is Joint Filter Estimation Necessary? We compared 4 cases: Delay and sum Optimize 1 filter for delay-and-sum output Optimize microphone array filters independently Optimize microphone array filters jointly Bar chart shows WER (%) on WSJ + WGN at 10 dB for the four cases
Summary Microphone array processing is effective although not yet in widespread use Despite many developments in signal processing, actual applications to speech are based on very simple concepts: Delay-and-sum beamforming Constant-beamwidth arrays Array processing based on preservation of feature values can improve accuracy, even in reverberant environments Major problems and issues: Maintaining good performance in reverberation Real-time operation with time-varying environments and speakers