Applying Models of Auditory Processing to Automatic Speech Recognition: Promise and Progress


1 Applying Models of Auditory Processing to Automatic Speech Recognition: Promise and Progress
Richard Stern (with Chanwoo Kim, Yu-Hsiang Chiu, and others)
Department of Electrical and Computer Engineering and Language Technologies Institute
Carnegie Mellon University, Pittsburgh, Pennsylvania
Frederick Jelinek Memorial Workshop on Meaning Representations in Language and Speech Processing
Prague, Czech Republic, July 16, 2014

2 Introduction: auditory processing and automatic speech recognition
- I was originally trained in auditory perception, and my original work was in binaural hearing
- Over the past years, I have been spending the bulk of my time trying to improve the accuracy of automatic speech recognition systems in difficult acoustical environments
- In this talk I would like to discuss some of the ways in which my group (and many others) have been attempting to apply knowledge of auditory perception to improve ASR accuracy
Comment: approaches can be more or less faithful to physiology and psychophysics
Slide 2

3 The big questions
- How can knowledge of auditory physiology and perception improve speech recognition accuracy?
- Can speech recognition results tell us anything we don't already know about auditory processing?
- What aspects of the processing are most valuable for robust feature extraction?
Slide 3

4 Two historical notes
- Everything is changing with deep learning
  - Is there a role for traditional robust speech technologies?
- Knowledge-based versus statistically-based processing
Slide 4

5 So what I will do is:
- Briefly review some of the major physiological and psychophysical results that motivate the models
- Briefly review and discuss the major classical auditory models of the 1980s
  - Seneff, Lyon, and Ghitza
- Review some of the major new trends in today's models
- Talk about some representative issues that have driven recent work at CMU and what we have learned from them
Slide 5

6 Speech recognition as pattern classification
[Block diagram: feature extraction produces speech features; a decision-making procedure turns them into utterance hypotheses]
- Major functional components:
  - Signal processing to extract features from speech waveforms
  - Comparison of features to pre-stored representations
- Important design choices:
  - Choice of features
  - Specific method of comparing features to stored representations
Slide 6
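The two functional components on this slide can be sketched in a few lines. This is only a toy illustration: the nearest-template rule and the names `classify` and `templates` are assumptions made for the example, whereas practical recognizers compare features to statistical models rather than to single stored vectors.

```python
import math

def classify(features, templates):
    """Return the label whose stored feature vector is closest (Euclidean).

    `templates` maps a label to a pre-stored feature vector; both the
    dictionary layout and the distance measure are illustrative choices.
    """
    return min(templates, key=lambda label: math.dist(features, templates[label]))

# Hypothetical two-dimensional features for two stored words.
templates = {"yes": [1.0, 0.0], "no": [0.0, 1.0]}
print(classify([0.9, 0.2], templates))
```

The design choices listed on the slide map directly onto this sketch: the contents of `features` are the choice of features, and the distance function is the method of comparison.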

7 Default signal processing: Mel frequency cepstral coefficients (MFCCs)
Processing chain: input speech -> multiply by brief window -> Fourier transform -> magnitude -> triangular weighting -> log -> inverse Fourier transform -> MFCCs
Comment: 20-ms time slices are modeled by smoothed spectra, with attention paid to auditory frequency selectivity
Slide 7
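The MFCC chain on this slide can be written out for a single frame as a plain-Python sketch. Several details are assumptions chosen for illustration and do not come from the slides: the Hamming window, the filter count of 20, the 13 output coefficients, and the small floor inside the logarithm. The final inverse transform is realized as a DCT, which is the usual implementation because the log spectrum is real and symmetric.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sample_rate, n_filters=20, n_ceps=13):
    """MFCCs for one short frame (direct O(N^2) DFT, for clarity only)."""
    n = len(frame)
    # 1. Multiply by a brief window (Hamming is an assumed choice).
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    # 2. Fourier transform; keep the squared magnitude for bins 0..n/2.
    n_bins = n // 2 + 1
    power = []
    for k in range(n_bins):
        s = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                for i in range(n))
        power.append(abs(s) ** 2)
    # 3. Triangular weighting with band edges equally spaced in mel.
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * j / (n_filters + 1)) for j in range(n_filters + 2)]
    bin_hz = sample_rate / n
    log_energies = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        e = 0.0
        for k in range(n_bins):
            f = k * bin_hz
            if lo < f <= mid:
                e += power[k] * (f - lo) / (mid - lo)
            elif mid < f < hi:
                e += power[k] * (hi - f) / (hi - mid)
        # 4. Log compression (the floor avoids log(0)).
        log_energies.append(math.log(e + 1e-10))
    # 5. Inverse transform of the log spectrum, realized as a DCT-II.
    return [sum(log_energies[j] * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j in range(n_filters))
            for c in range(n_ceps)]

# Usage: one 20-ms frame of a tone sampled at 8 kHz.
fs = 8000
frame = [math.sin(2 * math.pi * 1030 * i / fs) for i in range(160)]
coeffs = mfcc_frame(frame, fs)
```

Keeping only the first few cosine-transform coefficients is what smooths the spectrum, which is why the reconstruction looks blurry next to a wideband spectrogram.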


8 What the speech recognizer sees
An original spectrogram: [figure]
Spectrum recovered from MFCC: [figure]
Slide 8

9 Comments on the MFCC representation
- It's very blurry compared to a wideband spectrogram!
- Aspects of auditory processing represented:
  - Frequency selectivity and spectral bandwidth (but using a constant analysis window duration!)
    - Wavelet schemes exploit time-frequency resolution better
  - Nonlinear amplitude response
- Aspects of auditory processing NOT represented:
  - Detailed timing structure
  - Lateral suppression
  - Enhancement of temporal contrast
  - Other auditory nonlinearities
Slide 9

10 Basic auditory anatomy
- Structures involved in auditory processing: [figure]
Slide 10

11 Excitation along the basilar membrane (courtesy James Hudspeth, HHMI) Slide 11

12 Central auditory pathways
- There is a lot going on!
- For the most part, we only consider the response of the auditory nerve
  - It is in series with everything else
Slide 12

13 Transient response of auditory-nerve fibers
- Histograms of response to tone bursts (Kiang et al., 1965): [figure]
Comment: Onsets and offsets produce overshoot
Slide 13

14 Frequency response of auditory-nerve fibers: tuning curves
- Threshold level for auditory-nerve response to tones: [figure]
- Note dependence of bandwidth on center frequency and asymmetry of response
Slide 14

15 Typical response of auditory-nerve fibers as a function of stimulus level
- Typical response of auditory-nerve fibers to tones as a function of intensity: [figure]
- Comment: Saturation and limited dynamic range
Slide 15

16 Synchronized auditory-nerve response to low-frequency tones
- Comment: response remains synchronized over a wide range of intensities
Slide 16

17 Comments on synchronized auditory response
- Nerve fibers synchronize to fine structure at low frequencies and to signal envelopes at high frequencies
- Synchrony is clearly important for auditory localization
- Synchrony could be important for monaural processing of complex signals as well
Slide 17

18 Lateral suppression in auditory processing
- Auditory-nerve response to pairs of tones: [figure]
- Comment: Lateral suppression enhances local contrast in frequency
Slide 18

19 Auditory frequency selectivity: critical bands
- Measurements of psychophysical filter bandwidth by various methods: [figure]
- Comments:
  - Bandwidth increases with center frequency
  - Solid curve is Equivalent Rectangular Bandwidth (ERB)
Slide 19

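20 Three perceptual auditory frequency scales
Bark scale (DE):
\[ \mathrm{Bark}(f) = \begin{cases} 0.01 f, & 0 \le f < 500 \\ 0.007 f + 1.5, & 500 \le f < 1220 \\ 6 \ln(f) - 32.6, & 1220 \le f \end{cases} \]
Mel scale (USA):
\[ \mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \]
ERB scale (UK):
\[ \mathrm{ERB}(f) = 24.7\,(4.37 f + 1) \]
(f in Hz for the Bark and Mel scales; f in kHz for the ERB expression)
Slide 20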
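The three expressions on this slide are easy to evaluate side by side. In the sketch below the Mel and ERB constants are as printed on the slide; the two middle Bark coefficients (0.007 and 6) are not legible in the extracted text and are assumed here because they make the three segments meet at 500 Hz and 1220 Hz. The function names are illustrative.

```python
import math

def bark(f_hz):
    """Piecewise Bark approximation; 0.007 and 6.0 are assumed coefficients."""
    if f_hz < 500.0:
        return 0.01 * f_hz
    if f_hz < 1220.0:
        return 0.007 * f_hz + 1.5
    return 6.0 * math.log(f_hz) - 32.6

def mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def erb(f_hz):
    """Equivalent rectangular bandwidth in Hz; the formula takes f in kHz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

for f in (250.0, 1000.0, 4000.0):
    print(f, round(bark(f), 2), round(mel(f), 1), round(erb(f), 1))
```

Note that `erb` returns a bandwidth rather than a position on a warped frequency axis; all three functions grow roughly linearly at low frequencies and more slowly (or, for the bandwidth, proportionally) at high frequencies.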

21 Comparison of normalized perceptual frequency scales
- Bark scale (in blue), Mel scale (in red), and ERB scale (in green):
[Plot: relative perceptual scale versus frequency (Hz) for the Bark, Mel, and ERB scales]
Slide 21

22 Perceptual masking of adjacent spectrotemporal components
- Spectral masking: Intense signals at a given frequency mask adjacent frequencies (asymmetrically)
- Temporal masking: Intense signals at a given frequency can mask successive input at that frequency (and to some extent before the masker occurs!)
- These phenomena are an important part of the auditory models used in perceptual audio coding (such as in creating MP3 files)
Slide 22

23 The loudness of sounds
- Equal loudness contours (Fletcher-Munson curves): [figure]
Slide 23

24 Summary of basic auditory physiology and perception
- Major monaural physiological attributes:
  - Frequency analysis in parallel channels
  - Preservation of temporal fine structure
  - Limited dynamic range in individual channels
  - Enhancement of temporal contrast (at onsets and offsets)
  - Enhancement of spectral contrast (at adjacent frequencies)
- Most major physiological attributes have psychophysical correlates
- Most physiological and psychophysical effects are not preserved in conventional representations for speech recognition
Slide 24

25 Auditory models in the 1980s: the Seneff model
- An early well-known auditory model
- In addition to mean rate, used Generalized Synchrony Detector to extract synchrony
- Overall model: critical-band filter bank (Stage I) -> hair-cell model (Stage II) -> Stage III, in which an envelope detector gives the mean-rate spectrum and a synchrony detector gives the synchrony spectrum
- Detail of Stage II: saturating half-wave rectifier -> short-term AGC -> lowpass filter -> rapid AGC
Slide 25

26 Auditory models in the 1980s: Ghitza's EIH model
- Estimated timing information from ensembles of zero crossings with different thresholds
[Block diagram: cochlear filters 1 through 85, each feeding level-crossing detectors L-1 to L-7 and interval histograms IH-1 to IH-7]
Slide 26
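The idea of building interval histograms from level crossings can be illustrated for a single channel. This is a simplified sketch and not Ghitza's implementation: the cochlear filter is omitted, and the function names, the choice of upward crossings only, the levels, and the 50-Hz histogram bins are all assumptions made for the example.

```python
import math
from collections import Counter

def upward_crossings(signal, level):
    """Sample indices at which the signal crosses `level` going upward."""
    return [i for i in range(1, len(signal))
            if signal[i - 1] < level <= signal[i]]

def interval_histogram(signal, sample_rate, levels, bin_hz=50.0):
    """Pool 1/interval (in Hz) over all levels into one histogram."""
    hist = Counter()
    for level in levels:
        idx = upward_crossings(signal, level)
        for a, b in zip(idx, idx[1:]):
            freq = sample_rate / (b - a)  # inverse of the crossing interval
            hist[round(freq / bin_hz) * bin_hz] += 1
    return hist

# Usage: a 200-Hz tone should pile its intervals into the 200-Hz bin.
fs = 8000
tone = [math.sin(2 * math.pi * 200 * i / fs) for i in range(800)]
hist = interval_histogram(tone, fs, levels=[0.1, 0.3, 0.5, 0.7])
print(hist.most_common(1))
```

Using several thresholds per channel means that stronger components contribute crossings at more levels, so the pooled histogram reflects both timing and, indirectly, intensity.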

27 Auditory models in the 1980s: Lyon's auditory model
- Single stage of the Lyon auditory model: [block diagram with half-wave rectifier, 1-kHz lowpass, gains A, B, and C, target, limit, and H-B + H-C blocks]
- Lyon model included nonlinear compression, lateral suppression, temporal effects
- Also added correlograms (autocorrelation and crosscorrelation of model outputs)
Slide 27

28 And one more: Cohen's model (1989)
- Blocks: 512-point FFT, critical-band filters, loudness normalization, power-law compression, short-term adaptation
- Loudness normalization and transient enhancement novel for the time
- Used successfully as part of many IBM systems
Slide 28


29 The other standard approach: Perceptual Linear Prediction (PLP, Hermansky '90)
- Comments:
  - A pragmatic approach to auditory modeling
  - Pre-emphasis, loudness normalization based on threshold of hearing
  - RASTA enhancement provides cepstral normalization and modulation filtering
  - Widely used with success today
Slide 30

30 Auditory modeling was expensive: Computational complexity of Seneff model
- Number of multiplications per ms of speech (from Ohshima and Stern, 1994):
[Bar chart: multiplications per ms of speech for Stages I, II, and III of the Seneff model and for LPC; vertical axis up to 25,000]
Slide 31

31 Summary: early auditory models
- The models developed in the 1980s included:
  - Realistic auditory filtering
  - Realistic auditory nonlinearity
  - Synchrony extraction
  - Lateral suppression
  - Higher order processing through auto-correlation and cross-correlation
- Every system developer had his or her own idea of what was important
Slide 32

32 Evaluation of early auditory models (Ohshima and Stern, 1994)
- Not much quantitative evaluation actually performed
- General trends of results:
  - Physiological processing did not help much (if at all) for clean speech
  - More substantial improvements observed for degraded input
  - Benefits generally do not exceed what could be achieved with more prosaic approaches (e.g. CDCN/VTS in our case)
Slide 33

33 Other reasons why work on auditory models subsided in the late 1980s
- Failure to obtain a good statistical match between characteristics of features and speech recognition system
  - Ameliorated by subsequent development of continuous HMMs
- More pressing need to solve other basic speech recognition problems
Slide 34

34 Renaissance in the 1990s!
By the late 1990s, physiologically-motivated and perceptually-motivated approaches to signal processing began to flourish
Some major new trends:
- Computation no longer such a limiting factor
- Serious attention to temporal evolution
- Attention to reverberation
- Binaural processing
- More effective and mature approaches to information fusion
Slide 35

35 Peripheral auditory modeling at CMU, 2004-now
- Foci of activities:
  - Representing synchrony
  - The shape of the rate-intensity function
  - Revisiting analysis duration
  - Revisiting frequency resolution
  - Onset enhancement
  - Modulation filtering
  - Binaural and polyaural techniques
  - Auditory scene analysis: common frequency modulation
Slide 36

36 Speech representation using mean rate
- Representation of vowels by Young and Sachs using mean rate: [figure]
- Mean rate representation does not preserve spectral information
Slide 37

37 Speech representation using average localized synchrony rate
- Representation of vowels by Young and Sachs using ALSR: [figure]
Slide 38

38 Physiologically-motivated signal processing: the Zhang-Carney model of the periphery
- We used the synapse output as the basis for further processing
Slide 40


39 Physiologically-motivated signal processing: synchrony and mean-rate detection (Kim/Chiu '06)
- Synchrony response is smeared across frequency to remove pitch effects
- Higher frequencies represented by mean rate of firing
- Synchrony and mean rate combined additively
- Much more processing than MFCCs
Slide 41

40 Comparing auditory processing with cepstral analysis: clean speech
[Figure panels: original spectrogram, MFCC reconstruction, auditory analysis]
Slide 42

41 Comparing auditory processing with cepstral analysis: 20-dB SNR Slide 43

42 Comparing auditory processing with cepstral analysis: 10-dB SNR Slide 44

43 Comparing auditory processing with cepstral analysis: 0-dB SNR Slide 45

44 Auditory processing is more effective than MFCCs at low SNRs, especially in white noise
Accuracy in background noise: [figure]
Accuracy in background music: [figure]
- Curves are shifted by dB (greater improvement than obtained with VTS or CDCN)
[Results from Kim et al., Interspeech 2006]
Slide 46

45 Do auditory models really need to be so complex?
- Model of Zhang et al. 2001: [figure]
- A much simpler model: s(t) -> gammatone filters -> nonlinear rectifiers -> lowpass filters -> P(t)
Slide 47
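One channel of the simpler model (gammatone filter, rectifier, lowpass filter) can be sketched as follows. The specifics are assumptions chosen for illustration rather than the settings used in the work described: a fourth-order gammatone with bandwidth 1.019 times the ERB, a half-wave rectifier as the nonlinearity, a one-pole lowpass with a 1-kHz cutoff, and the function names themselves.

```python
import math

def gammatone_ir(fc, sample_rate, duration=0.025, order=4):
    """Peak-normalized impulse response t^(n-1) exp(-2 pi b t) cos(2 pi fc t)."""
    b = 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)  # bandwidth tied to the ERB
    n = int(duration * sample_rate)
    ir = [(i / sample_rate) ** (order - 1)
          * math.exp(-2 * math.pi * b * i / sample_rate)
          * math.cos(2 * math.pi * fc * i / sample_rate)
          for i in range(n)]
    peak = max(abs(v) for v in ir)
    return [v / peak for v in ir]

def convolve(x, h):
    """Causal FIR filtering: y[n] = sum_k h[k] x[n-k], same length as x."""
    return [sum(h[k] * x[n - k] for k in range(min(len(h), n + 1)))
            for n in range(len(x))]

def simple_auditory_channel(x, fc, sample_rate, cutoff_hz=1000.0):
    """Gammatone filter -> half-wave rectifier -> one-pole lowpass filter."""
    filtered = convolve(x, gammatone_ir(fc, sample_rate))
    rectified = [max(v, 0.0) for v in filtered]
    a = math.exp(-2 * math.pi * cutoff_hz / sample_rate)
    out, state = [], 0.0
    for v in rectified:
        state = a * state + (1.0 - a) * v
        out.append(state)
    return out

# Usage: a 1-kHz tone excites the 1-kHz channel far more than the 3-kHz one.
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * i / fs) for i in range(400)]
on_channel = simple_auditory_channel(tone, 1000.0, fs)
off_channel = simple_auditory_channel(tone, 3000.0, fs)
```

A full front end would run a bank of such channels with center frequencies spaced along a perceptual scale and then compress the outputs.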

46 Comparing simple and complex auditory models
- Comparing MFCC processing, a trivial (filter-rectify-compress) auditory model, and the full Carney-Zhang model (Chiu 2006):
[Plot: WER (%) versus SNR (dB) for the complex auditory model, the simple auditory model, and MFCC]
Slide 48

47 The nonlinearity seems to be the most important attribute of the Seneff model (Chiu '08)
[Block diagram: critical-band filter bank (Stage I) -> hair-cell model (Stage II) -> Stage III, with an envelope detector giving the mean-rate spectrum and a synchrony detector giving the synchrony spectrum; Stage II consists of saturating half-wave rectifier, short-term AGC, lowpass filter, and rapid AGC]
Slide 49

48 Why the nonlinearity seems to help
[Figure: output versus channel index for clean and noisy speech, under traditional log compression and under rate-level function compression, shown alongside input energy histograms of clean speech and of 20-dB white noise]
Slide 50

49 Impact of auditory nonlinearity (Chiu)
[Plot (a), test set A: accuracy from 0% to 100% for the learned nonlinearity, the baseline nonlinearity, and MFCC, for clean speech and at SNRs of 20, 15, 10, 5, 0, and -5 dB]
Slide 51

50 PNCC processing (Kim and Stern, 2010, 2014)
- A pragmatic implementation of a number of the principles described:
  - Gammatone filterbanks
  - Nonlinearity shaped to follow auditory processing
  - Medium-time environmental compensation using nonlinear cepstral highpass filtering in each channel
  - Enhancement of envelope onsets
  - Computationally efficient implementation
Slide 52


53 An integrated front end: power-normalized cepstral coefficients (PNCC, Kim '10)
[Three block diagrams, each also containing a |H(ω)|^2 block:]
- MFCC processing: STFT, triangular frequency weighting, logarithmic nonlinearity, DCT
- RASTA-PLP processing: STFT, critical-band frequency weighting, nonlinear compression, RASTA filter, nonlinear expansion, power-law nonlinearity (exponent 1/3), IFFT, compute LPC-based cepstra
- PNCC processing: STFT, gammatone frequency weighting, noise reduction, temporal masking, frequency weighting, power normalization, power-law nonlinearity (exponent 1/15), DCT
Slide 55

54 The nonlinearity in PNCC processing (Kim)
[Two plots of rate (spikes/sec), one versus pressure (Pa) and one versus tone level (dB SPL), each comparing the human rate-intensity model with a cube-root power-law approximation, an MMSE power-law approximation, and a logarithmic approximation]
Slide 56
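A small numerical comparison shows one way in which logarithmic and power-law nonlinearities differ. The exponents 1/3 and 1/15 are the ones printed on the pipeline slide for RASTA-PLP and PNCC; the function names and the floor used inside the logarithm are assumptions for illustration.

```python
import math

def log_nonlinearity(power, floor=1e-20):
    """Natural-log compression; the floor is an assumed guard against log(0)."""
    return math.log(max(power, floor))

def power_law_nonlinearity(power, exponent=1.0 / 15.0):
    """Power-law compression; 1/15 is the PNCC exponent shown on the slide."""
    return power ** exponent

for p in (1e-12, 1e-6, 1.0, 1e6):
    print(p,
          round(log_nonlinearity(p), 2),
          round(power_law_nonlinearity(p, 1.0 / 3.0), 4),
          round(power_law_nonlinearity(p), 4))
```

As the input power approaches zero the logarithm heads toward minus infinity, so small changes in low-level noise produce large output swings, whereas the power-law output stays bounded and approaches zero.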

55 Frequency resolution
- Examined several types of frequency resolution
  - MFCC triangular filters
  - Gammatone filter shapes
  - Truncated Gammatone filter shapes
- Most results do not depend greatly on filter shape
- Some sort of frequency integration is helpful when frequency-based selection algorithms are used
[Plot: H(e^{jω}) versus frequency (Hz)]
Slide 57

56 Analysis window duration (Kim)
- Typical analysis window duration for speech recognition is ~25-35 ms
- Optimal analysis window duration for estimation of environmental parameters is ~ ms
- Best systems measure environmental parameters (including voice activity detection) over a longer time interval but apply results to a short-duration analysis frame
Slide 58
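Measuring over a longer interval while still working frame by frame can be sketched as a running average of short-time power across neighboring frames. The function name, the default `half_width`, and the shrinking window at the edges are assumptions for illustration; the slide does not give these details.

```python
def medium_time_power(power, half_width=2):
    """Average each channel's power over up to 2*half_width + 1 frames.

    power[m][l] is the short-time power in frame m, channel l. Near the
    start and end the window simply shrinks (an assumed edge treatment).
    """
    n_frames = len(power)
    smoothed = []
    for m in range(n_frames):
        lo = max(0, m - half_width)
        hi = min(n_frames, m + half_width + 1)
        n_channels = len(power[m])
        smoothed.append([sum(power[k][l] for k in range(lo, hi)) / (hi - lo)
                         for l in range(n_channels)])
    return smoothed

# Usage: one channel, five frames.
print(medium_time_power([[1.0], [2.0], [3.0], [4.0], [5.0]], half_width=1))
```

Environmental parameters such as the noise floor would be estimated from the smoothed values, and the resulting correction would then be applied to the original short-duration frame `power[m]`.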

57 Temporal speech properties: modulation filtering
- Output of speech and noise segments from 14th Mel filter (1050 Hz): [figure]
- Speech segment exhibits greater fluctuations
Slide 59

58 Nonlinear noise processing
- Use nonlinear cepstral highpass filtering to pass speech but not noise
- Why nonlinear?
  - Need to keep results positive because we are dealing with manipulations of signal power
Slide 60

59 Asymmetric lowpass filtering (Kim, 2010)
- Overview of processing:
  - Assume that noise components vary slowly compared to speech components
  - Obtain a running estimate of noise level in each channel using nonlinear processing
  - Subtract estimated noise level from speech
- An example: [figure]
Note: Asymmetric highpass filtering is obtained by subtracting the lowpass filter output from the input
Slide 61

60 Implementing asymmetric lowpass filtering
Basic equation: [equation not recoverable from the extraction]
Dependence on parameter values:
[Three plots of power (dB) versus time (s) for street noise at 5 dB, each showing Q_in[m, l] and Q_out[m, l], for (λ_a = 0.9, λ_b = 0.9), (λ_a = 0.5, λ_b = 0.95), and (λ_a = 0.999, λ_b = 0.5)]
Slide 62
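The basic equation did not survive extraction. The sketch below assumes a first-order recursion with two forgetting factors, which is consistent with the Q_in, Q_out, λ_a, and λ_b labels on the plots: λ_a is applied when the input is at or above the previous output, and λ_b otherwise. The initialization to the first input value, the half-wave rectification in the highpass step, and the function names are also assumptions.

```python
def asymmetric_lowpass(q_in, lam_a=0.999, lam_b=0.5):
    """Running level estimate for one channel (assumed form of the filter).

    With lam_a close to 1 and lam_b small, the output rises slowly and
    falls quickly, so it follows the lower envelope of the input power.
    """
    q_out = []
    prev = q_in[0]  # assumed initialization
    for q in q_in:
        lam = lam_a if q >= prev else lam_b
        prev = lam * prev + (1.0 - lam) * q
        q_out.append(prev)
    return q_out

def asymmetric_highpass(q_in, lam_a=0.999, lam_b=0.5):
    """Input minus the lowpass output, rectified so that power stays >= 0."""
    floor = asymmetric_lowpass(q_in, lam_a, lam_b)
    return [max(q - f, 0.0) for q, f in zip(q_in, floor)]

# Usage: a steady noise floor of 1.0 with a short burst of 10.0.
q = [1.0] * 50 + [10.0] * 5 + [1.0] * 50
floor = asymmetric_lowpass(q)
speech_like = asymmetric_highpass(q)
```

In this example the floor estimate barely moves during the burst, so subtracting it leaves the burst nearly intact while removing the steady component, and the rectification keeps the result non-negative as required when manipulating signal power.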

61 Computational complexity of front ends
[Bar chart: multiplications and divisions per frame for MFCC, PLP, PNCC, and truncated PNCC]
Slide 64

62 Performance of PNCC in white noise (RM)
[Plot not recoverable from the extraction]
Slide 65

63 Performance of PNCC in white noise (WSJ)
[Plot not recoverable from the extraction]
Slide 66

64 Performance of PNCC in background music
[Plot not recoverable from the extraction]
Slide 67

65 Performance of PNCC in reverberation
[Plot not recoverable from the extraction]
Slide 68

66 Contributions of PNCC components: white noise (WSJ)
[Plot with curves labeled: Baseline MFCC + CMN; + Medium-duration processing; + Noise suppression; + Temporal masking]
Slide 69

67 Contributions of PNCC components: background music (WSJ)
[Plot with curves labeled: Baseline MFCC + CMN; + Medium-duration processing; + Noise suppression; + Temporal masking]
Slide 70

68 Contributions of PNCC components: reverberation (WSJ)
[Plot with curves labeled: Baseline MFCC + CMN; + Medium-duration processing; + Noise suppression; + Temporal masking]
Slide 71

69 Effects of onset enhancement/temporal masking (SSF processing, Kim '10)
[Plots not recoverable from the extraction]
Slide 72

70 PNCC and Slide 73

71 Summary: so what matters?
- Knowledge of the auditory system can certainly improve ASR accuracy:
  - Use of synchrony
  - Consideration of rate-intensity function
  - Onset enhancement
  - Nonlinear modulation filtering
  - Selective reconstruction
  - Consideration of processes mediating auditory scene analysis
Slide 74

72 Summary: PNCC processing
- PNCC processing includes
  - More effective nonlinearity
  - Parameter estimation for noise compensation and analysis based on longer analysis time and frequency spread
  - Efficient noise compensation based on modulation filtering
  - Onset enhancement
  - Computationally-efficient implementation
- Not considered yet
  - Synchrony representation
  - Lateral suppression
Slide 75
