ROBUST SPEECH RECOGNITION. Richard Stern


1 ROBUST SPEECH RECOGNITION Richard Stern, Robust Speech Recognition Group, Carnegie Mellon University. Telephone: (412) Fax: (412) Short Course at Universidad Carlos III, July 12-15, 2005.

2 Outline of discussion Summary of the state of the art in speech technology at Carnegie Mellon and elsewhere; review of speech production and cepstral analysis; introduction to robust speech recognition (classical techniques); robust speech recognition using missing-feature techniques; speech recognition using complementary feature sets; signal processing and signal separation based on human auditory perception; use of multiple microphones for improved recognition accuracy.

3 Introduction Conventional signal processing schemes for speech recognition are motivated more by knowledge of speech production than by knowledge of speech perception. Nevertheless, the auditory system does a lot of interesting things! In this talk I will: talk a bit about some basic findings in auditory physiology and perception; talk a bit about how knowledge of perception is starting to influence how we design signal processing for speech recognition; and talk about how we can apply auditory principles to separate signals that are presented simultaneously.

4 Basic auditory anatomy: structures involved in auditory processing.

5 Excitation along the basilar membrane Some of von Békésy's (1960) measurements of motion along the basilar membrane. Comment: different locations are most sensitive to different frequencies.

6 Transient response of auditory-nerve fibers Histograms of responses to tone bursts (Kiang et al., 1965). Comment: onsets and offsets produce overshoot.

7 Frequency response of auditory-nerve fibers: tuning curves Threshold level for auditory-nerve response to tones. Note the dependence of bandwidth on center frequency and the asymmetry of the response.

8 Typical response of auditory-nerve fibers as a function of stimulus level Typical response of auditory-nerve fibers to tones as a function of intensity. Comment: saturation and limited dynamic range.

9 Synchronized auditory-nerve response to low-frequency tones Comment: response remains synchronized over a wide range of intensities.

10 Comments on synchronized auditory response Nerve fibers synchronize to fine structure at low frequencies, signal envelopes at high frequencies. Synchrony is clearly important for auditory localization. Synchrony is now believed important for monaural processing of complex signals as well.

11 Lateral suppression in auditory processing Auditory-nerve response to pairs of tones. Comment: lateral suppression enhances local contrast in frequency.

12 Auditory masking patterns Masking produced by narrowband noise at 410 Hz. Comment: asymmetries in auditory-nerve patterns are preserved.

13 Auditory frequency selectivity: critical bands Measurements of psychophysical filter bandwidth by various methods. Comments: bandwidth increases with center frequency; the solid curve is the Equivalent Rectangular Bandwidth (ERB).

14 Three perceptual auditory frequency scales Bark scale: Bark(f) = 0.01 f for 0 ≤ f < 500; 0.007 f + 1.5 for 500 ≤ f < 1220; 6 ln(f) - 32.6 for f ≥ 1220. Mel scale: Mel(f) = 2595 log10(1 + f/700). ERB scale: ERB(f) = 24.7 (4.37 f + 1), with f in kHz.
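To make the reconstructed formulas concrete, here is a minimal NumPy sketch (function names are my own; the Bark branch coefficients follow the piecewise approximation quoted above, and f is in Hz throughout, with the ERB formula converted from kHz):

```python
import numpy as np

def bark(f):
    """Piecewise Bark-scale approximation from the slide (f in Hz, f > 0)."""
    f = np.asarray(f, dtype=float)
    return np.where(f < 500.0, 0.01 * f,
           np.where(f < 1220.0, 0.007 * f + 1.5,
                    6.0 * np.log(f) - 32.6))

def mel(f):
    """Standard Mel scale (f in Hz)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def erb(f):
    """Equivalent rectangular bandwidth in Hz (the slide's formula takes f in kHz)."""
    return 24.7 * (4.37 * np.asarray(f, dtype=float) / 1000.0 + 1.0)

for freq in (100.0, 500.0, 1220.0, 4000.0):
    print(freq, float(bark(freq)), float(mel(freq)), float(erb(freq)))
```

The piecewise Bark branches are continuous at 500 Hz and 1220 Hz, which is a useful sanity check on the reconstruction.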

15 Comparison of normalized perceptual frequency scales [Plot: Bark scale (in blue), Mel scale (in red), and ERB scale (in green), shown as perceptual scale value versus frequency in Hz.]

16 Forward and backward masking Masking can have an effect even if the target and masker are not presented simultaneously. Forward masking: the masker precedes the target. Backward masking: the target precedes the masker. [Audio examples: introduction, backward masking, forward masking.]

17 The loudness of sounds Equal loudness contours (Fletcher-Munson curves).

18 Summary of basic auditory physiology and perception Major physiological attributes: frequency analysis in parallel channels; preservation of temporal fine structure; limited dynamic range in individual channels; enhancement of temporal contrast (at onsets and offsets); enhancement of spectral contrast (at adjacent frequencies). Most major physiological attributes have psychophysical correlates. Most physiological and psychophysical effects are not preserved in conventional representations for speech recognition.

19 Conventional ASR signal processing: MFCCs Segment the incoming waveform into frames; compute the frequency response of each frame using the DFT; multiply the magnitude of the frequency response by triangular weighting functions to produce channels; compute the log of the weighted magnitudes for each channel; take the inverse discrete cosine transform (DCT) of the log weighted magnitudes across channels, producing ~14 cepstral coefficients for each frame; calculate delta and double-delta coefficients.
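A minimal NumPy sketch of that pipeline (no pre-emphasis or deltas; the frame sizes and filterbank details are illustrative choices, not the exact CMU front end):

```python
import numpy as np

def mfcc(x, fs=16000, frame_len=400, hop=160, n_fft=512, n_mels=40, n_ceps=14):
    """Minimal sketch of the MFCC steps described above."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Triangular weighting functions spaced evenly on the Mel scale
    edges = imel(np.linspace(0.0, mel(fs / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # DCT matrix: cepstra are the low-order DCT of the log Mel energies
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    win = np.hamming(frame_len)
    ceps = []
    for start in range(0, len(x) - frame_len + 1, hop):
        spec = np.abs(np.fft.rfft(x[start:start + frame_len] * win, n_fft))
        ceps.append(dct @ np.log(fb @ spec + 1e-10))
    return np.array(ceps)

print(mfcc(np.random.randn(16000)).shape)   # (frames, 14)
```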

20 COMPARING SPECTRAL REPRESENTATIONS [Panels: original speech; Mel log magnitudes; after cepstral processing.]

21 Comments on the MFCC representation It's very blurry compared to a wideband spectrogram! Aspects of auditory processing represented: frequency selectivity and spectral bandwidth (but using a constant analysis window duration!)» wavelet schemes exploit time-frequency resolution better; nonlinear amplitude response. Aspects of auditory processing NOT represented: detailed timing structure; lateral suppression; enhancement of temporal contrast; other auditory nonlinearities.

22 Speech representation using mean rate Representation of vowels by Young and Sachs using mean rate. The mean rate representation does not preserve spectral information.

23 Speech representation using average localized synchrony measure Representation of vowels by Young and Sachs using ALSR.

24 The importance of timing information Re-analysis of the Young-Sachs data by Searle. Temporal processing captures dominant formants in a spectral region.

25 Paths to the realization of temporal fine structure in speech Correlograms (Slaney and Lyon); computations based on interval processing; Seneff's Generalized Synchrony Detector (GSD) model; Ghitza's Ensemble Interval Histogram (EIH) model; D.-S. Kim's Zero Crossing Peak Analysis (ZCPA) model.

26 A typical auditory model of the 1980s: The Seneff model

27 Recognition accuracy using the Seneff model (Ohshima, 1994) Comment: CDCN performs just as well as the Seneff model.

28 Computational complexity of the Seneff model Number of multiplications per ms of speech. Comment: auditory computation is extremely expensive.

29 Sequence of vowels

30 Vowels processed using energy only

31 Vowel sounds using autocorrelation expansion

32 Comments on peripheral timing information Use of timing enables us to develop a rich display of frequencies, even with a limited number of analysis channels. Nevertheless, this really gives us no new information unless the nonlinearities do something interesting. Processing based on timing information (zero crossings, etc.) is likely to give us a more radically different display of information.

33 Summary - auditory physiology and perception Major physiological attributes: frequency analysis in parallel channels; preservation of temporal fine structure; limited dynamic range in individual channels; enhancement of temporal contrast (at onsets and offsets); enhancement of spectral contrast (at adjacent frequencies). Most major physiological attributes have psychophysical correlates. We are trying to capture important attributes of the representation for recognition, and we believe that this may help performance in noise and competing signals.

34 Summary - exploiting peripheral auditory processing Traditional cepstral representations are motivated by frequency resolution in auditory processing, BUT there is much more information in the signal that is not included, and timing seems to be important. Physiologically-motivated representations are generally computationally expensive. While physiologically-motivated representations have not improved speech recognition accuracy so far, we remain optimistic.

35 Auditory scene analysis for automatic speech recognition Auditory scene analysis refers to the methods that humans use to perceive separately sounds that are presented simultaneously. Al Bregman and his colleagues have identified many potential cues in auditory stream separation that can be applied directly to the speech recognition problem. Some examples: separation by pitch/fundamental frequency; separation by source location; separation by micromodulation/common fate. Computational auditory scene analysis (CASA) refers to efforts to mimic these processes computationally.

36 Tracking speech sounds via fundamental frequency Given good pitch estimates: how well can we separate signals from noise? How much will this separation help in speech recognition? To what extent can pitch be used to separate speech signals from one another?

37 The CMU ARCTIC database Collected by John Kominek and Alan Black as a resource for speech synthesis. Contains phonetically-balanced recordings with simultaneously-recorded EGG (laryngograph) measurements. Available at

38 The CMU ARCTIC database [Waveform plots: original speech and the corresponding laryngograph recording.]

39 Typical pitch estimates obtained from ARCTIC Comment: not all outliers were successfully removed.

40 Isolating speech by pitch Method 1: estimate the amplitudes of the partials by synchronous heterodyne analysis, then resynthesize as sums of sines or cosines; unvoiced segments are problematical. Method 2: pass speech through a comb filter that tracks the harmonic frequencies; unvoiced segments are still problematical.
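A minimal sketch of Method 1, under the simplifying assumption of a constant pitch (a real system would follow a frame-by-frame pitch track; all names here are mine):

```python
import numpy as np

def heterodyne_partials(x, fs, f0, n_harmonics=10, win_ms=25):
    """Estimate harmonic amplitudes by synchronous heterodyning, then resynthesize.
    Assumes constant pitch f0 over the excerpt."""
    n = np.arange(len(x))
    win = int(fs * win_ms / 1000)
    kernel = np.ones(win) / win                        # moving-average lowpass
    y = np.zeros(len(x))
    for k in range(1, n_harmonics + 1):
        osc = np.exp(-2j * np.pi * k * f0 * n / fs)    # shift k-th harmonic to DC
        a = np.convolve(x * osc, kernel, mode="same")  # smoothed complex amplitude
        y += 2.0 * np.real(a * np.conj(osc))           # remodulate and accumulate
    return y

fs = 16000
t = np.arange(fs) / fs
x = sum(np.cos(2 * np.pi * 120 * k * t) / k for k in range(1, 6))
print(np.corrcoef(x, heterodyne_partials(x, fs, 120.0))[0, 1])  # close to 1
```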

41 An example of pitch-based approaches: synchronized heterodyne analysis Extract instantaneous pitch, extract amplitudes at harmonics, resynthesize. [Audio examples: original speech samples and reconstructed speech.]

42 Recovering speech through comb filtering Pitch-tracking comb filter: H(z) = z^(-P) / (1 - g z^(-P)). [Plot: its frequency response. Audio examples: original speech samples and reconstructed speech.]
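A sketch of Method 2 using the comb filter as reconstructed above, with a fixed rather than pitch-tracking period P (SciPy is assumed for the filtering; a real system would update P frame by frame):

```python
import numpy as np
from scipy.signal import lfilter

def comb_filter(x, fs, f0, g=0.9):
    """Feedback comb filter H(z) = z^-P / (1 - g z^-P): peaks at multiples of f0."""
    P = int(round(fs / f0))                      # pitch period in samples
    b = np.zeros(P + 1); b[P] = 1.0              # numerator z^-P
    a = np.zeros(P + 1); a[0] = 1.0; a[P] = -g   # denominator 1 - g z^-P
    return lfilter(b, a, x)

fs = 16000
t = np.arange(fs) / fs
voiced = np.cos(2 * np.pi * 200 * t)             # harmonic stand-in for speech
noisy = voiced + 0.5 * np.random.randn(fs)
y = comb_filter(noisy, fs, 200.0)
# The output is delayed by one period, but the signal is periodic, so the
# correlation with the clean harmonic stays high:
print(np.corrcoef(voiced, y)[0, 1])
```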

43 Separating speech signals by heterodyning and comb filtering [Audio examples: combined speech signals; speech separated by heterodyne filters; speech separated by comb filters.] Comment: men mask women more because upper male harmonics are more likely to impinge on lower female harmonics.

44 Speech recognition in noise based on pitch tracking (Vicente y Peña) [Plot: accuracy vs. SNR (dB) for SHA with oracle pitch, SHA, and the baseline.] Initial results could improve as the techniques mature.

45 Speech separation by source location Sources arriving from different azimuths produce interaural time delays (ITDs) and interaural intensity differences (IIDs) as they arrive at the two ears. So far this information has been used for better masks for missing-feature recognition and to combat reverberation (e.g. Brown, Wang et al.), and for direct separation from the interaural representation.
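As a toy illustration of the ITD cue, here is a broadband cross-correlation sketch (a binaural model would do this per frequency channel after auditory filtering; the function name and test setup are mine):

```python
import numpy as np

def itd_samples(left, right, max_lag=32):
    """Estimate the interaural time delay (in samples) from the cross-correlation peak."""
    lags = np.arange(-max_lag, max_lag + 1)
    xc = [np.dot(left[max_lag:-max_lag], np.roll(right, -k)[max_lag:-max_lag])
          for k in lags]
    return lags[int(np.argmax(xc))]

s = np.random.randn(4000)
left, right = s, np.roll(s, 8)      # source arrives 8 samples later at the right ear
print(itd_samples(left, right))     # -> 8
```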

46 The classical model of binaural processing (Colburn and Durlach 1978)

47 Array processing based on human binaural hearing

48 Correlation-based system motivated by binaural hearing

49 Vowel representations are improved by correlation processing [Panels: reconstructed features of the vowel /a/ for two inputs with zero delay, two inputs with 120-µs delay, and eight inputs with 120-µs delay.] Recognition results in 1993 showed some (small) improvement in WER at great computational cost.

50 So what do things sound like on the cross-correlation display? Signals combined with ITDs of 0 and 0.5 ms. [Audio examples: individual speech signals; combined speech signals; signals separated by delay-and-sum beamforming; signals separated by the cross-correlation display; signals separated by additional correlations across frequency at a common ITD ("straightness" weighting).]

51 Signal separation using micro-modulation Micromodulation of amplitude and frequency may be helpful in separating unvoiced segments of sound sources. These physical cues have been supported by many psychoacoustical studies in recent years.

52 John Chowning's demonstration of the effects of micro-modulation in frequency [Plot: frequency versus time, reconstructed from the description in Bregman's book.]

53 Separating by frequency modulation only Extract the instantaneous frequencies of the filterbank outputs; cross-correlate frequencies across channels (this finds comodulated harmonics); cluster the correlated harmonics and resynthesize. [Audio examples: isolated speech; combined speech; speech separated by frequency modulation.] Comment: success will depend on the ability to track frequency components across analysis bands.
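A sketch of the first two steps: estimate each channel's instantaneous frequency from the analytic signal, then check comodulation by correlating the frequency tracks (the filterbank details, bandwidths, and names are illustrative assumptions):

```python
import numpy as np
from scipy.signal import hilbert, firwin, lfilter

def channel_inst_freq(x, fs, center, bw=200.0):
    """Instantaneous frequency of one filterbank channel via the analytic signal."""
    lo = max(center - bw / 2, 1.0) / (fs / 2)
    hi = min(center + bw / 2, fs / 2 - 1.0) / (fs / 2)
    band = lfilter(firwin(257, [lo, hi], pass_zero=False), 1.0, x)
    phase = np.unwrap(np.angle(hilbert(band)))
    return np.diff(phase) * fs / (2 * np.pi)

fs = 16000
t = np.arange(fs) / fs
ph = 2 * np.pi * 300 * t + 3.0 * np.sin(2 * np.pi * 5 * t)  # common 5-Hz vibrato
x = np.cos(ph) + np.cos(2 * ph)                  # two comodulated harmonics
f1 = channel_inst_freq(x, fs, 300.0)
f2 = channel_inst_freq(x, fs, 600.0)
print(np.corrcoef(f1[1000:-1000], f2[1000:-1000])[0, 1])  # comodulated -> near 1
```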

54 Summary: signal separation and recognition motivated by auditory processing Computational advances may now enable practical feature extraction based on auditory physiology and perception. Computational auditory scene analysis shows promise for signal-separation problems, but the field is presently in its infancy.

55 Outline of discussion Summary of the state of the art in speech technology at Carnegie Mellon and elsewhere; review of speech production and cepstral analysis; introduction to robust speech recognition (classical techniques); robust speech recognition using missing-feature techniques; speech recognition using complementary feature sets; signal processing and signal separation based on human auditory perception; use of multiple microphones for improved recognition accuracy.

56 Introduction The use of arrays of microphones can improve speech recognition accuracy in noise. Outline of this talk: review classical approaches to microphone array processing» delay-and-sum beamforming» traditional adaptive filtering» physiologically-motivated processing. Describe and discuss selected recent results» matched-filter array processing (Rutgers)» array processing based on speech features (CMU)

57 OVERVIEW OF SPEECH RECOGNITION [Diagram: waveform → speech features → phoneme hypotheses.] Major functional components: signal processing to extract features from speech waveforms; comparison of features to pre-stored templates. Important design choices: choice of features; feature extraction; specific method of comparing features to stored templates; decision-making procedure.

58 Why use microphone arrays? Microphone arrays can provide directional response, accepting speech from some directions but suppressing others.

59 Another reason for microphone arrays Microphone arrays can focus attention on the direct field in a reverberant environment.

60 Three classical types of microphone arrays Delay-and-sum beamforming and its many variants; adaptive arrays based on mean-square suppression of noise; physiologically and perceptually motivated approaches to multiple microphones.

61 Delay-and-sum beamforming Simple processing based on equalizing the delays to the sensors; high directivity can be achieved with many sensors. [Diagram: sensor k passes through a delay z^(-n_k) before the K outputs are summed.]
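A minimal time-domain sketch of delay-and-sum with integer-sample delays (the geometry and names are toy assumptions; real systems use fractional delays):

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Delay each sensor signal by an integer number of samples, then average."""
    n = len(channels[0])
    out = np.zeros(n)
    for x, d in zip(channels, delays):
        out[d:] += x[:n - d] if d > 0 else x
    return out / len(channels)

# Toy example: a source arriving 0/2/4 samples late at three sensors.
fs, f = 16000, 500
t = np.arange(1024) / fs
src = np.sin(2 * np.pi * f * t)
mics = [np.roll(src, k) + 0.3 * np.random.randn(len(t)) for k in (0, 2, 4)]
steered = delay_and_sum(mics, [4, 2, 0])   # align all arrivals before summing
# Residual is mostly the averaged noise (~0.3/sqrt(3)), plus small edge effects:
print(np.std(steered - np.roll(src, 4)))
```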

62 The physics of delay-and-sum beamforming [Diagram: adjacent sensors spaced d apart; a wave arriving from angle θ travels an extra distance d sin(θ) between sensors.]

63 The physics of delay-and-sum beamforming If the sensor outputs are added together directly, the look direction is θ = 0. The look direction can be steered to other directions by inserting electronic delays to compensate for the physical ones. For a look direction of θ = 0, the net array response is A(θ) = sin(N ω d sin(θ) / 2c) / sin(ω d sin(θ) / 2c) = sin(N π d sin(θ) / λ) / sin(π d sin(θ) / λ).
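A quick numerical check of the array response as reconstructed above, using the d = 8.62 cm, N = 9 geometry from the following slides (a sketch; the function name and normalization are mine):

```python
import numpy as np

def array_factor(theta, N=9, d=0.0862, f=1000.0, c=343.0):
    """|A(theta)| from the slide formula, normalized so the broadside peak is 1."""
    x = np.pi * d * f * np.sin(theta) / c          # = pi d sin(theta) / lambda
    num, den = np.sin(N * x), np.sin(x)
    safe = np.where(np.abs(den) < 1e-9, 1.0, den)  # avoid 0/0 at broadside
    return np.abs(np.where(np.abs(den) < 1e-9, float(N), num / safe)) / N

theta = np.radians(np.linspace(-90, 90, 9))
print(np.round(array_factor(theta), 3))            # main lobe at theta = 0
print(np.round(array_factor(theta, f=3000.0), 3))  # the beam narrows as f rises
```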

64 Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 500 Hz

65 Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 1000 Hz

66 Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 1500 Hz

67 Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 2000 Hz

68 Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 2500 Hz

69 Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 3000 Hz

70 Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 3500 Hz

71 Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 4000 Hz

72 Nested microphone arrays (Flanagan et al.): 5-element low-frequency array

73 Nested microphone arrays: 5-element mid-frequency array

74 Nested microphone arrays: 5-element high-frequency array

75 Combined nested array (Flanagan et al.): three-band quasi-constant-beamwidth array. [Diagram: lowpass, bandpass, and highpass filters feed the three sub-arrays.]

76 Another delay-and-sum issue: spatial aliasing. d = 8.62 cm, N = 9, f = 4000 Hz

77 Another delay-and-sum issue: spatial aliasing. d = 8.62 cm, N = 9, f = 5000 Hz

78 Another delay-and-sum issue: spatial aliasing. d = 8.62 cm, N = 9, f = 6000 Hz

79 Preventing spatial aliasing Spatial aliasing occurs when adjacent sensors receive the input more than half a period apart. The spatial Nyquist constraint depends on both frequency and arrival angle: to prevent spatial aliasing we require that the frequency satisfy ν < c / (2 d sin(θ)).
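The constraint is easy to check numerically; this little sketch uses the 8.62-cm spacing from the earlier beam-pattern examples (c = 343 m/s assumed; the function name is mine):

```python
import numpy as np

def max_unaliased_freq(d, theta_deg, c=343.0):
    """Highest frequency free of spatial aliasing, nu < c / (2 d sin(theta)),
    for sensor spacing d (m) and arrival angle theta."""
    return c / (2.0 * d * np.sin(np.radians(theta_deg)))

print(max_unaliased_freq(0.0862, 90.0))   # ~1990 Hz at endfire (worst case)
print(max_unaliased_freq(0.0862, 30.0))   # ~3980 Hz at 30 degrees
```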

80 Filter-and-sum beamforming Input filters can (in principle) apply delays that vary with frequency to ameliorate the frequency dependence of beamforming; the filters can also compensate for channel characteristics. [Diagram: each sensor passes through its own filter before the outputs are summed.]

81 Compensated delay-and-sum beamforming A filter is added to compensate for the filtering effects of delay-and-sum beamforming. [Diagram: the delayed sensor outputs are summed and passed through a common compensation filter.]

82 Compensated delay-and-sum beamforming: some implementations Sullivan and Stern: CDCN compensation. Rutgers group: compensation function derived using a neural network. Silverman group. Omologo group.

83 Sample recognition results using compensated delay-and-sum The Flanagan array with CDCN does improve accuracy.

84 Traditional adaptive arrays

85 Traditional adaptive arrays Large established literature. Use MMSE techniques to establish beams in the look direction and nulls toward additive noise sources. Generally do not perform well in reverberant environments: signal cancellation; effective impulse response longer than the length of the filter. Techniques to circumvent signal cancellation: switching the nulling mechanism off and on according to the presence or absence of speech (Van Compernolle); use of alternate adaptation algorithms.

86 Array processing based on human binaural hearing Motivation: the human binaural system is known to have excellent immunity to additive noise and reverberation. Binaural phenomena of interest: the cocktail party effect; the precedence effect. Problems with binaural models: correlation produces signal distortion from rectification and squaring; precedence-effect processing defeats echoes but also suppresses desired signals. Greatest challenge: decoding useful information from the cross-correlation display.

87 Correlation-based system motivated by binaural hearing

88 Matched-filter beamforming (Rutgers) Goal: compensation for the delay and dispersion introduced in reverberant environments. [Plots: a 600-ms room response and its autocorrelation function.]

89 Matched-filter beamforming procedure 1. Measure or estimate the sample response from the source to each sensor. 2. Convolve each input with the time-reversed sample response (producing the autocorrelation function). 3. Sum the channel outputs together. Rationale: the main lobes of the autocorrelation functions should reinforce while the side lobes cancel.
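A toy sketch of the three steps, with short synthetic decaying "room" responses standing in for measured ones (all names and the test geometry are mine):

```python
import numpy as np

def matched_filter_beamform(mics, impulse_responses):
    """Convolve each channel with its time-reversed room response, then sum,
    so the autocorrelation main lobes reinforce across channels."""
    out = None
    for x, h in zip(mics, impulse_responses):
        y = np.convolve(x, h[::-1])        # time-reversed response = matched filter
        out = y if out is None else out + y
    return out

rng = np.random.default_rng(0)
s = rng.standard_normal(2000)
hs = [rng.standard_normal(64) * np.exp(-np.arange(64) / 16.0) for _ in range(2)]
mics = [np.convolve(s, h) for h in hs]
y = matched_filter_beamform(mics, hs)
lag = np.argmax([np.dot(y[k:k + len(s)], s) for k in range(100)])
print(lag)   # peak alignment near len(h) - 1 = 63
```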

90 Optimizing microphone arrays for speech recognition features The typical objective of microphone array algorithms has been signal enhancement rather than speech recognition. [Diagram: microphones 1 through N feed an array processor that produces an enhanced estimate ŝ[n] of the speech signal s[n].]

91 Automatic Speech Recognition (ASR) Parameterize the speech signal and compare the parameter sequence to statistical models of speech sound units to hypothesize what a user said. [Diagram: s[n] → feature extraction → observations {O_1, ..., O_N} → ASR with acoustic model P(O|W) and language model P(W) → hypotheses {Ŵ_1, ..., Ŵ_M}.] By Bayes' rule, P(W|O) = P(O|W) P(W) / P(O). The objective is accurate recognition, a statistical pattern classification problem.

92 Speech recognition with microphone arrays Recognition with microphone arrays has traditionally been performed by gluing the two systems together. The systems have different objectives, and each system does not exploit information present in the other. [Diagram: microphones 1 through 4 feed an array processor, then feature extraction, then the ASR system.]

93 Array processing based on speech features Develop an array processing scheme targeted at improved speech recognition performance, without regard to conventional array processing objective criteria. [Diagram: the same chain of array processing, feature extraction, and ASR, now treated as a single system.]

94 Multi-microphone compensation for speech recognition based on cepstral distortion [Diagram: array processing → front end → ASR.] Multi-mic compensation based on optimizing speech features rather than signal distortion. [Audio examples: speech in room; delay-and-sum; optimal compensation.]

95 Choosing array weights based on speech features We want an objective function that uses parameters directly related to recognition. [Diagram: each microphone signal x_m is delayed by τ_m and filtered by h_m; the filtered signals are summed and passed through feature extraction to give M_y, which is compared against the clean-speech features M_s; the error ε is minimized.]

96 An objective function for mic arrays based on speech recognition Define Q as the sum of squared errors between the log Mel spectra of the clean speech s and the noisy speech y: Q = Σ_f Σ_l (M_y[f,l] - M_s[f,l])^2, where y is the output of a filter-and-sum microphone array and M[f,l] is the l-th log Mel spectral value in frame f. M_y[f,l] is a function of the signals captured by the array and the filter parameters associated with each microphone.
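Expressed directly in code, the objective is just a sum of squared feature differences; here is a trivial sketch with made-up feature matrices (names are mine):

```python
import numpy as np

def q_objective(M_y, M_s):
    """Sum of squared differences between log Mel spectra of the array output
    (M_y) and the clean speech (M_s); both are (frames x channels) arrays."""
    return float(np.sum((M_y - M_s) ** 2))

# Toy usage with made-up 100-frame, 40-channel log Mel spectra:
M_s = np.random.randn(100, 40)
M_y = M_s + 0.1 * np.random.randn(100, 40)
print(q_objective(M_y, M_s))
```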

97 Calibration of microphone arrays for ASR Calibration of a filter-and-sum microphone array: the user speaks an utterance with known transcription» with or without a close-talking microphone. Derive the optimal set of filters» minimize the objective function with respect to the filter coefficients» since the objective function is non-linear, use iterative gradient-based methods. Apply to all future speech.
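A heavily simplified sketch of that calibration loop: a generic optimizer adjusts per-channel FIR taps to minimize Q against target features. The real system uses the full log Mel front end and analytic gradients; the crude band-energy features, SciPy optimizer, and all names here are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def log_mel_like(x, n_fft=256, n_bands=8):
    """Crude stand-in for log Mel spectra: log band energies of one frame."""
    spec = np.abs(np.fft.rfft(x[:n_fft] * np.hamming(n_fft))) ** 2
    bands = np.array_split(spec, n_bands)
    return np.log(np.array([b.sum() for b in bands]) + 1e-8)

def calibrate(mics, target, n_taps=16):
    """Find per-channel FIR filters minimizing Q against the target features."""
    K = len(mics)

    def q(h_flat):
        h = h_flat.reshape(K, n_taps)
        y = sum(np.convolve(m, h[k], mode="same") for k, m in enumerate(mics))
        return np.sum((log_mel_like(y) - target) ** 2)

    h0 = np.zeros(K * n_taps)
    h0[::n_taps] = 1.0 / K                   # start from plain averaging
    return minimize(q, h0, method="BFGS").x.reshape(K, n_taps)

rng = np.random.default_rng(1)
clean = rng.standard_normal(512)
mics = [clean + 0.3 * rng.standard_normal(512) for _ in range(2)]
h = calibrate(mics, log_mel_like(clean))
print(h.shape)   # (2, 16): one 16-tap filter per microphone
```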

98 Calibration using a close-talking recording Given a close-talking mic recording for the calibration utterance, derive an optimal filter for each channel to improve recognition. [Diagram: the close-talking recording s[n] yields target features M_s; the delayed, filtered, and summed array channels yield M_y; the optimizer adjusts the filters h_1(n), ..., h_M(n).]

99 Multi-microphone data sets TMS: recorded in the CMU auditory lab» approx. 5m x 5m x 3m» noise from computer fans, blowers, etc. Isolated letters and digits, keywords. 10 speakers * 14 utterances = 140 utterances. Each utterance has a close-talking mic control waveform. [Diagram: talker 1 m from the array, 7-cm sensor spacing.]

100 Multi-microphone data sets (2) WSJ + off-axis noise source: room simulation created using the image method» 5m x 4m x 3m» 200-ms reverberation time» WGN at 5-dB SNR. WSJ test set» 5K-word vocabulary» 10 speakers * 65 utterances = 650 utterances. Original recordings used as close-talking control waveforms. [Diagram: room geometry with source, noise source, and array; distances of 2 m, 1 m, 25 cm, and 15 cm.]

101 Results TMS data set, WSJ0 + WGN point-source simulation. Constructed 50-point filters from a single calibration utterance and applied the filters to all test utterances. [Bar charts: WER (%) on TMS and WSJ for the close-talking mic (CLSTK), a single microphone (1 MIC), delay-and-sum (D&S), and Mel-optimized filters (MEL OPT).]

102 Calibration without Close-talking Microphone Obtain an initial waveform estimate using a conventional array processing technique (e.g. delay and sum). Use the transcription and the recognizer to estimate the sequence of target clean log Mel spectra. Optimize the filter parameters as before.

103 Calibration w/o Close-talking Microphone (2) Force-align the delay-and-sum waveform to the known transcription to generate an estimated HMM state sequence {q̂_1, q̂_2, ..., q̂_N}. [Diagram: the delayed and summed microphone signals pass through feature extraction and forced alignment against the transcription ("BLAH BLAH ...") using the HMMs.]

104 Calibration w/o Close-talking Microphone (3) Extract the means from the single-Gaussian HMMs of the estimated state sequence. Since the models have been trained from clean speech, use these means as the target clean-speech feature vectors: {q̂_1, q̂_2, ..., q̂_N} → HMM means {µ_1, µ_2, ..., µ_N} → IDCT → M̂_s.

105 Calibration w/o Close-talking Microphone (4) Use the estimated clean-speech feature vectors M̂_s to optimize the filters as before. [Diagram: as in the close-talking case, but with M̂_s as the optimization target.]

106 Results TMS data set, WSJ0 + WGN point-source simulation. Constructed 50-point filters from the calibration utterance and applied the filters to all utterances. [Bar charts: WER (%) on TMS and WSJ for CLSTK, 1 MIC, D&S, MEL OPT with the close-talking target, and MEL OPT without the close-talking target.]

107 Results (2) WER vs. SNR for WSJ + WGN. Constructed 50-point filters from the calibration utterance using the transcription only, and applied the filters to all utterances. [Plot: WER vs. SNR (dB) for close-talking, optimized calibration, delay-and-sum, and single-microphone processing.]

108 The problem of reverberation High levels of reverberation are extremely detrimental to recognition accuracy. In reverberant environments the "noise" is actually reflected copies of the target speech signal, so noise-canceling strategies (like the LMS algorithm) that assume uncorrelated noise sources fail. Frame-based compensation strategies also fail because the effects of reverberation can be spread over several frames.

109 Baseline in highly reverberant rooms Comparison of single-channel and delay-and-sum beamforming (WSJ data passed through measured impulse responses). [Plot: WER vs. reverberation time (ms) for the single channel and delay-and-sum.]

110 Subband processing using optimized features (Seltzer) Subband processing can address some of the effects of reverberation; the subband signals have more desirable narrowband signal properties. 1. Divide into independent subbands 2. Downsample 3. Process subbands independently 4. Upsample 5. Resynthesize the full signal from the subbands. [Diagram: x[n] passes through analysis filters H_1(z)...H_L(z), downsampling, per-band processing F_1(z)...F_L(z), upsampling, and synthesis filters G_1(z)...G_L(z), whose outputs are summed to give y[n].]
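A minimal single-channel sketch of the five steps, with the per-band processing left as identity (the real system optimizes filters per band and per microphone, which this does not attempt; a proper design would use a perfect-reconstruction filterbank, so some band-edge leakage remains here):

```python
import numpy as np
from scipy.signal import firwin, lfilter

def subband_process(x, n_bands=4, n_taps=129, process=lambda s: s):
    """Analysis / downsample / process / upsample / synthesis for one signal."""
    edges = np.linspace(0.0, 1.0, n_bands + 1)       # normalized to Nyquist
    y = np.zeros(len(x))
    for b in range(n_bands):
        lo, hi = edges[b], edges[b + 1]
        if b == 0:
            h = firwin(n_taps, hi)                   # lowest band: lowpass
        elif b == n_bands - 1:
            h = firwin(n_taps, lo, pass_zero=False)  # highest band: highpass
        else:
            h = firwin(n_taps, [lo, hi], pass_zero=False)
        sb = lfilter(h, 1.0, x)                      # 1. analysis filter
        sb = sb[::n_bands]                           # 2. downsample (integer-band)
        sb = process(sb)                             # 3. per-band processing
        up = np.zeros(len(x)); up[::n_bands] = sb    # 4. upsample by zero insertion
        y += n_bands * lfilter(h, 1.0, up)           # 5. synthesis filter and sum
    return y

x = np.random.randn(4000)
y = subband_process(x)
# The output approximates the input delayed by the two linear-phase filter passes:
print(np.corrcoef(x[:-(129 - 1)], y[129 - 1:])[0, 1])
```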

111 Subband results with the reverberated WSJ task WER for all speakers, compared to delay-and-sum processing. [Table: WER as a function of reverberation time (ms) for delay-and-sum and subband LIMABEAM.]

112 Is Joint Filter Estimation Necessary? We compared 4 cases: delay-and-sum; optimizing one filter for the delay-and-sum output; optimizing the microphone array filters independently; optimizing the microphone array filters jointly. [Bar chart: WER (%) on WSJ + WGN at 10-dB SNR for delay-and-sum, delay-and-sum + 1 filter, independent optimization, and joint optimization.]

113 Summary Microphone array processing is effective, although not yet in widespread use. Despite many developments in signal processing, actual applications to speech are based on very simple concepts: delay-and-sum beamforming; constant-beamwidth arrays. Array processing based on the preservation of feature values can improve accuracy, even in reverberant environments. Major problems and issues: maintaining good performance in reverberation; real-time operation with time-varying environments and speakers.


Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech

Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech INTERSPEECH 5 Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech M. A. Tuğtekin Turan and Engin Erzin Multimedia, Vision and Graphics Laboratory,

More information

Complex Sounds. Reading: Yost Ch. 4

Complex Sounds. Reading: Yost Ch. 4 Complex Sounds Reading: Yost Ch. 4 Natural Sounds Most sounds in our everyday lives are not simple sinusoidal sounds, but are complex sounds, consisting of a sum of many sinusoids. The amplitude and frequency

More information

MULTIPLE F0 ESTIMATION IN THE TRANSFORM DOMAIN

MULTIPLE F0 ESTIMATION IN THE TRANSFORM DOMAIN 10th International Society for Music Information Retrieval Conference (ISMIR 2009 MULTIPLE F0 ESTIMATION IN THE TRANSFORM DOMAIN Christopher A. Santoro +* Corey I. Cheng *# + LSB Audio Tampa, FL 33610

More information

Training neural network acoustic models on (multichannel) waveforms

Training neural network acoustic models on (multichannel) waveforms View this talk on YouTube: https://youtu.be/si_8ea_ha8 Training neural network acoustic models on (multichannel) waveforms Ron Weiss in SANE 215 215-1-22 Joint work with Tara Sainath, Kevin Wilson, Andrew

More information

Imagine the cochlea unrolled

Imagine the cochlea unrolled 2 2 1 1 1 1 1 Cochlea & Auditory Nerve: obligatory stages of auditory processing Think of the auditory periphery as a processor of signals 2 2 1 1 1 1 1 Imagine the cochlea unrolled Basilar membrane motion

More information

Signals, Sound, and Sensation

Signals, Sound, and Sensation Signals, Sound, and Sensation William M. Hartmann Department of Physics and Astronomy Michigan State University East Lansing, Michigan Л1Р Contents Preface xv Chapter 1: Pure Tones 1 Mathematics of the

More information

Lateralisation of multiple sound sources by the auditory system

Lateralisation of multiple sound sources by the auditory system Modeling of Binaural Discrimination of multiple Sound Sources: A Contribution to the Development of a Cocktail-Party-Processor 4 H.SLATKY (Lehrstuhl für allgemeine Elektrotechnik und Akustik, Ruhr-Universität

More information

SGN Audio and Speech Processing

SGN Audio and Speech Processing SGN 14006 Audio and Speech Processing Introduction 1 Course goals Introduction 2! Learn basics of audio signal processing Basic operations and their underlying ideas and principles Give basic skills although

More information

Robust Algorithms For Speech Reconstruction On Mobile Devices

Robust Algorithms For Speech Reconstruction On Mobile Devices Robust Algorithms For Speech Reconstruction On Mobile Devices XU SHAO A Thesis presented for the degree of Doctor of Philosophy Speech Group School of Computing Sciences University of East Anglia England

More information

Voice Activity Detection

Voice Activity Detection Voice Activity Detection Speech Processing Tom Bäckström Aalto University October 2015 Introduction Voice activity detection (VAD) (or speech activity detection, or speech detection) refers to a class

More information

AUDL Final exam page 1/7 Please answer all of the following questions.

AUDL Final exam page 1/7 Please answer all of the following questions. AUDL 11 28 Final exam page 1/7 Please answer all of the following questions. 1) Consider 8 harmonics of a sawtooth wave which has a fundamental period of 1 ms and a fundamental component with a level of

More information

Improving Meetings with Microphone Array Algorithms. Ivan Tashev Microsoft Research

Improving Meetings with Microphone Array Algorithms. Ivan Tashev Microsoft Research Improving Meetings with Microphone Array Algorithms Ivan Tashev Microsoft Research Why microphone arrays? They ensure better sound quality: less noises and reverberation Provide speaker position using

More information

CS 188: Artificial Intelligence Spring Speech in an Hour

CS 188: Artificial Intelligence Spring Speech in an Hour CS 188: Artificial Intelligence Spring 2006 Lecture 19: Speech Recognition 3/23/2006 Dan Klein UC Berkeley Many slides from Dan Jurafsky Speech in an Hour Speech input is an acoustic wave form s p ee ch

More information

Project 0: Part 2 A second hands-on lab on Speech Processing Frequency-domain processing

Project 0: Part 2 A second hands-on lab on Speech Processing Frequency-domain processing Project : Part 2 A second hands-on lab on Speech Processing Frequency-domain processing February 24, 217 During this lab, you will have a first contact on frequency domain analysis of speech signals. You

More information

Acoustics, signals & systems for audiology. Week 4. Signals through Systems

Acoustics, signals & systems for audiology. Week 4. Signals through Systems Acoustics, signals & systems for audiology Week 4 Signals through Systems Crucial ideas Any signal can be constructed as a sum of sine waves In a linear time-invariant (LTI) system, the response to a sinusoid

More information

Hearing and Deafness 2. Ear as a frequency analyzer. Chris Darwin

Hearing and Deafness 2. Ear as a frequency analyzer. Chris Darwin Hearing and Deafness 2. Ear as a analyzer Chris Darwin Frequency: -Hz Sine Wave. Spectrum Amplitude against -..5 Time (s) Waveform Amplitude against time amp Hz Frequency: 5-Hz Sine Wave. Spectrum Amplitude

More information

AUDL 4007 Auditory Perception. Week 1. The cochlea & auditory nerve: Obligatory stages of auditory processing

AUDL 4007 Auditory Perception. Week 1. The cochlea & auditory nerve: Obligatory stages of auditory processing AUDL 4007 Auditory Perception Week 1 The cochlea & auditory nerve: Obligatory stages of auditory processing 1 Think of the ear as a collection of systems, transforming sounds to be sent to the brain 25

More information