ROBUST SPEECH RECOGNITION. Richard Stern
1 ROBUST SPEECH RECOGNITION Richard Stern Robust Speech Recognition Group Carnegie Mellon University Telephone: (412) Fax: (412) Short Course at Universidad Carlos III July 12-15, 2005
2 Outline of discussion Summary of the state-of-the-art in speech technology at Mellon and elsewhere Review of speech production and cepstral analysis Introduction to robust speech recognition: classical techniques Robust speech recognition using missing-feature techniques Speech recognition using complementary feature sets Signal processing and signal separation based on human auditory perception Use of multiple microphones for improved recognition accuracy Mellon Slide 2 CMU Robust Speech Group
3 Introduction Conventional signal processing schemes for speech recognition are motivated more by knowledge of speech production than by knowledge of speech perception. Nevertheless, the auditory system does a lot of interesting things! In this talk I will: Talk a bit about some basic findings in auditory physiology and perception Talk a bit about how knowledge of perception is starting to influence how we design signal processing for speech recognition Talk about how we can apply auditory principles to separate signals that are presented simultaneously
4 Basic auditory anatomy Structures involved in auditory processing: Mellon Slide 4 CMU Robust Speech Group
5 Excitation along the basilar membrane Some of von Békésy's (1960) measurements of motion along the basilar membrane: Comment: Different locations are most sensitive to different frequencies
6 Transient response of auditory-nerve fibers Histograms of response to tone bursts (Kiang et al., 1965): Comment: Onsets and offsets produce overshoot Mellon Slide 6 CMU Robust Speech Group
7 Frequency response of auditory-nerve fibers: tuning curves Threshold level for auditory-nerve response to tones: Note dependence of bandwidth on center frequency and asymmetry of response Mellon Slide 7 CMU Robust Speech Group
8 Typical response of auditory-nerve fibers as a function of stimulus level Typical response of auditory-nerve fibers to tones as a function of intensity: Comment: Saturation and limited dynamic range Mellon Slide 8 CMU Robust Speech Group
9 Synchronized auditory-nerve response to low-frequency tones Comment: response remains synchronized over a wide range of intensities Mellon Slide 9 CMU Robust Speech Group
10 Comments on synchronized auditory response Nerve fibers synchronize to fine structure at low frequencies, signal envelopes at high frequencies Synchrony clearly important for auditory localization Synchrony now believed important for monaural processing of complex signals as well Mellon Slide 10 CMU Robust Speech Group
11 Lateral suppression in auditory processing Auditory-nerve response to pairs of tones: Comment: Lateral suppression enhances local contrast in frequency Mellon Slide 11 CMU Robust Speech Group
12 Auditory masking patterns Masking produced by narrowband noise at 410 Hz: Comment: asymmetries in auditory-nerve patterns preserved Mellon Slide 12 CMU Robust Speech Group
13 Auditory frequency selectivity: critical bands Measurements of psychophysical filter bandwidth by various methods: Comments: Bandwidth increases with center frequency Solid curve is Equivalent Rectangular Bandwidth (ERB) Mellon Slide 13 CMU Robust Speech Group
14 Three perceptual auditory frequency scales Bark scale: Bark(f) = 0.01 f for 0 <= f < 500; 0.007 f + 1.5 for 500 <= f < 1220; 6 ln(f) - 32.6 for f >= 1220 (f in Hz) Mel scale: Mel(f) = 2595 log10(1 + f/700) ERB scale: ERB(f) = 24.7 (4.37 f + 1) (f in kHz)
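The three scales above can be computed directly; a minimal sketch (function names are ours, and the piecewise Bark branch points follow the formulas on the slide, which are continuous at 500 Hz and 1220 Hz):

```python
import math

def bark(f):
    """Piecewise-linear Bark approximation from the slide (f in Hz)."""
    if f < 500:
        return 0.01 * f
    elif f < 1220:
        return 0.007 * f + 1.5
    else:
        return 6.0 * math.log(f) - 32.6

def mel(f):
    """Mel scale (f in Hz)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def erb(f_khz):
    """Equivalent Rectangular Bandwidth in Hz (f in kHz)."""
    return 24.7 * (4.37 * f_khz + 1.0)
```

Note that the Bark branches agree where they meet (both give 5.0 at 500 Hz and about 10.04 at 1220 Hz), which is how the coefficients fit together.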
15 Comparison of normalized perceptual frequency scales Bark scale (in blue), Mel scale (in red), and ERB scale (in green): Perceptual scale Frequency, Hz Mellon Slide 15 CMU Robust Speech Group
16 Forward and backward masking Masking can have an effect even if target and masker are not simultaneously presented Forward masking - masker precedes target Backward masking - target precedes masker Examples: Introduction Backward masking Forward masking
17 The loudness of sounds Equal loudness contours (Fletcher-Munson curves): Mellon Slide 17 CMU Robust Speech Group
18 Summary of basic auditory physiology and perception Major physiological attributes: Frequency analysis in parallel channels Preservation of temporal fine structure Limited dynamic range in individual channels Enhancement of temporal contrast (at onsets and offsets) Enhancement of spectral contrast (at adjacent frequencies) Most major physiological attributes have psychophysical correlates Most physiological and psychophysical effects are not preserved in conventional representations for speech recognition Mellon Slide 18 CMU Robust Speech Group
19 Conventional ASR signal processing: MFCCs Segment incoming waveform into frames Compute frequency response for each frame using DFTs Multiply magnitude of frequency response by triangular weighting functions to produce channels Compute log of weighted magnitudes for each channel Take inverse discrete cosine transform (DCT) of weighted magnitudes for each channel, producing ~14 cepstral coefficients for each frame Calculate delta and double-delta coefficients Mellon Slide 19 CMU Robust Speech Group
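The per-frame steps above can be sketched as follows (the triangular Mel filterbank matrix is assumed precomputed; names and shapes here are illustrative, not the slide's implementation):

```python
import numpy as np

def mfcc_frame(frame, mel_fbank, n_ceps=14):
    """One frame of the MFCC pipeline: DFT magnitude, triangular Mel
    weighting, log compression, then a DCT of the log channel energies
    to produce ~14 cepstral coefficients."""
    spectrum = np.abs(np.fft.rfft(frame))        # magnitude of frequency response
    channels = mel_fbank @ spectrum              # triangular weighting -> channels
    log_mags = np.log(channels + 1e-10)          # log of weighted magnitudes
    n = len(log_mags)
    # DCT-II basis: coefficient k = sum_j log_mags[j] * cos(pi*k*(j+0.5)/n)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), np.arange(n) + 0.5) / n)
    return basis @ log_mags
```

Delta and double-delta coefficients would then be computed as frame-to-frame differences of these vectors.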
20 COMPARING SPECTRAL REPRESENTATIONS ORIGINAL SPEECH MEL LOG MAGS AFTER CEPSTRA Mellon Slide 20 CMU Robust Speech Group
21 Comments on the MFCC representation It's very blurry compared to a wideband spectrogram! Aspects of auditory processing represented: Frequency selectivity and spectral bandwidth (but using a constant analysis window duration!)» Wavelet schemes exploit time-frequency resolution better Nonlinear amplitude response Aspects of auditory processing NOT represented: Detailed timing structure Lateral suppression Enhancement of temporal contrast Other auditory nonlinearities
22 Speech representation using mean rate Representation of vowels by Young and Sachs using mean rate: Mean rate representation does not preserve spectral information Mellon Slide 22 CMU Robust Speech Group
23 Speech representation using average localized synchrony measure Representation of vowels by Young and Sachs using ALSR: Mellon Slide 23 CMU Robust Speech Group
24 The importance of timing information Re-analysis of Young-Sachs data by Searle: Temporal processing captures dominant formants in a spectral region Mellon Slide 24 CMU Robust Speech Group
25 Paths to the realization of temporal fine structure in speech Correlograms (Slaney and Lyon) Computations based on interval processing Seneff's Generalized Synchrony Detector (GSD) model Ghitza's Ensemble Interval Histogram (EIH) model Doh-Suk Kim's Zero Crossing Peak Analysis (ZCPA) model
26 A typical auditory model of the 1980s: The Seneff model Mellon Slide 26 CMU Robust Speech Group
27 Recognition accuracy using the Seneff model (Ohshima, 1994) Comment: CDCN performs just as well as the Seneff model Mellon Slide 27 CMU Robust Speech Group
28 Computational complexity of Seneff model Number of multiplications per ms of speech: Comment: auditory computation is extremely expensive Mellon Slide 28 CMU Robust Speech Group
29 Sequence of vowels Mellon Slide 29 CMU Robust Speech Group
30 Vowels processed using energy only Mellon Slide 30 CMU Robust Speech Group
31 Vowel sounds using autocorrelation expansion Mellon Slide 31 CMU Robust Speech Group
32 Comments on peripheral timing information Use of timing enables us to develop a rich display of frequencies, even with a limited number of analysis channels Nevertheless, this really gives us no new information unless the nonlinearities do something interesting Processing based on timing information (zero crossings, etc.) is likely to give us a more radically different display of information
33 Summary - auditory physiology and perception Major physiological attributes: Frequency analysis in parallel channels Preservation of temporal fine structure Limited dynamic range in individual channels Enhancement of temporal contrast (at onsets and offsets) Enhancement of spectral contrast (at adjacent frequencies) Most major physiological attributes have psychophysical correlates We are trying to capture important attributes of the representation for recognition, and we believe that this may help performance in noise and competing signals. Mellon Slide 33 CMU Robust Speech Group
34 Summary - exploiting peripheral auditory processing Traditional cepstral representations are motivated by frequency resolution in auditory processing BUT. There is much more information in the signal that is not included Timing seems to be important Physiologically-motivated representations are generally computationally expensive While physiologically-motivated representations have not improved speech recognition accuracy so far we remain optimistic Mellon Slide 34 CMU Robust Speech Group
35 Auditory scene analysis for automatic speech recognition Auditory scene analysis refers to the methods that humans use to perceive separately sounds that are presented simultaneously Al Bregman and his colleagues have identified many potential cues in auditory stream separation that can be applied directly to the speech recognition problem. Some examples: Separation by pitch/fundamental frequency Separation by source location Separation by micromodulation/common fate Computational auditory scene analysis (CASA) refers to efforts to mimic these processes computationally
36 Tracking speech sounds via fundamental frequency Given good pitch estimates How well can we separate signals from noise? How much will this separation help in speech recognition? To what extent can pitch be used to separate speech signals from one another? Mellon Slide 36 CMU Robust Speech Group
37 The CMU ARCTIC database Collected by John Kominek and Alan Black as a resource for speech synthesis Contains phonetically-balanced recordings with simultaneously-recorded EGG (laryngograph) measurements Available at
38 The CMU ARCTIC database Original speech: Laryngograph recording:
39 Typical pitch estimates obtained from ARCTIC Comment: not all outliers were successfully removed
40 Isolating speech by pitch Method 1: Estimate amplitudes of partials by synchronous heterodyne analysis Resynthesize as sums of sines or cosines Unvoiced segments are problematical Method 2: Pass speech through a comb filter that tracks harmonic frequencies Unvoiced segments are still problematical Mellon Slide 40 CMU Robust Speech Group
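Method 1 can be sketched for a segment with a constant fundamental frequency (a real system re-estimates f0 frame by frame; the averaging here is a crude stand-in for the lowpass filter of a true heterodyne analyzer):

```python
import numpy as np

def heterodyne_partials(x, fs, f0, n_harmonics=10):
    """Estimate complex amplitudes of the first harmonics of f0 by
    synchronous heterodyne analysis: multiply by a complex exponential
    at each harmonic and average (a crude lowpass). Assumes constant f0."""
    t = np.arange(len(x)) / fs
    amps = []
    for k in range(1, n_harmonics + 1):
        amps.append(2 * np.mean(x * np.exp(-2j * np.pi * k * f0 * t)))
    return np.array(amps)

def resynthesize(amps, fs, f0, n_samples):
    """Rebuild the voiced signal as a sum of cosines at the harmonics."""
    t = np.arange(n_samples) / fs
    y = np.zeros(n_samples)
    for k, a in enumerate(amps, start=1):
        y += np.real(a * np.exp(2j * np.pi * k * f0 * t))
    return y
```

As the slide notes, this recovers voiced segments well but has nothing to say about unvoiced segments, where there are no harmonics to track.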
41 An example of pitch-based approaches: synchronized heterodyne analysis Extract instantaneous pitch, extract amplitudes at harmonics, resynthesize Original speech samples: Reconstructed speech: Mellon Slide 41 CMU Robust Speech Group
42 Recovering speech through comb filtering Pitch-tracking comb filter: H(z) = z^(-P) / (1 - g z^(-P)) Its frequency response: Original speech samples: Reconstructed speech:
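The recursion implied by that transfer function is short; a sketch with a fixed pitch period P in samples (a pitch-tracking version updates P continuously from the pitch estimate):

```python
import numpy as np

def pitch_comb_filter(x, period, g=0.9):
    """Comb filter H(z) = z^-P / (1 - g z^-P): its response peaks at
    multiples of fs/period, reinforcing the harmonics of a voice with
    that pitch period while attenuating energy between them."""
    y = np.zeros(len(x))
    for n in range(period, len(x)):
        y[n] = x[n - period] + g * y[n - period]
    return y
```

Feeding in a unit impulse shows the geometric train of echoes spaced P samples apart that gives the filter its harmonic response.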
43 Separating speech signals by heterodyning and comb filtering Combined speech signals: Speech separated by heterodyne filters: Speech separated by comb filters: Comment: men mask women more because upper male harmonics are more likely to impinge on lower female harmonics Mellon Slide 43 CMU Robust Speech Group
44 Speech recognition in noise based on pitch tracking (Vicente y Peña) WER vs. SNR (dB) for SHA with oracle pitch, SHA, and the baseline system Initial results could improve as techniques mature
45 Speech separation by source location Sources arriving from different azimuths produce interaural time delays (ITDs) and interaural intensity differences (IIDs) as they arrive at the two ears So far this information has been used for Better masks for missing feature recognition and to combat reverberation (e.g. Brown, Wang et al.) Direct separation from interaural representation Mellon Slide 45 CMU Robust Speech Group
46 The classical model of binaural processing (Colburn and Durlach 1978) Mellon Slide 46 CMU Robust Speech Group
47 Array processing based on human binaural hearing Mellon Slide 47 CMU Robust Speech Group
48 Correlation-based system motivated by binaural hearing Mellon Slide 48 CMU Robust Speech Group
49 Vowel representations are improved by correlation processing Reconstructed features of vowel /a/: two inputs, zero delay; two inputs, 120-µs delay; eight inputs, 120-µs delay Recognition results in 1993 showed some (small) improvement in WER at great computational cost
50 So what do things sound like on the cross-correlation display? Signals combined with ITDs of 0 and 0.5 ms Individual speech signals: Combined speech signals: Separated by delay-and-sum beamforming: Signals separated by cross-correlation display: Signals separated by additional correlations across frequency at a common ITD ("straightness" weighting):
51 Signal separation using micro-modulation Micromodulation of amplitude and frequency may be helpful in separating unvoiced segments of sound sources Physical cues supported by many psychoacoustical studies in recent years Mellon Slide 51 CMU Robust Speech Group
52 John Chowning's demonstration of effects of micro-modulation in frequency (Reconstruction of frequency vs. time based on the description in Bregman's book)
53 Separating by frequency modulation only Extract instantaneous frequencies of filterbank outputs Cross-correlate frequencies across channels (finds comodulated harmonics) Cluster correlated harmonics and resynthesize Our first example: Isolated speech: Combined speech: Speech separated by frequency modulation: Comment: Success will depend on ability to track frequency components across analysis bands Mellon Slide 53 CMU Robust Speech Group
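The cross-correlation step in the recipe above might look like the following sketch (channel clustering and resynthesis are omitted; the input is assumed to be a matrix of per-channel instantaneous-frequency tracks, however they were extracted):

```python
import numpy as np

def channel_comodulation(freq_tracks):
    """Correlate instantaneous-frequency tracks across analysis channels.
    Highly correlated channels are candidates for grouping as one source
    ('common fate'). freq_tracks: (n_channels, n_frames) array of
    instantaneous-frequency estimates, one row per filterbank channel."""
    f = np.asarray(freq_tracks, float)
    f = f - f.mean(axis=1, keepdims=True)          # remove each channel's mean
    norms = np.linalg.norm(f, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    fn = f / norms
    return fn @ fn.T                               # normalized cross-correlations
```

Channels whose entries in this matrix are near 1 share the same micromodulation and would be clustered together before resynthesis.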
54 Summary: signal separation and recognition motivated by auditory processing Computational advances may now enable practical feature extraction based on auditory physiology and perception Computational auditory scene analysis shows promise for signal-separation problems, but the field is presently in its infancy Mellon Slide 54 CMU Robust Speech Group
55 Outline of discussion Summary of the state-of-the-art in speech technology at Mellon and elsewhere Review of speech production and cepstral analysis Introduction to robust speech recognition: classical techniques Robust speech recognition using missing-feature techniques Speech recognition using complementary feature sets Signal processing and signal separation based on human auditory perception Use of multiple microphones for improved recognition accuracy Mellon Slide 55 CMU Robust Speech Group
56 Introduction The use of arrays of microphones can improve speech recognition accuracy in noise Outline of this talk Review classical approaches to microphone array processing» Delay-and-sum beamforming» Traditional adaptive filtering» Physiologically-motivated processing Describe and discuss selected recent results» Matched-filter array processing (Rutgers)» Array processing based on speech features (CMU)
57 OVERVIEW OF SPEECH RECOGNITION Speech features Phoneme hypotheses Major functional components: Signal processing to extract features from speech waveforms Comparison of features to pre-stored templates Important design choices: Choice of features Feature extraction Specific method of comparing features to stored templates Decision making procedure Mellon Slide 57 CMU Robust Speech Group
58 Why use microphone arrays? Microphone arrays can provide directional response, accepting speech from some directions but suppressing others Mellon Slide 58 CMU Robust Speech Group
59 Another reason for microphone arrays Microphone arrays can focus attention on the direct field in a reverberant environment Mellon Slide 59 CMU Robust Speech Group
60 Three classical types of microphone arrays Delay-and-sum beamforming and its many variants Adaptive arrays based on mean-square suppression of noise Physiologically and perceptually motivated approaches to multiple microphones Mellon Slide 60 CMU Robust Speech Group
61 Delay-and-sum beamforming Simple processing based on equalizing delays to sensors High directivity can be achieved with many sensors Sensor 1 Sensor 2 z -n1 z -n2 Output Sensor K z -nk Mellon Slide 61 CMU Robust Speech Group
62 The physics of delay-and-sum beamforming Geometry: sensor spacing d, arrival angle θ, path-length difference d sin(θ)
63 The physics of delay-and-sum beamforming If sensor outputs are added together, the look direction is θ = 0 The look direction can be steered to other directions by inserting electronic delays to compensate for the physical ones For a look direction of θ = 0, the net array response is A = sin(N ω d sin(θ)/2c) / sin(ω d sin(θ)/2c) = sin(N π d sin(θ)/λ) / sin(π d sin(θ)/λ)
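The response formula above can be evaluated directly; a sketch matching the example slides that follow (d = 8.62 cm, N = 9), with the speed of sound taken as 343 m/s:

```python
import numpy as np

def delay_sum_response(theta, n_sensors, d, f, c=343.0):
    """Magnitude response of an N-sensor delay-and-sum array steered to
    broadside (look direction theta = 0): |sin(N*psi)/sin(psi)| with
    psi = pi*d*sin(theta)/lambda. Peaks at N in the look direction."""
    lam = c / f
    psi = np.pi * d * np.sin(theta) / lam
    if abs(np.sin(psi)) < 1e-12:       # limit as theta -> 0 (or a grating lobe)
        return float(n_sensors)
    return abs(np.sin(n_sensors * psi) / np.sin(psi))
```

Sweeping f from 500 to 4000 Hz with this function reproduces the narrowing main lobe seen across the example slides.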
64 Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 500 Hz Mellon Slide 64 CMU Robust Speech Group
65 Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 1000 Hz Mellon Slide 65 CMU Robust Speech Group
66 Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 1500 Hz Mellon Slide 66 CMU Robust Speech Group
67 Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 2000 Hz Mellon Slide 67 CMU Robust Speech Group
68 Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 2500 Hz Mellon Slide 68 CMU Robust Speech Group
69 Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 3000 Hz Mellon Slide 69 CMU Robust Speech Group
70 Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 3500 Hz Mellon Slide 70 CMU Robust Speech Group
71 Examples of delay-and-sum beams d = 8.62 cm, N = 9, f = 4000 Hz Mellon Slide 71 CMU Robust Speech Group
72 Nested microphone arrays (Flanagan et al.) 5-element low frequency array Mellon Slide 72 CMU Robust Speech Group
73 Nested microphone arrays 5-element mid frequency array Mellon Slide 73 CMU Robust Speech Group
74 Nested microphone arrays 5-element high frequency array Mellon Slide 74 CMU Robust Speech Group
75 Combined nested array (Flanagan et al.) Three-band quasi-constant beamwidth array Lowpass filter Bandpass filter Highpass filter Mellon Slide 75 CMU Robust Speech Group
76 Another delay-and-sum issue: spatial aliasing d = 8.62 cm, N = 9, f = 4000 Hz Mellon Slide 76 CMU Robust Speech Group
77 Another delay-and-sum issue: spatial aliasing d = 8.62 cm, N = 9, f = 5000 Hz Mellon Slide 77 CMU Robust Speech Group
78 Another delay-and-sum issue: spatial aliasing d = 8.62 cm, N = 9, f = 6000 Hz Mellon Slide 78 CMU Robust Speech Group
79 Preventing spatial aliasing Spatial aliasing occurs when sensors receive input more than half a period apart from one another The spatial Nyquist constraint depends on both frequency and arrival angle: to prevent spatial aliasing we require that the frequency satisfy ν < c / (2 d sin(θ))
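That constraint can be packaged as a small helper (the speed of sound c = 343 m/s is our assumption; the worst case is endfire arrival, θ = 90°):

```python
import math

def max_unaliased_frequency(d, theta_max, c=343.0):
    """Highest frequency free of spatial aliasing for sensor spacing d (m)
    and worst-case arrival angle theta_max (rad): the inter-sensor delay
    d*sin(theta)/c must stay under half a period, so nu < c/(2 d sin(theta))."""
    return c / (2.0 * d * math.sin(theta_max))
```

For the d = 8.62 cm spacing of the example slides this gives roughly 2 kHz at endfire, consistent with the grating lobes appearing in the higher-frequency beam patterns shown earlier.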
80 Filter-and-sum beamforming Input filters can (in principle) place delays that vary with frequency to ameliorate frequency dependencies of beamforming Filters can also compensate for channel characteristics Sensor 1 Filter Sensor 2 Filter Output Sensor K Filter Mellon Slide 80 CMU Robust Speech Group
81 Compensated delay-and-sum beamforming Filter added to compensate for filtering effects of delay and sum beamforming Sensor 1 z -n1 Sensor 2 z -n2 Filter Output Sensor K z -nk Mellon Slide 81 CMU Robust Speech Group
82 Compensated delay-and-sum beamforming: some implementations Sullivan and Stern use CDCN compensation Rutgers group compensation function derived using a neural network Silverman group Omologo group Mellon Slide 82 CMU Robust Speech Group
83 Sample recognition results using compensated delay-and-sum The Flanagan array with CDCN does improve accuracy: Mellon Slide 83 CMU Robust Speech Group
84 Traditional adaptive arrays Mellon Slide 84 CMU Robust Speech Group
85 Traditional adaptive arrays Large established literature Use MMSE techniques to establish beams in the look direction and nulls to additive noise sources Generally do not perform well in reverberant environments Signal cancellation Effective impulse response longer than length of filter Techniques to circumvent signal cancellation Switching nulling mechanism off and on according to presence or absence of speech (Van Compernolle) Use of alternate adaptation algorithms Mellon Slide 85 CMU Robust Speech Group
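A plain LMS filter illustrates the MMSE adaptation mentioned above (a sketch only: real adaptive beamformers add look-direction constraints, and the signal-cancellation problem listed above arises precisely because reverberated speech leaks into the reference channel):

```python
import numpy as np

def lms_adaptive_filter(x, d, n_taps=8, mu=0.01):
    """Plain LMS: adapt FIR weights w so that filtering the reference x
    approximates the primary signal d in the MMSE sense. The error
    e[n] = d[n] - w.u drives the weight update."""
    w = np.zeros(n_taps)
    e = np.zeros(len(x))
    for n in range(n_taps, len(x)):
        u = x[n - n_taps:n][::-1]      # most recent reference samples
        e[n] = d[n] - w @ u            # estimation error
        w += 2 * mu * e[n] * u         # stochastic-gradient update
    return w, e
```

When d really is a filtered version of x, the error power decays toward zero; in a reverberant room the "noise" is correlated with the speech, so the same mechanism cancels the desired signal.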
86 Array processing based on human binaural hearing Motivation: human binaural system is known to have excellent immunity to additive noise and reverberation Binaural phenomena of interest: Cocktail party effect Precedence effect Problems with binaural models: Correlation produces signal distortion from rectification and squaring Precedence-effect processing defeats echoes but also suppresses desired signals Greatest challenge: decoding useful information from the crosscorrelation display Mellon Slide 86 CMU Robust Speech Group
87 Correlation-based system motivated by binaural hearing Mellon Slide 87 CMU Robust Speech Group
88 Matched-filter beamforming (Rutgers) Goal: compensation for delay and dispersion introduced in reverberant environments 600-ms room response and its autocorrelation function:
89 Matched-filter beamforming procedure 1. Measure or estimate sample response from source to each sensor 2. Convolve input with time-reversed sample function (producing autocorrelation function) 3. Sum outputs of channels together Rationale: main lobes of autocorrelation functions should reinforce while side lobes cancel Mellon Slide 89 CMU Robust Speech Group
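The three steps above reduce to a few lines; a sketch assuming the room impulse responses have already been measured or estimated (step 1):

```python
import numpy as np

def matched_filter_beamform(signals, room_responses):
    """Matched-filter beamforming sketch: convolve each channel with the
    time-reversed impulse response from source to that sensor (yielding
    an autocorrelation function for the direct signal), then sum. Main
    lobes add coherently across channels; side lobes tend to cancel."""
    out = None
    for x, h in zip(signals, room_responses):
        y = np.convolve(x, h[::-1])    # matched filter = time-reversed response
        out = y if out is None else out + y
    return out
```

The gain comes from the summation across sensors: each channel's autocorrelation peaks at the same lag, while the reverberant tails are uncorrelated from sensor to sensor.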
90 Optimizing microphone arrays for speech recognition features The objective of typical microphone array algorithms has been signal enhancement rather than speech recognition MIC 1 s[n] MIC 2 Array Processor ŝ[n] MIC N
91 Automatic Speech Recognition (ASR) Parameterize the speech signal and compare the parameter sequence to statistical models of speech sound units to hypothesize what a user said. s[n] → Feature Extraction → {O_1, ..., O_N} → ASR, with acoustic model P(O|W) and language model P(W) combined by Bayes' rule: P(W|O) = P(O|W) P(W) / P(O) → {Ŵ_1, ..., Ŵ_M} The objective is accurate recognition, a statistical pattern classification problem.
92 Speech recognition with microphone arrays Recognition with microphone arrays has traditionally been performed by gluing the two systems together. The systems have different objectives, and each system does not exploit information present in the other. MIC 1 MIC 2 MIC 3 Array Proc Feature Extraction ASR MIC 4
93 Array processing based on speech features Develop an array processing scheme targeted at improved speech recognition performance without regard to conventional array processing objective criteria. MIC 1 MIC 2 MIC 3 Array Proc Feature Extraction ASR MIC 4 Mellon Slide 93 CMU Robust Speech Group
94 Multi-microphone compensation for speech recognition based on cepstral distortion Array Proc Front End ASR Multi-mic compensation based on optimizing speech features rather than signal distortion Speech in Room Delay and Sum Optimal Comp Mellon Slide 94 CMU Robust Speech Group
95 Choosing array weights based on speech features Want an objective function that uses parameters directly related to recognition MIC 1 x 1 τ 1 h 1 Clean Speech Features MIC 2 x 2 τ 2 h 2 M s ε y M y Σ FE MIC M x M τ Μ h M minimize ε Mellon Slide 95 CMU Robust Speech Group
96 An objective function for mic arrays based on speech recognition Define Q as the sum of squared errors of the log Mel spectra of clean speech s and noisy speech y: Q = Σ_f Σ_l (M_y[f,l] - M_s[f,l])² where y is the output of a filter-and-sum microphone array and M[f,l] is the l-th log Mel spectral value in frame f. M_y[f,l] is a function of the signals captured by the array and the filter parameters associated with each microphone.
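The objective itself is a one-liner once the log Mel spectra have been computed; a sketch (shapes are our assumption, and in practice Q is minimized over the filter parameters that determine M_y):

```python
import numpy as np

def mel_spectral_objective(M_y, M_s):
    """The objective Q above: sum of squared differences between the log
    Mel spectra of the array output (M_y) and of clean speech (M_s),
    both of shape (n_frames, n_mel_channels)."""
    return float(np.sum((np.asarray(M_y) - np.asarray(M_s)) ** 2))
```

Because Q is a nonlinear function of the filter coefficients (the log and Mel weighting sit between them and the error), the calibration described next resorts to iterative gradient-based minimization rather than a closed-form solution.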
97 Calibration of microphone arrays for ASR Calibration of filter-and-sum microphone array: User speaks an utterance with known transcription» With or without close-talking microphone Derive optimal set of filters» Minimize the objective function with respect to the filter coefficients.» Since objective function is non-linear, use iterative gradient-based methods Apply to all future speech Mellon Slide 97 CMU Robust Speech Group
98 Calibration Using close-talking recording Given a close-talking mic recording for the calibration utterance, derive an optimal filter for each channel to improve recognition FE M s MIC 1 τ 1 h 1 (n) OPT ASR s[n] MIC M τ Μ h M (n) Σ FE M y Mellon Slide 98 CMU Robust Speech Group
99 Multi-microphone data sets TMS Recorded in CMU auditory lab» Approx. 5m x 5m x 3m» Noise from computer fans, blowers, etc. Isolated letters and digits, keywords 10 speakers * 14 utterances = 140 utterances Each utterance has a close-talking mic control waveform
100 Multi-microphone data sets (2) WSJ + off-axis noise source Room simulation created using the image method» 5m x 4m x 3m» 200ms reverberation time» WGN 5dB SNR WSJ test set» 5K word vocabulary» 10 speakers * 65 utterances = 650 utterances Original recordings used as close-talking control waveforms
101 Results TMS data set, WSJ0 + WGN point source simulation Constructed 50-point filters from a single calibration utterance Applied filters to all test utterances WER (%) compared for close-talking (CLSTK), 1 MIC, delay-and-sum (D&S), and MEL OPT on both TMS and WSJ
102 Calibration without Close-talking Microphone Obtain initial waveform estimate using conventional array processing technique (e.g. delay and sum). Use transcription and the recognizer to estimate the sequence of target clean log Mel spectra. Optimize filter parameters as before. Mellon Slide 102 CMU Robust Speech Group
103 Calibration w/o Close-talking Microphone (2) Force-align the delay-and-sum waveform to the known transcription to generate an estimated HMM state sequence. MIC 1 τ_1 s[n] Σ FE FALIGN → {q̂_1, q̂_2, ..., q̂_N} MIC M τ_M HMM
104 Calibration w/o Close-talking Microphone (3) Extract the means from the single-Gaussian HMMs of the estimated state sequence. Since the models have been trained on clean speech, use these means as the target clean speech feature vectors. {q̂_1, q̂_2, ..., q̂_N} → HMM → {µ_1, µ_2, ..., µ_N} → IDCT → M̂_s
105 Calibration w/o Close-talking Microphone (4) Use estimated clean speech feature vectors to optimize filters as before. Mˆ s MIC 1 τ 1 h 1 (n) OPT ASR s[n] MIC M τ Μ h M (n) Σ FE M y Mellon Slide 105 CMU Robust Speech Group
106 Results TMS data set, WSJ0 + WGN point source simulation Constructed 50-point filters from the calibration utterance Applied filters to all utterances WER (%) compared for CLSTK, 1 MIC, D&S, MEL OPT w/ CLSTK, and MEL OPT NO CLSTK on both TMS and WSJ
107 Results (2) WER vs. SNR (dB) for WSJ + WGN, comparing close-talk, optimized calibration, delay-and-sum, and 1 mic Constructed 50-point filters from the calibration utterance using the transcription only Applied filters to all utterances
108 The problem of reverberation High levels of reverberation are extremely detrimental to recognition accuracy In reverberant environments, the noise actually consists of reflected copies of the target speech signals Noise-canceling strategies (like the LMS algorithm) based on uncorrelated noise sources fail Frame-based compensation strategies also fail because the effects of reverberation can be spread over several frames
109 Baseline in highly reverberant rooms Comparison of single-channel and delay-and-sum beamforming as a function of reverberation time (WSJ data passed through measured impulse responses):
110 Subband processing using optimized features (Seltzer) Subband processing can address some of the effects of reverberation. The subband signals have more desirable narrowband signal properties. 1. Divide into independent subbands 2. Downsample 3. Process subbands independently 4. Upsample 5. Resynthesize full signal from subbands x[n] → H_i(z) → ↓L → F_i(z) → ↑L → G_i(z) → Σ → y[n], for subbands i = 1, ..., L
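The five-step structure above can be sketched with the simplest possible two-band filterbank (Haar analysis/synthesis, which reconstructs perfectly; the real system uses many bands and optimizes a filter F_i(z) per band, which we omit):

```python
import numpy as np

def haar_subband_roundtrip(x):
    """Two-band sketch of the subband structure: split into low and high
    bands, downsample by 2, (per-band processing would go here), then
    upsample and resynthesize. With no processing, the input is recovered."""
    x = np.asarray(x, float)
    if len(x) % 2:
        x = np.append(x, 0.0)
    lo = (x[0::2] + x[1::2]) / 2.0     # analysis lowpass + decimation
    hi = (x[0::2] - x[1::2]) / 2.0     # analysis highpass + decimation
    # ... independent narrowband processing of lo and hi would happen here ...
    y = np.empty(2 * len(lo))
    y[0::2] = lo + hi                  # upsampling + synthesis
    y[1::2] = lo - hi
    return y
```

The point of the decomposition is that each decimated band signal is narrowband, so a short per-band filter can model dispersion that would require a very long filter at the full rate.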
111 Subband results with reverberated WSJ task WER for all speakers as a function of reverberation time (ms), comparing delay-and-sum, subband processing, and LIMABEAM:
112 Is Joint Filter Estimation Necessary? We compared 4 cases: Delay and sum Optimize 1 filter for the delay-and-sum output Optimize microphone array filters independently Optimize microphone array filters jointly WER (%) compared across the four cases on WSJ + WGN at 10 dB
113 Summary Microphone array processing is effective although not yet in widespread use Despite many developments in signal processing, actual applications to speech are based on very simple concepts Delay-and-sum beamforming Constant-beamwidth arrays Array processing based on preservation of feature values can improve accuracy, even in reverberant environments Major problems and issues: Maintaining good performance in reverberation Real-time time-varying environments and speakers Mellon Slide 113 CMU Robust Speech Group
114
Applying Models of Auditory Processing to Automatic Speech Recognition: Promise and Progress!
Applying Models of Auditory Processing to Automatic Speech Recognition: Promise and Progress! Richard Stern (with Chanwoo Kim, Yu-Hsiang Chiu, and others) Department of Electrical and Computer Engineering
More informationRobust Speech Recognition Group Carnegie Mellon University. Telephone: Fax:
Robust Automatic Speech Recognition In the 21 st Century Richard Stern (with Alex Acero, Yu-Hsiang Chiu, Evandro Gouvêa, Chanwoo Kim, Kshitiz Kumar, Amir Moghimi, Pedro Moreno, Hyung-Min Park, Bhiksha
More informationROBUST SPEECH RECOGNITION BASED ON HUMAN BINAURAL PERCEPTION
ROBUST SPEECH RECOGNITION BASED ON HUMAN BINAURAL PERCEPTION Richard M. Stern and Thomas M. Sullivan Department of Electrical and Computer Engineering School of Computer Science Carnegie Mellon University
More informationThe psychoacoustics of reverberation
The psychoacoustics of reverberation Steven van de Par Steven.van.de.Par@uni-oldenburg.de July 19, 2016 Thanks to Julian Grosse and Andreas Häußler 2016 AES International Conference on Sound Field Control
More informationCalibration of Microphone Arrays for Improved Speech Recognition
MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Calibration of Microphone Arrays for Improved Speech Recognition Michael L. Seltzer, Bhiksha Raj TR-2001-43 December 2001 Abstract We present
More informationRobust Speech Recognition Based on Binaural Auditory Processing
INTERSPEECH 2017 August 20 24, 2017, Stockholm, Sweden Robust Speech Recognition Based on Binaural Auditory Processing Anjali Menon 1, Chanwoo Kim 2, Richard M. Stern 1 1 Department of Electrical and Computer
More informationMonaural and Binaural Speech Separation
Monaural and Binaural Speech Separation DeLiang Wang Perception & Neurodynamics Lab The Ohio State University Outline of presentation Introduction CASA approach to sound separation Ideal binary mask as
More informationRobust Speech Recognition Based on Binaural Auditory Processing
Robust Speech Recognition Based on Binaural Auditory Processing Anjali Menon 1, Chanwoo Kim 2, Richard M. Stern 1 1 Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh,
Effects of Reverberation on Pitch, Onset/Offset, and Binaural Cues
Effects of Reverberation on Pitch, Onset/Offset, and Binaural Cues DeLiang Wang Perception & Neurodynamics Lab The Ohio State University Outline of presentation Introduction Human performance Reverberation
Auditory Based Feature Vectors for Speech Recognition Systems
Auditory Based Feature Vectors for Speech Recognition Systems Dr. Waleed H. Abdulla Electrical & Computer Engineering Department The University of Auckland, New Zealand [w.abdulla@auckland.ac.nz] 1 Outlines
Mel Spectrum Analysis of Speech Recognition using Single Microphone
International Journal of Engineering Research in Electronics and Communication Mel Spectrum Analysis of Speech Recognition using Single Microphone [1] Lakshmi S.A, [2] Cholavendan M [1] PG Scholar, Sree
Perception of pitch. Definitions. Why is pitch important? BSc Audiology/MSc SHS Psychoacoustics wk 4: 7 Feb A. Faulkner.
Perception of pitch BSc Audiology/MSc SHS Psychoacoustics wk 4: 7 Feb 2008. A. Faulkner. See Moore, BCJ Introduction to the Psychology of Hearing, Chapter 5. Or Plack CJ The Sense of Hearing Lawrence Erlbaum,
AN AUDITORILY MOTIVATED ANALYSIS METHOD FOR ROOM IMPULSE RESPONSES
Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-), Verona, Italy, December 7-9,2 AN AUDITORILY MOTIVATED ANALYSIS METHOD FOR ROOM IMPULSE RESPONSES Tapio Lokki Telecommunications
Auditory modelling for speech processing in the perceptual domain
ANZIAM J. 45 (E) ppc964 C980, 2004 C964 Auditory modelling for speech processing in the perceptual domain L. Lin E. Ambikairajah W. H. Holmes (Received 8 August 2003; revised 28 January 2004) Abstract
Signal Processing for Speech Applications - Part 2-1. Signal Processing For Speech Applications - Part 2
Signal Processing for Speech Applications - Part 2-1 Signal Processing For Speech Applications - Part 2 May 14, 2013 Signal Processing for Speech Applications - Part 2-2 References Huang et al., Chapter
Perception of pitch. Definitions. Why is pitch important? BSc Audiology/MSc SHS Psychoacoustics wk 5: 12 Feb A. Faulkner.
Perception of pitch BSc Audiology/MSc SHS Psychoacoustics wk 5: 12 Feb 2009. A. Faulkner. See Moore, BCJ Introduction to the Psychology of Hearing, Chapter 5. Or Plack CJ The Sense of Hearing Lawrence
Binaural Hearing. Reading: Yost Ch. 12
Binaural Hearing Reading: Yost Ch. 12 Binaural Advantages Sounds in our environment are usually complex, and occur either simultaneously or close together in time. Studies have shown that the ability to
Speech Signal Analysis
Speech Signal Analysis Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 2&3 14,18 January 216 ASR Lectures 2&3 Speech Signal Analysis 1 Overview Speech Signal Analysis for
Perception of pitch. Importance of pitch: 2. mother hemp horse. scold. Definitions. Why is pitch important? AUDL4007: 11 Feb A. Faulkner.
Perception of pitch AUDL4007: 11 Feb 2010. A. Faulkner. See Moore, BCJ Introduction to the Psychology of Hearing, Chapter 5. Or Plack CJ The Sense of Hearing Lawrence Erlbaum, 2005 Chapter 7 1 Definitions
Signal Processing for Robust Speech Recognition Motivated by Auditory Processing
Signal Processing for Robust Speech Recognition Motivated by Auditory Processing Chanwoo Kim CMU-LTI-1-17 Language Technologies Institute School of Computer Science Carnegie Mellon University 5 Forbes
speech signal S(n). This involves a transformation of S(n) into another signal or a set of signals
16 3. SPEECH ANALYSIS 3.1 INTRODUCTION TO SPEECH ANALYSIS Many speech processing [22] applications exploits speech production and perception to accomplish speech analysis. By speech analysis we extract
EE482: Digital Signal Processing Applications
Professor Brendan Morris, SEB 3216, brendan.morris@unlv.edu EE482: Digital Signal Processing Applications Spring 2014 TTh 14:30-15:45 CBC C222 Lecture 14 Quiz 04 Review 14/04/07 http://www.ee.unlv.edu/~b1morris/ee482/
Robust speech recognition using temporal masking and thresholding algorithm
Robust speech recognition using temporal masking and thresholding algorithm Chanwoo Kim 1, Kean K. Chin 1, Michiel Bacchiani 1, Richard M. Stern 2 Google, Mountain View CA 9443 USA 1 Carnegie Mellon University,
SIGNAL PROCESSING FOR ROBUST SPEECH RECOGNITION MOTIVATED BY AUDITORY PROCESSING CHANWOO KIM
SIGNAL PROCESSING FOR ROBUST SPEECH RECOGNITION MOTIVATED BY AUDITORY PROCESSING CHANWOO KIM MAY 21 ABSTRACT Although automatic speech recognition systems have dramatically improved in recent decades,
COM325 Computer Speech and Hearing
COM325 Computer Speech and Hearing Part III : Theories and Models of Pitch Perception Dr. Guy Brown Room 145 Regent Court Department of Computer Science University of Sheffield Email: g.brown@dcs.shef.ac.uk
ADSP ADSP ADSP ADSP. Advanced Digital Signal Processing (18-792) Spring Fall Semester, Department of Electrical and Computer Engineering
ADSP ADSP ADSP ADSP Advanced Digital Signal Processing (18-792) Spring Fall Semester, 201 2012 Department of Electrical and Computer Engineering PROBLEM SET 5 Issued: 9/27/18 Due: 10/3/18 Reminder: Quiz
Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition Chanwoo Kim, Member, IEEE, and Richard M. Stern, Fellow, IEEE
IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 24, NO. 7, JULY 2016 1315 Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition Chanwoo Kim, Member, IEEE, and
EE482: Digital Signal Processing Applications
Professor Brendan Morris, SEB 3216, brendan.morris@unlv.edu EE482: Digital Signal Processing Applications Spring 2014 TTh 14:30-15:45 CBC C222 Lecture 12 Speech Signal Processing 14/03/25 http://www.ee.unlv.edu/~b1morris/ee482/
Enhanced Waveform Interpolative Coding at 4 kbps
Enhanced Waveform Interpolative Coding at 4 kbps Oded Gottesman, and Allen Gersho Signal Compression Lab. University of California, Santa Barbara E-mail: [oded, gersho]@scl.ece.ucsb.edu Signal Compression
Recurrent Timing Neural Networks for Joint F0-Localisation Estimation
Recurrent Timing Neural Networks for Joint F0-Localisation Estimation Stuart N. Wrigley and Guy J. Brown Department of Computer Science, University of Sheffield Regent Court, 211 Portobello Street, Sheffield
A classification-based cocktail-party processor
A classification-based cocktail-party processor Nicoleta Roman, DeLiang Wang Department of Computer and Information Science and Center for Cognitive Science The Ohio State University Columbus, OH 43, USA
RASTA-PLP SPEECH ANALYSIS. Aruna Bayya. Phil Kohn y TR December 1991
RASTA-PLP SPEECH ANALYSIS Hynek Hermansky Nelson Morgan y Aruna Bayya Phil Kohn y TR-91-069 December 1991 Abstract Most speech parameter estimation techniques are easily inuenced by the frequency response
Speech and Audio Processing Recognition and Audio Effects Part 3: Beamforming
Speech and Audio Processing Recognition and Audio Effects Part 3: Beamforming Gerhard Schmidt Christian-Albrechts-Universität zu Kiel Faculty of Engineering Electrical Engineering and Information Engineering
III. Publication III. c 2005 Toni Hirvonen.
III Publication III Hirvonen, T., Segregation of Two Simultaneously Arriving Narrowband Noise Signals as a Function of Spatial and Frequency Separation, in Proceedings of th International Conference on
MOST MODERN automatic speech recognition (ASR)
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 5, NO. 5, SEPTEMBER 1997 451 A Model of Dynamic Auditory Perception and Its Application to Robust Word Recognition Brian Strope and Abeer Alwan, Member,
Adaptive Filters Application of Linear Prediction
Adaptive Filters Application of Linear Prediction Gerhard Schmidt Christian-Albrechts-Universität zu Kiel Faculty of Engineering Electrical Engineering and Information Technology Digital Signal Processing
A CLOSER LOOK AT THE REPRESENTATION OF INTERAURAL DIFFERENCES IN A BINAURAL MODEL
9th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, -7 SEPTEMBER 7 A CLOSER LOOK AT THE REPRESENTATION OF INTERAURAL DIFFERENCES IN A BINAURAL MODEL PACS: PACS:. Pn Nicolas Le Goff ; Armin Kohlrausch ; Jeroen
Pattern Recognition. Part 6: Bandwidth Extension. Gerhard Schmidt
Pattern Recognition Part 6: Gerhard Schmidt Christian-Albrechts-Universität zu Kiel Faculty of Engineering Institute of Electrical and Information Engineering Digital Signal Processing and System Theory
19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007
19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 MODELING SPECTRAL AND TEMPORAL MASKING IN THE HUMAN AUDITORY SYSTEM PACS: 43.66.Ba, 43.66.Dc Dau, Torsten; Jepsen, Morten L.; Ewert,
Comparison of Spectral Analysis Methods for Automatic Speech Recognition
INTERSPEECH 2013 Comparison of Spectral Analysis Methods for Automatic Speech Recognition Venkata Neelima Parinam, Chandra Vootkuri, Stephen A. Zahorian Department of Electrical and Computer Engineering
Speech Synthesis; Pitch Detection and Vocoders
Speech Synthesis; Pitch Detection and Vocoders Tai-Shih Chi ( 冀泰石 ) Department of Communication Engineering National Chiao Tung University May. 29, 2008 Speech Synthesis Basic components of the text-to-speech
IN recent decades following the introduction of hidden. Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. X, NO. X, MONTH, YEAR 1 Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition Chanwoo Kim and Richard M. Stern, Member,
Psycho-acoustics (Sound characteristics, Masking, and Loudness)
Psycho-acoustics (Sound characteristics, Masking, and Loudness) Tai-Shih Chi ( 冀泰石 ) Department of Communication Engineering National Chiao Tung University Mar. 20, 2008 Pure tones Mathematics of the pure
An Efficient Extraction of Vocal Portion from Music Accompaniment Using Trend Estimation
An Efficient Extraction of Vocal Portion from Music Accompaniment Using Trend Estimation Aisvarya V 1, Suganthy M 2 PG Student [Comm. Systems], Dept. of ECE, Sree Sastha Institute of Engg. & Tech., Chennai,
Different Approaches of Spectral Subtraction Method for Speech Enhancement
ISSN 2249 5460 Available online at www.internationalejournals.com International ejournals International Journal of Mathematical Sciences, Technology and Humanities 95 (2013 1056 1062 Different Approaches
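The entry above previews a survey of spectral subtraction variants for speech enhancement. As a minimal sketch of the basic power-subtraction rule only (the bin values below are made-up illustration numbers, not any specific method from that paper):

```python
def spectral_subtract(noisy_power, noise_power, floor=0.01):
    """Basic power spectral subtraction: remove the estimated noise
    power from each frequency bin, but never go below a small
    spectral floor (which limits "musical noise" artifacts)."""
    return [max(p - n, floor * n) for p, n in zip(noisy_power, noise_power)]

# Toy spectrum: one speech-dominated bin, two noise-dominated bins,
# with a flat noise power estimate of 1.0 per bin.
noisy = [10.0, 1.25, 0.5]
noise = [1.0, 1.0, 1.0]
print(spectral_subtract(noisy, noise))  # → [9.0, 0.25, 0.01]
```

Note that the third bin would go negative without the floor; clamping such bins is what distinguishes the many subtraction variants the survey compares.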
HCS 7367 Speech Perception
HCS 7367 Speech Perception Dr. Peter Assmann Fall 212 Power spectrum model of masking Assumptions: Only frequencies within the passband of the auditory filter contribute to masking. Detection is based
Advanced audio analysis. Martin Gasser
Advanced audio analysis Martin Gasser Motivation Which methods are common in MIR research? How can we parameterize audio signals? Interesting dimensions of audio: Spectral/ time/melody structure, high
DERIVATION OF TRAPS IN AUDITORY DOMAIN
DERIVATION OF TRAPS IN AUDITORY DOMAIN Petr Motlíček, Doctoral Degree Programme (4) Dept. of Computer Graphics and Multimedia, FIT, BUT E-mail: motlicek@fit.vutbr.cz Supervised by: Dr. Jan Černocký, Prof.
Speech Enhancement Using Beamforming Dr. G. Ramesh Babu 1, D. Lavanya 2, B. Yamuna 2, H. Divya 2, B. Shiva Kumar 2, B.
www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 4 Issue 4 April 2015, Page No. 11143-11147 Speech Enhancement Using Beamforming Dr. G. Ramesh Babu 1, D. Lavanya
IN a natural environment, speech often occurs simultaneously. Monaural Speech Segregation Based on Pitch Tracking and Amplitude Modulation
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 15, NO. 5, SEPTEMBER 2004 1135 Monaural Speech Segregation Based on Pitch Tracking and Amplitude Modulation Guoning Hu and DeLiang Wang, Fellow, IEEE Abstract
Speech Synthesis using Mel-Cepstral Coefficient Feature
Speech Synthesis using Mel-Cepstral Coefficient Feature By Lu Wang Senior Thesis in Electrical Engineering University of Illinois at Urbana-Champaign Advisor: Professor Mark Hasegawa-Johnson May 2018 Abstract
Using the Gammachirp Filter for Auditory Analysis of Speech
Using the Gammachirp Filter for Auditory Analysis of Speech 18.327: Wavelets and Filterbanks Alex Park malex@sls.lcs.mit.edu May 14, 2003 Abstract Modern automatic speech recognition (ASR) systems typically
University of Washington Department of Electrical Engineering Computer Speech Processing EE516 Winter 2005
University of Washington Department of Electrical Engineering Computer Speech Processing EE516 Winter 2005 Lecture 5 Slides Jan 26 th, 2005 Outline of Today s Lecture Announcements Filter-bank analysis
BINAURAL PROCESSING FOR ROBUST RECOGNITION OF DEGRADED SPEECH
BINAURAL PROCESSING FOR ROBUST RECOGNITION OF DEGRADED SPEECH Anjali Menon 1, Chanwoo Kim 2, Umpei Kurokawa 1, Richard M. Stern 1 1 Department of Electrical and Computer Engineering, Carnegie Mellon University,
Recent Advances in Acoustic Signal Extraction and Dereverberation
Recent Advances in Acoustic Signal Extraction and Dereverberation Emanuël Habets Erlangen Colloquium 2016 Scenario Spatial Filtering Estimated Desired Signal Undesired sound components: Sensor noise Competing
Introduction of Audio and Music
1 Introduction of Audio and Music Wei-Ta Chu 2009/12/3 Outline 2 Introduction of Audio Signals Introduction of Music 3 Introduction of Audio Signals Wei-Ta Chu 2009/12/3 Li and Drew, Fundamentals of Multimedia,
Simultaneous Recognition of Speech Commands by a Robot using a Small Microphone Array
2012 2nd International Conference on Computer Design and Engineering (ICCDE 2012) IPCSIT vol. 49 (2012) (2012) IACSIT Press, Singapore DOI: 10.7763/IPCSIT.2012.V49.14 Simultaneous Recognition of Speech
Lecture 14: Source Separation
ELEN E896 MUSIC SIGNAL PROCESSING Lecture 1: Source Separation 1. Sources, Mixtures, & Perception. Spatial Filtering 3. Time-Frequency Masking. Model-Based Separation Dan Ellis Dept. Electrical Engineering,
Auditory System For a Mobile Robot
Auditory System For a Mobile Robot PhD Thesis Jean-Marc Valin Department of Electrical Engineering and Computer Engineering Université de Sherbrooke, Québec, Canada Jean-Marc.Valin@USherbrooke.ca Motivations
APPLICATIONS OF DSP OBJECTIVES
APPLICATIONS OF DSP OBJECTIVES This lecture will discuss the following: Introduce analog and digital waveform coding Introduce Pulse Coded Modulation Consider speech-coding principles Introduce the channel
Tone-in-noise detection: Observed discrepancies in spectral integration. Nicolas Le Goff a) Technische Universiteit Eindhoven, P.O.
Tone-in-noise detection: Observed discrepancies in spectral integration Nicolas Le Goff a) Technische Universiteit Eindhoven, P.O. Box 513, NL-5600 MB Eindhoven, The Netherlands Armin Kohlrausch b) and
L19: Prosodic modification of speech
L19: Prosodic modification of speech Time-domain pitch synchronous overlap add (TD-PSOLA) Linear-prediction PSOLA Frequency-domain PSOLA Sinusoidal models Harmonic + noise models STRAIGHT This lecture
SOUND SOURCE RECOGNITION AND MODELING
SOUND SOURCE RECOGNITION AND MODELING CASA seminar, summer 2000 Antti Eronen antti.eronen@tut.fi Contents: Basics of human sound source recognition Timbre Voice recognition Recognition of environmental
IMPROVED COCKTAIL-PARTY PROCESSING
IMPROVED COCKTAIL-PARTY PROCESSING Alexis Favrot, Markus Erne Scopein Research Aarau, Switzerland postmaster@scopein.ch Christof Faller Audiovisual Communications Laboratory, LCAV Swiss Institute of Technology
Automatic Text-Independent Speaker Recognition Approaches Using Binaural Inputs
Automatic Text-Independent Speaker Recognition Approaches Using Binaural Inputs Karim Youssef, Sylvain Argentieri and Jean-Luc Zarader 1 Outline Automatic speaker recognition: introduction Designed systems
Machine recognition of speech trained on data from New Jersey Labs
Machine recognition of speech trained on data from New Jersey Labs Frequency response (peak around 5 Hz) Impulse response (effective length around 200 ms) 41 RASTA filter 10 attenuation [db] 40 1 10 modulation
Binaural segregation in multisource reverberant environments
Binaural segregation in multisource reverberant environments Nicoleta Roman a Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210 Soundararajan Srinivasan b
A cat's cocktail party: Psychophysical, neurophysiological, and computational studies of spatial release from masking
A cat's cocktail party: Psychophysical, neurophysiological, and computational studies of spatial release from masking Courtney C. Lane 1, Norbert Kopco 2, Bertrand Delgutte 1, Barbara G. Shinn- Cunningham
Binaural Segregation in Multisource Reverberant Environments
T e c h n i c a l R e p o r t O S U - C I S R C - 9 / 0 5 - T R 6 0 D e p a r t m e n t o f C o m p u t e r S c i e n c e a n d E n g i n e e r i n g T h e O h i o S t a t e U n i v e r s i t y C o l u
PERFORMANCE COMPARISON BETWEEN STEREAUSIS AND INCOHERENT WIDEBAND MUSIC FOR LOCALIZATION OF GROUND VEHICLES ABSTRACT
Approved for public release; distribution is unlimited. PERFORMANCE COMPARISON BETWEEN STEREAUSIS AND INCOHERENT WIDEBAND MUSIC FOR LOCALIZATION OF GROUND VEHICLES September 1999 Tien Pham U.S. Army Research
You know about adding up waves, e.g. from two loudspeakers. AUDL 4007 Auditory Perception. Week 2½. Mathematical prelude: Adding up levels
AUDL 47 Auditory Perception You know about adding up waves, e.g. from two loudspeakers Week 2½ Mathematical prelude: Adding up levels 2 But how do you get the total rms from the rms values of two signals
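The AUDL snippet above asks how the total rms follows from the rms values of two signals. A minimal sketch of the standard answer, assuming the two signals are uncorrelated so that their powers (squared rms values) add:

```python
import math

def combined_rms(rms1, rms2):
    """RMS of the sum of two uncorrelated signals: powers add."""
    return math.sqrt(rms1**2 + rms2**2)

def combined_level_db(l1_db, l2_db):
    """Total level in dB when two uncorrelated sounds are combined:
    convert each level to power, add, convert back."""
    return 10 * math.log10(10**(l1_db / 10) + 10**(l2_db / 10))

# Two equal 60 dB sources combine to about 63 dB, not 120 dB.
print(round(combined_level_db(60, 60), 1))  # → 63.0
```

For correlated signals (e.g. a wave added to a delayed copy of itself) the powers no longer simply add, which is the distinction the lecture's "adding up waves" prelude is driving at.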
Learning to Unlearn and Relearn Speech Signal Processing using Neural Networks: current and future perspectives
Learning to Unlearn and Relearn Speech Signal Processing using Neural Networks: current and future perspectives Mathew Magimai Doss Collaborators: Vinayak Abrol, Selen Hande Kabil, Hannah Muckenhirn, Dimitri
Mel-frequency cepstral coefficients (MFCCs) and gammatone filter banks
SGN- 14006 Audio and Speech Processing Pasi PerQlä SGN- 14006 2015 Mel- frequency cepstral coefficients (MFCCs) and gammatone filter banks Slides for this lecture are based on those created by Katariina
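The mel-spaced filter banks mentioned in the lecture snippet above are built on a warped frequency axis; a sketch using the common 2595·log10(1 + f/700) mel formula (the filter count and frequency range here are arbitrary illustration values, not taken from the course):

```python
import math

def hz_to_mel(f_hz):
    """Common mel-scale formula used when laying out MFCC filter banks."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0**(m / 2595.0) - 1.0)

# Center frequencies of a small filter bank: equally spaced in mel
# between 0 and 8 kHz, then mapped back to Hz.
n_filters = 10
mel_max = hz_to_mel(8000.0)
centers = [mel_to_hz(mel_max * i / (n_filters + 1))
           for i in range(1, n_filters + 1)]
print([round(c) for c in centers])
```

The printed centers crowd together at low frequencies and spread out at high frequencies, mirroring the roughly logarithmic frequency resolution of the cochlea.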
SOUND QUALITY EVALUATION OF FAN NOISE BASED ON HEARING-RELATED PARAMETERS SUMMARY INTRODUCTION
SOUND QUALITY EVALUATION OF FAN NOISE BASED ON HEARING-RELATED PARAMETERS Roland SOTTEK, Klaus GENUIT HEAD acoustics GmbH, Ebertstr. 30a 52134 Herzogenrath, GERMANY SUMMARY Sound quality evaluation of
SGN Audio and Speech Processing
Introduction 1 Course goals Introduction 2 SGN 14006 Audio and Speech Processing Lectures, Fall 2014 Anssi Klapuri Tampere University of Technology! Learn basics of audio signal processing Basic operations
Reduction of Musical Residual Noise Using Harmonic-Adapted-Median Filter
Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter Ching-Ta Lu, Kun-Fu Tseng 2, Chih-Tsung Chen 2 Department of Information Communication, Asia University, Taichung, Taiwan, ROC
Phase and Feedback in the Nonlinear Brain. Malcolm Slaney (IBM and Stanford) Hiroko Shiraiwa-Terasawa (Stanford) Regaip Sen (Stanford)
Phase and Feedback in the Nonlinear Brain Malcolm Slaney (IBM and Stanford) Hiroko Shiraiwa-Terasawa (Stanford) Regaip Sen (Stanford) Auditory processing pre-cosyne workshop March 23, 2004 Simplistic Models
A Neural Oscillator Sound Separator for Missing Data Speech Recognition
A Neural Oscillator Sound Separator for Missing Data Speech Recognition Guy J. Brown and Jon Barker Department of Computer Science University of Sheffield Regent Court, 211 Portobello Street, Sheffield
Cepstrum analysis of speech signals
Cepstrum analysis of speech signals ELEC-E5520 Speech and language processing methods Spring 2016 Mikko Kurimo 1 /48 Contents Literature and other material Idea and history of cepstrum Cepstrum and LP
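The idea previewed above — the cepstrum as the inverse transform of the log magnitude spectrum, in which periodic excitation appears as a peak at its period ("quefrency") — can be sketched in a few lines. The toy impulse-train signal is an illustration of the principle, not an example from the lecture:

```python
import cmath
import math

def dft(x):
    """Naive DFT, adequate for a short illustrative signal."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_cepstrum(x):
    """Real cepstrum: inverse DFT of the log magnitude spectrum
    (a small floor keeps the log finite at spectral zeros)."""
    n = len(x)
    log_mag = [math.log(abs(v) + 1e-6) for v in dft(x)]
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

# A periodic excitation (period 8 samples) yields a strong cepstral
# peak at quefrency 8, while non-multiples of 8 stay near zero.
x = [1.0 if t % 8 == 0 else 0.0 for t in range(64)]
c = real_cepstrum(x)
print(c[8] > 1.0, abs(c[5]) < 1e-6)  # → True True
```

This separation of slowly varying spectral envelope (low quefrency) from periodic excitation (high quefrency) is what makes the cepstrum useful for both feature extraction and pitch estimation.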
Joint Position-Pitch Decomposition for Multi-Speaker Tracking
Joint Position-Pitch Decomposition for Multi-Speaker Tracking SPSC Laboratory, TU Graz 1 Contents: 1. Microphone Arrays SPSC circular array Beamforming 2. Source Localization Direction of Arrival (DoA)
Chapter 2 Channel Equalization
Chapter 2 Channel Equalization 2.1 Introduction In wireless communication systems signal experiences distortion due to fading [17]. As signal propagates, it follows multiple paths between transmitter and
Robust Voice Activity Detection Based on Discrete Wavelet Transform
Robust Voice Activity Detection Based on Discrete Wavelet Transform Kun-Ching Wang Department of Information Technology & Communication Shin Chien University kunching@mail.kh.usc.edu.tw Abstract This paper
Pitch-Based Segregation of Reverberant Speech
Technical Report OSU-CISRC-4/5-TR22 Department of Computer Science and Engineering The Ohio State University Columbus, OH 4321-1277 Ftp site: ftp.cse.ohio-state.edu Login: anonymous Directory: pub/tech-report/25
Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech
INTERSPEECH 5 Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech M. A. Tuğtekin Turan and Engin Erzin Multimedia, Vision and Graphics Laboratory,
Complex Sounds. Reading: Yost Ch. 4
Complex Sounds Reading: Yost Ch. 4 Natural Sounds Most sounds in our everyday lives are not simple sinusoidal sounds, but are complex sounds, consisting of a sum of many sinusoids. The amplitude and frequency
MULTIPLE F0 ESTIMATION IN THE TRANSFORM DOMAIN
10th International Society for Music Information Retrieval Conference (ISMIR 2009 MULTIPLE F0 ESTIMATION IN THE TRANSFORM DOMAIN Christopher A. Santoro +* Corey I. Cheng *# + LSB Audio Tampa, FL 33610
Training neural network acoustic models on (multichannel) waveforms
View this talk on YouTube: https://youtu.be/si_8ea_ha8 Training neural network acoustic models on (multichannel) waveforms Ron Weiss in SANE 215 215-1-22 Joint work with Tara Sainath, Kevin Wilson, Andrew
Imagine the cochlea unrolled
2 2 1 1 1 1 1 Cochlea & Auditory Nerve: obligatory stages of auditory processing Think of the auditory periphery as a processor of signals 2 2 1 1 1 1 1 Imagine the cochlea unrolled Basilar membrane motion
Signals, Sound, and Sensation
Signals, Sound, and Sensation William M. Hartmann Department of Physics and Astronomy Michigan State University East Lansing, Michigan Л1Р Contents Preface xv Chapter 1: Pure Tones 1 Mathematics of the
Lateralisation of multiple sound sources by the auditory system
Modeling of Binaural Discrimination of multiple Sound Sources: A Contribution to the Development of a Cocktail-Party-Processor 4 H.SLATKY (Lehrstuhl für allgemeine Elektrotechnik und Akustik, Ruhr-Universität
SGN Audio and Speech Processing
SGN 14006 Audio and Speech Processing Introduction 1 Course goals Introduction 2! Learn basics of audio signal processing Basic operations and their underlying ideas and principles Give basic skills although
Robust Algorithms For Speech Reconstruction On Mobile Devices
Robust Algorithms For Speech Reconstruction On Mobile Devices XU SHAO A Thesis presented for the degree of Doctor of Philosophy Speech Group School of Computing Sciences University of East Anglia England
Voice Activity Detection
Voice Activity Detection Speech Processing Tom Bäckström Aalto University October 2015 Introduction Voice activity detection (VAD) (or speech activity detection, or speech detection) refers to a class
AUDL Final exam page 1/7 Please answer all of the following questions.
AUDL 11 28 Final exam page 1/7 Please answer all of the following questions. 1) Consider 8 harmonics of a sawtooth wave which has a fundamental period of 1 ms and a fundamental component with a level of
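The exam question previewed above concerns 8 harmonics of a sawtooth wave with a 1-ms fundamental period (i.e. a 1-kHz fundamental). As a sketch, assuming the idealized sawtooth whose harmonic amplitudes fall off as 1/n, the nth harmonic sits 20·log10(n) dB below the fundamental:

```python
import math

f0 = 1000.0  # fundamental period of 1 ms → 1 kHz fundamental

def sawtooth_partial_level_db(n):
    """Level of the nth harmonic relative to the fundamental for an
    ideal sawtooth (amplitudes proportional to 1/n)."""
    return 20.0 * math.log10(1.0 / n)

def sawtooth_8(t):
    """First 8 harmonics of the sawtooth: a sum of 1/n-weighted
    sinusoids at multiples of f0 (t in seconds)."""
    return sum(math.sin(2 * math.pi * n * f0 * t) / n for n in range(1, 9))

levels = [round(sawtooth_partial_level_db(n), 1) for n in range(1, 9)]
print(levels)  # → [0.0, -6.0, -9.5, -12.0, -14.0, -15.6, -16.9, -18.1]
```

So each doubling of harmonic number costs about 6 dB, which is why the sawtooth's spectral envelope falls at roughly 6 dB per octave.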
Improving Meetings with Microphone Array Algorithms. Ivan Tashev Microsoft Research
Improving Meetings with Microphone Array Algorithms Ivan Tashev Microsoft Research Why microphone arrays? They ensure better sound quality: less noises and reverberation Provide speaker position using
CS 188: Artificial Intelligence Spring Speech in an Hour
CS 188: Artificial Intelligence Spring 2006 Lecture 19: Speech Recognition 3/23/2006 Dan Klein UC Berkeley Many slides from Dan Jurafsky Speech in an Hour Speech input is an acoustic wave form s p ee ch
Project 0: Part 2 A second hands-on lab on Speech Processing Frequency-domain processing
Project : Part 2 A second hands-on lab on Speech Processing Frequency-domain processing February 24, 217 During this lab, you will have a first contact on frequency domain analysis of speech signals. You
Acoustics, signals & systems for audiology. Week 4. Signals through Systems
Acoustics, signals & systems for audiology Week 4 Signals through Systems Crucial ideas Any signal can be constructed as a sum of sine waves In a linear time-invariant (LTI) system, the response to a sinusoid
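The "crucial idea" cut off at the end of the snippet above — that an LTI system's response to a sinusoid is a sinusoid at the same frequency, only scaled and phase-shifted — can be checked numerically (the 3-tap moving-average filter is an arbitrary example system):

```python
import cmath
import math

# Moving-average FIR filter: a simple LTI system.
h = [1 / 3, 1 / 3, 1 / 3]
w = 2 * math.pi * 0.05  # input frequency, radians per sample

# Frequency response of the filter at w.
H = sum(hk * cmath.exp(-1j * w * k) for k, hk in enumerate(h))

# Filter a pure sinusoid and compare against the LTI prediction:
# same frequency, amplitude scaled by |H(w)|, phase shifted by angle(H(w)).
n = 500
y = sum(h[k] * math.sin(w * (n - k)) for k in range(len(h)))
predicted = abs(H) * math.sin(w * n + cmath.phase(H))
print(abs(y - predicted) < 1e-9)  # → True
```

This is why sinusoids are the natural "probe" signals for audiological and audio systems alike: a single complex number H(w) per frequency characterizes the system completely.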
Hearing and Deafness 2. Ear as a frequency analyzer. Chris Darwin
Hearing and Deafness 2. Ear as a analyzer Chris Darwin Frequency: -Hz Sine Wave. Spectrum Amplitude against -..5 Time (s) Waveform Amplitude against time amp Hz Frequency: 5-Hz Sine Wave. Spectrum Amplitude
AUDL 4007 Auditory Perception. Week 1. The cochlea & auditory nerve: Obligatory stages of auditory processing
AUDL 4007 Auditory Perception Week 1 The cochlea & auditory nerve: Obligatory stages of auditory processing 1 Think of the ear as a collection of systems, transforming sounds to be sent to the brain 25