Effects of Reverberation on Pitch, Onset/Offset, and Binaural Cues

1 Effects of Reverberation on Pitch, Onset/Offset, and Binaural Cues DeLiang Wang Perception & Neurodynamics Lab The Ohio State University

2 Outline of presentation
- Introduction
- Human performance
- Reverberation effects
  - On pitch
  - On onset/offset
  - On binaural cues
- Monaural enhancement of reverberant signal
- Binaural segregation of reverberant signal
- Discussion and summary

3 Reverberation as linear transmission system
$$x(t) = \int h(\tau)\, s(t - \tau)\, d\tau$$
- x(t): reverberant signal; s(t): source signal
- h(τ): room impulse response function
[Figure: room impulse response over time (ms), with early reflections and late reflections marked]
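
A minimal sketch of the convolution model on this slide, in discrete time. The impulse response here is a toy assumption (a unit direct path followed by an exponentially decaying Gaussian noise tail), not a measured room response, and the names `synthetic_rir` and `reverberate` are hypothetical.

```python
import math
import random

def synthetic_rir(t60, fs, seed=0):
    """Toy impulse response: direct path plus a noise tail decaying 60 dB in t60 s."""
    rng = random.Random(seed)
    n = int(round(t60 * fs))
    # Amplitude falls by a factor of 10**-3 (60 dB) after t60 seconds.
    decay = 3.0 * math.log(10.0) / (t60 * fs)
    h = [0.1 * rng.gauss(0.0, 1.0) * math.exp(-decay * k) for k in range(n)]
    h[0] = 1.0  # direct sound
    return h

def reverberate(s, h):
    """Discrete form of x(t) = integral of h(tau) s(t - tau) d tau."""
    x = [0.0] * (len(s) + len(h) - 1)
    for n, s_n in enumerate(s):
        if s_n == 0.0:
            continue  # skip silent samples
        for k, h_k in enumerate(h):
            x[n + k] += s_n * h_k
    return x

fs = 2000
h = synthetic_rir(t60=0.3, fs=fs)
click = [1.0] + [0.0] * 99   # a unit impulse as the source s
x = reverberate(click, h)    # the reverberant signal is then h itself
```

Feeding a unit impulse through the model returns the impulse response, which is a quick sanity check on the convolution.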

4 Reverberation and speech quality
Room reverberation causes two distinct perceptual effects on speech quality:
- Early reflections lead to coloration or spectral deviation, determined by the signal-to-reverberant energy ratio; they also boost loudness
- Late reflections (long-term reverberation) smear the time-frequency components of speech, and are characterized by the reverberation time (T60)
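
One common way to obtain T60 from an impulse response is Schroeder backward integration followed by a line fit to the decay curve. The sketch below assumes that method, a fit range of -5 to -25 dB, and an idealized exponential impulse response; none of these choices come from the slides, and `estimate_t60` is a hypothetical helper.

```python
import math

def estimate_t60(h, fs, fit_range_db=(-5.0, -25.0)):
    """Estimate T60 from impulse response h via Schroeder backward integration."""
    # Energy decay curve: energy remaining from each sample to the end.
    edc = [0.0] * len(h)
    tail = 0.0
    for n in range(len(h) - 1, -1, -1):
        tail += h[n] * h[n]
        edc[n] = tail
    upper, lower = fit_range_db
    pts = []
    for n, e in enumerate(edc):
        if e <= 0.0:
            break
        level = 10.0 * math.log10(e / edc[0])
        if lower <= level <= upper:
            pts.append((n / fs, level))
    # Least-squares slope of level (dB) against time (s).
    mean_t = sum(t for t, _ in pts) / len(pts)
    mean_l = sum(l for _, l in pts) / len(pts)
    slope = (sum((t - mean_t) * (l - mean_l) for t, l in pts)
             / sum((t - mean_t) ** 2 for t, _ in pts))
    return -60.0 / slope  # time needed to decay by 60 dB

fs = 1000
true_t60 = 0.3
decay = 3.0 * math.log(10.0) / (true_t60 * fs)
h = [math.exp(-decay * n) for n in range(fs)]  # ideal exponential decay, 1 s long
estimated_t60 = estimate_t60(h, fs)
```

For a measured response with a noise floor, the fit range and truncation point would need more care than this sketch takes.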

5 Human performance
- Though speech perception in quiet seems robust to reverberation, speech intelligibility in noise suffers in the presence of reverberation (Plomp 1976; Culling et al. 2003)
- Culling et al. showed that reverberation (T60 = 0.4 s) produces a 5 dB increase in speech reception threshold when naturally intonated speech is presented together with a competing talker
- Hearing-impaired listeners are particularly susceptible to reverberation
- The binaural advantage for speech perception in noise is diminished by reverberation
- The Culling et al. study found no advantage at all
[Figure from Culling et al. (2003)]

6 Human performance
Darwin and Hukin (2000) compared reverberation effects on spatial, pitch, and vocal-tract size cues for sequential organization and found that:
- ITD cues are seriously impaired by reverberation
- Pitch cues (F0 trajectory) are more resistant
- A combination of pitch and vocal-tract size cues is very resistant to reverberation

8 Pitch tracking of a single utterance
[Figure: frequency-vs-time panels for clean and reverberant (T60 = 0.3 s) male and female utterances, with pitch-tracking panels (pitch as time lag vs. time in seconds) comparing clean and reverberant tracks]
- Pitch is fairly robust to reverberation, especially for slowly changing pitch tracks and long voiced speech segments
- Noticeable artifacts: elongated pitch tracks
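
A hedged illustration of why pitch, measured as a time lag, survives reverberation in the easiest case: a steady harmonic tone. The sketch uses a plain normalized autocorrelation rather than the tracker behind the slide's figures, and the toy impulse response (decaying Gaussian noise, T60 = 0.3 s) is an assumption. A constant pitch is the limiting case of a "slowly changing" track, so this shows the favorable end of the range only.

```python
import math
import random

def normalized_autocorrelation(x, lag):
    """Correlation between a frame and its copy shifted by `lag` samples."""
    a, b = x[:len(x) - lag], x[lag:]
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den > 0.0 else 0.0

def pitch_lag(frame, min_lag, max_lag):
    """Lag (in samples) with the highest normalized autocorrelation."""
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: normalized_autocorrelation(frame, lag))

fs = 4000
f0 = 100.0                      # pitch period = fs / f0 = 40 samples
n_rir = int(round(0.3 * fs))    # impulse response length for T60 = 0.3 s
n_frame = 400
# Harmonic complex with five harmonics of decreasing amplitude.
source = [sum(math.sin(2.0 * math.pi * f0 * k * n / fs) / k for k in range(1, 6))
          for n in range(n_rir + n_frame)]

rng = random.Random(1)
decay = 3.0 * math.log(10.0) / n_rir
rir = [0.1 * rng.gauss(0.0, 1.0) * math.exp(-decay * k) for k in range(n_rir)]
rir[0] = 1.0

# Analyze a frame taken after the reverberant build-up has settled.
clean_frame = source[n_rir:]
reverberant_frame = [sum(rir[k] * source[n - k] for k in range(n_rir))
                     for n in range(n_rir, n_rir + n_frame)]
clean_lag = pitch_lag(clean_frame, 20, 75)
reverberant_lag = pitch_lag(reverberant_frame, 20, 75)
```

With a gliding pitch or a short voiced segment, the reverberant tail overlaps frames of different pitch, which is where the elongated tracks noted on the slide would come from.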

9 Pitch tracking of two utterances
[Figure: reverberant mixture (T60 = 0.3 s), frequency (Hz) vs. time (sec); pitch-tracking panel (pitch as time lag) showing one-source tracking and two-source tracking]
- Multipitch tracking using the Wu et al. (2003) algorithm
- Even with multiple reverberant sources, pitch tracking works reasonably well

10 Reverberation effects on harmonic structure
[Figure from Darwin and Hukin (2000). The utterance is "Could you please write the word bead down now." T60 = 0.4 s]
- Primarily in the low-frequency range

11 Implications on pitch-based grouping
[Figure: histograms of selected peaks (pitch as time lag) for the clean and T60 = 0.3 s conditions, and pitch tracks (clean vs. reverberant) over time (sec)]
- Smearing of harmonic structure is worse in the high-frequency range. The figure shows the histogram of peak positions that are nearest to the detected pitch periods for frequencies greater than 8 Hz.
- This smearing effect would degrade the performance of pitch-based grouping.

12 Reverberation effects on temporal envelope
[Figure: amplitude (dB) vs. time (s). (a) Smoothed temporal envelope of anechoic utterance; (b) smoothed temporal envelope of reverberant utterance]
- Response envelope of a gammatone filter centered near 1 kHz to the utterance "That noise problem grows more annoying each day." (a) T60 = 0 s and (b) T60 = 0.3 s
- Amplitude modulation (AM) depth is reduced, but the AM pattern is reasonably maintained
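
The reduction in AM depth can be sketched with a simple intensity-domain model. This is an assumption borrowed from the modulation transfer function literature, not the gammatone analysis on the slide: the intensity envelope is convolved with an exponential energy decay, and the resulting depth is compared with the textbook prediction m(F) = 1 / sqrt(1 + (2πF·T60/13.8)²). The 4 Hz modulation rate is an arbitrary illustrative choice.

```python
import math

def reverberant_envelope(intensity, t60, fs):
    """Convolve an intensity envelope with an exponential energy decay (one-pole filter)."""
    # Energy falls by 60 dB (a factor of 10**-6) after t60 seconds.
    a = math.exp(-6.0 * math.log(10.0) / (t60 * fs))
    out, y = [], 0.0
    for v in intensity:
        y = a * y + (1.0 - a) * v
        out.append(y)
    return out

def modulation_depth(env):
    """Depth of modulation, (max - min) / (max + min)."""
    return (max(env) - min(env)) / (max(env) + min(env))

fs = 1000       # envelope sampling rate (Hz)
f_mod = 4.0     # modulation rate (Hz), illustrative
t60 = 0.3
clean = [1.0 + math.cos(2.0 * math.pi * f_mod * n / fs) for n in range(3 * fs)]
reverberant = reverberant_envelope(clean, t60, fs)

steady = slice(2 * fs, 3 * fs)  # skip the start-up transient
depth_clean = modulation_depth(clean[steady])
depth_reverberant = modulation_depth(reverberant[steady])
predicted = 1.0 / math.sqrt(1.0 + (2.0 * math.pi * f_mod * t60 / 13.8) ** 2)
```

The filtered envelope is still a 4 Hz sinusoid with a phase lag, so the pattern survives while the depth drops, consistent with the slide's observation.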

13 Onset and offset detection
[Figure: cochleogram representation, frequency (Hz) vs. time (s), of (a) the anechoic utterance and (b) the reverberant utterance. Red/black marks indicate detected onsets/offsets.]
The utterance: "That noise problem grows more annoying each day."

14 Reverberation effects on onset/offset detection
Both the times and strengths of onsets and offsets are affected:
- Onset times are slightly shifted
- Onsets of weak phones (e.g., unvoiced stops) are smeared
- Offset times are shifted forward (delayed)
- Reverberation introduces spurious offsets
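
A toy illustration of the asymmetry between onsets and offsets. It assumes a threshold-crossing detector (20 dB below the envelope peak) and an exponential energy decay applied to a rectangular burst. This is not the detector used for the cochleogram figures; it only shows why the decaying tail delays offsets far more than onsets.

```python
import math

def smear(intensity, t60, fs):
    """Apply an exponential energy decay (60 dB in t60 seconds) to an intensity envelope."""
    a = math.exp(-6.0 * math.log(10.0) / (t60 * fs))
    out, y = [], 0.0
    for v in intensity:
        y = a * y + (1.0 - a) * v
        out.append(y)
    return out

def onset_offset(env, fs, drop_db=20.0):
    """First and last times (s) at which env is within drop_db of its peak."""
    threshold = max(env) * 10.0 ** (-drop_db / 10.0)
    active = [n for n, v in enumerate(env) if v >= threshold]
    return active[0] / fs, active[-1] / fs

fs = 1000
burst = [1.0 if 100 <= n < 300 else 0.0 for n in range(1000)]  # on from 0.1 s to 0.3 s
reverberant = smear(burst, t60=0.3, fs=fs)

onset_clean, offset_clean = onset_offset(burst, fs)
onset_rev, offset_rev = onset_offset(reverberant, fs)
offset_delay = offset_rev - offset_clean
```

Under these assumptions the offset moves later by roughly T60 × 20/60, about 0.1 s, while the onset barely moves. A derivative-based detector would give different numbers but the same direction of shift.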

15 Reverberation effects on binaural cues: ITD
- Shinn-Cunningham and Kawakyu (2003) showed that the responses of a neural model to ITD (interaural time difference) are poor indicators of source azimuth in the presence of reverberation
- Integration over time enhances the estimation robustness

16 ITD estimation in time-frequency (T-F) units
[Figure: azimuth histograms (channel center frequency in Hz vs. azimuth in degrees) for a target source at 45°, in the anechoic and T60 = 0.3 s conditions, each with its across-frequency integration; time axis in seconds]
- ITD estimation in individual T-F units using a cross-correlation model (Roman et al. 2003). The input is natural speech.
- The distribution of local azimuth estimates is much noisier in the reverberant condition
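
A minimal sketch of cross-correlation ITD estimation for a single frame, in the anechoic case only. The broadband noise input, the 16 kHz sampling rate, and the ±16-sample search range are assumptions. The model cited on the slide works per frequency channel and per time frame, which this sketch does not reproduce.

```python
import random

def cross_correlation_lag(left, right, max_lag):
    """Lag in samples (positive when `right` trails `left`) that maximizes cross-correlation."""
    def corr(lag):
        if lag >= 0:
            pairs = zip(left[:len(left) - lag], right[lag:])
        else:
            pairs = zip(left[-lag:], right[:len(right) + lag])
        return sum(l * r for l, r in pairs)
    return max(range(-max_lag, max_lag + 1), key=corr)

rng = random.Random(0)
fs = 16000
source = [rng.gauss(0.0, 1.0) for _ in range(800)]
true_delay = 5                                   # samples; right ear trails left ear
left = source
right = [0.0] * true_delay + source[:-true_delay]

lag = cross_correlation_lag(left, right, max_lag=16)
itd_ms = 1000.0 * lag / fs
```

In a reverberant frame the reflections add competing correlation peaks, so per-unit estimates scatter around the true lag. Pooling across time or frequency, as the slides suggest, is one way to recover a stable estimate.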

17 Interaural intensity difference estimation in T-F units
[Figure: IID histograms (channel center frequency in Hz vs. IID in dB) for a target source at 45°, in the anechoic and T60 = 0.3 s conditions; mean IID for one utterance vs. channel center frequency (Hz), clean and reverberant]
- The distribution of IID (interaural intensity difference) is also much noisier in reverberation, and the mean IID values lose their distinguishing characteristics
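
For reference, a frame-level IID can be computed as an energy ratio in dB. The sketch below handles one frequency channel only; the sign convention (left over right), the frame length, and the helper names are assumptions rather than details from the slides.

```python
import math

def iid_db(left, right, eps=1e-12):
    """Interaural intensity difference of one frame, in dB (left relative to right)."""
    e_left = sum(v * v for v in left) + eps    # eps guards against silent frames
    e_right = sum(v * v for v in right) + eps
    return 10.0 * math.log10(e_left / e_right)

def frame_iids(left, right, frame_len):
    """IID for each non-overlapping frame of a channel signal."""
    return [iid_db(left[i:i + frame_len], right[i:i + frame_len])
            for i in range(0, len(left) - frame_len + 1, frame_len)]

left = [math.sin(0.3 * n) for n in range(400)]
right = [0.5 * v for v in left]      # right ear at half the amplitude
iids = frame_iids(left, right, frame_len=100)
mean_iid = sum(iids) / len(iids)     # about 6.02 dB for a 2:1 amplitude ratio
```

Averaging the per-frame values within each channel gives a mean-IID curve like the one described for the figure.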

19 A two-stage enhancement algorithm (Wu 2003)
- Identify an inverse filter to reduce coloration distortion by maximizing kurtosis of the LPC residue (Gillespie et al. 2001)
[Figure: clean speech (kurtosis = 12.2) and reverberant speech (kurtosis = 3.6), plotted over time (ms)]
- Estimate and subtract the effects of long-term reverberation
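
The kurtosis criterion rests on the observation that a peaky residue becomes more Gaussian-like once reverberation smears it. The sketch below only illustrates that observation with a synthetic pulse train and a toy decaying-noise filter, both assumptions. It does not perform LPC analysis or the adaptive inverse filtering of the cited method, and it is not expected to reproduce the kurtosis values on the slide.

```python
import math
import random

def kurtosis(x):
    """Fourth standardized moment (equals 3 for a Gaussian signal)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return sum((v - mean) ** 4 for v in x) / (n * var * var)

# Stand-in for a clean residue: one pulse per 80-sample pitch period.
residue = [1.0 if n % 80 == 0 else 0.0 for n in range(4000)]

# Toy reverberation tail: exponentially decaying Gaussian noise.
rng = random.Random(0)
tail = [rng.gauss(0.0, 1.0) * math.exp(-0.01 * k) for k in range(400)]

# Convolve the pulse train with the tail.
smeared = [0.0] * (len(residue) + len(tail) - 1)
for n, v in enumerate(residue):
    if v != 0.0:
        for k, t in enumerate(tail):
            smeared[n + k] += v * t

kurtosis_clean = kurtosis(residue)
kurtosis_smeared = kurtosis(smeared)
```

An inverse filter tuned to raise the kurtosis of the filtered residue would push the signal back toward the peaky, less reverberant shape.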

20 Results of Wu's enhancement algorithm
- Original speech
- Reverberant speech
- Inverse-filtered speech
- Enhanced speech

21 Binaural segregation of reverberant speech
- Roman and Wang (2004) proposed a figure-ground segregation strategy to identify the T-F units dominated by the target using spatial information, without imposing restrictions on the number, location, or content of interfering sources
- Basic idea
  - First, perform cancellation of the reverberant target (with detected target location) using adaptive filtering
  - Then, label those T-F units that have been largely attenuated in the first stage, since they are more likely to originate from the target location
[Block diagram: inputs H1 S + N1 and H2 S + N2, adaptive filter W with subtraction, two DFT matrix blocks, binary mask]
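
The labeling step can be sketched as a threshold on per-unit attenuation. The energies before and after cancellation are taken as given here (no adaptive filter is implemented), and the 6 dB threshold and the function name `attenuation_mask` are hypothetical choices for illustration.

```python
import math

def attenuation_mask(energy_before, energy_after, threshold_db=6.0, eps=1e-12):
    """Label a T-F unit 1 (target-dominated) when cancellation attenuated it by at least threshold_db."""
    mask = []
    for row_before, row_after in zip(energy_before, energy_after):
        row = []
        for before, after in zip(row_before, row_after):
            attenuation = 10.0 * math.log10((before + eps) / (after + eps))
            row.append(1 if attenuation >= threshold_db else 0)
        mask.append(row)
    return mask

# Rows are frequency channels, columns are time frames (illustrative values).
energy_before = [[1.0, 1.0],
                 [4.0, 2.0]]
energy_after = [[0.1, 0.9],
                [1.0, 1.9]]
mask = attenuation_mask(energy_before, energy_after)
```

Units that the target canceller barely changes are attributed to interference and receive a 0.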

22 Segregation results
An example with a target speaker at 0° and 4 other interfering speakers at (-135°, -45°, 45°, 135°) and T60 = 0.3 s

23 ASR results
The segregation output is fed to a missing data recognizer (Cooke et al. 2001)
[Figure: (a) 5-speaker configuration; (b) nonspeech intrusion: rock music at 45°. Conditions compared: baseline performance, estimated binary mask, ideal binary mask]

24 Summary and discussion
- Reverberation corrupts auditory cues
  - Pitch estimation is relatively robust, but harmonic structure is smeared, particularly at high frequencies
  - AM depth is reduced but the AM pattern is reasonably maintained
  - Onset times, and especially offset times, are shifted; onset and offset synchrony is weakened
  - Binaural cues become unreliable
- A two-stage monaural algorithm for reverberant speech enhancement
- A binaural algorithm for segregating reverberant speech
- Issues
  - What is ground truth pitch for a reverberant signal?
  - Dereverberation versus enhancement
  - How to deal with both segregation and reverberation monaurally?

25 Acknowledgment
- N. Roman and G. Hu for performing some computer experiments
- Funding by AFOSR/AFRL and NSF
