Monaural and Binaural Speech Separation

Monaural and Binaural Speech Separation
DeLiang Wang
Perception & Neurodynamics Lab, The Ohio State University

Outline of presentation
- Introduction
- CASA approach to sound separation
- Ideal binary mask as CASA goal
- Voiced speech separation based on pitch tracking and amplitude modulation analysis
- Unvoiced speech separation based on onset/offset analysis
- Binaural separation based on sound localization
- Summary and discussion

Auditory scene analysis
- The auditory system shows a remarkable capacity for monaural segregation of sound sources in the perceptual process of auditory scene analysis (ASA)
- ASA takes place in two conceptual stages (Bregman, 1990):
  - Segmentation: decompose the acoustic signal into segments (sensory elements)
  - Grouping: combine segments into streams, so that the segments of the same stream likely originate from the same source
- Computational ASA (CASA) approaches sound separation based on ASA principles

Ideal binary mask as CASA goal
- Key idea: retain the parts of a target sound that are stronger than the acoustic background, and mask out the parts where interference dominates the target
- Broadly consistent with auditory masking and speech intelligibility results
- Within a local time-frequency (T-F) unit, the ideal binary mask is 1 if the target energy is stronger than the interference energy, and 0 otherwise
- This amounts to a local 0-dB SNR criterion for mask generation
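The 0-dB criterion translates directly into code. A minimal sketch in NumPy (the per-unit energy arrays and the toy 2x2 grid are illustrative, not data from the talk):

```python
import numpy as np

def ideal_binary_mask(target_energy, interference_energy, lc_db=0.0):
    """Ideal binary mask: 1 where the local SNR exceeds the criterion
    (0 dB by default), 0 otherwise. Inputs are per-T-F-unit energies."""
    eps = np.finfo(float).eps  # guard against log of zero
    local_snr_db = 10.0 * np.log10((target_energy + eps) / (interference_energy + eps))
    return (local_snr_db > lc_db).astype(int)

# Toy 2x2 grid of T-F units: the target dominates two of them
target = np.array([[4.0, 0.5], [1.0, 9.0]])
interference = np.array([[1.0, 2.0], [1.0, 3.0]])
print(ideal_binary_mask(target, interference))
# [[1 0]
#  [0 1]]
```

Note that the unit with equal target and interference energy (local SNR exactly 0 dB) is assigned 0, since the mask requires the target to be strictly stronger.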

Ideal binary mask illustration

Monaural segregation of voiced speech
- For voiced speech, lower harmonics are resolved while higher harmonics are not
- For unresolved harmonics, a filter channel responds to multiple harmonics, and its response is amplitude modulated (AM)
- The CASA model of Hu & Wang (2004) applies different grouping mechanisms in the low-frequency and high-frequency ranges:
  - Low-frequency signals are grouped based on periodicity and temporal continuity
  - High-frequency signals are grouped based on AM and temporal continuity

AM illustration
(a) The output of a gammatone filter (center frequency: 2.6 kHz) in response to clean speech
(b) The corresponding autocorrelation function

Voiced speech segregation example

Segmentation and unvoiced speech separation
- To deal with unvoiced speech segregation, Hu and Wang (2004) proposed a model of auditory segmentation that applies to both voiced and unvoiced speech
- The task of segmentation is to decompose an auditory scene into contiguous T-F regions, each of which should contain signal from the same sound source
  - This definition of segmentation does not distinguish between voiced and unvoiced sounds
- Segmentation is equivalent to identifying the onsets and offsets of individual T-F regions, which generally correspond to sudden changes of acoustic energy
- The segmentation strategy is therefore based on onset and offset analysis

Scale-space analysis for auditory segmentation
- From a computational standpoint, auditory segmentation is similar to image (visual) segmentation:
  - Visual segmentation: finding bounding contours of visual objects
  - Auditory segmentation: finding onset and offset fronts of segments
- Onset/offset analysis employs scale-space theory, a multiscale analysis commonly used in image segmentation:
  - Smoothing
  - Onset/offset detection and onset/offset front matching
  - Multiscale integration
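The smoothing and detection steps can be sketched at a single scale: smooth a channel's energy track with a Gaussian, then mark onsets as local maxima of the time derivative that exceed a threshold (offsets would use the minima). This is a rough illustration, not the Hu & Wang implementation; the kernel width, threshold, and toy energy track are all assumptions:

```python
import numpy as np

def smooth(x, sigma):
    """Gaussian smoothing, the scale-space step; sigma is in samples,
    and a larger sigma corresponds to a coarser scale."""
    radius = int(4 * sigma)
    n = np.arange(-radius, radius + 1)
    kernel = np.exp(-n**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    return np.convolve(x, kernel, mode="same")

def detect_onsets(energy_db, sigma, threshold):
    """Onsets = local maxima of the first derivative of the smoothed
    energy track that exceed the threshold (in dB per frame)."""
    d = np.diff(smooth(energy_db, sigma))
    return [i for i in range(1, len(d) - 1)
            if d[i] > d[i - 1] and d[i] >= d[i + 1] and d[i] > threshold]

# Toy energy track: silence, a burst, silence, a second burst
e = np.concatenate([np.full(50, -60.0), np.full(40, -10.0),
                    np.full(50, -60.0), np.full(40, -10.0)])
print(detect_onsets(e, sigma=2.0, threshold=5.0))  # two onsets, near 50 and 140
```

Running the same detector at several values of sigma and matching the resulting onset/offset fronts across scales would correspond to the multiscale integration step listed above.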

Example of segregating fricatives/affricates
- Utterance: "That noise problem grows more annoying each day"
- Interference: crowd noise with music
(IBM: ideal binary mask)

Binaural segregation of natural speech
- Binaural speech segregation is applicable to both voiced and unvoiced speech
- The binaural segregation model of Roman, Wang, & Brown (2003) focuses on localization cues:
  - Interaural time difference (ITD)
  - Interaural intensity difference (IID)
- The model explicitly estimates ideal binary masks using supervised learning

Ideal binary mask estimation
- For narrowband stimuli, systematic changes in the extracted ITD and IID values occur as the relative strength of the target signal (vs. the mixture) changes; this interaction produces characteristic clustering in the joint ITD-IID space
- The core of the model lies in deriving the statistical relation between the relative strength and the binaural cues
- Supervised training is performed independently for different spatial configurations and different frequency bands in the joint ITD-IID space
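Extracting the two cues from a left/right signal pair can be sketched as below: ITD from the cross-correlation peak lag and IID from the energy ratio in dB. This shows only the cue-extraction step, not the supervised mask estimation; the synthetic delay and attenuation values are illustrative:

```python
import numpy as np

def itd_iid(left, right, fs):
    """Binaural cues for one channel: ITD from the cross-correlation
    peak lag (positive means the left signal lags), IID as the
    left/right energy ratio in dB."""
    xc = np.correlate(left, right, mode="full")
    lag = np.argmax(xc) - (len(right) - 1)   # delay of left vs. right, samples
    itd = lag / fs
    iid = 10.0 * np.log10(np.sum(left**2) / np.sum(right**2))
    return itd, iid

# Synthetic pair: a source off to the right, so the right ear leads
# and receives more energy
fs = 16000
rng = np.random.default_rng(0)
s = rng.standard_normal(2048)
delay = 8                                     # 0.5 ms interaural delay
right = s
left = 0.5 * np.concatenate([np.zeros(delay), s[:-delay]])  # delayed, quieter
itd, iid = itd_iid(left, right, fs)
print(itd * 1000, iid)   # about 0.5 ms and about -6 dB
```

In the model described above, such (ITD, IID) pairs, computed per T-F unit, are the features from which the trained classifiers estimate the ideal binary mask.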

Example (target: 0 o, noise: 30 o ) Target Noise Mixture Ideal binary mask Result 14

Summary and discussion
- It pays to have an additional microphone:
  - Binaural segregation produces better results than monaural segregation
  - It works equally well for voiced and unvoiced speech
- Binaural segregation employs spatial cues, whereas monaural segregation exploits intrinsic sound characteristics
- Limitations of binaural (and microphone-array) segregation:
  - It cannot deal with single-microphone mixtures
  - Configuration stationarity: what if the target sound switches between different sound sources, or the target changes its location and orientation?
- Can one achieve general separation without analyzing sound characteristics?