Auditory System For a Mobile Robot


Auditory System For a Mobile Robot
PhD Thesis
Jean-Marc Valin
Department of Electrical Engineering and Computer Engineering, Université de Sherbrooke, Québec, Canada
Jean-Marc.Valin@USherbrooke.ca

Motivations
- Robots need information about their environment in order to be intelligent.
- Artificial vision has been popular for a long time, but artificial audition is new.
- Robust audition is essential for human-robot interaction (the cocktail party effect).

Approaches To Artificial Audition
- Single microphone: unreliable for human-robot interaction.
- Two microphones (binaural audition): imitates the human auditory system, but offers limited localisation and separation.
- Microphone array audition: more information available, simpler processing.

Objectives
- Localise and track simultaneous moving sound sources.
- Separate sound sources.
- Perform automatic speech recognition.
- Remain within robotics constraints: complexity and algorithmic delay; robustness to noise and reverberation; weight, space, and adaptability; moving sources and a moving robot.

Experimental Setup
- Eight microphones on the Spartacus robot, in two configurations: cube (C1) and shell (C2).
- Noisy conditions.
- Two environments with different reverberation times: lab (E1), 350 ms; hall (E2), 1 s.

Sound Source Localisation

Approaches to Sound Source Localisation
- Binaural: interaural phase difference (delay); interaural intensity difference.
- Microphone array: estimation through TDOAs; subspace methods (MUSIC); direct search (steered beamformer).
- Post-processing: Kalman filtering; particle filtering.

Steered Beamformer
- Delay-and-sum beamformer: steer over candidate directions and maximise the output energy.
- Computation is performed in the frequency domain.
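As a rough illustration, the steered response power can be computed in the frequency domain by phase-aligning the microphone spectra for a candidate direction and summing. This sketch assumes far-field (plane-wave) propagation, and the names and geometry are invented for the example, not taken from the thesis code:

```python
import numpy as np

def steered_power(X, mic_pos, direction, freqs, c=343.0):
    """Steered response power of a delay-and-sum beamformer.

    X        : (n_mics, n_freqs) complex STFT of one frame
    mic_pos  : (n_mics, 3) microphone positions in metres
    direction: (3,) unit vector pointing at the candidate source
    freqs    : (n_freqs,) bin frequencies in Hz
    """
    # Plane-wave arrival delay for each microphone (seconds)
    tau = mic_pos @ direction / c
    # Undo the delays in the frequency domain, then sum the channels
    steering = np.exp(2j * np.pi * freqs[None, :] * tau[:, None])
    y = np.sum(X * steering, axis=0)
    # Output energy: maximal when the steering matches the true direction
    return np.sum(np.abs(y) ** 2)
```

Scanning `steered_power` over a grid of directions and picking the maxima is the direct-search localisation the slide refers to.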

Spectral Weighting
- Normal cross-correlation peaks are very wide; the PHAse Transform (PHAT) yields narrow peaks.
- Apply a weighting according to noise and reverberation.
- Models the precedence effect: sensitivity is decreased after a loud sound.
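The effect of the PHAT weighting can be illustrated with a two-microphone time-delay estimate: dividing the cross-spectrum by its own magnitude whitens it, which turns the broad correlation peak into a sharp pulse at the true delay. A minimal sketch (the function name and the small regulariser are assumptions of this example):

```python
import numpy as np

def gcc_phat(x1, x2, fs):
    """Estimate the TDOA between two signals using GCC-PHAT."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n)
    max_shift = n // 2
    # Reorder so negative lags come first, then find the peak
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    delay = np.argmax(np.abs(cc)) - max_shift
    return delay / fs                        # TDOA in seconds
```

Without the whitening line, the same code computes the plain cross-correlation, whose peak is much wider and more easily smeared by reverberation.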

Direction Search
- Find the directions with the highest energy, for a fixed number of sources (Q = 4).
- Lookup-and-sum algorithm: 25 times less complex than a direct search.

Post-Processing: Particle Filtering
- Sources must be tracked over time, since the steered beamformer output is noisy.
- The pdf is represented as particles: one set of (1000) particles per source.
- State = [position, speed].

Particle Filtering Steps
1) Prediction.
2) Estimation of instantaneous probabilities, as a function of the steered beamformer energy.

Particle Filtering Steps (cont.)
3) Source-observation assignment: we need to know which observation is related to which tracked source. For each observation q, compute: the probability that q is a false alarm; the probability that q is source j; the probability that q is a new source.

Particle Filtering Steps (cont.)
4) Particle weight update: merging past and present information, taking into account the source-observation assignment.
5) Addition or removal of sources.
6) Estimation of source positions: weighted mean of the particle positions.
7) Resampling.

Localisation Results (E1)
[Charts: detection accuracy over distance; localisation accuracy.]

Tracking Results
Two sources crossing, with C2. [Videos: E1, E2.]

Tracking Results (cont.)
Four moving sources, with C2. [Videos: E1, E2.]

Sound Source Separation & Speech Recognition

Overview of Sound Source Separation
- Frequency-domain processing: simple, low complexity.
- Linear source separation followed by a non-linear post-filter, both using the tracking information.
- Pipeline: sources S_m(k, l) → microphones X_n(k, l) → geometric source separation → Y_m(k, l) → post-filter → separated sources Ŝ_m(k, l).

Geometric Source Separation
- Frequency-domain constrained optimisation: minimise the correlation of the outputs, subject to a geometric constraint derived from the localised source directions.
- Modifications to the original GSS algorithm: instantaneous computation of correlations; regularisation.
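A minimal per-bin sketch of such a constrained optimisation, using a soft penalty in place of a hard constraint and plain gradient steps on the output cross-correlation; the step size, penalty weight, and function name are all assumptions of this example, not the thesis algorithm:

```python
import numpy as np

def gss_bin(X, A, mu=0.005, lam=1.0, n_iter=200):
    """Geometric source separation for a single frequency bin (sketch).

    X : (n_mics, n_frames) complex mic spectra at this bin
    A : (n_mics, n_srcs) steering matrix from the localiser
    Gradient descent minimises the off-diagonal (cross-) correlation of
    the outputs, with a soft penalty keeping W A = I.
    """
    n_srcs = A.shape[1]
    W = np.linalg.pinv(A)                 # constraint-satisfying start
    Rxx = X @ X.conj().T / X.shape[1]     # correlation estimate
    I = np.eye(n_srcs)
    for _ in range(n_iter):
        Ryy = W @ Rxx @ W.conj().T
        E = Ryy - np.diag(np.diag(Ryy))   # residual cross-correlation
        grad = 4 * E @ W @ Rxx + 2 * lam * (W @ A - I) @ A.conj().T
        W = W - mu * grad
    return W
```

In a streaming system, `Rxx` would be replaced by the instantaneous per-frame estimate the slide mentions, with regularisation keeping the updates stable.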

Multi-Source Post-Filter

Interference Estimation
- Source separation leaks, due to: incomplete adaptation; inaccuracy in localisation; reverberation and diffraction; imperfect microphones.
- The interference is estimated from the other separated sources.
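Estimating interference from the other separated outputs might be sketched as follows, with a single hypothetical leakage coefficient `eta` standing in for the leak estimate used in the thesis:

```python
import numpy as np

def leak_estimate(source_energy, eta=0.1):
    """Interference ('leakage') estimate for each separated source.

    For source m, the interference is modelled as a fraction eta of the
    energy of all *other* separated outputs in the same bin and frame
    (eta is a hypothetical leakage coefficient).

    source_energy : (n_srcs,) energies |Y_m(k, l)|^2
    """
    source_energy = np.asarray(source_energy)
    total = np.sum(source_energy)
    return eta * (total - source_energy)   # sum over the other sources
```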

Reverberation Estimation
- Exponential decay model. [Example plot: 500 Hz frequency bin.]
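One plausible form of the exponential decay model is a first-order recursion per frequency bin, where the decay factor would be tied to the measured reverberation time and the gain to the room; both constants below are hypothetical illustration values:

```python
import numpy as np

def reverb_estimate(frame_energy, gamma, delta):
    """Recursive late-reverberation energy estimate for one bin (sketch).

    Models reverberation as an exponentially decaying echo of the past
    output energy. gamma could be derived from the reverberation time,
    e.g. 10 ** (-6 * frame_advance / RT60) for a 60 dB decay over RT60;
    delta is a room-dependent gain. Both values here are illustrative.

    frame_energy : sequence of |S(k, l)|^2 values for one frequency bin
    """
    lam = 0.0
    estimates = []
    for e in frame_energy:
        lam = gamma * lam + (1 - gamma) * delta * e
        estimates.append(lam)
    return np.array(estimates)
```

After an energy burst, the estimate decays geometrically by `gamma` each frame, which is the exponential tail the 500 Hz example plot shows.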

Results (SNR)
Three speakers, C2 (shell), E1 (lab). [Bar chart: SNR (dB), roughly -7.5 to 15 dB, for sources 1-3 under: input, delay-and-sum, GSS, GSS + single-source post-filter, GSS + multi-source post-filter.]

Speech Recognition Accuracy (Nuance)
- The proposed post-filter reduces errors by 50%.
- Reverberation removal helps in E2 only.
- No significant difference between C1 and C2.
- Digit recognition on the separated sources: 83% word correct with 3 speakers, 90% with 2 speakers.
[Bar chart: word correct (%), 50-90%, in E2 with C2 and 3 speakers, for the right, front, and left sources: GSS only, post-filter (no dereverb.), proposed system.]

Man vs. Machine
How does a human compare? [Bar chart: word correct (%), 50-90%, for listeners 1-5 and the proposed system.]
Is it fair? Yes and no!

Real-Time Application
Video from the AAAI conference.

Speech Recognition With Missing Feature Theory
- Speech is transformed into features (~12 per frame); not all features are reliable.
- MFT: ignore the unreliable features.
- Compute a missing feature mask, and use the mask when computing probabilities.
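With a diagonal-Gaussian acoustic model, the simplest use of a missing feature mask is marginalisation: unreliable feature dimensions are simply dropped from the likelihood. A sketch, with names and shapes assumed for the example:

```python
import numpy as np

def mft_log_likelihood(x, mask, mean, var):
    """Diagonal-Gaussian log-likelihood using only reliable features.

    x, mask, mean, var : (n_features,) arrays; mask is 1 where the
    feature is judged reliable, 0 where interference dominates.
    Marginalising out the unreliable dimensions is the simplest form
    of missing feature theory.
    """
    ll = -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
    return np.sum(ll * mask)           # masked dimensions contribute 0
```

A single corrupted dimension can otherwise dominate the score and push recognition toward the wrong model; masking it restores the decision made by the clean dimensions.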

Missing Feature Mask
- Interference: unreliable. Stationary noise: reliable.
- [Mask image: black = reliable, white = unreliable.]

Results (MFT)
Japanese isolated-word recognition (SIG2 robot, CTK): 3 simultaneous sources, 200-word vocabulary, 30/60/90 degrees of separation. [Bar chart: word correct (%), 0-80%, for the right, front, and left sources: GSS, GSS + post-filter, GSS + post-filter + MFT.]

Summary of the System

Conclusion
What have we achieved?
- Localisation and tracking of sound sources.
- Separation of multiple sources.
- A robust basis for human-robot interaction.
What are the main innovations?
- A frequency-domain steered beamformer.
- Particle filtering with source-observation assignment.
- A separation post-filter handling multiple sources and reverberation.
- Integration with missing feature theory.

Where From Here?
Future work:
- Complete dialogue system.
- Echo cancellation for the robot's own voice.
- Use of human-inspired techniques.
- Environmental sound recognition.
- Embedded implementation.
Other applications:
- Video-conferencing: automatically follow the speaker with a camera.
- Automatic transcription.

Questions? Comments?