Improving Meetings with Microphone Array Algorithms. Ivan Tashev, Microsoft Research


Why microphone arrays?
They ensure better sound quality: less noise and reverberation.
They provide the speaker position using sound source localization algorithms.
These technologies are used in the upper levels of meeting recording and broadcasting systems:
Speaker position awareness for better UI
Assisting speaker clustering and segmentation
Better speech recognition for meeting annotation and transcription
Input data for machine learning enabled applications

Better audio quality and user experience with MicArrays
Meeting attendees look awkward wearing microphones; nobody likes to be tethered.
Capturing sound from a single point is difficult:
A single microphone captures ambient noise and reverberation.
Due to interference with reflected sound waves, some frequencies can be enhanced and others completely suppressed.
A microphone array is a set of microphones positioned closely together.
The signals are captured synchronously and processed together.
Beamforming is the ability to make the microphone array listen to a given location while suppressing signals coming from other locations. It is electronically steerable.
Another name for this type of processing is spatial filtering.

Delay and sum beamformer
The most straightforward approach.
Since sound from the desired direction reaches the microphones with different delays, simply delay the microphone signals appropriately and sum them.
The mismatched shifts (phases) of signals coming from other directions reduce their amplitude.
Fast and easy to implement.
Major problems:
The shape of the beam is different for different frequencies.
Almost no directivity in the lower part of the frequency band.
Side lobes (one or more) appear in the upper part of the frequency band.
Used for comparison as a baseline.

Delay and sum beamformer
Figure: delay and sum beamformer gain vs. frequency and angle.
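To make the delay-and-sum idea concrete, here is a minimal sketch in Python/NumPy. It assumes a far-field source, a linear array whose microphone x-coordinates are given in mic_x, multichannel input x shaped (num_mics, num_samples), and a sound speed of 343 m/s; all names and parameters are illustrative, not the code behind these slides.

```python
import numpy as np

def delay_and_sum(x, mic_x, steer_deg, fs, c=343.0):
    """Steer a linear array by phase-shifting each channel in the frequency
    domain to time-align the desired direction, then averaging the channels."""
    num_mics, num_samples = x.shape
    # Far-field delay of each microphone relative to the array origin (seconds).
    tau = mic_x * np.cos(np.deg2rad(steer_deg)) / c
    X = np.fft.rfft(x, axis=1)                    # per-channel spectra
    f = np.fft.rfftfreq(num_samples, d=1.0 / fs)  # bin frequencies in Hz
    aligned = X * np.exp(2j * np.pi * f * tau[:, None])
    return np.fft.irfft(np.mean(aligned, axis=0), n=num_samples)

# Example: 4 microphones spaced 8 cm apart, beam steered to broadside (90 degrees).
fs = 16000
mic_x = np.arange(4) * 0.08
x = np.random.randn(4, fs)                        # stand-in for one second of audio
y = delay_and_sum(x, mic_x, 90.0, fs)
```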

Time vs. frequency domain
Time domain processing:
More natural; used in most of the common beamforming algorithms (GSC, etc.)
No time spent on conversion
Requires long filters, very slow!
Frequency domain processing:
CPU time is spent on the conversion
Long filters become vector multiplications, much faster!
Many other types of audio signal processing are faster as well

Generalized beamformer
All time domain beamforming algorithms can be converted to processing in the frequency domain.
Canonical form of the beamformer: $Y(f) = \sum_{i=0}^{M-1} W(f,i)\,X_i(f)$
M: number of microphones
$X_i(f)$: spectrum of the i-th channel
$W(f,i)$: weight coefficients matrix
$Y(f)$: output signal
Fast processing: M multiplications and M-1 additions per frequency bin.
For each weight matrix there is a corresponding beam shape $B(\varphi,\theta,f)$, the array gain as a function of direction.
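The canonical form above is just a per-bin weighted sum, which a few lines of NumPy can illustrate; the weights W used below are the trivial delay-and-sum averaging weights, chosen only as a stand-in.

```python
import numpy as np

def apply_beamformer(X, W):
    """X: (num_mics, num_bins) channel spectra, W: (num_mics, num_bins) weights.
    Returns Y(f) = sum_i W(f, i) * X_i(f), i.e. M multiplications and
    M - 1 additions per frequency bin."""
    return np.sum(W * X, axis=0)

# Stand-in data: 8 microphones, 257 frequency bins (512-point FFT).
num_mics, num_bins = 8, 257
X = np.random.randn(num_mics, num_bins) + 1j * np.random.randn(num_mics, num_bins)
W = np.full((num_mics, num_bins), 1.0 / num_mics, dtype=complex)
Y = apply_beamformer(X, W)
```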

Calculation of the weights matrix
The goal is, for a given geometry and beam direction, to find the optimal weights matrix.
For each frequency bin, find the weights that minimize the total noise in the output.
Constraints: equalized gain and zero phase shift for signals coming from the beam direction.

Known approaches
Using multidimensional optimization:
The multidimensional surface is multimodal, i.e. it has multiple extrema.
Unpredictable number of iterations, i.e. slow.
Multiple computations lead to loss of precision.
Using the approach above with a different optimization criterion:
Minimax, i.e. minimization of the maximum difference
Minimal beamwidth, etc.
In all cases the starting point of the multidimensional optimization is critical.
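The presenter solves this constrained per-bin problem with a direct multidimensional numerical optimization. For reference only, the same "minimize output noise subject to unit gain and zero phase toward the beam direction" problem has the classical closed-form MVDR solution sketched below; the steering vector and noise covariance here are stand-ins, not values from this project.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """Weights minimizing w^H R w subject to the distortionless constraint w^H d = 1:
    w = R^{-1} d / (d^H R^{-1} d)."""
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)

# One frequency bin of an 8-microphone array with stand-in quantities.
M = 8
d = np.exp(-2j * np.pi * np.random.rand(M))   # steering vector toward the beam direction
R = np.eye(M) + 0.1 * np.ones((M, M))         # noise covariance (ambient + uncorrelated)
w = mvdr_weights(R, d)
print(abs(np.vdot(d, w)))                     # distortionless constraint, ~1.0
```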

Array noise suppression
Noise = ambient + non-correlated + correlated (jammers and reverberation)
Ambient noise suppression:
$$G_{amb} = 20\log_{10}\frac{\int_f S(f)\,df}{\int_f \int_{0}^{2\pi} \int_{-\pi/2}^{+\pi/2} N(f)\,|B(\varphi,\theta,f)|^2 \, d\theta\, d\varphi\, df}$$
Non-correlated noise:
$$G_{nc} = 20\log_{10}\frac{\int_f S(f)\,df}{\int_f \sum_{i=1}^{M} |W(f,i)|^2 \, df}$$
Correlated noise (from a given direction):
$$G_{c} = 20\log_{10}\frac{\int_f S(f)\,|B(\varphi_S,\theta_S,f)|^2\,df}{\int_f J(f)\,|B(\varphi_J,\theta_J,f)|^2\,df}$$

Microphone Array for meetings
Number of microphones: 8
Noise suppression, ambient: 12-16 dB
Sound source suppression (up to 4000 Hz):
At 90°: better than 12 dB
At 180°: better than 15 dB
Beam width at -3 dB: 40°
Work band: 80-7500 Hz
Principle of work: points a capturing beam at the speaker location

Microphone Array for meetings
Figure: MicArray gain vs. frequency and angle.

Additional goodies
Linear processing: beamforming doesn't introduce non-linear distortions, making the output signal suitable not only for recording/broadcasting but for speech recognition as well.
Integration with acoustic echo cancellation: a requirement for real-time communication purposes.
Better noise suppression: the initial noise reduction from the beamformer allows using better noise suppression algorithms after it without introducing significant non-linear distortions and musical noise.
Partial de-reverberation: the narrow beam suppresses sound waves reflected from the walls, making the sound drier and better accepted by live listeners and speech recognition engines; it also makes the job of a potential de-reverberation processor easier.

Beamshapes
Figure: beam shapes in 3D at several frequencies (525 Hz, 125 Hz, 225 Hz, 425 Hz).
The 3D beam shapes demonstrate frequency-independent beamforming.

Sound source localization
Provides the direction to the sound source.
In most cases it works in real time.
Goes through three phases:
Pre-processing
Actual sound source localization: provides a single SSL measurement (time, position, weight)
Post-processing of the results: final result is position and confidence level

SSL pre-processing
Packaging the audio signals into frames
Conversion to the frequency domain
Noise suppression
Classification signal/pause
Rejection of non-signal frames
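A hedged sketch of this pre-processing chain, assuming 512-sample frames with 50% overlap and a crude energy-percentile signal/pause classifier; the actual classifier and noise suppressor on these slides are not specified, so those parts are placeholders.

```python
import numpy as np

def preprocess_frames(x, frame_len=512, hop=256, pause_margin_db=6.0):
    """x: (num_mics, num_samples). Returns a list of (frame_index, spectra),
    keeping only frames classified as signal (pause/non-signal frames rejected)."""
    window = np.hanning(frame_len)
    spectra_list, energies = [], []
    num_samples = x.shape[1]
    for start in range(0, num_samples - frame_len + 1, hop):
        frame = x[:, start:start + frame_len] * window       # packaging into frames
        spectra = np.fft.rfft(frame, axis=1)                  # conversion to frequency domain
        spectra_list.append(spectra)
        energies.append(np.mean(np.abs(spectra) ** 2))
    noise_floor = np.percentile(energies, 10)                 # crude noise estimate
    threshold = noise_floor * 10 ** (pause_margin_db / 10.0)  # signal/pause decision level
    return [(i, s) for i, (s, e) in enumerate(zip(spectra_list, energies)) if e > threshold]
```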

SSL pre-processing (example)
Figure: SSL measurements vs. time for a one-channel signal (amplitude vs. time).

Actual SSL - known algorithms
Two-step, time delay estimation (TDOA) based:
Calculate the delay for each microphone pair
Convert it to a direction
Combine the delays from all pairs for the final estimate
One-step time delay estimation (Yong Rui and Dinei Florencio, MS Research):
Calculate the correlation function for each pair
For each hypothetical angle of arrival, accumulate the corresponding correlation strength from all pairs, and search for the best angle
Steered-beam based algorithms:
Calculate the energy of beams pointing in various directions
Find the maximum
Interpolate with neighbors for increased resolution
Others: ICA based, blind source separation, etc. Most of them are not real-time.
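The pair-wise methods above start from a per-pair delay estimate. A standard way to obtain it, shown here as an assumption rather than the exact method used in this project, is the generalized cross-correlation with phase transform (GCC-PHAT):

```python
import numpy as np

def gcc_phat_delay(x1, x2, fs, max_tau=None):
    """Estimate the delay (in seconds) of x2 relative to x1 using GCC-PHAT."""
    n = len(x1) + len(x2)                       # zero-pad to reduce circular wrap
    X1, X2 = np.fft.rfft(x1, n=n), np.fft.rfft(x2, n=n)
    cross = X2 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12              # PHAT weighting: keep only the phase
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# Example: the second channel lags the first by 10 samples.
fs = 16000
x1 = np.random.randn(4096)
x2 = np.roll(x1, 10)
print(gcc_phat_delay(x1, x2, fs) * fs)          # prints approximately 10
```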

Beamsteering SSL (example)
Figure: energy vs. angle and time, single sound source.

Major factors harming the precision
Ambient noise:
Smooths the maxima
Hides low-level sound sources
Reverberation:
Creates additional peaks
Lifts the noise floor
Suppresses/enhances some frequencies
Reflections:
Create distinct fake peaks with constant location
All of the above justify the post-processing phase.

SSL with reflections and reverberation: raw data
Figure: SSL results histograms for speakers in a conference room.
Speaker 1 at -8°: louder voice, fewer reflections
Speaker 2 at 52°: quieter voice, strong reflections from the whiteboards

SSL post-processing
The goals are:
To remove results caused by reflections and reverberation
To increase the SSL precision (standard deviation)
To track the sound source movement/change dynamics
Eventually, to track multiple sound sources
Approaches for post-processing of the SSL results:
Statistical processing
Real-time clustering
Kalman filtering
Particle filtering
Provides the final result: time, position, confidence level

Real-time clustering of SSL data
Put each new SSL measurement (time, direction, weight) into a queue.
Remove all measurements older than a given lifetime (~4 sec).
Place all measurements into spatially spread, 50% overlapping buckets.
Find the bucket with the largest sum of weights.
Compute the weighted average of the measurements in this bucket.
Calculate the confidence level based on the last measurement time, the number of measurements, and the standard deviation.
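A minimal sketch of this clustering procedure, assuming a 20-degree bucket width and a simple count-based confidence heuristic (the slide combines last time, count, and standard deviation); only the queue, lifetime, 50% overlapping buckets, and weighted-average steps come from the slide itself.

```python
import numpy as np

class SSLPostProcessor:
    """Real-time clustering of SSL measurements as outlined above."""

    def __init__(self, lifetime_s=4.0, bucket_width_deg=20.0):
        self.lifetime_s = lifetime_s
        self.bucket_width = bucket_width_deg          # assumed width
        self.queue = []                               # (time, direction_deg, weight)

    def add(self, t, direction_deg, weight):
        # Put the new measurement into the queue and drop expired ones.
        self.queue.append((t, direction_deg, weight))
        self.queue = [m for m in self.queue if t - m[0] <= self.lifetime_s]
        return self._estimate(t)

    def _estimate(self, now):
        # Spatially spread, 50% overlapping buckets; pick the one with the largest weight sum.
        step = self.bucket_width / 2.0
        best, best_sum = [], 0.0
        for center in np.arange(-180.0, 180.0, step):
            members = [m for m in self.queue if abs(m[1] - center) <= self.bucket_width / 2.0]
            total = sum(m[2] for m in members)
            if total > best_sum:
                best, best_sum = members, total
        if not best:
            return None
        # Weighted average inside the winning bucket, plus a placeholder confidence.
        dirs = np.array([m[1] for m in best])
        wts = np.array([m[2] for m in best])
        position = float(np.average(dirs, weights=wts))
        confidence = min(1.0, len(best) / 20.0)       # stand-in for the slide's heuristic
        return now, position, confidence

# Usage: feed measurements (time, direction, weight) as they arrive.
pp = SSLPostProcessor()
print(pp.add(t=0.1, direction_deg=-8.3, weight=0.9))
```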

Post-processing results
Single speaker in various positions.
Recording conditions:
Sound room (no noise and reverberation)
Office (high noise, shorter reverberation, reflections)
Conference room (less noise, longer reverberation, reflections)

Conditions    Speaker, deg   Bias, deg   StDev, deg   #results
Sound Room    36             -1.654      0.3857       334
Sound Room    -21            1.8722      2.87         319
Sound Room    38             5.6932      2.4788       292
Office        -29            -4.7539     1.3155       47
Office                       1.6181      0.9687       391
Office                       4.729       0.7511       45
Conf. Room    35             3.4657      0.9699       226
Conf. Room    -4             0.27        2.438        271
Conf. Room    -43            -5.1692     0.8766       383

All recordings were made with the 8-element circular microphone array for meeting recording.

Post-processing results (2)
Two speakers in fixed positions.
Recording conditions: conference room, speakers at -8 and 52 deg.
Figure: two persons' SSL data, angle (deg) vs. time (s), raw SSL and post-processed SSL.

Post-processing results (3)
Two speakers in fixed positions.
Recording conditions: conference room, speakers at -8 and 52 deg.
Figure: two persons' SSL (detail), angle (deg) vs. time (s), raw SSL and post-processed SSL.
Speaker switching at second 59.
Post-processing delay: ~400 ms.

Applications for MicArrays and sound source localization
Sound capturing during meetings:
Provides the direction to point the capturing beam
Assists the virtual director for speaker view (real-time)
Meeting post-processing:
Assists speaker clustering
Meeting annotation using rough ASR (requires good sound quality)
Meeting transcription with precise ASR
Viewing/browsing recorded meetings:
Audio timeline: suppress some audio tracks, navigate by speaker (based on the speaker clustering)
Good sound quality: better user experience
Good sound quality: search by phrases or keywords with ASR
SSL-data-assisted virtual director for speaker view (play time)

Meetings browser (example)

Meetings browser (detail) Audio timeline