Joint Position-Pitch Decomposition for Multi-Speaker Tracking

SPSC Laboratory, TU Graz

Contents:
1. Microphone Arrays
   - SPSC circular array
   - Beamforming
2. Source Localization
   - Direction of Arrival (DoA)
   - GCC and SRP
   - Performance tests
3. Joint position-pitch estimation (PoPi)
   - PoPi decomposition
   - Modifications
   - Performance tests
4. Conclusions and future work

1. Microphone Arrays
- Definition: arrangement of multiple spatially separated microphones
- Different designs:
  - Linear (1D)
  - Planar (2D)
  - Volumetric (3D)

SPSC circular microphone array:
- Planar design
- Circular arrangement
- Diameter = 0.4 m
- 16 channels
- Omnidirectional electret microphones
- Angular offset = 22.5°
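As an illustration of the geometry listed above, the following Python sketch computes the microphone coordinates. The function name `circular_array_positions`, the placement of the first microphone at azimuth 0°, and the counter-clockwise ordering are assumptions made for illustration; the slides only state the diameter, the channel count, and the angular offset.

```python
import math


def circular_array_positions(num_mics=16, diameter=0.4):
    """Return (x, y) microphone coordinates in metres on a circle.

    Assumption: microphone 0 sits at azimuth 0 degrees and the others
    follow counter-clockwise at equal angular steps.
    """
    radius = diameter / 2.0
    step = 2.0 * math.pi / num_mics  # 22.5 degrees for 16 microphones
    return [
        (radius * math.cos(m * step), radius * math.sin(m * step))
        for m in range(num_mics)
    ]


positions = circular_array_positions()
# Chord length between neighbouring microphones: 2 * r * sin(pi / N)
dx = positions[1][0] - positions[0][0]
dy = positions[1][1] - positions[0][1]
neighbour_spacing = math.hypot(dx, dy)
```

For the stated geometry the spacing between neighbouring microphones is 0.4 · sin(11.25°), roughly 0.078 m.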

Recording setup:
- Preamplifiers: Behringer ADA 8000
- A/D converter: RME-Fireface
- Apple MacBook Pro
- PD recording patch

Near-field and Far-field
- Near-field:
  - Source-array distance is comparable to the array dimensions
  - Wavefront curvature is not negligible
  - Source distance can be estimated
- Far-field:
  - Source-array distance is much larger than the array dimensions
  - Planar wavefronts can be assumed
  - Source distance cannot be estimated

Beamforming
- Summing the microphone signals
- Higher sensitivity for signals that arrive at all microphones at the same time
- Beam is focused on the 90° direction
- Beam width depends on the number of microphones

Delay and Sum Beamforming
- Signals are individually delayed
- Steering delays correspond to the focus direction
- Signals impinging on the array from the steering direction add up constructively because of their phase alignment
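The delay-and-sum principle can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the SPSC array: the function name `delay_and_sum` is hypothetical, the delays are restricted to whole samples, and `np.roll` wraps around at the signal edges, which is only acceptable for short delays in a demonstration.

```python
import numpy as np


def delay_and_sum(signals, delays):
    """Delay each channel by an integer number of samples and average.

    signals: array of shape (num_mics, num_samples)
    delays:  integer steering delays in samples, one per microphone
    """
    signals = np.asarray(signals, dtype=float)
    out = np.zeros(signals.shape[1])
    for channel, delay in zip(signals, delays):
        # np.roll wraps around at the edges (simplification for this sketch)
        out += np.roll(channel, int(delay))
    return out / len(signals)
```

If a pulse reaches the microphones at different sample indices, steering delays that compensate these arrival differences align the pulses, so the averaged output keeps the full pulse amplitude; with mismatched delays the pulses stay spread out and the peak is attenuated.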

Spatial Aliasing
- Similar to temporal aliasing
- Microphone distance d must be smaller than half the minimum wavelength to avoid spatial aliasing:
  $$ d < \frac{\lambda_{\min}}{2}, \qquad f_{\text{alias}} = \frac{c}{2 d \cos\theta} $$
- The frequency above which spatial aliasing occurs depends on microphone distance and angle
- Different effect on array designs:
  - Linear array (only one angle)
  - Circular array (different angles)
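A small Python helper makes the dependence of the aliasing frequency on spacing and angle concrete. The speed of sound c = 343 m/s and the function name `alias_frequency` are assumptions; the absolute value of cos θ is used so that angles beyond 90° also give a positive frequency, which is an interpretation and not stated on the slide.

```python
import math


def alias_frequency(d, theta_deg, c=343.0):
    """Frequency in Hz above which spatial aliasing occurs.

    d:         microphone distance in metres
    theta_deg: angle of incidence in degrees
    c:         speed of sound in m/s (assumed value)
    """
    cos_theta = abs(math.cos(math.radians(theta_deg)))
    if cos_theta < 1e-12:
        return math.inf  # no aliasing limit for this direction
    return c / (2.0 * d * cos_theta)
```

For d = 0.4 m and θ = 0° this gives 343 / 0.8 = 428.75 Hz; at θ = 60° the limit doubles because cos θ = 0.5.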

Linear (ULA) vs. Circular (UCA)
- Linear array: 16 microphones, d = 0.4 m (total length = 6 m)
- Circular array (SPSC): 16 microphones, diameter = 0.4 m

2. Source Localization
Direction of arrival (DoA)
- Direction from which a planar wavefront is impinging on the array
- Vector from array origin to source position
- Defined by azimuth θ and elevation φ:
  $$ \zeta_o^s = \begin{bmatrix} \cos\varphi \sin\theta \\ \cos\varphi \cos\theta \\ \sin\varphi \end{bmatrix} $$
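The DoA vector can be written as a short Python function. The name `doa_unit_vector` is hypothetical; the component order follows the formula on the slide, which implies that the azimuth is measured from the y-axis.

```python
import math


def doa_unit_vector(azimuth_deg, elevation_deg):
    """Unit vector (x, y, z) pointing from the array origin to the source."""
    theta = math.radians(azimuth_deg)
    phi = math.radians(elevation_deg)
    return (
        math.cos(phi) * math.sin(theta),
        math.cos(phi) * math.cos(theta),
        math.sin(phi),
    )
```

The result always has unit length, so only the direction is encoded, in line with the far-field assumption that the source distance cannot be estimated.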

Localization Strategies:
- Time-Delay Estimation (TDE)
  - Cross-correlation of microphone pairs leads to the time difference of arrival (TDoA)
  - TDoA and microphone positions lead to the DoA
- Steered beamforming
  - Beamformer is steered over a specific range
  - Output power of the beamformer reaches a maximum when focusing on the source direction

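Time delay estimation using GCC
- Generalized cross-correlation:
  $$ R_{12}(\tau) = \frac{1}{2\pi} \int \Psi_{12}(\omega)\, X_1(\omega)\, X_2^{*}(\omega)\, e^{j\omega\tau}\, d\omega $$
- Phase transform: division of the cross-power spectrum by its magnitude
  $$ \Psi_{12}^{\text{PHAT}}(\omega) = \frac{1}{\left| X_1(\omega)\, X_2^{*}(\omega) \right|} $$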
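A discrete-time version of the GCC with phase transform can be sketched with NumPy's FFT. This is a minimal illustration under stated assumptions: the function name `gcc_phat`, the zero-padding to the combined signal length, and the small constant that guards against division by zero are choices made here and do not come from the slides.

```python
import numpy as np


def gcc_phat(x1, x2, max_lag):
    """Return (lags, r), with r[i] the PHAT-weighted correlation at lags[i].

    Sign convention: a peak at a positive lag means that x1 is a delayed
    copy of x2, i.e. x1[n] is approximately x2[n - lag].
    """
    n = len(x1) + len(x2)               # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)            # cross-power spectrum
    cross /= np.abs(cross) + 1e-12      # phase transform (magnitude removed)
    r = np.fft.irfft(cross, n=n)
    # Reorder so that the lags run from -max_lag to +max_lag
    r = np.concatenate((r[-max_lag:], r[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, r
```

The TDoA estimate in samples is then `lags[np.argmax(r)]`. Restricting the search to `max_lag` keeps only the lags that are physically possible for the given microphone distance.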

GCC-Phat (1)
- Precise TDoA estimation
  - DoA-relevant range: -51 to +51 samples (0.4 m, fs = 44100 Hz)
  - Maximum can be easily located
- DoA estimation
  - The TDoA τ leads to the DoA angles θ and 360° - θ, which are stored for every microphone pair:
    $$ \theta = \arccos\left( \frac{c\,\tau}{d} \right) $$
  - Several GCC maxima are stored in the multi-speaker scenario
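The conversion from TDoA to DoA can be checked numerically. In this sketch the TDoA is passed in samples and converted to seconds internally; the speed of sound c = 343 m/s and both function names are assumptions.

```python
import math


def max_tdoa_samples(d, fs, c=343.0):
    """Largest physically possible TDoA in samples for microphone distance d."""
    return d * fs / c


def tdoa_to_doa(tau_samples, d, fs, c=343.0):
    """Return the two candidate DoA angles (theta, 360 - theta) in degrees."""
    arg = c * (tau_samples / fs) / d
    arg = max(-1.0, min(1.0, arg))      # clip small numerical overshoots
    theta = math.degrees(math.acos(arg))
    return theta, 360.0 - theta
```

For d = 0.4 m and fs = 44100 Hz, `max_tdoa_samples` gives about 51.4, consistent with the range of -51 to +51 samples stated above. A single pair cannot distinguish θ from 360° - θ, which is why both candidates are kept.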

GCC-Phat (2)
- Shifting of the DoA estimations
  - According to the angular offset of the pairs: (m - 1) · 22.5° (m: microphone pair index)
  - Total number of estimations: 2 · M (M: number of pairs)
- Histogram
  - Leads to the final DoA estimation
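The shift-and-histogram step can be sketched as a simple voting scheme. The function name `fuse_doa_estimates`, the 5° bin width, and the direction of the shift (adding the pair offset) are assumptions for illustration; the slides only state that the estimates are shifted by (m - 1) · 22.5° and collected in a histogram.

```python
from collections import Counter


def fuse_doa_estimates(pair_angles, offset_deg=22.5, bin_width=5.0):
    """Fuse per-pair DoA candidates into one estimate (bin centre, degrees).

    pair_angles[m] holds the (theta, 360 - theta) candidates of pair m + 1,
    measured relative to that pair's own axis.
    """
    votes = Counter()
    for m, candidates in enumerate(pair_angles):   # m = 0 corresponds to pair 1
        for angle in candidates:
            shifted = (angle + m * offset_deg) % 360.0
            votes[int(shifted // bin_width)] += 1
    best_bin, _ = votes.most_common(1)[0]
    return (best_bin + 0.5) * bin_width
```

After the shift, the correct candidate of every pair falls into the same bin, while the mirrored candidates are scattered over different angles, so the histogram maximum resolves the front-back ambiguity of the individual pairs.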

Steered Beamforming
- Delay-and-sum beamformer
  - Focusing on every direction
  - Output power is computed
- Steered response power (SRP)
  - Output power reaches a maximum when focused on the source direction
  - Problems in the two-speaker scenario

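SRP-Phat (1)
- Defined by Hector DiBiase in 2000:
  $$ P(\Delta_1, \ldots, \Delta_M) = \sum_{k=1}^{M} \sum_{l=1}^{M} \int \frac{1}{\left| X_k(\omega)\, X_l^{*}(\omega) \right|}\, X_k(\omega)\, X_l^{*}(\omega)\, e^{j\omega \Delta_{lk}}\, d\omega $$
  (Δ_lk: difference between the steering delays of microphones l and k)
- Sum of multiple shifted GCC-Phat functions
- Shifting according to the focus direction
- Predefined steering delays in a look-up table (LUT):
  $$ \Delta_m = \frac{\zeta_o \cdot d_m}{c} $$
- Steering in azimuth and elevation directions
- LUT is defined for every direction in the spherical half space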
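The look-up table of steering delays can be sketched as follows. Only the LUT construction is shown, not the full SRP computation. The 5° grid resolution, the speed of sound c = 343 m/s, the microphone ordering, and the function names are assumptions; the slides do not state the grid that was used.

```python
import math


def steering_delays(mic_positions, azimuth_deg, elevation_deg, c=343.0):
    """Steering delays in seconds, Delta_m = (zeta . d_m) / c, for one direction."""
    theta = math.radians(azimuth_deg)
    phi = math.radians(elevation_deg)
    zeta = (
        math.cos(phi) * math.sin(theta),
        math.cos(phi) * math.cos(theta),
        math.sin(phi),
    )
    return [sum(z * p for z, p in zip(zeta, pos)) / c for pos in mic_positions]


def build_lut(mic_positions, az_step=5, el_step=5):
    """Map (azimuth, elevation) in degrees to the list of steering delays."""
    return {
        (az, el): steering_delays(mic_positions, az, el)
        for az in range(0, 360, az_step)
        for el in range(0, 91, el_step)   # upper half space only
    }


# Planar circular array: 16 microphones, diameter 0.4 m, in the z = 0 plane
mics = [
    (0.2 * math.cos(2 * math.pi * m / 16), 0.2 * math.sin(2 * math.pi * m / 16), 0.0)
    for m in range(16)
]
lut = build_lut(mics)
```

Because the delays depend only on the geometry, the table is computed once and reused for every signal frame. For a planar array, a source directly above the array (elevation 90°) yields zero delay at every microphone.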

SRP-Phat (2)
- SRP can be used to locate multiple sources
- DoA estimation in the spherical half space is possible

GCC-Phat vs. SRP-Phat
- Localization performance in the presence of a disturbing noise source
  - 60 segments
  - Segment length = 2048 samples
- Multi-speaker scenario
  - 20 segments
  - Segment length = 2048 samples

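3. Joint position-pitch estimation (PoPi)
PoPi decomposition
- Reindexing of the cross-correlation:
  $$ \rho_t(\theta_s, f_0) = \frac{1}{2K+1} \sum_{i=1}^{P} \sum_{k=-K}^{K} R_{t,i}\big( k\, L(f_0) + O(\theta_{is}) \big) $$
- Position and pitch values are defined in LUTs:
  $$ O(\theta) = \frac{d \cos\theta\, f_s}{c}, \qquad L(f_0) = \frac{f_s}{f_0} $$
- Search ranges: θ from 0° to 360°, f_0 from 80 Hz to 280 Hz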
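The reindexing can be sketched in Python as a direct evaluation of the double sum. Several details here are assumptions and not taken from the slides: the function name `popi_plane`, the storage of each correlation over the lags -max_lag to +max_lag, the rounding of lags to the nearest sample, the interpretation of θ_is as the candidate angle minus the angular offset of pair i, and the default K = 2.

```python
import numpy as np


def popi_plane(R, max_lag, thetas_deg, f0s_hz, pair_offsets_deg, d, fs,
               c=343.0, K=2):
    """Evaluate the PoPi plane on a grid of candidate angles and pitches.

    R: array of shape (P, 2 * max_lag + 1); R[i, max_lag + n] is the
       cross-correlation of microphone pair i at lag n (in samples).
    """
    R = np.asarray(R, dtype=float)
    plane = np.zeros((len(thetas_deg), len(f0s_hz)))
    ks = np.arange(-K, K + 1)
    for a, theta in enumerate(thetas_deg):
        for b, f0 in enumerate(f0s_hz):
            L = fs / f0                                  # pitch lag L(f0)
            total = 0.0
            for i, offset in enumerate(pair_offsets_deg):
                # Position offset O(theta_is) in samples (assumed theta - offset)
                O = d * np.cos(np.radians(theta - offset)) * fs / c
                idx = np.rint(ks * L + O).astype(int) + max_lag
                valid = (idx >= 0) & (idx < R.shape[1])  # skip lags outside R
                total += R[i, idx[valid]].sum()
            plane[a, b] = total / (2 * K + 1)
    return plane
```

A candidate scores highly when the correlation has peaks at all lags k · L(f_0) + O(θ), that is, when both the pitch period and the inter-microphone delay match. The maximum of the plane therefore gives a joint position and pitch estimate.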

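The PoPi-Plane:
- The matrix $\rho_t(\theta, f_0)$ is visualized in a 2D plane
- Female speaker at 45°: undesired Gaussian at half the pitch
- Solution: additional decomposition term
  $$ \rho_t(\theta_s, f_0) = \frac{1}{2K+1} \sum_{i=1}^{P} \sum_{k=-K}^{K} \left[ R_{t,i}\big( k\, L(f_0) + O(\theta_{is}) \big) - \beta\, R_{t,i}\Big( \frac{2k+1}{2}\, L(f_0) + O(\theta_{is}) \Big) \right] $$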
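The effect of the additional term on the half-pitch peak can be illustrated for a single microphone pair and a single candidate. The function name `popi_score`, the weight β = 1, the default K = 2, and the subtraction of the second term are assumptions made for this sketch; the slides do not state the value of β.

```python
import numpy as np


def popi_score(R, max_lag, L, O, K=2, beta=1.0):
    """Score one (position, pitch) candidate for one microphone pair.

    R[max_lag + n] is the cross-correlation at lag n; L is the pitch lag
    and O the position offset, both in samples.
    """
    R = np.asarray(R, dtype=float)
    ks = np.arange(-K, K + 1)

    def pick(lags):
        idx = np.rint(lags).astype(int) + max_lag
        idx = idx[(idx >= 0) & (idx < R.size)]   # skip lags outside R
        return R[idx].sum()

    harmonic = pick(ks * L + O)                  # multiples of the pitch lag
    between = pick((2 * ks + 1) / 2.0 * L + O)   # lags halfway between them
    return (harmonic - beta * between) / (2 * K + 1)
```

For the true pitch lag, the lags halfway between the correlation peaks carry little energy, so the score is barely changed. For a candidate at half the pitch (twice the lag), those intermediate lags coincide with true correlation peaks, so the subtraction cancels the spurious maximum.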

Two speaker problem
- Analogous to SRP
- SRP for two speakers (90°, 270°): DoA estimation fails
- PoPi-Plane for two speakers (90° = female; 270° = male): PoPi estimation fails

PoPi-Phat
- Joining GCC and GCC-Phat
  - The phase transform removes periodicity (pitch information)
  - The DoA-relevant sample range (-60 to +60 samples) of the GCC function is replaced by the GCC-Phat values
- DoA precision is improved
- Pitch problem is not solved

PoPi-filter (1)
- Prefiltering the microphone signals
  - Inspired by the auditory model (multi-pitch detection)
  - Gammatone filterbank (17 band-pass filters)
- Normalized cross-correlation of the filtered signals
- Summing the cross-correlations
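The three steps above (prefiltering, normalized cross-correlation per channel, summation) can be sketched with NumPy. This is not the filterbank used in the presentation: the truncated FIR gammatone approximation, the ERB formula of Glasberg and Moore, the logarithmic spacing of the 17 centre frequencies between 100 Hz and 4 kHz, and all function names are assumptions made for illustration.

```python
import numpy as np


def erb(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore formula)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)


def gammatone_fir(fc, fs, duration=0.05, order=4):
    """Truncated impulse response of a gammatone filter centred at fc (Hz)."""
    t = np.arange(int(duration * fs)) / fs
    b = 1.019 * erb(fc)
    g = t ** (order - 1) * np.exp(-2.0 * np.pi * b * t) * np.cos(2.0 * np.pi * fc * t)
    return g / np.sqrt(np.sum(g ** 2))  # normalise to unit energy


def xcorr(a, b, max_lag):
    """c[k] = sum_n a[n + k] * b[n] for k = -max_lag..max_lag (equal lengths)."""
    n = len(a)
    return np.array([
        np.dot(a[max(k, 0):n + min(k, 0)], b[max(-k, 0):n - max(k, 0)])
        for k in range(-max_lag, max_lag + 1)
    ])


def summary_correlation(x1, x2, fs, max_lag, num_bands=17,
                        f_lo=100.0, f_hi=4000.0):
    """Sum of normalised cross-correlations over all filterbank channels.

    x1 and x2 must have the same length.
    """
    centres = np.geomspace(f_lo, f_hi, num_bands)  # assumed log spacing
    lags = np.arange(-max_lag, max_lag + 1)
    summary = np.zeros(lags.size)
    for fc in centres:
        g = gammatone_fir(fc, fs)
        y1 = np.convolve(x1, g)
        y2 = np.convolve(x2, g)
        norm = np.sqrt(np.sum(y1 ** 2) * np.sum(y2 ** 2)) + 1e-12
        summary += xcorr(y1, y2, max_lag) / norm
    return lags, summary
```

The per-channel normalization gives every band the same weight in the sum, so weak high-frequency channels, which carry the sharp delay information, are not dominated by the stronger low-frequency channels that carry the pitch periodicity.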

PoPi-filter (2)
Filtered correlations:
- Every filtered GCC makes a different contribution
- Low-frequency channels include pitch information
- High-frequency channels lead to precise DoA estimations

PoPi-filter (3)
- Summary correlation: includes pitch and position information for both sources
- PoPi plane: shows two dominant Gaussians at the correct pitch and position values

Performance of PoPi methods (1)
Presence of a disturbing noise source:
- 60 segments (2048 samples)
- Percentage of correct estimations
- PoPi performs better than SRP-Phat and Cepstrum for high noise levels
- PoPi favors speech sources and suppresses noise sources
- PoPi-filter outperforms the other methods

Performance of PoPi methods (2)
Presence of a disturbing noise source (joint pitch and position):
- 60 segments (2048 samples)
- Percentage of correct estimations
- Pitch and position values must both be correct
- PoPi-filter performs best if combined DoA and pitch information is desired

Two speaker scenario (1)
IBK-Studio (T60 = 0.13 s)
- 15 segments (2048 samples)
- Male speaker (337°), vowel "o"
- Female speaker (22°), vowel "e"
- Position and pitch estimation of the three PoPi methods
- PoPi-filter clearly outperforms the other methods

Two speaker scenario (2)
Seminar room (T60 = 0.5 s)
- 15 segments (2048 samples)
- Male speaker (45°), vowel "o"
- Female speaker (337°), vowel "e"
- All methods suffer strongly under reverberation
- PoPi-filter is the only method that gives usable results

Moving source (1)
IBK-Studio (T60 = 0.13 s)
- Male speaker moving around the array
- Pronouncing vowel "a"
- All three methods give usable results
- PoPi-filter gives a stable pitch estimation

Moving source (2)
Seminar room (T60 = 0.5 s)
- Male speaker moving around the array
- Pronouncing vowel "a"
- PoPi and PoPi-Phat fail in the presence of reverberation
- PoPi-filter clearly outperforms the other methods and gives accurate DoA and pitch estimations

Conclusion and future work
- Source localization
  - State-of-the-art algorithms implemented for the SPSC array
  - SRP-Phat outperforms GCC-Phat
  - SRP-Phat: multi-speaker and elevation estimation possible
- PoPi estimation
  - PoPi method is less sensitive to noise sources
  - Original PoPi is not suitable for multiple speakers
  - PoPi-Phat: DoA estimation more robust and precise
  - PoPi-filter: multi-speaker PoPi estimation possible
- Future work
  - Combining with a VAD and an advanced tracking algorithm
  - Reducing the computational effort of the PoPi decomposition
  - Real-time implementation

Thank you for your attention!