MULTIPLE SOUND SOURCE TRACKING AND IDENTIFICATION VIA DEGENERATE UNMIXING ESTIMATION TECHNIQUE AND CARDINALITY BALANCED MULTI-TARGET MULTI-BERNOULLI FILTER (DUET-CBMeMBer) WITH TRACK MANAGEMENT

Nicholas Chong, Shanhung Wong, Sven Nordholm, Iain Murray
Department of Electrical and Computer Engineering, Curtin University, Kent Street, Bentley, WA 6102, Australia

ABSTRACT

The cocktail party problem is a challenging problem that research into source separation aims to solve. Many attempts have been made to solve it, and a logical approach is to break this complex problem down into several smaller problems that are solved in different stages, each addressing a different aspect. In this paper, we provide a robust solution to one part of the problem by localizing and tracking multiple moving speech sources in a room environment. Here we study the problem for an unknown number of moving sources. The DUET-CBMeMBer method we outline is capable of estimating the number of sound sources as well as tracking and labelling them. This paper proposes a track management technique that identifies sound sources based on their trajectories as an extension to the DUET-CBMeMBer technique.

Index Terms: BSS, DUET, CBMeMBer, multiple speaker tracking, track management

1. INTRODUCTION

The classic cocktail party problem has been widely researched, and various blind source separation (BSS) systems have been proposed to solve it [1], [2], [3], [4], [5]. Most of these methods assume that the sound sources are not moving. Such an assumption does not hold in a practical setting, as people move around in most acoustic communication scenarios; a blind source separation system should therefore account for such movements as well as for a time-varying number of sound sources. The DUET-CBMeMBer filter [6], inspired by both the Degenerate Unmixing Estimation Technique (DUET) [2] and the Cardinality-Balanced Multi-Target Multi-Bernoulli filter (CBMeMBer) [7], was previously proposed to address the problem of localizing and tracking multiple speakers in a room. The method uses Random Finite Set (RFS) theory to track multiple speakers in the room and to automatically follow new speakers as they appear, a technique shown to be viable in [8] and [9].

This paper addresses the problem of multiple speaker tracking and speaker identification in acoustic environments. There have been previous attempts to separate mixtures of moving sound sources [10], [11], but these methods require additional processing, such as speech recognition or image recognition, to identify the separated sound sources. This means additional computational complexity, and such methods are only suited to offline processing. [12], [13], [14] are earlier works which also exploit the spatio-temporal nature of speech to perform sound source separation; however, these techniques track only the sources' Direction of Arrival (DoA), which is one-dimensional. The proposed algorithm aims to track the Cartesian coordinates and velocities of the speakers within a room, not just the DoA of the sound sources. The addition of another dimension adds a new layer of complexity to the problem, as the Time Difference of Arrival (TDOA) estimates obtained from the different microphone pairs must be fused coherently in order to estimate the positions of the speakers in the room.
DUET-CBMeMBer provides the means to fuse the extracted information and use it to estimate the source locations, but it still lacks the ability to identify and distinguish the sound features that have been separated. Sound source reconstruction is very challenging without information about the sound source identity, as a permutation problem arises in the features required to construct the mask. Building on the ideas in [15] and [16], this paper proposes an intuitive method of track management based on the trajectories of the sound sources as an extension to the DUET-CBMeMBer algorithm. The addition of the track management system addresses the track labelling problem that the original DUET-CBMeMBer algorithm [6] left open.

The rest of this paper is organized into four sections: System Overview, Proposed Track Management, Simulation and Discussion, and Conclusion. A brief overview of DUET-CBMeMBer is given in the System Overview section. In Section 3, the proposed track management is discussed in greater detail.

The experiments conducted and the analysis of the results are presented in Section 4. Closing remarks and future work are given in Section 5.

2. DUET-CBMEMBER OVERVIEW

The DUET-CBMeMBer algorithm was proposed in [6] to deal with the problem of tracking multiple moving sound sources. There are two main stages in the DUET-CBMeMBer algorithm. The first stage is feature extraction, which is performed using the DUET algorithm [17]. This is followed by the CBMeMBer filter [7], which performs the sound source tracking.

2.1. DUET

DUET is a robust and efficient BSS technique as long as the W-disjoint orthogonality assumption, that only one source is active at any time-frequency point, holds true. This can be expressed as:

$$S_j(\tau, \omega) S_k(\tau, \omega) = 0, \quad \forall \tau, \omega, \; j \neq k \quad (1)$$

where $S_j$ and $S_k$ refer to the $j$th and $k$th Short Time Fourier Transformed sound sources, while $\tau$ and $\omega$ refer to the time frame and frequency respectively. The two anechoic sound mixtures used by DUET to recover the original sound sources can be expressed as:

$$y_1(t) = \sum_{j=1}^{N} s_j(t) \quad (2)$$

$$y_2(t) = \sum_{j=1}^{N} a_j s_j(t - \delta_j) \quad (3)$$

with $y(t)$ being the received signal and $s_j(t)$ the $j$th sound source, while $a_j$ and $\delta_j$ are the attenuation and the delay respectively. A Short Time Fourier Transform (STFT) is performed on both observed sound mixtures in order to transform them into the time-frequency domain, in which the source signals are sparse. Feature extraction is achieved using the ratio of the mixtures to obtain the features: relative attenuation and relative delay. The time-frequency ratio can be expressed as:

$$\frac{Y_2(\tau, \omega)}{Y_1(\tau, \omega)} = \frac{\sum_{j=1}^{N} a_j e^{-i\omega\delta_j} S_j(\tau, \omega)}{\sum_{j=1}^{N} S_j(\tau, \omega)} \quad (4)$$

Only the relative delay is used in the CBMeMBer filter to localize the sound sources, after it is transformed into a TDOA. Under W-disjoint orthogonality, the relative delay can be obtained from the time-frequency ratio as:

$$\frac{Y_2(\tau, \omega)}{Y_1(\tau, \omega)} = \frac{a_j e^{-i\omega\delta_j} S_j(\tau, \omega)}{S_j(\tau, \omega)} = a_j e^{-i\omega\delta_j} \quad (5)$$

while the equation used to transform the relative delay into a TDOA is:

$$\Delta t_j = \frac{\delta_j}{f_s} \quad (6)$$

where $\Delta t_j$ refers to the TDOA of the $j$th peak, $\delta_j$ is the relative delay (in samples), and $f_s$ is the sampling frequency. This only holds if no phase aliasing occurs, so the distance between two microphones must be less than $c / (2 f_{\max})$, where $c$ is the speed of sound and $f_{\max}$ is the maximum frequency of interest [17].

A power-weighted histogram is constructed to cluster the estimated parameters into groups. Instead of the raw relative features, power-weighted features are used to construct the histogram, because they are more accurate estimates of the true parameters [17]. The power-weighted histogram can be used to estimate the mixing parameters, as the relative parameters cluster around the true parameters. The power-weighted delay estimate can be expressed as:

$$\hat{\delta}_j := \frac{\iint_{(\tau,\omega) \in \Omega_j} |Y_1(\tau, \omega) Y_2(\tau, \omega)|^p \, \omega^q \, \delta(\tau, \omega) \, d\tau \, d\omega}{\iint_{(\tau,\omega) \in \Omega_j} |Y_1(\tau, \omega) Y_2(\tau, \omega)|^p \, \omega^q \, d\tau \, d\omega} \quad (7)$$

The time-frequency weight of a given time-frequency point is $|Y_1(\tau, \omega) Y_2(\tau, \omega)|^p \omega^q$, whereas $\Omega_j$ refers to the set of $(\tau, \omega)$ points determined to be associated with the $j$th cluster. $p = 1$ and $q = 0$ is a good default choice [17], but $p = 2$ and $q = 2$ is a better choice for low Signal-to-Noise-Ratio scenarios or speech. In the scenario where two sound sources have vastly different time-frequency weights, a compression is performed by setting the value of $p$ to less than 1 with $q = 2$.
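To make the feature extraction concrete, here is a minimal sketch in Python with NumPy/SciPy of Eqs. (4), (5), and (7); the function names, STFT settings, and delay cutoff are our own choices, not from the paper:

```python
import numpy as np
from scipy.signal import stft

def duet_features(y1, y2, fs, nperseg=1024, p=1.0, q=0.0):
    """Per time-frequency-point DUET features from two anechoic mixtures.

    Returns the relative attenuation, the relative delay in samples
    (Eq. (5)), and the weights |Y1*Y2|^p * w^q used in Eq. (7).
    """
    f, _, Y1 = stft(y1, fs=fs, nperseg=nperseg)
    _, _, Y2 = stft(y2, fs=fs, nperseg=nperseg)
    eps = 1e-12
    w = 2 * np.pi * f[:, None] / fs + eps   # normalized angular frequency (rad/sample)
    R = (Y2 + eps) / (Y1 + eps)             # time-frequency ratio, Eq. (4)
    alpha = np.abs(R)                       # relative attenuation a_j
    delta = -np.angle(R) / w                # relative delay (samples); valid when |w*delta| < pi
    weights = np.abs(Y1 * Y2) ** p * w ** q
    return alpha, delta, weights

def duet_histogram(alpha, delta, weights, bins=(35, 51), d_max=10.0):
    """Power-weighted 2-D histogram whose peaks estimate (a_j, delta_j)."""
    keep = np.abs(delta) < d_max            # discard clearly aliased bins
    hist, a_edges, d_edges = np.histogram2d(
        alpha[keep], delta[keep], bins=bins, weights=weights[keep])
    return hist, a_edges, d_edges
```

Peaks of the returned histogram give the mixing-parameter estimates $(\hat{a}_j, \hat{\delta}_j)$; the delay peaks are then converted to TDOAs via Eq. (6).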
2.2. CBMeMBer

The CBMeMBer filter is used to estimate the time-varying number of targets, as well as the states of the targets, based on the observations made by DUET. The CBMeMBer filter was implemented using the Sequential Monte Carlo (SMC) method. Similar to the particle filter used for acoustic localization and tracking [18], the CBMeMBer filter operates in two spaces: the multi-target state space, $\mathcal{F}(\mathcal{X})$, and the multi-target observation space, $\mathcal{F}(\mathcal{Z})$. The multi-target state, $X_k$, which consists of the Cartesian coordinates and velocities of the targets, is recursively estimated from the multi-target observation, $Z_k$, which contains the TDOAs of the observed targets:

$$X_k = \{x_{k,1}, \ldots, x_{k,N(k)}\} \in \mathcal{F}(\mathcal{X}), \quad (8)$$

$$Z_k = \{z_{k,1}, \ldots, z_{k,M(k)}\} \in \mathcal{F}(\mathcal{Z}), \quad (9)$$

The states of the targets are predicted and updated with each propagation of the multi-target multi-Bernoulli density, $\pi_k(\cdot \mid Z_{1:k})$. The multi-target posterior density is propagated recursively according to [7]:

$$\pi_{k|k-1}(X_k \mid Z_{1:k-1}) = \int f_{k|k-1}(X_k \mid X) \, \pi_{k-1}(X \mid Z_{1:k-1}) \, \delta X \quad (10)$$

$$\pi_k(X_k \mid Z_{1:k}) = \frac{g_k(Z_k \mid X_k) \, \pi_{k|k-1}(X_k \mid Z_{1:k-1})}{\int g_k(Z_k \mid X) \, \pi_{k|k-1}(X \mid Z_{1:k-1}) \, \delta X} \quad (11)$$

where $f_{k|k-1}(\cdot \mid \cdot)$ is the multi-target transition density and $g_k(\cdot \mid \cdot)$ is the multi-target likelihood at any given time $k$. The CBMeMBer filter propagates a parameterized multi-target multi-Bernoulli density, described by the parameters $r^{(i)}$ and $p^{(i)}$, as this is less computationally intensive than propagating the exact multi-target density. By using the parameterized approximation, a finite set of existence probabilities and a corresponding density based on the kinematic state of each sound source is propagated in time. Readers are referred to [7] for further details on the SMC implementation of the CBMeMBer filter.

3. PROPOSED TRACK MANAGEMENT

3.1. Track Management based on Sound Source Trajectory

Target state estimation is performed by the CBMeMBer technique without the need for data association in the filtering process. In order to perform data association of the estimated states, a track management algorithm is required to associate the states with the corresponding tracks by labelling them. The track management system used is similar to the ones used in [16] and [19], with modifications made to suit acoustic signal analysis. There are multiple factors, such as silent periods, speaker interaction, and room reverberation, that need to be considered when speech sound sources are to be tracked. To account for these factors, certain constraints are placed on the track management system.

Due to the nature of speech, there may be silent periods which provide no information that allows a source to be tracked. As a result, the track management system has to retain the track information for the particular speech sound source instead of terminating it. Applying the idea of the track management used in [15], the labels of several time steps are taken into consideration for the data association. The track management based on [15] has four basic stages. The first stage is to associate the estimated data at the current time step with the data from the previous time step. Non-associated data in the current time step are then considered for data association with previously missed data. If the current data have no association with the data from previous time steps, a new identity is assigned to the current data. Memory retention is only available for a set number of time steps: if a datum remains unassociated beyond this period, it is removed from the data association memory.

In the case of moving sound sources, the locations of the sound sources change along the trajectories of the individual sources. The true estimates follow a trajectory, whereas the false estimates resulting from noise or reverberation are spurious. Hence, the true tracks are considered for data association, while the spurious tracks that result from noise or reverberation are removed from track association.

A modification made to the track management system is the addition of a track merging algorithm in the CBMeMBer filter. Tracks within a short distance of each other are merged into a single track, as such scenarios are highly unlikely in real life: there is usually a personal space between speakers in most social settings, so the boundary of this personal space is used as the threshold for track merging. Because the track management accounts for missed data, a merged track will not be lost as long as the tracks do not overlap for a period exceeding the memory retention period of the track management algorithm.
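As a concrete illustration of the merging step, the sketch below (Python; the particle-cloud representation and the rule for combining existence probabilities are our assumptions, as the paper does not specify them) merges two SMC Bernoulli track components whose position estimates fall within the merging threshold:

```python
import numpy as np

def merge_tracks(t1, t2):
    """Merge two SMC Bernoulli track components into one.

    Each track is a dict with particle states 'x' (N x 4 arrays of
    [px, py, vx, vy]), particle weights 'w', and existence probability 'r'.
    """
    # combine the particle clouds of the two estimates
    x = np.vstack([t1["x"], t2["x"]])
    # reweight: scale each cloud by its existence probability, then normalize
    w = np.concatenate([t1["r"] * t1["w"], t2["r"] * t2["w"]])
    w /= w.sum()
    # combine the probabilities of existence (assumed rule: the merged
    # track exists if either original track exists)
    r = 1.0 - (1.0 - t1["r"]) * (1.0 - t2["r"])
    return {"x": x, "w": w, "r": min(r, 0.999)}

def maybe_merge(t1, t2, threshold=0.5):
    """Merge two tracks when their weighted mean positions are closer
    than the merging threshold (0.5 m in the paper's simulation)."""
    p1 = np.average(t1["x"][:, :2], axis=0, weights=t1["w"])
    p2 = np.average(t2["x"][:, :2], axis=0, weights=t2["w"])
    if np.linalg.norm(p1 - p2) < threshold:
        return merge_tracks(t1, t2)
    return None
```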
The details of the track management algorithm are given in Algorithm 1, Track Management Pseudocode.

4. SIMULATION AND DISCUSSION

Fig. 1: Room dimensions and setup (microphone pairs 0.4 m apart).

In [6], the DUET-CBMeMBer filter was shown to be capable of localizing and tracking two sound sources in an ideal scenario with no noise and reverberation. A similar simulation was performed here to further test the system's tracking and sound source identification capabilities in the presence of reverberation. The simulated room setup, shown in Fig. 1, is the same as the one used in [6], with four pairs of microphones, each 0.4 m apart, spread out across a 1 m x 1 m room. The main difference is the addition of a T60 reverberation time of 0.15 s. The distance between the microphones in each pair fulfils the condition $d < c/(2 f_{\max})$, as the speed of sound is 343 m/s while the frequency of interest is 8 kHz. The sound sources used in the scenario are, as in [6], synthetic male and female speech signals sampled at 16 kHz. For the CBMeMBer settings, the steady-state velocity, $\bar{v}$, and the rate constant, $B$, of a speaker in the Langevin model are set to 1.4 m/s and 1 Hz respectively. The birth model used is a Gaussian spawning at the birth location of the speakers. The standard deviation, $\sigma$, of the normally distributed likelihood is set at 0.5% of the maximum TDOA between the sensor pair, and the clutter rate, $\kappa$, is set to 6. The merging threshold used in this scenario is 0.5 m, so speakers within 0.5 m of each other are merged into a single track. An example of the simulation results is shown in Fig. 2.
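For reference, a minimal sketch of the dynamic and measurement models behind these settings, assuming the standard discretization of the Langevin model used for acoustic tracking (e.g., [18]); the variable names and microphone positions are placeholders, and the Gaussian likelihood follows the "normally distributed likelihood" described above:

```python
import numpy as np

C = 343.0                 # speed of sound (m/s)
V_BAR, BETA = 1.4, 1.0    # steady-state velocity (m/s) and rate constant (Hz)

def langevin_predict(x, dt, rng):
    """Propagate particles [px, py, vx, vy] with the Langevin model."""
    a = np.exp(-BETA * dt)
    b = V_BAR * np.sqrt(1.0 - a * a)
    x = x.copy()
    x[:, 2:] = a * x[:, 2:] + b * rng.standard_normal(x[:, 2:].shape)
    x[:, :2] += dt * x[:, 2:]
    return x

def tdoa(p, m1, m2):
    """Predicted TDOA (s) of a source at positions p for one microphone pair."""
    return (np.linalg.norm(p - m1, axis=-1)
            - np.linalg.norm(p - m2, axis=-1)) / C

def tdoa_likelihood(z, p, m1, m2, sigma):
    """Gaussian likelihood of a measured TDOA z given particle positions."""
    err = z - tdoa(p, m1, m2)
    return np.exp(-0.5 * (err / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
```

The update step of the SMC-CBMeMBer filter weights each Bernoulli component's particles by this likelihood for every microphone pair, which is how the TDOA observations from the different pairs are fused into Cartesian position estimates.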

Algorithm 1: Pseudocode for Multiple Sound Source Track Management

Data Initialization
1: Acquire the estimated sound source locations from the DUET-CBMeMBer filter, starting from the first time frame
2: Set the track merging threshold, the association distance threshold, and the maximum memory retention period
3: Set all association bits for the estimated sound source locations to 1
4: Set all missed bits for the estimated sound source locations to 0
5: Assign a new label to each estimated sound source location

Track Merging
1: Compare the distances between all the estimated sound source locations at each time frame
2: while distance between two estimated locations < merging threshold do
3:   Combine the particle clouds of the two estimates
4:   Reweight the particles
5:   Combine the probabilities of existence of the two particle clouds
6: end while
7: Estimate new sound source locations based on the combined particle clouds

Iterative Data Association
1: while time frame <= last time frame do
2:   Compare estimates from the current time frame with the directly previous time frame
3:   if distance between current estimate and previous estimate is within the association distance threshold then
4:     Assign the previous label to the current estimate and set the corresponding association bit to 1
5:   end if
6:   Compare estimates from the current time frame with previously missed data
7:   if distance between the current estimate and a previously missed estimate is within the association distance threshold scaled by the number of missed time steps then
8:     Assign the previously missed estimate's label to the current estimate and set the corresponding association bit to 1
9:   end if
10:  if the current estimate has no association to previous data or previously missed data then
11:    Assign a new label to the current estimate and set the corresponding association bit to 1
12:  end if
13:  if a previously missed datum has been missed for no more than the maximum memory retention period then
14:    Increase its missed counter by 1
15:  else
16:    Remove it from memory
17:  end if
18: end while

Fig. 2: Result of source tracking and labelling (x and y coordinates in metres of the true tracks and the estimates for Speaker 1 and Speaker 2 against time frame, with the corresponding label frequency histograms).

As shown in the result, the proposed algorithm is able to track both speakers and attach an identity to each while they are moving. The gaps in the tracks are due to the silent periods in speech: no observation can be extracted during these periods, so the DUET-CBMeMBer filter is unable to produce location estimates. As the track management algorithm accounts for the silent periods and retains the track information, the tracks continue to be propagated after these silent periods. Since the tracks do not cross paths in the scenario shown, the track management is mainly used to clean up the tracks by eliminating most of the false estimates resulting from room reverberation.

5. CONCLUSION

In conclusion, the addition of a track management algorithm to the DUET-CBMeMBer filter is demonstrated to be a viable method of adding identity information to the tracked sound sources. Apart from that, the track management algorithm also produces cleaner results, with significantly fewer false estimates due to room reverberation. Identity association of a tracked sound source is an important stepping stone towards the overall research aim of separating multiple moving sound sources.
The output of the proposed algorithm provides the ability to construct time-frequency masks based on each sound source's state information and identity. Future work involves the use of the tracking information, including identities, for time-frequency mask design and, subsequently, the reconstruction of the separated sound sources based on these masks.
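Mask construction is left to future work in the paper; purely as an illustration of the idea, the sketch below (Python; the nearest-TDOA assignment rule is our assumption, not the authors' design) builds binary time-frequency masks by assigning each time-frequency point to the tracked source whose predicted TDOA best matches the per-bin delay estimate of Eq. (5):

```python
import numpy as np

def binary_masks(delta, track_tdoas, fs, tol=None):
    """Assign each TF point to the labelled source with the nearest
    predicted TDOA.

    delta       : per-TF-bin relative delay estimates in samples (Eq. (5))
    track_tdoas : dict label -> predicted TDOA (s) from the tracker
    Returns a dict label -> boolean mask with the same shape as delta.
    """
    labels = list(track_tdoas)
    # distance of each TF bin's delay to each track's predicted delay
    d = np.stack([np.abs(delta / fs - track_tdoas[k]) for k in labels])
    nearest = np.argmin(d, axis=0)
    masks = {k: nearest == i for i, k in enumerate(labels)}
    if tol is not None:  # optionally reject bins far from every track
        far = d.min(axis=0) > tol
        for k in masks:
            masks[k] &= ~far
    return masks
```

Each mask would then be applied to one mixture STFT and inverted to reconstruct the corresponding source, as in DUET [17], with the track labels resolving the permutation across time frames.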

6. REFERENCES

[1] N. Grbic, X. J. Tao, S. Nordholm, and I. Claesson, "Blind signal separation using overcomplete subband representation," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 5, pp. 524-533, 2001.

[2] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830-1847, 2004.

[3] S. Y. Low, S. Nordholm, and R. Togneri, "Convolutive blind signal separation with post-processing," IEEE Transactions on Speech and Audio Processing, vol. 12, no. 5, pp. 539-548, 2004.

[4] V. G. Reju, S. N. Koh, and Y. Soon, "Underdetermined convolutive blind source separation via time frequency masking," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 1, pp. 101-116, 2010.

[5] I. Jafari, S. Haque, R. Togneri, and S. Nordholm, "Evaluations on underdetermined blind source separation in adverse environments using time-frequency masking," EURASIP Journal on Advances in Signal Processing, vol. 2013, no. 1, p. 162, 2013.

[6] N. Chong, S. Wong, B.-T. Vo, S. Nordholm, and I. Murray, "Multiple sound source tracking via degenerate unmixing estimation technique and cardinality balanced multi-target multi-Bernoulli filter (DUET-CBMeMBer) with track management," in Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), 2014 IEEE Ninth International Conference on. IEEE, 2014, pp. 1-6.

[7] B.-T. Vo, B.-N. Vo, and A. Cantoni, "The cardinality balanced multi-target multi-Bernoulli filter and its implementations," IEEE Transactions on Signal Processing, vol. 57, no. 2, pp. 409-423, 2009.

[8] B.-N. Vo, S. Singh, and W. K. Ma, "Tracking multiple speakers using random sets," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04). IEEE, 2004, vol. 2, pp. ii-357.

[9] N. T. Pham, W. Huang, and S. H. Ong, "Tracking multiple speakers using CPHD filter," in Proceedings of the 15th International Conference on Multimedia. ACM, 2007, pp. 529-532.

[10] A. Koutvas, E. Dermatas, and G. Kokkinakis, "Blind speech separation of moving speakers in real reverberant environments," in Acoustics, Speech, and Signal Processing (ICASSP '00). Proceedings. 2000 IEEE International Conference on. IEEE, 2000, vol. 2, pp. II1133-II1136.

[11] S. M. Naqvi, M. Yu, and J. A. Chambers, "A multimodal approach to blind source separation of moving sources," Selected Topics in Signal Processing, IEEE Journal of, vol. 4, no. 5, pp. 895-910, 2010.

[12] P. Pertilä, "Online blind speech separation using multiple acoustic speaker tracking and time frequency masking," Computer Speech & Language, vol. 27, no. 3, pp. 683-702, 2013.

[13] B. Loesch and B. Yang, "Online blind source separation based on time-frequency sparseness," in Acoustics, Speech and Signal Processing (ICASSP 2009). IEEE International Conference on. IEEE, 2009, pp. 117-120.

[14] A. Brutti and F. Nesta, "Multiple source tracking by sequential posterior kernel density estimation through GSCT," in Proc. of EUSIPCO, 2011, pp. 259-263.

[15] K. Shafique and M. Shah, "A noniterative greedy algorithm for multiframe point correspondence," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 27, no. 1, pp. 51-65, 2005.

[16] S. Wong, B.-T. Vo, B.-N. Vo, and R. Hoseinnezhad, "Multi-Bernoulli based track-before-detect with road constraints," in Information Fusion (FUSION), 2012 15th International Conference on. IEEE, 2012, pp. 840-846.

[17] S. Rickard, "The DUET blind source separation algorithm," in Blind Speech Separation, pp. 217-241. Springer, 2007.

[18] D. B. Ward, E. A. Lehmann, and R. C. Williamson, "Particle filtering algorithms for tracking an acoustic source in a reverberant environment," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 826-836, 2003.

[19] R. Hoseinnezhad, B.-N. Vo, B.-T. Vo, and D. Suter, "Visual tracking of numerous targets via multi-Bernoulli filtering of image data," Pattern Recognition, vol. 45, no. 10, pp. 3625-3635, 2012.