Multiple Sound Sources Localization Using Energetic Analysis Method

VOL. 3, NO. 4, DECEMBER

Multiple Sound Sources Localization Using Energetic Analysis Method

Hasan Khaddour, Jiří Schimmel
Department of Telecommunications, FEEC, Brno University of Technology
Purkyňova 118, Brno, Czech Republic
Email: xkhadd@stud.feec.vutbr.cz, schimmel@feec.vutbr.cz

Abstract

In this article a method for multiple sound source localization is proposed. The method is based on the energetic analysis of B-format signals. The number of sound sources localized by this method can exceed the number of microphones used. The method was simulated in Matlab and tested in a real environment. Both simulation and experimental results show the efficiency of the method.

1 Introduction

Sound source localization methods have been investigated intensively. Several methods have been designed for the localization of a single sound source; most of them are based on time delay estimation [1] or on the phase difference [2]. Some methods are able to localize a number of sound sources equal to or smaller than the number of sensors (microphones) used, such as MUSIC (Multiple Signal Classification) [3]. MUSIC estimates the directions of arrival (DOAs) from the relation between the noise subspace and the signal subspace [3]. Newer methods have removed this limitation and are able to localize more sound sources. To achieve this, one method uses binary time-frequency masks for the blind separation of speech mixtures [4]; it relies on a property of the Gabor expansions of speech signals called W-disjoint orthogonality [4]. Another method, a blind source separation (BSS) approach, was presented in [5]; it estimates the DOAs by applying the Expectation-Maximization (EM) algorithm to a sparseness-based approach [5].
In this paper, a new method for multiple sound source localization using B-format signals is presented, in which the number of localized sound sources can exceed the number of sensors used (our method uses the horizontal-plane B-format signals only). The method is based on the energetic analysis of the B-format signals. The paper is organized as follows. Section 2 presents the principle of B-format signals. The energetic analysis method is introduced in Section 3. Section 4 presents the simulation results for this method in Matlab. The experimental results are presented in Section 5, and Section 6 concludes the paper.

2 B-Format Signals

B-format signals are able to represent sound sources in three-dimensional space. They consist of four signals, w(t), x(t), y(t) and z(t), which together carry all of the directional information about the sound sources [5]. The signals x(t) and y(t) provide information about the sound sources in the horizontal plane; they are recorded using two figure-of-eight microphones facing front-back (x(t)) and left-right (y(t)). The signal z(t) provides information about the vertical plane and is recorded using a figure-of-eight microphone facing up-down. The signal w(t) is recorded using an omnidirectional microphone, see Figure 1. The encoding equations for the B-format signals are [6]

w(t) = s(t) / √2,
x(t) = s(t) cos(θ) cos(φ),      (1)
y(t) = s(t) sin(θ) cos(φ),
z(t) = s(t) sin(φ),

where θ represents the azimuth angle of the source, φ represents the elevation angle of the source, and s(t) represents the sound signal.

Figure 1: Polar patterns of the B-format components.
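The encoding in (1) can be sketched as follows. This is an illustrative NumPy implementation (the function name and array conventions are ours, not from the original Matlab code):

```python
import numpy as np

def bformat_encode(s, azimuth, elevation):
    """Encode a mono signal s into first-order B-format, following eq. (1).

    Angles are in radians; returns the four signals (w, x, y, z).
    """
    w = s / np.sqrt(2.0)                          # omnidirectional component
    x = s * np.cos(azimuth) * np.cos(elevation)   # front-back figure-of-eight
    y = s * np.sin(azimuth) * np.cos(elevation)   # left-right figure-of-eight
    z = s * np.sin(elevation)                     # up-down figure-of-eight
    return w, x, y, z
```

For example, a source at azimuth 0 and elevation 0 yields x(t) = s(t) and y(t) = z(t) = 0, i.e. a source directly in front of the array.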

3 Energetic Analysis Method

The energetic analysis method is based on the fact that the sound source direction is opposite to the direction of the sound intensity vector. This principle has been used in spatial sound reproduction methods such as directional audio coding (DirAC) [7]; in this paper, however, it is used for multiple sound source localization using other criteria. The sound energy is distributed in time and frequency. The spectral density distribution of one of the signals is presented in Figure 2 using the spectrogram function in Matlab, where the length of the Hamming window was chosen to be 14 samples, the overlap was chosen to be 5 points, the number of sampling points used to calculate the discrete Fourier transform was 5, and the sampling frequency was 44100 Hz. Assuming that there are several sound sources, the energy at some time-frequency points is generated by several sound sources simultaneously. Therefore, it is not possible to determine all sound source positions from a single frequency bin.

Figure 2: Spectral density distribution of a sample of a speech signal recorded by an omnidirectional microphone.

In this method, the sound signals are divided in time and then in frequency using the short-time Fourier transform (STFT), where the window was chosen to be a Hann window with a length of 512 samples and the overlap was chosen to be half of the window size. The input signals for this method are the B-format signals; the intensity vector can be obtained for each time frame using the following equations [8]:

Ix(t, f) = (√2 / Z0) Re{ W*(t, f) X(t, f) },
Iy(t, f) = (√2 / Z0) Re{ W*(t, f) Y(t, f) },      (2)
Iz(t, f) = (√2 / Z0) Re{ W*(t, f) Z(t, f) },

where Z0 is the acoustic impedance of the air, t is time, f is frequency, * denotes the complex conjugate, W(t, f), X(t, f), Y(t, f) and Z(t, f) are the Fourier transforms of the B-format signals w(t), x(t), y(t) and z(t), respectively, and I(t, f) is the instantaneous intensity vector, defined as [8]

I(t, f) = [ Ix(t, f), Iy(t, f), Iz(t, f) ].      (3)

The instantaneous intensity vector points in the direction of the flow of sound energy, while the direction of arrival is assumed to be opposite to this direction. The azimuth of the sound source can be obtained as [8]

θ(t, f) = arctan( −Iy(t, f) / −Ix(t, f) ),      (4)

and the elevation as

φ(t, f) = arctan( −Iz(t, f) / √(Ix²(t, f) + Iy²(t, f)) ).      (5)

After the angles have been calculated for each frequency bin in each time frame, a statistical estimation of the angle distribution has to be performed, see Figure 3.

Figure 3: Diagram of the energetic analysis method: B-format signals → division of the signals in time → division of the signals into frequency bands → azimuth and elevation estimation → statistical calculation of the angles for each time frame → estimated angles.

In each time frame, we assume that only one sound source is dominant in each frequency bin. This assumption can hold since each sound signal differs from the others and the signals have different intensities in time. In this case, each frequency bin carries information about one sound source direction. We consider the direction from which a sound signal comes to be the direction that is most repeated over the frequency bins in each time frame. When several sound signals are emitted simultaneously, the direction of each sound source is repeated several times in each time frame, in different frequency bins. We can obtain a sound source direction as the angle that maximizes the summation of the probability function over the whole frequency interval for each time frame. In the case of a single sound source, the estimated direction can be written
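The per-bin direction estimation of Section 3 can be sketched as follows. This is an illustrative Python/NumPy version (the original implementation is in Matlab); the window length and 50% overlap follow the STFT settings above, and the constant factor √2/Z0 is omitted because it cancels in the angle:

```python
import numpy as np

def bin_azimuths(w, x, y, nwin=512):
    """Azimuth estimate per time-frequency bin from horizontal B-format
    signals, following eqs. (2) and (4): Ix ~ Re{W* X}, Iy ~ Re{W* Y}.

    Hann-windowed FFT frames with 50% overlap; returns azimuths in
    radians, one row per time frame, one column per frequency bin.
    """
    hop = nwin // 2
    win = np.hanning(nwin)
    az = []
    for i in range(0, len(w) - nwin + 1, hop):
        W = np.fft.rfft(win * w[i:i + nwin])
        X = np.fft.rfft(win * x[i:i + nwin])
        Y = np.fft.rfft(win * y[i:i + nwin])
        ix = np.real(np.conj(W) * X)  # horizontal intensity, x component
        iy = np.real(np.conj(W) * Y)  # horizontal intensity, y component
        # With signals encoded via eq. (1), Re{W* X} already points toward
        # the source; for a measured intensity that points along the energy
        # flow, negate both components as in eq. (4).
        az.append(np.arctan2(iy, ix))
    return np.array(az)
```

For a single noiseless source, every energetic bin yields (almost) the same azimuth; with several simultaneous sources, the per-bin azimuths cluster around the true directions and the statistical step of Figure 3 selects them.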

as

α̂(t) = argmax over α of Σ_{f=1}^{K} P(α | S(t, f)),      (6)

where α̂(t) is the estimated sound source direction, K is the number of frequency bins, S is the sound signal, t denotes the time frame index, f is the frequency bin, and P(α | S(t, f)) is the probability that the signal comes from the direction α.

The main difficulties faced by this method come generally from background noise, reverberation and microphone noise. However, the sound intensity coming from the sound source is greater than the noise and reverberation intensity. For some time frames, the detection error is larger when there is no active speaker.

4 Simulation Results

The method was simulated in Matlab. Figure 4 shows the simulation results. In this scenario we assumed four speakers in the horizontal plane speaking simultaneously. B-format signals were generated from their sound signals according to (1). With no additional noise, the method estimated the sound source positions perfectly. The peaks in Figure 4 denote the angles from which the sound is coming. As can be seen from Figure 4, the four sound source positions are estimated correctly. Some frequency bins indicate that the sound signal comes from other directions; this angle detection error arises because more than one signal has a component in the same frequency bin in the same time frame.

Figure 4: Simulation results in the absence of noise.

Two different noise signals were then added to each B-format signal. The first noise signal is a fan noise; its spectral energy distribution is presented in Figure 5 using the same spectrogram parameters as in Figure 2. The second noise signal is pseudo-random noise with a normal distribution with zero mean and unit standard deviation, generated by Matlab. The two noise signals were added to each other and were assumed to be located at different places around the microphones; these places were assumed to be equidistantly separated (i.e. 3 degrees from each other). The signal-to-noise ratio (SNR) between the signal and the additional noise signal is calculated using the following equation:

SNR = 10 log10( Ps / Pn ),      (7)

where Ps and Pn are the average powers of the signal and of the additional noise signal, respectively. The method was also able to detect the positions of the sound sources for all speakers, see Figure 6.

Figure 5: Spectral density distribution of a fan noise sound signal.

As can be seen from the simulation results, adding the noise decreases the ability to localize the sound sources. The noise signal influences the intelligibility of the sound source signal and changes the distribution of the sound intensity. However, since the sound source intensity is greater than the noise intensity, the method is still able to localize the sound sources correctly.

Figure 6: Simulation results in the presence of the pseudo-random noise signal and the fan noise signal.
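The frame-level decision rule of eq. (6), choosing the angle most repeated across the frequency bins, can be sketched as the argmax of a histogram over azimuth. The bin width and function name here are illustrative choices, not taken from the paper:

```python
import numpy as np

def frame_direction(bin_azimuths_deg, bin_width_deg=5.0):
    """Pick the frame-level direction of arrival as the centre of the
    most-populated histogram bin over the per-bin azimuths (in degrees)."""
    edges = np.arange(-180.0, 180.0 + bin_width_deg, bin_width_deg)
    counts, _ = np.histogram(bin_azimuths_deg, bins=edges)
    k = int(np.argmax(counts))
    return 0.5 * (edges[k] + edges[k + 1])  # centre of the winning bin
```

For several simultaneous sources, the same histogram can be searched for several local maxima instead of the single argmax.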

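The SNR between the clean signal and the additional noise, defined in Section 4 as a ratio of average powers in decibels, can be computed as follows (a minimal sketch; the variable names are ours):

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in dB between a signal and an additive noise signal,
    computed from their average powers: 10*log10(Ps/Pn)."""
    ps = np.mean(np.asarray(signal) ** 2)  # average signal power
    pn = np.mean(np.asarray(noise) ** 2)   # average noise power
    return 10.0 * np.log10(ps / pn)
```

For example, a noise amplitude ten times the signal amplitude gives an SNR of −20 dB.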
5 Experimental Results

The measurements were carried out in the acoustic laboratory of the Department of Telecommunications, FEEC, Brno University of Technology, where the experimental conditions are the same as in sound control rooms, listening rooms, or living rooms with a high-quality listening environment; the laboratory provides a semi-diffuse field with a reverberation time RT < 0.3 s in all octave bands. The measurements were carried out only for sound sources placed in the horizontal plane of the B-format microphone setup.

In the first part of the experiment, three people (two men and one woman) talked simultaneously in forty different positions. The positions were selected arbitrarily for each speaker on a circle around the microphones, see Figure 7. One sentence was chosen to be said by all speakers; the speech in each case lasted several seconds. The positions of the speakers in each case were registered and compared with the results of the method.

Figure 7: Speakers' positions around the microphones.

Two figure-of-eight microphones were used to pick up the signals x(t) and y(t), and one omnidirectional microphone was used to pick up the signal w(t). The directional sensitivity of the figure-of-eight microphones is shown in Figure 8.

Figure 8: The directional sensitivity of the figure-of-eight microphones [9].

The directions of arrival were estimated for each of the speakers' positions, and the results were compared with the real speakers' positions. The results show the ability of the method to estimate the sound source positions correctly. They are illustrated using box plots: the boxes have lines at the lower quartile, median, and upper quartile values; the whiskers show the extent of the rest of the data; outliers are shown as red crosses outside the whiskers.

Figure 9 shows one case of the experiment, where three speakers were at three different positions around the microphones. As can be seen, the positions of the three speakers are well estimated; the peaks denote the positions of the speakers. The speaker who stood at position (+180°) could also be considered to be at position (−180°). Since there are multiple sound sources (speakers), the sound intensity differs in time, and for some frequencies the sound energy comes from more than one speaker. Furthermore, there is background noise as well as noise coming from the microphones. Together, these factors make the localization less accurate. However, the method was able to estimate the sound source positions correctly in the real environment.

Figure 9: Estimated speakers' positions in the real environment.

As can be seen in Figure 10, the method was able to localize the sound source positions; the median error is between three and four degrees for all positions. The biggest error is about twelve degrees for the third speaker; the first and the third speakers are men and the second speaker is a woman.

Figure 10: Absolute angle error for each speaker in the case of three speakers.

In the second part of the experiment, four people (two men and two women) talked simultaneously. The same sentence as in the first part of the experiment was chosen to be said. The results show the ability of the method to localize the sound sources, as can be seen in Figure 11. It should be noted that the first and second speakers are women. The median error in this case was about 4 degrees.

Figure 11: Absolute angle error in the case of four speakers.

6 Conclusion

The energetic analysis method is a good method for multiple sound source localization. It achieved good results in both the simulated and the real environment. The angle detection errors come from the background noise and the reverberation signals. The method is able to localize more sound sources than the number of microphones used. The method can also be used for tracking moving targets when the duration of the time frame is chosen to suit the speed of the target's movement; errors can occur when the target moves too fast.

Acknowledgment

The described research was performed in laboratories supported by the SIX project, registration number CZ.1.05/2.1.00/03.0072, operational program Research and Development for Innovation.

References

[1] Carter, G. C., "Coherence and time delay estimation," Proceedings of the IEEE, vol. 75, no. 2, pp. 236-255, Feb. 1987.
[2] Taff, L. G., "Target localization from bearings-only observations," IEEE Transactions on Aerospace and Electronic Systems, vol. 33, no. 1, Jan. 1997.
[3] Schmidt, R., "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276-280, Mar. 1986.
[4] Yilmaz, O.; Rickard, S., "Blind separation of speech mixtures via time-frequency masking," IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830-1847, July 2004.
[5] Izumi, Y.; Ono, N.; Sagayama, S., "Sparseness-based 2ch BSS using the EM algorithm in reverberant environment," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 147-150, Oct. 2007.
[6] Benjamin, E.; Heller, A.; Lee, R., "Localization in horizontal-only ambisonic systems," in Proc. 121st Convention of the Audio Engineering Society, San Francisco, 2006.
[7] Pulkki, V., "Spatial sound reproduction with directional audio coding," Journal of the Audio Engineering Society, vol. 55, no. 6, pp. 503-516, June 2007.
[8] Ahonen, J.; Pulkki, V.; Kuech, F.; Kallinger, M.; Schultz-Amling, R., "Directional analysis of sound field with linear microphone array and applications in sound reproduction," in Proc. 124th Convention of the Audio Engineering Society, Amsterdam, The Netherlands, May 2008.
[9] AKG - 65 years of innovation [online]. Accessible from <http://www.akg.com/site/products/powerslave,id,35,pid,35,nodeid,,_language,EN,view,diagram.html>.
11st Convention of the Audio Engineering Society, San Francisco,. pp.1. [7] Pulkki, V.; Spatial Sound Reproduction with Directional audio coding J.Audio Eng.Soc.,vol.55,pp.53-51,Jun 7. [] E. Ahonen J.; Pulkki V., Kuech F.; Kallinger M.; Schultz- Amling R.; Directional analysis of sound field with linear microphone array and applications in sound reproduction. In Proc. AES 14th Convention, Amsterdam, The Netherlands, May. [9] AKG- 5 years of innovation [online]. [Citied.11.1]. Accessible from < http://www.akg.com/site/products/powerslave,id,35,pid, 35,nodeid,,_language,EN,view,diagram.html>. Conclusion The energetic analysis method is a good method for multiple sound source localization. It achieved good results in both simulated and real environment. The angle detection errors come from the background noise and the reverberation signals. The method is able to localize more sound sources than the number of the used microphones. The method can be used for tracking mobile targets, when the duration of time frame is chosen to be suitable for the speed of 9