Airo International Research Journal, September 2013, Volume II, ISSN:


Name of author: Navin Kumar, Research Scholar, Department of Electronics, BR Ambedkar Bihar University, Muzaffarpur

ABSTRACT

Direction of arrival (DOA) estimation of speech signals using a set of spatially separated microphones in an array has many practical applications in everyday life. DOA estimates from microphone arrays placed on a conference table can be used to automatically steer cameras to the speaker if the conference is part of a video conferencing session or a long distance TV based classroom. The current paper describes direction of arrival estimation using microphone arrays.

KEYWORDS: Microphone, Array, DOA

INTRODUCTION

In current video-conferencing systems or video classrooms, the control of the video camera is performed in one of three ways. First, cameras that provide different fixed views of the room can be placed at different locations in the conference room to cover all the people in it. Second, the system could consist of one or two cameras operated by humans. Third, the system could consist of manual switches for each user or group of users that steer the camera in their direction when activated. The third category of systems is commonly used in long distance education that uses TV based classrooms. These systems turn out to be expensive in terms of the extra hardware or manpower required to operate them effectively and reliably. It would be desirable to have one or two video cameras that can be automatically steered towards the speaker. Most conferences and classrooms typically have one person speaking at a time and all others listening. The speaker, however, could be moving around in the room. Thus there is a need for a system that effectively and reliably locates and tracks a single speaker. Single speaker localization and tracking can be performed using either visual or acoustic data.

The fundamental principle behind direction of arrival (DOA) estimation using microphone arrays is to use the phase information present in signals picked up by sensors (microphones) that are spatially separated. When the microphones are spatially separated, the acoustic signals arrive at them with time differences. For a known array geometry, these time delays depend on the DOA of the signal. There are three main categories of methods that process this information to estimate the DOA.

The first category consists of the steered beamformer based methods. Beamformers combine the signals from spatially separated array sensors in such a way that the array output emphasizes signals from a certain look direction. Thus if a signal is present in the look direction, the power of the array output signal is high, and if there is no signal in the look direction the array output power is low. Hence, the array can be used to construct beamformers that look in all possible directions, and the direction that gives the maximum output power can be considered an estimate of the DOA. The delay and sum beamformer (DSB) is the simplest kind of beamformer that can be implemented. In a DSB, the signals are combined so that the theoretical delays computed for a particular look direction are compensated and the signals add constructively. The minimum-variance beamformer (MVB) is an improvement over the simple DSB. In an MVB, we minimize the power of the array output subject to the constraint that the gain in the look direction is unity.
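The scan over look directions described above can be illustrated with a minimal sketch of a delay and sum beamformer. It assumes a narrowband, noise-free signal, microphones uniformly spaced on a line, and a speed of sound of 343 m/s. The function names (`steering_phases`, `dsb_power`, `dsb_doa`) and the scenario values are hypothetical and are not taken from the paper. Only the Python standard library is used.

```python
import cmath
import math

C = 343.0  # speed of sound in m/s (assumed value)

def steering_phases(theta_deg, n_mics, d, freq):
    """Phase factor of a far-field plane wave at each microphone of a line array."""
    tau = d * math.sin(math.radians(theta_deg)) / C  # delay between adjacent mics
    return [cmath.exp(-2j * math.pi * freq * m * tau) for m in range(n_mics)]

def dsb_power(snapshot, look_deg, d, freq):
    """Output power of a delay and sum beamformer steered to look_deg."""
    a = steering_phases(look_deg, len(snapshot), d, freq)
    # Multiplying by the conjugate phases compensates the theoretical delays,
    # so signals from the look direction add constructively.
    y = sum(x * s.conjugate() for x, s in zip(snapshot, a)) / len(snapshot)
    return abs(y) ** 2

def dsb_doa(snapshot, d, freq, step_deg=1):
    """Scan look directions from -90 to +90 degrees and return the power peak."""
    angles = range(-90, 91, step_deg)
    return max(angles, key=lambda ang: dsb_power(snapshot, ang, d, freq))

# Hypothetical scenario: 4 microphones, 5 cm spacing, 2 kHz tone from +25 degrees.
snapshot = steering_phases(25, n_mics=4, d=0.05, freq=2000.0)
estimate = dsb_doa(snapshot, d=0.05, freq=2000.0)
```

The exhaustive `max` over all candidate angles is what makes this family of methods costly: every additional search dimension (elevation, range) multiplies the number of beamformer evaluations.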
The main advantage of a steered beamformer based algorithm is that with one set of computations we are able to detect the directions of all the sources that are impinging on the array. Thus it is inherently suited to detecting multiple sources. From considerations of the eigenvalues of the spatial correlation matrix, if we have N elements in an array, it is not possible to detect more than N-1 independent sources. Methods like complementary beamforming have been proposed to detect DOAs even when the number of sources is equal to or greater than the number of sensors. For our requirement, which is detecting and tracking a single user, the computational load involved in a steered beamformer based method is deemed to be too large. For example, if we have to perform 3-dimensional DOA estimation, we have to compute the array output power using beamformers that look in all azimuths (0° to 360°) and all elevations (-90° to +90°). For a resolution of 1°, this involves a search space of 64,979 search points. If we add to this the condition that the source is in the near field of the array, then the set of possible ranges (distances of the sources from the array) is added to the search space.

The second category consists of high-resolution subspace based methods. This category of methods divides the cross-correlation matrix of the array signals into signal and noise subspaces using eigenvalue decomposition (EVD) to perform DOA estimation. These methods are also used extensively in the context of spectral estimation. Multiple signal classification (MUSIC) is an example of one such method. These methods distinguish multiple sources that are located very close to each other much better than the steered beamformer based methods, because the metric that is computed gives much sharper peaks at the correct locations. The algorithm again involves an exhaustive search over the set of possible source locations.

The third and final category of methods is a two-step process. In the first step, the time delays are estimated for each pair of microphones in the array.
The second step consists of combining or fusing this information based on the known geometry of the array to come up with the best estimate of the DOA. There are various techniques that can be used to compute pairwise time delays, such as the generalized cross correlation (GCC) method or narrowband filtering followed by phase difference estimation of sinusoids. The phase transform (PHAT) is the most commonly used pre-filter for the GCC. The estimated time delay for a pair of microphones is assumed to be the delay that maximizes the GCC-PHAT function for that pair. Fusing of the pairwise time delay estimates (TDEs) is usually done in the least squares sense by solving a set of linear equations to minimize the squared error. The simplicity of the algorithm and the fact that a closed form solution can be obtained (as opposed to searching) have made TDE based methods the methods of choice for DOA estimation using microphone arrays.

Overview of Research

Various factors affect the accuracy of the DOA estimates obtained using the TDE based algorithm. The accuracy of the hardware used to capture the array signals, the sampling frequency, the number of microphones used, and the reverberation and noise present in the signals are some of these factors. The hardware that is used should introduce minimum phase errors between signals in different channels. This is a requirement no matter what method is used for DOA estimation. Also, the more microphones the array has, the better the estimates become. The sampling frequency becomes an important factor for TDE based methods, especially when the array is small in terms of the distance between the microphones. This is because small distances mean smaller time delays, which require higher sampling frequencies to increase the resolution of the delay estimates. In the case of low sampling frequencies, the parabolic interpolation formula has been used before to come up with a more accurate sub-sample estimate of the time delay. In this work we look at an alternative to time domain interpolation: directly computing the sub-sample correlation values from the cross-power spectral density (XPSD) while computing the inverse Fourier transform.
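A minimal sketch of GCC-PHAT with sub-sample lags is given here to make the idea concrete. It evaluates the inverse transform of the PHAT-weighted cross-power spectrum directly at fractional lags instead of interpolating in the time domain. Everything in it is an assumption made for illustration: the function names (`half_spectrum`, `gcc_phat_delay`, `tone_sum`), the grid of 1/8 sample, the synthetic test signal, and the naive DFT, which keeps the example free of dependencies (a practical implementation would use an FFT).

```python
import cmath
import math
import random

def half_spectrum(x):
    """DFT bins 1 .. n/2-1 of a real signal (naive O(n^2) transform)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(1, n // 2)]

def gcc_phat_delay(x1, x2, max_lag, upsample=8):
    """Delay of x2 relative to x1, in samples, on a grid of 1/upsample samples."""
    n = len(x1)
    phat = []
    for a, b in zip(half_spectrum(x1), half_spectrum(x2)):
        cross = b * a.conjugate()                        # cross-power spectrum bin
        mag = abs(cross)
        phat.append(cross / mag if mag > 1e-9 else 0j)   # PHAT: keep the phase only

    def gcc(lag):
        # Inverse transform evaluated at a possibly fractional lag; the
        # negative-frequency half is implied by Hermitian symmetry.
        return sum((g * cmath.exp(2j * math.pi * k * lag / n)).real
                   for k, g in enumerate(phat, start=1))

    lags = [s / upsample for s in range(-max_lag * upsample, max_lag * upsample + 1)]
    return max(lags, key=gcc)

# Synthetic broadband test signal: a sum of tones with random phases, so that a
# fractional (circular) delay can be applied exactly.
rng = random.Random(0)
n, true_delay = 64, 2.5
phases = [rng.uniform(0, 2 * math.pi) for _ in range(1, n // 2)]

def tone_sum(shift):
    return [sum(math.cos(2 * math.pi * k * (t - shift) / n + p)
                for k, p in enumerate(phases, start=1)) for t in range(n)]

x1, x2 = tone_sum(0.0), tone_sum(true_delay)
delay_samples = gcc_phat_delay(x1, x2, max_lag=4)
delay_seconds = delay_samples / 16000.0  # assuming a 16 kHz sampling rate
```

With integer lags only, the best available answers would be 2 or 3 samples; the fractional grid recovers the delay of 2.5 samples at the cost of evaluating more lags.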
For the purpose of fast tracking, we also study the performance of the TDE based algorithms with very short frames (32-64 ms) of signal data in the presence of moderate reverberation. Under such conditions the performance of the GCC-PHAT based method is only marginal compared with the performance we obtain with another method, the steered response power (SRP) method. The performance of the GCC-PHAT based method is degraded by the presence of impulsive errors in certain frames. These errors are caused by the algorithm picking the wrong peak in the GCC as the one corresponding to the delay.

Initial work to improve these results was geared towards a weighted least squares estimate. The idea is that, while computing the least squares estimate of the DOA, we give less weight to those equations that are found to be less reliable based on certain criteria. It was found that, because the time delay of arrival between two microphones is not a linear but rather a trigonometric function of the angle of arrival, larger time delays give rise to less reliable angle estimates. This observation leads to the first weighting coefficient. Such a weighted least squares estimate coincides with a maximum likelihood (ML) estimate only if the delay errors are independent and Gaussian with variances inversely proportional to the weights. In the presence of reverberation, the strongest peak is not always at the correct delay. Therefore those time delays whose second strongest peaks are close in strength to the strongest peak are also less reliable estimates. This leads to the second weighting coefficient. These two weighting coefficients can be combined to give a weighted least squares estimate of the DOA. This kind of weighting was found to reduce the number of impulsive errors in the DOA estimate, but it did not eliminate them.

Impulsive errors in the DOA estimates are very undesirable in applications like video camera steering or beamforming. A unit norm constrained adaptive algorithm was suggested to remove the impulsive errors. This algorithm, though slower to reach the steady-state DOA estimate, remains in the proximity of the correct DOA and does not contain impulsive errors.
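The weighted least squares fusion described above can be sketched for a line array as follows. The sketch assumes the far-field model delay = spacing × sin(θ) / c with c = 343 m/s. The two weights are hypothetical stand-ins for the two weighting coefficients, chosen only to show the mechanism: one shrinks as the implied angle approaches the line of the array, the other shrinks as the second GCC peak approaches the strongest one. The paper does not give its formulas, so neither weight should be read as the author's definition.

```python
import math

C = 343.0  # speed of sound in m/s (assumed value)

def weighted_ls_doa(pairs):
    """Fuse pairwise delays from a line array into one DOA estimate in degrees.

    Each entry is (spacing_m, delay_s, peak_ratio), where peak_ratio is the
    second-strongest GCC peak divided by the strongest one (0 = unambiguous).
    """
    num = den = 0.0
    for spacing, delay, peak_ratio in pairs:
        u = max(-1.0, min(1.0, C * delay / spacing))  # sin(theta) implied by this pair
        w_angle = math.sqrt(1.0 - u * u)   # hypothetical weight 1: small for large delays
        w_peak = 1.0 - peak_ratio          # hypothetical weight 2: small for ambiguous peaks
        w = w_angle * w_peak
        # Weighted normal equation for the model delay = spacing * sin(theta) / C.
        num += w * spacing * delay * C
        den += w * spacing * spacing
    sin_theta = max(-1.0, min(1.0, num / den))
    return math.degrees(math.asin(sin_theta))

true_sin = math.sin(math.radians(30.0))
clean = [(d, d * true_sin / C, 0.0) for d in (0.05, 0.10, 0.15)]
# Same data, but the widest pair picked a wrong peak that was nearly tied.
corrupted = clean[:2] + [(0.15, -0.15 * true_sin / C, 0.95)]

clean_estimate = weighted_ls_doa(clean)
corrupted_estimate = weighted_ls_doa(corrupted)
```

In the corrupted case the unreliable pair is down-weighted but still biases the result away from 30 degrees, in line with the observation that weighting reduces impulsive errors without eliminating them.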
From extensive studies of frame-wise GCC data, we propose an alternative method to improve the reliability of pre-adaptation estimates, named Time Delay Selection (TIDES). For the frames that contained impulsive errors, it was observed that, though the wrong delay had the strongest peak, a weak peak was almost always present at the correct delay as well. Therefore it makes sense not to discard these other peaks. Since each pair of microphones can give us multiple time delay candidates, we have at hand several candidate time delay sets, from among which we should choose one based on some criterion.

MICROPHONE ARRAY STRUCTURE AND CONVENTIONS

Figure 1 shows a 4-element uniform linear array (ULA) of microphones and a sound source in the far field of the array. We will use the uniform linear array to develop the principles of these conventional methods. Without loss of generality, these methods can be extended to three-dimensional arrays. The array consists of 4 microphones placed in a straight line with a uniform distance, d, between adjacent microphones. The sound source is assumed to be in the far field of the array. This means that the distance of the source, S, from the array is much greater than the distance between the microphones. Under this assumption, we can approximate the spherical wavefront that emanates from the source as a plane wavefront, as shown in the figure. Thus the sound waves reaching each of the microphones can be assumed to be parallel to each other. The direction perpendicular to the array is called the broadside direction or simply the broadside of the array. All DOAs will be measured with respect to this direction. Angles in the clockwise direction from the broadside (such as the one shown in Figure 1) are taken to be positive, and angles in the counterclockwise direction from the broadside are taken to be negative.

Figure 1 Uniform Linear Array with Far Field Source.

The signal from the source reaches the microphones at different times. This is because each sound wave has to travel a different distance to reach the different microphones. For example, the signal incident on microphone M3 has to travel an extra distance of d sin θ as compared to the signal incident on microphone M4. This results in the signal at microphone M3 being a time-delayed version of the signal at microphone M4. This argument can be extended to the other microphones in the array.

Consider the pair of microphones shown in Figure 2. These microphones form part of a uniform linear array with a distance d between adjacent microphones. Also shown are two sources whose signals are incident on the array at an angle of θ with respect to the broadside. The angles made by the sources are measured with respect to two different broadsides, one in front of the array and the other behind it. The extra distance traveled by either source signal to reach M1 as compared to M2 is d sin θ. Thus the pairwise time delays associated with either source will be the same. This is under the assumption that the microphones are omnidirectional, which means that the gain of the microphone does not change with the direction of the acoustic wavefront. What this means is that the ULA is only capable of distinguishing that the source is at an angle with respect to the line of the array, but not where exactly it is around the line. This is referred to as the front-back ambiguity of the array. A ULA can uniquely distinguish angles between -90° and +90° with respect to the broadside of the array.

Figure 2 Uniform Linear Array shown with front-back ambiguity.

Restrictions on the Array

There is a relationship between the frequency content of the incident signal and the maximum allowed separation between each pair of microphones in the array. Consider two sinusoids of the same frequency, but with a phase difference of φ between them. This phase difference is restricted to be between -π and π. A phase lag of φ that is greater than π cannot be distinguished from a phase lead of 2π - φ, and vice versa. For example, consider the sinusoids shown in Figure 3, where the second sinusoid has a phase lead of 5π/4; this cannot be distinguished from a phase lag of 3π/4.

Figure 3 Two pairs of sinusoids with different phase differences appear identical.
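Both ambiguities discussed above can be checked numerically with a short sketch. It assumes a speed of sound of 343 m/s and, purely for illustration, measures the angle of a source behind the array from the front broadside, so that the mirror image of 30° is 150°. The helper names `pair_delay` and `wrap_phase` are hypothetical.

```python
import math

C = 343.0  # speed of sound in m/s (assumed value)

def pair_delay(spacing, angle_deg):
    """Arrival-time difference between two microphones for a far-field source.

    angle_deg is measured clockwise from the front broadside, so a source
    behind the array has an angle between 90 and 270 degrees.
    """
    return spacing * math.sin(math.radians(angle_deg)) / C

def wrap_phase(phi):
    """Map any phase difference to its equivalent value between -pi and pi."""
    return math.atan2(math.sin(phi), math.cos(phi))

front = pair_delay(0.05, 30.0)       # source in front of the array
back = pair_delay(0.05, 150.0)       # mirror-image source behind the array
lead = wrap_phase(5 * math.pi / 4)   # a lead of 5*pi/4 reads as a lag of 3*pi/4
```

Here `front` and `back` are equal, which is the front-back ambiguity, and `lead` comes out as -3π/4, which is the phase ambiguity for differences larger than π.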

This phase ambiguity places an important restriction on the array geometry to prevent spatial aliasing when performing narrowband DOA estimation. Spatial aliasing happens when the phase delay, at the frequency of interest, between signals from a pair of microphones exceeds π. This causes the time delays to be interpreted wrongly, which in the end results in wrong DOA estimates. Consider a signal incident on a ULA at an angle θ. Let this broadband signal have a maximum frequency of f_max. We would like to restrict the phase difference, at this frequency, between the signals of a pair of microphones to be less than or equal to π. For two microphones separated by d, with c the speed of sound, the delay is d sin θ / c, so the condition is 2π f_max d sin θ / c ≤ π. Since |sin θ| ≤ 1, it holds for every angle if d ≤ c / (2 f_max), that is, if the spacing is at most half the shortest wavelength in the signal. For example, with c ≈ 343 m/s and f_max = 4 kHz, d must not exceed about 4.3 cm. When this condition is satisfied, spatial aliasing is avoided and correct DOA estimates can be obtained. Note that this consideration becomes important only when we are performing TDE from phase difference estimates of narrowband signals. Algorithms that directly compute the time delays of broadband signals using cross-correlations are not restricted in this manner.