Convention Paper Presented at the 131st Convention 2011 October New York, USA

Audio Engineering Society
Convention Paper
Presented at the 131st Convention
2011 October 20-23, New York, USA

This paper was peer-reviewed as a complete manuscript for presentation at this Convention. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York, USA. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

Head orientation tracking using binaural headset microphones

Hannes Gamper, Sakari Tervo, and Tapio Lokki
Aalto University School of Science, Department of Media Technology, P.O. Box 15400, FI-00076, Finland
Correspondence should be addressed to Hannes Gamper (first[dot]last[at]aalto.fi)

ABSTRACT

A head orientation tracking system using binaural headset microphones is proposed. Unlike previous approaches, the proposed method does not require anchor sources, but relies on the speech signals of the wearers of the binaural headsets. From the binaural microphone signals, time of arrival (TOA) and time difference of arrival (TDOA) estimates are obtained. The tracking is performed using a particle filter integrated with a maximum likelihood estimation function. In a case study, the proposed method is used to track the head orientations of three conferees in a meeting scenario. With an accuracy of about 10 degrees, the proposed method is shown to outperform a reference method, which achieves an accuracy of about 35 degrees.

1. INTRODUCTION

Augmented reality applications aim at embedding virtual content into the perception of the real world. To produce realistic auditory augmentation, knowledge about the head orientation of the user is often necessary. Conventional head tracking systems use, for example, cameras to track visually distinct markers, or inertial sensors to detect head movements.

Here we propose a head orientation tracking method for a teleconferencing scenario that employs microphone signals captured at the ears of the conferees. Binaural headsets with integrated microphones are used in audio augmented reality applications to enable the display of virtual audio content, whilst leaving the transducers acoustically transparent [1]. In a teleconference scenario, wearing such headsets allows the simultaneous perception of real and virtual participants.

Head orientation tracking with binaural headset microphones has previously been implemented using anchor sources at known positions [2]. In the method presented here, the head orientation tracking is based on the speech signals of the conferees. Thus, anchor sources are not required. The locations of the conferees are assumed to be known. Alternatively, the positions of the conferees can be estimated, for example, via acoustic source localisation and tracking [3, 4].

The proposed tracking method is based on time of arrival (TOA) and time difference of arrival (TDOA) estimates of speech signals. TDOA estimates are obtained from the binaural microphone signals recorded at the ears of the active speaker S and one listener L (cf. Fig. 1). From these TDOAs, the TOAs at the ears of listener L can be estimated by assuming a constant propagation delay from the acoustic centre to the ears of speaker S. Particle filtering is applied to the maximum likelihood estimation function of the TOAs and TDOAs to track the head orientation $\phi_H$ of listener L.

As a case study, a conference scenario with three participants is presented. In the experiment, each participant wore a binaural headset with integrated microphones. The head orientations of the conferees were tracked using the speech signals recorded from each conferee. The results are compared to a reference method from the literature.

2. PROPOSED METHOD

2.1. Geometrical quantities and TDOA estimation

This work concentrates on tracking the head orientation of the listener. The positions of the speaker S and the listener L are assumed to be known. A schematic view of the problem is shown in Fig. 1. The head orientation tracking is based on time of arrival (TOA) and time difference of arrival (TDOA) estimates. A TDOA is the difference between two TOAs:

$$\tau_{i,j}(\phi) = t_i(\phi) - t_j(\phi). \tag{1}$$

The TOA between the source position S and the right binaural microphone of the listener at location $L_r$ is expressed with respect to the angle $\phi$ (cf. Fig. 1) as

$$t_r(\phi) = c^{-1} D_{r,M} = c^{-1} \lVert L_r - S \rVert = c^{-1} \sqrt{\left(D_{L,M}\cos(\phi) - D_H\right)^2 + \left(D_{L,M}\sin(\phi)\right)^2}, \tag{2}$$

where $D_H$ is the head radius. The distance between the listener and the acoustic centre of the speaker (i.e., the mouth of the speaker) is approximated as $D_{L,M} \approx \lVert L - S \rVert$. The speed of sound is denoted by $c$ and assumed to be constant, $c = 345$ m/s.

Fig. 1: Schematic view of the head orientation estimation problem. The head orientation of the listener L is denoted with $\phi_H$. The mouth M of the speaker S is assumed to be the acoustic centre. $D_{M-E}$ denotes the mouth-to-ear distance. The distance of the mouth to L is given by $D_{L,M}$, and the distance of the mouth to the right ear of L is denoted as $D_{r,M}$. $S_l$, $S_r$, $L_l$ and $L_r$ indicate the left and right microphone signals of the speaker and listener.

Similarly, the TOA for the left binaural microphone of the listener at location $L_l$ is calculated as

$$t_l(\phi) = c^{-1} D_{l,M} = c^{-1} \lVert L_l - S \rVert = c^{-1} \sqrt{\left(D_{L,M}\cos(\phi) + D_H\right)^2 + \left(D_{L,M}\sin(\phi)\right)^2}. \tag{3}$$

Eqs. (2) and (3) yield two solutions for the angle $\phi$. One solution can be discarded by assuming that the listener is facing the speaker, hence the speaker is located in the frontal hemisphere of the listener. The head orientation of the listener $\phi_H$ can be derived from the angle $\phi$ and the angle between the listener and speaker position $\phi_{L,S}$ as

$$\phi_H = \phi_{L,S} - \phi + \frac{\pi}{2}. \tag{4}$$

The TDOAs are estimated using the generalised correlation method with phase transform weighting

(GCC-PHAT) [5]:

$$R_{i,j}(\tau) = \mathcal{F}^{-1}\left( \frac{X_i(\omega)\, X_j^{*}(\omega)}{\left| X_i(\omega)\, X_j^{*}(\omega) \right|} \right), \tag{5}$$

where $X_i(\omega)$ is the Fourier transform of the microphone signal $x_i(t)$, $(\cdot)^{*}$ denotes the complex conjugate, and $\mathcal{F}^{-1}(\cdot)$ denotes the inverse Fourier transform. The argument that maximises the GCC-PHAT function is the TDOA estimate

$$\hat{\tau}_{i,j} = \arg\max_{\tau} \left( R_{i,j}(\tau) \right). \tag{6}$$

The method proposed to derive TOA estimates from TDOA estimates is described in Section 2.3.

2.2. Head orientation tracking using particle filtering

Particle filtering is a technique that can be used to approximate the posterior probability density function underlying the Bayesian filtering problem [6]. Particle filtering is often used in tracking applications [3, 4], since it substantially lowers the computational load in non-linear and non-Gaussian filtering. In the method presented here, particle filtering is applied to track the head orientation of the conferees. During initialisation, the particles are uniformly distributed from 0 to $2\pi$, for each conferee.

Particle filtering consists of three steps: prediction, update, and resampling. The prediction step requires a model that predicts the movement of the source or the particles. Here, Brownian motion is assumed, i.e., the particle locations are propagated according to a random distribution [4]. To update the particles, a weight $w_k(n)$ is calculated for each of the $K$ particles at time step $n$, using the likelihood model introduced in the next section. The resampling of the particles is done with stratified resampling according to the weights [7]. After the resampling, the weights are given a uniform value, $w_k(n) = 1/K$.

The head orientation estimate of each listener is the (weighted) circular mean of the particles of that listener,

$$\hat{\phi}(n) = \arg\left\{ \sum_{k=1}^{K} w_k(n) \exp\left(i\,\phi_k(n)\right) \right\}, \tag{7}$$

where $i = \sqrt{-1}$. The reader is referred to [3] and [6] for detailed information on particle filtering.

2.3. Likelihood function

To determine the weights of each particle, a likelihood estimation function is used.
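The prediction, update, and resampling steps of Section 2.2, together with the circular mean of Eq. (7), might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian random-walk step size `step_std` used for the Brownian motion, and the choice to pass the likelihood in as a callable are all hypothetical.

```python
import numpy as np


def stratified_resample(weights, rng):
    """Stratified resampling: one uniform draw inside each of K equal strata."""
    K = len(weights)
    positions = (np.arange(K) + rng.random(K)) / K
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0  # guard against floating-point round-off
    return np.searchsorted(cumulative, positions)


def circular_mean(angles, weights):
    """Weighted circular mean as in Eq. (7), wrapped to [0, 2*pi)."""
    return np.angle(np.sum(weights * np.exp(1j * angles))) % (2 * np.pi)


def particle_filter_step(particles, likelihood, step_std, rng):
    """One prediction / update / resampling cycle for angular particles.

    likelihood: callable mapping an array of angles to non-negative values
                (assumed interface; any likelihood model can be plugged in).
    step_std:   standard deviation in radians of the assumed Gaussian step.
    """
    # Prediction: random-walk (Brownian) propagation, wrapped to [0, 2*pi).
    noise = rng.normal(0.0, step_std, particles.shape)
    particles = (particles + noise) % (2 * np.pi)

    # Update: weight each particle by the likelihood and normalise.
    weights = likelihood(particles) + 1e-300  # avoid division by zero
    weights = weights / np.sum(weights)

    # Estimate: weighted circular mean of the particles.
    estimate = circular_mean(particles, weights)

    # Resampling: afterwards all particles carry the uniform weight 1/K.
    particles = particles[stratified_resample(weights, rng)]
    return particles, estimate
```

A caller would initialise the particles uniformly on $[0, 2\pi)$, for example with `rng.uniform(0, 2 * np.pi, K)`, and call `particle_filter_step` once per frame in which speech activity is detected. The circular mean is used instead of an arithmetic mean so that particles on either side of the $0 / 2\pi$ wrap-around average correctly.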
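In this paper we consider the maximum likelihood estimation (MLE) function [3]

$$P_{\mathrm{MLE}}(\phi) = \prod_{\{i,j\}=1}^{M} p\left(\tau_{i,j}(\phi) \mid \hat{\tau}_{i,j}\right), \tag{8}$$

where $M$ is the number of binaural microphone pairs $\{i, j\}$, and

$$p\left(\tau_{i,j}(\phi) \mid \hat{\tau}_{i,j}\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\left(\tau_{i,j}(\phi) - \hat{\tau}_{i,j}\right)^2}{2\sigma^2} \right)$$

is a normal distribution with mean $\hat{\tau}_{i,j}$ and spread parameter $\sigma = 1$ (variance $\sigma^2$). Here, the errors of the TDOA estimates are assumed to be independent and identically distributed.

Consider a speaker with binaural microphones $S_r$ and $S_l$ and a listener with binaural microphones $L_r$ and $L_l$ (cf. Fig. 1). The TDOA estimates between the binaural microphones are $\hat{\tau}_{L_r,L_l}$, $\hat{\tau}_{L_r,S_r}$, $\hat{\tau}_{L_r,S_l}$, $\hat{\tau}_{L_l,S_r}$, and $\hat{\tau}_{L_l,S_l}$. The TDOA between the binaural microphones of the speaker is assumed to be zero, $\hat{\tau}_{S_r,S_l} = 0$. By assuming a constant TOA from the acoustic centre to the binaural microphones of the speaker, $t_{M-E}$ (the mouth-to-ear TOA), we can extract the TOA at the binaural microphones of the listener from the TDOA estimates:

$$\begin{aligned}
\hat{t}_{L_r} &= \hat{\tau}_{L_r,S_r} + t_{M-E}, \\
\hat{t}_{L_r} &= \hat{\tau}_{L_r,S_l} + t_{M-E}, \\
\hat{t}_{L_l} &= \hat{\tau}_{L_l,S_r} + t_{M-E}, \\
\hat{t}_{L_l} &= \hat{\tau}_{L_l,S_l} + t_{M-E},
\end{aligned} \tag{9}$$

where $t_{M-E}$ is derived from the mouth-to-ear distance $D_{M-E}$, which we estimated as 0.18 m. The MLE function for the head orientation tracking can now be written as

$$\begin{aligned}
P_{\mathrm{MLE}}(\phi) ={} & p\left(\tau_{L_r,L_l}(\phi) \mid \hat{\tau}_{L_r,L_l}\right) \\
& \cdot\, p\left(t_{L_r}(\phi) \mid \hat{\tau}_{L_r,S_r} + t_{M-E}\right) \cdot p\left(t_{L_r}(\phi) \mid \hat{\tau}_{L_r,S_l} + t_{M-E}\right) \\
& \cdot\, p\left(t_{L_l}(\phi) \mid \hat{\tau}_{L_l,S_r} + t_{M-E}\right) \cdot p\left(t_{L_l}(\phi) \mid \hat{\tau}_{L_l,S_l} + t_{M-E}\right).
\end{aligned} \tag{10}$$

2.4. Speech and Speaker Activity Detection

To improve the reliability of the tracking results, tracking is only performed when speech activity is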

detected in the binaural microphone signals. A simple measure for speech activity is the energy in the speech band (0.2-8 kHz). Speech activity is detected if the energy at one of the binaural headsets exceeds a certain threshold. The wearer of that headset is determined to be the currently active speaker. This simple activity detection works reasonably well if the conferees do not talk simultaneously.

3. EXPERIMENTAL SETUP

In a case study, the head orientations of three conferees were tracked during a conversation in a meeting scenario. The experimental setup is depicted in Fig. 2. Each participant wore a binaural headset with integrated microphones of type Philips SHN2500. The microphone signals were recorded at a sampling rate of 96 kHz. For analysis, the signals were divided into frames of 50 ms with 45 ms overlap using a Hann window. The recording was made in a multipurpose space with a reverberation time of about 0.3 s. The signal-to-noise ratio (SNR) was between 5 and 30 dB during active speech frames.

Fig. 2: The experimental setup of the case study. Three conferees (P1, P2 and P3) are involved in a discussion. AR markers on the heads of the conferees are tracked using the ARToolkit [8] to obtain the true location and head orientation of each participant.

The position and head orientation of each conferee were determined using the ARToolkit [8], which allows tracking predefined markers in a video stream. The markers placed on top of each conferee's head were recorded using a Canon EOS 7D camera, mounted about 4 m above the scene, and tracked via the ARToolkit.

Two methods for estimating the head orientation of the conferees were tested and compared against the ground truth data, i.e., the true head orientation of each conferee obtained from the video stream using the ARToolkit [8]:

Reference method: the head orientation is determined from the TDOA estimates between the binaural microphones of the listener [2] (see Appendix for details).
Proposed method: the head orientation is tracked using the TDOA estimates between the listener and speaker (Eq. (9)), the maximum likelihood estimation function (Eq. (10)), and particle filtering.

Speech activity and the currently active speaker were determined in each frame from the binaural microphone signals. For each conferee, the head orientation was determined or tracked in frames where speech activity from any of the other conferees was detected.

4. RESULTS AND DISCUSSION

Figure 3 illustrates the tracking results for each conferee and the speech activity map. Speech activity was detected in 67% of the frames. Table 1 shows the root mean squared error (RMSE) of the head orientation tracking. It is calculated for each conferee over the frames during which tracking was performed.

For all three conferees the proposed method clearly outperforms the reference method. This is partly due to the fact that the reference method estimates the head orientation based only on the TDOA estimate between the binaural microphone signals of the listener, whereas the proposed method also uses the TDOA estimates between the binaural microphones of the speaker and the listener. The fact that the particle filter takes into account the history

of each particle, i.e., its iterative nature, seems to improve the tracking performance.

As can be seen in Fig. 3, P2 rotated the head the most during the meeting scenario. The RMSE is largest for P2, since a moving target generally suffers from a larger tracking error than a steady one. The tracking deteriorates in passages with large head movements or low speech activity, for instance around 15 s into the recording for P2.

Fig. 3: Head orientation tracking results (angle in degrees versus time in seconds) for each of the three conferees P1, P2 and P3, showing the ground truth, the reference method, and the proposed method. The bottom graph indicates the frames where speech activity was detected.

Table 1: Root mean squared error (RMSE, in degrees) of the head orientation tracking for the reference and the proposed method, listed per conferee (P1, P2, P3). The results are calculated over the frames where tracking was performed.

A key factor for the tracking performance is the signal-to-noise ratio (SNR). Fig. 4 shows the RMSE of both methods as a function of the SNR in the frames. The RMSE for each SNR value is obtained by averaging the RMSE of all three conferees over all frames with at least that SNR. As expected, the performance of both methods is better in frames with high SNR. Above 30 dB SNR the performance of the reference method approaches the performance of the proposed method. This implies that with high SNR a single TDOA estimate between the binaural microphones of the listener provides a reliable estimation of the head orientation, whereas the use of additional TDOA estimates in the proposed method

yields only a minor improvement. In frames with low SNR, however, the proposed method clearly outperforms the reference method. Frames with low SNR provide weak evidence for tracking, hence in those frames the reference method fails, as it estimates the head orientation in each frame separately. The proposed method compensates for weak evidence in frames with low SNR by taking into account the tracking history, thus relying on strong tracking evidence found in frames with high SNR. Furthermore, the use of several TDOA estimates adds to the robustness of the proposed method.

Fig. 4: Root mean squared error (RMSE, in degrees) versus signal-to-noise ratio (SNR, in dB) for the reference method and the proposed method. The RMSE is averaged over three conferees.

5. CONCLUSION AND FUTURE WORK

A method is proposed for tracking the head orientation of conferees in a meeting scenario using binaural headset microphones. In contrast to previous methods [2], the proposed method does not require anchor sources. The method is based on calculating the TOA and TDOA estimates of the speech signals between the binaural microphones of the speaker and the listeners. The tracking is implemented using a particle filter integrated with a maximum likelihood estimation function. As a proof of concept, head orientation tracking was applied to three conferees in a meeting scenario. The proposed method achieved an accuracy of about 10 degrees, while the reference method taken from the literature achieved an accuracy of about 35 degrees.

To further improve the robustness and accuracy of the proposed method, a propagation model taking into account the particle velocity could be integrated, such as the Langevin model [9]. This would also enable blind tracking during frames without speech activity. The integration of acoustic source localisation and tracking [3, 4] into the system would allow tracking the positions of the conferees in addition to their head orientation.

6. ACKNOWLEDGMENTS

The research leading to these results has received funding from the Academy of Finland, project nos. [ and 14786], the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013) / ERC grant agreement no. [23636], the Helsinki Graduate School in Computer Science and Engineering (HECSE), and [MIDE program] of Aalto University.

7. REFERENCES

[1] A. Härmä, J. Jakka, M. Tikander, M. Karjalainen, T. Lokki, J. Hiipakka, and G. Lorho, "Augmented reality audio for mobile and wearable appliances," Journal of the Audio Engineering Society, vol. 52, no. 6, 2004.

[2] M. Tikander, A. Härmä, and M. Karjalainen, "Acoustic positioning and head tracking based on binaural signals," in 116th Audio Engineering Society Convention, Berlin, Germany, 2004.

[3] E. Lehmann, "Particle filtering methods for acoustic source localisation and tracking," Ph.D. dissertation, Australian National University, 2004.

[4] P. Pertilä, T. Korhonen, and A. Visa, "Measurement combination for acoustic source localization in a room environment," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2008, pp. 1-14, 2008.

[5] C. Knapp and G. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 24, no. 4.

[6] M. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Transactions on Signal Processing, vol. 50, no. 2, 2002.

[7] R. Douc and O. Cappé, "Comparison of resampling schemes for particle filtering," in 4th International Symposium on Image and Signal Processing and Analysis, Zagreb, Croatia, 2005.

[8] H. Kato and M. Billinghurst, "Marker tracking and HMD calibration for a video-based augmented reality conferencing system," in IEEE International Workshop on Augmented Reality, San Francisco, CA, USA, 1999.

[9] E. Lehmann, A. Johansson, and S. Nordholm, "Modeling of motion dynamics and its influence on the performance of a particle filter for acoustic speaker tracking," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, 2007.

APPENDIX: REFERENCE METHOD

The angle of incidence $\phi$ of a sound wave can be derived from the TDOA $\tau_{l,r}$ measured between the microphones of a binaural headset via a TDOA model [2]:

$$\tau_{l,r} = D_{\mathrm{head}} \left( \phi + \sin(\phi) \right), \tag{11}$$

where $D_{\mathrm{head}}$ corresponds to the head radius. Assuming the positions of both the listener and the sound source are known, the head orientation of the listener $\phi_H$ can be obtained via

$$\phi_H = \phi_{L,S} - \phi, \tag{12}$$

where $\phi_{L,S}$ is the angle between the listener and the source position. This approach is proposed in [2] for head orientation tracking, and serves as a reference method in the case study presented in this paper.
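Eq. (11) has no closed-form inverse, so obtaining $\phi$ from a measured TDOA requires a numerical step. The Python sketch below shows one possible way to do this with bisection, followed by Eq. (12). It rests on several assumptions that are not stated in the text: the head radius is divided by the speed of sound so that the model yields seconds, the head radius of 0.09 m is a hypothetical value, the angle is restricted to $[-\pi/2, \pi/2]$, and the sign convention of $\tau_{l,r}$ is chosen so that positive TDOAs map to positive angles.

```python
import math

SPEED_OF_SOUND = 345.0  # m/s, the value used in Section 2.1
HEAD_RADIUS = 0.09      # m, hypothetical value chosen for illustration


def tdoa_model(phi, head_radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Eq. (11), assuming the radius is divided by c to give seconds."""
    return head_radius / c * (phi + math.sin(phi))


def incidence_angle(tdoa, head_radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Invert Eq. (11) for phi in [-pi/2, pi/2] by bisection.

    The model is monotonically increasing in phi, so bisection converges.
    """
    lo, hi = -math.pi / 2, math.pi / 2
    # Clamp measurements that fall outside the range the model can produce.
    tdoa = min(max(tdoa, tdoa_model(lo, head_radius, c)),
               tdoa_model(hi, head_radius, c))
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if tdoa_model(mid, head_radius, c) < tdoa:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def head_orientation(tdoa, phi_ls, head_radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Eq. (12): phi_H = phi_LS - phi, wrapped to [0, 2*pi)."""
    return (phi_ls - incidence_angle(tdoa, head_radius, c)) % (2 * math.pi)
```

Because this estimate is computed from a single TDOA in each frame, a noisy TDOA maps directly to a noisy angle, which is consistent with the frame-by-frame behaviour of the reference method discussed in Section 4.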
