Robotic Spatial Sound Localization and Its 3-D Sound Human Interface

Jie Huang, Katsunori Kume, Akira Saji, Masahiro Nishihashi, Teppei Watanabe and William L. Martens
The University of Aizu, Aizu-Wakamatsu, 965-8580, Japan

Abstract

In this paper, we describe a robotic spatial sound localization system using an auditory interface with four microphones arranged on the surface of a spherical robot head. The time differences and intensity differences from a sound source to the different microphones are analyzed by measuring the HRTFs around the spherical head in an anechoic chamber. It was found that while the time difference can be approximated by a simple equation, the intensity difference varies in a more complicated way with azimuth, elevation and frequency. A sound localization method based on time differences was proposed and tested by experiments. A sound interface for human listeners was also constructed from four loudspeakers with an arrangement similar to that of the microphone set. This interface can be used as a 3-D sound human interface by passing the four-channel audio signals from the microphone set directly to the four loudspeakers. It can also be used to create 3-D sound at an arbitrary spatial position determined by a virtual sound image or by the sound localization system.

1 Introduction

Mobile robot technology is an emerging field with wide applications. For example, a mobile robot can serve as a guard robot that detects suspicious objects by audition and vision. The robot can also be used as an Internet-connected agent robot by which the user can explore a new place without being there. The robot can even attend a meeting instead of its users, so that the users receive the remote auditory and visual scenes of the meeting room. For these purposes, the robot must be capable of handling multimedia resources, including sound media, to complement vision [6, 12]. Visual sensors are the most popular sensors used today for mobile robots.
However, since a robot generally looks at the external world through a camera, difficulties occur when an object does not lie in the visual field of the camera or when the lighting is poor. A robot cannot detect a non-visual event, which in many cases is accompanied by sound emission. In these situations, the most useful information is provided by audition. Audition is one of the most important senses used by humans and animals to recognize their environments. Although the spatial resolution of audition is relatively low compared with that of vision, the auditory system can complement and cooperate with vision systems. For example, sound localization can enable the robot to direct its camera toward a sound source. The auditory system of a mobile robot can also be used in a teleconference system to guide its camera to pick up the faces of speakers automatically [6, 8, 7]. In this paper, we focus on the techniques of spatial sound localization and its 3-D sound human interface [1, 3, 5, 9, 10].

2 A Multimodal Telerobot and Its Auditory Interface

A multimodal mobile telerobot named HERO (an abbreviation of HEaring RObot) has been developed (Figure 1) [6, 4, 11]. HERO is equipped with auditory sensors and a vision sensor, along with infrared and tactile sensors for obstacle avoidance. The auditory system of HERO has some properties similar to those of the human auditory system, with the aim of incorporating appropriate features of the human auditory system according to the engineering needs of efficiency and accuracy; it is explicitly not our purpose to simulate the human auditory system. The microphones are arranged on the surface of a spherical head (three on the side and one on the top, see Figure 1) with a radius of 15 cm, about 1.5 times that of a human head. Spatial cues, including time difference and intensity difference cues, are used for spatial sound localization [2]. A sphere-shaped head simplifies the formulation of the time difference calculation.
By using the top-mounted microphone, we can localize the elevation of sound sources based on the time difference and/or intensity difference cues without using the relatively uncertain spectral difference cue. As shown in Figure 1, a four-channel sound interface for
Proceedings of the First International Symposium on Cyber Worlds (CW'02), 0-7695-1862-1/02 $17.00 © 2002 IEEE
Figure 1. The 4-to-4 channel 3-D sound interface (telerobot, spatial sound processing of the sound sources, 1-channel 3-D sound reconstruction, and the 4-speaker 3-D sound interface)

3-D sound reproduction and reconstruction is constructed. With this 3-D sound interface, the 4-channel audio from the robot auditory interface can be reproduced directly, without any additional processing. Compared to a binaural 3-D sound interface, which usually uses headphones, the 4-speaker 3-D sound interface can create a wide 3-D sound field that accommodates a larger audience. The use of the 4-speaker 3-D sound interface is not limited to reproducing the 3-D sound from the HERO robot; it can also be used as an interface for creating virtual 3-D sound. In some cases, e.g. for bandwidth compression, we can send only one audio channel instead of four and then reconstruct the four-channel 3-D sound after receiving the signals.

3 Arrival Time Difference and Intensity Difference

3.1 Theoretical calculation of arrival time differences

The arrival time of a sound depends on the shortest path from the source position to the receiving microphone. Consider the case where a microphone (2, 3 or 4) and the sound source are in the same horizontal plane as the spherical center (i.e., the elevation of the sound source is 0). The sound path can be calculated as the length of the line (curve) S-P-M shown in Figure 2:

$d = \overline{SM}$ for $\theta \le \theta_p$, and $d = \overline{SP} + \mathrm{arc}(PM)$ for $\theta > \theta_p$ (1)

where $\overline{SP}$ is the tangent line to the sphere and P is the contact point. The angle $\theta_p$ can be obtained as

$\theta_p = \cos^{-1}(r/D)$ (2)

where $r$ is the radius of the sphere and $D$ is the distance of the sound source.

Figure 2. The shortest path from sound source S to microphone M

The arrival time from the sound source to microphone 1 (on the top) can be calculated by substituting $\theta$ by $90^\circ - \varphi$. When the elevation of the sound source is not zero, we need to rotate the coordinates around the x axis so that the elevation becomes zero.
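The shortest-path construction of Eqs. (1)-(2) can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function name, the default radius of 0.15 m (the 15 cm head of Section 2) and the default source distance are illustrative assumptions.

```python
import math

def arrival_time(theta, r=0.15, D=1.0, c=343.0):
    """Travel time (s) from a source at angle theta (rad, measured at the
    sphere center from the microphone direction) and distance D (m) to a
    microphone on a sphere of radius r (m); c is the speed of sound (m/s).
    Eq. (1): a straight chord S-M while the microphone is visible
    (theta <= theta_p), otherwise the tangent S-P plus the arc P-M."""
    theta_p = math.acos(r / D)                     # Eq. (2): grazing angle
    if theta <= theta_p:
        # direct path: law of cosines for the chord S-M
        d = math.sqrt(D * D + r * r - 2.0 * D * r * math.cos(theta))
    else:
        sp = math.sqrt(D * D - r * r)              # tangent length S-P
        arc = r * (theta - theta_p)                # arc P-M on the surface
        d = sp + arc
    return d / c
```

At $\theta = \theta_p$ both branches give $\sqrt{D^2 - r^2}$, so the path length is continuous across the grazing angle.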
Let the spherical center be the origin of the coordinates. The sound source S will then be at $(D\cos\theta\cos\varphi,\; D\sin\theta\cos\varphi,\; D\sin\varphi)$, where $\theta$ is the azimuth and $\varphi$ is the elevation of the sound source. The rotation angle $\psi$ satisfies

$\tan\psi = \dfrac{D\sin\varphi}{D\sin\theta\cos\varphi} = \dfrac{\tan\varphi}{\sin\theta}.$ (3)

After the rotation, the arrival time can be calculated in the same way as in the zero-elevation case. Figure 3 shows the calculated arrival time differences.

Figure 3. Calculated arrival time from the sound source to the microphone (with the arrival time from the sound source to point Q subtracted)

3.2 Comparison with the measured arrival time differences

The theoretical arrival time differences were compared with HRTF (Head Related Transfer Function) data, which were measured for azimuths from 0 to 180 degrees and elevations from 0 to 90 degrees in 5-degree steps. Two methods were used to calculate the arrival time differences from the HRTF data. The first method uses the cross-correlation between the HRTFs of different microphones. The cross-correlation method is widely used for time difference calculation. Figure 4 displays the obtained arrival time differences together with the theoretical arrival time differences (the smooth curved surface). From the figure, it can be seen that the time differences calculated from the measured HRTFs by the cross-correlation method match the theoretical values in general. However, there are several places with a large gap between the measured data and the theoretical values, namely the azimuths at which the sound source is on the opposite side of the sphere from microphone 2 or microphone 4. This is considered to be an influence of the surface of the sphere.

Figure 4. Arrival time difference between microphones 4 and 2
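The cross-correlation estimate used in Section 3.2 can be illustrated with a direct (non-FFT) implementation. This is a minimal sketch under the assumption of finite, equal-length signals; the function name is illustrative, and a practical system would use FFT-based correlation.

```python
def tdoa_crosscorr(x, y, fs):
    """Estimate the arrival time difference between two microphone
    signals x and y (equal-length lists of samples) by locating the peak
    of their cross-correlation; returns the lag of y relative to x in
    seconds, given the sampling frequency fs (Hz)."""
    n = len(x)
    best_lag, best_val = 0, -float("inf")
    for lag in range(-(n - 1), n):
        s = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                s += x[i] * y[j]       # correlate x with y shifted by lag
        if s > best_val:
            best_val, best_lag = s, lag
    return best_lag / fs
```

For an impulse in x at sample 4 and in y at sample 7, the peak falls at a lag of 3 samples, i.e. 3/fs seconds.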
Although the cross-correlation method is simple and useful, in many cases we need to calculate time differences for individual frequency components. The second method uses the phase differences between the Fourier-transformed HRTF data. Since the phase difference data are confined to the range $(-\pi, \pi]$, phase wrapping occurs, as shown in Figure 5 (upper plot). The phase wrapping, however, can be removed by the following phase shift:

$\Delta\phi' = \Delta\phi + 2\pi$ if $\Delta\phi < -\pi$; $\quad \Delta\phi' = \Delta\phi$ if $-\pi \le \Delta\phi \le \pi$; $\quad \Delta\phi' = \Delta\phi - 2\pi$ if $\Delta\phi > \pi$. (4)

Let $T$ be the period of the frequency component $f$; the time difference $\Delta t$ can then be expressed as

$\Delta t = \dfrac{\Delta\phi'}{2\pi}\,T = \dfrac{\Delta\phi'}{2\pi f}.$ (5)

Figure 6 shows the arrival time differences calculated from the phase differences for the 5Hz component. The results are very close to the theoretical values. Similar results can be obtained from other frequency components when the frequency is less than about 13kHz.
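Eqs. (4)-(5) together amount to a one-step phase unwrap followed by a scale by the component period. A minimal sketch (function name illustrative, not from the paper):

```python
import math

def phase_to_delay(dphi, f):
    """Convert an inter-microphone phase difference dphi (rad) at
    frequency f (Hz) into a time difference (s). First shift dphi back
    into [-pi, pi] (Eq. (4)), then divide by the angular frequency
    (Eq. (5))."""
    if dphi < -math.pi:
        dphi += 2.0 * math.pi       # wrapped downward: shift up
    elif dphi > math.pi:
        dphi -= 2.0 * math.pi       # wrapped upward: shift down
    return dphi / (2.0 * math.pi * f)
```

A quarter-cycle phase lead at 500 Hz, for example, corresponds to a quarter of the 2 ms period, i.e. 0.5 ms.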
Figure 5. Phase differences from the measured HRTF data. The horizontal axis is the setup azimuth of the HRTF data; the phase differences are calculated from the 5kHz frequency component. The upper plot shows the phase differences with phase wrapping, and the lower plot shows the phase differences with the phase wrapping removed.

Figure 6. Arrival time difference from phase difference (5Hz)

3.3 Theoretical calculation of the intensity differences

Compared with the time differences, the intensity differences for different azimuths, elevations and frequencies are very complicated. When a sound wave meets the spherical head, the head blocks its way, and the wave is bent to travel around the spherical surface. The sound then loses energy depending on the frequency, the path length and the radius of the sphere. Different sound waves travel along different sides of the sphere and finally rejoin at the position opposite the sound source. This phenomenon makes the sound intensity and phase complicated, as we saw in its influence on the cross-correlation in the previous section. As a very rough approximation, we assume the energy loss depends simply on the length of the path along the spherical surface, as shown in Figure 7. Here, we ignore the focusing and scattering effects of the sphere and the phase differences of the sound waves arriving via different paths.

Figure 7. Sound path length along the surface of the sphere

3.4 Comparison with the measured intensity differences

The sound intensity differences relative to the intensity of a sound from the front (azimuth 0 and elevation 0) are calculated using the HRTF data for different azimuths and elevations:

$R_{\mathrm{Intensity}}(\theta, \varphi, f) = \dfrac{I(\theta, \varphi, f)}{I(0, 0, f)}.$ (6)
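The quantity that the rough attenuation model of Section 3.3 (Figure 7) depends on, the portion of the path lying on the spherical surface, follows directly from the geometry of Figure 2. A minimal sketch with illustrative defaults (r = 0.15 m, D = 1.0 m):

```python
import math

def surface_path_length(theta, r=0.15, D=1.0):
    """Length (m) of the arc P-M that the sound travels along the
    spherical surface: zero while the microphone is directly visible
    (theta <= theta_p), and r*(theta - theta_p) beyond the grazing
    angle theta_p = acos(r/D)."""
    theta_p = math.acos(r / D)
    return r * max(0.0, theta - theta_p)
```

The surface path grows linearly with the angle past grazing, which is why the approximate attenuation in Figure 7 rises smoothly toward the side of the head opposite the source.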
One part of the results is shown in Figures 8 and 9, for frequencies of 5Hz and 5kHz respectively. Compared with the theoretical values shown in Figure 7, we can see that they follow the same trends: the sound energy decreases as the sound waves travel along the spherical surface. However, since the model is only a very rough approximation, we can expect it to capture only these rough features.

Figure 8. Sound intensity difference (5Hz)

Figure 9. Sound intensity difference (5kHz)

4 Spatial Sound Localization: Azimuth and Elevation Identification

4.1 Sound localization method

As shown in the previous section, the sound arrival time differences can be approximated well by a theoretical equation. The intensity differences, however, are more complicated and difficult to approximate. Although intensity differences are useful and important for sound localization in the human auditory system, the auditory system of the HERO robot has four spatially arranged microphones that can be used for azimuth and elevation localization. Using only time difference cues simplifies the sound localization method compared with methods that use intensity and spectral cues.

The arrival time differences are calculated from the sound waves at the different microphones by the cross-correlation method. By choosing different microphone pairs, there are six arrival time differences in total:

$\Delta t_m = (\Delta t_{12}, \Delta t_{13}, \Delta t_{14}, \Delta t_{23}, \Delta t_{24}, \Delta t_{34})$ (7)

where the indices denote the microphone numbers. These time differences are compared with the theoretical arrival time differences, which are calculated and stored in a database in advance. The distance between the theoretical arrival time differences and those of the input signals is calculated as

$e(\theta, \varphi) = \lVert \Delta t(\theta, \varphi) - \Delta t_m \rVert$ (8)

where $\Delta t(\theta, \varphi)$ is the vector of theoretical arrival time differences for the microphone pairs. The azimuth and elevation of the sound source are then estimated as the $\hat{\theta}$ and $\hat{\varphi}$ that minimize $e(\theta, \varphi)$.

As shown in the previous section, the measurement errors of the arrival time differences become large when the sound source is positioned behind the sphere as seen from the microphones. It is therefore better to choose microphone pairs on the front side relative to the sound source, i.e., the pairs with the smaller time differences.

4.2 Experiments and results

Sound localization experiments were conducted in an ordinary room. Test sounds including coin-dropping, glass-breaking and a piece of classical music were used. The sampling frequency was 44.1kHz, and the distance of the sound source was set to 1 m. Figures 10 and 11 show the results of sound localization using the 3 microphone pairs with the minimum time differences. The average localization errors using different numbers of microphone pairs are shown in Table 1. Choosing the three microphone pairs with the smallest arrival time differences achieved the best performance.

Figure 10. Results of elevation identification using 3 microphone pairs

Figure 11. Results of azimuth identification using 3 microphone pairs

Table 1. The average localization error using different numbers of microphone pairs (degrees)

number of microphone pairs:   2    3    4    5    6
elevation:                   2.7  2.6  2.7  2.9  2.9
azimuth:                     2.6  1.5  2.4  2.1  2.1

5 Evaluation Tests for the Four-channel 3-D Sound Interface

To evaluate the four-channel 3-D sound interface, psychological listening tests were conducted. Virtual sound images were created by convolving the sound sources with the measured HRTF data. Three subjects performed the listening tests. Sound stimuli of a radio news announcement, a piece of classical orchestral music and a glass-breaking sound were used. In the experiments, a multi-track HDD digital recorder (Fostex D824mk2), a 4-channel power amplifier (BOSE 12VI) and 4 speakers (YAMAHA NS-P21) were used. The sampling frequency was 44.1kHz at a resolution of 16 bits. Listeners sat on a chair with three of the speakers at the same level as their ears and one speaker above their head. All speakers were placed 1 m from the listeners. The experiments were conducted in an ordinary room in our lab. For each test, the sound stimuli were presented several times.

The results of the experiments are shown in Figures 12 and 13. Elevation testing was performed for 0, 15, 30, 60 and 90 degrees, with the azimuth set to 0 degrees.

Figure 12. Elevation recognized by the listeners
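The database-matching step of Section 4.1 (Eqs. (7)-(8)), restricted to the pairs with the smallest absolute time differences as recommended there, can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the dictionary layout and the pair labels are assumptions.

```python
import math

def localize(measured, database, n_pairs=3):
    """Given measured per-pair time differences {pair: dt} and a database
    {(azimuth, elevation): {pair: dt}} of theoretical differences,
    return the (azimuth, elevation) minimizing the Euclidean distance of
    Eq. (8). Only the n_pairs pairs with the smallest |dt| (microphones
    facing the source) are used, per Sec. 4.1."""
    pairs = sorted(measured, key=lambda p: abs(measured[p]))[:n_pairs]
    best, best_e = None, float("inf")
    for direction, theo in database.items():
        e = math.sqrt(sum((theo[p] - measured[p]) ** 2 for p in pairs))
        if e < best_e:
            best_e, best = e, direction
    return best
```

In practice the database would hold one entry per 5-degree grid point of the measured azimuth/elevation range, matching the HRTF measurement grid of Section 3.2.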
Azimuth testing was performed for 0, 90 and 180 degrees, with the elevation set to low, mid
and high (60 degrees) for all cases. From the results, it is clear that the system can reproduce elevation as well as azimuth.

Figure 13. Azimuth recognized by the listeners

6 Conclusion

In this paper, we described a robotic spatial sound localization system using a set of four microphones arranged on the surface of a spherical robot head. The time differences and intensity differences were analyzed by measuring the HRTFs around the spherical head in an anechoic chamber. It was found that while the time difference can be approximated by a simple equation, the intensity difference varies in a more complicated way with azimuth, elevation and frequency. A sound localization method based on time differences was proposed and tested by experiments. A sound interface for human listeners was also constructed from four loudspeakers with an arrangement similar to that of the microphone set. This interface can be used as a 3-D sound human interface by passing the four-channel audio signals from the microphone set directly to the four loudspeakers. It can also be used to create 3-D sound at an arbitrary spatial position determined by a virtual sound image or by the sound localization system.

References

[1] D. R. Begault. 3-D Sound for Virtual Reality and Multimedia. AP Professional, Boston, 1994.
[2] J. Blauert. Spatial Hearing: The Psychophysics of Human Sound Localization. The MIT Press, London, revised edition, 1997.
[3] K. Brandenburg and M. Bosi. Overview of MPEG audio: Current and future standards for low-bit-rate audio coding. J. Audio Engineering Society, (1-2):4–21, 1997.
[4] M. Cohen. A design for integrating the Internet Chair and a telerobot. In Proc. Int. Conf. Information Society in the 21st Century, pages 276–280, Aizu-Wakamatsu, Nov. 2000. U. Aizu.
[5] M. Gerzon. Periphony: With-height sound reproduction. J. Audio Engineering Society, 21(1-2):2–10, 1973.
[6] J. Huang. Spatial sound processing for a hearing robot. In Q. Jin, editor, Enabling Society with Information Technology, pages 197–206. Springer-Verlag, 2001.
[7] J. Huang, N. Ohnishi, and N. Sugie. Building ears for robots: Sound localization and separation. Artificial Life and Robotics, 1(4):157–163, 1997.
[8] J. Huang, T. Supaongprapa, I. Terakura, F. Wang, N. Ohnishi, and N. Sugie. A model-based sound localization system and its application to robot navigation. Robotics and Autonomous Systems, 27(4):199–209, 1999.
[9] J. Huopaniemi. Virtual Acoustics and 3-D Sound in Multimedia Signal Processing. Ph.D. dissertation, Helsinki University of Technology, Dept. of Electrical and Communications Engineering, 1999.
[10] D. Malham and A. Myatt. 3-D sound spatialization using Ambisonic techniques. Computer Music Journal, 19(4):58–70, 1995.
[11] W. L. Martens. Pseudophonic listening in reverberant environments: Implications for optimizing auditory display for the human user of a telerobotic listening system. In Proc. Int. Conf. Information Society in the 21st Century, pages 269–275, Aizu-Wakamatsu, Nov. 2000. U. Aizu.
[12] H. G. Okuno, K. Nakadai, T. Lourens, and H. Kitano. Sound and visual tracking for humanoid. In Proc. Int. Conf. Information Society in the 21st Century, pages 254–261, Aizu-Wakamatsu, Nov. 2000. U. Aizu.