International Conference on Control, Automation and Systems 2008
Oct. 14-17, 2008 in COEX, Seoul, Korea

Sound Source Localization in Median Plane using Artificial Ear

Sangmoon Lee 1, Sungmok Hwang 2, Youngjin Park 3, Youn-sik Park 4
1 Department of Mechanical Engineering, KAIST, Daejeon, Korea (Tel : +82-42-869-364; E-mail: smansl@kaist.ac.kr)
2 Department of Mechanical Engineering, KAIST, Daejeon, Korea (Tel : +82-42-869-376; E-mail: tjdahr78@kaist.ac.kr)
3 Department of Mechanical Engineering, KAIST, Daejeon, Korea (Tel : +82-42-869-336; E-mail: yjpark@kaist.ac.kr)
4 Department of Mechanical Engineering, KAIST, Daejeon, Korea (Tel : +82-42-869-32; E-mail: yspark@kaist.ac.kr)

Abstract: In acoustical engineering, sound source localization refers to methods that estimate the direction or position of a sound source from acoustic signals measured by microphone arrays. The technique is widely used in 3-D sound technology, humanoid robots, teleconferencing and other applications. The ultimate purpose of robots is to live alongside human beings, so the robot industry demands a practical robot auditory system in the form of artificial ears that resemble the human external ear, including the pinna. For humanoid robots, an auditory system with ear pinnae is more beneficial for Human-Robot Interaction (HRI). In this paper, we propose a sound source localization method using a pair of artificial ears, each consisting of a single ear pinna and two microphones. We show the feasibility and localization performance of the proposed method for speech signals in the median plane. Through an experiment in an office environment, we confirm that robots with artificial ears can estimate the elevation angle of a speech signal using only two microphone output signals.

Keywords: Sound source localization, Relative Transfer Function (RTF), group delay, artificial ear

1.
INTRODUCTION
Sound source localization is a listener's ability to estimate the direction or position of a detected sound; in acoustical engineering the term also covers methods that do so from acoustic signals measured by microphone arrays [1]. Whereas vision is a directed sense, hearing is omnidirectional. Because it is not constrained to a field of view, hearing can supplement vision by locating events of interest outside that field. Conversely, visual information can compensate for localization errors in audition. In humanoid robots, for example, audio-visual integration, the combined use of speech and face recognition, can improve recognition of speech signals captured by a pair of microphones [2-3]. Auditory information is particularly important when vision is blocked by obstacles or when speakers cannot be recognized in the dark. Sound source localization as a form of auditory perception is therefore a step toward more natural Human-Robot Interaction (HRI) [4].
Sound source localization has been widely used in 3-D sound technology, humanoid robots, teleconferencing and other applications. The ultimate purpose of robots is to live alongside human beings, so the robot industry demands a practical robot auditory system in the form of artificial ears that resemble the human external ear, including the pinna. For humanoid robots, an auditory system with ear pinnae is more beneficial for HRI. There have been several attempts to apply artificial ears to sound source localization for robots. For instance, by combining vision and audio sensors, robots can learn to improve their initial localization ability through supervised learning or visual information [5-7].
However, this learning process works only when speakers are within the field of view; it requires training time whenever the robot's surroundings change, and it requires an additional imaging system. The humanoid robot SIG uses two pairs of microphones: one pair at the ear positions and the other inside the cover to cancel noise induced by the motors [8-9]. Keyrouz and Saleh proposed binaural localization using an HRTF database measured by four microphones, two placed inside and two outside the ear canals of a KEMAR (Knowles Electronics Manikin for Acoustic Research) humanoid head. They showed localization performance for wide-band sounds such as finger snapping or percussive noises [10]. They used direction-dependent spectral features corresponding to different sound source locations, but these features did not lie in the voice frequency band. To overcome this problem, Hwang and Park applied large artificial ears to robots and showed that speech signals can be localized with their method [11]. Although all of these approaches use binaural localization, i.e. a two-ear system, they are hardly applicable to humanoid robots for the reasons mentioned above: the ear pinna is too large, or localization is possible only for wide-band sounds.
In this paper, we propose a sound source localization method using artificial ears consisting of ear pinnae and ear canals, with four microphones in total. The proposed auditory system uses a spherical head and two ears, each composed of a single pinna and a pair of microphones. We show the feasibility and localization performance of the proposed method for speech signals in the median plane given limited computational resources.
Through an experiment in an office environment, we confirm that robots with artificial ears can estimate the elevation angle of a speech signal using only two microphone output signals.

2. PROPOSED ARTIFICIAL EAR DESIGN AND HEAD SHAPE
The artificial ears and spherical head model were manufactured as depicted in Fig. 1.
Fig. 1 Proposed artificial ear built in a spherical head.
Both the shape and the size of the ear pinna attached to the ear flange were designed to obtain spectral features distributed in the frequency range from 3 to 4 kHz, using the Diffraction and Reflection model (DR model) suggested by Lopez-Poveda and Meddis for accurate reproduction of the spectral notches for elevated sources [12]. However, the DR model is applicable only at positions in the concha aperture. Because of this limited reproduction region, we carried out experiments with several microphone positions, as presented in Fig. 1 (left).

3. FRONT-BACK DISCRIMINATION AND PLACEMENT OF TWO MICROPHONES AND EAR PINNAE
3.1 Front-back confusion
When two microphones in the free field are used to localize sound sources in 2-D space, two points share the same ITD (Inter-channel Time Difference); this phenomenon is called front-back confusion. In 3-D space the set of such points is often called the cone of confusion [13], because the locations of all sounds originating from points on this cone are indistinguishable.
3.2 Placement of two microphones and ear pinna
If only two microphones are used to localize sound sources in the median plane, front-back confusion occurs. As depicted in Fig. 2, when the ear pinna is absent, the cone of confusion forms about the dotted line passing through the two microphones. To overcome the cone of confusion, we placed an ear pinna so that it passes between the two microphones, as shown in Fig. 2. As the sound source is elevated from the lower to the upper region, there is a single elevation angle at which the output signal levels of the two microphones are equal. By making this elevation lie on the dotted line, we can perform front-back discrimination. The relative placement of the microphones and the ear pinna therefore plays an essential part in front-back discrimination and in the achievable localization range.
Fig. 2 Placement of two microphones and ear pinna.

4. ELEVATION ESTIMATION METHOD
4.1 Relative Transfer Function (RTF)
In most practical situations, information about the input sound is unknown. In particular, the characteristics of a voice signal change rapidly from word to word and depend strongly on the individual speaker. The Relative Transfer Function (RTF), measured from the two output signals alone, is therefore useful as long as it is not strongly affected by additive noise such as sound reflected from the surroundings. The RTF is computed as [14]
$$ RTF(f_k) = \frac{G_{xy}(f_k)}{G_{xx}(f_k)} \qquad (1) $$
where $G_{xy}$ is the cross-spectrum of the two microphone output signals and $G_{xx}$ is the auto-spectrum of the reference microphone signal.
4.2 Cleansing method
The measured RTF is no longer reliable if reflected waves contribute more to the two microphone signals than the direct wave. A cleansing procedure is therefore needed to suppress reflections, which otherwise make it hard for the auditory system to estimate the true source position accurately. We cleansed the RTF using a Hamming window of length 67:
$$ w[n] = \begin{cases} \alpha - \beta \cos\left(\dfrac{2\pi n}{M}\right), & 0 \le n \le M \\ 0, & \text{otherwise} \end{cases} \qquad \alpha = 0.54,\ \beta = 0.46 \qquad (2) $$
The window length was determined from the smallest distance between a microphone and a dominant reflecting surface [15]. An example of the cleansing process is shown in Fig. 3.
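The RTF estimate of equation (1) and the windowing of equation (2) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `estimate_rtf`, `hamming_window` and `cleanse_rtf`, the FFT length, the non-overlapping segment averaging, and the choice to center the window on zero lag of the relative impulse response are all hypothetical choices made for the example.

```python
import numpy as np


def estimate_rtf(x, y, nfft=256):
    """Estimate RTF(f_k) = G_xy(f_k) / G_xx(f_k), as in equation (1).

    x is the reference microphone signal, y the other microphone signal.
    Spectra are averaged over non-overlapping segments of nfft samples
    (the segmenting scheme is an assumption of this sketch).
    """
    nseg = len(x) // nfft
    gxy = np.zeros(nfft, dtype=complex)
    gxx = np.zeros(nfft)
    for i in range(nseg):
        seg = slice(i * nfft, (i + 1) * nfft)
        X = np.fft.fft(x[seg])
        Y = np.fft.fft(y[seg])
        gxy += np.conj(X) * Y      # cross-spectrum accumulation
        gxx += np.abs(X) ** 2      # auto-spectrum accumulation
    return gxy / (gxx + 1e-12)     # small constant guards against division by zero


def hamming_window(M, alpha=0.54, beta=0.46):
    """Hamming window of equation (2): M + 1 samples for 0 <= n <= M."""
    n = np.arange(M + 1)
    return alpha - beta * np.cos(2.0 * np.pi * n / M)


def cleanse_rtf(rtf, length=67):
    """Suppress late reflections by windowing the relative impulse response.

    The window is centered on zero lag (an assumption), so that both small
    positive and small negative relative delays are kept.
    """
    nfft = len(rtf)
    # Relative impulse response with zero lag moved to index nfft // 2.
    rir = np.fft.fftshift(np.real(np.fft.ifft(rtf)))
    window = np.zeros(nfft)
    start = nfft // 2 - length // 2
    window[start:start + length] = hamming_window(length - 1)
    cleansed_rir = rir * window
    # Undo the shift and return to the frequency domain.
    return np.fft.fft(np.fft.ifftshift(cleansed_rir))
```

With these definitions, `hamming_window(66)` yields the 67 samples of the window in (2), which matches `np.hamming(67)`. Any component of the relative impulse response that arrives outside the window, such as a late reflection, is set to zero before the RTF is transformed back.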
Fig. 3 Original Relative Impulse Response (RIR) (blue line), cleansed RIR (red line) and Hamming window (black line).
As shown in Fig. 3, the Relative Impulse Response (RIR), the time-domain counterpart of the RTF, obtained in a real environment contains components reflected from objects surrounding the listener. The cleansing process excludes these reflected waves.
4.3 Estimation of Time Delay of Arrival (TDOA)
Group delay is a measure of the transit time of a signal from the input port to the output port. From the phase response of the RTF we obtain the group delay, which gives the TDOA between microphones U and B [16]:
$$ \text{Group Delay} = -\frac{1}{2\pi} \frac{d}{df} \bigl( \angle RTF(f_k) \bigr) \qquad (3) $$
By applying free-field and far-field conditions, we can directly determine the sound source direction from the TDOA.

5. EXPERIMENT IN AN OFFICE ENVIRONMENT
5.1 Selected microphone positions
The proposed localization method relies mainly on the phase response of the RTF. Front-back discrimination, on the other hand, is based on the magnitude response of the RTF, which depends on the relative placement of the microphones and the artificial ear pinna chosen to avoid the cone of confusion. We therefore selected two microphones: one (Mic. B) behind the ear pinna and the other (Mic. U) in the upper part of the ear flange, where it is less affected by reflection from the turntable, as shown in Fig. 1. The artificial ears fitted with the microphones and the experimental set-up are shown in Fig. 4.
Fig. 4 The experimental set-up in an office environment
5.2 Verification of proposed artificial ear and localization method in median plane
An experiment was carried out in an office environment, where many noise sources degrade localization performance. The room measures 7 m × 13 m × 2.5 m. The background noise level in the room is 45 dB and the SNR is 25 dB. The input signals are male speech: voice 1, "ANG NYEONG HA SE YO", and voice 2, "BANG GAP SEUP NI DA". The distance between the speaker and the center of the artificial head is fixed at 1.2 m. The RTF magnitude response is shown in Fig. 5.
Fig. 5 RTF magnitude response
As a quantitative measure, the Inter-channel Level Difference (ILD) is used for front-back discrimination [16]:
$$ ILD = \overline{20 \log_{10} \lvert RTF(f) \rvert} = \frac{\sum_{n=0}^{N-1} 20 \log_{10} \lvert RTF_{UB}(f_n) \rvert \, df_n}{\sum_{n=0}^{N-1} df_n} \ \text{dB} \qquad (4) $$
The ILD computed for each sound source position is shown in Fig. 6.
Fig. 6 ILD profile
The ILD changes sign about the 60° position, where the two microphone levels are equal. As shown in Fig. 2, ρ is equal to 60° in this case: if the ILD is less than 0 dB, the sound source is located below 60°, and if the ILD is larger than 0 dB, the sound source is located above 60°. After front-back discrimination is accomplished, we can find the elevation angle of the sound source by using the RTF phase response to measure the group delay,
which gives the TDOA. Estimation performance and errors for sound sources located on the median plane at elevation angles from -30° to 210° are shown in Fig. 7 and Table 1.
Fig. 7 Localization performance for voice 1 (red) and voice 2 (green): estimated elevation angle [degree] versus true elevation angle [degree].
Table 1 Estimation error of elevation angles (degrees)
Estimation range   -30 ~ 210   -30 ~ 70   80 ~ 110   120 ~ 210
Voice 1            4.          1.7        15.9       3.3
Voice 2            5.1         1.9        2..6       4.1
We found that front-back discrimination can be performed using the RTF magnitude response, because the magnitude level rises and falls about the 60° elevation angle, and that the elevation angles of sound sources on the median plane can be estimated using the phase response of the RTF.

6. CONCLUSIONS AND FUTURE WORK
We proposed a design of artificial ears, each consisting of a single ear pinna and two microphones, and a sound localization method using these artificial ears. By placing the ear pinna and the two microphones appropriately, we can solve the front-back confusion problem. Although we use an ear pinna with a characteristic length of 7cm, the proposed method is applicable to the localization of speech signals. Through an experiment conducted in an office environment, we showed the feasibility of the proposed localization method for sound sources on the median plane. In the near future, we will determine the optimal microphone positions for sound sources in 3-D space through experiments in an office environment and investigate the localization performance.

7. ACKNOWLEDGEMENT
This work was supported by the BK21 program, the Intelligent Robotics Development Program, and the Korea Science and Engineering Foundation through the National Research Laboratory Program (RA-25-- 1112-) funded by the Ministry of Education, Science and Technology.

REFERENCES
[1] M. S. Brandstein and H. Silverman, "A practical methodology for speech source localization with microphone arrays," Computer Speech and Language, Vol. 11, No. 2, pp. 91-126, 1997.
[2] Y. Sasaki, S. Kagami and H. Mizoguchi, "Multiple sound source mapping for a mobile robot by self-motion triangulation," Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, October 9-15, 2006.
[3] K. Nakadai, D. Matsuura, H. G. Okuno and H. Tsujino, "Improvement of recognition of simultaneous speech signals using AV integration and scattering theory for humanoid robots," Speech Communication, Vol. 44, pp. 97-112, 2004.
[4] I. J. Hirsh and C. S. Watson, "Auditory psychophysics and perception," Annual Review of Psychology, Vol. 47, pp. 461-484, 1996.
[5] H. Nakashima and T. Mukai, "3D Sound Source Localization System Based on Learning of Binaural Hearing," IEEE International Conference on Systems, Man and Cybernetics, 2005.
[6] P. Arabi and S. Zaky, "Integrated Vision and Sound Localization," Proceedings of the Third International Conference on Information Fusion, Vol. 3, pp. 21-26, 2000.
[7] J. Hornstein, M. Lopes, J. Santos-Victor and F. Lacerda, "Sound Localization for Humanoid Robots - Building Audio-Motor Maps based on the HRTF," Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, Beijing, China, October 9-15, 2006.
[8] K. Nakadai, H. G. Okuno and H. Kitano, "Real-time sound source localization and separation for robot audition," Proceedings of IEEE International Conference on Spoken Language Processing, pp. 193-196, 2002.
[9] H. G. Okuno, K. Nakadai and H. Kitano, "Social Interaction of Humanoid Robot based on Audio-Visual Tracking," Proceedings of Eighteenth International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems (IEA/AIE-2002), Vol. 2358, pp. 725-735, 2002.
[10] F. Keyrouz and A. Abous Saleh, "Intelligent Sound Source Localization Based on Head-related Transfer Functions," IEEE International Conference on Control, Automation and Systems, pp. 97-14, 2007.
[11] S. Hwang, Y. Park and Y. Park, "Sound direction estimation using artificial ear," Proceedings of the International Conference on Control, Automation and Systems, pp. 196-191, October 17-20, 2007.
[12] E. A. Lopez-Poveda and R. Meddis, "A physical model of sound diffraction and reflections in the human concha," Journal of the Acoustical Society of America, Vol. 100, No. 5, pp. 3248-3259, 1996.
[13] C. I. Cheng and G. H. Wakefield, "Introduction to Head-Related Transfer Functions (HRTFs): Representations of HRTFs in Time, Frequency, and Space," Journal of the Audio Engineering Society, Vol. 49, No. 4, pp. 231-249, 2001.
[14] J. S. Bendat and A. G. Piersol, Random Data: Analysis and Measurement Procedures, Wiley, New York, 1999.
[15] S. Lee, Y. Park and Y. Park, "Sound direction estimation using artificial ear for Human-Robot Interface," Control, Automation and Systems Symposium, October 14, 2008, Seoul, Korea.
[16] J. Blauert, Spatial Hearing, revised edition, MIT Press, 1997.