Binaural Sound Source Localization Based on Steered Beamformer with Spherical Scatterer

Zhao Shuo, Chen Xun, Hao Xiaohui, Wu Rongbin, Wu Xihong
National Laboratory on Machine Perception, School of Electronic Engineering and Computer Science, Peking University, Beijing, China, 100871
Correspondence should be addressed to Wu Xihong (wxh@cis.pku.edu.cn)

ABSTRACT

Inspired by human sound localization, this paper introduces a novel approach to binaural sound source localization in the frontal azimuthal half-plane based on a steered beamformer. Instead of using HRTFs, a rigid sphere, whose transfer functions can be calculated accurately, is introduced to simulate the head effect. A sub-band beamformer using both the time cue (ITD) and the intensity cue (IID) is designed to process the sound scattered by the rigid sphere. In the multi-band processing, a specialized filterbank is designed and a joint judgement strategy is employed. The whole system consists of the algorithm part, the hardware part and the user interface. Evaluation results from simulation and measurement experiments are also presented.

1. INTRODUCTION

Sound source localization is defined as the determination of the coordinates of sound sources in relation to a point in space. Many automation systems, such as voice capture in conference rooms, hearing aids and security monitoring, require sound source localization. Over the last two decades, many approaches have been proposed to solve this problem, such as time-delay estimation (TDE) [1, 2, 3], beamforming [4, 5, 6, 20], hemisphere sampling [7], and accumulated correlation [8, 9]. Most of these methods rely exclusively upon the time cue, which is known as the interaural time difference (ITD) or the interaural phase difference (IPD). But according to the duplex theory [10], there is another important cue for sound source localization, the interaural intensity difference (IID) or interaural level difference (ILD), which has received little attention in the signal processing community [11].

In recent years, more and more localization algorithms have been applied to robotic localization and many robotic localizing devices have appeared. In these applications, there are mainly three types of systems: sensor arrays in the free field, sensor arrays on robots that do not consider intensity information [20], and sensor arrays on robots that consider both time and intensity cues [16, 19]. The last type, which tries to mimic human audition, has become more and more popular recently.

The ability of humans to perceive the spatial location of sound sources is truly a remarkable skill. The human brain uses interaural differences in various frequency bands to infer the location of a source [12, 13]. Both cues mentioned above are contained in the Head-Related Transfer Functions (HRTFs), HRTF_{L,R}, which are functions of the sound source location, the frequency, and also the human subject. The HRTFs, which characterize the scattering properties of a person's anatomy (especially the head, pinnae and torso), play a very important role in binaural source localization for humans. Due to the scattering of the anatomy, the IID cue becomes more notable compared to the free sound field situation, and both cues vary with frequency, which brings more information to the sound source localization task. Inspired by human binaural sound localization, we are interested in developing an artificial binaural sound source localization system with a human-like scatterer.
It seems that a dummy head is a suitable choice for the localization task because its complex structure approaches that of the human head. However, there are many limitations in using a dummy head. First of all, HRTFs of a dummy head are usually obtained by measurement, which is time-consuming, and it is experimentally difficult to obtain accurate results [14]. Low-frequency measurements are particularly problematic, partly because large loudspeakers are required, and partly because even good anechoic chambers reflect long-wavelength sound waves.

Although numerical computation can bring more accurate results, it requires a substantial effort including scanning, modeling, meshing, long computation times and post-processing. Second, the dummy head does not fit every application because of its bulky body, especially when the localization part must be integrated into a larger system. Another method that does not require HRTFs is auditory epipolar geometry, which can extract directional information of sound sources without using HRTFs [15]. But it judges only three directions - left, right or center - for the IID, while it gives a continuous estimate of the IPD against the direction of a sound source [16].

Instead of a dummy head, a rigid sphere with two microphones is employed here to generate the binaural cues, because its transfer functions are fairly similar to HRTFs at low and medium frequencies, where most of the speech energy is contained. Scattering theory [17], which is concerned with the effect of obstacles or inhomogeneities on incident waves, is used here. It provides a method to estimate the scattered field from knowledge of the incident field and the scattering obstacle. The transfer functions of a rigid sphere can be computed very easily due to the analytical solution of its scattered sound field using scattering theory [18], so that the binaural cues - ITD and IID - can be obtained accurately. In previous studies, [16, 19] also employed a rigid sphere with two microphones to simulate the head effect for sound localization. The localization algorithms in those two studies are mainly based on calculating distances between the cues in the received binaural sound and the cues in a database to decide the direction of the sound source. However, in such systems, the binaural cues extracted from the received sound are correlated with the sound source type to some extent. Moreover, the performance is also affected when there are multiple sources.

In our system, two omnidirectional microphones are placed at the two end-points of a diameter of a rigid sphere. Although the structure is front-back symmetrical and neither pinna nor torso effects are introduced, which means that the sound source localization works only in the frontal azimuthal half-plane, we call it binaural localization, and it still suffices for many applications. In our localization algorithm, a frequency-domain steered beamformer using both binaural cues is employed to make the system more robust [20]. Due to the introduction of the rigid sphere, the binaural cues vary with frequency, which calls for sub-band processing. The steered beamformer in each sub-band ranges from -90° to 90° in azimuth in the horizontal plane (elevation = 0°), and the final localization result is made by a joint judgement strategy. A multi-channel audio acquisition board is developed using a DSP and USB to collect the audio data received by the microphones and then send the data to the computer, in which the localization program is executed and the final result is shown in the GUI. Both simulation and measurement results are given to show the performance of the system.

The rest of the paper is organized as follows. In Sec. 2, the scattering theory and the transfer function of a rigid sphere are introduced. In Sec. 3, our specialized filterbanks, the sub-band beamformer, the decision making and the framework are presented in detail.
In Sec. 4, the system implementation, including the rigid sphere, the microphone setup and the DSP board, is introduced. In Sec. 5, both the experiment setup and the evaluation results are presented. Some discussion is given in Sec. 6, and the following section is the conclusion.

2. BINAURAL CUES OF SPHERICAL SCATTERER

According to scattering theory, the scattering problem in the frequency domain reduces to the solution of the Helmholtz equation for the complex potential $\psi_s(\mathbf{r})$, given as

$\nabla^2 \psi_s(\mathbf{r}) + k^2 \psi_s(\mathbf{r}) = 0$   (1)

with the following impedance boundary condition on the surface S of the scatterer:

$\left( \frac{\partial \psi_s(\mathbf{r})}{\partial n} + i\sigma\,\psi_s(\mathbf{r}) \right)\bigg|_S = 0$   (2)

where $k = \omega/c$ is the wavenumber, $\sigma$ is the constant characterizing the impedance of the scatterer, and $i = \sqrt{-1}$. In the particular case of a rigid scatterer ($\sigma = 0$) we have the Neumann boundary condition

$\left( \frac{\partial \psi_s(\mathbf{r})}{\partial n} \right)\bigg|_S = 0$   (3)

Rabinowitz et al. [21] present the solution for the pressure on the surface of the rigid sphere due to a sinusoidal point source at any range r greater than the sphere radius a.

Fig. 1: Comparison of the IID cue in the free sound field and in the rigid-sphere scattered sound field at 1 kHz.

With minor notational changes, their solution can be written as [18]

$\psi_s(r,\theta,\omega) = \frac{i\rho_0 c}{4\pi a^2}\,\Psi\, e^{-i\omega t}$   (4)

where $\Psi$ is the infinite series expansion

$\Psi = \sum_{m=0}^{\infty} (2m+1)\, P_m(\cos\theta)\, \frac{h_m(kr)}{h'_m(ka)}, \quad r > a$   (5)

Here $\theta$ is the angle of incidence, i.e. the angle between the ray from the center of the sphere to the source and the ray to the measurement point on the surface of the sphere. $P_m$ is the Legendre polynomial of degree m, $h_m$ is the mth-order spherical Hankel function, and $h'_m$ is the derivative of $h_m$ with respect to its argument. Define the rigid-sphere transfer function $H(r,\theta,\omega)$ as

$H(r,\theta,\omega) = \frac{\psi_s}{\psi_{ff}}$   (6)

where the free-field pressure $\psi_{ff}$ at a distance r from the source is given by

$\psi_{ff}(r,\omega) = \frac{i\omega\rho_0}{4\pi r}\, e^{i(kr-\omega t)}$   (7)

Then

$H(r,\theta,\omega) = \frac{r}{k a^2}\, e^{-ikr}\, \Psi$   (8)

$\Psi$ can be calculated with the iteration shown in [18]. The binaural cues at a certain frequency $\omega$ can be extracted from the transfer function H. The binaural cues ITD (IPD) and IID are defined as

$ITD(r,\theta,\omega) = \frac{IPD(r,\theta,\omega)}{\omega} = \frac{\arg(H_R/H_L)}{\omega}$   (9)

$IID(r,\theta,\omega) = \frac{|H_R|}{|H_L|}$   (10)

When computing the IPD, unwrapping should be used to correct the phase by adding multiples of $\pm 2\pi$, so that the ITD is smooth across a series of frequencies.

Fig. 1 shows a comparison of the IID cue in the free sound field and in the rigid-sphere scattered sound field. It can easily be seen that the IID cue becomes much more notable when a scatterer is introduced. Because r has only a small effect on H and on the binaural cues, especially when the source is in the far field, the argument r is omitted in the system algorithm design.
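As a concrete illustration (not part of the original paper), the following Python sketch evaluates the series of Eq. (5), the transfer function of Eq. (8) and the cues of Eqs. (9)-(10) using SciPy's spherical Bessel functions; the sphere radius, source distance, microphone placement at ±90° azimuth and series truncation are assumed values:

```python
# Illustrative sketch of Eqs. (5), (8), (9) and (10); the numerical values
# below (radius, distance, truncation) are assumptions, not the paper's.
import numpy as np
from scipy.special import spherical_jn, spherical_yn, eval_legendre

C = 343.0   # speed of sound in m/s (assumed)
A = 0.09    # sphere radius in m (assumed, roughly half the 18 cm diameter)

def sphere_tf(r, theta, freq, a=A, c=C, terms=40):
    """Rigid-sphere transfer function H(r, theta, omega) of Eq. (8).
    theta is the incidence angle between the source ray and the ray to the
    measurement point on the sphere surface, in radians."""
    k = 2.0 * np.pi * freq / c
    psi = 0.0 + 0.0j
    for m in range(terms):
        # spherical Hankel function of the first kind and its derivative
        h_kr = spherical_jn(m, k * r) + 1j * spherical_yn(m, k * r)
        dh_ka = (spherical_jn(m, k * a, derivative=True)
                 + 1j * spherical_yn(m, k * a, derivative=True))
        psi += (2 * m + 1) * eval_legendre(m, np.cos(theta)) * h_kr / dh_ka
    return (r / (k * a ** 2)) * np.exp(-1j * k * r) * psi

def binaural_cues(azimuth_deg, freq, r=1.4):
    """ITD (s) and IID (amplitude ratio) for a source in the frontal
    half-plane, with the microphones assumed at azimuths +90 and -90 deg."""
    phi = np.deg2rad(azimuth_deg)
    h_left = sphere_tf(r, np.arccos(np.sin(phi)), freq)
    h_right = sphere_tf(r, np.arccos(-np.sin(phi)), freq)
    ipd = np.angle(h_right / h_left)        # Eq. (9), before unwrapping
    itd = ipd / (2.0 * np.pi * freq)
    iid = np.abs(h_right) / np.abs(h_left)  # Eq. (10)
    return itd, iid

if __name__ == "__main__":
    itd, iid = binaural_cues(45.0, 1000.0)
    print(f"45 deg, 1 kHz: ITD = {itd * 1e6:.1f} us, "
          f"IID = {20 * np.log10(iid):.2f} dB")
```

In a full implementation the cue tables would be tabulated off-line over azimuth and the sub-band centre frequencies, with the IPD unwrapped across frequency as described above.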

3. SOUND LOCALIZATION BASED ON SUB-BAND BEAMFORMER

3.1. Beamformer with Both Binaural Cues

The output y(t) of a basic N-microphone delay-and-sum beamformer is defined as

$y(t) = \sum_{n=1}^{N} x_n(t - \tau_n)$   (11)

This can be realized in the frequency domain by calculating only the cross-correlation function [20]. However, as noted in Sec. 1, only the time cue is used in this beamformer, while the intensity cue is neglected even though it becomes notable when the scatterer is introduced. For the binaural system, according to the definitions of the binaural cues in equations (9) and (10), the beamformer with both binaural cues at a certain frequency $\omega$ is modified as equation (12) to maximize the output Y:

$Y(\omega) = X_L(\omega) + \frac{1}{IID(\theta,\omega)}\, X_R(\omega)\, e^{i\omega\, ITD(\theta,\omega)}$   (12)

Then the beamformer output energy becomes

$E_Y = |X_L|^2 + \frac{1}{IID^2}\,|X_R|^2 + \frac{2}{IID}\,\mathrm{Re}\{X_L^* X_R\, e^{i\omega\, ITD}\}$   (13)

which is different from the basic beamformer in [20] because the IID is not a constant as the direction $\theta$ changes. The beamformer output energy computed in the frequency domain thus relates not only to the cross-correlation function but also to the $|X_L|^2 + |X_R|^2/IID^2$ part, which is not a constant. Any further improvement on this IID-related beamformer, such as spectral weighting [20], should consider all parts of the beamformer in equation (13).

3.2. Specialized Filterbanks

As shown in Sec. 2, both binaural cues vary with frequency, so the beamformer in (12) should in principle be applied to each frequency component. However, this is not practical because of the large computational cost. Here, sub-band processing is introduced to reduce the computational complexity. In each sub-band, it is assumed that all the frequency components share the same binaural cues as the center-frequency component. Gammatone filterbanks [22] are the most frequently used filterbanks in auditory signal processing because they simulate the sub-band processing of the cochlea well. However, there is much overlap among the bands of a gammatone filterbank, which makes the assumption mentioned above unreasonable and degrades the beamformer directivity. Here, a specialized filterbank is designed as follows. The center frequencies of the filterbank are the same as those of the gammatone filterbank, because of their distribution on the log-frequency axis. The bandwidth of each sub-band is decided by the center frequencies of its two neighboring sub-bands, which restricts unnecessary overlap among sub-bands so that the beamformer based on the assumption mentioned above performs better. Each sub-band filter can be realized as a basic FIR filter. An example magnitude-frequency response of this type of filterbank is shown in Fig. 2.

Fig. 2: A magnitude-frequency response example of the specialized filterbanks employed in the algorithm.

Fig. 3: The beamformer output energy at different azimuths using white noise.

3.3. Multi-band Joint Judgement

Assume that the input signal is filtered into M sub-bands by the filterbanks. In each sub-band, a beamformer is steered over the possible range to search for peaks in its output energy. It is well known that the peak width is a function of frequency: there are just a few wide peaks at low frequencies, while many narrow peaks exist at high frequencies. It is suggested in [5] to use a coarse-fine search to improve the algorithm efficiency. But this strategy requires that the results of the low-frequency beamformers be as correct as possible, or the search range of the high-frequency beamformers may be misled by previous results. In other words, the decision-making method in the coarse-fine search should vary with the sound source spectrum, which is not robust to the sound source type. In all bands, the same step, which is also the spatial resolution of the system, is taken in the steered beamformer. Obviously, this method brings more computation, but it can still work in real time due to our -90° to 90° task. Steered beamformers in the various sub-bands export their own results independently, and then the final localization result is obtained by the multi-band joint judgement strategy.
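To make the per-sub-band steering concrete, the sketch below (illustrative Python, not the authors' implementation; the sub-band STFT inputs, bin frequencies and precomputed cue tables are assumed variables) evaluates the output energy of Eqs. (12)-(13) on a grid of candidate azimuths for one sub-band:

```python
# Illustrative per-sub-band steered beamformer energy, following Eq. (12)-(13).
# X_l, X_r: STFT coefficients of the two channels restricted to one sub-band,
# shape [n_frames, n_bins]; freqs: bin frequencies in Hz; itd_tab, iid_tab:
# standard cues of the sub-band centre frequency for each candidate azimuth,
# tabulated off-line from the rigid-sphere transfer function (assumed inputs).
import numpy as np

def subband_energy(X_l, X_r, freqs, itd_tab, iid_tab):
    """Return the beamformer output energy E_Y for every candidate azimuth."""
    omega = 2.0 * np.pi * np.asarray(freqs)           # [n_bins]
    energy = np.empty(len(itd_tab))
    for i in range(len(itd_tab)):
        # steer the right channel by the standard ITD and IID of azimuth i
        steer = np.exp(1j * omega * itd_tab[i]) / iid_tab[i]
        y = X_l + steer[np.newaxis, :] * X_r          # Eq. (12) per bin/frame
        energy[i] = np.sum(np.abs(y) ** 2)            # Eq. (13), summed
    return energy

# Example: a 1-degree grid over the frontal half-plane.
# azimuths = np.arange(-90, 91)
# E = subband_energy(X_l, X_r, freqs, itd_tab, iid_tab)  # itd_tab[i] <-> azimuths[i]
```

Each of the M sub-bands produces such an energy curve, which feeds the joint judgement described next.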

In this joint judgement strategy, the ith beamformer output energy $E_i(\theta)$ is normalized and converted to the probability $P_i(\theta)$ as

$P_i(\theta) = \frac{E_i(\theta)}{E_0\, B(\theta)}$   (14)

Here $E_0$ is the energy threshold that ensures the probability $P_i(\theta)$ is not greater than 1, and $B(\theta)$ is the directional energy compensation factor. As shown in Fig. 3, when sweeping over different azimuths using white noise, the normalized output energy $B(\theta)$ varies with direction. In order to treat all directions equally, this gain introduced by the beamformer itself should be removed, as in equation (14). Then the joint probability $P(\theta)$ is calculated as equation (15) to obtain the final result:

$P(\theta) = \prod_{i=1}^{M} P_i(\theta)$   (15)

In order to find peaks in $P(\theta)$ more easily and accurately, $P(\theta)$ is smoothed by convolving it with a Hanning window whose length is chosen empirically. The smoothing makes the real peaks more distinct and, in particular, removes fake candidates near the real ones. The final source localization result $\hat{\theta}$ is obtained by

$\hat{\theta} = \arg\max_{\theta} P(\theta)$   (16)
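A compact sketch of this joint judgement and smoothing step (illustrative Python with assumed names and an assumed window length; the product in Eq. (15) follows the "joint probability" reading above):

```python
# Illustrative multi-band joint judgement of Eqs. (14)-(16); energies, B, E0
# and the window length are assumed inputs, not values from the paper.
import numpy as np

def joint_judgement(energies, B, E0, azimuths_deg, win_len=11):
    """energies: [M, n_azimuths] sub-band beamformer output energies.
    B: [n_azimuths] directional compensation factor (white-noise sweep).
    E0: scalar threshold keeping every P_i(theta) <= 1."""
    P_i = energies / (E0 * B[np.newaxis, :])                 # Eq. (14)
    P = np.prod(P_i, axis=0)                                 # Eq. (15)
    win = np.hanning(win_len)
    P_smooth = np.convolve(P, win / win.sum(), mode="same")  # smoothing
    return azimuths_deg[int(np.argmax(P_smooth))]            # Eq. (16)
```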
Fig. 4: Algorithm framework (spherical scatterer -> filterbank -> sub-band beamforming with IID and ITD, using the standard IID and ITD -> multi-band joint judgement -> smoothing -> source location).

3.4. Algorithm Framework

The algorithm framework is given in Fig. 4. First, the standard ITDs and IIDs for different directions and frequencies are calculated off-line using the rigid-sphere transfer function $H(\theta,\omega)$ of equation (8). Then the input data in the buffer are filtered into sub-bands using the specialized filterbanks. Sub-band beamformers are steered from -90° to 90° in azimuth, their output energies are computed, and the joint judgement and smoothing strategies yield the final localization result.

4. SYSTEM IMPLEMENTATION

In this system, a rigid plastic sphere with a diameter of 18 cm is used as the spherical scatterer, and two omnidirectional microphones are placed at the two end-points of a diameter, on the surface of the rigid sphere, as shown in Fig. 5. The whole sphere is mounted on a bracket and kept at a certain distance from the ground to eliminate the effect of other scatterers as much as possible. The sound collected by the two microphones is transferred to a multi-channel audio signal acquisition board. Note that there are six audio input channels on the board, any two of which can be used here, while the others are reserved for extended multi-channel use. This board is employed instead of the computer's sound card because computer sound cards often cannot meet the signal-to-noise-ratio requirements. Fig. 6 shows the frame of the acquisition board. Built around a TI TMS320VC5509 DSP chip, the board is composed of A/D converter, CPLD, FLASH and power-management modules. A DSP chip is chosen with a view to embedding the signal-processing algorithm in the DSP in the future. The two-channel sound captured by the microphones is A/D converted by an AD73360 codec at a sampling rate of 16 kHz and delivered to the multichannel buffered serial port (McBSP) of the DSP using the SPI protocol. The direct memory access (DMA) mechanism is then used to transfer the digitized signal to the computer for further signal processing through the USB interface. The CPLD part is a control unit responsible for setting the bootloader mode, the timer input and the working mode of the DSP. The FLASH part is used to store programs, which are transferred to the memory of the DSP at bootstrap or reset.

Table 1: Time-average errors varying with target azimuth.

Azimuth (deg)    0      20     40     60     75     85
Vowel [a:]       1.     1.1    1.5    4.4    .      1.6
Claps            .5     1.     1.     1.     7.     3.3
Male speech      5.6    3.6    6.6    1.3    7.5    4.1
White noise      .6     1.     1.     .8     5.7    1.4
Average          1.9    1.7    .6     4.     5.5    .6

Table 2: Time-average errors varying with SNR.

SNR (dB)         Pure   20     15     10     5      0
Vowel [a:]       .8     1.     1.     1.     1.3    1.6
Claps            1.3    1.3    1.3    1.3    1.3    1.4
Male speech      1.8    .6     3.5    4.1    5.     8.
White noise      .7     .7     .7     .8     .9     1.1
Average          1.1    1.4    1.7    1.8    .1     3.

Fig. 5: The rigid spherical scatterer and the microphones.

Fig. 6: Frame of the multi-channel audio signal acquisition board.

The inter-channel differences of the whole signal acquisition module, which includes the microphones and the board, are pre-measured in an anechoic chamber and then compensated in the computer before any further signal processing. The core localization procedure is then carried out in the computer, and the final result is shown in a GUI.

5. EXPERIMENTS

5.1. Simulation Experiments

To demonstrate the efficiency of the proposed algorithm, simulation and measurement experiments are performed under various conditions. In the simulation experiments, the synthesized input signals are sent directly to the core localization procedure without the microphones or the DSP board. The binaural input is synthesized by convolving a single-channel sound with the binaural rigid-sphere transfer functions, and the azimuth of the sound source ranges from 0° to 90°. There are four types of original single-channel sound: a vowel [a:], claps, a speech sentence by a male speaker and white noise. Sound sources in noisy environments are also synthesized by adding two uncorrelated background noises (white noise, carrying no spatial information) to the two channel inputs at SNRs of 0 dB, 5 dB, 10 dB, 15 dB and 20 dB.

Table 1 and Table 2 show how the time-average localization errors vary with target azimuth and SNR, respectively, for the four types of sound sources. Fig. 7 and Fig. 8 show the localization standard variances for the four types of sound sources at 15 dB SNR and 0 dB SNR, respectively. Fig. 9 shows the localization standard variances under all SNRs.

Fig. 7: Localization standard variances for 4 types of sources under 15 dB SNR.

Fig. 8: Localization standard variances for 4 types of sources under 0 dB SNR.

Fig. 9: Localization standard variances of all sources under various noisy environments.

It can easily be seen that the localization performance is good when the target azimuth is no greater than 60°, while the peak width of the beamformer becomes larger and larger as the azimuth increases. The speech source generally performs worse than the other types because the energy of speech varies significantly with time, while white noise has the best performance due to its stationary energy compared to the other sources. As seen from Fig. 9, the system is robust to background noise.
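For reference, binaural test signals like those described above can be synthesized along the following lines (an illustrative Python sketch, not the authors' code; tf_pair is an assumed helper returning the left/right rigid-sphere transfer functions on an FFT frequency grid, for instance built from the sphere transfer function of Sec. 2 with the DC bin handled separately):

```python
# Illustrative synthesis of the two-channel simulation inputs: filter a mono
# signal with the left/right rigid-sphere transfer functions, then add
# uncorrelated white noise at a chosen per-channel SNR. All names here are
# assumptions for this sketch.
import numpy as np

def synthesize_binaural(mono, fs, azimuth_deg, snr_db, tf_pair):
    n = len(mono)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    H_l, H_r = tf_pair(azimuth_deg, freqs)   # assumed helper (see lead-in)
    S = np.fft.rfft(mono)
    left = np.fft.irfft(S * H_l, n)
    right = np.fft.irfft(S * H_r, n)
    # scale uncorrelated white noise to the requested SNR
    sig_pow = 0.5 * (np.mean(left ** 2) + np.mean(right ** 2))
    noise_pow = sig_pow / (10.0 ** (snr_db / 10.0))
    left += np.sqrt(noise_pow) * np.random.randn(n)
    right += np.sqrt(noise_pow) * np.random.randn(n)
    return left, right
```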

5.2. Measurement Experiments

Measurement experiments are carried out in an ordinary office room with no obstacles between the sound source and the spherical scatterer. The source is placed about 1.4 m away from the center of the sphere. The actual azimuths of the source are 0°, 20°, 40° and 60°. The male speech, claps and white noise used in the simulation experiments are broadcast from a small loudspeaker acting as the source. Fig. 10 shows the localization results varying with time. It can be seen that the white noise source performs best while the speech performs worst, which is consistent with the simulation results.

Fig. 10: Localization results in the measurement experiments.

6. DISCUSSION

The proposed localization algorithm is built on ideas inspired by human audition, namely the cochlea-like filterbank and the binaural cues. The goal of the algorithm is certainly to localize the sound source rather than to understand or characterize human audition, but we hope that introducing these audition-inspired ideas helps to achieve that goal.

As mentioned before, due to the limitation of the symmetrical structure, this binaural system works only in the frontal azimuthal half-plane.

It would be easy to solve the front-back confusion problem using two microphones if they were not placed at the two end-points of a diameter of the rigid sphere, producing an unsymmetrical structure, but the performance would then decrease at the broad side, where the angle between the two microphones is larger. If 3-D localization is needed, the number of microphones can be increased, which is part of our future work.

From the experimental results, it is not hard to see that the localization error increases from 65° to 80° under all circumstances. There are two reasons for this phenomenon. First, the width of the beamformer peak increases as the target azimuth approaches the diameter on which the two microphones are placed. Second, due to the front-back symmetry, even though the steering range is from -90° to 90°, there exists a virtual beamformer on the back side. When the target azimuth approaches the diameter, there is some overlap between the desired beamformer and the virtual one, which makes the maximum energy output lie around 90°, so the localization error is relatively larger from 65° to 80°. Unfortunately, this problem is currently unavoidable in such a symmetrical setting, while it can be solved simply by using an unsymmetrical structure.

In the measurement experiment results, as shown in Fig. 10, the localization performance for white noise is better than that for the other types of sound sources, because white noise is the most time-stationary source among the three types and currently no time-tracking strategy (such as a Kalman filter or a particle filter) is employed in this system. The time-tracking part is also part of our future work.

7. CONCLUSION

A binaural sound source localization system for the frontal azimuthal half-plane based on a sub-band steered beamformer with a spherical scatterer has been introduced and realized. Both the ITD and IID cues are used in the beamformer. The rigid-sphere transfer functions, from which the binaural cues are extracted, are used instead of HRTFs and are calculated using scattering theory. Specialized filterbanks are introduced for the sub-band beamformer, and the multi-band joint judgement strategy is employed to make the decision-making more robust to various source types. Both the simulation and measurement experiments show good performance of the system. In the future, the localization and tracking of multiple moving sources based on this framework is going to be investigated, and embedding the localization algorithm in the DSP board is also under consideration. Other aspects, such as different scatterer models and beamformers, are going to be tested to improve the performance.

8. ACKNOWLEDGEMENT

This work was supported by the China Natural Science Foundation for Young Scientists (6354); the Key Project of the China Natural Science Foundation (64351, 65353); and the Project of New Century Excellent Talents in Universities.

9. REFERENCES

[1] Michael S. Brandstein and Harvey F. Silverman, "Practical methodology for speech source localization with microphone arrays," Computer Speech and Language, vol. 11, no. 2, pp. 91-126, 1997.

[2] Michael S. Brandstein and Harvey F. Silverman, "Acoustic source location in a three-dimensional space using crosspower spectrum phase," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1997, vol. 1, pp. 231-234.

[3] J. Vermaak and A. Blake, "Nonlinear filtering for speaker tracking in noisy and reverberant environments," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2001.
[4] J. L. Flanagan, J. D. Johnston, R. Zahn, and G. W. Elko, "Computer-steered microphone arrays for sound transduction in large rooms," Journal of the Acoustical Society of America, vol. 78, no. 5, 1985.

[5] R. Duraiswami, D. Zotkin, and L. Davis, "Active speech source localization by a dual coarse-to-fine search," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2001.

[6] Darren B. Ward and Robert C. Williamson, "Particle filter beamforming for acoustic source localization in a reverberant environment," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002.

[7] Stanley T. Birchfield and Daniel K. Gillmor, "Acoustic source direction by hemisphere sampling," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2001.

[8] Stanley T. Birchfield and Daniel K. Gillmor, "Fast Bayesian acoustic localization," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002.

[9] Stanley T. Birchfield, "A unifying framework for acoustic localization," in Proceedings of the 12th European Signal Processing Conference (EUSIPCO), 2004.

[10] Lord Rayleigh, "On our perception of sound direction," Philosophical Magazine, vol. 13, 1907.

[11] Stanley T. Birchfield and Rajitha Gangishetty, "Acoustic localization by interaural level difference," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[12] J. Blauert, Spatial Hearing, revised ed., Cambridge, MA: MIT Press, 1997.

[13] W. M. Hartmann, "How we localize sound," Physics Today, pp. 24-29, Nov. 1999.

[14] V. Ralph Algazi, Richard O. Duda, and Dennis M. Thompson, "The use of head-and-torso models for improved spatial sound synthesis," 113th Convention of the Audio Engineering Society, October 5-8, 2002, Los Angeles, CA, USA.

[15] K. Nakadai, T. Lourens, H. G. Okuno, and H. Kitano, "Active audition for humanoid," in AAAI-2000, pages 832-839, AAAI, 2000.

[16] Kazuhiro Nakadai, Daisuke Matsuura, Hiroshi G. Okuno, and Hiroaki Kitano, "Applying scattering theory to robot audition system: robust sound source localization and extraction," in Proceedings of the 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, Nevada, October 2003.

[17] P. Lax and R. Phillips, Scattering Theory, Academic Press, NY, 1989.

[18] Richard O. Duda and William L. Martens, "Range dependence of the response of a spherical head model," Journal of the Acoustical Society of America, vol. 104, no. 5, pp. 3048-3058, Nov. 1998.

[19] Amir A. Handzel and P. S. Krishnaprasad, "Biomimetic sound-source localization," IEEE Sensors Journal, vol. 2, no. 6, December 2002.

[20] Jean-Marc Valin, "Auditory System for a Mobile Robot," PhD thesis, Université de Sherbrooke, pp. 11-4, August 2005.

[21] W. M. Rabinowitz, J. Maxwell, Y. Shao, and M. Wei, "Sound localization cues for a magnified head: implications from sound diffraction about a rigid sphere," Presence, pp. 125-129, 1993.

[22] B. C. Moore and B. Glasberg, "Suggested formulae for calculating auditory-filter bandwidths and excitation patterns," Journal of the Acoustical Society of America, vol. 74, pp. 750-753, 1983.