Indoor Sound Source Localization with Probabilistic Neural Network


Yingxiang Sun, Student Member, IEEE, Jiajia Chen*, Chau Yuen, Senior Member, IEEE, and Susanto Rahardja, Fellow, IEEE

Abstract: It is known that adverse environments such as high reverberation and low signal-to-noise ratio (SNR) pose a great challenge to indoor sound source localization. To address this challenge, we propose a sound source localization algorithm based on a probabilistic neural network, namely the Generalized cross correlation Classification Algorithm (GCA). Experimental results for adverse environments with reverberation time $T_{60}$ up to 600ms and SNR as low as -10dB show that the average azimuth angle error and elevation angle error of GCA are only 4.6° and 3.1° respectively. Compared with three recently published algorithms, GCA significantly increases the success rate of direction of arrival estimation, with good robustness to environmental changes. These results show that the proposed GCA can localize accurately and robustly in diverse indoor applications where the site acoustic features can be studied prior to the localization stage.

Index Terms: Sound source localization (SSL), direction of arrival (DOA), generalized cross correlation (GCC), probabilistic neural network (PNN), machine learning.

I. INTRODUCTION

Localization techniques have been widely used in both outdoor environments [1] and indoor environments [2]. Diverse types of sensors, including acoustic sensors, electromagnetic sensors, and optical sensors, have been adopted for localization. Sensor nodes with low-power acoustic microphones [3] were used in wireless sensor networks. In [4], localization on the basis of dense passive radio-frequency identification tags was proposed. A laser range finder [5] was installed on a mobile robot to localize in environments surrounded by glass walls.
In [6], an RGB-depth camera was used for localization in light of the two-dimensional light detection and ranging technique. In contrast to the common device-enabled technology [7], device-free technology [8], which localizes targets that do not carry any device, has also appeared.

Manuscript received August 9, 2017; revised November 7, 2017; accepted December 6, 2017. This work was supported by the SUTD-MIT International Design Center under Grant IDG and NSFC. Yingxiang Sun, Jiajia Chen (Corresponding Author*) and Chau Yuen are with the Pillar of Engineering Product Development, Singapore University of Technology and Design, Singapore (jiajia_chen@sutd.edu.sg). Susanto Rahardja is with the School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an, P.R. China, and STMIK Raharja, Tangerang, Indonesia (susantorahardja@ieee.org).

Among localization techniques, indoor sound source localization (SSL) has important applications in a wide range of scenarios. For example, robots can localize a sound source to help detect unknown defects in a smart factory. Furthermore, in a smart hospital, robots can attend to patients by localizing the sound source. Moreover, a camera can be automatically steered for speaker localization in a smart meeting room. In terms of security monitoring, robots can go on patrol and look for sound sources caused by people breaking in. Therefore, indoor SSL has received a lot of attention [9], [10] in the past decades.

The existing SSL technologies can be categorized into three groups, viz. the time delay estimation method, the beamforming method, and the machine learning method. The time delay estimation method is based on computing the time difference of arrival (TDOA). One widely used technique for TDOA is the generalized cross correlation (GCC). As reverberation and noise cause ambiguities in TDOA estimation, many efforts were made to address this problem.
These works employed various types of microphone arrays, such as linear arrays [11], circular arrays [12], distributed arrays [13], and arbitrarily-shaped non-coplanar arrays [14]. The second class is the beamforming method, which can be classified into subspace approaches and beamscan approaches. Subspace approaches exploit the orthogonality between signal and noise subspaces. Two famous subspace algorithms are multiple signal classification (MUSIC) and estimation of signal parameters via rotational invariance technique (ESPRIT). Beamscan approaches can localize the array signals into one specific direction. A well-known technique is the steered response power phase transform (SRP-PHAT), which is adopted by many beamscan approaches [15]-[18]. The machine learning methods are more recent approaches, and a few attempts have been made in the literature. Most of the works are supervised learning methods, including support vector machine [19], multilayer perceptron neural network [20], and Gaussian mixture model [21]. Besides, a semi-supervised learning algorithm based on manifold regularization [22] was proposed.

Although much work has been done to propose effective localization algorithms, two major challenges remain to be addressed. The first issue is the accuracy of direction of arrival (DOA) estimation in highly reverberant environments. As indoor environments are echoic, the reverberation caused by multipath propagation introduces spectral distortions and therefore severely deteriorates DOA estimation. Secondly, the spectral characteristics of undesired background noise can be the same as those of the source signal. As such,

the DOA estimation accuracy is severely degraded in low signal-to-noise-ratio (SNR) environments. Therefore, more effort is needed to improve the DOA estimation accuracy for SSL in these adverse environments.

Among the applications, an important category exists where the acoustic features of the physical rooms can be pre-studied before localization. In this case, the acoustic features, including the room impulse response (RIR), can be evaluated before any localization is performed, which makes machine learning methods the right tools. This kind of data-driven training method can be more effective, especially when the environment is too complex to be modeled.

In this paper, we propose a probabilistic neural network (PNN) based SSL algorithm for applications where a pre-localization site survey is possible. Compared with other existing machine learning methods, the most important advantage of PNN is that it does not require any iterative training. In addition, the GCC feature is adopted to robustly represent the sound source position, making the training procedure effective in reverberant and noisy environments. Finally, the proposed weighted location decision method improves the accuracy of the DOA estimation by revisiting and accessing the probabilities of the adjacent clusters. Owing to these novelties, the results show that the proposed algorithm can perform more accurate SSL than existing methods in adverse environments. The performance is also shown to be robust when the room environment and/or geometry varies.

II. SYSTEM MODEL AND PROBLEM FORMULATION

In this section, we present the SSL problem to be addressed. We consider the problem of stationary single source localization inside a 3-dimensional rectangular enclosed room. The location of the source is arbitrary inside the room. A stationary microphone array which consists of M microphones is used to receive sound signals inside the same room.
Through these fixed microphones in the array, we can receive the signal transmitted from the source directly and the delayed replicas of the source reflected by room surfaces. The m-th microphone is denoted $M_m$ with $m \in [1, M]$. When the sound wave hits a surface such as a wall, a floor or a ceiling, part of the wave is absorbed by the surface while the rest is reflected back into the room. We assume that the sound wave is reflected by the surfaces with the angle of incidence equal to the angle of reflection. Therefore, the received signal at each microphone is a mixed signal, consisting of the signal transmitted from the source directly and the delayed replicas of the source, which are reflected and attenuated. If the source signal is $s(t)$, the received signal $x_m(t)$ at the m-th microphone can be expressed as

$$x_m(t) = h_m(t) * s(t) + n_m(t), \tag{1}$$

where $*$ denotes convolution. $n_m(t)$ is the noise at the m-th microphone, which is uncorrelated with $s(t)$ and with the noises at other microphones [9]. $h_m(t)$ is the RIR, which contains the multipath propagation and attenuation information between the sound source and the m-th microphone. $h_m(t)$ varies with the positions of the sound source and the m-th microphone. By defining the received signal set $X = [x_1, x_2, \ldots, x_M]^T$, the RIR set $H = [h_1, h_2, \ldots, h_M]^T$ and the noise set $N = [n_1, n_2, \ldots, n_M]^T$, (1) can be written as

$$X = H * s(t) + N. \tag{2}$$

If we divide a room into a set of space clusters whose volumes are small enough, each space cluster can be represented by a unique 3-dimensional coordinate inside it. To cope with the high computational burden, the regressive SSL inside a 3-dimensional room can be transformed into a likelihood-based nonlinear classification problem. The classifier then decides which particular cluster the source belongs to, as shown in Fig. 1. In this classification problem, each space cluster is a category and a total of K categories can be created, i.e. $C = [c_1, c_2, \ldots, c_K]$ and $c_i \in \mathbb{R}^3$ with $i \in [1, K]$.
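The partition of the room into space clusters can be illustrated with a minimal Python sketch. It assumes cubic clusters of equal side length and uses the cluster centre as the representative coordinate; the function names, the 4m room and the 0.25m side are illustrative choices, not part of the formulation above.

```python
import math

def cluster_index(position, side, n_per_axis):
    """Map a 3-D position (metres) to integer cluster indices (ix, iy, iz)."""
    idx = []
    for coord in position:
        i = math.floor(coord / side)
        # Clamp so that points lying exactly on the far wall stay in the last cluster.
        idx.append(min(max(i, 0), n_per_axis - 1))
    return tuple(idx)

def cluster_center(index, side):
    """Representative coordinate of a cluster: here, its geometric centre (an assumption)."""
    return tuple((i + 0.5) * side for i in index)

# Illustrative values: a 4 m cube split into 16 clusters per axis, so K = 16**3.
SIDE = 0.25
N_PER_AXIS = 16
K = N_PER_AXIS ** 3

source = (0.3, 1.0, 3.9)
idx = cluster_index(source, SIDE, N_PER_AXIS)
center = cluster_center(idx, SIDE)
print(K, idx, center)
```

With this mapping, a classifier only has to output one of K labels, and the label is converted back to a coordinate through `cluster_center`.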
The complexity of the classification grows as K increases, i.e. with finer-grained clusters, which leads to more accurate localization if the classification is successful. All the K categories are possible solutions, and each possible solution $c_i$ has a set of features $feature_i$ that decide the probability of $c_i$ being the final solution. Based on the features and the received signal set X, a dedicated classifier classifies the source into one cluster $c_s$, whose unique coordinate representative is $[d_{x,s}, d_{y,s}, d_{z,s}]$. $c_s$ is the solution of the classification problem while $[d_{x,s}, d_{y,s}, d_{z,s}]$ is the solution of the SSL problem. This classification problem, with classifier function classify(·), can be summarized as

$$c_s = \mathrm{classify}(X, feature_i). \tag{3}$$

Assume the actual source location is $[s_x, s_y, s_z]$ inside the cluster $c_{source}$. Even when the classification is wrong, i.e. $c_s \neq c_{source}$, the regression localization error $\varepsilon$ can be evaluated as

$$\varepsilon = \sqrt{(d_{x,s} - s_x)^2 + (d_{y,s} - s_y)^2 + (d_{z,s} - s_z)^2}. \tag{4}$$

The DOA results in terms of $\theta$ and $\phi$ can be obtained from

$$d_{x,s} = x_m + r \sin\theta \cos\phi; \quad d_{y,s} = y_m + r \sin\theta \sin\phi; \quad d_{z,s} = z_m + r \cos\theta, \tag{5}$$

where $x_m$, $y_m$ and $z_m$ are the coordinates of the microphone array and $r$ denotes the distance between the cluster $c_s$ and the array center. $\theta \in [-90°, +90°]$ is the elevation angle, measured from r's orthogonal projection onto the xy-plane towards the positive z-axis. $\phi \in (-180°, +180°]$ is the azimuth angle, measured from the positive x-axis towards the positive y-axis, in terms of r's orthogonal projection onto the xy-plane.

Fig. 1. Space cluster classification for SSL.

If the longest diagonal inside one cluster is $l$, $\varepsilon$ is bounded by $l$ when the classification is correct. If the classification is incorrect but $c_s$ is an adjacent cluster of $c_{source}$, it is still possible to have $\varepsilon$ bounded by $l$. Therefore, the localization error bound depends on the correctness probability of the classification as

$$P(\varepsilon \le l \mid c_s = c_{source}) = 1; \quad P(\varepsilon \le l) \ge P(c_s = c_{source}). \tag{6}$$

To minimize $\varepsilon$, therefore, we need an efficient and accurate classifier with a high classification correctness rate and affordable computational complexity. In the next section, we

will present the details of the proposed SSL algorithm based on the PNN classifier.

III. THE PROPOSED ALGORITHM

The relationship between the source position and the recorded signals at the microphone array is nonlinear. We adopt PNN [23] as the classifier, because PNN is well suited to nonlinear multi-class classification problems. PNN contains four layers, i.e. the input layer, pattern layer, summation layer and decision layer, in that order. With this classifier, we propose a GCC classification algorithm (GCA) to solve the classification problem formulated in Section II.

A. GCC Feature Extraction

In order to generate the input vector space I for the PNN, we need to extract features from the signals at the microphone array. The feature of each received signal is unique. Meanwhile, as each sound source is located at a unique position, there is a one-to-one correlation between the received signals and source positions. For a machine learning algorithm to provide a good solution, it is essential to select well-defined features prudently for the training. The reason is that the probability densities of the category patterns are unknown initially, and the derivation of these probability densities relies solely on the selected features. GCC is an ideal candidate feature, since it contains all the information needed for DOA estimation and is reliable in reverberant and noisy environments. GCC varies across different frames. In a silent frame, for example, GCC is mainly due to the noise, so the GCC of a single frame used directly as the feature is not representative. Therefore, GCC from higher-SNR frames needs to be given higher weight, while the rest are given lower weights or even neglected. In our method, the GCCs from all frames are weighted and summed to form the feature [20], namely the GCC feature. As the GCC and the weight of each frame are different, the GCC feature is unique.
The length L of each frame is selected based on a compromise between good spectral resolution and small bias and variance. Therefore, a vector consisting of L GCCs can be extracted for each frame. Assume that the source signal consists of F frames in total. We use $GCC_f^l$ to represent the l-th GCC corresponding to the f-th frame of the source signal, with $l \in [1, L]$ and $f \in [1, F]$. The GCC feature corresponding to one sound source can be expressed as

$$GCC^{l} = \sum_{f=1}^{F} w_f \, GCC_f^{l}, \quad l = 1, \ldots, L, \tag{7}$$

where

$$w_f = \frac{\sum_{l=1}^{L} \left( GCC_f^{l} \right)^{\gamma}}{\sum_{f=1}^{F} \sum_{l=1}^{L} \left( GCC_f^{l} \right)^{\gamma}} \tag{8}$$

denotes the weight of the f-th frame and $\gamma$ is a tuning parameter. To localize a sound source by an M-microphone array with $M \ge 2$, we can compute a total of $M(M-1)/2$ GCC features using (7), each corresponding to one microphone combination. These $M(M-1)/2$ GCC features are grouped together to form the complete GCC set corresponding to the sound source. Therefore, more accurate SSL can be achieved with more microphone combinations, but at the expense of higher computational complexity in GCC feature extraction.

B. Training

At the beginning of the training, the enclosed room is divided into K equal-dimension rectangular clusters, namely $c_1, c_2, \ldots, c_i, \ldots, c_K$ with $i \in [1, K]$. This dividing procedure is defined as cluster(Dim, K), where the dimension of each cluster Dim depends on the required localization accuracy. Assuming that $n_i$ is the total number of training samples taken inside the i-th cluster, we can define four vector spaces, namely $X = \{X_{i,j}\}$, $S = \{S_{i,j}\}$, $GCC = \{GCC_{i,j}\}$ and $H = \{H_{i,j}\}$. Each $X_{i,j}$ represents the signals produced at the microphone array M when the j-th training sample sound source $S_{i,j}$ is placed inside the i-th cluster, with $j \in [1, n_i]$. $GCC_{i,j}$ is the corresponding GCC feature extracted from $X_{i,j}$. $H_{i,j}$ represents the corresponding RIR between M and $S_{i,j}$.
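The feature construction in (7) and (8) can be sketched in a few lines of Python. This is a minimal illustration that reads (7) as an elementwise weighted sum over frames; the function names are hypothetical, and the use of `abs()` is an added assumption so that a non-integer $\gamma$ stays well defined for negative GCC values.

```python
def frame_weights(gcc_frames, gamma=2.0):
    """Eq. (8): each frame's weight is its share of the summed GCC**gamma.

    abs() is an assumption added here; for gamma = 2 it changes nothing.
    """
    energies = [sum(abs(g) ** gamma for g in frame) for frame in gcc_frames]
    total = sum(energies)
    return [e / total for e in energies]

def gcc_feature(gcc_frames, gamma=2.0):
    """Eq. (7): weighted sum of the per-frame GCC vectors, lag by lag."""
    w = frame_weights(gcc_frames, gamma)
    n_lags = len(gcc_frames[0])
    return [
        sum(w[f] * gcc_frames[f][l] for f in range(len(gcc_frames)))
        for l in range(n_lags)
    ]

# Toy input: one near-silent frame and one strong frame, three lags each.
frames = [
    [0.01, -0.02, 0.01],
    [0.50, 1.00, 0.50],
]
weights = frame_weights(frames, gamma=2.0)
feature = gcc_feature(frames, gamma=2.0)
print(weights, feature)
```

In this toy input the near-silent frame receives a weight close to zero, so the feature is dominated by the strong frame, which is the behaviour the weighting is meant to achieve.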
Given the sampling frequency of the sound signal ($f_{sample}$), the absorption coefficient of the room ($\alpha_c$), the sound velocity in the air ($v_c$), the reverberation time ($T_{60}$) and the noise in the room (N), the RIRs between the microphone array and the sources can be computed [24]. This procedure is defined as RIR($f_{sample}$, $v_c$, $T_{60}$, N, $\alpha_c$, M, S). By convolving H with S and adding N, we can produce the signal vector space X. After that, the GCC features GCC are extracted using (7). We define this procedure as GF(X, $\gamma$).

Upon completion of the feature extraction, all features are supplied to the PNN as the input vector space I. The number of neurons in the input layer is equal to the dimension of the input GCC feature vector. In the pattern layer, the number of neurons equals the total number of training samples used to train the PNN. Therefore, there are $\sum_{i=1}^{K} n_i$ neurons in the pattern layer. The neurons of the pattern layer map the input GCC feature vector to a high-dimensional space and estimate the corresponding probability density by the Gaussian kernel

$$\varphi_{i,j}(GCC) = \frac{1}{(2\pi)^{D/2} \sigma^{D}} \exp\left[ -\frac{(GCC - GCC_{i,j})^{T} (GCC - GCC_{i,j})}{2\sigma^{2}} \right], \tag{9}$$

where $\varphi_{i,j}(GCC)$ is the Gaussian kernel function, $\sigma$ is the spread parameter which represents the width of the Gaussian kernel, and T denotes the transpose. GCC is the D-dimensional input GCC feature vector and $GCC_{i,j}$ is the center of the kernel. The output of each neuron in the pattern layer is generated using (9), and all outputs are transmitted to the summation layer, in which the number of neurons equals K. By averaging the outputs of all neurons that belong to the same cluster $c_i$, the summation layer computes the probability $p_i(GCC)$ of the input GCC feature being classified into the i-th cluster as

$$p_i(GCC) = \frac{1}{(2\pi)^{D/2} \sigma^{D} n_i} \sum_{j=1}^{n_i} \exp\left[ -\frac{(GCC - GCC_{i,j})^{T} (GCC - GCC_{i,j})}{2\sigma^{2}} \right]. \tag{10}$$

Assume the a priori probability of occurrence of every cluster $c_i$ is $h_i$, and the loss caused by a misclassification decision for each cluster $c_i$ is $co_i$.
The decision layer neuron classifies the input GCC feature into cluster $c_s$ according to Bayes' decision rule [23] as

$$h_s \, co_s \, p_s(GCC) > h_i \, co_i \, p_i(GCC), \quad \forall i \neq s, \tag{11}$$

where $p_s(GCC)$ is the probability of GCC being classified into cluster $c_s$. We assume $h_i$ and $co_i$ are the same for all the clusters, so that the GCC feature is classified into cluster $c_s$ as

$$c_s = \arg\max_{c_i} \{ p_i(GCC) \}. \tag{12}$$
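The pattern, summation and decision layers in (9), (10) and (12) can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the dictionary layout of the training data and the function names are assumptions, and equal priors and losses are assumed so that the decision reduces to an arg max.

```python
import math

def pnn_probabilities(x, training, sigma):
    """Eq. (10): per-cluster average of Gaussian kernels centred on training features.

    training maps a cluster id to a list of D-dimensional feature vectors.
    """
    d = len(x)
    norm = (2.0 * math.pi) ** (d / 2.0) * sigma ** d
    probs = {}
    for cluster_id, samples in training.items():
        acc = 0.0
        for center in samples:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
            acc += math.exp(-sq_dist / (2.0 * sigma ** 2))  # kernel of eq. (9), unnormalised
        probs[cluster_id] = acc / (norm * len(samples))
    return probs

def classify(x, training, sigma):
    """Eq. (12): pick the cluster with the largest p_i (equal priors and losses assumed)."""
    probs = pnn_probabilities(x, training, sigma)
    return max(probs, key=probs.get)

# Toy 2-D example with one training sample per cluster.
training = {"c1": [[0.0, 0.0]], "c2": [[1.0, 1.0]]}
query = [0.1, 0.2]
probs = pnn_probabilities(query, training, sigma=0.5)
label = classify(query, training, sigma=0.5)
print(probs, label)
```

No iterative training takes place: "training" amounts to storing the feature vectors, and all computation happens at query time. For high-dimensional features the normalising constant becomes very large, so a practical implementation might work with log-probabilities instead.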

We define this procedure as DA(GCC). $p_i(GCC)$ is also the probability of each training sample being classified into the i-th cluster, as there is a one-to-one correlation between GCC and S. In the output layer, there is only one neuron, as only the most probable class is chosen by the PNN.

C. Localization

Once the PNN is trained with the GCC features, the GCA continues to the second stage to localize the unknown sound source $S_u$ into one of the K clusters. As presented in Section III-B, the probability of $S_u$ being classified into every cluster can be computed by the PNN using (10), according to the GCC feature of $S_u$. The decision layer can then classify $S_u$ into one of the K clusters $c_s$ using (12) with those computed probabilities. However, when the space cluster's volume is small, it is difficult to distinguish which cluster the source actually belongs to, and hence the rate of misclassification becomes higher. The situation gets worse when the actual source $c_{source}$ is close to the boundary of two adjacent clusters. To solve this problem, we propose a weighted location decision method (WLDM) in GCA instead of using the PNN decision layer to classify directly, which is presented below.

To guarantee $\sum_{a=1}^{K} p_a = 1$, the softmax function is adopted as the transfer function between the pattern layer and the summation layer. We can thereby normalize the categorical probability distribution to the range (0, 1) so that it adds up to 1. With the probabilities of all clusters computed, we select the $\zeta$ most probable clusters whose probability sum is less than a threshold THR, with the cluster count $\zeta$ dependent on THR, i.e. $\sum_{a=1}^{\zeta} p_a \le THR$. The selection starts from the cluster with the top probability, proceeds in descending order, and stops before the additional cluster that would cause the probability sum to exceed the threshold. After these $\zeta$ adjacent clusters are selected, we perform the localization through the following two steps, which are preliminary estimation and sample points estimation.
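The threshold-based cluster selection described above can be sketched as follows. The probabilities are assumed to be already normalised so that they sum to 1; the function name is hypothetical, and keeping the top cluster even when it alone exceeds THR is an added assumption for a case the text does not cover.

```python
def select_clusters(probs, thr):
    """Pick the most probable clusters, in descending order, whose sum stays <= thr.

    probs maps cluster id -> normalised probability.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen = []
    running = 0.0
    for cluster_id, p in ranked:
        # Stop before the cluster that would push the running sum above thr.
        # Assumption: the top cluster is always kept, even if it alone exceeds thr.
        if chosen and running + p > thr:
            break
        chosen.append((cluster_id, p))
        running += p
    return chosen

toy_probs = {"a": 0.50, "b": 0.30, "c": 0.15, "d": 0.05}
selected = select_clusters(toy_probs, thr=0.9)
print(selected)
```

Here the clusters "a" and "b" are kept (sum 0.8), and "c" is rejected because adding it would exceed the threshold of 0.9.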
Let $P_a$ denote the central point chosen for the a-th cluster, with $a \in [1, \zeta]$ and Cartesian coordinates $x_a$, $y_a$ and $z_a$. The preliminarily estimated source position $P_s$, with Cartesian coordinates $x_s$, $y_s$ and $z_s$, is computed as

$$x_s = \sum_{a=1}^{\zeta} p_a x_a; \quad y_s = \sum_{a=1}^{\zeta} p_a y_a; \quad z_s = \sum_{a=1}^{\zeta} p_a z_a. \tag{13}$$

This procedure is defined as PE($p_a$, $P_a$). With (13), we can compute the distance $l_a$ between the representative point of the a-th adjacent cluster and the estimated source position by

$$l_a = \sqrt{(x_a - x_s)^2 + (y_a - y_s)^2 + (z_a - z_s)^2}. \tag{14}$$

A longer distance indicates that the actual source position is more likely to be far away from that particular cluster, and hence its probability should be reduced. Therefore, a new weight for the a-th cluster, which is inversely proportional to the distance, can be derived by

$$w_a = \frac{(1/l_a)^{\lambda}}{\sum_{a=1}^{\zeta} (1/l_a)^{\lambda}}, \tag{15}$$

where $w_a$ is the new weight of the a-th cluster and $0 < \lambda < 1$ denotes the controlling parameter. This procedure is defined as weight_cluster($l_a$, $\lambda$).

In order to reduce the error further, we adjust the localization using more sample points in the second step. In each adjacent cluster, $\beta$ sample points are selected to represent the cluster position more accurately. Similar to the new cluster weights, the $\beta$ sample point weights can be computed by

$$w_{a,t} = \frac{(1/l_{a,t})^{\rho}}{\sum_{t=1}^{\beta} (1/l_{a,t})^{\rho}}, \tag{16}$$

where $l_{a,t}$ is the distance from $P_s$ to the t-th sample point in the a-th cluster with $t \in [1, \beta]$, $w_{a,t}$ denotes the weight of the t-th sample point in the a-th cluster, and $0 < \rho < 1$ is the controlling parameter. This procedure is defined as weight_sp($l_{a,t}$, $\rho$). Therefore, we can decide the location of $c_s$ through WLDM($w_a$, $w_{a,t}$, $P_{a,t}$):

$$d_{x,s} = \sum_{a=1}^{\zeta} \sum_{t=1}^{\beta} w_a w_{a,t} x_{a,t}; \quad d_{y,s} = \sum_{a=1}^{\zeta} \sum_{t=1}^{\beta} w_a w_{a,t} y_{a,t}; \quad d_{z,s} = \sum_{a=1}^{\zeta} \sum_{t=1}^{\beta} w_a w_{a,t} z_{a,t}, \tag{17}$$

where $x_{a,t}$, $y_{a,t}$ and $z_{a,t}$ are the Cartesian coordinates of the t-th sample point in the a-th cluster.
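The two WLDM steps in (13) to (17) can be sketched in Python as below. This is an illustrative reading rather than the authors' code: the function names, the use of cube vertices as sample points in the toy example, and the small `eps` guard against a zero distance are all assumptions.

```python
import math

def dist(p, q):
    """Euclidean distance between two 3-D points, as in eq. (14)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def inverse_distance_weights(distances, power, eps=1e-12):
    """Eqs. (15)/(16): normalised (1/l)**power weights. eps is an added guard."""
    raw = [(1.0 / max(d, eps)) ** power for d in distances]
    total = sum(raw)
    return [r / total for r in raw]

def cube_vertices(center, side):
    """Eight vertices of a cubic cluster, used here as its sample points."""
    h = side / 2.0
    return [(center[0] + dx, center[1] + dy, center[2] + dz)
            for dx in (-h, h) for dy in (-h, h) for dz in (-h, h)]

def wldm(centers, probs, sample_points, lam=0.25, rho=0.25):
    """Return (preliminary estimate P_s, final estimate [d_x, d_y, d_z])."""
    # Eq. (13): probability-weighted combination of the cluster centres.
    p_s = tuple(sum(p * c[k] for p, c in zip(probs, centers)) for k in range(3))
    # Eqs. (14)-(15): re-weight clusters by inverse distance to P_s.
    w_a = inverse_distance_weights([dist(c, p_s) for c in centers], lam)
    # Eqs. (16)-(17): weight the sample points of each cluster and combine.
    est = [0.0, 0.0, 0.0]
    for a, points in enumerate(sample_points):
        w_at = inverse_distance_weights([dist(pt, p_s) for pt in points], rho)
        for w_t, pt in zip(w_at, points):
            for k in range(3):
                est[k] += w_a[a] * w_t * pt[k]
    return p_s, tuple(est)

# Toy example: two adjacent 0.25 m clusters with probabilities 0.6 and 0.4.
centers = [(0.125, 0.125, 0.125), (0.375, 0.125, 0.125)]
probs = [0.6, 0.4]
points = [cube_vertices(c, 0.25) for c in centers]
prelim, final = wldm(centers, probs, points)
print(prelim, final)
```

Because both sets of weights are normalised, the final estimate is a convex combination of the sample points, so it always lies inside the region spanned by the selected clusters.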
TABLE I
THE PSEUDO CODE OF THE PROPOSED GCA

GCA(Train, Localize)
begin
    Train(M, Dim, f_sample, v_c, T_60, N, α_c, K, n_i, γ, S)    // training stage of GCA
    begin
        C = cluster(Dim, K);    // divide the room into K clusters
        for all i ∈ [1, K]
            for all j ∈ [1, n_i]
                H_i,j = RIR(f_sample, v_c, T_60, N, α_c, M, S_i,j);    // compute the RIR
                X_i,j = H_i,j * S_i,j + N;    // obtain the signal at microphone array
                GCC_i,j = GF(X_i,j, γ);    // extract GCC feature
            end
        end
        p_i(GCC) = DA(GCC);    // train PNN
    end
    Localize(S_u, γ, ζ, β, λ, ρ, THR);    // localization stage of GCA
    begin
        GCC = GF(S_u, γ);
        p_i(GCC) = DA(GCC);    // compute the probability
        for all a ∈ [1, ζ]
            P_s = PE(p_a, P_a);    // obtain preliminary estimation of source position
            w_a = weight_cluster(l_a, λ);    // compute weights of clusters
            for all t ∈ [1, β]
                w_a,t = weight_sp(l_a,t, ρ);    // compute weights of sample points
            end
        end
        c_s = WLDM(w_a, w_a,t, P_a,t);    // obtain final source position
    end
    return DOA = [θ, ϕ];
end

Fig. 2. Flow chart of the proposed GCA. Training stage: labeled signal samples at microphone array, GCC feature extraction, PNN training. Localization stage: received signal samples from unknown positions, GCC feature extraction, probability estimation and classification, preliminary estimation, sample points estimation, final location.

The located position $c_s = [d_{x,s}, d_{y,s}, d_{z,s}]$ is the solution of the GCA in Cartesian coordinates, and $[\theta, \phi]$ is the DOA result, which can be computed by (5). The pseudo code of the proposed GCA is summarized in Table I. The function GCA(Train, Localize) consists of two sub-functions, Train(M, Dim, $f_{sample}$, $v_c$, $T_{60}$, N, $\alpha_c$, K, $n_i$, $\gamma$, S) and Localize($S_u$, $\gamma$, $\zeta$, $\beta$, $\lambda$, $\rho$, THR), representing the two stages of GCA respectively. Finally, the DOA ($\theta$, $\phi$) is returned as the output. The flow chart of the proposed GCA is depicted in Fig. 2.

IV. SYNTHETIC EXPERIMENTAL RESULTS AND DISCUSSION

In this section, synthetic experiments are conducted to evaluate the performance of the proposed GCA, with three other recently published algorithms presented in [14], [18], and [19] employed as the competing methods.

A. Synthetic Experimental Setup

A typical medium-size meeting room with dimensions 4.0m × 4.0m × 4.0m is simulated. The microphone array consists of six microphones, which are placed at $M_1$ = (1.8m, 2.0m, 2.0m), $M_2$ = (2.2m, 2.0m, 2.0m), $M_3$ = (2.0m, 1.8m, 2.0m), $M_4$ = (2.0m, 2.2m, 2.0m), $M_5$ = (2.0m, 2.0m, 1.8m) and $M_6$ = (2.0m, 2.0m, 2.2m). The source is placed on a sphere centered at the centroid of the room, with three different radius values: 0.5m, 1.0m and 1.5m. On each of the three spherical surfaces, the sound source is placed at 21 different azimuth values from -160° to +160° and at 9 different elevation values from -60° to +60°, both with even intervals. In total, the sound source is placed at 567 different positions distributed in the room. In our experiments, omnidirectional microphones are adopted, with a frequency response from 20Hz to 20kHz and a dynamic range of 87dB.

We use six microphones, rather than another number, to form the array with this spatial distribution mainly for four reasons. Firstly, if microphones are distributed along each dimension of the space, the position of the source can be better determined, as the sound propagates along each dimension of the room.
Secondly, we use only 2 microphones along each of the three dimensions to minimize computational complexity. Thirdly, considering the tradeoff between computational complexity and the validity of the information obtained from cross correlations, we set the maximum distance between any two microphones to be 40cm. In addition, considering that a test source can be placed anywhere in the room, and referring to the setup of the competing method TDE [14], the center of the microphone array is placed at the center of the room.

A clean speech signal sampled at 8kHz, as in [25], is adopted as the sound source. The 2.7-second speech (from 220Hz to 3.4kHz) is taken from the NOIZEUS database and is in American English. The sound source is also omnidirectional in the setup. The reverberation time $T_{60}$, which measures the time for the original sound to decay by 60dB, is set to different levels: 0ms, 100ms, 200ms, 400ms and 600ms. A longer $T_{60}$ represents higher reverberation in the room. The SNR in the room is set to different levels: -10dB, 0dB, 5dB and 10dB, where the noise is additive noise.

The duration of each frame of the speech signal is chosen to be 0.064s and the overlap rate between two frames is set to 62.5%. As the maximum distance between any two microphones of the array is 0.4m, the maximum possible time delay is 1.17ms, assuming the sound speed in the air is 343m/s. As the sampling rate is 8kHz, the maximum delay in samples is 10. Therefore, for a microphone pair, the first 10 cross correlations contain the valid information. However, to avoid missing valid information, we select the first 16 cross correlations as the feature. As there are 15 microphone combinations in total for cross correlation computing, the dimension of the GCC feature vector applied to the input layer is 240. Therefore, the input layer consists of 240 neurons. In the training stage of our synthetic experiments, the room is divided into 4096 equal-dimension rectangular space clusters with dimensions 0.25m × 0.25m × 0.25m each.
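The setup figures quoted above follow from a few lines of arithmetic, which the Python sketch below reproduces from the stated array size, sound speed, sampling rate and room dimensions. The variable names are illustrative, and rounding the delay up to a whole sample is an assumption about how the sample count was obtained.

```python
import math

max_mic_distance = 0.4   # metres, largest spacing between two microphones
sound_speed = 343.0      # metres per second
fs = 8000                # sampling rate in Hz

max_delay_s = max_mic_distance / sound_speed        # about 1.17 ms
max_delay_samples = math.ceil(max_delay_s * fs)     # rounded up to whole samples

n_mics = 6
n_pairs = n_mics * (n_mics - 1) // 2                # M(M-1)/2 microphone combinations
lags_kept = 16                                      # cross correlations kept per pair
feature_dim = n_pairs * lags_kept                   # size of the PNN input layer

room_side = 4.0          # metres
cluster_side = 0.25      # metres
clusters_per_axis = round(room_side / cluster_side)
n_clusters = clusters_per_axis ** 3

frame_samples = round(0.064 * fs)                   # samples per 0.064 s frame

print(round(max_delay_s * 1e3, 2), max_delay_samples, n_pairs,
      feature_dim, n_clusters, frame_samples)
```

Keeping 16 lags per pair instead of the 10 strictly required leaves a margin of 6 samples around the largest physically possible delay.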
The sound source is randomly and successively placed in each cluster only once, i.e. $n_i = 1$, as the cluster volume is small. Therefore, a total of 4096 complete GCC feature sets are extracted. In this case, both the pattern layer and the summation layer consist of 4096 neurons. For the spread parameter $\sigma$, a small $\sigma$ will cause overfitting while a large $\sigma$ will result in underfitting. In practice, by referring to [23], $\sigma$ can be selected from 3 to 10. In our experiments, we set its value to 5. For the WLDM, we select the 5 most probable clusters whose probability sum is less than the threshold THR. For the controlling parameters, both $\lambda$ and $\rho$ are set to 0.25 while $\gamma$ is set to 2. In each adjacent cluster, the 8 vertices are selected as sample points to represent the cluster position more accurately.

B. Implementation

We perform the synthetic experiments to compare our results with three recent methods, which are the time delay estimation method TDE [14], the beamforming method TL-SSC [18], and the machine learning method LS-SVM [19]. In the synthetic experiments, Dim, $T_{60}$, $\alpha_c$, and SNR are all required by our proposed method and the competing methods. As the author-shared codes of TDE and TL-SSC are available online at [26] and [27] respectively, we select these two algorithms as competing methods. This helps to avoid potential errors introduced when non-authors re-implement the algorithms, so that the comparison is fair and valid. As TL-SSC is an improved version of the widely used SRP-PHAT algorithm, we do not adopt the original SRP-PHAT algorithm as a competing method. For LS-SVM, we collect the TDOA features for training as in its original paper [19]. In addition, as the LS-SVM algorithm transforms localization into a pure classification problem, we assume that the estimated sound source position is at the centroid of the cluster into which it is classified. Furthermore, the performance of these competing methods degrades if we adopt our microphone array setup in their methods.
To make a fair comparison, therefore, the microphone arrays for these three competing methods are set up in the same way as given in their original papers, [14], [18], and [19] respectively. Moreover, to improve the efficiency of RIR computation, the fast image method [24] is adopted, and the source code is available online at [28]. All four methods are implemented in Matlab and run on a workstation with 32GB RAM and dual Intel Xeon 2.4GHz processor E V3.

C. Results and Discussion

Validation on Feature Extraction

The first experiment examines the effectiveness of the GCC features. Simulations are performed in four different


environments with SNR decreasing from 10dB to -10dB, when $T_{60}$ = 0ms, as demonstrated by (a) to (d) of Fig. 3. With the same acoustic environment as in each subplot, we repeated the extraction three times; the repetitions are separated by the two black lines. It can be seen that the computed GCC features, shown as yellow patterns, are well representative of the testing clusters. The contrast between the yellow patterns and the blue regions becomes more distinct when SNR rises from -10dB to 10dB. This shows that the GCC feature is more reliable when SNR is high. A similar regularity can be observed when SNR is fixed and $T_{60}$ varies, where the GCC feature is more reliable when $T_{60}$ is low.

Fig. 3. GCC features extracted from different SNR when $T_{60}$ = 0ms.

Impact of Reverberation and SNR

With the validated GCC features, we perform the SSL using GCA and compare the performance. Because most SSL applications localize the source direction, we adopt the DOA as the performance metric. The results are summarized in Table II. A total of 20 different acoustic environments are created by varying SNR from -10dB to 10dB and $T_{60}$ from 0ms to 600ms. For each environment, the DOA estimation error (DEE), in terms of the mean error and standard deviation of $\phi$ and $\theta$, is collected from the 567 localized positions. To evaluate the performance under different accuracy requirements, the localization successful rate for DOA estimation (SRDE) is defined. SRDE(α°) is the percentage of localizations, out of the 567 performed, in which both the $\phi$ error and the $\theta$ error are within ±α°. In Table II, the results of SRDE(10°), SRDE(20°) and SRDE(30°) are provided.

When SNR is fixed and $T_{60}$ varies from 0ms to 600ms, the accuracy generally drops for every algorithm. A similar trend can be observed when $T_{60}$ is fixed and SNR decreases. However, SRDE does not always increase with SNR: in some scenarios, such as very long $T_{60}$, SRDEs may not strictly increase with the SNR.
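The SRDE metric defined above can be computed with a short Python sketch. The function names and the toy error values are illustrative, and the angle errors are assumed to be already wrapped into their valid ranges. The second helper gives the chance level of a uniformly random guess, taking the ±α° window as a fraction of the 360° azimuth range and the 180° elevation range.

```python
def srde(errors, alpha_deg):
    """Percentage of localizations whose azimuth AND elevation errors are within ±alpha.

    errors is a list of (azimuth_error_deg, elevation_error_deg) pairs.
    """
    hits = sum(1 for d_az, d_el in errors
               if abs(d_az) <= alpha_deg and abs(d_el) <= alpha_deg)
    return 100.0 * hits / len(errors)

def random_baseline(alpha_deg):
    """Chance-level SRDE (percent) for a uniformly random direction guess."""
    return 100.0 * (2.0 * alpha_deg / 360.0) * (2.0 * alpha_deg / 180.0)

# Toy error list (degrees); these values are made up for illustration.
toy_errors = [(3.0, -2.0), (12.0, 4.0), (-25.0, 8.0), (9.5, 10.0)]
rates = {alpha: srde(toy_errors, alpha) for alpha in (10, 20, 30)}
print(rates, round(random_baseline(10), 2))
```

Because both angles must fall inside the window at once, SRDE is stricter than reporting azimuth and elevation success rates separately.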
When reverberation is severe, therefore, a small variation in SNR does not affect the SRDE significantly. Comparing across the different algorithms, the proposed GCA significantly outperforms the other three algorithms. SRDE(30°) of GCA is 100% when $T_{60}$ is low, regardless of SNR, and drops to 69.8% in the worst case of $T_{60}$ = 600ms and SNR = -10dB. For TDE and TL-SSC, SRDE(30°) reaches 54.9% and 64.2% in the best case of $T_{60}$ = 0ms and SNR = 10dB. In adverse environments, however, the SRDE of TDE and TL-SSC drops, which shows that high reverberation and low SNR affect the localization effectiveness of these two algorithms. When α° is small, such as 10°, the baseline successful rate of random localization is (20°/360°) × (20°/180°) = 0.62%. In the most adverse environment, TDE provides a low SRDE(10°) that is only slightly better than this baseline rate. Nevertheless, this can still reasonably show that TDE performance will drop for more adverse environments. For LS-SVM, the results for $T_{60}$ = 600ms are left blank in Table II, as the provided source code of the fast image method encounters errors in this case. To avoid an inappropriate implementation of LS-SVM, we present and discuss the results of LS-SVM only when $T_{60}$ varies from 0ms to 400ms. The results show that LS-SVM performs similarly to TDE and TL-SSC in terms of accuracy for low reverberation and high SNR, but its accuracy drops in very adverse environments. When SNR = -10dB, the SRDEs of the competing algorithms with longer $T_{60}$ are sometimes slightly higher than those with shorter $T_{60}$. This shows that these algorithms are more sensitive to extremely low SNR: when the signals are very weak, the algorithms are significantly affected by the noise.

The averages of the DEEs and SRDEs under the twenty different environments are computed for each algorithm. They are plotted in Fig. 4(a) and (b) respectively. In Fig. 4(a), the averages of the mean errors of the azimuth angle and elevation angle by GCA are only 4.6° and 3.1° respectively, indicating that it can estimate DOA very accurately.
In contrast to GCA, the DEEs of the other algorithms are significantly higher. Compared with the best performance among the three competing algorithms, GCA reduces the average ϕ error and θ error by 88.6% and 83.8% respectively over all 20 acoustic environments. In Fig. 4(b), the average SRDE(10º), SRDE(20º) and SRDE(30º) of GCA reach 87.5%, 94.4% and 96.9% respectively. On the other hand, the average SRDEs of the other three methods in the 20 different acoustic environments are significantly lower. Compared with the best performances among the three algorithms, GCA improves the averages of SRDE(10º), SRDE(20º) and SRDE(30º) by 81.1, 74.1 and 60.3 percentage points respectively.

Fig. 4. (a) DOA estimation errors (b) SRDEs with different requirements

However, it should be noted that several factors contribute to the significantly increased SRDEs of GCA. The first is the pre-localization site survey effort to collect the features, which is not needed by TDE and TL-SSC. Secondly, the 567 different positions tested in this experiment cover most of the positions in the room. Because the performance of the other competing algorithms may vary when testing at different positions, the consistent performance of GCA becomes significant when comparing the average SRDEs of the 567 tests. Moreover, for LS-SVM, when the volume of a space cluster is small, it is difficult to decide which cluster the sound source belongs to. This is one reason why LS-SVM cannot achieve higher SSL accuracy; it also verifies the contribution of the proposed WLDM in GCA, which solves this problem. In addition, the GCC features used in GCA are more robust than the TDOA features used in LS-SVM.

Robustness Validation

In practice, when the room geometry and acoustic features change, for example through people moving or doors opening and closing, the validity of the collected training data varies. In this case, we need to ensure that the DOA estimated by GCA remains accurate and robust even when the environment changes after training. Moreover, we also expect the proposed GCA to perform well with sound sources at different frequencies.

For the first experiment, we evaluate the robustness of GCA with respect to changes in reverberation time. We collect four groups of training data with SNR = -10dB, -5dB, 0dB and 10dB respectively, all with T60 = 200ms. Next, for each group, we vary T60 to 0ms, 100ms, 400ms and 600ms to reflect a T60 that has actually changed at localization time, and collect the testing data from the 567 test positions. The results are summarized in Table III(A). The T60 = 200ms results, which match the training condition, serve as the benchmark. Compared with the benchmarks, the worst SRDE(30º) drops are 22.7%, 6.2%, 1.1% and 0.5% respectively for the four groups. This shows that, in terms of SRDE(30º), GCA is robust to T60 even when T60 varies significantly, except in the very adverse environment where SNR = -10dB and T60 = 600ms.

For the second experiment, we validate the robustness of GCA with respect to changes in SNR.
Similarly, we collect five groups of training data with T60 = 0ms, 100ms, 200ms, 400ms and 600ms respectively, all with SNR = 0dB. Next, we vary the SNR to -10dB, -5dB and 10dB to reflect an SNR that has actually changed at localization time, and collect the testing data from the 567 test positions. The results are summarized in Table III(B). The results with SNR = 0dB serve as the benchmark. Compared with the benchmarks, the worst SRDE(30º) drops are 5.7%, 6.7%, 5.1%, 29.4% and 36.2% respectively for the five groups. Therefore, GCA is robust to SNR except in the very adverse environments where SNR = -10dB and T60 = 400ms or 600ms.

For the third experiment, we illustrate the impact of frequency change on the localization accuracy of GCA. Instead of human speech, we use two new sound sources [29], machinery sound and telephone ring, whose frequencies differ from those of the preceding human speech source. We conduct the experiment at two SNR levels, the lower of which is -10dB, with T60 varying from 0ms to 600ms. All the setups are the same as those of the human speech scenario. The results are summarized in Table III(C). Compared with the human speech results in Table II, when T60 = 0ms, 100ms and 200ms, SRDE(30º) is almost unchanged for both new sources. When T60 becomes higher at SNR = -10dB, SRDE(30º) of machinery sound increases a little while that of telephone ring decreases slightly. When T60 becomes higher at the higher SNR, the SRDE(30º) of both machinery sound and telephone ring drops slightly faster. However, on average, SRDE(30º) accuracies of 94.9% and 88.7% are still achieved for these two new sound sources respectively. Therefore, we can conclude that the localization accuracy of the proposed method is only slightly affected by changes in frequency, and good performance can still be achieved when the frequency changes.
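The worst-case drops quoted in the robustness experiments are the largest gap between the benchmark SRDE(30º) and the SRDE(30º) obtained under mismatched conditions. The sketch below illustrates that calculation for the four SNR groups of Table III(A). The row values are an assumed reading of the table: digits that are missing from this copy are filled in so that the row reproduces the drops quoted in the text, and `worst_drop` is a hypothetical helper name.

```python
# SRDE(30 deg) in percent for localization at T60 = 0, 100, 200, 400, 600 ms,
# with training at T60 = 200 ms (index 2 is the benchmark).
# Assumed reading of Table III(A); missing digits are restored by assumption.
SRDE30_BY_SNR_DB = {
    -10: [99.5, 98.9, 99.8, 82.5, 77.1],
    -5: [99.1, 99.5, 100.0, 99.5, 93.8],
    0: [100.0, 100.0, 100.0, 100.0, 98.9],
    10: [100.0, 100.0, 100.0, 100.0, 99.5],
}
BENCHMARK_INDEX = 2


def worst_drop(values, benchmark_index):
    """Largest decrease relative to the benchmark, in percentage points."""
    benchmark = values[benchmark_index]
    others = [v for i, v in enumerate(values) if i != benchmark_index]
    return round(max(benchmark - v for v in others), 1)


if __name__ == "__main__":
    for snr_db, row in SRDE30_BY_SNR_DB.items():
        print(snr_db, worst_drop(row, BENCHMARK_INDEX))
```

The same helper applies to Table III(B) by treating the SNR = 0dB column as the benchmark within each T60 group.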
Impact by Different Test Set

To evaluate the performance of the four algorithms with a different test set, we re-compute their SRDEs with 378 positions. To make sure that we evaluate the competing algorithms correctly, we use the author-shared source codes of TDE and TL-SSC. The 378 positions are obtained by removing the source positions on the spherical surface with radius = 1.0m from the previous 567 positions. During the experiment, the SNR is set to -10dB, -5dB, 0dB and 10dB, while T60 varies from 0ms to 600ms. The results are summarized in Table IV. It can be observed that the three competing algorithms perform relatively well when the SNR is higher, i.e. SNR = 0dB and 10dB, compared with the cases where the SNR is lower, i.e. SNR = -10dB and -5dB. However, GCA still outperforms them even at the higher SNRs. This shows that, in adverse environments, GCA performs better than the other three algorithms whether the sound source is close to or far away from the microphone array. This better performance results from several factors, such as the study of the acoustic features, the robust GCC feature, the proposed WLDM method, and the consistent localization capability at different test positions in a room.

Complexity Comparisons

The computational complexities of the four algorithms at the 567 positions are summarized in Table V. For each T60, the presented CPU time and real processing time are the averages over SNR = -10dB, -5dB, 0dB and 10dB. As more than one core is used during computation, the CPU time is higher than the real processing time. From Table V, the cost of the machine learning based GCA and LS-SVM is dominated by offline training, which consists of RIR computation, feature extraction and training. However, this drawback can be mitigated by taking a further approximation with the fast image method and by using fewer training samples. In addition, offline training in a machine learning algorithm is performed only once for a fixed room.
For GCA, the CPU time of the remaining online localization accounts for only 0.88%, 0.55%, 0.32%, 0.13% and 0.06% of the total CPU time for T60 = 0ms, 100ms, 200ms, 400ms and 600ms respectively. This makes GCA especially suitable for real-time localization applications once the pre-localization site survey has been done. In contrast, LS-SVM is computationally inefficient when the number of classification categories is large, and its offline training is costly in CPU time (Table V). This validates the advantage of GCA in training speed compared with LS-SVM. The CPU time that TDE needs to generate the results of localizing the 567 positions is also given in Table V. TL-SSC is the most computationally efficient algorithm among the four methods: its cost consists of the pre-localization look-up table (LUT) computation and a short online localization. Compared with the time cost of TL-SSC, GCA spends more CPU time. However, this computational overhead is acceptable considering the significant improvements of 82.6, 74.1 and 60.3 percentage points by GCA over TL-SSC for SRDE(10º), SRDE(20º) and SRDE(30º) respectively.

TABLE II
RESULTS OF THE FOUR ALGORITHMS AT 567 TEST POSITIONS
Within each SNR group, the five values correspond to T60 = 0ms, 100ms, 200ms, 400ms and 600ms.

SNR = -10dB | SNR = -5dB

GCA (Proposed)
ϕ error (mean/deviation)/º: .8/.6 2.4/2.0 5./ / /28.2.2/0.9.3/. 2./.8 5.4/ /5.3
θ error (mean/deviation)/º: .3/..7/.3 3.6/ / / /0.7.2/0.9.5/.2 3.6/ /8.3
SRDE(10º): 100% 99.6% 84.3% 46.7% 27.7% | 100% 100% 99.5% 8.8% 56.3%
SRDE(20º): 100% 100% 97.9% 75.% 55.2% | 100% 100% 100% 96.0% 82.7%
SRDE(30º): 100% 100% 99.8% 88.7% 69.8% | 100% 100% 100% 98.4% 90.8%

TDE [14]
ϕ error (mean/deviation)/º: 87.2/ / / / / / / / / /56.7
θ error (mean/deviation)/º: 43.7/ / / / / / / / / /29.6
SRDE(10º): 0.9% .8% .4% .4% 0.9% | 3.2% 3.2% 3.5% 3.0% 2.3%
SRDE(20º): 3.5% 4.8% 4.4% 3.4% 4.4% | .5% 0.9% .3% 8.6% 7.6%
SRDE(30º): 9.7% 9.0% 7.9% 8.6% 0.2% | 20.6% 8.3% 8.7% 7.6% 3.%

TL-SSC [18]
ϕ error (mean/deviation)/º: 47.6/ / / / / / / / / /43.3
θ error (mean/deviation)/º: 9.0/.4 9.0/.4 9.0/.5 9./.6 9./.7 | 8.9/.5 8.8/.4 8.7/.3 8.9/.5 8.9/.5
SRDE(10º): 4.8% 5.5% 5.5% 3.5% 4.8% | 3.9% 4.9% 3.0% 3.0% 3.7%
SRDE(20º): 8.3% 5.3% 2.5% 7.9% 7.2% | 27.0% 23.3% 6.0% 9.7% 9.7%
SRDE(30º): 36.9% 3.9% 25.6% 8.0% 6.2% | 43.9% 40.2% 33.0% 23.6% 2.3%

LS-SVM [19]
ϕ error (mean/deviation)/º: 46.0/ / / / / / / /
θ error (mean/deviation)/º: 36.2/ / / / / / / /
SRDE(10º): 3.9% .8% .6% .6% - | 9.0% 7.8% 2.5% 3.0% -
SRDE(20º): 3.4% 7.% 6.5% 6.9% % 20.3% 2.2% 7.2% -
SRDE(30º): 23.3% 5.3% 2.5% 3.2% - | 4.4% 35.3% 22.% 6.% -

SNR = 0dB | SNR = 10dB

GCA (Proposed)
ϕ error (mean/deviation)/º: ./0.9./.0.5/.2 3.5/ /8.7./0.9.0/0.9.3/. 3.5/0. 6.7/3.2
θ error (mean/deviation)/º: 0.9/ /0.7./ / / / /0.7.0/0.8.8/ /6.3
SRDE(10º): 100% 100% 100% 93.5% 77.8% | 100% 100% 100% 94.5% 8.7%
SRDE(20º): 100% 100% 100% 99.0% 92.% | 100% 100% 100% 98.8% 9.0%
SRDE(30º): 100% 100% 100% 99.5% 97.2% | 100% 100% 100% 99.% 94.5%

TDE [14]
ϕ error (mean/deviation)/º: 5.8/ / / / / / / / / /57.4
θ error (mean/deviation)/º: 3.5/ / / / / / / / / /27.4
SRDE(10º): .3% 6.9% 7.2% 4.9% 4.% | 26.% 6.% 2.5% 9.4% 8.%
SRDE(20º): 24.5% 20.8% 20.8% 7.3% 5.2% | 35.5% 28.2% 25.4% 20.6% 9.9%
SRDE(30º): 37.9% 33.3% 33.2% 27.0% 26.5% | 54.9% 47.4% 42.9% 35.6% 36.5%

TL-SSC [18]
ϕ error (mean/deviation)/º: 29.2/ / / / / / / / / /4.8
θ error (mean/deviation)/º: 8.9/.6 8.7/.4 8.6/.4 8.7/.4 8.8/.5 | 9.5/.2 9.2/.3 8.6/.3 8.6/.4 8.7/.3
SRDE(10º): 6.7% 5.% 3.9% 3.5% 3.3% | 0.6% 9.5% 6.5% 3.5% 2.3%
SRDE(20º): 35.8% 3.4% 22.4% 2.5% 0.8% | 39.5% 4.6% 33.9% 8.5% 3.6%
SRDE(30º): 5.0% 49.7% 39.5% 28.4% 24.5% | 64.2% 6.4% 54.7% 36.2% 3.2%

LS-SVM [19]
ϕ error (mean/deviation)/º: 24.3/ / / / / / / /29. -
θ error (mean/deviation)/º: 34.2/ / / / / / / /
SRDE(10º): 2.2% 0.4% 7.2% 3.0% - | 9.4% .% 0.6% 5.3% -
SRDE(20º): 24.2% 26.3% 8.9% .% - | 2.2% 25.9% 24.7% 2.2% -
SRDE(30º): 37.0% 40.7% 34.7% 23.% % 42.2% 37.6% 35.5% -

TABLE III
(A) ROBUSTNESS VALIDATION FOR PROPOSED GCA WITH FIXED SNR AND VARYING T60 AT 567 TEST POSITIONS
Groups: SNR = -10dB | -5dB | 0dB | 10dB. T60 (train) = 200ms. Within each group, T60 (localize) = 0ms, 100ms, 200ms, 400ms and 600ms; the 200ms column is the benchmark.
SRDE(10º): 67.2% 64.9% 84.3% 44.3% 35.% | 74.4% 76.9% 99.5% 76.4% 66.7% | 69.0% 76.4% 100% 83.8% 79.5% | 8.% 86.% 100% 94.2% 9.9%
SRDE(20º): 96.3% 95.% 97.9% 75.0% 64.2% | 97.5% 98.2% 100% 96.5% 9.0% | 96.8% 98.2% 100% 99.0% 97.0% | 98.2% 98.8% 100% 99.8% 99.%
SRDE(30º): 99.5% 98.9% 99.8% 82.5% 77.% | 99.% 99.5% 100% 99.5% 93.8% | 100% 100% 100% 100% 98.9% | 100% 100% 100% 100% 99.5%

(B) ROBUSTNESS VALIDATION FOR PROPOSED GCA WITH FIXED T60 AND VARYING SNR AT 567 TEST POSITIONS
Groups: T60 = 0ms | 100ms | 200ms | 400ms | 600ms. SNR (train) = 0dB. Within each group, SNR (localize) = -10dB, -5dB, 0dB and 10dB; the 0dB column is the benchmark.
SRDE(10º): 76.9% 86.8% 100% 96.0% | 76.2% 95.% 100% 83.4% | 64.4% 92.4% 100% 68.8% | 34.2% 72.8% 93.5% 72.7% | 25.% 58.9% 77.8% 68.4%
SRDE(20º): 87.% 98.4% 100% 99.8% | 85.8% 100% 100% 98.8% | 86.8% 99.8% 100% 96.8% | 59.% 90.8% 99.0% 95.6% | 47.% 80.% 92.% 92.2%
SRDE(30º): 94.3% 99.3% 100% 100% | 93.3% 100% 100% 99.5% | 94.9% 99.8% 100% 99.8% | 70.% 94.7% 99.5% 99.4% | 6.0% 88.7% 97.2% 96.6%

(C) RESULTS OF PROPOSED GCA BY SOUND SOURCES AT DIFFERENT FREQUENCIES WITH 567 TEST POSITIONS
For each source, the two groups are the two SNR levels; within each group, T60 = 0ms, 100ms, 200ms, 400ms and 600ms.
Machinery sound
SRDE(10º): 98.9% 98.8% 87.7% 50.8% 26.6% | 99.3% 99.8% 99.% 84.% 69.7%
SRDE(20º): 100% 100% 98.9% 8.8% 55.9% | 100% 100% 100% 94.0% 84.0%
SRDE(30º): 100% 100% 99.5% 9.4% 7.6% | 100% 100% 100% 96.8% 90.%
Telephone ring
SRDE(10º): 100% 98.4% 7.6% 27.7% 2.7% | 97.4% 97.7% 90.8% 47.6% 28.2%
SRDE(20º): 100% 100% 93.3% 58.9% 37.0% | 100% 100% 98.2% 77.3% 56.4%
SRDE(30º): 100% 100% 98.% 76.4% 52.9% | 100% 100% 99.3% 88.7% 72.0%

TABLE IV
RESULTS OF THE FOUR ALGORITHMS AT 378 TEST POSITIONS
Groups: SNR = -10dB | -5dB | 0dB | 10dB. Within each group, T60 = 0ms, 100ms, 200ms, 400ms and 600ms.

GCA (Proposed)
SRDE(10º): 100% 99.5% 82.0% 48.9% 3.7% | 100% 100% 99.2% 76.5% 57.9% | 100% 100% 100% 9.8% 72.5% | 100% 100% 100% 93.4% 80.%
SRDE(20º): 100% 100% 96.8% 73.8% 57.9% | 100% 100% 100% 94.4% 79.4% | 100% 100% 100% 98.4% 90.0% | 100% 100% 100% 98.9% 9.3%
SRDE(30º): 100% 100% 99.7% 87.3% 70.% | 100% 100% 100% 97.6% 87.8% | 100% 100% 100% 99.2% 96.3% | 100% 100% 100% 99.2% 93.9%

TDE [14]
SRDE(10º): 0.8% 2.% .3% .6% .% | 3.2% 3.2% 3.4% 2.6% .9% | .6% 7.9% 7.% 5.0% 5.0% | 27.5% 5.% 5.% .6% 9.8%
SRDE(20º): 3.7% 5.0% 4.0% 3.7% 5.6% | .9% 0.6% .9% 8.5% 8.2% | 25.9% 20.6% 20.4% 8.8% 6.9% | 36.5% 27.8% 28.3% 24.% 9.8%
SRDE(30º): 0.% 0.% 7.4% 9.8% 0.6% | 20.4% 6.% 20.% 7.2% 3.2% | 40.2% 32.5% 32.5% 28.0% 27.5% | 55.0% 48.9% 44.7% 38.7% 37.0%

TL-SSC [18]
SRDE(10º): 4.5% 5.6% 5.6% 4.0% 5.0% | 3.4% 4.2% 2.4% 3.2% 3.4% | 6.% 3.7% 3.7% 3.2% 3.2% | 7.% 6.6% 4.5% 3.% 2.6%
SRDE(20º): 6.7% 3.2% 2.4% 7.4% 7.4% | 24.9% 2.4% 3.5% 9.0% 9.0% | 33.9% 27.3% 20.4% .4% 8.5% | 37.0% 39.2% 29.9% 6.4% 4.%
SRDE(30º): 35.4% 30.4% 24.9% 8.0% 5.9% | 42.6% 38.6% 3.0% 23.0% 9.8% | 48.2% 46.6% 36.5% 26.2% 22.5% | 62.8% 58.0% 50.7% 33.8% 28.9%

LS-SVM [19]
SRDE(10º): 4.5% 2.4% .% .% - | 0.3% 8.7% 2.% 2.4% - | 2.4% .4% 7.4% 3.2% - | 0.6% 0.9% 2.4% 5.3% -
SRDE(20º): 4.6% 8.5% 6.% 6.6% % 2.7% 3.0% 5.8% % 24.6% 20.4% 2.4% % 27.8% 26.7% 22.8% -
SRDE(30º): 25.7% 6.7% .9% .6% % 38.% 22.5% 4.6% % 39.4% 37.3% 23.3% % 4.6% 40.2% 37.0% -

TABLE V
THE COMPUTATIONAL COMPLEXITY BY CPU TIME (S) AND REAL TIME (S) AT 567 TEST POSITIONS
Columns: CPU time and real time for each of T60 = 0ms, 100ms, 200ms, 400ms and 600ms.
Rows:
GCA (Proposed): offline training; online localization
TDE [14]: online localization
TL-SSC [18]: offline LUT computing; online localization
LS-SVM [19]: offline training; online localization

TABLE VI
TRADEOFF BETWEEN LOCALIZATION ACCURACY AND COMPUTATIONAL COMPLEXITY AT 567 TEST POSITIONS
K: 512 | 4096 | 32768
SRDE(10º): 86.9% | 99.6% | 99.8%
SRDE(20º): 99.8% | 100% | 100%
SRDE(30º): 100% | 100% | 100%
Timing rows: offline training and online localization, each with CPU time (s) and real time (s) for every K.

Tradeoff strategy

The computational complexity grows in proportion to the number of clusters K. When K is small, the computational complexity is low but the features become inconsistent, which degrades the localization accuracy. In contrast, if K is large, the features become consistent but the computational complexity becomes expensive or even unaffordable. Therefore, the number of space clusters should be determined by making a tradeoff between localization accuracy and computational complexity. To illustrate this tradeoff, we conduct experiments with varying K in the environment where T60 = 100ms and SNR = -10dB. K is set to 512, 4096 and 32768, corresponding to cluster volumes of 0.5m × 0.5m × 0.5m, 0.25m × 0.25m × 0.25m and 0.125m × 0.125m × 0.125m. The results are summarized in Table VI. From the results, it can be observed that the complexity increases with K. When K = 4096, compared with the case of K = 512, SRDE(10º) improves significantly by 12.7 percentage points, reaching 99.6%, at the cost of some increase in complexity.
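The three values of K follow from tiling the room with cubic clusters: halving the cube side multiplies K by eight. The sketch below illustrates this under the assumption of a box-shaped room of 4m × 4m × 4m, a hypothetical size chosen only because its 64m³ volume is consistent with the K values and cube sides discussed in this section. `cluster_count` is an illustrative helper, not part of the authors' code.

```python
def cluster_count(room_dims_m, cube_side_m):
    """Number of cubic clusters of the given side that tile a box-shaped room.

    Assumes every room dimension is an integer multiple of the cube side.
    """
    count = 1
    for dim in room_dims_m:
        count *= round(dim / cube_side_m)
    return count


if __name__ == "__main__":
    room = (4.0, 4.0, 4.0)  # hypothetical room size in metres
    for side in (0.5, 0.25, 0.125):
        print(f"side = {side} m -> K = {cluster_count(room, side)}")
```

Because K grows with the inverse cube of the cluster side, each refinement step costs roughly eight times more classes to train and evaluate, which is why the gain in SRDE(10º) has to be weighed against the added complexity.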
When K increases from 4096 to 32768, SRDE(10º) can hardly be improved further, while the complexity becomes very expensive. Therefore, we divide the room into 4096 clusters.

V. CONCLUSION

In this paper, we address the problem of SSL in challenging environments with high reverberation and low SNR by proposing a novel machine learning based algorithm, GCA. With the GCC feature, the proposed GCA transforms the SSL problem into a likelihood-based nonlinear classification problem by utilizing PNN, which is especially suitable for multiclass classification. In order to overcome misclassification and estimate the DOA more accurately, we propose WLDM in GCA. The experimental results show that GCA achieves more accurate DOA estimation: the averages of the mean azimuth angle and elevation angle estimation errors of GCA are only 4.6º and 3.1º respectively. Compared with the best performances of three recently published algorithms, GCA improves the average SRDE(10º), SRDE(20º) and SRDE(30º) by 81.1, 74.1 and 60.3 percentage points respectively. In addition, GCA performs robustly in different acoustic environments. This validates that the proposed GCA can localize very effectively in applications where the acoustic features of the physical site can be accessed before the localization stage. This data-driven training method is especially suitable for industrial environments that are too complex to be modeled.

REFERENCES

[1] H. Guo, K. S. Low, and H. A. Nguyen, "Optimizing the localization of a wireless sensor network in real time based on a low-cost microcontroller," IEEE Trans. Ind. Electron., vol. 58, no. 3, Mar. 2011.
[2] B. Wang, S. Zhou, W. Liu, and Y. Mo, "Indoor localization based on curve fitting and location search using received signal strength," IEEE Trans. Ind. Electron., vol. 62, no. 1, Jan.
[3] F. Deng et al., "Energy-based sound source localization with low power consumption in wireless sensor networks," IEEE Trans. Ind. Electron., vol. 64, no. 6, Jun.
[4] P. Yang and W. Wu, "Efficient particle filter localization algorithm in dense passive RFID tags environment," IEEE Trans. Ind. Electron., vol. 61, no. 10, Oct.
[5] J. Kim and W. Chung, "Localization of a mobile robot using a laser range finder in a glass-walled environment," IEEE Trans. Ind. Electron., vol. 63, no. 6, Jun.
[6] H. Song, W. Choi, and H. Kim, "Robust vision-based relative-localization approach using an RGB-depth camera and LiDAR sensor fusion," IEEE Trans. Ind. Electron., vol. 63, no. 6, Jun.
[7] J. Wang, Q. Gao, Y. Yu, H. Wang, and M. Jin, "Toward robust indoor localization based on Bayesian filter using chirp-spread-spectrum ranging," IEEE Trans. Ind. Electron., vol. 59, no. 3, Mar.
[8] J. Wang et al., "Transferring compressive-sensing-based device-free localization across target diversity," IEEE Trans. Ind. Electron., vol. 62, no. 4, Apr.
[9] J. Chen, J. Benesty, and Y. Huang, "Time delay estimation in room acoustic environments: An overview," EURASIP J. Appl. Signal Process., vol. 2006, pp. 1-19, Jan.
[10] S. Argentieri, P. Danès, and P. Souères, "A survey on sound source localization in robotics: From binaural to array processing methods," Comput. Speech Lang., vol. 34, no. 1, pp. 87-112, Nov.
[11] H. He, L. Wu, J. Lu, X. Qiu, and J. Chen, "Time difference of arrival estimation exploiting multichannel spatio-temporal prediction," IEEE Trans. Audio Speech Lang. Process., vol. 21, no. 3, Mar.
[12] D. Pavlidi, A. Griffin, M. Puigt, and A. Mouchtaris, "Real-time multiple sound source localization and counting using a circular microphone array," IEEE Trans. Audio Speech Lang. Process., vol. 21, no. 10, Oct.
[13] A. Canclini, E. Antonacci, A. Sarti, and S. Tubaro, "Acoustic source localization with distributed asynchronous microphone networks," IEEE Trans. Audio Speech Lang. Process., vol. 21, no. 2, Feb.
[14] X. Alameda-Pineda and R. Horaud, "A geometric approach to sound source localization from time-delay estimates," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 22, no. 6, Jun.
[15] J. Velasco, C. J. Martín-Arguedas, J. Macias-Guarasa, D. Pizarro, and M. Mazo, "Proposal and validation of an analytical generative model of SRP-PHAT power maps in reverberant scenarios," Signal Process., vol. 119, pp. 209-228, Feb. 2016.
[16] B. Mungamuru and P. Aarabi, "Enhanced sound localization," IEEE Trans. Syst. Man Cybern. B Cybern., vol. 34, no. 3, Jun.
[17] J. Dmochowski, J. Benesty, and S. Affes, "A generalized steered response power method for computationally viable source localization," IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 8, pp. 2510-2526, Nov. 2007.
[18] D. Yook, T. Lee, and Y. Cho, "Fast sound source localization using two-level search space clustering," IEEE Trans. Cybern., vol. 46, no. 1, pp. 20-26, Jan.
[19] H. Chen and W. Ser, "Acoustic source localization using LS-SVMs without calibration of microphone arrays," in Proc. IEEE Int. Symp. Circuits Syst., Taipei, Taiwan, May 2009.
[20] X. Xiao et al., "A learning-based approach to direction of arrival estimation in noisy and reverberant environments," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Brisbane, Australia, Apr. 2015.
[21] X. Li and H. Liu, "Sound source localization for HRI using FOC-based time difference feature and spatial grid matching," IEEE Trans. Cybern., vol. 43, no. 4, Aug.
[22] B. Laufer-Goldshtein, R. Talmon, and S. Gannot, "Semi-supervised sound source localization based on manifold regularization," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 24, no. 8, pp. 1393-1407, Aug.
[23] D. F. Specht, "Probabilistic neural networks," Neural Netw., vol. 3, no. 1, pp. 109-118, Jan.
[24] E. Lehmann and A. Johansson, "Diffuse reverberation model for efficient image-source simulation of room impulse responses," IEEE Trans. Audio Speech Lang. Process., vol. 18, no. 6, pp. 1429-1439, Aug. 2010.
[25] Y. Hu and P. Loizou. NOIZEUS database. [Online].
[26] X. Alameda-Pineda and R. Horaud. The gtde MATLAB toolbox. [Online]. Available: ...r/perception/research/geometric-sound-sourcelocalization
[27] T. Lee. TLSSC code. [Online]. Available: https://github.com/leetaewoo/fast_sound_source_localization_using_tlssc
[28] E. Lehmann. Fast ISM code. [Online]. Available: ...nn.com
[29] FindSounds database. [Online].

Chau Yuen (S'02-M'08-SM'12) received the B.Eng. and Ph.D. degrees from Nanyang Technological University, Singapore, in 2000 and 2004, respectively. In 2005, he was a Post-Doctoral Fellow with Lucent Technologies Bell Labs, Murray Hill, NJ, USA. In 2008, he was a Visiting Assistant Professor with Hong Kong Polytechnic University, Hong Kong. From 2006 to 2010, he was a Senior Research Engineer with the Institute for Infocomm Research, Singapore, where he was involved in an industrial project developing an 802.11n wireless local area network system and actively participated in the Third Generation Partnership Project Long-Term Evolution (LTE) and LTE-A standardization. In 2010, he joined the Singapore University of Technology and Design, Singapore, as an Assistant Professor. He has authored over 300 research papers in international journals or conferences. He holds two U.S. patents. He received the IEEE Asia-Pacific Outstanding Young Researcher Award in 2012. He serves as an Editor of the IEEE TRANSACTIONS ON COMMUNICATIONS and the IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY.

Susanto Rahardja (F'11) received the B.Eng. degree from the National University of Singapore in 1991, and the M.Eng. and Ph.D. degrees, both in Electronic Engineering, from Nanyang Technological University, Singapore, in 1993 and 1997 respectively. He is currently a Chair Professor at the Northwestern Polytechnical University (NPU) under the Thousand Talent Plan of the People's Republic of China. His research interests are in multimedia, signal processing, wireless communications, discrete transforms and signal processing algorithms, implementation and optimization. Dr. Rahardja was the recipient of numerous awards, including the IEE Hartree Premium Award, the Tan Kah Kee Young Inventors' Open Category Gold Award, the Singapore National Technology Award, the A*STAR Most Inspiring Mentor Award, Finalist of the 2010 World Technology & Summit Award, the Nokia Foundation Visiting Professor Award and the ACM Recognition of Service Award.

Yingxiang Sun (S'16) received his B.Eng. degree in Electronic and Information Engineering from Xidian University, Xi'an, China. He received his M.Eng. degree in Electromagnetic Field and Microwave Technology from the 54th Research Institute of China Electronics Technology Group Corporation, Shijiazhuang, China, in 2012. Currently, he is working towards the Ph.D. degree at the Pillar of Engineering Product Development, Singapore University of Technology and Design, Singapore. His current research interests include digital signal processing and machine learning.

Jiajia Chen received his B.Eng. (Hons) and Ph.D. degrees from Nanyang Technological University, Singapore, in 2004 and 2010, respectively. Since April 2012, he has been with the Singapore University of Technology and Design, where he is currently a Senior Lecturer. His research interests include computational transformations of low-complexity digital filters, image fusion and audio signal processing. Dr. Chen served as Web Chair of the Asia-Pacific Computer Systems Architecture Conference 2005, as a Technical Program Committee member of the European Signal Processing Conference 2014 and the Third IEEE International Conference on Multimedia Big Data 2017, and as Associate Editor of the Springer EURASIP Journal on Embedded Systems since 2016.
