Speaker Localization in Noisy Environments Using Steered Response Voice Power

IEEE Transactions on Consumer Electronics, Vol. 61, No. 1, February 2015

Hyeontaek Lim, In-Chul Yoo, Youngkyu Cho, and Dongsuk Yook, Member, IEEE

Abstract: Many devices, including smart TVs and humanoid robots, can be operated through a speech interface. Since a user can interact with such a device at a distance, speech-operated devices must be able to process speech signals from a distance. Although many methods exist to localize speakers via sound source localization, it is very difficult to reliably find the location of a speaker in a noisy environment. In particular, conventional sound source localization methods only find the loudest sound source within a given area, and such a sound source may not necessarily be related to human speech. This can be problematic in real environments where loud noises frequently occur, and the performance of speech-based interfaces for a variety of devices could be negatively impacted as a result. In this paper, a new speaker localization method is proposed. It identifies the location associated with the maximum voice power from all candidate locations. The proposed method is tested under a variety of conditions using both simulation data and real data, and the results indicate that the performance of the proposed method is superior to that of a conventional algorithm for various types of noise¹.

Index Terms: sound source localization, speaker localization, human-robot interface.

Contributed Paper. Manuscript received 12/30/14. Current version published 03/30/15. Electronic version published 03/30/15.

¹ This work was supported by the Korea Research Foundation (KRF) grant funded by the Korean government (MEST) (No. 2011-0002906).
Hyeontaek Lim is with the Speech Information Processing Laboratory, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 136-701, Republic of Korea (e-mail: htlim@voice.korea.ac.kr).
In-Chul Yoo is with the Speech Information Processing Laboratory, Department of Computer and Communication Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 136-701, Republic of Korea (e-mail: icyoo@voice.korea.ac.kr).
Youngkyu Cho is with LG Electronics Seocho R&D Campus, 19 Yangjae-daero 11-gil, Seocho-gu, Seoul, 137-130, Republic of Korea (e-mail: youngkyu.cho@lge.com). This work was done when Youngkyu Cho was with Korea University.
Dongsuk Yook is with the Speech Information Processing Laboratory, Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 136-701, Republic of Korea (e-mail: yook@voice.korea.ac.kr).
0098-3063/15/$20.00 © 2015 IEEE

I. INTRODUCTION

Speech has several benefits as a communication medium: it is the basic interface that humans use to communicate with each other, and it does not require additional devices. More importantly, speech can travel over long distances, making it particularly useful for a variety of devices, including humanoid robots and smart TVs, since the user and the device are typically separated by a certain distance [1][2]. For example, a speech-based human-robot interface can provide a natural human-like interface without the need for external devices, such as remote controllers, and users can use familiar speech-based commands to control smart TVs from anywhere in the home. For such speech interfaces to be properly implemented, a method to process distant speech signals should be included [3]. Unlike speech signals detected at close range, speech signals travelling over longer distances are usually degraded and corrupted by severe unrelated noise.
A typical solution to this distant speech problem involves using microphone arrays both to enhance the speech signals coming from the desired direction and to reduce the noise signals coming from other directions. As a result, the quality of the speech signal improves. However, the location of the speaker must be estimated before improving the speech signal. In addition to speech enhancement using beamforming, information on the speaker's location can be used to enable efficient and natural interfaces [4]-[7]. For example, when a user interacts with a humanoid robot, the robot can use the user's location to turn and face him or her, and a smart doorbell can steer its camera to focus on the visitor's face.

For most applications, the relative location of the speaker is not known, requiring some method to determine the position of the speaker. Sound source localization (SSL) is one way to determine the location of the speaker, and this method is effective regardless of lighting conditions, allowing an estimation of the speaker's location even in the dark. Several methods have been proposed for sound source localization [8]-[13], and steered response power with a phase transform filter (SRP-PHAT) is generally known to be one of the most robust of such methods when the room produces reverberation [12][13]. However, direct use of SRP-PHAT has been shown to negatively impact the performance of real-life speech-based applications. SRP-PHAT steers the microphone array to determine the location of the maximum output power, and the output power of the beamformer is typically measured as the sum of the cross correlation values for each pair of microphone signals. Since SRP-PHAT estimates the power of the voice signal for a given location by using only the cross correlation values of the input signal, a noise source could be identified as the maximum output power location if the noise is louder than the voice of the speaker.
That is, conventional SRP-PHAT points to the direction of a noise source if the unrelated noise has higher steered energy, even when the steered energy from the location of the desired speaker remains high, because SRP-PHAT steers to the highest energy point regardless of the characteristics or content of the sound signal. When SRP-PHAT is implemented for speaker localization, speech characteristics must be taken into account in order to assign a higher weight to actual speech sources than to sources of loud noise [14]. Voice activity detection (VAD), which distinguishes human speech from noise [15][16], can be applied to handle this problem.

This paper proposes a robust speaker localization technique that utilizes VAD. The proposed method uses SRP-PHAT for sound source localization and adopts a VAD scheme to take into account the content of the sound signal rather than just the steered response power of the signal. As a result, the proposed method can compute the steered response voice power (SRVP) for each candidate speaker location. Since the proposed method can identify content within the signal and not just the power of the signal, the location of the voice source can be effectively localized, even under conditions with a 0 dB signal-to-noise ratio (SNR). As a result, speech-based interfaces can be implemented for actual use with a variety of mobile devices, even where unrelated noise might frequently occur.

The rest of this paper is organized as follows. Section II analyzes the problem of conventional SSL using SRP-PHAT and then describes the speaker localization method that computes the SRVP by adopting SRP-PHAT and VAD. The proposed method is then evaluated in Section III. Finally, Section IV concludes the paper.

II. STEERED RESPONSE VOICE POWER

A. SRP-PHAT under Noisy Environments

In the frequency domain, the output $Y_q(\omega)$ of a filter-and-sum beamformer focused on location $q$ is defined as follows:

$$Y_q(\omega) = \sum_{m=1}^{M} G_m(\omega) X_m(\omega) e^{j\omega\tau_{m,q}}, \tag{1}$$

where $M$ represents the number of microphones, $X_m(\omega)$ and $G_m(\omega)$ are respectively the Fourier transforms of the m-th microphone signal and its associated filter, and $\tau_{m,q}$ is the direct time of travel from location $q$ to the m-th microphone. The output is obtained by phase-aligning the microphone signals with the steering delays and summing them after the filter is applied.

The sound source localization algorithm based on SRP-PHAT calculates the output power, $P(q)$, of the microphone array focused on location $q$ as follows:

$$
\begin{aligned}
P(q) &= \int_{-\infty}^{\infty} |Y_q(\omega)|^2 \, d\omega \\
&= \int_{-\infty}^{\infty} \left( \sum_{l=1}^{M} G_l(\omega) X_l(\omega) e^{j\omega\tau_{l,q}} \right) \left( \sum_{k=1}^{M} G_k^*(\omega) X_k^*(\omega) e^{-j\omega\tau_{k,q}} \right) d\omega \\
&= \sum_{l=1}^{M} \sum_{k=1}^{M} \int_{-\infty}^{\infty} \Psi_{lk}(\omega) X_l(\omega) X_k^*(\omega) e^{j\omega(\tau_{l,q} - \tau_{k,q})} \, d\omega,
\end{aligned} \tag{2}
$$

where $\Psi_{lk}(\omega) = G_l(\omega) G_k^*(\omega) = 1 / |X_l(\omega) X_k^*(\omega)|$ is the phase transform weight. After calculating the steered response power, $P(q)$, for each candidate location, the point $\hat{q}$ that has the maximum output power is selected as the location of the sound source:

$$\hat{q} = \arg\max_{q} P(q). \tag{3}$$

Although SRP-PHAT is one of the most popular techniques in use for sound source localization, it may not be adequate for speaker localization in noisy environments. Fig. 1 shows an example, not unusual in real-world scenarios, where the noise is louder than the voice. In such a case, SRP-PHAT does not distinguish between the voice and the noise and simply computes the output power of an input signal, so if the noise has greater power, the location of the noise is identified rather than that of the voice. It should be noted that a high level of energy at the desired speaker's location can still be observed in Fig. 1, while the unwanted noise has an even higher steered energy.

Fig. 1. An example of the steered response power of a noisy voice signal where the noise is louder than the voice.

B. Steered Response Voice Power

One method that can be used to manage such a problem involves applying VAD values as weights for the SRP-PHAT energy maps. Since the VAD values are high for the speech signals and low for the noise signals, this method can effectively boost the peak from the speech signals while also reducing the peaks from the noise signals. However, the SRP-PHAT algorithm already requires a huge amount of computation, and computing the VAD values for every candidate location significantly increases the computational load. This section presents a robust speaker localization method that can distinguish between the locations of voice sources and noise sources with little additional computational cost. The proposed method finds the point associated with the maximum voice similarity [14] instead of the maximum output power of SRP-PHAT. It extracts n-best candidate locations and applies VAD algorithms to these candidate locations to determine the position where the voice similarity is the highest. Fig. 2 illustrates the steps of the proposed method.

Fig. 2. The proposed speaker localization using steered response voice power. The method is composed of three steps: n-best candidate selection using smoothed steered response power, beamforming of the candidate locations, and maximum voice similarity selection. (Blocks shown: competing noise, voice, and background noise as inputs; sound source localization producing n-best candidates; speech enhancement producing a beamformed signal; voice similarity measurement using peak signatures from a vowel corpus; output voice source location.)

In the first step, the usual SRP-PHAT algorithm is applied to compute the steered response power for every candidate location. The top n-best candidate locations are then detected for further computations. A simple smoothing method with a moving average is applied to minimize the effect of the serrated peaks surrounding the main peaks. The value of the output power for each location in the energy map is replaced by the mean of the neighboring output power values. The mean output power $\bar{P}$ is defined as follows:

$$\bar{P}(q_{a,e}) = \frac{1}{(2\theta + 1)^2} \sum_{a'=a-\theta}^{a+\theta} \sum_{e'=e-\theta}^{e+\theta} P(q_{a',e'}), \tag{4}$$

where $q_{a,e}$ is a point with azimuth $a$ and elevation $e$, and $\theta$ is the number of neighbors considered on each side. Fig. 3 shows an energy map smoothed by using (4). This simple smoothing scheme effectively helps identify multiple sources by discarding serrated peaks surrounding the main peaks.

In the second step, the microphone array is focused on the selected n-best candidate locations by using an adaptive beamforming method. An adaptive beamformer, such as a generalized sidelobe canceller (GSC) [17][18], boosts the signals from the desired locations and reduces the signals from other locations.
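The first step (steered power as in (1) and (2), smoothing as in (4), then n-best selection) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the speed of sound is taken as 343 m/s, the candidates lie on a single azimuth ring 2 m from the array (so the moving average runs over azimuth only rather than azimuth and elevation), the test signal is free-field white noise without reverberation, and picking the n largest local maxima is only one plausible way to choose the candidates, since the paper does not spell that rule out. The PHAT weight in (2) is applied by dividing each spectrum by its magnitude, which is equivalent to using $\Psi_{lk}$ in the double sum.

```python
import numpy as np

C = 343.0  # speed of sound in m/s (assumed value)


def srp_phat_map(frames, mic_xy, cand_xy, fs):
    """Steered response power with PHAT weighting for each candidate point.

    frames:  (M, n) time-domain frame per microphone
    mic_xy:  (M, 2) microphone positions in metres
    cand_xy: (Q, 2) candidate source positions in metres
    """
    n = frames.shape[1]
    spectra = np.fft.rfft(frames, axis=1)                 # X_m(w), shape (M, F)
    whitened = spectra / (np.abs(spectra) + 1e-12)        # PHAT: keep phase only
    omega = 2.0 * np.pi * np.fft.rfftfreq(n, d=1.0 / fs)  # angular frequency
    # tau[q, m]: direct travel time from candidate q to microphone m
    tau = np.linalg.norm(cand_xy[:, None, :] - mic_xy[None, :, :], axis=2) / C
    steer = np.exp(1j * tau[:, :, None] * omega[None, None, :])   # (Q, M, F)
    y = np.sum(whitened[None, :, :] * steer, axis=1)      # phase-align and sum, as in (1)
    return np.sum(np.abs(y) ** 2, axis=1)                 # power over frequency, as in (2)


def smooth_circular(power, theta):
    """Moving average over 2*theta+1 neighbouring azimuths with wrap-around."""
    return sum(np.roll(power, s) for s in range(-theta, theta + 1)) / (2 * theta + 1)


def n_best_peaks(power, n):
    """Indices of the n largest local maxima of a circular 1-D map."""
    left, right = np.roll(power, 1), np.roll(power, -1)
    peaks = np.flatnonzero((power > left) & (power >= right))
    return peaks[np.argsort(power[peaks])[::-1][:n]]


def simulate_frames(mic_xy, src_xy, fs, n, rng):
    """White-noise source delayed per microphone (free field, no reverberation)."""
    s = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    tau = np.linalg.norm(src_xy[None, :] - mic_xy, axis=1) / C
    delayed = s[None, :] * np.exp(-2j * np.pi * freqs[None, :] * tau[:, None])
    return np.fft.irfft(delayed, n=n, axis=1)


rng = np.random.default_rng(0)
fs, n = 16000, 2048                                   # 16 kHz, 128 ms frame
mic_az = np.arange(8) * 2.0 * np.pi / 8               # eight sensors, equal spacing
mic_xy = 0.25 * np.stack([np.cos(mic_az), np.sin(mic_az)], axis=1)
cand_deg = np.arange(0, 360, 5)                       # azimuth grid, 5-degree steps
cand_rad = np.deg2rad(cand_deg)
cand_xy = 2.0 * np.stack([np.cos(cand_rad), np.sin(cand_rad)], axis=1)

true_idx = 8                                          # source placed at 40 degrees
frames = simulate_frames(mic_xy, cand_xy[true_idx], fs, n, rng)
power = srp_phat_map(frames, mic_xy, cand_xy, fs)
smoothed = smooth_circular(power, theta=1)
candidates = n_best_peaks(smoothed, n=3)
print("SRP-PHAT peak (deg):", cand_deg[np.argmax(power)])
print("n-best candidates (deg):", cand_deg[candidates])
```

With a single clean source the unsmoothed peak falls on the true direction; smoothing can move the strongest candidate by one grid step, which is why the candidates are re-scored in the later steps rather than used directly.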
Fig. 3. A smoothed SRP-PHAT energy map. The output of SRP-PHAT (Fig. 1) was smoothed using (4).

In the third step, the voice similarity of the beamformed signal is evaluated. Since the proposed method targets situations where speech and background noise are captured simultaneously, it is crucial for the VAD algorithm to work reliably on a mixed signal. A VAD algorithm that relies on vowel sounds can operate well under these conditions [16]. Human vowel sounds have formants, which are distinctive spectral peaks that are likely to remain even after severe noise corruption [19]. However, non-relevant spectral peaks caused by noise corruption are major obstacles to utilizing these spectral peaks in noisy situations. Directly computing the similarity to pre-trained spectral peak templates effectively avoids the problem caused by non-relevant spectral peaks [16]. This makes it possible to detect the presence of speech signals even when noise is present at the same time. Thus, the characteristic spectral peaks of human vowels are utilized to compute voice similarity, and training data from several speakers are used to extract the characteristic spectral peaks [16].

The algorithm does not extract spectral peaks during recognition, but rather directly computes the similarity of the input spectrum to the pre-trained spectral peak signatures. The main idea is that if a spectral peak is present, the average energy of the spectral bands for that peak will be much higher than the average energy of the other bands. That is, the peak valley difference (PVD) will be higher. The positions of the spectral peaks are obtained during training and are stored as binary peak signatures consisting of values of 1 for spectral peak bands and 0 for the other bands. During training, similar spectral peak signatures can be clustered to reduce the computational overhead. The PVD is then used as a measure of voice similarity.
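The peak valley difference described above can be sketched as follows: the mean energy over the bands marked 1 in a binary signature minus the mean energy over the bands marked 0, with the best score over all stored signatures taken as the voice similarity. The function names, the eight-band toy spectrum, and the two signatures are hypothetical values chosen for illustration; real signatures would come from training on vowel data as in [16].

```python
import numpy as np


def pvd(spectrum, signature):
    """Peak valley difference: mean energy in peak bands minus mean energy elsewhere."""
    spectrum = np.asarray(spectrum, dtype=float)
    signature = np.asarray(signature, dtype=float)    # 1 = peak band, 0 = other band
    peak_mean = np.sum(spectrum * signature) / np.sum(signature)
    valley_mean = np.sum(spectrum * (1.0 - signature)) / np.sum(1.0 - signature)
    return peak_mean - valley_mean


def voice_similarity(spectrum, signatures):
    """Largest PVD over all stored binary peak signatures."""
    return max(pvd(spectrum, s) for s in signatures)


# Toy beamformed spectrum with energy concentrated in bands 1 and 4 (made-up values)
spectrum = [1.0, 9.0, 1.0, 1.0, 8.0, 1.0, 1.0, 1.0]
matching = [0, 1, 0, 0, 1, 0, 0, 0]       # signature whose peaks line up with the spectrum
mismatched = [1, 0, 0, 1, 0, 0, 0, 0]     # signature whose peaks fall on low-energy bands

print(pvd(spectrum, matching))            # peak mean 8.5, valley mean 1.0
print(pvd(spectrum, mismatched))          # peak mean 1.0, valley mean 3.5
print(voice_similarity(spectrum, [matching, mismatched]))
```

Because the signatures are fixed in advance, no peak picking is needed on the noisy input, which is the property the text relies on for robustness to spurious noise peaks.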
The similarity of a given binary spectral peak signature $S$ and the beamformed input spectrum $Y$ can be calculated as follows [16]:

$$\mathrm{PVD}(Y, S) = \frac{\sum_{k=0}^{N-1} Y[k] \, S[k]}{\sum_{k=0}^{N-1} S[k]} - \frac{\sum_{k=0}^{N-1} Y[k] \, (1 - S[k])}{\sum_{k=0}^{N-1} (1 - S[k])}, \tag{5}$$

where $N$ is the dimension of the spectrum. The similarity measurement is performed for every registered spectral peak signature, and the maximum value is taken as the spectral peak energy of location $q$:

$$\mathrm{PVD}(Y_q) = \max_{S} \mathrm{PVD}(Y_q, S). \tag{6}$$

Fig. 4 shows the energy map of the steered response voice power obtained using (6). Fig. 4 clearly shows that the VAD weights effectively boosted the peak from the speaker location while reducing the peak from the noise.

Fig. 4. A steered response voice power energy map obtained by combining (6) with the results shown in Fig. 3. In this figure, the values of the PVD are applied to all points for graphical illustration. In the actual algorithm, the PVD values are applied to the n-best candidate locations only.

The proposed method selects a point $\hat{q}$ that is associated with high steered energy and high voice similarity by combining the values obtained from SRP-PHAT with those obtained from the PVD. Since the two values have different ranges, a kind of normalization must be applied. The location, $\hat{q}$, of the speaker is determined by using a simple linear combination as follows:

$$\hat{q} = \arg\max_{q} \left( \frac{\bar{P}(q)}{\bar{P}_{\max}} + \frac{\mathrm{PVD}(Y_q)}{\mathrm{PVD}_{\max}} \right), \tag{7}$$

where $\bar{P}_{\max}$ is the maximum value of the steered mean output power and $\mathrm{PVD}_{\max}$ is the maximum of the PVD values. Unlike conventional SSL based on SRP-PHAT, the proposed method considers the content of the input signal, that is, the voice similarity together with its power, rather than looking for the maximum output energy of an input signal as in (3). Therefore, the location of the speaker can be effectively found even when the interfering noise is louder than the voice. The effectiveness of the proposed method in noisy environments is evaluated in the next section.

III. EXPERIMENTS

A. Simulation Data Experiments

In order to analyze the performance of the proposed method under various noisy environments, noisy speech data was created using the image method [20]. A circular microphone array of 25 cm radius with eight sensors was used. The proposed method is based on SRP-PHAT, and therefore the algorithm can be used for various other microphone array configurations as well [12][13]. The microphones were placed at equal intervals around the circular array. The array was located at (250 cm, 300 cm, 80 cm) in a room with dimensions of 600 cm × 500 cm × 240 cm, and the voice source was located at (250 cm, 500 cm, 80 cm). Noise sources were placed at the same distance at angles ranging from 0 to 180 degrees at intervals of 10 degrees (except 90 degrees), resulting in 18 different positions. Fig. 5 illustrates this configuration. The types of noise used were car, factory, channel, music, subway, train, white, and pink noise. The sampling rate was 16 kHz and the frame length was 128 ms. The performance was measured as the percentage of the estimated speaker locations that lie within ±5 degrees of the true voice source location.

Fig. 5. Locations of the microphone array, voice source, and noise used for the experiment with simulated data.

The performance of the speaker localization was analyzed for five different SNR levels (-5, 0, 5, 10, and 15 dB). Fig. 6 shows that the proposed method yielded better performance than SRP-PHAT at every SNR level. For the 0 dB SNR condition, conventional SRP-PHAT showed only 18.8% accuracy for speaker localization while the proposed method achieved an accuracy of 30.6% (an absolute error reduction of 11.8 percentage points). The proposed method showed a speaker localization accuracy of 49.2% when the SNR was 5 dB. When compared to conventional SRP-PHAT, the proposed method achieved an absolute error reduction of 12.6% on average.

Fig. 6. Speaker localization performance for SRP-PHAT and for the proposed method for five different levels of SNR (averaged over all types of noise).

Fig. 7 shows the localization performance for SRP-PHAT and for the proposed method under various types of noise at 0 dB. The proposed method can be seen to exhibit better performance than SRP-PHAT for all types of noise. The accuracy of SRP-PHAT was severely degraded in an environment with broadband noise such as white noise. This can be attributed to the fact that SRP-PHAT calculates the output power using only the phase information over all frequency bands. Most of the gains in performance came from the factory, channel, music, subway, white, and pink noises. For the car and train noise environments, relatively small improvements were made when compared to the other noises. It may be that some peak signatures from the car and train noises were very similar to those of some vowel sounds. If so, some low energy noise was boosted, causing a higher SRVP.

Fig. 7. Speaker localization performance for SRP-PHAT and for the proposed method for various types of noise at 0 dB.

Fig. 8. Speaker localization performance for SRP-PHAT and for the proposed method for various noise positions.

Fig. 8 illustrates the speaker localization performance for various noise positions. The result was not symmetrical around 90 degrees because the circular array was rotated slightly.
The overall performance of the proposed method can be seen to have increased when compared to that of the SRP-PHAT algorithm. As the angle between the voice and the noise locations increased, the gains in performance became larger. The relatively higher performance of SRP-PHAT between 80 and 100 degrees can be explained by the fact that the noise source was so close to the voice source that the sidelobes of the SRP-PHAT response from the noise also raised the steered energy at the voice location.

B. Real Data Experiments

The performance of the proposed method under real use was verified using actual sound data collected with the robot prototype shown in Fig. 9. The configuration of the microphones, the room dimensions, and the location of the microphone array were the same as those for the simulation data. It should be emphasized that, as noted for the simulation data experiments, the proposed method does not restrict the configuration of the microphone array to a circular form. Noise sources were placed at 0, 30, and 60 degrees at a distance of 200 cm. As for the simulation data, eight types of noise with five levels of SNR were used for the experiment.
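The accuracy figures in this section count an estimate as correct when it lies within ±5 degrees of the true direction, and the reported absolute error reduction is the difference between two such accuracies in percentage points. A minimal sketch of both calculations follows. The function name and the list of estimated angles are made up for illustration, and wrapping the angular error into [-180, 180) is an assumption of this sketch rather than something stated in the paper; the 90-degree true direction follows from the array and voice source coordinates given for the simulation.

```python
import numpy as np


def localization_accuracy(estimated_deg, true_deg, tolerance_deg=5.0):
    """Percentage of estimates within +/- tolerance_deg of the true azimuth."""
    est = np.asarray(estimated_deg, dtype=float)
    # Wrap the error into [-180, 180) so that 359 and 1 degrees count as 2 degrees apart
    err = (est - true_deg + 180.0) % 360.0 - 180.0
    return 100.0 * np.mean(np.abs(err) <= tolerance_deg)


# Hypothetical per-frame azimuth estimates for a voice source at 90 degrees
estimates = [88, 93, 95, 40, 270, 86, 91, 120]
accuracy = localization_accuracy(estimates, true_deg=90.0)
print(accuracy)   # five of the eight estimates fall inside the tolerance

# Absolute error reduction at 0 dB on simulated data, using the accuracies quoted in the text
srp_phat_acc, proposed_acc = 18.8, 30.6
print(round(proposed_acc - srp_phat_acc, 1))
```

Since error rate is 100 minus accuracy, a gain in accuracy and an absolute reduction in error rate are the same number.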

Fig. 9. A robot with a microphone array system that is used to record real environment sound data. The eight microphones around the shoulder area of the robot were used for the experiment.

Fig. 10 summarizes the performance of SRP-PHAT and of the proposed method for five different levels of SNR over real data. The results are similar to those of the simulation data in Fig. 6. For the condition with 0 dB SNR, SRP-PHAT showed a localization accuracy of 23.7% while the proposed method showed an accuracy of 42.9%, a reduction of 19.2 percentage points in the absolute error rate. Similar results were obtained for the 5 dB SNR condition, where the absolute error rate was reduced by 22.9%. Fig. 11 summarizes the performance for various types of noise over real data. The results are also similar to those of the simulation data: the increase in performance was larger for the factory, channel, music, subway, white, and pink noises, with a smaller increase for the train noise.

Fig. 10. Speaker localization performance for SRP-PHAT and for the proposed method for five different levels of SNR using real data.

Fig. 11. Speaker localization performance for SRP-PHAT and for the proposed method for various types of noise at 0 dB using real data.

IV. CONCLUSION

This paper proposed a robust speaker localization method that uses the voice similarity of the input signal instead of the simple output power of the beamformer. The proposed method uses SRP-PHAT to find several candidate locations and then uses GSC to enhance the signals coming from the top n candidate locations. The voice similarity of the enhanced signals is computed and combined with the steered response power. The final output is then interpreted as a steered response voice power, and the maximum SRVP location is selected as the speaker location.
The computational cost is relatively low because only the top n-best candidate locations are considered for GSC and voice similarity measurements. The experimental results showed that the proposed method significantly outperformed SRP-PHAT, a conventional sound source localization method, for very low SNR conditions where the noise signals have equal or higher energy than the voice signals. When compared to the conventional SRP-PHAT method, the proposed method achieved an absolute localization error reduction of 19.3% on average for real data environments with various kinds of noise. The proposed method can be used for interfaces based on spoken language in real environments where speech and noise can occur simultaneously. The increase in the accuracy of sound source localization allows location-based interactions between the user and various devices. For example, the camera of a smart doorbell system can be steered toward the visitor. The proposed method can also be used to increase the accuracy of speech-based interfaces. The naturalness and long-distance characteristics of speech can provide useful interfaces for various devices, including smart TVs and humanoid robots.
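As a compact recap of the final selection rule in (7), the sketch below normalizes the smoothed steered power and the PVD of each n-best candidate by their respective maxima, adds them, and picks the largest sum. The function name and the three candidate values are hypothetical, equal weighting of the two normalized terms is assumed, and the maxima are assumed to be positive.

```python
import numpy as np


def select_speaker(smoothed_power, pvd_scores):
    """Index of the candidate maximizing normalized power plus normalized PVD."""
    p = np.asarray(smoothed_power, dtype=float)
    v = np.asarray(pvd_scores, dtype=float)
    score = p / p.max() + v / v.max()     # each term is scaled so its maximum is 1
    return int(np.argmax(score)), score


# Three hypothetical n-best candidates: a loud noise source, a quieter talker, a sidelobe
power_nbest = [10.0, 8.0, 3.0]
pvd_nbest = [1.0, 4.0, 0.5]

best, score = select_speaker(power_nbest, pvd_nbest)
print(int(np.argmax(power_nbest)))   # index chosen by steered power alone
print(best)                          # index chosen by the combined score
```

In this toy case the loudest candidate wins on power alone, while the combined score moves the decision to the candidate whose spectrum looks most like a vowel, which is the behavior the method is designed to produce.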

REFERENCES

[1] Y. Oh, J. Yoon, J. Park, M. Kim, and H. Kim, "A name recognition based call-and-come service for home robots," IEEE Transactions on Consumer Electronics, vol. 54, no. 2, pp. 247-253, 2008.
[2] J. Park, G. Jang, J. Kim, and S. Kim, "Acoustic interference cancellation for a voice-driven interface in smart TVs," IEEE Transactions on Consumer Electronics, vol. 59, no. 1, pp. 244-249, 2013.
[3] K. Kwak and S. Kim, "Sound source localization with the aid of excitation source information in home robot environments," IEEE Transactions on Consumer Electronics, vol. 54, no. 2, pp. 852-856, 2008.
[4] A. Sekmen, M. Wilkes, and K. Kawamura, "An application of passive human-robot interaction: human tracking based on attention distraction," IEEE Transactions on Systems, Man, and Cybernetics - Part A, vol. 32, no. 2, pp. 248-259, 2002.
[5] Y. Cho, D. Yook, S. Chang, and H. Kim, "Sound source localization for robot auditory systems," IEEE Transactions on Consumer Electronics, vol. 55, no. 3, pp. 1663-1668, 2009.
[6] X. Li and H. Liu, "Sound source localization for HRI using FOC-based time difference feature and spatial grid matching," IEEE Transactions on Cybernetics, vol. 43, no. 4, pp. 1199-1212, 2013.
[7] T. Kim, H. Park, S. Hong, and Y. Chung, "Integrated system of face recognition and sound localization for a smart door phone," IEEE Transactions on Consumer Electronics, vol. 59, no. 3, pp. 598-603, 2013.
[8] C. Knapp and G. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-24, no. 4, pp. 320-327, 1976.
[9] R. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276-280, 1986.
[10] B. Mungamuru and P. Aarabi, "Enhanced sound localization," IEEE Transactions on Systems, Man, and Cybernetics - Part B, vol. 34, no. 3, pp. 1526-1540, 2004.
[11] V. Willert, J. Eggert, J. Adamy, R. Stahl, and E. Korner, "A probabilistic model for binaural sound localization," IEEE Transactions on Systems, Man, and Cybernetics - Part B, vol. 36, no. 5, pp. 982-994, 2006.
[12] J. DiBiase, "A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays," Ph.D. Dissertation, Brown University, 2000.
[13] J. DiBiase, H. Silverman, and M. Brandstein, "Robust localization in reverberant rooms," in Microphone Arrays: Signal Processing Techniques and Applications, M. Brandstein and D. Ward, Eds., Springer-Verlag, 2001, pp. 157-180.
[14] Y. Cho, "Robust speaker localization using steered response voice power," Ph.D. Dissertation, Korea University, 2011.
[15] J. Sohn, N. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1-3, 1999.
[16] I. Yoo and D. Yook, "Robust voice activity detection using the spectral peaks of vowel sounds," ETRI Journal, vol. 31, no. 4, pp. 451-453, 2009.
[17] O. Frost, "An algorithm for linearly constrained adaptive array processing," Proceedings of the IEEE, vol. 60, no. 8, pp. 926-935, 1972.
[18] L. Griffiths and C. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Transactions on Antennas and Propagation, vol. 30, no. 1, pp. 27-34, 1982.
[19] I. Yoo and D. Yook, "Automatic sound recognition for hearing impaired," IEEE Transactions on Consumer Electronics, vol. 54, no. 4, pp. 2029-2036, 2008.
[20] J. Allen and D. Berkley, "Image method for efficiently simulating small-room acoustics," Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943-950, 1979.

BIOGRAPHIES

Hyeontaek Lim received a B.S. degree in Computer Engineering from Yonsei University, and an M.S. degree in Computer and Communication Engineering from Korea University, Korea, in 2007 and 2010, respectively. He is currently in the Ph.D. program at the Speech Information Processing Laboratory in Korea University. His research interests are speech recognition for mobile devices and parallel speech recognition.

In-Chul Yoo received B.S. and M.S. degrees in computer science from Korea University, Seoul, Korea, in 2006 and 2008, respectively. He is currently pursuing the Ph.D. degree at the Speech Information Processing Laboratory in Korea University. His research interests include robust speech recognition and speaker recognition.

Youngkyu Cho received M.S. and Ph.D. degrees in computer science and engineering from Korea University, Korea, in 2002 and 2011, respectively. Currently, he is working for LG Electronics. His current research interests are acoustic modeling, speaker recognition, and sound source localization using a microphone array.

Dongsuk Yook (M'02) received B.S. and M.S. degrees in computer science from Korea University, Seoul, Korea, in 1990 and 1993, respectively, and a Ph.D. degree in computer science from Rutgers University, New Jersey, U.S., in 1999. He worked on speech recognition at IBM T.J. Watson Research Center, New York, USA, from 1999 to 2001. Currently, he is a professor in the Department of Computer Science and Engineering, Korea University, Seoul, Korea. His research interests include machine learning and speech processing.