Joint recognition and direction-of-arrival estimation of simultaneous meeting-room acoustic events

INTERSPEECH 2013

Joint recognition and direction-of-arrival estimation of simultaneous meeting-room acoustic events

Rupayan Chakraborty and Climent Nadeu
TALP Research Centre, Department of Signal Theory and Communications
Universitat Politècnica de Catalunya, Barcelona, Spain
{rupayan.chakraborty, climent.nadeu}@upc.edu

Abstract

Acoustic scene analysis usually requires several sub-systems working in parallel to carry out the various required functionalities. Moving towards a more integrated approach, in this paper we present an attempt to jointly recognize and localize several simultaneous acoustic events that take place in a meeting-room environment, by developing a computationally efficient technique that employs multiple arbitrarily-located small microphone arrays. Assuming a set of simultaneous sounds, for each array a matrix is computed whose elements are likelihoods along the set of classes and a set of discretized directions of arrival. MAP estimation is used to decide about both the recognized events and the estimated directions. Experimental results with two sources, one of which is speech, and two three-microphone linear arrays are reported. The recognition results compare favorably with the ones obtained by assuming that the positions are known.

Index Terms: acoustic event recognition, direction-of-arrival, multiple source separation, null steering beamforming, machine learning

1. Introduction

Acoustic scene analysis is a complex problem that requires a system encompassing several functionalities: detection (time), localization (space), separation, recognition, etc. Usually, these functionalities are assigned to different sub-systems. However, we can expect that an integrated approach, where all the functionalities are developed jointly, can offer advantages in terms of system performance.

On the other hand, time overlapping of events at the signal level is often a main source of classification errors in acoustic scene analysis. In particular, after the CLEAR 07 international evaluations, where acoustic event detection (AED) was carried out with meeting-room seminars, it became clear that time overlapping of acoustic events was responsible for a large portion of detection errors [1].

The detection of overlapping acoustic events may be dealt with at different levels: the signal level, the feature level, the model level, etc. In [2], a model-based approach was adopted for the detection of events in a meeting-room scenario with two sources, one of which is always speech, while the other one is a different acoustic event from a list of 11 pre-defined events. That approach is used in the current real-time system implemented in our smart-room, which includes both AED and acoustic source localization (ASL) [3]. However, the model-based approach is hardly feasible in scenarios where either the number of events or the number of simultaneous sources is large, since all the possible combinations of events have to be modeled.

In such cases, the problem can be tackled in alternative ways. In [4], we proposed an alternative signal-separation-based approach. It can easily work in real time by using multiple distributed linear microphone arrays composed of a small number of microphones. For each array, by assuming a set of P hypothesized source positions (e.g. provided by the ASL system), a set of P beamformers, based on a frequency-invariant null steering approach, was used to separate, up to some extent, each hypothesized source from the others.
Using those (partially) separated signals, acoustic event recognition was carried out by combining, with a maximum-a-posteriori (MAP) criterion, the likelihoods calculated from a set of HMM-GMM acoustic event models. Moreover, each hypothesized event was assigned to a given source position using the same framework.

In this paper, we aim to take a step further in the direction of the integrated approach mentioned above, by avoiding the assumption that the various acoustic source positions are known, and thus avoiding the need for a specific ASL sub-system. In fact, we present here a new technique as an attempt to jointly recognize and localize, in an unambiguous way, the classes and the positions of the N simultaneous sounds. Assuming that only the x-y plane is needed, this is done by discretizing, for each microphone array, the direction of arrival (DOA) with M angles, and building for each event class a sequence of posterior probabilities along the angle axis. In this way, for each array, i.e. for each multi-channel signal, we have a matrix whose elements are the likelihoods for a given class and a given angle. The hypothesized event classes are determined from that likelihood matrix by applying the MAP criterion. The angle for which the posterior of a given hypothesized class shows a minimum is taken as the estimated localization angle.

Experiments are carried out with the concrete meeting-room scenario mentioned above, using a database collected in our own smart-room. A machine-learning-based non-linear transformation from likelihoods to posteriors is used. The recognition results obtained with one array are comparable to the ones in [4], which resulted from assuming known source positions. The result of the DOA estimation for each array is also presented. Additionally, in the reported experiments it can be observed how the use of an additional array further improves the recognition performance.

2. Joint recognition and DOA estimation

In our approach, we aim to build for each microphone array a posterior matrix that contains information about both the identity of the acoustic events that are simultaneously present in the room and the direction of arrival of their acoustic waves to the array. Then, both the identities of the sounds and their DOAs are estimated with a MAP criterion. In our approach, the arrays can be located arbitrarily. Notice that, for deployment, this is an advantage with respect to using spatially structured array configurations.

As shown in Figure 1, at the front end of the proposed system, the multi-channel signal collected by each of the microphone arrays is driven to a set of null-steering beamformers (NSB). Each NSB places a null at a different value of the angular variable θ, which is discretized in M values that uniformly span the angle interval. Note that the vertical coordinate is not considered in this study. Feature extraction is then applied at the output of each beamformer, to subsequently compute a set of likelihood scores by using previously trained HMM-GMM models for the set of C acoustic event classes. Consequently, when K arrays are used, K×M×C likelihood scores are fed to the last block, where a MAP criterion is used to take the decision about the identities of the acoustic events E1, ..., EN and their directions of arrival θ1, ..., θN. Note that the number N of acoustic sources is hypothesized in this work.

Figure 1: Joint event recognition and localization system.

2.1. Signal separation with frequency invariant null steering beamforming

Null steering beamforming (NSB) allows us to design a sensor array pattern that steers the main beam towards the desired source and places nulls in the directions of interfering sources [5]. Given the broadband character of audio signals, in order to determine the beamformer coefficients we use a technique called frequency invariant beamforming (FIB). The method, proposed in [6], uses a numerical approach to construct an optimal frequency-invariant response for an arbitrary array configuration with a very small number of microphones, and it is capable of nulling several interfering sources simultaneously. As depicted in Figure 2, the FIB method first decouples the spatial selectivity from the frequency selectivity by replacing the set of real sensors with a set of virtual ones, which are frequency invariant. Then, the same array coefficients can be used for all frequencies. An illustrative example is shown in Figure 3; note how the null beams are rather constant along frequency. Indeed, in our case we cannot expect a perfect separation of the different mixed signals at the output of the NSB with this approach, since we use a small number of microphones per array, and also because of echoes and room reverberation.

Figure 2: Frequency invariant beamforming.

Figure 3: FIB beam patterns; left: nulls placed around -50 degrees, right: nulls placed around 45 degrees.
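As a concrete illustration of null steering (though not of the FIB design of [6], which is considerably more involved), the following minimal narrowband sketch computes minimum-norm weights for a 3-microphone linear array that keep unit gain towards a look direction and place a null at an interfering angle. The 20 cm spacing, the 1 kHz design frequency and all identifiers are assumptions made for the example.

```python
import numpy as np

def steering_vector(theta_deg, mic_pos, freq, c=343.0):
    """Far-field steering vector of a linear array; theta measured from broadside."""
    tau = mic_pos * np.sin(np.deg2rad(theta_deg)) / c   # per-microphone delays (s)
    return np.exp(-2j * np.pi * freq * tau)

def nsb_weights(theta_look, theta_nulls, mic_pos, freq):
    """Minimum-norm weights w with w^H a(theta_look) = 1 and w^H a(theta_null) = 0."""
    C = np.column_stack([steering_vector(theta_look, mic_pos, freq)] +
                        [steering_vector(t, mic_pos, freq) for t in theta_nulls])
    d = np.zeros(C.shape[1], dtype=complex)
    d[0] = 1.0
    w, *_ = np.linalg.lstsq(C.conj().T, d, rcond=None)   # solve C^H w = d
    return w

# Assumed geometry: 3 microphones 20 cm apart, design frequency 1 kHz
mics = np.array([-0.2, 0.0, 0.2])
w = nsb_weights(theta_look=0.0, theta_nulls=[45.0], mic_pos=mics, freq=1000.0)
print(abs(np.vdot(w, steering_vector(0.0,  mics, 1000.0))))   # ~1 (look direction)
print(abs(np.vdot(w, steering_vector(45.0, mics, 1000.0))))   # ~0 (nulled direction)
```

In the FIB framework the same constraint idea is applied to frequency-invariant virtual sensors, so that a single set of coefficients serves all frequencies.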
2.2. Acoustic event recognition

In this work we follow a detection approach that is based on classification. Since a silence class is used, when the system is running along time and outputs a non-silence hypothesis, it is decided that an event has been detected. Consequently, we will deal in this section with a classification problem. To determine the likelihoods, the acoustic events are modeled with hidden Markov models (HMM), and the state emission probabilities are computed with continuous-density Gaussian mixture models (GMM) [7].

Let us assume we have a set of N simultaneous events Ei, 1 ≤ i ≤ N, that belong to a set of C classes. For each of the K microphone arrays, there is a set of M beamformers, each one placing a null at a different angle θj. So there is a set of M output signals for each array and, after likelihood computations with the HMM-GMM models, we have an M×C matrix of likelihood scores, which can be seen as a set of C patterns along the angle variable. An example of such patterns for two different events is shown in Figure 4(a).

Figure 4: (a) Patterns of log-likelihood along the 11 models for two different events; (b) log-likelihoods along angles.
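The scores that fill this M×C matrix come from the HMM-GMM models. As an illustration of how a single emission likelihood is obtained, the following sketch evaluates the log-density of a sequence of feature frames under one diagonal-covariance GMM and sums it over frames; it deliberately ignores the HMM state and transition structure (full scoring would use the forward or Viterbi algorithm), and all shapes and names are assumptions made for the example.

```python
import numpy as np

def gmm_log_likelihood(frames, weights, means, variances):
    """Sum of per-frame log-densities of feature frames under a diagonal-covariance
    GMM (one HMM state's emission model); HMM decoding over states is omitted.
    frames: (T, D); weights: (G,); means, variances: (G, D)."""
    T, D = frames.shape
    # log N(x_t; mu_g, diag(var_g)) for every frame/component pair -> (T, G)
    diff2 = (frames[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :]
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    log_comp = log_norm[None, :] - 0.5 * diff2.sum(axis=2)
    # log-sum-exp over mixture components, then sum over frames
    per_frame = np.logaddexp.reduce(np.log(weights)[None, :] + log_comp, axis=1)
    return per_frame.sum()

# Illustrative call: a 32-component GMM over 32-dimensional FF-LFBE-like features
rng = np.random.default_rng(0)
w = np.full(32, 1.0 / 32)
mu = rng.normal(size=(32, 32))
var = np.ones((32, 32))
print(gmm_log_likelihood(rng.normal(size=(100, 32)), w, mu, var))
```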

Let us denote with Xk the multi-channel signal corresponding to the k-th array (notice that, to simplify notation, we do not consider time indices). We want to determine the posterior probability of a given class ci for the k-th array through all the NSBs. Note that our NSBs only separate the signals partially, so a class actually produced at the angle θj may still be observed in all the NSBs that do not place nulls at θj. We will assume that each angle θj has an associated prior probability p(θj). By using the product combination rule [8] (i.e. assuming the output signals of the beamformers are independent), we have

p(ci | Xk) = Σ_{j=1..M} p(ci | θj, Xk) p(θj)
           = Σ_{j=1..M} p(Xk | ci, θj) p(ci) p(θj) / p(Xk)    (1)

where p(Xk | ci, θj) is the likelihood of class ci for the multi-channel signal Xk after it goes through beamformer j, which is obtained from the corresponding HMM-GMM model. For combining the posterior probabilities from the various microphone arrays, we again use the product combination rule, so the optimal class co is obtained with

co = argmax_{ci} Π_{k=1..K} p(ci | Xk)    (2)

In the case of N simultaneous sources, and assuming they correspond to N different classes, the recognized identities of those classes are obtained by applying equation (2) N consecutive times, leaving out each time the class already recognized. As will be explained in sub-section 3.2, in this work we use a data-dependent likelihood-to-posterior transformation to compute the probabilities p(ci | θj, Xk) involved in the first line of equation (1).

2.3. Optimal DOA estimation

The optimal DOA θo^i of the i-th event source out of the N simultaneous sources is chosen according to

θo^i = argmin_{θj} p(θj | ci, Xk) = argmin_{θj} p(Xk | ci, θj) p(θj)    (3)

where the minimum is taken, rather than a maximum, because the beamformers place a null in the direction of the source position. Figure 4(b) shows an illustration of the variation of the likelihood scores along the angles; there is a minimum at a specific angle, which actually is the true DOA of the given class.
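A compact sketch of the decision block defined by equations (1)-(3) is given below, assuming the K×M×C likelihood scores have already been computed; the uniform priors, array shapes and every identifier are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def map_decision(lik, class_priors=None, angle_priors=None):
    """lik[k, j, i] = p(X_k | c_i, theta_j): likelihood of class i at the output of
    the beamformer of array k whose null is placed at angle theta_j.
    Returns the MAP class index (eqs. 1-2) and, for that class, the per-array DOA
    index chosen according to eq. (3)."""
    K, M, C = lik.shape
    p_c = np.full(C, 1.0 / C) if class_priors is None else class_priors
    p_th = np.full(M, 1.0 / M) if angle_priors is None else angle_priors

    # Eq. (1): per-array class posterior, marginalising over the discretized angles;
    # the p(X_k) normalizer is dropped because it does not affect the argmax.
    post = np.einsum('kmc,c,m->kc', lik, p_c, p_th)

    # Eq. (2): product combination over arrays (sum of logs for numerical stability).
    c_opt = int(np.argmax(np.log(post + 1e-300).sum(axis=0)))

    # Eq. (3): the DOA is the angle whose null-steering beamformer yields the
    # *minimum* weighted likelihood for the recognized class.
    doa_idx = np.argmin(lik[:, :, c_opt] * p_th[None, :], axis=1)
    return c_opt, doa_idx

# Illustrative call: 2 arrays, 36 discretized angles, 12 classes of random scores.
rng = np.random.default_rng(1)
print(map_decision(rng.random((2, 36, 12))))
```

For N simultaneous sources, the same routine would be applied N times, each time leaving out the class recognized in the previous pass, as described above.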
3. Experiments

In our experimental work, we consider a meeting-room scenario with a predefined set of 11 acoustic events plus speech [1-3]. As in [3], we assume that there may simultaneously exist either 0, 1 or 2 events and that, in the latter case, one of the events is always speech. The reported experiments correspond to the case of two overlapped events, since it is the most general one. Consequently, speech is always present and only the events need to be recognized.

3.1. Meeting room acoustic scenario and database

Figure 5 shows the Universitat Politècnica de Catalunya (UPC) smart-room, with the positions of its six T-shaped 4-microphone arrays on the walls. We use only the linear sub-arrays of 3 microphones in our experiments. For training, development and testing of the system we have used, as in [3], part of a publicly available multimodal database recorded in the UPC smart-room. Concretely, we use 8 recording sessions of audio data which contain isolated acoustic events. The approximate source positions of the acoustic events (AE) are shown in Figure 5. Each session was recorded with all six T-shaped microphone arrays.

Figure 5: Smart-room layout, with the positions of microphone arrays (T-i), acoustic events (AE) and speaker (SP).

The overlapped signals used for development and testing of the systems were generated by adding the AE signals recorded in the room to a speech signal, also recorded in the room, both from all 24 microphones. To do that, for each AE instance, a segment of the same length was extracted from the speech signal starting at a random position, and added to the AE signal. The mean power of the speech was made equal to the mean power of the overlapping AE. That addition of signals produces an increase of the background noise level, since it is included twice in the overlapped signals; however, going from isolated to overlapped signals the SNR reduction is slight: from 18.7 dB to 17.5 dB. Although in our real meeting-room scenario the speaker may be placed at any point in the room, in the experimental dataset the speaker position is fixed at a point on the left side (SP in Figure 5). All signals were recorded at a 44.1 kHz sampling frequency and further converted to 16 kHz.

3.2. Event recognition

The proposed event recognition system consists, at its front end, of a set of frequency-invariant beamformers that span all the angles in the room. The beamformers are designed to work with the horizontal row of 3 microphones that each array in the smart-room has. With so few microphones, the beamformers are expected to have wide lobes and the sources to be less well separated; on the other hand, this keeps the system computationally efficient.

In the feature extraction block of the multi-array signal-separation-based system depicted in Figure 1, a set of audio spectro-temporal features is computed for each signal frame. The frames are 30 ms long with a 20 ms shift, and a Hamming window is applied. We have used frequency-filtered log filter-bank energies (FF-LFBE) for the parametric representation of the spectral envelope of the audio signal [9]. For each frame, a short-length FIR filter with transfer function z − z^{-1} is applied to the log filter-bank energy vectors, and the end-points are taken into account. Here we have used 16 FF-LFBEs along with their 16 first temporal derivatives, where the latter represent the temporal evolution of the envelope. Therefore, the dimension of the feature vector is 32.
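To make the FF-LFBE parameterization concrete, here is a minimal sketch that applies the z − z^{-1} filter along the frequency axis of pre-computed log filter-bank energies and appends simple frame-difference derivatives. The zero-padding at the band edges and the delta scheme are plausible choices of ours, not necessarily those of [9].

```python
import numpy as np

def ff_lfbe(log_fbe):
    """Frequency-filtered log filter-bank energies plus first temporal derivatives.
    log_fbe: (n_frames, n_bands) log filter-bank energies (16 bands in the paper).
    The filter z - z^{-1} along the frequency axis amounts to E(k+1) - E(k-1);
    zero-padding at the band edges is one simple way to keep the end-points."""
    padded = np.pad(log_fbe, ((0, 0), (1, 1)), mode='constant')
    ff = padded[:, 2:] - padded[:, :-2]          # E(k+1) - E(k-1), same width as input
    delta = np.zeros_like(ff)                    # simple frame difference as the
    delta[1:] = ff[1:] - ff[:-1]                 # temporal derivative (a stand-in
    return np.hstack([ff, delta])                # for regression-based deltas)

# Illustrative call with random 16-band log energies for 100 frames.
feats = ff_lfbe(np.random.randn(100, 16))
print(feats.shape)    # (100, 32)
```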

The HTK toolkit is used for training and testing the HMM-GMM system [10]. There is one left-to-right HMM with three emitting states for each AE and for silence, and 32 Gaussian components with diagonal covariance matrix are used per state. Each HMM is trained with the standard Baum-Welch algorithm using mono-event signals from one microphone of a particular array. The state emission probabilities are computed with continuous-density GMMs. For each array and angle, the likelihoods are computed using the same set of AE (including speech) and silence models. This approach actually introduces a mismatch between training and testing conditions, which is a source of classification errors. Therefore, to compensate for that mismatch, in the decision block we have employed a machine-learning-based non-linear transformation technique that is common to all classes. It is trained, in a supervised way, with the likelihoods obtained from the separated signals (the NSB outputs). We have used a multi-layer feed-forward neural network (NN) with a back-propagation training algorithm. The NN consists of three layers: input, hidden and output. The number of hidden nodes is optimized through cross-validation. The tan-sigmoid transfer function is used at the outputs of the hidden and output layers, and a fast scaled-conjugate-gradient training algorithm is used [11]. At the output of the NN, we apply the MAP criterion according to (1) and (2). In our experiments, all the angles are assigned flat prior probabilities.

The testing results are obtained with all 8 sessions (S01-S08) using a leave-one-out criterion, i.e. we recursively keep one session for testing while the other 7 sessions are used for training. Table 1 shows a performance comparison of the proposed system (System2) with the previous one (System1), averaged over all 8 testing datasets, for two different arrays (T4 and T6) and their combination (T4+T6), as in [4]. It has to be mentioned that both the acoustic event and the speech sources are physically well separated in the room for these two arrays. System1 was designed with the assumption that the two source positions are known and thus uses two beamformers, whereas the proposed System2 does not require that assumption about the source positions. Moreover, in System1 the HMMs were trained using the separated signals, whereas in System2 we train them using mono-event signals. From the results in Table 1, it is clear that System2 works better than System1. For both systems, array T4 gives a better result than T6, and the system that combines the two arrays produces an even higher AED accuracy, as expected.
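The likelihood-to-posterior transformation described above can be sketched with an off-the-shelf feed-forward network. The example below uses scikit-learn's MLPClassifier with tanh hidden units; its 'lbfgs' optimizer stands in for the scaled-conjugate-gradient training of [11], which scikit-learn does not provide, and the data shapes and hidden-layer size are placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical training set: each row is the vector of HMM-GMM likelihood scores
# obtained at one NSB output, and the label is the true event class.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(500, 12))          # 12 model scores per example (placeholder)
y_train = rng.integers(0, 12, size=500)       # e.g. 11 AE classes plus speech

# One hidden layer with tanh units; 'lbfgs' replaces the scaled conjugate gradient
# optimizer of the paper. The hidden-layer size would be chosen by cross-validation,
# as described in Section 3.2.
net = MLPClassifier(hidden_layer_sizes=(32,), activation='tanh',
                    solver='lbfgs', max_iter=1000)
net.fit(X_train, y_train)

# At test time, predict_proba yields class posteriors that play the role of
# p(c_i | theta_j, X_k) in the first line of equation (1).
posteriors = net.predict_proba(rng.normal(size=(1, 12)))
print(posteriors.sum())    # rows sum to 1
```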
3.3. DOA estimation of overlapped events

The hypothesized classes from the recognizer are used to localize the sources in terms of DOA estimation. This is performed in the decision block with the scores from the HMM-GMM likelihood calculators. Since the HMM-GMM models are generated from the mono-event signals instead of the separated signals, more variation is expected in the likelihood scores along the angle, and that choice should consequently help to produce a better estimation. The optimal DOAs of the events for each array are obtained by using (3). Here also, we consider flat prior probabilities p(θj) for all angles.

To test the performance of the localization system, we use the normalized root mean squared error of the direction of arrival (RMSE_DOA), given by the following equation:

RMSE_DOA = sqrt( (1/N_e) Σ_{i=1..N_e} (θ_i^test − θ_i^ref)² )    (4)

where θ_i^test is the estimated DOA for an event i, θ_i^ref is its reference DOA, and N_e is the total number of event samples in the testing session. The reference DOA for each event class is taken from visual inspection during the recording of the signal. In our experiments, the null beam width is always kept constant (9 degrees). The testing results for DOA estimation are obtained using all 8 sessions (S01-S08) with a leave-one-out criterion. Table 2 shows the DOA estimation results obtained with the proposed metric (4), averaged over all 8 testing datasets, for the two arrays (T4 and T6).

Table 1. Performance comparison of different recognition systems.

Accuracy (%)   T4      T6      T4+T6
System1        79.18   77.84   81.83
System2        83.12   81.41   83.93

Table 2. Source localization results.

            T4      T6
RMSE_DOA    2.12    2.68

4. Conclusions

In this paper, we have presented a combined approach for the recognition and localization of simultaneously occurring meeting-room acoustic events. For recognition, a computationally efficient beamforming-based source separation technique followed by an HMM-GMM-based likelihood computation has been presented, where the estimation is done with a MAP criterion after applying a data-dependent non-linear transformation. The proposed system does not require any information about the event source positions: by using the hypothesized outputs of the recognizer, it is also able to localize the acoustic events in terms of DOA estimation, thus avoiding the need for an external localization system. Future work will be devoted to using the full set of existing linear arrays in the smart-room.

5. Acknowledgements

This work has been supported by the Spanish project SARAI (TEC2010-21040-C02-01).

6. References

[1] A. Temko, C. Nadeu, D. Macho, R. Malkin, C. Zieger, and M. Omologo, "Acoustic event detection and classification," in Computers in the Human Interaction Loop, A. Waibel and R. Stiefelhagen, Eds., Springer, pp. 61-73, 2009.
[2] A. Temko and C. Nadeu, "Acoustic event detection in meeting-room environments," Pattern Recognition Letters, vol. 30, no. 14, pp. 1281-1288, Elsevier, 2009.
[3] T. Butko, F. Gonzalez Pla, C. Segura, C. Nadeu, and J. Hernando, "Two-source acoustic event detection and localization: online implementation in a smart-room," Proc. EUSIPCO, Barcelona, Spain, 2011.
[4] R. Chakraborty and C. Nadeu, "Real-time multi-microphone recognition of simultaneous sounds in a room environment," Proc. ICASSP, Vancouver, Canada, 2013.
[5] O. Hoshuyama and A. Sugiyama, "Robust adaptive beamforming," in Microphone Arrays: Signal Processing Techniques and Applications, M. Brandstein and D. Ward, Eds., New York: Springer, 2001.
[6] L. C. Parra, "Steerable frequency-invariant beamforming for arbitrary arrays," Journal of the Acoustical Society of America, vol. 119, no. 6, pp. 3839-3847, June 2006.
[7] L. Rabiner and B. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.
[8] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, 2004.
[9] C. Nadeu, D. Macho, and J. Hernando, "Frequency & time filtering of filter-bank energies for robust HMM speech recognition," Speech Communication, vol. 34, pp. 93-114, 2001.
[10] S. Young et al., The HTK Book (for HTK Version 3.2), Cambridge University, 2002.
[11] M. F. Moller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks, vol. 6, pp. 525-533, 1993.