PhD Dissertation

International Doctorate School in Information and Communication Technologies
DIT - University of Trento

Distributed Microphone Networks for sound source localization in smart rooms

Alessio Brutti

Advisor: Prof. Maurizio Omologo
ITC-irst Centro per la Ricerca Scientifica e Tecnologica

March 2007

Abstract

This PhD thesis describes a research activity on distributed networks of sensors for acoustic source localization in enclosures. The problem can be addressed through so-called ubiquitous computing, as envisioned in the CHIL project framework, in an effort to make the users unaware of the underlying audio processing. A specific goal of this research is hence to design an algorithm able to estimate the position of active speakers during meetings and seminars held in rooms which are quite critical from an acoustic point of view. As a matter of fact, reflections and scattering by objects and walls critically affect the quality and characteristics of the signals received by far microphones. The problem is tackled by distributing a set of microphones in the given room in order to always ensure good coverage of the concerned area. This approach requires devising an algorithm able to handle all the inputs, possibly discordant, delivered by the network of sensors. Estimation of the time delay of arrival based on generalized cross-correlation has proved to be an effective approach to the sound source localization task. A novelty introduced in this doctoral research is the attempt to characterize the source orientation by exploiting the shape of the Global Coherence Field around a candidate source position. Once the orientation is determined, microphone contributions can be merged more efficiently, with emphasis on frontal microphone pairs. A further effort was devoted to investigating an approach which characterizes the source position and orientation on the basis of previously calculated acoustic models. In this way it is possible to also take advantage of the reverberation pattern and to deal with situations in which a direct path from source to receiver is not available.

Keywords

Sound source localization, distributed microphone network, sound source orientation, global coherence field, oriented global coherence field, sound map classification.

Acknowledgments

A work that requires as huge an amount of effort as a PhD thesis does is always a milestone in one's life. And once you have completed it and look back at the last years, you suddenly realize that you would never have got your work done without the help of many people, both from a technical and a human point of view. The first thanks are deserved by my girlfriend Silvia, for sure. She played a primary role when I made the choice to enroll in this PhD course and always supported and encouraged me when depression was going to prevail and I was about to give up. Thank you very much Silvia, you are the best. I wish of course to express my gratitude to my parents, who spared no effort to make my work as easy as possible. I know that I can always trust them. My special thanks go to my tutor Maurizio Omologo for guiding me through my research with his helpful comments and advice. I would also like to thank Piergiorgio Svaizer for always providing valuable and punctual hints. How could I forget my officemate Christian Zieger for his technical and human support. I wish also to thank all the members, both current and former, of the SSI division, particularly of the SHINE project, for creating a nice work environment. Among them, my companion of many travels in Germany, Luca Cristoforetti, deserves a special mention. The last thanks are reserved for my friends, in particular Campa and Pule. Anything gets easier when you have a bunch of friends who help you forget your professional problems for a night.

Contents

1 Introduction
1.1 The Context
1.2 The Problem
1.2.1 The CHIL Project
1.2.2 Speaker Localization in CHIL
1.3 The Solution
1.4 Innovative Aspects
1.5 Structure of the Thesis

2 The Sound Source Localization Problem
2.1 Problem Formulation
2.2 Signal Modeling
2.3 Time Difference of Arrival
2.3.1 TDOA estimation
2.4 Direct Approaches
2.5 Particle Filtering
2.6 Statistical Approaches
2.7 Source Orientation

3 Crosspower Spectrum Phase Analysis
3.1 Definition
3.2 Reverberation
3.3 Signal Spectrum
3.4 Sampling Problem
3.5 Sound Activity Detection

4 Distributed Microphone Networks
4.1 DMN in CHIL
4.2 Microphone Deployment
4.3 The Proposed Approach

5 The Global Coherence Field
5.1 Definition
5.1.1 Spatial Sampling
5.2 Coherence Measure
5.3 SSL based on GCF
5.3.1 Sound Maps
5.3.2 Localization Performance
5.4 Real-time Demo
5.5 Multiple Speakers
5.6 Multimodal System

6 The Oriented Global Coherence Field
6.1 Definition
6.2 OGCF Computation
6.3 SSL based on OGCF
6.4 Parameter Selection
6.4.1 Real Data Experiments
6.4.2 Simulated Data Experiments
6.5 Human Speaker

7 Speaker Localization Benchmarking
7.1 Data Collection
7.2 References
7.3 Evaluation Metrics
7.3.1 The Evaluation Tool
7.4 Implementation of the SSL system
7.5 Evaluation
7.5.1 Results

8 Example Based Approach
8.1 Proposed Approach
8.2 Experimental Results
8.2.1 Simulated data collection
8.2.2 Real data collection
8.2.3 Evaluation metrics
8.2.4 Evaluation results

9 Conclusions and Future Work
9.1 Discussions and Conclusions
9.2 Future Work

Bibliography

A Oriented Image Method

B DMN and Hardware
B.1 ITC-irst CHIL Room
B.2 Hardware

C Evaluation

List of Tables

7.1 Results obtained applying different localization systems to the NIST RT-05 Spring Evaluation test set.
8.1 Classification performance on the simulated training set. L1 and L2 refer to the two proposed distance measures. The same set is used for both training and test.
8.2 Results obtained on the real-talker data with RR set to 27% using the models computed with white noise.
8.3 Results obtained on the real-talker data with RR set to 27% using the models computed with speech signals.
8.4 Results obtained with the loudspeaker reproducing speech using models computed with white noise. The RR is set to about 27%.
C.1 Example of a localization output: for some time instants the localization algorithm does not provide any estimate because either there is no speaker or the confidence level is too low.
C.2 Example of a reference file: every 100 ms the reference file indicates the number of active speakers, the number of ongoing noises, and the ID and coordinates of the speaker if one and only one is active.
C.3 Example of an evaluation output: the upper part classifies each time instant in one of the cases foreseen by the metrics introduced in Chapter 7; the lower part summarizes the performance.

List of Figures

1.1 Speech processing benefits from both the digital signal processing and spoken language processing areas. Approaches to the speaker localization problem use traditional signal processing algorithms tailored to the characteristics of spontaneous speech signals.
2.1 Example of a multi-path acoustic channel.
2.2 Example of an impulse response in a multi-path environment. The first impulse accounts for the direct path, while the rest of the impulses represent the effects of reverberation.
2.3 Locus of points which satisfy equation 2.7 for a given absolute time delay at two microphones m_1 and m_2. The sign of the time delay identifies one of the two sheets.
2.4 Locus of points that solve equation 2.7 restricted to a plane and approximated by a straight line.
2.5 Relation between DOA and TDOA in a two-sensor set-up when the source is in a far-field position. Dashed lines represent the incoming sound waves, assumed to be planar.
2.6 Spherical sound waves produced by a radiating source in near-field condition impinge on two microphones.
2.7 Scheme of a delay-and-sum beamformer with N_s input signals.
2.8 Rough shape of the head radiation pattern at two different frequencies.
3.1 Example of CSP measured when the sound source is frontal to the sensors at about 2.5 m distance. Note the sharp peak at a time lag of 1 sample, which corresponds to the actual TDOA.
3.2 Example of CSP for a sound source aiming 45 degrees to the right with respect to the microphones. Note that besides the main correct peak, some secondary peaks due to reflected paths arise on the left of the figure.
3.3 Example of CSP computed when the sound source is aiming in the opposite direction with respect to the microphones (left) and when it is frontal (right).
3.4 Spectrogram and CSP derived from a fricative-vowel speech sequence.
3.5 Trend of the CSP main peak (above) with respect to a recorded speech signal (below).
4.1 T-shaped microphone array geometry.
4.2 Map of the CHIL room available at ITC-irst laboratories.
4.3 Map of the CHIL room available at Karlsruhe University laboratories.
4.4 Influence of noisy TDOA measurements on the final position estimate applying triangulation and two different microphone deployments.
5.1 Scheme of the computation of GCF for a given point p restricted to the i-th and (i+1)-th microphone pairs.
5.2 A loudspeaker is located in front of a microphone pair but is oriented towards a corner. The main part of the radiated energy reaches the microphones after a reflection, while the energy traveling along the direct path is lower due to the loudspeaker radiation pattern.
5.3 CSP computed when a loudspeaker located in front of a microphone pair is rotated 45° to the left with respect to the microphones, as illustrated in figure 5.2.
5.4 The picture on the left shows where the loudspeaker was located during the database collection with respect to the ITC-irst DMN. The picture on the right shows the orientations investigated.
5.5 Example of GCF based on TDOA estimations adopting a normal gaussian loss function. The source is located in the upper right corner.
5.6 Example of GCF_LS. The source is located in Pos3.
5.7 Example of GCF_CSP. The source is located in the upper right corner.
5.8 Example of GCF_exp. The source is in the middle of the room. Notice the fake peak generated by reflections.
5.9 Example of GCF_LS. The source is in the middle of the room.
5.10 Example of GCF based on C_CSP when the source is in Pos1.
5.11 RMS error of the localization estimates for each orientation investigated adopting GCF. The source is in Pos1.
5.12 RMS error restricted to the x coordinate when the loudspeaker is aimed at different directions.
5.13 RMS error restricted to the y coordinate when the loudspeaker is aimed at different directions.
5.14 Source position estimates obtained by picking the maximum of GCF. The green circle indicates the actual source position.
5.15 Overall RMS error, including all the orientations, for each position investigated.
5.16 Source position estimates restricted to the x coordinate. The source is in Pos1 and the orientation is 45°.
5.17 RMS error for each orientation investigated when the input signal is speech and white noise and the source is in Pos1.
5.18 Example of a CSP function regularized around the time delay −9 samples. The maximum in the range [−14 : −4] corresponds to −7 samples.
6.1 Graphical representation of the OGCF computation scheme. In this case 6 microphone pairs are available and 4 possible orientations are investigated.
6.2 Scheme of the computation of OGCF for a given point p restricted to the i-th and (i+1)-th microphone pairs.
6.3 Example of OGCF(p̂, d) computed with N = 64 and M = 21 in the CHIL room available at ITC-irst laboratories.
6.4 Sound map computed with M-OGCF when the loudspeaker is in the upper right corner (Pos3) and is aimed at the opposite corner (315°).
6.5 Sound map computed with M-OGCF when the loudspeaker is in the middle of the room (Pos1) and is aimed at the bottom right corner (45°).
6.6 RMS estimation error with different parameter configurations when a Tannoy loudspeaker is in the central position and reproduces white noise.
6.7 RMS estimation error with different parameter configurations when a Yamaha low-directionality loudspeaker in the central position reproduces white noise.
6.8 Percentage of correct orientation estimations with different parameter configurations. The acceptance threshold is set to π/14 rad.
6.9 Performance of a source position estimation algorithm based on OGCF with different parameter configurations.
6.10 Source orientation performance on simulated data expressed in terms of RMS error. Results obtained with different parameter configurations are reported.
6.11 RMS localization error on simulated data with different parameter configurations.
6.12 Outline of the speaker position and the 5 orientations adopted in the experiment.
6.13 Orientation error as a function of time.
7.1 Map of the CHIL room available at Karlsruhe University laboratories. The T-shaped arrays exploited for localization purposes are indicated as A-Array, B-Array, C-Array, D-Array.
7.2 Examples of outputs of the localization system for the x coordinate assuming a 100 ms time resolution in the reference file: SAD is the bilevel information of the Speech Activity Detector, REF is the reference transcription of the x coordinate for time frames labeled as one speaker, OUTPUT shows the results of the localization system in the case of output at a higher frame rate than 10 Hz, in the case of output at 10 Hz, and in cases of deletion and false alarm, respectively.
8.1 GCF sound map when the source is in the upper right corner (Pos3) and aims at the corner (135°). The sound map is computed with the sensor set-up available in the ITC-irst CHIL room.
8.2 Graphical representation of the proposed classification method to estimate the audio source position and orientation.
8.3 The left part of the figure represents the positions and orientations of the sound source that were investigated in the simulated data collection. The right part refers to the real data acquisition.
8.4 Overall orientation error rate for different similarity measures and maps with the simulated test set.
8.5 Overall position error rate for different similarity measures and maps with the simulated test set.
A.1 Image method: the figure on the left explains how a reflection can be substituted with an image source; the right part depicts an example of first-order images.
A.2 Example of a cardioid-like radiation pattern.
B.1 Map of the DMN available at ITC-irst laboratories.
B.2 Partial view of the DMN available at ITC-irst laboratories.
B.3 Block diagram of the acquisition chain implemented at ITC-irst laboratories for audio data recordings within the CHIL project.

Chapter 1

Introduction

1.1 The Context

This PhD dissertation addresses the problem of estimating the position of active speakers in rooms equipped with a network of distributed microphones. Localization of talkers belongs to the wide area of digital speech processing, which in turn comprises both digital signal processing and spoken language processing, as illustrated in figure 1.1. Speech processing tailors algorithms generically developed for signal processing to the specific task of handling speech signals produced by human beings. As a consequence, speech processing includes many research disciplines such as computer science, acoustics, psychology and linguistics. The objective of sound source localization is to estimate the position of an active sound source from a set of acoustic measurements. In the particular case addressed in this thesis, the acoustic measurements correspond to sequences of digitized samples derived from audio wavefronts impinging on a set of distant microphones placed in an enclosure. Due to audio wave propagation, signals acquired by microphones are strongly degraded by reflections and environmental noise. The adoption of a microphone network for estimating the position of active sound sources is a rather new and challenging task, as most research activities in the literature refer to linear or compact microphone arrays.

Figure 1.1: Speech processing benefits from both the digital signal processing and spoken language processing areas. Approaches to the speaker localization problem use traditional signal processing algorithms tailored to the characteristics of spontaneous speech signals.

Many practical applications could benefit from accurate knowledge of the position of the active source: applications for domestic environments and surveillance systems; beamformers and other speech enhancement techniques such as blind source separation, speech activity detection or pitch estimation; automatic camera steering for video conferencing systems; and applications for speaker identification and verification.

1.2 The Problem

1.2.1 The CHIL Project

This research activity was conducted at ITC-irst (Istituto Trentino di Cultura, Centro per la Ricerca Scientifica e Tecnologica) within the EU-funded CHIL project (Computers in the Human Interaction Loop; further details can be found at http://chil.server.de), whose applicative scenarios are meetings and seminars held in rooms equipped with networks of audio and video sensors. The goal of the project is to create an automatic system able to serve users in an implicit and unobtrusive manner.

In this way users are allowed to focus on interacting with other humans/users rather than with computers. In order to comply with this constraint, ubiquitous computing, where machines and sensors fade into the background, must be achieved. The foreseen contexts are seminars, where a main speaker gives his/her lesson and a small audience interacts with the talker, and meetings, where a relatively small group of persons is involved and interacts intensively. With this aim, the front-end audio modules are charged with extracting any useful information that may allow the upper levels to recognize and characterize everything that is happening in a given room. The position of the active speakers is one of the most important pieces of information, as it can be fruitfully exploited by other audio processing modules, e.g. a speech recognizer, or by higher-level tools.

1.2.2 Speaker Localization in CHIL

The project's aims entail a set of constraints on the localization problem, which turns out to be rather different from traditional formulations. Since the users must be unaware of the sensors and of the underlying information processing, they are left free to roam in the room. As a consequence, the localization system cannot rely on compact microphone arrays, nor can it assume that users are frontal or cooperative. Another noteworthy difference from traditional formulations of the localization problem is that the envisioned system must work in real environments, dealing with all the interferences which can occur during a real meeting of people in a room. Several assumptions commonly exploited in the acoustic localization community are not always valid, as for instance: white environmental noise, gaussian distribution of estimation errors, source independence in the multi-speaker case, etc. Also the peculiarities of spontaneous speech, in terms of dynamics, pauses and spectral features, are an aspect to take care of, as they may affect the clarity of the recorded signals and their applicability to traditional signal processing techniques.

Besides these particular matters, traditional issues related to the localization problem in the presence of a directional sound source must be dealt with. Reflections and scattering strongly affect the quality of the signals impinging on the sensors and considerably complicate the estimation of the point where the sound was emitted.

1.3 The Solution

The solution adopted in this thesis to tackle the problem of estimating the position of the active speaker in the CHIL scenario makes use of a network of sensors spread around the room without any geometrical constraint. In this way the user has the maximum freedom of movement, since the sensor network guarantees good spatial coverage whatever the talker's position and orientation. Unfortunately, adopting a so-called Distributed Microphone Network (DMN) introduces the further problem of efficiently handling and merging the redundant information provided by a large set of sensors. This thesis focuses on this issue of merging contributions, computed with traditional state-of-the-art algorithms for sound source localization, when clusters of microphones are deployed all around a room. There is hence no commitment to improving either traditional time delay of arrival approaches or coherence measure techniques. The concern is instead on improving single-frame localizations, without any smoothing obtained with Kalman or particle filtering. This dissertation suggests handling a DMN by means of an extension of the Global Coherence Field (GCF). In particular, the proposed approach exploits the knowledge of the speaker orientation in order to inherently select contributions from sensors with a higher confidence measure.

With this purpose we introduce the concept of the Oriented Global Coherence Field (OGCF), which delivers more robust and accurate localization outputs and can also perform an estimation of the source orientation. Both GCF and OGCF turn out to clearly represent the acoustic characteristics of an environment by means of sound maps. However, the new approach still presents some weak points, mainly ascribable to the presence of blind areas in the room. Due to the limited number of available microphones, there are anyway some configurations of the source for which reflected wavefronts are predominant at all the microphones. In the final part of this thesis, we present an alternative approach that tries to handle these uncovered areas by applying a pattern classification framework. Although a microphone gathers only reflected wavefronts, the pattern of reflections carries cues about the source position. The proposed method learns the reflection pattern for a set of predefined source positions and orientations and then recognizes the position of the source by comparing the new input audio signals with the models. The algorithm uses sound map representations, computed as GCF or OGCF, to encode the reflections occurring in a given source configuration.

1.4 Innovative Aspects

Besides the adoption of a DMN instead of linear arrays, the main innovative aspect of this thesis is the attempt to estimate and exploit the speaker orientation in an effort to improve the robustness and accuracy of a localization system in reverberant and noisy environments. Most experiments in the literature are carried out assuming either an omnidirectional sound source or forcing the source to be rather frontal with respect to the microphones. In this thesis, instead, the orientation of the source is taken into account to assign a reliability measure to each microphone, or group of microphones, and to merge their contributions accordingly.

The estimation of the talker's head orientation is a new and rather unexplored task, and only a few works have been presented in the literature so far. In our method the estimation is inferred by exploiting coherence measures at a set of microphone pairs by means of the common Crosspower Spectrum Phase (CSP) analysis. Microphone pairs with high coherence are more likely to be frontal. A finer estimation of the angle of the source is derived by interpolating all the coherence measures. As far as the pattern classification approach is concerned, a richer literature exists on this topic, but no work had so far been devoted to applying it to GCF or OGCF map representations. However, this research line is still at a preliminary stage, and some works refer to academic studies on pattern recognition and neglect important aspects such as reverberation.

1.5 Structure of the Thesis

After a description of the sound source localization problem and a brief introduction to the multi-path mathematical model, Chapter 2 presents a detailed survey of the state-of-the-art algorithms for the sound source localization problem. A particular emphasis is placed on approaches based on the time delay of arrival, since they are the most common in the scientific community. The chapter offers a summary of the several alternative approaches to estimating the time delay of arrival and a survey of the solutions presented over the years to the problem of combining multi-channel time delay measurements to infer a single three-dimensional source position estimate. Algorithms which work in statistical frameworks are also described. Finally, the chapter provides an insight into the few activities carried out on the problem of estimating the sound source orientation. The review of the state of the art is completed by Chapter 3, which describes in depth the Crosspower Spectrum Phase analysis approach for estimating the time delay of arrival at two microphones.

The chapter introduces the mathematical formulation and analyzes the performance of the algorithm in different acoustic conditions; with this aim, experiments on real data are reported. Chapter 4 presents the approach adopted to address the requirements introduced by the CHIL project. Distributed Microphone Networks are introduced and characterized from a sound source localization perspective. Distributing sensors around a room guarantees good coverage in space, which means potentially good localization performance. On the other hand, some microphones are unreliable due to their unfavorable position with respect to the source and must be handled accordingly. The chapter also includes a brief analysis of the effect of different microphone deployments on the localization process. The method chosen to manage a DMN is the Global Coherence Field (GCF), which is described in detail in Chapter 5. Comparisons among different possible implementations are given in terms of localization errors. An extension of the Global Coherence Field is presented in Chapter 6, which introduces the concept of the Oriented Global Coherence Field (OGCF) and its exploitation to estimate the source orientation. Oriented Global Coherence Field performance is measured in terms of orientation estimation errors and localization errors. Chapter 7 is dedicated to evaluating a practical implementation of a localization system that exploits OGCF. Experiments are conducted on real data acquired in the Karlsruhe University laboratories and distributed worldwide for the spring 2005 NIST evaluation campaign on speaker tracking. The evaluation metrics are also the same adopted in the international evaluation campaign. Results are compared with two traditional localization techniques based on triangulation and GCF. An alternative approach framed in a statistical context is presented in Chapter 8.

The method was devised in an attempt to exploit reflection patterns to reinforce the localization process. The chapter includes a description of the proposed algorithm with some alternative potential implementations and reports on a preliminary evaluation conducted on both a small simulated database and a real data collection. Chapter 9 contains conclusions and foreseen future activities.

Chapter 2

The Sound Source Localization Problem

2.1 Problem Formulation

The goal of a Sound Source Localization (SSL) system consists in estimating the position of active sound sources given the acoustic measurements provided by a set of microphones. This task typically applies to relatively small enclosures, such as offices, meeting rooms and domestic environments, whose dimensions are comparable with the involved wavelengths (e.g. 6.8 cm at 5 kHz). As a consequence, the quality of the signals acquired by the microphones is strongly degraded by reflections on surfaces and scattering by objects in the room [73]. These phenomena are known as reverberation and make multiple delayed replicas of the emitted signal reach the microphones. In practice, besides the direct path, wavefronts can reach the sensors through several alternative paths that bounce around the room (a multi-path environment). Figure 2.1 depicts an example where two reflected paths join an acoustic source with a microphone. Indirect paths are longer than the direct one for obvious geometrical reasons. Moreover, at each reflection part of the acoustic energy is absorbed, so a reflected path always undergoes a higher attenuation than the direct path. Reverberation is the most critical issue for SSL systems, since a virtual competitive sound source is generated whenever a reflection occurs [61].

Figure 2.1: Example of a multi-path acoustic channel.

Besides reverberation, the quality of the signals acquired from a desired source is also affected by environmental noise (diffuse or directional) and by sounds produced by other competing sources (fans, computers, etc.). The first part of this chapter provides a mathematical model of the acoustic phenomena which are critical for the SSL problem. The second part presents the state of the art on the topic under investigation.

2.2 Signal Modeling

Let us denote as x(t) the signal impinging on a generic acoustic sensor and as s(t) the original signal emitted by the source. The received signal can be modeled as:

    x(t) = h(t) \ast s(t) + n(t)    (2.1)

where h(t) is the impulse response accounting for the effects of signal propagation from source to microphone, while n(t) is white noise, here assumed diffuse.

Assuming a ray propagation model and a multi-path environment, each single path contributes to the impulse response with a delta function centered at the propagation time delay and with amplitude corresponding to the attenuation along the particular propagation path. The overall impulse response can be decomposed into two main components:

    h(t) = h_D(t) + h_R(t)    (2.2)

where:

- h_D(t) represents the wave propagation through the direct path between source and microphone. It is characterized only by the signal attenuation A and the propagation delay \Delta:

    h_D(t) = A \delta(t - \Delta)    (2.3)

- the second part h_R(t) accounts for the delayed replicas that constitute reverberation. It can be modeled as a train of delayed impulses:

    h_R(t) = \sum_{k=0}^{\infty} A^R_k \delta(t - \Delta_k)    (2.4)

Notice that the attenuation coefficients A^R_k include both the power absorption due to reflections and the decay due to propagation. \Delta_k is the propagation time measured along the k-th path to reach the microphone, and it is always larger than \Delta, since the direct path is also the shortest one. Figure 2.2 shows an example of an impulse response in a multi-path environment adopting a ray propagation model. According to this mathematical model, the received signal can then be rewritten as:

    x(t) = A s(t - \Delta) + \sum_{k=0}^{\infty} A^R_k s(t - \Delta_k)    (2.5)
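
To make the model concrete, the following Python sketch (not from the thesis: sampling rate, number of reflections, attenuations and delays are all invented for illustration) builds a discrete-time impulse response with a direct path plus a train of weaker reflections, and synthesizes the received signal according to equation 2.1:

    import numpy as np

    fs = 16000                      # sampling rate in Hz (assumed for the example)
    rng = np.random.default_rng(0)

    # Discrete approximation of h(t) = h_D(t) + h_R(t) over roughly 0.25 s.
    h = np.zeros(4096)
    A, direct = 0.9, 120            # direct-path attenuation and delay in samples (invented)
    h[direct] = A                   # h_D: a single impulse (eq. 2.3)
    for _ in range(50):             # h_R: later, weaker impulses (eq. 2.4)
        dk = int(rng.integers(direct + 1, len(h)))
        h[dk] += rng.uniform(0.0, 0.4) * np.exp(-dk / 2000.0)

    # Received signal x(t) = h(t) * s(t) + n(t) (eq. 2.1).
    s = rng.standard_normal(fs)     # stand-in for the emitted signal s(t)
    x = np.convolve(s, h)[:len(s)] + 0.01 * rng.standard_normal(len(s))

Every reflected impulse arrives after the direct one and with smaller amplitude, mirroring the constraints \Delta_k > \Delta and A > A^R_k discussed above.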

Attenuation coefficients and propagation delays depend on the respective source and microphone positions and on the sensor's acoustic characteristics (i.e. its directional response). Note that A > A^R_k always, since A is not subject to absorption due to reflections. However, this is not sufficient to guarantee that the maximum peak of an impulse response corresponds to the direct path: it is possible that multiple reflections interfere constructively and build up higher peaks.

Figure 2.2: Example of an impulse response in a multi-path environment. The first impulse accounts for the direct path, while the rest of the impulses represent the effects of reverberation.

The amount of reverberation is in general characterized by means of the reverberation time T_60, defined as the time taken by the acoustic level of the received signal to drop by 60 dB once the source is abruptly interrupted. In enclosures it can be approximated by the Sabine formula [58]:

    T_{60} = 0.163 \frac{V}{aS} \quad (s)    (2.6)

where V is the room volume, S is the total surface area and a is the average Sabine absorptivity [91].
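
As a quick worked example of equation 2.6 (room dimensions and absorptivity are invented), consider a 6 m × 5 m × 3 m room, so that V = 90 m³ and S = 2(6·5 + 6·3 + 5·3) = 126 m²; assuming an average absorptivity a = 0.3:

    T_{60} = 0.163 \cdot \frac{90}{0.3 \cdot 126} \approx 0.39 \ \mathrm{s}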

The aforementioned mathematical model for multi-path environments, even if simple and based on a narrow-band propagation model, offers a good understanding of the problems that reverberation introduces and that will be dealt with in the remainder of this thesis. Moreover, evaluation experiments for SSL algorithms are commonly carried out adopting the image method [3], which relies on this multi-path mathematical approximation. Let us now analyze the different approaches presented so far in the literature to tackle the SSL problem in reverberant environments.

2.3 Time Difference of Arrival

Over the years many research efforts have been devoted to the SSL task, with particular attention to environments characterized by high reverberation levels. Almost all algorithms tackle the SSL task by exploiting an estimation of the Time Delay of Arrival (TDOA) at two or more sensors. Given a sound source in spatial position p = (x, y, z) and two microphones m_1 and m_2 with coordinates m_1 = (m_{x1}, m_{y1}, m_{z1}) and m_2 = (m_{x2}, m_{y2}, m_{z2}), the direct wavefronts reach the two sensors with a certain time delay T(p):

    T(p) = \frac{\|m_1 - p\| - \|m_2 - p\|}{c} \quad (s)    (2.7)

where c is the speed of sound, which can be expressed as a function of the temperature K in kelvins:

    c = 331.45 \sqrt{\frac{K}{273}} \quad (m/s)    (2.8)

Human beings also exploit a similar feature, the interaural time difference (ITD), in combination with the interaural level difference (ILD), to determine where a sound is coming from. For a single microphone pair, the locus of points which satisfy a given time delay is one sheet of a hyperboloid of two sheets, as depicted in figure 2.3. Hence, more than a single microphone pair, or a combination of TDOA with ILD [34], is needed to accomplish the SSL task. When M microphone pairs are available, the source position can be derived as the point in space that best fits the set of M TDOAs estimated at the sensor pairs.

Figure 2.3: Locus of points which satisfy equation 2.7 for a given absolute time delay at two microphones m_1 and m_2. The sign of the time delay identifies one of the two sheets.
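
Equations 2.7 and 2.8 translate directly into code; the following Python sketch is only illustrative, and all positions and the temperature are invented placeholders:

    import numpy as np

    def speed_of_sound(kelvin):
        """Eq. 2.8: c = 331.45 * sqrt(K / 273), in m/s."""
        return 331.45 * np.sqrt(kelvin / 273.0)

    def tdoa(p, m1, m2, c):
        """Eq. 2.7: T(p), the difference of the two direct propagation times (s)."""
        return (np.linalg.norm(m1 - p) - np.linalg.norm(m2 - p)) / c

    c = speed_of_sound(293.0)            # roughly 343 m/s at 20 degrees Celsius
    p = np.array([2.0, 3.0, 1.5])        # hypothetical source position (m)
    m1 = np.array([0.0, 0.0, 1.5])       # hypothetical microphone pair (m)
    m2 = np.array([0.5, 0.0, 1.5])
    print(tdoa(p, m1, m2, c))            # the expected TDOA T(p), in seconds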

Let us denote as τ_i the TDOA estimated at the i-th microphone pair (m_{i1}, m_{i2}), with i = 0, ..., M−1. If P(p | τ_0, τ_1, ..., τ_{M−1}) is the probability that the source is in position p given the estimated TDOAs, then the Maximum Likelihood (ML) solution, from a risk minimization perspective, is the following:

    \hat{p} = \arg\max_p \log P(p \mid \tau_0, \tau_1, \ldots, \tau_{M-1})    (2.9)

Bayes' theorem allows reformulating the decision criterion in a more convenient way:

    \hat{p} = \arg\max_p \log P(\tau_0, \tau_1, \ldots, \tau_{M-1} \mid p)    (2.10)

According to the central limit theorem, the τ_i are commonly modeled as independent gaussian random variables with mean equal to the true time delay T_i(p) and variance σ_i:

    \tau_i \sim N(T_i(p), \sigma_i)    (2.11)

In vector form, the probability distribution becomes:

    P(\tau \mid p) = \frac{1}{\sqrt{(2\pi)^M \det(\Xi)}} \exp\left( -\frac{1}{2} [\tau - T(p)]^T \Xi^{-1} [\tau - T(p)] \right)    (2.12)

where:

    \Xi = I [\sigma_0, \sigma_1, \ldots, \sigma_{M-1}]    (2.13)
    \tau = [\tau_0, \tau_1, \ldots, \tau_{M-1}]^T    (2.14)
    T(p) = [T_0(p), T_1(p), \ldots, T_{M-1}(p)]^T    (2.15)

Substituting 2.12 in 2.9, the ML solution reduces to minimizing the following objective function [104]:

    J_{ML}(p) = [\tau - T(p)]^T \Xi^{-1} [\tau - T(p)]    (2.16)

Notice that the ML solution is equivalent to the Least Squares error (LS) formulation [50] when the variances of the time delays are assumed to be equal:

    J_{LS}(p) = [\tau - T(p)]^T [\tau - T(p)] = \sum_{i=0}^{M-1} |\tau_i - T_i(p)|^2    (2.17)

In general the assumption of gaussianity of the noise is difficult to verify, and hardly ever is any knowledge about the statistics of the problem given. Hence an LS approach is often more suitable than the ML one. From a geometrical perspective, the solution to the LS formulation is the point in space with minimum distance from all the hyperboloids identified by the set of estimated time delays. Since T(p) is a non-linear function of p, estimating the source position according to the LS approach as:

    \hat{p} = \arg\min_p J_{LS}(p)    (2.18)

requires solving a non-linear system, which is not trivial.
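
A brute-force illustration of criterion 2.18 follows; this is a sketch only (real systems use the iterative or closed-form solvers discussed next), the grid bounds and step are arbitrary, and `tdoa` is the helper defined in the previous snippet:

    import numpy as np

    def j_ls(p, pairs, taus, c=343.0):
        """Eq. 2.17: squared deviations between measured and predicted TDOAs."""
        return sum((tau - tdoa(p, m1, m2, c)) ** 2
                   for (m1, m2), tau in zip(pairs, taus))

    def locate_ls(pairs, taus, bounds, step=0.1):
        """Eq. 2.18: exhaustive grid search for the minimizer of J_LS."""
        xs, ys, zs = (np.arange(lo, hi, step) for lo, hi in bounds)
        best_p, best_cost = None, np.inf
        for x in xs:
            for y in ys:
                for z in zs:
                    p = np.array([x, y, z])
                    cost = j_ls(p, pairs, taus)
                    if cost < best_cost:
                        best_p, best_cost = p, cost
        return best_p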

Efficient iterative search methods, whose starting points must be selected carefully, allow computing the LS solution (see DiBiase et al. in [18]): Newton-Raphson, Gauss-Newton, LMS [14]. Gradient descent was also implemented in a real-time application [86]. In an effort to reduce the computational cost of the LS approach, many closed-form solutions based on sub-optimal decision criteria were presented over the years. The simplest closed form is triangulation of two TDOAs, which does not allow exploiting extra sensor information. A way to avoid the non-linearity is to compute a linear approximation of T_i(p) applying the theory of Taylor series and then a more efficient iterative search [102]. On the other hand, adopting alternative error functions allows defining different LS formulations which can turn out either more robust or less complex: Smith and Abel in [99] introduced the Spherical Interpolation (SI), later extended with linear correction in [53]; Schau and Robinson presented the Spherical Intersection (SX) in [94]; and Chan and Ho proposed to perform a Hyperbolic Intersection (HI) [30]. Later, the Linear Interpolation (LI) method based on the Direction of Arrival (DOA) was introduced in [17]. As already mentioned, given a microphone pair, the locus of points which satisfy a given TDOA is one sheet of a hyperboloid of two sheets. When the source satisfies the far-field condition, the wave propagation can be assumed to be planar and the hyperboloid generated by the i-th microphone pair can be approximated by a cone. The far-field condition is verified when the distance between the source and the microphone is much larger than the involved wavelengths λ: \|p - m_i\| \gg \lambda [73]. If we restrict the SSL problem to a plane, the hyperboloid reduces to a hyperbola that can be approximated by a straight line whose slope indicates the angle, or direction, of arrival, as shown in figure 2.4. The relationship between τ_i and the direction of arrival θ_i can be approximated as:

    \tau_i = \frac{\|m_{i1} - m_{i2}\| \cos(\theta_i)}{c}    (2.19)

Figure 2.4: Locus of points that solve equation 2.7 restricted to a plane and approximated by a straight line.

Hence, once the TDOA is given, the DOA can be computed as:

    \theta_i = \arccos\left( \frac{c \tau_i}{\|m_{i1} - m_{i2}\|} \right)    (2.20)

Figure 2.5 depicts the relationship between τ_i and θ_i when the source is in far-field condition.

Figure 2.5: Relation between DOA and TDOA in a two-sensor set-up when the source is in a far-field position. Dashed lines represent the incoming sound waves, assumed to be planar.

On the other hand, if the source is too close to the microphones, the wave propagation must still be considered spherical (see figure 2.6) and the simple relationship 2.20 between θ_i and τ_i does not hold.
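
In code, equation 2.20 is a one-liner; in this sketch the clipping merely guards against |cτ_i| slightly exceeding the microphone spacing because of estimation noise (a practical safeguard, not part of the formulation above):

    import numpy as np

    def doa(tau, m1, m2, c=343.0):
        """Eq. 2.20: far-field direction of arrival (radians) from a TDOA."""
        d = np.linalg.norm(np.asarray(m1) - np.asarray(m2))   # microphone spacing
        return np.arccos(np.clip(c * tau / d, -1.0, 1.0))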

Figure 2.6: Spherical sound waves produced by a radiating source in near-field condition impinge on two microphones.

The already mentioned triangulation simply computes the crossing point of the DOA lines along the bearing angles estimated at two microphone pairs. Conversely, given a set of M DOA estimations, the LI algorithm computes each possible intersection (closest intersection actually, since it operates in a three-dimensional domain) and then derives the position estimate as a weighted average among all the intersections. Under the assumption that the estimated time delays τ_i are gaussian random variables (2.11), the weights are derived from the likelihood of τ_i. In this way, estimates that are very unlikely are not taken into account when computing the average position. Remaining in the LS framework, Brandstein in [16] analyzes alternative localization criteria which define the LS objective function on alternative features, like DOAs or the distance between a point p and each line identified by a DOA. Plenty of alternative closed-form approaches that rely on particular environmental set-ups can be found in the literature, as for instance in [54, 85].
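
Restricted to a plane, the triangulation mentioned above amounts to intersecting two bearing lines. A hedged sketch, with illustrative inputs only: it ignores the front-back ambiguity of a single microphone pair and does not handle near-parallel bearings:

    import numpy as np

    def triangulate_2d(c1, theta1, c2, theta2):
        """Crossing point of the bearing lines c_i + t_i * (cos(theta_i), sin(theta_i))."""
        u1 = np.array([np.cos(theta1), np.sin(theta1)])
        u2 = np.array([np.cos(theta2), np.sin(theta2)])
        # Solve c1 + t1*u1 = c2 + t2*u2 for (t1, t2); singular if the bearings are parallel.
        t = np.linalg.solve(np.column_stack([u1, -u2]),
                            np.asarray(c2, float) - np.asarray(c1, float))
        return np.asarray(c1, float) + t[0] * u1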

It is worth mentioning that all the algorithms described above require that the sound propagation speed is known. In many cases, a small error in the sound speed estimation does not critically affect the localization performance. Nevertheless, in [107] the authors describe a Constrained Least Squares (CLS) algorithm that does not require knowledge of the sound speed.

2.3.1 TDOA estimation

The instantaneous TDOA at a sensor pair can be derived by picking the peak of a Coherence Measure (CM) (see Ferguson in [28]) that expresses the similarity of two signals given a time delay. The simplest and most straightforward way to compute a CM is to exploit the cross-correlation between the received signals. Unfortunately this approach can easily fail in real and noisy conditions due to the signal spectral properties and the microphone acoustic response. Several variations were presented to overcome the weaknesses of classic cross-correlation. Among them, the Normalized Cross Correlation (NCC) is noteworthy, though it is prone to errors when coping with periodic signals [97]. Knapp and Carter introduced in [60] the most common technique for TDOA estimation: the Generalized Cross Correlation (GCC). In particular, the authors presented several different variants, among which a ML method that requires the knowledge of both noise and signal statistics, and the PHase Transform (GCC-PHAT) version, which has been widely adopted in the SSL community. GCC-PHAT reformulates the analysis of the similarity in the phase domain in order to be independent of the signal dynamics. With this aim, the signal spectra are flattened with a whitening process that makes each frequency identical from an energy point of view. GCC-PHAT is also known in the SSL community as Crosspower Spectrum Phase analysis (CSP) [78, 77]. In the rest of this thesis the denomination CSP will be used. Even if both techniques take advantage of a phase analysis of the crosspower spectrum, they were derived in two different contexts: GCC-PHAT from a theoretical investigation on GCC, CSP from an application to SSL.
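
For reference, here is a compact frequency-domain sketch of the PHAT-weighted GCC (windowing, frame segmentation and sub-sample peak interpolation are omitted; the small constant in the denominator avoids division by zero and is an implementation detail, not part of the original definition):

    import numpy as np

    def gcc_phat(x1, x2, fs, max_tau=None):
        """TDOA estimate: whiten the crosspower spectrum, then pick the peak lag."""
        n = 2 * max(len(x1), len(x2))            # zero-padding avoids circular wrap-around
        X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
        cross = X1 * np.conj(X2)                 # crosspower spectrum
        csp = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)
        max_shift = n // 2
        if max_tau is not None:
            max_shift = max(1, min(int(fs * max_tau), n // 2))
        csp = np.concatenate((csp[-max_shift:], csp[:max_shift + 1]))
        return (np.argmax(np.abs(csp)) - max_shift) / fs   # lag in seconds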

Since CSP is the technique adopted in this doctoral research, Chapter 3 will investigate this approach and its behavior in different acoustic conditions in depth. The accuracy and robustness of a TDOA estimation process based on CSP may be improved by means of either linear microphone arrays or compact clusters of microphones. The estimation may take advantage of a smarter peak-picking procedure which exploits the redundancy provided by the multi-channel scenario, or may rely on a more accurate computation of the CM. With the latter purpose, an additive CSP approach, which aims at reinforcing main peaks by summing up aligned versions of the CSP computed at different sensor pairs, was presented in [76]. An extension of GCC to the multi-channel case was introduced in [31] and afterward adapted to an iterative estimation of the signal cross-correlation [12]. The authors exploited a spatial correlation matrix in a spatial prediction framework and took advantage of the closed-form relationship between time delays when microphones are arranged in certain geometrical fashions. Matsuo et al. applied the same concept to a small triangular microphone array [69]. Besides CSP, many alternative techniques for TDOA estimation are described in the literature. Some approaches perform a parameter estimation of the signal model and rely on spatio-spectral and spatio-temporal matrices estimated from the observed data [57]. In this particular case TDOAs are derived as a side effect of adaptive beamforming, which computes an approximation of the impulse response and adjusts the time delays in order to maximize the beamformer output. Methods in this class present some critical drawbacks. First of all, matrix estimation requires the signals to be stationary within the analysis window, which must be quite long in order to guarantee good accuracy. Moreover, these algorithms, which are in some cases limited to linear arrays and far-field conditions, were conceived for narrow-band signals, and they are sensitive to errors in the source and noise modeling.

2.3. Time Difference of Arrival Chapter 2. The Sound Source Localization Problem source and noise modeling. Among this category are noteworthy Autoregressive (AR) approaches, the Minimum Variance (MV) method presented by Capon [27] and the MUltiple SIgnal Classification (MUSIC) method of Schmidt [96]. The former is based on an eigenvalue analysis of the spatiotemporal correlation matrix. In this framework Huang et al. [52] presented an algorithm that estimates the impulse response at each microphone performing an eigenvalue decomposition. This approach can be extended to the multi-channel scenario [5]. Once impulse responses are given, the time delay of arrival is derived as the time difference between the main peak of each impulse response. Likewise, Doclo and Moonen described adaptive approaches based on Generalized Singular Value Decomposition (GSVD) [39] and Generalized Eigenvalue Decomposition (GEVD) [40] which include a whitening process. Channel identification for TDOA estimation can be performed also applying an adaptive LMS filter [87] where one of the received signals acts as the reference and the other as the desired signal. TRINICON [25] for instance is a blind signal process framework which allows to estimate impulse responses, and hence TDOA, in a multi-speaker multi-sensor scenario. As an alternative to CSP, Benesty in [32] described an approach which combines Average Magnitude Difference Function (AMDF) and Average Magnitude Sum Function (AMSF) to estimate the TDOA. It is worth noting that this approach needs a whitening process, which flattens the signal spectra, in order to yield reliable TDOA estimates. Recently some contributions to the TDOA estimation, and more in general to SSL, come from research activities on Blind Source Separation (BSS) [13]. TDOA is obtained either as a side effect of the speech separation process [74, 26] or in an effort to handle bin permutation for the frequency domain based methods [93]. BSS is particularly useful when SSL is applied in a multi-speaker context [103]. Unfortunately BSS re- 21

requires that microphone pairs be quite close, which limits the TDOA estimation resolution, and it is often tailored to low-reverberation environments. Finally, the Histogram Mapping proposed in [51] applies a frequency decomposition in an effort to reduce the broadband SSL task to a set of narrowband problems. The frequency decomposition relies on the source sparseness, or W-disjoint, assumption: in a multi-speaker context each frequency bin is associated to a single speaker [7].

2.4 Direct Approaches

TDOA-based approaches are also referred to as 2-step methods since they first estimate a set of TDOAs and then combine them to accomplish the localization task. On the other hand, much research effort has been devoted in recent years to direct approaches, which extract the position information in a single step. Algorithms in this class commonly perform a scan over a grid of spatial points and maximize, or minimize, a given objective function. When the correct objective function is applied, a direct approach can provide a ML solution computed on a reduced set of potential solutions rather than on the whole space of source positions. Direct methods have often been ignored in past years because of their heavy computational requirements. Nowadays, as the available computational power keeps growing, there is a renewed interest in this kind of approach. The main advantage offered by direct approaches is the absence of any assumption about wave propagation (far versus near field) or about the signal and sensor models. It is worth mentioning that direct, or 1-step, algorithms satisfy the least commitment principle [90]: they do not take decisions in intermediate steps, decisions which would prematurely throw away part of the available information. The simplest manner to scan the space of possible sound source positions

is to use a delay-and-sum beamformer [4, 81]. Given a spatial position, the signals received at each sensor are aligned according to the theoretical set of propagation time delays and summed together in order to derive a single output signal. Figure 2.7 shows the scheme of a delay-and-sum beamformer with $N_s$ input signals.

Figure 2.7: Scheme of a delay and sum beamformer with $N_s$ input signals.

When the position under investigation corresponds to the actual source position, the energy of the combined output signal presents a peak. Beamformers can be improved by applying matched filters or any other knowledge about the context (see Ward et al. in [18]). First investigations in this field date back to 1990, when Alvarado introduced the concept of Power Field (PF) [4] in combination with Stochastic Region Contraction. In an attempt to reduce the computational load of an exhaustive search for the maximum, an efficient coarse-to-fine search based on octrees is applied in [42]. As it has been proved that energy is not a reliable feature in noisy and reverberant environments, Silverman and Kirtman described in [97] an alternative implementation of PF which employs cross-correlation and performs a two-stage search.
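As a concrete illustration of the scan described above, the following is a minimal Python sketch that computes the output energy of a delay-and-sum beamformer steered to one candidate point. It is not code from the thesis: the names (delay_and_sum_energy, mic_positions) are illustrative, alignment is rounded to whole samples, and edge effects of the circular shift are ignored.

```python
import numpy as np

def delay_and_sum_energy(signals, mic_positions, point, fs, c=343.0):
    """Energy of a delay-and-sum beamformer steered to `point`.

    signals: (N_s, L) array, one analysis frame per microphone.
    mic_positions: (N_s, 3) sensor coordinates in meters.
    point: (3,) candidate source position in meters.
    """
    # Propagation delays from the candidate point to each sensor,
    # referred to the first microphone and rounded to whole samples.
    dists = np.linalg.norm(mic_positions - point, axis=1)
    delays = np.round((dists - dists[0]) / c * fs).astype(int)
    # Align each channel on the hypothesized delays, then sum them up.
    output = sum(np.roll(x, -d) for x, d in zip(signals, delays))
    return float(np.sum(output ** 2))
```

Scanning a grid of candidate points and keeping the one with the highest output energy is exactly the exhaustive search whose cost motivates the coarse-to-fine strategies cited above.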

2.5. Particle Filtering Chapter 2. The Sound Source Localization Problem coherence measure rather than power. The objective function to maximize is defined as a combination of a set of generally defined Coherence Measures (CM) provided by a set of microphone pairs. A possible implementation of GCF exploits CSP as coherence measure. A more detailed analysis on GCF is given in Chapter 5. More recently a particular implementation of GCF based on CSP and referred to as SRP-PHAT was investigated by DiBiase in [37]. In [64] a sector based, instead of point based, version of GCF is defined in the frequency domain. Restricting to linear arrays, Griebel in [47] introduces the concept of realizable, or consistent, delay vectors. In [83] Petterson and Kyriakakis suggested a hybrid approach which reduces the set of possible source position to investigate by performing a preliminary spherical intersection. Rui and Florencio in [88] presented an implementation of GCF whose CM are computed with the ML version of GCC instead of the more traditional phase analysis based on CSP. Birchfield in [15] presents a localization algorithm which performs a scan over a hemisphere defined around a compact microphone array. It is noteworthy that when direct approaches are adopted in an effort to improve the estimation of direction of arrivals with compact arrays, as in the case above, they comply only partially with the least commitment principle since a second combination step is still required. 2.5 Particle Filtering Particle filtering and Kalman filtering [10] have gained wider attention in the SSL community in recent years in particular when tracking of sources is required. Kalman filtering is mainly exploited to guarantee a smooth tracking. Referring to audio based systems, TDOA measurements are commonly 24

associated to the observation vector of the Kalman filter. The source position estimates are provided directly at the update step and there is no need for an exhaustive search or closed-form approximations. Nevertheless, linearizing the LS criterion is still required in order to fit the Kalman filter formulation. The Extended Kalman filter and the Iterative Extended Kalman filter [56] are better tailored to the SSL problem [59, 11], since they do not require that the state transition and observation models be linear. The tuning of the parameters that govern the filter behaviour is fundamental to ensure accurate and reliable tracking, and can be derived from training data. Particular attention must be paid to handling long silences, during which the acoustic measurements are not useful to the filter update procedure.

Vermaak and Blake in [105] introduced the use of particle filtering, or sequential Monte Carlo [41], methods. In their work the problem of tracking a sound source is defined in a state-space framework adopting a multi-hypothesis model in order to cope with reverberation. The particle filter operates recursively: each hypothesis, or particle, is propagated on the basis of a motion model, it is weighted according to a set of acoustic measurements (i.e. TDOAs), and then the space of potential source positions is resampled according to the probability distribution represented by the particle weights. The source position estimate is obtained as a weighted average of the single particle positions. In practice, given a likelihood, a particle filter computes the ML solution on a reduced set of positions which, after a few iterations, tend to concentrate in the areas with high probability. This technique also allows implementing a steered beamformer on a sampled set of locations, eliminating the need for a comprehensive search. In this framework, Ward and Williamson presented in [106] an audio tracking system which combines beamforming and particle filtering. In [72] a particle filter extended to the spectro-spatial domain is presented.
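The recursion just described (propagate, weight, resample) fits in a few lines. The sketch below is a generic bootstrap particle filter step, not the implementation of [105]; the names (particle_filter_step, likelihood_fn) and the random-walk motion model are illustrative assumptions.

```python
import numpy as np

def particle_filter_step(particles, weights, likelihood_fn, motion_std=0.05):
    """One propagate-weight-resample recursion for SSL tracking.

    particles: (N, 2) hypothesized source positions in meters.
    weights: (N,) normalized particle weights.
    likelihood_fn: maps (N, 2) positions to positive plausibility values
        derived from the acoustic measurements (e.g. a TDOA fit).
    """
    n = len(particles)
    # 1. Propagate each hypothesis with a random-walk motion model.
    particles = particles + np.random.normal(0.0, motion_std, particles.shape)
    # 2. Re-weight by the acoustic likelihood of each position.
    weights = weights * likelihood_fn(particles)
    weights = weights / weights.sum()
    # 3. Systematic resampling according to the weight distribution.
    idx = np.searchsorted(np.cumsum(weights),
                          (np.arange(n) + np.random.rand()) / n)
    particles, weights = particles[idx], np.full(n, 1.0 / n)
    # Point estimate: mean of the resampled cloud.
    return particles, weights, particles.mean(axis=0)
```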

In some conditions particle filtering has proved to outperform standard localization systems [67]. Moreover, any alternative acoustic measurement which may turn out to be more efficient than TDOA is suitable as a likelihood measure. Nevertheless, some drawbacks are present: if the likelihood is too sharp, i.e. very accurate, the filter may fail, since the particles are only spread around the correct position and may all receive negligible weights [101]. A regularization or smoothing of the likelihood is required in these cases.

2.6 Statistical Approaches

In recent years some efforts have been devoted to framing the SSL problem in a statistical formulation. The idea of a stochastic approach is appealing because it allows coping efficiently with reverberation and can provide reliable source position estimates even when reflections are predominant. Moreover, it promises to be more robust to noise and less sensitive to microphone position errors. Some of the works presented so far in the literature apply statistics to TDOA estimation, hence a combination step is still needed. Other approaches instead directly determine the final source position. A first attempt at a stochastic method was presented by Guentchev and Weng in [49]. The authors modeled the mapping function from the set of observed features, including interaural level difference and interaural time difference, to the source coordinates as a regression tree. Strobel and Rabestein presented in [100] a TDOA classifier. The authors investigated the idea of estimating the probability distribution function of the TDOA estimates given a direction of arrival. The classification is then performed by comparing the probability distribution models with the new probability distribution inferred from the input signals. Neural Networks have also been employed to derive an estimate of the direction of arrival, based either on interaural level difference and interaural time difference [35]

or on the whole Crosspower Spectrum, possibly reduced by some feature extraction algorithm [9]. Smaragdis and Boufounos [98] presented an approach, based on Hidden Markov Models (HMM), able to model and classify source trajectories. It models the wrapped phase difference at two microphones as a unimodal gaussian process. More recently, an attempt to characterize reverberation from a physical point of view by exploiting the autocorrelation function at each microphone is described in [95].

2.7 Source Orientation

Figure 2.8: Rough shape of the head radiation pattern at two different frequencies (200 Hz and 1500 Hz).

As already mentioned in the introduction, this research activity on the SSL topic was mainly focused on combining the information provided by a set of microphone pairs distributed in a room. A possible approach can attempt to exploit knowledge about the source orientation. In the literature it is typically assumed that the source is omnidirectional or is aiming at the microphones, and as a consequence errors are mainly ascribed to noise and reverberation. On the contrary, human beings are quite directional sound sources, as proved by several practical experiments [61, 33], and so knowledge of the source orientation could help to deliver more reliable estimates by efficiently exploiting the redundancy provided by the

network of microphones. The Source Orientation Estimation (SOE) task is a rather new and challenging research area. The first attempt to characterize the source radiation pattern was described by Meuse and Silverman in [71]. The authors modeled the human speaker as a cylindrical piston-like source whose parameters were estimated by fitting an expected pressure function at the microphones. One of these parameters is the orientation of the source. Silverman and Sachar in [92] presented an approach to SOE based on energy. The proposed method requires the so-called Huge Microphone Array (HMA) and relies on the fact that the energy radiated from the back of a talker's head is lower than that radiated from the front [43], as sketched in figure 2.8.

Chapter 3

Crosspower Spectrum Phase Analysis

3.1 Definition

Although many TDOA estimators have been proposed in the literature, CSP remains the technique of choice in most real applications. The reasons for the success of CSP in the SSL community are various. First of all, CSP is a quite straightforward and computationally fast approach, which are key qualities in real-time applications. Moreover, CSP has proved to be reliable even in adverse conditions where the levels of both noise and reverberation are high. In fact, this approach does not rely on any assumption about signal characteristics or acoustic wave propagation, assumptions which often do not hold in real environments. Unfortunately, in very critical conditions, when for instance the direct path is missing and the sensors receive mainly reflected wavefronts, CSP fails. In those cases, however, only a smart analysis of the impulse response or a statistical approach could help handling the effects of reverberation. Since this research adopts a standard CSP approach to address the TDOA estimation problem, this chapter provides a detailed analysis of this method. The choice of adopting CSP was driven by the fact that the

improvement of single TDOA estimates is not a goal of this thesis. We therefore chose the most common method, which was also deeply investigated, and patented¹ [65, 66], at ITC-irst and is part of a product for teleconferencing delivered by Aethra (www.aethra.it). Moreover, some preliminary experiments on real data acquired in highly reverberant and noisy environments proved that no significant improvements can be achieved with alternative approaches. Last but not least, portability is also key, as this research activity was conducted within the CHIL project framework and the algorithms were foreseen to deal with different scenarios where acoustics and sensor set-ups could change considerably.

According to the notation introduced in Chapter 2, let us assume that two audio sensors $m_1$ and $m_2$ are available, and let us denote as $x_1(n)$ and $x_2(n)$ the digitized sequences captured by the sensors, where $n = 0, \ldots, L-1$ and $L$ is the analysis window length. CSP at time instant $t$ can hence be formulated as follows [78]:

$$\mathrm{CSP}(t, l) = \mathrm{FFT}^{-1}\left\{ \frac{\mathrm{FFT}(x_1(n)) \cdot \mathrm{FFT}^*(x_2(n))}{|\mathrm{FFT}(x_1(n))| \, |\mathrm{FFT}(x_2(n))|} \right\} \qquad (3.1)$$

where $\mathrm{FFT}$ is the Fast Fourier Transform, $\mathrm{FFT}^*$ denotes its complex conjugate, and $l$ is the time lag in samples between the signals. The whitening process, obtained by normalizing the Crosspower Spectrum, produces a flat spectrum which theoretically corresponds to a delta function in the time domain. In this way, CSP exploits only the phase difference between the acquired signals and avoids artifacts introduced by the spectral characteristics of the incoming wavefronts. The range of values of the time lag is limited to $\pm l_{max}$ according to the inter-sensor distance, i.e. the delay observed when the source is in end-fire position:

$$l_{max} = \frac{\|m_1 - m_2\|\, F_c}{c} \qquad (3.2)$$

where $F_c$ is the sampling rate and $c$ is the speed of sound.

¹ U.S. Patent 5,465,302; Italian patent Nr. TO92A000855.
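Equation 3.1 translates almost directly into numpy. The following sketch is our own illustration, not the patented ITC-irst implementation; the function name csp and the small constant eps guarding the division are assumptions, and the lag axis is restricted to the admissible range of equation 3.2.

```python
import numpy as np

def csp(x1, x2, fs, mic_distance, c=343.0, eps=1e-12):
    """CSP (GCC-PHAT) of two frames, following equation 3.1."""
    n = len(x1)
    X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)
    cross = X1 * np.conj(X2)
    # Whitening: |X1||X2| equals |cross|, so dividing by its modulus
    # keeps only the phase of the Crosspower Spectrum.
    coh = np.fft.fftshift(np.fft.irfft(cross / (np.abs(cross) + eps), n=n))
    lags = np.arange(-(n // 2), n - n // 2)
    # Keep only physically admissible delays (equation 3.2).
    l_max = int(np.ceil(mic_distance * fs / c))
    keep = np.abs(lags) <= l_max
    return lags[keep], coh[keep]

# TDOA estimate in samples: the lag of the CSP maximum, e.g.
# lags, coh = csp(frame1, frame2, fs=44100, mic_distance=0.2)
# tdoa = lags[np.argmax(coh)]
```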

The length $L$ of the analysis window must be chosen as a trade-off between stability, obtained with high values of $L$, and tracking requirements. Moreover, the selection of the window length should take into account the stationarity of the signals. Without loss of generality, hereafter the dependence on the time $t$ is neglected for the sake of simplicity. In ideal conditions, CSP has proved to present a prominent peak when the time lag $l$ is set equal to the actual time difference of arrival [79]. For each time lag, CSP provides a measure of the similarity, or coherence measure, between the two signals aligned according to the given lag.

3.2 Reverberation

As outlined in [29], CSP is strongly degraded as the level of reverberation rises. In fact, as soon as the power of the echoes increases, several secondary peaks arise in the CSP in correspondence with reflected paths. Some examples derived from real data will better clarify how reverberation affects CSP. For this purpose we acquired a set of real data obtained by playing a white noise sequence through a loudspeaker, which guarantees a fixed radiation pattern, placed in front of a microphone pair with different orientations. The reverberation time of the environment where the signals were acquired is $T_{60} = 0.7$ s. Figure 3.1 shows an example of CSP computed when the loudspeaker is aiming at the microphones: the energy of the direct path dominates those of the reflected ones, and the prominent peak corresponds to the actual time delay. Figure 3.2 depicts instead how CSP changes when the loudspeaker is rotated by 45° to the right with respect to the microphones: reflected paths arise on the left of the figure, even if their coherence is considerably lower than that of the direct path. Reflections have degraded the quality of the signals and the mutual correlation along secondary paths is reduced. Notice that the overall coherence is in some way shared between the main peak and the secondary peaks, reducing the discriminative capacity of the algorithm.

Figure 3.1: Example of CSP measured when the sound source is frontal to the sensors at about 2.5 m distance. Note the sharp peak at the time lag equal to 1 sample, which corresponds to the actual TDOA.

Figure 3.2: Example of CSP for a sound source aiming 45 degrees to the right with respect to the microphones. Note that besides the main correct peak some secondary peaks, due to reflected paths, arise on the left of the figure.

Finally, figure 3.3 shows the effects of the lack of a direct path on CSP. In this case the loudspeaker is aiming in the opposite direction, so the acoustic energy radiated

by the back of the loudspeaker is negligible and the microphones receive only acoustic waves coming from reflected paths. It is worth noting that in this configuration CSP does not present any clear peak, in particular when compared with figures 3.1 and 3.2. It is hence clear that CSP is not reliable when a directional source is not aiming at the microphone pair. Conversely, figure 3.1 proves that even a high reverberation level is not critical as long as a clear direct path exists between the source and the sensors.

Figure 3.3: Example of CSP computed when the sound source is aiming in the opposite direction with respect to the microphones (left) and when it is frontal (right).

Some algorithms are claimed to deal better with reverberation by estimating the impulse responses from the source to the receivers. Let us extend the notation defined in Chapter 2 by introducing the two impulse responses $h_1$ and $h_2$ from the source to the two acoustic sensors $m_1$ and $m_2$. If $s(t)$ is the original signal emitted by the source, the acquired signals $x_1(t)$ and $x_2(t)$ can be rewritten according to equation 2.1:

$$x_1(t) = h_1 * s(t) + n_1(t) \qquad (3.3)$$
$$x_2(t) = h_2 * s(t) + n_2(t) \qquad (3.4)$$

Let $X_1(f)$, $X_2(f)$, $H_1(f)$ and $H_2(f)$ be the spectra of the involved quantities; the above relations can be reformulated in the frequency domain,

neglecting the noise components:

$$X_1(f) = H_1(f)S(f) \qquad (3.5)$$
$$X_2(f) = H_2(f)S(f) \qquad (3.6)$$

The CSP formulation in turn can be rewritten by replacing equations 3.5 and 3.6 in equation 3.1:

$$\mathrm{CSP}(l) = \mathrm{FFT}^{-1}\left\{ \frac{H_1(f)S(f) \cdot H_2^*(f)S^*(f)}{|H_1(f)S(f)| \, |H_2(f)S(f)|} \right\} \qquad (3.7)$$

which after some straightforward mathematical simplifications reduces to:

$$\mathrm{CSP}(l) = \mathrm{FFT}^{-1}\left\{ \frac{H_1(f) \, H_2^*(f)}{|H_1(f)| \, |H_2(f)|} \right\} \qquad (3.8)$$

Performing a CSP analysis on the acquired signals therefore leads to the same solution as exploiting an estimation of $h_1$ and $h_2$, which convey all the information about the propagation through the audio channel. An estimation of the impulse responses does not seem to provide any benefit when the time difference between the two main peaks is used as TDOA estimator. However, in borderline situations, when for instance the direct path is missing or carries very low energy, knowledge of the real impulse responses could help disambiguating or could prevent erroneous choices. In particular, one could attempt to detect the exact starting time instant of each impulse response and derive the TDOA from them. Another interesting observation suggested by equation 3.8 is that the sensors must be close enough to guarantee that $h_1$ and $h_2$ are correlated. On the other hand, a larger sensor distance offers a higher TDOA resolution. In fact, the maximum time delay, which corresponds to a source in end-fire position, increases as the microphones are moved away from each other (equation 3.2).
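The cancellation of $S(f)$ leading to equation 3.8 can be checked numerically. The sketch below is only a toy verification under assumptions of our own (synthetic decaying impulse responses, circular convolution, no noise); it confirms that the CSP of the received signals coincides with the CSP computed directly on $h_1$ and $h_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
s = rng.standard_normal(n)                              # broadband source
h1, h2 = rng.standard_normal((2, n)) * np.exp(-np.arange(n) / 50.0)

def phat(a, b, eps=1e-12):
    cross = np.fft.rfft(a) * np.conj(np.fft.rfft(b))
    return np.fft.irfft(cross / (np.abs(cross) + eps), n=len(a))

# Received signals via circular convolution: x_i = h_i * s (no noise).
x1 = np.fft.irfft(np.fft.rfft(h1) * np.fft.rfft(s), n=n)
x2 = np.fft.irfft(np.fft.rfft(h2) * np.fft.rfft(s), n=n)
print(np.allclose(phat(x1, x2), phat(h1, h2), atol=1e-6))  # expected: True
```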

3.3 Signal Spectrum

From a more mathematical perspective, the main peak of CSP can be interpreted as occurring at the time lag corresponding to the linear phase that, in the frequency domain, best interpolates the phase of the Crosspower Spectrum. In these terms, it is clear that the bandwidth of the involved signals plays a fundamental role in determining the performance of a CSP-based TDOA estimator. In particular, narrow-band sounds, or sounds whose spectrum, even if spread, is clustered in a few frequency bins, are very difficult or even impossible to process with an approach like CSP. Typical speech sounds which present such critical spectral characteristics are occlusives, e.g. p and b, and vowels. On the contrary, the spectral contribution of fricative sounds, e.g. s, is spread over a wide band of frequencies and is therefore well suited to CSP. Figure 3.4 shows an example obtained with real data: the left picture depicts the spectrogram of the speech sequence "sce" (fricative + vowel). The vertical axis represents frequency, while time flows along the horizontal axis. Dark areas represent frequencies with high energy, while bright colors indicate low energy. In turn, the right part shows a sequence of CSPs computed over time (horizontal axis). Dark colors identify high coherence values, while bright colors mark time lags (vertical axis) characterized by a low coherence level. It is clear that CSP provides a sharp peak only during the fricative sound, while the vowel is not informative from this point of view. As a consequence, it is realistic to expect that when processing real speech data, the CSP outputs corresponding to those unfavorable sounds are not informative. A possible solution is to perform a weighted CSP according to precalculated speech spectra [36], in an effort to limit the analysis to those frequency bins which are more informative. Another possible approach proposes to modify the whitening filter by giving more emphasis to frequency bins with high energy [86].

Figure 3.4: Spectrogram and CSP derived from a fricative-vowel speech sequence.

Previous investigations conducted at ITC-irst showed that in this case the resulting function in the time lag domain is no longer a delta function. Nevertheless, as already mentioned, improving TDOA or CM estimation is not a goal of this thesis, hence we will adopt the classic CSP method, keeping its limitations in mind.

3.4 Sampling Problem

According to equation 3.1, CSP provides a coherence measure only for time delays corresponding to an integer number of samples, which limits the TDOA resolution to the sampling period. In some cases an interpolation or refinement method can be fruitfully applied to provide a more accurate estimation of the TDOA. For example, parabolic interpolation [70] allows refining the TDOA estimate as follows:

$$\tau' = \tau + \frac{1}{2} \cdot \frac{\mathrm{CSP}(\tau-1) - \mathrm{CSP}(\tau+1)}{\mathrm{CSP}(\tau-1) - 2\,\mathrm{CSP}(\tau) + \mathrm{CSP}(\tau+1)} \qquad (3.9)$$

where $\tau$ is the integer time lag that maximizes CSP and $\tau'$ is the refined time lag at sub-sample resolution. Oversampling obtained through zero padding may also turn out to be useful [82].
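Equation 3.9 corresponds to the small helper below; the function name refine_tdoa and the guard against a degenerate denominator are our own additions, and the peak is assumed not to lie at the border of the lag range.

```python
def refine_tdoa(csp_values, tau):
    """Sub-sample refinement of the CSP peak via equation 3.9.

    csp_values: CSP sampled on integer lags; tau: index of its maximum.
    Fits a parabola through the peak and its two neighbours and returns
    the abscissa of the vertex.
    """
    y_m, y_0, y_p = csp_values[tau - 1], csp_values[tau], csp_values[tau + 1]
    denom = y_m - 2.0 * y_0 + y_p
    if denom == 0.0:  # flat triple: no refinement possible
        return float(tau)
    return tau + 0.5 * (y_m - y_p) / denom
```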

Nevertheless, some practical issues remain open and must be taken into account, in particular when dealing with real data. First of all, the sampling grid is not aligned with the CSP peak, hence the main peak may be missed; in this case there is no way to recover the correct peak. A possible solution exploits a Crosspower Spectrum rotation that allows sampling the time-domain function at non-integer time instants [82], but at the cost of a computationally burdensome process. Finally, as shown in figure 3.1, in ideal conditions the main peak is very sharp, since CSP is close to being a delta function: evaluating the coherence measure just one sample before or after the correct one could lead to picking the wrong peak.

3.5 Sound Activity Detection

Figure 3.1 shows that when the active sound source is in a frontal position with respect to the microphones, CSP presents a sharp and high peak at the time lag equivalent to the actual TDOA. The maximum value of the CSP can then be exploited to infer the presence of an active sound source in the room [8]. Unfortunately, a speech activity detection system based on CSP requires that the source is aiming at the sensors, and it is prone to errors when the spectrum of the signal is not wide enough (see section 3.3). Nevertheless, a localization system based on CSP offers an implicit and efficient speech activity detector that may prove useful to automatically restrict the localization outputs to the time instants when an active source is likely to be present. Figure 3.5 reports on the behaviour of the maximum value of CSP in the presence or absence of an active sound source in a room. A post-processing of the CSP outputs was performed to avoid large fluctuations due to short pauses and narrow-band sounds. It is worth noting that a simple threshold on the CSP peak allows extracting the silence segments from an audio signal.
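A minimal sketch of such a detector is given below, assuming that the per-frame CSP maxima are already available; the threshold value and the run-length smoothing are illustrative choices, not the post-processing actually tuned at ITC-irst.

```python
import numpy as np

def detect_activity(csp_peaks, threshold=0.2, min_run=3):
    """Flag frames whose CSP maximum exceeds a fixed threshold.

    Activity bursts shorter than `min_run` frames are discarded to
    smooth out fluctuations due to pauses and narrow-band sounds.
    """
    active = np.asarray(csp_peaks) > threshold
    out = active.copy()
    start = None
    for i, a in enumerate(np.append(active, False)):  # sentinel frame
        if a and start is None:
            start = i
        elif not a and start is not None:
            if i - start < min_run:
                out[start:i] = False  # drop the short burst
            start = None
    return out
```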

Figure 3.5: Trend of the CSP main peak (above) with respect to a recorded speech signal (below).

Chapter 4

Distributed Microphone Networks

The task of estimating the position of an acoustic source in the CHIL project differs in unique ways from the classical SSL problems investigated in the literature. The aim envisioned by the CHIL consortium of making users unaware of the underlying network of sensors, letting people behave as they want, introduces a set of restrictions on the technologies charged with the SSL task. Moreover, dealing with real data and real-time processing constraints reduces the possibility of relying on assumptions and simplifications of the signal propagation model. The following issues are worth mentioning as particularly crucial in the addressed scenario.

Background noise can no longer be modeled as white noise. At the very least, it must be taken into account that coherent noise sources such as fans, computers, printers, chairs, doors, other persons, etc. critically affect the quality of the acoustic signals. If the noise were stationary, one could attempt to estimate its statistics in order to handle it better [89], but unfortunately the given scenario is totally uncontrolled and offers few guarantees of noise stationarity.

Adopting only a compact microphone array, either linear or of any other shape, is not feasible, as speakers are absolutely free to move within the room. There is instead the need to guarantee a good coverage in space,

in order to ensure that good quality signals can be acquired.

Human beings are quite directional sound sources. Depending on the speaker position/orientation, the acoustic measurements computed at some receivers may be completely faulty [1].

In an effort to address the issues introduced by the project requirements, a so-called Distributed Microphone Network (DMN) was adopted. A DMN is a set of microphones, possibly grouped in compact arrays, placed all around a room in order to always guarantee a good coverage in space. In fact, as shown in Chapter 3, microphones reached only by reflected acoustic waves are not useful for localization purposes. On the other hand, distributing microphones all around the room ensures that at least a subset of them receives direct wavefronts coming from the source. Nevertheless, given the same number of microphones, other technologies such as spatial filtering, speech enhancement and feature extraction benefit more from a linear microphone array, and their application to DMNs has not been investigated yet. Rooms equipped with a DMN are called CHIL rooms.

4.1 DMN in CHIL

The adoption of similar sensor arrangements in the different laboratories involved in the CHIL project ensures the availability of common data and common conditions that permit comparing results and experiences obtained through different approaches. Taking into account each task addressed by the project, the consortium agreed on defining a minimum sensor set-up for the DMN to be implemented at each site:

- one 64-channel microphone array (NIST Mark III);
- at least 3 microphone arrays arranged in a reverse T-shape;
- close-talk microphones;

- table-top microphones.

Besides acoustic sensors, CHIL rooms are also furnished with several cameras for vision-based activities. As far as the SSL task is concerned, the DMN consists of a set of microphone clusters arranged in a reverse T shape (hereafter T-shaped arrays). A T-shaped array carries the advantages of compact microphone arrays, which can either autonomously provide a source localization estimate or perform a more robust process to derive a TDOA or DOA estimation. At the same time, having a set of distributed clusters complies with the uniform coverage requirement. Figure 4.1 depicts the layout of a T-shaped array. This particular geometry, allowing the determination of both the azimuth and elevation angles relative to each array, permits a 3D estimation of the source position. Microphone distances were selected as a trade-off between good resolution and applicability of CSP. In fact, while equation 3.8 entails that microphones must be kept close to guarantee high coherence levels, increasing the microphone distance increases the TDOA resolution and reduces the sensitivity of the DOA to noisy TDOA measurements, according to the following equation [16]:

$$\sigma_\theta = \frac{c^2\, \sigma_\tau}{\|m_1 - m_2\|^2 \sin^2(\theta)} \qquad (4.1)$$

where $\sigma_\tau$ is the variance of the time delay estimate, $\sigma_\theta$ is the variance of the corresponding direction of arrival estimated by applying equation 2.20, and $\theta$ is the actual direction of arrival.

As for the remainder of the available DMN, the NIST Mark III was mainly used for beamforming and speech recognition, and the close-talk microphones were used for transcriptions. Table-top microphones were instead exploited for other speech processing tasks such as Speech Activity Detection (SAD) and Speaker ID. Figure 4.2 shows a map of the DMN available at the ITC-irst laboratories, where 7 microphone clusters can be exploited for SSL purposes.

Figure 4.1: T-shaped microphone array geometry.

Figure 4.2: Map of the CHIL room available at the ITC-irst laboratories.

Most of the experiments reported in this thesis were carried out with signals acquired in the ITC-irst CHIL room. Figure 4.3 depicts the DMN available at the Karlsruhe University laboratories, where 4 T-shaped arrays are present. Data acquired in the latter room have also been used exhaustively in this research activity for evaluation purposes.

Very critical issues when dealing with sensor networks are synchronization and time alignment between the recorded signals. In fact, if one wants to improve an algorithm based on phase analysis, such as CSP, by exploiting the

redundancy provided by a DMN, the signals must be aligned at the sample level to obtain satisfactory results. Other speech processing applications, like for instance beamforming, also need accurate time alignment. For details about how these aspects were dealt with, and for a more accurate description of the ITC-irst CHIL room, refer to Appendix B.

Figure 4.3: Map of the CHIL room available at the Karlsruhe University laboratories.

4.2 Microphone Deployment

Given the same number of microphones, their relative positions considerably influence the performance that can be achieved by a localization algorithm [16]. It is hence worth investigating which deployments better cope with the particular localization scenario addressed in this dissertation.

Figure 4.4 shows the source position estimates for two different sensor set-ups when the TDOA measurements are noisy. For the sake of simplicity the analysis is reduced to a scenario with 2 microphone pairs, and the localization is performed by simple triangulation. For each position, the true time delay of arrival is corrupted by additive gaussian noise.

Figure 4.4: Influence of noisy TDOA measurements on the final position estimate, applying triangulation and two different microphone deployments.

First of all it is worth noting that, as predicted by equation 4.1, the localization process is more sensitive to measurement noise when the source is in end-fire position with respect to the microphone pairs. The picture on the right makes clear that when the microphone pairs are placed in a parallel manner, the uncertainty is higher in the direction orthogonal to the microphone pair axes. On the contrary, when the microphones are deployed in an orthogonal way, the effect of measurement errors is attenuated, in particular for central positions. Nevertheless, a set of very unfavorable positions remains, corresponding to the line joining the microphone pairs. The same effect occurs when the microphone pairs face each other. This brief analysis of microphone placement suggests that a good microphone deployment should guarantee the availability of orthogonal sensor clusters and should give the possibility of selecting subsets of microphones with respect to which the sources are in more favorable positions. A sensor deployment like the one

sketched in figures 4.2 and 4.3 seems to comply with the above-mentioned requirements and guarantees a good coverage as well.

4.3 The Proposed Approach

Although a DMN seems suitable for the given localization problem, dealing with the information delivered by a network of spread sensors is not trivial, and a selection process is needed due to the aforementioned issues related to source directivity. As a matter of fact, only those microphones that receive direct wavefronts are useful to the localization process, while the others are unreliable due to reflections. The approach proposed to tackle the localization problem, as envisioned in CHIL, is an implementation of the Global Coherence Field theory which, as the next chapter will point out, can implicitly deal with the problems introduced by a DMN. In an effort to overcome the limitations of this well-known approach, we attempted to reinforce the localization process by embedding knowledge of the source orientation. Indeed, even a rough estimate of the direction in which the source is oriented can be employed to give more emphasis to those sensors that, being in favorable positions, are supposed to be reliable.

The Global Coherence Field was chosen among all the possible solutions for a series of mainly practical reasons. First of all, it is a straightforward and rather intuitive manner of combining the contributions provided by a DMN. Moreover, the first investigations conducted at ITC-irst in this field date back to 1993 [77], so this approach is well known and a solid starting point for the foreseen investigation of the usefulness of the source orientation. Besides these practical issues, the Global Coherence Field was also adopted because it offers a high level of generalization and treats in the same way sound sources in either near- or far-field conditions. Taking into account

the applicative scenarios envisioned in the CHIL project and the different acoustic conditions that the localization system was going to face, being as flexible as possible is evidently a crucial feature. Furthermore, the Global Coherence Field immediately turned out to be suitable for extension with the orientation knowledge. However, it is reasonable to assume that several alternative approaches, such as closed-form techniques or particle filtering, could be fruitfully exploited in the given context and enriched with the source orientation information. In particular, a particle filter could be applied to account also for the history and improve tracking performance. Nevertheless, a memory-based localization algorithm is beyond the intents of this work, which is focused on improving localization performance on the single time frame. In any case, once a more robust combination of the single contributions is available, it could easily be integrated with a memory-based technique.

Chapter 5

The Global Coherence Field

5.1 Definition

A direct approach, which implicitly gives more emphasis to informative microphone pairs and reduces the effects of non-informative pairs, is an appealing solution to adopt in an environment fitted out with a DMN. The first and most straightforward method to handle a DMN is a steered beamformer approach. Alvarado in [4] introduced the concept of Power Field (PF): the idea is to steer a network of $N_s$ sensors in order to scan a grid $\Sigma$ of potential sound source positions. For each spatial point $p \in \Sigma$, $PF(t, p)$ represents the energy of the output of a beamformer steered to the given point:

$$PF(t, p) = \sum_{k=-L/2}^{L/2} \left[ \sum_{i=0}^{N_s-1} G_i\, x_i(t + k - \delta_i) \right]^2 \qquad (5.1)$$

where $G_i$ counterbalances the attenuation introduced by the acoustic wave propagation, $L$ is the length of the analysis window and $\delta_i$ is the propagation time that is compensated to align all the signals. The position estimate is the point in space with the highest energy, obtained by maximizing the PF:

$$\hat{p}_{PF} = \arg\max_{p \in \Sigma} PF(t, p) \qquad (5.2)$$

A more recent work [88] extends the concept of PF and introduces weighting functions to improve the localization process. A direct extension of PF was introduced in [77] and [81]. The authors proposed an approach based on coherence rather than power and defined the Global Coherence Field (GCF). The GCF is defined as a function over the grid $\Sigma$ representing a measure of the plausibility that a sound source is active at the given spatial position $p \in \Sigma$:

$$GCF: \Sigma \rightarrow \mathbb{R} \qquad (5.3)$$

Let us consider a set of $M$ microphone pairs and the corresponding set of theoretical time delays $T_i(p)$ ($i = 0, \ldots, M-1$) calculated according to equation 2.7. If we assume that we can compute a Coherence Measure $C_i(t, \tau)$ representing the plausibility that the correct time delay at time instant $t$ for the $i$-th microphone pair is $\tau$, we can define the GCF as follows:

$$GCF(t, p) = \frac{1}{M} \sum_{i=0}^{M-1} C_i(t, T_i(p)) \qquad (5.4)$$

For the sake of simplicity and without loss of generality, the time variable $t$ is neglected in the remainder of this chapter. Figure 5.1 shows a graphical representation of the procedure for computing the GCF. In the same way as for the PF, the sound source position is estimated as the point that maximizes the GCF:

$$\hat{p}_{GCF} = \arg\max_{p \in \Sigma} GCF(p) \qquad (5.5)$$

In other words, a localization method based on GCF hypothesizes that the source is in position $p$ and checks the consistency of the theoretical time delay $T_i(p)$ of each available microphone pair with the CMs delivered by the DMN. Functions like GCF and PF, which represent the acoustic activity in an enclosure, are also referred to as sound maps.
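Equations 5.4 and 5.5 map directly onto a grid scan. The sketch below assumes the per-pair CMs have already been sampled on a common set of integer lags (e.g. the CSP values of Chapter 3); the name gcf_map and the nearest-lag lookup, with out-of-range delays clamped to the border, are illustrative simplifications.

```python
import numpy as np

def gcf_map(coherences, lags, mic_pairs, grid, fs, c=343.0):
    """Sampled GCF (equation 5.4) and its maximizer (equation 5.5).

    coherences: list of M arrays, the CM of each pair sampled on `lags`.
    mic_pairs: list of M (pos_a, pos_b) microphone coordinates.
    grid: (P, 3) candidate source positions (the grid Sigma).
    """
    gcf = np.zeros(len(grid))
    for cm, (ma, mb) in zip(coherences, mic_pairs):
        # Theoretical delay T_i(p) in samples for every grid point.
        t_i = (np.linalg.norm(grid - ma, axis=1)
               - np.linalg.norm(grid - mb, axis=1)) / c * fs
        # Look up the CM at the nearest sampled lag.
        idx = np.clip(np.searchsorted(lags, np.round(t_i)), 0, len(lags) - 1)
        gcf += cm[idx]
    gcf /= len(mic_pairs)
    return gcf, grid[np.argmax(gcf)]
```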

Figure 5.1: Scheme of the computation of the GCF at a given point $p$, restricted to the $i$-th and $(i+1)$-th microphone pairs.

In fact, from an analysis of the shape of these functions (peaks, valleys, etc.) it is possible to infer knowledge about the acoustic events ongoing in a room. Even if it has been demonstrated with practical experiments that the sum of $GCF(t, p)$ over $\Sigma$ is a constant given the same sensor set-up and acoustic conditions (i.e. level of noise, sources, reverberation, etc.), a theoretical proof is beyond the purposes of this work and has not been investigated yet. As a consequence, the GCF cannot be formally interpreted as a probability distribution function. Even so, when no knowledge is given about the statistics involved, a CM can be employed as an equivalent representation of a probability distribution. From this point of view, the GCF can be interpreted as an equivalent representation of the likelihood exploited in the general ML formulation:

$$GCF(p) \propto \log P(p \mid \epsilon) \qquad (5.6)$$

where $\epsilon$ is the set of observed features, e.g. the acoustic coherence measurements of each microphone pair. Under this assumption, the GCF solves the ML problem (equation 2.9) restricting the solution space to $\Sigma$ instead of performing an exhaustive search. Although this interpretation is not formally correct, the parallelism with a well-known context better clarifies the

usefulness and the meaning of a GCF representation. An important advantage of GCF, and more generally of direct approaches, is the complete lack of assumptions or simplifications about both the signal propagation and the environmental conditions. The same algorithm works in both near- and far-field conditions and does not require any knowledge about the statistics of the involved quantities. In most practical applications, assumptions about noise statistics and far-field conditions are seldom satisfied, and the performance of algorithms that rely on them is critically degraded. A direct consequence of the high level of generalization offered by GCF is also a considerable gain in flexibility, which is always appreciated from an implementation point of view.

As already mentioned in Chapter 2, a noteworthy feature of direct approaches, like PF and some implementations of GCF that we will see later, is that decisions can be delayed until after the combination of the single contributions. In this sense, direct approaches satisfy the least commitment principle, which is peculiar to the artificial intelligence research area [90] and claims that deferring a decision as long as possible increases the probability that the decision is correct. A practical example better clarifies how deferring a decision can improve localization performance. Let us consider a situation like the one depicted in figure 5.2, where a loudspeaker is rotated 45° to the left with respect to the microphones and a strong reflection occurs on the wall. Due to either some acoustic phenomena or the sampling process, the CSP analysis provides two major peaks: the lower one consistent with the actual source position and the higher one introduced by the strong reflection on the wall, as depicted in figure 5.3. A 2-step method selects the wrong peak, making the microphone pair useless or, even worse, weakening the whole localization system. On the contrary, if the decision is deferred and both peaks are kept, the minor one still contributes, emphasizing the correct source position.

Figure 5.2: A loudspeaker is located in front of a microphone pair but is oriented towards a corner. The main part of the radiated energy reaches the microphones after a reflection, while the energy traveling along the direct path is lower due to the loudspeaker radiation pattern.

Figure 5.3: CSP computed when a loudspeaker located in front of a microphone pair is rotated 45° to the left with respect to the microphones, as illustrated in figure 5.2.

5.1.1 Spatial Sampling

The spatial sampling introduced in the computation of the GCF can cause the whole localization process to fail. Unless the grid is dense enough, there is no guarantee that the main peak is present in the sampled version of the acoustic activity map. Moreover, the GCF may potentially be a very sharp function of space, introducing the need for a very high spatial sampling rate. Even worse, the GCF could be based on a discrete CM, such as CSP, that introduces a second level of sampling.

5.2 Coherence Measure

In principle any CM can be adopted in the GCF computation. In practice, in an SSL context, algorithms are commonly based on phase analysis. A first definition of CM may rely on TDOA estimation and frame the analysis in a LS context. Let $\tau_i$ be the TDOA estimated at the $i$-th microphone pair; the LS-based CM is then defined as [86]:

$$C_i^{LS}(T_i(p)) = -\|T_i(p) - \tau_i\|^2 \qquad (5.7)$$

The additive inverse is taken to stick to the convention that high values of the CM correspond to high levels of coherence. Substituting equation 5.7 in equation 5.4, the formulation of GCF is rewritten as:

$$GCF^{LS}(t, p) = -\frac{1}{M} \sum_{i=0}^{M-1} \|T_i(p) - \tau_i\|^2 \qquad (5.8)$$

$GCF^{LS}$ provides a suboptimal solution of the LS problem (equation 2.18). An alternative approach may instead consider a gaussian function of the difference between the actual and the estimated time delays. In this case the CM is expressed as follows:

$$C_i^{exp}(T_i(p)) = e^{-\|T_i(p) - \tau_i\|^2} \qquad (5.9)$$

which leads to the following formulation of GCF:

$$GCF^{exp}(t, p) = \frac{1}{M} \sum_{i=0}^{M-1} e^{-\|T_i(p) - \tau_i\|^2} \qquad (5.10)$$

A standard deviation $\sigma_i$, accounting for the sharpness of the loss function, can be introduced if there is any clue about the statistics involved. It is worth mentioning that both $C^{LS}$ and $C^{exp}$ take the decision on the single contributions instead of deferring it. From this point of view, more complex loss functions, which for instance consider more than one TDOA estimate, are more appropriate as CMs, as suggested for instance in [6], where TDOAs are exploited in a TRINICON framework for BSS. A definition of CM that better fits the given localization scenario is instead based directly on CSP:

$$C_i^{CSP}(T_i(p)) = CSP_i(T_i(p)) \qquad (5.11)$$

According to its formulation, CSP represents a measure of the reliability of a given time delay. Although in this case local interpolation or refinement of the CSP function is no longer feasible, a local smoothing process could prevent problems due to the spatio-temporal sampling. Applying the CM defined in equation 5.11, we can introduce $GCF^{CSP}$:

$$GCF^{CSP}(t, p) = \frac{1}{M} \sum_{i=0}^{M-1} CSP_i(T_i(p)) \qquad (5.12)$$

$GCF^{CSP}$ has been exhaustively investigated by the SSL community in recent years in many different formulations and implementations [48, 18, 2]. Furthermore, the first investigations on GCF carried out at ITC-irst [77, 81] referred to a CSP-based CM. For all the above-mentioned reasons, this dissertation adopts a CSP approach as the technique of choice to compute the CM.
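For completeness, the three CMs of this section can be written as small functions to be plugged into a grid scan like the gcf_map sketch given earlier; the default $\sigma_i = 1$ and the nearest-lag lookup are our own illustrative choices.

```python
import numpy as np

def cm_ls(t_p, tau_i):
    """LS-based CM of equation 5.7 (sign flipped so high = coherent)."""
    return -np.abs(t_p - tau_i) ** 2

def cm_exp(t_p, tau_i, sigma_i=1.0):
    """Gaussian CM of equations 5.9-5.10, with the optional sigma_i."""
    return np.exp(-np.abs(t_p - tau_i) ** 2 / sigma_i ** 2)

def cm_csp(t_p, csp_i, lags):
    """CSP-based CM of equation 5.11: look up the coherence of the
    i-th pair at the theoretical delay (nearest sampled lag)."""
    idx = np.clip(np.searchsorted(lags, np.round(t_p)), 0, len(lags) - 1)
    return csp_i[idx]
```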

5.3 SSL based on GCF

The localization capacity of GCF was evaluated on a small database collected in the CHIL room available at ITC-irst, which is equipped with 7 T-shaped arrays. The room is characterized by a quite challenging reverberation time $T_{60} = 0.7$ s. A detailed description of both the room acoustics and the sensor set-up is provided in Appendix B. A white noise sequence was reproduced by a Tannoy high-quality loudspeaker placed in 5 different positions, as shown in figure 5.4. For each position the loudspeaker was aimed at 8 different orientations, with 45° angular spacing, in order to complete a loop around the loudspeaker axis.

Figure 5.4: The picture on the left shows where the loudspeaker was located during the database collection with respect to the ITC-irst DMN. The picture on the right shows the orientations investigated.

For the sake of simplicity, the SSL problem is restricted to 2 dimensions. For each time frame, the GCF is computed exploiting each available horizontal microphone pair (i.e. 21 microphone pairs in the given set-up). The grid $\Sigma$ has a resolution of 20 mm in both dimensions. The height of the source is given, and the sound map analysis is restricted to the horizontal plane that includes the

source. Given a sampling rate of 44.1 kHz, the analysis window for the CSP computation is set to $2^{14}$ samples with an overlap factor of 4, corresponding to about 10 sound map computations per second. The estimation of the source position is performed frame by frame, i.e. without memory, by picking the peak of the GCF.

5.3.1 Sound Maps

With merely didactic intent, this section presents some examples of how sound maps look in different source configurations and applying different approaches to the CM computation. Figures 5.5, 5.6 and 5.7 show, for example, sound maps based on the three different CMs introduced above. The loudspeaker is located in the upper right corner (Pos3) and is aiming at the opposite corner (315°). Bright colors represent high GCF values, while black or dark colors mark areas where the plausibility that an active source is present is very low. Each GCF presents a peak, the bright spot, corresponding to the correct source position.

Figure 5.5: Example of GCF based on TDOA estimates adopting a normal gaussian loss function. The source is located in the upper right corner.

It can be observed that the

main peak benefits from only a reduced set of microphones. Bright lines, hyperbolic curves actually, represent the loci of potential source locations related to the directional coherence observed by the single microphone pairs. In other words, a bright line represents the locus of points whose theoretical time delays are equal to the estimated ones, or to CSP peaks. The microphones from which the lines contributing to the main GCF peak depart receive direct wavefronts and are useful for inferring the source position. Other, less bright areas account for the effects of reflections and reverberation in the room. In this particular configuration, $C^{exp}$ seems to be more suitable, since it delivers a sharp and effective sound map (figure 5.5). Note anyway that in figure 5.5 the lines departing from arrays T5 and T6 do not contribute to the main peak but aim at reflections on the wall. Even if noisier, $GCF^{CSP}$ also presents an evident peak in the area where the source is located (figure 5.7). Computing a CM from a LS perspective seems instead to offer poor spatial resolution (figure 5.6). Nevertheless, when the number of useful pairs is reduced and the effects of reflections increase, the CSP-based CM outperforms both TDOA-based approaches.

Figure 5.6: Example of $GCF^{LS}$. The source is located in Pos3.

Figure 5.7: Example of $GCF^{CSP}$. The source is located in the upper right corner.

A simple and effective demonstration is obtained when the loudspeaker is in the central position (Pos1), that is a favorable position, and aims at the bottom right corner of the room (45°). Since the adopted Tannoy loudspeaker is quite directional, only two microphone clusters can rely on direct wavefronts, which are emitted from the lateral part of the loudspeaker radiation lobe (arrays T0 and T1 with respect to figure 5.4). A rather extreme approach like adopting a gaussian loss function emphasizes the effects of reflections, generating a faulty and misleading sound map. In fact, a second strong peak arises due to reflected paths, as depicted in figure 5.8. $C^{LS}$ offers poor performance and low resolution in this case as well, as reported in figure 5.9. On the other hand, a CSP-based method, which assigns a reliability measure to each time delay and defers decisions as much as possible, allows correctly estimating the source position even in this critical situation, by giving rise to coherent contributions (figure 5.10).

Figure 5.8: Example of $GCF^{exp}$. The source is in the middle of the room. Notice the spurious peak generated by reflections.

Figure 5.9: Example of $GCF^{LS}$. The source is in the middle of the room.

5.3.2 Localization Performance

For a more formal analysis of the potential of $GCF^{CSP}$, figure 5.11 reports the localization performance, in terms of RMS error, for each orientation when the source is in the central position (Pos1).

Figure 5.10: Example of GCF based on $C^{CSP}$ when the source is in Pos1.

Without ambiguity, the remainder of this thesis will refer to $GCF^{CSP}$ simply as GCF. The localization error is computed as the euclidean distance between the correct source position and the localization output for each of the 400 frames evaluated.

Figure 5.11: RMS error of the localization estimates for each orientation investigated, adopting GCF. The source is in Pos1.

It is worth noting that some orientations are more prone to

errors than others. In particular, as reported in figures 5.12 and 5.13, for some orientations the localization system is less accurate in the x coordinate estimation, while for other source configurations the errors increase in the y coordinate estimation. As a matter of fact, while the source rotates, the number, the relative positions and the coverage of those microphone pairs that contribute positively to the GCF change, influencing the localization performance.

Figure 5.12: RMS error restricted to the x coordinate when the loudspeaker is aimed at different orientations.

Figure 5.14 reports the localization estimates obtained by picking the maximum of the GCF for a whole loop of the loudspeaker around its axis. Figure 5.15 compares the localization performance in terms of RMS error for each position, considering a complete loop of the loudspeaker. Intuitively, being in a central position with respect to the DMN, Pos1 delivers the best performance. On the other hand, the remaining 4 positions are located near the corners and so include very unfavorable source configurations, as for instance when the source is aimed toward a wall and no microphone gathers clean direct

wavefronts, for which an estimate of the source position is not feasible. As a consequence, the overall performance for these positions is considerably degraded with respect to Pos1.

Figure 5.13: RMS error restricted to the y coordinate when the loudspeaker is aimed at different orientations.

Figure 5.14: Source position estimates obtained by picking the maximum of GCF. The green circle indicates the actual source position.

So far, the analysis was conducted using white noise in order to neglect the CSP

dependence on signal spectral features, as explained in Chapter 3. However, since the goal of this research is dealing with human speakers, it is worth investigating the localization performance of the proposed approach when fed with speech.

Figure 5.15: Overall RMS error, including all the orientations, for each position investigated.

For comparison purposes, figure 5.16 depicts the x coordinate estimates when a loudspeaker in the central position reproduces white noise and speech respectively, with the orientation set to 45°. Notice the stability of the localization output in the white noise case, in contrast with the errors that characterize the localization output in the speech case, which are to be ascribed to pauses or narrow-band sounds. Thresholding the GCF peaks is a simple and efficient manner to skip non-informative speech frames and guarantee a more reliable localization output. When some guarantees about source steadiness or low mobility are given, a smoothing process could also provide a gain in stability and accuracy [2]. Figure 5.17 compares the localization performance in the speech and white noise cases in terms of RMS error for each orientation when the source is in position Pos1. In the speech case the localization outputs are evaluated only in the time

Figure 5.16: Source position estimates restricted to the x coordinate. The source is in Pos1 and the orientation is 45°.

Figure 5.17 compares localization performance in both the speech and white noise cases in terms of RMS error for each orientation when the source is in position Pos1. In the speech case, localization outputs are evaluated only in time frames providing a high GCF peak, in order to permit a fairer comparison. The high error that occurs when the orientation is 90° is attributable to the asymmetrical set-up and to the source directivity. As depicted in figure 5.4, in the given source configuration only one microphone cluster, namely T0, receives direct wavefronts, while arrays T1 and T6 can rely only on the energy emitted from the lateral side of the source radiation pattern. As a consequence, having only parallel microphone pairs, the discriminative capability on the orthogonal dimension (i.e. the x coordinate in this case) is very low, as outlined in figure 4.4. This problem is attenuated when the loudspeaker plays white noise, since even lateral pairs can deliver reliable CMs due to the wide band of the received signals.

Figure 5.17: RMS error for each orientation investigated when the input signal is speech and white noise and the source is in Pos1.

5.4 Real-time Demo

Although it requires a comprehensive search for the maximum peak over the whole grid Σ, GCF is suitable for adoption in real-time applications. A

real-time demo, which exploits 7 microphone pairs, is currently running with satisfactory localization performance in the CHIL room arranged at the ITC-irst laboratories. Moreover, the straightforward way in which microphone contributions are combined offers a high degree of portability and permits quick adaptation of the system to variations in the sensor set-up. Computational load is not really an issue, as the algorithm can be scaled down according to the computational power available and the accuracy requirements. Nevertheless, the computational load could be further reduced by employing a coarse-to-fine search [42], sketched below, or by reducing the search space with a pre-localization step based on SX, SI or LI, as done in [83].
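A minimal sketch of such a coarse-to-fine search follows; gcf_at(x, y) stands for the evaluation of the sound map at a single point and the step values are illustrative assumptions, the point being that the fine grid is evaluated only around the coarse maximum.

import numpy as np

def coarse_to_fine_argmax(gcf_at, bounds, coarse_step=200.0, fine_step=20.0):
    # bounds = (x_min, x_max, y_min, y_max) in mm.
    x_min, x_max, y_min, y_max = bounds
    # Coarse pass over the whole search area.
    xs = np.arange(x_min, x_max, coarse_step)
    ys = np.arange(y_min, y_max, coarse_step)
    scores = np.array([[gcf_at(x, y) for y in ys] for x in xs])
    i, j = np.unravel_index(scores.argmax(), scores.shape)
    cx, cy = xs[i], ys[j]
    # Fine pass restricted to the neighborhood of the coarse maximum.
    best, best_p = -np.inf, (cx, cy)
    for x in np.arange(cx - coarse_step, cx + coarse_step, fine_step):
        for y in np.arange(cy - coarse_step, cy + coarse_step, fine_step):
            s = gcf_at(x, y)
            if s > best:
                best, best_p = s, (x, y)
    return best_p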

5.5 Multiple Speakers

When two or more speakers are active exactly at the same time, the GCF sound map presents two or more peaks. Hence, although the peaks will be less sharp and lower than in the single-speaker case, GCF makes it possible to handle the multi-source scenario. In this case, efficient and effective algorithms for relative maxima search must be implemented to obtain the local peaks corresponding to each source. A tracking algorithm based on a particle filter could efficiently address the problem of finding those local maxima associated with the sources. Nevertheless, in real scenarios, hardly ever are two speakers active exactly at the same time, for the same duration and with the same spectral content of uttered speech; therefore it is likely that single peaks appear alternately in the areas corresponding to the speakers. In these conditions, clustering [38] of the GCF peaks seems to be a reasonable manner to identify the position of each speaker. The localization problem is remarkably complicated when the number of sources is not known. A random finite set approach could effectively handle the unknown number of sources [68].

5.6 Multimodal System

Some efforts were devoted to porting the GCF theory into a video tracker based on a particle filter. The chance of fusing multimodal information is appealing, in particular when the number or the effectiveness of unimodal sensors is limited by physical or economical constraints. Generative approaches for video person tracking exploit a model of the target, for instance consisting of a coarse shape and colors, acquired in a preliminary stage [62]. The tracking is performed with a particle filter where the target state is defined in terms of position and velocity. As done in common particle filter approaches, once particles have been propagated according to the motion model, likelihoods are computed from the video signals. When more cameras are available, each particle weight is derived as the product of the likelihoods obtained from the single views [63]. A multimodal tracker was devised as an extension of the video-only

tracker. Audio information is interpreted as an additional source of likelihood: the value of GCF at the spatial point identified by the particle state is taken as a measure of the particle likelihood. Nevertheless, a regularization of GCF was needed to guarantee good tracking performance. GCF sharpness is appreciated in frameworks where a global maximization is performed, whereas highly irregular and sharp likelihoods do not fit well the particle filter [101], which relies on a local search for maxima. In order to preserve the overall acoustic information, the regularization was performed locally on single CSP contributions [19], as depicted in figure 5.18. Inevitably, even if local, the regularization reduces the resolution of the CSP analysis.

Figure 5.18: Example of a CSP function regularized around the time delay −9 samples. The maximum in the range [−14 : −4] corresponds to −7 samples.

Given a spatial position p, indicated by a particle, and the corresponding time delay of arrival T_i(p) at the i-th microphone pair, the algorithm searches for the local CSP maximum ν in the interval I = [T_i(p) − ζ, T_i(p) + ζ] centered at the theoretical time delay:

ν = max_{l ∈ I} CSP_i(l)    (5.13)

and for the corresponding time lag l_ν:

l_ν = arg max_{l ∈ I} CSP_i(l)    (5.14)

The particle weight w_i(p) is computed as the local maximum value of CSP weighted by the distance from the theoretical time delay:

w_i(p) = ν (1 − |T_i(p) − l_ν| / ζ)    (5.15)

In a multimodal tracker the acoustic likelihood is reliable only when an active source is present in the room. Fusion can then be performed only in time instants characterized by some audio activity; otherwise the tracker must rely on the video signals only. The decision of whether or not to exploit the audio signals was based on the maximum peak of GCF exceeding a given threshold. Evaluations conducted under the CLEAR 2006 evaluation campaign¹ proved that adding audio knowledge helps video trackers deal with unfavorable situations where the number of available sensors is limited due to occlusions (i.e. a person is occluded by another person) or other phenomena. Further details about the video tracker and the evaluation results can be found in [19].

¹ Further details on the evaluation campaign can be found logging in at: http://www.clearevaluation.org
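For illustration, the regularization of equations 5.13–5.15 can be sketched as follows; the CSP of the i-th pair is assumed to be available as an array of values indexed by integer lag with lag 0 at the center, and the interval is assumed to fall within the array bounds (all names are illustrative).

import numpy as np

def particle_weight(csp, tau_p, zeta):
    # tau_p is the theoretical TDOA T_i(p) for the particle position p, in samples.
    offset = len(csp) // 2                        # index of lag 0
    lo = int(round(tau_p - zeta)) + offset
    hi = int(round(tau_p + zeta)) + offset
    window = csp[lo:hi + 1]                       # interval I of eq. 5.13
    nu = window.max()                             # local maximum, eq. 5.13
    l_nu = int(np.argmax(window)) + lo - offset   # corresponding lag, eq. 5.14
    return nu * (1.0 - abs(tau_p - l_nu) / zeta)  # weight, eq. 5.15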


Chapter 6

The Oriented Global Coherence Field

6.1 Definition

It is rather intuitive that, in the given SSL scenario, knowledge about the source orientation can be advantageously exploited to improve the accuracy of a localization algorithm. While the GCF based algorithm maximizes the sum of the CSP values from all the microphone pairs, uniformly weighted since no information about the source orientation is available a priori, it would be more profitable to give more emphasis to the contributions of those microphone pairs receiving direct wavefronts and less emphasis to those collecting mostly reflections. From this perspective, even a rough knowledge about where the source is aimed could be fruitfully exploited to weight single pair contributions accordingly. In the previous chapter it was shown that a sound field representation such as GCF carries cues about the source orientation. Indeed, bright lines in the GCF graphical representation identify microphone pairs that receive direct wavefronts, entailing information about the source orientation (see figure 5.7). The study of the shape of sound maps around a given point leads to the idea of an Oriented GCF (OGCF), which embeds

also information about the speaker's head pose. OGCF is defined as an extension of GCF to a four-dimensional domain. Given a set of N predefined possible orientation angles ϕ_d ∈ Φ (d = 0..N−1), OGCF is formulated as a mapping function from the cartesian product between the set of all possible positions Σ and the set of all possible orientations Φ to the set of real numbers:

OGCF : Σ × Φ → R    (6.1)

In a similar way as GCF, OGCF associates to each couple (p, ϕ_d) a score that represents the plausibility that an active sound source is present in p and is oriented toward the angle ϕ_d. Hereafter we will refer to the orientation index d as the orientation corresponding to the predefined angle ϕ_d.

6.2 OGCF Computation

Given a DMN consisting of M microphone pairs, the OGCF map is derived, for each point p and orientation d, by considering the coherence contributions at M points K_i (i = 0..M−1) on a circle C of radius r around the point p, according to the formula [21, 22]:

OGCF(t, p, d) = Σ_{i=0}^{M−1} C_i(t, T_i(K_i)) w(θ_id)    (6.2)

where w(·) is a weighting function and θ_id ∈ [−π, π] is the angle between the line passing through p and Π_d and the line from p to K_i. The point Π_d is the intersection between the line from p with direction d and the circle C. The point K_i is the intersection between the line from p to the i-th microphone pair and the circle C. From a geometrical point of view, θ_id is equivalent to the arccosine of the inner product between the

6.2. OGCF Computation Chapter 6. The Oriented Global Coherence Field PSfrag replacements normalized vectors from p to K i and Π d : (Ki p) θ id = arccos K i p, (Π d p) Π d p (6.3) Figure 6.1 depicts a graphical representation of the procedure for OGCF computation. Taking into account the final goal of emphasizing frontal mi- d 1 m 2 K2 Π 1 K 1 m 1 p θ 10 d 2 Π 2 Π 0 d 0 K 3 r K 0 θ 00 m 0 m 3 K 4 K 5 Π 3 d 3 m 4 m 5 Figure 6.1: Graphical representation of the OGCF computation scheme. In this case 6 microphone pairs are available and 4 possible orientations are investigated. crophones, a reasonable and straightforward implementation derives weights w(θ id ) from a gaussian function: w(θ) = 1 2πσ e θ2 2σ 2 (6.4) As a result, the weights w(θ id ) related to the orientation d will emphasize the CSP contributions in those points K i that are close to Π d (i.e. the di- 71

Without loss of generality, the time variable t is dropped from now on. For a comparison with the GCF computation procedure, figure 6.2 sketches a schematic representation of the procedure to derive OGCF(p, d) from a two microphone pair set-up.

Figure 6.2: Scheme of the computation of OGCF for a given point p restricted to the i-th and (i+1)-th microphone pairs.

Given a sound source position hypothesis p̂, the orientation d̂ for which OGCF(p̂, d) is maximized represents an estimate of the sound source orientation:

d̂ = arg max_{d ∈ Φ} OGCF(p̂, d)    (6.5)

From this point of view, OGCF(p̂, d) can be interpreted as the plausibility that a talker is at that position with his/her head aimed according to the considered orientation. Figure 6.3 shows an example of the function OGCF(p̂, d). It is worth outlining that the estimate of the source orientation is not expected to be very accurate in the foreseen scenario. The accuracy of the estimation algorithm depends on several environmental conditions:

1. the available set-up;
2. the source directivity pattern;
3. the directivity of the microphones;
4. the position of the source with respect to the DMN.

Figure 6.3: Example of OGCF(p̂, d) computed with N = 64 and M = 21 in the CHIL room available at the ITC-irst laboratories.

It is intuitive that OGCF requires a rather uniform coverage in order to deliver satisfactory performance, and even a rich set-up such as the one available in the ITC-irst CHIL room presents several blind areas. Nevertheless, the goal of this thesis is to exploit the knowledge of the source orientation to devise an efficient and accurate source localization algorithm, rather than to obtain an accurate estimate of the source orientation itself.

6.3 SSL based on OGCF

The extension of GCF to OGCF was inspired by the idea that knowledge about the source orientation could improve the localization performance given a DMN. The original idea was to refine the localization estimate by exploiting the orientation knowledge in a three-step process:

1. the source position is estimated through GCF maximization;
2. the orientation of the source is estimated applying equation 6.5;
3. a new position estimate is obtained through maximization of a weighted version of GCF.

According to equation 6.2, for a given source orientation d̂, OGCF can be interpreted as a weighted version of GCF which gives more emphasis to the contributions provided by frontal microphone pairs. It is hence possible to collapse the three-step process described above into a single step [23]. The proposed approach is to estimate the source position p̂ by maximizing the OGCF function over all possible orientations and positions:

p̂ = arg max_{(p,d) ∈ Σ×Φ} OGCF(p, d)    (6.6)

As a side effect, an estimate of the source orientation is given as well. The previous equation can be rewritten as follows:

p̂ = arg max_{p ∈ Σ} { max_{d ∈ Φ} OGCF(p, d) }    (6.7)

Equation 6.7 suggests an alternative interpretation of the proposed method: in practice, for each point p ∈ Σ, OGCF hypothesizes a candidate orientation and weights the microphone pair contributions according to the hypothesized orientation. The orientation that delivers the highest score is assumed to be the correct one and its score is associated to the point p under investigation. Such a score represents the plausibility that an active source is present in p. Unlike in the GCF approach, in this case the plausibility is computed through a weighted fusion of the single contributions. The point with the highest plausibility is the sound source position estimate.
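A minimal sketch of this single-step search, reusing the ogcf() function from the previous sketch (the grid of candidate positions and the set of orientations are assumed to be given):

import numpy as np

def localize(grid, orientations, ogcf_fn):
    # Equation 6.7: for each point, keep the best orientation score,
    # then take the point with the highest resulting plausibility.
    best_score, p_hat, phi_hat = -np.inf, None, None
    for p in grid:
        scores = [ogcf_fn(p, phi) for phi in orientations]
        d = int(np.argmax(scores))            # inner maximization over Phi
        if scores[d] > best_score:
            best_score, p_hat, phi_hat = scores[d], p, orientations[d]
    return p_hat, phi_hat                     # position and, as a side effect, orientation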

This interpretation leads us to derive from OGCF the concept of M-OGCF, which is more suitable for sound map representation. The domain of M-OGCF is reduced back to Σ by local maximization of OGCF over all the possible orientations:

M-OGCF : Σ → R    (6.8)

M-OGCF is related to OGCF by the following equation [24]:

M-OGCF(p) = max_{d ∈ Φ} OGCF(p, d)    (6.9)

Due to the weighting process, M-OGCF delivers sharper audio maps and attenuates the effects of spurious peaks. In fact, under the assumption that the local orientation estimation is performed correctly in the area around the actual source position, the effects of reverberation are lowered with respect to direct path contributions by giving less emphasis to unreliable microphones. For comparison with standard GCF, figures 6.4 and 6.5 show two sound maps computed by means of M-OGCF in the same configurations as in figures 5.7 and 5.10.

6.4 Parameter Selection

The computation of OGCF with equation 6.2 involves some crucial parameters. An analysis of the best parameter configuration is hence required from an orientation estimation point of view. Theoretically, a low value of σ in equation 6.4 forces the orientation estimates to be close to directions characterized by a small angular distance from the microphones (the points K_i). As a matter of fact, when adopting a sharp weighting function, directions between two microphone pairs

receive contributions from neither pair and their scores are always low. On the contrary, a spread weighting function, corresponding to a high value of σ, gives importance also to non-frontal contributions, increasing the estimation error variance.

Figure 6.4: Sound map computed with M-OGCF when the loudspeaker is in the upper right corner (Pos3) and is aimed at the opposite corner (315°).

Figure 6.5: Sound map computed with M-OGCF when the loudspeaker is in the middle of the room (Pos1) and is aimed at the bottom right corner (45°).

An implementation of OGCF that adopts a flat weighting function is equivalent to GCF.

A set of evaluation experiments restricted to the orientation estimation was conducted on both real and simulated data. Performance was measured in terms of RMS estimation error and percentage of correct orientation estimates. An estimate is defined as correct if the angular distance between the estimate and the actual orientation is below a given threshold. The orientation estimate is computed as the joint optimal solution of equation 6.6, restricting the search to a bi-dimensional grid of potential source positions. Although the actual source position is known from the experimental set-up and the estimate of the source orientation could be computed directly from equation 6.5 instead of performing a comprehensive search, it is more prudent to rely on the effectiveness of OGCF rather than on manual human measurements. In fact, due to the sampling issues already highlighted, together with the sharpness of the entities involved, the overall OGCF approach turns out to be quite sensitive to wrong assumptions on the source position. The grid resolution was set to 20 mm in each dimension, while the number of orientations investigated was N = 32, which is equivalent to an angular step of π/16 rad, or about 11 degrees.

6.4.1 Real Data Experiments

A first experiment was carried out on the small audio collection adopted for the analysis of GCF and described in Chapter 5. Figure 6.6 reports on the RMS error with respect to different parameter configurations when the loudspeaker is in the central position and plays white noise. The error is measured over all the eight orientations investigated (a complete loop around the loudspeaker). Notice the minimum reached when σ² is around 4 and the radius ranges between 50 and 100 mm. It is worth noting that, as supposed above, larger or smaller values of σ reduce the overall performance.

Figure 6.6: RMS estimation error with different parameter configurations when a Tannoy loudspeaker is in central position and reproduces white noise (error in degrees versus σ², for r = 0, 50, 100 and 200 mm).

According to the figure, the algorithm performance seems to be less sensitive to the radius r, even if a very large value is detrimental for the estimation process. Since the evaluation includes a complete loop of the source, the best parameter configuration optimizes performance on average over all directions, taking into account the issues of those orientations close to microphone pairs and of those in the middle between two clusters of sensors. The best parameter configuration is influenced not only by the sensor set-up but also by the source radiation pattern. As a proof, figure 6.7 shows the estimation performance when the Tannoy loudspeaker is substituted with a loudspeaker characterized by lower directivity, smaller dimensions and poorer quality. It is worth noting that when the source directivity is lower the estimation performance is critically reduced. As a matter of fact, in the extreme condition where the source is omnidirectional, an estimate of its orientation is not feasible. Experimental results show that the adoption

of a more selective σ seems to be more suitable for tackling less directional sources.

Figure 6.7: RMS estimation error with different parameter configurations when a Yamaha loudspeaker with low directivity in central position reproduces white noise (error in degrees versus σ², for r = 0, 50, 100 and 200 mm).

As already pointed out, an accurate estimate of the source orientation is not the goal here. What is desired instead is a rough estimate that indicates where the speaker is oriented and permits emphasizing the useful contributions. In this perspective, figure 6.8 depicts the percentage of correct orientation estimates given an acceptance threshold set to ±π/14 rad, i.e. about ±13°. As soon as σ² > 2.5 and r < 200 mm, nearly 100% of the estimates are correct. An analysis of the localization performance with different parameter set-ups is not useful in this clean experimental environment. In fact, the localization process is too accurate and it is not possible to discriminate the effectiveness of different parameter selections. Moreover, it was expected that the parameter configuration is not so crucial for the localization performance, since in practice the weights only select the microphone pairs to merge.

Figure 6.8: Percentage of correct orientation estimations with different parameter configurations. The acceptance threshold is set to π/14 rad.

However, figure 6.9 reports on the localization errors with different parameter configurations. Notice that when σ is very low, the performance considerably decreases, because the corresponding very sharp weighting function does not allow merging contributions from different pairs.

6.4.2 Simulated Data Experiments

A benchmarking activity was also conducted on simulated data. Given the same sensor set-up adopted for the real data acquisition, an impulse response from the source position to each microphone was generated with a modified version of the image method, which also accounts for source directivity (see Appendix A for further details on the impulse response generation process). A speech sequence consisting of a phonetically rich sentence was then filtered through each impulse response and corrupted by additive Gaussian noise to obtain a 30 dB SNR level. In order to reduce the impulse response length, and the computational load as well, the sampling rate was lowered to 16 kHz. A sketch of the channel simulation procedure is given below.
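A minimal sketch of the simulation of one channel, assuming that a simulated impulse response is available (the image-method generation itself is described in Appendix A and is not reproduced here; all names are illustrative):

import numpy as np
from scipy.signal import fftconvolve

def simulate_channel(speech, rir, snr_db, rng=None):
    rng = rng or np.random.default_rng()
    # Propagate the close-talk signal through the simulated room response.
    x = fftconvolve(speech, rir)[:len(speech)]
    # Additive Gaussian noise scaled to reach the requested SNR.
    noise = rng.standard_normal(len(x))
    gain = np.sqrt(np.mean(x**2) / (np.mean(noise**2) * 10**(snr_db / 10.0)))
    return x + gain * noise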

Figure 6.9: Performance of a source position estimation algorithm based on OGCF with different parameter configurations.

A lower sampling rate reduces the resolution of CSP and, as a consequence, negatively influences the performance of GCF and OGCF as well. Figure 6.10 shows the orientation estimation performance in terms of RMS error. Notice that the trend is the same as with real data and the minimum is achieved when σ² ranges between 2 and 3. As already outlined for real data, an analysis of the localization performance in this context is not informative. Nevertheless, figure 6.11 shows that the best values of r range in the interval from 50 mm to 100 mm.

6.5 Human Speaker

A preliminary experiment was also carried out adopting a human speaker as the acoustic source. Unfortunately, an accurate reference of the actual talker's head pose is difficult to obtain. Even small fluctuations of the head are enough to introduce considerable discrepancies between the reference and the actual orientation, making the evaluation unreliable.

Figure 6.10: Source orientation performance on simulated data expressed in terms of RMS error. Results obtained with different parameter configurations are reported.

Figure 6.11: RMS localization error on simulated data with different parameter configurations.

However, a small data collection consisting of a sequence of sentences uttered by one male speaker was acquired. The speaker repeated the same sentence (of about 7 seconds length) five times, standing at the same position but each time with a different orientation, rotating by steps of about 45° in order to sweep the range between 0° and 180°. The speaker was more or less in position Pos1, whose distance from the microphone arrays is about 3 m. The total sequence length was about 45 seconds. Figure 6.12 shows the position of the speaker with respect to the sensor set-up and the orientations investigated.

Figure 6.12: Outline of the speaker position and the 5 orientations adopted in the experiment.

Figure 6.13 represents the errors in the orientation estimate with respect to the true head orientation. Each cluster of points corresponds to a different orientation. From this result, one can observe that, except for two frames, the proposed method always ensures an orientation error lower than 45°. Taking into account the reference accuracy issues outlined above, the RMS error was roughly estimated at around 15°.

Figure 6.13: Orientation error as a function of time.

Chapter 7

Speaker Localization Benchmarking

In order to evaluate the effectiveness of the proposed approach, the results of a benchmarking activity are presented in this chapter. Although most of the literature addressing speaker localization is based on simulations and often reports performance expressed in terms of accuracy of the delay estimates, in this PhD thesis we directly address the problem of evaluating our SSL technologies in terms of localization accuracy in a real scenario. For the evaluation of the proposed method we adopted the database distributed by NIST for the 2005 international evaluation campaign on speaker localization. This decision is justified by the fact that our laboratories participated in that evaluation campaign with a system based on GCF that was recognized as the most accurate one [80]. The chance of a direct comparison between the technique proposed here and a valid reference method is appealing from an algorithm development point of view. Moreover, the accuracy of a speaker localization system is influenced by many factors: room acoustics, number of exploited microphones, their sensitivity, their spatial and spectral response, their relative geometric position, their distance from the speakers and so on. Adopting the same data and the same evaluation criteria across different laboratories guarantees a fair and meaningful comparison between alternative approaches to the SSL problem.

7.1 Data Collection

The database collected for the NIST RT-Spring 2005 evaluation consists of excerpts of 13 lectures recorded between November 2004 and February 2005 in the CHIL room available at the Karlsruhe University laboratories, resulting in an overall duration of about 66 minutes. The room is equipped with 4 T-shaped arrays, one on each wall, as depicted in figure 7.1. The room dimensions are 7 × 6 × 3 m and the reverberation time is about 0.45 s, rather lower than the reverberation time of the ITC-irst CHIL room.

Figure 7.1: Map of the CHIL room available at the Karlsruhe University laboratories. The T-shaped arrays exploited for localization purposes are indicated as A-Array, B-Array, C-Array, D-Array.

Although the recording chunks were selected among sequences in which there was no intentional interruption by the audience (in fact, only rough labeling would have been available for the latter case), several short unwanted noises attributable to the audience occur in the recordings. Moreover, computers, beamers and printers, together with late-arriving attendees, contribute to increase the amount of interference which degrades the signal clarity. Audio files were recorded at 44.1 kHz, 24-bit PCM, even if during the localization process the sample resolution was reduced to 16 bits. It was verified that for localization purposes a reduction of the precision to 16 bits attenuates the computational load without decreasing the algorithm performance.

7.2 References

Lectures were also recorded by 4 calibrated video cameras, for both the evaluation of video-based algorithms and reference extraction. The location of the centroid of the speaker's head in each image was manually marked every 667 ms¹. Starting from these hand-marked labels, the true position of the speaker's head in three dimensions was calculated using the technique described in [44]. The resulting accuracy of the ground truth positions is approximately 10 cm. The total number of reference frames (for which a set of coordinates was available) was 5788. Besides the 3D reference coordinates, a set of rich transcriptions obtained from far microphones was available, including information about the number of active speakers, ongoing noise events, speech boundaries and so on. The reference coordinates were then crossed with the transcriptions in order to obtain a reference file which characterizes each set of 3D coordinates with additional labels such as the number of active speakers, the number of active noise sources and the speaker ID. An example of a reference file is reported in Appendix C.

¹ The extraction of the centroid of the speaker's head was carried out at the Evaluations and Language resources Distribution Agency.

7.3 Evaluation Metrics

The basic metric adopted to evaluate the accuracy of an SSL algorithm is the localization error e(t). It is defined as the euclidean distance between the coordinates p(t) produced by the localization system at a given time instant t and the corresponding ground truth g(t) obtained from the reference file:

e(t) = d(p(t), g(t))    (7.1)

The localization error is evaluated only at time instants for which the reference file provides a set of coordinates labeled as one speaker. Given an acceptance distance D, localization errors are classified as:

anomalies, or gross errors, if e(t) > D;
non-anomalies, or fine errors, if e(t) ≤ D.

This classification was introduced to distinguish between noisy localizations which roughly hit the correct source position and faulty position estimates. Besides mistakes in the localization process, in a real scenario anomalous errors may occur either when references have not been properly labeled or when a competing sound event is ongoing while the target is speaking. A few anomalous errors may considerably affect the performance of an otherwise accurate system. Therefore, besides the overall evaluation, the adopted metrics also provide an analysis restricted to fine errors. The original formulation of this classification was proposed in [31, 29] to evaluate TDOA estimation errors, and the acceptance threshold was related to the correlation time of the involved signals. Since in our scenario the statistics of the signals are not given, the classification is based on an empirical

acceptance threshold set to 50 cm.

Given the localization error definition and the consequent classification, a first set of metrics is exploited to evaluate the localization accuracy [80], as illustrated in the sketch below:

RMS error (RMSE): the root mean square error computed on all the localization outputs. In some conditions the overall RMS error could be considerably degraded by a few localization errors.

Fine RMS error: the root mean square error computed only on the localization outputs classified as fine errors.

Localization rate (P_cor): the percentage of fine errors with respect to all the localization outputs. It is a measure of the reliability of the algorithm. Denoting by N_fe the number of localization outputs classified as fine errors and by N_T the overall number of localization outputs, the localization rate is computed as follows [75, 55]:

P_cor = (N_fe / N_T) × 100    (7.2)

Bias: the average of the localization errors on the single coordinates. It is useful to understand whether any bias is present and, if feasible, to infer conclusions about the statistics of the localization errors.

In principle, a speaker localization system may produce a set of coordinates at a high rate, for instance the inverse of the analysis window step. However, in any realistic applicative scenario one needs a system able to track the position of the speaker smoothly. Hence, an overproduction of speaker positions has to be post-processed anyway in order to derive one position for a longer temporal segment. This choice is also consistent with the typical rates adopted for vision technologies, and so with a potential integration between audio and image processing systems for person localization and tracking purposes. If a speaker localization system provides coordinates at a faster rate, the evaluation tool averages the coordinates, every 667 ms according to the reference temporal resolution, on a 667 ms window centered around the given time instant. If the speaker localization system produces data at a slower rate, or is not able to produce a set of coordinates for some frames labeled as one speaker, the evaluation tool should account for those missing data.
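A minimal sketch of the computation of this first set of metrics, given per-frame outputs and references for the frames labeled as one speaker (the 500 mm acceptance distance matches the threshold introduced above; all other names are illustrative):

import numpy as np

def localization_metrics(outputs, references, D=500.0):
    # outputs/references: (n_frames, n_dims) arrays of coordinates in mm.
    diff = np.asarray(outputs, float) - np.asarray(references, float)
    e = np.linalg.norm(diff, axis=1)          # localization error e(t), eq. 7.1
    fine = e <= D                             # non-anomalous (fine) outputs
    rmse = np.sqrt(np.mean(e**2))             # overall RMS error
    fine_rmse = np.sqrt(np.mean(e[fine]**2))  # RMS error restricted to fine errors
    p_cor = 100.0 * fine.sum() / len(e)       # localization rate, eq. 7.2
    bias = diff.mean(axis=0)                  # average error on single coordinates
    return rmse, fine_rmse, p_cor, bias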

These arguments lead us to the introduction of a secondary set of metrics that contribute to better represent the speech activity detection capability of an SSL algorithm:

Output rate: the number of position estimates produced per second. In the perspective that source position estimates will be exploited by automatic camera steering systems, beamformers and so on, a system with a very low output rate, even if accurate, is not acceptable. On the other hand, high output rates in combination with low accuracy do not comply with the requirements of higher level algorithms.

False Alarm rate (FA): the percentage of localization outputs produced when nobody is speaking over the number of frames labeled as silence in the reference file.

Deletion Rate: it accounts for time instants when the tracker does not provide any estimate even though somebody is speaking. It is computed as the ratio between the number of missing localizations and the overall number of valid frames, i.e. those labeled as one speaker.

The introduction of deletion and false alarm rates is due to the fact that a speaker localization system includes an implicit acoustic event detection process. SSL algorithms are supposed to deliver outputs only when an active speaker is present in the room. We can assume that in a real application one is interested in a good localization accuracy as well as in a low, and balanced, rate of deletions and false alarms. Figure 7.2 provides

one example for each of the situations foreseen in the evaluation metrics, assuming a reference rate of 10 Hz: an average of localization outputs produced at a higher rate, a localization at the given evaluation frame rate, a deletion, and a false alarm.

Figure 7.2: Examples of outputs of the localization system for the x coordinate assuming a 100 ms time resolution in the reference file: SAD is the bilevel information of the Speech Activity Detector, REF is the reference transcription of the x coordinate for time frames labeled as one speaker, OUTPUT shows the results of the localization system in the case of output at a frame rate higher than 10 Hz, in the case of output at 10 Hz, and in the cases of deletion and false alarm, respectively.

As already mentioned, the reference coordinates for each attendee in the audience are less accurate than those of the main speaker, for several practical reasons, as for instance video recording coverage. For this reason the reference file includes the information about the speaker ID, and the evaluation tool also delivers separate results for the main lecturer and any speaker in the audience.

7.3.1 The Evaluation Tool

Based on the above-mentioned metrics, ITC-irst developed, under the CHIL project, an evaluation tool for SSL algorithms. The same software was then

adopted by NIST for the NIST RT-Spring Evaluation 2005 benchmarking². The evaluation tool proceeds along the reference file: for each set of coordinates it looks for localization outputs in the given time window and compares their average with the reference. According to the reference label and the averaged localization output, the tool classifies each time frame as one of the foreseen events: gross or fine error on the lecturer, gross or fine error on the audience, deletion, false alarm. In case nobody is speaking, or at least two speakers are active simultaneously, the tool ignores the current time frame. Besides the final summary of the localization performance including the aforementioned metrics, the tool also delivers the single frame classification, in case one is interested in a deeper analysis of the performance. A dummy example of how this software works is reported in Appendix C, where one can find: the content of the reference file for a sequence of 1.3 seconds; a second list of data representing the output of an SSL algorithm; a third list reporting the output of the evaluation tool for every frame (i.e. every 100 ms); finally, a list of performance indicators, which allow one to interpret the potential of an SSL system from different perspectives.

² More details can be found at http://www.nist.gov/speech/tests/rt/rt2005/spring/, where the source code and the related technical document SpeakerLocEval-V5.0-2005-01-18.pdf can be downloaded.

7.4 Implementation of the SSL system

In the proposed implementation of the SSL system, the position of the active source was estimated by maximizing a sound map computed by means of OGCF. To reduce the computational time, the localization algorithm was split into two steps [23]:

1. the source position was estimated on a horizontal plane by maximizing a sound map based on OGCF;

2. given the 2D localization, the third coordinate was derived with a traditional TDOA approach using the vertical pair that provides the highest CSP peak. The height of the speaker was hence estimated as the intersection between the vertical line passing through the 2D localization estimate and the line passing through the most reliable vertical pair with slope corresponding to the estimated TDOA.

The two-step algorithm adopted here represents a suboptimal approach. Although a direct maximization of OGCF in the Σ × Φ space was possible, it was not adopted because of the high computational requirements, which exceed the limits of the potential real-time implementation required by the benchmark tests. The window size for the CSP analysis was set to 2^14 samples with an overlap factor of 4, which theoretically corresponds to an output every 100 ms. The OGCF was computed exploiting each horizontal microphone pair, i.e. 12 pairs, with a grid resolution equal to 50 mm in each dimension. According to preliminary experiments carried out in the CHIL room available at ITC-irst, the parameter configuration was: r = 50 mm, σ = 1.7 and N = 32. A post-processing step was applied in order to discard unreliable frames, based on the amplitude of the peaks of the OGCF function exceeding a predefined threshold. This step acts as a sort of implicit Speaker Activity Detection (SAD) and has the purpose of properly balancing performance with a trade-off between FA rate and Deletion rate. Furthermore, the SAD threshold allows skipping those speech segments not useful to the localization process because they are characterized by narrow-band sounds or long pauses. As the threshold is increased, the selectivity of the post-processing is raised, reducing the average output rate. In order to guarantee a smooth tracking of the source, the localization outputs were further parsed to filter out possible outliers introduced, for instance, by competing noise sources. The filter was designed to throw away localizations whose distance from the previous one was larger than an acceptance distance. In case at least 7 consecutive localizations are within the acceptance interval of each other, even if too far from the last valid localization, the system accepts the 7th localization and moves to the new area. The acceptance distance was set to 50 cm. A sketch of this filter is given below.
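A minimal sketch of the gating filter described above (the acceptance distance and the run length follow the values given in the text; the handling of the pending run is an illustrative implementation choice):

import math

def smooth_track(positions, accept_mm=500.0, jump_len=7):
    # positions: chronological list of position estimates (tuples in mm).
    valid, pending = [], []
    for p in positions:
        if not valid or math.dist(p, valid[-1]) <= accept_mm:
            valid.append(p)                  # consistent with the last valid output
            pending = []
        else:
            # Candidate new area: require jump_len mutually consistent estimates.
            if pending and math.dist(p, pending[-1]) > accept_mm:
                pending = []
            pending.append(p)
            if len(pending) >= jump_len:
                valid.append(p)              # accept the 7th localization, move on
                pending = []
    return valid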

7.5 Evaluation

Localization experiments were carried out to compare the newly proposed OGCF based localization algorithm with two reference algorithms previously adopted for evaluation campaigns on speaker localization and tracking: standard triangulation and GCF. In particular, GCF proved to be the most accurate system in the NIST RT-Spring 2005 evaluation and can be considered a valid reference for a new localization algorithm. The first method simply exploits the TDOA between the signals of two orthogonal microphone pairs and derives the source position by means of triangulation in two steps: first on a plane using horizontal microphone pairs, and then for the vertical coordinate by means of vertical pairs. The two orthogonal pairs used for triangulation are the ones that guarantee the best performance on average over the whole evaluation set. The second method is based on the maximization of a GCF function computed on a plane. The procedure is exactly the same as the one described above for OGCF, except for the computation of the sound map. The same post-processing as for the algorithm under evaluation was applied to the reference algorithms. For both reference algorithms some parameters (e.g. the analysis window size) had been optimized, and the reported results refer to the best performance obtained.

7.5.1 Results

Table 7.1 reports the results obtained with the three given algorithms, considering different SAD thresholds for GCF and OGCF.

Technique        Output Rate  FA Rate  Del. Rate  Loc. Rate  RMSE  fine RMSE  Bias
(SAD threshold)  [1/s]        [%]      [%]        [%]        [mm]  [mm]       [mm]
TDOA             2.25         42       41         95         309   203        (59,-78,-41)
GCF(0)           6.21         81       7          87         479   226        (43,-64,-77)
GCF(0.38)        1.94         39       48         92         327   198        (40,-47,-51)
GCF(0.75)        0.07         3        96         91         238   159        (80,-22,-57)
OGCF(0.15)       5.09         68       13         95         298   193        (-1,-7,-55)
OGCF(0.20)       3.91         55       23         95         272   193        (-12,-10,-47)
OGCF(0.25)       2.84         44       36         95         266   192        (-23,-10,-41)
OGCF(0.30)       2.01         33       50         95         249   191        (-37,-14,-33)

Table 7.1: Results obtained applying different localization systems to the NIST RT-05 Spring Evaluation test set.

As a first comment, one can notice that the chosen SAD threshold values have a direct effect on the performance reported in Table 7.1. Consider that the thresholds for GCF and OGCF cannot be compared with each other, due to the different ranges assumed by the two functions. In practice, when the SAD threshold increases, the output rate and the FA rate decrease, which leads to a less reactive but quite robust system. For high values of the SAD threshold, the localization rate and the RMSE are also improved. However, it is worth noting that an RMSE of about 24 cm is achieved by the GCF method only at a non-realistic output rate of 0.07/s (i.e. one localization every 14 seconds), obtained with the given SAD threshold of 0.75. As a result, the OGCF based method turns out to be the most interesting and best performing one: with the highest SAD threshold, it ensures an RMSE of 25 cm with an output rate of more than 2 localizations

per second; with the lowest SAD threshold (i.e. 0.15), it ensures an RMSE of less than 30 cm with an output rate of more than 5 localizations per second (i.e. very good real-time tracking capabilities). Finally, one can note that the fine RMSE is close to 19 cm. This is an important result, since it expresses the error observed when gross errors are discarded (gross errors are sometimes caused by cross-talk effects generated in the audience and not annotated by the manual labelers), and it is computed on the 95% (localization rate) of all the outputs. Note that the system operates in a completely unsupervised manner. Parameters were tuned by running experiments in the ITC-irst CHIL room, and the system was then tested in another room with fewer microphone pairs and different acoustic characteristics. This fact shows the robustness and portability of the proposed solution.

Chapter 8

Example Based Approach

In the previous chapters the localization task has been tackled in a traditional way, by maximizing sound maps computed as either GCF (Chapter 5) or OGCF (Chapter 6). It is rather intuitive that this approach delivers satisfactory performance only if some microphone pairs can collect direct wavefronts from the source. In case the source is facing a wall or is emitting sound towards a surface, reflections are predominant and the maximum peak of the sound map does not correspond to the actual sound source position, so standard localization algorithms fail (see figure 5.15). Nevertheless, even in these unfavorable conditions, the pattern of reflected wavefronts can still characterize a particular source position within an enclosure. Even if some techniques aiming at a comprehensive description of the propagation of sound waves in rooms have been successfully investigated exploiting beam tracing [45, 46], an exhaustive modeling of the reflection pattern in a room is not yet feasible. However, a classification approach based on examples computed in a training phase seems to be a valid alternative. For example, figure 8.1 shows a GCF sound map computed in the ITC-irst CHIL room when a loudspeaker is placed in the upper right corner and is aimed at the corner. Reflections on the walls generate several virtual sources that are represented by the wide bright

area surrounding the corner. Although a map like this is not suitable for traditional localization methods, its shape is related in an unambiguous manner to the particular source configuration.

Figure 8.1: GCF sound map when the source is in the upper right corner (Pos3) and aims at the corner (135°). The sound map is computed with the sensor set-up available in the ITC-irst CHIL room.

A pattern classification method, where sound maps act as the set of input features, is the easiest and most straightforward way to pursue the above-mentioned aims. The proposed approach is hence to compute a set of map models, each one representing a particular source configuration, and then apply a pattern classification approach to estimate both the source position and orientation as the ones corresponding to the sound map model that delivers the highest similarity with the input map [24]. A similar approach was presented by Strobel and Rabenstein in [100], where the authors suggest classifying time delays with histograms. More recently, a statistical approach which uses the magnitude and phase of the Crosspower Spectrum as discriminant features was introduced in [98] by Smaragdis and Boufounos.

8.1 Proposed Approach

Let us consider a set of R predefined potential source positions r ∈ R and Q potential orientations q ∈ Q. Each couple (r, q) indicates a particular source configuration, or class, c ∈ R × Q among the R·Q overall possible configurations. Let us assume that a set of acquired signals is available for each source configuration c, so that a sound map model µ_c(p), where p is again a point of the grid Σ, can be derived from a training data set as either GCF or OGCF. A map model µ_c(p) is obtained as the average of all the maps computed on the training data set given a particular configuration c. During the test phase, for each time frame a new map Λ(p) is computed from the input signals and is compared with the models in order to determine the most likely source configuration ĉ. The classification is based on maximizing a similarity measure between normalized versions of the models and of the map under evaluation. An easy and reasonable way to achieve normalization is by mean value subtraction and scaling to unitary energy; however, more efficient approaches could be found with more accurate investigations. Figure 8.2 sketches the diagram of the proposed method.

The easiest way to derive a similarity measure for sound maps is to evaluate the distance based on the L1 norm, defined as follows:

d_L1(c) = Σ_{p ∈ Σ} |Λ(p) − µ_c(p)|    (8.1)

The localization estimate is the class ĉ whose model is characterized by the minimum distance from Λ(p):

ĉ = arg min_c d_L1(c)    (8.2)

As an alternative, a metric based on the L2 norm, as in [100], may be applied:

d_L2(c) = Σ_{p ∈ Σ} [Λ(p) − µ_c(p)]²    (8.3)

Figure 8.2: Graphical representation of the proposed classification method to estimate the audio source position and orientation.

Again, the classification is based on the minimum distance criterion:

ĉ = arg min_c d_L2(c)    (8.4)

Since maps and models have zero mean and unitary energy due to the normalization step, the distance d_L2(c) is equivalent to a similarity measure based on cross-correlation. In fact, given the cross-correlation between a map model and the input map, computed as follows:

d_corr(c) = Σ_{p ∈ Σ} Λ(p) µ_c(p)    (8.5)

the following relationship between d_corr(c) and d_L2(c) holds [20]:

d_L2(c) = N (1 − d_corr(c))    (8.6)

Of course, in case the similarity measure is based on cross-correlation, the best class is the one that maximizes d_corr(c).
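A minimal sketch of the whole classification step, using the normalization described above and the L2 distance of equation 8.3 (models is assumed to map each class c to its model map; all names are illustrative):

import numpy as np

def classify_map(input_map, models):
    def normalize(m):
        m = np.asarray(m, dtype=float).ravel()
        m = m - m.mean()                     # zero mean
        return m / np.sqrt(np.sum(m**2))     # unitary energy
    x = normalize(input_map)
    best_c, best_d = None, np.inf
    for c, mu in models.items():
        d = np.sum((x - normalize(mu))**2)   # L2 distance, eq. 8.3
        if d < best_d:
            best_c, best_d = c, d
    return best_c, best_d                    # returning the distance allows thresholding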

8.2 Experimental Results

A set of preliminary experiments on simulated and real data was conducted to evaluate the proposed approach and prove that it can deliver satisfactory performance even when facing unfavorable conditions. In both cases the sensor set-up consisted of the 7 T-shaped arrays included in the DMN available at ITC-irst. Without loss of generality, the SSL task is restricted to a two-dimensional space and only horizontal microphone pairs are exploited in the experiments.

8.2.1 Simulated data collection

The simulated data collection includes 9 different positions and 8 different orientations for each position, as sketched in the left part of figure 8.3, resulting in 72 different source configurations. For each source configuration, the impulse responses from the source to each microphone were generated using a modified version of the image method that also takes into account the source directivity. A cardioid-like directivity pattern, which roughly models the radiation properties of human beings, was adopted. The reverberation time was set to 0.7 s in order to simulate the acoustics of the ITC-irst CHIL room. Further details about the modified image method are given in Appendix A. A close-talk speech segment was filtered with each impulse response in order to simulate the sound propagation through the room and obtain the acoustic signals acquired by the DMN. This first set of signals constitutes the development set and is exploited in the training procedure. The length of the development set for each class is 11.32 s. Afterwards, 4 further clean speech segments were also filtered with the impulse responses to generate the evaluation set, for an overall length of 21.92 s for each class (the single segment lengths are respectively 4.95, 6.22, 6.88 and 3.87 s). The filtered signals were corrupted by real background noise

with 4 different SNR levels: 30, 10, 5 and 0 dB. The background noise was acquired in the real room. The sampling rate was 16 kHz, the analysis step was about 128 ms and the analysis window length was 2^13 samples. With this particular configuration, 84 input maps contribute to each model computation, while the overall number of frames adopted for evaluation is 12330, which corresponds to about 171 frames for each class.

Figure 8.3: The left part of the figure represents the positions and orientations of the sound source that were investigated in the simulated data collection. The right part refers to the real data acquisition.

8.2.2 Real data collection

As for the real data collection, the development set is a subset of the already mentioned database that includes audio signals acquired by the DMN while a loudspeaker was reproducing white noise and speech sequences. In particular, the subset comprises 3 positions, as depicted in the right part of figure 8.3, resulting in 24 different configurations. For each source configuration the length of the acquired signals is about 49 s for both speech and white noise. The evaluation set was instead acquired with a real human talker. The overall length of the test set is about 5 minutes and 30 seconds, which corresponds to an average length of 13.6 s per class. The SNR

in both the training and test data sets is about 20 dB. It is worth noting that the positions and orientations of the real talker are only nominally the same as those of the loudspeaker in the training data set. The exact location, and in particular the orientation, of a real speaker are not easily determinable; however, this fact contributes to proving the feasibility and robustness of the proposed method. Only for the experiments involving real data, a rejection class C_rej is introduced to handle those input segments that are not useful to the classification process because, for instance, they are characterized by narrow-band spectra, long silences or strong background noises. An input map is classified in the rejection class if no class provides a distance below a given acceptance threshold.

8.2.3 Evaluation metrics

Since the current formulation of the SSL task has actually become a pattern classification approach, the metrics previously defined in Chapter 7 are no longer suitable to describe the localization capabilities of the proposed technique. In this framework, performance is hence evaluated in terms of the following metrics, sketched below:

Localization Error Rate (LER).
Orientation Error Rate (OER).
Rejection Rate (RR).
Adjacent Class Error Rate (ACER).

Both LER and OER represent the percentage of wrong localization or orientation classifications with respect to the overall number of classifications performed. The Rejection Rate represents the ratio between the number of frames classified in C_rej and the number of input frames. Furthermore, orientation classification performance is also evaluated in terms of ACER, which is defined as the percentage of frames that are classified either in the correct class or in one of the two adjacent classes, with respect to all the classifications performed.
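A minimal sketch of these four metrics, assuming per-frame predicted and reference classes given as (position, orientation) index pairs, with rejected frames marked as None and orientation indices wrapping around the circle (all names are illustrative):

def classification_metrics(pred, ref, n_orient):
    kept = [(p, r) for p, r in zip(pred, ref) if p is not None]
    rr = 100.0 * (len(pred) - len(kept)) / len(pred)          # Rejection Rate
    n = len(kept)
    ler = 100.0 * sum(p[0] != r[0] for p, r in kept) / n      # LER
    oer = 100.0 * sum(p[1] != r[1] for p, r in kept) / n      # OER
    adj = sum(min(abs(p[1] - r[1]), n_orient - abs(p[1] - r[1])) <= 1
              for p, r in kept)
    acer = 100.0 * adj / n                                    # ACER
    return ler, oer, rr, acer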

8.2.4 Evaluation results

As a baseline, the classification capabilities of the proposed approach were tested on the same simulated data set exploited in the training phase. Results in terms of the above-introduced metrics are reported in table 8.1 and show that sound maps actually carry cues useful to recognize the source configuration. Moreover, there seems to be no difference between the similarity measures and the types of sound map computation.

Dev Set       LER     OER     ACER
GCF L1        0.09%   1.6%    99.8%
GCF L2        0%      0.86%   100%
M-OGCF L1     0.43%   1.4%    99.5%
M-OGCF L2     0.17%   0.61%   99.8%

Table 8.1: Classification performance on the simulated training set. L1 and L2 refer to the two proposed distance measures. The same set is used for both training and test.

Figure 8.4 reports on the orientation estimation error on the simulated test set. Interestingly, there is no significant difference between the two distance measures. Conversely, GCF always performs better than M-OGCF. The sharper and more irregular representation of the acoustic activity obtained with M-OGCF with respect to GCF, as illustrated in Chapter 6, may be a reason for this behavior. It is worth mentioning that most of the orientation errors fall within the adjacent class, the worst case being at 0 dB SNR with an ACER equal to 90%. Figure 8.5 reports instead on the LER achieved on the simulated test set. Again, there is no significant difference between the two similarity measures

Figure 8.4: Overall orientation error rate (%) versus SNR (dB) for different similarity measures and maps with the simulated test set.

Figure 8.5: Overall position error rate (%) versus SNR (dB) for different similarity measures and maps with the simulated test set.

and the two sound map computation methods also perform in the same way. It is worth mentioning that the classification algorithm for simulated data does not implement the rejection class. As a consequence, the system is also evaluated on silences, which slightly affects the overall performance.

As far as real data are concerned, table 8.2 reports on the classification performance achieved when the models are computed with white noise signals.

white noise   LER      OER      ACER
GCF L1        0.05%    11.06%   98.5%
GCF L2        0%       9.18%    99.8%
M-OGCF L1     0.11%    11.7%    99.4%
M-OGCF L2     0%       12.6%    99.77%

Table 8.2: Results obtained on the real-talker data with RR set to 27% using the models computed with white noise.

It can be noted that, even in the case of models acquired with a different source (a loudspeaker) and a limited accuracy in the position/orientation of the real talker, performance is still satisfactory in terms of LER, with a limited drop in OER mainly due to classification into the directions adjacent to the correct one. These results were obtained with thresholds on the similarity measures empirically set to values that guarantee a maximum RR level of 27%, as sketched below.
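One plausible reading of this calibration step, given as a sketch only (the function and variable names are assumptions; the thesis does not describe the actual procedure), is to set the acceptance threshold to the distance quantile that leaves the desired fraction of development frames rejected:

```python
import numpy as np

def calibrate_threshold(dev_min_dists, max_rr=0.27):
    # dev_min_dists: per-frame minimum model distance on the dev set.
    # A frame is rejected when its minimum distance exceeds the
    # threshold, so the (1 - max_rr) quantile keeps the rejection
    # rate at roughly max_rr on the development data.
    return np.quantile(np.asarray(dev_min_dists), 1.0 - max_rr)
```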

Table 8.3 reports instead on the evaluation of a classification system whose models were derived from speech signals. Results are very similar to the white noise case and confirm the similarity between the different approaches investigated.

speech        LER      OER      ACER
GCF L1        0.22%    9.43%    99.78%
GCF L2        0%       10.57%   100%
M-OGCF L1     0.11%    11.74%   99.4%
M-OGCF L2     0%       12.66%   99.78%

Table 8.3: Results obtained on the real-talker data with RR set to 27% using the models computed with speech signals.

Finally, in order to verify how slight changes in the source configuration influence the effectiveness of the algorithm, table 8.4 presents the classification performance when the models are computed with white noise and the speech signals played by the loudspeaker are used as the test set. Now the positions and orientations of the source in the test set are very close to the ones in the training set. Notice the considerable reduction of OER, due to a higher correspondence between the orientations in the test and development data sets.

white noise   LER      OER      ACER
GCF L1        0.64%    3.2%     98.72%
GCF L2        0.54%    3.3%     98.84%
M-OGCF L1     1.12%    4.88%    97.56%
M-OGCF L2     1.40%    4.9%     97.32%

Table 8.4: Results obtained with the loudspeaker reproducing speech, using models computed with white noise. The RR is set to about 27%.


Chapter 9

Conclusions and Future Work

9.1 Discussions and Conclusions

This thesis addressed the problem of estimating the position of a sound source in an enclosure. In particular, the main focus was on real seminars and meetings held in rooms characterized by very noisy and reverberant acoustic environments, as in the scenario envisioned in the CHIL project. The adopted solution exploits a DMN, which consists of a set of acoustic sensors deployed all around a room without any particular geometrical criterion. The main purpose of this research was to investigate an efficient fusion of the information provided by the single microphone pairs. The adoption of GCF and CSP was driven by the fact that these techniques had been thoroughly investigated at ITC-irst in recent years and represented the best choice to obtain high performance given a DMN. The novelty of this research activity was the attempt to characterize the orientation of a directional sound source, such as a speaking human being, in an effort to improve the capabilities of a GCF-based localization system. With these intents, the GCF theory was extended with the introduction of the OGCF, which makes it possible both to infer rough knowledge of the source orientation and to improve the accuracy of the source position estimates. Concerning the source orientation estimation, the proposed algorithm is

able to provide estimates of the source direction with an RMS error that ranges between 7 and 15 degrees, depending on the source radiation pattern. However, for obvious practical reasons, orientation experiments were limited to the DMN available at the ITC-irst laboratories, and it has not yet been investigated how the algorithm behaves when the number of available sensors is reduced. Moreover, experiments were always carried out adopting a single class of weighting functions, i.e. Gaussians, while the use of more complex functions, for instance derived from a mixture of Gaussians, may improve the orientation estimates. Nevertheless, the estimation of the source orientation is somewhat a by-product of this research activity, whose main focus is on increasing system robustness when single-pair contributions are combined by emphasizing the microphones that are frontal to the speaker. From this point of view, as shown by the localization experiments on the NIST RT-Spring 2005 evaluation data set, the new approach based on OGCF is able to provide accurate estimates at a higher output rate than a more traditional method based on GCF. However, in the NIST evaluation data set the problem was limited to the single-source case. This is not a very critical limitation, since in typical conditions speakers may overlap only for a few seconds. Furthermore, localization approaches based on maximizing a sound map can quite easily provide information about the positions of more than one speaker, since maps will present either multiple peaks at the same time instant or single alternating peaks as time evolves. In any case, further investigations on these aspects should be conducted. Because GCF turned out to be extremely appropriate to extend with source orientation knowledge, no comparison was conducted with alternative approaches, such as closed-form solutions or particle filtering. It is nonetheless reasonable to expect that any technique that combines contributions from single microphone pairs, or clusters in general, will benefit from the knowledge of the source orientation.

However, since the orientation estimation algorithm is tailored to the pre-existing GCF formulation, adopting alternative localization methods requires the design of new techniques. The final part of this thesis presented preliminary investigations, in an attempt to port the source localization task into a pattern recognition framework. The idea is to also take advantage of reflected wavefronts, which can still characterize the position and orientation of the source. Although the classification problem was quite simplified, the outcomes of the preliminary experiments demonstrated that the novel localization approach delivers satisfactory performance and can be exploited to cope with source positions that are unfavorable with respect to the microphone coverage. The main drawback of the proposed method is the need for a training phase that requires a representative amount of signals for each possible source configuration. Furthermore, models are strictly tailored to a given sensor set-up and given acoustic conditions, and it is not possible to fit them to a new scenario. Unfortunately, generating a training set with simulations based on the image method is not practicable: the image method simulates realistic impulse responses, but it falls far short of fully describing the real acoustic characteristics of an enclosure. However, the proposed approach has proved to handle satisfactorily small variations in the source configuration between the training and the test data sets.

9.2 Future Work

As mentioned above, the analysis carried out during this doctoral activity was restricted to the single-speaker case. Although it is reasonable to think that both GCF and OGCF can deal with the multi-speaker scenario, further research efforts should be devoted to efficiently coping with sound maps obtained when multiple competing sources are active at the same time. Two interesting research topics in this direction are: the search for

local maxima corresponding to each speaker, and the identification of the number of active speakers. With regard to the former aspect, it can be supposed that when two speakers are oriented towards different microphone pairs, OGCF will present sharper and higher peaks than GCF. If the environmental conditions and the addressed scenario suggest that speakers are steady or slowly moving, a clustering approach (see Di Claudio and Parisi in [18]) may solve the problem of searching for local maxima in the sound map and provide some clues about the number of active speakers. Another interesting research line could investigate more appropriate weighting functions for the OGCF computation, perhaps obtained as a mixture of Gaussians optimized on a development data set. In the same framework, an adaptive thresholding for sound activity detection may be devised in order to make the localization system independent of the sensor set-up and the acoustics of the environment. The investigation of a statistical framework also opens the way to many possible research activities. The work carried out so far is purely preliminary, and hence many issues deserve a more in-depth investigation. Further study is needed, for instance, to achieve more discriminant distance measures, to determine the number of models necessary in realistic scenarios, and to assess the robustness of the models in case of deviations with respect to the predefined positions and orientations. Then, a robust training mechanism should be devised in order to reduce the effort needed for data recordings and to make the algorithm less dependent on the acoustics of the environment. In this perspective, the capability of automatically generating real acoustic responses of rooms would be of great benefit for the proposed classification algorithm.

Bibliography

[1] Alberto Abad, Dušan Macho, Carlos Segura, Javier Hernando, and Climent Nadeu. Effect of head orientation on the speaker localization performance in smart-room environment. In Proceedings of Interspeech, pages 145-148, Lisbon, Portugal, September 4-8 2005.

[2] Alberto Abad, Carlos Segura, Dušan Macho, Javier Hernando, and Climent Nadeu. Audio person tracking in a smart-room environment. In Proceedings of Interspeech, pages 2590-2593, Pittsburgh, PA, USA, September 17-21 2006.

[3] Jont B. Allen and David A. Berkley. Image method for efficiently simulating small-room acoustics. Journal of the Acoustical Society of America, 65(4):943-950, April 1979.

[4] Victor M. Alvarado. Talker Localization and Optimal Placement of Microphones for a Linear Microphone Array Using Stochastic Region Contraction. PhD thesis, Brown University, May 1990.

[5] Fabio Antonacci, Davide Lonoce, Marco Motta, Augusto Sarti, and Stefano Tubaro. Efficient source localization and tracking in reverberant environments using microphone arrays. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 4, pages 1061-1064, Philadelphia, PA, USA, March 18-23 2005.

[6] Fabio Antonacci, Davide Riva, Diego Saiu, Augusto Sarti, Marco Tagliasacchi, and Stefano Tubaro. Tracking multiple acoustic sources using particle filtering. In Proceedings of the European Signal Processing Conference, Florence, Italy, September 4-8 2006.

[7] Shoko Araki, Hiroshi Sawada, Ryo Mukai, and Shoji Makino. DOA estimation for multiple sparse sources with normalized observation vector clustering. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 5, pages 33-36, Toulouse, France, May 14-19 2006.

[8] Luca Armani, Marco Matassoni, Maurizio Omologo, and Piergiorgio Svaizer. Distant-talking activity detection with multi-channel audio input. In Proceedings of IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing, Grado, Italy, June 8-11 2003.

[9] Güner Arslan, F. Ayhan Sakarya, and Brian L. Evans. Speaker localization for far-field and near-field wideband sources using neural networks. In Proceedings of IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing, volume 2, pages 528-532, Antalya, Turkey, June 20-23 1999.

[10] M. Sanjeev Arulampalam, Simon Maskell, Neil Gordon, and Tim Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174-188, February 2002.

[11] Kristine L. Bell. MAP-PF position tracking with a network of sensor arrays. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 4, pages 849-852, Philadelphia, PA, USA, March 18-23 2005.

[12] Jacob Benesty, Jingdong Chen, and Yiteng Huang. Time-delay estimation via linear interpolation and cross correlation. IEEE Transactions on Speech and Audio Processing, 12(5):509-519, September 2004.

[13] Jacob Benesty, Shoji Makino, and Jingdong Chen. Speech Enhancement. Springer-Verlag, 2005.

[14] Nevio Benvenuto and Giovanni Cherubini. Algorithms for Communications Systems and their Applications. John Wiley & Sons, August 2002.

[15] Stanley T. Birchfield and Daniel K. Gillmor. Acoustic source direction by hemisphere sampling. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 5, pages 3053-3056, Salt Lake City, UT, USA, May 7-11 2001.

[16] Michael S. Brandstein. A Framework for Speech Source Localization Using Sensor Arrays. PhD thesis, Brown University, May 1995.

[17] Michael S. Brandstein, John Adcock, and Harvey F. Silverman. A closed-form location estimator for use with room environment microphone arrays. IEEE Transactions on Speech and Audio Processing, 5(1):45-50, January 1997.

[18] Michael S. Brandstein and Darren B. Ward, editors. Microphone Arrays. Springer-Verlag, 2001.

[19] Roberto Brunelli, Alessio Brutti, Paul Chippendale, Oswald Lanz, Maurizio Omologo, Piergiorgio Svaizer, and Francesco Tobia. A generative approach to audio-visual person tracking. In Rainer Stiefelhagen and John Garofolo, editors, Multimodal Technologies for Perception of Humans, pages 55-68. Springer LNCS 4122, 2006. First International Evaluation Workshop on Classification of Events, Activities and Relationships, CLEAR 2006.

[20] Roberto Brunelli and Stefano Messelodi. Robust estimation of correlation with applications to computer vision. Pattern Recognition, 28:833-841, 1995.

[21] Alessio Brutti, Maurizio Omologo, and Piergiorgio Svaizer. Oriented Global Coherence Field for the estimation of the head orientation in smart rooms equipped with distributed microphone arrays. In Proceedings of Interspeech, pages 2337-2340, Lisbon, Portugal, September 4-8 2005.

[22] Alessio Brutti, Maurizio Omologo, and Piergiorgio Svaizer. Estimation of talker's head orientation based on Oriented Global Coherence Field. In Proceedings of the 120th AES Convention, Paris, France, May 20-23 2006.

[23] Alessio Brutti, Maurizio Omologo, and Piergiorgio Svaizer. Speaker localization based on Oriented Global Coherence Field. In Proceedings of Interspeech, pages 2606-2609, Pittsburgh, PA, USA, September 17-21 2006.

[24] Alessio Brutti, Maurizio Omologo, and Piergiorgio Svaizer. Classification of acoustic maps to determine speaker position and orientation from a distributed microphone network. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawaii, USA, April 15-20 2007. Accepted for publication.

[25] Herbert Buchner, Robert Aichner, and Walter Kellermann. TRINICON: A versatile framework for multichannel blind signal processing. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 3, pages 889-892, Montreal, Canada, May 17-21 2004.

[26] Herbert Buchner, Robert Aichner, Jochen Stenglein, Heinz Teutsch, and Walter Kellermann. Simultaneous localization of multiple sound sources using blind adaptive MIMO filtering. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 3, pages 97-100, Philadelphia, PA, USA, March 18-23 2005.

[27] J. Capon. High-resolution frequency-wavenumber spectrum analysis. Proceedings of the IEEE, 57(8):1408-1418, August 1969.

[28] G. Clifford Carter, editor. Coherence and Time Delay Estimation: An Applied Tutorial for Research, Development, Test and Evaluation Engineers. IEEE Press, Piscataway, NJ, USA, 1993.

[29] Benoît Champagne, Stéphane Bedard, and Alex Stephenne. Performance of time-delay estimation in the presence of room reverberation. IEEE Transactions on Speech and Audio Processing, 4(2):148-152, March 1996.

[30] Y. T. Chan and K. C. Ho. A simple and efficient estimator for hyperbolic location. IEEE Transactions on Signal Processing, 42(8):1905-1915, August 1994.

[31] Jingdong Chen, Jacob Benesty, and Yiteng Huang. Robust time delay estimation exploiting redundancy among multiple microphones. IEEE Transactions on Speech and Audio Processing, 11(6):549-557, November 2003.

[32] Jingdong Chen, Jacob Benesty, and Yiteng Huang. Performance of GCC- and AMDF-based time delay estimation in practical reverberant environment. EURASIP Journal on Applied Signal Processing, 2005(1):25-36, 2005.

[33] W. T. Chu and Alf C. Warnock. Detailed directivity of sound fields around human talkers. Technical report, National Research Council Canada, December 2002. URL: http://irc.nrc-cnrc.gc.ca/pubs/rr/rr104/.

[34] Weiwei Cui, Zhigang Cao, and Jianqiang Wei. Dual-microphone source localization method in 2-D space. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 4, pages 845-848, Toulouse, France, May 14-19 2006.

[35] Andrzej Czyzewski. Automatic identification of sound source position employing neural networks and rough sets. Pattern Recognition Letters, 24(6):921-933, March 2003.

[36] Yuki Denda, Takanobu Nishiura, and Yoichi Yamashita. Noise robust talker localization based on weighted CSP analysis with an average speech spectrum for microphone array steering. In Proceedings of International Workshop on Acoustic Echo and Noise Control, pages 45-48, Eindhoven, The Netherlands, September 12-15 2005.

[37] Joseph DiBiase. A High-Accuracy, Low-Latency Technique for Talker Localization in Reverberant Environments Using Microphone Arrays. PhD thesis, Brown University, May 2000.

[38] Elio D. Di Claudio, Raffaele Parisi, and Gianni Orlandi. Multi-source localization in reverberant environments by ROOT-MUSIC and clustering. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 921-924, Istanbul, Turkey, June 6-9 2000.

[39] Simon Doclo and Marc Moonen. Robust time-delay estimation in highly adverse acoustic environments. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 59-62, New Paltz, NY, USA, October 21-24 2001.

[40] Simon Doclo and Marc Moonen. Robust adaptive time delay estimation for speaker localization in noisy and reverberant acoustic environments. EURASIP Journal on Applied Signal Processing, 2003(11):1110-1124, October 2003.

[41] Arnaud Doucet, Nando de Freitas, and Neil Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer-Verlag, 2001.

[42] Ramani Duraiswami, Dmitry Zotkin, and Larry S. Davis. Active speech source localization by a dual coarse-to-fine search. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 5, pages 3309-3312, Salt Lake City, UT, USA, May 7-11 2001.

[43] James L. Flanagan. Analog measurements of sound radiation from the mouth. The Journal of the Acoustical Society of America, 32(12):1613-1620, December 1960.

[44] Dirk Focken and Rainer Stiefelhagen. Towards vision-based 3-D people tracking in a smart room. In Proceedings of IEEE International Conference on Multimodal Interfaces, pages 400-405, Pittsburgh, PA, USA, October 14-16 2002.

[45] Marco Foco, Pietro Polotti, Augusto Sarti, and Stefano Tubaro. Sound spatialization based on fast beam tracing in the dual space. In Proceedings of the Sixth Conference on Digital Audio Effects, London, UK, September 8-11 2003.

[46] Thomas Funkhouser, Nicolas Tsingos, Ingrid Carlbom, Gary Elko, Mohan Sondhi, James E. West, Gopal Pingali, Patrick Min, and Addy Ngan. A beam tracing method for interactive architectural acoustics. Journal of the Acoustical Society of America, 115(2):739-756, February 2004.

[47] Scott M. Griebel and Michael S. Brandstein. Microphone array source localization using realizable delay vectors. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 71-74, New Paltz, NY, USA, October 21-24 2001.

[48] Scott M. Griebel. A Microphone Array System for Speech Source Localization, Denoising and Dereverberation. PhD thesis, Harvard University, 2002.

[49] Kamen Guentchev and John Weng. Learning-based three dimensional sound localization using a compact non-coplanar array of microphones. In Proceedings of AAAI Spring Symposium on Intelligent Environments, pages 68-78, Stanford University, Palo Alto, CA, USA, March 23-25 1998.

[50] Monson H. Hayes. Statistical Digital Signal Processing and Modeling. John Wiley & Sons, 1996.

[51] Jie Huang, Noboru Ohnishi, and Noboru Sugie. A biomimetic system for localization and separation of multiple sound sources. IEEE Transactions on Instrumentation and Measurement, 44(3):733-738, June 1995.

[52] Yiteng Huang, Jacob Benesty, and Gary Elko. Adaptive eigenvalue decomposition algorithm for real time acoustic source localization system. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 937-940, Phoenix, AZ, USA, March 15-19 1999.

[53] Yiteng Huang, Jacob Benesty, Gary Elko, and Russell Mersereau. Real-time passive source localization: a practical linear-correction least-squares approach. IEEE Transactions on Speech and Audio Processing, 9(8):943-955, November 2001.

[54] Yiteng Huang, Jacob Benesty, and Gary W. Elko. Passive acoustic source localization for video camera steering. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 909-912, Istanbul, Turkey, June 5-9 2000.

[55] John Iannello. Time delay estimation via cross-correlation in the presence of large estimation errors. IEEE Transactions on Acoustics, Speech, and Signal Processing, 30(6):998-1003, December 1982.

[56] Andrew H. Jazwinski. Stochastic Processes and Filtering Theory. Academic Press, June 1970.

[57] Don H. Johnson and Dan E. Dudgeon. Array Signal Processing: Concepts and Techniques. Prentice Hall, 1993.

[58] Lawrence E. Kinsler, Austin R. Frey, Alan B. Coppens, and James V. Sanders. Fundamentals of Acoustics. John Wiley & Sons, 2000.

[59] Ulrich Klee, Tobias Gehrig, and John McDonough. Kalman filters for time delay of arrival-based source localization. In Proceedings of Interspeech, pages 2289-2292, Lisbon, Portugal, September 4-8 2005.

[60] Charles H. Knapp and G. Clifford Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(4):320-327, August 1976.

[61] Heinrich Kuttruff. Room Acoustics. Elsevier Applied Science, 1991.

[62] Oswald Lanz. Probabilistic Multi-person Tracking for Ambient Intelligence. PhD thesis, ICT School, Università di Trento, February 2005.

[63] Oswald Lanz. Approximate Bayesian multibody tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1436-1449, September 2006.

[64] Guillaume Lathoud and Mathew Magimai.-Doss. A sector-based, frequency-domain approach to detection and localization of multiple speakers. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 3, pages 265-268, Philadelphia, PA, USA, March 18-23 2005.

[65] Gianni Lazzari, Maurizio Omologo, and Piergiorgio Svaizer. Procedimento per la localizzazione di un parlatore e l'acquisizione di un messaggio vocale, e relativo sistema. Nr. TO92A000855, October 1992.

[66] Gianni Lazzari, Maurizio Omologo, and Piergiorgio Svaizer. Method for location of a speaker and acquisition of a voice message, and related system. U.S. Patent 5,465,302, 1993.

[67] Eric A. Lehmann, Darren B. Ward, and Robert C. Williamson. Experimental comparison of particle filtering algorithms for acoustic source localization in a reverberant room. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 5, pages 177-180, Hong Kong, April 6-10 2003.

[68] Wing-Kin Ma, Ba-Ngu Vo, Sumeetpal Singh, and Adrian Baddeley. Tracking an unknown time-varying number of speakers using TDOA measurements: a random finite set approach. IEEE Transactions on Signal Processing, 54(9):3291-3304, September 2006.

[69] Masao Matsuo, Yusuke Hioka, and Nozomu Hamada. Estimating DOA of multiple speech signals by improved histogram mapping method. In Proceedings of International Workshop on Acoustic Echo and Noise Control, pages 129-132, Eindhoven, The Netherlands, September 12-15 2005.

[70] Robin J. Y. McLeod and M. Louisa Baart. Geometry and Interpolation of Curves and Surfaces. Cambridge University Press, 1998.

[71] Paul C. Meuse and Harvey F. Silverman. Characterization of talker radiation pattern using a microphone array. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 257-260, Adelaide, Australia, April 19-22 1994.

[72] Mitsunori Mitzumaki and Latsuyuki Niyada. DOA estimation using cross-correlation with particle filter. In Proceedings of Joint Workshop on Hands-Free Communications and Microphone Arrays, Piscataway, NJ, USA, March 17-18 2005.

[73] Philip M. Morse and K. Uno Ingard. Theoretical Acoustics. Princeton University Press, Princeton, NJ, USA, 1986.

[74] Ryo Mukai, Hiroshi Sawada, Shoko Araki, and Shoji Makino. Real-time blind source separation and DOA estimation using small 3-D microphone array. In Proceedings of International Workshop on Acoustic Echo and Noise Control, pages 45-48, Eindhoven, The Netherlands, September 12-15 2005.

[75] Takanobu Nishiura, Takeshi Yamada, Satoshi Nakamura, and Kyohiro Shikano. Localization of multiple sound sources based on a CSP analysis with a microphone array. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 1053-1056, Istanbul, Turkey, June 5-9 2000.

[76] Takanobu Nishiura, Satoshi Nakamura, and Kyohiro Shikano. Talker localization in a real acoustic environment based on DOA estimation and statistical sound source identification. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 893-896, Orlando, FL, USA, May 13-17 2002.

[77] Maurizio Omologo and Piergiorgio Svaizer. Use of the crosspower-spectrum phase in acoustic event localization. Technical Report 9303-13, ITC-irst Centro per la Ricerca Scientifica e Tecnologica, 1993.

[78] Maurizio Omologo and Piergiorgio Svaizer. Acoustic event localization using a crosspower-spectrum based technique. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 273-276, Adelaide, Australia, April 19-22 1994.

[79] Maurizio Omologo and Piergiorgio Svaizer. Use of Crosspower-Spectrum Phase in acoustic event location. IEEE Transactions on Speech and Audio Processing, 5(3):288-292, May 1997.

[80] Maurizio Omologo, Piergiorgio Svaizer, Alessio Brutti, and Luca Cristoforetti. Speaker localization in CHIL lectures: Evaluation criteria and results. In Steve Renals and Samy Bengio, editors, MLMI 2005: Revised and Selected Papers, pages 476-487, Edinburgh, UK, July 11-13 2005. Springer Berlin/Heidelberg.

[81] Maurizio Omologo, Piergiorgio Svaizer, and Renato De Mori. Spoken Dialogue with Computers. Academic Press, 1998. Chapter 2: Acoustic Transduction.

[82] Alan V. Oppenheim, Ronald W. Schafer, and John R. Buck. Discrete-Time Signal Processing. Prentice Hall, 1999.

[83] J. Michael Petterson and Chris Kyriakakis. Hybrid algorithm for robust, real-time source localization in reverberant environments. In Proceedings of Joint Workshop on Hands-Free Communications and Microphone Arrays, Piscataway, NJ, USA, March 17-18 2005.

[84] Patrick M. Peterson. Simulating the response of multiple microphones to a single acoustic source in a reverberant room. Journal of the Acoustical Society of America, 80(5):1527-1529, November 1986.

[85] Ilyas Potamitis, Huimin Chen, and George Tremoulis. Tracking of multiple moving speakers with multiple microphone arrays. IEEE Transactions on Speech and Audio Processing, 12(5):520-529, September 2004.

[86] Daniel V. Rabinkin, Richard J. Renomeron, Arthur Dahl, Joseph C. French, and James L. Flanagan. A DSP implementation of source location using microphone arrays. In Proceedings of the 131st Meeting of the Acoustical Society of America, pages 88-99, Indianapolis, IN, USA, May 13-17 1996.

[87] Francis A. Reed, Paul L. Feintuch, and Neil J. Bershad. Time delay estimation using the LMS adaptive filter - static behavior. IEEE Transactions on Acoustics, Speech and Signal Processing, 29(3):561-570, June 1981.

[88] Yong Rui and Dinei Florencio. New direct approaches to robust sound source localization. In Proceedings of IEEE International Conference on Multimedia and Expo, volume 1, pages 737-740, Baltimore, MD, USA, July 6-9 2003.

[89] Yong Rui and Dinei Florencio. Time delay estimation in the presence of correlated noise and reverberation. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 133-136, Montreal, Quebec, Canada, May 17-21 2004.

[90] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, NJ, USA, 1995.

[91] Wallace C. Sabine. Collected Papers on Acoustics. Peninsula Publishing, 1922.

[92] Joshua M. Sachar and Harvey F. Silverman. A baseline algorithm for estimating talker orientation using acoustical data from a large-aperture microphone array. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 4, pages 65-68, Montreal, Canada, May 17-21 2004.

[93] Hiroshi Sawada, Stefan Winter, Ryo Mukai, Shoko Araki, and Shoji Makino. Estimating the number of sources for frequency-domain blind source separation. In Proceedings of the International Conference on Independent Component Analysis and Blind Signal Separation, pages 610-617, Granada, Spain, September 22-24 2004.

[94] H. Schau and A. Robinson. Passive source localization employing intersecting spherical surfaces from time-of-arrival differences. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(12):1661-1669, December 1987.

[95] Jan Scheuing and Bin Yang. Disambiguation of TDOA estimates in multi-path multi-source environments (DATEMM). In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 4, pages 837-840, Toulouse, France, May 14-19 2006.

[96] Ralph Schmidt. A Signal Subspace Approach to Multiple Emitter Location and Spectral Estimation. PhD thesis, Stanford University, 1981.

[97] Harvey F. Silverman and Stuart E. Kirtman. A two-stage algorithm for determining talker location from linear microphone array data. Computer Speech and Language, 6(2):129-152, 1992.

[98] Paris Smaragdis and Petros Boufounos. Learning source trajectories using wrapped-phase hidden Markov models. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 114-117, New Paltz, NY, USA, October 16-19 2005.

[99] J. Smith and J. Abel. Closed-form least-squares source location estimation from range-difference measurements. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(12):1661-1669, December 1987.

[100] Norbert Strobel and Rudolf Rabenstein. Classification of time delay estimates for robust speaker localization. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 6, pages 3081-3084, Phoenix, AZ, USA, March 15-19 1999.

[101] Josephine Sullivan and Jens Rittscher. Guiding random particles by deterministic search. In Proceedings of IEEE International Conference on Computer Vision, volume 1, pages 323-330, Vancouver, BC, Canada, July 7-14 2001.

[102] Piergiorgio Svaizer, Marco Matassoni, and Maurizio Omologo. Acoustic source location in a three-dimensional space using crosspower spectrum phase. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 231-234, Munich, Germany, April 21-24 1997.

[103] Mikael Swartling, Nedelko Grbić, and Ingvar Claesson. Direction of arrival estimation for multiple speakers using time-frequency orthogonal signal separation. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 4, pages 833-836, Toulouse, France, May 14-19 2006.

[104] Don J. Torrieri. Statistical theory of passive location systems. IEEE Transactions on Aerospace and Electronic Systems, 20(3), March 1984.

[105] Jaco Vermaak and Andrew Blake. Nonlinear filtering for speaker tracking in noisy and reverberant environments. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 5, pages 3021-3024, Salt Lake City, UT, USA, May 7-11 2001.

[106] Darren B. Ward and Robert C. Williamson. Particle filter beamforming for acoustic source localization in a reverberant environment. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 1777-1780, Orlando, FL, USA, May 13-17 2002.

[107] Kung Yao, Ralph E. Hudson, Chris W. Reed, Daching Chen, and Flavio Lorenzelli. Blind beamforming on a randomly distributed sensor array system. IEEE Journal on Selected Areas in Communications, 16(8):1555-1567, October 1998.

Appendix A

Oriented Image Method

The image method, introduced by Allen and Berkley [3] and afterwards extended by Peterson [84], is widely adopted in the SSL community to simulate impulse responses of acoustic channels in reverberant environments. The method assumes a ray propagation model and represents acoustic waves as rays that propagate through a room and reflect on surfaces according to Snell's law. Reflections are further simplified by neglecting the effects of diffraction and by assuming that wall reflectivity is constant with respect to frequency. With these simplifications, reflections can be modeled as virtual sources obtained by mirroring the original source with respect to the walls.

Figure A.1: Image method: the figure on the left explains how a reflection can be substituted with an image source; the right part depicts an example of first-order images.
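As a concrete instance of the mirroring step in the standard shoebox geometry of [3] (the coordinate convention below is chosen purely for illustration), a source at $\mathbf{s} = (x_s, y_s, z_s)$ reflected by the wall lying on the plane $x = 0$ is replaced by the image source

$$\mathbf{s}' = (-x_s,\; y_s,\; z_s),$$

and iterating the mirroring over the two walls perpendicular to the $x$ axis of a room of width $L_x$ places the image abscissas at $x = 2 n L_x \pm x_s$ with $n \in \mathbb{Z}$; the other two axes behave in the same way, generating the lattice of image sources at increasing distances described below.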

Let us assume that, besides the direct path between the source and the receiver, a reflection on a wall introduces an indirect path, as depicted in figure A.1. The contribution of the reflection on the wall is accounted for by mirroring the real source with respect to the wall where the reflection occurred. The image source is assumed to emit the same signal as the original source in a synchronous manner. The reflection point is inferred by joining the image source with the receiver. In an enclosure, every surface is used to mirror the original source, producing a set of first-order image sources which account for the first-order, or single-reflection, indirect paths. The right part of figure A.1 shows an example of first-order image sources restricted to a plane. Each image can be mirrored again by each wall, in an iterative process that considers several images at increasing distances from the source. Summing up the contributions of each source, delayed by the propagation time and weighted according to the propagation and reflection absorptions, makes it possible to generate a realistic multi-path impulse response.

The classical image method assumes that the sound source and the receiver are omnidirectional, so that each reflection contributes in the same way to the impulse response. Since in this research activity we never deal with omnidirectional sources, we devised an extension of the image method that includes information about the source radiation pattern. By connecting a generic image source to the receiver it is possible to obtain the final part of the propagation path, i.e. from the last reflection to the receiver, and consequently the angle of impact at the receiver. It is therefore quite trivial to account for the directivity of the receiver. Conversely, as soon as the image order is higher than one, the information about the first part of the propagation path, from the source to the first reflection, is lost, and so there is no knowledge about the emitting angle. However, the reflection pattern from the source to the receiver is equal to the reflection pattern from the receiver to the source. Hence the radiation pattern of the

source can be taken into account by considering a directional receiver and switching the source and the receiver.

Figure A.2: Example of a cardioid-like radiation pattern (contour levels at 0, -5, -12 and -20 dB).

A cardioid-like radiation pattern, which roughly simulates the radiation properties of human beings [61, 33], was exploited in the simulations. In particular, figure A.2 depicts a typical radiation pattern adopted to generate the artificial data collections. The algorithm is suitable for implementing any other radiation pattern that is a function of the elevation and azimuth angles of radiation. On the other hand, the dependence on frequency of the radiation characteristics of human beings [58] was neglected in this work.
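To make the construction concrete, the following minimal sketch (Python/NumPy) mirrors the source across the six walls of a shoebox room and weights each path by a cardioid-like gain. All names, the -20 dB rear floor and the nearest-sample delay are illustrative assumptions, and only first-order images are generated, whereas the thesis simulations use higher orders; note that for first-order images the emission angle can be recovered directly by un-mirroring the image-ray direction, so the source-receiver swap described above is not needed here:

```python
import numpy as np

C = 343.0  # speed of sound in m/s (room-temperature assumption)

def cardioid_gain(cos_theta, floor_db=-20.0):
    # Cardioid-like gain versus emission angle, floored at about
    # -20 dB so the rear lobe is attenuated rather than nulled,
    # loosely matching the contour levels of figure A.2.
    g = 0.5 * (1.0 + cos_theta)
    return max(g, 10.0 ** (floor_db / 20.0))

def first_order_oriented_ir(src, look_dir, mic, room, beta,
                            fs=16000, n_taps=8192):
    # src, mic: 3-D positions in meters; room: (Lx, Ly, Lz);
    # beta: frequency-flat wall reflection coefficient;
    # look_dir: unit vector giving the source orientation.
    src = np.asarray(src, dtype=float)
    mic = np.asarray(mic, dtype=float)
    look = np.asarray(look_dir, dtype=float)
    look = look / np.linalg.norm(look)
    h = np.zeros(n_taps)

    # The real source plus its six first-order images (one per wall);
    # the mirroring axis is kept so the emission angle can be recovered.
    paths = [(src, 1.0, None)]
    for axis in range(3):
        for wall in (0.0, room[axis]):
            img = src.copy()
            img[axis] = 2.0 * wall - img[axis]  # mirror across the wall
            paths.append((img, beta, axis))

    for pos, refl, axis in paths:
        ray = mic - pos
        dist = np.linalg.norm(ray)
        direction = ray / dist
        # The direction actually leaving the real source is the
        # image-ray direction un-mirrored along the reflecting axis
        # (for higher orders this information is lost, which is why
        # the appendix swaps source and receiver instead).
        emit = direction.copy()
        if axis is not None:
            emit[axis] = -emit[axis]
        tap = int(round(dist / C * fs))  # nearest-sample delay
        if tap < n_taps:
            h[tap] += (refl * cardioid_gain(float(np.dot(emit, look)))
                       / (4.0 * np.pi * dist))
    return h
```

For instance, under these assumptions, first_order_oriented_ir((2.0, 3.0, 1.5), (1.0, 0.0, 0.0), (4.5, 1.0, 1.6), (5.0, 6.0, 4.0), beta=0.7) returns the direct tap plus six attenuated first-order reflections; Peterson's low-pass interpolation of fractional delays [84] is deliberately omitted to keep the sketch short.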


Appendix B

DMN and Hardware

B.1 ITC-irst CHIL Room

In order to fulfill the CHIL project requirements, ITC-irst arranged a smart room whose dimensions are 5 x 6 x 4 m. A complete map of the room, including the audio sensor positions, is shown in figure B.1.

Figure B.1: Map of the DMN available at ITC-irst laboratories (T-shaped clusters T0-T6, fixed and pan-tilt-zoom cameras, screen, speaker area, table microphones, table for meetings, the NIST Mark III array and the IRST Light Array).

As already mentioned, the DMN deployed in this CHIL room consists of: 7 T-shaped microphone clusters, 1 64-channel NIST Mark III linear array, 4 table-top microphones and close-talk microphones to be worn by the persons within the room. Furthermore, an Acoustic Magic linear array, capable of automatically tracking sound sources and delivering beamformed signals, was deployed for comparison purposes. Figure B.2 shows a partial view of the CHIL room created at ITC-irst. In particular, the picture includes two T-shaped arrays, namely T3, just above the door, and T4, as well as the Mark III linear array.

Figure B.2: Partial view of the DMN available at ITC-irst laboratories.

The room where the DMN is operating is almost empty, except for a small round table located near one of the corners and a small desk with a monitor and a personal computer, placed against a wall below the array T6 for demonstration purposes. During data collections for the meeting scenario, the table where the table-top microphones are placed is moved to the middle of the room. In addition, the walls are not covered by any kind of furnishing; as a consequence, the reverberation time reaches the considerable value of 0.7 s. It is worth mentioning that the wall depicted in