Integrated Vision and Sound Localization

Parham Aarabi and Safwat Zaky
Department of Electrical and Computer Engineering, University of Toronto
10 King's College Road, Toronto, Ontario, Canada, M5S 3G4
parham@stanford.edu, safwat@eecg.toronto.edu

Abstract - This paper illustrates the synergistic advantages of a multi-modal object localization system that combines vision and sound localization. Prototype vision and sound localization systems were developed and integrated using spatial probability maps, which allow any number of cameras or microphones with arbitrary orientations to be integrated easily. Test results show a significant improvement in the integrated vision and sound localization (IVSL) system's ability to localize objects accurately in low signal-to-noise situations. Furthermore, the performance of the IVSL system was shown to surpass that of the individual subsystems.

Keywords: Microphone arrays, vision, sound localization, multi-sense multi-source information fusion, data integration.

1 Introduction

Many object localization systems have been reported that rely on a single sense, such as sound localization or vision. While some of these have been relatively successful, they lack the robustness and accuracy needed in many applications. Biological systems, such as human perception, can robustly and accurately localize objects even in the presence of significant amounts of noise. One reason for this ability is that they rely on the integration of several different senses rather than on a single sense. Sense integration is effective because it allows the perception system to operate in a greater number of situations than would be possible with a single sense alone. Also, a given source of noise is likely to affect only one of the senses. For example, a vision system is unaffected by background sound sources in the environment, just as a sound localization system is unaffected by rapidly varying room lighting.

Many previous artificial awareness systems have attempted to integrate multiple senses ([1], [2], [3], [4], [5]). However, most of these implementations process each sense separately and integrate the overall results only as the final step. Such an approach may lose information that would have been available had the sensors been processed interdependently. The system described in [1] uses an array of 8 microphones to locate a speaker and then steer a camera towards the sound source. The camera does not participate in the localization of objects; it is used simply to take images of the sound source after it has been localized. Since the system was not tested in situations with low signal-to-noise ratios (SNRs), its performance characteristics in those situations are unknown.

The approach proposed in this paper is based on integrating the information obtained from a vision system and a sound localization system. During analysis, the system makes use of all the information gathered to take maximum advantage of the capabilities of each sense. Data from the two senses is used to form a spatial probability map, which describes the probability of the object being found at any given location in the environment. As will be shown, this form of information integration has the potential for more robust and accurate localization.

Sections 2 and 3 describe the vision and sound localization subsystems, respectively. Section 4 presents the results of integrating the two subsystems using spatial probability maps.
2 Vision based Object Localization

The vision subsystem uses a simple object identification algorithm known as background segmentation to determine the direction of an object relative to the camera's frame of reference. The first step in this process involves taking two images, one before and one after an object is introduced into the environment. To identify the object (or objects) in the image, each pixel of the updated image is subtracted from the corresponding pixel in the background image. Areas of large intensity difference between the two images remain, while areas with similar intensities are removed.
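As a rough illustration, the pixel subtraction step described above might be sketched as follows in NumPy (the grayscale input format and the threshold value are assumptions; the paper does not specify them):

```python
import numpy as np

def background_segmentation(background, current, threshold=30):
    """Flag pixels whose intensity changed substantially after an
    object was introduced into the scene."""
    # Use a signed type so the subtraction cannot wrap around.
    diff = np.abs(current.astype(np.int16) - background.astype(np.int16))
    # Large differences survive as object candidates; similar
    # intensities are removed.
    return diff > threshold

# Usage (hypothetical): mask = background_segmentation(bg, frame)
# where bg and frame are equally sized 8-bit grayscale images.
```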

The possible object locations obtained by background segmentation are converted to a two-dimensional image describing the object's probable location, as illustrated in Figure 1. The resulting image is called the spatial probability map (SPM). Since, with a single camera, it is not possible to distinguish between a small object close to the camera and a large object far away, the probable locations of the object fall into the triangular region shown. With additional cameras and objects, the situation becomes much worse. Although this visual object localization approach is very simple, it would not be practical without external help, such as multi-modal information, because of the large number of false objects it produces, as shown in Figure 3.

Figure 1 The spatial probability map (a high object probability region bounded by low object probability regions)

In order to pinpoint the object's position, at least one more camera is required. A point-wise addition of the SPMs obtained will then yield a two-dimensional image that has a local intensity peak at the location corresponding to the object's position. In the case of multiple objects, multiple peaks would be expected. One problem that arises with multiple objects is the formation of false objects, as illustrated in Figure 2. In most situations, false objects can be removed by a heuristic search algorithm [6]. Sense integration can also aid the removal of false objects.

Figure 2 The formation of false objects (two real objects giving rise to a false object)

Figure 3 Visual object localization in the presence of three cameras and objects

Other approaches, such as more complex visual object localization systems or the Intelligent False Object Removal algorithm, provide further means of solving the false object problem [6]. In this paper, however, the focus is placed on utilizing multi-modal information for real object identification.

3 The Sound Localization System

A variety of microphone array-based sound localization techniques exist. The integrated system described in this paper uses an iterative spatial probability (ISP) algorithm, which has been shown to provide increased accuracy and robustness [6]. First, sound signals are obtained from a microphone array. Next, cross-correlation functions are computed for all possible microphone pairs. After filtering, the peak of each cross-correlation function is identified. The process is repeated on successive sound windows, and the positions of the cross-correlation peaks for each microphone pair are recorded in a histogram. Changes in the positions of the histogram peaks are monitored, and the iterative process is terminated when the desired localization accuracy is attained.
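The per-pair histogram accumulation can be illustrated with a simplified sketch (the window length, lag limit, and the omission of the filtering and iteration-control steps are assumptions here; the full ISP algorithm is described in [6]):

```python
import numpy as np

def tdoa_histogram(x, y, win=1024, max_lag=64):
    """Accumulate a histogram of cross-correlation peak positions
    (time-delay estimates) over successive windows of two
    microphone signals x and y."""
    hist = np.zeros(2 * max_lag + 1)
    lags = np.arange(-(win - 1), win)        # lags of the full correlation
    keep = np.abs(lags) <= max_lag           # physically plausible delays
    for start in range(0, len(x) - win + 1, win):
        a = x[start:start + win]
        b = y[start:start + win]
        cc = np.correlate(a, b, mode="full") # length 2*win - 1
        peak_lag = lags[keep][np.argmax(cc[keep])]
        hist[peak_lag + max_lag] += 1        # record this window's peak
    return hist
```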

Each histogram, which corresponds to a specific microphone pair, is converted to a two-dimensional spatial probability map showing the likelihood of the existence of an object at all spatial locations. A high-intensity point in the map corresponds to a high likelihood that the object is at that location. Figure 4 shows an example of a spatial probability map associated with a single microphone pair.

Figure 4 SPM associated with a single microphone pair

Since the histograms, and thus the spatial probability maps, represent the likelihood of an object being at certain locations, they can be merged by adding the corresponding SPMs. Figure 5 gives the result of adding the two SPMs associated with two microphone pairs.

Figure 5 Overlapped SPMs for two microphone pairs resulting in an intensity peak

The iterative approach means that the sound localization process can be adjusted to allow the system to perform as accurately as needed for any signal-to-noise ratio. The use of the spatial probability map makes the sound localization process very flexible. Adding a microphone pair simply involves computing a histogram for that pair and adding the resulting SPM to the overall SPM [6]. In a similar manner, information received from other localization subsystems can be easily incorporated by adding the corresponding SPMs, as will be shown in Section 4.

The sound localization subsystem used in the integrated system consisted of 3 microphones placed in a linear fashion at 0.5 m intervals. The height of the array was fixed at 1.65 m in order to ensure that the microphones were coplanar with speakers in the environment. A detailed discussion of the sound localization system can be found in [6] and [7].

4 Integrating the results of sound localization and vision

Spatial probability maps are used in both the vision and sound localization subsystems to combine multiple cameras and multiple microphone pairs, respectively. This makes integration of the results of the two subsystems particularly easy. By the weighted addition of the SPMs obtained from the two subsystems, we obtain a single SPM representing the combined results. The weights used prior to addition correspond to the relative merit of the associated sensor. For example, in a low-SNR environment, more weight would be given to the vision system than to the sound localization system because the degree of confidence in the former is greater.

Figure 6 The steps involved in the IVSL system: 1. obtain sound samples; 2. compute cross-correlations and append histograms; 3. send histograms in message packets; 4. interface with camera and obtain current scene image; 5. obtain scene image; 6. conduct image processing and decomposition routines; 7. block until message packets are received; 8. produce joint vision and sound localization SPM; 9. locate speaker in image; 10. display joint SPM and located speaker
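Under these definitions, the weighted fusion amounts to a point-wise weighted sum followed by peak extraction. A minimal sketch (the weight values are illustrative, not taken from the paper):

```python
import numpy as np

def fuse_spms(spms, weights):
    """Merge spatial probability maps by weighted point-wise
    addition and return the joint map and its peak location."""
    joint = sum(w * spm.astype(float) for spm, w in zip(spms, weights))
    peak = np.unravel_index(np.argmax(joint), joint.shape)
    return joint, peak

# Example: trust vision more than sound in a low-SNR room.
# joint, peak = fuse_spms([vision_spm, sound_spm], [0.8, 0.2])
```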

As illustrated in Figure 6, the IVSL system consists of a sound localization system that continuously obtains SPMs of the environment. Every completed SPM is passed, in the form of encoded histograms, to a visual object localization system that merges its own visual SPM with the sound localization SPM. The peak of the integrated SPM is selected, and the images of the objects associated with that peak are displayed as the result of the speaker localization process.

As an initial example, the results of integrating a single-camera visual object localization system with the 3-microphone sound localization system are presented.

Figure 7 Results of the IVSL system using a single camera (panels a-f)

Figure 7 illustrates the processing steps undertaken by the IVSL system. The locations of the microphone array and the camera are shown in Figure 7a. The camera image of the room is shown in Figure 7b. Figure 7c illustrates the SPM obtained by the sound localization system. The weighted addition of the vision and sound localization SPMs and the peak of the combined SPM are shown in Figures 7d and 7e, respectively. It should be noted that the vision SPM has a much higher weight than the sound localization SPM. The reason for this weight selection is to ensure that the vision results, which are more robust than the sound localization results, are relied upon to a greater extent. Finally, Figure 7f shows that the system has been able to identify the object responsible for the production of sound.

Figures 8-14 illustrate the application of the IVSL algorithms to a two-camera situation in the presence of a speaker and a non-vocal object. The implemented system consists of the same 3-microphone sound localization system used in the previous example along with two cameras placed in the corners of the room, as shown in Figure 8.

Figure 8 The camera and microphone array locations for the dual camera example

Figures 9a and 9b illustrate the SPMs formed for the first and second camera, respectively. It is clear from these SPMs that each camera sees two objects.

Figure 9 The SPMs obtained by a) the first camera and b) the second camera

By adding the two visual SPMs we obtain the SPM in Figure 10, which has four peaks. The four peaks arise from two true objects (a person and a chair in the room) and two false objects.

Figure 10 The combined SPM for both cameras
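One way to enumerate such peaks (true and false alike) in a combined map is a local-maximum search. A sketch, with the neighborhood size and relative threshold chosen arbitrarily:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def find_peaks(spm, size=5, rel_thresh=0.5):
    """Return the (row, col) positions of local maxima in an SPM.
    With two cameras and two objects, up to four peaks are
    expected: two real objects and two false ones."""
    is_local_max = spm == maximum_filter(spm, size=size)
    is_strong = spm > rel_thresh * spm.max()
    rows, cols = np.nonzero(is_local_max & is_strong)
    return list(zip(rows, cols))
```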

The ambiguity between the true and false objects can be resolved by using the information gathered by the sound localization subsystem, which produced the SPM shown in Figure 11.

Figure 11 The sound localization SPM

In order to combine the sound localization and vision results, the corresponding SPMs are added together after being multiplied by appropriate weight factors. The combined SPM, which is shown in Figure 12a, has a single intensity peak (Figure 12b) that corresponds to the speaker in the room.

Figure 12 Results of a) combining the visual and sound localization SPMs and b) the integrated SPM's peak

When this location is translated back to the camera image coordinates, it correctly identifies the speaker, as illustrated in Figures 13a and 13b.

Figure 13 The results of the IVSL object localization system as seen by a) camera 1 and b) camera 2

5 Performance

This paper is based on the proposition that the integration of the vision and sound localization senses increases the accuracy and robustness of the overall localization. An experiment was conducted to examine the effects of a secondary sound source on localization accuracy with and without sense integration. The overall accuracy was computed for five different SNR values. The localization error is taken to be the root mean square of the distance between the actual object location and the estimated one.
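Written out explicitly, this error measure over N trials is (the notation here is assumed, not taken from the paper):

```latex
E_{\mathrm{rms}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}
    \left\lVert \hat{\mathbf{x}}_i - \mathbf{x}_i \right\rVert^{2}}
```

where \(\mathbf{x}_i\) is the actual object location and \(\hat{\mathbf{x}}_i\) the estimated location in trial i.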

The localization accuracy of the IVSL system is compared to the accuracy of the sound localization system alone in Figure 14. As can be seen, the localization accuracy is increased in all cases, especially in low-SNR situations, where the sound localization system by itself would occasionally mistake the noise source for the main speaker, resulting in a sudden increase in localization error. With the addition of the vision sense, the sound localization system is no longer confused by the secondary sound source, and hence the accuracy is greatly increased.

Figure 14 Localization error (cm) versus SNR (dB) for the IVSL system and for sound localization alone

Compared to the speaker localization and visual identification system implemented in [1], the IVSL system benefits from both sound and vision. This means that in cases where sound localization is not able to correctly locate the speaker, the vision system can aid the localization process, as illustrated in Figure 14. Also, the IVSL system was tested at a variety of SNRs, while the noise robustness of the implementation in [1] remains unclear. Another difference is that the IVSL system does not require a camera aiming procedure. Unlike the implementation in [1], the IVSL system has a fixed camera pointed at the environment, and the image of the speaker is a subset of the image obtained by this camera. Overall, the IVSL system offers superior functionality and robustness to background sounds and noise. In terms of accuracy at high SNRs, both systems use a per-sample analysis, which means that their accuracy at these SNRs is roughly equivalent [6]. At low SNRs, however, the IVSL system can consistently locate the speaker.

6 Conclusions

This paper described the integration of a dual camera vision system with the results of a sound localization system. The localization results of each subsystem are in the form of a map that represents the probability of an object being at any given location in a two-dimensional space. Integration of the results is accomplished by computing a weighted sum of the two maps. Test results confirm that the performance of the integrated system is superior to that of either of its subsystems. The important advantage of the proposed method of integration is that noise sources seen by one of the senses do not have any effect on the other. Hence the integrated system is more accurate and robust, and can operate at significantly lower signal-to-noise ratios.

References

[1] Rabinkin, D., et al., "A DSP Implementation of Source Location Using Microphone Arrays," 131st Meeting of the Acoustical Society of America, Indianapolis, Indiana, May 1996.

[2] Coen, M., "Design Principles for Intelligent Environments," Proceedings of the 1998 National Conference on Artificial Intelligence (AAAI-98), 1998.

[3] Pentland, A., Scientific American, Vol. 274, No. 4, pp. 68-76, April 1996.

[4] Brooks, R. A., with contributions from M. Coen, D. Dang, J. DeBonet, J. Kramer, T. Lozano-Perez, J. Mellor, P. Pook, C. Stauffer, L. Stein, M. Torrance, and M. Wessler, "The Intelligent Room Project," Proceedings of the Second International Cognitive Technology Conference (CT'97), Aizu, Japan, August 1997.

[5] Torrance, M., "Advances in Human-Computer Interaction: The Intelligent Room," Working Notes of the CHI 95 Research Symposium, Denver, Colorado, May 6-7, 1995.

[6] Aarabi, P., "Multi-Sense Artificial Awareness," M.A.Sc. Thesis, Department of Electrical and Computer Engineering, University of Toronto, June 1999.

[7] Aarabi, P. and Zaky, S., "Iterative Spatial Probability Based Sound Localization," to appear in Proceedings of the Fourth World Multiconference on Circuits, Systems, Communications, and Computers, Athens, Greece, July 2000.