
LOOK WHO'S TALKING: SPEAKER DETECTION USING VIDEO AND AUDIO CORRELATION

Ross Cutler and Larry Davis
Institute for Advanced Computer Studies
University of Maryland, College Park
{rgc,lsd}@cs.umd.edu

ABSTRACT

The visual motion of the mouth and the corresponding audio data generated when a person speaks are highly correlated. This fact has been exploited for lip/speechreading and for improving speech recognition. We describe a method of automatically detecting a talking person (both spatially and temporally) using video and audio data from a single microphone. The audio-visual correlation is learned using a TDNN, which is then used to perform a spatio-temporal search for a speaking person. Applications include video conferencing, video indexing, and improving human-computer interaction (HCI). An example HCI application is provided.

1. INTRODUCTION

The visual motion of a speaker's mouth is highly correlated with the audio data generated from the voice box and mouth [8]. This fact has been exploited for lip/speechreading (e.g., [17, 13]) and for combined audio-visual speech recognition (e.g., [5]). We utilize this correlation to detect speakers using video and audio input from a single microphone. We learn the audio-visual correlation in speaking using a time-delayed neural network (TDNN) [2], which is then used to search an audio-video input for speaking people.

Applications of speaker detection include video conferencing, video indexing, and improving the human-computer interface. In video conferencing, knowing where someone is speaking can cue a video camera to zoom in on the speaker; it can also be used to transmit only the speaker's video in bandwidth-limited conferencing applications. Speaker detection can also be used to index video (e.g., "find me all clips of someone speaking"), and can be combined with face recognition techniques (e.g., "find me all clips of Bill Clinton speaking"). Finally, speaker detection can be used to improve human-computer interaction (HCI) by providing applications with the knowledge of when and where a user is speaking. We provide an example application in which the user speaks to the computer, and the computer performs an action using the time and location of the speaker.

1.1. Related work

There has been a significant amount of work done in detecting faces from images and video (e.g., [16]). There has also been a significant amount of work done in locating speakers using arrays of microphones (e.g., [1]), and in identifying a specific individual who is speaking (e.g., [15]). Audio data can be used to animate lips for animated and real characters [3, 4]. Vision-based techniques have also been used to detect people in front of kiosks [14, 7]. There are text-to-speech systems which utilize hand-coded phoneme-to-viseme rules to animate characters [18]. We are not aware of any previous work that exploits the audio-visual correlation in speaking to detect speakers (spatially and temporally) in video with a single microphone.

1.2. Assumptions

In this work, we assume that only one person is speaking at a time, and that there is no significant background noise (though the audio S/N in our test data is not high). We also assume that the speaker does not move his head excessively while talking (though we suggest methods in Section 5 to handle this).

2. METHOD

Our method exploits the correlation between mouth motions and audio data. Figure 1 shows a recurrence matrix [9, 6] of the mouth-region image similarities and the corresponding audio data.
A recurrence matrix is a qualitative tool used to perform time-series analysis of non-linear dynamic systems. In this case, the recurrence matrix $R$ is defined by

$$R(t_1, t_2) = d(I_{t_1}, I_{t_2})$$

where $d$ is the correlation of images $I_{t_1}$ and $I_{t_2}$. In this figure, we see that times of change in the audio data are highly correlated with visual changes in the mouth. However, the relationship between the two signals is not simple: changes in the audio signal do not necessarily imply changes in the visual signal (and vice versa), and the visual signal may lead or lag the audio signal significantly (Bregler and Konig [5] use mutual information to show that the audio data on average lagged behind the video data by approximately 120 ms in their dataset). In addition, the changes are highly context sensitive, analogous to the coarticulation problem in speech recognition. We utilize a TDNN to learn the context-dependent correlations between the audio and visual signals.
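As an illustration (not part of the original paper), the following Python sketch computes such an image-similarity recurrence matrix with NumPy. It assumes the mouth-region crops are already available as a (T, H, W) array of grayscale frames and uses normalized correlation as the similarity $d$; both choices are ours.

import numpy as np

def recurrence_matrix(frames):
    """Recurrence matrix R[t1, t2] = correlation of mouth patches I_t1 and I_t2.

    frames: array of shape (T, H, W), grayscale mouth-region crops.
    Returns a (T, T) matrix; larger (whiter) values mean more similar frames.
    """
    T = frames.shape[0]
    # Flatten each patch and zero-mean it so the dot product is a correlation.
    X = frames.reshape(T, -1).astype(np.float64)
    X -= X.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(X, axis=1) + 1e-12
    X /= norms[:, None]
    return X @ X.T  # R[t1, t2] in [-1, 1]

Plotting this matrix for a talking sequence (e.g., with matplotlib's imshow) produces the kind of structure seen in the upper triangle of Figure 1; the lower triangle of that figure instead uses Euclidean distance between audio frames.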

Figure 1: Recurrence matrix of a 10-second talking sequence. The upper triangle is the similarity (absolute correlation) of the mouth region for images at times $T_1$ and $T_2$, and the lower triangle is the similarity (Euclidean distance) of the corresponding audio signal at times $T_1$ and $T_2$. Whiter pixels are more similar.

The mel cepstrum coefficients $a_t$ of the audio signal are used as the audio features; these are commonly used in speech recognition systems [10]. In our examples, we compute 12 mel cepstrum coefficients using a 10 ms window.

For the visual features, we utilize a simple measure of change between two images $I_t$ and $I_{t-1}$ (absolute correlation):

$$S_t = \sum_{(x,y) \in W_t} |I_t(x,y) - I_{t-1}(x,y)| \qquad (1)$$

where $W_t$ is a windowing function in $I_t$ (typically a rectangle). In order to account for small translations of the head during speaking, the minimal $S_t$ is found by translating over a small search radius $r$:

$$S'_t = \min_{|\delta x|,\,|\delta y| \le r} \sum_{(x,y) \in W_t} |I_t(x + \delta x, y + \delta y) - I_{t-1}(x,y)| \qquad (2)$$

The TDNN has an input layer consisting of the audio features $[a_{t-N_a}, \ldots, a_t, \ldots, a_{t+N_a}]$ and the visual features $[S'_{t-N_V}, \ldots, S'_t, \ldots, S'_{t+N_V}]$. In our examples, $N_a$ and $N_V$ are chosen such that approximately 200 ms of context in each direction (symmetrically) is provided. There is one hidden layer, and only a single output node $O_{t,W}$, which indicates whether someone is speaking at time $t$ in the window $W$.

Figure 2: Example image (640x480 pixels) used in training. For training, the positive visual features were computed using a 60x30 window centered on the mouth. See Figure 3 for the corresponding feature vectors.

2.1. Training

The TDNN is trained using supervised learning and backpropagation [12]. Specifically, for each image $I_t$, the output $O_{t,W}$ is set to 1 where a person is talking, and to 0 otherwise. An example image is shown in Figure 2. An example of the feature vectors (the TDNN input) is shown in Figure 3. The training data consists of both positive data ($O_{t,W} = 1$) and negative data ($O_{t,W} = 0$).

Figure 3: Training data example for a person saying "computer" nine times: (top) audio data, (middle) mel cepstrum coefficients $a_t$, (bottom) visual features $S'_t$.

2.2. Speaker detection

Once the TDNN has been trained, it is evaluated on an audio-visual sequence to detect correlated motion and audio that are indicative of a person talking. Specifically, for a given image $I_t$ and a windowing function $W$ centered at $(x, y)$, we evaluate the TDNN output $O_{t,W}$ at each $(x, y) \in I_t$. The windowing function is typically rectangular, and of the size of an expected mouth. In our implementation, the dimensions of $W$ are $\alpha h$ by $\alpha w$ pixels, where $h$ and $w$ are nominal mouth dimensions and $\alpha \in \{1, 2\}$ is a spatial scaling factor. During the search, we choose the $\alpha$ that maximizes $O_{t,W}$ to allow for a range of mouth sizes (primarily due to changes in the person's distance from the camera). The TDNN does not handle large changes in temporal scale from the training data. Therefore, the feature vectors $a_t$ and $S'_t$ are linearly scaled by a time factor before evaluating $O_{t,W}$. In our implementation, we select the time factor that maximizes $O_{t,W}$, to allow for a significant variation in speaking rate between the training data and the test data.
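For concreteness, here is a sketch of the two per-frame features described in Section 2. The search radius, window placement, hop size, and the use of librosa for the mel cepstrum are our own assumptions, not values from the paper.

import numpy as np
import librosa  # assumed here for mel cepstrum extraction; any MFCC implementation works

def visual_feature(prev, curr, window, radius=2):
    """S'_t of Eq. (2): minimum sum of absolute differences between the mouth
    window of frame t and frame t-1, searched over small translations.

    prev, curr: grayscale frames as 2-D arrays.
    window:     (x, y, w, h) mouth window in frame t-1.
    radius:     search radius r in pixels (value here is illustrative).
    The window is assumed to lie at least `radius` pixels inside the frame.
    """
    x, y, w, h = window
    ref = prev[y:y + h, x:x + w].astype(np.float64)
    best = np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            cand = curr[y + dy:y + dy + h, x + dx:x + dx + w].astype(np.float64)
            best = min(best, np.abs(cand - ref).sum())
    return best

def audio_features(signal, sr, n_mfcc=12, hop_ms=10):
    """12 mel cepstrum coefficients per analysis frame (the a_t in the text)."""
    hop = int(sr * hop_ms / 1000)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                hop_length=hop).T  # shape (frames, 12)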

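The paper does not give the TDNN layer sizes, so the following PyTorch sketch is only one plausible realization: the per-frame audio and visual features are stacked into 13 channels, the first temporal convolution plays the role of the time-delay input window (a ±6-frame context standing in for $N_a$ and $N_V$), one hidden layer follows, and a single sigmoid unit per frame gives $O_{t,W}$.

import torch
import torch.nn as nn

class SpeakerTDNN(nn.Module):
    """Minimal time-delay network: per-frame input is 12 mel cepstrum
    coefficients plus the visual change feature S'_t (13 channels total).
    Layer sizes and context length are illustrative, not the paper's values."""

    def __init__(self, n_features=13, hidden=32, context=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_features, hidden, kernel_size=2 * context + 1,
                      padding=context),       # input layer over the +/- context window
            nn.Tanh(),
            nn.Conv1d(hidden, 1, kernel_size=1),  # single output node per frame
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, n_features, time) -> (batch, time) speaking probabilities
        return self.net(x).squeeze(1)

# Training follows the recipe of Section 2.1: backpropagation with targets
# O_{t,W} = 1 for frames where the person in window W is talking, 0 otherwise.
model = SpeakerTDNN()
loss_fn = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)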
$O_{t,W}$ can be treated as a probability that there is someone speaking at time $t$ with a mouth in the window $W$. To achieve better robustness, $O_{t,W}$ can be filtered to reduce spurious noise. While a Kalman filter could be used for this purpose, in our implementation we simply use a 3D moving average filter to compute a smoothed $O_{t,W}$.

3. DESIGN DECISIONS

In this section, we discuss several important design decisions we have made. First, in detecting the audio-visual correlation, we could have hand-coded rules that map phonemes to visemes. However, extracting phonemes is error-prone, as is extracting visemes. Moreover, to accurately extract visemes would likely require greater resolution than we use in our test images (the mouth is about 10x30 pixels) and a sophisticated model-based visual feature extractor. There is also the problem of determining a suitable viseme lexicon to map to; like phonemes, there are many viseme standards to choose from. Rather than use a rule-based system, we chose to use a TDNN. We structured the network so that, with sufficient training, it should learn similar phoneme-viseme rules, though without actually classifying a phoneme or viseme. While we have not done so, the trained network could be tested with specific phoneme-viseme pairs to determine whether this correlation is actually learned.

In choosing the visual features, we needed a feature that could be robustly determined at relatively low resolutions. It was also desirable to choose a method that could be implemented in real time on a standard PC. The correlation feature satisfies both of these requirements. Finally, in choosing the audio features, we utilized features (mel cepstrum) that are commonly used in speech recognition systems. More sophisticated audio features could be utilized (e.g., ...), which could enhance the performance of the system, particularly in the presence of noise.

4. EXAMPLE APPLICATION

In this section, we demonstrate the system with a simple HCI application. When the user speaks the word "computer", the system recognizes when and where he is speaking, and windows his head for further processing (e.g., teleconferencing). An example image from a sequence is shown in Figure 4, and the corresponding features are shown in Figure 5. The output of the TDNN is shown in Figure 6. For this test sequence, the system correctly detected all 7 instances of the word "computer". The system uses a Sony DFW-V500 camera (640x480 resolution, 30 FPS), and is implemented on a standard PC workstation. The microphone is an inexpensive desktop microphone, and the audio is sampled at 20 kHz with 8-bit resolution.

Figure 4: Output of the HCI application, which locates a speaking person and windows their head. The cross-hair marks the location of the detected mouth. The bounding box size is a function of $W$.

Figure 5: Features for the sequence shown in Figure 4. Note that the audio signal has a much lower S/N ratio than in Figure 3, due to the greater distance from the microphone.

Figure 6: Output of the TDNN for the image given in Figure 4. Dark pixels correspond to regions of high speaker probability.

5. CONCLUSIONS

We used a TDNN to learn the audio-visual correlations of mouth regions during speaking. This was utilized to detect speakers (both spatially and temporally) using video and a single microphone as input. We demonstrated the utility of the system using an HCI application that recognized when and where a user was talking.

To utilize this method of speaker detection for more general applications, such as video conferencing and video indexing, a measure of image similarity must be chosen that is relatively invariant to translations and rotations of the head during speaking. One possible solution is to use an affine tracker to stabilize the regions being tested [11]. Other visual feature vectors can also be extracted, such as the optical flow around the mouth. To improve robustness and accuracy, our method can be combined with a face detector [16] and tracker [11], and a voice detector [19].

6. REFERENCES

[1] S. Basu, M. Casey, W. Gardner, A. Azarbayejani, and A. Pentland. Vision-steered audio for interactive environments. Technical Report 373, MIT Media Lab Perceptual Computing Section, 1996.
[2] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[3] M. Brand. Voice puppetry. In SIGGRAPH, 1999.
[4] C. Bregler, M. Covell, and M. Slaney. Video rewrite: driving visual speech with audio. In SIGGRAPH, 1997.
[5] C. Bregler and Y. Konig. Eigenlips for robust speech recognition. In ICASSP, 1994.
[6] M. Casdagli. Recurrence plots revisited. Physica D, 108:12-44, 1997.
[7] T. Darrell, G. Gordon, J. Woodfill, H. Baker, and M. Harville. A virtual mirror interface using real-time robust face tracking. In FG, 1998.
[8] B. Dodd and R. Campbell. Hearing by Eye: The Psychology of Lipreading. Lawrence Erlbaum Press, 1987.
[9] J.-P. Eckmann, S. O. Kamphorst, and D. Ruelle. Recurrence plots of dynamical systems. Europhysics Letters, 4:973-977, 1987.
[10] B. Gold and N. Morgan. Speech and Audio Signal Processing. John Wiley and Sons, Inc., 1999.
[11] G. Hager and K. Toyama. The XVision system: A general-purpose substrate for portable real-time vision applications. Computer Vision and Image Understanding, 69(1):23-37, 1998.
[12] J. Hertz, A. Krogh, and R. Palmer. Introduction to the Theory of Neural Computation. Addison Wesley, 1991.
[13] K. Mase and A. Pentland. Lip reading: automatic visual recognition of spoken words. In Proc. Image Understanding and Machine Vision, Optical Society of America, June 1989.
[14] J. M. Rehg, K. P. Murphy, and P. W. Fieguth. Vision-based speaker detection using Bayesian networks. In Proceedings of Computer Vision and Pattern Recognition, 1999.
[15] D. Reynolds and R. Rose. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing, 3(1):72-83, 1995.
[16] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23-38, January 1998.
[17] D. Stork, G. Wolff, and E. Levine. Neural network lipreading system for improved speech recognition. In IJCNN, 1992.

[18] K. Waters and T. Levergood. DECface: An automatic lip-synchronization algorithm for synthetic faces. Technical Report 93/4, DEC, September 1994.
[19] T. Zhang and C.-C. Kuo. Hierarchical classification of audio data for archiving and retrieving. In ICASSP, 1999.