LOOK WHO'S TALKING: SPEAKER DETECTION USING VIDEO AND AUDIO CORRELATION

Ross Cutler and Larry Davis
Institute for Advanced Computer Studies
University of Maryland, College Park
{rgc,lsd}@cs.umd.edu

ABSTRACT

The visual motion of the mouth and the corresponding audio data generated when a person speaks are highly correlated. This fact has been exploited for lip/speechreading and for improving speech recognition. We describe a method of automatically detecting a talking person (both spatially and temporally) using video and audio data from a single microphone. The audio-visual correlation is learned using a TDNN, which is then used to perform a spatio-temporal search for a speaking person. Applications include video conferencing, video indexing, and improving human-computer interaction (HCI). An example HCI application is provided.

1. INTRODUCTION

The visual motion of a speaker's mouth is highly correlated with the audio data generated from the voicebox and mouth [8]. This fact has been exploited for lip/speechreading (e.g., [17, 13]) and for combined audio-visual speech recognition (e.g., [5]). We utilize this correlation to detect speakers using video and audio input from a single microphone. We learn the audio-visual correlation in speaking using a time-delay neural network (TDNN) [2], which is then used to search an audio-video input for speaking people.

Applications of speaker detection include video conferencing, video indexing, and improving the human-computer interface. In video conferencing, knowing where someone is speaking can cue a video camera to zoom in on the speaker; it can also be used to transmit only the speaker's video in bandwidth-limited conferencing applications. Speaker detection can also be used to index video (e.g., "find me all clips of someone speaking"), and can be combined with face recognition techniques (e.g., "find me all clips of Bill Clinton speaking").
Finally, speaker detection can be used to improve human-computer interaction (HCI) by providing applications with the knowledge of when and where a user is speaking. We provide an example application in which the user speaks to the computer, and the computer performs an action using the time and location of the speaker.

1.1. Related work

There has been a significant amount of work done in detecting faces from images and video (e.g., [16]). There has also been a significant amount of work done in locating speakers using arrays of microphones (e.g., [1]), and in identifying a specific individual speaking (e.g., [15]). Audio data can be used to animate lips for animated and real characters [3, 4]. Vision-based techniques have also been used to detect people in front of kiosks [14, 7]. There are text-to-speech systems which utilize hand-coded phoneme-to-viseme rules to animate characters [18]. We are not aware of any previous work that exploits the audio-visual correlation in speaking to detect speakers (spatially and temporally) in video with a single microphone.

1.2. Assumptions

In this work, we assume that only one person is speaking at a time, and that there is no significant background noise (though the audio S/N is not high in our test data). We also assume that the speaker does not move his head excessively while talking (though we suggest methods in Section 5 to handle this).

2. METHOD

Our method exploits the correlation between mouth motions and audio data. Figure 1 shows a recurrence matrix [9, 6] of the mouth-region image similarities and the corresponding audio data. A recurrence matrix is a qualitative tool used to perform time series analysis of non-linear dynamic systems. In this case, the recurrence matrix $R$ is defined by

$$R(t_1, t_2) = c(I_{t_1}, I_{t_2})$$

where $c$ is the correlation of images $I_{t_1}$ and $I_{t_2}$. In this figure, we see that times of change in the audio data are highly correlated with visual changes in the mouth.
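To make the recurrence-matrix construction concrete, the following is a minimal numpy sketch (not the authors' code). For simplicity it uses a single similarity measure, negative mean absolute difference, for the whole matrix, whereas Figure 1 uses image correlation in the upper triangle and audio Euclidean distance in the lower; the function name and array shapes are illustrative assumptions.

```python
import numpy as np

def recurrence_matrix(frames):
    """Recurrence matrix R(t1, t2) = c(I_t1, I_t2) over a sequence of
    mouth-region images, where c is a similarity measure.  `frames` is a
    (T, H, W) array of grayscale mouth regions; larger values of R mean
    more similar frames (whiter pixels in Figure 1)."""
    T = frames.shape[0]
    flat = frames.reshape(T, -1).astype(float)
    R = np.zeros((T, T))
    for t1 in range(T):
        for t2 in range(T):
            # similarity as negative mean absolute difference
            R[t1, t2] = -np.mean(np.abs(flat[t1] - flat[t2]))
    return R
```

With a single similarity measure the matrix is symmetric; using two measures, as in Figure 1, fills the two triangles independently.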
However, the relationship between the two signals is not simple: changes in the audio signal do not necessarily imply changes in the visual signal (and vice versa), and the visual signal may lead or lag the audio signal significantly (Bregler and Konig [5] use mutual information to show that the audio data on average lagged the video data by approximately 120 ms in their dataset). In addition, the changes are highly context sensitive, analogous to the coarticulation problem in speech recognition. We utilize a TDNN to learn the context-dependent correlations between the audio and visual signals.

Figure 1: Recurrence matrix of a 10-second talking sequence. The upper triangle is the similarity (absolute correlation) of the mouth region for images at times $T_1$ and $T_2$, and the lower triangle is the similarity (Euclidean distance) of the corresponding audio signal at times $T_1$ and $T_2$. Whiter pixels are more similar.

The mel cepstrum coefficients $c_t$ of the audio signal are used as the audio features; these are commonly used in speech recognition systems [10]. In our examples, we compute 12 mel cepstrum coefficients using a 10 ms window. For the visual features, we utilize a simple measure of change between two images $I_t$ and $I_{t-1}$ (absolute correlation):

$$S_t = \sum_{(x,y) \in W_t} |I_t(x,y) - I_{t-1}(x,y)| \qquad (1)$$

where $W_t$ is a windowing function in $I_t$ (typically a rectangle). In order to account for small translations of the head during speaking, the minimal $S_t$ is found by translating over a small search radius $r$:

$$S'_t = \min_{|\delta_x|, |\delta_y| \le r} \sum_{(x,y) \in W_t} |I_t(x + \delta_x, y + \delta_y) - I_{t-1}(x,y)| \qquad (2)$$

The TDNN has an input layer consisting of the audio features $[c_{t-N_A}, \ldots, c_t, \ldots, c_{t+N_A}]$ and the visual features $[S'_{t-N_V}, \ldots, S'_t, \ldots, S'_{t+N_V}]$. In our examples, we have used $N_A$ and $N_V$ such that approximately 200 ms of context in each direction (symmetrically) is provided.

Figure 2: Example image (640x480 pixels) used in training. For training, the positive visual features were computed using a 60x30 window centered on the mouth. See Figure 3 for the corresponding feature vectors.
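The visual feature above can be sketched in a few lines of numpy (illustrative, not the authors' implementation). Here the window is given as (x0, y0, h, w) and must lie at least r pixels inside the image; the function name is hypothetical.

```python
import numpy as np

def visual_feature(I_t, I_prev, window, r=2):
    """S'_t of Eq. (2): the minimum, over translations of up to r pixels,
    of the sum of absolute differences inside the mouth window.
    `I_t` and `I_prev` are 2-D grayscale arrays; `window` = (x0, y0, h, w)
    and must be at least r pixels from every image border."""
    x0, y0, h, w = window
    ref = I_prev[y0:y0 + h, x0:x0 + w].astype(float)
    best = np.inf
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            cand = I_t[y0 + dy:y0 + dy + h, x0 + dx:x0 + dx + w].astype(float)
            best = min(best, np.abs(cand - ref).sum())
    return best
```

Eq. (1) corresponds to the special case r = 0; the minimization over small shifts is what absorbs slight head translations between frames.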
There is one hidden layer and a single output node $O_{t,W}$, which indicates whether someone is speaking at time $t$ in the window $W$.

2.1. Training

The TDNN is trained using supervised learning and back propagation [12]. Specifically, for each image $I_t$, the output $O_{t,W}$ is set to 1 where a person is talking, and 0 otherwise. An example image is shown in Figure 2. An example of the feature vectors (the TDNN input) is shown in Figure 3. The training data consists of both positive data ($O_{t,W} = 1$) and negative data ($O_{t,W} = 0$).

2.2. Speaker detection

Once the TDNN has been trained, it is evaluated on an audio-visual sequence to detect correlated motion and audio that is indicative of a person talking. Specifically, for a given image $I_t$ and windowing function $W$ centered at $(x, y)$, we evaluate the TDNN output $O_{t,W}$ at each $(x, y) \in I_t$. The windowing function is typically rectangular, and of the size of an expected mouth. In our implementation, the dimensions of $W$ are $\alpha h$ by $\alpha w$ pixels, where $h$ and $w$ are the nominal mouth height and width, and $\alpha \in [1, 2]$ is a spatial scaling factor. During the search, we choose the $\alpha$ that maximizes $O_{t,W}$ to allow for a range of mouth sizes (primarily due to changes in the person's distance from the camera).

The TDNN does not handle large changes in temporal scale from the training data. Therefore, the feature vectors $c_t$ and $S'_t$ are linearly scaled by a time factor $\beta$ before evaluating $O_{t,W}$. In our implementation, we select, from a small set of candidate values, the $\beta$ that maximizes $O_{t,W}$ to allow for a significant
variation in speaking rate between the training data and the test data.

Figure 3: Training data example for a person saying "computer" nine times: (top) audio data, (middle) mel cepstrum coefficients $c_t$, (bottom) visual features $S'_t$.

$O_{t,W}$ can be treated as the probability that someone is speaking at time $t$, with a mouth in the window $W$. To achieve better robustness, $O_{t,W}$ can be filtered to reduce spurious noise. While a Kalman filter could be used for this purpose, in our implementation we simply use a 3D moving average filter to compute a smoothed $O_{t,W}$.

3. DESIGN DECISIONS

In this section, we discuss several important design decisions we have made. First, in detecting the audio-visual correlation, we could have hand-coded rules that map phonemes to visemes. However, extracting phonemes is error-prone, as is extracting visemes. Moreover, to accurately extract visemes would likely require greater resolution than we use in our test images (the mouth is about 10x30 pixels), and a sophisticated model-based visual feature extractor. There is also the problem of determining a suitable viseme lexicon to map to; like phonemes, there are many viseme standards to choose from. Rather than use a rule-based system, we chose to use a TDNN. We structured the NN so that, with sufficient training, it should learn similar phoneme-viseme rules, though without actually classifying a phoneme or viseme. While we have not done so, the trained NN could be tested with specific phoneme-viseme pairs to determine whether this correlation is actually learned.

In choosing the visual features, we needed a feature that could be robustly determined at relatively low resolutions. It was also desirable to choose a method that could be implemented in real-time on a standard PC. The correlation feature satisfies both of these requirements. Finally, in choosing the audio features, we utilized features (the mel cepstrum) that are commonly used in speech recognition systems. More sophisticated audio features could be utilized (e.g., ...), which could enhance the performance of the system, particularly in the presence of noise.

Figure 4: Output of the HCI application, which locates a speaking person and windows their head. The cross-hair marks the location of the detected mouth. The bounding box size is a function of $W$.

4. EXAMPLE APPLICATION

In this section, we demonstrate the system with a simple HCI application. When the user speaks the word "computer", the system recognizes when and where he is speaking, and windows his head for further processing (e.g., teleconferencing). An example image from a sequence is shown in Figure 4, and the corresponding features are shown in Figure 5. The output of the TDNN is shown in Figure 6. For this test sequence, the system correctly detected all 7 instances of the word "computer".

The system uses a Sony DFW-V500 camera (640x480 resolution at 30 FPS), and is implemented on a standard PC workstation. The microphone is an inexpensive desktop microphone, and the audio is sampled at 20 kHz with 8-bit resolution.

5. CONCLUSIONS

We used a TDNN to learn the audio-visual correlations of mouth regions during speaking. This was utilized to detect speakers (both spatially and temporally) using video and a single microphone as input. We demonstrated the utility of the system using an HCI application that recognized when and where a user was talking.
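The spatial search over window positions and scales described in Section 2.2 can be sketched as follows. Here `score()` is a stand-in for the trained TDNN evaluated on features extracted from window W; the nominal mouth size and the discrete set of scales are illustrative assumptions, not the paper's exact values.

```python
import numpy as np

def detect_speaker(score, frame_shape, base_hw=(30, 60), alphas=(1.0, 1.5, 2.0)):
    """Spatial search of Section 2.2: evaluate the TDNN output O_{t,W}
    for a mouth-sized window centered at each pixel and at each spatial
    scale alpha, returning (best score, best (x, y), best alpha).
    `score(x, y, h, w)` stands in for the trained TDNN; `base_hw` is a
    nominal mouth (height, width)."""
    H, W_img = frame_shape
    best = (-np.inf, None, None)
    for alpha in alphas:                      # range of mouth sizes
        h, w = int(alpha * base_hw[0]), int(alpha * base_hw[1])
        # keep the window fully inside the frame
        for y in range(h // 2, H - h // 2):
            for x in range(w // 2, W_img - w // 2):
                s = score(x, y, h, w)
                if s > best[0]:
                    best = (s, (x, y), alpha)
    return best
```

In practice the per-window scores would then be smoothed with the 3D moving average filter of Section 2.2 before thresholding.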
Figure 5: Features for the sequence shown in Figure 4: (top) audio data, (middle) mel cepstrum coefficients, (bottom) visual features. Note that the audio signal has a much lower S/N ratio than in Figure 3, due to the greater distance from the microphone.

Figure 6: Output of the TDNN for the image given in Figure 4. Dark pixels correspond to regions of high speaker probability.

To utilize this method of speaker detection for more general applications, such as video conferencing and video indexing, a measure of image similarity must be chosen that is relatively invariant to translations and rotations of the head during speaking. One possible solution is to use an affine tracker to stabilize the regions being tested [11]. Other visual feature vectors could also be extracted, such as the optical flow around the mouth. To improve robustness and accuracy, our method can be combined with a face detector [16], a tracker [11], and a voice detector [19].

6. REFERENCES

[1] S. Basu, M. Casey, W. Gardner, A. Azarbayejani, and A. Pentland. Vision-steered audio for interactive environments. Technical Report 373, MIT Media Lab Perceptual Computing Section, 1996.
[2] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[3] M. Brand. Voice puppetry. In SIGGRAPH, 1999.
[4] C. Bregler, M. Covell, and M. Slaney. Video rewrite: driving visual speech with audio. In SIGGRAPH, 1997.
[5] C. Bregler and Y. Konig. Eigenlips for robust speech recognition. In ICASSP, 1994.
[6] M. Casdagli. Recurrence plots revisited. Physica D, 108:12-44, 1997.
[7] T. Darrell, G. Gordon, J. Woodfill, H. Baker, and M. Harville. A virtual mirror interface using real-time robust face tracking. In FG, 1998.
[8] B. Dodd and R. Campbell. Hearing by Eye: The Psychology of Lipreading. Lawrence Erlbaum Press, 1987.
[9] J.-P. Eckmann, S. O. Kamphorst, and D. Ruelle. Recurrence plots of dynamical systems. Europhysics Letters, 4:973-977, 1987.
[10] B. Gold and N. Morgan. Speech and Audio Signal Processing. John Wiley and Sons, Inc., 1999.
[11] G. Hager and K. Toyama. The XVision system: A general-purpose substrate for portable real-time vision applications. Computer Vision and Image Understanding, 69(1):23-37, 1998.
[12] J. Hertz, A. Krogh, and R. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, 1991.
[13] K. Mase and A. Pentland. Lip reading: automatic visual recognition of spoken words. In Proc. Image Understanding and Machine Vision, Optical Society of America, June 1989.
[14] J. M. Rehg, K. P. Murphy, and P. W. Fieguth. Vision-based speaker detection using Bayesian networks. In Proceedings of Computer Vision and Pattern Recognition, 1999.
[15] D. Reynolds and R. Rose. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing, 3(1):72-83, 1995.
[16] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23-38, January 1998.
[17] D. Stork, G. Wolff, and E. Levine. Neural network lipreading system for improved speech recognition. In IJCNN, 1992.
[18] K. Waters and T. Levergood. DECface: An automatic lip-synchronization algorithm for synthetic faces. Technical Report 93/4, DEC, September 1994.
[19] T. Zhang and C.-C. Kuo. Hierarchical classification of audio data for archiving and retrieving. In ICASSP, 1999.