LOOK WHO'S TALKING: SPEAKER DETECTION USING VIDEO AND AUDIO CORRELATION. Ross Cutler and Larry Davis
Institute for Advanced Computer Studies, University of Maryland, College Park

ABSTRACT

The visual motion of the mouth and the corresponding audio data generated when a person speaks are highly correlated. This fact has been exploited for lip/speechreading and for improving speech recognition. We describe a method of automatically detecting a talking person (both spatially and temporally) using video and audio data from a single microphone. The audio-visual correlation is learned using a TDNN, which is then used to perform a spatio-temporal search for a speaking person. Applications include video conferencing, video indexing, and improving human-computer interaction (HCI). An example HCI application is provided.

1. INTRODUCTION

The visual motion of a speaker's mouth is highly correlated with the audio data generated from the voicebox and mouth [8]. This fact has been exploited for lip/speechreading (e.g., [17, 13]) and for combined audio-visual speech recognition (e.g., [5]). We utilize this correlation to detect speakers using video and audio input from a single microphone. We learn the audio-visual correlation in speaking using a time-delay neural network (TDNN) [2], which is then used to search an audio-video input for speaking people.

Applications of speaker detection include video conferencing, video indexing, and improving the human-computer interface. In video conferencing, knowing where someone is speaking can cue a video camera to zoom in on the speaker; it can also be used to transmit only the speaker's video in bandwidth-limited conferencing applications. Speaker detection can also be used to index video (e.g., "find me all clips of someone speaking"), and can be combined with face recognition techniques (e.g., "find me all clips of Bill Clinton speaking").
Finally, speaker detection can be used to improve human-computer interaction (HCI) by providing applications with the knowledge of when and where a user is speaking. We provide an example application in which the user speaks to the computer, and the computer performs an action using the time and location of the speaker.

1.1. Related work

There has been a significant amount of work on detecting faces in images and video (e.g., [16]). There has also been a significant amount of work on locating speakers using arrays of microphones (e.g., [1]), and on identifying a specific individual speaking (e.g., [15]). Audio data can be used to animate lips for animated and real characters [3, 4]. Vision-based techniques have also been used to detect people in front of kiosks [14, 7]. There are text-to-speech systems that utilize hand-coded phoneme-to-viseme rules to animate characters [18]. We are not aware of any previous work that exploits the audio-visual correlation in speaking to detect speakers (spatially and temporally) in video with a single microphone.

1.2. Assumptions

In this work, we assume that only one person is speaking at a time, and that there is no significant background noise (though the audio S/N in our test data is not high). We also assume that the speaker does not move his head excessively while talking (though we suggest methods in Section 5 to handle this).

2. METHOD

Our method exploits the correlation between mouth motions and audio data. Figure 1 shows a recurrence matrix [9, 6] of the mouth region image similarities and the corresponding audio data. A recurrence matrix is a qualitative tool used to perform time series analysis of non-linear dynamic systems. In this case, the recurrence matrix $R$ is defined by $R(t_1, t_2) = C(I_{t_1}, I_{t_2})$, where $C$ is the correlation of images $I_{t_1}$ and $I_{t_2}$. In this figure, we see that times of change in the audio data are highly correlated with visual changes in the mouth.
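To make the recurrence matrix definition concrete, here is a minimal sketch in plain Python. It assumes that the similarity $C$ is the Pearson correlation coefficient of two flattened mouth-region images; the text only says "correlation", so this is one possible reading and not necessarily the measure used to produce Figure 1. The function names `correlation` and `recurrence_matrix` and the toy frames are hypothetical.

```python
import math

def correlation(a, b):
    """Pearson correlation of two equal-length, flattened images (assumed measure)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant image: correlation is undefined, treated as 0 here
    return cov / math.sqrt(var_a * var_b)

def recurrence_matrix(frames):
    """R[i][j] = correlation(frames[i], frames[j]) for every pair of frames."""
    return [[correlation(fi, fj) for fj in frames] for fi in frames]

# Toy sequence: frame 1 is a brighter copy of frame 0, frame 2 is its mirror image.
frames = [
    [0, 1, 2, 3],
    [0, 2, 4, 6],
    [3, 2, 1, 0],
]
R = recurrence_matrix(frames)
```

The matrix is symmetric, so only one triangle carries information; that is why a single plot can show image similarity in one triangle and audio similarity in the other.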
However, the relationship between the two signals is not simple, as changes in the audio signal do not necessarily imply changes in the visual signal (and vice versa), and the visual signal may lead or lag the audio signal significantly (Bregler and Konig [5] use mutual information to show that audio data on average lagged behind the video data by approximately 12 ms in their dataset). In addition, the changes are highly context sensitive, analogous to the coarticulation problem in speech recognition.

Figure 1: Recurrence matrix of a 1 second talking sequence. The upper triangle is the similarity (absolute correlation) of the mouth region for images at times $T_1$ and $T_2$, and the lower triangle is the similarity (Euclidean distance) of the corresponding audio signal at times $T_1$ and $T_2$. Whiter pixels are more similar.

We utilize a TDNN to learn the context-dependent correlations between the audio and visual signals. The audio features are the mel cepstrum coefficients $A_t$ of the audio signal, which are commonly used in speech recognition systems [10]. In our examples, we compute 12 mel cepstrum coefficients using a 1 ms window. For visual features, we utilize a simple measure of change between two images $I_t$ and $I_{t-1}$ (absolute correlation):

$$S_t = \sum_{(x,y) \in W_t} \left| I_t(x, y) - I_{t-1}(x, y) \right| \qquad (1)$$

where $W_t$ is a windowing function in $I_t$ (typically a rectangle). In order to account for small translations of the head during speaking, the minimal $S_t$ is found by translating over a small search radius $r$:

$$S'_t = \min_{|\delta x|, |\delta y| \le r} \sum_{(x,y) \in W_t} \left| I_t(x + \delta x, y + \delta y) - I_{t-1}(x, y) \right| \qquad (2)$$

The TDNN has an input layer consisting of $[A_{t-N_A}, \ldots, A_t, \ldots, A_{t+N_A}]$ audio features and $[S'_{t-N_V}, \ldots, S'_t, \ldots, S'_{t+N_V}]$ visual features. In our examples, $N_A$ and $N_V$ are chosen such that approximately 2 ms of context in each direction (symmetrically) is provided.

Figure 2: Example image (64x48 pixels) used in training. For training, the positive visual features were computed using a 6x3 window centered on the mouth. See Figure 3 for the corresponding feature vectors.
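The visual feature of Equations (1) and (2) and the symmetric context window of the TDNN input can be sketched as follows. This is an illustrative sketch only: it assumes the change measure is a sum of absolute differences over a rectangular window, that images are lists of rows indexed as `image[y][x]`, and that shifts falling outside the image are simply skipped. The names `window_change`, `min_window_change` and `context_vector` are hypothetical.

```python
def window_change(prev, cur, window, dx=0, dy=0):
    """S_t: sum of absolute differences over `window`, with `cur` shifted by (dx, dy).

    window = (x0, y0, width, height) in pixel coordinates of `prev`.
    """
    x0, y0, w, h = window
    total = 0
    for y in range(y0, y0 + h):
        for x in range(x0, x0 + w):
            total += abs(cur[y + dy][x + dx] - prev[y][x])
    return total

def min_window_change(prev, cur, window, r):
    """S'_t: the minimum of S_t over all shifts with |dx| <= r and |dy| <= r."""
    x0, y0, w, h = window
    height, width = len(cur), len(cur[0])
    best = None
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            # Skip shifts that would read outside the image (an assumption).
            if x0 + dx < 0 or y0 + dy < 0:
                continue
            if x0 + dx + w > width or y0 + dy + h > height:
                continue
            s = window_change(prev, cur, window, dx, dy)
            if best is None or s < best:
                best = s
    return best

def context_vector(features, t, n):
    """[f_{t-n}, ..., f_t, ..., f_{t+n}]; requires n <= t < len(features) - n."""
    return features[t - n : t + n + 1]

# A bright pixel moves one column to the right between two frames.
prev = [[0, 0, 0, 0], [0, 9, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
cur = [[0, 0, 0, 0], [0, 0, 9, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
window = (1, 1, 2, 2)
unaligned = window_change(prev, cur, window)       # 18
aligned = min_window_change(prev, cur, window, 1)  # 0
```

In this toy case the pure translation is fully absorbed by the search, so a small head shift does not register as mouth motion, which is the purpose of the search radius $r$.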
There is one hidden layer, and only a single output node $O_t^W$, which indicates whether someone is speaking at time $t$ in the window $W$.

2.1. Training

The TDNN is trained using supervised learning and back propagation [12]. Specifically, for each image $I_t$, the output $O_t^W$ is set to 1 where a person is talking, and 0 otherwise. An example image is shown in Figure 2. An example of the feature vectors (the TDNN input) is shown in Figure 3. The training data consists of both positive data ($O_t^W = 1$) and negative data ($O_t^W = 0$).

2.2. Speaker detection

Once the TDNN has been trained, it is evaluated on an audio-visual sequence to detect correlated motion and audio that is indicative of a person talking. Specifically, for a given image $I_t$ and windowing function $W$ centered at $(x, y)$, we evaluate the TDNN output $O_t^W$ at each $(x, y) \in I_t$. The windowing function is typically rectangular, and of the size of an expected mouth. In our implementation, the dimensions of $W$ are $\alpha h$ by $\alpha w$ pixels, where $h$ and $w$ are the base window dimensions and $\alpha$, ranging from 1 to 2, is a spatial scaling factor. During the search, we choose the $\alpha$ that maximizes $O_t^W$ to allow for a range of mouth sizes (primarily due to changes in the person's distance from the camera).

The TDNN does not handle large changes in temporal scale from the training data. Therefore, the audio feature vectors and $S'_t$ are linearly scaled by a time factor before evaluating $O_t^W$. In our implementation, we select the time factor that maximizes $O_t^W$ to allow for a significant variation in speaking rate between the training data and test data.

Figure 3: Training data example for a person saying "computer" nine times: (top) audio data, (middle) mel cepstrum coefficients, (bottom) visual features $S'_t$. The horizontal axis is time (frames).

$O_t^W$ can be treated as a probability that there is someone speaking at time $t$, with a mouth in the window $W$. To achieve better robustness, $O_t^W$ can be filtered to reduce spurious noise. While a Kalman filter could be used for this purpose, in our implementation we simply smooth $O_t^W$ with a 3D moving average filter.

3. DESIGN DECISIONS

In this section, we discuss several important design decisions we have made. First, in detecting the audio-visual correlation, we could have hand-coded rules that map phonemes to visemes. However, extracting phonemes is error-prone, as is extracting visemes. Moreover, accurately extracting visemes would likely require greater resolution than we use in our test images (the mouth is about 1x3 pixels), and a sophisticated model-based visual feature extractor. There is also the problem of determining a suitable viseme lexicon to map to; as with phonemes, there are many viseme standards to choose from.

Rather than use a rule-based system, we chose to use a TDNN. We structured the NN so that, with sufficient training, it should learn similar phoneme-viseme rules, though without actually classifying a phoneme or viseme. While we have not done so, the trained NN could be tested with specific phoneme-viseme pairs to determine if this correlation is actually learned.

In choosing the visual features, we needed a feature that could be robustly determined at relatively low resolutions. It was also desirable to choose a method that could be implemented in real time on a standard PC. The correlation feature satisfies both of these requirements.

Figure 4: Output of the HCI application which locates a speaking person and windows their head. The cross-hair marks the location of the detected mouth. The bounding box size is a function of $W$.

Finally, in choosing the audio features, we utilized features (mel cepstrum) that are commonly used in speech recognition systems. More sophisticated audio features could be utilized (e.g.,...), which could enhance the performance of the system, particularly in the presence of noise.

4. EXAMPLE APPLICATION

In this section, we demonstrate the system with a simple HCI application. When the user speaks the word "computer", the system recognizes when and where he is speaking, and windows his head for further processing (e.g., teleconferencing). An example image from a sequence is shown in Figure 4, and the corresponding features are shown in Figure 5. The output of the TDNN is shown in Figure 6. For this test sequence, the system correctly detected all 7 instances of the word "computer". The system uses a Sony DFW-V5 camera (64x48 resolution, 3 FPS), and is implemented on a standard PC workstation. The microphone is an inexpensive desktop microphone, and the audio is sampled at 2 kHz with 8-bit resolution.

5. CONCLUSIONS

We used a TDNN to learn the audio-visual correlations of mouth regions during speaking. This was utilized to detect speakers (both spatially and temporally) using video and a single microphone as input. We demonstrated the utility of the system using an HCI application that recognized when and where a user was talking.
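The "when and where" decision summarized above can be illustrated with a small sketch of the smoothing step from Section 2.2 followed by a peak search. It assumes the TDNN outputs are stored as a nested list `O[t][y][x]`, that the 3D moving average is a box filter with zero padding at the borders, and that a detection is the largest smoothed value above a threshold. The threshold of 0.5 and the names `moving_average_3d`, `detect_speaker` and `volume` are hypothetical and are not taken from the paper.

```python
def moving_average_3d(O, k=1):
    """Box filter over (t, y, x) with half-width k; out-of-range cells count as 0."""
    T, H, W = len(O), len(O[0]), len(O[0][0])
    size = (2 * k + 1) ** 3
    out = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                total = 0.0
                for tt in range(max(0, t - k), min(T, t + k + 1)):
                    for yy in range(max(0, y - k), min(H, y + k + 1)):
                        for xx in range(max(0, x - k), min(W, x + k + 1)):
                            total += O[tt][yy][xx]
                out[t][y][x] = total / size
    return out

def detect_speaker(O, threshold=0.5):
    """Return (t, x, y) of the largest value above `threshold`, or None."""
    best, where = threshold, None
    for t, frame in enumerate(O):
        for y, row in enumerate(frame):
            for x, value in enumerate(row):
                if value > best:
                    best, where = value, (t, x, y)
    return where

def volume(value):
    """A 3x3x3 block of identical outputs, used only for the toy example."""
    return [[[value] * 3 for _ in range(3)] for _ in range(3)]

# An isolated one-frame, one-pixel response is averaged away ...
spike = volume(0.0)
spike[1][1][1] = 1.0
spurious = detect_speaker(moving_average_3d(spike))

# ... while a response that persists in space and time survives the filter.
steady = volume(1.0)
persistent = detect_speaker(moving_average_3d(steady))
```

Here `spurious` is `None` and `persistent` is `(1, 1, 1)`. A Kalman filter, which the text mentions as an alternative, would additionally model how the mouth position evolves over time; the box filter only suppresses isolated responses.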
Figure 5: Features for the sequence shown in Figure 4: (top) sound, (middle) mel cepstrum coefficients, (bottom) visual features, plotted against time (frames). Note the audio signal has a much lower S/N ratio than in Figure 3, due to the greater distance from the microphone.

Figure 6: Output of the TDNN for the image given in Figure 4. Dark pixels correspond to regions of high speaker probability.

To utilize this method of speaker detection for more general applications, such as video conferencing and video indexing, a measure of image similarity must be chosen that is relatively invariant to translations and rotations of the head during speaking. One possible solution is to use an affine tracker to stabilize the regions being tested [11]. Other visual feature vectors can also be extracted, such as the optical flow around the mouth. To improve robustness and accuracy, our method can be combined with a face detector [16] and tracker [11], and a voice detector [19].

6. REFERENCES

[1] S. Basu, M. Casey, W. Gardner, A. Azarbayejani, and A. Pentland. Vision-steered audio for interactive environments. Technical Report 373, MIT Media Lab Perceptual Computer Section.
[2] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press.
[3] M. Brand. Voice puppetry. In SIGGRAPH.
[4] C. Bregler, M. Covell, and M. Slaney. Video rewrite: driving visual speech with audio. In SIGGRAPH.
[5] C. Bregler and Y. Konig. Eigenlips for robust speech recognition. In ICASSP.
[6] M. Casdagli. Recurrence plots revisited. Physica D, 18:12-44.
[7] T. Darrell, G. Gordon, J. Woodfill, H. Baker, and M. Harville. A virtual mirror interface using real-time robust face tracking. In FG.
[8] B. Dodd and R. Campbell. Hearing by Eye: The Psychology of Lipreading. Lawrence Erlbaum Press.
[9] P. Eckmann, S. O. Kamphorst, and D. Ruelle. Recurrence plots of dynamical systems. J. of Europhysics Letters, 4.
[10] B. Gold and N. Morgan. Speech and audio signal processing. John Wiley and Sons, Inc.
[11] G. Hager and K. Toyama. The xvision system: A general-purpose substrate for portable real-time vision applications. Computer Vision and Image Understanding, 69(1):23-37.
[12] J. Hertz, A. Krogh, and R. Palmer. Introduction to the theory of neural computation. Addison Wesley.
[13] K. Mase and A. Pentland. Lip reading: automatic visual recognition of spoken words. In Proc. Image Understanding and Machine Vision, Optical Society of America, June.
[14] J. M. Rehg, K. P. Murphy, and P. W. Fieguth. Vision-based speaker detection using Bayesian networks. In Proceedings of the Computer Vision and Pattern Recognition.
[15] D. Reynolds and R. Rose. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing, 3(1):72-83.
[16] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2(1):23-38, January.
[17] D. Stork, G. Wolff, and E. Levine. Neural network lipreading system for improved speech recognition. IJCNN, 1992.
[18] K. Waters and T. Levergood. DECface: An automatic lip-synchronization algorithm for synthetic faces. Technical Report 93/4, DEC, September.
[19] T. Zhang and C.-C. Kuo. Hierarchical classification of audio data for archiving and retrieving. In ICASSP, 1999.
More informationSPEECH ENHANCEMENT USING PITCH DETECTION APPROACH FOR NOISY ENVIRONMENT
SPEECH ENHANCEMENT USING PITCH DETECTION APPROACH FOR NOISY ENVIRONMENT RASHMI MAKHIJANI Department of CSE, G. H. R.C.E., Near CRPF Campus,Hingna Road, Nagpur, Maharashtra, India rashmi.makhijani2002@gmail.com
More informationFace Registration Using Wearable Active Vision Systems for Augmented Memory
DICTA2002: Digital Image Computing Techniques and Applications, 21 22 January 2002, Melbourne, Australia 1 Face Registration Using Wearable Active Vision Systems for Augmented Memory Takekazu Kato Takeshi
More informationSPEECH ENHANCEMENT USING A ROBUST KALMAN FILTER POST-PROCESSOR IN THE MODULATION DOMAIN. Yu Wang and Mike Brookes
SPEECH ENHANCEMENT USING A ROBUST KALMAN FILTER POST-PROCESSOR IN THE MODULATION DOMAIN Yu Wang and Mike Brookes Department of Electrical and Electronic Engineering, Exhibition Road, Imperial College London,
More informationA Neural Solution for Signal Detection In Non-Gaussian Noise
1 A Neural Solution for Signal Detection In Non-Gaussian Noise D G Khairnar, S N Merchant, U B Desai SPANN Laboratory Department of Electrical Engineering Indian Institute of Technology, Bombay, Mumbai-400
More informationCalibration of Microphone Arrays for Improved Speech Recognition
MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Calibration of Microphone Arrays for Improved Speech Recognition Michael L. Seltzer, Bhiksha Raj TR-2001-43 December 2001 Abstract We present
More informationRecognizing Talking Faces From Acoustic Doppler Reflections
MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Recognizing Talking Faces From Acoustic Doppler Reflections Kaustubh Kalgaonkar, Bhiksha Raj TR2008-080 December 2008 Abstract Face recognition
More informationKeywords: - Gaussian Mixture model, Maximum likelihood estimator, Multiresolution analysis
Volume 4, Issue 2, February 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Expectation
More informationMODIFIED DCT BASED SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENTS
MODIFIED DCT BASED SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENTS 1 S.PRASANNA VENKATESH, 2 NITIN NARAYAN, 3 K.SAILESH BHARATHWAAJ, 4 M.P.ACTLIN JEEVA, 5 P.VIJAYALAKSHMI 1,2,3,4,5 SSN College of Engineering,
More informationIntroduction to Video Forgery Detection: Part I
Introduction to Video Forgery Detection: Part I Detecting Forgery From Static-Scene Video Based on Inconsistency in Noise Level Functions IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 5,
More informationRestoration of Motion Blurred Document Images
Restoration of Motion Blurred Document Images Bolan Su 12, Shijian Lu 2 and Tan Chew Lim 1 1 Department of Computer Science,School of Computing,National University of Singapore Computing 1, 13 Computing
More informationColor Constancy Using Standard Deviation of Color Channels
2010 International Conference on Pattern Recognition Color Constancy Using Standard Deviation of Color Channels Anustup Choudhury and Gérard Medioni Department of Computer Science University of Southern
More informationSOUND SOURCE RECOGNITION FOR INTELLIGENT SURVEILLANCE
Paper ID: AM-01 SOUND SOURCE RECOGNITION FOR INTELLIGENT SURVEILLANCE Md. Rokunuzzaman* 1, Lutfun Nahar Nipa 1, Tamanna Tasnim Moon 1, Shafiul Alam 1 1 Department of Mechanical Engineering, Rajshahi University
More informationCompensation of a position servo
UPPSALA UNIVERSITY SYSTEMS AND CONTROL GROUP CFL & BC 9610, 9711 HN & PSA 9807, AR 0412, AR 0510, HN 2006-08 Automatic Control Compensation of a position servo Abstract The angular position of the shaft
More informationProf Trivedi ECE253A Notes for Students only
ECE 253A: Digital Processing: Course Related Class Website: https://sites.google.com/a/eng.ucsd.edu/ece253fall2017/ Course Graduate Assistants: Nachiket Deo Borhan Vasili Kirill Pirozenko Piazza Grading:
More informationAdvanced Functions of Java-DSP for use in Electrical and Computer Engineering Senior Level Courses
Advanced Functions of Java-DSP for use in Electrical and Computer Engineering Senior Level Courses Andreas Spanias Robert Santucci Tushar Gupta Mohit Shah Karthikeyan Ramamurthy Topics This presentation
More information3D and Sequential Representations of Spatial Relationships among Photos
3D and Sequential Representations of Spatial Relationships among Photos Mahoro Anabuki Canon Development Americas, Inc. E15-349, 20 Ames Street Cambridge, MA 02139 USA mahoro@media.mit.edu Hiroshi Ishii
More informationHand & Upper Body Based Hybrid Gesture Recognition
Hand & Upper Body Based Hybrid Gesture Prerna Sharma #1, Naman Sharma *2 # Research Scholor, G. B. P. U. A. & T. Pantnagar, India * Ideal Institue of Technology, Ghaziabad, India Abstract Communication
More informationImage Processing by Bilateral Filtering Method
ABHIYANTRIKI An International Journal of Engineering & Technology (A Peer Reviewed & Indexed Journal) Vol. 3, No. 4 (April, 2016) http://www.aijet.in/ eissn: 2394-627X Image Processing by Bilateral Image
More informationSingle Channel Speaker Segregation using Sinusoidal Residual Modeling
NCC 2009, January 16-18, IIT Guwahati 294 Single Channel Speaker Segregation using Sinusoidal Residual Modeling Rajesh M Hegde and A. Srinivas Dept. of Electrical Engineering Indian Institute of Technology
More informationNEURALNETWORK BASED CLASSIFICATION OF LASER-DOPPLER FLOWMETRY SIGNALS
NEURALNETWORK BASED CLASSIFICATION OF LASER-DOPPLER FLOWMETRY SIGNALS N. G. Panagiotidis, A. Delopoulos and S. D. Kollias National Technical University of Athens Department of Electrical and Computer Engineering
More informationProceedings of Meetings on Acoustics
Proceedings of Meetings on Acoustics Volume 19, 2013 http://acousticalsociety.org/ ICA 2013 Montreal Montreal, Canada 2-7 June 2013 Architectural Acoustics Session 1pAAa: Advanced Analysis of Room Acoustics:
More informationSmartCanvas: A Gesture-Driven Intelligent Drawing Desk System
SmartCanvas: A Gesture-Driven Intelligent Drawing Desk System Zhenyao Mo +1 213 740 4250 zmo@graphics.usc.edu J. P. Lewis +1 213 740 9619 zilla@computer.org Ulrich Neumann +1 213 740 0877 uneumann@usc.edu
More informationDiscriminative Training for Automatic Speech Recognition
Discriminative Training for Automatic Speech Recognition 22 nd April 2013 Advanced Signal Processing Seminar Article Heigold, G.; Ney, H.; Schluter, R.; Wiesler, S. Signal Processing Magazine, IEEE, vol.29,
More informationIEEE TRANSACTIONS ON MULTIMEDIA, VOL. 7, NO. 1, FEBRUARY A Speech/Music Discriminator Based on RMS and Zero-Crossings
TRANSACTIONS ON MULTIMEDIA, VOL. 7, NO. 1, FEBRUARY 2005 1 A Speech/Music Discriminator Based on RMS and Zero-Crossings Costas Panagiotakis and George Tziritas, Senior Member, Abstract Over the last several
More informationAutonomous Vehicle Speaker Verification System
Autonomous Vehicle Speaker Verification System Functional Requirements List and Performance Specifications Aaron Pfalzgraf Christopher Sullivan Project Advisor: Dr. Jose Sanchez 4 November 2013 AVSVS 2
More informationRecurrent neural networks Modelling sequential data. MLP Lecture 9 / 13 November 2018 Recurrent Neural Networks 1: Modelling sequential data 1
Recurrent neural networks Modelling sequential data MLP Lecture 9 / 13 November 2018 Recurrent Neural Networks 1: Modelling sequential data 1 Recurrent Neural Networks 1: Modelling sequential data Steve
More informationSmart antenna for doa using music and esprit
IOSR Journal of Electronics and Communication Engineering (IOSRJECE) ISSN : 2278-2834 Volume 1, Issue 1 (May-June 2012), PP 12-17 Smart antenna for doa using music and esprit SURAYA MUBEEN 1, DR.A.M.PRASAD
More informationResearch on Hand Gesture Recognition Using Convolutional Neural Network
Research on Hand Gesture Recognition Using Convolutional Neural Network Tian Zhaoyang a, Cheng Lee Lung b a Department of Electronic Engineering, City University of Hong Kong, Hong Kong, China E-mail address:
More informationA Comparison of Histogram and Template Matching for Face Verification
A Comparison of and Template Matching for Face Verification Chidambaram Chidambaram Universidade do Estado de Santa Catarina chidambaram@udesc.br Marlon Subtil Marçal, Leyza Baldo Dorini, Hugo Vieira Neto
More informationTHE problem of acoustic echo cancellation (AEC) was
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 13, NO. 6, NOVEMBER 2005 1231 Acoustic Echo Cancellation and Doubletalk Detection Using Estimated Loudspeaker Impulse Responses Per Åhgren Abstract
More informationAuditory modelling for speech processing in the perceptual domain
ANZIAM J. 45 (E) ppc964 C980, 2004 C964 Auditory modelling for speech processing in the perceptual domain L. Lin E. Ambikairajah W. H. Holmes (Received 8 August 2003; revised 28 January 2004) Abstract
More informationVideo Synthesis System for Monitoring Closed Sections 1
Video Synthesis System for Monitoring Closed Sections 1 Taehyeong Kim *, 2 Bum-Jin Park 1 Senior Researcher, Korea Institute of Construction Technology, Korea 2 Senior Researcher, Korea Institute of Construction
More information