Automatic Text-Independent Speaker Recognition Approaches Using Binaural Inputs

Automatic Text-Independent Speaker Recognition Approaches Using Binaural Inputs
Karim Youssef, Sylvain Argentieri and Jean-Luc Zarader

Outline
- Automatic speaker recognition: introduction
- Designed systems: MFCCs, PNNs, GMMs
- Simulated and recorded databases
- Results
- Conclusions and prospects

Automatic Speaker Recognition
A machine recognizes a person talking to it, based on vocal characteristics.
- Closed or open sets of persons
- Text-dependent or text-independent
- Fields of use: access control, forensics, speaker indexing in recorded conversations, customizing services or information to individuals by voice
- Faced problems: robot movement, real-life environments (noises, reverberations), condition differences between the learning and testing phases, intra-speaker speech variability

Binaural Speaker Recognition: Our System
To show the advantage of binaural processing, two systems have been designed:
- Monaural system: Monaural Signal → Frame Extraction → Silence Elimination → MFCC Coding → PNN/GMM
- Binaural systems: Left & Right Ear Signals → Frame Extraction → Silence Elimination → either MFCC Coding & concatenation, or Intercorrelation & MFCC Coding → PNN/GMM
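The slides name the intercorrelation front-end but give no implementation details. A minimal numpy sketch of the idea — cross-correlating a pair of left/right ear frames before MFCC coding — could look as follows (frame length, normalization and the roll-based delay simulation are assumptions for illustration, not taken from the slides):

```python
import numpy as np

def intercorrelation(left_frame, right_frame):
    """Cross-correlate a left/right ear frame pair.

    The resulting sequence would feed the same MFCC coder as a
    monaural frame (a sketch; the normalization is an assumption).
    """
    corr = np.correlate(left_frame, right_frame, mode="full")
    norm = np.linalg.norm(left_frame) * np.linalg.norm(right_frame) + 1e-12
    return corr / norm

# A delayed copy of a signal peaks away from the centre lag, which is
# how the correlation sequence carries directional information.
rng = np.random.default_rng(0)
sig = rng.standard_normal(256)
left = sig
right = np.roll(sig, 3)            # simulate a 3-sample interaural delay
corr = intercorrelation(left, right)
lag = int(np.argmax(corr)) - (len(sig) - 1)
```

The peak lag of the correlation reflects the interaural time difference, so the MFCCs computed from this sequence mix speaker and direction information, which is consistent with the position sensitivity discussed later.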

Mel-Frequency Cepstral Coding
Pipeline: FFT → Triangular Filters → Output Energies → Logarithm → DCT
Uses:
- Vocal recognition
- Musical information retrieval (genre classification)
Mel scale: approximates human hearing better than the Hertz scale.
Delta-MFCCs are used to improve performance (they reflect temporal variation).
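The pipeline above can be sketched for a single frame in numpy. The sample rate, filter count and cepstral order below are common illustrative defaults, not values from the slides:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr=16000, n_filters=26, n_ceps=13):
    """Minimal single-frame MFCC: |FFT| -> mel triangular filters
    -> log energies -> DCT-II. Parameter values are assumptions."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(n_fft)))
    # Triangular filterbank on a mel-spaced grid
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_e = np.log(fbank @ spectrum + 1e-10)
    # DCT-II decorrelates the log filterbank energies
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return dct @ log_e

frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)
ceps = mfcc(frame)
```

Delta-MFCCs would then be obtained by differencing these vectors across consecutive frames.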

Predictive Neural Networks
- One specific network for each speaker
- Principle: prediction of a frame based on the previous two
- Use in recognition: classification based on the network with the minimal reconstruction error
Parallel learning: each PNN is trained independently from the others. Performance equalization by controlling the learning speed of the networks: stop the training of a network that is faster than the others and memorize the least confusing configuration.
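The predict-from-two-previous-frames principle and the minimum-reconstruction-error decision can be illustrated with a linear predictor as a stand-in for the neural network (the slides' PNN architecture and training schedule are not specified, so least squares is used here purely for brevity):

```python
import numpy as np

class FramePredictor:
    """Linear stand-in for a predictive network: predicts frame t
    from frames t-1 and t-2 (least squares replaces the slides'
    neural training; this is an illustrative assumption)."""
    def fit(self, frames):
        X = np.hstack([frames[:-2], frames[1:-1]])   # two previous frames
        Y = frames[2:]
        self.W, *_ = np.linalg.lstsq(X, Y, rcond=None)
        return self
    def error(self, frames):
        X = np.hstack([frames[:-2], frames[1:-1]])
        Y = frames[2:]
        return float(np.mean((X @ self.W - Y) ** 2))

def identify(models, frames):
    # Decision rule: the speaker whose model reconstructs best wins.
    return int(np.argmin([m.error(frames) for m in models]))

# Toy usage: two "speakers" with different frame dynamics.
rng = np.random.default_rng(1)
def make_frames(coef, T=200, d=4):
    f = rng.standard_normal((T, d)) * 0.1
    for t in range(2, T):
        f[t] += coef * f[t - 1]
    return f

models = [FramePredictor().fit(make_frames(0.9)),
          FramePredictor().fit(make_frames(-0.9))]
```

Each model is fitted only on its own speaker's frames, mirroring the parallel, independent learning described above.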

Gaussian Mixture Models
- Weighted sum of Gaussian PDFs, M Gaussians per GMM
- For each Gaussian: weight, mean vector and covariance matrix
- Optimal parameters: Expectation-Maximization algorithm
- One GMM per speaker

Simulated database
- Radiophonic recordings: French monologues with low-level ambient noise, 10 speakers
- Monaural Signal → Left HRTF → Left Ear; Monaural Signal → Right HRTF → Right Ear
- HRTF database: from the CIPIC Interface Laboratory, University of California
- HRTF: description of the filtering applied to the signal
- Obtained by measurement of the impulse responses at both eardrums, in an anechoic chamber
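The binaural simulation boils down to convolving the monaural signal with a left and a right head-related impulse response. A sketch (the toy HRIRs below — a pure interaural delay and level difference — are placeholders, not actual CIPIC measurements):

```python
import numpy as np

def binauralize(mono, hrir_left, hrir_right):
    """Render a binaural pair from a monaural signal by convolving
    it with left/right head-related impulse responses."""
    return np.convolve(mono, hrir_left), np.convolve(mono, hrir_right)

# Toy HRIRs: near ear undelayed, far ear delayed and attenuated.
hl = np.zeros(8); hl[0] = 1.0
hr = np.zeros(8); hr[3] = 0.6
mono = np.random.default_rng(0).standard_normal(128)
left, right = binauralize(mono, hl, hr)
```

With measured CIPIC HRIRs in place of `hl`/`hr`, each (azimuth, elevation) pair in the database yields one simulated listening direction per speaker.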

Real recorded database
- 15 speakers, 50 minutes per speaker
- Silence and white noise recordings
- KU 100 dummy head
- Data sources: two ears and one monaural microphone

PNNs - Experimental results: position effect
Binaural signals: the position of the speaker enters as a parameter of the system. Two kinds of information influence recognition:
- Specificities of the speech signal
- Directional cues
3 directional groups:
- 3 speakers per group
- Directions: (azimuth, elevation) = {(-45, -45), (0, 45), (45, -45)}
Comparing the confusion rates within the same directional group and between different groups reveals the influence of the directional cues on speaker identification.

PNNs - position effect: concatenation
Significant difference between the intra-group and inter-group confusion rates: the confusion rate between two speakers from the same directional group ranges from 8.24% to 31.79%, while it is generally less than 1% for two speakers from different directional groups. This indicates a strong learning of the directional cues, supplanting the characterization of the speaker.

PNNs - position effect: intercorrelation vs. concatenation
The intra-group and inter-group confusion rates are not so different, being equal to about 10-15% and 5% respectively. These values show that even if BM-I (the binaural intercorrelation method) is not totally insensitive to the directional information, the learning of directional cues is now lower: an opportunity to efficiently discriminate speakers regardless of the direction of emission.

PNNs - noise effect
- Noise addition
- Learning performed on 14 directions
- Testing on durations of 3, 5 and 15 seconds (majority vote over frame-sequence classifications)
- The binaural intercorrelation method is relatively robust to noise: it outperforms the monaural method.
- The binaural concatenation method does not provide such robustness to noisy conditions: its performance remains nearly identical to the monaural system's.
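The segment-level decision mentioned above is a plain majority vote over per-frame classifications; a one-function sketch:

```python
import numpy as np

def majority_vote(frame_labels):
    """Segment decision: each frame votes for a speaker and the
    most frequent label wins."""
    labels, counts = np.unique(frame_labels, return_counts=True)
    return int(labels[np.argmax(counts)])

decision = majority_vote([2, 1, 2, 2, 0])
```

Longer test durations (3, 5, 15 s) give more frames per vote, which is why the recognition rates reported later improve with duration.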

GMMs - Experimental results
Training:
- One direction per speaker: either the same direction for each group of speakers, or one speaker per direction
- Multiple directions per speaker: 10 directions per speaker
Tests:
- On the training directions
- On unlearned directions

GMMs - Simulated database: the same direction for a group of speakers
- 3 speakers per group; group 1: (0, 0); group 2: (0, 45); group 3: (0, -45)
- Recognition rates increase with SNR and testing duration
- Recognition rates decrease when directions change between training and testing
- Best performance obtained for the (0, 0) direction, which is the least distant from the other directions

GMMs - Simulated database: one specific direction for each speaker

Testing on the training directions (recognition rates, %):

SNR (dB) | Frame (22 ms) | 1 s   | 3 s   | 5 s   | 15 s
10       | 54.63         | 97.75 | 100   | 100   | 100
0        | 38.64         | 92.52 | 98.26 | 98.94 | 100
-3       | 33.22         | 85.62 | 94.8  | 97.2  | 98.48

Testing on other directions (recognition rates, %):

SNR (dB) | Frame (22 ms) | 1 s   | 3 s   | 5 s   | 15 s
10       | 36.69         | 86.86 | 93.66 | 95.35 | 96.88
0        | 23.09         | 53.49 | 60.77 | 63.48 | 64.97
-3       | 19.45         | 39.81 | 42.5  | 43.48 | 46.65

- Best performances when testing on the training directions
- Very high direction sensitivity

GMMs - Simulated database: 10 directions per speaker
- The same 10 training directions for all the speakers
- The most realistic case
- Highest robustness to changing directions

GMMs - Recorded database
- One direction for all the speakers / multiple directions per speaker: same interpretation as for the simulated database
- Testing with moving speakers: performances decrease, but this is the realistic case

Conclusions and prospects
Conclusions:
- Binaural sensitivity to the speaker's position
- Complementarity between the intercorrelation and the concatenation techniques
- Well-trained binaural systems offer better performance than monaural ones
- GMMs discriminate better than PNNs
Prospects:
- Development of a more advanced voice activity detector
- Combined binaural speaker recognition and localization

Thank you!