Automatic Text-Independent Speaker Recognition Approaches Using Binaural Inputs

Karim Youssef, Sylvain Argentieri and Jean-Luc Zarader

Outline
- Automatic speaker recognition: introduction
- Designed systems
- MFCCs, PNNs, GMMs
- Simulated and recorded databases
- Results
- Conclusions and prospects

Automatic Speaker Recognition
A machine recognizes the person talking to it, based on that person's vocal characteristics.
- Closed or open sets of persons
- Text-dependent or text-independent
- Fields of use: access control, forensics, speaker indexing in recorded conversations, customizing services or information to individuals by voice
- Problems faced: robot movement; real-life environments (noise, reverberation); mismatched conditions between the learning and testing phases; intra-speaker speech variability

Binaural Speaker Recognition: Our System
To show the advantage of binaural processing, two kinds of systems have been designed:
- Monaural system: Monaural Signal -> Frame Extraction -> Silence Elimination -> MFCC Coding -> PNN/GMM
- Binaural systems: Left & Right Ear Signals -> Frame Extraction -> Silence Elimination, followed by one of two branches:
  - MFCC Coding & concatenation -> PNN/GMM
  - Intercorrelation & MFCC Coding -> PNN/GMM
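
The two binaural branches can be illustrated with a minimal sketch. The slides do not define the intercorrelation precisely, so this assumes it is the plain cross-correlation of the left-ear and right-ear frames; the function names are hypothetical.

```python
import numpy as np

def frame_intercorrelation(left_frame, right_frame):
    """Full cross-correlation between the left-ear and right-ear frames."""
    # np.correlate(a, v, "full")[i] corresponds to lag k = i - (len(v) - 1),
    # with c[k] = sum_n a[n + k] * v[n].
    return np.correlate(left_frame, right_frame, mode="full")

def peak_lag(left_frame, right_frame):
    """Lag (in samples) at which the two ear signals are most similar."""
    corr = frame_intercorrelation(left_frame, right_frame)
    return int(np.argmax(corr)) - (len(right_frame) - 1)

def concatenate_features(left_features, right_features):
    """Concatenation branch: stack the per-ear feature vectors into one."""
    return np.concatenate([left_features, right_features])
```

In the intercorrelation branch, the correlated frame would then be passed to the MFCC coder; in the concatenation branch, each ear is coded separately and the two vectors are joined.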

Mel-Frequency Cepstral Coding
Processing chain: FFT -> Triangular Filters -> Output Energies -> Logarithm -> DCT
Usage:
- Voice recognition
- Music information retrieval (genre classification)
Mel scale: approximates human hearing better than the Hertz scale.
Delta-MFCCs are used to improve performance (they reflect temporal variation).
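
The processing chain above can be sketched in NumPy as follows. This is only an illustration: the sampling rate, the Hamming window, the number of filters and the number of coefficients are assumptions, not values taken from the presentation, and the delta computation is a simple finite difference.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):      # rising slope
            fbank[i, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling slope
            fbank[i, k] = (right - k) / (right - center)
    return fbank

def mfcc(frame, sample_rate=16000, n_filters=24, n_ceps=13):
    """FFT -> triangular filters -> output energies -> logarithm -> DCT."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    energies = mel_filterbank(n_filters, n_fft, sample_rate) @ power
    log_energies = np.log(energies + 1e-10)  # small offset avoids log(0)
    # DCT-II written out explicitly to keep the sketch NumPy-only
    n = np.arange(n_filters)
    k = np.arange(n_ceps)[:, None]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return dct_basis @ log_energies

def delta(cepstra):
    """Frame-to-frame variation of an MFCC sequence of shape (frames, n_ceps)."""
    return np.gradient(cepstra, axis=0)
```

Calling `mfcc` on each frame and stacking the results gives the sequence that `delta` differentiates over time.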

Predictive Neural Networks
- One specific network for each speaker
- Principle: prediction of a frame based on the previous two
- Usage in recognition: classification based on the network with the minimal reconstruction error
- Parallel learning: each PNN is trained independently of the others.
- Performance equalization by controlling the learning speed of the networks: training is stopped for a network that learns faster than the others, and the least confusing configuration is memorized.
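
The decision rule (one predictor per speaker, pick the one with the smallest error) can be sketched as below. Note the substitution: for brevity, a linear least-squares predictor stands in for each neural network, so this shows the classification principle only, not the networks or the training schedule used in the presentation. All function names are hypothetical.

```python
import numpy as np

def make_pairs(frames):
    """Inputs are the two previous frames, targets are the current frame."""
    inputs = np.hstack([frames[:-2], frames[1:-1]])
    targets = frames[2:]
    return inputs, targets

def fit_linear_predictor(frames):
    """Stand-in for training one speaker's PNN (linear model, not a network)."""
    inputs, targets = make_pairs(frames)
    weights, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
    return weights

def prediction_error(weights, frames):
    """Mean squared error when predicting each frame from the previous two."""
    inputs, targets = make_pairs(frames)
    return float(np.mean((inputs @ weights - targets) ** 2))

def classify(frames, predictors):
    """Index of the speaker whose predictor gives the minimal error."""
    errors = [prediction_error(w, frames) for w in predictors]
    return int(np.argmin(errors))
```

Here `frames` is a (number of frames, feature dimension) array, for instance a sequence of MFCC vectors.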

Gaussian Mixture Models
- Weighted sum of Gaussian PDFs
- M Gaussians per GMM
- For each Gaussian: weight, mean vector and covariance matrix
- Optimal parameters: Expectation-Maximization algorithm
- One GMM per speaker
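
A minimal sketch of how one GMM per speaker can be used at test time: each model scores the frames, and the speaker with the highest total log-likelihood is selected. Diagonal covariance matrices are assumed here for simplicity (the slides do not say which form was used), and the EM training step is not shown.

```python
import numpy as np

def gmm_log_likelihood(frames, weights, means, variances):
    """Per-frame log p(x) under a diagonal-covariance GMM.

    frames: (T, D); weights: (M,); means and variances: (M, D).
    """
    diff = frames[:, None, :] - means[None, :, :]           # (T, M, D)
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(variances), axis=1)
        + frames.shape[1] * np.log(2.0 * np.pi)
    )                                                       # (T, M)
    log_weighted = log_gauss + np.log(weights)
    # log-sum-exp over the M components, for numerical stability
    peak = log_weighted.max(axis=1, keepdims=True)
    return peak[:, 0] + np.log(np.sum(np.exp(log_weighted - peak), axis=1))

def identify_speaker(frames, speaker_gmms):
    """speaker_gmms: list of (weights, means, variances), one per speaker."""
    scores = [gmm_log_likelihood(frames, *gmm).sum() for gmm in speaker_gmms]
    return int(np.argmax(scores))
```

Summing the per-frame log-likelihoods treats the frames as independent, which is the usual simplification for this kind of scoring.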

Simulated database
- Radio recordings: French monologues with low-level ambient noise
- 10 speakers
- Simulation: Monaural Signal -> Left HRTF / Right HRTF -> Left Ear / Right Ear
- HRTF database: from the CIPIC Interface Laboratory, University of California, Davis
- HRTF: description of the filtering applied to the signal
- How HRTFs are obtained: measurement of the impulse responses at both eardrums, in an anechoic chamber
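
The simulation step amounts to filtering the monaural signal with the left and right impulse responses for the chosen direction. A minimal sketch, assuming the two head-related impulse responses are already available as equal-length arrays (loading them from the CIPIC files is not shown, and the function name is hypothetical):

```python
import numpy as np

def simulate_binaural(mono, hrir_left, hrir_right):
    """Convolve a monaural signal with the left and right impulse responses.

    Returns an array of shape (2, len(mono) + len(hrir) - 1): row 0 is the
    left-ear signal, row 1 the right-ear signal.
    """
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])
```

Changing the simulated direction of the speaker only requires swapping in the impulse-response pair measured for that azimuth and elevation.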

Real recorded database
- 15 speakers, 50 minutes per speaker
- Silence and white noise recordings
- KU 100 dummy head
- Data sources: two ears and one monaural microphone

PNNs - Experimental results: position effect
Binaural signals: the position of the speaker enters the system as a parameter.
Influence of the speaker's position on recognition. Two kinds of information are available:
- Specificities of the speech signal
- Directional cues
3 directional groups:
- 3 speakers per group (speakers 1 to 9)
- Directions: (azimuth, elevation) = {(-45, -45), (0, 45), (45, -45)}
Comparison of the confusion rates within the same directional group and between different groups -> influence of the directional cues on the identification of speakers
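
One way to compare confusion within and between directional groups is sketched below. It assumes a confusion matrix of counts (rows: true speaker, columns: recognized speaker) and averages the row-normalized off-diagonal rates; the exact averaging used in the presentation is not stated, so this is only one plausible reading.

```python
import numpy as np

def group_confusion_rates(confusion_counts, groups):
    """Mean intra-group and inter-group confusion rates.

    confusion_counts[i, j]: number of tests of speaker i recognized as j.
    groups[i]: directional group of speaker i.
    """
    counts = np.asarray(confusion_counts, dtype=float)
    groups = np.asarray(groups)
    rates = counts / counts.sum(axis=1, keepdims=True)
    same_group = groups[:, None] == groups[None, :]
    off_diagonal = ~np.eye(len(groups), dtype=bool)
    intra = rates[same_group & off_diagonal].mean()   # different speaker, same group
    inter = rates[~same_group].mean()                 # speakers from different groups
    return float(intra), float(inter)
```

A large gap between the two values suggests that the classifier has learned the direction rather than the speaker.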

PNNs - position effect: concatenation
- Significant difference between the intra-group and inter-group confusion rates: the confusion rate between two speakers from the same directional group ranges from 8.24% to 31.79%, while it is generally below 1% for two speakers from different directional groups.
-> Strong learning of the directional cues, supplanting the characterization of the speaker.

PNNs - position effect: intercorrelation
- The intra-group and inter-group confusion rates are not so different: about 10-15% and 5%, respectively. These values show that even if BM-I (the intercorrelation-based binaural method) is not totally insensitive to the directional information, it learns the directional cues less strongly than the concatenation method does.
-> Opportunity to efficiently discriminate speakers regardless of the direction of emission.

PNNs - noise effect
- Noise addition
- Learning performed on 14 directions
- Testing on durations of 3, 5 and 15 seconds (majority vote over the frame-level classifications)
- The binaural intercorrelation method is relatively robust to noise: it outperforms the monaural method.
- The binaural concatenation method does not provide such robustness to noisy conditions: its performance remains nearly identical to that of the monaural system.
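
The two test-time operations mentioned above, adding noise at a chosen SNR and taking a majority vote over frame-level decisions, can be sketched as follows. The noise type and the way it was scaled are not given in the slides; this assumes a noise array of the same length as the signal and integer speaker labels starting at 0.

```python
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so that the mixture has the requested SNR in dB."""
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise

def majority_vote(frame_labels):
    """Most frequent label among the per-frame decisions.

    Ties go to the smallest label, a side effect of argmax.
    """
    return int(np.bincount(np.asarray(frame_labels)).argmax())
```

A longer test duration simply means more frame-level decisions entering the vote, which is consistent with recognition rates rising with duration.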

GMMs - Experimental results
Training:
- One direction per speaker:
  - the same direction for each group of speakers
  - one speaker per direction
- Multiple directions per speaker:
  - 10 directions per speaker
Tests:
- on the training directions
- on unlearned directions

GMMs - Simulated database
The same direction for a group of speakers:
- 3 speakers per group
- group 1: (0, 0); group 2: (0, 45); group 3: (0, -45)
- Recognition rates increase with SNR and testing duration
- Recognition rates decrease when the direction changes between training and testing
- Best performance: obtained on the (0, 0) direction, which is the least distant from the other directions

GMMs - Simulated database
One specific direction for each speaker

Testing on the training directions (recognition rates, %):

SNR (dB)  Frame (22 ms)  1 s    3 s    5 s    15 s
10        54.63          97.75  100    100    100
0         38.64          92.52  98.26  98.94  100
-3        33.22          85.62  94.8   97.2   98.48

Testing on other directions (recognition rates, %):

SNR (dB)  Frame (22 ms)  1 s    3 s    5 s    15 s
10        36.69          86.86  93.66  95.35  96.88
0         23.09          53.49  60.77  63.48  64.97
-3        19.45          39.81  42.5   43.48  46.65

- Best performance when testing on the training directions
- Very high direction sensitivity

GMMs - Simulated database
10 directions per speaker:
- the same 10 training directions for all the speakers
- most realistic case
- highest robustness to changing directions

GMMs - Recorded database
- One direction for all the speakers, and multiple directions per speaker: same interpretation as for the simulated database
- Testing with moving speakers: performance decreases, but it is a realistic case

Conclusions and prospects
Conclusions:
- Binaural systems are sensitive to the speaker's position.
- The intercorrelation and concatenation techniques are complementary.
- Well-trained binaural systems offer better performance than monaural ones.
- GMMs discriminate better than PNNs.
Prospects:
- Development of a more advanced voice activity detector
- Combined binaural speaker recognition and localization

Thank you!