Performance study of Text-independent Speaker identification system using MFCC & IMFCC for Telephone and Microphone Speeches

Ruchi Chaudhary, National Technical Research Organization

Abstract: A state-of-the-art Speaker Identification (SI) system requires a robust feature extraction unit followed by a speaker classification scheme. Over the years, Mel-Frequency Cepstral Coefficients (MFCC), modelled on the human auditory system, have been used as a standard acoustic feature set for speech-related applications. It has also been shown that Inverted Mel-Frequency Cepstral Coefficients (IMFCC) form a useful feature set for SI, containing information complementary to MFCC since they cover the high-frequency region more closely. In this study, the performance of a speaker identification system is evaluated by generating Detection-Error-Trade-off (DET) curves for both MFCC and IMFCC (in individual and fused mode) using two different kinds of database (microphone speech and telephone speech). It is found that the IMFCC-based classifier produces improved accuracy, especially for the telephone speech database, and the preferred mixing proportions of the two streams (MFCC and IMFCC in the combined model) are also obtained for both kinds of database.

Key Words: Speaker Identification, MFCC, IMFCC, Fused feature set.

1. INTRODUCTION
Automatic speaker recognition is the task of verifying a person's claimed identity from his or her voice. In a text-independent speaker identification system there is no constraint on the words the speakers are allowed to use, so the reference utterances (what is spoken in training) and the test utterances (what is uttered in actual use) may have completely different content. Feature extraction is the method of obtaining the unique characteristic pattern of a speaker, known as a feature set. A feature set provides a more suitable, robust and compact representation of a speaker's speech than the raw input signal.
MFCC has been widely accepted as the feature input for a typical speaker recognition system because of its low vulnerability to noise perturbation, little session variability, and ease of extraction compared with other methods such as Line Spectral Frequency (LSF), Log Area Ratio (LAR), Perceptual Log Area Ratio (PLAR) and Perceptual Linear Prediction (PLP) [1-5]. The computation of MFCC involves averaging the low-frequency region (up to 1 kHz) of the energy spectrum by means of closely spaced, overlapping triangular filters; a smaller number of less closely spaced triangular filters is used to average the high

frequency zone. Figure 1 shows the block diagram for Mel-frequency cepstral coefficients: the continuous speech signal passes through frame blocking, Hamming windowing, Fourier transform, Mel-frequency warping with the Mel filter bank, logarithm and discrete cosine transform stages.

Figure 1: Block diagram for Mel-frequency cepstral coefficients.

For feature extraction, the Mel-scale frequency is related to linear frequency by the empirical equation (1), and Figure 2 shows the Mel-scale frequency as a function of linear-scale frequency:

    f_mel = 2595 log10(1 + f/700)    (1)

The inverse of the Mel-frequency warping function is given by (2):

    f = 700 (10^(f_mel/2595) - 1)    (2)

Figure 2: Mel-scale frequency related to linear-scale frequency.
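As a concrete illustration of the pipeline in Figure 1 and the warping equations (1) and (2), the following Python sketch builds a mel filter bank and computes MFCC-style coefficients. The frame length, hop, FFT size and filter count are assumed round numbers for 8 kHz speech, not values taken from the paper.

```python
import numpy as np

def hz_to_mel(f):
    """Eq. (1): f_mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Eq. (2): f = 700 * (10 ** (f_mel / 2595) - 1)."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, n_fft=256, fs=8000):
    """Triangular filters spaced uniformly on the mel scale, hence
    densely packed below ~1 kHz and sparse above it (cf. Figure 3)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, mid):          # rising edge of the triangle
            fb[i - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):          # falling edge of the triangle
            fb[i - 1, k] = (hi - k) / max(hi - mid, 1)
    return fb

def dct2_ortho(x):
    """Orthonormal DCT-II, implemented directly to stay dependency-free."""
    n = len(x)
    basis = np.cos(np.pi * np.outer(np.arange(n), 2 * np.arange(n) + 1) / (2 * n))
    c = basis @ x * np.sqrt(2.0 / n)
    c[0] /= np.sqrt(2.0)
    return c

def mfcc(signal, fs=8000, frame_len=200, hop=80, n_filters=20, n_ceps=13):
    """Frame blocking -> Hamming window -> |FFT|^2 -> mel filter bank
    -> log -> DCT; returns one coefficient vector per frame."""
    n_fft = 256
    fb = mel_filterbank(n_filters, n_fft, fs)
    win = np.hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        power = np.abs(np.fft.rfft(signal[start:start + frame_len] * win, n_fft)) ** 2
        frames.append(dct2_ortho(np.log(fb @ power + 1e-10))[:n_ceps])
    return np.array(frames)
```

Replacing `mel_filterbank` with its inverted counterpart (dense filters at high frequency) while keeping every other stage identical yields the IMFCC features discussed next.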

MFCC thus represents the low-frequency region more accurately than the high-frequency region and hence can efficiently capture the lower formants, which lie in the low-frequency range and characterize the vocal tract resonances. However, formants that lie above 1 kHz are not captured as effectively, because of the larger spacing of the filters in the higher frequency range, as shown in Figure 3.

Figure 3: Mel-scale filter bank structure.

The authors in [5, 6] conducted experiments in which the entire filter bank structure is inverted, so that the higher frequency range is averaged by more densely spaced filters while a smaller number of widely spaced filters is used in the lower frequency range. This feature set, named Inverted Mel-Frequency Cepstral Coefficients (IMFCC), follows the same procedure as MFCC but uses a reversed filter bank structure that is complementary in nature to the vocal tract characterization described by MFCC. Figure 4 shows the block diagram for the inverted Mel-scale cepstral coefficients: the stages are identical to those of Figure 1, except that the Mel filter bank and Mel-frequency warping are replaced by their inverted counterparts.

Figure 4: Block diagram for Inverted Mel-frequency cepstral coefficients.
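A minimal sketch of the inverted warping, assuming the paper's 8 kHz sampling rate: the mel curve is simply flipped about the upper band edge, reversing the resolution pattern of equations (1) and (2). The constants below follow from that flip and are an assumption consistent with the inverted-mel relations formalized below as equations (3) and (4), not a verbatim transcription of them.

```python
import math

NYQUIST = 4000.0  # half of the 8 kHz sampling rate used in the paper

def hz_to_mel(f):
    """Standard mel warping, Eq. (1)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def hz_to_inverted_mel(f):
    """Flip the mel warping about the band edge: resolution is now
    highest near 4 kHz and lowest near 0 Hz."""
    return hz_to_mel(NYQUIST) - hz_to_mel(NYQUIST - f)

def inverted_mel_to_hz(m):
    """Algebraic inverse of the flipped warping."""
    return NYQUIST - 700.0 * (10.0 ** ((hz_to_mel(NYQUIST) - m) / 2595.0) - 1.0)
```

Filters placed uniformly on this inverted scale come out dense near 4 kHz and sparse near 0 Hz, mirroring the inverted filter bank of Figure 6.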

To increase the frequency resolution in the high-frequency range, the inverted Mel warping function and its inverse (for a sampling frequency of 8 kHz) are used as the empirical relations (3) and (4); the inverted Mel scale as a function of linear frequency is presented in Figure 5, and the inverted Mel-scale filter bank structure is depicted in Figure 6:

    f_inverted-mel = 2146.06 - 2595 log10(1 + (4000 - f)/700)    (3)

    f = 4000 - 700 (10^((2146.06 - f_inverted-mel)/2595) - 1)    (4)

Figure 5: Inverted Mel-scale frequency warping.

Figure 6: Inverted Mel-scale filter bank structure.

On the inverted scale, filters are placed densely in the high-frequency range and sparsely in the low-frequency range. Figure 7 shows the filter banks, in the time domain, for (a) the Mel scale and (b)

the inverted Mel scale. Cepstral coefficients are calculated using the inverted Mel filter bank in place of the Mel filter bank; the detailed procedure is given in [5, 6].

Figure 7: (a) Mel filter bank (b) Inverted Mel filter bank, in the time domain.

A combination of two or more classifiers performs better when they are supplied with information that is complementary in nature [6-8]. MFCC and IMFCC feature vectors, which are complementary in information content, can therefore be fused to obtain improved identification accuracy. A number of combination schemes, such as product, sum, minimum, maximum, median and average, can be utilized, but the sum rule outperforms the other combination schemes and is the most resilient to estimation errors [6-8].

2. Databases used for the Experiments
Two kinds of database, telephone speech and microphone-recorded speech, were used for the experiments. They are described below.

(i) Telephone Speech: The Centre for Spoken Language Understanding (CSLU) Speaker Recognition corpus, collected from http://cslu.cse.ogi.edu, consists of telephone speech. Each participant recorded speech in twelve sessions by calling a toll-free telephone number and answering a few questions. The files were sampled at 8 kHz, 8-bit. There are 4 speakers (males and females), with 96 utterances per speaker. In this work, 36 (4 x 9) of the utterances are used for developing

the speaker model in training mode, and 4 (6 x 4) utterances are put under test to evaluate the identification accuracies.

(ii) Microphone-Recorded Speech: This database was obtained from the internet as speech recordings of speakers made with a microphone at a 16 kHz sampling rate; all speech samples were then down-sampled to 8 kHz. Each speaker contributed a set of utterances, each a few seconds in length. For this database too, part of the speeches per speaker are used for developing the speaker model in training mode and the rest are put under test to evaluate the identification accuracies.

3. Experiment Setup
The experiment was set up as shown in Figure 8 to obtain the performance of the fused MFCC-IMFCC speaker identification system (for the two kinds of database mentioned above) and to evaluate the system using Detection-Error-Trade-off (DET) plots. MFCC, IMFCC and parallel-fused MFCC-IMFCC GMM classifiers were created in Matlab. A Gaussian Mixture Model (GMM) based classifier is used, which provides an unsupervised clustering technique to model the speakers. For each speech, Gaussian-mixture feature sets are generated, and the scores obtained from the MFCC- and IMFCC-based SI systems are fused using the sum rule. For the i-th speech, the combined score S_com^i can be expressed as (5):

    S_com^i = w S_MFCC^i + (1 - w) S_IMFCC^i    (5)

where S_MFCC^i and S_IMFCC^i are the scores generated by the MFCC and IMFCC models respectively, and w is the fusion coefficient.

In Figure 8, the pre-processed speech is passed in parallel through MFCC and inverted-MFCC feature extraction; each feature stream is matched against its own GMM speaker-model database, and the two matching scores S_MFCC^i and S_IMFCC^i are summed according to (5) to produce the final output.

Figure 8: MFCC-IMFCC fused speaker identification system.

The performance of the fused system has been obtained for both databases. Thereafter, the performance of the fused speaker identification system is evaluated using DET plots for the two speech corpora, in order to analyse the effect of the fusion coefficient on the MFCC and IMFCC features.

4. Results & Discussion
DET performance curves have been obtained for MFCC, IMFCC and fused MFCC-IMFCC for both databases mentioned above. Figure 6(a) shows the speaker detection performance for MFCC, IMFCC and MFCC-IMFCC (with fusion coefficient 0.5) obtained using telephone speech, and Figure 6(b) shows the corresponding performance obtained using microphone speech.

Figure 6(a): DET curve for MFCC, IMFCC and fused MFCC-IMFCC (with fusion coefficient 0.5) for the telephone speech database.
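The sum-rule fusion of equation (5), and the equal-error-rate figure reported in the tables, can be sketched as follows; the score vectors and the threshold scan are illustrative stand-ins, not the paper's Matlab implementation.

```python
import numpy as np

def fuse_scores(s_mfcc, s_imfcc, w):
    """Sum rule, Eq. (5): S_com = w * S_MFCC + (1 - w) * S_IMFCC."""
    return w * np.asarray(s_mfcc, float) + (1.0 - w) * np.asarray(s_imfcc, float)

def identify_speaker(s_mfcc, s_imfcc, w=0.5):
    """Identified speaker = arg max over the fused per-speaker scores."""
    return int(np.argmax(fuse_scores(s_mfcc, s_imfcc, w)))

def equal_error_rate(genuine, impostor):
    """Scan candidate thresholds for the point where the miss rate
    (genuine trials rejected) equals the false-alarm rate (impostor
    trials accepted); report their average at the closest crossing."""
    genuine = np.asarray(genuine, float)
    impostor = np.asarray(impostor, float)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.concatenate([genuine, impostor])):
        miss = np.mean(genuine < t)
        fa = np.mean(impostor >= t)
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), (miss + fa) / 2.0
    return float(eer)
```

With w = 1 the system reduces to MFCC alone and with w = 0 to IMFCC alone; the sweep reported in Table 2 corresponds to evaluating the EER of the fused scores for each candidate w.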

Figure 6(b): DET curve for MFCC, IMFCC and fused MFCC-IMFCC for the microphone speech database.

Table 1: Equal Error Rate for the MFCC, IMFCC and fused speaker detection systems.

Database          | MFCC system | IMFCC system | MFCC-IMFCC fused system
Telephone speech  | 9%          | 7.9%         | 7.9%
Microphone speech | %           | 6%           | 48%

Speaker identification performance, in terms of the equal-error-rate parameter for the MFCC, IMFCC and fused MFCC-IMFCC feature sets, is summarized in Table 1 for both databases. The combined scheme shows significant improvement over the MFCC-based system alone, for both the microphone and the telephone speech databases. For the telephone speech database in particular, the independent performance of the IMFCC-based classifier is better than that of the MFCC-based classifier. Figure 7(a) shows the performance of MFCC-IMFCC fusion with various fusion coefficients obtained using telephone speech, and Figure 7(b) shows the corresponding performance obtained using microphone speech. The DET plots show the miss probability against the false-alarm

probability. Table 2 gives the comparative performance for different fusion coefficients.

Figure 7(a): DET curves for the telephone speech database with various fusion coefficients.

Figure 7(b): DET curves for the microphone speech database with various fusion coefficients.

Table 2: Equal Error Rate for MFCC-IMFCC fusion with various fusion coefficients.

Database          | w=0.5 | w=0.6 | w=0.4 | w=0.3 | w=0.2
Telephone speech  | 7.9%  | 9%    | 8.%   | 7.8%  | %

Microphone speech | 48%   | 4%    | 49%   | %     | 47%

Individual MFCC, IMFCC and fused MFCC-IMFCC features with different fusion coefficients were used for both databases. For the telephone speech database used here, the fusion coefficient 0.3 gives the best speaker identification performance, while for the microphone speech database the fusion coefficient 0.6 gives the best system performance. The same can also be established from the DET plots obtained through fusion with equal contributions of MFCC and IMFCC.

5. CONCLUSION
The IMFCC-based classifier can provide improved accuracy for the telephone speech database through a proper choice of the mixing proportion of the two streams in the combined model. The study reveals that, in order to improve the performance of the speaker identification system on a telephone speech database, the contribution of IMFCC should be larger than that of MFCC; this is because the bandwidth of the telephone channel is limited. On the other hand, for microphone speech the contribution of MFCC should be larger. The appropriate fusion coefficient for improving the accuracy of the system can be selected from the DET plots for any kind of database.

6. REFERENCES
1. J. Campbell, "Speaker recognition: a tutorial," Proceedings of the IEEE, vol. 85, no. 9, pp. 1437-1462, September 1997.
2. J. Kittler, "Combining Classifiers: A Theoretical Framework," Pattern Analysis & Applications, Springer-Verlag London, vol. 1, issue 1, pp. 18-27, 1998.
3. J. Kittler, M. Hatef, R. P. W. Duin and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226-239, March 1998.
4. J. Kittler and F. M. Alkoot, "Sum versus Vote Fusion in Multiple Classifier Systems," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 1, pp. 110-115, January 2003.

5. Sandipan Chakroborty, Anindya Roy and Goutam Saha, "Improved Closed Set Text-Independent Speaker Identification by Combining MFCC with Evidence from Flipped Filter Banks," International Journal of Information and Communication Engineering, vol. 4, issue 2, 2008.
6. Sandipan Chakroborty and Goutam Saha, "Improved Text-Independent Speaker Identification using Fused MFCC & IMFCC Feature Sets based on Gaussian Filter," International Journal of Signal Processing, vol. 5, issue 1, 2009.
7. Tomi Kinnunen and Haizhou Li, "An overview of text-independent speaker recognition: from features to supervectors," Speech Communication, vol. 52, pp. 12-40, 2010.
8. Nirmalya Sen, Tapan Basu and Sandipan Chakroborty, "Comparison of Features Extracted Using Time-Frequency and Frequency-Time Analysis Approach for Text-Independent Speaker Identification," IEEE National Conference on Communication, January 2011.
9. Satyanand Singh and E. G. Rajan, "Vector Quantization Approach for Speaker Recognition using MFCC and Inverted MFCC," International Journal of Computer Applications (ISSN 0975-8887), vol. 17, March 2011.

AUTHOR
Ruchi Chaudhary received her M.Tech degree from Guru Gobind Singh Indraprastha University, Kashmiri Gate, Delhi, and her B.Tech degree in Electronics & Communication Engineering from CSJM Kanpur University. She joined the Defence Research & Development Organisation as a Junior Research Fellow, and later joined Guru Prem Sukh Memorial College of Engineering as a Lecturer in the Department of Electronics & Communication, subsequently becoming Head of the Department of ECE at the same institution. She is presently working as a Scientist in a Government organization and pursuing a PhD at Guru Gobind Singh Indraprastha University. Her interests include speech processing and soft-computing techniques. She has also contributed a research paper on pattern recognition techniques to the International Journal of Sensors & Actuators.