MFCC AND GMM BASED TAMIL LANGUAGE SPEAKER IDENTIFICATION SYSTEM

www.advancejournals.org - Open Access Scientific Publisher

P. Santhiya 1, T. Jayasankar 1
1 AUT (BIT Campus), Tiruchirappalli, India
Correspondence should be addressed to P. Santhiya.
Received March 27, 2015; Accepted April 02, 2015; Published June 25, 2015.
Copyright 2015 P. Santhiya et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Cite This Article: Santhiya, P., Jayasankar, T. (2015). MFCC and GMM based Tamil language speaker identification system. Advances in Engineering & Scientific Research, 1(1), 1-6.

ABSTRACT - Speaker identification is the process of automatically identifying who is speaking on the basis of the individual information contained in speech waves. It is one of the most useful biometric recognition techniques in a world where insecurity is a major threat; organizations such as banks, institutions and industries currently use this technology to provide greater security for their vast databases. Speaker identification involves two main modules: feature extraction and feature matching. Feature extraction extracts a small amount of data from the speaker's voice signal that can later be used to represent that speaker. Feature matching is the procedure that identifies an unknown speaker by comparing the features extracted from his/her voice input with those already stored in the speech database. This paper presents an overview of feature extraction using Mel Frequency Cepstrum Coefficients (MFCC), the most widely used feature in speaker identification systems, together with GMM-based feature matching for Tamil speaker identification.

KEY WORDS - Mel Frequency Cepstrum Coefficients, GMM, feature matching, feature extraction, DCT.

INTRODUCTION

Speaker identification (SI) is the process of identifying an individual by extracting and processing information from his/her speech. It is the task of finding the best-matching speaker for an unknown speaker in a database of known speakers, and is mainly a part of speech processing. An SI system enables people to have secure access to information and property.

Speaker identification methods can be divided into two categories. In open-set SI, a reference model for the unknown speaker may not exist, so an additional decision alternative is required when the unknown speaker does not match any of the stored models. In closed-set SI, a set of N distinct speaker models is stored in the identification system by extracting parameters from the speech samples of N speakers; at identification time, the same parameters are extracted from the new speech input and the system decides which of the N known speakers best matches them [10].

Speaker identification systems are further divided into text-dependent and text-independent methods [1]. A text-dependent method requires the speaker to utter key words or sentences with the same text in both the training and identification trials, whereas in a text-independent method the utterances provided during identification trials may be independent of those used in the training phase [2].
SPEAKER IDENTIFICATION SYSTEM MODULE

A speaker identification system is composed of the following modules:

Front-end processing
Speaker modelling
Speaker database
Decision logic

Figure 1: Block diagram of a speaker identification system

The main aim of speaker identification is to compare a speech signal from an unknown speaker against the models of known speakers.

Front-end processing
This is the first step in creating feature vectors. It is the signal-processing part, which converts the sampled speech signal into a set of feature vectors characterizing the properties of speech that can separate different speakers. Front-end processing is performed in both the training and testing phases. Its objective is to modify the speech signal so that it is more suitable for feature extraction analysis. The front-end processing operation is based on noise cancelling, framing, windowing and pre-emphasis.

Feature extraction module
MFCC is one of the most important feature extraction techniques for speaker identification. The goal of feature extraction is to find a set of properties of an utterance that have acoustic correlations with the speech signal, that is, parameters that can be computed or estimated through processing of the signal waveform. Such parameters are termed features. Feature extraction includes measuring important characteristics of the signal, such as energy or frequency response, augmenting these measurements with perceptually meaningful derived measurements, and statistically conditioning these numbers to form observation vectors. The speaker identification system thus includes the following processes [7]: pre-processing, feature extraction and feature matching.

Figure 2: Steps involved in a speaker identification system

Modeling
The objective of the modeling technique is to generate a model for each speaker using the specific feature vectors extracted from that speaker. Modeling reduces the feature data by modelling the distributions of the feature vectors. Speaker recognition is also divided into speaker-dependent and speaker-independent modes. In the speaker-independent mode of speech recognition, the machine should ignore the speaker-specific characteristics of the speech signal and extract the intended message; in the speaker-dependent mode, it should instead extract the speaker characteristics present in the acoustic signal.

Speaker database
The speaker models are stored here. These models are obtained for each speaker by using the feature vectors extracted from that speaker, and they are used to identify the unknown speaker during the testing phase.

Decision logic
This module makes the final decision about the identity of the speaker by comparing the unknown speaker to all models in the database and selecting the best-matching model.

Pre-processing

Noise removal
Noise degrades the performance of a speaker identification system. De-noising is performed here by a wavelet decomposition technique: the original signal is decomposed, the detail coefficients are thresholded, and the signal is reconstructed. The decomposition portion of de-noising is accomplished via the Discrete Wavelet Transform (DWT), which is commonly implemented using dyadic multirate filter banks, i.e. sets of filters that divide a signal's frequency band into sub-bands. These filter banks are comprised of low-pass, high-pass or band-pass filters.
If the filter banks are wavelet filter banks, consisting of special low-pass and high-pass wavelet filters, then the outputs of the low-pass filter are the approximation coefficients and the outputs of the high-pass filter are the detail coefficients. The process of obtaining the approximation and detail coefficients is called decomposition. If a threshold operation is applied to the output of the DWT and wavelet coefficients below a specified value are removed, the system performs a de-noising function. There are two threshold operations: in hard thresholding, coefficients whose absolute values are lower than the threshold are set to zero; soft thresholding extends this by also shrinking the remaining nonzero coefficients toward zero. We have picked Daubechies 4 (db4) as our analysis wavelet with a three-level decomposition, which provides sufficient noise reduction.

Figure 3: Wavelet decomposition

The a3(n) trace represents the third-level approximation coefficients, which are the high-scale, low-frequency components; a3(n) is similar to the original signal. The other three waveforms (d3(n), d2(n) and d1(n)) are the detail coefficients, which are the low-scale, high-frequency components.
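The de-noising step just described can be sketched in a few lines of Python with the PyWavelets library. This is a minimal illustration, not the authors' implementation; in particular, the paper does not say how the threshold value is chosen, so the median-based universal threshold used below is an assumption.

```python
import numpy as np
import pywt

def denoise_dwt(signal, wavelet="db4", level=3):
    # Decompose into [a3, d3, d2, d1]: approximation + detail coefficients
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Noise estimate from the finest detail coefficients
    # (assumed rule, not specified in the paper)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2.0 * np.log(len(signal)))
    # Soft-threshold the detail coefficients, keep the approximation
    coeffs[1:] = [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    # Reconstruct the de-noised signal
    return pywt.waverec(coeffs, wavelet)[: len(signal)]
```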

Mel frequency cepstral coefficient estimation
After the background noise has been removed from the voice signal, feature extraction begins. Feature extraction is the process of obtaining different features of the voice signal, such as amplitude, pitch and the vocal tract shape; it is the task of finding a parameter set from the input voice signal. The extracted features should satisfy several criteria [11]:

Stable over time
Occur frequently and naturally in speech
Not susceptible to mimicry
Easy to measure
Show little fluctuation from one speaking environment to another
Discriminate between speakers while being tolerant of intra-speaker variability

We use Mel Frequency Cepstrum Coefficients (MFCC) to extract features from the voice signal. MFCC is based on a series of calculations that use the cepstrum with a nonlinear frequency axis following the mel scale. To obtain the mel cepstrum, the voice signal is first windowed with an analysis window and its Discrete Fourier Transform is computed. The main purpose of MFCC is to mimic the behaviour of the human ear. MFCC estimation is subdivided into five phases or blocks [7][8].

Figure 4: Block diagram of MFCC

Frame blocking
Speech is a non-stationary signal. If the frame is too long, signal properties may change too much across the window, adversely affecting the time resolution; if the frame is too short, the resolution of narrowband components is sacrificed, adversely affecting the frequency resolution. There is thus a trade-off between time resolution and frequency resolution. We choose 256 samples per frame, with 128 samples overlapping between adjacent frames. Overlapping frames are used to capture information that may occur at the frame boundaries. The number of frames is obtained by dividing the total number of samples in the input speech file by 128; the last frame may require zero-padding to cover all input samples. All frames are stored as rows of a single matrix whose number of rows equals the number of frames and whose number of columns equals 256, the frame width.

Figure 5: Frame of input signal
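A minimal sketch of this frame-blocking step with the stated parameters (256-sample frames, 128-sample overlap, zero-padded last frame):

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    # Number of frames = total samples / 128, rounded up so the last
    # (zero-padded) frame covers all input samples
    num_frames = int(np.ceil(len(x) / hop))
    padded = np.zeros(num_frames * hop + (frame_len - hop))
    padded[: len(x)] = x
    # Row m holds samples [m*hop, m*hop + frame_len)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    return padded[idx]  # shape: (num_frames, 256)
```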

Windowing
The next step is to window each individual frame so as to minimize the signal discontinuities at its beginning and end. The concept is to reduce spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame [9]. If we define the window as W(n), 0 ≤ n ≤ N−1, where N is the number of samples in each frame, then the result of windowing is the signal

y_l(n) = x_l(n) · W(n), 0 ≤ n ≤ N−1  (1)

Typically the Hamming window is used, which has the form

W(n) = 0.54 − 0.46 cos(2πn / (N−1)), 0 ≤ n ≤ N−1  (2)

Figure 6: Windowed input signal

FFT block
Spectral information means the energy levels at different frequencies in the given window. The time-domain data are converted into the frequency domain by applying the Discrete Fourier Transform (DFT), given by equation (3):

X(k) = Σ_{n=0}^{N−1} x(n) e^{−j2πkn/N}, k = 0, 1, ..., N−1  (3)

Here x(n) represents an input frame of 256 samples and X(k) its DFT. We use a 256-point FFT algorithm to convert each frame of 256 samples into its equivalent DFT. The FFT output is a set of complex numbers, i.e. real and imaginary parts. Speech recognition systems deal with real data, so only the spectral magnitude is retained. If Re(X(k)) and Im(X(k)) are the real and imaginary parts of X(k), the spectral magnitude of the speech signal is obtained by equation (4):

|X(k)| = √( Re(X(k))² + Im(X(k))² )  (4)

The spectral magnitudes of each frame are stored as rows of a single matrix whose number of rows equals the number of frames and whose number of columns equals 256, the frame width.
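The windowing and FFT stages (equations (1) through (4)) could be realized as follows; this is a sketch under the same frame layout as above, not the authors' code:

```python
import numpy as np

def spectral_magnitudes(frames, n_fft=256):
    N = frames.shape[1]
    n = np.arange(N)
    window = 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (N - 1))  # equation (2)
    windowed = frames * window                                 # equation (1)
    # 256-point FFT per frame; keep the magnitude only (equations (3)-(4))
    return np.abs(np.fft.fft(windowed, n=n_fft, axis=1))       # (num_frames, 256)
```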
Mel frequency wrapping block
Mel-frequency analysis of speech is based on human perception experiments. It has been shown that human ears are more sensitive to, and have higher resolution at, low frequencies than at high frequencies; hence the filter bank is designed to emphasize low frequencies over high frequencies. The voice signal also does not follow the linear frequency scale used in the FFT, so a perceptual scale of equidistant pitches, the mel scale, is used for feature extraction. Mel-scale frequency is proportional to the logarithm of the linear frequency, reflecting human perception; the logarithm is used because our ears respond approximately in decibels. Figure 7 shows frequencies on the mel scale plotted against frequencies on the linear scale. Equation (5) converts linear-scale frequency into mel-scale frequency [3]:

mel(f) = 2595 · log10(1 + f/700)  (5)

Figure 7: Plot of mel frequencies versus linear frequencies

Triangular band-pass filters (or Gaussian filters) are used to extract the spectral envelope, which is constituted by the dominant frequency components of the speech signal. Mel-frequency filters are thus triangular or Gaussian band-pass filters, non-uniformly spaced on the linear frequency axis and uniformly spaced on the mel frequency axis, with more filters in the low-frequency region and fewer in the high-frequency region. The magnitude response of each filter is unity at its centre and decreases linearly to zero at the centre frequencies of the two adjacent filters. We use a mel filter bank with Q = 20 overlapping triangular or Gaussian filters; it is observed that as the filter index m increases, the spacing between the centres of adjacent filters increases on the linear scale while remaining constant on the mel scale. The FFT spectrum is passed through the mel filters to obtain the mel spectrum: the filter bank is applied to every 256-sample frame, and in the frequency domain the filtering is obtained by multiplying the FFT of the signal by the transfer function of the filter, element by element.
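The filter bank itself might be constructed as below: Q = 20 triangular filters, uniformly spaced on the mel axis via equation (5). The 8 kHz sampling rate is an assumption, as the paper does not state one.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)   # equation (5)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, n_fft=256, fs=8000):
    # Q + 2 points uniform in mel give the left edge, centre and
    # right edge of each triangular filter
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        # Unity at the centre, falling linearly to zero at the
        # centres of the two adjacent filters
        fbank[m - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[m - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    return fbank
```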

Logarithm of filter energies
Human hearing smooths the spectrum and operates approximately on a logarithmic scale. Equation (6) computes the log-energy, i.e. the logarithm of the sum of the filtered components for each filter:

S(m) = log( Σ_k |X(k)| H_m(k) ), m = 1, 2, ..., Q  (6)

where H_m(k) is the response of the m-th filter. Each bin per frame per filter holds the log-energy obtained by taking the logarithm of the weighted sum of the spectral magnitudes in that filter-bank channel. With Q = 20 filters, we obtain 20 numeric values for each frame at the output of this stage, stored in a matrix whose number of rows equals the number of frames and whose number of columns equals 20, the number of filters in the filter bank.

DCT block
The discrete cosine transform (DCT) converts the log mel power spectrum back into the time (cepstral) domain. The DCT gathers most of the information of the signal into its lower-order coefficients, allowing a significant reduction in computational cost. Equation (7) represents the discrete cosine transform:

c(l) = Σ_{m=1}^{Q} S(m) cos( πl(m − 1/2)/Q )  (7)

where c(l) is the l-th Mel Frequency Cepstral Coefficient. The value of l typically ranges between 8 and 19; we choose l = 19, so we obtain 19 coefficients for each frame. At the output of this stage we get a matrix whose number of rows equals the number of frames and whose number of columns equals l = 19. Thus, cepstral analysis is performed on the mel spectrum to obtain the Mel Frequency Cepstrum Coefficients (MFCC).

Figure 8: Cepstrum of speech sample
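Equations (6) and (7) together reduce each frame to 19 cepstral coefficients. A sketch combining both stages follows; the helper name mfcc_from_spectrum is ours, and keeping c(1)...c(19) is our reading of the coefficient range:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_spectrum(magnitudes, fbank, n_ceps=19):
    # magnitudes: (num_frames, 256) from the FFT block; keep the
    # one-sided spectrum to match the filter-bank width
    half = magnitudes[:, : fbank.shape[1]]
    # Log of the weighted sum of spectral magnitudes per filter
    # (equation (6)); the small offset guards against log(0)
    log_energies = np.log(half @ fbank.T + 1e-10)
    # DCT, retaining the lower-order coefficients c(1)..c(19) (equation (7))
    return dct(log_energies, type=2, axis=1, norm="ortho")[:, 1 : n_ceps + 1]
```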
Feature matching module
The use of Gaussian mixture models for modeling speaker identity is motivated by the interpretation that the Gaussian components represent general speaker-dependent spectral shapes, and by the capability of Gaussian mixtures to model arbitrary densities [5][6]. A GMM is a weighted sum of M component densities, given by

p(x_t | λ) = Σ_{i=1}^{M} p_i b_i(x_t)  (8)

where x_t is a D-dimensional speech feature vector drawn from the sequence X = {x_1, ..., x_T} of feature vectors extracted from the audio data, b_i(x), i = 1, ..., M, are the component densities and p_i, i = 1, ..., M, are the mixture weights. Each component density is a D-variate Gaussian function of the form

b_i(x) = [1 / ((2π)^{D/2} |Σ_i|^{1/2})] exp( −(1/2)(x − μ_i)ᵀ Σ_i^{−1} (x − μ_i) )  (9)

with mean vector μ_i and covariance matrix Σ_i. The mixture weights satisfy

Σ_{i=1}^{M} p_i = 1  (10)

For speaker identification, each speaker is represented by a GMM λ, completely parameterized by its mixture weights, means and covariance matrices, collectively represented as

λ = { p_i, μ_i, Σ_i }, i = 1, ..., M  (11)

For computational ease and improved performance, the covariance matrices are constrained to be diagonal.

There are two principal motivations for using GMMs to model speaker identity. The first is that the components of such a multi-modal density may represent an underlying set of acoustic classes. It is reasonable to assume that the acoustic space corresponding to a speaker's voice can be characterized by a set of acoustic classes representing broad phonetic events such as vowels, nasals or fricatives. These acoustic classes reflect general speaker-dependent vocal tract configurations that are useful for characterizing speaker identity, and the spectral shape of the i-th acoustic class can in turn be represented by the mean μ_i and covariance matrix Σ_i. Because all the training and testing speech is unlabelled, the acoustic classes are hidden, in that the class of an observation is unknown. The second motivation is that a linear combination of Gaussian basis functions is capable of modelling a large class of sample distributions: a GMM can form smooth approximations to arbitrarily shaped densities.

Several techniques can be used to estimate the parameters λ of a GMM describing the distribution of the training feature vectors; by far the most popular and well-established is Maximum Likelihood (ML) estimation. These GMMs are trained separately on each speaker's enrolment data using the Expectation-Maximization (EM) algorithm, as sketched below.
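As an illustration of equations (8) through (17), the sketch below enrols one GMM per speaker and identifies a test utterance with scikit-learn's GaussianMixture, whose fit method performs ML estimation via EM under the diagonal-covariance constraint described above. The model order M = 32 is an assumption; the paper does not state one.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(enrolment, n_components=32):
    # enrolment: {speaker_id: MFCC matrix of shape (num_frames, 19)}
    models = {}
    for speaker, feats in enrolment.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        models[speaker] = gmm.fit(feats)   # EM updates, cf. equations (12)-(15)
    return models

def identify(models, test_feats):
    # Sum of per-frame log-likelihoods = log of the product over frames,
    # so the argmax realizes equation (17)
    scores = {s: gmm.score_samples(test_feats).sum() for s, gmm in models.items()}
    return max(scores, key=scores.get)
```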

The EM update equations, which guarantee a monotonic increase in the model's likelihood value, are:

Mixture weights:  p̂_i = (1/T) Σ_{t=1}^{T} p(i | x_t, λ)  (12)

Means:  μ̂_i = ( Σ_{t=1}^{T} p(i | x_t, λ) x_t ) / ( Σ_{t=1}^{T} p(i | x_t, λ) )  (13)

Variances:  σ̂_i² = ( Σ_{t=1}^{T} p(i | x_t, λ) x_t² ) / ( Σ_{t=1}^{T} p(i | x_t, λ) ) − μ̂_i²  (14)

where σ_i², x_t and μ_i refer to corresponding elements of the (diagonal) variance vector, the feature vector and the mean vector, respectively. The a posteriori probability for acoustic class i is given by

p(i | x_t, λ) = p_i b_i(x_t) / Σ_{k=1}^{M} p_k b_k(x_t)  (15)

In speaker identification, given a group of S speakers {1, 2, ..., S} represented by GMMs λ_1, λ_2, ..., λ_S, the objective is to find the speaker model with the maximum a posteriori probability for a given test sequence X:

Ŝ = arg max_{1≤k≤S} p(λ_k | X)  (16)

Assuming that all speakers are equally likely and that the observations are independent, and since p(X) is the same for all speakers, this simplifies to

Ŝ = arg max_{1≤k≤S} Σ_{t=1}^{T} log p(x_t | λ_k)  (17)

Each GMM outputs a probability for each frame; these are multiplied across all frames (equivalently, their logarithms are summed), and the classifier makes its decision based on these product posterior probabilities.

CONCLUSION

We have reviewed the performance of speaker identification systems built from various feature extraction and matching techniques. The MFCC algorithm is used in our system because it has the lowest false acceptance ratio, and the GMM model is used for feature matching in order to improve system performance and achieve high accuracy. The speakers were trained and tested using MFCC features and GMM models, which give a better identification rate for speaker features. In future work, this technique will integrate pitch information with MFCC, and the performance of the speaker identification system will be analysed in the presence of noise.

REFERENCES

[1] Douglas A. Reynolds and Richard C. Rose, "Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models", IEEE Transactions on Speech and Audio Processing, Vol. 3, No. 1, January 1995.
[2] Joseph P. Campbell, Jr., "Speaker Recognition: A Tutorial", Proceedings of the IEEE, Vol. 85, No. 9, September 1997.
[3] Ashutosh Parab, Joyeb Mulla, Pankaj Bhadoria and Vikram Bangar, "Speaker Recognition Using MFCC and GMM", Journal of Research in Electrical and Electronics Engineering (ISTP-JREEE), Volume 3, Issue 2, March 2014.
[4] Seiichi Nakagawa, Longbiao Wang and Shinji Ohtsuka, "Speaker Identification and Verification by Combining MFCC and Phase Information", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 20, No. 4, May 2012.
[5] D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models", Speech Communication, Vol. 17, No. 1-2, pp. 91-108, 1995.
[6] D. A. Reynolds, T. F. Quatieri and R. Dunn, "Speaker verification using adapted Gaussian mixture models", Digital Signal Processing, Vol. 10, No. 1-3, pp. 19-41, 2000.
[7] Md. Rashidul Hasan, Mustafa Jamil, Md. Golam Rabbani and Md. Saifur Rahman, "Speaker Identification Using Mel Frequency Cepstral Coefficients", 3rd International Conference on Electrical & Computer Engineering (ICECE), Dhaka, Bangladesh, 28-30 December 2004.
[8] Maxim Sidorov, Alexander Schmitt, Sergey Zablotskiy and Wolfgang Minker, "Survey of Automated Speaker Identification Methods", 9th International Conference on Intelligent Environments, 2013.
[9] M. G. Sumithra and A. K. Devika, "A Study on Feature Extraction Techniques for Text Independent Speaker Identification", 2012.
[10] Alfredo Maesa and Fabio Garzia, "Text Independent Automatic Speaker Recognition Using Mel Frequency Cepstrum Coefficient and Gaussian Mixture Model", IEEE Proceedings, Volume 3, No. 4, October 2012.
[11] Nisha V. S. and M. Jayasheela, "Survey on Feature Extraction and Matching Techniques for Speaker Recognition Systems", International Journal of Advanced Research in Electronics and Communication Engineering (IJARECE), Volume 2, Issue 3, March 2013.