A comparative study on feature extraction techniques in speech recognition


Smita B. Magre
Department of C.S. and I.T., Dr. Babasaheb Ambedkar Marathwada University, Aurangabad

Ratnadeep R. Deshmukh
Department of C.S. and I.T., Dr. Babasaheb Ambedkar Marathwada University, Aurangabad
ratnadeep_deshmukh@yahoo.co.in

Pukhraj P. Shrishrimal
Department of C.S. and I.T., Dr. Babasaheb Ambedkar Marathwada University, Aurangabad
pukhraj.shrishrimal@gmail.com

ABSTRACT
Automatic speech recognition (ASR) has made great strides with the development of digital signal processing hardware and software. Despite these advances, machines cannot match the performance of their human counterparts in accuracy and speed, especially in the case of speaker-independent speech recognition. A good portion of speech recognition research is therefore focused on the speaker-independent speech recognition problem. The reasons are its wide range of applications and the limitations of the available speech recognition techniques. This paper provides an overview of the techniques developed in each stage of speech recognition and helps in choosing a technique by listing their relative advantages and disadvantages. A comparative study of the different techniques is done stage by stage. The paper concludes with a decision on the future direction for developing techniques in a human-computer interface system using the Marathi language.

General Terms
Modeling technique, speech processing, signal processing

Keywords
Speech Recognition; Feature Extraction; MFCC; LPC; PCA; LDA; Wavelet; DTW.

1. INTRODUCTION
Speech recognition, also known as automatic speech recognition or computer speech recognition, means the understanding of voice by a computer and the performing of any required task, or the ability to match a voice against a provided or acquired vocabulary. The task is to get a computer to understand spoken language. By "understand" we mean to react appropriately and to convert the input speech into another medium, e.g. text. Speech recognition is therefore sometimes referred to as speech-to-text (STT). Speech processing is one of the exciting areas of signal processing. A speech recognition system consists of a microphone for the person to speak into; speech recognition software; a computer to take and interpret the speech; a good-quality sound card for input and/or output; and proper, clear pronunciation.

1.1 Topology of Speech Recognition System
Speaker-dependent systems: these systems require a user to train the system according to his or her voice.
Speaker-independent systems: these systems do not require a user to train the system, i.e. they are developed to operate for any speaker.
Isolated word recognizers: accept one word at a time.
Continuous speech systems: allow the user to speak naturally and continuously.
Connected word systems: allow the speaker to speak each word slowly and distinctly with a short pause, i.e. planned speech.
Spontaneous recognition systems: allow the user to speak spontaneously [3].

1.2 Overview of Speech Recognition
A speech recognition system consists of the following blocks: feature extraction, acoustic modeling, pronunciation modeling, and decoder. The process of speech recognition begins with a speaker creating an utterance, which consists of sound waves. These sound waves are captured by a microphone and converted into electrical signals. The electrical signals are then converted into digital form to make them understandable by the speech system. The speech signal is then converted into a discrete sequence of feature vectors, which is assumed to contain only the information about the given utterance that is important for its correct recognition. An important property of feature extraction is the suppression of information irrelevant for correct classification, such as information about the speaker (e.g. fundamental frequency) and about the transmission channel (e.g. the characteristics of a microphone). Finally, the recognition component finds the best match in the knowledge base for the incoming feature vectors. Sometimes, however, the information conveyed by these feature vectors may be correlated and less discriminative, which may slow down further processing. Feature extraction methods like Mel-frequency cepstral coefficients (MFCC) provide a way to get uncorrelated vectors by means of the discrete cosine transform (DCT).
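To make the step from a digitized signal to a discrete sequence of feature vectors more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes NumPy, borrows the frame length of 400 samples and frame step of 160 samples quoted in Section 2.8 for a 16 kHz sampling rate, and uses log energy and zero-crossing rate purely as placeholder features. The function names `frame_signal` and `toy_features` and the synthetic test tone are hypothetical.

```python
import numpy as np

def frame_signal(signal, frame_len=400, frame_step=160):
    """Split a 1-D signal into overlapping frames (one row per frame)."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // frame_step
    # Index matrix: row l holds the sample indices of frame l.
    idx = np.arange(frame_len)[None, :] + frame_step * np.arange(n_frames)[:, None]
    return signal[idx]

def toy_features(frames):
    """Placeholder feature vector per frame: [log energy, zero-crossing rate]."""
    energy = np.sum(frames ** 2, axis=1)
    log_energy = np.log(energy + 1e-10)  # small offset avoids log(0)
    signs = np.sign(frames)
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)
    return np.column_stack([log_energy, zcr])

# One second of a synthetic 440 Hz tone at 16 kHz (a stand-in for recorded speech).
fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 440 * t)

frames = frame_signal(speech)     # one row per frame
features = toy_features(frames)   # one feature vector per frame
```

Each row of `features` plays the role of one feature vector in the sequence handed to the recognizer; a real front end would replace `toy_features` with MFCC, LPC or another of the methods reviewed below.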

Figure 1: Outline of Speech Recognition System [4].

2. FEATURE EXTRACTION
First, speech samples of each word of the vocabulary are recorded by different speakers. After the speech samples are collected, they are converted from analog to digital form by sampling at a frequency of 16 kHz. Sampling means recording the speech signal at regular intervals. The collected data is then quantized, if required, to eliminate noise in the speech samples. The collected speech samples are then passed through the feature extraction, feature training and feature testing stages. Feature extraction transforms the incoming sound into an internal representation such that it is possible to reconstruct the original signal from it. There are various techniques to extract features, such as MFCC, PLP, RASTA, LPCC, PCA, LDA, wavelet and DTW, but the most widely used is MFCC.

Figure 2: Feature Extraction Diagram [1].

2.1 MFCC: Mel-Frequency Cepstral Coefficients
Mel-frequency cepstral coefficients (MFCC) are one of the most commonly used feature extraction methods in speech recognition. The technique is called FFT-based, which means that the feature vectors are extracted from the frequency spectra of the windowed speech frames. The mel frequency filter bank is a series of triangular bandpass filters. The filter bank is based on a non-linear frequency scale called the mel scale. A 1000 Hz tone is defined as having a pitch of 1000 mel. Below 1000 Hz, the mel scale is approximately linear in the linear frequency scale. Above the 1000 Hz reference point, the relationship between the mel scale and the linear frequency scale is non-linear and approximately logarithmic. The following equation describes the relationship between the mel scale and the linear frequency scale:

$$f_{\mathrm{mel}} = 1127 \,\ln\!\left(1 + \frac{f}{700}\right)$$

The mel frequency filter bank consists of triangular bandpass filters arranged so that the lower boundary of one filter is located at the center frequency of the previous filter and the upper boundary is located at the center frequency of the next filter. A fixed frequency resolution on the mel scale is computed, corresponding to a logarithmic scaling of the repetition frequency, using

$$\Delta f_{\mathrm{mel}} = \frac{f_{H}^{\mathrm{mel}} - f_{L}^{\mathrm{mel}}}{M + 1}$$

where $f_{H}^{\mathrm{mel}}$ is the highest frequency of the filter bank on the mel scale, computed from $f_H$ using the equation given above, $f_{L}^{\mathrm{mel}}$ is the lowest frequency on the mel scale, with corresponding $f_L$, and $M$ is the number of filters in the bank. The values considered for the parameters in the present study are $f_H$ = 16 kHz and $f_L$ = 0 Hz. The center frequencies on the mel scale are given by

$$f_{c}^{\mathrm{mel}}(m) = f_{L}^{\mathrm{mel}} + m\,\Delta f_{\mathrm{mel}}, \quad 1 \le m \le M$$

The center frequency in hertz is given by

$$f_c(m) = 700\left(e^{f_{c}^{\mathrm{mel}}(m)/1127} - 1\right)$$

This expression is used to place the triangular filters, which gives the mel filter bank. Finally, the MFCCs are obtained by computing the discrete cosine transform of the (log) filter bank outputs $X(m)$ using

$$c(l) = \sum_{m=1}^{M} X(m)\cos\!\left(l\,\frac{\pi}{M}\left(m - \tfrac{1}{2}\right)\right), \quad l = 1, 2, \dots, M$$

where $c(l)$ is the $l$th MFCC. The time derivative is approximated by a linear regression coefficient over a finite window of $2K + 1$ frames, which is defined as

$$\Delta c_t(l) = G \sum_{k=-K}^{K} k\, c_{t+k}(l), \quad 1 \le l \le M$$

where $c_t(l)$ is the $l$th cepstral coefficient at time $t$ and $G$ is a constant used to make the variances of the derivative terms equal to those of the original cepstral coefficients.

Figure 3: Steps involved in MFCC Feature Extraction

Advantage
As the frequency bands are positioned logarithmically in MFCC, it approximates the response of the human auditory system more closely than linearly spaced frequency bands do.

Disadvantage
MFCC values are not very robust in the presence of additive noise, so it is common to normalize their values in speech recognition systems to lessen the influence of noise.
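The MFCC steps described above (mel warping, triangular filter bank, DCT, and delta regression) can be sketched in Python with NumPy as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in the paper. The function names are hypothetical, the constant 1127 is the value that maps 1000 Hz to 1000 mel, the upper band edge defaults to the Nyquist frequency `fs / 2`, the filter bank energies are log-compressed before the DCT, and the default `G` in `delta` is one common normalization choice, not a value taken from the text.

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale: 1000 Hz maps to (approximately) 1000 mel."""
    return 1127.0 * np.log(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (np.exp(np.asarray(mel, dtype=float) / 1127.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs, f_low=0.0, f_high=None):
    """Triangular filters whose edges sit on the neighbouring centre frequencies."""
    f_high = fs / 2.0 if f_high is None else f_high  # assumption: Nyquist by default
    # M + 2 points equally spaced on the mel scale: lower edge, M centres, upper edge,
    # so the spacing equals (f_high_mel - f_low_mel) / (M + 1).
    mel_points = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
    bank = np.zeros((n_filters, len(fft_freqs)))
    for m in range(1, n_filters + 1):
        left, centre, right = hz_points[m - 1], hz_points[m], hz_points[m + 1]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        bank[m - 1] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank, hz_points[1:-1]

def mfcc_from_frame(frame, fs, n_filters=20, n_ceps=12):
    """MFCCs of one frame: window, power spectrum, mel filter bank, log, DCT."""
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    bank, _ = mel_filterbank(n_filters, len(frame), fs)
    log_energies = np.log(bank @ power + 1e-10)      # X(m), m = 1..M
    l = np.arange(1, n_ceps + 1)[:, None]
    m = np.arange(1, n_filters + 1)[None, :]
    dct_matrix = np.cos(l * np.pi / n_filters * (m - 0.5))
    return dct_matrix @ log_energies                 # c(l), l = 1..n_ceps

def delta(ceps, K=2, G=None):
    """Regression-based time derivative over 2K+1 frames; ceps has shape (T, L)."""
    ceps = np.asarray(ceps, dtype=float)
    if G is None:
        G = 1.0 / (2.0 * sum(k * k for k in range(1, K + 1)))  # assumed normalization
    padded = np.pad(ceps, ((K, K), (0, 0)), mode="edge")
    out = np.zeros_like(ceps)
    for k in range(-K, K + 1):
        out += k * padded[K + k : K + k + len(ceps)]
    return G * out

# Example on a synthetic 25 ms frame (400 samples at 16 kHz).
fs = 16000
frame = np.sin(2 * np.pi * 440 * np.arange(400) / fs)
coeffs = mfcc_from_frame(frame, fs)
bank, centres = mel_filterbank(20, 400, fs)
```

With the default `G`, `delta` returns a slope of 1 per frame for a coefficient that grows by 1 each frame, which is a quick sanity check on the regression formula.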

2.1.3 Applications
MFCCs are commonly used as features in speech recognition systems, such as systems that automatically recognize numbers spoken into a telephone. They are also common in speaker recognition, which is the task of recognizing people from their voices. MFCCs are also increasingly finding uses in music information retrieval applications such as genre classification and audio similarity measures.

2.2 Linear Prediction Coefficient
LPC (linear predictive coding) analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the remaining signal is called the residue. In an LPC system, each sample of the signal is expressed as a linear combination of the previous samples. This equation is called a linear predictor, hence the name linear predictive coding [9]. The coefficients of the difference equation characterize the formants. The basic steps of the LPC processor are the following [5]:

Preemphasis: The digitized speech signal, $s(n)$, is put through a low-order digital system to spectrally flatten the signal and to make it less susceptible to finite precision effects later in the signal processing. The output of the preemphasizer network, $\tilde{s}(n)$, is related to the input to the network, $s(n)$, by the difference equation

$$\tilde{s}(n) = s(n) - \tilde{a}\,s(n-1)$$

where $\tilde{a}$ is the preemphasis coefficient.

Frame Blocking: The output of the preemphasis step, $\tilde{s}(n)$, is blocked into frames of $N$ samples, with adjacent frames separated by $M$ samples. If $x_l(n)$ is the $l$th frame of speech, and there are $L$ frames within the entire speech signal, then [5]

$$x_l(n) = \tilde{s}(Ml + n), \quad 0 \le n \le N-1, \; 0 \le l \le L-1$$

Windowing: After frame blocking, the next step is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. If we define the window as $w(n)$, $0 \le n \le N-1$, then the result of windowing is the signal

$$\tilde{x}_l(n) = x_l(n)\,w(n), \quad 0 \le n \le N-1$$

A typical window is the Hamming window, which has the form

$$w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$

Autocorrelation Analysis: The next step is to autocorrelate each frame of the windowed signal in order to give

$$r_l(m) = \sum_{n=0}^{N-1-m} \tilde{x}_l(n)\,\tilde{x}_l(n+m), \quad m = 0, 1, \dots, p$$

where the highest autocorrelation index, $p$, is the order of the LPC analysis.

LPC Analysis: The next processing step is the LPC analysis, which converts each frame of $p + 1$ autocorrelations into an LPC parameter set by using Durbin's method. This can formally be given as the following algorithm:

$$E^{(0)} = r(0)$$

$$k_i = \frac{r(i) - \sum_{j=1}^{i-1} \alpha_j^{(i-1)}\, r(|i-j|)}{E^{(i-1)}}, \quad 1 \le i \le p$$

$$\alpha_i^{(i)} = k_i$$

$$\alpha_j^{(i)} = \alpha_j^{(i-1)} - k_i\,\alpha_{i-j}^{(i-1)}, \quad 1 \le j \le i-1$$

$$E^{(i)} = (1 - k_i^2)\,E^{(i-1)}$$

By solving the above equations recursively for $i = 1, 2, \dots, p$, the LPC coefficient $a_m$ is given as

$$a_m = \alpha_m^{(p)}, \quad 1 \le m \le p$$

and the LPC-derived cepstral coefficients for $m > p$ follow from

$$c_m = \sum_{k=m-p}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \quad m > p$$

Figure 4: Block Diagram of LPC

Advantages
One of the most powerful signal analysis techniques is the method of linear prediction. LPC minimizes the sum of the squared differences between the original speech signal and the estimated speech signal over a finite duration.

2.3 Perceptually Based Linear Predictive Analysis (PLP)
PLP analysis models a perceptually motivated auditory spectrum by a low-order all-pole function, using the autocorrelation LP technique. The basic concept of the PLP method is shown in the block diagram of Fig. 5.

Figure 5: Block Diagram of PLP Speech Analysis Method [7].

It involves two major steps: obtaining the auditory spectrum, and approximating the auditory spectrum by an all-pole model. The auditory spectrum is derived from the speech waveform by critical-band filtering, equal-loudness curve preemphasis, and intensity-loudness root compression. Eighteen critical-band filter outputs, with their center frequencies equally spaced in the Bark domain, are defined using the frequency warping

$$\Omega(\omega) = 6\ln\!\left(\frac{\omega}{1200\pi} + \left(\left(\frac{\omega}{1200\pi}\right)^{2} + 1\right)^{0.5}\right)$$

with the center frequency of the $k$th critical band at $\Omega_k = 0.994k$. The output of the $k$th filter is

$$Q_k = \left[\int_{0}^{\pi} E(\omega)\,C_k(\omega)\,P(\omega)\,d\omega\right]^{1/3}$$

where $P(\omega)$ is the short-term power spectrum, $E(\omega)$ is the equal-loudness curve, and the critical-band shape is

$$C_k(\omega) = \begin{cases} 10^{(\Omega - \Omega_k + 0.5)}, & \Omega \le \Omega_k - 0.5 \\ 1, & \Omega_k - 0.5 < \Omega < \Omega_k + 0.5 \\ 10^{-2.5(\Omega - \Omega_k - 0.5)}, & \Omega_k + 0.5 \le \Omega \end{cases}$$

The output thus obtained is linearly interpolated to give the interpolated auditory spectrum. The interpolated auditory spectrum is approximated by a fifth-order all-pole model spectrum. The IDFT of the interpolated auditory spectrum provides the first six terms of the autocorrelation function. These are used in the solution of the Yule-Walker equations [7] to obtain the five autoregressive coefficients of the all-pole filter. PLP analysis provides results similar to those of LPC analysis, but the order of the PLP model is half that of the LP model. This allows computational and storage savings for ASR. It also provides better performance for cross-speaker ASR.

Advantages
PLP coefficients are often used because they approximate well the high-energy regions of the speech spectrum while simultaneously smoothing out the fine harmonic structure, which is often characteristic of the individual but not of the underlying linguistic unit. LPC, however, approximates the speech spectrum equally well at all frequencies, and this representation is contrary to known principles of human hearing. The spectral resolution of human hearing is roughly linear up to 800 or 1000 Hz, but it decreases with increasing frequency above this linear range. PLP incorporates critical-band spectral resolution into its spectrum estimate by remapping the frequency axis to the Bark scale and integrating the energy in the critical bands to produce a critical-band spectrum approximation.

2.4 Dynamic Time Warping (DTW)
In time series analysis, dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences which may vary in time or speed. For instance, similarities in walking patterns could be detected using DTW, even if one person was walking faster than the other, or if there were accelerations and decelerations during the course of an observation. DTW has been applied to temporal sequences of video, audio and graphics data; indeed, any data which can be turned into a linear sequence can be analyzed with DTW.

Figure 6: Block Diagram of DTW

Advantages
Increased speed of recognition. Reduced storage space for the reference template. Constraints can be imposed when finding the optimal path. Increased recognition rate. A threshold can be used to stop the process if the error is too great.

Disadvantage
Finding the best reference template for a certain word: choosing the appropriate reference template is a difficult task.

2.5 Wavelet
Speech is a non-stationary signal. The Fourier transform (FT) is not suitable for the analysis of such a non-stationary signal because it provides only the frequency information of the signal and does not provide information about the time at which each frequency is present. The windowed short-time Fourier transform (STFT) provides temporal information about the frequency content of the signal. A drawback of the STFT is its fixed time resolution due to the fixed window length. The wavelet transform (WT), with its flexible time-frequency window, is an appropriate tool for the analysis of non-stationary signals like speech, which have both short high-frequency bursts and long quasi-stationary components.
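As a rough illustration of how a wavelet decomposition separates a short high-frequency burst from a slowly varying component, and of the sub-band energies mentioned in Section 2.5.1, here is a minimal Python sketch. It assumes NumPy and the orthonormal Haar wavelet, chosen only for simplicity since the paper does not name a specific wavelet. The function names and the synthetic frame are hypothetical.

```python
import numpy as np

def haar_dwt_step(x):
    """One level of the orthonormal Haar DWT: returns (approximation, detail)."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:
        raise ValueError("length must be even")
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-pass half-band
    detail = (even - odd) / np.sqrt(2.0)   # high-pass half-band
    return approx, detail

def subband_energies(frame, levels=3):
    """Energies of the detail bands (finest first), then the final approximation band."""
    energies = []
    approx = np.asarray(frame, dtype=float)
    for _ in range(levels):
        approx, detail = haar_dwt_step(approx)
        energies.append(np.sum(detail ** 2))
    energies.append(np.sum(approx ** 2))
    return np.array(energies)

# A 256-sample frame: slow oscillation plus a short high-frequency burst at the end.
n = np.arange(256)
frame = np.sin(2 * np.pi * n / 64.0)
frame[240:] += np.where(n[240:] % 2 == 0, 1.0, -1.0)

energies = subband_energies(frame, levels=3)
```

Because the Haar transform used here is orthonormal, the sub-band energies add up to the energy of the frame, and the burst shows up almost entirely in the finest detail band `energies[0]`. A practical system would more likely rely on a dedicated wavelet library and a smoother wavelet family.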


2.5.1 Advantages
Wavelet transforms have been used for speech feature extraction, in which the energies of wavelet-decomposed sub-bands are used in place of mel-filtered sub-band energies, because of their better energy compaction property. Wavelet transform-based features have been reported to give better recognition accuracy than LPC and MFCC. The WT has a better capability to model the details of unvoiced sound portions and offers better time resolution than the Fourier transform. The main advantage of wavelets is that they offer simultaneous localization in the time and frequency domains. A second advantage is that, using the fast wavelet transform, they are computationally very fast. Wavelets also have the great advantage of being able to separate the fine details in a signal: very small wavelets can be used to isolate very fine details, while very large wavelets can identify coarse details. A wavelet transform can be used to decompose a signal into component wavelets.

Figure 7: Block Diagram of Wavelet.

Disadvantage
The cost of computing the DWT may be higher than that of the DCT. It requires a longer compression time.

Application
Signal processing. Data compression. Speech recognition. Computer graphics and multi-fractal analysis.

2.6 Relative Spectra Filtering of Log Domain Coefficients (RASTA)
To compensate for linear channel distortions, the analysis library provides the ability to perform RASTA filtering. The RASTA filter can be used either in the log spectral or in the cepstral domain. In effect, the RASTA filter band-passes each feature coefficient. Linear channel distortions appear as an additive constant in both the log spectral and the cepstral domains. The high-pass portion of the equivalent band-pass filter alleviates the effect of convolutional noise introduced in the channel. The low-pass filtering helps in smoothing frame-to-frame spectral changes.

Figure 8: Block Diagram of RASTA.

2.7 Principal Component Analysis (PCA)
PCA, or principal component analysis, is a statistical analytical tool used to explore, sort and group data. PCA takes a large number of correlated (interrelated) variables and transforms this data into a smaller number of uncorrelated variables (principal components) while retaining the maximal amount of variation, thus making it easier to operate on the data and make predictions. PCA is a way of identifying patterns in data and expressing the data in such a way as to highlight their similarities and differences. Since patterns can be hard to find in data of high dimension, where the luxury of graphical representation is not available, PCA is a powerful tool for analyzing data.

Figure 9: Block Diagram of PCA

Advantage
PCA provides a high degree of dimensionality reduction.
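The idea of transforming correlated variables into a smaller number of uncorrelated principal components can be sketched as follows. This is a minimal Python example assuming NumPy and an eigendecomposition of the sample covariance matrix. The function name `pca` and the synthetic three-dimensional data, which stand in for real feature vectors, are hypothetical.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the n_components directions of largest variance."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]   # keep the largest ones
    components = eigvecs[:, order]
    return centered @ components, components, eigvals[order]

# Synthetic data: the second variable is almost a scaled copy of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([base[:, 0], 2.0 * base[:, 0] + 0.1 * base[:, 1], base[:, 1]])

projected, components, variances = pca(X, n_components=2)
```

The covariance matrix of `projected` is diagonal, which is the sense in which the principal components are uncorrelated. The result depends on how the input variables are scaled, so features are often standardized before PCA is applied.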

2.7.2 Disadvantage
The results of PCA depend on the scaling of the variables. The applicability of PCA is limited by certain assumptions made in its derivation. It lacks a probability density model and an associated likelihood measure.

2.8 Combined LPC and MFCC
The algorithms for determining the MFCC and LPC coefficients that express the basic speech features were developed by the author of the combined method. Combined use of the MFCC and LPC cepstra in a speech recognition system is suggested by the author to improve the reliability of the system. The recognition system is divided into MFCC-based and LPC-based recognition subsystems. The training and recognition processes are carried out in both subsystems separately, and the recognition system accepts a decision only when both subsystems give the same result. The author claimed that this results in a decrease of the error rate during recognition.

Steps of Combined Use of Cepstral of MFCC and LPC
1. The speech signal is passed through a first-order FIR high-pass filter.
2. Voice activity detection (VAD): locating the endpoints of an utterance in the speech signal with the help of commonly used measures such as the short-term energy estimate Es, the short-term power estimate Ps, and the short-term zero-crossing rate Zs.
3. The mean and variance of these measures are calculated for the background noise, assuming that the first 5 blocks are background noise.
4. Framing.
5. Windowing.
6. Calculation of MFCC features.
7. Calculation of LPC features.
8. The speech recognition system consists of two subsystems, one MFCC-based and one LPC-based. These subsystems are trained by neural networks with MFCC and LPC features.

The Recognition Process Stages
1. In the MFCC-based and LPC-based recognition subsystems, the recognition processes are carried out in parallel.
2. The recognition results of the two subsystems are compared, and the speech recognition system confirms a result only if it is confirmed by both subsystems.

Since the MFCC and LPC [2] methods are applied to overlapping frames of the speech signal, the dimension of the feature vector depends on the number of frames. At the same time, the number of frames depends on the length of the speech signal, the sampling frequency, the frame step and the frame length. The author uses a sampling frequency of 16 kHz, a frame step of 160 samples and a frame length of 400 samples. Another problem of speech recognition is that the same speech has different durations: even when the same person repeats the same utterance, its duration differs. The author suggested that, to partially remove this problem, the time durations be brought to the same scale. When the dimension of the scale defined for the speech signal increases, the dimension of the feature vector corresponding to the signal also increases.

3. COMPARATIVE ANALYSIS
Various methods for feature extraction in speech recognition are broadly shown in Table 1.

Table 1: Feature Extraction Methods [1]

| Method | Property | Comments |
| Principal Component Analysis (PCA) | Non-linear feature extraction; linear map; fast; eigenvector-based | Traditional, eigenvector-based method, also known as Karhunen-Loève expansion; good for Gaussian data |
| Linear Discriminant Analysis (LDA) | Non-linear feature extraction; supervised linear map; fast; eigenvector-based | Better than PCA for classification |
| Linear Predictive Coding | Static feature extraction; 10 to 16 lower-order coefficients | Used for feature extraction at lower order |
| Cepstral Analysis | Static feature extraction; power spectrum | Used to represent the spectral envelope |
| Mel-frequency cepstrum (MFCCs) | Power spectrum is computed by performing Fourier analysis | Used to find the features |
| Independent Component Analysis (ICA) | Non-linear feature extraction; linear map; iterative; non-Gaussian | Blind source separation; used for de-mixing non-Gaussian distributed sources (features) |
| Mel-frequency scale analysis | Static feature extraction; spectral analysis | Spectral analysis is done with a fixed resolution along a subjective frequency scale, i.e. the mel-frequency scale |
| Kernel-based feature extraction method | Non-linear transformations | Dimensionality reduction leads to better classification; used to remove noisy and redundant features and to improve classification error |
| Wavelet | Better time resolution than the Fourier transform | Replaces the fixed bandwidth of the Fourier transform with one proportional to frequency, which allows better time resolution at high frequencies than the Fourier transform |
| Dynamic feature extraction: (i) LPC, (ii) MFCCs | Acceleration and delta coefficients, i.e. II and III order derivatives of the normal LPC and MFCC coefficients | Used for dynamic or runtime features |
| Spectral subtraction | Robust feature extraction method | Based on the spectrogram |
| RASTA filtering | For noisy speech | Finds features in noisy data |
| Integrated phoneme subspace method | A transformation based on PCA + LDA + ICA | Higher accuracy than the existing methods |

4. CONCLUSION
We have discussed some feature extraction techniques and their advantages and disadvantages. Some new methods have been developed by combining several techniques, and their authors have claimed improvements in performance. There is a need to develop new hybrid methods that will give better performance in the area of robust speech recognition.

5. REFERENCES
[1] Santosh K. Gaikwad, Bharti W. Gawali, Pravin Yannawar, "A Review on Speech Recognition Technique", International Journal of Computer Applications, Volume 10, No. 3, November.
[2] Urmila Shrawankar, "Techniques for Feature Extraction in Speech Recognition System: A Comparative Study", International Journal of Computer Applications in Engineering, Technology and Sciences (IJCAETS), 6 May 2013.
[3] M. A. Anusuya, S. K. Katti, "Speech Recognition by Machine: A Review", International Journal of Computer Science and Information Security (IJCSIS), Vol. 6, No. 3, 2009.
[4] Preeti Saini, Parneet Kaur, "Automatic Speech Recognition: A Review", International Journal of Engineering Trends and Technology, Volume 4.
[5] Leena R. Mehta, S. P. Mahajan, Amol S. Dabhade, "Comparative Study of MFCC and LPC for Marathi Isolated Word Recognition System", International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, Vol. 2, Issue 6, June 2013.
[6] Kayte Charansing Nathoosing, "Isolated Word Recognition for Marathi Language using VQ and HMM", Science Research Reporter 2(2), April 2012.
[7] Manish P. Kesarkar, "Feature Extraction for Speech Recognition", M.Tech. Credit Seminar Report, Electronic Systems Group, EE Dept., IIT Bombay, submitted November 2003.
[8] Hynek Hermansky, "Perceptual Linear Predictive (PLP) Analysis of Speech", Speech Technology Laboratory, Division of Panasonic Technologies, Journal of the Acoustical Society of America, Vol. 87, No. 4, April.
[9] Bharti W. Gawali, Santosh Gaikwad, Pravin Yannawar, Suresh C. Mehrotra, "Marathi Isolated Word Recognition System using MFCC and DTW Features", ACEEE International Journal on Information Technology, Vol. 01, No. 01, Mar.
[10] K. R. Aida-Zade, C. Ardil and S. S. Rustamov, "Investigation of Combined Use of MFCC and LPC Features in Speech Recognition Systems", Proceedings of World Academy of Science, Engineering and Technology, Volume 13, May 2006.
