Isolated Digit Recognition Using MFCC AND DTW


Maruti Limkar a, Rama Rao b & Vidya Sagvekar c
a Terna College of Engineering, Department of Electronics Engineering, Mumbai University, India
b Vidyalankar Institute of Technology, Department of Electronics and Telecommunication Engineering, Mumbai University, India
c K.J. Somaiya Institute of Engineering & Information Technology, Department of Electronics Engineering, Mumbai University, India
E-mail: limkar_maruti@rediffmail.com, b.ramarao@vit.edu.in & vidyarsagvekar@gmail.com

Abstract - This paper proposes an approach to recognizing spoken English words corresponding to the digits zero to nine, uttered in isolation by different male and female speakers. Endpoint detection, framing, normalization, Mel Frequency Cepstral Coefficients (MFCC) and the Dynamic Time Warping (DTW) algorithm are used to process the speech samples and accomplish the recognition. The system is applied to the isolated English digit words 'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight' and 'nine', and the algorithm is tested on recorded speech samples. The results show that the algorithm recognized almost 9.% of the English digits across all recorded words.

Keywords - Speech Recognition, Dynamic Time Warping, Mel Frequency Cepstrum Coefficient.

I. INTRODUCTION

Speech recognition is a popular and active area of research, used to translate words spoken by humans into a form a computer can recognize. It usually involves extracting patterns from digitized speech samples and representing them using an appropriate data model. These patterns are then compared to each other using mathematical operations to determine their contents. In this paper we focus only on recognition of the words corresponding to the English numerals zero to nine. In many speech recognition systems, endpoint detection and pattern recognition are used to detect the presence of speech against a background of noise.
The beginning and end of a word must be detected by the system that processes it. The endpoint detection problem seems easy for a human listener, but it has proved complicated for a machine; over the last three decades a number of endpoint detection methods have been developed to improve the speed and accuracy of speech recognition systems. The main challenges of speech recognition involve modeling the variation of the same word as spoken by different speakers, which depends on speaking style, accent, regional and social dialect, gender, voice patterns, etc. In addition, background noise and the change of signal properties over time also pose major problems. This paper proposes a speech recognition algorithm for the English digits 0 to 9. The system performs speech processing, including digit boundary detection, using zero-crossing and energy techniques. Mel Frequency Cepstral Coefficient (MFCC) vectors are used to provide an estimate of the vocal tract filter, while dynamic time warping (DTW) is used to find the nearest recorded voice. The paper is organized as follows: Section II describes the proposed approach, Section III presents the experiments and the results obtained, and Section IV provides overall conclusions.

International Journal on Advanced Electrical and Electronics Engineering (IJAEEE), ISSN (Print): 2278-8948

II. PROPOSED APPROACH

This paper proposes an approach to automatically recognize the digits 0 to 9 from audio signals generated

by different individuals in a controlled environment. It uses a combination of features based on Voice Activity Detection and Mel Frequency Cepstral Coefficients (MFCC). Dynamic Time Warping (DTW) is used to discriminate the speech data models into their respective classes.

A. Feature Extraction

The general methodology of audio classification involves extracting discriminatory features from the audio data and feeding them to a pattern classifier. Different approaches and various kinds of audio features have been proposed, with varying success rates. The features can be extracted either directly from the time-domain signal or from a transform domain, depending on the choice of signal analysis approach. Among the audio features that have been used successfully for audio classification are the Mel Frequency Cepstral Coefficients (MFCC).

1. Mel Frequency Cepstrum Coefficients

Human perception of the frequency content of sounds in a speech signal does not follow a linear scale. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the mel scale. The mel-frequency spacing is approximately linear below 1000 Hz and logarithmic above 1000 Hz. As a reference point, the pitch of a 1 kHz tone 40 dB above the perceptual hearing threshold is defined as 1000 mels. The following approximate formula is used to compute the mels for a given frequency f in Hz:

mel(f) = 2595 * log10(1 + f/700)

Pre-emphasis applies the first-order filter

x'(n) = x(n) - a*x(n-1)

where a typical value of a is 0.95, boosting the high-frequency content of the signal.

Fig. 1 Block Diagram of Mel Frequency Cepstrum Coefficients.

Fig. 2 Example of pre-emphasis: the original wave s(n) and the pre-emphasized signal s2(n) = s(n) - a*s(n-1).

One filter is used for each desired mel-frequency component. The filter bank has a triangular band-pass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval.
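A minimal sketch of the pre-emphasis filter x'(n) = x(n) - a*x(n-1) described above; the coefficient a = 0.95 is a typical textbook choice, and this is illustrative code rather than the paper's implementation:

```python
import numpy as np

def pre_emphasis(x, a=0.95):
    """First-order FIR high-pass: boosts high-frequency content before MFCC analysis."""
    x = np.asarray(x, dtype=float)
    # y[0] = x[0]; y[n] = x[n] - a*x[n-1] for n >= 1
    return np.append(x[0], x[1:] - a * x[:-1])

# A constant (purely low-frequency) signal is strongly attenuated:
print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))  # first sample kept, rest ~0.05
```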
The mel scale filter bank is a series of triangular band-pass filters designed to simulate the band-pass filtering believed to occur in the auditory system. This corresponds to a series of band-pass filters with constant bandwidth and spacing on the mel frequency scale. Fig. 1 shows the block diagram of the Mel Frequency Cepstrum Coefficient computation.

1.1 Pre-Emphasis

This step passes the signal through a filter that emphasizes the higher frequencies, increasing the energy of the signal at high frequency. A first-order FIR filter is used (Fig. 2).

1.2 Windowing

Traditional spectral evaluation is reliable only for a stationary signal, while the nature of a speech signal changes continuously with time; for voice, reliability can be assumed only over a short interval. Moreover, the audio signal is continuous, processing cannot wait for the last sample, and processing the whole signal at once is computationally expensive; it is important to retain short-term features. Short-time analysis is therefore performed by windowing the signal, normally with a Hamming window:

w(n) = 0.54 - 0.46 cos(2*pi*n / (N-1)) for 0 <= n <= N-1, and w(n) = 0 otherwise.
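The short-time analysis above can be sketched as framing plus a Hamming window. The frame length and step (400 and 160 samples, i.e. 25 ms and 10 ms at 16 kHz) are illustrative assumptions, not values from the paper:

```python
import numpy as np

def frame_and_window(x, frame_len=400, step=160):
    """Split a signal into overlapping frames and apply a Hamming window."""
    n_frames = 1 + (len(x) - frame_len) // step
    # np.hamming(N) implements w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    window = np.hamming(frame_len)
    frames = np.stack([x[i * step : i * step + frame_len] for i in range(n_frames)])
    return frames * window

x = np.random.randn(16000)      # one second of noise at 16 kHz (stand-in for speech)
frames = frame_and_window(x)
print(frames.shape)             # (98, 400)
```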

Fig. 3 Hamming Window

1.3 Discrete Fourier Transform

Each frame of N samples is converted from the time domain to the frequency domain. The Fourier transform converts the convolution of the glottal pulse U[n] with the vocal tract impulse response H[n] in the time domain into a product in the frequency domain. The complexity of the direct DFT is O(N^2), while the Fast Fourier Transform (FFT) is O(N log2 N); in general N is chosen as a power of two, N = 2^m. The DFT gives the magnitude and phase at N equidistant digital frequencies between 0 and 2π; the corresponding analog frequencies are k*fs/N Hz, k = 0, 1, ..., N-1.

1.4 Mel Filter Banks

Human hearing is not equally sensitive to all frequency bands: it is less sensitive at higher frequencies, roughly those above 1000 Hz. In other words, human perception of frequency is non-linear. A set of filters with a triangular band-pass frequency response simulates the filtering believed to occur in the auditory system. One filter is assigned to each desired mel-frequency component; the spacing and bandwidth are determined by a constant mel-frequency interval.

Fig. 5 Mel Filter Banks

1.5 The Cepstrum

In this final step, the log mel spectrum is converted back to time. The result is called the mel frequency cepstral coefficients (MFCC). The cepstral representation of the speech spectrum provides a good characterization of the local spectral properties of the signal for the given analysis frame. Because the mel spectrum coefficients (and so their logarithm) are real numbers, they can be converted to the time domain using the Discrete Cosine Transform (DCT). The human response to signal level is logarithmic: we are less sensitive to slight differences in amplitude at high amplitudes than at low amplitudes. Taking the logarithm compresses the dynamic range of values, making feature extraction less sensitive to dynamic variation, and makes the frequency estimates less sensitive to slight variations in the input (such as power variation due to the speaker's mouth moving closer to the microphone).
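A mel filter bank of the kind described above can be sketched as triangular filters spaced uniformly on the mel scale, mel(f) = 2595*log10(1 + f/700). The filter count, FFT size and sampling rate (20, 512, 8 kHz) are illustrative assumptions:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters=20, n_fft=512, fs=8000):
    # Centre frequencies uniform in mel, hence logarithmic in Hz above ~1 kHz
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):          # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):         # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

fbank = mel_filter_bank()
print(fbank.shape)  # (20, 257): one row of weights per triangular filter
```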
Taking the magnitude of the DFT discards the unwanted phase information.

1.6 Mel Frequency Cepstrum Computation

The IDFT, implemented as a DCT, converts the frequency samples back to the time domain; generally J = 12 or 16 coefficients are retained.

Fig. 4 Mel Scale
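The final step above (logarithm of the mel spectrum followed by a DCT back to the quefrency domain) can be sketched as follows; keeping J = 12 coefficients is a common choice, assumed here for illustration:

```python
import numpy as np

def mfcc_from_mel_energies(mel_energies, n_coeffs=12):
    """mel_energies: positive filter-bank outputs for one frame."""
    log_mel = np.log(mel_energies)   # logarithm compresses the dynamic range
    n = len(log_mel)
    k = np.arange(n)
    # DCT-II of the log mel spectrum yields the cepstral coefficients
    return np.array([np.sum(log_mel * np.cos(np.pi * j * (2 * k + 1) / (2 * n)))
                     for j in range(n_coeffs)])

frame_energies = np.ones(20)     # flat mel spectrum -> log is zero everywhere
print(mfcc_from_mel_energies(frame_energies))  # all coefficients are zero
```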

2. Feature Vector Matching

The features of the speech signal take the form of an N-dimensional feature vector. For a signal divided into M segments, M vectors are determined, producing an M x N feature matrix. This matrix is created by extracting features from utterances of the speaker for selected words or sentences during the training phase. After extraction of the feature vectors from the speech signal, template matching is carried out for speaker recognition. This process can be either manual (visual comparison of spectrograms) or automatic. In automatic template matching, speaker models are constructed from the extracted features; a speaker is then authenticated by comparing the incoming speech signal with the stored model of the claimed user.

2.1 Dynamic Time Warping

A many-to-many matching between the data points of time series x and those of time series y matches every data point xi in x with at least one data point yj in y, and every data point in y with at least one data point in x. The set of matches (i, j) forms a warping path. We define the DTW distance as the minimization of the lp-norm of the differences over all warping paths. A warping path is minimal if no proper subset of it forms a warping path; for simplicity we require all warping paths to be minimal. In computing the DTW distance, we commonly require the warping to remain local: for time series x and y, values xi and yj are aligned only if |i - j| <= w for some locality constraint w >= 0. When w = 0, the DTW reduces to the pointwise local distance, whereas when w = n the DTW has no locality constraint. The value of the DTW diminishes monotonically as w increases.

Fig. 6 Time alignment of two time-independent sequences.

2.2 DTW Constraints

The warping path is typically subject to several constraints:

Boundary conditions: w1 = (1, 1) and wK = (m, n); simply stated, this requires the warping path to start and finish in diagonally opposite corner cells of the matrix.

Continuity: given wk = (a, b), then wk-1 = (a', b') where a - a' <= 1 and b - b' <= 1. This restricts the allowable steps in the warping path to adjacent cells (including diagonally adjacent cells).

Monotonicity: given wk = (a, b), then wk-1 = (a', b') where a - a' >= 0 and b - b' >= 0. This forces the points in W to be monotonically spaced in time.

Fig. 7 DTW constraints: (a) admissible warping path satisfying the conditions; (b) boundary condition violated; (c) monotonicity condition violated; (d) step-size condition violated.

2.3 The Algorithm

Fig. 8 Local Distance.

1. Start with the calculation of g(1, 1) = d(1, 1). Calculate the first row g(i, 1) = g(i-1, 1) + d(i, 1).

Fig. 9 Optimal Path Assignment
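The constrained DTW distance described above, with the recurrence g(i, j) = d(i, j) + min of the three admissible predecessor cells and an optional locality window w, can be sketched as follows. This is a generic illustration, not the paper's code:

```python
import numpy as np

def dtw_distance(x, y, w=None):
    """DTW distance between sequences x and y with locality constraint |i-j| <= w."""
    n, m = len(x), len(y)
    if w is None:
        w = max(n, m)                        # no locality constraint
    g = np.full((n + 1, m + 1), np.inf)      # accumulated-cost grid
    g[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = abs(x[i - 1] - y[j - 1])     # local distance
            # boundary/continuity/monotonicity admit exactly these predecessors
            g[i, j] = d + min(g[i - 1, j], g[i - 1, j - 1], g[i, j - 1])
    return g[n, m]

a = [1.0, 2.0, 3.0]
b = [1.0, 2.0, 2.0, 3.0]
print(dtw_distance(a, b))  # 0.0 -- b is a time-warped copy of a
```

In the recognizer, this distance is computed between the MFCC feature sequence of the test utterance and each stored template, and the template with the smallest distance wins.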

2. Calculate the first column g(1, j) = g(1, j-1) + d(1, j).

3. Move to the second row: g(i, 2) = min(g(i, 1), g(i-1, 1), g(i-1, 2)) + d(i, 2). Book-keep, for each cell, the index of the neighbouring cell that contributes the minimum score (red arrows).

4. Carry on from left to right and from bottom to top with the rest of the grid: g(i, j) = min(g(i, j-1), g(i-1, j-1), g(i-1, j)) + d(i, j).

5. Trace back the best path through the grid, starting from g(n, m) and moving towards g(1, 1) by following the red arrows.

Hence the path which gives the minimum distance after testing against the feature vectors stored in the database identifies the speaker.

III. RESULT

After running the algorithm, the results are obtained as follows. The system requires the user to record the numbers 0 to 9 in the English language, after which it saves the recorded voices into an English digit directory. The user is then required to record any single number between 0 and 9, as shown in Fig. 12. The input word is recognized as the word corresponding to the template with the lowest matching score. The recognition is implemented using DTW, with the distance calculation done between the tested speech and the reference word bank.

Fig. 10 Screenshot of the DTW paths taken for recognition of an utterance of "one" with respect to each of the other digits (zero to nine).

Fig. 11 Screenshot after recording the ten digits in the English language.

Fig. 12 Screenshot of the DTW path taken for recognition of an utterance of "one".

The recognition algorithm is then tested for accuracy. The test is limited to the digits 0 to 9: numbers are uttered at random and the accuracy of the samples is analyzed. The accuracy obtained from the test is about 9.%.
The results obtained are displayed in Table 1. Most of the time, inaccurate recognition is due to sudden impulses of noise or a sudden drastic change in voice tone.

Table 1. Accuracy test results

Word    | Accuracy (%)
--------|-------------
Zero    | 8.
One     | 9.
Two     | 8.
Three   | .
Four    | 9.
Five    | .
Six     | 8.
Seven   | .
Eight   | .
Nine    | 8.
Average | 9.

IV. CONCLUSIONS

This paper has presented a speech recognition algorithm for English digits that uses MFCC vectors to provide an estimate of the vocal tract filter, while DTW with appropriate constraints is used to find the nearest recorded voice. The results show a promising English digit speech recognition module: recognition with about 9.% accuracy can be achieved using this method, and the accuracy can be increased further with additional research and development.

REFERENCES

[1] Gold, B. and Morgan, N. (2000). Speech and Audio Signal Processing. 1st ed. John Wiley and Sons, NY, USA.

[2] Deng, L. and Huang, X. (2004). Challenges in adopting speech recognition. Communications of the ACM, 47(1):69-75.

[3] Milner, B.P. and Shao, X. (2002). Speech reconstruction from mel-frequency cepstral coefficients using a source-filter model. Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP), Sept. 16-20, 2002, Denver, Colorado, USA, pp. 2421-2424.

[4] Rabiner, L.R. and Sambur, M.R. (1975). An algorithm for determining the endpoints of isolated utterances. The Bell System Technical Journal, 54(2):297-315.

[5] Rabiner, L.R. and Schafer, R.W. (1978). Digital Processing of Speech Signals. 1st ed. Prentice-Hall Inc., Englewood Cliffs, NJ, USA.

[6] Deng, L. and Huang, X. (2004). Challenges in adopting speech recognition. Communications of the ACM, 47(1):69-75.